10 Jul, 2020

1 commit

  • In order for no_refcnt and is_data to be the two lowest-order
    bits of 'val', we have to pad out the bitfield to a full u8.

    Fixes: ad0f75e5f57c ("cgroup: fix cgroup_sk_alloc() for sk_clone_lock()")
    Reported-by: Guenter Roeck
    Signed-off-by: David S. Miller

    Cong Wang
     

08 Jul, 2020

1 commit

  • When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
    copied, so the cgroup refcnt must be taken too. And, unlike the
    sk_alloc() path, sock_update_netprioidx() is not called here.
    Therefore, it is safe and necessary to grab the cgroup refcnt
    even when cgroup_sk_alloc is disabled.

    sk_clone_lock() always runs in BH context, where the in_interrupt()
    check would make this function return early anyway. And in the
    sk_alloc() path skcd->val is always zero. So it's safe to factor
    out the code to make it more readable.

    The global variable 'cgroup_sk_alloc_disabled' is used to determine
    whether to take these reference counts. It is impossible to make
    the reference counting correct unless we save this bit of information
    in skcd->val. So, add a new bit there to record whether the socket
    has already taken the reference counts. This obviously relies on
    kmalloc() aligning cgroup pointers to at least 4 bytes;
    ARCH_KMALLOC_MINALIGN is certainly larger than that.

    This bug seems to be introduced since the beginning, commit
    d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets")
    tried to fix it but not completely. It seems not easy to trigger until
    the recent commit 090e28b229af
    ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups") was merged.

    Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
    Reported-by: Cameron Berkenpas
    Reported-by: Peter Geis
    Reported-by: Lu Fengqi
    Reported-by: Daniël Sonck
    Reported-by: Zhang Qiang
    Tested-by: Cameron Berkenpas
    Tested-by: Peter Geis
    Tested-by: Thomas Lamprecht
    Cc: Daniel Borkmann
    Cc: Zefan Li
    Cc: Tejun Heo
    Cc: Roman Gushchin
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     

04 Apr, 2020

1 commit

  • Pull cgroup updates from Tejun Heo:

    - Christian extended clone3 so that processes can be spawned into
    cgroups directly.

    This is not only neat in terms of semantics but also avoids grabbing
    the global cgroup_threadgroup_rwsem for migration.

    - Daniel added !root xattr support to cgroupfs.

    Userland already uses xattrs on cgroupfs for bookkeeping. This will
    allow delegated cgroups to support such usages.

    - Prateek tried to make cpuset hotplug handling synchronous but that
    led to possible deadlock scenarios. Reverted.

    - Other minor changes including release_agent_path handling cleanup.

    * 'for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    docs: cgroup-v1: Document the cpuset_v2_mode mount option
    Revert "cpuset: Make cpuset hotplug synchronous"
    cgroupfs: Support user xattrs
    kernfs: Add option to enable user xattrs
    kernfs: Add removed_size out param for simple_xattr_set
    kernfs: kvmalloc xattr value instead of kmalloc
    cgroup: Restructure release_agent_path handling
    selftests/cgroup: add tests for cloning into cgroups
    clone3: allow spawning processes into cgroups
    cgroup: add cgroup_may_write() helper
    cgroup: refactor fork helpers
    cgroup: add cgroup_get_from_file() helper
    cgroup: unify attach permission checking
    cpuset: Make cpuset hotplug synchronous
    cgroup.c: Use built-in RCU list checking
    kselftest/cgroup: add cgroup destruction test
    cgroup: Clean up css_set task traversal

    Linus Torvalds
     

03 Apr, 2020

1 commit

  • Right now, the effective protection of any given cgroup is capped by its
    own explicit memory.low setting, regardless of what the parent says. The
    reasons for this are mostly historical and ease of implementation: to make
    delegation of memory.low safe, effective protection is the min() of all
    memory.low up the tree.

    Unfortunately, this limitation makes it impossible to protect an entire
    subtree from another without forcing the user to make explicit protection
    allocations all the way to the leaf cgroups - something that is highly
    undesirable in real life scenarios.

    Consider memory in a data center host. At the cgroup top level, we have a
    distinction between system management software and the actual workload the
    system is executing. Both branches are further subdivided into individual
    services, job components etc.

    We want to protect the workload as a whole from the system management
    software, but that doesn't mean we want to protect and prioritize
    individual workload wrt each other. Their memory demand can vary over
    time, and we'd want the VM to simply cache the hottest data within the
    workload subtree. Yet, the current memory.low limitations force us to
    allocate a fixed amount of protection to each workload component in order
    to get protection from system management software in general. This
    results in very inefficient resource distribution.

    Another concern with mandating downward allocation is that, as the
    complexity of the cgroup tree grows, it gets harder for the lower levels
    to be informed about decisions made at the host-level. Consider a
    container inside a namespace that in turn creates its own nested tree of
    cgroups to run multiple workloads. It'd be extremely difficult to
    configure memory.low parameters in those leaf cgroups that on one hand
    balance pressure among siblings as the container desires, while also
    reflecting the host-level protection from e.g. rpm upgrades, that lie
    beyond one or more delegation and namespacing points in the tree.

    It's highly unusual from a cgroup interface POV that nested levels have to
    be aware of and reflect decisions made at higher levels for them to be
    effective.

    To enable such use cases and scale configurability for complex trees, this
    patch implements a resource inheritance model for memory that is similar
    to how the CPU and the IO controller implement work-conserving resource
    allocations: a share of a resource allocated to a subtree always applies to
    the entire subtree recursively, while allowing, but not mandating,
    children to further specify distribution rules.

    That means that if protection is explicitly allocated among siblings,
    those configured shares are being followed during page reclaim just like
    they are now. However, if the memory.low set at a higher level is not
    fully claimed by the children in that subtree, the "floating" remainder is
    applied to each cgroup in the tree in proportion to its size. Since
    reclaim pressure is applied in proportion to size as well, each child in
    that tree gets the same boost, and the effect is neutral among siblings -
    with respect to each other, they behave as if no memory control was
    enabled at all, and the VM simply balances the memory demands optimally
    within the subtree. But collectively those cgroups enjoy a boost over the
    cgroups in neighboring trees.

    E.g. a leaf cgroup with a memory.low setting of 0 no longer means that
    it's not getting a share of the hierarchically assigned resource, just
    that it doesn't claim a fixed amount of it to protect from its siblings.

    This allows us to recursively protect one subtree (workload) from another
    (system management), while letting subgroups compete freely among each
    other - without having to assign fixed shares to each leaf, and without
    nested groups having to echo higher-level settings.

    The floating protection composes naturally with fixed protection.
    Consider the following example tree:

         A          A:  low = 2G
        / \         A1: low = 1G
      A1   A2       A2: low = 0G

    As outside pressure is applied to this tree, A1 will enjoy a fixed
    protection from A2 of 1G, but the remaining, unclaimed 1G from A is split
    evenly among A1 and A2, coming out to 1.5G and 0.5G.

    There is a slight risk of regressing theoretical setups where the
    top-level cgroups don't know about the true budgeting and set bogusly high
    "bypass" values that are meaningfully allocated down the tree. Such
    setups would rely on unclaimed protection to be discarded, and
    distributing it would change the intended behavior. Be safe and hide the
    new behavior behind a mount option, 'memory_recursiveprot'.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Acked-by: Tejun Heo
    Acked-by: Roman Gushchin
    Acked-by: Chris Down
    Cc: Michal Hocko
    Cc: Michal Koutný
    Link: http://lkml.kernel.org/r/20200227195606.46212-4-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

13 Feb, 2020

1 commit

  • This adds support for creating a process in a different cgroup than its
    parent. Callers can limit and account processes and threads right from
    the moment they are spawned:
    - A service manager can directly spawn new services into dedicated
    cgroups.
    - A process can be directly created in a frozen cgroup and will be
    frozen as well.
    - The initial accounting jitter experienced by process supervisors and
    daemons is eliminated with this.
    - Threaded applications or even thread implementations can choose to
    create a specific cgroup layout where each thread is spawned
    directly into a dedicated cgroup.

    This feature is limited to the unified hierarchy. Callers need to pass
    a directory file descriptor for the target cgroup. The caller can
    choose to pass an O_PATH file descriptor. All usual migration
    restrictions apply, i.e. there can be no processes in inner nodes. In
    general, creating a process directly in a target cgroup adheres to all
    migration restrictions.

    One of the biggest advantages of this feature is that CLONE_INTO_CGROUP
    does not need to grab the write side of the global
    cgroup_threadgroup_rwsem. This lock makes moving tasks/threads around
    super expensive. With clone3() this lock is avoided.

    Cc: Tejun Heo
    Cc: Ingo Molnar
    Cc: Oleg Nesterov
    Cc: Johannes Weiner
    Cc: Li Zefan
    Cc: Peter Zijlstra
    Cc: cgroups@vger.kernel.org
    Signed-off-by: Christian Brauner
    Signed-off-by: Tejun Heo

    Christian Brauner
     

13 Nov, 2019

1 commit

  • cgroup ID is currently allocated using a dedicated per-hierarchy idr
    and used internally and exposed through tracepoints and bpf. This is
    confusing because there are tracepoints and other interfaces which use
    the cgroupfs ino as IDs.

    The preceding changes exposed kn->id as a 64bit ino on supported
    archs, or as ino+gen (low 32 bits the ino, high 32 bits the
    generation) elsewhere. There's no
    reason for cgroup to use different IDs. The kernfs IDs are unique and
    userland can easily discover them and map them back to paths using
    standard file operations.

    This patch replaces cgroup IDs with kernfs IDs.

    * cgroup_id() is added and all cgroup ID users are converted to use it.

    * kernfs_node creation is moved to earlier during cgroup init so that
    cgroup_id() is available during init.

    * While at it, s/cgroup/cgrp/ in psi helpers for consistency.

    * Fallback ID value is changed to 1 to be consistent with root cgroup
    ID.

    Signed-off-by: Tejun Heo
    Reviewed-by: Greg Kroah-Hartman
    Cc: Namhyung Kim

    Tejun Heo
     

07 Nov, 2019

1 commit

  • cgroup->bstat_pending is used to determine the base stat delta to
    propagate to the parent. While correct, this is different from how
    percpu delta is determined for no good reason and the inconsistency
    makes the code more difficult to understand.

    This patch makes parent propagation delta calculation use the same
    method as percpu to global propagation.

    * cgroup_base_stat_accumulate() is renamed to cgroup_base_stat_add()
    and cgroup_base_stat_sub() is added.

    * percpu propagation calculation is updated to use the above helpers.

    * cgroup->bstat_pending is replaced with cgroup->last_bstat and
    updated to use the same calculation as percpu propagation.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

15 Jul, 2019

1 commit


09 Jul, 2019

1 commit


15 Jun, 2019

2 commits

  • Pull cgroup fixes from Tejun Heo:
    "This has an unusually high density of tricky fixes:

    - task_get_css() could deadlock when it races against a dying cgroup.

    - cgroup.procs didn't list thread group leaders with live threads.

    This could mislead readers to think that a cgroup is empty when
    it's not. Fixed by making PROCS iterator include dead tasks. I made
    a couple mistakes making this change and this pull request contains
    a couple follow-up patches.

    - When cpusets run out of online cpus, it updates cpusmasks of member
    tasks in bizarre ways. Joel improved the behavior significantly"

    * 'for-5.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cpuset: restore sanity to cpuset_cpus_allowed_fallback()
    cgroup: Fix css_task_iter_advance_css_set() cset skip condition
    cgroup: css_task_iter_skip()'d iterators must be advanced before accessed
    cgroup: Include dying leaders with live threads in PROCS iterations
    cgroup: Implement css_task_iter_skip()
    cgroup: Call cgroup_release() before __exit_signal()
    docs cgroups: add another example size for hugetlb
    cgroup: Use css_tryget() instead of css_tryget_online() in task_get_css()

    Linus Torvalds
     
  • Convert the cgroup-v1 files to ReST format, in order to
    allow a later addition to the admin-guide.

    The conversion is actually:
    - add blank lines and indentation in order to identify paragraphs;
    - fix tables markups;
    - add some lists markups;
    - mark literal blocks;
    - adjust title markups.

    At its new index.rst, let's add a :orphan: while this is not linked to
    the main index.rst file, in order to avoid build warnings.

    Signed-off-by: Mauro Carvalho Chehab
    Acked-by: Tejun Heo
    Signed-off-by: Tejun Heo

    Mauro Carvalho Chehab
     

10 Jun, 2019

1 commit

  • There's some discussion on how to do this the best, and Tejun prefers
    that BFQ just create the file itself instead of having cgroups support
    a symlink feature.

    Hence revert commit 54b7b868e826 and 19e9da9e86c4 for 5.2, and this
    can be done properly for 5.3.

    Signed-off-by: Jens Axboe

    Jens Axboe
     


02 Jun, 2019

1 commit

  • memory.stat and other files already consider subtrees in their output, and
    we should too in order to not present an inconsistent interface.

    The current situation is fairly confusing, because people interacting with
    cgroups expect hierarchical behaviour in the vein of memory.stat,
    cgroup.events, and other files. For example, this causes confusion when
    debugging reclaim events under low, as currently these always read "0" at
    non-leaf memcg nodes, which frequently causes people to misdiagnose breach
    behaviour. The same confusion applies to other counters in this file when
    debugging issues.

    Aggregation is done at write time instead of at read-time since these
    counters aren't hot (unlike memory.stat which is per-page, so it does it
    at read time), and it makes sense to bundle this with the file
    notifications.

    After this patch, events are propagated up the hierarchy:

    [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
    low 0
    high 0
    max 0
    oom 0
    oom_kill 0
    [root@ktst ~]# systemd-run -p MemoryMax=1 true
    Running as unit: run-r251162a189fb4562b9dabfdc9b0422f5.service
    [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
    low 0
    high 0
    max 7
    oom 1
    oom_kill 1

    As this is a change in behaviour, this can be reverted to the old
    behaviour by mounting with the `memory_localevents' flag set. However, we
    use the new behaviour by default as there's a lack of evidence that there
    are any current users of memory.events that would find this change
    undesirable.

    akpm: this is a behaviour change, so Cc:stable. This is so that
    forthcoming distros which use cgroup v2 are more likely to pick up the
    revised behaviour.

    Link: http://lkml.kernel.org/r/20190208224419.GA24772@chrisdown.name
    Signed-off-by: Chris Down
    Acked-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Roman Gushchin
    Cc: Dennis Zhou
    Cc: Suren Baghdasaryan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Down
     

01 Jun, 2019

1 commit

  • CSS_TASK_ITER_PROCS currently iterates live group leaders; however,
    this means that a process with dying leader and live threads will be
    skipped. IOW, cgroup.procs might be empty while cgroup.threads isn't,
    which is confusing to say the least.

    Fix it by making cset track dying tasks and include dying leaders with
    live threads in PROCS iteration.

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Topi Miettinen
    Cc: Oleg Nesterov

    Tejun Heo
     

20 Apr, 2019

2 commits

  • Cgroup v1 implements the freezer controller, which provides an ability
    to stop the workload in a cgroup and temporarily free up some
    resources (cpu, io, network bandwidth and, potentially, memory)
    for some other tasks. Cgroup v2 lacks this functionality.

    This patch implements freezer for cgroup v2.

    Cgroup v2 freezer tries to put tasks into a state similar to jobctl
    stop. This means that tasks can be killed, ptraced (using
    PTRACE_SEIZE*), and interrupted. It is possible to attach to
    a frozen task, get some information (e.g. read registers) and detach.
    It's also possible to migrate a frozen task to another cgroup.

    This distinguishes the cgroup v2 freezer from the cgroup v1 freezer,
    which mostly tried to imitate the system-wide freezer. However, while
    uninterruptible sleep is fine when all tasks are going to be frozen
    (the hibernation case), it's not an acceptable state when only some
    subset of the system is frozen.

    The cgroup v2 freezer does not support freezing kthreads.
    If a non-root cgroup contains a kthread, the cgroup can still be
    frozen, but the kthread will remain running, the cgroup will be shown
    as non-frozen, and the notification will not be delivered.

    PTRACE_ATTACH does not work because non-fatal signal delivery
    is blocked in the frozen state.

    There are some interface differences between cgroup v1 and cgroup v2
    freezer too, which are required to conform to the cgroup v2 interface
    design principles:
    1) There is no separate controller, which has to be turned on:
    the functionality is always available and is represented by
    cgroup.freeze and cgroup.events cgroup control files.
    2) The desired state is defined by the cgroup.freeze control file.
    Any hierarchical configuration is allowed.
    3) The interface is asynchronous. The actual state is available
    using cgroup.events control file ("frozen" field). There are no
    dedicated transitional states.
    4) It's allowed to make any changes with the cgroup hierarchy
    (create new cgroups, remove old cgroups, move tasks between cgroups)
    no matter if some cgroups are frozen.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Tejun Heo
    No-objection-from-me-by: Oleg Nesterov
    Cc: kernel-team@fb.com

    Roman Gushchin
     
  • The number of descendant cgroups and the number of dying
    descendant cgroups are currently synchronized using the cgroup_mutex.

    The number of descendant cgroups will be required by the cgroup v2
    freezer, which will use it to determine if a cgroup is frozen
    (depending on total number of descendants and number of frozen
    descendants). It's not always acceptable to grab the cgroup_mutex,
    especially from quite hot paths (e.g. exit()).

    To avoid this, let's additionally synchronize these counters using
    the css_set_lock.

    So, it's safe to read these counters with either cgroup_mutex or
    css_set_lock held, and both locks must be acquired to change them.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Tejun Heo
    Cc: kernel-team@fb.com

    Roman Gushchin
     

08 Mar, 2019

1 commit

  • Pull cgroup updates from Tejun Heo:

    - Oleg's pids controller accounting update which gets rid of rcu delay
    in pids accounting updates

    - rstat (cgroup hierarchical stat collection mechanism) optimization

    - Doc updates

    * 'for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cpuset: remove unused task_has_mempolicy()
    cgroup, rstat: Don't flush subtree root unless necessary
    cgroup: add documentation for pids.events file
    Documentation: cgroup-v2: eliminate markup warnings
    MAINTAINERS: Update cgroup entry
    cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix the accounting

    Linus Torvalds
     

06 Mar, 2019

1 commit

  • Cgroup has a standardized poll/notification mechanism for waking all
    pollers on all fds when a filesystem node changes. To allow polling for
    custom events, add a .poll callback that can override the default.

    This is in preparation for pollable cgroup pressure files which have
    per-fd trigger configurations.

    Link: http://lkml.kernel.org/r/20190124211518.244221-3-surenb@google.com
    Signed-off-by: Johannes Weiner
    Signed-off-by: Suren Baghdasaryan
    Cc: Dennis Zhou
    Cc: Ingo Molnar
    Cc: Jens Axboe
    Cc: Li Zefan
    Cc: Peter Zijlstra
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

31 Jan, 2019

1 commit

  • The only user of cgroup_subsys->free() callback is pids_cgrp_subsys which
    needs pids_free() to uncharge the pid.

    However, ->free() is called from __put_task_struct()->cgroup_free() and this
    is too late. Even the trivial program which does

    for (;;) {
            int pid = fork();
            assert(pid >= 0);
            if (pid)
                    wait(NULL);
            else
                    exit(0);
    }

    can run out of limits because release_task()->call_rcu(delayed_put_task_struct)
    implies an RCU gp after the task/pid goes away and before the final put().

    Test-case:

    mkdir -p /tmp/CG
    mount -t cgroup2 none /tmp/CG
    echo '+pids' > /tmp/CG/cgroup.subtree_control

    mkdir /tmp/CG/PID
    echo 2 > /tmp/CG/PID/pids.max

    perl -e 'while ($p = fork) { wait; } $p // die "fork failed: $!\n"' &
    echo $! > /tmp/CG/PID/cgroup.procs

    Without this patch the forking process fails soon after migration.

    Rename cgroup_subsys->free() to cgroup_subsys->release() and move the callsite
    into the new helper, cgroup_release(), called by release_task() which actually
    frees the pid(s).

    Reported-by: Herton R. Krzesinski
    Reported-by: Jan Stancek
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Tejun Heo

    Oleg Nesterov
     

09 Nov, 2018

1 commit

  • For debugging purpose, it will be useful to expose the content of the
    subparts_cpus as a read-only file to see if the code work correctly.
    However, subparts_cpus will not be used at all in most use cases. So
    adding a new cpuset file that clutters the cgroup directory may not be
    desirable. This is now being done by using the hidden "cgroup_debug"
    kernel command line option to expose a new "cpuset.cpus.subpartitions"
    file.

    That option was originally used by the debug controller to expose
    itself when configured into the kernel. This is now extended to set an
    internal flag used by cgroup_addrm_files(). A new CFTYPE_DEBUG flag
    can now be used to specify that a cgroup file should only be created
    when the "cgroup_debug" option is specified.

    Signed-off-by: Waiman Long
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Tejun Heo

    Waiman Long
     

27 Oct, 2018

1 commit

  • On a system that executes multiple cgrouped jobs and independent
    workloads, we don't just care about the health of the overall system, but
    also that of individual jobs, so that we can ensure individual job health,
    fairness between jobs, or prioritize some jobs over others.

    This patch implements pressure stall tracking for cgroups. In kernels
    with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure, memory.pressure,
    and io.pressure files that track aggregate pressure stall times for only
    the tasks inside the cgroup.

    Link: http://lkml.kernel.org/r/20180828172258.3185-10-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Tejun Heo
    Acked-by: Peter Zijlstra (Intel)
    Tested-by: Daniel Drake
    Tested-by: Suren Baghdasaryan
    Cc: Christopher Lameter
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Mike Galbraith
    Cc: Peter Enderborg
    Cc: Randy Dunlap
    Cc: Shakeel Butt
    Cc: Vinayak Menon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

05 Oct, 2018

1 commit

  • A cgroup which is already a threaded domain may be converted into a
    threaded cgroup if the prerequisite conditions are met. When this
    happens, all threaded descendant should also have their ->dom_cgrp
    updated to the new threaded domain cgroup. Unfortunately, this
    propagation was missing leading to the following failure.

    # cd /sys/fs/cgroup/unified
    # cat cgroup.subtree_control # show that no controllers are enabled

    # mkdir -p mycgrp/a/b/c
    # echo threaded > mycgrp/a/b/cgroup.type

    At this point, the hierarchy looks as follows:

    mycgrp [d]
        a [dt]
            b [t]
                c [inv]

    Now let's make node "a" threaded (and thus "mycgrp" is made "domain threaded"):

    # echo threaded > mycgrp/a/cgroup.type

    By this point, we now have a hierarchy that looks as follows:

    mycgrp [dt]
        a [t]
            b [t]
                c [inv]

    But, when we try to convert the node "c" from "domain invalid" to
    "threaded", we get ENOTSUP on the write():

    # echo threaded > mycgrp/a/b/c/cgroup.type
    sh: echo: write error: Operation not supported

    This patch fixes the problem by

    * Moving the opencoded ->dom_cgrp save and restoration in
    cgroup_enable_threaded() into cgroup_{save|restore}_control() so
    that multiple cgroups can be handled.

    * Updating all threaded descendants' ->dom_cgrp to point to the new
    dom_cgrp when enabling threaded mode.

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: "Michael Kerrisk (man-pages)"
    Reported-by: Amin Jamali
    Reported-by: Joao De Almeida Pereira
    Link: https://lore.kernel.org/r/CAKgNAkhHYCMn74TCNiMJ=ccLd7DcmXSbvw3CbZ1YREeG7iJM5g@mail.gmail.com
    Fixes: 454000adaa2a ("cgroup: introduce cgroup->dom_cgrp and threaded css_set handling")
    Cc: stable@vger.kernel.org # v4.14+

    Tejun Heo
     

09 Jul, 2018

1 commit

  • Since IO can be issued from literally anywhere it's almost impossible to
    do throttling without having some sort of adverse effect somewhere else
    in the system because of locking or other dependencies. The best way to
    solve this is to do the throttling when we know we aren't holding any
    other kernel resources. Do this by tracking throttling in a per-blkg
    basis, and if we require throttling flag the task that it needs to check
    before it returns to user space and possibly sleep there.

    This is to address the case where a process is doing work that is
    generating IO that can't be throttled, whether that is directly with a
    lot of REQ_META IO, or indirectly by allocating so much memory that it
    is swamping the disk with REQ_SWAP. We can't use task_work_add() as we
    don't want to induce a memory allocation in the IO path, so simply
    saving the request queue in the task and flagging it to do the
    notify_resume thing achieves the same result without the overhead of a
    memory allocation.

    Signed-off-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Josef Bacik
     

27 Apr, 2018

4 commits

  • This patch adds cgroup_subsys->css_rstat_flush(). If a subsystem has
    this callback, its csses are linked on cgrp->css_rstat_list and rstat
    will call the function whenever the associated cgroup is flushed.
    Flush is also performed when such csses are released so that residual
    counts aren't lost.

    Combined with the rstat API previous patches factored out, this allows
    controllers to plug into rstat to manage their statistics in a
    scalable way.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • Base resource stat accounts universal (not specific to any
    controller) resource consumptions on top of rstat. Currently, its
    implementation is intermixed with the rstat implementation, making
    the code confusing to follow.

    This patch clarifies the distinction by doing the following:

    * Encapsulate base resource stat counters, currently only cputime, in
    struct cgroup_base_stat.

    * Move prev_cputime into struct cgroup and initialize it with cgroup.

    * Rename the related functions so that they start with cgroup_base_stat.

    * Prefix the related variables and field names with b.

    This patch doesn't make any functional changes.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • stat is too generic a name and ends up causing subtle confusions.
    It'll be made generic so that controllers can plug into it, which will
    make the problem worse. Let's rename it to something more specific -
    cgroup_rstat for cgroup recursive stat.

    This patch does the following renames. No other changes.

    * cpu_stat -> rstat_cpu
    * stat -> rstat
    * ?cstat -> ?rstatc

    Note that the renames are selective. The unrenamed are the ones which
    implement basic resource statistics on top of rstat. This will be
    further cleaned up in the following patches.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • ".events" files generate file modified event to notify userland of
    possible new events. Some of the events can be quite bursty
    (e.g. memory high event) and generating notification each time is
    costly and pointless.

    This patch implements a event rate limit mechanism. If a new
    notification is requested before 10ms has passed since the previous
    notification, the new notification is delayed till then.

    As this only delays from the second notification on in a given close
    cluster of notifications, userland reactions to notifications
    shouldn't be delayed at all in most cases while avoiding notification
    storms.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

04 Apr, 2018

1 commit

  • Pull workqueue updates from Tejun Heo:
    "rcu_work addition and a couple trivial changes"

    * 'for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: remove the comment about the old manager_arb mutex
    workqueue: fix the comments of nr_idle
    fs/aio: Use rcu_work instead of explicit rcu and work item
    cgroup: Use rcu_work instead of explicit rcu and work item
    RCU, workqueue: Implement rcu_work

    Linus Torvalds
     


15 Mar, 2018

1 commit

  • Andrei Vagin reported a KASAN: slab-out-of-bounds error in
    skb_update_prio()

    Since SYNACK might be attached to a request socket, we need to
    get back to the listener socket.
    Since this listener is manipulated without locks, add const
    qualifiers to sock_cgroup_prioidx() so that the const can also
    be used in skb_update_prio()

    Also add the const qualifier to sock_cgroup_classid() for consistency.

    Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener")
    Signed-off-by: Eric Dumazet
    Reported-by: Andrei Vagin
    Signed-off-by: David S. Miller

    Eric Dumazet
     

02 Jan, 2018

1 commit

  • The cgroup_subsys structure references a documentation file that has been
    renamed after the v1/v2 split. Since the v2 documentation doesn't
    currently contain any information on kernel interfaces for controllers,
    point the user to the v1 docs.

    Cc: Tejun Heo
    Cc: linux-doc@vger.kernel.org
    Signed-off-by: Matt Roper
    Signed-off-by: Tejun Heo

    Matt Roper
     

16 Nov, 2017

1 commit

  • Pull cgroup updates from Tejun Heo:
    "Cgroup2 cpu controller support is finally merged.

    - Basic cpu statistics support to allow monitoring by default without
    the CPU controller enabled.

    - cgroup2 cpu controller support.

    - /sys/kernel/cgroup files to help dealing with new / optional
    features"

    * 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: export list of cgroups v2 features using sysfs
    cgroup: export list of delegatable control files using sysfs
    cgroup: mark @cgrp __maybe_unused in cpu_stat_show()
    MAINTAINERS: relocate cpuset.c
    cgroup, sched: Move basic cpu stats from cgroup.stat to cpu.stat
    sched: Implement interface for cgroup unified hierarchy
    sched: Misc preps for cgroup unified hierarchy interface
    sched/cputime: Add dummy cputime_adjust() implementation for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
    cgroup: statically initialize init_css_set->dfl_cgrp
    cgroup: Implement cgroup2 basic CPU usage accounting
    cpuacct: Introduce cgroup_account_cputime[_field]()
    sched/cputime: Expose cputime_adjust()

    Linus Torvalds
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boilerplate text.

    This patch is based on work done by Thomas Gleixner, Kate Stewart,
    and Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier should be
    applied to a file was done in a spreadsheet of side-by-side results
    from the output of two independent scanners (ScanCode & Windriver)
    producing SPDX tag:value files, created by Philippe Ombredanne.
    Philippe prepared the base worksheet, and did an initial spot review
    of a few thousand files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    <5 lines).
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

27 Oct, 2017

1 commit

  • The basic cpu stat is currently shown with "cpu." prefix in
    cgroup.stat, and the same information is duplicated in cpu.stat when
    cpu controller is enabled. This is ugly and not very scalable as we
    want to expand the coverage of stat information which is always
    available.

    This patch makes cgroup core always create "cpu.stat" file and show
    the basic cpu stat there and calls the cpu controller to show the
    extra stats when enabled. This ensures that the same information
    isn't presented in multiple places and makes future expansion of basic
    stats easier.

    Signed-off-by: Tejun Heo
    Acked-by: Peter Zijlstra (Intel)

    Tejun Heo
     

25 Sep, 2017

1 commit

  • In cgroup1, while cpuacct isn't actually controlling any resources, it
    is a separate controller due to a combination of two factors -
    1. enabling the cpu controller has significant side effects, and 2. we
    have to pick one of the hierarchies to account CPU usage on. The
    cpuacct controller is effectively used to designate a hierarchy to
    track CPU usage on.

    cgroup2's unified hierarchy removes the second reason and we can
    account basic CPU usages by default. While we can use cpuacct for
    this purpose, both its interface and implementation leave a lot to be
    desired - it collects and exposes two sources of truth which don't
    agree with each other and some of the exposed statistics don't make
    much sense. Also, it propagates all the way up the hierarchy on each
    accounting event, which is unnecessary.

    This patch adds basic resource accounting mechanism to cgroup2's
    unified hierarchy and accounts CPU usages using it.

    * All accounting is done per-cpu and doesn't propagate immediately.
    Each event just bumps the per-cgroup per-cpu counters and links the
    cgroup to the parent's updated list if it isn't already on it.

    * On a read, the per-cpu counters are collected into the global ones
    and then propagated upwards. Only the per-cpu counters which have
    changed since the last read are propagated.

    * CPU usage stats are collected and shown in "cgroup.stat" with "cpu."
    prefix. Total usage is collected from scheduling events. User/sys
    breakdown is sourced from tick sampling and adjusted to the usage
    using cputime_adjust().

    This keeps the accounting side hot path O(1) and per-cpu and the read
    side O(nr_updated_since_last_read).
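    The scheme above can be modeled in a short user-space sketch. Names
    here are invented for illustration; the kernel's actual rstat code
    differs (e.g. it keeps per-cpu updated lists rather than a flag):

    ```python
    class Cgroup:
        """Toy model of per-cpu accounting with lazy read-side propagation."""

        def __init__(self, parent=None, nr_cpus=4):
            self.parent = parent
            self.children = []
            if parent:
                parent.children.append(self)
            self.percpu = [0] * nr_cpus  # hot-path counters
            self.total = 0               # flushed usage of this cgroup alone
            self.updated = False         # unflushed deltas in this subtree?

        def account(self, cpu, delta):
            """Hot path: O(1) bump plus marking the path up to the root."""
            self.percpu[cpu] += delta
            node = self
            while node is not None and not node.updated:
                node.updated = True
                node = node.parent

        def read(self):
            """Read side: flush pending per-cpu deltas, return subtree usage."""
            if self.updated:
                self.total += sum(self.percpu)       # collect per-cpu counters
                self.percpu = [0] * len(self.percpu)
                self.updated = False
            # The kernel walks only the updated children; walking all of
            # them here keeps the toy short.
            return self.total + sum(c.read() for c in self.children)
    ```

    Reading a cgroup collects each pending delta exactly once; a second
    read with no intervening account() calls returns the same totals.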

    v2: Minor changes and documentation updates as suggested by Waiman and
    Roman.

    Signed-off-by: Tejun Heo
    Acked-by: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Li Zefan
    Cc: Johannes Weiner
    Cc: Waiman Long
    Cc: Roman Gushchin

    Tejun Heo
     

18 Aug, 2017

1 commit


03 Aug, 2017

2 commits

  • Creating cgroup hierarchies of unreasonable size can affect
    overall system performance. A user might want to limit the
    size of the cgroup hierarchy. This is especially important if a user
    is delegating some cgroup sub-tree.

    To address this issue, introduce the ability to control
    the size of the cgroup hierarchy.

    The cgroup.max.descendants control file allows setting the maximum
    allowed number of descendant cgroups.
    The cgroup.max.depth file controls the maximum depth of the cgroup
    tree. Both are single-value r/w files, with "max" as the default value.

    The control files exist on each hierarchy level (including root).
    When a new cgroup is created, we check the total descendants
    and depth limits on each level, and if none of them are exceeded,
    a new cgroup is created.

    Only alive cgroups are counted, removed (dying) cgroups are
    ignored.
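    The creation-time check can be sketched as follows. This is a
    user-space model with invented names, representing the "max" default
    as infinity:

    ```python
    MAX = float("inf")  # the "max" default value of both control files

    class Cgroup:
        """Toy model of the cgroup.max.descendants / cgroup.max.depth checks."""

        def __init__(self, parent=None):
            self.parent = parent
            self.max_descendants = MAX
            self.max_depth = MAX
            self.nr_descendants = 0  # alive descendants only

        def create_child(self):
            """Check both limits on every ancestor level before creating."""
            ancestor, depth = self, 1
            while ancestor is not None:
                if ancestor.nr_descendants + 1 > ancestor.max_descendants:
                    raise PermissionError("cgroup.max.descendants exceeded")
                if depth > ancestor.max_depth:
                    raise PermissionError("cgroup.max.depth exceeded")
                ancestor, depth = ancestor.parent, depth + 1
            child = Cgroup(self)
            # The new cgroup is a descendant of every ancestor up to the root.
            ancestor = self
            while ancestor is not None:
                ancestor.nr_descendants += 1
                ancestor = ancestor.parent
            return child

    root = Cgroup()
    root.max_descendants = 2
    a = root.create_child()
    b = root.create_child()
    # A third descendant anywhere under root now raises PermissionError.
    ```

    Because the walk covers every ancestor, a limit set high up in a
    delegated sub-tree constrains all levels below it.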

    Signed-off-by: Roman Gushchin
    Suggested-by: Tejun Heo
    Signed-off-by: Tejun Heo
    Cc: Zefan Li
    Cc: Waiman Long
    Cc: Johannes Weiner
    Cc: kernel-team@fb.com
    Cc: cgroups@vger.kernel.org
    Cc: linux-doc@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org

    Roman Gushchin
     
  • Keep track of the number of online and dying descendant cgroups.

    This data will be used later to add the ability to control the
    cgroup hierarchy (limit the depth and the number of descendant
    cgroups) and display hierarchy stats.

    Signed-off-by: Roman Gushchin
    Suggested-by: Tejun Heo
    Signed-off-by: Tejun Heo
    Cc: Zefan Li
    Cc: Waiman Long
    Cc: Johannes Weiner
    Cc: kernel-team@fb.com
    Cc: cgroups@vger.kernel.org
    Cc: linux-doc@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org

    Roman Gushchin
     

21 Jul, 2017

1 commit

  • This patch implements cgroup v2 thread support. The goal of the
    thread mode is to support hierarchical accounting and control at
    thread granularity while staying inside the resource domain model,
    which allows coordination across different resource controllers and
    handling of anonymous resource consumption.

    A cgroup is always created as a domain and can be made threaded by
    writing to the "cgroup.type" file. When a cgroup becomes threaded, it
    becomes a member of a threaded subtree which is anchored at the
    closest ancestor which isn't threaded.

    The threads of the processes which are in a threaded subtree can be
    placed anywhere without being restricted by process granularity or
    the no-internal-process constraint. Note that the threads aren't
    allowed to escape to a different threaded subtree. To be used inside
    a threaded subtree, a controller should explicitly support threaded
    mode and be able to handle internal competition in a way appropriate
    for the resource.

    The root of a threaded subtree, the nearest ancestor which isn't
    threaded, is called the threaded domain and serves as the resource
    domain for the whole subtree. This is the last cgroup where domain
    controllers are operational and where all the domain-level resource
    consumptions in the subtree are accounted. This allows threaded
    controllers to operate at thread granularity when requested while
    staying inside the scope of system-level resource distribution.
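    The "nearest non-threaded ancestor" rule can be modeled in a few
    lines. This is an illustrative sketch with invented names, not the
    kernel's representation:

    ```python
    class Cgroup:
        """Toy model of cgroup v2 domain vs. threaded cgroup types."""

        def __init__(self, name, parent=None, threaded=False):
            self.name = name
            self.parent = parent
            self.threaded = threaded  # set by writing "threaded" to cgroup.type

        def threaded_domain(self):
            """Nearest non-threaded ancestor: the resource domain for the
            whole threaded subtree (the cgroup itself if not threaded)."""
            node = self
            while node.threaded:
                node = node.parent
            return node

    root = Cgroup("root")
    svc = Cgroup("svc", root)              # a normal domain cgroup
    t1 = Cgroup("t1", svc, threaded=True)  # made threaded via cgroup.type
    t2 = Cgroup("t2", t1, threaded=True)   # nested threaded child

    print(t2.threaded_domain().name)  # "svc": domain-level accounting lands here
    ```

    However deeply the threaded children nest, domain-level resource
    consumption in the subtree is charged to the same threaded domain.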

    As the root cgroup is exempt from the no-internal-process constraint,
    it can serve as both a threaded domain and a parent to normal cgroups,
    so, unlike non-root cgroups, the root cgroup can have both domain and
    threaded children.

    Internally, in a threaded subtree, each css_set has its ->dom_cset
    pointing to a matching css_set which belongs to the threaded domain.
    This ensures that thread root level cgroup_subsys_state for all
    threaded controllers are readily accessible for domain-level
    operations.

    This patch enables threaded mode for the pids and perf_events
    controllers. Neither has to worry about domain-level resource
    consumptions and it's enough to simply set the flag.

    For more details on the interface and behavior of the thread mode,
    please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
    by this patch.

    v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
    Spotted by Waiman.
    - Documentation updated as suggested by Waiman.
    - cgroup.type content slightly reformatted.
    - Mark the debug controller threaded.

    v4: - Updated to the general idea of marking specific cgroups
    domain/threaded as suggested by PeterZ.

    v3: - Dropped "join" and always make mixed children join the parent's
    threaded subtree.

    v2: - After discussions with Waiman, support for mixed thread mode is
    added. This should address the issue that Peter pointed out
    where any nesting should be avoided for thread subtrees while
    coexisting with other domain cgroups.
    - Enabling / disabling thread mode now piggybacks on the existing
    control mask update mechanism.
    - Bug fixes and cleanup.

    Signed-off-by: Tejun Heo
    Cc: Waiman Long
    Cc: Peter Zijlstra

    Tejun Heo