10 Jan, 2019

1 commit

  • commit e9d81a1bc2c48ea9782e3e8b53875f419766ef47 upstream.

    CSS_TASK_ITER_PROCS implements process-only iteration by making
    css_task_iter_advance() skip tasks which aren't threadgroup leaders;
    however, when an iteration is started css_task_iter_start() calls the
    inner helper function css_task_iter_advance_css_set() instead of
    css_task_iter_advance(). As the helper doesn't have the skip logic,
    when the first task to visit is a non-leader thread, it doesn't get
    skipped correctly as shown in the following example.

    # ps -L 2030
    PID LWP TTY STAT TIME COMMAND
    2030 2030 pts/0 Sl+ 0:00 ./test-thread
    2030 2031 pts/0 Sl+ 0:00 ./test-thread
    # mkdir -p /sys/fs/cgroup/x/a/b
    # echo threaded > /sys/fs/cgroup/x/a/cgroup.type
    # echo threaded > /sys/fs/cgroup/x/a/b/cgroup.type
    # echo 2030 > /sys/fs/cgroup/x/a/cgroup.procs
    # cat /sys/fs/cgroup/x/a/cgroup.threads
    2030
    2031
    # cat /sys/fs/cgroup/x/cgroup.procs
    2030
    # echo 2030 > /sys/fs/cgroup/x/a/b/cgroup.threads
    # cat /sys/fs/cgroup/x/cgroup.procs
    2031
    2030

    The last read of cgroup.procs is incorrectly showing non-leader 2031
    in cgroup.procs output.

    This can be fixed by updating css_task_iter_advance() to handle the
    first advance and css_task_iter_start() to call
    css_task_iter_advance() instead of the inner helper. After the fix,
    the same commands result in the following (correct) result:

    # ps -L 2062
    PID LWP TTY STAT TIME COMMAND
    2062 2062 pts/0 Sl+ 0:00 ./test-thread
    2062 2063 pts/0 Sl+ 0:00 ./test-thread
    # mkdir -p /sys/fs/cgroup/x/a/b
    # echo threaded > /sys/fs/cgroup/x/a/cgroup.type
    # echo threaded > /sys/fs/cgroup/x/a/b/cgroup.type
    # echo 2062 > /sys/fs/cgroup/x/a/cgroup.procs
    # cat /sys/fs/cgroup/x/a/cgroup.threads
    2062
    2063
    # cat /sys/fs/cgroup/x/cgroup.procs
    2062
    # echo 2062 > /sys/fs/cgroup/x/a/b/cgroup.threads
    # cat /sys/fs/cgroup/x/cgroup.procs
    2062
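    The mechanics of the bug can be sketched in plain userspace C. This is
    a simplified model with stand-in types, not the kernel code; it only
    shows why the start path needs the same skip test as the advance path:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel's task list and iterator. */
struct task {
    int pid;
    int is_leader;          /* thread-group leader? */
    struct task *next;
};

struct task_iter {
    struct task *cur;
    int procs_only;         /* models CSS_TASK_ITER_PROCS */
};

/* Advance the cursor, skipping non-leader threads in procs-only mode. */
void iter_advance(struct task_iter *it)
{
    do {
        it->cur = it->cur ? it->cur->next : NULL;
    } while (it->procs_only && it->cur && !it->cur->is_leader);
}

/* Buggy start: positions the cursor without the skip test, so a
 * non-leader first task is visited. */
void iter_start_buggy(struct task_iter *it, struct task *first, int procs_only)
{
    it->procs_only = procs_only;
    it->cur = first;
}

/* Fixed start: routes the first position through the same skip logic. */
void iter_start_fixed(struct task_iter *it, struct task *first, int procs_only)
{
    it->procs_only = procs_only;
    it->cur = first;
    if (procs_only && it->cur && !it->cur->is_leader)
        iter_advance(it);
}
```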

    Signed-off-by: Tejun Heo
    Reported-by: "Michael Kerrisk (man-pages)"
    Fixes: 8cfd8147df67 ("cgroup: implement cgroup v2 thread support")
    Cc: stable@vger.kernel.org # v4.14+
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

18 Oct, 2018

1 commit

  • commit 479adb89a97b0a33e5a9d702119872cc82ca21aa upstream.

    A cgroup which is already a threaded domain may be converted into a
    threaded cgroup if the prerequisite conditions are met. When this
    happens, all threaded descendant should also have their ->dom_cgrp
    updated to the new threaded domain cgroup. Unfortunately, this
    propagation was missing leading to the following failure.

    # cd /sys/fs/cgroup/unified
    # cat cgroup.subtree_control # show that no controllers are enabled

    # mkdir -p mycgrp/a/b/c
    # echo threaded > mycgrp/a/b/cgroup.type

    At this point, the hierarchy looks as follows:

    mycgrp [d]
    a [dt]
    b [t]
    c [inv]

    Now let's make node "a" threaded (and thus "mycgrp" is made "domain threaded"):

    # echo threaded > mycgrp/a/cgroup.type

    By this point, we now have a hierarchy that looks as follows:

    mycgrp [dt]
    a [t]
    b [t]
    c [inv]

    But, when we try to convert the node "c" from "domain invalid" to
    "threaded", we get ENOTSUP on the write():

    # echo threaded > mycgrp/a/b/c/cgroup.type
    sh: echo: write error: Operation not supported

    This patch fixes the problem by

    * Moving the opencoded ->dom_cgrp save and restoration in
    cgroup_enable_threaded() into cgroup_{save|restore}_control() so
    that multiple cgroups can be handled.

    * Updating all threaded descendants' ->dom_cgrp to point to the new
    dom_cgrp when enabling threaded mode.
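    A minimal userspace sketch of the missing propagation, assuming
    descendants are handed over as a flat array (the kernel walks them
    with an iterator; all names here are simplified stand-ins):

```c
#include <assert.h>
#include <stddef.h>

struct cgrp {
    struct cgrp *parent;
    struct cgrp *dom_cgrp;  /* threaded domain this cgroup belongs to */
    int threaded;
};

/* When @cgrp turns threaded, its parent becomes the domain; every
 * already-threaded descendant must be repointed too (the missing step). */
void enable_threaded(struct cgrp *cgrp, struct cgrp **descendants, size_t n)
{
    struct cgrp *dom = cgrp->parent;

    cgrp->threaded = 1;
    cgrp->dom_cgrp = dom;
    for (size_t i = 0; i < n; i++)
        if (descendants[i]->threaded)
            descendants[i]->dom_cgrp = dom;   /* the fix's propagation */
}
```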

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: "Michael Kerrisk (man-pages)"
    Reported-by: Amin Jamali
    Reported-by: Joao De Almeida Pereira
    Link: https://lore.kernel.org/r/CAKgNAkhHYCMn74TCNiMJ=ccLd7DcmXSbvw3CbZ1YREeG7iJM5g@mail.gmail.com
    Fixes: 454000adaa2a ("cgroup: introduce cgroup->dom_cgrp and threaded css_set handling")
    Cc: stable@vger.kernel.org # v4.14+
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

29 Mar, 2018

1 commit

  • commit d1897c9538edafd4ae6bbd03cc075962ddde2c21 upstream.

    A domain cgroup isn't allowed to be turned threaded if its subtree is
    populated or domain controllers are enabled. cgroup_enable_threaded()
    depended on cgroup_can_be_thread_root() test to enforce this rule. A
    parent which has populated domain descendants or has domain
    controllers enabled can't become a thread root, so the above rules are
    enforced automatically.

    However, for the root cgroup which can host mixed domain and threaded
    children, cgroup_can_be_thread_root() doesn't check any of those
    conditions and thus first-level cgroups end up escaping those rules.

    This patch fixes the bug by adding explicit checks for those rules in
    cgroup_enable_threaded().
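    The added rule can be sketched as a guard (flag names are illustrative,
    not the kernel's actual predicates):

```c
#include <assert.h>
#include <errno.h>

/* Sketch of the added guards in cgroup_enable_threaded(). Previously only
 * cgroup_can_be_thread_root() on the parent enforced this, and for
 * children of the root cgroup that check never fired. */
int enable_threaded_allowed(int subtree_populated, int domain_controllers_enabled)
{
    if (subtree_populated || domain_controllers_enabled)
        return -EOPNOTSUPP;
    return 0;
}
```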

    Reported-by: Michael Kerrisk (man-pages)
    Signed-off-by: Tejun Heo
    Fixes: 8cfd8147df67 ("cgroup: implement cgroup v2 thread support")
    Cc: stable@vger.kernel.org # v4.14+
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

03 Mar, 2018

1 commit

  • [ Upstream commit 116d2f7496c51b2e02e8e4ecdd2bdf5fb9d5a641 ]

    A deadlock can occur during cgroup migration from the cpu hotplug path
    when a task T is being moved from a source to a destination cgroup.

    kworker/0:0
    cpuset_hotplug_workfn()
    cpuset_hotplug_update_tasks()
    hotplug_update_tasks_legacy()
    remove_tasks_in_empty_cpuset()
    cgroup_transfer_tasks() // stuck in iterator loop
    cgroup_migrate()
    cgroup_migrate_add_task()

    In cgroup_migrate_add_task() it checks for PF_EXITING flag of task T.
    Task T will not migrate to destination cgroup. css_task_iter_start()
    will keep pointing to task T in loop waiting for task T cg_list node
    to be removed.

    Task T
    do_exit()
    exit_signals() // sets PF_EXITING
    exit_task_namespaces()
    switch_task_namespaces()
    free_nsproxy()
    put_mnt_ns()
    drop_collected_mounts()
    namespace_unlock()
    synchronize_rcu()
    _synchronize_rcu_expedited()
    schedule_work() // on cpu0 low priority worker pool
    wait_event() // waiting for work item to execute

    Task T inserted a work item in the worklist of cpu0 low priority
    worker pool. It is waiting for expedited grace period work item
    to execute. This work item will only be executed once kworker/0:0
    complete execution of cpuset_hotplug_workfn().

    kworker/0:0 ==> Task T ==> kworker/0:0

    When a PF_EXITING task is encountered while being migrated from the
    source to the destination cgroup, migrate the next available task in
    the source cgroup instead.
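    The fix in spirit, as a self-contained sketch (PF_EXITING has this
    value in the kernel, but the task structure here is a stand-in):

```c
#include <assert.h>
#include <stddef.h>

#define PF_EXITING 0x00000004   /* per-task "I am exiting" flag */

struct mtask {
    unsigned int flags;
    struct mtask *next;
};

/* Pick the next task to migrate, skipping ones that are already exiting
 * instead of waiting on them forever. */
struct mtask *next_migratable(struct mtask *t)
{
    while (t && (t->flags & PF_EXITING))
        t = t->next;
    return t;
}
```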

    Signed-off-by: Prateek Sood
    Signed-off-by: Tejun Heo
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Prateek Sood
     

17 Jan, 2018

1 commit

  • commit 74d0833c659a8a54735e5efdd44f4b225af68586 upstream.

    While teaching css_task_iter to handle skipping over tasks which
    aren't group leaders, bc2fb7ed089f ("cgroup: add @flags to
    css_task_iter_start() and implement CSS_TASK_ITER_PROCS") introduced a
    silly bug.

    CSS_TASK_ITER_PROCS is implemented by repeating
    css_task_iter_advance() while the advanced cursor is pointing to a
    non-leader thread. However, the cursor variable, @l, wasn't updated
    when the iteration has to advance to the next css_set and the
    following repetition would operate on the terminal @l from the
    previous iteration which isn't pointing to a valid task leading to
    oopses like the following or infinite looping.

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000254
    IP: __task_pid_nr_ns+0xc7/0xf0
    PGD 0 P4D 0
    Oops: 0000 [#1] SMP
    ...
    CPU: 2 PID: 1 Comm: systemd Not tainted 4.14.4-200.fc26.x86_64 #1
    Hardware name: System manufacturer System Product Name/PRIME B350M-A, BIOS 3203 11/09/2017
    task: ffff88c4baee8000 task.stack: ffff96d5c3158000
    RIP: 0010:__task_pid_nr_ns+0xc7/0xf0
    RSP: 0018:ffff96d5c315bd50 EFLAGS: 00010206
    RAX: 0000000000000000 RBX: ffff88c4b68c6000 RCX: 0000000000000250
    RDX: ffffffffa5e47960 RSI: 0000000000000000 RDI: ffff88c490f6ab00
    RBP: ffff96d5c315bd50 R08: 0000000000001000 R09: 0000000000000005
    R10: ffff88c4be006b80 R11: ffff88c42f1b8004 R12: ffff96d5c315bf18
    R13: ffff88c42d7dd200 R14: ffff88c490f6a510 R15: ffff88c4b68c6000
    FS: 00007f9446f8ea00(0000) GS:ffff88c4be680000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000254 CR3: 00000007f956f000 CR4: 00000000003406e0
    Call Trace:
    cgroup_procs_show+0x19/0x30
    cgroup_seqfile_show+0x4c/0xb0
    kernfs_seq_show+0x21/0x30
    seq_read+0x2ec/0x3f0
    kernfs_fop_read+0x134/0x180
    __vfs_read+0x37/0x160
    ? security_file_permission+0x9b/0xc0
    vfs_read+0x8e/0x130
    SyS_read+0x55/0xc0
    entry_SYSCALL_64_fastpath+0x1a/0xa5
    RIP: 0033:0x7f94455f942d
    RSP: 002b:00007ffe81ba2d00 EFLAGS: 00000293 ORIG_RAX: 0000000000000000
    RAX: ffffffffffffffda RBX: 00005574e2233f00 RCX: 00007f94455f942d
    RDX: 0000000000001000 RSI: 00005574e2321a90 RDI: 000000000000002b
    RBP: 0000000000000000 R08: 00005574e2321a90 R09: 00005574e231de60
    R10: 00007f94458c8b38 R11: 0000000000000293 R12: 00007f94458c8ae0
    R13: 00007ffe81ba3800 R14: 0000000000000000 R15: 00005574e2116560
    Code: 04 74 0e 89 f6 48 8d 04 76 48 8d 04 c5 f0 05 00 00 48 8b bf b8 05 00 00 48 01 c7 31 c0 48 8b 0f 48 85 c9 74 18 8b b2 30 08 00 00 71 04 77 0d 48 c1 e6 05 48 01 f1 48 3b 51 38 74 09 5d c3 8b
    RIP: __task_pid_nr_ns+0xc7/0xf0 RSP: ffff96d5c315bd50

    Fix it by moving the initialization of the cursor below the repeat
    label. While at it, rename it to @next for readability.
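    A simplified model of the fixed loop shape, with the cursor re-derived
    below the repeat label on every pass (stand-in types, not the kernel
    iterator):

```c
#include <assert.h>
#include <stddef.h>

struct node {
    int leader;
    struct node *next;
};

/* The candidate cursor (@next) is computed below the repeat label each
 * pass, so falling back to a new list head can't leave a stale terminal
 * pointer behind, which is what caused the oops. */
struct node *iter_next(struct node **pos, struct node *new_head, int procs_only)
{
    struct node *next;
repeat:
    next = *pos ? (*pos)->next : new_head;
    if (!next)
        return NULL;
    *pos = next;
    if (procs_only && !next->leader)
        goto repeat;   /* re-derives @next, never reuses a stale one */
    return next;
}
```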

    Signed-off-by: Tejun Heo
    Fixes: bc2fb7ed089f ("cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS")
    Reported-by: Laura Abbott
    Reported-by: Bronek Kozicki
    Reported-by: George Amanakis
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.
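    For a C source file lacking license text, the identifier goes on the
    first line, for example:

```
// SPDX-License-Identifier: GPL-2.0
```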

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few thousand files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

22 Sep, 2017

1 commit

  • The cgroup_taskset structure within the larger cgroup_mgctx structure
    is supposed to be used once and then discarded. That is not really the
    case in the hotplug code path:

    cpuset_hotplug_workfn()
    - cgroup_transfer_tasks()
    - cgroup_migrate()
    - cgroup_migrate_add_task()
    - cgroup_migrate_execute()

    In this case, the cgroup_migrate() function is called multiple times
    with the same cgroup_mgctx structure to transfer the tasks from
    one cgroup to another one-by-one. The second time cgroup_migrate()
    is called, the cgroup_taskset will be in an incorrect state and so
    may cause the system to panic. For example,

    [ 150.888410] Faulting instruction address: 0xc0000000001db648
    [ 150.888414] Oops: Kernel access of bad area, sig: 11 [#1]
    [ 150.888417] SMP NR_CPUS=2048
    [ 150.888417] NUMA
    [ 150.888419] pSeries
    :
    [ 150.888545] NIP [c0000000001db648] cpuset_can_attach+0x58/0x1b0
    [ 150.888548] LR [c0000000001db638] cpuset_can_attach+0x48/0x1b0
    [ 150.888551] Call Trace:
    [ 150.888554] [c0000005f65cb940] [c0000000001db638] cpuset_can_attach+0x48/0x1b 0 (unreliable)
    [ 150.888559] [c0000005f65cb9a0] [c0000000001cff04] cgroup_migrate_execute+0xc4/0x4b0
    [ 150.888563] [c0000005f65cba20] [c0000000001d7d14] cgroup_transfer_tasks+0x1d4/0x370
    [ 150.888568] [c0000005f65cbb70] [c0000000001ddcb0] cpuset_hotplug_workfn+0x710/0x8f0
    [ 150.888572] [c0000005f65cbc80] [c00000000012032c] process_one_work+0x1ac/0x4d0
    [ 150.888576] [c0000005f65cbd20] [c0000000001206f8] worker_thread+0xa8/0x5b0
    [ 150.888580] [c0000005f65cbdc0] [c0000000001293f8] kthread+0x168/0x1b0
    [ 150.888584] [c0000005f65cbe30] [c00000000000b368] ret_from_kernel_thread+0x5c/0x74

    To allow reuse of the cgroup_mgctx structure, some fields in that
    structure are now re-initialized at the end of cgroup_migrate_execute()
    function call so that the structure can be reused again in a later
    iteration without causing problem.
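    A toy model of the reuse problem and the fix (field names are
    illustrative, not the kernel's cgroup_taskset layout):

```c
#include <assert.h>

/* cgroup_migrate_execute() consumed the taskset but left its fields
 * dirty; re-initializing them at the end lets the same context drive
 * the next one-task transfer. */
struct taskset {
    int nr_tasks;
    int cursor;
};

int taskset_add(struct taskset *ts)
{
    return ++ts->nr_tasks;
}

int migrate_execute(struct taskset *ts)
{
    int moved = 0;

    while (ts->cursor < ts->nr_tasks) {
        ts->cursor++;
        moved++;
    }
    /* the fix: reset for reuse by a later cgroup_migrate() call */
    ts->nr_tasks = 0;
    ts->cursor = 0;
    return moved;
}
```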

    This bug was introduced in the commit e595cd706982 ("cgroup: track
    migration context in cgroup_mgctx") in 4.11. This commit moves the
    cgroup_taskset initialization out of cgroup_migrate(). The commit
    10467270fb3 ("cgroup: don't call migration methods if there are no
    tasks to migrate") helped, but did not completely resolve the problem.

    Fixes: e595cd706982bff0211e6fafe5a108421e747fbc ("cgroup: track migration context in cgroup_mgctx")
    Signed-off-by: Waiman Long
    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org # v4.11+

    Waiman Long
     

08 Sep, 2017

1 commit

  • Pull block layer updates from Jens Axboe:
    "This is the first pull request for 4.14, containing most of the code
    changes. It's a quiet series this round, which I think we needed after
    the churn of the last few series. This contains:

    - Fix for a registration race in loop, from Anton Volkov.

    - Overflow complaint fix from Arnd for DAC960.

    - Series of drbd changes from the usual suspects.

    - Conversion of the stec/skd driver to blk-mq. From Bart.

    - A few BFQ improvements/fixes from Paolo.

    - CFQ improvement from Ritesh, allowing idling for group idle.

    - A few fixes found by Dan's smatch, courtesy of Dan.

    - A warning fixup for a race between changing the IO scheduler and
    device removal. From David Jeffery.

    - A few nbd fixes from Josef.

    - Support for cgroup info in blktrace, from Shaohua.

    - Also from Shaohua, new features in the null_blk driver to allow it
    to actually hold data, among other things.

    - Various corner cases and error handling fixes from Weiping Zhang.

    - Improvements to the IO stats tracking for blk-mq from me. Can
    drastically improve performance for fast devices and/or big
    machines.

    - Series from Christoph removing bi_bdev as being needed for IO
    submission, in preparation for nvme multipathing code.

    - Series from Bart, including various cleanups and fixes for switch
    fall through case complaints"

    * 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
    kernfs: checking for IS_ERR() instead of NULL
    drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
    drbd: Fix allyesconfig build, fix recent commit
    drbd: switch from kmalloc() to kmalloc_array()
    drbd: abort drbd_start_resync if there is no connection
    drbd: move global variables to drbd namespace and make some static
    drbd: rename "usermode_helper" to "drbd_usermode_helper"
    drbd: fix race between handshake and admin disconnect/down
    drbd: fix potential deadlock when trying to detach during handshake
    drbd: A single dot should be put into a sequence.
    drbd: fix rmmod cleanup, remove _all_ debugfs entries
    drbd: Use setup_timer() instead of init_timer() to simplify the code.
    drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
    drbd: new disk-option disable-write-same
    drbd: Fix resource role for newly created resources in events2
    drbd: mark symbols static where possible
    drbd: Send P_NEG_ACK upon write error in protocol != C
    drbd: add explicit plugging when submitting batches
    drbd: change list_for_each_safe to while(list_first_entry_or_null)
    drbd: introduce drbd_recv_header_maybe_unplug
    ...

    Linus Torvalds
     

07 Sep, 2017

4 commits

  • Cpusets vs. suspend-resume is _completely_ broken. And it got noticed
    because it now resulted in non-cpuset usage breaking too.

    On suspend cpuset_cpu_inactive() doesn't call into
    cpuset_update_active_cpus() because it doesn't want to move tasks about,
    there is no need, all tasks are frozen and won't run again until after
    we've resumed everything.

    But this means that when we finally do call into
    cpuset_update_active_cpus() after resuming the last frozen cpu in
    cpuset_cpu_active(), the top_cpuset will not have any difference with
    the cpu_active_mask and thus it will not in fact do _anything_.

    So the cpuset configuration will not be restored. This was largely
    hidden because we would unconditionally create identity domains and
    mobile users would not in fact use cpusets much. And servers that do use
    cpusets tend not to suspend-resume much.

    An additional problem is that we'd not in fact wait for the cpuset work to
    finish before resuming the tasks, allowing spurious migrations outside
    of the specified domains.

    Fix the rebuild by introducing cpuset_force_rebuild() and fix the
    ordering with cpuset_wait_for_hotplug().
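    The force-rebuild mechanism can be modeled as a one-shot flag (a
    sketch; the kernel's version guards this with proper locking):

```c
#include <assert.h>

/* Suspend/resume marks the rebuild as mandatory so the post-resume
 * update runs even though the active mask ends up identical to what
 * top_cpuset already has. */
static int force_rebuild;

void cpuset_force_rebuild(void)
{
    force_rebuild = 1;
}

int should_rebuild(int active_mask_changed)
{
    if (force_rebuild) {
        force_rebuild = 0;
        return 1;
    }
    return active_mask_changed;
}
```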

    Reported-by: Andy Lutomirski
    Signed-off-by: Peter Zijlstra (Intel)
    Cc:
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Rafael J. Wysocki
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Fixes: deb7aa308ea2 ("cpuset: reorganize CPU / memory hotplug handling")
    Link: http://lkml.kernel.org/r/20170907091338.orwxrqkbfkki3c24@hirez.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Pull cgroup updates from Tejun Heo:
    "Several notable changes this cycle:

    - Thread mode was merged. This will be used for cgroup2 support for
    CPU and possibly other controllers. Unfortunately, CPU controller
    cgroup2 support didn't make this pull request but most contentions
    have been resolved and the support is likely to be merged before
    the next merge window.

    - cgroup.stat now shows the number of descendant cgroups.

    - cpuset now can enable the easier-to-configure v2 behavior on v1
    hierarchy"

    * 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
    cpuset: Allow v2 behavior in v1 cgroup
    cgroup: Add mount flag to enable cpuset to use v2 behavior in v1 cgroup
    cgroup: remove unneeded checks
    cgroup: misc changes
    cgroup: short-circuit cset_cgroup_from_root() on the default hierarchy
    cgroup: re-use the parent pointer in cgroup_destroy_locked()
    cgroup: add cgroup.stat interface with basic hierarchy stats
    cgroup: implement hierarchy limits
    cgroup: keep track of number of descent cgroups
    cgroup: add comment to cgroup_enable_threaded()
    cgroup: remove unnecessary empty check when enabling threaded mode
    cgroup: update debug controller to print out thread mode information
    cgroup: implement cgroup v2 thread support
    cgroup: implement CSS_TASK_ITER_THREADED
    cgroup: introduce cgroup->dom_cgrp and threaded css_set handling
    cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS
    cgroup: reorganize cgroup.procs / task write path
    cgroup: replace css_set walking populated test with testing cgrp->nr_populated_csets
    cgroup: distinguish local and children populated states
    cgroup: remove now unused list_head @pending in cgroup_apply_cftypes()
    ...

    Linus Torvalds
     
  • TIF_MEMDIE is set only to the tasks which were either directly selected
    by the OOM killer or passed through mark_oom_victim from the allocator
    path. tsk_is_oom_victim is more generic and allows to identify all
    tasks (threads) which share the mm with the oom victim.

    Please note that the freezer still needs to check TIF_MEMDIE because we
    cannot thaw tasks which do not participate in oom_victims counting
    otherwise a !TIF_MEMDIE task could interfere after oom_disable returns.
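    The distinction can be sketched with stand-in types: the flag is
    per-thread, while the victim state hangs off the shared mm:

```c
#include <assert.h>
#include <stddef.h>

/* TIF_MEMDIE marks individual threads; tsk_is_oom_victim() keys off the
 * shared mm, so it covers every thread sharing it with the oom victim. */
struct mm_struct { int oom_victim; };
struct task      { struct mm_struct *mm; int tif_memdie; };

int tsk_is_oom_victim(const struct task *t)
{
    return t->mm && t->mm->oom_victim;
}
```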

    Link: http://lkml.kernel.org/r/20170810075019.28998-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Commit fa06235b8eb0 ("cgroup: reset css on destruction") caused
    css_reset callback to be called from the offlining path. Although it
    solves the problem mentioned in the commit description ("For instance,
    memory cgroup needs to reset memory.low, otherwise pages charged to a
    dead cgroup might never get reclaimed."), generally speaking, it's not
    correct.

    An offline cgroup can still be a resource domain, and we shouldn't grant
    it more resources than it had before deletion.

    For instance, if an offline memory cgroup has dirty pages, we should
    still imply i/o limits during writeback.

    The css_reset callback is designed to return the cgroup state into the
    original state, that means reset all limits and counters. It's
    something different from the offlining, and we shouldn't use it from
    the offlining path. Instead, we should adjust necessary settings from
    the per-controller css_offline callbacks (e.g. reset memory.low).
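    An illustrative split of the two callbacks' responsibilities, with
    simplified fields standing in for memory.low/memory.max:

```c
#include <assert.h>
#include <limits.h>

struct mem_css {
    long low;     /* memory.low-style protection */
    long max;     /* memory.max-style limit */
};

/* css_reset returns everything to defaults. */
void css_reset(struct mem_css *c)
{
    c->low = 0;
    c->max = LONG_MAX;
}

/* css_offline only drops what would keep dead pages unreclaimable. */
void css_offline(struct mem_css *c)
{
    c->low = 0;
    /* c->max kept: an offline cgroup is still a resource domain */
}
```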

    Link: http://lkml.kernel.org/r/20170727130428.28856-2-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Tejun Heo
    Acked-by: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

05 Sep, 2017

2 commits

  • Pull locking updates from Ingo Molnar:

    - Add 'cross-release' support to lockdep, which allows APIs like
    completions, where it's not the 'owner' who releases the lock, to be
    tracked. It's all activated automatically under
    CONFIG_PROVE_LOCKING=y.

    - Clean up (restructure) the x86 atomics op implementation to be more
    readable, in preparation of KASAN annotations. (Dmitry Vyukov)

    - Fix static keys (Paolo Bonzini)

    - Add killable versions of down_read() et al (Kirill Tkhai)

    - Rework and fix jump_label locking (Marc Zyngier, Paolo Bonzini)

    - Rework (and fix) tlb_flush_pending() barriers (Peter Zijlstra)

    - Remove smp_mb__before_spinlock() and convert its usages, introduce
    smp_mb__after_spinlock() (Peter Zijlstra)

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (56 commits)
    locking/lockdep/selftests: Fix mixed read-write ABBA tests
    sched/completion: Avoid unnecessary stack allocation for COMPLETION_INITIALIZER_ONSTACK()
    acpi/nfit: Fix COMPLETION_INITIALIZER_ONSTACK() abuse
    locking/pvqspinlock: Relax cmpxchg's to improve performance on some architectures
    smp: Avoid using two cache lines for struct call_single_data
    locking/lockdep: Untangle xhlock history save/restore from task independence
    locking/refcounts, x86/asm: Disable CONFIG_ARCH_HAS_REFCOUNT for the time being
    futex: Remove duplicated code and fix undefined behaviour
    Documentation/locking/atomic: Finish the document...
    locking/lockdep: Fix workqueue crossrelease annotation
    workqueue/lockdep: 'Fix' flush_work() annotation
    locking/lockdep/selftests: Add mixed read-write ABBA tests
    mm, locking/barriers: Clarify tlb_flush_pending() barriers
    locking/lockdep: Make CONFIG_LOCKDEP_CROSSRELEASE and CONFIG_LOCKDEP_COMPLETIONS truly non-interactive
    locking/lockdep: Explicitly initialize wq_barrier::done::map
    locking/lockdep: Rename CONFIG_LOCKDEP_COMPLETE to CONFIG_LOCKDEP_COMPLETIONS
    locking/lockdep: Reword title of LOCKDEP_CROSSRELEASE config
    locking/lockdep: Make CONFIG_LOCKDEP_CROSSRELEASE part of CONFIG_PROVE_LOCKING
    locking/refcounts, x86/asm: Implement fast refcount overflow protection
    locking/lockdep: Fix the rollback and overwrite detection logic in crossrelease
    ...

    Linus Torvalds
     
  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - fix affine wakeups (Peter Zijlstra)

    - improve CPU onlining (and general bootup) scalability on systems
    with ridiculous number (thousands) of CPUs (Peter Zijlstra)

    - sched/numa updates (Rik van Riel)

    - sched/deadline updates (Byungchul Park)

    - sched/cpufreq enhancements and related cleanups (Viresh Kumar)

    - sched/debug enhancements (Xie XiuQi)

    - various fixes"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
    sched/debug: Optimize sched_domain sysctl generation
    sched/topology: Avoid pointless rebuild
    sched/topology, cpuset: Avoid spurious/wrong domain rebuilds
    sched/topology: Improve comments
    sched/topology: Fix memory leak in __sdt_alloc()
    sched/completion: Document that reinit_completion() must be called after complete_all()
    sched/autogroup: Fix error reporting printk text in autogroup_create()
    sched/fair: Fix wake_affine() for !NUMA_BALANCING
    sched/debug: Intruduce task_state_to_char() helper function
    sched/debug: Show task state in /proc/sched_debug
    sched/debug: Use task_pid_nr_ns in /proc/$pid/sched
    sched/core: Remove unnecessary initialization init_idle_bootup_task()
    sched/deadline: Change return value of cpudl_find()
    sched/deadline: Make find_later_rq() choose a closer CPU in topology
    sched/numa: Scale scan period with tasks in group and shared/private
    sched/numa: Slow down scan rate if shared faults dominate
    sched/pelt: Fix false running accounting
    sched: Mark pick_next_task_dl() and build_sched_domain() as static
    sched/cpupri: Don't re-initialize 'struct cpupri'
    sched/deadline: Don't re-initialize 'struct cpudl'
    ...

    Linus Torvalds
     

25 Aug, 2017

2 commits

  • When disabling cpuset.sched_load_balance we expect to be able to online
    CPUs without generating sched_domains. However this is currently
    completely broken.

    What happens is that we generate the sched_domains and then destroy
    them. This is because of the spurious 'default' domain build in
    cpuset_update_active_cpus(). That builds a single machine wide domain
    and then schedules a work to build the 'real' domains. The work then
    finds there are _no_ domains and destroys the lot again.

    Furthermore, if there actually were cpusets, building the machine wide
    domain is actively wrong, because it would allow tasks to 'escape' their
    cpuset. Also I don't think its needed, the scheduler really should
    respect the active mask.

    Reported-by: Ofer Levi(SW)
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Vineet.Gupta1@synopsys.com
    Cc: rusty@rustcorp.com.au
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The memory_pressure control file was incorrectly set up without
    a private value (0, by default). As a result, this control
    file was treated like memory_migrate on read. By adding back the
    FILE_MEMORY_PRESSURE private value, the correct memory pressure value
    will be returned.
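    The shape of the fix, sketched with simplified types (the kernel's
    cftype has many more fields):

```c
#include <assert.h>

/* Because an unset .private defaults to 0, and 0 aliased
 * FILE_MEMORY_MIGRATE, reads of memory_pressure dispatched to the wrong
 * attribute; the explicit value restores correct dispatch. */
enum cpuset_filetype {
    FILE_MEMORY_MIGRATE,    /* 0: what an unset .private decays to */
    FILE_MEMORY_PRESSURE,
};

struct cftype {
    const char *name;
    int private;
};

static const struct cftype memory_pressure_cft = {
    .name = "memory_pressure",
    .private = FILE_MEMORY_PRESSURE,
};
```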

    Signed-off-by: Waiman Long
    Signed-off-by: Tejun Heo
    Fixes: 7dbdb199d3bf ("cgroup: replace cftype->mode with CFTYPE_WORLD_WRITABLE")
    Cc: stable@vger.kernel.org # v4.4+

    Waiman Long
     

18 Aug, 2017

2 commits

  • Cpuset v2 has some useful behaviors that are not present in v1 because
    of backward compatibility concerns. One of them is the restoration of
    the original cpu and memory node mask after a hot removal and addition
    event sequence.

    This patch makes the cpuset controller check the
    CGRP_ROOT_CPUSET_V2_MODE flag and use the v2 behavior if it is set.

    Signed-off-by: Waiman Long
    Signed-off-by: Tejun Heo

    Waiman Long
     
  • A new mount option "cpuset_v2_mode" is added to the v1 cgroupfs
    filesystem to enable cpuset controller to use v2 behavior in a v1
    cgroup. This mount option applies only to the cpuset controller and has
    no effect on other controllers.
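    Assuming it is passed like other v1 cgroupfs mount flags, usage would
    look something like (illustrative):

```
# mount -t cgroup -o cpuset,cpuset_v2_mode none /sys/fs/cgroup/cpuset
```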

    Signed-off-by: Waiman Long
    Signed-off-by: Tejun Heo

    Waiman Long
     

12 Aug, 2017

1 commit

  • "descendants" and "depth" are declared as int, so they can't be larger
    than INT_MAX. Static checkers complain and it's slightly confusing for
    humans as well so let's just remove these conditions.
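    The removed comparisons in a nutshell: since the variables are int,
    the check can never fire.

```c
#include <assert.h>
#include <limits.h>

/* An int cannot hold a value above INT_MAX, so a check like this on an
 * int-typed "descendants" or "depth" is always false. */
int exceeds_int_max(int descendants)
{
    return descendants > INT_MAX;
}
```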

    Signed-off-by: Dan Carpenter
    Signed-off-by: Tejun Heo

    Dan Carpenter
     

11 Aug, 2017

1 commit

  • Misc trivial changes to prepare for future changes. No functional
    difference.

    * Expose cgroup_get(), cgroup_tryget() and cgroup_parent().

    * Implement task_dfl_cgroup() which dereferences css_set->dfl_cgrp.

    * Rename cgroup_stats_show() to cgroup_stat_show() for consistency
    with the file name.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

10 Aug, 2017

1 commit

  • Any use of key->enabled (that is static_key_enabled and static_key_count)
    outside jump_label_lock should handle its own serialization. In the case
    of cpusets_enabled_key, the key is always incremented/decremented under
    cpuset_mutex, and hence the same rule applies to nr_cpusets. The rule
    *is* respected currently, but the mutex is static so nr_cpusets should
    be static too.

    Signed-off-by: Paolo Bonzini
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Zefan Li
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1501601046-35683-4-git-send-email-pbonzini@redhat.com
    Signed-off-by: Ingo Molnar

    Paolo Bonzini
     

03 Aug, 2017

6 commits

  • In codepaths that use the begin/retry interface for reading
    mems_allowed_seq with irqs disabled, there exists a race condition that
    stalls the patch process after only modifying a subset of the
    static_branch call sites.

    This problem manifested itself as a deadlock in the slub allocator,
    inside get_any_partial. The loop reads mems_allowed_seq value (via
    read_mems_allowed_begin), performs the defrag operation, and then
    verifies the consistency of mem_allowed via the read_mems_allowed_retry
    and the cookie returned by xxx_begin.

    The issue here is that both begin and retry first check if cpusets are
    enabled via the cpusets_enabled() static branch. This branch can be
    rewritten dynamically (via cpuset_inc) if a new cpuset is created. The
    x86 jump label code fully synchronizes across all CPUs for every entry
    it rewrites. If it rewrites only one of the callsites (specifically the
    one in read_mems_allowed_retry) and then waits for the
    smp_call_function(do_sync_core) to complete while a CPU is inside the
    begin/retry section with IRQs off and the mems_allowed value is changed,
    we can hang.

    This is because begin() will always return 0 (since it wasn't patched
    yet) while retry() will test the 0 against the actual value of the seq
    counter.

    The fix is to use two different static keys: one for begin
    (pre_enable_key) and one for retry (enable_key). In cpuset_inc(), we
    first bump the pre_enable key to ensure that cpuset_mems_allowed_begin()
    always returns a valid seqcount if we are enabling cpusets. Similarly,
    when disabling cpusets via cpuset_dec(), we first ensure that callers of
    cpuset_mems_allowed_retry() will start ignoring the seqcount value
    before we let cpuset_mems_allowed_begin() return 0.
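
    The ordering can be illustrated with a toy model in which plain
    booleans stand in for the static keys. The names mirror the commit,
    but this is an illustrative sketch, not kernel code:

```python
# Simplified model of the two-key scheme: pre_enable_key gates begin(),
# enable_key gates retry(). The broken intermediate state the commit
# describes (retry patched, begin not) can never occur with this order.

class CpusetKeys:
    def __init__(self):
        self.pre_enable_key = 0   # gates read_mems_allowed_begin()
        self.enable_key = 0       # gates read_mems_allowed_retry()
        self.seq = 0              # the mems_allowed seqcount

    def begin(self):
        # Hands out a real seqcount only once pre_enable_key is set.
        return self.seq if self.pre_enable_key else 0

    def retry(self, cookie):
        # Ignores the cookie entirely while enable_key is unset.
        return bool(self.enable_key and cookie != self.seq)

    def cpuset_inc(self):
        # Enable: bump pre_enable first, so begin() hands out valid
        # cookies before retry() starts comparing them.
        self.pre_enable_key = 1
        self.enable_key = 1

    def cpuset_dec(self):
        # Disable in reverse order: retry() stops comparing before
        # begin() goes back to returning 0.
        self.enable_key = 0
        self.pre_enable_key = 0

k = CpusetKeys()
k.cpuset_inc()
cookie = k.begin()
k.seq += 1                 # mems_allowed changed concurrently
assert k.retry(cookie)     # stale cookie is detected, caller loops
```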

    The relevant stack traces of the two stuck threads:

    CPU: 1 PID: 1415 Comm: mkdir Tainted: G L 4.9.36-00104-g540c51286237 #4
    Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
    task: ffff8817f9c28000 task.stack: ffffc9000ffa4000
    RIP: smp_call_function_many+0x1f9/0x260
    Call Trace:
    smp_call_function+0x3b/0x70
    on_each_cpu+0x2f/0x90
    text_poke_bp+0x87/0xd0
    arch_jump_label_transform+0x93/0x100
    __jump_label_update+0x77/0x90
    jump_label_update+0xaa/0xc0
    static_key_slow_inc+0x9e/0xb0
    cpuset_css_online+0x70/0x2e0
    online_css+0x2c/0xa0
    cgroup_apply_control_enable+0x27f/0x3d0
    cgroup_mkdir+0x2b7/0x420
    kernfs_iop_mkdir+0x5a/0x80
    vfs_mkdir+0xf6/0x1a0
    SyS_mkdir+0xb7/0xe0
    entry_SYSCALL_64_fastpath+0x18/0xad

    ...

    CPU: 2 PID: 1 Comm: init Tainted: G L 4.9.36-00104-g540c51286237 #4
    Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
    task: ffff8818087c0000 task.stack: ffffc90000030000
    RIP: int3+0x39/0x70
    Call Trace:
    ? ___slab_alloc+0x28b/0x5a0
    ? copy_process.part.40+0xf7/0x1de0
    __slab_alloc.isra.80+0x54/0x90
    copy_process.part.40+0xf7/0x1de0
    copy_process.part.40+0xf7/0x1de0
    kmem_cache_alloc_node+0x8a/0x280
    copy_process.part.40+0xf7/0x1de0
    _do_fork+0xe7/0x6c0
    _raw_spin_unlock_irq+0x2d/0x60
    trace_hardirqs_on_caller+0x136/0x1d0
    entry_SYSCALL_64_fastpath+0x5/0xad
    do_syscall_64+0x27/0x350
    SyS_clone+0x19/0x20
    do_syscall_64+0x60/0x350
    entry_SYSCALL64_slow_path+0x25/0x25

    Link: http://lkml.kernel.org/r/20170731040113.14197-1-dmitriyz@waymo.com
    Fixes: 46e700abc44c ("mm, page_alloc: remove unnecessary taking of a seqlock when cpusets are disabled")
    Signed-off-by: Dima Zavin
    Reported-by: Cliff Spradlin
    Acked-by: Vlastimil Babka
    Cc: Peter Zijlstra
    Cc: Christopher Lameter
    Cc: Li Zefan
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dima Zavin
     
  • Each css_set directly points to the default cgroup it belongs to, so
    there's no reason to walk the cgrp_links list on the default
    hierarchy.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • As we already have a pointer to the parent cgroup in
    cgroup_destroy_locked(), we don't need to calculate it again
    to pass as an argument for cgroup1_check_for_release().

    Signed-off-by: Roman Gushchin
    Suggested-by: Tejun Heo
    Signed-off-by: Tejun Heo
    Cc: Zefan Li
    Cc: Waiman Long
    Cc: Johannes Weiner
    Cc: kernel-team@fb.com
    Cc: linux-kernel@vger.kernel.org

    Roman Gushchin
     
  • A cgroup can consume resources even after being deleted by a user.
    For example, writing back dirty pages should be accounted and
    limited even though the corresponding cgroup may contain no processes
    and may already have been deleted by the user.

    In the current implementation a cgroup can remain in such a "dying"
    state for an undefined amount of time, for instance if a memory cgroup
    contains a page mlocked by a process belonging to another cgroup.

    Although the lifecycle of a dying cgroup is out of the user's control,
    it's important to have some insight into what's going on under the hood.

    In particular, it's handy to have a counter which allows css leaks to
    be detected.

    To solve this problem, add a cgroup.stat interface to
    the base cgroup control files with the following metrics:

    nr_descendants: total number of visible descendant cgroups
    nr_dying_descendants: total number of dying descendant cgroups
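
    cgroup.stat uses the usual flat keyed format, so a minimal parser
    (an illustrative helper, not part of any kernel API) might look like:

```python
def parse_cgroup_stat(text):
    """Parse the flat keyed cgroup.stat format into a dict.

    Illustrative helper; the two field names come from the commit above.
    """
    stats = {}
    for line in text.splitlines():
        key, value = line.split()
        stats[key] = int(value)
    return stats

sample = "nr_descendants 12\nnr_dying_descendants 3\n"
assert parse_cgroup_stat(sample) == {
    "nr_descendants": 12,
    "nr_dying_descendants": 3,
}
```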

    Signed-off-by: Roman Gushchin
    Suggested-by: Tejun Heo
    Signed-off-by: Tejun Heo
    Cc: Zefan Li
    Cc: Waiman Long
    Cc: Johannes Weiner
    Cc: kernel-team@fb.com
    Cc: cgroups@vger.kernel.org
    Cc: linux-doc@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org

    Roman Gushchin
     
  • Creating cgroup hierarchies of unreasonable size can affect
    overall system performance. A user might want to limit the
    size of the cgroup hierarchy. This is especially important if a user
    is delegating some cgroup sub-tree.

    To address this issue, introduce an ability to control
    the size of cgroup hierarchy.

    The cgroup.max.descendants control file allows setting the maximum
    allowed number of descendant cgroups.
    The cgroup.max.depth file controls the maximum depth of the cgroup
    tree. Both are single-value r/w files, with "max" as the default value.

    The control files exist on each hierarchy level (including root).
    When a new cgroup is created, we check the total descendants
    and depth limits on each level, and if none of them are exceeded,
    a new cgroup is created.

    Only alive cgroups are counted, removed (dying) cgroups are
    ignored.
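
    The creation-time check described above can be sketched as follows.
    The walk from the parent to the root and the attribute names are
    illustrative stand-ins, not the kernel's actual fields:

```python
# Sketch of the creation-time check: walk from the parent to the root,
# and refuse creation if any ancestor's live-descendant count or the new
# cgroup's depth would exceed that ancestor's limits. Dying cgroups are
# simply not counted in nr_descendants.

MAX = float("inf")   # stands in for the "max" default value

class Cgroup:
    def __init__(self, parent=None, max_descendants=MAX, max_depth=MAX):
        self.parent = parent
        self.level = parent.level + 1 if parent else 0
        self.nr_descendants = 0          # live descendants only
        self.max_descendants = max_descendants
        self.max_depth = max_depth

def may_create(parent):
    ancestor = parent
    while ancestor is not None:
        if ancestor.nr_descendants + 1 > ancestor.max_descendants:
            return False
        # Depth of the new child relative to this ancestor:
        if (parent.level + 1) - ancestor.level > ancestor.max_depth:
            return False
        ancestor = ancestor.parent
    return True

def mkdir(parent):
    if not may_create(parent):
        raise PermissionError("descendant/depth limit exceeded")
    child = Cgroup(parent)
    a = parent
    while a is not None:
        a.nr_descendants += 1            # every ancestor gains one
        a = a.parent
    return child

root = Cgroup(max_depth=1)
a = mkdir(root)              # depth 1 under root: allowed
try:
    mkdir(a)                 # depth 2 under root: exceeds root.max_depth
    assert False, "should have been rejected"
except PermissionError:
    pass
```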

    Signed-off-by: Roman Gushchin
    Suggested-by: Tejun Heo
    Signed-off-by: Tejun Heo
    Cc: Zefan Li
    Cc: Waiman Long
    Cc: Johannes Weiner
    Cc: kernel-team@fb.com
    Cc: cgroups@vger.kernel.org
    Cc: linux-doc@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org

    Roman Gushchin
     
  • Keep track of the number of online and dying descendant cgroups.

    This data will be used later to add an ability to control the cgroup
    hierarchy (limit the depth and the number of descendant cgroups)
    and display hierarchy stats.

    Signed-off-by: Roman Gushchin
    Suggested-by: Tejun Heo
    Signed-off-by: Tejun Heo
    Cc: Zefan Li
    Cc: Waiman Long
    Cc: Johannes Weiner
    Cc: kernel-team@fb.com
    Cc: cgroups@vger.kernel.org
    Cc: linux-doc@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org

    Roman Gushchin
     

29 Jul, 2017

2 commits

  • By default we output the cgroup id in blktrace. This adds an option to
    display the cgroup path. Since getting the cgroup path is a relatively
    heavy operation, we don't enable it by default.

    With the option enabled, blktrace will output something like this:
    dd-1353 [007] d..2 293.015252: 8,0 /test/level D R 24 + 8 [dd]

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • Now we have the facilities to implement exportfs operations. The idea
    is that cgroup can export the fhandle info to userspace, and userspace
    then uses the fhandle to find the cgroup name. Another example:
    userspace can get the fhandle for a cgroup and BPF can use the fhandle
    to filter info for that cgroup.

    Acked-by: Greg Kroah-Hartman
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     

26 Jul, 2017

2 commits

  • Explain cgroup_enable_threaded() and note that the function can never
    be called on the root cgroup.

    Signed-off-by: Tejun Heo
    Suggested-by: Waiman Long

    Tejun Heo
     
  • cgroup_enable_threaded() checks that the cgroup doesn't have any tasks
    or children and fails the operation if it does. This test is partly
    unnecessary: the no-tasks part is already checked by
    cgroup_can_be_thread_root(), and the no-children part is not needed at
    all. The latter actually causes a behavioral oddity. Please consider
    the following hierarchy, where all cgroups are domains.

      A
     / \
    B   C
         \
          D

    If B is made threaded, C and D become invalid domains. Due to the
    no-children restriction, threaded mode can't be enabled on C. For C and
    D, the only thing the user can do is removal.

    There is no reason for this restriction. Remove it.

    Acked-by: Waiman Long
    Signed-off-by: Tejun Heo

    Tejun Heo
     

23 Jul, 2017

1 commit

  • While refactoring, f7b2814bb9b6 ("cgroup: factor out
    cgroup_{apply|finalize}_control() from
    cgroup_subtree_control_write()") broke the error return value of the
    function: the return value from the last operation is always
    overridden to zero. Fix it.

    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org # v4.6+
    Signed-off-by: Tejun Heo

    Tejun Heo
     

21 Jul, 2017

5 commits

  • Update debug controller so that it prints out debug info about thread
    mode.

    1) The relationship between proc_cset and threaded_csets is displayed.
    2) The status of being a thread root or threaded cgroup is displayed.

    This patch is extracted from Waiman's larger patch.

    v2: - Removed [thread root] / [threaded] from debug.cgroup_css_links
    file as the same information is available from cgroup.type.
    Suggested by Waiman.
    - Threaded marking is moved to the previous patch.

    Patch-originally-by: Waiman Long
    Signed-off-by: Tejun Heo

    Waiman Long
     
  • This patch implements cgroup v2 thread support. The goal of the
    thread mode is supporting hierarchical accounting and control at
    thread granularity while staying inside the resource domain model
    which allows coordination across different resource controllers and
    handling of anonymous resource consumptions.

    A cgroup is always created as a domain and can be made threaded by
    writing to the "cgroup.type" file. When a cgroup becomes threaded, it
    becomes a member of a threaded subtree which is anchored at the
    closest ancestor which isn't threaded.

    The threads of the processes which are in a threaded subtree can be
    placed anywhere without being restricted by process granularity or
    no-internal-process constraint. Note that the threads aren't allowed
    to escape to a different threaded subtree. To be used inside a
    threaded subtree, a controller should explicitly support threaded mode
    and be able to handle internal competition in the way which is
    appropriate for the resource.

    The root of a threaded subtree, the nearest ancestor which isn't
    threaded, is called the threaded domain and serves as the resource
    domain for the whole subtree. This is the last cgroup where domain
    controllers are operational and where all the domain-level resource
    consumptions in the subtree are accounted. This allows threaded
    controllers to operate at thread granularity when requested while
    staying inside the scope of system-level resource distribution.
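
    Resolving a cgroup's threaded domain amounts to walking up until the
    first non-threaded ancestor. Upstream caches the result in
    cgroup->dom_cgrp; the walk below is an illustrative sketch with
    made-up Python attributes:

```python
# Sketch: the threaded domain of a cgroup is the nearest ancestor that
# is not itself threaded. Thread roots and normal cgroups are their own
# domain.

class Cgroup:
    def __init__(self, parent=None, threaded=False):
        self.parent = parent
        self.threaded = threaded

    @property
    def dom_cgrp(self):
        c = self
        while c.threaded:
            c = c.parent
        return c

root = Cgroup()                        # thread root (not threaded)
a = Cgroup(root, threaded=True)        # threaded subtree member
b = Cgroup(a, threaded=True)           # nested threaded member
assert a.dom_cgrp is root
assert b.dom_cgrp is root              # whole subtree shares one domain
assert root.dom_cgrp is root
```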

    As the root cgroup is exempt from the no-internal-process constraint,
    it can serve as both a threaded domain and a parent to normal cgroups,
    so, unlike non-root cgroups, the root cgroup can have both domain and
    threaded children.

    Internally, in a threaded subtree, each css_set has its ->dom_cset
    pointing to a matching css_set which belongs to the threaded domain.
    This ensures that the thread root level cgroup_subsys_state of each
    threaded controller is readily accessible for domain-level
    operations.

    This patch enables threaded mode for the pids and perf_events
    controllers. Neither has to worry about domain-level resource
    consumptions and it's enough to simply set the flag.

    For more details on the interface and behavior of the thread mode,
    please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
    by this patch.

    v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
    Spotted by Waiman.
    - Documentation updated as suggested by Waiman.
    - cgroup.type content slightly reformatted.
    - Mark the debug controller threaded.

    v4: - Updated to the general idea of marking specific cgroups
    domain/threaded as suggested by PeterZ.

    v3: - Dropped "join" and always make mixed children join the parent's
    threaded subtree.

    v2: - After discussions with Waiman, support for mixed thread mode is
    added. This should address the issue that Peter pointed out
    where any nesting should be avoided for thread subtrees while
    coexisting with other domain cgroups.
    - Enabling / disabling thread mode now piggy backs on the existing
    control mask update mechanism.
    - Bug fixes and cleanup.

    Signed-off-by: Tejun Heo
    Cc: Waiman Long
    Cc: Peter Zijlstra

    Tejun Heo
     
  • cgroup v2 is in the process of growing thread granularity support.
    Once thread mode is enabled, the root cgroup of the subtree serves as
    the dom_cgrp to which the processes of the subtree conceptually belong
    and domain-level resource consumptions not tied to any specific task
    are charged. In the subtree, threads won't be subject to process
    granularity or no-internal-task constraint and can be distributed
    arbitrarily across the subtree.

    This patch implements a new task iterator flag CSS_TASK_ITER_THREADED,
    which, when used on a dom_cgrp, makes the iteration include the tasks
    on all the associated threaded css_sets. "cgroup.procs" read path is
    updated to use it so that reading the file on a proc_cgrp lists all
    processes. This will also be used by controller implementations which
    need to walk processes or tasks at the resource domain level.

    Task iteration is implemented nested in css_set iteration. If
    CSS_TASK_ITER_THREADED is specified, after walking tasks of each
    !threaded css_set, all the associated threaded css_sets are visited
    before moving onto the next !threaded css_set.
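
    The iteration order can be sketched as a nested generator. The
    tuple-of-lists layout is a simplified stand-in for the kernel's
    css_set lists, not the actual data structures:

```python
# Illustrative generator for the THREADED iteration order: tasks of each
# non-threaded (domain) css_set are yielded first, then the tasks of all
# threaded css_sets associated with it, before moving on to the next
# domain css_set.

def iter_tasks_threaded(dom_csets):
    """dom_csets: list of (domain_tasks, [threaded_task_lists])."""
    for dom_tasks, threaded_csets in dom_csets:
        yield from dom_tasks
        for tasks in threaded_csets:
            yield from tasks

csets = [
    (["p1"], [["t1a", "t1b"]]),   # domain cset plus one threaded cset
    (["p2"], []),                 # domain cset with no threaded csets
]
assert list(iter_tasks_threaded(csets)) == ["p1", "t1a", "t1b", "p2"]
```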

    v2: ->cur_pcset renamed to ->cur_dcset. Updated for the new
    enable-threaded-per-cgroup behavior.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • cgroup v2 is in the process of growing thread granularity support. A
    threaded subtree is composed of a thread root and threaded cgroups
    which are proper members of the subtree.

    The root cgroup of the subtree serves as the domain cgroup to which
    the processes (as opposed to threads / tasks) of the subtree
    conceptually belong and domain-level resource consumptions not tied to
    any specific task are charged. Inside the subtree, threads won't be
    subject to process granularity or no-internal-task constraint and can
    be distributed arbitrarily across the subtree.

    This patch introduces cgroup->dom_cgrp along with threaded css_set
    handling.

    * cgroup->dom_cgrp points to self for normal cgroups and thread roots.
    For proper thread subtree members, it points to the dom_cgrp (the
    thread root).

    * css_set->dom_cset points to self for normal cgroups and thread roots.
    If threaded, it points to the css_set which belongs to the
    cgrp->dom_cgrp. The dom_cgrp serves as the resource domain and keeps
    the matching csses available. The dom_cset holds those csses and makes
    them easily accessible.

    * All threaded csets are linked on their dom_csets to enable iteration
    of all threaded tasks.

    * cgroup->nr_threaded_children keeps track of the number of threaded
    children.

    This patch adds the above but doesn't actually use them yet. The
    following patches will build on top.
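
    The pointer relationships can be modeled in a few lines. The
    attribute names follow the commit, but the classes are illustrative
    only:

```python
# Minimal model of the dom_cgrp / dom_cset linkage: both default to
# pointing at self, and threaded csets additionally link themselves on
# their dom_cset so all threaded tasks can be iterated from the domain.

class Cgroup:
    def __init__(self, dom_cgrp=None):
        self.dom_cgrp = dom_cgrp or self   # self for normal/thread roots
        self.nr_threaded_children = 0

class CssSet:
    def __init__(self, cgroup, dom_cset=None):
        self.cgroup = cgroup
        self.dom_cset = dom_cset or self   # self unless threaded
        self.threaded_csets = []           # threaded csets linked to us
        if dom_cset is not None:
            dom_cset.threaded_csets.append(self)

root = Cgroup()                   # thread root: dom_cgrp == self
member = Cgroup(dom_cgrp=root)    # threaded member points at the root
root_cset = CssSet(root)
member_cset = CssSet(member, dom_cset=root_cset)

assert member.dom_cgrp is root
assert member_cset.dom_cset is root_cset
assert member_cset in root_cset.threaded_csets
```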

    v4: ->nr_threaded_children added.

    v3: ->proc_cgrp/cset renamed to ->dom_cgrp/cset. Updated for the new
    enable-threaded-per-cgroup behavior.

    v2: Added cgroup_is_threaded() helper.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • css_task_iter currently always walks all tasks. With the scheduled
    cgroup v2 thread support, the iterator would need to handle multiple
    types of iteration. As a preparation, add @flags to
    css_task_iter_start() and implement CSS_TASK_ITER_PROCS. If the flag
    is not specified, it walks all tasks as before. When asserted, the
    iterator only walks the group leaders.
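
    The flag-gated filtering can be sketched as follows. The task dicts
    are hypothetical stand-ins; in the kernel a task is a thread-group
    leader when its pid equals its tgid:

```python
# Sketch of CSS_TASK_ITER_PROCS: without the flag every task is yielded;
# with the flag, non-leader threads (pid != tgid) are skipped.

CSS_TASK_ITER_PROCS = 1 << 0

def css_task_iter(tasks, flags=0):
    for task in tasks:
        if flags & CSS_TASK_ITER_PROCS and task["pid"] != task["tgid"]:
            continue   # skip non-leader threads
        yield task

tasks = [
    {"pid": 100, "tgid": 100},   # group leader
    {"pid": 101, "tgid": 100},   # thread of 100
    {"pid": 200, "tgid": 200},   # another leader
]
assert [t["pid"] for t in css_task_iter(tasks)] == [100, 101, 200]
assert [t["pid"]
        for t in css_task_iter(tasks, CSS_TASK_ITER_PROCS)] == [100, 200]
```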

    For now, the only user of the flag is cgroup v2 "cgroup.procs" file
    which no longer needs to skip non-leader tasks in cgroup_procs_next().
    Note that cgroup v1 "cgroup.procs" can't use the group leader walk as
    v1 "cgroup.procs" doesn't mean "list all thread group leaders in the
    cgroup" but "list all thread group id's with any threads in the
    cgroup".

    While at it, update cgroup_procs_show() to use task_pid_vnr() instead
    of task_tgid_vnr(). As the iteration guarantees that the function
    only sees group leaders, this doesn't change the output and will allow
    sharing the function for thread iteration.

    Signed-off-by: Tejun Heo

    Tejun Heo