06 Apr, 2019

2 commits

  • [ Upstream commit 51bee5abeab2058ea5813c5615d6197a23dbf041 ]

    The only user of cgroup_subsys->free() callback is pids_cgrp_subsys which
    needs pids_free() to uncharge the pid.

    However, ->free() is called from __put_task_struct()->cgroup_free() and this
    is too late. Even the trivial program which does

    for (;;) {
            int pid = fork();
            assert(pid >= 0);
            if (pid)
                    wait(NULL);
            else
                    exit(0);
    }

    can run out of limits because release_task()->call_rcu(delayed_put_task_struct)
    implies an RCU gp after the task/pid goes away and before the final put().

    Test-case:

    mkdir -p /tmp/CG
    mount -t cgroup2 none /tmp/CG
    echo '+pids' > /tmp/CG/cgroup.subtree_control

    mkdir /tmp/CG/PID
    echo 2 > /tmp/CG/PID/pids.max

    perl -e 'while ($p = fork) { wait; } $p // die "fork failed: $!\n"' &
    echo $! > /tmp/CG/PID/cgroup.procs

    Without this patch the forking process fails soon after migration.

    Rename cgroup_subsys->free() to cgroup_subsys->release() and move the callsite
    into the new helper, cgroup_release(), called by release_task() which actually
    frees the pid(s).
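
    The timing problem can be illustrated with a minimal userspace sketch.
    It assumes a toy counter model; struct toy_pids and the toy_*()
    functions are hypothetical names for illustration, not kernel API. The
    charge is taken at fork time and is dropped either when the task is
    released or only after a simulated grace period.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a pids-limited cgroup; every name here is hypothetical. */
struct toy_pids {
	int charged;	/* tasks currently counted against the limit */
	int pending;	/* exited tasks whose uncharge is still deferred */
	int max;	/* analogue of pids.max */
};

/* Charge one pid at fork time; fail if the limit would be exceeded. */
bool toy_fork(struct toy_pids *p)
{
	assert(p->max > 0);
	if (p->charged + 1 > p->max)
		return false;
	p->charged++;
	return true;
}

/*
 * The child has exited and been reaped. With uncharge_at_release the
 * charge is dropped right away (the ->release() placement); otherwise it
 * stays until toy_grace_period() runs (the old ->free() placement).
 */
void toy_reap(struct toy_pids *p, bool uncharge_at_release)
{
	if (uncharge_at_release)
		p->charged--;
	else
		p->pending++;
}

/* Simulated grace period: the deferred uncharges finally happen. */
void toy_grace_period(struct toy_pids *p)
{
	p->charged -= p->pending;
	p->pending = 0;
}

/*
 * Run fork/wait up to `iterations` times with no grace period in between
 * and return how many forks succeeded. The parent itself holds one charge.
 */
int toy_fork_loop(int max, int iterations, bool uncharge_at_release)
{
	struct toy_pids p = { .charged = 1, .pending = 0, .max = max };
	int ok = 0;

	for (int i = 0; i < iterations; i++) {
		if (!toy_fork(&p))
			break;
		ok++;
		toy_reap(&p, uncharge_at_release);
	}
	return ok;
}
```

    With a limit of 2, the deferred variant fails on the second fork
    because the first child's charge is still pending, while the
    release-time variant can loop indefinitely.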

    Reported-by: Herton R. Krzesinski
    Reported-by: Jan Stancek
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Tejun Heo
    Signed-off-by: Sasha Levin

    Oleg Nesterov
     
  • [ Upstream commit b4ff1b44bcd384d22fcbac6ebaf9cc0d33debe50 ]

    cgroup_rstat_cpu_pop_updated() is used to traverse the updated cgroups
    on flush. While it was only visiting updated ones in the subtree, it
    was visiting @root unconditionally. We can easily check whether @root
    is updated or not by looking at its ->updated_next just as with the
    cgroups in the subtree.

    * Remove the unnecessary cgroup_parent() test. The system root cgroup
    is never updated and thus its ->updated_next is always NULL. No
    need to test whether cgroup_parent() exists in addition to
    ->updated_next.

    * Terminate the traversal if ->updated_next is NULL. This can only
    happen for subtree @root and there's no reason to visit it if it's
    not marked updated.

    This reduces cpu consumption when reading a lot of rstat backed files.
    In a micro benchmark reading stat from ~1600 cgroups, the sys time was
    lowered by >40%.
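
    The idea of skipping an unmarked @root can be sketched in userspace as
    follows. This is a simplified model under stated assumptions: a boolean
    flag stands in for a non-NULL ->updated_next, the tree is walked
    recursively rather than through per-cpu lists, and struct toy_node and
    the toy_*() functions are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy cgroup node; `updated` stands in for a non-NULL ->updated_next. */
struct toy_node {
	struct toy_node *parent;
	struct toy_node *first_child;
	struct toy_node *next_sibling;
	bool updated;
};

/* Mark a node and all not-yet-marked ancestors as updated. */
void toy_mark_updated(struct toy_node *n)
{
	for (; n && !n->updated; n = n->parent)
		n->updated = true;
}

/*
 * Flush the subtree rooted at pos and return the number of nodes visited.
 * An unmarked node is not visited at all, and neither is anything below
 * it, because marking always propagates upwards first.
 */
int toy_flush(struct toy_node *pos)
{
	int visited = 0;

	assert(pos != NULL);
	if (!pos->updated)
		return 0;
	for (struct toy_node *c = pos->first_child; c; c = c->next_sibling)
		visited += toy_flush(c);
	pos->updated = false;
	return visited + 1;
}
```

    Flushing a subtree with no pending updates then costs a single flag
    check instead of a visit.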

    Signed-off-by: Tejun Heo
    Signed-off-by: Sasha Levin

    Tejun Heo
     

24 Mar, 2019

1 commit

  • commit 399504e21a10be16dd1408ba0147367d9d82a10c upstream.

    same story as with last May fixes in sysfs (7b745a4e4051
    "unfuck sysfs_mount()"); new_sb is left uninitialized
    in case of early errors in kernfs_mount_ns() and papering
    over it by treating any error from kernfs_mount_ns() as
    equivalent to !new_ns ends up conflating the cases when
    objects had never been transferred to a superblock with
    ones when that has happened and resulting new superblock
    had been dropped. Easily fixed (same way as in sysfs
    case). Additionally, there's a superblock leak on
    kernfs_node_dentry() failure *and* a dentry leak inside
    kernfs_node_dentry() itself - the latter on probably
    impossible errors, but the former not impossible to trigger
    (as a matter of fact, injecting allocation failures
    at that point *does* trigger it).

    Cc: stable@kernel.org
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     

13 Feb, 2019

1 commit

  • [ Upstream commit e250d91d65750a0c0c62483ac4f9f357e7317617 ]

    This fixes the case where all mount options specified are consumed by an
    LSM and all that's left is an empty string. In this case cgroupfs should
    accept the string and not fail.

    How to reproduce (with SELinux enabled):

    # umount /sys/fs/cgroup/unified
    # mount -o context=system_u:object_r:cgroup_t:s0 -t cgroup2 cgroup2 /sys/fs/cgroup/unified
    mount: /sys/fs/cgroup/unified: wrong fs type, bad option, bad superblock on cgroup2, missing codepage or helper program, or other error.
    # dmesg | tail -n 1
    [ 31.575952] cgroup: cgroup2: unknown option ""
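
    A minimal sketch of the intended behaviour, assuming a toy parser that
    knows a single option: struct toy_opts and toy_parse_options() are
    hypothetical names, and the real parser is not reproduced here. An
    empty or missing option string is accepted up front instead of being
    reported as an unknown option.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical option flags for this sketch. */
struct toy_opts {
	bool nsdelegate;
};

/*
 * Parse a comma-separated option string in place.
 * Returns 0 on success and -1 on an unknown option.
 */
int toy_parse_options(char *data, struct toy_opts *opts)
{
	char *token = data;

	assert(opts != NULL);
	opts->nsdelegate = false;

	/* Nothing left after the LSM took its options: accept it. */
	if (!data || *data == '\0')
		return 0;

	while (token) {
		char *comma = strchr(token, ',');

		if (comma)
			*comma = '\0';
		if (strcmp(token, "nsdelegate") == 0)
			opts->nsdelegate = true;
		else
			return -1;	/* unknown option */
		token = comma ? comma + 1 : NULL;
	}
	return 0;
}
```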

    Fixes: 67e9c74b8a87 ("cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type")
    [NOTE: should apply on top of commit 5136f6365ce3 ("cgroup: implement "nsdelegate" mount option"), older versions need manual rebase]
    Suggested-by: Stephen Smalley
    Signed-off-by: Ondrej Mosnacek
    Signed-off-by: Tejun Heo
    Signed-off-by: Sasha Levin

    Ondrej Mosnacek
     

10 Jan, 2019

1 commit

  • commit e9d81a1bc2c48ea9782e3e8b53875f419766ef47 upstream.

    CSS_TASK_ITER_PROCS implements process-only iteration by making
    css_task_iter_advance() skip tasks which aren't threadgroup leaders;
    however, when an iteration is started css_task_iter_start() calls the
    inner helper function css_task_iter_advance_css_set() instead of
    css_task_iter_advance(). As the helper doesn't have the skip logic,
    when the first task to visit is a non-leader thread, it doesn't get
    skipped correctly as shown in the following example.

    # ps -L 2030
    PID LWP TTY STAT TIME COMMAND
    2030 2030 pts/0 Sl+ 0:00 ./test-thread
    2030 2031 pts/0 Sl+ 0:00 ./test-thread
    # mkdir -p /sys/fs/cgroup/x/a/b
    # echo threaded > /sys/fs/cgroup/x/a/cgroup.type
    # echo threaded > /sys/fs/cgroup/x/a/b/cgroup.type
    # echo 2030 > /sys/fs/cgroup/x/a/cgroup.procs
    # cat /sys/fs/cgroup/x/a/cgroup.threads
    2030
    2031
    # cat /sys/fs/cgroup/x/cgroup.procs
    2030
    # echo 2030 > /sys/fs/cgroup/x/a/b/cgroup.threads
    # cat /sys/fs/cgroup/x/cgroup.procs
    2031
    2030

    The last read of cgroup.procs incorrectly shows the non-leader
    thread 2031.

    This can be fixed by updating css_task_iter_advance() to handle the
    first advance and css_task_iter_start() to call
    css_task_iter_advance() instead of the inner helper. After the fix,
    the same commands produce the following (correct) result:

    # ps -L 2062
    PID LWP TTY STAT TIME COMMAND
    2062 2062 pts/0 Sl+ 0:00 ./test-thread
    2062 2063 pts/0 Sl+ 0:00 ./test-thread
    # mkdir -p /sys/fs/cgroup/x/a/b
    # echo threaded > /sys/fs/cgroup/x/a/cgroup.type
    # echo threaded > /sys/fs/cgroup/x/a/b/cgroup.type
    # echo 2062 > /sys/fs/cgroup/x/a/cgroup.procs
    # cat /sys/fs/cgroup/x/a/cgroup.threads
    2062
    2063
    # cat /sys/fs/cgroup/x/cgroup.procs
    2062
    # echo 2062 > /sys/fs/cgroup/x/a/b/cgroup.threads
    # cat /sys/fs/cgroup/x/cgroup.procs
    2062
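
    The shape of the fix can be sketched in userspace, assuming a toy
    iterator over a flat task array. struct toy_task, struct toy_iter and
    the toy_iter_*() functions are hypothetical names. The point is that
    the start function goes through the same advance routine that carries
    the skip logic, so a non-leader in first position is skipped too.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_task {
	int tid;
	bool is_leader;
};

struct toy_iter {
	const struct toy_task *tasks;
	size_t count;
	size_t pos;		/* current index, == count when exhausted */
	bool procs_only;	/* analogue of CSS_TASK_ITER_PROCS */
	bool started;
};

/*
 * Move to the next task, skipping non-leaders in procs_only mode.
 * The first call does not step; it only validates the initial position.
 */
void toy_iter_advance(struct toy_iter *it)
{
	do {
		if (!it->started)
			it->started = true;
		else if (it->pos < it->count)
			it->pos++;
	} while (it->pos < it->count && it->procs_only &&
		 !it->tasks[it->pos].is_leader);
}

void toy_iter_start(struct toy_iter *it, const struct toy_task *tasks,
		    size_t count, bool procs_only)
{
	assert(tasks != NULL || count == 0);
	it->tasks = tasks;
	it->count = count;
	it->pos = 0;
	it->procs_only = procs_only;
	it->started = false;
	/* Use the full advance routine so the first task is filtered too. */
	toy_iter_advance(it);
}

/* Return the current task and step forward, or NULL when exhausted. */
const struct toy_task *toy_iter_next(struct toy_iter *it)
{
	const struct toy_task *cur;

	if (it->pos >= it->count)
		return NULL;
	cur = &it->tasks[it->pos];
	toy_iter_advance(it);
	return cur;
}
```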

    Signed-off-by: Tejun Heo
    Reported-by: "Michael Kerrisk (man-pages)"
    Fixes: 8cfd8147df67 ("cgroup: implement cgroup v2 thread support")
    Cc: stable@vger.kernel.org # v4.14+
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

05 Oct, 2018

1 commit

  • A cgroup which is already a threaded domain may be converted into a
    threaded cgroup if the prerequisite conditions are met. When this
    happens, all threaded descendants should also have their ->dom_cgrp
    updated to the new threaded domain cgroup. Unfortunately, this
    propagation was missing, leading to the following failure.

    # cd /sys/fs/cgroup/unified
    # cat cgroup.subtree_control # show that no controllers are enabled

    # mkdir -p mycgrp/a/b/c
    # echo threaded > mycgrp/a/b/cgroup.type

    At this point, the hierarchy looks as follows:

    mycgrp [d]
        a [dt]
            b [t]
                c [inv]

    Now let's make node "a" threaded (and thus "mycgrp" is made "domain threaded"):

    # echo threaded > mycgrp/a/cgroup.type

    By this point, we now have a hierarchy that looks as follows:

    mycgrp [dt]
        a [t]
            b [t]
                c [inv]

    But, when we try to convert the node "c" from "domain invalid" to
    "threaded", we get ENOTSUP on the write():

    # echo threaded > mycgrp/a/b/c/cgroup.type
    sh: echo: write error: Operation not supported

    This patch fixes the problem by

    * Moving the open-coded ->dom_cgrp save and restoration in
    cgroup_enable_threaded() into cgroup_{save|restore}_control() so
    that multiple cgroups can be handled.

    * Updating all threaded descendants' ->dom_cgrp to point to the new
    dom_cgrp when enabling threaded mode.
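
    The propagation step can be sketched in userspace as follows, assuming
    a toy tree in which a cgroup counts as threaded when its dom_cgrp
    points at some other cgroup. struct toy_cgroup and the toy_*()
    functions are hypothetical names, and the save/restore part of the
    patch is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_cgroup {
	struct toy_cgroup *parent;
	struct toy_cgroup *first_child;
	struct toy_cgroup *next_sibling;
	struct toy_cgroup *dom_cgrp;	/* points to itself unless threaded */
};

bool toy_is_threaded(const struct toy_cgroup *cg)
{
	return cg->dom_cgrp != cg;
}

/* Re-point every threaded descendant of cg at new_dom. */
void toy_update_threaded(struct toy_cgroup *cg, struct toy_cgroup *new_dom)
{
	for (struct toy_cgroup *c = cg->first_child; c; c = c->next_sibling) {
		if (toy_is_threaded(c))
			c->dom_cgrp = new_dom;
		toy_update_threaded(c, new_dom);
	}
}

/* Make cg threaded: it joins its parent's domain, and so do its
 * already-threaded descendants. */
void toy_enable_threaded(struct toy_cgroup *cg)
{
	struct toy_cgroup *new_dom;

	assert(cg->parent != NULL);
	new_dom = cg->parent->dom_cgrp;
	cg->dom_cgrp = new_dom;
	toy_update_threaded(cg, new_dom);
}
```

    Replaying the sequence from the message (b, then a, then c) leaves all
    three pointing at mycgrp; without the descendant update, b would still
    point at a.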

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: "Michael Kerrisk (man-pages)"
    Reported-by: Amin Jamali
    Reported-by: Joao De Almeida Pereira
    Link: https://lore.kernel.org/r/CAKgNAkhHYCMn74TCNiMJ=ccLd7DcmXSbvw3CbZ1YREeG7iJM5g@mail.gmail.com
    Fixes: 454000adaa2a ("cgroup: introduce cgroup->dom_cgrp and threaded css_set handling")
    Cc: stable@vger.kernel.org # v4.14+

    Tejun Heo
     

25 Aug, 2018

1 commit


21 Jul, 2018

1 commit

  • This change allows creating kernfs files and directories with arbitrary
    uid/gid instead of always using GLOBAL_ROOT_UID/GID by extending
    kernfs_create_dir_ns() and kernfs_create_file_ns() with uid/gid arguments.
    The "simple" kernfs_create_file() and kernfs_create_dir() are left alone
    and always create objects belonging to the global root.

    When creating symlinks ownership (uid/gid) is taken from the target kernfs
    object.

    Co-Developed-by: Tyler Hicks
    Signed-off-by: Dmitry Torokhov
    Signed-off-by: Tyler Hicks
    Signed-off-by: David S. Miller

    Dmitry Torokhov
     

12 Jul, 2018

1 commit

  • It is unwise to take spin locks from the handlers of trace events,
    mainly because doing so can introduce lockups: it adds locks in
    places that are normally not tested. Worse yet, because trace events
    are tucked away in the include/trace/events/ directory, locks that are
    taken there are forgotten about.

    As a general rule, I tell people never to take any locks in a trace
    event handler.

    Several cgroup trace event handlers call cgroup_path() which eventually
    takes the kernfs_rename_lock spinlock. This injects the spinlock in the
    code without people realizing it. It also can cause issues for the
    PREEMPT_RT patch, as the spinlock becomes a mutex, and the trace event
    handlers are called with preemption disabled.

    By moving the calculation of cgroup_path() out of the trace event
    handlers and into a macro (guarded by
    trace_cgroup_##type##_enabled()), we can place the cgroup path
    into a string and pass that to the trace event. Not only does this
    remove the taking of the spinlock from the trace event handler, but
    it also means that cgroup_path() only needs to be called once. It
    is currently called twice: once to get the length to reserve the
    buffer for, and once again to get the path itself.
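
    The pattern can be sketched in userspace under stated assumptions: a
    global flag stands in for trace_cgroup_##type##_enabled(), a counter
    records how often the path is computed, and every toy_* and TOY_* name
    is hypothetical. The path is built once, outside the handler, and only
    when the event is enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define TOY_PATH_MAX 64

struct toy_cgroup {
	const char *path;
};

int toy_path_calls;		/* how many times the path was computed */
bool toy_trace_enabled;		/* stand-in for the per-event enabled check */
char toy_last_event[TOY_PATH_MAX + 16];

/* Stand-in for cgroup_path(): the step that would take a lock. */
void toy_cgroup_path(const struct toy_cgroup *cg, char *buf, size_t len)
{
	assert(strlen(cg->path) < len);
	toy_path_calls++;
	snprintf(buf, len, "%s", cg->path);
}

/* The handler receives a ready-made string and takes no locks. */
void toy_trace_event(const char *type, const char *path)
{
	snprintf(toy_last_event, sizeof(toy_last_event), "%s:%s", type, path);
}

/* Compute the path once, and only if the event is enabled. */
#define TOY_TRACE_CGROUP_PATH(type, cg)					\
	do {								\
		if (toy_trace_enabled) {				\
			char path_buf[TOY_PATH_MAX];			\
			toy_cgroup_path((cg), path_buf,			\
					sizeof(path_buf));		\
			toy_trace_event(#type, path_buf);		\
		}							\
	} while (0)
```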

    Reported-by: Sebastian Andrzej Siewior
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Tejun Heo

    Steven Rostedt (VMware)
     

16 Jun, 2018

1 commit

  • As we move stuff around, some doc references are broken. Fix some of
    them via this script:
    ./scripts/documentation-file-ref-check --fix

    Manually checked if the produced result is valid, removing a few
    false-positives.

    Acked-by: Takashi Iwai
    Acked-by: Masami Hiramatsu
    Acked-by: Stephen Boyd
    Acked-by: Charles Keepax
    Acked-by: Mathieu Poirier
    Reviewed-by: Coly Li
    Signed-off-by: Mauro Carvalho Chehab
    Acked-by: Jonathan Corbet

    Mauro Carvalho Chehab
     

13 Jun, 2018

2 commits

  • The vmalloc() function has no 2-factor argument form, so multiplication
    factors need to be wrapped in array_size(). This patch replaces cases of:

    vmalloc(a * b)

    with:
    vmalloc(array_size(a, b))

    as well as handling cases of:

    vmalloc(a * b * c)

    with:

    vmalloc(array3_size(a, b, c))

    This does, however, attempt to ignore constant size factors like:

    vmalloc(4 * 1024)

    though any constants defined via macros get caught up in the conversion.

    Any factors with a sizeof() of "unsigned char", "char", and "u8" were
    dropped, since they're redundant.
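
    For readers unfamiliar with the helpers, here is a rough userspace
    sketch of what a saturating size calculation looks like. It assumes a
    compiler that provides __builtin_mul_overflow() (GCC 5+ or Clang); the
    toy_*() names are hypothetical and this is not the kernel's
    implementation. On overflow the result saturates to SIZE_MAX, which an
    allocator will refuse, instead of wrapping to a small value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* a * b, or SIZE_MAX if the product does not fit in size_t. */
size_t toy_array_size(size_t a, size_t b)
{
	size_t bytes;

	if (__builtin_mul_overflow(a, b, &bytes))
		return SIZE_MAX;
	return bytes;
}

/* a * b * c, or SIZE_MAX if any intermediate product overflows. */
size_t toy_array3_size(size_t a, size_t b, size_t c)
{
	size_t bytes;

	if (__builtin_mul_overflow(a, b, &bytes) ||
	    __builtin_mul_overflow(bytes, c, &bytes))
		return SIZE_MAX;
	return bytes;
}

/* Small self-check: in-range products are exact, overflow saturates. */
void toy_array_size_demo(void)
{
	assert(toy_array_size(4, 1024) == 4096);
	assert(toy_array_size(SIZE_MAX, 2) == SIZE_MAX);
	assert(toy_array3_size(2, 3, 4) == 24);
}
```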

    The Coccinelle script used for this was:

    // Fix redundant parens around sizeof().
    @@
    type TYPE;
    expression THING, E;
    @@

    (
    vmalloc(
    - (sizeof(TYPE)) * E
    + sizeof(TYPE) * E
    , ...)
    |
    vmalloc(
    - (sizeof(THING)) * E
    + sizeof(THING) * E
    , ...)
    )

    // Drop single-byte sizes and redundant parens.
    @@
    expression COUNT;
    typedef u8;
    typedef __u8;
    @@

    (
    vmalloc(
    - sizeof(u8) * (COUNT)
    + COUNT
    , ...)
    |
    vmalloc(
    - sizeof(__u8) * (COUNT)
    + COUNT
    , ...)
    |
    vmalloc(
    - sizeof(char) * (COUNT)
    + COUNT
    , ...)
    |
    vmalloc(
    - sizeof(unsigned char) * (COUNT)
    + COUNT
    , ...)
    |
    vmalloc(
    - sizeof(u8) * COUNT
    + COUNT
    , ...)
    |
    vmalloc(
    - sizeof(__u8) * COUNT
    + COUNT
    , ...)
    |
    vmalloc(
    - sizeof(char) * COUNT
    + COUNT
    , ...)
    |
    vmalloc(
    - sizeof(unsigned char) * COUNT
    + COUNT
    , ...)
    )

    // 2-factor product with sizeof(type/expression) and identifier or constant.
    @@
    type TYPE;
    expression THING;
    identifier COUNT_ID;
    constant COUNT_CONST;
    @@

    (
    vmalloc(
    - sizeof(TYPE) * (COUNT_ID)
    + array_size(COUNT_ID, sizeof(TYPE))
    , ...)
    |
    vmalloc(
    - sizeof(TYPE) * COUNT_ID
    + array_size(COUNT_ID, sizeof(TYPE))
    , ...)
    |
    vmalloc(
    - sizeof(TYPE) * (COUNT_CONST)
    + array_size(COUNT_CONST, sizeof(TYPE))
    , ...)
    |
    vmalloc(
    - sizeof(TYPE) * COUNT_CONST
    + array_size(COUNT_CONST, sizeof(TYPE))
    , ...)
    |
    vmalloc(
    - sizeof(THING) * (COUNT_ID)
    + array_size(COUNT_ID, sizeof(THING))
    , ...)
    |
    vmalloc(
    - sizeof(THING) * COUNT_ID
    + array_size(COUNT_ID, sizeof(THING))
    , ...)
    |
    vmalloc(
    - sizeof(THING) * (COUNT_CONST)
    + array_size(COUNT_CONST, sizeof(THING))
    , ...)
    |
    vmalloc(
    - sizeof(THING) * COUNT_CONST
    + array_size(COUNT_CONST, sizeof(THING))
    , ...)
    )

    // 2-factor product, only identifiers.
    @@
    identifier SIZE, COUNT;
    @@

    vmalloc(
    - SIZE * COUNT
    + array_size(COUNT, SIZE)
    , ...)

    // 3-factor product with 1 sizeof(type) or sizeof(expression), with
    // redundant parens removed.
    @@
    expression THING;
    identifier STRIDE, COUNT;
    type TYPE;
    @@

    (
    vmalloc(
    - sizeof(TYPE) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    vmalloc(
    - sizeof(TYPE) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    vmalloc(
    - sizeof(TYPE) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    vmalloc(
    - sizeof(TYPE) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    vmalloc(
    - sizeof(THING) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    vmalloc(
    - sizeof(THING) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    vmalloc(
    - sizeof(THING) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    vmalloc(
    - sizeof(THING) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    )

    // 3-factor product with 2 sizeof(variable), with redundant parens removed.
    @@
    expression THING1, THING2;
    identifier COUNT;
    type TYPE1, TYPE2;
    @@

    (
    vmalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    vmalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    vmalloc(
    - sizeof(THING1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    vmalloc(
    - sizeof(THING1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    vmalloc(
    - sizeof(TYPE1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    |
    vmalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    )

    // 3-factor product, only identifiers, with redundant parens removed.
    @@
    identifier STRIDE, SIZE, COUNT;
    @@

    (
    vmalloc(
    - (COUNT) * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    vmalloc(
    - COUNT * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    vmalloc(
    - COUNT * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    vmalloc(
    - (COUNT) * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    vmalloc(
    - COUNT * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    vmalloc(
    - (COUNT) * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    vmalloc(
    - (COUNT) * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    vmalloc(
    - COUNT * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    )

    // Any remaining multi-factor products, first at least 3-factor products
    // when they're not all constants...
    @@
    expression E1, E2, E3;
    constant C1, C2, C3;
    @@

    (
    vmalloc(C1 * C2 * C3, ...)
    |
    vmalloc(
    - E1 * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    )

    // And then all remaining 2 factors products when they're not all constants.
    @@
    expression E1, E2;
    constant C1, C2;
    @@

    (
    vmalloc(C1 * C2, ...)
    |
    vmalloc(
    - E1 * E2
    + array_size(E1, E2)
    , ...)
    )

    Signed-off-by: Kees Cook

    Kees Cook
     
  • The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
    patch replaces cases of:

    kmalloc(a * b, gfp)

    with:
    kmalloc_array(a, b, gfp)

    as well as handling cases of:

    kmalloc(a * b * c, gfp)

    with:

    kmalloc(array3_size(a, b, c), gfp)

    as it's slightly less ugly than:

    kmalloc_array(array_size(a, b), c, gfp)

    This does, however, attempt to ignore constant size factors like:

    kmalloc(4 * 1024, gfp)

    though any constants defined via macros get caught up in the conversion.

    Any factors with a sizeof() of "unsigned char", "char", and "u8" were
    dropped, since they're redundant.

    The tools/ directory was manually excluded, since it has its own
    implementation of kmalloc().
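
    The benefit of a 2-factor allocator can be sketched in userspace with
    malloc(). This is an illustration only: toy_malloc_array() is a
    hypothetical name and takes no gfp flags. Because the count and the
    element size arrive as separate arguments, the allocator can reject a
    product that would overflow before any multiplication happens.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Allocate room for n elements of `size` bytes each.
 * Returns NULL if n * size would overflow size_t.
 */
void *toy_malloc_array(size_t n, size_t size)
{
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;
	return malloc(n * size);
}

/* Small self-check: an overflowing request is refused. */
void toy_malloc_array_demo(void)
{
	int *ok = toy_malloc_array(16, sizeof(*ok));

	assert(toy_malloc_array(SIZE_MAX, 2) == NULL);
	free(ok);
}
```

    With the open-coded form, malloc(n * size) would instead receive the
    wrapped product and could return a buffer far smaller than the caller
    expects.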

    The Coccinelle script used for this was:

    // Fix redundant parens around sizeof().
    @@
    type TYPE;
    expression THING, E;
    @@

    (
    kmalloc(
    - (sizeof(TYPE)) * E
    + sizeof(TYPE) * E
    , ...)
    |
    kmalloc(
    - (sizeof(THING)) * E
    + sizeof(THING) * E
    , ...)
    )

    // Drop single-byte sizes and redundant parens.
    @@
    expression COUNT;
    typedef u8;
    typedef __u8;
    @@

    (
    kmalloc(
    - sizeof(u8) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(__u8) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(char) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(unsigned char) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(u8) * COUNT
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(__u8) * COUNT
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(char) * COUNT
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(unsigned char) * COUNT
    + COUNT
    , ...)
    )

    // 2-factor product with sizeof(type/expression) and identifier or constant.
    @@
    type TYPE;
    expression THING;
    identifier COUNT_ID;
    constant COUNT_CONST;
    @@

    (
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * (COUNT_ID)
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * COUNT_ID
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * (COUNT_CONST)
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * COUNT_CONST
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * (COUNT_ID)
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * COUNT_ID
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * (COUNT_CONST)
    + COUNT_CONST, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * COUNT_CONST
    + COUNT_CONST, sizeof(THING)
    , ...)
    )

    // 2-factor product, only identifiers.
    @@
    identifier SIZE, COUNT;
    @@

    - kmalloc
    + kmalloc_array
    (
    - SIZE * COUNT
    + COUNT, SIZE
    , ...)

    // 3-factor product with 1 sizeof(type) or sizeof(expression), with
    // redundant parens removed.
    @@
    expression THING;
    identifier STRIDE, COUNT;
    type TYPE;
    @@

    (
    kmalloc(
    - sizeof(TYPE) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    )

    // 3-factor product with 2 sizeof(variable), with redundant parens removed.
    @@
    expression THING1, THING2;
    identifier COUNT;
    type TYPE1, TYPE2;
    @@

    (
    kmalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kmalloc(
    - sizeof(THING1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kmalloc(
    - sizeof(THING1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    )

    // 3-factor product, only identifiers, with redundant parens removed.
    @@
    identifier STRIDE, SIZE, COUNT;
    @@

    (
    kmalloc(
    - (COUNT) * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - (COUNT) * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - (COUNT) * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - (COUNT) * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    )

    // Any remaining multi-factor products, first at least 3-factor products,
    // when they're not all constants...
    @@
    expression E1, E2, E3;
    constant C1, C2, C3;
    @@

    (
    kmalloc(C1 * C2 * C3, ...)
    |
    kmalloc(
    - (E1) * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kmalloc(
    - (E1) * (E2) * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kmalloc(
    - (E1) * (E2) * (E3)
    + array3_size(E1, E2, E3)
    , ...)
    |
    kmalloc(
    - E1 * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    )

    // And then all remaining 2 factors products when they're not all constants,
    // keeping sizeof() as the second factor argument.
    @@
    expression THING, E1, E2;
    type TYPE;
    constant C1, C2, C3;
    @@

    (
    kmalloc(sizeof(THING) * C2, ...)
    |
    kmalloc(sizeof(TYPE) * C2, ...)
    |
    kmalloc(C1 * C2 * C3, ...)
    |
    kmalloc(C1 * C2, ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * (E2)
    + E2, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * E2
    + E2, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * (E2)
    + E2, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * E2
    + E2, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - (E1) * E2
    + E1, E2
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - (E1) * (E2)
    + E1, E2
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - E1 * E2
    + E1, E2
    , ...)
    )

    Signed-off-by: Kees Cook

    Kees Cook
     

07 Jun, 2018

2 commits

  • Pull overflow updates from Kees Cook:
    "This adds the new overflow checking helpers and adds them to the
    2-factor argument allocators. And this adds the saturating size
    helpers and does a treewide replacement for the struct_size() usage.
    Additionally this adds the overflow testing modules to make sure
    everything works.

    I'm still working on the treewide replacements for allocators with
    "simple" multiplied arguments:

    *alloc(a * b, ...) -> *alloc_array(a, b, ...)

    and

    *zalloc(a * b, ...) -> *calloc(a, b, ...)

    as well as the more complex cases, but that's separable from this
    portion of the series. I expect to have the rest sent before -rc1
    closes; there are a lot of messy cases to clean up.

    Summary:

    - Introduce arithmetic overflow test helper functions (Rasmus)

    - Use overflow helpers in 2-factor allocators (Kees, Rasmus)

    - Introduce overflow test module (Rasmus, Kees)

    - Introduce saturating size helper functions (Matthew, Kees)

    - Treewide use of struct_size() for allocators (Kees)"

    * tag 'overflow-v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    treewide: Use struct_size() for devm_kmalloc() and friends
    treewide: Use struct_size() for vmalloc()-family
    treewide: Use struct_size() for kmalloc()-family
    device: Use overflow helpers for devm_kmalloc()
    mm: Use overflow helpers in kvmalloc()
    mm: Use overflow helpers in kmalloc_array*()
    test_overflow: Add memory allocation overflow tests
    overflow.h: Add allocation size calculation helpers
    test_overflow: Report test failures
    test_overflow: macrofy some more, do more tests for free
    lib: add runtime test of check_*_overflow functions
    compiler.h: enable builtin overflow checkers and add fallback code

    Linus Torvalds
     
  • One of the more common cases of allocation size calculations is finding
    the size of a structure that has a zero-sized array at the end, along
    with memory for some number of elements for that array. For example:

    struct foo {
    int stuff;
    void *entry[];
    };

    instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);

    Instead of leaving these open-coded and prone to type mistakes, we can
    now use the new struct_size() helper:

    instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);
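
    As a rough userspace illustration of what such a helper computes, the
    sketch below builds a saturating version from two small functions. It
    assumes a compiler with __builtin_mul_overflow() and
    __builtin_add_overflow() (GCC 5+ or Clang); TOY_STRUCT_SIZE() and the
    toy_*() names are hypothetical and this is not the kernel's macro.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct foo {
	int stuff;
	void *entry[];
};

/* a * b, saturating at SIZE_MAX. */
size_t toy_size_mul(size_t a, size_t b)
{
	size_t r;

	return __builtin_mul_overflow(a, b, &r) ? SIZE_MAX : r;
}

/* a + b, saturating at SIZE_MAX. */
size_t toy_size_add(size_t a, size_t b)
{
	size_t r;

	return __builtin_add_overflow(a, b, &r) ? SIZE_MAX : r;
}

/* Size of *p plus n trailing elements of p->member, saturating. */
#define TOY_STRUCT_SIZE(p, member, n) \
	toy_size_add(sizeof(*(p)), toy_size_mul((n), sizeof((p)->member[0])))

/* Allocate a struct foo with room for `count` entries, or NULL. */
struct foo *toy_foo_alloc(size_t count)
{
	struct foo *instance = NULL;
	size_t bytes = TOY_STRUCT_SIZE(instance, entry, count);

	if (bytes == SIZE_MAX)
		return NULL;	/* the size calculation overflowed */
	return malloc(bytes);
}

/* Small self-check of the size arithmetic. */
void toy_struct_size_demo(void)
{
	struct foo *instance = NULL;

	assert(TOY_STRUCT_SIZE(instance, entry, 3) ==
	       sizeof(struct foo) + 3 * sizeof(void *));
	assert(toy_foo_alloc(SIZE_MAX) == NULL);
}
```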

    This patch makes the changes for kmalloc()-family (and kvmalloc()-family)
    uses. It was done via automatic conversion with manual review for the
    "CHECKME" non-standard cases noted below, using the following Coccinelle
    script:

    // pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len *
    // sizeof *pkey_cache->table, GFP_KERNEL);
    @@
    identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
    expression GFP;
    identifier VAR, ELEMENT;
    expression COUNT;
    @@

    - alloc(sizeof(*VAR) + COUNT * sizeof(*VAR->ELEMENT), GFP)
    + alloc(struct_size(VAR, ELEMENT, COUNT), GFP)

    // mr = kzalloc(sizeof(*mr) + m * sizeof(mr->map[0]), GFP_KERNEL);
    @@
    identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
    expression GFP;
    identifier VAR, ELEMENT;
    expression COUNT;
    @@

    - alloc(sizeof(*VAR) + COUNT * sizeof(VAR->ELEMENT[0]), GFP)
    + alloc(struct_size(VAR, ELEMENT, COUNT), GFP)

    // Same pattern, but can't trivially locate the trailing element name,
    // or variable name.
    @@
    identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
    expression GFP;
    expression SOMETHING, COUNT, ELEMENT;
    @@

    - alloc(sizeof(SOMETHING) + COUNT * sizeof(ELEMENT), GFP)
    + alloc(CHECKME_struct_size(&SOMETHING, ELEMENT, COUNT), GFP)

    Signed-off-by: Kees Cook

    Kees Cook
     

06 Jun, 2018

1 commit

  • Pull cgroup updates from Tejun Heo:

    - For cpustat, cgroup has a percpu hierarchical stat mechanism which
    propagates up the hierarchy lazily.

    This contains commits to factor out and generalize the mechanism so
    that it can be used for other cgroup stats too.

    The original intention was to update memcg stats to use it but memcg
    went for a different approach, so still the only user is cpustat. The
    factoring out and generalization still make sense and it's likely
    that this can be used for other purposes in the future.

    - cgroup uses kernfs_notify() (which uses fsnotify()) to inform user
    space of certain events. A rate limiting mechanism is added.

    - Other misc changes.

    * 'for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: css_set_lock should nest inside tasklist_lock
    rdmacg: Convert to use match_string() helper
    cgroup: Make cgroup_rstat_updated() ready for root cgroup usage
    cgroup: Add memory barriers to plug cgroup_rstat_updated() race window
    cgroup: Add cgroup_subsys->css_rstat_flush()
    cgroup: Replace cgroup_rstat_mutex with a spinlock
    cgroup: Factor out and expose cgroup_rstat_*() interface functions
    cgroup: Reorganize kernel/cgroup/rstat.c
    cgroup: Distinguish base resource stat implementation from rstat
    cgroup: Rename stat to rstat
    cgroup: Rename kernel/cgroup/stat.c to kernel/cgroup/rstat.c
    cgroup: Limit event generation frequency
    cgroup: Explicitly remove core interface files

    Linus Torvalds
     

24 May, 2018

1 commit

  • cgroup_enable_task_cg_lists() incorrectly nests non-irq-safe
    tasklist_lock inside irq-safe css_set_lock triggering the following
    lockdep warning.

    WARNING: possible irq lock inversion dependency detected
    4.17.0-rc1-00027-gb37d049 #6 Not tainted
    --------------------------------------------------------
    systemd/1 just changed the state of lock:
    00000000fe57773b (css_set_lock){..-.}, at: cgroup_free+0xf2/0x12a
    but this lock took another, SOFTIRQ-unsafe lock in the past:
    (tasklist_lock){.+.+}

    and interrupts could create inverse lock ordering between them.

    other info that might help us debug this:
    Possible interrupt unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(tasklist_lock);
                                   local_irq_disable();
                                   lock(css_set_lock);
                                   lock(tasklist_lock);
      <Interrupt>
        lock(css_set_lock);

    *** DEADLOCK ***

    The condition is highly unlikely to actually happen especially given
    that the path is executed only once per boot.

    Signed-off-by: Tejun Heo
    Reported-by: Boqun Feng

    Tejun Heo
     

16 May, 2018

1 commit


08 May, 2018

1 commit


27 Apr, 2018

11 commits

  • cgroup_rstat_updated() ensures that the cgroup's rstat is linked to
    the parent. If there's no parent, it never gets linked and the
    function ends up grabbing and releasing the cgroup_rstat_lock each
    time for no reason which can be expensive.

    This hasn't been a problem till now because nobody was calling the
    function for the root cgroup but rstat is gonna be exposed to
    controllers and use cases, so let's get ready. Make
    cgroup_rstat_updated() a no-op for the root cgroup.
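
    The early return can be illustrated with a minimal userspace
    sketch. It is not the kernel code: struct node, node_init() and
    node_updated() are hypothetical names, there is a single "CPU", and
    the lock is only indicated by a comment. A node is linked onto its
    parent's updated list the first time it is updated, and the root
    returns before any locking would take place.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical single-CPU stand-in for a cgroup's rstat linkage. */
struct node {
    struct node *parent;           /* NULL for the root */
    struct node *updated_children; /* list head; points to self when empty */
    struct node *updated_next;     /* NULL while not on the parent's list */
};

static void node_init(struct node *n, struct node *parent)
{
    assert(n);
    n->parent = parent;
    n->updated_children = n;
    n->updated_next = NULL;
}

/* Returns how many nodes were newly linked; 0 means a fast path was taken. */
static int node_updated(struct node *n)
{
    int linked = 0;

    assert(n);

    /* The root has no parent to link to: return before taking any lock. */
    if (!n->parent)
        return 0;

    /* Already linked, which implies that all ancestors are linked too. */
    if (n->updated_next)
        return 0;

    /* The real code would acquire the rstat lock here. */
    for (struct node *p = n->parent; p; n = p, p = n->parent) {
        if (n->updated_next)
            break;
        n->updated_next = p->updated_children;
        p->updated_children = n;
        linked++;
    }
    /* ... and release it here. */

    return linked;
}
```

    Calling node_updated() on the root returns 0 without touching any
    list, which is the behavior this patch gives the root cgroup.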

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • cgroup_rstat_updated() has a small race window where an update
    signal can race with a flush and be lost until the next update.
    This wasn't a problem for the existing usages, but we plan to use
    rstat to track counters which need to be accurate.

    This patch plugs the race window by synchronizing
    cgroup_rstat_updated() and flush path with memory barriers around
    cgroup_rstat_cpu->updated_next pointer.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • This patch adds cgroup_subsys->css_rstat_flush(). If a subsystem has
    this callback, its csses are linked on cgrp->css_rstat_list and rstat
    will call the function whenever the associated cgroup is flushed.
    Flush is also performed when such csses are released so that residual
    counts aren't lost.

    Combined with the rstat API previous patches factored out, this allows
    controllers to plug into rstat to manage their statistics in a
    scalable way.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • Currently, rstat flush path is protected with a mutex which is fine as
    all the existing users are from interface file show path. However,
    rstat is being generalized for use by controllers and flushing from
    atomic contexts will be necessary.

    This patch replaces cgroup_rstat_mutex with a spinlock and adds an
    irq-safe flush function - cgroup_rstat_flush_irqsafe(). Explicit
    yield handling is added to the flush path so that other flush
    functions can yield to other threads and flushers.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • cgroup_rstat is being generalized so that controllers can use it too.
    This patch factors out and exposes the following interface functions.

    * cgroup_rstat_updated(): Renamed from cgroup_rstat_cpu_updated() for
    consistency.

    * cgroup_rstat_flush_hold/release(): Factored out from base stat
    implementation.

    * cgroup_rstat_flush(): Verbatim expose.

    While at it, drop assert on cgroup_rstat_mutex in
    cgroup_base_stat_flush() as it crosses layers and make a minor comment
    update.

    v2: Added EXPORT_SYMBOL_GPL(cgroup_rstat_updated) to fix a build bug.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • Currently, rstat.c has rstat and base stat implementations intermixed.
    Collect base stat implementation at the end of the file. Also,
    reorder the prototypes.

    This patch doesn't make any functional changes.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • Base resource stat accounts universal (not specific to any
    controller) resource consumptions on top of rstat. Currently, its
    implementation is intermixed with rstat implementation making the code
    confusing to follow.

    This patch clarifies the distinction by doing the following.

    * Encapsulate base resource stat counters, currently only cputime, in
    struct cgroup_base_stat.

    * Move prev_cputime into struct cgroup and initialize it with cgroup.

    * Rename the related functions so that they start with cgroup_base_stat.

    * Prefix the related variables and field names with b.

    This patch doesn't make any functional changes.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • stat is too generic a name and ends up causing subtle confusions.
    It'll be made generic so that controllers can plug into it, which will
    make the problem worse. Let's rename it to something more specific -
    cgroup_rstat for cgroup recursive stat.

    This patch does the following renames. No other changes.

    * cpu_stat -> rstat_cpu
    * stat -> rstat
    * ?cstat -> ?rstatc

    Note that the renames are selective. The unrenamed are the ones which
    implement basic resource statistics on top of rstat. This will be
    further cleaned up in the following patches.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • stat is too generic a name and ends up causing subtle confusions.
    It'll be made generic so that controllers can plug into it, which will
    make the problem worse. Let's rename it to something more specific -
    cgroup_rstat for cgroup recursive stat.

    First, rename kernel/cgroup/stat.c to kernel/cgroup/rstat.c. No
    content changes.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • ".events" files generate file modified event to notify userland of
    possible new events. Some of the events can be quite bursty
    (e.g. memory high event) and generating notification each time is
    costly and pointless.

    This patch implements an event rate limit mechanism. If a new
    notification is requested before 10ms has passed since the previous
    notification, the new notification is delayed till then.

    As this only delays from the second notification on in a given close
    cluster of notifications, userland reactions to notifications
    shouldn't be delayed at all in most cases while avoiding notification
    storms.
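
    The delay rule can be sketched in userspace as follows. This is an
    illustration under assumptions, not the kernel implementation:
    struct notify_state, notify_delay_ms() and notify_delivered() are
    hypothetical names, time is a plain monotonic millisecond counter
    supplied by the caller, and only the 10ms interval comes from the
    description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NOTIFY_MIN_INTERVAL_MS 10 /* interval named in the commit message */

/* Hypothetical per-file notification state. */
struct notify_state {
    bool     sent_once; /* false until the first notification is delivered */
    uint64_t last_ms;   /* time of the most recently delivered notification */
};

/*
 * Returns how long a notification requested at now_ms should be delayed.
 * 0 means "deliver immediately". now_ms is assumed to be monotonic.
 */
static uint64_t notify_delay_ms(const struct notify_state *st, uint64_t now_ms)
{
    uint64_t next_ms;

    assert(st);
    if (!st->sent_once)
        return 0;

    next_ms = st->last_ms + NOTIFY_MIN_INTERVAL_MS;
    return now_ms >= next_ms ? 0 : next_ms - now_ms;
}

/* Records that a notification was actually delivered at now_ms. */
static void notify_delivered(struct notify_state *st, uint64_t now_ms)
{
    assert(st);
    st->sent_once = true;
    st->last_ms = now_ms;
}
```

    The first request in a burst sees a delay of 0, so only the second
    and later requests inside the window are held back.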

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • The "cgroup." core interface files bypass the usual interface removal
    path and get removed recursively along with the cgroup itself. While
    this works now, the subtle discrepancy gets in the way of implementing
    common mechanisms.

    This patch updates cgroup core interface file handling so that it's
    consistent with controller interface files. When added, the css is
    marked CSS_VISIBLE and they're explicitly removed before the cgroup is
    destroyed.

    This doesn't cause user-visible behavior changes.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

04 Apr, 2018

1 commit

  • Pull workqueue updates from Tejun Heo:
    "rcu_work addition and a couple trivial changes"

    * 'for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: remove the comment about the old manager_arb mutex
    workqueue: fix the comments of nr_idle
    fs/aio: Use rcu_work instead of explicit rcu and work item
    cgroup: Use rcu_work instead of explicit rcu and work item
    RCU, workqueue: Implement rcu_work

    Linus Torvalds
     

20 Mar, 2018

1 commit


22 Feb, 2018

1 commit

  • A domain cgroup isn't allowed to be turned threaded if its subtree is
    populated or domain controllers are enabled. cgroup_enable_threaded()
    depended on cgroup_can_be_thread_root() test to enforce this rule. A
    parent which has populated domain descendants or has domain
    controllers enabled can't become a thread root, so the above rules are
    enforced automatically.

    However, for the root cgroup which can host mixed domain and threaded
    children, cgroup_can_be_thread_root() doesn't check any of those
    conditions and thus first level cgroups end up escaping those rules.

    This patch fixes the bug by adding explicit checks for those rules in
    cgroup_enable_threaded().

    Reported-by: Michael Kerrisk (man-pages)
    Signed-off-by: Tejun Heo
    Fixes: 8cfd8147df67 ("cgroup: implement cgroup v2 thread support")
    Cc: stable@vger.kernel.org # v4.14+

    Tejun Heo
     

07 Feb, 2018

1 commit

  • Make current_cpuset_is_being_rebound return bool due to this particular
    function only using either one or zero as its return value.

    No functional change.

    Link: http://lkml.kernel.org/r/1513266622-15860-4-git-send-email-baiyaowei@cmss.chinamobile.com
    Signed-off-by: Yaowei Bai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     

20 Jan, 2018

1 commit

  • e7fd37ba1217 ("cgroup: avoid copying strings longer than the buffers")
    converted possibly unsafe strncpy() usages in cgroup to strscpy().
    However, although the callsites are completely fine with truncated
    copies, because strscpy() is marked __must_check, it led to the
    following warnings.

    kernel/cgroup/cgroup.c: In function ‘cgroup_file_name’:
    kernel/cgroup/cgroup.c:1400:10: warning: ignoring return value of ‘strscpy’, declared with attribute warn_unused_result [-Wunused-result]
    strscpy(buf, cft->name, CGROUP_FILE_NAME_MAX);
    ^

    To avoid the warnings, 50034ed49645 ("cgroup: use strlcpy() instead of
    strscpy() to avoid spurious warning") switched them to strlcpy().

    strlcpy() is worse than strscpy() because it unconditionally runs
    strlen() on the source string, and the only reason we switched to
    strlcpy() here was because it was lacking __must_check, which doesn't
    reflect any material differences between the two functions. It's just
    that someone added __must_check to strscpy() and not to strlcpy().

    These basic string copy operations are used in a variety of ways, and
    one of not-so-uncommon use cases is safely handling truncated copies,
    where the caller naturally doesn't care about the return value. The
    __must_check doesn't match the actual use cases and forces users to
    opt for inferior variants which lack __must_check by happenstance or
    spread ugly (void) casts.

    Remove __must_check from strscpy() and restore strscpy() usages in
    cgroup.
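
    For illustration, a userspace sketch of a truncating copy in the
    spirit of strscpy() is shown below. copy_trunc() is a hypothetical
    name, not the kernel function, and it returns -1 on truncation
    where the kernel helper returns an errno value. It never reads the
    source beyond what fits in the destination, and a caller that is
    fine with truncation can simply ignore the return value.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copies src into dst, which has room for size bytes, and always
 * NUL-terminates when size > 0. Returns the number of characters copied
 * (excluding the NUL), or -1 if the source did not fit or size is 0.
 */
static ptrdiff_t copy_trunc(char *dst, const char *src, size_t size)
{
    size_t i;

    assert(dst && src);
    if (size == 0)
        return -1;

    /* Stop at the end of the source or one byte before the end of dst. */
    for (i = 0; i < size - 1 && src[i] != '\0'; i++)
        dst[i] = src[i];
    dst[i] = '\0';

    /* If the source has more characters left, the copy was truncated. */
    return src[i] == '\0' ? (ptrdiff_t)i : -1;
}
```

    Unlike strlcpy(), this loop does not run strlen() over the whole
    source, so an oversized source costs no more than the destination
    size.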

    Signed-off-by: Tejun Heo
    Suggested-by: Linus Torvalds
    Cc: Ma Shimiao
    Cc: Arnd Bergmann
    Cc: Chris Metcalf

    Tejun Heo
     

11 Jan, 2018

1 commit


20 Dec, 2017

1 commit

  • While teaching css_task_iter to handle skipping over tasks which
    aren't group leaders, bc2fb7ed089f ("cgroup: add @flags to
    css_task_iter_start() and implement CSS_TASK_ITER_PROCS") introduced a
    silly bug.

    CSS_TASK_ITER_PROCS is implemented by repeating
    css_task_iter_advance() while the advanced cursor is pointing to a
    non-leader thread. However, the cursor variable, @l, wasn't updated
    when the iteration has to advance to the next css_set and the
    following repetition would operate on the terminal @l from the
    previous iteration which isn't pointing to a valid task leading to
    oopses like the following or infinite looping.

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000254
    IP: __task_pid_nr_ns+0xc7/0xf0
    PGD 0 P4D 0
    Oops: 0000 [#1] SMP
    ...
    CPU: 2 PID: 1 Comm: systemd Not tainted 4.14.4-200.fc26.x86_64 #1
    Hardware name: System manufacturer System Product Name/PRIME B350M-A, BIOS 3203 11/09/2017
    task: ffff88c4baee8000 task.stack: ffff96d5c3158000
    RIP: 0010:__task_pid_nr_ns+0xc7/0xf0
    RSP: 0018:ffff96d5c315bd50 EFLAGS: 00010206
    RAX: 0000000000000000 RBX: ffff88c4b68c6000 RCX: 0000000000000250
    RDX: ffffffffa5e47960 RSI: 0000000000000000 RDI: ffff88c490f6ab00
    RBP: ffff96d5c315bd50 R08: 0000000000001000 R09: 0000000000000005
    R10: ffff88c4be006b80 R11: ffff88c42f1b8004 R12: ffff96d5c315bf18
    R13: ffff88c42d7dd200 R14: ffff88c490f6a510 R15: ffff88c4b68c6000
    FS: 00007f9446f8ea00(0000) GS:ffff88c4be680000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000254 CR3: 00000007f956f000 CR4: 00000000003406e0
    Call Trace:
    cgroup_procs_show+0x19/0x30
    cgroup_seqfile_show+0x4c/0xb0
    kernfs_seq_show+0x21/0x30
    seq_read+0x2ec/0x3f0
    kernfs_fop_read+0x134/0x180
    __vfs_read+0x37/0x160
    ? security_file_permission+0x9b/0xc0
    vfs_read+0x8e/0x130
    SyS_read+0x55/0xc0
    entry_SYSCALL_64_fastpath+0x1a/0xa5
    RIP: 0033:0x7f94455f942d
    RSP: 002b:00007ffe81ba2d00 EFLAGS: 00000293 ORIG_RAX: 0000000000000000
    RAX: ffffffffffffffda RBX: 00005574e2233f00 RCX: 00007f94455f942d
    RDX: 0000000000001000 RSI: 00005574e2321a90 RDI: 000000000000002b
    RBP: 0000000000000000 R08: 00005574e2321a90 R09: 00005574e231de60
    R10: 00007f94458c8b38 R11: 0000000000000293 R12: 00007f94458c8ae0
    R13: 00007ffe81ba3800 R14: 0000000000000000 R15: 00005574e2116560
    Code: 04 74 0e 89 f6 48 8d 04 76 48 8d 04 c5 f0 05 00 00 48 8b bf b8 05 00 00 48 01 c7 31 c0 48 8b 0f 48 85 c9 74 18 8b b2 30 08 00 00 71 04 77 0d 48 c1 e6 05 48 01 f1 48 3b 51 38 74 09 5d c3 8b
    RIP: __task_pid_nr_ns+0xc7/0xf0 RSP: ffff96d5c315bd50

    Fix it by moving the initialization of the cursor below the repeat
    label. While at it, rename it to @next for readability.
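
    The shape of the fix can be sketched in userspace as below. This
    is a simplified model with hypothetical names (struct demo_task,
    struct demo_set, struct demo_iter, iter_next_leader()), using arrays
    instead of the kernel's linked lists. The point it shows is that
    the local cursor is assigned after the repeat label, so every pass
    re-reads the iterator state, including after moving to the next
    set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_task { int pid; bool leader; };
struct demo_set  { const struct demo_task *tasks; size_t n; };

struct demo_iter {
    const struct demo_set *sets;
    size_t nsets;
    size_t set; /* index of the current set */
    size_t pos; /* index of the next task to examine in that set */
};

/* Returns the next group leader, or NULL when the iteration is over. */
static const struct demo_task *iter_next_leader(struct demo_iter *it)
{
    size_t next;

    assert(it);
repeat:
    if (it->set >= it->nsets)
        return NULL;

    /* Assigned below the label: never reuse a cursor from a previous pass. */
    next = it->pos;

    if (next >= it->sets[it->set].n) {
        /* Current set exhausted: move on and start over from its head. */
        it->set++;
        it->pos = 0;
        goto repeat;
    }

    it->pos = next + 1;

    /* Skip over tasks which aren't group leaders. */
    if (!it->sets[it->set].tasks[next].leader)
        goto repeat;

    return &it->sets[it->set].tasks[next];
}
```

    Had next been initialized once above the label, the pass following
    a set change would index the new set with a position taken from the
    old one.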

    Signed-off-by: Tejun Heo
    Fixes: bc2fb7ed089f ("cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS")
    Cc: stable@vger.kernel.org # v4.14+
    Reported-by: Laura Abbott
    Reported-by: Bronek Kozicki
    Reported-by: George Amanakis
    Signed-off-by: Tejun Heo

    Tejun Heo
     

19 Dec, 2017

1 commit

  • Deadlock during cgroup migration from cpu hotplug path when a task T is
    being moved from source to destination cgroup.

    kworker/0:0
    cpuset_hotplug_workfn()
    cpuset_hotplug_update_tasks()
    hotplug_update_tasks_legacy()
    remove_tasks_in_empty_cpuset()
    cgroup_transfer_tasks() // stuck in iterator loop
    cgroup_migrate()
    cgroup_migrate_add_task()

    In cgroup_migrate_add_task() it checks for PF_EXITING flag of task T.
    Task T will not migrate to destination cgroup. css_task_iter_start()
    will keep pointing to task T in loop waiting for task T cg_list node
    to be removed.

    Task T
    do_exit()
    exit_signals() // sets PF_EXITING
    exit_task_namespaces()
    switch_task_namespaces()
    free_nsproxy()
    put_mnt_ns()
    drop_collected_mounts()
    namespace_unlock()
    synchronize_rcu()
    _synchronize_rcu_expedited()
    schedule_work() // on cpu0 low priority worker pool
    wait_event() // waiting for work item to execute

    Task T inserted a work item in the worklist of cpu0 low priority
    worker pool. It is waiting for expedited grace period work item
    to execute. This work item will only be executed once kworker/0:0
    completes execution of cpuset_hotplug_workfn().

    kworker/0:0 ==> Task T ==>kworker/0:0

    In case of PF_EXITING task being migrated from source to destination
    cgroup, migrate next available task in source cgroup.
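
    A minimal userspace sketch of "skip exiting tasks and take the
    next one" follows. The names are hypothetical (struct src_task,
    TASK_EXITING as a stand-in for PF_EXITING, next_migratable()), and
    an array stands in for the task iterator.

```c
#include <assert.h>
#include <stddef.h>

#define TASK_EXITING 0x1u /* hypothetical stand-in for PF_EXITING */

struct src_task { int pid; unsigned int flags; };

/*
 * Returns the first task at or after *pos that is not exiting and
 * advances *pos past it. Returns NULL when no such task remains.
 */
static const struct src_task *next_migratable(const struct src_task *tasks,
                                              size_t n, size_t *pos)
{
    const struct src_task *t;

    assert(pos);
    assert(tasks || n == 0);

    do {
        t = (*pos < n) ? &tasks[(*pos)++] : NULL;
    } while (t && (t->flags & TASK_EXITING));

    return t;
}
```

    Because exiting tasks are stepped over rather than retried, the
    loop always makes progress and cannot spin on task T.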

    Signed-off-by: Prateek Sood
    Signed-off-by: Tejun Heo

    Prateek Sood
     

15 Dec, 2017

1 commit


12 Dec, 2017

1 commit


05 Dec, 2017

1 commit

  • This reverts commit aa24163b2ee5c92120e32e99b5a93143a0f4258e.

    This and the following commit led to another circular locking scenario
    and the scenario which is fixed by this commit no longer exists after
    e8b3f8db7aad ("workqueue/hotplug: simplify workqueue_offline_cpu()")
    which removes work item flushing from hotplug path.

    Revert it for now.

    Signed-off-by: Tejun Heo

    Tejun Heo