08 Jul, 2020

1 commit

  • When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
    copied, so the cgroup refcnt must be taken too. And, unlike the
    sk_alloc() path, sock_update_netprioidx() is not called here.
    Therefore, it is safe and necessary to grab the cgroup refcnt
    even when cgroup_sk_alloc is disabled.

    sk_clone_lock() is called in BH context anyway; the in_interrupt()
    check would make this function bail out if it were called from there.
    And for sk_alloc(), skcd->val is always zero, so it's safe to factor
    out the code to make it more readable.

    The global variable 'cgroup_sk_alloc_disabled' is used to determine
    whether to take these reference counts. It is impossible to make
    the reference counting correct unless we save this bit of information
    in skcd->val. So, add a new bit there to record whether the socket
    has already taken the reference counts. This obviously relies on
    kmalloc() aligning cgroup pointers to at least 4 bytes;
    ARCH_KMALLOC_MINALIGN is certainly larger than that.
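    The low-bit tagging scheme described above can be sketched in plain userspace C. All names below are hypothetical illustrations; the kernel stores the bit inside skcd->val with its own helpers:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical illustration of low-bit pointer tagging: because
 * kmalloc() aligns objects to at least 4 bytes, the two low bits of a
 * cgroup pointer are free to carry flags such as "refcnt taken". */
#define REFCNT_TAKEN 1UL

static uintptr_t tag_ptr(void *cgroup_ptr, int refcnt_taken)
{
    uintptr_t val = (uintptr_t)cgroup_ptr;

    assert((val & 3) == 0);          /* relies on >= 4-byte alignment */
    return refcnt_taken ? (val | REFCNT_TAKEN) : val;
}

static void *untag_ptr(uintptr_t val)
{
    return (void *)(val & ~3UL);     /* mask off the flag bits */
}

static int is_refcnt_taken(uintptr_t val)
{
    return val & REFCNT_TAKEN;
}
```

    The pointer round-trips unchanged because the flag lives only in bits the allocator guarantees are zero.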

    This bug seems to have been present from the beginning; commit
    d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets")
    tried to fix it but not completely. It seems it was not easy to trigger
    until the recent commit 090e28b229af
    ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups") was merged.

    Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
    Reported-by: Cameron Berkenpas
    Reported-by: Peter Geis
    Reported-by: Lu Fengqi
    Reported-by: Daniël Sonck
    Reported-by: Zhang Qiang
    Tested-by: Cameron Berkenpas
    Tested-by: Peter Geis
    Tested-by: Thomas Lamprecht
    Cc: Daniel Borkmann
    Cc: Zefan Li
    Cc: Tejun Heo
    Cc: Roman Gushchin
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     

13 Feb, 2020

3 commits

  • This adds support for creating a process in a different cgroup than its
    parent. Callers can limit and account processes and threads right from
    the moment they are spawned:
    - A service manager can directly spawn new services into dedicated
    cgroups.
    - A process can be directly created in a frozen cgroup and will be
    frozen as well.
    - The initial accounting jitter experienced by process supervisors and
    daemons is eliminated with this.
    - Threaded applications or even thread implementations can choose to
    create a specific cgroup layout where each thread is spawned
    directly into a dedicated cgroup.

    This feature is limited to the unified hierarchy. Callers need to pass
    a directory file descriptor for the target cgroup. The caller can
    choose to pass an O_PATH file descriptor. All usual migration
    restrictions apply, i.e. there can be no processes in inner nodes. In
    general, creating a process directly in a target cgroup adheres to all
    migration restrictions.

    One of the biggest advantages of this feature is that CLONE_INTO_CGROUP
    does not need to grab the write side of the global
    cgroup_threadgroup_rwsem. This lock makes moving tasks/threads around
    super expensive. With clone3() this lock is avoided.
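    A userspace caller would use this roughly as follows. This is a sketch, not the authoritative UAPI: the struct below is a simplified local copy of struct clone_args (see linux/sched.h), and the syscall-number fallback assumes x86_64:

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef CLONE_INTO_CGROUP
#define CLONE_INTO_CGROUP 0x200000000ULL
#endif
#ifndef SYS_clone3
#define SYS_clone3 435              /* x86_64; assumption for older libcs */
#endif

/* Simplified local copy of struct clone_args (see linux/sched.h). */
struct clone_args_sketch {
    uint64_t flags;
    uint64_t pidfd;
    uint64_t child_tid;
    uint64_t parent_tid;
    uint64_t exit_signal;
    uint64_t stack;
    uint64_t stack_size;
    uint64_t tls;
    uint64_t set_tid;
    uint64_t set_tid_size;
    uint64_t cgroup;                /* dirfd of the target cgroup */
};

/* Spawn a child directly into the cgroup referred to by cgroup_fd
 * (an O_PATH or O_RDONLY fd of a cgroup2 directory). Returns the pid
 * in the parent, 0 in the child, or -1 on error. */
static long clone_into_cgroup(int cgroup_fd)
{
    struct clone_args_sketch args;

    memset(&args, 0, sizeof(args));
    args.flags = CLONE_INTO_CGROUP;
    args.exit_signal = SIGCHLD;
    args.cgroup = (uint64_t)cgroup_fd;
    return syscall(SYS_clone3, &args, sizeof(args));
}
```

    Actually invoking this requires a 5.7+ kernel, a mounted cgroup2 hierarchy, and write access to the target cgroup, per the migration restrictions above.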

    Cc: Tejun Heo
    Cc: Ingo Molnar
    Cc: Oleg Nesterov
    Cc: Johannes Weiner
    Cc: Li Zefan
    Cc: Peter Zijlstra
    Cc: cgroups@vger.kernel.org
    Signed-off-by: Christian Brauner
    Signed-off-by: Tejun Heo

    Christian Brauner
     
    css_task_iter stores a pointer to the head of each iterable list; this
    dates back to commit 0f0a2b4fa621 ("cgroup: reorganize css_task_iter")
    when we did not store cur_cset. Let us utilize the list heads directly
    in cur_cset and streamline css_task_iter_advance_css_set a bit. There
    is no intended functional change.

    Signed-off-by: Michal Koutný
    Signed-off-by: Tejun Heo

    Michal Koutný
     
    PF_EXITING is set earlier than the actual removal from the css_set when
    a task is exiting. This can confuse cgroup.procs readers, who see no
    PF_EXITING tasks; however, rmdir checks against css_set membership, so
    it can transiently fail with EBUSY.

    Fix this by listing tasks that weren't unlinked from css_set active
    lists.
    It may happen that other users of the task iterator (without
    CSS_TASK_ITER_PROCS) spot a PF_EXITING task before cgroup_exit(). This
    is equal to the state before commit c03cd7738a83 ("cgroup: Include dying
    leaders with live threads in PROCS iterations") but it may be reviewed
    later.

    Reported-by: Suren Baghdasaryan
    Fixes: c03cd7738a83 ("cgroup: Include dying leaders with live threads in PROCS iterations")
    Signed-off-by: Michal Koutný
    Signed-off-by: Tejun Heo

    Michal Koutný
     

13 Nov, 2019

2 commits

  • cgroup ID is currently allocated using a dedicated per-hierarchy idr
    and used internally and exposed through tracepoints and bpf. This is
    confusing because there are tracepoints and other interfaces which use
    the cgroupfs ino as IDs.

    The preceding changes exposed kn->id as the inode number: a 64-bit ino
    on supported archs, or ino+gen (low 32 bits as ino, high 32 bits as
    gen) otherwise. There's no reason for cgroup to use different IDs. The
    kernfs IDs are unique and userland can easily discover them and map
    them back to paths using standard file operations.

    This patch replaces cgroup IDs with kernfs IDs.

    * cgroup_id() is added and all cgroup ID users are converted to use it.

    * kernfs_node creation is moved to earlier during cgroup init so that
    cgroup_id() is available during init.

    * While at it, s/cgroup/cgrp/ in psi helpers for consistency.

    * Fallback ID value is changed to 1 to be consistent with root cgroup
    ID.

    Signed-off-by: Tejun Heo
    Reviewed-by: Greg Kroah-Hartman
    Cc: Namhyung Kim

    Tejun Heo
     
  • kernfs_node->id is currently a union kernfs_node_id which represents
    either a 32bit (ino, gen) pair or u64 value. I can't see much value
    in the usage of the union - all that's needed is a 64bit ID which the
    current code is already limited to. Using a union makes the code
    unnecessarily complicated and prevents using 64bit ino without adding
    practical benefits.

    This patch drops union kernfs_node_id and makes kernfs_node->id a u64.
    ino is stored in the lower 32 bits and gen in the upper. Accessors -
    kernfs[_id]_ino() and kernfs[_id]_gen() - are added to retrieve the
    ino and gen. This makes ID handling less cumbersome and will
    allow using 64-bit inos on supported archs.
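    The packing amounts to simple shift-and-mask helpers. A userspace model (the real accessors live in include/linux/kernfs.h; the _model suffix marks these as illustrations):

```c
#include <stdint.h>

/* Model of the 64-bit kernfs ID described above: ino in the low
 * 32 bits, generation in the high 32 bits. */
static inline uint64_t make_kernfs_id(uint32_t ino, uint32_t gen)
{
    return ((uint64_t)gen << 32) | ino;
}

static inline uint32_t kernfs_id_ino_model(uint64_t id)
{
    return (uint32_t)id;            /* low 32 bits */
}

static inline uint32_t kernfs_id_gen_model(uint64_t id)
{
    return (uint32_t)(id >> 32);    /* high 32 bits */
}
```

    On archs with 64-bit inos the generation half can instead be folded into the ino itself, which is why the union bought nothing.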

    This patch doesn't make any functional changes.

    Signed-off-by: Tejun Heo
    Reviewed-by: Greg Kroah-Hartman
    Cc: Namhyung Kim
    Cc: Jens Axboe
    Cc: Alexei Starovoitov

    Tejun Heo
     

25 Oct, 2019

1 commit

    cgroup_enable_task_cg_lists() is used to lazily initialize task
    cgroup associations on the first use to reduce fork / exit overheads
    on systems which don't use cgroup. Unfortunately, locking around it
    has never been actually correct and its value is dubious given how the
    vast majority of systems use cgroup right away from boot.

    This patch removes the optimization. For now, replace the cg_list
    based branches with WARN_ON_ONCE()'s to be on the safe side. We can
    simplify the logic further in the future.

    Signed-off-by: Tejun Heo
    Reported-by: Oleg Nesterov
    Signed-off-by: Tejun Heo

    Tejun Heo
     

25 Jul, 2019

1 commit

    When the topology of root domains is modified by CPUset or CPUhotplug
    operations, information about the current deadline bandwidth held in
    the root domain is lost.

    This patch addresses the issue by recalculating the lost deadline
    bandwidth information by circling through the deadline tasks held in
    CPUsets and adding their current load to the root domain they are
    associated with.

    Tested-by: Dietmar Eggemann
    Signed-off-by: Mathieu Poirier
    Signed-off-by: Juri Lelli
    [ Various additional modifications. ]
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bristot@redhat.com
    Cc: claudio@evidence.eu.com
    Cc: lizefan@huawei.com
    Cc: longman@redhat.com
    Cc: luca.abeni@santannapisa.it
    Cc: rostedt@goodmis.org
    Cc: tj@kernel.org
    Cc: tommaso.cucinotta@santannapisa.it
    Link: https://lkml.kernel.org/r/20190719140000.31694-4-juri.lelli@redhat.com
    Signed-off-by: Ingo Molnar

    Mathieu Poirier
     

16 Jul, 2019

1 commit

  • Pull more block updates from Jens Axboe:
    "A later pull request with some followup items. I had some vacation
    leading up to the merge window, so certain items were delayed a
    bit. This pull request also contains fixes that came in within the
    last few days of the merge window, which I didn't want to push right
    before sending you a pull request.

    This contains:

    - NVMe pull request, mostly fixes, but also a few minor items on the
    feature side that were timing constrained (Christoph et al)

    - Report zones fixes (Damien)

    - Removal of dead code (Damien)

    - Turn on cgroup psi memstall (Josef)

    - block cgroup MAINTAINERS entry (Konstantin)

    - Flush init fix (Josef)

    - blk-throttle low iops timing fix (Konstantin)

    - nbd resize fixes (Mike)

    - nbd 0 blocksize crash fix (Xiubo)

    - block integrity error leak fix (Wenwen)

    - blk-cgroup writeback and priority inheritance fixes (Tejun)"

    * tag 'for-linus-20190715' of git://git.kernel.dk/linux-block: (42 commits)
    MAINTAINERS: add entry for block io cgroup
    null_blk: fixup ->report_zones() for !CONFIG_BLK_DEV_ZONED
    block: Limit zone array allocation size
    sd_zbc: Fix report zones buffer allocation
    block: Kill gfp_t argument of blkdev_report_zones()
    block: Allow mapping of vmalloc-ed buffers
    block/bio-integrity: fix a memory leak bug
    nvme: fix NULL deref for fabrics options
    nbd: add netlink reconfigure resize support
    nbd: fix crash when the blksize is zero
    block: Disable write plugging for zoned block devices
    block: Fix elevator name declaration
    block: Remove unused definitions
    nvme: fix regression upon hot device removal and insertion
    blk-throttle: fix zero wait time for iops throttled group
    block: Fix potential overflow in blk_report_zones()
    blkcg: implement REQ_CGROUP_PUNT
    blkcg, writeback: Implement wbc_blkcg_css()
    blkcg, writeback: Add wbc->no_cgroup_owner
    blkcg, writeback: Rename wbc_account_io() to wbc_account_cgroup_owner()
    ...

    Linus Torvalds
     

12 Jul, 2019

1 commit

  • Pull networking updates from David Miller:
    "Some highlights from this development cycle:

    1) Big refactoring of ipv6 route and neigh handling to support
    nexthop objects configurable as units from userspace. From David
    Ahern.

    2) Convert explored_states in BPF verifier into a hash table,
    significantly decreased state held for programs with bpf2bpf
    calls, from Alexei Starovoitov.

    3) Implement bpf_send_signal() helper, from Yonghong Song.

    4) Various classifier enhancements to mvpp2 driver, from Maxime
    Chevallier.

    5) Add aRFS support to hns3 driver, from Jian Shen.

    6) Fix use after free in inet frags by allocating fqdirs dynamically
    and reworking how rhashtable dismantle occurs, from Eric Dumazet.

    7) Add act_ctinfo packet classifier action, from Kevin
    Darbyshire-Bryant.

    8) Add TFO key backup infrastructure, from Jason Baron.

    9) Remove several old and unused ISDN drivers, from Arnd Bergmann.

    10) Add devlink notifications for flash update status to mlxsw driver,
    from Jiri Pirko.

    11) Lots of kTLS offload infrastructure fixes, from Jakub Kicinski.

    12) Add support for mv88e6250 DSA chips, from Rasmus Villemoes.

    13) Various enhancements to ipv6 flow label handling, from Eric
    Dumazet and Willem de Bruijn.

    14) Support TLS offload in nfp driver, from Jakub Kicinski, Dirk van
    der Merwe, and others.

    15) Various improvements to axienet driver including converting it to
    phylink, from Robert Hancock.

    16) Add PTP support to sja1105 DSA driver, from Vladimir Oltean.

    17) Add mqprio qdisc offload support to dpaa2-eth, from Ioana
    Radulescu.

    18) Add devlink health reporting to mlx5, from Moshe Shemesh.

    19) Convert stmmac over to phylink, from Jose Abreu.

    20) Add PTP PHC (Physical Hardware Clock) support to mlxsw, from
    Shalom Toledo.

    21) Add nftables SYNPROXY support, from Fernando Fernandez Mancera.

    22) Convert tcp_fastopen over to use SipHash, from Ard Biesheuvel.

    23) Track spill/fill of constants in BPF verifier, from Alexei
    Starovoitov.

    24) Support bounded loops in BPF, from Alexei Starovoitov.

    25) Various page_pool API fixes and improvements, from Jesper Dangaard
    Brouer.

    26) Just like ipv4, support ref-countless ipv6 route handling. From
    Wei Wang.

    27) Support VLAN offloading in aquantia driver, from Igor Russkikh.

    28) Add AF_XDP zero-copy support to mlx5, from Maxim Mikityanskiy.

    29) Add flower GRE encap/decap support to nfp driver, from Pieter
    Jansen van Vuuren.

    30) Protect against stack overflow when using act_mirred, from John
    Hurley.

    31) Allow devmap map lookups from eBPF, from Toke Høiland-Jørgensen.

    32) Use page_pool API in netsec driver, Ilias Apalodimas.

    33) Add Google gve network driver, from Catherine Sullivan.

    34) More indirect call avoidance, from Paolo Abeni.

    35) Add kTLS TX HW offload support to mlx5, from Tariq Toukan.

    36) Add XDP_REDIRECT support to bnxt_en, from Andy Gospodarek.

    37) Add MPLS manipulation actions to TC, from John Hurley.

    38) Add sending a packet to connection tracking from TC actions, and
    then allow flower classifier matching on conntrack state. From
    Paul Blakey.

    39) Netfilter hw offload support, from Pablo Neira Ayuso"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2080 commits)
    net/mlx5e: Return in default case statement in tx_post_resync_params
    mlx5: Return -EINVAL when WARN_ON_ONCE triggers in mlx5e_tls_resync().
    net: dsa: add support for BRIDGE_MROUTER attribute
    pkt_sched: Include const.h
    net: netsec: remove static declaration for netsec_set_tx_de()
    net: netsec: remove superfluous if statement
    netfilter: nf_tables: add hardware offload support
    net: flow_offload: rename tc_cls_flower_offload to flow_cls_offload
    net: flow_offload: add flow_block_cb_is_busy() and use it
    net: sched: remove tcf block API
    drivers: net: use flow block API
    net: sched: use flow block API
    net: flow_offload: add flow_block_cb_{priv, incref, decref}()
    net: flow_offload: add list handling functions
    net: flow_offload: add flow_block_cb_alloc() and flow_block_cb_free()
    net: flow_offload: rename TCF_BLOCK_BINDER_TYPE_* to FLOW_BLOCK_BINDER_TYPE_*
    net: flow_offload: rename TC_BLOCK_{UN}BIND to FLOW_BLOCK_{UN}BIND
    net: flow_offload: add flow_block_cb_setup_simple()
    net: hisilicon: Add an tx_desc to adapt HI13X1_GMAC
    net: hisilicon: Add an rx_desc to adapt HI13X1_GMAC
    ...

    Linus Torvalds
     

01 Jun, 2019

3 commits

  • cgroup already uses floating point for percent[ile] numbers and there
    are several controllers which want to take them as input. Add a
    generic parse helper to handle inputs.

    Update the interface convention documentation about the use of
    percentage numbers. While at it, also clarify the default time unit.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • CSS_TASK_ITER_PROCS currently iterates live group leaders; however,
    this means that a process with dying leader and live threads will be
    skipped. IOW, cgroup.procs might be empty while cgroup.threads isn't,
    which is confusing to say the least.

    Fix it by making cset track dying tasks and include dying leaders with
    live threads in PROCS iteration.

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Topi Miettinen
    Cc: Oleg Nesterov

    Tejun Heo
     
  • When a task is moved out of a cset, task iterators pointing to the
    task are advanced using the normal css_task_iter_advance() call. This
    is fine but we'll be tracking dying tasks on csets and thus moving
    tasks from cset->tasks to (to be added) cset->dying_tasks. When we
    remove a task from cset->tasks, if we advance the iterators, they may
    move over to the next cset before we had the chance to add the task
    back on the dying list, which can allow the task to escape iteration.

    This patch separates out skipping from advancing. Skipping only moves
    the affected iterators to the next pointer rather than fully advancing
    them; the subsequent advance will recognize that the cursor has
    already been moved forward and do the rest of the advancing. This ensures
    that when a task moves from one list to another in its cset, as long
    as it moves in the right direction, it's always visible to iteration.

    This doesn't cause any visible behavior changes.

    Signed-off-by: Tejun Heo
    Cc: Oleg Nesterov

    Tejun Heo
     

30 May, 2019

1 commit

  • A PF_EXITING task can stay associated with an offline css. If such
    task calls task_get_css(), it can get stuck indefinitely. This can be
    triggered by BSD process accounting which writes to a file with
    PF_EXITING set when racing against memcg disable as in the backtrace
    at the end.

    After this change, task_get_css() may return a css which was already
    offline when the function was called. None of the existing users are
    affected by this change.

    INFO: rcu_sched self-detected stall on CPU
    INFO: rcu_sched detected stalls on CPUs/tasks:
    ...
    NMI backtrace for cpu 0
    ...
    Call Trace:

    dump_stack+0x46/0x68
    nmi_cpu_backtrace.cold.2+0x13/0x57
    nmi_trigger_cpumask_backtrace+0xba/0xca
    rcu_dump_cpu_stacks+0x9e/0xce
    rcu_check_callbacks.cold.74+0x2af/0x433
    update_process_times+0x28/0x60
    tick_sched_timer+0x34/0x70
    __hrtimer_run_queues+0xee/0x250
    hrtimer_interrupt+0xf4/0x210
    smp_apic_timer_interrupt+0x56/0x110
    apic_timer_interrupt+0xf/0x20

    RIP: 0010:balance_dirty_pages_ratelimited+0x28f/0x3d0
    ...
    btrfs_file_write_iter+0x31b/0x563
    __vfs_write+0xfa/0x140
    __kernel_write+0x4f/0x100
    do_acct_process+0x495/0x580
    acct_process+0xb9/0xdb
    do_exit+0x748/0xa00
    do_group_exit+0x3a/0xa0
    get_signal+0x254/0x560
    do_signal+0x23/0x5c0
    exit_to_usermode_loop+0x5d/0xa0
    prepare_exit_to_usermode+0x53/0x80
    retint_user+0x8/0x8

    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org # v4.2+
    Fixes: ec438699a9ae ("cgroup, block: implement task_get_css() and use it in bio_associate_current()")

    Tejun Heo
     

29 May, 2019

1 commit

  • Currently the lifetime of bpf programs attached to a cgroup is bound
    to the lifetime of the cgroup itself. It means that if a user
    forgets (or intentionally avoids) to detach a bpf program before
    removing the cgroup, it will stay attached up to the release of the
    cgroup. Since the cgroup can stay in the dying state (the state
    between being rmdir()'ed and being released) for a very long time, it
    leads to a waste of memory. Also, it blocks a possibility to implement
    the memcg-based memory accounting for bpf objects, because a circular
    reference dependency will occur. Charged memory pages are pinning the
    corresponding memory cgroup, and if the memory cgroup is pinning
    the attached bpf program, nothing will be ever released.

    A dying cgroup can not contain any processes, so the only chance for
    an attached bpf program to be executed is a live socket associated
    with the cgroup. So in order to release all bpf data early, let's
    count associated sockets using a new percpu refcounter. On cgroup
    removal the counter is transitioned to the atomic mode, and as soon
    as it reaches 0, all bpf programs are detached.

    Because cgroup_bpf_release() can block, it can't be called from
    the percpu ref counter callback directly, so instead an asynchronous
    work is scheduled.

    The reference counter is not socket specific, and can be used for any
    other types of programs, which can be executed from a cgroup-bpf hook
    outside of the process context, had such a need arise in the future.

    Signed-off-by: Roman Gushchin
    Cc: jolsa@redhat.com
    Signed-off-by: Alexei Starovoitov

    Roman Gushchin
     

06 May, 2019

1 commit

  • A task should never enter the exit path with the task->frozen bit set.
    Any frozen task must enter the signal handling loop and the only
    way to escape is through cgroup_leave_frozen(true), which
    unconditionally drops the task->frozen bit. So it means that
    cgroup_freezer_frozen_exit() has zero chance of being called and
    has to be removed.

    Let's put a WARN_ON_ONCE() instead of the cgroup_freezer_frozen_exit()
    call to catch any potential leak of the task's frozen bit.

    Suggested-by: Oleg Nesterov
    Signed-off-by: Roman Gushchin
    Signed-off-by: Tejun Heo

    Roman Gushchin
     

20 Apr, 2019

1 commit

  • Cgroup v1 implements the freezer controller, which provides an ability
    to stop the workload in a cgroup and temporarily free up some
    resources (cpu, io, network bandwidth and, potentially, memory)
    for some other tasks. Cgroup v2 lacks this functionality.

    This patch implements freezer for cgroup v2.

    Cgroup v2 freezer tries to put tasks into a state similar to jobctl
    stop. This means that tasks can be killed, ptraced (using
    PTRACE_SEIZE, not PTRACE_ATTACH), and interrupted. It is possible to
    attach to a frozen task, get some information (e.g. read registers)
    and detach. It's also possible to migrate a frozen task to another cgroup.

    This distinguishes the cgroup v2 freezer from the cgroup v1 freezer,
    which mostly tried to imitate the system-wide freezer. However, while
    uninterruptible sleep is fine when all tasks are going to be frozen
    (the hibernation case), it's not an acceptable state when only some
    subset of the system is frozen.

    The cgroup v2 freezer does not support freezing kthreads.
    If a non-root cgroup contains a kthread, the cgroup can still be
    frozen, but the kthread will remain running, the cgroup will be shown
    as non-frozen, and the notification will not be delivered.

    * PTRACE_ATTACH does not work because non-fatal signal delivery
    is blocked in the frozen state.

    There are some interface differences between the cgroup v1 and cgroup
    v2 freezers too, which are required to conform to the cgroup v2
    interface design principles:
    1) There is no separate controller, which has to be turned on:
    the functionality is always available and is represented by
    cgroup.freeze and cgroup.events cgroup control files.
    2) The desired state is defined by the cgroup.freeze control file.
    Any hierarchical configuration is allowed.
    3) The interface is asynchronous. The actual state is available
    using cgroup.events control file ("frozen" field). There are no
    dedicated transitional states.
    4) It's allowed to make any changes with the cgroup hierarchy
    (create new cgroups, remove old cgroups, move tasks between cgroups)
    no matter if some cgroups are frozen.
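    Because the interface is asynchronous (point 3), a user writes to cgroup.freeze and then watches the "frozen" field of cgroup.events rather than assuming the transition is instant. A minimal parser for a cgroup.events buffer (a sketch; real users would also poll() the file for change notifications):

```c
#include <stdio.h>
#include <string.h>

/* Extract the "frozen" field from the flat keyed contents of a
 * cgroup.events file, e.g. "populated 1\nfrozen 0\n".
 * Returns 0 or 1, or -1 if the field is absent. */
static int parse_frozen(const char *events)
{
    const char *line = events;

    while (line && *line) {
        int val;

        if (sscanf(line, "frozen %d", &val) == 1)
            return val;
        line = strchr(line, '\n');  /* move to the next key/value line */
        if (line)
            line++;
    }
    return -1;
}
```

    On a pre-freezer kernel the "frozen" key simply isn't there, which is why the parser distinguishes "absent" from 0.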

    Signed-off-by: Roman Gushchin
    Signed-off-by: Tejun Heo
    No-objection-from-me-by: Oleg Nesterov
    Cc: kernel-team@fb.com

    Roman Gushchin
     

31 Jan, 2019

1 commit

  • The only user of cgroup_subsys->free() callback is pids_cgrp_subsys which
    needs pids_free() to uncharge the pid.

    However, ->free() is called from __put_task_struct()->cgroup_free() and this
    is too late. Even the trivial program which does

    for (;;) {
        int pid = fork();
        assert(pid >= 0);
        if (pid)
            wait(NULL);
        else
            exit(0);
    }

    can run out of limits because release_task()->call_rcu(delayed_put_task_struct)
    implies an RCU gp after the task/pid goes away and before the final put().

    Test-case:

    mkdir -p /tmp/CG
    mount -t cgroup2 none /tmp/CG
    echo '+pids' > /tmp/CG/cgroup.subtree_control

    mkdir /tmp/CG/PID
    echo 2 > /tmp/CG/PID/pids.max

    perl -e 'while ($p = fork) { wait; } $p // die "fork failed: $!\n"' &
    echo $! > /tmp/CG/PID/cgroup.procs

    Without this patch the forking process fails soon after migration.

    Rename cgroup_subsys->free() to cgroup_subsys->release() and move the callsite
    into the new helper, cgroup_release(), called by release_task() which actually
    frees the pid(s).

    Reported-by: Herton R. Krzesinski
    Reported-by: Jan Stancek
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Tejun Heo

    Oleg Nesterov
     

08 Dec, 2018

1 commit

  • The previous patch in this series removed carrying around a pointer to
    the css in blkg. However, the blkg association logic still relied on
    taking a reference on the css to ensure we wouldn't fail in getting a
    reference for the blkg.

    Here the implicit dependency on the css is removed. The association
    continues to rely on the tryget logic walking up the blkg tree. This
    streamlines the three ways that association can happen: normal, swap,
    and writeback.

    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Reviewed-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

03 Nov, 2018

1 commit

  • Pull block layer fixes from Jens Axboe:
    "The biggest part of this pull request is the revert of the blkcg
    cleanup series. It had one fix earlier for a stacked device issue, but
    another one was reported. Rather than play whack-a-mole with this,
    revert the entire series and try again for the next kernel release.

    Apart from that, only small fixes/changes.

    Summary:

    - Indentation fixup for mtip32xx (Colin Ian King)

    - The blkcg cleanup series revert (Dennis Zhou)

    - Two NVMe fixes. One fixing a regression in the nvme request
    initialization in this merge window, causing nvme-fc to not work.
    The other is a suspend/resume p2p resource issue (James, Keith)

    - Fix sg discard merge, allowing us to merge in cases where we didn't
    before (Jianchao Wang)

    - Call rq_qos_exit() after the queue is frozen, preventing a hang
    (Ming)

    - Fix brd queue setup, fixing an oops if we fail setting up all
    devices (Ming)"

    * tag 'for-linus-20181102' of git://git.kernel.dk/linux-block:
    nvme-pci: fix conflicting p2p resource adds
    nvme-fc: fix request private initialization
    blkcg: revert blkcg cleanups series
    block: brd: associate with queue until adding disk
    block: call rq_qos_exit() after queue is frozen
    mtip32xx: clean an indentation issue, remove extraneous tabs
    block: fix the DISCARD request merge

    Linus Torvalds
     

02 Nov, 2018

1 commit

  • This reverts a series committed earlier due to null pointer exception
    bug report in [1]. It seems there are edge case interactions that I did
    not consider and will need some time to understand what causes the
    adverse interactions.

    The original series can be found in [2] with a follow up series in [3].

    [1] https://www.spinics.net/lists/cgroups/msg20719.html
    [2] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/
    [3] https://lore.kernel.org/lkml/20181020185612.51587-1-dennis@kernel.org/

    This reverts the following commits:
    d459d853c2ed, b2c3fa546705, 101246ec02b5, b3b9f24f5fcc, e2b0989954ae,
    f0fcb3ec89f3, c839e7a03f92, bdc2491708c4, 74b7c02a9bc1, 5bf9a1f3b4ef,
    a7b39b4e961c, 07b05bcc3213, 49f4c2dc2b50, 27e6fa996c53

    Signed-off-by: Dennis Zhou
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

27 Oct, 2018

1 commit

  • On a system that executes multiple cgrouped jobs and independent
    workloads, we don't just care about the health of the overall system, but
    also that of individual jobs, so that we can ensure individual job health,
    fairness between jobs, or prioritize some jobs over others.

    This patch implements pressure stall tracking for cgroups. In kernels
    with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure, memory.pressure,
    and io.pressure files that track aggregate pressure stall times for only
    the tasks inside the cgroup.

    Link: http://lkml.kernel.org/r/20180828172258.3185-10-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Tejun Heo
    Acked-by: Peter Zijlstra (Intel)
    Tested-by: Daniel Drake
    Tested-by: Suren Baghdasaryan
    Cc: Christopher Lameter
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Mike Galbraith
    Cc: Peter Enderborg
    Cc: Randy Dunlap
    Cc: Shakeel Butt
    Cc: Vinayak Menon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

22 Sep, 2018

1 commit

  • The previous patch in this series removed carrying around a pointer to
    the css in blkg. However, the blkg association logic still relied on
    taking a reference on the css to ensure we wouldn't fail in getting a
    reference for the blkg.

    Here the implicit dependency on the css is removed. The association
    continues to rely on the tryget logic walking up the blkg tree. This
    streamlines the three ways that association can happen: normal, swap,
    and writeback.

    Acked-by: Tejun Heo
    Signed-off-by: Dennis Zhou
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)
     

13 Aug, 2018

1 commit

  • == Problem description ==

    It's useful to be able to identify cgroup associated with skb in TC so
    that a policy can be applied to this skb, and existing bpf_skb_cgroup_id
    helper can help with this.

    Though in real life, the cgroup hierarchy and the hierarchy a policy
    should apply to don't map 1:1.

    It's often the case that there is a container and corresponding cgroup,
    but there are many more sub-cgroups inside container, e.g. because it's
    delegated to containerized application to control resources for its
    subsystems, or to separate application inside container from infra that
    belongs to containerization system (e.g. sshd).

    At the same time it may be useful to apply a policy to container as a
    whole.

    If multiple containers like this are run on a host (what is often the
    case) and many of them have sub-cgroups, it may not be possible to apply
    per-container policy in TC with existing helpers such as
    bpf_skb_under_cgroup or bpf_skb_cgroup_id:

    * bpf_skb_cgroup_id will return id of immediate cgroup associated with
    skb, i.e. if it's a sub-cgroup inside container, it can't be used to
    identify container's cgroup;

    * bpf_skb_under_cgroup can work only with one cgroup and doesn't scale,
    i.e. if there are N containers on a host and a policy has to be
    applied to M of them (0 < M <= N), it'd require M calls to
    bpf_skb_under_cgroup, and, if M changes, it'd require rebuilding and
    loading a new BPF program.

    The patch introduces a new helper, bpf_skb_ancestor_cgroup_id, that
    returns the id of the cgroup v2 ancestor, at the specified level of
    the hierarchy, of the cgroup associated with the skb. Another option
    would be to use cgrp->ancestor_ids[level] and use it with idr_find()
    to get the struct cgroup for the ancestor, but that would require a
    radix lookup, which doesn't seem to be better (at least it's not
    obviously better).

    The format of the return value of the new helper is the same as that
    of bpf_skb_cgroup_id.
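    The ancestor lookup itself is just a walk up parent pointers until the requested level is reached. A userspace model of what the helper does (the struct here is a toy; the kernel operates on struct cgroup and its ->level/->ancestor_ids):

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model of a cgroup node, for illustrating the ancestor walk. */
struct toy_cgroup {
    uint64_t id;
    int level;                      /* root is level 0 */
    struct toy_cgroup *parent;
};

/* Return the id of the ancestor at 'level', or 0 if the cgroup is
 * itself above that level (loosely mirrors the helper's contract). */
static uint64_t ancestor_cgroup_id(struct toy_cgroup *cgrp, int level)
{
    while (cgrp && cgrp->level > level)
        cgrp = cgrp->parent;        /* walk up toward the root */
    if (!cgrp || cgrp->level != level)
        return 0;
    return cgrp->id;
}
```

    With containers all placed at level 1, an skb from any sub-cgroup resolves to its container's id with a single call, which is the per-container policy case the commit describes.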

    Signed-off-by: Andrey Ignatov
    Signed-off-by: Daniel Borkmann

    Andrey Ignatov
     

27 Apr, 2018

2 commits

  • Currently, rstat flush path is protected with a mutex which is fine as
    all the existing users are from interface file show path. However,
    rstat is being generalized for use by controllers and flushing from
    atomic contexts will be necessary.

    This patch replaces cgroup_rstat_mutex with a spinlock and adds an
    irq-safe flush function - cgroup_rstat_flush_irqsafe(). Explicit
    yield handling is added to the flush path so that other flush
    functions can yield to other threads and flushers.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • cgroup_rstat is being generalized so that controllers can use it too.
    This patch factors out and exposes the following interface functions.

    * cgroup_rstat_updated(): Renamed from cgroup_rstat_cpu_updated() for
    consistency.

    * cgroup_rstat_flush_hold/release(): Factored out from base stat
    implementation.

    * cgroup_rstat_flush(): Verbatim expose.

    While at it, drop assert on cgroup_rstat_mutex in
    cgroup_base_stat_flush() as it crosses layers and make a minor comment
    update.

    v2: Added EXPORT_SYMBOL_GPL(cgroup_rstat_updated) to fix a build bug.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

16 Nov, 2017

1 commit

  • Pull cgroup updates from Tejun Heo:
    "Cgroup2 cpu controller support is finally merged.

    - Basic cpu statistics support to allow monitoring by default without
    the CPU controller enabled.

    - cgroup2 cpu controller support.

    - /sys/kernel/cgroup files to help dealing with new / optional
    features"

    * 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: export list of cgroups v2 features using sysfs
    cgroup: export list of delegatable control files using sysfs
    cgroup: mark @cgrp __maybe_unused in cpu_stat_show()
    MAINTAINERS: relocate cpuset.c
    cgroup, sched: Move basic cpu stats from cgroup.stat to cpu.stat
    sched: Implement interface for cgroup unified hierarchy
    sched: Misc preps for cgroup unified hierarchy interface
    sched/cputime: Add dummy cputime_adjust() implementation for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
    cgroup: statically initialize init_css_set->dfl_cgrp
    cgroup: Implement cgroup2 basic CPU usage accounting
    cpuacct: Introduce cgroup_account_cputime[_field]()
    sched/cputime: Expose cputime_adjust()

    Linus Torvalds
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    <5 lines).
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

27 Oct, 2017

1 commit

  • The basic cpu stat is currently shown with "cpu." prefix in
    cgroup.stat, and the same information is duplicated in cpu.stat when
    cpu controller is enabled. This is ugly and not very scalable as we
    want to expand the coverage of stat information which is always
    available.

    This patch makes cgroup core always create "cpu.stat" file and show
    the basic cpu stat there and calls the cpu controller to show the
    extra stats when enabled. This ensures that the same information
    isn't presented in multiple places and makes future expansion of basic
    stats easier.

    Signed-off-by: Tejun Heo
    Acked-by: Peter Zijlstra (Intel)

    Tejun Heo
     

25 Sep, 2017

2 commits

  • In cgroup1, while cpuacct isn't actually controlling any resources, it
    is a separate controller due to a combination of two factors -
    1. enabling the cpu controller has significant side effects, and 2. we
    have to pick one of the hierarchies to account CPU usages on. cpuacct
    controller is effectively used to designate a hierarchy to track CPU
    usages on.

    cgroup2's unified hierarchy removes the second reason and we can
    account basic CPU usages by default. While we can use cpuacct for
    this purpose, both its interface and implementation leave a lot to be
    desired - it collects and exposes two sources of truth which don't
    agree with each other and some of the exposed statistics don't make
    much sense. Also, it propagates all the way up the hierarchy on each
    accounting event which is unnecessary.

    This patch adds basic resource accounting mechanism to cgroup2's
    unified hierarchy and accounts CPU usages using it.

    * All accountings are done per-cpu and don't propagate immediately.
    It just bumps the per-cgroup per-cpu counters and links to the
    parent's updated list if not already on it.

    * On a read, the per-cpu counters are collected into the global ones
    and then propagated upwards. Only the per-cpu counters which have
    changed since the last read are propagated.

    * CPU usage stats are collected and shown in "cgroup.stat" with "cpu."
    prefix. Total usage is collected from scheduling events. User/sys
    breakdown is sourced from tick sampling and adjusted to the usage
    using cputime_adjust().

    This keeps the accounting side hot path O(1) and per-cpu and the read
    side O(nr_updated_since_last_read).
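A minimal model of the updated-list scheme (Python for illustration; `Cgroup`, `account` and `flush` are hypothetical names): the write side only bumps a per-cpu counter and links the cgroup on each ancestor's per-cpu updated list, stopping at the first already-linked level, while the read side visits only the subtrees that actually saw updates.

```python
NR_CPUS = 2

class Cgroup:
    def __init__(self, parent=None):
        self.parent = parent
        self.percpu = [0] * NR_CPUS   # O(1) hot-path counters
        self.total = 0                # flushed hierarchical total
        # Per-cpu sets of children with unflushed updates.
        self.updated = [set() for _ in range(NR_CPUS)]

def account(cg, cpu, delta):
    """Hot path: bump the per-cpu counter and link cg on each
    ancestor's per-cpu updated list, stopping once already linked."""
    cg.percpu[cpu] += delta
    child, parent = cg, cg.parent
    while parent is not None and child not in parent.updated[cpu]:
        parent.updated[cpu].add(child)
        child, parent = parent, parent.parent

def flush(cg):
    """Read side, O(nr_updated): fold per-cpu deltas and recurse only
    into children that saw updates since the last flush."""
    delta = 0
    for cpu in range(NR_CPUS):
        delta += cg.percpu[cpu]
        cg.percpu[cpu] = 0
        for child in cg.updated[cpu]:
            delta += flush(child)
        cg.updated[cpu].clear()
    cg.total += delta
    return delta

root = Cgroup()
container = Cgroup(root)
app = Cgroup(container)
account(app, 0, 5)        # e.g. 5 units of CPU time on cpu 0
account(container, 1, 3)
flush(root)
assert (root.total, container.total, app.total) == (8, 8, 5)
```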

    v2: Minor changes and documentation updates as suggested by Waiman and
    Roman.

    Signed-off-by: Tejun Heo
    Acked-by: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Li Zefan
    Cc: Johannes Weiner
    Cc: Waiman Long
    Cc: Roman Gushchin

    Tejun Heo
     
  • Introduce cgroup_account_cputime[_field]() which wrap cpuacct_charge()
    and cgroup_account_field(). This doesn't introduce any functional
    changes and will be used to add cgroup basic resource accounting.

    Signed-off-by: Tejun Heo
    Acked-by: Peter Zijlstra
    Cc: Ingo Molnar

    Tejun Heo
     

08 Sep, 2017

1 commit

  • Pull block layer updates from Jens Axboe:
    "This is the first pull request for 4.14, containing most of the code
    changes. It's a quiet series this round, which I think we needed after
    the churn of the last few series. This contains:

    - Fix for a registration race in loop, from Anton Volkov.

    - Overflow complaint fix from Arnd for DAC960.

    - Series of drbd changes from the usual suspects.

    - Conversion of the stec/skd driver to blk-mq. From Bart.

    - A few BFQ improvements/fixes from Paolo.

    - CFQ improvement from Ritesh, allowing idling for group idle.

    - A few fixes found by Dan's smatch, courtesy of Dan.

    - A warning fixup for a race between changing the IO scheduler and
    device removal. From David Jeffery.

    - A few nbd fixes from Josef.

    - Support for cgroup info in blktrace, from Shaohua.

    - Also from Shaohua, new features in the null_blk driver to allow it
    to actually hold data, among other things.

    - Various corner cases and error handling fixes from Weiping Zhang.

    - Improvements to the IO stats tracking for blk-mq from me. Can
    drastically improve performance for fast devices and/or big
    machines.

    - Series from Christoph removing bi_bdev as being needed for IO
    submission, in preparation for nvme multipathing code.

    - Series from Bart, including various cleanups and fixes for switch
    fall through case complaints"

    * 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
    kernfs: checking for IS_ERR() instead of NULL
    drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
    drbd: Fix allyesconfig build, fix recent commit
    drbd: switch from kmalloc() to kmalloc_array()
    drbd: abort drbd_start_resync if there is no connection
    drbd: move global variables to drbd namespace and make some static
    drbd: rename "usermode_helper" to "drbd_usermode_helper"
    drbd: fix race between handshake and admin disconnect/down
    drbd: fix potential deadlock when trying to detach during handshake
    drbd: A single dot should be put into a sequence.
    drbd: fix rmmod cleanup, remove _all_ debugfs entries
    drbd: Use setup_timer() instead of init_timer() to simplify the code.
    drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
    drbd: new disk-option disable-write-same
    drbd: Fix resource role for newly created resources in events2
    drbd: mark symbols static where possible
    drbd: Send P_NEG_ACK upon write error in protocol != C
    drbd: add explicit plugging when submitting batches
    drbd: change list_for_each_safe to while(list_first_entry_or_null)
    drbd: introduce drbd_recv_header_maybe_unplug
    ...

    Linus Torvalds
     

11 Aug, 2017

1 commit

  • Misc trivial changes to prepare for future changes. No functional
    difference.

    * Expose cgroup_get(), cgroup_tryget() and cgroup_parent().

    * Implement task_dfl_cgroup() which dereferences css_set->dfl_cgrp.

    * Rename cgroup_stats_show() to cgroup_stat_show() for consistency
    with the file name.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

29 Jul, 2017

3 commits

  • By default we output cgroup id in blktrace. This adds an option to
    display the cgroup path. Since getting the cgroup path is a
    relatively heavy operation, we don't enable it by default.

    With the option enabled, blktrace will output something like this:
    dd-1353 [007] d..2 293.015252: 8,0 /test/level D R 24 + 8 [dd]

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
    Add an API to export cgroup fhandle info. We don't export a full
    'struct file_handle', as it contains unneeded info. Specifically,
    cgroup is always a directory, so we don't need a
    'FILEID_INO32_GEN_PARENT' type fhandle; we only need to export the
    inode number and generation number, just like generic_fh_to_dentry
    does. And we can avoid the overhead of getting an inode too, since
    kernfs_node_id (ino and generation) has all the info required.
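The ino + generation scheme can be sketched as follows (Python for illustration; `Handle`, `encode_fh` and `fh_to_node` are hypothetical names). The useful property is that a generation mismatch detects a recycled inode number, the same check generic_fh_to_dentry's decode path performs.

```python
from collections import namedtuple

# Model of kernfs_node_id: inode number + generation is all that's
# needed to identify a cgroup directory for export.
Handle = namedtuple("Handle", ["ino", "gen"])

class Node:
    def __init__(self, ino, gen):
        self.ino, self.gen = ino, gen

nodes = {}  # ino -> live node, stand-in for the kernfs inode lookup

def encode_fh(node):
    """Export only ino + generation - no full struct file_handle."""
    return Handle(node.ino, node.gen)

def fh_to_node(handle):
    """Resolve a handle; a generation mismatch means the inode number
    was reused after the original node went away (stale handle)."""
    node = nodes.get(handle.ino)
    if node is None or node.gen != handle.gen:
        return None
    return node

cg = Node(ino=42, gen=1)
nodes[42] = cg
fh = encode_fh(cg)
assert fh_to_node(fh) is cg

# Delete the cgroup and reuse the inode number: old handle goes stale.
nodes[42] = Node(ino=42, gen=2)
assert fh_to_node(fh) is None
```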

    Acked-by: Tejun Heo
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
    Inode number and generation can identify a kernfs node. We are
    going to export this identification via exportfs operations, so put
    ino and generation into a separate structure. This is convenient
    for later patches that use the identification.

    Acked-by: Greg Kroah-Hartman
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li