01 Oct, 2020

2 commits

  • Do not report failure on zero-sized writes; handle them as a no-op.

    There are issues, for example, with writev() when the iovec contains a
    zero-length buffer as its first element. In the example below, writev()
    is expected to successfully perform the write to the specified writable
    cgroup file (one expecting an integer value) and to return 2. Currently
    it returns -1 and skips the write:

    int writetest(int fd) {
            const char *buf1 = "";
            const char *buf2 = "1\n";
            struct iovec iov[2] = {
                    { .iov_base = (void*)buf1, .iov_len = 0 },
                    { .iov_base = (void*)buf2, .iov_len = 2 }
            };
            return writev(fd, iov, 2);
    }

    This patch fixes the issue by checking whether there is anything to
    write, and handling a zero-sized write as a no-op by returning 0.
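
    For reference, a self-contained, runnable userspace completion of the
    snippet above; the cgroup path is only an example and must point to an
    existing, writable cgroup file that accepts an integer value:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/uio.h>

    static int writetest(int fd)
    {
            const char *buf1 = "";
            const char *buf2 = "1\n";
            struct iovec iov[2] = {
                    { .iov_base = (void*)buf1, .iov_len = 0 },
                    { .iov_base = (void*)buf2, .iov_len = 2 }
            };
            return writev(fd, iov, 2);
    }

    int main(void)
    {
            /* Example path only; adjust to a writable cgroup file. */
            int fd = open("/sys/fs/cgroup/test/cgroup.freeze", O_WRONLY);

            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            /* With the fix this prints 2; before it, -1. */
            printf("writev returned %d\n", writetest(fd));
            close(fd);
            return 0;
    }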

    Signed-off-by: Jouni Roivas
    Signed-off-by: Tejun Heo

    Jouni Roivas
     
  • This step is already done in rebind_subsystems().

    It is not necessary to do it again.

    Signed-off-by: Wei Yang
    Signed-off-by: Tejun Heo

    Wei Yang
     

08 Jul, 2020

1 commit

  • When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
    copied, so the cgroup refcnt must be taken too. And, unlike the
    sk_alloc() path, sock_update_netprioidx() is not called here.
    Therefore, it is safe and necessary to grab the cgroup refcnt
    even when cgroup_sk_alloc is disabled.

    sk_clone_lock() is in BH context anyway, so the in_interrupt() check
    would terminate this function if it were called there. And for sk_alloc()
    skcd->val is always zero, so it's safe to factor out the code to make it
    more readable.

    The global variable 'cgroup_sk_alloc_disabled' is used to determine
    whether to take these reference counts. It is impossible to make
    the reference counting correct unless we save this bit of information
    in skcd->val. So, add a new bit there to record whether the socket
    has already taken the reference counts. This obviously relies on
    kmalloc() aligning cgroup pointers to at least 4 bytes;
    ARCH_KMALLOC_MINALIGN is certainly larger than that.
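
    A minimal, illustrative userspace sketch of that trick (hypothetical
    names, not the kernel's actual skcd->val handling): because the stored
    pointer is at least 4-byte aligned, its low bit is free to carry a flag:

    #include <stdbool.h>
    #include <stdint.h>

    #define SK_CGRP_REFCNT_TAKEN 0x1UL

    /* Recover the aligned pointer, masking off the flag bit. */
    static inline void *skcd_ptr(uintptr_t val)
    {
            return (void *)(val & ~SK_CGRP_REFCNT_TAKEN);
    }

    /* Did this value record that the refcount was already taken? */
    static inline bool skcd_refcnt_taken(uintptr_t val)
    {
            return val & SK_CGRP_REFCNT_TAKEN;
    }

    /* Pack a pointer and the "refcount taken" flag into one word. */
    static inline uintptr_t skcd_pack(void *cgrp, bool refcnt_taken)
    {
            return (uintptr_t)cgrp | (refcnt_taken ? SK_CGRP_REFCNT_TAKEN : 0);
    }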

    This bug seems to have been present since the beginning; commit
    d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets")
    tried to fix it but not completely. It seems it was not easy to trigger
    until the recent commit 090e28b229af
    ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups") was merged.

    Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
    Reported-by: Cameron Berkenpas
    Reported-by: Peter Geis
    Reported-by: Lu Fengqi
    Reported-by: Daniël Sonck
    Reported-by: Zhang Qiang
    Tested-by: Cameron Berkenpas
    Tested-by: Peter Geis
    Tested-by: Thomas Lamprecht
    Cc: Daniel Borkmann
    Cc: Zefan Li
    Cc: Tejun Heo
    Cc: Roman Gushchin
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     

28 May, 2020

1 commit

  • Currently, the root cgroup does not have a cpu.stat file. Add one which
    is consistent with /proc/stat to capture global cpu statistics that
    might not fall under cgroup accounting.

    We haven't done this in the past because the data are already presented
    in /proc/stat and we didn't want to add overhead from collecting root
    cgroup stats when cgroups are configured, but no cgroups have been
    created.

    By keeping the data consistent with /proc/stat, I think we avoid the
    first problem, while improving the usability of cgroups stats.
    We avoid the second problem by computing the contents of cpu.stat from
    existing data collected for /proc/stat anyway.
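
    The new file can be read like any other cgroup interface file; the path
    below assumes cgroup2 is mounted at the conventional /sys/fs/cgroup, and
    the exact set of fields depends on the kernel and enabled controllers:

    #include <stdio.h>

    int main(void)
    {
            char line[256];
            FILE *f = fopen("/sys/fs/cgroup/cpu.stat", "r");

            if (!f) {
                    perror("fopen");
                    return 1;
            }
            while (fgets(line, sizeof(line), f))
                    fputs(line, stdout);
            fclose(f);
            return 0;
    }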

    Signed-off-by: Boris Burkov
    Suggested-by: Tejun Heo
    Signed-off-by: Tejun Heo

    Boris Burkov
     

29 Apr, 2020

1 commit

  • Make bpf_link update support more generic by making it into another
    bpf_link_ops method. This allows the generic syscall handling code to be
    agnostic to various conditionally compiled features (e.g.
    CONFIG_CGROUP_BPF) and lets link type-specific code remain static within
    its respective code base. Refactor the existing bpf_cgroup_link code to
    take advantage of this.
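
    An illustrative, simplified sketch of the ops-based dispatch this
    describes (hypothetical names, not the kernel's actual structs): the
    generic handler only calls through a per-link-type callback and carries
    no cgroup-specific knowledge:

    #include <errno.h>

    struct my_link;

    struct my_link_ops {
            int (*update_prog)(struct my_link *link, int new_prog_fd,
                               int old_prog_fd);
    };

    struct my_link {
            const struct my_link_ops *ops;
    };

    /* Generic syscall-side handler: agnostic to the concrete link type. */
    static int generic_link_update(struct my_link *link, int new_fd, int old_fd)
    {
            if (!link->ops->update_prog)
                    return -EINVAL;
            return link->ops->update_prog(link, new_fd, old_fd);
    }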

    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200429001614.1544-2-andriin@fb.com

    Andrii Nakryiko
     

04 Apr, 2020

1 commit

  • Pull cgroup updates from Tejun Heo:

    - Christian extended clone3 so that processes can be spawned into
    cgroups directly.

    This is not only neat in terms of semantics but also avoids grabbing
    the global cgroup_threadgroup_rwsem for migration.

    - Daniel added !root xattr support to cgroupfs.

    Userland already uses xattrs on cgroupfs for bookkeeping. This will
    allow delegated cgroups to support such usages.

    - Prateek tried to make cpuset hotplug handling synchronous but that
    led to possible deadlock scenarios. Reverted.

    - Other minor changes including release_agent_path handling cleanup.

    * 'for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    docs: cgroup-v1: Document the cpuset_v2_mode mount option
    Revert "cpuset: Make cpuset hotplug synchronous"
    cgroupfs: Support user xattrs
    kernfs: Add option to enable user xattrs
    kernfs: Add removed_size out param for simple_xattr_set
    kernfs: kvmalloc xattr value instead of kmalloc
    cgroup: Restructure release_agent_path handling
    selftests/cgroup: add tests for cloning into cgroups
    clone3: allow spawning processes into cgroups
    cgroup: add cgroup_may_write() helper
    cgroup: refactor fork helpers
    cgroup: add cgroup_get_from_file() helper
    cgroup: unify attach permission checking
    cpuset: Make cpuset hotplug synchronous
    cgroup.c: Use built-in RCU list checking
    kselftest/cgroup: add cgroup destruction test
    cgroup: Clean up css_set task traversal

    Linus Torvalds
     

03 Apr, 2020

1 commit

  • Right now, the effective protection of any given cgroup is capped by its
    own explicit memory.low setting, regardless of what the parent says. The
    reasons for this are mostly historical and ease of implementation: to make
    delegation of memory.low safe, effective protection is the min() of all
    memory.low up the tree.

    Unfortunately, this limitation makes it impossible to protect an entire
    subtree from another without forcing the user to make explicit protection
    allocations all the way to the leaf cgroups - something that is highly
    undesirable in real life scenarios.

    Consider memory in a data center host. At the cgroup top level, we have a
    distinction between system management software and the actual workload the
    system is executing. Both branches are further subdivided into individual
    services, job components etc.

    We want to protect the workload as a whole from the system management
    software, but that doesn't mean we want to protect and prioritize
    individual workload wrt each other. Their memory demand can vary over
    time, and we'd want the VM to simply cache the hottest data within the
    workload subtree. Yet, the current memory.low limitations force us to
    allocate a fixed amount of protection to each workload component in order
    to get protection from system management software in general. This
    results in very inefficient resource distribution.

    Another concern with mandating downward allocation is that, as the
    complexity of the cgroup tree grows, it gets harder for the lower levels
    to be informed about decisions made at the host-level. Consider a
    container inside a namespace that in turn creates its own nested tree of
    cgroups to run multiple workloads. It'd be extremely difficult to
    configure memory.low parameters in those leaf cgroups that on one hand
    balance pressure among siblings as the container desires, while also
    reflecting the host-level protection from e.g. rpm upgrades, that lie
    beyond one or more delegation and namespacing points in the tree.

    It's highly unusual from a cgroup interface POV that nested levels have to
    be aware of and reflect decisions made at higher levels for them to be
    effective.

    To enable such use cases and scale configurability for complex trees, this
    patch implements a resource inheritance model for memory that is similar
    to how the CPU and the IO controller implement work-conserving resource
    allocations: a share of a resource allocated to a subtree always applies to
    the entire subtree recursively, while allowing, but not mandating,
    children to further specify distribution rules.

    That means that if protection is explicitly allocated among siblings,
    those configured shares are being followed during page reclaim just like
    they are now. However, if the memory.low set at a higher level is not
    fully claimed by the children in that subtree, the "floating" remainder is
    applied to each cgroup in the tree in proportion to its size. Since
    reclaim pressure is applied in proportion to size as well, each child in
    that tree gets the same boost, and the effect is neutral among siblings -
    with respect to each other, they behave as if no memory control was
    enabled at all, and the VM simply balances the memory demands optimally
    within the subtree. But collectively those cgroups enjoy a boost over the
    cgroups in neighboring trees.

    E.g. a leaf cgroup with a memory.low setting of 0 no longer means that
    it's not getting a share of the hierarchically assigned resource, just
    that it doesn't claim a fixed amount of it to protect from its siblings.

    This allows us to recursively protect one subtree (workload) from another
    (system management), while letting subgroups compete freely among each
    other - without having to assign fixed shares to each leaf, and without
    nested groups having to echo higher-level settings.

    The floating protection composes naturally with fixed protection.
    Consider the following example tree:

        A            A: low = 2G
       / \           A1: low = 1G
      A1  A2         A2: low = 0G

    As outside pressure is applied to this tree, A1 will enjoy a fixed
    protection from A2 of 1G, but the remaining, unclaimed 1G from A is split
    evenly among A1 and A2, coming out to 1.5G and 0.5G.
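
    A tiny, illustrative calculation of that distribution (simplified; the
    real reclaim-time computation in mm/memcontrol.c is more involved, and
    the function name here is hypothetical):

    #include <stdio.h>

    /* effective low = own low + unclaimed parent low, split by size share */
    static double effective_low(double own_low, double parent_low,
                                double siblings_low_sum, double own_size,
                                double all_siblings_size)
    {
            double floating = parent_low - siblings_low_sum;

            if (floating < 0)
                    floating = 0;
            return own_low + floating * own_size / all_siblings_size;
    }

    int main(void)
    {
            /* A: low = 2G, A1: low = 1G, A2: low = 0G, equal sizes */
            printf("A1: %.1fG\n", effective_low(1.0, 2.0, 1.0, 1.0, 2.0));
            printf("A2: %.1fG\n", effective_low(0.0, 2.0, 1.0, 1.0, 2.0));
            return 0;
    }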

    There is a slight risk of regressing theoretical setups where the
    top-level cgroups don't know about the true budgeting and set bogusly high
    "bypass" values that are meaningfully allocated down the tree. Such
    setups would rely on unclaimed protection to be discarded, and
    distributing it would change the intended behavior. Be safe and hide the
    new behavior behind a mount option, 'memory_recursiveprot'.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Acked-by: Tejun Heo
    Acked-by: Roman Gushchin
    Acked-by: Chris Down
    Cc: Michal Hocko
    Cc: Michal Koutný
    Link: http://lkml.kernel.org/r/20200227195606.46212-4-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

31 Mar, 2020

2 commits

  • Add a new operation (LINK_UPDATE), which allows replacing the active
    bpf_prog under a given bpf_link. Currently this is only supported for
    bpf_cgroup_link, but it will be extended to other kinds of bpf_links in
    follow-up patches.

    For bpf_cgroup_link, implemented functionality matches existing semantics for
    direct bpf_prog attachment (including BPF_F_REPLACE flag). User can either
    unconditionally set new bpf_prog regardless of which bpf_prog is currently
    active under given bpf_link, or, optionally, can specify expected active
    bpf_prog. If active bpf_prog doesn't match expected one, no changes are
    performed, old bpf_link stays intact and attached, operation returns
    a failure.

    The cgroup_bpf_replace() operation resolves the race between
    auto-detachment and bpf_prog update in the same fashion as is done for
    bpf_link detachment, except that in this case the update has no way of
    succeeding because the target cgroup is marked as dying, so an error is
    returned.
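
    A hedged sketch of driving the new command through the raw bpf() syscall
    (assumes uapi headers that already define BPF_LINK_UPDATE, i.e. v5.7+;
    error handling kept minimal):

    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <linux/bpf.h>

    /* Replace the program under link_fd. Pass old_prog_fd < 0 to replace
     * unconditionally, or a valid fd to require that exact program to be
     * the currently active one (BPF_F_REPLACE). */
    static int cgroup_link_update(int link_fd, int new_prog_fd, int old_prog_fd)
    {
            union bpf_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.link_update.link_fd = link_fd;
            attr.link_update.new_prog_fd = new_prog_fd;
            if (old_prog_fd >= 0) {
                    attr.link_update.flags = BPF_F_REPLACE;
                    attr.link_update.old_prog_fd = old_prog_fd;
            }

            return syscall(__NR_bpf, BPF_LINK_UPDATE, &attr, sizeof(attr));
    }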

    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200330030001.2312810-3-andriin@fb.com

    Andrii Nakryiko
     
  • Implement new sub-command to attach cgroup BPF programs and return FD-based
    bpf_link back on success. bpf_link, once attached to cgroup, cannot be
    replaced, except by owner having its FD. Cgroup bpf_link supports only
    BPF_F_ALLOW_MULTI semantics. Both link-based and prog-based BPF_F_ALLOW_MULTI
    attachments can be freely intermixed.

    To prevent bpf_cgroup_link from keeping cgroup alive past the point when no
    BPF program can be executed, implement auto-detachment of link. When
    cgroup_bpf_release() is called, all attached bpf_links are forced to release
    cgroup refcounts, but they leave bpf_link otherwise active and allocated, as
    well as still owning underlying bpf_prog. This is because user-space might
    still have FDs open and active, so bpf_link as a user-referenced object can't
    be freed yet. Once last active FD is closed, bpf_link will be freed and
    underlying bpf_prog refcount will be dropped. But cgroup refcount won't be
    touched, because cgroup is released already.

    The inherent race between bpf_cgroup_link release (from closing last FD) and
    cgroup_bpf_release() is resolved by both operations taking cgroup_mutex. So
    the only additional check required is when bpf_cgroup_link attempts to detach
    itself from cgroup. At that time we need to check whether there is still
    cgroup associated with that link. And if not, exit with success, because
    bpf_cgroup_link was already successfully detached.
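
    A hedged sketch of creating such a link via the raw bpf() syscall (again
    assuming v5.7+ uapi headers; the attach type is just an example):

    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <linux/bpf.h>

    /* Attach prog_fd to cgroup_fd as an FD-based bpf_link. */
    static int cgroup_link_create(int prog_fd, int cgroup_fd)
    {
            union bpf_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.link_create.prog_fd = prog_fd;
            attr.link_create.target_fd = cgroup_fd;
            attr.link_create.attach_type = BPF_CGROUP_INET_EGRESS;

            return syscall(__NR_bpf, BPF_LINK_CREATE, &attr, sizeof(attr));
    }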

    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Alexei Starovoitov
    Acked-by: Roman Gushchin
    Link: https://lore.kernel.org/bpf/20200330030001.2312810-2-andriin@fb.com

    Andrii Nakryiko
     

17 Mar, 2020

1 commit

  • This patch turns on xattr support for cgroupfs. This is useful for
    letting non-root owners of delegated subtrees attach metadata to
    cgroups.

    One use case is for subtree owners to tell a userspace out of memory
    killer to bias away from killing specific subtrees.

    Tests:

    [/sys/fs/cgroup]# for i in $(seq 0 130); \
    do setfattr workload.slice -n user.name$i -v wow; done
    setfattr: workload.slice: No space left on device
    setfattr: workload.slice: No space left on device
    setfattr: workload.slice: No space left on device

    [/sys/fs/cgroup]# for i in $(seq 0 130); \
    do setfattr workload.slice --remove user.name$i; done
    setfattr: workload.slice: No such attribute
    setfattr: workload.slice: No such attribute
    setfattr: workload.slice: No such attribute

    [/sys/fs/cgroup]# for i in $(seq 0 130); \
    do setfattr workload.slice -n user.name$i -v wow; done
    setfattr: workload.slice: No space left on device
    setfattr: workload.slice: No space left on device
    setfattr: workload.slice: No space left on device

    `seq 0 130` is inclusive, and 131 - 128 = 3, which is the number of
    errors we expect to see.

    [/data]# cat testxattr.c
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/xattr.h>

    int main() {
            char name[256];
            char *buf = malloc(64 << 10);
            if (!buf) {
                    perror("malloc");
                    return 1;
            }

            for (int i = 0; i < 4; ++i) {
                    snprintf(name, 256, "user.bigone%d", i);
                    if (setxattr("/sys/fs/cgroup/system.slice", name, buf,
                                 64 << 10, 0)) {
                            printf("setxattr failed on iteration=%d\n", i);
                            return 1;
                    }
            }

            return 0;
    }

    [/data]# ./a.out
    setxattr failed on iteration=2

    [/data]# ./a.out
    setxattr failed on iteration=0

    [/sys/fs/cgroup]# setfattr -x user.bigone0 system.slice/
    [/sys/fs/cgroup]# setfattr -x user.bigone1 system.slice/

    [/data]# ./a.out
    setxattr failed on iteration=2

    Signed-off-by: Daniel Xu
    Acked-by: Chris Down
    Reviewed-by: Greg Kroah-Hartman
    Signed-off-by: Tejun Heo

    Daniel Xu
     

13 Mar, 2020

2 commits

  • Pull networking fixes from David Miller:
    "It looks like a decent sized set of fixes, but a lot of these are one
    liner off-by-one and similar type changes:

    1) Fix netlink header pointer used to calculate the bad attribute offset
    reported to user. From Pablo Neira Ayuso.

    2) Don't double clear PHY interrupts when ->did_interrupt is set,
    from Heiner Kallweit.

    3) Add missing validation of various (devlink, nl802154, fib, etc.)
    attributes, from Jakub Kicinski.

    4) Missing *pos increments in various netfilter seq_next ops, from
    Vasily Averin.

    5) Missing break in of_mdiobus_register() loop, from Dajun Jin.

    6) Don't double bump tx_dropped in veth driver, from Jiang Lidong.

    7) Work around FMAN erratum A050385, from Madalin Bucur.

    8) Make sure ARP header is pulled early enough in bonding driver,
    from Eric Dumazet.

    9) Do a cond_resched() during multicast processing of ipvlan and
    macvlan, from Mahesh Bandewar.

    10) Don't attach cgroups to unrelated sockets when in interrupt
    context, from Shakeel Butt.

    11) Fix tpacket ring state management when encountering unknown GSO
    types. From Willem de Bruijn.

    12) Fix MDIO bus PHY resume by checking mdio_bus_phy_may_suspend()
    only in the suspend context. From Heiner Kallweit"

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (112 commits)
    net: systemport: fix index check to avoid an array out of bounds access
    tc-testing: add ETS scheduler to tdc build configuration
    net: phy: fix MDIO bus PM PHY resuming
    net: hns3: clear port base VLAN when unload PF
    net: hns3: fix RMW issue for VLAN filter switch
    net: hns3: fix VF VLAN table entries inconsistent issue
    net: hns3: fix "tc qdisc del" failed issue
    taprio: Fix sending packets without dequeueing them
    net: mvmdio: avoid error message for optional IRQ
    net: dsa: mv88e6xxx: Add missing mask of ATU occupancy register
    net: memcg: fix lockdep splat in inet_csk_accept()
    s390/qeth: implement smarter resizing of the RX buffer pool
    s390/qeth: refactor buffer pool code
    s390/qeth: use page pointers to manage RX buffer pool
    seg6: fix SRv6 L2 tunnels to use IANA-assigned protocol number
    net: dsa: Don't instantiate phylink for CPU/DSA ports unless needed
    net/packet: tpacket_rcv: do not increment ring index on drop
    sxgbe: Fix off by one in samsung driver strncpy size arg
    net: caif: Add lockdep expression to RCU traversal primitive
    MAINTAINERS: remove Sathya Perla as Emulex NIC maintainer
    ...

    Linus Torvalds
     

11 Mar, 2020

1 commit

  • We are testing network memory accounting in our setup and noticed
    inconsistent network memory usage and often unrelated cgroups network
    usage correlates with testing workload. On further inspection, it
    seems like mem_cgroup_sk_alloc() and cgroup_sk_alloc() are broken in
    irq context, especially for cgroup v1.

    mem_cgroup_sk_alloc() and cgroup_sk_alloc() can be called in irq context
    and kind of assume that this can only happen from sk_clone_lock() and
    that the source sock object already has an associated cgroup. However,
    in cgroup v1, where network memory accounting is opt-in, the source sock
    can be unassociated with any cgroup and the new cloned sock can get
    associated with an unrelated, interrupted cgroup.

    Cgroup v2 can also suffer if the source sock object was created by a
    process in the root cgroup or if sk_alloc() is called in irq context.
    The fix is to simply do nothing in interrupt context.

    WARNING: Please note that about half of the TCP sockets are allocated
    from the IRQ context, so memory used by such sockets will not be
    accounted by the memcg.

    The stack trace of mem_cgroup_sk_alloc() from IRQ-context:

    CPU: 70 PID: 12720 Comm: ssh Tainted: 5.6.0-smp-DEV #1
    Hardware name: ...
    Call Trace:

    dump_stack+0x57/0x75
    mem_cgroup_sk_alloc+0xe9/0xf0
    sk_clone_lock+0x2a7/0x420
    inet_csk_clone_lock+0x1b/0x110
    tcp_create_openreq_child+0x23/0x3b0
    tcp_v6_syn_recv_sock+0x88/0x730
    tcp_check_req+0x429/0x560
    tcp_v6_rcv+0x72d/0xa40
    ip6_protocol_deliver_rcu+0xc9/0x400
    ip6_input+0x44/0xd0
    ? ip6_protocol_deliver_rcu+0x400/0x400
    ip6_rcv_finish+0x71/0x80
    ipv6_rcv+0x5b/0xe0
    ? ip6_sublist_rcv+0x2e0/0x2e0
    process_backlog+0x108/0x1e0
    net_rx_action+0x26b/0x460
    __do_softirq+0x104/0x2a6
    do_softirq_own_stack+0x2a/0x40

    do_softirq.part.19+0x40/0x50
    __local_bh_enable_ip+0x51/0x60
    ip6_finish_output2+0x23d/0x520
    ? ip6table_mangle_hook+0x55/0x160
    __ip6_finish_output+0xa1/0x100
    ip6_finish_output+0x30/0xd0
    ip6_output+0x73/0x120
    ? __ip6_finish_output+0x100/0x100
    ip6_xmit+0x2e3/0x600
    ? ipv6_anycast_cleanup+0x50/0x50
    ? inet6_csk_route_socket+0x136/0x1e0
    ? skb_free_head+0x1e/0x30
    inet6_csk_xmit+0x95/0xf0
    __tcp_transmit_skb+0x5b4/0xb20
    __tcp_send_ack.part.60+0xa3/0x110
    tcp_send_ack+0x1d/0x20
    tcp_rcv_state_process+0xe64/0xe80
    ? tcp_v6_connect+0x5d1/0x5f0
    tcp_v6_do_rcv+0x1b1/0x3f0
    ? tcp_v6_do_rcv+0x1b1/0x3f0
    __release_sock+0x7f/0xd0
    release_sock+0x30/0xa0
    __inet_stream_connect+0x1c3/0x3b0
    ? prepare_to_wait+0xb0/0xb0
    inet_stream_connect+0x3b/0x60
    __sys_connect+0x101/0x120
    ? __sys_getsockopt+0x11b/0x140
    __x64_sys_connect+0x1a/0x20
    do_syscall_64+0x51/0x200
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Fixes: 2d7580738345 ("mm: memcontrol: consolidate cgroup socket tracking")
    Fixes: d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets")
    Signed-off-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Signed-off-by: David S. Miller

    Shakeel Butt
     

05 Mar, 2020

1 commit

  • Similar to the commit d7495343228f ("cgroup: fix incorrect
    WARN_ON_ONCE() in cgroup_setup_root()"), cgroup_id(root_cgrp) is not
    equal to 1 on 32bit ino archs, which triggers all sorts of issues with
    psi_show() on s390x. For example,

    BUG: KASAN: slab-out-of-bounds in collect_percpu_times+0x2d0/
    Read of size 4 at addr 000000001e0ce000 by task read_all/3667
    collect_percpu_times+0x2d0/0x798
    psi_show+0x7c/0x2a8
    seq_read+0x2ac/0x830
    vfs_read+0x92/0x150
    ksys_read+0xe2/0x188
    system_call+0xd8/0x2b4

    Fix it by using cgroup_ino().

    Fixes: 743210386c03 ("cgroup: use cgrp->kn->id as the cgroup ID")
    Signed-off-by: Qian Cai
    Acked-by: Johannes Weiner
    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org # v5.5

    Qian Cai
     

13 Feb, 2020

9 commits

  • This adds support for creating a process in a different cgroup than its
    parent. Callers can limit and account processes and threads right from
    the moment they are spawned:
    - A service manager can directly spawn new services into dedicated
    cgroups.
    - A process can be directly created in a frozen cgroup and will be
    frozen as well.
    - The initial accounting jitter experienced by process supervisors and
    daemons is eliminated with this.
    - Threaded applications or even thread implementations can choose to
    create a specific cgroup layout where each thread is spawned
    directly into a dedicated cgroup.

    This feature is limited to the unified hierarchy. Callers need to pass
    a directory file descriptor for the target cgroup. The caller can
    choose to pass an O_PATH file descriptor. All usual migration
    restrictions apply, i.e. there can be no processes in inner nodes. In
    general, creating a process directly in a target cgroup adheres to all
    migration restrictions.

    One of the biggest advantages of this feature is that CLONE_INTO_CGROUP
    does not need to grab the write side of the global
    cgroup_threadgroup_rwsem. This global lock makes moving tasks/threads
    around super expensive. With clone3() this lock is avoided.
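
    A hedged usage sketch (requires a v5.7+ kernel and matching kernel/libc
    headers that provide SYS_clone3 and CLONE_INTO_CGROUP; the cgroup path
    is only an example):

    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <linux/sched.h>        /* struct clone_args, CLONE_INTO_CGROUP */
    #include <linux/types.h>

    int main(void)
    {
            int cgroup_fd = open("/sys/fs/cgroup/mygroup",
                                 O_DIRECTORY | O_CLOEXEC);
            if (cgroup_fd < 0) {
                    perror("open");
                    return 1;
            }

            struct clone_args args = {
                    .flags       = CLONE_INTO_CGROUP,
                    .exit_signal = SIGCHLD,
                    .cgroup      = (__u64)cgroup_fd,
            };

            pid_t pid = syscall(SYS_clone3, &args, sizeof(args));
            if (pid < 0) {
                    perror("clone3");
                    return 1;
            }
            if (pid == 0) {
                    /* Child: already a member of /sys/fs/cgroup/mygroup. */
                    _exit(0);
            }
            waitpid(pid, NULL, 0);
            return 0;
    }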

    Cc: Tejun Heo
    Cc: Ingo Molnar
    Cc: Oleg Nesterov
    Cc: Johannes Weiner
    Cc: Li Zefan
    Cc: Peter Zijlstra
    Cc: cgroups@vger.kernel.org
    Signed-off-by: Christian Brauner
    Signed-off-by: Tejun Heo

    Christian Brauner
     
  • Add a cgroup_may_write() helper which we can use in the
    CLONE_INTO_CGROUP patch series to verify that we can write to the
    destination cgroup.

    Cc: Tejun Heo
    Cc: Johannes Weiner
    Cc: Li Zefan
    Cc: cgroups@vger.kernel.org
    Signed-off-by: Christian Brauner
    Signed-off-by: Tejun Heo

    Christian Brauner
     
  • This refactors the fork helpers so they can be easily modified in the
    next patches. The patch just moves the cgroup threadgroup rwsem grab and
    release into the helpers. They don't need to be directly exposed in fork.c.

    Cc: Tejun Heo
    Cc: Johannes Weiner
    Cc: Li Zefan
    Cc: cgroups@vger.kernel.org
    Acked-by: Michal Koutný
    Signed-off-by: Christian Brauner
    Signed-off-by: Tejun Heo

    Christian Brauner
     
  • Add a helper cgroup_get_from_file(). The helper will be used in
    subsequent patches to retrieve a cgroup while holding a reference to the
    struct file it was taken from.

    Cc: Tejun Heo
    Cc: Johannes Weiner
    Cc: Li Zefan
    Cc: cgroups@vger.kernel.org
    Acked-by: Michal Koutný
    Signed-off-by: Christian Brauner
    Signed-off-by: Tejun Heo

    Christian Brauner
     
  • The core codepaths to check whether a process can be attached to a
    cgroup are the same for threads and thread-group leaders. Only a small
    piece of code verifying that source and destination cgroup are in the
    same domain differentiates the thread permission checking from
    thread-group leader permission checking.
    Since cgroup_migrate_vet_dst() only matters for cgroup2 - it is a noop
    on cgroup1 - we can move it out of cgroup_attach_task().
    All checks can now be consolidated into a new helper
    cgroup_attach_permissions() callable from both cgroup_procs_write() and
    cgroup_threads_write().

    Cc: Tejun Heo
    Cc: Li Zefan
    Cc: Johannes Weiner
    Cc: cgroups@vger.kernel.org
    Acked-by: Michal Koutný
    Signed-off-by: Christian Brauner
    Signed-off-by: Tejun Heo

    Christian Brauner
     
  • list_for_each_entry_rcu has built-in RCU and lock checking.
    Pass cond argument to list_for_each_entry_rcu() to silence
    false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled
    by default.

    Even though the function css_next_child() already checks if
    cgroup_mutex or rcu_read_lock() is held using
    cgroup_assert_mutex_or_rcu_locked(), there is a need to pass
    cond to list_for_each_entry_rcu() to avoid false positive
    lockdep warning.

    Signed-off-by: Madhuparna Bhowmik
    Acked-by: Michal Koutný
    Signed-off-by: Tejun Heo

    Madhuparna Bhowmik
     
  • css_task_iter stores a pointer to the head of each iterable list; this
    dates back to commit 0f0a2b4fa621 ("cgroup: reorganize css_task_iter")
    when we did not store cur_cset. Let us utilize the list heads directly
    in cur_cset and streamline css_task_iter_advance_css_set() a bit. There
    is no intentional functional change.

    Signed-off-by: Michal Koutný
    Signed-off-by: Tejun Heo

    Michal Koutný
     
  • PF_EXITING is set earlier than the actual removal from the css_set when
    a task is exiting. This can confuse cgroup.procs readers, who see no
    PF_EXITING tasks; however, rmdir checks against css_set membership, so
    it can transiently fail with EBUSY.

    Fix this by listing tasks that weren't unlinked from css_set active
    lists.
    It may happen that other users of the task iterator (without
    CSS_TASK_ITER_PROCS) spot a PF_EXITING task before cgroup_exit(). This
    is equal to the state before commit c03cd7738a83 ("cgroup: Include dying
    leaders with live threads in PROCS iterations") but it may be reviewed
    later.

    Reported-by: Suren Baghdasaryan
    Fixes: c03cd7738a83 ("cgroup: Include dying leaders with live threads in PROCS iterations")
    Signed-off-by: Michal Koutný
    Signed-off-by: Tejun Heo

    Michal Koutný
     
  • If a seq_file .next function does not change the position index, a read
    after some lseek can generate unexpected output:

    1) dd with bs=1 skips every second element
    $ dd if=/sys/fs/cgroup/cgroup.procs bs=8 count=1
    2
    3
    4
    5
    1+0 records in
    1+0 records out
    8 bytes copied, 0,000267297 s, 29,9 kB/s
    [test@localhost ~]$ dd if=/sys/fs/cgroup/cgroup.procs bs=1 count=8
    2
    4 <<< NB! 3 was skipped
    6 <<< ... and 5 too
    8 <<< ... and 7
    8+0 records in
    8+0 records out
    8 bytes copied, 5,2123e-05 s, 153 kB/s

    This happens because __cgroup_procs_start() makes an extra
    cgroup_procs_next() call.

    2) A read after an lseek beyond the end of file generates the whole
    last line.
    3) A read after an lseek into the middle of the last line generates the
    expected rest of the last line and then, unexpectedly, the whole line
    once again.

    Additionally, the patch removes an extra position index change in
    __cgroup_procs_start().

    Cc: stable@vger.kernel.org
    https://bugzilla.kernel.org/show_bug.cgi?id=206283
    Signed-off-by: Vasily Averin
    Signed-off-by: Tejun Heo

    Vasily Averin
     

11 Feb, 2020

1 commit

  • Pull cgroup fix from Tejun Heo:
    "I made a mistake while removing cgroup task list lazy init
    optimization making the root cgroup.procs show entries for the
    init_tasks. The zero entries doesn't cause critical failures but does
    make systemd print out warning messages during boot.

    Fix it by omitting init_tasks as they should be"

    * 'for-5.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: init_tasks shouldn't be linked to the root cgroup

    Linus Torvalds
     

09 Feb, 2020

1 commit

  • Pull vfs file system parameter updates from Al Viro:
    "Saner fs_parser.c guts and data structures. The system-wide registry
    of syntax types (string/enum/int32/oct32/.../etc.) is gone and so is
    the horror switch() in fs_parse() that would have to grow another case
    every time something got added to that system-wide registry.

    New syntax types can be added by filesystems easily now, and their
    namespace is that of functions - not of system-wide enum members. IOW,
    they can be shared or kept private and if some turn out to be widely
    useful, we can make them common library helpers, etc., without having
    to do anything whatsoever to fs_parse() itself.

    And we already get that kind of requests - the thing that finally
    pushed me into doing that was "oh, and let's add one for timeouts -
    things like 15s or 2h". If some filesystem really wants that, let them
    do it. Without somebody having to play gatekeeper for the variants
    blessed by direct support in fs_parse(), TYVM.

    Quite a bit of boilerplate is gone. And IMO the data structures make a
    lot more sense now. -200LoC, while we are at it"

    * 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (25 commits)
    tmpfs: switch to use of invalfc()
    cgroup1: switch to use of errorfc() et.al.
    procfs: switch to use of invalfc()
    hugetlbfs: switch to use of invalfc()
    cramfs: switch to use of errofc() et.al.
    gfs2: switch to use of errorfc() et.al.
    fuse: switch to use errorfc() et.al.
    ceph: use errorfc() and friends instead of spelling the prefix out
    prefix-handling analogues of errorf() and friends
    turn fs_param_is_... into functions
    fs_parse: handle optional arguments sanely
    fs_parse: fold fs_parameter_desc/fs_parameter_spec
    fs_parser: remove fs_parameter_description name field
    add prefix to fs_context->log
    ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log
    new primitive: __fs_parse()
    switch rbd and libceph to p_log-based primitives
    struct p_log, variants of warnf() et.al. taking that one instead
    teach logfc() to handle prefices, give it saner calling conventions
    get rid of cg_invalf()
    ...

    Linus Torvalds
     

31 Jan, 2020

1 commit

  • 5153faac18d2 ("cgroup: remove cgroup_enable_task_cg_lists()
    optimization") removed lazy initialization of css_sets so that new
    tasks are always linked to their css_set. In the process, it incorrectly
    ended up adding the init_tasks to the root css_set. They show up as PID
    0's in the root's cgroup.procs, triggering warnings in systemd and
    generally confusing people.

    Fix it by skipping css_set linking for the init_tasks.

    Signed-off-by: Tejun Heo
    Reported-by: https://github.com/joanbm
    Link: https://github.com/systemd/systemd/issues/14682
    Fixes: 5153faac18d2 ("cgroup: remove cgroup_enable_task_cg_lists() optimization")
    Cc: stable@vger.kernel.org # v5.5+

    Tejun Heo
     

29 Jan, 2020

1 commit

  • Pull networking updates from David Miller:

    1) Add WireGuard

    2) Add HE and TWT support to ath11k driver, from John Crispin.

    3) Add ESP in TCP encapsulation support, from Sabrina Dubroca.

    4) Add variable window congestion control to TIPC, from Jon Maloy.

    5) Add BCM84881 PHY driver, from Russell King.

    6) Start adding netlink support for ethtool operations, from Michal
    Kubecek.

    7) Add XDP drop and TX action support to ena driver, from Sameeh
    Jubran.

    8) Add new ipv4 route notifications so that mlxsw driver does not have
    to handle identical routes itself. From Ido Schimmel.

    9) Add BPF dynamic program extensions, from Alexei Starovoitov.

    10) Support RX and TX timestamping in igc, from Vinicius Costa Gomes.

    11) Add support for macsec HW offloading, from Antoine Tenart.

    12) Add initial support for MPTCP protocol, from Christoph Paasch,
    Matthieu Baerts, Florian Westphal, Peter Krystad, and many others.

    13) Add Octeontx2 PF support, from Sunil Goutham, Geetha sowjanya, Linu
    Cherian, and others.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1469 commits)
    net: phy: add default ARCH_BCM_IPROC for MDIO_BCM_IPROC
    udp: segment looped gso packets correctly
    netem: change mailing list
    qed: FW 8.42.2.0 debug features
    qed: rt init valid initialization changed
    qed: Debug feature: ilt and mdump
    qed: FW 8.42.2.0 Add fw overlay feature
    qed: FW 8.42.2.0 HSI changes
    qed: FW 8.42.2.0 iscsi/fcoe changes
    qed: Add abstraction for different hsi values per chip
    qed: FW 8.42.2.0 Additional ll2 type
    qed: Use dmae to write to widebus registers in fw_funcs
    qed: FW 8.42.2.0 Parser offsets modified
    qed: FW 8.42.2.0 Queue Manager changes
    qed: FW 8.42.2.0 Expose new registers and change windows
    qed: FW 8.42.2.0 Internal ram offsets modifications
    MAINTAINERS: Add entry for Marvell OcteonTX2 Physical Function driver
    Documentation: net: octeontx2: Add RVU HW and drivers overview
    octeontx2-pf: ethtool RSS config support
    octeontx2-pf: Add basic ethtool support
    ...

    Linus Torvalds
     

16 Jan, 2020

1 commit

  • The test_cgcore_no_internal_process_constraint_on_threads selftest when
    running with subsystem controlling noise triggers two warnings:

    > [ 597.443115] WARNING: CPU: 1 PID: 28167 at kernel/cgroup/cgroup.c:3131 cgroup_apply_control_enable+0xe0/0x3f0
    > [ 597.443413] WARNING: CPU: 1 PID: 28167 at kernel/cgroup/cgroup.c:3177 cgroup_apply_control_disable+0xa6/0x160

    Both stem from a call to cgroup_type_write. The first warning was also
    triggered by syzkaller.

    When we're switching cgroup to threaded mode shortly after a subsystem
    was disabled on it, we can see the respective subsystem css dying there.

    The warning in cgroup_apply_control_enable is harmless in this case
    since we're not adding new subsys anyway.
    The warning in cgroup_apply_control_disable indicates an attempt to kill
    css of recently disabled subsystem repeatedly.

    The commit prevents these situations by making cgroup_type_write wait
    for all dying csses to go away before re-applying subtree controls.
    While at it, the locations of the WARN_ON_ONCE calls are moved so that
    the warning is triggered only when we are about to misuse the dying css.

    Reported-by: syzbot+5493b2a54d31d6aea629@syzkaller.appspotmail.com
    Reported-by: Christian Brauner
    Signed-off-by: Michal Koutný
    Signed-off-by: Tejun Heo

    Michal Koutný
     

20 Dec, 2019

1 commit

  • The common use-case in production is to have multiple cgroup-bpf
    programs per attach type that cover multiple use-cases. Such programs
    are attached with BPF_F_ALLOW_MULTI and can be maintained by different
    people.

    The order of programs usually matters; for example, imagine two egress
    programs: the first one drops packets and the second one counts packets.
    If they're swapped, the result of the counting program will be different.

    It brings operational challenges with updating cgroup-bpf program(s)
    attached with BPF_F_ALLOW_MULTI since there is no way to replace a
    program:

    * One way to update is to detach all programs first and then attach the
    new version(s) again in the right order. This introduces an
    interruption in the work a program is doing and may not be acceptable
    (e.g. if it's egress firewall);

    * Another way is to attach the new version of a program first and only
    then detach the old version. This introduces a time interval when two
    versions of the same program are working, which may not be acceptable if
    a program is not idempotent. It also imposes an additional burden on
    program developers to make sure that two versions of their program can
    co-exist.

    Solve the problem by introducing a "replace" mode in the BPF_PROG_ATTACH
    command for cgroup-bpf programs being attached with the BPF_F_ALLOW_MULTI
    flag. This mode is enabled by the newly introduced BPF_F_REPLACE attach
    flag and the bpf_attr.replace_bpf_fd attribute, which passes the fd of
    the old program to replace.

    That way user can replace any program among those attached with
    BPF_F_ALLOW_MULTI flag without the problems described above.

    Details of the new API:

    * If BPF_F_REPLACE is set but replace_bpf_fd doesn't have valid
    descriptor of BPF program, BPF_PROG_ATTACH will return corresponding
    error (EINVAL or EBADF).

    * If replace_bpf_fd has valid descriptor of BPF program but such a
    program is not attached to specified cgroup, BPF_PROG_ATTACH will
    return ENOENT.

    BPF_F_REPLACE is introduced to make the user intent clear, since
    replace_bpf_fd alone can't be used for this (its default value, 0, is a
    valid fd). BPF_F_REPLACE also makes it possible to extend the API in the
    future (e.g. add BPF_F_BEFORE and BPF_F_AFTER if needed).
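
    A hedged usage sketch via the raw bpf() syscall (assumes uapi headers
    that define BPF_F_REPLACE, i.e. v5.6+; the attach type is just an
    example):

    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <linux/bpf.h>

    /* Atomically replace old_prog_fd with new_prog_fd on cgroup_fd. */
    static int cgroup_prog_replace(int cgroup_fd, int new_prog_fd,
                                   int old_prog_fd)
    {
            union bpf_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.target_fd = cgroup_fd;
            attr.attach_bpf_fd = new_prog_fd;
            attr.attach_type = BPF_CGROUP_INET_EGRESS;
            attr.attach_flags = BPF_F_ALLOW_MULTI | BPF_F_REPLACE;
            attr.replace_bpf_fd = old_prog_fd;

            return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
    }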

    Signed-off-by: Andrey Ignatov
    Signed-off-by: Alexei Starovoitov
    Acked-by: Martin KaFai Lau
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/30cd850044a0057bdfcaaf154b7d2f39850ba813.1576741281.git.rdna@fb.com

    Andrey Ignatov
     

26 Nov, 2019

1 commit

  • Pull cgroup updates from Tejun Heo:
    "There are several notable changes here:

    - Single thread migrating itself has been optimized so that it
    doesn't need threadgroup rwsem anymore.

    - Freezer optimization to avoid unnecessary frozen state changes.

    - cgroup ID unification so that cgroup fs ino is the only unique ID
    used for the cgroup and can be used to directly look up live
    cgroups through filehandle interface on 64bit ino archs. On 32bit
    archs, cgroup fs ino is still the only ID in use but it is only
    unique when combined with gen.

    - selftest and other changes"

    * 'for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (24 commits)
    writeback: fix -Wformat compilation warnings
    docs: cgroup: mm: Fix spelling of "list"
    cgroup: fix incorrect WARN_ON_ONCE() in cgroup_setup_root()
    cgroup: use cgrp->kn->id as the cgroup ID
    kernfs: use 64bit inos if ino_t is 64bit
    kernfs: implement custom exportfs ops and fid type
    kernfs: combine ino/id lookup functions into kernfs_find_and_get_node_by_id()
    kernfs: convert kernfs_node->id from union kernfs_node_id to u64
    kernfs: kernfs_find_and_get_node_by_ino() should only look up activated nodes
    kernfs: use dumber locking for kernfs_find_and_get_node_by_ino()
    netprio: use css ID instead of cgroup ID
    writeback: use ino_t for inodes in tracepoints
    kernfs: fix ino wrap-around detection
    kselftests: cgroup: Avoid the reuse of fd after it is deallocated
    cgroup: freezer: don't change task and cgroups status unnecessarily
    cgroup: use cgroup->last_bstat instead of cgroup->bstat_pending for consistency
    cgroup: remove cgroup_enable_task_cg_lists() optimization
    cgroup: pids: use atomic64_t for pids->limit
    selftests: cgroup: Run test_core under interfering stress
    selftests: cgroup: Add task migration tests
    ...

    Linus Torvalds
     

15 Nov, 2019

1 commit

  • 743210386c03 ("cgroup: use cgrp->kn->id as the cgroup ID") added a WARN
    which triggers if cgroup_id(root_cgrp) is not 1. This is fine on 64bit
    ino archs, but on 32bit archs the cgroup ID is ((gen << 32) | ino) and
    gen starts at 1, so the root id is 0x1_0000_0001 instead of 1, always
    triggering the WARN.

    What we want to make sure of is that the ino part is 1. Fix it.

    Reported-by: Naresh Kamboju
    Fixes: 743210386c03 ("cgroup: use cgrp->kn->id as the cgroup ID")
    Signed-off-by: Tejun Heo

    Tejun Heo
     

13 Nov, 2019

3 commits

  • cgroup ID is currently allocated using a dedicated per-hierarchy idr
    and used internally and exposed through tracepoints and bpf. This is
    confusing because there are tracepoints and other interfaces which use
    the cgroupfs ino as IDs.

    The preceding changes exposed kn->id as a 64bit ino on supported archs,
    or as ino+gen (low 32 bits as ino, high 32 bits as gen) otherwise.
    There's no reason for cgroup to use different IDs. The kernfs IDs are
    unique, and userland can easily discover them and map them back to paths
    using standard file operations.

    This patch replaces cgroup IDs with kernfs IDs.

    * cgroup_id() is added and all cgroup ID users are converted to use it.

    * kernfs_node creation is moved to earlier during cgroup init so that
    cgroup_id() is available during init.

    * While at it, s/cgroup/cgrp/ in psi helpers for consistency.

    * Fallback ID value is changed to 1 to be consistent with root cgroup
    ID.

    Signed-off-by: Tejun Heo
    Reviewed-by: Greg Kroah-Hartman
    Cc: Namhyung Kim

    Tejun Heo
     
  • kernfs_find_and_get_node_by_ino() looks up the kernfs_node matching the
    specified ino. On top of that, kernfs_get_node_by_id() and
    kernfs_fh_get_inode() implement full ID matching by testing the rest
    of the ID.

    On the surface, confusingly, the two are slightly different in that the
    latter uses 0 gen as a wildcard while the former doesn't - does that mean
    the latter can't uniquely identify inodes w/ 0 gen? In practice, this is
    a distinction without a difference because the generation number starts
    at 1. There are no actual IDs with 0 gen, so it can always safely be used
    as a wildcard.

    Let's simplify the code by renaming kernfs_find_and_get_node_by_ino()
    to kernfs_find_and_get_node_by_id(), moving all lookup logics into it,
    and removing now unnecessary kernfs_get_node_by_id().

    Signed-off-by: Tejun Heo
    Reviewed-by: Greg Kroah-Hartman

    Tejun Heo
     
  • kernfs_node->id is currently a union kernfs_node_id which represents
    either a 32bit (ino, gen) pair or u64 value. I can't see much value
    in the usage of the union - all that's needed is a 64bit ID which the
    current code is already limited to. Using a union makes the code
    unnecessarily complicated and prevents using 64bit ino without adding
    practical benefits.

    This patch drops union kernfs_node_id and makes kernfs_node->id a u64.
    ino is stored in the lower 32 bits and gen in the upper 32. Accessors -
    kernfs[_id]_ino() and kernfs[_id]_gen() - are added to retrieve the ino
    and gen. This makes ID handling less cumbersome and will allow using
    64bit inos on supported archs.

    This patch doesn't make any functional changes.
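
    A tiny illustration of the packing described above (accessor names
    hypothetical, simplified from the kernel helpers):

    #include <stdint.h>
    #include <stdio.h>

    static inline uint32_t id_ino(uint64_t id) { return (uint32_t)id; }
    static inline uint32_t id_gen(uint64_t id) { return (uint32_t)(id >> 32); }

    int main(void)
    {
            uint64_t id = ((uint64_t)1 << 32) | 42;   /* gen = 1, ino = 42 */

            printf("ino=%u gen=%u\n", id_ino(id), id_gen(id));
            return 0;
    }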

    Signed-off-by: Tejun Heo
    Reviewed-by: Greg Kroah-Hartman
    Cc: Namhyung Kim
    Cc: Jens Axboe
    Cc: Alexei Starovoitov

    Tejun Heo