09 Feb, 2017

1 commit

  • commit 07cd12945551b63ecb1a349d50a6d69d1d6feb4a upstream.

    While refactoring cgroup creation, a5bca2152036 ("cgroup: factor out
    cgroup_create() out of cgroup_mkdir()") incorrectly onlined subsystems
    before the new cgroup is associated with its kernfs_node. This is fine
    for cgroup proper but cgroup_name/path() depend on the associated
    kernfs_node and if a subsystem makes the new cgroup_subsys_state
    visible, which they're allowed to after onlining, it can lead to NULL
    dereference.

    The current code performs cgroup creation and subsystem onlining in
    cgroup_create() and cgroup_mkdir() makes the cgroup and subsystems
    visible afterwards. There's no reason to online the subsystems early
    and we can simply drop cgroup_apply_control_enable() call from
    cgroup_create() so that the subsystems are onlined and made visible at
    the same time.

    Signed-off-by: Tejun Heo
    Reported-by: Konstantin Khlebnikov
    Fixes: a5bca2152036 ("cgroup: factor out cgroup_create() out of cgroup_mkdir()")
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

15 Oct, 2016

1 commit

  • Pull cgroup updates from Tejun Heo:

    - tracepoints for basic cgroup management operations added

    - kernfs and cgroup path formatting functions updated to behave in the
    style of strlcpy()

    - non-critical bug fixes

    * 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    blkcg: Unlock blkcg_pol_mutex only once when cpd == NULL
    cgroup: fix error handling regressions in proc_cgroup_show() and cgroup_release_agent()
    cpuset: fix error handling regression in proc_cpuset_show()
    cgroup: add tracepoints for basic operations
    cgroup: make cgroup_path() and friends behave in the style of strlcpy()
    kernfs: remove kernfs_path_len()
    kernfs: make kernfs_path*() behave in the style of strlcpy()
    kernfs: add dummy implementation of kernfs_path_from_node()

    Linus Torvalds
     

07 Oct, 2016

1 commit

  • Pull namespace updates from Eric Biederman:
    "This set of changes is a number of smaller things that have been
    overlooked in other development cycles focused on more fundamental
    change. The devpts changes are small things that were a distraction
    until we managed to kill off DEVPTS_MULTIPLE_INSTANCES. There is a
    trivial regression fix to autofs for the unprivileged mount changes
    that went in last cycle. A pair of ioctls has been added by Andrey
    Vagin making it possible to discover the relationships between
    namespaces when referring to them through file descriptors.

    The big user visible change is starting to add simple resource limits
    to catch programs that misbehave. With namespaces in general and user
    namespaces in particular allowing users to use more kinds of
    resources, it has become important to have something to limit errant
    programs. Because the purpose of these limits is to catch errant
    programs the code needs to be inexpensive to use as it is always on, and
    the default limits need to be high enough that well behaved programs
    on well behaved systems don't encounter them.

    To this end, after some review I have implemented per user per user
    namespace limits, and use them to limit the number of namespaces. The
    limits being per user mean that one user can not exhaust the limits of
    another user. The limits being per user namespace allow contexts where
    the limit is 0 and security conscious folks can remove from their
    threat analysis the code used to manage namespaces (as they have
    historically done as it is root only). At the same time the limits being
    per user namespace allow other parts of the system to use namespaces.

    Namespaces are increasingly being used in application sand boxing
    scenarios so an all or nothing disable for the entire system for the
    security conscious folks makes increasing use of these sandboxes
    impossible.

    There is also added a limit on the maximum number of mounts present in
    a single mount namespace. It is nontrivial to guess what a reasonable
    system wide limit on the number of mount structures in the kernel would
    be, especially as it varies based on how a system is using
    containers. A limit on the number of mounts in a mount namespace
    however is much easier to understand and set. In most cases in
    practice only about 1000 mounts are used. Given that some autofs
    scenarios have the potential to be 30,000 to 50,000 mounts I have set
    the default limit for the number of mounts at 100,000 which is well
    above every known set of users but low enough that the mount hash
    tables don't degrade unreasonably.

    These limits are a start. I expect this establishes a pattern that
    other limits for resources that namespaces use will follow. There has
    been interest in making inotify event limits per user per user
    namespace as well as interest expressed in making details about what
    is going on in the kernel more visible"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (28 commits)
    autofs: Fix automounts by using current_real_cred()->uid
    mnt: Add a per mount namespace limit on the number of mounts
    netns: move {inc,dec}_net_namespaces into #ifdef
    nsfs: Simplify __ns_get_path
    tools/testing: add a test to check nsfs ioctl-s
    nsfs: add ioctl to get a parent namespace
    nsfs: add ioctl to get an owning user namespace for ns file descriptor
    kernel: add a helper to get an owning user namespace for a namespace
    devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts
    devpts: Remove sync_filesystems
    devpts: Make devpts_kill_sb safe if fsi is NULL
    devpts: Simplify devpts_mount by using mount_nodev
    devpts: Move the creation of /dev/pts/ptmx into fill_super
    devpts: Move parse_mount_options into fill_super
    userns: When the per user per user namespace limit is reached return ENOSPC
    userns; Document per user per user namespace limits.
    mntns: Add a limit on the number of mount namespaces.
    netns: Add a limit on the number of net namespaces
    cgroupns: Add a limit on the number of cgroup namespaces
    ipcns: Add a limit on the number of ipc namespaces
    ...

    Linus Torvalds
     

30 Sep, 2016

1 commit


29 Sep, 2016

1 commit


28 Sep, 2016

1 commit

  • Pull cgroup fixes from Tejun Heo:
    "Three late fixes for cgroup: Two cpuset ones, one trivial and the
    other pretty obscure, and a cgroup core fix for a bug which impacts
    cgroup v2 namespace users"

    * 'for-4.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: fix invalid controller enable rejections with cgroup namespace
    cpuset: fix non static symbol warning
    cpuset: handle race between CPU hotplug and cpuset_hotplug_work

    Linus Torvalds
     

24 Sep, 2016

1 commit

  • On the v2 hierarchy, "cgroup.subtree_control" rejects controller
    enables if the cgroup has processes in it. The enforcement of this
    logic assumes that the cgroup wouldn't have any css_sets associated
    with it if there are no tasks in the cgroup, which is no longer true
    since a79a908fd2b0 ("cgroup: introduce cgroup namespaces").

    When a cgroup namespace is created, it pins the css_set of the
    creating task to use it as the root css_set of the namespace. This
    extra reference stays as long as the namespace is around and makes
    "cgroup.subtree_control" think that the namespace root cgroup is not
    empty even when it is and thus reject controller enables.

    Fix it by making cgroup_subtree_control() walk and test emptiness of
    each css_set instead of testing whether the list_head is empty.
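
    The difference between the two checks can be sketched in userspace C.
    This is only an illustration under simplifying assumptions: struct cset,
    nr_tasks and both function names are hypothetical stand-ins, not the
    kernel's css_set or its list handling.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a css_set linked to a cgroup. */
struct cset {
	int nr_tasks;        /* tasks currently using this set */
	struct cset *next;   /* next set linked to the same cgroup */
};

/* Old-style check: any linked set at all counts as "has processes". */
int has_tasks_by_list(const struct cset *head)
{
	return head != NULL;
}

/* Fixed-style check: walk the sets and look for one that holds tasks. */
int has_tasks_by_walk(const struct cset *head)
{
	for (const struct cset *c = head; c; c = c->next)
		if (c->nr_tasks > 0)
			return 1;
	return 0;
}
```

    A set that is only pinned by a namespace (nr_tasks == 0) makes the first
    check report a populated cgroup, while the walk reports it as empty.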

    While at it, update the comment of cgroup_task_count() to indicate
    that the returned value may be higher than the number of tasks, which
    has always been true due to temporary references and doesn't break
    anything.

    Signed-off-by: Tejun Heo
    Reported-by: Evgeny Vereshchagin
    Cc: Serge E. Hallyn
    Cc: Aditya Kali
    Cc: Eric W. Biederman
    Cc: stable@vger.kernel.org # v4.6+
    Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces")
    Link: https://github.com/systemd/systemd/pull/3589#issuecomment-249089541

    Tejun Heo
     

23 Sep, 2016

3 commits

  • From: Andrey Vagin

    Each namespace has an owning user namespace and currently there is no way
    to discover these relationships.

    Pid and user namespaces are hierarchical. There is no way to discover
    parent-child relationships either.

    Why might we want to know the relationships between namespaces?

    One use would be visualization, in order to understand the running
    system. Another would be to answer the question: what capability does
    process X have to perform operations on a resource governed by namespace
    Y?

    One more use case (which is usually called abnormal) is checkpoint/restart.
    In CRIU we are going to dump and restore nested namespaces.

    There [1] was a discussion about which interface to choose to determine
    relationships between namespaces.

    Eric suggested to add two ioctl-s [2]:
    > Grumble, Grumble. I think this may actually a case for creating ioctls
    > for these two cases. Now that random nsfs file descriptors are bind
    > mountable the original reason for using proc files is not as pressing.
    >
    > One ioctl for the user namespace that owns a file descriptor.
    > One ioctl for the parent namespace of a namespace file descriptor.

    Here is an implementation of these ioctl-s.

    $ man man7/namespaces.7
    ...
    Since Linux 4.X, the following ioctl(2) calls are supported for
    namespace file descriptors. The correct syntax is:

    fd = ioctl(ns_fd, ioctl_type);

    where ioctl_type is one of the following:

    NS_GET_USERNS
    Returns a file descriptor that refers to an owning user namespace.

    NS_GET_PARENT
    Returns a file descriptor that refers to a parent namespace.
    This ioctl(2) can be used for pid and user namespaces. For
    user namespaces, NS_GET_PARENT and NS_GET_USERNS have the same
    meaning.

    In addition to generic ioctl(2) errors, the following specific ones
    can occur:

    EINVAL NS_GET_PARENT was called for a nonhierarchical namespace.

    EPERM The requested namespace is outside of the current namespace
    scope.
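
    A usage sketch for the calls documented above. The helper name
    ns_owner_fd is hypothetical, and the example assumes NS_GET_USERNS is
    provided by <linux/nsfs.h>, as it is in mainline kernels that carry
    these ioctls. Error handling is kept minimal.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nsfs.h>   /* NS_GET_USERNS, NS_GET_PARENT */

/*
 * Open a namespace file such as /proc/self/ns/pid and ask for the user
 * namespace that owns it. Returns a new file descriptor on success, or
 * -1 with errno set (for example EPERM when the owner is out of scope).
 */
int ns_owner_fd(const char *ns_path)
{
	int ns_fd = open(ns_path, O_RDONLY);
	int owner_fd;
	int saved_errno;

	if (ns_fd < 0)
		return -1;

	owner_fd = ioctl(ns_fd, NS_GET_USERNS);

	/* close() may overwrite errno, so preserve the ioctl's value. */
	saved_errno = errno;
	close(ns_fd);
	errno = saved_errno;

	return owner_fd;
}
```

    Passing NS_GET_PARENT instead would request the parent namespace, which
    per the text above is only meaningful for pid and user namespaces.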

    [1] https://lkml.org/lkml/2016/7/6/158
    [2] https://lkml.org/lkml/2016/7/9/101

    Changes for v2:
    * don't return ENOENT for init_user_ns and init_pid_ns. There is nothing
    outside of the init namespace, so we can return EPERM in this case too.
    > The fewer special cases the easier the code is to get
    > correct, and the easier it is to read. // Eric

    Changes for v3:
    * rename ns->get_owner() to ns->owner(). get_* usually means that it
    grabs a reference.

    Cc: "Eric W. Biederman"
    Cc: James Bottomley
    Cc: "Michael Kerrisk (man-pages)"
    Cc: "W. Trevor King"
    Cc: Alexander Viro
    Cc: Serge Hallyn

    Eric W. Biederman
     
  • Return -EPERM if an owning user namespace is outside of a process's
    current user namespace.

    v2: In the first version ns_get_owner returned ENOENT for init_user_ns.
    This special case was removed from this version. There is nothing
    outside of init_user_ns, so we can return EPERM.
    v3: rename ns->get_owner() to ns->owner(). get_* usually means that it
    grabs a reference.

    Acked-by: Serge Hallyn
    Signed-off-by: Andrei Vagin
    Signed-off-by: Eric W. Biederman

    Andrey Vagin
     
  • The current error codes returned when the per user per user
    namespace limits are hit (EINVAL, EUSERS, and ENFILE) are wrong. I
    asked for advice on linux-api and it was made clear that those were
    the wrong error codes, but a correct error code was not suggested.

    The best general error code I have found for hitting a resource limit
    is ENOSPC. It is not perfect but as it is unambiguous it will serve
    until someone comes up with a better error code.
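
    A minimal sketch of a limit check that reports ENOSPC. The struct and
    function names are hypothetical, and charging every ancestor namespace
    on the way up is an assumption made for this illustration rather than
    something this commit message states.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical per user, per user namespace counter. 'parent' points at
 * the same user's counter in the parent user namespace (NULL at the top).
 */
struct ucount_sketch {
	int count;
	int max;
	struct ucount_sketch *parent;
};

/* Charge one object at every level; fail with -ENOSPC if any is full. */
int ucount_inc(struct ucount_sketch *uc)
{
	struct ucount_sketch *iter, *undo;

	for (iter = uc; iter; iter = iter->parent) {
		if (iter->count >= iter->max)
			goto fail;
		iter->count++;
	}
	return 0;

fail:
	/* Roll back the levels that were already charged. */
	for (undo = uc; undo != iter; undo = undo->parent)
		undo->count--;
	return -ENOSPC;
}
```

    With max set to 0 at some level, every attempt below that level fails,
    which matches the "limit is 0" use described for security conscious
    setups.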

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

22 Sep, 2016

1 commit


20 Sep, 2016

1 commit

  • When a socket is cloned, the associated sock_cgroup_data is duplicated
    but not its reference on the cgroup. As a result, the cgroup reference
    count will underflow when both sockets are destroyed later on.
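
    The underflow can be reproduced with a small userspace model. All names
    here are hypothetical, and the take_ref flag merely contrasts "copy the
    pointer only" with "copy the pointer and take a reference", which is
    assumed to be the shape of the fix.

```c
#include <string.h>

struct cgrp { int refcnt; };
struct sock_sketch { struct cgrp *cgrp; };

void cgrp_get(struct cgrp *c) { c->refcnt++; }
void cgrp_put(struct cgrp *c) { c->refcnt--; }

/* A new socket takes its own reference on the cgroup. */
void sock_init(struct sock_sketch *sk, struct cgrp *c)
{
	sk->cgrp = c;
	cgrp_get(c);
}

/*
 * Cloning copies the whole struct, which duplicates the pointer only.
 * With take_ref == 0 the clone shares the original's single reference.
 */
void sock_clone(struct sock_sketch *dst, const struct sock_sketch *src,
		int take_ref)
{
	memcpy(dst, src, sizeof(*dst));
	if (take_ref)
		cgrp_get(dst->cgrp);
}

/* Every socket drops one reference when it is destroyed. */
void sock_destroy(struct sock_sketch *sk)
{
	cgrp_put(sk->cgrp);
	sk->cgrp = NULL;
}
```

    Destroying both the original and a clone made without a reference drops
    two references where only one was taken, so the count ends one lower
    than it started.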

    Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
    Link: http://lkml.kernel.org/r/20160914194846.11153-2-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Tejun Heo
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: [4.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

18 Aug, 2016

1 commit

  • The current percpu-rwsem read side is entirely free of serializing insns
    at the cost of having a synchronize_sched() in the write path.

    The latency of the synchronize_sched() is too high for cgroups. The
    commit 1ed1328792ff talks about the write path being a fairly cold path
    but this is not the case for Android which moves tasks to the foreground
    cgroup and back around binder IPC calls from foreground processes to
    background processes, so it is significantly hotter than human initiated
    operations.

    Switch cgroup_threadgroup_rwsem into the slow mode for now to avoid the
    problem; hopefully it should not be that slow after another commit:

    80127a39681b ("locking/percpu-rwsem: Optimize readers and reduce global impact").

    We could just add rcu_sync_enter() into cgroup_init() but we do not want
    another synchronize_sched() at boot time, so this patch adds the new helper
    which doesn't block but currently can only be called before the first use.

    Reported-by: John Stultz
    Reported-by: Dmitry Shmidt
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Colin Cross
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rom Lemarchand
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Todd Kjos
    Link: http://lkml.kernel.org/r/20160811165413.GA22807@redhat.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

10 Aug, 2016

2 commits

  • Debugging what goes wrong with cgroup setup can get hairy. Add
    tracepoints for cgroup hierarchy mount, cgroup creation/destruction
    and task migration operations for better visibility.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • cgroup_path() and friends used to format the path from the end and
    thus the resulting path usually didn't start at the start of the
    passed in buffer. Also, when the buffer was too small, the partial
    result was truncated from the head rather than tail and there was no
    way to tell how long the full path would be. These make the functions
    less robust and more awkward to use.

    With recent updates to kernfs_path(), cgroup_path() and friends can be
    made to behave in strlcpy() style.

    * cgroup_path(), cgroup_path_ns[_locked]() and task_cgroup_path() now
    always return the length of the full path. If buffer is too small,
    it contains nul terminated truncated output.

    * All users updated accordingly.
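
    The strlcpy() style contract described above can be sketched in
    userspace C. path_format and path_was_truncated are hypothetical names
    for this illustration; they are not the kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy as much of 'path' as fits into 'buf' (always NUL terminated when
 * buflen > 0) and return the length of the FULL path, so the caller can
 * tell how large a buffer would have been needed.
 */
size_t path_format(char *buf, size_t buflen, const char *path)
{
	size_t full = strlen(path);

	assert(buf != NULL || buflen == 0);

	if (buflen > 0) {
		size_t n = full < buflen - 1 ? full : buflen - 1;

		memcpy(buf, path, n);
		buf[n] = '\0';
	}
	return full;
}

/* The output was truncated if the full length did not fit with its NUL. */
int path_was_truncated(size_t ret, size_t buflen)
{
	return ret >= buflen;
}
```

    A caller compares the return value with the buffer size; on truncation
    the buffer still starts at the beginning of the path and is cut at the
    tail.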

    v2: cgroup_path() usage in kernel/sched/debug.c converted.

    Signed-off-by: Tejun Heo
    Acked-by: Greg Kroah-Hartman
    Cc: Serge Hallyn
    Cc: Peter Zijlstra

    Tejun Heo
     

09 Aug, 2016

1 commit


30 Jul, 2016

1 commit

  • Pull more cgroup updates from Tejun Heo:
    "I forgot to include the patches which got applied to for-4.7-fixes
    late during last cycle.

    Eric's three patches fix bugs introduced with the namespace support"

    * 'for-4.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroupns: Only allow creation of hierarchies in the initial cgroup namespace
    cgroupns: Close race between cgroup_post_fork and copy_cgroup_ns
    cgroupns: Fix the locking in copy_cgroup_ns

    Linus Torvalds
     

28 Jul, 2016

1 commit

  • Pull networking updates from David Miller:

    1) Unified UDP encapsulation offload methods for drivers, from
    Alexander Duyck.

    2) Make DSA binding more sane, from Andrew Lunn.

    3) Support QCA9888 chips in ath10k, from Anilkumar Kolli.

    4) Several workqueue usage cleanups, from Bhaktipriya Shridhar.

    5) Add XDP (eXpress Data Path), essentially running BPF programs on RX
    packets as soon as the device sees them, with the option to mirror
    the packet on TX via the same interface. From Brenden Blanco and
    others.

    6) Allow qdisc/class stats dumps to run lockless, from Eric Dumazet.

    7) Add VLAN support to b53 and bcm_sf2, from Florian Fainelli.

    8) Simplify netlink conntrack entry layout, from Florian Westphal.

    9) Add ipv4 forwarding support to mlxsw spectrum driver, from Ido
    Schimmel, Yotam Gigi, and Jiri Pirko.

    10) Add SKB array infrastructure and convert tun and macvtap over to it.
    From Michael S Tsirkin and Jason Wang.

    11) Support qdisc packet injection in pktgen, from John Fastabend.

    12) Add neighbour monitoring framework to TIPC, from Jon Paul Maloy.

    13) Add NV congestion control support to TCP, from Lawrence Brakmo.

    14) Add GSO support to SCTP, from Marcelo Ricardo Leitner.

    15) Allow GRO and RPS to function on macsec devices, from Paolo Abeni.

    16) Support MPLS over IPV4, from Simon Horman.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits)
    xgene: Fix build warning with ACPI disabled.
    be2net: perform temperature query in adapter regardless of its interface state
    l2tp: Correctly return -EBADF from pppol2tp_getname.
    net/mlx5_core/health: Remove deprecated create_singlethread_workqueue
    net: ipmr/ip6mr: update lastuse on entry change
    macsec: ensure rx_sa is set when validation is disabled
    tipc: dump monitor attributes
    tipc: add a function to get the bearer name
    tipc: get monitor threshold for the cluster
    tipc: make cluster size threshold for monitoring configurable
    tipc: introduce constants for tipc address validation
    net: neigh: disallow transition to NUD_STALE if lladdr is unchanged in neigh_update()
    MAINTAINERS: xgene: Add driver and documentation path
    Documentation: dtb: xgene: Add MDIO node
    dtb: xgene: Add MDIO node
    drivers: net: xgene: ethtool: Use phy_ethtool_gset and sset
    drivers: net: xgene: Use exported functions
    drivers: net: xgene: Enable MDIO driver
    drivers: net: xgene: Add backward compatibility
    drivers: net: phy: xgene: Add MDIO driver
    ...

    Linus Torvalds
     

27 Jul, 2016

1 commit

  • Pull cgroup updates from Tejun Heo:
    "Nothing too exciting.

    - updates to the pids controller so that pid limit breaches can be
    noticed and monitored from userland.

    - cleanups and non-critical bug fixes"

    * 'for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: remove duplicated include from cgroup.c
    cgroup: Use lld instead of ld when printing pids controller events_limit
    cgroup: Add pids controller event when fork fails because of pid limit
    cgroup: allow NULL return from ss->css_alloc()
    cgroup: remove unnecessary 0 check from css_from_id()
    cgroup: fix idr leak for the first cgroup root

    Linus Torvalds
     

20 Jul, 2016

1 commit


15 Jul, 2016

3 commits

  • Unprivileged users can't use hierarchies if they create them as they do not
    have privileges to the root directory.

    Which means the only thing a hierarchy created by an unprivileged user
    is good for is expanding the number of cgroup links in every css_set,
    which is a DOS attack.

    We could allow hierarchies to be created in namespaces in the initial
    user namespace. Unfortunately there is only a single namespace for
    the names of hierarchies, so that is likely to create more confusion
    than not.

    So do the simple thing and restrict hierarchy creation to the initial
    cgroup namespace.

    Cc: stable@vger.kernel.org
    Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces")
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Tejun Heo

    Eric W. Biederman
     
  • In most code paths involving cgroup migration cgroup_threadgroup_rwsem
    is taken. There are two exceptions:

    - remove_tasks_in_empty_cpuset calls cgroup_transfer_tasks
    - vhost_attach_cgroups_work calls cgroup_attach_task_all

    With cgroup_threadgroup_rwsem held it is guaranteed that cgroup_post_fork
    and copy_cgroup_ns will reference the same css_set from the process calling
    fork.

    Without such an interlock the process after fork could reference one
    css_set from its new cgroup namespace and another css_set from
    task->cgroups, which semantically is nonsensical.

    Cc: stable@vger.kernel.org
    Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces")
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Tejun Heo

    Eric W. Biederman
     
  • If "clone(CLONE_NEWCGROUP...)" is called it results in a nice lockdep
    valid splat.

    In __cgroup_proc_write the lock ordering is:
    cgroup_mutex -- through cgroup_kn_lock_live
    cgroup_threadgroup_rwsem

    In copy_process the guts of clone the lock ordering is:
    cgroup_threadgroup_rwsem -- through threadgroup_change_begin
    cgroup_mutex -- through copy_namespaces -- copy_cgroup_ns

    lockdep reports some different call chains for the first ordering of
    cgroup_mutex and cgroup_threadgroup_rwsem but it is harder to trace.
    This is most definitely deadlock potential under the right
    circumstances.

    Fix this by skipping the cgroup_mutex and making the locking in
    copy_cgroup_ns mirror the locking in cgroup_post_fork which also runs
    during fork under the cgroup_threadgroup_rwsem.

    Cc: stable@vger.kernel.org
    Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces")
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Tejun Heo

    Eric W. Biederman
     

02 Jul, 2016

1 commit

  • Add a helper function to get a cgroup2 from a fd. It will be
    stored in a bpf array (BPF_MAP_TYPE_CGROUP_ARRAY) which will
    be introduced in the later patch.

    Signed-off-by: Martin KaFai Lau
    Cc: Alexei Starovoitov
    Cc: Daniel Borkmann
    Cc: Tejun Heo
    Acked-by: Tejun Heo
    Signed-off-by: David S. Miller

    Martin KaFai Lau
     

24 Jun, 2016

1 commit

  • While testing the deadline scheduler + cgroup setup I hit this
    warning.

    [ 132.612935] ------------[ cut here ]------------
    [ 132.612951] WARNING: CPU: 5 PID: 0 at kernel/softirq.c:150 __local_bh_enable_ip+0x6b/0x80
    [ 132.612952] Modules linked in: (a ton of modules...)
    [ 132.612981] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.7.0-rc2 #2
    [ 132.612981] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
    [ 132.612982] 0000000000000086 45c8bb5effdd088b ffff88013fd43da0 ffffffff813d229e
    [ 132.612984] 0000000000000000 0000000000000000 ffff88013fd43de0 ffffffff810a652b
    [ 132.612985] 00000096811387b5 0000000000000200 ffff8800bab29d80 ffff880034c54c00
    [ 132.612986] Call Trace:
    [ 132.612987] [] dump_stack+0x63/0x85
    [ 132.612994] [] __warn+0xcb/0xf0
    [ 132.612997] [] ? push_dl_task.part.32+0x170/0x170
    [ 132.612999] [] warn_slowpath_null+0x1d/0x20
    [ 132.613000] [] __local_bh_enable_ip+0x6b/0x80
    [ 132.613008] [] _raw_write_unlock_bh+0x1a/0x20
    [ 132.613010] [] _raw_spin_unlock_bh+0xe/0x10
    [ 132.613015] [] put_css_set+0x5c/0x60
    [ 132.613016] [] cgroup_free+0x7f/0xa0
    [ 132.613017] [] __put_task_struct+0x42/0x140
    [ 132.613018] [] dl_task_timer+0xca/0x250
    [ 132.613027] [] ? push_dl_task.part.32+0x170/0x170
    [ 132.613030] [] __hrtimer_run_queues+0xee/0x270
    [ 132.613031] [] hrtimer_interrupt+0xa8/0x190
    [ 132.613034] [] local_apic_timer_interrupt+0x38/0x60
    [ 132.613035] [] smp_apic_timer_interrupt+0x3d/0x50
    [ 132.613037] [] apic_timer_interrupt+0x8c/0xa0
    [ 132.613038] [] ? native_safe_halt+0x6/0x10
    [ 132.613043] [] default_idle+0x1e/0xd0
    [ 132.613044] [] arch_cpu_idle+0xf/0x20
    [ 132.613046] [] default_idle_call+0x2a/0x40
    [ 132.613047] [] cpu_startup_entry+0x2e7/0x340
    [ 132.613048] [] start_secondary+0x155/0x190
    [ 132.613049] ---[ end trace f91934d162ce9977 ]---

    The warning is triggered by spin_(lock|unlock)_bh(&css_set_lock) being
    called in interrupt context. Convert the spin_lock_bh to
    spin_lock_irq(save) to avoid this problem, and other problems of sharing
    a spinlock with an interrupt.

    Cc: Tejun Heo
    Cc: Li Zefan
    Cc: Johannes Weiner
    Cc: Juri Lelli
    Cc: Steven Rostedt
    Cc: cgroups@vger.kernel.org
    Cc: stable@vger.kernel.org # 4.5+
    Cc: linux-kernel@vger.kernel.org
    Reviewed-by: Rik van Riel
    Reviewed-by: "Luis Claudio R. Goncalves"
    Signed-off-by: Daniel Bristot de Oliveira
    Acked-by: Zefan Li
    Signed-off-by: Tejun Heo

    Daniel Bristot de Oliveira
     

22 Jun, 2016

1 commit

  • cgroup core expected css_alloc to return an ERR_PTR value on failure
    and caused NULL deref if it returned NULL. It's an easy mistake to
    make from an alloc function and there's no ambiguity in what's being
    indicated. Update css_create() so that it interprets NULL return from
    css_alloc as -ENOMEM.
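
    The convention can be illustrated with userspace imitations of the
    kernel's error pointer helpers. The lower-case helper names and
    alloc_result_to_errno are hypothetical; only the idea of treating NULL
    as -ENOMEM comes from the text above.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace imitations of ERR_PTR(), PTR_ERR() and IS_ERR(). */
void *err_ptr(long err) { return (void *)(intptr_t)err; }
long ptr_err(const void *p) { return (long)(intptr_t)p; }
int is_err(const void *p)
{
	/* Error codes live in the top MAX_ERRNO values of the address space. */
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/*
 * Normalize the result of an allocator callback:
 * NULL means out of memory, an error pointer carries its own errno,
 * and anything else is a valid object.
 */
long alloc_result_to_errno(const void *p)
{
	if (p == NULL)
		return -ENOMEM;
	if (is_err(p))
		return ptr_err(p);
	return 0;
}
```

    A caller can then check a single return code instead of crashing on a
    NULL that slipped past an IS_ERR() test.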

    Signed-off-by: Tejun Heo

    Tejun Heo
     

18 Jun, 2016

2 commits

  • css_idr allocation starts at 1, so index 0 will never point to an
    item. css_from_id() currently filters that before asking idr_find(),
    but idr_find() would also just return NULL, so this is not needed.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Tejun Heo

    Johannes Weiner
     
  • The valid cgroup hierarchy ID range includes 0, so we can't filter for
    positive numbers when freeing it, or it'll leak the first ID. No big
    deal, just disruptive when reading the code.

    The ID is freed during error handling and when the reference count
    hits zero, so the double-free test is not necessary; remove it.
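
    Why a "positive only" filter leaks the first ID can be shown with a toy
    allocator. The pool type and function names are hypothetical, and the
    range check in id_free exists only to keep the sketch memory safe; it
    does not claim to mirror the kernel's IDR code.

```c
#define NR_IDS 8

/* Hypothetical fixed-size ID pool; used[i] != 0 means ID i is taken. */
struct id_pool {
	unsigned char used[NR_IDS];
};

/* Hand out the lowest free ID (0 is a valid ID), or -1 if exhausted. */
int id_alloc(struct id_pool *p)
{
	for (int i = 0; i < NR_IDS; i++) {
		if (!p->used[i]) {
			p->used[i] = 1;
			return i;
		}
	}
	return -1;
}

/* Leaky release: the "positive only" filter silently skips ID 0. */
void id_free_positive_only(struct id_pool *p, int id)
{
	if (id > 0 && id < NR_IDS)
		p->used[id] = 0;
}

/* Release without the filter: every in-range ID, including 0, is freed. */
void id_free(struct id_pool *p, int id)
{
	if (id >= 0 && id < NR_IDS)
		p->used[id] = 0;
}
```

    After the leaky release of ID 0, the next allocation returns 1 and ID 0
    is never handed out again.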

    Signed-off-by: Johannes Weiner
    Signed-off-by: Tejun Heo

    Johannes Weiner
     

17 Jun, 2016

1 commit

  • If percpu_ref initialization fails during css_create(), the free path
    can end up trying to free css->id of zero. As ID 0 is unused, it
    doesn't cause a critical breakage but it does trigger a warning
    message. Fix it by setting css->id to -1 from init_and_link_css().

    Signed-off-by: Tejun Heo
    Cc: Wenwei Tao
    Fixes: 01e586598b22 ("cgroup: release css->id after css_free")
    Cc: stable@vger.kernel.org # v4.0+
    Signed-off-by: Tejun Heo

    Tejun Heo
     

27 May, 2016

1 commit

  • When css creation fails, we remove the css id and exit the percpu_ref
    before calling css_free_rcu_fn, but css_free_work_fn does both again,
    so they are redundant. Removing the css id twice is especially
    problematic, since the id may be assigned to another css after the
    first removal.

    tj: This was broken by two commits updating the free path without
    synchronizing the creation failure path. This can be easily
    triggered by trying to create more than 64k memory cgroups.

    Signed-off-by: Wenwei Tao
    Signed-off-by: Tejun Heo
    Cc: Vladimir Davydov
    Fixes: 9a1049da9bd2 ("percpu-refcount: require percpu_ref to be exited explicitly")
    Fixes: 01e586598b22 ("cgroup: release css->id after css_free")
    Cc: stable@vger.kernel.org # v3.17+

    Wenwei Tao
     

12 May, 2016

1 commit

  • commit 4f41fc59620f ("cgroup, kernfs: make mountinfo
    show properly scoped path for cgroup namespaces")
    added the following compile warning:

    kernel/cgroup.c: In function ‘cgroup_show_path’:
    kernel/cgroup.c:1634:15: warning: unused variable ‘ret’ [-Wunused-variable]
    int len = 0, ret = 0;
    ^
    fix it.

    Fixes: 4f41fc59620f ("cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces")
    Signed-off-by: Felipe Balbi
    Signed-off-by: Tejun Heo

    Felipe Balbi
     

10 May, 2016

1 commit

  • Patch summary:

    When showing a cgroupfs entry in mountinfo, show the path of the mount
    root dentry relative to the reader's cgroup namespace root.

    Short explanation (courtesy of mkerrisk):

    If we create a new cgroup namespace, then we want both /proc/self/cgroup
    and /proc/self/mountinfo to show cgroup paths that are correctly
    virtualized with respect to the cgroup mount point. Previous to this
    patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo
    does not.

    Long version:

    When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup
    namespace, and then mounts a new instance of the freezer cgroup, the new
    mount will be rooted at /a/b. The root dentry field of the mountinfo
    entry will show '/a/b'.

    cat > /tmp/do1 << EOF
    mount -t cgroup -o freezer freezer /mnt
    grep freezer /proc/self/mountinfo
    EOF

    unshare -Gm bash /tmp/do1
    > 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
    > 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer

    The task's freezer cgroup entry in /proc/self/cgroup will simply show
    '/':

    grep freezer /proc/self/cgroup
    9:freezer:/

    If instead the same task simply bind mounts the /a/b cgroup directory,
    the resulting mountinfo entry will again show /a/b for the dentry root.
    However in this case the task will find its own cgroup at /mnt/a/b,
    not at /mnt:

    mount --bind /sys/fs/cgroup/freezer/a/b /mnt
    130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer

    In other words, there is no way for the task to know, based on what is
    in mountinfo, which cgroup directory is its own.
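
    For readers who want to inspect the two fields discussed here
    programmatically, the root and mount point columns of a mountinfo line
    can be extracted as follows. The function name and the 255 character
    field cap are assumptions of this sketch.

```c
#include <stdio.h>

/*
 * Pull the root (4th) and mount point (5th) fields out of one
 * /proc/PID/mountinfo line. The mount ID, parent ID and major:minor
 * fields are skipped. Returns 0 on success, -1 if the line does not
 * parse. Fields longer than 255 characters are cut off.
 */
int mountinfo_root_and_mnt(const char *line, char root[256], char mnt[256])
{
	if (sscanf(line, "%*d %*d %*s %255s %255s", root, mnt) != 2)
		return -1;
	return 0;
}
```

    Applied to the "/a/b /mnt" entries shown above, this yields the root
    "/a/b" and the mount point "/mnt" in both the fresh mount and the bind
    mount case, which is exactly why the two cannot be told apart.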

    Example (by mkerrisk):

    First, a little script to save some typing and verbiage:

    echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)"
    cat /proc/self/mountinfo | grep freezer |
    awk '{print "\tmountinfo:\t\t" $4 "\t" $5}'

    Create cgroup, place this shell into the cgroup, and look at the state
    of the /proc files:

    2653
    2653 # Our shell
    14254 # cat(1)
    /proc/self/cgroup: 10:freezer:/a/b
    mountinfo: / /sys/fs/cgroup/freezer

    Create a shell in new cgroup and mount namespaces. The act of creating
    a new cgroup namespace causes the process's current cgroups directories
    to become its cgroup root directories. (Here, I'm using my own version
    of the "unshare" utility, which takes the same options as the util-linux
    version):

    Look at the state of the /proc files:

    /proc/self/cgroup: 10:freezer:/
    mountinfo: / /sys/fs/cgroup/freezer

    The third entry in /proc/self/cgroup (the pathname of the cgroup inside
    the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which
    is rooted at /a/b in the outer namespace.

    However, the info in /proc/self/mountinfo is not for this cgroup
    namespace, since we are seeing a duplicate of the mount from the
    old mount namespace, and the info there does not correspond to the
    new cgroup namespace. However, trying to create a new mount still
    doesn't show us the right information in mountinfo:

    # propagating to other mountns
    /proc/self/cgroup: 7:freezer:/
    mountinfo: /a/b /mnt/freezer

    The act of creating a new cgroup namespace caused the process's
    current freezer directory, "/a/b", to become its cgroup freezer root
    directory. In other words, the pathname directory of the directory
    within the newly mounted cgroup filesystem should be "/",
    but mountinfo wrongly shows us "/a/b". The consequence of this is
    that the process in the cgroup namespace cannot correctly construct
    the pathname of its cgroup root directory from the information in
    /proc/PID/mountinfo.

    With this patch, the dentry root field in mountinfo is shown relative
    to the reader's cgroup namespace. So the same steps as above:

    /proc/self/cgroup: 10:freezer:/a/b
    mountinfo: / /sys/fs/cgroup/freezer
    /proc/self/cgroup: 10:freezer:/
    mountinfo: /../.. /sys/fs/cgroup/freezer
    /proc/self/cgroup: 10:freezer:/
    mountinfo: / /mnt/freezer

    cgroup.clone_children freezer.parent_freezing freezer.state tasks
    cgroup.procs freezer.self_freezing notify_on_release
    3164
    2653 # First shell that was placed in this cgroup
    3164 # Shell started by 'unshare'
    14197 # cat(1)
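
    The relative-root behaviour described above can be modelled in userspace.
    The following is only a sketch, not the kernel implementation: the helper
    name mountinfo_root() is hypothetical, and it assumes both arguments are
    absolute cgroup paths within the same hierarchy.

```python
import posixpath


def mountinfo_root(mount_root: str, ns_root: str) -> str:
    """Return mount_root as it would appear to a reader whose cgroup
    namespace is rooted at ns_root (hypothetical helper, userspace model)."""
    rel = posixpath.relpath(mount_root, ns_root)
    # relpath() yields "." when both paths are the same directory.
    return "/" if rel == "." else "/" + rel


# Old mount of the hierarchy root, read from inside the namespace at /a/b.
print(mountinfo_root("/", "/a/b"))     # /../..
# New mount made inside the namespace: its root is the namespace root.
print(mountinfo_root("/a/b", "/a/b"))  # /
```

    The two calls correspond to the /sys/fs/cgroup/freezer and /mnt/freezer
    lines in the output above.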

    Signed-off-by: Serge Hallyn
    Tested-by: Michael Kerrisk
    Acked-by: Michael Kerrisk
    Signed-off-by: Tejun Heo

    Serge E. Hallyn
     

26 Apr, 2016

1 commit

  • Since e93ad19d0564 ("cpuset: make mm migration asynchronous"), cpuset
    kicks off asynchronous NUMA node migration if necessary during task
    migration and flushes it from cpuset_post_attach_flush() which is
    called at the end of __cgroup_procs_write(). This is to avoid
    performing migration with cgroup_threadgroup_rwsem write-locked which
    can lead to deadlock through dependency on kworker creation.

    memcg has a similar issue with charge moving, so let's convert it to
    an official callback rather than the current one-off cpuset specific
    function. This patch adds cgroup_subsys->post_attach callback and
    makes cpuset register cpuset_post_attach_flush() as its ->post_attach.

    The conversion is mostly one-to-one except that the new callback is
    called under cgroup_mutex. This is to guarantee that no other
    migration operations are started before ->post_attach callbacks are
    finished. cgroup_mutex is one of the outermost mutexes in the system
    and has never been and shouldn't be a problem. We can add specialized
    synchronization around __cgroup_procs_write() but I don't think
    there's any noticeable benefit.
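
    As a rough userspace model of the callback ordering described above
    (not the kernel code): Subsys and procs_write() are hypothetical names,
    and a threading.Lock stands in for cgroup_mutex. Subsystems without a
    post_attach callback are simply skipped.

```python
import threading
from typing import Callable, List, Optional


class Subsys:
    """Hypothetical stand-in for a cgroup subsystem with an optional hook."""

    def __init__(self, name: str,
                 post_attach: Optional[Callable[[], None]] = None):
        self.name = name
        self.post_attach = post_attach


cgroup_mutex = threading.Lock()  # stand-in for the kernel's cgroup_mutex


def procs_write(subsystems: List[Subsys], migrate: Callable[[], None]) -> None:
    """Run the migration, then every registered post_attach callback,
    all while holding the mutex so no other migration can start."""
    with cgroup_mutex:
        migrate()
        for ss in subsystems:
            if ss.post_attach is not None:
                ss.post_attach()


log: List[str] = []
subsystems = [
    Subsys("cpuset", post_attach=lambda: log.append("cpuset: flush")),
    Subsys("freezer"),  # registers no post_attach callback
]
procs_write(subsystems, migrate=lambda: log.append("migrate"))
print(log)
```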

    Signed-off-by: Tejun Heo
    Cc: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: # 4.4+ prerequisite for the next patch

    Tejun Heo
     

22 Mar, 2016

1 commit

  • Pull cgroup namespace support from Tejun Heo:
    "These are changes to implement namespace support for cgroup which has
    been pending for quite some time now. It is very straight-forward and
    only affects what part of cgroup hierarchies are visible.

    After unsharing, mounting a cgroup fs will be scoped to the cgroups
    the task belonged to at the time of unsharing and the cgroup paths
    exposed to userland would be adjusted accordingly"

    * 'for-4.6-ns' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: fix and restructure error handling in copy_cgroup_ns()
    cgroup: fix alloc_cgroup_ns() error handling in copy_cgroup_ns()
    Add FS_USERNS_FLAG to cgroup fs
    cgroup: Add documentation for cgroup namespaces
    cgroup: mount cgroupns-root when inside non-init cgroupns
    kernfs: define kernfs_node_dentry
    cgroup: cgroup namespace setns support
    cgroup: introduce cgroup namespaces
    sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
    kernfs: Add API to generate relative kernfs path

    Linus Torvalds
     

17 Mar, 2016

2 commits

  • When all subsystems are disabled, gcc notices that cgroup_subsys_enabled_key
    is a zero-length array and that any access to it must be out of bounds:

    In file included from ../include/linux/cgroup.h:19:0,
    from ../kernel/cgroup.c:31:
    ../kernel/cgroup.c: In function 'cgroup_add_cftypes':
    ../kernel/cgroup.c:261:53: error: array subscript is above array bounds [-Werror=array-bounds]
    return static_key_enabled(cgroup_subsys_enabled_key[ssid]);
    ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~
    ../include/linux/jump_label.h:271:40: note: in definition of macro 'static_key_enabled'
    static_key_count((struct static_key *)x) > 0; \
    ^

    We should never call the function in this particular case, so this is
    not a bug. In order to silence the warning, this adds an explicit check
    for the CGROUP_SUBSYS_COUNT==0 case.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Tejun Heo

    Arnd Bergmann
     
  • Before 2e91fa7f6d45 ("cgroup: keep zombies associated with their
    original cgroups"), all dead tasks were associated with init_css_set.
    If a zombie task is requested for migration, while migration prep
    operations would still be performed on init_css_set, the actual
    migration would ignore zombie tasks. As init_css_set is always valid,
    this worked fine.

    However, after 2e91fa7f6d45, a zombie task stays with the css_set it was
    associated with at the time of death. Let's say a task T is associated
    with cgroup A on hierarchy H-1 and cgroup B on hierarchy H-2. After T
    becomes a zombie, it would still remain associated with A and B. If A
    only contains zombie tasks, it can be removed. On removal, A gets
    marked offline but stays pinned until all zombies are drained. At
    this point, if migration is initiated on T to a cgroup C on hierarchy
    H-2, migration path would try to prepare T's css_set for migration and
    trigger the following.

    WARNING: CPU: 0 PID: 1576 at kernel/cgroup.c:474 cgroup_get+0x121/0x160()
    CPU: 0 PID: 1576 Comm: bash Not tainted 4.4.0-work+ #289
    ...
    Call Trace:
    dump_stack+0x4e/0x82
    warn_slowpath_common+0x78/0xb0
    warn_slowpath_null+0x15/0x20
    cgroup_get+0x121/0x160
    link_css_set+0x7b/0x90
    find_css_set+0x3bc/0x5e0
    cgroup_migrate_prepare_dst+0x89/0x1f0
    cgroup_attach_task+0x157/0x230
    __cgroup_procs_write+0x2b7/0x470
    cgroup_tasks_write+0xc/0x10
    cgroup_file_write+0x30/0x1b0
    kernfs_fop_write+0x13c/0x180
    __vfs_write+0x23/0xe0
    vfs_write+0xa4/0x1a0
    SyS_write+0x44/0xa0
    entry_SYSCALL_64_fastpath+0x12/0x6f

    It doesn't make sense to prepare migration for css_sets pointing to
    dead cgroups as they are guaranteed to contain only zombies which are
    ignored later during migration. This patch makes cgroup destruction
    path mark all affected css_sets as dead and updates the migration path
    to ignore them during preparation.
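
    A minimal userspace sketch of that idea, under stated assumptions: the
    Cgroup and CssSet classes and both function names below are simplified,
    hypothetical models, and only the "mark dead on destruction, skip when
    preparing migration" logic is represented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CssSet:
    name: str
    dead: bool = False  # set once a cgroup this css_set points to is destroyed


@dataclass
class Cgroup:
    name: str
    csets: List[CssSet] = field(default_factory=list)


def destroy_cgroup(cgrp: Cgroup) -> None:
    """Mark every css_set linked to the dying cgroup as dead."""
    for cset in cgrp.csets:
        cset.dead = True


def migrate_add_src(src_cset: CssSet, preloaded: List[CssSet]) -> None:
    """Queue a source css_set for migration prep, ignoring dead ones,
    which can only contain zombies."""
    if src_cset.dead:
        return
    preloaded.append(src_cset)


zombie_cset = CssSet("A+B")  # css_set of zombie task T
live_cset = CssSet("D+B")
cgroup_a = Cgroup("A", csets=[zombie_cset])

destroy_cgroup(cgroup_a)  # A held only zombies, so it could be removed

preloaded: List[CssSet] = []
migrate_add_src(zombie_cset, preloaded)
migrate_add_src(live_cset, preloaded)
print([c.name for c in preloaded])
```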

    Signed-off-by: Tejun Heo
    Fixes: 2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")
    Cc: stable@vger.kernel.org # v4.4+

    Tejun Heo
     

09 Mar, 2016

4 commits

  • Some controllers, perf_event for now and possibly freezer in the
    future, don't really make sense to control explicitly through
    "cgroup.subtree_control". For example, the primary role of perf_event
    is identifying the cgroups of tasks; however, because the controller
    also keeps a small amount of state per cgroup, it can't be replaced
    with simple cgroup membership tests.

    This patch implements cgroup_subsys->implicit_on_dfl flag. When set,
    the controller is implicitly enabled on all cgroups on the v2
    hierarchy so that utility type controllers such as perf_event can be
    enabled and function transparently.

    An implicit controller doesn't show up in "cgroup.controllers" or
    "cgroup.subtree_control", is exempt from the no-internal-process rule and
    can be stolen from the default hierarchy even if there are non-root
    csses.

    v2: Reimplemented on top of the recent updates to css handling and
    subsystem rebinding. Rebinding implicit subsystems is now a
    simple matter of exempting it from the busy subsystem check.
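
    The visibility rule for implicit controllers can be illustrated with a
    small userspace model. This is a sketch, not kernel code: the Controller
    class and the two helper functions are hypothetical, and the controller
    list is chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Controller:
    name: str
    implicit_on_dfl: bool = False


def visible_controllers(available: List[Controller]) -> List[str]:
    """Names that would be listed in "cgroup.controllers":
    implicit controllers are hidden."""
    return [c.name for c in available if not c.implicit_on_dfl]


def effective_controllers(available: List[Controller],
                          subtree_control: Set[str]) -> List[str]:
    """Controllers actually active on child cgroups: the ones enabled
    explicitly plus every implicit one."""
    return [c.name for c in available
            if c.implicit_on_dfl or c.name in subtree_control]


available = [
    Controller("memory"),
    Controller("io"),
    Controller("perf_event", implicit_on_dfl=True),
]
print(visible_controllers(available))
print(effective_controllers(available, {"memory"}))
```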

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • Migration can be multi-target on the default hierarchy when a
    controller is enabled - processes belonging to each child cgroup have
    to be moved to the child cgroup itself to refresh css association.

    This isn't a problem for cgroup_migrate_add_src() as each source
    css_set still maps to single source and target cgroups; however,
    cgroup_migrate_prepare_dst() is called once after all source css_sets
    are added and thus might not have a single destination cgroup. This
    is currently worked around by specifying NULL for @dst_cgrp and using
    the source's default cgroup as destination as the only multi-target
    migration in use is self-targeting. While this works, it's subtle
    and clunky.

    As all target cgroups are already specified while preparing the source
    css_sets, this clunkiness can easily be removed by recording the
    target cgroup in each source css_set. This patch adds
    css_set->mg_dst_cgrp which is recorded on cgroup_migrate_add_src() and
    used by cgroup_migrate_prepare_dst(). This also makes migration code
    ready for arbitrary multi-target migration.
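
    The per-source bookkeeping can be sketched in userspace as follows. This
    is an illustrative model only: the CssSet class and the two functions
    borrow the kernel names but are hypothetical simplifications, and cgroups
    are represented by plain path strings.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CssSet:
    name: str
    mg_dst_cgrp: Optional[str] = None  # target cgroup recorded at add_src time


def migrate_add_src(cset: CssSet, dst_cgrp: str,
                    preloaded: List[CssSet]) -> None:
    """Queue a source css_set and remember where its tasks should go."""
    cset.mg_dst_cgrp = dst_cgrp
    preloaded.append(cset)


def migrate_prepare_dst(preloaded: List[CssSet]) -> Dict[str, str]:
    """Resolve a destination for each source from its own mg_dst_cgrp,
    so no single @dst_cgrp argument is needed."""
    return {cset.name: cset.mg_dst_cgrp for cset in preloaded}


# Enabling a controller on parent P: tasks in each child move to that child.
preloaded: List[CssSet] = []
migrate_add_src(CssSet("cset-c1"), "/P/c1", preloaded)
migrate_add_src(CssSet("cset-c2"), "/P/c2", preloaded)
print(migrate_prepare_dst(preloaded))
```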

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • On the default hierarchy, a migration can be multi-source and/or
    multi-destination. cgroup_taskset_migrate() used to incorrectly
    assume single destination cgroup but the bug has been fixed by
    1f7dd3e5a6e4 ("cgroup: fix handling of multi-destination migration
    from subtree_control enabling").

    Since the commit, @dst_cgrp to cgroup[_taskset]_migrate() is only used
    to determine which subsystems are affected or which cgroup_root the
    migration is taking place in. As such, @dst_cgrp is misleading. This
    patch replaces @dst_cgrp with @root.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • cgroup_migrate_prepare_dst() verifies whether the destination cgroup
    is allowable; however, the test doesn't really belong there. It's too
    deep and common in the stack and as a result the test itself is gated
    by another test.

    Separate the test out into cgroup_may_migrate_to() and update
    cgroup_attach_task() and cgroup_transfer_tasks() to perform the test
    directly. This doesn't cause any behavior differences.

    Signed-off-by: Tejun Heo

    Tejun Heo