Eric Lee / smarc-fsl-linux-kernel

09 Feb, 2017

1 commit

1d88791d5 cgroup: don't online subsystems before cgroup_name/path() are operational ... Browse Code »

commit 07cd12945551b63ecb1a349d50a6d69d1d6feb4a upstream.

While refactoring cgroup creation, a5bca2152036 ("cgroup: factor out
cgroup_create() out of cgroup_mkdir()") incorrectly onlined subsystems
before the new cgroup is associated with it kernfs_node. This is fine
for cgroup proper but cgroup_name/path() depend on the associated
kernfs_node and if a subsystem makes the new cgroup_subsys_state
visible, which they're allowed to after onlining, it can lead to NULL
dereference.

The current code performs cgroup creation and subsystem onlining in
cgroup_create() and cgroup_mkdir() makes the cgroup and subsystems
visible afterwards. There's no reason to online the subsystems early
and we can simply drop cgroup_apply_control_enable() call from
cgroup_create() so that the subsystems are onlined and made visible at
the same time.

Signed-off-by: Tejun Heo
Reported-by: Konstantin Khlebnikov
Fixes: a5bca2152036 ("cgroup: factor out cgroup_create() out of cgroup_mkdir()")
Signed-off-by: Greg Kroah-Hartman

Tejun Heo
2017-02-09 15:08:28 +0800

15 Oct, 2016

1 commit

f34d3606f Merge branch 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup ... Browse Code »

Pull cgroup updates from Tejun Heo:

- tracepoints for basic cgroup management operations added

- kernfs and cgroup path formatting functions updated to behave in the
style of strlcpy()

- non-critical bug fixes

* 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
blkcg: Unlock blkcg_pol_mutex only once when cpd == NULL
cgroup: fix error handling regressions in proc_cgroup_show() and cgroup_release_agent()
cpuset: fix error handling regression in proc_cpuset_show()
cgroup: add tracepoints for basic operations
cgroup: make cgroup_path() and friends behave in the style of strlcpy()
kernfs: remove kernfs_path_len()
kernfs: make kernfs_path*() behave in the style of strlcpy()
kernfs: add dummy implementation of kernfs_path_from_node()

Linus Torvalds
2016-10-15 03:18:50 +0800

07 Oct, 2016

1 commit

14986a34e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace ... Browse Code »

Pull namespace updates from Eric Biederman:
"This set of changes is a number of smaller things that have been
overlooked in other development cycles focused on more fundamental
change. The devpts changes are small things that were a distraction
until we managed to kill off DEVPTS_MULTPLE_INSTANCES. There is an
trivial regression fix to autofs for the unprivileged mount changes
that went in last cycle. A pair of ioctls has been added by Andrey
Vagin making it is possible to discover the relationships between
namespaces when referring to them through file descriptors.

The big user visible change is starting to add simple resource limits
to catch programs that misbehave. With namespaces in general and user
namespaces in particular allowing users to use more kinds of
resources, it has become important to have something to limit errant
programs. Because the purpose of these limits is to catch errant
programs the code needs to be inexpensive to use as it always on, and
the default limits need to be high enough that well behaved programs
on well behaved systems don't encounter them.

To this end, after some review I have implemented per user per user
namespace limits, and use them to limit the number of namespaces. The
limits being per user mean that one user can not exhause the limits of
another user. The limits being per user namespace allow contexts where
the limit is 0 and security conscious folks can remove from their
threat anlysis the code used to manage namespaces (as they have
historically done as it root only). At the same time the limits being
per user namespace allow other parts of the system to use namespaces.

Namespaces are increasingly being used in application sand boxing
scenarios so an all or nothing disable for the entire system for the
security conscious folks makes increasing use of these sandboxes
impossible.

There is also added a limit on the maximum number of mounts present in
a single mount namespace. It is nontrivial to guess what a reasonable
system wide limit on the number of mount structure in the kernel would
be, especially as it various based on how a system is using
containers. A limit on the number of mounts in a mount namespace
however is much easier to understand and set. In most cases in
practice only about 1000 mounts are used. Given that some autofs
scenarious have the potential to be 30,000 to 50,000 mounts I have set
the default limit for the number of mounts at 100,000 which is well
above every known set of users but low enough that the mount hash
tables don't degrade unreaonsably.

These limits are a start. I expect this estabilishes a pattern that
other limits for resources that namespaces use will follow. There has
been interest in making inotify event limits per user per user
namespace as well as interest expressed in making details about what
is going on in the kernel more visible"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (28 commits)
autofs: Fix automounts by using current_real_cred()->uid
mnt: Add a per mount namespace limit on the number of mounts
netns: move {inc,dec}_net_namespaces into #ifdef
nsfs: Simplify __ns_get_path
tools/testing: add a test to check nsfs ioctl-s
nsfs: add ioctl to get a parent namespace
nsfs: add ioctl to get an owning user namespace for ns file descriptor
kernel: add a helper to get an owning user namespace for a namespace
devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts
devpts: Remove sync_filesystems
devpts: Make devpts_kill_sb safe if fsi is NULL
devpts: Simplify devpts_mount by using mount_nodev
devpts: Move the creation of /dev/pts/ptmx into fill_super
devpts: Move parse_mount_options into fill_super
userns: When the per user per user namespace limit is reached return ENOSPC
userns; Document per user per user namespace limits.
mntns: Add a limit on the number of mount namespaces.
netns: Add a limit on the number of net namespaces
cgroupns: Add a limit on the number of cgroup namespaces
ipcns: Add a limit on the number of ipc namespaces
...

Linus Torvalds
2016-10-07 00:52:23 +0800

30 Sep, 2016

1 commit

0b429e18c Merge branch 'linus' into locking/core, to pick up fixes ... Browse Code »

Signed-off-by: Ingo Molnar

Ingo Molnar
2016-09-30 16:54:46 +0800

29 Sep, 2016

1 commit

e0223003e cgroup: fix error handling regressions in proc_cgroup_show() and cgroup_release_agent() ... Browse Code »

4c737b41de7f ("cgroup: make cgroup_path() and friends behave in the
style of strlcpy()") broke error handling in proc_cgroup_show() and
cgroup_release_agent() by not handling negative return values from
cgroup_path_ns_locked(). Fix it.

Reported-by: Dan Carpenter
Signed-off-by: Tejun Heo

Tejun Heo
2016-09-29 21:55:16 +0800

28 Sep, 2016

1 commit

8ab293e3a Merge branch 'for-4.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup ... Browse Code »

Pull cgroup fixes from Tejun Heo:
"Three late fixes for cgroup: Two cpuset ones, one trivial and the
other pretty obscure, and a cgroup core fix for a bug which impacts
cgroup v2 namespace users"

* 'for-4.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: fix invalid controller enable rejections with cgroup namespace
cpuset: fix non static symbol warning
cpuset: handle race between CPU hotplug and cpuset_hotplug_work

Linus Torvalds
2016-09-28 07:43:11 +0800

24 Sep, 2016

1 commit

9157056da cgroup: fix invalid controller enable rejections with cgroup namespace ... Browse Code »

On the v2 hierarchy, "cgroup.subtree_control" rejects controller
enables if the cgroup has processes in it. The enforcement of this
logic assumes that the cgroup wouldn't have any css_sets associated
with it if there are no tasks in the cgroup, which is no longer true
since a79a908fd2b0 ("cgroup: introduce cgroup namespaces").

When a cgroup namespace is created, it pins the css_set of the
creating task to use it as the root css_set of the namespace. This
extra reference stays as long as the namespace is around and makes
"cgroup.subtree_control" think that the namespace root cgroup is not
empty even when it is and thus reject controller enables.

Fix it by making cgroup_subtree_control() walk and test emptiness of
each css_set instead of testing whether the list_head is empty.

While at it, update the comment of cgroup_task_count() to indicate
that the returned value may be higher than the number of tasks, which
has always been true due to temporary references and doesn't break
anything.

Signed-off-by: Tejun Heo
Reported-by: Evgeny Vereshchagin
Cc: Serge E. Hallyn
Cc: Aditya Kali
Cc: Eric W. Biederman
Cc: stable@vger.kernel.org # v4.6+
Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces")
Link: https://github.com/systemd/systemd/pull/3589#issuecomment-249089541

Tejun Heo
2016-09-24 04:55:49 +0800

23 Sep, 2016

3 commits

787255966 Merge branch 'nsfs-ioctls' into HEAD ... Browse Code »

From: Andrey Vagin

Each namespace has an owning user namespace and now there is not way
to discover these relationships.

Pid and user namepaces are hierarchical. There is no way to discover
parent-child relationships too.

Why we may want to know relationships between namespaces?

One use would be visualization, in order to understand the running
system. Another would be to answer the question: what capability does
process X have to perform operations on a resource governed by namespace
Y?

One more use-case (which usually called abnormal) is checkpoint/restart.
In CRIU we are going to dump and restore nested namespaces.

There [1] was a discussion about which interface to choose to determing
relationships between namespaces.

Eric suggested to add two ioctl-s [2]:
> Grumble, Grumble. I think this may actually a case for creating ioctls
> for these two cases. Now that random nsfs file descriptors are bind
> mountable the original reason for using proc files is not as pressing.
>
> One ioctl for the user namespace that owns a file descriptor.
> One ioctl for the parent namespace of a namespace file descriptor.

Here is an implementaions of these ioctl-s.

$ man man7/namespaces.7
...
Since Linux 4.X, the following ioctl(2) calls are supported for
namespace file descriptors. The correct syntax is:

fd = ioctl(ns_fd, ioctl_type);

where ioctl_type is one of the following:

NS_GET_USERNS
Returns a file descriptor that refers to an owning user names‐
pace.

NS_GET_PARENT
Returns a file descriptor that refers to a parent namespace.
This ioctl(2) can be used for pid and user namespaces. For
user namespaces, NS_GET_PARENT and NS_GET_USERNS have the same
meaning.

In addition to generic ioctl(2) errors, the following specific ones
can occur:

EINVAL NS_GET_PARENT was called for a nonhierarchical namespace.

EPERM The requested namespace is outside of the current namespace
scope.

[1] https://lkml.org/lkml/2016/7/6/158
[2] https://lkml.org/lkml/2016/7/9/101

Changes for v2:
* don't return ENOENT for init_user_ns and init_pid_ns. There is nothing
outside of the init namespace, so we can return EPERM in this case too.
> The fewer special cases the easier the code is to get
> correct, and the easier it is to read. // Eric

Changes for v3:
* rename ns->get_owner() to ns->owner(). get_* usually means that it
grabs a reference.

Cc: "Eric W. Biederman"
Cc: James Bottomley
Cc: "Michael Kerrisk (man-pages)"
Cc: "W. Trevor King"
Cc: Alexander Viro
Cc: Serge Hallyn

Eric W. Biederman
2016-09-23 09:00:36 +0800
bcac25a58 kernel: add a helper to get an owning user namespace for a namespace ... Browse Code »

Return -EPERM if an owning user namespace is outside of a process
current user namespace.

v2: In a first version ns_get_owner returned ENOENT for init_user_ns.
This special cases was removed from this version. There is nothing
outside of init_user_ns, so we can return EPERM.
v3: rename ns->get_owner() to ns->owner(). get_* usually means that it
grabs a reference.

Acked-by: Serge Hallyn
Signed-off-by: Andrei Vagin
Signed-off-by: Eric W. Biederman

Andrey Vagin
2016-09-23 08:59:39 +0800
df75e7748 userns: When the per user per user namespace limit is reached return ENOSPC ... Browse Code »

The current error codes returned when a the per user per user
namespace limit are hit (EINVAL, EUSERS, and ENFILE) are wrong. I
asked for advice on linux-api and it we made clear that those were
the wrong error code, but a correct effor code was not suggested.

The best general error code I have found for hitting a resource limit
is ENOSPC. It is not perfect but as it is unambiguous it will serve
until someone comes up with a better error code.

Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2016-09-23 02:25:56 +0800

22 Sep, 2016

1 commit

7cf0f1426 Merge branch 'locking/urgent' into locking/core, to pick up fixes ... Browse Code »

Signed-off-by: Ingo Molnar

Ingo Molnar
2016-09-22 21:21:48 +0800

20 Sep, 2016

1 commit

d979a39d7 cgroup: duplicate cgroup reference when cloning sockets ... Browse Code »

When a socket is cloned, the associated sock_cgroup_data is duplicated
but not its reference on the cgroup. As a result, the cgroup reference
count will underflow when both sockets are destroyed later on.

Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
Link: http://lkml.kernel.org/r/20160914194846.11153-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Acked-by: Tejun Heo
Cc: Michal Hocko
Cc: Vladimir Davydov
Cc: [4.5+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2016-09-20 06:36:17 +0800

18 Aug, 2016

1 commit

3942a9bd7 locking, rcu, cgroup: Avoid synchronize_sched() in __cgroup_procs_write() ... Browse Code »

The current percpu-rwsem read side is entirely free of serializing insns
at the cost of having a synchronize_sched() in the write path.

The latency of the synchronize_sched() is too high for cgroups. The
commit 1ed1328792ff talks about the write path being a fairly cold path
but this is not the case for Android which moves task to the foreground
cgroup and back around binder IPC calls from foreground processes to
background processes, so it is significantly hotter than human initiated
operations.

Switch cgroup_threadgroup_rwsem into the slow mode for now to avoid the
problem, hopefully it should not be that slow after another commit:

80127a39681b ("locking/percpu-rwsem: Optimize readers and reduce global impact").

We could just add rcu_sync_enter() into cgroup_init() but we do not want
another synchronize_sched() at boot time, so this patch adds the new helper
which doesn't block but currently can only be called before the first use.

Reported-by: John Stultz
Reported-by: Dmitry Shmidt
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Oleg Nesterov
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andrew Morton
Cc: Colin Cross
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Rom Lemarchand
Cc: Tejun Heo
Cc: Thomas Gleixner
Cc: Todd Kjos
Link: http://lkml.kernel.org/r/20160811165413.GA22807@redhat.com
Signed-off-by: Ingo Molnar

Peter Zijlstra
2016-08-18 21:36:59 +0800

10 Aug, 2016

2 commits

ed1777de2 cgroup: add tracepoints for basic operations ... Browse Code »

Debugging what goes wrong with cgroup setup can get hairy. Add
tracepoints for cgroup hierarchy mount, cgroup creation/destruction
and task migration operations for better visibility.

Signed-off-by: Tejun Heo

Tejun Heo
2016-08-10 23:23:44 +0800
4c737b41d cgroup: make cgroup_path() and friends behave in the style of strlcpy() ... Browse Code »

cgroup_path() and friends used to format the path from the end and
thus the resulting path usually didn't start at the start of the
passed in buffer. Also, when the buffer was too small, the partial
result was truncated from the head rather than tail and there was no
way to tell how long the full path would be. These make the functions
less robust and more awkward to use.

With recent updates to kernfs_path(), cgroup_path() and friends can be
made to behave in strlcpy() style.

* cgroup_path(), cgroup_path_ns[_locked]() and task_cgroup_path() now
always return the length of the full path. If buffer is too small,
it contains nul terminated truncated output.

* All users updated accordingly.

v2: cgroup_path() usage in kernel/sched/debug.c converted.

Signed-off-by: Tejun Heo
Acked-by: Greg Kroah-Hartman
Cc: Serge Hallyn
Cc: Peter Zijlstra

Tejun Heo
2016-08-10 23:23:44 +0800

09 Aug, 2016

1 commit

d08311dd6 cgroupns: Add a limit on the number of cgroup namespaces ... Browse Code »

Acked-by: Kees Cook
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2016-08-09 03:42:03 +0800

30 Jul, 2016

1 commit

574c7e233 Merge branch 'for-4.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup ... Browse Code »

Pull more cgroup updates from Tejun Heo:
"I forgot to include the patches which got applied to for-4.7-fixes
late during last cycle.

Eric's three patches fix bugs introduced with the namespace support"

* 'for-4.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroupns: Only allow creation of hierarchies in the initial cgroup namespace
cgroupns: Close race between cgroup_post_fork and copy_cgroup_ns
cgroupns: Fix the locking in copy_cgroup_ns

Linus Torvalds
2016-07-30 05:29:04 +0800

28 Jul, 2016

1 commit

468fc7ed5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next ... Browse Code »

Pull networking updates from David Miller:

1) Unified UDP encapsulation offload methods for drivers, from
Alexander Duyck.

2) Make DSA binding more sane, from Andrew Lunn.

3) Support QCA9888 chips in ath10k, from Anilkumar Kolli.

4) Several workqueue usage cleanups, from Bhaktipriya Shridhar.

5) Add XDP (eXpress Data Path), essentially running BPF programs on RX
packets as soon as the device sees them, with the option to mirror
the packet on TX via the same interface. From Brenden Blanco and
others.

6) Allow qdisc/class stats dumps to run lockless, from Eric Dumazet.

7) Add VLAN support to b53 and bcm_sf2, from Florian Fainelli.

8) Simplify netlink conntrack entry layout, from Florian Westphal.

9) Add ipv4 forwarding support to mlxsw spectrum driver, from Ido
Schimmel, Yotam Gigi, and Jiri Pirko.

10) Add SKB array infrastructure and convert tun and macvtap over to it.
From Michael S Tsirkin and Jason Wang.

11) Support qdisc packet injection in pktgen, from John Fastabend.

12) Add neighbour monitoring framework to TIPC, from Jon Paul Maloy.

13) Add NV congestion control support to TCP, from Lawrence Brakmo.

14) Add GSO support to SCTP, from Marcelo Ricardo Leitner.

15) Allow GRO and RPS to function on macsec devices, from Paolo Abeni.

16) Support MPLS over IPV4, from Simon Horman.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits)
xgene: Fix build warning with ACPI disabled.
be2net: perform temperature query in adapter regardless of its interface state
l2tp: Correctly return -EBADF from pppol2tp_getname.
net/mlx5_core/health: Remove deprecated create_singlethread_workqueue
net: ipmr/ip6mr: update lastuse on entry change
macsec: ensure rx_sa is set when validation is disabled
tipc: dump monitor attributes
tipc: add a function to get the bearer name
tipc: get monitor threshold for the cluster
tipc: make cluster size threshold for monitoring configurable
tipc: introduce constants for tipc address validation
net: neigh: disallow transition to NUD_STALE if lladdr is unchanged in neigh_update()
MAINTAINERS: xgene: Add driver and documentation path
Documentation: dtb: xgene: Add MDIO node
dtb: xgene: Add MDIO node
drivers: net: xgene: ethtool: Use phy_ethtool_gset and sset
drivers: net: xgene: Use exported functions
drivers: net: xgene: Enable MDIO driver
drivers: net: xgene: Add backward compatibility
drivers: net: phy: xgene: Add MDIO driver
...

Linus Torvalds
2016-07-28 03:03:20 +0800

27 Jul, 2016

1 commit

b55b04871 Merge branch 'for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup ... Browse Code »

Pull cgroup updates from Tejun Heo:
"Nothing too exciting.

- updates to the pids controller so that pid limit breaches can be
noticed and monitored from userland.

- cleanups and non-critical bug fixes"

* 'for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: remove duplicated include from cgroup.c
cgroup: Use lld instead of ld when printing pids controller events_limit
cgroup: Add pids controller event when fork fails because of pid limit
cgroup: allow NULL return from ss->css_alloc()
cgroup: remove unnecessary 0 check from css_from_id()
cgroup: fix idr leak for the first cgroup root

Linus Torvalds
2016-07-27 05:34:17 +0800

20 Jul, 2016

1 commit

55094f575 cgroup: remove duplicated include from cgroup.c ... Browse Code »

Remove duplicated include.

Signed-off-by: Wei Yongjun
Signed-off-by: Tejun Heo

Wei Yongjun
2016-07-20 02:28:04 +0800

15 Jul, 2016

3 commits

726a4994b cgroupns: Only allow creation of hierarchies in the initial cgroup namespace ... Browse Code »

Unprivileged users can't use hierarchies if they create them as they do not
have privilieges to the root directory.

Which means the only thing a hiearchy created by an unprivileged user
is good for is expanding the number of cgroup links in every css_set,
which is a DOS attack.

We could allow hierarchies to be created in namespaces in the initial
user namespace. Unfortunately there is only a single namespace for
the names of heirarchies, so that is likely to create more confusion
than not.

So do the simple thing and restrict hiearchy creation to the initial
cgroup namespace.

Cc: stable@vger.kernel.org
Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces")
Signed-off-by: "Eric W. Biederman"
Signed-off-by: Tejun Heo

Eric W. Biederman
2016-07-15 20:04:27 +0800
eedd0f4cb cgroupns: Close race between cgroup_post_fork and copy_cgroup_ns ... Browse Code »

In most code paths involving cgroup migration cgroup_threadgroup_rwsem
is taken. There are two exceptions:

- remove_tasks_in_empty_cpuset calls cgroup_transfer_tasks
- vhost_attach_cgroups_work calls cgroup_attach_task_all

With cgroup_threadgroup_rwsem held it is guaranteed that cgroup_post_fork
and copy_cgroup_ns will reference the same css_set from the process calling
fork.

Without such an interlock there process after fork could reference one
css_set from it's new cgroup namespace and another css_set from
task->cgroups, which semantically is nonsensical.

Cc: stable@vger.kernel.org
Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces")
Signed-off-by: "Eric W. Biederman"
Signed-off-by: Tejun Heo

Eric W. Biederman
2016-07-15 19:56:38 +0800
7bd883087 cgroupns: Fix the locking in copy_cgroup_ns ... Browse Code »

If "clone(CLONE_NEWCGROUP...)" is called it results in a nice lockdep
valid splat.

In __cgroup_proc_write the lock ordering is:
cgroup_mutex -- through cgroup_kn_lock_live
cgroup_threadgroup_rwsem

In copy_process the guts of clone the lock ordering is:
cgroup_threadgroup_rwsem -- through threadgroup_change_begin
cgroup_mutex -- through copy_namespaces -- copy_cgroup_ns

lockdep reports some a different call chains for the first ordering of
cgroup_mutex and cgroup_threadgroup_rwsem but it is harder to trace.
This is most definitely deadlock potential under the right
circumstances.

Fix this by by skipping the cgroup_mutex and making the locking in
copy_cgroup_ns mirror the locking in cgroup_post_fork which also runs
during fork under the cgroup_threadgroup_rwsem.

Cc: stable@vger.kernel.org
Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces")
Signed-off-by: "Eric W. Biederman"
Signed-off-by: Tejun Heo

Eric W. Biederman
2016-07-15 19:56:32 +0800

02 Jul, 2016

1 commit

1f3fe7ebf cgroup: Add cgroup_get_from_fd ... Browse Code »

Add a helper function to get a cgroup2 from a fd. It will be
stored in a bpf array (BPF_MAP_TYPE_CGROUP_ARRAY) which will
be introduced in the later patch.

Signed-off-by: Martin KaFai Lau
Cc: Alexei Starovoitov
Cc: Daniel Borkmann
Cc: Tejun Heo
Acked-by: Tejun Heo
Signed-off-by: David S. Miller

Martin KaFai Lau
2016-07-02 04:30:38 +0800

24 Jun, 2016

1 commit

82d6489d0 cgroup: Disable IRQs while holding css_set_lock ... Browse Code »

While testing the deadline scheduler + cgroup setup I hit this
warning.

[ 132.612935] ------------[ cut here ]------------
[ 132.612951] WARNING: CPU: 5 PID: 0 at kernel/softirq.c:150 __local_bh_enable_ip+0x6b/0x80
[ 132.612952] Modules linked in: (a ton of modules...)
[ 132.612981] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.7.0-rc2 #2
[ 132.612981] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
[ 132.612982] 0000000000000086 45c8bb5effdd088b ffff88013fd43da0 ffffffff813d229e
[ 132.612984] 0000000000000000 0000000000000000 ffff88013fd43de0 ffffffff810a652b
[ 132.612985] 00000096811387b5 0000000000000200 ffff8800bab29d80 ffff880034c54c00
[ 132.612986] Call Trace:
[ 132.612987] [] dump_stack+0x63/0x85
[ 132.612994] [] __warn+0xcb/0xf0
[ 132.612997] [] ? push_dl_task.part.32+0x170/0x170
[ 132.612999] [] warn_slowpath_null+0x1d/0x20
[ 132.613000] [] __local_bh_enable_ip+0x6b/0x80
[ 132.613008] [] _raw_write_unlock_bh+0x1a/0x20
[ 132.613010] [] _raw_spin_unlock_bh+0xe/0x10
[ 132.613015] [] put_css_set+0x5c/0x60
[ 132.613016] [] cgroup_free+0x7f/0xa0
[ 132.613017] [] __put_task_struct+0x42/0x140
[ 132.613018] [] dl_task_timer+0xca/0x250
[ 132.613027] [] ? push_dl_task.part.32+0x170/0x170
[ 132.613030] [] __hrtimer_run_queues+0xee/0x270
[ 132.613031] [] hrtimer_interrupt+0xa8/0x190
[ 132.613034] [] local_apic_timer_interrupt+0x38/0x60
[ 132.613035] [] smp_apic_timer_interrupt+0x3d/0x50
[ 132.613037] [] apic_timer_interrupt+0x8c/0xa0
[ 132.613038] [] ? native_safe_halt+0x6/0x10
[ 132.613043] [] default_idle+0x1e/0xd0
[ 132.613044] [] arch_cpu_idle+0xf/0x20
[ 132.613046] [] default_idle_call+0x2a/0x40
[ 132.613047] [] cpu_startup_entry+0x2e7/0x340
[ 132.613048] [] start_secondary+0x155/0x190
[ 132.613049] ---[ end trace f91934d162ce9977 ]---

The warn is the spin_(lock|unlock)_bh(&css_set_lock) in the interrupt
context. Converting the spin_lock_bh to spin_lock_irq(save) to avoid
this problem - and other problems of sharing a spinlock with an
interrupt.

Cc: Tejun Heo
Cc: Li Zefan
Cc: Johannes Weiner
Cc: Juri Lelli
Cc: Steven Rostedt
Cc: cgroups@vger.kernel.org
Cc: stable@vger.kernel.org # 4.5+
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Rik van Riel
Reviewed-by: "Luis Claudio R. Goncalves"
Signed-off-by: Daniel Bristot de Oliveira
Acked-by: Zefan Li
Signed-off-by: Tejun Heo

Daniel Bristot de Oliveira
2016-06-24 05:23:12 +0800

22 Jun, 2016

1 commit

e7e15b87f cgroup: allow NULL return from ss->css_alloc() ... Browse Code »

cgroup core expected css_alloc to return an ERR_PTR value on failure
and caused NULL deref if it returned NULL. It's an easy mistake to
make from an alloc function and there's no ambiguity in what's being
indicated. Update css_create() so that it interprets NULL return from
css_alloc as -ENOMEM.

Signed-off-by: Tejun Heo

Tejun Heo
2016-06-22 01:07:09 +0800

18 Jun, 2016

2 commits

d6ccc55e6 cgroup: remove unnecessary 0 check from css_from_id() ... Browse Code »

css_idr allocation starts at 1, so index 0 will never point to an
item. css_from_id() currently filters that before asking idr_find(),
but idr_find() would also just return NULL, so this is not needed.

Signed-off-by: Johannes Weiner
Signed-off-by: Tejun Heo

Johannes Weiner
2016-06-18 02:16:32 +0800
8c8a55021 cgroup: fix idr leak for the first cgroup root ... Browse Code »

The valid cgroup hierarchy ID range includes 0, so we can't filter for
positive numbers when freeing it, or it'll leak the first ID. No big
deal, just disruptive when reading the code.

The ID is freed during error handling and when the reference count
hits zero, so the double-free test is not necessary; remove it.

Signed-off-by: Johannes Weiner
Signed-off-by: Tejun Heo

Johannes Weiner
2016-06-18 02:16:28 +0800

17 Jun, 2016

1 commit

8fa3b8d68 cgroup: set css->id to -1 during init ... Browse Code »

If percpu_ref initialization fails during css_create(), the free path
can end up trying to free css->id of zero. As ID 0 is unused, it
doesn't cause a critical breakage but it does trigger a warning
message. Fix it by setting css->id to -1 from init_and_link_css().

Signed-off-by: Tejun Heo
Cc: Wenwei Tao
Fixes: 01e586598b22 ("cgroup: release css->id after css_free")
Cc: stable@vger.kernel.org # v4.0+
Signed-off-by: Tejun Heo

Tejun Heo
2016-06-17 05:59:35 +0800

27 May, 2016

1 commit

b00c52dae cgroup: remove redundant cleanup in css_create ... Browse Code »

When create css failed, before call css_free_rcu_fn, we remove the css
id and exit the percpu_ref, but we will do these again in
css_free_work_fn, so they are redundant. Especially the css id, that
would cause problem if we remove it twice, since it may be assigned to
another css after the first remove.

tj: This was broken by two commits updating the free path without
synchronizing the creation failure path. This can be easily
triggered by trying to create more than 64k memory cgroups.

Signed-off-by: Wenwei Tao
Signed-off-by: Tejun Heo
Cc: Vladimir Davydov
Fixes: 9a1049da9bd2 ("percpu-refcount: require percpu_ref to be exited explicitly")
Fixes: 01e586598b22 ("cgroup: release css->id after css_free")
Cc: stable@vger.kernel.org # v3.17+

Wenwei Tao
2016-05-27 03:09:23 +0800

12 May, 2016

1 commit

09be4c824 cgroup: fix compile warning ... Browse Code »

commit 4f41fc59620f ("cgroup, kernfs: make mountinfo
show properly scoped path for cgroup namespaces")
added the following compile warning:

kernel/cgroup.c: In function ‘cgroup_show_path’:
kernel/cgroup.c:1634:15: warning: unused variable ‘ret’ [-Wunused-variable]
int len = 0, ret = 0;
^
fix it.

Fixes: 4f41fc59620f ("cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces")
Signed-off-by: Felipe Balbi
Signed-off-by: Tejun Heo

Felipe Balbi
2016-05-12 23:05:27 +0800

10 May, 2016

1 commit

4f41fc596 cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces ... Browse Code »

Patch summary:

When showing a cgroupfs entry in mountinfo, show the path of the mount
root dentry relative to the reader's cgroup namespace root.

Short explanation (courtesy of mkerrisk):

If we create a new cgroup namespace, then we want both /proc/self/cgroup
and /proc/self/mountinfo to show cgroup paths that are correctly
virtualized with respect to the cgroup mount point. Previous to this
patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo
does not.

Long version:

When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup
namespace, and then mounts a new instance of the freezer cgroup, the new
mount will be rooted at /a/b. The root dentry field of the mountinfo
entry will show '/a/b'.

cat > /tmp/do1 << EOF
mount -t cgroup -o freezer freezer /mnt
grep freezer /proc/self/mountinfo
EOF

unshare -Gm bash /tmp/do1
> 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
> 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer

The task's freezer cgroup entry in /proc/self/cgroup will simply show
'/':

grep freezer /proc/self/cgroup
9:freezer:/

If instead the same task simply bind mounts the /a/b cgroup directory,
the resulting mountinfo entry will again show /a/b for the dentry root.
However in this case the task will find its own cgroup at /mnt/a/b,
not at /mnt:

mount --bind /sys/fs/cgroup/freezer/a/b /mnt
130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer

In other words, there is no way for the task to know, based on what is
in mountinfo, which cgroup directory is its own.

Example (by mkerrisk):

First, a little script to save some typing and verbiage:

echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)"
cat /proc/self/mountinfo | grep freezer |
awk '{print "\tmountinfo:\t\t" $4 "\t" $5}'

Create cgroup, place this shell into the cgroup, and look at the state
of the /proc files:

2653
2653 # Our shell
14254 # cat(1)
/proc/self/cgroup: 10:freezer:/a/b
mountinfo: / /sys/fs/cgroup/freezer

Create a shell in new cgroup and mount namespaces. The act of creating
a new cgroup namespace causes the process's current cgroups directories
to become its cgroup root directories. (Here, I'm using my own version
of the "unshare" utility, which takes the same options as the util-linux
version):

Look at the state of the /proc files:

/proc/self/cgroup: 10:freezer:/
mountinfo: / /sys/fs/cgroup/freezer

The third entry in /proc/self/cgroup (the pathname of the cgroup inside
the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which
is rooted at /a/b in the outer namespace.

However, the info in /proc/self/mountinfo is not for this cgroup
namespace, since we are seeing a duplicate of the mount from the
old mount namespace, and the info there does not correspond to the
new cgroup namespace. However, trying to create a new mount still
doesn't show us the right information in mountinfo:

# propagating to other mountns
/proc/self/cgroup: 7:freezer:/
mountinfo: /a/b /mnt/freezer

The act of creating a new cgroup namespace caused the process's
current freezer directory, "/a/b", to become its cgroup freezer root
directory. In other words, the pathname directory of the directory
within the newly mounted cgroup filesystem should be "/",
but mountinfo wrongly shows us "/a/b". The consequence of this is
that the process in the cgroup namespace cannot correctly construct
the pathname of its cgroup root directory from the information in
/proc/PID/mountinfo.

With this patch, the dentry root field in mountinfo is shown relative
to the reader's cgroup namespace. So the same steps as above:

/proc/self/cgroup: 10:freezer:/a/b
mountinfo: / /sys/fs/cgroup/freezer
/proc/self/cgroup: 10:freezer:/
mountinfo: /../.. /sys/fs/cgroup/freezer
/proc/self/cgroup: 10:freezer:/
mountinfo: / /mnt/freezer

cgroup.clone_children freezer.parent_freezing freezer.state tasks
cgroup.procs freezer.self_freezing notify_on_release
3164
2653 # First shell that placed in this cgroup
3164 # Shell started by 'unshare'
14197 # cat(1)

Signed-off-by: Serge Hallyn
Tested-by: Michael Kerrisk
Acked-by: Michael Kerrisk
Signed-off-by: Tejun Heo

Serge E. Hallyn
2016-05-10 00:15:03 +0800

26 Apr, 2016

1 commit

5cf1cacb4 cgroup, cpuset: replace cpuset_post_attach_flush() with cgroup_subsys->post_attach callback ... Browse Code »

Since e93ad19d0564 ("cpuset: make mm migration asynchronous"), cpuset
kicks off asynchronous NUMA node migration if necessary during task
migration and flushes it from cpuset_post_attach_flush() which is
called at the end of __cgroup_procs_write(). This is to avoid
performing migration with cgroup_threadgroup_rwsem write-locked which
can lead to deadlock through dependency on kworker creation.

memcg has a similar issue with charge moving, so let's convert it to
an official callback rather than the current one-off cpuset specific
function. This patch adds cgroup_subsys->post_attach callback and
makes cpuset register cpuset_post_attach_flush() as its ->post_attach.

The conversion is mostly one-to-one except that the new callback is
called under cgroup_mutex. This is to guarantee that no other
migration operations are started before ->post_attach callbacks are
finished. cgroup_mutex is one of the outermost mutex in the system
and has never been and shouldn't be a problem. We can add specialized
synchronization around __cgroup_procs_write() but I don't think
there's any noticeable benefit.

Signed-off-by: Tejun Heo
Cc: Li Zefan
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: # 4.4+ prerequisite for the next patch

Tejun Heo
2016-04-26 03:45:14 +0800

22 Mar, 2016

1 commit

5518f66b5 Merge branch 'for-4.6-ns' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup ... Browse Code »

Pull cgroup namespace support from Tejun Heo:
"These are changes to implement namespace support for cgroup which has
been pending for quite some time now. It is very straight-forward and
only affects what part of cgroup hierarchies are visible.

After unsharing, mounting a cgroup fs will be scoped to the cgroups
the task belonged to at the time of unsharing and the cgroup paths
exposed to userland would be adjusted accordingly"

* 'for-4.6-ns' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: fix and restructure error handling in copy_cgroup_ns()
cgroup: fix alloc_cgroup_ns() error handling in copy_cgroup_ns()
Add FS_USERNS_FLAG to cgroup fs
cgroup: Add documentation for cgroup namespaces
cgroup: mount cgroupns-root when inside non-init cgroupns
kernfs: define kernfs_node_dentry
cgroup: cgroup namespace setns support
cgroup: introduce cgroup namespaces
sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
kernfs: Add API to generate relative kernfs path

Linus Torvalds
2016-03-22 01:05:13 +0800

17 Mar, 2016

2 commits

cfe02a8a9 cgroup: avoid false positive gcc-6 warning ... Browse Code »

When all subsystems are disabled, gcc notices that cgroup_subsys_enabled_key
is a zero-length array and that any access to it must be out of bounds:

In file included from ../include/linux/cgroup.h:19:0,
from ../kernel/cgroup.c:31:
../kernel/cgroup.c: In function 'cgroup_add_cftypes':
../kernel/cgroup.c:261:53: error: array subscript is above array bounds [-Werror=array-bounds]
return static_key_enabled(cgroup_subsys_enabled_key[ssid]);
~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~
../include/linux/jump_label.h:271:40: note: in definition of macro 'static_key_enabled'
static_key_count((struct static_key *)x) > 0; \
^

We should never call the function in this particular case, so this is
not a bug. In order to silence the warning, this adds an explicit check
for the CGROUP_SUBSYS_COUNT==0 case.

Signed-off-by: Arnd Bergmann
Signed-off-by: Tejun Heo

Arnd Bergmann
2016-03-17 04:32:23 +0800
2b021cbf3 cgroup: ignore css_sets associated with dead cgroups during migration ... Browse Code »

Before 2e91fa7f6d45 ("cgroup: keep zombies associated with their
original cgroups"), all dead tasks were associated with init_css_set.
If a zombie task is requested for migration, while migration prep
operations would still be performed on init_css_set, the actual
migration would ignore zombie tasks. As init_css_set is always valid,
this worked fine.

However, after 2e91fa7f6d45, zombie tasks stay with the css_set it was
associated with at the time of death. Let's say a task T associated
with cgroup A on hierarchy H-1 and cgroup B on hiearchy H-2. After T
becomes a zombie, it would still remain associated with A and B. If A
only contains zombie tasks, it can be removed. On removal, A gets
marked offline but stays pinned until all zombies are drained. At
this point, if migration is initiated on T to a cgroup C on hierarchy
H-2, migration path would try to prepare T's css_set for migration and
trigger the following.

WARNING: CPU: 0 PID: 1576 at kernel/cgroup.c:474 cgroup_get+0x121/0x160()
CPU: 0 PID: 1576 Comm: bash Not tainted 4.4.0-work+ #289
...
Call Trace:
[] dump_stack+0x4e/0x82
[] warn_slowpath_common+0x78/0xb0
[] warn_slowpath_null+0x15/0x20
[] cgroup_get+0x121/0x160
[] link_css_set+0x7b/0x90
[] find_css_set+0x3bc/0x5e0
[] cgroup_migrate_prepare_dst+0x89/0x1f0
[] cgroup_attach_task+0x157/0x230
[] __cgroup_procs_write+0x2b7/0x470
[] cgroup_tasks_write+0xc/0x10
[] cgroup_file_write+0x30/0x1b0
[] kernfs_fop_write+0x13c/0x180
[] __vfs_write+0x23/0xe0
[] vfs_write+0xa4/0x1a0
[] SyS_write+0x44/0xa0
[] entry_SYSCALL_64_fastpath+0x12/0x6f

It doesn't make sense to prepare migration for css_sets pointing to
dead cgroups as they are guaranteed to contain only zombies which are
ignored later during migration. This patch makes cgroup destruction
path mark all affected css_sets as dead and updates the migration path
to ignore them during preparation.

Signed-off-by: Tejun Heo
Fixes: 2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")
Cc: stable@vger.kernel.org # v4.4+

Tejun Heo
2016-03-17 04:31:46 +0800

09 Mar, 2016

4 commits

f6d635ad3 cgroup: implement cgroup_subsys->implicit_on_dfl ... Browse Code »

Some controllers, perf_event for now and possibly freezer in the
future, don't really make sense to control explicitly through
"cgroup.subtree_control". For example, the primary role of perf_event
is identifying the cgroups of tasks; however, because the controller
also keeps a small amount of state per cgroup, it can't be replaced
with simple cgroup membership tests.

This patch implements cgroup_subsys->implicit_on_dfl flag. When set,
the controller is implicitly enabled on all cgroups on the v2
hierarchy so that utility type controllers such as perf_event can be
enabled and function transparently.

An implicit controller doesn't show up in "cgroup.controllers" or
"cgroup.subtree_control", is exempt from no internal process rule and
can be stolen from the default hierarchy even if there are non-root
csses.

v2: Reimplemented on top of the recent updates to css handling and
subsystem rebinding. Rebinding implicit subsystems is now a
simple matter of exempting it from the busy subsystem check.

Signed-off-by: Tejun Heo

Tejun Heo
2016-03-09 00:51:26 +0800
e4857982f cgroup: use css_set->mg_dst_cgrp for the migration target cgroup ... Browse Code »

Migration can be multi-target on the default hierarchy when a
controller is enabled - processes belonging to each child cgroup have
to be moved to the child cgroup itself to refresh css association.

This isn't a problem for cgroup_migrate_add_src() as each source
css_set still maps to single source and target cgroups; however,
cgroup_migrate_prepare_dst() is called once after all source css_sets
are added and thus might not have a single destination cgroup. This
is currently worked around by specifying NULL for @dst_cgrp and using
the source's default cgroup as destination as the only multi-target
migration in use is self-targetting. While this works, it's subtle
and clunky.

As all taget cgroups are already specified while preparing the source
css_sets, this clunkiness can easily be removed by recording the
target cgroup in each source css_set. This patch adds
css_set->mg_dst_cgrp which is recorded on cgroup_migrate_src() and
used by cgroup_migrate_prepare_dst(). This also makes migration code
ready for arbitrary multi-target migration.

Signed-off-by: Tejun Heo

Tejun Heo
2016-03-09 00:51:26 +0800
37ff9f8f4 cgroup: make cgroup[_taskset]_migrate() take cgroup_root instead of cgroup ... Browse Code »

On the default hierarchy, a migration can be multi-source and/or
multi-destination. cgroup_taskest_migrate() used to incorrectly
assume single destination cgroup but the bug has been fixed by
1f7dd3e5a6e4 ("cgroup: fix handling of multi-destination migration
from subtree_control enabling").

Since the commit, @dst_cgrp to cgroup[_taskset]_migrate() is only used
to determine which subsystems are affected or which cgroup_root the
migration is taking place in. As such, @dst_cgrp is misleading. This
patch replaces @dst_cgrp with @root.

Signed-off-by: Tejun Heo

Tejun Heo
2016-03-09 00:51:26 +0800
6c694c882 cgroup: move migration destination verification out of cgroup_migrate_prepare_dst() ... Browse Code »

cgroup_migrate_prepare_dst() verifies whether the destination cgroup
is allowable; however, the test doesn't really belong there. It's too
deep and common in the stack and as a result the test itself is gated
by another test.

Separate the test out into cgroup_may_migrate_to() and update
cgroup_attach_task() and cgroup_transfer_tasks() to perform the test
directly. This doesn't cause any behavior differences.

Signed-off-by: Tejun Heo

Tejun Heo
2016-03-09 00:51:25 +0800