Eric Lee / smarc-fsl-linux-kernel

15 Jul, 2014

1 commit

5577964e6 cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes ... Browse Code »

Currently, cgroup_subsys->base_cftypes is used for both the unified
default hierarchy and legacy ones and subsystems can mark each file
with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear
only on one of them. This is quite hairy and error-prone. Also, we
may end up exposing interface files to the default hierarchy without
thinking it through.

cgroup_subsys will grow two separate cftype arrays and apply each only
on the hierarchies of the matching type. This will allow organizing
cftypes in a lot clearer way and encourage subsystems to scrutinize
the interface which is being exposed in the new default hierarchy.

In preparation, this patch renames cgroup_subsys->base_cftypes to
cgroup_subsys->legacy_cftypes. This patch is pure rename.

Signed-off-by: Tejun Heo
Acked-by: Neil Horman
Acked-by: Li Zefan
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Vivek Goyal
Cc: Peter Zijlstra
Cc: Paul Mackerras
Cc: Ingo Molnar
Cc: Arnaldo Carvalho de Melo
Cc: Aristeu Rozanski
Cc: Aneesh Kumar K.V

Tejun Heo
2014-07-15 23:05:09 +0800

17 May, 2014

3 commits

7a3bb24f7 device_cgroup: use css_has_online_children() instead of has_children() ... Browse Code »

devcgroup_update_access() wants to know whether there are child
cgroups which are online and visible to userland and has_children()
may return false positive. Replace it with css_has_online_children().

Signed-off-by: Tejun Heo
Acked-by: Aristeu Rozanski
Acked-by: Serge Hallyn
Acked-by: Li Zefan

Tejun Heo
2014-05-17 01:22:52 +0800
5877019d9 device_cgroup: remove direct access to cgroup->children ... Browse Code »

Currently, devcg::has_children() directly tests cgroup->children for
list emptiness. The field is not a published field and scheduled to
go away. In addition, the test isn't strictly correct as devcg should
only care about children which are visible to userland.

This patch converts has_children() to use css_next_child() instead.
The subtle incorrectness is noted and will be dealt with later.

Signed-off-by: Tejun Heo
Acked-by: Aristeu Rozanski
Acked-by: Serge Hallyn
Acked-by: Li Zefan

Tejun Heo
2014-05-17 01:22:48 +0800
5c9d535b8 cgroup: remove css_parent() ... Browse Code »

cgroup in general is moving towards using cgroup_subsys_state as the
fundamental structural component and css_parent() was introduced to
convert from using cgroup->parent to css->parent. It was quite some
time ago and we're moving forward with making css more prominent.

This patch drops the trivial wrapper css_parent() and let the users
dereference css->parent. While at it, explicitly mark fields of css
which are public and immutable.

v2: New usage from device_cgroup.c converted.

Signed-off-by: Tejun Heo
Acked-by: Michal Hocko
Acked-by: Neil Horman
Acked-by: "David S. Miller"
Acked-by: Li Zefan
Cc: Vivek Goyal
Cc: Jens Axboe
Cc: Peter Zijlstra
Cc: Johannes Weiner

Tejun Heo
2014-05-17 01:22:48 +0800

14 May, 2014

1 commit

451af504d cgroup: replace cftype->write_string() with cftype->write() ... Browse Code »

Convert all cftype->write_string() users to the new cftype->write()
which maps directly to kernfs write operation and has full access to
kernfs and cgroup contexts. The conversions are mostly mechanical.

* @css and @cft are accessed using of_css() and of_cft() accessors
respectively instead of being specified as arguments.

* Should return @nbytes on success instead of 0.

* @buf is not trimmed automatically. Trim if necessary. Note that
blkcg and netprio don't need this as the parsers already handle
whitespaces.

cftype->write_string() has no user left after the conversions and
removed.

While at it, remove unnecessary local variable @p in
cgroup_subtree_control_write() and stale comment about
CGROUP_LOCAL_BUFFER_SIZE in cgroup_freezer.c.

This patch doesn't introduce any visible behavior changes.

v2: netprio was missing from conversion. Converted.

Signed-off-by: Tejun Heo
Acked-by: Aristeu Rozanski
Acked-by: Vivek Goyal
Acked-by: Li Zefan
Cc: Jens Axboe
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Neil Horman
Cc: "David S. Miller"

Tejun Heo
2014-05-14 00:16:21 +0800

05 May, 2014

2 commits

d2c2b11cf device_cgroup: check if exception removal is allowed ... Browse Code »

[PATCH v3 1/2] device_cgroup: check if exception removal is allowed

When the device cgroup hierarchy was introduced in
bd2953ebbb53 - devcg: propagate local changes down the hierarchy

a specific case was overlooked. Consider the hierarchy bellow:

A default policy: ALLOW, exceptions will deny access
\
B default policy: ALLOW, exceptions will deny access

There's no need to verify when an new exception is added to B because
in this case exceptions will deny access to further devices, which is
always fine. Hierarchy in device cgroup only makes sure B won't have
more access than A.

But when an exception is removed (by writing devices.allow), it isn't
checked if the user is in fact removing an inherited exception from A,
thus giving more access to B.

Example:

# echo 'a' >A/devices.allow
# echo 'c 1:3 rw' >A/devices.deny
# echo $$ >A/B/tasks
# echo >/dev/null
-bash: /dev/null: Operation not permitted
# echo 'c 1:3 w' >A/B/devices.allow
# echo >/dev/null
#

This shouldn't be allowed and this patch fixes it by making sure to never allow
exceptions in this case to be removed if the exception is partially or fully
present on the parent.

v3: missing '*' in function description
v2: improved log message and formatting fixes

Cc: cgroups@vger.kernel.org
Cc: Li Zefan
Cc: stable@vger.kernel.org
Signed-off-by: Aristeu Rozanski
Acked-by: Serge Hallyn
Signed-off-by: Tejun Heo

Aristeu Rozanski
2014-05-05 23:20:12 +0800
f5f3cf6f7 device_cgroup: fix the comment format for recently added functions ... Browse Code »

Moving more extensive explanations to the end of the comment.

Cc: Li Zefan
Signed-off-by: Aristeu Rozanski
Acked-by: Serge Hallyn
Signed-off-by: Tejun Heo

Aristeu Rozanski
2014-05-05 03:21:09 +0800

22 Apr, 2014

1 commit

79d719749 device_cgroup: rework device access check and exception checking ... Browse Code »

Whenever a device file is opened and checked against current device
cgroup rules, it uses the same function (may_access()) as when a new
exception rule is added by writing devices.{allow,deny}. And in both
cases, the algorithm is the same, doesn't matter the behavior.

First problem is having device access to be considered the same as rule
checking. Consider the following structure:

A (default behavior: allow, exceptions disallow access)
\
B (default behavior: allow, exceptions disallow access)

A new exception is added to B by writing devices.deny:

c 12:34 rw

When checking if that exception is allowed in may_access():

if (dev_cgroup->behavior == DEVCG_DEFAULT_ALLOW) {
if (behavior == DEVCG_DEFAULT_ALLOW) {
/* the exception will deny access to certain devices */
return true;

Which is ok, since B is not getting more privileges than A, it doesn't
matter and the rule is accepted

Now, consider it's a device file open check and the process belongs to
cgroup B. The access will be generated as:

behavior: allow
exception: c 12:34 rw

The very same chunk of code will allow it, even if there's an explicit
exception telling to do otherwise.

A simple test case:

# mkdir new_group
# cd new_group
# echo $$ >tasks
# echo "c 1:3 w" >devices.deny
# echo >/dev/null
# echo $?
0

This is a serious bug and was introduced on

c39a2a3018f8 devcg: prepare may_access() for hierarchy support

To solve this problem, the device file open function was split from the
new exception check.

Second problem is how exceptions are processed by may_access(). The
first part of the said function tries to match fully with an existing
exception:

list_for_each_entry_rcu(ex, &dev_cgroup->exceptions, list) {
if ((refex->type & DEV_BLOCK) && !(ex->type & DEV_BLOCK))
continue;
if ((refex->type & DEV_CHAR) && !(ex->type & DEV_CHAR))
continue;
if (ex->major != ~0 && ex->major != refex->major)
continue;
if (ex->minor != ~0 && ex->minor != refex->minor)
continue;
if (refex->access & (~ex->access))
continue;
match = true;
break;
}

That means the new exception should be contained into an existing one to
be considered a match:

New exception Existing match? notes
b 12:34 rwm b 12:34 rwm yes
b 12:34 r b *:34 rw yes
b 12:34 rw b 12:34 w no extra "r"
b *:34 rw b 12:34 rw no too broad "*"
b *:34 rw b *:34 rwm yes

Which is fine in some cases. Consider:

A (default behavior: deny, exceptions allow access)
\
B (default behavior: deny, exceptions allow access)

In this case the full match makes sense, the new exception cannot add
more access than the parent allows

But this doesn't always work, consider:

A (default behavior: allow, exceptions disallow access)
\
B (default behavior: deny, exceptions allow access)

In this case, a new exception in B shouldn't match any of the exceptions
in A, after all you can't allow something that was forbidden by A. But
consider this scenario:

New exception Existing in A match? outcome
b 12:34 rw b 12:34 r no exception is accepted

Because the new exception has "w" as extra, it doesn't match, so it'll
be added to B's exception list.

The same problem can happen during a file access check. Consider a
cgroup with allow as default behavior:

Access Exception match?
b 12:34 rw b 12:34 r no

In this case, the access didn't match any of the exceptions in the
cgroup, which is required since exceptions will disallow access.

To solve this problem, two new functions were created to match an
exception either fully or partially. In the example above, a partial
check will be performed and it'll produce a match since at least
"b 12:34 r" from "b 12:34 rw" access matches.

Cc: cgroups@vger.kernel.org
Cc: Tejun Heo
Cc: Serge Hallyn
Cc: Li Zefan
Cc: stable@vger.kernel.org
Signed-off-by: Aristeu Rozanski
Signed-off-by: Tejun Heo

Aristeu Rozanski
2014-04-22 06:21:42 +0800

19 Mar, 2014

1 commit

4d3bb511b cgroup: drop const from @buffer of cftype->write_string() ... Browse Code »

cftype->write_string() just passes on the writeable buffer from kernfs
and there's no reason to add const restriction on the buffer. The
only thing const achieves is unnecessarily complicating parsing of the
buffer. Drop const from @buffer.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan
Cc: Peter Zijlstra
Cc: Paul Mackerras
Cc: Ingo Molnar
Cc: Arnaldo Carvalho de Melo
Cc: Daniel Borkmann
Cc: Michal Hocko
Cc: Johannes Weiner
Cc: Balbir Singh
Cc: KAMEZAWA Hiroyuki

Tejun Heo
2014-03-19 22:23:54 +0800

08 Feb, 2014

1 commit

073219e99 cgroup: clean up cgroup_subsys names and initialization ... Browse Code »

cgroup_subsys is a bit messier than it needs to be.

* The name of a subsys can be different from its internal identifier
defined in cgroup_subsys.h. Most subsystems use the matching name
but three - cpu, memory and perf_event - use different ones.

* cgroup_subsys_id enums are postfixed with _subsys_id and each
cgroup_subsys is postfixed with _subsys. cgroup.h is widely
included throughout various subsystems, it doesn't and shouldn't
have claim on such generic names which don't have any qualifier
indicating that they belong to cgroup.

* cgroup_subsys->subsys_id should always equal the matching
cgroup_subsys_id enum; however, we require each controller to
initialize it and then BUG if they don't match, which is a bit
silly.

This patch cleans up cgroup_subsys names and initialization by doing
the followings.

* cgroup_subsys_id enums are now postfixed with _cgrp_id, and each
cgroup_subsys with _cgrp_subsys.

* With the above, renaming subsys identifiers to match the userland
visible names doesn't cause any naming conflicts. All non-matching
identifiers are renamed to match the official names.

cpu_cgroup -> cpu
mem_cgroup -> memory
perf -> perf_event

* controllers no longer need to initialize ->subsys_id and ->name.
They're generated in cgroup core and set automatically during boot.

* Redundant cgroup_subsys declarations removed.

* While updating BUG_ON()s in cgroup_init_early(), convert them to
WARN()s. BUGging that early during boot is stupid - the kernel
can't print anything, even through serial console and the trap
handler doesn't even link stack frame properly for back-tracing.

This patch doesn't introduce any behavior changes.

v2: Rebased on top of fe1217c4f3f7 ("net: net_cls: move cgroupfs
classid handling into core").

Signed-off-by: Tejun Heo
Acked-by: Neil Horman
Acked-by: "David S. Miller"
Acked-by: "Rafael J. Wysocki"
Acked-by: Michal Hocko
Acked-by: Peter Zijlstra
Acked-by: Aristeu Rozanski
Acked-by: Ingo Molnar
Acked-by: Li Zefan
Cc: Johannes Weiner
Cc: Balbir Singh
Cc: KAMEZAWA Hiroyuki
Cc: Serge E. Hallyn
Cc: Vivek Goyal
Cc: Thomas Graf

Tejun Heo
2014-02-08 23:36:58 +0800

06 Dec, 2013

1 commit

2da8ca822 cgroup: replace cftype->read_seq_string() with cftype->seq_show() ... Browse Code »

In preparation of conversion to kernfs, cgroup file handling is
updated so that it can be easily mapped to kernfs. This patch
replaces cftype->read_seq_string() with cftype->seq_show() which is
not limited to single_open() operation and will map directcly to
kernfs seq_file interface.

The conversions are mechanical. As ->seq_show() doesn't have @css and
@cft, the functions which make use of them are converted to use
seq_css() and seq_cft() respectively. In several occassions, e.f. if
it has seq_string in its name, the function name is updated to fit the
new method better.

This patch does not introduce any behavior changes.

Signed-off-by: Tejun Heo
Acked-by: Aristeu Rozanski
Acked-by: Vivek Goyal
Acked-by: Michal Hocko
Acked-by: Daniel Wagner
Acked-by: Li Zefan
Cc: Jens Axboe
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Johannes Weiner
Cc: Balbir Singh
Cc: KAMEZAWA Hiroyuki
Cc: Neil Horman

Tejun Heo
2013-12-06 01:28:04 +0800

24 Oct, 2013

1 commit

73ba35347 device_cgroup: remove can_attach ... Browse Code »

It is really only wanting to duplicate a check which is already done by the
cgroup subsystem.

With this patch, user jdoe still cannot move pid 1 into a devices cgroup
he owns, but now he can move his own other tasks into devices cgroups.

Signed-off-by: Serge Hallyn
Signed-off-by: Tejun Heo
Cc: Aristeu Rozanski

Serge Hallyn
2013-10-24 18:56:56 +0800

09 Aug, 2013

7 commits

bd8815a6d cgroup: make css_for_each_descendant() and friends include the origin css in the iteration ... Browse Code »

Previously, all css descendant iterators didn't include the origin
(root of subtree) css in the iteration. The reasons were maintaining
consistency with css_for_each_child() and that at the time of
introduction more use cases needed skipping the origin anyway;
however, given that css_is_descendant() considers self to be a
descendant, omitting the origin css has become more confusing and
looking at the accumulated use cases rather clearly indicates that
including origin would result in simpler code overall.

While this is a change which can easily lead to subtle bugs, cgroup
API including the iterators has recently gone through major
restructuring and no out-of-tree changes will be applicable without
adjustments making this a relatively acceptable opportunity for this
type of change.

The conversions are mostly straight-forward. If the iteration block
had explicit origin handling before or after, it's moved inside the
iteration. If not, if (pos == origin) continue; is added. Some
conversions add extra reference get/put around origin handling by
consolidating origin handling and the rest. While the extra ref
operations aren't strictly necessary, this shouldn't cause any
noticeable difference.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan
Acked-by: Vivek Goyal
Acked-by: Aristeu Rozanski
Acked-by: Michal Hocko
Cc: Jens Axboe
Cc: Matt Helsley
Cc: Johannes Weiner
Cc: Balbir Singh

Tejun Heo
2013-08-09 08:11:27 +0800
492eb21b9 cgroup: make hierarchy iterators deal with cgroup_subsys_state instead of cgroup ... Browse Code »

cgroup is currently in the process of transitioning to using css
(cgroup_subsys_state) as the primary handle instead of cgroup in
subsystem API. For hierarchy iterators, this is beneficial because

* In most cases, css is the only thing subsystems care about anyway.

* On the planned unified hierarchy, iterations for different
subsystems will need to skip over different subtrees of the
hierarchy depending on which subsystems are enabled on each cgroup.
Passing around css makes it unnecessary to explicitly specify the
subsystem in question as css is intersection between cgroup and
subsystem

* For the planned unified hierarchy, css's would need to be created
and destroyed dynamically independent from cgroup hierarchy. Having
cgroup core manage css iteration makes enforcing deref rules a lot
easier.

Most subsystem conversions are straight-forward. Noteworthy changes
are

* blkio: cgroup_to_blkcg() is no longer used. Removed.

* freezer: cgroup_freezer() is no longer used. Removed.

* devices: cgroup_to_devcgroup() is no longer used. Removed.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan
Acked-by: Michal Hocko
Acked-by: Vivek Goyal
Acked-by: Aristeu Rozanski
Cc: Johannes Weiner
Cc: Balbir Singh
Cc: Matt Helsley
Cc: Jens Axboe

Tejun Heo
2013-08-09 08:11:25 +0800
182446d08 cgroup: pass around cgroup_subsys_state instead of cgroup in file methods ... Browse Code »

cgroup is currently in the process of transitioning to using struct
cgroup_subsys_state * as the primary handle instead of struct cgroup.
Please see the previous commit which converts the subsystem methods
for rationale.

This patch converts all cftype file operations to take @css instead of
@cgroup. cftypes for the cgroup core files don't have their subsytem
pointer set. These will automatically use the dummy_css added by the
previous patch and can be converted the same way.

Most subsystem conversions are straight forwards but there are some
interesting ones.

* freezer: update_if_frozen() is also converted to take @css instead
of @cgroup for consistency. This will make the code look simpler
too once iterators are converted to use css.

* memory/vmpressure: mem_cgroup_from_css() needs to be exported to
vmpressure while mem_cgroup_from_cont() can be made static.
Updated accordingly.

* cpu: cgroup_tg() doesn't have any user left. Removed.

* cpuacct: cgroup_ca() doesn't have any user left. Removed.

* hugetlb: hugetlb_cgroup_form_cgroup() doesn't have any user left.
Removed.

* net_cls: cgrp_cls_state() doesn't have any user left. Removed.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan
Acked-by: Michal Hocko
Acked-by: Vivek Goyal
Acked-by: Aristeu Rozanski
Acked-by: Daniel Wagner
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Johannes Weiner
Cc: Balbir Singh
Cc: Matt Helsley
Cc: Jens Axboe
Cc: Steven Rostedt

Tejun Heo
2013-08-09 08:11:24 +0800
eb95419b0 cgroup: pass around cgroup_subsys_state instead of cgroup in subsystem methods ... Browse Code »

cgroup is currently in the process of transitioning to using struct
cgroup_subsys_state * as the primary handle instead of struct cgroup *
in subsystem implementations for the following reasons.

* With unified hierarchy, subsystems will be dynamically bound and
unbound from cgroups and thus css's (cgroup_subsys_state) may be
created and destroyed dynamically over the lifetime of a cgroup,
which is different from the current state where all css's are
allocated and destroyed together with the associated cgroup. This
in turn means that cgroup_css() should be synchronized and may
return NULL, making it more cumbersome to use.

* Differing levels of per-subsystem granularity in the unified
hierarchy means that the task and descendant iterators should behave
differently depending on the specific subsystem the iteration is
being performed for.

* In majority of the cases, subsystems only care about its part in the
cgroup hierarchy - ie. the hierarchy of css's. Subsystem methods
often obtain the matching css pointer from the cgroup and don't
bother with the cgroup pointer itself. Passing around css fits
much better.

This patch converts all cgroup_subsys methods to take @css instead of
@cgroup. The conversions are mostly straight-forward. A few
noteworthy changes are

* ->css_alloc() now takes css of the parent cgroup rather than the
pointer to the new cgroup as the css for the new cgroup doesn't
exist yet. Knowing the parent css is enough for all the existing
subsystems.

* In kernel/cgroup.c::offline_css(), unnecessary open coded css
dereference is replaced with local variable access.

This patch shouldn't cause any behavior differences.

v2: Unnecessary explicit cgrp->subsys[] deref in css_online() replaced
with local variable @css as suggested by Li Zefan.

Rebased on top of new for-3.12 which includes for-3.11-fixes so
that ->css_free() invocation added by da0a12caff ("cgroup: fix a
leak when percpu_ref_init() fails") is converted too. Suggested
by Li Zefan.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan
Acked-by: Michal Hocko
Acked-by: Vivek Goyal
Acked-by: Aristeu Rozanski
Acked-by: Daniel Wagner
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Johannes Weiner
Cc: Balbir Singh
Cc: Matt Helsley
Cc: Jens Axboe
Cc: Steven Rostedt

Tejun Heo
2013-08-09 08:11:23 +0800
638769869 cgroup: add css_parent() ... Browse Code »

Currently, controllers have to explicitly follow the cgroup hierarchy
to find the parent of a given css. cgroup is moving towards using
cgroup_subsys_state as the main controller interface construct, so
let's provide a way to climb the hierarchy using just csses.

This patch implements css_parent() which, given a css, returns its
parent. The function is guarnateed to valid non-NULL parent css as
long as the target css is not at the top of the hierarchy.

freezer, cpuset, cpu, cpuacct, hugetlb, memory, net_cls and devices
are converted to use css_parent() instead of accessing cgroup->parent
directly.

* __parent_ca() is dropped from cpuacct and its usage is replaced with
parent_ca(). The only difference between the two was NULL test on
cgroup->parent which is now embedded in css_parent() making the
distinction moot. Note that eventually a css->parent field will be
added to css and the NULL check in css_parent() will go away.

This patch shouldn't cause any behavior differences.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan

Tejun Heo
2013-08-09 08:11:23 +0800
a7c6d554a cgroup: add/update accessors which obtain subsys specific data from css ... Browse Code »

css (cgroup_subsys_state) is usually embedded in a subsys specific
data structure. Subsystems either use container_of() directly to cast
from css to such data structure or has an accessor function wrapping
such cast. As cgroup as whole is moving towards using css as the main
interface handle, add and update such accessors to ease dealing with
css's.

All accessors explicitly handle NULL input and return NULL in those
cases. While this looks like an extra branch in the code, as all
controllers specific data structures have css as the first field, the
casting doesn't involve any offsetting and the compiler can trivially
optimize out the branch.

* blkio, freezer, cpuset, cpu, cpuacct and net_cls didn't have such
accessor. Added.

* memory, hugetlb and devices already had one but didn't explicitly
handle NULL input. Updated.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan

Tejun Heo
2013-08-09 08:11:23 +0800
8af01f56a cgroup: s/cgroup_subsys_state/cgroup_css/ s/task_subsys_state/task_css/ ... Browse Code »

The names of the two struct cgroup_subsys_state accessors -
cgroup_subsys_state() and task_subsys_state() - are somewhat awkward.
The former clashes with the type name and the latter doesn't even
indicate it's somehow related to cgroup.

We're about to revamp large portion of cgroup API, so, let's rename
them so that they're less awkward. Most per-controller usages of the
accessors are localized in accessor wrappers and given the amount of
scheduled changes, this isn't gonna add any noticeable headache.

Rename cgroup_subsys_state() to cgroup_css() and task_subsys_state()
to task_css(). This patch is pure rename.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan

Tejun Heo
2013-08-09 08:11:22 +0800

24 May, 2013

1 commit

d591fb566 device_cgroup: simplify cgroup tree walk in propagate_exception() ... Browse Code »

During a config change, propagate_exception() needs to traverse the
subtree to update config on the subtree. Because such config updates
need to allocate memory, it couldn't directly use
cgroup_for_each_descendant_pre() which required the whole iteration to
be contained in a single RCU read critical section. To work around
the limitation, propagate_exception() built a linked list of
descendant cgroups while read-locking RCU and then walked the list
afterwards, which is safe as the whole iteration is protected by
devcgroup_mutex. This works but is cumbersome.

With the recent updates, cgroup iterators now allow dropping RCU read
lock while iteration is in progress making this workaround no longer
necessary. This patch replaces dev_cgroup->propagate_pending list and
get_online_devcg() with direct cgroup_for_each_descendant_pre() walk.

Signed-off-by: Tejun Heo
Cc: Aristeu Rozanski
Acked-by: Serge E. Hallyn
Reviewed-by: Michal Hocko

Tejun Heo
2013-05-24 09:55:38 +0800

19 Apr, 2013

1 commit

e57d5cf2f devcg: remove parent_cgroup. ... Browse Code »

In devcgroup_css_alloc(), there is no longer need for parent_cgroup.
bd2953ebbb("devcg: propagate local changes down the hierarchy") made
the variable parent_cgroup redundant. This patch removes parent_cgroup
from devcgroup_css_alloc().

Signed-off-by: Rami Rosen
Acked-by: Aristeu Rozanski
Signed-off-by: Tejun Heo

Rami Rosen
2013-04-19 02:34:35 +0800

08 Apr, 2013

1 commit

8adf12b0f devcg: remove broken_hierarchy tag ... Browse Code »

bd2953ebbb ("devcg: propagate local changes down the hierarchy")
implemented proper hierarchy support. Remove the broken tag.

Signed-off-by: Tejun Heo
Acked-by: Aristeu Rozanski

Tejun Heo
2013-04-08 23:31:59 +0800

20 Mar, 2013

4 commits

bd2953ebb devcg: propagate local changes down the hierarchy ... Browse Code »

This patch makes exception changes to propagate down in hierarchy respecting
when possible local exceptions.

New exceptions allowing additional access to devices won't be propagated, but
it'll be possible to add an exception to access all of part of the newly
allowed device(s).

New exceptions disallowing access to devices will be propagated down and the
local group's exceptions will be revalidated for the new situation.
Example:
A
/ \
B

group behavior exceptions
A allow "b 8:* rwm", "c 116:1 rw"
B deny "c 1:3 rwm", "c 116:2 rwm", "b 3:* rwm"

If a new exception is added to group A:
# echo "c 116:* r" > A/devices.deny
it'll propagate down and after revalidating B's local exceptions, the exception
"c 116:2 rwm" will be removed.

In case parent's exceptions change and local exceptions are not allowed anymore,
they'll be deleted.

v7:
- do not allow behavior change when the cgroup has children
- update documentation

v6: fixed issues pointed by Serge Hallyn
- only copy parent's exceptions while propagating behavior if the local
behavior is different
- while propagating exceptions, do not clear and copy parent's: it'd be against
the premise we don't propagate access to more devices

v5: fixed issues pointed by Serge Hallyn
- updated documentation
- not propagating when an exception is written to devices.allow
- when propagating a new behavior, clean the local exceptions list if they're
for a different behavior

v4: fixed issues pointed by Tejun Heo
- separated function to walk the tree and collect valid propagation targets

v3: fixed issues pointed by Tejun Heo
- update documentation
- move css_online/css_offline changes to a new patch
- use cgroup_for_each_descendant_pre() instead of own descendant walk
- move exception_copy rework to a separared patch
- move exception_clean rework to a separated patch

v2: fixed issues pointed by Tejun Heo
- instead of keeping the local settings that won't apply anymore, remove them

Cc: Tejun Heo
Cc: Serge Hallyn
Signed-off-by: Aristeu Rozanski
Signed-off-by: Tejun Heo

Aristeu Rozanski
2013-03-20 22:50:21 +0800
1909554c9 devcg: use css_online and css_offline ... Browse Code »

Allocate resources and change behavior only when online. This is needed in
order to determine if a node is suitable for hierarchy propagation or if it's
being removed.

Locking:
Both functions take devcgroup_mutex to make changes to device_cgroup structure.
Hierarchy propagation will also take devcgroup_mutex before walking the
tree while walking the tree itself is protected by rcu lock.

Acked-by: Tejun Heo
Acked-by: Serge Hallyn
Cc: Tejun Heo
Cc: Serge Hallyn
Signed-off-by: Aristeu Rozanski
Signed-off-by: Tejun Heo

Aristeu Rozanski
2013-03-20 22:50:17 +0800
c39a2a301 devcg: prepare may_access() for hierarchy support ... Browse Code »

Currently may_access() is only able to verify if an exception is valid for the
current cgroup, which has the same behavior. With hierarchy, it'll be also used
to verify if a cgroup local exception is valid towards its cgroup parent, which
might have different behavior.

v2:
- updated patch description
- rebased on top of a new patch to expand the may_access() logic to make it
more clear
- fixed argument description order in may_access()

Acked-by: Tejun Heo
Acked-by: Serge Hallyn
Cc: Tejun Heo
Cc: Serge Hallyn
Signed-off-by: Aristeu Rozanski
Signed-off-by: Tejun Heo

Aristeu Rozanski
2013-03-20 22:50:13 +0800
26898fdff devcg: expand may_access() logic ... Browse Code »

In order to make the next patch more clear, expand may_access() logic.

v2: may_access() returns bool now

Acked-by: Tejun Heo
Acked-by: Serge Hallyn
Cc: Tejun Heo
Cc: Serge Hallyn
Signed-off-by: Aristeu Rozanski
Signed-off-by: Tejun Heo

Aristeu Rozanski
2013-03-20 22:50:09 +0800

22 Feb, 2013

1 commit

53eb8c82d device_cgroup: don't grab mutex in rcu callback ... Browse Code »

Commit 103a197c0c4e ("security/device_cgroup: lock assert fails in
dev_exception_clean()") grabs devcgroup_mutex to fix assert failure, but
a mutex can't be grabbed in rcu callback. Since there shouldn't be any
other references when css_free is called, mutex isn't needed for list
cleanup in devcgroup_css_free().

Signed-off-by: Jerry Snitselaar
Acked-by: Tejun Heo
Acked-by: Aristeu Rozanski
Cc: James Morris
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jerry Snitselaar
2013-02-22 09:22:15 +0800

21 Jan, 2013

1 commit

103a197c0 security/device_cgroup: lock assert fails in dev_exception_clean() ... Browse Code »

devcgroup_css_free() calls dev_exception_clean() without the devcgroup_mutex being locked.

Shutting down a kvm virt was giving me the following trace:

[36280.732764] ------------[ cut here ]------------
[36280.732778] WARNING: at /home/snits/dev/linux/security/device_cgroup.c:172 dev_exception_clean+0xa9/0xc0()
[36280.732782] Hardware name: Studio XPS 8100
[36280.732785] Modules linked in: xt_REDIRECT fuse ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_CHECKSUM iptable_mangle bridge stp llc nf_conntrack_ipv4 ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter it87 hwmon_vid xt_state nf_conntrack ip6_tables snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq coretemp snd_seq_device crc32c_intel snd_pcm snd_page_alloc snd_timer snd broadcom tg3 serio_raw i7core_edac edac_core ptp pps_core lpc_ich pcspkr mfd_core soundcore microcode i2c_i801 nfsd auth_rpcgss nfs_acl lockd vhost_net sunrpc tun macvtap macvlan kvm_intel kvm uinput binfmt_misc autofs4 usb_storage firewire_ohci firewire_core crc_itu_t radeon drm_kms_helper ttm
[36280.732921] Pid: 933, comm: libvirtd Tainted: G W 3.8.0-rc3-00307-g4c217de #1
[36280.732922] Call Trace:
[36280.732927] [] warn_slowpath_common+0x93/0xc0
[36280.732930] [] warn_slowpath_null+0x1a/0x20
[36280.732932] [] dev_exception_clean+0xa9/0xc0
[36280.732934] [] devcgroup_css_free+0x1a/0x30
[36280.732938] [] cgroup_diput+0x76/0x210
[36280.732941] [] d_delete+0x120/0x180
[36280.732943] [] vfs_rmdir+0xef/0x130
[36280.732945] [] do_rmdir+0x107/0x1c0
[36280.732949] [] ? trace_hardirqs_on_thunk+0x3a/0x3f
[36280.732951] [] sys_rmdir+0x16/0x20
[36280.732954] [] system_call_fastpath+0x16/0x1b
[36280.732956] ---[ end trace ca39dced899a7d9f ]---

Signed-off-by: Jerry Snitselaar
Cc: stable@kernel.org
Signed-off-by: James Morris

Jerry Snitselaar
2013-01-21 21:27:55 +0800

20 Nov, 2012

1 commit

92fb97487 cgroup: rename ->create/post_create/pre_destroy/destroy() to ->css_alloc/online/offline/free() ... Browse Code »

Rename cgroup_subsys css lifetime related callbacks to better describe
what their roles are. Also, update documentation.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan

Tejun Heo
2012-11-20 00:13:38 +0800

07 Nov, 2012

2 commits

4b1c7840b device_cgroup: add lockdep asserts ... Browse Code »

device_cgroup uses RCU safe ->exceptions list which is write-protected
by devcgroup_mutex and has had some issues using locking correctly.
Add lockdep asserts to utility functions so that future errors can be
easily detected.

Signed-off-by: Tejun Heo
Acked-by: Serge E. Hallyn
Cc: Aristeu Rozanski
Cc: Li Zefan

Tejun Heo
2012-11-07 04:28:04 +0800
201e72acb device_cgroup: fix RCU usage ... Browse Code »

dev_cgroup->exceptions is protected with devcgroup_mutex for writes
and RCU for reads; however, RCU usage isn't correct.

* dev_exception_clean() doesn't use RCU variant of list_del() and
kfree(). The function can race with may_access() and may_access()
may end up dereferencing already freed memory. Use list_del_rcu()
and kfree_rcu() instead.

* may_access() may be called only with RCU read locked but doesn't use
RCU safe traversal over ->exceptions. Use list_for_each_entry_rcu().

Signed-off-by: Tejun Heo
Acked-by: Serge E. Hallyn
Cc: stable@vger.kernel.org
Cc: Aristeu Rozanski
Cc: Li Zefan

Tejun Heo
2012-11-07 04:25:51 +0800

06 Nov, 2012

1 commit

64e104771 device_cgroup: fix unchecked cgroup parent usage ... Browse Code »

In 4cef7299b478687 ("device_cgroup: add proper checking when changing
default behavior") the cgroup parent usage is unchecked. root will not
have a parent and trying to use device.{allow,deny} will cause problems.
For some reason my stressing scripts didn't test the root directory so I
didn't catch it on my regular tests.

Signed-off-by: Aristeu Rozanski
Cc: Li Zefan
Cc: James Morris
Cc: Pavel Emelyanov
Acked-by: Serge E. Hallyn
Cc: Jiri Slaby
Cc: Tejun Heo
Signed-off-by: Tejun Heo

Aristeu Rozanski
2012-11-06 23:25:20 +0800

26 Oct, 2012

4 commits

4cef7299b device_cgroup: add proper checking when changing default behavior ... Browse Code »

Before changing a group's default behavior to ALLOW, we must check if
its parent's behavior is also ALLOW.

Signed-off-by: Aristeu Rozanski
Cc: Tejun Heo
Cc: Li Zefan
Cc: James Morris
Cc: Pavel Emelyanov
Acked-by: Serge Hallyn
Cc: Jiri Slaby
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Aristeu Rozanski
2012-10-26 05:37:52 +0800
26fd8405d device_cgroup: stop using simple_strtoul() ... Browse Code »

Convert the code to use kstrtou32() instead of simple_strtoul() which is
deprecated. The real size of the variables are u32, so use kstrtou32
instead of kstrtoul

Signed-off-by: Aristeu Rozanski
Cc: Dave Jones
Cc: Tejun Heo
Cc: Li Zefan
Cc: James Morris
Cc: Pavel Emelyanov
Acked-by: Serge Hallyn
Cc: Jiri Slaby
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Aristeu Rozanski
2012-10-26 05:37:52 +0800
5b7aa7d5b device_cgroup: rename deny_all to behavior ... Browse Code »

This was done in a v2 patch but v1 ended up being committed. The
variable name is less confusing and stores the default behavior when no
matching exception exists.

Signed-off-by: Aristeu Rozanski
Cc: Dave Jones
Cc: Tejun Heo
Cc: Li Zefan
Cc: James Morris
Cc: Pavel Emelyanov
Acked-by: Serge Hallyn
Cc: Jiri Slaby
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Aristeu Rozanski
2012-10-26 05:37:52 +0800
8c9506d16 cgroup: fix invalid rcu dereference ... Browse Code »

Commit ad676077a2ae ("device_cgroup: convert device_cgroup internally to
policy + exceptions") removed rcu locks which are needed in
task_devcgroup called in this chain:

devcgroup_inode_mknod OR __devcgroup_inode_permission ->
__devcgroup_inode_permission ->
task_devcgroup ->
task_subsys_state ->
task_subsys_state_check.

Change the code so that task_devcgroup is safely called with rcu read
lock held.

===============================
[ INFO: suspicious RCU usage. ]
3.6.0-rc5-next-20120913+ #42 Not tainted
-------------------------------
include/linux/cgroup.h:553 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 1, debug_locks = 0
2 locks held by kdevtmpfs/23:
#0: (sb_writers){.+.+.+}, at: []
mnt_want_write+0x1f/0x50
#1: (&sb->s_type->i_mutex_key#3/1){+.+.+.}, at: []
kern_path_create+0x7f/0x170

stack backtrace:
Pid: 23, comm: kdevtmpfs Not tainted 3.6.0-rc5-next-20120913+ #42
Call Trace:
lockdep_rcu_suspicious+0xfd/0x130
devcgroup_inode_mknod+0x19d/0x240
vfs_mknod+0x71/0xf0
handle_create.isra.2+0x72/0x200
devtmpfsd+0x114/0x140
? handle_create.isra.2+0x200/0x200
kthread+0xd6/0xe0
kernel_thread_helper+0x4/0x10

Signed-off-by: Jiri Slaby
Cc: Dave Jones
Cc: Tejun Heo
Cc: Li Zefan
Cc: James Morris
Cc: Pavel Emelyanov
Acked-by: Serge Hallyn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jiri Slaby
2012-10-26 05:37:52 +0800

06 Oct, 2012

4 commits

db9aeca97 device_cgroup: rename whitelist to exception list ... Browse Code »

This patch replaces the "whitelist" usage in the code and comments and replace
them by exception list related information.

Signed-off-by: Aristeu Rozanski
Cc: Tejun Heo
Cc: Li Zefan
Cc: James Morris
Cc: Pavel Emelyanov
Acked-by: Serge E. Hallyn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Aristeu Rozanski
2012-10-06 02:05:14 +0800
ad676077a device_cgroup: convert device_cgroup internally to policy + exceptions ... Browse Code »

The original model of device_cgroup is having a whitelist where all the
allowed devices are listed. The problem with this approach is that is
impossible to have the case of allowing everything but few devices.

The reason for that lies in the way the whitelist is handled internally:
since there's only a whitelist, the "all devices" entry would have to be
removed and replaced by the entire list of possible devices but the ones
that are being denied. Since dev_t is 32 bits long, representing the allowed
devices as a bitfield is not memory efficient.

This patch replaces the "whitelist" by a "exceptions" list and the default
policy is kept as "deny_all" variable in dev_cgroup structure.

The current interface determines that whenever "a" is written to devices.allow
or devices.deny, the entry masking all devices will be added or removed,
respectively. This behavior is kept and it's what will determine the default
policy:

# cat devices.list
a *:* rwm
# echo a >devices.deny
# cat devices.list
# echo a >devices.allow
# cat devices.list
a *:* rwm

The interface is also preserved. For example, if one wants to block only access
to /dev/null:
# ls -l /dev/null
crw-rw-rw- 1 root root 1, 3 Jul 24 16:17 /dev/null
# echo a >devices.allow
# echo "c 1:3 rwm" >devices.deny
# cat /dev/null
cat: /dev/null: Operation not permitted
# echo >/dev/null
bash: /dev/null: Operation not permitted
mknod /tmp/null c 1 3
mknod: `/tmp/null': Operation not permitted
# echo "c 1:3 r" >devices.allow
# cat /dev/null
# echo >/dev/null
bash: /dev/null: Operation not permitted
mknod /tmp/null c 1 3
mknod: `/tmp/null': Operation not permitted
# echo "c 1:3 rw" >devices.allow
# echo >/dev/null
# cat /dev/null
# mknod /tmp/null c 1 3
mknod: `/tmp/null': Operation not permitted
# echo "c 1:3 rwm" >devices.allow
# echo >/dev/null
# cat /dev/null
# mknod /tmp/null c 1 3
#

Note that I didn't rename the functions/variables in this patch, but in the
next one to make reviewing easier.

Signed-off-by: Aristeu Rozanski
Cc: Tejun Heo
Cc: Li Zefan
Cc: James Morris
Cc: Pavel Emelyanov
Acked-by: Serge E. Hallyn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Aristeu Rozanski
2012-10-06 02:05:14 +0800
868539a3b device_cgroup: introduce dev_whitelist_clean() ... Browse Code »

This function cleans all the items in a whitelist and will be used by the next
patches.

Signed-off-by: Aristeu Rozanski
Cc: Tejun Heo
Cc: Li Zefan
Cc: James Morris
Cc: Pavel Emelyanov
Acked-by: Serge E. Hallyn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Aristeu Rozanski
2012-10-06 02:05:14 +0800
66b8ef677 device_cgroup: add "deny_all" in dev_cgroup structure ... Browse Code »

deny_all will determine if the default policy is to deny all device access
unless for the ones in the exception list.

This variable will be used in the next patches to convert device_cgroup
internally into a default policy + rules.

Signed-off-by: Aristeu Rozanski
Cc: Tejun Heo
Cc: Li Zefan
Cc: James Morris
Cc: Pavel Emelyanov
Acked-by: Serge E. Hallyn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Aristeu Rozanski
2012-10-06 02:05:13 +0800