Doug / smarc-fsl-linux-kernel | Embedian Git Server

03 Oct, 2012

1 commit

68d47a137 Merge branch 'for-3.7-hierarchy' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup hierarchy update from Tejun Heo:
"Currently, different cgroup subsystems handle nested cgroups
completely differently. There's no consistency among subsystems and
the behaviors often are outright broken.

People at least seem to agree that the broken hierarchy behaviors need
to be weeded out if any progress is gonna be made on this front and
that the fallouts from deprecating the broken behaviors should be
acceptable especially given that the current behaviors don't make much
sense when nested.

This patch makes cgroup emit warning messages if cgroups for
subsystems with broken hierarchy behavior are nested to prepare for
fixing them in the future. This was put in a separate branch because
more related changes were expected (didn't make it this round) and the
memory cgroup wanted to pull in this and make changes on top."

* 'for-3.7-hierarchy' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them

Linus Torvalds
2012-10-03 01:52:28 +0800

15 Sep, 2012

5 commits

8c7f6edbd cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them

Currently, cgroup hierarchy support is a mess. cpu related subsystems
behave correctly - configuration, accounting and control on a parent
properly cover its children. blkio and freezer completely ignore
hierarchy and treat all cgroups as if they're directly under the root
cgroup. Others show yet different behaviors.

These differing interpretations of cgroup hierarchy make using cgroup
confusing and make it impossible to co-mount controllers into the same
hierarchy and obtain sane behavior.

Eventually, we want full hierarchy support from all subsystems and
probably a unified hierarchy. Users using separate hierarchies
expecting completely different behaviors depending on the mounted
subsystem is detrimental to making any progress on this front.

This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
for controllers which are lacking in hierarchy support. The goal of
this patch is two-fold.

* Move users away from using hierarchy on currently non-hierarchical
subsystems, so that implementing proper hierarchy support on those
doesn't surprise them.

* Keep track of which controllers are broken how and nudge the
subsystems to implement proper hierarchy support.

For now, start with a single warning message. We can whine louder
later on.
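
A minimal sketch of the check described above, in plain userspace C. The struct names, the `warned` field, and the message text are hypothetical stand-ins chosen for illustration; only the `broken_hierarchy` flag comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct subsys_sketch {
    const char *name;
    bool broken_hierarchy;  /* true: nesting is not handled properly */
    bool warned;            /* assumed helper flag: warn once per subsystem */
};

struct cgroup_sketch {
    struct cgroup_sketch *parent;  /* NULL for the root cgroup */
};

/*
 * Meant to run after the new cgroup is fully created, so that the
 * subsystem's create step had a chance to change broken_hierarchy.
 * Returns true if a warning was emitted.
 */
static bool warn_if_nested(struct subsys_sketch *ss,
                           const struct cgroup_sketch *new_cgrp)
{
    bool nested;

    assert(ss != NULL && new_cgrp != NULL);

    /* Nested means the parent is itself not the root cgroup. */
    nested = new_cgrp->parent != NULL && new_cgrp->parent->parent != NULL;

    if (!ss->broken_hierarchy || !nested || ss->warned)
        return false;

    fprintf(stderr, "%s: nested cgroups are not properly supported\n",
            ss->name);
    ss->warned = true;
    return true;
}
```

A cgroup directly under the root does not trigger the warning; only the first nested one does, which matches the "single warning message" goal stated above.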

v2: Fixed a typo spotted by Michal. Warning message updated.

v3: Updated memcg part so that it doesn't generate warning in the
cases where .use_hierarchy=false doesn't make the behavior
different from root.use_hierarchy=true. Fixed a typo spotted by
Glauber.

v4: Check ->broken_hierarchy after cgroup creation is complete so that
->create() can affect the result per Michal. Dropped unnecessary
memcg root handling per Michal.

Signed-off-by: Tejun Heo
Acked-by: Michal Hocko
Acked-by: Li Zefan
Acked-by: Serge E. Hallyn
Cc: Glauber Costa
Cc: Peter Zijlstra
Cc: Paul Turner
Cc: Johannes Weiner
Cc: Thomas Graf
Cc: Vivek Goyal
Cc: Paul Mackerras
Cc: Ingo Molnar
Cc: Arnaldo Carvalho de Melo
Cc: Neil Horman
Cc: Aneesh Kumar K.V

Tejun Heo
2012-09-15 03:01:16 +0800
8a8e04df4 cgroup: Assign subsystem IDs during compile time

WARNING: With this change it is no longer possible to load externally
built controllers.

In the case where CONFIG_NETPRIO_CGROUP=m and CONFIG_NET_CLS_CGROUP=m
are set, the corresponding subsys_id should also be a constant. Up to now,
net_prio_subsys_id and net_cls_subsys_id would be of the type int and
the value would be assigned during runtime.

By switching the macro definition IS_SUBSYS_ENABLED from IS_BUILTIN
to IS_ENABLED, all *_subsys_id will have constant value. That means we
need to remove all the code which assumes a value can be assigned to
net_prio_subsys_id and net_cls_subsys_id.

A close look is necessary at the RCU part, which was introduced by
the following patch:

commit f845172531fb7410c7fb7780b1a6e51ee6df7d52
Author: Herbert Xu Mon May 24 09:12:34 2010
Committer: David S. Miller Mon May 24 09:12:34 2010

cls_cgroup: Store classid in struct sock

This code was added to init_cgroup_cls()

    /* We can't use rcu_assign_pointer because this is an int. */
    smp_wmb();
    net_cls_subsys_id = net_cls_subsys.subsys_id;

respectively to exit_cgroup_cls()

    net_cls_subsys_id = -1;
    synchronize_rcu();

and in the module version of task_cls_classid()

    rcu_read_lock();
    id = rcu_dereference(net_cls_subsys_id);
    if (id >= 0)
        classid = container_of(task_subsys_state(p, id),
                               struct cgroup_cls_state, css)->classid;
    rcu_read_unlock();

without an explicit explanation of why the RCU part is needed. (The
rcu_dereference() was fixed by changing it to rcu_dereference_index_check()
in a later commit, but that is a minor detail.)

So here is my pondering on why it was introduced and why it is safe to
remove it now. Note that this code was copied over to net_prio, so the
reasoning holds for that subsystem too.

The idea behind the RCU use for net_cls_subsys_id is to make sure we
get a valid pointer back from task_subsys_state(). task_subsys_state()
is just blindly accessing the subsys array and returning the
pointer. Obviously, passing in -1 as id into task_subsys_state()
returns an invalid value (out of lower bound).

So this code makes sure that the id is assigned only after the module is
loaded and the subsystem is registered.

Before unregistering the module, all old readers must have left the
critical section. This is done by assigning -1 to the id and issuing a
synchronize_rcu(). Any new readers won't call task_subsys_state()
anymore and therefore it is safe to unregister the subsystem.

The new code relies on the same trick, but it looks at the subsys
pointer returned by task_subsys_state() (remember the id is constant
and therefore we always have a valid index into the subsys
array).

No precautions need to be taken during module loading. Eventually,
all CPUs will get a valid pointer back from task_subsys_state()
because rebind_subsystem(), which is called after the module init()
function, will assign subsys[net_cls_subsys_id] the newly loaded
module subsystem pointer.

When the subsystem is about to be removed, rebind_subsystem() will be
called before the module exit() function. In this case,
rebind_subsys() will assign subsys[net_cls_subsys_id] a NULL pointer
and then call synchronize_rcu(). By then, all old readers have left
the critical section. Any new reader won't access the subsystem
anymore. At this point we are safe to unregister the subsystem. No
further synchronize_rcu() call is needed.

Signed-off-by: Daniel Wagner
Signed-off-by: Tejun Heo
Acked-by: Li Zefan
Acked-by: Neil Horman
Cc: "David S. Miller"
Cc: "Paul E. McKenney"
Cc: Andrew Morton
Cc: Eric Dumazet
Cc: Gao feng
Cc: Glauber Costa
Cc: Herbert Xu
Cc: Jamal Hadi Salim
Cc: John Fastabend
Cc: Kamezawa Hiroyuki
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org

Daniel Wagner
2012-09-15 00:57:43 +0800
80f4c8777 cgroup: Do not depend on a given order when populating the subsys array

The *_subsys_id will be used as an index to access the subsys array.
Therefore we need to take care to populate each subsystem at the correct
position by using designated initialization.

With this change we are able to interleave builtin and modular subsystems
in the subsys array.
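
For illustration, a small sketch of designated initialization for such an array. The enum values, struct, and variable names here are hypothetical and much simpler than the kernel's; the point is only that each entry lands at the index named by its ID, whatever order the entries are written in.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subsystem IDs; a modular one sits between two builtin ones. */
enum sketch_subsys_id {
    SKETCH_CPU_ID,
    SKETCH_NET_CLS_ID,   /* assumed to be built as a module */
    SKETCH_MEMORY_ID,
    SKETCH_SUBSYS_COUNT
};

struct sketch_subsys {
    const char *name;
};

static struct sketch_subsys cpu_subsys = { "cpu" };
static struct sketch_subsys memory_subsys = { "memory" };

/*
 * Designated initializers: the written order does not matter. The slot
 * of the modular subsystem is left NULL until the module registers.
 */
static struct sketch_subsys *subsys[SKETCH_SUBSYS_COUNT] = {
    [SKETCH_MEMORY_ID] = &memory_subsys,
    [SKETCH_CPU_ID]    = &cpu_subsys,
};

/* Look up a subsystem name by ID; NULL if the slot is still empty. */
static const char *subsys_name(enum sketch_subsys_id id)
{
    assert(id < SKETCH_SUBSYS_COUNT);
    return subsys[id] ? subsys[id]->name : NULL;
}
```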

Signed-off-by: Daniel Wagner
Signed-off-by: Tejun Heo
Acked-by: Li Zefan
Acked-by: Neil Horman
Cc: Gao feng
Cc: Jamal Hadi Salim
Cc: John Fastabend
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org

Daniel Wagner
2012-09-15 00:57:40 +0800
5fc0b0254 cgroup: Wrap subsystem selection macro

Before we are able to define all subsystem ids at compile time we need
finer-grained control over what gets defined when we include
cgroup_subsys.h. For example, we define the enums for the subsystems, or
declare the struct cgroup_subsys instances (for builtin subsystems), by
including cgroup_subsys.h and defining SUBSYS accordingly.

Currently, the decision if a subsys is used is defined inside the
header by testing if CONFIG_*=y is true. By moving this test outside
of cgroup_subsys.h we are able to control it on the include level.

This is done by introducing IS_SUBSYS_ENABLED, which is then defined
according to the task, e.g. whether CONFIG_*=y or CONFIG_*=m.
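
The idea of moving the selection test out of the list and into the includer can be sketched in a single file with an X-macro. This is not the kernel's mechanism (which re-includes cgroup_subsys.h with different IS_SUBSYS_ENABLED definitions); it is a swapped-in, self-contained analogue, and every macro and identifier below is hypothetical.

```c
#include <assert.h>

/* Hypothetical config values: y (builtin), m (module), n (disabled). */
#define CFG_CPUSETS y
#define CFG_NET_CLS m
#define CFG_FREEZER n

/* The list makes no decision itself; it hands each config value to X. */
#define SKETCH_SUBSYS_LIST(X)  \
    X(CFG_CPUSETS, cpuset)     \
    X(CFG_NET_CLS, net_cls)    \
    X(CFG_FREEZER, freezer)

/*
 * cfg is macro-expanded when it is passed on to SKETCH_PASTE, so y, m or
 * n is pasted rather than the CFG_* name.
 */
#define SKETCH_PASTE(a, b) a##b

/* Selection 1: emit an ID only for builtin (=y) subsystems. */
#define BUILTIN_ID_y(name) name##_builtin_id,
#define BUILTIN_ID_m(name)
#define BUILTIN_ID_n(name)
#define IF_BUILTIN(cfg, name) SKETCH_PASTE(BUILTIN_ID_, cfg)(name)

enum sketch_builtin_id {
    SKETCH_SUBSYS_LIST(IF_BUILTIN)
    SKETCH_BUILTIN_COUNT
};

/* Selection 2: emit an ID for builtin and modular (=y or =m) subsystems. */
#define ENABLED_ID_y(name) name##_enabled_id,
#define ENABLED_ID_m(name) name##_enabled_id,
#define ENABLED_ID_n(name)
#define IF_ENABLED(cfg, name) SKETCH_PASTE(ENABLED_ID_, cfg)(name)

enum sketch_enabled_id {
    SKETCH_SUBSYS_LIST(IF_ENABLED)
    SKETCH_ENABLED_COUNT
};

/* Subsystems that are enabled but not builtin, i.e. built as modules. */
static int sketch_modular_count(void)
{
    int enabled = (int)SKETCH_ENABLED_COUNT;
    int builtin = (int)SKETCH_BUILTIN_COUNT;

    assert(enabled >= builtin);
    return enabled - builtin;
}
```

The same list yields different enums depending on which selector the user of the list passes in, which is the kind of control the commit message asks for at the include level.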

Signed-off-by: Daniel Wagner
Signed-off-by: Tejun Heo
Acked-by: Li Zefan
Acked-by: Neil Horman
Cc: Gao feng
Cc: Jamal Hadi Salim
Cc: John Fastabend
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org

Daniel Wagner
2012-09-15 00:57:37 +0800
be45c900f cgroup: Remove CGROUP_BUILTIN_SUBSYS_COUNT

CGROUP_BUILTIN_SUBSYS_COUNT is used as start index or stop index when
looping over the subsys array looking either at the builtin or the
module subsystems. Since all the builtin subsystems have an id which
is lower than CGROUP_BUILTIN_SUBSYS_COUNT we know that any module will
have an id larger than CGROUP_BUILTIN_SUBSYS_COUNT. In short the ids
are sorted.

We are about to change id assignment to happen only at compile time
later in this series. That means we can't rely on the above trick
since all ids will always be defined at compile time. Furthermore,
ordering the builtin subsystems and the module subsystems is not
really necessary.

So we need a different way to know which subsystem is a builtin or a
module one. We can use the subsys[]->module pointer for this. Any
place where we need to know if a subsys is a module we just check for
the pointer. If it is NULL then the subsystem is a builtin one.
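
A minimal sketch of that test, assuming simplified stand-in structs (the names below are hypothetical; only the idea of checking the module pointer for NULL comes from the text above):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_module {
    const char *name;
};

struct sketch_subsys {
    const char *name;
    struct sketch_module *module;  /* NULL for a builtin subsystem */
};

/* A subsystem is modular exactly when its module pointer is set. */
static bool subsys_is_module(const struct sketch_subsys *ss)
{
    assert(ss != NULL);
    return ss->module != NULL;
}

/*
 * Walk the whole array instead of relying on builtin entries coming
 * first; empty slots (NULL) are skipped.
 */
static size_t count_modular(struct sketch_subsys *const *subsys, size_t n)
{
    size_t i, count = 0;

    for (i = 0; i < n; i++) {
        if (subsys[i] != NULL && subsys_is_module(subsys[i]))
            count++;
    }
    return count;
}
```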

With this we are able to drop the CGROUP_BUILTIN_SUBSYS_COUNT
enum. Though we need to introduce a temporary placeholder so that we
don't get a compilation error when only CONFIG_CGROUP is selected and
no single controller. An empty enum definition is not valid. Later in
this series we are able to remove the placeholder again.

And with this change we get a fix for this:

kernel/cgroup.c: In function ‘cgroup_load_subsys’:
kernel/cgroup.c:4326:38: warning: array subscript is below array bounds [-Warray-bounds]

when CONFIG_CGROUP=y and no built in controller was enabled.

Signed-off-by: Daniel Wagner
Signed-off-by: Tejun Heo
Acked-by: Li Zefan
Acked-by: Neil Horman
Cc: Gao feng
Cc: Jamal Hadi Salim
Cc: John Fastabend
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org

Daniel Wagner
2012-09-15 00:57:32 +0800

25 Aug, 2012

3 commits

a1a71b45a cgroup: rename subsys_bits to subsys_mask

In a previous discussion, Tejun Heo suggested renaming references to
subsys_bits (added_bits, removed_bits, etc.) to something more meaningful.

Cc: Li Zefan
Cc: Tejun Heo
Cc: Hugh Dickins
Cc: Hillf Danton
Cc: Lennart Poettering
Signed-off-by: Aristeu Rozanski
Signed-off-by: Tejun Heo

Aristeu Rozanski
2012-08-25 06:55:33 +0800
03b1cde6b cgroup: add xattr support

This is one of the items in the plumber's wish list.

For use cases:

>> What would the use case be for this?
>
> Attaching meta information to services, in an easily discoverable
> way. For example, in systemd we create one cgroup for each service, and
> could then store data like the main pid of the specific service as an
> xattr on the cgroup itself. That way we'd have almost all service state
> in the cgroupfs, which would make it possible to terminate systemd and
> later restart it without losing any state information. But there's more:
> for example, some very peculiar services cannot be terminated on
> shutdown (i.e. fakeraid DM stuff) and it would be really nice if the
> services in question could just mark that on their cgroup, by setting an
> xattr. On the more desktopy side of things there are other
> possibilities: for example there are plans defining what an application
> is along the lines of a cgroup (i.e. an app being a collection of
> processes). With xattrs one could then attach an icon or human readable
> program name on the cgroup.
>
> The key idea is that this would allow attaching runtime meta information
> to cgroups and everything they model (services, apps, vms), that doesn't
> need any complex userspace infrastructure, has good access control
> (i.e. because the file system enforces that anyway, and there's the
> "trusted." xattr namespace), notifications (inotify), and can easily be
> shared among applications.
>
> Lennart
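
As an illustration of the use case quoted above, here is a hedged userspace sketch that stores a service's main pid as an extended attribute on a cgroup directory using the Linux setxattr(2) call. The attribute name "trusted.main_pid" and the helper name are hypothetical; writing to the "trusted." namespace normally requires CAP_SYS_ADMIN, and the cgroup filesystem must have been mounted with xattr support enabled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/xattr.h>

/*
 * Store pid as a decimal string in a "trusted." xattr on cgroup_dir.
 * Returns 0 on success, -1 on failure (errno is set by setxattr()).
 */
static int store_main_pid(const char *cgroup_dir, long pid)
{
    char value[32];
    int len;

    assert(cgroup_dir != NULL);

    len = snprintf(value, sizeof(value), "%ld", pid);
    if (len < 0 || (size_t)len >= sizeof(value))
        return -1;

    /* flags == 0: create the attribute or replace an existing one. */
    return setxattr(cgroup_dir, "trusted.main_pid", value, (size_t)len, 0);
}
```

A manager could read the value back with getxattr(2) after a restart, which is the "restart without losing state" scenario described in the quote.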

v7:
- no changes
v6:
- remove user xattr namespace, only allow trusted and security
v5:
- check for capabilities before setting/removing xattrs
v4:
- no changes
v3:
- instead of config option, use mount option to enable xattr support

Original-patch-by: Li Zefan
Cc: Li Zefan
Cc: Tejun Heo
Cc: Hugh Dickins
Cc: Hillf Danton
Cc: Lennart Poettering
Signed-off-by: Li Zefan
Signed-off-by: Aristeu Rozanski
Signed-off-by: Tejun Heo

Aristeu Rozanski
2012-08-25 06:55:33 +0800
13af07df9 cgroup: revise how we re-populate root directory

When remounting cgroupfs with some subsystems added to it and some
removed, cgroup will remove all the files in the root directory and then
re-populate it.

What I'm doing here is to remove only the files which belong to subsystems
that are to be unbound, and to create files only for newly-added subsystems.
The purpose is to leave all other files untouched.
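
Assuming one bit per subsystem, the set of subsystems whose files must be removed or created on remount can be derived from the old and new masks. The struct and function names in this sketch are hypothetical.

```c
#include <assert.h>

/* One bit per subsystem in each mask. */
struct sketch_mask_delta {
    unsigned long added;    /* bound by the remount: create their files */
    unsigned long removed;  /* unbound by the remount: remove their files */
};

static struct sketch_mask_delta remount_delta(unsigned long old_mask,
                                              unsigned long new_mask)
{
    struct sketch_mask_delta d;

    d.added = new_mask & ~old_mask;
    d.removed = old_mask & ~new_mask;

    /* A subsystem cannot be both added and removed. */
    assert((d.added & d.removed) == 0);
    return d;
}
```

Subsystems present in both masks appear in neither field, so their files are left alone.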

This is a preparation for cgroup xattr support.

v7:
- checkpatch warnings fixed
v6:
- no changes
v5:
- no changes
v4:
- refactored cgroup_clear_directory() to not use cgroup_rm_file()
- instead of going thru the list of files, get the file list using the
subsystems
- use 'subsys_mask' instead of {added,removed}_bits and made
cgroup_populate_dir() to match the parameters with cgroup_clear_directory()
v3:
- refresh patches after recent refactoring

Original-patch-by: Li Zefan
Cc: Li Zefan
Cc: Hugh Dickins
Cc: Hillf Danton
Cc: Lennart Poettering
Signed-off-by: Li Zefan
Signed-off-by: Aristeu Rozanski
Signed-off-by: Tejun Heo

Aristeu Rozanski
2012-08-25 06:55:33 +0800

25 Jul, 2012

1 commit

614a6d434 Merge branch 'for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup changes from Tejun Heo:
"Nothing too interesting. A minor bug fix and some cleanups."

* 'for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: Update remount documentation
cgroup: cgroup_rm_files() was calling simple_unlink() with the wrong inode
cgroup: Remove populate() documentation
cgroup: remove hierarchy_mutex

Linus Torvalds
2012-07-25 08:47:44 +0800

14 Jul, 2012

2 commits

9249e17fe VFS: Pass mount flags to sget()

Pass mount flags to sget() so that it can use them in initialising a new
superblock before the set function is called. They could also be passed to the
compare function.

Signed-off-by: David Howells
Signed-off-by: Al Viro

David Howells
2012-07-14 20:38:34 +0800
00cd8dd3b stop passing nameidata to ->lookup()

Just the flags; only NFS cares even about that, but there are
legitimate uses for such argument. And getting rid of that
completely would require splitting ->lookup() into a couple
of methods (at least), so let's leave that alone for now...

Signed-off-by: Al Viro

Al Viro
2012-07-14 20:34:32 +0800

10 Jul, 2012

1 commit

ce27e317b cgroup: cgroup_rm_files() was calling simple_unlink() with the wrong inode

While refactoring cgroup file removal path, 05ef1d7c4a "cgroup:
introduce struct cfent" incorrectly changed the @dir argument of
simple_unlink() to the inode of the file being deleted instead of that
of the containing directory.

The effect of this bug is minor - ctime and mtime of the parent
weren't properly updated on file deletion.

Fix it by using @cgrp->dentry->d_inode instead.

Signed-off-by: Tejun Heo
Reported-by: Al Viro
Acked-by: Li Zefan
Cc: stable@vger.kernel.org

Tejun Heo
2012-07-10 01:11:14 +0800

08 Jul, 2012

2 commits

5db9a4d99 cgroup: fix cgroup hierarchy umount race

48ddbe1946 "cgroup: make css->refcnt clearing on cgroup removal
optional" allowed a css to linger after the associated cgroup is
removed. As a css holds a reference on the cgroup's dentry, it means
that cgroup dentries may linger for a while.

Destroying a superblock which has dentries with positive refcnts is a
critical bug and triggers BUG() in vfs code. As each cgroup dentry
holds an s_active reference, any lingering cgroup has both its dentry
and the superblock pinned, thus preventing premature release of the
superblock.

Unfortunately, after 48ddbe1946, there's a small window while
releasing a cgroup which is directly under the root of the hierarchy.
When a cgroup directory is released, vfs layer first deletes the
corresponding dentry and then invokes dput() on the parent, which may
recurse further, so when a cgroup directly below root cgroup is
released, the cgroup is first destroyed - which releases the s_active
it was holding - and then the dentry for the root cgroup is dput().

This creates a window where the root dentry's refcnt isn't zero but
superblock's s_active is. If umount happens before or during this
window, vfs will see the root dentry with non-zero refcnt and trigger
BUG().

Before 48ddbe1946, this problem didn't exist because the last dentry
reference was guaranteed to be put synchronously from rmdir(2)
invocation which holds s_active around the whole process.

Fix it by holding an extra superblock->s_active reference across
dput() from css release, which is the dput() path added by 48ddbe1946
and the only one which doesn't hold an extra s_active ref across the
final cgroup dput().

Signed-off-by: Tejun Heo
LKML-Reference:
Reported-by: shyju pv
Tested-by: shyju pv
Cc: Sasha Levin
Acked-by: Li Zefan

Tejun Heo
2012-07-08 07:08:18 +0800
7db5b3ca0 Revert "cgroup: superblock can't be released with active dentries"

This reverts commit fa980ca87d15bb8a1317853f257a505990f3ffde. The
commit was an attempt to fix a race condition where a cgroup hierarchy
may be unmounted with positive dentry reference on root cgroup. While
the commit made the race condition slightly more difficult to trigger,
the race was still there and could be reliably triggered using a
different test case.

Revert the incorrect fix. The next commit will describe the race and
fix it correctly.

Signed-off-by: Tejun Heo
LKML-Reference:
Reported-by: shyju pv
Cc: Sasha Levin
Acked-by: Li Zefan

Tejun Heo
2012-07-08 06:55:47 +0800

19 Jun, 2012

1 commit

8e3bbf42c cgroups: Account for CSS_DEACT_BIAS in __css_put

When we fixed the race between atomic_dec and css_refcnt, we missed
the fact that css_refcnt internally subtracts CSS_DEACT_BIAS to get
the actual reference count. This can potentially cause a refcount leak
if __css_put races with cgroup_clear_css_refs.

Signed-off-by: Salman Qazi
Acked-by: Li Zefan
Signed-off-by: Tejun Heo

Salman Qazi
2012-06-19 06:38:02 +0800

07 Jun, 2012

2 commits

6be96a5c9 cgroup: remove hierarchy_mutex

It was introduced for memcg to iterate cgroup hierarchy without
holding cgroup_mutex, but soon after that it was replaced with
a lockless way in memcg.

No one has used hierarchy_mutex since then, so remove it.

Signed-off-by: Li Zefan
Signed-off-by: Tejun Heo

Li Zefan
2012-06-07 10:12:30 +0800
967db0ea6 cgroup: make sure that decisions in __css_put are atomic

__css_put is using atomic_dec on the ref count, and then
looking at the ref count to make decisions. This is prone
to races, as someone else may decrement ref count between
our decrement and our decision. Instead, we should base our
decisions on the value that we decremented the ref count to.

(This results in an actual race on Google's kernel which I
haven't been able to reproduce on the upstream kernel. Having
said that, it's still incorrect by inspection).

Signed-off-by: Salman Qazi
Acked-by: Li Zefan
Signed-off-by: Tejun Heo
Cc: stable@vger.kernel.org

Salman Qazi
2012-06-07 09:51:35 +0800

06 Jun, 2012

1 commit

365f0e173 Merge branch 'for-3.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fix from Tejun Heo:
"This fixes the possible premature superblock release on umount bug
mentioned during v3.5-rc1 pull request.

Originally, cgroup dentry destruction path assumed that cgroup dentry
didn't have any reference left after cgroup removal thus put super
during dentry removal. Now that there can be lingering dentry
references, this led to super being put with live dentries. This
patch fixes the problem by putting super ref on dentry release instead
of removal."

* 'for-3.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: superblock can't be released with active dentries

Linus Torvalds
2012-06-06 02:54:12 +0800

30 May, 2012

1 commit

91c63734f kernel: cgroup: push rcu read locking from css_is_ancestor() to callsite

Library functions should not grab locks when the callsites can do it,
even if the lock nests like the rcu read-side lock does.

Push the rcu_read_lock() from css_is_ancestor() to its single user,
mem_cgroup_same_or_subtree() in preparation for another user that may
already hold the rcu read-side lock.

Signed-off-by: Johannes Weiner
Cc: Konstantin Khlebnikov
Acked-by: KAMEZAWA Hiroyuki
Acked-by: Michal Hocko
Acked-by: Li Zefan
Cc: Li Zefan
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2012-05-30 07:22:20 +0800

28 May, 2012

1 commit

fa980ca87 cgroup: superblock can't be released with active dentries

48ddbe1946 "cgroup: make css->refcnt clearing on cgroup removal
optional" allowed a css to linger after the associated cgroup is
removed. As a css holds a reference on the cgroup's dentry, it means
that cgroup dentries may linger for a while.

cgroup_create() does grab an active reference on the superblock to
prevent it from going away while there are !root cgroups; however, the
reference is put from cgroup_diput() which is invoked on cgroup
removal, so cgroup dentries which are removed but persisting due to
lingering csses already have released their superblock active refs
allowing superblock to be killed while those dentries are around.

Given the right condition, this makes cgroup_kill_sb() call
kill_litter_super() with dentries with non-zero d_count leading to
BUG() in shrink_dcache_for_umount_subtree().

Fix it by adding cgroup_dops->d_release() operation and moving
deactivate_super() to it. cgroup_diput() now marks dentry->d_fsdata
with itself if superblock should be deactivated and cgroup_d_release()
deactivates the superblock on dentry release.

Signed-off-by: Tejun Heo
Reported-by: Sasha Levin
Tested-by: Sasha Levin
LKML-Reference:
Acked-by: Li Zefan

Tejun Heo
2012-05-28 08:22:56 +0800

24 May, 2012

1 commit

644473e9c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull user namespace enhancements from Eric Biederman:
"This is a course correction for the user namespace, so that we can
reach an inexpensive, maintainable, and reasonably complete
implementation.

Highlights:
- Config guards make it impossible to enable the user namespace and
code that has not been converted to be user namespace safe.

- Use of the new kuid_t type ensures that if you somehow get past the
config guards the kernel will encounter type errors if you enable
user namespaces and attempt to compile in code whose permission
checks have not been updated to be user namespace safe.

- All uids from child user namespaces are mapped into the initial
user namespace before they are processed. Removing the need to add
an additional check to see if the user namespace of the compared
uids remains the same.

- With the user namespaces compiled out the performance is as good or
better than it is today.

- For most operations absolutely nothing changes, in performance or
operation, with the user namespace enabled.

- The worst case performance I could come up with was timing 1
billion cache cold stat operations with the user namespace code
enabled. This went from 156s to 164s on my laptop (or 156ns to
164ns per stat operation).

- (uid_t)-1 and (gid_t)-1 are reserved as an internal error value.
Most uid/gid setting system calls treat these values specially
anyway so attempting to use -1 as a uid would likely cause
entertaining failures in userspace.

- If setuid is called with a uid that can not be mapped setuid fails.
I have looked at sendmail, login, ssh and every other program I
could think of that would call setuid and they all check for and
handle the case where setuid fails.

- If stat or a similar system call is called from a context in which
we can not map a uid we lie and return overflowuid. The LFS
experience suggests not lying and returning an error code might be
better, but the historical precedent with uids is different and I
can not think of anything that would break by lying about a uid we
can't map.

- Capabilities are localized to the current user namespace making it
safe to give the initial user in a user namespace all capabilities.

My git tree covers all of the modifications needed to convert the core
kernel and enough changes to make a system bootable to runlevel 1."

Fix up trivial conflicts due to nearby independent changes in fs/stat.c

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits)
userns: Silence silly gcc warning.
cred: use correct cred accessor with regards to rcu read lock
userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq
userns: Convert cgroup permission checks to use uid_eq
userns: Convert tmpfs to use kuid and kgid where appropriate
userns: Convert sysfs to use kgid/kuid where appropriate
userns: Convert sysctl permission checks to use kuid and kgids.
userns: Convert proc to use kuid/kgid where appropriate
userns: Convert ext4 to user kuid/kgid where appropriate
userns: Convert ext3 to use kuid/kgid where appropriate
userns: Convert ext2 to use kuid/kgid where appropriate.
userns: Convert devpts to use kuid/kgid where appropriate
userns: Convert binary formats to use kuid/kgid where appropriate
userns: Add negative depends on entries to avoid building code that is userns unsafe
userns: signal remove unnecessary map_cred_ns
userns: Teach inode_capable to understand inodes whose uids map to other namespaces.
userns: Fail exec for suid and sgid binaries with ids outside our user namespace.
userns: Convert stat to return values mapped from kuids and kgids
userns: Convert user specfied uids and gids in chown into kuids and kgid
userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs
...

Linus Torvalds
2012-05-24 08:42:39 +0800

16 May, 2012

1 commit

14a590c3f userns: Convert cgroup permission checks to use uid_eq

Acked-by: Serge Hallyn
Signed-off-by: Eric W. Biederman

Eric W. Biederman
2012-05-16 05:59:30 +0800

24 Apr, 2012

1 commit

c4c27fbdd cgroups: disallow attaching kthreadd or PF_THREAD_BOUND threads

Allowing kthreadd to be moved to a non-root group makes no sense, it being
a global resource, and needlessly leads unsuspecting users toward trouble.

1. An RT workqueue worker thread spawned in a task group with no rt_runtime
allocated is not schedulable. Simple user error, but harmful to the box.

2. A worker thread which acquires PF_THREAD_BOUND can never leave a cpuset,
rendering the cpuset immortal.

Save the user some unexpected trouble, just say no.

Signed-off-by: Mike Galbraith
Acked-by: Peter Zijlstra
Acked-by: Thomas Gleixner
Acked-by: Li Zefan
Signed-off-by: Tejun Heo

Mike Galbraith
2012-04-24 02:03:51 +0800

12 Apr, 2012

1 commit

86f82d561 cgroup: remove cgroup_subsys->populate()

With memcg converted, cgroup_subsys->populate() doesn't have any user
left. Remove it.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan

Tejun Heo
2012-04-12 00:16:48 +0800

02 Apr, 2012

12 commits

48ddbe194 cgroup: make css->refcnt clearing on cgroup removal optional

Currently, cgroup removal tries to drain all css references. If there
are active css references, the removal logic waits and retries
->pre_destroy() until either all refs drop to zero or removal is
cancelled.

This semantics is unusual and adds non-trivial complexity to cgroup
core and IMHO is fundamentally misguided in that it couples internal
implementation details (references to internal data structure) with
externally visible operation (rmdir). To userland, this is a behavior
peculiarity which is unnecessary and difficult to expect (css refs are
otherwise invisible from userland), and, to policy implementations,
this is an unnecessary restriction (e.g. blkcg wants to hold css refs
for caching purposes but can't as that becomes visible as rmdir hang).

Unfortunately, memcg currently depends on ->pre_destroy() retrials and
cgroup removal vetoing and can't be immediately switched to the new
behavior. This patch introduces the new behavior of not waiting for
css refs to drain and maintains the old behavior for subsystems which
have __DEPRECATED_clear_css_refs set.

Once memcg is updated, we can drop the code paths for the old
behavior as proposed in the following patch. Note that the following
patch is incorrect in that dput work item is in cgroup and may lose
some of dputs when multiple css's are released back-to-back, and
__css_put() triggers check_for_release() when refcnt reaches 0 instead
of 1; however, it shows what part can be removed.

http://thread.gmane.org/gmane.linux.kernel.containers/22559/focus=75251

Note that, in not-too-distant future, cgroup core will start emitting
warning messages for subsys which require the old behavior, so please
get moving.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan
Cc: Vivek Goyal
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Balbir Singh
Cc: KAMEZAWA Hiroyuki

Tejun Heo
2012-04-02 03:09:56 +0800
28b4c27b8 cgroup: use negative bias on css->refcnt to block css_tryget()

When a cgroup is about to be removed, cgroup_clear_css_refs() is
called to check and ensure that there are no active css references.

This is currently achieved by dropping the refcnt to zero iff it has
only the base ref. If all css refs could be dropped to zero, ref
clearing is successful and CSS_REMOVED is set on all css. If not, the
base ref is restored. While css ref is zero w/o CSS_REMOVED set, any
css_tryget() attempt on it busy loops so that they are atomic
w.r.t. the whole css ref clearing.

This does work but dropping and re-instating the base ref is somewhat
hairy and makes it difficult to add more logic to the put path as
there are two of them - the regular css_put() and the reversible base
ref clearing.

This patch updates css ref clearing such that blocking new
css_tryget() and putting the base ref are separate operations.
CSS_DEACT_BIAS, defined as INT_MIN, is added to css->refcnt and
css_tryget() busy loops while refcnt is negative. After all css refs
are deactivated, if they were all one, ref clearing succeeded and
CSS_REMOVED is set and the base ref is put using the regular
css_put(); otherwise, CSS_DEACT_BIAS is subtracted from the refcnts
and the original positive values are restored.

css_refcnt() accessor which always returns the unbiased positive
reference counts is added and used to simplify refcnt usages. While
at it, relocate and reformat comments in cgroup_has_css_refs().

This separates css->refcnt deactivation and putting the base ref,
which enables the next patch to make ref clearing optional.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan

Tejun Heo
2012-04-02 03:09:56 +0800
79578621b cgroup: implement cgroup_rm_cftypes()

Implement cgroup_rm_cftypes() which removes an array of cftypes from a
subsystem. It can be called whether the target subsys is attached or
not. cgroup core will remove the specified file from all existing
cgroups.

This will be used to improve sub-subsys modularity and will be helpful
for unified hierarchy.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan

Tejun Heo
2012-04-02 03:09:56 +0800
05ef1d7c4 cgroup: introduce struct cfent

This patch adds cfent (cgroup file entry) which is the association
between a cgroup and a file. This is the in-cgroup representation of
files under a cgroup directory. This simplifies walking
cgroup files and thus cgroup_clear_directory(), which is now
implemented in two parts - cgroup_rm_file() and a loop around it.
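
The "remove one file, then loop" structure can be sketched in userspace as
follows. All names here (cfent_model, cgroup_model, add_file(), rm_file(),
clear_directory()) are hypothetical, a singly linked list stands in for the
kernel's list, and the entry is freed immediately, which ignores the
open-file lifetime issue that the v2 note below addresses.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct cftype_model {
    const char *name;
};

/* Association between one cgroup and one file type. */
struct cfent_model {
    struct cfent_model *next;
    const struct cftype_model *type;
};

struct cgroup_model {
    struct cfent_model *files;          /* head of this cgroup's file entries */
};

static int add_file(struct cgroup_model *cgrp, const struct cftype_model *cft)
{
    struct cfent_model *cfe = malloc(sizeof(*cfe));

    if (!cfe)
        return -ENOMEM;
    cfe->type = cft;
    cfe->next = cgrp->files;
    cgrp->files = cfe;
    return 0;
}

/* Remove the first entry of type cft, or the first entry at all if cft is NULL. */
static int rm_file(struct cgroup_model *cgrp, const struct cftype_model *cft)
{
    struct cfent_model **pp;

    for (pp = &cgrp->files; *pp; pp = &(*pp)->next) {
        struct cfent_model *cfe = *pp;

        if (cft && cfe->type != cft)
            continue;
        *pp = cfe->next;                /* unlink, then release the entry */
        free(cfe);
        return 0;
    }
    return -ENOENT;
}

/* Clearing a directory is just a loop around single-file removal. */
static void clear_directory(struct cgroup_model *cgrp)
{
    while (cgrp->files)
        rm_file(cgrp, NULL);
}
```

Because every file is reachable from the cgroup's own list, removing a
single cftype and clearing the whole directory share one code path.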

cgroup_rm_file() will be used to implement cftype removal and cfent is
scheduled to serve cgroup specific per-file data (e.g. for sysfs-like
"sever" semantics).

v2: - cfe was freed from cgroup_rm_file() which led to use-after-free
if the file had openers at the time of removal. Moved to
cgroup_diput().

- cgroup_clear_directory() triggered WARN_ON_ONCE() if d_subdirs
wasn't empty after removing all files. This triggered
spuriously if some files were open during directory clearing.
Removed.

v3: - In cgroup_diput(), WARN_ONCE(!list_empty(&cfe->node)) could be
spuriously triggered for root cgroups because they don't go
through cgroup_clear_directory() on unmount. Don't trigger WARN
for root cgroups.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan
Cc: Glauber Costa

Tejun Heo
2012-04-02 03:09:56 +0800
f6ea93723 cgroup: relocate __d_cgrp() and __d_cft()

Move the two macros upwards as they'll be used earlier in the file.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan

Tejun Heo
2012-04-02 03:09:55 +0800
db0416b64 cgroup: remove cgroup_add_file[s]()

No controller is using cgroup_add_file[s](). Unexport them, and
convert cgroup_add_files() to handle a NULL-entry-terminated array
instead of taking a count explicitly, and to continue creation on
failure, for internal use.
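
A minimal userspace sketch of that calling convention follows. It assumes
the terminating entry is one with an empty name and that the last error
seen is what gets returned; cftype_model, add_file_model() and
add_files_model() are hypothetical names, and the creator fails for the
name "broken" only so that the error path is exercised.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define MAX_CFTYPE_NAME 64

struct cftype_model {
    char name[MAX_CFTYPE_NAME];         /* empty name marks the end of the array */
};

/* Hypothetical per-file creator; fails for "broken" to exercise the error path. */
static int add_file_model(const struct cftype_model *cft, int *created)
{
    if (strcmp(cft->name, "broken") == 0)
        return -EEXIST;
    (*created)++;
    return 0;
}

/* Walk a terminated array: no count argument, and no early exit on failure. */
static int add_files_model(const struct cftype_model cfts[], int *created)
{
    const struct cftype_model *cft;
    int ret = 0;

    for (cft = cfts; cft->name[0] != '\0'; cft++) {
        int err = add_file_model(cft, created);

        if (err)
            ret = err;                  /* remember the failure but keep going */
    }
    return ret;
}
```

With the array { "tasks", "broken", "notify_on_release", "" }, this sketch
creates the two valid files and still reports the failure of the middle one.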

Signed-off-by: Tejun Heo
Acked-by: Li Zefan

Tejun Heo
2012-04-02 03:09:55 +0800
4baf6e332 cgroup: convert all non-memcg controllers to the new cftype interface

Convert debug, freezer, cpuset, cpu_cgroup, cpuacct, net_prio, blkio,
net_cls and device controllers to use the new cftype based interface.
Termination entry is added to cftype arrays and populate callbacks are
replaced with cgroup_subsys->base_cftypes initializations.

This is a functionally identical transformation. There shouldn't be any
visible behavior change.

memcg is rather special and will be converted separately.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan
Cc: Paul Menage
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: "David S. Miller"
Cc: Vivek Goyal

Tejun Heo
2012-04-02 03:09:55 +0800
6e6ff25bd cgroup: merge cft_release_agent cftype array into the base files array

Now that cftype can express whether a file should only be on root,
cft_release_agent can be merged into the base files cftypes array.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan

Tejun Heo
2012-04-02 03:09:55 +0800
8e3f6541d cgroup: implement cgroup_add_cftypes() and friends

Currently, cgroup directories are populated by the subsys->populate()
callback explicitly creating files on each cgroup creation. This
level of flexibility isn't needed or desirable. It provides largely
unused flexibility which invites abuse while severely limiting what
the core layer can do through the lack of structure and conventions.

For each cgroup file type, the only distinction that cgroup users are
making is whether a cgroup is root or not, which can easily be
expressed with flags.

This patch introduces cgroup_add_cftypes() and friends. These deal with
cftypes instead of individual files - controllers indicate that certain
types of files exist for a certain subsystem. The newly added
CFTYPE_*_ON_ROOT flags indicate whether a cftype should be excluded or
created only on the root cgroup.
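
The root / non-root distinction can be sketched as a simple flag test. This
is an illustration under assumptions: the two flag names and bit values are
given as assumed spellings of the CFTYPE_*_ON_ROOT pair, and cgroup_model,
flagged_cftype and cftype_applies() are hypothetical names that model "root"
as a cgroup without a parent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed spellings and values for the CFTYPE_*_ON_ROOT pair. */
#define CFTYPE_ONLY_ON_ROOT (1U << 0)   /* create only on the root cgroup */
#define CFTYPE_NOT_ON_ROOT  (1U << 1)   /* create everywhere except root */

struct cgroup_model {
    struct cgroup_model *parent;        /* NULL for the root cgroup */
};

struct flagged_cftype {
    const char *name;
    unsigned int flags;
};

/* Decide whether a file of this type should exist in the given cgroup. */
static bool cftype_applies(const struct flagged_cftype *cft,
                           const struct cgroup_model *cgrp)
{
    bool is_root = (cgrp->parent == NULL);

    if ((cft->flags & CFTYPE_ONLY_ON_ROOT) && !is_root)
        return false;
    if ((cft->flags & CFTYPE_NOT_ON_ROOT) && is_root)
        return false;
    return true;
}
```

A core-layer loop that creates files would call a check like this once per
(cftype, cgroup) pair, so controllers no longer need their own populate
logic to special-case the root.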

cgroup_add_cftypes() can be called any time whether the target
subsystem is currently attached or not. cgroup core will create files
on the existing cgroups as necessary.

Also, cgroup_subsys->base_cftypes is added to ease registration of the
base files for the subsystem. If non-NULL on subsys init, the cftypes
pointed to by ->base_cftypes are automatically registered on subsys
init / load.

Further patches will convert the existing users and remove the file
based interface. Note that this interface allows dynamic addition of
files to an active controller. This will be used for sub-controller
modularity and unified hierarchy in the longer term.

This patch implements the new mechanism but doesn't apply it to any
user.

v2: replaced DECLARE_CGROUP_CFTYPES[_COND]() with
cgroup_subsys->base_cftypes, which works better for a cgroup_subsys
which is loaded as a module.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan

Tejun Heo
2012-04-02 03:09:55 +0800
b0ca5a84f cgroup: build list of all cgroups under a given cgroupfs_root

Build a list of all cgroups anchored at cgroupfs_root->allcg_list and
going through cgroup->allcg_node. The list is protected by
cgroup_mutex and will be used to improve cgroup file handling.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan

Tejun Heo
2012-04-02 03:09:54 +0800
ff4c8d503 cgroup: move cgroup_clear_directory() call out of cgroup_populate_dir()

cgroup_populate_dir() currently clears all files and then repopulates
the directory; however, the clearing part is only useful when it's
called from cgroup_remount(). Relocate the invocation to
cgroup_remount().

This is to prepare for further cgroup file handling updates.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan

Tejun Heo
2012-04-02 03:09:54 +0800
8b5a5a9db cgroup: deprecate remount option changes

This patch marks the following features for deprecation.

* Rebinding subsys by remount: Never reached a useful state - only works
on empty hierarchies.

* release_agent update by remount: release_agent itself will be
replaced with conventional fsnotify notification.

v2: Lennart pointed out that "name=" is necessary for mounts w/o any
controller attached. Drop "name=" deprecation.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan
Cc: Lennart Poettering

Tejun Heo
2012-04-02 03:09:54 +0800

30 Mar, 2012

1 commit

8f121918f cgroup: cgroup_attach_task() could return -errno after success

61d1d219c4 "cgroup: remove extra calls to find_existing_css_set" made
cgroup_task_migrate() return void. An unfortunate side effect was
that cgroup_attach_task() was depending on that function's return
value to clear its @retval on the success path. On cgroup mounts
without any subsystem with ->can_attach() callback,
cgroup_attach_task() ended up returning @retval without initializing
it on success.

For some reason, gcc failed to warn about it and it didn't cause
cgroup_attach_task() to return a non-zero value in many cases, probably
due to differences in register allocation. When the problem
materializes, systemd fails to populate the /systemd cgroup mount and
fails to boot.

Fix it by initializing @retval to zero on declaration.
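
The shape of the bug and of the fix can be shown with a small userspace
model. It is a sketch only: subsys_model, task_migrate_model(),
attach_task_model() and refuse_attach() are hypothetical names, and the
control flow is reduced to the part the message describes (an optional
can_attach callback followed by a migrate step that returns void).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a subsystem; can_attach may be NULL. */
struct subsys_model {
    int (*can_attach)(void);
};

static void task_migrate_model(void)
{
    /* Returns void, so it can no longer reset the caller's retval. */
}

static int attach_task_model(const struct subsys_model *ss, size_t n)
{
    int retval = 0;     /* the fix: defined even if no callback ever runs */
    size_t i;

    for (i = 0; i < n; i++) {
        if (ss[i].can_attach) {
            retval = ss[i].can_attach();
            if (retval)
                return retval;          /* a subsystem refused the attach */
        }
    }

    task_migrate_model();
    return retval;      /* without "= 0" this could be indeterminate */
}

/* Example callback that always refuses. */
static int refuse_attach(void)
{
    return -EBUSY;
}
```

With no can_attach callbacks at all, the loop never assigns retval, which
is exactly the case where the missing initializer mattered.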

Signed-off-by: Tejun Heo
Reported-by: Jiri Kosina
LKML-Reference:
Reviewed-by: Mandeep Singh Baines
Acked-by: Li Zefan

Tejun Heo
2012-03-30 13:03:33 +0800

23 Mar, 2012

1 commit

95211279c Merge branch 'akpm' (Andrew's patch-bomb)

Merge first batch of patches from Andrew Morton:
"A few misc things and all the MM queue"

* emailed from Andrew Morton: (92 commits)
memcg: avoid THP split in task migration
thp: add HPAGE_PMD_* definitions for !CONFIG_TRANSPARENT_HUGEPAGE
memcg: clean up existing move charge code
mm/memcontrol.c: remove unnecessary 'break' in mem_cgroup_read()
mm/memcontrol.c: remove redundant BUG_ON() in mem_cgroup_usage_unregister_event()
mm/memcontrol.c: s/stealed/stolen/
memcg: fix performance of mem_cgroup_begin_update_page_stat()
memcg: remove PCG_FILE_MAPPED
memcg: use new logic for page stat accounting
memcg: remove PCG_MOVE_LOCK flag from page_cgroup
memcg: simplify move_account() check
memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat)
memcg: kill dead prev_priority stubs
memcg: remove PCG_CACHE page_cgroup flag
memcg: let css_get_next() rely upon rcu_read_lock()
cgroup: revert ss_id_lock to spinlock
idr: make idr_get_next() good for rcu_read_lock()
memcg: remove unnecessary thp check in page stat accounting
memcg: remove redundant returns
memcg: enum lru_list lru
...

Linus Torvalds
2012-03-23 00:04:48 +0800

22 Mar, 2012

1 commit

ca464d69b memcg: let css_get_next() rely upon rcu_read_lock()

Remove lock and unlock around css_get_next()'s call to idr_get_next().
memcg iterators (only users of css_get_next) already did rcu_read_lock(),
and its comment demands that; but add a WARN_ON_ONCE to make sure of it.

Signed-off-by: Hugh Dickins
Acked-by: KAMEZAWA Hiroyuki
Acked-by: Li Zefan
Cc: Eric Dumazet
Acked-by: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2012-03-22 08:55:01 +0800