29 Apr, 2008

40 commits

  • Hold the memory hotplug chain's mutex for a shorter time: when memory is
    offlined or onlined, a work item is added to the global workqueue. When
    the work item runs, it notifies the ipcns notifier chain with the
    IPCNS_MEMCHANGED event.
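
    A minimal sketch of that mechanism, simplified from the actual ipc code
    (the hotplug callback only queues the work; the notification itself runs
    later from the workqueue):

    static void ipc_memory_notifier(struct work_struct *work)
    {
            /* runs from the global workqueue, outside the hotplug mutex */
            ipcns_notify(IPCNS_MEMCHANGED);
    }

    static DECLARE_WORK(ipc_memory_wq, ipc_memory_notifier);

    static int ipc_memory_callback(struct notifier_block *self,
                                   unsigned long action, void *arg)
    {
            switch (action) {
            case MEM_ONLINE:
            case MEM_OFFLINE:
                    schedule_work(&ipc_memory_wq);
                    break;
            }
            return NOTIFY_OK;
    }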

    Signed-off-by: Nadia Derbey
    Cc: Yasunori Goto
    Cc: Matt Helsley
    Cc: Mingming Cao
    Cc: Pierre Peiffer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nadia Derbey
     
  • Introduce the registration of a callback routine that recomputes msg_ctlmni
    upon memory add / remove.

    A single notifier block is registered in the hotplug memory chain for all the
    ipc namespaces.

    Since the ipc namespaces are not linked together, they have their own
    notification chain: one notifier_block is defined per ipc namespace.

    Each time an ipc namespace is created (removed), it registers (unregisters)
    its notifier block in (from) the ipcns chain. The callback routine
    registered in the memory chain invokes the ipcns notifier chain with the
    IPCNS_MEMCHANGED event. Each callback routine registered in the ipcns
    chain, in turn, recomputes msgmni for the owning namespace.
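
    A sketch of the per-namespace registration, assuming the notifier_block is
    embedded in struct ipc_namespace as ipcns_nb and that ipcns_callback is
    the recompute routine:

    static BLOCKING_NOTIFIER_HEAD(ipcns_chain);

    int register_ipcns_notifier(struct ipc_namespace *ns)
    {
            memset(&ns->ipcns_nb, 0, sizeof(ns->ipcns_nb));
            ns->ipcns_nb.notifier_call = ipcns_callback;
            return blocking_notifier_chain_register(&ipcns_chain,
                                                    &ns->ipcns_nb);
    }

    int unregister_ipcns_notifier(struct ipc_namespace *ns)
    {
            return blocking_notifier_chain_unregister(&ipcns_chain,
                                                      &ns->ipcns_nb);
    }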

    Signed-off-by: Nadia Derbey
    Cc: Yasunori Goto
    Cc: Matt Helsley
    Cc: Mingming Cao
    Cc: Pierre Peiffer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nadia Derbey
     
  • This is a trivial patch that defines the priority of slab_memory_callback
    in the callback chain as a constant. This is to prepare for the next patch
    in the series.
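
    The idea, sketched (the constant's value here is illustrative):

    /* callbacks on the memory hotplug chain run highest priority first */
    #define SLAB_CALLBACK_PRI       1

    hotplug_memory_notifier(slab_memory_callback, SLAB_CALLBACK_PRI);

    A later callback can then pick its priority relative to this named
    constant, guaranteeing it runs after the slab one.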

    Signed-off-by: Nadia Derbey
    Cc: Yasunori Goto
    Cc: Matt Helsley
    Cc: Mingming Cao
    Cc: Pierre Peiffer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nadia Derbey
     
  • Since all the namespaces see the same amount of memory (the total one),
    this patch introduces a new variable that counts the ipc namespaces and
    divides msg_ctlmni by this counter.
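
    Sketched, with nr_ipc_ns as the new atomic counter:

    /* incremented on ipc namespace creation, decremented on removal */
    static atomic_t nr_ipc_ns = ATOMIC_INIT(1);

    /* when (re)computing the per-namespace limit */
    ns->msg_ctlmni = allowed / atomic_read(&nr_ipc_ns);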

    Signed-off-by: Nadia Derbey
    Cc: Yasunori Goto
    Cc: Matt Helsley
    Cc: Mingming Cao
    Cc: Pierre Peiffer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nadia Derbey
     
  • On large systems we'd like to allow a larger number of message queues, in
    some cases up to 32K. However, simply setting MSGMNI to a larger value may
    cause problems for smaller systems.

    The first patch of this series introduces a default maximum number of message
    queue ids that scales with the amount of lowmem.

    Since msgmni is per namespace and there is no amount of memory dedicated to
    each namespace so far, the second patch of this series scales msgmni to the
    number of ipc namespaces too.

    Since msgmni depends on the amount of memory, it becomes necessary to
    recompute it upon memory add/remove. In the 4th patch, memory hotplug
    management is added: a notifier block is registered into the memory hotplug
    notifier chain for the ipc subsystem. Since the ipc namespaces are not linked
    together, they have their own notification chain: one notifier_block is
    defined per ipc namespace. Each time an ipc namespace is created (removed),
    it registers (unregisters) its notifier block in (from) the ipcns chain.
    The callback routine registered in the memory chain invokes the ipcns
    notifier chain with the IPCNS_MEMCHANGED event. Each callback routine
    registered in the ipcns chain, in turn, recomputes msgmni for the owning
    namespace.

    The 5th patch makes it possible to hold the memory hotplug notifier chain's
    lock for a shorter time: instead of directly notifying the ipcns notifier
    chain upon memory add/remove, a work item is added to the global
    workqueue. When activated, this work item is the one that notifies the
    ipcns notifier chain.

    Since msgmni depends on the number of ipc namespaces, it becomes necessary to
    recompute it upon ipc namespace creation / removal. The 6th patch uses the
    ipc namespace notifier chain for that purpose: that chain is notified each
    time an ipc namespace is created or removed. This makes it possible to
    recompute msgmni for all the namespaces each time one of them is created or
    removed.

    When msgmni is explicitly set from userspace, we should avoid recomputing
    it upon memory add/remove or ipcns creation/removal. This is what the 7th
    patch does: it simply unregisters the ipcns callback routine as soon as
    msgmni has been changed from procfs or sysctl().

    Even if msgmni has been set by hand, it should be possible to switch it
    back to automatic recomputation upon memory add/remove or ipcns
    creation/removal. This is what patch 8 achieves: if msgmni is set to a
    negative value, its notifier is added back to the ipcns chain, making it
    automatically recomputed again.
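
    A heavily simplified sketch of the behaviour described in patches 7 and 8
    (the helper and its call site are hypothetical; the real sysctl plumbing
    is more involved):

    /* called after msgmni has been written via procfs/sysctl */
    static void msgmni_written(struct ipc_namespace *ns, int value)
    {
            if (value < 0) {
                    /* negative: switch back to automatic recompute */
                    register_ipcns_notifier(ns);
                    recompute_msgmni(ns);
            } else {
                    /* set by hand: stop recomputing automatically */
                    unregister_ipcns_notifier(ns);
            }
    }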

    This patch:

    Compute msg_ctlmni to make it scale with the amount of lowmem. msg_ctlmni is
    now set to make the message queues occupy 1/32 of the available lowmem.
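
    A sketch of that computation, assuming MSG_MEM_SCALE is the 1/32 factor
    and MSGMNB the size of one queue:

    struct sysinfo i;
    unsigned long allowed;

    si_meminfo(&i);
    /* lowmem in bytes, 1/32 of it for queues, MSGMNB bytes per queue */
    allowed = (((i.totalram - i.totalhigh) / MSG_MEM_SCALE) * i.mem_unit)
              / MSGMNB;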

    Some cleanup has also been done for the MSGPOOL constant: the msgctl man
    page says it is not used, but it also defines it as a size in bytes, while
    the code expresses it in Kbytes.

    Signed-off-by: Nadia Derbey
    Cc: Yasunori Goto
    Cc: Matt Helsley
    Cc: Mingming Cao
    Cc: Pierre Peiffer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nadia Derbey
     
  • Continuing the consolidation of the IPC code: each id can now be built
    directly in ipc_addid() instead of being built by each caller of
    ipc_addid().

    shm_addid() is also removed, so that shm/sem/msg share as much code as
    possible.
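
    Roughly, the id construction moves inside ipc_addid() (a simplified
    sketch, not the exact code):

    int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm *new, int size)
    {
            int id, err;

            err = idr_get_new(&ids->ipcs_idr, new, &id);
            if (err)
                    return err;

            new->seq = ids->seq++;
            /* built here once, instead of by every caller */
            new->id = ipc_buildid(id, new->seq);
            return id;
    }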

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Pierre Peiffer
    Cc: Nadia Derbey
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pierre Peiffer
     
  • Fix kernel bugzilla #10388.

    DMA-API.txt uses the wrong argument type for some functions: it uses struct
    device where it should use struct pci_dev.
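
    For example, the pci_* helpers documented there take a pci_dev:

    /* corrected style: struct pci_dev, not struct device */
    void *pci_alloc_consistent(struct pci_dev *hwdev, size_t size,
                               dma_addr_t *dma_handle);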

    Signed-off-by: Randy Dunlap
    Acked-by: James Bottomley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Add a new parameter, dmasync, to the ib_umem_get() prototype. Use dmasync = 1
    when mapping user-allocated CQs with ib_umem_get().
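
    The extended prototype, per the description (dmasync being the new, final
    argument):

    struct ib_umem *ib_umem_get(struct ib_ucontext *context,
                                unsigned long addr, size_t size,
                                int access, int dmasync);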

    Signed-off-by: Arthur Kepner
    Cc: Tony Luck
    Cc: Jesse Barnes
    Cc: Jes Sorensen
    Cc: Randy Dunlap
    Cc: Roland Dreier
    Cc: James Bottomley
    Cc: David Miller
    Cc: Benjamin Herrenschmidt
    Cc: Grant Grundler
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arthur Kepner
     
  • Change all ia64 machvecs to use the new dma_*map*_attrs() interfaces.
    Implement the old dma_*map_*() interfaces in terms of the corresponding new
    interfaces. For ia64/sn, make use of one dma attribute,
    DMA_ATTR_WRITE_BARRIER. Introduce swiotlb_*map*_attrs() functions.
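
    The pattern, sketched: each old interface becomes a thin wrapper that
    passes a NULL attrs argument to its *_attrs() counterpart.

    static inline dma_addr_t dma_map_single(struct device *dev, void *ptr,
                                            size_t size,
                                            enum dma_data_direction dir)
    {
            return dma_map_single_attrs(dev, ptr, size, dir, NULL);
    }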

    Signed-off-by: Arthur Kepner
    Cc: Tony Luck
    Cc: Jesse Barnes
    Cc: Jes Sorensen
    Cc: Randy Dunlap
    Cc: Roland Dreier
    Cc: James Bottomley
    Cc: David Miller
    Cc: Benjamin Herrenschmidt
    Cc: Grant Grundler
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arthur Kepner
     
  • Document the new dma_*map*_attrs() functions.

    [markn@au1.ibm.com: fix up for dma-add-dma_map_attrs-interfaces and update docs]
    Signed-off-by: Arthur Kepner
    Acked-by: David S. Miller
    Cc: Tony Luck
    Cc: Jesse Barnes
    Cc: Jes Sorensen
    Cc: Randy Dunlap
    Cc: Roland Dreier
    Cc: James Bottomley
    Cc: Benjamin Herrenschmidt
    Cc: Grant Grundler
    Cc: Michael Ellerman
    Signed-off-by: Mark Nelson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arthur Kepner
     
  • Introduce new interfaces, dma_*map*_attrs(), for passing architecture-specific
    attributes when memory is mapped and unmapped for DMA. Give the interfaces
    default implementations which ignore attributes. Also introduce the
    dma_{set|get}_attr() interfaces for setting and retrieving individual
    attributes. Define one attribute, DMA_ATTR_WRITE_BARRIER, in anticipation of
    its use by ia64/sn. Select whether architectures implement arch-specific
    versions of the dma_*map*_attrs() interfaces via HAVE_DMA_ATTRS in Kconfig.
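
    A usage sketch putting the pieces together:

    DEFINE_DMA_ATTRS(attrs);

    /* request write-barrier semantics for this mapping */
    dma_set_attr(DMA_ATTR_WRITE_BARRIER, &attrs);
    dma_handle = dma_map_single_attrs(dev, cpu_addr, size,
                                      DMA_BIDIRECTIONAL, &attrs);
    /* later */
    dma_unmap_single_attrs(dev, dma_handle, size, DMA_BIDIRECTIONAL, &attrs);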

    [markn@au1.ibm.com: dma_{set,get}_attr() have to be static inline]
    Signed-off-by: Arthur Kepner
    Cc: Tony Luck
    Cc: Jesse Barnes
    Cc: Jes Sorensen
    Cc: Randy Dunlap
    Cc: Roland Dreier
    Cc: James Bottomley
    Cc: David Miller
    Cc: Benjamin Herrenschmidt
    Cc: Grant Grundler
    Cc: Michael Ellerman
    Signed-off-by: Mark Nelson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arthur Kepner
     
  • cpu_hotplug_begin() must always be called under cpu_add_remove_lock, which
    means that only one process can be cpu_hotplug.active_writer. So we don't
    need cpu_hotplug.writer_queue; we can wake up the ->active_writer
    directly.

    Also, fix the comment.
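
    The reader-side unlock then wakes the writer directly; a simplified sketch
    of the resulting path:

    void put_online_cpus(void)
    {
            if (cpu_hotplug.active_writer == current)
                    return;
            mutex_lock(&cpu_hotplug.lock);
            /* at most one writer can exist, so wake it directly */
            if (!--cpu_hotplug.refcount && cpu_hotplug.active_writer)
                    wake_up_process(cpu_hotplug.active_writer);
            mutex_unlock(&cpu_hotplug.lock);
    }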

    Signed-off-by: Oleg Nesterov
    Cc: Dipankar Sarma
    Acked-by: Gautham R Shenoy
    Cc: Ingo Molnar
    Cc: Srivatsa Vaddagiri
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • cleanup_workqueue_thread() doesn't need the second argument, remove it.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • When cpu_populated_map was introduced, it was assumed that cwq->thread
    could survive after CPU_DEAD; that is why we never shrink
    cpu_populated_map.

    This is not very nice; we can safely remove the already dead CPU from the
    map. The only required change is that destroy_workqueue() must hold the
    hotplug lock until it destroys all cwq->thread's, to protect the
    cpu_populated_map. We could make a local copy of the cpu mask and drop the
    lock, but sizeof(cpumask_t) may be very large.

    Also, fix the comment near queue_work(). Unless _cpu_down() happens, we do
    guarantee the cpu-affinity of the work_struct, and we have users that rely
    on this.

    [akpm@linux-foundation.org: repair comment]
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • This flag provides the hardwalling properties of mem_exclusive, without
    enforcing the exclusivity. Either mem_hardwall or mem_exclusive is sufficient
    to prevent GFP_KERNEL allocations from passing outside the cpuset's assigned
    nodes.

    Signed-off-by: Paul Menage
    Acked-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Currently the cpusets mem_exclusive flag is overloaded to mean both
    "no-overlapping" and "no GFP_KERNEL allocations outside this cpuset".

    These patches add a new mem_hardwall flag with just the allocation restriction
    part of the mem_exclusive semantics, without breaking backwards-compatibility
    for those who continue to use just mem_exclusive. Additionally, the cgroup
    control file registration for cpusets is cleaned up to reduce boilerplate.

    This patch:

    This change tidies up the cpusets control file definitions, and reduces the
    amount of boilerplate required to add/change control files in the future.

    Signed-off-by: Paul Menage
    Reviewed-by: Li Zefan
    Acked-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Make the following needlessly global functions static:

    - cpuset_test_cpumask()
    - cpuset_change_cpumask()
    - cpuset_do_move_task()

    Signed-off-by: Adrian Bunk
    Acked-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • *mem has been zeroed, which means mem->info has already been filled with 0.

    Signed-off-by: Li Zefan
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • On ia64, this kmalloc() requires order-4 pages, but the allocation does not
    need to be physically contiguous. For a big mem_cgroup, vmalloc is better;
    for small ones, kmalloc is used.
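
    The allocation helper, sketched (details may differ from the actual
    patch):

    static struct mem_cgroup *mem_cgroup_alloc(void)
    {
            struct mem_cgroup *mem;

            if (sizeof(*mem) < PAGE_SIZE)
                    mem = kmalloc(sizeof(*mem), GFP_KERNEL);
            else
                    mem = vmalloc(sizeof(*mem));

            if (mem)
                    memset(mem, 0, sizeof(*mem));
            return mem;
    }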

    [akpm@linux-foundation.org: simplification]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Pavel Emelyanov
    Cc: Li Zefan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch makes the memory controller more responsive on my desktop.

    1. Set all cached pages as inactive. We were marking all pages as active
    by default, thus forcing us to go through two passes when reclaiming pages.

    2. Remove congestion_wait(), since we already have that logic in
    do_try_to_free_pages().

    Signed-off-by: Balbir Singh
    Reviewed-by: KOSAKI Motohiro
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Cc: Pavel Emelianov
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • remove_list/add_list use page_cgroup_zoneinfo() internally, so it is called
    twice, before and after taking the lock:

    mz = page_cgroup_zoneinfo();
    lock();
    mz = page_cgroup_zoneinfo();
    ....
    unlock();

    Yet the address of mz never changes.

    This is not good. This patch fixes this behavior.
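
    After the fix, the lookup is done once and the result reused; a sketch:

    mz = page_cgroup_zoneinfo(pc);
    spin_lock_irqsave(&mz->lru_lock, flags);
    /* ... manipulate the per-zone LRU through mz ... */
    spin_unlock_irqrestore(&mz->lru_lock, flags);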

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This is a very common requirement from people using the resource accounting
    facilities (not only memcgroup but also OpenVZ beancounters). They want to
    put the cgroup in an initial state without re-creating it.

    For example after re-configuring a group people want to observe how this new
    configuration fits the group needs without saving the previous failcnt value.

    Merge the two resets into one mem_cgroup_reset() function to demonstrate
    how the multiplexing works.

    Besides, I have plans to move the files that correspond to res_counter into
    the res_counter.c file and somehow "import" them into the controller. I
    don't know how to do this gracefully yet, but merging the resets of
    max_usage and failcnt into one function will be there for sure.
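
    A sketch of the merged reset, with the event multiplexed through
    cftype->private:

    static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
    {
            struct mem_cgroup *mem = mem_cgroup_from_cont(cont);

            switch (event) {
            case RES_MAX_USAGE:
                    res_counter_reset_max(&mem->res);
                    break;
            case RES_FAILCNT:
                    res_counter_reset_failcnt(&mem->res);
                    break;
            }
            return 0;
    }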

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Pavel Emelyanov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • These two files are essentially event callbacks. They do not care about the
    contents of the string, but only about the fact of the write itself.

    Signed-off-by: Pavel Emelyanov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Move the memory controller data structure page_cgroup to its own slab
    cache. It saves space on the system, since allocations are no longer
    necessarily pushed up to a power-of-2 size, and it should provide
    performance benefits. Users who disable the memory controller can also
    double-check that the memory controller is not allocating page_cgroup's.

    NOTE: Hugh Dickins brought up the issue of whether we want to mark page_cgroup
    as __GFP_MOVABLE or __GFP_RECLAIMABLE. I don't think there is an easy answer
    at the moment. page_cgroup's are associated with user pages, they can be
    reclaimed once the user page has been reclaimed, so it might make sense to
    mark them as __GFP_RECLAIMABLE. For now, I am leaving the marking to default
    values that the slab allocator uses.
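
    The cache itself is a one-liner at init time; a sketch:

    static struct kmem_cache *page_cgroup_cache;

    /* in the controller's init path */
    page_cgroup_cache = kmem_cache_create("page_cgroup",
                                          sizeof(struct page_cgroup), 0,
                                          SLAB_PANIC, NULL);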

    Signed-off-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Hugh Dickins
    Cc: Sudhir Kumar
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • The resource counter is supposed to facilitate the resource accounting of
    an arbitrary resource (and it already does this for the memory controller).

    However, it is about to be used in other resources controllers (swap, kernel
    memory, networking, etc), so provide a doc describing how to work with it.
    This will eliminate all the possible future duplications in the appropriate
    controllers' docs.

    Fixed errors pointed out by Randy.

    [akpm@linux-foundation.org: fix documentation tpyo]
    Signed-off-by: Pavel Emelyanov
    Cc: Randy Dunlap
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • This field holds the maximal value of the usage counter since the counter's
    creation (or since the latest reset).

    To reset it to the current usage value, simply write anything to the
    appropriate cgroup file.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Remove the mem_cgroup member from mm_struct and instead add an owner.

    This approach was suggested by Paul Menage. The advantage of this approach
    is that, once the mm->owner is known, the cgroup can be determined using
    the subsystem id. It also allows several control groups that are virtually
    grouped by mm_struct to exist independently of the memory controller,
    i.e., without adding a mem_cgroup pointer to mm_struct for each
    controller.

    A new config option CONFIG_MM_OWNER is added and the memory resource
    controller selects this config option.

    This patch also adds cgroup callbacks to notify subsystems when mm->owner
    changes. The mm_cgroup_changed callback is called with the task_lock() of
    the new task held and is called just prior to changing the mm->owner.

    I am indebted to Paul Menage for the several reviews of this patchset and
    helping me make it lighter and simpler.

    This patch was tested on a powerpc box, it was compiled with both the
    MM_OWNER config turned on and off.

    After the thread group leader exits, it's moved to init_css_set by
    cgroup_exit(), thus all future charges from running threads will be
    redirected to the init_css_set's subsystem.
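
    With mm->owner in place, a controller derives its group from the owning
    task, roughly like this (sketch):

    static struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
    {
            return container_of(task_subsys_state(p, mem_cgroup_subsys_id),
                                struct mem_cgroup, css);
    }

    /* at charge time */
    rcu_read_lock();
    mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
    rcu_read_unlock();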

    Signed-off-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Hugh Dickins
    Cc: Sudhir Kumar
    Cc: YAMAMOTO Takashi
    Cc: Hirokazu Takahashi
    Cc: David Rientjes
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Pekka Enberg
    Reviewed-by: Paul Menage
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Introduce a read_seq() helper in cftype, which uses seq_file to print out
    lists. Use it in the devices cgroup. Also split devices.allow into two
    files, so now devices.deny and devices.allow are the ones to use to manipulate
    the whitelist, while devices.list outputs the cgroup's current whitelist.
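
    The resulting file set, sketched (the cftype member names here approximate
    the description and may not match the patch exactly):

    static struct cftype dev_cgroup_files[] = {
            { .name = "allow", .write = devcgroup_access_write, },
            { .name = "deny",  .write = devcgroup_access_write, },
            { .name = "list",  .read_seq = devcgroup_seq_read, },
    };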

    Signed-off-by: Serge E. Hallyn
    Acked-by: Paul Menage
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • Now we can run through the hash table instead of running through the
    linked-list.

    Signed-off-by: Li Zefan
    Reviewed-by: Paul Menage
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • We are at system boot and there is only one css_set (i.e., init_css_set),
    so we don't need to run through the css_set linked list. Neither do we
    need to run through the task list, since no processes have been created
    yet.

    Also referring to a comment in cgroup.h:

    struct css_set {
            ...
            /*
             * Set of subsystem states, one for each subsystem. This array
             * is immutable after creation apart from the init_css_set
             * during subsystem registration (at boot time).
             */
            struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];
    };

    Signed-off-by: Li Zefan
    Reviewed-by: Paul Menage
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • When we attach a process to a different cgroup, the css_set linked-list will
    be run through to find a suitable existing css_set to use. This patch
    implements a hash table for better performance.
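
    The hash key is derived from the set of subsystem state pointers, roughly
    as follows (a sketch, simplified from the patch):

    #define CSS_SET_HASH_BITS       7
    static struct hlist_head css_set_table[1 << CSS_SET_HASH_BITS];

    static struct hlist_head *css_set_hash(struct cgroup_subsys_state *css[])
    {
            unsigned long key = 0;
            int i;

            for (i = 0; i < CGROUP_SUBSYS_COUNT; i++)
                    key += (unsigned long)css[i];
            key = (key >> 16) ^ key;

            return &css_set_table[hash_long(key, CSS_SET_HASH_BITS)];
    }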

    The following benchmarks have been tested:

    For N in 1, 5, 10, 50, 100, 500, 1000, create N cgroups with one sleeping
    task in each, and then move an additional task through each cgroup in
    turn.

    Here is a test result:

       N    Loop   orig - Time(s)   hash - Time(s)
    ------------------------------------------------
       1   10000    1.201231728      1.196311177
       5    2000    1.065743872      1.040566424
      10    1000    0.991054735      0.986876440
      50     200    0.976554203      0.969608733
     100     100    0.998504680      0.969218270
     500      20    1.157347764      0.962602963
    1000      10    1.619521852      1.085140172

    Signed-off-by: Li Zefan
    Reviewed-by: Paul Menage
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Implement a cgroup to track and enforce open and mknod restrictions on device
    files. A device cgroup associates a device access whitelist with each cgroup.
    A whitelist entry has 4 fields. 'type' is a (all), c (char), or b (block).
    'all' means it applies to all types and all major and minor numbers. Major
    and minor are either an integer or * for all. Access is a composition of r
    (read), w (write), and m (mknod).

    The root device cgroup starts with rwm to 'all'. A child devcg gets a copy of
    the parent. Admins can then remove devices from the whitelist or add new
    entries. A child cgroup can never receive a device access which is denied
    by its parent. However, when a device access is removed from a parent, it
    will not also be removed from the child(ren).

    An entry is added using devices.allow, and removed using
    devices.deny. For instance

    echo 'c 1:3 mr' > /cgroups/1/devices.allow

    allows cgroup 1 to read and mknod the device usually known as
    /dev/null. Doing

    echo a > /cgroups/1/devices.deny

    will remove the default 'a *:* mrw' entry.

    CAP_SYS_ADMIN is needed to change permissions or move another task to a new
    cgroup. A cgroup may not be granted more permissions than the cgroup's parent
    has. Any task can move itself between cgroups. This won't be sufficient, but
    we can decide the best way to adequately restrict movement later.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix may-be-used-uninitialized warning]
    Signed-off-by: Serge E. Hallyn
    Acked-by: James Morris
    Looks-good-to: Pavel Emelyanov
    Cc: Daniel Hokka Zakrisson
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • The trigger callback can be used to receive a kick from user space. The
    string written is ignored.

    The cftype->private is used for multiplexing events.
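
    A usage sketch, wiring the memory controller's reset handler to two
    trigger files:

    static struct cftype files[] = {
            {
                    .name = "max_usage_in_bytes",
                    .private = RES_MAX_USAGE,
                    .trigger = mem_cgroup_reset,
            },
            {
                    .name = "failcnt",
                    .private = RES_FAILCNT,
                    .trigger = mem_cgroup_reset,
            },
    };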

    Signed-off-by: Pavel Emelyanov
    Acked-by: Paul Menage
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • There is a race between create_proc_entry() and the assignment of file ops;
    proc_create() was invented to fix it.
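
    The race and the fix, sketched:

    /* racy: the entry is visible before its fops are set */
    entry = create_proc_entry("cgroups", 0, NULL);
    if (entry)
            entry->proc_fops = &proc_cgroupstats_operations;

    /* fixed: the fops are supplied atomically at creation */
    entry = proc_create("cgroups", 0, NULL, &proc_cgroupstats_operations);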

    Signed-off-by: Li Zefan
    Acked-by: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • It is called by cgroup_init() and cgroup_init_early() only, which are
    annotated with __init.

    Signed-off-by: Li Zefan
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • This removes some filesystem boilerplate from the CFS cgroup subsystem.

    Signed-off-by: Paul Menage
    Acked-by: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • These patches add cgroups read_s64 and write_s64 control file methods (the
    signed equivalent of read_u64/write_u64) and use them to implement the
    cpu.rt_runtime_us control file in the CFS cgroup subsystem.

    This patch:

    These are the signed equivalents of the read_u64/write_u64 methods.
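
    Their signatures mirror the unsigned variants:

    s64 (*read_s64)(struct cgroup *cgrp, struct cftype *cft);
    int (*write_s64)(struct cgroup *cgrp, struct cftype *cft, s64 val);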

    Signed-off-by: Paul Menage
    Acked-by: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • The cgroup debug subsystem isn't generally useful for users. It should
    default to "n".

    Signed-off-by: Paul Menage
    Cc: "Li Zefan"
    Cc: Balbir Singh
    Cc: Paul Jackson
    Cc: Pavel Emelyanov
    Cc: KAMEZAWA Hiroyuki
    Cc: "YAMAMOTO Takashi"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • The "releasable" control file provided by the cgroup framework exports the
    state of a per-cgroup flag that's related to the notify-on-release feature.
    This isn't really generally useful, unless you're trying to debug this
    particular feature of cgroups.

    This patch moves the "releasable" file to the cgroup_debug subsystem.

    Signed-off-by: Paul Menage
    Cc: "Li Zefan"
    Cc: Balbir Singh
    Cc: Paul Jackson
    Cc: Pavel Emelyanov
    Cc: KAMEZAWA Hiroyuki
    Cc: "YAMAMOTO Takashi"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • This function isn't needed - a NULL pointer in the cftype read function will
    result in the same EINVAL response to userspace.

    Signed-off-by: Paul Menage
    Cc: "Li Zefan"
    Cc: Balbir Singh
    Cc: Paul Jackson
    Cc: Pavel Emelyanov
    Cc: KAMEZAWA Hiroyuki
    Cc: "YAMAMOTO Takashi"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage