29 Apr, 2008
40 commits
-
Remove proc_root export. Creation and removal works well if parent PDE is
supplied as NULL -- it worked always that way.So, one useless export removed and consistency added, some drivers created
PDEs with &proc_root as parent but removed them as NULL and so on.Signed-off-by: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The kernel implements readlink of /proc/pid/exe by getting the file from
the first executable VMA. Then the path to the file is reconstructed and
reported as the result.Because of the VMA walk the code is slightly different on nommu systems.
This patch avoids separate /proc/pid/exe code on nommu systems. Instead of
walking the VMAs to find the first executable file-backed VMA we store a
reference to the exec'd file in the mm_struct.That reference would prevent the filesystem holding the executable file
from being unmounted even after unmapping the VMAs. So we track the number
of VM_EXECUTABLE VMAs and drop the new reference when the last one is
unmapped. This avoids pinning the mounted filesystem.[akpm@linux-foundation.org: improve comments]
[yamamoto@valinux.co.jp: fix dup_mmap]
Signed-off-by: Matt Helsley
Cc: Oleg Nesterov
Cc: David Howells
Cc:"Eric W. Biederman"
Cc: Christoph Hellwig
Cc: Al Viro
Cc: Hugh Dickins
Signed-off-by: YAMAMOTO Takashi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Make the keyring quotas controllable through /proc/sys files:
(*) /proc/sys/kernel/keys/root_maxkeys
/proc/sys/kernel/keys/root_maxbytesMaximum number of keys that root may have and the maximum total number of
bytes of data that root may have stored in those keys.(*) /proc/sys/kernel/keys/maxkeys
/proc/sys/kernel/keys/maxbytesMaximum number of keys that each non-root user may have and the maximum
total number of bytes of data that each of those users may have stored in
their keys.Also increase the quotas as a number of people have been complaining that it's
not big enough. I'm not sure that it's big enough now either, but on the
other hand, it can now be set in /etc/sysctl.conf.Signed-off-by: David Howells
Cc:
Cc:
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Don't generate the per-UID user and user session keyrings unless they're
explicitly accessed. This solves a problem during a login process whereby
set*uid() is called before the SELinux PAM module, resulting in the per-UID
keyrings having the wrong security labels.This also cures the problem of multiple per-UID keyrings sometimes appearing
due to PAM modules (including pam_keyinit) setuiding and causing user_structs
to come into and go out of existence whilst the session keyring pins the user
keyring. This is achieved by first searching for extant per-UID keyrings
before inventing new ones.The serial bound argument is also dropped from find_keyring_by_name() as it's
not currently made use of (setting it to 0 disables the feature).Signed-off-by: David Howells
Cc:
Cc:
Cc:
Cc: Stephen Smalley
Cc: James Morris
Cc: Chris Wright
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
CLONE_NEWIPC|CLONE_SYSVSEM interaction isn't handled properly. This can cause
a kernel memory corruption. CLONE_NEWIPC must detach from the existing undo
lists.Fix, part 3: refuse clone(CLONE_SYSVSEM|CLONE_NEWIPC).
With unshare, specifying CLONE_SYSVSEM means unshare the sysvsem. So it seems
reasonable that CLONE_NEWIPC without CLONE_SYSVSEM would just imply
CLONE_SYSVSEM.However with clone, specifying CLONE_SYSVSEM means *share* the sysvsem. So
calling clone(CLONE_SYSVSEM|CLONE_NEWIPC) is explicitly asking for something
we can't allow. So return -EINVAL in that case.[akpm@linux-foundation.org: cleanups]
Signed-off-by: Serge E. Hallyn
Cc: Manfred Spraul
Acked-by: "Eric W. Biederman"
Cc: Pavel Emelyanov
Cc: Michael Kerrisk
Cc: Pierre Peiffer
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
sys_unshare(CLONE_NEWIPC) doesn't handle the undo lists properly, this can
cause a kernel memory corruption. CLONE_NEWIPC must detach from the existing
undo lists.Fix, part 2: perform an implicit CLONE_SYSVSEM in CLONE_NEWIPC. CLONE_NEWIPC
creates a new IPC namespace, the task cannot access the existing semaphore
arrays after the unshare syscall. Thus the task can/must detach from the
existing undo list entries, too.This fixes the kernel corruption, because it makes it impossible that
undo records from two different namespaces are in sysvsem.undo_list.Signed-off-by: Manfred Spraul
Signed-off-by: Serge E. Hallyn
Acked-by: "Eric W. Biederman"
Cc: Pavel Emelyanov
Cc: Michael Kerrisk
Cc: Pierre Peiffer
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
sys_unshare(CLONE_NEWIPC) doesn't handle the undo lists properly, this can
cause a kernel memory corruption. CLONE_NEWIPC must detach from the existing
undo lists.Fix, part 1: add support for sys_unshare(CLONE_SYSVSEM)
The original reason to not support it was the potential (inevitable?)
confusion due to the fact that sys_unshare(CLONE_SYSVSEM) has the
inverse meaning of clone(CLONE_SYSVSEM).Our two most reasonable options then appear to be (1) fully support
CLONE_SYSVSEM, or (2) continue to refuse explicit CLONE_SYSVSEM,
but always do it anyway on unshare(CLONE_SYSVSEM). This patch does
(1).Changelog:
Apr 16: SEH: switch to Manfred's alternative patch which
removes the unshare_semundo() function which
always refused CLONE_SYSVSEM.Signed-off-by: Manfred Spraul
Signed-off-by: Serge E. Hallyn
Acked-by: "Eric W. Biederman"
Cc: Pavel Emelyanov
Cc: Michael Kerrisk
Cc: Pierre Peiffer
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The enhancement as asked for by Yasunori: if msgmni is set to a negative
value, register it back into the ipcns notifier chain.A new interface has been added to the notification mechanism:
notifier_chain_cond_register() registers a notifier block only if not already
registered. With that new interface we avoid taking care of the states
changes in procfs.Signed-off-by: Nadia Derbey
Cc: Yasunori Goto
Cc: Matt Helsley
Cc: Mingming Cao
Cc: Pierre Peiffer
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
cpu_hotplug_begin() must be always called under cpu_add_remove_lock, this
means that only one process can be cpu_hotplug.active_writer. So we don't
need the cpu_hotplug.writer_queue, we can wake up the ->active_writer
directly.Also, fix the comment.
Signed-off-by: Oleg Nesterov
Cc: Dipankar Sarma
Acked-by: Gautham R Shenoy
Cc: Ingo Molnar
Cc: Srivatsa Vaddagiri
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
cleanup_workqueue_thread() doesn't need the second argument, remove it.
Signed-off-by: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When cpu_populated_map was introduced, it was supposed that cwq->thread can
survive after CPU_DEAD, that is why we never shrink cpu_populated_map.This is not very nice, we can safely remove the already dead CPU from the map.
The only required change is that destroy_workqueue() must hold the hotplug
lock until it destroys all cwq->thread's, to protect the cpu_populated_map.
We could make the local copy of cpu mask and drop the lock, but
sizeof(cpumask_t) may be very large.Also, fix the comment near queue_work(). Unless _cpu_down() happens we do
guarantee the cpu-affinity of the work_struct, and we have users which rely on
this.[akpm@linux-foundation.org: repair comment]
Signed-off-by: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This flag provides the hardwalling properties of mem_exclusive, without
enforcing the exclusivity. Either mem_hardwall or mem_exclusive is sufficient
to prevent GFP_KERNEL allocations from passing outside the cpuset's assigned
nodes.Signed-off-by: Paul Menage
Acked-by: Paul Jackson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Currently the cpusets mem_exclusive flag is overloaded to mean both
"no-overlapping" and "no GFP_KERNEL allocations outside this cpuset".These patches add a new mem_hardwall flag with just the allocation restriction
part of the mem_exclusive semantics, without breaking backwards-compatibility
for those who continue to use just mem_exclusive. Additionally, the cgroup
control file registration for cpusets is cleaned up to reduce boilerplate.This patch:
This change tidies up the cpusets control file definitions, and reduces the
amount of boilerplate required to add/change control files in the future.Signed-off-by: Paul Menage
Reviewed-by: Li Zefan
Acked-by: Paul Jackson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Make the following needlessly global functions static:
- cpuset_test_cpumask()
- cpuset_change_cpumask()
- cpuset_do_move_task()Signed-off-by: Adrian Bunk
Acked-by: Paul Jackson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This field is the maximal value of the usage one since the counter creation
(or since the latest reset).To reset this to the usage value simply write anything to the appropriate
cgroup file.Signed-off-by: Pavel Emelyanov
Acked-by: Balbir Singh
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Remove the mem_cgroup member from mm_struct and instead adds an owner.
This approach was suggested by Paul Menage. The advantage of this approach
is that, once the mm->owner is known, using the subsystem id, the cgroup
can be determined. It also allows several control groups that are
virtually grouped by mm_struct, to exist independent of the memory
controller i.e., without adding mem_cgroup's for each controller, to
mm_struct.A new config option CONFIG_MM_OWNER is added and the memory resource
controller selects this config option.This patch also adds cgroup callbacks to notify subsystems when mm->owner
changes. The mm_cgroup_changed callback is called with the task_lock() of
the new task held and is called just prior to changing the mm->owner.I am indebted to Paul Menage for the several reviews of this patchset and
helping me make it lighter and simpler.This patch was tested on a powerpc box, it was compiled with both the
MM_OWNER config turned on and off.After the thread group leader exits, it's moved to init_css_state by
cgroup_exit(), thus all future charges from runnings threads would be
redirected to the init_css_set's subsystem.Signed-off-by: Balbir Singh
Cc: Pavel Emelianov
Cc: Hugh Dickins
Cc: Sudhir Kumar
Cc: YAMAMOTO Takashi
Cc: Hirokazu Takahashi
Cc: David Rientjes ,
Cc: Balbir Singh
Acked-by: KAMEZAWA Hiroyuki
Acked-by: Pekka Enberg
Reviewed-by: Paul Menage
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Introduce a read_seq() helper in cftype, which uses seq_file to print out
lists. Use it in the devices cgroup. Also split devices.allow into two
files, so now devices.deny and devices.allow are the ones to use to manipulate
the whitelist, while devices.list outputs the cgroup's current whitelist.Signed-off-by: Serge E. Hallyn
Acked-by: Paul Menage
Cc: Balbir Singh
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Now we can run through the hash table instead of running through the
linked-list.Signed-off-by: Li Zefan
Reviewed-by: Paul Menage
Cc: Balbir Singh
Cc: Pavel Emelyanov
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
We are at system boot and there is only 1 cgroup group (i,e, init_css_set), so
we don't need to run through the css_set linked list. Neither do we need to
run through the task list, since no processes have been created yet.Also referring to a comment in cgroup.h:
struct css_set
{
...
/*
* Set of subsystem states, one for each subsystem. This array
* is immutable after creation apart from the init_css_set
* during subsystem registration (at boot time).
*/
struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];
}Signed-off-by: Li Zefan
Reviewed-by: Paul Menage
Cc: Balbir Singh
Cc: Pavel Emelyanov
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When we attach a process to a different cgroup, the css_set linked-list will
be run through to find a suitable existing css_set to use. This patch
implements a hash table for better performance.The following benchmarks have been tested:
For N in 1, 5, 10, 50, 100, 500, 1000, create N cgroups with one sleeping
task in each, and then move an additional task through each cgroup in
turn.Here is a test result:
N Loop orig - Time(s) hash - Time(s)
----------------------------------------------
1 10000 1.201231728 1.196311177
5 2000 1.065743872 1.040566424
10 1000 0.991054735 0.986876440
50 200 0.976554203 0.969608733
100 100 0.998504680 0.969218270
500 20 1.157347764 0.962602963
1000 10 1.619521852 1.085140172Signed-off-by: Li Zefan
Reviewed-by: Paul Menage
Cc: Balbir Singh
Cc: Pavel Emelyanov
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Trigger callback can be used to receive a kick-up from the user space. The
string written is ignored.The cftype->private is used for multiplexing events.
Signed-off-by: Pavel Emelyanov
Acked-by: Paul Menage
Acked-by: KAMEZAWA Hiroyuki
Cc: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
There is a race between create_proc_entry() and the assignment of file ops.
proc_create() is invented to fix it.Signed-off-by: Li Zefan
Acked-by: Paul Menage
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
It is called by cgroup_init() and cgroup_init_early() only, which are
annotated with __init.Signed-off-by: Li Zefan
Cc: Paul Menage
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This removes some filesystem boilerplate from the CFS cgroup subsystem.
Signed-off-by: Paul Menage
Acked-by: Peter Zijlstra
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
These patches add cgroups read_s64 and write_s64 control file methods (the
signed equivalent of read_u64/write_u64) and use them to implement the
cpu.rt_runtime_us control file in the CFS cgroup subsystem.This patch:
These are the signed equivalents of the read_u64/write_u64 methods
Signed-off-by: Paul Menage
Acked-by: Peter Zijlstra
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The "releasable" control file provided by the cgroup framework exports the
state of a per-cgroup flag that's related to the notify-on-release feature.
This isn't really generally useful, unless you're trying to debug this
particular feature of cgroups.This patch moves the "releasable" file to the cgroup_debug subsystem.
Signed-off-by: Paul Menage
Cc: "Li Zefan"
Cc: Balbir Singh
Cc: Paul Jackson
Cc: Pavel Emelyanov
Cc: KAMEZAWA Hiroyuki
Cc: "YAMAMOTO Takashi"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Adds a new type of supported control file representation, a map from strings
to u64 values.Each map entry is printed as a line in a similar format to /proc/vmstat, i.e.
"$key $value\n"Signed-off-by: Paul Menage
Cc: "Li Zefan"
Cc: Balbir Singh
Cc: Paul Jackson
Cc: Pavel Emelyanov
Cc: KAMEZAWA Hiroyuki
Cc: "YAMAMOTO Takashi"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Many of the cpusets control files are simple integer values, which don't
require the overhead of memory allocations for reads and writes.Move the handlers for these control files into cpuset_read_u64() and
cpuset_write_u64().[akpm@linux-foundation.org: ad dmissing `break']
Signed-off-by: Paul Menage
Cc: "Li Zefan"
Cc: Balbir Singh
Cc: Paul Jackson
Cc: Pavel Emelyanov
Cc: KAMEZAWA Hiroyuki
Cc: "YAMAMOTO Takashi"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This removes the need for people to remember to pass the -n flag to echo when
writing values to cgroup control files.Signed-off-by: Paul Menage
Cc: "Li Zefan"
Cc: Balbir Singh
Cc: Paul Jackson
Cc: Pavel Emelyanov
Cc: KAMEZAWA Hiroyuki
Cc: "YAMAMOTO Takashi"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Adds a function for returning the value of a resource counter member, in a
form suitable for use in a cgroup read_u64 control file method.Signed-off-by: Paul Menage
Cc: "Li Zefan"
Cc: Balbir Singh
Cc: Paul Jackson
Cc: Pavel Emelyanov
Cc: KAMEZAWA Hiroyuki
Cc: "YAMAMOTO Takashi"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Several people have justifiably complained that the "_uint" suffix is
inappropriate for functions that handle u64 values, so this patch just renames
all these functions and their users to have the suffic _u64.[peterz@infradead.org: build fix]
Signed-off-by: Paul Menage
Cc: "Li Zefan"
Cc: Balbir Singh
Cc: Paul Jackson
Cc: Pavel Emelyanov
Cc: KAMEZAWA Hiroyuki
Cc: "YAMAMOTO Takashi"
Signed-off-by: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Every file should include the headers containing the externs its global
functions (in this case for ns_cgroup_clone()).Signed-off-by: Adrian Bunk
Acked-by: Serge Hallyn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Fix a code warning: symbol 'p' shadows an earlier one
This is a reincarnation of Harvey Harrison's patch:
cpuset: sparse warnings in cpuset.cIndependently, Cliff Wickman moved the affected code,
from kernel/cpuset.c to kernel/cgroup.c, in his patch:
cpusets: update_cpumask revisionSigned-off-by: Paul Jackson
Cc: Harvey Harrison
Cc: Cliff Wickman
Acked-by: Paul Menage
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Make the needlessly global cgroup_enable_task_cg_lists() static.
Signed-off-by: Adrian Bunk
Acked-by: David Rientjes
Cc: Paul Menage
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Make eCryptfs key module subsystem respect namespaces.
Since I will be removing the netlink interface in a future patch, I just made
changes to the netlink.c code so that it will not break the build. With my
recent patches, the kernel module currently defaults to the device handle
interface rather than the netlink interface.[akpm@linux-foundation.org: export free_user_ns()]
Signed-off-by: Michael Halcrow
Acked-by: Serge Hallyn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Due to the rcupreempt.h WARN_ON trigged, I got 2G syslog file. For some
serious complaining of kernel, we need repeat the warnings, so here I isolate
the ratelimit part of printk.c to a standalone file.Signed-off-by: Dave Young
Acked-by: Paul E. McKenney
Tested-by: Paul E. McKenney
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Following an experimental deletion of the unnecessary directive
#include
from the header file , these files under kernel/ were exposed
as needing to include one of or , so explicit
includes were added where necessary.Signed-off-by: Robert P. J. Day
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
From the POV of synchronization, there should be no need to call
wake_up_process() with the 'kthread_create_lock' being held.Signed-off-by: Dmitry Adamushko
Cc: Nick Piggin
Cc: Ingo Molnar
Cc: Rusty Russell
Cc: "Paul E. McKenney"
Cc: Peter Zijlstra
Cc: Andy Whitcroft
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Fix following warnings:
WARNING: vmlinux.o(.text+0xc60): Section mismatch in reference from the function kvm_init() to the function .cpuinit.text:register_cpu_notifier()
WARNING: vmlinux.o(.text+0x33869a): Section mismatch in reference from the function xfs_icsb_init_counters() to the function .cpuinit.text:register_cpu_notifier()
WARNING: vmlinux.o(.text+0x5556a1): Section mismatch in reference from the function acpi_processor_install_hotplug_notify() to the function .cpuinit.text:register_cpu_notifier()
WARNING: vmlinux.o(.text+0xfe6b28): Section mismatch in reference from the function cpufreq_register_driver() to the function .cpuinit.text:register_cpu_notifier()register_cpu_notifier() are only really defined when HOTPLUG_CPU is enabled.
So references to the function are OK.Annotate it with __ref so we do not get warnings from callers and do not get
warnings for the functions/data used by register_cpu_notifier().Signed-off-by: Sam Ravnborg
Cc: Gautham R Shenoy
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Fix following warnings:
WARNING: vmlinux.o(.text+0x75c8d): Section mismatch in reference from the function take_cpu_down() to the variable .cpuinit.data:cpu_chain
WARNING: vmlinux.o(.text+0x75d2a): Section mismatch in reference from the function _cpu_down() to the variable .cpuinit.data:cpu_chain
WARNING: vmlinux.o(.text+0x75d4d): Section mismatch in reference from the function _cpu_down() to the variable .cpuinit.data:cpu_chain
WARNING: vmlinux.o(.text+0x75de4): Section mismatch in reference from the function _cpu_down() to the variable .cpuinit.data:cpu_chain
WARNING: vmlinux.o(.text+0x75e33): Section mismatch in reference from the function _cpu_down() to the variable .cpuinit.data:cpu_chaincpu_down is only used from code surrounded by HOTPLUG_CPU so any references to
__cpuinit is OK.Add a few __ref to tech modpost to ignore the references.
This is just papering over the fact that the cpu hotplug code is fragile with
respect to use of HOTPLUG_CPU and in many cases rely on __cpuinit to get rid
of code when HOTPLUG_CPU is not enabled. For now this is the least invasive
change.Signed-off-by: Sam Ravnborg
Cc: Gautham R Shenoy
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds