09 Jul, 2011
1 commit
-
Since ca5ecddf (rcu: define __rcu address space modifier for sparse)
rcu_dereference_check use rcu_read_lock_held as a part of condition
automatically so callers do not have to do that as well.Signed-off-by: Michal Hocko
Acked-by: Paul E. McKenney
Signed-off-by: Jiri Kosina
27 May, 2011
4 commits
-
The ns_cgroup is an annoying cgroup at the namespace / cgroup frontier and
leads to some problems:* cgroup creation is out-of-control
* cgroup name can conflict when pids are looping
* it is not possible to have a single process handling a lot of
namespaces without falling in a exponential creation time
* we may want to create a namespace without creating a cgroupThe ns_cgroup was replaced by a compatibility flag 'clone_children',
where a newly created cgroup will copy the parent cgroup values.
The userspace has to manually create a cgroup and add a task to
the 'tasks' file.This patch removes the ns_cgroup as suggested in the following thread:
https://lists.linux-foundation.org/pipermail/containers/2009-June/018616.html
The 'cgroup_clone' function is removed because it is no longer used.
This is a userspace-visible change. Commit 45531757b45c ("cgroup: notify
ns_cgroup deprecated") (merged into 2.6.27) caused the kernel to emit a
printk warning users that the feature is planned for removal. Since that
time we have heard from XXX users who were affected by this.Signed-off-by: Daniel Lezcano
Signed-off-by: Serge E. Hallyn
Cc: Eric W. Biederman
Cc: Jamal Hadi Salim
Reviewed-by: Li Zefan
Acked-by: Paul Menage
Acked-by: Matt Helsley
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Convert cgroup_attach_proc to use flex_array.
The cgroup_attach_proc implementation requires a pre-allocated array to
store task pointers to atomically move a thread-group, but asking for a
monolithic array with kmalloc() may be unreliable for very large groups.
Using flex_array provides the same functionality with less risk of
failure.This is a post-patch for cgroup-procs-write.patch.
Signed-off-by: Ben Blum
Cc: "Eric W. Biederman"
Cc: Li Zefan
Cc: Matt Helsley
Reviewed-by: Paul Menage
Cc: Oleg Nesterov
Cc: David Rientjes
Cc: Miao Xie
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Make procs file writable to move all threads by tgid at once.
Add functionality that enables users to move all threads in a threadgroup
at once to a cgroup by writing the tgid to the 'cgroup.procs' file. This
current implementation makes use of a per-threadgroup rwsem that's taken
for reading in the fork() path to prevent newly forking threads within the
threadgroup from "escaping" while the move is in progress.Signed-off-by: Ben Blum
Cc: "Eric W. Biederman"
Cc: Li Zefan
Cc: Matt Helsley
Reviewed-by: Paul Menage
Cc: Oleg Nesterov
Cc: David Rientjes
Cc: Miao Xie
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Add cgroup subsystem callbacks for per-thread attachment in atomic contexts
Add can_attach_task(), pre_attach(), and attach_task() as new callbacks
for cgroups's subsystem interface. Unlike can_attach and attach, these
are for per-thread operations, to be called potentially many times when
attaching an entire threadgroup.Also, the old "bool threadgroup" interface is removed, as replaced by
this. All subsystems are modified for the new interface - of note is
cpuset, which requires from/to nodemasks for attach to be globally scoped
(though per-cpuset would work too) to persist from its pre_attach to
attach_task and attach.This is a pre-patch for cgroup-procs-writable.patch.
Signed-off-by: Ben Blum
Cc: "Eric W. Biederman"
Cc: Li Zefan
Cc: Matt Helsley
Reviewed-by: Paul Menage
Cc: Oleg Nesterov
Cc: David Rientjes
Cc: Miao Xie
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
08 May, 2011
3 commits
-
The rcu callback __free_css_id_cb() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(__free_css_id_cb).Signed-off-by: Lai Jiangshan
Acked-by: Paul Menage
Signed-off-by: Paul E. McKenney
Reviewed-by: Josh Triplett -
The rcu callback free_cgroup_rcu() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(free_cgroup_rcu).Signed-off-by: Lai Jiangshan
Acked-by: Paul Menage
Signed-off-by: Paul E. McKenney
Reviewed-by: Josh Triplett -
The rcu callback free_css_set_rcu() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(free_css_set_rcu).Signed-off-by: Lai Jiangshan
Acked-by: Paul Menage
Signed-off-by: Paul E. McKenney
Reviewed-by: Josh Triplett
31 Mar, 2011
1 commit
-
Fixes generated by 'codespell' and manually reviewed.
Signed-off-by: Lucas De Marchi
23 Mar, 2011
1 commit
-
list_del() leaves poison in the prev and next pointers. The next
list_empty() will compare those poisons, and say the list isn't empty.
Any list operations that assume the node is on a list because of such a
check will be fooled into dereferencing poison. One needs to INIT the
node after the del, and fortunately there's already a wrapper for that -
list_del_init().Some of the dels are followed by deallocations, so can be ignored, and one
can be merged with an add to make a move. Apart from that, I erred on the
side of caution in making nodes list_empty()-queriable.Signed-off-by: Phil Carmody
Reviewed-by: Paul Menage
Cc: Li Zefan
Acked-by: Kirill A. Shutemov
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
16 Feb, 2011
2 commits
-
This kernel patch adds the ability to filter monitoring based on
container groups (cgroups). This is for use in per-cpu mode only.The cgroup to monitor is passed as a file descriptor in the pid
argument to the syscall. The file descriptor must be opened to
the cgroup name in the cgroup filesystem. For instance, if the
cgroup name is foo and cgroupfs is mounted in /cgroup, then the
file descriptor is opened to /cgroup/foo. Cgroup mode is
activated by passing PERF_FLAG_PID_CGROUP in the flags argument
to the syscall.For instance to measure in cgroup foo on CPU1 assuming
cgroupfs is mounted under /cgroup:struct perf_event_attr attr;
int cgroup_fd, fd;cgroup_fd = open("/cgroup/foo", O_RDONLY);
fd = perf_event_open(&attr, cgroup_fd, 1, -1, PERF_FLAG_PID_CGROUP);
close(cgroup_fd);Signed-off-by: Stephane Eranian
[ added perf_cgroup_{exit,attach} ]
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar -
Make the ::exit method act like ::attach, it is after all very nearly
the same thing.The bug had no effect on correctness - fixing it is an optimization for
the scheduler. Also, later perf-cgroups patches rely on it.Signed-off-by: Peter Zijlstra
Acked-by: Paul Menage
LKML-Reference:
Signed-off-by: Ingo Molnar
15 Jan, 2011
2 commits
-
…t/npiggin/linux-npiggin
* 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin:
kernel: fix hlist_bl again
cgroups: Fix a lockdep warning at cgroup removal
fs: namei fix ->put_link on wrong inode in do_filp_open -
cgroup can't use simple_lookup(), since that'd override its desired ->d_op.
Tested-by: Li Zefan
Signed-off-by: Al Viro
Signed-off-by: Linus Torvalds
14 Jan, 2011
1 commit
-
Commit 2fd6b7f5 ("fs: dcache scale subdirs") forgot to annotate a dentry
lock, which caused a lockdep warning.Reported-by: Valdis Kletnieks
Signed-off-by: Li Zefan
13 Jan, 2011
1 commit
-
switching it to s_d_op allows to kill the cgroup_lookup() kludge.
Signed-off-by: Al Viro
07 Jan, 2011
7 commits
-
Reduce some branches and memory accesses in dcache lookup by adding dentry
flags to indicate common d_ops are set, rather than having to check them.
This saves a pointer memory access (dentry->d_op) in common path lookup
situations, and saves another pointer load and branch in cases where we
have d_op but not the particular operation.Patched with:
git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i
Signed-off-by: Nick Piggin
-
dget_locked was a shortcut to avoid the lazy lru manipulation when we already
held dcache_lock (lru manipulation was relatively cheap at that point).
However, how that the lru lock is an innermost one, we never hold it at any
caller, so the lock cost can now be avoided. We already have well working lazy
dcache LRU, so it should be fine to defer LRU manipulations to scan time.Signed-off-by: Nick Piggin
-
dcache_lock no longer protects anything. remove it.
Signed-off-by: Nick Piggin
-
Protect d_subdirs and d_child with d_lock, except in filesystems that aren't
using dcache_lock for these anyway (eg. using i_mutex).Note: if we change the locking rule in future so that ->d_child protection is
provided only with ->d_parent->d_lock, it may allow us to reduce some locking.
But it would be an exception to an otherwise regular locking scheme, so we'd
have to see some good results. Probably not worthwhile.Signed-off-by: Nick Piggin
-
Make d_count non-atomic and protect it with d_lock. This allows us to ensure a
0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when
we start protecting many other dentry members with d_lock.Signed-off-by: Nick Piggin
-
Change d_delete from a dentry deletion notification to a dentry caching
advise, more like ->drop_inode. Require it to be constant and idempotent,
and not take d_lock. This is how all existing filesystems use the callback
anyway.This makes fine grained dentry locking of dput and dentry lru scanning
much simpler.Signed-off-by: Nick Piggin
-
Switching d_op on a live dentry is racy in general, so avoid it. In this case
it is a negative dentry, which is safer, but there are still concurrent ops
which may be called on d_op in that case (eg. d_revalidate). So in general
a filesystem may not do this. Fix cgroupfs so as not to do this.Signed-off-by: Nick Piggin
29 Oct, 2010
1 commit
-
Signed-off-by: Al Viro
28 Oct, 2010
3 commits
-
Function "strcpy" is used without check for maximum allowed source string
length and could cause destination string overflow. Check for string
length is added before using "strcpy". Function now is return error if
source string length is more than a maximum.akpm: presently considered NotABug, but add the check for general
future-safeness and robustness.Signed-off-by: Evgeny Kuznetsov
Acked-by: Paul Menage
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Current behavior:
=================(1) When we mount a cgroup, we can specify the 'all' option which
means to enable all the cgroup subsystems. This is the default option
when no option is specified.(2) If we want to mount a cgroup with a subset of the supported cgroup
subsystems, we have to specify a subsystems name list for the mount
option.(3) If we specify another option like 'noprefix' or 'release_agent',
the actual code wants the 'all' or a subsystem name option specified
also. Not critical but a bit not friendly as we should assume (1) in
this case.(4) Logically, the 'all' option is mutually exclusive with a subsystem
name, but this is not detected.In other words:
succeed : mount -t cgroup -o all,freezer cgroup /cgroup
=> is it 'all' or 'freezer' ?
fails : mount -t cgroup -o noprefix cgroup /cgroup
=> succeed if we do '-o noprefix,all'The following patches consolidate a bit the mount options check.
New behavior:
=============(1) untouched
(2) untouched
(3) the 'all' option will be by default when specifying other than
a subsystem name option
(4) raises an errorIn other words:
fails : mount -t cgroup -o all,freezer cgroup /cgroup
succeed : mount -t cgroup -o noprefix cgroup /cgroupFor the sake of lisibility, the if ... then ... else ... if ...
indentation when parsing the options has been changed to:
if ... then
...
continue
fiSigned-off-by: Daniel Lezcano
Signed-off-by: Serge E. Hallyn
Reviewed-by: Li Zefan
Reviewed-by: Paul Menage
Cc: Eric W. Biederman
Cc: Jamal Hadi Salim
Cc: Matt Helsley
Cc: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The ns_cgroup is a control group interacting with the namespaces. When a
new namespace is created, a corresponding cgroup is automatically created
too. The cgroup name is the pid of the process who did 'unshare' or the
child of 'clone'.This cgroup is tied with the namespace because it prevents a process to
escape the control group and use the post_clone callback, so the child
cgroup inherits the values of the parent cgroup.Unfortunately, the more we use this cgroup and the more we are facing
problems with it:(1) when a process unshares, the cgroup name may conflict with a
previous cgroup with the same pid, so unshare or clone return -EEXIST(2) the cgroup creation is out of control because there may have an
application creating several namespaces where the system will
automatically create several cgroups in his back and let them on the
cgroupfs (eg. a vrf based on the network namespace).(3) the mix of (1) and (2) force an administrator to regularly check
and clean these cgroups.This patchset removes the ns_cgroup by adding a new flag to the cgroup and
the cgroupfs mount option. It enables the copy of the parent cgroup when
a child cgroup is created. We can then safely remove the ns_cgroup as
this flag brings a compatibility. We have now to manually create and add
the task to a cgroup, which is consistent with the cgroup framework.This patch:
Sent as an answer to a previous thread around the ns_cgroup.
https://lists.linux-foundation.org/pipermail/containers/2009-June/018627.html
It adds a control file 'clone_children' for a cgroup. This control file
is a boolean specifying if the child cgroup should be a clone of the
parent cgroup or not. The default value is 'false'.This flag makes the child cgroup to call the post_clone callback of all
the subsystem, if it is available.At present, the cpuset is the only one which had implemented the
post_clone callback.The option can be set at mount time by specifying the 'clone_children'
mount option.Signed-off-by: Daniel Lezcano
Signed-off-by: Serge E. Hallyn
Cc: Eric W. Biederman
Acked-by: Paul Menage
Reviewed-by: Li Zefan
Cc: Jamal Hadi Salim
Cc: Matt Helsley
Acked-by: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
26 Oct, 2010
1 commit
-
Instead of always assigning an increasing inode number in new_inode
move the call to assign it into those callers that actually need it.
For now callers that need it is estimated conservatively, that is
the call is added to all filesystems that do not assign an i_ino
by themselves. For a few more filesystems we can avoid assigning
any inode number given that they aren't user visible, and for others
it could be done lazily when an inode number is actually needed,
but that's left for later patches.Signed-off-by: Christoph Hellwig
Signed-off-by: Dave Chinner
Signed-off-by: Al Viro
23 Oct, 2010
1 commit
-
* 'vfs' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl: (30 commits)
BKL: remove BKL from freevxfs
BKL: remove BKL from qnx4
autofs4: Only declare function when CONFIG_COMPAT is defined
autofs: Only declare function when CONFIG_COMPAT is defined
ncpfs: Lock socket in ncpfs while setting its callbacks
fs/locks.c: prepare for BKL removal
BKL: Remove BKL from ncpfs
BKL: Remove BKL from OCFS2
BKL: Remove BKL from squashfs
BKL: Remove BKL from jffs2
BKL: Remove BKL from ecryptfs
BKL: Remove BKL from afs
BKL: Remove BKL from USB gadgetfs
BKL: Remove BKL from autofs4
BKL: Remove BKL from isofs
BKL: Remove BKL from fat
BKL: Remove BKL from ext2 filesystem
BKL: Remove BKL from do_new_mount()
BKL: Remove BKL from cgroup
BKL: Remove BKL from NTFS
...
07 Oct, 2010
1 commit
-
…ck/linux-2.6-rcu into core/rcu
05 Oct, 2010
2 commits
-
The BKL is only used in remount_fs and get_sb that are both protected by
the superblocks s_umount rw_semaphore. Therefore it is safe to remove the
BKL entirely.Signed-off-by: Jan Blunck
Signed-off-by: Arnd Bergmann -
This patch is a preparation necessary to remove the BKL from do_new_mount().
It explicitly adds calls to lock_kernel()/unlock_kernel() around
get_sb/fill_super operations for filesystems that still uses the BKL.I've read through all the code formerly covered by the BKL inside
do_kern_mount() and have satisfied myself that it doesn't need the BKL
any more.do_kern_mount() is already called without the BKL when mounting the rootfs
and in nfsctl. do_kern_mount() calls vfs_kern_mount(), which is called
from various places without BKL: simple_pin_fs(), nfs_do_clone_mount()
through nfs_follow_mountpoint(), afs_mntpt_do_automount() through
afs_mntpt_follow_link(). Both later functions are actually the filesystems
follow_link inode operation. vfs_kern_mount() is calling the specified
get_sb function and lets the filesystem do its job by calling the given
fill_super function.Therefore I think it is safe to push down the BKL from the VFS to the
low-level filesystems get_sb/fill_super operation.[arnd: do not add the BKL to those file systems that already
don't use it elsewhere]Signed-off-by: Jan Blunck
Signed-off-by: Arnd Bergmann
Cc: Matthew Wilcox
Cc: Christoph Hellwig
10 Sep, 2010
1 commit
-
Add cgroup_attach_task_all()
The existing cgroup_attach_task_current_cg() API is called by a thread to
attach another thread to all of its cgroups; this is unsuitable for cases
where a privileged task wants to attach itself to the cgroups of a less
privileged one, since the call must be made from the context of the target
task.This patch adds a more generic cgroup_attach_task_all() API that allows
both the source task and to-be-moved task to be specified.
cgroup_attach_task_current_cg() becomes a specialization of the more
generic new function.[menage@google.com: rewrote changelog]
[akpm@linux-foundation.org: address reviewer comments]
Signed-off-by: Michael S. Tsirkin
Tested-by: Alex Williamson
Acked-by: Paul Menage
Cc: Li Zefan
Cc: Ben Blum
Cc: Sridhar Samudrala
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
20 Aug, 2010
1 commit
-
Signed-off-by: Arnd Bergmann
Signed-off-by: Paul E. McKenney
Acked-by: Paul Menage
Cc: Li Zefan
Reviewed-by: Josh Triplett
11 Aug, 2010
1 commit
-
The original code didn't leave enough space for a NULL terminator. These
strings are copied with strcpy() into fixed length buffers in
cgroup_root_from_opts().Signed-off-by: Dan Carpenter
Acked-by: Serge E. Hallyn
Reviewd-by: KAMEZAWA Hiroyuki
Cc: Paul Menage
Cc: Li Zefan
Cc: Ben Blum
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
06 Aug, 2010
1 commit
-
We really shouldn't be asking userspace to create new root filesystems.
So follow along with all of the other in-kernel filesystems, and provide
a mount point in sysfs.For cgroupfs, this should be in /sys/fs/cgroup/ This change provides
that mount point when the cgroup filesystem is registered in the kernel.Acked-by: Paul Menage
Acked-by: Dhaval Giani
Cc: Li Zefan
Cc: Lennart Poettering
Cc: Kay Sievers
Signed-off-by: Greg Kroah-Hartman
05 Aug, 2010
1 commit
-
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1443 commits)
phy/marvell: add 88ec048 support
igb: Program MDICNFG register prior to PHY init
e1000e: correct MAC-PHY interconnect register offset for 82579
hso: Add new product ID
can: Add driver for esd CAN-USB/2 device
l2tp: fix export of header file for userspace
can-raw: Fix skb_orphan_try handling
Revert "net: remove zap_completion_queue"
net: cleanup inclusion
phy/marvell: add 88e1121 interface mode support
u32: negative offset fix
net: Fix a typo from "dev" to "ndev"
igb: Use irq_synchronize per vector when using MSI-X
ixgbevf: fix null pointer dereference due to filter being set for VLAN 0
e1000e: Fix irq_synchronize in MSI-X case
e1000e: register pm_qos request on hardware activation
ip_fragment: fix subtracting PPPOE_SES_HLEN from mtu twice
net: Add getsockopt support for TCP thin-streams
cxgb4: update driver version
cxgb4: add new PCI IDs
...Manually fix up conflicts in:
- drivers/net/e1000e/netdev.c: due to pm_qos registration
infrastructure changes
- drivers/net/phy/marvell.c: conflict between adding 88ec048 support
and cleaning up the IDs
- drivers/net/wireless/ipw2x00/ipw2100.c: trivial ipw2100_pm_qos_req
conflict (registration change vs marking it static)
28 Jul, 2010
1 commit
-
Add a new kernel API to attach a task to current task's cgroup
in all the active hierarchies.Signed-off-by: Sridhar Samudrala
Signed-off-by: Michael S. Tsirkin
Reviewed-by: Paul Menage
Acked-by: Li Zefan
05 Jun, 2010
1 commit
-
Child groups should have a greater depth than their parents. Prior to
this change, the parent would incorrectly report zero memory usage for
child cgroups when use_hierarchy is enabled.test script:
mount -t cgroup none /cgroups -o memory
cd /cgroups
mkdir cg1echo 1 > cg1/memory.use_hierarchy
mkdir cg1/cg11echo $$ > cg1/cg11/tasks
dd if=/dev/zero of=/tmp/foo bs=1M count=1echo
echo CHILD
grep cache cg1/cg11/memory.statecho
echo PARENT
grep cache cg1/memory.statecho $$ > tasks
rmdir cg1/cg11 cg1
cd /
umount /cgroupsUsing fae9c79, a recent patch that changed alloc_css_id() depth computation,
the parent incorrectly reports zero usage:
root@ubuntu:~# ./test
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.0151844 s, 69.1 MB/sCHILD
cache 1048576
total_cache 1048576PARENT
cache 0
total_cache 0With this patch, the parent correctly includes child usage:
root@ubuntu:~# ./test
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.0136827 s, 76.6 MB/sCHILD
cache 1052672
total_cache 1052672PARENT
cache 0
total_cache 1052672Signed-off-by: Greg Thelen
Acked-by: Paul Menage
Acked-by: KAMEZAWA Hiroyuki
Acked-by: Li Zefan
Cc: [2.6.34.x]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
28 May, 2010
1 commit
-
Since we are unable to handle an error returned by
cftype.unregister_event() properly, let's make the callback
void-returning.mem_cgroup_unregister_event() has been rewritten to be a "never fail"
function. On mem_cgroup_usage_register_event() we save old buffer for
thresholds array and reuse it in mem_cgroup_usage_unregister_event() to
avoid allocation.Signed-off-by: Kirill A. Shutemov
Acked-by: KAMEZAWA Hiroyuki
Cc: Phil Carmody
Cc: Balbir Singh
Cc: Daisuke Nishimura
Cc: Paul Menage
Cc: Li Zefan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds