08 Sep, 2014

1 commit

  • blkcg->id is a unique id given to each blkcg; however, the
    cgroup_subsys_state which each blkcg embeds already has ->serial_nr
    which can be used for the same purpose. Drop blkcg->id and replace
    its uses with blkcg->css.serial_nr. Rename cfq_cgroup->blkcg_id to
    ->blkcg_serial_nr and @id in check_blkcg_changed() to @serial_nr for
    consistency.
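
    As a rough userspace sketch of the pattern this relies on: the core
    stamps each object with a monotonically increasing serial number, and
    a cached consumer (cfq's cfq_cgroup here) detects that its group
    changed by comparing serials. All names below are illustrative
    stand-ins, not the kernel structures.

```c
#include <stdint.h>

/* Core-assigned counter, playing the role of css->serial_nr. */
static uint64_t next_serial;

struct group {                  /* stands in for blkcg */
    uint64_t serial_nr;
};

struct consumer {               /* stands in for cfq_cgroup */
    uint64_t cached_serial_nr;  /* like the renamed blkcg_serial_nr */
};

static void group_init(struct group *g)
{
    g->serial_nr = ++next_serial;   /* unique for the object's lifetime */
}

/* Returns 1 and re-associates if the cached group association is stale,
 * mirroring what check_blkcg_changed() does with @serial_nr. */
static int group_changed(struct consumer *c, const struct group *g)
{
    if (c->cached_serial_nr == g->serial_nr)
        return 0;
    c->cached_serial_nr = g->serial_nr;
    return 1;
}
```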

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     

05 Aug, 2014

1 commit

  • Pull cgroup changes from Tejun Heo:
    "Mostly changes to get the v2 interface ready. The core features are
    mostly ready now and I think it's reasonable to expect to drop the
    devel mask in one or two devel cycles at least for a subset of
    controllers.

    - cgroup added a controller dependency mechanism so that block cgroup
    can depend on memory cgroup. This will be used to finally support
    IO provisioning on the writeback traffic, which is currently being
    implemented.

    - The v2 interface now uses a separate table so that the interface
    files for the new interface are explicitly declared in one place.
    Each controller will explicitly review and add the files for the
    new interface.

    - cpuset is getting ready for the hierarchical behavior which is in
    the similar style with other controllers so that an ancestor's
    configuration change doesn't change the descendants' configurations
    irreversibly and processes aren't silently migrated when a CPU or
    node goes down.

    All the changes are to the new interface and no behavior changed for
    the multiple hierarchies"

    * 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (29 commits)
    cpuset: fix the WARN_ON() in update_nodemasks_hier()
    cgroup: initialize cgrp_dfl_root_inhibit_ss_mask from !->dfl_files test
    cgroup: make CFTYPE_ONLY_ON_DFL and CFTYPE_NO_ internal to cgroup core
    cgroup: distinguish the default and legacy hierarchies when handling cftypes
    cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes()
    cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes
    cgroup: split cgroup_base_files[] into cgroup_{dfl|legacy}_base_files[]
    cpuset: export effective masks to userspace
    cpuset: allow writing offlined masks to cpuset.cpus/mems
    cpuset: enable onlined cpu/node in effective masks
    cpuset: refactor cpuset_hotplug_update_tasks()
    cpuset: make cs->{cpus, mems}_allowed as user-configured masks
    cpuset: apply cs->effective_{cpus,mems}
    cpuset: initialize top_cpuset's configured masks at mount
    cpuset: use effective cpumask to build sched domains
    cpuset: inherit ancestor's masks if effective_{cpus, mems} becomes empty
    cpuset: update cs->effective_{cpus, mems} when config changes
    cpuset: update cpuset->effective_{cpus,mems} at hotplug
    cpuset: add cs->effective_cpus and cs->effective_mems
    cgroup: clean up sane_behavior handling
    ...

    Linus Torvalds
     

15 Jul, 2014

2 commits

  • Currently, cftypes added by cgroup_add_cftypes() are used for both the
    unified default hierarchy and legacy ones and subsystems can mark each
    file with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to
    appear only on one of them. This is quite hairy and error-prone.
    Also, we may end up exposing interface files to the default hierarchy
    without thinking it through.

    cgroup_subsys will grow two separate cftype addition functions and
    apply each only on the hierarchies of the matching type. This will
    allow organizing cftypes in a much clearer way and encourage subsystems
    to scrutinize the interface which is being exposed in the new default
    hierarchy.

    In preparation, this patch adds cgroup_add_legacy_cftypes() which
    currently is a simple wrapper around cgroup_add_cftypes() and replaces
    all cgroup_add_cftypes() usages with it.

    While at it, this patch drops a completely spurious return from
    __hugetlb_cgroup_file_init().

    This patch doesn't introduce any functional differences.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Aneesh Kumar K.V

    Tejun Heo
     
  • Currently, cgroup_subsys->base_cftypes is used for both the unified
    default hierarchy and legacy ones and subsystems can mark each file
    with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear
    only on one of them. This is quite hairy and error-prone. Also, we
    may end up exposing interface files to the default hierarchy without
    thinking it through.

    cgroup_subsys will grow two separate cftype arrays and apply each only
    on the hierarchies of the matching type. This will allow organizing
    cftypes in a much clearer way and encourage subsystems to scrutinize
    the interface which is being exposed in the new default hierarchy.

    In preparation, this patch renames cgroup_subsys->base_cftypes to
    cgroup_subsys->legacy_cftypes. This patch is pure rename.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vivek Goyal
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Aristeu Rozanski
    Cc: Aneesh Kumar K.V

    Tejun Heo
     

12 Jul, 2014

1 commit

  • While a queue is being destroyed, all the blkgs are destroyed and its
    ->root_blkg pointer is set to NULL. If someone else starts to drain
    while the queue is in this state, the following oops happens.

    NULL pointer dereference at 0000000000000028
    IP: [] blk_throtl_drain+0x84/0x230
    PGD e4a1067 PUD b773067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Modules linked in: cfq_iosched(-) [last unloaded: cfq_iosched]
    CPU: 1 PID: 537 Comm: bash Not tainted 3.16.0-rc3-work+ #2
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    task: ffff88000e222250 ti: ffff88000efd4000 task.ti: ffff88000efd4000
    RIP: 0010:[] [] blk_throtl_drain+0x84/0x230
    RSP: 0018:ffff88000efd7bf0 EFLAGS: 00010046
    RAX: 0000000000000000 RBX: ffff880015091450 RCX: 0000000000000001
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
    RBP: ffff88000efd7c10 R08: 0000000000000000 R09: 0000000000000001
    R10: ffff88000e222250 R11: 0000000000000000 R12: ffff880015091450
    R13: ffff880015092e00 R14: ffff880015091d70 R15: ffff88001508fc28
    FS: 00007f1332650740(0000) GS:ffff88001fa80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000000000028 CR3: 0000000009446000 CR4: 00000000000006e0
    Stack:
    ffffffff8144e8f6 ffff880015091450 0000000000000000 ffff880015091d80
    ffff88000efd7c28 ffffffff8144ae2f ffff880015091450 ffff88000efd7c58
    ffffffff81427641 ffff880015091450 ffffffff82401f00 ffff880015091450
    Call Trace:
    [] blkcg_drain_queue+0x1f/0x60
    [] __blk_drain_queue+0x71/0x180
    [] blk_queue_bypass_start+0x6e/0xb0
    [] blkcg_deactivate_policy+0x38/0x120
    [] blk_throtl_exit+0x34/0x50
    [] blkcg_exit_queue+0x35/0x40
    [] blk_release_queue+0x26/0xd0
    [] kobject_cleanup+0x38/0x70
    [] kobject_put+0x28/0x60
    [] blk_put_queue+0x15/0x20
    [] scsi_device_dev_release_usercontext+0x16b/0x1c0
    [] execute_in_process_context+0x89/0xa0
    [] scsi_device_dev_release+0x1c/0x20
    [] device_release+0x32/0xa0
    [] kobject_cleanup+0x38/0x70
    [] kobject_put+0x28/0x60
    [] put_device+0x17/0x20
    [] __scsi_remove_device+0xa9/0xe0
    [] scsi_remove_device+0x2b/0x40
    [] sdev_store_delete+0x27/0x30
    [] dev_attr_store+0x18/0x30
    [] sysfs_kf_write+0x3e/0x50
    [] kernfs_fop_write+0xe7/0x170
    [] vfs_write+0xaf/0x1d0
    [] SyS_write+0x4d/0xc0
    [] system_call_fastpath+0x16/0x1b

    776687bce42b ("block, blk-mq: draining can't be skipped even if
    bypass_depth was non-zero") made it easier to trigger this bug by
    making blk_queue_bypass_start() drain even when it loses the first
    bypass test to blk_cleanup_queue(); however, the bug has always been
    there even before the commit as blk_queue_bypass_start() could race
    against queue destruction, win the initial bypass test but perform the
    actual draining after blk_cleanup_queue() already destroyed all blkgs.

    Fix it by skipping the call into policy draining if all the blkgs
    are already gone.
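
    The shape of the fix can be sketched as a guard in the drain entry
    point: bail out once the queue's root_blkg has been cleared by
    destruction. The names below are illustrative stand-ins for
    blkcg_drain_queue() and the policy drain callbacks, not the kernel
    functions themselves.

```c
#include <stddef.h>

struct queue {
    void *root_blkg;        /* NULL once all blkgs are destroyed */
};

static int drains_performed;

/* Would walk the blkgs; in the kernel this oopses if they're gone. */
static void policy_drain(struct queue *q)
{
    (void)q;
    drains_performed++;
}

static void drain_queue(struct queue *q)
{
    /* The fix: skip calling into policy draining once blkgs are gone. */
    if (!q->root_blkg)
        return;
    policy_drain(q);
}
```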

    Signed-off-by: Tejun Heo
    Reported-by: Shirish Pargaonkar
    Reported-by: Sasha Levin
    Reported-by: Jet Chen
    Cc: stable@vger.kernel.org
    Tested-by: Shirish Pargaonkar
    Signed-off-by: Jens Axboe

    Tejun Heo
     

09 Jul, 2014

1 commit

  • Currently, the blkio subsystem attributes all of writeback IOs to the
    root. One of the issues is that there's no way to tell who originated
    a writeback IO from block layer. Those IOs are usually issued
    asynchronously from a task which didn't have anything to do with
    actually generating the dirty pages. The memory subsystem, when
    enabled, already keeps track of the ownership of each dirty page and
    it's desirable for blkio to piggyback instead of adding its own
    per-page tag.

    cgroup now has a mechanism to express such dependency -
    cgroup_subsys->depends_on. This patch declares that blkcg depends on
    memcg so that memcg is enabled automatically on the default hierarchy
    when available. Future changes will make blkcg map the memcg tag to
    find out the cgroup to blame for writeback IOs.

    As this means that a memcg may be made invisible, this patch also
    implements css_reset() for memcg which resets its basic
    configurations. This implementation will probably need to be expanded
    to cover other states which are used in the default hierarchy.

    v2: blkcg's dependency on memcg is wrapped with CONFIG_MEMCG to avoid
    build failure. Reported by kbuild test robot.
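
    The dependency mechanism can be pictured as a mask resolved to a
    fixed point: enabling a controller implicitly enables everything in
    the transitive closure of its ->depends_on masks. The subsystems,
    bit assignments and resolve helper below are hypothetical; only the
    blkcg-depends-on-memcg edge comes from this commit.

```c
#include <stdint.h>

enum {
    MEMCG = 1u << 0,
    BLKCG = 1u << 1,
    FOO   = 1u << 2,    /* hypothetical controller depending on blkcg */
};

/* depends_on[i] is the dependency mask of the controller using bit i. */
static const uint32_t depends_on[] = {
    0,          /* memcg: no dependencies */
    MEMCG,      /* blkcg depends on memcg, as declared in this commit */
    BLKCG,      /* foo: transitively pulls in memcg too */
};

/* Expand an enabled-mask until it contains all transitive dependencies. */
static uint32_t resolve_dependencies(uint32_t enabled)
{
    uint32_t prev;

    do {
        prev = enabled;
        for (int i = 0; i < 3; i++)
            if (enabled & (1u << i))
                enabled |= depends_on[i];
    } while (enabled != prev);
    return enabled;
}
```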

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vivek Goyal
    Cc: Jens Axboe

    Tejun Heo
     

23 Jun, 2014

2 commits

  • This reverts commit a2d445d440003f2d70ee4cd4970ea82ace616fee.

    The original commit is buggy, we do use the registration functions
    at runtime for modular builds.

    Jens Axboe
     
  • Hello,

    So, this patch should do. Joe, Vivek, can one of you guys please
    verify that the oops goes away with this patch?

    Jens, the original thread can be read at

    http://thread.gmane.org/gmane.linux.kernel/1720729

    The fix converts blkg->refcnt from int to atomic_t. It adds some
    overhead, but it should be minute compared to everything else which
    is going on and the involved cacheline bouncing, so I think it's
    highly unlikely to cause any noticeable difference. Also, the refcnt
    in question should be converted to a percpu_ref for blk-mq anyway,
    so the atomic_t is likely to go away pretty soon.

    Thanks.

    ------- 8< -------
    __blkg_release_rcu() may be invoked after the associated request_queue
    is released, with an RCU grace period in between. As such, the function
    and callbacks invoked from it must not dereference the associated
    request_queue. This is clearly indicated in the comment above the
    function.

    Unfortunately, while trying to fix a different issue, 2a4fd070ee85
    ("blkcg: move bulk of blkcg_gq release operations to the RCU
    callback") ignored this and added [un]locking of @blkg->q->queue_lock
    to __blkg_release_rcu(). This of course can cause oops as the
    request_queue may be long gone by the time this code gets executed.

    general protection fault: 0000 [#1] SMP
    CPU: 21 PID: 30 Comm: rcuos/21 Not tainted 3.15.0 #1
    Hardware name: Stratus ftServer 6400/G7LAZ, BIOS BIOS Version 6.3:57 12/25/2013
    task: ffff880854021de0 ti: ffff88085403c000 task.ti: ffff88085403c000
    RIP: 0010:[] [] _raw_spin_lock_irq+0x15/0x60
    RSP: 0018:ffff88085403fdf0 EFLAGS: 00010086
    RAX: 0000000000020000 RBX: 0000000000000010 RCX: 0000000000000000
    RDX: 000060ef80008248 RSI: 0000000000000286 RDI: 6b6b6b6b6b6b6b6b
    RBP: ffff88085403fdf0 R08: 0000000000000286 R09: 0000000000009f39
    R10: 0000000000020001 R11: 0000000000020001 R12: ffff88103c17a130
    R13: ffff88103c17a080 R14: 0000000000000000 R15: 0000000000000000
    FS: 0000000000000000(0000) GS:ffff88107fca0000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000006e5ab8 CR3: 000000000193d000 CR4: 00000000000407e0
    Stack:
    ffff88085403fe18 ffffffff812cbfc2 ffff88103c17a130 0000000000000000
    ffff88103c17a130 ffff88085403fec0 ffffffff810d1d28 ffff880854021de0
    ffff880854021de0 ffff88107fcaec58 ffff88085403fe80 ffff88107fcaec30
    Call Trace:
    [] __blkg_release_rcu+0x72/0x150
    [] rcu_nocb_kthread+0x1e8/0x300
    [] kthread+0xe1/0x100
    [] ret_from_fork+0x7c/0xb0
    Code: ff 47 04 48 8b 7d 08 be 00 02 00 00 e8 55 48 a4 ff 5d c3 0f 1f 00 66 66 66 66 90 55 48 89 e5
    +fa 66 66 90 66 66 90 b8 00 00 02 00 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f
    +b7
    RIP [] _raw_spin_lock_irq+0x15/0x60
    RSP

    The request_queue locking was added because blkcg_gq->refcnt is an int
    protected with the queue lock and __blkg_release_rcu() needs to put
    the parent. Let's fix it by making blkcg_gq->refcnt an atomic_t and
    dropping queue locking in the function.

    Given the general heavy weight of the current request_queue and blkcg
    operations, this is unlikely to cause any noticeable overhead.
    Moreover, blkcg_gq->refcnt is likely to be converted to percpu_ref in
    the near future, so whatever (most likely negligible) overhead it may
    add is temporary.
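
    A userspace analogue of the int-to-atomic_t conversion, using C11
    atomics: once the decrement itself is atomic, put() no longer needs
    the queue lock, and only the thread that drops the last reference
    runs the release path. Structure and function names are illustrative,
    not the kernel's.

```c
#include <stdatomic.h>

struct blkg_like {
    atomic_int refcnt;
    int released;
};

static void blkg_like_get(struct blkg_like *g)
{
    atomic_fetch_add(&g->refcnt, 1);
}

static void blkg_like_put(struct blkg_like *g)
{
    /* fetch_sub returns the old value: 1 means this was the last ref,
     * so run the release path (the kernel would go through RCU here).
     * No external lock is needed for the count to stay correct. */
    if (atomic_fetch_sub(&g->refcnt, 1) == 1)
        g->released = 1;
}
```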

    Signed-off-by: Tejun Heo
    Reported-by: Joe Lawrence
    Acked-by: Vivek Goyal
    Link: http://lkml.kernel.org/g/alpine.DEB.2.02.1406081816540.17948@jlaw-desktop.mno.stratus.com
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe

    Tejun Heo
     

11 Jun, 2014

2 commits

  • Pull block layer fixes from Jens Axboe:
    "Final small batch of fixes to be included before -rc1. Some general
    cleanups in here as well, but some of the blk-mq fixes we need for the
    NVMe conversion and/or scsi-mq. The pull request contains:

    - Support for not merging across a specified "chunk size", if set by
    the driver. Some NVMe devices perform poorly for IO that crosses
    such a chunk, so we need to support it generically as part of
    request merging to avoid having to do complicated split logic. From
    me.

    - Bump max tag depth to 10Ki tags. Some scsi devices have a huge
    shared tag space. Previously we failed with EINVAL if too large a
    tag depth was specified; now we truncate it and pass back the actual
    value. From me.

    - Various blk-mq rq init fixes from me and others.

    - A fix for enter on a dying queue for blk-mq from Keith. This is
    needed to prevent oopsing on hot device removal.

    - Fixup for blk-mq timer addition from Ming Lei.

    - Small round of performance fixes for mtip32xx from Sam Bradshaw.

    - Minor stack leak fix from Rickard Strandqvist.

    - Two __init annotations from Fabian Frederick"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    block: add __init to blkcg_policy_register
    block: add __init to elv_register
    block: ensure that bio_add_page() always accepts a page for an empty bio
    blk-mq: add timer in blk_mq_start_request
    blk-mq: always initialize request->start_time
    block: blk-exec.c: Cleaning up local variable address returnd
    mtip32xx: minor performance enhancements
    blk-mq: ->timeout should be cleared in blk_mq_rq_ctx_init()
    blk-mq: don't allow queue entering for a dying queue
    blk-mq: bump max tag depth to 10K tags
    block: add blk_rq_set_block_pc()
    block: add notion of a chunk size for request merging

    Linus Torvalds
     
  • blkcg_policy_register is only called by
    __init functions:

    __init cfq_init
    __init throtl_init

    Cc: Andrew Morton
    Signed-off-by: Fabian Frederick
    Signed-off-by: Jens Axboe

    Fabian Frederick
     

14 May, 2014

1 commit

  • Unlike the more usual refcnting, what css_tryget() provides is the
    distinction between online and offline csses instead of protection
    against upping a refcnt which already reached zero. cgroup is
    planning to provide actual tryget which fails if the refcnt already
    reached zero. Let's rename the existing trygets so that they clearly
    indicate that they're about onliness.

    I thought about keeping the existing names as-are and introducing new
    names for the planned actual tryget; however, given that each
    controller participates in the synchronization of the online state, it
    seems worthwhile to make it explicit that these functions are about
    on/offline state.

    Rename css_tryget() to css_tryget_online() and css_tryget_from_dir()
    to css_tryget_online_from_dir(). This is pure rename.

    v2: cgroup_freezer grew new usages of css_tryget(). Update
    accordingly.
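
    The distinction the rename makes explicit can be sketched with C11
    atomics: a "real" tryget fails only once the refcount has reached
    zero, while the renamed css_tryget_online() additionally fails once
    the object has been taken offline. This is a userspace analogue with
    illustrative names, not the kernel's percpu_ref machinery.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct css_like {
    atomic_int refcnt;
    bool online;
};

/* Increment-unless-zero: the planned "actual" tryget. */
static bool tryget(struct css_like *css)
{
    int old = atomic_load(&css->refcnt);

    do {
        if (old == 0)
            return false;       /* already dead; can't resurrect */
    } while (!atomic_compare_exchange_weak(&css->refcnt, &old, old + 1));
    return true;
}

/* What css_tryget_online() provides: also fail for offline objects. */
static bool tryget_online(struct css_like *css)
{
    if (!css->online)
        return false;
    return tryget(css);
}
```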

    Signed-off-by: Tejun Heo
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Li Zefan
    Cc: Vivek Goyal
    Cc: Jens Axboe
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo

    Tejun Heo
     

06 May, 2014

1 commit

  • During the recent conversion of cgroup to kernfs, cgroup_tree_mutex
    which nests above both the kernfs s_active protection and cgroup_mutex
    is added to synchronize cgroup file type operations as cgroup_mutex
    needed to be grabbed from some file operations and thus can't be put
    above s_active protection.

    While this arrangement mostly worked for cgroup, this triggered the
    following lockdep warning.

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    3.15.0-rc3-next-20140430-sasha-00016-g4e281fa-dirty #429 Tainted: G W
    -------------------------------------------------------
    trinity-c173/9024 is trying to acquire lock:
    (blkcg_pol_mutex){+.+.+.}, at: blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)

    but task is already holding lock:
    (s_active#89){++++.+}, at: kernfs_fop_write (fs/kernfs/file.c:283)

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #2 (s_active#89){++++.+}:
    lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
    __kernfs_remove (arch/x86/include/asm/atomic.h:27 fs/kernfs/dir.c:352 fs/kernfs/dir.c:1024)
    kernfs_remove_by_name_ns (fs/kernfs/dir.c:1219)
    cgroup_addrm_files (include/linux/kernfs.h:427 kernel/cgroup.c:1074 kernel/cgroup.c:2899)
    cgroup_clear_dir (kernel/cgroup.c:1092 (discriminator 2))
    rebind_subsystems (kernel/cgroup.c:1144)
    cgroup_setup_root (kernel/cgroup.c:1568)
    cgroup_mount (kernel/cgroup.c:1716)
    mount_fs (fs/super.c:1094)
    vfs_kern_mount (fs/namespace.c:899)
    do_mount (fs/namespace.c:2238 fs/namespace.c:2561)
    SyS_mount (fs/namespace.c:2758 fs/namespace.c:2729)
    tracesys (arch/x86/kernel/entry_64.S:746)

    -> #1 (cgroup_tree_mutex){+.+.+.}:
    lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
    mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
    cgroup_add_cftypes (include/linux/list.h:76 kernel/cgroup.c:3040)
    blkcg_policy_register (block/blk-cgroup.c:1106)
    throtl_init (block/blk-throttle.c:1694)
    do_one_initcall (init/main.c:789)
    kernel_init_freeable (init/main.c:854 init/main.c:863 init/main.c:882 init/main.c:1003)
    kernel_init (init/main.c:935)
    ret_from_fork (arch/x86/kernel/entry_64.S:552)

    -> #0 (blkcg_pol_mutex){+.+.+.}:
    __lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182)
    lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
    mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
    blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)
    cgroup_file_write (kernel/cgroup.c:2714)
    kernfs_fop_write (fs/kernfs/file.c:295)
    vfs_write (fs/read_write.c:532)
    SyS_write (fs/read_write.c:584 fs/read_write.c:576)
    tracesys (arch/x86/kernel/entry_64.S:746)

    other info that might help us debug this:

    Chain exists of:
    blkcg_pol_mutex --> cgroup_tree_mutex --> s_active#89

    Possible unsafe locking scenario:

    CPU0                              CPU1
    ----                              ----
    lock(s_active#89);
                                      lock(cgroup_tree_mutex);
                                      lock(s_active#89);
    lock(blkcg_pol_mutex);

    *** DEADLOCK ***

    4 locks held by trinity-c173/9024:
    #0: (&f->f_pos_lock){+.+.+.}, at: __fdget_pos (fs/file.c:714)
    #1: (sb_writers#18){.+.+.+}, at: vfs_write (include/linux/fs.h:2255 fs/read_write.c:530)
    #2: (&of->mutex){+.+.+.}, at: kernfs_fop_write (fs/kernfs/file.c:283)
    #3: (s_active#89){++++.+}, at: kernfs_fop_write (fs/kernfs/file.c:283)

    stack backtrace:
    CPU: 3 PID: 9024 Comm: trinity-c173 Tainted: G W 3.15.0-rc3-next-20140430-sasha-00016-g4e281fa-dirty #429
    ffffffff919687b0 ffff8805f6373bb8 ffffffff8e52cdbb 0000000000000002
    ffffffff919d8400 ffff8805f6373c08 ffffffff8e51fb88 0000000000000004
    ffff8805f6373c98 ffff8805f6373c08 ffff88061be70d98 ffff88061be70dd0
    Call Trace:
    dump_stack (lib/dump_stack.c:52)
    print_circular_bug (kernel/locking/lockdep.c:1216)
    __lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182)
    lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
    mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
    blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)
    cgroup_file_write (kernel/cgroup.c:2714)
    kernfs_fop_write (fs/kernfs/file.c:295)
    vfs_write (fs/read_write.c:532)
    SyS_write (fs/read_write.c:584 fs/read_write.c:576)

    This is a highly unlikely but valid circular dependency between "echo
    1 > blkcg.reset_stats" and cfq module [un]loading. cgroup is going
    through further locking update which will remove this complication but
    for now let's use trylock on blkcg_pol_mutex and retry the file
    operation if the trylock fails.
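
    The trylock-and-retry workaround has this shape: attempt the outer
    mutex without blocking and, on contention, back out and restart the
    whole file operation rather than sleep and complete the circular
    chain. Below is a pthreads sketch; -EAGAIN stands in for the
    kernel's restart convention and the names are illustrative.

```c
#include <errno.h>
#include <pthread.h>

static pthread_mutex_t pol_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Plays the role of blkcg_reset_stats(): must not block on pol_mutex. */
static int reset_stats(void)
{
    if (pthread_mutex_trylock(&pol_mutex))
        return -EAGAIN;         /* contended: caller should retry */

    /* ... reset per-policy stats under the lock ... */

    pthread_mutex_unlock(&pol_mutex);
    return 0;
}

/* The file-operation caller simply retries on trylock failure. */
static int file_write(void)
{
    int ret;

    do {
        ret = reset_stats();
    } while (ret == -EAGAIN);
    return ret;
}
```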

    Signed-off-by: Tejun Heo
    Reported-by: Sasha Levin
    References: http://lkml.kernel.org/g/5363C04B.4010400@oracle.com

    Tejun Heo
     

04 Apr, 2014

1 commit

  • Pull cgroup updates from Tejun Heo:
    "A lot of updates for cgroup:

    - The biggest one is cgroup's conversion to kernfs. cgroup took
    after the long abandoned vfs-entangled sysfs implementation and
    made it even more convoluted over time. cgroup's internal objects
    were fused with vfs objects which also brought in vfs locking and
    object lifetime rules. Naturally, there are places where vfs rules
    don't fit and nasty hacks, such as credential switching or lock
    dance interleaving inode mutex and cgroup_mutex with object serial
    number comparison thrown in to decide whether the operation is
    actually necessary, needed to be employed.

    After conversion to kernfs, internal object lifetime and locking
    rules are mostly isolated from vfs interactions allowing shedding
    of several nasty hacks and overall simplification. This will also
    allow implementation of operations which may affect multiple cgroups,
    which weren't possible before as they would have required nesting
    i_mutexes.

    - Various simplifications including dropping of module support,
    easier cgroup name/path handling, simplified cgroup file type
    handling and task_cg_lists optimization.

    - Preparatory changes for the planned unified hierarchy, which is still
    a patchset away from being actually operational. The dummy
    hierarchy is updated to serve as the default unified hierarchy.
    Controllers which aren't claimed by other hierarchies are
    associated with it, which BTW was what the dummy hierarchy was for
    anyway.

    - Various fixes from Li and others. This pull request includes some
    patches to add missing slab.h includes to various subsystems. This
    was triggered by the removal of the xattr.h include from cgroup.h.
    cgroup.h indirectly got included into a lot of files, which brought
    in xattr.h, which brought in slab.h.

    There are several merge commits - one to pull in kernfs updates
    necessary for converting cgroup (already in upstream through
    driver-core), others for interfering changes in the fixes branch"

    * 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (74 commits)
    cgroup: remove useless argument from cgroup_exit()
    cgroup: fix spurious lockdep warning in cgroup_exit()
    cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c
    cgroup: break kernfs active_ref protection in cgroup directory operations
    cgroup: fix cgroup_taskset walking order
    cgroup: implement CFTYPE_ONLY_ON_DFL
    cgroup: make cgrp_dfl_root mountable
    cgroup: drop const from @buffer of cftype->write_string()
    cgroup: rename cgroup_dummy_root and related names
    cgroup: move ->subsys_mask from cgroupfs_root to cgroup
    cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding
    cgroup: remove NULL checks from [pr_cont_]cgroup_{name|path}()
    cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root
    cgroup: reorganize cgroup bootstrapping
    cgroup: relocate setting of CGRP_DEAD
    cpuset: use rcu_read_lock() to protect task_cs()
    cgroup_freezer: document freezer_fork() subtleties
    cgroup: update cgroup_transfer_tasks() to either succeed or fail
    cgroup: drop task_lock() protection around task->cgroups
    cgroup: update how a newly forked task gets associated with css_set
    ...

    Linus Torvalds
     

19 Feb, 2014

1 commit

  • (Trivial patch.)

    If the code is looking at the RCU-protected pointer itself, but not
    dereferencing it, the rcu_dereference() functions can be downgraded to
    rcu_access_pointer(). This commit makes this downgrade in blkg_destroy()
    and ioc_destroy_icq(), both of which simply compare the RCU-protected
    pointer against another pointer with no dereferencing.
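
    The difference can be modeled in userspace with C11 memory orders:
    rcu_dereference() implies the ordering needed to safely dereference
    the result, while rcu_access_pointer() is a bare load that is valid
    only for comparing the pointer value. The globals and helpers below
    are illustrative analogues, not the kernel macros.

```c
#include <stdatomic.h>
#include <stddef.h>

static _Atomic(int *) protected_ptr;    /* stands in for an __rcu pointer */

/* rcu_dereference()-like: ordered load; result may be dereferenced. */
static int *deref(void)
{
    return atomic_load_explicit(&protected_ptr, memory_order_acquire);
}

/* rcu_access_pointer()-like: relaxed load; compare only, never deref. */
static int *access_ptr(void)
{
    return atomic_load_explicit(&protected_ptr, memory_order_relaxed);
}

/* blkg_destroy()-style check: compare against another pointer. */
static int is_current(int *candidate)
{
    return access_ptr() == candidate;
}
```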

    Signed-off-by: Paul E. McKenney
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Paul E. McKenney
     

13 Feb, 2014

1 commit

  • If !NULL, @skip_css makes cgroup_taskset_for_each() skip the matching
    css. The intention of the interface is to make it easy to skip css's
    (cgroup_subsys_states) which already match the migration target;
    however, this is entirely unnecessary as migration taskset doesn't
    include tasks which are already in the target cgroup. Drop @skip_css
    from cgroup_taskset_for_each().

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Daniel Borkmann

    Tejun Heo
     

08 Feb, 2014

2 commits

  • cgroup_subsys is a bit messier than it needs to be.

    * The name of a subsys can be different from its internal identifier
    defined in cgroup_subsys.h. Most subsystems use the matching name
    but three - cpu, memory and perf_event - use different ones.

    * cgroup_subsys_id enums are postfixed with _subsys_id and each
    cgroup_subsys is postfixed with _subsys. cgroup.h is widely
    included throughout various subsystems; it doesn't and shouldn't
    have claim on such generic names which don't have any qualifier
    indicating that they belong to cgroup.

    * cgroup_subsys->subsys_id should always equal the matching
    cgroup_subsys_id enum; however, we require each controller to
    initialize it and then BUG if they don't match, which is a bit
    silly.

    This patch cleans up cgroup_subsys names and initialization by doing
    the following.

    * cgroup_subsys_id enums are now postfixed with _cgrp_id, and each
    cgroup_subsys with _cgrp_subsys.

    * With the above, renaming subsys identifiers to match the userland
    visible names doesn't cause any naming conflicts. All non-matching
    identifiers are renamed to match the official names.

    cpu_cgroup -> cpu
    mem_cgroup -> memory
    perf -> perf_event

    * controllers no longer need to initialize ->subsys_id and ->name.
    They're generated in cgroup core and set automatically during boot.

    * Redundant cgroup_subsys declarations removed.

    * While updating BUG_ON()s in cgroup_init_early(), convert them to
    WARN()s. BUGging that early during boot is stupid - the kernel
    can't print anything, even through serial console and the trap
    handler doesn't even link stack frame properly for back-tracing.

    This patch doesn't introduce any behavior changes.
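
    The "generated in cgroup core" part works like an X-macro list, the
    same trick cgroup_subsys.h plays with its SUBSYS() entries: one list
    expands into both the _cgrp_id enum and the name table, so the two
    can never drift apart. The subsystem list and macro names below are
    a reduced illustration, not the actual kernel header.

```c
/* One list of subsystems, expanded twice. */
#define SUBSYS_LIST(x) x(cpu) x(memory) x(perf_event)

/* Expansion 1: cpu_cgrp_id, memory_cgrp_id, ... in list order. */
#define GEN_ID(name) name##_cgrp_id,
enum cgroup_subsys_id { SUBSYS_LIST(GEN_ID) CGROUP_SUBSYS_COUNT };
#undef GEN_ID

/* Expansion 2: the userland-visible name table, in the same order. */
#define GEN_NAME(name) #name,
static const char *cgroup_subsys_name[] = { SUBSYS_LIST(GEN_NAME) };
#undef GEN_NAME
```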

    v2: Rebased on top of fe1217c4f3f7 ("net: net_cls: move cgroupfs
    classid handling into core").

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: "David S. Miller"
    Acked-by: "Rafael J. Wysocki"
    Acked-by: Michal Hocko
    Acked-by: Peter Zijlstra
    Acked-by: Aristeu Rozanski
    Acked-by: Ingo Molnar
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Serge E. Hallyn
    Cc: Vivek Goyal
    Cc: Thomas Graf

    Tejun Heo
     
  • With module support dropped from net_prio, no controller is using
    cgroup module support. None of the actual resource controllers can be
    built as a module and we aren't gonna add new controllers which don't
    control resources. This patch drops module support from cgroup.

    * cgroup_[un]load_subsys() and cgroup_subsys->module removed.

    * As there's no point in distinguishing IS_BUILTIN() and IS_MODULE(),
    cgroup_subsys.h now uses IS_ENABLED() directly.

    * enum cgroup_subsys_id now exactly matches the list of enabled
    controllers as ordered in cgroup_subsys.h.

    * cgroup_subsys[] is now a contiguously occupied array. Size
    specification is no longer necessary and dropped.

    * for_each_builtin_subsys() is removed and for_each_subsys() is
    updated to not require any locking.

    * module ref handling is removed from rebind_subsystems().

    * Module related comments dropped.

    v2: Rebased on top of fe1217c4f3f7 ("net: net_cls: move cgroupfs
    classid handling into core").

    v3: Added {} around the if (need_forkexit_callback) block in
    cgroup_post_fork() for readability as suggested by Li.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     

23 Sep, 2013

1 commit

  • Pull block IO fixes from Jens Axboe:
    "After merge window, no new stuff this time only a collection of neatly
    confined and simple fixes"

    * 'for-3.12/core' of git://git.kernel.dk/linux-block:
    cfq: explicitly use 64bit divide operation for 64bit arguments
    block: Add nr_bios to block_rq_remap tracepoint
    If the queue is dying then we only call the rq->end_io callout. This leaves bios setup on the request, because the caller assumes when the blk_execute_rq_nowait/blk_execute_rq call has completed that the rq->bios have been cleaned up.
    bio-integrity: Fix use of bs->bio_integrity_pool after free
    blkcg: relocate root_blkg setting and clearing
    block: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node(...)
    block: trace all devices plug operation

    Linus Torvalds
     

12 Sep, 2013

1 commit

  • Hello, Jens.

    The original thread can be read from

    http://thread.gmane.org/gmane.linux.kernel.cgroups/8937

    While it leads to an oops, it only triggers under specific
    configurations which aren't common, so I don't think it's necessary
    to backport it through -stable; merging it during the coming merge
    window should be enough.

    Thanks!

    ----- 8< -----
    Currently, q->root_blkg and q->root_rl.blkg are set from
    blkcg_activate_policy() and cleared from blkg_destroy_all(). This
    doesn't necessarily coincide with the lifetime of the root blkcg_gq,
    leading to the following oops when blkcg is enabled but no policy is
    activated, because __blk_queue_next_rl() malfunctions while expecting
    the root_blkg pointers to be set.

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] __wake_up_common+0x2b/0x90
    PGD 60f7a9067 PUD 60f4c9067 PMD 0
    Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
    gsmi: Log Shutdown Reason 0x03
    Modules linked in: act_mirred cls_tcindex cls_prioshift sch_dsmark xt_multiport iptable_mangle sata_mv elephant elephant_dev_num cdc_acm uhci_hcd ehci_hcd i2c_d
    CPU: 9 PID: 41382 Comm: iSCSI-write- Not tainted 3.11.0-dbg-DEV #19
    Hardware name: Intel XXX
    task: ffff88060d16eec0 ti: ffff88060d170000 task.ti: ffff88060d170000
    RIP: 0010:[] [] __wake_up_common+0x2b/0x90
    RSP: 0000:ffff88060d171818 EFLAGS: 00010096
    RAX: 0000000000000082 RBX: ffff880baa3dee60 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffff880baa3dee60
    RBP: ffff88060d171858 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000002 R12: ffff880baa3dee98
    R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000003
    FS: 00007f977cba6700(0000) GS:ffff880c79c60000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000000000000 CR3: 000000060f7a5000 CR4: 00000000000007e0
    Stack:
    0000000000000082 0000000000000000 ffff88060d171858 ffff880baa3dee60
    0000000000000082 0000000000000003 0000000000000000 0000000000000000
    ffff88060d171898 ffffffff810c7848 ffff88060d171888 ffff880bde4bc4b8
    Call Trace:
    [] __wake_up+0x48/0x70
    [] __blk_drain_queue+0x123/0x190
    [] blk_cleanup_queue+0xf5/0x210
    [] __scsi_remove_device+0x5a/0xd0
    [] scsi_remove_device+0x34/0x50
    [] scsi_remove_target+0x16b/0x220
    [] __iscsi_unbind_session+0xd1/0x1b0
    [] iscsi_remove_session+0xe2/0x1c0
    [] iscsi_destroy_session+0x16/0x60
    [] iscsi_session_teardown+0xd9/0x100
    [] iscsi_sw_tcp_session_destroy+0x5a/0xb0
    [] iscsi_if_rx+0x10e8/0x1560
    [] netlink_unicast+0x145/0x200
    [] netlink_sendmsg+0x303/0x410
    [] sock_sendmsg+0xa6/0xd0
    [] ___sys_sendmsg+0x38c/0x3a0
    [] ? fget_light+0x40/0x160
    [] ? fget_light+0x99/0x160
    [] ? fget_light+0x40/0x160
    [] __sys_sendmsg+0x49/0x90
    [] SyS_sendmsg+0x12/0x20
    [] system_call_fastpath+0x16/0x1b
    Code: 66 66 66 66 90 55 48 89 e5 41 57 41 89 f7 41 56 41 89 ce 41 55 41 54 4c 8d 67 38 53 48 83 ec 18 89 55 c4 48 8b 57 38 4c 89 45 c8 8b 2a 48 8d 42 e8 49

    Fix it by moving q->root_blkg and q->root_rl.blkg setting to
    blkg_create() and clearing to blkg_destroy() so that they are
    initialized when a root blkg is created and cleared when destroyed.

    Reported-and-tested-by: Anatol Pomozov
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

09 Aug, 2013

6 commits

  • Previously, all css descendant iterators didn't include the origin
    (root of subtree) css in the iteration. The reasons were maintaining
    consistency with css_for_each_child() and that at the time of
    introduction more use cases needed skipping the origin anyway;
    however, given that css_is_descendant() considers self to be a
    descendant, omitting the origin css has become more confusing and
    looking at the accumulated use cases rather clearly indicates that
    including origin would result in simpler code overall.

    While this is a change which can easily lead to subtle bugs, the
    cgroup API, including the iterators, has recently gone through major
    restructuring, and no out-of-tree changes will be applicable without
    adjustments, making this a relatively acceptable opportunity for this
    type of change.

    The conversions are mostly straight-forward. If the iteration block
    had explicit origin handling before or after, it's moved inside the
    iteration. If not, if (pos == origin) continue; is added. Some
    conversions add extra reference get/put around origin handling by
    consolidating origin handling and the rest. While the extra ref
    operations aren't strictly necessary, this shouldn't cause any
    noticeable difference.

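    The new convention and the documented conversion can be illustrated
    with a small userspace pre-order walk; types and names here are
    invented for the sketch:

```c
/* Sketch of the behavior change: a pre-order descendant walk that now
 * yields the origin (subtree root) first.  Callers wanting the old
 * semantics skip it explicitly, analogous to the documented
 * "if (pos == origin) continue;" conversion. */

#define MAX_CHILDREN 4

struct node {
    int id;
    int nr_children;
    struct node *children[MAX_CHILDREN];
};

/* Pre-order walk that includes @origin itself, recording visited ids. */
static int walk_pre(struct node *origin, int *out, int n)
{
    out[n++] = origin->id;              /* origin is visited first */
    for (int i = 0; i < origin->nr_children; i++)
        n = walk_pre(origin->children[i], out, n);
    return n;
}

/* Old-semantics caller: same walk, origin skipped in the loop body. */
static int sum_descendants_only(struct node *origin)
{
    int ids[16];
    int n = walk_pre(origin, ids, 0);
    int sum = 0;

    for (int i = 0; i < n; i++) {
        if (ids[i] == origin->id)
            continue;                   /* skip origin explicitly */
        sum += ids[i];
    }
    return sum;
}

static int demo_origin_included(void)
{
    struct node c1 = { .id = 2 }, c2 = { .id = 3 };
    struct node root = { .id = 1, .nr_children = 2,
                         .children = { &c1, &c2 } };
    int ids[16];
    int n = walk_pre(&root, ids, 0);

    return n == 3 && ids[0] == 1;       /* origin comes first */
}

static int demo_skip_origin(void)
{
    struct node c1 = { .id = 2 }, c2 = { .id = 3 };
    struct node root = { .id = 1, .nr_children = 2,
                         .children = { &c1, &c2 } };

    return sum_descendants_only(&root); /* only the two children */
}
```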
    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Michal Hocko
    Cc: Jens Axboe
    Cc: Matt Helsley
    Cc: Johannes Weiner
    Cc: Balbir Singh

    Tejun Heo
     
  • cgroup is in the process of converting to css (cgroup_subsys_state)
    from cgroup as the principal subsystem interface handle. This is
    mostly to prepare for the unified hierarchy support where css's will
    be created and destroyed dynamically but also helps cleaning up
    subsystem implementations as css is usually what they are interested
    in anyway.

    cgroup_taskset which is used by the subsystem attach methods is the
    last cgroup subsystem API which isn't using css as the handle. Update
    cgroup_taskset_cur_cgroup() to cgroup_taskset_cur_css() and
    cgroup_taskset_for_each() to take @skip_css instead of @skip_cgrp.

    The conversions are pretty mechanical. One exception is
    cpuset::cgroup_cs(), which lost its last user and got removed.

    This patch shouldn't introduce any functional changes.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Daniel Wagner
    Cc: Ingo Molnar
    Cc: Matt Helsley
    Cc: Steven Rostedt

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using css
    (cgroup_subsys_state) as the primary handle instead of cgroup in
    subsystem API. For hierarchy iterators, this is beneficial because

    * In most cases, css is the only thing subsystems care about anyway.

    * On the planned unified hierarchy, iterations for different
    subsystems will need to skip over different subtrees of the
    hierarchy depending on which subsystems are enabled on each cgroup.
    Passing around css makes it unnecessary to explicitly specify the
    subsystem in question, as a css is the intersection between a cgroup
    and a subsystem.

    * For the planned unified hierarchy, css's would need to be created
    and destroyed dynamically independent from cgroup hierarchy. Having
    cgroup core manage css iteration makes enforcing deref rules a lot
    easier.

    Most subsystem conversions are straight-forward. Noteworthy changes
    are

    * blkio: cgroup_to_blkcg() is no longer used. Removed.

    * freezer: cgroup_freezer() is no longer used. Removed.

    * devices: cgroup_to_devcgroup() is no longer used. Removed.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using struct
    cgroup_subsys_state * as the primary handle instead of struct cgroup.
    Please see the previous commit which converts the subsystem methods
    for rationale.

    This patch converts all cftype file operations to take @css instead of
    @cgroup. cftypes for the cgroup core files don't have their subsystem
    pointer set. These will automatically use the dummy_css added by the
    previous patch and can be converted the same way.

    Most subsystem conversions are straightforward, but there are some
    interesting ones.

    * freezer: update_if_frozen() is also converted to take @css instead
    of @cgroup for consistency. This will make the code look simpler
    too once iterators are converted to use css.

    * memory/vmpressure: mem_cgroup_from_css() needs to be exported to
    vmpressure while mem_cgroup_from_cont() can be made static.
    Updated accordingly.

    * cpu: cgroup_tg() doesn't have any user left. Removed.

    * cpuacct: cgroup_ca() doesn't have any user left. Removed.

    * hugetlb: hugetlb_cgroup_from_cgroup() doesn't have any user left.
    Removed.

    * net_cls: cgrp_cls_state() doesn't have any user left. Removed.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Daniel Wagner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe
    Cc: Steven Rostedt

    Tejun Heo
     
  • cgroup is transitioning to using css (cgroup_subsys_state) instead of
    cgroup as the primary subsystem handle. The cgroupfs file interface
    will be converted to use css's which requires finding out the
    subsystem from cftype so that the matching css can be determined from
    the cgroup.

    This patch adds cftype->ss which points to the subsystem the file
    belongs to. The field is initialized while a cftype is being
    registered. This makes it unnecessary to explicitly specify the
    subsystem for other cftype handling functions. @ss argument dropped
    from various cftype handling functions.

    This patch shouldn't introduce any behavior differences.

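    A minimal sketch of this kind of back-pointer pattern, with
    illustrative types rather than the kernel's actual struct cftype and
    registration code:

```c
/* Sketch: each file type carries a back-pointer to its owning
 * subsystem, stamped in at registration time, so later handlers need
 * no explicit @ss argument.  All names are made up for illustration. */

struct subsys;

struct cftype {
    const char *name;
    struct subsys *ss;    /* owning subsystem, set at registration */
};

struct subsys {
    const char *name;
};

/* Registration stamps every cftype in the array with its subsystem. */
static void register_cftypes(struct subsys *ss, struct cftype *cfts, int n)
{
    for (int i = 0; i < n; i++)
        cfts[i].ss = ss;
}

/* A handler can now recover the subsystem from the cftype alone;
 * core files with no subsystem would fall back to a dummy. */
static const char *cft_subsys_name(const struct cftype *cft)
{
    return cft->ss ? cft->ss->name : "dummy";
}

static int demo(void)
{
    static struct subsys blkio_ss = { "blkio" };
    static struct cftype files[2] = { { "weight", 0 }, { "stat", 0 } };

    register_cftypes(&blkio_ss, files, 2);
    return files[0].ss == &blkio_ss
        && files[1].ss == &blkio_ss
        && cft_subsys_name(&files[0])[0] == 'b';
}
```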
    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Vivek Goyal
    Cc: Jens Axboe

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using struct
    cgroup_subsys_state * as the primary handle instead of struct cgroup *
    in subsystem implementations for the following reasons.

    * With unified hierarchy, subsystems will be dynamically bound and
    unbound from cgroups and thus css's (cgroup_subsys_state) may be
    created and destroyed dynamically over the lifetime of a cgroup,
    which is different from the current state where all css's are
    allocated and destroyed together with the associated cgroup. This
    in turn means that cgroup_css() should be synchronized and may
    return NULL, making it more cumbersome to use.

    * Differing levels of per-subsystem granularity in the unified
    hierarchy means that the task and descendant iterators should behave
    differently depending on the specific subsystem the iteration is
    being performed for.

    * In the majority of cases, a subsystem only cares about its part in
    the cgroup hierarchy - ie. the hierarchy of css's. Subsystem methods
    often obtain the matching css pointer from the cgroup and don't
    bother with the cgroup pointer itself. Passing around css fits
    much better.

    This patch converts all cgroup_subsys methods to take @css instead of
    @cgroup. The conversions are mostly straight-forward. A few
    noteworthy changes are

    * ->css_alloc() now takes css of the parent cgroup rather than the
    pointer to the new cgroup as the css for the new cgroup doesn't
    exist yet. Knowing the parent css is enough for all the existing
    subsystems.

    * In kernel/cgroup.c::offline_css(), unnecessary open coded css
    dereference is replaced with local variable access.

    This patch shouldn't cause any behavior differences.

    v2: Unnecessary explicit cgrp->subsys[] deref in css_online() replaced
    with local variable @css as suggested by Li Zefan.

    Rebased on top of new for-3.12 which includes for-3.11-fixes so
    that ->css_free() invocation added by da0a12caff ("cgroup: fix a
    leak when percpu_ref_init() fails") is converted too. Suggested
    by Li Zefan.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Daniel Wagner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe
    Cc: Steven Rostedt

    Tejun Heo
     

15 May, 2013

5 commits

  • With the recent updates, blk-throttle is finally ready for proper
    hierarchy support. Dispatching now honors service_queue->parent_sq
    and propagates correctly. The only thing missing is setting
    ->parent_sq correctly so that throtl_grp hierarchy matches the cgroup
    hierarchy.

    This patch updates throtl_pd_init() such that service_queues form the
    same hierarchy as the cgroup hierarchy if sane_behavior is enabled.
    As this concludes proper hierarchy support for blkcg, the shameful
    .broken_hierarchy tag is removed from blkio_subsys.

    v2: Updated blkio-controller.txt as suggested by Vivek.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Cc: Li Zefan

    Tejun Heo
     
    Currently, when the last reference of a blkcg_gq is put, all the
    release operations sans the actual freeing happen directly in
    blkg_put(). As blkg_put() may be called under queue_lock, all
    pd_exit_fn()s may be too. This makes it impossible for pd_exit_fn()s
    to use del_timer_sync() on timers which grab the queue_lock, which is
    an irq-safe lock, due to the deadlock possibility described in the
    comment on top of del_timer_sync().

    This can be easily avoided by performing the release operations in the
    RCU callback instead of directly from blkg_put(). This patch moves
    the blkcg_gq release operations to the RCU callback.

    As this leaves __blkg_release() with only call_rcu() invocation,
    blkg_rcu_free() is renamed to __blkg_release_rcu(), exported and
    call_rcu() invocation is now done directly from blkg_put() instead of
    going through __blkg_release() which is removed.

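    The shape of the fix can be sketched in userspace with a hand-rolled
    deferral queue standing in for call_rcu() and the grace period; all
    names here are illustrative:

```c
/* Sketch: the final put no longer runs release work inline (where the
 * caller may hold an irq-safe lock); it only queues a callback, and
 * the heavy teardown runs later from callback context. */

#include <stddef.h>

struct deferred {
    void (*fn)(struct deferred *);
    struct deferred *next;
};

static struct deferred *pending;        /* stand-in for the RCU queue */

static void defer(struct deferred *d, void (*fn)(struct deferred *))
{
    d->fn = fn;
    d->next = pending;
    pending = d;
}

/* Stand-in for the grace period elapsing: run the queued callbacks. */
static void run_deferred(void)
{
    while (pending) {
        struct deferred *d = pending;

        pending = d->next;
        d->fn(d);
    }
}

struct obj {
    int refcnt;
    int released;
    struct deferred rcu;
};

static void obj_release(struct deferred *d)
{
    /* Heavy teardown (e.g. del_timer_sync()) is safe here: the caller
     * of the final put and its locks are long gone. */
    struct obj *o = (struct obj *)((char *)d - offsetof(struct obj, rcu));

    o->released = 1;
}

static void obj_put(struct obj *o)
{
    if (--o->refcnt == 0)
        defer(&o->rcu, obj_release);    /* nothing heavy runs inline */
}

static int demo(void)
{
    struct obj o = { .refcnt = 1 };
    int released_inline;

    obj_put(&o);                 /* last ref: only queues the release */
    released_inline = o.released;
    run_deferred();
    return released_inline == 0 && o.released == 1;
}
```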
    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Currently, when creating a new blkcg_gq, each policy's pd_init_fn() is
    invoked in blkg_alloc() before the parent is linked. This makes it
    difficult for policies to perform initializations which are dependent
    on the parent.

    This patch moves pd_init_fn() invocations to blkg_create() after the
    parent blkg is linked where the new blkg is fully initialized. As
    this means that blkg_free() can't assume that pd's are initialized,
    pd_exit_fn() invocations are moved to __blkg_release(). This
    guarantees that pd_exit_fn() is also invoked with fully initialized
    blkgs with valid parent pointers.

    This will help implementing hierarchy support in blk-throttle.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • blk-throttle hierarchy support will make use of it. Move
    blkg_for_each_descendant_pre() from block/blk-cgroup.c to
    block/blk-cgroup.h.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
    In blkg_create(), after the lookup of the parent fails, control jumps
    to the error path with the error code encoded into @blkg. The error
    path doesn't use @blkg for the return value; it returns ERR_PTR(ret).
    Make the lookup failure path set @ret instead of @blkg.

    Note that the parent lookup is guaranteed to succeed at that point and
    the condition check is purely for sanity; it triggers a WARN when it
    fails. As such, I don't think it's necessary to mark it for -stable.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     

09 Apr, 2013

1 commit

  • Since 749fefe677 in v3.7 ("block: lift the initial queue bypass mode
    on blk_register_queue() instead of blk_init_allocated_queue()"),
    the following warning appears when multipath is used with CONFIG_PREEMPT=y.

    This patch moves blk_queue_bypass_start() before radix_tree_preload()
    to avoid the sleeping call while preemption is disabled.

    BUG: scheduling while atomic: multipath/2460/0x00000002
    1 lock held by multipath/2460:
    #0: (&md->type_lock){......}, at: [] dm_lock_md_type+0x17/0x19 [dm_mod]
    Modules linked in: ...
    Pid: 2460, comm: multipath Tainted: G W 3.7.0-rc2 #1
    Call Trace:
    [] __schedule_bug+0x6a/0x78
    [] __schedule+0xb4/0x5e0
    [] schedule+0x64/0x66
    [] schedule_timeout+0x39/0xf8
    [] ? put_lock_stats+0xe/0x29
    [] ? lock_release_holdtime+0xb6/0xbb
    [] wait_for_common+0x9d/0xee
    [] ? try_to_wake_up+0x206/0x206
    [] ? kfree_call_rcu+0x1c/0x1c
    [] wait_for_completion+0x1d/0x1f
    [] wait_rcu_gp+0x5d/0x7a
    [] ? wait_rcu_gp+0x7a/0x7a
    [] ? complete+0x21/0x53
    [] synchronize_rcu+0x1e/0x20
    [] blk_queue_bypass_start+0x5d/0x62
    [] blkcg_activate_policy+0x73/0x270
    [] ? kmem_cache_alloc_node_trace+0xc7/0x108
    [] cfq_init_queue+0x80/0x28e
    [] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
    [] elevator_init+0xe1/0x115
    [] ? blk_queue_make_request+0x54/0x59
    [] blk_init_allocated_queue+0x8c/0x9e
    [] dm_setup_md_queue+0x36/0xaa [dm_mod]
    [] table_load+0x1bd/0x2c8 [dm_mod]
    [] ctl_ioctl+0x1d6/0x236 [dm_mod]
    [] ? table_clear+0xaa/0xaa [dm_mod]
    [] dm_ctl_ioctl+0x13/0x17 [dm_mod]
    [] do_vfs_ioctl+0x3fb/0x441
    [] ? file_has_perm+0x8a/0x99
    [] sys_ioctl+0x5e/0x82
    [] ? trace_hardirqs_on_thunk+0x3a/0x3f
    [] system_call_fastpath+0x16/0x1b

    Signed-off-by: Jun'ichi Nomura
    Acked-by: Vivek Goyal
    Acked-by: Tejun Heo
    Cc: Alasdair G Kergon
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Jun'ichi Nomura
     

01 Mar, 2013

1 commit

  • Pull block IO core bits from Jens Axboe:
    "Below are the core block IO bits for 3.9. It was delayed a few days
    since my workstation kept crashing every 2-8h after pulling it into
    current -git, but it turns out it is a bug in the new pstate code
    (divide by zero, will report separately). In any case, it contains:

    - The big cfq/blkcg update from Tejun and Vivek.

    - Additional block and writeback tracepoints from Tejun.

    - Improvement of the should sort (based on queues) logic in the plug
    flushing.

    - _io() variants of the wait_for_completion() interface, using
    io_schedule() instead of schedule() to contribute to io wait
    properly.

    - Various little fixes.

    You'll get two trivial merge conflicts, which should be easy enough to
    fix up"

    Fix up the trivial conflicts due to hlist traversal cleanups (commit
    b67bfe0d42ca: "hlist: drop the node parameter from iterators").

    * 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
    block: remove redundant check to bd_openers()
    block: use i_size_write() in bd_set_size()
    cfq: fix lock imbalance with failed allocations
    drivers/block/swim3.c: fix null pointer dereference
    block: don't select PERCPU_RWSEM
    block: account iowait time when waiting for completion of IO request
    sched: add wait_for_completion_io[_timeout]
    writeback: add more tracepoints
    block: add block_{touch|dirty}_buffer tracepoint
    buffer: make touch_buffer() an exported function
    block: add @req to bio_{front|back}_merge tracepoints
    block: add missing block_bio_complete() tracepoint
    block: Remove should_sort judgement when flush blk_plug
    block,elevator: use new hashtable implementation
    cfq-iosched: add hierarchical cfq_group statistics
    cfq-iosched: collect stats from dead cfqgs
    cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
    blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
    block: RCU free request_queue
    blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
    ...

    Linus Torvalds
     

28 Feb, 2013

1 commit

    I'm not sure why, but the hlist for-each-entry iterators were
    conceived differently from the list one:

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not
    only do they not really need it, it also prevents the iterator from
    looking exactly like the list iterator, which is unfortunate.

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small amount of places were using the 'node' parameter, this
    was modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foundation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

10 Jan, 2013

7 commits

  • Instead of holding blkcg->lock while walking ->blkg_list and executing
    prfill(), RCU walk ->blkg_list and hold the blkg's queue lock while
    executing prfill(). This makes prfill() implementations easier as
    stats are mostly protected by queue lock.

    This will be used to implement hierarchical stats.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge().
    The former two collect the [rw]stats designated by the target policy
    data and offset from the pd's subtree. The latter two add one
    [rw]stat to another.

    Note that the recursive sum functions require the queue lock to be
    held on entry to make blkg online test reliable. This is necessary to
    properly handle stats of a dying blkg.

    These will be used to implement hierarchical stats.

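    A userspace sketch of what a recursive stat sum over a policy-data
    subtree looks like; simplified and illustrative, not the kernel
    implementation:

```c
/* Sketch: sum a counter over a subtree pre-order, folding in only
 * nodes that are online, mirroring the online test mentioned above.
 * Names and structure are made up for illustration. */

#define MAX_KIDS 4

struct pd {
    unsigned long stat;       /* this node's own counter */
    int online;               /* folded in only while online */
    int nr_kids;
    struct pd *kids[MAX_KIDS];
};

/* Sum @stat over @pd's subtree, @pd itself included. */
static unsigned long stat_recursive_sum(struct pd *pd)
{
    unsigned long sum = pd->online ? pd->stat : 0;

    for (int i = 0; i < pd->nr_kids; i++)
        sum += stat_recursive_sum(pd->kids[i]);
    return sum;
}

/* "Merge" one stat into another, as for stats of a dying node. */
static void stat_merge(unsigned long *to, unsigned long from)
{
    *to += from;
}

static unsigned long demo(void)
{
    struct pd leaf = { .stat = 5, .online = 1 };
    struct pd mid  = { .stat = 3, .online = 1,
                       .nr_kids = 1, .kids = { &leaf } };
    struct pd root = { .stat = 2, .online = 1,
                       .nr_kids = 1, .kids = { &mid } };

    stat_merge(&mid.stat, 10);          /* fold a dying node's stat */
    return stat_recursive_sum(&root);   /* 2 + (3+10) + 5 */
}
```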
    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Hierarchical stats for cfq-iosched will need __blkg_prfill_rwstat().
    Export it.

    Signed-off-by: Tejun Heo
    Reported-by: Fengguang Wu

    Tejun Heo
     
  • Add two blkcg_policy methods, ->online_pd_fn() and ->offline_pd_fn(),
    which are invoked as the policy_data gets activated and deactivated
    while holding both blkcg and q locks.

    Also, add blkcg_gq->online bool, which is set and cleared as the
    blkcg_gq gets activated and deactivated. This flag also is toggled
    while holding both blkcg and q locks.

    These will be used to implement hierarchical stats.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Add pd->plid so that the policy a pd belongs to can be identified
    easily. This will be used to implement hierarchical blkg_[rw]stats.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
    cfq blkcg is about to grow proper hierarchy handling, where a child
    blkg's weight would nest inside the parent's. This makes tasks in a
    blkg compete against both the tasks in sibling blkgs and the tasks of
    child blkgs.

    We're gonna use the existing weight as the group weight which decides
    the blkg's weight against its siblings. This patch introduces a new
    weight - leaf_weight - which decides the weight of a blkg against the
    child blkgs.

    It's named leaf_weight because another way to look at it is that each
    internal blkg node has a hidden child leaf node which contains all
    its tasks, and leaf_weight is the weight of that leaf node, handled
    the same as the weight of the child blkgs.

    This patch only adds leaf_weight fields and exposes it to userland.
    The new weight isn't actually used anywhere yet. Note that
    cfq-iosched currently officially supports only a single level of
    hierarchy and root blkgs compete with the first level blkgs - ie. the
    root weight is basically being used as leaf_weight. For root blkgs,
    the two weights are kept in sync for backward compatibility.

    v2: cfqd->root_group->leaf_weight initialization was missing from
    cfq_init_queue() causing divide by zero when
    !CONFIG_CFQ_GROUP_SCHED. Fix it. Reported by Fengguang.

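    As a worked example of the hidden-leaf model above, the share of
    service the internal node's own tasks receive against its child
    groups can be computed from leaf_weight alone; purely illustrative
    integer arithmetic, not cfq's actual scheduling math:

```c
/* Fraction (in per-mille) of a parent's service that the hidden leaf
 * (i.e. the parent's own tasks, weighted by leaf_weight) receives when
 * competing against @nr_children child groups of weight @child_weight
 * each.  Illustrative only. */
static int leaf_share_permille(int leaf_weight, int child_weight,
                               int nr_children)
{
    int total = leaf_weight + child_weight * nr_children;

    return 1000 * leaf_weight / total;
}
```

    With equal weights and one child group, the parent's tasks get half
    the service; lowering leaf_weight shifts service to the children.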
    Signed-off-by: Tejun Heo
    Cc: Fengguang Wu

    Tejun Heo
     
  • Currently a child blkg (blkcg_gq) can be created even if its parent
    doesn't exist. ie. Given a blkg, it's not guaranteed that its
    ancestors will exist. This makes it difficult to implement proper
    hierarchy support for blkcg policies.

    Always create blkgs recursively and make a child blkg hold a reference
    to its parent. blkg->parent is added so that finding the parent is
    easy. blkcg_parent() is also added in the process.

    This change can be visible to userland. e.g. while issuing IO in a
    nested cgroup previously didn't affect the ancestors at all, it will
    now initialize all ancestor blkgs, and zeroed stats for the
    request_queue will always appear on them. While this is userland
    visible, it shouldn't cause any functional difference.

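    The recursive-creation rule can be sketched in userspace: looking up
    a node creates all missing ancestors first, and each child pins its
    parent with a reference. Names and structure here are illustrative,
    not the kernel's:

```c
/* Sketch: blkg_lookup_create() always creates ancestors first, and a
 * child holds a reference on its parent, so given any node its whole
 * ancestor chain is guaranteed to exist.  Illustrative names only. */

#include <stdlib.h>

struct cg {                   /* stand-in for a blkcg */
    struct cg *parent;
};

struct blkg {                 /* stand-in for a blkcg_gq */
    struct cg *cg;
    struct blkg *parent;      /* pinned: holds a ref on the parent */
    int refcnt;
    struct blkg *next;        /* all blkgs, for lookup */
};

static struct blkg *all_blkgs;

static struct blkg *blkg_lookup(struct cg *cg)
{
    for (struct blkg *b = all_blkgs; b; b = b->next)
        if (b->cg == cg)
            return b;
    return NULL;
}

/* Create the blkg for @cg, recursively creating ancestors first. */
static struct blkg *blkg_lookup_create(struct cg *cg)
{
    struct blkg *blkg = blkg_lookup(cg);
    struct blkg *parent = NULL;

    if (blkg)
        return blkg;
    if (cg->parent) {
        parent = blkg_lookup_create(cg->parent);  /* ancestors first */
        if (!parent)
            return NULL;
        parent->refcnt++;                         /* child pins parent */
    }
    blkg = calloc(1, sizeof(*blkg));
    if (!blkg)
        return NULL;
    blkg->cg = cg;
    blkg->parent = parent;
    blkg->refcnt = 1;
    blkg->next = all_blkgs;
    all_blkgs = blkg;
    return blkg;
}

static int demo(void)
{
    struct cg root = { 0 };
    struct cg mid  = { .parent = &root };
    struct cg leaf = { .parent = &mid };
    struct blkg *b = blkg_lookup_create(&leaf);

    /* All ancestors exist and are pinned by their children. */
    return b && b->parent && b->parent->parent
             && b->parent->refcnt == 2          /* own ref + child's */
             && b->parent->parent->refcnt == 2;
}
```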
    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo