18 Sep, 2014

1 commit

  • commit 2a1b4cf2331d92bc009bf94fa02a24604cdaf24c upstream.

    While a queue is being destroyed, all the blkgs are destroyed and its
    ->root_blkg pointer is set to NULL. If someone else starts to drain
    while the queue is in this state, the following oops happens.

    NULL pointer dereference at 0000000000000028
    IP: [] blk_throtl_drain+0x84/0x230
    PGD e4a1067 PUD b773067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Modules linked in: cfq_iosched(-) [last unloaded: cfq_iosched]
    CPU: 1 PID: 537 Comm: bash Not tainted 3.16.0-rc3-work+ #2
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    task: ffff88000e222250 ti: ffff88000efd4000 task.ti: ffff88000efd4000
    RIP: 0010:[] [] blk_throtl_drain+0x84/0x230
    RSP: 0018:ffff88000efd7bf0 EFLAGS: 00010046
    RAX: 0000000000000000 RBX: ffff880015091450 RCX: 0000000000000001
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
    RBP: ffff88000efd7c10 R08: 0000000000000000 R09: 0000000000000001
    R10: ffff88000e222250 R11: 0000000000000000 R12: ffff880015091450
    R13: ffff880015092e00 R14: ffff880015091d70 R15: ffff88001508fc28
    FS: 00007f1332650740(0000) GS:ffff88001fa80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000000000028 CR3: 0000000009446000 CR4: 00000000000006e0
    Stack:
    ffffffff8144e8f6 ffff880015091450 0000000000000000 ffff880015091d80
    ffff88000efd7c28 ffffffff8144ae2f ffff880015091450 ffff88000efd7c58
    ffffffff81427641 ffff880015091450 ffffffff82401f00 ffff880015091450
    Call Trace:
    [] blkcg_drain_queue+0x1f/0x60
    [] __blk_drain_queue+0x71/0x180
    [] blk_queue_bypass_start+0x6e/0xb0
    [] blkcg_deactivate_policy+0x38/0x120
    [] blk_throtl_exit+0x34/0x50
    [] blkcg_exit_queue+0x35/0x40
    [] blk_release_queue+0x26/0xd0
    [] kobject_cleanup+0x38/0x70
    [] kobject_put+0x28/0x60
    [] blk_put_queue+0x15/0x20
    [] scsi_device_dev_release_usercontext+0x16b/0x1c0
    [] execute_in_process_context+0x89/0xa0
    [] scsi_device_dev_release+0x1c/0x20
    [] device_release+0x32/0xa0
    [] kobject_cleanup+0x38/0x70
    [] kobject_put+0x28/0x60
    [] put_device+0x17/0x20
    [] __scsi_remove_device+0xa9/0xe0
    [] scsi_remove_device+0x2b/0x40
    [] sdev_store_delete+0x27/0x30
    [] dev_attr_store+0x18/0x30
    [] sysfs_kf_write+0x3e/0x50
    [] kernfs_fop_write+0xe7/0x170
    [] vfs_write+0xaf/0x1d0
    [] SyS_write+0x4d/0xc0
    [] system_call_fastpath+0x16/0x1b

    776687bce42b ("block, blk-mq: draining can't be skipped even if
    bypass_depth was non-zero") made it easier to trigger this bug by
    making blk_queue_bypass_start() drain even when it loses the first
    bypass test to blk_cleanup_queue(); however, the bug has always been
    there even before the commit as blk_queue_bypass_start() could race
    against queue destruction, win the initial bypass test but perform the
    actual draining after blk_cleanup_queue() already destroyed all blkgs.

    Fix it by skipping the call into policy draining if all the blkgs are
    already gone.
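    The fix amounts to an early-exit guard at the top of the drain path. A
    minimal userspace model of the idea (the struct fields and function name
    below are illustrative stand-ins, not the kernel's exact code):

```c
#include <stdbool.h>
#include <stddef.h>

struct request_queue_model {
    void *root_blkg;    /* set to NULL once queue destruction killed all blkgs */
    int drained;        /* counts calls into policy draining */
};

/* Skip policy draining entirely when the blkgs are already gone. */
static bool blkcg_drain_queue_model(struct request_queue_model *q)
{
    if (!q->root_blkg)
        return false;   /* queue is being torn down: nothing to drain */
    q->drained++;       /* stands in for blk_throtl_drain() and friends */
    return true;
}
```

    With the guard in place, a drain racing against queue destruction becomes
    a harmless no-op instead of a NULL dereference.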

    Signed-off-by: Tejun Heo
    Reported-by: Shirish Pargaonkar
    Reported-by: Sasha Levin
    Reported-by: Jet Chen
    Tested-by: Shirish Pargaonkar
    Signed-off-by: Jens Axboe
    Signed-off-by: Jiri Slaby

    Tejun Heo
     

31 Jul, 2014

1 commit

  • commit 0b462c89e31f7eb6789713437eb551833ee16ff3 upstream.

    While a queue is being destroyed, all the blkgs are destroyed and its
    ->root_blkg pointer is set to NULL. If someone else starts to drain
    while the queue is in this state, the following oops happens.

    NULL pointer dereference at 0000000000000028
    IP: [] blk_throtl_drain+0x84/0x230
    PGD e4a1067 PUD b773067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Modules linked in: cfq_iosched(-) [last unloaded: cfq_iosched]
    CPU: 1 PID: 537 Comm: bash Not tainted 3.16.0-rc3-work+ #2
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    task: ffff88000e222250 ti: ffff88000efd4000 task.ti: ffff88000efd4000
    RIP: 0010:[] [] blk_throtl_drain+0x84/0x230
    RSP: 0018:ffff88000efd7bf0 EFLAGS: 00010046
    RAX: 0000000000000000 RBX: ffff880015091450 RCX: 0000000000000001
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
    RBP: ffff88000efd7c10 R08: 0000000000000000 R09: 0000000000000001
    R10: ffff88000e222250 R11: 0000000000000000 R12: ffff880015091450
    R13: ffff880015092e00 R14: ffff880015091d70 R15: ffff88001508fc28
    FS: 00007f1332650740(0000) GS:ffff88001fa80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000000000028 CR3: 0000000009446000 CR4: 00000000000006e0
    Stack:
    ffffffff8144e8f6 ffff880015091450 0000000000000000 ffff880015091d80
    ffff88000efd7c28 ffffffff8144ae2f ffff880015091450 ffff88000efd7c58
    ffffffff81427641 ffff880015091450 ffffffff82401f00 ffff880015091450
    Call Trace:
    [] blkcg_drain_queue+0x1f/0x60
    [] __blk_drain_queue+0x71/0x180
    [] blk_queue_bypass_start+0x6e/0xb0
    [] blkcg_deactivate_policy+0x38/0x120
    [] blk_throtl_exit+0x34/0x50
    [] blkcg_exit_queue+0x35/0x40
    [] blk_release_queue+0x26/0xd0
    [] kobject_cleanup+0x38/0x70
    [] kobject_put+0x28/0x60
    [] blk_put_queue+0x15/0x20
    [] scsi_device_dev_release_usercontext+0x16b/0x1c0
    [] execute_in_process_context+0x89/0xa0
    [] scsi_device_dev_release+0x1c/0x20
    [] device_release+0x32/0xa0
    [] kobject_cleanup+0x38/0x70
    [] kobject_put+0x28/0x60
    [] put_device+0x17/0x20
    [] __scsi_remove_device+0xa9/0xe0
    [] scsi_remove_device+0x2b/0x40
    [] sdev_store_delete+0x27/0x30
    [] dev_attr_store+0x18/0x30
    [] sysfs_kf_write+0x3e/0x50
    [] kernfs_fop_write+0xe7/0x170
    [] vfs_write+0xaf/0x1d0
    [] SyS_write+0x4d/0xc0
    [] system_call_fastpath+0x16/0x1b

    776687bce42b ("block, blk-mq: draining can't be skipped even if
    bypass_depth was non-zero") made it easier to trigger this bug by
    making blk_queue_bypass_start() drain even when it loses the first
    bypass test to blk_cleanup_queue(); however, the bug has always been
    there even before the commit as blk_queue_bypass_start() could race
    against queue destruction, win the initial bypass test but perform the
    actual draining after blk_cleanup_queue() already destroyed all blkgs.

    Fix it by skipping the call into policy draining if all the blkgs are
    already gone.

    Signed-off-by: Tejun Heo
    Reported-by: Shirish Pargaonkar
    Reported-by: Sasha Levin
    Reported-by: Jet Chen
    Tested-by: Shirish Pargaonkar
    Signed-off-by: Jens Axboe
    Signed-off-by: Jiri Slaby

    Tejun Heo
     

30 Jul, 2014

2 commits

  • commit d45b3279a5a2252cafcd665bbf2db8c9b31ef783 upstream.

    There is no inherent reason why the last put of a tag structure must be
    the one for the Scsi_Host, as device model objects can be held for
    arbitrary periods. Merge blk_free_tags and __blk_free_tags into a single
    function that just releases a reference, and get rid of the BUG() when the
    host reference wasn't the last one.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    Signed-off-by: Jiri Slaby

    Christoph Hellwig
     
  • commit 3b3a1814d1703027f9867d0f5cbbfaf6c7482474 upstream.

    This patch provides the compat BLKZEROOUT ioctl. The argument is a pointer
    to two uint64_t values, so there is no need to translate it.
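    The reason no translation is needed: the argument is two uint64_t values
    with no pointers or longs, so its 32-bit and 64-bit layouts are identical.
    That invariant can be checked directly (the struct name below is
    illustrative; the kernel ioctl takes a plain uint64_t range[2]):

```c
#include <stdint.h>
#include <stddef.h>

/* BLKZEROOUT's argument: start and length, both 64-bit, no padding. */
struct blkzeroout_arg {
    uint64_t start;
    uint64_t len;
};

/* Same size and field offsets whether compiled for a 32-bit or 64-bit
 * ABI, which is why the compat path can forward the pointer unchanged. */
_Static_assert(sizeof(struct blkzeroout_arg) == 16, "no padding in either ABI");
_Static_assert(offsetof(struct blkzeroout_arg, len) == 8, "fixed field offsets");
```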

    Signed-off-by: Mikulas Patocka
    Acked-by: Martin K. Petersen
    Signed-off-by: Jens Axboe
    Signed-off-by: Jiri Slaby

    Mikulas Patocka
     

18 Jul, 2014

1 commit

  • commit a5049a8ae34950249a7ae94c385d7c5c98914412 upstream.

    Hello,

    So, this patch should do. Joe, Vivek, can one of you guys please
    verify that the oops goes away with this patch?

    Jens, the original thread can be read at

    http://thread.gmane.org/gmane.linux.kernel/1720729

    The fix converts blkg->refcnt from int to atomic_t. It adds some
    overhead but it should be minute compared to everything else which is
    going on and the involved cacheline bouncing, so I think it's highly
    unlikely to cause any noticeable difference. Also, the refcnt in
    question should be converted to a percpu_ref for blk-mq anyway, so the
    atomic_t is likely to go away pretty soon.

    Thanks.

    ------- 8< -------
    __blkg_release_rcu() may be invoked after the associated request_queue
    is released, with an RCU grace period in between. As such, the function
    and callbacks invoked from it must not dereference the associated
    request_queue. This is clearly indicated in the comment above the
    function.

    Unfortunately, while trying to fix a different issue, 2a4fd070ee85
    ("blkcg: move bulk of blkcg_gq release operations to the RCU
    callback") ignored this and added [un]locking of @blkg->q->queue_lock
    to __blkg_release_rcu(). This of course can cause oops as the
    request_queue may be long gone by the time this code gets executed.

    general protection fault: 0000 [#1] SMP
    CPU: 21 PID: 30 Comm: rcuos/21 Not tainted 3.15.0 #1
    Hardware name: Stratus ftServer 6400/G7LAZ, BIOS BIOS Version 6.3:57 12/25/2013
    task: ffff880854021de0 ti: ffff88085403c000 task.ti: ffff88085403c000
    RIP: 0010:[] [] _raw_spin_lock_irq+0x15/0x60
    RSP: 0018:ffff88085403fdf0 EFLAGS: 00010086
    RAX: 0000000000020000 RBX: 0000000000000010 RCX: 0000000000000000
    RDX: 000060ef80008248 RSI: 0000000000000286 RDI: 6b6b6b6b6b6b6b6b
    RBP: ffff88085403fdf0 R08: 0000000000000286 R09: 0000000000009f39
    R10: 0000000000020001 R11: 0000000000020001 R12: ffff88103c17a130
    R13: ffff88103c17a080 R14: 0000000000000000 R15: 0000000000000000
    FS: 0000000000000000(0000) GS:ffff88107fca0000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000006e5ab8 CR3: 000000000193d000 CR4: 00000000000407e0
    Stack:
    ffff88085403fe18 ffffffff812cbfc2 ffff88103c17a130 0000000000000000
    ffff88103c17a130 ffff88085403fec0 ffffffff810d1d28 ffff880854021de0
    ffff880854021de0 ffff88107fcaec58 ffff88085403fe80 ffff88107fcaec30
    Call Trace:
    [] __blkg_release_rcu+0x72/0x150
    [] rcu_nocb_kthread+0x1e8/0x300
    [] kthread+0xe1/0x100
    [] ret_from_fork+0x7c/0xb0
    Code: ff 47 04 48 8b 7d 08 be 00 02 00 00 e8 55 48 a4 ff 5d c3 0f 1f 00 66 66 66 66 90 55 48 89 e5
    +fa 66 66 90 66 66 90 b8 00 00 02 00 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f
    +b7
    RIP [] _raw_spin_lock_irq+0x15/0x60
    RSP

    The request_queue locking was added because blkcg_gq->refcnt is an int
    protected with the queue lock and __blkg_release_rcu() needs to put
    the parent. Let's fix it by making blkcg_gq->refcnt an atomic_t and
    dropping queue locking in the function.

    Given the general heavy weight of the current request_queue and blkcg
    operations, this is unlikely to cause any noticeable overhead.
    Moreover, blkcg_gq->refcnt is likely to be converted to percpu_ref in
    the near future, so whatever (most likely negligible) overhead it may
    add is temporary.
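    The int-to-atomic_t conversion can be modelled in userspace with C11
    atomics (the struct and helper names below are simplified stand-ins for
    the kernel's blkcg_gq and its get/put helpers):

```c
#include <stdatomic.h>
#include <stdbool.h>

struct blkg_model {
    atomic_int refcnt;  /* was: a plain int protected by q->queue_lock */
};

static void blkg_get_model(struct blkg_model *b)
{
    atomic_fetch_add(&b->refcnt, 1);
}

/* Returns true when the last reference was dropped. Because the counter
 * is atomic, no (possibly already-released) queue lock is needed here. */
static bool blkg_put_model(struct blkg_model *b)
{
    return atomic_fetch_sub(&b->refcnt, 1) == 1;
}
```

    This is exactly why __blkg_release_rcu() no longer has to touch
    blkg->q->queue_lock just to put the parent reference.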

    Signed-off-by: Tejun Heo
    Reported-by: Joe Lawrence
    Acked-by: Vivek Goyal
    Link: http://lkml.kernel.org/g/alpine.DEB.2.02.1406081816540.17948@jlaw-desktop.mno.stratus.com
    Signed-off-by: Jens Axboe
    Signed-off-by: Jiri Slaby

    Tejun Heo
     

29 May, 2014

1 commit

  • commit af5040da01ef980670b3741b3e10733ee3e33566 upstream.

    trace_block_rq_complete does not take into account that a request can
    be partially completed, so we can get the following incorrect output
    from blkparse:

    C R 232 + 240 [0]
    C R 240 + 232 [0]
    C R 248 + 224 [0]
    C R 256 + 216 [0]

    but should be:

    C R 232 + 8 [0]
    C R 240 + 8 [0]
    C R 248 + 8 [0]
    C R 256 + 8 [0]

    Also, the output's overall summary statistics of completed requests and
    the final throughput will be incorrect.

    This patch takes into account real completion size of the request and
    fixes wrong completion accounting.
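    The arithmetic behind the bogus output is easy to model: the old code
    traced the request's remaining size where it should have traced the size
    actually completed in this step (the helper names below are illustrative,
    not the tracepoint's real signature):

```c
#include <stdint.h>

/* What the old tracepoint logged: whatever is still pending on the
 * request at completion time. */
static uint32_t traced_size_old(uint32_t remaining, uint32_t nr_done)
{
    (void)nr_done;
    return remaining;
}

/* What the fix logs: the chunk completed by this call. */
static uint32_t traced_size_new(uint32_t remaining, uint32_t nr_done)
{
    (void)remaining;
    return nr_done;
}
```

    For the 240-sector read at sector 232 completed in 8-sector steps, the
    old helper reproduces the incorrect "C R 232 + 240" lines above, while
    the new one yields the expected "+ 8" per completion.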

    Signed-off-by: Roman Pen
    CC: Steven Rostedt
    CC: Frederic Weisbecker
    CC: Ingo Molnar
    CC: linux-kernel@vger.kernel.org
    Signed-off-by: Jens Axboe
    Signed-off-by: Jiri Slaby

    Roman Pen
     

23 Feb, 2014

2 commits

  • commit c8123f8c9cb517403b51aa41c3c46ff5e10b2c17 upstream.

    When mkfs issues a full device discard and the device only
    supports discards of a smallish size, we can loop in
    blkdev_issue_discard() for a long time. If preempt isn't enabled,
    this can turn into a softlock situation and the kernel will
    start complaining.

    Add an explicit cond_resched() at the end of the loop to avoid
    that.
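    A sketch of the loop shape, with stand-ins for issue_discard() and
    cond_resched() (the real cond_resched() gives the scheduler a chance to
    run between chunks, which is what breaks the softlockup):

```c
#include <stdint.h>

static unsigned long chunks_issued;

static void issue_discard_stub(uint64_t sector, uint64_t nr)
{
    (void)sector; (void)nr;
    chunks_issued++;            /* stands in for submitting one discard bio */
}

static void cond_resched_stub(void)
{
    /* the added call: yield here so one huge discard can't hog the CPU */
}

/* Discard nr_sects in granule-sized chunks, yielding between chunks. */
static void blkdev_issue_discard_model(uint64_t sector, uint64_t nr_sects,
                                       uint64_t granule)
{
    while (nr_sects) {
        uint64_t chunk = nr_sects < granule ? nr_sects : granule;
        issue_discard_stub(sector, chunk);
        sector += chunk;
        nr_sects -= chunk;
        cond_resched_stub();
    }
}
```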

    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jens Axboe
     
  • commit 556ee818c06f37b2e583af0363e6b16d0e0270de upstream.

    request_queue bypassing is used to suppress higher-level functions of a
    request_queue so that they can be switched, reconfigured and shut
    down. A request_queue does the following while bypassing.

    * bypasses elevator and io_cq association and queues requests directly
    to the FIFO dispatch queue.

    * bypasses block cgroup request_list lookup and always uses the root
    request_list.

    Once confirmed to be bypassing, specific elevator and block cgroup
    policy implementations can assume that nothing is in flight for them
    and perform various operations which would be dangerous otherwise.

    Such confirmation is achieved by short-circuiting all new requests
    directly to the dispatch queue and waiting for all the requests which
    were issued before to finish. Unfortunately, while the request
    allocating and draining sides were properly handled, we forgot to
    actually plug the request dispatch path. Even after bypassing mode is
    confirmed, if the attached driver tries to fetch a request and the
    dispatch queue is empty, __elv_next_request() would invoke the current
    elevator's elevator_dispatch_fn() callback. As all in-flight requests
    were drained, the elevator wouldn't contain any requests, but once
    bypass is confirmed we don't even know whether the elevator is still
    there. It might be in the process of being switched and half torn
    down.

    Frank Mayhar reports that this actually happened while switching
    elevators, leading to an oops.

    Let's fix it by making __elv_next_request() avoid invoking the
    elevator_dispatch_fn() callback if the queue is bypassing. It already
    avoids invoking the callback if the queue is dying. As a dying queue
    is guaranteed to be bypassing, we can simply replace blk_queue_dying()
    check with blk_queue_bypass().
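    The one-line nature of the fix can be modelled as a dispatch helper that
    refuses to call into the elevator while bypassing (the struct fields and
    function name are simplified stand-ins):

```c
#include <stdbool.h>

struct queue_model {
    int fifo_len;       /* requests already on the FIFO dispatch queue */
    bool bypassing;     /* blk_queue_bypass(); a dying queue is also bypassing */
    int elevator_calls; /* counts elevator_dispatch_fn() invocations */
};

/* Returns how many requests are available, consulting the elevator only
 * when the queue is not bypassing. */
static int elv_next_request_model(struct queue_model *q)
{
    if (q->fifo_len)
        return q->fifo_len;
    if (q->bypassing)
        return 0;           /* elevator may be half torn down: don't touch it */
    q->elevator_calls++;    /* safe to ask the elevator for more work */
    return 0;
}
```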

    Reported-by: Frank Mayhar
    References: http://lkml.kernel.org/g/1390319905.20232.38.camel@bobble.lax.corp.google.com
    Tested-by: Frank Mayhar
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

12 Dec, 2013

1 commit


08 Dec, 2013

2 commits

  • commit 7c8a3679e3d8e9d92d58f282161760a0e247df97 upstream.

    Add locking of q->sysfs_lock into elevator_change() (an exported function)
    to ensure it is held to protect q->elevator from elevator_init(), even if
    elevator_change() is called from non-sysfs paths.
    The sysfs path (elv_iosched_store) uses __elevator_change(), the
    non-locking version, since the lock is already taken by elv_iosched_store().

    Signed-off-by: Tomoki Sekiyama
    Signed-off-by: Jens Axboe
    Cc: Josh Boyer
    Signed-off-by: Greg Kroah-Hartman

    Tomoki Sekiyama
     
  • commit eb1c160b22655fd4ec44be732d6594fd1b1e44f4 upstream.

    The soft lockup below happens at the boot time of the system using dm
    multipath and the udev rules to switch scheduler.

    [ 356.127001] BUG: soft lockup - CPU#3 stuck for 22s! [sh:483]
    [ 356.127001] RIP: 0010:[] [] lock_timer_base.isra.35+0x1d/0x50
    ...
    [ 356.127001] Call Trace:
    [ 356.127001] [] try_to_del_timer_sync+0x20/0x70
    [ 356.127001] [] ? kmem_cache_alloc_node_trace+0x20a/0x230
    [ 356.127001] [] del_timer_sync+0x52/0x60
    [ 356.127001] [] cfq_exit_queue+0x32/0xf0
    [ 356.127001] [] elevator_exit+0x2f/0x50
    [ 356.127001] [] elevator_change+0xf1/0x1c0
    [ 356.127001] [] elv_iosched_store+0x20/0x50
    [ 356.127001] [] queue_attr_store+0x59/0xb0
    [ 356.127001] [] sysfs_write_file+0xc6/0x140
    [ 356.127001] [] vfs_write+0xbd/0x1e0
    [ 356.127001] [] SyS_write+0x49/0xa0
    [ 356.127001] [] system_call_fastpath+0x16/0x1b

    This is caused by a race between md device initialization by multipathd and
    shell script to switch the scheduler using sysfs.

    - multipathd:
    SyS_ioctl -> do_vfs_ioctl -> dm_ctl_ioctl -> ctl_ioctl -> table_load
    -> dm_setup_md_queue -> blk_init_allocated_queue -> elevator_init
    q->elevator = elevator_alloc(q, e); // not yet initialized

    - sh -c 'echo deadline > /sys/$DEVPATH/queue/scheduler':
    elevator_switch (in the call trace above)
    struct elevator_queue *old = q->elevator;
    q->elevator = elevator_alloc(q, new_e);
    elevator_exit(old); // lockup! (*)

    - multipathd: (cont.)
    err = e->ops.elevator_init_fn(q); // init fails; q->elevator is modified

    (*) When del_timer_sync() is called, lock_timer_base() will loop infinitely
    while timer->base == NULL. In this case, as the timer will never be
    initialized, this results in a lockup.

    This patch introduces acquisition of q->sysfs_lock around elevator_init()
    into blk_init_allocated_queue(), to provide mutual exclusion between
    initialization of the q->scheduler and switching of the scheduler.
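    The shape of the fix is to take q->sysfs_lock on the init path, the same
    lock the sysfs store path already holds. A simplified single-threaded
    model (an int flag stands in for the mutex; the helper names are
    illustrative):

```c
#include <assert.h>
#include <stddef.h>

static int sysfs_locked;    /* stands in for q->sysfs_lock */
static void *q_elevator;    /* stands in for q->elevator */

static void sysfs_lock(void)   { assert(!sysfs_locked); sysfs_locked = 1; }
static void sysfs_unlock(void) { assert(sysfs_locked);  sysfs_locked = 0; }

/* Init path (blk_init_allocated_queue): now serialized too, so the
 * switch path can never observe a half-initialized elevator. */
static void elevator_init_model(void *e)
{
    sysfs_lock();
    q_elevator = e;
    sysfs_unlock();
}

/* Switch path (elv_iosched_store) already ran under the lock. */
static void *elevator_switch_model(void *new_e)
{
    sysfs_lock();
    void *old = q_elevator;     /* guaranteed fully initialized or NULL */
    q_elevator = new_e;
    sysfs_unlock();
    return old;
}
```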

    This should fix this bugzilla:
    https://bugzilla.redhat.com/show_bug.cgi?id=902012

    Signed-off-by: Tomoki Sekiyama
    Signed-off-by: Jens Axboe
    Cc: Josh Boyer
    Signed-off-by: Greg Kroah-Hartman

    Tomoki Sekiyama
     

05 Dec, 2013

1 commit

  • commit fff4996b7db7955414ac74386efa5e07fd766b50 upstream.

    If blkcg_init_queue fails, blk_alloc_queue_node doesn't call bdi_destroy
    to clean up structures allocated by the backing dev.

    ------------[ cut here ]------------
    WARNING: at lib/debugobjects.c:260 debug_print_object+0x85/0xa0()
    ODEBUG: free active (active state 0) object type: percpu_counter hint: (null)
    Modules linked in: dm_loop dm_mod ip6table_filter ip6_tables uvesafb cfbcopyarea cfbimgblt cfbfillrect fbcon font bitblit fbcon_rotate fbcon_cw fbcon_ud fbcon_ccw softcursor fb fbdev ipt_MASQUERADE iptable_nat nf_nat_ipv4 msr nf_conntrack_ipv4 nf_defrag_ipv4 xt_state ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc tun ipv6 cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand cpufreq_conservative spadfs fuse hid_generic usbhid hid raid0 md_mod dmi_sysfs nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack lm85 hwmon_vid snd_usb_audio snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_hwdep snd_usbmidi_lib snd_rawmidi snd soundcore acpi_cpufreq freq_table mperf sata_svw serverworks kvm_amd ide_core ehci_pci ohci_hcd libata ehci_hcd kvm usbcore tg3 usb_common libphy k10temp pcspkr ptp i2c_piix4 i2c_core evdev microcode hwmon rtc_cmos pps_core e100 skge floppy mii processor button unix
    CPU: 0 PID: 2739 Comm: lvchange Tainted: G W 3.10.15-devel #14
    Hardware name: empty empty/S3992-E, BIOS 'V1.06 ' 06/09/2009
    0000000000000009 ffff88023c3c1ae8 ffffffff813c8fd4 ffff88023c3c1b20
    ffffffff810399eb ffff88043d35cd58 ffffffff81651940 ffff88023c3c1bf8
    ffffffff82479d90 0000000000000005 ffff88023c3c1b80 ffffffff81039a67
    Call Trace:
    [] dump_stack+0x19/0x1b
    [] warn_slowpath_common+0x6b/0xa0
    [] warn_slowpath_fmt+0x47/0x50
    [] ? debug_check_no_obj_freed+0xcf/0x250
    [] debug_print_object+0x85/0xa0
    [] debug_check_no_obj_freed+0x203/0x250
    [] kmem_cache_free+0x20c/0x3a0
    [] blk_alloc_queue_node+0x2a9/0x2c0
    [] blk_alloc_queue+0xe/0x10
    [] dm_create+0x1a3/0x530 [dm_mod]
    [] ? list_version_get_info+0xe0/0xe0 [dm_mod]
    [] dev_create+0x57/0x2b0 [dm_mod]
    [] ? list_version_get_info+0xe0/0xe0 [dm_mod]
    [] ? list_version_get_info+0xe0/0xe0 [dm_mod]
    [] ctl_ioctl+0x268/0x500 [dm_mod]
    [] ? get_lock_stats+0x22/0x70
    [] dm_ctl_ioctl+0xe/0x20 [dm_mod]
    [] do_vfs_ioctl+0x2ed/0x520
    [] ? fget_light+0x377/0x4e0
    [] SyS_ioctl+0x4b/0x90
    [] system_call_fastpath+0x1a/0x1f
    ---[ end trace 4b5ff0d55673d986 ]---
    ------------[ cut here ]------------

    This fix should be backported to stable kernels starting with 2.6.37. Note
    that in the kernels prior to 3.5 the affected code is different, but the
    bug is still there - bdi_init is called and bdi_destroy isn't.
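    The missing cleanup follows the usual goto-unwind pattern for allocation
    error paths; a self-contained model with counters in place of the real
    bdi_init/bdi_destroy (all names here are stand-ins):

```c
#include <stdbool.h>

static int bdi_inits, bdi_destroys;

static int bdi_init_stub(void)     { bdi_inits++; return 0; }
static void bdi_destroy_stub(void) { bdi_destroys++; }

/* Model of blk_alloc_queue_node(): every failure after bdi_init() must
 * unwind through bdi_destroy(), including blkcg_init_queue() failing. */
static bool alloc_queue_model(bool blkcg_init_fails)
{
    if (bdi_init_stub())
        return false;
    if (blkcg_init_fails)
        goto fail_bdi;          /* the previously missing unwind */
    return true;

fail_bdi:
    bdi_destroy_stub();
    return false;
}
```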

    Signed-off-by: Mikulas Patocka
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     

30 Nov, 2013

2 commits

  • commit d82ae52e68892338068e7559a0c0657193341ce4 upstream.

    Without this patch all DM devices will default to BLK_MAX_SEGMENT_SIZE
    (65536) even if the underlying device(s) have a larger value -- this is
    due to blk_stack_limits() using min_not_zero() when stacking the
    max_segment_size limit.

    underlying device's max_segment_size:
    1073741824

    before patch:
    65536

    after patch:
    1073741824
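    The helper at fault is min_not_zero(): zero means "unset", so any
    non-zero stacking default wins the comparison. Written out as plain C (a
    sketch; the kernel defines this as a macro), assuming the fix raises the
    stacking default to UINT_MAX so the underlying device's value survives:

```c
#include <stdint.h>

static uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }

/* "min, except zero means unset and loses to any non-zero value" */
static uint32_t min_not_zero(uint32_t a, uint32_t b)
{
    if (a == 0) return b;
    if (b == 0) return a;
    return min_u32(a, b);
}
```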

    Reported-by: Lukasz Flis
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Mike Snitzer
     
  • commit 4912aa6c11e6a5d910264deedbec2075c6f1bb73 upstream.

    crocode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ioatdma dca be2net sg ses enclosure ext4 mbcache jbd2 sd_mod crc_t10dif ahci megaraid_sas(U) dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

    Pid: 491, comm: scsi_eh_0 Tainted: G W ---------------- 2.6.32-220.13.1.el6.x86_64 #1 IBM -[8722PAX]-/00D1461
    RIP: 0010:[] [] blk_requeue_request+0x94/0xa0
    RSP: 0018:ffff881057eefd60 EFLAGS: 00010012
    RAX: ffff881d99e3e8a8 RBX: ffff881d99e3e780 RCX: ffff881d99e3e8a8
    RDX: ffff881d99e3e8a8 RSI: ffff881d99e3e780 RDI: ffff881d99e3e780
    RBP: ffff881057eefd80 R08: ffff881057eefe90 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffff881057f92338
    R13: 0000000000000000 R14: ffff881057f92338 R15: ffff883058188000
    FS: 0000000000000000(0000) GS:ffff880040200000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
    CR2: 00000000006d3ec0 CR3: 000000302cd7d000 CR4: 00000000000406b0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process scsi_eh_0 (pid: 491, threadinfo ffff881057eee000, task ffff881057e29540)
    Stack:
    0000000000001057 0000000000000286 ffff8810275efdc0 ffff881057f16000
    ffff881057eefdd0 ffffffff81362323 ffff881057eefe20 ffffffff8135f393
    ffff881057e29af8 ffff8810275efdc0 ffff881057eefe78 ffff881057eefe90
    Call Trace:
    [] __scsi_queue_insert+0xa3/0x150
    [] ? scsi_eh_ready_devs+0x5e3/0x850
    [] scsi_queue_insert+0x13/0x20
    [] scsi_eh_flush_done_q+0x104/0x160
    [] scsi_error_handler+0x35b/0x660
    [] ? scsi_error_handler+0x0/0x660
    [] kthread+0x96/0xa0
    [] child_rip+0xa/0x20
    [] ? kthread+0x0/0xa0
    [] ? child_rip+0x0/0x20
    Code: 00 00 eb d1 4c 8b 2d 3c 8f 97 00 4d 85 ed 74 bf 49 8b 45 00 49 83 c5 08 48 89 de 4c 89 e7 ff d0 49 8b 45 00 48 85 c0 75 eb eb a4 0b eb fe 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00
    RIP [] blk_requeue_request+0x94/0xa0
    RSP

    The RIP is this line:
    BUG_ON(blk_queued_rq(rq));

    After digging through the code, I think there may be a race between the
    request completion and the timer handler running.

    A timer is started for each request put on the device's queue (see
    blk_start_request->blk_add_timer). If the request does not complete
    before the timer expires, the timer handler (blk_rq_timed_out_timer)
    will mark the request complete atomically:

    static inline int blk_mark_rq_complete(struct request *rq)
    {
            return test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
    }

    and then call blk_rq_timed_out. The latter function will call
    scsi_times_out, which will return one of BLK_EH_HANDLED,
    BLK_EH_RESET_TIMER or BLK_EH_NOT_HANDLED. If BLK_EH_RESET_TIMER is
    returned, blk_clear_rq_complete is called, and blk_add_timer is again
    called to simply wait longer for the request to complete.

    Now, if the request happens to complete while this is going on, what
    happens? Given that we know the completion handler will bail if it
    finds the REQ_ATOM_COMPLETE bit set, we need to focus on the completion
    handler running after that bit is cleared. So, from the above
    paragraph, after the call to blk_clear_rq_complete. If the completion
    sets REQ_ATOM_COMPLETE before the BUG_ON in blk_add_timer, we go boom
    there (I haven't seen this in the cores). Next, if we get the
    completion before the call to list_add_tail, then the timer will
    eventually fire for an old req, which may either be freed or reallocated
    (there is evidence that this might be the case). Finally, if the
    completion comes in *after* the addition to the timeout list, I think
    it's harmless. The request will be removed from the timeout list,
    req_atom_complete will be set, and all will be well.

    This will only actually explain the coredumps *IF* the request
    structure was freed, reallocated *and* queued before the error handler
    thread had a chance to process it. That is possible, but it may make
    sense to keep digging for another race. I think that if this is what
    was happening, we would see other instances of this problem showing up
    as null pointer or garbage pointer dereferences, for example when the
    request structure was not re-used. It looks like we actually do run
    into that situation in other reports.

    This patch moves the BUG_ON(test_bit(REQ_ATOM_COMPLETE,
    &req->atomic_flags)); from blk_add_timer to the only caller that could
    trip over it (blk_start_request). It then inverts the calls to
    blk_clear_rq_complete and blk_add_timer in blk_rq_timed_out to address
    the race. I've boot tested this patch, but nothing more.

    Signed-off-by: Jeff Moyer
    Acked-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jeff Moyer
     

17 Oct, 2013

1 commit

  • In commit 27a7c642174e ("partitions/efi: account for pmbr size in lba")
    we started treating bad sizes in lba field of the partition that has the
    0xEE (GPT protective) as errors.

    However, we may run into these "bad sizes" in the real world if someone
    uses dd to copy an image from a smaller disk to a bigger disk. Since
    this case used to work (even without using force_gpt), keep it working
    and treat the size mismatch as a warning instead of an error.

    Reported-by: Josh Triplett
    Reported-by: Sean Paul
    Signed-off-by: Doug Anderson
    Reviewed-by: Josh Triplett
    Acked-by: Davidlohr Bueso
    Tested-by: Artem Bityutskiy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Doug Anderson
     

01 Oct, 2013

1 commit

  • Recently commit bab55417b10c ("block: support embedded device command
    line partition") introduced CONFIG_CMDLINE_PARSER. However, that name
    is too generic and sounds like it enables/disables generic kernel boot
    arg processing, when it really is block specific.

    Before this option becomes a part of a full/final release, add the BLK_
    prefix to it so that it is clear in absence of any other context that it
    is block specific.

    In addition, fix up the following less critical items:
    - help text was not really at all helpful.
    - index file for Documentation was not updated
    - add the new arg to Documentation/kernel-parameters.txt
    - clarify wording in source comments

    Signed-off-by: Paul Gortmaker
    Cc: Jens Axboe
    Cc: Cai Zhiyong
    Cc: Wei Yongjun
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Gortmaker
     

23 Sep, 2013

2 commits

  • Pull block IO fixes from Jens Axboe:
    "After merge window, no new stuff this time only a collection of neatly
    confined and simple fixes"

    * 'for-3.12/core' of git://git.kernel.dk/linux-block:
    cfq: explicitly use 64bit divide operation for 64bit arguments
    block: Add nr_bios to block_rq_remap tracepoint
    If the queue is dying then we only call the rq->end_io callout. This
    leaves bios setup on the request, because the caller assumes when the
    blk_execute_rq_nowait/blk_execute_rq call has completed that the
    rq->bios have been cleaned up.
    bio-integrity: Fix use of bs->bio_integrity_pool after free
    blkcg: relocate root_blkg setting and clearing
    block: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node(...)
    block: trace all devices plug operation

    Linus Torvalds
     
    'samples' is a 64bit operand, but do_div()'s second parameter is 32bit.
    do_div() silently truncates the high 32 bits and the calculated result
    is invalid.

    If the low 32 bits of 'samples' are zero, then do_div() produces a
    kernel crash (division by zero).
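    The truncation is easy to demonstrate in plain userspace C; do_div()
    itself is a kernel macro, so its effective computation is written out
    here explicitly:

```c
#include <stdint.h>

/* What passing a 64-bit divisor to do_div() effectively computes. */
static uint64_t div_truncated(uint64_t n, uint64_t samples)
{
    uint32_t low = (uint32_t)samples;   /* high 32 bits silently dropped */
    return n / low;                      /* divides by zero when low == 0 */
}

/* What the fixed code computes via an explicit 64-by-64 divide
 * (div64_u64() in the kernel). */
static uint64_t div_full(uint64_t n, uint64_t samples)
{
    return n / samples;
}
```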

    Signed-off-by: Anatol Pomozov
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Anatol Pomozov
     

18 Sep, 2013

1 commit


15 Sep, 2013

1 commit

  • Matt found that commit 27a7c642174e ("partitions/efi: account for pmbr
    size in lba") caused his GPT formatted eMMC device not to boot. The
    reason is that this commit enforced Linux to always check the lesser of
    the whole disk or 2Tib for the pMBR size in LBA. While most disk
    partitioning tools out there create a pMBR with these characteristics,
    Microsoft does not, as it always sets the entry to the maximum 32-bit
    limitation - even though a drive may be smaller than that[1].

    Loosen this check and only verify that the size is either the whole disk
    or 0xFFFFFFFF. No tool in its right mind would set it to any value
    other than these.
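    The loosened check can be sketched as follows (the function name is
    illustrative; total_sectors is the whole-disk size in LBAs and
    size_in_lba is the field from the 0xEE protective entry):

```c
#include <stdbool.h>
#include <stdint.h>

/* Accept a protective MBR whose size-in-LBA field covers either the whole
 * disk (minus the pMBR sector itself, capped at 32 bits for >2TiB disks)
 * or the 0xFFFFFFFF maximum that Windows writes regardless of disk size. */
static bool pmbr_size_ok(uint32_t size_in_lba, uint64_t total_sectors)
{
    uint64_t whole = total_sectors - 1;

    if (whole > 0xFFFFFFFFULL)
        whole = 0xFFFFFFFFULL;
    return size_in_lba == (uint32_t)whole || size_in_lba == 0xFFFFFFFFU;
}
```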

    [1] http://thestarman.pcministry.com/asm/mbr/GPT.htm#GPTPT

    Reported-and-tested-by: Matt Porter
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

12 Sep, 2013

16 commits

  • With users of radix_tree_preload() run from interrupt (block/blk-ioc.c is
    one such possible user), the following race can happen:

    radix_tree_preload()
    ...
    radix_tree_insert()
    radix_tree_node_alloc()
    if (rtp->nr) {
    ret = rtp->nodes[rtp->nr - 1];

    <interrupt>
    ...
    radix_tree_preload()
    ...
    radix_tree_insert()
    radix_tree_node_alloc()
    if (rtp->nr) {
    ret = rtp->nodes[rtp->nr - 1];

    And we give out one radix tree node twice. That clearly results in radix
    tree corruption with different results (usually OOPS) depending on which
    two users of radix tree race.

    We fix the problem by making radix_tree_node_alloc() always allocate fresh
    radix tree nodes when in interrupt. Using preloading when in interrupt
    doesn't make sense since all the allocations have to be atomic anyway and
    we cannot steal nodes from process-context users because some users rely
    on radix_tree_insert() succeeding after radix_tree_preload().
    The in_interrupt() check is somewhat ugly but we cannot simply key off the
    passed gfp_mask as that is acquired from root_gfp_mask() and is thus the
    same for all preload users.

    Another part of the fix is to avoid node preallocation in
    radix_tree_preload() when the passed gfp_mask doesn't allow waiting. Again,
    preallocation in such a case doesn't make sense, and if preallocation
    happened in interrupt context we could possibly leak some allocated nodes.
    However, some users of radix_tree_preload() require the following
    radix_tree_insert() to succeed. To avoid unexpected effects for these
    users, radix_tree_preload() only warns if the passed gfp mask doesn't
    allow waiting, and we provide a new function, radix_tree_maybe_preload(),
    for those users which get a different gfp mask from different call sites
    and which are prepared to handle radix_tree_insert() failure.

    Signed-off-by: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Cc: Davidlohr Bueso
    Cc: Karel Zak
    Cc: Matt Fleming
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Trivial coding style cleanups - still plenty left.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Karel Zak
    Acked-by: Matt Fleming
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • I love emacs, but these settings for coding style are annoying when trying
    to open the efi.h file. More important, we already have checkpatch for
    that.

    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Karel Zak
    Acked-by: Matt Fleming
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • When verifying GPT header integrity, make sure that first usable LBA is
    smaller than last usable LBA.

    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Karel Zak
    Acked-by: Matt Fleming
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • The partition that has the 0xEE (GPT protective) OS type must have its
    size-in-LBA field set to the lesser of the size of the disk minus one,
    or 0xFFFFFFFF for larger disks.

    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Karel Zak
    Acked-by: Matt Fleming
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • One of the biggest problems with GPT is compatibility with older, non-GPT
    systems. The problem is addressed by creating hybrid MBRs, an extension,
    or variant, of the traditional protective MBR. These contain, apart from
    the 0xEE partition, up to three additional primary partitions that point
    to the same space marked by up to three GPT partitions. The result is that
    legacy OSes can see the three required MBR partitions and at the same time
    ignore the GPT-aware partitions that protect the GPT structures.

    While hybrid MBRs are hacks, workarounds and simply not part of the GPT
    standard, they do exist and we have no way around them. For instance, by
    default, OSX creates a hybrid scheme when using multi-OS booting.

    In order for Linux to properly discover protective MBRs, it must be made
    aware of devices that have hybrid MBRs. No functionality is changed by
    this patch, just a debug message informing the user of the MBR scheme that
    is being used.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Karel Zak
    Acked-by: Matt Fleming
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • When detecting a valid protective MBR, the Linux kernel isn't picky about
    the partition (1-4) the 0xEE is at, but, unlike other operating systems,
    it does require it to begin at the second sector (sector 1). This check,
    apart from not being enforced by UEFI, can cause Linux to fail to detect
    any *valid* partitions on the disk, and presents problems when dealing
    with hybrid MBRs[1].

    For compatibility reasons, if the first partition is hybridized, the 0xEE
    partition must be small enough to ensure that it only protects the GPT
    data structures - as opposed to the whole disk, as in a protective MBR.
    This problem is very well described by Rod Smith[1]: MBR-only
    partitioning programs (such as older versions of fdisk) can see some of
    the disk space as unallocated, thus defeating the purpose of the 0xEE
    partition's protection of the GPT data structures.

    By dropping this check, this patch enables Linux to be more flexible when
    probing for GPT disklabels.

    [1] http://www.rodsbooks.com/gdisk/hybrid.html#reactions

    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Karel Zak
    Acked-by: Matt Fleming
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Per the UEFI Specs 2.4, June 2013, the starting LBA of the partition that
    has the EFI GPT (0xEE) OS type must be set to 0x00000001 - this is
    obviously the LBA of the GPT Partition Header.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Karel Zak
    Acked-by: Matt Fleming
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • The kernel's GPT implementation currently uses the generic 'struct
    partition' type for dealing with legacy MBR partition records. While this
    is useful for disklabels that were designed for CHS addressing, such as
    msdos, it doesn't adapt well to newer standards that use LBA instead, such
    as GUID partition tables. Furthermore, these generic partition structures
    do not have all the required fields to properly follow the UEFI specs.

    While a CHS address can be translated to LBA, it's much simpler and
    cleaner to just replace the partition type. This patch adds a new
    'gpt_record' type that is fully compliant with EFI and will allow, in the
    next patches, adding more checks to properly verify a protective MBR,
    which is paramount to probing a device that makes use of GPT.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Karel Zak
    Acked-by: Matt Fleming
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • I found the following pattern that leads in to interesting findings:

    grep -r "ret.*|=.*__put_user" *
    grep -r "ret.*|=.*__get_user" *
    grep -r "ret.*|=.*__copy" *

    Since the __put_user() calls in compat_ioctl.c, ptrace compat, and signal
    compat appear in compat code, we can probably expect the kernel addresses
    not to be reachable in the lower 32-bit range, so I think they might not
    be exploitable.

    For the "__get_user" cases, I don't think those are exploitable: the worst
    that can happen is that the kernel will copy kernel memory into in-kernel
    buffers, and will fail immediately afterward.

    The alpha csum_partial_copy_from_user() seems to be missing the
    access_ok() check entirely. The fix is inspired by x86. This could
    lead to an information leak on alpha. I also noticed that many
    architectures map csum_partial_copy_from_user() to
    csum_partial_copy_generic(), but I wonder if the latter performs the
    access checks on every architecture.

    Signed-off-by: Mathieu Desnoyers
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Jens Axboe
    Cc: Oleg Nesterov
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathieu Desnoyers
     
  • Read the block device partition table from the command line. The
    partitions are used for fixed block devices (eMMC) on embedded devices.
    There is no MBR, which saves storage space. The bootloader can easily
    access data on the block device by absolute address. Users can easily
    change the partitions.

    This code references the MTD partition parser, source
    "drivers/mtd/cmdlinepart.c". For details on the partition syntax, see
    "Documentation/block/cmdline-partition.txt".
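
    For reference, the syntax described in that document is roughly as
    follows; the fragment below is a sketch from memory, and the file itself
    is the authoritative grammar:

```
blkdevparts=<blkdev-id>:<partdef>[,<partdef>][;<blkdev-id>:<partdef>...]
  <partdef> := <size>[@<offset>](part-name)

Example (hypothetical eMMC layout; "-" means the rest of the device):
blkdevparts=mmcblk0:1m(boot),512k(env),-(rootfs)
```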

    [akpm@linux-foundation.org: fix printk text]
    [yongjun_wei@trendmicro.com.cn: fix error return code in parse_parts()]
    Signed-off-by: Cai Zhiyong
    Cc: Karel Zak
    Cc: "Wanglin (Albert)"
    Cc: Marius Groeger
    Cc: David Woodhouse
    Cc: Jens Axboe
    Cc: Brian Norris
    Cc: Artem Bityutskiy
    Signed-off-by: Wei Yongjun
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cai Zhiyong
     
  • The usage of strict_strtoul() is not preferred, because strict_strtoul()
    is obsolete. Thus, kstrtoul() should be used.

    Signed-off-by: Jingoo Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jingoo Han
     
  • Hello, Jens.

    The original thread can be read from

    http://thread.gmane.org/gmane.linux.kernel.cgroups/8937

    While it leads to an oops, it only triggers under specific
    configurations which aren't common, so I don't think it's necessary to
    backport it through -stable; merging it during the coming merge
    window should be enough.

    Thanks!

    ----- 8< -----
    Currently, q->root_blkg and q->root_rl.blkg are set from
    blkcg_activate_policy() and cleared from blkg_destroy_all(). This
    doesn't necessarily coincide with the lifetime of the root blkcg_gq,
    leading to the following oops when blkcg is enabled but no policy is
    activated, because __blk_queue_next_rl() malfunctions expecting the
    root_blkg pointers to be set.

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] __wake_up_common+0x2b/0x90
    PGD 60f7a9067 PUD 60f4c9067 PMD 0
    Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
    gsmi: Log Shutdown Reason 0x03
    Modules linked in: act_mirred cls_tcindex cls_prioshift sch_dsmark xt_multiport iptable_mangle sata_mv elephant elephant_dev_num cdc_acm uhci_hcd ehci_hcd i2c_d
    CPU: 9 PID: 41382 Comm: iSCSI-write- Not tainted 3.11.0-dbg-DEV #19
    Hardware name: Intel XXX
    task: ffff88060d16eec0 ti: ffff88060d170000 task.ti: ffff88060d170000
    RIP: 0010:[] [] __wake_up_common+0x2b/0x90
    RSP: 0000:ffff88060d171818 EFLAGS: 00010096
    RAX: 0000000000000082 RBX: ffff880baa3dee60 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffff880baa3dee60
    RBP: ffff88060d171858 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000002 R12: ffff880baa3dee98
    R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000003
    FS: 00007f977cba6700(0000) GS:ffff880c79c60000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000000000000 CR3: 000000060f7a5000 CR4: 00000000000007e0
    Stack:
    0000000000000082 0000000000000000 ffff88060d171858 ffff880baa3dee60
    0000000000000082 0000000000000003 0000000000000000 0000000000000000
    ffff88060d171898 ffffffff810c7848 ffff88060d171888 ffff880bde4bc4b8
    Call Trace:
    [] __wake_up+0x48/0x70
    [] __blk_drain_queue+0x123/0x190
    [] blk_cleanup_queue+0xf5/0x210
    [] __scsi_remove_device+0x5a/0xd0
    [] scsi_remove_device+0x34/0x50
    [] scsi_remove_target+0x16b/0x220
    [] __iscsi_unbind_session+0xd1/0x1b0
    [] iscsi_remove_session+0xe2/0x1c0
    [] iscsi_destroy_session+0x16/0x60
    [] iscsi_session_teardown+0xd9/0x100
    [] iscsi_sw_tcp_session_destroy+0x5a/0xb0
    [] iscsi_if_rx+0x10e8/0x1560
    [] netlink_unicast+0x145/0x200
    [] netlink_sendmsg+0x303/0x410
    [] sock_sendmsg+0xa6/0xd0
    [] ___sys_sendmsg+0x38c/0x3a0
    [] ? fget_light+0x40/0x160
    [] ? fget_light+0x99/0x160
    [] ? fget_light+0x40/0x160
    [] __sys_sendmsg+0x49/0x90
    [] SyS_sendmsg+0x12/0x20
    [] system_call_fastpath+0x16/0x1b
    Code: 66 66 66 66 90 55 48 89 e5 41 57 41 89 f7 41 56 41 89 ce 41 55 41 54 4c 8d 67 38 53 48 83 ec 18 89 55 c4 48 8b 57 38 4c 89 45 c8 8b 2a 48 8d 42 e8 49

    Fix it by moving the q->root_blkg and q->root_rl.blkg setting to
    blkg_create() and the clearing to blkg_destroy() so that they are
    initialized when a root blkg is created and cleared when destroyed.

    Reported-and-tested-by: Anatol Pomozov
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Use the helper function instead of __GFP_ZERO.

    Signed-off-by: Joe Perches
    Signed-off-by: Jens Axboe

    Joe Perches
     
  • In blk_queue_bio(), if the plug list is empty, it calls blk_trace_plug.
    If a process deals with a single device, that's fine. But if a process
    deals with multiple devices, only the first device gets traced.
    Judging by request_count instead solves this problem.

    In addition, the comment is updated.

    Signed-off-by: Jianpeng Ma
    Signed-off-by: Jens Axboe

    Jianpeng Ma
     

04 Sep, 2013

1 commit

  • Pull cgroup updates from Tejun Heo:
    "A lot of activities on the cgroup front. Most changes aren't visible
    to userland at all at this point and are laying foundation for the
    planned unified hierarchy.

    - The biggest change is decoupling the lifetime management of css
    (cgroup_subsys_state) from that of cgroup's. Because controllers
    (cpu, memory, block and so on) will need to be dynamically enabled
    and disabled, css which is the association point between a cgroup
    and a controller may come and go dynamically across the lifetime of
    a cgroup. Till now, css's were created when the associated cgroup
    was created and stayed till the cgroup got destroyed.

    Assumptions around this tight coupling permeated through cgroup
    core and controllers. These assumptions are gradually removed,
    which constitutes the bulk of the patches, and the css destruction
    path is completely decoupled from the cgroup destruction path. Note
    that decoupling of the creation path is relatively easy on top of
    these changes and the patchset is pending for the next window.

    - cgroup has its own event mechanism, cgroup.event_control, which is
    only used by memcg. It is overly complex, trying to achieve high
    flexibility whose benefits seem dubious at best. Going forward,
    new events will simply generate a file-modified event and the
    existing mechanism is being made specific to memcg. This pull
    request contains preparatory patches for such a change.

    - Various fixes and cleanups"

    Fixed up conflict in kernel/cgroup.c as per Tejun.

    * 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (69 commits)
    cgroup: fix cgroup_css() invocation in css_from_id()
    cgroup: make cgroup_write_event_control() use css_from_dir() instead of __d_cgrp()
    cgroup: make cgroup_event hold onto cgroup_subsys_state instead of cgroup
    cgroup: implement CFTYPE_NO_PREFIX
    cgroup: make cgroup_css() take cgroup_subsys * instead and allow NULL subsys
    cgroup: rename cgroup_css_from_dir() to css_from_dir() and update its syntax
    cgroup: fix cgroup_write_event_control()
    cgroup: fix subsystem file accesses on the root cgroup
    cgroup: change cgroup_from_id() to css_from_id()
    cgroup: use css_get() in cgroup_create() to check CSS_ROOT
    cpuset: remove an unncessary forward declaration
    cgroup: RCU protect each cgroup_subsys_state release
    cgroup: move subsys file removal to kill_css()
    cgroup: factor out kill_css()
    cgroup: decouple cgroup_subsys_state destruction from cgroup destruction
    cgroup: replace cgroup->css_kill_cnt with ->nr_css
    cgroup: bounce cgroup_subsys_state ref kill confirmation to a work item
    cgroup: move cgroup->subsys[] assignment to online_css()
    cgroup: reorganize css init / exit paths
    cgroup: add __rcu modifier to cgroup->subsys[]
    ...

    Linus Torvalds
     

24 Aug, 2013

2 commits


09 Aug, 2013

1 commit

  • Previously, all css descendant iterators didn't include the origin
    (root of subtree) css in the iteration. The reasons were maintaining
    consistency with css_for_each_child() and that at the time of
    introduction more use cases needed skipping the origin anyway;
    however, given that css_is_descendant() considers self to be a
    descendant, omitting the origin css has become more confusing and
    looking at the accumulated use cases rather clearly indicates that
    including origin would result in simpler code overall.

    While this is a change which can easily lead to subtle bugs, cgroup
    API including the iterators has recently gone through major
    restructuring and no out-of-tree changes will be applicable without
    adjustments making this a relatively acceptable opportunity for this
    type of change.

    The conversions are mostly straight-forward. If the iteration block
    had explicit origin handling before or after, it's moved inside the
    iteration. If not, if (pos == origin) continue; is added. Some
    conversions add extra reference get/put around origin handling by
    consolidating origin handling and the rest. While the extra ref
    operations aren't strictly necessary, this shouldn't cause any
    noticeable difference.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Michal Hocko
    Cc: Jens Axboe
    Cc: Matt Helsley
    Cc: Johannes Weiner
    Cc: Balbir Singh

    Tejun Heo