09 Nov, 2013

3 commits

  • Signed-off-by: Jens Axboe

    Conflicts:
    block/blk-timeout.c

    Jens Axboe
     
  • This patch enables sysfs control of the I/O request merge
    functionality in the plug list. While this control has been
    implemented for the request queue, it was never wired up for the plug
    list. Therefore, the block layer merges requests (or attempts to
    merge them) even if merging was disabled via the sysfs nomerges
    parameter value 2.

    This limitation directly affects the functionality of the io_submit()
    system call. The system call enables a user to submit a batch of IO
    requests from user space using the struct iocb **ios input argument.
    However, the unconditional merging in the plug list potentially merges
    these requests together down the road. Therefore, there is no way to
    distinguish between an application sending a batch of sequential IOs
    and an application sending one big IO. Ultimately, all requests
    generated by the former app merge together within the plug list and
    end up looking like the latter.

    While merging is a desirable feature that improves the performance of
    the IO subsystem for some applications, it is of no use at all for
    other applications, such as ours.
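    As a rough sketch of the idea (not necessarily the literal patch), the
    plug-list merge attempt in blk_queue_bio() gains a nomerges check,
    assuming the 3.13-era helper names:

    if (!blk_queue_nomerges(q) &&
        blk_attempt_plug_merge(q, bio, &request_count))
            return;

    With this in place, 'echo 2 > /sys/block/<dev>/queue/nomerges'
    disables merging in the plug list as well, not just in the request
    queue.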

    Signed-off-by: Alireza Haghdoost
    Reviewed-by: Jeff Moyer

    Coding style modified.

    Signed-off-by: Jens Axboe

    Alireza Haghdoost
     
  • The soft lockup below happens at boot time on a system using dm
    multipath and udev rules to switch the scheduler.

    [ 356.127001] BUG: soft lockup - CPU#3 stuck for 22s! [sh:483]
    [ 356.127001] RIP: 0010:[] [] lock_timer_base.isra.35+0x1d/0x50
    ...
    [ 356.127001] Call Trace:
    [ 356.127001] [] try_to_del_timer_sync+0x20/0x70
    [ 356.127001] [] ? kmem_cache_alloc_node_trace+0x20a/0x230
    [ 356.127001] [] del_timer_sync+0x52/0x60
    [ 356.127001] [] cfq_exit_queue+0x32/0xf0
    [ 356.127001] [] elevator_exit+0x2f/0x50
    [ 356.127001] [] elevator_change+0xf1/0x1c0
    [ 356.127001] [] elv_iosched_store+0x20/0x50
    [ 356.127001] [] queue_attr_store+0x59/0xb0
    [ 356.127001] [] sysfs_write_file+0xc6/0x140
    [ 356.127001] [] vfs_write+0xbd/0x1e0
    [ 356.127001] [] SyS_write+0x49/0xa0
    [ 356.127001] [] system_call_fastpath+0x16/0x1b

    This is caused by a race between md device initialization by
    multipathd and a shell script switching the scheduler via sysfs.

    - multipathd:
    SyS_ioctl -> do_vfs_ioctl -> dm_ctl_ioctl -> ctl_ioctl -> table_load
    -> dm_setup_md_queue -> blk_init_allocated_queue -> elevator_init
    q->elevator = elevator_alloc(q, e); // not yet initialized

    - sh -c 'echo deadline > /sys/$DEVPATH/queue/scheduler':
    elevator_switch (in the call trace above)
    struct elevator_queue *old = q->elevator;
    q->elevator = elevator_alloc(q, new_e);
    elevator_exit(old); // lockup! (*)

    - multipathd: (cont.)
    err = e->ops.elevator_init_fn(q); // init fails; q->elevator is modified

    (*) When del_timer_sync() is called, lock_timer_base() will loop
    infinitely while timer->base == NULL. In this case, since the timer is
    never initialized, it results in a lockup.

    This patch introduces acquisition of q->sysfs_lock around
    elevator_init() in blk_init_allocated_queue(), to provide mutual
    exclusion between initialization of q->elevator and switching of the
    scheduler.
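    A minimal sketch of the resulting code in blk_init_allocated_queue(),
    assuming the 3.12-era locking scheme:

    /* init elevator */
    mutex_lock(&q->sysfs_lock);
    if (elevator_init(q, NULL)) {
        mutex_unlock(&q->sysfs_lock);
        return NULL;
    }
    mutex_unlock(&q->sysfs_lock);

    With q->sysfs_lock held, elevator_switch() (reached via the sysfs
    store path, which takes the same mutex) cannot observe the
    half-initialized elevator.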

    This should fix this bugzilla:
    https://bugzilla.redhat.com/show_bug.cgi?id=902012

    Signed-off-by: Tomoki Sekiyama
    Signed-off-by: Jens Axboe

    Tomoki Sekiyama
     

08 Nov, 2013

2 commits

  • If blkcg_init_queue fails, blk_alloc_queue_node doesn't call bdi_destroy
    to clean up structures allocated by the backing dev.

    ------------[ cut here ]------------
    WARNING: at lib/debugobjects.c:260 debug_print_object+0x85/0xa0()
    ODEBUG: free active (active state 0) object type: percpu_counter hint: (null)
    Modules linked in: dm_loop dm_mod ip6table_filter ip6_tables uvesafb cfbcopyarea cfbimgblt cfbfillrect fbcon font bitblit fbcon_rotate fbcon_cw fbcon_ud fbcon_ccw softcursor fb fbdev ipt_MASQUERADE iptable_nat nf_nat_ipv4 msr nf_conntrack_ipv4 nf_defrag_ipv4 xt_state ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc tun ipv6 cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand cpufreq_conservative spadfs fuse hid_generic usbhid hid raid0 md_mod dmi_sysfs nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack lm85 hwmon_vid snd_usb_audio snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_hwdep snd_usbmidi_lib snd_rawmidi snd soundcore acpi_cpufreq freq_table mperf sata_svw serverworks kvm_amd ide_core ehci_pci ohci_hcd libata ehci_hcd kvm usbcore tg3 usb_common libphy k10temp pcspkr ptp i2c_piix4 i2c_core evdev microcode hwmon rtc_cmos pps_core e100 skge floppy mii processor button unix
    CPU: 0 PID: 2739 Comm: lvchange Tainted: G W
    3.10.15-devel #14
    Hardware name: empty empty/S3992-E, BIOS 'V1.06 ' 06/09/2009
    0000000000000009 ffff88023c3c1ae8 ffffffff813c8fd4 ffff88023c3c1b20
    ffffffff810399eb ffff88043d35cd58 ffffffff81651940 ffff88023c3c1bf8
    ffffffff82479d90 0000000000000005 ffff88023c3c1b80 ffffffff81039a67
    Call Trace:
    [] dump_stack+0x19/0x1b
    [] warn_slowpath_common+0x6b/0xa0
    [] warn_slowpath_fmt+0x47/0x50
    [] ? debug_check_no_obj_freed+0xcf/0x250
    [] debug_print_object+0x85/0xa0
    [] debug_check_no_obj_freed+0x203/0x250
    [] kmem_cache_free+0x20c/0x3a0
    [] blk_alloc_queue_node+0x2a9/0x2c0
    [] blk_alloc_queue+0xe/0x10
    [] dm_create+0x1a3/0x530 [dm_mod]
    [] ? list_version_get_info+0xe0/0xe0 [dm_mod]
    [] dev_create+0x57/0x2b0 [dm_mod]
    [] ? list_version_get_info+0xe0/0xe0 [dm_mod]
    [] ? list_version_get_info+0xe0/0xe0 [dm_mod]
    [] ctl_ioctl+0x268/0x500 [dm_mod]
    [] ? get_lock_stats+0x22/0x70
    [] dm_ctl_ioctl+0xe/0x20 [dm_mod]
    [] do_vfs_ioctl+0x2ed/0x520
    [] ? fget_light+0x377/0x4e0
    [] SyS_ioctl+0x4b/0x90
    [] system_call_fastpath+0x1a/0x1f
    ---[ end trace 4b5ff0d55673d986 ]---
    ------------[ cut here ]------------

    This fix should be backported to stable kernels starting with 2.6.37.
    Note that in kernels prior to 3.5 the affected code is different, but
    the bug is still there: bdi_init is called and bdi_destroy isn't.
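    A sketch of the missing error path in blk_alloc_queue_node(), assuming
    the label names used in that era:

    if (blkcg_init_queue(q))
        goto fail_bdi;

    return q;

    fail_bdi:
        bdi_destroy(&q->backing_dev_info);
    fail_id:
        ida_simple_remove(&blk_queue_ida, q->id);
    fail_q:
        kmem_cache_free(blk_requestq_cachep, q);
        return NULL;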

    Signed-off-by: Mikulas Patocka
    Acked-by: Tejun Heo
    Cc: stable@kernel.org # 2.6.37+
    Signed-off-by: Jens Axboe

    Mikulas Patocka
     
  • crocode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ioatdma dca be2net sg ses enclosure ext4 mbcache jbd2 sd_mod crc_t10dif ahci megaraid_sas(U) dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

    Pid: 491, comm: scsi_eh_0 Tainted: G W ---------------- 2.6.32-220.13.1.el6.x86_64 #1 IBM -[8722PAX]-/00D1461
    RIP: 0010:[] [] blk_requeue_request+0x94/0xa0
    RSP: 0018:ffff881057eefd60 EFLAGS: 00010012
    RAX: ffff881d99e3e8a8 RBX: ffff881d99e3e780 RCX: ffff881d99e3e8a8
    RDX: ffff881d99e3e8a8 RSI: ffff881d99e3e780 RDI: ffff881d99e3e780
    RBP: ffff881057eefd80 R08: ffff881057eefe90 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffff881057f92338
    R13: 0000000000000000 R14: ffff881057f92338 R15: ffff883058188000
    FS: 0000000000000000(0000) GS:ffff880040200000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
    CR2: 00000000006d3ec0 CR3: 000000302cd7d000 CR4: 00000000000406b0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process scsi_eh_0 (pid: 491, threadinfo ffff881057eee000, task ffff881057e29540)
    Stack:
    0000000000001057 0000000000000286 ffff8810275efdc0 ffff881057f16000
    ffff881057eefdd0 ffffffff81362323 ffff881057eefe20 ffffffff8135f393
    ffff881057e29af8 ffff8810275efdc0 ffff881057eefe78 ffff881057eefe90
    Call Trace:
    [] __scsi_queue_insert+0xa3/0x150
    [] ? scsi_eh_ready_devs+0x5e3/0x850
    [] scsi_queue_insert+0x13/0x20
    [] scsi_eh_flush_done_q+0x104/0x160
    [] scsi_error_handler+0x35b/0x660
    [] ? scsi_error_handler+0x0/0x660
    [] kthread+0x96/0xa0
    [] child_rip+0xa/0x20
    [] ? kthread+0x0/0xa0
    [] ? child_rip+0x0/0x20
    Code: 00 00 eb d1 4c 8b 2d 3c 8f 97 00 4d 85 ed 74 bf 49 8b 45 00 49 83 c5 08 48 89 de 4c 89 e7 ff d0 49 8b 45 00 48 85 c0 75 eb eb a4 0b eb fe 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00
    RIP [] blk_requeue_request+0x94/0xa0
    RSP

    The RIP is this line:
    BUG_ON(blk_queued_rq(rq));

    After digging through the code, I think there may be a race between the
    request completion and the timer handler running.

    A timer is started for each request put on the device's queue (see
    blk_start_request->blk_add_timer). If the request does not complete
    before the timer expires, the timer handler (blk_rq_timed_out_timer)
    will mark the request complete atomically:

    static inline int blk_mark_rq_complete(struct request *rq)
    {
            return test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
    }

    and then call blk_rq_timed_out. The latter function will call
    scsi_times_out, which will return one of BLK_EH_HANDLED,
    BLK_EH_RESET_TIMER or BLK_EH_NOT_HANDLED. If BLK_EH_RESET_TIMER is
    returned, blk_clear_rq_complete is called, and blk_add_timer is again
    called to simply wait longer for the request to complete.

    Now, if the request happens to complete while this is going on, what
    happens? Given that we know the completion handler will bail if it
    finds the REQ_ATOM_COMPLETE bit set, we need to focus on the
    completion handler running after that bit is cleared, i.e. after the
    call to blk_clear_rq_complete described above. If the completion sets
    REQ_ATOM_COMPLETE before the BUG_ON in blk_add_timer, we go boom there
    (I haven't seen this in the cores). Next, if we get the completion
    before the call to list_add_tail, then the timer will eventually fire
    for an old req, which may either be freed or reallocated (there is
    evidence that this might be the case). Finally, if the completion
    comes in *after* the addition to the timeout list, I think it's
    harmless. The request will be removed from the timeout list,
    REQ_ATOM_COMPLETE will be set, and all will be well.

    This will only actually explain the coredumps *IF* the request
    structure was freed, reallocated *and* queued before the error handler
    thread had a chance to process it. That is possible, but it may make
    sense to keep digging for another race. I think that if this is what
    was happening, we would see other instances of this problem showing up
    as null pointer or garbage pointer dereferences, for example when the
    request structure was not re-used. It looks like we actually do run
    into that situation in other reports.

    This patch moves the BUG_ON(test_bit(REQ_ATOM_COMPLETE,
    &req->atomic_flags)); from blk_add_timer to the only caller that could
    trip over it (blk_start_request). It then inverts the calls to
    blk_clear_rq_complete and blk_add_timer in blk_rq_timed_out to address
    the race. I've boot tested this patch, but nothing more.
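    In code form, the inversion in blk_rq_timed_out() is roughly:

    case BLK_EH_RESET_TIMER:
        blk_add_timer(req);
        blk_clear_rq_complete(req);
        break;

    so that REQ_ATOM_COMPLETE is only cleared once the request is safely
    back on the timeout list, closing the window described above.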

    Signed-off-by: Jeff Moyer
    Acked-by: Hannes Reinecke
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Jeff Moyer
     

29 Oct, 2013

1 commit

  • The flush state machine takes in a struct request, which is then
    submitted multiple times to the underlying driver. The old block code
    requeues the same request for each of those, so it does not have an
    issue with tapping into the request pool. The new one, on the other
    hand, allocates a new request for each of the actual steps of the
    flush sequence. If we have already allocated all of the tags for IO,
    we will fail allocating the flush request.

    Set aside a reserved request just for flushes.
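    Illustrative only (using the blk_mq_tag_set fields as they later
    stabilized; the depth is hypothetical), the reservation looks like:

    static struct blk_mq_tag_set my_tag_set = {
        .nr_hw_queues  = 1,
        .queue_depth   = 64,
        .reserved_tags = 1, /* keep one tag back so a flush can proceed */
    };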

    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

25 Oct, 2013

3 commits

  • Linux currently has two models for block devices:

    - The classic request_fn based approach, where drivers use struct
    request units for IO. The block layer provides various helper
    functionalities to let drivers share code, things like tag
    management, timeout handling, queueing, etc.

    - The "stacked" approach, where a driver squeezes in between the
    block layer and IO submitter. Since this bypasses the IO stack,
    driver generally have to manage everything themselves.

    With drivers being written for new high IOPS devices, the classic
    request_fn based driver doesn't work well enough. The design dates
    back to when both SMP and high IOPS was rare. It has problems with
    scaling to bigger machines, and runs into scaling issues even on
    smaller machines when you have IOPS in the hundreds of thousands
    per device.

    The stacked approach is then most often selected as the model
    for the driver. But this means that everybody has to re-invent
    everything, and along with that we get all the problems again
    that the shared approach solved.

    This commit introduces blk-mq, block multi queue support. The
    design is centered around per-cpu queues for queueing IO, which
    then funnel down into some number of hardware submission queues.
    We might have a 1:1 mapping between the two, or it might be
    an N:M mapping. That all depends on what the hardware supports.

    blk-mq provides various helper functions, which include:

    - Scalable support for request tagging. Most devices need to
    be able to uniquely identify a request both in the driver and
    to the hardware. The tagging uses per-cpu caches for freed
    tags, to enable cache hot reuse.

    - Timeout handling without tracking requests on a per-device
    basis. Basically, the driver should be able to get a notification
    if a request happens to fail.

    - Optional support for non 1:1 mappings between issue and
    submission queues. blk-mq can redirect IO completions to the
    desired location.

    - Support for per-request payloads. Drivers almost always need
    to associate a request structure with some driver private
    command structure. Drivers can tell blk-mq this at init time,
    and then any request handed to the driver will have the
    required size of memory associated with it.

    - Support for merging of IO, and plugging. The stacked model
    gets neither of these. Even for high IOPS devices, merging
    sequential IO reduces per-command overhead and thus
    increases bandwidth.

    For now, this is provided as a potential 3rd queueing model, with
    the hope being that, as it matures, it can replace both the classic
    and stacked model. That would get us back to having just 1 real
    model for block devices, leaving the stacked approach to dm/md
    devices (as it was originally intended).
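    To make the tagging and per-request payload model concrete, here is a
    minimal driver-side sketch against the blk-mq API as it later
    stabilized (my_cmd and the hardware submission step are hypothetical):

    struct my_cmd {
        int status; /* per-request driver payload, sized via cmd_size */
    };

    static blk_status_t my_queue_rq(struct blk_mq_hw_ctx *hctx,
                                    const struct blk_mq_queue_data *bd)
    {
        struct request *rq = bd->rq;
        struct my_cmd *cmd = blk_mq_rq_to_pdu(rq);

        blk_mq_start_request(rq);
        cmd->status = 0;
        /* submit rq to hardware here; on completion: */
        blk_mq_end_request(rq, BLK_STS_OK);
        return BLK_STS_OK;
    }

    static const struct blk_mq_ops my_mq_ops = {
        .queue_rq = my_queue_rq,
    };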

    Contributions in this patch from the following people:

    Shaohua Li
    Alexander Gordeev
    Christoph Hellwig
    Mike Christie
    Matias Bjorling
    Jeff Moyer

    Acked-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This reference count has been around since before git history, but the
    only place where it's used is in blk_execute_rq, and there it is
    entirely useless, as it is incremented before submitting the request
    and decremented in the end_io handler before waking up the submitter
    thread.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • We have officially run out of flags in a 32-bit space. Extend it
    to 64-bit even on 32-bit archs.
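    A sketch of the widening (the surrounding struct layout is elided):

    struct request {
        /* ... */
        u64 cmd_flags; /* was: unsigned int cmd_flags; */
        /* ... */
    };

    #define REQ_WRITE (1ULL << __REQ_WRITE) /* flag macros become 64-bit */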

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     

23 Sep, 2013

1 commit

  • Pull block IO fixes from Jens Axboe:
    "After merge window, no new stuff this time only a collection of neatly
    confined and simple fixes"

    * 'for-3.12/core' of git://git.kernel.dk/linux-block:
    cfq: explicitly use 64bit divide operation for 64bit arguments
    block: Add nr_bios to block_rq_remap tracepoint
    If the queue is dying then we only call the rq->end_io callout. This leaves bios setup on the request, because the caller assumes when the blk_execute_rq_nowait/blk_execute_rq call has completed that the rq->bios have been cleaned up.
    bio-integrity: Fix use of bs->bio_integrity_pool after free
    blkcg: relocate root_blkg setting and clearing
    block: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node(...)
    block: trace all devices plug operation

    Linus Torvalds
     

12 Sep, 2013

1 commit

  • In blk_queue_bio(), if the plug list is empty, blk_trace_plug is
    called. If the process deals with a single device, that is OK. But if
    the process deals with multiple devices, only the first device is
    traced. Judging by request_count instead solves this problem.

    In addition, the comment is updated.
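    A sketch of the change in blk_queue_bio(), assuming the tracepoint is
    trace_block_plug():

    /* was: if (list_empty(&plug->list)) trace_block_plug(q); */
    if (!request_count)
        trace_block_plug(q);

    Since request_count only counts requests queued for the current
    device's queue, each device on the plug list now gets its plug event.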

    Signed-off-by: Jianpeng Ma
    Signed-off-by: Jens Axboe

    Jianpeng Ma
     

04 Jul, 2013

1 commit

  • Pull s390 updates from Martin Schwidefsky:
    "This is the bulk of the s390 patches for the 3.11 merge window.

    Notable enhancements are: the block timeout patches for dasd from
    Hannes, and more work on the PCI support front. In addition some
    cleanup and the usual bug fixing."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (42 commits)
    s390/dasd: Fail all requests when DASD_FLAG_ABORTIO is set
    s390/dasd: Add 'timeout' attribute
    block: check for timeout function in blk_rq_timed_out()
    block/dasd: detailed I/O errors
    s390/dasd: Reduce amount of messages for specific errors
    s390/dasd: Implement block timeout handling
    s390/dasd: process all requests in the device tasklet
    s390/dasd: make number of retries configurable
    s390/dasd: Clarify comment
    s390/hwsampler: Updated misleading member names in hws_data_entry
    s390/appldata_net_sum: do not use static data
    s390/appldata_mem: do not use static data
    s390/vmwatchdog: do not use static data
    s390/airq: simplify adapter interrupt code
    s390/pci: remove per device debug attribute
    s390/dma: remove gratuitous brackets
    s390/facility: decompose test_facility()
    s390/sclp: remove duplicated include from sclp_ctl.c
    s390/irq: store interrupt information in pt_regs
    s390/drivers: Cocci spatch "ptr_ret.spatch"
    ...

    Linus Torvalds
     

03 Jul, 2013

1 commit

  • Pull workqueue changes from Tejun Heo:
    "Surprisingly, Lai and I didn't break too many things implementing
    custom pools and stuff last time around and there aren't any follow-up
    changes necessary at this point.

    The only change in this pull request is Viresh's patches to make some
    per-cpu workqueues to behave as unbound workqueues dependent on a boot
    param whose default can be configured via a config option. This leads
    to higher processing overhead / lower bandwidth as more work items are
    bounced across CPUs; however, it can lead to noticeable powersave in
    certain configurations - ~10% w/ idlish constant workload on a
    big.LITTLE configuration according to Viresh.

    This is because per-cpu workqueues interfere with how the scheduler
    perceives whether or not each CPU is idle by forcing pinned tasks on
    them, which makes the scheduler's power-aware scheduling decisions
    less effective.

    Its effectiveness is likely less pronounced on homogenous
    configurations and this type of optimization can probably be made
    automatic; however, the changes are pretty minimal and the affected
    workqueues are clearly marked, so it's an easy gain for some
    configurations for the time being with pretty unintrusive changes."

    * 'for-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    fbcon: queue work on power efficient wq
    block: queue work on power efficient wq
    PHYLIB: queue work on system_power_efficient_wq
    workqueue: Add system wide power_efficient workqueues
    workqueues: Introduce new flag WQ_POWER_EFFICIENT for power oriented workqueues

    Linus Torvalds
     

01 Jul, 2013

1 commit

  • The DASD driver is using FASTFAIL as an equivalent to the
    transport errors in SCSI. And the 'steal lock' function maps
    roughly to a reservation error. So we should be returning the
    appropriate error codes when completing a request.

    Acked-by: Jens Axboe
    Signed-off-by: Hannes Reinecke
    Signed-off-by: Stefan Weinhuber
    Signed-off-by: Martin Schwidefsky

    Hannes Reinecke
     

17 May, 2013

1 commit

  • In blk_post_runtime_resume, an autosuspend request will be initiated for
    the device. Since we are holding the queue lock, we can't sleep and thus
    we should use the async version to initiate an autosuspend, i.e.
    pm_request_suspend instead of pm_runtime_suspend, which might sleep.

    Signed-off-by: Aaron Lu
    Signed-off-by: Jens Axboe

    Aaron Lu
     

15 May, 2013

1 commit

  • The block layer uses workqueues for multiple purposes. There is no
    real dependency on scheduling these on the cpu which queued them.

    On an idle system, it is observed that an idle cpu wakes up many times
    just to service this work. It would be better if we could schedule it
    on a cpu which the scheduler believes to be the most appropriate one.

    This patch replaces the normal workqueues with power efficient
    versions.
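    A sketch of the kind of change involved, using kblockd as the example:

    kblockd_workqueue = alloc_workqueue("kblockd",
                            WQ_MEM_RECLAIM | WQ_POWER_EFFICIENT, 0);

    With WQ_POWER_EFFICIENT set (and the workqueue.power_efficient boot
    param enabled), the workqueue behaves as unbound, letting the
    scheduler place the work on whichever cpu is most appropriate.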

    Cc: Jens Axboe
    Signed-off-by: Viresh Kumar
    Signed-off-by: Tejun Heo

    Viresh Kumar
     

09 May, 2013

1 commit

  • Pull block core updates from Jens Axboe:

    - Major bit is Kent's prep work for immutable bio vecs.

    - Stable candidate fix for a scheduling-while-atomic in the queue
    bypass operation.

    - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
    discard bios.

    - Tejun's changes to convert the writeback thread pool to the generic
    workqueue mechanism.

    - Runtime PM framework; SCSI patches exist on top of these in James'
    tree.

    - A few random fixes.

    * 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
    relay: move remove_buf_file inside relay_close_buf
    partitions/efi.c: replace useless kzalloc's by kmalloc's
    fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
    block: fix max discard sectors limit
    blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
    Documentation: cfq-iosched: update documentation help for cfq tunables
    writeback: expose the bdi_wq workqueue
    writeback: replace custom worker pool implementation with unbound workqueue
    writeback: remove unused bdi_pending_list
    aoe: Fix unitialized var usage
    bio-integrity: Add explicit field for owner of bip_buf
    block: Add an explicit bio flag for bios that own their bvec
    block: Add bio_alloc_pages()
    block: Convert some code to bio_for_each_segment_all()
    block: Add bio_for_each_segment_all()
    bounce: Refactor __blk_queue_bounce to not use bi_io_vec
    raid1: use bio_copy_data()
    pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
    pktcdvd: use bio_copy_data()
    block: Add bio_copy_data()
    ...

    Linus Torvalds
     

19 Apr, 2013

1 commit

  • This reverts commit 3a366e614d0837d9fc23f78cdb1a1186ebc3387f.

    Wanlong Gao reports that it causes a kernel panic on his machine several
    minutes after boot. Reverting it removes the panic.

    Jens says:
    "It's not quite clear why that is yet, so I think we should just revert
    the commit for 3.9 final (which I'm assuming is pretty close).

    The wifi is crap at the LSF hotel, so sending this email instead of
    queueing up a revert and pull request."

    Reported-by: Wanlong Gao
    Requested-by: Jens Axboe
    Cc: Tejun Heo
    Cc: Steven Rostedt
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

24 Mar, 2013

2 commits

  • Just a little convenience macro - main reason to add it now is preparing
    for immutable bio vecs, it'll reduce the size of the patch that puts
    bi_sector/bi_size/bi_idx into a struct bvec_iter.
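    The macro itself is not quoted in this log; assuming it is
    bio_end_sector() with the pre-bvec_iter field names, it would read:

    #define bio_end_sector(bio) ((bio)->bi_sector + bio_sectors(bio))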

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: Lars Ellenberg
    CC: Jiri Kosina
    CC: Alasdair Kergon
    CC: dm-devel@redhat.com
    CC: Neil Brown
    CC: Martin Schwidefsky
    CC: Heiko Carstens
    CC: linux-s390@vger.kernel.org
    CC: Chris Mason
    CC: Steven Whitehouse
    Acked-by: Steven Whitehouse

    Kent Overstreet
     
  • Converts it to use bio_advance(), simplifying it quite a bit in the
    process.

    Note that req_bio_endio() now always calls bio_advance() - which means
    it always loops over the biovec, not just on partial completions. Don't
    expect it to affect performance, but worth noting.

    Tested it by forcing partial updates, and dumping before and after on
    various bio/bvec fields when doing a partial update.
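    A sketch of the simplified shape of req_bio_endio() after the
    conversion, using the 3.10-era field names (error bookkeeping elided):

    static void req_bio_endio(struct request *rq, struct bio *bio,
                              unsigned int nbytes, int error)
    {
        /* ... error bookkeeping ... */

        bio_advance(bio, nbytes);

        /* don't actually finish bio if it's part of flush sequence */
        if (bio->bi_size == 0 && !(rq->cmd_flags & REQ_FLUSH_SEQ))
            bio_endio(bio, error);
    }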

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe

    Kent Overstreet
     

23 Mar, 2013

2 commits

  • When a request is added:
    If the device is suspended or is suspending and the request is not a
    PM request, resume the device.

    When the last request finishes:
    Call pm_runtime_mark_last_busy().

    When picking a request:
    If the device is resuming/suspending, then only PM requests are
    allowed to go.
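    For the first rule, a sketch of the check on the request-add path,
    assuming the helper is named blk_pm_add_request():

    static void blk_pm_add_request(struct request_queue *q,
                                   struct request *rq)
    {
        if (q->dev && !(rq->cmd_flags & REQ_PM) && q->nr_pending++ == 0 &&
            (q->rpm_status == RPM_SUSPENDED ||
             q->rpm_status == RPM_SUSPENDING))
            pm_request_resume(q->dev);
    }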

    The idea and API is designed by Alan Stern and described here:
    http://marc.info/?l=linux-scsi&m=133727953625963&w=2

    Signed-off-by: Lin Ming
    Signed-off-by: Aaron Lu
    Acked-by: Alan Stern
    Signed-off-by: Jens Axboe

    Lin Ming
     
  • Add runtime pm helper functions:

    void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
    - Initialization function for drivers to call.

    int blk_pre_runtime_suspend(struct request_queue *q)
    - If any requests are in the queue, mark last busy and return -EBUSY.
    Otherwise set q->rpm_status to RPM_SUSPENDING and return 0.

    void blk_post_runtime_suspend(struct request_queue *q, int err)
    - If the suspend succeeded then set q->rpm_status to RPM_SUSPENDED.
    Otherwise set it to RPM_ACTIVE and mark last busy.

    void blk_pre_runtime_resume(struct request_queue *q)
    - Set q->rpm_status to RPM_RESUMING.

    void blk_post_runtime_resume(struct request_queue *q, int err)
    - If the resume succeeded then set q->rpm_status to RPM_ACTIVE
    and call __blk_run_queue, then mark last busy and autosuspend.
    Otherwise set q->rpm_status to RPM_SUSPENDED.

    The idea and API is designed by Alan Stern and described here:
    http://marc.info/?l=linux-scsi&m=133727953625963&w=2
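    A sketch of driver-side usage, with my_dev_to_queue() and
    my_hw_suspend() standing in for the driver's own (hypothetical)
    pieces:

    static int my_runtime_suspend(struct device *dev)
    {
        struct request_queue *q = my_dev_to_queue(dev); /* hypothetical */
        int err;

        err = blk_pre_runtime_suspend(q);
        if (err)
            return err; /* requests still pending: -EBUSY */

        err = my_hw_suspend(dev); /* hypothetical */
        blk_post_runtime_suspend(q, err);
        return err;
    }

    The driver calls blk_pm_runtime_init(q, dev) once at probe time to
    hook the queue up to runtime PM.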

    Signed-off-by: Lin Ming
    Signed-off-by: Aaron Lu
    Acked-by: Alan Stern
    Signed-off-by: Jens Axboe

    Lin Ming
     

01 Mar, 2013

1 commit

  • Pull block IO core bits from Jens Axboe:
    "Below are the core block IO bits for 3.9. It was delayed a few days
    since my workstation kept crashing every 2-8h after pulling it into
    current -git, but turns out it is a bug in the new pstate code (divide
    by zero, will report separately). In any case, it contains:

    - The big cfq/blkcg update from Tejun and Vivek.

    - Additional block and writeback tracepoints from Tejun.

    - Improvement of the should sort (based on queues) logic in the plug
    flushing.

    - _io() variants of the wait_for_completion() interface, using
    io_schedule() instead of schedule() to contribute to io wait
    properly.

    - Various little fixes.

    You'll get two trivial merge conflicts, which should be easy enough to
    fix up"

    Fix up the trivial conflicts due to hlist traversal cleanups (commit
    b67bfe0d42ca: "hlist: drop the node parameter from iterators").

    * 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
    block: remove redundant check to bd_openers()
    block: use i_size_write() in bd_set_size()
    cfq: fix lock imbalance with failed allocations
    drivers/block/swim3.c: fix null pointer dereference
    block: don't select PERCPU_RWSEM
    block: account iowait time when waiting for completion of IO request
    sched: add wait_for_completion_io[_timeout]
    writeback: add more tracepoints
    block: add block_{touch|dirty}_buffer tracepoint
    buffer: make touch_buffer() an exported function
    block: add @req to bio_{front|back}_merge tracepoints
    block: add missing block_bio_complete() tracepoint
    block: Remove should_sort judgement when flush blk_plug
    block,elevator: use new hashtable implementation
    cfq-iosched: add hierarchical cfq_group statistics
    cfq-iosched: collect stats from dead cfqgs
    cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
    blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
    block: RCU free request_queue
    blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
    ...

    Linus Torvalds
     

22 Feb, 2013

1 commit

  • This provides a band-aid to provide stable page writes on jbd without
    needing to backport the fixed locking and page writeback bit handling
    schemes of jbd2. The band-aid works by using bounce buffers to snapshot
    page contents instead of waiting.

    For those wondering about the ext3 bandage -- fixing the jbd locking
    (which was done as part of ext4dev years ago) is a lot of surgery, and
    setting PG_writeback on data pages when we actually hold the page lock
    dropped ext3 performance by nearly an order of magnitude. If we're
    going to migrate iscsi and raid to use stable page writes, the
    complaints about high latency will likely return. We might as well
    centralize their page snapshotting thing to one place.

    Signed-off-by: Darrick J. Wong
    Tested-by: Andy Lutomirski
    Cc: Adrian Hunter
    Cc: Artem Bityutskiy
    Reviewed-by: Jan Kara
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc: Steven Whitehouse
    Cc: Jens Axboe
    Cc: Eric Van Hensbergen
    Cc: Ron Minnich
    Cc: Latchesar Ionkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     

14 Jan, 2013

2 commits

  • bio_{front|back}_merge tracepoints report a bio merging into an
    existing request but didn't specify which request the bio is being
    merged into. Add @req to them. This makes it impossible to share the
    event template with block_bio_queue, so split it out.

    @req isn't used or exported to userland at this point and there is no
    userland visible behavior change. Later changes will make use of the
    extra parameter.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • bio completion didn't kick block_bio_complete TP. Only dm was
    explicitly triggering the TP on IO completion. This makes
    block_bio_complete TP useless for tracers which want to know about
    bios, and all other bio based drivers skip generating blktrace
    completion events.

    This patch makes all bio completions via bio_endio() generate
    block_bio_complete TP.

    * Explicit trace_block_bio_complete() invocation removed from dm and
    the trace point is unexported.

    * @rq dropped from trace_block_bio_complete(). bios may fly around
    w/o a queue associated. Verifying and accessing the associated queue
    belongs to TP probes.

    * blktrace now gets both request and bio completions. Make it ignore
    bio completions if request completion path is happening.

    This makes all bio based drivers generate blktrace completion events
    properly and makes the block_bio_complete TP actually useful.

    v2: With this change, block_bio_complete TP could be invoked on sg
    commands which have bio's with %NULL bi_bdev. Update TP
    assignment code to check whether bio->bi_bdev is %NULL before
    dereferencing.
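    The v2 guard in the tracepoint assignment looks roughly like:

    __entry->dev = bio->bi_bdev ? bio->bi_bdev->bd_dev : 0;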

    Signed-off-by: Tejun Heo
    Original-patch-by: Namhyung Kim
    Cc: Tejun Heo
    Cc: Steven Rostedt
    Cc: Alasdair Kergon
    Cc: dm-devel@redhat.com
    Cc: Neil Brown
    Signed-off-by: Jens Axboe

    Tejun Heo
     

11 Jan, 2013

1 commit

  • In commit 975927b942c932, blk_rq_pos was added to sort requests when
    flushing. That commit targeted the situation where a blk_plug handles
    multiple devices at the same time, like an md device, but there must
    also be situations like this involving only a single device. So remove
    the should_sort judgement. Because the should_sort parameter exists
    only for this purpose, it can be deleted from struct blk_plug.
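    A sketch of the resulting simplification in blk_flush_plug_list():

    /* was: if (plug->should_sort) list_sort(NULL, &list, plug_rq_cmp); */
    list_sort(NULL, &list, plug_rq_cmp);

    and the should_sort field disappears from struct blk_plug.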

    CC: Shaohua Li
    Signed-off-by: Jianpeng Ma
    Signed-off-by: Jens Axboe

    Jianpeng Ma
     

06 Dec, 2012

5 commits

  • Some request_fn implementations, e.g. scsi_request_fn(), unlock
    the queue lock internally. This may result in multiple threads
    executing request_fn for the same queue simultaneously. Keep
    track of the number of active request_fn calls and make sure that
    blk_cleanup_queue() waits until all active request_fn invocations
    have finished. A block driver may start cleaning up resources
    needed by its request_fn as soon as blk_cleanup_queue() finished,
    so blk_cleanup_queue() must wait for all outstanding request_fn
    invocations to finish.
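    A sketch of the accounting this adds, assuming the counter is named
    request_fn_active:

    inline void __blk_run_queue_uncond(struct request_queue *q)
    {
        if (unlikely(blk_queue_dead(q)))
            return;

        q->request_fn_active++;
        q->request_fn(q);
        q->request_fn_active--;
    }

    blk_cleanup_queue() can then wait until request_fn_active drops to
    zero before returning.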

    Signed-off-by: Bart Van Assche
    Reported-by: Chanho Min
    Cc: James Bottomley
    Cc: Mike Christie
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Running a queue must continue after it has been marked dying until
    it has been marked dead. So the function blk_run_queue_async() must
    not schedule delayed work after blk_cleanup_queue() has marked a queue
    dead. Hence add a test for that queue state in blk_run_queue_async()
    and make sure that queue_unplugged() invokes that function with the
    queue lock held. This avoids that the queue state can change after
    it has been tested and before mod_delayed_work() is invoked. Drop
    the queue dying test in queue_unplugged() since it is now
    superfluous: __blk_run_queue() already tests whether or not the
    queue is dead.
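    The resulting check, as a sketch:

    void blk_run_queue_async(struct request_queue *q)
    {
        if (likely(!blk_queue_stopped(q) && !blk_queue_dead(q)))
            mod_delayed_work(kblockd_workqueue, &q->delay_work, 0);
    }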

    Signed-off-by: Bart Van Assche
    Cc: Mike Christie
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • A block driver may start cleaning up resources needed by its
    request_fn as soon as blk_cleanup_queue() finished, so request_fn
    must not be invoked after draining finished. This is important
    when blk_run_queue() is invoked without any requests in progress.
    As an example, if blk_drain_queue() and scsi_run_queue() run in
    parallel, blk_drain_queue() may have finished all requests after
    scsi_run_queue() has taken a SCSI device off the starved list but
    before that last function has had a chance to run the queue.

    Signed-off-by: Bart Van Assche
    Cc: James Bottomley
    Cc: Mike Christie
    Cc: Chanho Min
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Let the caller of blk_drain_queue() obtain the queue lock to improve
    readability of the patch called "Avoid that request_fn is invoked on
    a dead queue".

    Signed-off-by: Bart Van Assche
    Acked-by: Tejun Heo
    Cc: James Bottomley
    Cc: Mike Christie
    Cc: Jens Axboe
    Cc: Chanho Min
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • QUEUE_FLAG_DEAD is used to indicate that queuing new requests must
    stop. After this flag has been set queue draining starts. However,
    during the queue draining phase it is still safe to invoke the
    queue's request_fn, so QUEUE_FLAG_DYING is a better name for this
    flag.

    This patch has been generated by running the following command
    over the kernel source tree:

    git grep -lEw 'blk_queue_dead|QUEUE_FLAG_DEAD' |
    xargs sed -i.tmp -e 's/blk_queue_dead/blk_queue_dying/g' \
    -e 's/QUEUE_FLAG_DEAD/QUEUE_FLAG_DYING/g'; \
    sed -i.tmp -e "s/QUEUE_FLAG_DYING$(printf \\t)*5/QUEUE_FLAG_DYING$(printf \\t)5/g" \
    include/linux/blkdev.h; \
    sed -i.tmp -e 's/ DEAD/ DYING/g' -e 's/dead queue/a dying queue/' \
    -e 's/Dead queue/A dying queue/' block/blk-core.c

    Signed-off-by: Bart Van Assche
    Acked-by: Tejun Heo
    Cc: James Bottomley
    Cc: Mike Christie
    Cc: Jens Axboe
    Cc: Chanho Min
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

26 Oct, 2012

1 commit

  • My workload is a raid5 with 16 disks, and our filesystem writes to it
    using direct-io mode.

    I used blktrace and found these messages:
    8,16 0 6647 2.453665504 2579 M W 7493152 + 8 [md0_raid5]
    8,16 0 6648 2.453672411 2579 Q W 7493160 + 8 [md0_raid5]
    8,16 0 6649 2.453672606 2579 M W 7493160 + 8 [md0_raid5]
    8,16 0 6650 2.453679255 2579 Q W 7493168 + 8 [md0_raid5]
    8,16 0 6651 2.453679441 2579 M W 7493168 + 8 [md0_raid5]
    8,16 0 6652 2.453685948 2579 Q W 7493176 + 8 [md0_raid5]
    8,16 0 6653 2.453686149 2579 M W 7493176 + 8 [md0_raid5]
    8,16 0 6654 2.453693074 2579 Q W 7493184 + 8 [md0_raid5]
    8,16 0 6655 2.453693254 2579 M W 7493184 + 8 [md0_raid5]
    8,16 0 6656 2.453704290 2579 Q W 7493192 + 8 [md0_raid5]
    8,16 0 6657 2.453704482 2579 M W 7493192 + 8 [md0_raid5]
    8,16 0 6658 2.453715016 2579 Q W 7493200 + 8 [md0_raid5]
    8,16 0 6659 2.453715247 2579 M W 7493200 + 8 [md0_raid5]
    8,16 0 6660 2.453721730 2579 Q W 7493208 + 8 [md0_raid5]
    8,16 0 6661 2.453721974 2579 M W 7493208 + 8 [md0_raid5]
    8,16 0 6662 2.453728202 2579 Q W 7493216 + 8 [md0_raid5]
    8,16 0 6663 2.453728436 2579 M W 7493216 + 8 [md0_raid5]
    8,16 0 6664 2.453734782 2579 Q W 7493224 + 8 [md0_raid5]
    8,16 0 6665 2.453735019 2579 M W 7493224 + 8 [md0_raid5]
    8,16 0 6666 2.453741401 2579 Q W 7493232 + 8 [md0_raid5]
    8,16 0 6667 2.453741632 2579 M W 7493232 + 8 [md0_raid5]
    8,16 0 6668 2.453748148 2579 Q W 7493240 + 8 [md0_raid5]
    8,16 0 6669 2.453748386 2579 M W 7493240 + 8 [md0_raid5]
    8,16 0 6670 2.453851843 2579 I W 7493144 + 104 [md0_raid5]
    8,16 0 0 2.453853661 0 m N cfq2579 insert_request
    8,16 0 6671 2.453854064 2579 I W 7493120 + 24 [md0_raid5]
    8,16 0 0 2.453854439 0 m N cfq2579 insert_request
    8,16 0 6672 2.453854793 2579 U N [md0_raid5] 2
    8,16 0 0 2.453855513 0 m N cfq2579 Not idling.st->count:1
    8,16 0 0 2.453855927 0 m N cfq2579 dispatch_insert
    8,16 0 0 2.453861771 0 m N cfq2579 dispatched a request
    8,16 0 0 2.453862248 0 m N cfq2579 activate rq,drv=1
    8,16 0 6673 2.453862332 2579 D W 7493120 + 24 [md0_raid5]
    8,16 0 0 2.453865957 0 m N cfq2579 Not idling.st->count:1
    8,16 0 0 2.453866269 0 m N cfq2579 dispatch_insert
    8,16 0 0 2.453866707 0 m N cfq2579 dispatched a request
    8,16 0 0 2.453867061 0 m N cfq2579 activate rq,drv=2
    8,16 0 6674 2.453867145 2579 D W 7493144 + 104 [md0_raid5]
    8,16 0 6675 2.454147608 0 C W 7493120 + 24 [0]
    8,16 0 0 2.454149357 0 m N cfq2579 complete rqnoidle 0
    8,16 0 6676 2.454791505 0 C W 7493144 + 104 [0]
    8,16 0 0 2.454794803 0 m N cfq2579 complete rqnoidle 0
    8,16 0 0 2.454795160 0 m N cfq schedule dispatch

    From the above messages, we can see that rq[W 7493144 + 104] and
    rq[W 7493120 + 24] do not merge, because the bio order is:
    8,16 0 6638 2.453619407 2579 Q W 7493144 + 8 [md0_raid5]
    8,16 0 6639 2.453620460 2579 G W 7493144 + 8 [md0_raid5]
    8,16 0 6640 2.453639311 2579 Q W 7493120 + 8 [md0_raid5]
    8,16 0 6641 2.453639842 2579 G W 7493120 + 8 [md0_raid5]
    The bio at 7493144 comes first and the bio at 7493120 later, so the
    subsequent bios are divided into two parts. When flushing the plug
    list, rq[7493120 + 24] can't merge with rq[7493144 + 104], because
    elv_attempt_insert_merge only supports back merges, not front merges.

    From my test, this situation accounts for 25% on our system. With this
    patch, it no longer occurs.

    Signed-off-by: Jianpeng Ma
    CC:Shaohua Li
    Signed-off-by: Jens Axboe

    Jianpeng Ma
     

11 Oct, 2012

1 commit

  • Pull block IO update from Jens Axboe:
    "Core block IO bits for 3.7. Not a huge round this time, it contains:

    - First series from Kent cleaning up and generalizing bio allocation
    and freeing.

    - WRITE_SAME support from Martin.

    - Mikulas patches to prevent O_DIRECT crashes when someone changes
    the block size of a device.

    - Make bio_split() work on data-less bio's (like trim/discards).

    - A few other minor fixups."

    Fixed up silent semantic mis-merge as per Mikulas Patocka and Andrew
    Morton. It is due to the VM no longer using a prio-tree (see commit
    6b2dbba8b6ac: "mm: replace vma prio_tree with an interval tree").

    So make set_blocksize() use mapping_mapped() instead of open-coding the
    internal VM knowledge that has changed.

    * 'for-3.7/core' of git://git.kernel.dk/linux-block: (26 commits)
    block: makes bio_split support bio without data
    scatterlist: refactor the sg_nents
    scatterlist: add sg_nents
    fs: fix include/percpu-rwsem.h export error
    percpu-rw-semaphore: fix documentation typos
    fs/block_dev.c:1644:5: sparse: symbol 'blkdev_mmap' was not declared
    blockdev: turn a rw semaphore into a percpu rw semaphore
    Fix a crash when block device is read and block size is changed at the same time
    block: fix request_queue->flags initialization
    block: lift the initial queue bypass mode on blk_register_queue() instead of blk_init_allocated_queue()
    block: ioctl to zero block ranges
    block: Make blkdev_issue_zeroout use WRITE SAME
    block: Implement support for WRITE SAME
    block: Consolidate command flag and queue limit checks for merges
    block: Clean up special command handling logic
    block/blk-tag.c: Remove useless kfree
    block: remove the duplicated setting for congestion_threshold
    block: reject invalid queue attribute values
    block: Add bio_clone_bioset(), bio_clone_kmalloc()
    block: Consolidate bio_alloc_bioset(), bio_kmalloc()
    ...

    Linus Torvalds