14 Apr, 2016
1 commit
- 
Now that we converted everything to the newer block write cache 
 interface, kill off the queue flush_flags and queueable flush
 entries.

 Signed-off-by: Jens Axboe
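 For context, the newer interface referred to here is blk_queue_write_cache(); a hedged before/after sketch of what a driver's queue setup migrates to (the flush_flags call shown as "before" is the older style, and the wc/fua booleans are assumptions about the device):

    /* Before: flush capability advertised via queue flush_flags. */
    blk_queue_flush(q, REQ_FLUSH | REQ_FUA);

    /* After: the write cache interface; volatile write cache + FUA support. */
    blk_queue_write_cache(q, true, true);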
13 Apr, 2016
1 commit
- 
We could kmalloc() the payload, so we need the offset in the page.

 Signed-off-by: Ming Lin
 Reviewed-by: Christoph Hellwig
 Signed-off-by: Jens Axboe
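 For illustration, a kmalloc()'ed buffer does not start on a page boundary, so the page/offset pair has to be derived from the pointer; a sketch using the standard helpers (the surrounding payload-attach call is omitted):

    /* Sketch: derive the page and in-page offset of a kmalloc()'ed payload. */
    void *buf = kmalloc(512, GFP_KERNEL);
    struct page *page = virt_to_page(buf);
    unsigned int off = offset_in_page(buf);   /* buf & ~PAGE_MASK */
    /* page/off (rather than an assumed offset of 0) is what the payload
     * helper needs once the buffer may come from kmalloc(). */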
05 Apr, 2016
1 commit
- 
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
 ago with the promise that one day it would be possible to implement the
 page cache with bigger chunks than PAGE_SIZE.

 This promise never materialized, and it is unlikely that it ever will.

 We have many places where PAGE_CACHE_SIZE is assumed to be equal to
 PAGE_SIZE, and it's a constant source of confusion on whether
 PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
 especially on the border between fs and mm.

 Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
 breakage to be doable.

 Let's stop pretending that pages in the page cache are special. They are
 not.

 The changes are pretty straightforward:

 - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;
 - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;
 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
 - page_cache_get() -> get_page();
 - page_cache_release() -> put_page();

 This patch contains automated changes generated with coccinelle using
 the script below. For some reason, coccinelle doesn't patch header files;
 I've called spatch for them manually.

 The only adjustment after coccinelle is a revert of the changes to the
 PAGE_CACHE_ALIGN definition: we are going to drop it later.

 There are a few places in the code that coccinelle didn't reach. I'll
 fix them manually in a separate patch. Comments and documentation will
 also be addressed in a separate patch.

 virtual patch

 @@
 expression E;
 @@
 - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
 + E

 @@
 expression E;
 @@
 - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
 + E

 @@
 @@
 - PAGE_CACHE_SHIFT
 + PAGE_SHIFT

 @@
 @@
 - PAGE_CACHE_SIZE
 + PAGE_SIZE

 @@
 @@
 - PAGE_CACHE_MASK
 + PAGE_MASK

 @@
 expression E;
 @@
 - PAGE_CACHE_ALIGN(E)
 + PAGE_ALIGN(E)

 @@
 expression E;
 @@
 - page_cache_get(E)
 + get_page(E)

 @@
 expression E;
 @@
 - page_cache_release(E)
 + put_page(E)

 Signed-off-by: Kirill A. Shutemov
 Acked-by: Michal Hocko
 Signed-off-by: Linus Torvalds
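 To make the mechanical nature of the conversion concrete, a hypothetical before/after fragment of the kind the script produces:

    /* Before */
    index = pos >> PAGE_CACHE_SHIFT;
    offset = pos & ~PAGE_CACHE_MASK;
    page_cache_release(page);

    /* After (PAGE_CACHE_* == PAGE_*, so the shift-difference terms drop out) */
    index = pos >> PAGE_SHIFT;
    offset = pos & ~PAGE_MASK;
    put_page(page);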
19 Mar, 2016
1 commit
- 
Pull libata updates from Tejun Heo:

 - ahci grew runtime power management support so that the controller can
   be turned off if no devices are attached.

 - sata_via isn't dead yet. It got hotplug support and a more refined
   workaround for certain WD drives.

 - Misc cleanups. There's a merge from for-4.5-fixes to avoid confusing
   conflicts in the ahci PCI ID table.

* 'for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
 ata: ahci_xgene: dereferencing uninitialized pointer in probe
 AHCI: Remove obsolete Intel Lewisburg SATA RAID device IDs
 ata: sata_rcar: Use ARCH_RENESAS
 sata_via: Implement hotplug for VT6421
 sata_via: Apply WD workaround only when needed on VT6421
 ahci: Add runtime PM support for the host controller
 ahci: Add functions to manage runtime PM of AHCI ports
 ahci: Convert driver to use modern PM hooks
 ahci: Cache host controller version
 scsi: Drop runtime PM usage count after host is added
 scsi: Set request queue runtime PM status back to active on resume
 block: Add blk_set_runtime_active()
 ata: ahci_mvebu: add support for Armada 3700 variant
 libata: fix unbalanced spin_lock_irqsave/spin_unlock_irq() in ata_scsi_park_show()
 libata: support AHCI on OCTEON platform
23 Feb, 2016
1 commit
- 
Request-based DM's blk-mq support (dm-mq) was reported to be 50% slower 
 than if an underlying null_blk device were used directly. One of the
 reasons for this drop in performance is that blk_insert_clone_request()
 was calling blk_mq_insert_request() with @async=true. This forced the
 use of kblockd_schedule_delayed_work_on() to run the blk-mq hw queues
 which ushered in ping-ponging between process context (fio in this case)
 and kblockd's kworker to submit the cloned request. The ftrace
 function_graph tracer showed:

   kworker-2013 => fio-12190
   fio-12190    => kworker-2013
   ...
   kworker-2013 => fio-12190
   fio-12190    => kworker-2013
   ...

 Fixing blk_insert_clone_request()'s blk_mq_insert_request() call to
 _not_ use kblockd to submit the cloned requests isn't enough to
 eliminate the observed context switches.

 In addition to this dm-mq specific blk-core fix, there are 2 DM core
 fixes to dm-mq that (when paired with the blk-core fix) completely
 eliminate the observed context switching:

 1) don't blk_mq_run_hw_queues in blk-mq request completion

    Motivated by the desire to reduce the overhead of dm-mq, punting to
    kblockd just increases context switches.

    In my testing against a really fast null_blk device there was no
    benefit to running blk_mq_run_hw_queues() on completion (and no other
    blk-mq driver does this). So hopefully this change doesn't induce the
    need for yet another revert like commit 621739b00e16ca2d !

 2) use blk_mq_complete_request() in dm_complete_request()

    blk_complete_request() doesn't offer the traditional q->mq_ops vs
    .request_fn branching pattern that other historic block interfaces
    do (e.g. blk_get_request). Using blk_mq_complete_request() for
    blk-mq requests is important for performance. It should be noted
    that, like blk_complete_request(), blk_mq_complete_request() doesn't
    natively handle partial completions -- but the request-based
    DM-multipath target does provide the required partial completion
    support by dm.c:end_clone_bio() triggering requeueing of the request
    via dm-mpath.c:multipath_end_io()'s return of DM_ENDIO_REQUEUE.

 dm-mq fix #2 is _much_ more important than #1 for eliminating the
 context switches.
 Before: cpu : usr=15.10%, sys=59.39%, ctx=7905181, majf=0, minf=475
 After:  cpu : usr=20.60%, sys=79.35%, ctx=2008, majf=0, minf=472

 With these changes multithreaded async read IOPs improved from ~950K
 to ~1350K for this dm-mq stacked on null_blk test-case. The raw read
 IOPs of the underlying null_blk device for the same workload is ~1950K.

 Fixes: 7fb4898e0 ("block: add blk-mq support to blk_insert_cloned_request()")
 Fixes: bfebd1cdb ("dm: add full blk-mq support to request-based DM")
 Cc: stable@vger.kernel.org # 4.1+
 Reported-by: Sagi Grimberg
 Signed-off-by: Mike Snitzer
 Acked-by: Jens Axboe
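 The q->mq_ops vs .request_fn branching described in fix #2 amounts to something like the following in the DM completion path (a sketch only; the error plumbing of the real dm-mq code is simplified and the two-argument blk_mq_complete_request() of that era is assumed):

    /* Sketch: complete a request via the interface matching its queue type. */
    static void dm_complete_request(struct request *rq, int error)
    {
            if (rq->q->mq_ops)
                    blk_mq_complete_request(rq, error);  /* blk-mq path */
            else
                    blk_complete_request(rq);            /* legacy request_fn path */
    }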
19 Feb, 2016
1 commit
- 
If a block device is left runtime suspended during system suspend, the
 driver's resume hook typically corrects the runtime PM status of the
 device back to "active" after it is resumed. However, this is not enough
 because the queue's runtime PM status is still "suspended". As long as it
 is in this state, blk_pm_peek_request() returns NULL and thus prevents new
 requests from being processed.

 Add a new function blk_set_runtime_active() that can be used to force the
 queue status back to "active" as needed.

 Signed-off-by: Mika Westerberg
 Acked-by: Jens Axboe
 Signed-off-by: Tejun Heo
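 A minimal sketch of how a driver's system-resume path might use the new helper (the driver structure and resume hook are hypothetical; only blk_set_runtime_active() comes from this change):

    /* Hypothetical resume hook: fix up both the device's and the queue's
     * runtime PM status after system resume. */
    static int mydrv_resume(struct device *dev)
    {
            struct mydrv *drv = dev_get_drvdata(dev);

            pm_runtime_set_active(dev);             /* device back to "active" */
            blk_set_runtime_active(drv->queue);     /* queue back to "active" too */
            return 0;
    }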
05 Feb, 2016
2 commits
- 
When a storage device rejects a WRITE SAME command we will disable write 
 same functionality for the device and return -EREMOTEIO to the block
 layer. -EREMOTEIO will in turn prevent DM from retrying the I/O and/or
 failing the path.

 Yiwen Jiang discovered a small race where WRITE SAME requests issued
 simultaneously would cause -EIO to be returned. This happened because
 any requests being prepared after WRITE SAME had been disabled for the
 device caused us to return BLKPREP_KILL. The latter caused the block
 layer to return -EIO upon completion.

 To overcome this we introduce BLKPREP_INVALID, which indicates that this
 is an invalid request for the device. blk_peek_request() is modified to
 return -EREMOTEIO in that case.

 Reported-by: Yiwen Jiang
 Suggested-by: Mike Snitzer
 Reviewed-by: Hannes Reinecke
 Reviewed-by: Ewan Milne
 Reviewed-by: Yiwen Jiang
 Signed-off-by: Martin K. Petersen
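 A sketch of the prep-time distinction this introduces, from the point of view of a SCSI-style prep function (the device flag and function names are illustrative, not the exact sd.c code):

    /* Illustrative prep helper: a request type the device has rejected before
     * is invalid for this device, not a reason to fail the path. */
    static int mydrv_setup_write_same(struct request *rq)
    {
            struct mydrv_device *dev = rq->q->queuedata;    /* hypothetical */

            if (dev->no_write_same)
                    return BLKPREP_INVALID; /* blk_peek_request() -> -EREMOTEIO */

            /* ... build the WRITE SAME command ... */
            return BLKPREP_OK;
    }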
22 Jan, 2016
1 commit
- 
Pull NVMe updates from Jens Axboe: 
 "Last branch for this series is the nvme changes. It's in a separate
 branch to avoid splitting too much between core and NVMe changes,
 since NVMe is still helping drive some blk-mq changes. That said, not
 a huge amount of core changes in here. The grunt of the work is the
 continued split of the code"

* 'for-4.5/nvme' of git://git.kernel.dk/linux-block: (67 commits)
 uapi: update install list after nvme.h rename
 NVMe: Export NVMe attributes to sysfs group
 NVMe: Shutdown controller only for power-off
 NVMe: IO queue deletion re-write
 NVMe: Remove queue freezing on resets
 NVMe: Use a retryable error code on reset
 NVMe: Fix admin queue ring wrap
 nvme: make SG_IO support optional
 nvme: fixes for NVME_IOCTL_IO_CMD on the char device
 nvme: synchronize access to ctrl->namespaces
 nvme: Move nvme_freeze/unfreeze_queues to nvme core
 PCI/AER: include header file
 NVMe: Export namespace attributes to sysfs
 NVMe: Add pci error handlers
 block: remove REQ_NO_TIMEOUT flag
 nvme: merge iod and cmd_info
 nvme: meta_sg doesn't have to be an array
 nvme: properly free resources for cancelled command
 nvme: simplify completion handling
 nvme: special case AEN requests
 ...
20 Jan, 2016
1 commit
- 
Pull core block updates from Jens Axboe: 
 "We don't have a lot of core changes this time around, it's mostly in
 drivers, which will come in a subsequent pull.

 The core changes include:

 - blk-mq
     - Prep patch from Christoph, changing blk_mq_alloc_request() to
       take flags instead of just using gfp_t for sleep/nosleep.
     - Doc patch from me, clarifying the difference between legacy
       and blk-mq for timer usage.
     - Fixes from Raghavendra for memory-less numa nodes, and a reuse
       of CPU masks.

 - Cleanup from Geliang Tang, using offset_in_page() instead of open
   coding it.

 - From Ilya, rename the request_queue slab so it reflects what it holds,
   and a fix for proper use of bdgrab/put.

 - A real fix for the split across stripe boundaries from Keith. We
   yanked a broken version of this from 4.4-rc final, this one works.

 - From Mike Krinkin, emit a trace message when we split.

 - From Wei Tang, two small cleanups, not explicitly clearing memory
   that is already cleared"

* 'for-4.5/core' of git://git.kernel.dk/linux-block:
 block: use bd{grab,put}() instead of open-coding
 block: split bios to max possible length
 block: add call to split trace point
 blk-mq: Avoid memoryless numa node encoded in hctx numa_node
 blk-mq: Reuse hardware context cpumask for tags
 blk-mq: add a flags parameter to blk_mq_alloc_request
 Revert "blk-flush: Queue through IO scheduler when flush not required"
 block: clarify blk_add_timer() use case for blk-mq
 bio: use offset_in_page macro
 block: do not initialise statics to 0 or NULL
 block: do not initialise globals to 0 or NULL
 block: rename request_queue slab cache
29 Dec, 2015
1 commit
- 
We currently only have an inline/sync helper to restart a stopped 
 queue. If drivers need an async version, they have to roll their
 own. Add a generic helper instead.

 Signed-off-by: Jens Axboe
23 Dec, 2015
2 commits
- 
blk_queue_bio() does split then bounce, which makes the segment
 counting be based on pages before bouncing and could go wrong. Move
 the split to after bouncing, like we do for blk-mq, which fixes the
 issue of the bio's segment count being wrong.

 Fixes: 54efd50bfd87 ("block: make generic_make_request handle arbitrarily sized bios")
 Cc: stable@vger.kernel.org
 Tested-by: Artem S. Tashkinov
 Signed-off-by: Jens Axboe
- 
Timer context is not very useful for drivers to perform any meaningful abort 
 action from. So instead of calling the driver from this useless context
 defer it to a workqueue as soon as possible.

 Note that while a delayed_work item would seem the right thing here, I didn't
 dare to use it due to the magic in blk_add_timer that pokes deep into timer
 internals. But maybe this encourages Tejun to add a sensible API for that to
 the workqueue API and we'll all be fine in the end :)

 Contains a major update from Keith Busch:

 "This patch removes synchronizing the timeout work so that the timer can
  start a freeze on its own queue. The timer enters the queue, so timer
  context can only start a freeze, but not wait for frozen."

 Signed-off-by: Christoph Hellwig
 Acked-by: Keith Busch
 Signed-off-by: Jens Axboe
04 Dec, 2015
1 commit
- 
The routines in scsi_pm.c assume that if a runtime-PM callback is 
 invoked for a SCSI device, it can only mean that the device's driver
 has asked the block layer to handle the runtime power management (by
 calling blk_pm_runtime_init(), which among other things sets q->dev).

 However, this assumption turns out to be wrong for things like the ses
 driver. Normally ses devices are not allowed to do runtime PM, but
 userspace can override this setting. If this happens, the kernel gets
 a NULL pointer dereference when blk_post_runtime_resume() tries to use
 the uninitialized q->dev pointer.

 This patch fixes the problem by checking q->dev in the block layer before
 handling runtime PM. Since ses doesn't define any PM callbacks or call
 blk_pm_runtime_init(), the crash won't occur.

 This fixes Bugzilla #101371.
 https://bugzilla.kernel.org/show_bug.cgi?id=101371

 More discussion can be found at the link below:
 http://marc.info/?l=linux-scsi&m=144163730531875&w=2

 Signed-off-by: Ken Xue
 Acked-by: Alan Stern
 Cc: Xiangliang Yu
 Cc: James E.J. Bottomley
 Cc: Jens Axboe
 Cc: Michael Terry
 Cc: stable@vger.kernel.org
 Signed-off-by: Jens Axboe
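 A minimal sketch of the guard described above, assuming it sits at the top of the blk_*_runtime_* helpers (placement and return handling simplified):

    /* Sketch: bail out early when the driver never called blk_pm_runtime_init(),
     * i.e. q->dev was never set, instead of dereferencing a NULL pointer. */
    void blk_post_runtime_resume(struct request_queue *q, int err)
    {
            if (!q->dev)
                    return;

            /* ... existing runtime-PM status handling ... */
    }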
02 Dec, 2015
1 commit
- 
We already have the reserved flag, and a nowait flag awkwardly encoded as 
 a gfp_t. Add a real flags argument to make the scheme more extensible and
 allow for a nicer calling convention.

 Signed-off-by: Christoph Hellwig
 Signed-off-by: Jens Axboe
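 Assuming the flag names introduced around this series, the calling-convention change looks roughly like:

    /* Before: sleep behaviour awkwardly encoded in a gfp_t (plus a bool). */
    rq = blk_mq_alloc_request(q, READ, GFP_KERNEL, false);

    /* After: explicit, extensible flags. */
    rq = blk_mq_alloc_request(q, READ, 0);                  /* may sleep */
    rq = blk_mq_alloc_request(q, READ, BLK_MQ_REQ_NOWAIT);  /* don't block */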
30 Nov, 2015
1 commit
- 
When a cloned request is retried on other queues it always needs 
 to be checked against the queue limits of that queue.
 Otherwise the calculations for nr_phys_segments might be wrong,
 leading to a crash in scsi_init_sgtable().

 To clarify this, the patch renames blk_rq_check_limits()
 to blk_cloned_rq_check_limits() and removes the symbol
 export, as the new function should only be used for
 cloned requests and never exported.

 Cc: Mike Snitzer
 Cc: Ewan Milne
 Cc: Jeff Moyer
 Signed-off-by: Hannes Reinecke
 Fixes: e2a60da74 ("block: Clean up special command handling logic")
 Cc: stable@vger.kernel.org # 3.7+
 Acked-by: Mike Snitzer
 Signed-off-by: Jens Axboe
25 Nov, 2015
2 commits
- 
This patch fixes the checkpatch.pl error in blk-exec.c:

 ERROR: do not initialise globals to 0 or NULL

 Signed-off-by: Wei Tang
 Signed-off-by: Jens Axboe
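 The class of change checkpatch asks for, as a generic example (variable name made up):

    /* Flagged by checkpatch: the 0/NULL initialiser is redundant for
     * objects with static storage duration. */
    static struct kobject *example_kobj = NULL;

    /* Preferred form: */
    static struct kobject *example_kobj;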
- 
Name the cache after the actual name of the struct.

 Signed-off-by: Ilya Dryomov
 Signed-off-by: Jens Axboe
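 A sketch of the kind of change this is (the old slab name is an assumption; the struct name and kmem_cache_create() usage are the certain parts):

    /* Before: slab name doesn't match the structure it caches (assumed). */
    blk_requestq_cachep = kmem_cache_create("blkdev_queue",
                            sizeof(struct request_queue), 0, SLAB_PANIC, NULL);

    /* After: name the cache after struct request_queue itself. */
    blk_requestq_cachep = kmem_cache_create("request_queue",
                            sizeof(struct request_queue), 0, SLAB_PANIC, NULL);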
12 Nov, 2015
1 commit
- 
Fix kernel-doc warning in blk-core.c:

 Warning(..//block/blk-core.c:1549): No description found for parameter 'same_queue_rq'

 Signed-off-by: Randy Dunlap
 Reviewed-by: Jeff Moyer
 Signed-off-by: Jens Axboe
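 The fix amounts to a kernel-doc line for the parameter; a sketch of the format (the surrounding function and wording are assumptions, not the actual hunk):

    /**
     * blk_attempt_plug_merge - try to merge with %current's plugged list
     * @q: request_queue new bio is being queued at
     * @bio: new bio being queued
     * @request_count: out parameter for number of traversed plugged requests
     * @same_queue_rq: output parameter filled in with a request from @q that
     *                 was found on the plug list (may be %NULL)
     */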
11 Nov, 2015
1 commit
- 
Pull block IO poll support from Jens Axboe: 
 "Various groups have been doing experimentation around IO polling for
 (really) fast devices. The code has been reviewed and has been
 sitting on the side for a few releases, but this is now good enough
 for coordinated benchmarking and further experimentation.

 Currently O_DIRECT sync read/write are supported. A framework is in
 the works that allows scalable stats tracking so we can auto-tune
 this. And we'll add libaio support as well soon. For now, it's an
 opt-in feature for test purposes"

* 'for-4.4/io-poll' of git://git.kernel.dk/linux-block:
 direct-io: be sure to assign dio->bio_bdev for both paths
 directio: add block polling support
 NVMe: add blk polling support
 block: add block polling support
 blk-mq: return tag/queue combo in the make_request_fn handlers
 block: change ->make_request_fn() and users to return a queue cookie
08 Nov, 2015
2 commits
- 
Add basic support for polling for specific IO to complete. This uses 
 the cookie that blk-mq passes back, which enables the block layer
 to pass this cookie to the driver to spin for a specific request.

 This will be combined with request latency tracking, so we can make
 qualified decisions about when to poll and when not to. For now, for
 benchmark purposes, we add a sysfs file that controls whether polling
 is enabled or not.

 Signed-off-by: Jens Axboe
 Acked-by: Christoph Hellwig
 Acked-by: Keith Busch
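 A sketch of how a submitter can use the returned cookie, assuming the 4.4-era interfaces (two-argument submit_bio(), blk_poll() driving the driver's poll hook; bio, bdev and the bio_done completion flag are assumed to be set up by the caller):

    /* Sketch: submit a synchronous O_DIRECT-style read and spin for completion. */
    blk_qc_t cookie = submit_bio(READ, bio);    /* cookie identifies the queued IO */

    while (!bio_done)                           /* set by our bi_end_io callback */
            blk_poll(bdev_get_queue(bdev), cookie);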
- 
No functional changes in this patch, but it prepares us for returning 
 a more useful cookie related to the IO that was queued up.

 Signed-off-by: Jens Axboe
 Acked-by: Christoph Hellwig
 Acked-by: Keith Busch
07 Nov, 2015
2 commits
- 
__GFP_WAIT was used to signal that the caller was in atomic context and 
 could not sleep. Now it is possible to distinguish between true atomic
 context and callers that are not willing to sleep. The latter should
 clear __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing
 __GFP_WAIT behaves differently, there is a risk that people will clear the
 wrong flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly
 indicate what it does -- setting it allows all reclaim activity, clearing
 it prevents it.

 [akpm@linux-foundation.org: fix build]
 [akpm@linux-foundation.org: coding-style fixes]
 Signed-off-by: Mel Gorman
 Acked-by: Michal Hocko
 Acked-by: Vlastimil Babka
 Acked-by: Johannes Weiner
 Cc: Christoph Lameter
 Acked-by: David Rientjes
 Cc: Vitaly Wool
 Cc: Rik van Riel
 Signed-off-by: Andrew Morton
 Signed-off-by: Linus Torvalds
- 
…d avoiding waking kswapd

 __GFP_WAIT has been used to identify atomic context in callers that hold
 spinlocks or are in interrupts. They are expected to be high priority and
 have access to one of two watermarks lower than "min" which can be referred
 to as the "atomic reserve". __GFP_HIGH users get access to the first
 lower watermark and can be called the "high priority reserve".

 Over time, callers had a requirement to not block when fallback options
 were available. Some have abused __GFP_WAIT, leading to a situation where
 an optimistic allocation with a fallback option can access atomic
 reserves.

 This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
 cannot sleep and have no alternative. High priority users continue to use
 __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
 are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies
 callers that want to wake kswapd for background reclaim. __GFP_WAIT is
 redefined as a caller that is willing to enter direct reclaim and wake
 kswapd for background reclaim.

 This patch then converts a number of sites:

 o __GFP_ATOMIC is used by callers that are high priority and have memory
   pools for those requests. GFP_ATOMIC uses this flag.

 o Callers that have a limited mempool to guarantee forward progress clear
   __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
   into this category where kswapd will still be woken but atomic reserves
   are not used as there is a one-entry mempool to guarantee progress.

 o Callers that are checking if they are non-blocking should use the
   helper gfpflags_allow_blocking() where possible. This is because
   checking for __GFP_WAIT as was done historically can now trigger false
   positives. Some exceptions like dm-crypt.c exist where the code intent
   is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
   flag manipulations.

 o Callers that built their own GFP flags instead of starting with GFP_KERNEL
   and friends now also need to specify __GFP_KSWAPD_RECLAIM.

 The first key hazard to watch out for is callers that removed __GFP_WAIT
 and were depending on access to atomic reserves for inconspicuous reasons.
 In some cases it may be appropriate for them to use __GFP_HIGH.

 The second key hazard is callers that assembled their own combination of
 GFP flags instead of starting with something like GFP_KERNEL. They may
 now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
 if it's missed in most cases as other activity will wake kswapd.

 Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
 Acked-by: Vlastimil Babka <vbabka@suse.cz>
 Acked-by: Michal Hocko <mhocko@suse.com>
 Acked-by: Johannes Weiner <hannes@cmpxchg.org>
 Cc: Christoph Lameter <cl@linux.com>
 Cc: David Rientjes <rientjes@google.com>
 Cc: Vitaly Wool <vitalywool@gmail.com>
 Cc: Rik van Riel <riel@redhat.com>
 Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
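 The helper mentioned above boils down to a flag test; a sketch of how a caller that used to test __GFP_WAIT checks for blocking under the new scheme (the two setup helpers are hypothetical):

    /* Sketch: decide whether an allocation path may sleep. */
    if (gfpflags_allow_blocking(gfp_mask)) {
            /* Direct reclaim allowed: safe to take a sleeping path. */
            do_blocking_setup();            /* hypothetical */
    } else {
            /* Atomic-style caller: fall back to the non-blocking path. */
            do_nonblocking_setup();         /* hypothetical */
    }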
05 Nov, 2015
2 commits
- 
Pull block integrity updates from Jens Axboe: 
 ""This is the joint work of Dan and Martin, cleaning up and improving
 the support for block data integrity"* 'for-4.4/integrity' of git://git.kernel.dk/linux-block: 
 block, libnvdimm, nvme: provide a built-in blk_integrity nop profile
 block: blk_flush_integrity() for bio-based drivers
 block: move blk_integrity to request_queue
 block: generic request_queue reference counting
 nvme: suspend i/o during runtime blk_integrity_unregister
 md: suspend i/o during runtime blk_integrity_unregister
 md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown
 block: Inline blk_integrity in struct gendisk
 block: Export integrity data interval size in sysfs
 block: Reduce the size of struct blk_integrity
 block: Consolidate static integrity profile properties
 block: Move integrity kobject to struct gendisk
- 
Pull core block updates from Jens Axboe: 
 "This is the core block pull request for 4.4. I've got a few more
 topic branches this time around, some of them will layer on top of the
 core+drivers changes and will come in a separate round. So not a huge
 chunk of changes in this round.

 This pull request contains:

 - Enable blk-mq page allocation tracking with kmemleak, from Catalin.

 - Unused prototype removal in blk-mq from Christoph.

 - Cleanup of the q->blk_trace exchange, using cmpxchg instead of two
   xchg()'s, from Davidlohr.

 - A plug flush fix from Jeff.

 - Also from Jeff, a fix that means we don't have to update shared tag
   sets at init time unless we do a state change. This cuts down boot
   times on thousands of devices a lot with scsi/blk-mq.

 - blk-mq waitqueue barrier fix from Kosuke.

 - Various fixes from Ming:
     - Fixes for segment merging and splitting, and checks, for
       the old core and blk-mq.
     - Potential blk-mq speedup by marking ctx pending at the end
       of a plug insertion batch in blk-mq.

 - direct-io no page dirty on kernel direct reads.

 - A WRITE_SYNC fix for mpage from Roman"

* 'for-4.4/core' of git://git.kernel.dk/linux-block:
 blk-mq: avoid excessive boot delays with large lun counts
 blktrace: re-write setting q->blk_trace
 blk-mq: mark ctx as pending at batch in flush plug path
 blk-mq: fix for trace_block_plug()
 block: check bio_mergeable() early before merging
 blk-mq: check bio_mergeable() early before merging
 block: avoid to merge splitted bio
 block: setup bi_phys_segments after splitting
 block: fix plug list flushing for nomerge queues
 blk-mq: remove unused blk_mq_clone_flush_request prototype
 blk-mq: fix waitqueue_active without memory barrier in block/blk-mq-tag.c
 fs: direct-io: don't dirtying pages for ITER_BVEC/ITER_KVEC direct read
 fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
 block: kmemleak: Track the page allocations for struct request
22 Oct, 2015
3 commits
- 
Request queues with merging disabled will not flush the plug list after 
 BLK_MAX_REQUEST_COUNT requests have been queued, since the code relies
 on blk_attempt_plug_merge to compute the request_count. Fix this by
 computing the number of queued requests even for nomerge queues.

 Signed-off-by: Jeff Moyer
 Signed-off-by: Jens Axboe
- 
Since they lack requests to pin the request_queue active, synchronous 
 bio-based drivers may have in-flight integrity work from
 bio_integrity_endio() that is not flushed by blk_freeze_queue(). Flush
 that work to prevent races to free the queue and the final usage of the
 blk_integrity profile.

 This is temporary unless/until bio-based drivers start to generically
 take a q_usage_counter reference while a bio is in-flight.

 Cc: Martin K. Petersen
 [martin: fix the CONFIG_BLK_DEV_INTEGRITY=n case]
 Tested-by: Ross Zwisler
 Signed-off-by: Dan Williams
 Signed-off-by: Jens Axboe
- 
Allow pmem, and other synchronous/bio-based block drivers, to fallback 
 on a per-cpu reference count managed by the core for tracking queue
 live/dead state.

 The existing per-cpu reference count for the blk_mq case is promoted to
 be used in all block i/o scenarios. This involves initializing it by
 default, waiting for it to drop to zero at exit, and holding a live
 reference over the invocation of q->make_request_fn() in
 generic_make_request(). The blk_mq code continues to take its own
 reference per blk_mq request and retains the ability to freeze the
 queue, but the check that the queue is frozen is moved to
 generic_make_request().

 This fixes crash signatures like the following:

 BUG: unable to handle kernel paging request at ffff880140000000
 [..]
 Call Trace:
 [] ? copy_user_handle_tail+0x5f/0x70
 [] pmem_do_bvec.isra.11+0x70/0xf0 [nd_pmem]
 [] pmem_make_request+0xd1/0x200 [nd_pmem]
 [] ? mempool_alloc+0x72/0x1a0
 [] generic_make_request+0xd6/0x110
 [] submit_bio+0x76/0x170
 [] submit_bh_wbc+0x12f/0x160
 [] submit_bh+0x12/0x20
 [] jbd2_write_superblock+0x8d/0x170
 [] jbd2_mark_journal_empty+0x5d/0x90
 [] jbd2_journal_destroy+0x24b/0x270
 [] ? put_pwq_unlocked+0x2a/0x30
 [] ? destroy_workqueue+0x225/0x250
 [] ext4_put_super+0x64/0x360
 [] generic_shutdown_super+0x6a/0xf0

 Cc: Jens Axboe
 Cc: Keith Busch
 Cc: Ross Zwisler
 Suggested-by: Christoph Hellwig
 Reviewed-by: Christoph Hellwig
 Tested-by: Ross Zwisler
 Signed-off-by: Dan Williams
 Signed-off-by: Jens Axboe
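 The "live reference over the invocation of q->make_request_fn()" corresponds to code of roughly this shape in generic_make_request() (a sketch; gfp choice and error handling are simplified):

    /* Sketch: hold a queue reference across the driver's make_request_fn. */
    if (likely(blk_queue_enter(q, GFP_KERNEL) == 0)) {  /* pins q->q_usage_counter */
            q->make_request_fn(q, bio);
            blk_queue_exit(q);                          /* drops the reference */
    } else {
            bio_io_error(bio);                          /* queue is dying/dead */
    }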
15 Oct, 2015
1 commit
- 
bdi's are initialized in two steps, bdi_init() and bdi_register(), but 
 destroyed in a single step by bdi_destroy() which, for a bdi embedded
 in a request_queue, is called during blk_cleanup_queue() which makes
 the queue invisible and starts the draining of remaining usages.

 A request_queue's user can access the congestion state of the embedded
 bdi as long as it holds a reference to the queue. As such, it may
 access the congested state of a queue which finished
 blk_cleanup_queue() but hasn't reached blk_release_queue() yet.
 Because the congested state was embedded in backing_dev_info which in
 turn is embedded in request_queue, accessing the congested state after
 bdi_destroy() was called was fine. The bdi was destroyed but the
 memory region for the congested state remained accessible till the
 queue got released.

 a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in
 bdi_writeback") changed the situation. Now, the root congested state
 which is expected to be pinned while request_queue remains accessible
 is separately reference counted and the base ref is put during
 bdi_destroy(). This means that the root congested state may go away
 prematurely while the queue is between bdi_destroy() and
 blk_cleanup_queue(), which was detected by Andrey's KASAN tests.

 The root cause of this problem is that bdi doesn't distinguish the two
 steps of destruction, unregistration and release, and now the root
 congested state actually requires a separate release step. To fix the
 issue, this patch separates out bdi_unregister() and bdi_exit() from
 bdi_destroy(). bdi_unregister() is called from blk_cleanup_queue()
 and bdi_exit() from blk_release_queue(). bdi_destroy() is now just a
 simple wrapper calling the two steps back-to-back.

 While at it, the prototype of bdi_destroy() is moved right below
 bdi_setup_and_register() so that the counterpart operations are
 located together.

 Signed-off-by: Tejun Heo
 Fixes: a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback")
 Cc: stable@vger.kernel.org # v4.2+
 Reported-and-tested-by: Andrey Konovalov
 Link: http://lkml.kernel.org/g/CAAeHK+zUJ74Zn17=rOyxacHU18SgCfC6bsYW=6kCY5GXJBwGfQ@mail.gmail.com
 Reviewed-by: Jan Kara
 Reviewed-by: Jeff Moyer
 Signed-off-by: Jens Axboe
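 Per the description, bdi_destroy() ends up as a trivial wrapper over the two new steps; a sketch consistent with that:

    /* Sketch: destruction split into its two halves, with the old entry
     * point kept as a back-to-back wrapper. */
    void bdi_destroy(struct backing_dev_info *bdi)
    {
            bdi_unregister(bdi);    /* called from blk_cleanup_queue() */
            bdi_exit(bdi);          /* called from blk_release_queue() */
    }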
11 Sep, 2015
1 commit
- 
Pull blk-cg updates from Jens Axboe: 
 "A bit later in the cycle, but this has been in the block tree for a a
 while. This is basically four patchsets from Tejun, that improve our
 buffered cgroup writeback. It was dependent on the other cgroup
 changes, but they went in earlier in this cycle.Series 1 is set of 5 patches that has cgroup writeback updates: - bdi_writeback iteration fix which could lead to some wb's being 
 skipped or repeated during e.g. sync under memory pressure.- Simplification of wb work wait mechanism. - Writeback tracepoints updated to report cgroup. Series 2 is is a set of updates for the CFQ cgroup writeback handling: cfq has always charged all async IOs to the root cgroup. It didn't 
 have much choice as writeback didn't know about cgroups and there
 was no way to tell who to blame for a given writeback IO.
 writeback finally grew support for cgroups and now tags each
 writeback IO with the appropriate cgroup to charge it against.This patchset updates cfq so that it follows the blkcg each bio is 
 tagged with. Async cfq_queues are now shared across cfq_group,
 which is per-cgroup, instead of per-request_queue cfq_data. This
 makes all IOs follow the weight based IO resource distribution
 implemented by cfq.- Switched from GFP_ATOMIC to GFP_NOWAIT as suggested by Jeff. - Other misc review points addressed, acks added and rebased. Series 3 is the blkcg policy cleanup patches: This patchset contains assorted cleanups for blkcg_policy methods 
 and blk[c]g_policy_data handling.- alloc/free added for blkg_policy_data. exit dropped. - alloc/free added for blkcg_policy_data. - blk-throttle's async percpu allocation is replaced with direct 
 allocation.- all methods now take blk[c]g_policy_data instead of blkcg_gq or 
 blkcg.And finally, series 4 is a set of patches cleaning up the blkcg stats 
 handling:blkcg's stats have always been somwhat of a mess. This patchset 
 tries to improve the situation a bit.- The following patches added to consolidate blkcg entry point and 
 blkg creation. This is in itself is an improvement and helps
 colllecting common stats on bio issue.- per-blkg stats now accounted on bio issue rather than request 
 completion so that bio based and request based drivers can behave
 the same way. The issue was spotted by Vivek.- cfq-iosched implements custom recursive stats and blk-throttle 
 implements custom per-cpu stats. This patchset make blkcg core
 support both by default.- cfq-iosched and blk-throttle keep track of the same stats 
 multiple times. Unify them"* 'for-4.3/blkcg' of git://git.kernel.dk/linux-block: (45 commits) 
 blkcg: use CGROUP_WEIGHT_* scale for io.weight on the unified hierarchy
 blkcg: s/CFQ_WEIGHT_*/CFQ_WEIGHT_LEGACY_*/
 blkcg: implement interface for the unified hierarchy
 blkcg: misc preparations for unified hierarchy interface
 blkcg: separate out tg_conf_updated() from tg_set_conf()
 blkcg: move body parsing from blkg_conf_prep() to its callers
 blkcg: mark existing cftypes as legacy
 blkcg: rename subsystem name from blkio to io
 blkcg: refine error codes returned during blkcg configuration
 blkcg: remove unnecessary NULL checks from __cfqg_set_weight_device()
 blkcg: reduce stack usage of blkg_rwstat_recursive_sum()
 blkcg: remove cfqg_stats->sectors
 blkcg: move io_service_bytes and io_serviced stats into blkcg_gq
 blkcg: make blkg_[rw]stat_recursive_sum() to be able to index into blkcg_gq
 blkcg: make blkcg_[rw]stat per-cpu
 blkcg: add blkg_[rw]stat->aux_cnt and replace cfq_group->dead_stats with it
 blkcg: consolidate blkg creation in blkcg_bio_issue_check()
 blk-throttle: improve queue bypass handling
 blkcg: move root blkg lookup optimization from throtl_lookup_tg() to __blkg_lookup()
 blkcg: inline [__]blkg_lookup()
 ...
19 Aug, 2015
1 commit
- 
blkg (blkcg_gq) currently is created by blkcg policies invoking 
 blkg_lookup_create() which ends up repeating about the same code in
 different policies. Theoretically, this can avoid the overhead of
 looking and/or creating blkg's if blkcg is enabled but no policy is in
 use; however, the cost of blkg lookup / creation is very low
 especially if only the root blkcg is in use which is highly likely if
 no blkcg policy is in active use - it boils down to a single very
 predictable conditional and surrounding RCU protection.

 This patch consolidates blkg creation to a new function
 blkcg_bio_issue_check() which is called during bio issue from
 generic_make_request_checks(). blkcg_bio_issue_check() is now the
 only function which tries to create missing blkg's. The subsequent
 policy and request_list operations just perform blkg_lookup() and if
 missing fall back to the root.

 * blk_get_rl() no longer tries to create blkg. It uses blkg_lookup()
   instead of blkg_lookup_create().

 * blk_throtl_bio() is now called from blkcg_bio_issue_check() with rcu
   read locked and blkg already looked up. Both throtl_lookup_tg() and
   throtl_lookup_create_tg() are dropped.

 * cfq is similarly updated. cfq_lookup_create_cfqg() is replaced with
   cfq_lookup_cfqg() which uses blkg_lookup().

 This consolidates blkg handling and avoids unnecessary blkg creation
 retries under memory pressure. In addition, this provides a common
 bio entry point into blkcg where things like common accounting can be
 performed.

 v2: Build fixes for !CONFIG_CFQ_GROUP_IOSCHED and
     !CONFIG_BLK_DEV_THROTTLING.

 Signed-off-by: Tejun Heo
 Cc: Vivek Goyal
 Cc: Arianna Avanzini
 Signed-off-by: Jens Axboe
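 A sketch of the consolidated entry point as used from generic_make_request_checks(), consistent with the description above (return-value handling simplified):

    /* Sketch: one blkcg hook on the bio-issue path; it looks up (and, if
     * needed, creates) the blkg, runs throttling and does common accounting. */
    if (!blkcg_bio_issue_check(q, bio))
            return false;   /* bio was throttled/consumed; don't issue it now */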
14 Aug, 2015
1 commit
- 
The way the block layer is currently written, it goes to great lengths 
 to avoid having to split bios; upper layer code (such as bio_add_page())
 checks what the underlying device can handle and tries to always create
 bios that don't need to be split.

 But this approach becomes unwieldy and eventually breaks down with
 stacked devices and devices with dynamic limits, and it adds a lot of
 complexity. If the block layer could split bios as needed, we could
 eliminate a lot of complexity elsewhere - particularly in stacked
 drivers. Code that creates bios can then create whatever size bios are
 convenient, and more importantly stacked drivers don't have to deal with
 both their own bio size limitations and the limitations of the
 (potentially multiple) devices underneath them. In the future this will
 let us delete merge_bvec_fn and a bunch of other code.

 We do this by adding calls to blk_queue_split() to the various
 make_request functions that need it - a few can already handle arbitrary
 size bios. Note that we add the call _after_ any call to
 blk_queue_bounce(); this means that blk_queue_split() and
 blk_recalc_rq_segments() don't need to be concerned with bouncing
 affecting segment merging.

 Some make_request_fn() callbacks were simple enough to audit and verify
 they don't need blk_queue_split() calls. The skipped ones are:

 * nfhd_make_request (arch/m68k/emu/nfblock.c)
 * axon_ram_make_request (arch/powerpc/sysdev/axonram.c)
 * simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c)
 * brd_make_request (ramdisk - drivers/block/brd.c)
 * mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c)
 * loop_make_request
 * null_queue_bio
 * bcache's make_request fns

 Some others are almost certainly safe to remove now, but will be left
 for future patches.

 Cc: Jens Axboe
 Cc: Christoph Hellwig
 Cc: Al Viro
 Cc: Ming Lei
 Cc: Neil Brown
 Cc: Alasdair Kergon
 Cc: Mike Snitzer
 Cc: dm-devel@redhat.com
 Cc: Lars Ellenberg
 Cc: drbd-user@lists.linbit.com
 Cc: Jiri Kosina
 Cc: Geoff Levand
 Cc: Jim Paris
 Cc: Philip Kelleher
 Cc: Minchan Kim
 Cc: Nitin Gupta
 Cc: Oleg Drokin
 Cc: Andreas Dilger
 Acked-by: NeilBrown (for the 'md/md.c' bits)
 Acked-by: Mike Snitzer
 Reviewed-by: Martin K. Petersen
 Signed-off-by: Kent Overstreet
 [dpark: skip more mq-based drivers, resolve merge conflicts, etc.]
 Signed-off-by: Dongsu Park
 Signed-off-by: Ming Lin
 Signed-off-by: Jens Axboe
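 For a driver that cannot handle arbitrarily sized bios, the added call looks roughly like this inside its make_request_fn (a sketch using the three-argument blk_queue_split() of this era; the driver names are hypothetical):

    /* Hypothetical make_request_fn: let the core split the bio down to this
     * queue's limits, after any bouncing has been done. */
    static void mydrv_make_request(struct request_queue *q, struct bio *bio)
    {
            blk_queue_bounce(q, &bio);                  /* bounce first, as noted above */
            blk_queue_split(q, &bio, q->bio_split);     /* then split to queue limits */

            /* ... process bio, which now fits the device's limits ... */
    }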
29 Jul, 2015
2 commits
- 
Some places use helpers now, others don't. We only have the 'is set' 
 helper, add helpers for setting and clearing flags too.

 It was a bit of a mess of atomic vs non-atomic access. With
 BIO_UPTODATE gone, we don't have any risk of concurrent access to the
 flags. So relax the restriction and don't make any of them atomic. The
 flags that do have serialization issues (reffed and chained), we
 already handle those separately.Signed-off-by: Jens Axboe 
- 
Currently we have two different ways to signal an I/O error on a BIO:

 (1) by clearing the BIO_UPTODATE flag
 (2) by returning a Linux errno value to the bi_end_io callback

 The first one has the drawback of only communicating a single possible
 error (-EIO), and the second one has the drawback of not being persistent
 when bios are queued up, and of not being passed along from child to parent
 bio in the ever more popular chaining scenario. Having both mechanisms
 available has the additional drawback of utterly confusing driver authors
 and introducing bugs where various I/O submitters only deal with one of
 them, and the others have to add boilerplate code to deal with both kinds
 of error returns.

 So add a new bi_error field to store an errno value directly in struct
 bio and remove the existing mechanisms to clean all this up.

 Signed-off-by: Christoph Hellwig
 Reviewed-by: Hannes Reinecke
 Reviewed-by: NeilBrown
 Signed-off-by: Jens Axboe
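 With a single bi_error field, a driver's completion path reduces to something like the following (a sketch of the post-change convention; the driver function is hypothetical and the one-argument bio_endio() is assumed):

    /* Sketch: report an error by stashing the errno in the bio itself. */
    static void mydrv_end_io(struct bio *bio, int hw_status)
    {
            if (hw_status)
                    bio->bi_error = -EIO;   /* any errno, not just "not up to date" */
            bio_endio(bio);                 /* bi_error travels with the bio/chain */
    }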
07 Jul, 2015
1 commit
- 
Use FIELD_SIZEOF instead of open coding it.

 Signed-off-by: Maninder Singh
 Signed-off-by: Jens Axboe
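 For reference, FIELD_SIZEOF() is the kernel.h macro for the size of a struct member; it replaces the open-coded pattern (the struct and member here are only illustrative):

    /* Open-coded size of a member: */
    size_t len = sizeof(((struct request *)0)->cmd_flags);

    /* Equivalent, using the macro: */
    size_t len2 = FIELD_SIZEOF(struct request, cmd_flags);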
27 Jun, 2015
1 commit
- 
Pull device mapper fixes from Mike Snitzer: 
 "Apologies for not pressing this request-based DM partial completion
 issue further, it was an oversight on my part. We'll have to get it
 fixed up properly and revisit for a future release.

 - Revert block and DM core changes that removed request-based DM's
   ability to handle partial request completions -- otherwise with the
   current SCSI LLDs these changes could lead to silent data
   corruption.

 - Fix two DM version bumps that were missing from the initial 4.2 DM
   pull request (enabled userspace lvm2 to know certain changes have
   been made)"

* tag 'dm-4.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
 dm cache policy smq: fix "default" version to be 1.4.0
 dm: bump the ioctl version to 4.32.0
 Revert "block, dm: don't copy bios for request clones"
 Revert "dm: do not allocate any mempools for blk-mq request-based DM"
26 Jun, 2015
3 commits
- 
This reverts commit 5f1b670d0bef508a5554d92525f5f6d00d640b38.

 Justification for the revert as reported in this dm-devel post:
 https://www.redhat.com/archives/dm-devel/2015-June/msg00160.html

 this change should not be pushed to mainline yet.

 Firstly, Christoph has a newer version of the patch that fixes the silent
 data corruption problem:
 https://www.redhat.com/archives/dm-devel/2015-May/msg00229.html

 And the new version still depends on LLDDs to always complete requests
 to the end when an error happens, while the block API doesn't enforce such a
 requirement. If the assumption is ever broken, the inconsistency between
 request and bio (e.g. rq->__sector and rq->bio) will cause silent data
 corruption:
 https://www.redhat.com/archives/dm-devel/2015-June/msg00022.html

 Reported-by: Junichi Nomura
 Signed-off-by: Mike Snitzer
- 
Pull cgroup writeback support from Jens Axboe: 
 "This is the big pull request for adding cgroup writeback support.This code has been in development for a long time, and it has been 
 simmering in for-next for a good chunk of this cycle too. This is one
 of those problems that has been talked about for at least half a
 decade, finally there's a solution and code to go with it.Also see last weeks writeup on LWN: http://lwn.net/Articles/648292/" * 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits) 
 writeback, blkio: add documentation for cgroup writeback support
 vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
 writeback: do foreign inode detection iff cgroup writeback is enabled
 v9fs: fix error handling in v9fs_session_init()
 bdi: fix wrong error return value in cgwb_create()
 buffer: remove unusued 'ret' variable
 writeback: disassociate inodes from dying bdi_writebacks
 writeback: implement foreign cgroup inode bdi_writeback switching
 writeback: add lockdep annotation to inode_to_wb()
 writeback: use unlocked_inode_to_wb transaction in inode_congested()
 writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
 writeback: implement [locked_]inode_to_wb_and_lock_list()
 writeback: implement foreign cgroup inode detection
 writeback: make writeback_control track the inode being written back
 writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
 mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
 writeback: implement memcg writeback domain based throttling
 writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
 writeback: implement memcg wb_domain
 writeback: update wb_over_bg_thresh() to use wb_domain aware operations
 ...
- 
Pull core block IO update from Jens Axboe: 
 "Nothing really major in here, mostly a collection of smaller
 optimizations and cleanups, mixed with various fixes. In more detail,
 this contains:

 - Addition of policy specific data to blkcg for block cgroups. From
   Arianna Avanzini.

 - Various cleanups around command types from Christoph.

 - Cleanup of the suspend block I/O path from Christoph.

 - Plugging updates from Shaohua and Jeff Moyer, for blk-mq.

 - Eliminating atomic inc/dec of both remaining IO count and reference
   count in a bio. From me.

 - Fixes for SG gap and chunk size support for data-less (discards)
   IO, so we can merge these better. From me.

 - Small restructuring of blk-mq shared tag support, freeing drivers
   from iterating hardware queues. From Keith Busch.

 - A few cfq-iosched tweaks, from Tahsin Erdogan and me. Makes the
   IOPS mode the default for non-rotational storage"

* 'for-4.2/core' of git://git.kernel.dk/linux-block: (35 commits)
 cfq-iosched: fix other locations where blkcg_to_cfqgd() can return NULL
 cfq-iosched: fix sysfs oops when attempting to read unconfigured weights
 cfq-iosched: move group scheduling functions under ifdef
 cfq-iosched: fix the setting of IOPS mode on SSDs
 blktrace: Add blktrace.c to BLOCK LAYER in MAINTAINERS file
 block, cgroup: implement policy-specific per-blkcg data
 block: Make CFQ default to IOPS mode on SSDs
 block: add blk_set_queue_dying() to blkdev.h
 blk-mq: Shared tag enhancements
 block: don't honor chunk sizes for data-less IO
 block: only honor SG gap prevention for merges that contain data
 block: fix returnvar.cocci warnings
 block, dm: don't copy bios for request clones
 block: remove management of bi_remaining when restoring original bi_end_io
 block: replace trylock with mutex_lock in blkdev_reread_part()
 block: export blkdev_reread_part() and __blkdev_reread_part()
 suspend: simplify block I/O handling
 block: collapse bio bit space
 block: remove unused BIO_RW_BLOCK and BIO_EOF flags
 block: remove BIO_EOPNOTSUPP
 ...