24 Sep, 2014

1 commit

  • blk-mq uses percpu_ref for its usage counter, which tracks the number
    of in-flight commands and is used to synchronously drain the queue on
    freeze. percpu_ref shutdown takes measurable wallclock time as it
    involves a sched RCU grace period. This means that draining a blk-mq
    queue takes measurable wallclock time. One would think that this shouldn't
    matter as queue shutdown should be a rare event which takes place
    asynchronously w.r.t. userland.

    Unfortunately, SCSI probing involves synchronously setting up and then
    tearing down a lot of request_queues back-to-back for non-existent
    LUNs. This means that SCSI probing may take more than ten seconds
    when scsi-mq is used.

    This will be properly fixed by implementing a mechanism to keep
    q->mq_usage_counter in atomic mode till genhd registration; however,
    that involves rather big updates to percpu_ref which is difficult to
    apply late in the devel cycle (v3.17-rc6 at the moment). As a
    stop-gap measure till the proper fix can be implemented in the next
    cycle, this patch introduces __percpu_ref_kill_expedited() and makes
    blk_mq_freeze_queue() use it. This is heavy-handed but should work
    for testing the experimental SCSI blk-mq implementation.
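
    As an illustration, a minimal sketch of what such an expedited kill can
    look like (the helper name comes from the description above; the body is
    an assumption based on 3.17-era percpu_ref internals, not the actual
    patch):

    /* Sketch only: force the percpu -> atomic switch synchronously. */
    void __percpu_ref_kill_expedited(struct percpu_ref *ref)
    {
            ref->pcpu_count_ptr |= PCPU_REF_DEAD;   /* fail new trygets   */
            synchronize_sched_expedited();          /* expedited sched-RCU
                                                     * grace period, not a
                                                     * normal one         */
            percpu_ref_kill_rcu(&ref->rcu);         /* fold percpu counts
                                                     * into the atomic ref */
    }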

    Signed-off-by: Tejun Heo
    Reported-by: Christoph Hellwig
    Link: http://lkml.kernel.org/g/20140919113815.GA10791@lst.de
    Fixes: add703fda981 ("blk-mq: use percpu_ref for mq usage count")
    Cc: Kent Overstreet
    Cc: Jens Axboe
    Tested-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Tejun Heo
     

23 Sep, 2014

6 commits

  • Commit 2da78092 changed the locking from a mutex to a spinlock,
    so we no longer sleep in this context. But there was a leftover
    might_sleep() in there, which now triggers since we do the final
    free from an RCU callback. Get rid of it.
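
    The shape of the problem, as a minimal sketch (not the exact code):

    spin_lock(&lock);               /* locking is a spinlock since
                                     * commit 2da78092 ...          */
    might_sleep();                  /* ... so this leftover annotation
                                     * now warns: "sleeping function
                                     * called from invalid context" */
    spin_unlock(&lock);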

    Reported-by: Pontus Fuchs
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • When requests are retried due to hw or sw resource shortages,
    we often stop the associated hardware queue. So ensure that we
    restart the queues when running the requeue work, otherwise the
    queue run will be a no-op.
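
    A condensed sketch of the requeue work with the fix applied (assuming
    the 3.17-era helper names):

    static void blk_mq_requeue_work(struct work_struct *work)
    {
            struct request_queue *q =
                    container_of(work, struct request_queue, requeue_work);

            /* ... splice the requeue list and re-insert the requests ... */

            /*
             * Kick stopped hw queues too; a plain run would be a no-op
             * for queues that were stopped on a resource shortage.
             */
            blk_mq_start_hw_queues(q);
    }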

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • __blk_mq_alloc_rq_maps() can be invoked multiple times if we scale
    back the queue depth when we are low on memory. So don't clear
    set->tags when we fail, this is handled directly in
    the parent function, blk_mq_alloc_tag_set().
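
    Roughly, the inner unwind path then looks like this (a sketch following
    the description; the final kfree()/NULL of set->tags lives in
    blk_mq_alloc_tag_set()):

    static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
    {
            int i;

            for (i = 0; i < set->nr_hw_queues; i++) {
                    set->tags[i] = blk_mq_init_rq_map(set, i);
                    if (!set->tags[i])
                            goto out_unwind;
            }
            return 0;

    out_unwind:
            while (--i >= 0)
                    blk_mq_free_rq_map(set, set->tags[i], i);
            /* Deliberately leave set->tags alone: the caller may retry
             * with a smaller queue depth and reuse the array. */
            return -ENOMEM;
    }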

    Reported-by: Robert Elliott
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We should not insert requests into the flush state machine from
    blk_mq_insert_request. All incoming flush requests come through
    blk_{m,s}q_make_request and are handled there, while blk_execute_rq_nowait
    should only be called for BLOCK_PC requests. All other callers
    deal with requests that already went through the flush state machine
    and shouldn't be reinserted into it.
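
    Condensed, the insertion path becomes (a sketch, assuming 3.17-era
    internals):

    void blk_mq_insert_request(struct request *rq, bool at_head,
                               bool run_queue, bool async)
    {
            struct blk_mq_ctx *ctx = rq->mq_ctx;
            struct blk_mq_hw_ctx *hctx =
                    rq->q->mq_ops->map_queue(rq->q, ctx->cpu);

            /*
             * No blk_insert_flush() here any more: flushes were already
             * routed through the state machine by the make_request
             * functions, and BLOCK_PC requests never enter it.
             */
            __blk_mq_insert_request(hctx, rq, at_head);

            if (run_queue)
                    blk_mq_run_hw_queue(hctx, async);
    }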

    Reported-by: Robert Elliott
    Debugged-by: Ming Lei
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • This patch should fix the bug reported in
    https://lkml.org/lkml/2014/9/11/249.

    We have to initialize at least the atomic_flags and the cmd_flags when
    allocating storage for the requests.

    Otherwise blk_mq_timeout_check() might dereference uninitialized
    pointers when racing with the creation of a request.

    Also move the reset of cmd_flags from the initialization code to the point
    where a request is freed. So we will never end up with pending flush
    request indicators that might trigger dereferences of invalid pointers
    in blk_mq_timeout_check().
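
    A sketch of the allocation-side half of the fix (the request storage is
    carved out of raw pages, so nothing is zeroed for us; variable names
    here are illustrative):

    struct request *rq = p;         /* p points into an un-zeroed page  */

    rq->atomic_flags = 0;           /* no stale STARTED/COMPLETE bits   */
    rq->cmd_flags = 0;              /* no stale flush indicators that
                                     * blk_mq_timeout_check() might
                                     * chase to invalid pointers        */
    tags->rqs[i] = rq;              /* only now publish the request     */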

    Cc: stable@vger.kernel.org
    Signed-off-by: David Hildenbrand
    Reported-by: Paulo De Rezende Pinatti
    Tested-by: Paulo De Rezende Pinatti
    Acked-by: Christian Borntraeger
    Signed-off-by: Jens Axboe

    David Hildenbrand
     
  • When we start the request, we set the deadline and flip the bits
    marking the request as started and non-complete. However, it's
    important that the deadline store is ordered before flipping the
    bits, otherwise we could have a small window where the request is
    marked started but with an invalid deadline. This can confuse the
    timeout handling.
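
    In code, the ordering constraint looks like this (a sketch of
    blk_mq_start_request(), assuming the usual barrier pairing for atomic
    bitops):

    rq->deadline = jiffies + q->rq_timeout;

    /*
     * Make the deadline store visible before the STARTED bit;
     * set_bit()/clear_bit() do not imply a memory barrier.
     */
    smp_mb__before_atomic();

    set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
    clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);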

    Suggested-by: Ming Lei
    Signed-off-by: Jens Axboe

    Jens Axboe
     

10 Sep, 2014

2 commits

  • If we are running in a kdump environment, resources are scarce.
    For some SCSI setups with a huge set of shared tags, we run out
    of memory allocating what the driver is asking for. So implement
    a scale back logic to reduce the tag depth for those cases, allowing
    the driver to successfully load.

    We should extend this to detect low memory situations, and implement
    a sane fallback for those (1 queue, 64 tags, or something like that).
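
    The scale-back loop, condensed (a sketch; the exact lower bound is an
    assumption):

    do {
            if (!__blk_mq_alloc_rq_maps(set))
                    break;                  /* success at this depth */

            set->queue_depth >>= 1;         /* halve the tag depth   */
            if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN) {
                    err = -ENOMEM;          /* cannot go any lower   */
                    break;
            }
    } while (set->queue_depth);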

    Tested-by: Robert Elliott
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • When a queue is registered, the block layer turns off the bypass
    setting (because bypass is enabled when the queue is created). This
    doesn't work well for queues that are unregistered and then registered
    again; we get a WARNING because of the unbalanced calls to
    blk_queue_bypass_end().

    This patch fixes the problem by making blk_register_queue() call
    blk_queue_bypass_end() only the first time the queue is registered.
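
    One way to express the guard (a sketch; the flag name is an assumption
    based on the description):

    int blk_register_queue(struct gendisk *disk)
    {
            struct request_queue *q = disk->queue;

            /*
             * End bypass only on first registration; re-registering
             * the queue must not call blk_queue_bypass_end() again.
             */
            if (!test_bit(QUEUE_FLAG_INIT_DONE, &q->queue_flags)) {
                    queue_flag_set_unlocked(QUEUE_FLAG_INIT_DONE, q);
                    blk_queue_bypass_end(q);
            }

            /* ... register the queue kobject, elevator, etc. ... */
            return 0;
    }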

    Signed-off-by: Alan Stern
    Acked-by: Tejun Heo
    CC: James Bottomley
    CC: Jens Axboe
    Signed-off-by: Jens Axboe

    Alan Stern
     

04 Sep, 2014

2 commits

  • Releases the dev_t minor when all references are closed to prevent
    another device from acquiring the same major/minor.

    Since the partition's release may be invoked from call_rcu's soft-irq
    context, the ext_devt_idr's mutex had to be replaced with a spinlock so
    as not to sleep.
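
    Sketched against genhd.c (names approximate; the point is that a
    spinlock, unlike a mutex, may be taken from the soft-irq release path):

    static DEFINE_SPINLOCK(ext_devt_lock);          /* was a mutex */

    static void blk_free_devt(dev_t devt)
    {
            if (MAJOR(devt) == BLOCK_EXT_MAJOR) {
                    spin_lock(&ext_devt_lock);
                    idr_remove(&ext_devt_idr,
                               blk_mangle_minor(MINOR(devt)));
                    spin_unlock(&ext_devt_lock);
            }
    }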

    Signed-off-by: Keith Busch
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Keith Busch
     
  • In blk-mq.c blk_mq_alloc_tag_set(), if the
    set->tags = kmalloc_node()
    allocation succeeds but one of the blk_mq_init_rq_map() calls fails,
    the
    goto out_unwind;
    path needs to free set->tags so the caller is not obligated to do so.
    None of the current callers (null_blk, virtio_blk, or the forthcoming
    scsi-mq) do so.

    set->tags needs to be set to NULL after doing so,
    so other tag cleanup logic doesn't try to free
    a stale pointer later. Also set it to NULL
    in blk_mq_free_tag_set.

    Tested with error injection on the forthcoming
    scsi-mq + hpsa combination.
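
    The error path being described, condensed (sketch):

    out_unwind:
            while (--i >= 0)
                    blk_mq_free_rq_map(set, set->tags[i], i);

            kfree(set->tags);       /* don't force callers to free it  */
            set->tags = NULL;       /* and don't leave a stale pointer
                                     * for later tag cleanup logic     */
            return -ENOMEM;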

    Signed-off-by: Robert Elliott
    Signed-off-by: Jens Axboe

    Robert Elliott
     

03 Sep, 2014

1 commit

  • QUEUE_FLAG_NO_SG_MERGE is set by default for blk-mq devices,
    so the computed bio->bi_phys_segments may be bigger than
    queue_max_segments(q) for blk-mq devices, and drivers will then
    fail to handle the case; for example, the BUG_ON() in
    virtio_queue_rq() can be triggered for virtio-blk:

    https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1359146

    This patch fixes the issue by ignoring the QUEUE_FLAG_NO_SG_MERGE
    flag if the computed bio->bi_phys_segments is bigger than
    queue_max_segments(q). The regression was caused by commit
    05f1dd53152173 ("block: add queue flag for disabling SG merging").
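
    The shape of the fix in the segment recount (a sketch; the exact
    condition is an assumption):

    bool no_sg_merge = test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags);

    /* Trust the no-SG-merge shortcut only when the raw vector count
     * already fits within the queue's segment limit. */
    if (no_sg_merge && bio->bi_vcnt < queue_max_segments(q))
            return bio->bi_vcnt;

    /* Otherwise fall through to the full recount, which honors the
     * merge rules and keeps bi_phys_segments within the limit. */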

    Reported-by: Kick In
    Tested-by: Chris J Arges
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

27 Aug, 2014

1 commit

  • cfq_group_service_tree_add() is applying new_weight at the beginning of
    the function via cfq_update_group_weight().
    This actually allows weight to change between adding it to and subtracting
    it from children_weight, and triggers the WARN_ON_ONCE() in
    cfq_group_service_tree_del(), or even causes an oops via a divide
    error during vfr calculation in cfq_group_service_tree_add().

    The detailed scenario is as follows:
    1. Create blkio cgroups X and Y as a child of X.
    Set X's weight to 500 and perform some I/O to apply new_weight.
    This X's I/O completes before starting Y's I/O.
    2. Y starts I/O and cfq_group_service_tree_add() is called with Y.
    3. cfq_group_service_tree_add() walks up the tree during children_weight
    calculation and adds parent X's weight (500) to children_weight of root.
    children_weight becomes 500.
    4. Set X's weight to 1000.
    5. X starts I/O and cfq_group_service_tree_add() is called with X.
    6. cfq_group_service_tree_add() applies its new_weight (1000).
    7. I/O of Y completes and cfq_group_service_tree_del() is called with Y.
    8. I/O of X completes and cfq_group_service_tree_del() is called with X.
    9. cfq_group_service_tree_del() subtracts X's weight (1000) from
    children_weight of root. children_weight becomes -500.
    This triggers WARN_ON_ONCE().
    10. Set X's weight to 500.
    11. X starts I/O and cfq_group_service_tree_add() is called with X.
    12. cfq_group_service_tree_add() applies its new_weight (500) and adds it
    to children_weight of root. children_weight becomes 0. Calculation of
    vfr then triggers an oops via a divide error.

    The weight should be updated right before adding it to children_weight.
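
    Conceptually, the reordering looks like this (a sketch that omits the
    real tree-walk details):

    __cfq_group_service_tree_add(st, cfqg);

    cfq_update_group_weight(cfqg);  /* moved: apply a pending new_weight
                                     * right before accounting ...      */
    cfqg_parent(cfqg)->children_weight += cfqg->weight;
                                    /* ... so tree_del later subtracts
                                     * exactly the value added here     */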

    Reported-by: Ruki Sekiya
    Signed-off-by: Toshiaki Makita
    Acked-by: Tejun Heo
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe

    Toshiaki Makita
     

26 Aug, 2014

1 commit

  • Before commit 2cada584b200 ("block: cleanup error handling in sg_io"),
    we had ret = 0 before entering the last big if block of sg_io.

    Since 2cada584b200, ret = -EFAULT, which breaks hdparm:

    /dev/sda:
    setting Advanced Power Management level to 0xc8 (200)
    HDIO_DRIVE_CMD failed: Bad address
    APM_level = 128

    Signed-off-by: Sabrina Dubroca
    Fixes: 2cada584b200 ("block: cleanup error handling in sg_io")
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Sabrina Dubroca
     

23 Aug, 2014

2 commits

  • blk_rq_set_block_pc() memsets rq->cmd to 0, so it should come
    immediately after blk_get_request() to avoid overwriting the
    user-supplied CDB. Also check for failure to allocate rq.
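
    The corrected ordering, as a sketch (q, cdb and cdb_len are
    illustrative):

    rq = blk_get_request(q, WRITE, GFP_KERNEL);
    if (!rq)
            return -ENOMEM;         /* don't dereference a failed alloc */

    blk_rq_set_block_pc(rq);        /* zeroes rq->cmd, so do this first */
    memcpy(rq->cmd, cdb, cdb_len);  /* then copy the user-supplied CDB  */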

    Fixes: f27b087b81b7 ("block: add blk_rq_set_block_pc()")
    Cc: stable@vger.kernel.org # 3.16.x
    Signed-off-by: Tony Battersby
    Signed-off-by: Jens Axboe

    Tony Battersby
     
  • This patch fixes code such as the following with scsi-mq enabled:

    rq = blk_get_request(...);
    blk_rq_set_block_pc(rq);

    rq->cmd = my_cmd_buffer; /* separate CDB buffer */

    blk_execute_rq_nowait(...);

    Code like this appears in e.g. sg_start_req() in drivers/scsi/sg.c (for
    large CDBs only). Without this patch, scsi_mq_prep_fn() will set
    rq->cmd back to rq->__cmd, causing the wrong CDB to be sent to the device.

    Signed-off-by: Tony Battersby
    Signed-off-by: Jens Axboe

    Tony Battersby
     

22 Aug, 2014

6 commits

  • Signed-off-by: Christoph Hellwig
    Reviewed-by: Boaz Harrosh
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Make sure we always clean up through the out label and just have
    a single place to put the request.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • While converting to percpu_ref for freezing, add703fda981 ("blk-mq:
    use percpu_ref for mq usage count") incorrectly made
    blk_mq_freeze_queue() misbehave when freezing is nested due to
    percpu_ref_kill() being invoked on an already killed ref.

    Fix it by making blk_mq_freeze_queue() kill and kick the queue only
    for the outermost freeze attempt. All the nested ones can simply wait
    for the ref to reach zero.

    While at it, remove unnecessary @wake initialization from
    blk_mq_unfreeze_queue().
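
    A sketch of the outermost-only logic (assuming a freeze-depth counter
    and a percpu_ref_is_zero()-style helper):

    void blk_mq_freeze_queue(struct request_queue *q)
    {
            bool freeze;

            spin_lock_irq(q->queue_lock);
            freeze = !q->mq_freeze_depth++;         /* outermost freeze? */
            spin_unlock_irq(q->queue_lock);

            if (freeze) {
                    /* Only the first freezer kills and kicks ... */
                    percpu_ref_kill(&q->mq_usage_counter);
                    blk_mq_run_queues(q, false);
            }
            /* ... but every freezer, nested or not, waits for drain. */
            wait_event(q->mq_freeze_wq,
                       percpu_ref_is_zero(&q->mq_usage_counter));
    }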

    Signed-off-by: Tejun Heo
    Reported-by: Ming Lei
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Just grammar or spelling errors, nothing major.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • When getting a PI error, we reach bio_integrity_end_io with
    bi_remaining already decremented to 0, and we will eventually
    need to call bio_endio with the original bio completion handler
    restored. Calling bio_endio there invokes a BUG_ON(). We should
    call bio_endio_nodec instead, as is done in bio_integrity_verify_fn.

    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Sagi Grimberg
     
  • blk-mq uses BLK_MQ_F_SHOULD_MERGE, as set by the driver at init time,
    to determine whether it should merge IO or not. However, this could
    also be disabled by the admin, if merging is switched off through
    sysfs. So check the general queue state as well before attempting
    to merge IO.
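
    The check, roughly (a sketch of the make_request-side logic; exact
    placement is an assumption):

    if (!(hctx->flags & BLK_MQ_F_SHOULD_MERGE) ||
        blk_queue_nomerges(q)) {
            /* Driver opted out, or the admin disabled merging via
             * sysfs: turn the bio into a new request directly. */
            blk_mq_bio_to_request(rq, bio);
    } else if (!blk_mq_attempt_merge(q, ctx, bio)) {
            blk_mq_bio_to_request(rq, bio);  /* no merge candidate */
    }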

    Reported-by: Rob Elliott
    Tested-by: Rob Elliott
    Signed-off-by: Jens Axboe

    Jens Axboe
     

16 Aug, 2014

1 commit

  • Before the queue is released, it has already been frozen
    by blk_cleanup_queue(), so there is no need to freeze the queue
    again when deleting the tag set.

    This patch fixes the WARNING "percpu_ref_kill() called more than once!"
    which is triggered when unloading a block driver.

    Cc: Tejun Heo
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

14 Aug, 2014

3 commits

  • Pull device mapper changes from Mike Snitzer:

    - Allow the thin target to be paired with any size external origin; also
    allow thin snapshots to be larger than the external origin.

    - Add support for quickly loading a repetitive pattern into the
    dm-switch target.

    - Use per-bio data in the dm-crypt target instead of always using a
    mempool for each allocation. Required switching to kmalloc alignment
    for the bio slab.

    - Fix DM core to properly stack the QUEUE_FLAG_NO_SG_MERGE flag

    - Fix the dm-cache and dm-thin targets' export of the minimum_io_size
    to match the data block size -- this fixes an issue where mkfs.xfs
    would improperly infer raid striping was in place on the underlying
    storage.

    - Small cleanups in dm-io, dm-mpath and dm-cache

    * tag 'dm-3.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
    dm table: propagate QUEUE_FLAG_NO_SG_MERGE
    dm switch: efficiently support repetitive patterns
    dm switch: factor out switch_region_table_read
    dm cache: set minimum_io_size to cache's data block size
    dm thin: set minimum_io_size to pool's data block size
    dm crypt: use per-bio data
    block: use kmalloc alignment for bio slab
    dm table: make dm_table_supports_discards static
    dm cache metadata: use dm-space-map-metadata.h defined size limits
    dm cache: fail migrations in the do_worker error path
    dm cache: simplify deferred set reference count increments
    dm thin: relax external origin size constraints
    dm thin: switch to an atomic_t for tracking pending new block preparations
    dm mpath: eliminate pg_ready() wrapper
    dm io: simplify dec_count and sync_io

    Linus Torvalds
     
  • Pull block driver changes from Jens Axboe:
    "Nothing out of the ordinary here, this pull request contains:

    - A big round of fixes for bcache from Kent Overstreet, Slava Pestov,
    and Surbhi Palande. No new features, just a lot of fixes.

    - The usual round of drbd updates from Andreas Gruenbacher, Lars
    Ellenberg, and Philipp Reisner.

    - virtio_blk was converted to blk-mq back in 3.13, but now Ming Lei
    has taken it one step further and added support for actually using
    more than one queue.

    - Addition of an explicit SG_FLAG_Q_AT_HEAD for block/bsg, to
    complement the default behavior of adding to the tail of the
    queue. From Douglas Gilbert"

    * 'for-3.17/drivers' of git://git.kernel.dk/linux-block: (86 commits)
    bcache: Drop unneeded blk_sync_queue() calls
    bcache: add mutex lock for bch_is_open
    bcache: Correct printing of btree_gc_max_duration_ms
    bcache: try to set b->parent properly
    bcache: fix memory corruption in init error path
    bcache: fix crash with incomplete cache set
    bcache: Fix more early shutdown bugs
    bcache: fix use-after-free in btree_gc_coalesce()
    bcache: Fix an infinite loop in journal replay
    bcache: fix crash in bcache_btree_node_alloc_fail tracepoint
    bcache: bcache_write tracepoint was crashing
    bcache: fix typo in bch_bkey_equal_header
    bcache: Allocate bounce buffers with GFP_NOWAIT
    bcache: Make sure to pass GFP_WAIT to mempool_alloc()
    bcache: fix uninterruptible sleep in writeback thread
    bcache: wait for buckets when allocating new btree root
    bcache: fix crash on shutdown in passthrough mode
    bcache: fix lockdep warnings on shutdown
    bcache allocator: send discards with correct size
    bcache: Fix to remove the rcu_sched stalls.
    ...

    Linus Torvalds
     
  • Pull block core bits from Jens Axboe:
    "Small round this time, after the massive blk-mq dump for 3.16. This
    pull request contains:

    - Fixes for max_sectors overflow in ioctls from Akinoby Mita.

    - Partition off-by-one bug fix in aix partitions from Dan Carpenter.

    - Various small partition cleanups from Fabian Frederick.

    - Fix for the block integrity code sometimes returning the wrong
    vector count from Gu Zheng.

    - Cleanup and re-org of the blk-mq queue enter/exit percpu counters
    from Tejun. Dependent on the percpu pull for 3.17 (which was in
    the block tree too), that you have already pulled in.

    - A blkcg oops fix, also from Tejun"

    * 'for-3.17/core' of git://git.kernel.dk/linux-block:
    partitions: aix.c: off by one bug
    blkcg: don't call into policy draining if root_blkg is already gone
    Revert "bio: modify __bio_add_page() to accept pages that don't start a new segment"
    bio: modify __bio_add_page() to accept pages that don't start a new segment
    block: fix SG_[GS]ET_RESERVED_SIZE ioctl when max_sectors is huge
    block: fix BLKSECTGET ioctl when max_sectors is greater than USHRT_MAX
    block/partitions/efi.c: kerneldoc fixing
    block/partitions/msdos.c: code clean-up
    block/partitions/amiga.c: replace nolevel printk by pr_err
    block/partitions/aix.c: replace count*size kzalloc by kcalloc
    bio-integrity: add "bip_max_vcnt" into struct bio_integrity_payload
    blk-mq: use percpu_ref for mq usage count
    blk-mq: collapse __blk_mq_drain_queue() into blk_mq_freeze_queue()
    blk-mq: decouble blk-mq freezing from generic bypassing
    block, blk-mq: draining can't be skipped even if bypass_depth was non-zero
    blk-mq: fix a memory ordering bug in blk_mq_queue_enter()

    Linus Torvalds
     

06 Aug, 2014

1 commit

  • The lvip[] array has "state->limit" elements so the condition here
    should be >= instead of >.
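
    The pattern of the fix (a sketch; the index variable is illustrative):

    /* lvip[] is allocated with state->limit elements, so the valid
     * indexes are 0 .. state->limit - 1: */
    if (lv_ix >= state->limit)      /* was '>', one past the end */
            continue;               /* skip out-of-range entries */
    /* ... lvip[lv_ix] is now a safe access ... */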

    Fixes: 6ceea22bbbc8 ('partitions: add aix lvm partition support files')
    Signed-off-by: Dan Carpenter
    Acked-by: Philippe De Muyter
    Signed-off-by: Jens Axboe

    Dan Carpenter
     

05 Aug, 2014

1 commit

  • Pull cgroup changes from Tejun Heo:
    "Mostly changes to get the v2 interface ready. The core features are
    mostly ready now and I think it's reasonable to expect to drop the
    devel mask in one or two devel cycles at least for a subset of
    controllers.

    - cgroup added a controller dependency mechanism so that block cgroup
    can depend on memory cgroup. This will be used to finally support
    IO provisioning on the writeback traffic, which is currently being
    implemented.

    - The v2 interface now uses a separate table so that the interface
    files for the new interface are explicitly declared in one place.
    Each controller will explicitly review and add the files for the
    new interface.

    - cpuset is getting ready for the hierarchical behavior which is in
    the similar style with other controllers so that an ancestor's
    configuration change doesn't change the descendants' configurations
    irreversibly and processes aren't silently migrated when a CPU or
    node goes down.

    All the changes are to the new interface and no behavior changed for
    the multiple hierarchies"

    * 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (29 commits)
    cpuset: fix the WARN_ON() in update_nodemasks_hier()
    cgroup: initialize cgrp_dfl_root_inhibit_ss_mask from !->dfl_files test
    cgroup: make CFTYPE_ONLY_ON_DFL and CFTYPE_NO_ internal to cgroup core
    cgroup: distinguish the default and legacy hierarchies when handling cftypes
    cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes()
    cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes
    cgroup: split cgroup_base_files[] into cgroup_{dfl|legacy}_base_files[]
    cpuset: export effective masks to userspace
    cpuset: allow writing offlined masks to cpuset.cpus/mems
    cpuset: enable onlined cpu/node in effective masks
    cpuset: refactor cpuset_hotplug_update_tasks()
    cpuset: make cs->{cpus, mems}_allowed as user-configured masks
    cpuset: apply cs->effective_{cpus,mems}
    cpuset: initialize top_cpuset's configured masks at mount
    cpuset: use effective cpumask to build sched domains
    cpuset: inherit ancestor's masks if effective_{cpus, mems} becomes empty
    cpuset: update cs->effective_{cpus, mems} when config changes
    cpuset: update cpuset->effective_{cpus,mems} at hotplug
    cpuset: add cs->effective_cpus and cs->effective_mems
    cgroup: clean up sane_behavior handling
    ...

    Linus Torvalds
     

02 Aug, 2014

1 commit

  • Various subsystems can ask the bio subsystem to create a bio slab cache
    with some free space before the bio. This free space can be used for any
    purpose. Device mapper uses this per-bio-data feature to place some
    target-specific and device-mapper specific data before the bio, so that
    the target-specific data doesn't have to be allocated separately.

    This per-bio-data mechanism is used in place of kmalloc, so we need the
    allocated slab to have the same memory alignment as memory allocated
    with kmalloc.

    Change bio_find_or_create_slab() so that it uses ARCH_KMALLOC_MINALIGN
    alignment when creating the slab cache. This is needed so that dm-crypt
    can use per-bio-data for encryption - the crypto subsystem assumes this
    data will have the same alignment as kmalloc'ed memory.
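
    The change boils down to the alignment argument of the slab creation
    (a sketch of bio_find_or_create_slab()):

    slab = kmem_cache_create(bslab->name, sz,
                             ARCH_KMALLOC_MINALIGN,  /* was 0: give the
                                                      * per-bio data the
                                                      * same alignment as
                                                      * kmalloc() memory */
                             SLAB_HWCACHE_ALIGN, NULL);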

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Acked-by: Jens Axboe

    Mikulas Patocka
     

16 Jul, 2014

1 commit

  • While a queue is being destroyed, all the blkgs are destroyed and its
    ->root_blkg pointer is set to NULL. If someone else starts to drain
    while the queue is in this state, the following oops happens.

    NULL pointer dereference at 0000000000000028
    IP: [] blk_throtl_drain+0x84/0x230
    PGD e4a1067 PUD b773067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Modules linked in: cfq_iosched(-) [last unloaded: cfq_iosched]
    CPU: 1 PID: 537 Comm: bash Not tainted 3.16.0-rc3-work+ #2
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    task: ffff88000e222250 ti: ffff88000efd4000 task.ti: ffff88000efd4000
    RIP: 0010:[] [] blk_throtl_drain+0x84/0x230
    RSP: 0018:ffff88000efd7bf0 EFLAGS: 00010046
    RAX: 0000000000000000 RBX: ffff880015091450 RCX: 0000000000000001
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
    RBP: ffff88000efd7c10 R08: 0000000000000000 R09: 0000000000000001
    R10: ffff88000e222250 R11: 0000000000000000 R12: ffff880015091450
    R13: ffff880015092e00 R14: ffff880015091d70 R15: ffff88001508fc28
    FS: 00007f1332650740(0000) GS:ffff88001fa80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000000000028 CR3: 0000000009446000 CR4: 00000000000006e0
    Stack:
    ffffffff8144e8f6 ffff880015091450 0000000000000000 ffff880015091d80
    ffff88000efd7c28 ffffffff8144ae2f ffff880015091450 ffff88000efd7c58
    ffffffff81427641 ffff880015091450 ffffffff82401f00 ffff880015091450
    Call Trace:
    [] blkcg_drain_queue+0x1f/0x60
    [] __blk_drain_queue+0x71/0x180
    [] blk_queue_bypass_start+0x6e/0xb0
    [] blkcg_deactivate_policy+0x38/0x120
    [] blk_throtl_exit+0x34/0x50
    [] blkcg_exit_queue+0x35/0x40
    [] blk_release_queue+0x26/0xd0
    [] kobject_cleanup+0x38/0x70
    [] kobject_put+0x28/0x60
    [] blk_put_queue+0x15/0x20
    [] scsi_device_dev_release_usercontext+0x16b/0x1c0
    [] execute_in_process_context+0x89/0xa0
    [] scsi_device_dev_release+0x1c/0x20
    [] device_release+0x32/0xa0
    [] kobject_cleanup+0x38/0x70
    [] kobject_put+0x28/0x60
    [] put_device+0x17/0x20
    [] __scsi_remove_device+0xa9/0xe0
    [] scsi_remove_device+0x2b/0x40
    [] sdev_store_delete+0x27/0x30
    [] dev_attr_store+0x18/0x30
    [] sysfs_kf_write+0x3e/0x50
    [] kernfs_fop_write+0xe7/0x170
    [] vfs_write+0xaf/0x1d0
    [] SyS_write+0x4d/0xc0
    [] system_call_fastpath+0x16/0x1b

    776687bce42b ("block, blk-mq: draining can't be skipped even if
    bypass_depth was non-zero") made it easier to trigger this bug by
    making blk_queue_bypass_start() drain even when it loses the first
    bypass test to blk_cleanup_queue(); however, the bug has always been
    there even before the commit as blk_queue_bypass_start() could race
    against queue destruction, win the initial bypass test but perform the
    actual draining after blk_cleanup_queue() already destroyed all blkgs.

    Fix it by skipping calling into policy draining if all the blkgs are
    already gone.
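
    The fix itself is a short guard (sketch following the description):

    void blkcg_drain_queue(struct request_queue *q)
    {
            lockdep_assert_held(q->queue_lock);

            /*
             * @q could be exiting and already past blkg destruction;
             * in that case there is nothing left to drain.
             */
            if (!q->root_blkg)
                    return;

            blk_throtl_drain(q);
    }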

    Signed-off-by: Tejun Heo
    Reported-by: Shirish Pargaonkar
    Reported-by: Sasha Levin
    Reported-by: Jet Chen
    Cc: stable@vger.kernel.org
    Tested-by: Shirish Pargaonkar
    Signed-off-by: Jens Axboe

    Tejun Heo
     

15 Jul, 2014

3 commits

  • Currently, cftypes added by cgroup_add_cftypes() are used for both the
    unified default hierarchy and legacy ones and subsystems can mark each
    file with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to
    appear only on one of them. This is quite hairy and error-prone.
    Also, we may end up exposing interface files to the default hierarchy
    without thinking it through.

    cgroup_subsys will grow two separate cftype addition functions and
    apply each only on the hierarchies of the matching type. This will
    allow organizing cftypes in a lot clearer way and encourage subsystems
    to scrutinize the interface which is being exposed in the new default
    hierarchy.

    In preparation, this patch adds cgroup_add_legacy_cftypes() which
    currently is a simple wrapper around cgroup_add_cftypes() and replaces
    all cgroup_add_cftypes() usages with it.

    While at it, this patch drops a completely spurious return from
    __hugetlb_cgroup_file_init().

    This patch doesn't introduce any functional differences.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Aneesh Kumar K.V

    Tejun Heo
     
  • Currently, cgroup_subsys->base_cftypes is used for both the unified
    default hierarchy and legacy ones and subsystems can mark each file
    with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear
    only on one of them. This is quite hairy and error-prone. Also, we
    may end up exposing interface files to the default hierarchy without
    thinking it through.

    cgroup_subsys will grow two separate cftype arrays and apply each only
    on the hierarchies of the matching type. This will allow organizing
    cftypes in a lot clearer way and encourage subsystems to scrutinize
    the interface which is being exposed in the new default hierarchy.

    In preparation, this patch renames cgroup_subsys->base_cftypes to
    cgroup_subsys->legacy_cftypes. This patch is pure rename.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vivek Goyal
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Aristeu Rozanski
    Cc: Aneesh Kumar K.V

    Tejun Heo
     
  • This reverts commit 254c4407cb84a6dec90336054615b0f0e996bb7c.

    It causes crashes with cryptsetup, even after a few iterations and
    updates. Drop it for now.

    Jens Axboe
     

09 Jul, 2014

2 commits

  • sane_behavior has been used as a development vehicle for the default
    unified hierarchy. Now that the default hierarchy is in place, the
    flag became redundant and confusing as its usage is allowed on all
    hierarchies. There are gonna be either the default hierarchy or
    legacy ones. Let's make that clear by removing sane_behavior support
    on non-default hierarchies.

    This patch replaces cgroup_sane_behavior() with cgroup_on_dfl(). The
    comment on top of CGRP_ROOT_SANE_BEHAVIOR is moved to on top of
    cgroup_on_dfl() with sane_behavior specific part dropped.

    On the default and legacy hierarchies w/o sane_behavior, this
    shouldn't cause any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko

    Tejun Heo
     
  • Currently, the blkio subsystem attributes all of writeback IOs to the
    root. One of the issues is that there's no way to tell who originated
    a writeback IO from the block layer. Those IOs are usually issued
    asynchronously from a task which didn't have anything to do with
    actually generating the dirty pages. The memory subsystem, when
    enabled, already keeps track of the ownership of each dirty page and
    it's desirable for blkio to piggyback instead of adding its own
    per-page tag.

    cgroup now has a mechanism to express such dependency -
    cgroup_subsys->depends_on. This patch declares that blkcg depends on
    memcg so that memcg is enabled automatically on the default hierarchy
    when available. Future changes will make blkcg map the memcg tag to
    find out the cgroup to blame for writeback IOs.

    As this means that a memcg may be made invisible, this patch also
    implements css_reset() for memcg which resets its basic
    configurations. This implementation will probably need to be expanded
    to cover other states which are used in the default hierarchy.

    v2: blkcg's dependency on memcg is wrapped with CONFIG_MEMCG to avoid
    build failure. Reported by kbuild test robot.
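
    The dependency declaration itself is small (a sketch; the surrounding
    fields are elided):

    struct cgroup_subsys blkio_cgrp_subsys = {
            /* .css_alloc, .css_free, ... elided ... */
    #ifdef CONFIG_MEMCG
            /*
             * blkcg piggybacks on memcg's per-page ownership tracking
             * for writeback IOs, so have memcg enabled together with
             * blkcg on the default hierarchy.
             */
            .depends_on = 1 << memory_cgrp_id,
    #endif
    };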

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vivek Goyal
    Cc: Jens Axboe

    Tejun Heo
     

08 Jul, 2014

1 commit

  • There is no inherent reason why the last put of a tag structure must be
    the one for the Scsi_Host, as device model objects can be held for
    arbitrary periods. Merge blk_free_tags and __blk_free_tags into a single
    function that just releases a reference, and get rid of the BUG() when
    the host reference wasn't the last.
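
    A sketch of the merged function (following the description; details
    assumed):

    void blk_free_tags(struct blk_queue_tag *bqt)
    {
            /*
             * Just drop one reference; whoever drops the last one frees,
             * and there is no BUG() if that isn't the Scsi_Host's put.
             */
            if (atomic_dec_and_test(&bqt->refcnt)) {
                    BUG_ON(find_first_bit(bqt->tag_map, bqt->max_depth) <
                           bqt->max_depth);  /* no tags may be in flight */
                    kfree(bqt->tag_index);
                    kfree(bqt->tag_map);
                    kfree(bqt);
            }
    }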

    Signed-off-by: Christoph Hellwig
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

02 Jul, 2014

1 commit

  • The original behaviour is to refuse to add a new page if the maximum
    number of segments has been reached, regardless of whether the page we
    are going to add can be merged into the last segment.

    Unfortunately, when the system runs under heavy memory fragmentation
    conditions, a driver may try to add multiple pages to the last segment.
    The original code won't accept them and EBUSY will be reported to
    userspace.

    This patch modifies the function so that it refuses to add a page only
    when the page starts a new segment and the maximum number of segments
    has already been reached.
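
    A sketch of the relaxed check in __bio_add_page() (the contiguity
    condition is an assumption):

    if (bio->bi_vcnt >= queue_max_segments(q)) {
            struct bio_vec *prev = &bio->bi_io_vec[bio->bi_vcnt - 1];

            /* Refuse the page only if it would start a new segment,
             * i.e. it does not extend the last bvec contiguously. */
            if (page != prev->bv_page ||
                offset != prev->bv_offset + prev->bv_len)
                    return 0;
    }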

    The bug can be easily reproduced with the st driver:

    1) set CONFIG_SCSI_MPT2SAS_MAX_SGE or CONFIG_SCSI_MPT3SAS_MAX_SGE to 16
    2) modprobe st buffer_kbs=1024
    3) #dd if=/dev/zero of=/dev/st0 bs=1M count=10
    dd: error writing `/dev/st0': Device or resource busy

    [ming.lei@canonical.com: update bi_iter.bi_size before recounting segments]
    Signed-off-by: Maurizio Lombardi
    Signed-off-by: Ming Lei
    Tested-by: Dongsu Park
    Tested-by: Jet Chen
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Kent Overstreet
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Maurizio Lombardi