06 Nov, 2015

1 commit

  • Pull cgroup updates from Tejun Heo:
    "The cgroup core saw several significant updates this cycle:

    - percpu_rwsem for threadgroup locking is reinstated. This was
    temporarily dropped due to down_write latency issues. Oleg's
    rework of percpu_rwsem, which is scheduled to be merged in this
    merge window, resolves the issue.

    - On the v2 hierarchy, when controllers are enabled and disabled, all
    operations are atomic and can fail and revert cleanly. This allows
    ->can_attach() failure which is necessary for cpu RT slices.

    - Tasks now stay associated with the original cgroups after exit
    until released. This allows tracking resources held by zombies
    (e.g. pids) and makes it easy to find out where zombies came from
    on the v2 hierarchy. The pids controller was broken before these
    changes as zombies escaped the limits; unfortunately, updating this
    behavior required too many invasive changes and I don't think it's
    a good idea to backport them, so the pids controller on 4.3, the
    first version to include it, will stay broken
    at least until I'm sure about the cgroup core changes.

    - Optimization of a couple of common tests using static_key"

    * 'for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (38 commits)
    cgroup: fix race condition around termination check in css_task_iter_next()
    blkcg: don't create "io.stat" on the root cgroup
    cgroup: drop cgroup__DEVEL__legacy_files_on_dfl
    cgroup: replace error handling in cgroup_init() with WARN_ON()s
    cgroup: add cgroup_subsys->free() method and use it to fix pids controller
    cgroup: keep zombies associated with their original cgroups
    cgroup: make css_set_rwsem a spinlock and rename it to css_set_lock
    cgroup: don't hold css_set_rwsem across css task iteration
    cgroup: reorganize css_task_iter functions
    cgroup: factor out css_set_move_task()
    cgroup: keep css_set and task lists in chronological order
    cgroup: make cgroup_destroy_locked() test cgroup_is_populated()
    cgroup: make css_sets pin the associated cgroups
    cgroup: relocate cgroup_[try]get/put()
    cgroup: move check_for_release() invocation
    cgroup: replace cgroup_has_tasks() with cgroup_is_populated()
    cgroup: make cgroup->nr_populated count the number of populated css_sets
    cgroup: remove an unused parameter from cgroup_task_migrate()
    cgroup: fix too early usage of static_branch_disable()
    cgroup: make cgroup_update_dfl_csses() migrate all target processes atomically
    ...

    Linus Torvalds
     

05 Nov, 2015

3 commits

  • Pull block reservation support from Jens Axboe:
    "This adds support for persistent reservations, both at the core level,
    as well as for sd and NVMe"

    [ Background from the docs: "Persistent Reservations allow restricting
    access to block devices to specific initiators in a shared storage
    setup. All implementations are expected to ensure the reservations
    survive a power loss and cover all connections in a multi path
    environment" ]

    * 'for-4.4/reservations' of git://git.kernel.dk/linux-block:
    NVMe: Precedence error in nvme_pr_clear()
    nvme: add missing endianess annotations in nvme_pr_command
    NVMe: Add persistent reservation ops
    sd: implement the Persistent Reservation API
    block: add an API for Persistent Reservations
    block: cleanup blkdev_ioctl

    Linus Torvalds
     
  • Pull block integrity updates from Jens Axboe:
    ""This is the joint work of Dan and Martin, cleaning up and improving
    the support for block data integrity"

    * 'for-4.4/integrity' of git://git.kernel.dk/linux-block:
    block, libnvdimm, nvme: provide a built-in blk_integrity nop profile
    block: blk_flush_integrity() for bio-based drivers
    block: move blk_integrity to request_queue
    block: generic request_queue reference counting
    nvme: suspend i/o during runtime blk_integrity_unregister
    md: suspend i/o during runtime blk_integrity_unregister
    md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown
    block: Inline blk_integrity in struct gendisk
    block: Export integrity data interval size in sysfs
    block: Reduce the size of struct blk_integrity
    block: Consolidate static integrity profile properties
    block: Move integrity kobject to struct gendisk

    Linus Torvalds
     
  • Pull core block updates from Jens Axboe:
    "This is the core block pull request for 4.4. I've got a few more
    topic branches this time around, some of them will layer on top of the
    core+drivers changes and will come in a separate round. So not a huge
    chunk of changes in this round.

    This pull request contains:

    - Enable blk-mq page allocation tracking with kmemleak, from Catalin.

    - Unused prototype removal in blk-mq from Christoph.

    - Cleanup of the q->blk_trace exchange, using cmpxchg instead of two
    xchg()'s, from Davidlohr.

    - A plug flush fix from Jeff.

    - Also from Jeff, a fix that means we don't have to update shared tag
    sets at init time unless we do a state change. This greatly cuts down
    boot times with scsi/blk-mq on systems with thousands of devices.

    - blk-mq waitqueue barrier fix from Kosuke.

    - Various fixes from Ming:

    - Fixes for segment merging and splitting, and checks, for
    the old core and blk-mq.

    - Potential blk-mq speedup by marking ctx pending at the end
    of a plug insertion batch in blk-mq.

    - direct-io: don't dirty pages on kernel direct reads.

    - A WRITE_SYNC fix for mpage from Roman"

    * 'for-4.4/core' of git://git.kernel.dk/linux-block:
    blk-mq: avoid excessive boot delays with large lun counts
    blktrace: re-write setting q->blk_trace
    blk-mq: mark ctx as pending at batch in flush plug path
    blk-mq: fix for trace_block_plug()
    block: check bio_mergeable() early before merging
    blk-mq: check bio_mergeable() early before merging
    block: avoid to merge splitted bio
    block: setup bi_phys_segments after splitting
    block: fix plug list flushing for nomerge queues
    blk-mq: remove unused blk_mq_clone_flush_request prototype
    blk-mq: fix waitqueue_active without memory barrier in block/blk-mq-tag.c
    fs: direct-io: don't dirtying pages for ITER_BVEC/ITER_KVEC direct read
    fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
    block: kmemleak: Track the page allocations for struct request

    Linus Torvalds
     

03 Nov, 2015

1 commit

  • Hi,

    Zhangqing Luo reported long boot times on a system with thousands of
    LUNs when scsi-mq was enabled. He narrowed the problem down to
    blk_mq_add_queue_tag_set, where every queue is frozen in order to set
    the BLK_MQ_F_TAG_SHARED flag. Each added device will freeze all queues
    added before it in sequence, which involves waiting for an RCU grace
    period for each one. We don't need to do this. After the second queue
    is added, only new queues need to be initialized with the shared tag.
    We can do that by percolating the flag up to the blk_mq_tag_set, and
    updating the newly added queue's hctxs if the flag is set.

    This problem was introduced by commit 0d2602ca30e41 (blk-mq: improve
    support for shared tags maps).
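
    For illustration, a minimal sketch of the approach described above.
    This is not the literal patch; queue_set_hctx_shared() and
    blk_mq_update_tag_set_depth() stand in for helpers that mark a
    queue's hctxs shared and update the already-registered queues:

        static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
                                             struct request_queue *q)
        {
                mutex_lock(&set->tag_list_lock);

                /* Going from one queue to two: mark the whole set shared
                 * exactly once, instead of re-freezing every queue on
                 * each later addition. */
                if (!list_empty(&set->tag_list) &&
                    !(set->flags & BLK_MQ_F_TAG_SHARED)) {
                        set->flags |= BLK_MQ_F_TAG_SHARED;
                        blk_mq_update_tag_set_depth(set, true);
                }

                /* A newly added queue just inherits the flag. */
                if (set->flags & BLK_MQ_F_TAG_SHARED)
                        queue_set_hctx_shared(q, true);

                list_add_tail(&q->tag_set_list, &set->tag_list);
                mutex_unlock(&set->tag_list_lock);
        }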

    Reported-and-tested-by: Jason Luo
    Reviewed-by: Ming Lei
    Signed-off-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jeff Moyer
     

28 Oct, 2015

1 commit

  • In commit b49a087 ("block: remove split code in
    blkdev_issue_{discard,write_same}"), discard_granularity and alignment
    checks were removed. Ideally, with bio late splitting, the upper layers
    shouldn't need to depend on device's limits.

    Christoph reported a discard regression on the HGST Ultrastar SN100 NVMe
    device when running mkfs.xfs. We have not found the root cause yet.

    This patch re-adds discard_granularity and alignment checks by reverting
    the related changes in commit b49a087. The good thing is now we can
    remove the 2G discard size cap and just use UINT_MAX to avoid bi_size
    overflow.
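
    The cap itself can be sketched as below; bi_size is an unsigned int
    counted in bytes, so UINT_MAX >> 9 is the per-bio limit expressed in
    512-byte sectors (a sketch of the idea, not the exact patch):

        unsigned int granularity = max(q->limits.discard_granularity >> 9, 1U);
        sector_t req_sects = min_t(sector_t, nr_sects, UINT_MAX >> 9);

        /* If this is not the last chunk, round its end down to the
         * discard granularity so follow-on bios stay aligned. */
        if (req_sects < nr_sects) {
                sector_t end = sector + req_sects;
                sector_t tmp = end;
                unsigned int rem = sector_div(tmp, granularity);

                if (rem)
                        req_sects = (end - rem) - sector;
        }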

    Reviewed-by: Christoph Hellwig
    Tested-by: Christoph Hellwig
    Signed-off-by: Ming Lin
    Reviewed-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Ming Lin
     

22 Oct, 2015

19 commits

  • The stat files on the root cgroup show stats for the whole system and
    usually don't contain any information which isn't available through
    the usual system monitoring mechanisms. Some controllers skip
    collecting these duplicate stats to optimize the case where cgroup
    isn't used, and later try to emulate the result on demand.

    This leads to complexities and subtle differences in the information
    shown through different channels. This is entirely unnecessary, and
    cgroup v2 is dropping these duplicate stat files from all
    controllers. This patch removes "io.stat" from the root hierarchy.

    Signed-off-by: Tejun Heo
    Acked-by: Jens Axboe
    Cc: Vivek Goyal

    Tejun Heo
     
  • Most of the time, flushing the plug is the hottest I/O path, so mark
    the ctx as pending once, after all requests in the list have been
    inserted.

    Reviewed-by: Jeff Moyer
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • The trace point traces plug events per request queue rather than per
    task, so we should count the requests in the plug list that belong
    to the current queue instead of all requests from the current task.

    Signed-off-by: Ming Lei
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Now that bio splitting has been introduced, a bio can be split and
    marked NOMERGE because it is too large to be merged, so check
    bio_mergeable() earlier to avoid trying to merge it unnecessarily.

    Signed-off-by: Ming Lei
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • It isn't necessary to try to merge a bio that is marked as NOMERGE.

    Reviewed-by: Jeff Moyer
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • A split bio is already too large to merge, so mark it as NOMERGE.

    Reviewed-by: Jeff Moyer
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • bio->bi_phys_segments is always computed during bio splitting, so it
    is natural to set it up right after splitting; this avoids
    recomputing the segment count during merging.

    Reviewed-by: Jeff Moyer
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Request queues with merging disabled will not flush the plug list after
    BLK_MAX_REQUEST_COUNT requests have been queued, since the code relies
    on blk_attempt_plug_merge to compute the request_count. Fix this by
    computing the number of queued requests even for nomerge queues.
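
    In sketch form, the submission path becomes (blk_plug_queued_count()
    counting the plug-list requests destined for this queue; treat the
    exact signatures as approximate):

        if (!blk_queue_nomerges(q)) {
                if (blk_attempt_plug_merge(q, bio, &request_count, NULL))
                        return;
        } else {
                /* Merging is off, but the count still has to be kept so
                 * the plug is flushed at BLK_MAX_REQUEST_COUNT. */
                request_count = blk_plug_queued_count(q);
        }

        if (request_count >= BLK_MAX_REQUEST_COUNT)
                blk_flush_plug_list(plug, false);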

    Signed-off-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jeff Moyer
     
  • This commit adds a driver API and ioctls for generically controlling
    Persistent Reservations at the block layer. Persistent Reservations
    are supported by SCSI and NVMe and allow controlling who gets access
    to a device in a shared storage setup.

    Note that we add a pr_ops structure to struct block_device_operations
    instead of adding the members directly to avoid bloating all instances
    of devices that will never support Persistent Reservations.
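
    The hook table looks roughly like this (abridged; see the pr_ops
    definition in the tree for the authoritative version):

        struct pr_ops {
                int (*pr_register)(struct block_device *bdev, u64 old_key,
                                   u64 new_key, u32 flags);
                int (*pr_reserve)(struct block_device *bdev, u64 key,
                                  enum pr_type type, u32 flags);
                int (*pr_release)(struct block_device *bdev, u64 key,
                                  enum pr_type type);
                int (*pr_preempt)(struct block_device *bdev, u64 old_key,
                                  u64 new_key, enum pr_type type, bool abort);
                int (*pr_clear)(struct block_device *bdev, u64 key);
        };

    A driver that supports reservations points
    block_device_operations->pr_ops at its implementation; all other
    drivers leave the pointer NULL.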

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Split out helpers for all non-trivial ioctls to make this function simpler,
    and also start passing around a pointer version of the argument, as that's
    what most ioctl handlers actually need.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • The libnvdimm-btt and nvme drivers use blk_integrity to reserve space
    for per-sector metadata, but sometimes without protection checksums.
    This property is generically useful, so teach the block core to
    internally specify a nop profile if one is not provided at registration
    time.
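
    Conceptually, the built-in fallback is just a profile whose
    generate/verify hooks accept everything (a sketch; names
    approximate):

        /* Metadata space is still reserved per sector, but no protection
         * information is generated or checked. */
        static int blk_integrity_nop_fn(struct blk_integrity_iter *iter)
        {
                return 0;
        }

        static struct blk_integrity_profile nop_profile = {
                .name           = "nop",
                .generate_fn    = blk_integrity_nop_fn,
                .verify_fn      = blk_integrity_nop_fn,
        };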

    Cc: Keith Busch
    Cc: Matthew Wilcox
    Suggested-by: Christoph Hellwig
    [hch: kill the local nvme nop profile as well]
    Acked-by: Martin K. Petersen
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Dan Williams
     
  • Since they lack requests to pin the request_queue active, synchronous
    bio-based drivers may have in-flight integrity work from
    bio_integrity_endio() that is not flushed by blk_freeze_queue(). Flush
    that work to prevent races to free the queue and the final usage of the
    blk_integrity profile.

    This is temporary unless/until bio-based drivers start to generically
    take a q_usage_counter reference while a bio is in-flight.

    Cc: Martin K. Petersen
    [martin: fix the CONFIG_BLK_DEV_INTEGRITY=n case]
    Tested-by: Ross Zwisler
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Dan Williams
     
  • A trace like the following precedes a crash in bio_integrity_process()
    when it goes to use an already freed blk_integrity profile.

    BUG: unable to handle kernel paging request at ffff8800d31b10d8
    IP: [] 0xffff8800d31b10d8
    PGD 2f65067 PUD 21fffd067 PMD 80000000d30001e3
    Oops: 0011 [#1] SMP
    Dumping ftrace buffer:
    ---------------------------------
    ndctl-2222 2.... 44526245us : disk_release: pmem1s
    systemd--2223 4.... 44573945us : bio_integrity_endio: pmem1s
    -409 4.... 44574005us : bio_integrity_process: pmem1s
    ---------------------------------
    [..]
    Call Trace:
    [] ? bio_integrity_process+0x159/0x2d0
    [] bio_integrity_verify_fn+0x36/0x60
    [] process_one_work+0x1cc/0x4e0

    Given that a request_queue is pinned while i/o is in flight and that a
    gendisk is allowed to have a shorter lifetime, move blk_integrity to
    request_queue to satisfy requests arriving after the gendisk has been
    torn down.

    Cc: Christoph Hellwig
    Cc: Martin K. Petersen
    [martin: fix the CONFIG_BLK_DEV_INTEGRITY=n case]
    Tested-by: Ross Zwisler
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Dan Williams
     
  • Allow pmem, and other synchronous/bio-based block drivers, to fallback
    on a per-cpu reference count managed by the core for tracking queue
    live/dead state.

    The existing per-cpu reference count for the blk_mq case is promoted to
    be used in all block i/o scenarios. This involves initializing it by
    default, waiting for it to drop to zero at exit, and holding a live
    reference over the invocation of q->make_request_fn() in
    generic_make_request(). The blk_mq code continues to take its own
    reference per blk_mq request and retains the ability to freeze the
    queue, but the check that the queue is frozen is moved to
    generic_make_request().
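
    The resulting pattern in generic_make_request() is roughly (a
    sketch; blk_queue_enter()/blk_queue_exit() wrap
    percpu_ref_tryget_live()/percpu_ref_put() on q->q_usage_counter):

        if (likely(blk_queue_enter(q, __GFP_WAIT) == 0)) {
                /* the queue cannot be torn down under us here */
                ret = q->make_request_fn(q, bio);
                blk_queue_exit(q);
        } else {
                bio_io_error(bio);      /* queue is dead or dying */
        }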

    This fixes crash signatures like the following:

    BUG: unable to handle kernel paging request at ffff880140000000
    [..]
    Call Trace:
    [] ? copy_user_handle_tail+0x5f/0x70
    [] pmem_do_bvec.isra.11+0x70/0xf0 [nd_pmem]
    [] pmem_make_request+0xd1/0x200 [nd_pmem]
    [] ? mempool_alloc+0x72/0x1a0
    [] generic_make_request+0xd6/0x110
    [] submit_bio+0x76/0x170
    [] submit_bh_wbc+0x12f/0x160
    [] submit_bh+0x12/0x20
    [] jbd2_write_superblock+0x8d/0x170
    [] jbd2_mark_journal_empty+0x5d/0x90
    [] jbd2_journal_destroy+0x24b/0x270
    [] ? put_pwq_unlocked+0x2a/0x30
    [] ? destroy_workqueue+0x225/0x250
    [] ext4_put_super+0x64/0x360
    [] generic_shutdown_super+0x6a/0xf0

    Cc: Jens Axboe
    Cc: Keith Busch
    Cc: Ross Zwisler
    Suggested-by: Christoph Hellwig
    Reviewed-by: Christoph Hellwig
    Tested-by: Ross Zwisler
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Dan Williams
     
  • Up until now the integrity profile has been dynamically allocated and
    attached to struct gendisk after the disk has been made active.

    This causes problems because NVMe devices need to register the profile
    prior to the partition table being read due to a mandatory metadata
    buffer requirement. In addition, DM goes through hoops to deal with
    preallocating, but not initializing integrity profiles.

    Since the integrity profile is small (4 bytes + a pointer), Christoph
    suggested moving it to struct gendisk proper. This requires several
    changes:

    - Moving the blk_integrity definition to genhd.h.

    - Inlining blk_integrity in struct gendisk.

    - Removing the dynamic allocation code.

    - Adding helper functions which allow gendisk to set up and tear down
    the integrity sysfs dir when a disk is added/deleted.

    - Adding a blk_integrity_revalidate() callback for updating the stable
    pages bdi setting.

    - The calls that depend on whether a device has an integrity profile or
    not now key off of the bi->profile pointer.

    - Simplifying the integrity support routines in DM (Mike Snitzer).

    Signed-off-by: Martin K. Petersen
    Reported-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Mike Snitzer
    Cc: Dan Williams
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • The size of the data interval was not exported in the sysfs integrity
    directory. Export it so that userland apps can tell whether the interval
    is different from the device's logical block size.

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • The per-device properties in the blk_integrity structure were previously
    unsigned short. However, most of the values fit inside a char. The only
    exception is the data interval size and we can work around that by
    storing it as a power of two.

    This cuts the size of the dynamic portion of blk_integrity in half.
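
    Combined with the template split in the related commit, the
    per-device struct ends up roughly as:

        struct blk_integrity {
                struct blk_integrity_profile *profile; /* static template */
                unsigned char flags;
                unsigned char tuple_size;
                unsigned char interval_exp; /* interval = 1 << exp bytes */
                unsigned char tag_size;
        };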

    Signed-off-by: Martin K. Petersen
    Reported-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • We previously made a complete copy of a device's data integrity profile
    even though several of the fields inside the blk_integrity struct are
    pointers to fixed template entries in t10-pi.c.

    Split the static and per-device portions so that we can reference the
    template directly.

    Signed-off-by: Martin K. Petersen
    Reported-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Cc: Dan Williams
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • The integrity kobject purely exists to support the integrity
    subdirectory in sysfs and doesn't really have anything to do with the
    blk_integrity data structure. Move the kobject to struct gendisk where
    it belongs.

    Signed-off-by: Martin K. Petersen
    Reported-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

15 Oct, 2015

2 commits

  • bdi's are initialized in two steps, bdi_init() and bdi_register(), but
    destroyed in a single step by bdi_destroy() which, for a bdi embedded
    in a request_queue, is called during blk_cleanup_queue() which makes
    the queue invisible and starts the draining of remaining usages.

    A request_queue's user can access the congestion state of the embedded
    bdi as long as it holds a reference to the queue. As such, it may
    access the congested state of a queue which finished
    blk_cleanup_queue() but hasn't reached blk_release_queue() yet.
    Because the congested state was embedded in backing_dev_info which in
    turn is embedded in request_queue, accessing the congested state after
    bdi_destroy() was called was fine. The bdi was destroyed but the
    memory region for the congested state remained accessible till the
    queue got released.

    a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in
    bdi_writeback") changed the situation. Now, the root congested state
    which is expected to be pinned while request_queue remains accessible
    is separately reference counted and the base ref is put during
    bdi_destroy(). This means that the root congested state may go away
    prematurely while the queue is between bdi_destroy() and
    blk_cleanup_queue(), which was detected by Andrey's KASAN tests.

    The root cause of this problem is that bdi doesn't distinguish the two
    steps of destruction, unregistration and release, and now the root
    congested state actually requires a separate release step. To fix the
    issue, this patch separates out bdi_unregister() and bdi_exit() from
    bdi_destroy(). bdi_unregister() is called from blk_cleanup_queue()
    and bdi_exit() from blk_release_queue(). bdi_destroy() is now just a
    simple wrapper calling the two steps back-to-back.

    While at it, the prototype of bdi_destroy() is moved right below
    bdi_setup_and_register() so that the counterpart operations are
    located together.
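
    The end result is literally the two steps back-to-back:

        void bdi_destroy(struct backing_dev_info *bdi)
        {
                bdi_unregister(bdi);    /* blk_cleanup_queue() step */
                bdi_exit(bdi);          /* blk_release_queue() step */
        }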

    Signed-off-by: Tejun Heo
    Fixes: a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback")
    Cc: stable@vger.kernel.org # v4.2+
    Reported-and-tested-by: Andrey Konovalov
    Link: http://lkml.kernel.org/g/CAAeHK+zUJ74Zn17=rOyxacHU18SgCfC6bsYW=6kCY5GXJBwGfQ@mail.gmail.com
    Reviewed-by: Jan Kara
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • tags is freed in blk_mq_free_rq_map() and should not be used after that.
    The problem doesn't manifest if CONFIG_CPUMASK_OFFSTACK is false because
    free_cpumask_var() is a no-op in that case.

    tags->cpumask is allocated in blk_mq_init_tags(), so it's natural to
    free the cpumask in its counterpart, blk_mq_free_tags().
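
    A sketch of the resulting pairing (bt_free() frees the tag bitmaps;
    details approximate):

        void blk_mq_free_tags(struct blk_mq_tags *tags)
        {
                bt_free(&tags->bitmap_tags);
                bt_free(&tags->breserved_tags);
                /* allocated in blk_mq_init_tags(), freed here rather
                 * than in blk_mq_free_rq_map() */
                free_cpumask_var(tags->cpumask);
                kfree(tags);
        }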

    Fixes: f26cdc8536ad ("blk-mq: Shared tag enhancements")
    Signed-off-by: Jun'ichi Nomura
    Cc: Keith Busch
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Junichi Nomura
     

10 Oct, 2015

3 commits

  • Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • blk_mq_tag_update_depth() seems to be missing a memory barrier which
    might cause the waker to not notice the waiter and fail to send a
    wake_up as in the following figure.

    blk_mq_tag_update_depth                    bt_get
    ------------------------------------------------------------------------
    if (waitqueue_active(&bs->wait))
    /* The CPU might reorder the test for
       the waitqueue up here, before
       prior writes complete */
                                               prepare_to_wait(&bs->wait, &wait,
                                                   TASK_UNINTERRUPTIBLE);
                                               tag = __bt_get(hctx, bt, last_tag,
                                                   tags);
                                               /* Value set in bt_update_count
                                                  not visible yet */
    bt_update_count(&tags->bitmap_tags, tdepth);
    /* blk_mq_tag_wakeup_all(tags, false); */
    bt = &tags->bitmap_tags;
    wake_index = atomic_read(&bt->wake_index);
    ...
                                               io_schedule();
    ------------------------------------------------------------------------

    This patch adds the missing memory barrier.
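
    On the waker side the canonical fix pattern looks like this (a
    generic sketch of the waitqueue_active() idiom, not the literal
    diff):

        bt_update_count(&tags->bitmap_tags, tdepth);
        /* Make the count update visible before testing for waiters;
         * pairs with the barrier implied by prepare_to_wait() on the
         * waiter side. */
        smp_mb();
        if (waitqueue_active(&bs->wait))
                wake_up(&bs->wait);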

    I found this issue while looking through the Linux source code for
    places that call waitqueue_active() before wake_up*() without a
    preceding memory barrier, after sending a patch to fix a similar
    issue in drivers/tty/n_tty.c (details about the original issue can
    be found here: https://lkml.org/lkml/2015/9/28/849).

    Signed-off-by: Kosuke Tatsukawa
    Signed-off-by: Jens Axboe

    Kosuke Tatsukawa
     
  • Linux 4.3-rc4

    Pulling in v4.3-rc4 to avoid conflicts with NVMe fixes that have gone
    in since for-4.4/core was based.

    Jens Axboe
     

01 Oct, 2015

2 commits

  • And replace blk_mq_tag_busy_iter with it - the driver use was
    replaced with a new helper a while ago, and internal to the block
    layer we only need the new version.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • blk_mq_complete_request may be a no-op if the request has already
    been completed by other means (e.g. a timeout or cancellation), but
    currently drivers have to set rq->errors before calling
    blk_mq_complete_request, which might leave us with the wrong error value.

    Add an error parameter to blk_mq_complete_request so that we can
    defer setting rq->errors until we know we won the race to complete the
    request.
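
    The changed convention, sketched from the description above:

        /* Before: drivers set rq->errors, then called
         * blk_mq_complete_request(rq). After: the error travels with
         * the call and only the winner of the completion race stores it. */
        void blk_mq_complete_request(struct request *rq, int error)
        {
                struct request_queue *q = rq->q;

                if (unlikely(blk_should_fake_timeout(q)))
                        return;
                if (!blk_mark_rq_complete(rq)) {
                        rq->errors = error;
                        __blk_mq_complete_request(rq);
                }
        }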

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

30 Sep, 2015

6 commits

  • CPU hotplug handling for blk-mq (blk_mq_queue_reinit) acquires
    all_q_mutex in blk_mq_queue_reinit_notify() and then removes sysfs
    entries via blk_mq_sysfs_unregister(). Removing a sysfs entry is
    blocked until the active reference of the kernfs_node drops to zero.

    On the other hand, reading blk_mq_hw_sysfs_cpu sysfs entry (e.g.
    /sys/block/nullb0/mq/0/cpu_list) acquires all_q_mutex in
    blk_mq_hw_sysfs_cpus_show().

    If these happen at the same time, a deadlock can happen. Because one
    can wait for the active reference to be zero with holding all_q_mutex,
    and the other tries to acquire all_q_mutex with holding the active
    reference.

    The reason that all_q_mutex is acquired in blk_mq_hw_sysfs_cpus_show()
    is to avoid reading an incomplete hctx->cpumask. Since reading a
    blk-mq sysfs entry already acquires q->sysfs_lock, we can avoid both
    the deadlock and the incomplete read by holding q->sysfs_lock
    while hctx->cpumask is being updated.
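
    In sketch form (treat the exact signatures as approximate):

        /* Writer side: rebuild the CPU mapping under q->sysfs_lock... */
        mutex_lock(&q->sysfs_lock);
        blk_mq_map_swqueue(q);          /* updates hctx->cpumask */
        mutex_unlock(&q->sysfs_lock);

        /* ...so blk_mq_hw_sysfs_cpus_show() no longer needs all_q_mutex:
         * the q->sysfs_lock it already holds makes the cpumask read
         * coherent. */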

    Signed-off-by: Akinobu Mita
    Reviewed-by: Ming Lei
    Cc: Ming Lei
    Cc: Wanpeng Li
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Akinobu Mita
     
  • Notifier callbacks for the CPU_ONLINE action can run on a CPU other
    than the one which was just onlined. So it is possible for a process
    running on the just-onlined CPU to insert requests and run a hw
    queue before the new mapping is established by
    blk_mq_queue_reinit_notify().

    This causes a problem when the CPU is onlined for the first time
    since the request queue was initialized. At that point ctx->index_hw
    for the CPU, which is the index in hctx->ctxs[] for this ctx, is
    still zero before blk_mq_queue_reinit_notify() is called by the
    notifier callbacks for the CPU_ONLINE action.

    For example, suppose there is a single hw queue (hctx) and two CPU
    queues (ctx0 for CPU0, and ctx1 for CPU1). Now CPU1 is just onlined
    and a request is inserted into ctx1->rq_list; bit0 is set in the
    pending bitmap because ctx1->index_hw is still zero.

    Then, while running the hw queue, flush_busy_ctxs() finds bit0 set
    in the pending bitmap and tries to retrieve requests from
    hctx->ctxs[0]->rq_list. But hctx->ctxs[0] is a pointer to ctx0, so
    the request in ctx1->rq_list is ignored.

    Fix it by ensuring that the new mapping is established before the
    onlined CPU starts running.

    Signed-off-by: Akinobu Mita
    Reviewed-by: Ming Lei
    Cc: Jens Axboe
    Cc: Ming Lei
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Akinobu Mita
     
  • CPU hotplug handling for blk-mq (blk_mq_queue_reinit) accesses
    q->mq_usage_counter while freezing all request queues in all_q_list.
    On the other hand, q->mq_usage_counter is deinitialized in
    blk_mq_free_queue() before deleting the queue from all_q_list.

    So if a CPU hotplug event occurs in that window, percpu_ref_kill() is
    called on a q->mq_usage_counter which has already been marked dead,
    and it triggers a warning. Fix it by deleting the queue from
    all_q_list before destroying q->mq_usage_counter.

    Signed-off-by: Akinobu Mita
    Reviewed-by: Ming Lei
    Cc: Ming Lei
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Akinobu Mita
     
  • CPU hotplug handling for blk-mq (blk_mq_queue_reinit) updates
    q->mq_map by blk_mq_update_queue_map() for all request queues in
    all_q_list. On the other hand, q->mq_map is released before deleting
    the queue from all_q_list.

    So if a CPU hotplug event occurs in that window, an invalid memory
    access can happen. Fix it by releasing q->mq_map in blk_mq_release()
    so that it happens later than the removal from all_q_list.

    Signed-off-by: Akinobu Mita
    Suggested-by: Ming Lei
    Reviewed-by: Ming Lei
    Cc: Ming Lei
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Akinobu Mita
     
  • There is a race between cpu hotplug handling and adding/deleting
    gendisk for blk-mq, where both are trying to register and unregister
    the same sysfs entries.

    null_add_dev
    --> blk_mq_init_queue
    --> blk_mq_init_allocated_queue
    --> add to 'all_q_list' (*)
    --> add_disk
    --> blk_register_queue
    --> blk_mq_register_disk (++)

    null_del_dev
    --> del_gendisk
    --> blk_unregister_queue
    --> blk_mq_unregister_disk (--)
    --> blk_cleanup_queue
    --> blk_mq_free_queue
    --> del from 'all_q_list' (*)

    blk_mq_queue_reinit
    --> blk_mq_sysfs_unregister (-)
    --> blk_mq_sysfs_register (+)

    While the request queue is on 'all_q_list' (*),
    blk_mq_queue_reinit() can be called for the queue at any time by the
    CPU hotplug callback. But blk_mq_sysfs_unregister (-) and
    blk_mq_sysfs_register (+) in blk_mq_queue_reinit must not be called
    before blk_mq_register_disk (++) or after blk_mq_unregister_disk (--)
    has finished, because '/sys/block/*/mq/' does not exist then.

    There is already a BLK_MQ_F_SYSFS_UP flag in hctx->flags which can
    be used to track this sysfs state, but it only fixes the issue
    partially.

    To fix it completely, we need a per-queue flag instead of a
    per-hctx flag, with appropriate locking. So this introduces
    q->mq_sysfs_init_done, which is properly protected by all_q_mutex.

    Also, we need to ensure that blk_mq_map_swqueue() is called with
    all_q_mutex held. Since hctx->nr_ctx is temporarily reset and then
    updated in blk_mq_map_swqueue(), we should prevent
    blk_mq_register_hctx() from seeing the temporary hctx->nr_ctx value
    during CPU hotplug handling or while adding/deleting a gendisk.
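
    A sketch of how the flag gates the hotplug path (names per the
    description above; details approximate):

        void blk_mq_sysfs_unregister(struct request_queue *q)
        {
                struct blk_mq_hw_ctx *hctx;
                int i;

                /* set in blk_mq_register_disk(), cleared in
                 * blk_mq_unregister_disk(), both under all_q_mutex */
                if (!q->mq_sysfs_init_done)
                        return;

                queue_for_each_hw_ctx(q, hctx, i)
                        blk_mq_unregister_hctx(hctx);
        }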

    Signed-off-by: Akinobu Mita
    Reviewed-by: Ming Lei
    Cc: Ming Lei
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Akinobu Mita
     
  • When unmapped hw queue is remapped after CPU topology is changed,
    hctx->tags->cpumask has to be set after hctx->tags is setup in
    blk_mq_map_swqueue(), otherwise it causes null pointer dereference.

    Fixes: f26cdc8536 ("blk-mq: Shared tag enhancements")
    Signed-off-by: Akinobu Mita
    Cc: Keith Busch
    Cc: Ming Lei
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Akinobu Mita
     

24 Sep, 2015

1 commit

  • The pages allocated for struct request contain pointers to other slab
    allocations (via ops->init_request). Since kmemleak does not track/scan
    page allocations, the slab objects will be reported as leaks (false
    positives). This patch adds kmemleak callbacks to allow tracking of such
    pages.
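
    The shape of the hook, in sketch form (order_to_size() converts a
    page-allocation order to bytes):

        p = page_address(page);
        /* The page itself isn't scanned by kmemleak, but it holds
         * pointers set up via ops->init_request(); register it so the
         * slab objects it references aren't reported as leaks. */
        kmemleak_alloc(p, order_to_size(this_order), 1, GFP_NOIO);

    with a matching kmemleak_free(page_address(page)) before the pages
    are handed back to the allocator.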

    Signed-off-by: Catalin Marinas
    Reported-by: Bart Van Assche
    Tested-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Catalin Marinas
     

20 Sep, 2015

1 commit

  • Pull block updates from Jens Axboe:
    "This is a bit bigger than it should be, but I could (did) not want to
    send it off last week due to both wanting extra testing, and expecting
    a fix for the bounce regression as well. In any case, this contains:

    - Fix for the blk-merge.c compilation warning on gcc 5.x from me.

    - A set of back/front SG gap merge fixes, from me and from Sagi.
    This ensures that we honor SG gapping for integrity payloads as
    well.

    - Two small fixes for null_blk from Matias, fixing a leak and a
    capacity propagation issue.

    - A blkcg fix from Tejun, fixing a NULL dereference.

    - A fast clone optimization from Ming, fixing a performance
    regression since the arbitrarily sized bio's were introduced.

    - Also from Ming, a regression fix for bouncing IOs"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    block: fix bounce_end_io
    block: blk-merge: fast-clone bio when splitting rw bios
    block: blkg_destroy_all() should clear q->root_blkg and ->root_rl.blkg
    block: Copy a user iovec if it includes gaps
    block: Refuse adding appending a gapped integrity page to a bio
    block: Refuse request/bio merges with gaps in the integrity payload
    block: Check for gaps on front and back merges
    null_blk: fix wrong capacity when bs is not 512 bytes
    null_blk: fix memory leak on cleanup
    block: fix bogus compiler warnings in blk-merge.c

    Linus Torvalds