19 Aug, 2014

4 commits

  • Most places which allocate an r10_bio zero ->state, but some don't.
    As the r10_bio comes from a mempool, and the allocation function uses
    kzalloc, it is often zero anyway. But sometimes it isn't, and it is
    best to be safe (see the sketch after this entry).

    I only noticed this because of the bug fixed by an earlier patch
    where the r10_bios allocated for a reshape were left around to
    be used by a subsequent resync. In that case the R10BIO_IsReshape
    flag caused problems.

    Signed-off-by: NeilBrown

    NeilBrown
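
    A minimal sketch of the point being made, under a simplified structure
    (the type and helper names are illustrative, not the exact md code): an
    object handed back by a mempool may still carry whatever state its
    previous user left behind, so ->state is cleared explicitly instead of
    trusting that kzalloc was the original source.

    /* Illustrative only -- simplified from the raid10 allocation path. */
    struct r10bio_like {
        unsigned long state;        /* bit flags such as R10BIO_IsReshape */
        /* ... other fields ... */
    };

    static struct r10bio_like *alloc_r10bio(mempool_t *pool, gfp_t gfp)
    {
        struct r10bio_like *r10_bio = mempool_alloc(pool, gfp);

        if (r10_bio)
            r10_bio->state = 0;     /* don't trust bits left by a prior user */
        return r10_bio;
    }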
     
  • If a raid10 reshape fails to find somewhere to read a block
    from, it returns without freeing memory... (A sketch of the
    error-path fix follows this entry.)

    Signed-off-by: NeilBrown

    NeilBrown
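
    A hedged sketch of the shape of such a fix (field names are
    hypothetical, not the actual reshape code): the error path releases
    the r10_bio it allocated before returning.

    /* Illustrative only -- hypothetical field names. */
    r10_bio = mempool_alloc(conf->buf_pool, GFP_NOIO);
    /* ... try to pick a device to read the block from ... */
    if (!rdev) {
        /* nowhere to read from: free what was allocated, then bail out */
        mempool_free(r10_bio, conf->buf_pool);
        return sectors_done;
    }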
     
  • When a raid10 array commences a resync/recovery/reshape it allocates
    some buffer space.
    When a resync/recovery completes, that buffer space is freed, but it
    is not freed when a reshape completes.
    This can result in a small memory leak (see the sketch after this
    entry).

    There is a subtle side-effect of this bug. When a RAID10 is reshaped
    to a larger array (more devices), the reshape is immediately followed
    by a "resync" of the new space. This "resync" will use the buffer
    space which was allocated for "reshape". This can cause problems
    including a "BUG" in the SCSI layer. So this is suitable for -stable.

    Cc: stable@vger.kernel.org (v3.5+)
    Fixes: 3ea7daa5d7fde47cd41f4d56c2deb949114da9d6
    Signed-off-by: NeilBrown

    NeilBrown
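
    A hedged sketch of the intent (the recovery flag names are real md
    bits, but the helper is hypothetical): the buffer pool should be
    released on the reshape-completion path just as it already is when a
    resync or recovery completes.

    /* Illustrative only -- free_sync_buffers() is a hypothetical helper. */
    if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery) ||
        test_bit(MD_RECOVERY_RECOVER, &mddev->recovery) ||
        test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))   /* previously missed */
        free_sync_buffers(conf);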
     
  • raid10 reshape clears unwanted bits from bio->bi_flags using
    a method which, while clumsy, worked until 3.10 when BIO_OWNS_VEC
    was added.
    Since then it has also been clearing that bit, which it shouldn't.
    This results in a memory leak.

    So change to use the approved method of clearing unwanted bits
    (see the sketch after this entry).

    As this causes a memory leak which can consume all of memory,
    the fix is suitable for -stable.

    Fixes: a38352e0ac02dbbd4fa464dc22d1352b5fbd06fd
    Cc: stable@vger.kernel.org (v3.10+)
    Reported-by: mdraid.pkoch@dfgh.net (Peter Koch)
    Signed-off-by: NeilBrown

    NeilBrown
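
    A hedged sketch of the difference (flag names are from the 3.16-era
    bio; this is an illustration, not the literal patch):

    /* Old idiom (clumsy): blanket-clear the low bi_flags bits, which
     * since 3.10 also wipes BIO_OWNS_VEC, so the bio's vector is never
     * freed and memory leaks. */

    /* Preferred: touch only the bits that actually need changing. */
    __clear_bit(BIO_SEG_VALID, &bio->bi_flags);
    __set_bit(BIO_UPTODATE, &bio->bi_flags);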
     

18 Aug, 2014

2 commits

  • During recovery of a double-degraded RAID6 it is possible for
    some blocks not to be recovered properly, leading to corruption.

    If a write happens to one block in a stripe that would be written to a
    missing device, and at the same time that stripe is recovering data
    to the other missing device, then that recovered data may not be written.

    This patch skips, in the double-degraded case, an optimisation that is
    only safe for single-degraded arrays (see the sketch after this entry).

    The bug was introduced in 2.6.32 and the fix is suitable for any kernel
    since then. In an older kernel with separate handle_stripe5() and
    handle_stripe6() functions the patch must change handle_stripe6().

    Cc: stable@vger.kernel.org (2.6.32+)
    Fixes: 6c0069c0ae9659e3a91b68eaed06a5c6c37f45c8
    Cc: Yuri Tikhonov
    Cc: Dan Williams
    Reported-by: "Manibalan P"
    Tested-by: "Manibalan P"
    Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1090423
    Signed-off-by: NeilBrown
    Acked-by: Dan Williams

    NeilBrown
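
    A hedged sketch of the kind of guard involved (simplified, not the
    exact stripe-handling code): the shortcut that folds recovered data
    into a pending write is only taken when at most one device is failed.

    /* Illustrative only -- inside the per-device loop of stripe handling. */
    if (s.failed > 1)
        /* double-degraded: the single-degraded optimisation is unsafe,
         * so fall back to the full recovery write-out path */
        continue;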
     
  • If a stripe in a raid6 array receives a write to each data block while
    the array is degraded, and if any of these writes to a missing device
    are not page-aligned, then a live-lock happens.

    In this case the P and Q blocks need to be read so that the part of
    the missing block which is *not* being updated by the write can be
    constructed. Due to a logic error, these blocks are not loaded, so
    the update cannot proceed and the stripe is 'handled' repeatedly in an
    infinite loop (see the sketch after this entry).

    This bug is unlikely to be hit, as most writes are page-aligned. However,
    as it can lead to a livelock it is suitable for -stable. It was introduced
    in 3.16.

    Cc: stable@vger.kernel.org (v3.16)
    Fixed: 67f455486d2ea20b2d94d6adf5b9b783d079e321
    Signed-off-by: NeilBrown

    NeilBrown
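
    A hedged sketch of the decision being fixed (hypothetical types, not
    the raid5.c implementation): a write that only partially covers a
    block on a failed device must force P and Q to be read so the
    untouched part of the missing block can be reconstructed.

    /* Illustrative only -- hypothetical types and field names. */
    struct dev_state    { bool towrite; bool fully_overwritten; };
    struct stripe_state { int failed; };

    static bool need_parity_read(const struct stripe_state *s,
                                 const struct dev_state *fdev)
    {
        /* a partial write to a failed device needs P/Q for reconstruction */
        return s->failed && fdev->towrite && !fdev->fully_overwritten;
    }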
     

14 Aug, 2014

2 commits

  • Pull device mapper changes from Mike Snitzer:

    - Allow the thin target to be paired with any size external origin; also
    allow thin snapshots to be larger than the external origin.

    - Add support for quickly loading a repetitive pattern into the
    dm-switch target.

    - Use per-bio data in the dm-crypt target instead of always using a
    mempool for each allocation. Required switching to kmalloc alignment
    for the bio slab.

    - Fix DM core to properly stack the QUEUE_FLAG_NO_SG_MERGE flag

    - Fix the dm-cache and dm-thin targets' export of the minimum_io_size
    to match the data block size -- this fixes an issue where mkfs.xfs
    would improperly infer raid striping was in place on the underlying
    storage.

    - Small cleanups in dm-io, dm-mpath and dm-cache

    * tag 'dm-3.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
    dm table: propagate QUEUE_FLAG_NO_SG_MERGE
    dm switch: efficiently support repetitive patterns
    dm switch: factor out switch_region_table_read
    dm cache: set minimum_io_size to cache's data block size
    dm thin: set minimum_io_size to pool's data block size
    dm crypt: use per-bio data
    block: use kmalloc alignment for bio slab
    dm table: make dm_table_supports_discards static
    dm cache metadata: use dm-space-map-metadata.h defined size limits
    dm cache: fail migrations in the do_worker error path
    dm cache: simplify deferred set reference count increments
    dm thin: relax external origin size constraints
    dm thin: switch to an atomic_t for tracking pending new block preparations
    dm mpath: eliminate pg_ready() wrapper
    dm io: simplify dec_count and sync_io

    Linus Torvalds
     
  • Pull block driver changes from Jens Axboe:
    "Nothing out of the ordinary here, this pull request contains:

    - A big round of fixes for bcache from Kent Overstreet, Slava Pestov,
    and Surbhi Palande. No new features, just a lot of fixes.

    - The usual round of drbd updates from Andreas Gruenbacher, Lars
    Ellenberg, and Philipp Reisner.

    - virtio_blk was converted to blk-mq back in 3.13, but now Ming Lei
    has taken it one step further and added support for actually using
    more than one queue.

    - Addition of an explicit SG_FLAG_Q_AT_HEAD for block/bsg, to
    complement the default behavior of adding to the tail of the
    queue. From Douglas Gilbert"

    * 'for-3.17/drivers' of git://git.kernel.dk/linux-block: (86 commits)
    bcache: Drop unneeded blk_sync_queue() calls
    bcache: add mutex lock for bch_is_open
    bcache: Correct printing of btree_gc_max_duration_ms
    bcache: try to set b->parent properly
    bcache: fix memory corruption in init error path
    bcache: fix crash with incomplete cache set
    bcache: Fix more early shutdown bugs
    bcache: fix use-after-free in btree_gc_coalesce()
    bcache: Fix an infinite loop in journal replay
    bcache: fix crash in bcache_btree_node_alloc_fail tracepoint
    bcache: bcache_write tracepoint was crashing
    bcache: fix typo in bch_bkey_equal_header
    bcache: Allocate bounce buffers with GFP_NOWAIT
    bcache: Make sure to pass GFP_WAIT to mempool_alloc()
    bcache: fix uninterruptible sleep in writeback thread
    bcache: wait for buckets when allocating new btree root
    bcache: fix crash on shutdown in passthrough mode
    bcache: fix lockdep warnings on shutdown
    bcache allocator: send discards with correct size
    bcache: Fix to remove the rcu_sched stalls.
    ...

    Linus Torvalds
     

11 Aug, 2014

2 commits

  • Pull md updates from Neil Brown:
    "Most interesting is that md devices (major == 9) with minor numbers of
    512 or more will no longer be created simply by opening a block device
    file. They can only be created by writing to

    /sys/module/md_mod/parameters/new_array

    The 'auto-create-on-open' semantic is cumbersome and we need to start
    moving away from it"

    * tag 'md/3.17' of git://neil.brown.name/md:
    md: don't allow bitmap file to be added to raid0/linear.
    md/raid0: check for bitmap compatability when changing raid levels.
    md: Recovery speed is wrong
    md: disable probing for md devices 512 and over.
    md/raid1,raid10: always abort recover on write error.

    Linus Torvalds
     
  • Commit 05f1dd5 ("block: add queue flag for disabling SG merging")
    introduced a new queue flag: QUEUE_FLAG_NO_SG_MERGE. This gets set by
    default in blk_mq_init_queue for mq-enabled devices. The effect of
    the flag is to bypass the SG segment merging. Instead, the
    bio->bi_vcnt is used as the number of hardware segments.

    With a device mapper target on top of a device with
    QUEUE_FLAG_NO_SG_MERGE set, we can end up sending down more segments
    than a driver is prepared to handle. I ran into this when backporting
    the virtio_blk mq support. It triggered this BUG_ON, in
    virtio_queue_rq:

    BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);

    The queue's max is set here:
    blk_queue_max_segments(q, vblk->sg_elems-2);

    Basically, what happens is that a bio is built up for the dm device
    (which does not have the QUEUE_FLAG_NO_SG_MERGE flag set) using
    bio_add_page. That path will call into __blk_recalc_rq_segments, so
    what you end up with is bi_phys_segments being much smaller than bi_vcnt
    (and bi_vcnt grows beyond the maximum sg elements). Then, when the bio
    is submitted, it gets cloned. When the cloned bio is submitted, it will
    end up in blk_recount_segments, here:

    if (test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags))
            bio->bi_phys_segments = bio->bi_vcnt;

    and now we've set bio->bi_phys_segments to a number that is beyond what
    was registered as queue_max_segments by the driver.

    The right way to fix this is to propagate the queue flag up the stack.

    The rules for propagating the flag are simple:
    - if the flag is set for any underlying device, it must be set for the
    upper device;
    - consequently, if the flag is not set for any underlying device, it
    should not be set for the upper device (see the sketch after this entry).

    Signed-off-by: Jeff Moyer
    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org # 3.16+

    Jeff Moyer
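
    A hedged sketch of that propagation rule (the iterate-devices callback
    signature is real, but the surrounding helper names are assumptions,
    not the literal patch):

    /* Illustrative sketch -- helper names are assumptions. */
    static int device_no_sg_merge(struct dm_target *ti, struct dm_dev *dev,
                                  sector_t start, sector_t len, void *data)
    {
        struct request_queue *q = bdev_get_queue(dev->bdev);

        return q && test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags);
    }

    /* in dm_table_set_restrictions(): if any underlying device has the
     * flag set, the stacked dm queue must set it too */
    if (dm_table_any_device_attribute(t, device_no_sg_merge))  /* assumed helper */
        queue_flag_set_unlocked(QUEUE_FLAG_NO_SG_MERGE, q);
    else
        queue_flag_clear_unlocked(QUEUE_FLAG_NO_SG_MERGE, q);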
     

02 Aug, 2014

4 commits

  • Add support for quickly loading a repetitive pattern into the
    dm-switch target.

    In the "set_region_mappings" message, the user may now use "Rn,m" as
    one of the arguments. "n" and "m" are hexadecimal numbers. The "Rn,m"
    argument repeats the last "n" arguments in the following "m" slots
    (a model of the expansion follows this entry).

    For example:
    dmsetup message switch 0 set_region_mappings 1000:1 :2 R2,10
    is equivalent to
    dmsetup message switch 0 set_region_mappings 1000:1 :2 :1 :2 :1 :2 :1 :2 \
    :1 :2 :1 :2 :1 :2 :1 :2 :1 :2

    Requested-by: Jay Wang
    Tested-by: Jay Wang
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer

    Mikulas Patocka
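
    A self-contained model of how an "Rn,m" argument expands (this mirrors
    the semantics described above; it is not the dm-switch parser itself):

    /* Model of "R2,10": repeat the last n=2 mappings into the next
     * m=0x10 (16) region slots, reproducing the long-form example above. */
    #include <stdio.h>

    int main(void)
    {
        unsigned mappings[64];
        unsigned count = 0;

        mappings[count++] = 1;              /* "1000:1" */
        mappings[count++] = 2;              /* ":2"     */

        unsigned n = 2, m = 0x10;           /* "R2,10" -- n and m are hex */
        for (unsigned i = 0; i < m; i++, count++)
            mappings[count] = mappings[count - n];

        for (unsigned i = 0; i < count; i++)
            printf(":%u ", mappings[i]);    /* prints the expanded mapping list */
        printf("\n");
        return 0;
    }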
     
  • Move the code that reads the table into a new function,
    switch_region_table_read. It will be needed for the next commit.
    No functional change.

    Tested-by: Jay Wang
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer

    Mikulas Patocka
     
  • Before, if the block layer's limit stacking didn't establish an
    optimal_io_size that was compatible with the cache's data block size
    we'd set optimal_io_size to the data block size and minimum_io_size to 0
    (which the block layer adjusts to be physical_block_size).

    Update cache_io_hints() to set both minimum_io_size and optimal_io_size
    to the cache's data block size (see the sketch after this entry). This
    fixes an issue where mkfs.xfs would create more XFS Allocation Groups
    on cache volumes than on a normal linear LV of comparable size.

    Signed-off-by: Mike Snitzer

    Mike Snitzer
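
    A hedged sketch of the io_hints change (simplified; blk_limits_io_min()
    and blk_limits_io_opt() are real block-layer helpers, the rest is
    illustrative). The thin-pool change in the next entry follows the same
    pattern.

    /* Illustrative sketch -- simplified dm-cache io_hints hook. */
    static void cache_io_hints(struct dm_target *ti, struct queue_limits *limits)
    {
        struct cache *cache = ti->private;
        uint64_t io_opt_sectors = limits->io_opt >> 9;   /* bytes -> sectors */

        /* Only override when the stacked limits are incompatible with the
         * cache's data block size (io_opt not a whole multiple of it). */
        if (io_opt_sectors < cache->sectors_per_block ||
            do_div(io_opt_sectors, cache->sectors_per_block)) {
            blk_limits_io_min(limits, cache->sectors_per_block << 9);
            blk_limits_io_opt(limits, cache->sectors_per_block << 9);
        }
    }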
     
  • Before, if the block layer's limit stacking didn't establish an
    optimal_io_size that was compatible with the thin-pool's data block size
    we'd set optimal_io_size to the data block size and minimum_io_size to 0
    (which the block layer adjusts to be physical_block_size).

    Update pool_io_hints() to set both minimum_io_size and optimal_io_size
    to the thin-pool's data block size. This fixes an issue reported where
    mkfs.xfs would create more XFS Allocation Groups on thinp volumes than
    on a normal linear LV of comparable size, see:
    https://bugzilla.redhat.com/show_bug.cgi?id=1003227

    Reported-by: Chris Murphy
    Signed-off-by: Mike Snitzer

    Mike Snitzer