13 Sep, 2013

1 commit

  • Pull vfs pile 4 from Al Viro:
    "list_lru pile, mostly"

    This came out of Andrew's pile; Al ended up doing the merge work so
    that Andrew didn't have to.

    Additionally, a few fixes.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (42 commits)
    super: fix for destroy lrus
    list_lru: dynamically adjust node arrays
    shrinker: Kill old ->shrink API.
    shrinker: convert remaining shrinkers to count/scan API
    staging/lustre/libcfs: cleanup linux-mem.h
    staging/lustre/ptlrpc: convert to new shrinker API
    staging/lustre/obdclass: convert lu_object shrinker to count/scan API
    staging/lustre/ldlm: convert to shrinkers to count/scan API
    hugepage: convert huge zero page shrinker to new shrinker API
    i915: bail out earlier when shrinker cannot acquire mutex
    drivers: convert shrinkers to new count/scan API
    fs: convert fs shrinkers to new scan/count API
    xfs: fix dquot isolation hang
    xfs-convert-dquot-cache-lru-to-list_lru-fix
    xfs: convert dquot cache lru to list_lru
    xfs: rework buffer dispose list tracking
    xfs-convert-buftarg-lru-to-generic-code-fix
    xfs: convert buftarg LRU to generic code
    fs: convert inode and dentry shrinking to be node aware
    vmscan: per-node deferred work
    ...

    Linus Torvalds
     

11 Sep, 2013

3 commits

  • Convert the driver shrinkers to the new API. Most changes are compile
    tested only because I either don't have the hardware or it's staging
    stuff.

    FWIW, the md and android code is pretty good, but the rest of it
    makes me want to claw my eyes out. The amount of broken code I just
    encountered is mind-boggling. I've added comments explaining what is
    broken, but I fear that some of the code would be best dealt with by
    being dragged behind the bike shed, buried in mud up to its neck, and
    then run over repeatedly with a blunt lawn mower.

    Special mention goes to the zcache/zcache2 drivers. They can't
    co-exist in the build at the same time, they are under different menu
    options in menuconfig, and they only show up when you've got the
    right set of mm subsystem options configured, so even compile testing
    is an exercise in pulling teeth. And that doesn't even take into
    account the horrible, broken code...
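
    For reference, a minimal sketch of the count/scan shrinker API these
    conversions move to (shapes as merged for 3.12; the my_* helpers are
    hypothetical stand-ins for a driver's object cache):

        static unsigned long my_count(struct shrinker *s,
                                      struct shrink_control *sc)
        {
                /* cheap estimate of freeable objects; 0 means nothing to do */
                return my_cache_object_count();        /* hypothetical */
        }

        static unsigned long my_scan(struct shrinker *s,
                                     struct shrink_control *sc)
        {
                /* free up to sc->nr_to_scan objects and return the number
                 * freed, or SHRINK_STOP if no progress can be made */
                return my_cache_trim(sc->nr_to_scan);  /* hypothetical */
        }

        static struct shrinker my_shrinker = {
                .count_objects = my_count,
                .scan_objects  = my_scan,
                .seeks         = DEFAULT_SEEKS,
        };

    Registration is unchanged: register_shrinker(&my_shrinker).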

    [glommer@openvz.org: fixes for i915, android lowmem, zcache, bcache]
    Signed-off-by: Dave Chinner
    Signed-off-by: Glauber Costa
    Acked-by: Mel Gorman
    Cc: Daniel Vetter
    Cc: Kent Overstreet
    Cc: John Stultz
    Cc: David Rientjes
    Cc: Jerome Glisse
    Cc: Thomas Hellstrom
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton

    Signed-off-by: Al Viro

    Dave Chinner
     
  • Pull device-mapper updates from Mike Snitzer:
    "Add the ability to collect I/O statistics on user-defined regions of a
    device-mapper device. This dm-stats code required the reintroduction
    of a div64_u64_rem() helper, but as a separate method that doesn't
    slow down div64_u64() -- especially on 32-bit systems.

    Allow the error target to replace request-based DM devices (e.g.
    multipath) in addition to bio-based DM devices.

    Various other small code fixes and improvements to thin-provisioning,
    DM cache and the DM ioctl interface"
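
    On the div64_u64_rem() helper mentioned above: a hedged sketch of how
    a remainder-returning variant can avoid a second slow 64-by-64
    division on 32-bit systems (my_div64_u64_rem is an illustrative name,
    not the kernel's actual implementation):

        #include <linux/math64.h>

        static inline u64 my_div64_u64_rem(u64 dividend, u64 divisor,
                                           u64 *remainder)
        {
                u64 quot = div64_u64(dividend, divisor); /* one slow divide */

                *remainder = dividend - quot * divisor;  /* multiply-subtract */
                return quot;
        }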

    * tag 'dm-3.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
    dm stripe: silence a couple sparse warnings
    dm: add statistics support
    dm thin: always return -ENOSPC if no_free_space is set
    dm ioctl: cleanup error handling in table_load
    dm ioctl: increase granularity of type_lock when loading table
    dm ioctl: prevent rename to empty name or uuid
    dm thin: set pool read-only if breaking_sharing fails block allocation
    dm thin: prefix pool error messages with pool device name
    dm: allow error target to replace bio-based and request-based targets
    math64: New separate div64_u64_rem helper
    dm space map: optimise sm_ll_dec and sm_ll_inc
    dm btree: prefetch child nodes when walking tree for a dm_btree_del
    dm btree: use pop_frame in dm_btree_del to cleanup code
    dm cache: eliminate holes in cache structure
    dm cache: fix stacking of geometry limits
    dm thin: fix stacking of geometry limits
    dm thin: add data block size limits to Documentation
    dm cache: add data block size limits to code and Documentation
    dm cache: document metadata device is exclussive to a cache
    dm: stop using WQ_NON_REENTRANT

    Linus Torvalds
     
  • Pull md update from Neil Brown:
    "Headline item is multithreading for RAID5 so that more IO/sec can be
    supported on fast (SSD) devices. Also TILE-Gx SIMD support for RAID6
    calculations and an assortment of bug fixes"

    * tag 'md/3.12' of git://neil.brown.name/md:
    raid5: only wakeup necessary threads
    md/raid5: flush out all pending requests before proceeding with reshape.
    md/raid5: use seqcount to protect access to shape in make_request.
    raid5: sysfs entry to control worker thread number
    raid5: offload stripe handle to workqueue
    raid5: fix stripe release order
    raid5: make release_stripe lockless
    md: avoid deadlock when dirty buffers during md_stop.
    md: Don't test all of mddev->flags at once.
    md: Fix apparent cut-and-paste error in super_90_validate
    raid6/test: replace echo -e with printf
    RAID: add tilegx SIMD implementation of raid6
    md: fix safe_mode buglet.
    md: don't call md_allow_write in get_bitmap_file.

    Linus Torvalds
     

06 Sep, 2013

9 commits

  • Eliminate the following sparse warnings:
    drivers/md/dm-stripe.c:443:12: warning: symbol 'dm_stripe_init' was not declared. Should it be static?
    drivers/md/dm-stripe.c:456:6: warning: symbol 'dm_stripe_exit' was not declared. Should it be static?

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Support the collection of I/O statistics on user-defined regions of
    a DM device. If no regions are defined, no statistics are collected,
    so there is no performance impact. Only bio-based DM devices are
    currently supported.

    Each user-defined region specifies a starting sector, length and step.
    Individual statistics will be collected for each step-sized area within
    the range specified.

    The I/O statistics counters for each step-sized area of a region are
    in the same format as /sys/block/*/stat or /proc/diskstats but extra
    counters (12 and 13) are provided: total time spent reading and
    writing in milliseconds. All these counters may be accessed by sending
    the @stats_print message to the appropriate DM device via dmsetup.

    The creation of DM statistics will allocate memory via kmalloc or
    fall back to using vmalloc space. At most 1/4 of the overall system
    memory may be allocated by DM statistics. The admin can see how much
    memory is used by reading
    /sys/module/dm_mod/parameters/stats_current_allocated_bytes

    See Documentation/device-mapper/statistics.txt for more details.
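
    A hedged sketch of the 1/4-of-memory cap described above (the helper
    name and accounting variable are illustrative, not dm-stats' actual
    internals):

        static bool stats_alloc_allowed(size_t alloc_size, size_t allocated)
        {
                /* compare in pages to avoid 32-bit overflow: allow at most
                 * 1/4 of total system memory for statistics data */
                return ((allocated + alloc_size) >> PAGE_SHIFT) <=
                        totalram_pages / 4;
        }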

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • If pool has 'no_free_space' set it means a previous allocation already
    determined the pool has no free space (and failed that allocation with
    -ENOSPC). By always returning -ENOSPC if 'no_free_space' is set, we do
    not allow the pool to oscillate between allocating blocks and then not.
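
    Roughly the shape of that check, as a hedged sketch (the normal
    allocation path is elided):

        static int alloc_data_block(struct thin_c *tc, dm_block_t *result)
        {
                struct pool *pool = tc->pool;

                /* a previous allocation already found the pool full: keep
                 * returning -ENOSPC until a table reload clears the flag */
                if (pool->no_free_space)
                        return -ENOSPC;

                /* ... normal free-block allocation path ... */
                return 0;
        }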

    But a side-effect of this determinism is that if a user wants to be able
    to allocate new blocks they'll need to reload the pool's table (to clear
    the 'no_free_space' flag). This reload will happen automatically if the
    pool's data volume is resized. But if the user takes action to free a
    lot of space by deleting snapshot volumes, etc., the pool will no
    longer allow data allocations to continue without an intervening
    table reload.

    Signed-off-by: Mike Snitzer
    Acked-by: Joe Thornber
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Make use of common cleanup code.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Hold the mapped device's type_lock before calling populate_table() since
    it is where the table's type is determined based on the specified
    targets. There is no need to allow concurrent table loads to race to
    establish the table's targets or type.

    This eliminates the need to grab the lock in dm_table_set_type().

    Also verify that the type_lock is held in both dm_set_md_type() and
    dm_get_md_type().
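
    A hedged sketch of the resulting shape of table_load() (error
    unwinding elided):

        dm_lock_md_type(md);            /* take md->type_lock up front */
        r = populate_table(t, param, param_flags); /* type decided in here */
        if (!r) {
                if (dm_get_md_type(md) == DM_TYPE_NONE)
                        dm_set_md_type(md, dm_table_get_type(t)); /* 1st load */
                else if (dm_get_md_type(md) != dm_table_get_type(t))
                        r = -EINVAL;    /* type can't change on reload */
        }
        dm_unlock_md_type(md);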

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • A device-mapper device must always have a name consisting of a non-empty
    string. If the device also has a uuid, this similarly must not be an
    empty string.

    The DM_DEV_CREATE ioctl enforces these rules when the device is created,
    but this patch is needed to enforce them when DM_DEV_RENAME is used to
    change the name or uuid.
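
    As a hedged sketch, the rename path simply has to reject an empty
    string (new_name here is illustrative):

        if (!*new_name) {
                DMWARN("Invalid new mapped device name or uuid string supplied.");
                return -EINVAL;
        }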

    Reported-by: Zdenek Kabelac
    Signed-off-by: Alasdair G Kergon
    Signed-off-by: Mike Snitzer
    Acked-by: Mikulas Patocka

    Alasdair Kergon
     
  • break_sharing() now handles an arbitrary alloc_data_block() error
    the same way as provision_block(): it marks the pool read-only and
    errors the cell.

    Signed-off-by: Mike Snitzer
    Acked-by: Joe Thornber
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Useful to know which pool is experiencing the error.

    Signed-off-by: Mike Snitzer
    Acked-by: Joe Thornber
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • It may be useful to switch a request-based table to the "error" target.
    Enhance the DM core to allow a hybrid target_type which is capable of
    handling either bios (via .map) or requests (via .map_rq).

    Add a request-based map function (.map_rq) to the "error" target_type;
    making it DM's first hybrid target. Train dm_table_set_type() to prefer
    the mapped device's established type (request-based or bio-based). If
    the mapped device doesn't have an established type default to making the
    table with the hybrid target(s) bio-based.
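
    A hedged sketch of what a hybrid target_type looks like, modelled on
    the "error" target (ctr/dtr and module boilerplate elided):

        static int io_err_map(struct dm_target *ti, struct bio *bio)
        {
                return -EIO;            /* bio-based path: fail every bio */
        }

        static int io_err_map_rq(struct dm_target *ti, struct request *clone,
                                 union map_info *map_context)
        {
                return -EIO;            /* request-based path: fail every clone */
        }

        static struct target_type error_target = {
                .name   = "error",
                .map    = io_err_map,   /* used by bio-based tables */
                .map_rq = io_err_map_rq /* used by request-based tables */
        };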

    Tested that 'dmsetup wipe_table' works on both bio-based and
    request-based devices.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Joe Jin
    Signed-off-by: Jun'ichi Nomura
    Acked-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     

04 Sep, 2013

1 commit

  • Pull first round of SCSI updates from James Bottomley:
    "This patch set is a set of driver updates (ufs, zfcp, lpfc, mpt2/3sas,
    qla4xxx, qla2xxx [adding support for ISP8044 + other things]).

    We also have a new driver: esas2r which has a number of static checker
    problems, but which I expect to resolve over the -rc course of 3.12
    under the new driver exception.

    We also have the error returns that were discussed at LSF"

    * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (118 commits)
    [SCSI] sg: push file descriptor list locking down to per-device locking
    [SCSI] sg: checking sdp->detached isn't protected when open
    [SCSI] sg: no need sg_open_exclusive_lock
    [SCSI] sg: use rwsem to solve race during exclusive open
    [SCSI] scsi_debug: fix logical block provisioning support when unmap_alignment != 0
    [SCSI] scsi_debug: fix endianness bug in sdebug_build_parts()
    [SCSI] qla2xxx: Update the driver version to 8.06.00.08-k.
    [SCSI] qla2xxx: print MAC via %pMR.
    [SCSI] qla2xxx: Correction to message ids.
    [SCSI] qla2xxx: Correctly print out/in mailbox registers.
    [SCSI] qla2xxx: Add a new interface to update versions.
    [SCSI] qla2xxx: Move queue depth ramp down message to i/o debug level.
    [SCSI] qla2xxx: Select link initialization option bits from current operating mode.
    [SCSI] qla2xxx: Add loopback IDC-TIME-EXTEND aen handling support.
    [SCSI] qla2xxx: Set default critical temperature value in cases when ISPFX00 firmware doesn't provide it
    [SCSI] qla2xxx: QLAFX00 make over temperature AEN handling informational, add log for normal temperature AEN
    [SCSI] qla2xxx: Correct Interrupt Register offset for ISPFX00
    [SCSI] qla2xxx: Remove handling of Shutdown Requested AEN from qlafx00_process_aen().
    [SCSI] qla2xxx: Send all AENs for ISPFx00 to above layers.
    [SCSI] qla2xxx: Add changes in initialization for ISPFX00 cards with BIOS
    ...

    Linus Torvalds
     

02 Sep, 2013

1 commit

  • If there are not enough stripes to handle, we'd better not always
    queue all available work_structs. If one worker can only handle a few
    or even no stripes, it will impact request merging and create lock
    contention.

    With this patch, the number of running work_structs depends on the
    number of pending stripes. Note: some statistics info used in the
    patch is accessed without locking protection. This shouldn't matter;
    we just try our best to avoid queueing unnecessary work_structs.
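
    A hedged sketch of the idea (the pending-stripe counter and group
    fields are illustrative, not raid5's exact ones):

        /* queue at most as many work_structs as there are pending stripes */
        int i, thread_cnt = min(pending_stripes, conf->worker_cnt_per_group);

        for (i = 0; i < thread_cnt; i++)
                queue_work_on(sh->cpu, raid5_wq, &group->workers[i].work);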

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    Shaohua Li
     

28 Aug, 2013

6 commits

  • Some requests - particularly 'discard' and 'read' - are handled
    differently depending on whether a reshape is active or not.

    It is harmless to assume reshape is active if it isn't, but wrong
    to act as though reshape is not active when it is.

    So when we start reshape - after making clear to all requests that
    reshape has started - use mddev_suspend/mddev_resume to flush out all
    requests. This will ensure that no requests will be assuming the
    absence of reshape once it really starts.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • make_request() accesses various shape parameters (raid_disks,
    chunk_size, etc.) which might be changed by raid5_start_reshape().

    If the latter is called at an awkward time during the former, the
    wrong stripe_head might be used.

    So introduce a 'seqcount' and, after finding a stripe_head, make sure
    there is no reason to suspect that we got the wrong one.
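
    The seqcount pattern, as a hedged sketch (raid5's counter is
    conf->gen_lock; the real retry path also releases the stripe before
    looping):

        /* writer, raid5_start_reshape(): */
        write_seqcount_begin(&conf->gen_lock);
        conf->raid_disks = new_disks;   /* ... update shape parameters ... */
        write_seqcount_end(&conf->gen_lock);

        /* reader, make_request(): */
        do {
                seq = read_seqcount_begin(&conf->gen_lock);
                sh = get_active_stripe(conf, new_sector, previous, 0, 0);
        } while (read_seqcount_retry(&conf->gen_lock, seq));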

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Add a sysfs entry to control the number of running workqueue
    threads. If group_thread_cnt is set to 0, workqueue offload handling
    of stripes is disabled.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    Shaohua Li
     
  • This is another attempt to create multiple threads to handle raid5
    stripes. This time I use a workqueue.

    raid5 handles requests (especially writes) in stripe units. A stripe
    is page-size aligned and spans all disks. For a write to any disk
    sector, raid5 runs a state machine for the corresponding stripe,
    which includes reading some disks of the stripe, calculating parity,
    and writing some disks of the stripe. The state machine currently
    runs in the raid5d thread. Since there is only one thread, it doesn't
    scale well for high-speed storage. An obvious solution is
    multi-threading.

    To get better performance, we have some requirements:
    a. locality: a stripe corresponding to a request submitted from one
    cpu is better handled by a thread on the local cpu or local node. The
    local cpu is preferred, but can sometimes be a bottleneck - for
    example, when parity calculation is too heavy; running on the local
    node has wider adaptability.
    b. configurability: different raid5 array setups might need different
    configurations, especially the thread number. More threads don't
    always mean better performance, because of lock contention.

    My original implementation created some kernel threads. There were
    interfaces to control which cpus' stripes each thread should handle,
    and userspace could set the affinity of the threads. This provided
    the biggest flexibility and configurability, but it was hard to use,
    and apparently a new thread pool implementation was disfavoured.

    Recent workqueue improvements are quite promising. An unbound
    workqueue will be bound to a NUMA node, and if WQ_SYSFS is set on the
    workqueue, there are sysfs options to do affinity setting - for
    example, we can include only one HT sibling in the affinity mask.
    Since work is non-reentrant by default, we can control the number of
    running threads by limiting the number of dispatched work_structs.
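
    For reference, this is the kind of workqueue being described (flags
    as in the raid5 series; WQ_SYSFS exposes cpumask and numa attributes
    under /sys/devices/virtual/workqueue/):

        raid5_wq = alloc_workqueue("raid5wq",
                                   WQ_UNBOUND | WQ_MEM_RECLAIM |
                                   WQ_CPU_INTENSIVE | WQ_SYSFS, 0);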

    In this patch, I created several stripe worker groups. A group
    corresponds to a NUMA node: stripes from the cpus of one node are
    added to that group's list, and workqueue threads of one node only
    handle stripes of that node's worker group. In this way, stripe
    handling has NUMA node locality, and, as said above, we can control
    the thread number by limiting the number of dispatched work_structs.

    The work_struct callback function handles several stripes in one run.
    Typical workqueue usage is to run one unit per work_struct; in the
    raid5 case the unit would be a stripe. But we can't do that:
    a. Though handling a stripe doesn't need a lock (because of reference
    accounting, and the stripe isn't on any list), queuing a work_struct
    for each stripe would make the workqueue lock very heavily contended.
    b. blk_start_plug()/blk_finish_plug() should surround stripe
    handling, as we might dispatch requests. If each work_struct only
    handled one stripe, such block plugging would be meaningless.

    This implementation can't do very fine-grained configuration, but
    NUMA binding is the most popular usage model and should be enough for
    most workloads.

    Note: since we have only one stripe queue, switching to
    multi-threading might decrease the request size dispatched down to
    the low-level layer. The impact depends on thread number, raid
    configuration and workload, so multi-threaded raid5 might not be
    appropriate for all setups.

    Changes V1 -> V2:
    1. removed WQ_NON_REENTRANT
    2. disabled multi-threading by default
    3. added more description to the changelog

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    Shaohua Li
     
  • patch "make release_stripe lockless" changes the order stripes are released.
    Originally I thought block layer can take care of request merge, but it appears
    there are still some requests not merged. It's easy to fix the order.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    Shaohua Li
     
  • release_stripe still has big lock contention. With this patch we
    just add the stripe to an llist without taking device_lock, and let
    the raid5d thread do the real stripe release, which must hold
    device_lock anyway. In this way, release_stripe doesn't hold any
    locks.
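
    A hedged sketch of the two halves (field names follow the raid5
    change):

        /* release_stripe(): lock-free push; wake raid5d on first insertion */
        if (llist_add(&sh->release_list, &conf->released_stripes))
                md_wakeup_thread(conf->mddev->thread);

        /* raid5d: detach the whole list at once, then do the real release
         * of each stripe under device_lock as before */
        head = llist_del_all(&conf->released_stripes);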

    The side effect is that the order of released stripes is changed. But
    that doesn't sound like a big deal: stripes are never handled in
    order, and I thought the block layer could already do nice request
    merging anyway, which means order isn't that important.

    I kept the unplug release batch, which is unnecessary with this patch
    from a lock-contention-avoidance point of view (and if we deleted it,
    the stripe_head release_list and lru could actually share storage),
    but the unplug release batch is also helpful for request merging. We
    could probably delay waking raid5d until unplug, but I'm still wary
    of the case where raid5d is already running.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    Shaohua Li
     

27 Aug, 2013

5 commits

  • When the last process closes /dev/mdX, sync_blockdev will be called
    so that all buffers get flushed.
    So if it is then opened for the STOP_ARRAY ioctl to be sent, there
    will be nothing to flush.

    However if we open /dev/mdX in order to send the STOP_ARRAY ioctl just
    moments before some other process which was writing closes their file
    descriptor, then there won't be a 'last close' and the buffers might
    not get flushed.

    So do_md_stop() calls sync_blockdev(). However at this point it is
    holding ->reconfig_mutex. So if the array is currently 'clean' then
    the writes from sync_blockdev() will not complete until the array
    can be marked dirty and that won't happen until some other thread
    can get ->reconfig_mutex. So we deadlock.

    We need to move the sync_blockdev() call to before we take
    ->reconfig_mutex.
    However then some other thread could open /dev/mdX and write to it
    after we call sync_blockdev() and before we actually stop the array.
    This can leave dirty data in the page cache which is awkward.

    So introduce new flag MD_STILL_CLOSED. Set it before calling
    sync_blockdev(), clear it if anyone does open the file, and abort the
    STOP_ARRAY attempt if it gets set before we lock against further
    opens.
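
    The protocol, as a hedged sketch:

        /* before taking ->reconfig_mutex in the STOP_ARRAY path: */
        set_bit(MD_STILL_CLOSED, &mddev->flags);
        sync_blockdev(bdev);

        /* md_open(): any new opener invalidates the flushed state */
        clear_bit(MD_STILL_CLOSED, &mddev->flags);

        /* do_md_stop(), once further opens are locked out: */
        if (!test_bit(MD_STILL_CLOSED, &mddev->flags))
                return -EBUSY;  /* someone opened the device; abort the stop */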

    It is still possible to get problems if you open /dev/mdX, write to
    it, then issue the STOP_ARRAY ioctl. Just don't do that.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • mddev->flags is mostly used to record if an update of the
    metadata is needed. Sometimes the whole field is tested
    instead of just the important bits. This makes it difficult
    to introduce more state bits.

    So replace all bare tests of mddev->flags with tests for the bits
    that actually need testing.
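
    For example (the mask name follows the patch; treat the exact bits as
    a sketch):

        /* was:  if (mddev->flags) ... */
        if (mddev->flags & MD_UPDATE_SB_FLAGS)  /* only the sb-change bits */
                md_update_sb(mddev, 0);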

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Setting a variable to itself probably wasn't the intention here.

    Signed-off-by: Dave Jones
    Signed-off-by: NeilBrown

    Dave Jones
     
  • When we set the safe_mode_timeout to a smaller value we trigger a
    timeout immediately - otherwise the smaller value might not be
    honoured. However if the previous timeout was 0, meaning "no
    timeout", we didn't. This would mean that no timeout happens until
    the next write completes, which could be a long time.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • There is no real need, as GFP_NOIO is very likely sufficient,
    and failure is not catastrophic.

    Calling md_allow_write here will convert a read-auto array to
    read/write which could be confusing when you are just performing
    a read operation.

    Signed-off-by: NeilBrown

    NeilBrown
     

23 Aug, 2013

8 commits

  • Prior to this patch these methods did a lookup followed by an insert.
    Instead they now call a common mutate function that adjusts the value
    according to a callback function. This avoids traversing the data
    structures twice and hence improves performance.

    Also factor out sm_ll_lookup_big_ref_count() for use by both
    sm_ll_lookup() and sm_ll_mutate().
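
    A hedged sketch of the resulting structure (the mutator callback
    signature is illustrative):

        static int sm_ll_inc(struct ll_disk *ll, dm_block_t b,
                             enum allocation_event *ev)
        {
                return sm_ll_mutate(ll, b, add_one, ev); /* add_one: ref + 1 */
        }

        static int sm_ll_dec(struct ll_disk *ll, dm_block_t b,
                             enum allocation_event *ev)
        {
                return sm_ll_mutate(ll, b, sub_one, ev); /* sub_one: ref - 1 */
        }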

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Joe Thornber
     
  • dm-btree now takes advantage of dm-bufio's ability to prefetch data via
    dm_bm_prefetch(). Prior to this change many btree node visits were
    causing a synchronous read.
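
    A hedged sketch of the walk (value64() reads a child block number out
    of an internal btree node):

        /* kick off non-blocking reads for all children before descending,
         * so the later per-child visits hit dm-bufio's cache */
        for (i = 0; i < le32_to_cpu(n->header.nr_entries); i++)
                dm_bm_prefetch(bm, value64(n, i));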

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Joe Thornber
     
  • Remove a visited leaf straight away from the stack, rather than
    marking all its children as visited and letting it get removed on
    the next iteration. This may also offer a micro-optimisation in
    dm_btree_del.

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Joe Thornber
     
  • Reorder members in the cache structure to eliminate 6 out of 7 holes
    (reclaiming 24 bytes). Also, the 'worker' and 'waker' members no longer
    straddle cachelines.
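
    Illustrative only (not dm-cache's actual layout), this is the kind of
    reordering involved - grouping members by descending size/alignment
    removes the padding holes:

        struct before {         /* sizeof == 32 on 64-bit: two 7-byte holes */
                u8  a;          /* 1 byte + 7-byte hole */
                u64 b;
                u8  c;          /* 1 byte + 7-byte hole */
                u64 d;
        };

        struct after {          /* sizeof == 24 on 64-bit: 6-byte tail pad */
                u64 b;
                u64 d;
                u8  a;
                u8  c;
        };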

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon
    Acked-by: Joe Thornber

    Mike Snitzer
     
  • Do not blindly override the queue limits (specifically io_min and
    io_opt). Allow traditional stacking of these limits if io_opt is a
    factor of the cache's data block size.

    Without this patch mkfs.xfs does not recognize the cache device's
    provided limits as a useful geometry (e.g. raid) so these hints are
    ignored. This was due to setting io_min to a useless value.
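
    A hedged sketch of the resulting io_hints hook (close to, but not
    verbatim, the patch):

        static void cache_io_hints(struct dm_target *ti,
                                   struct queue_limits *limits)
        {
                struct cache *cache = ti->private;
                uint64_t io_opt_sectors = limits->io_opt >> SECTOR_SHIFT;

                /* keep the stacked hints if io_opt already fits our block
                 * size; otherwise advertise the cache's own geometry */
                if (io_opt_sectors < cache->sectors_per_block ||
                    do_div(io_opt_sectors, cache->sectors_per_block)) {
                        blk_limits_io_min(limits, 0);
                        blk_limits_io_opt(limits,
                                cache->sectors_per_block << SECTOR_SHIFT);
                }
        }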

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon
    Acked-by: Joe Thornber

    Mike Snitzer
     
  • Do not blindly override the queue limits (specifically io_min and
    io_opt). Allow traditional stacking of these limits if io_opt is a
    factor of the thin-pool's data block size.

    Without this patch mkfs.xfs does not recognize the thin device's
    provided limits as a useful geometry (e.g. raid) so these hints are
    ignored. This was due to setting io_min to a useless value.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon
    Acked-by: Joe Thornber

    Mike Snitzer
     
  • Place an upper bound on the cache's data block size (1GB).

    Inform users that the data block size can't be any arbitrary number,
    i.e. its value must be between 32KB and 1GB. Also, it should be a
    multiple of 32KB.
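
    In 512-byte sectors the bounds work out as below; a hedged sketch of
    the validation (macro names are illustrative):

        #define BLOCK_SIZE_MIN_SECTORS (32 * 1024 >> SECTOR_SHIFT)   /* 64 */
        #define BLOCK_SIZE_MAX_SECTORS (1024 * 1024 * 1024 >> SECTOR_SHIFT)

        /* the multiple-of-32KB test works because 64 is a power of two */
        if (block_size < BLOCK_SIZE_MIN_SECTORS ||
            block_size > BLOCK_SIZE_MAX_SECTORS ||
            block_size & (BLOCK_SIZE_MIN_SECTORS - 1)) {
                *error = "Invalid data block size";
                return -EINVAL;
        }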

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon
    Acked-by: Joe Thornber

    Mike Snitzer
     
  • dbf2576e37 ("workqueue: make all workqueues non-reentrant") made
    WQ_NON_REENTRANT no-op and the flag is going away. Remove its usages.

    This patch doesn't introduce any behavior changes.
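
    Illustrative before/after (the queue name is hypothetical):

        /* before: */
        wq = alloc_workqueue("dm-example", WQ_NON_REENTRANT | WQ_MEM_RECLAIM, 0);
        /* after - the flag had already become a no-op: */
        wq = alloc_workqueue("dm-example", WQ_MEM_RECLAIM, 0);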

    Signed-off-by: Tejun Heo
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon
    Acked-by: Joe Thornber

    Tejun Heo
     

17 Aug, 2013

1 commit

  • On sparc32, which includes <linux/swap.h> from <asm/pgtable_32.h>:

    drivers/md/dm-cache-policy-mq.c:962:13: error: conflicting types for 'remove_mapping'
    include/linux/swap.h:285:12: note: previous declaration of 'remove_mapping' was here

    As mq_remove_mapping() already exists, and the local remove_mapping() is
    used only once, inline it manually to avoid the conflict.

    Signed-off-by: Geert Uytterhoeven
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair Kergon
    Acked-by: Joe Thornber

    Geert Uytterhoeven
     

25 Jul, 2013

2 commits

  • If a device in a RAID4/5/6 is being replaced while another is being
    recovered, then the writes to the replacement device currently don't
    happen, resulting in corruption when the replacement completes and the
    new drive takes over.

    This is because the replacement writes are only triggered when
    's.replacing' is set and not when the similar 's.sync' is set (which
    is the case during resync and recovery - it means all devices need to
    be read).

    So schedule those writes when s.replacing is set as well.

    In this case we cannot use "STRIPE_INSYNC" to record that the
    replacement has happened as that is needed for recording that any
    parity calculation is complete. So introduce STRIPE_REPLACED to
    record if the replacement has happened.

    For safety we should also check that STRIPE_COMPUTE_RUN is not set.
    This has a similar effect to the "s.locked == 0" test: the latter
    ensures that no IO has been flagged but not yet started, while the
    former checks that no parity calculation has been flagged but not
    yet started. We must wait for both of these to complete before
    triggering the 'replace'.
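
    A hedged sketch of that gating, handle_stripe-style (the
    write-scheduling helper is hypothetical):

        if (s.replacing && s.locked == 0 &&
            !test_bit(STRIPE_COMPUTE_RUN, &sh->state) && /* no parity calc */
            !test_bit(STRIPE_REPLACED, &sh->state)) {
                set_bit(STRIPE_REPLACED, &sh->state);
                schedule_replacement_writes(sh);  /* hypothetical helper */
        }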

    Add a similar test to the subsequent check for "are we finished yet".
    This possibly isn't needed (is subsumed in the STRIPE_INSYNC test),
    but it makes it more obvious that the REPLACE will happen before we
    think we are finished.

    Finally if a NeedReplace device is not UPTODATE then that is an
    error. We really must trigger a warning.

    This bug was introduced in commit 9a3e1101b827a59ac9036a672f5fa8d5279d0fe2
    (md/raid5: detect and handle replacements during recovery.)
    which introduced replacement for raid5.
    That was in 3.3-rc3, so any stable kernel since then would benefit
    from this fix.

    Cc: stable@vger.kernel.org (3.3+)
    Reported-by: qindehua
    Tested-by: qindehua
    Signed-off-by: NeilBrown

    NeilBrown
     
  • We always need to be careful when calling generic_make_request, as it
    can start a chain of events which might free something that we are
    using.

    Here is one place I wasn't careful enough. If the wbio2 is not in
    use, then it might get freed at the first generic_make_request call.
    So perform all necessary tests first.
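
    The safe ordering, as a hedged sketch:

        /* decide about wbio2 before anything is submitted: submitting wbio
         * may complete the r10_bio and free an unused wbio2 */
        if (wbio2 && !wbio2->bi_end_io)
                wbio2 = NULL;

        if (wbio->bi_end_io)
                generic_make_request(wbio);
        if (wbio2)
                generic_make_request(wbio2);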

    This bug was introduced in 3.3-rc3 (24afd80d99) and can cause an
    oops, so fix is suitable for any -stable since then.

    Cc: stable@vger.kernel.org (3.3+)
    Signed-off-by: NeilBrown

    NeilBrown
     

23 Jul, 2013

1 commit

  • Pull block IO driver bits from Jens Axboe:
    "As I mentioned in the core block pull request, due to real life
    circumstances the driver pull request would be late. Now it looks
    like -rc2 late... On the plus side, apart from the rsxx update, these
    are all things that I could argue could go in later in the cycle as
    they are fixes and not features. So even though things are late, it's
    not ALL bad.

    The pull request contains:

    - Updates to bcache, all bug fixes, from Kent.

    - A pile of drbd bug fixes (no big features this time!).

    - xen blk front/back fixes.

    - rsxx driver updates, some of them deferred from 3.10. So should be
    well cooked by now"

    * 'for-3.11/drivers' of git://git.kernel.dk/linux-block: (63 commits)
    bcache: Allocation kthread fixes
    bcache: Fix GC_SECTORS_USED() calculation
    bcache: Journal replay fix
    bcache: Shutdown fix
    bcache: Fix a sysfs splat on shutdown
    bcache: Advertise that flushes are supported
    bcache: check for allocation failures
    bcache: Fix a dumb race
    bcache: Use standard utility code
    bcache: Update email address
    bcache: Delete fuzz tester
    bcache: Document shrinker reserve better
    bcache: FUA fixes
    drbd: Allow online change of al-stripes and al-stripe-size
    drbd: Constants should be UPPERCASE
    drbd: Ignore the exit code of a fence-peer handler if it returns too late
    drbd: Fix rcu_read_lock balance on error path
    drbd: fix error return code in drbd_init()
    drbd: Do not sleep inside rcu
    bcache: Refresh usage docs
    ...

    Linus Torvalds