01 Feb, 2013

1 commit


31 Jan, 2013

2 commits

  • When processing WRITE SAME requests, fix dm to send the configured
    number of WRITE SAME requests to the target rather than the number
    configured for discards, which is not necessarily the same.

    Device-mapper WRITE SAME support was introduced by commit
    23508a96cd2e857d57044a2ed7d305f2d9daf441 ("dm: add WRITE SAME support").
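
    The shape of the fix, sketched against the 3.8-era dm core (the helper
    names follow that code base, but treat this as a sketch rather than the
    exact diff):

        static int __clone_and_map_write_same(struct clone_info *ci)
        {
                /* was get_num_discard_requests, which cloned one request
                 * per configured discard instead of per configured
                 * WRITE SAME */
                return __clone_and_map_changing_extent_only(ci,
                                get_num_write_same_requests, NULL);
        }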

    Signed-off-by: Alasdair G Kergon
    Acked-by: Mike Snitzer

    Alasdair G Kergon
     
  • thin_io_hints() is blindly copying the queue limits from the thin-pool,
    which can lead to incorrect limits being set. The fix here simply
    deletes the thin_io_hints() hook, leaving the existing stacking
    infrastructure to set the limits correctly.

    When a thin-pool uses an MD device for the data device a thin device
    from the thin-pool must respect MD's constraints about disallowing a bio
    from spanning multiple chunks. Otherwise errors result. For example,
    if the raid0 chunksize is 1152K and the thin-pool chunksize is 256K,
    the following md/raid0 error (with extra debug tracing added to
    thin_endio) appears when mkfs.xfs is executed against the thin device:

    md/raid0:md99: make_request bug: can't convert block across chunks or bigger than 1152k 6688 127
    device-mapper: thin: bio sector=2080 err=-5 bi_size=130560 bi_rw=17 bi_vcnt=32 bi_idx=0

    This extra DM debugging shows that the failing bio is spanning across
    the first and second logical 1152K chunk (sector 2080 + 255 takes the
    bio beyond the first chunk's boundary of sector 2304). So the bio
    splitting that DM is doing clearly isn't respecting the MD limits.

    max_hw_sectors_kb is 127 for both the thin-pool and thin device
    (queue_max_hw_sectors returns 255 so we'll excuse sysfs's lack of
    precision). So this explains why bi_size is 130560.

    But the thin device's max_hw_sectors_kb should be 4 (PAGE_SIZE), given
    that it has no .merge function (for bio_add_page to consult indirectly
    via dm_merge_bvec), yet the thin-pool sits above an MD device with a
    compulsory merge_bvec_fn. This scenario is exactly
    why DM must resort to sending single PAGE_SIZE bios to the underlying
    layer. Some additional context for this is available in the header for
    commit 8cbeb67a ("dm: avoid unsupported spanning of md stripe boundaries").

    Long story short, the reason a thin device doesn't properly get
    configured to have a max_hw_sectors_kb of 4 (PAGE_SIZE) is that
    thin_io_hints() is blindly copying the queue limits from the thin-pool
    device directly to the thin device's queue limits.

    Fix this by eliminating thin_io_hints. Doing so is safe because the
    block layer's queue limits stacking already enables the upper level thin
    device to inherit the thin-pool device's discard, minimum_io_size and
    optimal_io_size limits that get set in pool_io_hints. But avoiding the
    queue limits copy allows the thin and thin-pool limits to be different
    where it is important, namely max_hw_sectors_kb.
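
    For illustration, the kind of hook being deleted looked broadly like
    this (a sketch, not the verbatim 3.7 code):

        static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)
        {
                struct thin_c *tc = ti->private;

                /* blind copy: clobbers limits (e.g. max_hw_sectors) that
                 * the block layer's stacking would otherwise derive
                 * correctly for the thin device */
                *limits = bdev_get_queue(tc->pool_dev->bdev)->limits;
        }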

    Reported-by: Daniel Browning
    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     

24 Jan, 2013

1 commit

  • Before attempting to activate a RAID array, we check it for sufficient
    redundancy. That is, we make sure there are not so many failed devices -
    or devices specified for rebuild - as to undermine our ability to
    activate the array. The current code performs this check twice: once to
    ensure the user did not specify too many devices for rebuild
    ('validate_rebuild_devices') and again after possibly experiencing a
    failure to read the superblock ('analyse_superblocks'). Neither of these
    checks is sufficient. The first check is done properly but with
    insufficient information about the possible failure state of the devices
    to determine reliably whether the array can be activated. The second
    check is simply done wrong in the case of RAID10 because it doesn't
    account for the independence of the stripes (i.e. mirror sets). The
    solution is to use the properly written check
    ('validate_rebuild_devices'), but to perform it after the superblocks
    have been read, when we know which devices have failed. This gives us
    one check instead of two and performs it in a location where it can be
    done right.

    Only RAID10 was affected, in the following ways:
    - the code did not properly catch the condition where a user specified
    a device for rebuild that already had a failed device in the same mirror
    set. (This condition would, however, be caught at a deeper level in MD.)
    - the code triggered false positives, denying activation when devices in
    independent mirror sets had failed, because it counted the failures as
    though they were all in the same set (see the sketch below).
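
    A hedged sketch of the corrected accounting for RAID10's 'near' layout
    (identifiers loosely follow the dm-raid target and are illustrative):

        /* a device is unusable if its superblock could not be read or it
         * is not in-sync; fail activation only when every copy in a
         * single mirror set is unusable */
        rebuilds_per_set = 0;
        for (i = 0; i < rs->md.raid_disks; i++) {
                if (!(i % copies))
                        rebuilds_per_set = 0;   /* entering a new mirror set */
                if ((!rs->dev[i].rdev.sb_page ||
                     !test_bit(In_sync, &rs->dev[i].rdev.flags)) &&
                    ++rebuilds_per_set == copies)
                        return -EINVAL;         /* every copy in this set lost */
        }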

    This error was most likely introduced by (or this patch should have
    accompanied) commit 4ec1e369, which first appeared in v3.7-rc1.
    Consequently this fix should also go into v3.7.y; however, there is a
    small conflict on the .version in raid_target, so I'll submit a
    separate patch to -stable.

    Cc: stable@vger.kernel.org
    Signed-off-by: Jonathan Brassow
    Signed-off-by: NeilBrown

    Jonathan Brassow
     

22 Dec, 2012

36 commits

  • Pull dm update from Alasdair G Kergon:
    "Miscellaneous device-mapper fixes, cleanups and performance
    improvements.

    Of particular note:
    - Disable broken WRITE SAME support in all targets except linear and
    striped. Use it when kcopyd is zeroing blocks.
    - Remove several mempools from targets by moving the data into the
    bio's new front_pad area (which dm calls 'per_bio_data').
    - Fix a race in thin provisioning if discards are misused.
    - Prevent userspace from interfering with the ioctl parameters and
    use kmalloc for the data buffer if it's small instead of vmalloc.
    - Throttle some annoying error messages when I/O fails."

    * tag 'dm-3.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm: (36 commits)
    dm stripe: add WRITE SAME support
    dm: remove map_info
    dm snapshot: do not use map_context
    dm thin: dont use map_context
    dm raid1: dont use map_context
    dm flakey: dont use map_context
    dm raid1: rename read_record to bio_record
    dm: move target request nr to dm_target_io
    dm snapshot: use per_bio_data
    dm verity: use per_bio_data
    dm raid1: use per_bio_data
    dm: introduce per_bio_data
    dm kcopyd: add WRITE SAME support to dm_kcopyd_zero
    dm linear: add WRITE SAME support
    dm: add WRITE SAME support
    dm: prepare to support WRITE SAME
    dm ioctl: use kmalloc if possible
    dm ioctl: remove PF_MEMALLOC
    dm persistent data: improve space map block alloc failure message
    dm thin: use DMERR_LIMIT for errors
    ...

    Linus Torvalds
     
  • Rename stripe_map_discard to stripe_map_range and reuse it for WRITE
    SAME bio processing.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • This patch removes map_info from bio-based device mapper targets.
    map_info is still used for request-based targets.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Eliminate struct map_info from dm-snap.

    map_info->ptr was used in dm-snap to indicate if the bio was tracked.
    If map_info->ptr was non-NULL, the bio was linked in tracked_chunk_hash.

    This patch removes the use of map_info->ptr. We now determine whether
    the bio was tracked based on hlist_unhashed(&c->node): if
    hlist_unhashed is true, the bio is not tracked; if false, it is.
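
    A minimal sketch of the new test (the helper name is hypothetical;
    hlist_unhashed and the node member are as described above):

        /* a bio is tracked iff its chunk is still linked into tracked_chunk_hash */
        static inline bool bio_is_tracked(struct dm_snap_tracked_chunk *c)
        {
                return !hlist_unhashed(&c->node);
        }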

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • This patch removes endio_hook_pool from dm-thin and uses per-bio data instead.

    This patch removes any use of map_info in preparation for the next patch
    that removes map_info from bio-based device mapper.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Don't use map_info any more in dm-raid1.

    map_info was used for writes to hold the region number. For this purpose
    we add a new field dm_bio_details to dm_raid1_bio_record.

    map_info was used for reads to hold a pointer to dm_raid1_bio_record (a
    non-NULL pointer meant bio details had been saved; a NULL pointer meant
    they had not). We now use dm_raid1_bio_record's details.bi_bdev for
    this purpose: if bi_bdev is NULL, details were not saved; if bi_bdev is
    non-NULL, they were.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Replace map_info with a per-bio structure "struct per_bio_data" in dm-flakey.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Rename struct read_record to bio_record in dm-raid1.

    In the following patch, the structure will be used for both read and
    write bios, so rename it.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • This patch moves target_request_nr from map_info to dm_target_io and
    makes it accessible with dm_bio_get_target_request_nr.

    This patch is a preparation for the next patch that removes map_info.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Replace tracked_chunk_pool with per_bio_data in dm-snap.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Replace io_mempool with per_bio_data in dm-verity.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Replace read_record_pool with per_bio_data in dm-raid1.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Introduce a field per_bio_data_size in struct dm_target.

    Targets can set this field in the constructor. If a target sets this
    field to a non-zero value, "per_bio_data_size" bytes of auxiliary data
    are allocated for each bio submitted to the target. The target can use
    this data for any purpose; doing so improves performance by removing
    some per-target mempools.

    Per-bio data is accessed with dm_per_bio_data. The
    argument data_size must be the same as the value per_bio_data_size in
    dm_target.

    If the target has a pointer to per_bio_data, it can get a pointer to
    the bio with the dm_bio_from_per_bio_data() function (data_size must be
    the same as the value passed to dm_per_bio_data).
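
    A hedged sketch of how a target might use this API (the target, its
    struct and the example_dev() helper are hypothetical; the dm calls are
    those described above):

        struct example_pb {
                unsigned long start_jiffies;    /* whatever the target needs per bio */
        };

        static int example_ctr(struct dm_target *ti, unsigned argc, char **argv)
        {
                /* ask dm to reserve space in every bio's front_pad */
                ti->per_bio_data_size = sizeof(struct example_pb);
                return 0;
        }

        static int example_map(struct dm_target *ti, struct bio *bio)
        {
                struct example_pb *pb =
                        dm_per_bio_data(bio, sizeof(struct example_pb));

                pb->start_jiffies = jiffies;            /* no mempool allocation needed */
                bio->bi_bdev = example_dev(ti)->bdev;   /* hypothetical remap helper */
                return DM_MAPIO_REMAPPED;
        }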

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Add WRITE SAME support to dm-io and make it accessible to
    dm_kcopyd_zero(). dm_kcopyd_zero() provides an asynchronous interface
    whereas the blkdev_issue_write_same() interface is synchronous.

    WRITE SAME is a SCSI command that can be leveraged for more efficient
    zeroing of a specified logical extent of a device which supports it.
    Only a single zeroed logical block is transferred to the target for each
    WRITE SAME and the target then writes that same block across the
    specified extent.

    The dm thin target uses this.
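
    For reference, the asynchronous entry point looks like this (signature
    per the 3.8-era dm-kcopyd header; treat as illustrative):

        int dm_kcopyd_zero(struct dm_kcopyd_client *kc, unsigned num_dests,
                           struct dm_io_region *dests, unsigned flags,
                           dm_kcopyd_notify_fn fn, void *context);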

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • The linear target can already support WRITE SAME requests so signal
    this by setting num_write_same_requests to 1.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • WRITE SAME bios have a payload that contains a single page. When
    cloning WRITE SAME bios DM has no need to modify the bi_io_vec
    attributes (and doing so would be detrimental). DM need only alter the
    start and end of the WRITE SAME bio accordingly.

    Rather than duplicate __clone_and_map_discard, factor out a common
    function that is also used by __clone_and_map_write_same.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Allow targets to opt in to WRITE SAME support by setting
    'num_write_same_requests' in the dm_target structure.

    A dm device will only advertise WRITE SAME support if all its
    targets and all its underlying devices support it.
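
    A minimal sketch of a target opting in from its constructor (the target
    name is hypothetical; the field is as described above):

        static int opt_in_ctr(struct dm_target *ti, unsigned argc, char **argv)
        {
                ti->num_write_same_requests = 1;        /* advertise WRITE SAME support */
                return 0;
        }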

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • If the parameter buffer is small enough, try to allocate it with kmalloc()
    rather than vmalloc().

    vmalloc is noticeably slower than kmalloc because it has to manipulate
    page tables.

    In my tests, this patch makes activation 13 times faster on PA-RISC
    and 5% faster on Opteron.

    This patch introduces a new function, free_params(), to free the
    parameters; it uses new flags that record whether vmalloc() was used
    and whether the input buffer must be wiped after use.
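
    A sketch of the allocation strategy described (the flag names follow
    the patch but are best treated as illustrative):

        /* sketch of copy_params(): prefer kmalloc for small buffers */
        size_t size = param_kernel.data_size;   /* validated fixed-size header */
        struct dm_ioctl *dmi = NULL;

        if (size <= KMALLOC_MAX_SIZE)
                dmi = kmalloc(size, GFP_NOIO | __GFP_NORETRY | __GFP_NOWARN);

        if (dmi) {
                *param_flags |= DM_PARAMS_KMALLOC;      /* free_params() uses kfree() */
        } else {
                dmi = __vmalloc(size, GFP_NOIO | __GFP_HIGHMEM, PAGE_KERNEL);
                *param_flags |= DM_PARAMS_VMALLOC;      /* free_params() uses vfree() */
        }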

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • When allocating memory for the userspace ioctl data, set appropriate
    GFP flags directly instead of using PF_MEMALLOC.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Improve space map error message when unable to allocate a new
    metadata block.

    Signed-off-by: Joe Thornber
    Signed-off-by: Alasdair G Kergon

    Joe Thornber
     
  • Throttle all errors logged from the IO path by dm thin.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Nearly all of persistent-data is in the IO path so throttle error
    messages with DMERR_LIMIT to limit the amount logged when
    something has gone wrong.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Reinstate a useful error message when the block manager buffer validator fails.
    This was mistakenly eliminated when the block manager was converted to use
    dm-bufio.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • If the user does not supply a bitmap region_size to the dm raid target,
    a reasonable size is computed automatically. If this is not a power of 2,
    the md code will report an error later.

    This patch catches the problem early and rounds the region_size to the
    next power of two.
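
    The rounding can use the kernel's existing log2 helpers, along these
    lines (a sketch):

        /* md requires a power-of-2 region_size; round the computed value up */
        if (!is_power_of_2(region_size))
                region_size = roundup_pow_of_two(region_size);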

    Signed-off-by: Jonathan Brassow
    Signed-off-by: Alasdair G Kergon

    Jonathan Brassow
     
  • Remove the unused @data_block parameter from cell_defer. Change
    thin_bio_map to return directly at each exit point rather than setting
    a variable.

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Joe Thornber
     
  • Rename cell_defer_except() to cell_defer_no_holder() which describes
    its function more clearly.

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Joe Thornber
     
  • track_chunk is always called with interrupts enabled. Consequently, we
    do not need to save and restore the interrupt state in a "flags"
    variable. This patch changes spin_lock_irqsave to spin_lock_irq and
    spin_unlock_irqrestore to spin_unlock_irq.
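
    The transformation, sketched against dm-snap's tracked-chunk lock
    (illustrative):

        /* before: saving interrupt state that is known to be "enabled" */
        spin_lock_irqsave(&s->tracked_chunk_lock, flags);
        hlist_del(&c->node);
        spin_unlock_irqrestore(&s->tracked_chunk_lock, flags);

        /* after: plain disable/enable, no flags needed */
        spin_lock_irq(&s->tracked_chunk_lock);
        hlist_del(&c->node);
        spin_unlock_irq(&s->tracked_chunk_lock);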

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Use a defined macro DM_ENDIO_INCOMPLETE instead of a numeric constant.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • mempool_alloc can't fail if __GFP_WAIT is specified, so the condition
    that tests if read_record is non-NULL is always true.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • If "ignore_discard" is specified when creating the thin pool device then
    discard support is disabled for that device. The pool device's status
    should reflect this fact rather than stating "no_discard_passdown"
    (which implies discards are enabled but passdown is disabled).

    Reported-by: Zdenek Kabelac
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • When deleting nested btrees, the code forgets to delete the innermost
    btree. The thin-metadata code serendipitously compensates for this by
    claiming there is one extra layer in the tree.

    This patch corrects both problems.

    Signed-off-by: Joe Thornber
    Signed-off-by: Alasdair G Kergon

    Joe Thornber
     
  • When discards are prepared it is best to directly wake the worker that
    will process them. The worker will be woken anyway, via periodic
    commit, but there is no reason not to call wake_worker here.

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Joe Thornber
     
  • There is a race when discard bios and non-discard bios are issued
    simultaneously to the same block.

    Discard support is expensive for all thin devices precisely because you
    have to be careful to quiesce the area you're discarding. DM thin must
    handle this conflicting IO pattern (simultaneous non-discard vs discard)
    even though a sane application shouldn't be issuing such IO.

    The race manifests as follows:

    1. A non-discard bio is mapped in thin_bio_map.
    This doesn't lock out parallel activity to the same block.

    2. A discard bio is issued to the same block as the non-discard bio.

    3. The discard bio is locked in a dm_bio_prison_cell in process_discard
    to lock out parallel activity against the same block.

    4. The non-discard bio's mapping continues and its all_io_entry is
    incremented so the bio is accounted for in the thin pool's all_io_ds
    which is a dm_deferred_set used to track time locality of non-discard IO.

    5. The non-discard bio is finally locked in a dm_bio_prison_cell in
    process_bio.

    The race can result in deadlock, leaving the block layer hanging waiting
    for completion of a discard bio that never completes, e.g.:

    INFO: task ruby:15354 blocked for more than 120 seconds.
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    ruby D ffffffff8160f0e0 0 15354 15314 0x00000000
    ffff8802fb08bc58 0000000000000082 ffff8802fb08bfd8 0000000000012900
    ffff8802fb08a010 0000000000012900 0000000000012900 0000000000012900
    ffff8802fb08bfd8 0000000000012900 ffff8803324b9480 ffff88032c6f14c0
    Call Trace:
    [] schedule+0x29/0x70
    [] schedule_timeout+0x195/0x220
    [] ? _dm_request+0x111/0x160 [dm_mod]
    [] wait_for_common+0x11e/0x190
    [] ? try_to_wake_up+0x2b0/0x2b0
    [] wait_for_completion+0x1d/0x20
    [] blkdev_issue_discard+0x219/0x260
    [] blkdev_ioctl+0x6e9/0x7b0
    [] block_ioctl+0x3c/0x40
    [] do_vfs_ioctl+0x8c/0x340
    [] ? block_llseek+0x67/0xb0
    [] sys_ioctl+0xa1/0xb0
    [] ? sys_rt_sigprocmask+0x86/0xd0
    [] system_call_fastpath+0x16/0x1b

    The thinp-test-suite's test_discard_random_sectors reliably hits this
    deadlock on fast SSD storage.

    The fix for this race is that the all_io_entry for a bio must be
    incremented whilst the dm_bio_prison_cell is held for the bio's
    associated virtual and physical blocks. That cell locking wasn't
    occurring early enough in thin_bio_map; this patch corrects that.

    Care is taken to always call the new function inc_all_io_entry() with
    the relevant cells locked, but they are generally unlocked before
    calling issue() to try to avoid holding the cells locked across
    generic_make_request.

    Also, now that thin_bio_map may lock bios in a cell, process_bio() is no
    longer the only thread that will do so. Because of this we must be sure
    to use cell_defer_except() to release all non-holder entries added by
    the other thread, since they must be deferred.
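
    A sketch of the new helper, modelled on the 3.8-era dm-thin.c (the
    per-bio hook retrieval predates per_bio_data; treat details as
    illustrative):

        /* callers must hold the bio's cell(s); discards are accounted separately */
        static void inc_all_io_entry(struct pool *pool, struct bio *bio)
        {
                struct dm_thin_endio_hook *h;

                if (bio->bi_rw & REQ_DISCARD)
                        return;

                h = dm_get_mapinfo(bio)->ptr;
                h->all_io_entry = dm_deferred_entry_inc(pool->all_io_ds);
        }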

    This patch depends on "dm thin: replace dm_cell_release_singleton with
    cell_defer_except".

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon
    Cc: stable@vger.kernel.org

    Joe Thornber
     
  • Change existing users of the function dm_cell_release_singleton to use
    cell_defer_except instead, and then remove the now-unused function.

    Everywhere that calls dm_cell_release_singleton, the bio in question
    is the holder of the cell.

    If there are no non-holder entries in the cell then cell_defer_except
    behaves exactly like dm_cell_release_singleton. Conversely, if there
    *are* non-holder entries then dm_cell_release_singleton must not be used
    because those entries would need to be deferred.

    Consequently, it is safe to replace use of dm_cell_release_singleton
    with cell_defer_except.

    This patch is a pre-requisite for "dm thin: fix race between
    simultaneous io and discards to same block".

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org
    Signed-off-by: Alasdair G Kergon

    Joe Thornber
     
  • WRITE SAME bios are not yet handled correctly by device-mapper, so
    disable their use on device-mapper devices by setting
    max_write_same_sectors to zero.

    As an example, a ciphertext device is incompatible because the data
    gets changed according to the location at which it is written, so the
    dm crypt target cannot support it.

    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org
    Cc: Milan Broz
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Abort dm ioctl processing if userspace changes the data_size parameter
    after we validated it but before we finished copying the data buffer
    from userspace.

    The dm ioctl parameters are processed in the following sequence:
    1. ctl_ioctl() calls copy_params();
    2. copy_params() makes a first copy of the fixed-sized portion of the
    userspace parameters into the local variable "tmp";
    3. copy_params() then validates tmp.data_size and allocates a new
    structure big enough to hold the complete data and copies the whole
    userspace buffer there;
    4. ctl_ioctl() reads userspace data the second time and copies the whole
    buffer into the pointer "param";
    5. ctl_ioctl() reads param->data_size without any validation and stores it
    in the variable "input_param_size";
    6. "input_param_size" is further used as the authoritative size of the
    kernel buffer.

    The problem is that userspace code could change the contents of user
    memory between steps 2 and 4. In particular, the data_size parameter
    can be changed to an invalid value after the kernel has validated it.
    This lets userspace force the kernel to access invalid kernel memory.

    The fix is to ensure that the size has not changed at step 4.
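
    Sketched, the check amounts to the following (names illustrative):

        /* after the full second copy_from_user() of the buffer */
        if (dmi->data_size != param_kernel.data_size) {
                DMERR("rejecting ioctl: data size modified while processing parameters");
                goto bad;       /* abort rather than trust the changed value */
        }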

    This patch shouldn't have a security impact because CAP_SYS_ADMIN is
    required to run this code, but it should be fixed anyway.

    Reported-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon
    Cc: stable@kernel.org

    Alasdair G Kergon