Eric Lee / smarc-fsl-linux-kernel

02 Oct, 2013

1 commit

85f58908c cfq: explicitly use 64bit divide operation for 64bit arguments

commit f3cff25f05f2ac29b2ee355e611b0657482f6f1d upstream.

'samples' is a 64-bit operand, but do_div()'s second parameter is 32-bit.
do_div() silently truncates the high 32 bits, so the calculated result
is invalid.

If the low 32 bits of 'samples' are zero, the truncated divisor is zero
and do_div() crashes the kernel.
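
The truncation can be reproduced in a userspace sketch. The function names below are hypothetical models rather than kernel APIs, and naming div64_u64() as the full-width helper is an assumption drawn from the commit title, not something this log confirms.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace model of the buggy call: do_div() takes a 32-bit divisor, so a
 * 64-bit 'samples' value is narrowed before the division happens.
 */
static uint64_t div_with_truncated_divisor(uint64_t total, uint64_t samples)
{
    uint32_t narrowed = (uint32_t)samples; /* high 32 bits are dropped here */

    assert(narrowed != 0); /* the kernel would take a divide error instead */
    return total / narrowed;
}

/* Model of a full 64-by-64 divide, the kind a helper like div64_u64() does. */
static uint64_t div_full_width(uint64_t total, uint64_t samples)
{
    assert(samples != 0);
    return total / samples;
}

/* True when narrowing to 32 bits turns a non-zero divisor into zero. */
static int truncates_to_zero(uint64_t samples)
{
    return samples != 0 && (uint32_t)samples == 0;
}
```

With samples = 2^32 + 10, the truncating model divides by 10, while the full-width model divides by the real value.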

Signed-off-by: Anatol Pomozov
Acked-by: Tejun Heo
Signed-off-by: Jens Axboe
Cc: Jonghwan Choi
Signed-off-by: Greg Kroah-Hartman

Anatol Pomozov
2013-10-02 00:17:48 +0800

20 Aug, 2013

1 commit

a6ad83fce elevator: Fix a race in elevator switching

commit d50235b7bc3ee0a0427984d763ea7534149531b4 upstream.

There's a race between elevator switching and normal I/O operation,
because struct elevator_queue and struct elevator_data are not allocated
in one atomic operation. So there is a chance of using a NULL
->elevator_data.
For example:
    Thread A:                        Thread B:
    blk_queue_bio                    elevator_switch
    spin_lock_irq(q->queue_lock)     elevator_alloc
    elv_merge                        elevator_init_fn

elevator_alloc cannot be called with queue_lock held, and at that point
->elevator_data is still NULL. If thread A calls elv_merge at the same
time and needs some info from elevator_data, the crash happens.

Move elevator_alloc into elevator_init_fn so that both steps happen in
one atomic operation.
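
The ordering behind the fix can be sketched in userspace as "build completely, then publish". The struct and function names are hypothetical stand-ins, and the sketch is single-threaded; in the kernel the publishing store would be done under q->queue_lock.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel structures named above. */
struct elevator_queue_model {
    void *elevator_data;
};

struct request_queue_model {
    struct elevator_queue_model *elevator; /* read by the merge path */
};

/*
 * Allocate the elevator and its private data together, then publish the
 * finished object with a single pointer store.
 */
static int elevator_init_model(struct request_queue_model *q, size_t data_size)
{
    struct elevator_queue_model *eq = calloc(1, sizeof(*eq));

    if (!eq)
        return -1;
    eq->elevator_data = calloc(1, data_size);
    if (!eq->elevator_data) {
        free(eq);
        return -1;
    }
    q->elevator = eq; /* readers never observe a NULL elevator_data */
    return 0;
}

/* Tear down the model elevator and clear the published pointer. */
static void elevator_exit_model(struct request_queue_model *q)
{
    if (q->elevator) {
        free(q->elevator->elevator_data);
        free(q->elevator);
        q->elevator = NULL;
    }
}
```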

The following steps easily reproduce this bug:
1: dd if=/dev/sdb of=/dev/null
2: while true; do echo noop > scheduler; echo deadline > scheduler; done

The fix was tested with the same steps.

Signed-off-by: Jianpeng Ma
Signed-off-by: Jens Axboe
Cc: Jonghwan Choi
Signed-off-by: Greg Kroah-Hartman

Jianpeng Ma
2013-08-20 23:43:03 +0800

14 Jul, 2013

1 commit

88ce7cf76 block: do not pass disk names as format strings

commit ffc8b30866879ed9ba62bd0a86fecdbd51cd3d19 upstream.

Disk names may contain arbitrary strings, so they must not be
interpreted as format strings. It seems that only md allows arbitrary
strings to be used for disk names, but this could allow for a local
memory corruption from uid 0 into ring 0.
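
A minimal userspace sketch of the pattern, assuming a hypothetical helper name: the untrusted name is passed as an argument to a constant "%s" format instead of being used as the format itself.

```c
#include <stdio.h>

/*
 * Copy an untrusted disk name into dst. The name is an argument to a
 * constant "%s" format, so any '%' sequences in it stay literal text.
 */
static int set_disk_name_model(char *dst, size_t len, const char *disk_name)
{
    /* Unsafe variant (do not use): snprintf(dst, len, disk_name); */
    return snprintf(dst, len, "%s", disk_name);
}
```

A name such as "md%d%s%n" is then stored byte for byte rather than being interpreted.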

CVE-2013-2851

Signed-off-by: Kees Cook
Cc: Jens Axboe
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Kees Cook
2013-07-14 02:42:26 +0800

17 May, 2013

1 commit

c60855cdb blkpm: avoid sleep when holding queue lock

In blk_post_runtime_resume, an autosuspend request will be initiated for
the device. Since we are holding the queue lock, we can't sleep and thus
we should use the async version to initiate an autosuspend, i.e.
pm_request_suspend instead of pm_runtime_suspend, which might sleep.

Signed-off-by: Aaron Lu
Signed-off-by: Jens Axboe

Aaron Lu
2013-05-17 16:00:43 +0800

09 May, 2013

1 commit

4de13d7aa Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block

Pull block core updates from Jens Axboe:

- Major bit is Kent's prep work for immutable bio vecs.

- Stable candidate fix for a scheduling-while-atomic in the queue
bypass operation.

- Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
discard bios.

- Tejun's changes to convert the writeback thread pool to the generic
workqueue mechanism.

- Runtime PM framework; SCSI patches exist on top of these in James'
tree.

- A few random fixes.

* 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
relay: move remove_buf_file inside relay_close_buf
partitions/efi.c: replace useless kzalloc's by kmalloc's
fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
block: fix max discard sectors limit
blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
Documentation: cfq-iosched: update documentation help for cfq tunables
writeback: expose the bdi_wq workqueue
writeback: replace custom worker pool implementation with unbound workqueue
writeback: remove unused bdi_pending_list
aoe: Fix unitialized var usage
bio-integrity: Add explicit field for owner of bip_buf
block: Add an explicit bio flag for bios that own their bvec
block: Add bio_alloc_pages()
block: Convert some code to bio_for_each_segment_all()
block: Add bio_for_each_segment_all()
bounce: Refactor __blk_queue_bounce to not use bi_io_vec
raid1: use bio_copy_data()
pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
pktcdvd: use bio_copy_data()
block: Add bio_copy_data()
...

Linus Torvalds
2013-05-09 01:13:35 +0800

08 May, 2013

1 commit

a27bb332c aio: don't include aio.h in sched.h

Faster kernel compiles by way of fewer unnecessary includes.

[akpm@linux-foundation.org: fix fallout]
[akpm@linux-foundation.org: fix build]
Signed-off-by: Kent Overstreet
Cc: Zach Brown
Cc: Felipe Balbi
Cc: Greg Kroah-Hartman
Cc: Mark Fasheh
Cc: Joel Becker
Cc: Rusty Russell
Cc: Jens Axboe
Cc: Asai Thambi S P
Cc: Selvan Mani
Cc: Sam Bradshaw
Cc: Jeff Moyer
Cc: Al Viro
Cc: Benjamin LaHaise
Reviewed-by: "Theodore Ts'o"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kent Overstreet
2013-05-08 11:16:25 +0800

03 May, 2013

1 commit

736a2dd25 Merge tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux

Pull virtio & lguest updates from Rusty Russell:
"Lots of virtio work which wasn't quite ready for last merge window.

Plus I dived into lguest again, reworking the pagetable code so we can
move the switcher page: our fixmaps sometimes take more than 2MB now..."

Ugh. Annoying conflicts with the tcm_vhost -> vhost_scsi rename.
Hopefully correctly resolved.

* tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (57 commits)
caif_virtio: Remove bouncing email addresses
lguest: improve code readability in lg_cpu_start.
virtio-net: fill only rx queues which are being used
lguest: map Switcher below fixmap.
lguest: cache last cpu we ran on.
lguest: map Switcher text whenever we allocate a new pagetable.
lguest: don't share Switcher PTE pages between guests.
lguest: expost switcher_pages array (as lg_switcher_pages).
lguest: extract shadow PTE walking / allocating.
lguest: make check_gpte et. al return bool.
lguest: assume Switcher text is a single page.
lguest: rename switcher_page to switcher_pages.
lguest: remove RESERVE_MEM constant.
lguest: check vaddr not pgd for Switcher protection.
lguest: prepare to make SWITCHER_ADDR a variable.
virtio: console: replace EMFILE with EBUSY for already-open port
virtio-scsi: reset virtqueue affinity when doing cpu hotplug
virtio-scsi: introduce multiqueue support
virtio-scsi: push vq lock/unlock into virtscsi_vq_done
virtio-scsi: pass struct virtio_scsi to virtqueue completion function
...

Linus Torvalds
2013-05-03 05:14:04 +0800

30 Apr, 2013

3 commits

ea56505be partitions/efi.c: replace useless kzalloc's by kmalloc's

In alloc_read_gpt_entries and alloc_read_gpt_header, the kzalloc'ated
zones are either totally overwritten by the following read_lba call,
or freed. As kmalloc is cheaper than kzalloc, use kmalloc.

Signed-off-by: Philippe De Muyter
Cc: Matt Domsch
Cc: Panagiotis Issaris
Cc: Andrew Morton
Signed-off-by: Jens Axboe

Philippe De Muyter
2013-04-30 14:34:25 +0800

191a71209 Merge branch 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:

- Fixes and a lot of cleanups. Locking cleanup is finally complete.
cgroup_mutex is no longer exposed to individual controllers which
used to cause nasty deadlock issues. Li fixed and cleaned up quite a
bit including long standing ones like racy cgroup_path().

- device cgroup now supports proper hierarchy thanks to Aristeu.

- perf_event cgroup now supports proper hierarchy.

- A new mount option "__DEVEL__sane_behavior" is added. As indicated
by the name, this option is to be used for development only at this
point and generates a warning message when used. Unfortunately,
cgroup interface currently has too many breakages and inconsistencies
to implement a consistent and unified hierarchy on top. The new flag
is used to collect the behavior changes which are necessary to
implement consistent unified hierarchy. It's likely that this flag
won't be used verbatim when it becomes ready but will be enabled
implicitly along with unified hierarchy.

The option currently disables some of the broken behaviors in cgroup core
and also .use_hierarchy switch in memcg (will be routed through -mm),
which can be used to make very unusual hierarchy where nesting is
partially honored. It will also be used to implement hierarchy
support for blk-throttle which would be impossible otherwise without
introducing a full separate set of control knobs.

This is essentially versioning of interface which isn't very nice but
at this point I can't see any other options which would allow keeping
the interface the same while moving towards hierarchy behavior which
is at least somewhat sane. The planned unified hierarchy is likely
to require some level of adaptation from userland anyway, so I think
it'd be best to take the chance and update the interface such that
it's supportable in the long term.

Maintaining the existing interface does complicate cgroup core but
shouldn't put too much strain on individual controllers and I think
it'd be manageable for the foreseeable future. Maybe we'll be able
to drop it in a decade.

Fix up conflicts (including a semantic one adding a new #include to ppc
that was uncovered by the header file changes) as per Tejun.

* 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (45 commits)
cpuset: fix compile warning when CONFIG_SMP=n
cpuset: fix cpu hotplug vs rebuild_sched_domains() race
cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn()
cgroup: restore the call to eventfd->poll()
cgroup: fix use-after-free when umounting cgroupfs
cgroup: fix broken file xattrs
devcg: remove parent_cgroup.
memcg: force use_hierarchy if sane_behavior
cgroup: remove cgrp->top_cgroup
cgroup: introduce sane_behavior mount option
move cgroupfs_root to include/linux/cgroup.h
cgroup: convert cgroupfs_root flag bits to masks and add CGRP_ prefix
cgroup: make cgroup_path() not print double slashes
Revert "cgroup: remove bind() method from cgroup_subsys."
perf: make perf_event cgroup hierarchical
cgroup: implement cgroup_is_descendant()
cgroup: make sure parent won't be destroyed before its children
cgroup: remove bind() method from cgroup_subsys.
devcg: remove broken_hierarchy tag
cgroup: remove cgroup_lock_is_held()
...

Linus Torvalds
2013-04-30 10:14:20 +0800

2794b5d40 Merge tag 'driver-core-3.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core update from Greg Kroah-Hartman:
"Here's the merge request for the driver core tree for 3.10-rc1

It's pretty small, just a number of driver core and sysfs updates and
fixes, all of which have been in linux-next for a while now.

Signed-off-by: Greg Kroah-Hartman "

Fixed conflict in kernel/rtmutex-tester.c, the locking tree had a better
fix for the same sysfs file mode problem.

* tag 'driver-core-3.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
PM / Runtime: Idle devices asynchronously after probe|release
driver core: handle user namespaces properly with the uid/gid devtmpfs change
driver core: devtmpfs: fix compile failure with CONFIG_UIDGID_STRICT_TYPE_CHECKS
devtmpfs: add base.h include
driver core: add uid and gid to devtmpfs
sysfs: check if one entry has been removed before freeing
sysfs: fix crash_notes_size build warning
sysfs: fix use after free in case of concurrent read/write and readdir
rtmutex-tester: fix mode of sysfs files
Documentation: Add ABI entry for crash_notes and crash_notes_size
sysfs: Add crash_notes_size to export percpu note size
driver core: platform_device.h: fix checkpatch errors and warnings
driver core: platform.c: fix checkpatch errors and warnings
driver core: warn that platform_driver_probe can not use deferred probing
sysfs: use atomic_inc_unless_negative in sysfs_get_active
base: core: WARN() about bogus permissions on device attributes
device: separate all subsys mutexes

Linus Torvalds
2013-04-30 02:31:50 +0800

19 Apr, 2013

1 commit

0a82a8d13 Revert "block: add missing block_bio_complete() tracepoint"

This reverts commit 3a366e614d0837d9fc23f78cdb1a1186ebc3387f.

Wanlong Gao reports that it causes a kernel panic on his machine several
minutes after boot. Reverting it removes the panic.

Jens says:
"It's not quite clear why that is yet, so I think we should just revert
the commit for 3.9 final (which I'm assuming is pretty close).

The wifi is crap at the LSF hotel, so sending this email instead of
queueing up a revert and pull request."

Reported-by: Wanlong Gao
Requested-by: Jens Axboe
Cc: Tejun Heo
Cc: Steven Rostedt
Signed-off-by: Linus Torvalds

Linus Torvalds
2013-04-19 00:00:26 +0800

15 Apr, 2013

1 commit

0d1d392f0 Merge 3.9-rc7 into driver-core-next

Signed-off-by: Greg Kroah-Hartman

Greg Kroah-Hartman
2013-04-15 09:37:05 +0800

12 Apr, 2013

1 commit

4e4098a3e driver core: handle user namespaces properly with the uid/gid devtmpfs change

Now that devtmpfs is caring about uid/gid, we need to use the correct
internal types so users who have USER_NS enabled will have things work
properly for them.

Thanks to Eric for pointing this out, and the patch review.

Reported-by: Eric W. Biederman
Cc: Kay Sievers
Cc: Ming Lei
Signed-off-by: Greg Kroah-Hartman

Greg Kroah-Hartman
2013-04-12 02:43:29 +0800

09 Apr, 2013

1 commit

e5072664f blkcg: fix "scheduling while atomic" in blk_queue_bypass_start

Since 749fefe677 in v3.7 ("block: lift the initial queue bypass mode
on blk_register_queue() instead of blk_init_allocated_queue()"),
the following warning appears when multipath is used with CONFIG_PREEMPT=y.

This patch moves blk_queue_bypass_start() before radix_tree_preload()
to avoid the sleeping call while preemption is disabled.

BUG: scheduling while atomic: multipath/2460/0x00000002
1 lock held by multipath/2460:
#0: (&md->type_lock){......}, at: [] dm_lock_md_type+0x17/0x19 [dm_mod]
Modules linked in: ...
Pid: 2460, comm: multipath Tainted: G W 3.7.0-rc2 #1
Call Trace:
[] __schedule_bug+0x6a/0x78
[] __schedule+0xb4/0x5e0
[] schedule+0x64/0x66
[] schedule_timeout+0x39/0xf8
[] ? put_lock_stats+0xe/0x29
[] ? lock_release_holdtime+0xb6/0xbb
[] wait_for_common+0x9d/0xee
[] ? try_to_wake_up+0x206/0x206
[] ? kfree_call_rcu+0x1c/0x1c
[] wait_for_completion+0x1d/0x1f
[] wait_rcu_gp+0x5d/0x7a
[] ? wait_rcu_gp+0x7a/0x7a
[] ? complete+0x21/0x53
[] synchronize_rcu+0x1e/0x20
[] blk_queue_bypass_start+0x5d/0x62
[] blkcg_activate_policy+0x73/0x270
[] ? kmem_cache_alloc_node_trace+0xc7/0x108
[] cfq_init_queue+0x80/0x28e
[] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
[] elevator_init+0xe1/0x115
[] ? blk_queue_make_request+0x54/0x59
[] blk_init_allocated_queue+0x8c/0x9e
[] dm_setup_md_queue+0x36/0xaa [dm_mod]
[] table_load+0x1bd/0x2c8 [dm_mod]
[] ctl_ioctl+0x1d6/0x236 [dm_mod]
[] ? table_clear+0xaa/0xaa [dm_mod]
[] dm_ctl_ioctl+0x13/0x17 [dm_mod]
[] do_vfs_ioctl+0x3fb/0x441
[] ? file_has_perm+0x8a/0x99
[] sys_ioctl+0x5e/0x82
[] ? trace_hardirqs_on_thunk+0x3a/0x3f
[] system_call_fastpath+0x16/0x1b

Signed-off-by: Jun'ichi Nomura
Acked-by: Vivek Goyal
Acked-by: Tejun Heo
Cc: Alasdair G Kergon
Cc: stable@kernel.org
Signed-off-by: Jens Axboe

Jun'ichi Nomura
2013-04-09 21:01:21 +0800

08 Apr, 2013

2 commits

3c2670e65 driver core: add uid and gid to devtmpfs

Some drivers want to tell userspace what uid and gid should be used for
their device nodes, so allow that information to percolate through the
driver core to userspace in order to make this happen. This means that
some systems (i.e. Android and friends) will not need to even run a
udev-like daemon for their device node manager and can just rely on
devtmpfs fully, reducing their footprint even more.

Signed-off-by: Kay Sievers
Signed-off-by: Greg Kroah-Hartman

Kay Sievers
2013-04-08 23:21:48 +0800

c2fccc1c9 Revert "loop: cleanup partitions when detaching loop device"

This reverts commit 8761a3dc1f07b163414e2215a2cadbb4cfe2a107.

There are situations where the destruction path is called
with the bdev->bd_mutex already held, which then deadlocks in
loop_clr_fd(). The normal partition cleanup does a trylock()
on the mutex, but it'd be nice to have a more bullet proof
method in loop. So punt this more involved fix to the next
merge window, and just back out this buggy fix for now.

Signed-off-by: Jens Axboe

Jens Axboe
2013-04-08 16:12:11 +0800

04 Apr, 2013

1 commit

c678ef528 block: avoid using uninitialized value in from queue_var_store

As found by gcc-4.8, the QUEUE_SYSFS_BIT_FNS macro creates functions
that use a value generated by queue_var_store independent of whether
that value was set or not.

block/blk-sysfs.c: In function 'queue_store_nonrot':
block/blk-sysfs.c:244:385: warning: 'val' may be used uninitialized in this function [-Wmaybe-uninitialized]

Unlike most other such warnings, this one is not a false positive,
writing any non-number string into the sysfs files indeed has
an undefined result, rather than returning an error.
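
The pattern the warning points at can be sketched in userspace: the parser reports failure, and the caller checks the result before touching the parsed value. The names and the use of strtoul() are assumptions for illustration; the kernel functions differ.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Model of a "store" parser: *var is written only on success, and a
 * non-numeric string yields -EINVAL instead of leaving *var undefined.
 */
static long parse_var_model(unsigned long *var, const char *page, size_t count)
{
    char *end;
    unsigned long v;

    errno = 0;
    v = strtoul(page, &end, 10);
    if (end == page || errno == ERANGE)
        return -EINVAL;
    *var = v;
    return (long)count;
}

/* Model of a generated bit-flag store function that checks the result. */
static long store_flag_model(unsigned int *flags, const char *page, size_t count)
{
    unsigned long val;
    long ret = parse_var_model(&val, page, count);

    if (ret < 0)
        return ret; /* val was never set, so it must not be read */
    if (val)
        *flags |= 1u;
    else
        *flags &= ~1u;
    return ret;
}
```

Writing "abc" then returns an error and leaves the flag unchanged.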

Signed-off-by: Arnd Bergmann
Signed-off-by: Jens Axboe

Arnd Bergmann
2013-04-04 03:53:57 +0800

02 Apr, 2013

1 commit

64f8de4da Merge branch 'writeback-workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq into for-3.10/core

Tejun writes:

-----

This is the pull request for the earlier patchset[1] with the same
name. It's only three patches (the first one was committed to
workqueue tree) but the merge strategy is a bit involved due to the
dependencies.

* Because the conversion needs features from wq/for-3.10,
block/for-3.10/core is based on rc3, and wq/for-3.10 has conflicts
with rc3, I pulled mainline (rc5) into wq/for-3.10 to prevent those
workqueue conflicts from flaring up in block tree.

* Resolving the issue that Jan and Dave raised about debugging
requires arch-wide changes. The patchset is being worked on[2] but
it'll have to go through -mm after these changes show up in -next,
and not included in this pull request.

The three commits are located in the following git branch.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git writeback-workqueue

Pulling it into block/for-3.10/core produces a conflict in
drivers/md/raid5.c between the following two commits.

e3620a3ad5 ("MD RAID5: Avoid accessing gendisk or queue structs when not available")
2f6db2a707 ("raid5: use bio_reset()")

The conflict is trivial - one removes an "if ()" conditional while the
other removes "rbi->bi_next = NULL" right above it. We just need to
remove both. The merged branch is available at

git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git block-test-merge

so that you can use it for verification. The test merge commit has
proper merge description.

While these changes are a bit of a pain to route, they make code simpler
and even have, while minute, measurable performance gain[3] even on a
workload which isn't particularly favorable to showing the benefits of
this conversion.

----

Fixed up the conflict.

Conflicts:
drivers/md/raid5.c

Signed-off-by: Jens Axboe <axboe@kernel.dk>

Jens Axboe
2013-04-02 16:04:39 +0800

25 Mar, 2013

1 commit

705cd0ea1 Merge branch 'for-jens' of http://evilpiepirate.org/git/linux-bcache into for-3.10/core

This contains Kent's prep work for the immutable bio_vecs.

Jens Axboe
2013-03-25 11:38:59 +0800

24 Mar, 2013

2 commits

f73a1c7d1 block: Add bio_end_sector()

Just a little convenience macro - main reason to add it now is preparing
for immutable bio vecs, it'll reduce the size of the patch that puts
bi_sector/bi_size/bi_idx into a struct bvec_iter.
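
As a rough illustration of what such a convenience macro computes, here is a userspace model. The `_model` names and the reduced struct are hypothetical; the only assumptions carried over are that bi_sector counts 512-byte sectors and bi_size counts bytes.

```c
#include <stdint.h>

typedef uint64_t sector_t;

/* Minimal stand-in for the two struct bio fields the macro reads. */
struct bio_model {
    sector_t bi_sector;   /* first sector, in 512-byte units */
    unsigned int bi_size; /* remaining I/O size, in bytes */
};

/* Number of 512-byte sectors covered by the bio. */
#define bio_sectors_model(bio)    ((bio)->bi_size >> 9)

/* First sector past the end of the bio. */
#define bio_end_sector_model(bio) ((bio)->bi_sector + bio_sectors_model(bio))
```

A bio starting at sector 8 with 4096 bytes remaining covers 8 sectors and ends at sector 16.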

Signed-off-by: Kent Overstreet
CC: Jens Axboe
CC: Lars Ellenberg
CC: Jiri Kosina
CC: Alasdair Kergon
CC: dm-devel@redhat.com
CC: Neil Brown
CC: Martin Schwidefsky
CC: Heiko Carstens
CC: linux-s390@vger.kernel.org
CC: Chris Mason
CC: Steven Whitehouse
Acked-by: Steven Whitehouse

Kent Overstreet
2013-03-24 05:15:29 +0800

f79ea4161 block: Refactor blk_update_request()

Converts it to use bio_advance(), simplifying it quite a bit in the
process.

Note that req_bio_endio() now always calls bio_advance() - which means
it always loops over the biovec, not just on partial completions. Don't
expect it to affect performance, but worth noting.

Tested it by forcing partial updates, and dumping before and after on
various bio/bvec fields when doing a partial update.

Signed-off-by: Kent Overstreet
CC: Jens Axboe

Kent Overstreet
2013-03-24 05:15:28 +0800

23 Mar, 2013

4 commits

c8158819d block: implement runtime pm strategy

When a request is added:
If device is suspended or is suspending and the request is not a
PM request, resume the device.

When the last request finishes:
Call pm_runtime_mark_last_busy().

When picking a request:
If device is resuming/suspending, then only PM request is allowed
to go.
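
The rules above can be written as two small predicates. This is a userspace sketch with hypothetical names, and the choice to dispatch nothing while fully suspended is an assumption that the description does not spell out.

```c
/* Model of the queue's runtime PM state. */
enum rpm_state {
    ST_ACTIVE,
    ST_RESUMING,
    ST_SUSPENDED,
    ST_SUSPENDING
};

/* "When a request is added": does adding it have to trigger a resume? */
static int needs_resume_on_add(enum rpm_state s, int is_pm_request)
{
    return !is_pm_request && (s == ST_SUSPENDED || s == ST_SUSPENDING);
}

/* "When picking a request": may this request go to the device now? */
static int may_dispatch(enum rpm_state s, int is_pm_request)
{
    switch (s) {
    case ST_ACTIVE:
        return 1;
    case ST_RESUMING:
    case ST_SUSPENDING:
        return is_pm_request != 0; /* only PM requests are allowed */
    default:
        return 0; /* assumption: hold everything while fully suspended */
    }
}
```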

The idea and API were designed by Alan Stern and are described here:
http://marc.info/?l=linux-scsi&m=133727953625963&w=2

Signed-off-by: Lin Ming
Signed-off-by: Aaron Lu
Acked-by: Alan Stern
Signed-off-by: Jens Axboe

Lin Ming
2013-03-23 12:22:15 +0800

6c9546675 block: add runtime pm helpers

Add runtime pm helper functions:

void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
- Initialization function for drivers to call.

int blk_pre_runtime_suspend(struct request_queue *q)
- If any requests are in the queue, mark last busy and return -EBUSY.
Otherwise set q->rpm_status to RPM_SUSPENDING and return 0.

void blk_post_runtime_suspend(struct request_queue *q, int err)
- If the suspend succeeded then set q->rpm_status to RPM_SUSPENDED.
Otherwise set it to RPM_ACTIVE and mark last busy.

void blk_pre_runtime_resume(struct request_queue *q)
- Set q->rpm_status to RPM_RESUMING.

void blk_post_runtime_resume(struct request_queue *q, int err)
- If the resume succeeded then set q->rpm_status to RPM_ACTIVE
and call __blk_run_queue, then mark last busy and autosuspend.
Otherwise set q->rpm_status to RPM_SUSPENDED.
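
The state transitions described for the helpers can be modelled in userspace as follows. The struct, enum, and function names are hypothetical; marking last busy, running the queue, and autosuspend are left out because they have no userspace equivalent here.

```c
#include <errno.h>

enum rpm_status_model {
    MODEL_RPM_ACTIVE,
    MODEL_RPM_RESUMING,
    MODEL_RPM_SUSPENDED,
    MODEL_RPM_SUSPENDING
};

struct queue_model {
    enum rpm_status_model rpm_status;
    unsigned int nr_pending; /* requests currently in the queue */
};

/* Refuse to suspend while requests are queued, else start suspending. */
static int pre_runtime_suspend_model(struct queue_model *q)
{
    if (q->nr_pending)
        return -EBUSY;
    q->rpm_status = MODEL_RPM_SUSPENDING;
    return 0;
}

/* A failed suspend (err != 0) leaves the queue active. */
static void post_runtime_suspend_model(struct queue_model *q, int err)
{
    q->rpm_status = err ? MODEL_RPM_ACTIVE : MODEL_RPM_SUSPENDED;
}

static void pre_runtime_resume_model(struct queue_model *q)
{
    q->rpm_status = MODEL_RPM_RESUMING;
}

/* A failed resume (err != 0) leaves the queue suspended. */
static void post_runtime_resume_model(struct queue_model *q, int err)
{
    q->rpm_status = err ? MODEL_RPM_SUSPENDED : MODEL_RPM_ACTIVE;
}
```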

The idea and API were designed by Alan Stern and are described here:
http://marc.info/?l=linux-scsi&m=133727953625963&w=2

Signed-off-by: Lin Ming
Signed-off-by: Aaron Lu
Acked-by: Alan Stern
Signed-off-by: Jens Axboe

Lin Ming
2013-03-23 12:22:15 +0800

f2fc7d0ed Block: blk-flush: Fixed indent code style

Fixed code indent should use tabs where possible.

Signed-off-by: Alice Ferrazzi
Signed-off-by: Jens Axboe

Alice Ferrazzi
2013-03-23 02:22:51 +0800

8761a3dc1 loop: cleanup partitions when detaching loop device

Any partitions added by user space to the loop device were being
left in place after detaching the loop device. This was because
the detach path issued a BLKRRPART to clean up partitions if
LO_FLAGS_PARTSCAN was set, meaning that the partitions were auto
scanned on attach. Replace this BLKRRPART with code that
unconditionally cleans up partitions on detach instead.

Signed-off-by: Phillip Susi

Modified by Jens to export delete_partition().

Signed-off-by: Jens Axboe

Phillip Susi
2013-03-23 02:21:53 +0800

20 Mar, 2013

1 commit

c8164d893 scatterlist: introduce sg_unmark_end

This is useful in places that recycle the same scatterlist multiple
times, and do not want to incur the cost of sg_init_table every
time in hot paths.

Acked-by: Jens Axboe
Signed-off-by: Paolo Bonzini
Signed-off-by: Rusty Russell

Paolo Bonzini
2013-03-20 13:13:04 +0800

05 Mar, 2013

1 commit

65dff759d cgroup: fix cgroup_path() vs rename() race

rename() will change dentry->d_name. The result of this race can
be worse than seeing a partially rewritten name: we might access
a stale pointer, because rename() will re-allocate memory to hold
a longer name.

As accessing dentry->d_name must be protected by dentry->d_lock or the
parent inode's i_mutex, while on the other hand cgroup_path() can
be called with some irq-safe spinlocks held, we can't generate
the cgroup path using dentry->d_name.

Instead, we make a copy of dentry->d_name and save it in
cgrp->name when a cgroup is created, and update cgrp->name at
rename().

v5: use flexible array instead of zero-size array.
v4: - allocate root_cgroup_name and all root_cgroup->name points to it.
- add cgroup_name() wrapper.
v3: use kfree_rcu() instead of synchronize_rcu() in user-visible path.
v2: make cgrp->name RCU safe.

Signed-off-by: Li Zefan
Signed-off-by: Tejun Heo

Li Zefan
2013-03-05 01:50:08 +0800

01 Mar, 2013

1 commit

ee89f8125 Merge branch 'for-3.9/core' of git://git.kernel.dk/linux-block

Pull block IO core bits from Jens Axboe:
"Below are the core block IO bits for 3.9. It was delayed a few days
since my workstation kept crashing every 2-8h after pulling it into
current -git, but turns out it is a bug in the new pstate code (divide
by zero, will report separately). In any case, it contains:

- The big cfq/blkcg update from Tejun and Vivek.

- Additional block and writeback tracepoints from Tejun.

- Improvement of the should sort (based on queues) logic in the plug
flushing.

- _io() variants of the wait_for_completion() interface, using
io_schedule() instead of schedule() to contribute to io wait
properly.

- Various little fixes.

You'll get two trivial merge conflicts, which should be easy enough to
fix up"

Fix up the trivial conflicts due to hlist traversal cleanups (commit
b67bfe0d42ca: "hlist: drop the node parameter from iterators").

* 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
block: remove redundant check to bd_openers()
block: use i_size_write() in bd_set_size()
cfq: fix lock imbalance with failed allocations
drivers/block/swim3.c: fix null pointer dereference
block: don't select PERCPU_RWSEM
block: account iowait time when waiting for completion of IO request
sched: add wait_for_completion_io[_timeout]
writeback: add more tracepoints
block: add block_{touch|dirty}_buffer tracepoint
buffer: make touch_buffer() an exported function
block: add @req to bio_{front|back}_merge tracepoints
block: add missing block_bio_complete() tracepoint
block: Remove should_sort judgement when flush blk_plug
block,elevator: use new hashtable implementation
cfq-iosched: add hierarchical cfq_group statistics
cfq-iosched: collect stats from dead cfqgs
cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
block: RCU free request_queue
blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
...

Linus Torvalds
2013-03-01 04:52:24 +0800

28 Feb, 2013

8 commits

b67bfe0d4 hlist: drop the node parameter from iterators

I'm not sure why, but the hlist for each entry iterators were conceived
differently from the list ones, which look like this:

list_for_each_entry(pos, head, member)

The hlist ones were greedy and wanted an extra parameter:

hlist_for_each_entry(tpos, pos, head, member)

Why did they need an extra pos parameter? I'm not quite sure. Not only
do they not really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
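
To show that the extra cursor is not needed, here is a userspace sketch of a three-argument iterator. It is a simplified model rather than the kernel macro: the node has only a next pointer, the `_model` names are hypothetical, and it relies on the GNU `__typeof__` extension.

```c
#include <stddef.h>

/* Simplified singly linked model of the hlist types. */
struct hlist_node { struct hlist_node *next; };
struct hlist_head { struct hlist_node *first; };

/* Map a node pointer back to its containing entry, or NULL at list end. */
#define entry_or_null(ptr, type, member) \
    ((ptr) ? (type *)((char *)(ptr) - offsetof(type, member)) : NULL)

/* The entry pointer 'pos' doubles as the cursor; no separate node needed. */
#define hlist_for_each_entry_model(pos, head, member)                      \
    for (pos = entry_or_null((head)->first, __typeof__(*(pos)), member);   \
         pos;                                                              \
         pos = entry_or_null((pos)->member.next, __typeof__(*(pos)), member))

struct item {
    int value;
    struct hlist_node node;
};

/* Walk the list with the three-argument iterator and add up the values. */
static int sum_items(struct hlist_head *head)
{
    struct item *it;
    int sum = 0;

    hlist_for_each_entry_model(it, head, node)
        sum += it->value;
    return sum;
}
```

The call site now has the same shape as list_for_each_entry(pos, head, member).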

Besides the semantic patch, there was some manual work required:

- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small number of places were using the 'node' parameter; these
were modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.

The semantic patch which is mostly the work of Peter Senna Tschudin is here:

@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

type T;
expression a,c,d,e;
identifier b;
statement S;
@@

-T b;

[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foundation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin
Acked-by: Paul E. McKenney
Signed-off-by: Sasha Levin
Cc: Wu Fengguang
Cc: Marcelo Tosatti
Cc: Gleb Natapov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sasha Levin
2013-02-28 11:10:24 +0800

ac2e5327a block/partitions: optimize memory allocation in check_partition()

Currently, sizeof(struct parsed_partitions) may be 64KB on 32-bit
architectures, so check_partition can easily trigger a page allocation
failure, especially when block devices are hotplugged (USB mass storage,
MMC card, ...). Felipe Balbi has observed the failure.

This patch makes the following optimizations to the allocation of struct
parsed_partitions to try to address the issue:

- make parsed_partitions.parts a pointer so that the memory it points to
fits in a 32KB buffer, which saves approximately 32KB

- vmalloc the buffer pointed to by parsed_partitions.parts because 32KB is
still a bit big for kmalloc

- given that many devices have a partition count limit, allocate only
disk_max_parts() partitions instead of always allocating 256

Signed-off-by: Ming Lei
Reported-by: Felipe Balbi
Cc: Jens Axboe
Reviewed-by: Yasuaki Ishimatsu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ming Lei
2013-02-28 11:10:21 +0800

06004e6ee block/partitions/mac.c: obey the state->limit constraint

It isn't necessary to read the information of partitions whose number is
equal to or greater than state->limit, since at most state->limit
partitions will be added inside rescan_partitions().

That is also what the other partition types do.

Signed-off-by: Ming Lei
Cc: Jens Axboe
Cc: Yasuaki Ishimatsu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ming Lei
2013-02-28 11:10:21 +0800

8b8a6e188 block/partitions/efi.c: ensure that the GPT header is at least the size of the structure.

UEFI 2.3.1D will include a change to the spec language mandating that a
GPT header must be greater than *or equal to* the size of the defined
structure. While verifying that this would work on Linux, I discovered
that we're not actually checking the minimum bound at all.

The result of this is that when we verify the checksum, it's possible that
on a malformed header (with header_size of 0), we won't actually verify
any data.
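
A minimal sketch of such a bounds check, with assumptions stated loudly: the 92-byte structure size, the upper bound against the logical block size, and the function name are illustrative and are not taken from this commit message.

```c
#include <stdint.h>

/* Assumption: size in bytes of the defined GPT header structure. */
#define GPT_HEADER_STRUCT_SIZE 92u

/*
 * Return 1 when header_size is plausible: at least the size of the defined
 * structure (the missing minimum bound) and, as an assumed upper bound, no
 * larger than one logical block.
 */
static int gpt_header_size_ok(uint32_t header_size, uint32_t logical_block_size)
{
    if (header_size < GPT_HEADER_STRUCT_SIZE)
        return 0; /* e.g. header_size == 0 would checksum no data at all */
    if (header_size > logical_block_size)
        return 0;
    return 1;
}
```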

[akpm@linux-foundation.org: fix printk warning]
Signed-off-by: Peter Jones
Acked-by: Matt Fleming
Cc: Jens Axboe
Cc: Stephen Warren
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Peter Jones
2013-02-28 11:10:21 +0800

86ee8ba64 block/partition/msdos: detect AIX formatted disks even without 55aa

AIX formatted disks do not always have the MSDOS 55aa signature.
This happens e.g. for unbootable AIX disks.

Up to now, such disks were not recognized as AIX disks, because of the
missing 55aa. Fix that by inverting the two tests. Let's first
check for the AIX magic strings, and only if that fails check for
the MSDOS magic word.

Signed-off-by: Philippe De Muyter
Cc: Andreas Mohr
Cc: OGAWA Hirofumi
Cc: Jens Axboe
Cc: Olaf Hering
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Philippe De Muyter
2013-02-28 11:10:21 +0800
bab998d62 block: convert to idr_alloc()

Convert to the much saner new idr interface. Both bsg and genhd
protect idr w/ mutex making preloading unnecessary.

Signed-off-by: Tejun Heo
Acked-by: Jens Axboe
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tejun Heo
2013-02-28 11:10:15 +0800
ce23bba84 block: fix synchronization and limit check in blk_alloc_devt()

idr allocation in blk_alloc_devt() wasn't synchronized against lookup
and removal, and its limit check was off by one - 1 << MINORBITS is
the number of minors allowed, not the maximum allowed minor.

Add locking and rename MAX_EXT_DEVT to NR_EXT_DEVT and fix limit
checking.
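
The off-by-one can be shown in isolation with a small userspace sketch. The `_SKETCH` names and both helper functions are hypothetical; the only value taken from the kernel is MINORBITS being 20.

```c
#include <assert.h>

#define MINORBITS_SKETCH   20
/* Number of minors available, so valid minors are 0 .. NR - 1. */
#define NR_EXT_DEVT_SKETCH (1 << MINORBITS_SKETCH)

/* Old-style check: treats the count as if it were the largest minor. */
static int minor_ok_off_by_one(int idx)
{
    return idx >= 0 && idx <= NR_EXT_DEVT_SKETCH;
}

/* Corrected check: the count itself is already out of range. */
static int minor_ok(int idx)
{
    return idx >= 0 && idx < NR_EXT_DEVT_SKETCH;
}
```

The two checks differ only for idx equal to 1 << MINORBITS, which the first accepts and the second rejects. Renaming the constant from MAX_ to NR_ makes the intended comparison easier to get right.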

Signed-off-by: Tejun Heo
Acked-by: Jens Axboe
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tejun Heo
2013-02-28 11:10:14 +0800
7b74e9127 block: fix ext_devt_idr handling

While adding and removing a lot of disks and partitions, this
sometimes shows up:

WARNING: at fs/sysfs/dir.c:512 sysfs_add_one+0xc9/0x130() (Not tainted)
Hardware name:
sysfs: cannot create duplicate filename '/dev/block/259:751'
Modules linked in: raid1 autofs4 bnx2fc cnic uio fcoe libfcoe libfc 8021q scsi_transport_fc scsi_tgt garp stp llc sunrpc cpufreq_ondemand powernow_k8 freq_table mperf ipv6 dm_mirror dm_region_hash dm_log power_meter microcode dcdbas serio_raw amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core k10temp bnx2 sg ixgbe dca mdio ext4 mbcache jbd2 dm_round_robin sr_mod cdrom sd_mod crc_t10dif ata_generic pata_acpi pata_atiixp ahci mptsas mptscsih mptbase scsi_transport_sas dm_multipath dm_mod [last unloaded: scsi_wait_scan]
Pid: 44103, comm: async/16 Not tainted 2.6.32-195.el6.x86_64 #1
Call Trace:
warn_slowpath_common+0x87/0xc0
warn_slowpath_fmt+0x46/0x50
sysfs_add_one+0xc9/0x130
sysfs_do_create_link+0x12b/0x170
sysfs_create_link+0x13/0x20
device_add+0x317/0x650
idr_get_new+0x13/0x50
add_partition+0x21c/0x390
rescan_partitions+0x32b/0x470
sd_open+0x81/0x1f0 [sd_mod]
__blkdev_get+0x1b6/0x3c0
blkdev_get+0x10/0x20
register_disk+0x155/0x170
add_disk+0xa6/0x160
sd_probe_async+0x13b/0x210 [sd_mod]
add_wait_queue+0x46/0x60
async_thread+0x102/0x250
default_wake_function+0x0/0x20
async_thread+0x0/0x250
kthread+0x96/0xa0
child_rip+0xa/0x20
kthread+0x0/0xa0
child_rip+0x0/0x20

This most likely happens because the dev_t is freed while the number is
still in use, and idr_get_new() is not protected on every use. The fix
adds a mutex where there wasn't one before and moves the dev_t free
function so that it is called after the device is deleted.

Signed-off-by: Tomas Henzl
Cc: Jens Axboe
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tomas Henzl
2013-02-28 11:10:12 +0800

24 Feb, 2013

1 commit

25e823c8c block/genhd.c: apply pm_runtime_set_memalloc_noio on block devices

Apply the newly introduced pm_runtime_set_memalloc_noio() to block
devices so that the PM core tells mm not to allocate memory with
GFP_IOFS when calling the runtime_resume and runtime_suspend callbacks
for block devices and their ancestors.

Signed-off-by: Ming Lei
Cc: Jens Axboe
Cc: Minchan Kim
Cc: Alan Stern
Cc: Oliver Neukum
Cc: Jiri Kosina
Cc: Mel Gorman
Cc: KAMEZAWA Hiroyuki
Cc: Michal Hocko
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: "Rafael J. Wysocki"
Cc: Greg KH
Cc: "David S. Miller"
Cc: Eric Dumazet
Cc: David Decotigny
Cc: Tom Herbert
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ming Lei
2013-02-24 09:50:16 +0800

22 Feb, 2013

3 commits

a3cc86c2f cfq: fix lock imbalance with failed allocations

While stress-running very-small container scenarios with the Kernel Memory
Controller, I've run into a lockdep-detected lock imbalance in
cfq-iosched.c.

I'll apologize beforehand for not posting a backlog: I didn't anticipate
it would be so hard to reproduce, so I didn't save my serial output and
went straight to debugging. It turns out that it did not happen again in
more than 20 runs, so it is quite a rare pattern.

But here is my analysis:

When we are in very low-memory situations, we will arrive at
cfq_find_alloc_queue and may not find a queue, having to resort to the oom
queue, in an rcu-locked condition:

    if (!cfqq || cfqq == &cfqd->oom_cfqq)
        [ ... ]

Next, we will release the rcu lock, and try to allocate a queue, retrying
if we succeed:

    rcu_read_unlock();
    spin_unlock_irq(cfqd->queue->queue_lock);
    new_cfqq = kmem_cache_alloc_node(cfq_pool,
                                     gfp_mask | __GFP_ZERO,
                                     cfqd->queue->node);
    spin_lock_irq(cfqd->queue->queue_lock);
    if (new_cfqq)
        goto retry;

We are unlocked at this point, but it should be fine, since we will
reacquire the rcu_read_lock when we retry.

Except of course, that we may not retry: the allocation may very well fail
and we'll keep on going through the flow:

The next branch is:

    if (cfqq) {
        [ ... ]
    } else
        cfqq = &cfqd->oom_cfqq;

And right before exiting, we'll issue rcu_read_unlock().

Being already unlocked, this is the likely source of our imbalance. Since
cfqq is either already NULL or made NULL in the first statement of the
outer branch, the only viable alternative here seems to be to return the
oom queue right away in case of allocation failure.
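
To make the control flow in this analysis easier to follow, here is a userspace model of it. It is only a sketch: the RCU read-side lock is replaced by a plain counter, allocation is simulated by passing in either a queue or NULL, and every name in it is hypothetical rather than taken from cfq-iosched.c.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the RCU read-side nesting level; must end at 0. */
static int read_lock_depth;

static void model_read_lock(void)   { read_lock_depth++; }
static void model_read_unlock(void) { read_lock_depth--; }

struct queue_sketch { int id; };

/* Stand-in for the fallback oom queue. */
static struct queue_sketch oom_queue = { -1 };

/*
 * 'fresh' models the result of the allocation: a queue on success,
 * NULL on failure. The lookup is assumed to always miss.
 */
static struct queue_sketch *find_alloc_queue(struct queue_sketch *fresh)
{
    struct queue_sketch *new_q = NULL;
    struct queue_sketch *q;

retry:
    model_read_lock();
    q = NULL;                       /* lookup misses */
    if (new_q) {
        q = new_q;                  /* second pass: use the new queue */
    } else {
        model_read_unlock();        /* drop the lock to allocate */
        new_q = fresh;
        if (new_q)
            goto retry;             /* lock is retaken at 'retry' */
        return &oom_queue;          /* failure: leave with lock dropped */
    }
    model_read_unlock();
    return q;
}
```

The early return on allocation failure is what keeps the counter balanced. Falling through to the final unlock instead would release a lock that had already been dropped.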

Please review the following patch and apply if you agree with my analysis.

Signed-off-by: Glauber Costa
Cc: Jens Axboe
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Jens Axboe

Glauber Costa
2013-02-22 17:42:46 +0800
79d0b7f0e block: don't select PERCPU_RWSEM

The block device code no longer uses the percpu rw-semaphore, so don't
select it for compilation.

Signed-off-by: Mikulas Patocka
Cc: Jens Axboe
Signed-off-by: Andrew Morton
Signed-off-by: Jens Axboe

Mikulas Patocka
2013-02-22 17:42:45 +0800
ffecfd1a7 block: optionally snapshot page contents to provide stable pages during write

This provides a band-aid to provide stable page writes on jbd without
needing to backport the fixed locking and page writeback bit handling
schemes of jbd2. The band-aid works by using bounce buffers to snapshot
page contents instead of waiting.
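
The snapshot idea can be sketched in userspace as follows. This is an illustration only: the page size constant and the function name are hypothetical, and a real bounce buffer is allocated and managed by the block layer rather than by malloc().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE_SKETCH 4096

/*
 * Copy the page into a private bounce buffer. The write is then issued
 * from the copy, so later changes to the original page cannot alter the
 * data in flight. Returns NULL if the buffer cannot be allocated; the
 * caller frees the buffer once the write completes.
 */
static unsigned char *snapshot_page(const unsigned char *page)
{
    unsigned char *bounce = malloc(PAGE_SIZE_SKETCH);

    if (bounce)
        memcpy(bounce, page, PAGE_SIZE_SKETCH);
    return bounce;
}
```

The trade-off is an extra copy per write in exchange for not having to wait on the page, which is the latency concern the message raises.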

For those wondering about the ext3 bandage -- fixing the jbd locking
(which was done as part of ext4dev years ago) is a lot of surgery, and
setting PG_writeback on data pages when we actually hold the page lock
dropped ext3 performance by nearly an order of magnitude. If we're
going to migrate iscsi and raid to use stable page writes, the
complaints about high latency will likely return. We might as well
centralize their page snapshotting thing to one place.

Signed-off-by: Darrick J. Wong
Tested-by: Andy Lutomirski
Cc: Adrian Hunter
Cc: Artem Bityutskiy
Reviewed-by: Jan Kara
Cc: Joel Becker
Cc: Mark Fasheh
Cc: Steven Whitehouse
Cc: Jens Axboe
Cc: Eric Van Hensbergen
Cc: Ron Minnich
Cc: Latchesar Ionkov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Darrick J. Wong
2013-02-22 09:22:20 +0800