02 Oct, 2013

1 commit

  • commit f3cff25f05f2ac29b2ee355e611b0657482f6f1d upstream.

    'samples' is a 64-bit operand, but the second parameter of do_div()
    is 32 bits wide. do_div() silently truncates the high 32 bits, so the
    calculated result is invalid.

    If the low 32 bits of 'samples' are all zero, do_div() divides by
    zero and crashes the kernel.

    Signed-off-by: Anatol Pomozov
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe
    Cc: Jonghwan Choi
    Signed-off-by: Greg Kroah-Hartman

    Anatol Pomozov
     

20 Aug, 2013

1 commit

  • commit d50235b7bc3ee0a0427984d763ea7534149531b4 upstream.

    There's a race between elevator switching and normal I/O operation,
    because struct elevator_queue and struct elevator_data are not
    allocated in one atomic operation. So there is a chance of using a
    NULL ->elevator_data.
    For example:
        Thread A:                        Thread B:
        blk_queue_bio                    elevator_switch
        spin_lock_irq(q->queue_lock)     elevator_alloc
        elv_merge                        elevator_init_fn

    Because elevator_alloc cannot be called with queue_lock held,
    ->elevator_data is still NULL at that point. If thread A calls
    elv_merge at the same time and needs some information from
    elevator_data, the crash happens.

    Move elevator_alloc into elevator_init_fn so that both operations
    happen in one atomic operation.

    The bug can be reproduced easily with the following steps:
    1: dd if=/dev/sdb of=/dev/null
    2: while true; do echo noop > scheduler; echo deadline > scheduler; done

    The fix was tested with the same method.

    Signed-off-by: Jianpeng Ma
    Signed-off-by: Jens Axboe
    Cc: Jonghwan Choi
    Signed-off-by: Greg Kroah-Hartman

    Jianpeng Ma
     

14 Jul, 2013

1 commit

  • commit ffc8b30866879ed9ba62bd0a86fecdbd51cd3d19 upstream.

    Disk names may contain arbitrary strings, so they must not be
    interpreted as format strings. It seems that only md allows arbitrary
    strings to be used for disk names, but this could allow for a local
    memory corruption from uid 0 into ring 0.

    CVE-2013-2851

    Signed-off-by: Kees Cook
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     

17 May, 2013

1 commit

  • In blk_post_runtime_resume, an autosuspend request will be initiated for
    the device. Since we are holding the queue lock, we can't sleep and thus
    we should use the async version to initiate an autosuspend, i.e.
    pm_request_suspend instead of pm_runtime_suspend, which might sleep.

    Signed-off-by: Aaron Lu
    Signed-off-by: Jens Axboe

    Aaron Lu
     

09 May, 2013

1 commit

  • Pull block core updates from Jens Axboe:

    - Major bit is Kent's prep work for immutable bio vecs.

    - Stable candidate fix for a scheduling-while-atomic in the queue
    bypass operation.

    - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
    discard bios.

    - Tejun's changes to convert the writeback thread pool to the generic
    workqueue mechanism.

    - Runtime PM framework. SCSI patches exist on top of these in James'
    tree.

    - A few random fixes.

    * 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
    relay: move remove_buf_file inside relay_close_buf
    partitions/efi.c: replace useless kzalloc's by kmalloc's
    fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
    block: fix max discard sectors limit
    blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
    Documentation: cfq-iosched: update documentation help for cfq tunables
    writeback: expose the bdi_wq workqueue
    writeback: replace custom worker pool implementation with unbound workqueue
    writeback: remove unused bdi_pending_list
    aoe: Fix unitialized var usage
    bio-integrity: Add explicit field for owner of bip_buf
    block: Add an explicit bio flag for bios that own their bvec
    block: Add bio_alloc_pages()
    block: Convert some code to bio_for_each_segment_all()
    block: Add bio_for_each_segment_all()
    bounce: Refactor __blk_queue_bounce to not use bi_io_vec
    raid1: use bio_copy_data()
    pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
    pktcdvd: use bio_copy_data()
    block: Add bio_copy_data()
    ...

    Linus Torvalds
     

08 May, 2013

1 commit

  • Faster kernel compiles by way of fewer unnecessary includes.

    [akpm@linux-foundation.org: fix fallout]
    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kent Overstreet
     

03 May, 2013

1 commit

  • Pull virtio & lguest updates from Rusty Russell:
    "Lots of virtio work which wasn't quite ready for last merge window.

    Plus I dived into lguest again, reworking the pagetable code so we can
    move the switcher page: our fixmaps sometimes take more than 2MB now..."

    Ugh. Annoying conflicts with the tcm_vhost -> vhost_scsi rename.
    Hopefully correctly resolved.

    * tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (57 commits)
    caif_virtio: Remove bouncing email addresses
    lguest: improve code readability in lg_cpu_start.
    virtio-net: fill only rx queues which are being used
    lguest: map Switcher below fixmap.
    lguest: cache last cpu we ran on.
    lguest: map Switcher text whenever we allocate a new pagetable.
    lguest: don't share Switcher PTE pages between guests.
    lguest: expost switcher_pages array (as lg_switcher_pages).
    lguest: extract shadow PTE walking / allocating.
    lguest: make check_gpte et. al return bool.
    lguest: assume Switcher text is a single page.
    lguest: rename switcher_page to switcher_pages.
    lguest: remove RESERVE_MEM constant.
    lguest: check vaddr not pgd for Switcher protection.
    lguest: prepare to make SWITCHER_ADDR a variable.
    virtio: console: replace EMFILE with EBUSY for already-open port
    virtio-scsi: reset virtqueue affinity when doing cpu hotplug
    virtio-scsi: introduce multiqueue support
    virtio-scsi: push vq lock/unlock into virtscsi_vq_done
    virtio-scsi: pass struct virtio_scsi to virtqueue completion function
    ...

    Linus Torvalds
     

30 Apr, 2013

3 commits

  • In alloc_read_gpt_entries and alloc_read_gpt_header, the kzalloc'ated
    zones are either totally overwritten by the following read_lba call,
    or freed. As kmalloc is cheaper than kzalloc, use kmalloc.

    Signed-off-by: Philippe De Muyter
    Cc: Matt Domsch
    Cc: Panagiotis Issaris
    Cc: Andrew Morton
    Signed-off-by: Jens Axboe

    Philippe De Muyter
     
  • Pull cgroup updates from Tejun Heo:

    - Fixes and a lot of cleanups. Locking cleanup is finally complete.
    cgroup_mutex is no longer exposed to individual controllers which
    used to cause nasty deadlock issues. Li fixed and cleaned up quite a
    bit including long standing ones like racy cgroup_path().

    - device cgroup now supports proper hierarchy thanks to Aristeu.

    - perf_event cgroup now supports proper hierarchy.

    - A new mount option "__DEVEL__sane_behavior" is added. As indicated
    by the name, this option is to be used for development only at this
    point and generates a warning message when used. Unfortunately,
    cgroup interface currently has too many breakages and inconsistencies
    to implement a consistent and unified hierarchy on top. The new flag
    is used to collect the behavior changes which are necessary to
    implement consistent unified hierarchy. It's likely that this flag
    won't be used verbatim when it becomes ready but will be enabled
    implicitly along with unified hierarchy.

    The option currently disables some of the broken behaviors in cgroup core
    and also .use_hierarchy switch in memcg (will be routed through -mm),
    which can be used to make very unusual hierarchy where nesting is
    partially honored. It will also be used to implement hierarchy
    support for blk-throttle which would be impossible otherwise without
    introducing a full separate set of control knobs.

    This is essentially versioning of interface which isn't very nice but
    at this point I can't see any other options which would allow keeping
    the interface the same while moving towards hierarchy behavior which
    is at least somewhat sane. The planned unified hierarchy is likely
    to require some level of adaptation from userland anyway, so I think
    it'd be best to take the chance and update the interface such that
    it's supportable in the long term.

    Maintaining the existing interface does complicate cgroup core but
    shouldn't put too much strain on individual controllers and I think
    it'd be manageable for the foreseeable future. Maybe we'll be able
    to drop it in a decade.

    Fix up conflicts (including a semantic one adding a new #include to ppc
    that was uncovered by the header file changes) as per Tejun.

    * 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (45 commits)
    cpuset: fix compile warning when CONFIG_SMP=n
    cpuset: fix cpu hotplug vs rebuild_sched_domains() race
    cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn()
    cgroup: restore the call to eventfd->poll()
    cgroup: fix use-after-free when umounting cgroupfs
    cgroup: fix broken file xattrs
    devcg: remove parent_cgroup.
    memcg: force use_hierarchy if sane_behavior
    cgroup: remove cgrp->top_cgroup
    cgroup: introduce sane_behavior mount option
    move cgroupfs_root to include/linux/cgroup.h
    cgroup: convert cgroupfs_root flag bits to masks and add CGRP_ prefix
    cgroup: make cgroup_path() not print double slashes
    Revert "cgroup: remove bind() method from cgroup_subsys."
    perf: make perf_event cgroup hierarchical
    cgroup: implement cgroup_is_descendant()
    cgroup: make sure parent won't be destroyed before its children
    cgroup: remove bind() method from cgroup_subsys.
    devcg: remove broken_hierarchy tag
    cgroup: remove cgroup_lock_is_held()
    ...

    Linus Torvalds
     
  • Pull driver core update from Greg Kroah-Hartman:
    "Here's the merge request for the driver core tree for 3.10-rc1

    It's pretty small, just a number of driver core and sysfs updates and
    fixes, all of which have been in linux-next for a while now.

    Signed-off-by: Greg Kroah-Hartman "

    Fixed conflict in kernel/rtmutex-tester.c, the locking tree had a better
    fix for the same sysfs file mode problem.

    * tag 'driver-core-3.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
    PM / Runtime: Idle devices asynchronously after probe|release
    driver core: handle user namespaces properly with the uid/gid devtmpfs change
    driver core: devtmpfs: fix compile failure with CONFIG_UIDGID_STRICT_TYPE_CHECKS
    devtmpfs: add base.h include
    driver core: add uid and gid to devtmpfs
    sysfs: check if one entry has been removed before freeing
    sysfs: fix crash_notes_size build warning
    sysfs: fix use after free in case of concurrent read/write and readdir
    rtmutex-tester: fix mode of sysfs files
    Documentation: Add ABI entry for crash_notes and crash_notes_size
    sysfs: Add crash_notes_size to export percpu note size
    driver core: platform_device.h: fix checkpatch errors and warnings
    driver core: platform.c: fix checkpatch errors and warnings
    driver core: warn that platform_driver_probe can not use deferred probing
    sysfs: use atomic_inc_unless_negative in sysfs_get_active
    base: core: WARN() about bogus permissions on device attributes
    device: separate all subsys mutexes

    Linus Torvalds
     

19 Apr, 2013

1 commit

  • This reverts commit 3a366e614d0837d9fc23f78cdb1a1186ebc3387f.

    Wanlong Gao reports that it causes a kernel panic on his machine several
    minutes after boot. Reverting it removes the panic.

    Jens says:
    "It's not quite clear why that is yet, so I think we should just revert
    the commit for 3.9 final (which I'm assuming is pretty close).

    The wifi is crap at the LSF hotel, so sending this email instead of
    queueing up a revert and pull request."

    Reported-by: Wanlong Gao
    Requested-by: Jens Axboe
    Cc: Tejun Heo
    Cc: Steven Rostedt
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

15 Apr, 2013

1 commit


12 Apr, 2013

1 commit


09 Apr, 2013

1 commit

  • Since 749fefe677 in v3.7 ("block: lift the initial queue bypass mode
    on blk_register_queue() instead of blk_init_allocated_queue()"),
    the following warning appears when multipath is used with CONFIG_PREEMPT=y.

    This patch moves blk_queue_bypass_start() before radix_tree_preload()
    to avoid the sleeping call while preemption is disabled.

    BUG: scheduling while atomic: multipath/2460/0x00000002
    1 lock held by multipath/2460:
    #0: (&md->type_lock){......}, at: [] dm_lock_md_type+0x17/0x19 [dm_mod]
    Modules linked in: ...
    Pid: 2460, comm: multipath Tainted: G W 3.7.0-rc2 #1
    Call Trace:
    [] __schedule_bug+0x6a/0x78
    [] __schedule+0xb4/0x5e0
    [] schedule+0x64/0x66
    [] schedule_timeout+0x39/0xf8
    [] ? put_lock_stats+0xe/0x29
    [] ? lock_release_holdtime+0xb6/0xbb
    [] wait_for_common+0x9d/0xee
    [] ? try_to_wake_up+0x206/0x206
    [] ? kfree_call_rcu+0x1c/0x1c
    [] wait_for_completion+0x1d/0x1f
    [] wait_rcu_gp+0x5d/0x7a
    [] ? wait_rcu_gp+0x7a/0x7a
    [] ? complete+0x21/0x53
    [] synchronize_rcu+0x1e/0x20
    [] blk_queue_bypass_start+0x5d/0x62
    [] blkcg_activate_policy+0x73/0x270
    [] ? kmem_cache_alloc_node_trace+0xc7/0x108
    [] cfq_init_queue+0x80/0x28e
    [] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
    [] elevator_init+0xe1/0x115
    [] ? blk_queue_make_request+0x54/0x59
    [] blk_init_allocated_queue+0x8c/0x9e
    [] dm_setup_md_queue+0x36/0xaa [dm_mod]
    [] table_load+0x1bd/0x2c8 [dm_mod]
    [] ctl_ioctl+0x1d6/0x236 [dm_mod]
    [] ? table_clear+0xaa/0xaa [dm_mod]
    [] dm_ctl_ioctl+0x13/0x17 [dm_mod]
    [] do_vfs_ioctl+0x3fb/0x441
    [] ? file_has_perm+0x8a/0x99
    [] sys_ioctl+0x5e/0x82
    [] ? trace_hardirqs_on_thunk+0x3a/0x3f
    [] system_call_fastpath+0x16/0x1b

    Signed-off-by: Jun'ichi Nomura
    Acked-by: Vivek Goyal
    Acked-by: Tejun Heo
    Cc: Alasdair G Kergon
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Jun'ichi Nomura
     

08 Apr, 2013

2 commits

  • Some drivers want to tell userspace what uid and gid should be used for
    their device nodes, so allow that information to percolate through the
    driver core to userspace in order to make this happen. This means that
    some systems (i.e. Android and friends) will not need to even run a
    udev-like daemon for their device node manager and can just rely on
    devtmpfs fully, reducing their footprint even more.

    Signed-off-by: Kay Sievers
    Signed-off-by: Greg Kroah-Hartman

    Kay Sievers
     
  • This reverts commit 8761a3dc1f07b163414e2215a2cadbb4cfe2a107.

    There are situations where the destruction path is called
    with the bdev->bd_mutex already held, which then deadlocks in
    loop_clr_fd(). The normal partition cleanup does a trylock()
    on the mutex, but it'd be nice to have a more bullet proof
    method in loop. So punt this more involved fix to the next
    merge window, and just back out this buggy fix for now.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

04 Apr, 2013

1 commit

  • As found by gcc-4.8, the QUEUE_SYSFS_BIT_FNS macro creates functions
    that use a value generated by queue_var_store independent of whether
    that value was set or not.

    block/blk-sysfs.c: In function 'queue_store_nonrot':
    block/blk-sysfs.c:244:385: warning: 'val' may be used uninitialized in this function [-Wmaybe-uninitialized]

    Unlike most other such warnings, this one is not a false positive,
    writing any non-number string into the sysfs files indeed has
    an undefined result, rather than returning an error.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Jens Axboe

    Arnd Bergmann
     

02 Apr, 2013

1 commit

  • …git/tj/wq into for-3.10/core

    Tejun writes:

    -----

    This is the pull request for the earlier patchset[1] with the same
    name. It's only three patches (the first one was committed to
    workqueue tree) but the merge strategy is a bit involved due to the
    dependencies.

    * Because the conversion needs features from wq/for-3.10,
    block/for-3.10/core is based on rc3, and wq/for-3.10 has conflicts
    with rc3, I pulled mainline (rc5) into wq/for-3.10 to prevent those
    workqueue conflicts from flaring up in block tree.

    * Resolving the issue that Jan and Dave raised about debugging
    requires arch-wide changes. The patchset is being worked on[2] but
    it'll have to go through -mm after these changes show up in -next,
    and not included in this pull request.

    The three commits are located in the following git branch.

    git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git writeback-workqueue

    Pulling it into block/for-3.10/core produces a conflict in
    drivers/md/raid5.c between the following two commits.

    e3620a3ad5 ("MD RAID5: Avoid accessing gendisk or queue structs when not available")
    2f6db2a707 ("raid5: use bio_reset()")

    The conflict is trivial - one removes an "if ()" conditional while the
    other removes "rbi->bi_next = NULL" right above it. We just need to
    remove both. The merged branch is available at

    git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git block-test-merge

    so that you can use it for verification. The test merge commit has
    proper merge description.

    While these changes are a bit of pain to route, they make code simpler
    and even have, while minute, measurable performance gain[3] even on a
    workload which isn't particularly favorable to showing the benefits of
    this conversion.

    ----

    Fixed up the conflict.

    Conflicts:
    drivers/md/raid5.c

    Signed-off-by: Jens Axboe <axboe@kernel.dk>

    Jens Axboe
     

25 Mar, 2013

1 commit


24 Mar, 2013

2 commits

  • Just a little convenience macro - main reason to add it now is preparing
    for immutable bio vecs, it'll reduce the size of the patch that puts
    bi_sector/bi_size/bi_idx into a struct bvec_iter.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: Lars Ellenberg
    CC: Jiri Kosina
    CC: Alasdair Kergon
    CC: dm-devel@redhat.com
    CC: Neil Brown
    CC: Martin Schwidefsky
    CC: Heiko Carstens
    CC: linux-s390@vger.kernel.org
    CC: Chris Mason
    CC: Steven Whitehouse
    Acked-by: Steven Whitehouse

    Kent Overstreet
     
  • Converts it to use bio_advance(), simplifying it quite a bit in the
    process.

    Note that req_bio_endio() now always calls bio_advance() - which means
    it always loops over the biovec, not just on partial completions. Don't
    expect it to affect performance, but worth noting.

    Tested it by forcing partial updates, and dumping before and after on
    various bio/bvec fields when doing a partial update.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe

    Kent Overstreet
     

23 Mar, 2013

4 commits

  • When a request is added:
    If device is suspended or is suspending and the request is not a
    PM request, resume the device.

    When the last request finishes:
    Call pm_runtime_mark_last_busy().

    When picking a request:
    If device is resuming/suspending, then only PM request is allowed
    to go.

    The idea and API is designed by Alan Stern and described here:
    http://marc.info/?l=linux-scsi&m=133727953625963&w=2

    Signed-off-by: Lin Ming
    Signed-off-by: Aaron Lu
    Acked-by: Alan Stern
    Signed-off-by: Jens Axboe

    Lin Ming
     
  • Add runtime pm helper functions:

    void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
    - Initialization function for drivers to call.

    int blk_pre_runtime_suspend(struct request_queue *q)
    - If any requests are in the queue, mark last busy and return -EBUSY.
    Otherwise set q->rpm_status to RPM_SUSPENDING and return 0.

    void blk_post_runtime_suspend(struct request_queue *q, int err)
    - If the suspend succeeded then set q->rpm_status to RPM_SUSPENDED.
    Otherwise set it to RPM_ACTIVE and mark last busy.

    void blk_pre_runtime_resume(struct request_queue *q)
    - Set q->rpm_status to RPM_RESUMING.

    void blk_post_runtime_resume(struct request_queue *q, int err)
    - If the resume succeeded then set q->rpm_status to RPM_ACTIVE
    and call __blk_run_queue, then mark last busy and autosuspend.
    Otherwise set q->rpm_status to RPM_SUSPENDED.

    The idea and API is designed by Alan Stern and described here:
    http://marc.info/?l=linux-scsi&m=133727953625963&w=2

    Signed-off-by: Lin Ming
    Signed-off-by: Aaron Lu
    Acked-by: Alan Stern
    Signed-off-by: Jens Axboe

    Lin Ming
     
  • Fixed code indent should use tabs where possible.

    Signed-off-by: Alice Ferrazzi
    Signed-off-by: Jens Axboe

    Alice Ferrazzi
     
  • Any partitions added by user space to the loop device were being
    left in place after detaching the loop device. This was because
    the detach path issued a BLKRRPART to clean up partitions if
    LO_FLAGS_PARTSCAN was set, meaning that the partitions were auto
    scanned on attach. Replace this BLKRRPART with code that
    unconditionally cleans up partitions on detach instead.

    Signed-off-by: Phillip Susi

    Modified by Jens to export delete_partition().

    Signed-off-by: Jens Axboe

    Phillip Susi
     

20 Mar, 2013

1 commit


05 Mar, 2013

1 commit

  • rename() will change dentry->d_name. The result of this race can
    be worse than seeing partially rewritten name, but we might access
    a stale pointer because rename() will re-allocate memory to hold
    a longer name.

    As accessing dentry->name must be protected by dentry->d_lock or
    parent inode's i_mutex, while on the other hand cgroup-path() can
    be called with some irq-safe spinlocks held, we can't generate
    cgroup path using dentry->d_name.

    Alternatively we make a copy of dentry->d_name and save it in
    cgrp->name when a cgroup is created, and update cgrp->name at
    rename().

    v5: use flexible array instead of zero-size array.
    v4: - allocate root_cgroup_name and all root_cgroup->name points to it.
    - add cgroup_name() wrapper.
    v3: use kfree_rcu() instead of synchronize_rcu() in user-visible path.
    v2: make cgrp->name RCU safe.

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     

01 Mar, 2013

1 commit

  • Pull block IO core bits from Jens Axboe:
    "Below are the core block IO bits for 3.9. It was delayed a few days
    since my workstation kept crashing every 2-8h after pulling it into
    current -git, but turns out it is a bug in the new pstate code (divide
    by zero, will report separately). In any case, it contains:

    - The big cfq/blkcg update from Tejun and Vivek.

    - Additional block and writeback tracepoints from Tejun.

    - Improvement of the should sort (based on queues) logic in the plug
    flushing.

    - _io() variants of the wait_for_completion() interface, using
    io_schedule() instead of schedule() to contribute to io wait
    properly.

    - Various little fixes.

    You'll get two trivial merge conflicts, which should be easy enough to
    fix up"

    Fix up the trivial conflicts due to hlist traversal cleanups (commit
    b67bfe0d42ca: "hlist: drop the node parameter from iterators").

    * 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
    block: remove redundant check to bd_openers()
    block: use i_size_write() in bd_set_size()
    cfq: fix lock imbalance with failed allocations
    drivers/block/swim3.c: fix null pointer dereference
    block: don't select PERCPU_RWSEM
    block: account iowait time when waiting for completion of IO request
    sched: add wait_for_completion_io[_timeout]
    writeback: add more tracepoints
    block: add block_{touch|dirty}_buffer tracepoint
    buffer: make touch_buffer() an exported function
    block: add @req to bio_{front|back}_merge tracepoints
    block: add missing block_bio_complete() tracepoint
    block: Remove should_sort judgement when flush blk_plug
    block,elevator: use new hashtable implementation
    cfq-iosched: add hierarchical cfq_group statistics
    cfq-iosched: collect stats from dead cfqgs
    cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
    blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
    block: RCU free request_queue
    blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
    ...

    Linus Torvalds
     

28 Feb, 2013

8 commits

  • I'm not sure why, but the hlist for each entry iterators were conceived
    differently from the list ones. The list iterators take three parameters:

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only
    do they not really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small number of places were using the 'node' parameter; these
    were modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foundation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
  • Currently, sizeof(struct parsed_partitions) may be 64KB on 32-bit
    architectures, so check_partition can easily trigger a page allocation
    failure, especially when block devices are hotplugged (USB mass storage,
    MMC cards, ...). Felipe Balbi has observed the failure.

    This patch makes the following optimizations to the allocation of struct
    parsed_partitions to try to address the issue:

    - make parsed_partitions.parts a pointer so that the memory it points to
    fits in a 32KB buffer, saving approximately 32KB of memory

    - vmalloc the buffer pointed to by parsed_partitions.parts, because 32KB
    is still a bit big for kmalloc

    - given that many devices have a partition count limit, allocate only
    disk_max_parts() partitions instead of always allocating 256

    Signed-off-by: Ming Lei
    Reported-by: Felipe Balbi
    Cc: Jens Axboe
    Reviewed-by: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ming Lei
     
  • It isn't necessary to read the information of partitions whose number is
    equal to or greater than state->limit, since at most state->limit
    partitions will be added inside rescan_partitions().

    That is also what the parsers for other partition types do.

    Signed-off-by: Ming Lei
    Cc: Jens Axboe
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ming Lei
     
  • UEFI 2.3.1D will include a change to the spec language mandating that a
    GPT header must be greater than *or equal to* the size of the defined
    structure. While verifying that this would work on Linux, I discovered
    that we're not actually checking the minimum bound at all.

    The result of this is that when we verify the checksum, it's possible that
    on a malformed header (with header_size of 0), we won't actually verify
    any data.

    [akpm@linux-foundation.org: fix printk warning]
    Signed-off-by: Peter Jones
    Acked-by: Matt Fleming
    Cc: Jens Axboe
    Cc: Stephen Warren
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Jones
     
  • AIX formatted disks do not always have the MSDOS 55aa signature.
    This happens e.g. for unbootable AIX disks.

    Up to now, such disks were not recognized as AIX disks, because of the
    missing 55aa. Fix that by inverting the two tests. Let's first
    check for the AIX magic strings, and only if that fails check for
    the MSDOS magic word.

    Signed-off-by: Philippe De Muyter
    Cc: Andreas Mohr
    Cc: OGAWA Hirofumi
    Cc: Jens Axboe
    Cc: Olaf Hering
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Philippe De Muyter
     
  • Convert to the much saner new idr interface. Both bsg and genhd
    protect idr w/ mutex making preloading unnecessary.

    Signed-off-by: Tejun Heo
    Acked-by: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • idr allocation in blk_alloc_devt() wasn't synchronized against lookup
    and removal, and its limit check was off by one - 1 << MINORBITS is
    the number of minors allowed, not the maximum allowed minor.

    Add locking and rename MAX_EXT_DEVT to NR_EXT_DEVT and fix limit
    checking.

    Signed-off-by: Tejun Heo
    Acked-by: Jens Axboe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • While adding and removing a lot of disks and partitions this
    sometimes shows up:

    WARNING: at fs/sysfs/dir.c:512 sysfs_add_one+0xc9/0x130() (Not tainted)
    Hardware name:
    sysfs: cannot create duplicate filename '/dev/block/259:751'
    Modules linked in: raid1 autofs4 bnx2fc cnic uio fcoe libfcoe libfc 8021q scsi_transport_fc scsi_tgt garp stp llc sunrpc cpufreq_ondemand powernow_k8 freq_table mperf ipv6 dm_mirror dm_region_hash dm_log power_meter microcode dcdbas serio_raw amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core k10temp bnx2 sg ixgbe dca mdio ext4 mbcache jbd2 dm_round_robin sr_mod cdrom sd_mod crc_t10dif ata_generic pata_acpi pata_atiixp ahci mptsas mptscsih mptbase scsi_transport_sas dm_multipath dm_mod [last unloaded: scsi_wait_scan]
    Pid: 44103, comm: async/16 Not tainted 2.6.32-195.el6.x86_64 #1
    Call Trace:
    warn_slowpath_common+0x87/0xc0
    warn_slowpath_fmt+0x46/0x50
    sysfs_add_one+0xc9/0x130
    sysfs_do_create_link+0x12b/0x170
    sysfs_create_link+0x13/0x20
    device_add+0x317/0x650
    idr_get_new+0x13/0x50
    add_partition+0x21c/0x390
    rescan_partitions+0x32b/0x470
    sd_open+0x81/0x1f0 [sd_mod]
    __blkdev_get+0x1b6/0x3c0
    blkdev_get+0x10/0x20
    register_disk+0x155/0x170
    add_disk+0xa6/0x160
    sd_probe_async+0x13b/0x210 [sd_mod]
    add_wait_queue+0x46/0x60
    async_thread+0x102/0x250
    default_wake_function+0x0/0x20
    async_thread+0x0/0x250
    kthread+0x96/0xa0
    child_rip+0xa/0x20
    kthread+0x0/0xa0
    child_rip+0x0/0x20

    This most likely happens because dev_t is freed while the number is
    still used and idr_get_new() is not protected on every use. The fix
    adds a mutex where it wasn't before and moves the dev_t free function so
    it is called after device del.

    Signed-off-by: Tomas Henzl
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tomas Henzl
     

24 Feb, 2013

1 commit

  • Apply the recently introduced pm_runtime_set_memalloc_noio to block
    devices so that the PM core will teach mm not to allocate memory with
    GFP_IOFS when calling the runtime_resume and runtime_suspend callbacks
    for block devices and their ancestors.

    Signed-off-by: Ming Lei
    Cc: Jens Axboe
    Cc: Minchan Kim
    Cc: Alan Stern
    Cc: Oliver Neukum
    Cc: Jiri Kosina
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Greg KH
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: David Decotigny
    Cc: Tom Herbert
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ming Lei
     

22 Feb, 2013

3 commits

  • While stress-running very-small container scenarios with the Kernel Memory
    Controller, I've run into a lockdep-detected lock imbalance in
    cfq-iosched.c.

    I'll apologize beforehand for not posting a backlog: I didn't anticipate
    it would be so hard to reproduce, so I didn't save my serial output and
    went directly on debugging. Turns out that it did not happen again in
    more than 20 runs, making it a quite rare pattern.

    But here is my analysis:

    When we are in very low-memory situations, we will arrive at
    cfq_find_alloc_queue and may not find a queue, having to resort to the oom
    queue, in an rcu-locked condition:

        if (!cfqq || cfqq == &cfqd->oom_cfqq)
                [ ... ]

    Next, we will release the rcu lock, and try to allocate a queue, retrying
    if we succeed:

        rcu_read_unlock();
        spin_unlock_irq(cfqd->queue->queue_lock);
        new_cfqq = kmem_cache_alloc_node(cfq_pool,
                        gfp_mask | __GFP_ZERO,
                        cfqd->queue->node);
        spin_lock_irq(cfqd->queue->queue_lock);
        if (new_cfqq)
                goto retry;

    We are unlocked at this point, but it should be fine, since we will
    reacquire the rcu_read_lock when we retry.

    Except of course, that we may not retry: the allocation may very well fail
    and we'll keep on going through the flow:

    The next branch is:

        if (cfqq) {
                [ ... ]
        } else
                cfqq = &cfqd->oom_cfqq;

    And right before exiting, we'll issue rcu_read_unlock().

    Being already unlocked, this is the likely source of our imbalance. Since
    cfqq is either already NULL or made NULL in the first statement of the
    outer branch, the only viable alternative here seems to be to return the
    oom queue right away in case of allocation failure.
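
    The imbalance can be reproduced in a userspace model. This is only a
    sketch of the control flow analysed above, not the cfq-iosched code: a
    plain counter stands in for the RCU read-side nesting depth, and every
    name below (rcu_depth, model_rcu_read_lock(), find_alloc_queue_model(),
    depth_after(), check_balance()) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

static int rcu_depth;   /* models read-side critical section nesting */

static void model_rcu_read_lock(void)   { rcu_depth++; }
static void model_rcu_read_unlock(void) { rcu_depth--; }

/*
 * Models the low-memory path: the lookup finds no usable queue, the lock
 * is dropped for the allocation, and the flow either retries (allocation
 * succeeded), returns the oom queue early (fixed flow), or falls through
 * to the exit path without holding the lock (buggy flow).
 */
static void find_alloc_queue_model(bool alloc_succeeds, bool return_oom_early)
{
    bool have_new = false;

retry:
    model_rcu_read_lock();
    if (!have_new) {
        model_rcu_read_unlock();    /* dropped around the allocation */
        have_new = alloc_succeeds;
        if (have_new)
            goto retry;             /* retry re-takes the read lock */
        if (return_oom_early)
            return;                 /* fixed flow: balanced */
        /* buggy flow: continue although the lock is no longer held */
    }
    model_rcu_read_unlock();        /* unlock issued right before exiting */
}

/* Runs one scenario from a clean state and reports the final depth. */
static int depth_after(bool alloc_succeeds, bool return_oom_early)
{
    rcu_depth = 0;
    find_alloc_queue_model(alloc_succeeds, return_oom_early);
    return rcu_depth;
}

/* Hypothetical self-check covering the three scenarios. */
static void check_balance(void)
{
    assert(depth_after(true, false) == 0);    /* allocation succeeds */
    assert(depth_after(false, false) == -1);  /* failure: one unlock too many */
    assert(depth_after(false, true) == 0);    /* failure with early return */
}
```

    A final depth of -1 corresponds to the extra rcu_read_unlock() that
    lockdep reported; returning early on allocation failure keeps it at 0.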

    Please review the following patch and apply if you agree with my analysis.

    Signed-off-by: Glauber Costa
    Cc: Jens Axboe
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Glauber Costa
     
  • The block device doesn't use percpu rw-semaphore anymore, so don't select
    it for compilation.

    Signed-off-by: Mikulas Patocka
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Mikulas Patocka
     
  • This is a band-aid that provides stable page writes on jbd without
    needing to backport the fixed locking and page writeback bit handling
    schemes of jbd2. The band-aid works by using bounce buffers to snapshot
    page contents instead of waiting.
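
    The snapshot idea can be sketched in userspace as follows. This is a
    simplified model rather than the block layer's bounce code: the 4096-byte
    MODEL_PAGE_SIZE and the helpers snapshot_page() and check_snapshot() are
    assumptions made for illustration only.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096   /* assumed page size for this model */

/*
 * Copy the page into a freshly allocated bounce buffer. The write is then
 * issued from the copy, so later changes to the original page cannot alter
 * the data in flight. Returns NULL on allocation failure; the caller frees.
 */
static unsigned char *snapshot_page(const unsigned char *page)
{
    unsigned char *bounce = malloc(MODEL_PAGE_SIZE);

    if (bounce)
        memcpy(bounce, page, MODEL_PAGE_SIZE);
    return bounce;
}

/* Hypothetical self-check: the snapshot stays stable while the page changes. */
static void check_snapshot(void)
{
    unsigned char page[MODEL_PAGE_SIZE];
    unsigned char *bounce;

    memset(page, 0xAA, sizeof(page));
    bounce = snapshot_page(page);
    assert(bounce != NULL);

    page[0] = 0x55;               /* writer keeps modifying the page */
    assert(bounce[0] == 0xAA);    /* data being written is unaffected */
    free(bounce);
}
```

    The cost is one page copy per write, which is traded against making the
    writer wait until writeback of the page has finished.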

    For those wondering about the ext3 bandage -- fixing the jbd locking
    (which was done as part of ext4dev years ago) is a lot of surgery, and
    setting PG_writeback on data pages when we actually hold the page lock
    dropped ext3 performance by nearly an order of magnitude. If we're
    going to migrate iscsi and raid to use stable page writes, the
    complaints about high latency will likely return. We might as well
    centralize their page snapshotting thing to one place.

    Signed-off-by: Darrick J. Wong
    Tested-by: Andy Lutomirski
    Cc: Adrian Hunter
    Cc: Artem Bityutskiy
    Reviewed-by: Jan Kara
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc: Steven Whitehouse
    Cc: Jens Axboe
    Cc: Eric Van Hensbergen
    Cc: Ron Minnich
    Cc: Latchesar Ionkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong