09 May, 2013

1 commit

  • Pull block core updates from Jens Axboe:

    - Major bit is Kent's prep work for immutable bio vecs.

    - Stable-candidate fix for a scheduling-while-atomic bug in the queue
    bypass operation.

    - Fix for the hang when merged discard bios exceed the 32-bit unsigned
    rq->datalen.

    - Tejun's changes to convert the writeback thread pool to the generic
    workqueue mechanism.

    - Runtime PM framework; SCSI patches exist on top of these in James'
    tree.

    - A few random fixes.

    * 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
    relay: move remove_buf_file inside relay_close_buf
    partitions/efi.c: replace useless kzalloc's by kmalloc's
    fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
    block: fix max discard sectors limit
    blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
    Documentation: cfq-iosched: update documentation help for cfq tunables
    writeback: expose the bdi_wq workqueue
    writeback: replace custom worker pool implementation with unbound workqueue
    writeback: remove unused bdi_pending_list
    aoe: Fix unitialized var usage
    bio-integrity: Add explicit field for owner of bip_buf
    block: Add an explicit bio flag for bios that own their bvec
    block: Add bio_alloc_pages()
    block: Convert some code to bio_for_each_segment_all()
    block: Add bio_for_each_segment_all()
    bounce: Refactor __blk_queue_bounce to not use bi_io_vec
    raid1: use bio_copy_data()
    pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
    pktcdvd: use bio_copy_data()
    block: Add bio_copy_data()
    ...

    Linus Torvalds
     

08 May, 2013

1 commit

  • Faster kernel compiles by way of fewer unnecessary includes.

    [akpm@linux-foundation.org: fix fallout]
    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kent Overstreet
     

03 May, 2013

1 commit

  • Pull virtio & lguest updates from Rusty Russell:
    "Lots of virtio work which wasn't quite ready for last merge window.

    Plus I dived into lguest again, reworking the pagetable code so we can
    move the switcher page: our fixmaps sometimes take more than 2MB now..."

    Ugh. Annoying conflicts with the tcm_vhost -> vhost_scsi rename.
    Hopefully correctly resolved.

    * tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (57 commits)
    caif_virtio: Remove bouncing email addresses
    lguest: improve code readability in lg_cpu_start.
    virtio-net: fill only rx queues which are being used
    lguest: map Switcher below fixmap.
    lguest: cache last cpu we ran on.
    lguest: map Switcher text whenever we allocate a new pagetable.
    lguest: don't share Switcher PTE pages between guests.
    lguest: expost switcher_pages array (as lg_switcher_pages).
    lguest: extract shadow PTE walking / allocating.
    lguest: make check_gpte et. al return bool.
    lguest: assume Switcher text is a single page.
    lguest: rename switcher_page to switcher_pages.
    lguest: remove RESERVE_MEM constant.
    lguest: check vaddr not pgd for Switcher protection.
    lguest: prepare to make SWITCHER_ADDR a variable.
    virtio: console: replace EMFILE with EBUSY for already-open port
    virtio-scsi: reset virtqueue affinity when doing cpu hotplug
    virtio-scsi: introduce multiqueue support
    virtio-scsi: push vq lock/unlock into virtscsi_vq_done
    virtio-scsi: pass struct virtio_scsi to virtqueue completion function
    ...

    Linus Torvalds
     

30 Apr, 2013

3 commits

  • In alloc_read_gpt_entries and alloc_read_gpt_header, the kzalloc'ated
    zones are either totally overwritten by the following read_lba call,
    or freed. As kmalloc is cheaper than kzalloc, use kmalloc.

    Signed-off-by: Philippe De Muyter
    Cc: Matt Domsch
    Cc: Panagiotis Issaris
    Cc: Andrew Morton
    Signed-off-by: Jens Axboe

    Philippe De Muyter
     
  • Pull cgroup updates from Tejun Heo:

    - Fixes and a lot of cleanups. Locking cleanup is finally complete.
    cgroup_mutex is no longer exposed to individual controllers, which
    used to cause nasty deadlock issues. Li fixed and cleaned up quite a
    bit, including long-standing ones like the racy cgroup_path().

    - device cgroup now supports proper hierarchy thanks to Aristeu.

    - perf_event cgroup now supports proper hierarchy.

    - A new mount option "__DEVEL__sane_behavior" is added. As indicated
    by the name, this option is to be used for development only at this
    point and generates a warning message when used. Unfortunately, the
    cgroup interface currently has too many breakages and inconsistencies
    to implement a consistent and unified hierarchy on top. The new flag
    is used to collect the behavior changes which are necessary to
    implement a consistent, unified hierarchy. It's likely that this flag
    won't be used verbatim when it becomes ready but will be enabled
    implicitly along with the unified hierarchy.

    The option currently disables some of the broken behaviors in cgroup
    core and also the .use_hierarchy switch in memcg (will be routed
    through -mm), which can be used to make a very unusual hierarchy where
    nesting is partially honored. It will also be used to implement hierarchy
    support for blk-throttle which would be impossible otherwise without
    introducing a full separate set of control knobs.

    This is essentially versioning of the interface, which isn't very nice, but
    at this point I can't see any other options which would allow keeping
    the interface the same while moving towards hierarchy behavior which
    is at least somewhat sane. The planned unified hierarchy is likely
    to require some level of adaptation from userland anyway, so I think
    it'd be best to take the chance and update the interface such that
    it's supportable in the long term.

    Maintaining the existing interface does complicate cgroup core but
    shouldn't put too much strain on individual controllers and I think
    it'd be manageable for the foreseeable future. Maybe we'll be able
    to drop it in a decade.

    Fix up conflicts (including a semantic one adding a new #include to ppc
    that was uncovered by the header file changes) as per Tejun.

    * 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (45 commits)
    cpuset: fix compile warning when CONFIG_SMP=n
    cpuset: fix cpu hotplug vs rebuild_sched_domains() race
    cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn()
    cgroup: restore the call to eventfd->poll()
    cgroup: fix use-after-free when umounting cgroupfs
    cgroup: fix broken file xattrs
    devcg: remove parent_cgroup.
    memcg: force use_hierarchy if sane_behavior
    cgroup: remove cgrp->top_cgroup
    cgroup: introduce sane_behavior mount option
    move cgroupfs_root to include/linux/cgroup.h
    cgroup: convert cgroupfs_root flag bits to masks and add CGRP_ prefix
    cgroup: make cgroup_path() not print double slashes
    Revert "cgroup: remove bind() method from cgroup_subsys."
    perf: make perf_event cgroup hierarchical
    cgroup: implement cgroup_is_descendant()
    cgroup: make sure parent won't be destroyed before its children
    cgroup: remove bind() method from cgroup_subsys.
    devcg: remove broken_hierarchy tag
    cgroup: remove cgroup_lock_is_held()
    ...

    Linus Torvalds
     
  • Pull driver core update from Greg Kroah-Hartman:
    "Here's the merge request for the driver core tree for 3.10-rc1

    It's pretty small, just a number of driver core and sysfs updates and
    fixes, all of which have been in linux-next for a while now.

    Signed-off-by: Greg Kroah-Hartman "

    Fixed conflict in kernel/rtmutex-tester.c, the locking tree had a better
    fix for the same sysfs file mode problem.

    * tag 'driver-core-3.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
    PM / Runtime: Idle devices asynchronously after probe|release
    driver core: handle user namespaces properly with the uid/gid devtmpfs change
    driver core: devtmpfs: fix compile failure with CONFIG_UIDGID_STRICT_TYPE_CHECKS
    devtmpfs: add base.h include
    driver core: add uid and gid to devtmpfs
    sysfs: check if one entry has been removed before freeing
    sysfs: fix crash_notes_size build warning
    sysfs: fix use after free in case of concurrent read/write and readdir
    rtmutex-tester: fix mode of sysfs files
    Documentation: Add ABI entry for crash_notes and crash_notes_size
    sysfs: Add crash_notes_size to export percpu note size
    driver core: platform_device.h: fix checkpatch errors and warnings
    driver core: platform.c: fix checkpatch errors and warnings
    driver core: warn that platform_driver_probe can not use deferred probing
    sysfs: use atomic_inc_unless_negative in sysfs_get_active
    base: core: WARN() about bogus permissions on device attributes
    device: separate all subsys mutexes

    Linus Torvalds
     

19 Apr, 2013

1 commit

  • This reverts commit 3a366e614d0837d9fc23f78cdb1a1186ebc3387f.

    Wanlong Gao reports that it causes a kernel panic on his machine several
    minutes after boot. Reverting it removes the panic.

    Jens says:
    "It's not quite clear why that is yet, so I think we should just revert
    the commit for 3.9 final (which I'm assuming is pretty close).

    The wifi is crap at the LSF hotel, so sending this email instead of
    queueing up a revert and pull request."

    Reported-by: Wanlong Gao
    Requested-by: Jens Axboe
    Cc: Tejun Heo
    Cc: Steven Rostedt
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

09 Apr, 2013

1 commit

  • Since 749fefe677 in v3.7 ("block: lift the initial queue bypass mode
    on blk_register_queue() instead of blk_init_allocated_queue()"),
    the following warning appears when multipath is used with CONFIG_PREEMPT=y.

    This patch moves blk_queue_bypass_start() before radix_tree_preload()
    to avoid the sleeping call while preemption is disabled.

    BUG: scheduling while atomic: multipath/2460/0x00000002
    1 lock held by multipath/2460:
    #0: (&md->type_lock){......}, at: [] dm_lock_md_type+0x17/0x19 [dm_mod]
    Modules linked in: ...
    Pid: 2460, comm: multipath Tainted: G W 3.7.0-rc2 #1
    Call Trace:
    [] __schedule_bug+0x6a/0x78
    [] __schedule+0xb4/0x5e0
    [] schedule+0x64/0x66
    [] schedule_timeout+0x39/0xf8
    [] ? put_lock_stats+0xe/0x29
    [] ? lock_release_holdtime+0xb6/0xbb
    [] wait_for_common+0x9d/0xee
    [] ? try_to_wake_up+0x206/0x206
    [] ? kfree_call_rcu+0x1c/0x1c
    [] wait_for_completion+0x1d/0x1f
    [] wait_rcu_gp+0x5d/0x7a
    [] ? wait_rcu_gp+0x7a/0x7a
    [] ? complete+0x21/0x53
    [] synchronize_rcu+0x1e/0x20
    [] blk_queue_bypass_start+0x5d/0x62
    [] blkcg_activate_policy+0x73/0x270
    [] ? kmem_cache_alloc_node_trace+0xc7/0x108
    [] cfq_init_queue+0x80/0x28e
    [] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
    [] elevator_init+0xe1/0x115
    [] ? blk_queue_make_request+0x54/0x59
    [] blk_init_allocated_queue+0x8c/0x9e
    [] dm_setup_md_queue+0x36/0xaa [dm_mod]
    [] table_load+0x1bd/0x2c8 [dm_mod]
    [] ctl_ioctl+0x1d6/0x236 [dm_mod]
    [] ? table_clear+0xaa/0xaa [dm_mod]
    [] dm_ctl_ioctl+0x13/0x17 [dm_mod]
    [] do_vfs_ioctl+0x3fb/0x441
    [] ? file_has_perm+0x8a/0x99
    [] sys_ioctl+0x5e/0x82
    [] ? trace_hardirqs_on_thunk+0x3a/0x3f
    [] system_call_fastpath+0x16/0x1b
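
    The shape of the fix, as an illustrative ordering only (the real call
    site is inside the blkcg setup path shown in the trace above; this is
    not the verbatim diff):

    /* do the sleeping work first: blk_queue_bypass_start() may call
     * synchronize_rcu() and must not run with preemption disabled */
    blk_queue_bypass_start(q);

    /* radix_tree_preload() returns with preemption disabled on success,
     * until radix_tree_preload_end() */
    ret = radix_tree_preload(GFP_KERNEL);
    if (!ret) {
            /* ... no sleeping calls in here ... */
            radix_tree_preload_end();
    }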

    Signed-off-by: Jun'ichi Nomura
    Acked-by: Vivek Goyal
    Acked-by: Tejun Heo
    Cc: Alasdair G Kergon
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Jun'ichi Nomura
     

08 Apr, 2013

2 commits

  • Some drivers want to tell userspace what uid and gid should be used for
    their device nodes, so allow that information to percolate through the
    driver core to userspace in order to make this happen. This means that
    some systems (i.e. Android and friends) will not even need to run a
    udev-like daemon for their device node manager and can just rely on
    devtmpfs fully, reducing their footprint even more.
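
    A hedged sketch of the driver-side hook this enables: a devnode callback
    can now hand back ownership along with the mode (the callback name and
    the uid/gid values below are illustrative, not taken from the patch):

    static char *mydev_devnode(struct device *dev, umode_t *mode,
                               kuid_t *uid, kgid_t *gid)
    {
            if (mode)
                    *mode = 0660;                   /* illustrative permissions */
            if (uid)
                    *uid = KUIDT_INIT(1000);        /* illustrative owner */
            if (gid)
                    *gid = KGIDT_INIT(1000);        /* illustrative group */
            return NULL;    /* keep the default device node name */
    }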

    Signed-off-by: Kay Sievers
    Signed-off-by: Greg Kroah-Hartman

    Kay Sievers
     
  • This reverts commit 8761a3dc1f07b163414e2215a2cadbb4cfe2a107.

    There are situations where the destruction path is called
    with the bdev->bd_mutex already held, which then deadlocks in
    loop_clr_fd(). The normal partition cleanup does a trylock()
    on the mutex, but it'd be nice to have a more bulletproof
    method in loop. So punt this more involved fix to the next
    merge window, and just back out this buggy fix for now.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

04 Apr, 2013

1 commit

  • As found by gcc-4.8, the QUEUE_SYSFS_BIT_FNS macro creates functions
    that use a value generated by queue_var_store independent of whether
    that value was set or not.

    block/blk-sysfs.c: In function 'queue_store_nonrot':
    block/blk-sysfs.c:244:385: warning: 'val' may be used uninitialized in this function [-Wmaybe-uninitialized]

    Unlike most other such warnings, this one is not a false positive:
    writing any non-number string into the sysfs files indeed has an
    undefined result, rather than returning an error.
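
    A hedged sketch of the defensive pattern the fix points at (not the
    verbatim kernel diff): let queue_var_store() report the parse failure
    and bail out before the value is ever used.

    static ssize_t queue_store_nonrot(struct request_queue *q,
                                      const char *page, size_t count)
    {
            unsigned long val;
            ssize_t ret = queue_var_store(&val, page, count);

            if (ret < 0)            /* non-numeric input: return the error */
                    return ret;     /* instead of using an uninitialized val */

            /* ... use val to set or clear the queue flag ... */
            return ret;
    }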

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Jens Axboe

    Arnd Bergmann
     

02 Apr, 2013

1 commit

  • …git/tj/wq into for-3.10/core

    Tejun writes:

    -----

    This is the pull request for the earlier patchset[1] with the same
    name. It's only three patches (the first one was committed to
    workqueue tree) but the merge strategy is a bit involved due to the
    dependencies.

    * Because the conversion needs features from wq/for-3.10,
    block/for-3.10/core is based on rc3, and wq/for-3.10 has conflicts
    with rc3, I pulled mainline (rc5) into wq/for-3.10 to prevent those
    workqueue conflicts from flaring up in block tree.

    * Resolving the issue that Jan and Dave raised about debugging
    requires arch-wide changes. The patchset is being worked on[2] but
    it'll have to go through -mm after these changes show up in -next,
    so it is not included in this pull request.

    The three commits are located in the following git branch.

    git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git writeback-workqueue

    Pulling it into block/for-3.10/core produces a conflict in
    drivers/md/raid5.c between the following two commits.

    e3620a3ad5 ("MD RAID5: Avoid accessing gendisk or queue structs when not available")
    2f6db2a707 ("raid5: use bio_reset()")

    The conflict is trivial - one removes an "if ()" conditional while the
    other removes "rbi->bi_next = NULL" right above it. We just need to
    remove both. The merged branch is available at

    git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git block-test-merge

    so that you can use it for verification. The test merge commit has
    proper merge description.

    While these changes are a bit of a pain to route, they make the code
    simpler and even show a minute but measurable performance gain[3] on a
    workload which isn't particularly favorable to showing the benefits of
    this conversion.

    ----

    Fixed up the conflict.

    Conflicts:
    drivers/md/raid5.c

    Signed-off-by: Jens Axboe <axboe@kernel.dk>

    Jens Axboe
     

24 Mar, 2013

2 commits

  • Just a little convenience macro - main reason to add it now is preparing
    for immutable bio vecs, it'll reduce the size of the patch that puts
    bi_sector/bi_size/bi_idx into a struct bvec_iter.
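
    For orientation, the struct this prep work is heading toward looks like
    this in the later immutable-biovec series (shown as context only, it is
    not part of this patch):

    struct bvec_iter {
            sector_t        bi_sector;      /* device address, in 512-byte sectors */
            unsigned int    bi_size;        /* residual I/O count */
            unsigned int    bi_idx;         /* current index into bi_io_vec */
            unsigned int    bi_bvec_done;   /* bytes completed in the current bvec */
    };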

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: Lars Ellenberg
    CC: Jiri Kosina
    CC: Alasdair Kergon
    CC: dm-devel@redhat.com
    CC: Neil Brown
    CC: Martin Schwidefsky
    CC: Heiko Carstens
    CC: linux-s390@vger.kernel.org
    CC: Chris Mason
    CC: Steven Whitehouse
    Acked-by: Steven Whitehouse

    Kent Overstreet
     
  • Converts it to use bio_advance(), simplifying it quite a bit in the
    process.

    Note that req_bio_endio() now always calls bio_advance() - which means
    it always loops over the biovec, not just on partial completions. I don't
    expect it to affect performance, but it's worth noting.

    Tested it by forcing partial updates, and dumping before and after on
    various bio/bvec fields when doing a partial update.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe

    Kent Overstreet
     

23 Mar, 2013

4 commits

  • When a request is added:
    If the device is suspended or suspending and the request is not a
    PM request, resume the device.

    When the last request finishes:
    Call pm_runtime_mark_last_busy().

    When picking a request:
    If the device is resuming/suspending, then only PM requests are allowed
    to go.

    The idea and API were designed by Alan Stern and are described here:
    http://marc.info/?l=linux-scsi&m=133727953625963&w=2
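
    A sketch of the first two rules as block-layer hooks (close to, but not
    verbatim, the actual patch):

    static void blk_pm_add_request(struct request_queue *q, struct request *rq)
    {
            /* a normal request arriving while suspended or suspending
             * kicks off an asynchronous resume */
            if (q->dev && !(rq->cmd_flags & REQ_PM) && q->nr_pending++ == 0 &&
                (q->rpm_status == RPM_SUSPENDED ||
                 q->rpm_status == RPM_SUSPENDING))
                    pm_request_resume(q->dev);
    }

    static void blk_pm_put_request(struct request *rq)
    {
            /* when the last request completes, note the time so the
             * autosuspend timer can start counting */
            if (rq->q->dev && !(rq->cmd_flags & REQ_PM) && !--rq->q->nr_pending)
                    pm_runtime_mark_last_busy(rq->q->dev);
    }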

    Signed-off-by: Lin Ming
    Signed-off-by: Aaron Lu
    Acked-by: Alan Stern
    Signed-off-by: Jens Axboe

    Lin Ming
     
  • Add runtime pm helper functions:

    void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
    - Initialization function for drivers to call.

    int blk_pre_runtime_suspend(struct request_queue *q)
    - If any requests are in the queue, mark last busy and return -EBUSY.
    Otherwise set q->rpm_status to RPM_SUSPENDING and return 0.

    void blk_post_runtime_suspend(struct request_queue *q, int err)
    - If the suspend succeeded then set q->rpm_status to RPM_SUSPENDED.
    Otherwise set it to RPM_ACTIVE and mark last busy.

    void blk_pre_runtime_resume(struct request_queue *q)
    - Set q->rpm_status to RPM_RESUMING.

    void blk_post_runtime_resume(struct request_queue *q, int err)
    - If the resume succeeded then set q->rpm_status to RPM_ACTIVE
    and call __blk_run_queue, then mark last busy and autosuspend.
    Otherwise set q->rpm_status to RPM_SUSPENDED.

    The idea and API were designed by Alan Stern and are described here:
    http://marc.info/?l=linux-scsi&m=133727953625963&w=2
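
    A hedged sketch of how a low-level driver might wire these helpers into
    its runtime PM callbacks; dev_to_queue(), mydrv_quiesce_hw() and
    mydrv_wake_hw() are hypothetical stand-ins for driver specifics:

    static int mydrv_runtime_suspend(struct device *dev)
    {
            struct request_queue *q = dev_to_queue(dev);    /* hypothetical */
            int err;

            err = blk_pre_runtime_suspend(q);
            if (err)                        /* -EBUSY: requests still queued */
                    return err;
            err = mydrv_quiesce_hw(dev);    /* hypothetical hardware suspend */
            blk_post_runtime_suspend(q, err);
            return err;
    }

    static int mydrv_runtime_resume(struct device *dev)
    {
            struct request_queue *q = dev_to_queue(dev);    /* hypothetical */
            int err;

            blk_pre_runtime_resume(q);
            err = mydrv_wake_hw(dev);       /* hypothetical hardware resume */
            blk_post_runtime_resume(q, err);
            return err;
    }

    blk_pm_runtime_init(q, dev) is the one-time initialization call a driver
    would make at probe time, after the queue exists, per the description
    above.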

    Signed-off-by: Lin Ming
    Signed-off-by: Aaron Lu
    Acked-by: Alan Stern
    Signed-off-by: Jens Axboe

    Lin Ming
     
  • Fixed code indentation to use tabs where possible.

    Signed-off-by: Alice Ferrazzi
    Signed-off-by: Jens Axboe

    Alice Ferrazzi
     
  • Any partitions added by user space to the loop device were being
    left in place after detaching the loop device. This was because
    the detach path issued a BLKRRPART to clean up partitions if
    LO_FLAGS_PARTSCAN was set, meaning that the partitions were auto
    scanned on attach. Replace this BLKRRPART with code that
    unconditionally cleans up partitions on detach instead.

    Signed-off-by: Phillip Susi

    Modified by Jens to export delete_partition().

    Signed-off-by: Jens Axboe

    Phillip Susi
     

05 Mar, 2013

1 commit

  • rename() will change dentry->d_name. The result of this race can
    be worse than seeing a partially rewritten name; we might access
    a stale pointer, because rename() will re-allocate memory to hold
    a longer name.

    As accessing dentry->d_name must be protected by dentry->d_lock or
    the parent inode's i_mutex, while on the other hand cgroup_path() can
    be called with some irq-safe spinlocks held, we can't generate the
    cgroup path using dentry->d_name.

    Alternatively we make a copy of dentry->d_name and save it in
    cgrp->name when a cgroup is created, and update cgrp->name at
    rename().

    v5: use flexible array instead of zero-size array.
    v4: - allocate root_cgroup_name and all root_cgroup->name points to it.
    - add cgroup_name() wrapper.
    v3: use kfree_rcu() instead of synchronize_rcu() in user-visible path.
    v2: make cgrp->name RCU safe.
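
    A minimal sketch of the resulting data structure and accessor, following
    the description and version notes above (hedged, not the exact patch):

    struct cgroup_name {
            struct rcu_head rcu_head;
            char name[];            /* flexible array member, see v5 note */
    };

    static inline const char *cgroup_name(const struct cgroup *cgrp)
    {
            /* readers access the copy under RCU; rename() publishes a new
             * copy and frees the old one with kfree_rcu() (see v3 note) */
            return rcu_dereference(cgrp->name)->name;
    }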

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     

01 Mar, 2013

1 commit

  • Pull block IO core bits from Jens Axboe:
    "Below are the core block IO bits for 3.9. It was delayed a few days
    since my workstation kept crashing every 2-8h after pulling it into
    current -git, but turns out it is a bug in the new pstate code (divide
    by zero, will report separately). In any case, it contains:

    - The big cfq/blkcg update from Tejun and Vivek.

    - Additional block and writeback tracepoints from Tejun.

    - Improvement of the should sort (based on queues) logic in the plug
    flushing.

    - _io() variants of the wait_for_completion() interface, using
    io_schedule() instead of schedule() to contribute to io wait
    properly.

    - Various little fixes.

    You'll get two trivial merge conflicts, which should be easy enough to
    fix up"

    Fix up the trivial conflicts due to hlist traversal cleanups (commit
    b67bfe0d42ca: "hlist: drop the node parameter from iterators").

    * 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
    block: remove redundant check to bd_openers()
    block: use i_size_write() in bd_set_size()
    cfq: fix lock imbalance with failed allocations
    drivers/block/swim3.c: fix null pointer dereference
    block: don't select PERCPU_RWSEM
    block: account iowait time when waiting for completion of IO request
    sched: add wait_for_completion_io[_timeout]
    writeback: add more tracepoints
    block: add block_{touch|dirty}_buffer tracepoint
    buffer: make touch_buffer() an exported function
    block: add @req to bio_{front|back}_merge tracepoints
    block: add missing block_bio_complete() tracepoint
    block: Remove should_sort judgement when flush blk_plug
    block,elevator: use new hashtable implementation
    cfq-iosched: add hierarchical cfq_group statistics
    cfq-iosched: collect stats from dead cfqgs
    cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
    blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
    block: RCU free request_queue
    blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
    ...

    Linus Torvalds
     

28 Feb, 2013

8 commits

  • I'm not sure why, but the hlist for-each-entry iterators were conceived
    differently from the list ones. While the list iterator is simply:

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only
    do they not really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small number of places were using the 'node' parameter; these
    were modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch, which is mostly the work of Peter Senna Tschudin, is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;
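
    The user-visible effect of the conversion, before and after (struct foo
    and use() are illustrative placeholders):

    struct foo {
            int key;
            struct hlist_node member;
    };
    struct hlist_head *head;
    struct foo *tpos;
    struct hlist_node *pos;

    /* before: a scratch struct hlist_node cursor was required */
    hlist_for_each_entry(tpos, pos, head, member)
            use(tpos);

    /* after: same shape as list_for_each_entry() */
    hlist_for_each_entry(tpos, head, member)
            use(tpos);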

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foudnation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
  • Currently, sizeof(struct parsed_partitions) may be 64KB on a 32-bit arch,
    so it is easy to trigger a page allocation failure in check_partition,
    especially in hotplug block device situations (such as USB mass storage,
    MMC cards, ...), and Felipe Balbi has observed the failure.

    This patch makes the following optimizations to the allocation of struct
    parsed_partitions to address the issue (a sketch of the resulting
    allocation path follows the list):

    - make parsed_partitions.parts a pointer, so that the pointed-to memory
    can fit in a 32KB buffer and approximately 32KB of memory is saved

    - vmalloc the buffer pointed to by parsed_partitions.parts, because 32KB
    is still a bit big for kmalloc

    - given that many devices have a partition count limit, allocate only
    disk_max_parts() partitions instead of always allocating 256
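
    The resulting allocation path, roughly (a sketch along the lines
    described above, not the exact patch):

    static struct parsed_partitions *allocate_partitions(struct gendisk *hd)
    {
            struct parsed_partitions *state;
            int nr;

            state = kzalloc(sizeof(*state), GFP_KERNEL);
            if (!state)
                    return NULL;

            nr = disk_max_parts(hd);        /* no longer a fixed 256 */
            state->parts = vzalloc(nr * sizeof(state->parts[0]));
            if (!state->parts) {
                    kfree(state);
                    return NULL;
            }
            state->limit = nr;

            return state;
    }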

    Signed-off-by: Ming Lei
    Reported-by: Felipe Balbi
    Cc: Jens Axboe
    Reviewed-by: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ming Lei
     
  • It isn't necessary to read the information of partitions whose number is
    equal to or greater than state->limit, since at most state->limit
    partitions will be added inside rescan_partitions().

    That is also what the other partition parsers do.

    Signed-off-by: Ming Lei
    Cc: Jens Axboe
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ming Lei
     
  • UEFI 2.3.1D will include a change to the spec language mandating that a
    GPT header must be greater than *or equal to* the size of the defined
    structure. While verifying that this would work on Linux, I discovered
    that we're not actually checking the minimum bound at all.

    The result of this is that when we verify the checksum, it's possible that
    on a malformed header (with header_size of 0), we won't actually verify
    any data.
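
    The missing bound, in hedged form (an illustrative check, not the
    verbatim patch):

    /* reject headers that claim to be smaller than the structure itself,
     * otherwise the checksum may end up being computed over zero bytes */
    if (le32_to_cpu((*gpt)->header_size) < sizeof(gpt_header))
            goto fail;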

    [akpm@linux-foundation.org: fix printk warning]
    Signed-off-by: Peter Jones
    Acked-by: Matt Fleming
    Cc: Jens Axboe
    Cc: Stephen Warren
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Jones
     
  • AIX formatted disks do not always have the MSDOS 55aa signature.
    This happens e.g. for unbootable AIX disks.

    Up to now, such disks were not recognized as AIX disks, because of the
    missing 55aa. Fix that by inverting the two tests. Let's first
    check for the AIX magic strings, and only if that fails check for
    the MSDOS magic word.

    Signed-off-by: Philippe De Muyter
    Cc: Andreas Mohr
    Cc: OGAWA Hirofumi
    Cc: Jens Axboe
    Cc: Olaf Hering
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Philippe De Muyter
     
  • Convert to the much saner new idr interface. Both bsg and genhd
    protect the idr with a mutex, making preloading unnecessary.

    Signed-off-by: Tejun Heo
    Acked-by: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • idr allocation in blk_alloc_devt() wasn't synchronized against lookup
    and removal, and its limit check was off by one - 1 << MINORBITS is
    the number of minors allowed, not the maximum allowed minor.

    Add locking and rename MAX_EXT_DEVT to NR_EXT_DEVT and fix limit
    checking.

    Signed-off-by: Tejun Heo
    Acked-by: Jens Axboe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • While adding and removing a lot of disks and partitions, this
    sometimes shows up:

    WARNING: at fs/sysfs/dir.c:512 sysfs_add_one+0xc9/0x130() (Not tainted)
    Hardware name:
    sysfs: cannot create duplicate filename '/dev/block/259:751'
    Modules linked in: raid1 autofs4 bnx2fc cnic uio fcoe libfcoe libfc 8021q scsi_transport_fc scsi_tgt garp stp llc sunrpc cpufreq_ondemand powernow_k8 freq_table mperf ipv6 dm_mirror dm_region_hash dm_log power_meter microcode dcdbas serio_raw amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core k10temp bnx2 sg ixgbe dca mdio ext4 mbcache jbd2 dm_round_robin sr_mod cdrom sd_mod crc_t10dif ata_generic pata_acpi pata_atiixp ahci mptsas mptscsih mptbase scsi_transport_sas dm_multipath dm_mod [last unloaded: scsi_wait_scan]
    Pid: 44103, comm: async/16 Not tainted 2.6.32-195.el6.x86_64 #1
    Call Trace:
    warn_slowpath_common+0x87/0xc0
    warn_slowpath_fmt+0x46/0x50
    sysfs_add_one+0xc9/0x130
    sysfs_do_create_link+0x12b/0x170
    sysfs_create_link+0x13/0x20
    device_add+0x317/0x650
    idr_get_new+0x13/0x50
    add_partition+0x21c/0x390
    rescan_partitions+0x32b/0x470
    sd_open+0x81/0x1f0 [sd_mod]
    __blkdev_get+0x1b6/0x3c0
    blkdev_get+0x10/0x20
    register_disk+0x155/0x170
    add_disk+0xa6/0x160
    sd_probe_async+0x13b/0x210 [sd_mod]
    add_wait_queue+0x46/0x60
    async_thread+0x102/0x250
    default_wake_function+0x0/0x20
    async_thread+0x0/0x250
    kthread+0x96/0xa0
    child_rip+0xa/0x20
    kthread+0x0/0xa0
    child_rip+0x0/0x20

    This most likely happens because the dev_t is freed while the number is
    still in use and idr_get_new() is not protected on every use. The fix
    adds a mutex where there wasn't one before and moves the dev_t free so
    that it is called after device_del().

    Signed-off-by: Tomas Henzl
    Cc: Jens Axboe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tomas Henzl
     

24 Feb, 2013

1 commit

  • Apply the newly introduced pm_runtime_set_memalloc_noio() to block
    devices so that the PM core will teach mm not to allocate memory with
    GFP_IOFS when calling the runtime_resume and runtime_suspend callbacks
    for block devices and their ancestors.
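
    Conceptually the change amounts to flagging the disk's device when it is
    added and clearing the flag on removal; a hedged sketch:

    /* in add_disk(), roughly: */
    pm_runtime_set_memalloc_noio(disk_to_dev(disk), true);

    /* ... and the counterpart in del_gendisk(): */
    pm_runtime_set_memalloc_noio(disk_to_dev(disk), false);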

    Signed-off-by: Ming Lei
    Cc: Jens Axboe
    Cc: Minchan Kim
    Cc: Alan Stern
    Cc: Oliver Neukum
    Cc: Jiri Kosina
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Greg KH
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: David Decotigny
    Cc: Tom Herbert
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ming Lei
     

22 Feb, 2013

4 commits

  • While stress-running very-small container scenarios with the Kernel Memory
    Controller, I've run into a lockdep-detected lock imbalance in
    cfq-iosched.c.

    I'll apologize beforehand for not posting a backlog: I didn't anticipate
    it would be so hard to reproduce, so I didn't save my serial output and
    went directly on debugging. Turns out that it did not happen again in
    more than 20 runs, making it a quite rare pattern.

    But here is my analysis:

    When we are in very low-memory situations, we will arrive at
    cfq_find_alloc_queue and may not find a queue, having to resort to the oom
    queue, in an rcu-locked condition:

    if (!cfqq || cfqq == &cfqd->oom_cfqq)
    [ ... ]

    Next, we will release the rcu lock, and try to allocate a queue, retrying
    if we succeed:

    rcu_read_unlock();
    spin_unlock_irq(cfqd->queue->queue_lock);
    new_cfqq = kmem_cache_alloc_node(cfq_pool,
                                     gfp_mask | __GFP_ZERO,
                                     cfqd->queue->node);
    spin_lock_irq(cfqd->queue->queue_lock);
    if (new_cfqq)
            goto retry;

    We are unlocked at this point, but it should be fine, since we will
    reacquire the rcu_read_lock when we retry.

    Except of course, that we may not retry: the allocation may very well fail
    and we'll keep on going through the flow:

    The next branch is:

    if (cfqq) {
            [ ... ]
    } else
            cfqq = &cfqd->oom_cfqq;

    And right before exiting, we'll issue rcu_read_unlock().

    Being already unlocked, this is the likely source of our imbalance. Since
    cfqq is either already NULL or made NULL in the first statement of the
    outer branch, the only viable alternative here seems to be to return the
    oom queue right away in case of allocation failure.

    Please review the following patch and apply if you agree with my analysis.
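
    For orientation, the direction of that fix in sketch form (not the
    submitted patch itself):

    spin_lock_irq(cfqd->queue->queue_lock);
    if (new_cfqq)
            goto retry;
    else
            return &cfqd->oom_cfqq;         /* no second rcu_read_unlock() */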

    Signed-off-by: Glauber Costa
    Cc: Jens Axboe
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Glauber Costa
     
  • The block device doesn't use percpu rw-semaphore anymore, so don't select
    it for compilation.

    Signed-off-by: Mikulas Patocka
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Mikulas Patocka
     
  • This provides a band-aid to provide stable page writes on jbd without
    needing to backport the fixed locking and page writeback bit handling
    schemes of jbd2. The band-aid works by using bounce buffers to snapshot
    page contents instead of waiting.

    For those wondering about the ext3 bandage -- fixing the jbd locking
    (which was done as part of ext4dev years ago) is a lot of surgery, and
    setting PG_writeback on data pages when we actually hold the page lock
    dropped ext3 performance by nearly an order of magnitude. If we're
    going to migrate iscsi and raid to use stable page writes, the
    complaints about high latency will likely return. We might as well
    centralize their page snapshotting thing to one place.

    Signed-off-by: Darrick J. Wong
    Tested-by: Andy Lutomirski
    Cc: Adrian Hunter
    Cc: Artem Bityutskiy
    Reviewed-by: Jan Kara
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc: Steven Whitehouse
    Cc: Jens Axboe
    Cc: Eric Van Hensbergen
    Cc: Ron Minnich
    Cc: Latchesar Ionkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     
  • This patchset ("stable page writes, part 2") makes some key
    modifications to the original 'stable page writes' patchset. First, it
    provides creators (devices and filesystems) of a backing_dev_info a flag
    that declares whether or not it is necessary to ensure that page
    contents cannot change during writeout. It is no longer assumed that
    this is true of all devices (which was never true anyway). Second, the
    flag is used to relax the wait_on_page_writeback calls so that the wait
    only occurs if the device needs it. Third, it fixes up the remaining
    disk-backed filesystems to use this improved conditional-wait logic to
    provide stable page writes on those filesystems.

    It is hoped that (for people not using checksumming devices, anyway)
    this patchset will give back the performance that was unnecessarily lost
    when the original stable page write patchset went into 3.0. Sorry about
    not fixing it sooner.

    Complaints were registered by several people about the long write
    latencies introduced by the original stable page write patchset.
    Generally speaking, the kernel ought to allocate as little extra memory
    as possible to facilitate writeout, but for people who simply cannot
    wait, a second page stability strategy is (re)introduced: snapshotting
    page contents. The waiting behavior is still the default strategy; to
    enable page snapshotting, a superblock flag (MS_SNAP_STABLE) must be
    set. This flag is used to bandaid^Henable stable page writeback on
    ext3[1], and is not used anywhere else.

    Given that there are already a few storage devices and network FSes that
    have rolled their own page stability wait/page snapshot code, it would
    be nice to move towards consolidating all of these. It seems possible
    that iscsi and raid5 may wish to use the new stable page write support
    to enable zero-copy writeout.

    Thank you to Jan Kara for helping fix a couple more filesystems.

    Per Andrew Morton's request, here are the result of using dbench to measure
    latencies on ext2:

    3.8.0-rc3:
    Operation Count AvgLat MaxLat
    ----------------------------------------
    WriteX 109347 0.028 59.817
    ReadX 347180 0.004 3.391
    Flush 15514 29.828 287.283

    Throughput 57.429 MB/sec 4 clients 4 procs max_latency=287.290 ms

    3.8.0-rc3 + patches:
    WriteX 105556 0.029 4.273
    ReadX 335004 0.005 4.112
    Flush 14982 30.540 298.634

    Throughput 55.4496 MB/sec 4 clients 4 procs max_latency=298.650 ms

    As you can see, for ext2 the maximum write latency decreases from ~60ms
    on a laptop hard disk to ~4ms. I'm not sure why the flush latencies
    increase, though I suspect that being able to dirty pages faster gives
    the flusher more work to do.

    On ext4, the average write latency decreases as well as all the maximum
    latencies:

    3.8.0-rc3:
    WriteX 85624 0.152 33.078
    ReadX 272090 0.010 61.210
    Flush 12129 36.219 168.260

    Throughput 44.8618 MB/sec 4 clients 4 procs max_latency=168.276 ms

    3.8.0-rc3 + patches:
    WriteX 86082 0.141 30.928
    ReadX 273358 0.010 36.124
    Flush 12214 34.800 165.689

    Throughput 44.9941 MB/sec 4 clients 4 procs max_latency=165.722 ms

    XFS seems to exhibit similar latency improvements as ext2:

    3.8.0-rc3:
    WriteX 125739 0.028 104.343
    ReadX 399070 0.005 4.115
    Flush 17851 25.004 131.390

    Throughput 66.0024 MB/sec 4 clients 4 procs max_latency=131.406 ms

    3.8.0-rc3 + patches:
    WriteX 123529 0.028 6.299
    ReadX 392434 0.005 4.287
    Flush 17549 25.120 188.687

    Throughput 64.9113 MB/sec 4 clients 4 procs max_latency=188.704 ms

    ...and btrfs, just to round things out, also shows some latency
    decreases:

    3.8.0-rc3:
    WriteX 67122 0.083 82.355
    ReadX 212719 0.005 2.828
    Flush 9547 47.561 147.418

    Throughput 35.3391 MB/sec 4 clients 4 procs max_latency=147.433 ms

    3.8.0-rc3 + patches:
    WriteX 64898 0.101 71.631
    ReadX 206673 0.005 7.123
    Flush 9190 47.963 219.034

    Throughput 34.0795 MB/sec 4 clients 4 procs max_latency=219.044 ms

    Before this patchset, all filesystems would block, regardless of whether
    or not it was necessary. ext3 would wait, but still generate occasional
    checksum errors. The network filesystems were left to do their own
    thing, so they'd wait too.

    After this patchset, all the disk filesystems except ext3 and btrfs will
    wait only if the hardware requires it. ext3 (if necessary) snapshots
    pages instead of blocking, and btrfs provides its own bdi so the mm will
    never wait. Network filesystems haven't been touched, so either they
    provide their own wait code, or they don't block at all. The blocking
    behavior is back to what it was before 3.0 if you don't have a disk
    requiring stable page writes.

    This patchset has been tested on 3.8.0-rc3 on x64 with ext3, ext4, and
    xfs. I've spot-checked 3.8.0-rc4 and seem to be getting the same
    results as -rc3.

    [1] The alternative fixes to ext3 include fixing the locking order and
    page bit handling like we did for ext4 (but then why not just use
    ext4?), or setting PG_writeback so early that ext3 becomes extremely
    slow. I tried that, but the number of write()s I could initiate dropped
    by nearly an order of magnitude. That was a bit much even for the
    author of the stable page series! :)

    This patch:

    Creates a per-backing-device flag that tracks whether or not pages must
    be held immutable during writeout. Eventually it will be used to waive
    wait_for_page_writeback() if nothing requires stable pages.
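
    How the flag is meant to be consumed once the later patches in the
    series land, as a hedged sketch (the helper names follow the rest of
    the series):

    void wait_for_stable_page(struct page *page)
    {
            struct backing_dev_info *bdi = page->mapping->backing_dev_info;

            if (!bdi_cap_stable_pages_required(bdi))
                    return;         /* this device copes with changing pages */

            wait_on_page_writeback(page);
    }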

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Jan Kara
    Cc: Adrian Hunter
    Cc: Andy Lutomirski
    Cc: Artem Bityutskiy
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc: Steven Whitehouse
    Cc: Jens Axboe
    Cc: Eric Van Hensbergen
    Cc: Ron Minnich
    Cc: Latchesar Ionkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     

20 Feb, 2013

2 commits

  • Pull async changes from Tejun Heo:
    "These are followups for the earlier deadlock issue involving async
    ending up waiting for itself through block requesting a module[1]. The
    following changes are made by these commits.

    - Instead of requesting default elevator on each request_queue init,
    block now requests it once early during boot.

    - Kmod triggers warning if invoked from an async worker.

    - Async synchronization implementation has been reimplemented. It's
    a lot simpler now."

    * 'for-3.9-async' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    async: initialise list heads to fix crash
    async: replace list of active domains with global list of pending items
    async: keep pending tasks on async_domain and remove async_pending
    async: use ULLONG_MAX for infinity cookie value
    async: bring sanity to the use of words domain and running
    async, kmod: warn on synchronous request_module() from async workers
    block: don't request module during elevator init
    init, block: try to load default elevator module early during boot

    Linus Torvalds
     
  • Pull scheduler changes from Ingo Molnar:
    "Main changes:

    - scheduler side full-dynticks (user-space execution is undisturbed
    and receives no timer IRQs) preparation changes that convert the
    cputime accounting code to be full-dynticks ready, from Frederic
    Weisbecker.

    - Initial sched.h split-up changes, by Clark Williams

    - select_idle_sibling() performance improvement by Mike Galbraith:

    " 1 tbench pair (worst case) in a 10 core + SMT package:

    pre 15.22 MB/sec 1 procs
    post 252.01 MB/sec 1 procs "

    - sched_rr_get_interval() ABI fix/change. We think this detail is not
    used by apps (so it's not an ABI in practice), but let's keep it
    under observation.

    - misc RT scheduling cleanups, optimizations"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
    sched/rt: Add header to
    cputime: Remove irqsave from seqlock readers
    sched, powerpc: Fix sched.h split-up build failure
    cputime: Restore CPU_ACCOUNTING config defaults for PPC64
    sched/rt: Move rt specific bits into new header file
    sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice
    sched: Move sched.h sysctl bits into separate header
    sched: Fix signedness bug in yield_to()
    sched: Fix select_idle_sibling() bouncing cow syndrome
    sched/rt: Further simplify pick_rt_task()
    sched/rt: Do not account zero delta_exec in update_curr_rt()
    cputime: Safely read cputime of full dynticks CPUs
    kvm: Prepare to add generic guest entry/exit callbacks
    cputime: Use accessors to read task cputime stats
    cputime: Allow dynamic switch between tick/virtual based cputime accounting
    cputime: Generic on-demand virtual cputime accounting
    cputime: Move default nsecs_to_cputime() to jiffies based cputime file
    cputime: Librarize per nsecs resolution cputime definitions
    cputime: Avoid multiplication overflow on utime scaling
    context_tracking: Export context state for generic vtime
    ...

    Fix up conflict in kernel/context_tracking.c due to comment additions.

    Linus Torvalds
     

15 Feb, 2013

1 commit

  • Using wait_for_completion() to wait for an IO request to be executed
    results in wrong iowait time accounting. For example, a system whose
    only task is doing write() and fdatasync() on a block device can be
    reported as idle instead of iowaiting, as it should be, because
    blkdev_issue_flush() calls wait_for_completion(), which in turn calls
    schedule(), which does not increment the iowait proc counter and thus
    does not turn on iowait time accounting.

    The patch makes block layer use wait_for_completion_io() instead of
    wait_for_completion() where appropriate to account iowait time
    correctly.
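
    In a submit-and-wait path such as blkdev_issue_flush(), the change boils
    down to something like this (a hedged sketch, not the verbatim diff):

    DECLARE_COMPLETION_ONSTACK(wait);

    bio->bi_private = &wait;        /* completed by the bio's end_io handler */
    submit_bio(WRITE_FLUSH, bio);

    /* old: wait_for_completion(&wait) -> schedule(), time shows up as idle */
    /* new: io_schedule() under the hood, time is accounted as iowait */
    wait_for_completion_io(&wait);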

    Signed-off-by: Vladimir Davydov
    Signed-off-by: Jens Axboe

    Vladimir Davydov