01 Mar, 2013

1 commit

  • Pull block IO core bits from Jens Axboe:
    "Below are the core block IO bits for 3.9. It was delayed a few days
    since my workstation kept crashing every 2-8h after pulling it into
    current -git, but turns out it is a bug in the new pstate code (divide
    by zero, will report separately). In any case, it contains:

    - The big cfq/blkcg update from Tejun and Vivek.

    - Additional block and writeback tracepoints from Tejun.

    - Improvement of the should_sort (based on queues) logic in the plug
    flushing.

    - _io() variants of the wait_for_completion() interface, using
    io_schedule() instead of schedule() to contribute to io wait
    properly.

    - Various little fixes.

    You'll get two trivial merge conflicts, which should be easy enough to
    fix up"

    Fix up the trivial conflicts due to hlist traversal cleanups (commit
    b67bfe0d42ca: "hlist: drop the node parameter from iterators").

    * 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
    block: remove redundant check to bd_openers()
    block: use i_size_write() in bd_set_size()
    cfq: fix lock imbalance with failed allocations
    drivers/block/swim3.c: fix null pointer dereference
    block: don't select PERCPU_RWSEM
    block: account iowait time when waiting for completion of IO request
    sched: add wait_for_completion_io[_timeout]
    writeback: add more tracepoints
    block: add block_{touch|dirty}_buffer tracepoint
    buffer: make touch_buffer() an exported function
    block: add @req to bio_{front|back}_merge tracepoints
    block: add missing block_bio_complete() tracepoint
    block: Remove should_sort judgement when flush blk_plug
    block,elevator: use new hashtable implementation
    cfq-iosched: add hierarchical cfq_group statistics
    cfq-iosched: collect stats from dead cfqgs
    cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
    blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
    block: RCU free request_queue
    blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
    ...

    Linus Torvalds
     

28 Feb, 2013

8 commits

  • I'm not sure why, but the hlist for-each-entry iterators were conceived
    differently from the list ones, which simply look like:

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only
    do they not really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.
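
    A before/after sketch at a call site (hypothetical struct foo and
    bucket; 'n' is the extra struct hlist_node cursor the old API
    demanded):

        struct foo {
                int key;
                struct hlist_node node;
        };
        struct foo *pos;
        struct hlist_node *n;    /* only the old API needs this */
        /* head points at some struct hlist_head bucket */

        /* before: hlist_for_each_entry(tpos, pos, head, member) */
        hlist_for_each_entry(pos, n, head, node)
                printk("%d\n", pos->key);

        /* after: same shape as list_for_each_entry(pos, head, member) */
        hlist_for_each_entry(pos, head, node)
                printk("%d\n", pos->key);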

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A small number of places were using the 'node' parameter; these
    were modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue,
    hlist_for_each_entry_from, hlist_for_each_entry_rcu,
    hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh,
    for_each_busy_worker, ax25_uid_for_each, ax25_for_each,
    inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each,
    sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound,
    hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu,
    nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each,
    nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp,
    for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foundation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
  • Currently, sizeof(struct parsed_partitions) may be 64KB on 32-bit
    arches, so it is easy for check_partition() to trigger a page
    allocation failure, especially in block device hotplug situations
    (such as USB mass storage, MMC cards, ...); Felipe Balbi has observed
    the failure.

    This patch makes the following optimizations to the allocation of
    struct parsed_partitions to address the issue:

    - make parsed_partitions.parts a pointer so that the pointed-to
    memory fits in its own 32KB buffer, saving roughly 32KB in the
    structure itself

    - vmalloc the buffer pointed to by parsed_partitions.parts, because
    32KB is still a bit big for kmalloc

    - given that many devices have a partition count limit, allocate only
    disk_max_parts() partitions instead of always allocating 256
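
    The resulting allocator looks roughly like this (sketch;
    disk_max_parts() is the existing helper, error handling condensed):

        static struct parsed_partitions *allocate_partitions(struct gendisk *hd)
        {
                struct parsed_partitions *state;
                int nr = disk_max_parts(hd);

                state = kzalloc(sizeof(*state), GFP_KERNEL);
                if (!state)
                        return NULL;
                /* the big parts[] array is now separate and vmalloc'ed */
                state->parts = vzalloc(nr * sizeof(state->parts[0]));
                if (!state->parts) {
                        kfree(state);
                        return NULL;
                }
                state->limit = nr;
                return state;
        }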

    Signed-off-by: Ming Lei
    Reported-by: Felipe Balbi
    Cc: Jens Axboe
    Reviewed-by: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ming Lei
     
  • It isn't necessary to read the information of partitions whose number
    is equal to or greater than state->limit, since at most state->limit
    partitions will be added inside rescan_partitions().

    That is also what the parsers for other kinds of partitions do.

    Signed-off-by: Ming Lei
    Cc: Jens Axboe
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ming Lei
     
  • UEFI 2.3.1D will include a change to the spec language mandating that a
    GPT header must be greater than *or equal to* the size of the defined
    structure. While verifying that this would work on Linux, I discovered
    that we're not actually checking the minimum bound at all.

    The result of this is that when we verify the checksum, it's possible that
    on a malformed header (with header_size of 0), we won't actually verify
    any data.
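
    A sketch of the added bound check (le32_to_cpu(), gpt_header and
    bdev_logical_block_size() are existing kernel interfaces; placement
    within the EFI partition code is paraphrased):

        /* a header_size of 0 would otherwise checksum zero bytes and
         * "verify" trivially; also reject sizes larger than a block */
        if (le32_to_cpu(gpt->header_size) < sizeof(gpt_header) ||
            le32_to_cpu(gpt->header_size) > bdev_logical_block_size(bdev))
                return 0;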

    [akpm@linux-foundation.org: fix printk warning]
    Signed-off-by: Peter Jones
    Acked-by: Matt Fleming
    Cc: Jens Axboe
    Cc: Stephen Warren
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Jones
     
  • AIX formatted disks do not always have the MSDOS 55aa signature.
    This happens e.g. for unbootable AIX disks.

    Up to now, such disks were not recognized as AIX disks, because of the
    missing 55aa. Fix that by inverting the two tests. Let's first
    check for the AIX magic strings, and only if that fails check for
    the MSDOS magic word.

    Signed-off-by: Philippe De Muyter
    Cc: Andreas Mohr
    Cc: OGAWA Hirofumi
    Cc: Jens Axboe
    Cc: Olaf Hering
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Philippe De Muyter
     
  • Convert to the much saner new idr interface. Both bsg and genhd
    protect the idr with a mutex, making preloading unnecessary.

    Signed-off-by: Tejun Heo
    Acked-by: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • idr allocation in blk_alloc_devt() wasn't synchronized against lookup
    and removal, and its limit check was off by one: 1 << MINORBITS is
    the number of minors allowed, not the maximum allowed minor.

    Add locking, rename MAX_EXT_DEVT to NR_EXT_DEVT, and fix the limit
    check.
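
    Illustratively (a sketch of the off-by-one only; the actual patch
    also wraps the idr calls in a lock):

        /* 1 << MINORBITS is the number of extended minors, so valid
         * indices are 0 .. (1 << MINORBITS) - 1 */
        #define NR_EXT_DEVT (1 << MINORBITS)

        if (idx >= NR_EXT_DEVT)   /* the old check effectively let
                                     idx == NR_EXT_DEVT slip through */
                return -EBUSY;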

    Signed-off-by: Tejun Heo
    Acked-by: Jens Axboe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • While adding and removing a lot of disks and partitions, this
    sometimes shows up:

    WARNING: at fs/sysfs/dir.c:512 sysfs_add_one+0xc9/0x130() (Not tainted)
    Hardware name:
    sysfs: cannot create duplicate filename '/dev/block/259:751'
    Modules linked in: raid1 autofs4 bnx2fc cnic uio fcoe libfcoe libfc 8021q scsi_transport_fc scsi_tgt garp stp llc sunrpc cpufreq_ondemand powernow_k8 freq_table mperf ipv6 dm_mirror dm_region_hash dm_log power_meter microcode dcdbas serio_raw amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core k10temp bnx2 sg ixgbe dca mdio ext4 mbcache jbd2 dm_round_robin sr_mod cdrom sd_mod crc_t10dif ata_generic pata_acpi pata_atiixp ahci mptsas mptscsih mptbase scsi_transport_sas dm_multipath dm_mod [last unloaded: scsi_wait_scan]
    Pid: 44103, comm: async/16 Not tainted 2.6.32-195.el6.x86_64 #1
    Call Trace:
    warn_slowpath_common+0x87/0xc0
    warn_slowpath_fmt+0x46/0x50
    sysfs_add_one+0xc9/0x130
    sysfs_do_create_link+0x12b/0x170
    sysfs_create_link+0x13/0x20
    device_add+0x317/0x650
    idr_get_new+0x13/0x50
    add_partition+0x21c/0x390
    rescan_partitions+0x32b/0x470
    sd_open+0x81/0x1f0 [sd_mod]
    __blkdev_get+0x1b6/0x3c0
    blkdev_get+0x10/0x20
    register_disk+0x155/0x170
    add_disk+0xa6/0x160
    sd_probe_async+0x13b/0x210 [sd_mod]
    add_wait_queue+0x46/0x60
    async_thread+0x102/0x250
    default_wake_function+0x0/0x20
    async_thread+0x0/0x250
    kthread+0x96/0xa0
    child_rip+0xa/0x20
    kthread+0x0/0xa0
    child_rip+0x0/0x20

    This most likely happens because the dev_t is freed while the number
    is still in use and idr_get_new() is not protected on every use. The
    fix adds a mutex where one was missing and moves the dev_t free
    function so it is called after device_del().

    Signed-off-by: Tomas Henzl
    Cc: Jens Axboe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tomas Henzl
     

24 Feb, 2013

1 commit

  • Apply the newly introduced pm_runtime_set_memalloc_noio() to block
    devices so that the PM core will teach mm not to allocate memory with
    GFP_IOFS when calling the runtime_resume and runtime_suspend
    callbacks for block devices and their ancestors.
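
    The block-layer side is just a pair of calls (sketch; the helper
    comes from the PM series this patch builds on):

        /* add_disk(): resume/suspend callbacks of this device and its
         * ancestors must not allocate with GFP_IOFS */
        pm_runtime_set_memalloc_noio(disk_to_dev(disk), true);

        /* del_gendisk(): */
        pm_runtime_set_memalloc_noio(disk_to_dev(disk), false);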

    Signed-off-by: Ming Lei
    Cc: Jens Axboe
    Cc: Minchan Kim
    Cc: Alan Stern
    Cc: Oliver Neukum
    Cc: Jiri Kosina
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Greg KH
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: David Decotigny
    Cc: Tom Herbert
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ming Lei
     

22 Feb, 2013

4 commits

  • While stress-running very-small container scenarios with the Kernel Memory
    Controller, I've run into a lockdep-detected lock imbalance in
    cfq-iosched.c.

    I'll apologize beforehand for not posting a backlog: I didn't anticipate
    it would be so hard to reproduce, so I didn't save my serial output and
    went directly on debugging. Turns out that it did not happen again in
    more than 20 runs, making it a quite rare pattern.

    But here is my analysis:

    When we are in very low-memory situations, we will arrive at
    cfq_find_alloc_queue and may not find a queue, having to resort to the oom
    queue, in an rcu-locked condition:

        if (!cfqq || cfqq == &cfqd->oom_cfqq)
                [ ... ]

    Next, we will release the rcu lock, and try to allocate a queue, retrying
    if we succeed:

        rcu_read_unlock();
        spin_unlock_irq(cfqd->queue->queue_lock);
        new_cfqq = kmem_cache_alloc_node(cfq_pool,
                                         gfp_mask | __GFP_ZERO,
                                         cfqd->queue->node);
        spin_lock_irq(cfqd->queue->queue_lock);
        if (new_cfqq)
                goto retry;

    We are unlocked at this point, but it should be fine, since we will
    reacquire the rcu_read_lock when we retry.

    Except of course, that we may not retry: the allocation may very well fail
    and we'll keep on going through the flow:

    The next branch is:

        if (cfqq) {
                [ ... ]
        } else
                cfqq = &cfqd->oom_cfqq;

    And right before exiting, we'll issue rcu_read_unlock().

    Being already unlocked, this is the likely source of our imbalance.
    Since cfqq is either already NULL or made NULL in the first statement
    of the outer branch, the only viable alternative here seems to be to
    return the oom queue right away in case of allocation failure.
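
    In code, the suggested fix amounts to returning it at the retry site
    (sketch based on the excerpt above):

        spin_lock_irq(cfqd->queue->queue_lock);
        if (new_cfqq)
                goto retry;              /* retakes rcu_read_lock */
        else
                return &cfqd->oom_cfqq;  /* don't fall through to the
                                            final rcu_read_unlock() */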

    Please review the following patch and apply if you agree with my analysis.

    Signed-off-by: Glauber Costa
    Cc: Jens Axboe
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Glauber Costa
     
  • The block device doesn't use percpu rw-semaphore anymore, so don't select
    it for compilation.

    Signed-off-by: Mikulas Patocka
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Mikulas Patocka
     
  • This provides a band-aid to provide stable page writes on jbd without
    needing to backport the fixed locking and page writeback bit handling
    schemes of jbd2. The band-aid works by using bounce buffers to snapshot
    page contents instead of waiting.

    For those wondering about the ext3 bandage -- fixing the jbd locking
    (which was done as part of ext4dev years ago) is a lot of surgery, and
    setting PG_writeback on data pages when we actually hold the page lock
    dropped ext3 performance by nearly an order of magnitude. If we're
    going to migrate iscsi and raid to use stable page writes, the
    complaints about high latency will likely return. We might as well
    centralize their page snapshotting in one place.

    Signed-off-by: Darrick J. Wong
    Tested-by: Andy Lutomirski
    Cc: Adrian Hunter
    Cc: Artem Bityutskiy
    Reviewed-by: Jan Kara
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc: Steven Whitehouse
    Cc: Jens Axboe
    Cc: Eric Van Hensbergen
    Cc: Ron Minnich
    Cc: Latchesar Ionkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     
  • This patchset ("stable page writes, part 2") makes some key
    modifications to the original 'stable page writes' patchset. First, it
    provides creators (devices and filesystems) of a backing_dev_info a flag
    that declares whether or not it is necessary to ensure that page
    contents cannot change during writeout. It is no longer assumed that
    this is true of all devices (which was never true anyway). Second, the
    flag is used to relax the wait_on_page_writeback calls so that the wait
    only occurs if the device needs it. Third, it fixes up the remaining
    disk-backed filesystems to use this improved conditional-wait logic to
    provide stable page writes on those filesystems.

    It is hoped that (for people not using checksumming devices, anyway)
    this patchset will give back the performance unnecessarily lost since
    the original stable page write patchset went into 3.0. Sorry about not
    fixing it sooner.

    Complaints were registered by several people about the long write
    latencies introduced by the original stable page write patchset.
    Generally speaking, the kernel ought to allocate as little extra memory
    as possible to facilitate writeout, but for people who simply cannot
    wait, a second page stability strategy is (re)introduced: snapshotting
    page contents. The waiting behavior is still the default strategy; to
    enable page snapshotting, a superblock flag (MS_SNAP_STABLE) must be
    set. This flag is used to bandaid^Henable stable page writeback on
    ext3[1], and is not used anywhere else.

    Given that there are already a few storage devices and network FSes that
    have rolled their own page stability wait/page snapshot code, it would
    be nice to move towards consolidating all of these. It seems possible
    that iscsi and raid5 may wish to use the new stable page write support
    to enable zero-copy writeout.

    Thank you to Jan Kara for helping fix a couple more filesystems.

    Per Andrew Morton's request, here are the results of using dbench to
    measure latencies (in ms) on ext2:

    3.8.0-rc3:
    Operation      Count    AvgLat    MaxLat
    ----------------------------------------
    WriteX        109347     0.028    59.817
    ReadX         347180     0.004     3.391
    Flush          15514    29.828   287.283

    Throughput 57.429 MB/sec  4 clients  4 procs  max_latency=287.290 ms

    3.8.0-rc3 + patches:
    Operation      Count    AvgLat    MaxLat
    ----------------------------------------
    WriteX        105556     0.029     4.273
    ReadX         335004     0.005     4.112
    Flush          14982    30.540   298.634

    Throughput 55.4496 MB/sec  4 clients  4 procs  max_latency=298.650 ms

    As you can see, for ext2 the maximum write latency decreases from ~60ms
    on a laptop hard disk to ~4ms. I'm not sure why the flush latencies
    increase, though I suspect that being able to dirty pages faster gives
    the flusher more work to do.

    On ext4, the average write latency decreases as well as all the maximum
    latencies:

    3.8.0-rc3:
    Operation      Count    AvgLat    MaxLat
    ----------------------------------------
    WriteX         85624     0.152    33.078
    ReadX         272090     0.010    61.210
    Flush          12129    36.219   168.260

    Throughput 44.8618 MB/sec  4 clients  4 procs  max_latency=168.276 ms

    3.8.0-rc3 + patches:
    Operation      Count    AvgLat    MaxLat
    ----------------------------------------
    WriteX         86082     0.141    30.928
    ReadX         273358     0.010    36.124
    Flush          12214    34.800   165.689

    Throughput 44.9941 MB/sec  4 clients  4 procs  max_latency=165.722 ms

    XFS seems to exhibit similar latency improvements as ext2:

    3.8.0-rc3:
    Operation      Count    AvgLat    MaxLat
    ----------------------------------------
    WriteX        125739     0.028   104.343
    ReadX         399070     0.005     4.115
    Flush          17851    25.004   131.390

    Throughput 66.0024 MB/sec  4 clients  4 procs  max_latency=131.406 ms

    3.8.0-rc3 + patches:
    Operation      Count    AvgLat    MaxLat
    ----------------------------------------
    WriteX        123529     0.028     6.299
    ReadX         392434     0.005     4.287
    Flush          17549    25.120   188.687

    Throughput 64.9113 MB/sec  4 clients  4 procs  max_latency=188.704 ms

    ...and btrfs, just to round things out, also shows some latency
    decreases:

    3.8.0-rc3:
    Operation      Count    AvgLat    MaxLat
    ----------------------------------------
    WriteX         67122     0.083    82.355
    ReadX         212719     0.005     2.828
    Flush           9547    47.561   147.418

    Throughput 35.3391 MB/sec  4 clients  4 procs  max_latency=147.433 ms

    3.8.0-rc3 + patches:
    Operation      Count    AvgLat    MaxLat
    ----------------------------------------
    WriteX         64898     0.101    71.631
    ReadX         206673     0.005     7.123
    Flush           9190    47.963   219.034

    Throughput 34.0795 MB/sec  4 clients  4 procs  max_latency=219.044 ms

    Before this patchset, all filesystems would block, regardless of whether
    or not it was necessary. ext3 would wait, but still generate occasional
    checksum errors. The network filesystems were left to do their own
    thing, so they'd wait too.

    After this patchset, all the disk filesystems except ext3 and btrfs will
    wait only if the hardware requires it. ext3 (if necessary) snapshots
    pages instead of blocking, and btrfs provides its own bdi so the mm will
    never wait. Network filesystems haven't been touched, so either they
    provide their own wait code, or they don't block at all. The blocking
    behavior is back to what it was before 3.0 if you don't have a disk
    requiring stable page writes.

    This patchset has been tested on 3.8.0-rc3 on x64 with ext3, ext4, and
    xfs. I've spot-checked 3.8.0-rc4 and seem to be getting the same
    results as -rc3.

    [1] The alternative fixes to ext3 include fixing the locking order and
    page bit handling like we did for ext4 (but then why not just use
    ext4?), or setting PG_writeback so early that ext3 becomes extremely
    slow. I tried that, but the number of write()s I could initiate dropped
    by nearly an order of magnitude. That was a bit much even for the
    author of the stable page series! :)

    This patch:

    Creates a per-backing-device flag that tracks whether or not pages must
    be held immutable during writeout. Eventually it will be used to waive
    wait_for_page_writeback() if nothing requires stable pages.
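
    Concretely, the flag is a bdi capability bit with a query helper (as
    added to include/linux/backing-dev.h; sketch):

        #define BDI_CAP_STABLE_WRITES   0x00000200

        static inline bool
        bdi_cap_stable_pages_required(struct backing_dev_info *bdi)
        {
                return bdi->capabilities & BDI_CAP_STABLE_WRITES;
        }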

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Jan Kara
    Cc: Adrian Hunter
    Cc: Andy Lutomirski
    Cc: Artem Bityutskiy
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc: Steven Whitehouse
    Cc: Jens Axboe
    Cc: Eric Van Hensbergen
    Cc: Ron Minnich
    Cc: Latchesar Ionkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     

20 Feb, 2013

2 commits

  • Pull async changes from Tejun Heo:
    "These are followups for the earlier deadlock issue involving async
    ending up waiting for itself through block requesting module[1]. The
    following changes are made by these commits.

    - Instead of requesting default elevator on each request_queue init,
    block now requests it once early during boot.

    - Kmod triggers warning if invoked from an async worker.

    - Async synchronization implementation has been reimplemented. It's
    a lot simpler now."

    * 'for-3.9-async' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    async: initialise list heads to fix crash
    async: replace list of active domains with global list of pending items
    async: keep pending tasks on async_domain and remove async_pending
    async: use ULLONG_MAX for infinity cookie value
    async: bring sanity to the use of words domain and running
    async, kmod: warn on synchronous request_module() from async workers
    block: don't request module during elevator init
    init, block: try to load default elevator module early during boot

    Linus Torvalds
     
  • Pull scheduler changes from Ingo Molnar:
    "Main changes:

    - scheduler side full-dynticks (user-space execution is undisturbed
    and receives no timer IRQs) preparation changes that convert the
    cputime accounting code to be full-dynticks ready, from Frederic
    Weisbecker.

    - Initial sched.h split-up changes, by Clark Williams

    - select_idle_sibling() performance improvement by Mike Galbraith:

    " 1 tbench pair (worst case) in a 10 core + SMT package:

    pre 15.22 MB/sec 1 procs
    post 252.01 MB/sec 1 procs "

    - sched_rr_get_interval() ABI fix/change. We think this detail is not
    used by apps (so it's not an ABI in practice), but let's keep it
    under observation.

    - misc RT scheduling cleanups, optimizations"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
    sched/rt: Add header to
    cputime: Remove irqsave from seqlock readers
    sched, powerpc: Fix sched.h split-up build failure
    cputime: Restore CPU_ACCOUNTING config defaults for PPC64
    sched/rt: Move rt specific bits into new header file
    sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice
    sched: Move sched.h sysctl bits into separate header
    sched: Fix signedness bug in yield_to()
    sched: Fix select_idle_sibling() bouncing cow syndrome
    sched/rt: Further simplify pick_rt_task()
    sched/rt: Do not account zero delta_exec in update_curr_rt()
    cputime: Safely read cputime of full dynticks CPUs
    kvm: Prepare to add generic guest entry/exit callbacks
    cputime: Use accessors to read task cputime stats
    cputime: Allow dynamic switch between tick/virtual based cputime accounting
    cputime: Generic on-demand virtual cputime accounting
    cputime: Move default nsecs_to_cputime() to jiffies based cputime file
    cputime: Librarize per nsecs resolution cputime definitions
    cputime: Avoid multiplication overflow on utime scaling
    context_tracking: Export context state for generic vtime
    ...

    Fix up conflict in kernel/context_tracking.c due to comment additions.

    Linus Torvalds
     

15 Feb, 2013

1 commit

  • Using wait_for_completion() to wait for an IO request to be executed
    results in wrong iowait time accounting. For example, a system whose
    only task is doing write() and fdatasync() on a block device can be
    reported as idle instead of iowaiting, as it should be, because
    blkdev_issue_flush() calls wait_for_completion(), which in turn calls
    schedule(), which does not increment the iowait proc counter and thus
    does not turn on iowait time accounting.

    The patch makes block layer use wait_for_completion_io() instead of
    wait_for_completion() where appropriate to account iowait time
    correctly.
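
    Condensed from blkdev_issue_flush() after the change (sketch; error
    handling omitted):

        DECLARE_COMPLETION_ONSTACK(wait);
        struct bio *bio = bio_alloc(gfp_mask, 0);

        bio->bi_end_io  = bio_end_flush;   /* calls complete(&wait) */
        bio->bi_bdev    = bdev;
        bio->bi_private = &wait;

        submit_bio(WRITE_FLUSH, bio);
        wait_for_completion_io(&wait);     /* sleeps via io_schedule(),
                                              so the time is accounted
                                              as iowait, not idle */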

    Signed-off-by: Vladimir Davydov
    Signed-off-by: Jens Axboe

    Vladimir Davydov
     

08 Feb, 2013

1 commit

  • Move the sysctl-related bits from include/linux/sched.h into
    a new file: include/linux/sched/sysctl.h. Then update source
    files requiring access to those bits by including the new
    header file.

    Signed-off-by: Clark Williams
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/20130207094659.06dced96@riff.lan
    Signed-off-by: Ingo Molnar

    Clark Williams
     

23 Jan, 2013

1 commit

  • The block layer allows an elevator which is built as a module to be
    selected as the system default via the kernel param "elevator=". This
    is achieved by automatically invoking request_module() whenever a new
    block device is initialized and the elevator is not available.

    This led to an interesting deadlock problem involving async and module
    init. Block device probing running off an async job invokes
    request_module(). While the module is being loaded, it performs
    async_synchronize_full() which ends up waiting for the async job which
    is already waiting for request_module() to finish, leading to
    deadlock.

    Invoking request_module() from deep in block device init path is
    already nasty in itself. It seems best to avoid these situations from
    the beginning by moving on-demand module loading out of block init
    path.

    The previous patch made sure that the default elevator module is
    loaded early during boot if available. This patch removes on-demand
    loading of the default elevator from elevator init path. As the
    module would have been loaded during boot, userland-visible behavior
    difference should be minimal.

    For more details, please refer to the following thread.

    http://thread.gmane.org/gmane.linux.kernel/1420814

    v2: The bool parameter was named @request_module which conflicted with
    request_module(). This built okay w/ CONFIG_MODULES because
    request_module() was defined as a macro. W/o CONFIG_MODULES, it
    causes build breakage. Rename the parameter to @try_loading.
    Reported by Fengguang.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Arjan van de Ven
    Cc: Linus Torvalds
    Cc: Alex Riesen
    Cc: Fengguang Wu

    Tejun Heo
     

19 Jan, 2013

1 commit

  • This patch adds default module loading and uses it to load the default
    block elevator. During boot, it's called right after initramfs or
    initrd is made available and right before control is passed to
    userland. This ensures that as long as the modules are available in
    the usual places in initramfs, initrd or the root filesystem, the
    default modules are loaded as soon as possible.

    This will replace the on-demand elevator module loading from elevator
    init path.

    v2: Fixed build breakage when !CONFIG_BLOCK. Reported by kbuild test
    robot.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Arjan van de Ven
    Cc: Linus Torvalds
    Cc: Alex Riesen
    Cc: Fengguang Wu

    Tejun Heo
     

14 Jan, 2013

2 commits

  • bio_{front|back}_merge tracepoints report a bio merging into an
    existing request but didn't specify which request the bio is being
    merged into. Add @req to them. This makes it impossible to share the
    event template with block_bio_queue - split it out.

    @req isn't used or exported to userland at this point and there is no
    userland visible behavior change. Later changes will make use of the
    extra parameter.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • bio completion didn't kick block_bio_complete TP. Only dm was
    explicitly triggering the TP on IO completion. This makes
    block_bio_complete TP useless for tracers which want to know about
    bios, and all other bio based drivers skip generating blktrace
    completion events.

    This patch makes all bio completions via bio_endio() generate
    block_bio_complete TP.

    * Explicit trace_block_bio_complete() invocation removed from dm and
    the trace point is unexported.

    * @rq dropped from trace_block_bio_complete(). bios may fly around
    w/o queue associated. Verifying and accessing the associated queue
    belongs to TP probes.

    * blktrace now gets both request and bio completions. Make it ignore
    bio completions if request completion path is happening.

    This makes all bio based drivers generate blktrace completion events
    properly and makes the block_bio_complete TP actually useful.

    v2: With this change, block_bio_complete TP could be invoked on sg
    commands which have bio's with %NULL bi_bdev. Update TP
    assignment code to check whether bio->bi_bdev is %NULL before
    dereferencing.
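
    The v2 check looks like this in the tracepoint assignment (sketch of
    the TP_fast_assign change; remaining fields elided):

        TP_fast_assign(
                /* sg commands can carry bios with no bdev attached */
                __entry->dev = bio->bi_bdev ? bio->bi_bdev->bd_dev : 0;
                /* other fields assigned as before */
        ),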

    Signed-off-by: Tejun Heo
    Original-patch-by: Namhyung Kim
    Cc: Tejun Heo
    Cc: Steven Rostedt
    Cc: Alasdair Kergon
    Cc: dm-devel@redhat.com
    Cc: Neil Brown
    Signed-off-by: Jens Axboe

    Tejun Heo
     

12 Jan, 2013

1 commit


11 Jan, 2013

2 commits

  • Commit 975927b942c932 added sorting by blk_rq_pos() when flushing
    plugged requests. That change targeted the situation where a blk_plug
    handles multiple devices at the same time, such as an md device, but
    often only a single device is involved. So remove the should_sort
    judgement. Since the should_sort parameter exists only for this
    purpose, it can also be deleted from blk_plug.
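
    The change in blk_flush_plug_list() is effectively (sketch):

        /* before: sort only if requests for multiple queues were seen */
        if (plug->should_sort) {
                list_sort(NULL, &list, plug_rq_cmp);
                plug->should_sort = 0;
        }

        /* after: sort unconditionally; should_sort is gone from
         * struct blk_plug */
        list_sort(NULL, &list, plug_rq_cmp);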

    CC: Shaohua Li
    Signed-off-by: Jianpeng Ma
    Signed-off-by: Jens Axboe

    Jianpeng Ma
     
  • Switch the elevator to use the new hashtable implementation. This
    reduces the amount of generic, unrelated code in the elevator.

    This also removes the dynamic allocation of the hash table. The size
    of the table is constant, so there's no point in paying the price of
    an extra dereference when accessing it.

    This patch depends on d9b482c ("hashtable: introduce a small and naive
    hashtable") which was merged in v3.6.

    Signed-off-by: Sasha Levin
    Signed-off-by: Jens Axboe

    Sasha Levin
     

10 Jan, 2013

15 commits

  • Unfortunately, at this point, there's no way to make the existing
    statistics hierarchical without creating nasty surprises for the
    existing users. Just create recursive counterpart of the existing
    stats.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • To support hierarchical stats, it's necessary to remember stats from
    dead children. Add cfqg->dead_stats and make a dying cfqg transfer
    its stats to the parent's dead-stats.

    The transfer happens form ->pd_offline_fn() and it is possible that
    there are some residual IOs completing afterwards. Currently, we lose
    these stats. Given that cgroup removal isn't a very high frequency
    operation and the amount of residual IOs on offline is likely to be
    nil or small, this shouldn't be a big deal and the complexity needed
    to handle residual IOs - another callback and rather elaborate
    synchronization to reach and lock the matching q - doesn't seem
    justified.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Separate out cfqg_stats_reset() which takes struct cfqg_stats * from
    cfq_pd_reset_stats() and move the latter to where other pd methods are
    defined. cfqg_stats_reset() will be used to implement hierarchical
    stats.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Instead of holding blkcg->lock while walking ->blkg_list and executing
    prfill(), RCU walk ->blkg_list and hold the blkg's queue lock while
    executing prfill(). This makes prfill() implementations easier as
    stats are mostly protected by queue lock.

    This will be used to implement hierarchical stats.
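
    The new walk looks roughly like this (sketch; blkg_list and
    blkcg_node are the existing blkcg fields):

        rcu_read_lock();
        hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
                /* per-blkg queue lock, instead of one big blkcg lock */
                spin_lock_irq(blkg->q->queue_lock);
                if (blkcg_policy_enabled(blkg->q, pol))
                        total += prfill(sf, blkg->pd[pol->plid], data);
                spin_unlock_irq(blkg->q->queue_lock);
        }
        rcu_read_unlock();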

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • RCU free request_queue so that blkcg_gq->q can be dereferenced under
    RCU lock. This will be used to implement hierarchical stats.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge().
    The former two collect the [rw]stats designated by the target policy
    data and offset from the pd's subtree. The latter two add one
    [rw]stat to another.

    Note that the recursive sum functions require the queue lock to be
    held on entry to make blkg online test reliable. This is necessary to
    properly handle stats of a dying blkg.

    These will be used to implement hierarchical stats.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Hierarchical stats for cfq-iosched will need __blkg_prfill_rwstat().
    Export it.

    Signed-off-by: Tejun Heo
    Reported-by: Fengguang Wu

    Tejun Heo
     
  • Rename blkg_rwstat_sum() to blkg_rwstat_total(). sum will be used for
    summing up stats from multiple blkgs.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Add two blkcg_policy methods, ->online_pd_fn() and ->offline_pd_fn(),
    which are invoked as the policy_data gets activated and deactivated
    while holding both blkcg and q locks.

    Also, add blkcg_gq->online bool, which is set and cleared as the
    blkcg_gq gets activated and deactivated. This flag also is toggled
    while holding both blkcg and q locks.

    These will be used to implement hierarchical stats.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Add pd->plid so that the policy a pd belongs to can be identified
    easily. This will be used to implement hierarchical blkg_[rw]stats.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • With the previous two patches, all cfqg scheduling decisions are based
    on vfraction and ready for hierarchy support. The only thing which
    keeps the behavior flat is cfqg_flat_parent(), which makes the
    vfraction calculation consider all non-root cfqgs to be children of
    the root cfqg.

    Replace it with cfqg_parent() which returns the real parent. This
    enables full blkcg hierarchy support for cfq-iosched. For example,
    consider the following hierarchy.

              root
             /    \
          A:500    B:250
          /   \
      AA:500   AB:1000

    For simplicity, let's say all the leaf nodes have active tasks and are
    on service tree. For each leaf node, vfraction would be

    AA: (500 / 1500) * (500 / 750) =~ 0.2222
    AB: (1000 / 1500) * (500 / 750) =~ 0.4444
     B:                 (250 / 750) =~ 0.3333

    and vdisktime will be distributed accordingly. For more detail,
    please refer to Documentation/block/cfq-iosched.txt.

    v2: cfq-iosched.txt updated to describe group scheduling as suggested
    by Vivek.

    v3: blkio-controller.txt updated.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • cfq_group_slice() calculates slice by taking a fraction of
    cfq_target_latency according to the ratio of cfqg->weight against
    service_tree->total_weight. This currently works only because all
    cfqgs are treated as being at the same level.

    To prepare for proper hierarchy support, convert cfq_group_slice() to
    base the calculation on cfqg->vfraction. As cfqg->vfraction is always
    a fraction of 1 and represents the fraction allocated to the cfqg with
    hierarchy considered, the slice can be calculated simply by
    multiplying cfqg->vfraction by cfq_target_latency (with the fixed
    point shift factored in).
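
    The resulting calculation is essentially a one-liner (sketch;
    cfq_target_latency and CFQ_SERVICE_SHIFT are existing cfq-iosched
    definitions):

        static inline unsigned cfq_group_slice(struct cfq_data *cfqd,
                                               struct cfq_group *cfqg)
        {
                /* vfraction is fixed point: 1 << CFQ_SERVICE_SHIFT == 1 */
                return cfqd->cfq_target_latency * cfqg->vfraction >>
                        CFQ_SERVICE_SHIFT;
        }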

    As vfraction calculation currently treats all non-root cfqgs as
    children of the root cfqg, this patch doesn't introduce noticeable
    behavior difference.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Currently, cfqg charges are scaled directly according to cfqg->weight.
    Regardless of the number of active cfqgs or the amount of active
    weights, a given weight value always scales the charge the same way. This
    works fine as long as all cfqgs are treated equally regardless of
    their positions in the hierarchy, which is what cfq currently
    implements. It can't work in hierarchical settings because the
    interpretation of a given weight value depends on where the weight is
    located in the hierarchy.

    This patch reimplements cfqg charge scaling so that it can be used to
    support hierarchy properly. The scheme is fairly simple and
    light-weight.

    * When a cfqg is added to the service tree, v(disktime)weight is
    calculated. It walks up the tree to root calculating the fraction
    it has in the hierarchy. At each level, the fraction can be
    calculated as

        cfqg->weight / parent->level_weight

    By compounding these, the global fraction of vdisktime the cfqg has
    claim to - vfraction - can be determined (see the sketch after this
    list).

    * When the cfqg needs to be charged, the charge is scaled inversely
    proportionally to the vfraction.
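
    A minimal sketch of that walk (field names assumed from this series:
    level_weight above is renamed children_weight by a later patch, and
    the parent lookup is still cfqg_flat_parent() at this stage):

        unsigned int vfr = 1 << CFQ_SERVICE_SHIFT;  /* fixed point 1.0 */
        struct cfq_group *pos = cfqg, *parent;

        /* compound per-level fractions up to the root */
        while ((parent = cfqg_flat_parent(pos))) {
                vfr = vfr * pos->weight / parent->children_weight;
                pos = parent;
        }
        cfqg->vfraction = max_t(unsigned, vfr, 1);  /* never zero */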

    The new scaling scheme uses the same CFQ_SERVICE_SHIFT for fixed point
    representation as before; however, the smallest scaling factor is now
    1 (ie. 1 << CFQ_SERVICE_SHIFT). This is different from before where 1
    was for CFQ_WEIGHT_DEFAULT and higher weight would result in smaller
    scaling factor.

    While this shifts the global scale of vdisktime a bit, it doesn't
    change the relative relationships among cfqgs and the scheduling
    result isn't different.

    cfq_group_notify_queue_add uses fixed CFQ_IDLE_DELAY when appending
    new cfqg to the service tree. The specific value of CFQ_IDLE_DELAY
    didn't have any relevance to vdisktime before and is unlikely to cause
    any visible behavior difference now especially as the scale shift
    isn't that large.

    As the new scheme now makes proper distinction between cfqg->weight
    and ->leaf_weight, reverse the weight aliasing for root cfqgs. For
    root, both weights are now mapped to ->leaf_weight instead of the
    other way around.

    Because we're still using cfqg_flat_parent(), this patch shouldn't
    change the scheduling behavior in any noticeable way.

    v2: Beefed up comments on vfraction as requested by Vivek.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • To prepare for blkcg hierarchy support, add cfqg->nr_active and
    ->children_weight. cfqg->nr_active counts the number of active cfqgs
    at the cfqg's level and ->children_weight is sum of weights of those
    cfqgs. The level covers itself (cfqg->leaf_weight) and immediate
    children.

    The two values are updated when a cfqg enters and leaves the group
    service tree. Unless the hierarchy is very deep, the added overhead
    should be negligible.

    Currently, the parent is determined using cfqg_flat_parent() which
    makes the root cfqg the parent of all other cfqgs. This is to make
    the transition to hierarchy-aware scheduling gradual. Scheduling
    logic will be converted to use cfqg->children_weight without actually
    changing the behavior. When everything is ready,
    blkcg_weight_parent() will be replaced with proper parent function.

    This patch doesn't introduce any behavior change.

    v2: s/cfqg->level_weight/cfqg->children_weight/ as per Vivek.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • cfq blkcg is about to grow proper hierarchy handling, where a child
    blkg's weight would nest inside the parent's. This makes tasks in a
    blkg compete against both tasks in the sibling blkgs and the tasks
    of child blkgs.

    We're gonna use the existing weight as the group weight which decides
    the blkg's weight against its siblings. This patch introduces a new
    weight - leaf_weight - which decides the weight of a blkg against the
    child blkgs.

    It's named leaf_weight because another way to look at it is that each
    internal blkg node has a hidden leaf child node which contains all of
    its tasks; leaf_weight is the weight of that leaf node and is handled
    the same as the weights of the child blkgs.

    This patch only adds leaf_weight fields and exposes it to userland.
    The new weight isn't actually used anywhere yet. Note that
    cfq-iosched currently officially supports only a single level of
    hierarchy and root blkgs compete with the first level blkgs - ie.
    root weight is
    basically being used as leaf_weight. For root blkgs, the two weights
    are kept in sync for backward compatibility.

    v2: cfqd->root_group->leaf_weight initialization was missing from
    cfq_init_queue() causing divide by zero when
    !CONFIG_CFQ_GROUP_SCHED. Fix it. Reported by Fengguang.

    Signed-off-by: Tejun Heo
    Cc: Fengguang Wu

    Tejun Heo