18 Dec, 2012

4 commits

  • Merge misc patches from Andrew Morton:
    "Incoming:

    - lots of misc stuff

    - backlight tree updates

    - lib/ updates

    - Oleg's percpu-rwsem changes

    - checkpatch

    - rtc

    - aoe

    - more checkpoint/restart support

    I still have a pile of MM stuff pending - Pekka should be merging
    later today after which that is good to go. A number of other things
    are twiddling thumbs awaiting maintainer merges."

    * emailed patches from Andrew Morton: (180 commits)
    scatterlist: don't BUG when we can trivially return a proper error.
    docs: update documentation about /proc/<pid>/fdinfo/<fd> fanotify output
    fs, fanotify: add @mflags field to fanotify output
    docs: add documentation about /proc/<pid>/fdinfo/<fd> output
    fs, notify: add procfs fdinfo helper
    fs, exportfs: add exportfs_encode_inode_fh() helper
    fs, exportfs: escape nil dereference if no s_export_op present
    fs, epoll: add procfs fdinfo helper
    fs, eventfd: add procfs fdinfo helper
    procfs: add ability to plug in auxiliary fdinfo providers
    tools/testing/selftests/kcmp/kcmp_test.c: print reason for failure in kcmp_test
    breakpoint selftests: print failure status instead of cause make error
    kcmp selftests: print fail status instead of cause make error
    kcmp selftests: make run_tests fix
    mem-hotplug selftests: print failure status instead of cause make error
    cpu-hotplug selftests: print failure status instead of cause make error
    mqueue selftests: print failure status instead of cause make error
    vm selftests: print failure status instead of cause make error
    ubifs: use prandom_bytes
    mtd: nandsim: use prandom_bytes
    ...

    Linus Torvalds
     
  • Currently only block_dev and uprobes use percpu_rw_semaphore,
    add the config option selected by BLOCK || UPROBES.

    Signed-off-by: Oleg Nesterov
    Cc: Anton Arapov
    Cc: Ingo Molnar
    Cc: Linus Torvalds
    Cc: Michal Marek
    Cc: Mikulas Patocka
    Cc: "Paul E. McKenney"
    Cc: Peter Zijlstra
    Cc: Srikar Dronamraju
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Pull block driver update from Jens Axboe:
    "Now that the core bits are in, here are the driver bits for 3.8. The
    branch contains:

    - A huge pile of drbd bits that were dumped from the 3.7 merge
    window. Following that, it was both made perfectly clear that
    there is going to be no more over-the-wall pulls and how the
    situation on individual pulls can be improved.

    - A few cleanups from Akinobu Mita for drbd and cciss.

    - Queue improvement for loop from Lukas. This grew into adding a
    generic interface for waiting/checking an event with a specific
    lock, allowing this to be pulled out of md and now loop and drbd are
    also using it.

    - A few fixes for xen back/front block driver from Roger Pau Monne.

    - Partition improvements from Stephen Warren, allowing partition UUID
    to be used as an identifier."

    * 'for-3.8/drivers' of git://git.kernel.dk/linux-block: (609 commits)
    drbd: update Kconfig to match current dependencies
    drbd: Fix drbdsetup wait-connect, wait-sync etc... commands
    drbd: close race between drbd_set_role and drbd_connect
    drbd: respect no-md-barriers setting also when changed online via disk-options
    drbd: Remove obsolete check
    drbd: fixup after wait_even_lock_irq() addition to generic code
    loop: Limit the number of requests in the bio list
    wait: add wait_event_lock_irq() interface
    xen-blkfront: free allocated page
    xen-blkback: move free persistent grants code
    block: partition: msdos: provide UUIDs for partitions
    init: reduce PARTUUID min length to 1 from 36
    block: store partition_meta_info.uuid as a string
    cciss: use check_signature()
    cciss: cleanup bitops usage
    drbd: use copy_highpage
    drbd: if the replication link breaks during handshake, keep retrying
    drbd: check return of kmalloc in receive_uuids
    drbd: Broadcast sync progress no more often than once per second
    drbd: don't try to clear bits once the disk has failed
    ...

    Linus Torvalds
     
  • Pull block layer core updates from Jens Axboe:
    "Here are the core block IO bits for 3.8. The branch contains:

    - The final version of the surprise device removal fixups from Bart.

    - Don't hide EFI partitions under advanced partition types. It's
    fairly wide spread these days. This is especially dangerous for
    systems that have both msdos and efi partition tables, where you
    want to keep them in sync.

    - Cleanup of using -1 instead of the proper NUMA_NO_NODE

    - Export control of bdi flusher thread CPU mask and default to using
    the home node (if known) from Jeff.

    - Export unplug tracepoint for MD.

    - Core improvements from Shaohua. Reinstate the recursive merge, as
    the original bug has been fixed. Add plugging for discard and also
    fix a problem handling non pow-of-2 discard limits.

    There's a trivial merge in block/blk-exec.c due to a fix that went
    into 3.7-rc at a later point than -rc4 where this is based."

    * 'for-3.8/core' of git://git.kernel.dk/linux-block:
    block: export block_unplug tracepoint
    block: add plug for blkdev_issue_discard
    block: discard granularity might not be power of 2
    deadline: Allow 0ms deadline latency, increase the read speed
    partitions: enable EFI/GPT support by default
    bsg: Remove unused function bsg_goose_queue()
    block: Make blk_cleanup_queue() wait until request_fn finished
    block: Avoid scheduling delayed work on a dead queue
    block: Avoid that request_fn is invoked on a dead queue
    block: Let blk_drain_queue() caller obtain the queue lock
    block: Rename queue dead flag
    bdi: add a user-tunable cpu_list for the bdi flusher threads
    block: use NUMA_NO_NODE instead of -1
    block: recursive merge requests
    block CFQ: avoid moving request to different queue

    Linus Torvalds
     

15 Dec, 2012

3 commits

  • This allows stacked devices (like md/raid5) to provide blktrace
    tracing, including unplug events.

    Reported-by: Fengguang Wu
    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    NeilBrown
     
  • The last post of this patch appears to have been lost, so I am resending it.

    Now that discard merging works, add a plug for blkdev_issue_discard. This helps
    discard requests merge, especially in the raid0 case. In raid0, a big discard
    request is split into small requests, and if the plug is added correctly, those
    small requests can be merged in the lower layer.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • In MD raid case, discard granularity might not be power of 2, for example, a
    4-disk raid5 has 3*chunk_size discard granularity. Correct the calculation for
    such cases.

    Reported-by: Neil Brown
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
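
The calculation problem described above can be illustrated outside the kernel. The following is a minimal Python sketch, not the patch itself: the function names and the 128-sector chunk size are assumptions chosen for illustration. It shows why rounding with a bit mask only works for power-of-two granularities, while a remainder-based calculation works for any granularity, such as the 3*chunk_size of a 4-disk raid5.

```python
def round_down_mask(sector, granularity):
    """Round down with a bit mask; only correct for power-of-two granularity."""
    return sector & ~(granularity - 1)


def round_down_mod(sector, granularity):
    """Round down with a remainder; correct for any positive granularity."""
    return sector - (sector % granularity)


# Assumed example geometry: 128-sector chunks on a 4-disk raid5, so the
# three data chunks per stripe give a discard granularity of 384 sectors.
CHUNK_SECTORS = 128
GRANULARITY = 3 * CHUNK_SECTORS
```

For sector 1000, the remainder version returns 768 (2 * 384), while the mask version returns 640, which is not a multiple of 384. With a power-of-two granularity such as 256, both versions agree.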
     

13 Dec, 2012

1 commit

  • Pull cgroup changes from Tejun Heo:
    "A lot of activities on cgroup side. The big changes are focused on
    making cgroup hierarchy handling saner.

    - cgroup_rmdir() had peculiar semantics - it allowed cgroup
    destruction to be vetoed by individual controllers and tried to
    drain refcnt synchronously. The vetoing never worked properly and
    caused a good deal of contortions in cgroup. memcg was the last
    remaining user. Michal Hocko removed the usage and cgroup_rmdir()
    path has been simplified significantly. This was done in a
    separate branch so that the memcg people can base further memcg
    changes on top.

    - The above allowed cleaning up cgroup lifecycle management and
    implementation of generic cgroup iterators which are used to
    improve hierarchy support.

    - cgroup_freezer updated to allow migration in and out of a frozen
    cgroup and handle hierarchy. If a cgroup is frozen, all descendant
    cgroups are frozen.

    - netcls_cgroup and netprio_cgroup updated to handle hierarchy
    properly.

    - Various fixes and cleanups.

    - Two merge commits. One to pull in memcg and rmdir cleanups (needed
    to build iterators). The other pulled in cgroup/for-3.7-fixes for
    device_cgroup fixes so that further device_cgroup patches can be
    stacked on top."

    Fixed up a trivial conflict in mm/memcontrol.c as per Tejun (due to
    commit bea8c150a7 ("memcg: fix hotplugged memory zone oops") in master
    touching code close to commit 2ef37d3fe4 ("memcg: Simplify
    mem_cgroup_force_empty_list error handling") in for-3.8)

    * 'for-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (65 commits)
    cgroup: update Documentation/cgroups/00-INDEX
    cgroup_rm_file: don't delete the uncreated files
    cgroup: remove subsystem files when remounting cgroup
    cgroup: use cgroup_addrm_files() in cgroup_clear_directory()
    cgroup: warn about broken hierarchies only after css_online
    cgroup: list_del_init() on removed events
    cgroup: fix lockdep warning for event_control
    cgroup: move list add after list head initilization
    netprio_cgroup: allow nesting and inherit config on cgroup creation
    netprio_cgroup: implement netprio[_set]_prio() helpers
    netprio_cgroup: use cgroup->id instead of cgroup_netprio_state->prioidx
    netprio_cgroup: reimplement priomap expansion
    netprio_cgroup: shorten variable names in extend_netdev_table()
    netprio_cgroup: simplify write_priomap()
    netcls_cgroup: move config inheritance to ->css_online() and remove .broken_hierarchy marking
    cgroup: remove obsolete guarantee from cgroup_task_migrate.
    cgroup: add cgroup->id
    cgroup, cpuset: remove cgroup_subsys->post_clone()
    cgroup: s/CGRP_CLONE_CHILDREN/CGRP_CPUSET_CLONE_CHILDREN/
    cgroup: rename ->create/post_create/pre_destroy/destroy() to ->css_alloc/online/offline/free()
    ...

    Linus Torvalds
     

10 Dec, 2012

1 commit


06 Dec, 2012

7 commits

  • The Kconfig currently enables MSDOS partitions by default because they
    are assumed to be essential, but it's necessary to enable "advanced
    partition selection" in order to get GPT support. IMO GPT partitions
    are becoming common enough to deserve the same treatment MSDOS
    partitions get.

    (Side note: I got bit by a disk that had MSDOS and GPT partition
    tables, but for some reason the MSDOS table was different from the
    GPT one. I was stupid enough to disable "advanced partition
    selection" in my .config, which disabled GPT partitioning and made
    my btrfs pool unbootable because it couldn't find the partitions)

    Signed-off-by: Diego Calleja
    Signed-off-by: Jens Axboe

    Diego Calleja
     
  • The function bsg_goose_queue() does not have any in-tree callers,
    so let's remove it.

    Signed-off-by: Bart Van Assche
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Some request_fn implementations, e.g. scsi_request_fn(), unlock
    the queue lock internally. This may result in multiple threads
    executing request_fn for the same queue simultaneously. Keep
    track of the number of active request_fn calls and make sure that
    blk_cleanup_queue() waits until all active request_fn invocations
    have finished. A block driver may start cleaning up resources
    needed by its request_fn as soon as blk_cleanup_queue() finished,
    so blk_cleanup_queue() must wait for all outstanding request_fn
    invocations to finish.

    Signed-off-by: Bart Van Assche
    Reported-by: Chanho Min
    Cc: James Bottomley
    Cc: Mike Christie
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Bart Van Assche
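
The counting scheme described above can be modelled in user space. This is a hedged Python sketch using threads, not kernel code: the class `RequestQueue`, its fields, and the method names are hypothetical stand-ins. It assumes the whole request function runs with the queue lock dropped, which is a simplification of what scsi_request_fn() does.

```python
import threading


class RequestQueue:
    """Toy stand-in for a request queue (hypothetical, not the kernel struct)."""

    def __init__(self):
        self.lock = threading.Lock()
        self.idle = threading.Condition(self.lock)
        self.request_fn_active = 0  # number of request_fn calls in flight
        self.dead = False

    def run_queue(self, request_fn):
        """Invoke request_fn unless the queue is dead; return True if it ran."""
        with self.lock:
            if self.dead:
                return False
            self.request_fn_active += 1
        try:
            # The lock is dropped here, so several callers can be inside
            # request_fn for the same queue at the same time.
            request_fn()
        finally:
            with self.lock:
                self.request_fn_active -= 1
                self.idle.notify_all()
        return True

    def cleanup(self):
        """Wait for every active request_fn call, then mark the queue dead."""
        with self.lock:
            while self.request_fn_active > 0:
                self.idle.wait()
            self.dead = True
```

Because the wait and the final state change happen under the same lock, no new request_fn call can start between the counter reaching zero and the queue being marked dead.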
     
  • Running a queue must continue after it has been marked dying until
    it has been marked dead. So the function blk_run_queue_async() must
    not schedule delayed work after blk_cleanup_queue() has marked a queue
    dead. Hence add a test for that queue state in blk_run_queue_async()
    and make sure that queue_unplugged() invokes that function with the
    queue lock held. This prevents the queue state from changing after
    it has been tested and before mod_delayed_work() is invoked. Drop
    the queue dying test in queue_unplugged() since it is now
    superfluous: __blk_run_queue() already tests whether or not the
    queue is dead.

    Signed-off-by: Bart Van Assche
    Cc: Mike Christie
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • A block driver may start cleaning up resources needed by its
    request_fn as soon as blk_cleanup_queue() finished, so request_fn
    must not be invoked after draining finished. This is important
    when blk_run_queue() is invoked without any requests in progress.
    As an example, if blk_drain_queue() and scsi_run_queue() run in
    parallel, blk_drain_queue() may have finished all requests after
    scsi_run_queue() has taken a SCSI device off the starved list but
    before that last function has had a chance to run the queue.

    Signed-off-by: Bart Van Assche
    Cc: James Bottomley
    Cc: Mike Christie
    Cc: Chanho Min
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Let the caller of blk_drain_queue() obtain the queue lock to improve
    readability of the patch called "Avoid that request_fn is invoked on
    a dead queue".

    Signed-off-by: Bart Van Assche
    Acked-by: Tejun Heo
    Cc: James Bottomley
    Cc: Mike Christie
    Cc: Jens Axboe
    Cc: Chanho Min
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • QUEUE_FLAG_DEAD is used to indicate that queuing new requests must
    stop. After this flag has been set queue draining starts. However,
    during the queue draining phase it is still safe to invoke the
    queue's request_fn, so QUEUE_FLAG_DYING is a better name for this
    flag.

    This patch has been generated by running the following command
    over the kernel source tree:

    git grep -lEw 'blk_queue_dead|QUEUE_FLAG_DEAD' |
    xargs sed -i.tmp -e 's/blk_queue_dead/blk_queue_dying/g' \
    -e 's/QUEUE_FLAG_DEAD/QUEUE_FLAG_DYING/g'; \
    sed -i.tmp -e "s/QUEUE_FLAG_DYING$(printf \\t)*5/QUEUE_FLAG_DYING$(printf \\t)5/g" \
    include/linux/blkdev.h; \
    sed -i.tmp -e 's/ DEAD/ DYING/g' -e 's/dead queue/a dying queue/' \
    -e 's/Dead queue/A dying queue/' block/blk-core.c

    Signed-off-by: Bart Van Assche
    Acked-by: Tejun Heo
    Cc: James Bottomley
    Cc: Mike Christie
    Cc: Jens Axboe
    Cc: Chanho Min
    Signed-off-by: Jens Axboe

    Bart Van Assche
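
The distinction between a dying and a dead queue can be sketched as a small state model. This Python sketch is illustrative only: `ToyQueue` and its methods are hypothetical names, and the single-threaded drain ignores locking entirely. It assumes, as the message above states, that a dying queue rejects new requests but may still run its request function, while a dead queue must not run it at all.

```python
class ToyQueue:
    """Hypothetical single-threaded model of the dying/dead queue states."""

    def __init__(self):
        self.dying = False  # no new requests may be queued
        self.dead = False   # the request function must no longer run
        self.pending = []
        self.served = []

    def submit(self, request):
        """Queue a request; refused once the queue is dying."""
        if self.dying:
            return False
        self.pending.append(request)
        return True

    def run(self):
        """Serve pending requests; still allowed while dying, not when dead."""
        if self.dead:
            return
        while self.pending:
            self.served.append(self.pending.pop(0))

    def cleanup(self):
        """Mark dying, drain what is already queued, then mark dead."""
        self.dying = True
        self.run()
        self.dead = True
```

The ordering in `cleanup()` is the point of the model: requests accepted before the queue started dying are still served during draining.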
     

23 Nov, 2012

3 commits

  • After we've done __elv_add_request() and __blk_run_queue() in
    blk_execute_rq_nowait(), the request might finish and be freed
    immediately. Therefore checking if the type is REQ_TYPE_PM_RESUME
    isn't safe afterwards, because if it isn't, rq might be gone.
    Instead, check beforehand and stash the result in a temporary.

    This fixes crashes in blk_execute_rq_nowait() I get occasionally when
    running with lots of memory debugging options enabled -- I think this
    race is usually harmless because the window for rq to be reallocated
    is so small.

    Signed-off-by: Roland Dreier
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Roland Dreier
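
The "check beforehand and stash the result" pattern can be shown in a toy model. Python has no use-after-free, so this sketch simulates a freed request by deleting its field; the `Request` class, the numeric type values, and both function names are assumptions made for illustration and do not mirror the kernel code.

```python
REQ_TYPE_FS = 1          # hypothetical values, for illustration only
REQ_TYPE_PM_RESUME = 2


class Request:
    def __init__(self, cmd_type):
        self.cmd_type = cmd_type

    def free(self):
        # Model freed memory: any later read of cmd_type fails.
        del self.cmd_type


def execute_nowait_unsafe(rq, run_queue):
    """Read the request type after running the queue (the racy order)."""
    run_queue(rq)  # the request may complete and be freed in here
    return rq.cmd_type == REQ_TYPE_PM_RESUME


def execute_nowait_safe(rq, run_queue):
    """Stash the type check in a local before the request can be freed."""
    is_pm_resume = rq.cmd_type == REQ_TYPE_PM_RESUME
    run_queue(rq)
    return is_pm_resume
```

When `run_queue` frees the request immediately, the unsafe variant fails on the later read, while the safe variant only uses the value it saved earlier.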
     
  • The MSDOS/MBR partition table includes a 32-bit unique ID, often referred
    to as the NT disk signature. When combined with a partition number within
    the table, this can form a unique ID similar in concept to EFI/GPT's
    partition UUID. Constructing and recording this value in struct
    partition_meta_info allows MSDOS partitions to be referred to on the
    kernel command-line using the following syntax:

    root=PARTUUID=0002dd75-01

    Signed-off-by: Stephen Warren
    Cc: Tejun Heo
    Cc: Will Drewry
    Cc: Kay Sievers
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Stephen Warren
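
The identifier format can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the kernel implementation: the helper names are hypothetical, the two-digit hexadecimal partition number is inferred from the example `0002dd75-01` above, and the signature is read from byte offset 440 of the MBR sector as a little-endian 32-bit value.

```python
import struct

MBR_SIGNATURE_OFFSET = 440  # byte offset of the 32-bit disk signature in the MBR


def read_disk_signature(mbr_sector):
    """Extract the little-endian 32-bit disk signature from a 512-byte MBR."""
    (signature,) = struct.unpack_from("<I", mbr_sector, MBR_SIGNATURE_OFFSET)
    return signature


def msdos_partuuid(disk_signature, partition_number):
    """Combine disk signature and partition number into a PARTUUID string."""
    return "%08x-%02x" % (disk_signature, partition_number)
```

With a disk signature of 0x0002dd75 and partition number 1, `msdos_partuuid` yields the string used in the command-line example above.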
     
  • This will allow other types of UUID to be stored here, aside from true
    UUIDs. This also simplifies code that uses this field, since the field is
    usually constructed from, used as, or compared to strings.

    Note: A simplistic approach here would be to set uuid_str[36]=0 whenever a
    /PARTNROFF option was found to be present. However, this modifies the
    input string, and causes subsequent calls to devt_from_partuuid() not to
    see the /PARTNROFF option, which causes different results. In order to
    avoid misleading future maintainers, this parameter is marked const.

    Signed-off-by: Stephen Warren
    Cc: Tejun Heo
    Cc: Will Drewry
    Cc: Kay Sievers
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Stephen Warren
     

20 Nov, 2012

1 commit


10 Nov, 2012

1 commit


09 Nov, 2012

1 commit

  • In this workload, thread 1 accesses a, a+2, ... and thread 2 accesses a+1, a+3, ....
    When the requests are flushed to the queue, a and a+1 are merged into (a, a+1), and
    a+2 and a+3 into (a+2, a+3), but (a, a+1) and (a+2, a+3) are not merged with each other.

    If we do a recursive merge for such interleaved access, the throughput of some
    workloads improves. A recent workload I'm checking is swap, where the change below
    boosts throughput by around 5% to 10%.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
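
The interleaved case above can be reproduced with a small model. This Python sketch is a simplification and not the kernel code: the function name is hypothetical, requests are one unit long, only back merges are modelled, and limits such as the maximum request size are ignored. Pending requests are kept in a dictionary keyed by their end position, so a new request can find a neighbour that ends exactly where it starts.

```python
def insert_with_back_merge(by_end, start, length, recursive):
    """Insert [start, start+length) into by_end, a dict mapping end -> start.

    A back merge appends the new request to an existing one that ends at
    its start. With recursive=True the merged request is retried until no
    further neighbour is found.
    """
    end = start + length
    while True:
        prev_start = by_end.pop(start, None)  # request ending where we start
        if prev_start is None:
            break
        start = prev_start                    # merged request grows backwards
        if not recursive:
            break
    by_end[end] = start


def flush(order, recursive):
    """Insert unit-sized requests in the given order; return (start, end) pairs."""
    by_end = {}
    for position in order:
        insert_with_back_merge(by_end, position, 1, recursive)
    return sorted((s, e) for e, s in by_end.items())
```

With thread 1 flushing positions 0 and 2 and thread 2 then flushing 1 and 3, `flush([0, 2, 1, 3], recursive=False)` leaves two requests, (0, 2) and (2, 4), while `recursive=True` ends with the single request (0, 4).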
     

07 Nov, 2012

1 commit


06 Nov, 2012

3 commits

  • A request is queued on the cfqq->fifo list. It looks possible that we move a
    request from one cfqq to another in the request merge case. In that case, adjusting
    the fifo list order doesn't make sense and is impossible unless we iterate over
    the whole fifo list.

    My test does hit a case where the two cfqqs are different, but it didn't cause a
    kernel crash, maybe because the fifo list isn't used frequently. Anyway, judging
    from the code logic, this is buggy.

    I think we can re-enable the recursive merge logic after this is fixed.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • Pull rmdir updates into for-3.8 so that further callback updates can
    be put on top. This pull created a trivial conflict between the
    following two commits.

    8c7f6edbda ("cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them")
    ed95779340 ("cgroup: kill cgroup_subsys->__DEPRECATED_clear_css_refs")

    The former added a field to cgroup_subsys and the latter removed one
    from it. They happen to be colocated causing the conflict. Keeping
    what's added and removing what's removed resolves the conflict.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • All ->pre_destroy() implementations return 0 now, which is the only
    allowed return value. Make it return void.

    Signed-off-by: Tejun Heo
    Reviewed-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Li Zefan
    Cc: Balbir Singh
    Cc: Vivek Goyal

    Tejun Heo
     

26 Oct, 2012

1 commit

  • My workload is a raid5 array with 16 disks, written to through our filesystem
    in direct-io mode.

    Using blktrace, I found the following messages:
    8,16 0 6647 2.453665504 2579 M W 7493152 + 8 [md0_raid5]
    8,16 0 6648 2.453672411 2579 Q W 7493160 + 8 [md0_raid5]
    8,16 0 6649 2.453672606 2579 M W 7493160 + 8 [md0_raid5]
    8,16 0 6650 2.453679255 2579 Q W 7493168 + 8 [md0_raid5]
    8,16 0 6651 2.453679441 2579 M W 7493168 + 8 [md0_raid5]
    8,16 0 6652 2.453685948 2579 Q W 7493176 + 8 [md0_raid5]
    8,16 0 6653 2.453686149 2579 M W 7493176 + 8 [md0_raid5]
    8,16 0 6654 2.453693074 2579 Q W 7493184 + 8 [md0_raid5]
    8,16 0 6655 2.453693254 2579 M W 7493184 + 8 [md0_raid5]
    8,16 0 6656 2.453704290 2579 Q W 7493192 + 8 [md0_raid5]
    8,16 0 6657 2.453704482 2579 M W 7493192 + 8 [md0_raid5]
    8,16 0 6658 2.453715016 2579 Q W 7493200 + 8 [md0_raid5]
    8,16 0 6659 2.453715247 2579 M W 7493200 + 8 [md0_raid5]
    8,16 0 6660 2.453721730 2579 Q W 7493208 + 8 [md0_raid5]
    8,16 0 6661 2.453721974 2579 M W 7493208 + 8 [md0_raid5]
    8,16 0 6662 2.453728202 2579 Q W 7493216 + 8 [md0_raid5]
    8,16 0 6663 2.453728436 2579 M W 7493216 + 8 [md0_raid5]
    8,16 0 6664 2.453734782 2579 Q W 7493224 + 8 [md0_raid5]
    8,16 0 6665 2.453735019 2579 M W 7493224 + 8 [md0_raid5]
    8,16 0 6666 2.453741401 2579 Q W 7493232 + 8 [md0_raid5]
    8,16 0 6667 2.453741632 2579 M W 7493232 + 8 [md0_raid5]
    8,16 0 6668 2.453748148 2579 Q W 7493240 + 8 [md0_raid5]
    8,16 0 6669 2.453748386 2579 M W 7493240 + 8 [md0_raid5]
    8,16 0 6670 2.453851843 2579 I W 7493144 + 104 [md0_raid5]
    8,16 0 0 2.453853661 0 m N cfq2579 insert_request
    8,16 0 6671 2.453854064 2579 I W 7493120 + 24 [md0_raid5]
    8,16 0 0 2.453854439 0 m N cfq2579 insert_request
    8,16 0 6672 2.453854793 2579 U N [md0_raid5] 2
    8,16 0 0 2.453855513 0 m N cfq2579 Not idling.st->count:1
    8,16 0 0 2.453855927 0 m N cfq2579 dispatch_insert
    8,16 0 0 2.453861771 0 m N cfq2579 dispatched a request
    8,16 0 0 2.453862248 0 m N cfq2579 activate rq,drv=1
    8,16 0 6673 2.453862332 2579 D W 7493120 + 24 [md0_raid5]
    8,16 0 0 2.453865957 0 m N cfq2579 Not idling.st->count:1
    8,16 0 0 2.453866269 0 m N cfq2579 dispatch_insert
    8,16 0 0 2.453866707 0 m N cfq2579 dispatched a request
    8,16 0 0 2.453867061 0 m N cfq2579 activate rq,drv=2
    8,16 0 6674 2.453867145 2579 D W 7493144 + 104 [md0_raid5]
    8,16 0 6675 2.454147608 0 C W 7493120 + 24 [0]
    8,16 0 0 2.454149357 0 m N cfq2579 complete rqnoidle 0
    8,16 0 6676 2.454791505 0 C W 7493144 + 104 [0]
    8,16 0 0 2.454794803 0 m N cfq2579 complete rqnoidle 0
    8,16 0 0 2.454795160 0 m N cfq schedule dispatch

    From the messages above, we can see that rq[W 7493144 + 104] and rq[W
    7493120 + 24] are not merged.
    This is because the bio order is:
    8,16 0 6638 2.453619407 2579 Q W 7493144 + 8 [md0_raid5]
    8,16 0 6639 2.453620460 2579 G W 7493144 + 8 [md0_raid5]
    8,16 0 6640 2.453639311 2579 Q W 7493120 + 8 [md0_raid5]
    8,16 0 6641 2.453639842 2579 G W 7493120 + 8 [md0_raid5]
    bio(7493144) arrives first and bio(7493120) later, so the subsequent
    bios are divided into two parts.
    When the plug list is flushed, elv_attempt_insert_merge only supports
    back merges, not front merges,
    so rq[7493120 + 24] can't merge with rq[7493144 + 104].

    In my tests, this situation occurs in about 25% of cases on our system.
    With this patch, the situation no longer occurs.

    Signed-off-by: Jianpeng Ma
    CC:Shaohua Li
    Signed-off-by: Jens Axboe

    Jianpeng Ma
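
The missed merge in the trace can be reproduced with a toy model that supports back merges only. This is a hedged Python sketch: the function name and the `sort_by_sector` switch are hypothetical, and sorting the plug list by start sector is shown as one way to let the back merge succeed, which is an assumption here because the message above does not say how the patch achieves its result.

```python
def flush_plug_list(requests, sort_by_sector=False):
    """Insert (start, length) requests using back merges only.

    Returns the resulting requests as sorted (start, length) pairs.
    """
    pending = sorted(requests) if sort_by_sector else list(requests)
    by_end = {}  # maps end sector -> start sector of a queued request
    for start, length in pending:
        end = start + length
        prev_start = by_end.pop(start, None)  # request ending at our start
        if prev_start is not None:
            start = prev_start                # back merge into that request
        by_end[end] = start
    return sorted((s, e - s) for e, s in by_end.items())


# The two requests from the trace, in the order they were inserted.
TRACE_ORDER = [(7493144, 104), (7493120, 24)]
```

In trace order the two requests stay separate, because no queued request ends at sector 7493120 when the second one arrives. Sorted by sector, the request starting at 7493144 finds the one ending there and the result is a single request of 128 sectors starting at 7493120.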
     

24 Oct, 2012

1 commit

  • This config item has not carried much meaning for a while now and is
    almost always enabled by default. As agreed during the Linux kernel
    summit, remove it.

    CC: Jens Axboe
    Signed-off-by: Kees Cook
    Signed-off-by: Jens Axboe

    Kees Cook
     

23 Oct, 2012

2 commits

  • __blk_queue_next_rl() finds next request list based on blkg_list
    while skipping root_blkg in the list.
    OTOH, root_rl is special as it may exist even without root_blkg.

    Though the later part of the function handles such a case correctly,
    exiting early is good for readability of the code.

    Signed-off-by: Jun'ichi Nomura
    Cc: Tejun Heo
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Jun'ichi Nomura
     
  • blk_put_rl() does not call blkg_put() for q->root_rl because we
    don't take request list reference on q->root_blkg.
    However, if root_blkg is once attached then detached (freed),
    blk_put_rl() is confused by the bogus pointer in q->root_blkg.

    For example, with !CONFIG_BLK_DEV_THROTTLING &&
    CONFIG_CFQ_GROUP_IOSCHED,
    switching IO scheduler from cfq to deadline will cause system stall
    after the following warning with 3.6:

    > WARNING: at /work/build/linux/block/blk-cgroup.h:250
    > blk_put_rl+0x4d/0x95()
    > Modules linked in: bridge stp llc sunrpc acpi_cpufreq freq_table mperf
    > ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4
    > Pid: 0, comm: swapper/0 Not tainted 3.6.0 #1
    > Call Trace:
    > [] warn_slowpath_common+0x85/0x9d
    > [] warn_slowpath_null+0x1a/0x1c
    > [] blk_put_rl+0x4d/0x95
    > [] __blk_put_request+0xc3/0xcb
    > [] blk_finish_request+0x232/0x23f
    > [] ? blk_end_bidi_request+0x34/0x5d
    > [] blk_end_bidi_request+0x42/0x5d
    > [] blk_end_request+0x10/0x12
    > [] scsi_io_completion+0x207/0x4d5
    > [] scsi_finish_command+0xfa/0x103
    > [] scsi_softirq_done+0xff/0x108
    > [] blk_done_softirq+0x8d/0xa1
    > [] ?
    > generic_smp_call_function_single_interrupt+0x9f/0xd7
    > [] __do_softirq+0x102/0x213
    > [] ? lock_release_holdtime+0xb6/0xbb
    > [] ? raise_softirq_irqoff+0x9/0x3d
    > [] call_softirq+0x1c/0x30
    > [] do_softirq+0x4b/0xa3
    > [] irq_exit+0x53/0xd5
    > [] smp_call_function_single_interrupt+0x34/0x36
    > [] call_function_single_interrupt+0x6f/0x80
    > [] ? mwait_idle+0x94/0xcd
    > [] ? mwait_idle+0x8b/0xcd
    > [] cpu_idle+0xbb/0x114
    > [] rest_init+0xc1/0xc8
    > [] ? csum_partial_copy_generic+0x16c/0x16c
    > [] start_kernel+0x3d4/0x3e1
    > [] ? kernel_init+0x1f7/0x1f7
    > [] x86_64_start_reservations+0xb8/0xbd
    > [] x86_64_start_kernel+0x101/0x110

    This patch clears q->root_blkg and q->root_rl.blkg when root blkg
    is destroyed.

    Signed-off-by: Jun'ichi Nomura
    Acked-by: Vivek Goyal
    Acked-by: Tejun Heo
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Jun'ichi Nomura
     

11 Oct, 2012

1 commit

  • Pull block IO update from Jens Axboe:
    "Core block IO bits for 3.7. Not a huge round this time, it contains:

    - First series from Kent cleaning up and generalizing bio allocation
    and freeing.

    - WRITE_SAME support from Martin.

    - Mikulas' patches to prevent O_DIRECT crashes when someone changes
    the block size of a device.

    - Make bio_split() work on data-less bio's (like trim/discards).

    - A few other minor fixups."

    Fixed up silent semantic mis-merge as per Mikulas Patocka and Andrew
    Morton. It is due to the VM no longer using a prio-tree (see commit
    6b2dbba8b6ac: "mm: replace vma prio_tree with an interval tree").

    So make set_blocksize() use mapping_mapped() instead of open-coding the
    internal VM knowledge that has changed.

    * 'for-3.7/core' of git://git.kernel.dk/linux-block: (26 commits)
    block: makes bio_split support bio without data
    scatterlist: refactor the sg_nents
    scatterlist: add sg_nents
    fs: fix include/percpu-rwsem.h export error
    percpu-rw-semaphore: fix documentation typos
    fs/block_dev.c:1644:5: sparse: symbol 'blkdev_mmap' was not declared
    blockdev: turn a rw semaphore into a percpu rw semaphore
    Fix a crash when block device is read and block size is changed at the same time
    block: fix request_queue->flags initialization
    block: lift the initial queue bypass mode on blk_register_queue() instead of blk_init_allocated_queue()
    block: ioctl to zero block ranges
    block: Make blkdev_issue_zeroout use WRITE SAME
    block: Implement support for WRITE SAME
    block: Consolidate command flag and queue limit checks for merges
    block: Clean up special command handling logic
    block/blk-tag.c: Remove useless kfree
    block: remove the duplicated setting for congestion_threshold
    block: reject invalid queue attribute values
    block: Add bio_clone_bioset(), bio_clone_kmalloc()
    block: Consolidate bio_alloc_bioset(), bio_kmalloc()
    ...

    Linus Torvalds
     

03 Oct, 2012

2 commits

  • Pull cgroup hierarchy update from Tejun Heo:
    "Currently, different cgroup subsystems handle nested cgroups
    completely differently. There's no consistency among subsystems and
    the behaviors often are outright broken.

    People at least seem to agree that the broken hierarchy behaviors need
    to be weeded out if any progress is gonna be made on this front and
    that the fallouts from deprecating the broken behaviors should be
    acceptable especially given that the current behaviors don't make much
    sense when nested.

    This patch makes cgroup emit warning messages if cgroups for
    subsystems with broken hierarchy behavior are nested to prepare for
    fixing them in the future. This was put in a separate branch because
    more related changes were expected (didn't make it this round) and the
    memory cgroup wanted to pull in this and make changes on top."

    * 'for-3.7-hierarchy' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them

    Linus Torvalds
     
  • Pull workqueue changes from Tejun Heo:
    "This is workqueue updates for v3.7-rc1. A lot of activities this
    round including considerable API and behavior cleanups.

    * delayed_work combines a timer and a work item. The handling of the
    timer part has always been a bit clunky leading to confusing
    cancelation API with weird corner-case behaviors. delayed_work is
    updated to use new IRQ safe timer and cancelation now works as
    expected.

    * Another deficiency of delayed_work was lack of the counterpart of
    mod_timer() which led to cancel+queue combinations or open-coded
    timer+work usages. mod_delayed_work[_on]() are added.

    These two delayed_work changes make delayed_work provide an interface
    and behave like a timer that is executed in process context.

    * A work item could be executed concurrently on multiple CPUs, which
    is rather unintuitive and made flush_work() behavior confusing and
    half-broken under certain circumstances. This problem doesn't
    exist for non-reentrant workqueues. While non-reentrancy check
    isn't free, the overhead is incurred only when a work item bounces
    across different CPUs and even in simulated pathological scenario
    the overhead isn't too high.

    All workqueues are made non-reentrant. This removes the
    distinction between flush_[delayed_]work() and
    flush_[delayed_]work_sync(). The former is now as strong as the
    latter and the specified work item is guaranteed to have finished
    execution of any previous queueing on return.

    * In addition to the various bug fixes, Lai redid and simplified CPU
    hotplug handling significantly.

    * Joonsoo introduced system_highpri_wq and used it during CPU
    hotplug.

    There are two merge commits - one to pull in IRQ safe timer from
    tip/timers/core and the other to pull in CPU hotplug fixes from
    wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."

    Fixed a number of trivial conflicts, but the more interesting conflicts
    were silent ones where the deprecated interfaces had been used by new
    code in the merge window, and thus didn't cause any real data conflicts.

    Tejun pointed out a few of them, I fixed a couple more.

    * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
    workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
    workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
    workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
    workqueue: remove @delayed from cwq_dec_nr_in_flight()
    workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
    workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
    workqueue: use __cpuinit instead of __devinit for cpu callbacks
    workqueue: rename manager_mutex to assoc_mutex
    workqueue: WORKER_REBIND is no longer necessary for idle rebinding
    workqueue: WORKER_REBIND is no longer necessary for busy rebinding
    workqueue: reimplement idle worker rebinding
    workqueue: deprecate __cancel_delayed_work()
    workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
    workqueue: use mod_delayed_work() instead of __cancel + queue
    workqueue: use irqsafe timer for delayed_work
    workqueue: clean up delayed_work initializers and add missing one
    workqueue: make deferrable delayed_work initializer names consistent
    workqueue: cosmetic whitespace updates for macro definitions
    workqueue: deprecate system_nrt[_freezable]_wq
    workqueue: deprecate flush[_delayed]_work_sync()
    ...

    Linus Torvalds
     

26 Sep, 2012

1 commit

  • In some usage scenarios it is desirable to work with disk images or
    virtualized DASD devices. One problem that prevents such applications
    is the partition detection in ibm.c. Currently it works only for
    devices that support the BIODASDINFO2 ioctl, in other words, it only
    works for devices that belong to the DASD device driver.

    The information gained from the BIODASDINFO2 ioctl is absolutely
    necessary only for a small set of legacy cases. All current VOL1, LNX1
    and CMS1 type of disk labels can be interpreted correctly without this
    information, as long as the generic HDIO_GETGEO ioctl works and
    provides a correct disk geometry.

    This patch makes the ibm.c partition detection as independent as
    possible from the BIODASDINFO2 ioctl. Only the following two cases are
    still restricted to real DASDs:
    - An FBA DASD, or LDL formatted ECKD DASD without any disk label.
    - An old style LNX1 label (without large volume support) on a disk
    with inconsistent device geometry.

    Signed-off-by: Stefan Weinhuber
    Signed-off-by: Martin Schwidefsky

    Stefan Weinhuber
     

21 Sep, 2012

2 commits

  • A queue newly allocated with blk_alloc_queue_node() has only
    QUEUE_FLAG_BYPASS set. For request-based drivers,
    blk_init_allocated_queue() is called and q->queue_flags is overwritten
    with QUEUE_FLAG_DEFAULT which doesn't include BYPASS even though the
    initial bypass is still in effect.

    In blk_init_allocated_queue(), OR QUEUE_FLAG_DEFAULT into
    q->queue_flags instead of overwriting it.

    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
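
The difference between overwriting and OR-ing the flags word, as described in the commit above, can be illustrated with a minimal userspace sketch. The `sketch_` names and the bit positions below are assumptions made up for this illustration; they are not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit positions chosen only for this sketch; the kernel's
 * real QUEUE_FLAG_* values are defined elsewhere and differ. */
enum {
    SKETCH_FLAG_BYPASS    = 1u << 0,
    SKETCH_FLAG_IO_STAT   = 1u << 1,
    SKETCH_FLAG_SAME_COMP = 1u << 2,
};
#define SKETCH_FLAG_DEFAULT (SKETCH_FLAG_IO_STAT | SKETCH_FLAG_SAME_COMP)

struct sketch_queue {
    unsigned long queue_flags;
};

/* Behaviour before the fix: plain assignment discards every bit that was
 * already set, including the BYPASS bit set at allocation time. */
static void sketch_init_overwrite(struct sketch_queue *q)
{
    assert(q != NULL);
    q->queue_flags = SKETCH_FLAG_DEFAULT;
}

/* Behaviour after the fix: OR-ing adds the default bits and keeps the
 * bits that were already set. */
static void sketch_init_or(struct sketch_queue *q)
{
    assert(q != NULL);
    q->queue_flags |= SKETCH_FLAG_DEFAULT;
}
```

Starting from a queue that has only the bypass bit set, the first helper leaves the bit cleared while the second leaves it set alongside the defaults.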
     
  • …_init_allocated_queue()

    b82d4b197c ("blkcg: make request_queue bypassing on allocation") made
    request_queues bypassed on allocation to avoid switching on and off
    bypass mode on a queue being initialized. Some drivers allocate and
    then destroy a lot of queues without fully initializing them, and
    incurring bypass latency on each of them could add up to significant
    overhead.

    Unfortunately, blk_init_allocated_queue() is never used by queues of
    bio-based drivers, which means that all bio-based driver queues are in
    bypass mode even after initialization and registration complete
    successfully.

    Due to the limited way request_queues are used by bio drivers, this
    problem is hidden pretty well but it shows up when blk-throttle is
    used in combination with a bio-based driver. Trying to configure
    (echoing to cgroupfs file) blk-throttle for a bio-based driver hangs
    indefinitely in blkg_conf_prep() waiting for bypass mode to end.

    This patch moves the initial blk_queue_bypass_end() call from
    blk_init_allocated_queue() to blk_register_queue(), which is called
    for every userland-visible queue regardless of its type.

    I believe this is correct because I don't think there is any block
    driver which needs or wants a working elevator and blk-cgroup on a
    queue which isn't visible to userland. If there are such users, we
    need a different solution.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-by: Joseph Glanville <joseph.glanville@orionvm.com.au>
    Cc: stable@vger.kernel.org
    Acked-by: Vivek Goyal <vgoyal@redhat.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

    Tejun Heo
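
The effect of moving the bypass-end call, as described in the commit above, can be modelled with a small userspace sketch. Everything here is an assumption for illustration: the `sketch_` names, the depth counter, and the two-step bring-up are a simplified model, not the block layer's code.

```c
#include <assert.h>
#include <stdbool.h>

/* Where the initial "end bypass" call lives in this model. */
enum sketch_layout {
    SKETCH_END_IN_INIT,      /* before the patch: in the init step */
    SKETCH_END_IN_REGISTER,  /* after the patch: in the register step */
};

/* Walks one queue through allocation, optional init, and registration,
 * and returns the remaining bypass depth (0 means bypass has ended). */
static int sketch_bring_up(bool request_based, enum sketch_layout layout)
{
    int bypass_depth = 1;  /* a newly allocated queue starts bypassed */

    /* Init step: only request-based drivers go through it. */
    if (request_based && layout == SKETCH_END_IN_INIT)
        bypass_depth--;

    /* Register step: every userland-visible queue goes through it. */
    if (layout == SKETCH_END_IN_REGISTER)
        bypass_depth--;

    assert(bypass_depth >= 0);
    return bypass_depth;
}
```

In this model a bio-based queue (`request_based == false`) keeps a depth of 1 under the old layout and reaches 0 under the new one, while a request-based queue reaches 0 in both.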
     

20 Sep, 2012

4 commits

  • Introduce a BLKZEROOUT ioctl which can be used to clear block ranges by
    way of blkdev_issue_zeroout().

    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Martin K. Petersen
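
A userspace caller of the ioctl introduced above might look like the following minimal sketch. The helper name `zero_range` is made up for this example. The argument layout (a two-element `uint64_t` array holding a start offset and a length in bytes) and the fallback `#define` reflect my understanding of the interface and should be checked against the installed kernel headers.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

/* Fallback for older userspace headers; assumed to match the kernel's
 * definition of the ioctl number. */
#ifndef BLKZEROOUT
#define BLKZEROOUT _IO(0x12, 127)
#endif

/* Ask the kernel to zero `len` bytes starting at byte offset `start` on
 * the block device open at `fd`. Returns 0 on success, or -1 with errno
 * set by ioctl() on failure. */
static int zero_range(int fd, uint64_t start, uint64_t len)
{
    uint64_t range[2] = { start, len };

    return ioctl(fd, BLKZEROOUT, range);
}
```

A caller would typically open the device node for writing, pass sector-aligned values for `start` and `len`, and inspect `errno` if the helper returns -1.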
     
  • If the device supports WRITE SAME, use that to optimize zeroing of
    blocks. If the device does not support WRITE SAME or if the operation
    fails, fall back to writing zeroes the old-fashioned way.

    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Martin K. Petersen
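
The "try WRITE SAME, then fall back" strategy from the commit above can be sketched against an in-memory model of a device. All names here (`sketch_dev`, `sketch_zeroout`, and so on) are hypothetical, and the model ignores everything a real block layer has to handle, such as request splitting and asynchronous completion.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_BLOCK 512  /* assumed logical block size for the model */

struct sketch_dev {
    unsigned char *data;   /* backing memory, nblocks * SKETCH_BLOCK bytes */
    size_t nblocks;
    int has_write_same;    /* 0 models a device without WRITE SAME */
};

/* Models WRITE SAME: one block supplied by the host is replicated over
 * the whole range. Returns -1 if the device lacks the command. */
static int sketch_write_same(struct sketch_dev *d, size_t first,
                             size_t count, const unsigned char *block)
{
    if (!d->has_write_same)
        return -1;
    for (size_t i = 0; i < count; i++)
        memcpy(d->data + (first + i) * SKETCH_BLOCK, block, SKETCH_BLOCK);
    return 0;
}

/* Models the old-fashioned path: the host writes zeroes for every block. */
static void sketch_write_zeroes(struct sketch_dev *d, size_t first,
                                size_t count)
{
    memset(d->data + first * SKETCH_BLOCK, 0, count * SKETCH_BLOCK);
}

/* Zero a block range, preferring WRITE SAME and falling back otherwise.
 * Returns 0 on success, -1 if the range is outside the device. */
static int sketch_zeroout(struct sketch_dev *d, size_t first, size_t count)
{
    static const unsigned char zero_block[SKETCH_BLOCK];

    assert(d != NULL && d->data != NULL);
    if (first > d->nblocks || count > d->nblocks - first)
        return -1;
    if (sketch_write_same(d, first, count, zero_block) == 0)
        return 0;
    sketch_write_zeroes(d, first, count);
    return 0;
}
```

Both paths leave the same bytes on the modelled device; the difference in the real case is how much data has to travel from the host.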
     
  • The WRITE SAME command supported on some SCSI devices allows the same
    block to be efficiently replicated throughout a block range. Only a
    single logical block is transferred from the host and the storage device
    writes the same data to all blocks described by the I/O.

    This patch implements support for WRITE SAME in the block layer. The
    blkdev_issue_write_same() function can be used by filesystems and block
    drivers to replicate a buffer across a block range. This can be used to
    efficiently initialize software RAID devices, etc.

    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • - blk_check_merge_flags() verifies that cmd_flags / bi_rw are
    compatible. This function is called for both req-req and req-bio
    merging.

    - blk_rq_get_max_sectors() and blk_queue_get_max_sectors() can be used
    to query the maximum sector count for a given request or queue. The
    calls will return the right value from the queue limits given the
    type of command (RW, discard, write same, etc.).

    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Martin K. Petersen
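
The two helpers described in the commit above can be pictured with a simplified userspace model. The `sketch_` types and functions below are assumptions for illustration only: the kernel works with request and bio flag words and a larger limits structure, and its compatibility check is finer-grained than the "same command type" rule used here.

```c
#include <assert.h>
#include <stddef.h>

enum sketch_cmd {
    SKETCH_RW,
    SKETCH_DISCARD,
    SKETCH_WRITE_SAME,
};

/* Hypothetical subset of per-queue limits, in sectors. */
struct sketch_limits {
    unsigned int max_sectors;
    unsigned int max_discard_sectors;
    unsigned int max_write_same_sectors;
};

/* Pick the limit that applies to the given command type. */
static unsigned int sketch_get_max_sectors(const struct sketch_limits *lim,
                                           enum sketch_cmd cmd)
{
    assert(lim != NULL);
    switch (cmd) {
    case SKETCH_DISCARD:
        return lim->max_discard_sectors;
    case SKETCH_WRITE_SAME:
        return lim->max_write_same_sectors;
    default:
        return lim->max_sectors;
    }
}

/* Simplified compatibility rule for this model: two commands may be
 * merged only if they are of the same type. Returns 1 if mergeable. */
static int sketch_check_merge(enum sketch_cmd a, enum sketch_cmd b)
{
    return a == b;
}
```

Centralising the lookup means a merge path only has to ask one question, "how large may this kind of command grow on this queue?", instead of open-coding a per-type choice at each call site.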