02 Aug, 2012

2 commits

  • Pull block driver changes from Jens Axboe:

    - Making the plugging support for drivers a bit more sane, from Neil.
    This supersedes the plugging change from Shaohua as well.

    - The usual round of drbd updates.

    - Using a tail add instead of a head add in the request completion for
    nbd, making us find the most completed request more quickly.

    - A few floppy changes, getting rid of a duplicated flag and also
    running the floppy init async (since it takes forever in boot terms),
    from Andi.

    * 'for-3.6/drivers' of git://git.kernel.dk/linux-block:
    floppy: remove duplicated flag FD_RAW_NEED_DISK
    blk: pass from_schedule to non-request unplug functions.
    block: stack unplug
    blk: centralize non-request unplug handling.
    md: remove plug_cnt feature of plugging.
    block/nbd: micro-optimization in nbd request completion
    drbd: announce FLUSH/FUA capability to upper layers
    drbd: fix max_bio_size to be unsigned
    drbd: flush drbd work queue before invalidate/invalidate remote
    drbd: fix potential access after free
    drbd: call local-io-error handler early
    drbd: do not reset rs_pending_cnt too early
    drbd: reset congestion information before reporting it in /proc/drbd
    drbd: report congestion if we are waiting for some userland callback
    drbd: differentiate between normal and forced detach
    drbd: cleanup, remove two unused global flags
    floppy: Run floppy initialization asynchronous

    Linus Torvalds
     
  • Pull core block IO bits from Jens Axboe:
    "The most complicated part if this is the request allocation rework by
    Tejun, which has been queued up for a long time and has been in
    for-next ditto as well.

    There are a few commits from yesterday and today, mostly trivial and
    obvious fixes. So I'm pretty confident that it is sound. It's also
    smaller than usual."

    * 'for-3.6/core' of git://git.kernel.dk/linux-block:
    block: remove dead func declaration
    block: add partition resize function to blkpg ioctl
    block: uninitialized ioc->nr_tasks triggers WARN_ON
    block: do not artificially constrain max_sectors for stacking drivers
    blkcg: implement per-blkg request allocation
    block: prepare for multiple request_lists
    block: add q->nr_rqs[] and move q->rq.elvpriv to q->nr_rqs_elvpriv
    blkcg: inline bio_blkcg() and friends
    block: allocate io_context upfront
    block: refactor get_request[_wait]()
    block: drop custom queue draining used by scsi_transport_{iscsi|fc}
    mempool: add @gfp_mask to mempool_create_node()
    blkcg: make root blkcg allocation use %GFP_KERNEL
    blkcg: __blkg_lookup_create() doesn't need radix preload

    Linus Torvalds
     

01 Aug, 2012

4 commits

  • The __generic_unplug_device() function was removed by commit
    7eaceaccab5f40bbfda044629a6298616aeaed50, which forgot to remove the
    declaration at the same time. Remove it here.

    Signed-off-by: Yuanhan Liu
    Signed-off-by: Jens Axboe

    Yuanhan Liu
     
  • Add a new operation code (BLKPG_RESIZE_PARTITION) to the BLKPG ioctl that
    allows altering the size of an existing partition, even if it is currently
    in use.

    This patch converts hd_struct->nr_sects into a sequence counter, because
    one might extend a partition while IO is happening to it, and the update
    of nr_sects can be non-atomic on 32-bit machines with a 64-bit sector_t.
    This can lead to issues like reading an inconsistent size for a partition.
    A sequence counter is used so that readers don't have to take the bdev
    mutex lock, as we call sector_in_part() very frequently.

    Now all access to hd_struct->nr_sects should happen using the sequence
    counter read/update helper functions part_nr_sects_read/part_nr_sects_write.
    There is one exception though, set_capacity()/get_capacity(). I think
    theoretically a race should exist there too, but this patch does not
    modify set_capacity()/get_capacity() due to the sheer number of call sites,
    and I am afraid that change might break something. I have left that as a
    TODO item. We can handle it later if need be. This patch does not introduce
    any new races as such w.r.t. set_capacity()/get_capacity().

    v2: Add CONFIG_LBDAF test to UP preempt case as suggested by Phillip.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Phillip Susi
    Signed-off-by: Jens Axboe

    Vivek Goyal
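
    As a rough illustration of the mechanism described in the entry above, a
    seqcount-protected size read/write pair could look like the sketch below
    (illustrative only; the struct layout and helper bodies are simplified,
    not the exact in-tree definitions):

    #include <linux/seqlock.h>
    #include <linux/types.h>

    struct part_example {
            sector_t        nr_sects;
            seqcount_t      nr_sects_seq;   /* guards nr_sects where the
                                             * update can't be atomic */
    };

    static inline sector_t part_nr_sects_read_sketch(struct part_example *part)
    {
            sector_t nr_sects;
            unsigned seq;

            do {
                    seq = read_seqcount_begin(&part->nr_sects_seq);
                    nr_sects = part->nr_sects;
            } while (read_seqcount_retry(&part->nr_sects_seq, seq));

            return nr_sects;        /* consistent even against a concurrent resize */
    }

    static inline void part_nr_sects_write_sketch(struct part_example *part,
                                                  sector_t size)
    {
            write_seqcount_begin(&part->nr_sects_seq);
            part->nr_sects = size;
            write_seqcount_end(&part->nr_sects_seq);
    }

    Readers only retry if a resize raced with them, so the hot
    sector_in_part() path stays lock-free.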
     
  • Hi,

    I'm using the old-fashioned 'dump' backup tool, and I noticed that it spews the
    below warning as of 3.5-rc1 and later (3.4 is fine):

    [ 10.886893] ------------[ cut here ]------------
    [ 10.886904] WARNING: at include/linux/iocontext.h:140 copy_process+0x1488/0x1560()
    [ 10.886905] Hardware name: Bochs
    [ 10.886906] Modules linked in:
    [ 10.886908] Pid: 2430, comm: dump Not tainted 3.5.0-rc7+ #27
    [ 10.886908] Call Trace:
    [ 10.886911] [] warn_slowpath_common+0x7a/0xb0
    [ 10.886912] [] warn_slowpath_null+0x15/0x20
    [ 10.886913] [] copy_process+0x1488/0x1560
    [ 10.886914] [] do_fork+0xb4/0x340
    [ 10.886918] [] ? recalc_sigpending+0x1a/0x50
    [ 10.886919] [] ? __set_task_blocked+0x32/0x80
    [ 10.886920] [] ? __set_current_blocked+0x3a/0x60
    [ 10.886923] [] sys_clone+0x23/0x30
    [ 10.886925] [] stub_clone+0x13/0x20
    [ 10.886927] [] ? system_call_fastpath+0x16/0x1b
    [ 10.886928] ---[ end trace 32a14af7ee6a590b ]---

    Reproducing is easy, I can hit it on a KVM system with a very basic
    config (x86_64 make defconfig + enable the drivers needed). To hit it,
    just install dump (on debian/ubuntu, not sure what the package might be
    called on Fedora), and:

    dump -o -f /tmp/foo /

    You'll see the warning in dmesg once it forks off the I/O process and
    starts dumping filesystem contents.

    I bisected it down to the following commit:

    commit f6e8d01bee036460e03bd4f6a79d014f98ba712e
    Author: Tejun Heo
    Date: Mon Mar 5 13:15:26 2012 -0800

    block: add io_context->active_ref

    Currently ioc->nr_tasks is used to decide two things - whether an ioc
    is done issuing IOs and whether it's shared by multiple tasks. This
    patch separates out the first into ioc->active_ref, which is acquired
    and released using {get|put}_io_context_active() respectively.

    This will be used to associate bio's with a given task. This patch
    doesn't introduce any visible behavior change.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    It seems like the init of ioc->nr_tasks was removed in that patch,
    so it starts out at 0 instead of 1.

    Tejun, is the right thing here to add back the init, or should something else
    be done?

    The below patch removes the warning, but I haven't done any more extensive
    testing on it.

    Signed-off-by: Olof Johansson
    Acked-by: Tejun Heo
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Olof Johansson
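
    The shape of the fix being discussed - restoring the initial task
    reference when the io_context is created - would look roughly like the
    sketch below (an assumption on my part; the entry doesn't reproduce the
    actual patch, and the helper name here is made up):

    #include <linux/atomic.h>
    #include <linux/iocontext.h>

    /* Sketch: give the creating task an initial reference so the WARN_ON()
     * in copy_process() (include/linux/iocontext.h:140) no longer sees
     * nr_tasks == 0 on fork. */
    static void ioc_init_task_counts_sketch(struct io_context *ioc)
    {
            atomic_set(&ioc->active_ref, 1);   /* "still issuing IOs" ref */
            atomic_set(&ioc->nr_tasks, 1);     /* the init the report says went missing */
    }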
     
  • blk_set_stacking_limits is intended to allow stacking drivers to build
    up the limits of the stacked device based on the underlying devices'
    limits. But defaulting 'max_sectors' to BLK_DEF_MAX_SECTORS (1024)
    doesn't allow the stacking driver to inherit a max_sectors larger than
    1024 -- due to blk_stack_limits' use of min_not_zero.

    It is now clear that this artificial limit is getting in the way so
    change blk_set_stacking_limits's max_sectors to UINT_MAX (which allows
    stacking drivers like dm-multipath to inherit 'max_sectors' from the
    underlying paths).

    Reported-by: Vijay Chauhan
    Tested-by: Vijay Chauhan
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
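
    Why the 1024-sector default got in the way can be seen from the
    min_not_zero() behaviour the entry refers to; the snippet below is a
    self-contained sketch, not the in-tree blk_stack_limits() code:

    /* Sketch: how a non-zero stacking default always "wins" the minimum. */
    static unsigned int min_not_zero_sketch(unsigned int a, unsigned int b)
    {
            if (!a)
                    return b;
            if (!b)
                    return a;
            return a < b ? a : b;
    }

    /*
     * Stacking default 1024 (BLK_DEF_MAX_SECTORS), underlying limit 2048:
     *     min_not_zero_sketch(1024, 2048) == 1024     -> capped at 1024
     * Stacking default UINT_MAX:
     *     min_not_zero_sketch(UINT_MAX, 2048) == 2048 -> underlying limit inherited
     */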
     

31 Jul, 2012

3 commits

  • This will allow md/raid to know why the unplug was called, and it
    will be able to act accordingly - if !from_schedule it is safe to
    perform tasks which could themselves schedule.

    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    NeilBrown
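
    A driver-side unplug callback acting on the new flag might look like the
    sketch below (illustrative; only the from_schedule parameter follows the
    series, the work item and helper are hypothetical):

    #include <linux/blkdev.h>
    #include <linux/workqueue.h>

    static struct work_struct my_deferred_work;     /* hypothetical, set up elsewhere */
    static void my_dispatch_pending_requests(void); /* hypothetical driver helper */

    static void my_unplug_sketch(struct blk_plug_cb *cb, bool from_schedule)
    {
            if (from_schedule) {
                    /* called from schedule(): must not block, punt to a worker */
                    schedule_work(&my_deferred_work);
                    return;
            }

            /* !from_schedule: safe to do work that may itself schedule */
            my_dispatch_pending_requests();
    }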
     
  • MD RAID1 prepares to dispatch requests in its unplug callback. If
    make_request in the low-level queue also uses an unplug callback to
    dispatch requests, the low-level queue's unplug callback will not be
    called. Rechecking the callback list helps in this case.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • Both md and umem have similar code for getting notified on a
    blk_finish_plug event.
    Centralize this code in block/ and allow each driver to
    provide its distinctive difference.

    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    NeilBrown
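
    In rough terms, a driver using the centralized helper this change
    introduces would do something like the sketch below (hedged: the
    blk_check_plugged() signature is per my reading of the series, and the
    driver-side names are made up):

    struct my_plug_cb_sketch {
            struct blk_plug_cb cb;          /* embedded block-core callback */
            struct my_dev *dev;             /* hypothetical driver state */
    };

    static void my_flush_queued_io(struct my_dev *dev, bool from_schedule);  /* hypothetical */

    static void my_cb_unplug_sketch(struct blk_plug_cb *cb, bool from_schedule)
    {
            struct my_plug_cb_sketch *mcb =
                    container_of(cb, struct my_plug_cb_sketch, cb);

            my_flush_queued_io(mcb->dev, from_schedule);
    }

    static void my_make_request_sketch(struct my_dev *dev)
    {
            struct blk_plug_cb *cb;

            /* find or create this driver's callback on the current plug */
            cb = blk_check_plugged(my_cb_unplug_sketch, dev,
                                   sizeof(struct my_plug_cb_sketch));
            if (cb) {
                    /* a plug is active; IO is kicked from my_cb_unplug_sketch() */
            }
    }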
     

20 Jul, 2012

1 commit

  • If the queue is dead, blk_execute_rq_nowait() doesn't invoke the done()
    callback function. That will result in blk_execute_rq() being stuck
    in wait_for_completion(). Avoid this by initializing rq->end_io to the
    done() callback before we check the queue state. Also, make sure the
    queue lock is held around the invocation of the done() callback. Found
    this through source code review.

    Signed-off-by: Muthukumar Ratty
    Signed-off-by: Bart Van Assche
    Reviewed-by: Tejun Heo
    Acked-by: Jens Axboe
    Signed-off-by: James Bottomley

    Muthukumar Ratty
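
    The ordering the entry describes, in heavily abridged sketch form (the
    real function carries more state; this only shows the two points being
    fixed):

    #include <linux/blkdev.h>
    #include <linux/elevator.h>

    static void execute_rq_nowait_sketch(struct request_queue *q,
                                         struct request *rq, int at_head,
                                         rq_end_io_fn *done)
    {
            rq->end_io = done;              /* 1) assign before looking at queue state */

            spin_lock_irq(q->queue_lock);
            if (unlikely(blk_queue_dead(q))) {
                    rq->errors = -ENXIO;
                    if (rq->end_io)
                            rq->end_io(rq, rq->errors);  /* 2) done() runs with lock held */
                    spin_unlock_irq(q->queue_lock);
                    return;
            }

            __elv_add_request(q, rq, at_head ? ELEVATOR_INSERT_FRONT
                                             : ELEVATOR_INSERT_BACK);
            __blk_run_queue(q);
            spin_unlock_irq(q->queue_lock);
    }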
     

27 Jun, 2012

1 commit

  • Currently, request_queue has one request_list to allocate requests
    from regardless of blkcg of the IO being issued. When the unified
    request pool is used up, cfq proportional IO limits become meaningless
    - whoever grabs the next request being freed wins the race regardless
    of the configured weights.

    This can be easily demonstrated by creating a blkio cgroup w/ very low
    weight, put a program which can issue a lot of random direct IOs there
    and running a sequential IO from a different cgroup. As soon as the
    request pool is used up, the sequential IO bandwidth crashes.

    This patch implements per-blkg request_list. Each blkg has its own
    request_list and any IO allocates its request from the matching blkg
    making blkcgs completely isolated in terms of request allocation.

    * Root blkcg uses the request_list embedded in each request_queue,
    which was renamed to @q->root_rl from @q->rq. While making blkcg rl
    handling a bit hairier, this enables avoiding most overhead for root
    blkcg.

    * Queue fullness is properly per request_list but bdi isn't blkcg
    aware yet, so congestion state currently just follows the root
    blkcg. As writeback isn't aware of blkcg yet, this works okay for
    async congestion but readahead may get the wrong signals. It's
    better than blkcg completely collapsing with shared request_list but
    needs to be improved with future changes.

    * After this change, each block cgroup gets a full request pool making
    resource consumption of each cgroup higher. This makes allowing
    non-root users to create cgroups less desirable; however, note that
    allowing non-root users to directly manage cgroups is already
    severely broken regardless of this patch - each block cgroup
    consumes kernel memory and skews IO weight (IO weights are not
    hierarchical).

    v2: queue-sysfs.txt updated and patch description updated as suggested
    by Vivek.

    v3: blk_get_rl() wasn't checking error return from
    blkg_lookup_create() and may cause oops on lookup failure. Fix it
    by falling back to root_rl on blkg lookup failures. This problem
    was spotted by Rakesh Iyer.

    v4: Updated to accommodate 458f27a982 "block: Avoid missed wakeup in
    request waitqueue". blk_drain_queue() now wakes up waiters on all
    blkg->rl on the target queue.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Cc: Wu Fengguang
    Signed-off-by: Jens Axboe

    Tejun Heo
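
    A simplified sketch of the lookup-with-fallback described in v3 above
    (RCU and reference counting omitted; helper names are those introduced
    by this series, but this is not the exact in-tree code):

    /* Sketch: pick the request_list matching the bio's blkcg, falling back
     * to the queue-embedded root_rl when the blkg can't be looked up or
     * created. */
    static struct request_list *get_rl_sketch(struct request_queue *q,
                                              struct bio *bio)
    {
            struct blkcg *blkcg = bio_blkcg(bio);
            struct blkcg_gq *blkg;

            if (blkcg == &blkcg_root)
                    return &q->root_rl;             /* root uses the embedded rl */

            blkg = blkg_lookup_create(blkcg, q);    /* can fail under memory pressure */
            if (IS_ERR(blkg))
                    return &q->root_rl;             /* v3: fall back instead of oopsing */

            return &blkg->rl;
    }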
     

25 Jun, 2012

9 commits

  • Request allocation is about to be made per-blkg meaning that there'll
    be multiple request lists.

    * Make queue full state per request_list. blk_*queue_full() functions
    are renamed to blk_*rl_full() and take @rl instead of @q.

    * Rename blk_init_free_list() to blk_init_rl() and make it take @rl
    instead of @q. Also add @gfp_mask parameter.

    * Add blk_exit_rl() instead of destroying rl directly from
    blk_release_queue().

    * Add request_list->q and make request alloc/free functions -
    blk_free_request(), [__]freed_request(), __get_request() - take @rl
    instead of @q.

    This patch doesn't introduce any functional difference.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Add q->nr_rqs[] which currently behaves the same as q->rq.count[] and
    move q->rq.elvpriv to q->nr_rqs_elvpriv. blk_drain_queue() is updated
    to use q->nr_rqs[] instead of q->rq.count[].

    These counters separate queue-wide request statistics from the
    request list and allow implementation of per-queue request allocation.

    While at it, properly indent fields of struct request_list.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Make bio_blkcg() and friends inline. They are all very simple and
    used only in a few places.

    This patch is to prepare for further updates to request allocation
    path.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • The block layer does very lazy allocation of the ioc. It waits until
    the moment the ioc is absolutely necessary; unfortunately, that time
    could be inside the queue lock, and __get_request() performs an
    unlock - try alloc - retry dance.

    Just allocate it up-front on entry to block layer. We're not saving
    the rain forest by deferring it to the last possible moment and
    complicating things unnecessarily.

    This patch is to prepare for further updates to request allocation
    path.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, there are two request allocation functions - get_request()
    and get_request_wait(). The former tries to allocate a request once;
    the latter wraps the former and keeps retrying until allocation
    succeeds.

    The combination of the two functions delivers fallible non-wait
    allocation, fallible wait allocation and unfailing wait allocation.
    However, given that forward progress is guaranteed, fallible wait
    allocation isn't all that useful and in fact nobody uses it.

    This patch simplifies the interface as follows.

    * get_request() is renamed to __get_request() and is only used by the
    wrapper function.

    * get_request_wait() is renamed to get_request(). It now takes
    @gfp_mask and retries iff it contains %__GFP_WAIT.

    This patch doesn't introduce any functional change and is to prepare
    for further updates to request allocation path.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
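
    The resulting control flow, sketched below (abridged: tracing, congestion
    marking and queue-lock handling are left out; __get_request() stands for
    the renamed single-attempt helper described above):

    static struct request *get_request_sketch(struct request_queue *q,
                                              int rw_flags, struct bio *bio,
                                              gfp_t gfp_mask)
    {
            const bool is_sync = rw_is_sync(rw_flags) != 0;
            DEFINE_WAIT(wait);
            struct request *rq;

    retry:
            rq = __get_request(q, rw_flags, bio, gfp_mask); /* single attempt */
            if (rq)
                    return rq;

            /* retry only if the caller can wait and the queue is still alive */
            if (!(gfp_mask & __GFP_WAIT) || unlikely(blk_queue_dead(q)))
                    return NULL;

            /* the real code holds the queue lock across the failed attempt
             * and prepare_to_wait to avoid a lost wakeup; omitted here */
            prepare_to_wait_exclusive(&q->rq.wait[is_sync], &wait,
                                      TASK_UNINTERRUPTIBLE);
            io_schedule();                  /* woken when a request is freed */
            finish_wait(&q->rq.wait[is_sync], &wait);
            goto retry;
    }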
     
  • iscsi_remove_host() uses bsg_remove_queue() which implements custom
    queue draining. fc_bsg_remove() open-codes mostly identical logic.

    The draining logic isn't correct in that blk_stop_queue() doesn't
    prevent new requests from being queued - it just stops processing, so
    nothing prevents new requests from being queued after the logic
    determines that the queue is drained.

    blk_cleanup_queue() now implements proper queue draining and these
    custom draining logics aren't necessary. Drop them and use
    bsg_unregister_queue() + blk_cleanup_queue() instead.

    Signed-off-by: Tejun Heo
    Reviewed-by: Mike Christie
    Acked-by: Vivek Goyal
    Cc: James Bottomley
    Cc: James Smart
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • mempool_create_node() currently assumes %GFP_KERNEL. Its only user,
    blk_init_free_list(), is about to be updated to use other allocation
    flags - add @gfp_mask argument to the function.

    Signed-off-by: Tejun Heo
    Cc: Andrew Morton
    Cc: Hugh Dickins
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, blkcg_activate_policy() depends on %GFP_ATOMIC allocation
    from __blkg_lookup_create() for root blkcg creation. This could make
    policy activation fail unnecessarily.

    Make blkg_alloc() take @gfp_mask, __blkg_lookup_create() take an
    optional @new_blkg for preallocated blkg, and blkcg_activate_policy()
    preload radix tree and preallocate blkg with %GFP_KERNEL before trying
    to create the root blkg.

    v2: __blkg_lookup_create() was returning %NULL on blkg alloc failure
    instead of ERR_PTR() value. Fixed.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
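
    The preload-then-insert pattern the allocation moves toward is the
    standard radix-tree idiom sketched below (a generic sketch, not the
    blkcg-specific code):

    #include <linux/radix-tree.h>

    static int insert_with_preload_sketch(struct radix_tree_root *root,
                                          unsigned long index, void *item)
    {
            int ret;

            /* may sleep: preallocate tree nodes with GFP_KERNEL up front */
            ret = radix_tree_preload(GFP_KERNEL);
            if (ret)
                    return ret;

            /* ... take the spinlock that protects the tree ... */
            ret = radix_tree_insert(root, index, item);  /* uses preloaded nodes */
            /* ... drop the spinlock ... */

            radix_tree_preload_end();       /* re-enables preemption */
            return ret;
    }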
     
  • There's no point in calling radix_tree_preload() if preloading doesn't
    use a more permissive GFP mask. Drop preloading from
    __blkg_lookup_create().

    While at it, drop sparse locking annotation which no longer applies.

    v2: Vivek pointed out the odd preload usage. Instead of updating,
    just drop it.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     

15 Jun, 2012

4 commits

  • Sometimes, warnings about ioctls to a partition happen often enough that
    they form the majority of warnings in the kernel log and users complain.
    In some cases the warnings are about ioctls such as SG_IO, so it's not
    good to get rid of the warnings completely, as they can ease debugging of
    userspace problems when an ioctl is refused.

    Since I have seen warnings from lots of commands, including some
    proprietary userspace applications, I don't think disallowing the ioctls
    for processes with CAP_SYS_RAWIO will happen in the near future, if ever.
    So let's just stop warning for processes with CAP_SYS_RAWIO, for which
    the ioctl is allowed.

    CC: Paolo Bonzini
    CC: James Bottomley
    CC: linux-scsi@vger.kernel.org
    Acked-by: Paolo Bonzini
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
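
    In sketch form, the check being described (abridged; the real partition
    ioctl verification also whitelists a set of safe ioctls for unprivileged
    callers, which is left out here):

    #include <linux/capability.h>
    #include <linux/sched.h>

    static int verify_part_ioctl_sketch(unsigned int cmd)
    {
            if (capable(CAP_SYS_RAWIO))
                    return 0;               /* allowed anyway - and no log noise */

            /* unprivileged callers still get the (rate-limited) hint */
            if (printk_ratelimit())
                    printk(KERN_WARNING "%s: sending ioctl %x to a partition!\n",
                           current->comm, cmd);

            return -ENOIOCTLCMD;
    }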
     
  • This function was only used by btrfs code in btrfs_abort_devices()
    (seemingly in a wrong way).

    It was removed in commit d07eb9117050c9ed3f78296ebcc06128b52693be,
    so let's remove the dead code to avoid any confusion.

    Changes in v2: update commit log; btrfs_abort_devices() was removed
    already.

    Cc: Jens Axboe
    Cc: linux-kernel@vger.kernel.org
    Cc: Chris Mason
    Cc: linux-btrfs@vger.kernel.org
    Cc: David Sterba
    Signed-off-by: Asias He
    Signed-off-by: Jens Axboe

    Asias He
     
  • Commit 777eb1bf15b8532c396821774bf6451e563438f5 disconnects the externally
    supplied queue_lock before blk_drain_queue(). Switching the lock would
    introduce a lock imbalance because threads which have taken the external
    lock might unlock the internal lock during the queue drain. This
    patch mitigates this by disconnecting the lock after the queue draining,
    since queue draining makes a lot of request_queue users go away.

    However, please note, this patch only makes the problem less likely to
    happen. Anyone who still holds a ref might try to issue a new request on
    a dead queue after blk_cleanup_queue() finishes draining, and the lock
    imbalance might still happen in this case.

    =====================================
    [ BUG: bad unlock balance detected! ]
    3.4.0+ #288 Not tainted
    -------------------------------------
    fio/17706 is trying to release lock (&(&q->__queue_lock)->rlock) at:
    [] blk_queue_bio+0x2a2/0x380
    but there are no more locks to release!

    other info that might help us debug this:
    1 lock held by fio/17706:
    #0: (&(&vblk->lock)->rlock){......}, at: []
    get_request_wait+0x19a/0x250

    stack backtrace:
    Pid: 17706, comm: fio Not tainted 3.4.0+ #288
    Call Trace:
    [] ? blk_queue_bio+0x2a2/0x380
    [] print_unlock_inbalance_bug+0xf9/0x100
    [] lock_release_non_nested+0x1df/0x330
    [] ? dio_bio_end_aio+0x34/0xc0
    [] ? bio_check_pages_dirty+0x85/0xe0
    [] ? dio_bio_end_aio+0xb1/0xc0
    [] ? blk_queue_bio+0x2a2/0x380
    [] ? blk_queue_bio+0x2a2/0x380
    [] lock_release+0xd9/0x250
    [] _raw_spin_unlock_irq+0x23/0x40
    [] blk_queue_bio+0x2a2/0x380
    [] generic_make_request+0xca/0x100
    [] submit_bio+0x76/0xf0
    [] ? set_page_dirty_lock+0x3c/0x60
    [] ? bio_set_pages_dirty+0x51/0x70
    [] do_blockdev_direct_IO+0xbf8/0xee0
    [] ? blkdev_get_block+0x80/0x80
    [] __blockdev_direct_IO+0x55/0x60
    [] ? blkdev_get_block+0x80/0x80
    [] blkdev_direct_IO+0x57/0x60
    [] ? blkdev_get_block+0x80/0x80
    [] generic_file_aio_read+0x70e/0x760
    [] ? __lock_acquire+0x215/0x5a0
    [] ? aio_run_iocb+0x54/0x1a0
    [] ? grab_cache_page_nowait+0xc0/0xc0
    [] aio_rw_vect_retry+0x7c/0x1e0
    [] ? aio_fsync+0x30/0x30
    [] aio_run_iocb+0x66/0x1a0
    [] do_io_submit+0x6f0/0xb80
    [] ? trace_hardirqs_on_thunk+0x3a/0x3f
    [] sys_io_submit+0x10/0x20
    [] system_call_fastpath+0x16/0x1b

    Changes since v2: Update commit log to explain how the code is still
    broken even if we delay the lock switching after the drain.
    Changes since v1: Update commit log as Tejun suggested.

    Acked-by: Tejun Heo
    Signed-off-by: Asias He
    Signed-off-by: Jens Axboe

    Asias He
     
  • After hot-unplugging a stressed disk, I found that rl->wait[] is not empty
    while rl->count[] is empty and there are threads still sleeping on
    get_request after the queue cleanup. With simple debug code, I found
    there are exactly nr_sleep - nr_wakeup threads in D state. So there
    are missed wakeups.

    $ dmesg | grep nr_sleep
    [ 52.917115] ---> nr_sleep=1046, nr_wakeup=873, delta=173
    $ vmstat 1
    1 173 0 712640 24292 96172 0 0 0 0 419 757 0 0 0 100 0

    To quote Tejun:

    Ah, okay, freed_request() wakes up single waiter with the assumption
    that after the wakeup there will at least be one successful allocation
    which in turn will continue the wakeup chain until the wait list is
    empty - ie. waiter wakeup is dependent on successful request
    allocation happening after each wakeup. With queue marked dead, any
    woken up waiter fails the allocation path, so the wakeup chaining is
    lost and we're left with hung waiters. What we need is wake_up_all()
    after drain completion.

    This patch fixes the missed wakeups by waking up all the threads which
    are sleeping on the wait queue after the queue drain.

    Changes in v2: Drop waitqueue_active() optimization

    Acked-by: Tejun Heo
    Signed-off-by: Asias He

    Fixed a bug by me, where stacked devices would oops on calling
    blk_drain_queue() since ->rq.wait[] do not get initialized unless
    it's a full queue setup.

    Signed-off-by: Jens Axboe

    Asias He
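
    In sketch form, the wakeup Tejun describes above (illustrative; in the
    real code this runs at the end of queue draining, under the appropriate
    locking):

    #include <linux/blkdev.h>

    static void drain_done_wake_all_sketch(struct request_list *rl)
    {
            int i;

            /* A single wake_up() relies on a successful allocation to chain
             * the next wakeup; once the queue is dead that chain breaks, so
             * wake every sleeper and let each fail its allocation and exit. */
            for (i = 0; i < ARRAY_SIZE(rl->wait); i++)
                    wake_up_all(&rl->wait[i]);
    }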
     

06 Jun, 2012

1 commit

  • blkg_destroy() caches @blkg->q in the local variable @q. While there are
    two places which need @blkg->q, only lockdep_assert_held() used the
    local variable, leading to an unused local variable warning if lockdep is
    configured out. Drop the local variable and just use @blkg->q
    directly.

    Signed-off-by: Tejun Heo
    Reported-by: Rakesh Iyer
    Signed-off-by: Jens Axboe

    Tejun Heo
     

04 Jun, 2012

3 commits

  • When policy data allocation fails in the middle, blkg_alloc() invokes
    blkg_free() to destroy the half constructed blkg. This ends up
    calling pd_exit_fn() on policy datas which didn't go through
    pd_init_fn(). Fix it by making blkg_alloc() call pd_init_fn()
    immediately after each policy data allocation.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • cfq may be built w/ or w/o blkcg support depending on
    CONFIG_CFQ_GROUP_IOSCHED. If blkcg support is disabled, most of the
    related code is ifdef'd out but some parts are left dangling -
    blkcg_policy_cfq is left zero-filled and blkcg_policy_[un]register()
    calls are made on it.

    Feeding zero filled policy to blkcg_policy_register() is incorrect and
    triggers the following WARN_ON() if CONFIG_BLK_CGROUP &&
    !CONFIG_CFQ_GROUP_IOSCHED.

    ------------[ cut here ]------------
    WARNING: at block/blk-cgroup.c:867
    Modules linked in:
    Modules linked in:
    CPU: 3 Not tainted 3.4.0-09547-gfb21aff #1
    Process swapper/0 (pid: 1, task: 000000003ff80000, ksp: 000000003ff7f8b8)
    Krnl PSW : 0704100180000000 00000000003d76ca (blkcg_policy_register+0xca/0xe0)
    R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0 EA:3
    Krnl GPRS: 0000000000000000 00000000014b85ec 00000000014b85b0 0000000000000000
    000000000096fb60 0000000000000000 00000000009a8e78 0000000000000048
    000000000099c070 0000000000b6f000 0000000000000000 000000000099c0b8
    00000000014b85b0 0000000000667580 000000003ff7fd98 000000003ff7fd70
    Krnl Code: 00000000003d76be: a7280001 lhi %r2,1
    00000000003d76c2: a7f4ffdf brc 15,3d7680
    #00000000003d76c6: a7f40001 brc 15,3d76c8
    >00000000003d76ca: a7c8ffea lhi %r12,-22
    00000000003d76ce: a7f4ffce brc 15,3d766a
    00000000003d76d2: a7f40001 brc 15,3d76d4
    00000000003d76d6: a7c80000 lhi %r12,0
    00000000003d76da: a7f4ffc2 brc 15,3d765e
    Call Trace:
    ([] initcall_debug+0x0/0x4)
    [] cfq_init+0x62/0xd4
    [] do_one_initcall+0x3a/0x170
    [] kernel_init+0x214/0x2bc
    [] kernel_thread_starter+0x6/0xc
    [] kernel_thread_starter+0x0/0xc
    no locks held by swapper/0/1.
    Last Breaking-Event-Address:
    [] blkcg_policy_register+0xc6/0xe0
    ---[ end trace b8ef4903fcbf9dd3 ]---

    This patch fixes the problem by ensuring all blkcg support code is
    inside CONFIG_CFQ_GROUP_IOSCHED.

    * blkcg_policy_cfq declaration and blkg_to_cfqg() definition are moved
    inside the first CONFIG_CFQ_GROUP_IOSCHED block. __maybe_unused is
    dropped from blkcg_policy_cfq decl.

    * blkcg_deactivate_policy() invocation is moved inside ifdef. This
    also makes the activation logic match cfq_init_queue().

    * All blkcg_policy_[un]register() invocations are moved inside ifdef.

    Signed-off-by: Tejun Heo
    Reported-by: Heiko Carstens
    LKML-Reference:
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • cfq_init() would return zero after kmem cache creation failure. Fix
    so that it returns -ENOMEM.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

31 May, 2012

1 commit

  • Calling get_task_io_context() on an exiting task which isn't %current can
    loop forever. This triggers at boot time on my dev machine.

    BUG: soft lockup - CPU#3 stuck for 22s ! [mountall.1603]

    Fix this by making create_task_io_context() return -EBUSY in this case
    to break the loop.

    Signed-off-by: Eric Dumazet
    Cc: Tejun Heo
    Cc: Andrew Morton
    Cc: Alan Cox
    Signed-off-by: Jens Axboe

    Eric Dumazet
     

30 May, 2012

1 commit

  • Merge block/IO core bits from Jens Axboe:
    "This is a bit bigger on the core side than usual, but that is purely
    because we decided to hold off on parts of Tejun's submission on 3.4
    to give it a bit more time to simmer. As a consequence, it's seen a
    long cycle in for-next.

    It contains:

    - Bug fix from Dan, wrong locking type.
    - Relax splice gifting restriction from Eric.
    - A ton of updates from Tejun, primarily for blkcg. This improves
    the code a lot, making the API nicer and cleaner, and also includes
    fixes for how we handle and tie policies and re-activate on
    switches. The changes also include generic bug fixes.
    - A simple fix from Vivek, along with a fix for doing proper delayed
    allocation of the blkcg stats."

    Fix up annoying conflict just due to different merge resolution in
    Documentation/feature-removal-schedule.txt

    * 'for-3.5/core' of git://git.kernel.dk/linux-block: (92 commits)
    blkcg: tg_stats_alloc_lock is an irq lock
    vmsplice: relax alignement requirements for SPLICE_F_GIFT
    blkcg: use radix tree to index blkgs from blkcg
    blkcg: fix blkcg->css ref leak in __blkg_lookup_create()
    block: fix elvpriv allocation failure handling
    block: collapse blk_alloc_request() into get_request()
    blkcg: collapse blkcg_policy_ops into blkcg_policy
    blkcg: embed struct blkg_policy_data in policy specific data
    blkcg: mass rename of blkcg API
    blkcg: style cleanups for blk-cgroup.h
    blkcg: remove blkio_group->path[]
    blkcg: blkg_rwstat_read() was missing inline
    blkcg: shoot down blkgs if all policies are deactivated
    blkcg: drop stuff unused after per-queue policy activation update
    blkcg: implement per-queue policy activation
    blkcg: add request_queue->root_blkg
    blkcg: make request_queue bypassing on allocation
    blkcg: make sure blkg_lookup() returns %NULL if @q is bypassing
    blkcg: make blkg_conf_prep() take @pol and return with queue lock held
    blkcg: remove static policy ID enums
    ...

    Linus Torvalds
     

23 May, 2012

2 commits

  • tg_stats_alloc_lock nests inside the queue lock and should always be held
    with irqs disabled. throtl_pd_{init|exit}() were using non-irqsafe
    spinlock ops, which triggered an inverse-lock-ordering-via-irq warning
    when RCU freeing of a blkg invoked throtl_pd_exit() w/o disabling IRQs.

    Update both functions to use irq safe operations.

    Signed-off-by: Tejun Heo
    Reported-by: Sasha Levin
    LKML-Reference:
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Pull cgroup updates from Tejun Heo:
    "cgroup file type addition / removal is updated so that file types are
    added and removed instead of individual files so that dynamic file
    type addition / removal can be implemented by cgroup and used by
    controllers. blkio controller changes which will come through block
    tree are dependent on this. Other changes include res_counter cleanup
    and disallowing kthread / PF_THREAD_BOUND threads to be attached to
    non-root cgroups.

    There's a reported bug with the file type addition / removal handling
    which can lead to oops on cgroup umount. The issue is being looked
    into. It shouldn't cause problems for most setups and isn't a
    security concern."

    Fix up trivial conflict in Documentation/feature-removal-schedule.txt

    * 'for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
    res_counter: Account max_usage when calling res_counter_charge_nofail()
    res_counter: Merge res_counter_charge and res_counter_charge_nofail
    cgroups: disallow attaching kthreadd or PF_THREAD_BOUND threads
    cgroup: remove cgroup_subsys->populate()
    cgroup: get rid of populate for memcg
    cgroup: pass struct mem_cgroup instead of struct cgroup to socket memcg
    cgroup: make css->refcnt clearing on cgroup removal optional
    cgroup: use negative bias on css->refcnt to block css_tryget()
    cgroup: implement cgroup_rm_cftypes()
    cgroup: introduce struct cfent
    cgroup: relocate __d_cgrp() and __d_cft()
    cgroup: remove cgroup_add_file[s]()
    cgroup: convert memcg controller to the new cftype interface
    memcg: always create memsw files if CONFIG_CGROUP_MEM_RES_CTLR_SWAP
    cgroup: convert all non-memcg controllers to the new cftype interface
    cgroup: relocate cftype and cgroup_subsys definitions in controllers
    cgroup: merge cft_release_agent cftype array into the base files array
    cgroup: implement cgroup_add_cftypes() and friends
    cgroup: build list of all cgroups under a given cgroupfs_root
    cgroup: move cgroup_clear_directory() call out of cgroup_populate_dir()
    ...

    Linus Torvalds
     

22 May, 2012

1 commit

  • Pull s390 updates from Martin Schwidefsky:
    "Just a random collection of bug-fixes and cleanups, nothing new in
    this merge request."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (46 commits)
    s390/ap: Fix wrong or missing comments
    s390/ap: move receive callback to message struct
    s390/dasd: re-prioritize partition detection message
    s390/qeth: reshuffle initialization
    s390/qeth: cleanup drv attr usage
    s390/claw: cleanup drv attr usage
    s390/lcs: cleanup drv attr usage
    s390/ctc: cleanup drv attr usage
    s390/ccwgroup: remove ccwgroup_create_from_string
    s390/qeth: stop using struct ccwgroup driver for discipline callbacks
    s390/qeth: switch to ccwgroup_create_dev
    s390/claw: switch to ccwgroup_create_dev
    s390/lcs: switch to ccwgroup_create_dev
    s390/ctcm: switch to ccwgroup_create_dev
    s390/ccwgroup: exploit ccwdev_by_dev_id
    s390/ccwgroup: introduce ccwgroup_create_dev
    s390: fix race on TIF_MCCK_PENDING
    s390/barrier: make use of fast-bcr facility
    s390/barrier: cleanup barrier functions
    s390/claw: remove "eieio" calls
    ...

    Linus Torvalds
     


15 May, 2012

1 commit

  • 6d1d8050b4bc8 "block, partition: add partition_meta_info to hd_struct"
    added part_unpack_uuid() which assumes that the passed in buffer has
    enough space for sprintfing "%pU" - 37 characters including '\0'.

    Unfortunately, b5af921ec0233 "init: add support for root devices
    specified by partition UUID" supplied 33 bytes buffer to the function
    leading to the following panic with stackprotector enabled.

    Kernel panic - not syncing: stack-protector: Kernel stack corrupted in: ffffffff81b14c7e

    [] panic+0xba/0x1c6
    [] ? printk_all_partitions+0x259/0x26b
    [] __stack_chk_fail+0x1b/0x20
    [] printk_all_partitions+0x259/0x26b
    [] mount_block_root+0x1bc/0x27f
    [] mount_root+0x57/0x5b
    [] prepare_namespace+0x13d/0x176
    [] ? release_tgcred.isra.4+0x330/0x30
    [] kernel_init+0x155/0x15a
    [] ? schedule_tail+0x27/0xb0
    [] kernel_thread_helper+0x5/0x10
    [] ? start_kernel+0x3c5/0x3c5
    [] ? gs_change+0x13/0x13

    Increase the buffer size, remove the dangerous part_unpack_uuid() and
    use snprintf() directly from printk_all_partitions().

    Signed-off-by: Tejun Heo
    Reported-by: Szymon Gruszczynski
    Cc: Will Drewry
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe

    Tejun Heo
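
    The safe pattern the fix moves to, sketched (illustrative; "%pU" expands
    to 36 characters, so the buffer needs 37 bytes including the NUL):

    static void format_part_uuid_sketch(char *buf, size_t buflen, const u8 *uuid)
    {
            /* snprintf() bounds the write and NUL-terminates within buflen,
             * so an undersized buffer can no longer smash the stack. */
            snprintf(buf, buflen, "%pU", uuid);
    }

    /* caller side: */
    char uuid_buf[37];              /* 32 hex digits + 4 dashes + '\0' */
    format_part_uuid_sketch(uuid_buf, sizeof(uuid_buf), part_uuid);  /* part_uuid: 16-byte UUID */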
     


20 Apr, 2012

4 commits

  • blkg lookup is currently performed by traversing a linked list anchored
    at blkcg->blkg_list. This is very unscalable and, with blk-throttle
    enabled and enough request queues on the system, this can get very
    ugly quickly (blk-throttle performs a lookup on every bio submission).

    This patch makes blkcg use radix tree to index blkgs combined with
    simple last-looked-up hint. This is mostly identical to how icqs are
    indexed from ioc.

    Note that because __blkg_lookup() may be invoked without holding queue
    lock, hint is only updated from __blkg_lookup_create(). Due to cfq's
    cfqq caching, this makes hint updates overly lazy. This will be
    improved with scheduled blkcg aware request allocation.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
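
    A sketch of the hinted lookup described above (simplified; RCU usage is
    reduced to a comment, and the field names follow my reading of the
    series rather than being quoted from it):

    /* caller must hold rcu_read_lock() */
    static struct blkcg_gq *blkg_lookup_sketch(struct blkcg *blkcg,
                                               struct request_queue *q)
    {
            struct blkcg_gq *blkg;

            /* fast path: the last blkg this blkcg looked up */
            blkg = rcu_dereference(blkcg->blkg_hint);
            if (blkg && blkg->q == q)
                    return blkg;

            /* slow path: radix tree indexed by the queue id */
            blkg = radix_tree_lookup(&blkcg->blkg_tree, q->id);
            if (blkg && blkg->q == q)
                    return blkg;

            return NULL;
    }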
     
  • __blkg_lookup_create() leaked blkcg->css ref if blkg allocation
    failed. Fix it.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Request allocation is mempool backed to guarantee forward progress
    under memory pressure; unfortunately, this property got broken while
    adding elvpriv data. Failures during elvpriv allocation, including
    ioc and icq creation failures, currently make get_request() fail as a
    whole. There's no forward progress guarantee for these allocations -
    they may fail indefinitely under memory pressure stalling IO and
    deadlocking the system.

    This patch updates get_request() such that elvpriv allocation failure
    doesn't make the whole function fail. If elvpriv allocation fails,
    the allocation is degraded into !ELVPRIV. This will force the request
    to ELEVATOR_INSERT_BACK disturbing scheduling but elvpriv alloc
    failures should be rare (nothing is per-request) and anything is
    better than deadlocking.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Allocation failure handling in get_request() is about to be updated.
    To ease the update, collapse blk_alloc_request() into get_request().

    This patch doesn't introduce any functional change.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo