03 Oct, 2012

2 commits

  • Pull cgroup hierarchy update from Tejun Heo:
    "Currently, different cgroup subsystems handle nested cgroups
    completely differently. There's no consistency among subsystems and
    the behaviors often are outright broken.

    People at least seem to agree that the broken hierarchy behaviors need
    to be weeded out if any progress is gonna be made on this front and
    that the fallouts from deprecating the broken behaviors should be
    acceptable especially given that the current behaviors don't make much
    sense when nested.

    This patch makes cgroup emit warning messages if cgroups for
    subsystems with broken hierarchy behavior are nested to prepare for
    fixing them in the future. This was put in a separate branch because
    more related changes were expected (didn't make it this round) and the
    memory cgroup wanted to pull in this and make changes on top."

    * 'for-3.7-hierarchy' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them

    Linus Torvalds
     
  • Pull workqueue changes from Tejun Heo:
    "This is workqueue updates for v3.7-rc1. A lot of activities this
    round including considerable API and behavior cleanups.

    * delayed_work combines a timer and a work item. The handling of the
    timer part has always been a bit clunky leading to confusing
    cancelation API with weird corner-case behaviors. delayed_work is
    updated to use new IRQ safe timer and cancelation now works as
    expected.

    * Another deficiency of delayed_work was lack of the counterpart of
    mod_timer() which led to cancel+queue combinations or open-coded
    timer+work usages. mod_delayed_work[_on]() are added.

    These two delayed_work changes make delayed_work provide an interface
    and behave like a timer which executes in process context.

    * A work item could be executed concurrently on multiple CPUs, which
    is rather unintuitive and made flush_work() behavior confusing and
    half-broken under certain circumstances. This problem doesn't
    exist for non-reentrant workqueues. While the non-reentrancy check
    isn't free, the overhead is incurred only when a work item bounces
    across different CPUs, and even in a simulated pathological scenario
    the overhead isn't too high.

    All workqueues are made non-reentrant. This removes the
    distinction between flush_[delayed_]work() and
    flush_[delayed_]work_sync(). The former is now as strong as the
    latter and the specified work item is guaranteed to have finished
    execution of any previous queueing on return.

    * In addition to the various bug fixes, Lai redid and simplified CPU
    hotplug handling significantly.

    * Joonsoo introduced system_highpri_wq and used it during CPU
    hotplug.

    There are two merge commits - one to pull in IRQ safe timer from
    tip/timers/core and the other to pull in CPU hotplug fixes from
    wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."

    Fixed a number of trivial conflicts, but the more interesting conflicts
    were silent ones where the deprecated interfaces had been used by new
    code in the merge window, and thus didn't cause any real data conflicts.

    Tejun pointed out a few of them, I fixed a couple more.

    * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
    workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
    workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
    workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
    workqueue: remove @delayed from cwq_dec_nr_in_flight()
    workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
    workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
    workqueue: use __cpuinit instead of __devinit for cpu callbacks
    workqueue: rename manager_mutex to assoc_mutex
    workqueue: WORKER_REBIND is no longer necessary for idle rebinding
    workqueue: WORKER_REBIND is no longer necessary for busy rebinding
    workqueue: reimplement idle worker rebinding
    workqueue: deprecate __cancel_delayed_work()
    workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
    workqueue: use mod_delayed_work() instead of __cancel + queue
    workqueue: use irqsafe timer for delayed_work
    workqueue: clean up delayed_work initializers and add missing one
    workqueue: make deferrable delayed_work initializer names consistent
    workqueue: cosmetic whitespace updates for macro definitions
    workqueue: deprecate system_nrt[_freezable]_wq
    workqueue: deprecate flush[_delayed]_work_sync()
    ...

    Linus Torvalds
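
    The merge above centers on the delayed_work rework. A minimal sketch of
    the before/after pattern it enables, with hypothetical driver names
    (my_dev, my_poll_fn, my_kick_*) and assuming the work item was set up
    with INIT_DELAYED_WORK(), might look like this:

        #include <linux/workqueue.h>

        /* Hypothetical driver state; names are illustrative, not from the patches. */
        struct my_dev {
            struct delayed_work poll_work;  /* INIT_DELAYED_WORK(&dev->poll_work, my_poll_fn) at probe time */
        };

        static void my_poll_fn(struct work_struct *work)
        {
            struct my_dev *dev = container_of(to_delayed_work(work),
                                              struct my_dev, poll_work);
            /* ... poll hardware; runs in process context ... */
        }

        /* Old style: cancel + requeue, clunky and easy to get wrong. */
        static void my_kick_old(struct my_dev *dev)
        {
            cancel_delayed_work(&dev->poll_work);
            queue_delayed_work(system_wq, &dev->poll_work, HZ);
        }

        /* New style: mod_delayed_work() behaves like mod_timer() for work items,
         * and is usable even from IRQ context now that delayed_work uses an
         * IRQ-safe timer. */
        static void my_kick_new(struct my_dev *dev)
        {
            mod_delayed_work(system_wq, &dev->poll_work, HZ);
        }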
     

26 Sep, 2012

1 commit

  • In some usage scenarios it is desirable to work with disk images or
    virtualized DASD devices. One problem that prevents such applications
    is the partition detection in ibm.c. Currently it works only for
    devices that support the BIODASDINFO2 ioctl; in other words, it only
    works for devices that belong to the DASD device driver.

    The information gained from the BIODASDINFO2 ioctl is absolutely
    necessary only for a small set of legacy cases. All current VOL1, LNX1 and
    CMS1 type of disk labels can be interpreted correctly without this
    information, as long as the generic HDIO_GETGEO ioctl works and
    provides a correct disk geometry.

    This patch makes the ibm.c partition detection as independent as
    possible from the BIODASDINFO2 ioctl. Only the following two cases are
    still restricted to real DASDs:
    - An FBA DASD, or LDL formatted ECKD DASD without any disk label.
    - An old style LNX1 label (without large volume support) on a disk
    with inconsistent device geometry.

    Signed-off-by: Stefan Weinhuber
    Signed-off-by: Martin Schwidefsky

    Stefan Weinhuber
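
    The change above keys the ibm.c label detection off HDIO_GETGEO instead of
    the DASD-only BIODASDINFO2 ioctl. For reference, a small userspace sketch
    of querying the geometry the same generic way (error handling kept minimal):

        #include <stdio.h>
        #include <fcntl.h>
        #include <unistd.h>
        #include <sys/ioctl.h>
        #include <linux/hdreg.h>

        int main(int argc, char **argv)
        {
            struct hd_geometry geo;
            int fd;

            if (argc < 2)
                return 1;
            fd = open(argv[1], O_RDONLY);   /* e.g. a DASD node or a loop device over an image */
            if (fd < 0) {
                perror("open");
                return 1;
            }
            if (ioctl(fd, HDIO_GETGEO, &geo) < 0) {
                perror("HDIO_GETGEO");
                return 1;
            }
            printf("heads=%u sectors=%u cylinders=%u start=%lu\n",
                   (unsigned)geo.heads, (unsigned)geo.sectors,
                   (unsigned)geo.cylinders, geo.start);
            close(fd);
            return 0;
        }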
     

15 Sep, 2012

1 commit

  • Currently, cgroup hierarchy support is a mess. cpu related subsystems
    behave correctly - configuration, accounting and control on a parent
    properly cover its children. blkio and freezer completely ignore
    hierarchy and treat all cgroups as if they're directly under the root
    cgroup. Others show yet different behaviors.

    These differing interpretations of cgroup hierarchy make using cgroup
    confusing and make it impossible to co-mount controllers into the same
    hierarchy and obtain sane behavior.

    Eventually, we want full hierarchy support from all subsystems and
    probably a unified hierarchy. Users using separate hierarchies
    expecting completely different behaviors depending on the mounted
    subsystem is detrimental to making any progress on this front.

    This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
    for controllers which are lacking in hierarchy support. The goal of
    this patch is two-fold.

    * Move users away from using hierarchy on currently non-hierarchical
    subsystems, so that implementing proper hierarchy support on those
    doesn't surprise them.

    * Keep track of which controllers are broken how and nudge the
    subsystems to implement proper hierarchy support.

    For now, start with a single warning message. We can whine louder
    later on.

    v2: Fixed a typo spotted by Michal. Warning message updated.

    v3: Updated memcg part so that it doesn't generate warning in the
    cases where .use_hierarchy=false doesn't make the behavior
    different from root.use_hierarchy=true. Fixed a typo spotted by
    Glauber.

    v4: Check ->broken_hierarchy after cgroup creation is complete so that
    ->create() can affect the result per Michal. Dropped unnecessary
    memcg root handling per Michal.

    Signed-off-by: Tejun Heo
    Acked-by: Michal Hocko
    Acked-by: Li Zefan
    Acked-by: Serge E. Hallyn
    Cc: Glauber Costa
    Cc: Peter Zijlstra
    Cc: Paul Turner
    Cc: Johannes Weiner
    Cc: Thomas Graf
    Cc: Vivek Goyal
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Neil Horman
    Cc: Aneesh Kumar K.V

    Tejun Heo
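
    A minimal sketch of the mechanism described above: the
    cgroup_subsys.broken_hierarchy field name comes from the patch, while the
    example subsystem, the check's placement and the warning text are
    illustrative only.

        /* A controller without proper nesting support marks itself: */
        struct cgroup_subsys example_subsys = {
            .name               = "example",
            /* ... create/destroy and other callbacks elided ... */
            .broken_hierarchy   = true,
        };

        /* Rough shape of the core check, run after a cgroup has been created:
         * whine only when the new cgroup is nested below a non-root cgroup. */
        static void warn_if_broken_hierarchy(struct cgroup *cgrp,
                                             struct cgroup_subsys *ss)
        {
            if (ss->broken_hierarchy && cgrp->parent && cgrp->parent->parent)
                pr_warning("cgroup: %s does not support nested cgroups yet; "
                           "behavior may change\n", ss->name);
        }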
     

31 Aug, 2012

1 commit

  • When performing a cable pull test with active stress I/O using fio over
    a dual port Intel 82599 FCoE CNA, with 256 LUNs on one port and about 32 LUNs
    on the other, it is observed that the system becomes unusable because
    scsi-ml is busy printing error messages for all the failing commands.
    I don't believe this problem is specific to FCoE, and these commands are
    failing anyway due to the link being down (DID_NO_CONNECT), so just
    rate-limit the messages here to solve this issue.

    v2->v1: use __ratelimit(), as Tomas Henzl mentioned, as the proper way to
    rate-limit per function. However, in this case, the failed I/O gets to
    blk_end_request_err() and then blk_update_request(), which also has to
    be rate-limited, as added in v2 of this patch.

    v3->v2: resolved conflict to apply on the current 3.6-rc3 upstream tip.

    Signed-off-by: Yi Zou
    Cc: www.Open-FCoE.org
    Cc: Tomas Henzl
    Cc:
    Signed-off-by: Jens Axboe

    Yi Zou
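
    The fix leans on the kernel's ratelimit helpers. A generic sketch of
    per-callsite rate limiting with __ratelimit() (the message and limits here
    are illustrative; the actual patch touches the scsi-ml and
    blk_update_request() error paths):

        #include <linux/ratelimit.h>
        #include <linux/printk.h>

        static void report_failed_io(int error)
        {
            /* one ratelimit state per call site: at most 10 messages per 5 seconds */
            static DEFINE_RATELIMIT_STATE(rs, 5 * HZ, 10);

            if (__ratelimit(&rs))
                printk(KERN_ERR "I/O failed with error %d\n", error);
        }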
     

22 Aug, 2012

2 commits

  • Now that cancel_delayed_work() can be safely called from IRQ handlers,
    there's no reason to use __cancel_delayed_work(). Use
    cancel_delayed_work() instead of __cancel_delayed_work() and mark the
    latter deprecated.

    Signed-off-by: Tejun Heo
    Acked-by: Jens Axboe
    Cc: Jiri Kosina
    Cc: Roland Dreier
    Cc: Tomi Valkeinen

    Tejun Heo
     
  • Now that mod_delayed_work() is safe to call from IRQ handlers,
    __cancel_delayed_work() followed by queue_delayed_work() can be
    replaced with mod_delayed_work().

    Most conversions are straight-forward except for the following.

    * net/core/link_watch.c: linkwatch_schedule_work() was doing quite an
    elaborate dance around its delayed_work. Collapse it such that
    linkwatch_work is queued for immediate execution if LW_URGENT and the
    existing timer is kept otherwise.

    Signed-off-by: Tejun Heo
    Cc: "David S. Miller"
    Cc: Tomi Valkeinen

    Tejun Heo
     

21 Aug, 2012

1 commit

  • system_nrt[_freezable]_wq are now spurious. Mark them deprecated and
    convert all users to system[_freezable]_wq.

    If you're cc'd and wondering what's going on: Now all workqueues are
    non-reentrant, so there's no reason to use system_nrt[_freezable]_wq.
    Please use system[_freezable]_wq instead.

    This patch doesn't make any functional difference.

    Signed-off-by: Tejun Heo
    Acked-By: Lai Jiangshan

    Cc: Jens Axboe
    Cc: David Airlie
    Cc: Jiri Kosina
    Cc: "David S. Miller"
    Cc: Rusty Russell
    Cc: "Paul E. McKenney"
    Cc: David Howells

    Tejun Heo
     

14 Aug, 2012

1 commit

  • Convert delayed_work users doing cancel_delayed_work() followed by
    queue_delayed_work() to mod_delayed_work().

    Most conversions are straight-forward. Ones worth mentioning are,

    * drivers/edac/edac_mc.c: edac_mc_workq_setup() converted to always
    use mod_delayed_work() and cancel loop in
    edac_mc_reset_delay_period() is dropped.

    * drivers/platform/x86/thinkpad_acpi.c: No need to remember whether
    watchdog is active or not. @fan_watchdog_active and related code
    dropped.

    * drivers/power/charger-manager.c: Seemingly a lot of
    delayed_work_pending() abuse going on here.
    [delayed_]work_pending() are unsynchronized and racy when used like
    this. I converted one instance in fullbatt_handler(). Please
    convert the rest so that it invokes workqueue APIs for the intended
    target state rather than trying to game work item pending state
    transitions. e.g. if the timer should be modified - call
    mod_delayed_work(), canceled - call cancel_delayed_work[_sync]().

    * drivers/thermal/thermal_sys.c: thermal_zone_device_set_polling()
    simplified. Note that the round_jiffies() calls in this function are
    meaningless. round_jiffies() works on absolute jiffies, not the relative
    delay used by delayed_work.

    v2: Tomi pointed out that __cancel_delayed_work() users can't be
    safely converted to mod_delayed_work(). They could be calling it
    from irq context and if that happens while delayed_work_timer_fn()
    is running, it could deadlock. __cancel_delayed_work() users are
    dropped.

    Signed-off-by: Tejun Heo
    Acked-by: Henrique de Moraes Holschuh
    Acked-by: Dmitry Torokhov
    Acked-by: Anton Vorontsov
    Acked-by: David Howells
    Cc: Tomi Valkeinen
    Cc: Jens Axboe
    Cc: Jiri Kosina
    Cc: Doug Thompson
    Cc: David Airlie
    Cc: Roland Dreier
    Cc: "John W. Linville"
    Cc: Zhang Rui
    Cc: Len Brown
    Cc: "J. Bruce Fields"
    Cc: Johannes Berg

    Tejun Heo
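
    The charger-manager note above describes a general anti-pattern. A hedged
    sketch of the difference (work item and names hypothetical, assumed to be
    initialized elsewhere with INIT_DELAYED_WORK()):

        #include <linux/workqueue.h>

        static struct delayed_work poll_work;   /* hypothetical work item */

        /* Racy: the pending bit can change between the test and the queueing,
         * so this games internal state instead of requesting a target state. */
        static void reschedule_racy(unsigned long delay)
        {
            if (delayed_work_pending(&poll_work))
                cancel_delayed_work(&poll_work);
            queue_delayed_work(system_wq, &poll_work, delay);
        }

        /* Intended-state style: just say what the timer should be. */
        static void reschedule_clean(unsigned long delay)
        {
            mod_delayed_work(system_wq, &poll_work, delay);
        }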
     

03 Aug, 2012

3 commits

  • I met an odd problem: reading /proc/partitions may return zero.

    I wrote a test program, test.c:

    int main()
    {
        char buff[4096];
        int ret;
        int fd;

        printf("pid=%d\n", getpid());
        while (1) {
            fd = open("/proc/partitions", O_RDONLY);
            if (fd < 0) {
                printf("open error %s\n", strerror(errno));
                return 0;
            }
            ret = read(fd, buff, 4096);
            if (ret <= 0) {
                printf("read returned %d\n", ret);
                break;
            }
            close(fd);
        }
        return 0;
    }

    and at the same time ran:
    1: while :; do cat /proc/partitions > /dev/null; done
    2: ./test

    I reviewed the code and found:

    >> static void *show_partition_start(struct seq_file *seqf, loff_t *pos)
    >> {
    >> static void *p;
    >>
    >> p = disk_seqf_start(seqf, pos);
    >> if (!IS_ERR_OR_NULL(p) && !*pos)
    >> seq_puts(seqf, "major minor #blocks name\n\n");
    >> return p;
    >> }
    Because p is static, the two concurrent readers share it; the interleaving
    looks like:

    test                                 cat /proc/partitions
    p = disk_seqf_start()  (not NULL)
                                         p = disk_seqf_start()  (NULL because of its *pos)
    if (!IS_ERR_OR_NULL(p) && !*pos)
        -> p has been overwritten with NULL, so show_partition_start() returns
           NULL and the read returns zero.

    Signed-off-by: Jianpeng Ma
    Signed-off-by: Jens Axboe

    Jianpeng Ma
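
    The root cause is the function-static pointer shared by concurrent readers
    of /proc/partitions. The obvious shape of the fix (sketched here, not
    quoted from the final patch) is to make it an ordinary local again:

        static void *show_partition_start(struct seq_file *seqf, loff_t *pos)
        {
            void *p;    /* per-call, so concurrent readers cannot clobber each other */

            p = disk_seqf_start(seqf, pos);
            if (!IS_ERR_OR_NULL(p) && !*pos)
                seq_puts(seqf, "major minor  #blocks  name\n\n");
            return p;
        }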
     
  • Add a helper to map a bio to a scatterlist, modelled after
    blk_rq_map_sg.

    This helper is useful for any driver that wants to create
    a scatterlist from its ->make_request_fn method.

    Changes in v2:
    - Use __blk_segment_map_sg to avoid duplicated code
    - Add docbook style function comment

    Cc: Rusty Russell
    Cc: Christoph Hellwig
    Cc: Tejun Heo
    Cc: Shaohua Li
    Cc: "Michael S. Tsirkin"
    Cc: kvm@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: virtualization@lists.linux-foundation.org
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Minchan Kim
    Signed-off-by: Asias He
    Signed-off-by: Jens Axboe

    Asias He
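
    A rough sketch of how a bio-based driver might use the new helper from its
    ->make_request_fn (the driver structure and segment limit are hypothetical;
    the helper fills the scatterlist from the bio and returns the number of
    entries used):

        #include <linux/blkdev.h>
        #include <linux/scatterlist.h>

        #define MY_MAX_SEGMENTS 16      /* hypothetical per-device limit */

        struct my_dev {
            /* preallocated; a real driver must serialize access to this table */
            struct scatterlist sg[MY_MAX_SEGMENTS];
        };

        static void my_make_request(struct request_queue *q, struct bio *bio)
        {
            struct my_dev *dev = q->queuedata;
            int nents;

            sg_init_table(dev->sg, MY_MAX_SEGMENTS);
            nents = blk_bio_map_sg(q, bio, dev->sg);

            /* ... hand dev->sg[0..nents-1] to the hardware or virtqueue ... */
        }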
     
  • Split the mapping code in blk_rq_map_sg() to a helper
    __blk_segment_map_sg(), so that other mapping function, e.g.
    blk_bio_map_sg(), can share the code.

    Cc: Rusty Russell
    Cc: Christoph Hellwig
    Cc: Tejun Heo
    Cc: Shaohua Li
    Cc: "Michael S. Tsirkin"
    Cc: kvm@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: virtualization@lists.linux-foundation.org
    Suggested-by: Jens Axboe
    Suggested-by: Tejun Heo
    Signed-off-by: Asias He
    Signed-off-by: Jens Axboe

    Asias He
     

02 Aug, 2012

4 commits

  • When a disk has large discard_granularity and small max_discard_sectors,
    discards are not split with optimal alignment. In the limit case of
    discard_granularity == max_discard_sectors, no request could be aligned
    correctly, so in fact you might end up with no discarded logical blocks
    at all.

    Another example that helps show the condition in the patch is with
    discard_granularity == 64, max_discard_sectors == 128. A request that is
    submitted for 256 sectors 2..257 will be split in two: 2..129, 130..257.
    However, only 2 aligned blocks out of 3 are included in the request;
    128..191 may be left intact and not discarded. With this patch, the
    first request will be truncated to ensure good alignment of what's left,
    and the split will be 2..127, 128..255, 256..257. The patch will also
    take into account the discard_alignment.

    At most one extra request will be introduced, because the first request
    will be reduced by at most granularity-1 sectors, and granularity
    must be less than max_discard_sectors. Subsequent requests will run
    on round_down(max_discard_sectors, granularity) sectors, as in the
    current code.

    Signed-off-by: Paolo Bonzini
    Acked-by: Vivek Goyal
    Tested-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Paolo Bonzini
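
    The 64/128 example from the text can be checked with a little standalone
    arithmetic. This sketch (discard_alignment assumed to be 0) reproduces the
    2..127, 128..255, 256..257 split by truncating a full-size chunk so that
    what remains starts on a granularity boundary:

        #include <stdio.h>

        int main(void)
        {
            unsigned long long start = 2, nr = 256;     /* sectors 2..257 */
            unsigned int granularity = 64, max_discard = 128;

            while (nr) {
                unsigned long long req = nr < max_discard ? nr : max_discard;
                unsigned long long end = start + req;

                /* truncate so the remainder begins on an aligned block */
                if (req == max_discard && (end % granularity))
                    req -= end % granularity;

                printf("discard %llu..%llu (%llu sectors)\n",
                       start, start + req - 1, req);
                start += req;
                nr -= req;
            }
            return 0;
        }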
     
  • Mostly a preparation for the next patch.

    In principle this fixes an infinite loop if max_discard_sectors < granularity,
    but that really shouldn't happen.

    Signed-off-by: Paolo Bonzini
    Acked-by: Vivek Goyal
    Tested-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Paolo Bonzini
     
  • Pull block driver changes from Jens Axboe:

    - Making the plugging support for drivers a bit more sane from Neil.
    This supersedes the plugging change from Shaohua as well.

    - The usual round of drbd updates.

    - Using a tail add instead of a head add in the request completion for
    nbd, making us find the most completed request more quickly.

    - A few floppy changes, getting rid of a duplicated flag and also
    running the floppy init async (since it takes forever in boot terms)
    from Andi.

    * 'for-3.6/drivers' of git://git.kernel.dk/linux-block:
    floppy: remove duplicated flag FD_RAW_NEED_DISK
    blk: pass from_schedule to non-request unplug functions.
    block: stack unplug
    blk: centralize non-request unplug handling.
    md: remove plug_cnt feature of plugging.
    block/nbd: micro-optimization in nbd request completion
    drbd: announce FLUSH/FUA capability to upper layers
    drbd: fix max_bio_size to be unsigned
    drbd: flush drbd work queue before invalidate/invalidate remote
    drbd: fix potential access after free
    drbd: call local-io-error handler early
    drbd: do not reset rs_pending_cnt too early
    drbd: reset congestion information before reporting it in /proc/drbd
    drbd: report congestion if we are waiting for some userland callback
    drbd: differentiate between normal and forced detach
    drbd: cleanup, remove two unused global flags
    floppy: Run floppy initialization asynchronous

    Linus Torvalds
     
  • Pull core block IO bits from Jens Axboe:
    "The most complicated part if this is the request allocation rework by
    Tejun, which has been queued up for a long time and has been in
    for-next ditto as well.

    There are a few commits from yesterday and today, mostly trivial and
    obvious fixes. So I'm pretty confident that it is sound. It's also
    smaller than usual."

    * 'for-3.6/core' of git://git.kernel.dk/linux-block:
    block: remove dead func declaration
    block: add partition resize function to blkpg ioctl
    block: uninitialized ioc->nr_tasks triggers WARN_ON
    block: do not artificially constrain max_sectors for stacking drivers
    blkcg: implement per-blkg request allocation
    block: prepare for multiple request_lists
    block: add q->nr_rqs[] and move q->rq.elvpriv to q->nr_rqs_elvpriv
    blkcg: inline bio_blkcg() and friends
    block: allocate io_context upfront
    block: refactor get_request[_wait]()
    block: drop custom queue draining used by scsi_transport_{iscsi|fc}
    mempool: add @gfp_mask to mempool_create_node()
    blkcg: make root blkcg allocation use %GFP_KERNEL
    blkcg: __blkg_lookup_create() doesn't need radix preload

    Linus Torvalds
     

01 Aug, 2012

4 commits

  • The __generic_unplug_device() function was removed in commit
    7eaceaccab5f40bbfda044629a6298616aeaed50, which forgot to remove the
    declaration at the same time. Remove it here.

    Signed-off-by: Yuanhan Liu
    Signed-off-by: Jens Axboe

    Yuanhan Liu
     
  • Add a new operation code (BLKPG_RESIZE_PARTITION) to the BLKPG ioctl that
    allows altering the size of an existing partition, even if it is currently
    in use.

    This patch converts hd_struct->nr_sects into a sequence counter, because
    one might extend a partition while I/O is happening to it, and the update
    of nr_sects can be non-atomic on 32-bit machines with a 64-bit sector_t.
    This can lead to issues like reading an inconsistent size of a partition.
    A sequence counter has been used so that readers don't have to take the
    bdev mutex lock, as we call sector_in_part() very frequently.

    Now all the access to hd_struct->nr_sects should happen using sequence
    counter read/update helper functions part_nr_sects_read/part_nr_sects_write.
    There is one exception though: set_capacity()/get_capacity(). I think
    theoretically the race should exist there too, but this patch does not
    modify set_capacity()/get_capacity() due to the sheer number of call sites,
    and I am afraid that change might break something. I have left that as a
    TODO item. We can handle it later if need be. This patch does not introduce
    any new races as such w.r.t. set_capacity()/get_capacity().

    v2: Add CONFIG_LBDAF test to UP preempt case as suggested by Phillip.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Phillip Susi
    Signed-off-by: Jens Axboe

    Vivek Goyal
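
    The read side of a sequence counter simply retries if a writer interleaved.
    A generic sketch of the pattern (not the exact part_nr_sects_read() source;
    the kernel only needs it where a 64-bit nr_sects can tear, i.e. 32-bit SMP):

        #include <linux/seqlock.h>
        #include <linux/types.h>

        struct sized_thing {                    /* illustrative pairing only */
            seqcount_t  seq;
            sector_t    nr_sects;
        };

        static sector_t read_size(struct sized_thing *t)
        {
            sector_t sects;
            unsigned int start;

            do {
                start = read_seqcount_begin(&t->seq);
                sects = t->nr_sects;
            } while (read_seqcount_retry(&t->seq, start));

            return sects;
        }

        static void write_size(struct sized_thing *t, sector_t size)
        {
            write_seqcount_begin(&t->seq);      /* writers must already be serialized */
            t->nr_sects = size;
            write_seqcount_end(&t->seq);
        }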
     
  • Hi,

    I'm using the old-fashioned 'dump' backup tool, and I noticed that it spews the
    below warning as of 3.5-rc1 and later (3.4 is fine):

    [ 10.886893] ------------[ cut here ]------------
    [ 10.886904] WARNING: at include/linux/iocontext.h:140 copy_process+0x1488/0x1560()
    [ 10.886905] Hardware name: Bochs
    [ 10.886906] Modules linked in:
    [ 10.886908] Pid: 2430, comm: dump Not tainted 3.5.0-rc7+ #27
    [ 10.886908] Call Trace:
    [ 10.886911] [] warn_slowpath_common+0x7a/0xb0
    [ 10.886912] [] warn_slowpath_null+0x15/0x20
    [ 10.886913] [] copy_process+0x1488/0x1560
    [ 10.886914] [] do_fork+0xb4/0x340
    [ 10.886918] [] ? recalc_sigpending+0x1a/0x50
    [ 10.886919] [] ? __set_task_blocked+0x32/0x80
    [ 10.886920] [] ? __set_current_blocked+0x3a/0x60
    [ 10.886923] [] sys_clone+0x23/0x30
    [ 10.886925] [] stub_clone+0x13/0x20
    [ 10.886927] [] ? system_call_fastpath+0x16/0x1b
    [ 10.886928] ---[ end trace 32a14af7ee6a590b ]---

    Reproducing is easy, I can hit it on a KVM system with a very basic
    config (x86_64 make defconfig + enable the drivers needed). To hit it,
    just install dump (on debian/ubuntu, not sure what the package might be
    called on Fedora), and:

    dump -o -f /tmp/foo /

    You'll see the warning in dmesg once it forks off the I/O process and
    starts dumping filesystem contents.

    I bisected it down to the following commit:

    commit f6e8d01bee036460e03bd4f6a79d014f98ba712e
    Author: Tejun Heo
    Date: Mon Mar 5 13:15:26 2012 -0800

    block: add io_context->active_ref

    Currently ioc->nr_tasks is used to decide two things - whether an ioc
    is done issuing IOs and whether it's shared by multiple tasks. This
    patch separate out the first into ioc->active_ref, which is acquired
    and released using {get|put}_io_context_active() respectively.

    This will be used to associate bio's with a given task. This patch
    doesn't introduce any visible behavior change.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    It seems like the init of ioc->nr_tasks was removed in that patch,
    so it starts out at 0 instead of 1.

    Tejun, is the right thing here to add back the init, or should something else
    be done?

    The below patch removes the warning, but I haven't done any more extensive
    testing on it.

    Signed-off-by: Olof Johansson
    Acked-by: Tejun Heo
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Olof Johansson
     
  • blk_set_stacking_limits is intended to allow stacking drivers to build
    up the limits of the stacked device based on the underlying devices'
    limits. But defaulting 'max_sectors' to BLK_DEF_MAX_SECTORS (1024)
    doesn't allow the stacking driver to inherit a max_sectors larger than
    1024 -- due to blk_stack_limits' use of min_not_zero.

    It is now clear that this artificial limit is getting in the way so
    change blk_set_stacking_limits's max_sectors to UINT_MAX (which allows
    stacking drivers like dm-multipath to inherit 'max_sectors' from the
    underlying paths).

    Reported-by: Vijay Chauhan
    Tested-by: Vijay Chauhan
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
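
    The cap is easiest to see from min_not_zero() itself: whatever non-zero
    default the stacking queue starts with becomes an upper bound on what it
    can inherit. A toy userspace illustration (min_not_zero restated locally
    with the same semantics as the kernel macro):

        #include <stdio.h>

        #define min_not_zero(a, b) \
            ((a) == 0 ? (b) : ((b) == 0 ? (a) : ((a) < (b) ? (a) : (b))))

        int main(void)
        {
            unsigned int underlying = 4096;   /* e.g. a path below dm-multipath */

            /* old stacking default: BLK_DEF_MAX_SECTORS == 1024 */
            unsigned int old_result = min_not_zero(1024u, underlying);
            /* new stacking default: UINT_MAX, so the underlying limit wins */
            unsigned int new_result = min_not_zero(~0u, underlying);

            printf("old default caps at %u, new default inherits %u\n",
                   old_result, new_result);
            return 0;
        }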
     

31 Jul, 2012

3 commits

  • This will allow md/raid to know why the unplug was called,
    and it will be able to act accordingly - if !from_schedule it
    is safe to perform tasks which could themselves schedule.

    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    NeilBrown
     
  • MD raid1 prepares to dispatch requests in its unplug callback. If
    make_request in a lower-level queue also uses an unplug callback to
    dispatch requests, the lower-level queue's unplug callback will not be
    called. Rechecking the callback list helps this case.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • Both md and umem have similar code for getting notified of a
    blk_finish_plug event.
    Centralize this code in block/ and allow each driver to
    provide its distinctive difference.

    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    NeilBrown
     

20 Jul, 2012

1 commit

  • If the queue is dead, blk_execute_rq_nowait() doesn't invoke the done()
    callback function. That will result in blk_execute_rq() being stuck
    in wait_for_completion(). Avoid this by initializing rq->end_io to the
    done() callback before we check the queue state. Also, make sure the
    queue lock is held around the invocation of the done() callback. Found
    this through source code review.

    Signed-off-by: Muthukumar Ratty
    Signed-off-by: Bart Van Assche
    Reviewed-by: Tejun Heo
    Acked-by: Jens Axboe
    Signed-off-by: James Bottomley

    Muthukumar Ratty
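
    A hedged sketch of the ordering the fix describes: set rq->end_io before
    the dead-queue check, and invoke done() under the queue lock when bailing
    out (shape only, not the verbatim patch):

        #include <linux/blkdev.h>
        #include <linux/elevator.h>

        static void execute_nowait_sketch(struct request_queue *q,
                                          struct gendisk *bd_disk,
                                          struct request *rq, int at_head,
                                          rq_end_io_fn *done)
        {
            rq->rq_disk = bd_disk;
            rq->end_io = done;                  /* set before checking queue state */

            spin_lock_irq(q->queue_lock);
            if (unlikely(blk_queue_dead(q))) {
                rq->errors = -ENXIO;
                if (rq->end_io)
                    rq->end_io(rq, rq->errors); /* done() still runs, under the lock */
                spin_unlock_irq(q->queue_lock);
                return;
            }
            __elv_add_request(q, rq, at_head ?
                              ELEVATOR_INSERT_FRONT : ELEVATOR_INSERT_BACK);
            __blk_run_queue(q);
            spin_unlock_irq(q->queue_lock);
        }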
     

27 Jun, 2012

1 commit

  • Currently, request_queue has one request_list to allocate requests
    from regardless of blkcg of the IO being issued. When the unified
    request pool is used up, cfq proportional IO limits become meaningless
    - whoever grabs the next request being freed wins the race regardless
    of the configured weights.

    This can be easily demonstrated by creating a blkio cgroup with a very low
    weight, putting a program which can issue a lot of random direct IOs there,
    and running a sequential IO from a different cgroup. As soon as the
    request pool is used up, the sequential IO bandwidth crashes.

    This patch implements per-blkg request_list. Each blkg has its own
    request_list and any IO allocates its request from the matching blkg
    making blkcgs completely isolated in terms of request allocation.

    * Root blkcg uses the request_list embedded in each request_queue,
    which was renamed to @q->root_rl from @q->rq. While making blkcg rl
    handling a bit hairier, this enables avoiding most overhead for root
    blkcg.

    * Queue fullness is properly per request_list but bdi isn't blkcg
    aware yet, so congestion state currently just follows the root
    blkcg. As writeback isn't aware of blkcg yet, this works okay for
    async congestion but readahead may get the wrong signals. It's
    better than blkcg completely collapsing with shared request_list but
    needs to be improved with future changes.

    * After this change, each block cgroup gets a full request pool making
    resource consumption of each cgroup higher. This makes allowing
    non-root users to create cgroups less desirable; however, note that
    allowing non-root users to directly manage cgroups is already
    severely broken regardless of this patch - each block cgroup
    consumes kernel memory and skews IO weight (IO weights are not
    hierarchical).

    v2: queue-sysfs.txt updated and patch description updated as suggested
    by Vivek.

    v3: blk_get_rl() wasn't checking error return from
    blkg_lookup_create() and may cause oops on lookup failure. Fix it
    by falling back to root_rl on blkg lookup failures. This problem
    was spotted by Rakesh Iyer.

    v4: Updated to accommodate 458f27a982 "block: Avoid missed wakeup in
    request waitqueue". blk_drain_queue() now wakes up waiters on all
    blkg->rl on the target queue.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Cc: Wu Fengguang
    Signed-off-by: Jens Axboe

    Tejun Heo
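
    A rough sketch of the request_list selection this describes, including the
    v3 fallback to root_rl; the helper here is paraphrased and simplified, not
    the kernel's actual blk_get_rl() (which also handles reference counting and
    RCU):

        /* pick the request_list for a bio; lose isolation, not correctness,
         * if the blkg cannot be found */
        static struct request_list *choose_request_list(struct request_queue *q,
                                                        struct bio *bio)
        {
            struct blkcg *blkcg = bio_blkcg(bio);
            struct blkcg_gq *blkg;

            if (blkcg == &blkcg_root)
                return &q->root_rl;     /* root blkcg uses the queue-embedded list */

            blkg = blkg_lookup(blkcg, q);
            if (!blkg)
                return &q->root_rl;     /* lookup/creation failed: fall back */
            return &blkg->rl;
        }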
     

25 Jun, 2012

9 commits

  • Request allocation is about to be made per-blkg meaning that there'll
    be multiple request lists.

    * Make queue full state per request_list. blk_*queue_full() functions
    are renamed to blk_*rl_full() and takes @rl instead of @q.

    * Rename blk_init_free_list() to blk_init_rl() and make it take @rl
    instead of @q. Also add @gfp_mask parameter.

    * Add blk_exit_rl() instead of destroying rl directly from
    blk_release_queue().

    * Add request_list->q and make request alloc/free functions -
    blk_free_request(), [__]freed_request(), __get_request() - take @rl
    instead of @q.

    This patch doesn't introduce any functional difference.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Add q->nr_rqs[] which currently behaves the same as q->rq.count[] and
    move q->rq.elvpriv to q->nr_rqs_elvpriv. blk_drain_queue() is updated
    to use q->nr_rqs[] instead of q->rq.count[].

    These counters separate queue-wide request statistics from the
    request list and allow implementation of per-queue request allocation.

    While at it, properly indent fields of struct request_list.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Make bio_blkcg() and friends inline. They all are very simple and
    used only in a few places.

    This patch is to prepare for further updates to request allocation
    path.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • The block layer does very lazy allocation of ioc. It waits until the
    moment ioc is absolutely necessary; unfortunately, that time could be
    inside the queue lock, and __get_request() performs an unlock - try
    alloc - retry dance.

    Just allocate it up-front on entry to block layer. We're not saving
    the rain forest by deferring it to the last possible moment and
    complicating things unnecessarily.

    This patch is to prepare for further updates to request allocation
    path.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, there are two request allocation functions - get_request()
    and get_request_wait(). The former tries to allocate a request once;
    the latter wraps the former and keeps retrying until allocation
    succeeds.

    The combination of two functions deliver fallible non-wait allocation,
    fallible wait allocation and unfailing wait allocation. However,
    given that forward progress is guaranteed, fallible wait allocation
    isn't all that useful and in fact nobody uses it.

    This patch simplifies the interface as follows.

    * get_request() is renamed to __get_request() and is only used by the
    wrapper function.

    * get_request_wait() is renamed to get_request(). It now takes
    @gfp_mask and retries iff it contains %__GFP_WAIT.

    This patch doesn't introduce any functional change and is to prepare
    for further updates to request allocation path.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • iscsi_remove_host() uses bsg_remove_queue() which implements custom
    queue draining. fc_bsg_remove() open-codes mostly identical logic.

    The draining logic isn't correct in that blk_stop_queue() doesn't
    prevent new requests from being queued - it just stops processing - so
    nothing prevents new requests from being queued after the logic determines
    that the queue is drained.

    blk_cleanup_queue() now implements proper queue draining and these
    custom draining logics aren't necessary. Drop them and use
    bsg_unregister_queue() + blk_cleanup_queue() instead.

    Signed-off-by: Tejun Heo
    Reviewed-by: Mike Christie
    Acked-by: Vivek Goyal
    Cc: James Bottomley
    Cc: James Smart
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • mempool_create_node() currently assumes %GFP_KERNEL. Its only user,
    blk_init_free_list(), is about to be updated to use other allocation
    flags - add @gfp_mask argument to the function.

    Signed-off-by: Tejun Heo
    Cc: Andrew Morton
    Cc: Hugh Dickins
    Signed-off-by: Jens Axboe

    Tejun Heo
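
    With the extra argument, a caller sketch (pool size and cache hypothetical)
    looks like:

        #include <linux/mempool.h>
        #include <linux/slab.h>

        /* a small object pool bound to a NUMA node, now created with an
         * explicit gfp_mask instead of an implied GFP_KERNEL */
        static mempool_t *create_my_pool(struct kmem_cache *cachep, int node)
        {
            return mempool_create_node(4 /* min_nr */,
                                       mempool_alloc_slab, mempool_free_slab,
                                       cachep, GFP_KERNEL, node);
        }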
     
  • Currently, blkcg_activate_policy() depends on %GFP_ATOMIC allocation
    from __blkg_lookup_create() for root blkcg creation. This could make
    policy activation fail unnecessarily.

    Make blkg_alloc() take @gfp_mask, __blkg_lookup_create() take an
    optional @new_blkg for preallocated blkg, and blkcg_activate_policy()
    preload radix tree and preallocate blkg with %GFP_KERNEL before trying
    to create the root blkg.

    v2: __blkg_lookup_create() was returning %NULL on blkg alloc failure
    instead of ERR_PTR() value. Fixed.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • There's no point in calling radix_tree_preload() if preloading doesn't
    use a more permissive GFP mask. Drop preloading from
    __blkg_lookup_create().

    While at it, drop sparse locking annotation which no longer applies.

    v2: Vivek pointed out the odd preload usage. Instead of updating,
    just drop it.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     

15 Jun, 2012

4 commits

  • Sometimes, warnings about ioctls to a partition happen often enough that
    they form the majority of the warnings in the kernel log and users complain.
    In some cases the warnings are about ioctls such as SG_IO, so it's not good
    to get rid of the warnings completely, as they can ease debugging of
    userspace problems when an ioctl is refused.

    Since I have seen warnings from lots of commands, including some proprietary
    userspace applications, I don't think disallowing the ioctls for processes
    with CAP_SYS_RAWIO will happen in the near future, if ever. So let's just
    stop warning for processes with CAP_SYS_RAWIO, for which the ioctl is allowed.

    CC: Paolo Bonzini
    CC: James Bottomley
    CC: linux-scsi@vger.kernel.org
    Acked-by: Paolo Bonzini
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
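
    The shape of the change: keep the (rate-limited) warning, but skip it when
    the caller has CAP_SYS_RAWIO, since the ioctl will be allowed for such
    processes anyway (a sketch, not the verbatim patch):

        #include <linux/capability.h>
        #include <linux/printk.h>
        #include <linux/sched.h>

        static void maybe_warn_partition_ioctl(unsigned int cmd)
        {
            if (capable(CAP_SYS_RAWIO))
                return;                 /* allowed anyway, don't spam the log */

            printk_ratelimited(KERN_WARNING
                               "%s: sending ioctl %x to a partition!\n",
                               current->comm, cmd);
        }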
     
  • This function was only used by btrfs code in btrfs_abort_devices()
    (seemingly in a wrong way).

    It was removed in commit d07eb9117050c9ed3f78296ebcc06128b52693be,
    so let's remove the dead code to avoid any confusion.

    Changes in v2: update commit log, btrfs_abort_devices() was removed
    already.

    Cc: Jens Axboe
    Cc: linux-kernel@vger.kernel.org
    Cc: Chris Mason
    Cc: linux-btrfs@vger.kernel.org
    Cc: David Sterba
    Signed-off-by: Asias He
    Signed-off-by: Jens Axboe

    Asias He
     
  • Commit 777eb1bf15b8532c396821774bf6451e563438f5 disconnects the externally
    supplied queue_lock before blk_drain_queue(). Switching the lock would
    introduce a lock imbalance, because threads which have taken the external
    lock might unlock the internal lock during the queue drain. This patch
    mitigates this by disconnecting the lock after the queue draining, since
    queue draining makes a lot of request_queue users go away.

    However, please note that this patch only makes the problem less likely to
    happen. Anyone who still holds a ref might try to issue a new request on
    a dead queue after blk_cleanup_queue() finishes draining, and the lock
    imbalance might still happen in this case.

    =====================================
    [ BUG: bad unlock balance detected! ]
    3.4.0+ #288 Not tainted
    -------------------------------------
    fio/17706 is trying to release lock (&(&q->__queue_lock)->rlock) at:
    [] blk_queue_bio+0x2a2/0x380
    but there are no more locks to release!

    other info that might help us debug this:
    1 lock held by fio/17706:
    #0: (&(&vblk->lock)->rlock){......}, at: []
    get_request_wait+0x19a/0x250

    stack backtrace:
    Pid: 17706, comm: fio Not tainted 3.4.0+ #288
    Call Trace:
    [] ? blk_queue_bio+0x2a2/0x380
    [] print_unlock_inbalance_bug+0xf9/0x100
    [] lock_release_non_nested+0x1df/0x330
    [] ? dio_bio_end_aio+0x34/0xc0
    [] ? bio_check_pages_dirty+0x85/0xe0
    [] ? dio_bio_end_aio+0xb1/0xc0
    [] ? blk_queue_bio+0x2a2/0x380
    [] ? blk_queue_bio+0x2a2/0x380
    [] lock_release+0xd9/0x250
    [] _raw_spin_unlock_irq+0x23/0x40
    [] blk_queue_bio+0x2a2/0x380
    [] generic_make_request+0xca/0x100
    [] submit_bio+0x76/0xf0
    [] ? set_page_dirty_lock+0x3c/0x60
    [] ? bio_set_pages_dirty+0x51/0x70
    [] do_blockdev_direct_IO+0xbf8/0xee0
    [] ? blkdev_get_block+0x80/0x80
    [] __blockdev_direct_IO+0x55/0x60
    [] ? blkdev_get_block+0x80/0x80
    [] blkdev_direct_IO+0x57/0x60
    [] ? blkdev_get_block+0x80/0x80
    [] generic_file_aio_read+0x70e/0x760
    [] ? __lock_acquire+0x215/0x5a0
    [] ? aio_run_iocb+0x54/0x1a0
    [] ? grab_cache_page_nowait+0xc0/0xc0
    [] aio_rw_vect_retry+0x7c/0x1e0
    [] ? aio_fsync+0x30/0x30
    [] aio_run_iocb+0x66/0x1a0
    [] do_io_submit+0x6f0/0xb80
    [] ? trace_hardirqs_on_thunk+0x3a/0x3f
    [] sys_io_submit+0x10/0x20
    [] system_call_fastpath+0x16/0x1b

    Changes since v2: Update commit log to explain how the code is still
    broken even if we delay the lock switching after the drain.
    Changes since v1: Update commit log as Tejun suggested.

    Acked-by: Tejun Heo
    Signed-off-by: Asias He
    Signed-off-by: Jens Axboe

    Asias He
     
  • After hot-unplugging a stressed disk, I found that rl->wait[] is not empty
    while rl->count[] is empty and there are threads still sleeping on
    get_request after the queue cleanup. With simple debug code, I found
    there are exactly nr_sleep - nr_wakeup threads in D state. So there
    are missed wakeups.

    $ dmesg | grep nr_sleep
    [ 52.917115] ---> nr_sleep=1046, nr_wakeup=873, delta=173
    $ vmstat 1
    1 173 0 712640 24292 96172 0 0 0 0 419 757 0 0 0 100 0

    To quote Tejun:

    Ah, okay, freed_request() wakes up single waiter with the assumption
    that after the wakeup there will at least be one successful allocation
    which in turn will continue the wakeup chain until the wait list is
    empty - ie. waiter wakeup is dependent on successful request
    allocation happening after each wakeup. With queue marked dead, any
    woken up waiter fails the allocation path, so the wakeup chaining is
    lost and we're left with hung waiters. What we need is wake_up_all()
    after drain completion.

    This patch fixes the missed wakeups by waking up all the threads which
    are sleeping on the wait queue after queue drain.

    Changes in v2: Drop waitqueue_active() optimization

    Acked-by: Tejun Heo
    Signed-off-by: Asias He

    Fixed a bug by me, where stacked devices would oops on calling
    blk_drain_queue() since ->rq.wait[] does not get initialized unless
    it's a full queue setup.

    Signed-off-by: Jens Axboe

    Asias He
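
    A hedged sketch of the fix's core idea: once draining has finished, wake
    every thread sleeping on the request list's wait queues instead of relying
    on the one-wakeup-per-freed-request chain (field names follow the
    ->rq.wait[] layout mentioned in the message):

        #include <linux/blkdev.h>
        #include <linux/wait.h>

        /* called at the end of queue draining on a request-based queue */
        static void wake_all_request_waiters(struct request_queue *q)
        {
            struct request_list *rl = &q->rq;
            int i;

            for (i = 0; i < ARRAY_SIZE(rl->wait); i++)
                wake_up_all(&rl->wait[i]);      /* BLK_RW_SYNC and BLK_RW_ASYNC */
        }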
     

06 Jun, 2012

1 commit

  • blkg_destroy() caches @blkg->q in the local variable @q. While there are
    two places which need @blkg->q, only lockdep_assert_held() used the
    local variable, leading to an unused-variable warning if lockdep is
    configured out. Drop the local variable and just use @blkg->q
    directly.

    Signed-off-by: Tejun Heo
    Reported-by: Rakesh Iyer
    Signed-off-by: Jens Axboe

    Tejun Heo