23 Nov, 2012

1 commit

  • After we've done __elv_add_request() and __blk_run_queue() in
    blk_execute_rq_nowait(), the request might finish and be freed
    immediately. Therefore checking if the type is REQ_TYPE_PM_RESUME
    isn't safe afterwards, because if it isn't, rq might be gone.
    Instead, check beforehand and stash the result in a temporary.
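
    In code form, the pattern described is roughly (a sketch, not the
    verbatim patch; names assumed from blk_execute_rq_nowait() of that
    era):

        bool is_pm_resume;

        /*
         * Must be checked before __elv_add_request()/__blk_run_queue():
         * once the request is queued and run, rq may already be freed.
         */
        is_pm_resume = rq->cmd_type == REQ_TYPE_PM_RESUME;

        __elv_add_request(q, rq, where);
        __blk_run_queue(q);
        /* the queue is stopped so it won't be run */
        if (is_pm_resume)
                q->request_fn(q);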

    This fixes crashes in blk_execute_rq_nowait() I get occasionally when
    running with lots of memory debugging options enabled -- I think this
    race is usually harmless because the window for rq to be reallocated
    is so small.

    Signed-off-by: Roland Dreier
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Roland Dreier
     

26 Oct, 2012

1 commit

  • My workload is a RAID5 array of 16 disks, written to through our
    filesystem in direct-I/O mode.

    I used blktrace and found these messages:
    8,16 0 6647 2.453665504 2579 M W 7493152 + 8 [md0_raid5]
    8,16 0 6648 2.453672411 2579 Q W 7493160 + 8 [md0_raid5]
    8,16 0 6649 2.453672606 2579 M W 7493160 + 8 [md0_raid5]
    8,16 0 6650 2.453679255 2579 Q W 7493168 + 8 [md0_raid5]
    8,16 0 6651 2.453679441 2579 M W 7493168 + 8 [md0_raid5]
    8,16 0 6652 2.453685948 2579 Q W 7493176 + 8 [md0_raid5]
    8,16 0 6653 2.453686149 2579 M W 7493176 + 8 [md0_raid5]
    8,16 0 6654 2.453693074 2579 Q W 7493184 + 8 [md0_raid5]
    8,16 0 6655 2.453693254 2579 M W 7493184 + 8 [md0_raid5]
    8,16 0 6656 2.453704290 2579 Q W 7493192 + 8 [md0_raid5]
    8,16 0 6657 2.453704482 2579 M W 7493192 + 8 [md0_raid5]
    8,16 0 6658 2.453715016 2579 Q W 7493200 + 8 [md0_raid5]
    8,16 0 6659 2.453715247 2579 M W 7493200 + 8 [md0_raid5]
    8,16 0 6660 2.453721730 2579 Q W 7493208 + 8 [md0_raid5]
    8,16 0 6661 2.453721974 2579 M W 7493208 + 8 [md0_raid5]
    8,16 0 6662 2.453728202 2579 Q W 7493216 + 8 [md0_raid5]
    8,16 0 6663 2.453728436 2579 M W 7493216 + 8 [md0_raid5]
    8,16 0 6664 2.453734782 2579 Q W 7493224 + 8 [md0_raid5]
    8,16 0 6665 2.453735019 2579 M W 7493224 + 8 [md0_raid5]
    8,16 0 6666 2.453741401 2579 Q W 7493232 + 8 [md0_raid5]
    8,16 0 6667 2.453741632 2579 M W 7493232 + 8 [md0_raid5]
    8,16 0 6668 2.453748148 2579 Q W 7493240 + 8 [md0_raid5]
    8,16 0 6669 2.453748386 2579 M W 7493240 + 8 [md0_raid5]
    8,16 0 6670 2.453851843 2579 I W 7493144 + 104 [md0_raid5]
    8,16 0 0 2.453853661 0 m N cfq2579 insert_request
    8,16 0 6671 2.453854064 2579 I W 7493120 + 24 [md0_raid5]
    8,16 0 0 2.453854439 0 m N cfq2579 insert_request
    8,16 0 6672 2.453854793 2579 U N [md0_raid5] 2
    8,16 0 0 2.453855513 0 m N cfq2579 Not idling.st->count:1
    8,16 0 0 2.453855927 0 m N cfq2579 dispatch_insert
    8,16 0 0 2.453861771 0 m N cfq2579 dispatched a request
    8,16 0 0 2.453862248 0 m N cfq2579 activate rq,drv=1
    8,16 0 6673 2.453862332 2579 D W 7493120 + 24 [md0_raid5]
    8,16 0 0 2.453865957 0 m N cfq2579 Not idling.st->count:1
    8,16 0 0 2.453866269 0 m N cfq2579 dispatch_insert
    8,16 0 0 2.453866707 0 m N cfq2579 dispatched a request
    8,16 0 0 2.453867061 0 m N cfq2579 activate rq,drv=2
    8,16 0 6674 2.453867145 2579 D W 7493144 + 104 [md0_raid5]
    8,16 0 6675 2.454147608 0 C W 7493120 + 24 [0]
    8,16 0 0 2.454149357 0 m N cfq2579 complete rqnoidle 0
    8,16 0 6676 2.454791505 0 C W 7493144 + 104 [0]
    8,16 0 0 2.454794803 0 m N cfq2579 complete rqnoidle 0
    8,16 0 0 2.454795160 0 m N cfq schedule dispatch

    From the above messages, we can see that rq[W 7493144 + 104] and
    rq[W 7493120 + 24] do not merge, because the bio order is:
    8,16 0 6638 2.453619407 2579 Q W 7493144 + 8 [md0_raid5]
    8,16 0 6639 2.453620460 2579 G W 7493144 + 8 [md0_raid5]
    8,16 0 6640 2.453639311 2579 Q W 7493120 + 8 [md0_raid5]
    8,16 0 6641 2.453639842 2579 G W 7493120 + 8 [md0_raid5]
    bio(7493144) comes first and bio(7493120) later, so the subsequent
    bios are divided between the two requests. When the plug list is
    flushed, elv_attempt_insert_merge() only supports back merging, not
    front merging, so rq[7493120 + 24] cannot merge with
    rq[7493144 + 104].
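
    To illustrate with the two requests above (a sketch, not the patch
    itself):

        /*
         * Back merging asks: "does some queued request end exactly
         * where the new one starts?"
         *
         *   rq_a = [7493144 .. 7493247]   W 7493144 + 104, inserted first
         *   rq_b = [7493120 .. 7493143]   W 7493120 + 24, inserted second
         *
         * For rq_b nothing ends at 7493120, so the back merge fails,
         * even though rq_a begins exactly where rq_b ends; only a
         * front merge (or retrying after requests coalesce) catches
         * that case.
         */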

    In my testing, this situation accounts for about 25% on our system.
    With this patch applied, it no longer occurs.

    Signed-off-by: Jianpeng Ma
    Cc: Shaohua Li
    Signed-off-by: Jens Axboe

    Jianpeng Ma
     

24 Oct, 2012

1 commit

  • This config item has not carried much meaning for a while now and is
    almost always enabled by default. As agreed during the Linux kernel
    summit, remove it.

    CC: Jens Axboe
    Signed-off-by: Kees Cook
    Signed-off-by: Jens Axboe

    Kees Cook
     

23 Oct, 2012

2 commits

  • __blk_queue_next_rl() finds next request list based on blkg_list
    while skipping root_blkg in the list.
    OTOH, root_rl is special as it may exist even without root_blkg.

    Though the latter part of the function handles such a case
    correctly, exiting early improves the readability of the code.
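
    A sketch of the early exit, assuming the structure of
    __blk_queue_next_rl() at the time:

        if (rl == &q->root_rl) {
                ent = &q->blkg_list;
                /* no more block groups, hence no more request lists */
                if (list_empty(ent))
                        return NULL;
        }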

    Signed-off-by: Jun'ichi Nomura
    Cc: Tejun Heo
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Jun'ichi Nomura
     
  • blk_put_rl() does not call blkg_put() for q->root_rl because we
    don't take request list reference on q->root_blkg.
    However, if root_blkg is attached and then detached (freed),
    blk_put_rl() is confused by the bogus pointer left in q->root_blkg.

    For example, with !CONFIG_BLK_DEV_THROTTLING &&
    CONFIG_CFQ_GROUP_IOSCHED,
    switching IO scheduler from cfq to deadline will cause system stall
    after the following warning with 3.6:

    > WARNING: at /work/build/linux/block/blk-cgroup.h:250
    > blk_put_rl+0x4d/0x95()
    > Modules linked in: bridge stp llc sunrpc acpi_cpufreq freq_table mperf
    > ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4
    > Pid: 0, comm: swapper/0 Not tainted 3.6.0 #1
    > Call Trace:
    > [] warn_slowpath_common+0x85/0x9d
    > [] warn_slowpath_null+0x1a/0x1c
    > [] blk_put_rl+0x4d/0x95
    > [] __blk_put_request+0xc3/0xcb
    > [] blk_finish_request+0x232/0x23f
    > [] ? blk_end_bidi_request+0x34/0x5d
    > [] blk_end_bidi_request+0x42/0x5d
    > [] blk_end_request+0x10/0x12
    > [] scsi_io_completion+0x207/0x4d5
    > [] scsi_finish_command+0xfa/0x103
    > [] scsi_softirq_done+0xff/0x108
    > [] blk_done_softirq+0x8d/0xa1
    > [] ?
    > generic_smp_call_function_single_interrupt+0x9f/0xd7
    > [] __do_softirq+0x102/0x213
    > [] ? lock_release_holdtime+0xb6/0xbb
    > [] ? raise_softirq_irqoff+0x9/0x3d
    > [] call_softirq+0x1c/0x30
    > [] do_softirq+0x4b/0xa3
    > [] irq_exit+0x53/0xd5
    > [] smp_call_function_single_interrupt+0x34/0x36
    > [] call_function_single_interrupt+0x6f/0x80
    > [] ? mwait_idle+0x94/0xcd
    > [] ? mwait_idle+0x8b/0xcd
    > [] cpu_idle+0xbb/0x114
    > [] rest_init+0xc1/0xc8
    > [] ? csum_partial_copy_generic+0x16c/0x16c
    > [] start_kernel+0x3d4/0x3e1
    > [] ? kernel_init+0x1f7/0x1f7
    > [] x86_64_start_reservations+0xb8/0xbd
    > [] x86_64_start_kernel+0x101/0x110

    This patch clears q->root_blkg and q->root_rl.blkg when root blkg
    is destroyed.
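
    A sketch of the fix as described, assuming it runs where the root
    blkg is torn down (e.g. blkg_destroy()):

        /*
         * root_rl does not hold a reference on root_blkg, so don't
         * leave dangling pointers behind once the root blkg is freed.
         */
        if (blkg == q->root_blkg) {
                q->root_blkg = NULL;
                q->root_rl.blkg = NULL;
        }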

    Signed-off-by: Jun'ichi Nomura
    Acked-by: Vivek Goyal
    Acked-by: Tejun Heo
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Jun'ichi Nomura
     

11 Oct, 2012

1 commit

  • Pull block IO update from Jens Axboe:
    "Core block IO bits for 3.7. Not a huge round this time, it contains:

    - First series from Kent cleaning up and generalizing bio allocation
    and freeing.

    - WRITE_SAME support from Martin.

    - Mikulas patches to prevent O_DIRECT crashes when someone changes
    the block size of a device.

    - Make bio_split() work on data-less bio's (like trim/discards).

    - A few other minor fixups."

    Fixed up silent semantic mis-merge as per Mikulas Patocka and Andrew
    Morton. It is due to the VM no longer using a prio-tree (see commit
    6b2dbba8b6ac: "mm: replace vma prio_tree with an interval tree").

    So make set_blocksize() use mapping_mapped() instead of open-coding the
    internal VM knowledge that has changed.

    * 'for-3.7/core' of git://git.kernel.dk/linux-block: (26 commits)
    block: makes bio_split support bio without data
    scatterlist: refactor the sg_nents
    scatterlist: add sg_nents
    fs: fix include/percpu-rwsem.h export error
    percpu-rw-semaphore: fix documentation typos
    fs/block_dev.c:1644:5: sparse: symbol 'blkdev_mmap' was not declared
    blockdev: turn a rw semaphore into a percpu rw semaphore
    Fix a crash when block device is read and block size is changed at the same time
    block: fix request_queue->flags initialization
    block: lift the initial queue bypass mode on blk_register_queue() instead of blk_init_allocated_queue()
    block: ioctl to zero block ranges
    block: Make blkdev_issue_zeroout use WRITE SAME
    block: Implement support for WRITE SAME
    block: Consolidate command flag and queue limit checks for merges
    block: Clean up special command handling logic
    block/blk-tag.c: Remove useless kfree
    block: remove the duplicated setting for congestion_threshold
    block: reject invalid queue attribute values
    block: Add bio_clone_bioset(), bio_clone_kmalloc()
    block: Consolidate bio_alloc_bioset(), bio_kmalloc()
    ...

    Linus Torvalds
     

03 Oct, 2012

2 commits

  • Pull cgroup hierarchy update from Tejun Heo:
    "Currently, different cgroup subsystems handle nested cgroups
    completely differently. There's no consistency among subsystems and
    the behaviors often are outright broken.

    People at least seem to agree that the broken hierarchy behaviors need
    to be weeded out if any progress is gonna be made on this front and
    that the fallouts from deprecating the broken behaviors should be
    acceptable especially given that the current behaviors don't make much
    sense when nested.

    This patch makes cgroup emit warning messages if cgroups for
    subsystems with broken hierarchy behavior are nested to prepare for
    fixing them in the future. This was put in a separate branch because
    more related changes were expected (didn't make it this round) and the
    memory cgroup wanted to pull in this and make changes on top."

    * 'for-3.7-hierarchy' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them

    Linus Torvalds
     
  • Pull workqueue changes from Tejun Heo:
    "This is workqueue updates for v3.7-rc1. A lot of activities this
    round including considerable API and behavior cleanups.

    * delayed_work combines a timer and a work item. The handling of the
    timer part has always been a bit clunky leading to confusing
    cancelation API with weird corner-case behaviors. delayed_work is
    updated to use new IRQ safe timer and cancelation now works as
    expected.

    * Another deficiency of delayed_work was lack of the counterpart of
    mod_timer() which led to cancel+queue combinations or open-coded
    timer+work usages. mod_delayed_work[_on]() are added.

    These two delayed_work changes make delayed_work provide interface
    and behave like timer which is executed with process context.

    * A work item could be executed concurrently on multiple CPUs, which
    is rather unintuitive and made flush_work() behavior confusing and
    half-broken under certain circumstances. This problem doesn't
    exist for non-reentrant workqueues. While non-reentrancy check
    isn't free, the overhead is incurred only when a work item bounces
    across different CPUs and even in simulated pathological scenario
    the overhead isn't too high.

    All workqueues are made non-reentrant. This removes the
    distinction between flush_[delayed_]work() and
    flush_[delayed_]work_sync(). The former is now as strong as the
    latter and the specified work item is guaranteed to have finished
    execution of any previous queueing on return.

    * In addition to the various bug fixes, Lai redid and simplified CPU
    hotplug handling significantly.

    * Joonsoo introduced system_highpri_wq and used it during CPU
    hotplug.

    There are two merge commits - one to pull in IRQ safe timer from
    tip/timers/core and the other to pull in CPU hotplug fixes from
    wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."

    Fixed a number of trivial conflicts, but the more interesting conflicts
    were silent ones where the deprecated interfaces had been used by new
    code in the merge window, and thus didn't cause any real data conflicts.

    Tejun pointed out a few of them, I fixed a couple more.

    * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
    workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
    workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
    workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
    workqueue: remove @delayed from cwq_dec_nr_in_flight()
    workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
    workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
    workqueue: use __cpuinit instead of __devinit for cpu callbacks
    workqueue: rename manager_mutex to assoc_mutex
    workqueue: WORKER_REBIND is no longer necessary for idle rebinding
    workqueue: WORKER_REBIND is no longer necessary for busy rebinding
    workqueue: reimplement idle worker rebinding
    workqueue: deprecate __cancel_delayed_work()
    workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
    workqueue: use mod_delayed_work() instead of __cancel + queue
    workqueue: use irqsafe timer for delayed_work
    workqueue: clean up delayed_work initializers and add missing one
    workqueue: make deferrable delayed_work initializer names consistent
    workqueue: cosmetic whitespace updates for macro definitions
    workqueue: deprecate system_nrt[_freezable]_wq
    workqueue: deprecate flush[_delayed]_work_sync()
    ...

    Linus Torvalds
     

26 Sep, 2012

1 commit

  • In some usage scenarios it is desirable to work with disk images or
    virtualized DASD devices. One problem that prevents such applications
    is the partition detection in ibm.c. Currently it works only for
    devices that support the BIODASDINFO2 ioctl, in other words, it only
    works for devices that belong to the DASD device driver.

    The information gained from the BIODASDINFO2 ioctl is absolutely
    necessary only for a small set of legacy cases. All current VOL1, LNX1 and
    CMS1 type of disk labels can be interpreted correctly without this
    information, as long as the generic HDIO_GETGEO ioctl works and
    provides a correct disk geometry.

    This patch makes the ibm.c partition detection as independent as
    possible from the BIODASDINFO2 ioctl. Only the following two cases are
    still restricted to real DASDs:
    - An FBA DASD, or LDL formatted ECKD DASD without any disk label.
    - An old style LNX1 label (without large volume support) on a disk
    with inconsistent device geometry.

    Signed-off-by: Stefan Weinhuber
    Signed-off-by: Martin Schwidefsky

    Stefan Weinhuber
     

21 Sep, 2012

2 commits

  • A queue newly allocated with blk_alloc_queue_node() has only
    QUEUE_FLAG_BYPASS set. For request-based drivers,
    blk_init_allocated_queue() is called and q->queue_flags is overwritten
    with QUEUE_FLAG_DEFAULT which doesn't include BYPASS even though the
    initial bypass is still in effect.

    In blk_init_allocated_queue(), OR QUEUE_FLAG_DEFAULT into
    q->queue_flags instead of overwriting it.
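
    In code form, the change amounts to something like:

        /* keep the allocation-time BYPASS flag instead of clobbering it */
        q->queue_flags |= QUEUE_FLAG_DEFAULT;
        /* previously: q->queue_flags = QUEUE_FLAG_DEFAULT; */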

    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • block: lift the initial queue bypass mode on blk_register_queue()
    instead of blk_init_allocated_queue()

    b82d4b197c ("blkcg: make request_queue bypassing on allocation") made
    request_queues bypassed on allocation to avoid switching on and off
    bypass mode on a queue being initialized. Some drivers allocate and
    then destroy a lot of queues without fully initializing them, and
    incurring the bypass latency overhead on each of them could add up
    to significant overhead.

    Unfortunately, blk_init_allocated_queue() is never used by queues of
    bio-based drivers, which means that all bio-based driver queues are in
    bypass mode even after initialization and registration complete
    successfully.

    Due to the limited way request_queues are used by bio drivers, this
    problem is hidden pretty well but it shows up when blk-throttle is
    used in combination with a bio-based driver. Trying to configure
    (echoing to cgroupfs file) blk-throttle for a bio-based driver hangs
    indefinitely in blkg_conf_prep() waiting for bypass mode to end.

    This patch moves the initial blk_queue_bypass_end() call from
    blk_init_allocated_queue() to blk_register_queue() which is called for
    any userland-visible queue regardless of its type.
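
    A sketch of the move, assuming the function layout of that kernel:

        int blk_register_queue(struct gendisk *disk)
        {
                struct request_queue *q = disk->queue;
                ...
                /*
                 * Initialization must be complete by now.  Finish the
                 * initial bypass from queue allocation; previously done
                 * in blk_init_allocated_queue(), which bio-based
                 * drivers never call.
                 */
                blk_queue_bypass_end(q);
                ...
        }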

    I believe this is correct because I don't think there is any block
    driver which needs or wants working elevator and blk-cgroup on a queue
    which isn't visible to userland. If there are such users, we need a
    different solution.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-by: Joseph Glanville <joseph.glanville@orionvm.com.au>
    Cc: stable@vger.kernel.org
    Acked-by: Vivek Goyal <vgoyal@redhat.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

    Tejun Heo
     

20 Sep, 2012

5 commits

  • Introduce a BLKZEROOUT ioctl which can be used to clear block ranges by
    way of blkdev_issue_zeroout().
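
    From userspace the ioctl takes a {start, length} pair in bytes
    (both 512-byte aligned); a minimal, hypothetical caller:

        #include <stdint.h>
        #include <sys/ioctl.h>
        #include <linux/fs.h>           /* BLKZEROOUT */

        int zero_range(int fd, uint64_t start, uint64_t len)
        {
                uint64_t range[2] = { start, len };

                return ioctl(fd, BLKZEROOUT, range);
        }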

    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • If the device supports WRITE SAME, use that to optimize zeroing of
    blocks. If the device does not support WRITE SAME or if the operation
    fails, fall back to writing zeroes the old-fashioned way.
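
    A sketch of the fallback logic described (helper names assumed;
    __blkdev_issue_zeroout stands for the pre-existing zero-writing
    path):

        int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
                                 sector_t nr_sects, gfp_t gfp_mask)
        {
                /* try the efficient path first */
                if (bdev_write_same(bdev) &&
                    blkdev_issue_write_same(bdev, sector, nr_sects,
                                            gfp_mask, ZERO_PAGE(0)) == 0)
                        return 0;

                /* WRITE SAME unsupported or failed: zero manually */
                return __blkdev_issue_zeroout(bdev, sector, nr_sects,
                                              gfp_mask);
        }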

    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • The WRITE SAME command supported on some SCSI devices allows the same
    block to be efficiently replicated throughout a block range. Only a
    single logical block is transferred from the host and the storage device
    writes the same data to all blocks described by the I/O.

    This patch implements support for WRITE SAME in the block layer. The
    blkdev_issue_write_same() function can be used by filesystems and block
    drivers to replicate a buffer across a block range. This can be used to
    efficiently initialize software RAID devices, etc.
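
    The new entry point, roughly (signature as described):

        /*
         * Replicate the single logical block in @page across @nr_sects
         * sectors starting at @sector.  Returns 0 on success.
         */
        int blkdev_issue_write_same(struct block_device *bdev,
                                    sector_t sector, sector_t nr_sects,
                                    gfp_t gfp_mask, struct page *page);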

    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • - blk_check_merge_flags() verifies that cmd_flags / bi_rw are
    compatible. This function is called for both req-req and req-bio
    merging.

    - blk_rq_get_max_sectors() and blk_queue_get_max_sectors() can be used
    to query the maximum sector count for a given request or queue. The
    calls will return the right value from the queue limits given the
    type of command (RW, discard, write same, etc.)
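
    A sketch of the queue-side helper, picking the right limit per
    command type:

        static inline unsigned int
        blk_queue_get_max_sectors(struct request_queue *q,
                                  unsigned int cmd_flags)
        {
                if (unlikely(cmd_flags & REQ_DISCARD))
                        return q->limits.max_discard_sectors;

                if (unlikely(cmd_flags & REQ_WRITE_SAME))
                        return q->limits.max_write_same_sectors;

                return q->limits.max_sectors;
        }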

    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • Remove special-casing of non-rw fs style requests (discard). The nomerge
    flags are consolidated in blk_types.h, and rq_mergeable() and
    bio_mergeable() have been modified to use them.

    bio_is_rw() is used in place of bio_has_data() a few places. This is
    done to distinguish true reads and writes from other fs type requests
    that carry a payload (e.g. write same).
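
    A sketch of bio_is_rw(): the bio must carry data and must not be a
    payload-carrying special command such as write same:

        static inline bool bio_is_rw(struct bio *bio)
        {
                if (!bio_has_data(bio))
                        return false;

                /* write same has a payload but isn't a true read/write */
                if (bio->bi_rw & REQ_WRITE_SAME)
                        return false;

                return true;
        }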

    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

15 Sep, 2012

1 commit

  • Currently, cgroup hierarchy support is a mess. cpu related subsystems
    behave correctly - configuration, accounting and control on a parent
    properly cover its children. blkio and freezer completely ignore
    hierarchy and treat all cgroups as if they're directly under the root
    cgroup. Others show yet different behaviors.

    These differing interpretations of cgroup hierarchy make using
    cgroup confusing and make it impossible to co-mount controllers into
    the same hierarchy and obtain sane behavior.

    Eventually, we want full hierarchy support from all subsystems and
    probably a unified hierarchy. Users using separate hierarchies and
    expecting completely different behaviors depending on the mounted
    subsystem is detrimental to making any progress on this front.

    This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
    for controllers which are lacking in hierarchy support. The goal of
    this patch is two-fold.

    * Move users away from using hierarchy on currently non-hierarchical
    subsystems, so that implementing proper hierarchy support on those
    doesn't surprise them.

    * Keep track of which controllers are broken how and nudge the
    subsystems to implement proper hierarchy support.

    For now, start with a single warning message. We can whine louder
    later on.
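
    A sketch of the warning (field names assumed: a per-subsystem
    broken_hierarchy flag plus a warn-once latch):

        /* whine once per subsystem when it's used in a nested cgroup */
        if (cgrp->parent->parent && ss->broken_hierarchy &&
            !ss->warned_broken_hierarchy) {
                pr_warning("cgroup: controller \"%s\" has incomplete "
                           "hierarchy support; nested cgroups may "
                           "change behavior in the future\n", ss->name);
                ss->warned_broken_hierarchy = true;
        }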

    v2: Fixed a typo spotted by Michal. Warning message updated.

    v3: Updated memcg part so that it doesn't generate warning in the
    cases where .use_hierarchy=false doesn't make the behavior
    different from root.use_hierarchy=true. Fixed a typo spotted by
    Glauber.

    v4: Check ->broken_hierarchy after cgroup creation is complete so that
    ->create() can affect the result per Michal. Dropped unnecessary
    memcg root handling per Michal.

    Signed-off-by: Tejun Heo
    Acked-by: Michal Hocko
    Acked-by: Li Zefan
    Acked-by: Serge E. Hallyn
    Cc: Glauber Costa
    Cc: Peter Zijlstra
    Cc: Paul Turner
    Cc: Johannes Weiner
    Cc: Thomas Graf
    Cc: Vivek Goyal
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Neil Horman
    Cc: Aneesh Kumar K.V

    Tejun Heo
     

13 Sep, 2012

1 commit

  • Remove useless kfree() and clean up code related to the removal.

    The semantic patch that finds this problem is as follows:
    (http://coccinelle.lip6.fr/)

    //
    @r exists@
    position p1,p2;
    expression x;
    @@

    if (x@p1 == NULL) { ... kfree@p2(x); ... return ...; }

    @unchanged exists@
    position r.p1,r.p2;
    expression e
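
    For reference, the shape of code the rule flags, as a hypothetical
    example: a kfree() of a pointer just proven to be NULL (kfree(NULL)
    is a no-op, so the call is useless):

        tags = kmalloc(sizeof(*tags), GFP_ATOMIC);
        if (tags == NULL) {
                kfree(tags);    /* useless: tags is known to be NULL */
                return NULL;
        }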

    Signed-off-by: Peter Senna Tschudin
    Signed-off-by: Jens Axboe

    Peter Senna Tschudin
     

09 Sep, 2012

5 commits

  • blk_queue_congestion_threshold() is already called from
    blk_queue_make_request(), so the duplicated call here has been
    removed.

    Signed-off-by: Jaehoon Chung
    Signed-off-by: Kyungmin Park
    Signed-off-by: Jens Axboe

    Jaehoon Chung
     
  • Instead of using simple_strtoul which "converts" invalid numbers to 0,
    use strict_strtoul and perform error checking to ensure that userspace
    passes us a valid unsigned long. This addresses problems with functions
    such as writev, which might want to write a trailing newline -- the
    newline should rightfully be rejected, but the value preceding it
    should be preserved.
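
    A sketch of the store-side helper after the change, assuming the
    queue_var_store() pattern in blk-sysfs.c:

        static ssize_t queue_var_store(unsigned long *var,
                                       const char *page, size_t count)
        {
                int err;

                /*
                 * strict_strtoul() accepts an optional trailing newline
                 * but rejects other garbage, instead of silently
                 * yielding 0 the way simple_strtoul() does.
                 */
                err = strict_strtoul(page, 10, var);
                if (err)
                        return -EINVAL;

                return count;
        }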

    Fixes BZ#46981.

    Signed-off-by: Dave Reisner
    Signed-off-by: Jens Axboe

    Dave Reisner
     
  • Previously, there was bio_clone() but it only allocated from the fs bio
    set; as a result various users were open coding it and using
    __bio_clone().

    This changes bio_clone() to become bio_clone_bioset(), and then we add
    bio_clone() and bio_clone_kmalloc() as wrappers around it, making use of
    the functionality the last patch added.
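
    After this change the old entry points survive as thin wrappers; a
    sketch:

        static inline struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
        {
                return bio_clone_bioset(bio, gfp_mask, fs_bio_set);
        }

        static inline struct bio *bio_clone_kmalloc(struct bio *bio,
                                                    gfp_t gfp_mask)
        {
                /* NULL bioset means a plain kmalloc-backed allocation */
                return bio_clone_bioset(bio, gfp_mask, NULL);
        }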

    This will also help in a later patch changing how bio cloning works.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: NeilBrown
    CC: Alasdair Kergon
    CC: Boaz Harrosh
    CC: Jeff Garzik
    Acked-by: Jeff Garzik
    Signed-off-by: Jens Axboe

    Kent Overstreet
     
  • Now that we've got generic code for freeing bios allocated from bio
    pools, this isn't needed anymore.

    This patch also makes bio_free() static, since without bi_destructor
    there should be no need for it to be called anywhere else.

    bio_free() is now only called from bio_put, so we can refactor those a
    bit - move some code from bio_put() to bio_free() and kill the redundant
    bio->bi_next = NULL.
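
    The resulting shape of bio_put(), sketched:

        void bio_put(struct bio *bio)
        {
                BIO_BUG_ON(!atomic_read(&bio->bi_cnt));

                /* last put frees it */
                if (atomic_dec_and_test(&bio->bi_cnt))
                        bio_free(bio);
        }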

    v5: Switch to BIO_KMALLOC_POOL ((void *)~0), per Boaz
    v6: BIO_KMALLOC_POOL now NULL, drop bio_free's EXPORT_SYMBOL
    v7: No #define BIO_KMALLOC_POOL anymore

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    Signed-off-by: Jens Axboe

    Kent Overstreet
     
  • Now that bios keep track of where they were allocated from,
    bio_integrity_alloc_bioset() becomes redundant.

    Remove bio_integrity_alloc_bioset() and drop bio_set argument from the
    related functions and make them use bio->bi_pool.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: Martin K. Petersen
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Kent Overstreet
     

31 Aug, 2012

1 commit

  • When performing a cable pull test with active stress I/O using fio
    over a dual-port Intel 82599 FCoE CNA, with 256 LUNs on one port and
    about 32 LUNs on the other, the system becomes unusable because
    scsi-ml is busy printing error messages for all the failing
    commands. I don't believe this problem is specific to FCoE, and
    since these commands are failing anyway due to the link being down
    (DID_NO_CONNECT), just rate-limit the messages here to solve this
    issue.
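
    One way to express the rate limiting (a sketch;
    printk_ratelimited() keeps its own per-call-site ratelimit state,
    matching the per-function behavior described below):

        printk_ratelimited(KERN_ERR
                           "end_request: %s error, dev %s, sector %llu\n",
                           error_type,
                           req->rq_disk ? req->rq_disk->disk_name : "?",
                           (unsigned long long)blk_rq_pos(req));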

    v2: use __ratelimit(), as Tomas Henzl mentioned, as the proper way
    to rate-limit per function. However, the failed I/O also reaches
    blk_end_request_err() and then blk_update_request(), which has to be
    rate-limited too, as added in v2 of this patch.

    v3: resolved conflict to apply on the current 3.6-rc3 upstream tip.

    Signed-off-by: Yi Zou
    Cc: www.Open-FCoE.org
    Cc: Tomas Henzl
    Cc:
    Signed-off-by: Jens Axboe

    Yi Zou
     

22 Aug, 2012

2 commits

  • Now that cancel_delayed_work() can be safely called from IRQ handlers,
    there's no reason to use __cancel_delayed_work(). Use
    cancel_delayed_work() instead of __cancel_delayed_work() and mark the
    latter deprecated.

    Signed-off-by: Tejun Heo
    Acked-by: Jens Axboe
    Cc: Jiri Kosina
    Cc: Roland Dreier
    Cc: Tomi Valkeinen

    Tejun Heo
     
  • Now that mod_delayed_work() is safe to call from IRQ handlers,
    __cancel_delayed_work() followed by queue_delayed_work() can be
    replaced with mod_delayed_work().
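
    The conversion is mechanical in most spots (dwork is a hypothetical
    delayed_work):

        /* before: racy two-step cancel + requeue */
        __cancel_delayed_work(&dwork);
        queue_delayed_work(wq, &dwork, delay);

        /* after: one call that adjusts the timer in place */
        mod_delayed_work(wq, &dwork, delay);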

    Most conversions are straight-forward except for the following.

    * net/core/link_watch.c: linkwatch_schedule_work() was doing quite
    an elaborate dance around its delayed_work. Collapse it such that
    linkwatch_work is queued for immediate execution if LW_URGENT and
    the existing timer is kept otherwise.

    Signed-off-by: Tejun Heo
    Cc: "David S. Miller"
    Cc: Tomi Valkeinen

    Tejun Heo
     

21 Aug, 2012

1 commit

  • system_nrt[_freezable]_wq are now spurious. Mark them deprecated and
    convert all users to system[_freezable]_wq.

    If you're cc'd and wondering what's going on: Now all workqueues are
    non-reentrant, so there's no reason to use system_nrt[_freezable]_wq.
    Please use system[_freezable]_wq instead.

    This patch doesn't make any functional difference.

    Signed-off-by: Tejun Heo
    Acked-By: Lai Jiangshan

    Cc: Jens Axboe
    Cc: David Airlie
    Cc: Jiri Kosina
    Cc: "David S. Miller"
    Cc: Rusty Russell
    Cc: "Paul E. McKenney"
    Cc: David Howells

    Tejun Heo
     

14 Aug, 2012

1 commit

  • Convert delayed_work users doing cancel_delayed_work() followed by
    queue_delayed_work() to mod_delayed_work().

    Most conversions are straight-forward. Ones worth mentioning are,

    * drivers/edac/edac_mc.c: edac_mc_workq_setup() converted to always
    use mod_delayed_work() and cancel loop in
    edac_mc_reset_delay_period() is dropped.

    * drivers/platform/x86/thinkpad_acpi.c: No need to remember whether
    watchdog is active or not. @fan_watchdog_active and related code
    dropped.

    * drivers/power/charger-manager.c: Seemingly a lot of
    delayed_work_pending() abuse going on here.
    [delayed_]work_pending() are unsynchronized and racy when used like
    this. I converted one instance in fullbatt_handler(). Please
    convert the rest so that it invokes workqueue APIs for the intended
    target state rather than trying to game work item pending state
    transitions. e.g. if timer should be modified - call
    mod_delayed_work(), canceled - call cancel_delayed_work[_sync]().

    * drivers/thermal/thermal_sys.c: thermal_zone_device_set_polling()
    simplified. Note that round_jiffies() calls in this function are
    meaningless. round_jiffies() work on absolute jiffies not delta
    delay used by delayed_work.

    v2: Tomi pointed out that __cancel_delayed_work() users can't be
    safely converted to mod_delayed_work(). They could be calling it
    from irq context and if that happens while delayed_work_timer_fn()
    is running, it could deadlock. __cancel_delayed_work() users are
    dropped.

    Signed-off-by: Tejun Heo
    Acked-by: Henrique de Moraes Holschuh
    Acked-by: Dmitry Torokhov
    Acked-by: Anton Vorontsov
    Acked-by: David Howells
    Cc: Tomi Valkeinen
    Cc: Jens Axboe
    Cc: Jiri Kosina
    Cc: Doug Thompson
    Cc: David Airlie
    Cc: Roland Dreier
    Cc: "John W. Linville"
    Cc: Zhang Rui
    Cc: Len Brown
    Cc: "J. Bruce Fields"
    Cc: Johannes Berg

    Tejun Heo
     

03 Aug, 2012

3 commits

  • I met an odd problem: reading /proc/partitions may return zero.

    I wrote a test program, test.c:

    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main()
    {
            char buff[4096];
            int ret;
            int fd;

            printf("pid=%d\n", getpid());
            while (1) {
                    fd = open("/proc/partitions", O_RDONLY);
                    if (fd < 0) {
                            printf("open error %s\n", strerror(errno));
                            return 0;
                    }
                    ret = read(fd, buff, 4096);
                    if (ret <= 0) {
                            printf("read error %s\n", strerror(errno));
                            return 0;
                    }
                    close(fd);
            }
            return 0;
    }

    To reproduce, run two loops in parallel:
    1: while :; do cat /proc/partitions > /dev/null; done
    2: ./test

    I reviewed the code and found:

    >> static void *show_partition_start(struct seq_file *seqf, loff_t *pos)
    >> {
    >> static void *p;
    >>
    >> p = disk_seqf_start(seqf, pos);
    >> if (!IS_ERR_OR_NULL(p) && !*pos)
    >> seq_puts(seqf, "major minor #blocks name\n\n");
    >> return p;
    >> }
    If two processes race (e.g. ./test against cat /proc/partitions):

    process1                           process2
    p = disk_seqf_start() (not NULL)
                                       p = disk_seqf_start() (NULL, because of *pos)
    if (!IS_ERR_OR_NULL(p) && !*pos)

    Because p is static, the NULL obtained by process2 overwrites the
    pointer process1 had just stored, so process1's check fails and its
    read returns zero; hence the fix of dropping the static.

    Signed-off-by: Jianpeng Ma
    Signed-off-by: Jens Axboe

    Jianpeng Ma
     
  • Add a helper to map a bio to a scatterlist, modelled after
    blk_rq_map_sg.

    This helper is useful for any driver that wants to create
    a scatterlist from its ->make_request_fn method.
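
    Roughly, the helper's contract (a sketch of the signature):

        /*
         * Map the segments of @bio into @sglist; returns the number of
         * scatterlist entries used.
         */
        int blk_bio_map_sg(struct request_queue *q, struct bio *bio,
                           struct scatterlist *sglist);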

    Changes in v2:
    - Use __blk_segment_map_sg to avoid duplicated code
    - Add DocBook style function comment

    Cc: Rusty Russell
    Cc: Christoph Hellwig
    Cc: Tejun Heo
    Cc: Shaohua Li
    Cc: "Michael S. Tsirkin"
    Cc: kvm@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: virtualization@lists.linux-foundation.org
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Minchan Kim
    Signed-off-by: Asias He
    Signed-off-by: Jens Axboe

    Asias He
     
  • Split the mapping code in blk_rq_map_sg() to a helper
    __blk_segment_map_sg(), so that other mapping functions, e.g.
    blk_bio_map_sg(), can share the code.

    Cc: Rusty Russell
    Cc: Christoph Hellwig
    Cc: Tejun Heo
    Cc: Shaohua Li
    Cc: "Michael S. Tsirkin"
    Cc: kvm@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: virtualization@lists.linux-foundation.org
    Suggested-by: Jens Axboe
    Suggested-by: Tejun Heo
    Signed-off-by: Asias He
    Signed-off-by: Jens Axboe

    Asias He
     

02 Aug, 2012

4 commits

  • When a disk has large discard_granularity and small max_discard_sectors,
    discards are not split with optimal alignment. In the limit case of
    discard_granularity == max_discard_sectors, no request could be aligned
    correctly, so in fact you might end up with no discarded logical blocks
    at all.

    Another example that helps showing the condition in the patch is with
    discard_granularity == 64, max_discard_sectors == 128. A request that is
    submitted for 256 sectors 2..257 will be split in two: 2..129, 130..257.
    However, only 2 aligned blocks out of 3 are included in the request;
    128..191 may be left intact and not discarded. With this patch, the
    first request will be truncated to ensure good alignment of what's left,
    and the split will be 2..127, 128..255, 256..257. The patch will also
    take into account the discard_alignment.

    At most one extra request will be introduced, because the first request
    will be reduced by at most granularity-1 sectors, and granularity
    must be less than max_discard_sectors. Subsequent requests will run
    on round_down(max_discard_sectors, granularity) sectors, as in the
    current code.
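
    A simplified sketch of the alignment step in the split loop
    (variable names assumed; the real code would use sector_div() for
    32-bit safety):

        /* max_discard_sectors is already rounded down to granularity */
        req_sects = min_t(sector_t, nr_sects, max_discard_sectors);
        end_sect = sector + req_sects;

        /*
         * If we are splitting and the next starting sector would be
         * misaligned, stop at the previous aligned sector instead, so
         * the remainder starts on a granularity boundary.
         */
        if (req_sects < nr_sects &&
            (end_sect - alignment) % granularity) {
                end_sect -= (end_sect - alignment) % granularity;
                req_sects = end_sect - sector;
        }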

    Signed-off-by: Paolo Bonzini
    Acked-by: Vivek Goyal
    Tested-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Paolo Bonzini
     
  • Mostly a preparation for the next patch.

    In principle this fixes an infinite loop if max_discard_sectors < granularity,
    but that really shouldn't happen.

    Signed-off-by: Paolo Bonzini
    Acked-by: Vivek Goyal
    Tested-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Paolo Bonzini
     
  • Pull block driver changes from Jens Axboe:

    - Making the plugging support for drivers a bit more sane from Neil.
    This supersedes the plugging change from Shaohua as well.

    - The usual round of drbd updates.

    - Using a tail add instead of a head add in the request completion
    for nbd, making us find the most completed request more quickly.

    - A few floppy changes, getting rid of a duplicated flag and also
    running the floppy init async (since it takes forever in boot terms)
    from Andi.

    * 'for-3.6/drivers' of git://git.kernel.dk/linux-block:
    floppy: remove duplicated flag FD_RAW_NEED_DISK
    blk: pass from_schedule to non-request unplug functions.
    block: stack unplug
    blk: centralize non-request unplug handling.
    md: remove plug_cnt feature of plugging.
    block/nbd: micro-optimization in nbd request completion
    drbd: announce FLUSH/FUA capability to upper layers
    drbd: fix max_bio_size to be unsigned
    drbd: flush drbd work queue before invalidate/invalidate remote
    drbd: fix potential access after free
    drbd: call local-io-error handler early
    drbd: do not reset rs_pending_cnt too early
    drbd: reset congestion information before reporting it in /proc/drbd
    drbd: report congestion if we are waiting for some userland callback
    drbd: differentiate between normal and forced detach
    drbd: cleanup, remove two unused global flags
    floppy: Run floppy initialization asynchronous

    Linus Torvalds
     
  • Pull core block IO bits from Jens Axboe:
    "The most complicated part if this is the request allocation rework by
    Tejun, which has been queued up for a long time and has been in
    for-next ditto as well.

    There are a few commits from yesterday and today, mostly trivial and
    obvious fixes. So I'm pretty confident that it is sound. It's also
    smaller than usual."

    * 'for-3.6/core' of git://git.kernel.dk/linux-block:
    block: remove dead func declaration
    block: add partition resize function to blkpg ioctl
    block: uninitialized ioc->nr_tasks triggers WARN_ON
    block: do not artificially constrain max_sectors for stacking drivers
    blkcg: implement per-blkg request allocation
    block: prepare for multiple request_lists
    block: add q->nr_rqs[] and move q->rq.elvpriv to q->nr_rqs_elvpriv
    blkcg: inline bio_blkcg() and friends
    block: allocate io_context upfront
    block: refactor get_request[_wait]()
    block: drop custom queue draining used by scsi_transport_{iscsi|fc}
    mempool: add @gfp_mask to mempool_create_node()
    blkcg: make root blkcg allocation use %GFP_KERNEL
    blkcg: __blkg_lookup_create() doesn't need radix preload

    Linus Torvalds
     

01 Aug, 2012

4 commits

  • The __generic_unplug_device() function was removed by commit
    7eaceaccab5f40bbfda044629a6298616aeaed50, which forgot to remove the
    declaration at the same time. Remove it here.

    Signed-off-by: Yuanhan Liu
    Signed-off-by: Jens Axboe

    Yuanhan Liu
     
  • Add a new operation code (BLKPG_RESIZE_PARTITION) to the BLKPG ioctl that
    allows altering the size of an existing partition, even if it is currently
    in use.

    This patch converts hd_struct->nr_sects into a sequence counter,
    because one might extend a partition while IO is happening to it,
    and the update of nr_sects can be non-atomic on 32-bit machines with
    a 64-bit sector_t. This can lead to issues like reading an
    inconsistent size of a partition. A sequence counter is used so that
    readers don't have to take the bdev mutex lock, as we call
    sector_in_part() very frequently.

    Now all access to hd_struct->nr_sects should happen through the
    sequence counter read/update helper functions part_nr_sects_read()
    and part_nr_sects_write(). There is one exception though:
    set_capacity()/get_capacity(). I think a race should theoretically
    exist there too, but this patch does not modify
    set_capacity()/get_capacity() due to the sheer number of call sites,
    and I am afraid that change might break something. I have left that
    as a TODO item; we can handle it later if need be. This patch does
    not introduce any new races as such w.r.t.
    set_capacity()/get_capacity().
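
    A sketch of the read-side helper (the 32-bit SMP case is where the
    sequence counter matters; the UP-preempt variant is omitted here):

        static inline sector_t part_nr_sects_read(struct hd_struct *part)
        {
        #if BITS_PER_LONG == 32 && defined(CONFIG_LBDAF) && defined(CONFIG_SMP)
                sector_t nr_sects;
                unsigned seq;

                do {
                        seq = read_seqcount_begin(&part->nr_sects_seq);
                        nr_sects = part->nr_sects;
                } while (read_seqcount_retry(&part->nr_sects_seq, seq));

                return nr_sects;
        #else
                return part->nr_sects; /* a single load is atomic here */
        #endif
        }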

    v2: Add CONFIG_LBDAF test to UP preempt case as suggested by Phillip.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Phillip Susi
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • Hi,

    I'm using the old-fashioned 'dump' backup tool, and I noticed that it spews the
    below warning as of 3.5-rc1 and later (3.4 is fine):

    [ 10.886893] ------------[ cut here ]------------
    [ 10.886904] WARNING: at include/linux/iocontext.h:140 copy_process+0x1488/0x1560()
    [ 10.886905] Hardware name: Bochs
    [ 10.886906] Modules linked in:
    [ 10.886908] Pid: 2430, comm: dump Not tainted 3.5.0-rc7+ #27
    [ 10.886908] Call Trace:
    [ 10.886911] [] warn_slowpath_common+0x7a/0xb0
    [ 10.886912] [] warn_slowpath_null+0x15/0x20
    [ 10.886913] [] copy_process+0x1488/0x1560
    [ 10.886914] [] do_fork+0xb4/0x340
    [ 10.886918] [] ? recalc_sigpending+0x1a/0x50
    [ 10.886919] [] ? __set_task_blocked+0x32/0x80
    [ 10.886920] [] ? __set_current_blocked+0x3a/0x60
    [ 10.886923] [] sys_clone+0x23/0x30
    [ 10.886925] [] stub_clone+0x13/0x20
    [ 10.886927] [] ? system_call_fastpath+0x16/0x1b
    [ 10.886928] ---[ end trace 32a14af7ee6a590b ]---

    Reproducing is easy, I can hit it on a KVM system with a very basic
    config (x86_64 make defconfig + enable the drivers needed). To hit it,
    just install dump (on debian/ubuntu, not sure what the package might be
    called on Fedora), and:

    dump -o -f /tmp/foo /

    You'll see the warning in dmesg once it forks off the I/O process and
    starts dumping filesystem contents.

    I bisected it down to the following commit:

    commit f6e8d01bee036460e03bd4f6a79d014f98ba712e
    Author: Tejun Heo
    Date: Mon Mar 5 13:15:26 2012 -0800

    block: add io_context->active_ref

    Currently ioc->nr_tasks is used to decide two things - whether an ioc
    is done issuing IOs and whether it's shared by multiple tasks. This
    patch separate out the first into ioc->active_ref, which is acquired
    and released using {get|put}_io_context_active() respectively.

    This will be used to associate bio's with a given task. This patch
    doesn't introduce any visible behavior change.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    It seems like the init of ioc->nr_tasks was removed in that patch,
    so it starts out at 0 instead of 1.

    Tejun, is the right thing here to add back the init, or should something else
    be done?

    The below patch removes the warning, but I haven't done any more extensive
    testing on it.
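
    The fix as described is essentially a one-liner (sketch):

        /* in create_task_io_context(): a new ioc starts with one task */
        atomic_set(&ioc->nr_tasks, 1);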

    Signed-off-by: Olof Johansson
    Acked-by: Tejun Heo
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Olof Johansson
     
  • blk_set_stacking_limits is intended to allow stacking drivers to build
    up the limits of the stacked device based on the underlying devices'
    limits. But defaulting 'max_sectors' to BLK_DEF_MAX_SECTORS (1024)
    doesn't allow the stacking driver to inherit a max_sectors larger than
    1024 -- due to blk_stack_limits' use of min_not_zero.

    It is now clear that this artificial limit is getting in the way so
    change blk_set_stacking_limits's max_sectors to UINT_MAX (which allows
    stacking drivers like dm-multipath to inherit 'max_sectors' from the
    underlying paths).
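
    In code, the change boils down to (sketch):

        void blk_set_stacking_limits(struct queue_limits *lim)
        {
                blk_set_default_limits(lim);

                /* Inherit limits from component devices */
                lim->max_sectors = UINT_MAX; /* was BLK_DEF_MAX_SECTORS */
                ...
        }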

    Reported-by: Vijay Chauhan
    Tested-by: Vijay Chauhan
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer