05 Oct, 2009

2 commits

  • This reverts commit a9327cac440be4d8333bba975cbbf76045096275.

    Corrado Zoccolo reports:

    "with 2.6.32-rc1 I started getting the following strange output from
    "iostat -kx 2":
    Linux 2.6.31bisect (et2) 04/10/2009 _i686_ (2 CPU)

    avg-cpu:  %user  %nice %system %iowait %steal  %idle
              10,70   0,00    3,16   15,75   0,00  70,38

    Device:  rrqm/s wrqm/s   r/s  w/s   rkB/s  wkB/s avgrq-sz avgqu-sz await       svctm      %util
    sda       18,22   0,00  0,67 0,01   14,77   0,02    43,94     0,01 10,53 39043915,03 2629219,87
    sdb       60,89   9,68 50,79 3,04 1724,43  50,52    65,95     0,70 13,06   488437,47 2629219,87

    avg-cpu:  %user  %nice %system %iowait %steal  %idle
               2,72   0,00    0,74    0,00   0,00  96,53

    Device:  rrqm/s wrqm/s   r/s  w/s   rkB/s  wkB/s avgrq-sz avgqu-sz await svctm  %util
    sda        0,00   0,00  0,00 0,00    0,00   0,00     0,00     0,00  0,00  0,00 100,00
    sdb        0,00   0,00  0,00 0,00    0,00   0,00     0,00     0,00  0,00  0,00 100,00

    avg-cpu:  %user  %nice %system %iowait %steal  %idle
               6,68   0,00    0,99    0,00   0,00  92,33

    Device:  rrqm/s wrqm/s   r/s  w/s   rkB/s  wkB/s avgrq-sz avgqu-sz await svctm  %util
    sda        0,00   0,00  0,00 0,00    0,00   0,00     0,00     0,00  0,00  0,00 100,00
    sdb        0,00   0,00  0,00 0,00    0,00   0,00     0,00     0,00  0,00  0,00 100,00

    avg-cpu:  %user  %nice %system %iowait %steal  %idle
               4,40   0,00    0,73    1,47   0,00  93,40

    Device:  rrqm/s wrqm/s   r/s  w/s   rkB/s  wkB/s avgrq-sz avgqu-sz await  svctm  %util
    sda        0,00   0,00  0,00 0,00    0,00   0,00     0,00     0,00  0,00   0,00 100,00
    sdb        0,00   4,00  0,00 3,00    0,00  28,00    18,67     0,06 19,50 333,33 100,00

    Global values for service time and utilization are garbage. For
    interval values, utilization is always 100%, and service time is
    higher than normal.

    I bisected it down to:
    [a9327cac440be4d8333bba975cbbf76045096275] Seperate read and write
    statistics of in_flight requests
    and verified that reverting just that commit indeed solves the issue
    on 2.6.32-rc1."

    So until this is debugged, revert the bad commit.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We cannot delay the first dispatch of the async queue if it
    hasn't dispatched at all, since that could present a local user
    DoS attack vector: an app that just did slow timed sync reads
    while filling memory could keep async IO delayed indefinitely.

    Signed-off-by: Jens Axboe

    Jens Axboe
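
    A minimal sketch of the fix, under the assumption of the 2.6.32-era
    cfq names (the ramp-up that computes "depth" appears under the
    03 Oct entry below):

        /* Never delay an async queue that has not dispatched anything
         * yet: otherwise an app doing slow timed sync reads while
         * filling memory could keep async writeback starved forever.
         */
        if (!depth && !cfqq->dispatched)
                depth = 1;
        if (depth < max_dispatch)
                max_dispatch = depth;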
     

04 Oct, 2009

3 commits


03 Oct, 2009

3 commits

  • This slowly ramps up the async queue depth based on the time
    passed since the sync IO, and doesn't allow async at all until
    a sync slice period has passed.

    Signed-off-by: Jens Axboe

    Jens Axboe
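
    A sketch of the ramp-up, assuming cfq's last_end_sync_rq timestamp
    and cfq_slice[1] (the sync slice period) from the 2.6.32-era code:

        /* The longer it has been since a sync request completed, the
         * deeper the async queue may go; until one full sync slice
         * has passed, depth stays 0 and nothing async is dispatched.
         */
        unsigned long last_sync = jiffies - cfqd->last_end_sync_rq;
        unsigned int depth = last_sync / cfqd->cfq_slice[1];

        if (depth < max_dispatch)
                max_dispatch = depth;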
     
  • o Do not allow more than max_dispatch requests from an async queue if some
    sync request has finished recently. This is in the hope that sync activity
    is still going on in the system and we might receive a sync request soon,
    most likely from a sync queue which just finished a request and on which
    we did not enable idling.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
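
    A sketch of the check; last_end_sync_rq is assumed to record the
    completion time of the most recent sync request:

        /* A sync request finished within the last sync slice: assume
         * sync activity is ongoing and keep the async queue clamped
         * to max_dispatch requests.
         */
        if (!cfq_cfqq_sync(cfqq) && cfqq->dispatched >= max_dispatch &&
            time_before(jiffies,
                        cfqd->last_end_sync_rq + cfqd->cfq_slice[1]))
                return 0;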
     
  • This is basically identical to what Vivek Goyal posted, but combined
    into one and labelled 'desktop' instead of 'fairness'. The goal
    is to continue to improve on the latency side of things as it relates
    to interactiveness, keeping the questionable bits under this sysfs
    tunable so it would be easy for throughput-only people to turn off.

    Apart from adding the interactive sysfs knob, it also adds the
    behavioural change of allowing slice idling even if the hardware
    does tagged command queuing.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

02 Oct, 2009

6 commits

  • Since 2.6.31 now has request-based device-mapper, it's useful to have
    a tracepoint for request-remapping as well as bio-remapping.
    This patch adds a tracepoint for request-remapping, trace_block_rq_remap().

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Cc: Alasdair G Kergon
    Cc: Li Zefan
    Signed-off-by: Jens Axboe

    Jun'ichi Nomura
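
    A sketch of the tracepoint at a request-based dm call site; the
    argument order mirrors the bio-remap tracepoint and should be
    treated as an assumption:

        /* record that the clone on the underlying queue was remapped
         * from the dm device (dev) at the original position */
        trace_block_rq_remap(clone->q, clone, disk_devt(dm_disk(md)),
                             blk_rq_pos(rq));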
     
  • Currently we set the bio size to the byte equivalent of the blocks to
    be trimmed when submitting the initial DISCARD ioctl. That means it
    is subject to the max_hw_sectors limitation of the HBA which is
    much lower than the size of a DISCARD request we can support.
    Add a separate max_discard_sectors tunable to limit the size for discard
    requests.

    We limit the max discard request size in bytes to what fits in 32 bits,
    as that is the limit for bio->bi_size. This could be much larger if we
    had a way to pass that information through the block layer.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
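
    A sketch of a driver advertising the new limit at queue setup time;
    the value is purely illustrative:

        /* allow single DISCARD requests of up to 128 MiB (expressed
         * in 512-byte sectors), independent of max_hw_sectors */
        blk_queue_max_discard_sectors(q, 128 * 1024 * 2);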
     
  • prepare_discard_fn() was being called in a place where memory allocation
    was effectively impossible. This makes it inappropriate for all but
    the most trivial translations of Linux's DISCARD operation to the block
    command set. Additionally, adding a payload there makes the ownership
    of the bio backing unclear, as it is then allocated by the device driver
    and not by the submitter as usual.

    It is replaced with QUEUE_FLAG_DISCARD which is used to indicate whether
    the queue supports discard operations or not. blkdev_issue_discard now
    allocates a one-page, sector-length payload which is the right thing
    for the common ATA and SCSI implementations.

    The mtd implementation of prepare_discard_fn() is replaced with simply
    checking for the request being a discard.

    Largely based on a previous patch from Matthew Wilcox which removed
    prepare_discard_fn() but did not yet add the different payload
    allocation.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
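
    A sketch of both sides of the new interface (the driver context is
    hypothetical):

        /* driver side: declare discard support when setting up q */
        queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);

        /* submitter side: blkdev_issue_discard() checks the flag and
         * allocates the one-page, sector-length payload itself */
        if (!blk_queue_discard(q))
                return -EOPNOTSUPP;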
     
  • Add missing blk_trace_remove_sysfs to be in pair with blk_trace_init_sysfs
    introduced in commit 1d54ad6da9192fed5dd3b60224d9f2dfea0dcd82.
    Release kobject also in case the request_fn is NULL.

    The problem was noticed via a kmemleak backtrace when some sysfs entries
    were not properly destroyed during device removal:

    unreferenced object 0xffff88001aa76640 (size 80):
    comm "lvcreate", pid 2120, jiffies 4294885144
    hex dump (first 32 bytes):
    01 00 00 00 00 00 00 00 f0 65 a7 1a 00 88 ff ff .........e......
    90 66 a7 1a 00 88 ff ff 86 1d 53 81 ff ff ff ff .f........S.....
    backtrace:
    [] kmemleak_alloc+0x26/0x60
    [] kmem_cache_alloc+0x133/0x1c0
    [] sysfs_new_dirent+0x41/0x120
    [] sysfs_add_file_mode+0x3c/0xb0
    [] internal_create_group+0xc1/0x1a0
    [] sysfs_create_group+0x13/0x20
    [] blk_trace_init_sysfs+0x14/0x20
    [] blk_register_queue+0x3c/0xf0
    [] add_disk+0x94/0x160
    [] dm_create+0x598/0x6e0 [dm_mod]
    [] dev_create+0x51/0x350 [dm_mod]
    [] ctl_ioctl+0x1a3/0x240 [dm_mod]
    [] dm_compat_ctl_ioctl+0x12/0x20 [dm_mod]
    [] compat_sys_ioctl+0xcd/0x4f0
    [] sysenter_dispatch+0x7/0x2c
    [] 0xffffffffffffffff

    Signed-off-by: Zdenek Kabelac
    Reviewed-by: Li Zefan
    Signed-off-by: Jens Axboe

    Zdenek Kabelac
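
    A sketch of the now-paired teardown path, assuming
    blk_unregister_queue() as the counterpart of blk_register_queue():

        void blk_unregister_queue(struct gendisk *disk)
        {
                struct request_queue *q = disk->queue;

                if (q->request_fn)
                        elv_unregister_queue(q);

                /* mirror blk_trace_init_sysfs() from registration */
                blk_trace_remove_sysfs(disk_to_dev(disk));
                /* release the kobject even when request_fn is NULL */
                kobject_put(&disk_to_dev(disk)->kobj);
        }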
     
  • Stacking devices do not have an inherent max_hw_sector limit. Set the
    default to INT_MAX so we are bounded only by capabilities of the
    underlying storage.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
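
    The change itself is a one-liner in the stacking defaults
    (blk_set_default_limits() is the helper stacking drivers use):

        /* stacked devices inherit the real limit from the devices
         * below them, so don't impose an artificial one here */
        lim->max_hw_sectors = INT_MAX;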
     
  • The topology changes unintentionally caused SAFE_MAX_SECTORS to be set
    for stacking devices. Set the default limit to BLK_DEF_MAX_SECTORS and
    provide SAFE_MAX_SECTORS in blk_queue_make_request() for legacy hw
    drivers that depend on the old behavior.

    Acked-by: Mike Snitzer
    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
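
    A sketch of the split defaults; names follow the block layer of
    this era:

        /* stacking (bio-based) drivers get the modern default */
        lim->max_sectors = BLK_DEF_MAX_SECTORS;

        /* legacy request_fn drivers keep the conservative value,
         * applied in blk_queue_make_request() */
        blk_queue_max_sectors(q, SAFE_MAX_SECTORS);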
     

20 Sep, 2009

1 commit

  • This allows subsystems to provide devtmpfs with non-default permissions
    for the device node. Instead of the default mode of 0600, null, zero,
    random, urandom, full, tty, ptmx now have a mode of 0666, which allows
    non-privileged processes to access standard device nodes in case no
    other userspace process applies the expected permissions.

    This also fixes a wrong assignment in pktcdvd and a checkpatch.pl complaint.

    Signed-off-by: Kay Sievers
    Signed-off-by: Greg Kroah-Hartman

    Kay Sievers
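
    A sketch of the callback this enables; the class shown is modeled
    on the memory-device class, with details assumed:

        /* let devtmpfs create this class's nodes with mode 0666
         * instead of the default 0600 */
        static char *mem_devnode(struct device *dev, mode_t *mode)
        {
                if (mode)
                        *mode = 0666;
                return NULL;    /* keep the default node name */
        }

        static struct class mem_class = {
                .name    = "mem",
                .devnode = mem_devnode,
        };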
     

16 Sep, 2009

3 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6:
    Driver Core: devtmpfs - kernel-maintained tmpfs-based /dev
    debugfs: Modify default debugfs directory for debugging pktcdvd.
    debugfs: Modified default dir of debugfs for debugging UHCI.
    debugfs: Change debugfs directory of IWMC3200
    debugfs: Change debuhgfs directory of trace-events-sample.h
    debugfs: Fix mount directory of debugfs by default in events.txt
    hpilo: add poll f_op
    hpilo: add interrupt handler
    hpilo: staging for interrupt handling
    driver core: platform_device_add_data(): use kmemdup()
    Driver core: Add support for compatibility classes
    uio: add generic driver for PCI 2.3 devices
    driver-core: move dma-coherent.c from kernel to driver/base
    mem_class: fix bug
    mem_class: use minor as index instead of searching the array
    driver model: constify attribute groups
    UIO: remove 'default n' from Kconfig
    Driver core: Add accessor for device platform data
    Driver core: move dev_get/set_drvdata to drivers/base/dd.c
    Driver core: add new device to bus's list before probing

    Linus Torvalds
     
  • Let attribute group vectors be declared "const". We'd
    like to let most attribute metadata live in read-only
    sections... this is a start.

    Signed-off-by: David Brownell
    Signed-off-by: Greg Kroah-Hartman

    David Brownell
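
    A minimal example of what the constification permits (foo/bar are
    placeholders):

        static struct attribute *foo_attrs[] = {
                &dev_attr_bar.attr,
                NULL,
        };

        /* the group itself can now live in a read-only section */
        static const struct attribute_group foo_group = {
                .attrs = foo_attrs,
        };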
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (46 commits)
    powerpc64: convert to dynamic percpu allocator
    sparc64: use embedding percpu first chunk allocator
    percpu: kill lpage first chunk allocator
    x86,percpu: use embedding for 64bit NUMA and page for 32bit NUMA
    percpu: update embedding first chunk allocator to handle sparse units
    percpu: use group information to allocate vmap areas sparsely
    vmalloc: implement pcpu_get_vm_areas()
    vmalloc: separate out insert_vmalloc_vm()
    percpu: add chunk->base_addr
    percpu: add pcpu_unit_offsets[]
    percpu: introduce pcpu_alloc_info and pcpu_group_info
    percpu: move pcpu_lpage_build_unit_map() and pcpul_lpage_dump_cfg() upward
    percpu: add @align to pcpu_fc_alloc_fn_t
    percpu: make @dyn_size mandatory for pcpu_setup_first_chunk()
    percpu: drop @static_size from first chunk allocators
    percpu: generalize first chunk allocator selection
    percpu: build first chunk allocators selectively
    percpu: rename 4k first chunk allocator to page
    percpu: improve boot messages
    percpu: fix pcpu_reclaim() locking
    ...

    Fix trivial conflict in kernel/sched.c, as per Tejun Heo.

    Linus Torvalds
     

15 Sep, 2009

1 commit

  • * 'for-2.6.32' of git://git.kernel.dk/linux-2.6-block: (29 commits)
    block: use blkdev_issue_discard in blk_ioctl_discard
    Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of reads
    block: don't assume device has a request list backing in nr_requests store
    block: Optimal I/O limit wrapper
    cfq: choose a new next_req when a request is dispatched
    Seperate read and write statistics of in_flight requests
    aoe: end barrier bios with EOPNOTSUPP
    block: trace bio queueing trial only when it occurs
    block: enable rq CPU completion affinity by default
    cfq: fix the log message after dispatched a request
    block: use printk_once
    cciss: memory leak in cciss_init_one()
    splice: update mtime and atime on files
    block: make blk_iopoll_prep_sched() follow normal 0/1 return convention
    cfq-iosched: get rid of must_alloc flag
    block: use interrupts disabled version of raise_softirq_irqoff()
    block: fix comment in blk-iopoll.c
    block: adjust default budget for blk-iopoll
    block: fix long lines in block/blk-iopoll.c
    block: add blk-iopoll, a NAPI like approach for block devices
    ...

    Linus Torvalds
     

14 Sep, 2009

5 commits

  • blk_ioctl_discard duplicates large amounts of code from blkdev_issue_discard,
    the only difference between the two is that blkdev_issue_discard needs to
    send a barrier discard request and blk_ioctl_discard a non-barrier one,
    and blk_ioctl_discard needs to wait on the request. To facilitate this,
    add a flags argument to blkdev_issue_discard to control both aspects of
    the behaviour. This will be very useful later on for using the waiting
    functionality for other callers.

    Based on an earlier patch from Matthew Wilcox.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
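
    A sketch of the two callers after the change; the DISCARD_FL_*
    names follow this patch series but treat them as assumptions:

        /* barrier path (e.g. filesystems): issue, don't wait */
        blkdev_issue_discard(bdev, sector, nr_sects, GFP_KERNEL,
                             DISCARD_FL_BARRIER);

        /* BLKDISCARD ioctl path: non-barrier, but wait for it */
        blkdev_issue_discard(bdev, start, len, GFP_KERNEL,
                             DISCARD_FL_WAIT);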
     
  • Stacked devices do not have a request list backing them. For now, just
    error out with -EINVAL. Later we could make the limit apply on stacked
    devices too, for throttling reasons.

    This fixes

    5a54cd13353bb3b88887604e2c980aa01e314309

    and should go into 2.6.31 stable as well.

    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Jens Axboe
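
    A sketch of the guard in the sysfs store:

        static ssize_t
        queue_requests_store(struct request_queue *q, const char *page,
                             size_t count)
        {
                /* stacked (bio-based) devices have no request list */
                if (!q->request_fn)
                        return -EINVAL;

                /* ... parse and apply the new nr_requests value ... */
                return count;
        }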
     
  • Implement blk_limits_io_opt() and make blk_queue_io_opt() a wrapper
    around it. DM needs this to avoid poking at the queue_limits directly.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Martin K. Petersen
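
    The wrapper is small enough to sketch in full; treat the exact
    signatures as assumptions:

        void blk_limits_io_opt(struct queue_limits *limits, unsigned int opt)
        {
                limits->io_opt = opt;
        }

        void blk_queue_io_opt(struct request_queue *q, unsigned int opt)
        {
                /* DM works with bare queue_limits, so it can call
                 * blk_limits_io_opt() instead of touching q->limits */
                blk_limits_io_opt(&q->limits, opt);
        }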
     
  • This patch addresses http://bugzilla.kernel.org/show_bug.cgi?id=13401, a
    regression introduced in 2.6.30.

    From the bug report:

    Signed-off-by: Jens Axboe

    Jeff Moyer
     
  • Currently, there is a single in_flight counter measuring the number of
    requests in the request_queue. But some monitoring tools would like to
    know how many read requests and write requests are in progress. Split the
    current in_flight counter into two separate counters for read and write.

    This information is exported as a sysfs attribute, as changing the
    currently available stat files would break the existing tools.

    Signed-off-by: Nikanth Karthikesan
    Signed-off-by: Jens Axboe

    Nikanth Karthikesan
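
    A sketch of the split, indexing by data direction; field placement
    is an assumption:

        struct hd_struct {
                /* ... */
                unsigned int in_flight[2];      /* [0] read, [1] write */
        };

        static inline void part_inc_in_flight(struct hd_struct *part, int rw)
        {
                part->in_flight[rw]++;
        }

        /* aggregate view for the existing stat files */
        #define part_in_flight(part) \
                ((part)->in_flight[0] + (part)->in_flight[1])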
     

11 Sep, 2009

16 commits

  • If a BIO is discarded or crosses over the end of the device, the
    BIO queueing trial doesn't occur.

    Originally the trace was called just before make_request:
    [PATCH] Block queue IO tracing support (blktrace) as of 2006-03-23
         2056a782f8e7e65fd4bfd027506b4ce1c5e9ccd4

    Then two patches added checks between the trace and make_request:
    [PATCH] md: check bio address after mapping through partitions
           5ddfe9691c91a244e8d1be597b6428fcefd58103,
    [BLOCK] Don't allow empty barriers to be passed down to
    queues that don't grok them
           51fd77bd9f512ab6cc9df0733ba1caaab89eb957

    This breaks the original goal, so trace the queueing trial only when
    it actually happens.

    Signed-off-by: Minchan Kim
    Acked-by: Wu Fengguang
    Cc: Li Zefan
    Signed-off-by: Jens Axboe

    Minchan Kim
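
    A sketch of the reordering inside __generic_make_request():

        /* checks that can reject the bio come first ... */
        if (bio_check_eod(bio, nr_sectors))
                goto end_io;
        if (bio_rw_flagged(bio, BIO_RW_DISCARD) &&
            !blk_queue_discard(q)) {
                err = -EOPNOTSUPP;
                goto end_io;
        }

        /* ... so the trace fires only for bios actually queued */
        trace_block_bio_queue(q, bio);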
     
  • Use cfq_log_cfqq() instead of cfq_log() so that the blktrace tools
    can show the process id when CFQ dispatches a request.

    Signed-off-by: Shan Wei
    Signed-off-by: Jens Axboe

    Shan Wei
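
    The change in one line; the message text is illustrative:

        /* before: no queue context, hence no pid in the trace */
        cfq_log(cfqd, "dispatched a request");

        /* after: cfq_log_cfqq() tags the message with cfqq's pid */
        cfq_log_cfqq(cfqd, cfqq, "dispatched a request");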
     
  • It's not currently used, as pointed out by Gui Jianfeng. We already
    check the wait_request flag to allow an idling queue access to priority
    allocation, so we don't need this extra flag.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We already have interrupts disabled at that point, so use the
    __raise_softirq_irqoff() variant.

    Signed-off-by: Jens Axboe

    Jens Axboe
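
    The call site in blk-iopoll, sketched:

        /* interrupts are already disabled here, so skip the flag
         * save/restore that raise_softirq() would perform */
        __raise_softirq_irqoff(BLOCK_IOPOLL_SOFTIRQ);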
     
  • Signed-off-by: Jens Axboe

    Jens Axboe
     
  • It's not exported, I doubt we'll have a reason to change this...

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Not sure why they happened in the first place; probably some bad
    terminal setting.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This borrows some code from NAPI and implements a polled completion
    mode for block devices. The idea is the same as NAPI - instead of
    doing the command completion when the irq occurs, schedule a dedicated
    softirq in the hopes that we will complete more IO when the iopoll
    handler is invoked. Devices have a budget of commands assigned, and will
    stay in polled mode as long as they continue to consume their budget
    from the iopoll softirq handler. If they do not, the device is set back
    to interrupt completion mode.

    This patch holds the core bits for blk-iopoll, device driver support
    sold separately.

    Signed-off-by: Jens Axboe

    Jens Axboe
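
    A sketch of what the driver side could look like; every mydrv_*
    name is hypothetical:

        static int mydrv_iopoll(struct blk_iopoll *iop, int budget)
        {
                /* reap up to 'budget' completed commands */
                int done = mydrv_reap_completions(iop, budget);

                if (done < budget) {
                        /* budget not consumed: leave polled mode and
                         * re-enable the device interrupt */
                        blk_iopoll_complete(iop);
                        mydrv_unmask_irq();
                }
                return done;
        }

        /* at init time; a weight/budget of 32 is illustrative */
        blk_iopoll_init(&hba->iopoll, 32, mydrv_iopoll);

        /* in the irq handler: mask the irq, punt to the softirq */
        blk_iopoll_sched(&hba->iopoll);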
     
  • Instead of just checking whether this device uses block layer
    tagging, we can improve the detection by looking at the maximum
    queue depth it has reached. If that crosses 4, then deem it a
    queuing device.

    This is important on high IOPS devices, since plugging hurts
    the performance there (it can be as much as 10-15% of the sys
    time).

    Signed-off-by: Jens Axboe

    Jens Axboe
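
    A sketch of the improved detection; field names are assumptions in
    the spirit of the cfq code:

        /* track the deepest driver queue depth ever reached */
        if (cfqd->rq_in_driver > cfqd->rq_in_driver_peak)
                cfqd->rq_in_driver_peak = cfqd->rq_in_driver;

        /* only a device that has crossed depth 4 is deemed a
         * queuing device, even if it uses block layer tagging */
        cfqd->hw_tag = cfqd->rq_in_driver_peak > 4;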
     
  • Get rid of any functions that test for these bits and make callers
    use bio_rw_flagged() directly. Then it is at least directly apparent
    what variable and flag they check.

    Signed-off-by: Jens Axboe

    Jens Axboe
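
    The pattern after the change, sketched:

        /* before, the wrapper hid which variable and flag were
         * tested: if (bio_sync(bio)) ...
         * after, both are explicit at the call site: */
        if (bio_rw_flagged(bio, BIO_RW_SYNCIO))
                rw_flags |= REQ_RW_SYNC;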
     
  • Whenever a block device changes its read-only attribute,
    notify userspace about it.

    Signed-off-by: Hannes Reinecke
    Signed-off-by: Nikanth Karthikesan
    Signed-off-by: Jens Axboe

    Hannes Reinecke
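
    A sketch of such a notification; the DISK_RO environment string is
    an assumption:

        static void set_disk_ro_uevent(struct gendisk *gd, int ro)
        {
                char event[] = "DISK_RO=1";
                char *envp[] = { event, NULL };

                if (!ro)
                        event[8] = '0';
                kobject_uevent_env(&disk_to_dev(gd)->kobj, KOBJ_CHANGE,
                                   envp);
        }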
     
  • o Get rid of the busy_rt_queues infrastructure. It looks redundant.

    o Once an RT queue gets a request it will preempt any of the BE or IDLE
    queues immediately. Otherwise this queue will be put on the service tree
    and the scheduler will select it before any of the BE or IDLE queues
    anyway. Hence there seems to be no need to keep track of how many busy
    RT queues are currently on the service tree.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • To lessen the impact of async IO on sync IO, let the device drain any
    async IO in progress when switching to a sync cfqq that has idling
    enabled.

    Signed-off-by: Jens Axboe

    Jens Axboe
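
    A sketch of the gate in cfq's dispatch path; the per-direction
    in-driver count is assumed from the 2.6.32-era code:

        /* switching to a sync queue that idles: let async IO already
         * in the driver drain first so it cannot hurt sync latency */
        if (cfq_cfqq_idle_window(cfqq) &&
            cfqd->rq_in_driver[BLK_RW_ASYNC])
                return 0;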
     
  • Update scsi_io_completion() such that it only fails requests up to the
    next error boundary and retries the leftover. This enables the block
    layer to merge requests with different failfast settings and still
    behave correctly on errors. Allow merging of requests with different
    failfast settings.

    As SCSI is currently the only subsystem which follows failfast status,
    there's no need to worry about other block drivers for now.

    Signed-off-by: Tejun Heo
    Cc: Niel Lambrechts
    Cc: James Bottomley
    Signed-off-by: Jens Axboe

    Tejun Heo
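
    A simplified sketch of the idea, using the blk_rq_err_bytes()
    helper from the mixed-merge patch below:

        /* fail only the bytes up to the next error boundary; whatever
         * remains is requeued and retried under normal rules */
        unsigned int err_bytes = blk_rq_err_bytes(req);

        if (blk_end_request(req, -EIO, err_bytes))
                scsi_requeue_command(q, cmd);   /* leftover remains */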
     
  • Failfast has different characteristics from other attributes. When
    issuing, executing and successfully completing requests, failfast
    doesn't make any difference. It only affects how a request is handled
    on failure. Allowing requests with different failfast settings to be
    merged causes normal IOs to fail prematurely, while not allowing merges
    has performance penalties, as failfast is used for readaheads which are
    likely to be located near in-flight or to-be-issued normal IOs.

    This patch introduces the concept of 'mixed merge'. A request is a
    mixed merge if it is merge of segments which require different
    handling on failure. Currently the only mixable attributes are
    failfast ones (or lack thereof).

    When a bio with different failfast settings is added to an existing
    request or requests of different failfast settings are merged, the
    merged request is marked mixed. Each bio carries failfast settings
    and the request always tracks failfast state of the first bio. When
    the request fails, blk_rq_err_bytes() can be used to determine how
    many bytes can be safely failed without crossing into an area which
    requires further retrials.

    This allows request merging regardless of failfast settings while
    keeping the failure handling correct.

    This patch only implements mixed merge but doesn't enable it. The
    next one will update SCSI to make use of mixed merge.

    Signed-off-by: Tejun Heo
    Cc: Niel Lambrechts
    Signed-off-by: Jens Axboe

    Tejun Heo
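
    A sketch of the marking step when two requests merge; helper and
    flag names follow the description above, with details assumed:

        /* requests with different failfast bits may now merge, but
         * the result must remember to handle failures per bio */
        if ((req->cmd_flags & REQ_FAILFAST_MASK) !=
            (next->cmd_flags & REQ_FAILFAST_MASK))
                blk_rq_set_mixed_merge(req);

        /* on error, callers ask how much can safely be failed */
        unsigned int ok_to_fail = blk_rq_err_bytes(req);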
     
  • bio and request use the same set of failfast bits. This patch makes
    the following changes to simplify things.

    * Enumify BIO_RW* bits and reorder them such that BIO_RW_FAILFAST_*
    bits coincide with __REQ_FAILFAST_* bits.

    * The above pushes BIO_RW_AHEAD out of sync with __REQ_FAILFAST_DEV
    but the matching is useless anyway. init_request_from_bio() is
    responsible for setting FAILFAST bits on FS requests and non-FS
    requests never use BIO_RW_AHEAD. Drop the code and comment from
    blk_rq_bio_prep().

    * Define REQ_FAILFAST_MASK which is OR of all FAILFAST bits and
    simplify FAILFAST flags handling in init_request_from_bio().

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
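
    A sketch of the aligned bits and the mask; the ordering is the
    point, the exact values are not:

        enum bio_rw_flags {
                BIO_RW,
                BIO_RW_FAILFAST_DEV,
                BIO_RW_FAILFAST_TRANSPORT,
                BIO_RW_FAILFAST_DRIVER,
                /* must match __REQ_FAILFAST_* bit positions;
                 * remaining BIO_RW_* flags follow */
        };

        #define REQ_FAILFAST_MASK \
                (REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | \
                 REQ_FAILFAST_DRIVER)

        /* init_request_from_bio() can then copy the bits directly */
        req->cmd_flags |= bio->bi_rw & REQ_FAILFAST_MASK;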