17 Jun, 2009

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (64 commits)
    debugfs: use specified mode to possibly mark files read/write only
    debugfs: Fix terminology inconsistency of dir name to mount debugfs filesystem.
    xen: remove driver_data direct access of struct device from more drivers
    usb: gadget: at91_udc: remove driver_data direct access of struct device
    uml: remove driver_data direct access of struct device
    block/ps3: remove driver_data direct access of struct device
    s390: remove driver_data direct access of struct device
    parport: remove driver_data direct access of struct device
    parisc: remove driver_data direct access of struct device
    of_serial: remove driver_data direct access of struct device
    mips: remove driver_data direct access of struct device
    ipmi: remove driver_data direct access of struct device
    infiniband: ehca: remove driver_data direct access of struct device
    ibmvscsi: gadget: at91_udc: remove driver_data direct access of struct device
    hvcs: remove driver_data direct access of struct device
    xen block: remove driver_data direct access of struct device
    thermal: remove driver_data direct access of struct device
    scsi: remove driver_data direct access of struct device
    pcmcia: remove driver_data direct access of struct device
    PCIE: remove driver_data direct access of struct device
    ...

    Manually fix up trivial conflicts due to differing driver_data
    direct-access fixups in drivers/block/{ps3disk.c,ps3vram.c}

    Linus Torvalds
     

16 Jun, 2009

7 commits


12 Jun, 2009

3 commits

  • Fix kernel-doc warnings in recently changed block/ source code.

    Signed-off-by: Randy Dunlap
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • * 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block: (153 commits)
    block: add request clone interface (v2)
    floppy: fix hibernation
    ramdisk: remove long-deprecated "ramdisk=" boot-time parameter
    fs/bio.c: add missing __user annotation
    block: prevent possible io_context->refcount overflow
    Add serial number support for virtio_blk, V4a
    block: Add missing bounce_pfn stacking and fix comments
    Revert "block: Fix bounce limit setting in DM"
    cciss: decode unit attention in SCSI error handling code
    cciss: Remove no longer needed sendcmd reject processing code
    cciss: change SCSI error handling routines to work with interrupts enabled.
    cciss: separate error processing and command retrying code in sendcmd_withirq_core()
    cciss: factor out fix target status processing code from sendcmd functions
    cciss: simplify interface of sendcmd() and sendcmd_withirq()
    cciss: factor out core of sendcmd_withirq() for use by SCSI error handling code
    cciss: Use schedule_timeout_uninterruptible in SCSI error handling code
    block: needs to set the residual length of a bidi request
    Revert "block: implement blkdev_readpages"
    block: Fix bounce limit setting in DM
    Removed reference to non-existing file Documentation/PCI/PCI-DMA-mapping.txt
    ...

    Manually fix conflicts with tracing updates in:
    block/blk-sysfs.c
    drivers/ide/ide-atapi.c
    drivers/ide/ide-cd.c
    drivers/ide/ide-floppy.c
    drivers/ide/ide-tape.c
    include/trace/events/block.h
    kernel/trace/blktrace.c

    Linus Torvalds
     
  • * 'for-2.6.31' of git://git.kernel.org/pub/scm/linux/kernel/git/bart/ide-2.6: (28 commits)
    ide-tape: fix debug call
    alim15x3: Remove historical hacks, re-enable init_hwif for PowerPC
    ide-dma: don't reset request fields on dma_timeout_retry()
    ide: drop rq->data handling from ide_map_sg()
    ide-atapi: kill unused fields and callbacks
    ide-tape: simplify read/write functions
    ide-tape: use byte size instead of sectors on rw issue functions
    ide-tape: unify r/w init paths
    ide-tape: kill idetape_bh
    ide-tape: use standard data transfer mechanism
    ide-tape: use single continuous buffer
    ide-atapi,tape,floppy: allow ->pc_callback() to change rq->data_len
    ide-tape,floppy: fix failed command completion after request sense
    ide-pm: don't abuse rq->data
    ide-cd,atapi: use bio for internal commands
    ide-atapi: convert ide-{floppy,tape} to using preallocated sense buffer
    ide-cd: convert to using generic sense request
    ide: add helpers for preparing sense requests
    ide-cd: don't abuse rq->buffer
    ide-atapi: don't abuse rq->buffer
    ...

    Linus Torvalds
     

11 Jun, 2009

3 commits

  • This patch adds the following 2 interfaces for request-stacking drivers:

    - blk_rq_prep_clone(struct request *clone, struct request *orig,
                        struct bio_set *bs, gfp_t gfp_mask,
                        int (*bio_ctr)(struct bio *, struct bio *, void *),
                        void *data)
      * Clones the bios in the original request into the clone request
        (bio_ctr is called for each cloned bio).
      * Copies attributes of the original request to the clone request.
        The actual data parts (e.g. ->cmd, ->buffer, ->sense) are not
        copied.

    - blk_rq_unprep_clone(struct request *clone)
      * Frees cloned bios from the clone request.

    Request stacking drivers (e.g. request-based dm) need to make a clone
    request for a submitted request and dispatch it to other devices.

    To allocate a request for the clone, request stacking drivers may not
    be able to use blk_get_request() because the allocation may be done
    in an irq-disabled context.
    So blk_rq_prep_clone() takes a request already allocated by the caller
    as an argument.

    For each clone bio in the clone request, request stacking drivers
    should be able to set up their own completion handler.
    So blk_rq_prep_clone() takes a callback function which is called
    for each clone bio, and a pointer for private data which is passed
    to the callback.

    NOTE:
    blk_rq_prep_clone() doesn't copy any actual data of the original
    request. Pages are shared between the original bios and the cloned
    bios. So the caller must not complete the original request before
    the clone request.

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Cc: Boaz Harrosh
    Signed-off-by: Jens Axboe

    Kiyoshi Ueda
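
    A hedged usage sketch of the interface described above; apart from
    blk_rq_prep_clone()/blk_rq_unprep_clone(), the names here (my_bio_ctr,
    my_prep_clone, the GFP_ATOMIC choice) are hypothetical:

    #include <linux/blkdev.h>
    #include <linux/bio.h>

    /* Hypothetical per-bio constructor; called once for every cloned bio. */
    static int my_bio_ctr(struct bio *clone_bio, struct bio *orig_bio, void *data)
    {
            clone_bio->bi_private = data;   /* stash driver-private context */
            return 0;
    }

    /*
     * Hypothetical clone path of a request-stacking driver.  'clone' must be
     * allocated by the caller, since blk_get_request() may not be usable in
     * an irq-disabled context.
     */
    static int my_prep_clone(struct request *clone, struct request *orig,
                             struct bio_set *bs, void *ctx)
    {
            int ret;

            ret = blk_rq_prep_clone(clone, orig, bs, GFP_ATOMIC,
                                    my_bio_ctr, ctx);
            if (ret)
                    return ret;

            /*
             * ...dispatch 'clone' to the underlying device...  If dispatch
             * fails, blk_rq_unprep_clone(clone) frees the cloned bios.  The
             * original request must not be completed before the clone.
             */
            return 0;
    }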
     
  • * 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (244 commits)
    Revert "x86, bts: reenable ptrace branch trace support"
    tracing: do not translate event helper macros in print format
    ftrace/documentation: fix typo in function grapher name
    tracing/events: convert block trace points to TRACE_EVENT(), fix !CONFIG_BLOCK
    tracing: add protection around module events unload
    tracing: add trace_seq_vprint interface
    tracing: fix the block trace points print size
    tracing/events: convert block trace points to TRACE_EVENT()
    ring-buffer: fix ret in rb_add_time_stamp
    ring-buffer: pass in lockdep class key for reader_lock
    tracing: add annotation to what type of stack trace is recorded
    tracing: fix multiple use of __print_flags and __print_symbolic
    tracing/events: fix output format of user stack
    tracing/events: fix output format of kernel stack
    tracing/trace_stack: fix the number of entries in the header
    ring-buffer: discard timestamps that are at the start of the buffer
    ring-buffer: try to discard unneeded timestamps
    ring-buffer: fix bug in ring_buffer_discard_commit
    ftrace: do not profile functions when disabled
    tracing: make trace pipe recognize latency format flag
    ...

    Linus Torvalds
     
    Currently io_context has an atomic_t (32-bit) as its refcount. In the
    case of cfq, a reference to the io_context is taken for each device
    against which a task does I/O. And when multiple processes share an
    io_context (CLONE_IO), each of them also holds a reference to the same
    io_context.

    Theoretically the possible maximum number of processes sharing the same
    io_context + the number of disks/cfq_data referring to the same io_context
    can overflow the 32-bit counter on a very high-end machine.

    Even though it is an improbable case, let us make it atomic_long_t.

    Signed-off-by: Nikanth Karthikesan
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Nikanth Karthikesan
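
    A minimal sketch of the type change; the struct and helper names here
    are illustrative, not the exact kernel definitions:

    #include <asm/atomic.h>

    struct io_context_sketch {
            atomic_long_t refcount;   /* was atomic_t; 32-bit could overflow */
            /* ... */
    };

    static inline void ioc_get(struct io_context_sketch *ioc)
    {
            atomic_long_inc(&ioc->refcount);
    }

    static inline int ioc_put(struct io_context_sketch *ioc)
    {
            /* returns true when the last reference is dropped */
            return atomic_long_dec_and_test(&ioc->refcount);
    }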
     

10 Jun, 2009

1 commit

  • TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
    these new capabilities to this tracepoint:

    - zero-copy and per-cpu splice() tracing
    - binary tracing without printf overhead
    - structured logging records exposed under /debug/tracing/events
    - trace events embedded in function tracer output and other plugins
    - user-defined, per tracepoint filter expressions
    ...

    Cons:

    - no dev_t info for the output of the plug, unplug_timer and unplug_io
      events;
      no dev_t info for getrq and sleeprq events if bio == NULL;
      no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.

    This is mainly because we can't get the device from a request queue.
    But this may change in the future.

    - A packet command is converted to a string in TP_assign, not TP_print,
      while blktrace does the conversion just before output.

    Since pc requests should be rather rare, this is not a big issue.

    - In blktrace, an event can have 2 different print formats, but a TRACE_EVENT
    has a unique format, which means we have some unused data in a trace entry.

    The overhead is minimized by using __dynamic_array() instead of __array().

    I've benchmarked the ioctl blktrace vs the splice based TRACE_EVENT tracing:

           dd                  dd + ioctl blktrace    dd + TRACE_EVENT (splice)
    1      7.36s, 42.7 MB/s    7.50s, 42.0 MB/s       7.41s, 42.5 MB/s
    2      7.43s, 42.3 MB/s    7.48s, 42.1 MB/s       7.43s, 42.4 MB/s
    3      7.38s, 42.6 MB/s    7.45s, 42.2 MB/s       7.41s, 42.5 MB/s

    So the overhead of tracing is very small, and there is no regression
    when using these trace events vs blktrace.

    And the binary output of TRACE_EVENT is much smaller than blktrace:

    # ls -l -h
    -rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0
    -rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1
    -rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out

    Following are some comparisons between TRACE_EVENT and blktrace:

    plug:
    kjournald-480 [000] 303.084981: block_plug: [kjournald]
    kjournald-480 [000] 303.084981: 8,0 P N [kjournald]

    unplug_io:
    kblockd/0-118 [000] 300.052973: block_unplug_io: [kblockd/0] 1
    kblockd/0-118 [000] 300.052974: 8,0 U N [kblockd/0] 1

    remap:
    kjournald-480 [000] 303.085042: block_remap: 8,0 W 102736992 + 8

    Changelog from v2 -> v3:

    - use the newly introduced __dynamic_array().

    Changelog from v1 -> v2:

    - use __string() instead of __array() to minimize the memory required
    to store the hex dump of rq->cmd.

    - support large pc requests.

    - add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT.

    - some cleanups.

    Signed-off-by: Li Zefan
    LKML-Reference:
    Signed-off-by: Steven Rostedt

    Li Zefan
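
    For orientation, a simplified sketch of a TRACE_EVENT() definition in
    the style these block events use (roughly the block_plug case above;
    not a verbatim copy of include/trace/events/block.h):

    /* In a trace events header, simplified: */
    TRACE_EVENT(block_plug,

            TP_PROTO(struct request_queue *q),

            TP_ARGS(q),

            TP_STRUCT__entry(
                    __array(char, comm, TASK_COMM_LEN)
            ),

            TP_fast_assign(
                    memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
            ),

            /* yields the "[kjournald]" style output shown above */
            TP_printk("[%s]", __entry->comm)
    );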
     

09 Jun, 2009

4 commits


03 Jun, 2009

1 commit

    blk_queue_bounce_limit() is more than a wrapper around the request
    queue's limits.bounce_pfn variable. Introduce blk_queue_bounce_pfn(), which can
    be called by stacking drivers that wish to set the bounce limit
    explicitly.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
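
    A hedged example of the new helper in use by a stacking driver; the
    function name my_stacking_setup and the BLK_BOUNCE_ANY value are just
    illustrative choices:

    #include <linux/blkdev.h>

    static void my_stacking_setup(struct request_queue *q)
    {
            /*
             * Set the bounce boundary directly, bypassing the DMA-mask
             * heuristics that blk_queue_bounce_limit() applies.
             */
            blk_queue_bounce_pfn(q, BLK_BOUNCE_ANY);
    }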
     

02 Jun, 2009

1 commit

    I found one more mis-conversion to the 'request is always dequeued
    when completing' model in elv_abort_queue() during code inspection.
    Although I haven't hit any problem caused by this mis-conversion yet,
    and have only done compile/boot testing, please apply if you see no
    problem with it.

    Request must be dequeued when it completes.
    However, elv_abort_queue() completes requests without dequeueing.
    This will cause oops in the __blk_end_request_all().
    This patch fixes the oops.

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Jens Axboe

    Kiyoshi Ueda
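
    The shape of the fix, as a sketch rather than the verbatim patch:
    start (and thereby dequeue) each request before ending it.

    /* In elv_abort_queue(), for each request still on the queue: */
    rq->cmd_flags |= REQ_QUIET;
    trace_block_rq_abort(q, rq);
    /*
     * Mark the request started so __blk_end_request_all() does not
     * trip over a still-queued request.
     */
    blk_start_request(rq);
    __blk_end_request_all(rq, -EIO);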
     

30 May, 2009

1 commit

    Doing a bit of torture testing, I ran across a BUG in the block
    subsystem (at blk-core.c:2048): the test for whether the request is
    still queued.

    It turns out the trigger was a BLKPREP_KILL coming out of the SCSI prep
    function. Currently for BLKPREP_KILL requests, we send them straight
    into __blk_end_request_all() with an error, but they've never been
    dequeued, so they trip the bug. Fix this by starting requests before
    killing them.

    Signed-off-by: James Bottomley
    Signed-off-by: Jens Axboe

    James Bottomley
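
    Again as a sketch, not the exact diff: when the prep function returns
    BLKPREP_KILL, the request is started first so it is no longer queued
    by the time it is ended with an error.

    /* In the request peek/prep path, on BLKPREP_KILL: */
    rq->cmd_flags |= REQ_QUIET;
    blk_start_request(rq);                  /* dequeue before killing */
    __blk_end_request_all(rq, -EIO);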
     

28 May, 2009

1 commit


27 May, 2009

2 commits

    The commit below in 2.6-block/for-2.6.31 causes a "no diskstat" problem,
    because the blk_discard_rq() check was added with '&&'.
    It should be 'blk_fs_request() || blk_discard_rq()'.
    This patch makes that change and fixes the missing diskstat entries.
    Please review and apply.

    ------ /proc/diskstat without this patch -------------------------------------
    8 0 sda 0 0 0 0 0 0 0 0 0 0 0
    ------------------------------------------------------------------------------

    ----- /proc/diskstat with this patch applied ---------------------------------
    8 0 sda 4186 303 373621 61600 9578 3859 107468 169479 2 89755 231059
    ------------------------------------------------------------------------------

    --------------------------------------------------------------------------
    commit c69d48540c201394d08cb4d48b905e001313d9b8
    Author: Jens Axboe
    Date: Fri Apr 24 08:12:19 2009 +0200

    block: include discard requests in IO accounting

    We currently don't do merging on discard requests, but we potentially
    could. If we do, then we need to include discard requests in the IO
    accounting, or merging would end up decrementing in_flight IO counters
    for an IO which never incremented them.

    So enable accounting for discard requests.

    static inline int blk_do_io_stat(struct request *rq)
    {
    -       return rq->rq_disk && blk_rq_io_stat(rq) && blk_fs_request(rq);
    +       return rq->rq_disk && blk_rq_io_stat(rq) && blk_fs_request(rq) &&
    +               blk_discard_rq(rq);
    }
    --------------------------------------------------------------------------

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Jens Axboe

    Kiyoshi Ueda
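
    The corrected helper, per the description above (a sketch of the
    result after the '||' fix):

    static inline int blk_do_io_stat(struct request *rq)
    {
            return rq->rq_disk && blk_rq_io_stat(rq) &&
                    (blk_fs_request(rq) || blk_discard_rq(rq));
    }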
     
  • commit e8939a50466fd963eb1ba9118c34b9ffb7ff6aa6
    Author: Tejun Heo
    Date: Fri May 8 11:54:16 2009 +0900

    block: implement and enforce request peek/start/fetch

    Added a BUG_ON(blk_queued_rq(req)) to the top of blk_finish_req().
    Unfortunately, this checks whether req->queuelist is empty. This list
    is doing double duty both as the queue list and the tag list, so tagged
    requests come in here with this not empty and boom (the tag list is
    emptied by blk_queue_end_tag() lower down).

    Fix this by moving the BUG_ON() to below the end-tag handling. We also
    seem vulnerable to this in blk_requeue_request(). I think all uses of
    blk_queued_rq() need auditing because the check is clearly wrong in
    the tagged case.

    Signed-off-by: James Bottomley
    Signed-off-by: Jens Axboe

    James Bottomley
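
    A simplified sketch of the reordering described above (not the
    verbatim blk_finish_request()):

    static void blk_finish_request(struct request *req, int error)
    {
            /*
             * Release the tag first: for tagged requests ->queuelist doubles
             * as the tag list and is only emptied by blk_queue_end_tag().
             */
            if (blk_rq_tagged(req))
                    blk_queue_end_tag(req->q, req);

            /* Only now is it safe to assert the request is off the queue. */
            BUG_ON(blk_queued_rq(req));

            /* ... accounting and completion follow ... */
    }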
     

23 May, 2009

6 commits

  • To support devices with physical block sizes bigger than 512 bytes we
    need to ensure proper alignment. This patch adds support for exposing
    I/O topology characteristics as devices are stacked.

    logical_block_size is the smallest unit the device can address.

    physical_block_size indicates the smallest I/O the device can write
    without incurring a read-modify-write penalty.

    The io_min parameter is the smallest preferred I/O size reported by
    the device. In many cases this is the same as the physical block
    size. However, the io_min parameter can be scaled up when stacking
    (RAID5 chunk size > physical block size).

    The io_opt characteristic indicates the optimal I/O size reported by
    the device. This is usually the stripe width for arrays.

    The alignment_offset parameter indicates the number of bytes the start
    of the device/partition is offset from the device's natural alignment.
    Partition tools and MD/DM utilities can use this to pad their offsets
    so filesystems start on proper boundaries.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
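
    A hedged example of a low-level driver exporting these characteristics
    through the corresponding queue setters; the concrete values are made
    up for a 4KB-physical/512-byte-logical disk:

    #include <linux/blkdev.h>

    static void my_disk_set_topology(struct request_queue *q)
    {
            blk_queue_logical_block_size(q, 512);    /* smallest addressable unit */
            blk_queue_physical_block_size(q, 4096);  /* no RMW penalty below this */
            blk_queue_io_min(q, 4096);               /* smallest preferred I/O */
            blk_queue_io_opt(q, 64 * 1024);          /* e.g. stripe width */
            blk_queue_alignment_offset(q, 0);        /* starts on a natural boundary */
    }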
     
  • Currently stacking devices do not have a queue directory in sysfs.
    However, many of the I/O characteristics like sector size, maximum
    request size, etc. are queue properties.

    This patch enables the queue directory for MD/DM devices. The elevator
    code has been modified to deal with queues that do not have an I/O
    scheduler.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • To accommodate stacking drivers that do not have an associated request
    queue we're moving the limits to a separate, embedded structure.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • Convert all external users of queue limits to using wrapper functions
    instead of poking the request queue variables directly.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
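
    Illustrative before/after of the conversion (a sketch;
    queue_max_sectors() stands in for the family of accessors):

    /* Before: external code poked the request queue fields directly. */
    max_sectors = q->max_sectors;

    /* After: go through the wrapper, so the limits can move into a
     * separate queue_limits structure without touching callers again. */
    max_sectors = queue_max_sectors(q);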
     
    Until now we have had a 1:1 mapping between the storage device's
    physical block size and the logical block size used when addressing
    the device. With SATA 4KB drives coming out that will no longer be
    the case. The sector size will be 4KB but the logical block size will
    remain 512 bytes. Hence we need to distinguish between the physical
    block size and the logical one.

    This patch renames hardsect_size to logical_block_size.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
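
    In driver code the rename looks roughly like this (hedged
    before/after; 512 is just an example value):

    /* Before the rename: */
    blk_queue_hardsect_size(q, 512);

    /* After the rename, same semantics with a clearer name: */
    blk_queue_logical_block_size(q, 512);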
     
  • Conflicts:
    drivers/block/hd.c
    drivers/block/mg_disk.c

    Signed-off-by: Jens Axboe

    Jens Axboe
     

20 May, 2009

2 commits


19 May, 2009

4 commits

    OSD was the last in-tree user of blk_rq_append_bio(). Now
    that it is fixed, blk_rq_append_bio() is un-exported and
    is only used internally by the block layer.

    Signed-off-by: Boaz Harrosh
    Signed-off-by: Jens Axboe

    Boaz Harrosh
     
    New block API:
    given a struct bio, allocate a new request. This is the parallel of
    generic_make_request for users of BLOCK_PC commands.

    The passed bio may be a chained bio. The bio is bounced if needed
    inside this call.

    This is part of the effort to un-export blk_rq_append_bio().

    Signed-off-by: Boaz Harrosh
    CC: Jeff Garzik
    Signed-off-by: Jens Axboe

    Boaz Harrosh
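
    Assuming the helper referred to here is blk_make_request() as merged
    for 2.6.31, a hedged usage sketch for a BLOCK_PC submitter (the
    submit_pc_bio wrapper and its error handling are illustrative):

    #include <linux/blkdev.h>
    #include <linux/err.h>

    static int submit_pc_bio(struct request_queue *q, struct bio *bio)
    {
            struct request *rq;

            rq = blk_make_request(q, bio, GFP_KERNEL);  /* bounces bio if needed */
            if (IS_ERR(rq))
                    return PTR_ERR(rq);

            rq->cmd_type = REQ_TYPE_BLOCK_PC;
            blk_execute_rq(q, NULL, rq, 0);             /* synchronous execution */
            blk_put_request(rq);
            return 0;
    }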
     
  • Use blk_rq_append_bio() internally instead of blk_rq_bio_prep()
    so blk_rq_map_kern can be called multiple times, to map multiple
    buffers.

    This is in the effort to un-export blk_rq_append_bio()

    Signed-off-by: James Bottomley
    Signed-off-by: Boaz Harrosh
    Signed-off-by: Jens Axboe

    James Bottomley
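
    A hedged sketch of what this enables, mapping two kernel buffers into
    a single request with consecutive blk_rq_map_kern() calls (the wrapper
    and buffer names are illustrative):

    #include <linux/blkdev.h>

    static int map_two_buffers(struct request_queue *q, struct request *rq,
                               void *buf1, unsigned int len1,
                               void *buf2, unsigned int len2)
    {
            int ret;

            ret = blk_rq_map_kern(q, rq, buf1, len1, GFP_KERNEL);
            if (ret)
                    return ret;

            /* A second call now appends another bio to the same request. */
            return blk_rq_map_kern(q, rq, buf2, len2, GFP_KERNEL);
    }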
     
    In commit c3a4d78c580de4edc9ef0f7c59812fb02ceb037f, while introducing
    rq->resid_len, the default value of the residue count was changed from
    the full count to zero. The conversion was done under the assumption
    that the residue count isn't defined when a request fails. However,
    Boaz and James pointed out that this wasn't true and that the residue
    count should be preserved for failed requests too.

    This patchset restores the original behavior by setting rq->resid_len
    to blk_rq_bytes(rq) on request start and restoring explicit clearing
    in affected drivers. While at it, take advantage of the fact that
    rq->resid_len is set to full count where applicable.

    * ide-cd: rq->resid_len cleared on pc success

    * mptsas: req->resid_len cleared on success

    * sas_expander: rsp/req->resid_len cleared on success

    * mpt2sas_transport: req->resid_len cleared on success

    * ide-cd, ide-tape, mptsas, sas_host_smp, mpt2sas_transport, ub: take
    advantage of initial full count to simplify code

    Boaz Harrosh spotted a bug in the resid_len initialization. Fixed as
    suggested.

    Signed-off-by: Tejun Heo
    Acked-by: Borislav Petkov
    Cc: Boaz Harrosh
    Cc: James Bottomley
    Cc: Pete Zaitcev
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Sergei Shtylyov
    Cc: Eric Moore
    Cc: Darrick J. Wong
    Signed-off-by: Jens Axboe

    Tejun Heo
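
    The core of the restored behaviour, in fragment form (a sketch, not
    the verbatim patch):

    /* Block core, when the request is started: */
    rq->resid_len = blk_rq_bytes(rq);       /* default residue = full count */

    /* Driver's successful-completion path: */
    rq->resid_len = 0;                      /* everything was transferred */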
     

18 May, 2009

1 commit


12 May, 2009

1 commit

    The current bio_vec array index out-of-bounds test within
    __end_that_request_first() does not seem correct.
    It checks bio->bi_idx against bio->bi_vcnt, but the subsequent code
    uses idx (which is bio->bi_idx + next_idx) as the array index into the
    bio_vec array. This means that the test really makes sense only on
    the first iteration of the !(nr_bytes >= bio->bi_size) case (when
    next_idx == zero). Fix this by replacing bio->bi_idx with idx.
    (This patch applies to 2.6.30-rc4.)

    Signed-off-by: Kazuhisa Ichikawa
    Signed-off-by: Jens Axboe

    Kazuhisa Ichikawa
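
    A sketch of the fix (not the verbatim patch): bound-check the index
    that the subsequent code actually uses.

    /* Inside the completion loop of __end_that_request_first(): */
    int idx = bio->bi_idx + next_idx;

    /* Test the index used below, not bio->bi_idx alone: */
    if (unlikely(idx >= bio->bi_vcnt)) {
            /* error: index out of range for this bio */
            break;
    }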
     

11 May, 2009

1 commit