07 Oct, 2009

1 commit

  • Commit a9327cac440be4d8333bba975cbbf76045096275 added separate read
    and write statistics of in_flight requests, and exported the number
    of read and write requests in progress separately through sysfs.

    But Corrado Zoccolo reported getting strange output from
    "iostat -kx 2". Global values for service time and utilization were
    garbage. For interval values, utilization was always 100%, and
    service time was higher than normal.

    So this was reverted by commit 0f78ab9899e9d6acb09d5465def618704255963b.

    The problem was in part_round_stats_single(). I missed the following:

            if (now == part->stamp)
                    return;

    -       if (part->in_flight) {
    +       if (part_in_flight(part)) {
                    __part_stat_add(cpu, part, time_in_queue,
                                    part_in_flight(part) * (now - part->stamp));
                    __part_stat_add(cpu, part, io_ticks, (now - part->stamp));

    With this chunk included, the reported regression gets fixed.
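
    The effect of the missing chunk can be illustrated with a minimal
    user-space sketch. It assumes that the split turned in_flight into a
    two-element array (reads, writes); testing such an array directly is
    always true, so io_ticks would grow even on an idle device, which
    would be consistent with the constant 100% utilization reported. The
    struct and the *_sketch names below are hypothetical stand-ins, not
    the kernel's definitions.

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for the per-partition statistics. */
struct part_sketch {
    unsigned long stamp;         /* time of the last accounting update */
    unsigned int in_flight[2];   /* [0] = reads, [1] = writes in progress */
    unsigned long time_in_queue; /* sum of (requests in flight * elapsed time) */
    unsigned long io_ticks;      /* time during which any request was in flight */
};

/* Total number of requests in flight, reads plus writes. */
static unsigned int part_in_flight_sketch(const struct part_sketch *p)
{
    return p->in_flight[0] + p->in_flight[1];
}

/* Account the time elapsed since the last update, then move the stamp. */
static void part_round_stats_sketch(struct part_sketch *p, unsigned long now)
{
    unsigned int busy;

    if (now == p->stamp)
        return;

    /* Test the summed counters, not the array itself. */
    busy = part_in_flight_sketch(p);
    if (busy) {
        p->time_in_queue += busy * (now - p->stamp);
        p->io_ticks += now - p->stamp;
    }
    p->stamp = now;
}
```

    With both counters at zero, a call only advances the stamp; with two
    reads and one write in flight over five ticks, io_ticks grows by 5
    and time_in_queue by 15.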

    Signed-off-by: Nikanth Karthikesan

    --
    Signed-off-by: Jens Axboe

    Nikanth Karthikesan
     

05 Oct, 2009

1 commit

  • This reverts commit a9327cac440be4d8333bba975cbbf76045096275.

    Corrado Zoccolo reports:

    "with 2.6.32-rc1 I started getting the following strange output from
    "iostat -kx 2":
    Linux 2.6.31bisect (et2) 04/10/2009 _i686_ (2 CPU)

    avg-cpu:   %user   %nice %system %iowait  %steal   %idle
               10,70    0,00    3,16   15,75    0,00   70,38

    Device:  rrqm/s  wrqm/s    r/s    w/s    rkB/s   wkB/s  avgrq-sz  avgqu-sz  await        svctm       %util
    sda       18,22    0,00   0,67   0,01    14,77    0,02     43,94      0,01  10,53  39043915,03  2629219,87
    sdb       60,89    9,68  50,79   3,04  1724,43   50,52     65,95      0,70  13,06    488437,47  2629219,87

    avg-cpu:   %user   %nice %system %iowait  %steal   %idle
                2,72    0,00    0,74    0,00    0,00   96,53

    Device:  rrqm/s  wrqm/s    r/s    w/s    rkB/s   wkB/s  avgrq-sz  avgqu-sz  await        svctm       %util
    sda        0,00    0,00   0,00   0,00     0,00    0,00      0,00      0,00   0,00         0,00      100,00
    sdb        0,00    0,00   0,00   0,00     0,00    0,00      0,00      0,00   0,00         0,00      100,00

    avg-cpu:   %user   %nice %system %iowait  %steal   %idle
                6,68    0,00    0,99    0,00    0,00   92,33

    Device:  rrqm/s  wrqm/s    r/s    w/s    rkB/s   wkB/s  avgrq-sz  avgqu-sz  await        svctm       %util
    sda        0,00    0,00   0,00   0,00     0,00    0,00      0,00      0,00   0,00         0,00      100,00
    sdb        0,00    0,00   0,00   0,00     0,00    0,00      0,00      0,00   0,00         0,00      100,00

    avg-cpu:   %user   %nice %system %iowait  %steal   %idle
                4,40    0,00    0,73    1,47    0,00   93,40

    Device:  rrqm/s  wrqm/s    r/s    w/s    rkB/s   wkB/s  avgrq-sz  avgqu-sz  await        svctm       %util
    sda        0,00    0,00   0,00   0,00     0,00    0,00      0,00      0,00   0,00         0,00      100,00
    sdb        0,00    4,00   0,00   3,00     0,00   28,00     18,67      0,06  19,50       333,33      100,00

    Global values for service time and utilization are garbage. For
    interval values, utilization is always 100%, and service time is
    higher than normal.

    I bisected it down to:
    [a9327cac440be4d8333bba975cbbf76045096275] Seperate read and write
    statistics of in_flight requests
    and verified that reverting just that commit indeed solves the issue
    on 2.6.32-rc1."

    So until this is debugged, revert the bad commit.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

22 Sep, 2009

1 commit


16 Sep, 2009

1 commit


14 Sep, 2009

1 commit

  • Currently, there is a single in_flight counter measuring the number of
    requests in the request_queue. But some monitoring tools would like to
    know how many read requests and write requests are in progress. Split the
    current in_flight counter into two separate counters for read and write.

    This information is exported as a sysfs attribute, as changing the
    currently available stat files would break the existing tools.

    Signed-off-by: Nikanth Karthikesan
    Signed-off-by: Jens Axboe

    Nikanth Karthikesan
     

13 Jul, 2009

1 commit

  • git commit f67f129e "Driver core: implement uevent suppress in kobject"
    contains this chunk for fs/partitions/check.c:

            /* suppress uevent if the disk supresses it */
    -       if (!ddev->uevent_suppress)
    +       if (!dev_get_uevent_suppress(pdev))
                    kobject_uevent(&pdev->kobj, KOBJ_ADD);

    However, that should have been:

    -       if (!ddev->uevent_suppress)
    +       if (!dev_get_uevent_suppress(ddev))

    Signed-off-by: Heiko Carstens
    Acked-by: Ming Lei
    Cc: stable
    Signed-off-by: Greg Kroah-Hartman

    Heiko Carstens
     

13 Jun, 2009

1 commit

  • * 'for-2.6.31' of git://git.kernel.org/pub/scm/linux/kernel/git/bart/ide-2.6: (29 commits)
    ide: re-implement ide_pci_init_one() on top of ide_pci_init_two()
    ide: unexport ide_find_dma_mode()
    ide: fix PowerMac bootup oops
    ide: skip probe if there are no devices on the port (v2)
    sl82c105: add printk() logging facility
    ide-tape: fix proc warning
    ide: add IDE_DFLAG_NIEN_QUIRK device flag
    ide: respect quirk_drives[] list on all controllers
    hpt366: enable all quirks for devices on quirk_drives[] list
    hpt366: sync quirk_drives[] list with pdc202xx_{new,old}.c
    ide: remove superfluous SELECT_MASK() call from do_rw_taskfile()
    ide: remove superfluous SELECT_MASK() call from ide_driveid_update()
    icside: remove superfluous ->maskproc method
    ide-tape: fix IDE_AFLAG_* atomic accesses
    ide-tape: change IDE_AFLAG_IGNORE_DSC non-atomically
    pdc202xx_old: kill resetproc() method
    pdc202xx_old: don't call pdc202xx_reset() on IRQ timeout
    pdc202xx_old: use ide_dma_test_irq()
    ide: preserve Host Protected Area by default (v2)
    ide-gd: implement block device ->set_capacity method (v2)
    ...

    Linus Torvalds
     

07 Jun, 2009

2 commits

  • * Add ->set_capacity block device method and use it in rescan_partitions()
    to attempt enabling native capacity of the device upon detecting the
    partition which exceeds device capacity.

    * Add GENHD_FL_NATIVE_CAPACITY flag to limit attempts to enable
    native capacity during partition scan.

    Together with the subsequent patch implementing the ->set_capacity method
    in the ide-gd device driver, this allows automatic disabling of the Host
    Protected Area (HPA) if any partitions overlapping the HPA are detected.

    Cc: Robert Hancock
    Cc: Frans Pop
    Cc: "Andries E. Brouwer"
    Acked-by: Al Viro
    Emphatically-Acked-by: Alan Cox
    Signed-off-by: Bartlomiej Zolnierkiewicz

    Bartlomiej Zolnierkiewicz
     
  • The current warning message only describes the action taken by the
    kernel, without mentioning the underlying reason for it.

    Noticed-by: Robert Hancock
    Cc: Frans Pop
    Cc: "Andries E. Brouwer"
    Cc: Al Viro
    Emphatically-Acked-by: Alan Cox
    Signed-off-by: Bartlomiej Zolnierkiewicz

    Bartlomiej Zolnierkiewicz
     

23 May, 2009

2 commits

  • To support devices with physical block sizes bigger than 512 bytes we
    need to ensure proper alignment. This patch adds support for exposing
    I/O topology characteristics as devices are stacked.

    logical_block_size is the smallest unit the device can address.

    physical_block_size indicates the smallest I/O the device can write
    without incurring a read-modify-write penalty.

    The io_min parameter is the smallest preferred I/O size reported by
    the device. In many cases this is the same as the physical block
    size. However, the io_min parameter can be scaled up when stacking
    (RAID5 chunk size > physical block size).

    The io_opt characteristic indicates the optimal I/O size reported by
    the device. This is usually the stripe width for arrays.

    The alignment_offset parameter indicates the number of bytes the start
    of the device/partition is offset from the device's natural alignment.
    Partition tools and MD/DM utilities can use this to pad their offsets
    so filesystems start on proper boundaries.
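
    As a rough illustration of how a partitioning tool might use these
    values, the sketch below computes how far a start offset is from a
    naturally aligned boundary and how much padding would reach the next
    one. It assumes that a byte address x is aligned when
    (x - alignment_offset) is a multiple of the chosen granularity (for
    example physical_block_size or io_min). The helper names are
    hypothetical and are not part of the kernel interface.

```c
#include <assert.h>

/*
 * Number of bytes by which start_bytes lies past the previous naturally
 * aligned boundary. Assumption: x is aligned when
 * (x - alignment_offset) % granularity == 0.
 */
static unsigned long long misaligned_by(unsigned long long start_bytes,
                                        unsigned int granularity,
                                        unsigned int alignment_offset)
{
    assert(granularity != 0);
    return (start_bytes % granularity + granularity
            - alignment_offset % granularity) % granularity;
}

/* Padding (in bytes) needed to move start_bytes up to the next boundary. */
static unsigned long long pad_to_boundary(unsigned long long start_bytes,
                                          unsigned int granularity,
                                          unsigned int alignment_offset)
{
    unsigned long long off =
        misaligned_by(start_bytes, granularity, alignment_offset);

    return off ? granularity - off : 0;
}
```

    For a hypothetical device with a 4096-byte physical block size and an
    alignment_offset of 3584, a partition starting at sector 63 (byte
    32256) is already aligned, while one starting at sector 64 (byte
    32768) would need 3584 bytes of padding.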

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • Until now we have had a 1:1 mapping between storage device physical
    block size and the logical block size used when addressing the device.
    With SATA 4KB drives coming out that will no longer be the case. The
    sector size will be 4KB but the logical block size will remain
    512 bytes. Hence we need to distinguish between the physical block size
    and the logical block size.

    This patch renames hardsect_size to logical_block_size.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

02 Apr, 2009

1 commit


27 Mar, 2009

1 commit

  • * 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6: (81 commits)
    [S390] remove duplicated #includes
    [S390] cpumask: use mm_cpumask() wrapper
    [S390] cpumask: Use accessors code.
    [S390] cpumask: prepare for iterators to only go to nr_cpu_ids/nr_cpumask_bits.
    [S390] cpumask: remove cpu_coregroup_map
    [S390] fix clock comparator save area usage
    [S390] Add hwcap flag for the etf3 enhancement facility
    [S390] Ensure that ipl panic notifier is called late.
    [S390] fix dfp elf hwcap/facility bit detection
    [S390] smp: perform initial cpu reset before starting a cpu
    [S390] smp: fix memory leak on __cpu_up
    [S390] ipl: Improve checking logic and remove switch defaults.
    [S390] s390dbf: Remove needless check for NULL pointer.
    [S390] s390dbf: Remove redundant initilizations.
    [S390] use kzfree()
    [S390] BUG to BUG_ON changes
    [S390] zfcpdump: Prevent zcore from beeing built as a kernel module.
    [S390] Use csum_partial in checksum.h
    [S390] cleanup lowcore.h
    [S390] eliminate ipl_device from lowcore
    ...

    Linus Torvalds
     

26 Mar, 2009

1 commit

  • The dasd device driver will now support ECKD devices with more than
    65520 cylinders.
    In the traditional ECKD addressing scheme each track is addressed
    by a 16-bit cylinder and 16-bit head number. The new addressing
    scheme makes use of the fact that the actual number of heads is
    never larger than 15, so 12 bits of the head number can be redefined
    to be part of the cylinder address.
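
    The idea can be sketched as follows: the head needs only 4 bits, so
    the remaining 12 bits of the 16-bit head field can carry the high
    bits of a 28-bit cylinder number. The exact bit placement below (high
    cylinder bits in the upper 12 bits of the head field) and all names
    are assumptions made for illustration, not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical on-the-wire track address: two 16-bit fields. */
struct ch_sketch {
    uint16_t cyl;   /* low 16 bits of the cylinder number */
    uint16_t head;  /* upper 12 bits: high cylinder bits, lower 4 bits: head */
};

/* Pack a cylinder of up to 28 bits and a head of up to 4 bits. */
static struct ch_sketch pack_ch(uint32_t cyl, uint8_t head)
{
    struct ch_sketch a;

    assert(head < 16);
    assert(cyl < (1u << 28));
    a.cyl = (uint16_t)(cyl & 0xffffu);
    a.head = (uint16_t)(((cyl >> 16) << 4) | head);
    return a;
}

/* Recover the full cylinder number from both fields. */
static uint32_t unpack_cyl(struct ch_sketch a)
{
    return ((uint32_t)(a.head >> 4) << 16) | a.cyl;
}

/* Recover the head number from the low 4 bits of the head field. */
static uint8_t unpack_head(struct ch_sketch a)
{
    return (uint8_t)(a.head & 0x0fu);
}
```

    For cylinders below 65536 the head field holds just the head number,
    so in this sketch the layout is unchanged for small devices.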

    Signed-off-by: Stefan Weinhuber
    Signed-off-by: Martin Schwidefsky

    Stefan Weinhuber
     

25 Mar, 2009

1 commit

  • This patch implements uevent suppress in kobject and removes it
    from struct device, based on the following ideas:

    1. Uevent sending should be an attribute of the kobject, so suppressing
    it in the kobject layer is more natural than in the device layer. This
    way, we can do it for other objects that embed a kobject.

    2. It may save several bytes for each instance of struct device. (On my
    omap3 (32-bit ARM) based box, it saves 8 bytes per device object.)

    This patch also introduces dev_set|get_uevent_suppress() helpers to
    set and query the uevent_suppress attribute, in case kobject becomes
    a private part of struct device in the future.

    [This version is against Greg's latest driver-core patch set; please
    ignore the previous version.]

    Signed-off-by: Ming Lei
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     

27 Jan, 2009

1 commit


26 Jan, 2009

1 commit

  • Impact: New way of using the blktrace infrastructure

    This drops the requirement of userspace utilities to use the blktrace
    facility.

    Configuration is done through sysfs, adding a "trace" directory to the
    partition directory where blktrace can be enabled for the associated
    request_queue.

    The same filters present in the IOCTL interface are present as sysfs
    device attributes.

    The /sys/block/sdX/sdXN/trace/enable file allows tracing without any
    filters.

    The other files in this directory: pid, act_mask, start_lba and end_lba
    can be used with the same meaning as with the IOCTL interface.

    Using the sysfs interface will only set up the request_queue->blk_trace
    fields; tracing will only take place when the "blk" tracer is selected
    via the ftrace interface, as in the following example:

    To see the trace, one can use the /d/tracing/trace file or the
    /d/tracing/trace_pipe file, with semantics defined in the ftrace
    documentation in Documentation/ftrace.txt.

    [root@f10-1 ~]# cat /t/trace
    kjournald-305 [000] 3046.491224: 8,1 A WBS 6367 + 8 -0 [000] 3046.511914: 8,1 C RS 6367 + 8 [6367]
    [root@f10-1 ~]#

    The default line context (prefix) format is the one described in the ftrace
    documentation, with the blktrace specific bits using its existing format,
    described in blkparse(8).

    If one wants to have the classic blktrace formatting, this is possible by
    using:

    [root@f10-1 ~]# echo blk_classic > /t/trace_options
    [root@f10-1 ~]# cat /t/trace
    8,1 0 3046.491224 305 A WBS 6367 + 8 /t/trace_options
    [root@f10-1 ~]# echo stacktrace > /t/trace_options

    [root@f10-1 ~]# cat /t/trace
    kjournald-305 [000] 3318.826779: 8,1 A WBS 6375 + 8
    Signed-off-by: Ingo Molnar

    Arnaldo Carvalho de Melo
     

10 Jan, 2009

1 commit

  • Neil writes:

    Hi Jens,

    I've found a little bug for you. It was introduced by
    a6f23657d3072bde6844055bbc2290e497f33fbc

    block: add one-hit cache for disk partition lookup

    and has the effect of killing my machine whenever I try to assemble
    an md array :-(
    One of the devices in the array has partitions, and mdadm always
    deletes partitions before putting a whole-device in an array (as it
    can cause confusion). The next IO to that device locks the machine.
    I don't really understand exactly why it locks up, but it happens in
    disk_map_sector_rcu(). This patch fixes it.

    Which is due to a missing clear of the (now) stale partition lookup
    data. So clear that when we delete a partition.

    Signed-off-by: Jens Axboe

    Neil Brown
     

07 Jan, 2009

1 commit


18 Nov, 2008

3 commits


21 Oct, 2008

2 commits


17 Oct, 2008

3 commits

  • With extended devt, finding out the partition number becomes a bit
    more challenging as subtracting the minor number from that of the
    parent device doesn't work anymore. The only thing left is parsing
    the partition name, which is brittle and not exactly universal (some
    have '-' between the device name and partition number while others
    don't). This patch introduces a partition attribute which contains the
    partition number of the device. This should make finding partitions
    and their indexes easier.

    This problem and solution were suggested by H. Peter Anvin.

    Signed-off-by: Tejun Heo
    Cc: H. Peter Anvin
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • We currently blindly trust whatever the partition table claims about
    the disk, and let the kernel create block devices which can not be
    accessed. Trying to identify the device leads to kernel logs full of:
    sdb: rw=0, want=73392, limit=28800
    attempt to access beyond end of device

    Here is an example of a broken partition table, where sdb2 starts
    behind the end of the disk, and sdb3 is larger than the entire disk:
    Disk /dev/sdb: 14 MB, 14745600 bytes
    1 heads, 29 sectors/track, 993 cylinders, total 28800 sectors
    Device     Boot  Start    End  Blocks   Id  System
    /dev/sdb1           29   7800    3886   83  Linux
    /dev/sdb2        37801  45601    3900+  83  Linux
    /dev/sdb3        15602  73402   28900+  83  Linux
    /dev/sdb4        23403  28796    2697   83  Linux

    The kernel creates these completely invalid devices, which can not be
    accessed, or may lead to other unpredictable failures:
    grep . /sys/class/block/sdb*/{start,size}
    /sys/class/block/sdb/size:28800
    /sys/class/block/sdb1/start:29
    /sys/class/block/sdb1/size:7772
    /sys/class/block/sdb2/start:37801
    /sys/class/block/sdb2/size:7801
    /sys/class/block/sdb3/start:15602
    /sys/class/block/sdb3/size:57801
    /sys/class/block/sdb4/start:23403
    /sys/class/block/sdb4/size:5394

    With this patch, we ignore partitions which start behind the end of the disk,
    and limit partitions to the end of the disk if they pretend to be larger:
    grep . /sys/class/block/sdb*/{start,size}
    /sys/class/block/sdb/size:28800
    /sys/class/block/sdb1/start:29
    /sys/class/block/sdb1/size:7772
    /sys/class/block/sdb3/start:15602
    /sys/class/block/sdb3/size:13198
    /sys/class/block/sdb4/start:23403
    /sys/class/block/sdb4/size:5394

    These warnings are printed to the kernel log:
    sdb: p2 ignored, start 37801 is behind the end of the disk
    sdb: p3 size 57801 limited to end of disk
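
    The rule described above can be sketched in a few lines: a partition
    whose start is at or beyond the disk size is ignored, and one that
    extends past the end is truncated. This is a user-space illustration
    with hypothetical names, not the code from the patch. The numbers in
    the usage note are the ones from the example above (28800 sectors).

```c
#include <assert.h>

typedef unsigned long long sector_sketch_t;

enum part_verdict {
    PART_OK,       /* partition fits on the disk as described */
    PART_IGNORED,  /* start lies behind the end of the disk */
    PART_LIMITED   /* size was truncated to the end of the disk */
};

/*
 * Validate one partition table entry against the disk size (in sectors).
 * On PART_LIMITED, *size is rewritten to the truncated value.
 */
static enum part_verdict check_partition(sector_sketch_t disk_size,
                                         sector_sketch_t start,
                                         sector_sketch_t *size)
{
    if (start >= disk_size)
        return PART_IGNORED;

    /* Compare against the remaining space to avoid overflow in start + size. */
    if (*size > disk_size - start) {
        *size = disk_size - start;
        return PART_LIMITED;
    }
    return PART_OK;
}
```

    With these inputs, sdb2 (start 37801) is ignored and sdb3 (start
    15602, size 57801) is limited to 28800 - 15602 = 13198 sectors,
    matching the sysfs output shown above.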

    Signed-off-by: Kay Sievers
    Cc: Herton Ronaldo Krzesinski
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kay Sievers
     
  • I missed this when I did the arm26 removal.

    Reported-by: Robert P. J. Day
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     

09 Oct, 2008

13 commits

  • Check for device resize in the rescan_partitions() routine. If the device
    has been resized, the bdev size is set to match. The rescan_partitions()
    routine is called when opening the device and when calling the
    BLKRRPART ioctl.

    Signed-off-by: Andrew Patterson
    Signed-off-by: Jens Axboe

    Andrew Patterson
     
  • Now that disk and partition handling is mostly unified, it's easy to
    allow a disk to have an extended device number. This patch makes
    add_disk() use an extended device number if disk->minors is zero. Both
    sd and ide-disk are updated to use this.

    * sd_format_disk_name() is implemented which can generically determine
    the drive name. This removes disk number restriction stemming from
    limited device names.

    * If sd index goes over SD_MAX_DISKS (which can be increased now BTW),
    sd simply doesn't initialize minors letting block layer choose
    extended device number.

    * If CONFIG_DEBUG_EXT_DEVT is set, both sd and ide-disk always set
    minors to 0 and use extended device numbers.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • With previous changes, it's meaningless to limit the number of
    partitions. Replace @ext_minors with GENHD_FL_EXT_DEVT such that
    setting the flag allows the disk to have maximum number of allowed
    partitions (only limited by the number of entries in parsed_partitions
    as determined by MAX_PART constant).

    This kills not-too-pretty alloc_disk_ext[_node]() functions and makes
    @minors parameter to alloc_disk[_node]() unnecessary. The parameter
    is left alone to avoid disturbing the users.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • disk->__part used to be statically allocated to the maximum possible
    number of partitions. This patch makes partition array allocation
    dynamic. The added overhead is minimal as only real change is one
    memory dereference changed to RCU one. This saves both a bit of
    memory and cpu cycles iterating through unoccupied slots and makes
    increasing partition limit easier.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Move stats related fields - stamp, in_flight, dkstats - from disk to
    part0 and unify stat handling such that...

    * part_stat_*() now updates part0 together if the specified partition
    is not part0. ie. part_stat_*() are now essentially all_stat_*().

    * {disk|all}_stat_*() are gone.

    * part_round_stats() is updated similarly. It handles part0 stats
    automatically and disk_round_stats() is killed.

    * part_{inc|dec}_in_flight() is implemented which automatically updates
    part0 stats for parts other than part0.

    * disk_map_sector_rcu() is updated to return part0 if no part matches.
    Combined with the above changes, this makes NULL special case
    handling in callers unnecessary.

    * Separate stats show code paths for disk are collapsed into part
    stats show code paths.

    * Rename disk_stat_lock/unlock() to part_stat_lock/unlock()

    While at it, reposition stat handling macros a bit and add missing
    parentheses around macro parameters.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • GENHD_FL_FAIL for disk is what make_it_fail is for parts. Kill it and
    use part0->make_it_fail. Sysfs node handling is unified too.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Till now, bdev->bd_part was set only if the bdev was for parts other
    than part0. This patch makes bdev->bd_part always set so that code
    paths don't have to differentiate common handling.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Move disk->holder_dir to part0->holder_dir. Kill now mostly
    superfluous bdev_get_holder().

    While at it, kill superfluous kobject_get/put() around holder_dir,
    slave_dir and cmd_filter creation and collapse
    disk_sysfs_add_subdirs() into register_disk(). These serve no purpose
    but obfuscating the code.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Move disk->policy to part0->policy. Implement and use get_disk_ro().

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Now that capacity and __dev are moved to part0, part0 and others can
    share the same method.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Move disk->capacity to part0->nr_sects and convert all users who
    directly accessed the field to use {get|set}_capacity(). This is done
    early to allow the __dev field to be moved.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • genhd and partition code handled disk and partitions separately. All
    information about the whole disk was in struct genhd and partitions in
    struct hd_struct. However, the whole disk (part0) and other
    partitions have a lot in common and the data structures end up having
    good number of common fields and thus separate code paths doing the
    same thing. Also, the partition array was indexed by partno - 1 which
    gets pretty confusing at times.

    This patch introduces partition 0 and makes the partition array
    indexed by partno. Following patches will unify the handling of disk
    and parts piece-by-piece.

    This patch also implements disk_partitionable() which tests whether a
    disk is partitionable. With coming dynamic partition array change,
    the most common usage of disk_max_parts() will be testing whether a
    disk is partitionable and the number of max partitions will become
    much less important.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Implement {disk|part}_to_dev() and use them to access generic device
    instead of directly dereferencing {disk|part}->dev. To make sure no
    user is left behind, rename generic devices fields to __dev.

    This is in preparation of unifying partition 0 handling with other
    partitions.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo