16 May, 2007

1 commit


11 May, 2007

1 commit

  • When stacked block devices are in use (e.g. md or dm), the recursive calls
    to generic_make_request can use up a lot of stack space, and we would
    rather they didn't.

    As generic_make_request is a void function, and as it is generally not
    expected that it will have any effect immediately, it is safe to delay any
    call to generic_make_request until there is sufficient stack space
    available.

    As ->bi_next is reserved for the driver to use, it can have no valid value
    when generic_make_request is called, and as __make_request implicitly
    assumes it will be NULL (ELEVATOR_BACK_MERGE fork of switch) we can be
    certain that all callers set it to NULL. We can therefore safely use
    bi_next to link pending requests together, providing we clear it before
    making the real call.

    So, we choose to allow each thread to only be active in one
    generic_make_request at a time. If a subsequent (recursive) call is made,
    the bio is linked into a per-thread list, and is handled when the active
    call completes.

    As the list of pending bios is per-thread, there are no locking issues to
    worry about.
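
    A minimal user-space sketch of the scheme described above, not the patch
    itself: struct bio is reduced to two fields, and submit_sketch(),
    make_request_sketch(), levels_below and the counters are hypothetical names
    introduced only for illustration. A recursive submission is chained through
    bi_next onto a thread-local list and handled once the active call returns,
    so the nesting depth stays at one however many devices are stacked.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bio, reduced to what the sketch needs. */
struct bio {
    struct bio *bi_next;    /* NULL on entry; reused to chain pending bios */
    int levels_below;       /* stacked devices still to pass through */
};

/* Per-thread state, so no locking is needed (thread-local in this sketch). */
static _Thread_local struct bio *pending_head;
static _Thread_local struct bio **pending_tail;
static _Thread_local int active;

/* Instrumentation for the sketch only. */
static _Thread_local int calls, nesting, max_nesting;

static void submit_sketch(struct bio *bio);

/* Models a stacked driver's make_request_fn: remap, then resubmit. */
static void make_request_sketch(struct bio *bio)
{
    assert(bio->bi_next == NULL);   /* the invariant the scheme relies on */
    calls++;
    if (++nesting > max_nesting)
        max_nesting = nesting;
    if (bio->levels_below > 0) {
        bio->levels_below--;
        submit_sketch(bio);         /* would recurse without the list */
    }
    nesting--;
}

static void submit_sketch(struct bio *bio)
{
    if (active) {
        /* Recursive call: queue the bio at the tail and return at once. */
        bio->bi_next = NULL;
        *pending_tail = bio;
        pending_tail = &bio->bi_next;
        return;
    }

    active = 1;
    pending_head = NULL;
    pending_tail = &pending_head;
    do {
        bio->bi_next = NULL;        /* clear the link before the real call */
        make_request_sketch(bio);
        bio = pending_head;         /* take the oldest pending bio, if any */
        if (bio) {
            pending_head = bio->bi_next;
            if (!pending_head)
                pending_tail = &pending_head;
        }
    } while (bio);
    active = 0;
}
```

    Pending bios are taken from the head and appended at the tail, so in this
    sketch they are processed in the order in which they were submitted.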

    I say above that it is "safe to delay any call...". There are, however,
    some behaviours of a make_request_fn which would make it unsafe. These
    include any behaviour that assumes anything will have changed after a
    recursive call to generic_make_request.

    These could include:
    - waiting for that call to finish and call its bi_end_io function.
      md used to sometimes do this (marking the superblock dirty before
      completing a write) but doesn't any more.
    - inspecting the bio for fields that generic_make_request might
      change, such as bi_sector or bi_bdev. It is hard to see a good
      reason for this, and I don't think anyone actually does it.
    - inspecting the queue to see if, e.g., it is 'full' yet. Again, I
      think this is very unlikely to be useful, or to be done.

    Signed-off-by: Neil Brown
    Cc: Jens Axboe
    Cc:

    Alasdair G Kergon said:

    I can see nothing wrong with this in principle.

    For device-mapper at the moment though it's essential that, while the bio
    mappings may now get delayed, they still get processed in exactly
    the same order as they were passed to generic_make_request().

    My main concern is whether the timing changes implicit in this patch
    will make the rare data-corrupting races in the existing snapshot code
    more likely. (I'm working on a fix for these races, but the unfinished
    patch is already several hundred lines long.)

    It would be helpful if some people on this mailing list would test
    this patch in various scenarios and report back.

    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Neil Brown
     

10 May, 2007

4 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (25 commits)
    sound: convert "sound" subdirectory to UTF-8
    MAINTAINERS: Add cxacru website/mailing list
    include files: convert "include" subdirectory to UTF-8
    general: convert "kernel" subdirectory to UTF-8
    documentation: convert the Documentation directory to UTF-8
    Convert the toplevel files CREDITS and MAINTAINERS to UTF-8.
    remove broken URLs from net drivers' output
    Magic number prefix consistency change to Documentation/magic-number.txt
    trivial: s/i_sem /i_mutex/
    fix file specification in comments
    drivers/base/platform.c: fix small typo in doc
    misc doc and kconfig typos
    Remove obsolete fat_cvf help text
    Fix occurrences of "the the "
    Fix minor typoes in kernel/module.c
    Kconfig: Remove reference to external mqueue library
    Kconfig: A couple of grammatical fixes in arch/i386/Kconfig
    Correct comments in genrtc.c to refer to correct /proc file.
    Fix more "deprecated" spellos.
    Fix "deprecated" typoes.
    ...

    Fix trivial comment conflict in kernel/relay.c.

    Linus Torvalds
     
  • Since nonboot CPUs are now disabled after tasks and devices have been
    frozen and the CPU hotplug infrastructure is used for this purpose, we need
    special CPU hotplug notifications that will help the CPU-hotplug-aware
    subsystems distinguish normal CPU hotplug events from CPU hotplug events
    related to a system-wide suspend or resume operation in progress. This
    patch introduces such notifications and causes them to be used during
    suspend and resume transitions. It also changes all of the
    CPU-hotplug-aware subsystems to take these notifications into consideration
    (for now they are handled in the same way as the corresponding "normal"
    ones).

    [oleg@tv-sign.ru: cleanups]
    Signed-off-by: Rafael J. Wysocki
    Cc: Gautham R Shenoy
    Cc: Pavel Machek
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • flush_work(wq, work) doesn't need the first parameter; we can use cwq->wq
    (this was possible from the very beginning, I missed this). So we can unify
    flush_work_keventd and flush_work.

    Also, rename flush_work() to cancel_work_sync() and fix all callers.
    Perhaps this is not the best name, but "flush_work" is really bad.

    (akpm: this is why the earlier patches bypassed maintainers)

    Signed-off-by: Oleg Nesterov
    Cc: Jeff Garzik
    Cc: "David S. Miller"
    Cc: Jens Axboe
    Cc: Tejun Heo
    Cc: Auke Kok
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Switch the kblockd flushing from a global flush to a more specific
    flush_work().

    (akpm: bypassed maintainers, sorry. There are other patches which depend on
    this)

    Cc: "Maciej W. Rozycki"
    Cc: David Howells
    Cc: Jens Axboe
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

09 May, 2007

2 commits

  • Signed-off-by: Michael Opdenacker
    Signed-off-by: Adrian Bunk

    Michael Opdenacker
     
  • I think we might just need the blk_map_kern users now. For the async
    execute I added the bounce code already and the block SG_IO has it
    already. I think the blk_map_kern bounce code got dropped because we
    thought the correct gfp_t would be passed in. But I think all we need is
    the patch below and all the paths are taken care of. The patch is not
    tested. Patch was made against scsi-misc.

    The last place that is sending non sg commands may just be md/dm-emc.c
    but that is just waiting on Alasdair to take some patches that fix
    that and a bunch of junk in there including adding bounce support. If
    the patch below is ok though and dm-emc finally gets converted then it
    will have sg and bounce buffer support.

    Signed-off-by: Mike Christie
    Signed-off-by: Jens Axboe

    Mike Christie
     

06 May, 2007

1 commit

  • * master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (87 commits)
    [SCSI] fusion: fix domain validation loops
    [SCSI] qla2xxx: fix regression on sparc64
    [SCSI] modalias for scsi devices
    [SCSI] sg: cap reserved_size values at max_sectors
    [SCSI] BusLogic: stop using check_region
    [SCSI] tgt: fix rdma transfer bugs
    [SCSI] aacraid: fix aacraid not finding device
    [SCSI] aacraid: Correct SMC products in aacraid.txt
    [SCSI] scsi_error.c: Add EH Start Unit retry
    [SCSI] aacraid: [Fastboot] Panics for AACRAID driver during 'insmod' for kexec test.
    [SCSI] ipr: Driver version to 2.3.2
    [SCSI] ipr: Faster sg list fetch
    [SCSI] ipr: Return better qc_issue errors
    [SCSI] ipr: Disrupt device error
    [SCSI] ipr: Improve async error logging level control
    [SCSI] ipr: PCI unblock config access fix
    [SCSI] ipr: Fix for oops following SATA request sense
    [SCSI] ipr: Log error for SAS dual path switch
    [SCSI] ipr: Enable logging of debug error data for all devices
    [SCSI] ipr: Add new PCI-E IDs to device table
    ...

    Linus Torvalds
     

30 Apr, 2007

1 commit


18 Apr, 2007

1 commit

  • This patch (as857) modifies the SG_GET_RESERVED_SIZE and
    SG_SET_RESERVED_SIZE ioctls in the sg driver, capping the values at
    the device's request_queue's max_sectors value. This will permit
    cdrecord to obtain a legal value for the maximum transfer length,
    fixing Bugzilla #7026.

    The patch also caps the initial reserved_size value. There's no
    reason to have a reserved buffer larger than max_sectors, since it
    would be impossible to use the extra space.

    The corresponding ioctls in the block layer are modified similarly,
    and the initial value for the reserved_size is set as large as
    possible. This will effectively make it default to max_sectors.
    Note that the actual value is meaningless anyway, since block devices
    don't have a reserved buffer.

    Finally, the BLKSECTGET ioctl is added to sg, so that there will be a
    uniform way for users to determine the actual max_sectors value for
    any raw SCSI transport.
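
    As a rough illustration of the capping rule described above (a sketch, not
    the sg driver's code): cap_reserved_size() is a hypothetical helper, and it
    assumes max_sectors is counted in 512-byte sectors while the reserved size
    is in bytes.

```c
#include <assert.h>

/*
 * Hypothetical helper: clamp a requested reserved size (bytes) to what the
 * queue can transfer in one request (max_sectors, assumed 512-byte units).
 */
static unsigned int cap_reserved_size(unsigned int requested,
                                      unsigned int max_sectors)
{
    unsigned long long limit = (unsigned long long)max_sectors * 512;

    /* The cast is safe: this branch is taken only when limit < requested. */
    return requested > limit ? (unsigned int)limit : requested;
}
```

    With max_sectors = 128, a 1 MiB request is cut down to 64 KiB, while a
    4 KiB request passes through unchanged.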

    Signed-off-by: Alan Stern
    Acked-by: Jens Axboe
    Acked-by: Douglas Gilbert
    Signed-off-by: James Bottomley

    Alan Stern
     

27 Mar, 2007

1 commit

  • There is a small problem in handling page bounce.

    At the moment blk_max_pfn equals max_pfn, which is in fact not the maximum
    possible _number_ of a page frame, but the _amount_ of page frames. For
    example, for a 32-bit x86 node with 4GB of RAM, max_pfn = 0x100000, not
    0xFFFFF.

    The request_queue structure has a member q->bounce_pfn, and the queue needs
    bounce pages for the pages _above_ this limit. This is handled by
    blk_queue_bounce(), which performs the following check:

    if (q->bounce_pfn >= blk_max_pfn)
            return;

    Assume that a driver has set q->bounce_pfn to 0xFFFF, but blk_max_pfn
    equals 0x10000. In this situation the check above fails, and for each bio
    we always fall through to iterating over the pages tied to the bio.

    Note that for quite a wide range of device drivers (ide, md, ...) this
    problem doesn't happen because they use BLK_BOUNCE_ANY for bounce_pfn.
    BLK_BOUNCE_ANY is defined as blk_max_pfn << PAGE_SHIFT, and then the check
    above doesn't fail. But for other drivers, which obtain the required value
    from drivers, it fails. For example sata_nv uses ATA_DMA_MASK or
    dev->dma_mask.

    I propose to use (max_pfn - 1) for blk_max_pfn, and the same for
    blk_max_low_pfn. The patch also cleans up some checks related to
    bounce_pfn.
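
    The off-by-one can be reproduced in isolation. This is only a sketch built
    around the check quoted above: may_need_bounce(), blk_max_pfn_old() and
    blk_max_pfn_new() are hypothetical names, and the numbers are the ones used
    in the example above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: true when the early return in the quoted check is NOT taken. */
static bool may_need_bounce(unsigned long bounce_pfn, unsigned long blk_max_pfn)
{
    return !(bounce_pfn >= blk_max_pfn);
}

/* Before the change: blk_max_pfn holds the count of page frames. */
static unsigned long blk_max_pfn_old(unsigned long max_pfn)
{
    return max_pfn;
}

/* Proposed: blk_max_pfn holds the highest valid page frame number. */
static unsigned long blk_max_pfn_new(unsigned long max_pfn)
{
    assert(max_pfn > 0);
    return max_pfn - 1;
}
```

    With max_pfn = 0x10000 and bounce_pfn = 0xFFFF, the old definition makes
    may_need_bounce() return true, so every bio is scanned page by page even
    though no page can lie above the limit; with the proposed definition it
    returns false.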

    Signed-off-by: Vasily Tarasov
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Vasily Tarasov
     

10 Feb, 2007

1 commit

  • It is possible for raid5 to be sent a bio that is too big for an underlying
    device. So if it is a READ that we pass straight down to a device, it will
    fail and confuse RAID5.

    So in 'chunk_aligned_read' we check that the bio fits within the parameters
    for the target device and if it doesn't fit, fall back on reading through
    the stripe cache and making lots of one-page requests.

    Note that this is the earliest time we can check against the device because
    earlier we don't have a lock on the device, so it could change underneath
    us.

    Also, the code for handling a retry through the cache when a read fails has
    not been tested and was badly broken. This patch fixes that code.

    Signed-off-by: Neil Brown
    Cc: "Kai"
    Cc:
    Cc:
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Brown
     

23 Dec, 2006

1 commit


19 Dec, 2006

5 commits


13 Dec, 2006

1 commit


12 Dec, 2006

1 commit

  • While working on bidi support at struct request level
    I have found that blk_queue_activity_fn is actually never used.
    The only user is in ide-probe.c with this code:

    /* enable led activity for disk drives only */
    if (drive->media == ide_disk && hwif->led_act)
            blk_queue_activity_fn(q, hwif->led_act, drive);

    And led_act is never initialized anywhere.
    (Looking back at older kernels, it was used in the PPC arch, but was removed around 2.6.18.)
    Unless it is all for future use, of course.
    (This patch is against linux-2.6-block.git as of 2006/12/4.)

    Signed-off-by: Boaz Harrosh
    Signed-off-by: Jens Axboe

    Boaz Harrosh
     

11 Dec, 2006

1 commit

  • Wire up read accounting for block devices, within submit_bio().

    Cc: Jay Lan
    Cc: Shailabh Nagar
    Cc: Balbir Singh
    Cc: Chris Sturtivant
    Cc: Tony Ernst
    Cc: Guillaume Thouvenin
    Cc: David Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

09 Dec, 2006

1 commit

  • This patch provides fault-injection capability for disk IO.

    Boot option:

    fail_make_request=<interval>,<probability>,<space>,<times>

    <interval> -- specifies the interval of failures.

    <probability> -- specifies how often it should fail in percent.

    <space> -- specifies the size of free space where disk IO can be issued
    safely in bytes.

    <times> -- specifies how many times failures may happen at most.

    Debugfs:

    /debug/fail_make_request/interval
    /debug/fail_make_request/probability
    /debug/fail_make_request/space
    /debug/fail_make_request/times

    Example:

    fail_make_request=10,100,0,-1
    echo 1 > /sys/block/hda/hda1/make-it-fail

    generic_make_request() on /dev/hda1 fails once per 10 times.
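
    How the four parameters might combine into a single decision can be
    sketched as follows. This is an illustrative model only, not the code added
    by the patch: struct fault_attr_sketch and should_fail_sketch() are
    hypothetical names, and rand() stands in for whatever random source the
    kernel uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical mirror of the four boot-option fields plus a call counter. */
struct fault_attr_sketch {
    unsigned long interval;     /* fail at most once per this many calls */
    unsigned long probability;  /* percent, 0..100 */
    long space;                 /* bytes of IO allowed before failures start */
    long times;                 /* failures left; -1 means unlimited */
    unsigned long count;        /* calls seen so far */
};

static bool should_fail_sketch(struct fault_attr_sketch *attr, long size)
{
    assert(attr->probability <= 100);

    if (attr->times == 0)               /* failure budget used up */
        return false;

    if (attr->space > size) {           /* still inside the safe space */
        attr->space -= size;
        return false;
    }

    if (attr->interval > 1) {           /* only every interval-th call */
        attr->count++;
        if (attr->count % attr->interval)
            return false;
    }

    if (attr->probability <= (unsigned long)rand() % 100)
        return false;                   /* lost the dice roll */

    if (attr->times > 0)
        attr->times--;
    return true;
}
```

    With the settings from the example (interval 10, probability 100, space 0,
    times -1), one call in every ten is selected to fail, which matches the
    "once per 10 times" behaviour described above.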

    Cc: Jens Axboe
    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     

08 Dec, 2006

2 commits

  • There was lots of #ifdef noise in the kernel due to hotcpu_notifier(fn,
    prio) not correctly marking 'fn' as used in the !HOTPLUG_CPU case, and thus
    generating compiler warnings of unused symbols, hence forcing people to add
    #ifdefs.

    The compiler can skip truly unused functions just fine:

       text    data     bss     dec     hex filename
    1624412  728710 3674856 6027978  5bfaca vmlinux.before
    1624412  728710 3674856 6027978  5bfaca vmlinux.after

    [akpm@osdl.org: topology.c fix]
    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Replace all uses of kmem_cache_t with struct kmem_cache.

    The patch was generated using the following script:

    #!/bin/sh
    #
    # Replace one string by another in all the kernel sources.
    #

    set -e

    for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
        quilt add $file
        sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
        mv /tmp/$$ $file
        quilt refresh
    done

    The script was run like this:

    sh replace kmem_cache_t "struct kmem_cache"

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

05 Dec, 2006

1 commit


01 Dec, 2006

1 commit


22 Nov, 2006

1 commit

  • Pass the work_struct pointer to the work function rather than context data.
    The work function can use container_of() to work out the data.
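
    A small user-space sketch of the container_of() pattern mentioned above.
    The macro here is a simplified stand-in for the kernel's (no type
    checking), work_struct is reduced to a single function pointer, and
    struct my_device, my_device_work() and run_work() are hypothetical names
    used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel macro: member pointer -> container. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Reduced work_struct: the work function now receives the work item itself. */
struct work_struct {
    void (*func)(struct work_struct *work);
};

/* Hypothetical driver structure with the work item embedded in it. */
struct my_device {
    int events_handled;
    struct work_struct work;
};

static void my_device_work(struct work_struct *work)
{
    struct my_device *dev;

    assert(work != NULL);
    /* Recover the enclosing structure instead of taking a void *data. */
    dev = container_of(work, struct my_device, work);
    dev->events_handled++;
}

/* Models the executor: it passes only the work_struct pointer along. */
static void run_work(struct work_struct *work)
{
    work->func(work);
}
```

    Because the work item is embedded rather than pointed to, no separate
    context pointer has to be stored alongside the function.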

    For the cases where the container of the work_struct may go away the moment the
    pending bit is cleared, it is made possible to defer the release of the
    structure by deferring the clearing of the pending bit.

    To make this work, an extra flag is introduced into the management side of the
    work_struct. This governs auto-release of the structure upon execution.

    Ordinarily, the work queue executor would release the work_struct for further
    scheduling or deallocation by clearing the pending bit prior to jumping to the
    work function. This means that, unless the driver makes some guarantee itself
    that the work_struct won't go away, the work function may not access anything
    else in the work_struct or its container lest they be deallocated. This is a
    problem if the auxiliary data is taken away (as done by the last patch).

    However, if the pending bit is *not* cleared before jumping to the work
    function, then the work function *may* access the work_struct and its container
    with no problems. But then the work function must itself release the
    work_struct by calling work_release().

    In most cases, automatic release is fine, so this is the default. Special
    initiators exist for the non-auto-release case (ending in _NAR).

    Signed-Off-By: David Howells

    David Howells
     

04 Nov, 2006

1 commit


01 Nov, 2006

1 commit

  • Partitions are not limited to live within a device. So we should range
    check after partition mapping.

    Note that 'maxsector' was being used for two different things. I have
    split off the second usage into 'old_sector' so that maxsector can still
    be used for its primary usage later in the function.
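
    Why the check has to come after the mapping can be shown with a short
    sketch. The types and helpers below (sketch_sector_t, struct
    partition_sketch, in_range(), remap_and_check()) are hypothetical and only
    model the idea; they are not the code changed by this patch.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned long long sketch_sector_t;

/* Hypothetical partition description: offset and length in sectors. */
struct partition_sketch {
    sketch_sector_t start_sect;
    sketch_sector_t nr_sects;
};

/* True if [sector, sector + nr) fits below limit, without overflowing. */
static bool in_range(sketch_sector_t sector, sketch_sector_t nr,
                     sketch_sector_t limit)
{
    return nr <= limit && sector <= limit - nr;
}

/* Map a partition-relative sector to the disk, then check the whole disk. */
static bool remap_and_check(sketch_sector_t *sector, sketch_sector_t nr,
                            const struct partition_sketch *p,
                            sketch_sector_t disk_sectors)
{
    *sector += p->start_sect;
    return in_range(*sector, nr, disk_sectors);
}
```

    For a 1000-sector disk carrying a partition that starts at sector 900 and
    claims 200 sectors, a request at partition sector 150 passes a check made
    against the partition alone, but fails once it has been mapped to disk
    sector 1050.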

    Cc: Jens Axboe
    Signed-off-by: Neil Brown
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

21 Oct, 2006

2 commits

  • Separate out the concept of "queue congestion" from "backing-dev congestion".
    Congestion is a backing-dev concept, not a queue concept.

    The blk_* congestion functions are retained, as wrappers around the core
    backing-dev congestion functions.

    This proper layering is needed so that NFS can cleanly use the congestion
    functions, and so that CONFIG_BLOCK=n actually links.
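
    The layering described above might look roughly like this. It is a sketch
    under assumptions: the structures are cut down to the one field that
    matters, the bit layout is made up, and the function bodies are
    illustrative rather than taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

enum { SKETCH_READ = 0, SKETCH_WRITE = 1 };    /* hypothetical direction ids */

/* Congestion state lives in the backing device, not in the queue. */
struct backing_dev_info {
    unsigned long state;    /* one congestion bit per direction (assumed) */
};

struct request_queue {
    struct backing_dev_info backing_dev_info;
};

/* Core helpers: usable by anything that owns a backing_dev_info. */
static void set_bdi_congested(struct backing_dev_info *bdi, int rw)
{
    assert(rw == SKETCH_READ || rw == SKETCH_WRITE);
    bdi->state |= 1UL << rw;
}

static void clear_bdi_congested(struct backing_dev_info *bdi, int rw)
{
    assert(rw == SKETCH_READ || rw == SKETCH_WRITE);
    bdi->state &= ~(1UL << rw);
}

static bool bdi_congested(const struct backing_dev_info *bdi, int rw)
{
    return (bdi->state >> rw) & 1UL;
}

/* Block-layer wrappers: thin forwarding functions kept for queue users. */
static inline void blk_set_queue_congested(struct request_queue *q, int rw)
{
    set_bdi_congested(&q->backing_dev_info, rw);
}

static inline void blk_clear_queue_congested(struct request_queue *q, int rw)
{
    clear_bdi_congested(&q->backing_dev_info, rw);
}
```

    A non-block user such as a network filesystem would call the bdi helpers
    directly, without needing a request_queue at all.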

    Cc: "Thomas Maier"
    Cc: "Jens Axboe"
    Cc: Trond Myklebust
    Cc: David Howells
    Cc: Peter Osterlund
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Export the clear_queue_congested() and set_queue_congested() functions
    located in ll_rw_blk.c

    The functions are renamed to blk_clear_queue_congested() and
    blk_set_queue_congested().

    (needed in the pktcdvd driver's bio write congestion control)

    Signed-off-by: Thomas Maier
    Cc: Peter Osterlund
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Maier
     

05 Oct, 2006

1 commit


01 Oct, 2006

7 commits