17 Dec, 2010

1 commit

  • __get_cpu_var() can be replaced with this_cpu_read(), which then uses a
    single read instruction with implied address calculation to access the
    correct per-cpu instance.

    However, the address of a per-cpu variable passed to __this_cpu_read()
    cannot be determined (since it is an implied address conversion through
    segment prefixes). Therefore, apply this only to uses of __get_cpu_var()
    where the address of the variable is not used.
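
    A minimal sketch of the conversion (the variable and accessor are
    illustrative, not from the patch):

    DEFINE_PER_CPU(int, my_counter);

    static int my_read(void)
    {
            /* before: return __get_cpu_var(my_counter); */
            /* after: one read insn with a %gs segment prefix on x86 */
            return __this_cpu_read(my_counter);
    }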

    Cc: Pekka Enberg
    Cc: Hugh Dickins
    Cc: Thomas Gleixner
    Acked-by: H. Peter Anvin
    Signed-off-by: Christoph Lameter
    Signed-off-by: Tejun Heo

    Christoph Lameter
     

25 Oct, 2010

1 commit


23 Oct, 2010

1 commit

  • * 'for-2.6.37/core' of git://git.kernel.dk/linux-2.6-block: (39 commits)
    cfq-iosched: Fix a gcc 4.5 warning and put some comments
    block: Turn bvec_k{un,}map_irq() into static inline functions
    block: fix accounting bug on cross partition merges
    block: Make the integrity mapped property a bio flag
    block: Fix double free in blk_integrity_unregister
    block: Ensure physical block size is unsigned int
    blkio-throttle: Fix possible multiplication overflow in iops calculations
    blkio-throttle: limit max iops value to UINT_MAX
    blkio-throttle: There is no need to convert jiffies to milli seconds
    blkio-throttle: Fix link failure on i386
    blkio: Recalculate the throttled bio dispatch time upon throttle limit change
    blkio: Add root group to td->tg_list
    blkio: deletion of a cgroup causes oops
    blkio: Do not export throttle files if CONFIG_BLK_DEV_THROTTLING=n
    block: set the bounce_pfn to the actual DMA limit rather than to max memory
    block: revert bad fix for memory hotplug causing bounces
    Fix compile error in blk-exec.c for !CONFIG_DETECT_HUNG_TASK
    block: set the bounce_pfn to the actual DMA limit rather than to max memory
    block: Prevent hang_check firing during long I/O
    cfq: improve fsync performance for small files
    ...

    Fix up trivial conflicts due to __rcu sparse annotation in include/linux/genhd.h

    Linus Torvalds
     

19 Oct, 2010

1 commit

  • /proc/diskstats would display a strange output as follows.

    $ cat /proc/diskstats |grep sda
    8 0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
    8 1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                           ~~~~~~~~~~
    8 2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
    8 3 sda3 54 487 2188 92 0 0 0 0 0 88 92
    8 4 sda4 4 0 8 0 0 0 0 0 0 0 0
    8 5 sda5 81 2027 2130 138 0 0 0 0 0 87 137

    The reason is the wrong way of accounting hd_struct->in_flight when a bio
    is merged into a request belonging to a different partition by
    ELEVATOR_FRONT_MERGE.

    The detailed root cause is as follows.

    Assuming that there are two partitions, sda1 and sda2.

    1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
    is 0 and sda2's one is 1.

    | hd_struct->in_flight
    ---------------------------
    sda1 | 0
    sda2 | 1
    ---------------------------

    2. A bio belonging to sda1 is issued and is merged into the request
    mentioned in step 1 by ELEVATOR_BACK_MERGE. The first sector of the request
    is changed from the sda2 region to the sda1 region. However, neither
    partition's hd_struct->in_flight is changed.

    | hd_struct->in_flight
    ---------------------------
    sda1 | 0
    sda2 | 1
    ---------------------------

    3. The request is finished and blk_account_io_done() is called. In this case,
    sda2's hd_struct->in_flight, not sda1's, is decremented.

    | hd_struct->in_flight
    ---------------------------
    sda1 | -1
    sda2 | 1
    ---------------------------

    The patch fixes the problem by caching the partition lookup
    inside the request structure, hence making sure that the increment
    and decrement will always happen on the same partition struct. This
    also speeds up IO with accounting enabled, since it cuts down on
    the number of lookups we have to do.
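
    A minimal sketch of the caching idea (the helper names follow genhd.h of
    the era; treat the details as illustrative rather than as the exact patch):

    static void my_account_start(struct request *rq, sector_t sector)
    {
            /* look the partition up once and cache it in the request
             * (done under part_stat_lock()/RCU in real code) */
            rq->part = disk_map_sector_rcu(rq->rq_disk, sector);
            part_inc_in_flight(rq->part, rq_data_dir(rq));
    }

    static void my_account_done(struct request *rq)
    {
            /* decrement the very struct we incremented, even if merging
             * later moved the request's start into another partition */
            part_dec_in_flight(rq->part, rq_data_dir(rq));
    }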

    When reloading partition tables, quiesce IO to ensure that no
    request references to the partition struct exists. When it is safe
    to free the partition table, the IO for that device is restarted
    again.

    Signed-off-by: Yasuaki Ishimatsu
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Yasuaki Ishimatsu
     

07 Oct, 2010

1 commit

  • 2.6.36 introduces an API for drivers to switch the IO scheduler
    instead of manually calling the elevator exit and init functions.
    This API was added since q->elevator must be cleared in between
    those two calls. And since we already have this functionality
    for use by the sysfs interface that switches schedulers online,
    it was prudent to reuse it internally too.

    But this API needs the queue to be in a fully initialized state
    before it is called, or it will attempt to unregister elevator
    kobjects before they have been added. This results in an oops
    like this:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000051
    IP: [] sysfs_create_dir+0x2e/0xc0
    PGD 47ddfc067 PUD 47c6a1067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP
    last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:04:00.1/irq
    CPU 2
    Modules linked in: t(+) loop hid_apple usbhid ahci ehci_hcd uhci_hcd libahci usbcore nls_base igb

    Pid: 7319, comm: modprobe Not tainted 2.6.36-rc6+ #132 QSSC-S4R/QSSC-S4R
    RIP: 0010:[] [] sysfs_create_dir+0x2e/0xc0
    RSP: 0018:ffff88027da25d08 EFLAGS: 00010246
    RAX: ffff88047c68c528 RBX: 00000000fffffffe RCX: 0000000000000000
    RDX: 000000000000002f RSI: 000000000000002f RDI: ffff88047e196c88
    RBP: ffff88027da25d38 R08: 0000000000000000 R09: d84156c5635688c0
    R10: d84156c5635688c0 R11: 0000000000000000 R12: ffff88047e196c88
    R13: 0000000000000000 R14: 0000000000000000 R15: ffff88047c68c528
    FS: 00007fcb0b26f6e0(0000) GS:ffff880287400000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000000000051 CR3: 000000047e76e000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process modprobe (pid: 7319, threadinfo ffff88027da24000, task ffff88027d377090)
    Stack:
    ffff88027da25d58 ffff88047c68c528 00000000fffffffe ffff88047e196c88
    ffff88047c68c528 ffff88047e05bd90 ffff88027da25d78 ffffffff8123fb77
    ffff88047e05bd90 0000000000000000 ffff88047e196c88 ffff88047c68c528
    Call Trace:
    [] kobject_add_internal+0xe7/0x1f0
    [] kobject_add_varg+0x38/0x60
    [] kobject_add+0x69/0x90
    [] ? sysfs_remove_dir+0x20/0xa0
    [] ? sub_preempt_count+0x9d/0xe0
    [] ? _raw_spin_unlock+0x30/0x50
    [] ? sysfs_remove_dir+0x20/0xa0
    [] ? sysfs_remove_dir+0x34/0xa0
    [] elv_register_queue+0x34/0xa0
    [] elevator_change+0xfd/0x250
    [] ? t_init+0x0/0x361 [t]
    [] ? t_init+0x0/0x361 [t]
    [] t_init+0xa8/0x361 [t]
    [] do_one_initcall+0x3e/0x170
    [] sys_init_module+0xbd/0x220
    [] system_call_fastpath+0x16/0x1b
    Code: e5 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 85 ff 74 52 48 8b 47 18 49 c7 c5 00 46 61 81 48 85 c0 74 04 4c 8b 68 30 45 31 f6 80 7d 51 00 74 0e 49 8b 44 24 28 4c 89 e7 ff 50 20 49 89 c6
    RIP [] sysfs_create_dir+0x2e/0xc0
    RSP
    CR2: 0000000000000051
    ---[ end trace a6541d3bf07945df ]---

    Fix this by adding a registered bit to the elevator queue, which is
    set when the sysfs kobjects have been registered.
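
    A sketch of the guard (the bitfield matches the description above; its
    placement and the check site are illustrative):

    struct my_elevator_queue {
            struct kobject kobj;
            unsigned int registered:1;      /* sysfs kobjects added? */
    };

    static void my_unregister(struct my_elevator_queue *e)
    {
            /* only tear down what was actually registered */
            if (e->registered) {
                    kobject_del(&e->kobj);
                    e->registered = 0;
            }
    }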

    Signed-off-by: Jens Axboe

    Jens Axboe
     

23 Aug, 2010

1 commit

  • Currently drivers must do an elevator_exit() + elevator_init()
    to switch IO schedulers. There are a few problems with this:

    - Since commit 1abec4fdbb142e3ccb6ce99832fae42129134a96,
    elevator_init() requires a zeroed out q->elevator
    pointer. The two existing in-kernel users don't do that.

    - It will only work at initialization time, since using the
    above two-staged construct does not properly quiesce the queue.

    So add elevator_change() which takes care of this, and convert
    the elv_iosched_store() sysfs interface to use this helper as well.
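
    A sketch of the in-kernel use (the signature matches the description
    above; the call site is invented):

    static int my_setup_queue(struct request_queue *q)
    {
            /* replaces the old elevator_exit() + elevator_init() pair
             * and quiesces the queue around the switch */
            return elevator_change(q, "deadline");
    }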

    Reported-by: Peter Oberparleiter
    Reported-by: Kevin Vigor
    Signed-off-by: Jens Axboe

    Jens Axboe
     

09 Apr, 2010

1 commit

  • This includes both the number of bios merged into requests belonging to this
    cgroup as well as the number of requests merged together.
    In the past, we've observed different merging behavior across upstream
    kernels, some by design and some actual bugs. This stat helps a lot in
    debugging such problems when applications report decreased throughput
    with a new kernel version.

    This required adding an extra elevator function to capture bios being
    merged, as I did not want to pollute the elevator code with blkiocg
    knowledge; hence the accounting invocation comes from CFQ.
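
    The shape of such a hook, as a sketch (the names are illustrative; the
    real accounting call lives in CFQ per the above):

    static void my_bio_merged(struct request_queue *q, struct request *rq,
                              struct bio *bio)
    {
            /* credit the merged bio to the issuing cgroup's io_merged stat */
            my_blkiocg_update_io_merged(rq, bio);   /* hypothetical */
    }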

    Signed-off-by: Divyesh Shah
    Signed-off-by: Jens Axboe

    Divyesh Shah
     

11 May, 2009

2 commits

  • Till now the block layer allowed two separate modes of request execution.
    A request is always acquired from the request queue via
    elv_next_request(). After that, drivers are free to either dequeue it
    or process it without dequeueing. Dequeueing allows elv_next_request()
    to return the next request so that multiple requests can be in flight.

    Executing requests without dequeueing has its merits, mostly in
    allowing drivers for simpler devices which can't do scatter/gather to
    deal only with segments, without considering request boundaries.
    However, the benefit this brings is dubious and declining, while the
    cost of the API ambiguity is increasing. Segment-based drivers are
    usually for very old or limited devices, and as converting to the
    dequeueing model isn't difficult, it doesn't justify the API overhead
    it puts on the block layer and its more modern users.

    Previous patches converted all block low level drivers to dequeueing
    model. This patch completes the API transition by...

    * renaming elv_next_request() to blk_peek_request()

    * renaming blkdev_dequeue_request() to blk_start_request()

    * adding blk_fetch_request() which is combination of peek and start

    * disallowing completion of queued (not started) requests

    * applying new API to all LLDs

    Renamings are for consistency and to break out of tree code so that
    it's apparent that out of tree drivers need updating.

    [ Impact: block request issue API cleanup, no functional change ]
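
    A sketch of a converted LLD issue loop under the new API (the my_*
    names are invented; error handling elided):

    static void my_request_fn(struct request_queue *q)
    {
            struct request *rq;

            /* blk_fetch_request() == blk_peek_request() + blk_start_request() */
            while ((rq = blk_fetch_request(q)) != NULL) {
                    int error = my_do_transfer(rq);         /* hypothetical */

                    /* completion is only legal now that the request is started */
                    __blk_end_request_all(rq, error);
            }
    }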

    Signed-off-by: Tejun Heo
    Cc: Rusty Russell
    Cc: James Bottomley
    Cc: Mike Miller
    Cc: unsik Kim
    Cc: Paul Clements
    Cc: Tim Waugh
    Cc: Geert Uytterhoeven
    Cc: David S. Miller
    Cc: Laurent Vivier
    Cc: Jeff Garzik
    Cc: Jeremy Fitzhardinge
    Cc: Grant Likely
    Cc: Adrian McMenamin
    Cc: Stephen Rothwell
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Borislav Petkov
    Cc: Sergei Shtylyov
    Cc: Alex Dubov
    Cc: Pierre Ossman
    Cc: David Woodhouse
    Cc: Markus Lidel
    Cc: Stefan Weinhuber
    Cc: Martin Schwidefsky
    Cc: Pete Zaitcev
    Cc: FUJITA Tomonori
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • struct request has had a few different ways to represent some
    properties of a request. ->hard_* represent the block layer's view of
    the request progress (completion cursor), while the ones without the
    prefix are supposed to represent the issue cursor and are allowed to be
    updated as necessary by the low level drivers. The thing is that since
    the block layer supports partial completion, the two cursors really
    aren't necessary and only cause confusion. In addition, manual
    management of request details from low level drivers is cumbersome and
    error-prone at the very least.

    Another interesting set of duplicate fields is rq->[hard_]nr_sectors
    and rq->{hard_cur|current}_nr_sectors versus rq->data_len and
    rq->bio->bi_size. This is more convoluted than the hard_ case.

    rq->[hard_]nr_sectors are initialized for requests with bio, but
    blk_rq_bytes() uses them only for !pc requests. rq->data_len is
    initialized for all requests, but blk_rq_bytes() uses it only for pc
    requests. This causes a good amount of confusion throughout the block
    layer and its drivers, and determining the request length has been a
    bit of black magic which may or may not work depending on circumstances
    and what the specific LLD is actually doing.

    rq->{hard_cur|current}_nr_sectors represent the number of sectors in
    the contiguous data area at the front. This is mainly used by drivers
    which transfer data by walking the request segment-by-segment. This
    value always equals rq->bio->bi_size >> 9. However, the data length for
    pc requests may not be a multiple of 512 bytes, and using this field
    becomes a bit confusing.

    In general, having multiple fields to represent the same property
    leads only to confusion and subtle bugs. With recent block low level
    driver cleanups, no driver is accessing or manipulating these
    duplicate fields directly. Drop all the duplicates. Now rq->sector
    means the current sector, rq->data_len the current total length and
    rq->bio->bi_size the current segment length. Everything else is
    defined in terms of these three and available only through accessors.

    * blk_recalc_rq_sectors() is collapsed into blk_update_request(), which
    now handles pc and fs requests equally, other than the rq->sector
    update. This means that pc requests can now use partial completion too
    (no in-kernel user yet, though).

    * bio_cur_sectors() is replaced with bio_cur_bytes() as block layer
    now uses byte count as the primary data length.

    * blk_rq_pos() is now guaranteed to be always correct. In-block users
    converted.

    * blk_rq_bytes() is now guaranteed to be always valid as is
    blk_rq_sectors(). In-block users converted.

    * blk_rq_sectors() is now guaranteed to equal blk_rq_bytes() >> 9.
    Whichever is more convenient is used.

    * blk_rq_bytes() and blk_rq_cur_bytes() are now inlined and take const
    pointer to request.

    [ Impact: API cleanup, single way to represent one property of a request ]
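
    The resulting accessor-only view, as a sketch (the debug helper itself
    is illustrative):

    static void my_show_rq(const struct request *rq)
    {
            printk(KERN_DEBUG "pos=%llu bytes=%u cur=%u sectors=%u\n",
                   (unsigned long long)blk_rq_pos(rq), /* current sector */
                   blk_rq_bytes(rq),        /* current total length */
                   blk_rq_cur_bytes(rq),    /* current segment length */
                   blk_rq_sectors(rq));     /* == blk_rq_bytes(rq) >> 9 */
    }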

    Signed-off-by: Tejun Heo
    Cc: Boaz Harrosh
    Signed-off-by: Jens Axboe

    Tejun Heo
     

07 Apr, 2009

1 commit


29 Dec, 2008

1 commit


09 Oct, 2008

2 commits

  • Signed-off-by: Mike Anderson
    Signed-off-by: Jens Axboe

    Mike Anderson
     
  • This patch adds support for controlling the IO completion CPU of
    either all requests on a queue, or on a per-request basis. We export
    a sysfs variable (rq_affinity) which, if set, migrates completions
    of requests to the CPU that originally submitted them. A bio helper
    (bio_set_completion_cpu()) is also added, so that queuers can ask
    for completion on a specific CPU.

    In testing, this has been shown to cut the system time by as much
    as 20-40% on synthetic workloads where CPU affinity is desired.

    This requires a little help from the architecture, so it'll only
    work as designed for archs that are using the new generic smp
    helper infrastructure.
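
    A sketch of a queuer asking for a specific completion CPU (the bio
    helper is named in the text above; the submit wrapper is invented).
    The queue-wide behaviour is toggled through the rq_affinity sysfs
    variable instead:

    static void my_submit(int rw, struct bio *bio)
    {
            int cpu = get_cpu();    /* keep the cpu id meaningful */

            /* ask for completion to be steered back to this CPU */
            bio_set_completion_cpu(bio, cpu);
            put_cpu();

            submit_bio(rw, bio);
    }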

    Signed-off-by: Jens Axboe

    Jens Axboe
     

18 Dec, 2007

1 commit


24 Jul, 2007

1 commit

  • Some of the code has been gradually transitioned to using the proper
    struct request_queue, but there's lots left. So do a full sweep of
    the kernel and get rid of this typedef, replacing its uses with
    the proper type.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

20 Dec, 2006

1 commit

  • Currently we allow any merge, even if the io originates from different
    processes. This can cause really bad starvation and unfairness if those
    ios happen to be synchronous (reads or direct writes).

    So add an allow_merge hook to the io scheduler ops, so an io scheduler
    can help decide whether a bio/process combination may be merged with an
    existing request.
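
    A sketch of a scheduler implementing the hook (signature as of this
    era; the policy shown and my_same_process() are illustrative, not
    cfq's actual logic):

    static int my_allow_merge(request_queue_t *q, struct request *rq,
                              struct bio *bio)
    {
            /* returning 0 vetoes the merge, non-zero permits it */
            if (bio_sync(bio) && !my_same_process(rq, bio)) /* hypothetical */
                    return 0;
            return 1;
    }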

    Signed-off-by: Jens Axboe

    Jens Axboe
     

01 Dec, 2006

1 commit


12 Oct, 2006

1 commit


01 Oct, 2006

6 commits

  • Make it possible to disable the block layer. Not all embedded devices require
    it, some can make do with just JFFS2, NFS, ramfs, etc - none of which require
    the block layer to be present.

    This patch does the following:

    (*) Introduces CONFIG_BLOCK to disable the block layer, buffering and blockdev
    support.

    (*) Adds dependencies on CONFIG_BLOCK to any configuration item that controls
    an item that uses the block layer. This includes:

    (*) Block I/O tracing.

    (*) Disk partition code.

    (*) All filesystems that are block based, eg: Ext3, ReiserFS, ISOFS.

    (*) The SCSI layer. As far as I can tell, even SCSI chardevs use the
    block layer to do scheduling. Some drivers that use SCSI facilities -
    such as USB storage - end up disabled indirectly from this.

    (*) Various block-based device drivers, such as IDE and the old CDROM
    drivers.

    (*) MTD blockdev handling and FTL.

    (*) JFFS - which uses set_bdev_super(), something it could avoid doing by
    taking a leaf out of JFFS2's book.

    (*) Makes most of the contents of linux/blkdev.h, linux/buffer_head.h and
    linux/elevator.h contingent on CONFIG_BLOCK being set. sector_div() is,
    however, still used in places, and so is still available.

    (*) Also made contingent are the contents of linux/mpage.h, linux/genhd.h and
    parts of linux/fs.h.

    (*) Makes a number of files in fs/ contingent on CONFIG_BLOCK.

    (*) Makes mm/bounce.c (bounce buffering) contingent on CONFIG_BLOCK.

    (*) set_page_dirty() doesn't call __set_page_dirty_buffers() if CONFIG_BLOCK
    is not enabled.

    (*) fs/no-block.c is created to hold out-of-line stubs and things that are
    required when CONFIG_BLOCK is not set:

    (*) Default blockdev file operations (to give error ENODEV on opening).

    (*) Makes some /proc changes:

    (*) /proc/devices does not list any blockdevs.

    (*) /proc/diskstats and /proc/partitions are contingent on CONFIG_BLOCK.

    (*) Makes some compat ioctl handling contingent on CONFIG_BLOCK.

    (*) If CONFIG_BLOCK is not defined, makes sys_quotactl() return -ENODEV if
    given a command other than Q_SYNC or if a special device is specified.

    (*) In init/do_mounts.c, no reference is made to the blockdev routines if
    CONFIG_BLOCK is not defined. This does not prohibit NFS roots or JFFS2.

    (*) The bdflush, ioprio_set and ioprio_get syscalls can now be absent (return
    error ENOSYS by way of cond_syscall if so).

    (*) The seclvl_bd_claim() and seclvl_bd_release() security calls do nothing if
    CONFIG_BLOCK is not set, since they can't then happen.
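
    The recurring header pattern, sketched (the declaration shown is
    illustrative, not a specific hunk from the patch):

    #ifdef CONFIG_BLOCK
    extern void submit_bio(int rw, struct bio *bio);
    /* ... the rest of the block-only API ... */
    #else
    /* nothing: users must themselves depend on CONFIG_BLOCK */
    #endif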

    Signed-Off-By: David Howells
    Signed-off-by: Jens Axboe

    David Howells
     
  • None of the in-kernel primitives for handling "atomic" counting seem
    to be a good fit. We need something that is essentially free for
    incrementing/decrementing, while the read side may be more expensive
    as we only ever need to do that when a device is removed from the
    kernel.

    Use a per-cpu variable for maintaining a per-cpu ioc count and define
    a reading mechanism that just sums up the values.
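
    A minimal sketch of the scheme (names are illustrative, not the actual
    elevator code):

    static DEFINE_PER_CPU(unsigned long, my_ioc_count);

    static inline void my_ioc_inc(void)
    {
            get_cpu_var(my_ioc_count)++;    /* essentially free */
            put_cpu_var(my_ioc_count);
    }

    static unsigned long my_ioc_total(void)
    {
            unsigned long sum = 0;
            int cpu;

            /* slow read side, only needed on device removal */
            for_each_possible_cpu(cpu)
                    sum += per_cpu(my_ioc_count, cpu);
            return sum;
    }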

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • It's not needed for anything, so kill the bio passing.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • The io schedulers can use this instead of having to allocate space for
    it themselves.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • The rbtree sort/lookup/reposition logic is mostly duplicated in
    cfq/deadline/as, so move it to the elevator core. The io schedulers
    still provide the actual rb root, as we don't want to impose any sort
    of specific handling on the schedulers.

    Introduce the helpers and rb_node in struct request to help migrate the
    IO schedulers.
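
    A sketch of a scheduler using the core helpers (the helper names follow
    this patch; the scheduler-side struct is illustrative):

    struct my_sched_data {
            struct rb_root sort_list;       /* the scheduler still owns the root */
    };

    static void my_add_rq(struct my_sched_data *sd, struct request *rq)
    {
            elv_rb_add(&sd->sort_list, rq); /* keyed by start sector */
    }

    static struct request *my_lookup(struct my_sched_data *sd, sector_t sector)
    {
            return elv_rb_find(&sd->sort_list, sector);
    }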

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Right now, every IO scheduler implements its own backmerging (except for
    noop, which does no merging). That results in duplicated code for
    essentially the same operation, which is never a good thing. This patch
    moves the backmerging out of the io schedulers and into the elevator
    core. We save 1.6kb of text and as a bonus get backmerging for noop as
    well. Win-win!

    Signed-off-by: Jens Axboe

    Jens Axboe
     

09 Jun, 2006

1 commit

  • There's a race between shutting down one io scheduler and firing up the
    next, in which a new io could enter and cause the io scheduler to be
    invoked with bad or NULL data.

    To fix this, we need to hold the queue lock for a bit longer.
    Unfortunately we cannot do that, since the elevator init must run
    without the lock held. This isn't easily fixable without also
    changing the mempool API. So split the initialization into two parts,
    an alloc-init operation and an attach operation. Then we can
    preallocate the io scheduler and related structures, and run the attach
    inside the lock after we detach the old one.
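
    The shape of the switch after the split, as a sketch (function names
    are illustrative, not the exact kernel symbols):

    static int my_switch_elevator(request_queue_t *q, struct elevator_type *type)
    {
            elevator_t *new_e, *old_e;

            /* phase 1: allocation may sleep, so run without the queue lock */
            new_e = my_elevator_alloc(type);                /* hypothetical */
            if (!new_e)
                    return -ENOMEM;

            spin_lock_irq(q->queue_lock);
            old_e = q->elevator;
            my_elevator_attach(q, new_e);   /* phase 2: swap under the lock */
            spin_unlock_irq(q->queue_lock);

            my_elevator_free(old_e);                        /* hypothetical */
            return 0;
    }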

    This patch has survived 30 minutes of 1 second io scheduler switching
    with a very busy io load.

    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

19 Mar, 2006

3 commits


08 Feb, 2006

1 commit

  • q->ordcolor must only be flipped on initial queueing of a hardbarrier
    request.

    Constructing ordered sequence and requeueing used to pass through
    __elv_add_request() which flips q->ordcolor when it sees a barrier
    request.

    This patch separates out elv_insert() from __elv_add_request() and uses
    elv_insert() when constructing ordered sequence and requeueing.
    elv_insert() inserts the given request at the specified position and
    does nothing else.

    Signed-off-by: Tejun Heo
    Acked-by: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

10 Jan, 2006

1 commit


09 Jan, 2006

1 commit


06 Jan, 2006

1 commit

  • Reimplement handling of barrier requests.

    * Flexible handling to deal with various capabilities of
    target devices.
    * Retry support for falling back.
    * Tagged queues which don't support ordered tags can still do ordered.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

28 Oct, 2005

4 commits


28 Jun, 2005

1 commit

  • This updates the CFQ io scheduler to the new time sliced design (cfq
    v3). It provides full process fairness, while giving excellent
    aggregate system throughput even for many competing processes. It
    supports io priorities, either inherited from the cpu nice value or set
    directly with the ioprio_get/set syscalls. The latter closely mimic
    set/getpriority.
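
    A userspace sketch of the new syscalls (there was no glibc wrapper, so
    raw syscall(); the IOPRIO_* values are spelled out by hand and
    SYS_ioprio_set assumes headers that define it):

    #include <sys/syscall.h>
    #include <unistd.h>

    #define IOPRIO_WHO_PROCESS      1
    #define IOPRIO_CLASS_BE         2
    #define IOPRIO_CLASS_SHIFT      13

    int main(void)
    {
            /* best-effort class, level 4 (levels run 0-7, like nice bands) */
            int ioprio = (IOPRIO_CLASS_BE << IOPRIO_CLASS_SHIFT) | 4;

            return syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0, ioprio);
    }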

    This import is based on my latest from -mm.

    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds