21 Jul, 2016

1 commit

  • Before merging a bio into an existing request, the io scheduler is
    called to give its approval first. However, requests that come from a
    plug flush may get merged by the block layer without consulting the io
    scheduler.

    In case of CFQ, this can cause fairness problems. For instance, if a
    request gets merged into a low weight cgroup's request, the high weight
    cgroup now depends on the low weight cgroup to get scheduled. If the
    high weight cgroup needs that io request to complete before submitting
    more requests, it will also lose its timeslice.

    The following script demonstrates the problem. Group g1 has a low
    weight, g2 and g3 have equal high weights, but g2's requests are
    adjacent to g1's requests so they are subject to merging. Due to these
    merges, g2 gets poor disk time allocation.

    cat > cfq-merge-repro.sh << "EOF"
    #!/bin/bash
    set -e

    IO_ROOT=/mnt-cgroup/io

    mkdir -p $IO_ROOT

    if ! mount | grep -qw $IO_ROOT; then
        mount -t cgroup none -oblkio $IO_ROOT
    fi

    cd $IO_ROOT

    for i in g1 g2 g3; do
        if [ -d $i ]; then
            rmdir $i
        fi
    done

    mkdir g1 && echo 10 > g1/blkio.weight
    mkdir g2 && echo 495 > g2/blkio.weight
    mkdir g3 && echo 495 > g3/blkio.weight

    RUNTIME=10

    (echo $BASHPID > g1/cgroup.procs &&
     fio --readonly --name name1 --filename /dev/sdb \
         --rw read --size 64k --bs 64k --time_based \
         --runtime=$RUNTIME --offset=0k &> /dev/null)&

    (echo $BASHPID > g2/cgroup.procs &&
     fio --readonly --name name1 --filename /dev/sdb \
         --rw read --size 64k --bs 64k --time_based \
         --runtime=$RUNTIME --offset=64k &> /dev/null)&

    (echo $BASHPID > g3/cgroup.procs &&
     fio --readonly --name name1 --filename /dev/sdb \
         --rw read --size 64k --bs 64k --time_based \
         --runtime=$RUNTIME --offset=256k &> /dev/null)&

    sleep $((RUNTIME+1))

    for i in g1 g2 g3; do
        echo ---- $i ----
        cat $i/blkio.time
    done

    EOF
    # ./cfq-merge-repro.sh
    ---- g1 ----
    8:16 162
    ---- g2 ----
    8:16 165
    ---- g3 ----
    8:16 686

    After applying the patch:

    # ./cfq-merge-repro.sh
    ---- g1 ----
    8:16 90
    ---- g2 ----
    8:16 445
    ---- g3 ----
    8:16 471
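
    The commit text above doesn't include the code change itself; as a
    minimal sketch of the idea (the hook and field names here are
    assumptions, not quoted from the patch), the block layer would ask the
    scheduler before letting two requests merge during a plug flush:

    static bool blk_allow_rq_merge(struct request_queue *q, struct request *rq,
                                   struct request *next)
    {
            struct elevator_queue *e = q->elevator;

            /* consult the io scheduler before merging plug-flush requests */
            if (e->type->ops.elevator_allow_rq_merge_fn)
                    return e->type->ops.elevator_allow_rq_merge_fn(q, rq, next);

            return true;
    }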

    Signed-off-by: Tahsin Erdogan
    Signed-off-by: Jens Axboe

    Tahsin Erdogan
     

28 Jun, 2016

1 commit

  • Currently rq->fifo_time is unsigned long, but CFQ stores a nanosecond
    timestamp in it, which would overflow on 32-bit archs. Convert it to u64
    to avoid the overflow. Since rq->fifo_time is unioned with struct
    call_single_data, this does not change the size of struct request in
    any way.

    We have to slightly fix up block/deadline-iosched.c so that the
    comparison happens in the right types.
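
    An illustrative sketch of that fixup (the exact lines are an assumption,
    not quoted from the patch): deadline keeps a jiffies value in the now-u64
    fifo_time, so the expiry check casts back to unsigned long before using
    the jiffies helpers:

    /* sketch of the deadline expiry check after the type change */
    if (time_after_eq(jiffies, (unsigned long)rq->fifo_time))
            return 1;       /* request has waited past its deadline */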

    Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

02 Feb, 2016

1 commit


25 Feb, 2014

1 commit

  • The block layer currently abuses rq->csd.list.next for storing
    fifo_time. That is a terrible hack and completely unnecessary as well.
    A union achieves the same space saving in a cleaner way.
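
    Roughly, the resulting layout inside struct request is a plain union, so
    the field costs no extra space whichever way it is used (a sketch, not
    the verbatim diff):

    union {
            struct call_single_data csd;    /* IPI completion data */
            unsigned long fifo_time;        /* io scheduler fifo expiry */
    };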

    Signed-off-by: Jan Kara
    Cc: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Ingo Molnar
    Cc: Jens Axboe
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Jens Axboe

    Jan Kara
     

12 Sep, 2013

1 commit


03 Jul, 2013

1 commit

  • There's a race between elevator switching and normal io operation,
    because the allocation of struct elevator_queue and struct elevator_data
    is not done as one atomic operation. That leaves a window in which a
    NULL ->elevator_data can be used. For example:

        Thread A                        Thread B
        blk_queue_bio                   elevator_switch
        spin_lock_irq(q->queue_lock)    elevator_alloc
        elv_merge                       elevator_init_fn

    elevator_alloc cannot be called with queue_lock held, so while it runs
    ->elevator_data is still NULL. If thread A calls elv_merge at that
    moment, it needs data from ->elevator_data, and the crash happens.

    Move elevator_alloc into the elevator_init_fn so that the two operations
    are done atomically with respect to the queue; see the sketch after the
    reproducer below.

    This bug is easily reproduced with the following method:
    1: dd if=/dev/sdb of=/dev/null
    2: while true; do echo noop > scheduler; echo deadline > scheduler; done

    The fix was tested with the same method.
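
    A sketch of the fixed init path, modeled on the noop scheduler (details
    are approximate, not the literal patch): the init function allocates the
    elevator_queue itself and only publishes q->elevator once ->elevator_data
    is set, under the queue lock:

    static int noop_init_queue(struct request_queue *q, struct elevator_type *e)
    {
            struct elevator_queue *eq;
            struct noop_data *nd;

            eq = elevator_alloc(q, e);
            if (!eq)
                    return -ENOMEM;

            nd = kmalloc_node(sizeof(*nd), GFP_KERNEL, q->node);
            if (!nd) {
                    kobject_put(&eq->kobj);
                    return -ENOMEM;
            }
            eq->elevator_data = nd;
            INIT_LIST_HEAD(&nd->queue);

            spin_lock_irq(q->queue_lock);
            q->elevator = eq;
            spin_unlock_irq(q->queue_lock);
            return 0;
    }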

    Signed-off-by: Jianpeng Ma
    Signed-off-by: Jens Axboe

    Jianpeng Ma
     

24 Mar, 2013

1 commit

  • Just a little convenience macro - the main reason to add it now is to
    prepare for immutable bio vecs; it'll reduce the size of the patch that
    puts bi_sector/bi_size/bi_idx into a struct bvec_iter.
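
    The log doesn't name the macro here; assuming it is the bio end-sector
    helper, a plausible sketch of such a convenience macro (with the
    pre-bvec_iter field names) is:

    /* assumed sketch: first sector past the end of a bio */
    #define bio_end_sector(bio)     ((bio)->bi_sector + bio_sectors(bio))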

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: Lars Ellenberg
    CC: Jiri Kosina
    CC: Alasdair Kergon
    CC: dm-devel@redhat.com
    CC: Neil Brown
    CC: Martin Schwidefsky
    CC: Heiko Carstens
    CC: linux-s390@vger.kernel.org
    CC: Chris Mason
    CC: Steven Whitehouse
    Acked-by: Steven Whitehouse

    Kent Overstreet
     

10 Dec, 2012

1 commit


07 Mar, 2012

1 commit

  • elevator_ops->elevator_init_fn() has a weird return value. It returns
    a void * which the caller should assign to q->elevator->elevator_data,
    and a %NULL return denotes init failure.

    Update it so that it returns integer 0/-errno and sets elevator_data
    directly as necessary.

    This makes the interface more conventional and eases further cleanup.
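
    A sketch of the signature change (illustrative; the surrounding code is
    omitted):

    /* before: return the private data, NULL on failure */
    void *(*elevator_init_fn)(struct request_queue *);

    /* after: return 0/-errno and set q->elevator->elevator_data directly */
    int (*elevator_init_fn)(struct request_queue *);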

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     

14 Dec, 2011

1 commit

  • Let elevators set ->icq_size and ->icq_align in elevator_type, and have
    elv_register() and elv_unregister() create and destroy the kmem_cache
    for icq respectively.

    * elv_register() now can return failure. All callers updated.

    * icq caches are automatically named "ELVNAME_io_cq".

    * cfq_slab_setup/kill() are collapsed into cfq_init/exit().

    * While at it, minor indentation change for iosched_cfq.elevator_name
    for consistency.

    This will help moving icq management to block core. This doesn't
    introduce any functional change.
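
    A sketch of the elv_register() side (field names and buffer handling are
    approximate): when an elevator declares an icq_size, its icq cache is
    created and named after the elevator:

    if (e->icq_size) {
            snprintf(e->icq_cache_name, sizeof(e->icq_cache_name),
                     "%s_io_cq", e->elevator_name);
            e->icq_cache = kmem_cache_create(e->icq_cache_name, e->icq_size,
                                             e->icq_align, 0, NULL);
            if (!e->icq_cache)
                    return -ENOMEM;
    }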

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

03 Jun, 2011

1 commit

  • Hi, Jens,

    If you recall, I posted an RFC patch for this back in July of last year:
    http://lkml.org/lkml/2010/7/13/279

    The basic problem is that a process can issue a never-ending stream of
    async direct I/Os to the same sector on a device, thus starving out
    other I/O in the system (due to the way the alias handling works in both
    cfq and deadline). The solution I proposed back then was to start
    dispatching from the fifo after a certain number of aliases had been
    dispatched. Vivek asked why we had to treat aliases differently at all,
    and I never had a good answer. So, I put together a simple patch which
    allows aliases to be added to the rb tree (it adds them to the right,
    though that doesn't matter as the order isn't guaranteed anyway). I
    think this is the preferred solution, as it doesn't break up time slices
    in CFQ or batches in deadline. I've tested it, and it does solve the
    starvation issue. Let me know what you think.

    Cheers,
    Jeff
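
    For reference, the insertion this allows is just a plain rb-tree insert
    where an alias (equal sector) goes to the right; a sketch along the
    lines of elv_rb_add():

    struct rb_node **p = &root->rb_node, *parent = NULL;
    struct request *__rq;

    while (*p) {
            parent = *p;
            __rq = rb_entry(parent, struct request, rb_node);

            if (blk_rq_pos(rq) < blk_rq_pos(__rq))
                    p = &(*p)->rb_left;
            else    /* equal sectors (aliases) also go right */
                    p = &(*p)->rb_right;
    }
    rb_link_node(&rq->rb_node, parent, p);
    rb_insert_color(&rq->rb_node, root);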

    Signed-off-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jeff Moyer
     

10 Mar, 2011

1 commit

  • Code has been converted over to the new explicit on-stack plugging,
    and delay users have been converted to use the new API for that.
    So let's kill off the old plugging along with aops->sync_page().

    Signed-off-by: Jens Axboe

    Jens Axboe
     

11 May, 2009

1 commit

  • With recent cleanups, there is no place where a low level driver
    directly manipulates request fields. This means that the 'hard'
    request fields always equal the !hard fields. Convert all
    rq->sectors, nr_sectors and current_nr_sectors references to
    accessors.

    While at it, drop the superfluous blk_rq_pos() < 0 test in swim.c.

    [ Impact: use pos and nr_sectors accessors ]
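
    An illustrative before/after of the conversion (not lines taken from the
    patch):

    /* before: direct field access */
    sector_t pos       = rq->sector;
    unsigned int left  = rq->nr_sectors;
    unsigned int cur   = rq->current_nr_sectors;

    /* after: accessors */
    sector_t pos       = blk_rq_pos(rq);
    unsigned int left  = blk_rq_sectors(rq);
    unsigned int cur   = blk_rq_cur_sectors(rq);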

    Signed-off-by: Tejun Heo
    Acked-by: Geert Uytterhoeven
    Tested-by: Grant Likely
    Acked-by: Grant Likely
    Tested-by: Adrian McMenamin
    Acked-by: Adrian McMenamin
    Acked-by: Mike Miller
    Cc: James Bottomley
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Borislav Petkov
    Cc: Sergei Shtylyov
    Cc: Eric Moore
    Cc: Alan Stern
    Cc: FUJITA Tomonori
    Cc: Pete Zaitcev
    Cc: Stephen Rothwell
    Cc: Paul Clements
    Cc: Tim Waugh
    Cc: Jeff Garzik
    Cc: Jeremy Fitzhardinge
    Cc: Alex Dubov
    Cc: David Woodhouse
    Cc: Martin Schwidefsky
    Cc: Dario Ballabio
    Cc: David S. Miller
    Cc: Rusty Russell
    Cc: unsik Kim
    Cc: Laurent Vivier
    Signed-off-by: Jens Axboe

    Tejun Heo
     

29 Dec, 2008

1 commit


09 Oct, 2008

2 commits

  • * convert goto to simpler while loop;
    * use rq_end_sector() instead of computing manually;
    * fix incorrect comments;
    * remove spurious whitespace;
    * convert the rq_rb_root macro to an inline function (sketched below).
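
    A sketch of that macro-to-inline conversion (names follow the deadline
    scheduler; details approximate):

    /* before */
    #define RQ_RB_ROOT(dd, rq)      (&(dd)->sort_list[rq_data_dir((rq))])

    /* after */
    static inline struct rb_root *
    deadline_rb_root(struct deadline_data *dd, struct request *rq)
    {
            return &dd->sort_list[rq_data_dir(rq)];
    }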

    Signed-off-by: Aaron Carroll
    Signed-off-by: Jens Axboe

    Aaron Carroll
     
  • Deadline currently only batches sector-contiguous requests, so except
    for a few circumstances (e.g. requests in a single direction), it is
    essentially first come first served. This is bad for throughput, so
    change it to CSCAN, which means requests in a batch do not need to be
    sequential and are issued in increasing sector order.
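
    Conceptually, the batch-continuation test changes from strict contiguity
    to "anything at a higher sector"; an illustrative (not literal)
    comparison:

    /* illustrative only: names and structure are assumptions */
    static bool can_extend_batch(sector_t last_end, sector_t next_pos, bool cscan)
    {
            if (cscan)
                    return next_pos >= last_end;    /* C-SCAN: any higher sector */
            return next_pos == last_end;            /* old: strictly contiguous */
    }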

    Signed-off-by: Aaron Carroll
    Signed-off-by: Jens Axboe

    Aaron Carroll
     

18 Dec, 2007

1 commit


02 Nov, 2007

3 commits

  • After switching data directions, deadline always starts the next batch
    from the lowest-sector request. This gives excessive deadline expiries
    and large latency and throughput disparity between high- and low-sector
    requests; an order of magnitude in some tests.

    This patch changes the batching behaviour so new batches start from the
    request whose expiry is earliest.
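
    In other words, when a batch starts in a new direction, dispatch now
    picks the fifo head (the request whose deadline expires first) rather
    than the lowest-sector request; roughly (the exact line is an
    assumption):

    /* start the new batch at the request that has waited the longest */
    rq = rq_entry_fifo(dd->fifo_list[data_dir].next);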

    Signed-off-by: Aaron Carroll
    Signed-off-by: Jens Axboe

    Aaron Carroll
     
  • The deadline I/O scheduler does not reset the batch count when starting
    a new batch at a higher-sectored request. This means the second and
    subsequent batch in the same data direction will never exceed a single
    request in size whenever higher-sectored requests are pending.

    This patch gives new batches in the same data direction as old ones
    their full quota of requests by resetting the batch count.

    Signed-off-by: Aaron Carroll
    Signed-off-by: Jens Axboe

    Aaron Carroll
     
  • Factor finding the next request in sector-sorted order into
    a function deadline_latter_request.
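
    The helper is small; roughly (a sketch, since the patch body isn't shown
    here):

    static inline struct request *
    deadline_latter_request(struct request *rq)
    {
            struct rb_node *node = rb_next(&rq->rb_node);

            if (node)
                    return rb_entry_rq(node);
            return NULL;
    }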

    Signed-off-by: Aaron Carroll
    Signed-off-by: Jens Axboe

    Aaron Carroll
     

24 Jul, 2007

1 commit

  • Some of the code has been gradually transitioned to using the proper
    struct request_queue, but there's lots left. So do a full sweep of
    the kernel, get rid of this typedef and replace its uses with
    the proper type.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

18 Jul, 2007

1 commit

  • kmalloc_node() and kmem_cache_alloc_node() were not available in a
    zeroing variant in the past. But with __GFP_ZERO it is now possible to
    do the zeroing while allocating.

    Use __GFP_ZERO to remove the explicit clearing of memory via memset
    wherever we can.
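
    An illustrative before/after (not lines from the patch):

    /* before: allocate, then clear by hand */
    nd = kmalloc_node(sizeof(*nd), GFP_KERNEL, q->node);
    if (nd)
            memset(nd, 0, sizeof(*nd));

    /* after: let the allocator zero the memory */
    nd = kmalloc_node(sizeof(*nd), GFP_KERNEL | __GFP_ZERO, q->node);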

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

01 Dec, 2006

1 commit


01 Oct, 2006

4 commits


01 Jul, 2006

1 commit


23 Jun, 2006

2 commits


21 Jun, 2006

1 commit

  • * git://git.infradead.org/~dwmw2/rbtree-2.6:
    [RBTREE] Switch rb_colour() et al to en_US spelling of 'color' for consistency
    Update UML kernel/physmem.c to use rb_parent() accessor macro
    [RBTREE] Update hrtimers to use rb_parent() accessor macro.
    [RBTREE] Add explicit alignment to sizeof(long) for struct rb_node.
    [RBTREE] Merge colour and parent fields of struct rb_node.
    [RBTREE] Remove dead code in rb_erase()
    [RBTREE] Update JFFS2 to use rb_parent() accessor macro.
    [RBTREE] Update eventpoll.c to use rb_parent() accessor macro.
    [RBTREE] Update key.c to use rb_parent() accessor macro.
    [RBTREE] Update ext3 to use rb_parent() accessor macro.
    [RBTREE] Change rbtree off-tree marking in I/O schedulers.
    [RBTREE] Add accessor macros for colour and parent fields of rb_node

    Linus Torvalds
     

09 Jun, 2006

1 commit

  • There's a race between shutting down one io scheduler and firing up the
    next, in which a new io could enter and cause the io scheduler to be
    invoked with bad or NULL data.

    To fix this, we need to maintain the queue lock for a bit longer.
    Unfortunately we cannot do that, since the elevator init requires to be
    run without the lock held. This isn't easily fixable, without also
    changing the mempool API. So split the initialization into two parts,
    an alloc-init operation and an attach operation. Then we can
    preallocate the io scheduler and related structures, and run the attach
    inside the lock after we detach the old one.
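
    A sketch of the two-phase switch (the helper names here, such as
    elevator_alloc_and_init, are placeholders rather than the functions in
    the patch):

    /* phase 1: allocate and init the new scheduler, no locks held */
    new_e = elevator_alloc_and_init(q, new_type);
    if (!new_e)
            return -ENOMEM;

    /* phase 2: swap it in under the queue lock */
    spin_lock_irq(q->queue_lock);
    old_e = q->elevator;
    elevator_attach(q, new_e);
    spin_unlock_irq(q->queue_lock);

    elevator_exit(old_e);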

    This patch has survived 30 minutes of 1 second io scheduler switching
    with a very busy io load.

    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

21 Apr, 2006

1 commit


19 Mar, 2006

2 commits


06 Jan, 2006

1 commit

  • the patch below marks various read-only variables in block/* as const,
    so that gcc can optimize their use; e.g. gcc will now replace uses with
    the value directly and can even drop the storage for these variables
    entirely.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Jens Axboe

    Arjan van de Ven
     

19 Nov, 2005

1 commit


04 Nov, 2005

1 commit