16 Aug, 2016

1 commit

  • Commit 288dab8a35a0 ("block: add a separate operation type for secure
    erase") split REQ_OP_SECURE_ERASE from REQ_OP_DISCARD without considering
    all the places REQ_OP_DISCARD was being used to mean either. Fix those.
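
    As a rough illustration of the kind of check involved (a minimal,
    self-contained userspace sketch, not the kernel code; the enum and the
    helper below are stand-ins for req_op() and the REQ_OP_* constants):

    #include <stdbool.h>
    #include <stdio.h>

    /* Simplified stand-ins for the kernel's request operation types. */
    enum req_op { REQ_OP_READ, REQ_OP_WRITE, REQ_OP_DISCARD, REQ_OP_SECURE_ERASE };

    /* Before the split, a secure erase was a discard with an extra flag, so
     * testing for REQ_OP_DISCARD covered both; after the split, places that
     * should treat the two alike have to check both ops explicitly. */
    static bool op_discards_data(enum req_op op)
    {
            return op == REQ_OP_DISCARD || op == REQ_OP_SECURE_ERASE;
    }

    int main(void)
    {
            printf("%d\n", op_discards_data(REQ_OP_SECURE_ERASE)); /* prints 1 */
            return 0;
    }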

    Signed-off-by: Adrian Hunter
    Fixes: 288dab8a35a0 ("block: add a separate operation type for secure erase")
    Signed-off-by: Jens Axboe

    Adrian Hunter
     

21 Jul, 2016

1 commit

  • Before a bio is merged into an existing request, the io scheduler is
    consulted for its approval. However, requests that come from a plug
    flush may get merged by the block layer without consulting the io
    scheduler.

    In case of CFQ, this can cause fairness problems. For instance, if a
    request gets merged into a low-weight cgroup's request, the high-weight
    cgroup now depends on the low-weight cgroup to get scheduled. If the
    high-weight cgroup needs that io request to complete before submitting
    more requests, it will also lose its timeslice.

    Following script demonstrates the problem. Group g1 has a low weight, g2
    and g3 have equal high weights but g2's requests are adjacent to g1's
    requests so they are subject to merging. Due to these merges, g2 gets
    poor disk time allocation.

    cat > cfq-merge-repro.sh << "EOF"
    #!/bin/bash
    set -e

    IO_ROOT=/mnt-cgroup/io

    mkdir -p $IO_ROOT

    if ! mount | grep -qw $IO_ROOT; then
      mount -t cgroup none -oblkio $IO_ROOT
    fi

    cd $IO_ROOT

    for i in g1 g2 g3; do
      if [ -d $i ]; then
        rmdir $i
      fi
    done

    mkdir g1 && echo 10 > g1/blkio.weight
    mkdir g2 && echo 495 > g2/blkio.weight
    mkdir g3 && echo 495 > g3/blkio.weight

    RUNTIME=10

    (echo $BASHPID > g1/cgroup.procs &&
      fio --readonly --name name1 --filename /dev/sdb \
          --rw read --size 64k --bs 64k --time_based \
          --runtime=$RUNTIME --offset=0k &> /dev/null)&

    (echo $BASHPID > g2/cgroup.procs &&
      fio --readonly --name name1 --filename /dev/sdb \
          --rw read --size 64k --bs 64k --time_based \
          --runtime=$RUNTIME --offset=64k &> /dev/null)&

    (echo $BASHPID > g3/cgroup.procs &&
      fio --readonly --name name1 --filename /dev/sdb \
          --rw read --size 64k --bs 64k --time_based \
          --runtime=$RUNTIME --offset=256k &> /dev/null)&

    sleep $((RUNTIME+1))

    for i in g1 g2 g3; do
      echo ---- $i ----
      cat $i/blkio.time
    done

    EOF
    # ./cfq-merge-repro.sh
    ---- g1 ----
    8:16 162
    ---- g2 ----
    8:16 165
    ---- g3 ----
    8:16 686

    After applying the patch:

    # ./cfq-merge-repro.sh
    ---- g1 ----
    8:16 90
    ---- g2 ----
    8:16 445
    ---- g3 ----
    8:16 471

    Signed-off-by: Tahsin Erdogan
    Signed-off-by: Jens Axboe

    Tahsin Erdogan
     

08 Jun, 2016

1 commit

  • This patch converts the elevator code to use separate variables
    for the operation and flags, and to check req_op for the REQ_OP.
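
    A minimal sketch of the idea, using hypothetical stand-in types rather
    than the real block-layer structures: the operation and the modifier
    flags live in separate fields, and callers read the operation back
    through a small accessor in the spirit of req_op():

    #include <stdio.h>

    /* Hypothetical stand-ins; the real kernel types differ. */
    enum req_op { REQ_OP_READ, REQ_OP_WRITE, REQ_OP_DISCARD };

    struct request_sketch {
            enum req_op op;          /* what the request does              */
            unsigned int flags;      /* modifiers (sync, meta, ...) only   */
    };

    static enum req_op req_op_sketch(const struct request_sketch *rq)
    {
            return rq->op;           /* callers test the op, not the flags */
    }

    int main(void)
    {
            struct request_sketch rq = { .op = REQ_OP_WRITE, .flags = 0 };
            printf("is write: %d\n", req_op_sketch(&rq) == REQ_OP_WRITE);
            return 0;
    }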

    Signed-off-by: Mike Christie
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Mike Christie
     

22 Oct, 2015

1 commit

  • Now that bio splitting has been introduced, a bio can be split and
    marked as NOMERGE because it is too fat to be merged, so check
    bio_mergeable() earlier to avoid trying to merge it unnecessarily.

    Signed-off-by: Ming Lei
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Ming Lei
     

26 Jun, 2015

2 commits

  • Pull cgroup writeback support from Jens Axboe:
    "This is the big pull request for adding cgroup writeback support.

    This code has been in development for a long time, and it has been
    simmering in for-next for a good chunk of this cycle too. This is one
    of those problems that has been talked about for at least half a
    decade, finally there's a solution and code to go with it.

    Also see last week's writeup on LWN:

    http://lwn.net/Articles/648292/"

    * 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
    writeback, blkio: add documentation for cgroup writeback support
    vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
    writeback: do foreign inode detection iff cgroup writeback is enabled
    v9fs: fix error handling in v9fs_session_init()
    bdi: fix wrong error return value in cgwb_create()
    buffer: remove unusued 'ret' variable
    writeback: disassociate inodes from dying bdi_writebacks
    writeback: implement foreign cgroup inode bdi_writeback switching
    writeback: add lockdep annotation to inode_to_wb()
    writeback: use unlocked_inode_to_wb transaction in inode_congested()
    writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
    writeback: implement [locked_]inode_to_wb_and_lock_list()
    writeback: implement foreign cgroup inode detection
    writeback: make writeback_control track the inode being written back
    writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
    mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
    writeback: implement memcg writeback domain based throttling
    writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
    writeback: implement memcg wb_domain
    writeback: update wb_over_bg_thresh() to use wb_domain aware operations
    ...

    Linus Torvalds
     
  • Pull core block IO update from Jens Axboe:
    "Nothing really major in here, mostly a collection of smaller
    optimizations and cleanups, mixed with various fixes. In more detail,
    this contains:

    - Addition of policy specific data to blkcg for block cgroups. From
    Arianna Avanzini.

    - Various cleanups around command types from Christoph.

    - Cleanup of the suspend block I/O path from Christoph.

    - Plugging updates from Shaohua and Jeff Moyer, for blk-mq.

    - Eliminating atomic inc/dec of both remaining IO count and reference
    count in a bio. From me.

    - Fixes for SG gap and chunk size support for data-less (discards)
    IO, so we can merge these better. From me.

    - Small restructuring of blk-mq shared tag support, freeing drivers
    from iterating hardware queues. From Keith Busch.

    - A few cfq-iosched tweaks, from Tahsin Erdogan and me. Makes the
    IOPS mode the default for non-rotational storage"

    * 'for-4.2/core' of git://git.kernel.dk/linux-block: (35 commits)
    cfq-iosched: fix other locations where blkcg_to_cfqgd() can return NULL
    cfq-iosched: fix sysfs oops when attempting to read unconfigured weights
    cfq-iosched: move group scheduling functions under ifdef
    cfq-iosched: fix the setting of IOPS mode on SSDs
    blktrace: Add blktrace.c to BLOCK LAYER in MAINTAINERS file
    block, cgroup: implement policy-specific per-blkcg data
    block: Make CFQ default to IOPS mode on SSDs
    block: add blk_set_queue_dying() to blkdev.h
    blk-mq: Shared tag enhancements
    block: don't honor chunk sizes for data-less IO
    block: only honor SG gap prevention for merges that contain data
    block: fix returnvar.cocci warnings
    block, dm: don't copy bios for request clones
    block: remove management of bi_remaining when restoring original bi_end_io
    block: replace trylock with mutex_lock in blkdev_reread_part()
    block: export blkdev_reread_part() and __blkdev_reread_part()
    suspend: simplify block I/O handling
    block: collapse bio bit space
    block: remove unused BIO_RW_BLOCK and BIO_EOF flags
    block: remove BIO_EOPNOTSUPP
    ...

    Linus Torvalds
     

10 Jun, 2015

1 commit

  • A previous commit wanted to make CFQ default to IOPS mode on
    non-rotational storage, but it did so at queue initialization time,
    while the non-rotational flag is only set later on in the probe.

    Add an elevator hook that gets called off the add_disk() path; at that
    point we know that feature probing has finished, and we can reliably
    check the various flags that drivers can set.

    Fixes: 41c0126b ("block: Make CFQ default to IOPS mode on SSDs")
    Tested-by: Romain Francoise
    Signed-off-by: Jens Axboe

    Jens Axboe
     

02 Jun, 2015

1 commit


24 Apr, 2015

1 commit

  • The issue is described by the call path below:
      elevator_init
        -> elevator_init_fn
          -> {cfq,deadline,noop}_init_queue
            -> elevator_alloc
              -> kzalloc_node
    If kzalloc_node fails, the module reference is put in elevator_alloc;
    elevator_init_fn then fails as well and the module is put again in
    elevator_init.

    Remove the elevator_put invocation from the error path of
    elevator_alloc to avoid this double release.
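
    The bug pattern, reduced to a self-contained sketch with a plain
    counter standing in for the module refcount: if the inner allocation
    helper drops the reference on failure and its caller does so again,
    the count goes negative.

    #include <stdio.h>
    #include <stdlib.h>

    static int module_refs = 1;              /* stand-in for the elevator module ref */

    static void elevator_put_sketch(void) { module_refs--; }

    /* Inner helper: before the fix it also dropped the reference on failure. */
    static void *elevator_alloc_sketch(int fail)
    {
            void *eq = fail ? NULL : malloc(16);
            if (!eq) {
                    /* elevator_put_sketch();   <-- removed by the fix */
            }
            return eq;
    }

    int main(void)
    {
            if (!elevator_alloc_sketch(1))
                    elevator_put_sketch();   /* caller releases exactly once on error */
            printf("refs after failed init: %d\n", module_refs);   /* 0, not -1 */
            return 0;
    }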

    Signed-off-by: Chao Yu
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Chao Yu
     

04 Dec, 2014

1 commit

  • After commit b2b49ccbdd54 (PM: Kconfig: Set PM_RUNTIME if PM_SLEEP is
    selected) PM_RUNTIME is always set if PM is set, so #ifdef blocks
    depending on CONFIG_PM_RUNTIME may now be changed to depend on
    CONFIG_PM.

    Replace CONFIG_PM_RUNTIME with CONFIG_PM in the block device core.
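
    The mechanical shape of the change, as a compile-time sketch; the
    CONFIG_* macro is defined by hand here purely for illustration, since
    in the kernel it comes from the build configuration:

    #include <stdio.h>

    #define CONFIG_PM 1          /* pretend the kernel was built with PM support */

    #ifdef CONFIG_PM             /* formerly guarded by: #ifdef CONFIG_PM_RUNTIME */
    static const char *pm_note = "runtime PM code built in";
    #else
    static const char *pm_note = "runtime PM code compiled out";
    #endif

    int main(void) { puts(pm_note); return 0; }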

    Reviewed-by: Aaron Lu
    Acked-by: Jens Axboe
    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     

24 Oct, 2014

1 commit

  • While compiling, integer err was flagged as a set-but-unused variable.
    elevator_init_fn can be either cfq_init_queue, deadline_init_queue or
    noop_init_queue, and all three of these functions return -ENOMEM if
    they fail to allocate the queue, so we should actually be returning
    that error code rather than always returning 0.

    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Jens Axboe

    Sudip Mukherjee
     

23 Jun, 2014

1 commit


12 Jun, 2014

1 commit


11 Jun, 2014

1 commit


10 Apr, 2014

1 commit

  • Martin reported that his test system would not boot with
    current git, it oopsed with this:

    BUG: unable to handle kernel paging request at ffff88046c6c9e80
    IP: [] blk_queue_start_tag+0x90/0x150
    PGD 1ddf067 PUD 1de2067 PMD 47fc7d067 PTE 800000046c6c9060
    Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
    Modules linked in: sd_mod lpfc(+) scsi_transport_fc scsi_tgt oracleasm
    rpcsec_gss_krb5 ipv6 igb dca i2c_algo_bit i2c_core hwmon
    CPU: 3 PID: 87 Comm: kworker/u17:1 Not tainted 3.14.0+ #246
    Hardware name: Supermicro X9DRX+-F/X9DRX+-F, BIOS 3.00 07/09/2013
    Workqueue: events_unbound async_run_entry_fn
    task: ffff8802743c2150 ti: ffff880273d02000 task.ti: ffff880273d02000
    RIP: 0010:[] []
    blk_queue_start_tag+0x90/0x150
    RSP: 0018:ffff880273d03a58 EFLAGS: 00010092
    RAX: ffff88046c6c9e78 RBX: ffff880077208e78 RCX: 00000000fffc8da6
    RDX: 00000000fffc186d RSI: 0000000000000009 RDI: 00000000fffc8d9d
    RBP: ffff880273d03a88 R08: 0000000000000001 R09: ffff8800021c2410
    R10: 0000000000000005 R11: 0000000000015b30 R12: ffff88046c5bb8a0
    R13: ffff88046c5c0890 R14: 000000000000001e R15: 000000000000001e
    FS: 0000000000000000(0000) GS:ffff880277b00000(0000)
    knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffff88046c6c9e80 CR3: 00000000018f6000 CR4: 00000000000407e0
    Stack:
    ffff880273d03a98 ffff880474b18800 0000000000000000 ffff880474157000
    ffff88046c5c0890 ffff880077208e78 ffff880273d03ae8 ffffffff813b9e62
    ffff880200000010 ffff880474b18968 ffff880474b18848 ffff88046c5c0cd8
    Call Trace:
    [] scsi_request_fn+0xf2/0x510
    [] __blk_run_queue+0x37/0x50
    [] blk_execute_rq_nowait+0xb3/0x130
    [] blk_execute_rq+0x64/0xf0
    [] ? bit_waitqueue+0xd0/0xd0
    [] scsi_execute+0xe5/0x180
    [] scsi_execute_req_flags+0x9a/0x110
    [] sd_spinup_disk+0x94/0x460 [sd_mod]
    [] ? __unmap_hugepage_range+0x200/0x2f0
    [] sd_revalidate_disk+0xaa/0x3f0 [sd_mod]
    [] sd_probe_async+0xd8/0x200 [sd_mod]
    [] async_run_entry_fn+0x3f/0x140
    [] process_one_work+0x175/0x410
    [] worker_thread+0x123/0x400
    [] ? manage_workers+0x160/0x160
    [] kthread+0xce/0xf0
    [] ? kthread_freezable_should_stop+0x70/0x70
    [] ret_from_fork+0x7c/0xb0
    [] ? kthread_freezable_should_stop+0x70/0x70
    Code: 48 0f ab 11 72 db 48 81 4b 40 00 00 10 00 89 83 08 01 00 00 48 89
    df 49 8b 04 24 48 89 1c d0 e8 f7 a8 ff ff 49 8b 85 28 05 00 00 89
    58 08 48 89 03 49 8d 85 28 05 00 00 48 89 43 08 49 89 9d
    RIP [] blk_queue_start_tag+0x90/0x150
    RSP
    CR2: ffff88046c6c9e80

    Martin bisected and found this to be the problem patch;

    commit 6d113398dcf4dfcd9787a4ead738b186f7b7ff0f
    Author: Jan Kara
    Date: Mon Feb 24 16:39:54 2014 +0100

    block: Stop abusing rq->csd.list in blk-softirq

    and the problem was immediately apparent. The patch states that
    it is safe to reuse queuelist at completion time, since it is
    no longer used. However, that is not true if a device is using
    block enabled tagging. If that is the case, then the queuelist
    is reused to keep track of busy tags. If a device also ended
    up using softirq completions, we'd reuse ->queuelist for the
    IPI handling while block tagging was still using it. Boom.

    Fix this by adding a new ipi_list list head, and share the
    memory used with the request hash table. The hash table is
    never used after the request is moved to the dispatch list,
    which happens long before any potential completion of the
    request. Add a new request bit for this, so we don't have
    cases that check rq->hash while it could potentially have
    been reused for the IPI completion.
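
    The memory-sharing trick, reduced to a self-contained sketch: two
    members that are never live at the same time overlay the same storage
    through a union, in the spirit of letting the IPI list head reuse the
    space of the elevator hash node (the types here are simplified
    stand-ins, not the real struct request layout).

    #include <stdio.h>

    struct list_node { struct list_node *next, *prev; };

    struct request_sketch {
            union {
                    struct list_node hash_node;  /* live while on the elevator hash      */
                    struct list_node ipi_list;   /* reused for softirq/IPI completion    */
            };
            int hashed;  /* stand-in for the new flag saying which member is live */
    };

    int main(void)
    {
            struct request_sketch rq = { .hashed = 1 };
            printf("same storage: %d\n",
                   (void *)&rq.hash_node == (void *)&rq.ipi_list);  /* prints 1 */
            return 0;
    }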

    Reported-by: Martin K. Petersen
    Tested-by: Benjamin Herrenschmidt
    Signed-off-by: Jens Axboe

    Jens Axboe
     

24 Nov, 2013

1 commit

  • Immutable biovecs are going to require an explicit iterator. To
    implement immutable bvecs, a later patch is going to add a bi_bvec_done
    member to this struct; for now, this patch effectively just renames
    things.

    Signed-off-by: Kent Overstreet
    Cc: Jens Axboe
    Cc: Geert Uytterhoeven
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Ed L. Cashin"
    Cc: Nick Piggin
    Cc: Lars Ellenberg
    Cc: Jiri Kosina
    Cc: Matthew Wilcox
    Cc: Geoff Levand
    Cc: Yehuda Sadeh
    Cc: Sage Weil
    Cc: Alex Elder
    Cc: ceph-devel@vger.kernel.org
    Cc: Joshua Morris
    Cc: Philip Kelleher
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Konrad Rzeszutek Wilk
    Cc: Jeremy Fitzhardinge
    Cc: Neil Brown
    Cc: Alasdair Kergon
    Cc: Mike Snitzer
    Cc: dm-devel@redhat.com
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: linux390@de.ibm.com
    Cc: Boaz Harrosh
    Cc: Benny Halevy
    Cc: "James E.J. Bottomley"
    Cc: Greg Kroah-Hartman
    Cc: "Nicholas A. Bellinger"
    Cc: Alexander Viro
    Cc: Chris Mason
    Cc: "Theodore Ts'o"
    Cc: Andreas Dilger
    Cc: Jaegeuk Kim
    Cc: Steven Whitehouse
    Cc: Dave Kleikamp
    Cc: Joern Engel
    Cc: Prasad Joshi
    Cc: Trond Myklebust
    Cc: KONISHI Ryusuke
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Ben Myers
    Cc: xfs@oss.sgi.com
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Len Brown
    Cc: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Cc: Herton Ronaldo Krzesinski
    Cc: Ben Hutchings
    Cc: Andrew Morton
    Cc: Guo Chao
    Cc: Tejun Heo
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Wei Yongjun
    Cc: "Roger Pau Monné"
    Cc: Jan Beulich
    Cc: Stefano Stabellini
    Cc: Ian Campbell
    Cc: Sebastian Ott
    Cc: Christian Borntraeger
    Cc: Minchan Kim
    Cc: Jiang Liu
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Joe Perches
    Cc: Peng Tao
    Cc: Andy Adamson
    Cc: fanchaoting
    Cc: Jie Liu
    Cc: Sunil Mushran
    Cc: "Martin K. Petersen"
    Cc: Namjae Jeon
    Cc: Pankaj Kumar
    Cc: Dan Magenheimer
    Cc: Mel Gorman

    Kent Overstreet
     

09 Nov, 2013

2 commits

  • Add locking of q->sysfs_lock into elevator_change() (an exported
    function) to ensure it is held to protect q->elevator from
    elevator_init(), even if elevator_change() is called from non-sysfs
    paths. The sysfs path (elv_iosched_store) uses __elevator_change(), the
    non-locking version, since elv_iosched_store() already holds the lock.
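
    The shape of the resulting interface, as a self-contained sketch with a
    pthread mutex standing in for q->sysfs_lock: the exported entry point
    takes the lock around an unlocked internal helper, and callers that
    already hold the lock call the helper directly.

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t sysfs_lock = PTHREAD_MUTEX_INITIALIZER;  /* stand-in for q->sysfs_lock */

    /* Non-locking version: callers must already hold sysfs_lock
     * (as the sysfs store path does). */
    static int __elevator_change_sketch(const char *name)
    {
            printf("switching elevator to %s\n", name);
            return 0;
    }

    /* Exported version: safe to call from non-sysfs paths, takes the lock itself. */
    static int elevator_change_sketch(const char *name)
    {
            int ret;

            pthread_mutex_lock(&sysfs_lock);
            ret = __elevator_change_sketch(name);
            pthread_mutex_unlock(&sysfs_lock);
            return ret;
    }

    int main(void) { return elevator_change_sketch("deadline"); }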

    Signed-off-by: Tomoki Sekiyama
    Signed-off-by: Jens Axboe

    Tomoki Sekiyama
     
  • The soft lockup below happens at boot time on a system that uses dm
    multipath and udev rules to switch the scheduler.

    [ 356.127001] BUG: soft lockup - CPU#3 stuck for 22s! [sh:483]
    [ 356.127001] RIP: 0010:[] [] lock_timer_base.isra.35+0x1d/0x50
    ...
    [ 356.127001] Call Trace:
    [ 356.127001] [] try_to_del_timer_sync+0x20/0x70
    [ 356.127001] [] ? kmem_cache_alloc_node_trace+0x20a/0x230
    [ 356.127001] [] del_timer_sync+0x52/0x60
    [ 356.127001] [] cfq_exit_queue+0x32/0xf0
    [ 356.127001] [] elevator_exit+0x2f/0x50
    [ 356.127001] [] elevator_change+0xf1/0x1c0
    [ 356.127001] [] elv_iosched_store+0x20/0x50
    [ 356.127001] [] queue_attr_store+0x59/0xb0
    [ 356.127001] [] sysfs_write_file+0xc6/0x140
    [ 356.127001] [] vfs_write+0xbd/0x1e0
    [ 356.127001] [] SyS_write+0x49/0xa0
    [ 356.127001] [] system_call_fastpath+0x16/0x1b

    This is caused by a race between md device initialization by multipathd
    and a shell script switching the scheduler via sysfs.

    - multipathd:
    SyS_ioctl -> do_vfs_ioctl -> dm_ctl_ioctl -> ctl_ioctl -> table_load
    -> dm_setup_md_queue -> blk_init_allocated_queue -> elevator_init
    q->elevator = elevator_alloc(q, e); // not yet initialized

    - sh -c 'echo deadline > /sys/$DEVPATH/queue/scheduler':
    elevator_switch (in the call trace above)
    struct elevator_queue *old = q->elevator;
    q->elevator = elevator_alloc(q, new_e);
    elevator_exit(old); // lockup! (*)

    - multipathd: (cont.)
    err = e->ops.elevator_init_fn(q); // init fails; q->elevator is modified

    (*) When del_timer_sync() is called, lock_timer_base() will loop infinitely
    while timer->base == NULL. In this case, as the timer will never be
    initialized, it results in a lockup.

    This patch introduces acquisition of q->sysfs_lock around elevator_init()
    into blk_init_allocated_queue(), to provide mutual exclusion between
    initialization of the q->scheduler and switching of the scheduler.

    This should fix this bugzilla:
    https://bugzilla.redhat.com/show_bug.cgi?id=902012

    Signed-off-by: Tomoki Sekiyama
    Signed-off-by: Jens Axboe

    Tomoki Sekiyama
     

12 Sep, 2013

1 commit


03 Jul, 2013

1 commit

  • There's a race between elevator switching and normal io operation,
    because the allocation of struct elevator_queue and struct elevator_data
    is not done as one atomic operation, so there is a window in which a
    NULL ->elevator_data can be used.
    For example:

    Thread A:                        Thread B:
    blk_queue_bio                    elevator_switch
      spin_lock_irq(q->queue_lock)     elevator_alloc
      elv_merge                        elevator_init_fn

    elevator_alloc is called without the queue lock held, so at that point
    ->elevator_data is still NULL. Meanwhile thread A calls elv_merge, which
    needs data from elevator_data, and the crash happens.

    Move elevator_alloc into elevator_init_fn so that both allocations
    happen as one atomic operation.

    The bug is easy to reproduce with the following:
    1: dd if=/dev/sdb of=/dev/null
    2: while true; do echo noop > scheduler; echo deadline > scheduler; done

    The same method was also used to test the fix.

    Signed-off-by: Jianpeng Ma
    Signed-off-by: Jens Axboe

    Jianpeng Ma
     

23 Mar, 2013

1 commit

  • When a request is added:
    If the device is suspended or suspending and the request is not a
    PM request, resume the device.

    When the last request finishes:
    Call pm_runtime_mark_last_busy().

    When picking a request:
    If the device is resuming/suspending, only PM requests are allowed
    to go.

    The idea and API were designed by Alan Stern and are described here:
    http://marc.info/?l=linux-scsi&m=133727953625963&w=2
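
    A sketch of the first and third rules as plain decision logic, with
    hypothetical stand-ins for the runtime-PM status and the PM-request
    flag (the real code lives in the block core and uses the pm_runtime_*
    API):

    #include <stdbool.h>
    #include <stdio.h>

    enum rpm_status { RPM_ACTIVE, RPM_RESUMING, RPM_SUSPENDING, RPM_SUSPENDED };

    /* Rule 1: adding a non-PM request to a suspended/suspending device resumes it. */
    static bool request_add_needs_resume(enum rpm_status s, bool is_pm_request)
    {
            return !is_pm_request && (s == RPM_SUSPENDED || s == RPM_SUSPENDING);
    }

    /* Rule 3: while resuming/suspending, only PM requests may be dispatched. */
    static bool may_dispatch(enum rpm_status s, bool is_pm_request)
    {
            if (s == RPM_RESUMING || s == RPM_SUSPENDING)
                    return is_pm_request;
            return true;
    }

    int main(void)
    {
            /* Rule 2 (mark last busy when the final request completes) is just a
             * pm_runtime_mark_last_busy() call and is not modelled here. */
            printf("%d %d\n", request_add_needs_resume(RPM_SUSPENDED, false),
                   may_dispatch(RPM_SUSPENDING, false));   /* prints: 1 0 */
            return 0;
    }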

    Signed-off-by: Lin Ming
    Signed-off-by: Aaron Lu
    Acked-by: Alan Stern
    Signed-off-by: Jens Axboe

    Lin Ming
     

01 Mar, 2013

1 commit

  • Pull block IO core bits from Jens Axboe:
    "Below are the core block IO bits for 3.9. It was delayed a few days
    since my workstation kept crashing every 2-8h after pulling it into
    current -git, but turns out it is a bug in the new pstate code (divide
    by zero, will report separately). In any case, it contains:

    - The big cfq/blkcg update from Tejun and Vivek.

    - Additional block and writeback tracepoints from Tejun.

    - Improvement of the should sort (based on queues) logic in the plug
    flushing.

    - _io() variants of the wait_for_completion() interface, using
    io_schedule() instead of schedule() to contribute to io wait
    properly.

    - Various little fixes.

    You'll get two trivial merge conflicts, which should be easy enough to
    fix up"

    Fix up the trivial conflicts due to hlist traversal cleanups (commit
    b67bfe0d42ca: "hlist: drop the node parameter from iterators").

    * 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
    block: remove redundant check to bd_openers()
    block: use i_size_write() in bd_set_size()
    cfq: fix lock imbalance with failed allocations
    drivers/block/swim3.c: fix null pointer dereference
    block: don't select PERCPU_RWSEM
    block: account iowait time when waiting for completion of IO request
    sched: add wait_for_completion_io[_timeout]
    writeback: add more tracepoints
    block: add block_{touch|dirty}_buffer tracepoint
    buffer: make touch_buffer() an exported function
    block: add @req to bio_{front|back}_merge tracepoints
    block: add missing block_bio_complete() tracepoint
    block: Remove should_sort judgement when flush blk_plug
    block,elevator: use new hashtable implementation
    cfq-iosched: add hierarchical cfq_group statistics
    cfq-iosched: collect stats from dead cfqgs
    cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
    blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
    block: RCU free request_queue
    blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
    ...

    Linus Torvalds
     

28 Feb, 2013

1 commit

  • I'm not sure why, but the hlist for each entry iterators were conceived
    differently from the list ones. While the list ones are formed as:

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only
    do they not really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.
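
    To make the difference concrete, here is a hedged, self-contained
    sketch on a hypothetical hlist-like structure (the kernel macro shapes
    are quoted only in the comments; the loop below mimics the new,
    cursor-free style by deriving the entry from the node directly):

    #include <stddef.h>
    #include <stdio.h>

    /* Minimal stand-ins for hlist_node / an entry embedding it. */
    struct hnode { struct hnode *next; };
    struct item  { int value; struct hnode node; };

    #define container_of_sketch(ptr, type, member) \
            ((type *)((char *)(ptr) - offsetof(type, member)))

    int main(void)
    {
            struct item a = { .value = 1 }, b = { .value = 2 };
            struct hnode *head = &a.node;
            a.node.next = &b.node;
            b.node.next = NULL;

            /* Old style needed a separate node cursor besides the entry:
             *     hlist_for_each_entry(tpos, pos, head, member)
             * New style derives the entry straight from the node cursor,
             * just like the list iterator always did:
             *     hlist_for_each_entry(tpos, head, member)
             */
            for (struct hnode *pos = head; pos; pos = pos->next) {
                    struct item *entry = container_of_sketch(pos, struct item, node);
                    printf("%d\n", entry->value);
            }
            return 0;
    }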

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small amount of places were using the 'node' parameter, this
    was modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foudnation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

23 Jan, 2013

1 commit

  • The block layer allows an elevator that is built as a module to be
    selected as the system default via the kernel param "elevator=". This is
    achieved by automatically invoking request_module() whenever a new
    block device is initialized and the elevator is not available.

    This led to an interesting deadlock problem involving async and module
    init. Block device probing running off an async job invokes
    request_module(). While the module is being loaded, it performs
    async_synchronize_full() which ends up waiting for the async job which
    is already waiting for request_module() to finish, leading to
    deadlock.

    Invoking request_module() from deep in block device init path is
    already nasty in itself. It seems best to avoid these situations from
    the beginning by moving on-demand module loading out of block init
    path.

    The previous patch made sure that the default elevator module is
    loaded early during boot if available. This patch removes on-demand
    loading of the default elevator from elevator init path. As the
    module would have been loaded during boot, userland-visible behavior
    difference should be minimal.

    For more details, please refer to the following thread.

    http://thread.gmane.org/gmane.linux.kernel/1420814

    v2: The bool parameter was named @request_module which conflicted with
    request_module(). This built okay w/ CONFIG_MODULES because
    request_module() was defined as a macro. W/o CONFIG_MODULES, it
    causes build breakage. Rename the parameter to @try_loading.
    Reported by Fengguang.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Arjan van de Ven
    Cc: Linus Torvalds
    Cc: Alex Riesen
    Cc: Fengguang Wu

    Tejun Heo
     

19 Jan, 2013

1 commit

  • This patch adds default module loading and uses it to load the default
    block elevator. During boot, it's called right after initramfs or
    initrd is made available and right before control is passed to
    userland. This ensures that as long as the modules are available in
    the usual places in initramfs, initrd or the root filesystem, the
    default modules are loaded as soon as possible.

    This will replace the on-demand elevator module loading from elevator
    init path.

    v2: Fixed build breakage when !CONFIG_BLOCK. Reported by kbuild test
    robot.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Arjan van de Ven
    Cc: Linus Torvalds
    Cc: Alex Riesen
    Cc: Fengguang Wu

    Tejun Heo
     

11 Jan, 2013

1 commit

  • Switch elevator to use the new hashtable implementation. This reduces the
    amount of generic unrelated code in the elevator.

    This also removes the dynamic allocation of the hash table. The size of
    the table is constant, so there's no point in paying the price of an
    extra dereference when accessing it.

    This patch depends on d9b482c ("hashtable: introduce a small and naive
    hashtable") which was merged in v3.6.

    Signed-off-by: Sasha Levin
    Signed-off-by: Jens Axboe

    Sasha Levin
     

09 Nov, 2012

1 commit

  • In a workload, thread 1 accesses a, a+2, ..., and thread 2 accesses a+1,
    a+3, .... When the requests are flushed to the queue, a and a+1 are
    merged to (a, a+1), and a+2 and a+3 to (a+2, a+3), but (a, a+1) and
    (a+2, a+3) aren't merged.

    If we do a recursive merge for such interleaved access, some workloads
    get a throughput improvement. A recent workload I'm checking on is swap;
    the change below boosts its throughput by around 5% ~ 10%.
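
    The effect on the interleaved pattern, sketched with plain integer
    ranges standing in for queued requests: after the one-level merges
    produce (a, a+1) and (a+2, a+3), a further merge pass is needed to join
    the two into a single request.

    #include <stdbool.h>
    #include <stdio.h>

    struct rq { long start, len; };   /* stand-in for a queued request's sector range */

    /* One pass of back-merging adjacent requests; returns true if anything merged. */
    static bool merge_pass(struct rq *q, int *n)
    {
            bool merged = false;
            for (int i = 0; i + 1 < *n; i++) {
                    if (q[i].start + q[i].len == q[i + 1].start) {
                            q[i].len += q[i + 1].len;
                            for (int j = i + 1; j + 1 < *n; j++)
                                    q[j] = q[j + 1];
                            (*n)--;
                            merged = true;
                    }
            }
            return merged;
    }

    int main(void)
    {
            /* Thread 1 queued a, a+2; thread 2 queued a+1, a+3 (unit-sized requests). */
            struct rq q[] = { {0, 1}, {1, 1}, {2, 1}, {3, 1} };
            int n = 4;

            merge_pass(q, &n);                /* one level: (a,a+1) and (a+2,a+3)  */
            while (merge_pass(q, &n))         /* the recursive merge joins the halves */
                    ;
            printf("%d request(s), len %ld\n", n, q[0].len);   /* 1 request, len 4 */
            return 0;
    }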

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     

20 Sep, 2012

1 commit

  • Remove special-casing of non-rw fs style requests (discard). The nomerge
    flags are consolidated in blk_types.h, and rq_mergeable() and
    bio_mergeable() have been modified to use them.

    bio_is_rw() is used in place of bio_has_data() in a few places. This is
    done to distinguish true reads and writes from other fs type requests
    that carry a payload (e.g. write same).

    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

20 Apr, 2012

1 commit

  • All blkcg policies were assumed to be enabled on all request_queues.
    Due to various implementation obstacles, during the recent blkcg core
    updates, this was temporarily implemented as shooting down all !root
    blkgs on elevator switch and policy [de]registration combined with
    half-broken in-place root blkg updates. In addition to being buggy
    and racy, this meant losing all blkcg configurations across those
    events.

    Now that blkcg is cleaned up enough, this patch replaces the temporary
    implementation with proper per-queue policy activation. Each blkcg
    policy should call the new blkcg_[de]activate_policy() to enable and
    disable the policy on a specific queue. blkcg_activate_policy()
    allocates and installs policy data for the policy for all existing
    blkgs. blkcg_deactivate_policy() does the reverse. If a policy is
    not enabled for a given queue, blkg printing / config functions skip
    the respective blkg for the queue.

    blkcg_activate_policy() also takes care of root blkg creation, and
    cfq_init_queue() and blk_throtl_init() are updated accordingly.

    This makes blkcg_bypass_{start|end}() and update_root_blkg_pd()
    unnecessary. Both are dropped.

    v2: cfq_init_queue() was returning uninitialized @ret on root_group
    alloc failure if !CONFIG_CFQ_GROUP_IOSCHED. Fixed.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     

07 Mar, 2012

7 commits

  • IO scheduling and cgroup are tied to the issuing task via io_context
    and cgroup of %current. Unfortunately, there are cases where IOs need
    to be routed via a different task which makes scheduling and cgroup
    limit enforcement applied completely incorrectly.

    For example, all bios delayed by blk-throttle end up being issued by a
    delayed work item and get assigned the io_context of the worker task
    which happens to serve the work item and dumped to the default block
    cgroup. This is double confusing as bios which aren't delayed end up
    in the correct cgroup and makes using blk-throttle and cfq propio
    together impossible.

    Any code which punts IO issuing to another task is affected which is
    getting more and more common (e.g. btrfs). As both io_context and
    cgroup are firmly tied to task including userland visible APIs to
    manipulate them, it makes a lot of sense to match up tasks to bios.

    This patch implements bio_associate_current() which associates the
    specified bio with %current. The bio will record the associated ioc
    and blkcg at that point and block layer will use the recorded ones
    regardless of which task actually ends up issuing the bio. bio
    release puts the associated ioc and blkcg.

    It grabs and remembers ioc and blkcg instead of the task itself,
    because the task may already be dead by the time the bio is issued,
    making ioc and blkcg inaccessible, and those are all the block layer
    cares about.

    elevator_set_req_fn() is updated such that the bio elvdata is being
    allocated for is available to the elevator.

    This doesn't update block cgroup policies yet. Further patches will
    implement the support.

    -v2: #ifdef CONFIG_BLK_CGROUP added around bio->bi_ioc dereference in
    rq_ioc() to fix build breakage.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Kent Overstreet
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, blkg is per cgroup-queue-policy combination. This is
    unnatural and leads to various convolutions in partially used
    duplicate fields in blkg, config / stat access, and general management
    of blkgs.

    This patch makes blkgs per cgroup-queue and lets them serve all
    policies. blkgs are now created and destroyed by blkcg core proper.
    This will allow further consolidation of common management logic into
    blkcg core and API with better defined semantics and layering.

    As a transitional step to untangle blkg management, elvswitch and
    policy [de]registration, all blkgs except the root blkg are being shot
    down during elvswitch and bypass. This patch adds blkg_root_update()
    to update root blkg in place on policy change. This is hacky and racy
    but should be good enough as interim step until we get locking
    simplified and switch over to proper in-place update for all blkgs.

    -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc()
    comment wasn't updated according to the function change. Fixed.
    Both pointed out by Vivek.

    -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for
    all policies. This freed root pd during elvswitch before the
    last queue finished exiting and led to oops. Directly invoke
    update_root_blkg_pd() only on BLKIO_POLICY_PROP from
    cfq_exit_queue(). This also is closer to what will be done with
    proper in-place blkg update. Reported by Vivek.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • With the previous patch to move blkg list heads and counters to
    request_queue and blkg, logic to manage them in both policies are
    almost identical and can be moved to blkcg core.

    This patch moves blkg link logic into blkg_lookup_create(), implements
    common blkg unlink code in blkg_destroy(), and updates
    blkg_destroy_all() so that it's policy specific and can skip the root
    group. The updated blkg_destroy_all() is now used to both clear queue
    for bypassing and elv switching, and release all blkgs on q exit.

    This patch introduces a race window where policy [de]registration may
    race against queue blkg clearing. This can only be a problem on cfq
    unload and shouldn't be a real problem in practice (and we have many
    other places where this race already exists). Future patches will
    remove these unlikely races.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Elevator switch may involve changes to blkcg policies. Implement
    shoot down of blkio_groups.

    Combined with the previous bypass updates, the end goal is updating
    blkcg core such that it can ensure that blkcg's being affected become
    quiescent and don't have any per-blkg data hanging around before
    commencing any policy updates. Until queues are made aware of the
    policies that applies to them, as an interim step, all per-policy blkg
    data will be shot down.

    * blk-throtl doesn't need this change as it can't be disabled for a
    live queue; however, update it anyway as the scheduled blkg
    unification requires this behavior change. This means that
    blk-throtl configuration will be unnecessarily lost over elevator
    switch. This oddity will be removed after blkcg learns to associate
    individual policies with request_queues.

    * blk-throtl doesn't shoot down root_tg. This is to ease transition.
    The unified blkg will always have a persistent root group, and not
    shooting down root_tg for now eases the transition to that point by
    avoiding having to update td->root_tg; it is safe as blk-throtl can
    never be disabled.

    -v2: Vivek pointed out that group list is not guaranteed to be empty
    on return from clear function if it raced cgroup removal and
    lost. Fix it by waiting a bit and retrying. This kludge will
    soon be removed once locking is updated such that blkg is never
    in limbo state between blkcg and request_queue locks.

    blk-throtl no longer shoots down root_tg to avoid breaking
    td->root_tg.

    Also, Nest queue_lock inside blkio_list_lock not the other way
    around to avoid introduce possible deadlock via blkcg lock.

    -v3: blkcg_clear_queue() repositioned and renamed to
    blkg_destroy_all() to increase consistency with later changes.
    cfq_clear_queue() updated to check q->elevator before
    dereferencing it to avoid NULL dereference on not fully
    initialized queues (used by later change).

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Rename and extend elv_quiesce_start/end() to
    blk_queue_bypass_start/end(), which are exported and support nesting
    via @q->bypass_depth. Also add blk_queue_bypass() to test bypass
    state.

    This will be further extended and used for blkio_group management.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • elevator_ops->elevator_init_fn() has a weird return value. It returns
    a void * which the caller should assign to q->elevator->elevator_data
    and %NULL return denotes init failure.

    Update such that it returns integer 0/-errno and sets elevator_data
    directly as necessary.

    This makes the interface more conventional and eases further cleanup.
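
    The interface change in miniature, with hypothetical stand-in types:
    instead of returning a void * that the caller stores into
    elevator_data (NULL meaning failure), the init function stores the
    pointer itself and returns 0 or -errno.

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct elevator_queue_sketch { void *elevator_data; };
    struct request_queue_sketch  { struct elevator_queue_sketch *elevator; };

    /* Old style (sketch): return the data pointer, NULL on failure:
     *     static void *init_queue_old(struct request_queue_sketch *q);
     * New style: set q->elevator->elevator_data directly, return 0/-errno. */
    static int init_queue_new(struct request_queue_sketch *q)
    {
            void *data = malloc(64);
            if (!data)
                    return -ENOMEM;
            q->elevator->elevator_data = data;
            return 0;
    }

    int main(void)
    {
            struct elevator_queue_sketch eq = { 0 };
            struct request_queue_sketch q = { .elevator = &eq };
            printf("init: %d\n", init_queue_new(&q));
            free(eq.elevator_data);
            return 0;
    }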

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Elevator switch tries hard to keep as much context as possible until the
    new elevator is ready, so that it can revert to the original state if
    initializing the new elevator fails for some reason. Unfortunately,
    with more auxiliary contexts to manage, this makes the elevator init and
    exit paths too complex and fragile.

    This patch makes elevator_switch() unregister the current elevator and
    flush icq's before starting to initialize the new one. As we still keep
    the old elevator itself, the only difference is that we lose icq's on
    rare occasions of switching failure, which isn't critical at all.

    Note that this makes explicit elevator parameter to
    elevator_init_queue() and __elv_register_queue() unnecessary as they
    always can use the current elevator.

    This patch enables block cgroup cleanups.

    -v2: blk_add_trace_msg() prints elevator name from @new_e instead of
    @e->type as the local variable no longer exists. This caused
    build failure on CONFIG_BLK_DEV_IO_TRACE.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     

08 Feb, 2012

1 commit

  • blk_rq_merge_ok() is the elevator-neutral part of merge eligibility
    test. blk_try_merge() determines merge direction and expects the
    caller to have tested elv_rq_merge_ok() previously.

    elv_rq_merge_ok() now wraps blk_rq_merge_ok() and then calls
    elv_iosched_allow_merge(). elv_try_merge() is removed and the two
    callers are updated to call elv_rq_merge_ok() explicitly followed by
    blk_try_merge(). While at it, make rq_merge_ok() functions return
    bool.
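
    The resulting call order for a caller, as a self-contained sketch with
    stub predicates standing in for the real blk_rq_merge_ok(),
    elv_rq_merge_ok() and blk_try_merge():

    #include <stdbool.h>
    #include <stdio.h>

    enum { NO_MERGE, FRONT_MERGE, BACK_MERGE };

    struct rq  { long start, len; };
    struct bio { long start, len; };

    /* Elevator-neutral eligibility check (stands in for blk_rq_merge_ok()). */
    static bool rq_merge_ok(struct rq *rq, struct bio *bio) { (void)rq; (void)bio; return true; }

    /* Wraps the neutral check and then asks the io scheduler
     * (stands in for elv_rq_merge_ok()). */
    static bool elv_merge_ok(struct rq *rq, struct bio *bio)
    {
            return rq_merge_ok(rq, bio) /* && elevator allow_merge hook */;
    }

    /* Direction only; assumes eligibility was already checked
     * (stands in for blk_try_merge()). */
    static int try_merge(struct rq *rq, struct bio *bio)
    {
            if (rq->start + rq->len == bio->start)
                    return BACK_MERGE;
            if (bio->start + bio->len == rq->start)
                    return FRONT_MERGE;
            return NO_MERGE;
    }

    int main(void)
    {
            struct rq rq = { 0, 8 };
            struct bio bio = { 8, 8 };
            int dir = elv_merge_ok(&rq, &bio) ? try_merge(&rq, &bio) : NO_MERGE;
            printf("direction: %d\n", dir);   /* BACK_MERGE */
            return 0;
    }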

    This is to prepare for plug merge update and doesn't introduce any
    behavior change.

    This is based on Jens' patch to skip elevator_allow_merge_fn() from
    plug merge.

    Signed-off-by: Tejun Heo
    LKML-Reference:
    Original-patch-by: Jens Axboe
    Signed-off-by: Jens Axboe

    Tejun Heo
     

15 Jan, 2012

1 commit

  • This reverts commit 274193224cdabd687d804a26e0150bb20f2dd52c.

    We have some problems related to selection of empty queues
    that need to be resolved, evidence so far points to the
    recursive merge logic making either being the cause or at
    least the accelerator for this. So revert it for now, until
    we figure this out.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

16 Dec, 2011

1 commit

  • In my workload, thread 1 accesses a, a+2, ..., thread 2 accesses a+1,
    a+3,.... When the requests are flushed to queue, a and a+1 are merged
    to (a, a+1), a+2 and a+3 too to (a+2, a+3), but (a, a+1) and (a+2, a+3)
    aren't merged.
    With the recursive merge below, the workload throughput improves by 20%
    and context switches drop by 60%.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     

14 Dec, 2011

1 commit

  • With kmem_cache managed by blk-ioc, io_cq exit/release can be moved to
    blk-ioc too. The odd ->io_cq->exit/release() callbacks are replaced
    with elevator_ops->elevator_exit_icq_fn() with unlinking from both ioc
    and q, and freeing automatically handled by blk-ioc. The elevator
    operation only needs to perform the exit operation specific to the
    elevator - in cfq's case, exiting the cfqq's.

    Also, clearing of io_cq's on q detach is moved to block core and
    automatically performed on elevator switch and q release.

    Because the q an io_cq points to might be freed before the RCU callback
    for the io_cq runs, the blk-ioc code should remember which cache the
    io_cq needs to be freed to when the io_cq is released. A new field
    io_cq->__rcu_icq_cache is added for this purpose. As both the new
    field and rcu_head are used only after io_cq is released and the
    q/ioc_node fields aren't, they are put into unions.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo