05 Aug, 2014

1 commit

  • Pull cgroup changes from Tejun Heo:
    "Mostly changes to get the v2 interface ready. The core features are
    mostly ready now and I think it's reasonable to expect to drop the
    devel mask in one or two devel cycles at least for a subset of
    controllers.

    - cgroup added a controller dependency mechanism so that block cgroup
    can depend on memory cgroup. This will be used to finally support
    IO provisioning on the writeback traffic, which is currently being
    implemented.

    - The v2 interface now uses a separate table so that the interface
    files for the new interface are explicitly declared in one place.
    Each controller will explicitly review and add the files for the
    new interface.

    - cpuset is getting ready for hierarchical behavior in a similar
    style to the other controllers, so that an ancestor's configuration
    change doesn't irreversibly change the descendants' configurations
    and processes aren't silently migrated when a CPU or node goes down.

    All the changes are to the new interface and no behavior changed for
    the multiple hierarchies"

    * 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (29 commits)
    cpuset: fix the WARN_ON() in update_nodemasks_hier()
    cgroup: initialize cgrp_dfl_root_inhibit_ss_mask from !->dfl_files test
    cgroup: make CFTYPE_ONLY_ON_DFL and CFTYPE_NO_ internal to cgroup core
    cgroup: distinguish the default and legacy hierarchies when handling cftypes
    cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes()
    cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes
    cgroup: split cgroup_base_files[] into cgroup_{dfl|legacy}_base_files[]
    cpuset: export effective masks to userspace
    cpuset: allow writing offlined masks to cpuset.cpus/mems
    cpuset: enable onlined cpu/node in effective masks
    cpuset: refactor cpuset_hotplug_update_tasks()
    cpuset: make cs->{cpus, mems}_allowed as user-configured masks
    cpuset: apply cs->effective_{cpus,mems}
    cpuset: initialize top_cpuset's configured masks at mount
    cpuset: use effective cpumask to build sched domains
    cpuset: inherit ancestor's masks if effective_{cpus, mems} becomes empty
    cpuset: update cs->effective_{cpus, mems} when config changes
    cpuset: update cpuset->effective_{cpus,mems} at hotplug
    cpuset: add cs->effective_cpus and cs->effective_mems
    cgroup: clean up sane_behavior handling
    ...

    Linus Torvalds
     

15 Jul, 2014

2 commits

  • Currently, cftypes added by cgroup_add_cftypes() are used for both the
    unified default hierarchy and legacy ones and subsystems can mark each
    file with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to
    appear only on one of them. This is quite hairy and error-prone.
    Also, we may end up exposing interface files to the default hierarchy
    without thinking it through.

    cgroup_subsys will grow two separate cftype addition functions and
    apply each only on the hierarchies of the matching type. This will
    allow organizing cftypes in a lot clearer way and encourage subsystems
    to scrutinize the interface which is being exposed in the new default
    hierarchy.

    In preparation, this patch adds cgroup_add_legacy_cftypes() which
    currently is a simple wrapper around cgroup_add_cftypes() and replaces
    all cgroup_add_cftypes() usages with it.
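
    In sketch form, per the description above (a pure pass-through for
    now):

    int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, struct cftype *cfts)
    {
            /* Currently a simple wrapper; a follow-up change will
             * restrict these cftypes to legacy hierarchies only. */
            return cgroup_add_cftypes(ss, cfts);
    }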

    While at it, this patch drops a completely spurious return from
    __hugetlb_cgroup_file_init().

    This patch doesn't introduce any functional differences.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Aneesh Kumar K.V

    Tejun Heo
     
  • Currently, cgroup_subsys->base_cftypes is used for both the unified
    default hierarchy and legacy ones and subsystems can mark each file
    with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear
    only on one of them. This is quite hairy and error-prone. Also, we
    may end up exposing interface files to the default hierarchy without
    thinking it through.

    cgroup_subsys will grow two separate cftype arrays and apply each only
    on the hierarchies of the matching type. This will allow organizing
    cftypes in a lot clearer way and encourage subsystems to scrutinize
    the interface which is being exposed in the new default hierarchy.

    In preparation, this patch renames cgroup_subsys->base_cftypes to
    cgroup_subsys->legacy_cftypes. This patch is pure rename.
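
    As an illustration of the rename's effect on a subsystem declaration
    (the files-array name here is illustrative):

    struct cgroup_subsys blkio_cgrp_subsys = {
            /* other fields omitted */
            .legacy_cftypes = blkcg_files,  /* previously .base_cftypes */
    };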

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vivek Goyal
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Aristeu Rozanski
    Cc: Aneesh Kumar K.V

    Tejun Heo
     

12 Jul, 2014

1 commit

  • While a queue is being destroyed, all the blkgs are destroyed and its
    ->root_blkg pointer is set to NULL. If someone else starts to drain
    while the queue is in this state, the following oops happens.

    NULL pointer dereference at 0000000000000028
    IP: [] blk_throtl_drain+0x84/0x230
    PGD e4a1067 PUD b773067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Modules linked in: cfq_iosched(-) [last unloaded: cfq_iosched]
    CPU: 1 PID: 537 Comm: bash Not tainted 3.16.0-rc3-work+ #2
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    task: ffff88000e222250 ti: ffff88000efd4000 task.ti: ffff88000efd4000
    RIP: 0010:[] [] blk_throtl_drain+0x84/0x230
    RSP: 0018:ffff88000efd7bf0 EFLAGS: 00010046
    RAX: 0000000000000000 RBX: ffff880015091450 RCX: 0000000000000001
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
    RBP: ffff88000efd7c10 R08: 0000000000000000 R09: 0000000000000001
    R10: ffff88000e222250 R11: 0000000000000000 R12: ffff880015091450
    R13: ffff880015092e00 R14: ffff880015091d70 R15: ffff88001508fc28
    FS: 00007f1332650740(0000) GS:ffff88001fa80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000000000028 CR3: 0000000009446000 CR4: 00000000000006e0
    Stack:
    ffffffff8144e8f6 ffff880015091450 0000000000000000 ffff880015091d80
    ffff88000efd7c28 ffffffff8144ae2f ffff880015091450 ffff88000efd7c58
    ffffffff81427641 ffff880015091450 ffffffff82401f00 ffff880015091450
    Call Trace:
    [] blkcg_drain_queue+0x1f/0x60
    [] __blk_drain_queue+0x71/0x180
    [] blk_queue_bypass_start+0x6e/0xb0
    [] blkcg_deactivate_policy+0x38/0x120
    [] blk_throtl_exit+0x34/0x50
    [] blkcg_exit_queue+0x35/0x40
    [] blk_release_queue+0x26/0xd0
    [] kobject_cleanup+0x38/0x70
    [] kobject_put+0x28/0x60
    [] blk_put_queue+0x15/0x20
    [] scsi_device_dev_release_usercontext+0x16b/0x1c0
    [] execute_in_process_context+0x89/0xa0
    [] scsi_device_dev_release+0x1c/0x20
    [] device_release+0x32/0xa0
    [] kobject_cleanup+0x38/0x70
    [] kobject_put+0x28/0x60
    [] put_device+0x17/0x20
    [] __scsi_remove_device+0xa9/0xe0
    [] scsi_remove_device+0x2b/0x40
    [] sdev_store_delete+0x27/0x30
    [] dev_attr_store+0x18/0x30
    [] sysfs_kf_write+0x3e/0x50
    [] kernfs_fop_write+0xe7/0x170
    [] vfs_write+0xaf/0x1d0
    [] SyS_write+0x4d/0xc0
    [] system_call_fastpath+0x16/0x1b

    776687bce42b ("block, blk-mq: draining can't be skipped even if
    bypass_depth was non-zero") made it easier to trigger this bug by
    making blk_queue_bypass_start() drain even when it loses the first
    bypass test to blk_cleanup_queue(); however, the bug has always been
    there even before the commit as blk_queue_bypass_start() could race
    against queue destruction, win the initial bypass test but perform the
    actual draining after blk_cleanup_queue() already destroyed all blkgs.

    Fix it by skipping calling into policy draining if all the blkgs are
    already gone.
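
    In sketch form, the guard in blkcg_drain_queue() (details hedged):

    void blkcg_drain_queue(struct request_queue *q)
    {
            lockdep_assert_held(q->queue_lock);

            /*
             * @q could be exiting and already have destroyed all blkgs,
             * as indicated by a NULL ->root_blkg.  If so, don't call
             * into the policies; there is nothing left to drain.
             */
            if (!q->root_blkg)
                    return;

            blk_throtl_drain(q);
    }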

    Signed-off-by: Tejun Heo
    Reported-by: Shirish Pargaonkar
    Reported-by: Sasha Levin
    Reported-by: Jet Chen
    Cc: stable@vger.kernel.org
    Tested-by: Shirish Pargaonkar
    Signed-off-by: Jens Axboe

    Tejun Heo
     

09 Jul, 2014

2 commits

  • sane_behavior has been used as a development vehicle for the default
    unified hierarchy. Now that the default hierarchy is in place, the
    flag became redundant and confusing as its usage is allowed on all
    hierarchies. Going forward there will be either the default hierarchy
    or legacy ones. Let's make that clear by removing sane_behavior support
    on non-default hierarchies.

    This patch replaces cgroup_sane_behavior() with cgroup_on_dfl(). The
    comment on top of CGRP_ROOT_SANE_BEHAVIOR is moved to on top of
    cgroup_on_dfl() with sane_behavior specific part dropped.

    On the default and legacy hierarchies w/o sane_behavior, this
    shouldn't cause any behavior differences.
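
    A minimal sketch of the replacement test (the default hierarchy's
    root is cgrp_dfl_root):

    static inline bool cgroup_on_dfl(const struct cgroup *cgrp)
    {
            return cgrp->root == &cgrp_dfl_root;
    }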

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko

    Tejun Heo
     
  • Currently, the blkio subsystem attributes all writeback IOs to the
    root. One of the issues is that there's no way to tell from the block
    layer who originated a writeback IO. Those IOs are usually issued
    asynchronously from a task which didn't have anything to do with
    actually generating the dirty pages. The memory subsystem, when
    enabled, already keeps track of the ownership of each dirty page and
    it's desirable for blkio to piggyback instead of adding its own
    per-page tag.

    cgroup now has a mechanism to express such dependency -
    cgroup_subsys->depends_on. This patch declares that blkcg depends on
    memcg so that memcg is enabled automatically on the default hierarchy
    when available. Future changes will make blkcg map the memcg tag to
    find out the cgroup to blame for writeback IOs.

    As this means that a memcg may be made invisible, this patch also
    implements css_reset() for memcg which resets its basic
    configurations. This implementation will probably need to be expanded
    to cover other states which are used in the default hierarchy.

    v2: blkcg's dependency on memcg is wrapped with CONFIG_MEMCG to avoid
    build failure. Reported by kbuild test robot.
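
    In sketch form, the declaration this adds (other fields omitted):

    struct cgroup_subsys blkio_cgrp_subsys = {
            /* other callbacks omitted */
    #ifdef CONFIG_MEMCG
            /*
             * This ensures that, if available, memcg is automatically
             * enabled together on the default hierarchy so that the
             * owner cgroup can be looked up from writeback pages.
             */
            .depends_on = 1 << memory_cgrp_id,
    #endif
    };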

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vivek Goyal
    Cc: Jens Axboe

    Tejun Heo
     

08 Jul, 2014

1 commit

  • There is no inherent reason why the last put of a tag structure must be
    the one for the Scsi_Host, as device model objects can be held for
    arbitrary periods. Merge blk_free_tags and __blk_free_tags into a
    single function that just releases a reference, and get rid of the
    BUG() when the host reference wasn't the last.
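
    A hedged sketch of the merged function (the exact freeing details
    may differ):

    void blk_free_tags(struct blk_queue_tag *bqt)
    {
            /* Drop one reference; whichever put turns out to be the
             * last one frees the structure, host or not. */
            if (atomic_dec_and_test(&bqt->refcnt)) {
                    kfree(bqt->tag_index);
                    kfree(bqt->tag_map);
                    kfree(bqt);
            }
    }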

    Signed-off-by: Christoph Hellwig
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

27 Jun, 2014

1 commit

  • Pull block fixes from Jens Axboe:
    "A small collection of fixes/changes for the current series. This
    contains:

    - Removal of dead code from Gu Zheng.

    - Revert of two bad fixes that went in earlier in this round, marking
    things as __init that were not purely used from init.

    - A fix for blk_mq_start_hw_queue() using __blk_mq_run_hw_queue()
    directly, which could place us on the wrong CPU. Make it use the
    non-__ variant, which handles cases where we are called from the
    wrong CPU set.
    From me.

    - A fix for drbd, which allocates discard requests without room for
    the SCSI payload. From Lars Ellenberg.

    - A fix for a use-after-free in the blkcg code from Tejun.

    - Addition of limiting gaps in SG lists, if the hardware needs it.
    This is the last pre-req patch for blk-mq to enable the full NVMe
    conversion. Could wait until 3.17, but it's simple enough so would
    be nice to have everything we need for the NVMe port in the 3.17
    release. From me"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    drbd: fix NULL pointer deref in blk_add_request_payload
    blk-mq: blk_mq_start_hw_queue() should use blk_mq_run_hw_queue()
    block: add support for limiting gaps in SG lists
    bio: remove unused macro bip_vec_idx()
    Revert "block: add __init to elv_register"
    Revert "block: add __init to blkcg_policy_register"
    blkcg: fix use-after-free in __blkg_release_rcu() by making blkcg_gq refcnt an atomic_t
    floppy: format block0 read error message properly

    Linus Torvalds
     

25 Jun, 2014

2 commits

  • Currently it calls __blk_mq_run_hw_queue(), which depends on the
    CPU placement being correct. This means it's not possible to call
    blk_mq_start_hw_queues(q) from a context that is correct for all
    queues, leading to triggering the

    WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask));

    in __blk_mq_run_hw_queue().
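
    The fix in sketch form:

    void blk_mq_start_hw_queue(struct blk_mq_hw_ctx *hctx)
    {
            clear_bit(BLK_MQ_S_STOPPED, &hctx->state);

            /* The non-__ variant checks CPU placement and punts to
             * kblockd on a correct CPU when necessary. */
            blk_mq_run_hw_queue(hctx, false);
    }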

    Reported-by: Ming Lei
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Another restriction inherited for NVMe - those devices don't support
    SG lists that have "gaps" in them. Gaps refer to cases where the
    previous SG entry doesn't end on a page boundary. For NVMe, all SG
    entries must start at offset 0 (except the first) and end on a page
    boundary (except the last).
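
    A sketch of the gap test between a new segment and the previous one
    (helper name as introduced around this change; treat as
    illustrative):

    static inline bool bvec_gap_to_prev(struct bio_vec *bprv, unsigned int offset)
    {
            /* A gap exists if the new segment doesn't start at offset 0
             * or the previous one doesn't end on a page boundary. */
            return offset ||
                    ((bprv->bv_offset + bprv->bv_len) & (PAGE_SIZE - 1));
    }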

    Signed-off-by: Jens Axboe

    Jens Axboe
     

23 Jun, 2014

3 commits

  • This reverts commit b5097e956a4d2919ee248d6481e4204c5568ed5c.

    The original commit is buggy, we do use the registration functions
    at runtime, for instance when loading IO schedulers through sysfs.

    Reported-by: Damien Wyart

    Jens Axboe
     
  • This reverts commit a2d445d440003f2d70ee4cd4970ea82ace616fee.

    The original commit is buggy, we do use the registration functions
    at runtime for modular builds.

    Jens Axboe
     
  • Hello,

    So, this patch should do. Joe, Vivek, can one of you guys please
    verify that the oops goes away with this patch?

    Jens, the original thread can be read at

    http://thread.gmane.org/gmane.linux.kernel/1720729

    The fix converts blkg->refcnt from int to atomic_t. It adds some
    overhead but it should be minute compared to everything else which is
    going on and the involved cacheline bouncing, so I think it's highly
    unlikely to cause any noticeable difference. Also, the refcnt in
    question should be converted to a percpu_ref for blk-mq anyway, so the
    atomic_t is likely to go away pretty soon.

    Thanks.

    ------- 8< -------
    __blkg_release_rcu() may be invoked after the associated request_queue
    is released with an RCU grace period in between. As such, the function
    and callbacks invoked from it must not dereference the associated
    request_queue. This is clearly indicated in the comment above the
    function.

    Unfortunately, while trying to fix a different issue, 2a4fd070ee85
    ("blkcg: move bulk of blkcg_gq release operations to the RCU
    callback") ignored this and added [un]locking of @blkg->q->queue_lock
    to __blkg_release_rcu(). This of course can cause oops as the
    request_queue may be long gone by the time this code gets executed.

    general protection fault: 0000 [#1] SMP
    CPU: 21 PID: 30 Comm: rcuos/21 Not tainted 3.15.0 #1
    Hardware name: Stratus ftServer 6400/G7LAZ, BIOS BIOS Version 6.3:57 12/25/2013
    task: ffff880854021de0 ti: ffff88085403c000 task.ti: ffff88085403c000
    RIP: 0010:[] [] _raw_spin_lock_irq+0x15/0x60
    RSP: 0018:ffff88085403fdf0 EFLAGS: 00010086
    RAX: 0000000000020000 RBX: 0000000000000010 RCX: 0000000000000000
    RDX: 000060ef80008248 RSI: 0000000000000286 RDI: 6b6b6b6b6b6b6b6b
    RBP: ffff88085403fdf0 R08: 0000000000000286 R09: 0000000000009f39
    R10: 0000000000020001 R11: 0000000000020001 R12: ffff88103c17a130
    R13: ffff88103c17a080 R14: 0000000000000000 R15: 0000000000000000
    FS: 0000000000000000(0000) GS:ffff88107fca0000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000006e5ab8 CR3: 000000000193d000 CR4: 00000000000407e0
    Stack:
    ffff88085403fe18 ffffffff812cbfc2 ffff88103c17a130 0000000000000000
    ffff88103c17a130 ffff88085403fec0 ffffffff810d1d28 ffff880854021de0
    ffff880854021de0 ffff88107fcaec58 ffff88085403fe80 ffff88107fcaec30
    Call Trace:
    [] __blkg_release_rcu+0x72/0x150
    [] rcu_nocb_kthread+0x1e8/0x300
    [] kthread+0xe1/0x100
    [] ret_from_fork+0x7c/0xb0
    Code: ff 47 04 48 8b 7d 08 be 00 02 00 00 e8 55 48 a4 ff 5d c3 0f 1f 00 66 66 66 66 90 55 48 89 e5 fa 66 66 90 66 66 90 b8 00 00 02 00 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f b7
    RIP [] _raw_spin_lock_irq+0x15/0x60
    RSP

    The request_queue locking was added because blkcg_gq->refcnt is an int
    protected with the queue lock and __blkg_release_rcu() needs to put
    the parent. Let's fix it by making blkcg_gq->refcnt an atomic_t and
    dropping queue locking in the function.

    Given the general heavy weight of the current request_queue and blkcg
    operations, this is unlikely to cause any noticeable overhead.
    Moreover, blkcg_gq->refcnt is likely to be converted to percpu_ref in
    the near future, so whatever (most likely negligible) overhead it may
    add is temporary.
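
    In sketch form, the converted helpers (note that the put no longer
    needs the queue lock):

    static inline void blkg_get(struct blkcg_gq *blkg)
    {
            WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0);
            atomic_inc(&blkg->refcnt);
    }

    static inline void blkg_put(struct blkcg_gq *blkg)
    {
            WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0);
            if (atomic_dec_and_test(&blkg->refcnt))
                    call_rcu(&blkg->rcu_head, __blkg_release_rcu);
    }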

    Signed-off-by: Tejun Heo
    Reported-by: Joe Lawrence
    Acked-by: Vivek Goyal
    Link: http://lkml.kernel.org/g/alpine.DEB.2.02.1406081816540.17948@jlaw-desktop.mno.stratus.com
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe

    Tejun Heo
     

20 Jun, 2014

1 commit

  • Pull block fixes from Jens Axboe:
    "A smaller collection of fixes for the block core that would be nice to
    have in -rc2. This pull request contains:

    - Fixes for races in the wait/wakeup logic used in blk-mq from
    Alexander. No issues have been observed, but it is definitely a
    bit flakey currently. Alternatively, we may drop the cyclic
    wakeups going forward, but that needs more testing.

    - Some cleanups from Christoph.

    - Fix for an oops in null_blk if queue_mode=1 and softirq completions
    are used. From me.

    - A fix for a regression caused by the chunk size setting. It
    inadvertently used max_hw_sectors instead of max_sectors, which is
    incorrect, and causes hangs on btrfs multi-disk setups (where hw
    sectors apparently isn't set). From me.

    - Removal of WQ_POWER_EFFICIENT in the kblockd creation. This was a
    recent addition as well, but it actually breaks blk-mq, which relies
    on strict scheduling: if the workqueue power_efficient mode is
    turned on, blk-mq breaks. From Matias.

    - null_blk module parameter description fix from Mike"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    blk-mq: bitmap tag: fix races in bt_get() function
    blk-mq: bitmap tag: fix race on blk_mq_bitmap_tags::wake_cnt
    blk-mq: bitmap tag: fix races on shared ::wake_index fields
    block: blk_max_size_offset() should check ->max_sectors
    null_blk: fix softirq completions for queue_mode == 1
    blk-mq: merge blk_mq_drain_queue and __blk_mq_drain_queue
    blk-mq: properly drain stopped queues
    block: remove WQ_POWER_EFFICIENT from kblockd
    null_blk: fix name and description of 'queue_mode' module parameter
    block: remove elv_abort_queue and blk_abort_flushes

    Linus Torvalds
     

18 Jun, 2014

3 commits

  • This update fixes a few issues in the bt_get() function:

    - list_empty(&wait.task_list) check is not protected;

    - the was_empty check is always true, which results in *every* thread
    entering the loop resetting the bt_wait_state::wait_cnt counter
    rather than every bt->wake_cnt'th thread;

    - 'bt_wait_state::wait_cnt' counter update is redundant, since
    it also gets reset in bt_clear_tag() function;
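
    A hedged sketch of a fixed wait loop (helper and variable names are
    illustrative): performing the check under prepare_to_wait() closes
    the unprotected list_empty() window, since queueing the waiter takes
    the wait-queue lock.

    do {
            prepare_to_wait(&bs->wait, &wait, TASK_UNINTERRUPTIBLE);

            tag = __bt_get(hctx, bt, last_tag);     /* assumed helper */
            if (tag != -1)
                    break;

            io_schedule();
    } while (1);
    finish_wait(&bs->wait, &wait);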

    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Jens Axboe
    Signed-off-by: Alexander Gordeev
    Signed-off-by: Jens Axboe

    Alexander Gordeev
     
  • This piece of code in bt_clear_tag() function is racy:

    bs = bt_wake_ptr(bt);
    if (bs && atomic_dec_and_test(&bs->wait_cnt)) {
            atomic_set(&bs->wait_cnt, bt->wake_cnt);
            wake_up(&bs->wait);
    }

    Since nothing prevents bt_wake_ptr() from returning the very
    same 'bs' address on multiple CPUs, the following scenario is
    possible:

    CPU1                                        CPU2
    ----                                        ----

    0. bs = bt_wake_ptr(bt);                    bs = bt_wake_ptr(bt);
    1. atomic_dec_and_test(&bs->wait_cnt)
    2.                                          atomic_dec_and_test(&bs->wait_cnt)
    3. atomic_set(&bs->wait_cnt, bt->wake_cnt);

    If the decrement in [1] yields zero then for some amount of time
    the decrement in [2] results in a negative/overflow value, which
    is not expected. The follow-up assignment in [3] overwrites the
    invalid value with the batch value (and likely prevents the issue
    from being severe), but it is still incorrect and should be avoided.
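
    One way to close the race, in hedged sketch form (not necessarily
    the exact patch): replenish with atomic_add() instead of
    atomic_set(), and let only the thread that saw the counter hit
    exactly zero do the wakeup, so concurrent decrements compose instead
    of being overwritten.

    int wait_cnt = atomic_dec_return(&bs->wait_cnt);

    if (wait_cnt == 0) {
            /* Add the batch back in; decrements that raced with us are
             * preserved rather than clobbered. */
            atomic_add(bt->wake_cnt, &bs->wait_cnt);
            wake_up(&bs->wait);
    }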

    Cc: Ming Lei
    Cc: Jens Axboe
    Signed-off-by: Alexander Gordeev
    Signed-off-by: Jens Axboe

    Alexander Gordeev
     
  • Fix racy updates of shared blk_mq_bitmap_tags::wake_index
    and blk_mq_hw_ctx::wake_index fields.

    Cc: Ming Lei
    Signed-off-by: Alexander Gordeev
    Signed-off-by: Jens Axboe

    Alexander Gordeev
     

16 Jun, 2014

1 commit

  • Pull NVMe update from Matthew Wilcox:
    "Mostly bugfixes again for the NVMe driver. I'd like to call out the
    exported tracepoint in the block layer; I believe Keith has cleared
    this with Jens.

    We've had a few reports from people who're really pounding on NVMe
    devices at scale, hence the timeout changes (and new module
    parameters), hotplug cpu deadlock, tracepoints, and minor performance
    tweaks"

    [ Jens hadn't seen that tracepoint thing, but is ok with it - it will
    end up going away when mq conversion happens ]

    * git://git.infradead.org/users/willy/linux-nvme: (22 commits)
    NVMe: Fix START_STOP_UNIT Scsi->NVMe translation.
    NVMe: Use Log Page constants in SCSI emulation
    NVMe: Define Log Page constants
    NVMe: Fix hot cpu notification dead lock
    NVMe: Rename io_timeout to nvme_io_timeout
    NVMe: Use last bytes of f/w rev SCSI Inquiry
    NVMe: Adhere to request queue block accounting enable/disable
    NVMe: Fix nvme get/put queue semantics
    NVMe: Delete NVME_GET_FEAT_TEMP_THRESH
    NVMe: Make admin timeout a module parameter
    NVMe: Make iod bio timeout a parameter
    NVMe: Prevent possible NULL pointer dereference
    NVMe: Fix the buffer size passed in GetLogPage(CDW10.NUMD)
    NVMe: Update data structures for NVMe 1.2
    NVMe: Enable BUILD_BUG_ON checks
    NVMe: Update namespace and controller identify structures to the 1.1a spec
    NVMe: Flush with data support
    NVMe: Configure support for block flush
    NVMe: Add tracepoints
    NVMe: Protect against badly formatted CQEs
    ...

    Linus Torvalds
     

12 Jun, 2014

2 commits

  • blk-mq issues async requests through kblockd. To issue a work request on
    a specific CPU, kblockd_schedule_delayed_work_on is used. However, the
    specific CPU choice may not be honored, if the power_efficient option
    for workqueues is set. blk-mq requires that we have strict per-cpu
    scheduling, so it won't work properly if kblockd is marked
    POWER_EFFICIENT and power_efficient is set.

    Remove the kblockd WQ_POWER_EFFICIENT flag to prevent this behavior.
    This essentially reverts part of commit 695588f9454b, which added
    the WQ_POWER_EFFICIENT marker to kblockd.
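
    After the change, the kblockd creation presumably reads (flags per
    the pre-695588f9454b state):

    kblockd_workqueue = alloc_workqueue("kblockd",
                                        WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);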

    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • elv_abort_queue has no callers, and blk_abort_flushes is only called by
    elv_abort_queue.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

11 Jun, 2014

4 commits

  • Pull block layer fixes from Jens Axboe:
    "Final small batch of fixes to be included before -rc1. Some general
    cleanups in here as well, but some of the blk-mq fixes we need for the
    NVMe conversion and/or scsi-mq. The pull request contains:

    - Support for not merging across a specified "chunk size", if set by
    the driver. Some NVMe devices perform poorly for IO that crosses
    such a chunk, so we need to support it generically as part of
    request merging to avoid having to do complicated split logic. From
    me.

    - Bump max tag depth to 10Ki tags. Some scsi devices have a huge
    shared tag space. Before, we failed with EINVAL if too large a tag
    depth was specified; now we truncate it and pass back the actual
    value. From me.

    - Various blk-mq rq init fixes from me and others.

    - A fix for enter on a dying queue for blk-mq from Keith. This is
    needed to prevent oopsing on hot device removal.

    - Fixup for blk-mq timer addition from Ming Lei.

    - Small round of performance fixes for mtip32xx from Sam Bradshaw.

    - Minor stack leak fix from Rickard Strandqvist.

    - Two __init annotations from Fabian Frederick"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    block: add __init to blkcg_policy_register
    block: add __init to elv_register
    block: ensure that bio_add_page() always accepts a page for an empty bio
    blk-mq: add timer in blk_mq_start_request
    blk-mq: always initialize request->start_time
    block: blk-exec.c: Cleaning up local variable address returnd
    mtip32xx: minor performance enhancements
    blk-mq: ->timeout should be cleared in blk_mq_rq_ctx_init()
    blk-mq: don't allow queue entering for a dying queue
    blk-mq: bump max tag depth to 10K tags
    block: add blk_rq_set_block_pc()
    block: add notion of a chunk size for request merging

    Linus Torvalds
     
  • blkcg_policy_register is only called by
    __init functions:

    __init cfq_init
    __init throtl_init

    Cc: Andrew Morton
    Signed-off-by: Fabian Frederick
    Signed-off-by: Jens Axboe

    Fabian Frederick
     
  • elv_register is only called by elevator init functions:

    __init cfq_init
    __init deadline_init
    __init noop_init

    Cc: Andrew Morton
    Signed-off-by: Fabian Frederick
    Signed-off-by: Jens Axboe

    Fabian Frederick
     
  • Commit 762380ad9322 added support for chunk sizes and no merging
    across them, which broke the rule of always allowing the addition of a
    single page to an empty bio. So relax the restriction a bit to allow
    for that, similarly to what we have always done.

    This fixes a crash with mkfs.xfs and 512b sector sizes on NVMe.
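
    A hedged sketch of the relaxed check: if the bio is still empty, the
    first page is allowed in even when it exceeds the chunk-derived
    limit.

    unsigned int max_sectors = blk_max_size_offset(q, bio->bi_iter.bi_sector);

    if ((max_sectors < (len >> 9)) && !bio->bi_iter.bi_size)
            max_sectors = len >> 9;     /* empty bio: accept the page */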

    Reported-by: Keith Busch
    Signed-off-by: Jens Axboe

    Jens Axboe
     

10 Jun, 2014

2 commits

  • Pull cgroup updates from Tejun Heo:
    "A lot of activities on cgroup side. Heavy restructuring including
    locking simplification took place to improve the code base and enable
    implementation of the unified hierarchy, which currently exists behind
    a __DEVEL__ mount option. The core support is mostly complete but
    individual controllers need further work. To explain the design and
    rationales of the unified hierarchy

    Documentation/cgroups/unified-hierarchy.txt

    is added.

    Another notable change is css (cgroup_subsys_state - what each
    controller uses to identify and interact with a cgroup) iteration
    update. This is part of continuing updates on css object lifetime and
    visibility. cgroup started with reference count draining on removal
    way back and is now reaching a point where csses behave and are
    iterated like normal refcnted objects albeit with some complexities to
    allow distinguishing the state where they're being deleted. The css
    iteration update isn't taken advantage of yet but is planned to be
    used to simplify memcg significantly"

    * 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (77 commits)
    cgroup: disallow disabled controllers on the default hierarchy
    cgroup: don't destroy the default root
    cgroup: disallow debug controller on the default hierarchy
    cgroup: clean up MAINTAINERS entries
    cgroup: implement css_tryget()
    device_cgroup: use css_has_online_children() instead of has_children()
    cgroup: convert cgroup_has_live_children() into css_has_online_children()
    cgroup: use CSS_ONLINE instead of CGRP_DEAD
    cgroup: iterate cgroup_subsys_states directly
    cgroup: introduce CSS_RELEASED and reduce css iteration fallback window
    cgroup: move cgroup->serial_nr into cgroup_subsys_state
    cgroup: link all cgroup_subsys_states in their sibling lists
    cgroup: move cgroup->sibling and ->children into cgroup_subsys_state
    cgroup: remove cgroup->parent
    device_cgroup: remove direct access to cgroup->children
    memcg: update memcg_has_children() to use css_next_child()
    memcg: remove tasks/children test from mem_cgroup_force_empty()
    cgroup: remove css_parent()
    cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css
    cgroup: use cgroup->self.refcnt for cgroup refcnting
    ...

    Linus Torvalds
     
  • This way it becomes consistent with the non-mq case, and also
    avoids updating rq->deadline twice for mq.

    The comment said: "We do this early, to ensure we are on
    the right CPU.", but no percpu stuff is used in blk_add_timer(),
    so it isn't necessary. Even when inserting from the plug list, there
    is no such guarantee at all.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

07 Jun, 2014

3 commits

  • printk is meant to be used with an associated log level. There are some
    instances of printk scattered around the mm code where the log level is
    missing. Add a log level and adhere to suggestions by
    scripts/checkpatch.pl by moving to the pr_* macros.

    Also add the typical pr_fmt definition so that print statements can be
    easily traced back to the modules where they occur, correlated one with
    another, etc. This will require the removal of some (now redundant)
    prefixes on a few print statements.
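
    The pattern described above, for illustration (the "mm: " prefix is
    an assumed example):

    #define pr_fmt(fmt) "mm: " fmt      /* must precede the includes */

    #include <linux/printk.h>

    /* printk(KERN_INFO "mm: out of memory\n"); becomes: */
    pr_info("out of memory\n");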

    Signed-off-by: Mitchel Humpherys
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mitchel Humpherys
     
  • The request's ->timeout will be used in blk_mq_start_request() to set
    a potential timeout for the request, so clear it to zero at alloc time
    to ensure that we know if someone has set it or not.

    Fixes random early timeouts on NVMe testing.
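
    In sketch form, inside blk_mq_rq_ctx_init():

    /* Zero at alloc time so blk_mq_start_request() can tell whether the
     * caller supplied its own timeout value. */
    rq->timeout = 0;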

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • If the queue is going away, don't let new allocs or queueing
    happen on it. Go through the normal wait process, and exit with
    ENODEV in that case.
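
    A minimal sketch of the added exit condition (placement in the
    queue-enter path assumed):

    /* After the normal wait, bail out rather than proceed on a queue
     * that is going away. */
    if (blk_queue_dying(q))
            return -ENODEV;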

    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
     

06 Jun, 2014

3 commits

  • For some scsi-mq cases, the tag map can be huge. So increase the
    max number of tags we support.

    Additionally, don't fail with EINVAL if a user requests too many
    tags. Warn that the tag depth has been adjusted down, and store
    the new value inside the tag_set passed in.
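
    In sketch form (the limit's name is assumed):

    if (set->queue_depth > BLK_MQ_MAX_DEPTH) {
            /* Truncate and report back instead of failing with EINVAL. */
            pr_info("blk-mq: reduced tag depth to %u\n", BLK_MQ_MAX_DEPTH);
            set->queue_depth = BLK_MQ_MAX_DEPTH;
    }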

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • With the optimizations around not clearing the full request at alloc
    time, we are leaving some of the needed init for REQ_TYPE_BLOCK_PC
    up to the user allocating the request.

    Add a blk_rq_set_block_pc() that sets the command type to
    REQ_TYPE_BLOCK_PC, and properly initializes the members associated
    with this type of request. Update callers to use this function instead
    of manipulating rq->cmd_type directly.
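
    A hedged sketch of the helper (the precise set of initialized
    members may differ):

    void blk_rq_set_block_pc(struct request *rq)
    {
            rq->cmd_type = REQ_TYPE_BLOCK_PC;
            rq->__data_len = 0;
            rq->__sector = (sector_t) -1;
            rq->bio = rq->biotail = NULL;
            memset(rq->__cmd, 0, sizeof(rq->__cmd));
            rq->cmd = rq->__cmd;
    }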

    Includes fixes from Christoph Hellwig for my half-assed
    attempt.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Some drivers have different limits on what size a request should
    optimally be, depending on the offset of the request. Similar to
    dividing a device into chunks. Add a setting that allows the driver
    to inform the block layer of such a chunk size. The block layer will
    then prevent merging across the chunks.

    This is needed to optimally support NVMe with a non-zero stripe size.
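
    A sketch of how the chunk size folds into the per-offset size limit
    (as of the follow-up fix that made it check ->max_sectors):

    static inline unsigned int blk_max_size_offset(struct request_queue *q,
                                                   sector_t offset)
    {
            if (!q->limits.chunk_sectors)
                    return q->limits.max_sectors;

            /* Allow only up to the end of the current chunk. */
            return q->limits.chunk_sectors -
                    (offset & (q->limits.chunk_sectors - 1));
    }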

    Signed-off-by: Jens Axboe

    Jens Axboe
     

05 Jun, 2014

2 commits

  • blk_mq_tag_to_rq() needs to be able to tell if it should return
    the original request, or the flush request if we are doing a flush
    sequence. Clear the flush tag when IO completes for a flush, since
    that is what we are comparing against.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • We currently pass in the hardware queue, and get the tags from there.
    But for scsi-mq, with a shared tag space, it's a lot more convenient
    to pass in the blk_mq_tags instead as the hardware queue isn't always
    directly available. So instead of having to re-map to a given
    hardware queue from rq->mq_ctx, just pass in the tags structure.
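
    The resulting interface, in sketch form (the flush-request
    special-casing is omitted):

    struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
    {
            return tags->rqs[tag];
    }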

    Signed-off-by: Jens Axboe

    Jens Axboe
     

04 Jun, 2014

1 commit

  • When the code was collapsed to avoid duplication, the recent patch
    ensuring that a queue is idled before being freed, added by commit
    19c5d84f14d2, was dropped.

    Add back the blk_mq_tag_idle(), to ensure we don't leak a reference
    to an active queue when it is freed.

    Signed-off-by: Jens Axboe

    Jens Axboe