24 Sep, 2014

1 commit

  • blk-mq uses percpu_ref for its usage counter, which tracks the number
    of in-flight commands and is used to synchronously drain the queue on
    freeze. percpu_ref shutdown takes measurable wallclock time as it
    involves a sched RCU grace period. This means that draining a blk-mq
    queue takes measurable wallclock time. One would think that this shouldn't
    matter as queue shutdown should be a rare event which takes place
    asynchronously w.r.t. userland.

    Unfortunately, SCSI probing involves synchronously setting up and then
    tearing down a lot of request_queues back-to-back for non-existent
    LUNs. This means that SCSI probing may take more than ten seconds
    when scsi-mq is used.

    This will be properly fixed by implementing a mechanism to keep
    q->mq_usage_counter in atomic mode till genhd registration; however,
    that involves rather big updates to percpu_ref which is difficult to
    apply late in the devel cycle (v3.17-rc6 at the moment). As a
    stop-gap measure till the proper fix can be implemented in the next
    cycle, this patch introduces __percpu_ref_kill_expedited() and makes
    blk_mq_freeze_queue() use it. This is heavy-handed but should work
    for testing the experimental SCSI blk-mq implementation.
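
    As an illustration, a minimal sketch of what such an expedited kill can
    look like (the helper name comes from the description above; the body is
    an assumption based on 3.17-era percpu_ref internals, not the actual
    patch):

    /* Sketch only: force the percpu -> atomic switch synchronously. */
    void __percpu_ref_kill_expedited(struct percpu_ref *ref)
    {
            ref->pcpu_count_ptr |= PCPU_REF_DEAD;   /* fail new trygets   */
            synchronize_sched_expedited();          /* expedited sched-RCU
                                                     * grace period, not a
                                                     * normal one         */
            percpu_ref_kill_rcu(&ref->rcu);         /* fold percpu counts
                                                     * into the atomic ref */
    }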

    Signed-off-by: Tejun Heo
    Reported-by: Christoph Hellwig
    Link: http://lkml.kernel.org/g/20140919113815.GA10791@lst.de
    Fixes: add703fda981 ("blk-mq: use percpu_ref for mq usage count")
    Cc: Kent Overstreet
    Cc: Jens Axboe
    Tested-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Tejun Heo
     

23 Sep, 2014

6 commits

  • Commit 2da78092 changed the locking from a mutex to a spinlock,
    so we no longer sleep in this context. But there was a leftover
    might_sleep() in there, which now triggers since we do the final
    free from an RCU callback. Get rid of it.
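
    The shape of the problem, as a minimal sketch (not the exact code):

    spin_lock(&lock);               /* locking is a spinlock since
                                     * commit 2da78092 ...          */
    might_sleep();                  /* ... so this leftover annotation
                                     * now warns: "sleeping function
                                     * called from invalid context" */
    spin_unlock(&lock);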

    Reported-by: Pontus Fuchs
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • When requests are retried due to hw or sw resource shortages,
    we often stop the associated hardware queue. So ensure that we
    restart the queues when running the requeue work, otherwise the
    queue run will be a no-op.
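
    A condensed sketch of the requeue work with the fix applied (assuming
    the 3.17-era helper names):

    static void blk_mq_requeue_work(struct work_struct *work)
    {
            struct request_queue *q =
                    container_of(work, struct request_queue, requeue_work);

            /* ... splice the requeue list and re-insert the requests ... */

            /*
             * Kick stopped hw queues too; a plain run would be a no-op
             * for queues that were stopped on a resource shortage.
             */
            blk_mq_start_hw_queues(q);
    }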

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • __blk_mq_alloc_rq_maps() can be invoked multiple times if we scale
    back the queue depth when we are low on memory. So don't clear
    set->tags when we fail, this is handled directly in
    the parent function, blk_mq_alloc_tag_set().
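
    Roughly, the inner unwind path then looks like this (a sketch following
    the description; the final kfree()/NULL of set->tags lives in
    blk_mq_alloc_tag_set()):

    static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
    {
            int i;

            for (i = 0; i < set->nr_hw_queues; i++) {
                    set->tags[i] = blk_mq_init_rq_map(set, i);
                    if (!set->tags[i])
                            goto out_unwind;
            }
            return 0;

    out_unwind:
            while (--i >= 0)
                    blk_mq_free_rq_map(set, set->tags[i], i);
            /* Deliberately leave set->tags alone: the caller may retry
             * with a smaller queue depth and reuse the array. */
            return -ENOMEM;
    }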

    Reported-by: Robert Elliott
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We should not insert requests into the flush state machine from
    blk_mq_insert_request. All incoming flush requests come through
    blk_{m,s}q_make_request and are handled there, while blk_execute_rq_nowait
    should only be called for BLOCK_PC requests. All other callers
    deal with requests that already went through the flush state machine
    and shouldn't be reinserted into it.
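
    Condensed, the insertion path becomes (a sketch, assuming 3.17-era
    internals):

    void blk_mq_insert_request(struct request *rq, bool at_head,
                               bool run_queue, bool async)
    {
            struct blk_mq_ctx *ctx = rq->mq_ctx;
            struct blk_mq_hw_ctx *hctx =
                    rq->q->mq_ops->map_queue(rq->q, ctx->cpu);

            /*
             * No blk_insert_flush() here any more: flushes were already
             * routed through the state machine by the make_request
             * functions, and BLOCK_PC requests never enter it.
             */
            __blk_mq_insert_request(hctx, rq, at_head);

            if (run_queue)
                    blk_mq_run_hw_queue(hctx, async);
    }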

    Reported-by: Robert Elliott
    Debugged-by: Ming Lei
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • This patch should fix the bug reported in
    https://lkml.org/lkml/2014/9/11/249.

    We have to initialize at least the atomic_flags and the cmd_flags when
    allocating storage for the requests.

    Otherwise blk_mq_timeout_check() might dereference uninitialized
    pointers when racing with the creation of a request.

    Also move the reset of cmd_flags from the initialization code to the point
    where a request is freed. So we will never end up with pending flush
    request indicators that might trigger dereferences of invalid pointers
    in blk_mq_timeout_check().
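
    A sketch of the allocation-side half of the fix (the request storage is
    carved out of raw pages, so nothing is zeroed for us; variable names
    here are illustrative):

    struct request *rq = p;         /* p points into an un-zeroed page  */

    rq->atomic_flags = 0;           /* no stale STARTED/COMPLETE bits   */
    rq->cmd_flags = 0;              /* no stale flush indicators that
                                     * blk_mq_timeout_check() might
                                     * chase to invalid pointers        */
    tags->rqs[i] = rq;              /* only now publish the request     */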

    Cc: stable@vger.kernel.org
    Signed-off-by: David Hildenbrand
    Reported-by: Paulo De Rezende Pinatti
    Tested-by: Paulo De Rezende Pinatti
    Acked-by: Christian Borntraeger
    Signed-off-by: Jens Axboe

    David Hildenbrand
     
  • When we start the request, we set the deadline and flip the bits
    marking the request as started and non-complete. However, it's
    important that the deadline store is ordered before flipping the
    bits, otherwise we could have a small window where the request is
    marked started but with an invalid deadline. This can confuse the
    timeout handling.
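
    In code, the ordering constraint looks like this (a sketch of
    blk_mq_start_request(), assuming the usual barrier pairing for atomic
    bitops):

    rq->deadline = jiffies + q->rq_timeout;

    /*
     * Make the deadline store visible before the STARTED bit;
     * set_bit()/clear_bit() do not imply a memory barrier.
     */
    smp_mb__before_atomic();

    set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
    clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);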

    Suggested-by: Ming Lei
    Signed-off-by: Jens Axboe

    Jens Axboe
     

10 Sep, 2014

2 commits

  • If we are running in a kdump environment, resources are scarce.
    For some SCSI setups with a huge set of shared tags, we run out
    of memory allocating what the driver is asking for. So implement
    a scale back logic to reduce the tag depth for those cases, allowing
    the driver to successfully load.

    We should extend this to detect low memory situations, and implement
    a sane fallback for those (1 queue, 64 tags, or something like that).
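
    The scale-back loop, condensed (a sketch; the exact lower bound is an
    assumption):

    do {
            if (!__blk_mq_alloc_rq_maps(set))
                    break;                  /* success at this depth */

            set->queue_depth >>= 1;         /* halve the tag depth   */
            if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN) {
                    err = -ENOMEM;          /* cannot go any lower   */
                    break;
            }
    } while (set->queue_depth);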

    Tested-by: Robert Elliott
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • When a queue is registered, the block layer turns off the bypass
    setting (because bypass is enabled when the queue is created). This
    doesn't work well for queues that are unregistered and then registered
    again; we get a WARNING because of the unbalanced calls to
    blk_queue_bypass_end().

    This patch fixes the problem by making blk_register_queue() call
    blk_queue_bypass_end() only the first time the queue is registered.
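
    One way to express the guard (a sketch; the flag name is an assumption
    based on the description):

    int blk_register_queue(struct gendisk *disk)
    {
            struct request_queue *q = disk->queue;

            /*
             * End bypass only on first registration; re-registering
             * the queue must not call blk_queue_bypass_end() again.
             */
            if (!test_bit(QUEUE_FLAG_INIT_DONE, &q->queue_flags)) {
                    queue_flag_set_unlocked(QUEUE_FLAG_INIT_DONE, q);
                    blk_queue_bypass_end(q);
            }

            /* ... register the queue kobject, elevator, etc. ... */
            return 0;
    }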

    Signed-off-by: Alan Stern
    Acked-by: Tejun Heo
    CC: James Bottomley
    CC: Jens Axboe
    Signed-off-by: Jens Axboe

    Alan Stern
     

04 Sep, 2014

2 commits

  • Releases the dev_t minor when all references are closed to prevent
    another device from acquiring the same major/minor.

    Since the partition's release may be invoked from call_rcu's soft-irq
    context, the ext_devt_idr's mutex had to be replaced with a spinlock so
    as not to sleep.
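
    Sketched against genhd.c (names approximate; the point is that a
    spinlock, unlike a mutex, may be taken from the soft-irq release path):

    static DEFINE_SPINLOCK(ext_devt_lock);          /* was a mutex */

    static void blk_free_devt(dev_t devt)
    {
            if (MAJOR(devt) == BLOCK_EXT_MAJOR) {
                    spin_lock(&ext_devt_lock);
                    idr_remove(&ext_devt_idr,
                               blk_mangle_minor(MINOR(devt)));
                    spin_unlock(&ext_devt_lock);
            }
    }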

    Signed-off-by: Keith Busch
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Keith Busch
     
  • In blk-mq.c blk_mq_alloc_tag_set(), if the
    set->tags = kmalloc_node()
    allocation succeeds but one of the blk_mq_init_rq_map() calls fails,
    the
    goto out_unwind;
    path needs to free set->tags so the caller is not obligated to do so.
    None of the current callers (null_blk, virtio_blk, or the forthcoming
    scsi-mq) do so.

    set->tags needs to be set to NULL after doing so,
    so other tag cleanup logic doesn't try to free
    a stale pointer later. Also set it to NULL
    in blk_mq_free_tag_set.

    Tested with error injection on the forthcoming
    scsi-mq + hpsa combination.
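
    The error path being described, condensed (sketch):

    out_unwind:
            while (--i >= 0)
                    blk_mq_free_rq_map(set, set->tags[i], i);

            kfree(set->tags);       /* don't force callers to free it  */
            set->tags = NULL;       /* and don't leave a stale pointer
                                     * for later tag cleanup logic     */
            return -ENOMEM;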

    Signed-off-by: Robert Elliott
    Signed-off-by: Jens Axboe

    Robert Elliott
     

03 Sep, 2014

1 commit

  • QUEUE_FLAG_NO_SG_MERGE is set by default for blk-mq devices,
    so the computed bio->bi_phys_segments may be bigger than
    queue_max_segments(q) for blk-mq devices, and drivers will then
    fail to handle the case; for example, the BUG_ON() in
    virtio_queue_rq() can be triggered for virtio-blk:

    https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1359146

    This patch fixes the issue by ignoring the QUEUE_FLAG_NO_SG_MERGE
    flag if the computed bio->bi_phys_segments is bigger than
    queue_max_segments(q). The regression was caused by commit
    05f1dd53152173 ("block: add queue flag for disabling SG merging").
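
    The shape of the fix in the segment recount (a sketch; the exact
    condition is an assumption):

    bool no_sg_merge = test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags);

    /* Trust the no-SG-merge shortcut only when the raw vector count
     * already fits within the queue's segment limit. */
    if (no_sg_merge && bio->bi_vcnt < queue_max_segments(q))
            return bio->bi_vcnt;

    /* Otherwise fall through to the full recount, which honors the
     * merge rules and keeps bi_phys_segments within the limit. */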

    Reported-by: Kick In
    Tested-by: Chris J Arges
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

27 Aug, 2014

1 commit

  • cfq_group_service_tree_add() is applying new_weight at the beginning of
    the function via cfq_update_group_weight().
    This actually allows weight to change between adding it to and subtracting
    it from children_weight, and triggers the WARN_ON_ONCE() in
    cfq_group_service_tree_del(), or even causes an oops via a divide
    error during vfr calculation in cfq_group_service_tree_add().

    The detailed scenario is as follows:
    1. Create blkio cgroups X and Y as a child of X.
    Set X's weight to 500 and perform some I/O to apply new_weight.
    This X's I/O completes before starting Y's I/O.
    2. Y starts I/O and cfq_group_service_tree_add() is called with Y.
    3. cfq_group_service_tree_add() walks up the tree during children_weight
    calculation and adds parent X's weight (500) to children_weight of root.
    children_weight becomes 500.
    4. Set X's weight to 1000.
    5. X starts I/O and cfq_group_service_tree_add() is called with X.
    6. cfq_group_service_tree_add() applies its new_weight (1000).
    7. I/O of Y completes and cfq_group_service_tree_del() is called with Y.
    8. I/O of X completes and cfq_group_service_tree_del() is called with X.
    9. cfq_group_service_tree_del() subtracts X's weight (1000) from
    children_weight of root. children_weight becomes -500.
    This triggers WARN_ON_ONCE().
    10. Set X's weight to 500.
    11. X starts I/O and cfq_group_service_tree_add() is called with X.
    12. cfq_group_service_tree_add() applies its new_weight (500) and adds it
    to children_weight of root. children_weight becomes 0. Calculation of
    vfr then triggers an oops via a divide error.

    The weight should be updated right before adding it to children_weight.
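
    Conceptually, the reordering looks like this (a sketch that omits the
    real tree-walk details):

    __cfq_group_service_tree_add(st, cfqg);

    cfq_update_group_weight(cfqg);  /* moved: apply a pending new_weight
                                     * right before accounting ...      */
    cfqg_parent(cfqg)->children_weight += cfqg->weight;
                                    /* ... so tree_del later subtracts
                                     * exactly the value added here     */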

    Reported-by: Ruki Sekiya
    Signed-off-by: Toshiaki Makita
    Acked-by: Tejun Heo
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe

    Toshiaki Makita
     

26 Aug, 2014

1 commit

  • Before commit 2cada584b200 ("block: cleanup error handling in sg_io"),
    we had ret = 0 before entering the last big if block of sg_io.

    Since 2cada584b200, ret = -EFAULT, which breaks hdparm:

    /dev/sda:
    setting Advanced Power Management level to 0xc8 (200)
    HDIO_DRIVE_CMD failed: Bad address
    APM_level = 128

    Signed-off-by: Sabrina Dubroca
    Fixes: 2cada584b200 ("block: cleanup error handling in sg_io")
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Sabrina Dubroca
     

23 Aug, 2014

2 commits

  • blk_rq_set_block_pc() memsets rq->cmd to 0, so it should come
    immediately after blk_get_request() to avoid overwriting the
    user-supplied CDB. Also check for failure to allocate rq.
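
    The corrected ordering, as a sketch (q, cdb and cdb_len are
    illustrative):

    rq = blk_get_request(q, WRITE, GFP_KERNEL);
    if (!rq)
            return -ENOMEM;         /* don't dereference a failed alloc */

    blk_rq_set_block_pc(rq);        /* zeroes rq->cmd, so do this first */
    memcpy(rq->cmd, cdb, cdb_len);  /* then copy the user-supplied CDB  */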

    Fixes: f27b087b81b7 ("block: add blk_rq_set_block_pc()")
    Cc: stable@vger.kernel.org # 3.16.x
    Signed-off-by: Tony Battersby
    Signed-off-by: Jens Axboe

    Tony Battersby
     
  • This patch fixes code such as the following with scsi-mq enabled:

    rq = blk_get_request(...);
    blk_rq_set_block_pc(rq);

    rq->cmd = my_cmd_buffer; /* separate CDB buffer */

    blk_execute_rq_nowait(...);

    Code like this appears in e.g. sg_start_req() in drivers/scsi/sg.c (for
    large CDBs only). Without this patch, scsi_mq_prep_fn() will set
    rq->cmd back to rq->__cmd, causing the wrong CDB to be sent to the device.

    Signed-off-by: Tony Battersby
    Signed-off-by: Jens Axboe

    Tony Battersby
     

22 Aug, 2014

6 commits

  • Signed-off-by: Christoph Hellwig
    Reviewed-by: Boaz Harrosh
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Make sure we always clean up through the out label and just have
    a single place to put the request.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • While converting to percpu_ref for freezing, add703fda981 ("blk-mq:
    use percpu_ref for mq usage count") incorrectly made
    blk_mq_freeze_queue() misbehave when freezing is nested due to
    percpu_ref_kill() being invoked on an already killed ref.

    Fix it by making blk_mq_freeze_queue() kill and kick the queue only
    for the outermost freeze attempt. All the nested ones can simply wait
    for the ref to reach zero.

    While at it, remove unnecessary @wake initialization from
    blk_mq_unfreeze_queue().
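
    A sketch of the outermost-only logic (assuming a freeze-depth counter
    and a percpu_ref_is_zero()-style helper):

    void blk_mq_freeze_queue(struct request_queue *q)
    {
            bool freeze;

            spin_lock_irq(q->queue_lock);
            freeze = !q->mq_freeze_depth++;         /* outermost freeze? */
            spin_unlock_irq(q->queue_lock);

            if (freeze) {
                    /* Only the first freezer kills and kicks ... */
                    percpu_ref_kill(&q->mq_usage_counter);
                    blk_mq_run_queues(q, false);
            }
            /* ... but every freezer, nested or not, waits for drain. */
            wait_event(q->mq_freeze_wq,
                       percpu_ref_is_zero(&q->mq_usage_counter));
    }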

    Signed-off-by: Tejun Heo
    Reported-by: Ming Lei
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Just grammar or spelling errors, nothing major.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • When getting a PI error, we reach bio_integrity_end_io with
    bi_remaining already decremented to 0, and we will eventually
    need to call bio_endio with the original bio completion handler
    restored. Calling bio_endio there invokes a BUG_ON(). We should
    call bio_endio_nodec instead, as is done in bio_integrity_verify_fn.

    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Sagi Grimberg
     
  • blk-mq uses BLK_MQ_F_SHOULD_MERGE, as set by the driver at init time,
    to determine whether it should merge IO or not. However, this could
    also be disabled by the admin, if merging is switched off through
    sysfs. So check the general queue state as well before attempting
    to merge IO.
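
    The check, roughly (a sketch of the make_request-side logic; exact
    placement is an assumption):

    if (!(hctx->flags & BLK_MQ_F_SHOULD_MERGE) ||
        blk_queue_nomerges(q)) {
            /* Driver opted out, or the admin disabled merging via
             * sysfs: turn the bio into a new request directly. */
            blk_mq_bio_to_request(rq, bio);
    } else if (!blk_mq_attempt_merge(q, ctx, bio)) {
            blk_mq_bio_to_request(rq, bio);  /* no merge candidate */
    }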

    Reported-by: Rob Elliott
    Tested-by: Rob Elliott
    Signed-off-by: Jens Axboe

    Jens Axboe
     

16 Aug, 2014

1 commit

  • Before the queue is released, it has already been frozen
    by blk_cleanup_queue(), so there is no need to freeze the queue
    again when deleting the tag set.

    This patch fixes the WARNING "percpu_ref_kill() called more than once!"
    which is triggered when unloading a block driver.

    Cc: Tejun Heo
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

14 Aug, 2014

3 commits

  • Pull device mapper changes from Mike Snitzer:

    - Allow the thin target to be paired with any size external origin; also
    allow thin snapshots to be larger than the external origin.

    - Add support for quickly loading a repetitive pattern into the
    dm-switch target.

    - Use per-bio data in the dm-crypt target instead of always using a
    mempool for each allocation. Required switching to kmalloc alignment
    for the bio slab.

    - Fix DM core to properly stack the QUEUE_FLAG_NO_SG_MERGE flag

    - Fix the dm-cache and dm-thin targets' export of the minimum_io_size
    to match the data block size -- this fixes an issue where mkfs.xfs
    would improperly infer raid striping was in place on the underlying
    storage.

    - Small cleanups in dm-io, dm-mpath and dm-cache

    * tag 'dm-3.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
    dm table: propagate QUEUE_FLAG_NO_SG_MERGE
    dm switch: efficiently support repetitive patterns
    dm switch: factor out switch_region_table_read
    dm cache: set minimum_io_size to cache's data block size
    dm thin: set minimum_io_size to pool's data block size
    dm crypt: use per-bio data
    block: use kmalloc alignment for bio slab
    dm table: make dm_table_supports_discards static
    dm cache metadata: use dm-space-map-metadata.h defined size limits
    dm cache: fail migrations in the do_worker error path
    dm cache: simplify deferred set reference count increments
    dm thin: relax external origin size constraints
    dm thin: switch to an atomic_t for tracking pending new block preparations
    dm mpath: eliminate pg_ready() wrapper
    dm io: simplify dec_count and sync_io

    Linus Torvalds
     
  • Pull block driver changes from Jens Axboe:
    "Nothing out of the ordinary here, this pull request contains:

    - A big round of fixes for bcache from Kent Overstreet, Slava Pestov,
    and Surbhi Palande. No new features, just a lot of fixes.

    - The usual round of drbd updates from Andreas Gruenbacher, Lars
    Ellenberg, and Philipp Reisner.

    - virtio_blk was converted to blk-mq back in 3.13, but now Ming Lei
    has taken it one step further and added support for actually using
    more than one queue.

    - Addition of an explicit SG_FLAG_Q_AT_HEAD for block/bsg, to
    complement the default behavior of adding to the tail of the
    queue. From Douglas Gilbert"

    * 'for-3.17/drivers' of git://git.kernel.dk/linux-block: (86 commits)
    bcache: Drop unneeded blk_sync_queue() calls
    bcache: add mutex lock for bch_is_open
    bcache: Correct printing of btree_gc_max_duration_ms
    bcache: try to set b->parent properly
    bcache: fix memory corruption in init error path
    bcache: fix crash with incomplete cache set
    bcache: Fix more early shutdown bugs
    bcache: fix use-after-free in btree_gc_coalesce()
    bcache: Fix an infinite loop in journal replay
    bcache: fix crash in bcache_btree_node_alloc_fail tracepoint
    bcache: bcache_write tracepoint was crashing
    bcache: fix typo in bch_bkey_equal_header
    bcache: Allocate bounce buffers with GFP_NOWAIT
    bcache: Make sure to pass GFP_WAIT to mempool_alloc()
    bcache: fix uninterruptible sleep in writeback thread
    bcache: wait for buckets when allocating new btree root
    bcache: fix crash on shutdown in passthrough mode
    bcache: fix lockdep warnings on shutdown
    bcache allocator: send discards with correct size
    bcache: Fix to remove the rcu_sched stalls.
    ...

    Linus Torvalds
     
  • Pull block core bits from Jens Axboe:
    "Small round this time, after the massive blk-mq dump for 3.16. This
    pull request contains:

    - Fixes for max_sectors overflow in ioctls from Akinoby Mita.

    - Partition off-by-one bug fix in aix partitions from Dan Carpenter.

    - Various small partition cleanups from Fabian Frederick.

    - Fix for the block integrity code sometimes returning the wrong
    vector count from Gu Zheng.

    - Cleanup and re-org of the blk-mq queue enter/exit percpu counters
    from Tejun. Dependent on the percpu pull for 3.17 (which was in
    the block tree too), that you have already pulled in.

    - A blkcg oops fix, also from Tejun"

    * 'for-3.17/core' of git://git.kernel.dk/linux-block:
    partitions: aix.c: off by one bug
    blkcg: don't call into policy draining if root_blkg is already gone
    Revert "bio: modify __bio_add_page() to accept pages that don't start a new segment"
    bio: modify __bio_add_page() to accept pages that don't start a new segment
    block: fix SG_[GS]ET_RESERVED_SIZE ioctl when max_sectors is huge
    block: fix BLKSECTGET ioctl when max_sectors is greater than USHRT_MAX
    block/partitions/efi.c: kerneldoc fixing
    block/partitions/msdos.c: code clean-up
    block/partitions/amiga.c: replace nolevel printk by pr_err
    block/partitions/aix.c: replace count*size kzalloc by kcalloc
    bio-integrity: add "bip_max_vcnt" into struct bio_integrity_payload
    blk-mq: use percpu_ref for mq usage count
    blk-mq: collapse __blk_mq_drain_queue() into blk_mq_freeze_queue()
    blk-mq: decouble blk-mq freezing from generic bypassing
    block, blk-mq: draining can't be skipped even if bypass_depth was non-zero
    blk-mq: fix a memory ordering bug in blk_mq_queue_enter()

    Linus Torvalds
     

06 Aug, 2014

1 commit

  • The lvip[] array has "state->limit" elements so the condition here
    should be >= instead of >.
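
    The pattern of the fix (a sketch; the index variable is illustrative):

    /* lvip[] is allocated with state->limit elements, so the valid
     * indexes are 0 .. state->limit - 1: */
    if (lv_ix >= state->limit)      /* was '>', one past the end */
            continue;               /* skip out-of-range entries */
    /* ... lvip[lv_ix] is now a safe access ... */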

    Fixes: 6ceea22bbbc8 ('partitions: add aix lvm partition support files')
    Signed-off-by: Dan Carpenter
    Acked-by: Philippe De Muyter
    Signed-off-by: Jens Axboe

    Dan Carpenter
     

05 Aug, 2014

1 commit

  • Pull cgroup changes from Tejun Heo:
    "Mostly changes to get the v2 interface ready. The core features are
    mostly ready now and I think it's reasonable to expect to drop the
    devel mask in one or two devel cycles at least for a subset of
    controllers.

    - cgroup added a controller dependency mechanism so that block cgroup
    can depend on memory cgroup. This will be used to finally support
    IO provisioning on the writeback traffic, which is currently being
    implemented.

    - The v2 interface now uses a separate table so that the interface
    files for the new interface are explicitly declared in one place.
    Each controller will explicitly review and add the files for the
    new interface.

    - cpuset is getting ready for the hierarchical behavior which is in
    the similar style with other controllers so that an ancestor's
    configuration change doesn't change the descendants' configurations
    irreversibly and processes aren't silently migrated when a CPU or
    node goes down.

    All the changes are to the new interface and no behavior changed for
    the multiple hierarchies"

    * 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (29 commits)
    cpuset: fix the WARN_ON() in update_nodemasks_hier()
    cgroup: initialize cgrp_dfl_root_inhibit_ss_mask from !->dfl_files test
    cgroup: make CFTYPE_ONLY_ON_DFL and CFTYPE_NO_ internal to cgroup core
    cgroup: distinguish the default and legacy hierarchies when handling cftypes
    cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes()
    cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes
    cgroup: split cgroup_base_files[] into cgroup_{dfl|legacy}_base_files[]
    cpuset: export effective masks to userspace
    cpuset: allow writing offlined masks to cpuset.cpus/mems
    cpuset: enable onlined cpu/node in effective masks
    cpuset: refactor cpuset_hotplug_update_tasks()
    cpuset: make cs->{cpus, mems}_allowed as user-configured masks
    cpuset: apply cs->effective_{cpus,mems}
    cpuset: initialize top_cpuset's configured masks at mount
    cpuset: use effective cpumask to build sched domains
    cpuset: inherit ancestor's masks if effective_{cpus, mems} becomes empty
    cpuset: update cs->effective_{cpus, mems} when config changes
    cpuset: update cpuset->effective_{cpus,mems} at hotplug
    cpuset: add cs->effective_cpus and cs->effective_mems
    cgroup: clean up sane_behavior handling
    ...

    Linus Torvalds
     

02 Aug, 2014

1 commit

  • Various subsystems can ask the bio subsystem to create a bio slab cache
    with some free space before the bio. This free space can be used for any
    purpose. Device mapper uses this per-bio-data feature to place some
    target-specific and device-mapper specific data before the bio, so that
    the target-specific data doesn't have to be allocated separately.

    This per-bio-data mechanism is used in place of kmalloc, so we need the
    allocated slab to have the same memory alignment as memory allocated
    with kmalloc.

    Change bio_find_or_create_slab() so that it uses ARCH_KMALLOC_MINALIGN
    alignment when creating the slab cache. This is needed so that dm-crypt
    can use per-bio-data for encryption - the crypto subsystem assumes this
    data will have the same alignment as kmalloc'ed memory.
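
    The change boils down to the alignment argument of the slab creation
    (a sketch of bio_find_or_create_slab()):

    slab = kmem_cache_create(bslab->name, sz,
                             ARCH_KMALLOC_MINALIGN,  /* was 0: give the
                                                      * per-bio data the
                                                      * same alignment as
                                                      * kmalloc() memory */
                             SLAB_HWCACHE_ALIGN, NULL);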

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Acked-by: Jens Axboe

    Mikulas Patocka
     

16 Jul, 2014

1 commit

  • While a queue is being destroyed, all the blkgs are destroyed and its
    ->root_blkg pointer is set to NULL. If someone else starts to drain
    while the queue is in this state, the following oops happens.

    NULL pointer dereference at 0000000000000028
    IP: [] blk_throtl_drain+0x84/0x230
    PGD e4a1067 PUD b773067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Modules linked in: cfq_iosched(-) [last unloaded: cfq_iosched]
    CPU: 1 PID: 537 Comm: bash Not tainted 3.16.0-rc3-work+ #2
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    task: ffff88000e222250 ti: ffff88000efd4000 task.ti: ffff88000efd4000
    RIP: 0010:[] [] blk_throtl_drain+0x84/0x230
    RSP: 0018:ffff88000efd7bf0 EFLAGS: 00010046
    RAX: 0000000000000000 RBX: ffff880015091450 RCX: 0000000000000001
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
    RBP: ffff88000efd7c10 R08: 0000000000000000 R09: 0000000000000001
    R10: ffff88000e222250 R11: 0000000000000000 R12: ffff880015091450
    R13: ffff880015092e00 R14: ffff880015091d70 R15: ffff88001508fc28
    FS: 00007f1332650740(0000) GS:ffff88001fa80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000000000028 CR3: 0000000009446000 CR4: 00000000000006e0
    Stack:
    ffffffff8144e8f6 ffff880015091450 0000000000000000 ffff880015091d80
    ffff88000efd7c28 ffffffff8144ae2f ffff880015091450 ffff88000efd7c58
    ffffffff81427641 ffff880015091450 ffffffff82401f00 ffff880015091450
    Call Trace:
    [] blkcg_drain_queue+0x1f/0x60
    [] __blk_drain_queue+0x71/0x180
    [] blk_queue_bypass_start+0x6e/0xb0
    [] blkcg_deactivate_policy+0x38/0x120
    [] blk_throtl_exit+0x34/0x50
    [] blkcg_exit_queue+0x35/0x40
    [] blk_release_queue+0x26/0xd0
    [] kobject_cleanup+0x38/0x70
    [] kobject_put+0x28/0x60
    [] blk_put_queue+0x15/0x20
    [] scsi_device_dev_release_usercontext+0x16b/0x1c0
    [] execute_in_process_context+0x89/0xa0
    [] scsi_device_dev_release+0x1c/0x20
    [] device_release+0x32/0xa0
    [] kobject_cleanup+0x38/0x70
    [] kobject_put+0x28/0x60
    [] put_device+0x17/0x20
    [] __scsi_remove_device+0xa9/0xe0
    [] scsi_remove_device+0x2b/0x40
    [] sdev_store_delete+0x27/0x30
    [] dev_attr_store+0x18/0x30
    [] sysfs_kf_write+0x3e/0x50
    [] kernfs_fop_write+0xe7/0x170
    [] vfs_write+0xaf/0x1d0
    [] SyS_write+0x4d/0xc0
    [] system_call_fastpath+0x16/0x1b

    776687bce42b ("block, blk-mq: draining can't be skipped even if
    bypass_depth was non-zero") made it easier to trigger this bug by
    making blk_queue_bypass_start() drain even when it loses the first
    bypass test to blk_cleanup_queue(); however, the bug has always been
    there even before the commit as blk_queue_bypass_start() could race
    against queue destruction, win the initial bypass test but perform the
    actual draining after blk_cleanup_queue() already destroyed all blkgs.

    Fix it by skipping calling into policy draining if all the blkgs are
    already gone.
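
    The fix itself is a short guard (sketch following the description):

    void blkcg_drain_queue(struct request_queue *q)
    {
            lockdep_assert_held(q->queue_lock);

            /*
             * @q could be exiting and already past blkg destruction;
             * in that case there is nothing left to drain.
             */
            if (!q->root_blkg)
                    return;

            blk_throtl_drain(q);
    }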

    Signed-off-by: Tejun Heo
    Reported-by: Shirish Pargaonkar
    Reported-by: Sasha Levin
    Reported-by: Jet Chen
    Cc: stable@vger.kernel.org
    Tested-by: Shirish Pargaonkar
    Signed-off-by: Jens Axboe

    Tejun Heo
     

15 Jul, 2014

3 commits

  • Currently, cftypes added by cgroup_add_cftypes() are used for both the
    unified default hierarchy and legacy ones and subsystems can mark each
    file with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to
    appear only on one of them. This is quite hairy and error-prone.
    Also, we may end up exposing interface files to the default hierarchy
    without thinking it through.

    cgroup_subsys will grow two separate cftype addition functions and
    apply each only on the hierarchies of the matching type. This will
    allow organizing cftypes in a lot clearer way and encourage subsystems
    to scrutinize the interface which is being exposed in the new default
    hierarchy.

    In preparation, this patch adds cgroup_add_legacy_cftypes() which
    currently is a simple wrapper around cgroup_add_cftypes() and replaces
    all cgroup_add_cftypes() usages with it.

    While at it, this patch drops a completely spurious return from
    __hugetlb_cgroup_file_init().

    This patch doesn't introduce any functional differences.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Aneesh Kumar K.V

    Tejun Heo
     
  • Currently, cgroup_subsys->base_cftypes is used for both the unified
    default hierarchy and legacy ones and subsystems can mark each file
    with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear
    only on one of them. This is quite hairy and error-prone. Also, we
    may end up exposing interface files to the default hierarchy without
    thinking it through.

    cgroup_subsys will grow two separate cftype arrays and apply each only
    on the hierarchies of the matching type. This will allow organizing
    cftypes in a lot clearer way and encourage subsystems to scrutinize
    the interface which is being exposed in the new default hierarchy.

    In preparation, this patch renames cgroup_subsys->base_cftypes to
    cgroup_subsys->legacy_cftypes. This patch is pure rename.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vivek Goyal
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Aristeu Rozanski
    Cc: Aneesh Kumar K.V

    Tejun Heo
     
  • This reverts commit 254c4407cb84a6dec90336054615b0f0e996bb7c.

    It causes crashes with cryptsetup, even after a few iterations and
    updates. Drop it for now.

    Jens Axboe
     

09 Jul, 2014

2 commits

  • sane_behavior has been used as a development vehicle for the default
    unified hierarchy. Now that the default hierarchy is in place, the
    flag became redundant and confusing as its usage is allowed on all
    hierarchies. There are gonna be either the default hierarchy or
    legacy ones. Let's make that clear by removing sane_behavior support
    on non-default hierarchies.

    This patch replaces cgroup_sane_behavior() with cgroup_on_dfl(). The
    comment on top of CGRP_ROOT_SANE_BEHAVIOR is moved to on top of
    cgroup_on_dfl() with sane_behavior specific part dropped.

    On the default and legacy hierarchies w/o sane_behavior, this
    shouldn't cause any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko

    Tejun Heo
     
  • Currently, the blkio subsystem attributes all of writeback IOs to the
    root. One of the issues is that there's no way to tell who originated
    a writeback IO from the block layer. Those IOs are usually issued
    asynchronously from a task which didn't have anything to do with
    actually generating the dirty pages. The memory subsystem, when
    enabled, already keeps track of the ownership of each dirty page and
    it's desirable for blkio to piggyback instead of adding its own
    per-page tag.

    cgroup now has a mechanism to express such dependency -
    cgroup_subsys->depends_on. This patch declares that blkcg depends on
    memcg so that memcg is enabled automatically on the default hierarchy
    when available. Future changes will make blkcg map the memcg tag to
    find out the cgroup to blame for writeback IOs.

    As this means that a memcg may be made invisible, this patch also
    implements css_reset() for memcg which resets its basic
    configurations. This implementation will probably need to be expanded
    to cover other states which are used in the default hierarchy.

    v2: blkcg's dependency on memcg is wrapped with CONFIG_MEMCG to avoid
    build failure. Reported by kbuild test robot.
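
    The dependency declaration itself is small (a sketch; the surrounding
    fields are elided):

    struct cgroup_subsys blkio_cgrp_subsys = {
            /* .css_alloc, .css_free, ... elided ... */
    #ifdef CONFIG_MEMCG
            /*
             * blkcg piggybacks on memcg's per-page ownership tracking
             * for writeback IOs, so have memcg enabled together with
             * blkcg on the default hierarchy.
             */
            .depends_on = 1 << memory_cgrp_id,
    #endif
    };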

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vivek Goyal
    Cc: Jens Axboe

    Tejun Heo
     

08 Jul, 2014

1 commit

  • There is no inherent reason why the last put of a tag structure must be
    the one for the Scsi_Host, as device model objects can be held for
    arbitrary periods. Merge blk_free_tags and __blk_free_tags into a single
    function that just releases a reference, and get rid of the BUG() when
    the host reference wasn't the last.
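
    A sketch of the merged function (following the description; details
    assumed):

    void blk_free_tags(struct blk_queue_tag *bqt)
    {
            /*
             * Just drop one reference; whoever drops the last one frees,
             * and there is no BUG() if that isn't the Scsi_Host's put.
             */
            if (atomic_dec_and_test(&bqt->refcnt)) {
                    BUG_ON(find_first_bit(bqt->tag_map, bqt->max_depth) <
                           bqt->max_depth);  /* no tags may be in flight */
                    kfree(bqt->tag_index);
                    kfree(bqt->tag_map);
                    kfree(bqt);
            }
    }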

    Signed-off-by: Christoph Hellwig
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

02 Jul, 2014

1 commit

  • The original behaviour is to refuse to add a new page if the maximum
    number of segments has been reached, regardless of whether the page we
    are going to add can be merged into the last segment.

    Unfortunately, when the system runs under heavy memory fragmentation
    conditions, a driver may try to add multiple pages to the last segment.
    The original code won't accept them and EBUSY will be reported to
    userspace.

    This patch modifies the function so that it refuses to add a page only
    when the page starts a new segment and the maximum number of segments
    has already been reached.
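
    A sketch of the relaxed check in __bio_add_page() (the contiguity
    condition is an assumption):

    if (bio->bi_vcnt >= queue_max_segments(q)) {
            struct bio_vec *prev = &bio->bi_io_vec[bio->bi_vcnt - 1];

            /* Refuse the page only if it would start a new segment,
             * i.e. it does not extend the last bvec contiguously. */
            if (page != prev->bv_page ||
                offset != prev->bv_offset + prev->bv_len)
                    return 0;
    }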

    The bug can be easily reproduced with the st driver:

    1) set CONFIG_SCSI_MPT2SAS_MAX_SGE or CONFIG_SCSI_MPT3SAS_MAX_SGE to 16
    2) modprobe st buffer_kbs=1024
    3) #dd if=/dev/zero of=/dev/st0 bs=1M count=10
    dd: error writing `/dev/st0': Device or resource busy

    [ming.lei@canonical.com: update bi_iter.bi_size before recounting segments]
    Signed-off-by: Maurizio Lombardi
    Signed-off-by: Ming Lei
    Tested-by: Dongsu Park
    Tested-by: Jet Chen
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Kent Overstreet
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Maurizio Lombardi