14 Aug, 2014

3 commits

  • Pull device mapper changes from Mike Snitzer:

    - Allow the thin target to be paired with an external origin of any
    size; also allow thin snapshots to be larger than the external origin.

    - Add support for quickly loading a repetitive pattern into the
    dm-switch target.

    - Use per-bio data in the dm-crypt target instead of always using a
    mempool for each allocation. Required switching to kmalloc alignment
    for the bio slab.

    - Fix DM core to properly stack the QUEUE_FLAG_NO_SG_MERGE flag

    - Fix the dm-cache and dm-thin targets' export of the minimum_io_size
    to match the data block size -- this fixes an issue where mkfs.xfs
    would improperly infer raid striping was in place on the underlying
    storage.

    - Small cleanups in dm-io, dm-mpath and dm-cache

    * tag 'dm-3.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
    dm table: propagate QUEUE_FLAG_NO_SG_MERGE
    dm switch: efficiently support repetitive patterns
    dm switch: factor out switch_region_table_read
    dm cache: set minimum_io_size to cache's data block size
    dm thin: set minimum_io_size to pool's data block size
    dm crypt: use per-bio data
    block: use kmalloc alignment for bio slab
    dm table: make dm_table_supports_discards static
    dm cache metadata: use dm-space-map-metadata.h defined size limits
    dm cache: fail migrations in the do_worker error path
    dm cache: simplify deferred set reference count increments
    dm thin: relax external origin size constraints
    dm thin: switch to an atomic_t for tracking pending new block preparations
    dm mpath: eliminate pg_ready() wrapper
    dm io: simplify dec_count and sync_io

    Linus Torvalds
     
  • Pull block driver changes from Jens Axboe:
    "Nothing out of the ordinary here, this pull request contains:

    - A big round of fixes for bcache from Kent Overstreet, Slava Pestov,
    and Surbhi Palande. No new features, just a lot of fixes.

    - The usual round of drbd updates from Andreas Gruenbacher, Lars
    Ellenberg, and Philipp Reisner.

    - virtio_blk was converted to blk-mq back in 3.13, but now Ming Lei
    has taken it one step further and added support for actually using
    more than one queue.

    - Addition of an explicit SG_FLAG_Q_AT_HEAD for block/bsg, to
    complement the default behavior of adding to the tail of the
    queue. From Douglas Gilbert"

    * 'for-3.17/drivers' of git://git.kernel.dk/linux-block: (86 commits)
    bcache: Drop unneeded blk_sync_queue() calls
    bcache: add mutex lock for bch_is_open
    bcache: Correct printing of btree_gc_max_duration_ms
    bcache: try to set b->parent properly
    bcache: fix memory corruption in init error path
    bcache: fix crash with incomplete cache set
    bcache: Fix more early shutdown bugs
    bcache: fix use-after-free in btree_gc_coalesce()
    bcache: Fix an infinite loop in journal replay
    bcache: fix crash in bcache_btree_node_alloc_fail tracepoint
    bcache: bcache_write tracepoint was crashing
    bcache: fix typo in bch_bkey_equal_header
    bcache: Allocate bounce buffers with GFP_NOWAIT
    bcache: Make sure to pass GFP_WAIT to mempool_alloc()
    bcache: fix uninterruptible sleep in writeback thread
    bcache: wait for buckets when allocating new btree root
    bcache: fix crash on shutdown in passthrough mode
    bcache: fix lockdep warnings on shutdown
    bcache allocator: send discards with correct size
    bcache: Fix to remove the rcu_sched stalls.
    ...

    Linus Torvalds
     
  • Pull block core bits from Jens Axboe:
    "Small round this time, after the massive blk-mq dump for 3.16. This
    pull request contains:

    - Fixes for max_sectors overflow in ioctls from Akinobu Mita.

    - Partition off-by-one bug fix in aix partitions from Dan Carpenter.

    - Various small partition cleanups from Fabian Frederick.

    - Fix for the block integrity code sometimes returning the wrong
    vector count from Gu Zheng.

    - Cleanup and re-org of the blk-mq queue enter/exit percpu counters
    from Tejun. Dependent on the percpu pull for 3.17 (which was in
    the block tree too), that you have already pulled in.

    - A blkcg oops fix, also from Tejun"

    * 'for-3.17/core' of git://git.kernel.dk/linux-block:
    partitions: aix.c: off by one bug
    blkcg: don't call into policy draining if root_blkg is already gone
    Revert "bio: modify __bio_add_page() to accept pages that don't start a new segment"
    bio: modify __bio_add_page() to accept pages that don't start a new segment
    block: fix SG_[GS]ET_RESERVED_SIZE ioctl when max_sectors is huge
    block: fix BLKSECTGET ioctl when max_sectors is greater than USHRT_MAX
    block/partitions/efi.c: kerneldoc fixing
    block/partitions/msdos.c: code clean-up
    block/partitions/amiga.c: replace nolevel printk by pr_err
    block/partitions/aix.c: replace count*size kzalloc by kcalloc
    bio-integrity: add "bip_max_vcnt" into struct bio_integrity_payload
    blk-mq: use percpu_ref for mq usage count
    blk-mq: collapse __blk_mq_drain_queue() into blk_mq_freeze_queue()
    blk-mq: decouble blk-mq freezing from generic bypassing
    block, blk-mq: draining can't be skipped even if bypass_depth was non-zero
    blk-mq: fix a memory ordering bug in blk_mq_queue_enter()

    Linus Torvalds
     

06 Aug, 2014

1 commit

  • The lvip[] array has "state->limit" elements so the condition here
    should be >= instead of >.

    Fixes: 6ceea22bbbc8 ('partitions: add aix lvm partition support files')
    Signed-off-by: Dan Carpenter
    Acked-by: Philippe De Muyter
    Signed-off-by: Jens Axboe

    Dan Carpenter
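
    The bounds check described in this commit can be illustrated with a
    small userspace sketch. The struct and function names below are
    hypothetical and only model the comparison; they are not the code in
    block/partitions/aix.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the parser state: lvip[] has `limit` slots. */
struct demo_state {
	size_t limit;
};

/*
 * Valid indices into an array of `limit` elements are 0 .. limit - 1,
 * so an index equal to `limit` is already out of bounds.
 */
static bool demo_lv_index_out_of_range(const struct demo_state *state,
				       size_t lv_ix)
{
	assert(state != NULL);
	return lv_ix >= state->limit;	/* the off-by-one form used '>' */
}
```

    With limit == 4, index 3 is accepted and index 4 is rejected; the '>'
    form would have let index 4 through.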
     

05 Aug, 2014

1 commit

  • Pull cgroup changes from Tejun Heo:
    "Mostly changes to get the v2 interface ready. The core features are
    mostly ready now and I think it's reasonable to expect to drop the
    devel mask in one or two devel cycles at least for a subset of
    controllers.

    - cgroup added a controller dependency mechanism so that block cgroup
    can depend on memory cgroup. This will be used to finally support
    IO provisioning on the writeback traffic, which is currently being
    implemented.

    - The v2 interface now uses a separate table so that the interface
    files for the new interface are explicitly declared in one place.
    Each controller will explicitly review and add the files for the
    new interface.

    - cpuset is getting ready for hierarchical behavior in a style similar
    to the other controllers, so that an ancestor's configuration change
    doesn't change the descendants' configurations irreversibly and
    processes aren't silently migrated when a CPU or node goes down.

    All the changes are to the new interface and no behavior changed for
    the multiple hierarchies"

    * 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (29 commits)
    cpuset: fix the WARN_ON() in update_nodemasks_hier()
    cgroup: initialize cgrp_dfl_root_inhibit_ss_mask from !->dfl_files test
    cgroup: make CFTYPE_ONLY_ON_DFL and CFTYPE_NO_ internal to cgroup core
    cgroup: distinguish the default and legacy hierarchies when handling cftypes
    cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes()
    cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes
    cgroup: split cgroup_base_files[] into cgroup_{dfl|legacy}_base_files[]
    cpuset: export effective masks to userspace
    cpuset: allow writing offlined masks to cpuset.cpus/mems
    cpuset: enable onlined cpu/node in effective masks
    cpuset: refactor cpuset_hotplug_update_tasks()
    cpuset: make cs->{cpus, mems}_allowed as user-configured masks
    cpuset: apply cs->effective_{cpus,mems}
    cpuset: initialize top_cpuset's configured masks at mount
    cpuset: use effective cpumask to build sched domains
    cpuset: inherit ancestor's masks if effective_{cpus, mems} becomes empty
    cpuset: update cs->effective_{cpus, mems} when config changes
    cpuset: update cpuset->effective_{cpus,mems} at hotplug
    cpuset: add cs->effective_cpus and cs->effective_mems
    cgroup: clean up sane_behavior handling
    ...

    Linus Torvalds
     

02 Aug, 2014

1 commit

  • Various subsystems can ask the bio subsystem to create a bio slab cache
    with some free space before the bio. This free space can be used for any
    purpose. Device mapper uses this per-bio-data feature to place some
    target-specific and device-mapper specific data before the bio, so that
    the target-specific data doesn't have to be allocated separately.

    This per-bio-data mechanism is used in place of kmalloc, so we need the
    allocated slab to have the same memory alignment as memory allocated
    with kmalloc.

    Change bio_find_or_create_slab() so that it uses ARCH_KMALLOC_MINALIGN
    alignment when creating the slab cache. This is needed so that dm-crypt
    can use per-bio-data for encryption - the crypto subsystem assumes this
    data will have the same alignment as kmalloc'ed memory.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Acked-by: Jens Axboe

    Mikulas Patocka
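
    A rough userspace sketch of the layout this commit is about: one
    allocation holds the per-bio data followed by the bio, and both must
    start on an allocator-grade boundary. Everything here is an assumption
    for illustration: alignof(max_align_t) stands in for
    ARCH_KMALLOC_MINALIGN, aligned_alloc() stands in for the slab cache,
    and struct demo_bio is a placeholder for struct bio.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption: alignof(max_align_t) stands in for ARCH_KMALLOC_MINALIGN. */
#define DEMO_MINALIGN alignof(max_align_t)

/* Hypothetical miniature "bio"; the real struct bio is much larger. */
struct demo_bio {
	unsigned int size;
};

/* Round n up to the next multiple of DEMO_MINALIGN (a power of two). */
static size_t demo_round_up(size_t n)
{
	return (n + DEMO_MINALIGN - 1) & ~(DEMO_MINALIGN - 1);
}

/*
 * Allocate one object laid out as [per-bio data | struct demo_bio].
 * Returns the start of the per-bio data and stores the bio address in *bio.
 * The caller releases the object with free() on the returned pointer.
 */
static void *demo_alloc_with_front_pad(size_t front_pad, struct demo_bio **bio)
{
	size_t pad = demo_round_up(front_pad);
	size_t total = demo_round_up(pad + sizeof(struct demo_bio));
	char *obj;

	assert(bio != NULL);
	obj = aligned_alloc(DEMO_MINALIGN, total);
	if (!obj)
		return NULL;
	*bio = (struct demo_bio *)(obj + pad);
	return obj;
}
```

    Because the object starts on a DEMO_MINALIGN boundary and the pad is
    rounded up to the same boundary, code that assumes malloc-style
    alignment can use either region.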
     

16 Jul, 2014

1 commit

  • While a queue is being destroyed, all the blkgs are destroyed and its
    ->root_blkg pointer is set to NULL. If someone else starts to drain
    while the queue is in this state, the following oops happens.

    NULL pointer dereference at 0000000000000028
    IP: [] blk_throtl_drain+0x84/0x230
    PGD e4a1067 PUD b773067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Modules linked in: cfq_iosched(-) [last unloaded: cfq_iosched]
    CPU: 1 PID: 537 Comm: bash Not tainted 3.16.0-rc3-work+ #2
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    task: ffff88000e222250 ti: ffff88000efd4000 task.ti: ffff88000efd4000
    RIP: 0010:[] [] blk_throtl_drain+0x84/0x230
    RSP: 0018:ffff88000efd7bf0 EFLAGS: 00010046
    RAX: 0000000000000000 RBX: ffff880015091450 RCX: 0000000000000001
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
    RBP: ffff88000efd7c10 R08: 0000000000000000 R09: 0000000000000001
    R10: ffff88000e222250 R11: 0000000000000000 R12: ffff880015091450
    R13: ffff880015092e00 R14: ffff880015091d70 R15: ffff88001508fc28
    FS: 00007f1332650740(0000) GS:ffff88001fa80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000000000028 CR3: 0000000009446000 CR4: 00000000000006e0
    Stack:
    ffffffff8144e8f6 ffff880015091450 0000000000000000 ffff880015091d80
    ffff88000efd7c28 ffffffff8144ae2f ffff880015091450 ffff88000efd7c58
    ffffffff81427641 ffff880015091450 ffffffff82401f00 ffff880015091450
    Call Trace:
    [] blkcg_drain_queue+0x1f/0x60
    [] __blk_drain_queue+0x71/0x180
    [] blk_queue_bypass_start+0x6e/0xb0
    [] blkcg_deactivate_policy+0x38/0x120
    [] blk_throtl_exit+0x34/0x50
    [] blkcg_exit_queue+0x35/0x40
    [] blk_release_queue+0x26/0xd0
    [] kobject_cleanup+0x38/0x70
    [] kobject_put+0x28/0x60
    [] blk_put_queue+0x15/0x20
    [] scsi_device_dev_release_usercontext+0x16b/0x1c0
    [] execute_in_process_context+0x89/0xa0
    [] scsi_device_dev_release+0x1c/0x20
    [] device_release+0x32/0xa0
    [] kobject_cleanup+0x38/0x70
    [] kobject_put+0x28/0x60
    [] put_device+0x17/0x20
    [] __scsi_remove_device+0xa9/0xe0
    [] scsi_remove_device+0x2b/0x40
    [] sdev_store_delete+0x27/0x30
    [] dev_attr_store+0x18/0x30
    [] sysfs_kf_write+0x3e/0x50
    [] kernfs_fop_write+0xe7/0x170
    [] vfs_write+0xaf/0x1d0
    [] SyS_write+0x4d/0xc0
    [] system_call_fastpath+0x16/0x1b

    776687bce42b ("block, blk-mq: draining can't be skipped even if
    bypass_depth was non-zero") made it easier to trigger this bug by
    making blk_queue_bypass_start() drain even when it loses the first
    bypass test to blk_cleanup_queue(); however, the bug has always been
    there even before the commit as blk_queue_bypass_start() could race
    against queue destruction, win the initial bypass test but perform the
    actual draining after blk_cleanup_queue() already destroyed all blkgs.

    Fix it by skipping the call into policy draining if all the blkgs are
    already gone.

    Signed-off-by: Tejun Heo
    Reported-by: Shirish Pargaonkar
    Reported-by: Sasha Levin
    Reported-by: Jet Chen
    Cc: stable@vger.kernel.org
    Tested-by: Shirish Pargaonkar
    Signed-off-by: Jens Axboe

    Tejun Heo
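
    The shape of the fix described above, skipping policy draining once
    the root blkg is gone, can be sketched in userspace as follows. The
    demo_* types and functions are hypothetical models, not the kernel's
    blkcg structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature versions of the blkg and request queue. */
struct demo_blkg {
	int pending;
};

struct demo_queue {
	struct demo_blkg *root_blkg;	/* NULL once the blkgs are destroyed */
};

static int demo_policy_drain_calls;

/* Models a policy drain hook that dereferences the root blkg. */
static void demo_policy_drain(struct demo_queue *q)
{
	demo_policy_drain_calls++;
	q->root_blkg->pending = 0;	/* would fault if root_blkg were NULL */
}

/* Models the guarded entry point: bail out if the blkgs are already gone. */
static void demo_blkcg_drain_queue(struct demo_queue *q)
{
	assert(q != NULL);
	if (!q->root_blkg)
		return;
	demo_policy_drain(q);
}
```

    A queue whose root_blkg is already NULL is left alone, while a live
    queue is drained as before.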
     

15 Jul, 2014

3 commits

  • Currently, cftypes added by cgroup_add_cftypes() are used for both the
    unified default hierarchy and legacy ones and subsystems can mark each
    file with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to
    appear only on one of them. This is quite hairy and error-prone.
    Also, we may end up exposing interface files to the default hierarchy
    without thinking it through.

    cgroup_subsys will grow two separate cftype addition functions and
    apply each only on the hierarchies of the matching type. This will
    allow organizing cftypes in a lot clearer way and encourage subsystems
    to scrutinize the interface which is being exposed in the new default
    hierarchy.

    In preparation, this patch adds cgroup_add_legacy_cftypes() which
    currently is a simple wrapper around cgroup_add_cftypes() and replaces
    all cgroup_add_cftypes() usages with it.

    While at it, this patch drops a completely spurious return from
    __hugetlb_cgroup_file_init().

    This patch doesn't introduce any functional differences.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Aneesh Kumar K.V

    Tejun Heo
     
  • Currently, cgroup_subsys->base_cftypes is used for both the unified
    default hierarchy and legacy ones and subsystems can mark each file
    with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear
    only on one of them. This is quite hairy and error-prone. Also, we
    may end up exposing interface files to the default hierarchy without
    thinking it through.

    cgroup_subsys will grow two separate cftype arrays and apply each only
    on the hierarchies of the matching type. This will allow organizing
    cftypes in a lot clearer way and encourage subsystems to scrutinize
    the interface which is being exposed in the new default hierarchy.

    In preparation, this patch renames cgroup_subsys->base_cftypes to
    cgroup_subsys->legacy_cftypes. This patch is pure rename.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vivek Goyal
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Aristeu Rozanski
    Cc: Aneesh Kumar K.V

    Tejun Heo
     
  • This reverts commit 254c4407cb84a6dec90336054615b0f0e996bb7c.

    It causes crashes with cryptsetup, even after a few iterations and
    updates. Drop it for now.

    Jens Axboe
     

14 Jul, 2014

1 commit


12 Jul, 2014

1 commit

  • While a queue is being destroyed, all the blkgs are destroyed and its
    ->root_blkg pointer is set to NULL. If someone else starts to drain
    while the queue is in this state, the following oops happens.

    NULL pointer dereference at 0000000000000028
    IP: [] blk_throtl_drain+0x84/0x230
    PGD e4a1067 PUD b773067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Modules linked in: cfq_iosched(-) [last unloaded: cfq_iosched]
    CPU: 1 PID: 537 Comm: bash Not tainted 3.16.0-rc3-work+ #2
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    task: ffff88000e222250 ti: ffff88000efd4000 task.ti: ffff88000efd4000
    RIP: 0010:[] [] blk_throtl_drain+0x84/0x230
    RSP: 0018:ffff88000efd7bf0 EFLAGS: 00010046
    RAX: 0000000000000000 RBX: ffff880015091450 RCX: 0000000000000001
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
    RBP: ffff88000efd7c10 R08: 0000000000000000 R09: 0000000000000001
    R10: ffff88000e222250 R11: 0000000000000000 R12: ffff880015091450
    R13: ffff880015092e00 R14: ffff880015091d70 R15: ffff88001508fc28
    FS: 00007f1332650740(0000) GS:ffff88001fa80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000000000028 CR3: 0000000009446000 CR4: 00000000000006e0
    Stack:
    ffffffff8144e8f6 ffff880015091450 0000000000000000 ffff880015091d80
    ffff88000efd7c28 ffffffff8144ae2f ffff880015091450 ffff88000efd7c58
    ffffffff81427641 ffff880015091450 ffffffff82401f00 ffff880015091450
    Call Trace:
    [] blkcg_drain_queue+0x1f/0x60
    [] __blk_drain_queue+0x71/0x180
    [] blk_queue_bypass_start+0x6e/0xb0
    [] blkcg_deactivate_policy+0x38/0x120
    [] blk_throtl_exit+0x34/0x50
    [] blkcg_exit_queue+0x35/0x40
    [] blk_release_queue+0x26/0xd0
    [] kobject_cleanup+0x38/0x70
    [] kobject_put+0x28/0x60
    [] blk_put_queue+0x15/0x20
    [] scsi_device_dev_release_usercontext+0x16b/0x1c0
    [] execute_in_process_context+0x89/0xa0
    [] scsi_device_dev_release+0x1c/0x20
    [] device_release+0x32/0xa0
    [] kobject_cleanup+0x38/0x70
    [] kobject_put+0x28/0x60
    [] put_device+0x17/0x20
    [] __scsi_remove_device+0xa9/0xe0
    [] scsi_remove_device+0x2b/0x40
    [] sdev_store_delete+0x27/0x30
    [] dev_attr_store+0x18/0x30
    [] sysfs_kf_write+0x3e/0x50
    [] kernfs_fop_write+0xe7/0x170
    [] vfs_write+0xaf/0x1d0
    [] SyS_write+0x4d/0xc0
    [] system_call_fastpath+0x16/0x1b

    776687bce42b ("block, blk-mq: draining can't be skipped even if
    bypass_depth was non-zero") made it easier to trigger this bug by
    making blk_queue_bypass_start() drain even when it loses the first
    bypass test to blk_cleanup_queue(); however, the bug has always been
    there even before the commit as blk_queue_bypass_start() could race
    against queue destruction, win the initial bypass test but perform the
    actual draining after blk_cleanup_queue() already destroyed all blkgs.

    Fix it by skipping the call into policy draining if all the blkgs are
    already gone.

    Signed-off-by: Tejun Heo
    Reported-by: Shirish Pargaonkar
    Reported-by: Sasha Levin
    Reported-by: Jet Chen
    Cc: stable@vger.kernel.org
    Tested-by: Shirish Pargaonkar
    Signed-off-by: Jens Axboe

    Tejun Heo
     

09 Jul, 2014

2 commits

  • sane_behavior has been used as a development vehicle for the default
    unified hierarchy. Now that the default hierarchy is in place, the
    flag became redundant and confusing as its usage is allowed on all
    hierarchies. There are gonna be either the default hierarchy or
    legacy ones. Let's make that clear by removing sane_behavior support
    on non-default hierarchies.

    This patch replaces cgroup_sane_behavior() with cgroup_on_dfl(). The
    comment on top of CGRP_ROOT_SANE_BEHAVIOR is moved to on top of
    cgroup_on_dfl() with sane_behavior specific part dropped.

    On the default and legacy hierarchies w/o sane_behavior, this
    shouldn't cause any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko

    Tejun Heo
     
  • Currently, the blkio subsystem attributes all of writeback IOs to the
    root. One of the issues is that there's no way to tell who originated
    a writeback IO from block layer. Those IOs are usually issued
    asynchronously from a task which didn't have anything to do with
    actually generating the dirty pages. The memory subsystem, when
    enabled, already keeps track of the ownership of each dirty page and
    it's desirable for blkio to piggyback instead of adding its own
    per-page tag.

    cgroup now has a mechanism to express such dependency -
    cgroup_subsys->depends_on. This patch declares that blkcg depends on
    memcg so that memcg is enabled automatically on the default hierarchy
    when available. Future changes will make blkcg map the memcg tag to
    find out the cgroup to blame for writeback IOs.

    As this means that a memcg may be made invisible, this patch also
    implements css_reset() for memcg which resets its basic
    configurations. This implementation will probably need to be expanded
    to cover other states which are used in the default hierarchy.

    v2: blkcg's dependency on memcg is wrapped with CONFIG_MEMCG to avoid
    build failure. Reported by kbuild test robot.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vivek Goyal
    Cc: Jens Axboe

    Tejun Heo
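
    The dependency mechanism described in this commit amounts to expanding
    a requested set of controllers until every dependency is included. A
    minimal sketch, assuming a made-up controller list and a plain bitmask
    table (neither is the kernel's actual data structure):

```c
#include <assert.h>

/* Hypothetical controller ids for the sketch. */
enum { DEMO_SS_MEMORY, DEMO_SS_BLKIO, DEMO_SS_CPU, DEMO_SS_COUNT };

/* demo_depends_on[i] is a bitmask of controllers that controller i needs. */
static const unsigned int demo_depends_on[DEMO_SS_COUNT] = {
	[DEMO_SS_BLKIO] = 1u << DEMO_SS_MEMORY,
};

/* Expand a requested controller mask until all dependencies are included. */
static unsigned int demo_effective_mask(unsigned int requested)
{
	unsigned int cur = requested;

	assert((requested >> DEMO_SS_COUNT) == 0);
	for (;;) {
		unsigned int next = cur;

		for (int i = 0; i < DEMO_SS_COUNT; i++)
			if (cur & (1u << i))
				next |= demo_depends_on[i];
		if (next == cur)
			return cur;	/* fixed point reached */
		cur = next;
	}
}
```

    Requesting only the blkio bit yields a mask with both blkio and memory
    set; requesting an unrelated controller leaves the mask unchanged.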
     

08 Jul, 2014

1 commit

  • There is no inherent reason why the last put of a tag structure must be
    the one for the Scsi_Host, as device model objects can be held for
    arbitrary periods. Merge blk_free_tags and __blk_free_tags into a single
    function that just releases a reference, and get rid of the BUG() when the
    host reference wasn't the last.

    Signed-off-by: Christoph Hellwig
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Christoph Hellwig
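
    The point of this commit is that whichever holder drops the last
    reference frees the tag structure, with no assumption about who that
    is. A minimal sketch with hypothetical names and a plain (non-atomic)
    counter:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the shared tag map. */
struct demo_tags {
	int refcnt;
};

static struct demo_tags *demo_tags_alloc(void)
{
	struct demo_tags *t = malloc(sizeof(*t));

	if (t)
		t->refcnt = 1;	/* reference owned by the creator */
	return t;
}

static void demo_tags_get(struct demo_tags *t)
{
	assert(t != NULL && t->refcnt > 0);
	t->refcnt++;
}

/* Drop one reference; returns true if this put freed the structure. */
static bool demo_tags_put(struct demo_tags *t)
{
	assert(t != NULL && t->refcnt > 0);
	if (--t->refcnt > 0)
		return false;
	free(t);
	return true;
}
```

    The first put by the "host" simply returns false if a queue still
    holds a reference; nothing treats that ordering as a bug.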
     

02 Jul, 2014

14 commits

  • The original behaviour is to refuse to add a new page if the maximum
    number of segments has been reached, regardless of whether the page we
    are going to add can be merged into the last segment.

    Unfortunately, when the system runs under heavy memory fragmentation
    conditions, a driver may try to add multiple pages to the last segment.
    The original code won't accept them and EBUSY will be reported to
    userspace.

    This patch modifies the function so it refuses to add a page only in case
    the latter starts a new segment and the maximum number of segments has
    already been reached.

    The bug can be easily reproduced with the st driver:

    1) set CONFIG_SCSI_MPT2SAS_MAX_SGE or CONFIG_SCSI_MPT3SAS_MAX_SGE to 16
    2) modprobe st buffer_kbs=1024
    3) #dd if=/dev/zero of=/dev/st0 bs=1M count=10
    dd: error writing `/dev/st0': Device or resource busy

    [ming.lei@canonical.com: update bi_iter.bi_size before recounting segments]
    Signed-off-by: Maurizio Lombardi
    Signed-off-by: Ming Lei
    Tested-by: Dongsu Park
    Tested-by: Jet Chen
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Kent Overstreet
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Maurizio Lombardi
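
    The rule this commit introduces can be stated as a small predicate.
    This is a sketch with hypothetical names, not the body of
    __bio_add_page(); how the kernel decides that a page merges into the
    last segment is outside its scope.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether one more page may be added to a bio.
 * nr_segments:      segments the bio already has
 * max_segments:     limit imposed by the queue
 * merges_into_last: true if the page is contiguous with the last segment
 */
static bool demo_may_add_page(unsigned int nr_segments,
			      unsigned int max_segments,
			      bool merges_into_last)
{
	assert(nr_segments <= max_segments);

	/* A page that extends the last segment does not consume a new slot. */
	if (merges_into_last)
		return true;

	/* Otherwise the page starts a new segment and needs a free slot. */
	return nr_segments < max_segments;
}
```

    With 16 of 16 segments in use, a mergeable page is accepted and a page
    that would start a 17th segment is refused.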
     
  • After the SG_IO ioctl was copied into the block layer and
    later into the bsg driver, subtle differences emerged.

    One difference is the way injected commands are queued through
    the block layer (i.e. this is not SCSI device queueing nor SATA
    NCQ). Summarizing:
    - SG_IO on block layer device: blk_exec*(at_head=false)
    - sg device SG_IO: at_head=true
    - bsg device SG_IO: at_head=true

    Some time ago Boaz Harrosh introduced a sg v4 flag called
    BSG_FLAG_Q_AT_TAIL to override the bsg driver default. A
    recent patch titled: "sg: add SG_FLAG_Q_AT_TAIL flag"
    allowed the sg driver default to be overridden. This patch
    allows a SG_IO ioctl sent to a block layer device to have
    its default overridden.

    ChangeLog:
    - introduce SG_FLAG_Q_AT_HEAD flag in sg.h to cause
    commands that are injected via a block layer
    device SG_IO ioctl to set at_head=true
    - make comments clearer about queueing in sg.h since the
    header is used both by the sg device and block layer
    device implementations of the SG_IO ioctl.
    - introduce BSG_FLAG_Q_AT_HEAD in bsg.h for compatibility
    (it does nothing) and update comments.

    Signed-off-by: Douglas Gilbert
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Mike Christie
    Signed-off-by: Jens Axboe

    Douglas Gilbert
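
    The defaults and overrides summarized in this commit can be written
    as one decision function. The flag values and enum below are
    placeholders invented for the sketch; the real definitions live in
    sg.h and bsg.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder flag bits for the sketch; see sg.h and bsg.h for real values. */
#define DEMO_Q_AT_TAIL 0x1u
#define DEMO_Q_AT_HEAD 0x2u

enum demo_path {
	DEMO_BLOCK_DEV_SG_IO,	/* SG_IO on a block layer device */
	DEMO_SG_DEV_SG_IO,	/* SG_IO on an sg device */
	DEMO_BSG_DEV_SG_IO,	/* SG_IO on a bsg device */
};

/* Returns true if the injected command should be queued at the head. */
static bool demo_at_head(enum demo_path path, unsigned int flags)
{
	switch (path) {
	case DEMO_BLOCK_DEV_SG_IO:
		/* Default is tail; the head flag overrides it. */
		return (flags & DEMO_Q_AT_HEAD) != 0;
	case DEMO_SG_DEV_SG_IO:
	case DEMO_BSG_DEV_SG_IO:
		/* Default is head; the tail flag overrides it. */
		return (flags & DEMO_Q_AT_TAIL) == 0;
	}
	assert(!"unknown path");
	return false;
}
```

    On the sg and bsg paths the head flag is accepted but changes nothing,
    matching the "for compatibility (it does nothing)" note above.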
     
  • SG_GET_RESERVED_SIZE and SG_SET_RESERVED_SIZE ioctls access a reserved
    buffer in bytes as int type. The value needs to be capped at the request
    queue's max_sectors. But integer overflow is not correctly handled in
    the calculation when converting max_sectors from sectors to bytes.

    Signed-off-by: Akinobu Mita
    Cc: Jens Axboe
    Cc: "James E.J. Bottomley"
    Cc: Douglas Gilbert
    Cc: linux-scsi@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Akinobu Mita
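
    One way to avoid the overflow described in this commit is to clamp
    the sector count before converting it to bytes. The helper below is
    a userspace sketch with a hypothetical name; it assumes 512-byte
    sectors (a shift by 9).

```c
#include <assert.h>
#include <limits.h>

/* Convert max_sectors (512-byte units) to bytes, capped so it fits in int. */
static int demo_max_sectors_bytes(unsigned int max_sectors)
{
	const unsigned int cap = (unsigned int)INT_MAX >> 9;

	if (max_sectors > cap)
		max_sectors = cap;	/* clamp before shifting, not after */

	assert(max_sectors <= cap);
	return (int)(max_sectors << 9);
}
```

    Without the clamp, a max_sectors of 0x800000 shifted left by 9 would
    wrap a 32-bit value to zero; with it, the result stays positive.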
     
  • The BLKSECTGET ioctl stores the request queue's max_sectors as an
    unsigned short value through the argument pointer. So if max_sectors is
    greater than USHRT_MAX, its upper 16 bits are simply discarded.

    In that case, returning USHRT_MAX is preferable to returning the lower
    16 bits of max_sectors.

    Signed-off-by: Akinobu Mita
    Cc: Jens Axboe
    Cc: "James E.J. Bottomley"
    Cc: Douglas Gilbert
    Cc: linux-scsi@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Akinobu Mita
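
    The difference between truncating and saturating can be seen in a
    short sketch. Both helpers are hypothetical and operate on a plain
    integer rather than a request queue.

```c
#include <assert.h>
#include <limits.h>

/* Truncating shape: the conversion silently keeps only the low bits. */
static unsigned short demo_sectget_truncating(unsigned int max_sectors)
{
	return (unsigned short)max_sectors;
}

/* Saturating shape: report USHRT_MAX instead of wrapping. */
static unsigned short demo_sectget_clamped(unsigned int max_sectors)
{
	unsigned short out = max_sectors > USHRT_MAX
				? USHRT_MAX
				: (unsigned short)max_sectors;

	assert(out <= max_sectors);
	return out;
}
```

    For a max_sectors of 70000, the truncating helper reports only the low
    bits of the value, while the clamped helper reports USHRT_MAX.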
     
  • Adding function documentation and fixing kerneldoc warnings
    ('field: description' uniformization).

    Cc: Davidlohr Bueso
    Cc: Jens Axboe
    Signed-off-by: Fabian Frederick
    Signed-off-by: Jens Axboe

    Fabian Frederick
     
  • checkpatch fixing:
    WARNING: Missing a blank line after declarations
    WARNING: space prohibited between function name and open parenthesis '('
    ERROR: spaces required around that '
    Cc: Andrew Morton
    Signed-off-by: Fabian Frederick
    Signed-off-by: Jens Axboe

    Fabian Frederick
     
  • Also add no prefix pr_fmt to avoid any future default format update

    Cc: Jens Axboe
    Cc: Andrew Morton
    Signed-off-by: Fabian Frederick
    Signed-off-by: Jens Axboe

    Fabian Frederick
     
  • kcalloc manages count*sizeof overflow.

    Cc: Jens Axboe
    Cc: Andrew Morton
    Signed-off-by: Fabian Frederick
    Signed-off-by: Jens Axboe

    Fabian Frederick
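
    The overflow check that makes an array allocator safer than an open
    coded count*size multiplication can be sketched in userspace. The
    helper name is hypothetical, and malloc() plus memset() stand in for
    the kernel allocator.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Allocate a zeroed array of n elements, refusing if n * size overflows. */
static void *demo_zalloc_array(size_t n, size_t size)
{
	void *p;

	if (size != 0 && n > SIZE_MAX / size)
		return NULL;	/* n * size would wrap around */

	assert(size == 0 || (n * size) / size == n);
	p = malloc(n * size);
	if (p)
		memset(p, 0, n * size);
	return p;
}
```

    An open coded multiplication would wrap for a huge element count and
    allocate a buffer far smaller than the caller expects; the checked
    helper returns NULL instead.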
     
  • Commit 08778795 ("block: Fix nr_vecs for inline integrity vectors") from
    Martin introduced the function bip_integrity_vecs() (get the useful
    vectors) to fix the nr_vecs issue for inline integrity vectors that was
    reported by David Milburn.

    But bip_integrity_vecs() seems to return the wrong number if the bio is
    not based on any bio_set for some reason (bio->bi_pool == NULL), because
    in that case bip_inline_vecs[0] is malloced directly. So here we add
    bip_max_vcnt to record the count of vector slots, and clean up the
    function bip_integrity_vecs().

    Signed-off-by: Gu Zheng
    Cc: Martin K. Petersen
    Cc: Kent Overstreet
    Signed-off-by: Jens Axboe

    Gu Zheng
     
  • Currently, blk-mq uses a percpu_counter to keep track of how many
    usages are in flight. The percpu_counter is drained while freezing to
    ensure that no usage is left in-flight after freezing is complete.
    blk_mq_queue_enter/exit() and blk_mq_[un]freeze_queue() implement this
    per-cpu gating mechanism.

    This type of code has a relatively high chance of subtle bugs which are
    extremely difficult to trigger and it's way too hairy to be open coded
    in blk-mq. percpu_ref can serve the same purpose after the recent
    changes. This patch replaces the open-coded per-cpu usage counting
    and draining mechanism with percpu_ref.

    blk_mq_queue_enter() performs tryget_live on the ref and exit()
    performs put. blk_mq_freeze_queue() kills the ref and waits until the
    reference count reaches zero. blk_mq_unfreeze_queue() revives the ref
    and wakes up the waiters.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Nicholas A. Bellinger
    Cc: Kent Overstreet
    Signed-off-by: Jens Axboe

    Tejun Heo
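
    The enter/exit/freeze/unfreeze protocol described in this commit can
    be modelled with a deliberately simplified, single-threaded sketch.
    It is not percpu_ref: there are no per-cpu counters, no atomics and
    no sleeping, and all names are hypothetical. It only shows which
    operation is allowed in which state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-threaded model of a "live" reference counter. */
struct demo_ref {
	bool live;	/* cleared when frozen ("killed"), set on unfreeze */
	long count;	/* usages currently in flight */
};

/* Model of queue enter: take a reference only while the ref is live. */
static bool demo_ref_enter(struct demo_ref *ref)
{
	if (!ref->live)
		return false;	/* the real code would wait for unfreeze */
	ref->count++;
	return true;
}

/* Model of queue exit: drop the reference taken by demo_ref_enter(). */
static void demo_ref_exit(struct demo_ref *ref)
{
	assert(ref->count > 0);
	ref->count--;
}

/* Model of starting a freeze: no new usages are admitted from now on. */
static void demo_ref_freeze_begin(struct demo_ref *ref)
{
	ref->live = false;
}

/* A freeze is complete once every in-flight usage has exited. */
static bool demo_ref_drained(const struct demo_ref *ref)
{
	return !ref->live && ref->count == 0;
}

/* Model of unfreeze: admit new usages again. */
static void demo_ref_unfreeze(struct demo_ref *ref)
{
	ref->live = true;
}
```

    A usage that entered before the freeze keeps the queue from being
    considered drained until it exits; entries attempted during the
    freeze are turned away.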
     
  • Keeping __blk_mq_drain_queue() as a separate function doesn't buy us
    anything and it's gonna be further simplified. Let's flatten it into
    its caller.

    This patch doesn't make any functional change.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Nicholas A. Bellinger
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blk_mq freezing is entangled with generic bypassing which bypasses
    blkcg and io scheduler and lets IO requests fall through the block
    layer to the drivers in FIFO order. This allows forward progress on
    IOs with the advanced features disabled so that those features can be
    configured or altered without worrying about stalling IO which may
    lead to deadlock through memory allocation.

    However, generic bypassing doesn't quite fit blk-mq. blk-mq currently
    doesn't make use of blkcg or ioscheds and it maps bypassing to
    freezing, which blocks request processing and drains all the in-flight
    ones. This causes problems as bypassing assumes that request
    processing is online. blk-mq works around this by conditionally
    allowing request processing for the problem case - during queue
    initialization.

    Another oddity is that except for during queue cleanup, bypassing
    started on the generic side prevents blk-mq from processing new
    requests but doesn't drain the in-flight ones. This shouldn't break
    anything but again highlights that something isn't quite right here.

    The root cause is conflating blk-mq freezing and generic bypassing
    which are two different mechanisms. The only intersecting purpose
    that they serve is during queue cleanup. Let's properly separate
    blk-mq freezing from generic bypassing and simply use it where
    necessary.

    * request_queue->mq_freeze_depth is added and
    blk_mq_[un]freeze_queue() now operate on this counter instead of
    ->bypass_depth. The replacement for QUEUE_FLAG_BYPASS isn't added
    but the counter is tested directly. This will be further updated by
    later changes.

    * blk_mq_drain_queue() is dropped and "__" prefix is dropped from
    blk_mq_freeze_queue(). Queue cleanup path now calls
    blk_mq_freeze_queue() directly.

    * blk_queue_enter()'s fast path condition is simplified to simply
    check @q->mq_freeze_depth. Previously, the condition was

    !blk_queue_dying(q) &&
    (!blk_queue_bypass(q) || !blk_queue_init_done(q))

    mq_freeze_depth is incremented right after dying is set and
    blk_queue_init_done() exception isn't necessary as blk-mq doesn't
    start frozen, which only leaves the blk_queue_bypass() test which
    can be replaced by @q->mq_freeze_depth test.

    This change simplifies the code and reduces confusion in the area.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Nicholas A. Bellinger
    Signed-off-by: Jens Axboe

    Tejun Heo
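
    The old and new fast-path conditions quoted in this commit can be
    compared in a small sketch. The struct is a hypothetical model that
    keeps only the fields the two conditions mention; the freeze helpers
    show why a depth counter handles nested freezes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model holding only the state the two conditions read. */
struct demo_mq {
	bool dying;
	bool bypass;
	bool init_done;
	int mq_freeze_depth;
};

/* The previous fast-path test, as quoted in the commit message. */
static bool demo_fast_path_old(const struct demo_mq *q)
{
	return !q->dying && (!q->bypass || !q->init_done);
}

/* The simplified test: enter only while nobody holds a freeze. */
static bool demo_fast_path_new(const struct demo_mq *q)
{
	return q->mq_freeze_depth == 0;
}

/* Each freeze bumps the depth; freezes may nest. */
static void demo_mq_freeze(struct demo_mq *q)
{
	q->mq_freeze_depth++;
}

/* The queue is usable again only after the matching number of unfreezes. */
static void demo_mq_unfreeze(struct demo_mq *q)
{
	assert(q->mq_freeze_depth > 0);
	q->mq_freeze_depth--;
}
```

    After two freezes and one unfreeze the fast path stays closed; it
    reopens only when the depth returns to zero.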
     
  • Currently, both blk_queue_bypass_start() and blk_mq_freeze_queue()
    skip queue draining if bypass_depth was already above zero. The
    assumption is that the one which bumped the bypass_depth should have
    performed draining already; however, there's nothing which prevents a
    new instance of bypassing/freezing from starting before the previous
    one finishes draining. The current code may allow the later
    bypassing/freezing instances to complete while there still are
    in-flight requests which haven't finished draining.

    Fix it by draining regardless of bypass_depth. We still skip draining
    from blk_queue_bypass_start() while the queue is initializing to avoid
    introducing excessive delays during boot. INIT_DONE setting is moved
    above the initial blk_queue_bypass_end() so that bypassing attempts
    can't slip in between.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Nicholas A. Bellinger
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blk-mq uses a percpu_counter to keep track of how many usages are in
    flight. The percpu_counter is drained while freezing to ensure that
    no usage is left in-flight after freezing is complete.

    blk_mq_queue_enter/exit() and blk_mq_[un]freeze_queue() implement this
    per-cpu gating mechanism; unfortunately, it contains a subtle bug -
    smp_wmb() in blk_mq_queue_enter() doesn't prevent the cpu from
    fetching @q->bypass_depth before incrementing @q->mq_usage_counter and
    if freezing happens in between the caller can slip through and freezing
    can be complete while there are active users.

    Use smp_mb() instead so that bypass_depth and mq_usage_counter
    modifications and tests are properly interlocked.
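
    The ordering requirement can be illustrated in userspace with C11
    atomics. This is only a sketch under stated assumptions: struct gate
    and the gate_*() helpers are hypothetical names, a single shared
    counter stands in for the percpu_counter, and none of this is the
    kernel's code. The point is that each side stores to its own variable
    and then loads the other's, so a full fence is needed between the
    store and the load.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the queue gate; not kernel types. */
struct gate {
	atomic_int usage;   /* stands in for mq_usage_counter */
	atomic_int frozen;  /* stands in for bypass_depth */
};

static void gate_init(struct gate *g)
{
	atomic_init(&g->usage, 0);
	atomic_init(&g->frozen, 0);
}

/* Enter: publish our usage first, then check whether a freeze has begun. */
static bool gate_enter(struct gate *g)
{
	atomic_fetch_add_explicit(&g->usage, 1, memory_order_relaxed);
	/* Full fence: orders the store above before the load below.
	 * A write-only barrier would not order this store/load pair. */
	atomic_thread_fence(memory_order_seq_cst);
	if (atomic_load_explicit(&g->frozen, memory_order_relaxed) > 0) {
		/* A freeze is in progress: back out and report failure. */
		atomic_fetch_sub_explicit(&g->usage, 1, memory_order_relaxed);
		return false;
	}
	return true;
}

static void gate_exit(struct gate *g)
{
	int prev = atomic_fetch_sub_explicit(&g->usage, 1, memory_order_release);
	assert(prev > 0);
	(void)prev;
}

/* Start freezing; returns the number of users that still have to drain. */
static int gate_freeze_start(struct gate *g)
{
	atomic_fetch_add_explicit(&g->frozen, 1, memory_order_relaxed);
	/* Mirror image of gate_enter(): store, full fence, then load. */
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&g->usage, memory_order_acquire);
}

static void gate_unfreeze(struct gate *g)
{
	atomic_fetch_sub_explicit(&g->frozen, 1, memory_order_release);
}
```

    With the fences in place, either the entering thread observes the
    freeze and backs out, or the freezing thread observes the usage and
    waits for it to drain; both missing each other is ruled out.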

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Nicholas A. Bellinger
    Signed-off-by: Jens Axboe

    Tejun Heo
     

27 Jun, 2014

1 commit

  • Pull block fixes from Jens Axboe:
    "A small collection of fixes/changes for the current series. This
    contains:

    - Removal of dead code from Gu Zheng.

    - Revert of two bad fixes that went in earlier in this round, marking
    things as __init that were not purely used from init.

    - A fix for blk_mq_start_hw_queue() calling __blk_mq_run_hw_queue(),
    which could run us on the wrong CPU. Make it use the non-__ variant,
    which handles cases where we are called from the wrong CPU set.
    From me.

    - A fix for drbd, which allocates discard requests without room for
    the SCSI payload. From Lars Ellenberg.

    - A fix for use-after-free in the blkcg code from Tejun.

    - Addition of limiting gaps in SG lists, if the hardware needs it.
    This is the last pre-req patch for blk-mq to enable the full NVMe
    conversion. Could wait until 3.17, but it's simple enough so would
    be nice to have everything we need for the NVMe port in the 3.17
    release. From me"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    drbd: fix NULL pointer deref in blk_add_request_payload
    blk-mq: blk_mq_start_hw_queue() should use blk_mq_run_hw_queue()
    block: add support for limiting gaps in SG lists
    bio: remove unused macro bip_vec_idx()
    Revert "block: add __init to elv_register"
    Revert "block: add __init to blkcg_policy_register"
    blkcg: fix use-after-free in __blkg_release_rcu() by making blkcg_gq refcnt an atomic_t
    floppy: format block0 read error message properly

    Linus Torvalds
     

25 Jun, 2014

2 commits

  • Currently blk_mq_start_hw_queue() calls __blk_mq_run_hw_queue(),
    which depends on the CPU placement being correct. This means it's
    not possible to call blk_mq_start_hw_queues(q) from a context that
    is correct for all queues, which triggers the

    WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask));

    in __blk_mq_run_hw_queue().

    Reported-by: Ming Lei
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Another restriction inherited for NVMe - those devices don't support
    SG lists that have "gaps" in them. Gaps refers to cases where the
    previous SG entry doesn't end on a page boundary. For NVMe, all SG
    entries must start at offset 0 (except the first) and end on a page
    boundary (except the last).
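
    The rule above can be expressed as a small check. This is a sketch
    only: struct seg and sg_has_gap() are hypothetical names used for
    illustration and are not the block layer's data structures or API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified segment: byte offset into its first page
 * plus length in bytes. Not the kernel's scatterlist type. */
struct seg {
	size_t offset;
	size_t len;
};

/* Returns true if the list violates the "no gaps" rule described above. */
static bool sg_has_gap(const struct seg *segs, size_t n, size_t page_size)
{
	assert(page_size > 0);
	for (size_t i = 0; i < n; i++) {
		/* Every entry except the first must start at offset 0. */
		if (i > 0 && segs[i].offset != 0)
			return true;
		/* Every entry except the last must end on a page boundary. */
		if (i + 1 < n &&
		    (segs[i].offset + segs[i].len) % page_size != 0)
			return true;
	}
	return false;
}
```

    A list whose first entry starts mid-page but runs to the end of the
    page, followed by full pages and a short tail, passes the check; an
    entry in the middle that starts or stops mid-page does not.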

    Signed-off-by: Jens Axboe

    Jens Axboe
     

23 Jun, 2014

3 commits

  • This reverts commit b5097e956a4d2919ee248d6481e4204c5568ed5c.

    The original commit is buggy: we do use the registration functions
    at runtime, for instance when loading IO schedulers through sysfs.

    Reported-by: Damien Wyart

    Jens Axboe
     
  • This reverts commit a2d445d440003f2d70ee4cd4970ea82ace616fee.

    The original commit is buggy: we do use the registration functions
    at runtime for modular builds.

    Jens Axboe
     
  • Hello,

    So, this patch should do. Joe, Vivek, can one of you guys please
    verify that the oops goes away with this patch?

    Jens, the original thread can be read at

    http://thread.gmane.org/gmane.linux.kernel/1720729

    The fix converts blkg->refcnt from int to atomic_t. It does add some
    overhead but it should be minute compared to everything else which is
    going on and the involved cacheline bouncing, so I think it's highly
    unlikely to cause any noticeable difference. Also, the refcnt in
    question should be converted to a percpu_ref for blk-mq anyway, so the
    atomic_t is likely to go away pretty soon.

    Thanks.

    ------- 8< -------
    __blkg_release_rcu() may be invoked after the associated request_queue
    is released with a RCU grace period in between. As such, the function
    and callbacks invoked from it must not dereference the associated
    request_queue. This is clearly indicated in the comment above the
    function.

    Unfortunately, while trying to fix a different issue, 2a4fd070ee85
    ("blkcg: move bulk of blkcg_gq release operations to the RCU
    callback") ignored this and added [un]locking of @blkg->q->queue_lock
    to __blkg_release_rcu(). This of course can cause oops as the
    request_queue may be long gone by the time this code gets executed.

    general protection fault: 0000 [#1] SMP
    CPU: 21 PID: 30 Comm: rcuos/21 Not tainted 3.15.0 #1
    Hardware name: Stratus ftServer 6400/G7LAZ, BIOS BIOS Version 6.3:57 12/25/2013
    task: ffff880854021de0 ti: ffff88085403c000 task.ti: ffff88085403c000
    RIP: 0010:[] [] _raw_spin_lock_irq+0x15/0x60
    RSP: 0018:ffff88085403fdf0 EFLAGS: 00010086
    RAX: 0000000000020000 RBX: 0000000000000010 RCX: 0000000000000000
    RDX: 000060ef80008248 RSI: 0000000000000286 RDI: 6b6b6b6b6b6b6b6b
    RBP: ffff88085403fdf0 R08: 0000000000000286 R09: 0000000000009f39
    R10: 0000000000020001 R11: 0000000000020001 R12: ffff88103c17a130
    R13: ffff88103c17a080 R14: 0000000000000000 R15: 0000000000000000
    FS: 0000000000000000(0000) GS:ffff88107fca0000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000006e5ab8 CR3: 000000000193d000 CR4: 00000000000407e0
    Stack:
    ffff88085403fe18 ffffffff812cbfc2 ffff88103c17a130 0000000000000000
    ffff88103c17a130 ffff88085403fec0 ffffffff810d1d28 ffff880854021de0
    ffff880854021de0 ffff88107fcaec58 ffff88085403fe80 ffff88107fcaec30
    Call Trace:
    [] __blkg_release_rcu+0x72/0x150
    [] rcu_nocb_kthread+0x1e8/0x300
    [] kthread+0xe1/0x100
    [] ret_from_fork+0x7c/0xb0
    Code: ff 47 04 48 8b 7d 08 be 00 02 00 00 e8 55 48 a4 ff 5d c3 0f 1f 00 66 66 66 66 90 55 48 89 e5
    +fa 66 66 90 66 66 90 b8 00 00 02 00 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f
    +b7
    RIP [] _raw_spin_lock_irq+0x15/0x60
    RSP

    The request_queue locking was added because blkcg_gq->refcnt is an int
    protected with the queue lock and __blkg_release_rcu() needs to put
    the parent. Let's fix it by making blkcg_gq->refcnt an atomic_t and
    dropping queue locking in the function.

    Given the general heavy weight of the current request_queue and blkcg
    operations, this is unlikely to cause any noticeable overhead.
    Moreover, blkcg_gq->refcnt is likely to be converted to percpu_ref in
    the near future, so whatever (most likely negligible) overhead it may
    add is temporary.
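
    The idea of putting the parent without taking an external lock can
    be sketched in userspace with C11 atomics. struct node and the
    node_*() helpers below are hypothetical illustrations, not the blkcg
    code; the released flag exists only so the effect can be observed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a group object that pins its parent. */
struct node {
	atomic_int refcnt;
	struct node *parent;
	int released;  /* set when the last reference is dropped */
};

/* Start with one reference; a child also takes a reference on its parent. */
static void node_init(struct node *n, struct node *parent)
{
	atomic_init(&n->refcnt, 1);
	n->parent = parent;
	n->released = 0;
	if (parent)
		atomic_fetch_add_explicit(&parent->refcnt, 1,
					  memory_order_relaxed);
}

/* Drop one reference. The counter is atomic, so no lock belonging to
 * another object (which may already be gone) has to be taken here. */
static void node_put(struct node *n)
{
	while (n) {
		int prev = atomic_fetch_sub_explicit(&n->refcnt, 1,
						     memory_order_acq_rel);
		assert(prev > 0);
		if (prev != 1)
			return;
		/* Last reference gone: release, then put the parent too. */
		n->released = 1;
		n = n->parent;
	}
}
```

    Because the decrement and the "was this the last reference" test are
    a single atomic operation, the release path does not need to reach
    into any other structure for synchronization.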

    Signed-off-by: Tejun Heo
    Reported-by: Joe Lawrence
    Acked-by: Vivek Goyal
    Link: http://lkml.kernel.org/g/alpine.DEB.2.02.1406081816540.17948@jlaw-desktop.mno.stratus.com
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe

    Tejun Heo
     

20 Jun, 2014

1 commit

  • Pull block fixes from Jens Axboe:
    "A smaller collection of fixes for the block core that would be nice to
    have in -rc2. This pull request contains:

    - Fixes for races in the wait/wakeup logic used in blk-mq from
    Alexander. No issues have been observed, but it is definitely a
    bit flakey currently. Alternatively, we may drop the cyclic
    wakeups going forward, but that needs more testing.

    - Some cleanups from Christoph.

    - Fix for an oops in null_blk if queue_mode=1 and softirq completions
    are used. From me.

    - A fix for a regression caused by the chunk size setting. It
    inadvertently used max_hw_sectors instead of max_sectors, which is
    incorrect, and causes hangs on btrfs multi-disk setups (where hw
    sectors apparently isn't set). From me.

    - Removal of WQ_POWER_EFFICIENT in the kblockd creation. This was a
    recent addition as well, but it actually breaks blk-mq which relies
    on strict scheduling. If the workqueue power_efficient mode is
    turned on, this breaks blk-mq. From Matias.

    - null_blk module parameter description fix from Mike"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    blk-mq: bitmap tag: fix races in bt_get() function
    blk-mq: bitmap tag: fix race on blk_mq_bitmap_tags::wake_cnt
    blk-mq: bitmap tag: fix races on shared ::wake_index fields
    block: blk_max_size_offset() should check ->max_sectors
    null_blk: fix softirq completions for queue_mode == 1
    blk-mq: merge blk_mq_drain_queue and __blk_mq_drain_queue
    blk-mq: properly drain stopped queues
    block: remove WQ_POWER_EFFICIENT from kblockd
    null_blk: fix name and description of 'queue_mode' module parameter
    block: remove elv_abort_queue and blk_abort_flushes

    Linus Torvalds
     

18 Jun, 2014

3 commits

  • This update fixes a few issues in the bt_get() function:

    - list_empty(&wait.task_list) check is not protected;

    - the was_empty check is always true, so *every* thread entering
    the loop resets the bt_wait_state::wait_cnt counter rather than
    every bt->wake_cnt'th thread;

    - the 'bt_wait_state::wait_cnt' counter update is redundant, since
    it also gets reset in the bt_clear_tag() function;

    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Jens Axboe
    Signed-off-by: Alexander Gordeev
    Signed-off-by: Jens Axboe

    Alexander Gordeev
     
  • This piece of code in bt_clear_tag() function is racy:

    bs = bt_wake_ptr(bt);
    if (bs && atomic_dec_and_test(&bs->wait_cnt)) {
            atomic_set(&bs->wait_cnt, bt->wake_cnt);
            wake_up(&bs->wait);
    }

    Since nothing prevents bt_wake_ptr() from returning the very
    same 'bs' address on multiple CPUs, the following scenario is
    possible:

       CPU1                                        CPU2
       ----                                        ----

    0. bs = bt_wake_ptr(bt);                       bs = bt_wake_ptr(bt);
    1. atomic_dec_and_test(&bs->wait_cnt)
    2.                                             atomic_dec_and_test(&bs->wait_cnt)
    3. atomic_set(&bs->wait_cnt, bt->wake_cnt);

    If the decrement in [1] yields zero then, for some amount of time,
    the decrement in [2] results in a negative/overflow value, which
    is not expected. The follow-up assignment in [3] overwrites the
    invalid value with the batch value (and likely prevents the issue
    from being severe), but the result is still incorrect: the counter
    should end up lower than the batch value.
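
    One way to close this kind of window is to make the decrement and
    the reset a single atomic step. The following is a userspace sketch
    with C11 atomics, not necessarily what this commit does; struct
    wait_state and wake_batch_dec() are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a wait state with a batched wakeup counter. */
struct wait_state {
	atomic_int wait_cnt;
};

/* Count one freed tag. Returns true for exactly one caller per batch of
 * wake_cnt calls; that caller is the one that should issue the wakeup. */
static bool wake_batch_dec(struct wait_state *ws, int wake_cnt)
{
	int old = atomic_load(&ws->wait_cnt);

	assert(wake_cnt > 0);
	for (;;) {
		/* Either count down, or wrap back to the batch size. */
		int next = (old <= 1) ? wake_cnt : old - 1;

		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&ws->wait_cnt, &old, next))
			return old <= 1;
	}
}
```

    Since the counter moves from 1 straight back to the batch size in
    one compare-and-exchange, no other CPU can observe or decrement a
    zero or negative intermediate value.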

    Cc: Ming Lei
    Cc: Jens Axboe
    Signed-off-by: Alexander Gordeev
    Signed-off-by: Jens Axboe

    Alexander Gordeev
     
  • Fix racy updates of shared blk_mq_bitmap_tags::wake_index
    and blk_mq_hw_ctx::wake_index fields.

    Cc: Ming Lei
    Signed-off-by: Alexander Gordeev
    Signed-off-by: Jens Axboe

    Alexander Gordeev
     

16 Jun, 2014

1 commit

  • Pull NVMe update from Matthew Wilcox:
    "Mostly bugfixes again for the NVMe driver. I'd like to call out the
    exported tracepoint in the block layer; I believe Keith has cleared
    this with Jens.

    We've had a few reports from people who're really pounding on NVMe
    devices at scale, hence the timeout changes (and new module
    parameters), hotplug cpu deadlock, tracepoints, and minor performance
    tweaks"

    [ Jens hadn't seen that tracepoint thing, but is ok with it - it will
    end up going away when mq conversion happens ]

    * git://git.infradead.org/users/willy/linux-nvme: (22 commits)
    NVMe: Fix START_STOP_UNIT Scsi->NVMe translation.
    NVMe: Use Log Page constants in SCSI emulation
    NVMe: Define Log Page constants
    NVMe: Fix hot cpu notification dead lock
    NVMe: Rename io_timeout to nvme_io_timeout
    NVMe: Use last bytes of f/w rev SCSI Inquiry
    NVMe: Adhere to request queue block accounting enable/disable
    NVMe: Fix nvme get/put queue semantics
    NVMe: Delete NVME_GET_FEAT_TEMP_THRESH
    NVMe: Make admin timeout a module parameter
    NVMe: Make iod bio timeout a parameter
    NVMe: Prevent possible NULL pointer dereference
    NVMe: Fix the buffer size passed in GetLogPage(CDW10.NUMD)
    NVMe: Update data structures for NVMe 1.2
    NVMe: Enable BUILD_BUG_ON checks
    NVMe: Update namespace and controller identify structures to the 1.1a spec
    NVMe: Flush with data support
    NVMe: Configure support for block flush
    NVMe: Add tracepoints
    NVMe: Protect against badly formatted CQEs
    ...

    Linus Torvalds