12 Feb, 2019

4 commits

  • Since I can remember DM has forced the block layer to allow the
    allocation and initialization of the request_queue to be distinct
    operations. The reason is that block/genhd.c:add_disk() requires
    that the request_queue (and associated bdi) be tied to the gendisk
    before add_disk() is called -- because add_disk() also deals with
    exposing the request_queue via blk_register_queue().

    DM's dynamic creation of arbitrary device types (and associated
    request_queue types) requires the DM device's gendisk be available so
    that DM table loads can establish a master/slave relationship with
    subordinate devices that are referenced by loaded DM tables -- using
    bd_link_disk_holder(). But until these DM tables, and their associated
    subordinate devices, are known DM cannot know what type of request_queue
    it needs -- nor what its queue_limits should be.

    This chicken and egg scenario has created all manner of problems for DM
    and, at times, the block layer.

    Summary of changes:

    - Add device_add_disk_no_queue_reg() and add_disk_no_queue_reg() variant
    that drivers may use to add a disk without also calling
    blk_register_queue(). Driver must call blk_register_queue() once its
    request_queue is fully initialized.

    - Return early from blk_unregister_queue() if QUEUE_FLAG_REGISTERED
    is not set. It won't be set if driver used add_disk_no_queue_reg()
    but driver encounters an error and must del_gendisk() before calling
    blk_register_queue().

    - Export blk_register_queue().

    These changes allow DM to use add_disk_no_queue_reg() to anchor its
    gendisk as the "master" for master/slave relationships DM must establish
    with subordinate devices referenced in DM tables that get loaded. Once
    all "slave" devices for a DM device are known its request_queue can be
    properly initialized and then advertised via sysfs -- important
    improvement being that no request_queue resource initialization
    performed by blk_register_queue() is missed for DM devices anymore.

    Signed-off-by: Mike Snitzer
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe
    (cherry picked from commit fa70d2e2c4a0a54ced98260c6a176cc94c876d27)

    Mike Snitzer
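
The ordering this commit describes can be sketched in userspace C. This is only a simplified model, not the kernel code: every `model_*` name and both structs are hypothetical stand-ins, and refusing to register an uninitialized queue is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins; these are not the kernel's structures. */
struct model_queue {
    bool initialized;   /* queue type and limits are known */
    bool registered;    /* models QUEUE_FLAG_REGISTERED */
};

struct model_disk {
    struct model_queue *queue;
    bool added;
};

/* Models add_disk_no_queue_reg(): add the disk, leave the queue alone. */
void model_add_disk_no_queue_reg(struct model_disk *disk)
{
    assert(disk != NULL && disk->queue != NULL);
    disk->added = true;
}

/* Models blk_register_queue(), called by the driver once the queue is ready. */
int model_register_queue(struct model_disk *disk)
{
    if (!disk->queue->initialized)
        return -1;      /* assumption of this sketch, not kernel behaviour */
    disk->queue->registered = true;
    return 0;
}

/* Models blk_unregister_queue() with the early return described above.
 * Returns 1 if something was unregistered, 0 if there was nothing to undo. */
int model_unregister_queue(struct model_disk *disk)
{
    if (!disk->queue->registered)
        return 0;
    disk->queue->registered = false;
    return 1;
}

/* Models del_gendisk(): it always attempts to unregister the queue, which is
 * why the early return matters on the error path. */
void model_del_disk(struct model_disk *disk)
{
    model_unregister_queue(disk);
    disk->added = false;
}
```

In this model, a driver that hits an error after `model_add_disk_no_queue_reg()` can call `model_del_disk()` right away; the unregister step is then a no-op because the registered flag was never set.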
     
  • device_add_disk() will only call bdi_register_owner() if
    !GENHD_FL_HIDDEN, so it follows that del_gendisk() should only call
    bdi_unregister() if !GENHD_FL_HIDDEN.

    Found with code inspection. bdi_unregister() won't do any harm if
    bdi_register_owner() wasn't used but best to avoid the unnecessary
    call to bdi_unregister().

    Fixes: 8ddcd65325 ("block: introduce GENHD_FL_HIDDEN")
    Signed-off-by: Mike Snitzer
    Reviewed-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    (cherry picked from commit bc8d062c36e3525e81ea8237ff0ab3264c2317b6)

    Mike Snitzer
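
A minimal sketch of the symmetry argued for here, written as a userspace model. The flag value, the struct and the `model_*` functions are hypothetical; only the "register and unregister under the same condition" logic is taken from the message.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_FL_HIDDEN 0x1u    /* stands in for GENHD_FL_HIDDEN; value is arbitrary */

struct model_disk {
    unsigned int flags;
    int bdi_registered;         /* 1 while the bdi is registered */
    int bdi_unregister_calls;   /* counts calls made on the removal path */
};

/* Models device_add_disk(): register the bdi only for visible disks. */
void model_add_disk(struct model_disk *disk)
{
    assert(disk != NULL);
    if (!(disk->flags & MODEL_FL_HIDDEN))
        disk->bdi_registered = 1;
}

/* Models del_gendisk(): mirror the same condition on the way out. */
void model_del_disk(struct model_disk *disk)
{
    assert(disk != NULL);
    if (!(disk->flags & MODEL_FL_HIDDEN)) {
        disk->bdi_registered = 0;
        disk->bdi_unregister_calls++;
    }
}
```

For a hidden disk, neither path touches the bdi, so the unnecessary unregister call mentioned above never happens in this model.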
     
  • With this flag a driver can create a gendisk that can be used for I/O
    submission inside the kernel, but which is not registered as a
    user-facing block device. This will be useful for the NVMe multipath
    implementation.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    (cherry picked from commit 8ddcd653257c18a669fcb75ee42c37054908e0d6)

    Christoph Hellwig
     
  • The hidden gendisks introduced in the next patch need to keep the dev
    field in their struct device empty so that udev won't try to create
    block device nodes for them. To support that, rewrite disk_devt() to
    look at the major and first_minor fields in the gendisk itself instead
    of looking into the struct device.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    (cherry picked from commit 517bf3c306bad4fe0da631f90ae2ec40924dee2b)

    Christoph Hellwig
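
As a rough illustration, the rewritten lookup amounts to composing the device number from the two gendisk fields. The sketch below is a userspace model: the `MODEL_*` macros use the same 20-bit minor split as the kernel's MKDEV(), while the struct and `model_disk_devt()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same layout as the kernel's dev_t encoding: major above a 20-bit minor. */
#define MODEL_MINORBITS 20
#define MODEL_MINORMASK ((1u << MODEL_MINORBITS) - 1)
#define MODEL_MKDEV(ma, mi) (((uint32_t)(ma) << MODEL_MINORBITS) | (uint32_t)(mi))
#define MODEL_MAJOR(dev) ((unsigned int)((dev) >> MODEL_MINORBITS))
#define MODEL_MINOR(dev) ((unsigned int)((dev) & MODEL_MINORMASK))

struct model_gendisk {
    int major;
    int first_minor;
    uint32_t dev_devt;  /* models the dev field of struct device; may stay 0 */
};

/* Derive the device number from the gendisk itself, not from dev_devt. */
uint32_t model_disk_devt(const struct model_gendisk *disk)
{
    assert(disk != NULL);
    return MODEL_MKDEV(disk->major, disk->first_minor);
}
```

Because `dev_devt` is never read, it can be left empty without changing what `model_disk_devt()` returns.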
     

21 Jun, 2018

1 commit

  • [ Upstream commit bf0ddaba65ddbb2715af97041da8e7a45b2d8628 ]

    When the blk-mq inflight implementation was added, /proc/diskstats was
    converted to use it, but /sys/block/$dev/inflight was not. Fix it by
    adding another helper to count in-flight requests by data direction.

    Fixes: f299b7c7a9de ("blk-mq: provide internal in-flight variant")
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Omar Sandoval
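
Counting in-flight requests by data direction can be sketched as a walk over the request slots. This is a userspace model with hypothetical names (`model_request`, `model_in_flight_rw`); it only illustrates the idea of one counter per direction and does not reproduce the kernel helper.

```c
#include <assert.h>
#include <stddef.h>

enum model_dir { MODEL_READ = 0, MODEL_WRITE = 1 };

struct model_request {
    int busy;               /* non-zero while the request is in flight */
    enum model_dir dir;     /* data direction of the request */
};

/* Fill inflight[0] with busy reads and inflight[1] with busy writes. */
void model_in_flight_rw(const struct model_request *tags, size_t ntags,
                        unsigned int inflight[2])
{
    assert(tags != NULL || ntags == 0);
    inflight[0] = 0;
    inflight[1] = 0;
    for (size_t i = 0; i < ntags; i++) {
        if (tags[i].busy)
            inflight[tags[i].dir]++;
    }
}
```

A sysfs-style "inflight" file would then print the two counters side by side, reads first.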
     

08 Sep, 2017

1 commit

  • Pull block layer updates from Jens Axboe:
    "This is the first pull request for 4.14, containing most of the code
    changes. It's a quiet series this round, which I think we needed after
    the churn of the last few series. This contains:

    - Fix for a registration race in loop, from Anton Volkov.

    - Overflow complaint fix from Arnd for DAC960.

    - Series of drbd changes from the usual suspects.

    - Conversion of the stec/skd driver to blk-mq. From Bart.

    - A few BFQ improvements/fixes from Paolo.

    - CFQ improvement from Ritesh, allowing idling for group idle.

    - A few fixes found by Dan's smatch, courtesy of Dan.

    - A warning fixup for a race between changing the IO scheduler and
    device removal. From David Jeffery.

    - A few nbd fixes from Josef.

    - Support for cgroup info in blktrace, from Shaohua.

    - Also from Shaohua, new features in the null_blk driver to allow it
    to actually hold data, among other things.

    - Various corner cases and error handling fixes from Weiping Zhang.

    - Improvements to the IO stats tracking for blk-mq from me. Can
    drastically improve performance for fast devices and/or big
    machines.

    - Series from Christoph removing bi_bdev as being needed for IO
    submission, in preparation for nvme multipathing code.

    - Series from Bart, including various cleanups and fixes for switch
    fall through case complaints"

    * 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
    kernfs: checking for IS_ERR() instead of NULL
    drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
    drbd: Fix allyesconfig build, fix recent commit
    drbd: switch from kmalloc() to kmalloc_array()
    drbd: abort drbd_start_resync if there is no connection
    drbd: move global variables to drbd namespace and make some static
    drbd: rename "usermode_helper" to "drbd_usermode_helper"
    drbd: fix race between handshake and admin disconnect/down
    drbd: fix potential deadlock when trying to detach during handshake
    drbd: A single dot should be put into a sequence.
    drbd: fix rmmod cleanup, remove _all_ debugfs entries
    drbd: Use setup_timer() instead of init_timer() to simplify the code.
    drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
    drbd: new disk-option disable-write-same
    drbd: Fix resource role for newly created resources in events2
    drbd: mark symbols static where possible
    drbd: Send P_NEG_ACK upon write error in protocol != C
    drbd: add explicit plugging when submitting batches
    drbd: change list_for_each_safe to while(list_first_entry_or_null)
    drbd: introduce drbd_recv_header_maybe_unplug
    ...

    Linus Torvalds
     

24 Aug, 2017

2 commits


18 Aug, 2017

1 commit

  • Annotate gendisk.part_tbl and disk_part_tbl.part dereferences with
    rcu_dereference_protected(). This patch does not change the behavior
    of the modified code but ensures that sparse does not complain about
    disk->part_tbl manipulations nor about part_tbl->part accesses.
    Additionally, improve documentation of the locking requirements of
    the modified functions.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Cc: Tejun Heo
    Cc: Jan Kara
    Cc: Dan Williams
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

10 Aug, 2017

3 commits

  • We don't have to inc/dec some counter, since we can just
    iterate the tags. That makes inc/dec a noop, but means we
    have to iterate busy tags to get an in-flight count.

    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Instead of returning the count that matches the partition, pass
    in an array of two ints. Index 0 will be filled with the inflight
    count for the partition in question, and index 1 will be filled
    with the root inflight count, if the partition passed in is not the
    root.

    This is in preparation for being able to calculate both in one
    go.

    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
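
The two-slot interface described above, combined with counting by iterating busy requests rather than maintaining a counter, might look like the following userspace sketch. All names are hypothetical. Treating partition number 0 as the root, and counting every busy request against the root, are assumptions of the model.

```c
#include <assert.h>
#include <stddef.h>

struct model_rq {
    int busy;       /* non-zero while the request is in flight */
    int partno;     /* partition the request was issued against */
};

/* inflight[0]: in-flight count for partition partno.
 * inflight[1]: root (whole-disk) count, filled only if partno is not the root. */
void model_part_in_flight(const struct model_rq *rqs, size_t n, int partno,
                          unsigned int inflight[2])
{
    assert(rqs != NULL || n == 0);
    inflight[0] = 0;
    inflight[1] = 0;
    for (size_t i = 0; i < n; i++) {
        if (!rqs[i].busy)
            continue;
        /* Assumption: the root sees every busy request on the queue. */
        if (partno == 0 || rqs[i].partno == partno)
            inflight[0]++;
        if (partno != 0)
            inflight[1]++;
    }
}
```

One pass over the requests yields both numbers, which is the "calculate both in one go" goal stated in the message.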
     
  • No functional change in this patch, just in preparation for
    basing the inflight mechanism on the queue in question.

    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     

17 Jul, 2017

1 commit

  • Presently, the order of the block devices listed in /proc/devices is not
    entirely sequential. If a block device has a major number greater than
    BLKDEV_MAJOR_HASH_SIZE (255), it will be ordered as if its major were
    modulo 255. For example, 511 appears after 1.

    This patch cleans that up and prints each major number in the correct
    order, regardless of where they are stored in the hash table.

    In order to do this, we introduce BLKDEV_MAJOR_MAX as an artificial
    limit (chosen to be 512). It will then print all devices in major
    number order from 0 to the maximum.

    Signed-off-by: Logan Gunthorpe
    Cc: Jens Axboe
    Cc: Jeff Layton
    Cc: "J. Bruce Fields"
    Signed-off-by: Greg Kroah-Hartman

    Logan Gunthorpe
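
The ordering problem and its fix can be reproduced with a small chained hash table. The sketch below is a userspace model with hypothetical names. The two constants mirror the values given in the text, while appending to the tail of a bucket and stopping just below the maximum are assumptions of the model.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_HASH_SIZE 255u    /* BLKDEV_MAJOR_HASH_SIZE in the text */
#define MODEL_MAJOR_MAX 512u    /* BLKDEV_MAJOR_MAX in the text */

struct model_major_name {
    struct model_major_name *next;
    unsigned int major;
    const char *name;
};

static struct model_major_name *model_table[MODEL_HASH_SIZE];

static unsigned int model_hash(unsigned int major)
{
    return major % MODEL_HASH_SIZE;
}

/* Append an entry to the tail of its bucket (assumed insertion order). */
void model_register(struct model_major_name *e)
{
    assert(e != NULL && e->major < MODEL_MAJOR_MAX);
    struct model_major_name **n = &model_table[model_hash(e->major)];
    while (*n)
        n = &(*n)->next;
    e->next = NULL;
    *n = e;
}

/* Old listing: walk the buckets, so output is ordered by major % 255. */
size_t model_list_by_bucket(unsigned int *out, size_t cap)
{
    size_t n = 0;
    for (unsigned int b = 0; b < MODEL_HASH_SIZE; b++)
        for (const struct model_major_name *e = model_table[b]; e; e = e->next)
            if (n < cap)
                out[n++] = e->major;
    return n;
}

/* New listing: walk the major numbers and look each one up in its bucket. */
size_t model_list_by_major(unsigned int *out, size_t cap)
{
    size_t n = 0;
    for (unsigned int m = 0; m < MODEL_MAJOR_MAX; m++)
        for (const struct model_major_name *e = model_table[model_hash(m)]; e; e = e->next)
            if (e->major == m && n < cap)
                out[n++] = m;
    return n;
}
```

With majors 1, 511 and 8 registered in that order, the bucket walk lists 1, 511, 8 (511 shares bucket 1), while the major walk lists 1, 8, 511.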
     

21 Jun, 2017

1 commit

  • The variable 'disk_type' is never modified so constify it.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

03 May, 2017

1 commit

  • Pull documentation update from Jonathan Corbet:
    "A reasonably busy cycle for documentation this time around. There is a
    new guide for user-space API documents, rather sparsely populated at
    the moment, but it's a start. Markus improved the infrastructure for
    converting diagrams. Mauro has converted much of the USB documentation
    over to RST. Plus the usual set of fixes, improvements, and tweaks.

    There's a bit more than the usual amount of reaching out of
    Documentation/ to fix comments elsewhere in the tree; I have acks for
    those where I could get them"

    * tag 'docs-4.12' of git://git.lwn.net/linux: (74 commits)
    docs: Fix a couple typos
    docs: Fix a spelling error in vfio-mediated-device.txt
    docs: Fix a spelling error in ioctl-number.txt
    MAINTAINERS: update file entry for HSI subsystem
    Documentation: allow installing man pages to a user defined directory
    Doc/PM: Sync with intel_powerclamp code behavior
    zr364xx.rst: usb/devices is now at /sys/kernel/debug/
    usb.rst: move documentation from proc_usb_info.txt to USB ReST book
    convert philips.txt to ReST and add to media docs
    docs-rst: usb: update old usbfs-related documentation
    arm: Documentation: update a path name
    docs: process/4.Coding.rst: Fix a couple of document refs
    docs-rst: fix usb cross-references
    usb: gadget.h: be consistent at kernel doc macros
    usb: composite.h: fix two warnings when building docs
    usb: get rid of some ReST doc build errors
    usb.rst: get rid of some Sphinx errors
    usb/URB.txt: convert to ReST and update it
    usb/persist.txt: convert to ReST and add to driver-api book
    usb/hotplug.txt: convert to ReST and add to driver-api book
    ...

    Linus Torvalds
     

28 Apr, 2017

1 commit

  • Commit 99e6608c9e74 "block: Add badblock management for gendisks"
    allowed for drivers like pmem and software-raid to advertise a list of
    bad media areas. However, it inadvertently added a 'badblocks' attribute
    to all block devices. Let's clean this up by making the 'badblocks'
    attribute invisible when the driver has not populated a 'struct
    badblocks' instance in the gendisk.

    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Cc: Martin K. Petersen
    Reported-by: Vishal Verma
    Signed-off-by: Dan Williams
    Tested-by: Vishal Verma
    Signed-off-by: Jens Axboe

    Dan Williams
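
Conditional attribute visibility of this kind is usually a small predicate that returns either the attribute's mode or 0. The following is a userspace sketch with hypothetical types and names; matching the attribute by name is a simplification chosen for the model.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct model_badblocks { int count; };

struct model_disk {
    struct model_badblocks *bb;     /* NULL when the driver populated nothing */
};

struct model_attr {
    const char *name;
    unsigned short mode;            /* permission bits when visible */
};

/* Return the mode to expose the attribute with, or 0 to hide it. */
unsigned short model_attr_is_visible(const struct model_disk *disk,
                                     const struct model_attr *attr)
{
    assert(disk != NULL && attr != NULL);
    if (strcmp(attr->name, "badblocks") == 0 && disk->bb == NULL)
        return 0;
    return attr->mode;
}
```

Other attributes are unaffected: only 'badblocks' depends on whether the disk carries a bad-block list.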
     

03 Apr, 2017

1 commit

  • ./lib/string.c:134: WARNING: Inline emphasis start-string without end-string.
    ./mm/filemap.c:522: WARNING: Inline interpreted text or phrase reference start-string without end-string.
    ./mm/filemap.c:1283: ERROR: Unexpected indentation.
    ./mm/filemap.c:3003: WARNING: Inline interpreted text or phrase reference start-string without end-string.
    ./mm/vmalloc.c:1544: WARNING: Inline emphasis start-string without end-string.
    ./mm/page_alloc.c:4245: ERROR: Unexpected indentation.
    ./ipc/util.c:676: ERROR: Unexpected indentation.
    ./drivers/pci/irq.c:35: WARNING: Block quote ends without a blank line; unexpected unindent.
    ./security/security.c:109: ERROR: Unexpected indentation.
    ./security/security.c:110: WARNING: Definition list ends without a blank line; unexpected unindent.
    ./block/genhd.c:275: WARNING: Inline strong start-string without end-string.
    ./block/genhd.c:283: WARNING: Inline strong start-string without end-string.
    ./include/linux/clk.h:134: WARNING: Inline emphasis start-string without end-string.
    ./include/linux/clk.h:134: WARNING: Inline emphasis start-string without end-string.
    ./ipc/util.c:477: ERROR: Unknown target name: "s".

    Signed-off-by: Mauro Carvalho Chehab
    Acked-by: Bjorn Helgaas
    Signed-off-by: Jonathan Corbet

    mchehab@s-opensource.com
     

23 Mar, 2017

1 commit

  • When device open races with device shutdown, we can get the following
    oops in scsi_disk_get():

    [11863.044351] general protection fault: 0000 [#1] SMP
    [11863.045561] Modules linked in: scsi_debug xfs libcrc32c netconsole btrfs raid6_pq zlib_deflate lzo_compress xor [last unloaded: loop]
    [11863.047853] CPU: 3 PID: 13042 Comm: hald-probe-stor Tainted: G W 4.10.0-rc2-xen+ #35
    [11863.048030] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    [11863.048030] task: ffff88007f438200 task.stack: ffffc90000fd0000
    [11863.048030] RIP: 0010:scsi_disk_get+0x43/0x70
    [11863.048030] RSP: 0018:ffffc90000fd3a08 EFLAGS: 00010202
    [11863.048030] RAX: 6b6b6b6b6b6b6b6b RBX: ffff88007f56d000 RCX: 0000000000000000
    [11863.048030] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffffffff81a8d880
    [11863.048030] RBP: ffffc90000fd3a18 R08: 0000000000000000 R09: 0000000000000001
    [11863.059217] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000fffffffa
    [11863.059217] R13: ffff880078872800 R14: ffff880070915540 R15: 000000000000001d
    [11863.059217] FS: 00007f2611f71800(0000) GS:ffff88007f0c0000(0000) knlGS:0000000000000000
    [11863.059217] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [11863.059217] CR2: 000000000060e048 CR3: 00000000778d4000 CR4: 00000000000006e0
    [11863.059217] Call Trace:
    [11863.059217] ? disk_get_part+0x22/0x1f0
    [11863.059217] sd_open+0x39/0x130
    [11863.059217] __blkdev_get+0x69/0x430
    [11863.059217] ? bd_acquire+0x7f/0xc0
    [11863.059217] ? bd_acquire+0x96/0xc0
    [11863.059217] ? blkdev_get+0x350/0x350
    [11863.059217] blkdev_get+0x126/0x350
    [11863.059217] ? _raw_spin_unlock+0x2b/0x40
    [11863.059217] ? bd_acquire+0x7f/0xc0
    [11863.059217] ? blkdev_get+0x350/0x350
    [11863.059217] blkdev_open+0x65/0x80
    ...

    As you can see, the RAX value is already poisoned, showing that the
    gendisk we got has already been freed. The problem is that get_gendisk()
    looks up the device number in ext_devt_idr and then calls get_disk(),
    which does kobject_get() on the disk's kobject. However, the disk gets
    removed from ext_devt_idr only in disk_release() (through
    blk_free_devt()), at which point its refcount is already 0 and it is on
    its way to being freed. Indeed, we got a warning from kobject_get() about
    a 0 refcount shortly before the oops.

    We fix the problem by using kobject_get_unless_zero() in get_disk() so
    that get_disk() cannot get reference on a disk that is already being
    freed.

    Tested-by: Lekshmi Pillai
    Reviewed-by: Bart Van Assche
    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
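
The "get unless zero" idea can be illustrated with C11 atomics. This is a userspace sketch only: `model_kobject`, `model_get_unless_zero()` and `model_put()` are hypothetical names and do not reproduce the kernel's kobject implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct model_kobject {
    atomic_uint refcount;
};

/* Take a reference only if the object is still alive (refcount > 0). */
bool model_get_unless_zero(struct model_kobject *k)
{
    assert(k != NULL);
    unsigned int old = atomic_load(&k->refcount);
    while (old != 0) {
        /* On failure, old is reloaded with the current value and we retry. */
        if (atomic_compare_exchange_weak(&k->refcount, &old, old + 1))
            return true;
    }
    return false;   /* already on its way to being freed: do not revive it */
}

/* Drop a reference; returns true when the last reference went away. */
bool model_put(struct model_kobject *k)
{
    assert(k != NULL);
    return atomic_fetch_sub(&k->refcount, 1) == 1;
}
```

A lookup that races with release sees `false` from `model_get_unless_zero()` and can treat the disk as not found, instead of handing out a pointer to an object whose count already reached zero.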
     

09 Mar, 2017

2 commits

  • This reverts commit 0dba1314d4f81115dce711292ec7981d17231064. It causes
    leaking of device numbers for SCSI when SCSI registers multiple gendisks
    for one request_queue in succession. It can be easily reproduced using
    Omar's script [1] on kernel with CONFIG_DEBUG_TEST_DRIVER_REMOVE.
    Furthermore the protection provided by this commit is not needed anymore
    as the problem it was fixing got also fixed by commit 165a5e22fafb
    "block: Move bdi_unregister() to del_gendisk()".

    [1]: http://marc.info/?l=linux-block&m=148554717109098&w=2

    Signed-off-by: Jan Kara
    Acked-by: Dan Williams
    Tested-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • Commit 165a5e22fafb "block: Move bdi_unregister() to del_gendisk()"
    added disk->queue dereference to del_gendisk(). Although del_gendisk()
    is not supposed to be called without disk->queue valid and
    blk_unregister_queue() warns in that case, this change will make it oops
    instead. Return to the old more robust behavior of just warning when
    del_gendisk() gets called for gendisk with disk->queue being NULL.

    Reported-by: Dan Carpenter
    Signed-off-by: Jan Kara
    Tested-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jan Kara
     

03 Mar, 2017

1 commit

  • Commit 6cd18e711dd8 "block: destroy bdi before blockdev is
    unregistered." moved bdi unregistration (at that time through
    bdi_destroy()) from blk_release_queue() to blk_cleanup_queue() because
    it needs to happen before blk_unregister_region() call in del_gendisk()
    for MD. SCSI though will free up the device number from sd_remove()
    called through a maze of callbacks from device_del() in
    __scsi_remove_device() before blk_cleanup_queue() and thus similar races
    as described in 6cd18e711dd8 can happen for SCSI as well as reported by
    Omar [1].

    Moving bdi_unregister() to del_gendisk() works for MD and fixes the
    problem for SCSI since del_gendisk() gets called from sd_remove() before
    freeing the device number.

    This also makes device_add_disk() (calling bdi_register_owner()) more
    symmetric with del_gendisk().

    [1] http://marc.info/?l=linux-block&m=148554717109098&w=2

    Tested-by: Lekshmi Pillai
    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Tested-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jan Kara
     

22 Feb, 2017

2 commits

  • Iteration over partitions in del_gendisk() omits part0. Add
    bdev_unhash_inode() call for the whole device. Otherwise if the device
    number gets reused, bdev inode will be still associated with the old
    (stale) bdi.

    Tested-by: Lekshmi Pillai
    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • Move bdev_unhash_inode() after invalidate_partition() as
    invalidate_partition() looks up bdev and it cannot find the right bdev
    inode after bdev_unhash_inode() is called. Thus invalidate_partition()
    would not invalidate page cache of the previously used bdev. Also use
    part_devt() when calling bdev_unhash_inode() instead of manually
    creating the device number.

    Tested-by: Lekshmi Pillai
    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

02 Feb, 2017

3 commits

  • Warnings of the following form occur because scsi reuses a devt number
    while the block layer still has it referenced as the name of the bdi
    [1]:

    WARNING: CPU: 1 PID: 93 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x62/0x80
    sysfs: cannot create duplicate filename '/devices/virtual/bdi/8:192'
    [..]
    Call Trace:
    dump_stack+0x86/0xc3
    __warn+0xcb/0xf0
    warn_slowpath_fmt+0x5f/0x80
    ? kernfs_path_from_node+0x4f/0x60
    sysfs_warn_dup+0x62/0x80
    sysfs_create_dir_ns+0x77/0x90
    kobject_add_internal+0xb2/0x350
    kobject_add+0x75/0xd0
    device_add+0x15a/0x650
    device_create_groups_vargs+0xe0/0xf0
    device_create_vargs+0x1c/0x20
    bdi_register+0x90/0x240
    ? lockdep_init_map+0x57/0x200
    bdi_register_owner+0x36/0x60
    device_add_disk+0x1bb/0x4e0
    ? __pm_runtime_use_autosuspend+0x5c/0x70
    sd_probe_async+0x10d/0x1c0
    async_run_entry_fn+0x39/0x170

    This is a brute-force fix to pass the devt release information from
    sd_probe() to the locations where we register the bdi,
    device_add_disk(), and unregister the bdi, blk_cleanup_queue().

    Thanks to Omar for the quick reproducer script [2]. This patch survives
    where an unmodified kernel fails in a few seconds.

    [1]: https://marc.info/?l=linux-scsi&m=147116857810716&w=4
    [2]: http://marc.info/?l=linux-block&m=148554717109098&w=2

    Cc: James Bottomley
    Cc: Bart Van Assche
    Cc: "Martin K. Petersen"
    Cc: Jan Kara
    Reported-by: Omar Sandoval
    Tested-by: Omar Sandoval
    Signed-off-by: Dan Williams
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Dan Williams
     
  • We will want to have struct backing_dev_info allocated separately from
    struct request_queue. As the first step add pointer to backing_dev_info
    to request_queue and convert all users touching it. No functional
    changes in this patch.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • Currently, block device inodes stay around after the corresponding
    gendisk has died until memory reclaim finds them and frees them. Since we will
    make block device inode pin the bdi, we want to free the block device
    inode as soon as the device goes away so that bdi does not stay around
    unnecessarily. Furthermore we need to avoid issues when new device with
    the same major,minor pair gets created since reusing the bdi structure
    would be rather difficult in this case.

    Unhashing block device inode on gendisk destruction nicely deals with
    these problems. Once last block device inode reference is dropped (which
    may be directly in del_gendisk()), the inode gets evicted. Furthermore if
    the major,minor pair gets reallocated, we are guaranteed to get new
    block device inode even if old block device inode is not yet evicted and
    thus we avoid issues with possible reuse of bdi.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

05 Aug, 2016

2 commits

  • The name for a bdi of a gendisk is derived from the gendisk's devt.
    However, since the gendisk is destroyed before the bdi it leaves a
    window where a new gendisk could dynamically reuse the same devt while a
    bdi with the same name is still live. Arrange for the bdi to hold a
    reference against its "owner" disk device while it is registered.
    Otherwise we can hit sysfs duplicate name collisions like the following:

    WARNING: CPU: 10 PID: 2078 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x64/0x80
    sysfs: cannot create duplicate filename '/devices/virtual/bdi/259:1'

    Hardware name: HP ProLiant DL580 Gen8, BIOS P79 05/06/2015
    0000000000000286 0000000002c04ad5 ffff88006f24f970 ffffffff8134caec
    ffff88006f24f9c0 0000000000000000 ffff88006f24f9b0 ffffffff8108c351
    0000001f0000000c ffff88105d236000 ffff88105d1031e0 ffff8800357427f8
    Call Trace:
    [] dump_stack+0x63/0x87
    [] __warn+0xd1/0xf0
    [] warn_slowpath_fmt+0x5f/0x80
    [] sysfs_warn_dup+0x64/0x80
    [] sysfs_create_dir_ns+0x7e/0x90
    [] kobject_add_internal+0xaa/0x320
    [] ? vsnprintf+0x34e/0x4d0
    [] kobject_add+0x75/0xd0
    [] ? mutex_lock+0x12/0x2f
    [] device_add+0x125/0x610
    [] device_create_groups_vargs+0xd8/0x100
    [] device_create_vargs+0x1c/0x20
    [] bdi_register+0x8c/0x180
    [] bdi_register_dev+0x27/0x30
    [] add_disk+0x175/0x4a0

    Cc:
    Reported-by: Yi Zhang
    Tested-by: Yi Zhang
    Signed-off-by: Dan Williams

    Fixed up missing 0 return in bdi_register_owner().

    Signed-off-by: Jens Axboe

    Dan Williams
     
  • I got a KASAN report of use-after-free:

    ==================================================================
    BUG: KASAN: use-after-free in klist_iter_exit+0x61/0x70 at addr ffff8800b6581508
    Read of size 8 by task trinity-c1/315
    =============================================================================
    BUG kmalloc-32 (Not tainted): kasan: bad access detected
    -----------------------------------------------------------------------------

    Disabling lock debugging due to kernel taint
    INFO: Allocated in disk_seqf_start+0x66/0x110 age=144 cpu=1 pid=315
    ___slab_alloc+0x4f1/0x520
    __slab_alloc.isra.58+0x56/0x80
    kmem_cache_alloc_trace+0x260/0x2a0
    disk_seqf_start+0x66/0x110
    traverse+0x176/0x860
    seq_read+0x7e3/0x11a0
    proc_reg_read+0xbc/0x180
    do_loop_readv_writev+0x134/0x210
    do_readv_writev+0x565/0x660
    vfs_readv+0x67/0xa0
    do_preadv+0x126/0x170
    SyS_preadv+0xc/0x10
    do_syscall_64+0x1a1/0x460
    return_from_SYSCALL_64+0x0/0x6a
    INFO: Freed in disk_seqf_stop+0x42/0x50 age=160 cpu=1 pid=315
    __slab_free+0x17a/0x2c0
    kfree+0x20a/0x220
    disk_seqf_stop+0x42/0x50
    traverse+0x3b5/0x860
    seq_read+0x7e3/0x11a0
    proc_reg_read+0xbc/0x180
    do_loop_readv_writev+0x134/0x210
    do_readv_writev+0x565/0x660
    vfs_readv+0x67/0xa0
    do_preadv+0x126/0x170
    SyS_preadv+0xc/0x10
    do_syscall_64+0x1a1/0x460
    return_from_SYSCALL_64+0x0/0x6a

    CPU: 1 PID: 315 Comm: trinity-c1 Tainted: G B 4.7.0+ #62
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
    ffffea0002d96000 ffff880119b9f918 ffffffff81d6ce81 ffff88011a804480
    ffff8800b6581500 ffff880119b9f948 ffffffff8146c7bd ffff88011a804480
    ffffea0002d96000 ffff8800b6581500 fffffffffffffff4 ffff880119b9f970
    Call Trace:
    [] dump_stack+0x65/0x84
    [] print_trailer+0x10d/0x1a0
    [] object_err+0x2f/0x40
    [] kasan_report_error+0x221/0x520
    [] __asan_report_load8_noabort+0x3e/0x40
    [] klist_iter_exit+0x61/0x70
    [] class_dev_iter_exit+0x9/0x10
    [] disk_seqf_stop+0x3a/0x50
    [] seq_read+0x4b2/0x11a0
    [] proc_reg_read+0xbc/0x180
    [] do_loop_readv_writev+0x134/0x210
    [] do_readv_writev+0x565/0x660
    [] vfs_readv+0x67/0xa0
    [] do_preadv+0x126/0x170
    [] SyS_preadv+0xc/0x10

    This problem can occur in the following situation:

    open()
    - pread()
    - .seq_start()
    - iter = kmalloc() // succeeds
    - seqf->private = iter
    - .seq_stop()
    - kfree(seqf->private)
    - pread()
    - .seq_start()
    - iter = kmalloc() // fails
    - .seq_stop()
    - class_dev_iter_exit(seqf->private) // boom! old pointer

    As the comment in disk_seqf_stop() says, stop is called even if start
    failed, so we need to reinitialise the private pointer to NULL when seq
    iteration stops.

    An alternative would be to set the private pointer to NULL when the
    kmalloc() in disk_seqf_start() fails.

    Cc: stable@vger.kernel.org
    Signed-off-by: Vegard Nossum
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Vegard Nossum
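
The failure sequence and the fix can be modelled in userspace. In this sketch, `model_seq_file`, the two `model_seqf_*` functions, the allocation-failure hook and the teardown counter are all hypothetical; the counter merely stands in for class_dev_iter_exit().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct model_seq_file {
    void *private;
};

int model_fail_alloc;   /* test hook: non-zero makes the next start fail */
int model_exit_calls;   /* how often the iterator teardown ran */

/* Models .seq_start(): allocate an iterator and stash it in private. */
void *model_seqf_start(struct model_seq_file *seqf)
{
    assert(seqf != NULL);
    void *iter = model_fail_alloc ? NULL : malloc(32);
    if (!iter)
        return NULL;            /* private is left as it was */
    seqf->private = iter;
    return iter;
}

/* Models .seq_stop(), which runs even if start failed. */
void model_seqf_stop(struct model_seq_file *seqf)
{
    assert(seqf != NULL);
    void *iter = seqf->private;
    if (iter) {
        model_exit_calls++;     /* stands in for class_dev_iter_exit() */
        free(iter);
        seqf->private = NULL;   /* the fix: never leave a stale pointer */
    }
}
```

Without the reset to NULL, the second stop in the sequence above would tear down and free the iterator that the first stop had already released.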
     

27 Jul, 2016

1 commit

  • Pull block driver updates from Jens Axboe:
    "This branch also contains core changes. I've come to the conclusion
    that from 4.9 and forward, I'll be doing just a single branch. We
    often have dependencies between core and drivers, and it's hard to
    always split them up appropriately without pulling core into drivers
    when that happens.

    That said, this contains:

    - separate secure erase type for the core block layer, from
    Christoph.

    - set of discard fixes, from Christoph.

    - bio shrinking fixes from Christoph, as a follow-up to the
    op/flags change in the core branch.

    - map and append request fixes from Christoph.

    - NVMeF (NVMe over Fabrics) code from Christoph. This is pretty
    exciting!

    - nvme-loop fixes from Arnd.

    - removal of ->driverfs_dev from Dan, after providing a
    device_add_disk() helper.

    - bcache fixes from Bhaktipriya and Yijing.

    - cdrom subchannel read fix from Vchannaiah.

    - set of lightnvm updates from Wenwei, Matias, Johannes, and Javier.

    - set of drbd updates and fixes from Fabian, Lars, and Philipp.

    - mg_disk error path fix from Bart.

    - user notification for failed device add for loop, from Minfei.

    - NVMe in general:
    + NVMe delay quirk from Guilherme.
    + SR-IOV support and command retry limits from Keith.
    + fix for memory-less NUMA node from Masayoshi.
    + use UINT_MAX for discard sectors, from Minfei.
    + cancel IO fixes from Ming.
    + don't allocate unused major, from Neil.
    + error code fixup from Dan.
    + use constants for PSDT/FUSE from James.
    + variable init fix from Jay.
    + fabrics fixes from Ming, Sagi, and Wei.
    + various fixes"

    * 'for-4.8/drivers' of git://git.kernel.dk/linux-block: (115 commits)
    nvme/pci: Provide SR-IOV support
    nvme: initialize variable before logical OR'ing it
    block: unexport various bio mapping helpers
    scsi/osd: open code blk_make_request
    target: stop using blk_make_request
    block: simplify and export blk_rq_append_bio
    block: ensure bios return from blk_get_request are properly initialized
    virtio_blk: use blk_rq_map_kern
    memstick: don't allow REQ_TYPE_BLOCK_PC requests
    block: shrink bio size again
    block: simplify and cleanup bvec pool handling
    block: get rid of bio_rw and READA
    block: don't ignore -EOPNOTSUPP blkdev_issue_write_same
    block: introduce BLKDEV_DISCARD_ZERO to fix zeroout
    NVMe: don't allocate unused nvme_major
    nvme: avoid crashes when node 0 is memoryless node.
    nvme: Limit command retries
    loop: Make user notify for adding loop device failed
    nvme-loop: fix nvme-loop Kconfig dependencies
    nvmet: fix return value check in nvmet_subsys_alloc()
    ...

    Linus Torvalds
     

07 Jul, 2016

1 commit

  • We now have implicit batching in the timer wheel. The slack API is no longer
    used, so remove it.

    Signed-off-by: Thomas Gleixner
    Cc: Alan Stern
    Cc: Andrew F. Davis
    Cc: Arjan van de Ven
    Cc: Chris Mason
    Cc: David S. Miller
    Cc: David Woodhouse
    Cc: Dmitry Eremin-Solenikov
    Cc: Eric Dumazet
    Cc: Frederic Weisbecker
    Cc: George Spelvin
    Cc: Greg Kroah-Hartman
    Cc: Jaehoon Chung
    Cc: Jens Axboe
    Cc: John Stultz
    Cc: Josh Triplett
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Mathias Nyman
    Cc: Pali Rohár
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Sebastian Reichel
    Cc: Ulf Hansson
    Cc: linux-block@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mmc@vger.kernel.org
    Cc: linux-pm@vger.kernel.org
    Cc: linux-usb@vger.kernel.org
    Cc: netdev@vger.kernel.org
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160704094342.189813118@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

28 Jun, 2016

1 commit

  • Now that all drivers that specify a ->driverfs_dev have been converted
    to device_add_disk(), the pointer can be removed from struct gendisk.

    Cc: Jens Axboe
    Cc: Ross Zwisler
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Dan Williams

    Dan Williams
     

16 Jun, 2016

1 commit

  • In preparation for removing the ->driverfs_dev member of a gendisk, add
    an API that takes the parent device as a parameter to add_disk(). For
    now this maintains the status quo of WARN()ing on failure, but not
    returning an error code.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Bart Van Assche
    Signed-off-by: Dan Williams

    Dan Williams
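
    The shape of the change described above can be sketched in plain C. This
    is only an illustration under stated assumptions: every sketch_* name is
    a hypothetical stand-in, not the kernel's struct device, struct gendisk,
    or the real device_add_disk(). It shows the parent being passed
    explicitly, with the older call kept as a thin wrapper that passes no
    parent.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a generic device with a parent link. */
struct sketch_device {
    struct sketch_device *parent;
};

/* Hypothetical stand-in for a disk that embeds its own device. */
struct sketch_gendisk {
    struct sketch_device dev;
    int added;
};

/* Hypothetical helper: get the device embedded in a disk. */
#define sketch_disk_to_dev(disk) (&(disk)->dev)

/* New-style entry point: the caller hands in the parent device directly. */
static void sketch_device_add_disk(struct sketch_device *parent,
                                   struct sketch_gendisk *disk)
{
    assert(disk != NULL);
    sketch_disk_to_dev(disk)->parent = parent;
    disk->added = 1;
}

/* Old-style entry point kept for callers that have no parent device. */
static void sketch_add_disk(struct sketch_gendisk *disk)
{
    sketch_device_add_disk(NULL, disk);
}
```

    With this arrangement the disk no longer needs to carry a separate
    pointer to its parent; the parent is reachable through the embedded
    device.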
     

20 Jan, 2016

1 commit

  • Pull core block updates from Jens Axboe:
    "We don't have a lot of core changes this time around, it's mostly in
    drivers, which will come in a subsequent pull.

    The core changes include:

    - blk-mq
    - Prep patch from Christoph, changing blk_mq_alloc_request() to
    take flags instead of just using gfp_t for sleep/nosleep.
    - Doc patch from me, clarifying the difference between legacy
    and blk-mq for timer usage.
    - Fixes from Raghavendra for memory-less numa nodes, and a reuse
    of CPU masks.

    - Cleanup from Geliang Tang, using offset_in_page() instead of open
    coding it.

    - From Ilya, rename request_queue slab so it reflects what it holds,
    and a fix for proper use of bdgrab/put.

    - A real fix for the split across stripe boundaries from Keith. We
    yanked a broken version of this from 4.4-rc final, this one works.

    - From Mike Krinkin, emit a trace message when we split.

    - From Wei Tang, two small cleanups, not explicitly clearing memory
    that is already cleared"

    * 'for-4.5/core' of git://git.kernel.dk/linux-block:
    block: use bd{grab,put}() instead of open-coding
    block: split bios to max possible length
    block: add call to split trace point
    blk-mq: Avoid memoryless numa node encoded in hctx numa_node
    blk-mq: Reuse hardware context cpumask for tags
    blk-mq: add a flags parameter to blk_mq_alloc_request
    Revert "blk-flush: Queue through IO scheduler when flush not required"
    block: clarify blk_add_timer() use case for blk-mq
    bio: use offset_in_page macro
    block: do not initialise statics to 0 or NULL
    block: do not initialise globals to 0 or NULL
    block: rename request_queue slab cache

    Linus Torvalds
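
    As a small illustration of the offset_in_page() cleanup mentioned in the
    pull message, the sketch below contrasts an open-coded mask with a helper
    macro. It is a userspace approximation: the SKETCH_* constants and
    sketch_* names are hypothetical, and a 4 KiB page size is assumed, whereas
    the kernel takes its page size from the architecture.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. */
#define SKETCH_PAGE_SIZE 4096UL
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))

/* Helper form: the offset of an address within its page. */
#define sketch_offset_in_page(p) \
    ((unsigned long)(uintptr_t)(p) & ~SKETCH_PAGE_MASK)

/* Open-coded form that a cleanup of this kind would replace. */
static unsigned long sketch_offset_open_coded(const void *p)
{
    return (unsigned long)(uintptr_t)p & (SKETCH_PAGE_SIZE - 1);
}
```

    Both forms compute the same value; the helper simply names the intent at
    each call site.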
     

10 Jan, 2016

4 commits

  • These actions are completely managed by a block driver or can use the
    badblocks api directly.

    Signed-off-by: Dan Williams

    Dan Williams
     
  • The badblocks list attached to a gendisk is allocated by the driver
    which equates to the driver owning the lifetime of the object. Do not
    automatically free it in del_gendisk(). This is in preparation for
    expanding the use of badblocks in libnvdimm drivers and introducing
    devm_init_badblocks().

    Signed-off-by: Dan Williams

    Dan Williams
     
  • For symmetry with badblocks_init() make it clear that this path only
    destroys incremental allocations of a badblocks instance, and does not
    free the badblocks instance itself.

    Signed-off-by: Dan Williams

    Dan Williams
     
  • NVDIMM devices, which can behave more like DRAM rather than block
    devices, may develop bad cache lines, or 'poison'. A block device
    exposed by the pmem driver can then consume poison via a read (or
    write), and cause a machine check. On platforms without machine
    check recovery features, this would mean a crash.

    The block device maintaining a runtime list of all known sectors that
    have poison can directly avoid this, and also provide a path forward
    to enable proper handling/recovery for DAX faults on such a device.

    Use the new badblock management interfaces to add a badblocks list to
    gendisks.

    Signed-off-by: Vishal Verma
    Signed-off-by: Dan Williams

    Vishal Verma
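
    The idea of a runtime list of known-bad sectors that is consulted before
    I/O can be sketched as follows. This is a minimal illustration, not the
    kernel's badblocks implementation: the sketch_* types and functions, the
    fixed capacity, and the unsorted array are all assumptions made for
    brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_BAD 16 /* assumed fixed capacity */

/* One contiguous run of bad sectors: [start, start + len). */
struct sketch_bad_range {
    uint64_t start;
    uint64_t len;
};

/* Hypothetical per-disk list of bad ranges. */
struct sketch_badblocks {
    struct sketch_bad_range r[SKETCH_MAX_BAD];
    size_t count;
};

/* Record a bad range. Returns 0 on success, -1 if the list is full. */
static int sketch_set_bad(struct sketch_badblocks *bb,
                          uint64_t start, uint64_t len)
{
    assert(bb != NULL);
    if (len == 0)
        return 0;
    if (bb->count == SKETCH_MAX_BAD)
        return -1;
    bb->r[bb->count].start = start;
    bb->r[bb->count].len = len;
    bb->count++;
    return 0;
}

/* Return true if [start, start + len) overlaps any recorded bad range. */
static bool sketch_is_bad(const struct sketch_badblocks *bb,
                          uint64_t start, uint64_t len)
{
    size_t i;

    assert(bb != NULL);
    if (len == 0)
        return false;
    for (i = 0; i < bb->count; i++) {
        const struct sketch_bad_range *r = &bb->r[i];

        if (start < r->start + r->len && r->start < start + len)
            return true;
    }
    return false;
}
```

    A driver following this pattern would call the check before touching the
    media and fail the request instead of reading a poisoned range.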
     

09 Jan, 2016

1 commit

  • When tearing down a block device early in its lifetime, userspace may
    still be performing discovery actions like blkdev_ioctl() to re-read
    partitions.

    The nvdimm_revalidate_disk() implementation depends on
    disk->driverfs_dev to be valid at entry. However, it is set to NULL in
    del_gendisk() and fatally this is happening *before* the disk device is
    deleted from userspace view.

    There's no reason for del_gendisk() to clear ->driverfs_dev. That
    device is the parent of the disk. It is guaranteed to not be freed
    until the disk, as a child, drops its ->parent reference.

    We could also fix this issue locally in nvdimm_revalidate_disk() by
    using disk_to_dev(disk)->parent, but let's fix it globally since
    ->driverfs_dev follows the lifetime of the parent. Longer term we
    should probably just add a @parent parameter to add_disk(), and stop
    carrying this pointer in the gendisk.

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] nvdimm_revalidate_disk+0x18/0x90 [libnvdimm]
    CPU: 2 PID: 538 Comm: systemd-udevd Tainted: G O 4.4.0-rc5 #2257
    [..]
    Call Trace:
    [] rescan_partitions+0x87/0x2c0
    [] ? __lock_is_held+0x49/0x70
    [] __blkdev_reread_part+0x72/0xb0
    [] blkdev_reread_part+0x25/0x40
    [] blkdev_ioctl+0x4fd/0x9c0
    [] ? current_kernel_time64+0x69/0xd0
    [] block_ioctl+0x3d/0x50
    [] do_vfs_ioctl+0x308/0x560
    [] ? __audit_syscall_entry+0xb1/0x100
    [] ? do_audit_syscall_entry+0x66/0x70
    [] SyS_ioctl+0x79/0x90
    [] entry_SYSCALL_64_fastpath+0x12/0x76

    Reported-by: Robert Hu
    Signed-off-by: Dan Williams

    Dan Williams
     

25 Nov, 2015

1 commit


22 Oct, 2015

1 commit

  • Up until now the integrity profile has been dynamically allocated and
    attached to struct gendisk after the disk has been made active.

    This causes problems because NVMe devices need to register the profile
    prior to the partition table being read due to a mandatory metadata
    buffer requirement. In addition, DM jumps through hoops to deal with
    preallocating, but not initializing, integrity profiles.

    Since the integrity profile is small (4 bytes + a pointer), Christoph
    suggested moving it to struct gendisk proper. This requires several
    changes:

    - Moving the blk_integrity definition to genhd.h.

    - Inlining blk_integrity in struct gendisk.

    - Removing the dynamic allocation code.

    - Adding helper functions which allow gendisk to set up and tear down
    the integrity sysfs dir when a disk is added/deleted.

    - Adding a blk_integrity_revalidate() callback for updating the stable
    pages bdi setting.

    - The calls that depend on whether a device has an integrity profile or
    not now key off of the bi->profile pointer.

    - Simplifying the integrity support routines in DM (Mike Snitzer).

    Signed-off-by: Martin K. Petersen
    Reported-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Mike Snitzer
    Cc: Dan Williams
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Martin K. Petersen