22 Apr, 2019

1 commit

  • commit 2da78092dda "block: Fix dev_t minor allocation lifetime"
    specifically moved blk_free_devt(dev->devt) call to part_release()
    to avoid reallocating the device number before the device is fully
    shut down.

    However, it can cause use-after-free on gendisk in get_gendisk().
    We use an md device as an example to show the race:

    Process1                    Worker                      Process2
    md_free
                                                            blkdev_open
    del_gendisk
      add delete_partition_work_fn() to wq
                                                            __blkdev_get
                                                              get_gendisk
    put_disk
      disk_release
        kfree(disk)
                                                                find part from ext_devt_idr
                                                                get_disk_and_module(disk)
                                                                cause use after free

                                delete_partition_work_fn
                                  put_device(part)
                                    part_release
                                      remove part from ext_devt_idr

    Before the partition's entry is removed from ext_devt_idr by
    delete_partition_work_fn(), we can still find the devt and then access
    the gendisk through the hd_struct pointer. If the gendisk has already
    been freed at that point, this access is a use-after-free on the
    gendisk in get_gendisk().

    We fix this by adding a new helper blk_invalidate_devt() in
    delete_partition() and del_gendisk(). It replaces hd_struct
    pointer in idr with value 'NULL', and deletes the entry from
    idr in part_release() as we do now.
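
    The idea of invalidating an entry while keeping its number reserved
    can be sketched in plain userspace C. This is not the kernel code: the
    fixed-size table only stands in for ext_devt_idr, and the names
    devt_table, devt_alloc, devt_invalidate, devt_lookup and devt_free are
    hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_SLOTS 8

/* Stand-in for ext_devt_idr: a slot can stay reserved while holding NULL. */
struct devt_table {
    bool  reserved[NR_SLOTS];
    void *entry[NR_SLOTS];
};

/* Reserve the lowest free slot for 'part'; returns the slot or -1. */
static int devt_alloc(struct devt_table *t, void *part)
{
    for (int i = 0; i < NR_SLOTS; i++) {
        if (!t->reserved[i]) {
            t->reserved[i] = true;
            t->entry[i] = part;
            return i;
        }
    }
    return -1;
}

/* Analogue of the new helper: keep the number, drop the pointer. */
static void devt_invalidate(struct devt_table *t, int slot)
{
    assert(slot >= 0 && slot < NR_SLOTS && t->reserved[slot]);
    t->entry[slot] = NULL;
}

/* Lookup as on the open path: NULL means "no usable partition here". */
static void *devt_lookup(const struct devt_table *t, int slot)
{
    if (slot < 0 || slot >= NR_SLOTS || !t->reserved[slot])
        return NULL;
    return t->entry[slot];
}

/* Analogue of the final release: only now may the number be reused. */
static void devt_free(struct devt_table *t, int slot)
{
    assert(slot >= 0 && slot < NR_SLOTS);
    t->reserved[slot] = false;
    t->entry[slot] = NULL;
}
```

    In this model a lookup between devt_invalidate() and devt_free() fails
    cleanly instead of returning a stale pointer, while a concurrent
    devt_alloc() still cannot hand out the same number.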

    Thanks to Jan Kara for providing the solution and clearer comments
    for the code.

    Fixes: 2da78092dda1 ("block: Fix dev_t minor allocation lifetime")
    Cc: Al Viro
    Reviewed-by: Bart Van Assche
    Reviewed-by: Keith Busch
    Reviewed-by: Jan Kara
    Suggested-by: Jan Kara
    Signed-off-by: Yufen Yu
    Signed-off-by: Jens Axboe

    Yufen Yu
     

13 Apr, 2019

2 commits

  • Currently, an empty disk->events field tells the block layer not to
    forward media change events to user space. This was done in commit
    7c88a168da80 ("block: don't propagate unlisted DISK_EVENTs to userland")
    in order to keep events from "fringe" drivers from being forwarded to user
    space. By doing so, the block layer lost the information which events
    were supported by a particular block device, and most importantly,
    whether or not a given device supports media change events at all.

    Prepare for not interpreting the "events" field this way in the future
    any more. This is done by adding an additional field "event_flags" to
    struct gendisk, and two flag bits that can be set to have the device
    treated like one that had the "events" field set to a non-zero value
    before. This applies only to the sd and sr drivers, which are changed to
    set the new flags.

    The new flags are DISK_EVENT_FLAG_POLL to enforce polling of the device
    for synchronous events, and DISK_EVENT_FLAG_UEVENT to tell the
    block layer to generate udev events from kernel events.

    In order to add the event_flags field to struct gendisk, the events
    field is converted to an "unsigned short"; it doesn't need to hold
    values bigger than 2 anyway.

    This patch doesn't change behavior.
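
    A minimal sketch of how the two fields could interact is shown below.
    The struct is a hypothetical cut-down stand-in for struct gendisk, the
    bit values are assumptions made for the example, and the helper names
    events_for_uevent and should_poll are invented here.

```c
#include <assert.h>

/* Bit values are assumptions for this sketch, not quoted from the patch. */
#define DISK_EVENT_MEDIA_CHANGE   (1u << 0)
#define DISK_EVENT_EJECT_REQUEST  (1u << 1)
#define DISK_EVENT_FLAG_POLL      (1u << 0)
#define DISK_EVENT_FLAG_UEVENT    (1u << 1)

/* Hypothetical cut-down stand-in for struct gendisk. */
struct disk_sketch {
    unsigned short events;       /* events the device supports */
    unsigned short event_flags;  /* how the block layer treats them */
};

/* Events that would be turned into uevents for user space. */
static unsigned int events_for_uevent(const struct disk_sketch *d,
                                      unsigned int pending)
{
    assert(d);
    if (!(d->event_flags & DISK_EVENT_FLAG_UEVENT))
        return 0;
    return pending & d->events;
}

/* Whether the block layer should poll this device for events. */
static int should_poll(const struct disk_sketch *d)
{
    assert(d);
    return (d->event_flags & DISK_EVENT_FLAG_POLL) && d->events != 0;
}
```

    With this split, a driver can list the events it supports in "events"
    without that alone causing polling or uevents.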

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Martin Wilck
    Signed-off-by: Jens Axboe

    Martin Wilck
     
  • The async_events field, intended to be used for drivers that support
    asynchronous notifications about disk events (aka media change events),
    isn't currently used by any driver, and apparently that has been that
    way for a long time (if not forever). Remove it.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Martin Wilck
    Signed-off-by: Jens Axboe

    Martin Wilck
     

07 Apr, 2019

1 commit

  • Currently support for 64-bit sector_t and blkcnt_t is optional on 32-bit
    architectures. These types are required to support block device and/or
    file sizes larger than 2 TiB, and have generally defaulted to on for
    a long time. Enabling the option only increases the i386 tinyconfig
    size by 145 bytes, and many data structures already always use
    64-bit values for their in-core and on-disk data structures anyway,
    so there should not be a large change in dynamic memory usage either.

    Dropping this option removes a somewhat weird non-default config that
    has caused various bugs or compiler warnings when actually used.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

10 Dec, 2018

4 commits

  • The previous patches deleted all the code that needed the second value
    returned from part_in_flight - now the kernel only uses the first value.

    Consequently, part_in_flight (and blk_mq_in_flight) may be changed so that
    it only returns one value.

    This patch just refactors the code, there's no functional change.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mikulas Patocka
     
  • Now that part_round_stats is gone, we can switch to per-cpu in-flight
    counters.

    We use the local-atomic type local_t, so that if part_inc_in_flight or
    part_dec_in_flight is reentrantly called from an interrupt, the value will
    be correct.

    The other counters could be corrupted due to reentrant interrupt, but the
    corruption only results in slight counter skew - the in_flight counter
    must be exact, so it needs local_t.
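
    The per-cpu scheme can be illustrated with a single-threaded userspace
    model. It does not show the interrupt safety that local_t provides; it
    only shows why the sum over all CPUs stays exact even when an I/O starts
    on one CPU and completes on another. All names are hypothetical.

```c
#include <assert.h>

#define NR_CPUS_SKETCH 4

/* One signed counter per CPU; the kernel patch uses local_t for these. */
struct in_flight_sketch {
    long count[NR_CPUS_SKETCH];
};

/* Called on the CPU that starts an I/O. */
static void in_flight_inc(struct in_flight_sketch *s, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
    s->count[cpu]++;
}

/* Called on the CPU that completes an I/O, possibly a different one. */
static void in_flight_dec(struct in_flight_sketch *s, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
    s->count[cpu]--;
}

/* Readers sum all CPUs; a single slot may be negative, the total is not. */
static long in_flight_read(const struct in_flight_sketch *s)
{
    long sum = 0;

    for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
        sum += s->count[cpu];
    return sum;
}
```

    Updates touch only the local slot, so the cost of summing is paid by
    the comparatively rare readers.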

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mikulas Patocka
     
  • We want to convert to per-cpu in_flight counters.

    The function part_round_stats needs the in_flight counter every jiffy.
    Summing all the percpu variables every jiffy would be too costly, so the
    function must be deleted. part_round_stats is used to calculate two counters -
    time_in_queue and io_ticks.

    time_in_queue can be calculated without part_round_stats, by adding the
    duration of the I/O when the I/O ends (the value is almost as exact as the
    previously calculated value, except that time for in-progress I/Os is not
    counted).

    io_ticks can be approximated by increasing the value when I/O is started
    or ended and the jiffies value has changed. If the I/Os take less than a
    jiffy, the value is as exact as the previously calculated value. If the
    I/Os take more than a jiffy, io_ticks can drift behind the previously
    calculated value.
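
    The io_ticks approximation can be sketched as follows. This is a
    userspace model under the assumption that the counter advances by one
    per observed jiffy change; the struct and the name sketch_update_io_ticks
    are hypothetical.

```c
#include <assert.h>

/* Hypothetical per-partition state for this sketch. */
struct ticks_sketch {
    unsigned long stamp;     /* jiffies value at the last update */
    unsigned long io_ticks;  /* approximate busy time in jiffies */
};

/* Called at I/O start and at I/O end with the current jiffies value. */
static void sketch_update_io_ticks(struct ticks_sketch *p, unsigned long now)
{
    assert(p);
    if (now != p->stamp) {
        p->stamp = now;
        p->io_ticks++;
    }
}
```

    In this model an I/O that starts at jiffy 10 and ends at jiffy 15 adds
    only 2 ticks although the device was busy for 5, which is the drift
    described above; an I/O that starts and ends within one jiffy adds 1.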

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mikulas Patocka
     
  • All of part_stat_* and related methods are used with preempt disabled,
    so there is no need to pass cpu around to all of them. Just call
    smp_processor_id() as needed.

    Suggested-by: Jens Axboe
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

29 Nov, 2018

1 commit

  • Syzkaller recently reported a stack trace like this:

    BUG: sleeping function called from invalid context at mm/slab.h:361
    in_atomic(): 1, irqs_disabled(): 0, pid: 6644, name: blkid
    INFO: lockdep is turned off.
    CPU: 1 PID: 6644 Comm: blkid Not tainted 4.4.163-514.55.6.9.x86_64+ #76
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
    0000000000000000 5ba6a6b879e50c00 ffff8801f6b07b10 ffffffff81cb2194
    0000000041b58ab3 ffffffff833c7745 ffffffff81cb2080 5ba6a6b879e50c00
    0000000000000000 0000000000000001 0000000000000004 0000000000000000
    Call Trace:
    [] __dump_stack lib/dump_stack.c:15 [inline]
    [] dump_stack+0x114/0x1a0 lib/dump_stack.c:51
    [] ___might_sleep+0x291/0x490 kernel/sched/core.c:7675
    [] __might_sleep+0xb3/0x270 kernel/sched/core.c:7637
    [] slab_pre_alloc_hook mm/slab.h:361 [inline]
    [] slab_alloc_node mm/slub.c:2610 [inline]
    [] slab_alloc mm/slub.c:2692 [inline]
    [] kmem_cache_alloc_trace+0x2c3/0x5c0 mm/slub.c:2709
    [] kmalloc include/linux/slab.h:479 [inline]
    [] kzalloc include/linux/slab.h:623 [inline]
    [] kobject_uevent_env+0x2c7/0x1150 lib/kobject_uevent.c:227
    [] kobject_uevent+0x1f/0x30 lib/kobject_uevent.c:374
    [] kobject_cleanup lib/kobject.c:633 [inline]
    [] kobject_release+0x229/0x440 lib/kobject.c:675
    [] kref_sub include/linux/kref.h:73 [inline]
    [] kref_put include/linux/kref.h:98 [inline]
    [] kobject_put+0x72/0xd0 lib/kobject.c:692
    [] put_device+0x25/0x30 drivers/base/core.c:1237
    [] delete_partition_rcu_cb+0x1d4/0x2f0 block/partition-generic.c:232
    [] __rcu_reclaim kernel/rcu/rcu.h:118 [inline]
    [] rcu_do_batch kernel/rcu/tree.c:2705 [inline]
    [] invoke_rcu_callbacks kernel/rcu/tree.c:2973 [inline]
    [] __rcu_process_callbacks kernel/rcu/tree.c:2940 [inline]
    [] rcu_process_callbacks+0x59c/0x1c70 kernel/rcu/tree.c:2957
    [] __do_softirq+0x299/0xe20 kernel/softirq.c:273
    [] invoke_softirq kernel/softirq.c:350 [inline]
    [] irq_exit+0x216/0x2c0 kernel/softirq.c:391
    [] exiting_irq arch/x86/include/asm/apic.h:652 [inline]
    [] smp_apic_timer_interrupt+0x8b/0xc0 arch/x86/kernel/apic/apic.c:926
    [] apic_timer_interrupt+0xa5/0xb0 arch/x86/entry/entry_64.S:746
    [] ? audit_kill_trees+0x180/0x180
    [] fd_install+0x57/0x80 fs/file.c:626
    [] do_sys_open+0x45e/0x550 fs/open.c:1043
    [] SYSC_open fs/open.c:1055 [inline]
    [] SyS_open+0x32/0x40 fs/open.c:1050
    [] entry_SYSCALL_64_fastpath+0x1e/0x9a

    In softirq context, we call rcu callback function delete_partition_rcu_cb(),
    which may allocate memory by kzalloc with GFP_KERNEL flag. If the
    allocation cannot be satisfied, it may sleep. However, that is not allowed
    in softirq context.

    Although we found this problem on linux 4.4, the latest kernel version
    seems to have this problem as well. And it is very similar to the
    previous one:
    https://lkml.org/lkml/2018/7/9/391

    Fix it by using RCU workqueue, which allows sleep.

    Reviewed-by: Paul E. McKenney
    Signed-off-by: Yufen Yu
    Signed-off-by: Jens Axboe

    Yufen Yu
     

01 Oct, 2018

1 commit

  • Merge -rc6 in, for two reasons:

    1) Resolve a trivial conflict in the blk-mq-tag.c documentation
    2) A few important regression fixes went into upstream directly, so
    they aren't in the 4.20 branch.

    Signed-off-by: Jens Axboe

    * tag 'v4.19-rc6': (780 commits)
    Linux 4.19-rc6
    MAINTAINERS: fix reference to moved drivers/{misc => auxdisplay}/panel.c
    cpufreq: qcom-kryo: Fix section annotations
    perf/core: Add sanity check to deal with pinned event failure
    xen/blkfront: correct purging of persistent grants
    Revert "xen/blkfront: When purging persistent grants, keep them in the buffer"
    selftests/powerpc: Fix Makefiles for headers_install change
    blk-mq: I/O and timer unplugs are inverted in blktrace
    dax: Fix deadlock in dax_lock_mapping_entry()
    x86/boot: Fix kexec booting failure in the SEV bit detection code
    bcache: add separate workqueue for journal_write to avoid deadlock
    drm/amd/display: Fix Edid emulation for linux
    drm/amd/display: Fix Vega10 lightup on S3 resume
    drm/amdgpu: Fix vce work queue was not cancelled when suspend
    Revert "drm/panel: Add device_link from panel device to DRM device"
    xen/blkfront: When purging persistent grants, keep them in the buffer
    clocksource/drivers/timer-atmel-pit: Properly handle error cases
    block: fix deadline elevator drain for zoned block devices
    ACPI / hotplug / PCI: Don't scan for non-hotplug bridges if slot is not bridge
    drm/syncobj: Don't leak fences when WAIT_FOR_SUBMIT is set
    ...

    Signed-off-by: Jens Axboe

    Jens Axboe
     

28 Sep, 2018

1 commit

  • Update device_add_disk() to take a 'groups' argument so that
    individual drivers can register a device with additional sysfs
    attributes.
    This avoids the race condition the driver would otherwise have if these
    groups were to be created with sysfs_add_groups().

    Signed-off-by: Martin Wilck
    Signed-off-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Hannes Reinecke
     

22 Sep, 2018

1 commit

  • Klaus Kusche reported that the I/O busy time in /proc/diskstats was not
    updating properly on 4.18. This is because we started using ktime to
    track elapsed time, and we convert nanoseconds to jiffies when we update
    the partition counter. However, this gets rounded down, so any I/Os that
    take less than a jiffy are not accounted for. Previously in this case,
    the value of jiffies would sometimes increment while we were doing I/O,
    so at least some I/Os were accounted for.

    Let's convert the stats to use nanoseconds internally. We still report
    milliseconds as before, now more accurately than ever. The value is
    still truncated to 32 bits for backwards compatibility.
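
    The rounding problem can be reproduced with simple arithmetic. The
    sketch below assumes a tick rate of 250 Hz (4 ms per jiffy) purely for
    illustration; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_HZ      250ULL           /* assumed tick rate: 4 ms per jiffy */
#define NSEC_PER_SEC   1000000000ULL
#define NSEC_PER_MSEC  1000000ULL

/* Old scheme: convert each I/O to jiffies (rounding down), then add. */
static uint64_t account_in_jiffies(uint64_t nr_ios, uint64_t ns_per_io)
{
    uint64_t jiffies_total = 0;

    assert(ns_per_io <= UINT64_MAX / SKETCH_HZ);
    for (uint64_t i = 0; i < nr_ios; i++)
        jiffies_total += ns_per_io * SKETCH_HZ / NSEC_PER_SEC;
    return jiffies_total * (1000ULL / SKETCH_HZ);   /* report in ms */
}

/* New scheme: accumulate nanoseconds, convert once when reporting. */
static uint64_t account_in_ns(uint64_t nr_ios, uint64_t ns_per_io)
{
    uint64_t ns_total = 0;

    for (uint64_t i = 0; i < nr_ios; i++)
        ns_total += ns_per_io;
    return ns_total / NSEC_PER_MSEC;                /* report in ms */
}
```

    For 1000 I/Os of 0.5 ms each, the first function reports 0 ms because
    every I/O rounds down to zero jiffies, while the second reports 500 ms.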

    Fixes: 522a777566f5 ("block: consolidate struct request timestamp fields")
    Cc: stable@vger.kernel.org
    Reported-by: Klaus Kusche
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

18 Jul, 2018

3 commits

  • Add tracking of REQ_OP_DISCARD ios to the partition statistics and
    append them to the various stat files in /sys as well as
    /proc/diskstats. These are tracked with the same four stats as reads
    and writes:

    Number of discard ios completed.
    Number of discard ios merged
    Number of discard sectors completed
    Milliseconds spent on discard requests

    This is done via adding a new STAT_DISCARD define to genhd.h and then
    using it to index that stat field for discard requests.
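
    Indexing the stat fields by request type might look like the sketch
    below. The struct layout, the NR_STAT_GROUPS name and account_done are
    assumptions for illustration and not the definitions from genhd.h.

```c
#include <assert.h>

/* Index names follow the commit text; NR_STAT_GROUPS is assumed here. */
enum { STAT_READ, STAT_WRITE, STAT_DISCARD, NR_STAT_GROUPS };

/* The same four counters are kept for every group. */
struct disk_stats_sketch {
    unsigned long ios[NR_STAT_GROUPS];
    unsigned long merges[NR_STAT_GROUPS];
    unsigned long sectors[NR_STAT_GROUPS];
    unsigned long ticks_ms[NR_STAT_GROUPS];
};

/* Account one completed request of 'sectors' sectors that took 'ms'. */
static void account_done(struct disk_stats_sketch *s, int group,
                         unsigned long sectors, unsigned long ms)
{
    assert(group >= 0 && group < NR_STAT_GROUPS);
    s->ios[group]++;
    s->sectors[group] += sectors;
    s->ticks_ms[group] += ms;
}
```

    Adding a new request type then only means adding one enum value; the
    accounting code itself does not change.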

    tj: Refreshed on top of v4.17 and other previous updates.

    Signed-off-by: Michael Callahan
    Signed-off-by: Tejun Heo
    Cc: Andy Newell
    Signed-off-by: Jens Axboe

    Michael Callahan
     
  • Add defines for STAT_READ and STAT_WRITE for indexing the partition
    stat entries. This clarifies some fs/ code which has hardcoded 1 for
    STAT_WRITE and will make it easier to extend the stats with additional
    fields.

    tj: Refreshed on top of v4.17.

    Signed-off-by: Michael Callahan
    Signed-off-by: Tejun Heo
    Cc: "Theodore Ts'o"
    Cc: Jaegeuk Kim
    Signed-off-by: Jens Axboe

    Michael Callahan
     
  • Add a part_stat_read_accum macro to genhd.h to read and sum across
    field entries, for example to sum up the number of read and write
    sectors completed. In addition to being a reasonable cleanup by
    itself this will make it easier to add new stat fields in the future.
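
    A macro of this kind could be shaped roughly as below. This is a sketch
    only: part_sketch, part_stat_read_accum_sketch and total_sectors are
    hypothetical names, and the real macro reads per-cpu data rather than a
    plain array.

```c
#include <assert.h>

enum { STAT_READ, STAT_WRITE, NR_STAT_GROUPS };

/* Hypothetical partition with two stat fields, indexed by group. */
struct part_sketch {
    unsigned long sectors[NR_STAT_GROUPS];
    unsigned long ios[NR_STAT_GROUPS];
};

/* Read one field for reads and for writes and return the sum. */
#define part_stat_read_accum_sketch(part, field) \
    ((part)->field[STAT_READ] + (part)->field[STAT_WRITE])

/* Example caller: total number of sectors transferred. */
static unsigned long total_sectors(const struct part_sketch *p)
{
    assert(p);
    return part_stat_read_accum_sketch(p, sectors);
}
```

    Callers no longer spell out each index, so a later stat group only
    requires a change inside the macro.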

    tj: Refreshed on top of v4.17.

    Signed-off-by: Michael Callahan
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Michael Callahan
     

26 Apr, 2018

1 commit

  • When the blk-mq inflight implementation was added, /proc/diskstats was
    converted to use it, but /sys/block/$dev/inflight was not. Fix it by
    adding another helper to count in-flight requests by data direction.

    Fixes: f299b7c7a9de ("blk-mq: provide internal in-flight variant")
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

27 Feb, 2018

3 commits

  • When two blkdev_open() calls for a partition race with device removal
    and recreation, we can hit BUG_ON(!bd_may_claim(bdev, whole, holder)) in
    blkdev_open(). The race can happen as follows:

    CPU0                            CPU1                            CPU2
                                                                    del_gendisk()
                                                                      bdev_unhash_inode(part1);

    blkdev_open(part1, O_EXCL)      blkdev_open(part1, O_EXCL)
      bdev = bd_acquire()             bdev = bd_acquire()
      blkdev_get(bdev)
        bd_start_claiming(bdev)
          - finds old inode 'whole'
          bd_prepare_to_claim() -> 0
                                                                      bdev_unhash_inode(whole);
                                                                    (device removed and recreated)

                                      blkdev_get(bdev);
                                        bd_start_claiming(bdev)
                                          - finds new inode 'whole'
                                          bd_prepare_to_claim()
                                            - this also succeeds as we have
                                              different 'whole' here...
                                            - bad things happen now as we
                                              have two exclusive openers of
                                              the same bdev

    The problem here is that block device opens can see various intermediate
    states while gendisk is shutting down and then being recreated.

    We fix the problem by introducing new lookup_sem in gendisk that
    synchronizes gendisk deletion with get_gendisk() and furthermore by
    making sure that get_gendisk() does not return gendisk that is being (or
    has been) deleted. This makes sure that once we ever manage to look up
    newly created bdev inode, we are also guaranteed that following
    get_gendisk() will either return failure (and we fail open) or it
    returns gendisk for the new device and following bdget_disk() will
    return new bdev inode (i.e., blkdev_open() follows the path as if it is
    completely run after new device is created).

    Reported-and-analyzed-by: Hou Tao
    Tested-by: Hou Tao
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • Add a proper counterpart to get_disk_and_module() -
    put_disk_and_module(). Currently it is opencoded in several places.

    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • Rename get_disk() to get_disk_and_module() to make it clear what the
    function does. It's not a great name but at least it is now clear that
    put_disk() is not its counterpart.

    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

15 Jan, 2018

1 commit

  • Since I can remember DM has forced the block layer to allow the
    allocation and initialization of the request_queue to be distinct
    operations. The reason for this is that block/genhd.c:add_disk() requires
    that the request_queue (and associated bdi) be tied to the gendisk
    before add_disk() is called -- because add_disk() also deals with
    exposing the request_queue via blk_register_queue().

    DM's dynamic creation of arbitrary device types (and associated
    request_queue types) requires the DM device's gendisk be available so
    that DM table loads can establish a master/slave relationship with
    subordinate devices that are referenced by loaded DM tables -- using
    bd_link_disk_holder(). But until these DM tables, and their associated
    subordinate devices, are known DM cannot know what type of request_queue
    it needs -- nor what its queue_limits should be.

    This chicken and egg scenario has created all manner of problems for DM
    and, at times, the block layer.

    Summary of changes:

    - Add device_add_disk_no_queue_reg() and add_disk_no_queue_reg() variant
    that drivers may use to add a disk without also calling
    blk_register_queue(). Driver must call blk_register_queue() once its
    request_queue is fully initialized.

    - Return early from blk_unregister_queue() if QUEUE_FLAG_REGISTERED
    is not set. It won't be set if driver used add_disk_no_queue_reg()
    but driver encounters an error and must del_gendisk() before calling
    blk_register_queue().

    - Export blk_register_queue().

    These changes allow DM to use add_disk_no_queue_reg() to anchor its
    gendisk as the "master" for master/slave relationships DM must establish
    with subordinate devices referenced in DM tables that get loaded. Once
    all "slave" devices for a DM device are known its request_queue can be
    properly initialized and then advertised via sysfs -- important
    improvement being that no request_queue resource initialization
    performed by blk_register_queue() is missed for DM devices anymore.

    Signed-off-by: Mike Snitzer
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

15 Nov, 2017

1 commit

  • Pull core block layer updates from Jens Axboe:
    "This is the main pull request for block storage for 4.15-rc1.

    Nothing out of the ordinary in here, and no API changes or anything
    like that. Just various new features for drivers, core changes, etc.
    In particular, this pull request contains:

    - A patch series from Bart, closing the hole on blk/scsi-mq queue
    quiescing.

    - A series from Christoph, building towards hidden gendisks (for
    multipath) and ability to move bio chains around.

    - NVMe
    - Support for native multipath for NVMe (Christoph).
    - Userspace notifications for AENs (Keith).
    - Command side-effects support (Keith).
    - SGL support (Chaitanya Kulkarni)
    - FC fixes and improvements (James Smart)
    - Lots of fixes and tweaks (Various)

    - bcache
    - New maintainer (Michael Lyle)
    - Writeback control improvements (Michael)
    - Various fixes (Coly, Elena, Eric, Liang, et al)

    - lightnvm updates, mostly centered around the pblk interface
    (Javier, Hans, and Rakesh).

    - Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)

    - Writeback series that fix the much discussed hundreds of millions
    of sync-all units. This goes all the way, as discussed previously
    (me).

    - Fix for missing wakeup on writeback timer adjustments (Yafang
    Shao).

    - Fix laptop mode on blk-mq (me).

    - {mq,name} tuple lookup for IO schedulers, allowing us to have
    alias names. This means you can use 'deadline' on both !mq and on
    mq (where it's called mq-deadline). (me).

    - blktrace race fix, oopsing on sg load (me).

    - blk-mq optimizations (me).

    - Obscure waitqueue race fix for kyber (Omar).

    - NBD fixes (Josef).

    - Disable writeback throttling by default on bfq, like we do on cfq
    (Luca Miccio).

    - Series from Ming that enable us to treat flush requests on blk-mq
    like any other request. This is a really nice cleanup.

    - Series from Ming that improves merging on blk-mq with schedulers,
    getting us closer to flipping the switch on scsi-mq again.

    - BFQ updates (Paolo).

    - blk-mq atomic flags memory ordering fixes (Peter Z).

    - Loop cgroup support (Shaohua).

    - Lots of minor fixes from lots of different folks, both for core and
    driver code"

    * 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
    nvme: fix visibility of "uuid" ns attribute
    blk-mq: fixup some comment typos and lengths
    ide: ide-atapi: fix compile error with defining macro DEBUG
    blk-mq: improve tag waiting setup for non-shared tags
    brd: remove unused brd_mutex
    blk-mq: only run the hardware queue if IO is pending
    block: avoid null pointer dereference on null disk
    fs: guard_bio_eod() needs to consider partitions
    xtensa/simdisk: fix compile error
    nvme: expose subsys attribute to sysfs
    nvme: create 'slaves' and 'holders' entries for hidden controllers
    block: create 'slaves' and 'holders' entries for hidden gendisks
    nvme: also expose the namespace identification sysfs files for mpath nodes
    nvme: implement multipath access to nvme subsystems
    nvme: track shared namespaces
    nvme: introduce a nvme_ns_ids structure
    nvme: track subsystems
    block, nvme: Introduce blk_mq_req_flags_t
    block, scsi: Make SCSI quiesce and resume work reliably
    block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
    ...

    Linus Torvalds
     

11 Nov, 2017

1 commit

  • guard_bio_eod() needs to look at the partition capacity, not just the
    capacity of the whole device, when determining if truncation is
    necessary.
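
    The difference between checking against the partition and checking
    against the whole device can be shown with a small userspace sketch.
    The function clamp_to_capacity is hypothetical and the sector numbers
    in the note below are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return how many of 'nr_sectors' starting at 'start' may be issued
 * when the target (partition or whole disk) holds 'capacity' sectors.
 */
static uint64_t clamp_to_capacity(uint64_t start, uint64_t nr_sectors,
                                  uint64_t capacity)
{
    assert(nr_sectors > 0);
    if (start >= capacity)
        return nr_sectors;      /* entirely past the end: not truncated here */
    if (nr_sectors <= capacity - start)
        return nr_sectors;      /* fits, nothing to truncate */
    return capacity - start;    /* truncate to the last valid sector */
}
```

    For a 100-sector partition on a 1000-sector disk, an 8-sector read at
    partition-relative sector 96 is truncated to 4 sectors when the
    partition capacity is used, but passes through untruncated when only
    the capacity of the whole device is considered.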

    [ 60.268688] attempt to access beyond end of device
    [ 60.268690] unknown-block(9,1): rw=0, want=67103509, limit=67103506
    [ 60.268693] buffer_io_error: 2 callbacks suppressed
    [ 60.268696] Buffer I/O error on dev md1p7, logical block 4524305, async page read

    Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and partitions index")
    Cc: stable@vger.kernel.org # v4.13
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Greg Edwards
    Signed-off-by: Jens Axboe

    Greg Edwards
     

07 Nov, 2017

1 commit


04 Nov, 2017

2 commits

  • With this flag a driver can create a gendisk that can be used for I/O
    submission inside the kernel, but which is not registered as user
    facing block device. This will be useful for the NVMe multipath
    implementation.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • The hidden gendisks introduced in the next patch need to keep the dev
    field in their struct device empty so that udev won't try to create
    block device nodes for them. To support that rewrite disk_devt to
    look at the major and first_minor fields in the gendisk itself instead
    of looking into the struct device.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

26 Oct, 2017

1 commit

  • Darrick posted the following warning and Dave Chinner analyzed it:

    > ======================================================
    > WARNING: possible circular locking dependency detected
    > 4.14.0-rc1-fixes #1 Tainted: G W
    > ------------------------------------------------------
    > loop0/31693 is trying to acquire lock:
    > (&(&ip->i_mmaplock)->mr_lock){++++}, at: [] xfs_ilock+0x23c/0x330 [xfs]
    >
    > but now in release context of a crosslock acquired at the following:
    > ((complete)&ret.event){+.+.}, at: [] submit_bio_wait+0x7f/0xb0
    >
    > which lock already depends on the new lock.
    >
    > the existing dependency chain (in reverse order) is:
    >
    > -> #2 ((complete)&ret.event){+.+.}:
    > lock_acquire+0xab/0x200
    > wait_for_completion_io+0x4e/0x1a0
    > submit_bio_wait+0x7f/0xb0
    > blkdev_issue_zeroout+0x71/0xa0
    > xfs_bmapi_convert_unwritten+0x11f/0x1d0 [xfs]
    > xfs_bmapi_write+0x374/0x11f0 [xfs]
    > xfs_iomap_write_direct+0x2ac/0x430 [xfs]
    > xfs_file_iomap_begin+0x20d/0xd50 [xfs]
    > iomap_apply+0x43/0xe0
    > dax_iomap_rw+0x89/0xf0
    > xfs_file_dax_write+0xcc/0x220 [xfs]
    > xfs_file_write_iter+0xf0/0x130 [xfs]
    > __vfs_write+0xd9/0x150
    > vfs_write+0xc8/0x1c0
    > SyS_write+0x45/0xa0
    > entry_SYSCALL_64_fastpath+0x1f/0xbe
    >
    > -> #1 (&xfs_nondir_ilock_class){++++}:
    > lock_acquire+0xab/0x200
    > down_write_nested+0x4a/0xb0
    > xfs_ilock+0x263/0x330 [xfs]
    > xfs_setattr_size+0x152/0x370 [xfs]
    > xfs_vn_setattr+0x6b/0x90 [xfs]
    > notify_change+0x27d/0x3f0
    > do_truncate+0x5b/0x90
    > path_openat+0x237/0xa90
    > do_filp_open+0x8a/0xf0
    > do_sys_open+0x11c/0x1f0
    > entry_SYSCALL_64_fastpath+0x1f/0xbe
    >
    > -> #0 (&(&ip->i_mmaplock)->mr_lock){++++}:
    > up_write+0x1c/0x40
    > xfs_iunlock+0x1d0/0x310 [xfs]
    > xfs_file_fallocate+0x8a/0x310 [xfs]
    > loop_queue_work+0xb7/0x8d0
    > kthread_worker_fn+0xb9/0x1f0
    >
    > Chain exists of:
    > &(&ip->i_mmaplock)->mr_lock --> &xfs_nondir_ilock_class --> (complete)&ret.event
    >
    > Possible unsafe locking scenario by crosslock:
    >
    > CPU0 CPU1
    > ---- ----
    > lock(&xfs_nondir_ilock_class);
    > lock((complete)&ret.event);
    > lock(&(&ip->i_mmaplock)->mr_lock);
    > unlock((complete)&ret.event);
    >
    > *** DEADLOCK ***

    The warning is a false positive, caused by the fact that all
    wait_for_completion()s in submit_bio_wait() are waiting with the same
    lock class.

    However, some bios have nothing to do with others, for example in the case
    of loop devices, there's no direct connection between the bios of an upper
    device and the bios of a lower device(=loop device).

    The safest way to assign different lock classes to different devices is
    to do it for each gendisk. In other words, this patch assigns a
    lockdep_map per gendisk and uses it when initializing completion in
    submit_bio_wait().

    Analyzed-by: Dave Chinner
    Reported-by: Darrick J. Wong
    Signed-off-by: Byungchul Park
    Reviewed-by: Jens Axboe
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: amir73il@gmail.com
    Cc: axboe@kernel.dk
    Cc: david@fromorbit.com
    Cc: hch@infradead.org
    Cc: idryomov@gmail.com
    Cc: johan@kernel.org
    Cc: johannes.berg@intel.com
    Cc: kernel-team@lge.com
    Cc: linux-block@vger.kernel.org
    Cc: linux-fsdevel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: linux-xfs@vger.kernel.org
    Cc: oleg@redhat.com
    Cc: tj@kernel.org
    Link: http://lkml.kernel.org/r/1508921765-15396-10-git-send-email-byungchul.park@lge.com
    Signed-off-by: Ingo Molnar

    Byungchul Park
     

10 Aug, 2017

3 commits

  • We don't have to inc/dec some counter, since we can just
    iterate the tags. That makes inc/dec a noop, but means we
    have to iterate busy tags to get an in-flight count.

    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Instead of returning the count that matches the partition, pass
    in an array of two ints. Index 0 will be filled with the inflight
    count for the partition in question, and index 1 will be filled
    with the root inflight count, if the partition passed in is not the
    root.

    This is in preparation for being able to calculate both in one
    go.
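
    Filling both counts in a single pass over the busy requests could look
    like the sketch below. The req_sketch type and count_in_flight are
    hypothetical; the real code walks busy tags rather than an array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request: only the partition number matters here. */
struct req_sketch {
    int partno;   /* 0 means the whole disk (the "root") */
};

/*
 * Walk all busy requests once and fill both counts:
 * inflight[0] for 'partno', inflight[1] for the root when partno != 0.
 */
static void count_in_flight(const struct req_sketch *busy, size_t n,
                            int partno, unsigned int inflight[2])
{
    assert(partno >= 0);
    inflight[0] = inflight[1] = 0;
    for (size_t i = 0; i < n; i++) {
        if (busy[i].partno == partno)
            inflight[0]++;
        if (partno != 0)
            inflight[1]++;      /* every busy request counts for the root */
    }
}
```

    With four busy requests, two of them on partition 1, a query for
    partition 1 yields inflight[0] == 2 and inflight[1] == 4.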

    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • No functional change in this patch, just in preparation for
    basing the inflight mechanism on the queue in question.

    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     

05 Jun, 2017

1 commit

  • This helper was only used by IMA of all things, which would get spurious
    errors if CONFIG_BLOCK is disabled. Just opencode the call there.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Amir Goldstein
    Acked-by: Mimi Zohar
    Reviewed-by: Andy Shevchenko

    Christoph Hellwig
     

22 Apr, 2017

1 commit

  • Commit 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk")
    introduced blk_integrity_revalidate(), which seems to assume ownership
    of the stable pages flag and unilaterally clears it if no blk_integrity
    profile is registered:

    if (bi->profile)
            disk->queue->backing_dev_info->capabilities |=
                    BDI_CAP_STABLE_WRITES;
    else
            disk->queue->backing_dev_info->capabilities &=
                    ~BDI_CAP_STABLE_WRITES;

    It's called from revalidate_disk() and rescan_partitions(), making it
    impossible to enable stable pages for drivers that support partitions
    and don't use blk_integrity: while the call in revalidate_disk() can be
    trivially worked around (see zram, which doesn't support partitions and
    hence gets away with zram_revalidate_disk()), rescan_partitions() can
    be triggered from userspace at any time. This breaks rbd, where the
    ceph messenger is responsible for generating/verifying CRCs.

    Since blk_integrity_{un,}register() "must" be used for (un)registering
    the integrity profile with the block layer, move BDI_CAP_STABLE_WRITES
    setting there. This way drivers that call blk_integrity_register() and
    use integrity infrastructure won't interfere with drivers that don't
    but still want stable pages.

    Fixes: 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk")
    Cc: "Martin K. Petersen"
    Cc: Christoph Hellwig
    Cc: Mike Snitzer
    Cc: stable@vger.kernel.org # 4.4+, needs backporting
    Tested-by: Dan Williams
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe

    Ilya Dryomov
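
The ownership problem described above can be modelled in a few lines of C. This is only a sketch under stated assumptions: the structs, the flag value, and the function names (`integrity_revalidate_old()`, `integrity_register()`, `integrity_unregister()`) are hypothetical simplifications, not the kernel's definitions. It contrasts recomputing the flag on every revalidate with touching it only on register and unregister.

```c
#include <stddef.h>

#define BDI_CAP_STABLE_WRITES 0x1u  /* illustrative value, not the kernel's */

/* Hypothetical, flattened stand-ins for the kernel structures. */
struct bdi       { unsigned int capabilities; };
struct integrity { const void *profile; };
struct disk      { struct bdi bdi; struct integrity bi; };

/*
 * Old behaviour: every revalidate or partition rescan recomputes the
 * flag from bi.profile, so a driver that set the flag itself (without
 * an integrity profile) loses it.
 */
static void integrity_revalidate_old(struct disk *d)
{
    if (d->bi.profile)
        d->bdi.capabilities |= BDI_CAP_STABLE_WRITES;
    else
        d->bdi.capabilities &= ~BDI_CAP_STABLE_WRITES;
}

/* New behaviour: only registration sets the flag ... */
static void integrity_register(struct disk *d, const void *profile)
{
    d->bi.profile = profile;
    d->bdi.capabilities |= BDI_CAP_STABLE_WRITES;
}

/* ... and only unregistration clears it. */
static void integrity_unregister(struct disk *d)
{
    d->bdi.capabilities &= ~BDI_CAP_STABLE_WRITES;
    d->bi.profile = NULL;
}
```

In this model, a disk that sets `BDI_CAP_STABLE_WRITES` on its own and never calls `integrity_register()` keeps the flag as long as nothing like `integrity_revalidate_old()` runs against it.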
     

25 Mar, 2017

1 commit


09 Mar, 2017

1 commit

  • This reverts commit 0dba1314d4f81115dce711292ec7981d17231064. It causes
    leaking of device numbers for SCSI when SCSI registers multiple gendisks
    for one request_queue in succession. It can be easily reproduced using
    Omar's script [1] on a kernel with CONFIG_DEBUG_TEST_DRIVER_REMOVE.
    Furthermore, the protection provided by this commit is no longer needed,
    as the problem it was fixing was also fixed by commit 165a5e22fafb
    ("block: Move bdi_unregister() to del_gendisk()").

    [1]: http://marc.info/?l=linux-block&m=148554717109098&w=2

    Signed-off-by: Jan Kara
    Acked-by: Dan Williams
    Tested-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jan Kara
     

02 Feb, 2017

1 commit

  • Warnings of the following form occur because scsi reuses a devt number
    while the block layer still has it referenced as the name of the bdi
    [1]:

    WARNING: CPU: 1 PID: 93 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x62/0x80
    sysfs: cannot create duplicate filename '/devices/virtual/bdi/8:192'
    [..]
    Call Trace:
    dump_stack+0x86/0xc3
    __warn+0xcb/0xf0
    warn_slowpath_fmt+0x5f/0x80
    ? kernfs_path_from_node+0x4f/0x60
    sysfs_warn_dup+0x62/0x80
    sysfs_create_dir_ns+0x77/0x90
    kobject_add_internal+0xb2/0x350
    kobject_add+0x75/0xd0
    device_add+0x15a/0x650
    device_create_groups_vargs+0xe0/0xf0
    device_create_vargs+0x1c/0x20
    bdi_register+0x90/0x240
    ? lockdep_init_map+0x57/0x200
    bdi_register_owner+0x36/0x60
    device_add_disk+0x1bb/0x4e0
    ? __pm_runtime_use_autosuspend+0x5c/0x70
    sd_probe_async+0x10d/0x1c0
    async_run_entry_fn+0x39/0x170

    This is a brute-force fix to pass the devt release information from
    sd_probe() to the locations where we register the bdi,
    device_add_disk(), and unregister the bdi, blk_cleanup_queue().

    Thanks to Omar for the quick reproducer script [2]. This patch survives
    where an unmodified kernel fails in a few seconds.

    [1]: https://marc.info/?l=linux-scsi&m=147116857810716&w=4
    [2]: http://marc.info/?l=linux-block&m=148554717109098&w=2

    Cc: James Bottomley
    Cc: Bart Van Assche
    Cc: "Martin K. Petersen"
    Cc: Jan Kara
    Reported-by: Omar Sandoval
    Tested-by: Omar Sandoval
    Signed-off-by: Dan Williams
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Dan Williams
     

23 Dec, 2016

1 commit


11 Oct, 2016

1 commit

  • The __latent_entropy gcc attribute can be used only on functions and
    variables. If it is on a function then the plugin will instrument it for
    gathering control-flow entropy. If the attribute is on a variable then
    the plugin will initialize it with random contents. The variable must
    be an integer, an integer array type or a structure with integer fields.

    These specific functions have been selected because they are init
    functions (to help gather boot-time entropy), are called at unpredictable
    times, or they have variable loops, each of which provides some level of
    latent entropy.

    Signed-off-by: Emese Revfy
    [kees: expanded commit message]
    Signed-off-by: Kees Cook

    Emese Revfy
     

28 Jun, 2016

1 commit

  • Now that all drivers that specify a ->driverfs_dev have been converted
    to device_add_disk(), the pointer can be removed from struct gendisk.

    Cc: Jens Axboe
    Cc: Ross Zwisler
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Dan Williams

    Dan Williams
     

16 Jun, 2016

1 commit

  • In preparation for removing the ->driverfs_dev member of a gendisk, add
    an api that takes the parent device as a parameter to add_disk(). For
    now this maintains the status quo of WARN()ing on failure without
    returning an error code.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Bart Van Assche
    Signed-off-by: Dan Williams

    Dan Williams
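
The shape of the new API can be sketched as follows. This is a simplified userspace model rather than the kernel source: the structs are hypothetical reductions, and the bodies only record what was passed in. The point it illustrates is that the parent device becomes an argument, while `add_disk()` remains available as a thin wrapper that passes no parent.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced stand-ins for the kernel structures. */
struct device  { const struct device *parent; };
struct gendisk { struct device dev; bool added; };

/*
 * The parent is supplied by the caller instead of being read from a
 * ->driverfs_dev member of the gendisk. Like the API described above,
 * this returns nothing, so failures cannot be reported to the caller.
 */
static void device_add_disk(const struct device *parent, struct gendisk *disk)
{
    disk->dev.parent = parent;  /* NULL means "no parent device" */
    disk->added = true;
}

/* Callers that never set a parent keep using add_disk() unchanged. */
static inline void add_disk(struct gendisk *disk)
{
    device_add_disk(NULL, disk);
}
```

A driver that previously assigned `disk->driverfs_dev` before calling `add_disk(disk)` would instead call `device_add_disk(parent, disk)` directly.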
     

21 May, 2016

1 commit

  • The UUID library provides the uuid_be type and the uuid_be_to_bin()
    function. This replaces the open-coded variant with generic library calls.

    Signed-off-by: Andy Shevchenko
    Reviewed-by: Matt Fleming
    Cc: Dmitry Kasatkin
    Cc: Mimi Zohar
    Cc: Rasmus Villemoes
    Cc: Arnd Bergmann
    Cc: "Theodore Ts'o"
    Cc: Al Viro
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
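
To illustrate what a string-to-binary UUID helper of this kind does, here is a minimal userspace sketch. It is not the kernel's implementation: `uuid_be_sketch`, `hex_val()`, and `uuid_be_to_bin_sketch()` are hypothetical names, and requiring the input to be exactly 36 characters long is a choice made for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type: 16 bytes stored in textual (big-endian) order. */
typedef struct { uint8_t b[16]; } uuid_be_sketch;

/* Return the value of one hex digit, or -1 if c is not a hex digit. */
static int hex_val(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 10;
    return -1;
}

/*
 * Parse "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" into 16 bytes.
 * Returns 0 on success and -1 on malformed input.
 */
static int uuid_be_to_bin_sketch(const char *s, uuid_be_sketch *u)
{
    size_t out = 0;

    if (strlen(s) != 36)
        return -1;

    for (size_t i = 0; i < 36; ) {
        /* Dashes sit at fixed offsets between the hex groups. */
        if (i == 8 || i == 13 || i == 18 || i == 23) {
            if (s[i] != '-')
                return -1;
            i++;
            continue;
        }

        int hi = hex_val(s[i]);
        int lo = hex_val(s[i + 1]);

        if (hi < 0 || lo < 0)
            return -1;
        u->b[out++] = (uint8_t)((hi << 4) | lo);
        i += 2;
    }
    return 0;
}
```

Keeping this logic in one shared helper rather than repeating it per caller is the kind of consolidation the commit describes.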