22 Apr, 2019

1 commit

  • commit 2da78092dda "block: Fix dev_t minor allocation lifetime"
    specifically moved blk_free_devt(dev->devt) call to part_release()
    to avoid reallocating device number before the device is fully
    shutdown.

    However, it can cause use-after-free on gendisk in get_gendisk().
    We use an md device as an example to show the race:

    Process1                Worker                  Process2
    md_free
                                                    blkdev_open
      del_gendisk
        add delete_partition_work_fn() to wq
                                                    __blkdev_get
                                                      get_gendisk
      put_disk
        disk_release
          kfree(disk)
                                                        find part from ext_devt_idr
                                                        get_disk_and_module(disk)
                                                          cause use after free

                            delete_partition_work_fn
                              put_device(part)
                                part_release
                                  remove part from ext_devt_idr

    Before the <devt, hd_struct pointer> entry is removed from
    ext_devt_idr by delete_partition_work_fn(), we can find the devt and
    then access the gendisk through the hd_struct pointer. But if we
    access the gendisk after it has been freed, the result is a
    use-after-free on the gendisk in get_gendisk().

    We fix this by adding a new helper blk_invalidate_devt() in
    delete_partition() and del_gendisk(). It replaces hd_struct
    pointer in idr with value 'NULL', and deletes the entry from
    idr in part_release() as we do now.
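
    The two-step teardown described above can be sketched in user-space C.
    This is only a model under stated assumptions: the fixed-size table, the
    struct names and the devt_*() helpers are hypothetical stand-ins for the
    idr and hd_struct, not the kernel code. It shows that an invalidated
    entry keeps its number reserved while lookups no longer see the pointer.

```c
#include <stddef.h>

#define DEVT_SLOTS 8	/* hypothetical table size for this sketch */

struct part { int partno; };	/* stand-in for hd_struct */

struct devt_table {
	int used[DEVT_SLOTS];		/* 1 while the minor is reserved */
	struct part *ptr[DEVT_SLOTS];	/* NULL once invalidated */
};

/* Reserve the lowest free minor and publish the pointer; -1 if full. */
int devt_alloc(struct devt_table *t, struct part *p)
{
	for (int i = 0; i < DEVT_SLOTS; i++) {
		if (!t->used[i]) {
			t->used[i] = 1;
			t->ptr[i] = p;
			return i;
		}
	}
	return -1;
}

/* Teardown step 1: keep the minor reserved, but hide the pointer. */
void devt_invalidate(struct devt_table *t, int minor)
{
	t->ptr[minor] = NULL;
}

/* Teardown step 2 (at release time): the minor may now be reused. */
void devt_free(struct devt_table *t, int minor)
{
	t->used[minor] = 0;
	t->ptr[minor] = NULL;
}

/* Returns NULL for free, invalidated and out-of-range minors. */
struct part *devt_lookup(const struct devt_table *t, int minor)
{
	if (minor < 0 || minor >= DEVT_SLOTS || !t->used[minor])
		return NULL;
	return t->ptr[minor];
}
```

    In this model a lookup racing with teardown gets NULL instead of a
    pointer to a structure that may already have been freed, and a new
    device cannot take the same number until devt_free() has run.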

    Thanks to Jan Kara for providing the solution and more clear comments
    for the code.

    Fixes: 2da78092dda1 ("block: Fix dev_t minor allocation lifetime")
    Cc: Al Viro
    Reviewed-by: Bart Van Assche
    Reviewed-by: Keith Busch
    Reviewed-by: Jan Kara
    Suggested-by: Jan Kara
    Signed-off-by: Yufen Yu
    Signed-off-by: Jens Axboe

    Yufen Yu
     

10 Dec, 2018

3 commits

  • The previous patches deleted all the code that needed the second value
    returned from part_in_flight - now the kernel only uses the first value.

    Consequently, part_in_flight (and blk_mq_in_flight) may be changed so that
    it only returns one value.

    This patch just refactors the code, there's no functional change.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mikulas Patocka
     
  • We want to convert to per-cpu in_flight counters.

    The function part_round_stats needs the in_flight counter every jiffy.
    Summing all the percpu variables every jiffy would be too costly, so
    part_round_stats must be deleted. part_round_stats is used to calculate
    two counters - time_in_queue and io_ticks.

    time_in_queue can be calculated without part_round_stats, by adding the
    duration of the I/O when the I/O ends (the value is almost as exact as the
    previously calculated value, except that time for in-progress I/Os is not
    counted).

    io_ticks can be approximated by increasing the value when I/O is started
    or ended and the jiffies value has changed. If the I/Os take less than a
    jiffy, the value is as exact as the previously calculated value. If the
    I/Os take more than a jiffy, io_ticks can drift behind the previously
    calculated value.
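
    The io_ticks approximation described above can be sketched as follows.
    This is a single-threaded user-space model, not the kernel code: the
    struct, its field names and sketch_update_io_ticks() are assumptions
    made for illustration, and a real implementation would need atomic
    updates.

```c
struct part_stats_sketch {
	unsigned long stamp;	/* tick value seen at the last update */
	unsigned long io_ticks;	/* approximate busy time, in ticks */
};

/* Call at I/O start and at I/O end with the current tick counter. */
void sketch_update_io_ticks(struct part_stats_sketch *s, unsigned long now)
{
	/* Count at most one tick per call, and only if time has advanced. */
	if (s->stamp != now) {
		s->stamp = now;
		s->io_ticks++;
	}
}
```

    An I/O that starts at tick 101 and ends at tick 105 adds only two ticks
    in this model, which is the drift the message mentions for I/Os that
    take longer than a jiffy.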

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mikulas Patocka
     
  • All of part_stat_* and related methods are used with preempt disabled,
    so there is no need to pass cpu around to all of them. Just call
    smp_processor_id() as needed.

    Suggested-by: Jens Axboe
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

29 Nov, 2018

1 commit

  • We recently got a stack by syzkaller like this:

    BUG: sleeping function called from invalid context at mm/slab.h:361
    in_atomic(): 1, irqs_disabled(): 0, pid: 6644, name: blkid
    INFO: lockdep is turned off.
    CPU: 1 PID: 6644 Comm: blkid Not tainted 4.4.163-514.55.6.9.x86_64+ #76
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
    0000000000000000 5ba6a6b879e50c00 ffff8801f6b07b10 ffffffff81cb2194
    0000000041b58ab3 ffffffff833c7745 ffffffff81cb2080 5ba6a6b879e50c00
    0000000000000000 0000000000000001 0000000000000004 0000000000000000
    Call Trace:
    __dump_stack lib/dump_stack.c:15 [inline]
    dump_stack+0x114/0x1a0 lib/dump_stack.c:51
    ___might_sleep+0x291/0x490 kernel/sched/core.c:7675
    __might_sleep+0xb3/0x270 kernel/sched/core.c:7637
    slab_pre_alloc_hook mm/slab.h:361 [inline]
    slab_alloc_node mm/slub.c:2610 [inline]
    slab_alloc mm/slub.c:2692 [inline]
    kmem_cache_alloc_trace+0x2c3/0x5c0 mm/slub.c:2709
    kmalloc include/linux/slab.h:479 [inline]
    kzalloc include/linux/slab.h:623 [inline]
    kobject_uevent_env+0x2c7/0x1150 lib/kobject_uevent.c:227
    kobject_uevent+0x1f/0x30 lib/kobject_uevent.c:374
    kobject_cleanup lib/kobject.c:633 [inline]
    kobject_release+0x229/0x440 lib/kobject.c:675
    kref_sub include/linux/kref.h:73 [inline]
    kref_put include/linux/kref.h:98 [inline]
    kobject_put+0x72/0xd0 lib/kobject.c:692
    put_device+0x25/0x30 drivers/base/core.c:1237
    delete_partition_rcu_cb+0x1d4/0x2f0 block/partition-generic.c:232
    __rcu_reclaim kernel/rcu/rcu.h:118 [inline]
    rcu_do_batch kernel/rcu/tree.c:2705 [inline]
    invoke_rcu_callbacks kernel/rcu/tree.c:2973 [inline]
    __rcu_process_callbacks kernel/rcu/tree.c:2940 [inline]
    rcu_process_callbacks+0x59c/0x1c70 kernel/rcu/tree.c:2957
    __do_softirq+0x299/0xe20 kernel/softirq.c:273
    invoke_softirq kernel/softirq.c:350 [inline]
    irq_exit+0x216/0x2c0 kernel/softirq.c:391
    exiting_irq arch/x86/include/asm/apic.h:652 [inline]
    smp_apic_timer_interrupt+0x8b/0xc0 arch/x86/kernel/apic/apic.c:926
    apic_timer_interrupt+0xa5/0xb0 arch/x86/entry/entry_64.S:746
    ? audit_kill_trees+0x180/0x180
    fd_install+0x57/0x80 fs/file.c:626
    do_sys_open+0x45e/0x550 fs/open.c:1043
    SYSC_open fs/open.c:1055 [inline]
    SyS_open+0x32/0x40 fs/open.c:1050
    entry_SYSCALL_64_fastpath+0x1e/0x9a

    In softirq context, we call the RCU callback function
    delete_partition_rcu_cb(), which may allocate memory by kzalloc with the
    GFP_KERNEL flag. If the allocation cannot be satisfied, it may sleep.
    However, that is not allowed in softirq context.

    Although we found this problem on linux 4.4, the latest kernel version
    seems to have this problem as well. And it is very similar to the
    previous one:
    https://lkml.org/lkml/2018/7/9/391

    Fix it by using RCU workqueue, which allows sleep.

    Reviewed-by: Paul E. McKenney
    Signed-off-by: Yufen Yu
    Signed-off-by: Jens Axboe

    Yufen Yu
     

22 Sep, 2018

1 commit

  • Klaus Kusche reported that the I/O busy time in /proc/diskstats was not
    updating properly on 4.18. This is because we started using ktime to
    track elapsed time, and we convert nanoseconds to jiffies when we update
    the partition counter. However, this gets rounded down, so any I/Os that
    take less than a jiffy are not accounted for. Previously in this case,
    the value of jiffies would sometimes increment while we were doing I/O,
    so at least some I/Os were accounted for.

    Let's convert the stats to use nanoseconds internally. We still report
    milliseconds as before, now more accurately than ever. The value is
    still truncated to 32 bits for backwards compatibility.
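
    The rounding problem and the fix can be illustrated with a small
    user-space sketch. The tick rate (SKETCH_HZ, 4 ms per jiffy) and all
    function names are assumptions chosen for this example, not kernel
    identifiers.

```c
#include <stdint.h>

#define NSEC_PER_MSEC	1000000ULL
#define SKETCH_HZ	250ULL	/* assumed tick rate: 4 ms per jiffy */
#define NSEC_PER_JIFFY	(1000000000ULL / SKETCH_HZ)

/* Old scheme: convert every I/O to jiffies first, rounding down. */
uint64_t total_jiffies_rounded(const uint64_t *dur_ns, int n)
{
	uint64_t total = 0;

	for (int i = 0; i < n; i++)
		total += dur_ns[i] / NSEC_PER_JIFFY;
	return total;
}

/* New scheme: accumulate nanoseconds, convert only when reporting. */
uint64_t total_nsecs(const uint64_t *dur_ns, int n)
{
	uint64_t total = 0;

	for (int i = 0; i < n; i++)
		total += dur_ns[i];
	return total;
}

/* Report milliseconds, truncated to 32 bits for compatibility. */
uint32_t report_msecs(uint64_t nsecs)
{
	return (uint32_t)(nsecs / NSEC_PER_MSEC);
}
```

    With four I/Os of 300 microseconds each, the per-I/O conversion
    accumulates nothing, while the nanosecond total is 1.2 ms and is
    reported as 1 ms.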

    Fixes: 522a777566f5 ("block: consolidate struct request timestamp fields")
    Cc: stable@vger.kernel.org
    Reported-by: Klaus Kusche
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

18 Jul, 2018

2 commits

  • Add tracking of REQ_OP_DISCARD ios to the partition statistics and
    append them to the various stat files in /sys as well as
    /proc/diskstats. These are tracked with the same four stats as reads
    and writes:

    Number of discard ios completed.
    Number of discard ios merged
    Number of discard sectors completed
    Milliseconds spent on discard requests

    This is done via adding a new STAT_DISCARD define to genhd.h and then
    using it to index that stat field for discard requests.
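
    Indexing the four counters by a stat group can be sketched like this.
    STAT_READ, STAT_WRITE and STAT_DISCARD are the names used in these
    commit messages; the struct layout, NR_STAT_GROUPS_SKETCH and the
    helper functions are assumptions for illustration only.

```c
enum stat_group_sketch {
	STAT_READ,
	STAT_WRITE,
	STAT_DISCARD,
	NR_STAT_GROUPS_SKETCH	/* number of groups, used as array size */
};

struct disk_stats_sketch {
	unsigned long ios[NR_STAT_GROUPS_SKETCH];	/* requests completed */
	unsigned long merges[NR_STAT_GROUPS_SKETCH];	/* requests merged */
	unsigned long sectors[NR_STAT_GROUPS_SKETCH];	/* sectors completed */
	unsigned long msecs[NR_STAT_GROUPS_SKETCH];	/* time spent, in ms */
};

/* Account one completed request in the given group. */
void account_completion(struct disk_stats_sketch *s,
			enum stat_group_sketch g,
			unsigned long sectors, unsigned long msecs)
{
	s->ios[g]++;
	s->sectors[g] += sectors;
	s->msecs[g] += msecs;
}

/* Account one merged request in the given group. */
void account_merge(struct disk_stats_sketch *s, enum stat_group_sketch g)
{
	s->merges[g]++;
}
```

    Adding a further request type in this model only needs a new
    enumerator before the size marker; the accounting helpers stay
    unchanged.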

    tj: Refreshed on top of v4.17 and other previous updates.

    Signed-off-by: Michael Callahan
    Signed-off-by: Tejun Heo
    Cc: Andy Newell
    Signed-off-by: Jens Axboe

    Michael Callahan
     
  • Add defines for STAT_READ and STAT_WRITE for indexing the partition
    stat entries. This clarifies some fs/ code which has hardcoded 1 for
    STAT_WRITE and will make it easier to extend the stats with additional
    fields.

    tj: Refreshed on top of v4.17.

    Signed-off-by: Michael Callahan
    Signed-off-by: Tejun Heo
    Cc: "Theodore Ts'o"
    Cc: Jaegeuk Kim
    Signed-off-by: Jens Axboe

    Michael Callahan
     

29 May, 2018

1 commit


25 May, 2018

1 commit

  • Convert the S_ symbolic permissions to their octal equivalents as
    using octal and not symbolic permissions is preferred by many as more
    readable.

    see: https://lkml.org/lkml/2016/8/2/1945

    Done with automated conversion via:
    $ ./scripts/checkpatch.pl -f --types=SYMBOLIC_PERMS --fix-inplace

    Miscellanea:

    o Wrapped modified multi-line calls to a single line where appropriate
    o Realign modified multi-line calls to open parenthesis
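
    The equivalence behind the symbolic-to-octal conversion can be checked
    in user space with the POSIX mode bits from <sys/stat.h>. The function
    names below are made up for this sketch; the kernel's own S_IRUGO-style
    shorthands are not available in user space, so the individual POSIX
    bits are used instead.

```c
#include <sys/stat.h>

/* Symbolic spelling of "readable by everyone". */
unsigned int perm_ro_symbolic(void)
{
	return S_IRUSR | S_IRGRP | S_IROTH;
}

/* Octal spelling of the same permissions. */
unsigned int perm_ro_octal(void)
{
	return 0444;
}

/* Symbolic spelling of "owner read/write, everyone else read". */
unsigned int perm_rw_symbolic(void)
{
	return S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH;
}

/* Octal spelling of the same permissions. */
unsigned int perm_rw_octal(void)
{
	return 0644;
}
```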

    Signed-off-by: Joe Perches
    Signed-off-by: Jens Axboe

    Joe Perches
     

26 Apr, 2018

1 commit

  • When the blk-mq inflight implementation was added, /proc/diskstats was
    converted to use it, but /sys/block/$dev/inflight was not. Fix it by
    adding another helper to count in-flight requests by data direction.

    Fixes: f299b7c7a9de ("blk-mq: provide internal in-flight variant")
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

01 Mar, 2018

1 commit

  • bio_devname uses __bdevname to display the device name, so it can
    only show the major and minor numbers of part0.
    Fix this by using disk_name to display the correct name.

    Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and partitions index")
    Reviewed-by: Omar Sandoval
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jiufei Xue
    Signed-off-by: Jens Axboe

    Jiufei Xue
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

25 Sep, 2017

1 commit

  • part_stat_show takes a part device not a disk, so we should use
    part_to_disk.

    Fixes: d62e26b3ffd2 ("block: pass in queue to inflight accounting")
    Cc: Bart Van Assche
    Cc: Omar Sandoval
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     

24 Aug, 2017

1 commit


18 Aug, 2017

1 commit

  • Annotate gendisk.part_tbl and disk_part_tbl.part dereferences with
    rcu_dereference_protected(). This patch does not change the behavior
    of the modified code but ensures that sparse does not complain about
    disk->part_tbl manipulations nor about part_tbl->part accesses.
    Additionally, improve documentation of the locking requirements of
    the modified functions.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Cc: Tejun Heo
    Cc: Jan Kara
    Cc: Dan Williams
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

10 Aug, 2017

2 commits

  • Instead of returning the count that matches the partition, pass
    in an array of two ints. Index 0 will be filled with the inflight
    count for the partition in question, and index 1 will be filled
    with the root inflight count, if the partition passed in is not the
    root.
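
    The calling convention described above can be sketched as follows. The
    struct and function names are hypothetical, and the counters are plain
    integers rather than the kernel's per-partition accounting.

```c
struct part_sketch {
	int partno;			/* 0 denotes the whole disk (root) */
	unsigned int in_flight;		/* requests currently in flight */
	struct part_sketch *root;	/* whole-disk entry, unused for root */
};

/*
 * Fill inflight[0] with the count for the given partition and
 * inflight[1] with the root count, unless the partition is the root.
 */
void part_in_flight_sketch(const struct part_sketch *part,
			   unsigned int inflight[2])
{
	inflight[0] = part->in_flight;
	inflight[1] = 0;
	if (part->partno)
		inflight[1] = part->root->in_flight;
}
```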

    This is in preparation for being able to calculate both in one
    go.

    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • No functional change in this patch, just in preparation for
    basing the inflight mechanism on the queue in question.

    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     

23 May, 2017

1 commit

  • We don't set an error code on this path. It means that we return NULL
    instead of an error pointer and the caller does a NULL dereference.
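
    The difference between returning NULL and returning an error pointer
    can be shown with a user-space imitation of the kernel's error-pointer
    convention. All names here (the sketch_* helpers, SKETCH_MAX_ERRNO and
    add_widget_sketch()) are assumptions for illustration; they mimic the
    idea and are not the kernel macros themselves.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095	/* assumed size of the error range */

/* Encode a negative errno value in a pointer. */
void *sketch_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* Decode the errno value from an error pointer. */
long sketch_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* True only for pointers in the top SKETCH_MAX_ERRNO addresses. */
int sketch_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

static int widget;	/* dummy object returned on success */

/* Hypothetical constructor: every failure path must set an error. */
void *add_widget_sketch(int fail)
{
	if (fail)
		return sketch_err_ptr(-ENOMEM);
	return &widget;
}
```

    Note that sketch_is_err(NULL) is false. A caller that checks only for
    error pointers therefore treats a NULL return as success and goes on
    to dereference it, which is the failure mode described in the message.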

    Fixes: 6d1d8050b4bc ("block, partition: add partition_meta_info to hd_struct")
    Signed-off-by: Dan Carpenter
    Signed-off-by: Jens Axboe

    Dan Carpenter
     

06 May, 2017

1 commit

  • Pull libnvdimm updates from Dan Williams:
    "The bulk of this has been in multiple -next releases. There were a few
    late breaking fixes and small features that got added in the last
    couple days, but the whole set has received a build success
    notification from the kbuild robot.

    Change summary:

    - Region media error reporting: A libnvdimm region device is the
    parent to one or more namespaces. To date, media errors have been
    reported via the "badblocks" attribute attached to pmem block
    devices for namespaces in "raw" or "memory" mode. Given that
    namespaces can be in "device-dax" or "btt-sector" mode this new
    interface reports media errors generically, i.e. independent of
    namespace modes or state.

    This subsequently allows userspace tooling to craft "ACPI 6.1
    Section 9.20.7.6 Function Index 4 - Clear Uncorrectable Error"
    requests and submit them via the ioctl path for NVDIMM root bus
    devices.

    - Introduce 'struct dax_device' and 'struct dax_operations': Prompted
    by a request from Linus and feedback from Christoph this allows for
    dax capable drivers to publish their own custom dax operations.
    This fixes the broken assumption that all dax operations are
    related to a persistent memory device, and makes it easier for
    other architectures and platforms to add customized persistent
    memory support.

    - 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is
    available for storage appliance applications to manually trigger
    memory controllers to drain write-pending buffers that would
    otherwise be flushed automatically by the platform ADR
    (asynchronous-DRAM-refresh) mechanism at a power loss event.
    Support for "locked" DIMMs is included to prevent namespaces from
    surfacing when the namespace label data area is locked. Finally,
    fixes for various reported deadlocks and crashes, also tagged for
    -stable.

    - ACPI / nfit driver updates: General updates of the nfit driver to
    add DSM command overrides, ACPI 6.1 health state flags support, DSM
    payload debug available by default, and various fixes.

    Acknowledgements that came after the branch was pushed:

    - commit 565851c972b5 "device-dax: fix sysfs attribute deadlock":
    Tested-by: Yi Zhang

    - commit 23f498448362 "libnvdimm: rework region badblocks clearing"
    Tested-by: Toshi Kani "

    * tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (52 commits)
    libnvdimm, pfn: fix 'npfns' vs section alignment
    libnvdimm: handle locked label storage areas
    libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED
    brd: fix uninitialized use of brd->dax_dev
    block, dax: use correct format string in bdev_dax_supported
    device-dax: fix sysfs attribute deadlock
    libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking"
    libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering
    libnvdimm: rework region badblocks clearing
    acpi, nfit: kill ACPI_NFIT_DEBUG
    libnvdimm: fix clear length of nvdimm_forget_poison()
    libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify
    libnvdimm, region: sysfs trigger for nvdimm_flush()
    libnvdimm: fix phys_addr for nvdimm_clear_poison
    x86, dax, pmem: remove indirection around memcpy_from_pmem()
    block: remove block_device_operations ->direct_access()
    block, dax: convert bdev_dax_supported() to dax_direct_access()
    filesystem-dax: convert to dax_direct_access()
    Revert "block: use DAX for partition table reads"
    ext2, ext4, xfs: retrieve dax_device for iomap operations
    ...

    Linus Torvalds
     

26 Apr, 2017

1 commit

  • commit d1a5f2b4d8a1 ("block: use DAX for partition table reads") was
    part of a stalled effort to allow dax mappings of block devices. Since
    then the device-dax mechanism has filled the role of dax-mapping static
    device ranges.

    Now that we are moving ->direct_access() from a block_device operation
    to a dax_inode operation we would need block devices to map and carry
    their own dax_inode reference.

    Unless / until we decide to revive dax mapping of raw block devices
    through the dax_inode scheme, there is no need to carry
    read_dax_sector(). Its removal in turn allows for the removal of
    bdev_direct_access() and should have been included in commit
    223757016837 ("block_dev: remove DAX leftovers").

    Cc: Jeff Moyer
    Signed-off-by: Dan Williams

    Dan Williams
     

22 Apr, 2017

1 commit

  • Commit 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk")
    introduced blk_integrity_revalidate(), which seems to assume ownership
    of the stable pages flag and unilaterally clears it if no blk_integrity
    profile is registered:

        if (bi->profile)
                disk->queue->backing_dev_info->capabilities |=
                        BDI_CAP_STABLE_WRITES;
        else
                disk->queue->backing_dev_info->capabilities &=
                        ~BDI_CAP_STABLE_WRITES;

    It's called from revalidate_disk() and rescan_partitions(), making it
    impossible to enable stable pages for drivers that support partitions
    and don't use blk_integrity: while the call in revalidate_disk() can be
    trivially worked around (see zram, which doesn't support partitions and
    hence gets away with zram_revalidate_disk()), rescan_partitions() can
    be triggered from userspace at any time. This breaks rbd, where the
    ceph messenger is responsible for generating/verifying CRCs.

    Since blk_integrity_{un,}register() "must" be used for (un)registering
    the integrity profile with the block layer, move BDI_CAP_STABLE_WRITES
    setting there. This way drivers that call blk_integrity_register() and
    use integrity infrastructure won't interfere with drivers that don't
    but still want stable pages.

    Fixes: 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk")
    Cc: "Martin K. Petersen"
    Cc: Christoph Hellwig
    Cc: Mike Snitzer
    Cc: stable@vger.kernel.org # 4.4+, needs backporting
    Tested-by: Dan Williams
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe

    Ilya Dryomov
     

12 Jan, 2017

1 commit

  • All block device data fields and functions returning a number of 512B
    sectors are by convention named xxx_sectors while names in the form
    xxx_size are generally used for a number of bytes. The blk_queue_zone_size
    and bdev_zone_size functions were not following this convention so rename
    them.

    No functional change is introduced by this patch.

    Signed-off-by: Damien Le Moal

    Collapsed the two patches, they were nonsensically split and broke
    bisection.

    Signed-off-by: Jens Axboe

    Damien Le Moal
     

01 Dec, 2016

1 commit

  • Both blkdev_report_zones and blkdev_reset_zones can operate on a partition of
    a zoned block device. However, the first and last zones reported for a
    partition make sense only if the partition start sector and size are aligned
    on the device zone size. The same applies for zone reset. Resetting the first
    or the last zone of a partition straddling zones may impact neighboring
    partitions. Finally, if a partition start sector is not at the beginning of a
    sequential zone, it will be impossible to write to the first sectors of the
    partition on a host-managed device.
    Avoid all these problems and incoherencies by ignoring partitions that are not
    zone aligned.

    Note: Even with CONFIG_BLK_DEV_ZONED disabled, bdev_is_zoned() will report the
    correct disk zoning type (host-aware, host-managed or none) but
    bdev_zone_size() will always return 0 for zoned block devices (i.e. the zone
    size is unknown). So test this as a way to ensure that a zoned block device is
    being handled as such. As a result, for host-aware devices, unaligned zone
    partitions will be accepted with CONFIG_BLK_DEV_ZONED disabled. That is, the
    disk will be treated as a regular block device (as it should). If zoned block
    device support is enabled, only aligned partitions will be accepted.
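
    A minimal sketch of the alignment test, assuming all values are in
    512-byte sectors. The function name and signature are hypothetical,
    and the sketch ignores details such as a smaller last zone on the
    device. A zone size of 0 models the "zone size unknown" case from the
    note above, in which the partition is accepted.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if a partition starting at 'start' and spanning
 * 'nr_sectors' begins and ends on a zone boundary.
 */
bool part_zone_aligned_sketch(uint64_t start, uint64_t nr_sectors,
			      uint64_t zone_sectors)
{
	/* Zone size unknown: treat the disk as a regular block device. */
	if (zone_sectors == 0)
		return true;

	return start % zone_sectors == 0 &&
	       (start + nr_sectors) % zone_sectors == 0;
}
```

    For example, with 524288-sector (256 MiB) zones, a partition starting
    at sector 2048 is rejected, while one starting at sector 0 and
    covering exactly one zone is accepted.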

    Signed-off-by: Damien Le Moal
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Damien Le Moal
     

14 Jun, 2016

1 commit


16 Apr, 2016

1 commit

  • Pull block fixes from Jens Axboe:
    "A few fixes for the current series. This contains:

    - Two fixes for NVMe:

    One fixes a reset race that can be triggered by repeated
    insert/removal of the module.

    The other fixes an issue on some platforms, where we get probe
    timeouts since legacy interrupts aren't working. This used not to
    be a problem since we had the worker thread poll for completions,
    but since that was killed off, it means those poor souls can't
    successfully probe their NVMe device. Use a proper IRQ check and
    probe (msi-x -> msi -> legacy), like most other drivers to work
    around this. Both from Keith.

    - A loop corruption issue with offset in iters, from Ming Lei.

    - A fix for not having the partition stat per cpu ref count
    initialized before sending out the KOBJ_ADD, which could cause user
    space to access the counter prior to initialization. Also from
    Ming Lei.

    - A fix for using the wrong congestion state, from Kaixu Xia"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    block: loop: fix filesystem corruption in case of aio/dio
    NVMe: Always use MSI/MSI-x interrupts
    NVMe: Fix reset/remove race
    writeback: fix the wrong congested state variable definition
    block: partition: initialize percpuref before sending out KOBJ_ADD

    Linus Torvalds
     

05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the
    page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely that it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion whether
    PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
    especially on the border between fs and mm.

    Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

30 Mar, 2016

1 commit

  • The initialization of partition's percpu_ref should have been done before
    sending out KOBJ_ADD uevent, which may cause userspace to read partition
    table. So the uninitialized percpu_ref may be accessed in data path.

    This patch fixes this issue reported by Naveen.

    Reported-by: Naveen Kaje
    Tested-by: Naveen Kaje
    Fixes: 6c71013ecb7e2 ("block: partition: convert percpu ref")
    Cc: # v4.3+
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

16 Mar, 2016

1 commit

  • This patch has been carried in the Android tree for quite some time and
    is one of the few patches required to get a mainline kernel up and
    running with an existing Android userspace. So I wanted to submit it
    for review and consideration if it should be merged.

    For partitions, add new uevent parameters 'PARTN', which specifies the
    partition's index in the table, and 'PARTNAME', which specifies the
    partition name of a partition device.

    Android's userspace uses this for creating device node links from the
    partition name and number, ie:

    /dev/block/platform/soc/by-name/system
    or
    /dev/block/platform/soc/by-num/p1

    One can see its usage here:
    https://android.googlesource.com/platform/system/core/+/master/init/devices.cpp#355
    and
    https://android.googlesource.com/platform/system/core/+/master/init/devices.cpp#494

    [john.stultz@linaro.org: dropped NPARTS and reworded commit message for context]
    Signed-off-by: Dima Zavin
    Signed-off-by: John Stultz
    Cc: Jens Axboe
    Cc: Rom Lemarchand
    Cc: Android Kernel Team
    Cc: Jeff Moyer
    Cc:
    Cc: Kees Cook
    Cc: Kay Sievers
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    San Mehat
     

31 Jan, 2016

1 commit

  • Avoid populating pagecache when the block device is in DAX mode.
    Otherwise these page cache entries collide with the fsync/msync
    implementation and break data durability guarantees.

    Cc: Jan Kara
    Cc: Jeff Moyer
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Andrew Morton
    Reported-by: Ross Zwisler
    Tested-by: Ross Zwisler
    Reviewed-by: Matthew Wilcox
    Signed-off-by: Dan Williams

    Dan Williams
     

26 Nov, 2015

1 commit

  • Today, blockdev --rereadpt /dev/sda will fail with EBUSY if any
    partition of sda is mounted (and will fail with EINVAL if pointed
    at a partition). But it will pass if the entire block device is
    formatted with a filesystem and mounted. I don't think this makes
    sense; partitioning should surely not ever change out from under
    a mounted device.

    So check for bdev->bd_super, and fail that with -EBUSY as well.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Eric Sandeen
     

22 Oct, 2015

1 commit

  • Up until now the integrity profile has been dynamically allocated and
    attached to struct gendisk after the disk has been made active.

    This causes problems because NVMe devices need to register the profile
    prior to the partition table being read due to a mandatory metadata
    buffer requirement. In addition, DM goes through hoops to deal with
    preallocating, but not initializing integrity profiles.

    Since the integrity profile is small (4 bytes + a pointer), Christoph
    suggested moving it to struct gendisk proper. This requires several
    changes:

    - Moving the blk_integrity definition to genhd.h.

    - Inlining blk_integrity in struct gendisk.

    - Removing the dynamic allocation code.

    - Adding helper functions which allow gendisk to set up and tear down
    the integrity sysfs dir when a disk is added/deleted.

    - Adding a blk_integrity_revalidate() callback for updating the stable
    pages bdi setting.

    - The calls that depend on whether a device has an integrity profile or
    not now key off of the bi->profile pointer.

    - Simplifying the integrity support routines in DM (Mike Snitzer).

    Signed-off-by: Martin K. Petersen
    Reported-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Mike Snitzer
    Cc: Dan Williams
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

17 Jul, 2015

2 commits


04 Sep, 2014

1 commit

  • Releases the dev_t minor when all references are closed to prevent
    another device from acquiring the same major/minor.

    Since the partition's release may be invoked from call_rcu's soft-irq
    context, the ext_dev_idr's mutex had to be replaced with a spinlock so
    as not to sleep.

    Signed-off-by: Keith Busch
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Keith Busch
     

08 Apr, 2013

1 commit

  • This reverts commit 8761a3dc1f07b163414e2215a2cadbb4cfe2a107.

    There are situations where the destruction path is called
    with the bdev->bd_mutex already held, which then deadlocks in
    loop_clr_fd(). The normal partition cleanup does a trylock()
    on the mutex, but it'd be nice to have a more bullet proof
    method in loop. So punt this more involved fix to the next
    merge window, and just back out this buggy fix for now.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

23 Mar, 2013

1 commit

  • Any partitions added by user space to the loop device were being
    left in place after detaching the loop device. This was because
    the detach path issued a BLKRRPART to clean up partitions if
    LO_FLAGS_PARTSCAN was set, meaning that the partitions were auto
    scanned on attach. Replace this BLKRRPART with code that
    unconditionally cleans up partitions on detach instead.

    Signed-off-by: Phillip Susi

    Modified by Jens to export delete_partition().

    Signed-off-by: Jens Axboe

    Phillip Susi
     

28 Feb, 2013

2 commits

  • Currently, sizeof(struct parsed_partitions) may be 64KB on 32-bit
    architectures, so check_partition can easily trigger a page allocation
    failure, especially when block devices are hotplugged (USB mass storage,
    MMC cards, ...). Felipe Balbi has observed this failure.

    This patch makes the following optimizations to the allocation of struct
    parsed_partitions to try to address the issue:

    - make parsed_partitions.parts a pointer so that the memory it points to
    fits in a 32KB buffer, which saves approximately 32KB

    - vmalloc the buffer pointed to by parsed_partitions.parts, because 32KB
    is still a bit big for kmalloc

    - since many devices have a partition count limit, allocate only
    disk_max_parts() partitions instead of always allocating 256
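
    The allocation strategy in the list above can be sketched in userspace C.
    This is only an illustration, not the kernel code: the struct layout and
    the names part_entry_sketch, parsed_parts_sketch, alloc_parsed_parts()
    and free_parsed_parts() are hypothetical, and calloc() stands in for the
    kernel's allocators.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-partition record; the real kernel entry has more fields. */
struct part_entry_sketch {
    unsigned long long from;   /* first sector */
    unsigned long long size;   /* length in sectors */
    int flags;
};

/* The table is a pointer, not an embedded 256-entry array, so the
 * header itself stays small. */
struct parsed_parts_sketch {
    int limit;                         /* number of entries actually allocated */
    struct part_entry_sketch *parts;   /* separately allocated table */
};

/* Allocate a header plus a table sized to the device's partition limit
 * (max_parts plays the role of disk_max_parts() here). */
struct parsed_parts_sketch *alloc_parsed_parts(int max_parts)
{
    struct parsed_parts_sketch *state;

    assert(max_parts > 0);

    state = calloc(1, sizeof(*state));
    if (!state)
        return NULL;

    /* Stand-in for the kernel's vmalloc-style zeroed allocation. */
    state->parts = calloc((size_t)max_parts, sizeof(*state->parts));
    if (!state->parts) {
        free(state);
        return NULL;
    }
    state->limit = max_parts;
    return state;
}

void free_parsed_parts(struct parsed_parts_sketch *state)
{
    if (!state)
        return;
    free(state->parts);
    free(state);
}
```

    A caller would size the table from the device limit, for example
    alloc_parsed_parts(16), and must check the result for NULL before use.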

    Signed-off-by: Ming Lei
    Reported-by: Felipe Balbi
    Cc: Jens Axboe
    Reviewed-by: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ming Lei
     
  • While adding and removing a lot of disks and partitions, this warning
    sometimes shows up:

    WARNING: at fs/sysfs/dir.c:512 sysfs_add_one+0xc9/0x130() (Not tainted)
    Hardware name:
    sysfs: cannot create duplicate filename '/dev/block/259:751'
    Modules linked in: raid1 autofs4 bnx2fc cnic uio fcoe libfcoe libfc 8021q scsi_transport_fc scsi_tgt garp stp llc sunrpc cpufreq_ondemand powernow_k8 freq_table mperf ipv6 dm_mirror dm_region_hash dm_log power_meter microcode dcdbas serio_raw amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core k10temp bnx2 sg ixgbe dca mdio ext4 mbcache jbd2 dm_round_robin sr_mod cdrom sd_mod crc_t10dif ata_generic pata_acpi pata_atiixp ahci mptsas mptscsih mptbase scsi_transport_sas dm_multipath dm_mod [last unloaded: scsi_wait_scan]
    Pid: 44103, comm: async/16 Not tainted 2.6.32-195.el6.x86_64 #1
    Call Trace:
    warn_slowpath_common+0x87/0xc0
    warn_slowpath_fmt+0x46/0x50
    sysfs_add_one+0xc9/0x130
    sysfs_do_create_link+0x12b/0x170
    sysfs_create_link+0x13/0x20
    device_add+0x317/0x650
    idr_get_new+0x13/0x50
    add_partition+0x21c/0x390
    rescan_partitions+0x32b/0x470
    sd_open+0x81/0x1f0 [sd_mod]
    __blkdev_get+0x1b6/0x3c0
    blkdev_get+0x10/0x20
    register_disk+0x155/0x170
    add_disk+0xa6/0x160
    sd_probe_async+0x13b/0x210 [sd_mod]
    add_wait_queue+0x46/0x60
    async_thread+0x102/0x250
    default_wake_function+0x0/0x20
    async_thread+0x0/0x250
    kthread+0x96/0xa0
    child_rip+0xa/0x20
    kthread+0x0/0xa0
    child_rip+0x0/0x20

    This most likely happens because dev_t is freed while the number is
    still used and idr_get_new() is not protected on every use. The fix
    adds a mutex where it wasn't before and moves the dev_t free function so
    it is called after device del.

    Signed-off-by: Tomas Henzl
    Cc: Jens Axboe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tomas Henzl
     

01 Aug, 2012

1 commit

  • Add a new operation code (BLKPG_RESIZE_PARTITION) to the BLKPG ioctl that
    allows altering the size of an existing partition, even if it is currently
    in use.

    This patch protects hd_struct->nr_sects with a sequence counter because
    one might extend a partition while IO is happening to it, and the update
    of nr_sects can be non-atomic on 32-bit machines with a 64-bit sector_t.
    This can lead to issues such as reading an inconsistent partition size.
    A sequence counter is used so that readers don't have to take the bdev
    mutex, as sector_in_part() is called very frequently.

    Now all access to hd_struct->nr_sects should happen through the sequence
    counter read/update helper functions part_nr_sects_read/part_nr_sects_write.
    There is one exception though, set_capacity()/get_capacity(). I think
    theoretically a race should exist there too, but this patch does not
    modify set_capacity()/get_capacity() due to the sheer number of call sites,
    and I am afraid that the change might break something. I have left that as
    a TODO item. We can handle it later if need be. This patch does not
    introduce any new races as such w.r.t. set_capacity()/get_capacity().
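
    The sequence counter idea can be sketched in userspace C11. This is not
    the kernel's seqcount implementation: the type part_size_sketch and the
    functions sketch_nr_sects_init/read/write are hypothetical names, the
    64-bit size is split into two 32-bit halves to model a 32-bit machine,
    and writers are assumed to be serialised elsewhere.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical model: a 64-bit sector count stored as two 32-bit halves,
 * guarded by a sequence counter (even = stable, odd = write in progress). */
struct part_size_sketch {
    atomic_uint seq;
    _Atomic uint32_t lo;
    _Atomic uint32_t hi;
};

void sketch_nr_sects_init(struct part_size_sketch *p, uint64_t nr_sects)
{
    atomic_init(&p->seq, 0);
    atomic_init(&p->lo, (uint32_t)nr_sects);
    atomic_init(&p->hi, (uint32_t)(nr_sects >> 32));
}

/* Writer: bump the counter to odd, update both halves, bump it to even. */
void sketch_nr_sects_write(struct part_size_sketch *p, uint64_t nr_sects)
{
    /* Assumption: only one writer at a time, so no write is in progress. */
    assert((atomic_load(&p->seq) & 1u) == 0);

    atomic_fetch_add(&p->seq, 1);
    atomic_store(&p->lo, (uint32_t)nr_sects);
    atomic_store(&p->hi, (uint32_t)(nr_sects >> 32));
    atomic_fetch_add(&p->seq, 1);
}

/* Reader: takes no lock; retries if a write was in progress or completed
 * while the two halves were being read. */
uint64_t sketch_nr_sects_read(struct part_size_sketch *p)
{
    unsigned int before, after;
    uint32_t lo, hi;

    do {
        before = atomic_load(&p->seq);
        lo = atomic_load(&p->lo);
        hi = atomic_load(&p->hi);
        after = atomic_load(&p->seq);
    } while ((before & 1u) || before != after);

    return ((uint64_t)hi << 32) | lo;
}
```

    The sketch uses the default sequentially consistent atomics for
    simplicity; a production implementation would typically use weaker
    orderings with explicit barriers.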

    v2: Add CONFIG_LBDAF test to UP preempt case as suggested by Phillip.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Phillip Susi
    Signed-off-by: Jens Axboe

    Vivek Goyal