25 Sep, 2020

1 commit


24 Sep, 2020

1 commit


08 Sep, 2020

1 commit

  • Discarding blocks and buffers under a mounted filesystem is hardly
    anything an admin wants to do. Usually it will confuse the filesystem,
    and sometimes the loss of buffer_head state (including the b_private
    field) can even cause crashes like:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    PGD 0 P4D 0
    Oops: 0002 [#1] SMP PTI
    CPU: 4 PID: 203778 Comm: jbd2/dm-3-8 Kdump: loaded Tainted: G O --------- - - 4.18.0-147.5.0.5.h126.eulerosv2r9.x86_64 #1
    Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 1.57 08/11/2015
    RIP: 0010:jbd2_journal_grab_journal_head+0x1b/0x40 [jbd2]
    ...
    Call Trace:
    __jbd2_journal_insert_checkpoint+0x23/0x70 [jbd2]
    jbd2_journal_commit_transaction+0x155f/0x1b60 [jbd2]
    kjournald2+0xbd/0x270 [jbd2]

    So if we don't have the block device open with O_EXCL already, claim
    the block device while we truncate the buffer cache. This makes sure
    any exclusive block device user (such as a filesystem) cannot operate
    on the device while we are discarding the buffer cache.

    Reported-by: Ye Bin
    Signed-off-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    [axboe: fix !CONFIG_BLOCK error in truncate_bdev_range()]
    Signed-off-by: Jens Axboe

    Jan Kara
     

19 May, 2020

1 commit

  • This patch fixes the following sparse warnings:

    block/ioctl.c:209:16: warning: incorrect type in argument 1 (different address spaces)
    block/ioctl.c:209:16: expected void const volatile [noderef] *
    block/ioctl.c:209:16: got signed int [usertype] *argp
    block/ioctl.c:214:16: warning: incorrect type in argument 1 (different address spaces)
    block/ioctl.c:214:16: expected void const volatile [noderef] *
    block/ioctl.c:214:16: got unsigned int [usertype] *argp
    block/ioctl.c:666:40: warning: incorrect type in argument 1 (different address spaces)
    block/ioctl.c:666:40: expected signed int [usertype] *argp
    block/ioctl.c:666:40: got void [noderef] *argp
    block/ioctl.c:672:41: warning: incorrect type in argument 1 (different address spaces)
    block/ioctl.c:672:41: expected unsigned int [usertype] *argp
    block/ioctl.c:672:41: got void [noderef] *argp

    Fixes: 9b81648cb5e3 ("compat_ioctl: simplify up block/ioctl.c")
    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Acked-by: Arnd Bergmann
    Cc: Arnd Bergmann
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

21 Apr, 2020

1 commit

  • Split each sub-command out into a separate helper, and move those helpers
    to block/partitions/core.c instead of having a lot of partition
    manipulation logic open coded in block/ioctl.c.

    Signed-off-by: Christoph Hellwig

    Christoph Hellwig
     

25 Mar, 2020

1 commit


03 Jan, 2020

4 commits

  • Having separate implementations of blkdev_ioctl() often leads to these
    getting out of sync, despite the comment at the top.

    Since most of the ioctl commands are compatible, and we try very hard
    not to add any new incompatible ones, move all the common bits into a
    shared function and leave only the ones that are historically different
    in separate functions for native/compat mode.

    To deal with the compat_ptr() conversion, pass both the integer
    argument and the pointer argument into the new blkdev_common_ioctl()
    and make sure to always use the correct one of these.

    blkdev_ioctl() is now only kept as a separate exported interface
    for drivers/char/raw.c, which lacks a compat_ioctl variant.
    We should probably either move raw.c to staging if there are no
    more users, or export blkdev_compat_ioctl() as well.

    Reviewed-by: Ben Hutchings
    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     
  • There is no need to go through a compat_alloc_user_space()
    copy any more, just wrap the function in a small helper that
    works the same way for native and compat mode.

    Reviewed-by: Ben Hutchings
    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     
  • Having both in the same file allows a number of simplifications
    to the compat path, and makes it more likely that changes to
    the native path get applied to the compat version as well.

    Reviewed-by: Ben Hutchings
    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     
  • A lot of block drivers need only a trivial .compat_ioctl callback.

    Add a helper function that can be set as the callback pointer
    to only convert the argument using the compat_ptr() conversion
    and otherwise assume all input and output data is compatible,
    or handled using in_compat_syscall() checks.

    This mirrors the compat_ptr_ioctl() helper function used in
    character devices.

    Reviewed-by: Ben Hutchings
    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
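
The shared-handler arrangement described in the first commit of this group (one common function that receives the argument both as an integer and as a pointer) can be modelled in user space roughly as follows. This is only a sketch: every `model_*` function and the `MODEL_*` command numbers are hypothetical stand-ins, not the kernel's functions or real ioctl values, and `model_compat_ptr()` only imitates what compat_ptr() does on most architectures.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical command numbers for this model only (not real ioctl values). */
enum { MODEL_SET_VALUE = 1, MODEL_GET_VALUE = 2 };

static int model_value; /* stands in for per-device state */

/* Imitates compat_ptr(): widen a 32-bit user pointer to a native pointer. */
static void *model_compat_ptr(uint32_t uptr)
{
	return (void *)(uintptr_t)uptr;
}

/*
 * Shared handler: it gets the argument in both forms, and each command
 * uses only the form that is correct for it.
 */
static int model_common_ioctl(unsigned int cmd, unsigned long arg, void *argp)
{
	switch (cmd) {
	case MODEL_SET_VALUE:
		model_value = (int)arg;        /* integer argument */
		return 0;
	case MODEL_GET_VALUE:
		if (!argp)
			return -EFAULT;
		*(int *)argp = model_value;    /* pointer argument */
		return 0;
	default:
		return -ENOTTY;
	}
}

/* Native entry point: the integer already has pointer width. */
static int model_native_ioctl(unsigned int cmd, unsigned long arg)
{
	return model_common_ioctl(cmd, arg, (void *)(uintptr_t)arg);
}

/* Compat entry point: convert the 32-bit value before sharing the handler. */
static int model_compat_ioctl(unsigned int cmd, uint32_t arg)
{
	return model_common_ioctl(cmd, arg, model_compat_ptr(arg));
}
```

Only the two thin entry points differ between the modes, which is the property the commit message is after: a change to the common handler reaches both paths at once.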
     

03 Dec, 2019

1 commit

  • Simplify the arguments to blkdev_nr_zones by passing a gendisk instead
    of the block_device and capacity. This also removes the need for
    __blkdev_nr_zones as all callers are outside the fast path and can
    deal with the additional branch.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

14 Nov, 2019

3 commits

  • In general drivers should never mess with partition tables directly.
    Unfortunately s390 and loop do for somewhat historic reasons, but they
    can use bdev_disk_changed directly instead when we export it as they
    satisfy the sanity checks we have in __blkdev_reread_part.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Stefan Haberland [dasd]
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Even if the disk has no partitions, we still have to set the capacity
    to 0 when invalidating, or call revalidate_disk when not. Fix that by
    merging rescan_partitions into bdev_disk_changed and just stubbing out
    blk_add_partitions and blk_drop_partitions for non-partitioned
    devices.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • A lot of the logic in invalidate_partitions and rescan_partitions is
    shared. Merge the two functions to simplify things. There is a small
    behavior change in that we now send the kevent change notice also if we
    were not invalidating but no partitions were found, which seems like
    the right thing to do.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

07 Nov, 2019

1 commit

  • Introduce three new ioctl commands BLKOPENZONE, BLKCLOSEZONE and
    BLKFINISHZONE to allow applications to control the condition of zones
    on a zoned block device through the execution of the REQ_OP_ZONE_OPEN,
    REQ_OP_ZONE_CLOSE and REQ_OP_ZONE_FINISH operations.

    Contains contributions from Matias Bjorling, Hans Holmberg,
    Dmitry Fomichev, Keith Busch, Damien Le Moal and Christoph Hellwig.

    Reviewed-by: Javier González
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ajay Joshi
    Signed-off-by: Matias Bjorling
    Signed-off-by: Hans Holmberg
    Signed-off-by: Dmitry Fomichev
    Signed-off-by: Keith Busch
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Ajay Joshi
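
The one-to-one relation between the three new ioctl commands and the zone operations they execute can be pictured as a simple translation step. The sketch below uses stand-in enumerators (`MODEL_*`) rather than the real BLK*ZONE and REQ_OP_ZONE_* values, and the error code for an unknown command is an assumption of this model.

```c
#include <errno.h>

/* Stand-ins for the ioctl commands; the real values live in kernel headers. */
enum model_zone_cmd {
	MODEL_BLKOPENZONE = 1,
	MODEL_BLKCLOSEZONE,
	MODEL_BLKFINISHZONE,
};

/* Stand-ins for the request operations the commands map to. */
enum model_zone_op {
	MODEL_ZONE_OPEN = 1,
	MODEL_ZONE_CLOSE,
	MODEL_ZONE_FINISH,
};

/* Translate a zone-management command into the operation it triggers. */
static int model_zone_cmd_to_op(int cmd)
{
	switch (cmd) {
	case MODEL_BLKOPENZONE:
		return MODEL_ZONE_OPEN;
	case MODEL_BLKCLOSEZONE:
		return MODEL_ZONE_CLOSE;
	case MODEL_BLKFINISHZONE:
		return MODEL_ZONE_FINISH;
	default:
		return -ENOTTY; /* assumed error for unknown commands */
	}
}
```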
     

01 May, 2019

1 commit


26 Oct, 2018

2 commits

  • Get the total number of zones of a zoned block device. The device can
    be a partition of the whole device. The number of zones is always 0
    for regular block devices.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • Get the zone size of a zoned block device, in number of 512 B sectors.
    The zone size is always 0 for regular block devices.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
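
How the two values in this group relate can be illustrated with a little arithmetic. The helper below is a hypothetical user-space sketch, not the kernel's implementation; it assumes that a smaller last zone still counts as a zone (hence the rounding up) and uses a zone size of 0 to represent a regular block device.

```c
#include <stdint.h>

/*
 * Count the zones of a device from its capacity and zone size, both in
 * 512-byte sectors. A zone size of 0 models a regular (non-zoned)
 * device, for which the result is 0.
 */
static uint64_t model_nr_zones(uint64_t capacity_sectors,
			       uint64_t zone_sectors)
{
	if (zone_sectors == 0)
		return 0;
	/* Round up so that a partial last zone is counted. */
	return capacity_sectors / zone_sectors +
	       (capacity_sectors % zone_sectors != 0);
}
```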
     

24 Feb, 2018

1 commit


26 Oct, 2017

1 commit

  • Check for CAP_SYS_ADMIN before calling into the driver, similar to
    blkdev_flushbuf(). This is safer and can spare a check in the driver.

    (Currently BLKROSET is overridden by md and rbd; rbd is missing the
    check. md has the check, but it covers a lot more than BLKROSET.)

    Acked-by: Al Viro
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe

    Ilya Dryomov
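
The ordering this commit describes (privilege check first, driver second) can be sketched as below. All names are hypothetical: `model_driver_set_ro()` stands in for a driver's own BLKROSET handling, the boolean parameter replaces the real capability check, and the error code is an assumption of this model.

```c
#include <errno.h>
#include <stdbool.h>

/* Counts how often the (hypothetical) driver hook was reached. */
static int model_driver_calls;

/* Stand-in for a driver-specific read-only handler. */
static int model_driver_set_ro(int read_only)
{
	(void)read_only;
	model_driver_calls++;
	return 0;
}

/*
 * Generic-layer handler: unprivileged callers are rejected before the
 * driver is reached, so the driver needs no capability check of its own.
 */
static int model_blkroset(bool has_cap_sys_admin, int read_only)
{
	if (!has_cap_sys_admin)
		return -EACCES; /* assumed error code for this sketch */
	return model_driver_set_ro(read_only);
}
```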
     

25 Oct, 2017

1 commit

  • It is reasonable to drop the page cache on discard; otherwise those
    pages may be written back by writeback a second later, so
    thin-provisioned devices will not be happy. This also looks like a
    security leak in the secure discard case.

    Also add a check for the queue_discard flag at an early stage.

    Signed-off-by: Dmitry Monakhov
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Dmitry Monakhov
     

09 Apr, 2017

2 commits


02 Feb, 2017

1 commit


25 Dec, 2016

1 commit


20 Dec, 2016

1 commit

  • Partitions that are not aligned to the blocksize of a device may cause
    invalid I/O requests because the blocklayer cares only about alignment
    within the partition when building requests on partitions.

    device
    |--------4096--------|--------4096--------|--------4096--------|
    partition offset 512byte
    |-512-|--------4096--------|--------4096--------|--------4096--------|

    When reading/writing one 4k block of the partition this maps to
    reading/writing with an offset of 512 byte of the device leading to
    unaligned requests for the device which in turn may cause unexpected
    behavior of the device driver.

    For DASD devices we have to translate the block number into a cylinder,
    head, record format. The unaligned requests lead to wrong calculation
    and therefore to misdirected I/O. In a "good" case this leads to I/O
    errors because the underlying hardware detects the wrong addressing.
    In a worst case scenario this might destroy data on the device.

    To prevent partitions that are not aligned to the physical blocksize
    of a device check for the alignment in the blkpg_ioctl.

    Signed-off-by: Stefan Haberland
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Stefan Haberland
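
The alignment test from this commit can be illustrated with the numbers in the diagram above: a partition starting at byte 512 on a device with 4096-byte blocks is rejected, one starting at 4096 is accepted. The function below is a minimal sketch with a hypothetical name; it assumes the block size is a non-zero power of two and works on byte offsets.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if a partition starting at start_bytes is aligned to
 * block_size. block_size must be a non-zero power of two so that
 * (block_size - 1) is a mask of the low bits that have to be zero.
 */
static bool model_partition_aligned(uint64_t start_bytes, uint32_t block_size)
{
	return (start_bytes & (block_size - 1)) == 0;
}
```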
     

19 Oct, 2016

1 commit

  • Adds the new BLKREPORTZONE and BLKRESETZONE ioctls for respectively
    obtaining the zone configuration of a zoned block device and resetting
    the write pointer of sequential zones of a zoned block device.

    The BLKREPORTZONE ioctl maps directly to a single call of the function
    blkdev_report_zones. The zone information result is passed as an array
    of struct blk_zone identical to the structure used internally for
    processing the REQ_OP_ZONE_REPORT operation. The BLKRESETZONE ioctl
    maps to a single call of the blkdev_reset_zones function.

    Signed-off-by: Shaun Tancheff
    Signed-off-by: Damien Le Moal
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Shaun Tancheff
     

12 Oct, 2016

1 commit

  • Patch series "fallocate for block devices", v11.

    This is a patchset to fix page cache coherency with BLKZEROOUT and
    implement fallocate for block devices.

    The first patch is a fix to the existing BLKZEROOUT ioctl to invalidate
    the page cache if the zeroing command to the underlying device succeeds.
    Without this patch we still have the pagecache coherence bug that's been
    in the kernel forever.

    The second patch changes the internal block device functions to reject
    attempts to discard or zeroout that are not aligned to the logical block
    size. Previously, we only checked that the start/len parameters were
    512-byte aligned, which caused kernel BUG_ONs for unaligned IOs to 4k-LBA
    devices.

    The third patch creates an fallocate handler for block devices, wires up
    the FALLOC_FL_PUNCH_HOLE flag to zeroing-discard, and connects
    FALLOC_FL_ZERO_RANGE to write-same so that we can have a consistent
    fallocate interface between files and block devices. It also allows the
    combination of PUNCH_HOLE and NO_HIDE_STALE to invoke non-zeroing discard.

    Test cases for the new block device fallocate are now in xfstests as
    generic/349-351.

    This patch (of 3):

    Invalidate the page cache (as a regular O_DIRECT write would do) to avoid
    returning stale cache contents at a later time.

    Link: http://lkml.kernel.org/r/147518378313.22791.16649519283678515021.stgit@birch.djwong.org
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Cc: Theodore Ts'o
    Cc: Mike Snitzer
    Cc: Brian Foster
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
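
The stricter check described for the second patch of the series (alignment to the logical block size instead of a fixed 512 bytes) might look like the sketch below. The function name and the extra bounds check against the device size are assumptions made for illustration; the logical block size is assumed to be a non-zero power of two.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Validate a discard/zeroout byte range. Both start and len must be
 * multiples of the logical block size (lbs), and the range must lie
 * within the device. Returns 0 if the range is acceptable.
 */
static int model_check_range(uint64_t start, uint64_t len,
			     uint32_t lbs, uint64_t dev_size)
{
	uint64_t mask = (uint64_t)lbs - 1;

	/* One test covers both values: any low bit set means misaligned. */
	if ((start | len) & mask)
		return -EINVAL;

	/* Bounds check (an assumption of this sketch), written to avoid overflow. */
	if (len == 0 || len > dev_size || start > dev_size - len)
		return -EINVAL;

	return 0;
}
```

With a 4096-byte logical block size, a range starting at byte 512 passes a 512-byte alignment test but is rejected here, which is the case the commit message says used to trigger a BUG_ON.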
     

27 May, 2016

1 commit

  • Pull misc DAX updates from Vishal Verma:
    "DAX error handling for 4.7

    - Until now, dax has been disabled if media errors were found on any
    device. This enables the use of DAX in the presence of these
    errors by making all sector-aligned zeroing go through the driver.

    - The driver (already) has the ability to clear errors on writes that
    are sent through the block layer using 'DSMs' defined in ACPI 6.1.

    Other misc changes:

    - When mounting DAX filesystems, check to make sure the partition is
    page aligned. This is a requirement for DAX, and previously, we
    allowed such unaligned mounts to succeed, but subsequent
    reads/writes would fail.

    - Misc/cleanup fixes from Jan that remove unused code from DAX
    related to zeroing, writeback, and some size checks"

    * tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    dax: fix a comment in dax_zero_page_range and dax_truncate_page
    dax: for truncate/hole-punch, do zeroing through the driver if possible
    dax: export a low-level __dax_zero_page_range helper
    dax: use sb_issue_zerout instead of calling dax_clear_sectors
    dax: enable dax in the presence of known media errors (badblocks)
    dax: fallback from pmd to pte on error
    block: Update blkdev_dax_capable() for consistency
    xfs: Add alignment check for DAX mount
    ext2: Add alignment check for DAX mount
    ext4: Add alignment check for DAX mount
    block: Add bdev_dax_supported() for dax mount checks
    block: Add vfs_msg() interface
    dax: Remove redundant inode size checks
    dax: Remove pointless writeback from dax_do_io()
    dax: Remove zeroing from dax_io()
    dax: Remove dead zeroing code from fault handlers
    ext2: Avoid DAX zeroing to corrupt data
    ext2: Fix block zeroing in ext2_get_blocks() for DAX
    dax: Remove complete_unwritten argument
    DAX: move RADIX_DAX_ definitions to dax.c

    Linus Torvalds
     

21 May, 2016

1 commit


17 May, 2016

1 commit

  • blkdev_dax_capable() is similar to bdev_dax_supported(), but needs
    to remain as a separate interface for checking dax capability of
    a raw block device.

    Rename and relocate blkdev_dax_capable() to keep them maintained
    consistently, and call bdev_direct_access() for the dax capability
    check.

    There is no change in the behavior.

    Link: https://lkml.org/lkml/2016/5/9/950
    Signed-off-by: Toshi Kani
    Reviewed-by: Jan Kara
    Cc: Alexander Viro
    Cc: Jens Axboe
    Cc: Andreas Dilger
    Cc: Jan Kara
    Cc: Dave Chinner
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Christoph Hellwig
    Cc: Boaz Harrosh
    Signed-off-by: Vishal Verma

    Toshi Kani
     

05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement
    the page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    the script below. For some reason, coccinelle doesn't patch header
    files. I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

31 Jan, 2016

1 commit

  • Dynamically enabling DAX requires that the page cache first be flushed
    and invalidated. This must occur atomically with the change of DAX mode
    otherwise we confuse the fsync/msync tracking and violate data
    durability guarantees. Eliminate the possibility of DAX-disabled to
    DAX-enabled transitions for now and revisit this for the next cycle.

    Cc: Jan Kara
    Cc: Jeff Moyer
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Matthew Wilcox
    Cc: Andrew Morton
    Cc: Ross Zwisler
    Signed-off-by: Dan Williams

    Dan Williams
     

23 Jan, 2016

1 commit

  • parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
    inode_foo(inode) being mutex_foo(&inode->i_mutex).

    Please, use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become rwsem, with ->lookup() done with it held
    only shared.

    Signed-off-by: Al Viro

    Al Viro
     

10 Jan, 2016

1 commit


09 Jan, 2016

1 commit

  • If an application wants exclusive access to all of the persistent memory
    provided by an NVDIMM namespace it can use this raw-block-dax facility
    to forgo establishing a filesystem. This capability is targeted
    primarily to hypervisors wanting to provision persistent memory for
    guests. It can be disabled / enabled dynamically via the new BLKDAXSET
    ioctl.

    Cc: Jeff Moyer
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Andrew Morton
    Cc: Ross Zwisler
    Reported-by: kbuild test robot
    Reviewed-by: Jan Kara
    Signed-off-by: Dan Williams

    Dan Williams
     

22 Oct, 2015

2 commits

  • This commit adds a driver API and ioctls for controlling Persistent
    Reservations generically at the block layer. Persistent Reservations
    are supported by SCSI and NVMe and allow controlling who gets access
    to a device in a shared storage setup.

    Note that we add a pr_ops structure to struct block_device_operations
    instead of adding the members directly to avoid bloating all instances
    of devices that will never support Persistent Reservations.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Split out helpers for all non-trivial ioctls to make this function simpler,
    and also start passing around a pointer version of the argument, as that's
    what most ioctl handlers actually need.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
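
The design note in the first commit of this group (a pointer to a separate pr_ops table instead of the callbacks themselves) can be sketched as follows. The `model_*` structures are simplified stand-ins with made-up callback signatures, not the kernel definitions, and returning -EOPNOTSUPP for a missing table is an assumption of this sketch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a table of Persistent Reservation callbacks. */
struct model_pr_ops {
	int (*pr_register)(uint64_t old_key, uint64_t new_key);
	int (*pr_reserve)(uint64_t key);
	int (*pr_release)(uint64_t key);
};

/* Per-driver operations pay for one pointer, with or without PR support. */
struct model_bdev_ops {
	const struct model_pr_ops *pr_ops; /* NULL when PR is unsupported */
};

/* Example callback used to show a driver that does support PR. */
static int model_accept_register(uint64_t old_key, uint64_t new_key)
{
	(void)old_key;
	(void)new_key;
	return 0;
}

/* Dispatch a register request, failing cleanly if there is no support. */
static int model_pr_register(const struct model_bdev_ops *ops,
			     uint64_t old_key, uint64_t new_key)
{
	if (!ops->pr_ops || !ops->pr_ops->pr_register)
		return -EOPNOTSUPP;
	return ops->pr_ops->pr_register(old_key, new_key);
}
```

A driver without Persistent Reservation support leaves `pr_ops` NULL and carries a single unused pointer instead of the whole callback table.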
     

20 May, 2015

2 commits

  • The only possible problem with using mutex_lock() instead of trylock
    is deadlock.

    If there aren't any locks held before calling blkdev_reread_part(),
    deadlock can't be caused by this conversion.

    If there are locks held before calling blkdev_reread_part(),
    and if these locks aren't required in the open and close handlers or
    the I/O path, deadlock shouldn't be caused either.

    Both user space's ioctl(BLKRRPART) and md_setup_drive() from
    init/do_mounts_md.c belong to the first case, so the conversion is
    safe for both of them.

    For loop, the previous patches in this patchset have fixed the ABBA
    lock dependency, so the conversion is OK.

    For nbd, tx_lock is held when calling the function:

    - both open and release won't hold the lock
    - when blkdev_reread_part() is run, the I/O thread has been stopped
    already, so tx_lock won't be acquired in the I/O path at that time.
    - so the conversion won't cause deadlock for nbd

    For dasd, none of dasd_open(), dasd_release() and the request function
    acquires any mutex/semaphore, so the conversion should be safe.

    Reviewed-by: Christoph Hellwig
    Tested-by: Jarod Wilson
    Acked-by: Jarod Wilson
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • This patch exports blkdev_reread_part() for block drivers, and also
    introduces __blkdev_reread_part().

    For some drivers, such as loop, reread of partitions can be run
    from the release path, and bd_mutex may already be held prior to
    calling ioctl_by_bdev(bdev, BLKRRPART, 0), so introduce
    __blkdev_reread_part for use in such cases.

    CC: Christoph Hellwig
    CC: Jens Axboe
    CC: Tejun Heo
    CC: Alexander Viro
    CC: Markus Pargmann
    CC: Stefan Weinhuber
    CC: Stefan Haberland
    CC: Sebastian Ott
    CC: Fabian Frederick
    CC: Ming Lei
    CC: David Herrmann
    CC: Andrew Morton
    CC: Peter Zijlstra
    CC: nbd-general@lists.sourceforge.net
    CC: linux-s390@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jarod Wilson
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Jarod Wilson
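
The split between a variant that expects the lock to be held and a wrapper that takes the lock itself, as introduced in this group, follows a common pattern. The single-threaded toy model below only records whether the lock is held; the names are hypothetical, and where a real mutex_lock() would block on an already-held lock, the model returns -EDEADLK so that the situation can be observed.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy single-threaded lock: it just records whether it is held. */
struct model_mutex {
	bool held;
};

static int model_rescan_count; /* how many rescans actually ran */

/* Caller must already hold the lock (models the __-prefixed variant). */
static int model_reread_part_locked(struct model_mutex *m)
{
	if (!m->held)
		return -EINVAL;
	model_rescan_count++;
	return 0;
}

/* Takes the lock itself (models the exported wrapper). */
static int model_reread_part(struct model_mutex *m)
{
	int ret;

	if (m->held)
		return -EDEADLK; /* a real mutex_lock() would block here */
	m->held = true;
	ret = model_reread_part_locked(m);
	m->held = false;
	return ret;
}
```

A caller that already holds the lock, such as the release path mentioned in the commit message, has to use the locked variant; calling the wrapper there is exactly the deadlock case the messages discuss.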