12 Feb, 2019

1 commit


23 Jan, 2019

1 commit

  • commit 04906b2f542c23626b0ef6219b808406f8dddbe9 upstream.

    bd_set_size() updates also block device's block size. This is somewhat
    unexpected from its name and at this point, only blkdev_open() uses this
    functionality. Furthermore, this can result in changing block size under
    a filesystem mounted on a loop device which leads to livelocks inside
    __getblk_gfp() like:

    Sending NMI from CPU 0 to CPUs 1:
    NMI backtrace for cpu 1
    CPU: 1 PID: 10863 Comm: syz-executor0 Not tainted 4.18.0-rc5+ #151
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
    01/01/2011
    RIP: 0010:__sanitizer_cov_trace_pc+0x3f/0x50 kernel/kcov.c:106
    ...
    Call Trace:
    init_page_buffers+0x3e2/0x530 fs/buffer.c:904
    grow_dev_page fs/buffer.c:947 [inline]
    grow_buffers fs/buffer.c:1009 [inline]
    __getblk_slow fs/buffer.c:1036 [inline]
    __getblk_gfp+0x906/0xb10 fs/buffer.c:1313
    __bread_gfp+0x2d/0x310 fs/buffer.c:1347
    sb_bread include/linux/buffer_head.h:307 [inline]
    fat12_ent_bread+0x14e/0x3d0 fs/fat/fatent.c:75
    fat_ent_read_block fs/fat/fatent.c:441 [inline]
    fat_alloc_clusters+0x8ce/0x16e0 fs/fat/fatent.c:489
    fat_add_cluster+0x7a/0x150 fs/fat/inode.c:101
    __fat_get_block fs/fat/inode.c:148 [inline]
    ...

    Trivial reproducer for the problem looks like:

    truncate -s 1G /tmp/image
    losetup /dev/loop0 /tmp/image
    mkfs.ext4 -b 1024 /dev/loop0
    mount -t ext4 /dev/loop0 /mnt
    losetup -c /dev/loop0
    l /mnt

    Fix the problem by moving initialization of a block device block size
    into a separate function and call it when needed.

    Thanks to Tetsuo Handa for help with
    debugging the problem.

    Reported-by: syzbot+9933e4476f365f5d5a1b@syzkaller.appspotmail.com
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     

03 Aug, 2018

1 commit

  • commit 9362dd1109f87a9d0a798fbc890cb339c171ed35 upstream.

    Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io")
    Reviewed-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Martin Wilck
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Martin Wilck
     

14 Oct, 2017

1 commit

  • When using FAT on a block device which supports rw_page, we can hit
    BUG_ON(!PageLocked(page)) in try_to_free_buffers(). This is because we
    call clean_buffers() after unlocking the page we've written. Introduce
    a new clean_page_buffers() which cleans all buffers associated with a
    page and call it from within bdev_write_page().

    [akpm@linux-foundation.org: s/PAGE_SIZE/~0U/ per Linus and Matthew]
    Link: http://lkml.kernel.org/r/20171006211541.GA7409@bombadil.infradead.org
    Signed-off-by: Matthew Wilcox
    Reported-by: Toshi Kani
    Reported-by: OGAWA Hirofumi
    Tested-by: Toshi Kani
    Acked-by: Johannes Thumshirn
    Cc: Ross Zwisler
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

15 Sep, 2017

1 commit


05 Sep, 2017

1 commit


24 Aug, 2017

2 commits

  • This way we don't need a block_device structure to submit I/O. The
    block_device has different life time rules from the gendisk and
    request_queue and is usually only available when the block device node
    is open. Other callers need to explicitly create one (e.g. the lightnvm
    passthrough code, or the new nvme multipathing code).

    For the actual I/O path all that we need is the gendisk, which exists
    once per block device. But given that the block layer also does
    partition remapping we additionally need a partition index, which is
    used for said remapping in generic_make_request.

    Note that all the block drivers generally want request_queue or
    sometimes the gendisk, so this removes a layer of indirection all
    over the stack.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

08 Jul, 2017

1 commit

  • Pull Writeback error handling updates from Jeff Layton:
    "This pile represents the bulk of the writeback error handling fixes
    that I have for this cycle. Some of the earlier patches in this pile
    may look trivial but they are prerequisites for later patches in the
    series.

    The aim of this set is to improve how we track and report writeback
    errors to userland. Most applications that care about data integrity
    will periodically call fsync/fdatasync/msync to ensure that their
    writes have made it to the backing store.

    For a very long time, we have tracked writeback errors using two flags
    in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a
    writeback error occurs (via mapping_set_error) and are cleared as a
    side-effect of filemap_check_errors (as you noted yesterday). This
    model really sucks for userland.

    Only the first task to call fsync (or msync or fdatasync) will see the
    error. Any subsequent task calling fsync on a file will get back 0
    (unless another writeback error occurs in the interim). If I have
    several tasks writing to a file and calling fsync to ensure that their
    writes got stored, then I need to have them coordinate with one
    another. That's difficult enough, but in a world of containerized
    setups that coordination may even not be possible.

    But wait...it gets worse!

    The calls to filemap_check_errors can be buried pretty far down in the
    call stack, and there are internal callers of filemap_write_and_wait
    and the like that also end up clearing those errors. Many of those
    callers ignore the error return from that function or return it to
    userland at nonsensical times (e.g. truncate() or stat()). If I get
    back -EIO on a truncate, there is no reason to think that it was
    because some previous writeback failed, and a subsequent fsync() will
    (incorrectly) return 0.

    This pile aims to do three things:

    1) ensure that when a writeback error occurs that that error will be
    reported to userland on a subsequent fsync/fdatasync/msync call,
    regardless of what internal callers are doing

    2) report writeback errors on all file descriptions that were open at
    the time that the error occurred. This is a user-visible change,
    but I think most applications are written to assume this behavior
    anyway. Those that aren't are unlikely to be hurt by it.

    3) document what filesystems should do when there is a writeback
    error. Today, there is very little consistency between them, and a
    lot of cargo-cult copying. We need to make it very clear what
    filesystems should do in this situation.

    To achieve this, the set adds a new data type (errseq_t) and then
    builds new writeback error tracking infrastructure around that. Once
    all of that is in place, we change the filesystems to use the new
    infrastructure for reporting wb errors to userland.

    Note that this is just the initial foray into cleaning up this mess.
    There is a lot of work remaining here:

    1) convert the rest of the filesystems in a similar fashion. Once the
    initial set is in, then I think most other fs' will be fairly
    simple to convert. Hopefully most of those can in via individual
    filesystem trees.

    2) convert internal waiters on writeback to use errseq_t for
    detecting errors instead of relying on the AS_* flags. I have some
    draft patches for this for ext4, but they are not quite ready for
    prime time yet.

    This was a discussion topic this year at LSF/MM too. If you're
    interested in the gory details, LWN has some good articles about this:

    https://lwn.net/Articles/718734/
    https://lwn.net/Articles/724307/"

    * tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
    btrfs: minimal conversion to errseq_t writeback error reporting on fsync
    xfs: minimal conversion to errseq_t writeback error reporting
    ext4: use errseq_t based error handling for reporting data writeback errors
    fs: convert __generic_file_fsync to use errseq_t based reporting
    block: convert to errseq_t based writeback error tracking
    dax: set errors in mapping when writeback fails
    Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors
    mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
    fs: new infrastructure for writeback error handling and reporting
    lib: add errseq_t type and infrastructure for handling it
    mm: don't TestClearPageError in __filemap_fdatawait_range
    mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
    jbd2: don't clear and reset errors after waiting on writeback
    buffer: set errors in mapping at the time that the error occurs
    fs: check for writeback errors after syncing out buffers in generic_file_fsync
    buffer: use mapping_set_error instead of setting the flag
    mm: fix mapping_set_error call in me_pagecache_dirty

    Linus Torvalds
     

06 Jul, 2017

2 commits

  • This is a very minimal conversion to errseq_t based error tracking
    for raw block device access. Just have it use the standard
    file_write_and_wait_range call.

    Note that there are internal callers that call sync_blockdev
    and the like that are not affected by this. They'll continue
    to use the AS_EIO/AS_ENOSPC flags for error reporting like
    they always have for now.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jeff Layton

    Jeff Layton
     
  • Most filesystems currently use mapping_set_error and
    filemap_check_errors for setting and reporting/clearing writeback errors
    at the mapping level. filemap_check_errors is indirectly called from
    most of the filemap_fdatawait_* functions and from
    filemap_write_and_wait*. These functions are called from all sorts of
    contexts to wait on writeback to finish -- e.g. mostly in fsync, but
    also in truncate calls, getattr, etc.

    The non-fsync callers are problematic. We should be reporting writeback
    errors during fsync, but many places spread over the tree clear out
    errors before they can be properly reported, or report errors at
    nonsensical times.

    If I get -EIO on a stat() call, there is no reason for me to assume that
    it is because some previous writeback failed. The fact that it also
    clears out the error such that a subsequent fsync returns 0 is a bug,
    and a nasty one since that's potentially silent data corruption.

    This patch adds a small bit of new infrastructure for setting and
    reporting errors during address_space writeback. While the above was my
    original impetus for adding this, I think it's also the case that
    current fsync semantics are just problematic for userland. Most
    applications that call fsync do so to ensure that the data they wrote
    has hit the backing store.

    In the case where there are multiple writers to the file at the same
    time, this is really hard to determine. The first one to call fsync will
    see any stored error, and the rest get back 0. The processes with open
    fds may not be associated with one another in any way. They could even
    be in different containers, so ensuring coordination between all fsync
    callers is not really an option.

    One way to remedy this would be to track what file descriptor was used
    to dirty the file, but that's rather cumbersome and would likely be
    slow. However, there is a simpler way to improve the semantics here
    without incurring too much overhead.

    This set adds an errseq_t to struct address_space, and a corresponding
    one is added to struct file. Writeback errors are recorded in the
    mapping's errseq_t, and the one in struct file is used as the "since"
    value.

    This changes the semantics of the Linux fsync implementation such that
    applications can now use it to determine whether there were any
    writeback errors since fsync(fd) was last called (or since the file was
    opened in the case of fsync having never been called).

    Note that those writeback errors may have occurred when writing data
    that was dirtied via an entirely different fd, but that's the case now
    with the current mapping_set_error/filemap_check_error infrastructure.
    This will at least prevent you from getting a false report of success.

    The new behavior is still consistent with the POSIX spec, and is more
    reliable for application developers. This patch just adds some basic
    infrastructure for doing this, and ensures that the f_wb_err "cursor"
    is properly set when a file is opened. Later patches will change the
    existing code to use this new infrastructure for reporting errors at
    fsync time.

    Signed-off-by: Jeff Layton
    Reviewed-by: Jan Kara

    Jeff Layton
     

04 Jul, 2017

1 commit

  • Pull core block/IO updates from Jens Axboe:
    "This is the main pull request for the block layer for 4.13. Not a huge
    round in terms of features, but there's a lot of churn related to some
    core cleanups.

    Note this depends on the UUID tree pull request, that Christoph
    already sent out.

    This pull request contains:

    - A series from Christoph, unifying the error/stats codes in the
    block layer. We now use blk_status_t everywhere, instead of using
    different schemes for different places.

    - Also from Christoph, some cleanups around request allocation and IO
    scheduler interactions in blk-mq.

    - And yet another series from Christoph, cleaning up how we handle
    and do bounce buffering in the block layer.

    - A blk-mq debugfs series from Bart, further improving on the support
    we have for exporting internal information to aid debugging IO
    hangs or stalls.

    - Also from Bart, a series that cleans up the request initialization
    differences across types of devices.

    - A series from Goldwyn Rodrigues, allowing the block layer to return
    failure if we will block and the user asked for non-blocking.

    - Patch from Hannes for supporting setting loop devices block size to
    that of the underlying device.

    - Two series of patches from Javier, fixing various issues with
    lightnvm, particular around pblk.

    - A series from me, adding support for write hints. This comes with
    NVMe support as well, so applications can help guide data placement
    on flash to improve performance, latencies, and write
    amplification.

    - A series from Ming, improving and hardening blk-mq support for
    stopping/starting and quiescing hardware queues.

    - Two pull requests for NVMe updates. Nothing major on the feature
    side, but lots of cleanups and bug fixes. From the usual crew.

    - A series from Neil Brown, greatly improving the bio rescue set
    support. Most notably, this kills the bio rescue work queues, if we
    don't really need them.

    - Lots of other little bug fixes that are all over the place"

    * 'for-4.13/block' of git://git.kernel.dk/linux-block: (217 commits)
    lightnvm: pblk: set line bitmap check under debug
    lightnvm: pblk: verify that cache read is still valid
    lightnvm: pblk: add initialization check
    lightnvm: pblk: remove target using async. I/Os
    lightnvm: pblk: use vmalloc for GC data buffer
    lightnvm: pblk: use right metadata buffer for recovery
    lightnvm: pblk: schedule if data is not ready
    lightnvm: pblk: remove unused return variable
    lightnvm: pblk: fix double-free on pblk init
    lightnvm: pblk: fix bad le64 assignations
    nvme: Makefile: remove dead build rule
    blk-mq: map all HWQ also in hyperthreaded system
    nvmet-rdma: register ib_client to not deadlock in device removal
    nvme_fc: fix error recovery on link down.
    nvmet_fc: fix crashes on bad opcodes
    nvme_fc: Fix crash when nvme controller connection fails.
    nvme_fc: replace ioabort msleep loop with completion
    nvme_fc: fix double calls to nvme_cleanup_cmd()
    nvme-fabrics: verify that a controller returns the correct NQN
    nvme: simplify nvme_dev_attrs_are_visible
    ...

    Linus Torvalds
     

29 Jun, 2017

1 commit

  • Wen reports significant memory leaks with DIF and O_DIRECT:

    "With nvme devive + T10 enabled, On a system it has 256GB and started
    logging /proc/meminfo & /proc/slabinfo for every minute and in an hour
    it increased by 15968128 kB or ~15+GB.. Approximately 256 MB / minute
    leaking.

    /proc/meminfo | grep SUnreclaim...

    SUnreclaim: 6752128 kB
    SUnreclaim: 6874880 kB
    SUnreclaim: 7238080 kB
    ....
    SUnreclaim: 22307264 kB
    SUnreclaim: 22485888 kB
    SUnreclaim: 22720256 kB

    When testcases with T10 enabled call into __blkdev_direct_IO_simple,
    code doesn't free memory allocated by bio_integrity_alloc. The patch
    fixes the issue. HTX has been run with +60 hours without failure."

    Since __blkdev_direct_IO_simple() allocates the bio on the stack, it
    doesn't go through the regular bio free. This means that any ancillary
    data allocated with the bio through the stack is not freed. Hence, we
    can leak the integrity data associated with the bio, if the device is
    using DIF/DIX.

    Fix this by providing a bio_uninit() and export it, so that we can use
    it to free this data. Note that this is a minimal fix for this issue.
    Any current user of bio's that are allocated outside of
    bio_alloc_bioset() suffers from this issue, most notably some drivers.
    We will fix those in a more comprehensive patch for 4.13. This also
    means that the commit marked as being fixed by this isn't the real
    culprit, it's just the most obvious one out there.

    Fixes: 542ff7bf18c6 ("block: new direct I/O implementation")
    Reported-by: Wen Xiong
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     

28 Jun, 2017

1 commit


19 Jun, 2017

1 commit

  • "flags" arguments are often seen as good API design as they allow
    easy extensibility.
    bioset_create_nobvec() is implemented internally as a variation in
    flags passed to __bioset_create().

    To support future extension, make the internal structure part of the
    API.
    i.e. add a 'flags' argument to bioset_create() and discard
    bioset_create_nobvec().

    Note that the bio_split allocations in drivers/md/raid* do not need
    the bvec mempool - they should have used bioset_create_nobvec().

    Suggested-by: Christoph Hellwig
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    NeilBrown
     

09 Jun, 2017

2 commits


13 May, 2017

1 commit

  • Pull libnvdimm fixes from Dan Williams:
    "Incremental fixes and a small feature addition on top of the main
    libnvdimm 4.12 pull request:

    - Geert noticed that tinyconfig was bloated by BLOCK selecting DAX.
    The size regression is fixed by moving all dax helpers into the
    dax-core and only specifying "select DAX" for FS_DAX and
    dax-capable drivers. He also asked for clarification of the
    NR_DEV_DAX config option which, on closer look, does not need to be
    a config option at all. Mike also throws in a DEV_DAX_PMEM fixup
    for good measure.

    - Ben's attention to detail on -stable patch submissions caught a
    case where the recent fixes to arch_copy_from_iter_pmem() missed a
    condition where we strand dirty data in the cache. This is tagged
    for -stable and will also be included in the rework of the pmem api
    to a proposed {memcpy,copy_user}_flushcache() interface for 4.13.

    - Vishal adds a feature that missed the initial pull due to pending
    review feedback. It allows the kernel to clear media errors when
    initializing a BTT (atomic sector update driver) instance on a pmem
    namespace.

    - Ross noticed that the dax_device + dax_operations conversion broke
    __dax_zero_page_range(). The nvdimm unit tests fail to check this
    path, but xfstests immediately trips over it. No excuse for missing
    this before submitting the 4.12 pull request.

    These all pass the nvdimm unit tests and an xfstests spot check. The
    set has received a build success notification from the kbuild robot"

    * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    filesystem-dax: fix broken __dax_zero_page_range() conversion
    libnvdimm, btt: ensure that initializing metadata clears poison
    libnvdimm: add an atomic vs process context flag to rw_bytes
    x86, pmem: Fix cache flushing for iovec write < 8 bytes
    device-dax: kill NR_DEV_DAX
    block, dax: move "select DAX" from BLOCK to FS_DAX
    device-dax: Tell kbuild DEV_DAX_PMEM depends on DEV_DAX

    Linus Torvalds
     

09 May, 2017

1 commit

  • For configurations that do not enable DAX filesystems or drivers, do not
    require the DAX core to be built.

    Given that the 'direct_access' method has been removed from
    'block_device_operations', we can also go ahead and remove the
    block-related dax helper functions from fs/block_dev.c to
    drivers/dax/super.c. This keeps dax details out of the block layer and
    lets the DAX core be built as a module in the FS_DAX=n case.

    Filesystems need to include dax.h to call bdev_dax_supported().

    Cc: linux-xfs@vger.kernel.org
    Cc: Jens Axboe
    Cc: "Theodore Ts'o"
    Cc: Matthew Wilcox
    Cc: Alexander Viro
    Cc: "Darrick J. Wong"
    Cc: Ross Zwisler
    Reviewed-by: Jan Kara
    Reported-by: Geert Uytterhoeven
    Signed-off-by: Dan Williams

    Dan Williams
     

06 May, 2017

1 commit

  • Pull libnvdimm updates from Dan Williams:
    "The bulk of this has been in multiple -next releases. There were a few
    late breaking fixes and small features that got added in the last
    couple days, but the whole set has received a build success
    notification from the kbuild robot.

    Change summary:

    - Region media error reporting: A libnvdimm region device is the
    parent to one or more namespaces. To date, media errors have been
    reported via the "badblocks" attribute attached to pmem block
    devices for namespaces in "raw" or "memory" mode. Given that
    namespaces can be in "device-dax" or "btt-sector" mode this new
    interface reports media errors generically, i.e. independent of
    namespace modes or state.

    This subsequently allows userspace tooling to craft "ACPI 6.1
    Section 9.20.7.6 Function Index 4 - Clear Uncorrectable Error"
    requests and submit them via the ioctl path for NVDIMM root bus
    devices.

    - Introduce 'struct dax_device' and 'struct dax_operations': Prompted
    by a request from Linus and feedback from Christoph this allows for
    dax capable drivers to publish their own custom dax operations.
    This fixes the broken assumption that all dax operations are
    related to a persistent memory device, and makes it easier for
    other architectures and platforms to add customized persistent
    memory support.

    - 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is
    available for storage appliance applications to manually trigger
    memory controllers to drain write-pending buffers that would
    otherwise be flushed automatically by the platform ADR
    (asynchronous-DRAM-refresh) mechanism at a power loss event.
    Support for "locked" DIMMs is included to prevent namespaces from
    surfacing when the namespace label data area is locked. Finally,
    fixes for various reported deadlocks and crashes, also tagged for
    -stable.

    - ACPI / nfit driver updates: General updates of the nfit driver to
    add DSM command overrides, ACPI 6.1 health state flags support, DSM
    payload debug available by default, and various fixes.

    Acknowledgements that came after the branch was pushed:

    - commmit 565851c972b5 "device-dax: fix sysfs attribute deadlock":
    Tested-by: Yi Zhang

    - commit 23f498448362 "libnvdimm: rework region badblocks clearing"
    Tested-by: Toshi Kani "

    * tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (52 commits)
    libnvdimm, pfn: fix 'npfns' vs section alignment
    libnvdimm: handle locked label storage areas
    libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED
    brd: fix uninitialized use of brd->dax_dev
    block, dax: use correct format string in bdev_dax_supported
    device-dax: fix sysfs attribute deadlock
    libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking"
    libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering
    libnvdimm: rework region badblocks clearing
    acpi, nfit: kill ACPI_NFIT_DEBUG
    libnvdimm: fix clear length of nvdimm_forget_poison()
    libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify
    libnvdimm, region: sysfs trigger for nvdimm_flush()
    libnvdimm: fix phys_addr for nvdimm_clear_poison
    x86, dax, pmem: remove indirection around memcpy_from_pmem()
    block: remove block_device_operations ->direct_access()
    block, dax: convert bdev_dax_supported() to dax_direct_access()
    filesystem-dax: convert to dax_direct_access()
    Revert "block: use DAX for partition table reads"
    ext2, ext4, xfs: retrieve dax_device for iomap operations
    ...

    Linus Torvalds
     

04 May, 2017

1 commit

  • invalidate_bdev() calls cleancache_invalidate_inode() iff ->nrpages != 0
    which doen't make any sense.

    Make sure that invalidate_bdev() always calls cleancache_invalidate_inode()
    regardless of mapping->nrpages value.

    Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
    Link: http://lkml.kernel.org/r/20170424164135.22350-3-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Jan Kara
    Acked-by: Konrad Rzeszutek Wilk
    Cc: Alexander Viro
    Cc: Ross Zwisler
    Cc: Jens Axboe
    Cc: Johannes Weiner
    Cc: Alexey Kuznetsov
    Cc: Christoph Hellwig
    Cc: Nikolay Borisov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

02 May, 2017

1 commit

  • The new message has an incorrect format string, causing a warning in some
    configurations:

    fs/block_dev.c: In function 'bdev_dax_supported':
    fs/block_dev.c:779:5: error: format '%d' expects argument of type 'int', but argument 2 has type 'long int' [-Werror=format=]
    "error: dax access failed (%d)", len);

    This changes it to use the correct %ld instead of %d.

    Fixes: 2093f2e9dfec ("block, dax: convert bdev_dax_supported() to dax_direct_access()")
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Dan Williams

    Arnd Bergmann
     

26 Apr, 2017

2 commits


22 Apr, 2017

1 commit

  • Commit 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk")
    introduced blk_integrity_revalidate(), which seems to assume ownership
    of the stable pages flag and unilaterally clears it if no blk_integrity
    profile is registered:

    if (bi->profile)
    disk->queue->backing_dev_info->capabilities |=
    BDI_CAP_STABLE_WRITES;
    else
    disk->queue->backing_dev_info->capabilities &=
    ~BDI_CAP_STABLE_WRITES;

    It's called from revalidate_disk() and rescan_partitions(), making it
    impossible to enable stable pages for drivers that support partitions
    and don't use blk_integrity: while the call in revalidate_disk() can be
    trivially worked around (see zram, which doesn't support partitions and
    hence gets away with zram_revalidate_disk()), rescan_partitions() can
    be triggered from userspace at any time. This breaks rbd, where the
    ceph messenger is responsible for generating/verifying CRCs.

    Since blk_integrity_{un,}register() "must" be used for (un)registering
    the integrity profile with the block layer, move BDI_CAP_STABLE_WRITES
    setting there. This way drivers that call blk_integrity_register() and
    use integrity infrastructure won't interfere with drivers that don't
    but still want stable pages.

    Fixes: 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk")
    Cc: "Martin K. Petersen"
    Cc: Christoph Hellwig
    Cc: Mike Snitzer
    Cc: stable@vger.kernel.org # 4.4+, needs backporting
    Tested-by: Dan Williams
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe

    Ilya Dryomov
     

21 Apr, 2017

2 commits

  • Replace bdev_direct_access() with dax_direct_access() that uses
    dax_device and dax_operations instead of a block_device and
    block_device_operations for dax. Once all consumers of the old api have
    been converted bdev_direct_access() will be deleted.

    Given that block device partitioning decisions can cause dax page
    alignment constraints to be violated this also introduces the
    bdev_dax_pgoff() helper. It handles calculating a logical pgoff relative
    to the dax_device and also checks for page alignment.

    Signed-off-by: Dan Williams

    Dan Williams
     
  • This is leftover dead code that has since been replaced by
    bdev_dax_supported().

    Signed-off-by: Dan Williams

    Dan Williams
     

09 Apr, 2017

2 commits


23 Mar, 2017

2 commits

  • When block device is closed, we call inode_detach_wb() in __blkdev_put()
    which sets inode->i_wb to NULL. That is contrary to expectations that
    inode->i_wb stays valid once set during the whole inode's lifetime and
    leads to oops in wb_get() in locked_inode_to_wb_and_lock_list() because
    inode_to_wb() returned NULL.

    The reason why we called inode_detach_wb() is not valid anymore though.
    BDI is guaranteed to stay along until we call bdi_put() from
    bdev_evict_inode() so we can postpone calling inode_detach_wb() to that
    moment.

    Also add a warning to catch if someone uses inode_detach_wb() in a
    dangerous way.

    Reported-by: Thiago Jung Bauermann
    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • When disk->fops->open() in __blkdev_get() returns -ERESTARTSYS, we
    restart the process of opening the block device. However we forget to
    switch bdev->bd_bdi back to noop_backing_dev_info and as a result bdev
    inode will be pointing to a stale bdi. Fix the problem by setting
    bdev->bd_bdi later when __blkdev_get() is already guaranteed to succeed.

    Acked-by: Tejun Heo
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

02 Mar, 2017

1 commit

  • So far we initialized bd_bdi only in bdget(). That is fine for normal
    bdev inodes however for the special case of the root inode of
    blockdev_superblock that function is never called and thus bd_bdi is
    left uninitialized. As a result bdev_evict_inode() may oops doing
    bdi_put(root->bd_bdi) on that inode as can be seen when doing:

    mount -t bdev none /mnt

    Fix the problem by initializing bd_bdi when first allocating the inode
    and then reinitializing bd_bdi in bdev_evict_inode().

    Thanks to syzkaller team for finding the problem.

    Reported-by: Dmitry Vyukov
    Fixes: b1d2dc5659b4 ("block: Make blk_get_backing_dev_info() safe without open bdev")
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

28 Feb, 2017

1 commit

  • Replace all 1 << inode->i_blkbits and (1 << inode->i_blkbits) in fs
    branch.

    This patch also fixes multiple checkpatch warnings: WARNING: Prefer
    'unsigned int' to bare use of 'unsigned'

    Thanks to Andrew Morton for suggesting more appropriate function instead
    of macro.

    [geliangtang@gmail.com: truncate: use i_blocksize()]
    Link: http://lkml.kernel.org/r/9c8b2cd83c8f5653805d43debde9fa8817e02fc4.1484895804.git.geliangtang@gmail.com
    Link: http://lkml.kernel.org/r/1481319905-10126-1-git-send-email-fabf@skynet.be
    Signed-off-by: Fabian Frederick
    Signed-off-by: Geliang Tang
    Cc: Alexander Viro
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     

22 Feb, 2017

1 commit

  • When a device gets removed, block device inode unhashed so that it is not
    used anymore (bdget() will not find it anymore). Later when a new device
    gets created with the same device number, we create new block device
    inode. However there may be file system device inodes whose i_bdev still
    points to the original block device inode and thus we get two active
    block device inodes for the same device. They will share the same
    gendisk so the only visible differences will be that page caches will
    not be coherent and BDIs will be different (the old block device inode
    still points to unregistered BDI).

    Fix the problem by checking in bd_acquire() whether i_bdev still points
    to active block device inode and re-lookup the block device if not. That
    way any open of a block device happening after the old device has been
    removed will get correct block device inode.

    Tested-by: Lekshmi Pillai
    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

18 Feb, 2017

1 commit


02 Feb, 2017

2 commits

  • Currenly blk_get_backing_dev_info() is not safe to be called when the
    block device is not open as bdev->bd_disk is NULL in that case. However
    inode_to_bdi() uses this function and may be call called from flusher
    worker or other writeback related functions without bdev being open
    which leads to crashes such as:

    [113031.075540] Unable to handle kernel paging request for data at address 0x00000000
    [113031.075614] Faulting instruction address: 0xc0000000003692e0
    0:mon> t
    [c0000000fb65f900] c00000000036cb6c writeback_sb_inodes+0x30c/0x590
    [c0000000fb65fa10] c00000000036ced4 __writeback_inodes_wb+0xe4/0x150
    [c0000000fb65fa70] c00000000036d33c wb_writeback+0x30c/0x450
    [c0000000fb65fb40] c00000000036e198 wb_workfn+0x268/0x580
    [c0000000fb65fc50] c0000000000f3470 process_one_work+0x1e0/0x590
    [c0000000fb65fce0] c0000000000f38c8 worker_thread+0xa8/0x660
    [c0000000fb65fd80] c0000000000fc4b0 kthread+0x110/0x130
    [c0000000fb65fe30] c0000000000098f0 ret_from_kernel_thread+0x5c/0x6c

    Signed-off-by: Jens Axboe

    Jan Kara
     
  • Currently, block device inodes stay around after corresponding gendisk
    hash died until memory reclaim finds them and frees them. Since we will
    make block device inode pin the bdi, we want to free the block device
    inode as soon as the device goes away so that bdi does not stay around
    unnecessarily. Furthermore we need to avoid issues when new device with
    the same major,minor pair gets created since reusing the bdi structure
    would be rather difficult in this case.

    Unhashing block device inode on gendisk destruction nicely deals with
    these problems. Once last block device inode reference is dropped (which
    may be directly in del_gendisk()), the inode gets evicted. Furthermore if
    the major,minor pair gets reallocated, we are guaranteed to get new
    block device inode even if old block device inode is not yet evicted and
    thus we avoid issues with possible reuse of bdi.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

24 Jan, 2017

1 commit

  • We can't dereference the dio structure after submitting the last bio for
    this request, as I/O completion might have happened before the code is
    run. Introduce a local is_sync variable instead.

    Fixes: 542ff7bf ("block: new direct I/O implementation")
    Signed-off-by: Christoph Hellwig
    Reported-by: Matias Bjørling
    Tested-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

05 Jan, 2017

1 commit

  • Pull block layer fixes from Jens Axboe:
    "A set of fixes for the current series, one fixing a regression with
    block size < page cache size in the alias series from Jan. Outside of
    that, two small cleanups for wbt from Bart, a nvme pull request from
    Christoph, and a few small fixes of documentation updates"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    block: fix up io_poll documentation
    block: Avoid that sparse complains about context imbalance in __wbt_wait()
    block: Make wbt_wait() definition consistent with declaration
    clean_bdev_aliases: Prevent cleaning blocks that are not in block range
    genhd: remove dead and duplicated scsi code
    block: add back plugging in __blkdev_direct_IO
    nvmet/fcloop: remove some logically dead code performing redundant ret checks
    nvmet: fix KATO offset in Set Features
    nvme/fc: simplify error handling of nvme_fc_create_hw_io_queues
    nvme/fc: correct some printk information
    nvme/scsi: Remove START STOP emulation
    nvme/pci: Delete misleading queue-wrap comment
    nvme/pci: Fix whitespace problem
    nvme: simplify stripe quirk
    nvme: update maintainers information

    Linus Torvalds
     

25 Dec, 2016

1 commit