30 Nov, 2017

1 commit

  • commit 67f2519fe2903c4041c0e94394d14d372fe51399 upstream.

    guard_bio_eod() needs to look at the partition capacity, not just the
    capacity of the whole device, when determining if truncation is
    necessary.

    [ 60.268688] attempt to access beyond end of device
    [ 60.268690] unknown-block(9,1): rw=0, want=67103509, limit=67103506
    [ 60.268693] buffer_io_error: 2 callbacks suppressed
    [ 60.268696] Buffer I/O error on dev md1p7, logical block 4524305, async page read
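
    The check described above can be sketched in user-space C. This is a
    simplified model, not the kernel's guard_bio_eod(): struct part_model and
    sectors_to_truncate() are hypothetical names, and the only point shown is
    that the limit comes from the partition rather than from the whole device.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space model of a partition; not a kernel structure. */
struct part_model {
	uint64_t start_sect;  /* first sector of the partition on the disk */
	uint64_t nr_sects;    /* partition capacity in 512-byte sectors */
};

/*
 * Return how many sectors must be cut from the tail of an I/O that starts
 * at partition-relative sector `sector` and spans `nr` sectors. Returns 0
 * when the request fits, or when it starts beyond the limit and truncation
 * cannot help.
 */
static uint64_t sectors_to_truncate(const struct part_model *part,
				    uint64_t sector, uint64_t nr)
{
	uint64_t limit = part->nr_sects;  /* partition limit, not whole disk */

	if (limit == 0 || sector >= limit)
		return 0;                 /* entirely out of range */
	if (nr <= limit - sector)
		return 0;                 /* request fits as-is */
	return nr - (limit - sector);     /* overhang past the partition end */
}
```

    With the limit from the log above (67103506), a request whose end would
    be sector 67103509 overhangs by three sectors in this model.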

    Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and partitions index")
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Greg Edwards
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Greg Edwards
     

08 Sep, 2017

1 commit

  • Pull block layer updates from Jens Axboe:
    "This is the first pull request for 4.14, containing most of the code
    changes. It's a quiet series this round, which I think we needed after
    the churn of the last few series. This contains:

    - Fix for a registration race in loop, from Anton Volkov.

    - Overflow complaint fix from Arnd for DAC960.

    - Series of drbd changes from the usual suspects.

    - Conversion of the stec/skd driver to blk-mq. From Bart.

    - A few BFQ improvements/fixes from Paolo.

    - CFQ improvement from Ritesh, allowing idling for group idle.

    - A few fixes found by Dan's smatch, courtesy of Dan.

    - A warning fixup for a race between changing the IO scheduler and
    device removal. From David Jeffery.

    - A few nbd fixes from Josef.

    - Support for cgroup info in blktrace, from Shaohua.

    - Also from Shaohua, new features in the null_blk driver to allow it
    to actually hold data, among other things.

    - Various corner cases and error handling fixes from Weiping Zhang.

    - Improvements to the IO stats tracking for blk-mq from me. Can
    drastically improve performance for fast devices and/or big
    machines.

    - Series from Christoph removing bi_bdev as being needed for IO
    submission, in preparation for nvme multipathing code.

    - Series from Bart, including various cleanups and fixes for switch
    fall through case complaints"

    * 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
    kernfs: checking for IS_ERR() instead of NULL
    drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
    drbd: Fix allyesconfig build, fix recent commit
    drbd: switch from kmalloc() to kmalloc_array()
    drbd: abort drbd_start_resync if there is no connection
    drbd: move global variables to drbd namespace and make some static
    drbd: rename "usermode_helper" to "drbd_usermode_helper"
    drbd: fix race between handshake and admin disconnect/down
    drbd: fix potential deadlock when trying to detach during handshake
    drbd: A single dot should be put into a sequence.
    drbd: fix rmmod cleanup, remove _all_ debugfs entries
    drbd: Use setup_timer() instead of init_timer() to simplify the code.
    drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
    drbd: new disk-option disable-write-same
    drbd: Fix resource role for newly created resources in events2
    drbd: mark symbols static where possible
    drbd: Send P_NEG_ACK upon write error in protocol != C
    drbd: add explicit plugging when submitting batches
    drbd: change list_for_each_safe to while(list_first_entry_or_null)
    drbd: introduce drbd_recv_header_maybe_unplug
    ...

    Linus Torvalds
     

07 Sep, 2017

4 commits

  • All users of pagevec_lookup() and pagevec_lookup_range() now pass
    PAGEVEC_SIZE as a desired number of pages.

    Just drop the argument.

    Link: http://lkml.kernel.org/r/20170726114704.7626-11-jack@suse.cz
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • We want only pages from given range in page_cache_seek_hole_data(). Use
    pagevec_lookup_range() instead of pagevec_lookup() and remove
    unnecessary code.

    Note that the check for getting less pages than desired can be removed
    because index gets updated by pagevec_lookup_range().

    Link: http://lkml.kernel.org/r/20170726114704.7626-9-jack@suse.cz
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Commit e64855c6cfaa ("fs: Add helper to clean bdev aliases under a bh
    and use it") added a wrapper for clean_bdev_aliases() that invalidates
    bdev aliases underlying a single buffer head.

    However this has caused a performance regression for bonnie++ benchmark
    on ext4 filesystem when delayed allocation is turned off (ext3 mode) -
    average of 3 runs:

    Hmean SeqOut Char 164787.55 ( 0.00%) 107189.06 (-34.95%)
    Hmean SeqOut Block 219883.89 ( 0.00%) 168870.32 (-23.20%)

    The reason for this regression is that clean_bdev_aliases() is slower
    when called for a single block because pagevec_lookup() it uses will end
    up iterating through the radix tree until it finds a page (which may
    take a while) but we are only interested whether there's a page at a
    particular index.

    Fix the problem by using pagevec_lookup_range() instead which avoids the
    needless iteration.

    Fixes: e64855c6cfaa ("fs: Add helper to clean bdev aliases under a bh and use it")
    Link: http://lkml.kernel.org/r/20170726114704.7626-5-jack@suse.cz
    Signed-off-by: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Make pagevec_lookup() (and underlying find_get_pages()) update index to
    the next page where iteration should continue. Most callers want this
    and also pagevec_lookup_tag() already does this.
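
    The calling convention can be illustrated with a small user-space model.
    lookup_batch() below is a hypothetical stand-in that scans a sorted array
    instead of the page cache; it only shows how updating the index lets the
    caller loop without recomputing where to resume.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy up to `max` entries that are >= *index from the sorted array
 * `present` into `out`, then advance *index just past the last entry
 * returned. Returns the number of entries copied.
 */
static size_t lookup_batch(const unsigned long *present, size_t n,
			   unsigned long *index, unsigned long *out,
			   size_t max)
{
	size_t found = 0;

	for (size_t i = 0; i < n && found < max; i++) {
		if (present[i] >= *index)
			out[found++] = present[i];
	}
	if (found)
		*index = out[found - 1] + 1;  /* where iteration continues */
	return found;
}
```

    A caller can then simply repeat the call until it returns 0.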

    Link: http://lkml.kernel.org/r/20170726114704.7626-3-jack@suse.cz
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

24 Aug, 2017

1 commit

  • This way we don't need a block_device structure to submit I/O. The
    block_device has different life time rules from the gendisk and
    request_queue and is usually only available when the block device node
    is open. Other callers need to explicitly create one (e.g. the lightnvm
    passthrough code, or the new nvme multipathing code).

    For the actual I/O path all that we need is the gendisk, which exists
    once per block device. But given that the block layer also does
    partition remapping we additionally need a partition index, which is
    used for said remapping in generic_make_request.

    Note that all the block drivers generally want request_queue or
    sometimes the gendisk, so this removes a layer of indirection all
    over the stack.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

11 Jul, 2017

2 commits

  • To install a buffer_head into the cpu's LRU queue, bh_lru_install()
    would construct a new copy of the queue and then memcpy it over the real
    queue. But it's easily possible to do the update in-place, which is
    faster and simpler. Some work can also be skipped if the buffer_head
    was already in the queue.

    As a microbenchmark I timed how long it takes to run sb_getblk()
    10,000,000 times alternating between BH_LRU_SIZE + 1 blocks.
    Effectively, this benchmarks looking up buffer_heads that are in the
    page cache but not in the LRU:

    Before this patch: 1.758s
    After this patch: 1.653s

    This patch also removes about 350 bytes of compiled code (on x86_64),
    partly due to removal of the memcpy() which was being inlined+unrolled.
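
    A minimal sketch of the in-place update, assuming a plain array of
    pointers in place of the per-cpu buffer_head queue. LRU_SIZE, struct
    lru_model and lru_install() are illustrative names, and reference
    counting and locking are left out.

```c
#include <assert.h>
#include <stddef.h>

#define LRU_SIZE 4  /* stand-in for BH_LRU_SIZE; value chosen for the sketch */

struct lru_model {
	const void *slot[LRU_SIZE];  /* slot[0] is the most recently used */
};

/*
 * Put `item` (non-NULL) at the front by carrying the displaced entry down
 * the array one slot at a time. Returns the evicted entry, or NULL when
 * nothing was evicted (a free slot absorbed the shift, or `item` was
 * already queued and the loop stopped early).
 */
static const void *lru_install(struct lru_model *lru, const void *item)
{
	const void *carry = item;

	for (int i = 0; i < LRU_SIZE; i++) {
		const void *tmp = lru->slot[i];

		lru->slot[i] = carry;
		carry = tmp;
		if (carry == item)
			return NULL;  /* item was already in the queue */
	}
	return carry;
}
```

    No temporary copy of the array is built, and the early return is the
    skipped work for an entry that was already present.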

    Link: http://lkml.kernel.org/r/20161229193445.1913-1-ebiggers3@gmail.com
    Signed-off-by: Eric Biggers
    Cc: Alexander Viro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Biggers
     
  • Pull XFS updates from Darrick Wong:
    "Here are some changes for you for 4.13. For the most part it's fixes
    for bugs and deadlock problems, and preparation for online fsck in
    some future merge window.

    - Avoid quotacheck deadlocks

    - Fix transaction overflows when bunmapping fragmented files

    - Refactor directory readahead

    - Allow admin to configure if ASSERT is fatal

    - Improve transaction usage detail logging during overflows

    - Minor cleanups

    - Don't leak log items when the log shuts down

    - Remove double-underscore typedefs

    - Various preparation for online scrubbing

    - Introduce new error injection configuration sysfs knobs

    - Refactor dq_get_next to use extent map directly

    - Fix problems with iterating the page cache for unwritten data

    - Implement SEEK_{HOLE,DATA} via iomap

    - Refactor XFS to use iomap SEEK_HOLE and SEEK_DATA

    - Don't use MAXPATHLEN to check on-disk symlink target lengths"

    * tag 'xfs-4.13-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (48 commits)
    xfs: don't crash on unexpected holes in dir/attr btrees
    xfs: rename MAXPATHLEN to XFS_SYMLINK_MAXLEN
    xfs: fix contiguous dquot chunk iteration livelock
    xfs: Switch to iomap for SEEK_HOLE / SEEK_DATA
    vfs: Add iomap_seek_hole and iomap_seek_data helpers
    vfs: Add page_cache_seek_hole_data helper
    xfs: remove a whitespace-only line from xfs_fs_get_nextdqblk
    xfs: rewrite xfs_dq_get_next_id using xfs_iext_lookup_extent
    xfs: Check for m_errortag initialization in xfs_errortag_test
    xfs: grab dquots without taking the ilock
    xfs: fix semicolon.cocci warnings
    xfs: Don't clear SGID when inheriting ACLs
    xfs: free cowblocks and retry on buffered write ENOSPC
    xfs: replace log_badcrc_factor knob with error injection tag
    xfs: convert drop_writes to use the errortag mechanism
    xfs: remove unneeded parameter from XFS_TEST_ERROR
    xfs: expose errortag knobs via sysfs
    xfs: make errortag a per-mountpoint structure
    xfs: free uncommitted transactions during log recovery
    xfs: don't allow bmap on rt files
    ...

    Linus Torvalds
     

10 Jul, 2017

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "The first major feature for ext4 this merge window is the largedir
    feature, which allows ext4 directories to support over 2 billion
    directory entries (assuming ~64 byte file names; in practice, users
    will run into practical performance limits first.) This feature was
    originally written by the Lustre team, and credit goes to Artem
    Blagodarenko from Seagate for getting this feature upstream.

    The second major feature allows ext4 to support extended
    attribute values up to 64k. This feature was also originally from
    Lustre, and has been enhanced by Tahsin Erdogan from Google with a
    deduplication feature so that if multiple files have the same xattr
    value (for example, Windows ACL's stored by Samba), only one copy will
    be stored on disk for encoding and caching efficiency.

    We also have the usual set of bug fixes, cleanups, and optimizations"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (47 commits)
    ext4: fix spelling mistake: "prellocated" -> "preallocated"
    ext4: fix __ext4_new_inode() journal credits calculation
    ext4: skip ext4_init_security() and encryption on ea_inodes
    fs: generic_block_bmap(): initialize all of the fields in the temp bh
    ext4: change fast symlink test to not rely on i_blocks
    ext4: require key for truncate(2) of encrypted file
    ext4: don't bother checking for encryption key in ->mmap()
    ext4: check return value of kstrtoull correctly in reserved_clusters_store
    ext4: fix off-by-one fsmap error on 1k block filesystems
    ext4: return EFSBADCRC if a bad checksum error is found in ext4_find_entry()
    ext4: return EIO on read error in ext4_find_entry
    ext4: forbid encrypting root directory
    ext4: send parallel discards on commit completions
    ext4: avoid unnecessary stalls in ext4_evict_inode()
    ext4: add nombcache mount option
    ext4: strong binding of xattr inode references
    ext4: eliminate xattr entry e_hash recalculation for removes
    ext4: reserve space for xattr entries/names
    quota: add get_inode_usage callback to transfer multi-inode charges
    ext4: xattr inode deduplication
    ...

    Linus Torvalds
     

08 Jul, 2017

1 commit

  • Pull Writeback error handling updates from Jeff Layton:
    "This pile represents the bulk of the writeback error handling fixes
    that I have for this cycle. Some of the earlier patches in this pile
    may look trivial but they are prerequisites for later patches in the
    series.

    The aim of this set is to improve how we track and report writeback
    errors to userland. Most applications that care about data integrity
    will periodically call fsync/fdatasync/msync to ensure that their
    writes have made it to the backing store.

    For a very long time, we have tracked writeback errors using two flags
    in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a
    writeback error occurs (via mapping_set_error) and are cleared as a
    side-effect of filemap_check_errors (as you noted yesterday). This
    model really sucks for userland.

    Only the first task to call fsync (or msync or fdatasync) will see the
    error. Any subsequent task calling fsync on a file will get back 0
    (unless another writeback error occurs in the interim). If I have
    several tasks writing to a file and calling fsync to ensure that their
    writes got stored, then I need to have them coordinate with one
    another. That's difficult enough, but in a world of containerized
    setups that coordination may even not be possible.

    But wait...it gets worse!

    The calls to filemap_check_errors can be buried pretty far down in the
    call stack, and there are internal callers of filemap_write_and_wait
    and the like that also end up clearing those errors. Many of those
    callers ignore the error return from that function or return it to
    userland at nonsensical times (e.g. truncate() or stat()). If I get
    back -EIO on a truncate, there is no reason to think that it was
    because some previous writeback failed, and a subsequent fsync() will
    (incorrectly) return 0.

    This pile aims to do three things:

    1) ensure that when a writeback error occurs that that error will be
    reported to userland on a subsequent fsync/fdatasync/msync call,
    regardless of what internal callers are doing

    2) report writeback errors on all file descriptions that were open at
    the time that the error occurred. This is a user-visible change,
    but I think most applications are written to assume this behavior
    anyway. Those that aren't are unlikely to be hurt by it.

    3) document what filesystems should do when there is a writeback
    error. Today, there is very little consistency between them, and a
    lot of cargo-cult copying. We need to make it very clear what
    filesystems should do in this situation.

    To achieve this, the set adds a new data type (errseq_t) and then
    builds new writeback error tracking infrastructure around that. Once
    all of that is in place, we change the filesystems to use the new
    infrastructure for reporting wb errors to userland.

    Note that this is just the initial foray into cleaning up this mess.
    There is a lot of work remaining here:

    1) convert the rest of the filesystems in a similar fashion. Once the
    initial set is in, then I think most other fs' will be fairly
    simple to convert. Hopefully most of those can go in via individual
    filesystem trees.

    2) convert internal waiters on writeback to use errseq_t for
    detecting errors instead of relying on the AS_* flags. I have some
    draft patches for this for ext4, but they are not quite ready for
    prime time yet.

    This was a discussion topic this year at LSF/MM too. If you're
    interested in the gory details, LWN has some good articles about this:

    https://lwn.net/Articles/718734/
    https://lwn.net/Articles/724307/"
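
    The per-file-description reporting described in the message above can be
    modelled roughly as follows. This is a deliberately simplified sketch:
    the real errseq_t packs its state into a single word and handles more
    cases, and every name below (wb_err_model, wb_err_set, wb_err_sample,
    wb_err_check_and_advance) is hypothetical.

```c
#include <assert.h>

/* Simplified model of the error state kept in a mapping. */
struct wb_err_model {
	unsigned int seq;  /* bumped each time an error is recorded */
	int err;           /* most recent negative errno, 0 if none */
};

/* Record a writeback error on the mapping. */
static void wb_err_set(struct wb_err_model *m, int err)
{
	m->err = err;
	m->seq++;
}

/* Sample the current position, e.g. when a file description is opened. */
static unsigned int wb_err_sample(const struct wb_err_model *m)
{
	return m->seq;
}

/*
 * fsync-time check: if an error was recorded since `*since` was sampled,
 * report it and advance the cursor so it is reported once per cursor.
 */
static int wb_err_check_and_advance(const struct wb_err_model *m,
				    unsigned int *since)
{
	if (*since == m->seq)
		return 0;
	*since = m->seq;
	return m->err;
}
```

    Because each open file keeps its own cursor, one task collecting the
    error does not hide it from another, which is the problem with the
    shared AS_EIO and AS_ENOSPC flags.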

    * tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
    btrfs: minimal conversion to errseq_t writeback error reporting on fsync
    xfs: minimal conversion to errseq_t writeback error reporting
    ext4: use errseq_t based error handling for reporting data writeback errors
    fs: convert __generic_file_fsync to use errseq_t based reporting
    block: convert to errseq_t based writeback error tracking
    dax: set errors in mapping when writeback fails
    Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors
    mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
    fs: new infrastructure for writeback error handling and reporting
    lib: add errseq_t type and infrastructure for handling it
    mm: don't TestClearPageError in __filemap_fdatawait_range
    mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
    jbd2: don't clear and reset errors after waiting on writeback
    buffer: set errors in mapping at the time that the error occurs
    fs: check for writeback errors after syncing out buffers in generic_file_fsync
    buffer: use mapping_set_error instead of setting the flag
    mm: fix mapping_set_error call in me_pagecache_dirty

    Linus Torvalds
     

06 Jul, 2017

2 commits

  • I noticed on xfs that I could still sometimes get back an error on fsync
    on a fd that was opened after the error condition had been cleared.

    The problem is that the buffer code sets the write_io_error flag and
    then later checks that flag to set the error in the mapping. That flag
    persists for quite a while however. If the file is later opened with
    O_TRUNC, the buffers will then be invalidated and the mapping's error
    set such that a subsequent fsync will return error. I think this is
    incorrect, as there was no writeback between the open and fsync.

    Add a new mark_buffer_write_io_error operation that sets the flag and
    the error in the mapping at the same time. Replace all calls to
    set_buffer_write_io_error with mark_buffer_write_io_error, and remove
    the places that check this flag in order to set the error in the
    mapping.

    This sets the error in the mapping earlier, at the time that it's first
    detected.

    Signed-off-by: Jeff Layton
    Reviewed-by: Jan Kara
    Reviewed-by: Carlos Maiolino

    Jeff Layton
     
  • Signed-off-by: Jeff Layton
    Reviewed-by: Jan Kara
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Christoph Hellwig

    Jeff Layton
     

05 Jul, 2017

1 commit

  • KMSAN (KernelMemorySanitizer, a new error detection tool) reports the
    use of uninitialized memory in ext4_update_bh_state():

    ==================================================================
    BUG: KMSAN: use of unitialized memory
    CPU: 3 PID: 1 Comm: swapper/0 Tainted: G B 4.8.0-rc6+ #597
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
    01/01/2011
    0000000000000282 ffff88003cc96f68 ffffffff81f30856 0000003000000008
    ffff88003cc96f78 0000000000000096 ffffffff8169742a ffff88003cc96ff8
    ffffffff812fc1fc 0000000000000008 ffff88003a1980e8 0000000100000000
    Call Trace:
    [< inline >] __dump_stack lib/dump_stack.c:15
    [] dump_stack+0xa6/0xc0 lib/dump_stack.c:51
    [] kmsan_report+0x1ec/0x300 mm/kmsan/kmsan.c:?
    [] __msan_warning+0x2b/0x40 ??:?
    [< inline >] ext4_update_bh_state fs/ext4/inode.c:727
    [] _ext4_get_block+0x6ca/0x8a0 fs/ext4/inode.c:759
    [] ext4_get_block+0x8c/0xa0 fs/ext4/inode.c:769
    [] generic_block_bmap+0x246/0x2b0 fs/buffer.c:2991
    [] ext4_bmap+0x5ee/0x660 fs/ext4/inode.c:3177
    ...
    origin description: ----tmp@generic_block_bmap
    ==================================================================

    (the line numbers are relative to 4.8-rc6, but the bug persists
    upstream)

    The local |tmp| is created in generic_block_bmap() and then passed into
    ext4_bmap() => ext4_get_block() => _ext4_get_block() =>
    ext4_update_bh_state(). Along the way tmp.b_page is never initialized
    before ext4_update_bh_state() checks its value.

    [ Use the approach suggested by Kees Cook of initializing the whole bh
    structure.]
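
    As an illustration of initializing the whole structure, here is a sketch
    with a cut-down, hypothetical struct bh_model (the real struct
    buffer_head has more fields). It relies on the C rule that members not
    named in a designated initializer are zero-initialized.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct buffer_head with only a few fields. */
struct bh_model {
	unsigned long b_state;
	void *b_page;
	unsigned long long b_blocknr;
	size_t b_size;
};

/* Build a temporary bh; every field except b_size starts out as zero. */
static struct bh_model make_tmp_bh(size_t blocksize)
{
	struct bh_model tmp = {
		.b_size = blocksize,
	};

	return tmp;
}
```

    With this pattern a field such as b_page has a defined value even if no
    code on the call path assigns it.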

    Signed-off-by: Alexander Potapenko
    Signed-off-by: Theodore Ts'o

    Alexander Potapenko
     

03 Jul, 2017

1 commit

  • Both ext4 and xfs implement seeking for the next hole or piece of data
    in unwritten extents by scanning the page cache, and both versions share
    the same bug when iterating the buffers of a page: the start offset into
    the page isn't taken into account, so when a page fits more than two
    filesystem blocks, things will go wrong. For example, on a filesystem
    with a block size of 1k, the following command will fail:

    xfs_io -f -c "falloc 0 4k" \
    -c "pwrite 1k 1k" \
    -c "pwrite 3k 1k" \
    -c "seek -a -r 0" foo

    In this example, neither lseek(fd, 1024, SEEK_HOLE) nor lseek(fd, 2048,
    SEEK_DATA) will return the correct result.

    Introduce a generic vfs helper for seeking in the page cache that gets
    this right. The next commits will replace the filesystem specific
    implementations.
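
    The role of the start offset can be shown with a small model that works
    on per-block flags rather than on real pages. seek_in_blocks() is a
    hypothetical helper, not the vfs helper this commit adds, and unlike
    lseek() it returns -1 instead of treating end of file as a hole.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Scan per-block "has data" flags and return the byte offset of the first
 * position at or after `start` whose block state matches `want_data`, or
 * -1 if there is none. The scan begins at the block containing `start`,
 * not at the first block of the page.
 */
static long seek_in_blocks(const bool *has_data, size_t nblocks,
			   size_t blocksize, size_t start, bool want_data)
{
	for (size_t blk = start / blocksize; blk < nblocks; blk++) {
		if (has_data[blk] == want_data) {
			size_t off = blk * blocksize;

			return (long)(off > start ? off : start);
		}
	}
	return -1;
}
```

    For the layout written by the xfs_io command above (1k blocks, data in
    the second and fourth block), this model finds the hole at 2048 when
    starting from 1024 and the data at 3072 when starting from 2048.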

    Signed-off-by: Andreas Gruenbacher
    [hch: dropped the export]
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Andreas Gruenbacher
     

28 Jun, 2017

1 commit


09 Jun, 2017

1 commit

  • Replace bi_error with a new bi_status to allow for a clear conversion.
    Note that device mapper overloaded bi_error with a private value, which
    we'll have to keep around at least for now and thus propagate to a
    proper blk_status_t value.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

10 May, 2017

1 commit

  • Pull misc vfs updates from Al Viro:
    "Assorted bits and pieces from various people. No common topic in this
    pile, sorry"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs/affs: add rename exchange
    fs/affs: add rename2 to prepare multiple methods
    Make stat/lstat/fstatat pass AT_NO_AUTOMOUNT to vfs_statx()
    fs: don't set *REFERENCED on single use objects
    fs: compat: Remove warning from COMPATIBLE_IOCTL
    remove pointless extern of atime_need_update_rcu()
    fs: completely ignore unknown open flags
    fs: add a VALID_OPEN_FLAGS
    fs: remove _submit_bh()
    fs: constify tree_descr arrays passed to simple_fill_super()
    fs: drop duplicate header percpu-rwsem.h
    fs/affs: bugfix: Write files greater than page size on OFS
    fs/affs: bugfix: enable writes on OFS disks
    fs/affs: remove node generation check
    fs/affs: import amigaffs.h
    fs/affs: bugfix: make symbolic links work again

    Linus Torvalds
     

09 May, 2017

1 commit

  • Commit afddba49d18f ("fs: introduce write_begin, write_end, and
    perform_write aops") introduced AOP_FLAG_UNINTERRUPTIBLE flag which was
    checked in pagecache_write_begin(), but that check was removed by
    4e02ed4b4a2f ("fs: remove prepare_write/commit_write").

    Between these two commits, commit d9414774dc0c ("cifs: Convert cifs to
    new aops.") added a check in cifs_write_begin(), but that check was soon
    removed by commit a98ee8c1c707 ("[CIFS] fix regression in
    cifs_write_begin/cifs_write_end").

    Therefore, AOP_FLAG_UNINTERRUPTIBLE flag is checked nowhere. Let's
    remove this flag. This patch has no functionality changes.

    Link: http://lkml.kernel.org/r/1489294781-53494-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Reviewed-by: Jeff Layton
    Reviewed-by: Christoph Hellwig
    Cc: Nick Piggin
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

27 Apr, 2017

1 commit

  • _submit_bh() allowed submitting a buffer_head for I/O using custom
    bio_flags. It used to be used by jbd to set BIO_SNAP_STABLE, introduced
    by commit 713685111774 ("mm: make snapshotting pages for stable writes a
    per-bio operation"). However, the code and flag has since been removed
    and no _submit_bh() users remain.

    These days, bio_flags are mostly used internally by the block layer to
    track the state of bio's. As such, it doesn't really make sense for
    filesystems to use them instead of op_flags when wanting special
    behavior for block requests.

    Therefore, remove _submit_bh() and trim the bio_flags argument from
    submit_bh_wbc().

    Cc: Darrick J. Wong
    Signed-off-by: Eric Biggers
    Signed-off-by: Al Viro

    Eric Biggers
     

02 Mar, 2017

1 commit

  • Instead of including the full signal header, we are going to include the
    types-only signal header in the scheduler header (sched.h), to further
    decouple the scheduler header from the signal headers.

    This means that various files which relied on the full signal header need
    to be updated to gain an explicit dependency on it.

    Update the code that relies on sched.h's inclusion of the full signal header.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

28 Feb, 2017

1 commit

  • Replace all 1 << inode->i_blkbits and (1 << inode->i_blkbits) in fs
    branch.

    This patch also fixes multiple checkpatch warnings: WARNING: Prefer
    'unsigned int' to bare use of 'unsigned'

    Thanks to Andrew Morton for suggesting more appropriate function instead
    of macro.
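
    A sketch of the kind of helper this refers to, written against a
    hypothetical struct inode_model so that it compiles outside the kernel.
    The name blocksize_of() is illustrative.

```c
#include <assert.h>

/* Hypothetical stand-in for the one inode field the helper needs. */
struct inode_model {
	unsigned char i_blkbits;  /* log2 of the block size */
};

/* Replaces open-coded "1 << inode->i_blkbits" at each call site. */
static inline unsigned int blocksize_of(const struct inode_model *inode)
{
	return 1U << inode->i_blkbits;
}
```

    Using a function rather than a macro gives the result a definite
    unsigned int type and type-checks the argument.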

    [geliangtang@gmail.com: truncate: use i_blocksize()]
    Link: http://lkml.kernel.org/r/9c8b2cd83c8f5653805d43debde9fa8817e02fc4.1484895804.git.geliangtang@gmail.com
    Link: http://lkml.kernel.org/r/1481319905-10126-1-git-send-email-fabf@skynet.be
    Signed-off-by: Fabian Frederick
    Signed-off-by: Geliang Tang
    Cc: Alexander Viro
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     

03 Jan, 2017

1 commit


15 Dec, 2016

1 commit

  • Pull fs meta data unmap optimization from Jens Axboe:
    "A series from Jan Kara, providing a more efficient way for unmapping
    meta data from the buffer cache than doing it block-by-block.

    Provide a general helper that existing callers can use"

    * 'for-4.10/fs-unmap' of git://git.kernel.dk/linux-block:
    fs: Remove unmap_underlying_metadata
    fs: Add helper to clean bdev aliases under a bh and use it
    ext2: Use clean_bdev_aliases() instead of iteration
    ext4: Use clean_bdev_aliases() instead of iteration
    direct-io: Use clean_bdev_aliases() instead of handmade iteration
    fs: Provide function to unmap metadata for a range of blocks

    Linus Torvalds
     

14 Dec, 2016

1 commit

  • Pull block layer updates from Jens Axboe:
    "This is the main block pull request this series. Contrary to previous
    release, I've kept the core and driver changes in the same branch. We
    always ended up having dependencies between the two for obvious
    reasons, so makes more sense to keep them together. That said, I'll
    probably try and keep more topical branches going forward, especially
    for cycles that end up being as busy as this one.

    The major parts of this pull request are:

    - Improved support for O_DIRECT on block devices, with a small
    private implementation instead of using the pig that is
    fs/direct-io.c. From Christoph.

    - Request completion tracking in a scalable fashion. This is utilized
    by two components in this pull, the new hybrid polling and the
    writeback queue throttling code.

    - Improved support for polling with O_DIRECT, adding a hybrid mode
    that combines pure polling with an initial sleep. From me.

    - Support for automatic throttling of writeback queues on the block
    side. This uses feedback from the device completion latencies to
    scale the queue on the block side up or down. From me.

    - Support for SMR drives in the block layer and for SD. From Hannes
    and Shaun.

    - Multi-connection support for nbd. From Josef.

    - Cleanup of request and bio flags, so we have a clear split between
    which are bio (or rq) private, and which ones are shared. From
    Christoph.

    - A set of patches from Bart, that improve how we handle queue
    stopping and starting in blk-mq.

    - Support for WRITE_ZEROES from Chaitanya.

    - Lightnvm updates from Javier/Matias.

    - Support for FC for the nvme-over-fabrics code. From James Smart.

    - A bunch of fixes from a whole slew of people, too many to name
    here"

    * 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits)
    blk-stat: fix a few cases of missing batch flushing
    blk-flush: run the queue when inserting blk-mq flush
    elevator: make the rqhash helpers exported
    blk-mq: abstract out blk_mq_dispatch_rq_list() helper
    blk-mq: add blk_mq_start_stopped_hw_queue()
    block: improve handling of the magic discard payload
    blk-wbt: don't throttle discard or write zeroes
    nbd: use dev_err_ratelimited in io path
    nbd: reset the setup task for NBD_CLEAR_SOCK
    nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME
    nvme-fabrics: Add target support for FC transport
    nvme-fabrics: Add host support for FC transport
    nvme-fabrics: Add FC transport LLDD api definitions
    nvme-fabrics: Add FC transport FC-NVME definitions
    nvme-fabrics: Add FC transport error codes to nvme.h
    Add type 0x28 NVME type code to scsi fc headers
    nvme-fabrics: patch target code in prep for FC transport support
    nvme-fabrics: set sqe.command_id in core not transports
    parser: add u64 number parser
    nvme-rdma: align to generic ib_event logging helper
    ...

    Linus Torvalds
     

10 Nov, 2016

1 commit


05 Nov, 2016

3 commits


03 Nov, 2016

1 commit

  • Add wbc_to_write_flags(), which returns the write modifier flags to use,
    based on a struct writeback_control. No functional changes in this
    patch, but it prepares us for factoring other wbc fields for write type.
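
    A rough model of such a helper follows. The types and the flag value are
    placeholders, and the assumption that a data-integrity sync mode maps to
    a "sync" write flag is not stated in this commit message.

```c
#include <assert.h>

/* Placeholder types and values for illustration only. */
enum sync_mode_model { SYNC_NONE_MODEL, SYNC_ALL_MODEL };
#define WRITE_SYNC_MODEL 0x1u

struct wbc_model {
	enum sync_mode_model sync_mode;
};

/* Derive the write modifier flags from the writeback control state. */
static unsigned int wbc_flags_model(const struct wbc_model *wbc)
{
	return wbc->sync_mode == SYNC_ALL_MODEL ? WRITE_SYNC_MODEL : 0;
}
```

    Keeping the decision in one helper means later patches can add further
    inputs without touching every caller.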

    Signed-off-by: Jens Axboe
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig

    Jens Axboe
     

01 Nov, 2016

1 commit


28 Oct, 2016

1 commit

  • Now that we don't need the common flags to overflow outside the range
    of a 32-bit type we can encode them the same way for both the bio and
    request fields. This in addition allows us to place the operation
    first (and make some room for more ops while we're at it) and to
    stop having to shift around the operation values.

    In addition this allows passing around only one value in the block layer
    instead of two (and eventually also in the file systems, but we can do
    that later) and thus clean up a lot of code.

    Last but not least this allows decreasing the size of the cmd_flags
    field in struct request to 32-bits. Various functions passing this
    value could also be updated, but I'd like to avoid the churn for now.
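
    The encoding can be pictured like this: the operation sits in the low
    bits of a 32-bit word and the modifier flags occupy the bits above it.
    The bit width and the flag names below are assumptions made for the
    sketch, not values taken from this commit.

```c
#include <assert.h>
#include <stdint.h>

#define OP_BITS_MODEL   8u                          /* assumed width */
#define OP_MASK_MODEL   ((1u << OP_BITS_MODEL) - 1u)
#define FLAG_SYNC_MODEL (1u << OP_BITS_MODEL)       /* placeholder flag */
#define FLAG_META_MODEL (1u << (OP_BITS_MODEL + 1)) /* placeholder flag */

/* Extract the operation: no shifting needed, it lives in the low bits. */
static inline uint32_t op_of(uint32_t opf)
{
	return opf & OP_MASK_MODEL;
}

/* Extract the modifier flags carried alongside the operation. */
static inline uint32_t flags_of(uint32_t opf)
{
	return opf & ~OP_MASK_MODEL;
}
```

    A caller passes one combined value, for example an operation number
    OR-ed with FLAG_SYNC_MODEL, instead of two separate arguments.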

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

12 Oct, 2016

1 commit

  • The mapping_set_error() helper sets the correct AS_ flag for the mapping
    so there is no reason to open code it. Use the helper directly.

    [akpm@linux-foundation.org: be honest about conversion from -ENXIO to -EIO]
    Link: http://lkml.kernel.org/r/20160912111608.2588-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

28 Sep, 2016

1 commit

  • __getblk_slow() was exported to modules in commit 3b5e6454aaf6
    ("fs/buffer.c: support buffer cache allocations with gfp modifiers").
    This seems to have been a mistake, as no users were introduced nor was
    the function declared in a header. Change it back to 'static'.

    Signed-off-by: Eric Biggers
    Signed-off-by: Al Viro

    Eric Biggers
     

28 Jul, 2016

1 commit

  • Pull xfs updates from Dave Chinner:
    "The major addition is the new iomap based block mapping
    infrastructure. We've been kicking this about locally for years, but
    other filesystems want to use it too (e.g. gfs2). Now it is fully
    working, reviewed and ready to be merged and used by other
    filesystems.

    There are a lot of other fixes and cleanups in the tree, but those are
    XFS internal things and none are of the scale or visibility of the
    iomap changes. See below for details.

    I am likely to send another pull request next week - we're just about
    ready to merge some new functionality (on disk block->owner reverse
    mapping infrastructure), but that's a huge chunk of code (74 files
    changed, 7283 insertions(+), 1114 deletions(-)) so I'm keeping that
    separate to all the "normal" pull request changes so they don't get
    lost in the noise.

    Summary of changes in this update:
    - generic iomap based IO path infrastructure
    - generic iomap based fiemap implementation
    - xfs iomap based IO path implementation
    - buffer error handling fixes
    - tracking of in flight buffer IO for unmount serialisation
    - direct IO and DAX io path separation and simplification
    - shortform directory format definition changes for wider platform
    compatibility
    - various buffer cache fixes
    - cleanups in preparation for rmap merge
    - error injection cleanups and fixes
    - log item format buffer memory allocation restructuring to prevent
    rare OOM reclaim deadlocks
    - sparse inode chunks are now fully supported"

    * tag 'xfs-for-linus-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (53 commits)
    xfs: remove EXPERIMENTAL tag from sparse inode feature
    xfs: bufferhead chains are invalid after end_page_writeback
    xfs: allocate log vector buffers outside CIL context lock
    libxfs: directory node splitting does not have an extra block
    xfs: remove dax code from object file when disabled
    xfs: skip dirty pages in ->releasepage()
    xfs: remove __arch_pack
    xfs: kill xfs_dir2_inou_t
    xfs: kill xfs_dir2_sf_off_t
    xfs: split direct I/O and DAX path
    xfs: direct calls in the direct I/O path
    xfs: stop using generic_file_read_iter for direct I/O
    xfs: split xfs_file_read_iter into buffered and direct I/O helpers
    xfs: remove s_maxbytes enforcement in xfs_file_read_iter
    xfs: kill ioflags
    xfs: don't pass ioflags around in the ioctl path
    xfs: track and serialize in-flight async buffers against unmount
    xfs: exclude never-released buffers from buftarg I/O accounting
    xfs: don't reset b_retries to 0 on every failure
    xfs: remove extraneous buffer flag changes
    ...

    Linus Torvalds
     

27 Jul, 2016

2 commits

  • Pull block driver updates from Jens Axboe:
    "This branch also contains core changes. I've come to the conclusion
    that from 4.9 and forward, I'll be doing just a single branch. We
    often have dependencies between core and drivers, and it's hard to
    always split them up appropriately without pulling core into drivers
    when that happens.

    That said, this contains:

    - separate secure erase type for the core block layer, from
    Christoph.

    - set of discard fixes, from Christoph.

    - bio shrinking fixes from Christoph, as a followup to the
    op/flags change in the core branch.

    - map and append request fixes from Christoph.

    - NVMeF (NVMe over Fabrics) code from Christoph. This is pretty
    exciting!

    - nvme-loop fixes from Arnd.

    - removal of ->driverfs_dev from Dan, after providing a
    device_add_disk() helper.

    - bcache fixes from Bhaktipriya and Yijing.

    - cdrom subchannel read fix from Vchannaiah.

    - set of lightnvm updates from Wenwei, Matias, Johannes, and Javier.

    - set of drbd updates and fixes from Fabian, Lars, and Philipp.

    - mg_disk error path fix from Bart.

    - user notification for failed device add for loop, from Minfei.

    - NVMe in general:
    + NVMe delay quirk from Guilherme.
    + SR-IOV support and command retry limits from Keith.
    + fix for memory-less NUMA node from Masayoshi.
    + use UINT_MAX for discard sectors, from Minfei.
    + cancel IO fixes from Ming.
    + don't allocate unused major, from Neil.
    + error code fixup from Dan.
    + use constants for PSDT/FUSE from James.
    + variable init fix from Jay.
    + fabrics fixes from Ming, Sagi, and Wei.
    + various fixes"

    * 'for-4.8/drivers' of git://git.kernel.dk/linux-block: (115 commits)
    nvme/pci: Provide SR-IOV support
    nvme: initialize variable before logical OR'ing it
    block: unexport various bio mapping helpers
    scsi/osd: open code blk_make_request
    target: stop using blk_make_request
    block: simplify and export blk_rq_append_bio
    block: ensure bios return from blk_get_request are properly initialized
    virtio_blk: use blk_rq_map_kern
    memstick: don't allow REQ_TYPE_BLOCK_PC requests
    block: shrink bio size again
    block: simplify and cleanup bvec pool handling
    block: get rid of bio_rw and READA
    block: don't ignore -EOPNOTSUPP blkdev_issue_write_same
    block: introduce BLKDEV_DISCARD_ZERO to fix zeroout
    NVMe: don't allocate unused nvme_major
    nvme: avoid crashes when node 0 is memoryless node.
    nvme: Limit command retries
    loop: Make user notify for adding loop device failed
    nvme-loop: fix nvme-loop Kconfig dependencies
    nvmet: fix return value check in nvmet_subsys_alloc()
    ...

    Linus Torvalds
     
  • Pull core block updates from Jens Axboe:

    - the big change is the cleanup from Mike Christie, cleaning up our
    uses of command types and modifier flags. This is what will throw
    some merge conflicts

    - regression fix for the above for btrfs, from Vincent

    - following up to the above, better packing of struct request from
    Christoph

    - a 2038 fix for blktrace from Arnd

    - a few trivial/spelling fixes from Bart Van Assche

    - a front merge check fix from Damien, which could cause issues on
    SMR drives

    - Atari partition fix from Gabriel

    - convert cfq to highres timers, since jiffies isn't granular enough
    for some devices these days. From Jan and Jeff

    - CFQ priority boost fix for idle classes, from me

    - cleanup series from Ming, improving our bio/bvec iteration

    - a direct issue fix for blk-mq from Omar

    - fix for plug merging not involving the IO scheduler, like we do for
    other types of merges. From Tahsin

    - expose DAX type internally and through sysfs. From Toshi and Yigal

    * 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits)
    block: Fix front merge check
    block: do not merge requests without consulting with io scheduler
    block: Fix spelling in a source code comment
    block: expose QUEUE_FLAG_DAX in sysfs
    block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
    Btrfs: fix comparison in __btrfs_map_block()
    block: atari: Return early for unsupported sector size
    Doc: block: Fix a typo in queue-sysfs.txt
    cfq-iosched: Charge at least 1 jiffie instead of 1 ns
    cfq-iosched: Fix regression in bonnie++ rewrite performance
    cfq-iosched: Convert slice_resid from u64 to s64
    block: Convert fifo_time from ulong to u64
    blktrace: avoid using timespec
    block/blk-cgroup.c: Declare local symbols static
    block/bio-integrity.c: Add #include "blk.h"
    block/partition-generic.c: Remove a set-but-not-used variable
    block: bio: kill BIO_MAX_SIZE
    cfq-iosched: temporarily boost queue priority for idle classes
    block: drbd: avoid to use BIO_MAX_SIZE
    block: bio: remove BIO_MAX_SECTORS
    ...

    Linus Torvalds
     

21 Jul, 2016

1 commit

  • These two are confusing leftovers of the old world order, combining
    values of the REQ_OP_ and REQ_ namespaces. For callers that don't
    special case we mostly just replace bi_rw with bio_data_dir or
    op_is_write, except for the few cases where a switch over the REQ_OP_
    values makes more sense. Any check for READA is replaced with an
    explicit check for REQ_RAHEAD. Also remove the READA alias for
    REQ_RAHEAD.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Mike Christie
    Signed-off-by: Jens Axboe

    Christoph Hellwig
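
    The two kinds of replacement check mentioned above can be sketched as
    follows. This is an illustrative userspace model: the operation values,
    the flag bit and the rule that every operation other than a read counts
    as a write are assumptions, and none of the _sketch names are kernel
    definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical operation namespace, kept separate from the flag namespace. */
enum req_op_sketch { OP_READ_SKETCH, OP_WRITE_SKETCH, OP_DISCARD_SKETCH };

#define RAHEAD_SKETCH (1u << 0)	/* assumed readahead modifier flag */

struct bio_sketch {
	enum req_op_sketch op;	/* what to do */
	unsigned int opf;	/* how to do it (modifier flags) */
};

/* Direction check: assumed rule is "anything that is not a read". */
static bool op_is_write_sketch(enum req_op_sketch op)
{
	return op != OP_READ_SKETCH;
}

/* Explicit flag test in place of comparing against a combined alias. */
static bool bio_is_readahead_sketch(const struct bio_sketch *bio)
{
	assert(bio != NULL);
	return (bio->opf & RAHEAD_SKETCH) != 0;
}
```

    Keeping the direction test and the readahead test apart avoids mixing
    values from the two namespaces in a single comparison.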
     

27 Jun, 2016

1 commit

  • gfs2 needs to be able to skip the check to see if a page is outside of
    the file size when writing it out. gfs2 can get into a situation where
    it needs to flush its in-memory log to disk while a truncate is in
    progress. If the file being truncated has data journaling enabled, it is
    possible that there are data blocks in the log that are past the end of
    the file. gfs2 can't finish the log flush without either writing these
    blocks out or revoking them. Otherwise, if the node crashed, it could
    overwrite subsequent changes made by other nodes in the cluster when
    its journal was replayed.

    Unfortunately, there is no way to add log entries to the log during a
    flush. So gfs2 simply writes out the page instead. This situation can
    only occur when the truncate code still has the file locked exclusively,
    and hasn't marked this block as free in the metadata (which happens
    later in trunc_dealloc). After gfs2 writes this page out, the truncation
    code will shortly invalidate it and write out any revokes if necessary.

    In order to make this work, gfs2 needs to be able to skip the check for
    writes outside the file size. Since the check exists in
    block_write_full_page, this patch exports __block_write_full_page, which
    doesn't have the check.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Bob Peterson

    Benjamin Marzinski
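
    The file-size check that gfs2 needs to bypass can be modelled as a pure
    decision function. This is a sketch under assumptions: the 4096-byte
    page size, the names and the exact boundary handling are illustrative
    and are not taken from the commit text, which only states that the
    check lives in block_write_full_page and is absent from
    __block_write_full_page.

```c
#include <assert.h>

#define PAGE_SHIFT_SKETCH 12u				/* assumed 4096-byte pages */
#define PAGE_SIZE_SKETCH  (1ull << PAGE_SHIFT_SKETCH)

enum write_decision_sketch {
	WRITE_FULL_SKETCH,	/* page lies entirely inside the file */
	WRITE_ZERO_TAIL_SKETCH,	/* page straddles the end of the file */
	SKIP_BEYOND_EOF_SKETCH	/* page lies entirely past the end of the file */
};

/* Model of the size check a checked write path applies before writing. */
static enum write_decision_sketch
size_check_sketch(unsigned long long i_size, unsigned long long page_index)
{
	unsigned long long end_index = i_size >> PAGE_SHIFT_SKETCH;
	unsigned long long offset = i_size & (PAGE_SIZE_SKETCH - 1);

	assert(PAGE_SIZE_SKETCH == 4096);
	if (page_index < end_index)
		return WRITE_FULL_SKETCH;
	if (page_index >= end_index + 1 || offset == 0)
		return SKIP_BEYOND_EOF_SKETCH;
	return WRITE_ZERO_TAIL_SKETCH;
}
```

    In this model, a journaled data block past the end of a file that is
    being truncated falls into the skip case, which is why a caller in
    that situation needs an entry point that performs the write without
    consulting the check.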
     

21 Jun, 2016

1 commit

  • Add infrastructure for multipage buffered writes. This is implemented
    using a main iterator that applies an actor function to a range that
    can be written.

    This infrastructure is used to implement a buffered write helper, one
    to zero file ranges and one to implement the ->page_mkwrite VM
    operations. All of them borrow a fair amount of code from fs/buffer.c
    for now by using an internal version of __block_write_begin that
    gets passed an iomap and builds the corresponding buffer head.

    The file system gets a set of paired ->iomap_begin and ->iomap_end
    calls which allow it to map/reserve a range and get a notification
    once the write code is finished with it.

    Based on earlier code from Dave Chinner.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Bob Peterson
    Signed-off-by: Dave Chinner

    Christoph Hellwig
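
    The iterator/actor pattern described in this commit can be sketched in
    userspace C. Everything here is a hypothetical model: the struct
    layouts, the function signatures and the demo callbacks (a pretend file
    system that maps in 4096-byte extents) are assumptions chosen to show
    the paired begin/end calls around an actor, not the kernel interface.

```c
#include <assert.h>
#include <stddef.h>

/* All names below are hypothetical; they only model the described pattern. */
struct iomap_sketch {
	long long offset;	/* start of the mapped extent, in bytes */
	long long length;	/* length of the mapped extent, in bytes */
};

struct iomap_ops_sketch {
	int (*iomap_begin)(long long pos, long long len, struct iomap_sketch *map);
	int (*iomap_end)(long long pos, long long len, long long written,
			 struct iomap_sketch *map);
};

typedef long long (*iomap_actor_sketch)(long long pos, long long len,
					void *data, const struct iomap_sketch *map);

/* Map one extent, run the actor on the part it covers, then notify the fs. */
static long long iomap_apply_once_sketch(const struct iomap_ops_sketch *ops,
					 long long pos, long long len,
					 void *data, iomap_actor_sketch actor)
{
	struct iomap_sketch map = { 0, 0 };
	long long written;
	int ret;

	assert(ops != NULL && actor != NULL);
	ret = ops->iomap_begin(pos, len, &map);
	if (ret)
		return ret;
	/* Never hand the actor more than the mapping actually covers. */
	if (map.offset + map.length < pos + len)
		len = map.offset + map.length - pos;
	written = actor(pos, len, data, &map);
	ret = ops->iomap_end(pos, len, written > 0 ? written : 0, &map);
	return written != 0 ? written : ret;
}

/* The main iterator: repeat until the range is finished or a step fails. */
static long long iomap_apply_range_sketch(const struct iomap_ops_sketch *ops,
					  long long pos, long long len,
					  void *data, iomap_actor_sketch actor)
{
	long long done = 0;

	while (len > 0) {
		long long ret = iomap_apply_once_sketch(ops, pos, len, data, actor);

		if (ret <= 0)
			return done > 0 ? done : ret;
		pos += ret;
		len -= ret;
		done += ret;
	}
	return done;
}

/* Demo callbacks: a pretend file system that maps in 4096-byte extents. */
static int demo_begin_calls;
static int demo_end_calls;

static int demo_begin(long long pos, long long len, struct iomap_sketch *map)
{
	(void)len;
	map->offset = pos - (pos % 4096);
	map->length = 4096;
	demo_begin_calls++;
	return 0;
}

static int demo_end(long long pos, long long len, long long written,
		    struct iomap_sketch *map)
{
	(void)pos; (void)len; (void)written; (void)map;
	demo_end_calls++;
	return 0;
}

/* Demo actor: "writes" by adding the byte count to the caller's total. */
static long long demo_actor(long long pos, long long len, void *data,
			    const struct iomap_sketch *map)
{
	(void)pos; (void)map;
	*(long long *)data += len;
	return len;
}

static const struct iomap_ops_sketch demo_ops = { demo_begin, demo_end };
```

    With the demo callbacks, a 10000-byte range starting at offset 1000 is
    split at the extent boundaries, so the actor runs once per extent and
    each run is bracketed by one begin call and one end call.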