14 Jan, 2021

1 commit

  • When a buffer is added to the LRU list, a reference is taken which is
    not dropped until the buffer is evicted from the LRU list. This is the
    correct behavior; however, this LRU reference will prevent the buffer
    from being dropped. This means that the buffer can't actually be dropped
    until it is selected for eviction. There's no bound on the time spent
    on the LRU list, which means that the buffer may be undroppable for
    very long periods of time. Given that migration involves dropping
    buffers, the associated page is now unmigratable for long periods of
    time as well. CMA relies on being able to migrate a specific range
    of pages, so these types of failures make CMA significantly
    less reliable, especially under high filesystem usage.

    Rather than waiting for the LRU algorithm to eventually kick out
    the buffer, explicitly remove the buffer from the LRU list when trying
    to drop it. There is still the possibility that the buffer
    could be added back to the list, but that indicates the buffer is
    still in use and would probably have other 'in use' indicators
    preventing it from being dropped.

    Note: a bug reported by "kernel test robot" led to a switch from
    using xas_for_each() to xa_for_each().
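
    A minimal sketch of the idea, assuming the per-CPU LRU arrays in
    fs/buffer.c (the helper name and exact eviction logic here are
    hypothetical):

    /* Hypothetical sketch: evict one buffer_head from this CPU's LRU so
     * its LRU reference is dropped immediately instead of at eviction. */
    static void __evict_bh_lru(void *arg)
    {
            struct bh_lru *b = this_cpu_ptr(&bh_lrus);
            struct buffer_head *bh = arg;   /* buffer we want to drop */
            int i;

            for (i = 0; i < BH_LRU_SIZE; i++) {
                    if (b->bhs[i] == bh) {
                            b->bhs[i] = NULL;
                            brelse(bh);     /* drop the LRU reference */
                            break;
                    }
            }
    }

    Such a helper would be invoked on every CPU (e.g. via on_each_cpu())
    before the final attempt to drop the buffer.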

    Bug: 174118021
    Link: https://lore.kernel.org/linux-mm/cover.1610572007.git.cgoldswo@codeaurora.org/
    Signed-off-by: Laura Abbott
    Signed-off-by: Chris Goldsworthy
    Cc: Matthew Wilcox
    Reported-by: kernel test robot
    Change-Id: I4a93c4ed81c57874764d12f3beea1194a30c13b2

    Laura Abbott
     

19 Oct, 2020

1 commit

  • Currently the remote memcg charging API consists of two functions:
    memalloc_use_memcg() and memalloc_unuse_memcg(), which set and clear the
    active memcg value, overriding the memcg of the current task.

    memalloc_use_memcg(target_memcg);

    memalloc_unuse_memcg();

    It works perfectly for allocations performed from a normal context,
    however an attempt to call it from an interrupt context or to nest two
    remote charging blocks will lead to incorrect accounting. On exit from
    the inner block the active memcg will be cleared instead of being
    restored.

    memalloc_use_memcg(target_memcg);

    memalloc_use_memcg(target_memcg_2);

    memalloc_unuse_memcg();

    Error: allocations here are charged to the memcg of the current
    process instead of target_memcg.

    memalloc_unuse_memcg();

    This patch extends the remote charging API by switching to a single
    function: struct mem_cgroup *set_active_memcg(struct mem_cgroup *memcg),
    which sets the new value and returns the old one. So a remote charging
    block will look like:

    old_memcg = set_active_memcg(target_memcg);

    set_active_memcg(old_memcg);
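
    With set_active_memcg(), nesting works because each block restores the
    value it replaced; a short sketch based on the pattern above:

    struct mem_cgroup *old_memcg, *old_memcg_2;

    old_memcg = set_active_memcg(target_memcg);
    /* allocations here are charged to target_memcg */
    old_memcg_2 = set_active_memcg(target_memcg_2);
    /* allocations here are charged to target_memcg_2 */
    set_active_memcg(old_memcg_2);  /* back to target_memcg */
    set_active_memcg(old_memcg);    /* back to the original memcg */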

    This patch is heavily based on the patch by Johannes Weiner, which can be
    found here: https://lkml.org/lkml/2020/5/28/806 .

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Dan Schatzberg
    Link: https://lkml.kernel.org/r/20200821212056.3769116-1-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

08 Sep, 2020

1 commit

  • If block_write_full_page() is called for a page that is beyond current
    inode size, it will truncate page buffers for the page and return 0.
    This logic has been added in 2.5.62 in commit 81eb69062588 ("fix ext3
    BUG due to race with truncate") in history.git tree to fix a problem
    with ext3 in data=ordered mode. This particular problem doesn't exist
    anymore because ext3 is long gone and ext4 handles ordered data
    differently. Also normally buffers are invalidated by truncate code and
    there's no need to specially handle this in ->writepage() code.

    This invalidation of page buffers in block_write_full_page() is causing
    issues for filesystems (e.g. ext4 or ocfs2) when the block device is
    shrunk under the filesystem's hands and metadata buffers get discarded
    while being tracked by the journalling layer. Although it is obviously
    "not supported", it can cause kernel crashes like:

    [ 7986.689400] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    [ 7986.697197] PGD 0 P4D 0
    [ 7986.699724] Oops: 0002 [#1] SMP PTI
    [ 7986.703200] CPU: 4 PID: 203778 Comm: jbd2/dm-3-8 Kdump: loaded Tainted: G O --------- - - 4.18.0-147.5.0.5.h126.eulerosv2r9.x86_64 #1
    [ 7986.716438] Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 1.57 08/11/2015
    [ 7986.723462] RIP: 0010:jbd2_journal_grab_journal_head+0x1b/0x40 [jbd2]
    ...
    [ 7986.810150] Call Trace:
    [ 7986.812595] __jbd2_journal_insert_checkpoint+0x23/0x70 [jbd2]
    [ 7986.818408] jbd2_journal_commit_transaction+0x155f/0x1b60 [jbd2]
    [ 7986.836467] kjournald2+0xbd/0x270 [jbd2]

    which is not great. The crash happens because bh->b_private is suddenly
    NULL although the BH_JBD flag is still set (this is because
    block_invalidatepage() cleared the BH_Mapped flag and a subsequent bh
    lookup found the buffer without BH_Mapped set, called
    init_page_buffers(), which rewrote bh->b_private). So just remove the
    invalidation in block_write_full_page().

    Note that the buffer cache invalidation when block device changes size
    is already careful to avoid similar problems by using
    invalidate_mapping_pages() which skips busy buffers so it was only this
    odd block_write_full_page() behavior that could tear down bdev buffers
    under filesystem's hands.
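
    A sketch of the hunk being removed, reconstructed from the description
    above (the exact code in fs/buffer.c may differ slightly):

    /* Is the page fully outside i_size? (truncate in progress) */
    if (page->index >= end_index + 1 || !offset) {
            do_invalidatepage(page, 0, PAGE_SIZE);  /* <-- dropped */
            unlock_page(page);
            return 0;       /* don't care */
    }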

    Reported-by: Ye Bin
    Signed-off-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    CC: stable@vger.kernel.org
    Signed-off-by: Jens Axboe

    Jan Kara
     

24 Aug, 2020

1 commit

  • Replace the existing /* fall through */ comments and their variants with
    the new pseudo-keyword macro fallthrough [1]. Also, remove unnecessary
    fall-through markings where applicable.

    [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through
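
    A hypothetical switch statement showing the conversion (the case labels
    and helpers here are made up):

    switch (bh_state) {
    case BH_STATE_NEW:
            setup_buffer(bh);
            fallthrough;    /* replaces the old fall-through comment */
    case BH_STATE_MAPPED:
            submit_buffer(bh);
            break;
    }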

    Signed-off-by: Gustavo A. R. Silva

    Gustavo A. R. Silva
     

22 Aug, 2020

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "Improvements to ext4's block allocator performance for very large file
    systems, especially when the file system or files which are highly
    fragmented. There is a new mount option, prefetch_block_bitmaps which
    will pull in the block bitmaps and set up the in-memory buddy bitmaps
    when the file system is initially mounted.

    Beyond that, a lot of bug fixes and cleanups. In particular, a number
    of changes to make ext4 more robust in the face of write errors or
    file system corruptions"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (46 commits)
    ext4: limit the length of per-inode prealloc list
    ext4: reorganize if statement of ext4_mb_release_context()
    ext4: add mb_debug logging when there are lost chunks
    ext4: Fix comment typo "the the".
    jbd2: clean up checksum verification in do_one_pass()
    ext4: change to use fallthrough macro
    ext4: remove unused parameter of ext4_generic_delete_entry function
    mballoc: replace seq_printf with seq_puts
    ext4: optimize the implementation of ext4_mb_good_group()
    ext4: delete invalid comments near ext4_mb_check_limits()
    ext4: fix typos in ext4_mb_regular_allocator() comment
    ext4: fix checking of directory entry validity for inline directories
    fs: prevent BUG_ON in submit_bh_wbc()
    ext4: correctly restore system zone info when remount fails
    ext4: handle add_system_zone() failure in ext4_setup_system_zone()
    ext4: fold ext4_data_block_valid_rcu() into the caller
    ext4: check journal inode extents more carefully
    ext4: don't allow overlapping system zones
    ext4: handle error of ext4_setup_system_zone() on remount
    ext4: delete the invalid BUGON in ext4_mb_load_buddy_gfp()
    ...

    Linus Torvalds
     

08 Aug, 2020

1 commit

  • If a device is hot-removed --- for example, when a physical device is
    unplugged from a PCIe slot or an nbd device's network is shut down ---
    this can result in a BUG_ON() crash in submit_bh_wbc(). This is
    because when the block device dies, the buffer heads will have
    their Buffer_Mapped flag cleared, leading to the crash in
    submit_bh_wbc().

    We had attempted to work around this problem in commit a17712c8
    ("ext4: check superblock mapped prior to committing"). Unfortunately,
    it's still possible to hit the BUG_ON(!buffer_mapped(bh)) if the
    device dies between the work-around check in ext4_commit_super()
    and the eventual call to submit_bh_wbc():

    Code path:
    ext4_commit_super
        check if 'buffer_mapped(sbh)' is false, return

    Xianting Tian
     

04 Aug, 2020

1 commit

  • Pull core block updates from Jens Axboe:
    "Good amount of cleanups and tech debt removals in here, and as a
    result, the diffstat shows a nice net reduction in code.

    - Softirq completion cleanups (Christoph)

    - Stop using ->queuedata (Christoph)

    - Cleanup bd claiming (Christoph)

    - Use check_events, moving away from the legacy media change
    (Christoph)

    - Use inode i_blkbits consistently (Christoph)

    - Remove old unused writeback congestion bits (Christoph)

    - Cleanup/unify submission path (Christoph)

    - Use bio_uninit consistently, instead of bio_disassociate_blkg
    (Christoph)

    - sbitmap cleared bits handling (John)

    - Request merging blktrace event addition (Jan)

    - sysfs add/remove race fixes (Luis)

    - blk-mq tag fixes/optimizations (Ming)

    - Duplicate words in comments (Randy)

    - Flush deferral cleanup (Yufen)

    - IO context locking/retry fixes (John)

    - struct_size() usage (Gustavo)

    - blk-iocost fixes (Chengming)

    - blk-cgroup IO stats fixes (Boris)

    - Various little fixes"

    * tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block: (135 commits)
    block: blk-timeout: delete duplicated word
    block: blk-mq-sched: delete duplicated word
    block: blk-mq: delete duplicated word
    block: genhd: delete duplicated words
    block: elevator: delete duplicated word and fix typos
    block: bio: delete duplicated words
    block: bfq-iosched: fix duplicated word
    iocost_monitor: start from the oldest usage index
    iocost: Fix check condition of iocg abs_vdebt
    block: Remove callback typedefs for blk_mq_ops
    block: Use non _rcu version of list functions for tag_set_list
    blk-cgroup: show global disk stats in root cgroup io.stat
    blk-cgroup: make iostat functions visible to stat printing
    block: improve discard bio alignment in __blkdev_issue_discard()
    block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers
    block: defer flush request no matter whether we have elevator
    block: make blk_timeout_init() static
    block: remove retry loop in ioc_release_fn()
    block: remove unnecessary ioc nested locking
    block: integrate bd_start_claiming into __blkdev_get
    ...

    Linus Torvalds
     

09 Jul, 2020

1 commit

  • Wire up ext4 to support inline encryption via the helper functions which
    fs/crypto/ now provides. This includes:

    - Adding a mount option 'inlinecrypt' which enables inline encryption
    on encrypted files where it can be used.

    - Setting the bio_crypt_ctx on bios that will be submitted to an
    inline-encrypted file.

    Note: submit_bh_wbc() in fs/buffer.c also needed to be patched for
    this part, since ext4 sometimes uses ll_rw_block() on file data; see
    the sketch after this list.

    - Not adding logically discontiguous data to bios that will be submitted
    to an inline-encrypted file.

    - Not doing filesystem-layer crypto on inline-encrypted files.
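
    A sketch of that submit_bh_wbc() hunk, assuming the
    fscrypt_set_bio_crypt_ctx_bh() helper exposed by fs/crypto/:

    bio = bio_alloc(GFP_NOIO, 1);

    /* tag the bio with the inline-encryption context, if any, so the
     * block layer can program the hardware keyslot for this IO */
    fscrypt_set_bio_crypt_ctx_bh(bio, bh, GFP_NOIO);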

    Co-developed-by: Satya Tangirala
    Signed-off-by: Satya Tangirala
    Reviewed-by: Theodore Ts'o
    Link: https://lore.kernel.org/r/20200702015607.1215430-5-satyat@google.com
    Signed-off-by: Eric Biggers

    Eric Biggers
     

01 Jul, 2020

1 commit


03 Jun, 2020

2 commits

  • Since the new pair of functions has been introduced, we can use them to
    clean up the code in buffer.c.

    Signed-off-by: Guoqing Jiang
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Alexander Viro
    Link: http://lkml.kernel.org/r/20200517214718.468-5-guoqing.jiang@cloud.ionos.com
    Signed-off-by: Linus Torvalds

    Guoqing Jiang
     
  • When syncing out a block device (à la __sync_blockdev), any error
    encountered will only be recorded in the bd_inode's mapping. When the
    blockdev contains a filesystem however, we'd like to also record the
    error in the super_block that's stored there.

    Make mark_buffer_write_io_error also record the error in the
    corresponding super_block when a writeback error occurs and the block
    device contains a mounted superblock.

    Since superblocks are RCU freed, hold the rcu_read_lock to ensure that
    the superblock doesn't go away while we're marking it.
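
    A sketch of the change as described, assuming the bd_super field and
    errseq_set() of that era (the existing per-mapping error recording is
    elided):

    void mark_buffer_write_io_error(struct buffer_head *bh)
    {
            struct super_block *sb;

            set_buffer_write_io_error(bh);
            /* ... existing mapping_set_error() calls elided ... */

            /* superblocks are RCU-freed: hold rcu_read_lock while
             * recording the writeback error in the mounted sb */
            rcu_read_lock();
            sb = READ_ONCE(bh->b_bdev->bd_super);
            if (sb)
                    errseq_set(&sb->s_wb_err, -EIO);
            rcu_read_unlock();
    }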

    Signed-off-by: Jeff Layton
    Signed-off-by: Andrew Morton
    Reviewed-by: Jan Kara
    Cc: Al Viro
    Cc: Andres Freund
    Cc: Matthew Wilcox
    Cc: David Howells
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Link: http://lkml.kernel.org/r/20200428135155.19223-3-jlayton@kernel.org
    Signed-off-by: Linus Torvalds

    Jeff Layton
     

25 Apr, 2020

1 commit

  • Pull block fixes from Jens Axboe:
    "A few fixes/changes that should go into this release:

    - null_blk zoned fixes (Damien)

    - blkdev_close() sync improvement (Douglas)

    - Fix regression in blk-iocost that impacted (at least) systemtap
    (Waiman)

    - Comment fix, header removal (Zhiqiang, Jianpeng)"

    * tag 'block-5.7-2020-04-24' of git://git.kernel.dk/linux-block:
    null_blk: Cleanup zoned device initialization
    null_blk: Fix zoned command handling
    block: remove unused header
    blk-iocost: Fix error on iocost_ioc_vrate_adj
    bdev: Reduce time holding bd_mutex in sync in blkdev_close()
    buffer: remove useless comment and WB_REASON_FREE_MORE_MEM, reason.

    Linus Torvalds
     

18 Apr, 2020

1 commit

  • The free_more_memory() function was completely removed in commit
    bc48f001de12 ("buffer: eliminate the need to call free_more_memory()
    in __getblk_slow()").

    So the comment and the `WB_REASON_FREE_MORE_MEM` reason referring to
    free_more_memory() are no longer needed.

    Fixes: bc48f001de12 ("buffer: eliminate the need to call free_more_memory() in __getblk_slow()")
    Reviewed-by: Jan Kara
    Signed-off-by: Zhiqiang Liu
    Signed-off-by: Jens Axboe

    Zhiqiang Liu
     

16 Apr, 2020

1 commit

  • Since commit a8ac900b8163 ("ext4: use non-movable memory for the
    superblock"), buffers for the ext4 superblock were allocated using
    the sb_bread_unmovable() helper, which allocates buffer heads
    out of non-movable memory blocks. This was necessary to avoid blocking
    page migration and causing CMA allocation failures.

    However commit 85c8f176a611 ("ext4: preload block group descriptors")
    broke this by introducing pre-reading of the ext4 superblock.
    The problem is that __breadahead() is using __getblk() underneath,
    which allocates buffer heads out of movable memory.

    It resulted in page migration failures I've seen on a machine
    with an ext4 partition and a preallocated cma area.

    Fix this by introducing sb_breadahead_unmovable() and
    __breadahead_gfp() helpers which use non-movable memory for buffer
    head allocations and use them for the ext4 superblock readahead.
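
    A sketch of the two helpers, closely following the description (passing
    a gfp of 0 avoids the __GFP_MOVABLE flag that plain __getblk() adds):

    void __breadahead_gfp(struct block_device *bdev, sector_t block,
                          unsigned size, gfp_t gfp)
    {
            struct buffer_head *bh = __getblk_gfp(bdev, block, size, gfp);

            if (likely(bh)) {
                    ll_rw_block(REQ_OP_READ, REQ_RAHEAD, 1, &bh);
                    brelse(bh);
            }
    }

    static inline void sb_breadahead_unmovable(struct super_block *sb,
                                               sector_t block)
    {
            __breadahead_gfp(sb->s_bdev, block, sb->s_blocksize, 0);
    }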

    Reviewed-by: Andreas Dilger
    Fixes: 85c8f176a611 ("ext4: preload block group descriptors")
    Signed-off-by: Roman Gushchin
    Link: https://lore.kernel.org/r/20200229001411.128010-1-guro@fb.com
    Signed-off-by: Theodore Ts'o

    Roman Gushchin
     

31 Mar, 2020

1 commit

  • Pull locking updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Continued user-access cleanups in the futex code.

    - percpu-rwsem rewrite that uses its own waitqueue and atomic_t
    instead of an embedded rwsem. This addresses a couple of
    weaknesses, but the primary motivation was complications on the -rt
    kernel.

    - Introduce raw lock nesting detection on lockdep
    (CONFIG_PROVE_RAW_LOCK_NESTING=y), document the raw_lock vs. normal
    lock differences. This too originates from -rt.

    - Reuse lockdep zapped chain_hlocks entries, to conserve RAM
    footprint on distro-ish kernels running into the "BUG:
    MAX_LOCKDEP_CHAIN_HLOCKS too low!" depletion of the lockdep
    chain-entries pool.

    - Misc cleanups, smaller fixes and enhancements - see the changelog
    for details"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (55 commits)
    fs/buffer: Make BH_Uptodate_Lock bit_spin_lock a regular spinlock_t
    thermal/x86_pkg_temp: Make pkg_temp_lock a raw_spinlock_t
    Documentation/locking/locktypes: Minor copy editor fixes
    Documentation/locking/locktypes: Further clarifications and wordsmithing
    m68knommu: Remove mm.h include from uaccess_no.h
    x86: get rid of user_atomic_cmpxchg_inatomic()
    generic arch_futex_atomic_op_inuser() doesn't need access_ok()
    x86: don't reload after cmpxchg in unsafe_atomic_op2() loop
    x86: convert arch_futex_atomic_op_inuser() to user_access_begin/user_access_end()
    objtool: whitelist __sanitizer_cov_trace_switch()
    [parisc, s390, sparc64] no need for access_ok() in futex handling
    sh: no need of access_ok() in arch_futex_atomic_op_inuser()
    futex: arch_futex_atomic_op_inuser() calling conventions change
    completion: Use lockdep_assert_RT_in_threaded_ctx() in complete_all()
    lockdep: Add posixtimer context tracing bits
    lockdep: Annotate irq_work
    lockdep: Add hrtimer context tracing bits
    lockdep: Introduce wait-type checks
    completion: Use simple wait queues
    sched/swait: Prepare usage in completions
    ...

    Linus Torvalds
     

28 Mar, 2020

1 commit

  • Bit spinlocks are problematic if PREEMPT_RT is enabled, because they
    disable preemption, which is undesired for latency reasons and breaks when
    regular spinlocks are taken within the bit_spinlock locked region because
    regular spinlocks are converted to 'sleeping spinlocks' on RT.

    PREEMPT_RT replaced the bit spinlocks with regular spinlocks to avoid this
    problem. The replacement was done conditionally at compile time, but
    Christoph requested to do an unconditional conversion.

    Jan suggested to move the spinlock into an existing padding hole, which
    avoids a size increase of struct buffer_head on production kernels.

    As a benefit the lock gains lockdep coverage.

    [ bigeasy: Remove the wrapper and use always spinlock_t and move it into
    the padding hole ]
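
    A before/after sketch of the conversion in the async-read completion
    path, with b_uptodate_lock being the new spinlock_t in the padding hole:

    /* before: bit spinlock in b_state, preemption and interrupts off */
    local_irq_save(flags);
    bit_spin_lock(BH_Uptodate_Lock, &first->b_state);
    /* ... walk the page's buffers ... */
    bit_spin_unlock(BH_Uptodate_Lock, &first->b_state);
    local_irq_restore(flags);

    /* after: a regular spinlock_t, which also gains lockdep coverage */
    spin_lock_irqsave(&first->b_uptodate_lock, flags);
    /* ... walk the page's buffers ... */
    spin_unlock_irqrestore(&first->b_uptodate_lock, flags);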

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Jan Kara
    Cc: Christoph Hellwig
    Link: https://lkml.kernel.org/r/20191118132824.rclhrbujqh4b4g4d@linutronix.de

    Thomas Gleixner
     

25 Mar, 2020

1 commit


25 Jan, 2020

1 commit


09 Jan, 2020

1 commit

  • Commit 85a8ce62c2ea ("block: add bio_truncate to fix guard_bio_eod")
    adds bio_truncate() for handling bio EOD. However, bio_truncate()
    doesn't use the passed 'op' parameter from guard_bio_eod's callers.

    So bio_truncate() may retrieve the wrong 'op', and zeroing pages may
    not be done for a READ bio.

    Fix this issue by moving guard_bio_eod() after bio_set_op_attrs()
    in submit_bh_wbc() so that bio_truncate() can always retrieve the
    correct op info.

    Also remove the 'op' parameter from guard_bio_eod() since it isn't
    used any more.
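
    A sketch of the reordering in submit_bh_wbc(), assuming the call sites
    of that era:

    bio_set_op_attrs(bio, op, op_flags);

    /* op is now set on the bio, so bio_truncate() can tell whether this
     * is a READ and zero the truncated pages accordingly */
    guard_bio_eod(bio);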

    Cc: Carlos Maiolino
    Cc: linux-fsdevel@vger.kernel.org
    Fixes: 85a8ce62c2ea ("block: add bio_truncate to fix guard_bio_eod")
    Signed-off-by: Ming Lei

    Fold in kerneldoc and bio_op() change.

    Signed-off-by: Jens Axboe

    Ming Lei
     

29 Dec, 2019

1 commit

  • Some filesystems, such as vfat, may send a bio which crosses the device
    boundary, and the worse thing is that an IO request starting within
    device boundaries can contain more than one segment past EOD.

    Commit dce30ca9e3b6 ("fs: fix guard_bio_eod to check for real EOD errors")
    tries to fix this issue by returning -EIO for this situation. However,
    this way fs user code loses the chance to handle -EIO, so
    sync_inodes_sb() may hang forever.

    Also the current truncating of the last segment is dangerous, because
    updating the last bvec means the bvec table is no longer immutable, and
    fs bio users may not retrieve the truncated pages via
    bio_for_each_segment_all() in their .end_io callbacks.

    Fix this issue by supporting multi-segment truncating, and the
    approach is simpler (see the sketch after this list):

    - just update the bio size, since the block layer can build correct
    bvecs from the updated bio size. The bvec table then stays truly
    immutable.

    - zero all truncated segments for a read bio
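
    A sketch of a multi-segment truncate along these lines (close to, but
    not necessarily identical to, the upstream bio_truncate()):

    void bio_truncate(struct bio *bio, unsigned int new_size)
    {
            struct bio_vec bv;
            struct bvec_iter iter;
            unsigned int done = 0;
            bool truncated = false;

            if (new_size >= bio->bi_iter.bi_size)
                    return;

            if (bio_op(bio) != REQ_OP_READ)
                    goto exit;

            /* zero every part of a read that now lies past new_size */
            bio_for_each_segment(bv, bio, iter) {
                    if (done + bv.bv_len > new_size) {
                            unsigned offset = truncated ? 0 : new_size - done;

                            zero_user(bv.bv_page, bv.bv_offset + offset,
                                      bv.bv_len - offset);
                            truncated = true;
                    }
                    done += bv.bv_len;
            }
    exit:
            /* shrinking bi_size is enough; the bvec table stays immutable */
            bio->bi_iter.bi_size = new_size;
    }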

    Cc: Carlos Maiolino
    Cc: linux-fsdevel@vger.kernel.org
    Fixed-by: dce30ca9e3b6 ("fs: fix guard_bio_eod to check for real EOD errors")
    Reported-by: syzbot+2b9e54155c8c25d8d165@syzkaller.appspotmail.com
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

01 Dec, 2019

2 commits

  • The declarations of __block_write_begin_int and guard_bio_eod are needed
    from internal.h, so include it to fix the following sparse warnings:

    fs/buffer.c:1930:5: warning: symbol '__block_write_begin_int' was not declared. Should it be static?
    fs/buffer.c:2994:6: warning: symbol 'guard_bio_eod' was not declared. Should it be static?
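
    The fix is presumably just pulling in the header that declares them:

    /* fs/buffer.c */
    #include "internal.h"   /* __block_write_begin_int, guard_bio_eod */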

    Link: http://lkml.kernel.org/r/20191011170039.16100-1-ben.dooks@codethink.co.uk
    Signed-off-by: Ben Dooks
    Reviewed-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Dooks
     
  • Use true/false for bool return type of has_bh_in_lru().

    Link: http://lkml.kernel.org/r/20191029040529.GA7625@saurav
    Signed-off-by: Saurav Girepunje
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Saurav Girepunje
     

15 Nov, 2019

1 commit

  • After each filesystem block (as represented by a buffer_head) has been
    read from disk by block_read_full_page(), decrypt it if needed. The
    decryption is done on the fscrypt_read_workqueue.

    This is the final change needed to support ext4 encryption with
    blocksize != PAGE_SIZE, and it's a fairly small change now that
    CONFIG_FS_ENCRYPTION is a bool and fs/crypto/ exposes functions to
    decrypt individual blocks and to enqueue work on the fscrypt workqueue.

    Don't try to add fs-verity support yet, as the fs/verity/ support layer
    isn't ready for sub-page blocks yet. Just add fscrypt support for now.

    Almost all the new code is compiled away when CONFIG_FS_ENCRYPTION=n.
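
    A sketch of the post-read decryption step, assuming the helpers exposed
    by fs/crypto/ (the context struct here is illustrative):

    struct decrypt_bh_ctx {
            struct work_struct work;
            struct buffer_head *bh;
    };

    static void decrypt_bh(struct work_struct *work)
    {
            struct decrypt_bh_ctx *ctx =
                    container_of(work, struct decrypt_bh_ctx, work);
            struct buffer_head *bh = ctx->bh;
            int err;

            /* decrypt one filesystem block in place */
            err = fscrypt_decrypt_pagecache_blocks(bh->b_page, bh->b_size,
                                                   bh_offset(bh));
            end_buffer_async_read(bh, err == 0);
            kfree(ctx);
    }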

    Cc: Chandan Rajendra
    Signed-off-by: Eric Biggers
    Link: https://lore.kernel.org/r/20191023033312.361355-2-ebiggers@kernel.org
    Signed-off-by: Theodore Ts'o

    Eric Biggers
     

16 Jul, 2019

1 commit

  • Pull more block updates from Jens Axboe:
    "A later pull request with some followup items. I had some vacation
    coming up to the merge window, so certain things items were delayed a
    bit. This pull request also contains fixes that came in within the
    last few days of the merge window, which I didn't want to push right
    before sending you a pull request.

    This contains:

    - NVMe pull request, mostly fixes, but also a few minor items on the
    feature side that were timing constrained (Christoph et al)

    - Report zones fixes (Damien)

    - Removal of dead code (Damien)

    - Turn on cgroup psi memstall (Josef)

    - block cgroup MAINTAINERS entry (Konstantin)

    - Flush init fix (Josef)

    - blk-throttle low iops timing fix (Konstantin)

    - nbd resize fixes (Mike)

    - nbd 0 blocksize crash fix (Xiubo)

    - block integrity error leak fix (Wenwen)

    - blk-cgroup writeback and priority inheritance fixes (Tejun)"

    * tag 'for-linus-20190715' of git://git.kernel.dk/linux-block: (42 commits)
    MAINTAINERS: add entry for block io cgroup
    null_blk: fixup ->report_zones() for !CONFIG_BLK_DEV_ZONED
    block: Limit zone array allocation size
    sd_zbc: Fix report zones buffer allocation
    block: Kill gfp_t argument of blkdev_report_zones()
    block: Allow mapping of vmalloc-ed buffers
    block/bio-integrity: fix a memory leak bug
    nvme: fix NULL deref for fabrics options
    nbd: add netlink reconfigure resize support
    nbd: fix crash when the blksize is zero
    block: Disable write plugging for zoned block devices
    block: Fix elevator name declaration
    block: Remove unused definitions
    nvme: fix regression upon hot device removal and insertion
    blk-throttle: fix zero wait time for iops throttled group
    block: Fix potential overflow in blk_report_zones()
    blkcg: implement REQ_CGROUP_PUNT
    blkcg, writeback: Implement wbc_blkcg_css()
    blkcg, writeback: Add wbc->no_cgroup_owner
    blkcg, writeback: Rename wbc_account_io() to wbc_account_cgroup_owner()
    ...

    Linus Torvalds
     

10 Jul, 2019

1 commit


28 Jun, 2019

1 commit


21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

01 May, 2019

2 commits

  • In iomap_write_end, we're not holding a page reference anymore when
    calling the page_done callback, but the callback needs that reference to
    access the page. To fix that, move the put_page call in
    __generic_write_end into the callers of __generic_write_end. Then, in
    iomap_write_end, put the page after calling the page_done callback.
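
    A sketch of the resulting order in iomap_write_end, assuming the
    page_done hook as added by 63899c6f8851:

    __generic_write_end(inode, pos, copied, page);  /* no put_page() inside */
    if (iomap->page_done)
            iomap->page_done(inode, pos, copied, page, iomap);
    put_page(page);         /* dropped only after the callback has run */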

    Reported-by: Jan Kara
    Fixes: 63899c6f8851 ("iomap: add a page_done callback")
    Signed-off-by: Andreas Gruenbacher
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Andreas Gruenbacher
     
  • The VFS-internal __generic_write_end helper always returns the value of
    its @copied argument. This can be confusing, and it isn't very useful
    anyway, so turn __generic_write_end into a function returning void
    instead.
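
    In other words, the prototype change (sketch):

    /* before: always returned @copied */
    int __generic_write_end(struct inode *inode, loff_t pos, unsigned copied,
                            struct page *page);

    /* after: callers already know @copied, so return nothing */
    void __generic_write_end(struct inode *inode, loff_t pos, unsigned copied,
                             struct page *page);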

    Signed-off-by: Andreas Gruenbacher
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Andreas Gruenbacher
     

01 Mar, 2019

1 commit

  • guard_bio_eod() can truncate a segment in a bio to allow it to do IO on
    the odd last sectors of a device.

    It already checks if the IO starts past EOD, but it does not consider
    the possibility that an IO request starting within device boundaries
    can contain more than one segment past EOD.

    In such cases, truncated_bytes can be bigger than PAGE_SIZE, and will
    underflow bvec->bv_len.

    Fix this by checking if truncated_bytes is lower than PAGE_SIZE.
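
    A sketch of that guard (the exact form upstream may differ):

    if (unlikely(truncated_bytes > PAGE_SIZE)) {
            /* more than one segment spans EOD: leave the bio alone and
             * let the block layer fail it with -EIO */
            return;
    }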

    This situation has been found on filesystems such as isofs and vfat,
    which don't check the device size before mount. If the device is
    smaller than the filesystem itself, a readahead on such a filesystem
    that spans EOD can trigger this situation, leading to a call to
    zero_user() with a wrong size, possibly corrupting memory.

    I didn't see any crash, or didn't let the system run long enough to
    check if memory corruption would be hit somewhere, but adding
    instrumentation to guard_bio_eod() to check the truncated_bytes size
    was enough to see the error.

    The following script can trigger the error.

    MNT=/mnt
    IMG=./DISK.img
    DEV=/dev/loop0

    mkfs.vfat $IMG
    mount $IMG $MNT
    cp -R /etc $MNT &> /dev/null
    umount $MNT

    losetup -D

    losetup --find --show --sizelimit 16247280 $IMG
    mount $DEV $MNT

    find $MNT -type f -exec cat {} + >/dev/null

    Kudos to Eric Sandeen for coming up with the reproducer above

    Reviewed-by: Ming Lei
    Signed-off-by: Carlos Maiolino
    Signed-off-by: Jens Axboe

    Carlos Maiolino
     

15 Feb, 2019

2 commits

  • Pull in 5.0-rc6 to avoid a dumb merge conflict with fs/iomap.c.
    This is needed since io_uring is now based on the block branch,
    to avoid a conflict between the multi-page bvecs and the bits
    of io_uring that touch the core block parts.

    * tag 'v5.0-rc6': (525 commits)
    Linux 5.0-rc6
    x86/mm: Make set_pmd_at() paravirt aware
    MAINTAINERS: Update the ocores i2c bus driver maintainer, etc
    blk-mq: remove duplicated definition of blk_mq_freeze_queue
    Blk-iolatency: warn on negative inflight IO counter
    blk-iolatency: fix IO hang due to negative inflight counter
    MAINTAINERS: unify reference to xen-devel list
    x86/mm/cpa: Fix set_mce_nospec()
    futex: Handle early deadlock return correctly
    futex: Fix barrier comment
    net: dsa: b53: Fix for failure when irq is not defined in dt
    blktrace: Show requests without sector
    mips: cm: reprime error cause
    mips: loongson64: remove unreachable(), fix loongson_poweroff().
    sit: check if IPv6 enabled before calling ip6_err_gen_icmpv6_unreach()
    geneve: should not call rt6_lookup() when ipv6 was disabled
    KVM: nVMX: unconditionally cancel preemption timer in free_nested (CVE-2019-7221)
    KVM: x86: work around leak of uninitialized stack contents (CVE-2019-7222)
    kvm: fix kvm_ioctl_create_device() reference counting (CVE-2019-6974)
    signal: Better detection of synchronous signals
    ...

    Jens Axboe
     
  • Once multi-page bvec is enabled, the last bvec may include more than one
    page; this patch uses mp_bvec_last_segment() to truncate the bio.
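
    A sketch of the change in guard_bio_eod(), assuming the
    mp_bvec_last_segment() helper from that series:

    /* truncate the last bvec, then clear its tail for reads */
    bio->bi_iter.bi_size -= truncated_bytes;
    bvec->bv_len -= truncated_bytes;

    if (op == REQ_OP_READ) {
            struct bio_vec bv;

            /* get the last single-page segment of the multi-page bvec */
            mp_bvec_last_segment(bvec, &bv);
            zero_user(bv.bv_page, bv.bv_offset + bv.bv_len,
                      truncated_bytes);
    }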

    Reviewed-by: Omar Sandoval
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

07 Feb, 2019

1 commit

  • When something lets __find_get_block_slow() hit the all_mapped path, it
    calls printk() 100+ times per second. But there is no need to print the
    same message with such high frequency; it is just asking for a stall
    warning, or at least bloating log files.

    [ 399.866302][T15342] __find_get_block_slow() failed. block=1, b_blocknr=8
    [ 399.873324][T15342] b_state=0x00000029, b_size=512
    [ 399.878403][T15342] device loop0 blocksize: 4096
    [ 399.883296][T15342] __find_get_block_slow() failed. block=1, b_blocknr=8
    [ 399.890400][T15342] b_state=0x00000029, b_size=512
    [ 399.895595][T15342] device loop0 blocksize: 4096
    [ 399.900556][T15342] __find_get_block_slow() failed. block=1, b_blocknr=8
    [ 399.907471][T15342] b_state=0x00000029, b_size=512
    [ 399.912506][T15342] device loop0 blocksize: 4096

    This patch reduces the frequency to at most once per second, in addition
    to concatenating the three lines into one.

    [ 399.866302][T15342] __find_get_block_slow() failed. block=1, b_blocknr=8, b_state=0x00000029, b_size=512, device loop0 blocksize: 4096
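
    A sketch of the rate-limited form, using the standard
    DEFINE_RATELIMIT_STATE pattern:

    static DEFINE_RATELIMIT_STATE(last_warned, HZ, 1);

    if (all_mapped && __ratelimit(&last_warned)) {
            printk("__find_get_block_slow() failed. block=%llu, "
                   "b_blocknr=%llu, b_state=0x%08lx, b_size=%zu, "
                   "device %pg blocksize: %d\n",
                   (unsigned long long)block,
                   (unsigned long long)bh->b_blocknr,
                   bh->b_state, bh->b_size, bdev,
                   1 << bd_inode->i_blkbits);
    }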

    Signed-off-by: Tetsuo Handa
    Reviewed-by: Jan Kara
    Cc: Dmitry Vyukov
    Signed-off-by: Jens Axboe

    Tetsuo Handa
     

05 Jan, 2019

1 commit


08 Dec, 2018

1 commit

  • One of the goals of this series is to remove a separate reference to
    the css of the bio. This can and should be accessed via bio_blkcg(). In
    this patch, wbc_init_bio() now requires a bio to have a device
    associated with it.
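
    So a writeback submission path presumably looks like this sketch:

    bio = bio_alloc(GFP_NOIO, 1);
    bio_set_dev(bio, bh->b_bdev);   /* device must be associated first */
    wbc_init_bio(wbc, bio);         /* css then reachable via bio_blkcg() */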

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

03 Nov, 2018

1 commit

  • Pull block layer fixes from Jens Axboe:
    "The biggest part of this pull request is the revert of the blkcg
    cleanup series. It had one fix earlier for a stacked device issue, but
    another one was reported. Rather than play whack-a-mole with this,
    revert the entire series and try again for the next kernel release.

    Apart from that, only small fixes/changes.

    Summary:

    - Indentation fixup for mtip32xx (Colin Ian King)

    - The blkcg cleanup series revert (Dennis Zhou)

    - Two NVMe fixes. One fixing a regression in the nvme request
    initialization in this merge window, causing nvme-fc to not work.
    The other is a suspend/resume p2p resource issue (James, Keith)

    - Fix sg discard merge, allowing us to merge in cases where we didn't
    before (Jianchao Wang)

    - Call rq_qos_exit() after the queue is frozen, preventing a hang
    (Ming)

    - Fix brd queue setup, fixing an oops if we fail setting up all
    devices (Ming)"

    * tag 'for-linus-20181102' of git://git.kernel.dk/linux-block:
    nvme-pci: fix conflicting p2p resource adds
    nvme-fc: fix request private initialization
    blkcg: revert blkcg cleanups series
    block: brd: associate with queue until adding disk
    block: call rq_qos_exit() after queue is frozen
    mtip32xx: clean an indentation issue, remove extraneous tabs
    block: fix the DISCARD request merge

    Linus Torvalds
     

02 Nov, 2018

1 commit

  • This reverts a series committed earlier due to a null pointer exception
    bug reported in [1]. It seems there are edge case interactions that I
    did not consider and will need some time to understand what causes the
    adverse interactions.

    The original series can be found in [2] with a follow up series in [3].

    [1] https://www.spinics.net/lists/cgroups/msg20719.html
    [2] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/
    [3] https://lore.kernel.org/lkml/20181020185612.51587-1-dennis@kernel.org/

    This reverts the following commits:
    d459d853c2ed, b2c3fa546705, 101246ec02b5, b3b9f24f5fcc, e2b0989954ae,
    f0fcb3ec89f3, c839e7a03f92, bdc2491708c4, 74b7c02a9bc1, 5bf9a1f3b4ef,
    a7b39b4e961c, 07b05bcc3213, 49f4c2dc2b50, 27e6fa996c53

    Signed-off-by: Dennis Zhou
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

29 Oct, 2018

1 commit

  • Pull XArray conversion from Matthew Wilcox:
    "The XArray provides an improved interface to the radix tree data
    structure, providing locking as part of the API, specifying GFP flags
    at allocation time, eliminating preloading, less re-walking the tree,
    more efficient iterations and not exposing RCU-protected pointers to
    its users.

    This patch set

    1. Introduces the XArray implementation

    2. Converts the pagecache to use it

    3. Converts memremap to use it

    The page cache is the most complex and important user of the radix
    tree, so converting it was most important. Converting the memremap
    code removes the only other user of the multiorder code, which allows
    us to remove the radix tree code that supported it.

    I have 40+ followup patches to convert many other users of the radix
    tree over to the XArray, but I'd like to get this part in first. The
    other conversions haven't been in linux-next and aren't suitable for
    applying yet, but you can see them in the xarray-conv branch if you're
    interested"

    * 'xarray' of git://git.infradead.org/users/willy/linux-dax: (90 commits)
    radix tree: Remove multiorder support
    radix tree test: Convert multiorder tests to XArray
    radix tree tests: Convert item_delete_rcu to XArray
    radix tree tests: Convert item_kill_tree to XArray
    radix tree tests: Move item_insert_order
    radix tree test suite: Remove multiorder benchmarking
    radix tree test suite: Remove __item_insert
    memremap: Convert to XArray
    xarray: Add range store functionality
    xarray: Move multiorder_check to in-kernel tests
    xarray: Move multiorder_shrink to kernel tests
    xarray: Move multiorder account test in-kernel
    radix tree test suite: Convert iteration test to XArray
    radix tree test suite: Convert tag_tagged_items to XArray
    radix tree: Remove radix_tree_clear_tags
    radix tree: Remove radix_tree_maybe_preload_order
    radix tree: Remove split/join code
    radix tree: Remove radix_tree_update_node_t
    page cache: Finish XArray conversion
    dax: Convert page fault handlers to XArray
    ...

    Linus Torvalds
     

21 Oct, 2018

1 commit


22 Sep, 2018

1 commit

  • One of the goals of this series is to remove a separate reference to
    the css of the bio. This can and should be accessed via bio_blkcg. In
    this patch, the wbc_init_bio call is changed such that it must be called
    after a queue has been associated with the bio.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)