06 Apr, 2019

1 commit

  • [ Upstream commit dce30ca9e3b676fb288c33c1f4725a0621361185 ]

    guard_bio_eod() can truncate a segment in bio to allow it to do IO on
    odd last sectors of a device.

    It already checks if the IO starts past EOD, but it does not consider
    the possibility that an IO request starting within the device boundaries
    can contain more than one segment past EOD.

    In such cases, truncated_bytes can be bigger than PAGE_SIZE, and will
    underflow bvec->bv_len.

    Fix this by checking if truncated_bytes is lower than PAGE_SIZE.
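
    A minimal sketch of the guard described above, with the surrounding
    guard_bio_eod() code abbreviated (illustrative only, not the exact
    upstream hunk):

    /* truncated_bytes: how far the bio reaches past the end of the device,
     * in bytes; bvec is the last segment of the bio. */
    if (unlikely(truncated_bytes > PAGE_SIZE)) {
        /* More than one segment lies past EOD; shrinking the last segment
         * below would underflow bvec->bv_len, so bail out and let the
         * block layer fail the IO instead. */
        return;
    }

    /* Safe now: trim the last segment and, for reads, zero the truncated
     * part of the page. */
    bio->bi_iter.bi_size -= truncated_bytes;
    bvec->bv_len -= truncated_bytes;
    if (op == REQ_OP_READ)
        zero_user(bvec->bv_page, bvec->bv_offset + bvec->bv_len,
                  truncated_bytes);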

    This situation has been found on filesystems such as isofs and vfat,
    which don't check the device size before mount. If the device is
    smaller than the filesystem itself, a readahead on such a filesystem
    that spans EOD can trigger this situation, leading to a call to
    zero_user() with a wrong size, possibly corrupting memory.

    I didn't see any crash, nor did I let the system run long enough to
    check whether memory corruption would be hit somewhere, but adding
    instrumentation to guard_bio_eod() to check the truncated_bytes size
    was enough to see the error.

    The following script can trigger the error.

    MNT=/mnt
    IMG=./DISK.img
    DEV=/dev/loop0

    mkfs.vfat $IMG
    mount $IMG $MNT
    cp -R /etc $MNT &> /dev/null
    umount $MNT

    losetup -D

    losetup --find --show --sizelimit 16247280 $IMG
    mount $DEV $MNT

    find $MNT -type f -exec cat {} + >/dev/null

    Kudos to Eric Sandeen for coming up with the reproducer above.

    Reviewed-by: Ming Lei
    Signed-off-by: Carlos Maiolino
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Carlos Maiolino
     

14 Mar, 2019

1 commit

  • [ Upstream commit 43636c804df0126da669c261fc820fb22f62bfc2 ]

    When something lets __find_get_block_slow() hit the all_mapped path, it
    calls printk() 100+ times per second. But there is no need to print the
    same message with such high frequency; it is just asking for a stall
    warning, or at least bloating log files.

    [ 399.866302][T15342] __find_get_block_slow() failed. block=1, b_blocknr=8
    [ 399.873324][T15342] b_state=0x00000029, b_size=512
    [ 399.878403][T15342] device loop0 blocksize: 4096
    [ 399.883296][T15342] __find_get_block_slow() failed. block=1, b_blocknr=8
    [ 399.890400][T15342] b_state=0x00000029, b_size=512
    [ 399.895595][T15342] device loop0 blocksize: 4096
    [ 399.900556][T15342] __find_get_block_slow() failed. block=1, b_blocknr=8
    [ 399.907471][T15342] b_state=0x00000029, b_size=512
    [ 399.912506][T15342] device loop0 blocksize: 4096

    This patch reduces the frequency to at most once per second, in addition
    to concatenating the three lines into one.

    [ 399.866302][T15342] __find_get_block_slow() failed. block=1, b_blocknr=8, b_state=0x00000029, b_size=512, device loop0 blocksize: 4096
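
    A sketch of the rate-limited, single-line message, using the standard
    ratelimit helpers from linux/ratelimit.h (the state variable name here is
    an assumption; this is not the verbatim upstream hunk):

    static DEFINE_RATELIMIT_STATE(last_warned, HZ, 1);

    if (all_mapped && __ratelimit(&last_warned)) {
        printk("__find_get_block_slow() failed. block=%llu, "
               "b_blocknr=%llu, b_state=0x%08lx, b_size=%zu, "
               "device %pg blocksize: %d\n",
               (unsigned long long)block,
               (unsigned long long)bh->b_blocknr,
               bh->b_state, bh->b_size, bdev,
               1 << bd_inode->i_blkbits);
    }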

    Signed-off-by: Tetsuo Handa
    Reviewed-by: Jan Kara
    Cc: Dmitry Vyukov
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Tetsuo Handa
     

30 Aug, 2018

1 commit


18 Aug, 2018

1 commit

  • Buffer_heads can consume a significant amount of system memory and are
    directly related to the amount of page cache. In our production
    environment we have observed that a lot of machines spend a significant
    amount of memory on buffer_heads, and this cannot simply be left as
    unaccounted system memory overhead.

    Charging buffer_head is not as simple as adding __GFP_ACCOUNT to the
    allocation. The buffer_heads can be allocated in a memcg different from
    the memcg of the page for which buffer_heads are being allocated. One
    concrete example is memory reclaim. The reclaim can trigger I/O of
    pages of any memcg on the system. So, the right way to charge
    buffer_head is to extract the memcg from the page for which buffer_heads
    are being allocated and then use targeted memcg charging API.
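
    A rough sketch of that targeted charging pattern. The helper names used
    here (get_mem_cgroup_from_page(), memalloc_use_memcg()) are the memcg API
    of this era and are assumptions for illustration; later kernels renamed
    parts of this interface:

    struct mem_cgroup *memcg;
    struct buffer_head *bh;

    /* Charge the buffer_heads to the memcg that owns the page they
     * describe, not to whichever task happens to allocate them (e.g. a
     * reclaimer writing back someone else's page). */
    memcg = get_mem_cgroup_from_page(page);
    memalloc_use_memcg(memcg);

    bh = alloc_buffer_head(GFP_NOFS | __GFP_ACCOUNT);

    memalloc_unuse_memcg();
    mem_cgroup_put(memcg);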

    [shakeelb@google.com: use __GFP_ACCOUNT for directed memcg charging]
    Link: http://lkml.kernel.org/r/20180702220208.213380-1-shakeelb@google.com
    Link: http://lkml.kernel.org/r/20180627191250.209150-3-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jan Kara
    Cc: Amir Goldstein
    Cc: Greg Thelen
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     

20 Jun, 2018

2 commits

  • In iomap_to_bh, not only mark buffer heads in IOMAP_UNWRITTEN maps as
    new, but also buffer heads in IOMAP_MAPPED maps with the IOMAP_F_NEW
    flag set. This will be used by filesystems like gfs2, which allocate
    blocks in iomap->begin.

    Minor corrections to the comment for IOMAP_UNWRITTEN maps.
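
    A sketch of the IOMAP_MAPPED case in iomap_to_bh() with the extra
    IOMAP_F_NEW check (abbreviated, not the exact upstream hunk):

    case IOMAP_MAPPED:
        /* Blocks the filesystem just allocated in its iomap begin hook
         * (IOMAP_F_NEW) must be treated like new blocks, exactly as
         * blocks beyond EOF already were. */
        if ((iomap->flags & IOMAP_F_NEW) || offset >= i_size_read(inode))
            set_buffer_new(bh);
        bh->b_blocknr = (iomap->addr + offset - iomap->offset) >>
                        inode->i_blkbits;
        set_buffer_mapped(bh);
        break;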

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Andreas Gruenbacher
     
  • Factor out the bits of the buffer.c based write_end implementations that
    don't know about buffer_heads, so they can be reused by other
    implementations.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Andreas Gruenbacher
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     

02 Jun, 2018

1 commit

  • This function is only used by the iomap code, depends on being called
    from it, and will soon stop poking into buffer head internals.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Andreas Gruenbacher
    Reviewed-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     

13 Apr, 2018

1 commit


12 Apr, 2018

2 commits

  • Remove the address_space ->tree_lock and use the xa_lock newly added to
    the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
    since we don't really care that it's a tree.
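
    In practice the conversion is mechanical; a sketch of a typical site
    (the xa_lock helpers operate on the lock now embedded in the tree root):

    /* before */
    spin_lock_irq(&mapping->tree_lock);
    radix_tree_delete(&mapping->page_tree, page->index);
    spin_unlock_irq(&mapping->tree_lock);

    /* after */
    xa_lock_irq(&mapping->i_pages);
    radix_tree_delete(&mapping->i_pages, page->index);
    xa_unlock_irq(&mapping->i_pages);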

    [willy@infradead.org: fix nds32, fs/dax.c]
    Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.org
    Link: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jeff Layton
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Ryusuke Konishi
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • XFS currently contains a copy-and-paste of __set_page_dirty(). Export
    it from buffer.c instead.

    Link: http://lkml.kernel.org/r/20180313132639.17387-6-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jeff Layton
    Reviewed-by: Darrick J. Wong
    Cc: Ryusuke Konishi
    Cc: Dave Chinner
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

06 Apr, 2018

1 commit

  • Prior to commit d47992f86b30 ("mm: change invalidatepage prototype to
    accept length"), an offset of 0 meant that the full page was being
    invalidated. After that commit, we need to instead check the length.
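
    The corresponding check in block_invalidatepage() then looks roughly like
    this (a sketch, not the verbatim hunk):

    /* Release buffers only when the entire page is invalidated; an offset
     * of 0 no longer implies that, since partial-page invalidation now
     * passes an explicit length. */
    if (length == PAGE_SIZE)    /* was: if (offset == 0) */
        try_to_release_page(page, 0);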

    Jan said:
    :
    : The only possible issue is that try_to_release_page() was called more
    : often than necessary. Otherwise the issue is harmless but still it's good
    : to have this fixed.

    Link: http://lkml.kernel.org/r/x49fu5rtnzs.fsf@segfault.boston.devel.redhat.com
    Fixes: d47992f86b307 ("mm: change invalidatepage prototype to accept length")
    Signed-off-by: Jeff Moyer
    Reviewed-by: Jan Kara
    Cc: Lukas Czerner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Moyer
     

19 Mar, 2018

1 commit

  • There are two distinct freezing mechanisms: one operates on block
    devices and another directly on super blocks. Both end up with the same
    result, but thawing only one of them does not thaw the other.

    In particular, fsfreeze --freeze uses the ioctl variant that goes to the
    super block. Since, prior to this patch, emergency thaw did not perform
    the corresponding super block thaw, filesystems frozen with this method
    remained unaffected.

    The patch is a hack which adds blind unfreezing.

    In order to keep the super block write-locked the whole time the code
    is shuffled around and the newly introduced __iterate_supers is
    employed.

    Signed-off-by: Mateusz Guzik
    Signed-off-by: Al Viro

    Mateusz Guzik
     

01 Feb, 2018

1 commit

  • Pull misc vfs updates from Al Viro:
    "All kinds of misc stuff, without any unifying topic, from various
    people.

    Neil's d_anon patch, several bugfixes, introduction of kvmalloc
    analogue of kmemdup_user(), extending bitfield.h to deal with
    fixed-endians, assorted cleanups all over the place..."

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits)
    alpha: osf_sys.c: use timespec64 where appropriate
    alpha: osf_sys.c: fix put_tv32 regression
    jffs2: Fix use-after-free bug in jffs2_iget()'s error handling path
    dcache: delete unused d_hash_mask
    dcache: subtract d_hash_shift from 32 in advance
    fs/buffer.c: fold init_buffer() into init_page_buffers()
    fs: fold __inode_permission() into inode_permission()
    fs: add RWF_APPEND
    sctp: use vmemdup_user() rather than badly open-coding memdup_user()
    snd_ctl_elem_init_enum_names(): switch to vmemdup_user()
    replace_user_tlv(): switch to vmemdup_user()
    new primitive: vmemdup_user()
    memdup_user(): switch to GFP_USER
    eventfd: fold eventfd_ctx_get() into eventfd_ctx_fileget()
    eventfd: fold eventfd_ctx_read() into eventfd_read()
    eventfd: convert to use anon_inode_getfd()
    nfs4file: get rid of pointless include of btrfs.h
    uvc_v4l2: clean copyin/copyout up
    vme_user: don't use __copy_..._user()
    usx2y: don't bother with memdup_user() for 16-byte structure
    ...

    Linus Torvalds
     

26 Jan, 2018

1 commit


07 Jan, 2018

1 commit


16 Nov, 2017

1 commit

  • Every pagevec_init user claims the pages being released are hot even in
    cases where it is unlikely the pages are hot. As no one cares about the
    hotness of pages being released to the allocator, just ditch the
    parameter.

    No performance impact is expected as the overhead is marginal. The
    parameter is removed simply because it is a bit stupid to have a useless
    parameter copied everywhere.
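
    For callers the change is mechanical, along these lines (sketch):

    /* before: every caller passed a hot/cold hint, almost always 0 */
    pagevec_init(&pvec, 0);

    /* after: the useless hint is gone */
    pagevec_init(&pvec);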

    Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Jan Kara
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

15 Nov, 2017

2 commits

  • Pull core block layer updates from Jens Axboe:
    "This is the main pull request for block storage for 4.15-rc1.

    Nothing out of the ordinary in here, and no API changes or anything
    like that. Just various new features for drivers, core changes, etc.
    In particular, this pull request contains:

    - A patch series from Bart, closing the hole on blk/scsi-mq queue
    quiescing.

    - A series from Christoph, building towards hidden gendisks (for
    multipath) and ability to move bio chains around.

    - NVMe
    - Support for native multipath for NVMe (Christoph).
    - Userspace notifications for AENs (Keith).
    - Command side-effects support (Keith).
    - SGL support (Chaitanya Kulkarni)
    - FC fixes and improvements (James Smart)
    - Lots of fixes and tweaks (Various)

    - bcache
    - New maintainer (Michael Lyle)
    - Writeback control improvements (Michael)
    - Various fixes (Coly, Elena, Eric, Liang, et al)

    - lightnvm updates, mostly centered around the pblk interface
    (Javier, Hans, and Rakesh).

    - Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)

    - Writeback series that fix the much discussed hundreds of millions
    of sync-all units. This goes all the way, as discussed previously
    (me).

    - Fix for missing wakeup on writeback timer adjustments (Yafang
    Shao).

    - Fix laptop mode on blk-mq (me).

    - {mq,name} tuple lookup for IO schedulers, allowing us to have
    alias names. This means you can use 'deadline' on both !mq and on
    mq (where it's called mq-deadline). (me).

    - blktrace race fix, oopsing on sg load (me).

    - blk-mq optimizations (me).

    - Obscure waitqueue race fix for kyber (Omar).

    - NBD fixes (Josef).

    - Disable writeback throttling by default on bfq, like we do on cfq
    (Luca Miccio).

    - Series from Ming that enable us to treat flush requests on blk-mq
    like any other request. This is a really nice cleanup.

    - Series from Ming that improves merging on blk-mq with schedulers,
    getting us closer to flipping the switch on scsi-mq again.

    - BFQ updates (Paolo).

    - blk-mq atomic flags memory ordering fixes (Peter Z).

    - Loop cgroup support (Shaohua).

    - Lots of minor fixes from lots of different folks, both for core and
    driver code"

    * 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
    nvme: fix visibility of "uuid" ns attribute
    blk-mq: fixup some comment typos and lengths
    ide: ide-atapi: fix compile error with defining macro DEBUG
    blk-mq: improve tag waiting setup for non-shared tags
    brd: remove unused brd_mutex
    blk-mq: only run the hardware queue if IO is pending
    block: avoid null pointer dereference on null disk
    fs: guard_bio_eod() needs to consider partitions
    xtensa/simdisk: fix compile error
    nvme: expose subsys attribute to sysfs
    nvme: create 'slaves' and 'holders' entries for hidden controllers
    block: create 'slaves' and 'holders' entries for hidden gendisks
    nvme: also expose the namespace identification sysfs files for mpath nodes
    nvme: implement multipath access to nvme subsystems
    nvme: track shared namespaces
    nvme: introduce a nvme_ns_ids structure
    nvme: track subsystems
    block, nvme: Introduce blk_mq_req_flags_t
    block, scsi: Make SCSI quiesce and resume work reliably
    block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
    ...

    Linus Torvalds
     
  • Pull ext4 updates from Ted Ts'o:

    - Add support for online resizing of file systems with bigalloc

    - Fix two data corruption bugs involving DAX, as well as a corruption
    bug after a crash during a racing fallocate and delayed allocation.

    - Finally, a number of cleanups and optimizations.

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: improve smp scalability for inode generation
    ext4: add support for online resizing with bigalloc
    ext4: mention noload when recovering on read-only device
    Documentation: fix little inconsistencies
    ext4: convert timers to use timer_setup()
    jbd2: convert timers to use timer_setup()
    ext4: remove duplicate extended attributes defs
    ext4: add ext4_should_use_dax()
    ext4: add sanity check for encryption + DAX
    ext4: prevent data corruption with journaling + DAX
    ext4: prevent data corruption with inline data + DAX
    ext4: fix interaction between i_size, fallocate, and delalloc after a crash
    ext4: retry allocations conservatively
    ext4: Switch to iomap for SEEK_HOLE / SEEK_DATA
    ext4: Add iomap support for inline data
    iomap: Add IOMAP_F_DATA_INLINE flag
    iomap: Switch from blkno to disk offset

    Linus Torvalds
     

11 Nov, 2017

1 commit

  • guard_bio_eod() needs to look at the partition capacity, not just the
    capacity of the whole device, when determining if truncation is
    necessary.
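
    Conceptually, the fix clamps against the partition's size when the bio
    targets a partition and falls back to the whole-disk capacity otherwise.
    A sketch using that era's partition helpers (__disk_get_part() and
    part_nr_sects_read() are assumptions here, named for illustration):

    sector_t maxsector;
    struct hd_struct *part;

    rcu_read_lock();
    part = __disk_get_part(bio->bi_disk, bio->bi_partno);
    if (part)
        maxsector = part_nr_sects_read(part);   /* partition size */
    else
        maxsector = get_capacity(bio->bi_disk); /* whole device */
    rcu_read_unlock();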

    [ 60.268688] attempt to access beyond end of device
    [ 60.268690] unknown-block(9,1): rw=0, want=67103509, limit=67103506
    [ 60.268693] buffer_io_error: 2 callbacks suppressed
    [ 60.268696] Buffer I/O error on dev md1p7, logical block 4524305, async page read

    Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and partitions index")
    Cc: stable@vger.kernel.org # v4.13
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Greg Edwards
    Signed-off-by: Jens Axboe

    Greg Edwards
     

25 Oct, 2017

1 commit

  • …READ_ONCE()/WRITE_ONCE()

    Please do not apply this to mainline directly; instead, please re-run
    the coccinelle script shown below and apply its output.

    For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
    preference to ACCESS_ONCE(), and new code is expected to use one of the
    former. So far, there's been no reason to change most existing uses of
    ACCESS_ONCE(), as these aren't harmful, and changing them results in
    churn.

    However, for some features, the read/write distinction is critical to
    correct operation. To distinguish these cases, separate read/write
    accessors must be used. This patch migrates (most) remaining
    ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
    coccinelle script:

    ----
    // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
    // WRITE_ONCE()

    // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch

    virtual patch

    @ depends on patch @
    expression E1, E2;
    @@

    - ACCESS_ONCE(E1) = E2
    + WRITE_ONCE(E1, E2)

    @ depends on patch @
    expression E;
    @@

    - ACCESS_ONCE(E)
    + READ_ONCE(E)
    ----
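
    For a typical site the resulting change looks like this (illustrative
    before/after, not taken from the patch itself):

    /* before */
    tmp = ACCESS_ONCE(bh->b_state);
    ACCESS_ONCE(bh->b_state) = state;

    /* after */
    tmp = READ_ONCE(bh->b_state);
    WRITE_ONCE(bh->b_state, state);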

    Signed-off-by: Mark Rutland <mark.rutland@arm.com>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: davem@davemloft.net
    Cc: linux-arch@vger.kernel.org
    Cc: mpe@ellerman.id.au
    Cc: shuah@kernel.org
    Cc: snitzer@redhat.com
    Cc: thor.thayer@linux.intel.com
    Cc: tj@kernel.org
    Cc: viro@zeniv.linux.org.uk
    Cc: will.deacon@arm.com
    Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Mark Rutland
     

03 Oct, 2017

3 commits

  • Since the previous commit removed any case where grow_buffers()
    would return failure due to memory allocations, we can safely
    remove the case where we have to call free_more_memory() in
    this function.

    Since this is also the last user of free_more_memory(), kill
    it off completely.

    Reviewed-by: Nikolay Borisov
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We currently use __GFP_NOFAIL for find_or_create_page(), which means
    that it cannot fail. Ensure we also pass in 'retry == true' to
    alloc_page_buffers(), which also ensures that it cannot fail.

    After this, there are no failure cases in grow_dev_page() that
    occur because of a failed memory allocation.

    Reviewed-by: Nikolay Borisov
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Instead of adding weird retry logic in that function, utilize
    __GFP_NOFAIL to ensure that the vm takes care of handling any
    potential retries appropriately. This means we don't have to
    call free_more_memory() from here.
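
    A sketch of the allocation path with __GFP_NOFAIL (abbreviated; 'retry'
    is the flag passed in by callers of alloc_page_buffers()):

    gfp_t gfp = GFP_NOFS;

    if (retry)
        gfp |= __GFP_NOFAIL;

    /* With __GFP_NOFAIL the VM handles any retrying itself; the old
     * free_more_memory()-and-retry loop in buffer.c is no longer needed. */
    bh = alloc_buffer_head(gfp);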

    Reviewed-by: Nikolay Borisov
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jens Axboe
     

02 Oct, 2017

1 commit

  • Replace iomap->blkno, the sector number, with iomap->addr, the disk
    offset in bytes. For invalid disk offsets, use the special value
    IOMAP_NULL_ADDR instead of IOMAP_NULL_BLOCK.

    This allows iomap to be used for mappings which are not block aligned,
    such as inline data on ext4.
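
    A sketch of what a filesystem's iomap begin hook fills in after the
    change (illustrative only; 'hole' and 'block' stand in for the
    filesystem's own mapping state):

    if (hole)
        iomap->addr = IOMAP_NULL_ADDR;  /* was: iomap->blkno = IOMAP_NULL_BLOCK */
    else
        /* a disk offset in bytes instead of a 512-byte sector number,
         * so sub-block locations (e.g. inline data) can be described */
        iomap->addr = (u64)block << inode->i_blkbits;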

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Darrick J. Wong # iomap, xfs
    Reviewed-by: Jan Kara

    Andreas Gruenbacher
     

08 Sep, 2017

1 commit

  • Pull block layer updates from Jens Axboe:
    "This is the first pull request for 4.14, containing most of the code
    changes. It's a quiet series this round, which I think we needed after
    the churn of the last few series. This contains:

    - Fix for a registration race in loop, from Anton Volkov.

    - Overflow complaint fix from Arnd for DAC960.

    - Series of drbd changes from the usual suspects.

    - Conversion of the stec/skd driver to blk-mq. From Bart.

    - A few BFQ improvements/fixes from Paolo.

    - CFQ improvement from Ritesh, allowing idling for group idle.

    - A few fixes found by Dan's smatch, courtesy of Dan.

    - A warning fixup for a race between changing the IO scheduler and
    device removal. From David Jeffery.

    - A few nbd fixes from Josef.

    - Support for cgroup info in blktrace, from Shaohua.

    - Also from Shaohua, new features in the null_blk driver to allow it
    to actually hold data, among other things.

    - Various corner cases and error handling fixes from Weiping Zhang.

    - Improvements to the IO stats tracking for blk-mq from me. Can
    drastically improve performance for fast devices and/or big
    machines.

    - Series from Christoph removing bi_bdev as being needed for IO
    submission, in preparation for nvme multipathing code.

    - Series from Bart, including various cleanups and fixes for switch
    fall through case complaints"

    * 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
    kernfs: checking for IS_ERR() instead of NULL
    drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
    drbd: Fix allyesconfig build, fix recent commit
    drbd: switch from kmalloc() to kmalloc_array()
    drbd: abort drbd_start_resync if there is no connection
    drbd: move global variables to drbd namespace and make some static
    drbd: rename "usermode_helper" to "drbd_usermode_helper"
    drbd: fix race between handshake and admin disconnect/down
    drbd: fix potential deadlock when trying to detach during handshake
    drbd: A single dot should be put into a sequence.
    drbd: fix rmmod cleanup, remove _all_ debugfs entries
    drbd: Use setup_timer() instead of init_timer() to simplify the code.
    drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
    drbd: new disk-option disable-write-same
    drbd: Fix resource role for newly created resources in events2
    drbd: mark symbols static where possible
    drbd: Send P_NEG_ACK upon write error in protocol != C
    drbd: add explicit plugging when submitting batches
    drbd: change list_for_each_safe to while(list_first_entry_or_null)
    drbd: introduce drbd_recv_header_maybe_unplug
    ...

    Linus Torvalds
     

07 Sep, 2017

4 commits

  • All users of pagevec_lookup() and pagevec_lookup_range() now pass
    PAGEVEC_SIZE as a desired number of pages.

    Just drop the argument.

    Link: http://lkml.kernel.org/r/20170726114704.7626-11-jack@suse.cz
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • We want only pages from the given range in page_cache_seek_hole_data(). Use
    pagevec_lookup_range() instead of pagevec_lookup() and remove
    unnecessary code.

    Note that the check for getting less pages than desired can be removed
    because index gets updated by pagevec_lookup_range().

    Link: http://lkml.kernel.org/r/20170726114704.7626-9-jack@suse.cz
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Commit e64855c6cfaa ("fs: Add helper to clean bdev aliases under a bh
    and use it") added a wrapper for clean_bdev_aliases() that invalidates
    bdev aliases underlying a single buffer head.

    However this has caused a performance regression for bonnie++ benchmark
    on ext4 filesystem when delayed allocation is turned off (ext3 mode) -
    average of 3 runs:

    Hmean SeqOut Char 164787.55 ( 0.00%) 107189.06 (-34.95%)
    Hmean SeqOut Block 219883.89 ( 0.00%) 168870.32 (-23.20%)

    The reason for this regression is that clean_bdev_aliases() is slower
    when called for a single block, because the pagevec_lookup() it uses will
    end up iterating through the radix tree until it finds a page (which may
    take a while), but we are only interested in whether there's a page at a
    particular index.

    Fix the problem by using pagevec_lookup_range() instead which avoids the
    needless iteration.
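
    Roughly, the bounded lookup in clean_bdev_aliases() becomes (a sketch
    using the final pagevec_lookup_range() signature from this series;
    bd_mapping and bd_inode are the block device's inode mapping, as in the
    real function):

    pgoff_t index = block >> (PAGE_SHIFT - bd_inode->i_blkbits);
    pgoff_t end = index;    /* a single block maps to a single page */

    /* Stops at 'end' instead of walking the radix tree until it finds
     * *some* page, as plain pagevec_lookup() would. */
    while (pagevec_lookup_range(&pvec, bd_mapping, &index, end)) {
        /* ... drop aliasing buffers on each returned page ... */
        pagevec_release(&pvec);
    }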

    Fixes: e64855c6cfaa ("fs: Add helper to clean bdev aliases under a bh and use it")
    Link: http://lkml.kernel.org/r/20170726114704.7626-5-jack@suse.cz
    Signed-off-by: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Make pagevec_lookup() (and underlying find_get_pages()) update index to
    the next page where iteration should continue. Most callers want this
    and also pagevec_lookup_tag() already does this.

    Link: http://lkml.kernel.org/r/20170726114704.7626-3-jack@suse.cz
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

24 Aug, 2017

1 commit

  • This way we don't need a block_device structure to submit I/O. The
    block_device has different lifetime rules from the gendisk and
    request_queue and is usually only available when the block device node
    is open. Other callers need to explicitly create one (e.g. the lightnvm
    passthrough code, or the new nvme multipathing code).

    For the actual I/O path all that we need is the gendisk, which exists
    once per block device. But given that the block layer also does
    partition remapping we additionally need a partition index, which is
    used for said remapping in generic_make_request.

    Note that all the block drivers generally want request_queue or
    sometimes the gendisk, so this removes a layer of indirection all
    over the stack.
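
    For bio submitters the change boils down to carrying a gendisk plus a
    partition number instead of a block_device pointer; a sketch of the field
    change (the bio_set_dev() helper from the same 4.14 work wraps this, an
    assumption noted for illustration):

    /* before */
    bio->bi_bdev = bdev;

    /* after: gendisk plus partition index, remapped later in
     * generic_make_request() */
    bio->bi_disk = bdev->bd_disk;
    bio->bi_partno = bdev->bd_partno;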

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

11 Jul, 2017

2 commits

  • To install a buffer_head into the cpu's LRU queue, bh_lru_install()
    would construct a new copy of the queue and then memcpy it over the real
    queue. But it's easily possible to do the update in-place, which is
    faster and simpler. Some work can also be skipped if the buffer_head
    was already in the queue.
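
    The in-place update is essentially a rotate of the per-cpu array; a
    sketch of the loop (abbreviated, locking omitted):

    struct buffer_head *evictee = bh;
    struct bh_lru *b = this_cpu_ptr(&bh_lrus);
    int i;

    /* Shift bh in at slot 0 and ripple the rest down, all in place. */
    for (i = 0; i < BH_LRU_SIZE; i++) {
        swap(evictee, b->bhs[i]);
        if (evictee == bh)
            return;     /* bh was already in the LRU: refcounts balance */
    }

    get_bh(bh);         /* the LRU now holds an extra reference to bh */
    brelse(evictee);    /* drop whatever fell off the end */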

    As a microbenchmark I timed how long it takes to run sb_getblk()
    10,000,000 times alternating between BH_LRU_SIZE + 1 blocks.
    Effectively, this benchmarks looking up buffer_heads that are in the
    page cache but not in the LRU:

    Before this patch: 1.758s
    After this patch: 1.653s

    This patch also removes about 350 bytes of compiled code (on x86_64),
    partly due to removal of the memcpy() which was being inlined+unrolled.

    Link: http://lkml.kernel.org/r/20161229193445.1913-1-ebiggers3@gmail.com
    Signed-off-by: Eric Biggers
    Cc: Alexander Viro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Biggers
     
  • Pull XFS updates from Darrick Wong:
    "Here are some changes for you for 4.13. For the most part it's fixes
    for bugs and deadlock problems, and preparation for online fsck in
    some future merge window.

    - Avoid quotacheck deadlocks

    - Fix transaction overflows when bunmapping fragmented files

    - Refactor directory readahead

    - Allow admin to configure if ASSERT is fatal

    - Improve transaction usage detail logging during overflows

    - Minor cleanups

    - Don't leak log items when the log shuts down

    - Remove double-underscore typedefs

    - Various preparation for online scrubbing

    - Introduce new error injection configuration sysfs knobs

    - Refactor dq_get_next to use extent map directly

    - Fix problems with iterating the page cache for unwritten data

    - Implement SEEK_{HOLE,DATA} via iomap

    - Refactor XFS to use iomap SEEK_HOLE and SEEK_DATA

    - Don't use MAXPATHLEN to check on-disk symlink target lengths"

    * tag 'xfs-4.13-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (48 commits)
    xfs: don't crash on unexpected holes in dir/attr btrees
    xfs: rename MAXPATHLEN to XFS_SYMLINK_MAXLEN
    xfs: fix contiguous dquot chunk iteration livelock
    xfs: Switch to iomap for SEEK_HOLE / SEEK_DATA
    vfs: Add iomap_seek_hole and iomap_seek_data helpers
    vfs: Add page_cache_seek_hole_data helper
    xfs: remove a whitespace-only line from xfs_fs_get_nextdqblk
    xfs: rewrite xfs_dq_get_next_id using xfs_iext_lookup_extent
    xfs: Check for m_errortag initialization in xfs_errortag_test
    xfs: grab dquots without taking the ilock
    xfs: fix semicolon.cocci warnings
    xfs: Don't clear SGID when inheriting ACLs
    xfs: free cowblocks and retry on buffered write ENOSPC
    xfs: replace log_badcrc_factor knob with error injection tag
    xfs: convert drop_writes to use the errortag mechanism
    xfs: remove unneeded parameter from XFS_TEST_ERROR
    xfs: expose errortag knobs via sysfs
    xfs: make errortag a per-mountpoint structure
    xfs: free uncommitted transactions during log recovery
    xfs: don't allow bmap on rt files
    ...

    Linus Torvalds
     

10 Jul, 2017

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "The first major feature for ext4 this merge window is the largedir
    feature, which allows ext4 directories to support over 2 billion
    directory entries (assuming ~64 byte file names; in practice, users
    will run into practical performance limits first.) This feature was
    originally written by the Lustre team, and credit goes to Artem
    Blagodarenko from Seagate for getting this feature upstream.

    The second major feature allows ext4 to support extended
    attribute values up to 64k. This feature was also originally from
    Lustre, and has been enhanced by Tahsin Erdogan from Google with a
    deduplication feature so that if multiple files have the same xattr
    value (for example, Windows ACL's stored by Samba), only one copy will
    be stored on disk for encoding and caching efficiency.

    We also have the usual set of bug fixes, cleanups, and optimizations"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (47 commits)
    ext4: fix spelling mistake: "prellocated" -> "preallocated"
    ext4: fix __ext4_new_inode() journal credits calculation
    ext4: skip ext4_init_security() and encryption on ea_inodes
    fs: generic_block_bmap(): initialize all of the fields in the temp bh
    ext4: change fast symlink test to not rely on i_blocks
    ext4: require key for truncate(2) of encrypted file
    ext4: don't bother checking for encryption key in ->mmap()
    ext4: check return value of kstrtoull correctly in reserved_clusters_store
    ext4: fix off-by-one fsmap error on 1k block filesystems
    ext4: return EFSBADCRC if a bad checksum error is found in ext4_find_entry()
    ext4: return EIO on read error in ext4_find_entry
    ext4: forbid encrypting root directory
    ext4: send parallel discards on commit completions
    ext4: avoid unnecessary stalls in ext4_evict_inode()
    ext4: add nombcache mount option
    ext4: strong binding of xattr inode references
    ext4: eliminate xattr entry e_hash recalculation for removes
    ext4: reserve space for xattr entries/names
    quota: add get_inode_usage callback to transfer multi-inode charges
    ext4: xattr inode deduplication
    ...

    Linus Torvalds
     

08 Jul, 2017

1 commit

  • Pull Writeback error handling updates from Jeff Layton:
    "This pile represents the bulk of the writeback error handling fixes
    that I have for this cycle. Some of the earlier patches in this pile
    may look trivial but they are prerequisites for later patches in the
    series.

    The aim of this set is to improve how we track and report writeback
    errors to userland. Most applications that care about data integrity
    will periodically call fsync/fdatasync/msync to ensure that their
    writes have made it to the backing store.

    For a very long time, we have tracked writeback errors using two flags
    in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a
    writeback error occurs (via mapping_set_error) and are cleared as a
    side-effect of filemap_check_errors (as you noted yesterday). This
    model really sucks for userland.

    Only the first task to call fsync (or msync or fdatasync) will see the
    error. Any subsequent task calling fsync on a file will get back 0
    (unless another writeback error occurs in the interim). If I have
    several tasks writing to a file and calling fsync to ensure that their
    writes got stored, then I need to have them coordinate with one
    another. That's difficult enough, but in a world of containerized
    setups that coordination may even not be possible.

    But wait...it gets worse!

    The calls to filemap_check_errors can be buried pretty far down in the
    call stack, and there are internal callers of filemap_write_and_wait
    and the like that also end up clearing those errors. Many of those
    callers ignore the error return from that function or return it to
    userland at nonsensical times (e.g. truncate() or stat()). If I get
    back -EIO on a truncate, there is no reason to think that it was
    because some previous writeback failed, and a subsequent fsync() will
    (incorrectly) return 0.

    This pile aims to do three things:

    1) ensure that when a writeback error occurs, that error will be
    reported to userland on a subsequent fsync/fdatasync/msync call,
    regardless of what internal callers are doing

    2) report writeback errors on all file descriptions that were open at
    the time that the error occurred. This is a user-visible change,
    but I think most applications are written to assume this behavior
    anyway. Those that aren't are unlikely to be hurt by it.

    3) document what filesystems should do when there is a writeback
    error. Today, there is very little consistency between them, and a
    lot of cargo-cult copying. We need to make it very clear what
    filesystems should do in this situation.

    To achieve this, the set adds a new data type (errseq_t) and then
    builds new writeback error tracking infrastructure around that. Once
    all of that is in place, we change the filesystems to use the new
    infrastructure for reporting wb errors to userland.

    Note that this is just the initial foray into cleaning up this mess.
    There is a lot of work remaining here:

    1) convert the rest of the filesystems in a similar fashion. Once the
    initial set is in, then I think most other fs' will be fairly
    simple to convert. Hopefully most of those can go in via individual
    filesystem trees.

    2) convert internal waiters on writeback to use errseq_t for
    detecting errors instead of relying on the AS_* flags. I have some
    draft patches for this for ext4, but they are not quite ready for
    prime time yet.

    This was a discussion topic this year at LSF/MM too. If you're
    interested in the gory details, LWN has some good articles about this:

    https://lwn.net/Articles/718734/
    https://lwn.net/Articles/724307/"
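
    The heart of the new machinery is a per-mapping errseq_t that each open
    file samples and later checks; a sketch of the reporting pattern using
    the errseq helpers (field names follow the kernel's wb_err/f_wb_err
    usage):

    /* writeback path: record the error once, bumping the sequence */
    errseq_set(&mapping->wb_err, -EIO);

    /* open(): remember where this file description starts watching */
    file->f_wb_err = errseq_sample(&mapping->wb_err);

    /* fsync(): report any error recorded since this file last looked,
     * and advance its cursor so the error is reported exactly once here */
    err = errseq_check_and_advance(&mapping->wb_err, &file->f_wb_err);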

    * tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
    btrfs: minimal conversion to errseq_t writeback error reporting on fsync
    xfs: minimal conversion to errseq_t writeback error reporting
    ext4: use errseq_t based error handling for reporting data writeback errors
    fs: convert __generic_file_fsync to use errseq_t based reporting
    block: convert to errseq_t based writeback error tracking
    dax: set errors in mapping when writeback fails
    Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors
    mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
    fs: new infrastructure for writeback error handling and reporting
    lib: add errseq_t type and infrastructure for handling it
    mm: don't TestClearPageError in __filemap_fdatawait_range
    mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
    jbd2: don't clear and reset errors after waiting on writeback
    buffer: set errors in mapping at the time that the error occurs
    fs: check for writeback errors after syncing out buffers in generic_file_fsync
    buffer: use mapping_set_error instead of setting the flag
    mm: fix mapping_set_error call in me_pagecache_dirty

    Linus Torvalds
     

06 Jul, 2017

2 commits

  • I noticed on xfs that I could still sometimes get back an error on fsync
    on a fd that was opened after the error condition had been cleared.

    The problem is that the buffer code sets the write_io_error flag and
    then later checks that flag to set the error in the mapping. That flag
    persists for quite a while, however. If the file is later opened with
    O_TRUNC, the buffers will then be invalidated and the mapping's error
    set such that a subsequent fsync will return error. I think this is
    incorrect, as there was no writeback between the open and fsync.

    Add a new mark_buffer_write_io_error operation that sets the flag and
    the error in the mapping at the same time. Replace all calls to
    set_buffer_write_io_error with mark_buffer_write_io_error, and remove
    the places that check this flag in order to set the error in the
    mapping.

    This sets the error in the mapping earlier, at the time that it's first
    detected.
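
    A sketch of the combined operation (abbreviated; the real helper also
    covers the buffer's associated mapping):

    void mark_buffer_write_io_error(struct buffer_head *bh)
    {
        set_buffer_write_io_error(bh);
        /* Record the error in the mapping now, at detection time, rather
         * than waiting for a later check of the buffer flag. */
        if (bh->b_page && bh->b_page->mapping)
            mapping_set_error(bh->b_page->mapping, -EIO);
    }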

    Signed-off-by: Jeff Layton
    Reviewed-by: Jan Kara
    Reviewed-by: Carlos Maiolino

    Jeff Layton
     
  • Signed-off-by: Jeff Layton
    Reviewed-by: Jan Kara
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Christoph Hellwig

    Jeff Layton
     

05 Jul, 2017

1 commit

  • KMSAN (KernelMemorySanitizer, a new error detection tool) reports the
    use of uninitialized memory in ext4_update_bh_state():

    ==================================================================
    BUG: KMSAN: use of unitialized memory
    CPU: 3 PID: 1 Comm: swapper/0 Tainted: G B 4.8.0-rc6+ #597
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
    01/01/2011
    0000000000000282 ffff88003cc96f68 ffffffff81f30856 0000003000000008
    ffff88003cc96f78 0000000000000096 ffffffff8169742a ffff88003cc96ff8
    ffffffff812fc1fc 0000000000000008 ffff88003a1980e8 0000000100000000
    Call Trace:
    [< inline >] __dump_stack lib/dump_stack.c:15
    [] dump_stack+0xa6/0xc0 lib/dump_stack.c:51
    [] kmsan_report+0x1ec/0x300 mm/kmsan/kmsan.c:?
    [] __msan_warning+0x2b/0x40 ??:?
    [< inline >] ext4_update_bh_state fs/ext4/inode.c:727
    [] _ext4_get_block+0x6ca/0x8a0 fs/ext4/inode.c:759
    [] ext4_get_block+0x8c/0xa0 fs/ext4/inode.c:769
    [] generic_block_bmap+0x246/0x2b0 fs/buffer.c:2991
    [] ext4_bmap+0x5ee/0x660 fs/ext4/inode.c:3177
    ...
    origin description: ----tmp@generic_block_bmap
    ==================================================================

    (the line numbers are relative to 4.8-rc6, but the bug persists
    upstream)

    The local |tmp| is created in generic_block_bmap() and then passed into
    ext4_bmap() => ext4_get_block() => _ext4_get_block() =>
    ext4_update_bh_state(). Along the way tmp.b_page is never initialized
    before ext4_update_bh_state() checks its value.

    [ Use the approach suggested by Kees Cook of initializing the whole bh
    structure.]
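
    The fix amounts to zero-initializing the temporary buffer_head before
    handing it to get_block; a sketch of generic_block_bmap() after the
    change (abbreviated):

    struct buffer_head tmp = {
        .b_size = i_blocksize(inode),   /* every other field starts zeroed */
    };

    get_block(inode, block, &tmp, 0);
    return tmp.b_blocknr;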

    Signed-off-by: Alexander Potapenko
    Signed-off-by: Theodore Ts'o

    Alexander Potapenko
     

03 Jul, 2017

1 commit

  • Both ext4 and xfs implement seeking for the next hole or piece of data
    in unwritten extents by scanning the page cache, and both versions share
    the same bug when iterating the buffers of a page: the start offset into
    the page isn't taken into account, so when a page fits more than two
    filesystem blocks, things will go wrong. For example, on a filesystem
    with a block size of 1k, the following command will fail:

    xfs_io -f -c "falloc 0 4k" \
    -c "pwrite 1k 1k" \
    -c "pwrite 3k 1k" \
    -c "seek -a -r 0" foo

    In this example, neither lseek(fd, 1024, SEEK_HOLE) nor lseek(fd, 2048,
    SEEK_DATA) will return the correct result.

    Introduce a generic vfs helper for seeking in the page cache that gets
    this right. The next commits will replace the filesystem specific
    implementations.

    Signed-off-by: Andreas Gruenbacher
    [hch: dropped the export]
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Andreas Gruenbacher
     

28 Jun, 2017

1 commit


09 Jun, 2017

1 commit

  • Replace bi_error with a new bi_status to allow for a clear conversion.
    Note that device mapper overloaded bi_error with a private value, which
    we'll have to keep around at least for now and thus propagate to a
    proper blk_status_t value.
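
    For users of the block layer the conversion looks roughly like this
    (illustrative; BLK_STS_* codes replace raw negative errnos):

    /* completion path: report a failure */
    bio->bi_status = BLK_STS_IOERR;     /* was: bio->bi_error = -EIO; */
    bio_endio(bio);

    /* in a ->bi_end_io handler: check, and translate when an errno is
     * needed */
    if (bio->bi_status)
        error = blk_status_to_errno(bio->bi_status);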

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig