12 Jun, 2020

1 commit


03 Jun, 2020

1 commit

  • Implement the new readahead aop and convert all callers (block_dev,
    exfat, ext2, fat, gfs2, hpfs, isofs, jfs, nilfs2, ocfs2, omfs, qnx6,
    reiserfs & udf).

    The callers are all trivial except for GFS2 & OCFS2.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: Junxiao Bi # ocfs2
    Reviewed-by: Joseph Qi # ocfs2
    Reviewed-by: Dave Chinner
    Reviewed-by: John Hubbard
    Reviewed-by: Christoph Hellwig
    Reviewed-by: William Kucharski
    Cc: Chao Yu
    Cc: Cong Wang
    Cc: Darrick J. Wong
    Cc: Eric Biggers
    Cc: Gao Xiang
    Cc: Jaegeuk Kim
    Cc: Michal Hocko
    Cc: Zi Yan
    Cc: Johannes Thumshirn
    Cc: Miklos Szeredi
    Link: http://lkml.kernel.org/r/20200414150233.24495-17-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

13 Jan, 2020

1 commit


09 Jan, 2020

1 commit

  • Commit 85a8ce62c2ea ("block: add bio_truncate to fix guard_bio_eod")
    adds bio_truncate() for handling bio EOD. However, bio_truncate()
    doesn't use the passed 'op' parameter from guard_bio_eod's callers.

    So bio_trunacate() may retrieve wrong 'op', and zering pages may
    not be done for READ bio.

    Fixes this issue by moving guard_bio_eod() after bio_set_op_attrs()
    in submit_bh_wbc() so that bio_truncate() can always retrieve correct
    op info.

    Meantime remove the 'op' parameter from guard_bio_eod() because it isn't
    used any more.

    Cc: Carlos Maiolino
    Cc: linux-fsdevel@vger.kernel.org
    Fixes: 85a8ce62c2ea ("block: add bio_truncate to fix guard_bio_eod")
    Signed-off-by: Ming Lei

    Fold in kerneldoc and bio_op() change.

    Signed-off-by: Jens Axboe

    Ming Lei
     

24 Jul, 2019

1 commit


10 Jul, 2019

1 commit


21 May, 2019

1 commit


04 May, 2019

2 commits

  • Change-Id: I4380c68c3474026a42ffa9f95c525f9a563ba7a3

    Todd Kjos
     
  • Adds tracepoints in ext4/f2fs/mpage to track readpages/buffered
    write()s. This allows us to track files that are being read/written
    to PIDs.

    Bug: 120445624
    Change-Id: I44476230324e9397e292328463f846af4befbd6d
    [joelaf: Needed for storaged fsync accounting ("storaged --uid" and
    "storaged --task".)]
    Signed-off-by: Mohan Srinivasan
    [AmitP: Folded following android-4.9 commit changes into this patch
    a5c4dbb05ab7 ("ANDROID: Replace spaces by '_' for some
    android filesystem tracepoints.")]
    Signed-off-by: Amit Pundir
    [astrachan: Folded 63066f4acf92 ("ANDROID: fs: Refactor FS
    readpage/write tracepoints.") into this patch
    Signed-off-by: Alistair Strachan

    Mohan Srinivasan
     

30 Apr, 2019

1 commit


15 Feb, 2019

1 commit

  • This patch introduces one extra iterator variable to bio_for_each_segment_all(),
    then we can allow bio_for_each_segment_all() to iterate over multi-page bvec.

    Given it is just one mechannical & simple change on all bio_for_each_segment_all()
    users, this patch does tree-wide change in one single patch, so that we can
    avoid to use a temporary helper for this conversion.

    Reviewed-by: Omar Sandoval
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

18 Aug, 2018

2 commits

  • a_ops->readpages() is only ever used for read-ahead, yet we don't flag
    the IO being submitted as such. Fix that up. Any file system that uses
    mpage_readpages() as its ->readpages() implementation will now get this
    right.

    Since we're passing in whether the IO is read-ahead or not, we don't
    need to pass in the 'gfp' separately, as it is dependent on the IO being
    read-ahead. Kill off that member.

    Add some documentation notes on ->readpages() being purely for
    read-ahead.

    Link: http://lkml.kernel.org/r/20180621010725.17813-3-axboe@kernel.dk
    Signed-off-by: Jens Axboe
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Cc: Chris Mason
    Cc: Christoph Hellwig
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jens Axboe
     
  • Patch series "Submit ->readpages() IO as read-ahead", v4.

    The only caller of ->readpages() is from read-ahead, yet we don't submit
    IO flagged with REQ_RAHEAD. This means we don't see it in blktrace, for
    instance, which is a shame. Additionally, it's preventing further
    functional changes in the block layer for deadling with read-ahead more
    intelligently. We already make assumptions about ->readpages() just
    being for read-ahead in the mpage implementation, using
    readahead_gfp_mask(mapping) as out GFP mask of choice.

    This small series fixes up mpage_readpages() to submit with REQ_RAHEAD,
    which takes care of file systems using mpage_readpages(). The first
    patch is a prep patch, that makes do_mpage_readpage() take an argument
    structure.

    This patch (of 4):

    We're currently passing 8 arguments to this function, clean it up a bit
    by packing the arguments in an args structure we pass to it.

    No intentional functional changes in this patch.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20180621010725.17813-2-axboe@kernel.dk
    Signed-off-by: Jens Axboe
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Theodore Ts'o
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

18 Jul, 2018

1 commit

  • c11f0c0b5bb9 ("block/mm: make bdev_ops->rw_page() take a bool for
    read/write") replaced @op with boolean @is_write, which limited the
    amount of information going into ->rw_page() and more importantly
    page_endio(), which removed the need to expose block internals to mm.

    Unfortunately, we want to track discards separately and @is_write
    isn't enough information. This patch updates bdev_ops->rw_page() to
    take REQ_OP instead but leaves page_endio() to take bool @is_write.
    This allows the block part of operations to have enough information
    while not leaking it to mm.

    Signed-off-by: Tejun Heo
    Cc: Mike Christie
    Cc: Minchan Kim
    Cc: Dan Williams
    Signed-off-by: Jens Axboe

    Tejun Heo
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information it it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

14 Oct, 2017

1 commit

  • When using FAT on a block device which supports rw_page, we can hit
    BUG_ON(!PageLocked(page)) in try_to_free_buffers(). This is because we
    call clean_buffers() after unlocking the page we've written. Introduce
    a new clean_page_buffers() which cleans all buffers associated with a
    page and call it from within bdev_write_page().

    [akpm@linux-foundation.org: s/PAGE_SIZE/~0U/ per Linus and Matthew]
    Link: http://lkml.kernel.org/r/20171006211541.GA7409@bombadil.infradead.org
    Signed-off-by: Matthew Wilcox
    Reported-by: Toshi Kani
    Reported-by: OGAWA Hirofumi
    Tested-by: Toshi Kani
    Acked-by: Johannes Thumshirn
    Cc: Ross Zwisler
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

24 Aug, 2017

1 commit

  • This way we don't need a block_device structure to submit I/O. The
    block_device has different life time rules from the gendisk and
    request_queue and is usually only available when the block device node
    is open. Other callers need to explicitly create one (e.g. the lightnvm
    passthrough code, or the new nvme multipathing code).

    For the actual I/O path all that we need is the gendisk, which exists
    once per block device. But given that the block layer also does
    partition remapping we additionally need a partition index, which is
    used for said remapping in generic_make_request.

    Note that all the block drivers generally want request_queue or
    sometimes the gendisk, so this removes a layer of indirection all
    over the stack.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

04 Jul, 2017

1 commit

  • Pull documentation updates from Jonathan Corbet:
    "There has been a fair amount of activity in the docs tree this time
    around. Highlights include:

    - Conversion of a bunch of security documentation into RST

    - The conversion of the remaining DocBook templates by The Amazing
    Mauro Machine. We can now drop the entire DocBook build chain.

    - The usual collection of fixes and minor updates"

    * tag 'docs-4.13' of git://git.lwn.net/linux: (90 commits)
    scripts/kernel-doc: handle DECLARE_HASHTABLE
    Documentation: atomic_ops.txt is core-api/atomic_ops.rst
    Docs: clean up some DocBook loose ends
    Make the main documentation title less Geocities
    Docs: Use kernel-figure in vidioc-g-selection.rst
    Docs: fix table problems in ras.rst
    Docs: Fix breakage with Sphinx 1.5 and upper
    Docs: Include the Latex "ifthen" package
    doc/kokr/howto: Only send regression fixes after -rc1
    docs-rst: fix broken links to dynamic-debug-howto in kernel-parameters
    doc: Document suitability of IBM Verse for kernel development
    Doc: fix a markup error in coding-style.rst
    docs: driver-api: i2c: remove some outdated information
    Documentation: DMA API: fix a typo in a function name
    Docs: Insert missing space to separate link from text
    doc/ko_KR/memory-barriers: Update control-dependencies example
    Documentation, kbuild: fix typo "minimun" -> "minimum"
    docs: Fix some formatting issues in request-key.rst
    doc: ReSTify keys-trusted-encrypted.txt
    doc: ReSTify keys-request-key.txt
    ...

    Linus Torvalds
     

28 Jun, 2017

1 commit


09 Jun, 2017

1 commit

  • Replace bi_error with a new bi_status to allow for a clear conversion.
    Note that device mapper overloaded bi_error with a private value, which
    we'll have to keep arround at least for now and thus propagate to a
    proper blk_status_t value.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

16 May, 2017

1 commit

  • Sphinx gets confused when it finds identation without a
    good reason for it and without a preceding blank line:

    ./fs/mpage.c:347: ERROR: Unexpected indentation.
    ./fs/namei.c:4303: ERROR: Unexpected indentation.
    ./fs/fs-writeback.c:2060: ERROR: Unexpected indentation.

    No functional changes.

    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab
     

28 Feb, 2017

1 commit

  • Replace all 1 << inode->i_blkbits and (1 << inode->i_blkbits) in fs
    branch.

    This patch also fixes multiple checkpatch warnings: WARNING: Prefer
    'unsigned int' to bare use of 'unsigned'

    Thanks to Andrew Morton for suggesting more appropriate function instead
    of macro.

    [geliangtang@gmail.com: truncate: use i_blocksize()]
    Link: http://lkml.kernel.org/r/9c8b2cd83c8f5653805d43debde9fa8817e02fc4.1484895804.git.geliangtang@gmail.com
    Link: http://lkml.kernel.org/r/1481319905-10126-1-git-send-email-fabf@skynet.be
    Signed-off-by: Fabian Frederick
    Signed-off-by: Geliang Tang
    Cc: Alexander Viro
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     

05 Nov, 2016

1 commit

  • Add a helper function that clears buffer heads from a block device
    aliasing passed bh. Use this helper function from filesystems instead of
    the original unmap_underlying_metadata() to save some boiler plate code
    and also have a better name for the functionalily since it is not
    unmapping anything for a *long* time.

    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

03 Nov, 2016

1 commit

  • Add wbc_to_write_flags(), which returns the write modifier flags to use,
    based on a struct writeback_control. No functional changes in this
    patch, but it prepares us for factoring other wbc fields for write type.

    Signed-off-by: Jens Axboe
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig

    Jens Axboe
     

01 Nov, 2016

1 commit


08 Aug, 2016

1 commit

  • Commit abf545484d31 changed it from an 'rw' flags type to the
    newer ops based interface, but now we're effectively leaking
    some bdev internals to the rest of the kernel. Since we only
    care about whether it's a read or a write at that level, just
    pass in a bool 'is_write' parameter instead.

    Then we can also move op_is_write() and friends back under
    CONFIG_BLOCK protection.

    Reviewed-by: Mike Christie
    Signed-off-by: Jens Axboe

    Jens Axboe
     

05 Aug, 2016

1 commit

  • The rw_page users were not converted to use bio/req ops. As a result
    bdev_write_page is not passing down REQ_OP_WRITE and the IOs will
    be sent down as reads.

    Signed-off-by: Mike Christie
    Fixes: 4e1b2d52a80d ("block, fs, drivers: remove REQ_OP compat defs and related code")

    Modified by me to:

    1) Drop op_flags passing into ->rw_page(), as we don't use it.
    2) Make op_is_write() and friends safe to use for !CONFIG_BLOCK

    Signed-off-by: Jens Axboe

    Mike Christie
     

27 Jul, 2016

2 commits

  • Merge updates from Andrew Morton:

    - a few misc bits

    - ocfs2

    - most(?) of MM

    * emailed patches from Andrew Morton : (125 commits)
    thp: fix comments of __pmd_trans_huge_lock()
    cgroup: remove unnecessary 0 check from css_from_id()
    cgroup: fix idr leak for the first cgroup root
    mm: memcontrol: fix documentation for compound parameter
    mm: memcontrol: remove BUG_ON in uncharge_list
    mm: fix build warnings in
    mm, thp: convert from optimistic swapin collapsing to conservative
    mm, thp: fix comment inconsistency for swapin readahead functions
    thp: update Documentation/{vm/transhuge,filesystems/proc}.txt
    shmem: split huge pages beyond i_size under memory pressure
    thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE
    khugepaged: add support of collapse for tmpfs/shmem pages
    shmem: make shmem_inode_info::lock irq-safe
    khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page()
    thp: extract khugepaged from mm/huge_memory.c
    shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
    shmem: add huge pages support
    shmem: get_unmapped_area align huge page
    shmem: prepare huge= mount option and sysfs knob
    mm, rmap: account shmem thp pages
    ...

    Linus Torvalds
     
  • Vladimir has noticed that we might declare memcg oom even during
    readahead because read_pages only uses GFP_KERNEL (with mapping_gfp
    restriction) while __do_page_cache_readahead uses
    page_cache_alloc_readahead which adds __GFP_NORETRY to prevent from
    OOMs. This gfp mask discrepancy is really unfortunate and easily
    fixable. Drop page_cache_alloc_readahead() which only has one user and
    outsource the gfp_mask logic into readahead_gfp_mask and propagate this
    mask from __do_page_cache_readahead down to read_pages.

    This alone would have only very limited impact as most filesystems are
    implementing ->readpages and the common implementation mpage_readpages
    does GFP_KERNEL (with mapping_gfp restriction) again. We can tell it to
    use readahead_gfp_mask instead as this function is called only during
    readahead as well. The same applies to read_cache_pages.

    ext4 has its own ext4_mpage_readpages but the path which has pages !=
    NULL can use the same gfp mask. Btrfs, cifs, f2fs and orangefs are
    doing a very similar pattern to mpage_readpages so the same can be
    applied to them as well.

    [akpm@linux-foundation.org: coding-style fixes]
    [mhocko@suse.com: restrict gfp mask in mpage_alloc]
    Link: http://lkml.kernel.org/r/20160610074223.GC32285@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/1465301556-26431-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Chris Mason
    Cc: Steve French
    Cc: Theodore Ts'o
    Cc: Jan Kara
    Cc: Mike Marshall
    Cc: Jaegeuk Kim
    Cc: Changman Lee
    Cc: Chao Yu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

08 Jun, 2016

2 commits


05 Apr, 2016

2 commits

  • Mostly direct substitution with occasional adjustment or removing
    outdated comments.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
    ago with promise that one day it will be possible to implement page
    cache with bigger chunks than PAGE_SIZE.

    This promise never materialized. And unlikely will.

    We have many places where PAGE_CACHE_SIZE assumed to be equal to
    PAGE_SIZE. And it's constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> ;

    - >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> ;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is revert of changes to
    PAGE_CAHCE_ALIGN definition: we are going to drop it later.

    There are few places in the code where coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation also
    will be addressed with the separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

16 Mar, 2016

1 commit


07 Nov, 2015

1 commit

  • There are many places which use mapping_gfp_mask to restrict a more
    generic gfp mask which would be used for allocations which are not
    directly related to the page cache but they are performed in the same
    context.

    Let's introduce a helper function which makes the restriction explicit and
    easier to track. This patch doesn't introduce any functional changes.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Michal Hocko
    Suggested-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

05 Nov, 2015

1 commit

  • Pull core block updates from Jens Axboe:
    "This is the core block pull request for 4.4. I've got a few more
    topic branches this time around, some of them will layer on top of the
    core+drivers changes and will come in a separate round. So not a huge
    chunk of changes in this round.

    This pull request contains:

    - Enable blk-mq page allocation tracking with kmemleak, from Catalin.

    - Unused prototype removal in blk-mq from Christoph.

    - Cleanup of the q->blk_trace exchange, using cmpxchg instead of two
    xchg()'s, from Davidlohr.

    - A plug flush fix from Jeff.

    - Also from Jeff, a fix that means we don't have to update shared tag
    sets at init time unless we do a state change. This cuts down boot
    times on thousands of devices a lot with scsi/blk-mq.

    - blk-mq waitqueue barrier fix from Kosuke.

    - Various fixes from Ming:

    - Fixes for segment merging and splitting, and checks, for
    the old core and blk-mq.

    - Potential blk-mq speedup by marking ctx pending at the end
    of a plug insertion batch in blk-mq.

    - direct-io no page dirty on kernel direct reads.

    - A WRITE_SYNC fix for mpage from Roman"

    * 'for-4.4/core' of git://git.kernel.dk/linux-block:
    blk-mq: avoid excessive boot delays with large lun counts
    blktrace: re-write setting q->blk_trace
    blk-mq: mark ctx as pending at batch in flush plug path
    blk-mq: fix for trace_block_plug()
    block: check bio_mergeable() early before merging
    blk-mq: check bio_mergeable() early before merging
    block: avoid to merge splitted bio
    block: setup bi_phys_segments after splitting
    block: fix plug list flushing for nomerge queues
    blk-mq: remove unused blk_mq_clone_flush_request prototype
    blk-mq: fix waitqueue_active without memory barrier in block/blk-mq-tag.c
    fs: direct-io: don't dirtying pages for ITER_BVEC/ITER_KVEC direct read
    fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
    block: kmemleak: Track the page allocations for struct request

    Linus Torvalds
     

17 Oct, 2015

1 commit

  • Commit 6afdb859b710 ("mm: do not ignore mapping_gfp_mask in page cache
    allocation paths") has caught some users of hardcoded GFP_KERNEL used in
    the page cache allocation paths. This, however, wasn't complete and
    there were others which went unnoticed.

    Dave Chinner has reported the following deadlock for xfs on loop device:
    : With the recent merge of the loop device changes, I'm now seeing
    : XFS deadlock on my single CPU, 1GB RAM VM running xfs/073.
    :
    : The deadlocked is as follows:
    :
    : kloopd1: loop_queue_read_work
    : xfs_file_iter_read
    : lock XFS inode XFS_IOLOCK_SHARED (on image file)
    : page cache read (GFP_KERNEL)
    : radix tree alloc
    : memory reclaim
    : reclaim XFS inodes
    : log force to unpin inodes
    :
    :
    : xfs-cil/loop1:
    : xlog_cil_push
    : xlog_write
    :
    : xlog_state_get_iclog_space()
    :
    :
    :
    : kloopd1: loop_queue_write_work
    : xfs_file_write_iter
    : lock XFS inode XFS_IOLOCK_EXCL (on image file)
    :
    :
    : i.e. the kloopd, with it's split read and write work queues, has
    : introduced a dependency through memory reclaim. i.e. that writes
    : need to be able to progress for reads make progress.
    :
    : The problem, fundamentally, is that mpage_readpages() does a
    : GFP_KERNEL allocation, rather than paying attention to the inode's
    : mapping gfp mask, which is set to GFP_NOFS.
    :
    : The didn't used to happen, because the loop device used to issue
    : reads through the splice path and that does:
    :
    : error = add_to_page_cache_lru(page, mapping, index,
    : GFP_KERNEL & mapping_gfp_mask(mapping));

    This has changed by commit aa4d86163e4 ("block: loop: switch to VFS
    ITER_BVEC").

    This patch changes mpage_readpage{s} to follow gfp mask set for the
    mapping. There are, however, other places which are doing basically the
    same.

    lustre:ll_dir_filler is doing GFP_KERNEL from the function which
    apparently uses GFP_NOFS for other allocations so let's make this
    consistent.

    cifs:readpages_get_pages is called from cifs_readpages and
    __cifs_readpages_from_fscache called from the same path obeys mapping
    gfp.

    ramfs_nommu_expand_for_mapping is hardcoding GFP_KERNEL as well
    regardless it uses mapping_gfp_mask for the page allocation.

    ext4_mpage_readpages is the called from the page cache allocation path
    same as read_pages and read_cache_pages

    As I've noticed in my previous post I cannot say I would be happy about
    sprinkling mapping_gfp_mask all over the place and it sounds like we
    should drop gfp_mask argument altogether and use it internally in
    __add_to_page_cache_locked that would require all the filesystems to use
    mapping gfp consistently which I am not sure is the case here. From a
    quick glance it seems that some file system use it all the time while
    others are selective.

    Signed-off-by: Michal Hocko
    Reported-by: Dave Chinner
    Cc: "Theodore Ts'o"
    Cc: Ming Lei
    Cc: Andreas Dilger
    Cc: Oleg Drokin
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

24 Sep, 2015

1 commit

  • In case of wbc->sync_mode == WB_SYNC_ALL we need to do data integrity
    write, thus mark request as WRITE_SYNC.

    akpm: afaict this change will cause the data integrity write bios to be
    placed onto the second queue in cfq_io_cq.cfqq[], which presumably results
    in special treatment. The documentation for REQ_SYNC is horrid.

    Signed-off-by: Roman Pen
    Reviewed-by: Jan Kara
    Signed-off-by: Andrew Morton
    Reviewed-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Roman Pen
     

14 Aug, 2015

1 commit

  • We can always fill up the bio now, no need to estimate the possible
    size based on queue parameters.

    Acked-by: Steven Whitehouse
    Signed-off-by: Kent Overstreet
    [hch: rebased and wrote a changelog]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Ming Lin
    Signed-off-by: Jens Axboe

    Kent Overstreet
     

29 Jul, 2015

1 commit

  • Currently we have two different ways to signal an I/O error on a BIO:

    (1) by clearing the BIO_UPTODATE flag
    (2) by returning a Linux errno value to the bi_end_io callback

    The first one has the drawback of only communicating a single possible
    error (-EIO), and the second one has the drawback of not beeing persistent
    when bios are queued up, and are not passed along from child to parent
    bio in the ever more popular chaining scenario. Having both mechanisms
    available has the additional drawback of utterly confusing driver authors
    and introducing bugs where various I/O submitters only deal with one of
    them, and the others have to add boilerplate code to deal with both kinds
    of error returns.

    So add a new bi_error field to store an errno value directly in struct
    bio and remove the existing mechanisms to clean all this up.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: NeilBrown
    Signed-off-by: Jens Axboe

    Christoph Hellwig