18 Oct, 2020

1 commit

  • For the case where read-ahead is disabled on the file, or if the cgroup
    is congested, ensure that we can at least do 1 page of read-ahead to
    make progress on the read in an async fashion. This could potentially be
    larger, but it's not needed in terms of functionality, so let's err on
    the side of caution as larger counts of pages may run into reclaim
    issues (particularly if we're congested).

    This makes sure we're not hitting the potentially sync ->readpage() path
    for IO that is marked IOCB_WAITQ, which could cause us to block. It also
    means we'll use the same path for IO, regardless of whether or not
    read-ahead happens to be disabled on the lower level device.
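
    A minimal sketch of the resulting logic, simplified from the 5.10-era
    page_cache_sync_ra() (names and details are illustrative, not the
    verbatim kernel code):

    static void sync_ra_sketch(struct readahead_control *ractl,
                               struct file_ra_state *ra,
                               unsigned long req_count)
    {
            if (!ra->ra_pages || blk_cgroup_congested()) {
                    if (!ractl->file)
                            return;
                    /* One page is enough to make async progress and
                     * avoid the potentially sync ->readpage() path. */
                    force_page_cache_ra(ractl, ra, 1);
                    return;
            }
            ondemand_readahead(ractl, ra, false, req_count);
    }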

    Acked-by: Johannes Weiner
    Reported-by: Matthew Wilcox (Oracle)
    Reported-by: Hao Xu
    [axboe: updated for new ractl API]
    Signed-off-by: Jens Axboe

    Jens Axboe
     

17 Oct, 2020

7 commits

  • The file_ra_state being passed into page_cache_sync_readahead() was being
    ignored in favour of using the one embedded in the struct file. The only
    caller for which this makes a difference is the fsverity code if the file
    has been marked as POSIX_FADV_RANDOM, but it's confusing and worth fixing.

    Signed-off-by: David Howells
    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Cc: Eric Biggers
    Link: https://lkml.kernel.org/r/20200903140844.14194-10-willy@infradead.org
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Reimplement page_cache_sync_readahead() and page_cache_async_readahead()
    as wrappers around versions of the function which take a readahead_control
    in preparation for making do_sync_mmap_readahead() pass down an RAC
    struct.
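
    The wrapper shape is roughly as follows (a sketch based on the
    5.10-era include/linux/pagemap.h; details may differ):

    static inline void
    page_cache_sync_readahead(struct address_space *mapping,
                    struct file_ra_state *ra, struct file *file,
                    pgoff_t index, unsigned long req_count)
    {
            /* Build a readahead_control on the stack and hand it to
             * the ractl-taking implementation. */
            DEFINE_READAHEAD(ractl, file, mapping, index);
            page_cache_sync_ra(&ractl, ra, req_count);
    }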

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Cc: David Howells
    Cc: Eric Biggers
    Link: https://lkml.kernel.org/r/20200903140844.14194-8-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Reimplement force_page_cache_readahead() as a wrapper around
    force_page_cache_ra(). Pass the existing readahead_control from
    page_cache_sync_readahead().
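
    A sketch of what force_page_cache_ra() does with the passed-in ractl
    (illustrative; the chunk size and helpers follow the era's
    mm/readahead.c):

    /* Submit readahead in 2MB chunks so one huge request doesn't
     * pin too many pages at once. */
    while (nr_to_read) {
            unsigned long this_chunk = (2 * 1024 * 1024) / PAGE_SIZE;

            if (this_chunk > nr_to_read)
                    this_chunk = nr_to_read;
            ractl->_index = index;
            do_page_cache_ra(ractl, this_chunk, 0);

            index += this_chunk;
            nr_to_read -= this_chunk;
    }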

    Signed-off-by: David Howells
    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Cc: Eric Biggers
    Link: https://lkml.kernel.org/r/20200903140844.14194-7-willy@infradead.org
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Make ondemand_readahead() take a readahead_control struct in preparation
    for making do_sync_mmap_readahead() pass down an RAC struct.

    Signed-off-by: David Howells
    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Cc: Eric Biggers
    Link: https://lkml.kernel.org/r/20200903140844.14194-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Rename __do_page_cache_readahead() to do_page_cache_ra() and call it
    directly from ondemand_readahead() instead of indirecting via ra_submit().

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Cc: David Howells
    Cc: Eric Biggers
    Link: https://lkml.kernel.org/r/20200903140844.14194-5-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Define the readahead_control in the callers instead of in
    page_cache_ra_unbounded().

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Cc: David Howells
    Cc: Eric Biggers
    Link: https://lkml.kernel.org/r/20200903140844.14194-4-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Patch series "Readahead patches for 5.9/5.10".

    These are infrastructure for both the THP patchset and for the fscache
    rewrite.

    For both pieces of infrastructure being built on top of this patchset, we
    want the ractl to be available higher in the call stack.

    For David's work, he wants to add the 'critical page' to the ractl so that
    he knows which page NEEDS to be brought in from storage, and which ones
    are nice-to-have. We might want something similar in block storage too.
    It used to be simple -- the first page was the critical one, but then mmap
    added fault-around and so for that usecase, the middle page is the
    critical one. Anyway, I don't have any code to show that yet, we just
    know that the lowest point in the callchain where we have that information
    is do_sync_mmap_readahead() and so the ractl needs to start its life
    there.

    For THP, we have the code that needs it. It's actually the apex patch to
    the series; the one which finally starts to allocate THPs and present them
    to consenting filesystems:
    http://git.infradead.org/users/willy/pagecache.git/commitdiff/798bcf30ab2eff278caad03a9edca74d2f8ae760

    This patch (of 8):

    Allow for a more concise definition of a struct readahead_control.
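
    The convenience macro looks roughly like this (field names as in the
    5.9-era struct; a sketch, not necessarily the final form):

    #define DEFINE_READAHEAD(rac, f, m, i)                          \
            struct readahead_control rac = {                        \
                    .file = f,                                      \
                    .mapping = m,                                   \
                    ._index = i,                                    \
            }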

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Cc: Eric Biggers
    Cc: David Howells
    Link: https://lkml.kernel.org/r/20200903140844.14194-1-willy@infradead.org
    Link: https://lkml.kernel.org/r/20200903140844.14194-3-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

03 Jun, 2020

13 commits

  • Ensure that memory allocations in the readahead path do not attempt to
    reclaim file-backed pages, which could lead to a deadlock. It is
    possible, though unlikely, that this is the root cause of a problem
    observed by Cong Wang.
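
    The mechanism is the scoped-NOFS API; a sketch of the pattern as
    applied to the allocation section (placement is illustrative):

    unsigned int nofs = memalloc_nofs_save();

    /* Allocations in this scope implicitly behave as GFP_NOFS, so
     * reclaim cannot recurse into the filesystem and deadlock. */
    page = __page_cache_alloc(readahead_gfp_mask(mapping));

    memalloc_nofs_restore(nofs);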

    Reported-by: Cong Wang
    Suggested-by: Michal Hocko
    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Cc: Chao Yu
    Cc: Christoph Hellwig
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Eric Biggers
    Cc: Gao Xiang
    Cc: Jaegeuk Kim
    Cc: John Hubbard
    Cc: Joseph Qi
    Cc: Junxiao Bi
    Cc: Zi Yan
    Cc: Johannes Thumshirn
    Cc: Miklos Szeredi
    Link: http://lkml.kernel.org/r/20200414150233.24495-16-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • If the page is already in cache, we don't set PageReadahead on it.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: Christoph Hellwig
    Reviewed-by: William Kucharski
    Cc: Chao Yu
    Cc: Cong Wang
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Eric Biggers
    Cc: Gao Xiang
    Cc: Jaegeuk Kim
    Cc: John Hubbard
    Cc: Joseph Qi
    Cc: Junxiao Bi
    Cc: Michal Hocko
    Cc: Zi Yan
    Cc: Johannes Thumshirn
    Cc: Miklos Szeredi
    Link: http://lkml.kernel.org/r/20200414150233.24495-15-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • ext4 and f2fs have duplicated the guts of the readahead code so they can
    read past i_size. Instead, separate out the guts of the readahead code
    so they can call it directly.
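
    The separated-out helper can then be called directly; a sketch of the
    fsverity-style call (argument names are illustrative):

    /* Read Merkle tree pages that live past i_size; no struct file
     * is needed for this internal read. */
    page_cache_readahead_unbounded(inode->i_mapping, NULL,
                                   index, num_ra_pages, 0);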

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Tested-by: Eric Biggers
    Reviewed-by: Christoph Hellwig
    Reviewed-by: William Kucharski
    Reviewed-by: Eric Biggers
    Cc: Chao Yu
    Cc: Cong Wang
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Gao Xiang
    Cc: Jaegeuk Kim
    Cc: John Hubbard
    Cc: Joseph Qi
    Cc: Junxiao Bi
    Cc: Michal Hocko
    Cc: Zi Yan
    Cc: Johannes Thumshirn
    Cc: Miklos Szeredi
    Link: http://lkml.kernel.org/r/20200414150233.24495-14-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • By reducing nr_to_read, we can eliminate the end_index check from inside
    the loop.
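
    The hoisted check amounts to something like this (a sketch; the exact
    arithmetic lives in __do_page_cache_readahead()):

    pgoff_t end_index = (i_size_read(inode) - 1) >> PAGE_SHIFT;

    if (index > end_index)
            return;
    /* Clamp once up front instead of testing every loop iteration;
     * never read past the page containing the last byte of the file. */
    if (nr_to_read > end_index - index)
            nr_to_read = end_index - index + 1;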

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: John Hubbard
    Reviewed-by: William Kucharski
    Cc: Chao Yu
    Cc: Christoph Hellwig
    Cc: Cong Wang
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Eric Biggers
    Cc: Gao Xiang
    Cc: Jaegeuk Kim
    Cc: Joseph Qi
    Cc: Junxiao Bi
    Cc: Michal Hocko
    Cc: Zi Yan
    Cc: Johannes Thumshirn
    Cc: Miklos Szeredi
    Link: http://lkml.kernel.org/r/20200414150233.24495-13-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • This replaces ->readpages with a saner interface:

    - Return void instead of an ignored error code.
    - Page cache is already populated with locked pages when ->readahead
      is called.
    - New arguments can be passed to the implementation without changing
      all the filesystems that use a common helper function like
      mpage_readahead().
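
    A sketch of what a filesystem implementation can look like under the
    new interface (the myfs_* names are hypothetical):

    static void myfs_readahead(struct readahead_control *rac)
    {
            struct page *page;

            /* Pages are already locked and in the page cache; start
             * I/O on each one and drop our reference. The I/O
             * completion path unlocks the page. */
            while ((page = readahead_page(rac))) {
                    myfs_read_page_async(page);  /* hypothetical helper */
                    put_page(page);
            }
    }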

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: John Hubbard
    Reviewed-by: Christoph Hellwig
    Reviewed-by: William Kucharski
    Cc: Chao Yu
    Cc: Cong Wang
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Eric Biggers
    Cc: Gao Xiang
    Cc: Jaegeuk Kim
    Cc: Joseph Qi
    Cc: Junxiao Bi
    Cc: Michal Hocko
    Cc: Zi Yan
    Cc: Johannes Thumshirn
    Cc: Miklos Szeredi
    Link: http://lkml.kernel.org/r/20200414150233.24495-12-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • When populating the page cache for readahead, mappings that use
    ->readpages must populate the page cache themselves as the pages are
    passed on a linked list which would normally be used for the page
    cache's LRU. For mappings that use ->readpage or the upcoming
    ->readahead method, we can put the pages into the page cache as soon as
    they're allocated, which solves a race between readahead and direct IO.
    It also lets us remove the gfp argument from read_pages().

    Use the new readahead_page() API to implement the repeated calls to
    ->readpage(), just like most filesystems will.
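
    The ->readpage() fallback inside read_pages() then becomes a simple
    loop (sketch; aops is rac->mapping->a_ops):

    while ((page = readahead_page(rac))) {
            aops->readpage(rac->file, page);
            put_page(page);
    }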

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: Christoph Hellwig
    Reviewed-by: William Kucharski
    Cc: Chao Yu
    Cc: Cong Wang
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Eric Biggers
    Cc: Gao Xiang
    Cc: Jaegeuk Kim
    Cc: John Hubbard
    Cc: Joseph Qi
    Cc: Junxiao Bi
    Cc: Michal Hocko
    Cc: Zi Yan
    Cc: Johannes Thumshirn
    Cc: Miklos Szeredi
    Link: http://lkml.kernel.org/r/20200414150233.24495-11-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Replace the page_offset variable with 'index + i'.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: John Hubbard
    Reviewed-by: Christoph Hellwig
    Reviewed-by: William Kucharski
    Cc: Chao Yu
    Cc: Cong Wang
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Eric Biggers
    Cc: Gao Xiang
    Cc: Jaegeuk Kim
    Cc: Joseph Qi
    Cc: Junxiao Bi
    Cc: Michal Hocko
    Cc: Zi Yan
    Cc: Johannes Thumshirn
    Cc: Miklos Szeredi
    Link: http://lkml.kernel.org/r/20200414150233.24495-10-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Change the type of page_idx to unsigned long, and rename it -- it's just
    a loop counter, not a page index.

    Suggested-by: John Hubbard
    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: Dave Chinner
    Reviewed-by: William Kucharski
    Reviewed-by: Johannes Thumshirn
    Cc: Chao Yu
    Cc: Christoph Hellwig
    Cc: Cong Wang
    Cc: Darrick J. Wong
    Cc: Eric Biggers
    Cc: Gao Xiang
    Cc: Jaegeuk Kim
    Cc: Joseph Qi
    Cc: Junxiao Bi
    Cc: Michal Hocko
    Cc: Zi Yan
    Cc: Miklos Szeredi
    Link: http://lkml.kernel.org/r/20200414150233.24495-9-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • The word 'offset' is used ambiguously to mean 'byte offset within a
    page', 'byte offset from the start of the file' and 'page offset from
    the start of the file'.

    Use 'index' to mean 'page offset from the start of the file' throughout
    the readahead code.

    [ We should probably rename the 'pgoff_t' type to 'pgidx_t' too - Linus ]

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: Zi Yan
    Reviewed-by: William Kucharski
    Cc: Chao Yu
    Cc: Christoph Hellwig
    Cc: Cong Wang
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Eric Biggers
    Cc: Gao Xiang
    Cc: Jaegeuk Kim
    Cc: John Hubbard
    Cc: Joseph Qi
    Cc: Junxiao Bi
    Cc: Michal Hocko
    Cc: Johannes Thumshirn
    Cc: Miklos Szeredi
    Link: http://lkml.kernel.org/r/20200414150233.24495-8-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Use a readahead_control struct to pass arguments; in this patch, only
    between __do_page_cache_readahead() and read_pages(), but it will be
    extended in upcoming patches. The read_pages() function becomes aops
    centric, as this makes the most sense by the end of the patchset.
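
    The struct being threaded through looks roughly like this (5.7-era
    fields; the underscored members are private to the readahead code):

    struct readahead_control {
            struct file *file;
            struct address_space *mapping;
    /* private: use the readahead_* accessors to read these */
            pgoff_t _index;
            unsigned int _nr_pages;
            unsigned int _batch_count;
    };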

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: John Hubbard
    Reviewed-by: Christoph Hellwig
    Reviewed-by: William Kucharski
    Reviewed-by: Johannes Thumshirn
    Cc: Chao Yu
    Cc: Cong Wang
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Eric Biggers
    Cc: Gao Xiang
    Cc: Jaegeuk Kim
    Cc: Joseph Qi
    Cc: Junxiao Bi
    Cc: Michal Hocko
    Cc: Zi Yan
    Cc: Miklos Szeredi
    Link: http://lkml.kernel.org/r/20200414150233.24495-7-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Simplify the callers by moving the check for nr_pages and the BUG_ON
    into read_pages().

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: Zi Yan
    Reviewed-by: John Hubbard
    Reviewed-by: Christoph Hellwig
    Reviewed-by: William Kucharski
    Reviewed-by: Johannes Thumshirn
    Cc: Chao Yu
    Cc: Cong Wang
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Eric Biggers
    Cc: Gao Xiang
    Cc: Jaegeuk Kim
    Cc: Joseph Qi
    Cc: Junxiao Bi
    Cc: Michal Hocko
    Cc: Miklos Szeredi
    Link: http://lkml.kernel.org/r/20200414150233.24495-5-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • We used to assign the return value to a variable, which we then ignored.
    Remove the pretence of caring.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Reviewed-by: John Hubbard
    Reviewed-by: William Kucharski
    Reviewed-by: Johannes Thumshirn
    Cc: Chao Yu
    Cc: Cong Wang
    Cc: Darrick J. Wong
    Cc: Eric Biggers
    Cc: Gao Xiang
    Cc: Jaegeuk Kim
    Cc: Joseph Qi
    Cc: Junxiao Bi
    Cc: Michal Hocko
    Cc: Zi Yan
    Cc: Miklos Szeredi
    Link: http://lkml.kernel.org/r/20200414150233.24495-4-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • ondemand_readahead has two callers, neither of which use the return
    value. That means that both ra_submit and __do_page_cache_readahead()
    can return void, and we don't need to worry that a present page in the
    readahead window causes us to return a smaller nr_pages than we ought to
    have.

    Similarly, no caller uses the return value from
    force_page_cache_readahead().

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: Dave Chinner
    Reviewed-by: John Hubbard
    Reviewed-by: Christoph Hellwig
    Reviewed-by: William Kucharski
    Cc: Chao Yu
    Cc: Cong Wang
    Cc: Darrick J. Wong
    Cc: Eric Biggers
    Cc: Gao Xiang
    Cc: Jaegeuk Kim
    Cc: Joseph Qi
    Cc: Junxiao Bi
    Cc: Michal Hocko
    Cc: Zi Yan
    Cc: Johannes Thumshirn
    Cc: Miklos Szeredi
    Link: http://lkml.kernel.org/r/20200414150233.24495-3-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only
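
    For a C source file such as mm/readahead.c, the tag is a single new
    first line:

    // SPDX-License-Identifier: GPL-2.0-only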

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

06 Mar, 2019

1 commit

  • Many kernel-doc comments in mm/ have the return value descriptions
    either misformatted or omitted entirely, which makes the kernel-doc
    script unhappy:

    $ make V=1 htmldocs
    ...
    ./mm/util.c:36: info: Scanning doc for kstrdup
    ./mm/util.c:41: warning: No description found for return value of 'kstrdup'
    ./mm/util.c:57: info: Scanning doc for kstrdup_const
    ./mm/util.c:66: warning: No description found for return value of 'kstrdup_const'
    ./mm/util.c:75: info: Scanning doc for kstrndup
    ./mm/util.c:83: warning: No description found for return value of 'kstrndup'
    ...

    Fixing the formatting and adding the missing return value descriptions
    eliminates ~100 such warnings.
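
    The fix is the kernel-doc "Return:" section; a sketch based on the
    repaired kstrdup() comment:

    /**
     * kstrdup - allocate space for and copy an existing string
     * @s: the string to duplicate
     * @gfp: the GFP mask used in the kmalloc() call when allocating memory
     *
     * Return: newly allocated copy of @s or %NULL in case of error
     */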

    Link: http://lkml.kernel.org/r/1549549644-4903-4-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

29 Dec, 2018

1 commit

  • It's a trivial simplification for get_next_ra_size() and clear enough for
    humans to understand.

    It also fixes a potential overflow if ra->size (< ra_pages) is too large.
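
    The simplified helper reads roughly as follows (a sketch of the
    post-patch form):

    static unsigned long get_next_ra_size(struct file_ra_state *ra,
                                          unsigned long max)
    {
            unsigned long cur = ra->size;

            if (cur < max / 16)
                    return 4 * cur;
            if (cur <= max / 2)
                    return 2 * cur;
            /* Returning max directly cannot overflow, unlike the old
             * code that kept growing a newsize variable. */
            return max;
    }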

    Link: http://lkml.kernel.org/r/1540707206-19649-1-git-send-email-hsiangkao@aol.com
    Signed-off-by: Gao Xiang
    Reviewed-by: Fengguang Wu
    Reviewed-by: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gao Xiang
     

30 Sep, 2018

1 commit

  • Introduce xarray value entries and tagged pointers to replace radix
    tree exceptional entries. This is a slight change in encoding to allow
    the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
    value entry). It is also a change in emphasis; exceptional entries are
    intimidating and different. As the comment explains, you can choose
    to store values or pointers in the xarray and they are both first-class
    citizens.
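
    A usage sketch of value entries (the xa_* helpers are the API this
    change introduces):

    void *entry = xa_mk_value(42);  /* encode an integer as an entry */

    if (xa_is_value(entry))
            pr_info("stored value %lu\n", xa_to_value(entry));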

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Josef Bacik

    Matthew Wilcox
     

31 Aug, 2018

1 commit

  • The implementation of the readahead(2) syscall is identical to that of
    fadvise64(POSIX_FADV_WILLNEED) with a few exceptions:
    1. readahead(2) returns -EINVAL for !mapping->a_ops while fadvise64()
       ignores the request and returns 0.
    2. fadvise64() checks for an integer overflow corner case.
    3. fadvise64() calls the optional filesystem fadvise() file operation.

    Unite the two implementations by calling vfs_fadvise() from the
    readahead(2) syscall. Check !mapping->a_ops in the readahead(2) syscall
    to preserve the documented syscall ABI behaviour.
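
    A sketch of the united body after the fdget()/FMODE_READ checks (f is
    the struct fd; error handling elided):

    if (!f.file->f_mapping || !f.file->f_mapping->a_ops)
            return -EINVAL;         /* preserve the readahead(2) ABI */

    /* Everything else, including the optional filesystem ->fadvise()
     * file operation, is handled by the fadvise64() implementation. */
    return vfs_fadvise(f.file, offset, count, POSIX_FADV_WILLNEED);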

    Suggested-by: Miklos Szeredi
    Fixes: d1d04ef8572b ("ovl: stack file ops")
    Signed-off-by: Amir Goldstein
    Signed-off-by: Miklos Szeredi

    Amir Goldstein
     

27 Jul, 2018

1 commit

  • ondemand_readahead() checks bdi->io_pages to cap the maximum pages
    that need to be processed. This works until the readit section. If
    we do an async-only readahead (async size = sync size) and the
    target is at the beginning of the window, we expand the pages by
    another get_next_ra_size() pages. blktrace for large reads shows
    that the kernel always issues a doubled-size read at the beginning
    of processing. Add an additional check for io_pages in the lower
    part of the function. The fix helps devices that hard-limit bio
    pages and rely on proper handling of max_hw_read_sectors (e.g.
    older FusionIO cards). For that reason it could qualify for stable.
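
    The added clamp near the readit label looks roughly like this (a
    sketch of the fix; variable names follow ondemand_readahead()):

    if (offset == ra->start && ra->size == ra->async_size) {
            add_pages = get_next_ra_size(ra, max_pages);
            if (ra->size + add_pages <= max_pages) {
                    /* The window can still grow within the cap. */
                    ra->async_size = add_pages;
                    ra->size += add_pages;
            } else {
                    /* Cap the doubled read at what io_pages allows. */
                    ra->size = max_pages;
                    ra->async_size = max_pages >> 1;
            }
    }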

    Fixes: 9491ae4a ("mm: don't cap request size based on read-ahead setting")
    Cc: stable@vger.kernel.org
    Signed-off-by: Markus Stockhausen
    Signed-off-by: Jens Axboe

    Markus Stockhausen
     

09 Jul, 2018

1 commit

  • We noticed in testing we'd get pretty bad latency stalls under heavy
    pressure because read ahead would try to do its thing while the cgroup
    was under severe pressure. If we're under this much pressure we want to
    do as little IO as possible so we can still make progress on real work
    if we're a throttled cgroup, so just skip readahead if our group is
    under pressure.
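
    The skip is a one-line early-out in the readahead entry points
    (sketch):

    if (blk_cgroup_congested())
            return;   /* under pressure: do as little IO as possible */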

    Signed-off-by: Josef Bacik
    Acked-by: Tejun Heo
    Acked-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Josef Bacik
     

12 Apr, 2018

1 commit

  • Remove the address_space ->tree_lock and use the xa_lock newly added to
    the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
    since we don't really care that it's a tree.
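
    After the rename, call sites take the lock through the xarray, e.g.:

    xa_lock_irq(&mapping->i_pages);
    /* ... modify the page cache ... */
    xa_unlock_irq(&mapping->i_pages);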

    [willy@infradead.org: fix nds32, fs/dax.c]
    Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.org
    Link: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jeff Layton
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Ryusuke Konishi
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

03 Apr, 2018

1 commit

  • Using this helper allows us to avoid the in-kernel calls to the
    sys_readahead() syscall. The ksys_ prefix denotes that this function is
    meant as a drop-in replacement for the syscall. In particular, it uses the
    same calling convention as sys_readahead().

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
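
    The syscall itself then reduces to a thin wrapper, which is the
    pattern this series applies:

    SYSCALL_DEFINE3(readahead, int, fd, loff_t, offset, size_t, count)
    {
            return ksys_readahead(fd, offset, count);
    }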

    Cc: Andrew Morton
    Cc: linux-mm@kvack.org
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     

13 Dec, 2016

1 commit

  • We ran into a funky issue, where someone doing 256K buffered reads saw
    128K requests at the device level. Turns out it is read-ahead capping
    the request size, since we use 128K as the default setting. This
    doesn't make a lot of sense - if someone is issuing 256K reads, they
    should see 256K reads, regardless of the read-ahead setting, if the
    underlying device can support a 256K read in a single command.

    This patch introduces a bdi hint, io_pages. This is the soft max IO
    size for the lower level, I've hooked it up to the bdev settings here.
    Read-ahead is modified to issue the maximum of the user request size,
    and the read-ahead max size, but capped to the max request size on the
    device side. The latter is done to avoid reading ahead too much, if the
    application asks for a huge read. With this patch, the kernel behaves
    like the application expects.
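
    The sizing logic described above amounts to the following (a sketch
    of the ondemand_readahead() hunk):

    unsigned long max_pages = ra->ra_pages;

    /* Allow a bigger read if the user asked for it and the device
     * (bdi->io_pages) can take it in one request. */
    if (req_size > max_pages && bdi->io_pages > max_pages)
            max_pages = min(req_size, bdi->io_pages);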

    Link: http://lkml.kernel.org/r/1479498073-8657-1-git-send-email-axboe@fb.com
    Signed-off-by: Jens Axboe
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

27 Aug, 2016

1 commit

  • For DAX inodes we need to be careful to never have page cache pages in
    the mapping->page_tree. This radix tree should be composed only of DAX
    exceptional entries and zero pages.

    ltp's readahead02 test was triggering a warning because we were trying
    to insert a DAX exceptional entry but found that a page cache page had
    already been inserted into the tree. This page was being inserted into
    the radix tree in response to a readahead(2) call.

    Readahead doesn't make sense for DAX inodes, but we don't want it to
    report a failure either. Instead, we just return success and don't do
    any work.
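
    The no-op is a simple early return (a sketch; the exact function it
    lands in is elided here):

    if (dax_mapping(mapping))
            return 0;   /* report success without doing any work */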

    Link: http://lkml.kernel.org/r/20160824221429.21158-1-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reported-by: Jeff Moyer
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Jan Kara
    Cc: stable@vger.kernel.org [4.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

27 Jul, 2016

1 commit

  • Vladimir has noticed that we might declare memcg oom even during
    readahead because read_pages only uses GFP_KERNEL (with mapping_gfp
    restriction) while __do_page_cache_readahead uses
    page_cache_alloc_readahead which adds __GFP_NORETRY to prevent from
    OOMs. This gfp mask discrepancy is really unfortunate and easily
    fixable. Drop page_cache_alloc_readahead() which only has one user and
    outsource the gfp_mask logic into readahead_gfp_mask and propagate this
    mask from __do_page_cache_readahead down to read_pages.

    This alone would have only very limited impact as most filesystems are
    implementing ->readpages and the common implementation mpage_readpages
    does GFP_KERNEL (with mapping_gfp restriction) again. We can tell it to
    use readahead_gfp_mask instead as this function is called only during
    readahead as well. The same applies to read_cache_pages.

    ext4 has its own ext4_mpage_readpages but the path which has pages !=
    NULL can use the same gfp mask. Btrfs, cifs, f2fs and orangefs are
    doing a very similar pattern to mpage_readpages so the same can be
    applied to them as well.
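
    The helper centralising the mask looks like this (a 4.7-era sketch;
    some flags, e.g. __GFP_COLD, were removed from the kernel later):

    static inline gfp_t readahead_gfp_mask(struct address_space *x)
    {
            /* Readahead is speculative: don't retry hard, don't warn,
             * and never trigger the OOM killer for it. */
            return mapping_gfp_mask(x) |
                   __GFP_COLD | __GFP_NORETRY | __GFP_NOWARN;
    }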

    [akpm@linux-foundation.org: coding-style fixes]
    [mhocko@suse.com: restrict gfp mask in mpage_alloc]
    Link: http://lkml.kernel.org/r/20160610074223.GC32285@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/1465301556-26431-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Chris Mason
    Cc: Steve French
    Cc: Theodore Ts'o
    Cc: Jan Kara
    Cc: Mike Marshall
    Cc: Jaegeuk Kim
    Cc: Changman Lee
    Cc: Chao Yu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

05 Apr, 2016

1 commit

  • The PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
    time ago with the promise that one day it would be possible to implement
    the page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized. And unlikely will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE, and it's a constant source of confusion whether PAGE_CACHE_*
    or PAGE_* constants should be used in a particular case, especially on
    the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straightforward:

    - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using the
    script below. For some reason, coccinelle doesn't patch header files;
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Jan, 2016

1 commit