21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

06 Mar, 2019

1 commit

  • Many kernel-doc comments in mm/ have the return value descriptions
    either misformatted or omitted altogether, which makes the kernel-doc script
    unhappy:

    $ make V=1 htmldocs
    ...
    ./mm/util.c:36: info: Scanning doc for kstrdup
    ./mm/util.c:41: warning: No description found for return value of 'kstrdup'
    ./mm/util.c:57: info: Scanning doc for kstrdup_const
    ./mm/util.c:66: warning: No description found for return value of 'kstrdup_const'
    ./mm/util.c:75: info: Scanning doc for kstrndup
    ./mm/util.c:83: warning: No description found for return value of 'kstrndup'
    ...

    Fixing the formatting and adding the missing return value descriptions
    eliminates ~100 such warnings.

    Link: http://lkml.kernel.org/r/1549549644-4903-4-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
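
As an illustration of the formatting this commit asks for, a kernel-doc comment with an explicit return value section could look like the sketch below. The function `dup_string()` is a hypothetical userspace stand-in for kstrdup(), not the kernel implementation; only the comment layout, in particular the `Return:` section, is the point here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/**
 * dup_string - allocate space for and copy an existing string
 * @s: the string to duplicate, may be NULL
 *
 * Hypothetical userspace stand-in for kstrdup(); it exists only to
 * carry the comment layout.
 *
 * Return: newly allocated copy of @s, or %NULL if @s is NULL or the
 * allocation fails. The caller must free() the result.
 */
static char *dup_string(const char *s)
{
	size_t len;
	char *buf;

	if (!s)
		return NULL;

	len = strlen(s) + 1;	/* include the terminating NUL */
	buf = malloc(len);
	if (buf)
		memcpy(buf, s, len);
	return buf;
}
```

With a `Return:` section present, the kernel-doc script has a description to attach to the return value, which is what the warnings quoted above complain about.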
     

29 Dec, 2018

1 commit

  • It's a trivial simplification for get_next_ra_size() and clear enough for
    humans to understand.

    It also fixes potential overflow if ra->size(< ra_pages) is too large.

    Link: http://lkml.kernel.org/r/1540707206-19649-1-git-send-email-hsiangkao@aol.com
    Signed-off-by: Gao Xiang
    Reviewed-by: Fengguang Wu
    Reviewed-by: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gao Xiang
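
A minimal userspace sketch of the simplified window growth is shown below. The name `next_ra_size()` and the plain `unsigned long` parameters are assumptions made for illustration; the kernel function takes a `struct file_ra_state`. The idea is to compare against fractions of `max` before multiplying, so the multiplication cannot overflow.

```c
#include <assert.h>
#include <limits.h>

/*
 * Sketch: quadruple small windows, double medium ones, otherwise
 * clamp to max. Because cur is checked against max / 16 and max / 2
 * first, 4 * cur and 2 * cur are only computed when the product is
 * known to stay at or below max.
 */
static unsigned long next_ra_size(unsigned long cur, unsigned long max)
{
	if (cur < max / 16)
		return 4 * cur;
	if (cur <= max / 2)
		return 2 * cur;
	return max;
}
```

For example, with `max` set to `ULONG_MAX` and `cur` just above half of that, a "multiply first, clamp afterwards" version would wrap around, while this form simply returns `max`.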
     

21 Oct, 2018

2 commits


30 Sep, 2018

1 commit

  • Introduce xarray value entries and tagged pointers to replace radix
    tree exceptional entries. This is a slight change in encoding to allow
    the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
    value entry). It is also a change in emphasis; exceptional entries are
    intimidating and different. As the comment explains, you can choose
    to store values or pointers in the xarray and they are both first-class
    citizens.

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Josef Bacik

    Matthew Wilcox
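
The encoding described above can be sketched in userspace as follows. The helper names `mk_value()`, `to_value()` and `is_value()` are hypothetical and only mirror the idea; the kernel's own helpers live in the xarray header. The low bit marks an entry as a value, which leaves BITS_PER_LONG - 1 bits for the payload.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch: shift the integer left by one and set bit 0 as the tag. */
static inline void *mk_value(uintptr_t v)
{
	return (void *)((v << 1) | 1);
}

/* Sketch: drop the tag bit again to recover the integer. */
static inline uintptr_t to_value(const void *entry)
{
	return (uintptr_t)entry >> 1;
}

/* Sketch: aligned pointers have bit 0 clear, value entries have it set. */
static inline bool is_value(const void *entry)
{
	return ((uintptr_t)entry & 1) != 0;
}
```

The top bit of the integer is lost in the shift, so only values that fit in one bit less than a pointer round-trip; ordinary pointers to aligned objects are never mistaken for values.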
     

31 Aug, 2018

1 commit

  • The implementation of readahead(2) syscall is identical to that of
    fadvise64(POSIX_FADV_WILLNEED) with a few exceptions:
    1. readahead(2) returns -EINVAL for !mapping->a_ops and fadvise64()
    ignores the request and returns 0.
    2. fadvise64() checks for integer overflow corner case
    3. fadvise64() calls the optional filesystem fadvise() file operation

    Unite the two implementations by calling vfs_fadvise() from readahead(2)
    syscall. Check the !mapping->a_ops in readahead(2) syscall to preserve
    documented syscall ABI behaviour.

    Suggested-by: Miklos Szeredi
    Fixes: d1d04ef8572b ("ovl: stack file ops")
    Signed-off-by: Amir Goldstein
    Signed-off-by: Miklos Szeredi

    Amir Goldstein
     

27 Jul, 2018

1 commit

  • ondemand_readahead() checks bdi->io_pages to cap the maximum pages
    that need to be processed. This works until the readit section. If
    we would do an async only readahead (async size = sync size) and
    target is at beginning of window we expand the pages by another
    get_next_ra_size() pages. Btrace for large reads shows that kernel
    always issues a doubled size read at the beginning of processing.
    Add an additional check for io_pages in the lower part of the func.
    The fix helps devices that hard limit bio pages and rely on proper
    handling of max_hw_read_sectors (e.g. older FusionIO cards). For
    that reason it could qualify for stable.

    Fixes: 9491ae4a ("mm: don't cap request size based on read-ahead setting")
    Cc: stable@vger.kernel.org
    Signed-off-by: Markus Stockhausen stockhausen@collogia.de
    Signed-off-by: Jens Axboe

    Markus Stockhausen
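
The additional check described above might be sketched as follows. The struct `ra_window` and the function `expand_async_window()` are hypothetical names for illustration, and `add_pages` stands for the result of get_next_ra_size(); this is not the kernel code itself.

```c
#include <assert.h>

/* Hypothetical, reduced stand-in for the readahead state. */
struct ra_window {
	unsigned long size;		/* pages in the window */
	unsigned long async_size;	/* pages to read asynchronously */
};

/*
 * Sketch: grow the window by add_pages only if the result still fits
 * within max_pages; otherwise clamp the window to max_pages so the
 * request is not doubled past the device limit.
 */
static void expand_async_window(struct ra_window *ra,
				unsigned long add_pages,
				unsigned long max_pages)
{
	if (ra->size + add_pages <= max_pages) {
		ra->async_size = add_pages;
		ra->size += add_pages;
	} else {
		ra->size = max_pages;
		ra->async_size = max_pages >> 1;
	}
}
```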
     

09 Jul, 2018

1 commit

  • We noticed in testing we'd get pretty bad latency stalls under heavy
    pressure because read ahead would try to do its thing while the cgroup
    was under severe pressure. If we're under this much pressure we want to
    do as little IO as possible so we can still make progress on real work
    if we're a throttled cgroup, so just skip readahead if our group is
    under pressure.

    Signed-off-by: Josef Bacik
    Acked-by: Tejun Heo
    Acked-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Josef Bacik
     

02 Jun, 2018

3 commits


12 Apr, 2018

1 commit

  • Remove the address_space ->tree_lock and use the xa_lock newly added to
    the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
    since we don't really care that it's a tree.

    [willy@infradead.org: fix nds32, fs/dax.c]
    Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.org
    Link: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jeff Layton
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Ryusuke Konishi
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

03 Apr, 2018

1 commit

  • Using this helper allows us to avoid the in-kernel calls to the
    sys_readahead() syscall. The ksys_ prefix denotes that this function is
    meant as a drop-in replacement for the syscall. In particular, it uses the
    same calling convention as sys_readahead().

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Andrew Morton
    Cc: linux-mm@kvack.org
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     

13 Dec, 2016

1 commit

  • We ran into a funky issue, where someone doing 256K buffered reads saw
    128K requests at the device level. Turns out it is read-ahead capping
    the request size, since we use 128K as the default setting. This
    doesn't make a lot of sense - if someone is issuing 256K reads, they
    should see 256K reads, regardless of the read-ahead setting, if the
    underlying device can support a 256K read in a single command.

    This patch introduces a bdi hint, io_pages. This is the soft max IO
    size for the lower level, I've hooked it up to the bdev settings here.
    Read-ahead is modified to issue the maximum of the user request size,
    and the read-ahead max size, but capped to the max request size on the
    device side. The latter is done to avoid reading ahead too much, if the
    application asks for a huge read. With this patch, the kernel behaves
    like the application expects.

    Link: http://lkml.kernel.org/r/1479498073-8657-1-git-send-email-axboe@fb.com
    Signed-off-by: Jens Axboe
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jens Axboe
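
The sizing rule described in this commit can be sketched as below. The function name `ra_max_pages()` is an assumption for illustration; all three arguments are page counts, with `ra_pages` standing for the read-ahead setting, `req_size` for the user request and `io_pages` for the bdi hint.

```c
#include <assert.h>

/*
 * Sketch: start from the read-ahead setting. If the request is larger
 * and the device accepts more than the read-ahead setting, issue the
 * request size, capped at what the device accepts.
 */
static unsigned long ra_max_pages(unsigned long ra_pages,
				  unsigned long req_size,
				  unsigned long io_pages)
{
	unsigned long max_pages = ra_pages;

	if (req_size > max_pages && io_pages > max_pages)
		max_pages = req_size < io_pages ? req_size : io_pages;

	return max_pages;
}
```

With 4K pages, a 128K read-ahead setting is 32 pages and a 256K read is 64 pages, so a device that accepts at least 64 pages per request sees the full 256K read.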
     

27 Aug, 2016

1 commit

  • For DAX inodes we need to be careful to never have page cache pages in
    the mapping->page_tree. This radix tree should be composed only of DAX
    exceptional entries and zero pages.

    ltp's readahead02 test was triggering a warning because we were trying
    to insert a DAX exceptional entry but found that a page cache page had
    already been inserted into the tree. This page was being inserted into
    the radix tree in response to a readahead(2) call.

    Readahead doesn't make sense for DAX inodes, but we don't want it to
    report a failure either. Instead, we just return success and don't do
    any work.

    Link: http://lkml.kernel.org/r/20160824221429.21158-1-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reported-by: Jeff Moyer
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Jan Kara
    Cc: [4.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

27 Jul, 2016

1 commit

  • Vladimir has noticed that we might declare memcg oom even during
    readahead because read_pages only uses GFP_KERNEL (with mapping_gfp
    restriction) while __do_page_cache_readahead uses
    page_cache_alloc_readahead which adds __GFP_NORETRY to prevent
    OOMs. This gfp mask discrepancy is really unfortunate and easily
    fixable. Drop page_cache_alloc_readahead() which only has one user and
    outsource the gfp_mask logic into readahead_gfp_mask and propagate this
    mask from __do_page_cache_readahead down to read_pages.

    This alone would have only very limited impact as most filesystems are
    implementing ->readpages and the common implementation mpage_readpages
    does GFP_KERNEL (with mapping_gfp restriction) again. We can tell it to
    use readahead_gfp_mask instead as this function is called only during
    readahead as well. The same applies to read_cache_pages.

    ext4 has its own ext4_mpage_readpages but the path which has pages !=
    NULL can use the same gfp mask. Btrfs, cifs, f2fs and orangefs are
    doing a very similar pattern to mpage_readpages so the same can be
    applied to them as well.

    [akpm@linux-foundation.org: coding-style fixes]
    [mhocko@suse.com: restrict gfp mask in mpage_alloc]
    Link: http://lkml.kernel.org/r/20160610074223.GC32285@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/1465301556-26431-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Chris Mason
    Cc: Steve French
    Cc: Theodore Ts'o
    Cc: Jan Kara
    Cc: Mike Marshall
    Cc: Jaegeuk Kim
    Cc: Changman Lee
    Cc: Chao Yu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
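
A rough userspace sketch of the mask handling follows. The `SK_GFP_*` bit values, `struct sketch_mapping` and `sketch_readahead_gfp_mask()` are all hypothetical and chosen only for illustration; real gfp flags have different values, and any further flags the kernel helper may set are left out. The point is that one mask is computed once and then used for every allocation in the readahead path.

```c
#include <assert.h>

typedef unsigned int sketch_gfp_t;

/* Hypothetical flag bits, not the kernel's values. */
#define SK_GFP_IO	0x1u
#define SK_GFP_FS	0x2u
#define SK_GFP_RECLAIM	0x4u
#define SK_GFP_NORETRY	0x8u
#define SK_GFP_NOFS	(SK_GFP_IO | SK_GFP_RECLAIM)

struct sketch_mapping {
	sketch_gfp_t gfp_mask;	/* what the filesystem allows */
};

/*
 * Sketch: take the mapping's own mask and add NORETRY so a failed
 * readahead allocation gives up instead of pushing towards OOM.
 */
static sketch_gfp_t sketch_readahead_gfp_mask(const struct sketch_mapping *m)
{
	return m->gfp_mask | SK_GFP_NORETRY;
}
```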
     

05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
    ago with promise that one day it will be possible to implement page
    cache with bigger chunks than PAGE_SIZE.

    This promise never materialized. And unlikely will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are few places in the code where coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation also
    will be addressed with the separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Jan, 2016

2 commits


07 Nov, 2015

1 commit

  • There are many places which use mapping_gfp_mask to restrict a more
    generic gfp mask which would be used for allocations which are not
    directly related to the page cache but they are performed in the same
    context.

    Let's introduce a helper function which makes the restriction explicit and
    easier to track. This patch doesn't introduce any functional changes.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Michal Hocko
    Suggested-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
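
The restriction this helper makes explicit is a bitwise AND of the caller's mask with the mapping's mask. The sketch below uses hypothetical `SK_GFP_*` bit values and a hypothetical `struct sketch_mapping`; only the shape of the operation is taken from the description above.

```c
#include <assert.h>

typedef unsigned int sketch_gfp_t;

/* Hypothetical flag bits, not the kernel's values. */
#define SK_GFP_IO	0x1u
#define SK_GFP_FS	0x2u
#define SK_GFP_RECLAIM	0x4u
#define SK_GFP_KERNEL	(SK_GFP_IO | SK_GFP_FS | SK_GFP_RECLAIM)
#define SK_GFP_NOFS	(SK_GFP_IO | SK_GFP_RECLAIM)

struct sketch_mapping {
	sketch_gfp_t gfp_mask;	/* what the filesystem allows */
};

/*
 * Sketch: keep only those bits of the generic mask that the mapping
 * also permits, so a NOFS mapping never sees an FS-capable allocation.
 */
static sketch_gfp_t sketch_gfp_constraint(const struct sketch_mapping *m,
					  sketch_gfp_t gfp_mask)
{
	return m->gfp_mask & gfp_mask;
}
```

A caller that would otherwise pass a generic kernel mask gets it narrowed to what the mapping allows, which is the same result as the open-coded AND the helper replaces.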
     

06 Nov, 2015

1 commit

  • Maximal readahead size is limited now by two values:
    1) by global 2Mb constant (MAX_READAHEAD in max_sane_readahead())
    2) by configurable per-device value* (bdi->ra_pages)

    There are devices, which require custom readahead limit.
    For instance, for RAIDs it's calculated as number of devices
    multiplied by chunk size times 2.

    Readahead size can never be larger than bdi->ra_pages * 2 value
    (POSIX_FADV_SEQUENTIAL doubles readahead size).

    If so, why do we need two limits?
    I suggest to completely remove this max_sane_readahead() stuff and
    use per-device readahead limit everywhere.

    Also, using right readahead size for RAID disks can significantly
    increase i/o performance:

    before:
    $ dd if=/dev/md2 of=/dev/null bs=100M count=100
    100+0 records in
    100+0 records out
    10485760000 bytes (10 GB) copied, 12.9741 s, 808 MB/s

    after:
    $ dd if=/dev/md2 of=/dev/null bs=100M count=100
    100+0 records in
    100+0 records out
    10485760000 bytes (10 GB) copied, 8.91317 s, 1.2 GB/s

    (It's an 8-disks RAID5 storage).

    This patch doesn't change sys_readahead and madvise(MADV_WILLNEED)
    behavior introduced by 6d2be915e589b58 ("mm/readahead.c: fix readahead
    failure for memoryless NUMA nodes and limit readahead pages").

    Signed-off-by: Roman Gushchin
    Cc: Raghavendra K T
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: David Rientjes
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
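
To put numbers on the RAID case, the arithmetic from the message (number of devices multiplied by chunk size times 2) can be sketched as below. The function name `raid_ra_pages()` and the 512 KiB chunk size used in the usage note are assumptions for illustration; the commit does not state the chunk size of the test array.

```c
#include <assert.h>

/*
 * Sketch: readahead limit in pages for a striped array, following the
 * "devices * chunk size * 2" rule quoted in the commit message.
 */
static unsigned long raid_ra_pages(unsigned long nr_devices,
				   unsigned long chunk_bytes,
				   unsigned long page_size)
{
	return nr_devices * chunk_bytes * 2 / page_size;
}
```

For 8 devices with an assumed 512 KiB chunk and 4 KiB pages this gives 2048 pages (8 MiB), four times the 512 pages that a global 2 MiB limit would allow.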
     

17 Oct, 2015

1 commit

  • Commit 6afdb859b710 ("mm: do not ignore mapping_gfp_mask in page cache
    allocation paths") has caught some users of hardcoded GFP_KERNEL used in
    the page cache allocation paths. This, however, wasn't complete and
    there were others which went unnoticed.

    Dave Chinner has reported the following deadlock for xfs on loop device:
    : With the recent merge of the loop device changes, I'm now seeing
    : XFS deadlock on my single CPU, 1GB RAM VM running xfs/073.
    :
    : The deadlock is as follows:
    :
    : kloopd1: loop_queue_read_work
    : xfs_file_iter_read
    : lock XFS inode XFS_IOLOCK_SHARED (on image file)
    : page cache read (GFP_KERNEL)
    : radix tree alloc
    : memory reclaim
    : reclaim XFS inodes
    : log force to unpin inodes
    :
    :
    : xfs-cil/loop1:
    : xlog_cil_push
    : xlog_write
    :
    : xlog_state_get_iclog_space()
    :
    :
    :
    : kloopd1: loop_queue_write_work
    : xfs_file_write_iter
    : lock XFS inode XFS_IOLOCK_EXCL (on image file)
    :
    :
    : i.e. the kloopd, with its split read and write work queues, has
    : introduced a dependency through memory reclaim. i.e. that writes
    : need to be able to progress for reads to make progress.
    :
    : The problem, fundamentally, is that mpage_readpages() does a
    : GFP_KERNEL allocation, rather than paying attention to the inode's
    : mapping gfp mask, which is set to GFP_NOFS.
    :
    : This didn't use to happen, because the loop device used to issue
    : reads through the splice path and that does:
    :
    : error = add_to_page_cache_lru(page, mapping, index,
    : GFP_KERNEL & mapping_gfp_mask(mapping));

    This has changed by commit aa4d86163e4 ("block: loop: switch to VFS
    ITER_BVEC").

    This patch changes mpage_readpage{s} to follow gfp mask set for the
    mapping. There are, however, other places which are doing basically the
    same.

    lustre:ll_dir_filler is doing GFP_KERNEL from the function which
    apparently uses GFP_NOFS for other allocations so let's make this
    consistent.

    cifs:readpages_get_pages is called from cifs_readpages and
    __cifs_readpages_from_fscache called from the same path obeys mapping
    gfp.

    ramfs_nommu_expand_for_mapping is hardcoding GFP_KERNEL as well
    even though it uses mapping_gfp_mask for the page allocation.

    ext4_mpage_readpages is called from the page cache allocation path,
    same as read_pages and read_cache_pages.

    As I've noticed in my previous post I cannot say I would be happy about
    sprinkling mapping_gfp_mask all over the place and it sounds like we
    should drop gfp_mask argument altogether and use it internally in
    __add_to_page_cache_locked that would require all the filesystems to use
    mapping gfp consistently which I am not sure is the case here. From a
    quick glance it seems that some file systems use it all the time while
    others are selective.

    Signed-off-by: Michal Hocko
    Reported-by: Dave Chinner
    Cc: "Theodore Ts'o"
    Cc: Ming Lei
    Cc: Andreas Dilger
    Cc: Oleg Drokin
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

02 Jun, 2015

1 commit

  • In several places, bdi_congested() and its wrappers are used to
    determine whether more IOs should be issued. With cgroup writeback
    support, this question can't be answered solely based on the bdi
    (backing_dev_info). It's dependent on whether the filesystem and bdi
    support cgroup writeback and the blkcg the inode is associated with.

    This patch implements inode_congested() and its wrappers which take
    @inode and determine the congestion state considering cgroup
    writeback. The new functions replace bdi_*congested() calls in places
    where the query is about specific inode and task.

    There are several filesystem users which also fit this criteria but
    they should be updated when each filesystem implements cgroup
    writeback support.

    v2: Now that a given inode is associated with only one wb, congestion
    state can be determined independent from the asking task. Drop
    @task. Spotted by Vivek. Also, converted to take @inode instead
    of @mapping and renamed to inode_congested().

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     

21 Jan, 2015

1 commit

  • Now that we got rid of the bdi abuse on character devices we can always use
    sb->s_bdi to get at the backing_dev_info for a file, except for the block
    device special case. Export inode_to_bdi and replace uses of
    mapping->backing_dev_info with it to prepare for the removal of
    mapping->backing_dev_info.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Tejun Heo
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

07 Aug, 2014

1 commit


08 Apr, 2014

1 commit

  • Commit f9acc8c7b35a ("readahead: sanify file_ra_state names") left
    ra_submit with a single function call.

    Move ra_submit to internal.h and inline it to save some stack. Thanks
    to Andrew Morton for commenting on different versions.

    Signed-off-by: Fabian Frederick
    Suggested-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     

04 Apr, 2014

3 commits

  • Currently max_sane_readahead() returns zero on the cpu whose NUMA node
    has no local memory which leads to readahead failure. Fix this
    readahead failure by returning minimum of (requested pages, 512). Users
    running applications which need readahead, such as streaming
    applications, on a memory-less cpu see a considerable boost in performance.

    Result:

    fadvise experiment with FADV_WILLNEED on a PPC machine having memoryless
    CPU with 1GB testfile (12 iterations) yielded around 46.66% improvement.

    fadvise experiment with FADV_WILLNEED on a x240 machine with 1GB
    testfile 32GB* 4G RAM numa machine (12 iterations) showed no impact on
    the normal NUMA cases w/ patch.

    Kernel     Avg       Stddev
    base       7.4975    3.92%
    patched    7.4174    3.26%

    [Andrew: making return value PAGE_SIZE independent]
    Suggested-by: Linus Torvalds
    Signed-off-by: Raghavendra K T
    Acked-by: Jan Kara
    Cc: Wu Fengguang
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Raghavendra K T
     
  • shmem mappings already contain exceptional entries where swap slot
    information is remembered.

    To be able to store eviction information for regular page cache, prepare
    every site dealing with the radix trees directly to handle entries other
    than pages.

    The common lookup functions will filter out non-page entries and return
    NULL for page cache holes, just as before. But provide a raw version of
    the API which returns non-page entries as well, and switch shmem over to
    use it.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The radix tree hole searching code is only used for page cache, for
    example the readahead code trying to get a picture of the area
    surrounding a fault.

    It sufficed to rely on the radix tree definition of holes, which is
    "empty tree slot". But this is about to change, though, as shadow page
    descriptors will be stored in the page cache after the actual pages get
    evicted from memory.

    Move the functions over to mm/filemap.c and make them native page cache
    operations, where they can later be adapted to handle the new definition
    of "page cache hole".

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
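
The max_sane_readahead() fix in the first commit of this group, returning the minimum of the requested pages and 512 in a PAGE_SIZE independent way, can be sketched as below. The `SKETCH_*` names and `sane_readahead()` are hypothetical, and a 4 KiB page size is assumed for the example.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE	4096UL
/* 512 pages of 4 KiB, expressed so it scales with the page size. */
#define SKETCH_MAX_READAHEAD	((512 * 4096UL) / SKETCH_PAGE_SIZE)

/*
 * Sketch: never return 0 just because the local node has no memory;
 * cap the request at a fixed number of pages instead.
 */
static unsigned long sane_readahead(unsigned long nr_pages)
{
	return nr_pages < SKETCH_MAX_READAHEAD ? nr_pages
					       : SKETCH_MAX_READAHEAD;
}
```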
     

30 Jan, 2014

1 commit

  • Commit 63d0f0a3c7e1 ("mm/readahead.c:do_readhead(): don't check for
    ->readpage") unintentionally made do_readahead return 0 for all valid
    files regardless of whether readahead was supported, rather than the
    expected -EINVAL. This gets forwarded on to userspace, and results in
    sys_readahead appearing to succeed in cases that don't make sense (e.g.
    when called on pipes or sockets). This issue is detected by the LTP
    readahead01 testcase.

    As the exact return value of force_page_cache_readahead is currently
    never used, we can simplify it to return only 0 or -EINVAL (when
    readpage or readpages is missing). With that in place we can simply
    forward on the return value of force_page_cache_readahead in
    do_readahead.

    This patch performs said change, restoring the expected semantics.

    Signed-off-by: Mark Rutland
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Rutland
     

13 Nov, 2013

2 commits

  • The kernel's readahead algorithm sometimes interprets random read
    accesses as sequential and triggers unnecessary data prefetching from
    storage device (impacting random read average latency).

    In order to identify sequential cache read misses, the readahead
    algorithm intends to check whether offset - previous offset == 1
    (trivial sequential reads) or offset - previous offset == 0 (sequential
    reads not aligned on page boundary):

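    if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL)

    Because ra->prev_pos is a signed loff_t, the operands can be implicitly
    converted to a signed type. Consequently, when previous offset >
    current offset (which happens on random pattern), the if
    condition is true and access is wrongly interpreted as sequential. An
    unnecessary data prefetching is triggered, impacting the average random
    read latency.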

    Storing the previous offset value in a "pgoff_t" variable (unsigned
    long) fixes the sequential read detection logic.

    Signed-off-by: Damien Ramonda
    Reviewed-by: Fengguang Wu
    Acked-by: Pierre Tardy
    Acked-by: David Cohen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Damien Ramonda
     
  • The callee force_page_cache_readahead() already does this and unlike
    do_readahead(), force_page_cache_readahead() remembers to check for
    ->readpages() as well.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
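
The sequential read detection problem fixed by the first commit of this group can be reproduced in userspace with fixed-width types. In the sketch below, `sketch_pgoff_t` models a 32-bit unsigned long and `sketch_loff_t` a signed 64-bit file position; these typedefs, the 12-bit page shift and both function names are assumptions for illustration, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12

typedef uint32_t sketch_pgoff_t;	/* models a 32-bit unsigned long */
typedef int64_t sketch_loff_t;		/* models a signed 64-bit position */

/*
 * Buggy form: the subtraction mixes unsigned 32-bit and signed 64-bit
 * operands, so everything is converted to signed 64-bit. A backwards
 * jump yields a negative difference, which compares as <= 1.
 */
static bool looks_sequential_buggy(sketch_pgoff_t offset,
				   sketch_loff_t prev_pos)
{
	return offset - (prev_pos >> SKETCH_PAGE_SHIFT) <= (sketch_pgoff_t)1;
}

/*
 * Fixed form: store the previous page offset in the unsigned page
 * index type first, so a backwards jump wraps to a large value.
 */
static bool looks_sequential_fixed(sketch_pgoff_t offset,
				   sketch_loff_t prev_pos)
{
	sketch_pgoff_t prev_offset =
		(sketch_pgoff_t)(prev_pos >> SKETCH_PAGE_SHIFT);

	return offset - prev_offset <= (sketch_pgoff_t)1;
}
```

With the previous read at page 100, a read of page 101 is sequential in both forms, while a random read of page 5 is classified as sequential only by the buggy form.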
     

12 Sep, 2013

1 commit

  • This helps performance on moderately dense random reads on SSD.

    Transaction-Per-Second numbers provided by Taobao:

                QPS    case
    -------------------------------------------------------
               7536    disable context readahead totally
    w/ patch:  7129    slower size rampup and start RA on the 3rd read
               6717    slower size rampup
    w/o patch: 5581    unmodified context readahead

    Before, readahead would be started whenever reading page N+1 when it happened
    to have read N recently. After the patch, we'll only start readahead when *three*
    random reads happen to access pages N, N+1, N+2. The probability of this
    happening is extremely low for pure random reads, unless they are very
    dense, which actually deserves some readahead.

    Also start with a smaller readahead window. The impact to interleaved
    sequential reads should be small, because for a long run stream, the
    small readahead window rampup phase is negligible.

    The context readahead actually benefits clustered random reads on HDD
    whose seek cost is pretty high. However as SSD is increasingly used for
    random read workloads it's better for the context readahead to concentrate
    on interleaved sequential reads.

    Another SSD rand read test from Miao

    # file size: 2GB
    # read IO amount: 625MB
    sysbench --test=fileio \
    --max-requests=10000 \
    --num-threads=1 \
    --file-num=1 \
    --file-block-size=64K \
    --file-test-mode=rndrd \
    --file-fsync-freq=0 \
    --file-fsync-end=off run

    shows the performance of btrfs grows up from 69MB/s to 121MB/s, ext4 from
    104MB/s to 121MB/s.

    Signed-off-by: Wu Fengguang
    Tested-by: Tao Ma
    Tested-by: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     

22 May, 2013

1 commit

  • Currently there is no way to truncate partial page where the end
    truncate point is not at the end of the page. This is because it was not
    needed and the functionality was enough for file system truncate
    operation to work properly. However more file systems now support punch
    hole feature and it can benefit from mm supporting truncating page just
    up to the certain point.

    Specifically, with this functionality truncate_inode_pages_range() can
    be changed so it supports truncating partial page at the end of the
    range (currently it will BUG_ON() if 'end' is not at the end of the
    page).

    This commit changes the invalidatepage() address space operation
    prototype to accept range to be invalidated and update all the instances
    for it.

    We also change the block_invalidatepage() in the same way and actually
    make a use of the new length argument implementing range invalidation.

    Actual file system implementations will follow except the file systems
    where the changes are really simple and should not change the behaviour
    in any way. Implementation for truncate_page_range() which will be able
    to accept page unaligned ranges will follow as well.

    Signed-off-by: Lukas Czerner
    Cc: Andrew Morton
    Cc: Hugh Dickins

    Lukas Czerner
     

04 Mar, 2013

1 commit


27 Sep, 2012

2 commits


30 May, 2012

1 commit


31 Oct, 2011

1 commit