03 Nov, 2020

1 commit

  • Fix the following sparse warning:

    mm/truncate.c:531:15: warning: symbol '__invalidate_mapping_pages' was not declared. Should it be static?
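
    The fix is the usual one for this class of sparse warning: give the helper
    internal linkage. A minimal sketch (the parameter list here is assumed for
    illustration, not copied from the patch):

        /* Mark the mm/truncate.c-local helper static to silence sparse. */
        static unsigned long __invalidate_mapping_pages(struct address_space *mapping,
                        pgoff_t start, pgoff_t end, unsigned long *nr_pagevec);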

    Fixes: eb1d7a65f08a ("mm, fadvise: improve the expensive remote LRU cache draining after FADV_DONTNEED")
    Signed-off-by: Jason Yan
    Signed-off-by: Andrew Morton
    Reviewed-by: Yafang Shao
    Link: https://lkml.kernel.org/r/20201015054808.2445904-1-yanaijie@huawei.com
    Signed-off-by: Linus Torvalds

    Jason Yan
     

17 Oct, 2020

1 commit

  • Remove the assumption that a compound page is HPAGE_PMD_SIZE, and the
    assumption that any page is PAGE_SIZE.
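
    A rough illustration of the direction, assuming the thp_size() and
    offset_in_thp() helpers available in that kernel; the variable names are
    illustrative, not taken from the patch:

        /* Derive sizes from the page itself instead of assuming constants. */
        unsigned int offset = offset_in_thp(page, lstart); /* was: lstart & (PAGE_SIZE - 1) */
        unsigned int length = thp_size(page) - offset;     /* was: HPAGE_PMD_SIZE - offset */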

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: SeongJae Park
    Acked-by: Kirill A. Shutemov
    Cc: Huang Ying
    Link: https://lkml.kernel.org/r/20200908195539.25896-10-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

14 Oct, 2020

1 commit

  • Our users reported random latency spikes while their RT process was
    running. We eventually found that the latency spikes are caused by
    FADV_DONTNEED, which may call lru_add_drain_all() to drain the LRU cache
    on remote CPUs and then wait for the per-cpu work to complete. The wait
    time is uncertain and may be tens of milliseconds.

    That behavior is unreasonable, because this process is bound to a specific
    CPU and the file is only accessed by itself; IOW, there should be no
    pagecache pages on a per-cpu pagevec of a remote CPU. That unreasonable
    behavior is partially caused by the wrong comparison between the number of
    invalidated pages and the target number. For example,

    if (count < (end_index - start_index + 1))

    The count above is how many pages were invalidated on the local CPU, and
    (end_index - start_index + 1) is how many pages should be invalidated.
    Using (end_index - start_index + 1) is incorrect, because these are page
    indices in the file, which may not be backed by pages, and there may be
    holes between start and end. So we'd better check whether there are
    still pages on a per-cpu pagevec after draining the local CPU, and only
    then decide whether or not to call lru_add_drain_all().

    After applying this as a hotfix in our production environment, most of
    the lru_add_drain_all() calls can be avoided.
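
    A minimal sketch of the resulting decision in the fadvise(FADV_DONTNEED)
    path; this is only an illustration of the idea (the real patch plumbs a
    "pages possibly left on a pagevec" count out of the invalidation):

        count = invalidate_mapping_pages(mapping, start_index, end_index);
        if (count < (end_index - start_index + 1)) {
                lru_add_drain();        /* cheap: drains only the local CPU */
                count += invalidate_mapping_pages(mapping, start_index, end_index);
                if (count < (end_index - start_index + 1))
                        lru_add_drain_all();    /* expensive: work on every CPU */
        }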

    Suggested-by: Mel Gorman
    Signed-off-by: Yafang Shao
    Signed-off-by: Andrew Morton
    Acked-by: Mel Gorman
    Cc: Johannes Weiner
    Link: https://lkml.kernel.org/r/20200923133318.14373-1-laoar.shao@gmail.com
    Signed-off-by: Linus Torvalds

    Yafang Shao
     

19 Oct, 2019

1 commit

  • Once a THP is added to the page cache, it cannot be dropped via
    /proc/sys/vm/drop_caches. Fix this issue with proper handling in
    invalidate_mapping_pages().
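
    A hedged sketch of the kind of handling this needs inside the
    invalidate_mapping_pages() scan loop: a file THP that cannot be dropped as
    a whole is split so its subpages become droppable (details illustrative,
    not the exact hunk):

        if (PageTransHuge(page) && !PageHuge(page)) {
                /* Try to break up the compound page; skip it on failure. */
                if (!trylock_page(page))
                        continue;
                split_huge_page(page);  /* returns 0 on success */
                unlock_page(page);
        }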

    Link: http://lkml.kernel.org/r/20191017164223.2762148-5-songliubraving@fb.com
    Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Song Liu
    Tested-by: Song Liu
    Acked-by: Yang Shi
    Cc: Matthew Wilcox (Oracle)
    Cc: Oleg Nesterov
    Cc: Srikar Dronamraju
    Cc: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only
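
    In practice each converted source file simply gains the tag as its first
    line, for example:

        // SPDX-License-Identifier: GPL-2.0-only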

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

06 Mar, 2019

1 commit

  • Many kernel-doc comments in mm/ have return value descriptions that are
    either misformatted or omitted altogether, which makes the kernel-doc
    script unhappy:

    $ make V=1 htmldocs
    ...
    ./mm/util.c:36: info: Scanning doc for kstrdup
    ./mm/util.c:41: warning: No description found for return value of 'kstrdup'
    ./mm/util.c:57: info: Scanning doc for kstrdup_const
    ./mm/util.c:66: warning: No description found for return value of 'kstrdup_const'
    ./mm/util.c:75: info: Scanning doc for kstrndup
    ./mm/util.c:83: warning: No description found for return value of 'kstrndup'
    ...

    Fixing the formatting and adding the missing return value descriptions
    eliminates ~100 such warnings.
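
    The fix is mechanical: each kernel-doc comment gains a properly formatted
    "Return:" section, along these lines (wording illustrative):

        /**
         * kstrdup - allocate space for and copy an existing string
         * @s: the string to duplicate
         * @gfp: the GFP mask used in the kmalloc() call when allocating memory
         *
         * Return: newly allocated copy of @s or %NULL in case of error
         */
        char *kstrdup(const char *s, gfp_t gfp);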

    Link: http://lkml.kernel.org/r/1549549644-4903-4-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

01 Dec, 2018

1 commit

  • If all pages are deleted from the mapping by memory reclaim and also
    moved to the cleancache:

    __delete_from_page_cache
      (no shadow case)
      unaccount_page_cache_page
        cleancache_put_page
      page_cache_delete
        mapping->nrpages -= nr
        (nrpages becomes 0)

    We don't clean the cleancache for an inode after final file truncation
    (removal).

    truncate_inode_pages_final
      check (nrpages || nrexceptional) is false
        no truncate_inode_pages
          no cleancache_invalidate_inode(mapping)

    This way, when reading a new file created with the same inode, we may get
    stale leftover pages from cleancache and see wrong data instead of the
    contents of the new file.

    Fix it by always calling truncate_inode_pages(), which already copes with
    the nrpages == 0 && nrexceptional == 0 case and then just invalidates the
    inode.
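
    A hedged sketch of the resulting shape of truncate_inode_pages_final()
    (illustrative only; the real function also orders the check against the
    tree lock):

        /*
         * Do not gate the final truncate on nrpages/nrexceptional:
         * truncate_inode_pages() handles the empty case and still calls
         * cleancache_invalidate_inode(), so no stale cleancache survives.
         */
        truncate_inode_pages(mapping, 0);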

    [akpm@linux-foundation.org: add comment, per Jan]
    Link: http://lkml.kernel.org/r/20181112095734.17979-1-ptikhomirov@virtuozzo.com
    Fixes: 91b0abe36a7b ("mm + fs: store shadow entries in page cache")
    Signed-off-by: Pavel Tikhomirov
    Reviewed-by: Vasily Averin
    Reviewed-by: Andrey Ryabinin
    Reviewed-by: Jan Kara
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Matthew Wilcox
    Cc: Andi Kleen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tikhomirov
     

21 Oct, 2018

1 commit


30 Sep, 2018

1 commit

  • Introduce xarray value entries and tagged pointers to replace radix
    tree exceptional entries. This is a slight change in encoding to allow
    the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
    value entry). It is also a change in emphasis; exceptional entries are
    intimidating and different. As the comment explains, you can choose
    to store values or pointers in the xarray and they are both first-class
    citizens.
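
    A small usage sketch of the new encoding helpers:

        void *entry = xa_mk_value(42);  /* tag an integer as a value entry */
        if (xa_is_value(entry))         /* distinguishes values from pointers */
                pr_debug("stored %lu\n", xa_to_value(entry));   /* prints 42 */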

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Josef Bacik

    Matthew Wilcox
     

12 Apr, 2018

1 commit

  • Remove the address_space ->tree_lock and use the xa_lock newly added to
    the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
    since we don't really care that it's a tree.
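
    Call sites change mechanically, roughly:

        xa_lock_irq(&mapping->i_pages);   /* was: spin_lock_irq(&mapping->tree_lock) */
        /* ... modify the page cache ... */
        xa_unlock_irq(&mapping->i_pages); /* was: spin_unlock_irq(&mapping->tree_lock) */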

    [willy@infradead.org: fix nds32, fs/dax.c]
    Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.org
    Link: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jeff Layton
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Ryusuke Konishi
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

01 Feb, 2018

1 commit

  • Several users of unmap_mapping_range() would prefer to express their
    range in pages rather than bytes. Unfortunately, on a 32-bit kernel, you
    have to remember to cast your page number to a 64-bit type before
    shifting it, and four places in the current tree didn't remember to do
    that. That's a sign of a bad interface.

    Conveniently, unmap_mapping_range() actually converts from bytes into
    pages, so hoist the guts of unmap_mapping_range() into a new function
    unmap_mapping_pages() and convert the callers which want to use pages.
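
    Converted callers end up looking roughly like this (sketch):

        /* Unmap nr pages starting at pgoff; no 64-bit casts needed. */
        unmap_mapping_pages(mapping, pgoff, nr, false);

        /* instead of the error-prone byte-based form: */
        unmap_mapping_range(mapping, (loff_t)pgoff << PAGE_SHIFT,
                            (loff_t)nr << PAGE_SHIFT, 0);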

    Link: http://lkml.kernel.org/r/20171206142627.GD32044@bombadil.infradead.org
    Signed-off-by: Matthew Wilcox
    Reported-by: "zhangyi (F)"
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

16 Nov, 2017

5 commits

  • Every pagevec_init user claims the pages being released are hot even in
    cases where it is unlikely the pages are hot. As no one cares about the
    hotness of pages being released to the allocator, just ditch the
    parameter.

    No performance impact is expected as the overhead is marginal. The
    parameter is removed simply because it is a bit stupid to have a useless
    parameter copied everywhere.
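
    The conversion is a one-liner at each call site:

        pagevec_init(&pvec);    /* was: pagevec_init(&pvec, cold) - the flag is gone */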

    Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Jan Kara
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • During truncate, each entry in a pagevec is checked to see if it is an
    exceptional entry and, if so, the shadow entry is cleaned up. This is
    potentially expensive, as each exceptional entry for a mapping locks and
    unlocks the tree lock separately. This patch batches the operation so that
    any exceptional entries removed from a pagevec acquire the mapping tree
    lock only once.
    The corner case where this is more expensive is where there is only one
    exceptional entry but this is unlikely due to temporal locality and how
    it affects LRU ordering. Note that for truncations of small files
    created recently, this patch should show no gain because it only batches
    the handling of exceptional entries.

    sparsetruncate (large)
    4.14.0-rc4 4.14.0-rc4
    pickhelper-v1r1 batchshadow-v1r1
    Min Time 38.00 ( 0.00%) 27.00 ( 28.95%)
    1st-qrtle Time 40.00 ( 0.00%) 28.00 ( 30.00%)
    2nd-qrtle Time 44.00 ( 0.00%) 41.00 ( 6.82%)
    3rd-qrtle Time 146.00 ( 0.00%) 147.00 ( -0.68%)
    Max-90% Time 153.00 ( 0.00%) 153.00 ( 0.00%)
    Max-95% Time 155.00 ( 0.00%) 156.00 ( -0.65%)
    Max-99% Time 181.00 ( 0.00%) 171.00 ( 5.52%)
    Amean Time 93.04 ( 0.00%) 88.43 ( 4.96%)
    Best99%Amean Time 92.08 ( 0.00%) 86.13 ( 6.46%)
    Best95%Amean Time 89.19 ( 0.00%) 83.13 ( 6.80%)
    Best90%Amean Time 85.60 ( 0.00%) 79.15 ( 7.53%)
    Best75%Amean Time 72.95 ( 0.00%) 65.09 ( 10.78%)
    Best50%Amean Time 39.86 ( 0.00%) 28.20 ( 29.25%)
    Best25%Amean Time 39.44 ( 0.00%) 27.70 ( 29.77%)

    bonnie
    4.14.0-rc4 4.14.0-rc4
    pickhelper-v1r1 batchshadow-v1r1
    Hmean SeqCreate ops 71.92 ( 0.00%) 76.78 ( 6.76%)
    Hmean SeqCreate read 42.42 ( 0.00%) 45.01 ( 6.10%)
    Hmean SeqCreate del 26519.88 ( 0.00%) 27191.87 ( 2.53%)
    Hmean RandCreate ops 71.92 ( 0.00%) 76.95 ( 7.00%)
    Hmean RandCreate read 44.44 ( 0.00%) 49.23 ( 10.78%)
    Hmean RandCreate del 24948.62 ( 0.00%) 24764.97 ( -0.74%)

    Truncation of a large number of files shows a substantial gain with 99%
    of files being truncated 6.46% faster. bonnie shows a modest gain of
    2.53%.

    [jack@suse.cz: fix truncate_exceptional_pvec_entries()]
    Link: http://lkml.kernel.org/r/20171108164226.26788-1-jack@suse.cz
    Link: http://lkml.kernel.org/r/20171018075952.10627-4-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Signed-off-by: Jan Kara
    Reviewed-by: Jan Kara
    Acked-by: Johannes Weiner
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • During truncation, the mapping has already been checked for shmem and
    dax so it's known that workingset_update_node is required.

    This patch avoids the checks on mapping for each page being truncated.
    In all other cases, a lookup helper is used to determine if
    workingset_update_node() needs to be called. The one danger is that the
    API is slightly harder to use as calling workingset_update_node directly
    without checking for dax or shmem mappings could lead to surprises.
    However, the API rarely needs to be used and hopefully the comment is
    enough to give people the hint.

    sparsetruncate (tiny)
    4.14.0-rc4 4.14.0-rc4
    oneirq-v1r1 pickhelper-v1r1
    Min Time 141.00 ( 0.00%) 140.00 ( 0.71%)
    1st-qrtle Time 142.00 ( 0.00%) 141.00 ( 0.70%)
    2nd-qrtle Time 142.00 ( 0.00%) 142.00 ( 0.00%)
    3rd-qrtle Time 143.00 ( 0.00%) 143.00 ( 0.00%)
    Max-90% Time 144.00 ( 0.00%) 144.00 ( 0.00%)
    Max-95% Time 147.00 ( 0.00%) 145.00 ( 1.36%)
    Max-99% Time 195.00 ( 0.00%) 191.00 ( 2.05%)
    Max Time 230.00 ( 0.00%) 205.00 ( 10.87%)
    Amean Time 144.37 ( 0.00%) 143.82 ( 0.38%)
    Stddev Time 10.44 ( 0.00%) 9.00 ( 13.74%)
    Coeff Time 7.23 ( 0.00%) 6.26 ( 13.41%)
    Best99%Amean Time 143.72 ( 0.00%) 143.34 ( 0.26%)
    Best95%Amean Time 142.37 ( 0.00%) 142.00 ( 0.26%)
    Best90%Amean Time 142.19 ( 0.00%) 141.85 ( 0.24%)
    Best75%Amean Time 141.92 ( 0.00%) 141.58 ( 0.24%)
    Best50%Amean Time 141.69 ( 0.00%) 141.31 ( 0.27%)
    Best25%Amean Time 141.38 ( 0.00%) 140.97 ( 0.29%)

    As you'd expect, the gain is marginal but it can be detected. The
    differences in bonnie are all within the noise which is not surprising
    given the impact on the microbenchmark.

    radix_tree_update_node_t is a callback for some radix operations that
    optionally passes in a private field. The only user of the callback is
    workingset_update_node and as it no longer requires a mapping, the
    private field is removed.

    Link: http://lkml.kernel.org/r/20171018075952.10627-3-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Jan Kara
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently we remove pages from the radix tree one by one. To speed up
    page cache truncation, lock several pages at once and free them in one
    go. This allows us to batch radix tree operations in a more efficient
    way and also save round-trips on mapping->tree_lock. As a result we
    gain about 20% speed improvement in page cache truncation.
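
    The core of the batching, sketched with delete_from_page_cache_batch()
    from the notes below (illustrative call pattern, not the exact code):

        /* Lock the whole batch, then delete it in one tree_lock round-trip. */
        for (i = 0; i < pagevec_count(&pvec); i++)
                lock_page(pvec.pages[i]);
        delete_from_page_cache_batch(mapping, &pvec);
        for (i = 0; i < pagevec_count(&pvec); i++)
                unlock_page(pvec.pages[i]);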

    Data from a simple benchmark timing 10000 truncates of 1024 pages (on
    ext4 on ramdisk but the filesystem is barely visible in the profiles).
    The range shows 1% and 95% percentiles of the measured times:

    4.14-rc2 4.14-rc2 + batched truncation
    248-256 209-219
    249-258 209-217
    248-255 211-239
    248-255 209-217
    247-256 210-218

    [jack@suse.cz: convert delete_from_page_cache_batch() to pagevec]
    Link: http://lkml.kernel.org/r/20171018111648.13714-1-jack@suse.cz
    [akpm@linux-foundation.org: move struct pagevec forward declaration to top-of-file]
    Link: http://lkml.kernel.org/r/20171010151937.26984-8-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Mel Gorman
    Reviewed-by: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Move call of delete_from_page_cache() and page->mapping check out of
    truncate_complete_page() into the single caller - truncate_inode_page().
    Also move page_mapped() check into truncate_complete_page(). That way
    it will be easier to batch operations.

    Also rename truncate_complete_page() to truncate_cleanup_page().
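
    After the move, the single caller reads roughly as follows (sketch):

        int truncate_inode_page(struct address_space *mapping, struct page *page)
        {
                if (page->mapping != mapping)
                        return -EIO;
                truncate_cleanup_page(mapping, page);   /* was truncate_complete_page() */
                delete_from_page_cache(page);           /* moved out of the helper */
                return 0;
        }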

    Link: http://lkml.kernel.org/r/20171010151937.26984-3-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Mel Gorman
    Reviewed-by: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

11 Jul, 2017

1 commit

  • The condition checking for a THP straddling the end of the invalidated
    range is wrong - it checks 'index' against 'end', but 'index' has already
    been advanced to point to the end of the THP and thus the condition can
    never be true. As a result, a THP straddling 'end' gets fully invalidated.
    Given the nature of invalidate_mapping_pages(), this could only be a
    performance issue. In fact, we are lucky the condition is wrong because
    if it were ever true, we'd leave a locked page behind.

    Fix the condition checking for THP straddling 'end' and also properly
    unlock the page. Also update the comment before the condition to
    explain why we decide not to invalidate the page as it was not clear to
    me and I had to ask Kirill.

    Link: http://lkml.kernel.org/r/20170619124723.21656-1-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

13 May, 2017

2 commits

  • Currently we don't invalidate page tables during invalidate_inode_pages2()
    for DAX. That can result in e.g. a 2MiB zero page being mapped into
    page tables while there were already underlying blocks allocated and
    thus data seen through mmap were different from data seen by read(2).
    The following sequence reproduces the problem:

    - open an mmap over a 2MiB hole

    - read from a 2MiB hole, faulting in a 2MiB zero page

    - write to the hole with write(3p). The write succeeds but we
    incorrectly leave the 2MiB zero page mapping intact.

    - via the mmap, read the data that was just written. Since the zero
    page mapping is still intact we read back zeroes instead of the new
    data.

    Fix the problem by unconditionally calling invalidate_inode_pages2_range()
    in dax_iomap_actor() for new block allocations and by properly
    invalidating page tables in invalidate_inode_pages2_range() for DAX
    mappings.
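
    The page-table half of the fix conceptually amounts to unmapping the
    invalidated range for DAX mappings (sketch, not the exact hunk; 'index'
    and 'end' name the range being invalidated):

        if (dax_mapping(mapping)) {
                /* drop any stale (e.g. zero-page) PTEs/PMDs over the range */
                unmap_mapping_range(mapping,
                                    (loff_t)index << PAGE_SHIFT,
                                    (loff_t)(end - index + 1) << PAGE_SHIFT, 0);
        }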

    Fixes: c6dcf52c23d2 ("mm: Invalidate DAX radix tree entries only if appropriate")
    Link: http://lkml.kernel.org/r/20170510085419.27601-3-jack@suse.cz
    Signed-off-by: Jan Kara
    Signed-off-by: Ross Zwisler
    Cc: Dan Williams
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Patch series "mm,dax: Fix data corruption due to mmap inconsistency",
    v4.

    This series fixes data corruption that can happen for DAX mounts when
    page faults race with write(2) and as a result page tables get out of
    sync with block mappings in the filesystem and thus data seen through
    mmap is different from data seen through read(2).

    The series passes testing with t_mmap_stale test program from Ross and
    also other mmap related tests on DAX filesystem.

    This patch (of 4):

    dax_invalidate_mapping_entry() currently removes DAX exceptional entries
    only if they are clean and unlocked. This is done via:

    invalidate_mapping_pages()
      invalidate_exceptional_entry()
        dax_invalidate_mapping_entry()

    However, for page cache pages removed in invalidate_mapping_pages()
    there is an additional criterion, which is that the page must not be
    mapped. This is noted in the comments above invalidate_mapping_pages()
    and is checked in invalidate_inode_page().

    For DAX entries this means that we can end up in a situation where a
    DAX exceptional entry, either a huge zero page or a regular DAX entry,
    could end up mapped but without an associated radix tree entry. This is
    inconsistent with the rest of the DAX code and with what happens in the
    page cache case.

    We aren't able to unmap the DAX exceptional entry because according to
    its comments invalidate_mapping_pages() isn't allowed to block, and
    unmap_mapping_range() takes a write lock on the mapping->i_mmap_rwsem.

    Since we essentially never have unmapped DAX entries to evict from the
    radix tree, just remove dax_invalidate_mapping_entry().

    Fixes: c6dcf52c23d2 ("mm: Invalidate DAX radix tree entries only if appropriate")
    Link: http://lkml.kernel.org/r/20170510085419.27601-2-jack@suse.cz
    Signed-off-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Reported-by: Jan Kara
    Cc: Dan Williams
    Cc: [4.10+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

04 May, 2017

2 commits

  • cleancache_invalidate_inode() is called in truncate_inode_pages_range() and
    invalidate_inode_pages2_range() twice - on entry and on exit. That is
    pointless and a waste of time; it's enough to call it once, at exit.

    Link: http://lkml.kernel.org/r/20170424164135.22350-5-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Jan Kara
    Acked-by: Konrad Rzeszutek Wilk
    Cc: Alexander Viro
    Cc: Ross Zwisler
    Cc: Jens Axboe
    Cc: Johannes Weiner
    Cc: Alexey Kuznetsov
    Cc: Christoph Hellwig
    Cc: Nikolay Borisov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • If the mapping is empty (both ->nrpages and ->nrexceptional are zero) we can
    avoid pointless lookups in the empty radix tree and bail out immediately
    after the cleancache invalidation.

    Link: http://lkml.kernel.org/r/20170424164135.22350-4-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Jan Kara
    Acked-by: Konrad Rzeszutek Wilk
    Cc: Alexander Viro
    Cc: Ross Zwisler
    Cc: Jens Axboe
    Cc: Johannes Weiner
    Cc: Alexey Kuznetsov
    Cc: Christoph Hellwig
    Cc: Nikolay Borisov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

28 Feb, 2017

1 commit

  • Replace all instances of 1 << inode->i_blkbits and (1 << inode->i_blkbits)
    in the fs branch with i_blocksize().

    This patch also fixes multiple checkpatch warnings: WARNING: Prefer
    'unsigned int' to bare use of 'unsigned'

    Thanks to Andrew Morton for suggesting a more appropriate function instead
    of a macro.
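
    Typical conversion:

        unsigned int blocksize = i_blocksize(inode);    /* was: 1 << inode->i_blkbits */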

    [geliangtang@gmail.com: truncate: use i_blocksize()]
    Link: http://lkml.kernel.org/r/9c8b2cd83c8f5653805d43debde9fa8817e02fc4.1484895804.git.geliangtang@gmail.com
    Link: http://lkml.kernel.org/r/1481319905-10126-1-git-send-email-fabf@skynet.be
    Signed-off-by: Fabian Frederick
    Signed-off-by: Geliang Tang
    Cc: Alexander Viro
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     

25 Feb, 2017

1 commit

  • Remove the prototypes for shmem_mapping() and shmem_zero_setup() from
    linux/mm.h, since they are already provided in linux/shmem_fs.h. But
    shmem_fs.h must then provide the inline stub for shmem_mapping() when
    CONFIG_SHMEM is not set, and a few more C files now need to #include it.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1702081658250.1549@eggly.anvils
    Signed-off-by: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Michal Simek
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

27 Dec, 2016

1 commit

  • Currently invalidate_inode_pages2_range() and invalidate_mapping_pages()
    just delete all exceptional radix tree entries they find. For DAX this
    is not desirable as we track cache dirtiness in these entries and when
    they are evicted, we may not flush caches although it is necessary. This
    can for example manifest when we write to the same block both via mmap
    and via write(2) (to different offsets) and fsync(2) then does not
    properly flush CPU caches when modification via write(2) was the last
    one.

    Create appropriate DAX functions to handle invalidation of DAX entries
    for invalidate_inode_pages2_range() and invalidate_mapping_pages() and
    wire them up into the corresponding mm functions.

    Acked-by: Johannes Weiner
    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Dan Williams

    Jan Kara
     

13 Dec, 2016

2 commits

  • Currently, we track the shadow entries in the page cache in the upper
    bits of the radix_tree_node->count, behind the back of the radix tree
    implementation. Because the radix tree code has no awareness of them,
    we rely on random subtleties throughout the implementation (such as the
    node->count != 1 check in the shrinking code, which is meant to exclude
    multi-entry nodes but also happens to skip nodes with only one shadow
    entry, as that's accounted in the upper bits). This is error prone and
    has, in fact, caused the bug fixed in d3798ae8c6f3 ("mm: filemap: don't
    plant shadow entries without radix tree node").

    To remove these subtleties, this patch moves shadow entry tracking from
    the upper bits of node->count to the existing counter for exceptional
    entries. node->count goes back to being a simple counter of valid
    entries in the tree node and can be shrunk to a single byte.

    This vastly simplifies the page cache code. All accounting happens
    natively inside the radix tree implementation, and maintaining the LRU
    linkage of shadow nodes is consolidated into a single function in the
    workingset code that is called for leaf nodes affected by a change in
    the page cache tree.

    This also removes the last user of the __radix_delete_node() return
    value. Eliminate it.

    Link: http://lkml.kernel.org/r/20161117193211.GE23430@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Jan Kara
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The bug in khugepaged fixed earlier in this series shows that radix tree
    slot replacement is fragile; and it will become more so when not only
    NULL -> NULL transitions need to be caught but transitions from and to
    exceptional entries as well. We need checks.

    Re-implement radix_tree_replace_slot() on top of the sanity-checked
    __radix_tree_replace(). This requires existing callers to also pass the
    radix tree root, but it'll warn us when somebody replaces slots with
    contents that need proper accounting (transitions between NULL entries,
    real entries, exceptional entries) and where a replacement through the
    slot pointer would corrupt the radix tree node counts.
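
    At call sites the change is just the extra root argument, e.g. in the page
    cache:

        radix_tree_replace_slot(&mapping->page_tree, slot, new);
        /* was: radix_tree_replace_slot(slot, new); */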

    Link: http://lkml.kernel.org/r/20161117193021.GB23430@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Suggested-by: Jan Kara
    Reviewed-by: Jan Kara
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

01 Dec, 2016

1 commit

  • Hugetlb pages have ->index in units of the huge page size (PMD_SIZE or
    PUD_SIZE), not in PAGE_SIZE as other types of pages do. This means we
    cannot use page_to_pgoff() to check whether we've got the right page
    for the radix-tree index.

    Let's introduce page_to_index(), which returns the radix-tree index for
    a given page.

    We will be able to get rid of this once hugetlb is switched to
    multi-order entries.
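
    A hedged sketch of what such a helper looks like (tail pages do not carry
    their own ->index, so it is derived from the head page):

        static inline pgoff_t page_to_index(struct page *page)
        {
                pgoff_t pgoff;

                if (likely(!PageTransTail(page)))
                        return page->index;

                /* Tail page: head page's index plus the offset within it. */
                pgoff = compound_head(page)->index;
                pgoff += page - compound_head(page);
                return pgoff;
        }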

    Fixes: fc127da085c2 ("truncate: handle file thp")
    Link: http://lkml.kernel.org/r/20161123093053.mjbnvn5zwxw5e6lk@black.fi.intel.com
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Doug Nelson
    Tested-by: Doug Nelson
    Reviewed-by: Naoya Horiguchi
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

27 Jul, 2016

1 commit

  • For shmem/tmpfs we only need to tweak truncate_inode_page() and
    invalidate_mapping_pages().

    truncate_inode_pages_range() and invalidate_inode_pages2_range() are
    adjusted to use page_to_pgoff().

    Link: http://lkml.kernel.org/r/1466021202-61880-26-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

20 May, 2016

1 commit

  • Currently DAX page fault locking is racy.

    CPU0 (write fault)                        CPU1 (read fault)

    __dax_fault()                             __dax_fault()
      get_block(inode, block, &bh, 0) -> not mapped
                                                get_block(inode, block, &bh, 0)
                                                  -> not mapped
      if (!buffer_mapped(&bh))
        if (vmf->flags & FAULT_FLAG_WRITE)
          get_block(inode, block, &bh, 1) -> allocates blocks
      if (page) -> no
                                                if (!buffer_mapped(&bh))
                                                  if (vmf->flags & FAULT_FLAG_WRITE) {
                                                  } else {
                                                    dax_load_hole();
                                                  }
      dax_insert_mapping()

    And we are in a situation where we fail in dax_radix_entry() with -EIO.

    Another problem with the current DAX page fault locking is that there is
    no race-free way to clear dirty tag in the radix tree. We can always
    end up with clean radix tree and dirty data in CPU cache.

    We fix the first problem by introducing locking of exceptional radix
    tree entries in DAX mappings acting very similarly to page lock and thus
    synchronizing properly faults against the same mapping index. The same
    lock can later be used to avoid races when clearing radix tree dirty
    tag.

    Reviewed-by: NeilBrown
    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Ross Zwisler

    Jan Kara
     

05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the
    page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely it ever will.
    This promise never materialized. And unlikely will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE, and it's a constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code where coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation also
    will be addressed with the separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

16 Mar, 2016

3 commits

  • There are several users that nest lock_page_memcg() inside lock_page()
    to prevent page->mem_cgroup from changing. But the page lock prevents
    pages from moving between cgroups, so that is unnecessary overhead.

    Remove lock_page_memcg() in contexts with locked pages and fix the
    debug code in the page stat functions to be okay with the page lock.

    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Now that migration doesn't clear page->mem_cgroup of live pages anymore,
    it's safe to make lock_page_memcg() and the memcg stat functions take
    pages, and spare the callers from memcg objects.

    [akpm@linux-foundation.org: fix warnings]
    Signed-off-by: Johannes Weiner
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • These patches tag the page cache radix tree eviction entries with the
    memcg an evicted page belonged to, thus making per-cgroup LRU reclaim
    work properly and be as adaptive to new cache workingsets as global
    reclaim already is.

    This should have been part of the original thrash detection patch
    series, but was deferred due to the complexity of those patches.

    This patch (of 5):

    So far the only sites that needed to exclude charge migration to
    stabilize page->mem_cgroup have been per-cgroup page statistics, hence
    the name mem_cgroup_begin_page_stat(). But per-cgroup thrash detection
    will add another site that needs to ensure page->mem_cgroup lifetime.

    Rename these locking functions to the more generic lock_page_memcg() and
    unlock_page_memcg(). Since charge migration is a cgroup1 feature only,
    we might be able to delete it at some point, along with these locking
    sites, which are now easy to identify.

    Signed-off-by: Johannes Weiner
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

23 Jan, 2016

1 commit

  • Add support for tracking dirty DAX entries in the struct address_space
    radix tree. This tree is already used for dirty page writeback, and it
    already supports the use of exceptional (non struct page*) entries.

    In order to properly track dirty DAX pages we will insert new
    exceptional entries into the radix tree that represent dirty DAX PTE or
    PMD pages. These exceptional entries will also contain the writeback
    addresses for the PTE or PMD faults that we can use at fsync/msync time.

    There are currently two types of exceptional entries (shmem and shadow)
    that can be placed into the radix tree, and this adds a third. We rely
    on the fact that only one type of exceptional entry can be found in a
    given radix tree based on its usage. This happens for free with DAX vs
    shmem but we explicitly prevent shadow entries from being added to radix
    trees for DAX mappings.

    The only shadow entries that would be generated for DAX radix trees
    would be to track zero page mappings that were created for holes. These
    pages would receive minimal benefit from having shadow entries, and the
    choice to have only one type of exceptional entry in a given radix tree
    makes the logic simpler both in clear_exceptional_entry() and in the
    rest of DAX.

    Signed-off-by: Ross Zwisler
    Cc: "H. Peter Anvin"
    Cc: "J. Bruce Fields"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jeff Layton
    Cc: Matthew Wilcox
    Cc: Thomas Gleixner
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

02 Jun, 2015

2 commits

  • When modifying PG_Dirty on cached file pages, update the new
    MEM_CGROUP_STAT_DIRTY counter. This is done in the same places where
    global NR_FILE_DIRTY is managed. The new memcg stat is visible in the
    per memcg memory.stat cgroupfs file. The most recent past attempt at
    this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632

    The new accounting supports future efforts to add per cgroup dirty
    page throttling and writeback. It also helps an administrator break
    down a container's memory usage and provides evidence to understand
    memcg oom kills (the new dirty count is included in memcg oom kill
    messages).

    The ability to move page accounting between memcg
    (memory.move_charge_at_immigrate) makes this accounting more
    complicated than the global counter. The existing
    mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
    accounting with stat updates.
    Typical update operation:

        memcg = mem_cgroup_begin_page_stat(page)
        if (TestSetPageDirty()) {
                [...]
                mem_cgroup_update_page_stat(memcg)
        }
        mem_cgroup_end_page_stat(memcg)

    Summary of mem_cgroup_end_page_stat() overhead:
    - Without CONFIG_MEMCG it's a no-op
    - With CONFIG_MEMCG and no inter memcg task movement, it's just
    rcu_read_lock()
    - With CONFIG_MEMCG and inter memcg task movement, it's
    rcu_read_lock() + spin_lock_irqsave()

    A memcg parameter is added to several routines because their callers
    now grab mem_cgroup_begin_page_stat(), which returns the memcg later
    needed by mem_cgroup_update_page_stat().

    Because mem_cgroup_begin_page_stat() may disable interrupts, some
    adjustments are needed:
    - move __mark_inode_dirty() from __set_page_dirty() to its caller.
    __mark_inode_dirty() locking does not want interrupts disabled.
    - use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
    __delete_from_page_cache(), replace_page_cache_page(),
    invalidate_complete_page2(), and __remove_mapping().

    text data bss dec hex filename
    8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
    8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
    +192 text bytes
    8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
    8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
    +773 text bytes

    Performance tests run on v4.0-rc1-36-g4f671fe2f952. Lower is better for
    all metrics, they're all wall clock or cycle counts. The read and write
    fault benchmarks just measure fault time, they do not include I/O time.

    * CONFIG_MEMCG not set:
    baseline patched
    kbuild 1m25.030000(+-0.088% 3 samples) 1m25.426667(+-0.120% 3 samples)
    dd write 100 MiB 0.859211561 +-15.10% 0.874162885 +-15.03%
    dd write 200 MiB 1.670653105 +-17.87% 1.669384764 +-11.99%
    dd write 1000 MiB 8.434691190 +-14.15% 8.474733215 +-14.77%
    read fault cycles 254.0(+-0.000% 10 samples) 253.0(+-0.000% 10 samples)
    write fault cycles 2021.2(+-3.070% 10 samples) 1984.5(+-1.036% 10 samples)

    * CONFIG_MEMCG=y root_memcg:
    baseline patched
    kbuild 1m25.716667(+-0.105% 3 samples) 1m25.686667(+-0.153% 3 samples)
    dd write 100 MiB 0.855650830 +-14.90% 0.887557919 +-14.90%
    dd write 200 MiB 1.688322953 +-12.72% 1.667682724 +-13.33%
    dd write 1000 MiB 8.418601605 +-14.30% 8.673532299 +-15.00%
    read fault cycles 266.0(+-0.000% 10 samples) 266.0(+-0.000% 10 samples)
    write fault cycles 2051.7(+-1.349% 10 samples) 2049.6(+-1.686% 10 samples)

    * CONFIG_MEMCG=y non-root_memcg:
    baseline patched
    kbuild 1m26.120000(+-0.273% 3 samples) 1m25.763333(+-0.127% 3 samples)
    dd write 100 MiB 0.861723964 +-15.25% 0.818129350 +-14.82%
    dd write 200 MiB 1.669887569 +-13.30% 1.698645885 +-13.27%
    dd write 1000 MiB 8.383191730 +-14.65% 8.351742280 +-14.52%
    read fault cycles 265.7(+-0.172% 10 samples) 267.0(+-0.000% 10 samples)
    write fault cycles 2070.6(+-1.512% 10 samples) 2084.4(+-2.148% 10 samples)

    As expected anon page faults are not affected by this patch.

    tj: Updated to apply on top of the recent cancel_dirty_page() changes.

    Signed-off-by: Sha Zhengju
    Signed-off-by: Greg Thelen
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Greg Thelen
     
  • cancel_dirty_page() had some issues and b9ea25152e56 ("page_writeback:
    clean up mess around cancel_dirty_page()") replaced it with
    account_page_cleaned() which makes the caller responsible for clearing
    the dirty bit; unfortunately, the planned changes for cgroup writeback
    support requires synchronization between dirty bit manipulation and
    stat updates. While we can open-code such synchronization in each
    account_page_cleaned() callsite, that's gonna be unnecessarily awkward
    and verbose.

    This patch revives cancel_dirty_page() but in a more restricted form.
    All it does is TestClearPageDirty() followed by account_page_cleaned()
    invocation if the page was dirty. This helper covers all
    account_page_cleaned() usages except for __delete_from_page_cache()
    which is a special case anyway and left alone. As this leaves no
    module user for account_page_cleaned(), EXPORT_SYMBOL() is dropped
    from it.

    This patch just revives cancel_dirty_page() as a trivial wrapper to
    replace equivalent usages and doesn't introduce any functional
    changes.
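
    In its revived, restricted form the helper is essentially the following
    (sketch; the exact account_page_cleaned() arguments are assumed):

        static inline void cancel_dirty_page(struct page *page)
        {
                /* Only touch the counters if the page really was dirty. */
                if (TestClearPageDirty(page))
                        account_page_cleaned(page, page_mapping(page));
        }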

    Signed-off-by: Tejun Heo
    Cc: Konstantin Khlebnikov
    Signed-off-by: Jens Axboe

    Tejun Heo
     

16 Apr, 2015

1 commit

  • "deactivate_page" was created for file invalidation so it has too
    specific logic for file-backed pages. So, let's change the name of the
    function and date to a file-specific one and yield the generic name.

    Signed-off-by: Minchan Kim
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Wang, Yalin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

15 Apr, 2015

1 commit

  • This patch replaces cancel_dirty_page() with a helper function
    account_page_cleaned() which only updates counters. It's called from
    truncate_complete_page() and from try_to_free_buffers() (hack for ext3).
    Page is locked in both cases, page-lock protects against concurrent
    dirtiers: see commit 2d6d7f982846 ("mm: protect set_page_dirty() from
    ongoing truncation").

    delete_from_page_cache() shouldn't be called for dirty pages; they must
    be handled by the caller (either written or truncated). This patch treats
    the final dirty-accounting fixup at the end of __delete_from_page_cache()
    as a debug check and adds WARN_ON_ONCE() around it. If something removes
    dirty pages without proper handling, that might be a bug and unwritten
    data might be lost.

    Hugetlbfs has no dirty-page accounting, so ClearPageDirty() is enough
    here.

    cancel_dirty_page() in nfs_wb_page_cancel() is redundant. This is a
    helper for nfs_invalidate_page() and it's called only in the case of
    complete invalidation.

    The mess was started in v2.6.20 after commits 46d2277c796f ("Clean up
    and make try_to_free_buffers() not race with dirty pages") and
    3e67c0987d75 ("truncate: clear page dirtiness before running
    try_to_free_buffers()"). The first was reverted right in v2.6.20 by commit
    ecdfc9787fe5 ("Resurrect 'try_to_free_buffers()' VM hackery"), the second in
    v2.6.25 by commit a2b345642f53 ("Fix dirty page accounting leak with ext3
    data=journal").

    Custom fixes were introduced between these points. NFS in v2.6.23, commit
    1b3b4a1a2deb ("NFS: Fix a write request leak in nfs_invalidate_page()").
    Kludge in __delete_from_page_cache() in v2.6.24, commit 3a6927906f1b ("Do
    dirty page accounting when removing a page from the page cache"). Since
    v2.6.25 all of them are redundant.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Tejun Heo
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

21 Jan, 2015

1 commit

  • Now that we got rid of the bdi abuse on character devices we can always use
    sb->s_bdi to get at the backing_dev_info for a file, except for the block
    device special case. Export inode_to_bdi and replace uses of
    mapping->backing_dev_info with it to prepare for the removal of
    mapping->backing_dev_info.
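
    The replacement pattern at call sites:

        struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
        /* was: bdi = mapping->backing_dev_info; */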

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Tejun Heo
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

07 Nov, 2014

1 commit

  • XFS doesn't always hold i_mutex when calling truncate_setsize() and it
    uses a different lock to serialize truncates and writes. So fix the
    comment before truncate_setsize().

    Reported-by: Jan Beulich
    Signed-off-by: Jan Kara
    Signed-off-by: Dave Chinner

    Jan Kara