08 Jun, 2018

1 commit

  • Use new return type vm_fault_t for fault handler in struct
    vm_operations_struct. For now, this is just documenting that the
    function returns a VM_FAULT value rather than an errno. Once all
    instances are converted, vm_fault_t will become a distinct type.
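
    A minimal sketch of what a converted handler looks like; the handler
    name and body here are hypothetical, and only the vm_fault_t return
    type and the .fault hookup reflect the change being described:

        static vm_fault_t example_fault(struct vm_fault *vmf)
        {
                /* returns a VM_FAULT_* code rather than an errno */
                return filemap_fault(vmf);
        }

        static const struct vm_operations_struct example_vm_ops = {
                .fault = example_fault,
        };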

    Link: http://lkml.kernel.org/r/20180511190542.GA2412@jordon-HP-15-Notebook-PC
    Signed-off-by: Souptick Joarder
    Reviewed-by: Matthew Wilcox
    Cc: Dan Williams
    Cc: Jan Kara
    Cc: Ross Zwisler
    Cc: Rik van Riel
    Cc: Matthew Wilcox
    Cc: Hugh Dickins
    Cc: Pavel Tatashin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     

21 Apr, 2018

1 commit

  • f2fs specifies the __GFP_ZERO flag for allocating some of its pages.
    Unfortunately, the page cache also uses the mapping's GFP flags for
    allocating radix tree nodes. It has always masked off the __GFP_HIGHMEM
    flag, and masks off __GFP_ZERO in some paths, but not all. That causes
    radix tree nodes to be allocated with a NULL list_head, which causes
    backtraces like:

    __list_del_entry+0x30/0xd0
    list_lru_del+0xac/0x1ac
    page_cache_tree_insert+0xd8/0x110

    The __GFP_DMA and __GFP_DMA32 flags would also be able to sneak through
    if they are ever used. Fix them all by using GFP_RECLAIM_MASK at the
    innermost location, and remove it from earlier in the callchain.
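
    A rough illustration of the pattern described above (the helper below
    is hypothetical; only the masking with GFP_RECLAIM_MASK mirrors the
    fix):

        static int example_preload(gfp_t gfp)
        {
                /* strips __GFP_ZERO, __GFP_HIGHMEM, __GFP_DMA, __GFP_DMA32 */
                gfp &= GFP_RECLAIM_MASK;
                return radix_tree_maybe_preload(gfp);
        }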

    Link: http://lkml.kernel.org/r/20180411060320.14458-2-willy@infradead.org
    Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
    Signed-off-by: Matthew Wilcox
    Reported-by: Chris Fries
    Debugged-by: Minchan Kim
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Reviewed-by: Jan Kara
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

14 Apr, 2018

1 commit

  • Building orangefs on MMU-less machines now results in a link error
    because of the newly introduced use of the filemap_page_mkwrite()
    function:

    ERROR: "filemap_page_mkwrite" [fs/orangefs/orangefs.ko] undefined!

    This adds a dummy version for it, similar to the existing
    generic_file_mmap and generic_file_readonly_mmap stubs in the same file,
    to avoid the link error without adding #ifdefs in each file system that
    uses these.
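
    The stub presumably looks something like the sketch below (the exact
    return value used by the real dummy is an assumption here, made only
    for illustration):

        #ifndef CONFIG_MMU
        int filemap_page_mkwrite(struct vm_fault *vmf)
        {
                return VM_FAULT_SIGBUS;         /* assumed return value */
        }
        #endif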

    Link: http://lkml.kernel.org/r/20180409105555.2439976-1-arnd@arndb.de
    Fixes: a5135eeab2e5 ("orangefs: implement vm_ops->fault")
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Jan Kara
    Reviewed-by: Andrew Morton
    Cc: Martin Brandenburg
    Cc: Mike Marshall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

12 Apr, 2018

1 commit

  • Remove the address_space ->tree_lock and use the xa_lock newly added to
    the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
    since we don't really care that it's a tree.
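
    In practice the conversion amounts to swapping the lock and the field
    name; a sketch of a typical call site (illustrative only):

        /* before */
        spin_lock_irq(&mapping->tree_lock);
        radix_tree_delete(&mapping->page_tree, page->index);
        spin_unlock_irq(&mapping->tree_lock);

        /* after */
        xa_lock_irq(&mapping->i_pages);
        radix_tree_delete(&mapping->i_pages, page->index);
        xa_unlock_irq(&mapping->i_pages);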

    [willy@infradead.org: fix nds32, fs/dax.c]
    Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.org
    Link: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jeff Layton
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Ryusuke Konishi
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

01 Feb, 2018

1 commit

  • in_atomic() has been moved to include/linux/preempt.h, and filemap.c
    doesn't use in_atomic() directly at all, so it is unnecessary to
    include hardirq.h.

    Link: http://lkml.kernel.org/r/1509985319-38633-1-git-send-email-yang.s@alibaba-inc.com
    Signed-off-by: Yang Shi
    Reviewed-by: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

17 Nov, 2017

1 commit

  • Pull AFS updates from David Howells:
    "kAFS filesystem driver overhaul.

    The major points of the overhaul are:

    (1) Preliminary groundwork is laid for supporting network-namespacing
    of kAFS. The remainder of the namespacing work requires some way
    to pass namespace information to submounts triggered by an
    automount. This requires something like the mount overhaul that's
    in progress.

    (2) sockaddr_rxrpc is used in preference to in_addr for holding
    addresses internally, and support is added for talking to the YFS VL
    server. With this, kAFS can do everything over IPv6 as well as
    IPv4 if it's talking to servers that support it.

    (3) Callback handling is overhauled to be generally passive rather
    than active. 'Callbacks' are promises by the server to tell us
    about data and metadata changes. Callbacks are now checked when
    we next touch an inode rather than actively going and looking for
    it where possible.

    (4) File access permit caching is overhauled to store the caching
    information per-inode rather than per-directory, shared over
    subordinate files. Whilst older AFS servers only allow ACLs on
    directories (shared to the files in that directory), newer AFS
    servers break that restriction.

    To improve memory usage and to make it easier to do mass-key
    removal, permit combinations are cached and shared.

    (5) Cell database management is overhauled to allow lighter locks to
    be used and to make cell records autonomous state machines that
    look after getting their own DNS records and cleaning themselves
    up, in particular preventing races in acquiring and relinquishing
    the fscache token for the cell.

    (6) Volume caching is overhauled. The afs_vlocation record is got rid
    of to simplify things and the superblock is now keyed on the cell
    and the numeric volume ID only. The volume record is tied to a
    superblock and normal superblock management is used to mediate
    the lifetime of the volume fscache token.

    (7) File server record caching is overhauled to make server records
    independent of cells and volumes. A server can be in multiple
    cells (in such a case, the administrator must make sure that the
    VL services for all cells correctly reflect the volumes shared
    between those cells).

    Server records are now indexed using the UUID of the server
    rather than the address since a server can have multiple
    addresses.

    (8) File server rotation is overhauled to handle VMOVED, VBUSY (and
    similar), VOFFLINE and VNOVOL indications and to handle rotation
    both of servers and addresses of those servers. The rotation will
    also wait and retry if the server says it is busy.

    (9) Data writeback is overhauled. Each inode no longer stores a list
    of modified sections tagged with the key that authorised it in
    favour of noting the modified region of a page in page->private
    and storing a list of keys that made modifications in the inode.

    This simplifies things and allows other keys to be used to
    actually write to the server if a key that made a modification
    becomes useless.

    (10) Writable mmap() is implemented. This allows a kernel to be built
    entirely on AFS.

    Note that pre-AFS-3.4 servers are no longer supported, though this can
    be added back if necessary (AFS-3.4 was released in 1998)"

    * tag 'afs-next-20171113' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (35 commits)
    afs: Protect call->state changes against signals
    afs: Trace page dirty/clean
    afs: Implement shared-writeable mmap
    afs: Get rid of the afs_writeback record
    afs: Introduce a file-private data record
    afs: Use a dynamic port if 7001 is in use
    afs: Fix directory read/modify race
    afs: Trace the sending of pages
    afs: Trace the initiation and completion of client calls
    afs: Fix documentation on # vs % prefix in mount source specification
    afs: Fix total-length calculation for multiple-page send
    afs: Only progress call state at end of Tx phase from rxrpc callback
    afs: Make use of the YFS service upgrade to fully support IPv6
    afs: Overhaul volume and server record caching and fileserver rotation
    afs: Move server rotation code into its own file
    afs: Add an address list concept
    afs: Overhaul cell database management
    afs: Overhaul permit caching
    afs: Overhaul the callback handling
    afs: Rename struct afs_call server member to cm_server
    ...

    Linus Torvalds
     

16 Nov, 2017

11 commits

  • As the page free path makes no distinction between cache hot and cold
    pages, there is no real useful ordering of pages in the free list that
    allocation requests can take advantage of. Judging from the users of
    __GFP_COLD, it is likely that a number of them are the result of copying
    other sites instead of actually measuring the impact. Remove the
    __GFP_COLD parameter which simplifies a number of paths in the page
    allocator.

    This is potentially controversial but bear in mind that the size of the
    per-cpu pagelists versus modern cache sizes means that the whole per-cpu
    list can often fit in the L3 cache. Hence, there is only a potential
    benefit for microbenchmarks that alloc/free pages in a tight loop. It's
    even worse when THP is taken into account which has little or no chance
    of getting a cache-hot page as the per-cpu list is bypassed and the
    zeroing of multiple pages will thrash the cache anyway.

    The truncate microbenchmarks are not shown as this patch affects the
    allocation path and not the free path. A page fault microbenchmark was
    tested but it showed no significant difference, which is not surprising
    given that the __GFP_COLD branches are a minuscule percentage of the
    fault path.

    Link: http://lkml.kernel.org/r/20171018075952.10627-9-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Jan Kara
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Every pagevec_init user claims the pages being released are hot even in
    cases where it is unlikely the pages are hot. As no one cares about the
    hotness of pages being released to the allocator, just ditch the
    parameter.

    No performance impact is expected as the overhead is marginal. The
    parameter is removed simply because it is a bit stupid to have a useless
    parameter copied everywhere.
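
    After this change the initialisation is simply (sketch; the dropped
    second argument was the cold hint):

        struct pagevec pvec;

        pagevec_init(&pvec);            /* was: pagevec_init(&pvec, 0); */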

    Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Jan Kara
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • During truncation, the mapping has already been checked for shmem and
    dax so it's known that workingset_update_node is required.

    This patch avoids the checks on mapping for each page being truncated.
    In all other cases, a lookup helper is used to determine if
    workingset_update_node() needs to be called. The one danger is that the
    API is slightly harder to use as calling workingset_update_node directly
    without checking for dax or shmem mappings could lead to surprises.
    However, the API rarely needs to be used and hopefully the comment is
    enough to give people the hint.

    sparsetruncate (tiny)
                                     4.14.0-rc4         4.14.0-rc4
                                    oneirq-v1r1    pickhelper-v1r1
    Min            Time        141.00 ( 0.00%)    140.00 ( 0.71%)
    1st-qrtle      Time        142.00 ( 0.00%)    141.00 ( 0.70%)
    2nd-qrtle      Time        142.00 ( 0.00%)    142.00 ( 0.00%)
    3rd-qrtle      Time        143.00 ( 0.00%)    143.00 ( 0.00%)
    Max-90%        Time        144.00 ( 0.00%)    144.00 ( 0.00%)
    Max-95%        Time        147.00 ( 0.00%)    145.00 ( 1.36%)
    Max-99%        Time        195.00 ( 0.00%)    191.00 ( 2.05%)
    Max            Time        230.00 ( 0.00%)    205.00 ( 10.87%)
    Amean          Time        144.37 ( 0.00%)    143.82 ( 0.38%)
    Stddev         Time         10.44 ( 0.00%)      9.00 ( 13.74%)
    Coeff          Time          7.23 ( 0.00%)      6.26 ( 13.41%)
    Best99%Amean   Time        143.72 ( 0.00%)    143.34 ( 0.26%)
    Best95%Amean   Time        142.37 ( 0.00%)    142.00 ( 0.26%)
    Best90%Amean   Time        142.19 ( 0.00%)    141.85 ( 0.24%)
    Best75%Amean   Time        141.92 ( 0.00%)    141.58 ( 0.24%)
    Best50%Amean   Time        141.69 ( 0.00%)    141.31 ( 0.27%)
    Best25%Amean   Time        141.38 ( 0.00%)    140.97 ( 0.29%)

    As you'd expect, the gain is marginal but it can be detected. The
    differences in bonnie are all within the noise which is not surprising
    given the impact on the microbenchmark.

    radix_tree_update_node_t is a callback for some radix operations that
    optionally passes in a private field. The only user of the callback is
    workingset_update_node and as it no longer requires a mapping, the
    private field is removed.

    Link: http://lkml.kernel.org/r/20171018075952.10627-3-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Jan Kara
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently we remove pages from the radix tree one by one. To speed up
    page cache truncation, lock several pages at once and free them in one
    go. This allows us to batch radix tree operations in a more efficient
    way and also save round-trips on mapping->tree_lock. As a result we
    gain about 20% speed improvement in page cache truncation.
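
    The batched path is used roughly like this (a sketch; the signature of
    delete_from_page_cache_batch() shown here is an assumption based on
    the description above):

        struct pagevec pvec;

        pagevec_init(&pvec);
        /* ... collect locked pages from the mapping into pvec ... */
        delete_from_page_cache_batch(mapping, &pvec);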

    Data from a simple benchmark timing 10000 truncates of 1024 pages (on
    ext4 on ramdisk but the filesystem is barely visible in the profiles).
    The range shows 1% and 95% percentiles of the measured times:

       4.14-rc2    4.14-rc2 + batched truncation
        248-256              209-219
        249-258              209-217
        248-255              211-239
        248-255              209-217
        247-256              210-218

    [jack@suse.cz: convert delete_from_page_cache_batch() to pagevec]
    Link: http://lkml.kernel.org/r/20171018111648.13714-1-jack@suse.cz
    [akpm@linux-foundation.org: move struct pagevec forward declaration to top-of-file]
    Link: http://lkml.kernel.org/r/20171010151937.26984-8-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Mel Gorman
    Reviewed-by: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Move checks and accounting updates from __delete_from_page_cache() into
    a separate function. We will reuse it when batching page cache
    truncation operations.

    Link: http://lkml.kernel.org/r/20171010151937.26984-7-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Mel Gorman
    Reviewed-by: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Clearing of page->mapping makes sense in page_cache_tree_delete() as
    well and it will help us with batching things this way.

    Link: http://lkml.kernel.org/r/20171010151937.26984-6-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Mel Gorman
    Reviewed-by: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Move updates of various counters before page_cache_tree_delete() call.
    It will be easier to batch things this way and there is no difference
    whether the counters get updated before or after removal from the radix
    tree.

    Link: http://lkml.kernel.org/r/20171010151937.26984-5-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Mel Gorman
    Reviewed-by: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Factor out page freeing from delete_from_page_cache() into a separate
    function. We will need to call the same when batching pagecache
    deletion operations.

    invalidate_complete_page2() and replace_page_cache_page() might want to
    call this function as well; however, they currently don't seem to handle
    THPs, so it's unnecessary for them to take the hit of checking whether a
    page is a THP or not.

    Link: http://lkml.kernel.org/r/20171010151937.26984-4-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Mel Gorman
    Reviewed-by: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • All users of pagevec_lookup() and pagevec_lookup_range() now pass
    PAGEVEC_SIZE as a desired number of pages. Just drop the argument.
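
    In other words, a lookup now reads (sketch of a typical call; the old
    trailing argument is shown for contrast):

        /* before */
        pagevec_lookup(&pvec, mapping, &index, PAGEVEC_SIZE);

        /* after: the pagevec is always filled up to PAGEVEC_SIZE */
        pagevec_lookup(&pvec, mapping, &index);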

    Link: http://lkml.kernel.org/r/20171009151359.31984-15-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Daniel Jordan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Use pagevec_lookup_range_tag() in __filemap_fdatawait_range() as it is
    interested only in pages from given range. Remove unnecessary code
    resulting from this.

    Link: http://lkml.kernel.org/r/20171009151359.31984-11-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Daniel Jordan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Patch series "Ranged pagevec tagged lookup", v3.

    In this series I provide a ranged variant of pagevec_lookup_tag() and
    use it in places where it makes sense. This series removes some common
    code, and it also has the potential to speed up some operations,
    similarly to pagevec_lookup_range() (but for now I can think of only
    artificial cases where this happens).

    This patch (of 16):

    Implement a variant of find_get_pages_tag() that stops iterating at a
    given index. Lots of users of this function (through pagevec_lookup())
    actually want a range lookup and all of them are currently open-coding
    this.

    Also create corresponding pagevec_lookup_range_tag() function.
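
    A sketch of how the new ranged, tagged lookup is typically driven (the
    loop shape is illustrative; it assumes the post-cleanup pagevec API):

        struct pagevec pvec;
        pgoff_t index = start;

        pagevec_init(&pvec);
        while (pagevec_lookup_range_tag(&pvec, mapping, &index, end,
                                        PAGECACHE_TAG_DIRTY)) {
                unsigned i;

                for (i = 0; i < pagevec_count(&pvec); i++) {
                        /* ... write back pvec.pages[i] ... */
                }
                pagevec_release(&pvec);
        }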

    Link: http://lkml.kernel.org/r/20171009151359.31984-2-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Daniel Jordan
    Cc: Bob Peterson
    Cc: Chao Yu
    Cc: David Howells
    Cc: David Sterba
    Cc: Ilya Dryomov
    Cc: Jaegeuk Kim
    Cc: Ryusuke Konishi
    Cc: Steve French
    Cc: "Theodore Ts'o"
    Cc: "Yan, Zheng"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

04 Oct, 2017

1 commit

  • Eryu noticed that he could sometimes get a leftover error reported when
    it shouldn't be on fsync with ext2 and non-journalled ext4.

    The problem is that writeback_single_inode still uses filemap_fdatawait.
    That picks up a previously set AS_EIO flag, which would ordinarily have
    been cleared before.

    Since we're mostly using this function as a replacement for
    filemap_check_errors, have filemap_check_and_advance_wb_err clear AS_EIO
    and AS_ENOSPC when reporting an error. That should allow the new
    function to better emulate the behavior of the old with respect to these
    flags.
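
    Conceptually the helper now also folds in what filemap_check_errors()
    used to do; a hedged sketch of that part (not the full function):

        /* also clear the legacy flags so old-style checks stay quiet */
        if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
                err = -ENOSPC;
        if (test_and_clear_bit(AS_EIO, &mapping->flags))
                err = -EIO;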

    Link: http://lkml.kernel.org/r/20170922133331.28812-1-jlayton@kernel.org
    Signed-off-by: Jeff Layton
    Reported-by: Eryu Guan
    Reviewed-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Layton
     

25 Sep, 2017

1 commit

  • Currently when mixing buffered reads and asynchronous direct writes it
    is possible to end up with the situation where we have stale data in the
    page cache while the new data is already written to disk. This is
    permanent until the affected pages are flushed away. Despite the fact
    that mixing buffered and direct IO is ill-advised, it does pose a threat
    to data integrity, is unexpected, and should be fixed.

    Fix this by deferring completion of asynchronous direct writes to a
    process context in the case that there are mapped pages to be found in
    the inode. Later before the completion in dio_complete() invalidate
    the pages in question. This ensures that after the completion the pages
    in the written area are either unmapped, or populated with up-to-date
    data. Also do the same for the iomap case which uses
    iomap_dio_complete() instead.

    This has a side effect of deferring the completion to a process context
    for every AIO DIO that happens on an inode that has pages mapped. However
    since the consensus is that this is ill-advised practice the performance
    implication should not be a problem.

    This was based on proposal from Jeff Moyer, thanks!

    Reviewed-by: Jan Kara
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Jeff Moyer
    Signed-off-by: Lukas Czerner
    Signed-off-by: Jens Axboe

    Lukas Czerner
     

15 Sep, 2017

2 commits

  • Pull nowait read support from Al Viro:
    "Support IOCB_NOWAIT for buffered reads and block devices"

    * 'work.read_write' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    block_dev: support RWF_NOWAIT on block device nodes
    fs: support RWF_NOWAIT for buffered reads
    fs: support IOCB_NOWAIT in generic_file_buffered_read
    fs: pass iocb to do_generic_file_read
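
    From userspace this surfaces through preadv2() with RWF_NOWAIT; a
    minimal sketch (assumes a kernel and libc that expose the flag):

        #define _GNU_SOURCE
        #include <sys/uio.h>
        #include <errno.h>

        static ssize_t read_if_cached(int fd, void *buf, size_t len)
        {
                struct iovec iov = { .iov_base = buf, .iov_len = len };
                ssize_t n = preadv2(fd, &iov, 1, -1, RWF_NOWAIT);

                if (n < 0 && errno == EAGAIN)
                        return -1;      /* data not in the page cache yet */
                return n;
        }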

    Linus Torvalds
     
  • Now that we have added breaks in the wait queue scan and allow a
    bookmark on the scan position, we put this logic in the wake_up_page_bit
    function.

    We can have very long page wait lists in large systems where multiple
    pages share the same wait list. We break up the wake-up walk here to
    give other CPUs a chance to access the list, and to avoid keeping
    interrupts disabled while traversing the list for too long. This reduces
    the interrupt and rescheduling latency, and excessive page wait queue
    lock hold time.

    [ v2: Remove bookmark_wake_function ]

    Signed-off-by: Tim Chen
    Signed-off-by: Linus Torvalds

    Tim Chen
     

07 Sep, 2017

6 commits

  • Merge updates from Andrew Morton:

    - various misc bits

    - DAX updates

    - OCFS2

    - most of MM

    * emailed patches from Andrew Morton : (119 commits)
    mm,fork: introduce MADV_WIPEONFORK
    x86,mpx: make mpx depend on x86-64 to free up VMA flag
    mm: add /proc/pid/smaps_rollup
    mm: hugetlb: clear target sub-page last when clearing huge page
    mm: oom: let oom_reap_task and exit_mmap run concurrently
    swap: choose swap device according to numa node
    mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
    mm, oom: do not rely on TIF_MEMDIE for memory reserves access
    z3fold: use per-cpu unbuddied lists
    mm, swap: don't use VMA based swap readahead if HDD is used as swap
    mm, swap: add sysfs interface for VMA based swap readahead
    mm, swap: VMA based swap readahead
    mm, swap: fix swap readahead marking
    mm, swap: add swap readahead hit statistics
    mm/vmalloc.c: don't reinvent the wheel but use existing llist API
    mm/vmstat.c: fix wrong comment
    selftests/memfd: add memfd_create hugetlbfs selftest
    mm/shmem: add hugetlbfs support to memfd_create()
    mm, devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups
    mm/vmalloc.c: halve the number of comparisons performed in pcpu_get_vm_areas()
    ...

    Linus Torvalds
     
  • We want only pages from the given range in filemap_range_has_page();
    furthermore, we want at most a single page.

    So use find_get_pages_range() instead of pagevec_lookup() and remove
    unnecessary code.

    Link: http://lkml.kernel.org/r/20170726114704.7626-10-jack@suse.cz
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Implement a variant of find_get_pages() that stops iterating at a given
    index. This may be a substantial performance gain if the mapping is
    sparse. See following commit for details. Furthermore lots of users of
    this function (through pagevec_lookup()) actually want a range lookup
    and all of them are currently open-coding this.

    Also create corresponding pagevec_lookup_range() function.
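
    A sketch of how such a ranged lookup is driven (written against the
    4.14-era API, which still took an nr_pages argument; a later cleanup,
    listed further up in this log, drops it):

        struct pagevec pvec;
        pgoff_t index = start;

        pagevec_init(&pvec, 0);
        while (pagevec_lookup_range(&pvec, mapping, &index, end,
                                    PAGEVEC_SIZE)) {
                unsigned i;

                for (i = 0; i < pagevec_count(&pvec); i++) {
                        /* ... pvec.pages[i] lies within [start, end] ... */
                }
                pagevec_release(&pvec);
        }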

    Link: http://lkml.kernel.org/r/20170726114704.7626-4-jack@suse.cz
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Make pagevec_lookup() (and the underlying find_get_pages()) update the
    index to the next page where iteration should continue. Most callers
    want this, and pagevec_lookup_tag() already does this.

    Link: http://lkml.kernel.org/r/20170726114704.7626-3-jack@suse.cz
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Now that we no longer insert struct page pointers in DAX radix trees we
    can remove the special casing for DAX in page_cache_tree_insert().

    This also allows us to make dax_wake_mapping_entry_waiter() local to
    fs/dax.c, removing it from dax.h.

    Link: http://lkml.kernel.org/r/20170724170616.25810-5-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Suggested-by: Jan Kara
    Reviewed-by: Jan Kara
    Cc: "Darrick J. Wong"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Christoph Hellwig
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jonathan Corbet
    Cc: Matthew Wilcox
    Cc: Steven Rostedt
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • Pull writeback error handling updates from Jeff Layton:
    "This pile continues the work from last cycle on better tracking
    writeback errors. In v4.13 we added some basic errseq_t infrastructure
    and converted a few filesystems to use it.

    This set continues refining that infrastructure, adds documentation,
    and converts most of the other filesystems to use it. The main
    exception at this point is the NFS client"

    * tag 'wberr-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
    ecryptfs: convert to file_write_and_wait in ->fsync
    mm: remove optimizations based on i_size in mapping writeback waits
    fs: convert a pile of fsync routines to errseq_t based reporting
    gfs2: convert to errseq_t based writeback error reporting for fsync
    fs: convert sync_file_range to use errseq_t based error-tracking
    mm: add file_fdatawait_range and file_write_and_wait
    fuse: convert to errseq_t based error tracking for fsync
    mm: consolidate dax / non-dax checks for writeback
    Documentation: add some docs for errseq_t
    errseq: rename __errseq_set to errseq_set

    Linus Torvalds
     

29 Aug, 2017

1 commit

  • Commit 3510ca20ece0 ("Minor page waitqueue cleanups") made the page
    queue code always add new waiters to the back of the queue, which helps
    upcoming patches to batch the wakeups for some horrid loads where the
    wait queues grow to thousands of entries.

    However, I forgot about the nasty add_page_wait_queue() special case
    code that is only used by the cachefiles code. That one still continued
    to add the new wait queue entries at the beginning of the list.

    Fix it, because any sane batched wakeup will require that we don't
    suddenly start getting new entries at the beginning of the list that we
    already handled in a previous batch.

    [ The current code always does the whole list while holding the lock, so
    wait queue ordering doesn't matter for correctness, but even then it's
    better to add later entries at the end from a fairness standpoint ]

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

28 Aug, 2017

2 commits

  • The "lock_page_killable()" function waits for exclusive access to the
    page lock bit using the WQ_FLAG_EXCLUSIVE bit in the waitqueue entry
    set.

    That means that if it gets woken up, other waiters may have been
    skipped.

    That, in turn, means that if it sees the page being unlocked, it *must*
    take that lock and return success, even if a lethal signal is also
    pending.

    So instead of checking for lethal signals first, we need to check for
    them after we've checked the actual bit that we were waiting for. Even
    if that might then delay the killing of the process.

    This matches the order of the old "wait_on_bit_lock()" infrastructure
    that the page locking used to use (and is still used in a few other
    areas).

    Note that if we still return an error after having unsuccessfully tried
    to acquire the page lock, that is ok: that means that some other thread
    was able to get ahead of us and lock the page, and when that other
    thread then unlocks the page, the wakeup event will be repeated. So any
    other pending waiters will now get properly woken up.

    Fixes: 62906027091f ("mm: add PageWaiters indicating tasks are waiting for a page bit")
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Mel Gorman
    Cc: Jan Kara
    Cc: Davidlohr Bueso
    Cc: Andi Kleen
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Tim Chen and Kan Liang have been battling a customer load that shows
    extremely long page wakeup lists. The cause seems to be constant NUMA
    migration of a hot page that is shared across a lot of threads, but the
    actual root cause for the exact behavior has not been found.

    Tim has a patch that batches the wait list traversal at wakeup time, so
    that we at least don't get long uninterruptible cases where we traverse
    and wake up thousands of processes and get nasty latency spikes. That
    is likely 4.14 material, but we're still discussing the page waitqueue
    specific parts of it.

    In the meantime, I've tried to look at making the page wait queues less
    expensive, and failing miserably. If you have thousands of threads
    waiting for the same page, it will be painful. We'll need to try to
    figure out the NUMA balancing issue some day, in addition to avoiding
    the excessive spinlock hold times.

    That said, having tried to rewrite the page wait queues, I can at least
    fix up some of the braindamage in the current situation. In particular:

    (a) we don't want to continue walking the page wait list if the bit
    we're waiting for already got set again (which seems to be one of
    the patterns of the bad load). That makes no progress and just
    causes pointless cache pollution chasing the pointers.

    (b) we don't want to put the non-locking waiters always on the front of
    the queue, and the locking waiters always on the back. Not only is
    that unfair, it means that we wake up thousands of reading threads
    that will just end up being blocked by the writer later anyway.

    Also add a comment about the layout of 'struct wait_page_key' - there is
    an external user of it in the cachefiles code that means that it has to
    match the layout of 'struct wait_bit_key' in the two first members. It
    so happens to match, because 'struct page *' and 'unsigned long *' end
    up having the same values simply because the page flags are the first
    member in struct page.

    Cc: Tim Chen
    Cc: Kan Liang
    Cc: Mel Gorman
    Cc: Christopher Lameter
    Cc: Andi Kleen
    Cc: Davidlohr Bueso
    Cc: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

01 Aug, 2017

2 commits

  • Marcelo added this i_size based optimization with a patch in 2004
    (commitid is from the linux-history tree):

    commit 765dad09b4ac101a32d87af2bb793c3060497d3c
    Author: Marcelo Tosatti
    Date: Tue Sep 7 17:51:17 2004 -0700

    small wait_on_page_writeback_range() optimization

    filemap_fdatawait() calls wait_on_page_writeback_range() with -1
    as "end" parameter. This is not needed since we know the EOF
    from the inode. Use that instead.

    There may be races here, particularly with clustered or network
    filesystems. It also seems like a bit of a layering violation since
    we're operating on an address_space here, not an inode.

    Finally, it's also questionable whether this optimization really helps
    on workloads that we care about. Should we be optimizing for writeback
    vs. truncate races in a codepath where we expect to wait anyway? It
    doesn't seem worth the risk.

    Remove this optimization from the filemap_fdatawait codepaths. This
    means that filemap_fdatawait becomes a trivial wrapper around
    filemap_fdatawait_range.

    Reviewed-by: Jan Kara
    Signed-off-by: Jeff Layton

    Jeff Layton
     
  • Necessary now for gfs2_fsync and sync_file_range, but there will
    eventually be other callers.

    Reviewed-by: Jan Kara
    Signed-off-by: Jeff Layton

    Jeff Layton
     

11 Jul, 2017

1 commit

  • We avoid calling __mod_node_page_state(NR_FILE_PAGES) for hugetlb pages
    now, but that's not enough because later code doesn't handle hugetlb
    properly. Actually in our testing, WARN_ON_ONCE(PageDirty(page)) at the
    end of this function fires for hugetlb, which makes no sense. So we
    should return immediately for hugetlb pages.
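
    The fix boils down to an early return for hugetlb (sketch; the exact
    placement inside the accounting code is not shown in this log):

        if (PageHuge(page))
                return;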

    Link: http://lkml.kernel.org/r/1496305019-5493-3-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: "Aneesh Kumar K.V"
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

08 Jul, 2017

1 commit

  • Pull Writeback error handling updates from Jeff Layton:
    "This pile represents the bulk of the writeback error handling fixes
    that I have for this cycle. Some of the earlier patches in this pile
    may look trivial but they are prerequisites for later patches in the
    series.

    The aim of this set is to improve how we track and report writeback
    errors to userland. Most applications that care about data integrity
    will periodically call fsync/fdatasync/msync to ensure that their
    writes have made it to the backing store.

    For a very long time, we have tracked writeback errors using two flags
    in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a
    writeback error occurs (via mapping_set_error) and are cleared as a
    side-effect of filemap_check_errors (as you noted yesterday). This
    model really sucks for userland.

    Only the first task to call fsync (or msync or fdatasync) will see the
    error. Any subsequent task calling fsync on a file will get back 0
    (unless another writeback error occurs in the interim). If I have
    several tasks writing to a file and calling fsync to ensure that their
    writes got stored, then I need to have them coordinate with one
    another. That's difficult enough, but in a world of containerized
    setups that coordination may even not be possible.

    But wait...it gets worse!

    The calls to filemap_check_errors can be buried pretty far down in the
    call stack, and there are internal callers of filemap_write_and_wait
    and the like that also end up clearing those errors. Many of those
    callers ignore the error return from that function or return it to
    userland at nonsensical times (e.g. truncate() or stat()). If I get
    back -EIO on a truncate, there is no reason to think that it was
    because some previous writeback failed, and a subsequent fsync() will
    (incorrectly) return 0.

    This pile aims to do three things:

    1) ensure that when a writeback error occurs, that error will be
    reported to userland on a subsequent fsync/fdatasync/msync call,
    regardless of what internal callers are doing

    2) report writeback errors on all file descriptions that were open at
    the time that the error occurred. This is a user-visible change,
    but I think most applications are written to assume this behavior
    anyway. Those that aren't are unlikely to be hurt by it.

    3) document what filesystems should do when there is a writeback
    error. Today, there is very little consistency between them, and a
    lot of cargo-cult copying. We need to make it very clear what
    filesystems should do in this situation.

    To achieve this, the set adds a new data type (errseq_t) and then
    builds new writeback error tracking infrastructure around that. Once
    all of that is in place, we change the filesystems to use the new
    infrastructure for reporting wb errors to userland.

    Note that this is just the initial foray into cleaning up this mess.
    There is a lot of work remaining here:

    1) convert the rest of the filesystems in a similar fashion. Once the
    initial set is in, then I think most other fs' will be fairly
    simple to convert. Hopefully most of those can go in via individual
    filesystem trees.

    2) convert internal waiters on writeback to use errseq_t for
    detecting errors instead of relying on the AS_* flags. I have some
    draft patches for this for ext4, but they are not quite ready for
    prime time yet.

    This was a discussion topic this year at LSF/MM too. If you're
    interested in the gory details, LWN has some good articles about this:

    https://lwn.net/Articles/718734/
    https://lwn.net/Articles/724307/"
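
    The heart of the errseq_t scheme, in sketch form (the helpers named
    below are the ones this series introduces; the usage shown is
    illustrative, not lifted from the patches):

        /* each file samples the error sequence when it is opened */
        errseq_t since = errseq_sample(&mapping->wb_err);

        /* the writeback path records a failure */
        errseq_set(&mapping->wb_err, -EIO);

        /* every fsync caller checks against its own sample point, so
         * each open file description sees the error exactly once */
        int err = errseq_check_and_advance(&mapping->wb_err, &since);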

    * tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
    btrfs: minimal conversion to errseq_t writeback error reporting on fsync
    xfs: minimal conversion to errseq_t writeback error reporting
    ext4: use errseq_t based error handling for reporting data writeback errors
    fs: convert __generic_file_fsync to use errseq_t based reporting
    block: convert to errseq_t based writeback error tracking
    dax: set errors in mapping when writeback fails
    Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors
    mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
    fs: new infrastructure for writeback error handling and reporting
    lib: add errseq_t type and infrastructure for handling it
    mm: don't TestClearPageError in __filemap_fdatawait_range
    mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
    jbd2: don't clear and reset errors after waiting on writeback
    buffer: set errors in mapping at the time that the error occurs
    fs: check for writeback errors after syncing out buffers in generic_file_fsync
    buffer: use mapping_set_error instead of setting the flag
    mm: fix mapping_set_error call in me_pagecache_dirty

    Linus Torvalds
     

07 Jul, 2017

1 commit

  • Track the following reclaim counters for every memory cgroup: PGREFILL,
    PGSCAN, PGSTEAL, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE and PGLAZYFREED.

    These values are exposed using the memory.stat interface of cgroup v2.

    The meaning of each value is the same as for global counters, available
    using /proc/vmstat.

    Also, for consistency, rename mem_cgroup_count_vm_event() to
    count_memcg_event_mm().
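
    The rename is mechanical; a call site changes roughly like this
    (illustrative fragment):

        /* old name */
        mem_cgroup_count_vm_event(current->mm, PGMAJFAULT);

        /* new name, same semantics */
        count_memcg_event_mm(current->mm, PGMAJFAULT);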

    Link: http://lkml.kernel.org/r/1494530183-30808-1-git-send-email-guro@fb.com
    Signed-off-by: Roman Gushchin
    Suggested-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Cc: Li Zefan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin