24 Apr, 2018

1 commit

  • commit abc1be13fd113ddef5e2d807a466286b864caed3 upstream.

    f2fs specifies the __GFP_ZERO flag for allocating some of its pages.
    Unfortunately, the page cache also uses the mapping's GFP flags for
    allocating radix tree nodes. It always masks off the __GFP_HIGHMEM
    flag, and masks off __GFP_ZERO in some paths, but not all. That causes
    radix tree nodes to be allocated with a NULL list_head, which causes
    backtraces like:

    __list_del_entry+0x30/0xd0
    list_lru_del+0xac/0x1ac
    page_cache_tree_insert+0xd8/0x110

    The __GFP_DMA and __GFP_DMA32 flags would also be able to sneak through
    if they are ever used. Fix them all by using GFP_RECLAIM_MASK at the
    innermost location, and remove it from earlier in the callchain.
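
    A minimal sketch of the idea described above (the exact call sites in
    the real patch differ): strip the mapping's GFP flags down to the
    reclaim-related bits before they can reach the radix tree node
    allocator.

    gfp_t gfp_mask = mapping_gfp_mask(mapping) & GFP_RECLAIM_MASK;

    /* gfp_mask can no longer carry __GFP_ZERO, __GFP_HIGHMEM,
       __GFP_DMA or __GFP_DMA32 into the node allocation. */
    error = radix_tree_maybe_preload(gfp_mask);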

    Link: http://lkml.kernel.org/r/20180411060320.14458-2-willy@infradead.org
    Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
    Signed-off-by: Matthew Wilcox
    Reported-by: Chris Fries
    Debugged-by: Minchan Kim
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Reviewed-by: Jan Kara
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Matthew Wilcox
     

04 Oct, 2017

1 commit

  • Eryu noticed that he could sometimes get a leftover error reported on
    fsync with ext2 and non-journalled ext4 when it shouldn't be there.

    The problem is that writeback_single_inode still uses filemap_fdatawait.
    That picks up a previously set AS_EIO flag, which would ordinarily have
    been cleared before.

    Since we're mostly using this function as a replacement for
    filemap_check_errors, have filemap_check_and_advance_wb_err clear AS_EIO
    and AS_ENOSPC when reporting an error. That should allow the new
    function to better emulate the behavior of the old with respect to these
    flags.
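
    A rough sketch of that clearing, inside the check-and-advance helper
    (surrounding code omitted):

    struct address_space *mapping = file->f_mapping;

    /* Drop the legacy flags so an old-style filemap_fdatawait() caller
       cannot pick up and report the same error a second time. */
    if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
        err = -ENOSPC;
    if (test_and_clear_bit(AS_EIO, &mapping->flags))
        err = -EIO;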

    Link: http://lkml.kernel.org/r/20170922133331.28812-1-jlayton@kernel.org
    Signed-off-by: Jeff Layton
    Reported-by: Eryu Guan
    Reviewed-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Layton
     

25 Sep, 2017

1 commit

  • Currently, when mixing buffered reads and asynchronous direct writes it
    is possible to end up in a situation where we have stale data in the
    page cache while the new data is already written to disk. This is
    permanent until the affected pages are flushed away. Despite the fact
    that mixing buffered and direct IO is ill-advised, it does pose a threat
    to data integrity, is unexpected and should be fixed.

    Fix this by deferring completion of asynchronous direct writes to a
    process context in the case that there are mapped pages to be found in
    the inode. Later before the completion in dio_complete() invalidate
    the pages in question. This ensures that after the completion the pages
    in the written area are either unmapped, or populated with up-to-date
    data. Also do the same for the iomap case which uses
    iomap_dio_complete() instead.

    This has a side effect of deferring the completion to a process context
    for every AIO DIO that happens on an inode that has pages mapped. However
    since the consensus is that this is ill-advised practice the performance
    implication should not be a problem.
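
    A condensed sketch of the dio_complete() side, assuming field names
    along the lines of the real struct dio:

    /* Runs from process context for AIO writes against a mapped inode. */
    if (ret > 0 && dio->op == REQ_OP_WRITE &&
        dio->inode->i_mapping->nrpages) {
        err = invalidate_inode_pages2_range(dio->inode->i_mapping,
                    offset >> PAGE_SHIFT,
                    (offset + ret - 1) >> PAGE_SHIFT);
        WARN_ON_ONCE(err);
    }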

    This was based on a proposal from Jeff Moyer, thanks!

    Reviewed-by: Jan Kara
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Jeff Moyer
    Signed-off-by: Lukas Czerner
    Signed-off-by: Jens Axboe

    Lukas Czerner
     

15 Sep, 2017

2 commits

  • Pull nowait read support from Al Viro:
    "Support IOCB_NOWAIT for buffered reads and block devices"

    * 'work.read_write' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    block_dev: support RWF_NOWAIT on block device nodes
    fs: support RWF_NOWAIT for buffered reads
    fs: support IOCB_NOWAIT in generic_file_buffered_read
    fs: pass iocb to do_generic_file_read
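
    From userspace the new behaviour is requested per call; a minimal
    sketch (RWF_NOWAIT comes from the uapi headers on older glibc):

    #define _GNU_SOURCE
    #include <sys/uio.h>

    static ssize_t try_read_nowait(int fd, void *buf, size_t len, off_t off)
    {
        struct iovec iov = { .iov_base = buf, .iov_len = len };

        /* Succeeds only if the data can be returned without blocking,
           e.g. it is already in the page cache; otherwise fails with EAGAIN. */
        return preadv2(fd, &iov, 1, off, RWF_NOWAIT);
    }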

    Linus Torvalds
     
  • Now that we have added breaks in the wait queue scan and allow a bookmark
    on the scan position, we put this logic in the wake_up_page_bit function.

    We can have very long page wait lists in large systems where multiple
    pages share the same wait list. We break up the wake-up walk here to give
    other CPUs a chance to access the list and to avoid disabling interrupts
    for too long while traversing it. This reduces interrupt and rescheduling
    latency, and excessive page wait queue lock hold time.

    [ v2: Remove bookmark_wake_function ]
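
    A condensed sketch of the resulting walk in wake_up_page_bit():

    wait_queue_entry_t bookmark;

    bookmark.flags = 0;
    bookmark.private = NULL;
    bookmark.func = NULL;
    INIT_LIST_HEAD(&bookmark.entry);

    spin_lock_irqsave(&q->lock, flags);
    __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);

    while (bookmark.flags & WQ_FLAG_BOOKMARK) {
        /* Drop the lock between batches so other CPUs, and waiters that
           completed asynchronously, can get at the list. */
        spin_unlock_irqrestore(&q->lock, flags);
        cpu_relax();
        spin_lock_irqsave(&q->lock, flags);
        __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);
    }
    spin_unlock_irqrestore(&q->lock, flags);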

    Signed-off-by: Tim Chen
    Signed-off-by: Linus Torvalds

    Tim Chen
     

07 Sep, 2017

6 commits

  • Merge updates from Andrew Morton:

    - various misc bits

    - DAX updates

    - OCFS2

    - most of MM

    * emailed patches from Andrew Morton: (119 commits)
    mm,fork: introduce MADV_WIPEONFORK
    x86,mpx: make mpx depend on x86-64 to free up VMA flag
    mm: add /proc/pid/smaps_rollup
    mm: hugetlb: clear target sub-page last when clearing huge page
    mm: oom: let oom_reap_task and exit_mmap run concurrently
    swap: choose swap device according to numa node
    mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
    mm, oom: do not rely on TIF_MEMDIE for memory reserves access
    z3fold: use per-cpu unbuddied lists
    mm, swap: don't use VMA based swap readahead if HDD is used as swap
    mm, swap: add sysfs interface for VMA based swap readahead
    mm, swap: VMA based swap readahead
    mm, swap: fix swap readahead marking
    mm, swap: add swap readahead hit statistics
    mm/vmalloc.c: don't reinvent the wheel but use existing llist API
    mm/vmstat.c: fix wrong comment
    selftests/memfd: add memfd_create hugetlbfs selftest
    mm/shmem: add hugetlbfs support to memfd_create()
    mm, devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups
    mm/vmalloc.c: halve the number of comparisons performed in pcpu_get_vm_areas()
    ...

    Linus Torvalds
     
  • We want only pages from given range in filemap_range_has_page(),
    furthermore we want at most a single page.

    So use find_get_pages_range() instead of pagevec_lookup() and remove
    unnecessary code.
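
    The simplified check then looks roughly like this (byte offsets
    converted to page indices):

    pgoff_t index = start_byte >> PAGE_SHIFT;
    pgoff_t end = end_byte >> PAGE_SHIFT;
    struct page *page;

    if (end_byte < start_byte)
        return false;
    /* A single page anywhere in [index, end] answers the question. */
    if (!find_get_pages_range(mapping, &index, end, 1, &page))
        return false;
    put_page(page);
    return true;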

    Link: http://lkml.kernel.org/r/20170726114704.7626-10-jack@suse.cz
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Implement a variant of find_get_pages() that stops iterating at given
    index. This may be substantial performance gain if the mapping is
    sparse. See following commit for details. Furthermore lots of users of
    this function (through pagevec_lookup()) actually want a range lookup
    and all of them are currently open-coding this.

    Also create corresponding pagevec_lookup_range() function.
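
    A typical range iteration with the new helper, sketched:

    struct page *pages[PAGEVEC_SIZE];
    pgoff_t index = start;
    unsigned int i, nr;

    while ((nr = find_get_pages_range(mapping, &index, end,
                                      PAGEVEC_SIZE, pages))) {
        for (i = 0; i < nr; i++) {
            /* ... process pages[i], which lies within [start, end] ... */
            put_page(pages[i]);
        }
        /* 'index' has been advanced past the last page returned, so the
           next call continues exactly where this one stopped. */
    }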

    Link: http://lkml.kernel.org/r/20170726114704.7626-4-jack@suse.cz
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Make pagevec_lookup() (and underlying find_get_pages()) update index to
    the next page where iteration should continue. Most callers want this
    and also pagevec_lookup_tag() already does this.

    Link: http://lkml.kernel.org/r/20170726114704.7626-3-jack@suse.cz
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Now that we no longer insert struct page pointers in DAX radix trees we
    can remove the special casing for DAX in page_cache_tree_insert().

    This also allows us to make dax_wake_mapping_entry_waiter() local to
    fs/dax.c, removing it from dax.h.

    Link: http://lkml.kernel.org/r/20170724170616.25810-5-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Suggested-by: Jan Kara
    Reviewed-by: Jan Kara
    Cc: "Darrick J. Wong"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Christoph Hellwig
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jonathan Corbet
    Cc: Matthew Wilcox
    Cc: Steven Rostedt
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • Pull writeback error handling updates from Jeff Layton:
    "This pile continues the work from last cycle on better tracking
    writeback errors. In v4.13 we added some basic errseq_t infrastructure
    and converted a few filesystems to use it.

    This set continues refining that infrastructure, adds documentation,
    and converts most of the other filesystems to use it. The main
    exception at this point is the NFS client"

    * tag 'wberr-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
    ecryptfs: convert to file_write_and_wait in ->fsync
    mm: remove optimizations based on i_size in mapping writeback waits
    fs: convert a pile of fsync routines to errseq_t based reporting
    gfs2: convert to errseq_t based writeback error reporting for fsync
    fs: convert sync_file_range to use errseq_t based error-tracking
    mm: add file_fdatawait_range and file_write_and_wait
    fuse: convert to errseq_t based error tracking for fsync
    mm: consolidate dax / non-dax checks for writeback
    Documentation: add some docs for errseq_t
    errseq: rename __errseq_set to errseq_set

    Linus Torvalds
     

05 Sep, 2017

2 commits


29 Aug, 2017

1 commit

  • Commit 3510ca20ece0 ("Minor page waitqueue cleanups") made the page
    queue code always add new waiters to the back of the queue, which helps
    upcoming patches to batch the wakeups for some horrid loads where the
    wait queues grow to thousands of entries.

    However, I forgot about the nasty add_page_wait_queue() special case
    code that is only used by the cachefiles code. That one still continued
    to add the new wait queue entries at the beginning of the list.

    Fix it, because any sane batched wakeup will require that we don't
    suddenly start getting new entries at the beginning of the list that we
    already handled in a previous batch.

    [ The current code always does the whole list while holding the lock, so
    wait queue ordering doesn't matter for correctness, but even then it's
    better to add later entries at the end from a fairness standpoint ]
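
    The fix boils down to inserting at the tail; a sketch of the queueing
    in add_page_wait_queue() after the change:

    spin_lock_irqsave(&q->lock, flags);
    __add_wait_queue_entry_tail(q, waiter);  /* was effectively a head insert */
    SetPageWaiters(page);
    spin_unlock_irqrestore(&q->lock, flags);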

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

28 Aug, 2017

2 commits

  • The "lock_page_killable()" function waits for exclusive access to the
    page lock bit using the WQ_FLAG_EXCLUSIVE bit in the waitqueue entry
    set.

    That means that if it gets woken up, other waiters may have been
    skipped.

    That, in turn, means that if it sees the page being unlocked, it *must*
    take that lock and return success, even if a lethal signal is also
    pending.

    So instead of checking for lethal signals first, we need to check for
    them after we've checked the actual bit that we were waiting for. Even
    if that might then delay the killing of the process.

    This matches the order of the old "wait_on_bit_lock()" infrastructure
    that the page locking used to use (and is still used in a few other
    areas).

    Note that if we still return an error after having unsuccessfully tried
    to acquire the page lock, that is ok: that means that some other thread
    was able to get ahead of us and lock the page, and when that other
    thread then unlocks the page, the wakeup event will be repeated. So any
    other pending waiters will now get properly woken up.
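
    In terms of the wait loop, the resulting ordering looks roughly like
    this (sketch, details elided):

    for (;;) {
        /* ... queue ourselves and sleep ... */
        io_schedule();

        /* First see whether we got the bit we were woken for. */
        if (!test_and_set_bit_lock(bit_nr, &page->flags))
            break;          /* we own the lock: must report success */

        /* Only then honour a pending fatal signal. */
        if (unlikely(signal_pending_state(state, current))) {
            ret = -EINTR;
            break;
        }
    }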

    Fixes: 62906027091f ("mm: add PageWaiters indicating tasks are waiting for a page bit")
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Mel Gorman
    Cc: Jan Kara
    Cc: Davidlohr Bueso
    Cc: Andi Kleen
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Tim Chen and Kan Liang have been battling a customer load that shows
    extremely long page wakeup lists. The cause seems to be constant NUMA
    migration of a hot page that is shared across a lot of threads, but the
    actual root cause for the exact behavior has not been found.

    Tim has a patch that batches the wait list traversal at wakeup time, so
    that we at least don't get long uninterruptible cases where we traverse
    and wake up thousands of processes and get nasty latency spikes. That
    is likely 4.14 material, but we're still discussing the page waitqueue
    specific parts of it.

    In the meantime, I've tried to look at making the page wait queues less
    expensive, and failing miserably. If you have thousands of threads
    waiting for the same page, it will be painful. We'll need to try to
    figure out the NUMA balancing issue some day, in addition to avoiding
    the excessive spinlock hold times.

    That said, having tried to rewrite the page wait queues, I can at least
    fix up some of the braindamage in the current situation. In particular:

    (a) we don't want to continue walking the page wait list if the bit
    we're waiting for already got set again (which seems to be one of
    the patterns of the bad load). That makes no progress and just
    causes pointless cache pollution chasing the pointers.

    (b) we don't want to put the non-locking waiters always on the front of
    the queue, and the locking waiters always on the back. Not only is
    that unfair, it means that we wake up thousands of reading threads
    that will just end up being blocked by the writer later anyway.

    Also add a comment about the layout of 'struct wait_page_key' - there is
    an external user of it in the cachefiles code that means that it has to
    match the layout of 'struct wait_bit_key' in the first two members. It
    so happens to match, because 'struct page *' and 'unsigned long *' end
    up having the same values simply because the page flags are the first
    member in struct page.
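
    For reference, the two layouts that must line up in their first two
    members look roughly like this:

    struct wait_page_key {
        struct page *page;      /* aliases wait_bit_key.flags  */
        int bit_nr;             /* aliases wait_bit_key.bit_nr */
        int page_match;
    };

    struct wait_bit_key {
        void *flags;
        int bit_nr;
        unsigned long timeout;
    };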

    Cc: Tim Chen
    Cc: Kan Liang
    Cc: Mel Gorman
    Cc: Christopher Lameter
    Cc: Andi Kleen
    Cc: Davidlohr Bueso
    Cc: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

01 Aug, 2017

2 commits

  • Marcelo added this i_size based optimization with a patch in 2004
    (commitid is from the linux-history tree):

    commit 765dad09b4ac101a32d87af2bb793c3060497d3c
    Author: Marcelo Tosatti
    Date: Tue Sep 7 17:51:17 2004 -0700

    small wait_on_page_writeback_range() optimization

    filemap_fdatawait() calls wait_on_page_writeback_range() with -1
    as "end" parameter. This is not needed since we know the EOF
    from the inode. Use that instead.

    There may be races here, particularly with clustered or network
    filesystems. It also seems like a bit of a layering violation since
    we're operating on an address_space here, not an inode.

    Finally, it's also questionable whether this optimization really helps
    on workloads that we care about. Should we be optimizing for writeback
    vs. truncate races in a codepath where we expect to wait anyway? It
    doesn't seem worth the risk.

    Remove this optimization from the filemap_fdatawait codepaths. This
    means that filemap_fdatawait becomes a trivial wrapper around
    filemap_fdatawait_range.
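
    After the change, filemap_fdatawait() reduces to something like:

    int filemap_fdatawait(struct address_space *mapping)
    {
        /* Wait over the whole file; no more peeking at i_size. */
        return filemap_fdatawait_range(mapping, 0, LLONG_MAX);
    }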

    Reviewed-by: Jan Kara
    Signed-off-by: Jeff Layton

    Jeff Layton
     
  • Necessary now for gfs2_fsync and sync_file_range, but there will
    eventually be other callers.

    Reviewed-by: Jan Kara
    Signed-off-by: Jeff Layton

    Jeff Layton
     

29 Jul, 2017

1 commit


27 Jul, 2017

1 commit


11 Jul, 2017

1 commit

    We avoid calling __mod_node_page_state(NR_FILE_PAGES) for hugetlb pages
    now, but it's not enough because later code doesn't handle hugetlb
    properly. Actually in our testing, WARN_ON_ONCE(PageDirty(page)) at the
    end of this function fires for hugetlb, which makes no sense. So we
    should return immediately for hugetlb pages.
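
    The fix is a simple early return, sketched:

    /* hugetlb pages take no part in the accounting done below */
    if (PageHuge(page))
        return;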

    Link: http://lkml.kernel.org/r/1496305019-5493-3-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: "Aneesh Kumar K.V"
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

08 Jul, 2017

1 commit

  • Pull Writeback error handling updates from Jeff Layton:
    "This pile represents the bulk of the writeback error handling fixes
    that I have for this cycle. Some of the earlier patches in this pile
    may look trivial but they are prerequisites for later patches in the
    series.

    The aim of this set is to improve how we track and report writeback
    errors to userland. Most applications that care about data integrity
    will periodically call fsync/fdatasync/msync to ensure that their
    writes have made it to the backing store.

    For a very long time, we have tracked writeback errors using two flags
    in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a
    writeback error occurs (via mapping_set_error) and are cleared as a
    side-effect of filemap_check_errors (as you noted yesterday). This
    model really sucks for userland.

    Only the first task to call fsync (or msync or fdatasync) will see the
    error. Any subsequent task calling fsync on a file will get back 0
    (unless another writeback error occurs in the interim). If I have
    several tasks writing to a file and calling fsync to ensure that their
    writes got stored, then I need to have them coordinate with one
    another. That's difficult enough, but in a world of containerized
    setups that coordination may even not be possible.

    But wait...it gets worse!

    The calls to filemap_check_errors can be buried pretty far down in the
    call stack, and there are internal callers of filemap_write_and_wait
    and the like that also end up clearing those errors. Many of those
    callers ignore the error return from that function or return it to
    userland at nonsensical times (e.g. truncate() or stat()). If I get
    back -EIO on a truncate, there is no reason to think that it was
    because some previous writeback failed, and a subsequent fsync() will
    (incorrectly) return 0.

    This pile aims to do three things:

    1) ensure that when a writeback error occurs that that error will be
    reported to userland on a subsequent fsync/fdatasync/msync call,
    regardless of what internal callers are doing

    2) report writeback errors on all file descriptions that were open at
    the time that the error occurred. This is a user-visible change,
    but I think most applications are written to assume this behavior
    anyway. Those that aren't are unlikely to be hurt by it.

    3) document what filesystems should do when there is a writeback
    error. Today, there is very little consistency between them, and a
    lot of cargo-cult copying. We need to make it very clear what
    filesystems should do in this situation.

    To achieve this, the set adds a new data type (errseq_t) and then
    builds new writeback error tracking infrastructure around that. Once
    all of that is in place, we change the filesystems to use the new
    infrastructure for reporting wb errors to userland.

    Note that this is just the initial foray into cleaning up this mess.
    There is a lot of work remaining here:

    1) convert the rest of the filesystems in a similar fashion. Once the
    initial set is in, then I think most other fs' will be fairly
    simple to convert. Hopefully most of those can go in via individual
    filesystem trees.

    2) convert internal waiters on writeback to use errseq_t for
    detecting errors instead of relying on the AS_* flags. I have some
    draft patches for this for ext4, but they are not quite ready for
    prime time yet.

    This was a discussion topic this year at LSF/MM too. If you're
    interested in the gory details, LWN has some good articles about this:

    https://lwn.net/Articles/718734/
    https://lwn.net/Articles/724307/"
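
    Point 2 of the list above is easiest to see from userspace: with two
    descriptors open on the same file before a writeback error happens,
    both now observe it (sketch):

    /* fd1 and fd2 were both open when the writeback error occurred */
    fsync(fd1);   /* returns -1 with errno == EIO */
    fsync(fd2);   /* previously returned 0; now also reports the error */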

    * tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
    btrfs: minimal conversion to errseq_t writeback error reporting on fsync
    xfs: minimal conversion to errseq_t writeback error reporting
    ext4: use errseq_t based error handling for reporting data writeback errors
    fs: convert __generic_file_fsync to use errseq_t based reporting
    block: convert to errseq_t based writeback error tracking
    dax: set errors in mapping when writeback fails
    Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors
    mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
    fs: new infrastructure for writeback error handling and reporting
    lib: add errseq_t type and infrastructure for handling it
    mm: don't TestClearPageError in __filemap_fdatawait_range
    mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
    jbd2: don't clear and reset errors after waiting on writeback
    buffer: set errors in mapping at the time that the error occurs
    fs: check for writeback errors after syncing out buffers in generic_file_fsync
    buffer: use mapping_set_error instead of setting the flag
    mm: fix mapping_set_error call in me_pagecache_dirty

    Linus Torvalds
     

07 Jul, 2017

1 commit

  • Track the following reclaim counters for every memory cgroup: PGREFILL,
    PGSCAN, PGSTEAL, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE and PGLAZYFREED.

    These values are exposed using the memory.stat interface of cgroup v2.

    The meaning of each value is the same as for global counters, available
    using /proc/vmstat.

    Also, for consistency, rename mem_cgroup_count_vm_event() to
    count_memcg_event_mm().
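
    The rename itself is mechanical; for example, in the page cache fault
    path (sketch):

    /* before */
    mem_cgroup_count_vm_event(vmf->vma->vm_mm, PGMAJFAULT);
    /* after */
    count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);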

    Link: http://lkml.kernel.org/r/1494530183-30808-1-git-send-email-guro@fb.com
    Signed-off-by: Roman Gushchin
    Suggested-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Cc: Li Zefan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

06 Jul, 2017

4 commits

  • Most filesystems currently use mapping_set_error and
    filemap_check_errors for setting and reporting/clearing writeback errors
    at the mapping level. filemap_check_errors is indirectly called from
    most of the filemap_fdatawait_* functions and from
    filemap_write_and_wait*. These functions are called from all sorts of
    contexts to wait on writeback to finish -- e.g. mostly in fsync, but
    also in truncate calls, getattr, etc.

    The non-fsync callers are problematic. We should be reporting writeback
    errors during fsync, but many places spread over the tree clear out
    errors before they can be properly reported, or report errors at
    nonsensical times.

    If I get -EIO on a stat() call, there is no reason for me to assume that
    it is because some previous writeback failed. The fact that it also
    clears out the error such that a subsequent fsync returns 0 is a bug,
    and a nasty one since that's potentially silent data corruption.

    This patch adds a small bit of new infrastructure for setting and
    reporting errors during address_space writeback. While the above was my
    original impetus for adding this, I think it's also the case that
    current fsync semantics are just problematic for userland. Most
    applications that call fsync do so to ensure that the data they wrote
    has hit the backing store.

    In the case where there are multiple writers to the file at the same
    time, this is really hard to determine. The first one to call fsync will
    see any stored error, and the rest get back 0. The processes with open
    fds may not be associated with one another in any way. They could even
    be in different containers, so ensuring coordination between all fsync
    callers is not really an option.

    One way to remedy this would be to track what file descriptor was used
    to dirty the file, but that's rather cumbersome and would likely be
    slow. However, there is a simpler way to improve the semantics here
    without incurring too much overhead.

    This set adds an errseq_t to struct address_space, and a corresponding
    one is added to struct file. Writeback errors are recorded in the
    mapping's errseq_t, and the one in struct file is used as the "since"
    value.

    This changes the semantics of the Linux fsync implementation such that
    applications can now use it to determine whether there were any
    writeback errors since fsync(fd) was last called (or since the file was
    opened in the case of fsync having never been called).

    Note that those writeback errors may have occurred when writing data
    that was dirtied via an entirely different fd, but that's the case now
    with the current mapping_set_error/filemap_check_error infrastructure.
    This will at least prevent you from getting a false report of success.

    The new behavior is still consistent with the POSIX spec, and is more
    reliable for application developers. This patch just adds some basic
    infrastructure for doing this, and ensures that the f_wb_err "cursor"
    is properly set when a file is opened. Later patches will change the
    existing code to use this new infrastructure for reporting errors at
    fsync time.
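
    The interaction between the two errseq_t values, sketched:

    /* writeback path: record the error in the mapping */
    errseq_set(&mapping->wb_err, -EIO);

    /* fsync path: report anything this file has not yet seen, and advance
       the file's cursor so the same error is not reported twice on it */
    err = errseq_check_and_advance(&mapping->wb_err, &file->f_wb_err);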

    Signed-off-by: Jeff Layton
    Reviewed-by: Jan Kara

    Jeff Layton
     
  • The -EIO returned here can end up overriding whatever error is marked in
    the address space, and be returned at fsync time, even when there is a
    more appropriate error stored in the mapping.

    Read errors are also sometimes tracked on a per-page level using
    PG_error. Suppose we have a read error on a page, and then that page is
    subsequently dirtied by overwriting the whole page. Writeback doesn't
    clear PG_error, so we can then end up successfully writing back that
    page and still return -EIO on fsync.

    Worse yet, PG_error is cleared during a sync() syscall, but the -EIO
    return from that is silently discarded. Any subsystem that is relying on
    PG_error to report errors during fsync can easily lose writeback errors
    due to this. All you need is a stray sync() call to wait for writeback
    to complete and you've lost the error.

    Since the handling of the PG_error flag is somewhat inconsistent across
    subsystems, let's just rely on marking the address space when there are
    writeback errors. Change the TestClearPageError call to ClearPageError,
    and make __filemap_fdatawait_range a void return function.
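
    In __filemap_fdatawait_range() the change is essentially (sketch):

    wait_on_page_writeback(page);
    ClearPageError(page);   /* was: if (TestClearPageError(page)) ret = -EIO; */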

    Signed-off-by: Jeff Layton

    Jeff Layton
     
  • filemap_write_and_wait{_range} will return an error if writeback
    initiation fails, but won't clear errors in the address_space. This is
    particularly problematic on DAX, as filemap_fdatawrite* is
    effectively synchronous there. Ensure that we clear the AS_EIO/AS_ENOSPC
    flags when filemap_fdatawrite* returns an error.
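
    A sketch of the write-and-wait path with the new clearing behaviour:

    err = __filemap_fdatawrite_range(mapping, lstart, lend, WB_SYNC_ALL);
    if (err != -EIO) {
        /* writeback was started: wait for it and collect errors as before */
        int err2 = filemap_fdatawait_range(mapping, lstart, lend);
        if (!err)
            err = err2;
    } else {
        /* initiation failed hard: don't leave stale AS_EIO/AS_ENOSPC behind */
        filemap_check_errors(mapping);
    }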

    Signed-off-by: Jeff Layton

    Jeff Layton
     
  • Resetting this flag is almost certainly racy, and will be problematic
    with some coming changes.

    Make filemap_fdatawait_keep_errors return int, but not clear the flag(s).
    Have jbd2 call it instead of filemap_fdatawait and don't attempt to
    re-set the error flag if it fails.

    Reviewed-by: Jan Kara
    Reviewed-by: Carlos Maiolino
    Signed-off-by: Jeff Layton

    Jeff Layton
     

04 Jul, 2017

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Add the SYSTEM_SCHEDULING bootup state to move various scheduler
    debug checks earlier into the bootup. This turns silent and
    sporadically deadly bugs into nice, deterministic splats. Fix some
    of the splats that triggered. (Thomas Gleixner)

    - A round of restructuring and refactoring of the load-balancing and
    topology code (Peter Zijlstra)

    - Another round of consolidating ~20 of incremental scheduler code
    history: this time in terms of wait-queue nomenclature. (I didn't
    get much feedback on these renaming patches, and we can still
    easily change any names I might have misplaced, so if anyone hates
    a new name, please holler and I'll fix it.) (Ingo Molnar)

    - sched/numa improvements, fixes and updates (Rik van Riel)

    - Another round of x86/tsc scheduler clock code improvements, in hope
    of making it more robust (Peter Zijlstra)

    - Improve NOHZ behavior (Frederic Weisbecker)

    - Deadline scheduler improvements and fixes (Luca Abeni, Daniel
    Bristot de Oliveira)

    - Simplify and optimize the topology setup code (Lauro Ramos
    Venancio)

    - Debloat and decouple scheduler code some more (Nicolas Pitre)

    - Simplify code by making better use of llist primitives (Byungchul
    Park)

    - ... plus other fixes and improvements"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
    sched/cputime: Refactor the cputime_adjust() code
    sched/debug: Expose the number of RT/DL tasks that can migrate
    sched/numa: Hide numa_wake_affine() from UP build
    sched/fair: Remove effective_load()
    sched/numa: Implement NUMA node level wake_affine()
    sched/fair: Simplify wake_affine() for the single socket case
    sched/numa: Override part of migrate_degrades_locality() when idle balancing
    sched/rt: Move RT related code from sched/core.c to sched/rt.c
    sched/deadline: Move DL related code from sched/core.c to sched/deadline.c
    sched/cpuset: Only offer CONFIG_CPUSETS if SMP is enabled
    sched/fair: Spare idle load balancing on nohz_full CPUs
    nohz: Move idle balancer registration to the idle path
    sched/loadavg: Generalize "_idle" naming to "_nohz"
    sched/core: Drop the unused try_get_task_struct() helper function
    sched/fair: WARN() and refuse to set buddy when !se->on_rq
    sched/debug: Fix SCHED_WARN_ON() to return a value on !CONFIG_SCHED_DEBUG as well
    sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
    sched/wait: Move bit_wait_table[] and related functionality from sched/core.c to sched/wait_bit.c
    sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h>
    sched/wait: Re-adjust macro line continuation backslashes in <linux/wait.h>
    ...

    Linus Torvalds
     

20 Jun, 2017

4 commits

  • Find out if the I/O will trigger a wait due to writeback. If yes,
    return -EAGAIN.

    Return -EINVAL for buffered AIO: there are multiple causes of
    delay such as page locks, dirty throttling logic, page loading
    from disk etc. which cannot be taken care of.
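
    For the writeback check in the first paragraph, the direct write path
    becomes roughly (sketch):

    if (iocb->ki_flags & IOCB_NOWAIT) {
        /* Waiting for writeback would block: tell the caller to retry. */
        if (filemap_range_has_page(mapping, pos, pos + count - 1))
            return -EAGAIN;
    } else {
        ret = filemap_write_and_wait_range(mapping, pos, pos + count - 1);
        if (ret)
            return ret;
    }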

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Goldwyn Rodrigues
    Signed-off-by: Jens Axboe

    Goldwyn Rodrigues
     
  • filemap_range_has_page() returns true if the file's mapping has
    a page within the range mentioned. This function will be used
    to check if a write() call will cause a writeback of previous
    writes.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Goldwyn Rodrigues
    Signed-off-by: Jens Axboe

    Goldwyn Rodrigues
     
  • So I've noticed a number of instances where it was not obvious from the
    code whether ->task_list was for a wait-queue head or a wait-queue entry.

    Furthermore, there's a number of wait-queue users where the lists are
    not for 'tasks' but other entities (poll tables, etc.), in which case
    the 'task_list' name is actively confusing.

    To clear this all up, name the wait-queue head and entry list structure
    fields unambiguously:

    struct wait_queue_head::task_list => ::head
    struct wait_queue_entry::task_list => ::entry

    For example, this code:

    rqw->wait.task_list.next != &wait->task_list

    ... was pretty unclear (to me) what it's doing, while now it's written this way:

    rqw->wait.head.next != &wait->entry

    ... which makes it pretty clear that we are iterating a list until we see the head.

    Other examples are:

    list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
    list_for_each_entry(wq, &fence->wait.task_list, task_list) {

    ... where it's unclear (to me) what we are iterating, and during review it's
    hard to tell whether it's trying to walk a wait-queue entry (which would be
    a bug), while now it's written as:

    list_for_each_entry_safe(pos, next, &x->head, entry) {
    list_for_each_entry(wq, &fence->wait.head, entry) {

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Rename:

    wait_queue_t => wait_queue_entry_t

    'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
    but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
    which had to carry the name.

    Start sorting this out by renaming it to 'wait_queue_entry_t'.

    This also allows the real structure name 'struct __wait_queue' to
    lose its double underscore and become 'struct wait_queue_entry',
    which is the more canonical nomenclature for such data types.
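
    In code the rename is purely mechanical, e.g.:

    /* before */
    wait_queue_t wait;
    /* after: same object, clearer name */
    struct wait_queue_entry wait;   /* typedef'd as wait_queue_entry_t */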

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

10 May, 2017

1 commit


09 May, 2017

2 commits

  • Commit afddba49d18f ("fs: introduce write_begin, write_end, and
    perform_write aops") introduced AOP_FLAG_UNINTERRUPTIBLE flag which was
    checked in pagecache_write_begin(), but that check was removed by
    4e02ed4b4a2f ("fs: remove prepare_write/commit_write").

    Between these two commits, commit d9414774dc0c ("cifs: Convert cifs to
    new aops.") added a check in cifs_write_begin(), but that check was soon
    removed by commit a98ee8c1c707 ("[CIFS] fix regression in
    cifs_write_begin/cifs_write_end").

    Therefore, AOP_FLAG_UNINTERRUPTIBLE flag is checked nowhere. Let's
    remove this flag. This patch has no functionality changes.

    Link: http://lkml.kernel.org/r/1489294781-53494-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Reviewed-by: Jeff Layton
    Reviewed-by: Christoph Hellwig
    Cc: Nick Piggin
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Wrong sign of the iov_iter_revert() argument. Unfortunately, it slipped
    through testing, since most of the time we don't do anything to the iterator
    afterwards and potential oops on walking the iter->iov too far backwards
    is too infrequent to be easily triggered.

    Add a sanity check in iov_iter_revert() to catch bugs like this one;
    fortunately, the same braino hadn't happened in other callers, but we'd
    better have a warning if such a thing crops up.
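
    The added sanity check is along these lines (sketch):

    void iov_iter_revert(struct iov_iter *i, size_t unroll)
    {
        if (!unroll)
            return;
        /* A negative count passed by a buggy caller arrives here as a huge
           size_t value; warn and bail out rather than walking the iovec
           array backwards past its start. */
        if (WARN_ON(unroll > MAX_RW_COUNT))
            return;
        /* ... existing unwind logic ... */
    }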

    Signed-off-by: Al Viro

    Al Viro
     

04 May, 2017

2 commits

  • Patch series "Properly invalidate data in the cleancache", v2.

    We've noticed that after direct IO write, buffered read sometimes gets
    stale data which is coming from the cleancache. The reason for this is
    that some direct write hooks call invalidate_inode_pages2[_range]()
    conditionally iff mapping->nrpages is not zero, so we may not invalidate
    data in the cleancache.

    Another odd thing is that we check only for ->nrpages and don't check
    for ->nrexceptional, but invalidate_inode_pages2[_range] also
    invalidates exceptional entries as well. So we invalidate exceptional
    entries only if ->nrpages != 0? This doesn't feel right.

    - Patch 1 fixes direct IO writes by removing ->nrpages check.
    - Patch 2 fixes similar case in invalidate_bdev().
    Note: I only fixed conditional cleancache_invalidate_inode() here.
    Do we also need to add an ->nrexceptional check to invalidate_bdev()?

    - Patches 3-4: some optimizations.

    This patch (of 4):

    Some direct IO write fs hooks call invalidate_inode_pages2[_range]()
    conditionally iff mapping->nrpages is not zero. This can't be right,
    because invalidate_inode_pages2[_range]() also invalidate data in the
    cleancache via cleancache_invalidate_inode() call. So if page cache is
    empty but there is some data in the cleancache, buffered read after
    direct IO write would get stale data from the cleancache.

    Also it doesn't feel right to check only for ->nrpages because
    invalidate_inode_pages2[_range] invalidates exceptional entries as well.

    Fix this by calling invalidate_inode_pages2[_range]() regardless of
    nrpages state.

    Note: nfs, cifs and 9p don't need a similar fix because they never call
    cleancache_get_page() (neither directly nor via mpage_readpage[s]()), so
    they are not affected by this bug.
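
    The change in the direct IO write hooks is essentially dropping the
    ->nrpages guard (sketch):

    /* Before, this call was wrapped in "if (mapping->nrpages)"; run it
       unconditionally so cleancache_invalidate_inode() is reached even
       when the page cache itself is empty. */
    invalidate_inode_pages2_range(mapping,
                                  pos >> PAGE_SHIFT, end >> PAGE_SHIFT);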

    Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
    Link: http://lkml.kernel.org/r/20170424164135.22350-2-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Jan Kara
    Acked-by: Konrad Rzeszutek Wilk
    Cc: Alexander Viro
    Cc: Ross Zwisler
    Cc: Jens Axboe
    Cc: Johannes Weiner
    Cc: Alexey Kuznetsov
    Cc: Christoph Hellwig
    Cc: Nikolay Borisov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • The round_up() macro generates a couple of unnecessary instructions
    in this usage:

    48cd: 49 8b 47 50 mov 0x50(%r15),%rax
    48d1: 48 83 e8 01 sub $0x1,%rax
    48d5: 48 0d ff 0f 00 00 or $0xfff,%rax
    48db: 48 83 c0 01 add $0x1,%rax
    48df: 48 c1 f8 0c sar $0xc,%rax
    48e3: 48 39 c3 cmp %rax,%rbx
    48e6: 72 2e jb 4916

    If we change round_up() to ((x) + __round_mask(x, y)) & ~__round_mask(x, y)
    then GCC can see through it and remove the mask (because that would be
    dead code given the subsequent shift):

    48cd: 49 8b 47 50 mov 0x50(%r15),%rax
    48d1: 48 05 ff 0f 00 00 add $0xfff,%rax
    48d7: 48 c1 e8 0c shr $0xc,%rax
    48db: 48 39 c3 cmp %rax,%rbx
    48de: 72 2e jb 490e

    But that's problematic because we'd evaluate 'y' twice. Converting
    round_up into an inline function prevents it from being used in other
    definitions. The easiest thing to do is just change these three usages
    of round_up to use DIV_ROUND_UP. Also add an unlikely() because GCC's
    heuristic is wrong in this case.
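
    One of the converted comparisons, sketched (variable names
    illustrative):

    pgoff_t max_idx = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);

    if (unlikely(index >= max_idx))
        return VM_FAULT_SIGBUS;   /* faulting beyond EOF */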

    Link: http://lkml.kernel.org/r/20170207192812.5281-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

03 May, 2017

1 commit

  • Pull iov_iter updates from Al Viro:
    "Cleanups that sat in -next + -stable fodder that has just missed 4.11.

    There's more iov_iter work in my local tree, but I'd prefer to push
    the stuff that had been in -next first"

    * 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    iov_iter: don't revert iov buffer if csum error
    generic_file_read_iter(): make use of iov_iter_revert()
    generic_file_direct_write(): make use of iov_iter_revert()
    orangefs: use iov_iter_revert()
    sctp: switch to copy_from_iter_full()
    net/9p: switch to copy_from_iter_full()
    switch memcpy_from_msg() to copy_from_iter_full()
    rds: make use of iov_iter_revert()

    Linus Torvalds
     

22 Apr, 2017

2 commits