15 Aug, 2020

2 commits

  • struct file_ra_state ra.mmap_miss could be accessed concurrently during
    page faults as noticed by KCSAN,

    BUG: KCSAN: data-race in filemap_fault / filemap_map_pages

    write to 0xffff9b1700a2c1b4 of 4 bytes by task 3292 on cpu 30:
    filemap_fault+0x920/0xfc0
    do_sync_mmap_readahead at mm/filemap.c:2384
    (inlined by) filemap_fault at mm/filemap.c:2486
    __xfs_filemap_fault+0x112/0x3e0 [xfs]
    xfs_filemap_fault+0x74/0x90 [xfs]
    __do_fault+0x9e/0x220
    do_fault+0x4a0/0x920
    __handle_mm_fault+0xc69/0xd00
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    read to 0xffff9b1700a2c1b4 of 4 bytes by task 3313 on cpu 32:
    filemap_map_pages+0xc2e/0xd80
    filemap_map_pages at mm/filemap.c:2625
    do_fault+0x3da/0x920
    __handle_mm_fault+0xc69/0xd00
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 32 PID: 3313 Comm: systemd-udevd Tainted: G W L 5.5.0-next-20200210+ #1
    Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

    ra.mmap_miss contributes to readahead decisions, so a data race could be
    undesirable. Both the read and the write are done only under the
    non-exclusive mmap_sem, and two concurrent writers could even underflow the
    counter. Fix the underflow by writing to a local variable before committing
    a final store to ra.mmap_miss, given that a small inaccuracy of the counter
    should be acceptable.
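
    A minimal sketch of that pattern, assuming the usual mm/filemap.c context
    (struct file_ra_state, READ_ONCE()/WRITE_ONCE()); the helper name is made
    up for illustration and this is not the exact hunk:

    /* Read ra->mmap_miss once, update a local copy, and commit a single
     * store, so two racing faults can no longer underflow the counter. */
    static void mmap_miss_dec_sketch(struct file_ra_state *ra)
    {
            unsigned int mmap_miss = READ_ONCE(ra->mmap_miss);

            if (mmap_miss)
                    WRITE_ONCE(ra->mmap_miss, mmap_miss - 1);
    }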

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Tested-by: Qian Cai
    Reviewed-by: Matthew Wilcox (Oracle)
    Cc: Marco Elver
    Link: http://lkml.kernel.org/r/20200211030134.1847-1-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The thp prefix is more frequently used than hpage and we should be
    consistent between the various functions.

    [akpm@linux-foundation.org: fix mm/migrate.c]

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Cc: Mike Kravetz
    Cc: David Hildenbrand
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

13 Aug, 2020

1 commit

  • Drop the repeated word "the".

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Zi Yan
    Link: http://lkml.kernel.org/r/20200801173822.14973-3-rdunlap@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

08 Aug, 2020

2 commits

  • FGP_{WRITE|NOFS|NOWAIT} were missed in pagecache_get_page's kerneldoc
    comment.

    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Cc: Gang Deng
    Cc: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/1593031747-4249-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • Since commit bbddabe2e436aa ("mm: filemap: only do access activations on
    reads"), mark_page_accessed() is called for reads only. But it is
    mark_page_accessed() that clears the idle flag, so the flag won't get
    cleared if the page is only ever write accessed.

    Basically, idle page tracking is used to estimate the workingset size of a
    workload; a noticeable part of the workingset might be missed if the idle
    flag is not maintained correctly.

    It seems good enough to just clear idle flag for write operations.
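
    A hedged sketch of that approach, assuming the clearing happens where write
    lookups go through pagecache_get_page() (the exact placement may differ);
    page_is_idle()/clear_page_idle() come from <linux/page_idle.h>:

    if (fgp_flags & FGP_WRITE) {
            /* Clear the idle flag so idle page tracking also sees
             * write-only accesses. */
            if (page_is_idle(page))
                    clear_page_idle(page);
    }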

    Fixes: bbddabe2e436 ("mm: filemap: only do access activations on reads")
    Reported-by: Gang Deng
    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/1593020612-13051-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Yang Shi
     

04 Aug, 2020

1 commit

  • Pull io_uring updates from Jens Axboe:
    "Lots of cleanups in here, hardening the code and/or making it easier
    to read and fixing bugs, but a core feature/change too adding support
    for real async buffered reads. With the latter in place, we just need
    buffered write async support and we're done relying on kthreads for
    the fast path. In detail:

    - Cleanup how memory accounting is done on ring setup/free (Bijan)

    - sq array offset calculation fixup (Dmitry)

    - Consistently handle blocking off O_DIRECT submission path (me)

    - Support proper async buffered reads, instead of relying on kthread
    offload for that. This uses the page waitqueue to drive retries
    from task_work, like we handle poll based retry. (me)

    - IO completion optimizations (me)

    - Fix race with accounting and ring fd install (me)

    - Support EPOLLEXCLUSIVE (Jiufei)

    - Get rid of the io_kiocb unionizing, made possible by shrinking
    other bits (Pavel)

    - Completion side cleanups (Pavel)

    - Cleanup REQ_F_ flags handling, and kill off many of them (Pavel)

    - Request environment grabbing cleanups (Pavel)

    - File and socket read/write cleanups (Pavel)

    - Improve kiocb_set_rw_flags() (Pavel)

    - Tons of fixes and cleanups (Pavel)

    - IORING_SQ_NEED_WAKEUP clear fix (Xiaoguang)"

    * tag 'for-5.9/io_uring-20200802' of git://git.kernel.dk/linux-block: (127 commits)
    io_uring: flip if handling after io_setup_async_rw
    fs: optimise kiocb_set_rw_flags()
    io_uring: don't touch 'ctx' after installing file descriptor
    io_uring: get rid of atomic FAA for cq_timeouts
    io_uring: consolidate *_check_overflow accounting
    io_uring: fix stalled deferred requests
    io_uring: fix racy overflow count reporting
    io_uring: deduplicate __io_complete_rw()
    io_uring: de-unionise io_kiocb
    io-wq: update hash bits
    io_uring: fix missing io_queue_linked_timeout()
    io_uring: mark ->work uninitialised after cleanup
    io_uring: deduplicate io_grab_files() calls
    io_uring: don't do opcode prep twice
    io_uring: clear IORING_SQ_NEED_WAKEUP after executing task works
    io_uring: batch put_task_struct()
    tasks: add put_task_struct_many()
    io_uring: return locked and pinned page accounting
    io_uring: don't miscount pinned memory
    io_uring: don't open-code recv kbuf managment
    ...

    Linus Torvalds
     

03 Aug, 2020

2 commits

  • That gives us ordering guarantees around the pair.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • It turns out that wait_on_page_bit_common() had several problems,
    ranging from unfair behavior due to re-queueing at the end of the
    wait queue when re-trying, to an outright bug that could result in
    missed wakeups (but probably never happened in practice).

    This rewrites the whole logic to avoid both issues, by simply moving the
    logic to check (and possibly take) the bit lock into the wakeup path
    instead.

    That makes everything much more straightforward, and means that we never
    need to re-queue the wait entry: if we get woken up, we'll be notified
    through WQ_FLAG_WOKEN, and the wait queue entry will have been removed,
    and everything will have been done for us.
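
    A rough sketch of the resulting wait loop (simplified; the real
    wait_on_page_bit_common() also handles signals and the lock/drop-lock
    behaviours), where wait is the on-stack wait queue entry:

    for (;;) {
            set_current_state(TASK_UNINTERRUPTIBLE);
            /* The waker sets WQ_FLAG_WOKEN after removing the entry from
             * the queue, so seeing the flag means we are already dequeued. */
            if (smp_load_acquire(&wait->flags) & WQ_FLAG_WOKEN)
                    break;
            io_schedule();
    }
    __set_current_state(TASK_RUNNING);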

    Link: https://lore.kernel.org/lkml/CAHk-=wjJA2Z3kUFb-5s=6+n0qbTs8ELqKFt9B3pH85a8fGD73w@mail.gmail.com/
    Link: https://lore.kernel.org/lkml/alpine.LSU.2.11.2007221359450.1017@eggly.anvils/
    Reported-by: Oleg Nesterov
    Reported-by: Hugh Dickins
    Cc: Michal Hocko
    Reviewed-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

08 Jul, 2020

1 commit

  • Add an IOCB_NOIO flag that indicates to generic_file_read_iter that it
    shouldn't trigger any filesystem I/O for the actual request or for
    readahead. This allows tentative reads out of the page cache, as some
    filesystems allow, taking the appropriate locks and retrying the reads
    only if the requested pages are not cached.
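
    A hedged usage sketch (an assumed caller pattern, not taken from a specific
    filesystem): try a cache-only read first, then take the filesystem locks
    and retry with I/O allowed only if data was missing:

    iocb->ki_flags |= IOCB_NOIO;
    ret = generic_file_read_iter(iocb, to);
    iocb->ki_flags &= ~IOCB_NOIO;
    if (ret >= 0 && !iov_iter_count(to))
            return ret;             /* fully satisfied from the page cache */
    /* not (fully) cached: take the appropriate locks here, then retry
     * with I/O and readahead allowed */
    ret = generic_file_read_iter(iocb, to);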

    Signed-off-by: Andreas Gruenbacher

    Andreas Gruenbacher
     

22 Jun, 2020

4 commits

  • Use the async page locking infrastructure, if IOCB_WAITQ is set in the
    passed in iocb. The caller must expect an -EIOCBQUEUED return value,
    which means that IO is started but not done yet. This is similar to how
    O_DIRECT signals the same operation. Once the callback is received by
    the caller for IO completion, the caller must retry the operation.
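
    A hedged sketch of the calling convention described (names as used in this
    series; setup and error handling trimmed):

    iocb->ki_flags |= IOCB_WAITQ;
    iocb->ki_waitq = &wait;         /* caller-provided struct wait_page_queue */
    ret = generic_file_buffered_read(iocb, iter, 0);
    if (ret == -EIOCBQUEUED) {
            /* IO was started but not completed: the page-unlock callback
             * fires later and the caller retries the read from task_work. */
    }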

    Acked-by: Johannes Weiner
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Normally waiting for a page to become unlocked, or locking the page,
    requires waiting for IO to complete. Add support for lock_page_async()
    and wait_on_page_locked_async(), which are callback based instead. This
    allows a caller to get notified when a page becomes unlocked, rather
    than wait for it.

    We add a new iocb field, ki_waitq, to pass in the necessary data for this
    to happen. We can unionize this with ki_cookie, since that is only used
    for polled IO. Polled IO can never co-exist with async callbacks, as it
    relies (by definition) on polled completions. struct wait_page_key is made
    public,
    and we define struct wait_page_async as the interface between the caller
    and the core.
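
    A minimal sketch of the locking side of this interface (illustrative, close
    to the intended shape): try the lock first, and fall back to queueing an
    async wait that signals the caller via -EIOCBQUEUED:

    static inline int lock_page_async(struct page *page,
                                      struct wait_page_queue *wait)
    {
            if (trylock_page(page))
                    return 0;
            /* queue the wait entry on the page's waitqueue and return
             * -EIOCBQUEUED so the caller knows a callback is pending */
            return __lock_page_async(page, wait);
    }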

    Acked-by: Johannes Weiner
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • No functional changes in this patch, just in preparation for allowing
    more callers.

    Acked-by: Johannes Weiner
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • The read-ahead shouldn't block, so allow it to be done even if
    IOCB_NOWAIT is set in the kiocb.

    Acked-by: Johannes Weiner
    Signed-off-by: Jens Axboe

    Jens Axboe
     

10 Jun, 2020

3 commits

  • Convert comments that reference mmap_sem to reference mmap_lock instead.

    [akpm@linux-foundation.org: fix up linux-next leftovers]
    [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
    [akpm@linux-foundation.org: more linux-next fixups, per Michel]

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Convert comments that reference old mmap_sem APIs to reference
    corresponding new mmap locking APIs instead.

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Davidlohr Bueso
    Reviewed-by: Daniel Jordan
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-12-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)
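
    For illustration, the kind of transformation the rule performs on a typical
    call site (hypothetical snippet):

    /* before */
    down_read(&mm->mmap_sem);
    /* ... walk or modify the address space ... */
    up_read(&mm->mmap_sem);

    /* after */
    mmap_read_lock(mm);
    /* ... walk or modify the address space ... */
    mmap_read_unlock(mm);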

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

05 Jun, 2020

1 commit


04 Jun, 2020

6 commits

  • Swapin faults were the last event to charge pages after they had already
    been put on the LRU list. Now that we charge directly on swapin, the
    lrucare portion of the charge code is unused.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Joonsoo Kim
    Cc: Alex Shi
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Balbir Singh
    Cc: Shakeel Butt
    Link: http://lkml.kernel.org/r/20200508183105.225460-19-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • With the page->mapping requirement gone from memcg, we can charge anon and
    file-thp pages in one single step, right after they're allocated.

    This removes two out of three API calls - especially the tricky commit
    step that needed to happen at just the right time between when the page is
    "set up" and when it's "published" - somewhat vague and fluid concepts
    that varied by page type. All we need is a freshly allocated page and a
    memcg context to charge.

    v2: prevent double charges on pre-allocated hugepages in khugepaged

    [hannes@cmpxchg.org: Fix crash - *hpage could be ERR_PTR instead of NULL]
    Link: http://lkml.kernel.org/r/20200512215813.GA487759@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Joonsoo Kim
    Cc: Alex Shi
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Balbir Singh
    Cc: Qian Cai
    Link: http://lkml.kernel.org/r/20200508183105.225460-13-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memcg maintains private MEMCG_CACHE and NR_SHMEM counters. This
    divergence from the generic VM accounting means unnecessary code overhead,
    and creates a dependency for memcg that page->mapping is set up at the
    time of charging, so that page types can be told apart.

    Convert the generic accounting sites to mod_lruvec_page_state and friends
    to maintain the per-cgroup vmstat counters of NR_FILE_PAGES and NR_SHMEM.
    The page is already locked in these places, so page->mem_cgroup is stable;
    we only need minimal tweaks of two mem_cgroup_migrate() calls to ensure
    it's set up in time.

    Then replace MEMCG_CACHE with NR_FILE_PAGES and delete the private
    NR_SHMEM accounting sites.
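
    A hedged sketch of what an accounting site looks like after the conversion
    (illustrative; the exact hunks differ per call site):

    /* page cache insertion: account against both the node and the memcg
     * through the shared lruvec counters */
    __mod_lruvec_page_state(page, NR_FILE_PAGES, nr);
    if (PageSwapBacked(page))
            __mod_lruvec_page_state(page, NR_SHMEM, nr);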

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Joonsoo Kim
    Cc: Alex Shi
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200508183105.225460-10-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The try/commit/cancel protocol that memcg uses dates back to when pages
    used to be uncharged upon removal from the page cache, and thus couldn't
    be committed before the insertion had succeeded. Nowadays, pages are
    uncharged when they are physically freed; it doesn't matter whether the
    insertion was successful or not. For the page cache, the transaction
    dance has become unnecessary.

    Introduce a mem_cgroup_charge() function that simply charges a newly
    allocated page to a cgroup and sets up page->mem_cgroup in one single
    step. If the insertion fails, the caller doesn't have to do anything but
    free/put the page.

    Then switch the page cache over to this new API.
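
    A rough sketch of the new call pattern for a freshly allocated page cache
    page (illustrative; assumes the mem_cgroup_charge(page, mm, gfp) signature
    introduced here, with error paths trimmed):

    page = __page_cache_alloc(gfp);
    if (!page)
            return -ENOMEM;
    if (mem_cgroup_charge(page, current->mm, gfp)) {
            put_page(page);         /* charge failed: just drop the page */
            return -ENOMEM;
    }
    /* add the page to the mapping; if that fails, put_page() is all the
     * cleanup needed - no cancel step */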

    Subsequent patches will also convert anon pages, but it needs a bit more
    prep work. Right now, memcg depends on page->mapping being already set up
    at the time of charging, so that it can maintain its own MEMCG_CACHE and
    MEMCG_RSS counters. For anon, page->mapping is set under the same pte
    lock under which the page is published, so a single charge point that can
    block doesn't work there just yet.

    The following prep patches will replace the private memcg counters with
    the generic vmstat counters, thus removing the page->mapping dependency,
    then complete the transition to the new single-point charge API and delete
    the old transactional scheme.

    v2: leave shmem swapcache when charging fails to avoid double IO (Joonsoo)
    v3: rebase on preceding shmem simplification patch

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Alex Shi
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200508183105.225460-6-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The memcg charging API carries a boolean @compound parameter that tells
    whether the page we're dealing with is a hugepage.
    mem_cgroup_commit_charge() has another boolean @lrucare that indicates
    whether the page needs LRU locking or not while charging. The majority of
    callsites know those parameters at compile time, which results in a lot of
    naked "false, false" argument lists. This makes for cryptic code and is a
    breeding ground for subtle mistakes.

    Thankfully, the huge page state can be inferred from the page itself and
    doesn't need to be passed along. This is safe because charging completes
    before the page is published, i.e. before anybody could split it.

    Simplify the callsites by removing @compound, and let memcg infer the
    state by using hpage_nr_pages() unconditionally. That function does
    PageTransHuge() to identify huge pages, which also helpfully asserts that
    nobody passes in tail pages by accident.
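
    For reference, the inference is roughly what hpage_nr_pages() did at the
    time:

    static inline int hpage_nr_pages(struct page *page)
    {
            /* PageTransHuge() contains a VM_BUG_ON for tail pages */
            if (unlikely(PageTransHuge(page)))
                    return HPAGE_PMD_NR;
            return 1;
    }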

    The following patches will introduce a new charging API, best not to carry
    over unnecessary weight.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Alex Shi
    Reviewed-by: Joonsoo Kim
    Reviewed-by: Shakeel Butt
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200508183105.225460-4-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Patch series "mm: memcontrol: charge swapin pages on instantiation", v2.

    This patch series reworks memcg to charge swapin pages directly at
    swapin time, rather than at fault time, which may be much later, or
    not happen at all.

    Changes in version 2:
    - prevent double charges on pre-allocated hugepages in khugepaged
    - leave shmem swapcache when charging fails to avoid double IO (Joonsoo)
    - fix temporary accounting bug by switching rmap <-> commit order (Joonsoo)
    - fix double swap charge bug in cgroup1/cgroup2 code gating
    - simplify swapin error checking (Joonsoo)
    - mm: memcontrol: document the new swap control behavior (Alex)
    - review tags

    The delayed swapin charging scheme we have right now causes problems:

    - Alex's per-cgroup lru_lock patches rely on pages that have been
    isolated from the LRU to have a stable page->mem_cgroup; otherwise
    the lock may change underneath him. Swapcache pages are charged only
    after they are added to the LRU, and charging doesn't follow the LRU
    isolation protocol.

    - Joonsoo's anon workingset patches need a suitable LRU at the time
    the page enters the swap cache and displaces the non-resident
    info. But the correct LRU is only available after charging.

    - It's a containment hole / DoS vector. Users can trigger arbitrarily
    large swap readahead using MADV_WILLNEED. The memory is never
    charged unless somebody actually touches it.

    - It complicates the page->mem_cgroup stabilization rules

    In order to charge pages directly at swapin time, the memcg code base
    needs to be prepared, and several overdue cleanups become a necessity:

    To charge pages at swapin time, we need to always have cgroup
    ownership tracking of swap records. We also cannot rely on
    page->mapping to tell apart page types at charge time, because that's
    only set up during a page fault.

    To eliminate the page->mapping dependency, memcg needs to ditch its
    private page type counters (MEMCG_CACHE, MEMCG_RSS, NR_SHMEM) in favor
    of the generic vmstat counters and accounting sites, such as
    NR_FILE_PAGES, NR_ANON_MAPPED etc.

    To switch to generic vmstat counters, the charge sequence must be
    adjusted such that page->mem_cgroup is set up by the time these
    counters are modified.

    The series is structured as follows:

    1. Bug fixes
    2. Decoupling charging from rmap
    3. Swap controller integration into memcg
    4. Direct swapin charging

    This patch (of 19):

    When replacing one page with another one in the cache, we have to decrease
    the file count of the old page's NUMA node and increase the one of the new
    NUMA node, otherwise the old node leaks the count and the new node
    eventually underflows its counter.
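
    A hedged sketch of the corrected accounting in the replacement path
    (illustrative; the actual hunk in replace_page_cache_page() may differ
    slightly):

    /* move the file-page count from the old page's node to the new one's */
    __dec_node_page_state(old, NR_FILE_PAGES);
    __inc_node_page_state(new, NR_FILE_PAGES);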

    Fixes: 74d609585d8b ("page cache: Add and replace pages using the XArray")
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Alex Shi
    Reviewed-by: Shakeel Butt
    Reviewed-by: Joonsoo Kim
    Reviewed-by: Balbir Singh
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Cc: "Kirill A. Shutemov"
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/20200508183105.225460-1-hannes@cmpxchg.org
    Link: http://lkml.kernel.org/r/20200508183105.225460-2-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

03 Jun, 2020

2 commits

  • Pull btrfs updates from David Sterba:
    "Highlights:

    - speedup dead root detection during orphan cleanup, eg. when there
    are many deleted subvolumes waiting to be cleaned, the trees are
    now looked up in a radix tree instead of with an O(N^2) search

    - snapshot creation with inherited qgroup will mark the qgroup
    inconsistent, requires a rescan

    - send will emit file capabilities after chown, this produces a
    stream that does not need postprocessing to set the capabilities
    again

    - direct io ported to iomap infrastructure, cleaned up and simplified
    code, notably removing last use of struct buffer_head in btrfs code

    Core changes:

    - factor out backreference iteration, to be used by ordinary
    backreferences and relocation code

    - improved global block reserve utilization
    * better logic to serialize requests
    * increased maximum available for unlink
    * improved handling on large pages (64K)

    - direct io cleanups and fixes
    * simplify layering, where cloned bios were unnecessarily created
    for some cases
    * error handling fixes (submit, endio)
    * remove repair worker thread, used to avoid deadlocks during
    repair

    - refactored block group reading code, preparatory work for new type
    of block group storage that should improve mount time on large
    filesystems

    Cleanups:

    - cleaned up (and slightly sped up) set/get helpers for metadata data
    structure members

    - root bit REF_COWS got renamed to SHAREABLE to reflect that the
    blocks of the tree get shared either among subvolumes or with the
    relocation trees

    Fixes:

    - when subvolume deletion fails due to ENOSPC, the filesystem is not
    turned read-only

    - device scan deals with devices from other filesystems that changed
    ownership due to overwrite (mkfs)

    - fix a race between scrub and block group removal/allocation

    - fix long standing bug of a runaway balance operation, printing the
    same line to the syslog, caused by a stale status bit on a reloc
    tree that prevented progress

    - fix corrupt log due to concurrent fsync of inodes with shared
    extents

    - fix space underflow for NODATACOW and buffered writes when it for
    some reason needs to fall back to COW mode"

    * tag 'for-5.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (133 commits)
    btrfs: fix space_info bytes_may_use underflow during space cache writeout
    btrfs: fix space_info bytes_may_use underflow after nocow buffered write
    btrfs: fix wrong file range cleanup after an error filling dealloc range
    btrfs: remove redundant local variable in read_block_for_search
    btrfs: open code key_search
    btrfs: split btrfs_direct_IO to read and write part
    btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK
    fs: remove dio_end_io()
    btrfs: switch to iomap_dio_rw() for dio
    iomap: remove lockdep_assert_held()
    iomap: add a filesystem hook for direct I/O bio submission
    fs: export generic_file_buffered_read()
    btrfs: turn space cache writeout failure messages into debug messages
    btrfs: include error on messages about failure to write space/inode caches
    btrfs: remove useless 'fail_unlock' label from btrfs_csum_file_blocks()
    btrfs: do not ignore error from btrfs_next_leaf() when inserting checksums
    btrfs: make checksum item extension more efficient
    btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents
    btrfs: unexport btrfs_compress_set_level()
    btrfs: simplify iget helpers
    ...

    Linus Torvalds
     
  • We no longer return 0 here and the comment doesn't tell us anything that
    we don't already know (SIGBUS is a pretty good indicator that things
    didn't work out).

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: William Kucharski
    Link: http://lkml.kernel.org/r/20200529123243.20640-1-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

25 May, 2020

1 commit


08 Apr, 2020

1 commit

  • Yang Shi writes:

    Currently, when truncating a shmem file, if the range is partly in a THP
    (start or end is in the middle of THP), the pages actually will just get
    cleared rather than being freed, unless the range covers the whole THP.
    Even though all the subpages are truncated (randomly or sequentially), the
    THP may still be kept in page cache.

    This might be fine for some usecases which prefer preserving THP, but
    balloon inflation is handled in base page size. So when using shmem THP
    as memory backend, QEMU inflation actually doesn't work as expected since
    it doesn't free memory. But the inflation usecase really needs to get the
    memory freed. (Anonymous THP will also not get freed right away, but will
    be freed eventually when all subpages are unmapped: whereas shmem THP
    still stays in page cache.)

    Split THP right away when doing partial hole punch, and if split fails
    just clear the page so that read of the punched area will return zeroes.

    Hugh Dickins adds:

    Our earlier "team of pages" huge tmpfs implementation worked in the way
    that Yang Shi proposes; and we have been using this patch to continue to
    split the huge page when hole-punched or truncated, since converting over
    to the compound page implementation. Although huge tmpfs gives out huge
    pages when available, if the user specifically asks to truncate or punch a
    hole (perhaps to free memory, perhaps to reduce the memcg charge), then
    the filesystem should do so as best it can, splitting the huge page.

    That is not always possible: any additional reference to the huge page
    prevents split_huge_page() from succeeding, so the result can be flaky.
    But in practice it works successfully enough that we've not seen any
    problem from that.

    Add shmem_punch_compound() to encapsulate the decision of when a split is
    needed, and doing the split if so. Using this simplifies the flow in
    shmem_undo_range(); and the first (trylock) pass does not need to do any
    page clearing on failure, because the second pass will either succeed or
    do that clearing. Following the example of zero_user_segment() when
    clearing a partial page, add flush_dcache_page() and set_page_dirty() when
    clearing a hole - though I'm not certain that either is needed.
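
    A very rough sketch of the helper's intent as described (simplified, not
    the exact code): no split is needed when the page is not compound or lies
    wholly within the punched range; otherwise try to split it and report
    whether that worked:

    static bool shmem_punch_compound(struct page *page, pgoff_t start, pgoff_t end)
    {
            if (!PageTransCompound(page))
                    return true;
            if (PageHead(page) && page->index >= start &&
                page->index + HPAGE_PMD_NR <= end)
                    return true;    /* whole huge page lies inside the hole */
            return split_huge_page(page) == 0;
    }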

    But: split_huge_page() would be sure to fail if shmem_undo_range()'s
    pagevec holds further references to the huge page. The easiest way to fix
    that is for find_get_entries() to return early, as soon as it has put one
    compound head or tail into the pagevec. At first this felt like a hack;
    but on examination, this convention better suits all its callers - or will
    do, if the slight one-page-per-pagevec slowdown in shmem_unlock_mapping()
    and shmem_seek_hole_data() is transformed into a 512-page-per-pagevec
    speedup by checking for compound pages there.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Cc: Yang Shi
    Cc: Alexander Duyck
    Cc: "Michael S. Tsirkin"
    Cc: David Hildenbrand
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Andrea Arcangeli
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2002261959020.10801@eggly.anvils
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

03 Apr, 2020

7 commits

  • The idea comes from a discussion between Linus and Andrea [1].

    Before this patch we only allow a page fault to retry once. We achieved
    this by clearing the FAULT_FLAG_ALLOW_RETRY flag when doing
    handle_mm_fault() the second time. This was mainly used to avoid
    unexpected starvation of the system by looping over forever to handle the
    page fault on a single page. However that should hardly happen, and after
    all for each code path to return a VM_FAULT_RETRY we'll first wait for a
    condition (during which time we should possibly yield the cpu) to happen
    before VM_FAULT_RETRY is really returned.

    This patch removes the restriction by keeping the FAULT_FLAG_ALLOW_RETRY
    flag when we receive VM_FAULT_RETRY. It means that the page fault handler
    now can retry the page fault multiple times if necessary without the
    need to generate another page fault event. Meanwhile we still keep the
    FAULT_FLAG_TRIED flag so page fault handler can still identify whether a
    page fault is the first attempt or not.

    Then we'll have these combinations of fault flags (only considering
    ALLOW_RETRY flag and TRIED flag):

    - ALLOW_RETRY and !TRIED: the page fault is allowed to retry, and
    this is the first try

    - ALLOW_RETRY and TRIED: the page fault is allowed to retry, and
    this is not the first try

    - !ALLOW_RETRY and !TRIED: the page fault is not allowed to retry
    at all

    - !ALLOW_RETRY and TRIED: this is forbidden and should never be used

    In existing code we have multiple places that has taken special care of
    the first condition above by checking against (fault_flags &
    FAULT_FLAG_ALLOW_RETRY). This patch introduces a simple helper to detect
    the first retry of a page fault by checking against both (fault_flags &
    FAULT_FLAG_ALLOW_RETRY) and !(fault_flags & FAULT_FLAG_TRIED) because now
    even the 2nd try will have the ALLOW_RETRY set, then use that helper in
    all existing special paths. One example is in __lock_page_or_retry(), now
    we'll drop the mmap_sem only in the first attempt of page fault and we'll
    keep it in follow up retries, so old locking behavior will be retained.
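
    The helper is essentially a two-flag test; a sketch matching the
    description above (treat the exact name and location as assumptions):

    static inline bool fault_flag_allow_retry_first(unsigned int fault_flags)
    {
            return (fault_flags & FAULT_FLAG_ALLOW_RETRY) &&
                   !(fault_flags & FAULT_FLAG_TRIED);
    }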

    This will be a nice enhancement for current code [2] at the same time a
    supporting material for the future userfaultfd-writeprotect work, since in
    that work there will always be an explicit userfault writeprotect retry
    for protected pages, and if that cannot resolve the page fault (e.g., when
    userfaultfd-writeprotect is used in conjunction with swapped pages) then
    we'll possibly need a 3rd retry of the page fault. It might also benefit
    other potential users who will have similar requirement like userfault
    write-protection.

    GUP code is not touched yet and will be covered in follow up patch.

    Please read the thread below for more information.

    [1] https://lore.kernel.org/lkml/20171102193644.GB22686@redhat.com/
    [2] https://lore.kernel.org/lkml/20181230154648.GB9832@redhat.com/

    Suggested-by: Linus Torvalds
    Suggested-by: Andrea Arcangeli
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Tested-by: Brian Geffon
    Cc: Bobby Powers
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Jerome Glisse
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Matthew Wilcox
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Pavel Emelyanov
    Link: http://lkml.kernel.org/r/20200220160246.9790-1-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • - These were never called PCG flags; they've been called FGP flags since
    their introduction in 2014.
    - The FGP_FOR_MMAP flag was misleadingly documented as if it was an
    alternative to FGP_CREAT instead of an option to it.
    - Rename the 'offset' parameter to 'index'.
    - Capitalisation, formatting, rewording.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: Christoph Hellwig
    Acked-by: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Cc: Pankaj Gupta
    Link: http://lkml.kernel.org/r/20200318140253.6141-9-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • No in-tree users (proc, madvise, memcg, mincore) can be built as a module.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: Christoph Hellwig
    Acked-by: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Cc: Pankaj Gupta
    Link: http://lkml.kernel.org/r/20200318140253.6141-8-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Use VM_FAULT_OOM instead of indirecting through vmf_error(-ENOMEM).

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: Christoph Hellwig
    Acked-by: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Cc: Pankaj Gupta
    Link: http://lkml.kernel.org/r/20200318140253.6141-2-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • The first argument of shrink_readahead_size_eio() is not used. Hence
    remove it from the function definition and from all the callers.

    Signed-off-by: Souptick Joarder
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/1583868093-24342-1-git-send-email-jrdr.linux@gmail.com
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     
  • Mount failure issue happens under this scenario: an application forked
    dozens of threads to mount the same number of cramfs images separately in
    docker, but several mounts failed with high probability. Mount failed
    because the page (read from the superblock of the loop dev) was not
    uptodate after wait_on_page_locked(page) returned in function cramfs_read:

    wait_on_page_locked(page);
    if (!PageUptodate(page)) {
    ...
    }

    The reason the page is not uptodate: systemd-udevd read the loopX dev
    before mount. Because the status of loopX is Lo_unbound at this time,
    loop_make_request directly triggers the io_end handler
    end_buffer_async_read, which calls SetPageError(page). So the page can't
    be set to uptodate in function end_buffer_async_read:

    if(page_uptodate && !PageError(page)) {
    SetPageUptodate(page);
    }

    Then the mount operation is performed; it uses the same page that was just
    accessed by systemd-udevd above. Because this page is not uptodate, it
    launches an actual read via submit_bh and then waits on this page by
    calling wait_on_page_locked(page). When the I/O of the page is done, the
    io_end handler end_buffer_async_read is called; because no one cleared the
    page error (set by the earlier systemd-udevd read) during the whole read
    path of mount, this page is still in "PageError" status and can't be set
    to uptodate in end_buffer_async_read, which causes the mount failure.

    But sometimes mount succeeds even though systemd-udevd read the loopX dev
    just before. The reason is that systemd-udevd launched another loopX read
    just between steps 3.1 and 3.2; the steps are as below:

    1, loopX dev default status is Lo_unbound;
    2, systemd-udevd read loopX dev (page is set to PageError);
    3, mount operation
      1) set loopX status to Lo_bound;
      ==> systemd-udevd read loopX dev <==
      2) read loopX dev (page has no error)
      3) mount succeed

    Because loopX is Lo_bound after step 3.1, the interleaved systemd-udevd
    read goes through the regular buffered read path, which calls
    ClearPageError(page) before mapping->a_ops->readpage(filp, page); here,
    mapping->a_ops->readpage() is blkdev_readpage. In the latest kernel some
    function names changed; the call trace is as below:

    blkdev_read_iter
    generic_file_read_iter
    generic_file_buffered_read:
    /*
     * A previous I/O error may have been due to temporary
     * failures, eg. multipath errors.
     * PG_error will be set again if readpage fails.
     */
    ClearPageError(page);
    /* Start the actual read. The read will unlock the page. */
    error = mapping->a_ops->readpage(filp, page);

    We can see ClearPageError(page) is called before the actual read,
    then the read in step 3.2 succeeds.

    This patch adds a call to ClearPageError just before the actual read in
    the read path of the cramfs mount. Without the patch, the call trace is as
    below when performing cramfs mount:

    do_mount
    cramfs_read
    cramfs_blkdev_read
    read_cache_page
    do_read_cache_page:
    filler(data, page);
    or
    mapping->a_ops->readpage(data, page);

    With the patch, the call trace is as below when performing mount:

    do_mount
    cramfs_read
    cramfs_blkdev_read
    read_cache_page:
    do_read_cache_page:
    ClearPageError(page); a_ops->readpage(data, page);

    With the patch, the mount operation triggers ClearPageError(page) before
    the actual read, so the page has no error as long as no additional page
    error happens when the I/O is done.

    Signed-off-by: Xianting Tian
    Signed-off-by: Andrew Morton
    Reviewed-by: Matthew Wilcox (Oracle)
    Cc: Jan Kara
    Cc:
    Link: http://lkml.kernel.org/r/1583318844-22971-1-git-send-email-xianting_tian@126.com
    Signed-off-by: Linus Torvalds

    Xianting Tian
     
  • When handling a page fault, we drop mmap_sem to start async readahead so
    that we don't block on IO submission with mmap_sem held. However, there is
    no point in dropping mmap_sem when readahead is disabled. Handle that case
    to avoid pointlessly dropping mmap_sem and retrying the fault. This was
    actually reported to block mlockall(MCL_CURRENT) indefinitely.
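
    A hedged sketch of the check, assuming it sits near the top of the fault
    readahead helpers (fpin is the pinned struct file used by the
    drop-mmap_sem machinery):

    struct file_ra_state *ra = &vmf->vma->vm_file->f_ra;

    /* Readahead disabled: there is no IO to wait for, so do not drop
     * mmap_sem and do not force a VM_FAULT_RETRY round trip. */
    if (!ra->ra_pages)
            return fpin;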

    Fixes: 6b4c9f446981 ("filemap: drop the mmap_sem for all blocking operations")
    Reported-by: Minchan Kim
    Reported-by: Robert Stupp
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Reviewed-by: Josef Bacik
    Reviewed-by: Minchan Kim
    Link: http://lkml.kernel.org/r/20200212101356.30759-1-jack@suse.cz
    Signed-off-by: Linus Torvalds

    Jan Kara
     

01 Feb, 2020

1 commit

  • At some point filemap_write_and_wait() and
    filemap_write_and_wait_range() got the exact same implementation, with
    the exception of the range being specified in *_range().

    Similar to other functions in fs.h which call *_range(..., 0,
    LLONG_MAX), change filemap_write_and_wait() to be a static inline which
    calls filemap_write_and_wait_range().
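
    The result presumably looks like the other fs.h wrappers, roughly:

    static inline int filemap_write_and_wait(struct address_space *mapping)
    {
            return filemap_write_and_wait_range(mapping, 0, LLONG_MAX);
    }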

    Link: http://lkml.kernel.org/r/20191129160713.30892-1-ira.weiny@intel.com
    Signed-off-by: Ira Weiny
    Reviewed-by: Nikolay Borisov
    Reviewed-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ira Weiny
     

01 Dec, 2019

4 commits

  • One of our services is observing hanging ps/top/etc under heavy write
    IO, and the task states show this is an mmap_sem priority inversion:

    A write fault is holding the mmap_sem in read-mode and waiting for
    (heavily cgroup-limited) IO in balance_dirty_pages():

    balance_dirty_pages+0x724/0x905
    balance_dirty_pages_ratelimited+0x254/0x390
    fault_dirty_shared_page.isra.96+0x4a/0x90
    do_wp_page+0x33e/0x400
    __handle_mm_fault+0x6f0/0xfa0
    handle_mm_fault+0xe4/0x200
    __do_page_fault+0x22b/0x4a0
    page_fault+0x45/0x50

    Somebody tries to change the address space, contending for the mmap_sem in
    write-mode:

    call_rwsem_down_write_failed_killable+0x13/0x20
    do_mprotect_pkey+0xa8/0x330
    SyS_mprotect+0xf/0x20
    do_syscall_64+0x5b/0x100
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    The waiting writer locks out all subsequent readers to avoid lock
    starvation, and several threads can be seen hanging like this:

    call_rwsem_down_read_failed+0x14/0x30
    proc_pid_cmdline_read+0xa0/0x480
    __vfs_read+0x23/0x140
    vfs_read+0x87/0x130
    SyS_read+0x42/0x90
    do_syscall_64+0x5b/0x100
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    To fix this, do what we do for cache read faults already: drop the
    mmap_sem before calling into anything IO bound, in this case the
    balance_dirty_pages() function, and return VM_FAULT_RETRY.
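
    A sketch of the shape of the fix in the shared write-fault path (pieced
    together from the description; not necessarily the exact hunk):

    if ((dirtied || page_mkwrite) && mapping) {
            struct file *fpin;

            /* Drop mmap_sem (when allowed) before the IO-bound throttling,
             * then ask the caller to retry since the lock was released. */
            fpin = maybe_unlock_mmap_for_io(vmf, NULL);
            balance_dirty_pages_ratelimited(mapping);
            if (fpin) {
                    fput(fpin);
                    return VM_FAULT_RETRY;
            }
    }
    return 0;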

    Link: http://lkml.kernel.org/r/20190924194238.GA29030@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Matthew Wilcox (Oracle)
    Acked-by: Kirill A. Shutemov
    Cc: Josef Bacik
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • generic_file_direct_write() tries to invalidate the pagecache after an
    O_DIRECT write. Unlike the similar code in dio_complete(), it silently
    ignores the error returned from invalidate_inode_pages2_range().

    According to the comment, this code is here because not all filesystems
    call dio_complete() to do proper invalidation after an O_DIRECT write. A
    noticeable example is blkdev_direct_IO().

    This patch calls dio_warn_stale_pagecache() if invalidation fails.
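
    A hedged sketch of the change at the end of generic_file_direct_write()
    (illustrative):

    if (written > 0 && mapping->nrpages &&
        invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT,
                                      (pos + written - 1) >> PAGE_SHIFT))
            dio_warn_stale_pagecache(file);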

    Link: http://lkml.kernel.org/r/157270038294.4812.2238891109785106069.stgit@buzz
    Signed-off-by: Konstantin Khlebnikov
    Reviewed-by: Andrew Morton
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • This helper prints a warning if a direct I/O write failed to invalidate the
    cache, and sets EIO on the inode to warn userspace about possible data
    corruption.

    See also commit 5a9d929d6e13 ("iomap: report collisions between directio
    and buffered writes to userspace").

    Direct I/O is supported by non-disk filesystems, for example NFS. Thus
    generic code needs this even in kernel without CONFIG_BLOCK.

    Link: http://lkml.kernel.org/r/157270038074.4812.7980855544557488880.stgit@buzz
    Signed-off-by: Konstantin Khlebnikov
    Reviewed-by: Andrew Morton
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • generic_file_direct_write() invalidates the cache at entry. The second
    invalidation should be done when the request completes, but this function
    performs it at exit unconditionally, even for async requests.

    This patch skips second invalidation for async requests (-EIOCBQUEUED).

    Link: http://lkml.kernel.org/r/157270037850.4812.15036239021726025572.stgit@buzz
    Signed-off-by: Konstantin Khlebnikov
    Reviewed-by: Andrew Morton
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

19 Oct, 2019

1 commit

  • generic_file_vm_ops is declared in a header that mm/filemap.c does not
    include; include that header to fix the following warning:

    mm/filemap.c:2717:35: warning: symbol 'generic_file_vm_ops' was not declared. Should it be static?

    Link: http://lkml.kernel.org/r/20191008102311.25432-1-ben.dooks@codethink.co.uk
    Signed-off-by: Ben Dooks
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Dooks