12 Jul, 2022

1 commit

  • Release the refcount after xas_set() to fix a use-after-free (UAF) that may cause a panic like this:

    page:ffffea000491fa40 refcount:1 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x1247e9
    head:ffffea000491fa00 order:3 compound_mapcount:0 compound_pincount:0
    memcg:ffff888104f91091
    flags: 0x2fffff80010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
    ...
    page dumped because: VM_BUG_ON_PAGE(PageTail(page))
    ------------[ cut here ]------------
    kernel BUG at include/linux/page-flags.h:632!
    invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
    CPU: 1 PID: 7642 Comm: sh Not tainted 5.15.51-dirty #26
    ...
    Call Trace:

    __invalidate_mapping_pages+0xe7/0x540
    drop_pagecache_sb+0x159/0x320
    iterate_supers+0x120/0x240
    drop_caches_sysctl_handler+0xaa/0xe0
    proc_sys_call_handler+0x2b4/0x480
    new_sync_write+0x3d6/0x5c0
    vfs_write+0x446/0x7a0
    ksys_write+0x105/0x210
    do_syscall_64+0x35/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xae
    RIP: 0033:0x7f52b5733130
    ...

    This problem has been fixed on mainline by patch 6b24ca4a1a8d ("mm: Use
    multi-index entries in the page cache") since it deletes the related code.
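
    A hedged sketch of the ordering the fix restores (illustrative, not the
    exact stable diff): find_lock_entries() still needs page->index to skip
    past a THP, so the reference may only be dropped after xas_set() has
    consumed it.

        /* keep the reference while page->index is still needed */
        if (!xa_is_value(page) && PageTransHuge(page))
                xas_set(&xas, page->index + thp_nr_pages(page));
        put_page(page);         /* drop the refcount only after xas_set() */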

    Fixes: 5c211ba29deb ("mm: add and use find_lock_entries")
    Signed-off-by: Liu Shixin
    Acked-by: Matthew Wilcox (Oracle)
    Signed-off-by: Greg Kroah-Hartman

    Liu Shixin
     

01 May, 2022

2 commits

  • commit a6294593e8a1290091d0b078d5d33da5e0cd3dfe upstream

    Turn iov_iter_fault_in_readable into a function that returns the number
    of bytes not faulted in, similar to copy_to_user, instead of returning a
    non-zero value when any of the requested pages couldn't be faulted in.
    This supports the existing users that require all pages to be faulted in
    as well as new users that are happy if any pages can be faulted in.

    Rename iov_iter_fault_in_readable to fault_in_iov_iter_readable to make
    sure this change doesn't silently break things.
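
    As a hedged illustration of the new convention, an "all pages required"
    caller such as a buffered-write loop can still bail out when nothing at
    all could be faulted in:

        /* returns the number of bytes NOT faulted in; 0 means full success */
        if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
                status = -EFAULT;       /* not even one byte is accessible */
                break;
        }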

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Anand Jain
    Signed-off-by: Greg Kroah-Hartman

    Andreas Gruenbacher
     
  • commit bb523b406c849eef8f265a07cd7f320f1f177743 upstream

    Turn fault_in_pages_{readable,writeable} into versions that return the
    number of bytes not faulted in, similar to copy_to_user, instead of
    returning a non-zero value when any of the requested pages couldn't be
    faulted in. This supports the existing users that require all pages to
    be faulted in as well as new users that are happy if any pages can be
    faulted in.

    Rename the functions to fault_in_{readable,writeable} to make sure
    this change doesn't silently break things.

    Neither of these functions is entirely trivial and it doesn't seem
    useful to inline them, so move them to mm/gup.c.
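
    A hedged sketch of how a "partial progress is fine" user can consume the
    new return value (buf is an assumed char __user pointer, len its length):

        size_t left = fault_in_writeable(buf, len);   /* bytes NOT faulted in */

        if (left == len)
                return -EFAULT;         /* nothing could be faulted in */
        len -= left;                    /* operate on the now-accessible prefix */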

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Anand Jain
    Signed-off-by: Greg Kroah-Hartman

    Andreas Gruenbacher
     

02 Mar, 2022

1 commit

  • When a THP is present in the page cache, we can return it several times,
    leading to userspace seeing the same data repeatedly if doing a read()
    that crosses a 64-page boundary. This is probably not a security issue
    (since the data all comes from the same file), but it can be interpreted
    as a transient data corruption issue. Fortunately, it is very rare as
    it can only occur when CONFIG_READ_ONLY_THP_FOR_FS is enabled, and it can
    only happen to executables. We don't often call read() on executables.

    This bug is fixed differently in v5.17 by commit 6b24ca4a1a8d
    ("mm: Use multi-index entries in the page cache"). That commit is
    unsuitable for backporting, so fix this in the clearest way. It
    sacrifices a little performance for clarity, but this should never
    be a performance path in these kernel versions.

    Fixes: cbd59c48ae2b ("mm/filemap: use head pages in generic_file_buffered_read")
    Cc: stable@vger.kernel.org # v5.15, v5.16
    Link: https://lore.kernel.org/r/df3b5d1c-a36b-2c73-3e27-99e74983de3a@suse.cz/
    Analyzed-by: Adam Majer
    Analyzed-by: Dirk Mueller
    Bisected-by: Takashi Iwai
    Reported-by: Vlastimil Babka
    Tested-by: Vlastimil Babka
    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Greg Kroah-Hartman

    Matthew Wilcox (Oracle)
     

19 Nov, 2021

1 commit

  • commit d417b49fff3e2f21043c834841e8623a6098741d upstream.

    It is not safe to check page->index without holding the page lock. It
    can be changed if the page is moved between the swap cache and the page
    cache for a shmem file, for example. There is a VM_BUG_ON below which
    checks page->index is correct after taking the page lock.
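
    A hedged sketch of the safe pattern the fix moves to, where page->index
    is only trusted once the page lock pins the page's identity:

        if (!trylock_page(page))
                continue;                       /* can't trust page->index yet */
        if (page->mapping != mapping || page->index != index) {
                unlock_page(page);              /* moved, e.g. shmem <-> swap cache */
                continue;
        }
        /* page->index is now stable; safe to operate on the page */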

    Link: https://lkml.kernel.org/r/20210818144932.940640-1-willy@infradead.org
    Fixes: 5c211ba29deb ("mm: add and use find_lock_entries")
    Signed-off-by: Matthew Wilcox (Oracle)
    Reported-by:
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Matthew Wilcox (Oracle)
     

04 Sep, 2021

2 commits

  • Merge misc updates from Andrew Morton:
    "173 patches.

    Subsystems affected by this series: ia64, ocfs2, block, and mm (debug,
    pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
    bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure,
    hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock,
    oom-kill, migration, ksm, percpu, vmstat, and madvise)"

    * emailed patches from Andrew Morton : (173 commits)
    mm/madvise: add MADV_WILLNEED to process_madvise()
    mm/vmstat: remove unneeded return value
    mm/vmstat: simplify the array size calculation
    mm/vmstat: correct some wrong comments
    mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()
    selftests: vm: add COW time test for KSM pages
    selftests: vm: add KSM merging time test
    mm: KSM: fix data type
    selftests: vm: add KSM merging across nodes test
    selftests: vm: add KSM zero page merging test
    selftests: vm: add KSM unmerge test
    selftests: vm: add KSM merge test
    mm/migrate: correct kernel-doc notation
    mm: wire up syscall process_mrelease
    mm: introduce process_mrelease system call
    memblock: make memblock_find_in_range method private
    mm/mempolicy.c: use in_task() in mempolicy_slab_node()
    mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
    mm/mempolicy: advertise new MPOL_PREFERRED_MANY
    mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
    ...

    Linus Torvalds
     
  • The page cache deletion paths all have interrupts enabled, so no need to
    use irqsafe/irqrestore locking variants.

    They used to have irqs disabled by the memcg lock added in commit
    c4843a7593a9 ("memcg: add per cgroup dirty page accounting"), but that has
    since been replaced by memcg taking the page lock instead, commit
    0a31bc97c80c ("mm: memcontrol: rewrite uncharge API").
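
    A hedged before/after sketch of the change (XArray lock variants; the
    exact call sites are the page cache deletion paths):

        /* before: saves and restores an irq state it never needed to */
        xa_lock_irqsave(&mapping->i_pages, flags);
        __delete_from_page_cache(page, shadow);
        xa_unlock_irqrestore(&mapping->i_pages, flags);

        /* after: these paths always run with irqs enabled, so _irq suffices */
        xa_lock_irq(&mapping->i_pages);
        __delete_from_page_cache(page, shadow);
        xa_unlock_irq(&mapping->i_pages);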

    Link: https://lkml.kernel.org/r/20210614211904.14420-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

01 Sep, 2021

1 commit

  • Pull btrfs updates from David Sterba:
    "The highlights of this round are integrations with fs-verity and
    idmapped mounts, the rest is usual mix of minor improvements, speedups
    and cleanups.

    There are some patches outside of btrfs, namely updating some VFS
    interfaces, all straightforward and acked.

    Features:

    - fs-verity support, using standard ioctls, backward compatible with
    read-only limitation on inodes with previously enabled fs-verity

    - idmapped mount support

    - make mount with rescue=ibadroots more tolerant to partially damaged
    trees

    - allow raid0 on a single device and raid10 on two devices,
    degenerate cases but might be useful as an intermediate step during
    conversion to other profiles

    - zoned mode block group auto reclaim can be disabled via sysfs knob

    Performance improvements:

    - continue readahead of node siblings even if target node is in
    memory, could speed up full send (on sample test +11%)

    - batching of delayed items can speed up creating many files

    - fsync/tree-log speedups
        - avoid unnecessary work (gains +2% throughput, -2% run time on
          sample load)
        - reduced lock contention on renames (on dbench +4% throughput,
          up to -30% latency)

    Fixes:

    - various zoned mode fixes

    - preemptive flushing threshold tuning, avoid excessive work on
    almost full filesystems

    Core:

    - continued subpage support, preparation for implementing remaining
    features like compression and defragmentation; with some
    limitations, write is now enabled on 64K page systems with 4K
    sectors, still considered experimental
        - no readahead on compressed reads
        - inline extents disabled
        - disabled raid56 profile conversion and mount

    - improved flushing logic, fixing early ENOSPC on some workloads

    - inode flags have been internally split to read-only and read-write
    incompat bit parts, used by fs-verity

    - new tree items for fs-verity
        - descriptor item
        - Merkle tree item

    - inode operations extended to be namespace-aware

    - cleanups and refactoring

    Generic code changes:

    - fs: new export filemap_fdatawrite_wbc

    - fs: removed sync_inode

    - block: bio_trim argument type fixups

    - vfs: add namespace-aware lookup"

    * tag 'for-5.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (114 commits)
    btrfs: reset replace target device to allocation state on close
    btrfs: zoned: fix ordered extent boundary calculation
    btrfs: do not do preemptive flushing if the majority is global rsv
    btrfs: reduce the preemptive flushing threshold to 90%
    btrfs: tree-log: check btrfs_lookup_data_extent return value
    btrfs: avoid unnecessarily logging directories that had no changes
    btrfs: allow idmapped mount
    btrfs: handle ACLs on idmapped mounts
    btrfs: allow idmapped INO_LOOKUP_USER ioctl
    btrfs: allow idmapped SUBVOL_SETFLAGS ioctl
    btrfs: allow idmapped SET_RECEIVED_SUBVOL ioctls
    btrfs: relax restrictions for SNAP_DESTROY_V2 with subvolids
    btrfs: allow idmapped SNAP_DESTROY ioctls
    btrfs: allow idmapped SNAP_CREATE/SUBVOL_CREATE ioctls
    btrfs: check whether fsgid/fsuid are mapped during subvolume creation
    btrfs: allow idmapped permission inode op
    btrfs: allow idmapped setattr inode op
    btrfs: allow idmapped tmpfile inode op
    btrfs: allow idmapped symlink inode op
    btrfs: allow idmapped mkdir inode op
    ...

    Linus Torvalds
     

23 Aug, 2021

1 commit

  • Btrfs sometimes needs to flush dirty pages on a bunch of dirty inodes in
    order to reclaim metadata reservations. Unfortunately most helpers in
    this area are too smart for us:

    1) The normal filemap_fdata* helpers only take range and sync modes, and
    don't give any indication of how much was written, so we can only
    flush full inodes, which isn't what we want in most cases.
    2) The normal writeback path requires us to have the s_umount sem held,
    but we can't unconditionally take it in this path because we could
    deadlock.
    3) The normal writeback path also skips inodes with I_SYNC set if we
    write with WB_SYNC_NONE. This isn't the behavior we want under heavy
    ENOSPC pressure, we want to actually make sure the pages are under
    writeback before returning, and if another thread is in the middle of
    writing the file we may return before they're under writeback and
    miss our ordered extents and not properly wait for completion.
    4) sync_inode() uses the normal writeback path and has the same problem
    as #3.

    What we really want is to call do_writepages() with our wbc. This way
    we can make sure that writeback is actually started on the pages, and we
    can control how many pages are written as a whole as we write many
    inodes using the same wbc. Accomplish this with a new helper that does
    just that so we can use it for our ENOSPC flushing infrastructure.
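
    A hedged sketch of the helper's shape (the name filemap_fdatawrite_wbc
    matches the export listed in the btrfs pull above; details simplified):

        int filemap_fdatawrite_wbc(struct address_space *mapping,
                                   struct writeback_control *wbc)
        {
                int ret;

                if (!mapping_can_writeback(mapping) ||
                    !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
                        return 0;

                wbc_attach_fdatawrite_inode(wbc, mapping->host);
                ret = do_writepages(mapping, wbc);  /* honours wbc->nr_to_write */
                wbc_detach_inode(wbc);
                return ret;
        }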

    Reviewed-by: Nikolay Borisov
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Josef Bacik
     

13 Jul, 2021

3 commits

  • Some operations such as reflinking blocks among files will need to lock
    invalidate_lock for two mappings. Add helper functions to do that.
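
    A hedged sketch of such a helper, using the usual order-by-address trick
    to avoid ABBA deadlocks (name and details illustrative):

        void filemap_invalidate_lock_two(struct address_space *mapping1,
                                         struct address_space *mapping2)
        {
                if (mapping1 > mapping2)
                        swap(mapping1, mapping2);       /* stable lock order */
                if (mapping1)
                        down_write(&mapping1->invalidate_lock);
                if (mapping2 && mapping2 != mapping1)
                        down_write_nested(&mapping2->invalidate_lock, 1);
        }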

    Reviewed-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Jan Kara
     
  • Currently, serializing operations such as page fault, read, or readahead
    against hole punching is rather difficult. The basic race scheme is
    like:

    fallocate(FALLOC_FL_PUNCH_HOLE)              read / fault / ..
      truncate_inode_pages_range()
                                                 <create pages in page
                                                  cache here>
      <update fs block mapping and free blocks>

    Now the problem is in this way read / page fault / readahead can
    instantiate pages in page cache with potentially stale data (if blocks
    get quickly reused). Avoiding this race is not simple - page locks do
    not work because we want to make sure there are *no* pages in given
    range. inode->i_rwsem does not work because page fault happens under
    mmap_sem which ranks below inode->i_rwsem. Also using it for reads makes
    the performance for mixed read-write workloads suffer.

    So create a new rw_semaphore in the address_space - invalidate_lock -
    that protects adding of pages to page cache for page faults / reads /
    readahead.
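
    A hedged usage sketch of the two sides of the race with the new lock
    (helper names as introduced by this series; illustrative only):

        /* hole punch side: exclude page instantiation over the whole range */
        filemap_invalidate_lock(mapping);       /* down_write(invalidate_lock) */
        truncate_inode_pages_range(mapping, start, end);
        /* ... update the fs block mapping and free the blocks ... */
        filemap_invalidate_unlock(mapping);

        /* read / fault / readahead side: take it shared while adding pages */
        filemap_invalidate_lock_shared(mapping);
        /* ... instantiate page cache pages and read them in ... */
        filemap_invalidate_unlock_shared(mapping);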

    Reviewed-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Jan Kara
     
  • inode->i_mutex has been replaced with inode->i_rwsem long ago. Fix
    comments still mentioning i_mutex.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Acked-by: Hugh Dickins
    Signed-off-by: Jan Kara

    Jan Kara
     

04 Jul, 2021

1 commit

  • Pull iov_iter updates from Al Viro:
    "iov_iter cleanups and fixes.

    There are followups, but this is what had sat in -next this cycle. IMO
    the macro forest in there became much thinner and easier to follow..."

    * 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
    csum_and_copy_to_pipe_iter(): leave handling of csum_state to caller
    clean up copy_mc_pipe_to_iter()
    pipe_zero(): we don't need no stinkin' kmap_atomic()...
    iov_iter: clean csum_and_copy_...() primitives up a bit
    copy_page_from_iter(): don't need kmap_atomic() for kvec/bvec cases
    copy_page_to_iter(): don't bother with kmap_atomic() for bvec/kvec cases
    iterate_xarray(): only of the first iteration we might get offset != 0
    pull handling of ->iov_offset into iterate_{iovec,bvec,xarray}
    iov_iter: make iterator callbacks use base and len instead of iovec
    iov_iter: make the amount already copied available to iterator callbacks
    iov_iter: get rid of separate bvec and xarray callbacks
    iov_iter: teach iterate_{bvec,xarray}() about possible short copies
    iterate_bvec(): expand bvec.h macro forest, massage a bit
    iov_iter: unify iterate_iovec and iterate_kvec
    iov_iter: massage iterate_iovec and iterate_kvec to logics similar to iterate_bvec
    iterate_and_advance(): get rid of magic in case when n is 0
    csum_and_copy_to_iter(): massage into form closer to csum_and_copy_from_iter()
    iov_iter: replace iov_iter_copy_from_user_atomic() with iterator-advancing variant
    [xarray] iov_iter_npages(): just use DIV_ROUND_UP()
    iov_iter_npages(): don't bother with iterate_all_kinds()
    ...

    Linus Torvalds
     

30 Jun, 2021

1 commit

  • set_active_memcg() worked for kernel allocations but was silently ignored
    for user pages.

    This patch establishes a precedence order for who gets charged:

    1. If there is a memcg associated with the page already, that memcg is
    charged. This happens during swapin.

    2. If an explicit mm is passed, mm->memcg is charged. This happens
    during page faults, which can be triggered in remote VMs (eg gup).

    3. Otherwise consult the current process context. If there is an
    active_memcg, use that. Otherwise, current->mm->memcg.

    Previously, if a NULL mm was passed to mem_cgroup_charge (case 3) it would
    always charge the root cgroup. Now it looks up the active_memcg first
    (falling back to charging the root cgroup if not set).
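
    A hedged pseudo-C sketch of that precedence (accessor names simplified;
    active_memcg() here stands for whatever set_active_memcg() installed):

        struct mem_cgroup *memcg;

        if (page_memcg(page))                   /* 1. swapin: already charged */
                memcg = page_memcg(page);
        else if (mm)                            /* 2. explicit mm (remote fault/gup) */
                memcg = get_mem_cgroup_from_mm(mm);
        else {                                  /* 3. consult the current task */
                memcg = active_memcg();         /* simplified accessor */
                if (!memcg)
                        memcg = get_mem_cgroup_from_mm(current->mm);
        }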

    Link: https://lkml.kernel.org/r/20210610173944.1203706-3-schatzberg.dan@gmail.com
    Signed-off-by: Dan Schatzberg
    Acked-by: Johannes Weiner
    Acked-by: Tejun Heo
    Acked-by: Chris Down
    Acked-by: Jens Axboe
    Reviewed-by: Shakeel Butt
    Reviewed-by: Michal Koutný
    Cc: Michal Hocko
    Cc: Ming Lei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Schatzberg
     

10 Jun, 2021

1 commit

  • Replacement is called copy_page_from_iter_atomic(); unlike the old primitive the
    callers do *not* need to do iov_iter_advance() after it. In case when they end
    up consuming less than they'd been given they need to do iov_iter_revert() on
    everything they had not consumed. That, however, needs to be done only on slow
    paths.

    All in-tree callers converted. And that kills the last user of iterate_all_kinds()
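
    A hedged sketch of the new caller pattern (the commit step is a
    hypothetical stand-in for whatever the filesystem does with the data):

        copied = copy_page_from_iter_atomic(page, offset, bytes, i);
        /* the iterator has already been advanced by 'copied' bytes */
        status = commit_written_data(page, pos, copied);   /* hypothetical */
        if (status < copied)                               /* slow path only */
                iov_iter_revert(i, copied - status);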

    Signed-off-by: Al Viro

    Al Viro
     

03 Jun, 2021

1 commit


07 May, 2021

1 commit

  • Fix ~94 single-word typos in locking code comments, plus a few
    very obvious grammar mistakes.

    Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com
    Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com
    Signed-off-by: Ingo Molnar
    Reviewed-by: Matthew Wilcox (Oracle)
    Reviewed-by: Randy Dunlap
    Cc: Bhaskar Chowdhury
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

06 May, 2021

3 commits

  • Various coding style tweaks to various files under mm/

    [daizhiyuan@phytium.com.cn: mm/swapfile: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614223624-16055-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/sparse: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614227288-19363-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/vmscan: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614227649-19853-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/compaction: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614228218-20770-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/oom_kill: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614228360-21168-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/shmem: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614228504-21491-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/page_alloc: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614228613-21754-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/filemap: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614228936-22337-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/mlock: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1613956588-2453-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/frontswap: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1613962668-15045-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/vmalloc: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1613963379-15988-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/memory_hotplug: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1613971784-24878-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/mempolicy: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1613972228-25501-1-git-send-email-daizhiyuan@phytium.com.cn

    Link: https://lkml.kernel.org/r/1614222374-13805-1-git-send-email-daizhiyuan@phytium.com.cn
    Signed-off-by: Zhiyuan Dai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhiyuan Dai
     
  • Simplify mapping_needs_writeback() by accounting DAX entries as pages
    instead of exceptional entries.
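
    With DAX entries counted in nrpages, the check collapses to something
    like this hedged sketch:

        static bool mapping_needs_writeback(struct address_space *mapping)
        {
                /* no separate DAX/exceptional-entry special case any more */
                return mapping->nrpages;
        }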

    Link: https://lkml.kernel.org/r/20201026151849.24232-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Tested-by: Vishal Verma
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • We no longer need to keep track of how many shadow entries are present in
    a mapping. This saves a few writes to the inode and memory barriers.

    Link: https://lkml.kernel.org/r/20201026151849.24232-3-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Tested-by: Vishal Verma
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

01 May, 2021

5 commits

  • Commit a6de4b4873e1 ("mm: convert find_get_entry to return the head page")
    uses @index instead of @offset, but the comment is stale, update it.

    Link: https://lkml.kernel.org/r/1617948260-50724-1-git-send-email-zhangshaokun@hisilicon.com
    Signed-off-by: Rui Sun
    Signed-off-by: Shaokun Zhang
    Cc: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rui Sun
     
  • If the I/O completed successfully, the page will remain Uptodate, even
    if it is subsequently truncated. If the I/O completed with an error,
    this check would cause us to retry the I/O if the page were truncated
    before we woke up. There is no need to retry the I/O; the I/O to fill
    the page failed, so we can legitimately just return -EIO.

    This code was originally added by commit 56f0d5fe6851 ("[PATCH]
    readpage-vs-invalidate fix") in 2005 (this commit ID is from the
    linux-fullhistory tree; it is also commit ba1f08f14b52 in tglx-history).

    At the time, truncate_complete_page() called ClearPageUptodate(), and so
    this was fixing a real bug. In 2008, commit 84209e02de48 ("mm: dont clear
    PG_uptodate on truncate/invalidate") removed the call to
    ClearPageUptodate, and this check has been unnecessary ever since.

    It doesn't do any real harm, but there's no need to keep it.

    Link: https://lkml.kernel.org/r/20210303222547.1056428-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Acked-by: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • After splitting generic_file_buffered_read() into smaller parts, it turns
    out we can reuse one of the parts in filemap_fault(). This fixes an
    oversight -- waiting for the I/O to complete is now interruptible by a
    fatal signal. And it saves us a few bytes of text in an unlikely path.

    $ ./scripts/bloat-o-meter before.o after.o
    add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-207 (-207)
    Function                                     old     new   delta
    filemap_fault                               2187    1980    -207
    Total: Before=37491, After=37284, chg -0.55%

    Link: https://lkml.kernel.org/r/20210226140011.2883498-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Andrew Morton
    Cc: Kent Overstreet
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • For the generic page cache read helper, use the better variant of checking
    for the need to call filemap_write_and_wait_range() when doing O_DIRECT
    reads. This avoids falling back to the slow path for IOCB_NOWAIT, if
    there are no pages to wait for (or write out).
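
    A hedged sketch of the caller-side check in the O_DIRECT read path
    (iocb, mapping and count are assumed locals):

        if (iocb->ki_flags & IOCB_NOWAIT) {
                if (filemap_range_needs_writeback(mapping, iocb->ki_pos,
                                                  iocb->ki_pos + count - 1))
                        return -EAGAIN;   /* only punt if there is real work */
        } else {
                retval = filemap_write_and_wait_range(mapping, iocb->ki_pos,
                                                      iocb->ki_pos + count - 1);
                if (retval < 0)
                        return retval;
        }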

    Link: https://lkml.kernel.org/r/20210224164455.1096727-3-axboe@kernel.dk
    Signed-off-by: Jens Axboe
    Reviewed-by: Matthew Wilcox (Oracle)
    Reviewed-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jens Axboe
     
  • Patch series "Improve IOCB_NOWAIT O_DIRECT reads", v3.

    An internal workload complained because it was using too much CPU, and
    when I took a look, we had a lot of io_uring workers going to town.

    For an async buffered read like workload, I am normally expecting _zero_
    offloads to a worker thread, but this one had tons of them. I'd drop
    caches and things would look good again, but then a minute later we'd
    regress back to using workers. Turns out that every minute something
    was reading parts of the device, which would add page cache for that
    inode. I put patches like these in for our kernel, and the problem was
    solved.

    Don't -EAGAIN IOCB_NOWAIT dio reads just because we have page cache
    entries for the given range. This causes unnecessary work from the
    callers side, when the IO could have been issued totally fine without
    blocking on writeback when there is none.

    This patch (of 3):

    For O_DIRECT reads/writes, we check if we need to issue a call to
    filemap_write_and_wait_range() to issue and/or wait for writeback for any
    page in the given range. The existing mechanism just checks for a page in
    the range, which is suboptimal for IOCB_NOWAIT as we'll fallback to the
    slow path (and needing retry) if there's just a clean page cache page in
    the range.

    Provide filemap_range_needs_writeback() which tries a little harder to
    check if we actually need to issue and/or wait for writeback in the range.
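
    A hedged sketch of what the helper checks, condensed from the description
    above (not the verbatim upstream code):

        bool filemap_range_needs_writeback(struct address_space *mapping,
                                           loff_t start_byte, loff_t end_byte)
        {
                XA_STATE(xas, &mapping->i_pages, start_byte >> PAGE_SHIFT);
                pgoff_t max = end_byte >> PAGE_SHIFT;
                struct page *page;

                if (!mapping->nrpages)
                        return false;
                if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) &&
                    !mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
                        return false;

                rcu_read_lock();
                xas_for_each(&xas, page, max) {
                        if (xas_retry(&xas, page) || xa_is_value(page))
                                continue;
                        if (PageDirty(page) || PageLocked(page) ||
                            PageWriteback(page))
                                break;  /* found work that could block */
                }
                rcu_read_unlock();
                return page != NULL;
        }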

    Link: https://lkml.kernel.org/r/20210224164455.1096727-1-axboe@kernel.dk
    Link: https://lkml.kernel.org/r/20210224164455.1096727-2-axboe@kernel.dk
    Signed-off-by: Jens Axboe
    Reviewed-by: Matthew Wilcox (Oracle)
    Reviewed-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

28 Apr, 2021

1 commit

  • Pull network filesystem helper library updates from David Howells:
    "Here's a set of patches for 5.13 to begin the process of overhauling
    the local caching API for network filesystems. This set consists of
    two parts:

    (1) Add a helper library to handle the new VM readahead interface.

    This is intended to be used unconditionally by the filesystem
    (whether or not caching is enabled) and provides a common
    framework for doing caching, transparent huge pages and, in the
    future, possibly fscrypt and read bandwidth maximisation. It also
    allows the netfs and the cache to align, expand and slice up a
    read request from the VM in various ways; the netfs need only
    provide a function to read a stretch of data to the pagecache and
    the helper takes care of the rest.

    (2) Add an alternative fscache/cachfiles I/O API that uses the kiocb
    facility to do async DIO to transfer data to/from the netfs's
    pages, rather than using readpage with wait queue snooping on one
    side and vfs_write() on the other. It also uses less memory, since
    it doesn't do buffered I/O on the backing file.

    Note that this uses SEEK_HOLE/SEEK_DATA to locate the data
    available to be read from the cache. Whilst this is an improvement
    from the bmap interface, it still has a problem with regard to a
    modern extent-based filesystem inserting or removing bridging
    blocks of zeros. Fixing that requires a much greater overhaul.

    This is a step towards overhauling the fscache API. The change is
    opt-in on the part of the network filesystem. A netfs should not try
    to mix the old and the new API because of conflicting ways of handling
    pages and the PG_fscache page flag and because it would be mixing DIO
    with buffered I/O. Further, the helper library can't be used with the
    old API.

    This does not change any of the fscache cookie handling APIs or the
    way invalidation is done at this time.

    In the near term, I intend to deprecate and remove the old I/O API
    (fscache_allocate_page{,s}(), fscache_read_or_alloc_page{,s}(),
    fscache_write_page() and fscache_uncache_page()) and eventually
    replace most of fscache/cachefiles with something simpler and easier
    to follow.

    This patchset contains the following parts:

    - Some helper patches, including provision of an ITER_XARRAY iov
    iterator and a function to do readahead expansion.

    - Patches to add the netfs helper library.

    - A patch to add the fscache/cachefiles kiocb API.

    - A pair of patches to fix some review issues in the ITER_XARRAY and
    read helpers as spotted by Al and Willy.

    Jeff Layton has patches to add support in Ceph for this that he
    intends for this merge window. I have a set of patches to support AFS
    that I will post a separate pull request for.

    With this, AFS without a cache passes all expected xfstests; with a
    cache, there's an extra failure, but that's also there before these
    patches. Fixing that probably requires a greater overhaul. Ceph also
    passes the expected tests.

    I also have patches in a separate branch to tidy up the handling of
    PG_fscache/PG_private_2 and their contribution to page refcounting in
    the core kernel here, but I haven't included them in this set and will
    route them separately"

    Link: https://lore.kernel.org/lkml/3779937.1619478404@warthog.procyon.org.uk/

    * tag 'netfs-lib-20210426' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    netfs: Miscellaneous fixes
    iov_iter: Four fixes for ITER_XARRAY
    fscache, cachefiles: Add alternate API to use kiocb for read/write to cache
    netfs: Add a tracepoint to log failures that would be otherwise unseen
    netfs: Define an interface to talk to a cache
    netfs: Add write_begin helper
    netfs: Gather stats
    netfs: Add tracepoints
    netfs: Provide readahead and readpage netfs helpers
    netfs, mm: Add set/end/wait_on_page_fscache() aliases
    netfs, mm: Move PG_fscache helper funcs to linux/netfs.h
    netfs: Documentation for helper library
    netfs: Make a netfs helper module
    mm: Implement readahead_control pageset expansion
    mm/readahead: Handle ractl nr_pages being modified
    fs: Document file_ra_state
    mm/filemap: Pass the file_ra_state in the ractl
    mm: Add set/end/wait functions for PG_private_2
    iov_iter: Add ITER_XARRAY

    Linus Torvalds
     

24 Apr, 2021

2 commits

  • No problem on 64-bit, or without huge pages, but xfstests generic/285
    and other SEEK_HOLE/SEEK_DATA tests have regressed on huge tmpfs, and on
    32-bit architectures, with the new mapping_seek_hole_data(). Several
    different bugs turned out to need fixing.

    u64 cast to stop losing bits when converting unsigned long to loff_t
    (and let's use shifts throughout, rather than mixed with * and /).

    Use round_up() when advancing pos, to stop assuming that pos was already
    THP-aligned when advancing it by THP-size. (This use of round_up()
    assumes that any THP has THP-aligned index: true at present and true
    going forward, but could be recoded to avoid the assumption.)

    Use xas_set() when iterating away from a THP, so that xa_index stays in
    synch with start, instead of drifting away to return bogus offset.

    Check start against end to avoid wrapping 32-bit xa_index to 0 (and to
    handle these additional cases, seek_data or not, it's easier to break
    the loop than goto: so rearrange exit from the function).
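
    Hedged, illustrative fragments of the four fixes (variable names assumed;
    not the upstream diff):

        loff_t pos = (u64)xas.xa_index << PAGE_SHIFT;  /* u64 cast: no bit loss */
        pos = round_up(pos + 1, thp_size(page));       /* advance by THP size without
                                                          assuming pos was aligned */
        xas_set(&xas, pos >> PAGE_SHIFT);              /* keep xa_index in sync */
        if ((pos >> PAGE_SHIFT) >= end)                /* break instead of letting
                                                          32-bit xa_index wrap to 0 */
                break;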

    [hughd@google.com: remove unneeded u64 casts, per Matthew]
    Link: https://lkml.kernel.org/r/alpine.LSU.2.11.2104221347240.1170@eggly.anvils

    Link: https://lkml.kernel.org/r/alpine.LSU.2.11.2104211737410.3299@eggly.anvils
    Fixes: 41139aa4c3a3 ("mm/filemap: add mapping_seek_hole_data")
    Signed-off-by: Hugh Dickins
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Johannes Weiner
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: William Kucharski
    Cc: Yang Shi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • No problem on 64-bit, or without huge pages, but xfstests generic/308
    hung uninterruptibly on 32-bit huge tmpfs.

    Since commit 0cc3b0ec23ce ("Clarify (and fix) in 4.13 MAX_LFS_FILESIZE
    macros"), MAX_LFS_FILESIZE is only a PAGE_SIZE away from wrapping 32-bit
    xa_index to 0, so the new find_lock_entries() has to be extra careful
    when handling a THP.

    Link: https://lkml.kernel.org/r/alpine.LSU.2.11.2104211735430.3299@eggly.anvils
    Fixes: 5c211ba29deb ("mm: add and use find_lock_entries")
    Signed-off-by: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: William Kucharski
    Cc: Christoph Hellwig
    Cc: Jan Kara
    Cc: Dave Chinner
    Cc: Johannes Weiner
    Cc: "Kirill A. Shutemov"
    Cc: Yang Shi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

23 Apr, 2021

2 commits

  • For readahead_expand(), we need to modify the file ra_state, so pass it
    down by adding it to the ractl. We have to do this because it's not always
    the same as f_ra in the struct file that is already being passed.
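
    A hedged sketch of the structural change (field placement illustrative):

        struct readahead_control {
                struct file *file;
                struct address_space *mapping;
                struct file_ra_state *ra;   /* new: not always file->f_ra */
                /* private: internal use only */
                pgoff_t _index;
                unsigned int _nr_pages;
                unsigned int _batch_count;
        };

    Callers (and DEFINE_READAHEAD()) then pass the ra_state explicitly,
    usually &file->f_ra, but a netfs can substitute its own.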

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: David Howells
    Tested-by: Jeff Layton
    Tested-by: Dave Wysochanski
    Tested-By: Marc Dionne
    Link: https://lore.kernel.org/r/20210407201857.3582797-2-willy@infradead.org/
    Link: https://lore.kernel.org/r/161789067431.6155.8063840447229665720.stgit@warthog.procyon.org.uk/ # v6

    Matthew Wilcox (Oracle)
     
  • Add three functions to manipulate PG_private_2:

    (*) set_page_private_2() - Set the flag and take an appropriate reference
    on the flagged page.

    (*) end_page_private_2() - Clear the flag, drop the reference and wake up
    any waiters, somewhat analogously with end_page_writeback().

    (*) wait_on_page_private_2() - Wait for the flag to be cleared.

    Wrappers will need to be placed in the netfs lib header in the patch that
    adds that.

    [This implements a suggestion by Linus[1] to not mix the terminology of
    PG_private_2 and PG_fscache in the mm core function]
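
    A hedged usage sketch of the three helpers from a netfs-style caller:

        /* before handing the page to the cache for a background write */
        set_page_private_2(page);       /* set PG_private_2, take a page ref */

        /* ... when the async write to the cache completes ... */
        end_page_private_2(page);       /* clear flag, drop ref, wake waiters */

        /* elsewhere, e.g. before invalidating the page */
        wait_on_page_private_2(page);   /* sleep until the flag is cleared */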

    Changes:
    v7:
    - Use compound_head() in all the functions to make them THP safe[6].

    v5:
    - Add set and end functions, calling the end function end rather than
    unlock[3].
    - Keep a ref on the page when PG_private_2 is set[4][5].

    v4:
    - Remove extern from the declaration[2].

    Suggested-by: Linus Torvalds
    Signed-off-by: David Howells
    Reviewed-by: Matthew Wilcox (Oracle)
    Tested-by: Jeff Layton
    Tested-by: Dave Wysochanski
    Tested-By: Marc Dionne
    cc: Alexander Viro
    cc: Christoph Hellwig
    cc: linux-mm@kvack.org
    cc: linux-cachefs@redhat.com
    cc: linux-afs@lists.infradead.org
    cc: linux-nfs@vger.kernel.org
    cc: linux-cifs@vger.kernel.org
    cc: ceph-devel@vger.kernel.org
    cc: v9fs-developer@lists.sourceforge.net
    cc: linux-fsdevel@vger.kernel.org
    Link: https://lore.kernel.org/r/1330473.1612974547@warthog.procyon.org.uk/ # v1
    Link: https://lore.kernel.org/r/CAHk-=wjgA-74ddehziVk=XAEMTKswPu1Yw4uaro1R3ibs27ztw@mail.gmail.com/ [1]
    Link: https://lore.kernel.org/r/20210216102659.GA27714@lst.de/ [2]
    Link: https://lore.kernel.org/r/161340387944.1303470.7944159520278177652.stgit@warthog.procyon.org.uk/ # v3
    Link: https://lore.kernel.org/r/161539528910.286939.1252328699383291173.stgit@warthog.procyon.org.uk # v4
    Link: https://lore.kernel.org/r/20210321105309.GG3420@casper.infradead.org [3]
    Link: https://lore.kernel.org/r/CAHk-=wh+2gbF7XEjYc=HV9w_2uVzVf7vs60BPz0gFA=+pUm3ww@mail.gmail.com/ [4]
    Link: https://lore.kernel.org/r/CAHk-=wjSGsRj7xwhSMQ6dAQiz53xA39pOG+XA_WeTgwBBu4uqg@mail.gmail.com/ [5]
    Link: https://lore.kernel.org/r/20210408145057.GN2531743@casper.infradead.org/ [6]
    Link: https://lore.kernel.org/r/161653788200.2770958.9517755716374927208.stgit@warthog.procyon.org.uk/ # v5
    Link: https://lore.kernel.org/r/161789066013.6155.9816857201817288382.stgit@warthog.procyon.org.uk/ # v6

    David Howells
     

27 Feb, 2021

9 commits

  • All callers of find_get_entries() use a pvec, so pass it directly instead
    of manipulating it in the caller.

    Link: https://lkml.kernel.org/r/20201112212641.27837-14-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Jan Kara
    Reviewed-by: William Kucharski
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Yang Shi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • This simplifies the callers and leads to a more efficient implementation
    since the XArray has this functionality already.

    Link: https://lkml.kernel.org/r/20201112212641.27837-11-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Jan Kara
    Reviewed-by: William Kucharski
    Reviewed-by: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Yang Shi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • We have three functions (shmem_undo_range(), truncate_inode_pages_range()
    and invalidate_mapping_pages()) which want exactly this function, so add
    it to filemap.c. Before this patch, shmem_undo_range() would split any
    compound page which overlaps either end of the range being punched in both
    the first and second loops through the address space. After this patch,
    that functionality is left for the second loop, which is arguably more
    appropriate since the first loop is supposed to run through all the pages
    quickly, and splitting a page can sleep.

    [willy@infradead.org: add assertion]
    Link: https://lkml.kernel.org/r/20201124041507.28996-3-willy@infradead.org

    Link: https://lkml.kernel.org/r/20201112212641.27837-10-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Jan Kara
    Reviewed-by: William Kucharski
    Reviewed-by: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Yang Shi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Enhance mapping_seek_hole_data() to handle partially uptodate pages and
    convert the iomap seek code to call it.

    Link: https://lkml.kernel.org/r/20201112212641.27837-9-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: William Kucharski
    Cc: Yang Shi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Rewrite shmem_seek_hole_data() and move it to filemap.c.

    [willy@infradead.org: don't put an xa_is_value() page]
    Link: https://lkml.kernel.org/r/20201124041507.28996-4-willy@infradead.org

    Link: https://lkml.kernel.org/r/20201112212641.27837-8-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: William Kucharski
    Reviewed-by: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Yang Shi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • There is a lot of common code in find_get_entries(),
    find_get_pages_range() and find_get_pages_range_tag(). Factor out
    find_get_entry() which simplifies all three functions.

    [willy@infradead.org: remove VM_BUG_ON_PAGE()]
    Link: https://lkml.kernel.org/r/20201124041507.28996-2-willy@infradead.org
    Link: https://lkml.kernel.org/r/20201112212641.27837-7-willy@infradead.org

    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Jan Kara
    Reviewed-by: William Kucharski
    Reviewed-by: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Yang Shi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • find_get_entry doesn't "find" anything. It returns the entry at a
    particular index.

    Link: https://lkml.kernel.org/r/20201112212641.27837-6-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: William Kucharski
    Cc: Yang Shi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • The functionality of find_lock_entry() and find_get_entry() can be
    provided by pagecache_get_page(), which lets us delete find_lock_entry()
    and make find_get_entry() static.
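
    A hedged sketch of the equivalences; FGP_ENTRY is assumed to be the flag
    this series adds so that pagecache_get_page() will return value entries:

        /* roughly find_get_entry(mapping, index): */
        page = pagecache_get_page(mapping, index, FGP_ENTRY, 0);

        /* roughly find_lock_entry(mapping, index): */
        page = pagecache_get_page(mapping, index, FGP_ENTRY | FGP_LOCK, 0);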

    Link: https://lkml.kernel.org/r/20201112212641.27837-5-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: William Kucharski
    Cc: Yang Shi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Patch series "Overhaul multi-page lookups for THP", v4.

    This THP prep patchset changes several page cache iteration APIs to only
    return head pages.

    - It's only possible to tag head pages in the page cache, so only
    return head pages, not all their subpages.
    - Factor a lot of common code out of the various batch lookup routines
    - Add mapping_seek_hole_data()
    - Unify find_get_entries() and pagevec_lookup_entries()
    - Make find_get_entries only return head pages, like find_get_entry().

    These are only loosely connected, but they seem to make sense together as
    a series.

    This patch (of 14):

    Pagecache tags are used for dirty page writeback. Since dirtiness is
    tracked on a per-THP basis, we only want to return the head page rather
    than each subpage of a tagged page. All the filesystems which use huge
    pages today are in-memory, so there are no tagged huge pages today.

    Link: https://lkml.kernel.org/r/20201112212641.27837-2-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Jan Kara
    Reviewed-by: William Kucharski
    Reviewed-by: Christoph Hellwig
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Yang Shi
    Cc: Dave Chinner
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

25 Feb, 2021

1 commit

    Currently we use struct per_cpu_nodestat to cache the vmstat counters,
    which leads to inaccurate statistics, especially for the THP vmstat
    counters. On systems with hundreds of processors the error can amount to
    GBs of memory: for example, on a 96-CPU system the per-CPU threshold is
    at most 125, and the per-cpu counters can cache 23.4375 GB in total.
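
    Assuming 2 MB THPs, that worst case is simply the per-CPU error bound
    summed over all CPUs: 96 CPUs * 125 pending events * 2 MB per THP =
    24,000 MB = 23.4375 GB.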

    A THP page is already a form of batched addition (it adds 512 pages'
    worth of memory in one go), so skipping the per-cpu batching seems
    sensible. Every THP stats update then overflows the per-cpu counter and
    resorts to an atomic global update, but this makes the statistics more
    accurate for the THP vmstat counters.

    So convert the NR_SHMEM_THPS accounting to pages. This is consistent
    with 8f182270dfec ("mm/swap.c: flush lru pvecs on compound page
    arrival") and also makes the units of the vmstat counters more uniform.
    The units are now pages, kB and bytes: a B/KB suffix indicates bytes or
    kB, and counters without a suffix are in pages.

    Link: https://lkml.kernel.org/r/20201228164110.2838-5-songmuchun@bytedance.com
    Signed-off-by: Muchun Song
    Cc: Alexey Dobriyan
    Cc: Feng Tang
    Cc: Greg Kroah-Hartman
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: NeilBrown
    Cc: Pankaj Gupta
    Cc: Rafael. J. Wysocki
    Cc: Randy Dunlap
    Cc: Roman Gushchin
    Cc: Sami Tolvanen
    Cc: Shakeel Butt
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Muchun Song