01 Oct, 2020

1 commit

  • [ Upstream commit faffdfa04fa11ccf048cebdde73db41ede0679e0 ]

    Mount failures were observed under the following scenario: an application
    forked dozens of threads to mount the same number of cramfs images
    separately in docker, and several of the mounts failed with high
    probability. A mount failed because the page (read from the superblock of
    the loop device) was not uptodate after wait_on_page_locked(page) returned
    in cramfs_read:

    wait_on_page_locked(page);
    if (!PageUptodate(page)) {
    ...
    }

    The page fails the uptodate check for the following reason: systemd-udevd
    read the loopX device before the mount. Because the status of loopX was
    Lo_unbound at that time, loop_make_request directly triggered the io_end
    handler end_buffer_async_read, which called SetPageError(page). As a
    result, the page could not be set uptodate in end_buffer_async_read:

    if (page_uptodate && !PageError(page)) {
        SetPageUptodate(page);
    }

    The mount operation is then performed, and it uses the same page that was
    just accessed by systemd-udevd above. Because this page is not uptodate,
    cramfs launches an actual read via submit_bh and then waits on the page by
    calling wait_on_page_locked(page). When the I/O for the page is done, the
    io_end handler end_buffer_async_read is called. Because nothing in the
    whole read path of the mount cleared the page error left behind by the
    systemd-udevd read, the page is still in the PageError state, cannot be
    set uptodate in end_buffer_async_read, and the mount fails.

    But sometimes the mount succeeds even though systemd-udevd read the loopX
    device just beforehand. The reason is that systemd-udevd launched another
    loopX read between steps 3.1 and 3.2 below:

    1, loopX dev default status is Lo_unbound;
    2, systemd-udevd read loopX dev (page is set to PageError);
    3, mount operation
    3.1) set loopX status to Lo_bound;
    ==> systemd-udevd read loopX dev: a_ops->readpage(filp, page);
    3.2) mount reads the superblock page.

    Here, mapping->a_ops->readpage() is blkdev_readpage. In recent kernels
    some function names have changed; the call trace is as follows:

    blkdev_read_iter
    generic_file_read_iter
    generic_file_buffered_read:
    /*
    * A previous I/O error may have been due to temporary
    * failures, e.g. multipath errors.
    * PG_error will be set again if readpage fails.
    */
    ClearPageError(page);
    /* Start the actual read. The read will unlock the page. */
    error = mapping->a_ops->readpage(filp, page);

    We can see that ClearPageError(page) is called before the actual read, so
    the read in step 3.2 succeeds.

    This patch adds a call to ClearPageError just before the actual read in
    the read path of a cramfs mount. Without the patch, the call trace when
    performing a cramfs mount is:

    do_mount
    cramfs_read
    cramfs_blkdev_read
    read_cache_page
    do_read_cache_page:
    filler(data, page);
    or
    mapping->a_ops->readpage(data, page);

    With the patch, the call trace as below when performing mount:

    do_mount
    cramfs_read
    cramfs_blkdev_read
    read_cache_page:
    do_read_cache_page:
    ClearPageError(page);
    a_ops->readpage(data, page);

    With the patch, the mount operation triggers ClearPageError(page) before
    the actual read, so the page carries no stale error as long as no new page
    error occurs when the I/O completes.

    Signed-off-by: Xianting Tian
    Signed-off-by: Andrew Morton
    Reviewed-by: Matthew Wilcox (Oracle)
    Cc: Jan Kara
    Cc:
    Link: http://lkml.kernel.org/r/1583318844-22971-1-git-send-email-xianting_tian@126.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Xianting Tian
     

05 Aug, 2020

1 commit

  • commit 5c72feee3e45b40a3c96c7145ec422899d0e8964 upstream.

    When handling a page fault, we drop mmap_sem to start async readahead so
    that we don't block on IO submission with mmap_sem held. However, there's
    no point in dropping mmap_sem when readahead is disabled. Handle that case
    to avoid pointlessly dropping mmap_sem and retrying the fault. This was
    actually reported to block mlockall(MCL_CURRENT) indefinitely.

    Fixes: 6b4c9f446981 ("filemap: drop the mmap_sem for all blocking operations")
    Reported-by: Minchan Kim
    Reported-by: Robert Stupp
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Reviewed-by: Josef Bacik
    Reviewed-by: Minchan Kim
    Link: http://lkml.kernel.org/r/20200212101356.30759-1-jack@suse.cz
    Signed-off-by: Linus Torvalds
    Cc: SeongJae Park
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     

09 Jan, 2020

1 commit

  • [ Upstream commit 89b15332af7c0312a41e50846819ca6613b58b4c ]

    One of our services is observing hanging ps/top/etc under heavy write
    IO, and the task states show this is an mmap_sem priority inversion:

    A write fault is holding the mmap_sem in read-mode and waiting for
    (heavily cgroup-limited) IO in balance_dirty_pages():

    balance_dirty_pages+0x724/0x905
    balance_dirty_pages_ratelimited+0x254/0x390
    fault_dirty_shared_page.isra.96+0x4a/0x90
    do_wp_page+0x33e/0x400
    __handle_mm_fault+0x6f0/0xfa0
    handle_mm_fault+0xe4/0x200
    __do_page_fault+0x22b/0x4a0
    page_fault+0x45/0x50

    Somebody tries to change the address space, contending for the mmap_sem in
    write-mode:

    call_rwsem_down_write_failed_killable+0x13/0x20
    do_mprotect_pkey+0xa8/0x330
    SyS_mprotect+0xf/0x20
    do_syscall_64+0x5b/0x100
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    The waiting writer locks out all subsequent readers to avoid lock
    starvation, and several threads can be seen hanging like this:

    call_rwsem_down_read_failed+0x14/0x30
    proc_pid_cmdline_read+0xa0/0x480
    __vfs_read+0x23/0x140
    vfs_read+0x87/0x130
    SyS_read+0x42/0x90
    do_syscall_64+0x5b/0x100
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    To fix this, do what we do for cache read faults already: drop the
    mmap_sem before calling into anything IO bound, in this case the
    balance_dirty_pages() function, and return VM_FAULT_RETRY.

    Link: http://lkml.kernel.org/r/20190924194238.GA29030@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Matthew Wilcox (Oracle)
    Acked-by: Kirill A. Shutemov
    Cc: Josef Bacik
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Johannes Weiner
     

19 Oct, 2019

1 commit

  • The generic_file_vm_ops is defined in so include it to
    fix the following warning:

    mm/filemap.c:2717:35: warning: symbol 'generic_file_vm_ops' was not declared. Should it be static?

    Link: http://lkml.kernel.org/r/20191008102311.25432-1-ben.dooks@codethink.co.uk
    Signed-off-by: Ben Dooks
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Dooks
     

25 Sep, 2019

9 commits

    In the previous patch, an application could put part of its text section
    in THP via madvise(). These THPs are protected from writes while the
    application is still running (TXTBSY). However, after the application
    exits, the file becomes available for writes.

    This patch avoids writes to file THP by dropping page cache for the file
    when the file is open for write. A new counter nr_thps is added to struct
    address_space. In do_dentry_open(), if the file is open for write and
    nr_thps is non-zero, we drop page cache for the whole file.

    Link: http://lkml.kernel.org/r/20190801184244.3169074-8-songliubraving@fb.com
    Signed-off-by: Song Liu
    Reported-by: kbuild test robot
    Acked-by: Rik van Riel
    Acked-by: Kirill A. Shutemov
    Acked-by: Johannes Weiner
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: William Kucharski
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     
  • This patch is (hopefully) the first step to enable THP for non-shmem
    filesystems.

    This patch enables an application to put part of its text section into
    THP via madvise, for example:

    madvise((void *)0x600000, 0x200000, MADV_HUGEPAGE);

    We tried to reuse the logic for THP on tmpfs.

    Currently, write is not supported for non-shmem THP. khugepaged will only
    process vmas with VM_DENYWRITE. sys_mmap() ignores VM_DENYWRITE requests
    (see ksys_mmap_pgoff), so the only way to create a vma with VM_DENYWRITE
    is execve(). This requirement limits non-shmem THP to text sections.

    The next patch will handle writes, which can only happen once all the
    vmas with VM_DENYWRITE are unmapped.

    An EXPERIMENTAL config, READ_ONLY_THP_FOR_FS, is added to gate this
    feature.

    [songliubraving@fb.com: fix build without CONFIG_SHMEM]
    Link: http://lkml.kernel.org/r/F53407FB-96CC-42E8-9862-105C92CC2B98@fb.com
    [songliubraving@fb.com: fix double unlock in collapse_file()]
    Link: http://lkml.kernel.org/r/B960CBFA-8EFC-4DA4-ABC5-1977FFF2CA57@fb.com
    Link: http://lkml.kernel.org/r/20190801184244.3169074-7-songliubraving@fb.com
    Signed-off-by: Song Liu
    Acked-by: Rik van Riel
    Acked-by: Kirill A. Shutemov
    Acked-by: Johannes Weiner
    Cc: Stephen Rothwell
    Cc: Dan Carpenter
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: William Kucharski
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     
    With THP, the current check of the offset:

    VM_BUG_ON_PAGE(page->index != offset, page);

    is no longer accurate. Update it to:

    VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page);

    Link: http://lkml.kernel.org/r/20190801184244.3169074-4-songliubraving@fb.com
    Signed-off-by: Song Liu
    Acked-by: Rik van Riel
    Acked-by: Kirill A. Shutemov
    Acked-by: Johannes Weiner
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: William Kucharski
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     
    Similar to the previous patch, pagecache_get_page() avoids racing with
    truncate by checking page->mapping == mapping. This does not work for
    compound pages, so this patch makes it check
    compound_head(page)->mapping instead.

    Link: http://lkml.kernel.org/r/20190801184244.3169074-3-songliubraving@fb.com
    Signed-off-by: Song Liu
    Suggested-by: Johannes Weiner
    Acked-by: Johannes Weiner
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Rik van Riel
    Cc: William Kucharski
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     
    Patch series "Enable THP for text section of non-shmem files", v10.

    This patchset follows up on discussion at LSF/MM 2019. The motivation is
    to put the text section of an application in THP, and thus reduce the
    iTLB miss rate and improve performance. Both Facebook and Oracle showed
    strong interest in this feature.

    To make review easier, this set aims at a minimal valid product. The
    current version of the work does not make any changes to
    filesystem-specific code. This comes with some limitations (discussed
    later).

    This set enables an application to "hugify" its text section by simply
    running something like:

    madvise(0x600000, 0x80000, MADV_HUGEPAGE);

    Before this call, /proc/<pid>/maps looks like:

    00400000-074d0000 r-xp 00000000 00:27 2006927 app

    After this call, part of the text section is split out and mapped to
    THP:

    00400000-00425000 r-xp 00000000 00:27 2006927 app
    00600000-00e00000 r-xp 00200000 00:27 2006927 app <<< on THP
    00e00000-074d0000 r-xp 00a00000 00:27 2006927 app

    Limitations:

    1. This only works for text section (vma with VM_DENYWRITE).
    2. Original limitation #2 is removed in v3.

    We gated this feature with an experimental config, READ_ONLY_THP_FOR_FS.
    Once we get better support on the write path, we can remove the config and
    enable it by default.

    Tested cases:
    1. Tested with btrfs and ext4.
    2. Tested with real work application (memcache like caching service).
    3. Tested with "THP aware uprobe":
    https://patchwork.kernel.org/project/linux-mm/list/?series=131339

    This patch (of 7):

    Currently, filemap_fault() avoids racing with truncate by checking
    page->mapping == mapping. This does not work for compound pages, so this
    patch makes it check compound_head(page)->mapping instead.

    Link: http://lkml.kernel.org/r/20190801184244.3169074-2-songliubraving@fb.com
    Signed-off-by: Song Liu
    Acked-by: Rik van Riel
    Acked-by: Kirill A. Shutemov
    Acked-by: Johannes Weiner
    Cc: William Kucharski
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     
  • Transparent Huge Pages are currently stored in i_pages as pointers to
    consecutive subpages. This patch changes that to storing consecutive
    pointers to the head page in preparation for storing huge pages more
    efficiently in i_pages.

    Large parts of this are "inspired" by Kirill's patch
    https://lore.kernel.org/lkml/20170126115819.58875-2-kirill.shutemov@linux.intel.com/

    Kirill and Huang Ying contributed several fixes.

    [willy@infradead.org: use compound_nr, squish uninit-var warning]
    Link: http://lkml.kernel.org/r/20190731210400.7419-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jan Kara
    Reviewed-by: Kirill Shutemov
    Reviewed-by: Song Liu
    Tested-by: Song Liu
    Tested-by: William Kucharski
    Reviewed-by: William Kucharski
    Tested-by: Qian Cai
    Tested-by: Mikhail Gavrilov
    Cc: Hugh Dickins
    Cc: Chris Wilson
    Cc: Song Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • This actually checks that writeback is needed or in progress.

    Link: http://lkml.kernel.org/r/156378817069.1087.1302816672037672488.stgit@buzz
    Signed-off-by: Konstantin Khlebnikov
    Reviewed-by: Andrew Morton
    Cc: Tejun Heo
    Cc: Jens Axboe
    Cc: Johannes Weiner
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
    Functions like filemap_write_and_wait_range() should do nothing if the
    inode has no dirty pages and no pages currently under writeback. But they
    construct a struct writeback_control anyway, and with
    CONFIG_CGROUP_WRITEBACK=y this performs some atomic operations: on the
    fast path it locks inode->i_lock and updates the writeback ownership
    state, and the slow path may do more work. Currently this path is safely
    avoided only when the inode mapping has no pages at all.

    For example generic_file_read_iter() calls filemap_write_and_wait_range()
    at each O_DIRECT read - pretty hot path.

    This patch skips starting new writeback if mapping has no dirty tags set.
    If writeback is already in progress filemap_write_and_wait_range() will
    wait for it.

    Link: http://lkml.kernel.org/r/156378816804.1087.8607636317907921438.stgit@buzz
    Signed-off-by: Konstantin Khlebnikov
    Reviewed-by: Jan Kara
    Cc: Tejun Heo
    Cc: Jens Axboe
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Replace 1 << compound_order(page) with compound_nr(page). Minor
    improvements in readability.

    Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

20 Aug, 2019

1 commit


13 Jul, 2019

3 commits

  • Commit 6b4c9f446981 ("filemap: drop the mmap_sem for all blocking
    operations") changed when mmap_sem is dropped during filemap page fault
    and when returning VM_FAULT_RETRY.

    Correct the comment to reflect the change.

    Link: http://lkml.kernel.org/r/1556234531-108228-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Reviewed-by: Josef Bacik
    Acked-by: Song Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • We can just pass a NULL filler and do the right thing inside of
    do_read_cache_page based on the NULL parameter.

    Link: http://lkml.kernel.org/r/20190520055731.24538-3-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Kees Cook
    Cc: Nick Desaulniers
    Cc: Sami Tolvanen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Patch series "fix filler_t callback type mismatches", v2.

    Casting mapping->a_ops->readpage to filler_t causes an indirect call
    type mismatch with Control-Flow Integrity checking. This change fixes
    the mismatch in read_cache_page_gfp and read_mapping_page by using a
    NULL filler argument as an indication to call ->readpage directly, and
    by passing the right parameter callbacks in nfs and jffs2.

    This patch (of 4):

    Code cleanup.

    Link: http://lkml.kernel.org/r/20190520055731.24538-2-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Kees Cook
    Cc: Nick Desaulniers
    Cc: Sami Tolvanen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

11 Jul, 2019

2 commits

  • Pull ext4 updates from Ted Ts'o:
    "Many bug fixes and cleanups, and an optimization for case-insensitive
    lookups"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: fix coverity warning on error path of filename setup
    ext4: replace ktype default_attrs with default_groups
    ext4: rename htree_inline_dir_to_tree() to ext4_inlinedir_to_tree()
    ext4: refactor initialize_dirent_tail()
    ext4: rename "dirent_csum" functions to use "dirblock"
    ext4: allow directory holes
    jbd2: drop declaration of journal_sync_buffer()
    ext4: use jbd2_inode dirty range scoping
    jbd2: introduce jbd2_inode dirty range scoping
    mm: add filemap_fdatawait_range_keep_errors()
    ext4: remove redundant assignment to node
    ext4: optimize case-insensitive lookups
    ext4: make __ext4_get_inode_loc plug
    ext4: clean up kerneldoc warnigns when building with W=1
    ext4: only set project inherit bit for directory
    ext4: enforce the immutable flag on open files
    ext4: don't allow any modifications to an immutable file
    jbd2: fix typo in comment of journal_submit_inode_data_buffers
    jbd2: fix some print format mistakes
    ext4: gracefully handle ext4_break_layouts() failure during truncate

    Linus Torvalds
     
  • Pull copy_file_range updates from Darrick Wong:
    "This fixes numerous parameter checking problems and inconsistent
    behaviors in the new(ish) copy_file_range system call.

    Now the system call will actually check its range parameters
    correctly; refuse to copy into files for which the caller does not
    have sufficient privileges; update mtime and strip setuid like file
    writes are supposed to do; and allows copying up to the EOF of the
    source file instead of failing the call like we used to.

    Summary:

    - Create a generic copy_file_range handler and make individual
    filesystems responsible for calling it (i.e. no more assuming that
    do_splice_direct will work or is appropriate)

    - Refactor copy_file_range and remap_range parameter checking where
    they are the same

    - Install missing copy_file_range parameter checking(!)

    - Remove suid/sgid and update mtime like any other file write

    - Change the behavior so that a copy range crossing the source file's
    eof will result in a short copy to the source file's eof instead of
    EINVAL

    - Permit filesystems to decide if they want to handle
    cross-superblock copy_file_range in their local handlers"

    * tag 'copy-file-range-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    fuse: copy_file_range needs to strip setuid bits and update timestamps
    vfs: allow copy_file_range to copy across devices
    xfs: use file_modified() helper
    vfs: introduce file_modified() helper
    vfs: add missing checks to copy_file_range
    vfs: remove redundant checks from generic_remap_checks()
    vfs: introduce generic_file_rw_checks()
    vfs: no fallback for ->copy_file_range
    vfs: introduce generic_copy_file_range()

    Linus Torvalds
     

06 Jul, 2019

1 commit

  • This reverts commit 5fd4ca2d84b249f0858ce28cf637cf25b61a398f.

    Mikhail Gavrilov reports that it causes the VM_BUG_ON_PAGE() in
    __delete_from_swap_cache() to trigger:

    page:ffffd6d34dff0000 refcount:1 mapcount:1 mapping:ffff97812323a689 index:0xfecec363
    anon
    flags: 0x17fffe00080034(uptodate|lru|active|swapbacked)
    raw: 0017fffe00080034 ffffd6d34c67c508 ffffd6d3504b8d48 ffff97812323a689
    raw: 00000000fecec363 0000000000000000 0000000100000000 ffff978433ace000
    page dumped because: VM_BUG_ON_PAGE(entry != page)
    page->mem_cgroup:ffff978433ace000
    ------------[ cut here ]------------
    kernel BUG at mm/swap_state.c:170!
    invalid opcode: 0000 [#1] SMP NOPTI
    CPU: 1 PID: 221 Comm: kswapd0 Not tainted 5.2.0-0.rc2.git0.1.fc31.x86_64 #1
    Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 2202 04/11/2019
    RIP: 0010:__delete_from_swap_cache+0x20d/0x240
    Code: 30 65 48 33 04 25 28 00 00 00 75 4a 48 83 c4 38 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c6 2f dc 0f 8a 48 89 c7 e8 93 1b fd ff 0b 48 c7 c6 a8 74 0f 8a e8 85 1b fd ff 0f 0b 48 c7 c6 a8 7d 0f
    RSP: 0018:ffffa982036e7980 EFLAGS: 00010046
    RAX: 0000000000000021 RBX: 0000000000000040 RCX: 0000000000000006
    RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff97843d657900
    RBP: 0000000000000001 R08: ffffa982036e7835 R09: 0000000000000535
    R10: ffff97845e21a46c R11: ffffa982036e7835 R12: ffff978426387120
    R13: 0000000000000000 R14: ffffd6d34dff0040 R15: ffffd6d34dff0000
    FS: 0000000000000000(0000) GS:ffff97843d640000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00002cba88ef5000 CR3: 000000078a97c000 CR4: 00000000003406e0
    Call Trace:
    delete_from_swap_cache+0x46/0xa0
    try_to_free_swap+0xbc/0x110
    swap_writepage+0x13/0x70
    pageout.isra.0+0x13c/0x350
    shrink_page_list+0xc14/0xdf0
    shrink_inactive_list+0x1e5/0x3c0
    shrink_node_memcg+0x202/0x760
    shrink_node+0xe0/0x470
    balance_pgdat+0x2d1/0x510
    kswapd+0x220/0x420
    kthread+0xfb/0x130
    ret_from_fork+0x22/0x40

    and it's not immediately obvious why it happens. It's too late in the
    rc cycle to do anything but revert for now.

    Link: https://lore.kernel.org/lkml/CABXGCsN9mYmBD-4GaaeW_NrDu+FDXLzr_6x+XNxfmFV6QkYCDg@mail.gmail.com/
    Reported-and-bisected-by: Mikhail Gavrilov
    Suggested-by: Jan Kara
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Matthew Wilcox
    Cc: Kirill Shutemov
    Cc: William Kucharski
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

21 Jun, 2019

1 commit

  • In the spirit of filemap_fdatawait_range() and
    filemap_fdatawait_keep_errors(), introduce
    filemap_fdatawait_range_keep_errors() which both takes a range upon
    which to wait and does not clear errors from the address space.

    Signed-off-by: Ross Zwisler
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara
    Cc: stable@vger.kernel.org

    Ross Zwisler
     

10 Jun, 2019

3 commits

  • Like the clone and dedupe interfaces we've recently fixed, the
    copy_file_range() implementation is missing basic sanity, limits and
    boundary condition tests on the parameters that are passed to it
    from userspace. Create a new "generic_copy_file_checks()" function
    modelled on the generic_remap_checks() function to provide this
    missing functionality.

    [Amir] Shorten copy length instead of checking pos_in limits
    because input file size already abides by the limits.

    Signed-off-by: Dave Chinner
    Signed-off-by: Amir Goldstein
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Amir Goldstein
     
  • The access limit checks on input file range in generic_remap_checks()
    are redundant because the input file size is guaranteed to be within
    limits and pos+len are already checked to be within input file size.

    Beyond the fact that the check cannot fail: had it failed, it could have
    returned -EFBIG for an input file range error, for which there is no
    precedent; -EFBIG is returned by syscalls that would change the file
    length.

    With that call removed, we can fold generic_access_check_limits() into
    generic_write_check_limits().

    Signed-off-by: Amir Goldstein
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Amir Goldstein
     
  • Factor out helper with some checks on in/out file that are
    common to clone_file_range and copy_file_range.

    Suggested-by: Darrick J. Wong
    Signed-off-by: Amir Goldstein
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Amir Goldstein
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 May, 2019

4 commits

  • I removed the only user of this and hadn't noticed it was now unused.

    Link: http://lkml.kernel.org/r/20190430152929.21813-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Link: http://lkml.kernel.org/r/20190304155240.19215-1-ldufour@linux.ibm.com
    Signed-off-by: Laurent Dufour
    Reviewed-by: William Kucharski
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laurent Dufour
     
  • Recently I messed up the error handling in filemap_fault() because of an
    unexpected ENOMEM (related to cgroup memory limits) in add_to_page_cache.
    Enable error injection at this point so I can add a testcase to xfstests
    to verify I don't mess this up again.

    [akpm@linux-foundation.org: include linux/error-injection.h]
    Link: http://lkml.kernel.org/r/20190403152604.14008-1-josef@toxicpanda.com
    Signed-off-by: Josef Bacik
    Reviewed-by: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef Bacik
     
  • Transparent Huge Pages are currently stored in i_pages as pointers to
    consecutive subpages. This patch changes that to storing consecutive
    pointers to the head page in preparation for storing huge pages more
    efficiently in i_pages.

    Large parts of this are "inspired" by Kirill's patch
    https://lore.kernel.org/lkml/20170126115819.58875-2-kirill.shutemov@linux.intel.com/

    [willy@infradead.org: fix swapcache pages]
    Link: http://lkml.kernel.org/r/20190324155441.GF10344@bombadil.infradead.org
    [kirill@shutemov.name: hugetlb stores pages in page cache differently]
    Link: http://lkml.kernel.org/r/20190404134553.vuvhgmghlkiw2hgl@kshutemo-mobl1
    Link: http://lkml.kernel.org/r/20190307153051.18815-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jan Kara
    Reviewed-by: Kirill Shutemov
    Reviewed-and-tested-by: Song Liu
    Tested-by: William Kucharski
    Reviewed-by: William Kucharski
    Tested-by: Qian Cai
    Cc: Hugh Dickins
    Cc: Song Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

16 Mar, 2019

3 commits

  • I thought Josef Bacik's patch to drop the mmap_sem was buggy, because
    when looking at the error cases, there was one case where we returned
    VM_FAULT_RETRY without actually dropping the mmap_sem.

    Josef had to explain to me (using small words) that yes, that's actually
    what we're supposed to do, and his patch was correct. Which not only
    convinced me he knew what he was doing and I should stop arguing with
    him, but also that I should add a comment to the case I was confused
    about.

    Patiently-pointed-out-by: Josef Bacik
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Currently we only drop the mmap_sem if there is contention on the page
    lock. The idea is that we issue readahead and then go to lock the page
    while it is under IO and we want to not hold the mmap_sem during the IO.

    The problem with this is the assumption that the readahead does anything.
    In the case that the box is under extreme memory or IO pressure we may end
    up not reading anything at all for readahead, which means we will end up
    reading in the page under the mmap_sem.

    Even if the readahead does something, it could get throttled because of io
    pressure on the system and the process is in a lower priority cgroup.

    Holding the mmap_sem while doing IO is problematic because it can cause
    system-wide priority inversions. Consider some large company that does a
    lot of web traffic. This large company has load-balancing logic in its
    core web server, because some engineer thought this was a brilliant plan.
    This load-balancing logic gets statistics from /proc about the system,
    which trips over processes' mmap_sem for various reasons. Now the web
    server application is in a protected cgroup, but these other processes
    may not be, and if they are being throttled while their mmap_sem is held
    we'll stall, and cause this nice death spiral.

    Instead rework filemap fault path to drop the mmap sem at any point that
    we may do IO or block for an extended period of time. This includes while
    issuing readahead, locking the page, or needing to call ->readpage because
    readahead did not occur. Then once we have a fully uptodate page we can
    return with VM_FAULT_RETRY and come back again to find our nicely in-cache
    page that was gotten outside of the mmap_sem.

    This patch also adds a new helper for locking the page with the mmap_sem
    dropped. This doesn't make sense currently as generally speaking if the
    page is already locked it'll have been read in (unless there was an error)
    before it was unlocked. However a forthcoming patchset will change this
    with the ability to abort read-ahead bio's if necessary, making it more
    likely that we could contend for a page lock and still have a not uptodate
    page. This allows us to deal with this case by grabbing the lock and
    issuing the IO without the mmap_sem held, and then returning
    VM_FAULT_RETRY to come back around.

    [josef@toxicpanda.com: v6]
    Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
    [kirill@shutemov.name: fix race in filemap_fault()]
    Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
    Signed-off-by: Josef Bacik
    Acked-by: Johannes Weiner
    Reviewed-by: Andrew Morton
    Reviewed-by: Jan Kara
    Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
    Cc: Dave Chinner
    Cc: Rik van Riel
    Cc: Tejun Heo
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef Bacik
     
  • Patch series "drop the mmap_sem when doing IO in the fault path", v6.

    Now that we have proper isolation in place with cgroups2 we have started
    going through and fixing the various priority inversions. Most are gone
    now, but this one is sort of weird since it's not necessarily a
    priority inversion that happens within the kernel, but rather because of
    something userspace does.

    We have giant applications that we want to protect, and parts of these
    giant applications do things like watch the system state to determine how
    healthy the box is for load balancing and such. This involves running
    'ps' or other such utilities. These utilities will often walk
    /proc//whatever, and these files can sometimes need to
    down_read(&task->mmap_sem). Not usually a big deal, but we noticed while
    stress testing that our protected application sometimes has latency
    spikes trying to get the mmap_sem for tasks that are in lower priority
    cgroups.

    This is because any down_write() on a semaphore essentially turns it into
    a mutex: even if it is currently held for reading, new readers will not
    be allowed in, to keep from starving the writer. This is fine,
    except a lower priority task could be stuck doing IO because it has been
    throttled to the point that its IO is taking much longer than normal. But
    because a higher priority group depends on this completing it is now stuck
    behind lower priority work.

    In order to avoid this particular priority inversion we want to use the
    existing retry mechanism to stop from holding the mmap_sem at all if we
    are going to do IO. Something like this already exists in the read case,
    but it needed to be extended beyond just grabbing the page lock. With
    io.latency we throttle at submit_bio() time, so the readahead code can
    block and even page_cache_read can block; all of these paths need to have
    the mmap_sem dropped.

    The other big thing is ->page_mkwrite. btrfs is particularly shitty here
    because we have to reserve space for the dirty page, which can be a very
    expensive operation. We use the same retry method as the read path, and
    simply cache the page and verify the page is still setup properly the next
    pass through ->page_mkwrite().

    I've tested these patches with xfstests and there are no regressions.

    This patch (of 3):

    If we do not have a page at filemap_fault time we'll do this weird forced
    page_cache_read thing to populate the page, and then drop it again and
    loop around and find it. This makes for 2 ways we can read a page in
    filemap_fault, and it's not really needed. Instead add a FGP_FOR_MMAP
    flag so that pagecache_get_page() will return an unlocked page that's in
    pagecache. Then use the normal page locking and readpage logic already in
    filemap_fault. This simplifies the no page in page cache case
    significantly.

    [akpm@linux-foundation.org: fix comment text]
    [josef@toxicpanda.com: don't unlock null page in FGP_FOR_MMAP case]
    Link: http://lkml.kernel.org/r/20190312201742.22935-1-josef@toxicpanda.com
    Link: http://lkml.kernel.org/r/20181211173801.29535-2-josef@toxicpanda.com
    Signed-off-by: Josef Bacik
    Acked-by: Johannes Weiner
    Reviewed-by: Jan Kara
    Reviewed-by: Andrew Morton
    Cc: Tejun Heo
    Cc: Dave Chinner
    Cc: Rik van Riel
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef Bacik
     

15 Mar, 2019

1 commit

  • All of the arguments to these functions come from the vmf.

    Cut down on the amount of arguments passed by simply passing in the vmf
    to these two helpers.

    Link: http://lkml.kernel.org/r/20181211173801.29535-3-josef@toxicpanda.com
    Signed-off-by: Josef Bacik
    Reviewed-by: Andrew Morton
    Reviewed-by: Jan Kara
    Cc: Dave Chinner
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Tejun Heo
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef Bacik
     

06 Mar, 2019

5 commits

  • We have common pattern to access lru_lock from a page pointer:
    zone_lru_lock(page_zone(page))

    Which is silly, because it unfolds to this:
    &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)].zone_pgdat->lru_lock
    while we can simply do
    &NODE_DATA(page_to_nid(page))->lru_lock

    Remove the zone_lru_lock() function, since it only complicates things. Use
    'page_pgdat(page)->lru_lock' pattern instead.

    [aryabinin@virtuozzo.com: a slightly better version of __split_huge_page()]
    Link: http://lkml.kernel.org/r/20190301121651.7741-1-aryabinin@virtuozzo.com
    Link: http://lkml.kernel.org/r/20190228083329.31892-2-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Acked-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: William Kucharski
    Cc: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
    find_get_pages_range() and find_get_pages_range_tag() already correctly
    increment the reference count on the head page when seeing a compound
    page, but they may still use the page index from a tail page. The page
    index from a tail page is always zero, so these functions don't work on
    huge shmem. This hasn't been a problem because, AFAIK, nobody calls these
    functions on (huge) shmem. Fix them anyway, just in case.

    Link: http://lkml.kernel.org/r/20190110030838.84446-1-yuzhao@google.com
    Signed-off-by: Yu Zhao
    Reviewed-by: William Kucharski
    Cc: Matthew Wilcox
    Cc: Amir Goldstein
    Cc: Dave Chinner
    Cc: "Darrick J . Wong"
    Cc: Johannes Weiner
    Cc: Souptick Joarder
    Cc: Hugh Dickins
    Cc: "Kirill A . Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yu Zhao
     
    Many kernel-doc comments in mm/ have the return value descriptions
    either misformatted or omitted altogether, which makes the kernel-doc
    script unhappy:

    $ make V=1 htmldocs
    ...
    ./mm/util.c:36: info: Scanning doc for kstrdup
    ./mm/util.c:41: warning: No description found for return value of 'kstrdup'
    ./mm/util.c:57: info: Scanning doc for kstrdup_const
    ./mm/util.c:66: warning: No description found for return value of 'kstrdup_const'
    ./mm/util.c:75: info: Scanning doc for kstrndup
    ./mm/util.c:83: warning: No description found for return value of 'kstrndup'
    ...

    Fixing the formatting and adding the missing return value descriptions
    eliminates ~100 such warnings.

    Link: http://lkml.kernel.org/r/1549549644-4903-4-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • The 'end_byte' parameter of filemap_range_has_page is required to be
    inclusive, so follow the rule.

    Link: http://lkml.kernel.org/r/1548678679-18122-1-git-send-email-zhengbin13@huawei.com
    Fixes: 6be96d3ad34a ("fs: return if direct I/O will trigger writeback")
    Signed-off-by: zhengbin
    Reviewed-by: Andrew Morton
    Reviewed-by: Matthew Wilcox
    Acked-by: Christoph Hellwig
    Cc: "Darrick J. Wong"
    Cc: Amir Goldstein
    Cc: Dave Chinner
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Hou Tao
    Cc: zhangyi (F)
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhengbin
     
    After we establish a reference on the page, we check that the pointer
    continues to be in the correct position in i_pages. Checking
    page->index afterwards is unnecessary; if it were to change, then the
    pointer to it from the page cache would also move. The check used to be
    done before grabbing a reference on the page which was racy (see commit
    9cbb4cb21b19f ("mm: find_get_pages_contig fixlet")), but nobody noticed
    that moving the check after grabbing the reference was redundant.

    Link: http://lkml.kernel.org/r/20190107200224.13260-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

05 Jan, 2019

1 commit


29 Dec, 2018

1 commit

    filemap_map_pages takes a speculative reference to each page in the range
    before it tries to lock that page. While this is correct, it can also
    influence page migration, which will bail out when seeing an elevated
    reference count. The faultaround code would bail on seeing a locked page
    anyway, so we can proactively check the PageLocked bit before
    page_cache_get_speculative and prevent pointless reference count
    churn.

    Link: http://lkml.kernel.org/r/20181211142741.2607-4-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Suggested-by: Jan Kara
    Acked-by: Kirill A. Shutemov
    Reviewed-by: David Hildenbrand
    Acked-by: Hugh Dickins
    Reviewed-by: William Kucharski
    Cc: Oscar Salvador
    Cc: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko