09 Feb, 2017

1 commit

  • commit 5abf186a30a89d5b9c18a6bf93a2c192c9fd52f6 upstream.

    do_generic_file_read() can be told to perform a large request from
    userspace. If the system is under OOM and the reading task is the OOM
    victim then it has access to memory reserves, and finishing the full
    request can lead to full memory depletion, which is dangerous. Make
    sure we go with a short read instead and allow the killed task to
    terminate.
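
    As a sketch only (the loop, the 'error' variable and the 'out' label
    stand in for do_generic_file_read()'s existing structure and are
    illustrative here, not the actual patch), the shape of such a check is:

    for (;;) {
            if (fatal_signal_pending(current)) {
                    error = -EINTR;
                    goto out;       /* short read if some bytes were copied */
            }
            /* ... find the next page, wait for it, copy to userspace ... */
    }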

    Link: http://lkml.kernel.org/r/20170201092706.9966-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Christoph Hellwig
    Cc: Tetsuo Handa
    Cc: Al Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

20 Jan, 2017

1 commit

  • commit 965d004af54088d138f806d04d803fb60d441986 upstream.

    Currently in DAX if we have three read faults on the same hole address we
    can end up with the following:

    Thread 0                    Thread 1                    Thread 2
    --------                    --------                    --------
    dax_iomap_fault
      grab_mapping_entry
        lock_slot

                                dax_iomap_fault
                                  grab_mapping_entry
                                    get_unlocked_mapping_entry

                                                            dax_iomap_fault
                                                              grab_mapping_entry
                                                                get_unlocked_mapping_entry

    dax_load_hole
      find_or_create_page
      ...
    page_cache_tree_insert
      dax_wake_mapping_entry_waiter
      __radix_tree_replace

                                get_page
                                lock_page
                                ...
                                put_locked_mapping_entry
                                unlock_page
                                put_page

    The crux of the problem is that once we insert a 4k zero page, all
    locking from then on is done in terms of that 4k zero page and any
    additional threads sleeping on the empty DAX entry will never be woken.

    Fix this by waking all sleepers when we replace the DAX radix tree entry
    with a 4k zero page. This will allow all sleeping threads to
    successfully transition from locking based on the DAX empty entry to
    locking on the 4k zero page.

    With the test case reported by Xiong this happens very regularly in my
    test setup, with some runs resulting in 9+ threads in this deadlocked
    state. With this fix I've been able to run that same test dozens of
    times in a loop without issue.

    Fixes: ac401cc78242 ("dax: New fault locking")
    Link: http://lkml.kernel.org/r/1483479365-13607-1-git-send-email-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reported-by: Xiong Zhou
    Reviewed-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Ross Zwisler
     

06 Jan, 2017

1 commit

  • commit d05c5f7ba164aed3db02fb188c26d0dd94f5455b upstream.

    We truncated the possible read iterator to s_maxbytes in commit
    c2a9737f45e2 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()"),
    but our end condition handling was wrong: it's not an error to try to
    read at the end of the file.

    Reading past the end should return EOF (0), not EINVAL.

    See for example

    https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1649342
    http://lists.gnu.org/archive/html/bug-coreutils/2016-12/msg00008.html

    where an md5sum of a maximally sized file fails because the final read is
    exactly at s_maxbytes.
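
    A small userspace check of the expected semantics (illustrative program;
    the scratch file name "eoftest" is made up):

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            char buf[16];
            int fd = open("eoftest", O_RDWR | O_CREAT | O_TRUNC, 0600);

            if (fd < 0)
                    return 1;
            write(fd, "hello", 5);
            /* a read at an offset equal to the file size (or, as in the bug
               report, exactly at s_maxbytes) must report EOF, i.e. 0 */
            ssize_t n = pread(fd, buf, sizeof(buf), 5);
            printf("read at EOF returned %zd (%s)\n",
                   n, n == 0 ? "EOF, as expected" : strerror(errno));
            close(fd);
            return 0;
    }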

    Fixes: c2a9737f45e2 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()")
    Reported-by: Joseph Salisbury
    Cc: Wei Fang
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Al Viro
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     

07 Nov, 2016

1 commit

  • Starting from 4.9-rc1 kernel, I started noticing some test failures
    of sendfile(2) and splice(2) (sendfile0N and splice01 from LTP) when
    testing on sub-page block size filesystems (tested both XFS and
    ext4), these syscalls start to return EIO in the tests. e.g.

    sendfile02 1 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 26, got: -1
    sendfile02 2 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 24, got: -1
    sendfile02 3 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 22, got: -1
    sendfile02 4 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 20, got: -1

    This is because in sub-page block size cases we don't need the
    whole page to be uptodate; it's enough that the part we care about
    is uptodate (if the fs has ->is_partially_uptodate defined). But
    page_cache_pipe_buf_confirm() doesn't have the ability to check the
    partially-uptodate case; it needs the whole page to be uptodate. So
    it returns EIO in this case.

    This is a regression introduced by commit 82c156f85384 ("switch
    generic_file_splice_read() to use of ->read_iter()"). Prior to the
    change, generic_file_splice_read() didn't allow a partially-uptodate
    page either, so it worked fine.

    Fix it by skipping the partially-uptodate check if we're working on
    a pipe in do_generic_file_read(), so we read the whole page from
    disk as long as the page is not uptodate.
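
    A minimal LTP-style check of the return value (illustrative; file names
    are made up, and it assumes a kernel where sendfile(2) accepts a regular
    file as the output descriptor):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/sendfile.h>
    #include <unistd.h>

    int main(void)
    {
            static const char payload[] = "abcdefghijklmnopqrstuvwxyz";
            size_t len = sizeof(payload) - 1;
            int in = open("in.dat", O_RDWR | O_CREAT | O_TRUNC, 0600);
            int out = open("out.dat", O_WRONLY | O_CREAT | O_TRUNC, 0600);
            off_t offset = 0;
            ssize_t sent;

            if (in < 0 || out < 0)
                    return 1;
            write(in, payload, len);
            /* on an affected kernel with a sub-page block size fs this
               returned -1 (EIO) instead of the requested byte count */
            sent = sendfile(out, in, &offset, len);
            printf("sendfile returned %zd, expected %zu\n", sent, len);
            return sent == (ssize_t)len ? 0 : 1;
    }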

    Signed-off-by: Eryu Guan
    Signed-off-by: Al Viro

    Eryu Guan
     

28 Oct, 2016

1 commit

  • The per-zone waitqueues exist because of a scalability issue with the
    page waitqueues on some NUMA machines, but it turns out that they hurt
    normal loads, and now with the vmalloced stacks they also end up
    breaking gfs2 that uses a bit_wait on a stack object:

    wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE)

    where 'gh' can be a reference to the local variable 'mount_gh' on the
    stack of fill_super().

    The reason the per-zone hash table breaks for this case is that there is
    no "zone" for virtual allocations, and trying to look up the physical
    page to get at it will fail (with a BUG_ON()).

    It turns out that I actually complained to the mm people about the
    per-zone hash table for another reason just a month ago: the zone lookup
    also hurts the regular use of "unlock_page()" a lot, because the zone
    lookup ends up forcing several unnecessary cache misses and generates
    horrible code.

    As part of that earlier discussion, we had a much better solution for
    the NUMA scalability issue - by just making the page lock have a
    separate contention bit, the waitqueue doesn't even have to be looked at
    for the normal case.

    Peter Zijlstra already has a patch for that, but let's see if anybody
    even notices. In the meantime, let's fix the actual gfs2 breakage by
    simplifying the bitlock waitqueues and removing the per-zone issue.

    Reported-by: Andreas Gruenbacher
    Tested-by: Bob Peterson
    Acked-by: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andy Lutomirski
    Cc: Steven Whitehouse
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

11 Oct, 2016

2 commits

  • Pull splice fixups from Al Viro:
    "A couple of fixups for interaction of pipe-backed iov_iter with
    O_DIRECT reads + constification of a couple of primitives in uio.h
    missed by previous rounds.

    Kudos to davej - his fuzzing has caught those bugs"

    * 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    [btrfs] fix check_direct_IO() for non-iovec iterators
    constify iov_iter_count() and iter_is_iovec()
    fix ITER_PIPE interaction with direct_IO

    Linus Torvalds
     
  • by making sure we call iov_iter_advance() on original
    iov_iter even if direct_IO (done on its copy) has returned 0.
    It's a no-op for old iov_iter flavours and does the right thing
    (== truncation of the stuff we'd allocated, but not filled) in
    ITER_PIPE case. Failures (e.g. -EIO) get caught and dealt with
    by cleanup in generic_file_read_iter().
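
    The gist of the change, as a simplified sketch of the O_DIRECT branch in
    generic_file_read_iter() (variable names follow the function, but the
    surrounding details are omitted):

    struct iov_iter data = *iter;

    retval = mapping->a_ops->direct_IO(iocb, &data);
    if (retval >= 0) {
            iocb->ki_pos += retval;
            /* advance the original iterator even when retval == 0: a no-op
               for the old iov_iter flavours, but for ITER_PIPE it truncates
               the pipe buffers that were allocated and never filled */
            iov_iter_advance(iter, retval);
    }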

    Signed-off-by: Al Viro

    Al Viro
     

08 Oct, 2016

2 commits

    We triggered a dead loop in truncate_inode_pages_range() on a 32-bit
    architecture with the test case below:

    ...
    fd = open();
    write(fd, buf, 4096);
    preadv64(fd, &iovec, 1, 0xffffffff000);
    ftruncate(fd, 0);
    ...

    Then ftruncate() never returns.

    The filesystem used in this case is ubifs, but it can be triggered on
    many other filesystems.

    When preadv64() is called with offset=0xffffffff000, a page with
    index=0xffffffff will be added to the radix tree of ->mapping. Then
    this page can be found in ->mapping with pagevec_lookup(). After that,
    truncate_inode_pages_range(), which is called in ftruncate(), will fall
    into an infinite loop:

    - find a page with index=0xffffffff, since index>=end, this page won't
    be truncated

    - index++, and index become 0

    - the page with index=0xffffffff will be found again

    The data type of index is unsigned long, so index won't overflow to 0 on
    a 64-bit architecture in this case, and the dead loop won't happen.

    Since truncate_inode_pages_range() is executed with holding lock of
    inode->i_rwsem, any operation related with this lock will be blocked,
    and a hung task will happen, e.g.:

    INFO: task truncate_test:3364 blocked for more than 120 seconds.
    ...
    call_rwsem_down_write_failed+0x17/0x30
    generic_file_write_iter+0x32/0x1c0
    ubifs_write_iter+0xcc/0x170
    __vfs_write+0xc4/0x120
    vfs_write+0xb2/0x1b0
    SyS_write+0x46/0xa0

    The page with index=0xffffffff added to ->mapping is useless. Fix this
    by checking the read position before allocating pages.
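
    A self-contained version of the reproducer sketched above (illustrative;
    the file name is made up, and it assumes a 32-bit build on an affected
    kernel; on a fixed kernel the ftruncate() simply returns):

    #define _FILE_OFFSET_BITS 64
    #include <fcntl.h>
    #include <sys/uio.h>
    #include <unistd.h>

    int main(void)
    {
            char buf[4096] = { 0 };
            struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
            int fd = open("truncate_test.dat", O_RDWR | O_CREAT | O_TRUNC, 0600);

            if (fd < 0)
                    return 1;
            write(fd, buf, sizeof(buf));
            /* read far beyond EOF: on 32-bit this leaves a page with
               index 0xffffffff in the mapping's radix tree */
            preadv(fd, &iov, 1, 0xffffffff000ULL);
            /* hangs forever on an affected kernel */
            ftruncate(fd, 0);
            close(fd);
            return 0;
    }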

    Link: http://lkml.kernel.org/r/1475151010-40166-1-git-send-email-fangwei1@huawei.com
    Signed-off-by: Wei Fang
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Al Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Fang
     
  • If a fatal signal has been received, fail immediately instead of trying
    to read more data.

    If wait_on_page_locked_killable() was interrupted then this page most
    likely is not PageUptodate(), and in this case do_generic_file_read()
    will fail after lock_page_killable().

    See also commit ebded02788b5 ("mm: filemap: avoid unnecessary calls to
    lock_page when waiting for IO to complete during a read")

    [oleg@redhat.com: changelog addition]
    Link: http://lkml.kernel.org/r/63068e8e-8bee-b208-8441-a3c39a9d9eb6@sandisk.com
    Signed-off-by: Bart Van Assche
    Reviewed-by: Jan Kara
    Acked-by: Oleg Nesterov
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bart Van Assche
     

06 Oct, 2016

3 commits

  • Pull xfs and iomap updates from Dave Chinner:
    "The main things in this update are the iomap-based DAX infrastructure,
    an XFS delalloc rework, and a chunk of fixes to how log recovery
    schedules writeback to prevent spurious corruption detections when
    recovery of certain items was not required.

    The other main chunk of code is some preparation for the upcoming
    reflink functionality. Most of it is generic and cleanups that stand
    alone, but they were ready and reviewed so are in this pull request.

    Speaking of reflink, I'm currently planning to send you another pull
    request next week containing all the new reflink functionality. I'm
    working through a similar process to the last cycle, where I sent the
    reverse mapping code in a separate request because of how large it
    was. The reflink code merge is even bigger than reverse mapping, so
    I'll be doing the same thing again....

    Summary for this update:

    - change of XFS mailing list to linux-xfs@vger.kernel.org

    - iomap-based DAX infrastructure w/ XFS and ext2 support

    - small iomap fixes and additions

    - more efficient XFS delayed allocation infrastructure based on iomap

    - a rework of log recovery writeback scheduling to ensure we don't
    fail recovery when trying to replay items that are already on disk

    - some preparation patches for upcoming reflink support

    - configurable error handling fixes and documentation

    - aio access time update race fixes for XFS and
    generic_file_read_iter"

    * tag 'xfs-for-linus-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (40 commits)
    fs: update atime before I/O in generic_file_read_iter
    xfs: update atime before I/O in xfs_file_dio_aio_read
    ext2: fix possible integer truncation in ext2_iomap_begin
    xfs: log recovery tracepoints to track current lsn and buffer submission
    xfs: update metadata LSN in buffers during log recovery
    xfs: don't warn on buffers not being recovered due to LSN
    xfs: pass current lsn to log recovery buffer validation
    xfs: rework log recovery to submit buffers on LSN boundaries
    xfs: quiesce the filesystem after recovery on readonly mount
    xfs: remote attribute blocks aren't really userdata
    ext2: use iomap to implement DAX
    ext2: stop passing buffer_head to ext2_get_blocks
    xfs: use iomap to implement DAX
    xfs: refactor xfs_setfilesize
    xfs: take the ilock shared if possible in xfs_file_iomap_begin
    xfs: fix locking for DAX writes
    dax: provide an iomap based fault handler
    dax: provide an iomap based dax read/write path
    dax: don't pass buffer_head to copy_user_dax
    dax: don't pass buffer_head to dax_insert_mapping
    ...

    Linus Torvalds
     
  • Commit 22f2ac51b6d6 ("mm: workingset: fix crash in shadow node shrinker
    caused by replace_page_cache_page()") switched replace_page_cache() from
    raw radix tree operations to page_cache_tree_insert() but didn't take
    into account that the latter function, unlike the raw radix tree op,
    handles mapping->nrpages. As a result, that counter is bumped for each
    page replacement rather than being balanced out.

    The mapping->nrpages counter is used to skip needless radix tree walks
    when invalidating, truncating, syncing inodes without pages, as well as
    statistics for userspace. Since the error is positive, we'll do more
    page cache tree walks than necessary; we won't miss a necessary one.
    And we'll report more buffer pages to userspace than there are. The
    error is limited to fuse inodes.

    Fixes: 22f2ac51b6d6 ("mm: workingset: fix crash in shadow node shrinker caused by replace_page_cache_page()")
    Signed-off-by: Johannes Weiner
    Cc: Andrew Morton
    Cc: Miklos Szeredi
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When the underflow checks were added to workingset_node_shadow_dec(),
    they triggered immediately:

    kernel BUG at ./include/linux/swap.h:276!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: isofs usb_storage fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6
    soundcore wmi acpi_als pinctrl_sunrisepoint kfifo_buf tpm_tis industrialio acpi_pad pinctrl_intel tpm_tis_core tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_crypt
    CPU: 0 PID: 20929 Comm: blkid Not tainted 4.8.0-rc8-00087-gbe67d60ba944 #1
    Hardware name: System manufacturer System Product Name/Z170-K, BIOS 1803 05/06/2016
    task: ffff8faa93ecd940 task.stack: ffff8faa7f478000
    RIP: page_cache_tree_insert+0xf1/0x100
    Call Trace:
    __add_to_page_cache_locked+0x12e/0x270
    add_to_page_cache_lru+0x4e/0xe0
    mpage_readpages+0x112/0x1d0
    blkdev_readpages+0x1d/0x20
    __do_page_cache_readahead+0x1ad/0x290
    force_page_cache_readahead+0xaa/0x100
    page_cache_sync_readahead+0x3f/0x50
    generic_file_read_iter+0x5af/0x740
    blkdev_read_iter+0x35/0x40
    __vfs_read+0xe1/0x130
    vfs_read+0x96/0x130
    SyS_read+0x55/0xc0
    entry_SYSCALL_64_fastpath+0x13/0x8f
    Code: 03 00 48 8b 5d d8 65 48 33 1c 25 28 00 00 00 44 89 e8 75 19 48 83 c4 18 5b 41 5c 41 5d 41 5e 5d c3 0f 0b 41 bd ef ff ff ff eb d7 0b e8 88 68 ef ff 0f 1f 84 00
    RIP page_cache_tree_insert+0xf1/0x100

    This is a long-standing bug in the way shadow entries are accounted in
    the radix tree nodes. The shrinker needs to know when radix tree nodes
    contain only shadow entries, no pages, so node->count is split in half
    to count shadows in the upper bits and pages in the lower bits.

    Unfortunately, the radix tree implementation doesn't know of this and
    assumes all entries are in node->count. When there is a shadow entry
    directly in root->rnode and the tree is later extended, the radix tree
    implementation will copy that entry into the new node and bump its
    node->count, i.e. increases the page count bits. Once the shadow gets
    removed and we subtract from the upper counter, node->count underflows
    and triggers the warning. Afterwards, without node->count reaching 0
    again, the radix tree node is leaked.

    Limit shadow entries to when we have actual radix tree nodes and can
    count them properly. That means we lose the ability to detect refaults
    from files that had only the first page faulted in at eviction time.

    Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
    Signed-off-by: Johannes Weiner
    Reported-and-tested-by: Linus Torvalds
    Reviewed-by: Jan Kara
    Cc: Andrew Morton
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

03 Oct, 2016

1 commit

  • After the call to ->direct_IO the final reference to the file might have
    been dropped by aio_complete already, and the call to file_accessed might
    cause a use after free.

    Instead update the access time before the I/O, similar to how we
    update the time stamps before writes.
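
    As a simplified sketch of the resulting ordering (error handling and the
    rest of generic_file_read_iter() omitted):

    /* touch atime while our reference still keeps the file alive ... */
    file_accessed(file);
    retval = mapping->a_ops->direct_IO(iocb, iter);
    /* ... and not here: for AIO, aio_complete() may already have dropped
       the final reference to 'file' by the time ->direct_IO returns */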

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     

01 Oct, 2016

1 commit

  • Antonio reports the following crash when using fuse under memory pressure:

    kernel BUG at /build/linux-a2WvEb/linux-4.4.0/mm/workingset.c:346!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: all of them
    CPU: 2 PID: 63 Comm: kswapd0 Not tainted 4.4.0-36-generic #55-Ubuntu
    Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
    task: ffff88040cae6040 ti: ffff880407488000 task.ti: ffff880407488000
    RIP: shadow_lru_isolate+0x181/0x190
    Call Trace:
    __list_lru_walk_one.isra.3+0x8f/0x130
    list_lru_walk_one+0x23/0x30
    scan_shadow_nodes+0x34/0x50
    shrink_slab.part.40+0x1ed/0x3d0
    shrink_zone+0x2ca/0x2e0
    kswapd+0x51e/0x990
    kthread+0xd8/0xf0
    ret_from_fork+0x3f/0x70

    which corresponds to the following sanity check in the shadow node
    tracking:

    BUG_ON(node->count & RADIX_TREE_COUNT_MASK);

    The workingset code tracks radix tree nodes that exclusively contain
    shadow entries of evicted pages in them, and this (somewhat obscure)
    line checks whether there are real pages left that would interfere with
    reclaim of the radix tree node under memory pressure.

    While discussing ways how fuse might sneak pages into the radix tree
    past the workingset code, Miklos pointed to replace_page_cache_page(),
    and indeed there is a problem there: it properly accounts for the old
    page being removed - __delete_from_page_cache() does that - but then
    does a raw radix_tree_insert(), not accounting for the replacement
    page. Eventually the page count bits in node->count underflow while
    leaving the node incorrectly linked to the shadow node LRU.

    To address this, make sure replace_page_cache_page() uses the tracked
    page insertion code, page_cache_tree_insert(). This fixes the page
    accounting and makes sure page-containing nodes are properly unlinked
    from the shadow node LRU again.

    Also, make the sanity checks a bit less obscure by using the helpers for
    checking the number of pages and shadows in a radix tree node.

    Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
    Link: http://lkml.kernel.org/r/20160919155822.29498-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reported-by: Antonio SJ Musumeci
    Debugged-by: Miklos Szeredi
    Cc: [3.15+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

08 Aug, 2016

1 commit

  • Commit abf545484d31 changed it from an 'rw' flags type to the
    newer ops based interface, but now we're effectively leaking
    some bdev internals to the rest of the kernel. Since we only
    care about whether it's a read or a write at that level, just
    pass in a bool 'is_write' parameter instead.

    Then we can also move op_is_write() and friends back under
    CONFIG_BLOCK protection.

    Reviewed-by: Mike Christie
    Signed-off-by: Jens Axboe

    Jens Axboe
     

05 Aug, 2016

1 commit

  • The rw_page users were not converted to use bio/req ops. As a result
    bdev_write_page is not passing down REQ_OP_WRITE and the IOs will
    be sent down as reads.

    Signed-off-by: Mike Christie
    Fixes: 4e1b2d52a80d ("block, fs, drivers: remove REQ_OP compat defs and related code")

    Modified by me to:

    1) Drop op_flags passing into ->rw_page(), as we don't use it.
    2) Make op_is_write() and friends safe to use for !CONFIG_BLOCK

    Signed-off-by: Jens Axboe

    Mike Christie
     

30 Jul, 2016

1 commit

  • Pull fuse updates from Miklos Szeredi:
    "This fixes error propagation from writeback to fsync/close for
    writeback cache mode as well as adding a missing capability flag to
    the INIT message. The rest are cleanups.

    (The commits are recent but all the code actually sat in -next for a
    while now. The recommits are due to conflict avoidance and the
    addition of Cc: stable@...)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
    fuse: use filemap_check_errors()
    mm: export filemap_check_errors() to modules
    fuse: fix wrong assignment of ->flags in fuse_send_init()
    fuse: fuse_flush must check mapping->flags for errors
    fuse: fsync() did not return IO errors
    fuse: don't mess with blocking signals
    new helper: wait_event_killable_exclusive()
    fuse: improve aio directIO write performance for size extending writes

    Linus Torvalds
     

29 Jul, 2016

3 commits

    Can be used by fuse, btrfs and f2fs to replace open-coded variants.
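
    As an illustration of the kind of open-coded check it replaces (a sketch,
    not any particular filesystem's code):

    /* open-coded variant */
    err = 0;
    if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
            err = -ENOSPC;
    if (test_and_clear_bit(AS_EIO, &mapping->flags))
            err = -EIO;

    /* with the exported helper */
    err = filemap_check_errors(mapping);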

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • There are now a number of accounting oddities such as mapped file pages
    being accounted for on the node while the total number of file pages are
    accounted on the zone. This can be coped with to some extent but it's
    confusing, so this patch moves the relevant file-based accounting to the
    node. Due to
    throttling logic in the page allocator for reliable OOM detection, it is
    still necessary to track dirty and writeback pages on a per-zone basis.

    [mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
    Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Node-based reclaim requires node-based LRUs and locking. This is a
    preparation patch that just moves the lru_lock to the node so later
    patches are easier to review. It is a mechanical change but note this
    patch makes contention worse because the LRU lock is hotter and direct
    reclaim and kswapd can contend on the same lock even when reclaiming
    from different zones.

    Link: http://lkml.kernel.org/r/1467970510-21195-3-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

27 Jul, 2016

4 commits

  • Here's basic implementation of huge pages support for shmem/tmpfs.

    It's all pretty straightforward:

    - shmem_getpage() allocates a huge page if it can and tries to insert it
    into the radix tree with shmem_add_to_page_cache();

    - shmem_add_to_page_cache() puts the page onto the radix tree if there's
    space for it;

    - shmem_undo_range() removes huge pages if they are fully within the
    range. A partial truncate of a huge page zeroes out that part of the THP.

    This has a visible effect on fallocate(FALLOC_FL_PUNCH_HOLE)
    behaviour. As we don't really create a hole in this case,
    lseek(SEEK_HOLE) may have inconsistent results depending on what
    pages happened to be allocated.

    - no need to change shmem_fault(): core mm will map a compound page as
    huge if the VMA is suitable;

    Link: http://lkml.kernel.org/r/1466021202-61880-30-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    For now, we would have HPAGE_PMD_NR entries in the radix tree for every
    huge page. That's suboptimal, and it will be changed to use Matthew's
    multi-order entries later.

    The 'add' operation is not changed, because we don't need it to implement
    huge tmpfs: shmem uses its own implementation.

    Link: http://lkml.kernel.org/r/1466021202-61880-25-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The idea (and most of code) is borrowed again: from Hugh's patchset on
    huge tmpfs[1].

    Instead of allocating the pte page table upfront, we postpone this until
    we have a page to map in hand. This approach opens the possibility to map
    the page as huge if the filesystem supports this.

    Compared to Hugh's patch, I've pushed page table allocation a bit
    further: into do_set_pte(). This way we can postpone allocation even in
    the faultaround case without moving do_fault_around() after __do_fault().

    do_set_pte() got renamed to alloc_set_pte() as it can allocate page
    table if required.

    [1] http://lkml.kernel.org/r/alpine.LSU.2.11.1502202015090.14414@eggly.anvils

    Link: http://lkml.kernel.org/r/1466021202-61880-10-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The idea borrowed from Peter's patch from patchset on speculative page
    faults[1]:

    Instead of passing around the endless list of function arguments,
    replace the lot with a single structure so we can change context without
    endless function signature changes.

    The changes are mostly mechanical with exception of faultaround code:
    filemap_map_pages() got reworked a bit.

    This patch is preparation for the next one.

    [1] http://lkml.kernel.org/r/20141020222841.302891540@infradead.org

    Link: http://lkml.kernel.org/r/1466021202-61880-9-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

25 Jun, 2016

1 commit

  • This reverts commit 5c0a85fad949212b3e059692deecdeed74ae7ec7.

    The commit causes ~6% regression in unixbench.

    Let's revert it for now and consider other solution for reclaim problem
    later.

    Link: http://lkml.kernel.org/r/1465893750-44080-2-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Reported-by: "Huang, Ying"
    Cc: Linus Torvalds
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Vinayak Menon
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

27 May, 2016

1 commit

  • Pull DAX locking updates from Ross Zwisler:
    "Filesystem DAX locking for 4.7

    - We use a bit in an exceptional radix tree entry as a lock bit and
    use it similarly to how page lock is used for normal faults. This
    fixes races between hole instantiation and read faults of the same
    index.

    - Filesystem DAX PMD faults are disabled, and will be re-enabled when
    PMD locking is implemented"

    * tag 'dax-locking-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    dax: Remove i_mmap_lock protection
    dax: Use radix tree entry lock to protect cow faults
    dax: New fault locking
    dax: Allow DAX code to replace exceptional entries
    dax: Define DAX lock bit for radix tree exceptional entry
    dax: Make huge page handling depend of CONFIG_BROKEN
    dax: Fix condition for filling of PMD holes

    Linus Torvalds
     

21 May, 2016

4 commits

  • In addition to replacing the entry, we also clear all associated tags.
    This is really a one-off special for page_cache_tree_delete() which had
    far too much detailed knowledge about how the radix tree works.

    For efficiency, factor node_tag_clear() out of radix_tree_tag_clear(). It
    can be used by radix_tree_delete_item() as well as
    radix_tree_replace_clear_tags().

    Signed-off-by: Matthew Wilcox
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
    Currently, faultaround code produces young ptes. This can screw up
    vmscan behaviour[1], as it makes vmscan think that these pages are hot
    and not push them out on the first round.

    During sparse file access faultaround gets more pages mapped and all of
    them are young. Under memory pressure, this makes vmscan swap out anon
    pages instead, or to drop other page cache pages which otherwise stay
    resident.

    Modify faultaround to produce old ptes, so they can easily be reclaimed
    under memory pressure.

    This can to some extent defeat the purpose of faultaround on machines
    without a hardware accessed bit, as it will not help us with reducing the
    number of minor page faults.

    We may want to disable faultaround on such machines altogether, but
    that's subject for separate patchset.

    Minchan:
    "I tested 512M mmap sequential word read test on non-HW access bit
    system (i.e., ARM) and confirmed it doesn't increase minor fault any
    more.

    old: 4096 fault_around
    minor fault: 131291
    elapsed time: 6747645 usec

    new: 65536 fault_around
    minor fault: 131291
    elapsed time: 6709263 usec

    0.56% benefit"

    [1] https://lkml.kernel.org/r/1460992636-711-1-git-send-email-vinmenon@codeaurora.org

    Link: http://lkml.kernel.org/r/1463488366-47723-1-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Acked-by: Minchan Kim
    Tested-by: Minchan Kim
    Acked-by: Rik van Riel
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Vinayak Menon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Andres observed that his database workload is struggling with the
    transaction journal creating pressure on frequently read pages.

    Access patterns like transaction journals frequently write the same
    pages over and over, but in the majority of cases those pages are never
    read back. There are no caching benefits to be had for those pages, so
    activating them and having them put pressure on pages that do benefit
    from caching is a bad choice.

    Leave page activations to read accesses and don't promote pages based on
    writes alone.

    It could be said that partially written pages do contain cache-worthy
    data, because even if *userspace* does not access the unwritten part,
    the kernel still has to read it from the filesystem for correctness.
    However, a counter argument is that these pages enjoy at least *some*
    protection over other inactive file pages through the writeback cache,
    in the sense that dirty pages are written back with a delay and cache
    reclaim leaves them alone until they have been written back to disk.
    Should that turn out to be insufficient and we see increased read IO
    from partial writes under memory pressure, we can always go back and
    update grab_cache_page_write_begin() to take (pos, len) so that it can
    tell partial writes from pages that don't need partial reads. But for
    now, keep it simple.

    Signed-off-by: Johannes Weiner
    Reported-by: Andres Freund
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • This is a follow-up to

    http://www.spinics.net/lists/linux-mm/msg101739.html

    where Andres reported his database workingset being pushed out by the
    minimum size enforcement of the inactive file list - currently 50% of
    cache - as well as repeatedly written file pages that are never actually
    read.

    Two changes fell out of the discussions. The first change observes that
    pages that are only ever written don't benefit from caching beyond what
    the writeback cache does for partial page writes, and so we shouldn't
    promote them to the active file list where they compete with pages whose
    cached data is actually accessed repeatedly. This change comes in two
    patches - one for in-cache write accesses and one for refaults triggered
    by writes, neither of which should promote a cache page.

    Second, with the refault detection we don't need to set 50% of the cache
    aside for used-once cache anymore since we can detect frequently used
    pages even when they are evicted between accesses. We can allow the
    active list to be bigger and thus protect a bigger workingset that isn't
    challenged by streamers. Depending on the access patterns, this can
    increase major faults during workingset transitions for better
    performance during stable phases.

    This patch (of 3):

    When rewriting a page, the data in that page is replaced with new data.
    This means that evicting something else from the active file list, in
    order to cache data that will be replaced by something else, is likely
    to be a waste of memory.

    It is better to save the active list for frequently read pages, because
    reads actually use the data that is in the page.

    This patch ignores partial writes, because it is unclear whether the
    complexity of identifying those is worth any potential performance gain
    obtained from better caching pages that see repeated partial writes at
    large enough intervals to not get caught by the use-twice promotion code
    used for the inactive file list.

    Signed-off-by: Rik van Riel
    Signed-off-by: Johannes Weiner
    Reported-by: Andres Freund
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

20 May, 2016

3 commits

    page_reference manipulation functions are introduced to track down
    reference count changes of the page. Use them instead of direct
    modification of _count.
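
    For example (a sketch; free_the_page() is a made-up placeholder for
    whatever the caller does once the count drops to zero):

    /* before: direct manipulation of the counter */
    atomic_inc(&page->_count);
    if (atomic_dec_and_test(&page->_count))
            free_the_page(page);

    /* after: via the page_ref_* wrappers */
    page_ref_inc(page);
    if (page_ref_dec_and_test(page))
            free_the_page(page);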

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Johannes Berg
    Cc: "David S. Miller"
    Cc: Sunil Goutham
    Cc: Chris Metcalf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Currently DAX page fault locking is racy.

    CPU0 (write fault)                        CPU1 (read fault)

    __dax_fault()                             __dax_fault()
      get_block(inode, block, &bh, 0)
        -> not mapped
                                                get_block(inode, block, &bh, 0)
                                                  -> not mapped
      if (!buffer_mapped(&bh))
        if (vmf->flags & FAULT_FLAG_WRITE)
          get_block(inode, block, &bh, 1)
            -> allocates blocks
      if (page) -> no
                                                if (!buffer_mapped(&bh))
                                                  if (vmf->flags & FAULT_FLAG_WRITE) {
                                                  } else {
                                                    dax_load_hole();
                                                  }
      dax_insert_mapping()

    And we are in a situation where we fail in dax_radix_entry() with -EIO.

    Another problem with the current DAX page fault locking is that there is
    no race-free way to clear the dirty tag in the radix tree. We can always
    end up with a clean radix tree and dirty data in the CPU cache.

    We fix the first problem by introducing locking of exceptional radix
    tree entries in DAX mappings acting very similarly to page lock and thus
    synchronizing properly faults against the same mapping index. The same
    lock can later be used to avoid races when clearing radix tree dirty
    tag.

    Reviewed-by: NeilBrown
    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Ross Zwisler

    Jan Kara
     
  • Currently we forbid page_cache_tree_insert() to replace exceptional radix
    tree entries for DAX inodes. However, to make DAX faults race free we will
    lock radix tree entries, and when a hole is created, we need to replace
    such a locked radix tree entry with a hole page. So modify
    page_cache_tree_insert() to allow that.

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Ross Zwisler

    Jan Kara
     

02 May, 2016

5 commits


05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
    ago with promise that one day it will be possible to implement page
    cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)
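
    A typical hunk produced by the conversion looks like this (illustrative
    code, not taken from any particular file):

    /* before */
    index = pos >> PAGE_CACHE_SHIFT;
    offset = pos & ~PAGE_CACHE_MASK;
    page_cache_get(page);
    ...
    page_cache_release(page);

    /* after */
    index = pos >> PAGE_SHIFT;
    offset = pos & ~PAGE_MASK;
    get_page(page);
    ...
    put_page(page);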

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

26 Mar, 2016

1 commit

  • If
    - generic_file_read_iter() gets called with a zero read length,
    - the read offset is at a page boundary,
    - IOCB_DIRECT is not set
    - and the page in question hasn't made it into the page cache yet,
    then do_generic_file_read() will trigger a readahead with a req_size hint
    of zero.

    Since roundup_pow_of_two(0) is undefined, UBSAN reports

    UBSAN: Undefined behaviour in include/linux/log2.h:63:13
    shift exponent 64 is too large for 64-bit type 'long unsigned int'
    CPU: 3 PID: 1017 Comm: sa1 Tainted: G L 4.5.0-next-20160318+ #14
    [...]
    Call Trace:
    [...]
    [] ondemand_readahead+0x3aa/0x3d0
    [] ? ondemand_readahead+0x3aa/0x3d0
    [] ? find_get_entry+0x2d/0x210
    [] page_cache_sync_readahead+0x63/0xa0
    [] do_generic_file_read+0x80d/0xf90
    [] generic_file_read_iter+0x185/0x420
    [...]
    [] __vfs_read+0x256/0x3d0
    [...]

    when get_init_ra_size() gets called from ondemand_readahead().

    The net effect is that the initial readahead size is arch dependent for
    requested read lengths of zero: for example, since

    1UL << (sizeof(unsigned long) * 8)

    evaluates to 1 on x86 while its result is 0 on ARMv7, the initial readahead
    size becomes 4 on the former and 0 on the latter.

    What's more, whether or not the file access timestamp is updated for zero
    length reads is decided differently for the two cases of IOCB_DIRECT
    being set or cleared: in the first case, generic_file_read_iter()
    explicitly skips updating that timestamp while in the latter case, it is
    always updated through the call to do_generic_file_read().

    According to POSIX, zero length reads "do not modify the last data access
    timestamp" and thus, the IOCB_DIRECT behaviour is POSIXly correct.

    Let generic_file_read_iter() unconditionally check the requested read
    length at its entry and return immediately with success if it is zero.
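
    The shape of such a check, as a simplified sketch of the function's entry
    (only the early return is the point here; the rest of the function is
    elided):

    ssize_t generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
    {
            size_t count = iov_iter_count(iter);
            ssize_t retval = 0;

            if (!count)
                    goto out;       /* skip readahead and the atime update */
            ...
    out:
            return retval;
    }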

    Signed-off-by: Nicolai Stange
    Cc: Al Viro
    Reviewed-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolai Stange