01 Dec, 2016

1 commit

    Hugetlb pages have ->index in units of the huge page size (PMD_SIZE or
    PUD_SIZE), not in PAGE_SIZE as other types of pages do. This means we
    cannot use page_to_pgoff() to check whether we've got the right page
    for the radix-tree index.

    Let's introduce page_to_index(), which returns the radix-tree index for
    a given page.

    We will be able to get rid of this once hugetlb is switched to
    multi-order entries.
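
    A minimal sketch of the idea (a hedged illustration, not necessarily the
    exact kernel helper): for a hugetlb page, the head page's ->index already
    is the radix-tree index, while for other compound pages the tail offset
    from the head has to be added:

        /* sketch only; the in-tree helper may differ in detail */
        static inline pgoff_t page_to_index(struct page *page)
        {
                struct page *head = compound_head(page);

                if (PageHuge(head))
                        return head->index;          /* ->index is in huge-page units */
                return head->index + (page - head);  /* ->index is in PAGE_SIZE units */
        }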

    Fixes: fc127da085c2 ("truncate: handle file thp")
    Link: http://lkml.kernel.org/r/20161123093053.mjbnvn5zwxw5e6lk@black.fi.intel.com
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Doug Nelson
    Tested-by: Doug Nelson
    Reviewed-by: Naoya Horiguchi
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

27 Jul, 2016

1 commit

  • For shmem/tmpfs we only need to tweak truncate_inode_page() and
    invalidate_mapping_pages().

    truncate_inode_pages_range() and invalidate_inode_pages2_range() are
    adjusted to use page_to_pgoff().
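
    Illustrative shape of that adjustment (hedged; the exact hunks may
    differ): the index sanity checks in those loops compare the looked-up
    radix-tree index against page_to_pgoff(page) rather than raw
    page->index, so huge pages don't trip them:

        /* after locking the page found at radix-tree index 'index': */
        WARN_ON(page_to_pgoff(page) != index);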

    Link: http://lkml.kernel.org/r/1466021202-61880-26-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

20 May, 2016

1 commit

  • Currently DAX page fault locking is racy.

    CPU0 (write fault)                        CPU1 (read fault)

    __dax_fault()                             __dax_fault()
      get_block(inode, block, &bh, 0) -> not mapped
                                                get_block(inode, block, &bh, 0)
                                                  -> not mapped
      if (!buffer_mapped(&bh))
        if (vmf->flags & FAULT_FLAG_WRITE)
          get_block(inode, block, &bh, 1) -> allocates blocks
      if (page) -> no
                                                if (!buffer_mapped(&bh))
                                                  if (vmf->flags & FAULT_FLAG_WRITE) {
                                                  } else {
                                                    dax_load_hole();
                                                  }
      dax_insert_mapping()

    And we are in a situation where we fail in dax_radix_entry() with -EIO.

    Another problem with the current DAX page fault locking is that there is
    no race-free way to clear the dirty tag in the radix tree. We can always
    end up with a clean radix tree and dirty data in the CPU cache.

    We fix the first problem by introducing locking of exceptional radix
    tree entries in DAX mappings, acting very similarly to the page lock and
    thus properly synchronizing faults against the same mapping index. The
    same lock can later be used to avoid races when clearing the radix tree
    dirty tag.
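
    A hypothetical sketch of that entry-lock idea, with invented names and a
    simplified wait loop (the real code is more careful and uses wait
    queues); a spare low bit of the exceptional entry acts as the lock,
    taken and released under tree_lock:

        #define DAX_ENTRY_LOCK  (1UL << 2)  /* illustrative: a spare low bit */

        static void *lock_dax_entry(struct address_space *mapping, pgoff_t index)
        {
                void **slot, *entry;

                for (;;) {
                        spin_lock_irq(&mapping->tree_lock);
                        entry = __radix_tree_lookup(&mapping->page_tree, index,
                                                    NULL, &slot);
                        if (!entry || !radix_tree_exceptional_entry(entry)) {
                                spin_unlock_irq(&mapping->tree_lock);
                                return entry;   /* hole or real page: caller decides */
                        }
                        if (!((unsigned long)entry & DAX_ENTRY_LOCK)) {
                                /* take the entry lock: this fault owns the index */
                                radix_tree_replace_slot(slot, (void *)
                                        ((unsigned long)entry | DAX_ENTRY_LOCK));
                                spin_unlock_irq(&mapping->tree_lock);
                                return entry;
                        }
                        spin_unlock_irq(&mapping->tree_lock);
                        /* another fault holds the lock on this index: wait and
                         * retry, analogous to lock_page() sleeping on PG_locked */
                        schedule_timeout_uninterruptible(1);
                }
        }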

    Reviewed-by: NeilBrown
    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Ross Zwisler

    Jan Kara
     

05 Apr, 2016

1 commit

    PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the
    page cache with chunks bigger than PAGE_SIZE.

    This promise never materialized, and it is unlikely it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE, and it's a constant source of confusion whether the
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straightforward:

    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();
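
    For illustration, a typical converted hunk ends up looking like this
    (made-up example, not a quote from the patch):

        -       index = pos >> PAGE_CACHE_SHIFT;
        -       offset = pos & ~PAGE_CACHE_MASK;
        -       page_cache_release(page);
        +       index = pos >> PAGE_SHIFT;
        +       offset = pos & ~PAGE_MASK;
        +       put_page(page);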

    This patch contains automated changes generated with coccinelle using
    the script below. For some reason, coccinelle doesn't patch header
    files. I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

16 Mar, 2016

3 commits

  • There are several users that nest lock_page_memcg() inside lock_page()
    to prevent page->mem_cgroup from changing. But the page lock prevents
    pages from moving between cgroups, so that is unnecessary overhead.

    Remove lock_page_memcg() in contexts that already hold the page lock,
    and fix the debug code in the page stat functions to be okay with the
    page lock.
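
    In diff form, the pattern being removed is roughly (illustrative):

          lock_page(page);
        - lock_page_memcg(page);
          /* page->mem_cgroup is stable under the page lock alone */
        - unlock_page_memcg(page);
          unlock_page(page);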

    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Now that migration doesn't clear page->mem_cgroup of live pages anymore,
    it's safe to make lock_page_memcg() and the memcg stat functions take
    pages, and spare the callers from memcg objects.
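
    Illustrative before/after of a stat-update site under this change
    (sketch only, based on the old pattern quoted in the dirty-accounting
    entry further down this log):

        - memcg = mem_cgroup_begin_page_stat(page);   /* old: returns the memcg */
          if (TestClearPageDirty(page))
        -         /* ... stat update passing 'memcg' ... */
        - mem_cgroup_end_page_stat(memcg);

        + lock_page_memcg(page);                      /* new: just takes the page */
          if (TestClearPageDirty(page))
        +         /* ... stat update passing 'page' ... */
        + unlock_page_memcg(page);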

    [akpm@linux-foundation.org: fix warnings]
    Signed-off-by: Johannes Weiner
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • These patches tag the page cache radix tree eviction entries with the
    memcg an evicted page belonged to, thus making per-cgroup LRU reclaim
    work properly and be as adaptive to new cache workingsets as global
    reclaim already is.

    This should have been part of the original thrash detection patch
    series, but was deferred due to the complexity of those patches.

    This patch (of 5):

    So far the only sites that needed to exclude charge migration to
    stabilize page->mem_cgroup have been per-cgroup page statistics, hence
    the name mem_cgroup_begin_page_stat(). But per-cgroup thrash detection
    will add another site that needs to ensure page->mem_cgroup lifetime.

    Rename these locking functions to the more generic lock_page_memcg() and
    unlock_page_memcg(). Since charge migration is a cgroup1 feature only,
    we might be able to delete it at some point, along with these
    now-easy-to-identify locking sites.

    Signed-off-by: Johannes Weiner
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

23 Jan, 2016

1 commit

  • Add support for tracking dirty DAX entries in the struct address_space
    radix tree. This tree is already used for dirty page writeback, and it
    already supports the use of exceptional (non struct page*) entries.

    In order to properly track dirty DAX pages we will insert new
    exceptional entries into the radix tree that represent dirty DAX PTE or
    PMD pages. These exceptional entries will also contain the writeback
    addresses for the PTE or PMD faults that we can use at fsync/msync time.

    There are currently two types of exceptional entries (shmem and shadow)
    that can be placed into the radix tree, and this adds a third. We rely
    on the fact that only one type of exceptional entry can be found in a
    given radix tree based on its usage. This happens for free with DAX vs
    shmem but we explicitly prevent shadow entries from being added to radix
    trees for DAX mappings.

    The only shadow entries that would be generated for DAX radix trees
    would be to track zero page mappings that were created for holes. These
    pages would receive minimal benefit from having shadow entries, and the
    choice to have only one type of exceptional entry in a given radix tree
    makes the logic simpler both in clear_exceptional_entry() and in the
    rest of DAX.
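
    Purely illustrative encoding of such an entry (names and bit layout are
    assumptions, not the kernel's): the disk sector backing the PTE or PMD,
    plus a size flag, packed into an exceptional (non-pointer) radix-tree
    value:

        /* hypothetical packing, for illustration only */
        #define DAX_SHIFT       4
        #define DAX_PMD         (1UL << 2)      /* entry covers a PMD-sized fault */
        #define DAX_ENTRY(sector, pmd)                                  \
                ((void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |                \
                          ((unsigned long)(sector) << DAX_SHIFT) |      \
                          ((pmd) ? DAX_PMD : 0)))
        #define DAX_SECTOR(entry)       ((unsigned long)(entry) >> DAX_SHIFT)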

    Signed-off-by: Ross Zwisler
    Cc: "H. Peter Anvin"
    Cc: "J. Bruce Fields"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jeff Layton
    Cc: Matthew Wilcox
    Cc: Thomas Gleixner
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

02 Jun, 2015

2 commits

  • When modifying PG_Dirty on cached file pages, update the new
    MEM_CGROUP_STAT_DIRTY counter. This is done in the same places where
    global NR_FILE_DIRTY is managed. The new memcg stat is visible in the
    per memcg memory.stat cgroupfs file. The most recent past attempt at
    this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632

    The new accounting supports future efforts to add per cgroup dirty
    page throttling and writeback. It also helps an administrator break
    down a container's memory usage and provides evidence to understand
    memcg oom kills (the new dirty count is included in memcg oom kill
    messages).

    The ability to move page accounting between memcg
    (memory.move_charge_at_immigrate) makes this accounting more
    complicated than the global counter. The existing
    mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
    accounting with stat updates.
    Typical update operation:

        memcg = mem_cgroup_begin_page_stat(page)
        if (TestSetPageDirty()) {
            [...]
            mem_cgroup_update_page_stat(memcg)
        }
        mem_cgroup_end_page_stat(memcg)

    Summary of mem_cgroup_end_page_stat() overhead:
    - Without CONFIG_MEMCG it's a no-op
    - With CONFIG_MEMCG and no inter memcg task movement, it's just
      rcu_read_lock()
    - With CONFIG_MEMCG and inter memcg task movement, it's
      rcu_read_lock() + spin_lock_irqsave()

    A memcg parameter is added to several routines because their callers
    now grab mem_cgroup_begin_page_stat(), which returns the memcg later
    needed by mem_cgroup_update_page_stat().

    Because mem_cgroup_begin_page_stat() may disable interrupts, some
    adjustments are needed:
    - move __mark_inode_dirty() from __set_page_dirty() to its caller.
    __mark_inode_dirty() locking does not want interrupts disabled.
    - use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
    __delete_from_page_cache(), replace_page_cache_page(),
    invalidate_complete_page2(), and __remove_mapping().

    text data bss dec hex filename
    8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
    8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
    +192 text bytes
    8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
    8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
    +773 text bytes

    Performance tests were run on v4.0-rc1-36-g4f671fe2f952. Lower is better
    for all metrics; they're all wall-clock or cycle counts. The read and
    write fault benchmarks just measure fault time; they do not include I/O
    time.

    * CONFIG_MEMCG not set:
    baseline patched
    kbuild 1m25.030000(+-0.088% 3 samples) 1m25.426667(+-0.120% 3 samples)
    dd write 100 MiB 0.859211561 +-15.10% 0.874162885 +-15.03%
    dd write 200 MiB 1.670653105 +-17.87% 1.669384764 +-11.99%
    dd write 1000 MiB 8.434691190 +-14.15% 8.474733215 +-14.77%
    read fault cycles 254.0(+-0.000% 10 samples) 253.0(+-0.000% 10 samples)
    write fault cycles 2021.2(+-3.070% 10 samples) 1984.5(+-1.036% 10 samples)

    * CONFIG_MEMCG=y root_memcg:
    baseline patched
    kbuild 1m25.716667(+-0.105% 3 samples) 1m25.686667(+-0.153% 3 samples)
    dd write 100 MiB 0.855650830 +-14.90% 0.887557919 +-14.90%
    dd write 200 MiB 1.688322953 +-12.72% 1.667682724 +-13.33%
    dd write 1000 MiB 8.418601605 +-14.30% 8.673532299 +-15.00%
    read fault cycles 266.0(+-0.000% 10 samples) 266.0(+-0.000% 10 samples)
    write fault cycles 2051.7(+-1.349% 10 samples) 2049.6(+-1.686% 10 samples)

    * CONFIG_MEMCG=y non-root_memcg:
    baseline patched
    kbuild 1m26.120000(+-0.273% 3 samples) 1m25.763333(+-0.127% 3 samples)
    dd write 100 MiB 0.861723964 +-15.25% 0.818129350 +-14.82%
    dd write 200 MiB 1.669887569 +-13.30% 1.698645885 +-13.27%
    dd write 1000 MiB 8.383191730 +-14.65% 8.351742280 +-14.52%
    read fault cycles 265.7(+-0.172% 10 samples) 267.0(+-0.000% 10 samples)
    write fault cycles 2070.6(+-1.512% 10 samples) 2084.4(+-2.148% 10 samples)

    As expected anon page faults are not affected by this patch.

    tj: Updated to apply on top of the recent cancel_dirty_page() changes.

    Signed-off-by: Sha Zhengju
    Signed-off-by: Greg Thelen
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Greg Thelen
     
  • cancel_dirty_page() had some issues and b9ea25152e56 ("page_writeback:
    clean up mess around cancel_dirty_page()") replaced it with
    account_page_cleaned() which makes the caller responsible for clearing
    the dirty bit; unfortunately, the planned changes for cgroup writeback
    support require synchronization between dirty bit manipulation and
    stat updates. While we can open-code such synchronization in each
    account_page_cleaned() callsite, that's going to be unnecessarily
    awkward and verbose.

    This patch revives cancel_dirty_page() but in a more restricted form.
    All it does is TestClearPageDirty() followed by account_page_cleaned()
    invocation if the page was dirty. This helper covers all
    account_page_cleaned() usages except for __delete_from_page_cache()
    which is a special case anyway and left alone. As this leaves no
    module user for account_page_cleaned(), EXPORT_SYMBOL() is dropped
    from it.

    This patch just revives cancel_dirty_page() as a trivial wrapper to
    replace equivalent usages and doesn't introduce any functional
    changes.
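
    In sketch form, the revived helper is just what the paragraph above
    says (hedged; the in-tree version may differ in detail):

        void cancel_dirty_page(struct page *page)
        {
                if (TestClearPageDirty(page))
                        account_page_cleaned(page, page_mapping(page));
        }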

    Signed-off-by: Tejun Heo
    Cc: Konstantin Khlebnikov
    Signed-off-by: Jens Axboe

    Tejun Heo
     

16 Apr, 2015

1 commit

  • "deactivate_page" was created for file invalidation so it has too
    specific logic for file-backed pages. So, let's change the name of the
    function and date to a file-specific one and yield the generic name.

    Signed-off-by: Minchan Kim
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Wang, Yalin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

15 Apr, 2015

1 commit

  • This patch replaces cancel_dirty_page() with a helper function
    account_page_cleaned() which only updates counters. It's called from
    truncate_complete_page() and from try_to_free_buffers() (hack for ext3).
    Page is locked in both cases, page-lock protects against concurrent
    dirtiers: see commit 2d6d7f982846 ("mm: protect set_page_dirty() from
    ongoing truncation").

    delete_from_page_cache() shouldn't be called for dirty pages; they must
    be handled by the caller (either written back or truncated). This patch
    treats the final dirty accounting fixup at the end of
    __delete_from_page_cache() as a debug check and adds WARN_ON_ONCE()
    around it. If something removes dirty pages without proper handling,
    that might be a bug and unwritten data might be lost.
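
    Illustrative form of that debug check (a sketch of what the text
    describes):

        /* at the end of __delete_from_page_cache(): dirty pages should never
         * get here, so the old silent fixup becomes a loud one */
        if (WARN_ON_ONCE(PageDirty(page)))
                account_page_cleaned(page, mapping);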

    Hugetlbfs has no dirty pages accounting, ClearPageDirty() is enough
    here.

    cancel_dirty_page() in nfs_wb_page_cancel() is redundant. This is a
    helper for nfs_invalidate_page() and it's called only in the case of
    complete invalidation.

    The mess started in v2.6.20 with commits 46d2277c796f ("Clean up
    and make try_to_free_buffers() not race with dirty pages") and
    3e67c0987d75 ("truncate: clear page dirtiness before running
    try_to_free_buffers()"); the first was reverted right in v2.6.20 by
    commit ecdfc9787fe5 ("Resurrect 'try_to_free_buffers()' VM hackery"),
    the second in v2.6.25 by commit a2b345642f53 ("Fix dirty page
    accounting leak with ext3 data=journal").

    Custom fixes were introduced between these points: NFS in v2.6.23, commit
    1b3b4a1a2deb ("NFS: Fix a write request leak in nfs_invalidate_page()"),
    and a kludge in __delete_from_page_cache() in v2.6.24, commit 3a6927906f1b
    ("Do dirty page accounting when removing a page from the page cache").
    Since v2.6.25 all of them are redundant.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Tejun Heo
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

21 Jan, 2015

1 commit

  • Now that we got rid of the bdi abuse on character devices we can always use
    sb->s_bdi to get at the backing_dev_info for a file, except for the block
    device special case. Export inode_to_bdi and replace uses of
    mapping->backing_dev_info with it to prepare for the removal of
    mapping->backing_dev_info.
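
    A sketch of what the exported helper looks like under that description
    (the exact fallback and block-device handling are assumptions to check
    against the real code):

        struct backing_dev_info *inode_to_bdi(struct inode *inode)
        {
                struct super_block *sb;

                if (!inode)
                        return &noop_backing_dev_info;  /* assumed fallback */

                sb = inode->i_sb;
                if (sb_is_blkdev_sb(sb))        /* the block device special case */
                        return blk_get_backing_dev_info(I_BDEV(inode));
                return sb->s_bdi;
        }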

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Tejun Heo
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

07 Nov, 2014

1 commit

  • XFS doesn't always hold i_mutex when calling truncate_setsize() and it
    uses a different lock to serialize truncates and writes. So fix the
    comment before truncate_setsize().

    Reported-by: Jan Beulich
    Signed-off-by: Jan Kara
    Signed-off-by: Dave Chinner

    Jan Kara
     

30 Oct, 2014

1 commit


02 Oct, 2014

1 commit

  • ->page_mkwrite() is used by filesystems to allocate blocks under a page
    which is becoming writeably mmapped in some process' address space. This
    allows a filesystem to fail the page fault if there is not enough space
    available, the user exceeds their quota, or a similar problem happens,
    rather than silently discarding data later when writepage is called.

    However VFS fails to call ->page_mkwrite() in all the cases where
    filesystems need it when blocksize < pagesize. For example when
    blocksize = 1024, pagesize = 4096 the following is problematic:
        ftruncate(fd, 0);
        pwrite(fd, buf, 1024, 0);
        map = mmap(NULL, 1024, PROT_WRITE, MAP_SHARED, fd, 0);
        map[0] = 'a';         ----> page_mkwrite() for index 0 is called
        ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
        mremap(map, 1024, 10000, 0);
        map[4095] = 'a';      ----> no page_mkwrite() called

    At the moment ->page_mkwrite() is called, the filesystem can allocate
    only one block for the page because i_size == 1024. Otherwise it would
    create blocks beyond i_size, which is generally undesirable. But later,
    at ->writepage() time, we also need to store data at offset 4095, and we
    don't have a block allocated for it.

    This patch introduces a helper function filesystems can use to have
    ->page_mkwrite() called at all the necessary moments.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Jan Kara
     

09 Aug, 2014

1 commit

  • The memcg uncharging code that is involved towards the end of a page's
    lifetime - truncation, reclaim, swapout, migration - is impressively
    complicated and fragile.

    Because anonymous and file pages were always charged before they had their
    page->mapping established, uncharges had to happen when the page type
    could still be known from the context; as in unmap for anonymous, page
    cache removal for file and shmem pages, and swap cache truncation for swap
    pages. However, these operations happen well before the page is actually
    freed, and so a lot of synchronization is necessary:

    - Charging, uncharging, page migration, and charge migration all need
    to take a per-page bit spinlock as they could race with uncharging.

    - Swap cache truncation happens during both swap-in and swap-out, and
    possibly repeatedly before the page is actually freed. This means
    that the memcg swapout code is called from many contexts that make
    no sense and it has to figure out the direction from page state to
    make sure memory and memory+swap are always correctly charged.

    - On page migration, the old page might be unmapped but then reused,
    so memcg code has to prevent untimely uncharging in that case.
    Because this code - which should be a simple charge transfer - is so
    special-cased, it is not reusable for replace_page_cache().

    But now that charged pages always have a page->mapping, introduce
    mem_cgroup_uncharge(), which is called after the final put_page(), when we
    know for sure that nobody is looking at the page anymore.

    For page migration, introduce mem_cgroup_migrate(), which is called after
    the migration is successful and the new page is fully rmapped. Because
    the old page is no longer uncharged after migration, prevent double
    charges by decoupling the page's memcg association (PCG_USED and
    pc->mem_cgroup) from the page holding an actual charge. The new bits
    PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
    to the new page during migration.

    mem_cgroup_migrate() is suitable for replace_page_cache() as well,
    which gets rid of mem_cgroup_replace_page_cache(). However, care
    needs to be taken because both the source and the target page can
    already be charged and on the LRU when fuse is splicing: grab the page
    lock on the charge moving side to prevent changing pc->mem_cgroup of a
    page under migration. Also, the lruvecs of both pages change as we
    uncharge the old and charge the new during migration, and putback may
    race with us, so grab the lru lock and isolate the pages iff on LRU to
    prevent races and ensure the pages are on the right lruvec afterward.

    Swap accounting is massively simplified: because the page is no longer
    uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
    transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
    before the final put_page() in page reclaim.

    Finally, page_cgroup changes are now protected by whatever protection the
    page itself offers: anonymous pages are charged under the page table lock,
    whereas page cache insertions, swapin, and migration hold the page lock.
    Uncharging happens under full exclusion with no outstanding references.
    Charging and uncharging also ensure that the page is off-LRU, which
    serializes against charge migration. Remove the very costly page_cgroup
    lock and set pc->flags non-atomically.

    [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
    [vdavydov@parallels.com: fix flags definition]
    Signed-off-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Tested-by: Jet Chen
    Acked-by: Michal Hocko
    Tested-by: Felipe Balbi
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

24 Jul, 2014

1 commit

  • I wanted to revert my v3.1 commit d0823576bf4b ("mm: pincer in
    truncate_inode_pages_range"), to keep truncate_inode_pages_range() in
    synch with shmem_undo_range(); but have stepped back - a change to
    hole-punching in truncate_inode_pages_range() is a change to
    hole-punching in every filesystem (except tmpfs) that supports it.

    If there's a logical proof why no filesystem can depend for its own
    correctness on the pincer guarantee in truncate_inode_pages_range() - an
    instant when the entire hole is removed from pagecache - then let's
    revisit later. But the evidence is that only tmpfs suffered from the
    livelock, and we have no intention of extending hole-punch to ramfs. So
    for now just add a few comments (to match or differ from those in
    shmem_undo_range()), and fix one silliness noticed in d0823576bf4b...

    Its "index == start" addition to the hole-punch termination test was
    incomplete: it opened a way for the end condition to be missed, and the
    loop go on looking through the radix_tree, all the way to end of file.
    Fix that pessimization by resetting index when detected in inner loop.

    Note that it's actually hard to hit this case without the obsessive
    concurrent faulting that trinity does: normally all pages are removed in
    the initial trylock_page() pass, and this loop finds nothing to do. I
    had to "#if 0" out the initial pass to reproduce the bug and test the
    fix.

    Signed-off-by: Hugh Dickins
    Cc: Sasha Levin
    Cc: Konstantin Khlebnikov
    Cc: Lukas Czerner
    Cc: Dave Jones
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

07 May, 2014

1 commit

  • Dave Jones reports the following crash when find_get_pages_tag() runs
    into an exceptional entry:

    kernel BUG at mm/filemap.c:1347!
    RIP: find_get_pages_tag+0x1cb/0x220
    Call Trace:
    find_get_pages_tag+0x36/0x220
    pagevec_lookup_tag+0x21/0x30
    filemap_fdatawait_range+0xbe/0x1e0
    filemap_fdatawait+0x27/0x30
    sync_inodes_sb+0x204/0x2a0
    sync_inodes_one_sb+0x19/0x20
    iterate_supers+0xb2/0x110
    sys_sync+0x44/0xb0
    ia32_do_call+0x13/0x13

    1343 /*
    1344 * This function is never used on a shmem/tmpfs
    1345 * mapping, so a swap entry won't be found here.
    1346 */
    1347 BUG();

    After commit 0cd6144aadd2 ("mm + fs: prepare for non-page entries in
    page cache radix trees") this comment and BUG() are out of date because
    exceptional entries can now appear in all mappings - as shadows of
    recently evicted pages.

    However, as Hugh Dickins notes,

    "it is truly surprising for a PAGECACHE_TAG_WRITEBACK (and probably
    any other PAGECACHE_TAG_*) to appear on an exceptional entry.

    I expect it comes down to an occasional race in RCU lookup of the
    radix_tree: lacking absolute synchronization, we might sometimes
    catch an exceptional entry, with the tag which really belongs with
    the unexceptional entry which was there an instant before."

    And indeed, not only is the tree walk lockless, the tags are also read
    in chunks, one radix tree node at a time. There is plenty of time for
    page reclaim to swoop in and replace a page that was already looked up
    as tagged with a shadow entry.

    Remove the BUG() and update the comment. While reviewing all other
    lookup sites for whether they properly deal with shadow entries of
    evicted pages, update all the comments and fix memcg file charge moving
    to not miss shmem/tmpfs swapcache pages.
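
    The fix, in hedged sketch form: treat an exceptional entry found under a
    tag the way the other lookup functions already do, i.e. retry if the
    lookup slid into a different tree layout, otherwise just skip it:

                if (radix_tree_exception(page)) {
                        if (radix_tree_deref_retry(page))
                                goto restart;   /* tree changed under us */
                        /*
                         * A shadow entry of a recently evicted page, or a
                         * swap entry from shmem/tmpfs: skip it.
                         */
                        continue;
                }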

    Fixes: 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page cache radix trees")
    Signed-off-by: Johannes Weiner
    Reported-by: Dave Jones
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

04 Apr, 2014

3 commits

  • Previously, page cache radix tree nodes were freed after reclaim emptied
    out their page pointers. But now reclaim stores shadow entries in their
    place, which are only reclaimed when the inodes themselves are
    reclaimed. This is problematic for bigger files that are still in use
    after they have a significant amount of their cache reclaimed, without
    any of those pages actually refaulting. The shadow entries will just
    sit there and waste memory. In the worst case, the shadow entries will
    accumulate until the machine runs out of memory.

    To get this under control, the VM will track radix tree nodes
    exclusively containing shadow entries on a per-NUMA node list. Per-NUMA
    rather than global because we expect the radix tree nodes themselves to
    be allocated node-locally and we want to reduce cross-node references of
    otherwise independent cache workloads. A simple shrinker will then
    reclaim these nodes on memory pressure.

    A few things need to be stored in the radix tree node to implement the
    shadow node LRU and allow tree deletions coming from the list:

    1. There is no index available that would describe the reverse path
    from the node up to the tree root, which is needed to perform a
    deletion. To solve this, encode in each node its offset inside the
    parent. This can be stored in the unused upper bits of the same
    member that stores the node's height at no extra space cost.

    2. The number of shadow entries needs to be counted in addition to the
    regular entries, to quickly detect when the node is ready to go to
    the shadow node LRU list. The current entry count is an unsigned
    int but the maximum number of entries is 64, so a shadow counter
    can easily be stored in the unused upper bits.

    3. Tree modification needs tree lock and tree root, which are located
    in the address space, so store an address_space backpointer in the
    node. The parent pointer of the node is in a union with the 2-word
    rcu_head, so the backpointer comes at no extra cost as well.

    4. The node needs to be linked to an LRU list, which requires a list
    head inside the node. This does increase the size of the node, but
    it does not change the number of objects that fit into a slab page.
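
    Put together, the node ends up carrying roughly the following (field
    names are approximations for illustration, not the exact kernel
    definition):

        struct radix_tree_node {
                unsigned int    path;     /* height, plus offset in parent (upper bits) */
                unsigned int    count;    /* entries, plus shadow count (upper bits) */
                union {
                        struct {
                                struct radix_tree_node *parent;  /* while linked in a tree */
                                void *private_data;              /* address_space backpointer */
                        };
                        struct rcu_head rcu_head;                /* while being freed */
                };
                struct list_head private_list;   /* shadow-node LRU linkage */
                void __rcu      *slots[RADIX_TREE_MAP_SIZE];
                unsigned long   tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
        };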

    [akpm@linux-foundation.org: export the right function]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Reclaim will be leaving shadow entries in the page cache radix tree upon
    evicting the real page. As those pages are found from the LRU, an
    iput() can lead to the inode being freed concurrently. At this point,
    reclaim must no longer install shadow pages because the inode freeing
    code needs to ensure the page tree is really empty.

    Add an address_space flag, AS_EXITING, that the inode freeing code sets
    under the tree lock before doing the final truncate. Reclaim will check
    for this flag before installing shadow pages.
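
    A sketch of the described protocol (helper names here are assumptions):

        /* inode freeing side, before the final truncate: */
        spin_lock_irq(&mapping->tree_lock);
        mapping_set_exiting(mapping);           /* records AS_EXITING */
        spin_unlock_irq(&mapping->tree_lock);
        truncate_inode_pages(mapping, 0);

        /* reclaim side, when evicting a page: */
        void *shadow = NULL;
        if (!mapping_exiting(mapping))
                shadow = make_shadow_entry(page);  /* hypothetical helper */
        /* ... replace the page with 'shadow' (possibly NULL) in the tree ... */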

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • shmem mappings already contain exceptional entries where swap slot
    information is remembered.

    To be able to store eviction information for regular page cache, prepare
    every site dealing with the radix trees directly to handle entries other
    than pages.

    The common lookup functions will filter out non-page entries and return
    NULL for page cache holes, just as before. But provide a raw version of
    the API which returns non-page entries as well, and switch shmem over to
    use it.
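
    Hedged sketch of the resulting split: the raw lookup returns whatever
    sits in the slot, and the common wrapper filters exceptional entries
    back into "hole":

        struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
        {
                struct page *page = find_get_entry(mapping, offset);  /* raw API */

                if (radix_tree_exceptional_entry(page))
                        page = NULL;    /* shadow/swap entry: report a hole */
                return page;
        }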

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

13 Sep, 2013

1 commit


28 May, 2013

1 commit

    This commit changes truncate_inode_pages_range() so it can handle
    non-page-aligned regions of the truncate. Currently we can hit BUG_ON
    when the end of the range is not page aligned, but we can handle an
    unaligned start of the range.

    Being able to handle non-page-aligned regions can help filesystem
    punch_hole implementations and save some work, because once we're
    holding the page we might as well deal with it right away.

    In previous commits we've changed the ->invalidatepage() prototype to
    accept a 'length' argument, to be able to specify the range to
    invalidate. Now we can use that new ability in
    truncate_inode_pages_range().
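
    For example, a punch-hole path can now hand the byte-precise range
    straight in (sketch):

        /* lstart/lend are inclusive byte offsets; neither has to be page
         * aligned any more */
        truncate_inode_pages_range(inode->i_mapping, offset, offset + len - 1);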

    Signed-off-by: Lukas Czerner
    Cc: Andrew Morton
    Cc: Hugh Dickins
    Signed-off-by: Theodore Ts'o

    Lukas Czerner
     

22 May, 2013

1 commit

    Currently there is no way to truncate a partial page where the end
    truncate point is not at the end of the page. This is because it was not
    needed and the existing functionality was enough for filesystem truncate
    operations to work properly. However, more filesystems now support the
    punch hole feature, and they can benefit from mm supporting truncation
    of a page just up to a certain point.

    Specifically, with this functionality truncate_inode_pages_range() can
    be changed so it supports truncating partial page at the end of the
    range (currently it will BUG_ON() if 'end' is not at the end of the
    page).

    This commit changes the invalidatepage() address space operation
    prototype to accept the range to be invalidated, and updates all its
    instances.

    We also change block_invalidatepage() in the same way and actually make
    use of the new length argument, implementing range invalidation.

    Actual filesystem implementations will follow, except for the
    filesystems where the changes are really simple and should not change
    the behaviour in any way. An implementation of truncate_page_range(),
    which will be able to accept page-unaligned ranges, will follow as well.
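
    The prototype change, in hedged sketch form (offset and length of the
    range to invalidate within the page):

        -       void (*invalidatepage) (struct page *, unsigned long);
        +       void (*invalidatepage) (struct page *, unsigned int, unsigned int);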

    Signed-off-by: Lukas Czerner
    Cc: Andrew Morton
    Cc: Hugh Dickins

    Lukas Czerner
     

21 Dec, 2012

1 commit


09 Oct, 2012

2 commits

  • We had thought that pages could no longer get freed while still marked as
    mlocked; but Johannes Weiner posted this program to demonstrate that
    truncating an mlocked private file mapping containing COWed pages is still
    mishandled:

    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/mman.h>

    int main(void)
    {
            char *map;
            int fd;

            system("grep mlockfreed /proc/vmstat");
            fd = open("chigurh", O_CREAT|O_EXCL|O_RDWR);
            unlink("chigurh");
            ftruncate(fd, 4096);
            map = mmap(NULL, 4096, PROT_WRITE, MAP_PRIVATE, fd, 0);
            map[0] = 11;
            mlock(map, sizeof(fd));
            ftruncate(fd, 0);
            close(fd);
            munlock(map, sizeof(fd));
            munmap(map, 4096);
            system("grep mlockfreed /proc/vmstat");
            return 0;
    }

    The anon COWed pages are not caught by truncation's clear_page_mlock() of
    the pagecache pages; but unmap_mapping_range() unmaps them, so we ought to
    look out for them there in page_remove_rmap(). Indeed, why should
    truncation or invalidation be doing the clear_page_mlock() when removing
    from pagecache? mlock is a property of mapping in userspace, not a
    property of pagecache: an mlocked unmapped page is nonsensical.

    Reported-by: Johannes Weiner
    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Michel Lespinasse
    Cc: Ying Han
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • In fuzzing with trinity, lockdep protested "possible irq lock inversion
    dependency detected" when isolate_lru_page() reenabled interrupts while
    still holding the supposedly irq-safe tree_lock:

    invalidate_inode_pages2
      invalidate_complete_page2
        spin_lock_irq(&mapping->tree_lock)
          clear_page_mlock
            isolate_lru_page
              spin_unlock_irq(&zone->lru_lock)

    isolate_lru_page() is correct to enable interrupts unconditionally:
    invalidate_complete_page2() is incorrect to call clear_page_mlock() while
    holding tree_lock, which is supposed to nest inside lru_lock.

    Both truncate_complete_page() and invalidate_complete_page() call
    clear_page_mlock() before taking tree_lock to remove page from radix_tree.
    I guess invalidate_complete_page2() preferred to test PageDirty (again)
    under tree_lock before committing to the munlock; but since the page has
    already been unmapped, its state is already somewhat inconsistent, and
    it is no worse if clear_page_mlock() is moved up.
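
    In sketch form, the ordering after the fix mirrors the other two
    helpers:

        clear_page_mlock(page);         /* may take zone->lru_lock, re-enables irqs */

        spin_lock_irq(&mapping->tree_lock);     /* only then the irq-safe tree_lock */
        /* ... re-check PageDirty and remove the page from the radix tree ... */
        spin_unlock_irq(&mapping->tree_lock);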

    Reported-by: Sasha Levin
    Deciphered-by: Andrew Morton
    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michel Lespinasse
    Cc: Ying Han
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

30 May, 2012

1 commit

  • Remove vmtruncate_range(), and remove the truncate_range method from
    struct inode_operations: only tmpfs ever supported it, and tmpfs has now
    converted over to using the fallocate method of file_operations.

    Update Documentation accordingly, adding (setlease and) fallocate lines.
    And while we're in mm.h, remove duplicate declarations of shmem_lock() and
    shmem_file_setup(): everyone is now using the ones in shmem_fs.h.

    Based-on-patch-by: Cong Wang
    Signed-off-by: Hugh Dickins
    Cc: Christoph Hellwig
    Cc: Cong Wang
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

29 Mar, 2012

1 commit

  • Holepunching filesystems ext4 and xfs are using truncate_inode_pages_range
    but forgetting to unmap pages first (ocfs2 remembers). This is not really
    a bug, since races already require truncate_inode_page() to handle that
    case once the page is locked; but it can be very inefficient if the file
    being punched happens to be mapped into many vmas.

    Provide a drop-in replacement truncate_pagecache_range() which does the
    unmapping pass first, handling the awkward mismatch between arguments to
    truncate_inode_pages_range() and arguments to unmap_mapping_range().

    Note that holepunching does not unmap privately COWed pages in the range:
    POSIX requires that we do so when truncating, but it's hard to justify,
    difficult to implement without an i_size cutoff, and no filesystem is
    attempting to implement it.
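
    A sketch of such a helper under that description (the rounding details
    are assumptions to double-check against the real code):

        void truncate_pagecache_range(struct inode *inode, loff_t lstart, loff_t lend)
        {
                struct address_space *mapping = inode->i_mapping;
                loff_t unmap_start = round_up(lstart, PAGE_SIZE);
                loff_t unmap_end = round_down(1 + lend, PAGE_SIZE) - 1;

                /*
                 * unmap_mapping_range() wants a page-aligned start and a length,
                 * truncate_inode_pages_range() wants inclusive byte offsets:
                 * only unmap when at least one whole page lies inside the hole.
                 */
                if (unmap_start < unmap_end)
                        unmap_mapping_range(mapping, unmap_start,
                                            1 + unmap_end - unmap_start, 0);
                truncate_inode_pages_range(mapping, lstart, lend);
        }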

    Signed-off-by: Hugh Dickins
    Cc: "Theodore Ts'o"
    Cc: Andreas Dilger
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Ben Myers
    Cc: Alex Elder
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

23 Mar, 2012

1 commit

  • Pull cleancache changes from Konrad Rzeszutek Wilk:
    "This has some patches for the cleancache API that should have been
    submitted a _long_ time ago. They are basically cleanups:

    - rename of flush to invalidate

    - moving reporting of statistics into debugfs

    - use __read_mostly as necessary.

    Oh, and also the MAINTAINERS file change. The files (except the
    MAINTAINERS file) have been in #linux-next for months now. The late
    addition of MAINTAINERS file is a brain-fart on my side - didn't
    realize I needed that just until I was typing this up - and I based
    that patch on v3.3 - so the tree is on top of v3.3."

    * tag 'stable/for-linus-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm:
    MAINTAINERS: Adding cleancache API to the list.
    mm: cleancache: Use __read_mostly as appropiate.
    mm: cleancache: report statistics via debugfs instead of sysfs.
    mm: zcache/tmem/cleancache: s/flush/invalidate/
    mm: cleancache: s/flush/invalidate/

    Linus Torvalds
     

23 Feb, 2012

1 commit


24 Jan, 2012

1 commit

    Per akpm's suggestions, alter the use of the term "flush" to
    "invalidate". The next patch will do this across all of MM.

    This change is completely cosmetic.

    [v9: akpm@linux-foundation.org: change "flush" to "invalidate", part 3]

    Signed-off-by: Dan Magenheimer
    Cc: Kamezawa Hiroyuki
    Cc: Jan Beulich
    Reviewed-by: Seth Jennings
    Cc: Jeremy Fitzhardinge
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Nitin Gupta
    Cc: Matthew Wilcox
    Cc: Chris Mason
    Cc: Rik Riel
    Cc: Andrew Morton
    [v10: Fixed fs: move code out of buffer.c conflict change]
    Signed-off-by: Konrad Rzeszutek Wilk

    Dan Magenheimer
     

31 Oct, 2011

1 commit


04 Aug, 2011

1 commit

  • Remove PageSwapBacked (!page_is_file_cache) cases from
    add_to_page_cache_locked() and add_to_page_cache_lru(): those pages now
    go through shmem_add_to_page_cache().

    Remove a comment on maximum tmpfs size from fsstack_copy_inode_size(),
    and add a comment on swap entries to invalidate_mapping_pages().

    And mincore_page() uses find_get_page() on what might be shmem or a
    tmpfs file: allow for a radix_tree_exceptional_entry(), and proceed to
    find_get_page() on swapper_space if so (oh, swapper_space needs #ifdef).
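
    In hedged sketch form, the mincore_page() lookup becomes:

        page = find_get_page(mapping, pgoff);
        #ifdef CONFIG_SWAP
        if (radix_tree_exceptional_entry(page)) {
                /* shmem/tmpfs page that went out to swap: look it up there */
                swp_entry_t swp = radix_to_swp_entry(page);
                page = find_get_page(&swapper_space, swp.val);
        }
        #endif
        if (page) {
                present = PageUptodate(page);
                page_cache_release(page);
        }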

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

26 Jul, 2011

3 commits

  • truncate_inode_pages_range()'s final loop has a nice pincer property,
    bringing start and end together, squeezing out the last pages. But the
    range handling missed out on that, just sliding up the range, perhaps
    letting pages come in behind it. Add one more test to give it the same
    pincer effect.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Make the pagevec_lookup loops in truncate_inode_pages_range(),
    invalidate_mapping_pages() and invalidate_inode_pages2_range() more
    consistent with each other.

    They were relying upon page->index of an unlocked page, but apologizing
    for it: accept it, embrace it, add comments and WARN_ONs, and simplify the
    index handling.

    invalidate_inode_pages2_range() had special handling for a wrapped
    page->index + 1 = 0 case; but MAX_LFS_FILESIZE doesn't let us anywhere
    near there, and a corrupt page->index in the radix_tree could cause more
    trouble than that would catch. Remove that wrapped handling.

    invalidate_inode_pages2_range() uses min() to limit the pagevec_lookup
    when near the end of the range: copy that into the other two, although
    it's less useful than you might think (it limits the use of the buffer,
    rather than the indices looked up).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Use consistent variable names in truncate_pagecache(), truncate_setsize(),
    vmtruncate() and vmtruncate_range().

    unmap_mapping_range() and vmtruncate_range() have mismatched interfaces:
    don't change either, but make the vmtruncates more precise about what they
    expect unmap_mapping_range() to do.

    vmtruncate_range() is currently called only with page-aligned start and
    end+1: it can handle an unaligned start, but an unaligned end+1 would
    hit the BUG_ON in truncate_inode_pages_range() (which lacks partial
    clearing of the end page).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

21 Jul, 2011

1 commit

    i_alloc_sem is a rather special rw_semaphore. It's the last one that may
    be released by a non-owner, and its write side is always mirrored by
    real exclusion. Its intended use is to wait for all pending direct I/O
    requests to finish before starting a truncate.

    Replace it with a hand-grown construct:

    - exclusion for truncates is already guaranteed by i_mutex, so it can
      simply fall away
    - the reader side is replaced by an i_dio_count member in struct inode
      that counts the number of pending direct I/O requests. Truncate can't
      proceed as long as it's non-zero
    - when i_dio_count reaches zero we wake up a pending truncate using
      wake_up_bit on a new bit in i_flags
    - new references to i_dio_count can't appear while we are waiting for
      it to read zero because the direct I/O count always needs i_mutex
      (or an equivalent like XFS's i_iolock) for starting a new operation.

    This scheme is much simpler, and saves the space of a spinlock_t and a
    struct list_head in struct inode (typically 160 bits on a non-debug 64-bit
    system).
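
    Sketch of the scheme (helper and bit names are assumptions, not quotes
    of the patch):

        /* starting a direct I/O request (i_mutex or an equivalent held): */
        atomic_inc(&inode->i_dio_count);

        /* completing a direct I/O request: */
        if (atomic_dec_and_test(&inode->i_dio_count))
                wake_up_bit(&inode->i_flags, __I_DIO_WAKEUP);  /* bit name assumed */

        /* truncate side, under i_mutex: sleep until the count reads zero */
        inode_dio_wait(inode);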

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

28 Jun, 2011

1 commit

  • Under heavy memory and filesystem load, users observe the assertion
    mapping->nrpages == 0 in end_writeback() trigger. This can be caused by
    page reclaim reclaiming the last page from a mapping in the following
    race:

    CPU0                                  CPU1
                                          ...
                                          shrink_page_list()
                                            __remove_mapping()
                                              __delete_from_page_cache()
                                                radix_tree_delete()
    evict_inode()
      truncate_inode_pages()
        truncate_inode_pages_range()
          pagevec_lookup() - finds nothing
    end_writeback()
      mapping->nrpages != 0 -> BUG
                                                page->mapping = NULL
                                                mapping->nrpages--

    Fix the problem by doing a reliable check of mapping->nrpages under
    mapping->tree_lock in end_writeback().
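
    A hedged sketch of that check:

        /* in end_writeback(): take tree_lock so a concurrent
         * __delete_from_page_cache() can't leave nrpages transiently stale */
        spin_lock_irq(&mapping->tree_lock);
        BUG_ON(mapping->nrpages);
        spin_unlock_irq(&mapping->tree_lock);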

    Analyzed by Jay, lost in LKML, and dug out by Miklos Szeredi.

    Cc: Jay
    Cc: Miklos Szeredi
    Signed-off-by: Jan Kara
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara