05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
    time ago with the promise that one day it would be possible to
    implement the page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE, and it's a constant source of confusion whether
    PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
    especially on the border between fs and mm.

    Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too
    much breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straightforward:

    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    the script below. For some reason, coccinelle doesn't patch header
    files, so I've run spatch on them manually.

    The only manual adjustment after coccinelle is a revert of the changes
    to the PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)
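
    For illustration, a hypothetical before/after of the kind of code the
    script rewrites (not taken from the actual diff):

    /* before */
    index = pos >> PAGE_CACHE_SHIFT;
    offset = pos & ~PAGE_CACHE_MASK;
    page_cache_release(page);

    /* after */
    index = pos >> PAGE_SHIFT;
    offset = pos & ~PAGE_MASK;
    put_page(page);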

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

18 Mar, 2016

3 commits

  • shmem likes to occasionally drop the lock, schedule, then reacquire
    the lock and continue with the iteration from the last place it left
    off. This is currently done with a pretty ugly goto. Introduce
    radix_tree_iter_next() and use it throughout shmem.c.
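
    A minimal sketch of the resulting pattern, modeled on the shmem
    iteration described above (mapping and start are assumed context):

    struct radix_tree_iter iter;
    void **slot;

    rcu_read_lock();
    radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
            /* ... process the entry at iter.index ... */
            if (need_resched()) {
                    cond_resched_rcu();     /* drops and retakes rcu lock */
                    slot = radix_tree_iter_next(&iter); /* resume at next index */
            }
    }
    rcu_read_unlock();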

    [koct9i@gmail.com: fix bug in radix_tree_iter_next() for tagged iteration]
    Signed-off-by: Matthew Wilcox
    Cc: Hugh Dickins
    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Instead of a 'goto restart', we can now use radix_tree_iter_retry() to
    restart from our current position. This will make a difference when
    there are more ways to come across an indirect pointer. And it
    eliminates some confusing gotos.
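
    A sketch of the retry idiom (modeled on the find_get_entry-style loops
    this patch converts; names assumed from that era):

    page = radix_tree_deref_slot(slot);
    if (radix_tree_deref_retry(page)) {
            /* raced with an indirect pointer: retry this index */
            slot = radix_tree_iter_retry(&iter);
            continue;
    }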

    [vbabka@suse.cz: remove now-obsolete-and-misleading comment]
    Signed-off-by: Matthew Wilcox
    Cc: Hugh Dickins
    Cc: Konstantin Khlebnikov
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Most of the mm subsystem uses pr_<level>, so make it consistent.

    Miscellanea:

    - Realign arguments
    - Add missing newline to format
    - kmemleak-test.c has a "kmemleak: " prefix added to the
    "Kmemleak testing" logging message via pr_fmt

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     

16 Mar, 2016

1 commit

  • Changing a page's memcg association complicates dealing with the page,
    so we want to limit this as much as possible. Page migration e.g. does
    not have to do that. Just like page cache replacement, it can forcibly
    charge a replacement page, and then uncharge the old page when it gets
    freed. Temporarily overcharging the cgroup by a single page is not an
    issue in practice, and charging is so cheap nowadays that this is much
    preferable to the headache of messing with live pages.

    The only place that still changes the page->mem_cgroup binding of live
    pages is when pages move along with a task to another cgroup. But that
    path isolates the page from the LRU, takes the page lock, and the move
    lock (lock_page_memcg()). That means page->mem_cgroup is always stable
    in callers that have the page isolated from the LRU or locked. Lighter
    unlocked paths, like writeback accounting, can use lock_page_memcg().
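
    A rough sketch of such an unlocked path (helper names as in the
    writeback code of that era; surrounding details elided):

    lock_page_memcg(page);          /* page->mem_cgroup is stable from here */
    if (TestClearPageDirty(page))
            account_page_cleaned(page, mapping, wb);
    unlock_page_memcg(page);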

    [akpm@linux-foundation.org: fix build]
    [vdavydov@virtuozzo.com: fix lockdep splat]
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

23 Jan, 2016

2 commits


21 Jan, 2016

1 commit

  • This patchset introduces swap accounting to cgroup2.

    This patch (of 7):

    In the legacy hierarchy we charge memsw, which is dubious, because:

    - memsw.limit must be >= memory.limit, so it is impossible to limit
    swap usage to less than memory usage. Taking into account the fact that
    the primary limiting mechanism in the unified hierarchy is
    memory.high while memory.limit is either left unset or set to a very
    large value, moving memsw.limit knob to the unified hierarchy would
    effectively make it impossible to limit swap usage according to the
    user preference.

    - memsw.usage != memory.usage + swap.usage, because a page occupying
    both a swap entry and a swap cache page is charged only once to the memsw
    counter. As a result, it is possible to effectively eat up to
    memory.limit of memory pages *and* memsw.limit of swap entries, which
    looks unexpected.

    That said, we should provide a different swap limiting mechanism for
    cgroup2.

    This patch adds mem_cgroup->swap counter, which charges the actual number
    of swap entries used by a cgroup. It is only charged in the unified
    hierarchy, while the legacy hierarchy memsw logic is left intact.

    The swap usage can be monitored using the new memory.swap.current file
    and limited using memory.swap.max.
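
    For example, from the shell (cgroup2 mount point and group name
    illustrative):

    # echo 1G >/sys/fs/cgroup/<group>/memory.swap.max
    # cat /sys/fs/cgroup/<group>/memory.swap.current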

    Note, to charge swap resource properly in the unified hierarchy, we have
    to make swap_entry_free uncharge swap only when ->usage reaches zero, not
    just ->count, i.e. when all references to a swap entry, including the one
    taken by swap cache, are gone. This is necessary, because otherwise
    swap-in could result in uncharging swap even if the page is still in swap
    cache and hence still occupies a swap entry. At the same time, this
    shouldn't break memsw counter logic, where a page is never charged twice
    for using both memory and swap, because in case of legacy hierarchy we
    uncharge swap on commit (see mem_cgroup_commit_charge).

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

16 Jan, 2016

3 commits

  • As with rmap, with the new refcounting we cannot rely on
    PageTransHuge() to check if we need to charge the size of a huge page
    to the cgroup. We need to get the information from the caller to know
    whether it was mapped with a PMD or a PTE.

    We uncharge when the last reference to the page is gone. At that
    point, if we see PageTransHuge() it means we need to uncharge the
    whole huge page.

    The tricky part is partial unmap -- when we try to unmap part of a
    huge page. We don't do any special handling of this situation, meaning
    we don't uncharge the part of the huge page unless the last user is
    gone or split_huge_page() is triggered. In case cgroup memory pressure
    happens, the partially unmapped page will be split through the
    shrinker. This should be good enough.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • lock_page() must operate on the whole compound page. It doesn't make
    much sense to lock part of a compound page. Change the code to use the
    head page's PG_locked if a tail page is passed.

    This patch also gets rid of the custom helper functions --
    __set_page_locked() and __clear_page_locked(). They are replaced with
    helpers generated by __SETPAGEFLAG/__CLEARPAGEFLAG. Passing a tail
    page to these helpers would trigger VM_BUG_ON().

    SLUB uses PG_locked as a bit spinlock. IIUC, tail pages should never
    appear there. VM_BUG_ON() is added to make sure that this assumption
    is correct.
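
    A sketch of the head-page redirection (essentially the shape of the
    helper in that era's pagemap.h):

    static inline int trylock_page(struct page *page)
    {
            page = compound_head(page);     /* PG_locked lives on the head */
            return likely(!test_and_set_bit_lock(PG_locked, &page->flags));
    }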

    [akpm@linux-foundation.org: fix fs/cifs/file.c]
    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Merge first patch-bomb from Andrew Morton:

    - A few hotfixes which missed 4.4 because I was asleep. cc'ed to
    -stable

    - A few misc fixes

    - OCFS2 updates

    - Part of MM. Including pretty large changes to page-flags handling
    and to thp management which have been buffered up for 2-3 cycles now.

    I have a lot of MM material this time.

    [ It turns out the THP part wasn't quite ready, so that got dropped from
    this series - Linus ]

    * emailed patches from Andrew Morton : (117 commits)
    zsmalloc: reorganize struct size_class to pack 4 bytes hole
    mm/zbud.c: use list_last_entry() instead of list_tail_entry()
    zram/zcomp: do not zero out zcomp private pages
    zram: pass gfp from zcomp frontend to backend
    zram: try vmalloc() after kmalloc()
    zram/zcomp: use GFP_NOIO to allocate streams
    mm: add tracepoint for scanning pages
    drivers/base/memory.c: fix kernel warning during memory hotplug on ppc64
    mm/page_isolation: use macro to judge the alignment
    mm: fix noisy sparse warning in LIBCFS_ALLOC_PRE()
    mm: rework virtual memory accounting
    include/linux/memblock.h: fix ordering of 'flags' argument in comments
    mm: move lru_to_page to mm_inline.h
    Documentation/filesystems: describe the shared memory usage/accounting
    memory-hotplug: don't BUG() in register_memory_resource()
    hugetlb: make mm and fs code explicitly non-modular
    mm/swapfile.c: use list_for_each_entry_safe in free_swap_count_continuations
    mm: /proc/pid/clear_refs: no need to clear VM_SOFTDIRTY in clear_soft_dirty_pmd()
    mm: make sure isolate_lru_page() is never called for tail page
    vmstat: make vmstat_updater deferrable again and shut down on idle
    ...

    Linus Torvalds
     

15 Jan, 2016

4 commits

  • Following the previous patch, further reduction of /proc/pid/smaps cost
    is possible for private writable shmem mappings with unpopulated areas,
    where the page walk invokes the .pte_hole function. We can use the radix
    tree iterator for each such area instead of calling find_get_entry() in
    a loop. This is possible at the extra maintenance cost of introducing
    another shmem function, shmem_partial_swap_usage(), sketched below.
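
    A condensed sketch of shmem_partial_swap_usage() (assuming the radix
    tree helpers of that era; locking and corner cases elided):

    unsigned long swapped = 0;
    struct radix_tree_iter iter;
    void **slot;

    rcu_read_lock();
    radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
            struct page *page;

            if (iter.index >= end)
                    break;
            page = radix_tree_deref_slot(slot);
            if (radix_tree_deref_retry(page)) {
                    slot = radix_tree_iter_retry(&iter);
                    continue;
            }
            if (radix_tree_exceptional_entry(page))
                    swapped++;      /* swap entries are exceptional entries */
            if (need_resched()) {
                    cond_resched_rcu();
                    slot = radix_tree_iter_next(&iter);
            }
    }
    rcu_read_unlock();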

    To demonstrate the difference, I have measured this on a process that
    creates a private writable 2GB mapping of a partially swapped out
    /dev/shm/file (which cannot employ the optimizations from the previous
    patch) and doesn't populate it at all. I timed how long it takes to
    cat /proc/pid/smaps of this process 100 times.

    Before this patch:

    real 0m3.831s
    user 0m0.180s
    sys 0m3.212s

    After this patch:

    real 0m1.176s
    user 0m0.180s
    sys 0m0.684s

    The time is similar to the case where a radix tree iterator is employed
    on the whole mapping.

    Signed-off-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Jerome Marchand
    Cc: Konstantin Khlebnikov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The previous patch improved swap accounting for shmem mappings, which
    however made /proc/pid/smaps more expensive for them, as we consult
    the radix tree for each pte_none entry, so the overall complexity
    is O(n*log(n)).

    We can reduce this significantly for mappings that cannot contain COWed
    pages, because then we can either use the statistics that the shmem
    object itself tracks (if the mapping contains the whole object, or the
    swap usage of the whole object is zero), or use the radix tree iterator,
    which is much more effective than repeated find_get_entry() calls.

    This patch therefore introduces a function shmem_swap_usage(vma) and
    makes /proc/pid/smaps use it when possible. Only for writable private
    mappings of shmem objects (i.e. tmpfs files) with the shmem object
    itself (partially) swapped out do we have to resort to the
    find_get_entry() approach.

    Hopefully such mappings are relatively uncommon.
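
    A sketch of the resulting fast paths (field and helper names as in the
    shmem code of that era):

    unsigned long shmem_swap_usage(struct vm_area_struct *vma)
    {
            struct inode *inode = file_inode(vma->vm_file);
            struct shmem_inode_info *info = SHMEM_I(inode);
            unsigned long swapped = READ_ONCE(info->swapped);

            if (!swapped)
                    return 0;       /* whole object has nothing in swap */

            /* vma maps the whole object: the per-object count is exact */
            if (!vma->vm_pgoff && vma->vm_end - vma->vm_start >= inode->i_size)
                    return swapped << PAGE_SHIFT;

            /* otherwise, walk just the mapped range */
            return shmem_partial_swap_usage(inode->i_mapping,
                            linear_page_index(vma, vma->vm_start),
                            linear_page_index(vma, vma->vm_end));
    }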

    To demonstrate the difference, I have measured this on a process that
    creates a 2GB mapping and dirties single pages with a stride of 2MB,
    and timed how long it takes to cat /proc/pid/smaps of this process 100
    times.

    Private writable mapping of a /dev/shm/file (the most complex case):

    real 0m3.831s
    user 0m0.180s
    sys 0m3.212s

    Shared mapping of an almost fully populated, partially swapped
    /dev/shm/file (which needs to employ the radix tree iterator):

    real 0m1.351s
    user 0m0.096s
    sys 0m0.768s

    Same, but with /dev/shm/file not swapped (so no radix tree walk needed)

    real 0m0.935s
    user 0m0.128s
    sys 0m0.344s

    Private anonymous mapping:

    real 0m0.949s
    user 0m0.116s
    sys 0m0.348s

    The cost is now much closer to the private anonymous mapping case, unless
    the shmem mapping is private and writable.

    Signed-off-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Jerome Marchand
    Cc: Konstantin Khlebnikov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Mark those kmem allocations that are known to be easily triggered from
    userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
    memcg. For the list, see below:

    - threadinfo
    - task_struct
    - task_delay_info
    - pid
    - cred
    - mm_struct
    - vm_area_struct and vm_region (nommu)
    - anon_vma and anon_vma_chain
    - signal_struct
    - sighand_struct
    - fs_struct
    - files_struct
    - fdtable and fdtable->full_fds_bits
    - dentry and external_name
    - inode for all filesystems. This is the most tedious part, because
    most filesystems override the alloc_inode method.

    The list is far from complete, so feel free to add more objects.
    Nevertheless, it should be close to the "account everything" approach
    and keep most workloads within bounds. Malevolent users will be able
    to breach the limit, but this was possible even with the former
    "account everything" approach (simply because it did not, in fact,
    account everything).
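
    A sketch of what the annotations look like at allocation sites (the
    particular cache and call sites are illustrative):

    /* slab cache: objects get charged to the allocating task's memcg */
    vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);

    /* one-off allocation: charge explicitly via the gfp mask */
    fdt = kmalloc(sizeof(*fdt), GFP_KERNEL_ACCOUNT);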

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • inode_nohighmem() is sufficient to make sure that page_get_link()
    won't try to allocate a highmem page. Moreover, it is sufficient
    to make sure that page_symlink/__page_symlink won't do the same
    thing. However, any filesystem that manually preseeds the symlink's
    page cache upon symlink(2) needs to make sure that the page it
    inserts there won't be a highmem one.

    Fortunately, only nfs and shmem have run afoul of that...
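
    inode_nohighmem() itself is tiny -- as of this era it just drops
    highmem from the inode's page cache allocation mask:

    static inline void inode_nohighmem(struct inode *inode)
    {
            mapping_set_gfp_mask(inode->i_mapping, GFP_USER);
    }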

    Signed-off-by: Al Viro

    Al Viro
     

12 Jan, 2016

2 commits

  • Pull vfs xattr updates from Al Viro:
    "Andreas' xattr cleanup series.

    It's a followup to his xattr work that went in last cycle; -0.5KLoC"

    * 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    xattr handlers: Simplify list operation
    ocfs2: Replace list xattr handler operations
    nfs: Move call to security_inode_listsecurity into nfs_listxattr
    xfs: Change how listxattr generates synthetic attributes
    tmpfs: listxattr should include POSIX ACL xattrs
    tmpfs: Use xattr handler infrastructure
    btrfs: Use xattr handler infrastructure
    vfs: Distinguish between full xattr names and proper prefixes
    posix acls: Remove duplicate xattr name definitions
    gfs2: Remove gfs2_xattr_acl_chmod
    vfs: Remove vfs_xattr_cmp

    Linus Torvalds
     
  • Pull vfs RCU symlink updates from Al Viro:
    "Replacement of ->follow_link/->put_link, allowing to stay in RCU mode
    even if the symlink is not an embedded one.

    No changes since the mailbomb on Jan 1"

    * 'work.symlinks' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    switch ->get_link() to delayed_call, kill ->put_link()
    kill free_page_put_link()
    teach nfs_get_link() to work in RCU mode
    teach proc_self_get_link()/proc_thread_self_get_link() to work in RCU mode
    teach shmem_get_link() to work in RCU mode
    teach page_get_link() to work in RCU mode
    replace ->follow_link() with new method that could stay in RCU mode
    don't put symlink bodies in pagecache into highmem
    namei: page_getlink() and page_follow_link_light() are the same thing
    ufs: get rid of ->setattr() for symlinks
    udf: don't duplicate page_symlink_inode_operations
    logfs: don't duplicate page_symlink_inode_operations
    switch befs long symlinks to page_symlink_operations

    Linus Torvalds
     

31 Dec, 2015

1 commit


13 Dec, 2015

1 commit

  • Dmitry Vyukov provides a little program, autogenerated by syzkaller,
    which races a fault on a mapping of a sparse memfd object, against
    truncation of that object below the fault address: run repeatedly for a
    few minutes, it reliably generates shmem_evict_inode()'s
    WARN_ON(inode->i_blocks).

    (But there's nothing specific to memfd here, nor to the fstat which it
    happened to use to generate the fault: though that looked suspicious,
    since a shmem_recalc_inode() had been added there recently. The same
    problem can be reproduced with open+unlink in place of memfd_create, and
    with fstatfs in place of fstat.)

    v3.7 commit 0f3c42f522dc ("tmpfs: change final i_blocks BUG to WARNING")
    explains one cause of such a warning (a race with shmem_writepage to
    swap), and possible solutions; but we never took it further, and this
    syzkaller incident turns out to have a different cause.

    shmem_getpage_gfp()'s error recovery, when a freshly allocated page is
    then found to be beyond eof, looks plausible - decrementing the alloced
    count that was just incremented - but in fact can go wrong, if a racing
    thread (the truncator, for example) gets its shmem_recalc_inode() in
    just after our delete_from_page_cache(). delete_from_page_cache()
    decrements nrpages, so that shmem_recalc_inode() will balance the books
    by decrementing alloced itself; then our own decrement of alloced takes
    it one too low: leading to the WARNING when the object is finally
    evicted.

    Once the new page has been exposed in the page cache,
    shmem_getpage_gfp() must leave it to shmem_recalc_inode() itself to get
    the accounting right in all cases (and not fall through from "trunc:" to
    "decused:"). Adjust that error recovery block; and the reinitialization
    of info and sbinfo can be removed too.

    While we're here, fix shmem_writepage() to avoid the original issue: it
    will be safe against a racing shmem_recalc_inode(), if it merely
    increments swapped before the shmem_delete_from_page_cache() which
    decrements nrpages (but it must then do its own shmem_recalc_inode()
    before that, while still in balance, instead of after). (Aside: why do
    we shmem_recalc_inode() here in the swap path? Because its raison d'etre
    is to cope with clean sparse shmem pages being reclaimed behind our
    back: so here when swapping is a good place to look for that case.) But
    I've not now managed to reproduce this bug, even without the patch.

    I don't see why I didn't do that earlier: perhaps inhibited by the
    preference to eliminate shmem_recalc_inode() altogether. Driven by this
    incident, I do now have a patch to do so at last; but still want to sit
    on it for a bit, there's a couple of questions yet to be resolved.

    Signed-off-by: Hugh Dickins
    Reported-by: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

09 Dec, 2015

3 commits

  • Signed-off-by: Al Viro

    Al Viro
     
  • new method: ->get_link(); replacement of ->follow_link(). The differences
    are:
    * inode and dentry are passed separately
    * might be called both in RCU and non-RCU mode;
    the former is indicated by passing it a NULL dentry.
    * when called that way it isn't allowed to block
    and should return ERR_PTR(-ECHILD) if it needs to be called
    in non-RCU mode.

    It's a flagday change - the old method is gone, all in-tree instances
    converted. Conversion isn't hard; that said, so far very few instances
    do not immediately bail out when called in RCU mode. That'll change
    in the next commits.
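
    A sketch of the RCU bail-out an instance performs (foo_ is a
    hypothetical filesystem; this is the cookie-based signature introduced
    here, later replaced by delayed_call):

    static const char *foo_get_link(struct dentry *dentry,
                                    struct inode *inode, void **cookie)
    {
            if (!dentry)                     /* called in RCU mode */
                    return ERR_PTR(-ECHILD); /* retry me in non-RCU mode */
            /* a blocking instance would do its sleeping work here */
            return inode->i_link;            /* e.g. an inline symlink body */
    }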

    Signed-off-by: Al Viro

    Al Viro
     
  • kmap() in page_follow_link_light() needed to go - allowing an
    arbitrary number of kmaps to be held for long is a great way to
    deadlock the system.

    A new helper (inode_nohighmem(inode)) needs to be used for pagecache
    symlink inodes; this is done for all in-tree cases.
    page_follow_link_light() is instrumented to yell about anything
    missed.

    Signed-off-by: Al Viro

    Al Viro
     

07 Dec, 2015

2 commits

  • When a file on tmpfs has an ACL or a Default ACL, listxattr should include the
    corresponding xattr name.

    Signed-off-by: Andreas Gruenbacher
    Reviewed-by: James Morris
    Cc: Hugh Dickins
    Cc: linux-mm@kvack.org
    Signed-off-by: Al Viro

    Andreas Gruenbacher
     
  • Use the VFS xattr handler infrastructure and get rid of similar code in
    the filesystem. For implementing shmem_xattr_handler_set, we need a
    version of simple_xattr_set which removes the attribute when value is
    NULL. Use this to implement kernfs_iop_removexattr as well.
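
    The removal case then becomes (names from shmem; the flag choice is
    illustrative):

    /* value == NULL removes the attribute */
    err = simple_xattr_set(&info->xattrs, name, NULL, 0, XATTR_REPLACE);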

    Signed-off-by: Andreas Gruenbacher
    Reviewed-by: James Morris
    Cc: Hugh Dickins
    Cc: linux-mm@kvack.org
    Signed-off-by: Al Viro

    Andreas Gruenbacher
     

07 Nov, 2015

1 commit

  • Andrew stated the following

    We have quite a history of remote parts of the kernel using
    weird/wrong/inexplicable combinations of __GFP_ flags. I tend
    to think that this is because we didn't adequately explain the
    interface.

    And I don't think that gfp.h really improved much in this area as
    a result of this patchset. Could you go through it some time and
    decide if we've adequately documented all this stuff?

    This patch first moves some GFP flag combinations that are part of the MM
    internals to mm/internal.h. The rest of the patch documents the __GFP_FOO
    bits under various headings and then documents the flag combinations. It
    will not help callers that are brain damaged but the clarity might motivate
    some fixes and avoid future mistakes.

    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

06 Nov, 2015

2 commits

  • LKP reports that v4.2 commit afa2db2fb6f1 ("tmpfs: truncate prealloc
    blocks past i_size") causes a 14.5% slowdown in the AIM9 creat-clo
    benchmark.

    creat-clo does just what you'd expect from the name, and creat's O_TRUNC
    on a 0-length file does indeed incur more overhead now that
    shmem_setattr() tests "0 < 0".

    I'm not sure how much we care, but I think it would not be too VW-like to
    add in a check for whether any pages (or swap) are allocated: if none are
    allocated, there's none to remove from the radix_tree. At first I thought
    that check would be good enough for the unmaps too, but no: we should not
    skip the unlikely case of unmapping pages beyond the new EOF, which were
    COWed from holes which have now been reclaimed, leaving none.

    This gives me an 8.5% speedup: on Haswell instead of LKP's Westmere, and
    running a debug config before and after: I hope those account for the
    lesser speedup.

    And probably someone has a benchmark where a thousand threads keep on
    stat'ing the same file repeatedly: forestall that report by adjusting v4.3
    commit 44a30220bc0a ("shmem: recalculate file inode when fstat") not to
    take the spinlock in shmem_getattr() when there's no work to do.
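
    A sketch of the shmem_getattr() part (field names as in shmem): only
    take the lock when the counters can actually be out of balance:

    if (info->alloced - info->swapped != inode->i_mapping->nrpages) {
            spin_lock(&info->lock);
            shmem_recalc_inode(inode);
            spin_unlock(&info->lock);
    }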

    Signed-off-by: Hugh Dickins
    Reported-by: Ying Huang
    Tested-by: Ying Huang
    Cc: Josef Bacik
    Cc: Yu Zhao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • After v4.3's commit 0610c25daa3e ("memcg: fix dirty page migration")
    mem_cgroup_migrate() doesn't have much to offer in page migration: convert
    migrate_misplaced_transhuge_page() to set_page_memcg() instead.

    Then rename mem_cgroup_migrate() to mem_cgroup_replace_page(), since its
    remaining callers are replace_page_cache_page() and shmem_replace_page():
    both of whom passed lrucare true, so just eliminate that argument.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

09 Sep, 2015

1 commit

  • Shmem uses shmem_recalc_inode to update i_blocks when it allocates a
    page, undoes a range, or swaps. But mm can drop a clean page without
    notifying shmem. This makes fstat sometimes return an out-of-date
    block count.

    The problem can be partially solved by adding
    inode_operations->getattr, which calls shmem_recalc_inode to update
    i_blocks for fstat.

    shmem_recalc_inode also updates the counters used by statfs and
    vm_committed_as. For them the situation is unchanged: they still
    suffer from the discrepancy after a clean page is dropped and before
    the function is called by the aforementioned triggers.

    Signed-off-by: Yu Zhao
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yu Zhao
     

07 Aug, 2015

1 commit

  • The shm implementation internally uses shmem or hugetlbfs inodes for shm
    segments. As these inodes are never directly exposed to userspace and
    only accessed through the shm operations which are already hooked by
    security modules, mark the inodes with the S_PRIVATE flag so that inode
    security initialization and permission checking is skipped.
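
    The effect boils down to marking the backing inode (a sketch):

    inode->i_flags |= S_PRIVATE;    /* LSMs skip kernel-internal inodes */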

    This was motivated by the following lockdep warning:

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    4.2.0-0.rc3.git0.1.fc24.x86_64+debug #1 Tainted: G W
    -------------------------------------------------------
    httpd/1597 is trying to acquire lock:
    (&ids->rwsem){+++++.}, at: shm_close+0x34/0x130
    but task is already holding lock:
    (&mm->mmap_sem){++++++}, at: SyS_shmdt+0x4b/0x180
    which lock already depends on the new lock.
    the existing dependency chain (in reverse order) is:
    -> #3 (&mm->mmap_sem){++++++}:
    lock_acquire+0xc7/0x270
    __might_fault+0x7a/0xa0
    filldir+0x9e/0x130
    xfs_dir2_block_getdents.isra.12+0x198/0x1c0 [xfs]
    xfs_readdir+0x1b4/0x330 [xfs]
    xfs_file_readdir+0x2b/0x30 [xfs]
    iterate_dir+0x97/0x130
    SyS_getdents+0x91/0x120
    entry_SYSCALL_64_fastpath+0x12/0x76
    -> #2 (&xfs_dir_ilock_class){++++.+}:
    lock_acquire+0xc7/0x270
    down_read_nested+0x57/0xa0
    xfs_ilock+0x167/0x350 [xfs]
    xfs_ilock_attr_map_shared+0x38/0x50 [xfs]
    xfs_attr_get+0xbd/0x190 [xfs]
    xfs_xattr_get+0x3d/0x70 [xfs]
    generic_getxattr+0x4f/0x70
    inode_doinit_with_dentry+0x162/0x670
    sb_finish_set_opts+0xd9/0x230
    selinux_set_mnt_opts+0x35c/0x660
    superblock_doinit+0x77/0xf0
    delayed_superblock_init+0x10/0x20
    iterate_supers+0xb3/0x110
    selinux_complete_init+0x2f/0x40
    security_load_policy+0x103/0x600
    sel_write_load+0xc1/0x750
    __vfs_write+0x37/0x100
    vfs_write+0xa9/0x1a0
    SyS_write+0x58/0xd0
    entry_SYSCALL_64_fastpath+0x12/0x76
    ...

    Signed-off-by: Stephen Smalley
    Reported-by: Morten Stevens
    Acked-by: Hugh Dickins
    Acked-by: Paul Moore
    Cc: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Prarit Bhargava
    Cc: Eric Paris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Smalley
     

25 Jun, 2015

1 commit

  • One of the rocksdb people noticed that when you do something like this

    fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 10M)
    pwrite(fd, buf, 5M, 0)
    ftruncate(5M)

    on tmpfs, the file would still take up 10M: which led to super fun
    issues because we were getting ENOSPC before we thought we should be
    getting ENOSPC. This patch fixes the problem, and mirrors what all the
    other fs'es do (and was agreed to be the correct behaviour at LSF).

    I tested it locally to make sure it worked properly with the following

    xfs_io -f -c "falloc -k 0 10M" -c "pwrite 0 5M" -c "truncate 5M" file

    Without the patch we have "Blocks: 20480", with the patch we have the
    correct value of "Blocks: 10240".

    Signed-off-by: Josef Bacik
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef Bacik
     

23 Jun, 2015

1 commit

  • Pull vfs updates from Al Viro:
    "In this pile: pathname resolution rewrite.

    - recursion in link_path_walk() is gone.

    - nesting limits on symlinks are gone (the only limit remaining is
    that the total amount of symlinks is no more than 40, no matter how
    nested).

    - "fast" (inline) symlinks are handled without leaving rcuwalk mode.

    - stack footprint (independent of the nesting) is below kilobyte now,
    about on par with what it used to be with one level of nested
    symlinks and ~2.8 times lower than it used to be in the worst case.

    - struct nameidata is entirely private to fs/namei.c now (not even
    opaque pointers are being passed around).

    - ->follow_link() and ->put_link() calling conventions had been
    changed; all in-tree filesystems converted, out-of-tree should be
    able to follow reasonably easily.

    For out-of-tree conversions, see Documentation/filesystems/porting
    for details (and in-tree filesystems for examples of conversion).

    That has sat in -next since mid-May, seems to survive all testing
    without regressions and merges clean with v4.1"

    * 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (131 commits)
    turn user_{path_at,path,lpath,path_dir}() into static inlines
    namei: move saved_nd pointer into struct nameidata
    inline user_path_create()
    inline user_path_parent()
    namei: trim do_last() arguments
    namei: stash dfd and name into nameidata
    namei: fold path_cleanup() into terminate_walk()
    namei: saner calling conventions for filename_parentat()
    namei: saner calling conventions for filename_create()
    namei: shift nameidata down into filename_parentat()
    namei: make filename_lookup() reject ERR_PTR() passed as name
    namei: shift nameidata inside filename_lookup()
    namei: move putname() call into filename_lookup()
    namei: pass the struct path to store the result down into path_lookupat()
    namei: uninline set_root{,_rcu}()
    namei: be careful with mountpoint crossings in follow_dotdot_rcu()
    Documentation: remove outdated information from automount-support.txt
    get rid of assorted nameidata-related debris
    lustre: kill unused helper
    lustre: kill unused macro (LOOKUP_CONTINUE)
    ...

    Linus Torvalds
     

18 Jun, 2015

1 commit

  • It appears that, at some point last year, XFS made directory handling
    changes which bring it into lockdep conflict with shmem_zero_setup():
    it is surprising that mmap() can clone an inode while holding mmap_sem,
    but that has been so for many years.

    Since those few lockdep traces that I've seen all implicated selinux,
    I'm hoping that we can use the __shmem_file_setup(,,,S_PRIVATE) which
    v3.13's commit c7277090927a ("security: shmem: implement kernel private
    shmem inodes") introduced to avoid LSM checks on kernel-internal inodes:
    the mmap("/dev/zero") cloned inode is indeed a kernel-internal detail.
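
    The resulting shmem_zero_setup() change is essentially (sketch):

    file = shmem_kernel_file_setup("dev/zero", size, vma->vm_flags);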

    This also covers the !CONFIG_SHMEM use of ramfs to support /dev/zero
    (and MAP_SHARED|MAP_ANONYMOUS). I thought there were also drivers
    which cloned an inode in mmap(), but if so, I cannot locate them now.

    Reported-and-tested-by: Prarit Bhargava
    Reported-and-tested-by: Daniel Wagner
    Reported-and-tested-by: Morten Stevens
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

11 May, 2015

4 commits

  • only one instance looks at that argument at all; that sole
    exception wants inode rather than dentry.

    Signed-off-by: Al Viro

    Al Viro
     
  • its only use is getting passed to nd_jump_link(), which can obtain
    it from current->nameidata

    Signed-off-by: Al Viro

    Al Viro
     
  • a) instead of storing the symlink body (via nd_set_link()) and returning
    an opaque pointer later passed to ->put_link(), ->follow_link() _stores_
    that opaque pointer (into a void * passed by address by the caller) and
    returns the symlink body. It returns ERR_PTR() on error, NULL on a jump
    (procfs magic symlinks), and a pointer to the symlink body for normal
    symlinks. The stored pointer is ignored in all cases except the last
    one.

    Storing NULL for the opaque pointer (or not storing it at all) means no
    call of ->put_link().

    b) the body used to be passed to ->put_link() implicitly (via
    nameidata). Now only the opaque pointer is. In the cases when we used
    the symlink body to free stuff, ->follow_link() now should store it as
    the opaque pointer in addition to returning it.
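
    A sketch of the new convention (foo_ names are hypothetical):

    static const char *foo_follow_link(struct dentry *dentry, void **cookie)
    {
            char *body = foo_read_symlink(dentry);  /* hypothetical helper */

            if (IS_ERR(body))
                    return body;
            return *cookie = body;  /* body is handed to ->put_link() later */
    }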

    Signed-off-by: Al Viro

    Al Viro
     
  • Reviewed-by: Jan Kara
    Signed-off-by: Al Viro

    Al Viro
     

16 Apr, 2015

1 commit


12 Apr, 2015

2 commits


26 Mar, 2015

1 commit