27 Jul, 2016

5 commits

  • These flags are in use for file THP.

    Link: http://lkml.kernel.org/r/1466021202-61880-23-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • As with anon THP, we only mlock file huge pages if we can prove that the
    page is not mapped with PTEs. This way we avoid leaking mlock into a
    non-mlocked VMA on split.

    We rely on PageDoubleMap() under lock_page() to check whether the page
    may be PTE-mapped. PG_double_map is set by page_add_file_rmap() when
    the page is mapped with PTEs.

    Link: http://lkml.kernel.org/r/1466021202-61880-21-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Currently, to charge a non-slab allocation to kmemcg one has to use
    alloc_kmem_pages helper with __GFP_ACCOUNT flag. A page allocated with
    this helper should finally be freed using free_kmem_pages, otherwise it
    won't be uncharged.

    This API suits its current users fine, but it turns out to be impossible
    to use along with page reference counting, i.e. when an allocation is
    supposed to be freed with put_page, as it is the case with pipe or unix
    socket buffers.

    To overcome this limitation, this patch moves charging/uncharging to
    generic page allocator paths, i.e. to __alloc_pages_nodemask and
    free_pages_prepare, and zaps alloc/free_kmem_pages helpers. This way,
    one can use any of the available page allocation functions to get the
    allocated page charged to kmemcg - it's enough to pass __GFP_ACCOUNT,
    just like in the case of kmalloc and friends. A charged page will be
    automatically uncharged on free.

    To make it possible, we need to mark pages charged to kmemcg somehow.
    To avoid introducing a new page flag, we make use of page->_mapcount for
    marking such pages. Since pages charged to kmemcg are not supposed to
    be mapped to userspace, it should work just fine. There are other
    (ab)users of page->_mapcount - buddy and balloon pages - but we don't
    conflict with them.

    If kmemcg is compiled out or not used at runtime, this patch introduces
    no overhead to the generic page allocator paths. If kmemcg is used, it
    adds one gfp-flags check on alloc and one page->_mapcount check on free,
    which shouldn't hurt performance, because the data accessed are hot.
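
    A minimal sketch of the resulting usage, assuming GFP_KERNEL_ACCOUNT
    (i.e. GFP_KERNEL | __GFP_ACCOUNT) and an ordinary alloc_page()/put_page()
    pair; the helper names below are illustrative, not part of the patch:

    #include <linux/gfp.h>
    #include <linux/mm.h>

    static void *alloc_accounted_buffer(void)
    {
            struct page *page;

            /* Charged to the current memcg because of __GFP_ACCOUNT. */
            page = alloc_page(GFP_KERNEL_ACCOUNT);
            if (!page)
                    return NULL;
            return page_address(page);
    }

    static void free_accounted_buffer(void *addr)
    {
            /* The charge is dropped automatically when the page is freed. */
            put_page(virt_to_page(addr));
    }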

    Link: http://lkml.kernel.org/r/a9736d856f895bcb465d9f257b54efe32eda6f99.1464079538.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Eric Dumazet
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • - Add a proper comment to page->_mapcount.

    - Introduce a macro for generating helper functions.

    - Place all special page->_mapcount values next to each other so that
    readers can see all possible values and so we don't get duplicates.
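
    A hedged sketch of what such a generating macro can look like (the names
    follow the description above; the details are simplified assumptions, not
    copied from the patch):

    #define PAGE_MAPCOUNT_OPS(uname, lname)                                 \
    static inline int Page##uname(struct page *page)                        \
    {                                                                       \
            return atomic_read(&page->_mapcount) ==                         \
                            PAGE_##lname##_MAPCOUNT_VALUE;                  \
    }                                                                       \
    static inline void __SetPage##uname(struct page *page)                  \
    {                                                                       \
            VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page);      \
            atomic_set(&page->_mapcount, PAGE_##lname##_MAPCOUNT_VALUE);    \
    }                                                                       \
    static inline void __ClearPage##uname(struct page *page)                \
    {                                                                       \
            VM_BUG_ON_PAGE(!Page##uname(page), page);                       \
            atomic_set(&page->_mapcount, -1);                               \
    }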

    Link: http://lkml.kernel.org/r/502f49000e0b63e6c62e338fac6b420bf34fb526.1464079537.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Eric Dumazet
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Until now we have allowed migration only for LRU pages, and that was
    enough to build high-order pages. But recently, embedded systems (e.g.,
    webOS, Android) use lots of non-movable pages (e.g., zram, GPU memory),
    so we have seen several reports of trouble with small high-order
    allocations. There have been several efforts to fix the problem (e.g.,
    enhancing the compaction algorithm, SLUB fallback to order-0 pages,
    reserved memory, vmalloc and so on), but if there are lots of
    non-movable pages in the system, those solutions are ineffective in the
    long run.

    So, this patch adds a facility to turn non-movable pages into movable
    ones. For that, it introduces migration-related function pointers in
    address_space_operations as well as some page flags.

    If a driver wants to make its own pages movable, it should define three
    functions, which are function pointers of struct
    address_space_operations (see the sketch after the flag descriptions
    below).

    1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);

    The VM expects the driver's isolate_page function to return *true* if
    the driver isolates the page successfully. On returning true, the VM
    marks the page PG_isolated so that concurrent isolation on other CPUs
    skips the page. If the driver cannot isolate the page, it should return
    *false*.

    Once a page is successfully isolated, the VM uses the page.lru fields,
    so the driver shouldn't expect the values in those fields to be
    preserved.

    2. int (*migratepage) (struct address_space *mapping,
    struct page *newpage, struct page *oldpage, enum migrate_mode);

    After isolation, the VM calls the driver's migratepage with the isolated
    page. The job of migratepage is to move the content of the old page to
    the new page and set up the fields of struct page newpage. Keep in mind
    that if you migrated the oldpage successfully and return 0, you should
    indicate to the VM that the oldpage is no longer movable via
    __ClearPageMovable() under page_lock. If the driver cannot migrate the
    page at the moment, it can return -EAGAIN. On -EAGAIN, the VM will retry
    page migration shortly because it interprets -EAGAIN as a "temporary
    migration failure". On any error other than -EAGAIN, the VM gives up
    migrating the page without retrying this time.

    The driver shouldn't touch the page.lru field, which the VM uses, in
    these functions.

    3. void (*putback_page)(struct page *);

    If migration of an isolated page fails, the VM has to return the page to
    the driver, so it calls the driver's putback_page with the page whose
    migration failed. In this function, the driver should put the isolated
    page back into its own data structures.

    4. non-lru movable page flags

    There are two page flags for supporting non-lru movable pages.

    * PG_movable

    The driver should use the function below to make a page movable, under
    page_lock.

    void __SetPageMovable(struct page *page, struct address_space *mapping)

    It takes an address_space argument for registering the family of
    migration functions which will be called by the VM. Strictly speaking,
    PG_movable is not a real flag of struct page. Rather, the VM reuses the
    lower bits of page->mapping to represent it:

    #define PAGE_MAPPING_MOVABLE 0x2
    page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;

    so the driver shouldn't access page->mapping directly. Instead, the
    driver should use page_mapping, which masks off the low two bits of
    page->mapping so it returns the right struct address_space.

    For testing for a non-lru movable page, the VM provides the __PageMovable
    function. However, it does not guarantee identification of a non-lru
    movable page, because the page->mapping field is unified with other
    variables in struct page. Also, if the driver releases the page after
    isolation by the VM, page->mapping does not have a stable value even
    though it has PAGE_MAPPING_MOVABLE set (look at __ClearPageMovable). But
    __PageMovable is a cheap way to tell whether a page is LRU or non-lru
    movable once the page has been isolated, because LRU pages can never
    have PAGE_MAPPING_MOVABLE in page->mapping. It is also good for just
    peeking at non-lru movable pages before the more expensive check with
    lock_page during pfn scanning to select a victim.

    To identify a non-lru movable page with certainty, the VM provides the
    PageMovable function. Unlike __PageMovable, PageMovable validates
    page->mapping and mapping->a_ops->isolate_page under lock_page. The
    lock_page prevents page->mapping from being destroyed underneath us.

    A driver using __SetPageMovable should clear the flag via
    __ClearPageMovable under page_lock before releasing the page.

    * PG_isolated

    To prevent concurrent isolation on several CPUs, the VM marks an isolated
    page PG_isolated under lock_page, so if a CPU encounters a PG_isolated
    non-lru movable page it can skip it. The driver doesn't need to
    manipulate the flag, because the VM sets/clears it automatically. Keep
    in mind that if the driver sees a PG_isolated page, the page has been
    isolated by the VM, so the driver shouldn't touch the page.lru field.
    PG_isolated is aliased with the PG_reclaim flag, so the driver shouldn't
    use that flag for its own purposes.
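
    A hedged sketch of the driver side described above; the "mydrv" names and
    the callback bodies are hypothetical, only the three a_ops callbacks and
    __SetPageMovable()/__ClearPageMovable() come from this patch:

    #include <linux/fs.h>
    #include <linux/migrate.h>
    #include <linux/mm.h>
    #include <linux/page-flags.h>
    #include <linux/string.h>

    static bool mydrv_isolate_page(struct page *page, isolate_mode_t mode)
    {
            /* Detach the page from the driver's lists; true == isolated. */
            return true;
    }

    static int mydrv_migratepage(struct address_space *mapping,
                                 struct page *newpage, struct page *oldpage,
                                 enum migrate_mode mode)
    {
            /* Copy content and make driver bookkeeping point at newpage. */
            memcpy(page_address(newpage), page_address(oldpage), PAGE_SIZE);
            /* oldpage is no longer movable (we are called with it locked). */
            __ClearPageMovable(oldpage);
            return MIGRATEPAGE_SUCCESS;   /* or -EAGAIN on temporary failure */
    }

    static void mydrv_putback_page(struct page *page)
    {
            /* Migration failed: put the page back on the driver's lists. */
    }

    static const struct address_space_operations mydrv_aops = {
            .isolate_page   = mydrv_isolate_page,
            .migratepage    = mydrv_migratepage,
            .putback_page   = mydrv_putback_page,
    };

    /* Called with the page locked when the driver hands out a page. */
    static void mydrv_mark_movable(struct page *page,
                                   struct address_space *mapping)
    {
            mapping->a_ops = &mydrv_aops;
            __SetPageMovable(page, mapping);
    }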

    [opensource.ganesh@gmail.com: mm/compaction: remove local variable is_lru]
    Link: http://lkml.kernel.org/r/20160618014841.GA7422@leo-test
    Link: http://lkml.kernel.org/r/1464736881-24886-3-git-send-email-minchan@kernel.org
    Signed-off-by: Gioh Kim
    Signed-off-by: Minchan Kim
    Signed-off-by: Ganesh Mahendran
    Acked-by: Vlastimil Babka
    Cc: Sergey Senozhatsky
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Rafael Aquini
    Cc: Jonathan Corbet
    Cc: John Einar Reitan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

21 May, 2016

1 commit

  • struct page->flags is an unsigned long, so when shifting bits we should
    use the UL suffix to match it.

    I found this problem after I added 64-bit-CPU-specific page flags and
    failed to compile the kernel:

    mm/page_alloc.c: In function '__free_one_page':
    mm/page_alloc.c:672:2: error: integer overflow in expression [-Werror=overflow]
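
    A brief illustration (the flag position is made up for this example):
    with a page flag bit at position 32 or above, a plain int constant
    overflows, while the UL suffix keeps the shift in unsigned long, matching
    page->flags:

    #define PG_example              33      /* hypothetical 64-bit-only flag */

    /* Overflows: 1 is an int, so the shifted value does not fit. */
    /* #define EXAMPLE_MASK         (1 << PG_example) */

    /* Correct: the shift is performed in unsigned long, like page->flags. */
    #define EXAMPLE_MASK            (1UL << PG_example)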

    Link: http://lkml.kernel.org/r/1461971723-16187-1-git-send-email-yuzhao@google.com
    Signed-off-by: Yu Zhao
    Cc: "Kirill A . Shutemov"
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Jerome Marchand
    Cc: Denys Vlasenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yu Zhao
     

20 May, 2016

1 commit

  • The PageAnon check always goes through compound_head(), but this is a
    relatively expensive check when the caller already knows the page is a
    head page. This patch creates a helper and uses it in the page free
    path, which only operates on head pages.
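
    A hedged sketch of such a helper (the name and exact form are assumptions,
    not taken from the patch): it tests the anon bit directly on a page
    already known to be a head page, skipping compound_head().

    static inline int PageAnonHead(struct page *page)
    {
            return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
    }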

    With this patch and "Only check PageCompound for high-order pages", the
    performance difference on a page allocator microbenchmark is;

    4.6.0-rc2 4.6.0-rc2
    vanilla nocompound-v1r20
    Min alloc-odr0-1 425.00 ( 0.00%) 417.00 ( 1.88%)
    Min alloc-odr0-2 313.00 ( 0.00%) 308.00 ( 1.60%)
    Min alloc-odr0-4 257.00 ( 0.00%) 253.00 ( 1.56%)
    Min alloc-odr0-8 224.00 ( 0.00%) 221.00 ( 1.34%)
    Min alloc-odr0-16 208.00 ( 0.00%) 205.00 ( 1.44%)
    Min alloc-odr0-32 199.00 ( 0.00%) 199.00 ( 0.00%)
    Min alloc-odr0-64 195.00 ( 0.00%) 193.00 ( 1.03%)
    Min alloc-odr0-128 192.00 ( 0.00%) 191.00 ( 0.52%)
    Min alloc-odr0-256 204.00 ( 0.00%) 200.00 ( 1.96%)
    Min alloc-odr0-512 213.00 ( 0.00%) 212.00 ( 0.47%)
    Min alloc-odr0-1024 219.00 ( 0.00%) 219.00 ( 0.00%)
    Min alloc-odr0-2048 225.00 ( 0.00%) 225.00 ( 0.00%)
    Min alloc-odr0-4096 230.00 ( 0.00%) 231.00 ( -0.43%)
    Min alloc-odr0-8192 235.00 ( 0.00%) 234.00 ( 0.43%)
    Min alloc-odr0-16384 235.00 ( 0.00%) 234.00 ( 0.43%)
    Min free-odr0-1 215.00 ( 0.00%) 191.00 ( 11.16%)
    Min free-odr0-2 152.00 ( 0.00%) 136.00 ( 10.53%)
    Min free-odr0-4 119.00 ( 0.00%) 107.00 ( 10.08%)
    Min free-odr0-8 106.00 ( 0.00%) 96.00 ( 9.43%)
    Min free-odr0-16 97.00 ( 0.00%) 87.00 ( 10.31%)
    Min free-odr0-32 91.00 ( 0.00%) 83.00 ( 8.79%)
    Min free-odr0-64 89.00 ( 0.00%) 81.00 ( 8.99%)
    Min free-odr0-128 88.00 ( 0.00%) 80.00 ( 9.09%)
    Min free-odr0-256 106.00 ( 0.00%) 95.00 ( 10.38%)
    Min free-odr0-512 116.00 ( 0.00%) 111.00 ( 4.31%)
    Min free-odr0-1024 125.00 ( 0.00%) 118.00 ( 5.60%)
    Min free-odr0-2048 133.00 ( 0.00%) 126.00 ( 5.26%)
    Min free-odr0-4096 136.00 ( 0.00%) 130.00 ( 4.41%)
    Min free-odr0-8192 138.00 ( 0.00%) 130.00 ( 5.80%)
    Min free-odr0-16384 137.00 ( 0.00%) 130.00 ( 5.11%)

    There is a sizable boost to free-path performance. While there is an
    apparent boost on the allocation side, it's likely a coincidence or due
    to the patches slightly reducing the cache footprint.

    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

06 May, 2016

1 commit

  • After the THP refcounting change, obtaining a compound page from
    get_user_pages() no longer allows us to assume the entire compound page
    is immediately mappable from a secondary MMU.

    A secondary MMU doesn't want to call get_user_pages() more than once for
    each compound page, in order to know if it can map the whole compound
    page. So a secondary MMU needs to know from a single get_user_pages()
    invocation when it can map immediately the entire compound page to avoid
    a flood of unnecessary secondary MMU faults and spurious
    atomic_inc()/atomic_dec() (pages don't have to be pinned by MMU notifier
    users).

    Ideally, instead of the page->_mapcount < 1 check, get_user_pages()
    should return the granularity of the "page" mapping in the "mm" passed
    to get_user_pages(). However, it's a non-trivial change to pass the
    "pmd" status belonging to the "mm" walked by get_user_pages up the stack
    (up to the caller of get_user_pages). So the fix just checks that there
    is not a single pte mapping on the page returned by get_user_pages, and
    in turn that the caller can assume the whole compound page is mapped in
    the current "mm" (in a pmd_trans_huge()). In such a case the entire
    compound page is safe to map into the secondary MMU without additional
    get_user_pages() calls on the surrounding tail/head pages. Besides
    being faster, not having to run other get_user_pages() calls also
    reduces the memory footprint of the secondary MMU fault in case the pmd
    split happened as a result of memory pressure.

    Without this fix, after a MADV_DONTNEED (as invoked by QEMU during
    postcopy live migration or ballooning) or after generic swapping (with a
    failure in split_huge_page() that would only result in pmd splitting and
    not a physical page split), KVM would map the whole compound page into
    the shadow pagetables, even though regular faults or userfaults (like
    UFFDIO_COPY) may map regular pages into the primary MMU as a result of
    the pte faults, leading to guest mode and userland mode going out of
    sync and not working on the same memory at all times.

    Any other secondary MMU notifier manager (KVM is just one of the many
    MMU notifier users) will need the same information if it doesn't want to
    run a flood of get_user_pages_fast calls and it supports multiple
    granularities in its secondary MMU mappings, so I think it is justified
    to expose this beyond just KVM.

    The other option would be to move transparent_hugepage_adjust to
    mm/huge_memory.c, but that currently has all kinds of KVM data
    structures in it, so it's definitely not a cut-and-paste job, and I
    couldn't do a fix as clean as this one for 4.6.
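
    A hedged sketch of the kind of check described above (the helper name is
    made up for illustration); the raw _mapcount is stored biased by -1, so a
    negative value means no PTE maps this subpage:

    static inline bool whole_compound_page_is_mapped(struct page *page)
    {
            if (!PageTransCompound(page))
                    return false;
            /* No PTE mapping of this subpage, so the mapping is PMD-level. */
            return atomic_read(&page->_mapcount) < 0;
    }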

    Signed-off-by: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: "Kirill A. Shutemov"
    Cc: "Li, Liang Z"
    Cc: Amit Shah
    Cc: Paolo Bonzini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

18 Mar, 2016

2 commits

  • Sometimes gcc mysteriously doesn't inline
    very small functions we expect to be inlined. See

    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122

    With this .config:
    http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os,
    the following functions get deinlined many times.
    Examples of disassembly:

    (43 copies, 141 calls):
    55 push %rbp
    48 89 e5 mov %rsp,%rbp
    f0 80 0f 08 lock orb $0x8,(%rdi)
    5d pop %rbp
    c3 retq

    (10 copies, 134 calls):
    48 8b 07 mov (%rdi),%rax
    55 push %rbp
    48 89 e5 mov %rsp,%rbp
    48 c1 e8 0b shr $0xb,%rax
    83 e0 01 and $0x1,%eax
    5d pop %rbp
    c3 retq

    This patch fixes this via s/inline/__always_inline/.
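
    A simplified sketch of the kind of change involved (the macro body below
    is illustrative, not copied from the patch): the generated one-line page
    flag helpers are marked __always_inline so gcc cannot deinline them.

    #define SETPAGEFLAG(uname, lname)                                       \
    static __always_inline void SetPage##uname(struct page *page)           \
    {                                                                       \
            set_bit(PG_##lname, &page->flags);                              \
    }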

    Code size decrease after the patch is ~7k:

    text data bss dec hex filename
    92125002 20826048 36417536 149368586 8e72f0a vmlinux
    92118087 20826112 36417536 149361735 8e71447 vmlinux7_pageops_after

    Signed-off-by: Denys Vlasenko
    Cc: Ingo Molnar
    Cc: Thomas Graf
    Cc: Peter Zijlstra
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Denys Vlasenko
     
  • Currently /proc/kpageflags returns nothing for "tail" buddy pages, which
    is inconvenient when grasping how free pages are distributed. This
    patch sets KPF_BUDDY for such pages.

    With this patch:

    $ grep MemFree /proc/meminfo ; tools/vm/page-types -b buddy
    MemFree: 3134992 kB
    flags page-count MB symbolic-flags long-symbolic-flags
    0x0000000000000400 779272 3044 __________B_______________________________ buddy
    0x0000000000000c00 4385 17 __________BM______________________________ buddy,mmap
    total 783657 3061

    783657 pages is 3134628 kB (roughly consistent with the global counter),
    so it's OK.

    [akpm@linux-foundation.org: update comment, per Naoya]
    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Vladimir Davydov
    Cc: Konstantin Khlebnikov
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

16 Jan, 2016

18 commits

  • We're going to allow mapping of individual 4k pages of a THP compound
    page. That means we need to track the mapcount on a per-small-page
    basis.

    The straightforward approach is to use ->_mapcount in all subpages to
    track how many times the subpage is mapped with PMDs or PTEs combined.
    But this is rather expensive: mapping or unmapping a THP page with a PMD
    would require HPAGE_PMD_NR atomic operations instead of the single one
    we have now.

    The idea is to store separately how many times the page was mapped as
    whole -- compound_mapcount. This frees up ->_mapcount in subpages to
    track PTE mapcount.

    We use the same approach as with compound page destructor and compound
    order to store compound_mapcount: use space in first tail page,
    ->mapping this time.

    Any time we map/unmap whole compound page (THP or hugetlb) -- we
    increment/decrement compound_mapcount. When we map part of compound
    page with PTE we operate on ->_mapcount of the subpage.

    page_mapcount() counts both: PTE and PMD mappings of the page.

    Basically, the mapcount for a subpage is spread over two counters. That
    makes it tricky to detect when the last mapcount for a page goes away.

    We introduced PageDoubleMap() for this. When we split a THP PMD for the
    first time and there's another PMD mapping left, we bump ->_mapcount in
    all subpages by one and set PG_double_map on the compound page. These
    additional references go away with the last compound_mapcount.

    This approach provides a way to detect when the last mapcount goes away
    on a per-small-page basis without introducing new overhead for the most
    common cases.
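
    A hedged sketch of the accessors this layout implies (simplified; the
    field and helper names follow the description above): the compound-wide
    count lives in the first tail page and is biased by -1, like ->_mapcount.

    static inline atomic_t *compound_mapcount_ptr(struct page *page)
    {
            return &page[1].compound_mapcount;
    }

    static inline int compound_mapcount(struct page *page)
    {
            page = compound_head(page);
            return atomic_read(compound_mapcount_ptr(page)) + 1;
    }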

    [akpm@linux-foundation.org: fix typo in comment]
    [mhocko@suse.com: ignore partial THP when moving task]
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Sasha Levin
    Cc: Aneesh Kumar K.V
    Cc: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We are going to use migration entries to stabilize page counts. It
    means we don't need compound_lock() for that.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Nobody uses them.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • PageAnon() and PageKsm() look at the lower bits of page->mapping to check
    whether the page is Anon or KSM. page->mapping can be overloaded in tail
    pages.

    Let's always look at the head page to avoid false positives.
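
    A hedged sketch of the resulting check (simplified): resolve the head page
    first, then test the anon bit in its ->mapping.

    static inline int PageAnon(struct page *page)
    {
            page = compound_head(page);
            return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
    }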

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We use PG_uptodate only on the head pages of transparent huge pages.
    Let's use PF_NO_TAIL.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • So far, only IA64 uses PG_uncached and only on non-compound pages.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Transparent huge pages can be mlocked -- the whole compound page at once.
    Something has gone wrong if we're trying to mlock() a tail page. Let's
    use PF_NO_TAIL.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Swap cannot handle compound pages so far. Transparent huge pages are
    split on the way to swap.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • PG_swapbacked is used for transparent huge pages, but for head pages
    only. Let's use the PF_NO_TAIL policy.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • As far as I can see there are no users of PG_reserved on compound pages.
    Let's use PF_NO_COMPOUND here.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Jérôme Glisse
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • PG_pinned and PG_savepinned are about page table pages, which are never
    compound.

    I'm not so sure about PG_foreign, but it seems we shouldn't see compound
    pages there either.

    Let's use PF_NO_COMPOUND for all of them.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • SL*B uses compound pages and marks head pages with PG_slab.
    __SetPageSlab() and __ClearPageSlab() are never called for tail pages.

    The same applies to PG_slob_free in the SLOB allocator.

    PF_NO_TAIL is appropriate for these flags.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Only head pages are ever on LRU. Let's use PF_HEAD policy to avoid any
    confusion for all LRU-related flags.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • It seems we don't have compound pages on the FS/IO path currently. Use
    PF_NO_COMPOUND to catch it if we do.

    The odd exception is PG_dirty: the sound subsystem uses compound pages
    and maps them with PTEs. PF_NO_COMPOUND would trigger VM_BUG_ON() in
    set_page_dirty() when handling a shared fault. Let's use PF_HEAD for
    PG_dirty.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • lock_page() must operate on the whole compound page. It doesn't make
    much sense to lock part of a compound page. Change the code to use the
    head page's PG_locked if a tail page is passed.

    This patch also gets rid of the custom helper functions
    __set_page_locked() and __clear_page_locked(). They are replaced with
    helpers generated by __SETPAGEFLAG/__CLEARPAGEFLAG. Passing a tail page
    to these helpers would trigger VM_BUG_ON().

    SLUB uses PG_locked as a bit spinlock. IIUC, tail pages should never
    appear there. VM_BUG_ON() is added to make sure that this assumption is
    correct.

    [akpm@linux-foundation.org: fix fs/cifs/file.c]
    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • This patch adds a third argument to the macros which create function
    definitions for page flags. This argument defines how the page-flags
    helpers behave on compound pages.

    For now we define four policies:

    - PF_ANY: the helper function operates on the page it gets, regardless
    of whether it's non-compound, head or tail.

    - PF_HEAD: the helper function operates on the head page of the
    compound page if it gets a tail page.

    - PF_NO_TAIL: only head and non-compound pages are acceptable for this
    helper function.

    - PF_NO_COMPOUND: only non-compound pages are acceptable for this
    helper function.

    For now we use the PF_ANY policy for all helpers, which matches the
    current behaviour.

    We do not enforce the policy for TESTPAGEFLAG, because we have flags
    checked on random pages all over the kernel. A notable exception to
    this is PageTransHuge(), which triggers VM_BUG_ON() for tail pages.
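
    A hedged sketch of what these policies can resolve to (simplified and
    assumed rather than quoted from the patch); each policy maps the page the
    helper received to the page it actually operates on:

    #define PF_ANY(page, enforce)           page
    #define PF_HEAD(page, enforce)          compound_head(page)
    #define PF_NO_TAIL(page, enforce) ({                                    \
                    VM_BUG_ON_PAGE(enforce && PageTail(page), page);        \
                    compound_head(page); })
    #define PF_NO_COMPOUND(page, enforce) ({                                \
                    VM_BUG_ON_PAGE(enforce && PageCompound(page), page);    \
                    page; })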

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The preparation patch: we are going to use compound_head(), PageTail()
    and PageCompound() to define page-flags helpers.

    Let's define them before the macros.

    We cannot use the PageHead() helper in PageCompound() as it's not yet
    defined -- use test_bit(PG_head, &page->flags) instead.
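
    A hedged sketch of what that looks like (simplified): the head-page part
    of the test uses test_bit() directly, because the generated PageHead()
    helper is only defined further down in the header.

    static inline int PageCompound(struct page *page)
    {
            return test_bit(PG_head, &page->flags) || PageTail(page);
    }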

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Use TESTPAGEFLAG_FALSE() to make it a bit cleaner.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

07 Nov, 2015

1 commit

  • Hugh has pointed out that a compound_head() call can be unsafe in some
    contexts. Here's one example:

    CPU0                                        CPU1

    isolate_migratepages_block()
      page_count()
        compound_head()
          !!PageTail() == true
                                                put_page()
                                                  tail->first_page = NULL
          head = tail->first_page
                                                alloc_pages(__GFP_COMP)
                                                  prep_compound_page()
                                                    tail->first_page = head
                                                    __SetPageTail(p);
          !!PageTail() == true

    The race is purely theoretical. I don't think it's possible to trigger
    it in practice. But who knows.

    We can fix the race by changing how we encode PageTail() and
    compound_head() within struct page, so that both can be updated in one
    shot.

    The patch introduces page->compound_head in the third double word block,
    in front of compound_dtor and compound_order. Bit 0 encodes PageTail()
    and the remaining bits are a pointer to the head page if bit zero is set.

    The patch moves page->pmd_huge_pte out of that word, just in case an
    architecture defines pgtable_t as something that can have bit 0 set.

    hugetlb_cgroup uses page->lru.next in the second tail page to store a
    pointer to struct hugetlb_cgroup. The patch switches it to use
    page->private in the second tail page instead. The space is free since
    ->first_page is removed from the union.

    The patch also opens up the possibility of removing the
    HUGETLB_CGROUP_MIN_ORDER limitation, since there's now space in the
    first tail page to store the struct hugetlb_cgroup pointer. But that's
    out of scope for this patch.

    That means page->compound_head shares storage space with:

    - page->lru.next;
    - page->next;
    - page->rcu_head.next;

    That's too long a list to be absolutely sure, but it looks like nobody
    uses bit 0 of that word.

    page->rcu_head.next is guaranteed[1] to have bit 0 clear as long as we
    use call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But a
    future call_rcu_lazy() is not allowed, as it makes use of the bit and we
    could get a false-positive PageTail().

    [1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.com
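
    A hedged sketch of the encoding described above (simplified): bit 0 of
    page->compound_head marks a tail page, and the remaining bits hold the
    head page pointer, so both can be read or written in a single store.

    static inline int PageTail(struct page *page)
    {
            return READ_ONCE(page->compound_head) & 1;
    }

    static inline struct page *compound_head(struct page *page)
    {
            unsigned long head = READ_ONCE(page->compound_head);

            if (head & 1)
                    return (struct page *)(head - 1);
            return page;
    }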

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Reviewed-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Acked-by: Paul E. McKenney
    Cc: Aneesh Kumar K.V
    Cc: Andi Kleen
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

06 Nov, 2015

1 commit

  • This came up when implementing HIGHMEM/PAE40 for ARC. The kmap() /
    kmap_atomic() generated code seemed needlessly bloated due to the way the
    PageHighMem() macro is implemented. It derives the exact zone for the
    page and then does pointer subtraction with the first zone to infer the
    zone_type. The pointer arithmetic in turn generates the code bloat.

    PageHighMem(page)
      is_highmem(page_zone(page))
        zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones

    Instead, use is_highmem_idx() to work on the zone_type available in the
    page flags.
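
    A hedged sketch of the resulting check (simplified): derive the zone type
    straight from the zone index encoded in page->flags instead of
    dereferencing the zone and doing pointer arithmetic.

    static inline int PageHighMem(struct page *page)
    {
            return is_highmem_idx(page_zonenum(page));
    }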

    ----- Before -----
    80756348: mov_s r13,r0
    8075634a: ld_s r2,[r13,0]
    8075634c: lsr_s r2,r2,30
    8075634e: mpy r2,r2,0x2a4
    80756352: add_s r2,r2,0x80aef880
    80756358: ld_s r3,[r2,28]
    8075635a: sub_s r2,r2,r3
    8075635c: breq r2,0x2a4,80756378
    80756364: breq r2,0x548,80756378

    ----- After -----
    80756330: mov_s r13,r0
    80756332: ld_s r2,[r13,0]
    80756334: lsr_s r2,r2,30
    80756336: sub_s r2,r2,1
    80756338: brlo r2,2,80756348

    For x86 defconfig build (32 bit only) it saves around 900 bytes.
    For ARC defconfig with HIGHMEM, it saved around 2K bytes.

    ---->8-------
    ./scripts/bloat-o-meter x86/vmlinux-defconfig-pre x86/vmlinux-defconfig-post
    add/remove: 0/0 grow/shrink: 0/36 up/down: 0/-934 (-934)
    function old new delta
    saveable_page 162 154 -8
    saveable_highmem_page 154 146 -8
    skb_gro_reset_offset 147 131 -16
    ...
    ...
    __change_page_attr_set_clr 1715 1678 -37
    setup_data_read 434 394 -40
    mon_bin_event 1967 1927 -40
    swsusp_save 1148 1105 -43
    _set_pages_array 549 493 -56
    ---->8-------

    e.g. For ARC kmap()

    Signed-off-by: Vineet Gupta
    Acked-by: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Jennifer Herbert
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vineet Gupta
     

11 Sep, 2015

1 commit

  • Knowing the portion of memory that is not used by a certain application or
    memory cgroup (idle memory) can be useful for partitioning the system
    efficiently, e.g. by setting memory cgroup limits appropriately.
    Currently, the only means to estimate the amount of idle memory provided
    by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
    access bit for all pages mapped to a particular process by writing 1 to
    clear_refs, wait for some time, and then count smaps:Referenced. However,
    this method has two serious shortcomings:

    - it does not count unmapped file pages
    - it affects the reclaimer logic

    To overcome these drawbacks, this patch introduces two new page flags,
    Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
    A page's Idle flag can only be set from userspace, by setting the bit in
    /sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
    and it is cleared whenever the page is accessed either through page tables
    (it is cleared in page_referenced() in this case) or using the read(2)
    system call (mark_page_accessed()). Thus, by setting the Idle flag for
    pages of a particular workload, which can be found e.g. by reading
    /proc/PID/pagemap, waiting for some time to let the workload access its
    working set, and then reading the bitmap file, one can estimate the number
    of pages that are not used by the workload.

    The Young page flag is used to avoid interference with the memory
    reclaimer. A page's Young flag is set whenever the Access bit of a page
    table entry pointing to the page is cleared by writing to the bitmap file.
    If page_referenced() is called on a Young page, it will add 1 to its
    return value, therefore concealing the fact that the Access bit was
    cleared.

    Note, since there is no room for extra page flags on 32 bit, this feature
    uses extended page flags when compiled on 32 bit.
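
    A hedged userspace sketch of the workflow described above (file layout as
    documented: one 64-bit word covers 64 pages, offsets are in bytes); error
    handling is minimal and the PFN range is up to the caller:

    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    static int mark_pfns_idle(uint64_t start_pfn, uint64_t nr_pfns)
    {
            int fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
            uint64_t all_idle = ~0ULL;
            uint64_t pfn;

            if (fd < 0)
                    return -1;
            /* Re-read the same offsets later: bits still set are still idle. */
            for (pfn = start_pfn; pfn < start_pfn + nr_pfns; pfn += 64)
                    pwrite(fd, &all_idle, sizeof(all_idle), pfn / 64 * 8);
            close(fd);
            return 0;
    }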

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: kpageidle requires an MMU]
    [akpm@linux-foundation.org: decouple from page-flags rework]
    Signed-off-by: Vladimir Davydov
    Reviewed-by: Andres Lagar-Cavilla
    Cc: Minchan Kim
    Cc: Raghavendra K T
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: David Rientjes
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

07 Aug, 2015

1 commit

  • The race condition addressed in commit add05cecef80 ("mm: soft-offline:
    don't free target page in successful page migration") was not closed
    completely, because that can happen not only for soft-offline, but also
    for hard-offline. Consider that a slab page is about to be freed into
    buddy pool, and then an uncorrected memory error hits the page just
    after entering __free_one_page(), then VM_BUG_ON_PAGE(page->flags &
    PAGE_FLAGS_CHECK_AT_PREP) is triggered, despite the fact that it's not
    necessary because the data on the affected page is not consumed.

    To solve it, this patch drops __PG_HWPOISON from the page flag checks at
    allocation/free time. I think that's justified because the __PG_HWPOISON
    flag is defined to prevent the page from being reused, and setting it
    outside the page's alloc-free cycle is a designed behavior (not a bug.)

    In recent months, I was annoyed by the BUG_ON when a soft-offlined page
    remains on the lru cache list for a while, which is avoided by calling
    put_page() instead of putback_lru_page() in page migration's success
    path. This means that this patch reverts a major change from commit
    add05cecef80 about the new refcounting rule for soft-offlined pages, so
    the "reuse window" revives. This will be closed by a subsequent patch.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Dean Nelson
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

16 Apr, 2015

2 commits

  • Now that we have easy access to hugepage activeness, the existing helpers
    for getting that information can be cleaned up.

    [akpm@linux-foundation.org: s/PageHugeActive/page_huge_active/]
    Signed-off-by: Naoya Horiguchi
    Cc: Hugh Dickins
    Reviewed-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Currently we take a naive approach to page flags on compound pages - we
    set the flag on the page without considering whether the flag makes
    sense for a tail page or for the compound page in general. This patchset
    tries to sort this out by defining a per-flag policy on what needs to be
    done if a page-flag helper operates on a compound page.

    The last patch in the patchset also sanitizes the usage of page->mapping
    for tail pages. We don't define the meaning of page->mapping for tail
    pages. Currently it's always NULL, which can be inconsistent with the
    head page and potentially lead to problems.

    So far I have caught one case of illegal usage of page flags or
    ->mapping: the sound subsystem allocates pages with __GFP_COMP and maps
    them with PTEs. This leads to setting the dirty bit on tail pages and
    accessing a tail page's ->mapping. I don't see any bad behaviour caused
    by this, but it's worth fixing anyway.

    This patchset makes more sense if you take my THP refcounting work into
    account: we will see more compound pages mapped with PTEs and we need to
    define the behaviour of flags on compound pages to avoid bugs.

    This patch (of 16):

    We have page-flags helper function declarations/definitions spread over
    several header files. Let's consolidate them in <linux/page-flags.h>.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Apr, 2015

1 commit

  • This patch replaces cancel_dirty_page() with a helper function
    account_page_cleaned() which only updates counters. It's called from
    truncate_complete_page() and from try_to_free_buffers() (hack for ext3).
    Page is locked in both cases, page-lock protects against concurrent
    dirtiers: see commit 2d6d7f982846 ("mm: protect set_page_dirty() from
    ongoing truncation").

    delete_from_page_cache() shouldn't be called for dirty pages; they must
    be handled by the caller (either written or truncated). This patch
    treats the final dirty accounting fixup at the end of
    __delete_from_page_cache() as a debug check and adds WARN_ON_ONCE()
    around it. If something removes dirty pages without proper handling,
    that might be a bug and unwritten data might be lost.

    Hugetlbfs has no dirty page accounting, so ClearPageDirty() is enough
    here.

    cancel_dirty_page() in nfs_wb_page_cancel() is redundant. This is a
    helper for nfs_invalidate_page() and it's called only in the case of
    complete invalidation.

    The mess started in v2.6.20 with commits 46d2277c796f ("Clean up
    and make try_to_free_buffers() not race with dirty pages") and
    3e67c0987d75 ("truncate: clear page dirtiness before running
    try_to_free_buffers()"); the first was reverted right in v2.6.20 by
    commit ecdfc9787fe5 ("Resurrect 'try_to_free_buffers()' VM hackery"),
    the second in v2.6.25 by commit a2b345642f53 ("Fix dirty page accounting
    leak with ext3 data=journal").

    Custom fixes were introduced between these points: NFS in v2.6.23 with
    commit 1b3b4a1a2deb ("NFS: Fix a write request leak in
    nfs_invalidate_page()"), and a kludge in __delete_from_page_cache() in
    v2.6.24 with commit 3a6927906f1b ("Do dirty page accounting when
    removing a page from the page cache"). Since v2.6.25 all of them have
    been redundant.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Tejun Heo
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

28 Jan, 2015

1 commit

  • The foreign page flag will be used by Xen guests to mark pages that
    have grant mappings of frames from other (foreign) guests.

    The foreign flag is an alias for the existing (Xen-specific) pinned
    flag. This is safe because pinned is only used on pages used for page
    tables and these cannot also be foreign.

    Signed-off-by: Jennifer Herbert
    Acked-by: Andrew Morton
    Signed-off-by: David Vrabel

    Jennifer Herbert
     

07 Aug, 2014

1 commit

  • - PAGEFLAG_FALSE only defines TEST, make it define SET and CLEAR as
    well, analogous to PAGEFLAG.

    - Define TESTSETFLAG_FALSE, analogous to TESTSETFLAG.

    - Define TESTSCFLAG_FALSE, analogous to TESTSCFLAG

    - Make PG_mlocked accessors the same on both MMU and !MMU setups
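
    A hedged sketch of the first item above (simplified, assuming the existing
    *_NOOP helpers): PAGEFLAG_FALSE expands to a no-op test, set and clear for
    a flag that is configured out.

    #define PAGEFLAG_FALSE(uname)                                           \
            TESTPAGEFLAG_FALSE(uname)                                       \
            SETPAGEFLAG_NOOP(uname)                                         \
            CLEARPAGEFLAG_NOOP(uname)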

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

24 Jun, 2014

1 commit

  • To allow filtering of huge pages, makedumpfile must be able to identify
    them in the dump. This can be done by checking the appropriate page
    flag, so communicate its value to makedumpfile through the VMCOREINFO
    interface.

    There's only one small catch. Depending on how many page flags are
    available on a given architecture, this bit can be called PG_head or
    PG_compound.

    I sent a similar patch back in 2012, but Eric Biederman did not like
    using an #ifdef. So, this time I'm adding a common symbol
    (PG_head_mask) instead.

    See https://lkml.org/lkml/2012/11/28/91 for the previous version.

    Signed-off-by: Petr Tesarik
    Acked-by: Vivek Goyal
    Cc: Eric Biederman
    Cc: Paul Mackerras
    Cc: Fengguang Wu
    Cc: Benjamin Herrenschmidt
    Cc: Shaohua Li
    Cc: Alexey Kardashevskiy
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Tesarik
     

09 Jun, 2014

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "Clean ups and miscellaneous bug fixes, in particular for the new
    collapse_range and zero_range fallocate functions. In addition,
    improve the scalability of adding and remove inodes from the orphan
    list"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (25 commits)
    ext4: handle symlink properly with inline_data
    ext4: fix wrong assert in ext4_mb_normalize_request()
    ext4: fix zeroing of page during writeback
    ext4: remove unused local variable "stored" from ext4_readdir(...)
    ext4: fix ZERO_RANGE test failure in data journalling
    ext4: reduce contention on s_orphan_lock
    ext4: use sbi in ext4_orphan_{add|del}()
    ext4: use EXT_MAX_BLOCKS in ext4_es_can_be_merged()
    ext4: add missing BUFFER_TRACE before ext4_journal_get_write_access
    ext4: remove unnecessary double parentheses
    ext4: do not destroy ext4_groupinfo_caches if ext4_mb_init() fails
    ext4: make local functions static
    ext4: fix block bitmap validation when bigalloc, ^flex_bg
    ext4: fix block bitmap initialization under sparse_super2
    ext4: find the group descriptors on a 1k-block bigalloc,meta_bg filesystem
    ext4: avoid unneeded lookup when xattr name is invalid
    ext4: fix data integrity sync in ordered mode
    ext4: remove obsoleted check
    ext4: add a new spinlock i_raw_lock to protect the ext4's raw inode
    ext4: fix locking for O_APPEND writes
    ...

    Linus Torvalds
     

05 Jun, 2014

1 commit

  • aops->write_begin may allocate a new page and make it visible only to have
    mark_page_accessed called almost immediately after. Once the page is
    visible, the atomic operations are necessary, which is noticeable
    overhead when writing to an in-memory filesystem like tmpfs but should
    also be noticeable with fast storage. The objective of the patch is to
    initialise the accessed information with non-atomic operations before the
    page is visible.

    The bulk of filesystems directly or indirectly use
    grab_cache_page_write_begin or find_or_create_page for the initial
    allocation of a page cache page. This patch adds an init_page_accessed()
    helper which behaves like the first call to mark_page_accessed() but may
    be called before the page is visible and can be done non-atomically.

    The primary APIs of concern in this care are the following and are used
    by most filesystems.

    find_get_page
    find_lock_page
    find_or_create_page
    grab_cache_page_nowait
    grab_cache_page_write_begin

    All of them are very similar in detail, so the patch creates a core
    helper, pagecache_get_page(), which takes a flags parameter that affects
    its behaviour, such as whether the page should be marked accessed or not.
    The old API is preserved but is basically a thin wrapper around this core
    function, as sketched below.
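
    A hedged sketch of that wrapping (the FGP_* flag names and the exact
    pagecache_get_page() signature are assumptions here, not quoted from the
    patch):

    static inline struct page *find_get_page(struct address_space *mapping,
                                             pgoff_t offset)
    {
            return pagecache_get_page(mapping, offset, 0, 0);
    }

    static inline struct page *find_or_create_page(struct address_space *mapping,
                                                   pgoff_t offset, gfp_t gfp)
    {
            return pagecache_get_page(mapping, offset,
                                      FGP_LOCK | FGP_ACCESSED | FGP_CREAT,
                                      gfp);
    }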

    Each of the filesystems are then updated to avoid calling
    mark_page_accessed when it is known that the VM interfaces have already
    done the job. There is a slight snag in that the timing of the
    mark_page_accessed() has now changed so in rare cases it's possible a page
    gets to the end of the LRU as PageReferenced, whereas previously it might
    have been repromoted. This is expected to be rare but it's worth the
    filesystem people thinking about it in case they see a problem with the
    timing change. It is also the case that some filesystems may be marking
    pages accessed that previously did not but it makes sense that filesystems
    have consistent behaviour in this regard.

    The test case used to evaluate this is a simple dd of a large file done
    multiple times with the file deleted on each iterations. The size of the
    file is 1/10th physical memory to avoid dirty page balancing. In the
    async case it will be possible that the workload completes without even
    hitting the disk and will have variable results but highlight the impact
    of mark_page_accessed for async IO. The sync results are expected to be
    more stable. The exception is tmpfs where the normal case is for the "IO"
    to not hit the disk.

    The test machine was single socket and UMA to avoid any scheduling or NUMA
    artifacts. Throughput and wall times are presented for sync IO, only wall
    times are shown for async as the granularity reported by dd and the
    variability is unsuitable for comparison. As async results were variable
    due to writeback timings, I'm only reporting the maximum figures. The sync
    results were stable enough to make the mean and stddev uninteresting.

    The performance results are reported based on a run with no profiling.
    Profile data is based on a separate run with oprofile running.

    async dd
    3.15.0-rc3 3.15.0-rc3
    vanilla accessed-v2
    ext3 Max elapsed 13.9900 ( 0.00%) 11.5900 ( 17.16%)
    tmpfs Max elapsed 0.5100 ( 0.00%) 0.4900 ( 3.92%)
    btrfs Max elapsed 12.8100 ( 0.00%) 12.7800 ( 0.23%)
    ext4 Max elapsed 18.6000 ( 0.00%) 13.3400 ( 28.28%)
    xfs Max elapsed 12.5600 ( 0.00%) 2.0900 ( 83.36%)

    The XFS figure is a bit strange as it managed to avoid a worst case by
    sheer luck but the average figures looked reasonable.

    samples percentage
    ext3 86107 0.9783 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    ext3 23833 0.2710 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    ext3 5036 0.0573 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    ext4 64566 0.8961 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    ext4 5322 0.0713 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    ext4 2869 0.0384 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    xfs 62126 1.7675 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    xfs 1904 0.0554 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    xfs 103 0.0030 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    btrfs 10655 0.1338 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    btrfs 2020 0.0273 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    btrfs 587 0.0079 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    tmpfs 59562 3.2628 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    tmpfs 1210 0.0696 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    tmpfs 94 0.0054 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed

    [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Tested-by: Prabhakar Lad
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman