27 Jul, 2016
5 commits
-
These flags are in use for file THP.
Link: http://lkml.kernel.org/r/1466021202-61880-23-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
As with anon THP, we only mlock file huge pages if we can prove that the
page is not mapped with PTEs. This way we avoid leaking mlock into a
non-mlocked VMA on split.

We rely on PageDoubleMap() under lock_page() to check whether the page
may be PTE-mapped. PG_double_map is set by page_add_file_rmap() when
the page is mapped with PTEs.

Link: http://lkml.kernel.org/r/1466021202-61880-21-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Currently, to charge a non-slab allocation to kmemcg one has to use the
alloc_kmem_pages helper with the __GFP_ACCOUNT flag. A page allocated
with this helper must eventually be freed using free_kmem_pages, otherwise
it won't be uncharged.

This API suits its current users fine, but it turns out to be impossible
to use along with page reference counting, i.e. when an allocation is
supposed to be freed with put_page, as it is the case with pipe or unix
socket buffers.

To overcome this limitation, this patch moves charging/uncharging to
generic page allocator paths, i.e. to __alloc_pages_nodemask and
free_pages_prepare, and zaps alloc/free_kmem_pages helpers. This way,
one can use any of the available page allocation functions to get the
allocated page charged to kmemcg - it's enough to pass __GFP_ACCOUNT,
just like in case of kmalloc and friends. A charged page will be
automatically uncharged on free.

To make it possible, we need to mark pages charged to kmemcg somehow.
To avoid introducing a new page flag, we make use of page->_mapcount for
marking such pages. Since pages charged to kmemcg are not supposed to
be mapped to userspace, it should work just fine. There are other
(ab)users of page->_mapcount - buddy and balloon pages - but we don't
conflict with them.

In case kmemcg is compiled out or not used at runtime, this patch
introduces no overhead to generic page allocator paths. If kmemcg is
used, it costs one extra gfp-flags check on allocation and one extra
page->_mapcount check on free, which shouldn't hurt performance, because
the data accessed are hot.

Link: http://lkml.kernel.org/r/a9736d856f895bcb465d9f257b54efe32eda6f99.1464079538.git.vdavydov@virtuozzo.com
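As a minimal sketch of the usage model described above (illustrative only,
not code from the patch): any page allocator entry point works, the charge
is requested with __GFP_ACCOUNT, and put_page() is enough to free and
uncharge, which is exactly the pattern pipe and unix socket buffers need.

    /* charge an order-0 page to the caller's memcg */
    struct page *page = alloc_page(GFP_KERNEL | __GFP_ACCOUNT);

    if (!page)
        return -ENOMEM;

    /* ... hand the page around; other users may take extra references ... */

    put_page(page);    /* last ref gone: uncharged automatically in the free path */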
Signed-off-by: Vladimir Davydov
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Eric Dumazet
Cc: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
- Add a proper comment to page->_mapcount.
- Introduce a macro for generating helper functions.
- Place all special page->_mapcount values next to each other so that
readers can see all possible values and so we don't get duplicates.

Link: http://lkml.kernel.org/r/502f49000e0b63e6c62e338fac6b420bf34fb526.1464079537.git.vdavydov@virtuozzo.com
Signed-off-by: Vladimir Davydov
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Eric Dumazet
Cc: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Until now we have allowed migration only for LRU pages, and that was
enough to make high-order pages. But recently, embedded systems (e.g.,
webOS, Android) use lots of non-movable pages (e.g., zram, GPU memory), so
we have seen several reports about trouble with small high-order
allocations. There have been several efforts to fix the problem (e.g.,
enhancing the compaction algorithm, SLUB fallback to 0-order pages,
reserved memory, vmalloc and so on), but if there are lots of non-movable
pages in the system, those solutions are void in the long run.

So this patch adds a facility to turn non-movable pages into movable ones.
For the feature, it introduces migration-related functions in
address_space_operations as well as some page flags.

If a driver wants to make its own pages movable, it should define the
three function pointers of struct address_space_operations described
below (a skeletal driver example follows the description).

1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);
What the VM expects of a driver's isolate_page function is that it return
*true* if the driver isolates the page successfully. On returning true,
the VM marks the page PG_isolated, so concurrent isolation on several CPUs
skips the page. If the driver cannot isolate the page, it should return
*false*.

Once a page is successfully isolated, the VM uses the page.lru fields, so
the driver shouldn't expect the values in those fields to be preserved.

2. int (*migratepage) (struct address_space *mapping,
   struct page *newpage, struct page *oldpage, enum migrate_mode);

After isolation, the VM calls the driver's migratepage with the isolated
page. The job of migratepage is to move the content of the old page to the
new page and set up the fields of struct page newpage. Keep in mind that
you should indicate to the VM that the oldpage is no longer movable, via
__ClearPageMovable() under page_lock, if you migrated the oldpage
successfully and return 0. If the driver cannot migrate the page at the
moment, it can return -EAGAIN. On -EAGAIN, the VM will retry page
migration in a short time because the VM interprets -EAGAIN as "temporary
migration failure". On returning any error other than -EAGAIN, the VM
will give up the page migration without retrying.

The driver shouldn't touch the page.lru field, which the VM is using, in
these functions.
3. void (*putback_page)(struct page *);
If migration fails on an isolated page, the VM must return the isolated
page to the driver, so the VM calls the driver's putback_page with the
page whose migration failed. In this function, the driver should put the
isolated page back into its own data structure.

4. non-lru movable page flags
There are two page flags for supporting non-lru movable pages.

* PG_movable

A driver should use the function below to make a page movable under
page_lock:

void __SetPageMovable(struct page *page, struct address_space *mapping)

It takes an address_space argument for registering the family of migration
functions which will be called by the VM. Strictly speaking, PG_movable is
not a real flag of struct page. Rather, the VM reuses the lower bits of
page->mapping to represent it:

#define PAGE_MAPPING_MOVABLE 0x2
page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;

So a driver shouldn't access page->mapping directly. Instead, it should
use page_mapping(), which masks off the low two bits of page->mapping so
that it returns the right struct address_space.

For testing whether a page is non-lru movable, the VM provides the
__PageMovable() function.
However, it doesn't guarantee that the page is non-lru movable, because
the page->mapping field is unified with other variables in struct page.
Also, if the driver releases the page after isolation by the VM,
page->mapping doesn't have a stable value even though it has
PAGE_MAPPING_MOVABLE set (look at __ClearPageMovable). But __PageMovable
is a cheap way to tell whether a page is LRU or non-lru movable once the
page has been isolated, because LRU pages can never have
PAGE_MAPPING_MOVABLE in page->mapping. It is also good for a quick peek at
candidate pages before the more expensive check with lock_page during pfn
scanning to select a victim.

To positively identify a non-lru movable page, the VM provides the
PageMovable() function. Unlike __PageMovable(), PageMovable() validates
page->mapping and mapping->a_ops->isolate_page under lock_page. The
lock_page prevents page->mapping from being destroyed underneath it.

A driver using __SetPageMovable should clear the flag via
__ClearPageMovable under page_lock before releasing the page.

* PG_isolated
To prevent concurrent isolation among several CPUs, the VM marks an
isolated page as PG_isolated under lock_page. So if a CPU encounters a
PG_isolated non-lru movable page, it can skip it. The driver doesn't need
to manipulate the flag because the VM will set/clear it automatically.
Keep in mind that if a driver sees a PG_isolated page, it means the page
has been isolated by the VM, so it shouldn't touch the page.lru field.
PG_isolated is an alias of the PG_reclaim flag, so a driver shouldn't use
the flag for its own purposes.

[opensource.ganesh@gmail.com: mm/compaction: remove local variable is_lru]
Link: http://lkml.kernel.org/r/20160618014841.GA7422@leo-test
Link: http://lkml.kernel.org/r/1464736881-24886-3-git-send-email-minchan@kernel.org
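A skeletal driver wiring for the three callbacks and the PG_movable
marking described above (an illustrative sketch, not from the patch; the
mydrv_* names and mydrv_mapping are placeholders):

    static bool mydrv_isolate_page(struct page *page, isolate_mode_t mode)
    {
        /* detach the page from the driver's own lists; VM holds lock_page */
        return true;    /* true: isolated; false: the VM skips this page */
    }

    static int mydrv_migratepage(struct address_space *mapping,
                                 struct page *newpage, struct page *oldpage,
                                 enum migrate_mode mode)
    {
        /* copy contents and driver-private state from oldpage to newpage */

        /* on success, tell the VM the old page is no longer movable */
        __ClearPageMovable(oldpage);
        return 0;    /* or -EAGAIN to have the VM retry shortly */
    }

    static void mydrv_putback_page(struct page *page)
    {
        /* migration failed: take the isolated page back into driver lists */
    }

    static const struct address_space_operations mydrv_aops = {
        .isolate_page  = mydrv_isolate_page,
        .migratepage   = mydrv_migratepage,
        .putback_page  = mydrv_putback_page,
    };

    /* when the driver creates a page it wants the VM to be able to move
     * (mydrv_mapping->a_ops points at mydrv_aops): */
    lock_page(page);
    __SetPageMovable(page, mydrv_mapping);
    unlock_page(page);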
Signed-off-by: Gioh Kim
Signed-off-by: Minchan Kim
Signed-off-by: Ganesh Mahendran
Acked-by: Vlastimil Babka
Cc: Sergey Senozhatsky
Cc: Rik van Riel
Cc: Joonsoo Kim
Cc: Mel Gorman
Cc: Hugh Dickins
Cc: Rafael Aquini
Cc: Jonathan Corbet
Cc: John Einar Reitan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
21 May, 2016
1 commit
-
struct page->flags is unsigned long, so when shifting bits we should use
the UL suffix to match it.

I found this problem after I added 64-bit-CPU-specific page flags and
failed to compile the kernel:

mm/page_alloc.c: In function '__free_one_page':
mm/page_alloc.c:672:2: error: integer overflow in expression [-Werror=overflow]

Link: http://lkml.kernel.org/r/1461971723-16187-1-git-send-email-yuzhao@google.com
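A minimal illustration of the failure mode (the flag name below is
hypothetical, standing in for any page flag with a bit index of 32 or
more on a 64-bit build):

    /* page->flags is unsigned long (64 bits here), but a plain int constant
     * shifted past bit 31 overflows before it is ever widened: */
    #define BROKEN_MASK   (1 << PG_my_arch_flag)    /* int arithmetic: overflow */
    #define CORRECT_MASK  (1UL << PG_my_arch_flag)  /* matches page->flags */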
Signed-off-by: Yu Zhao
Cc: "Kirill A . Shutemov"
Cc: Michal Hocko
Cc: Naoya Horiguchi
Cc: Jerome Marchand
Cc: Denys Vlasenko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
20 May, 2016
1 commit
-
The PageAnon check always checks for compound_head but this is a
relatively expensive check if the caller already knows the page is a
head page. This patch creates a helper and uses it in the page free
path, which only operates on head pages.

With this patch and "Only check PageCompound for high-order pages", the
performance difference on a page allocator microbenchmark is:

4.6.0-rc2 4.6.0-rc2
vanilla nocompound-v1r20
Min alloc-odr0-1 425.00 ( 0.00%) 417.00 ( 1.88%)
Min alloc-odr0-2 313.00 ( 0.00%) 308.00 ( 1.60%)
Min alloc-odr0-4 257.00 ( 0.00%) 253.00 ( 1.56%)
Min alloc-odr0-8 224.00 ( 0.00%) 221.00 ( 1.34%)
Min alloc-odr0-16 208.00 ( 0.00%) 205.00 ( 1.44%)
Min alloc-odr0-32 199.00 ( 0.00%) 199.00 ( 0.00%)
Min alloc-odr0-64 195.00 ( 0.00%) 193.00 ( 1.03%)
Min alloc-odr0-128 192.00 ( 0.00%) 191.00 ( 0.52%)
Min alloc-odr0-256 204.00 ( 0.00%) 200.00 ( 1.96%)
Min alloc-odr0-512 213.00 ( 0.00%) 212.00 ( 0.47%)
Min alloc-odr0-1024 219.00 ( 0.00%) 219.00 ( 0.00%)
Min alloc-odr0-2048 225.00 ( 0.00%) 225.00 ( 0.00%)
Min alloc-odr0-4096 230.00 ( 0.00%) 231.00 ( -0.43%)
Min alloc-odr0-8192 235.00 ( 0.00%) 234.00 ( 0.43%)
Min alloc-odr0-16384 235.00 ( 0.00%) 234.00 ( 0.43%)
Min free-odr0-1 215.00 ( 0.00%) 191.00 ( 11.16%)
Min free-odr0-2 152.00 ( 0.00%) 136.00 ( 10.53%)
Min free-odr0-4 119.00 ( 0.00%) 107.00 ( 10.08%)
Min free-odr0-8 106.00 ( 0.00%) 96.00 ( 9.43%)
Min free-odr0-16 97.00 ( 0.00%) 87.00 ( 10.31%)
Min free-odr0-32 91.00 ( 0.00%) 83.00 ( 8.79%)
Min free-odr0-64 89.00 ( 0.00%) 81.00 ( 8.99%)
Min free-odr0-128 88.00 ( 0.00%) 80.00 ( 9.09%)
Min free-odr0-256 106.00 ( 0.00%) 95.00 ( 10.38%)
Min free-odr0-512 116.00 ( 0.00%) 111.00 ( 4.31%)
Min free-odr0-1024 125.00 ( 0.00%) 118.00 ( 5.60%)
Min free-odr0-2048 133.00 ( 0.00%) 126.00 ( 5.26%)
Min free-odr0-4096 136.00 ( 0.00%) 130.00 ( 4.41%)
Min free-odr0-8192 138.00 ( 0.00%) 130.00 ( 5.80%)
Min free-odr0-16384 137.00 ( 0.00%) 130.00 ( 5.11%)

There is a sizable boost to free-path performance. While there is an
apparent boost on the allocation side, it's likely a coincidence or due to
the patches slightly reducing cache footprint.

Signed-off-by: Mel Gorman
Acked-by: Vlastimil Babka
Cc: Jesper Dangaard Brouer
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
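For illustration only (the helper name here is hypothetical; the patch's
actual helper may be spelled differently), a head-only variant of the anon
test simply drops the compound_head() lookup that the generic PageAnon()
performs:

    /* caller guarantees 'page' is a head (or order-0) page, e.g. in the
     * free path, so no compound_head() indirection is needed */
    static inline bool page_anon_head(struct page *page)
    {
        return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
    }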
06 May, 2016
1 commit
-
After the THP refcounting change, obtaining a compound page from
get_user_pages() no longer allows us to assume the entire compound page
is immediately mappable from a secondary MMU.

A secondary MMU doesn't want to call get_user_pages() more than once for
each compound page, in order to know if it can map the whole compound
page. So a secondary MMU needs to know from a single get_user_pages()
invocation when it can map immediately the entire compound page to avoid
a flood of unnecessary secondary MMU faults and spurious
atomic_inc()/atomic_dec() (pages don't have to be pinned by MMU notifier
users).

Ideally, instead of the page->_mapcount < 1 check, get_user_pages()
should return the granularity of the "page" mapping in the "mm" passed
to get_user_pages(). However, it's a non-trivial change to pass the "pmd"
status belonging to the "mm" walked by get_user_pages up the stack (up
to the caller of get_user_pages). So the fix just checks if there is
not a single pte mapping on the page returned by get_user_pages, and in
turn if the caller can assume that the whole compound page is mapped in
the current "mm" (in a pmd_trans_huge()). In such case the entire
compound page is safe to map into the secondary MMU without additional
get_user_pages() calls on the surrounding tail/head pages. In addition
of being faster, not having to run other get_user_pages() calls also
reduces the memory footprint of the secondary MMU fault in case the pmd
split happened as result of memory pressure.Without this fix after a MADV_DONTNEED (like invoked by QEMU during
postcopy live migration or balloning) or after generic swapping (with a
failure in split_huge_page() that would only result in pmd splitting and
not a physical page split), KVM would map the whole compound page into
the shadow pagetables, despite regular faults or userfaults (like
UFFDIO_COPY) may map regular pages into the primary MMU as result of the
pte faults, leading to the guest mode and userland mode going out of
sync and not working on the same memory at all times.

Any other secondary MMU notifier manager (KVM is just one of the many
MMU notifier users) will need the same information if it doesn't want to
run a flood of get_user_pages_fast and it can support multiple
granularity in the secondary MMU mappings, so I think it is justified to
be exposed not just to KVM.

The other option would be to move transparent_hugepage_adjust to
mm/huge_memory.c but that currently has all kind of KVM data structures
in it, so it's definitely not cut-and-paste work, and I couldn't do a
fix cleaner than this one for 4.6.

Signed-off-by: Andrea Arcangeli
Cc: "Dr. David Alan Gilbert"
Cc: "Kirill A. Shutemov"
Cc: "Li, Liang Z"
Cc: Amit Shah
Cc: Paolo Bonzini
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
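A simplified sketch of the check this entry describes (the helper name is
illustrative, not the one from the patch; _mapcount is biased by -1, so a
negative value means the subpage has no PTE mappings and the compound page
can only be PMD-mapped in the mm that was just walked):

    static inline bool gup_can_map_compound(struct page *page)
    {
        /* part of a THP and no leftover PTE mappings of this subpage */
        return PageTransCompound(page) &&
               atomic_read(&page->_mapcount) < 0;
    }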
18 Mar, 2016
2 commits
-
Sometimes gcc mysteriously doesn't inline
very small functions we expect to be inlined. See
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
With this .config:
http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os,
the following functions get deinlined many times.
Examples of disassembly:

(43 copies, 141 calls):
55 push %rbp
48 89 e5 mov %rsp,%rbp
f0 80 0f 08 lock orb $0x8,(%rdi)
5d pop %rbp
c3 retq

(10 copies, 134 calls):
48 8b 07 mov (%rdi),%rax
55 push %rbp
48 89 e5 mov %rsp,%rbp
48 c1 e8 0b shr $0xb,%rax
83 e0 01 and $0x1,%eax
5d pop %rbp
c3 retq

This patch fixes this via s/inline/__always_inline/.
Code size decrease after the patch is ~7k:
text data bss dec hex filename
92125002 20826048 36417536 149368586 8e72f0a vmlinux
92118087 20826112 36417536 149361735 8e71447 vmlinux7_pageops_after

Signed-off-by: Denys Vlasenko
Cc: Ingo Molnar
Cc: Thomas Graf
Cc: Peter Zijlstra
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Currently /proc/kpageflags returns nothing for "tail" buddy pages, which
is inconvenient when grasping how free pages are distributed. This
patch sets KPF_BUDDY for such pages.

With this patch:
$ grep MemFree /proc/meminfo ; tools/vm/page-types -b buddy
MemFree: 3134992 kB
flags page-count MB symbolic-flags long-symbolic-flags
0x0000000000000400 779272 3044 __________B_______________________________ buddy
0x0000000000000c00 4385 17 __________BM______________________________ buddy,mmap
total 783657 3061

783657 pages is 3134628 kB (roughly consistent with the global counter),
so it's OK.

[akpm@linux-foundation.org: update comment, per Naoya]
Signed-off-by: Naoya Horiguchi
Reviewed-by: Vladimir Davydov
Cc: Konstantin Khlebnikov
Cc: Naoya Horiguchi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
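For reference, a small userspace sketch of the same accounting that
page-types does: /proc/kpageflags exposes one 64-bit word of KPF_* bits
per PFN, and KPF_BUDDY is bit 10 (needs root; error handling kept
minimal):

    #include <stdint.h>
    #include <stdio.h>

    #define KPF_BUDDY 10   /* include/uapi/linux/kernel-page-flags.h */

    int main(void)
    {
        FILE *f = fopen("/proc/kpageflags", "rb");
        uint64_t flags, buddy_pages = 0;

        if (!f)
            return 1;
        while (fread(&flags, sizeof(flags), 1, f) == 1)
            if (flags & (1ULL << KPF_BUDDY))
                buddy_pages++;
        fclose(f);
        printf("buddy pages: %llu (%llu kB)\n",
               (unsigned long long)buddy_pages,
               (unsigned long long)buddy_pages * 4);   /* assumes 4 kB pages */
        return 0;
    }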
16 Jan, 2016
18 commits
-
We're going to allow mapping of individual 4k pages of a THP compound
page. That means we need to track the mapcount on a per-small-page basis.

The straightforward approach is to use ->_mapcount in all subpages to
track how many times each subpage is mapped with PMDs and PTEs combined.
But this is rather expensive: mapping or unmapping a THP page with a PMD
would require HPAGE_PMD_NR atomic operations instead of the single one we
have now.

The idea is to store separately how many times the page was mapped as a
whole -- compound_mapcount. This frees up ->_mapcount in subpages to
track the PTE mapcount.

We use the same approach as with the compound page destructor and compound
order to store compound_mapcount: use space in the first tail page,
->mapping this time.

Any time we map/unmap a whole compound page (THP or hugetlb) we
increment/decrement compound_mapcount. When we map part of a compound
page with a PTE, we operate on ->_mapcount of the subpage.

page_mapcount() counts both PTE and PMD mappings of the page. Basically,
the mapcount for a subpage is spread over two counters, which makes it
tricky to detect when the last mapcount for a page goes away.

We introduced PageDoubleMap() for this. When we split a THP PMD for the
first time and there is another PMD mapping left, we offset ->_mapcount
in all subpages by one and set PG_double_map on the compound page. These
additional references go away with the last compound_mapcount.

This approach provides a way to detect when the last mapcount goes away on
a per-small-page basis without introducing new overhead for the most
common cases.

[akpm@linux-foundation.org: fix typo in comment]
[mhocko@suse.com: ignore partial THP when moving task]
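A simplified sketch of the combined counting described above (illustrative
only -- the in-tree helpers also deal with the PageDoubleMap() offset and
with hugetlb details):

    static inline int mapcount_sketch(struct page *page)
    {
        /* PTE mappings of this subpage; _mapcount starts at -1 */
        int count = atomic_read(&page->_mapcount) + 1;

        /* plus PMD (or hugetlb) mappings of the compound page as a whole */
        if (PageCompound(page))
            count += compound_mapcount(compound_head(page));

        return count;
    }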
Signed-off-by: Kirill A. Shutemov
Tested-by: Aneesh Kumar K.V
Acked-by: Jerome Marchand
Cc: Sasha Levin
Cc: Aneesh Kumar K.V
Cc: Jerome Marchand
Cc: Vlastimil Babka
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: Dave Hansen
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Naoya Horiguchi
Cc: Steve Capper
Cc: Johannes Weiner
Cc: Christoph Lameter
Cc: David Rientjes
Signed-off-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
We are going to use migration entries to stabilize page counts. It
means we don't need compound_lock() for that.

Signed-off-by: Kirill A. Shutemov
Tested-by: Sasha Levin
Tested-by: Aneesh Kumar K.V
Acked-by: Vlastimil Babka
Acked-by: Jerome Marchand
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: Dave Hansen
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Naoya Horiguchi
Cc: Steve Capper
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Christoph Lameter
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Nobody uses them.
Signed-off-by: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
PageAnon() and PageKsm() look at lower bits of page->mapping to check if
the page is Anon or KSM. page->mapping can be overloaded in tail pages.

Let's always look at the head page to avoid false positives.
Signed-off-by: Kirill A. Shutemov
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: Dave Hansen
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Vlastimil Babka
Cc: Christoph Lameter
Cc: Naoya Horiguchi
Cc: Steve Capper
Cc: "Aneesh Kumar K.V"
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Jerome Marchand
Cc: Jérôme Glisse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
We use PG_uptodate only on the head pages of transparent huge pages.
Let's use PF_NO_TAIL.

Signed-off-by: Kirill A. Shutemov
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: Dave Hansen
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Vlastimil Babka
Cc: Christoph Lameter
Cc: Naoya Horiguchi
Cc: Steve Capper
Cc: "Aneesh Kumar K.V"
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Jerome Marchand
Cc: Jérôme Glisse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
So far, only IA64 uses PG_uncached and only on non-compound pages.
Signed-off-by: Kirill A. Shutemov
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: Dave Hansen
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Vlastimil Babka
Cc: Christoph Lameter
Cc: Naoya Horiguchi
Cc: Steve Capper
Cc: "Aneesh Kumar K.V"
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Jerome Marchand
Cc: Jérôme Glisse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Transparent huge pages can be mlocked -- the whole compound page at once.
Something has gone wrong if we are trying to mlock() a tail page. Let's
use PF_NO_TAIL.

Signed-off-by: Kirill A. Shutemov
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: Dave Hansen
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Vlastimil Babka
Cc: Christoph Lameter
Cc: Naoya Horiguchi
Cc: Steve Capper
Cc: "Aneesh Kumar K.V"
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Jerome Marchand
Cc: Jérôme Glisse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Swap cannot handle compound pages so far. Transparent huge pages are
split on the way to swap.

Signed-off-by: Kirill A. Shutemov
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: Dave Hansen
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Vlastimil Babka
Cc: Christoph Lameter
Cc: Naoya Horiguchi
Cc: Steve Capper
Cc: "Aneesh Kumar K.V"
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Jerome Marchand
Cc: Jérôme Glisse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
PG_swapbacked is used for transparent huge pages, but only on head pages.
Let's use the PF_NO_TAIL policy.

Signed-off-by: Kirill A. Shutemov
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: Dave Hansen
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Vlastimil Babka
Cc: Christoph Lameter
Cc: Naoya Horiguchi
Cc: Steve Capper
Cc: "Aneesh Kumar K.V"
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Jerome Marchand
Cc: Jérôme Glisse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
As far as I can see there are no users of PG_reserved on compound pages.
Let's use PF_NO_COMPOUND here.

Signed-off-by: Kirill A. Shutemov
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: Dave Hansen
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Vlastimil Babka
Cc: Christoph Lameter
Cc: Naoya Horiguchi
Cc: Steve Capper
Cc: "Aneesh Kumar K.V"
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Jerome Marchand
Cc: Jérôme Glisse
Cc: Hillf Danton
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
PG_pinned and PG_savepinned are about page table pages, which are never
compound.

I'm not so sure about PG_foreign, but it seems we shouldn't see compound
pages there either.

Let's use PF_NO_COMPOUND for all of them.
Signed-off-by: Kirill A. Shutemov
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: Dave Hansen
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Vlastimil Babka
Cc: Christoph Lameter
Cc: Naoya Horiguchi
Cc: Steve Capper
Cc: "Aneesh Kumar K.V"
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Jerome Marchand
Cc: Jérôme Glisse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
SL*B uses compound pages and marks head pages with PG_slab.
__SetPageSlab() and __ClearPageSlab() are never called for tail pages.

The same situation holds for PG_slob_free in the SLOB allocator.
PF_NO_TAIL is appropriate for these flags.
Signed-off-by: Kirill A. Shutemov
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: Dave Hansen
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Vlastimil Babka
Cc: Christoph Lameter
Cc: Naoya Horiguchi
Cc: Steve Capper
Cc: "Aneesh Kumar K.V"
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Jerome Marchand
Cc: Jérôme Glisse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Only head pages are ever on LRU. Let's use PF_HEAD policy to avoid any
confusion for all LRU-related flags.

Signed-off-by: Kirill A. Shutemov
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: Dave Hansen
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Vlastimil Babka
Cc: Christoph Lameter
Cc: Naoya Horiguchi
Cc: Steve Capper
Cc: "Aneesh Kumar K.V"
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Jerome Marchand
Cc: Jérôme Glisse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
It seems we don't have compound pages on the FS/IO path currently. Use
PF_NO_COMPOUND to catch it if we do.

The odd exception is PG_dirty: the sound subsystem uses compound pages and
maps them with PTEs. PF_NO_COMPOUND triggers VM_BUG_ON() in
set_page_dirty() when handling a shared fault. Let's use PF_HEAD for
PG_dirty.

Signed-off-by: Kirill A. Shutemov
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: Dave Hansen
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Vlastimil Babka
Cc: Christoph Lameter
Cc: Naoya Horiguchi
Cc: Steve Capper
Cc: "Aneesh Kumar K.V"
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Jerome Marchand
Cc: Jérôme Glisse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
lock_page() must operate on the whole compound page. It doesn't make
much sense to lock part of a compound page. Change the code to use the
head page's PG_locked if a tail page is passed.

This patch also gets rid of the custom helper functions
__set_page_locked() and __clear_page_locked(). They are replaced with
helpers generated by __SETPAGEFLAG/__CLEARPAGEFLAG. Passing tail pages to
these helpers would trigger VM_BUG_ON().

SLUB uses PG_locked as a bit spin lock. IIUC, tail pages should never
appear there. VM_BUG_ON() is added to make sure that this assumption is
correct.

[akpm@linux-foundation.org: fix fs/cifs/file.c]
Signed-off-by: Kirill A. Shutemov
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: Dave Hansen
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Vlastimil Babka
Cc: Christoph Lameter
Cc: Naoya Horiguchi
Cc: Steve Capper
Cc: "Aneesh Kumar K.V"
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Jerome Marchand
Cc: Jérôme Glisse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This patch adds a third argument to the macros which create function
definitions for page flags. This argument defines how the page-flag
helpers behave on compound pages.

For now we define four policies:

- PF_ANY: the helper function operates on the page it gets, regardless
of whether it's non-compound, head or tail.

- PF_HEAD: the helper function operates on the head page of the
compound page if it gets a tail page.

- PF_NO_TAIL: only head and non-compound pages are acceptable for this
helper function.

- PF_NO_COMPOUND: only non-compound pages are acceptable for this
helper function.

For now we use the PF_ANY policy for all helpers, which matches current
behaviour.

We do not enforce the policy for TESTPAGEFLAG, because we have flags
checked for random pages all over the kernel. A noticeable exception to
this is PageTransHuge(), which triggers VM_BUG_ON() for a tail page.

Signed-off-by: Kirill A. Shutemov
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: Dave Hansen
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Vlastimil Babka
Cc: Christoph Lameter
Cc: Naoya Horiguchi
Cc: Steve Capper
Cc: "Aneesh Kumar K.V"
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Jerome Marchand
Cc: Jérôme Glisse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The preparation patch: we are going to use compound_head(), PageTail()
and PageCompound() to define page-flags helpers. Let's define them before
the macros.

We cannot use the PageHead() helper in PageCompound() as it's not yet
defined -- use test_bit(PG_head, &page->flags) instead.

Signed-off-by: Kirill A. Shutemov
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: Dave Hansen
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Vlastimil Babka
Cc: Christoph Lameter
Cc: Naoya Horiguchi
Cc: Steve Capper
Cc: "Aneesh Kumar K.V"
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Jerome Marchand
Cc: Jérôme Glisse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use TESTPAGEFLAG_FALSE() to get it a bit cleaner.
Signed-off-by: Kirill A. Shutemov
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: Dave Hansen
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Vlastimil Babka
Cc: Christoph Lameter
Cc: Naoya Horiguchi
Cc: Steve Capper
Cc: "Aneesh Kumar K.V"
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Jerome Marchand
Cc: Jérôme Glisse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
07 Nov, 2015
1 commit
-
Hugh has pointed out that the compound_head() call can be unsafe in some
contexts. Here's one example:

CPU0                                    CPU1
isolate_migratepages_block()
  page_count()
    compound_head()
      !!PageTail() == true
                                        put_page()
                                          tail->first_page = NULL
      head = tail->first_page
                                        alloc_pages(__GFP_COMP)
                                          prep_compound_page()
                                            tail->first_page = head
                                            __SetPageTail(p);
      !!PageTail() == true
The race is purely theoretical. I don't think it's possible to trigger it
in practice. But who knows.

We can fix the race by changing how we encode PageTail() and
compound_head() within struct page, so that they can be updated in one
shot.

The patch introduces page->compound_head into the third double-word block
in front of compound_dtor and compound_order. Bit 0 encodes PageTail()
and the rest of the bits are a pointer to the head page if bit zero is
set.

The patch moves page->pmd_huge_pte out of the word, just in case an
architecture defines pgtable_t as something that can have bit 0 set.

hugetlb_cgroup uses page->lru.next in the second tail page to store a
pointer to struct hugetlb_cgroup. The patch switches it to use
page->private in the second tail page instead. The space is free since
->first_page is removed from the union.

The patch also opens the possibility of removing the
HUGETLB_CGROUP_MIN_ORDER limitation, since there is now space in the first
tail page to store the struct hugetlb_cgroup pointer. But that's out of
scope for this patch.

That means page->compound_head shares storage space with:

- page->lru.next;
- page->next;
- page->rcu_head.next;

That's too long a list to be absolutely sure, but it looks like nobody
uses bit 0 of the word.

page->rcu_head.next is guaranteed[1] to have bit 0 clear as long as we use
call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But a future
call_rcu_lazy() is not allowed, as it makes use of the bit and we could
get a false positive PageTail().

[1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.com
Signed-off-by: Kirill A. Shutemov
Acked-by: Michal Hocko
Reviewed-by: Andrea Arcangeli
Cc: Hugh Dickins
Cc: David Rientjes
Cc: Vlastimil Babka
Acked-by: Paul E. McKenney
Cc: Aneesh Kumar K.V
Cc: Andi Kleen
Cc: Christoph Lameter
Cc: Joonsoo Kim
Cc: Sergey Senozhatsky
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
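A sketch of the resulting encoding (simplified from the in-tree helpers):

    static inline int PageTail(struct page *page)
    {
        return READ_ONCE(page->compound_head) & 1;   /* bit 0 marks a tail page */
    }

    static inline struct page *compound_head(struct page *page)
    {
        unsigned long head = READ_ONCE(page->compound_head);

        if (head & 1)
            return (struct page *)(head - 1);   /* remaining bits: head pointer */
        return page;
    }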
06 Nov, 2015
1 commit
-
This came up when implementing HIGHMEM/PAE40 for ARC. The kmap() /
kmap_atomic() generated code seemed needlessly bloated due to the way the
PageHighMem() macro is implemented. It derives the exact zone for the
page and then does pointer subtraction with the first zone to infer the
zone_type. The pointer arithmetic in turn generates the code bloat.

PageHighMem(page)
  is_highmem(page_zone(page))
    zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones

Instead, use is_highmem_idx() to work on the zone_type available in the
page flags.
----- Before -----
80756348: mov_s r13,r0
8075634a: ld_s r2,[r13,0]
8075634c: lsr_s r2,r2,30
8075634e: mpy r2,r2,0x2a4
80756352: add_s r2,r2,0x80aef880
80756358: ld_s r3,[r2,28]
8075635a: sub_s r2,r2,r3
8075635c: breq r2,0x2a4,80756378
80756364: breq r2,0x548,80756378

----- After -----
80756330: mov_s r13,r0
80756332: ld_s r2,[r13,0]
80756334: lsr_s r2,r2,30
80756336: sub_s r2,r2,1
80756338: brlo r2,2,80756348

For an x86 defconfig build (32-bit only) it saves around 900 bytes.
For an ARC defconfig with HIGHMEM, it saves around 2K bytes.

---->8-------
./scripts/bloat-o-meter x86/vmlinux-defconfig-pre x86/vmlinux-defconfig-post
add/remove: 0/0 grow/shrink: 0/36 up/down: 0/-934 (-934)
function old new delta
saveable_page 162 154 -8
saveable_highmem_page 154 146 -8
skb_gro_reset_offset 147 131 -16
...
...
__change_page_attr_set_clr 1715 1678 -37
setup_data_read 434 394 -40
mon_bin_event 1967 1927 -40
swsusp_save 1148 1105 -43
_set_pages_array 549 493 -56
---->8-------

e.g. For ARC kmap()
Signed-off-by: Vineet Gupta
Acked-by: Michal Hocko
Cc: Naoya Horiguchi
Cc: Hugh Dickins
Cc: "Kirill A. Shutemov"
Cc: Michal Hocko
Cc: Jennifer Herbert
Cc: Konstantin Khlebnikov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
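The gist of the change, sketched (the _before/_after suffixes exist only
for this illustration; the real definition lives in the page-flags
header):

    /* before: derive the zone and do the pointer math shown above */
    #define PageHighMem_before(page)  is_highmem(page_zone(page))

    /* after: use the zone index already encoded in page->flags */
    #define PageHighMem_after(page)   is_highmem_idx(page_zonenum(page))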
11 Sep, 2015
1 commit
-
Knowing the portion of memory that is not used by a certain application or
memory cgroup (idle memory) can be useful for partitioning the system
efficiently, e.g. by setting memory cgroup limits appropriately.
Currently, the only means to estimate the amount of idle memory provided
by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
access bit for all pages mapped to a particular process by writing 1 to
clear_refs, wait for some time, and then count smaps:Referenced. However,
this method has two serious shortcomings:

- it does not count unmapped file pages
- it affects the reclaimer logic

To overcome these drawbacks, this patch introduces two new page flags,
Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
A page's Idle flag can only be set from userspace by setting bit in
/sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
and it is cleared whenever the page is accessed either through page tables
(it is cleared in page_referenced() in this case) or using the read(2)
system call (mark_page_accessed()). Thus by setting the Idle flag for
pages of a particular workload, which can be found e.g. by reading
/proc/PID/pagemap, waiting for some time to let the workload access its
working set, and then reading the bitmap file, one can estimate the amount
of pages that are not used by the workload.The Young page flag is used to avoid interference with the memory
reclaimer. A page's Young flag is set whenever the Access bit of a page
table entry pointing to the page is cleared by writing to the bitmap file.
If page_referenced() is called on a Young page, it will add 1 to its
return value, therefore concealing the fact that the Access bit was
cleared.

Note, since there is no room for extra page flags on 32 bit, this feature
uses extended page flags when compiled on 32 bit.

[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: kpageidle requires an MMU]
[akpm@linux-foundation.org: decouple from page-flags rework]
Signed-off-by: Vladimir Davydov
Reviewed-by: Andres Lagar-Cavilla
Cc: Minchan Kim
Cc: Raghavendra K T
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Greg Thelen
Cc: Michel Lespinasse
Cc: David Rientjes
Cc: Pavel Emelyanov
Cc: Cyrill Gorcunov
Cc: Jonathan Corbet
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
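A userspace sketch of that workflow (the PFN and wait time are
placeholders; the bitmap is an array of 8-byte words, one bit per PFN, and
must be accessed in 8-byte units with appropriate privileges):

    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    /* mark the 64 pages covering 'pfn' idle, wait, and report which stayed idle */
    static uint64_t sample_idle(int fd, uint64_t pfn, unsigned int seconds)
    {
        uint64_t bits = ~0ULL;
        off_t offset = (pfn / 64) * 8;   /* one u64 word covers 64 PFNs */

        pwrite(fd, &bits, sizeof(bits), offset);   /* set the Idle flags */
        sleep(seconds);                            /* let the workload run */
        pread(fd, &bits, sizeof(bits), offset);    /* a bit still set => still idle */
        return bits;
    }

    /* usage: int fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR); */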
07 Aug, 2015
1 commit
-
The race condition addressed in commit add05cecef80 ("mm: soft-offline:
don't free target page in successful page migration") was not closed
completely, because that can happen not only for soft-offline, but also
for hard-offline. Consider that a slab page is about to be freed into
buddy pool, and then an uncorrected memory error hits the page just
after entering __free_one_page(), then VM_BUG_ON_PAGE(page->flags &
PAGE_FLAGS_CHECK_AT_PREP) is triggered, despite the fact that it's not
necessary because the data on the affected page is not consumed.

To solve it, this patch drops __PG_HWPOISON from the page flag checks at
allocation/free time. I think it's justified because the __PG_HWPOISON
flag is defined to prevent the page from being reused, and setting it
outside the page's alloc-free cycle is a designed behavior (not a bug).

In recent months, I was annoyed by a BUG_ON when a soft-offlined page
remains on lru cache list for a while, which is avoided by calling
put_page() instead of putback_lru_page() in page migration's success
path. This means that this patch reverts a major change from commit
add05cecef80 about the new refcounting rule of soft-offlined pages, so
"reuse window" revives. This will be closed by a subsequent patch.Signed-off-by: Naoya Horiguchi
Cc: Andi Kleen
Cc: Dean Nelson
Cc: Tony Luck
Cc: "Kirill A. Shutemov"
Cc: Hugh Dickins
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
16 Apr, 2015
2 commits
-
Now we have easy access to hugepages' activeness, so the existing helpers
to get the information can be cleaned up.

[akpm@linux-foundation.org: s/PageHugeActive/page_huge_active/]
Signed-off-by: Naoya Horiguchi
Cc: Hugh Dickins
Reviewed-by: Michal Hocko
Cc: Mel Gorman
Cc: Johannes Weiner
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Currently we take a naive approach to page flags on compound pages - we
set the flag on the page without considering whether the flag makes sense
for a tail page or for the compound page in general. This patchset tries
to sort this out by defining a per-flag policy on what needs to be done
when a page-flag helper operates on a compound page.

The last patch in the patchset also sanitizes usage of page->mapping for
tail pages. We don't define the meaning of page->mapping for tail pages.
Currently it's always NULL, which can be inconsistent with the head page
and potentially lead to problems.

For now I caught one case of illegal usage of page flags or ->mapping:
sound subsystem allocates pages with __GFP_COMP and maps them with PTEs.
It leads to setting dirty bit on tail pages and access to tail_page's
->mapping. I don't see any bad behaviour caused by this, but worth
fixing anyway.

This patchset makes more sense if you take my THP refcounting into
account: we will see more compound pages mapped with PTEs and we need to
define the behaviour of flags on compound pages to avoid bugs.

This patch (of 16):

We have page-flags helper function declarations/definitions spread over
several header files. Let's consolidate them in <linux/page-flags.h>.

Signed-off-by: Kirill A. Shutemov
Cc: Andrea Arcangeli
Acked-by: Hugh Dickins
Cc: Dave Hansen
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Vlastimil Babka
Cc: Christoph Lameter
Cc: Naoya Horiguchi
Cc: Steve Capper
Cc: "Aneesh Kumar K.V"
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Jerome Marchand
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
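As an illustration of the per-flag policy this cover letter describes (a
paraphrased sketch; the in-tree macros are named PF_ANY, PF_HEAD,
PF_NO_TAIL and PF_NO_COMPOUND but may differ in detail):

    /* each policy picks the struct page a helper should actually operate on */
    #define PF_ANY(page, enforce)      (page)
    #define PF_HEAD(page, enforce)     compound_head(page)
    #define PF_NO_TAIL(page, enforce) ({                              \
            VM_BUG_ON_PAGE(enforce && PageTail(page), page);          \
            compound_head(page); })
    #define PF_NO_COMPOUND(page, enforce) ({                          \
            VM_BUG_ON_PAGE(enforce && PageCompound(page), page);      \
            (page); })

    /* flag helpers are then generated with an explicit policy, e.g.: */
    #define SETPAGEFLAG(uname, lname, policy)                         \
    static inline void SetPage##uname(struct page *page)              \
            { set_bit(PG_##lname, &policy(page, 1)->flags); }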
15 Apr, 2015
1 commit
-
This patch replaces cancel_dirty_page() with a helper function
account_page_cleaned() which only updates counters. It's called from
truncate_complete_page() and from try_to_free_buffers() (hack for ext3).
Page is locked in both cases, page-lock protects against concurrent
dirtiers: see commit 2d6d7f982846 ("mm: protect set_page_dirty() from
ongoing truncation").Delete_from_page_cache() shouldn't be called for dirty pages, they must
be handled by caller (either written or truncated). This patch treats
final dirty accounting fixup at the end of __delete_from_page_cache() as
a debug check and adds WARN_ON_ONCE() around it. If something removes
dirty pages without proper handling that might be a bug and unwritten
data might be lost.

Hugetlbfs has no dirty-page accounting; ClearPageDirty() is enough here.

cancel_dirty_page() in nfs_wb_page_cancel() is redundant. This is a
helper for nfs_invalidate_page() and it's called only in the case of
complete invalidation.

The mess was started in v2.6.20 after commits 46d2277c796f ("Clean up
and make try_to_free_buffers() not race with dirty pages") and
3e67c0987d75 ("truncate: clear page dirtiness before running
try_to_free_buffers()") first was reverted right in v2.6.20 in commit
ecdfc9787fe5 ("Resurrect 'try_to_free_buffers()' VM hackery"), second in
v2.6.25 commit a2b345642f53 ("Fix dirty page accounting leak with ext3
data=journal").Custom fixes were introduced between these points. NFS in v2.6.23, commit
1b3b4a1a2deb ("NFS: Fix a write request leak in nfs_invalidate_page()").
Kludge in __delete_from_page_cache() in v2.6.24, commit 3a6927906f1b ("Do
dirty page accounting when removing a page from the page cache"). Since
v2.6.25 all of them are redundant.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Konstantin Khlebnikov
Cc: Tejun Heo
Cc: Jan Kara
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
28 Jan, 2015
1 commit
-
The foreign page flag will be used by Xen guests to mark pages that
have grant mappings of frames from other (foreign) guests.

The foreign flag is an alias for the existing (Xen-specific) pinned
flag. This is safe because pinned is only used on pages used for page
tables and these cannot also be foreign.

Signed-off-by: Jennifer Herbert
Acked-by: Andrew Morton
Signed-off-by: David Vrabel
07 Aug, 2014
1 commit
-
- PAGEFLAG_FALSE only defines TEST, make it define SET and CLEAR as
well, analogous to PAGEFLAG.

- Define TESTSETFLAG_FALSE, analogous to TESTSETFLAG.
- Define TESTSCFLAG_FALSE, analogous to TESTSCFLAG
- Make PG_mlocked accessors the same on both MMU and !MMU setups
Signed-off-by: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
24 Jun, 2014
1 commit
-
To allow filtering of huge pages, makedumpfile must be able to identify
them in the dump. This can be done by checking the appropriate page
flag, so communicate its value to makedumpfile through the VMCOREINFO
interface.

There's only one small catch. Depending on how many page flags are
available on a given architecture, this bit can be called PG_head or
PG_compound.

I sent a similar patch back in 2012, but Eric Biederman did not like
using an #ifdef. So, this time I'm adding a common symbol
(PG_head_mask) instead.

See https://lkml.org/lkml/2012/11/28/91 for the previous version.
Signed-off-by: Petr Tesarik
Acked-by: Vivek Goyal
Cc: Eric Biederman
Cc: Paul Mackerras
Cc: Fengguang Wu
Cc: Benjamin Herrenschmidt
Cc: Shaohua Li
Cc: Alexey Kardashevskiy
Cc: Sasha Levin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
09 Jun, 2014
1 commit
-
Pull ext4 updates from Ted Ts'o:
"Clean ups and miscellaneous bug fixes, in particular for the new
collapse_range and zero_range fallocate functions. In addition,
improve the scalability of adding and remove inodes from the orphan
list"* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (25 commits)
ext4: handle symlink properly with inline_data
ext4: fix wrong assert in ext4_mb_normalize_request()
ext4: fix zeroing of page during writeback
ext4: remove unused local variable "stored" from ext4_readdir(...)
ext4: fix ZERO_RANGE test failure in data journalling
ext4: reduce contention on s_orphan_lock
ext4: use sbi in ext4_orphan_{add|del}()
ext4: use EXT_MAX_BLOCKS in ext4_es_can_be_merged()
ext4: add missing BUFFER_TRACE before ext4_journal_get_write_access
ext4: remove unnecessary double parentheses
ext4: do not destroy ext4_groupinfo_caches if ext4_mb_init() fails
ext4: make local functions static
ext4: fix block bitmap validation when bigalloc, ^flex_bg
ext4: fix block bitmap initialization under sparse_super2
ext4: find the group descriptors on a 1k-block bigalloc,meta_bg filesystem
ext4: avoid unneeded lookup when xattr name is invalid
ext4: fix data integrity sync in ordered mode
ext4: remove obsoleted check
ext4: add a new spinlock i_raw_lock to protect the ext4's raw inode
ext4: fix locking for O_APPEND writes
...
05 Jun, 2014
1 commit
-
aops->write_begin may allocate a new page and make it visible only to have
mark_page_accessed called almost immediately after. Once the page is
visible, the atomic operations are necessary, which is noticeable
overhead when writing to an in-memory filesystem like tmpfs, but it should
also be noticeable with fast storage. The objective of the patch is to
initialise the accessed information with non-atomic operations before the
page is visible.

The bulk of filesystems directly or indirectly use
grab_cache_page_write_begin or find_or_create_page for the initial
allocation of a page cache page. This patch adds an init_page_accessed()
helper which behaves like the first call to mark_page_accessed() but may
be called before the page is visible and can be done non-atomically.

The primary APIs of concern in this case are the following, and they are
used by most filesystems:

find_get_page
find_lock_page
find_or_create_page
grab_cache_page_nowait
grab_cache_page_write_begin

All of them are very similar in detail, so the patch creates a core helper
pagecache_get_page() which takes a flags parameter that affects its
behavior such as whether the page should be marked accessed or not. Then
old API is preserved but is basically a thin wrapper around this core
function.

Each of the filesystems is then updated to avoid calling
mark_page_accessed when it is known that the VM interfaces have already
done the job. There is a slight snag in that the timing of the
mark_page_accessed() has now changed so in rare cases it's possible a page
gets to the end of the LRU as PageReferenced whereas previously it might
have been repromoted. This is expected to be rare but it's worth the
filesystem people thinking about it in case they see a problem with the
timing change. It is also the case that some filesystems may be marking
pages accessed that previously did not but it makes sense that filesystems
have consistent behaviour in this regard.

The test case used to evaluate this is a simple dd of a large file done
multiple times with the file deleted on each iterations. The size of the
file is 1/10th physical memory to avoid dirty page balancing. In the
async case it will be possible that the workload completes without even
hitting the disk and will have variable results but highlight the impact
of mark_page_accessed for async IO. The sync results are expected to be
more stable. The exception is tmpfs where the normal case is for the "IO"
to not hit the disk.

The test machine was single socket and UMA to avoid any scheduling or NUMA
artifacts. Throughput and wall times are presented for sync IO, only wall
times are shown for async as the granularity reported by dd and the
variability is unsuitable for comparison. As async results were variable
due to writeback timings, I'm only reporting the maximum figures. The sync
results were stable enough to make the mean and stddev uninteresting.

The performance results are reported based on a run with no profiling.
Profile data is based on a separate run with oprofile running.

async dd
3.15.0-rc3 3.15.0-rc3
vanilla accessed-v2
ext3 Max elapsed 13.9900 ( 0.00%) 11.5900 ( 17.16%)
tmpfs Max elapsed 0.5100 ( 0.00%) 0.4900 ( 3.92%)
btrfs Max elapsed 12.8100 ( 0.00%) 12.7800 ( 0.23%)
ext4 Max elapsed 18.6000 ( 0.00%) 13.3400 ( 28.28%)
xfs Max elapsed 12.5600 ( 0.00%) 2.0900 ( 83.36%)

The XFS figure is a bit strange as it managed to avoid a worst case by
sheer luck but the average figures looked reasonable.

samples percentage
ext3 86107 0.9783 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
ext3 23833 0.2710 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext3 5036 0.0573 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
ext4 64566 0.8961 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
ext4 5322 0.0713 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext4 2869 0.0384 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs 62126 1.7675 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
xfs 1904 0.0554 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs 103 0.0030 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
btrfs 10655 0.1338 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
btrfs 2020 0.0273 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
btrfs 587 0.0079 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
tmpfs 59562 3.2628 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
tmpfs 1210 0.0696 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
tmpfs 94 0.0054 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed

[akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
Signed-off-by: Mel Gorman
Cc: Johannes Weiner
Cc: Vlastimil Babka
Cc: Jan Kara
Cc: Michal Hocko
Cc: Hugh Dickins
Cc: Dave Hansen
Cc: Theodore Ts'o
Cc: "Paul E. McKenney"
Cc: Oleg Nesterov
Cc: Rik van Riel
Cc: Peter Zijlstra
Tested-by: Prabhakar Lad
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
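For illustration, a write_begin-style caller on top of the new core helper
might look like the sketch below. The FGP_* flags are the ones this series
introduces; treat the exact argument list as an assumption, since it has
varied between kernel versions:

    struct page *page;

    /* look the page up, or create and lock it; FGP_ACCESSED lets the helper
     * do the initial (possibly non-atomic) accessed marking itself, so the
     * caller no longer calls mark_page_accessed() */
    page = pagecache_get_page(mapping, index,
                              FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_ACCESSED,
                              mapping_gfp_mask(mapping));
    if (!page)
        return -ENOMEM;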