14 Jan, 2011

40 commits

  • This should work for both hugetlbfs and transparent hugepages.

    [akpm@linux-foundation.org: bring forward PageTransCompound() addition for bisectability]
    Signed-off-by: Andrea Arcangeli
    Cc: Avi Kivity
    Cc: Marcelo Tosatti
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
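
    For reference, PageTransCompound(), mentioned in the fix-up note above, is a
    trivial predicate. Roughly (a sketch, not necessarily the exact upstream
    definition):

        #ifdef CONFIG_TRANSPARENT_HUGEPAGE
        static inline int PageTransCompound(struct page *page)
        {
                /* true for hugetlbfs pages and transparent hugepages alike */
                return PageCompound(page);
        }
        #else
        static inline int PageTransCompound(struct page *page)
        {
                return 0;       /* compiled away when THP is disabled */
        }
        #endif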
     
  • Move the copy/clear_huge_page functions to common code to share between
    hugetlb.c and huge_memory.c.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
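
    A minimal sketch of the shared helper's shape (clear side only; the exact
    signature and checks in the common code may differ):

        static void clear_huge_page(struct page *page, unsigned long addr,
                                    unsigned int pages_per_huge_page)
        {
                int i;

                might_sleep();
                for (i = 0; i < pages_per_huge_page; i++) {
                        cond_resched();
                        clear_user_highpage(page + i, addr + i * PAGE_SIZE);
                }
        }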
     
  • Add paging logic that splits the page before it is unmapped and added to
    swap, to ensure backwards compatibility with the legacy swap code.
    Eventually swap should natively page out hugepages to increase performance
    and decrease seeking and fragmentation of swap space. swapoff can just
    skip over huge pmds as they cannot be part of swap yet. In add_to_swap, be
    careful to split the page only after a valid swap entry has been obtained,
    so we don't split hugepages when swap is full.

    In theory we could split pages before isolating them during the lru scan,
    but for khugepaged to be safe, I'm relying on either mmap_sem write mode
    or PG_lock being taken, so split_huge_page has to run either with mmap_sem
    in read/write mode or with PG_lock taken. Calling it from isolate_lru_page
    would make the locking more complicated; in addition, split_huge_page
    would deadlock if called by __isolate_lru_page because it has to take the
    lru lock to add the tail pages.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
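
    The ordering described above, sketched against add_to_swap() (simplified;
    the real function also handles the swap cache insertion and its error
    paths):

        /* inside add_to_swap(), after a swap entry was successfully allocated: */
        if (unlikely(PageTransHuge(page)))
                if (unlikely(split_huge_page(page))) {
                        swapcache_free(entry, NULL);    /* give the entry back */
                        return 0;                       /* keep the hugepage intact */
                }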
     
  • Add split_huge_page_pmd() compat code. Each of these call sites would need
    to be expanded into hundreds of lines of complex code without a fully
    reliable split_huge_page_pmd() design.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
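
    The compat helper lets callers stay one-liners; a sketch of its shape (the
    upstream version may be a macro rather than an inline function):

        static inline void split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd)
        {
                if (unlikely(pmd_trans_huge(*pmd)))
                        __split_huge_page_pmd(mm, pmd);
                /* a no-op for regular pmds, so pte-mangling callers only need
                   to add "split_huge_page_pmd(mm, pmd);" before proceeding */
        }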
     
  • This increases the size of the mm struct a bit, but it is needed to
    preallocate one pte for each hugepage so that split_huge_page will not
    require a failure path. A guarantee of success is a fundamental property
    of split_huge_page: it avoids decreasing swapping reliability and avoids
    adding -ENOMEM failure paths that would otherwise force the
    hugepage-unaware VM code to learn how to roll back in the middle of its
    pte mangling operations (if anything, we want that code to learn to handle
    pmd_trans_huge natively rather than to become capable of rollback). When
    split_huge_page runs, a pte page is needed for the split to succeed, to
    map the newly split regular pages with regular ptes. This way all existing
    VM code remains backwards compatible by just adding a split_huge_page*
    one-liner. The memory waste of those preallocated ptes is negligible, so
    it is worth it.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
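
    A sketch of the idea, using hypothetical deposit/withdraw helper names
    rather than the exact upstream ones (haddr and entry stand for the fault
    address and the huge pmd value in the fault path):

        /* hugepage fault path: allocate the spare pte page up front,
           while returning -ENOMEM is still easy */
        pgtable_t pgtable = pte_alloc_one(mm, haddr);
        if (unlikely(!pgtable))
                return VM_FAULT_OOM;

        spin_lock(&mm->page_table_lock);
        set_pmd_at(mm, haddr, pmd, entry);
        pgtable_deposit(mm, pgtable);           /* hypothetical: stash it in the mm */
        spin_unlock(&mm->page_table_lock);

        /* split_huge_page path: the pte page is guaranteed to be there */
        pgtable = pgtable_withdraw(mm);         /* hypothetical */
        pmd_populate(mm, pmd, pgtable);         /* no -ENOMEM path needed */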
     
  • split_huge_page must transform a compound page to a regular page and needs
    ClearPageCompound.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Reviewed-by: Christoph Lameter
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Add mmu notifier helpers to handle pmd huge operations.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • pte alloc routines must wait for split_huge_page if the pmd is not present
    and not null (i.e. pmd_trans_splitting). The additional branches are
    optimized away at compile time by pmd_trans_splitting if the config option
    is off. However, we must pass the vma down in order to know which anon_vma
    lock to wait on.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
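
    A sketch of the check the pte alloc path gains (simplified; the real wait
    is implemented by taking and releasing the anon_vma lock that
    split_huge_page holds for the duration of the split):

        /* in __pte_alloc() and friends, before populating the pte level: */
        if (unlikely(pmd_trans_splitting(*pmd))) {
                /* split_huge_page is rewriting this pmd: wait for it to
                   finish, then let the caller retry against a regular pmd */
                wait_split_huge_page(vma->anon_vma, pmd);
                return 0;
        }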
     
  • Force gup_fast to take the slow path and block if the pmd is splitting,
    not only if it's none.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
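
    In the x86 gup_fast pmd walk this amounts to treating a splitting pmd like
    a missing one (a sketch):

        /* gup_pmd_range(): bail out for none *and* splitting pmds */
        if (pmd_none(pmd) || pmd_trans_splitting(pmd))
                return 0;       /* caller falls back to the sleeping
                                   get_user_pages() slow path */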
     
  • Add the needed pmd mangling functions, symmetric with their pte
    counterparts. pmdp_splitting_flush() is the only new addition among the
    pmd_ methods and it's needed to serialize the VM against split_huge_page.
    It atomically sets the splitting bit, much like pmdp_clear_flush_young
    atomically clears the accessed bit. pmdp_splitting_flush() also flushes
    the tlb to make it effective against gup_fast; the flush isn't strictly
    required for the page tables themselves, but it is the simplest operation
    we can invoke to serialize pmdp_splitting_flush() against gup_fast.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
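
    A sketch of what the x86 helper does (bit and macro names taken from the
    neighbouring commits in this series; details may differ):

        void pmdp_splitting_flush(struct vm_area_struct *vma,
                                  unsigned long address, pmd_t *pmdp)
        {
                int set;

                VM_BUG_ON(address & ~HPAGE_PMD_MASK);
                set = !test_and_set_bit(_PAGE_BIT_SPLITTING,
                                        (unsigned long *)pmdp);
                if (set) {
                        /* the flush only serializes against gup_fast */
                        flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
                }
        }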
     
  • Some of these are needed to build, but are not actually used, on archs
    that don't support transparent hugepages. Others, like pmdp_clear_flush,
    are used by x86 too.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • These return 0 at compile time when the config option is disabled, to
    allow gcc to eliminate the transparent hugepage function calls at compile
    time without additional #ifdefs (only the declarations of those functions
    have to be visible to gcc; they won't be required at link time and
    huge_memory.o need not be built at all).

    _PAGE_BIT_UNUSED1 is never used for the pmd, only for the pte.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
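
    The pattern being described, roughly (a sketch; the real definitions live
    behind the config option, the stubs are what gcc sees when it is off):

        #ifdef CONFIG_TRANSPARENT_HUGEPAGE
        static inline int pmd_trans_splitting(pmd_t pmd)
        {
                return pmd_val(pmd) & _PAGE_SPLITTING;
        }

        static inline int pmd_trans_huge(pmd_t pmd)
        {
                return pmd_val(pmd) & _PAGE_PSE;
        }
        #else
        /* constant 0 lets gcc discard every THP-only branch and call, so
           huge_memory.o never has to be built or linked */
        static inline int pmd_trans_splitting(pmd_t pmd) { return 0; }
        static inline int pmd_trans_huge(pmd_t pmd)      { return 0; }
        #endif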
     
  • Add config option.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Add a reminder in destroy_compound_page() that __split_huge_page_refcount
    is heavily dependent on its internal behavior.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • huge_memory.c needs it too when it falls back to copying hugepages into
    regular fragmented pages if hugepage allocation fails during COW.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • No paravirt version of set_pmd_at/pmd_update/pmd_update_defer.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Add paravirt ops for pmd_update/pmd_update_defer/pmd_set_at. Not all might
    be necessary (vmware needs pmd_update, Xen needs set_pmd_at, nobody needs
    pmd_update_defer), but this keeps full symmetry with the pte paravirt ops,
    which looks cleaner and simpler from a common code POV.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Used by both the paravirt and non-paravirt set_pmd_at.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
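
    The non-paravirt helper is a one-liner; a sketch:

        static inline void native_set_pmd_at(struct mm_struct *mm,
                                             unsigned long addr,
                                             pmd_t *pmdp, pmd_t pmd)
        {
                native_set_pmd(pmdp, pmd);      /* mm and addr only matter
                                                   for the paravirt hooks */
        }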
     
  • Clear the compound mapping for anonymous compound pages, as already
    happens for regular anonymous pages. But crash if a mapping is set for any
    tail page; the PageAnon check is also meaningless for tail pages. That
    check only makes sense for the head page; for tail pages it can only hide
    bugs, and we definitely don't want to hide bugs.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • The futex code is smarter than most other gup_fast O_DIRECT code and knows
    about the compound internals. However, now doing a put_page(head_page)
    will not release the pin on the tail page taken by gup-fast, leading to
    all sorts of refcounting bugchecks. Getting a stable head_page is a little
    tricky.

    The assignment page_head = page is there because, if this is not a tail
    page, the page is also its own head. Only if this is a tail page is
    compound_head called; otherwise it's guaranteed unnecessary. And if it is
    a tail page, compound_head has to run atomically inside the irq-disabled
    section of __get_user_pages_fast before returning; otherwise ->first_page
    won't be a stable pointer.

    Disabling irqs before __get_user_pages_fast and re-enabling them after
    running compound_head is needed because, if __get_user_pages_fast returns
    1, it means the huge pmd is established and cannot go away from under us.
    pmdp_splitting_flush_notify in __split_huge_page_splitting will have to
    wait for local_irq_enable before the IPI delivery can return. This means
    __split_huge_page_refcount can't be running from under us, and in turn,
    when we run compound_head(page) we're not reading a dangling pointer from
    tailpage->first_page. Then, once we have a stable head page, we are always
    safe to call compound_lock, and after taking the compound lock on the head
    page we can finally re-check whether the page returned by gup-fast is
    still a tail page; if so, we're set and we didn't need to split the
    hugepage in order to take a futex on it.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
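
    A condensed sketch of the pattern described above (not the exact futex
    code; the retry loop and error handling are omitted):

        page_head = page;
        if (unlikely(PageTail(page))) {
                put_page(page);
                /* irqs off: __split_huge_page_splitting() cannot complete its
                   IPI-based pmdp_splitting_flush while we run, so
                   page->first_page stays a stable pointer */
                local_irq_disable();
                if (__get_user_pages_fast(address, 1, 1, &page) == 1) {
                        page_head = compound_head(page);
                        local_irq_enable();
                        compound_lock(page_head);
                        /* re-check PageTail(page) here: if it is still a tail
                           page we can take the futex without splitting */
                } else {
                        local_irq_enable();
                        /* fall back to the regular gup slow path */
                }
        }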
     
  • After releasing the compound_lock, split_huge_page can still run and
    release the page before put_page_testzero runs.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Alter compound get_page/put_page to keep references on subpages too, in
    order to allow __split_huge_page_refcount to split a hugepage even while
    subpages have been pinned by one of the get_user_pages() variants.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
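
    Roughly, the get_page() side becomes (a sketch; the put_page()/
    put_compound_page() side is the more involved half and is omitted):

        static inline void get_page(struct page *page)
        {
                atomic_inc(&page->_count);
                if (unlikely(PageTail(page))) {
                        /* pinning a tail page also pins the head, so
                           __split_huge_page_refcount can hand the gup
                           reference over to the tail page when it splits */
                        VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
                        atomic_inc(&page->first_page->_count);
                }
        }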
     
  • Add a new compound_lock() needed to serialize put_page against
    __split_huge_page_refcount().

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
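
    The lock is a bit spinlock in page->flags; a sketch:

        static inline void compound_lock(struct page *page)
        {
        #ifdef CONFIG_TRANSPARENT_HUGEPAGE
                bit_spin_lock(PG_compound_lock, &page->flags);
        #endif
        }

        static inline void compound_unlock(struct page *page)
        {
        #ifdef CONFIG_TRANSPARENT_HUGEPAGE
                bit_spin_unlock(PG_compound_lock, &page->flags);
        #endif
        }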
     
  • Define MADV_HUGEPAGE.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Arnd Bergmann
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
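
    From userspace the new hint is used like any other madvise flag; a minimal
    example (assumes kernel and libc headers that already know MADV_HUGEPAGE):

        #include <stdio.h>
        #include <sys/mman.h>

        int main(void)
        {
                size_t length = 2 * 1024 * 1024;        /* one 2 MiB hugepage */
                void *addr = mmap(NULL, length, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                if (addr == MAP_FAILED)
                        return 1;

                /* hint that this range should be backed by transparent
                   hugepages when the kernel can manage it */
                if (madvise(addr, length, MADV_HUGEPAGE) != 0)
                        perror("madvise(MADV_HUGEPAGE)");

                munmap(addr, length);
                return 0;
        }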
     
  • Add Documentation/vm/transhuge.txt.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • page_count shows the count of the head page, but the actual check is done
    on the tail page, so show what is really being checked.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • When a swapcache page is replaced by a ksm page, it's best to free that
    swap immediately.

    Reported-by: Andrea Arcangeli
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • I think determine_dirtyable_memory() is a rather costly function since it
    needs many atomic reads to gather zone/global page state. But when we use
    vm_dirty_bytes and dirty_background_bytes, we don't need that costly
    calculation.

    This patch eliminates such unnecessary overhead.

    NOTE: the newly added if condition might add overhead in the normal path,
    but it should be _really_ small because we need to access both
    vm_dirty_bytes and dirty_background_bytes anyway, so they are likely to be
    cache-hot.

    [akpm@linux-foundation.org: fix used-uninitialised warning]
    Signed-off-by: Minchan Kim
    Cc: Wu Fengguang
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
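
    The shape of the optimization in the dirty-limit calculation (a sketch;
    variable names may differ from the actual patch):

        unsigned long dirty, background;
        unsigned long available_memory = 0;

        /* only pay for the expensive global/zone page-state reads when at
           least one limit is ratio-based rather than byte-based */
        if (!vm_dirty_bytes || !dirty_background_bytes)
                available_memory = determine_dirtyable_memory();

        if (vm_dirty_bytes)
                dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
        else
                dirty = (vm_dirty_ratio * available_memory) / 100;

        if (dirty_background_bytes)
                background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
        else
                background = (dirty_background_ratio * available_memory) / 100;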
     
  • When the numa_zonelist_order parameter is set to "node" or "zone" on the
    command line, it still shows up as "default" in sysctl. That's because the
    early_param parsing function changes only the user_zonelist_order
    variable. Fix this by copying the user-provided string to
    numa_zonelist_order if it was successfully parsed.

    Signed-off-by: Volodymyr G Lukiianyk
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Volodymyr G. Lukiianyk
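
    A sketch of the fix in the early_param handler (simplified):

        static __init int setup_numa_zonelist_order(char *s)
        {
                int ret;

                if (!s)
                        return 0;

                ret = __parse_numa_zonelist_order(s);
                if (ret == 0)
                        /* keep the sysctl string in sync with what was parsed */
                        strlcpy(numa_zonelist_order, s, NUMA_ZONELIST_ORDER_LEN);

                return ret;
        }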
     
  • When kswapd is woken up for a high-order allocation, it takes into account
    the highest zone usable by the caller (the classzone idx). During
    allocation, this index is used to select the lowmem_reserve[] that should
    be applied to the watermark calculation in zone_watermark_ok().

    When balancing a node, kswapd considers the highest unbalanced zone to be
    the classzone index. This will always be at least the caller's
    classzone_idx and can be higher. However, sleeping_prematurely() always
    considers the lowest zone (e.g. ZONE_DMA) to be the classzone index. This
    means that sleeping_prematurely() can consider a zone to be balanced that
    is unusable by the allocation request that originally woke kswapd. This
    patch changes sleeping_prematurely() to use a classzone_idx matching the
    value it used in balance_pgdat().

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: Eric B Munson
    Cc: KAMEZAWA Hiroyuki
    Cc: Simon Kirby
    Cc: KOSAKI Motohiro
    Cc: Shaohua Li
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • After DEF_PRIORITY, balance_pgdat() considers all_unreclaimable zones to
    be balanced but sleeping_prematurely does not. This can force kswapd to
    stay awake longer than it should. This patch fixes it.

    Signed-off-by: Mel Gorman
    Reviewed-by: Eric B Munson
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Simon Kirby
    Cc: KOSAKI Motohiro
    Cc: Shaohua Li
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
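
    The change amounts to skipping such zones in the sleep check, mirroring
    what balance_pgdat() already does (a sketch):

        /* sleeping_prematurely(): */
        for (i = 0; i <= classzone_idx; i++) {
                struct zone *zone = pgdat->node_zones + i;

                if (!populated_zone(zone))
                        continue;

                /* balance_pgdat() treats these zones as balanced, so the
                   sleep check should not keep kswapd awake for them */
                if (zone->all_unreclaimable)
                        continue;

                if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
                                       classzone_idx, 0))
                        return true;    /* still unbalanced: stay awake */
        }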
     
  • When kswapd wakes up, it reads its order and classzone from pgdat and
    calls balance_pgdat. While it is awake, it potentially reclaims at a high
    order and a low classzone index. This might have been a one-off that is
    not required by subsequent callers. However, because the pgdat values were
    not reset, they remain artificially high while balance_pgdat() is running,
    and kswapd may enter a second unnecessary reclaim cycle. Reset the pgdat
    order and classzone index after reading them.

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Eric B Munson
    Cc: Simon Kirby
    Cc: KOSAKI Motohiro
    Cc: Shaohua Li
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
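
    A sketch of the change in the kswapd() wakeup loop:

        order = pgdat->kswapd_max_order;
        classzone_idx = pgdat->classzone_idx;

        /* reset immediately, so a later wakeup does not re-use stale,
           artificially high values for a reclaim pass nobody asked for */
        pgdat->kswapd_max_order = 0;
        pgdat->classzone_idx = MAX_NR_ZONES - 1;

        /* then balance_pgdat() runs with the snapshotted values */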
     
  • Before kswapd goes to sleep, it uses sleeping_prematurely() to check if
    there was a race pushing a zone below its watermark. If the race
    happened, it stays awake. However, balance_pgdat() can decide to reclaim
    at order-0 if it decides that high-order reclaim is not working as
    expected. This information is not passed back to sleeping_prematurely().
    The impact is that kswapd remains awake reclaiming pages long after it
    should have gone to sleep. This patch passes the adjusted order to
    sleeping_prematurely and uses the same logic as balance_pgdat to decide if
    it's ok to go to sleep.

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Eric B Munson
    Cc: Simon Kirby
    Cc: KOSAKI Motohiro
    Cc: Shaohua Li
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When reclaiming for high orders, kswapd is responsible for balancing a
    node, but it should not reclaim excessively. It avoids excessive reclaim
    by considering the node balanced once any zone in it is balanced. In cases
    where zone sizes are imbalanced (e.g. ZONE_DMA alongside ZONE_DMA32 and
    ZONE_NORMAL), kswapd can go to sleep prematurely because just one small
    zone was balanced.

    This alters the sleep logic of kswapd slightly. It counts the number of
    pages that make up the balanced zones. If the total number of balanced
    pages is more than a quarter of the node, kswapd will go back to sleep.
    This should keep a node balanced without reclaiming an excessive number of
    pages.

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Eric B Munson
    Cc: Simon Kirby
    Cc: KOSAKI Motohiro
    Cc: Shaohua Li
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
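
    The sleep test boils down to a helper along these lines (a sketch):

        static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
                                   int classzone_idx)
        {
                unsigned long present_pages = 0;
                int i;

                for (i = 0; i <= classzone_idx; i++)
                        present_pages += pgdat->node_zones[i].present_pages;

                /* sleep once at least 25% of the node's pages sit in zones
                   whose high watermark is met for the requested order */
                return balanced_pages > (present_pages >> 2);
        }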
     
  • Simon Kirby reported the following problem:

    We're seeing cases on a number of servers where cache never fully
    grows to use all available memory. Sometimes we see servers with 4 GB
    of memory that never seem to have less than 1.5 GB free, even with a
    constantly-active VM. In some cases, these servers also swap out while
    this happens, even though they are constantly reading the working set
    into memory. We have been seeing this happening for a long time; I
    don't think it's anything recent, and it still happens on 2.6.36.

    After some debugging work by Simon, Dave Hansen and others, the prevailing
    theory became that kswapd is too aggressive about reclaiming order-3 pages
    requested by SLUB.

    There are two apparent problems here. On the target machine, there is a
    small Normal zone in comparison to DMA32. As kswapd tries to balance all
    zones, it would continually try reclaiming for Normal even though DMA32
    was balanced enough for callers. The second problem is that
    sleeping_prematurely() does not use the same logic as balance_pgdat() when
    deciding whether to sleep or not. This keeps kswapd artificially awake.

    A number of tests were run, and the figures from previous postings will
    look very different for a few reasons. One, the old figures were forcing
    my network card to use GFP_ATOMIC in an attempt to replicate Simon's
    problem. Second, I previously specified slub_min_order=3, again in an
    attempt to reproduce Simon's problem. In this posting, I'm depending on
    Simon to say whether his problem is fixed or not, and these figures are to
    show the impact to the ordinary cases. Finally, the "vmscan" figures are
    taken from /proc/vmstat instead of the tracepoints. There is less
    information, but recording is less disruptive.

    The first test of relevance was postmark with a process running in the
    background reading a large amount of anonymous memory in blocks. The
    objective was to vaguely simulate what was happening on Simon's machine
    and it's memory intensive enough to have kswapd awake.

    POSTMARK
    traceonly kanyzone
    Transactions per second: 156.00 ( 0.00%) 153.00 (-1.96%)
    Data megabytes read per second: 21.51 ( 0.00%) 21.52 ( 0.05%)
    Data megabytes written per second: 29.28 ( 0.00%) 29.11 (-0.58%)
    Files created alone per second: 250.00 ( 0.00%) 416.00 (39.90%)
    Files create/transact per second: 79.00 ( 0.00%) 76.00 (-3.95%)
    Files deleted alone per second: 520.00 ( 0.00%) 420.00 (-23.81%)
    Files delete/transact per second: 79.00 ( 0.00%) 76.00 (-3.95%)

    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 16.58 17.4
    Total Elapsed Time (seconds) 218.48 222.47

    VMstat Reclaim Statistics: vmscan
    Direct reclaims 0 4
    Direct reclaim pages scanned 0 203
    Direct reclaim pages reclaimed 0 184
    Kswapd pages scanned 326631 322018
    Kswapd pages reclaimed 312632 309784
    Kswapd low wmark quickly 1 4
    Kswapd high wmark quickly 122 475
    Kswapd skip congestion_wait 1 0
    Pages activated 700040 705317
    Pages deactivated 212113 203922
    Pages written 9875 6363

    Total pages scanned 326631 322221
    Total pages reclaimed 312632 309968
    %age total pages scanned/reclaimed 95.71% 96.20%
    %age total pages scanned/written 3.02% 1.97%

    proc vmstat: Faults
    Major Faults 300 254
    Minor Faults 645183 660284
    Page ins 493588 486704
    Page outs 4960088 4986704
    Swap ins 1230 661
    Swap outs 9869 6355

    Performance is mildly affected because kswapd is no longer doing as much
    work and the background memory consumer process is getting in the way.
    Note that kswapd scanned and reclaimed fewer pages as it's less
    aggressive, and overall fewer pages were scanned and reclaimed. Swap
    in/out is particularly reduced, again reflecting kswapd throwing out fewer
    pages.

    The slight performance impact is unfortunate here but it looks like a
    direct result of kswapd being less aggressive. As the bug report is about
    too many pages being freed by kswapd, it may have to be accepted for now.

    The second test is a streaming IO benchmark that was previously used by
    Johannes to show regressions in page reclaim.

    MICRO
    traceonly kanyzone
    User/Sys Time Running Test (seconds) 29.29 28.87
    Total Elapsed Time (seconds) 492.18 488.79

    VMstat Reclaim Statistics: vmscan
    Direct reclaims 2128 1460
    Direct reclaim pages scanned 2284822 1496067
    Direct reclaim pages reclaimed 148919 110937
    Kswapd pages scanned 15450014 16202876
    Kswapd pages reclaimed 8503697 8537897
    Kswapd low wmark quickly 3100 3397
    Kswapd high wmark quickly 1860 7243
    Kswapd skip congestion_wait 708 801
    Pages activated 9635 9573
    Pages deactivated 1432 1271
    Pages written 223 1130

    Total pages scanned 17734836 17698943
    Total pages reclaimed 8652616 8648834
    %age total pages scanned/reclaimed 48.79% 48.87%
    %age total pages scanned/written 0.00% 0.01%

    proc vmstat: Faults
    Major Faults 165 221
    Minor Faults 9655785 9656506
    Page ins 3880 7228
    Page outs 37692940 37480076
    Swap ins 0 69
    Swap outs 19 15

    Again, fewer pages are scanned and reclaimed, as expected, and this time
    the test completed faster. Note that kswapd is hitting its watermarks
    faster (low and high wmark quickly), which I expect is due to kswapd
    reclaiming fewer pages.

    I also ran fs-mark, iozone and sysbench but there is nothing interesting
    to report in the figures. Performance is not significantly changed and
    the reclaim statistics look reasonable.

    This patch:

    When the allocator enters its slow path, kswapd is woken up to balance the
    node. It continues working until all zones within the node are balanced.
    For order-0 allocations, this makes perfect sense but for higher orders it
    can have unintended side-effects. If the zone sizes are imbalanced,
    kswapd may reclaim heavily within a smaller zone discarding an excessive
    number of pages. The user-visible behaviour is that kswapd is awake and
    reclaiming even though plenty of pages are free from a suitable zone.

    This patch alters the "balance" logic for high-order reclaim allowing
    kswapd to stop if any suitable zone becomes balanced to reduce the number
    of pages it reclaims from other zones. kswapd still tries to ensure that
    order-0 watermarks for all zones are met before sleeping.

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Eric B Munson
    Cc: Simon Kirby
    Cc: KOSAKI Motohiro
    Cc: Shaohua Li
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
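
    Conceptually, the end of balance_pgdat() gains a check along these lines
    (any_suitable_zone_balanced is a hypothetical name; the real code tracks
    the balanced zones directly):

        /* after a full reclaim pass over the node's zones: */
        if (order && any_suitable_zone_balanced(pgdat, order, classzone_idx)) {
                /* the high-order wakeup can already be satisfied; drop back
                   to order-0 so the remaining work only ensures the order-0
                   watermarks on every zone before kswapd sleeps */
                order = sc.order = 0;
        }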
     
  • Running the annotated branch profiler on a box doing average work
    (firefox, evolution, xchat, distcc farm), the likely() used in
    grab_cache_page_write_begin() was incorrect most of the time:

    correct incorrect % Function File Line
    ------- --------- - -------- ---- ----
    1924262 71332401 97 grab_cache_page_write_begin filemap.c 2206

    Adding a trace_printk() and running the function tracer limited to
    just this function I can see:

    gconfd-2-2696 [000] 4467.268935: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=7
    gconfd-2-2696 [000] 4467.268946: grab_cache_page_write_begin
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     
  • page_mapping() has an unlikely() that the mapping has PAGE_MAPPING_ANON
    set. But running the annotated branch profiler on a normal desktop system
    doing various tasks (xchat, evolution, firefox, distcc), it is not really
    that unlikely that the mapping here will have the PAGE_MAPPING_ANON flag
    set:

    correct incorrect % Function File Line
    ------- --------- - -------- ---- ----
    35935762 1270265395 97 page_mapping mm.h 659
    1306198001 143659 0 page_mapping mm.h 657
    203131478 121586 0 page_mapping mm.h 657
    5415491 1116 0 page_mapping mm.h 657
    74899487 1116 0 page_mapping mm.h 657
    203132845 224 0 page_mapping mm.h 659
    5415464 27 0 page_mapping mm.h 659
    13552 0 0 page_mapping mm.h 657
    13552 0 0 page_mapping mm.h 659
    242630 0 0 page_mapping mm.h 657
    242630 0 0 page_mapping mm.h 659
    74899487 0 0 page_mapping mm.h 659

    The page_mapping() is a static inline, which is why it shows up multiple
    times.

    The unlikely in page_mapping() was correct a total of 1909540379 times and
    incorrect 1270533123 times, i.e. about 39% of the hits were incorrect.
    With this much error, it's best to simply remove the unlikely and let the
    compiler and branch prediction figure this out.

    Signed-off-by: Steven Rostedt
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
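
    After the change, page_mapping() keeps only the hints that the profile
    supports; roughly (a sketch of its shape):

        static inline struct address_space *page_mapping(struct page *page)
        {
                struct address_space *mapping = page->mapping;

                VM_BUG_ON(PageSlab(page));
                if (unlikely(PageSwapCache(page)))
                        mapping = &swapper_space;
                /* no unlikely() on the anon check: the profile shows the hint
                   is wrong for roughly 39% of the calls on a desktop load */
                else if ((unsigned long)mapping & PAGE_MAPPING_ANON)
                        mapping = NULL;
                return mapping;
        }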
     
  • mapping_unevictable() has a likely() around its mapping parameter. This
    mapping parameter comes from page_mapping(), which has an unlikely() that
    the page will be set as PAGE_MAPPING_ANON, and if so it will return NULL.
    One would think that this unlikely() means that the mapping returned by
    page_mapping() would not be NULL, but where page_mapping() is used just
    above mapping_unevictable(), that unlikely() is incorrect most of the
    time. This means that the "likely(mapping)" in mapping_unevictable() is
    incorrect most of the time.

    Running the annotated branch profiler on my main box which runs firefox,
    evolution, xchat and is part of my distcc farm, I had this:

    correct incorrect % Function File Line
    ------- --------- - -------- ---- ----
    12872836 1269443893 98 mapping_unevictable pagemap.h 51
    35935762 1270265395 97 page_mapping mm.h 659
    1306198001 143659 0 page_mapping mm.h 657
    203131478 121586 0 page_mapping mm.h 657
    5415491 1116 0 page_mapping mm.h 657
    74899487 1116 0 page_mapping mm.h 657
    203132845 224 0 page_mapping mm.h 659
    5415464 27 0 page_mapping mm.h 659
    13552 0 0 page_mapping mm.h 657
    13552 0 0 page_mapping mm.h 659
    242630 0 0 page_mapping mm.h 657
    242630 0 0 page_mapping mm.h 659
    74899487 0 0 page_mapping mm.h 659

    The page_mapping() is a static inline, which is why it shows up multiple
    times. The mapping_unevictable() is also a static inline but seems to be
    used only once in my setup.

    The unlikely in page_mapping() was correct a total of 1909540379 times and
    incorrect 1270533123 times, i.e. about 39% of the hits were incorrect.
    Perhaps this is enough to remove the unlikely from page_mapping() as well.

    Signed-off-by: Steven Rostedt
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Nick Piggin
    Acked-by: Rik van Riel
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     
  • IS_ERR() already implies unlikely(), so it can be omitted here.

    Signed-off-by: Tobias Klauser
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tobias Klauser
     
  • Today, the tasklist_lock in migrate_pages doesn't protect anything;
    rcu_read_lock() provides enough protection for the pid hash walk.

    Signed-off-by: KOSAKI Motohiro
    Reported-by: Peter Zijlstra
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
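
    The locking change amounts to this in the task lookup (a sketch of the
    sys_migrate_pages path, error handling simplified):

        /* the pid hash walk is RCU-safe, tasklist_lock buys nothing here */
        rcu_read_lock();
        task = pid ? find_task_by_vpid(pid) : current;
        if (!task) {
                rcu_read_unlock();
                return -ESRCH;
        }
        mm = get_task_mm(task);
        rcu_read_unlock();

        if (!mm)
                return -EINVAL;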