Eric Lee / smarc-fsl-linux-kernel

13 Jan, 2012

8 commits

1c1c53d43 mm: remove del_page_from_lru, add page_off_lru ... Browse Code »

del_page_from_lru() repeats del_page_from_lru_list(), also working out
which LRU the page was on, clearing the relevant bits. Decouple those
functions: remove del_page_from_lru() and add page_off_lru().

Signed-off-by: Hugh Dickins
Reviewed-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2012-01-13 12:13:10 +0800
4d06f382c mm: no blank line after EXPORT_SYMBOL in swap.c ... Browse Code »

checkpatch rightly protests

WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable

so fix the five offenders in mm/swap.c.

Signed-off-by: Hugh Dickins
Reviewed-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2012-01-13 12:13:10 +0800
5095ae837 mm: fewer underscores in ____pagevec_lru_add ... Browse Code »

What's so special about ____pagevec_lru_add() that it needs four leading
underscores? Nothing, it just helped to distinguish from
__pagevec_lru_add() in 2.6.28 development. Cut two leading underscores.

Signed-off-by: Hugh Dickins
Reviewed-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2012-01-13 12:13:10 +0800
2bcf88796 mm: take pagevecs off reclaim stack ... Browse Code »

Replace pagevecs in putback_lru_pages() and move_active_pages_to_lru()
by lists of pages_to_free: then apply Konstantin Khlebnikov's
free_hot_cold_page_list() to them instead of pagevec_release().

Which simplifies the flow (no need to drop and retake lock whenever
pagevec fills up) and reduces stale addresses in stack backtraces
(which often showed through the pagevecs); but more importantly,
removes another 120 bytes from the deepest stacks in page reclaim.
Although I've not recently seen an actual stack overflow here with
a vanilla kernel, move_active_pages_to_lru() has often featured in
deep backtraces.

However, free_hot_cold_page_list() does not handle compound pages
(nor need it: a Transparent HugePage would have been split by the
time it reaches the call in shrink_page_list()), but it is possible
for putback_lru_pages() or move_active_pages_to_lru() to be left
holding the last reference on a THP, so must exclude the unlikely
compound case before putting on pages_to_free.

Remove pagevec_strip(), its work now done in move_active_pages_to_lru().
The pagevec in scan_mapping_unevictable_pages() remains in mm/vmscan.c,
but that is never on the reclaim path, and cannot be replaced by a list.

Signed-off-by: Hugh Dickins
Reviewed-by: KOSAKI Motohiro
Reviewed-by: Konstantin Khlebnikov
Cc: KAMEZAWA Hiroyuki
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2012-01-13 12:13:10 +0800
12d271078 memcg: fix split_huge_page_refcounts() ... Browse Code »

This patch started off as a cleanup: __split_huge_page_refcounts() has to
cope with two scenarios, when the hugepage being split is already on LRU,
and when it is not; but why does it have to split that accounting across
three different sites? Consolidate it in lru_add_page_tail(), handling
evictable and unevictable alike, and use standard add_page_to_lru_list()
when accounting is needed (when the head is not yet on LRU).

But a recent regression in -next, I guess the removal of PageCgroupAcctLRU
test from mem_cgroup_split_huge_fixup(), makes this now a necessary fix:
under load, the MEM_CGROUP_ZSTAT count was wrapping to a huge number,
messing up reclaim calculations and causing a freeze at rmdir of cgroup.

Add a VM_BUG_ON to mem_cgroup_lru_del_list() when we're about to wrap that
count - this has not been the only such incident. Document that
lru_add_page_tail() is for Transparent HugePages by #ifdef around it.

Signed-off-by: Hugh Dickins
Cc: Daisuke Nishimura
Cc: KAMEZAWA Hiroyuki
Cc: Johannes Weiner
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2012-01-13 12:13:09 +0800
45676885b thp: improve order in lru list for split huge page ... Browse Code »

Put the tail subpages of an isolated hugepage under splitting in the lru
reclaim head as they supposedly should be isolated too next.

Queues the subpages in physical order in the lru for non isolated
hugepages under splitting. That might provide some theoretical cache
benefit to the buddy allocator later.

Signed-off-by: Shaohua Li
Signed-off-by: Andrea Arcangeli
Cc: David Rientjes
Cc: Johannes Weiner
Cc: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Shaohua Li
2012-01-13 12:13:08 +0800
925b7673c mm: make per-memcg LRU lists exclusive ... Browse Code »
43

Now that all code that operated on global per-zone LRU lists is
converted to operate on per-memory cgroup LRU lists instead, there is no
reason to keep the double-LRU scheme around any longer.

The pc->lru member is removed and page->lru is linked directly to the
per-memory cgroup LRU lists, which removes two pointers from a
descriptor that exists for every page frame in the system.

Signed-off-by: Johannes Weiner
Signed-off-by: Hugh Dickins
Signed-off-by: Ying Han
Reviewed-by: KAMEZAWA Hiroyuki
Reviewed-by: Michal Hocko
Reviewed-by: Kirill A. Shutemov
Cc: Daisuke Nishimura
Cc: Balbir Singh
Cc: Greg Thelen
Cc: Michel Lespinasse
Cc: Rik van Riel
Cc: Minchan Kim
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2012-01-13 12:13:05 +0800
6290df545 mm: collect LRU list heads into struct lruvec ... Browse Code »

Having a unified structure with a LRU list set for both global zones and
per-memcg zones allows to keep that code simple which deals with LRU
lists and does not care about the container itself.

Once the per-memcg LRU lists directly link struct pages, the isolation
function and all other list manipulations are shared between the memcg
case and the global LRU case.

Signed-off-by: Johannes Weiner
Reviewed-by: KAMEZAWA Hiroyuki
Reviewed-by: Michal Hocko
Reviewed-by: Kirill A. Shutemov
Cc: Daisuke Nishimura
Cc: Balbir Singh
Cc: Ying Han
Cc: Greg Thelen
Cc: Michel Lespinasse
Cc: Rik van Riel
Cc: Minchan Kim
Cc: Christoph Hellwig
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2012-01-13 12:13:05 +0800

11 Jan, 2012

1 commit

cc59850ef mm: add free_hot_cold_page_list() helper ... Browse Code »

This patch adds helper free_hot_cold_page_list() to free list of 0-order
pages. It frees pages directly from list without temporary page-vector.
It also calls trace_mm_pagevec_free() to simulate pagevec_free()
behaviour.

bloat-o-meter:

add/remove: 1/1 grow/shrink: 1/3 up/down: 267/-295 (-28)
function old new delta
free_hot_cold_page_list - 264 +264
get_page_from_freelist 2129 2132 +3
__pagevec_free 243 239 -4
split_free_page 380 373 -7
release_pages 606 510 -96
free_page_list 188 - -188

Signed-off-by: Konstantin Khlebnikov
Cc: Mel Gorman
Cc: KOSAKI Motohiro
Acked-by: Minchan Kim
Acked-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Konstantin Khlebnikov
2012-01-11 08:30:41 +0800

07 Nov, 2011

1 commit

32aaeffbd Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux ... Browse Code »

* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
Revert "tracing: Include module.h in define_trace.h"
irq: don't put module.h into irq.h for tracking irqgen modules.
bluetooth: macroize two small inlines to avoid module.h
ip_vs.h: fix implicit use of module_get/module_put from module.h
nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
include: replace linux/module.h with "struct module" wherever possible
include: convert various register fcns to macros to avoid include chaining
crypto.h: remove unused crypto_tfm_alg_modname() inline
uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
pm_runtime.h: explicitly requires notifier.h
linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
miscdevice.h: fix up implicit use of lists and types
stop_machine.h: fix implicit use of smp.h for smp_processor_id
of: fix implicit use of errno.h in include/linux/of.h
of_platform.h: delete needless include
acpi: remove module.h include from platform/aclinux.h
miscdevice.h: delete unnecessary inclusion of module.h
device_cgroup.h: delete needless include
net: sch_generic remove redundant use of
net: inet_timewait_sock doesnt need
...

Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
- drivers/media/dvb/frontends/dibx000_common.c
- drivers/media/video/{mt9m111.c,ov6650.c}
- drivers/mfd/ab3550-core.c
- include/linux/dmaengine.h

Linus Torvalds
2011-11-07 11:44:47 +0800

03 Nov, 2011

1 commit

70b50f94f mm: thp: tail page refcounting fix ... Browse Code »
2

Michel while working on the working set estimation code, noticed that
calling get_page_unless_zero() on a random pfn_to_page(random_pfn)
wasn't safe, if the pfn ended up being a tail page of a transparent
hugepage under splitting by __split_huge_page_refcount().

He then found the problem could also theoretically materialize with
page_cache_get_speculative() during the speculative radix tree lookups
that uses get_page_unless_zero() in SMP if the radix tree page is freed
and reallocated and get_user_pages is called on it before
page_cache_get_speculative has a chance to call get_page_unless_zero().

So the best way to fix the problem is to keep page_tail->_count zero at
all times. This will guarantee that get_page_unless_zero() can never
succeed on any tail page. page_tail->_mapcount is guaranteed zero and
is unused for all tail pages of a compound page, so we can simply
account the tail page references there and transfer them to
tail_page->_count in __split_huge_page_refcount() (in addition to the
head_page->_mapcount).

While debugging this s/_count/_mapcount/ change I also noticed get_page is
called by direct-io.c on pages returned by get_user_pages. That wasn't
entirely safe because the two atomic_inc in get_page weren't atomic. As
opposed to other get_user_page users like secondary-MMU page fault to
establish the shadow pagetables would never call any superflous get_page
after get_user_page returns. It's safer to make get_page universally safe
for tail pages and to use get_page_foll() within follow_page (inside
get_user_pages()). get_page_foll() is safe to do the refcounting for tail
pages without taking any locks because it is run within PT lock protected
critical sections (PT lock for pte and page_table_lock for
pmd_trans_huge).

The standard get_page() as invoked by direct-io instead will now take
the compound_lock but still only for tail pages. The direct-io paths
are usually I/O bound and the compound_lock is per THP so very
finegrined, so there's no risk of scalability issues with it. A simple
direct-io benchmarks with all lockdep prove locking and spinlock
debugging infrastructure enabled shows identical performance and no
overhead. So it's worth it. Ideally direct-io should stop calling
get_page() on pages returned by get_user_pages(). The spinlock in
get_page() is already optimized away for no-THP builds but doing
get_page() on tail pages returned by GUP is generally a rare operation
and usually only run in I/O paths.

This new refcounting on page_tail->_mapcount in addition to avoiding new
RCU critical sections will also allow the working set estimation code to
work without any further complexity associated to the tail page
refcounting with THP.

Signed-off-by: Andrea Arcangeli
Reported-by: Michel Lespinasse
Reviewed-by: Michel Lespinasse
Reviewed-by: Minchan Kim
Cc: Peter Zijlstra
Cc: Hugh Dickins
Cc: Johannes Weiner
Cc: Rik van Riel
Cc: Mel Gorman
Cc: KOSAKI Motohiro
Cc: Benjamin Herrenschmidt
Cc: David Gibson
Cc:
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2011-11-03 07:06:57 +0800

31 Oct, 2011

1 commit

b95f1b31b mm: Map most files to use export.h instead of module.h ... Browse Code »

The files changed within are only using the EXPORT_SYMBOL
macro variants. They are not using core modular infrastructure
and hence don't need module.h but only the export.h header.

Signed-off-by: Paul Gortmaker

Paul Gortmaker
2011-10-31 21:20:12 +0800

25 May, 2011

2 commits

eb709b0d0 mm: batch activate_page() to reduce lock contention ... Browse Code »

The zone->lru_lock is heavily contented in workload where activate_page()
is frequently used. We could do batch activate_page() to reduce the lock
contention. The batched pages will be added into zone list when the pool
is full or page reclaim is trying to drain them.

For example, in a 4 socket 64 CPU system, create a sparse file and 64
processes, processes shared map to the file. Each process read access the
whole file and then exit. The process exit will do unmap_vmas() and cause
a lot of activate_page() call. In such workload, we saw about 58% total
time reduction with below patch. Other workloads with a lot of
activate_page also benefits a lot too.

Andrew Morton suggested activate_page() and putback_lru_pages() should
follow the same path to active pages, but this is hard to implement (see
commit 7a608572a282a ("Revert "mm: batch activate_page() to reduce lock
contention")). On the other hand, do we really need putback_lru_pages()
to follow the same path? I tested several FIO/FFSB benchmark (about 20
scripts for each benchmark) in 3 machines here from 2 sockets to 4
sockets. My test doesn't show anything significant with/without below
patch (there is slight difference but mostly some noise which we found
even without below patch before). Below patch basically returns to the
same as my first post.

I tested some microbenchmarks:
case-anon-cow-rand-mt 0.58%
case-anon-cow-rand -3.30%
case-anon-cow-seq-mt -0.51%
case-anon-cow-seq -5.68%
case-anon-r-rand-mt 0.23%
case-anon-r-rand 0.81%
case-anon-r-seq-mt -0.71%
case-anon-r-seq -1.99%
case-anon-rx-rand-mt 2.11%
case-anon-rx-seq-mt 3.46%
case-anon-w-rand-mt -0.03%
case-anon-w-rand -0.50%
case-anon-w-seq-mt -1.08%
case-anon-w-seq -0.12%
case-anon-wx-rand-mt -5.02%
case-anon-wx-seq-mt -1.43%
case-fork 1.65%
case-fork-sleep -0.07%
case-fork-withmem 1.39%
case-hugetlb -0.59%
case-lru-file-mmap-read-mt -0.54%
case-lru-file-mmap-read 0.61%
case-lru-file-mmap-read-rand -2.24%
case-lru-file-readonce -0.64%
case-lru-file-readtwice -11.69%
case-lru-memcg -1.35%
case-mmap-pread-rand-mt 1.88%
case-mmap-pread-rand -15.26%
case-mmap-pread-seq-mt 0.89%
case-mmap-pread-seq -69.72%
case-mmap-xread-rand-mt 0.71%
case-mmap-xread-seq-mt 0.38%

The most significent are:
case-lru-file-readtwice -11.69%
case-mmap-pread-rand -15.26%
case-mmap-pread-seq -69.72%

which use activate_page a lot. others are basically variations because
each run has slightly difference.

In UP case, 'size mm/swap.o'
before the two patches:
text data bss dec hex filename
6466 896 4 7366 1cc6 mm/swap.o
after the two patches:
text data bss dec hex filename
6343 896 4 7243 1c4b mm/swap.o

Signed-off-by: Shaohua Li
Cc: KOSAKI Motohiro
Cc: Hiroyuki Kamezawa
Cc: Andi Kleen
Cc: Minchan Kim
Cc: Rik van Riel
Cc: Mel Gorman
Cc: Johannes Weiner
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Shaohua Li
2011-05-25 23:39:37 +0800
821ed6bbe mm: filter unevictable page out in deactivate_page() ... Browse Code »

It's pointless that deactive_page's operates on unevictable pages. This
patch removes unnecessary overhead which might be a bit problem in case
that there are many unevictable page in system(ex, mprotect workload)

[akpm@linux-foundation.org: tidy up comment]
Reviewed-by: KOSAKI Motohiro
Signed-off-by: Minchan Kim
Reviewed-by: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Minchan Kim
2011-05-25 23:39:27 +0800

12 May, 2011

1 commit

bad49d9c8 mm: check PageUnevictable in lru_deactivate_fn() ... Browse Code »

The lru_deactivate_fn should not move page which in on unevictable lru
into inactive list. Otherwise, we can meet BUG when we use
isolate_lru_pages as __isolate_lru_page could return -EINVAL.

Reported-by: Ying Han
Tested-by: Ying Han
Signed-off-by: Minchan Kim
Reviewed-by: KOSAKI Motohiro
Reviewed-by: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Minchan Kim
2011-05-12 09:50:44 +0800

23 Mar, 2011

4 commits

3dd7ae8ec mm: simplify code of swap.c ... Browse Code »

Clean up code and remove duplicate code. Next patch will use
pagevec_lru_move_fn introduced here too.

Signed-off-by: Shaohua Li
Cc: KOSAKI Motohiro
Cc: Hiroyuki Kamezawa
Cc: Andi Kleen
Reviewed-by: Minchan Kim
Cc: Rik van Riel
Cc: Mel Gorman
Cc: Johannes Weiner
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Shaohua Li
2011-03-23 08:44:09 +0800
278df9f45 mm: reclaim invalidated page ASAP ... Browse Code »

invalidate_mapping_pages is very big hint to reclaimer. It means user
doesn't want to use the page any more. So in order to prevent working set
page eviction, this patch move the page into tail of inactive list by
PG_reclaim.

Please, remember that pages in inactive list are working set as well as
active list. If we don't move pages into inactive list's tail, pages near
by tail of inactive list can be evicted although we have a big clue about
useless pages. It's totally bad.

Now PG_readahead/PG_reclaim is shared. fe3cba17 added ClearPageReclaim
into clear_page_dirty_for_io for preventing fast reclaiming readahead
marker page.

In this series, PG_reclaim is used by invalidated page, too. If VM find
the page is invalidated and it's dirty, it sets PG_reclaim to reclaim
asap. Then, when the dirty page will be writeback,
clear_page_dirty_for_io will clear PG_reclaim unconditionally. It
disturbs this serie's goal.

I think it's okay to clear PG_readahead when the page is dirty, not
writeback time. So this patch moves ClearPageReadahead. In v4,
ClearPageReadahead in set_page_dirty has a problem which is reported by
Steven Barrett. It's due to compound page. Some driver(ex, audio) calls
set_page_dirty with compound page which isn't on LRU. but my patch does
ClearPageRelcaim on compound page. In non-CONFIG_PAGEFLAGS_EXTENDED, it
breaks PageTail flag.

I think it doesn't affect THP and pass my test with THP enabling but Cced
Andrea for double check.

Signed-off-by: Minchan Kim
Reported-by: Steven Barrett
Reviewed-by: Johannes Weiner
Acked-by: Rik van Riel
Acked-by: Mel Gorman
Cc: Wu Fengguang
Cc: KOSAKI Motohiro
Cc: Nick Piggin
Cc: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Minchan Kim
2011-03-23 08:44:04 +0800
3f58a8294 memcg: move memcg reclaimable page into tail of inactive list ... Browse Code »

The rotate_reclaimable_page function moves just written out pages, which
the VM wanted to reclaim, to the end of the inactive list. That way the
VM will find those pages first next time it needs to free memory.

This patch applies the rule in memcg. It can help to prevent unnecessary
working page eviction of memcg.

Signed-off-by: Minchan Kim
Acked-by: Balbir Singh
Acked-by: KAMEZAWA Hiroyuki
Reviewed-by: Rik van Riel
Cc: KOSAKI Motohiro
Acked-by: Johannes Weiner
Cc: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Minchan Kim
2011-03-23 08:44:03 +0800
315601809 mm: deactivate invalidated pages ... Browse Code »

Recently, there are reported problem about thrashing.
(http://marc.info/?l=rsync&m=128885034930933&w=2) It happens by backup
workloads(ex, nightly rsync). That's because the workload makes just
use-once pages and touches pages twice. It promotes the page into active
list so that it results in working set page eviction.

Some app developer want to support POSIX_FADV_NOREUSE. But other OSes
don't support it, either.
(http://marc.info/?l=linux-mm&m=128928979512086&w=2)

By other approach, app developers use POSIX_FADV_DONTNEED. But it has a
problem. If kernel meets page is writing during invalidate_mapping_pages,
it can't work. It makes for application programmer to use it since they
always have to sync data before calling fadivse(..POSIX_FADV_DONTNEED) to
make sure the pages could be discardable. At last, they can't use
deferred write of kernel so that they could see performance loss.
(http://insights.oetiker.ch/linux/fadvise.html)

In fact, invalidation is very big hint to reclaimer. It means we don't
use the page any more. So let's move the writing page into inactive
list's head if we can't truncate it right now.

Why I move page to head of lru on this patch, Dirty/Writeback page would
be flushed sooner or later. It can prevent writeout of pageout which is
less effective than flusher's writeout.

Originally, I reused lru_demote of Peter with some change so added his
Signed-off-by.

Signed-off-by: Minchan Kim
Reported-by: Ben Gamari
Signed-off-by: Peter Zijlstra
Acked-by: Rik van Riel
Acked-by: Mel Gorman
Reviewed-by: KOSAKI Motohiro
Cc: Wu Fengguang
Acked-by: Johannes Weiner
Cc: Nick Piggin
Cc: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Minchan Kim
2011-03-23 08:44:03 +0800

18 Jan, 2011

2 commits

83896fb5e Revert "mm: simplify code of swap.c" ... Browse Code »

This reverts commit d8505dee1a87b8d41b9c4ee1325cd72258226fbc.

Chris Mason ended up chasing down some page allocation errors and pages
stuck waiting on the IO scheduler, and was able to narrow it down to two
commits: commit 744ed1442757 ("mm: batch activate_page() to reduce lock
contention") and d8505dee1a87 ("mm: simplify code of swap.c").

This reverts the second one.

Reported-and-debugged-by: Chris Mason
Cc: Mel Gorman
Cc: Andrew Morton
Cc: Jens Axboe
Cc: linux-mm
Cc: KAMEZAWA Hiroyuki
Cc: Andrea Arcangeli
Cc: Shaohua Li
Signed-off-by: Linus Torvalds

Linus Torvalds
2011-01-18 06:42:34 +0800
7a608572a Revert "mm: batch activate_page() to reduce lock contention" ... Browse Code »

This reverts commit 744ed1442757767ffede5008bb13e0805085902e.

Chris Mason ended up chasing down some page allocation errors and pages
stuck waiting on the IO scheduler, and was able to narrow it down to two
commits: commit 744ed1442757 ("mm: batch activate_page() to reduce lock
contention") and d8505dee1a87 ("mm: simplify code of swap.c").

This reverts the first of them.

Reported-and-debugged-by: Chris Mason
Cc: Mel Gorman
Cc: Andrew Morton
Cc: Jens Axboe
Cc: linux-mm
Cc: KAMEZAWA Hiroyuki
Cc: Andrea Arcangeli
Cc: Shaohua Li
Signed-off-by: Linus Torvalds

Linus Torvalds
2011-01-18 06:42:19 +0800

14 Jan, 2011

5 commits

744ed1442 mm: batch activate_page() to reduce lock contention ... Browse Code »

The zone->lru_lock is heavily contented in workload where activate_page()
is frequently used. We could do batch activate_page() to reduce the lock
contention. The batched pages will be added into zone list when the pool
is full or page reclaim is trying to drain them.

For example, in a 4 socket 64 CPU system, create a sparse file and 64
processes, processes shared map to the file. Each process read access the
whole file and then exit. The process exit will do unmap_vmas() and cause
a lot of activate_page() call. In such workload, we saw about 58% total
time reduction with below patch. Other workloads with a lot of
activate_page also benefits a lot too.

I tested some microbenchmarks:
case-anon-cow-rand-mt 0.58%
case-anon-cow-rand -3.30%
case-anon-cow-seq-mt -0.51%
case-anon-cow-seq -5.68%
case-anon-r-rand-mt 0.23%
case-anon-r-rand 0.81%
case-anon-r-seq-mt -0.71%
case-anon-r-seq -1.99%
case-anon-rx-rand-mt 2.11%
case-anon-rx-seq-mt 3.46%
case-anon-w-rand-mt -0.03%
case-anon-w-rand -0.50%
case-anon-w-seq-mt -1.08%
case-anon-w-seq -0.12%
case-anon-wx-rand-mt -5.02%
case-anon-wx-seq-mt -1.43%
case-fork 1.65%
case-fork-sleep -0.07%
case-fork-withmem 1.39%
case-hugetlb -0.59%
case-lru-file-mmap-read-mt -0.54%
case-lru-file-mmap-read 0.61%
case-lru-file-mmap-read-rand -2.24%
case-lru-file-readonce -0.64%
case-lru-file-readtwice -11.69%
case-lru-memcg -1.35%
case-mmap-pread-rand-mt 1.88%
case-mmap-pread-rand -15.26%
case-mmap-pread-seq-mt 0.89%
case-mmap-pread-seq -69.72%
case-mmap-xread-rand-mt 0.71%
case-mmap-xread-seq-mt 0.38%

The most significent are:
case-lru-file-readtwice -11.69%
case-mmap-pread-rand -15.26%
case-mmap-pread-seq -69.72%

which use activate_page a lot. others are basically variations because
each run has slightly difference.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Shaohua Li
Cc: Andi Kleen
Cc: Minchan Kim
Cc: KOSAKI Motohiro
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Shaohua Li
2011-01-14 09:32:50 +0800
d8505dee1 mm: simplify code of swap.c ... Browse Code »

Clean up code and remove duplicate code. Next patch will use
pagevec_lru_move_fn introduced here too.

Signed-off-by: Shaohua Li
Cc: Andi Kleen
Cc: Minchan Kim
Cc: KOSAKI Motohiro
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Shaohua Li
2011-01-14 09:32:50 +0800
71e3aac07 thp: transparent hugepage core ... Browse Code »

Lately I've been working to make KVM use hugepages transparently without
the usual restrictions of hugetlbfs. Some of the restrictions I'd like to
see removed:

1) hugepages have to be swappable or the guest physical memory remains
locked in RAM and can't be paged out to swap

2) if a hugepage allocation fails, regular pages should be allocated
instead and mixed in the same vma without any failure and without
userland noticing

3) if some task quits and more hugepages become available in the
buddy, guest physical memory backed by regular pages should be
relocated on hugepages automatically in regions under
madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
kernel deamon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes
not null)

4) avoidance of reservation and maximization of use of hugepages whenever
possible. Reservation (needed to avoid runtime fatal faliures) may be ok for
1 machine with 1 database with 1 database cache with 1 database cache size
known at boot time. It's definitely not feasible with a virtualization
hypervisor usage like RHEV-H that runs an unknown number of virtual machines
with an unknown size of each virtual machine with an unknown amount of
pagecache that could be potentially useful in the host for guest not using
O_DIRECT (aka cache=off).

hugepages in the virtualization hypervisor (and also in the guest!) are
much more important than in a regular host not using virtualization,
becasue with NPT/EPT they decrease the tlb-miss cacheline accesses from 24
to 19 in case only the hypervisor uses transparent hugepages, and they
decrease the tlb-miss cacheline accesses from 19 to 15 in case both the
linux hypervisor and the linux guest both uses this patch (though the
guest will limit the addition speedup to anonymous regions only for
now...). Even more important is that the tlb miss handler is much slower
on a NPT/EPT guest than for a regular shadow paging or no-virtualization
scenario. So maximizing the amount of virtual memory cached by the TLB
pays off significantly more with NPT/EPT than without (even if there would
be no significant speedup in the tlb-miss runtime).

The first (and more tedious) part of this work requires allowing the VM to
handle anonymous hugepages mixed with regular pages transparently on
regular anonymous vmas. This is what this patch tries to achieve in the
least intrusive possible way. We want hugepages and hugetlb to be used in
a way so that all applications can benefit without changes (as usual we
leverage the KVM virtualization design: by improving the Linux VM at
large, KVM gets the performance boost too).

The most important design choice is: always fallback to 4k allocation if
the hugepage allocation fails! This is the _very_ opposite of some large
pagecache patches that failed with -EIO back then if a 64k (or similar)
allocation failed...

Second important decision (to reduce the impact of the feature on the
existing pagetable handling code) is that at any time we can split an
hugepage into 512 regular pages and it has to be done with an operation
that can't fail. This way the reliability of the swapping isn't decreased
(no need to allocate memory when we are short on memory to swap) and it's
trivial to plug a split_huge_page* one-liner where needed without
polluting the VM. Over time we can teach mprotect, mremap and friends to
handle pmd_trans_huge natively without calling split_huge_page*. The fact
it can't fail isn't just for swap: if split_huge_page would return -ENOMEM
(instead of the current void) we'd need to rollback the mprotect from the
middle of it (ideally including undoing the split_vma) which would be a
big change and in the very wrong direction (it'd likely be simpler not to
call split_huge_page at all and to teach mprotect and friends to handle
hugepages instead of rolling them back from the middle). In short the
very value of split_huge_page is that it can't fail.

The collapsing and madvise(MADV_HUGEPAGE) part will remain separated and
incremental and it'll just be an "harmless" addition later if this initial
part is agreed upon. It also should be noted that locking-wise replacing
regular pages with hugepages is going to be very easy if compared to what
I'm doing below in split_huge_page, as it will only happen when
page_count(page) matches page_mapcount(page) if we can take the PG_lock
and mmap_sem in write mode. collapse_huge_page will be a "best effort"
that (unlike split_huge_page) can fail at the minimal sign of trouble and
we can try again later. collapse_huge_page will be similar to how KSM
works and the madvise(MADV_HUGEPAGE) will work similar to
madvise(MADV_MERGEABLE).

The default I like is that transparent hugepages are used at page fault
time. This can be changed with
/sys/kernel/mm/transparent_hugepage/enabled. The control knob can be set
to three values "always", "madvise", "never" which mean respectively that
hugepages are always used, or only inside madvise(MADV_HUGEPAGE) regions,
or never used. /sys/kernel/mm/transparent_hugepage/defrag instead
controls if the hugepage allocation should defrag memory aggressively
"always", only inside "madvise" regions, or "never".

The pmd_trans_splitting/pmd_trans_huge locking is very solid. The
put_page (from get_user_page users that can't use mmu notifier like
O_DIRECT) that runs against a __split_huge_page_refcount instead was a
pain to serialize in a way that would result always in a coherent page
count for both tail and head. I think my locking solution with a
compound_lock taken only after the page_first is valid and is still a
PageHead should be safe but it surely needs review from SMP race point of
view. In short there is no current existing way to serialize the O_DIRECT
final put_page against split_huge_page_refcount so I had to invent a new
one (O_DIRECT loses knowledge on the mapping status by the time gup_fast
returns so...). And I didn't want to impact all gup/gup_fast users for
now, maybe if we change the gup interface substantially we can avoid this
locking, I admit I didn't think too much about it because changing the gup
unpinning interface would be invasive.

If we ignored O_DIRECT we could stick to the existing compound refcounting
code, by simply adding a get_user_pages_fast_flags(foll_flags) where KVM
(and any other mmu notifier user) would call it without FOLL_GET (and if
FOLL_GET isn't set we'd just BUG_ON if nobody registered itself in the
current task mmu notifier list yet). But O_DIRECT is fundamental for
decent performance of virtualized I/O on fast storage so we can't avoid it
to solve the race of put_page against split_huge_page_refcount to achieve
a complete hugepage feature for KVM.

Swap and oom works fine (well just like with regular pages ;). MMU
notifier is handled transparently too, with the exception of the young bit
on the pmd, that didn't have a range check but I think KVM will be fine
because the whole point of hugepages is that EPT/NPT will also use a huge
pmd when they notice gup returns pages with PageCompound set, so they
won't care of a range and there's just the pmd young bit to check in that
case.

NOTE: in some cases if the L2 cache is small, this may slowdown and waste
memory during COWs because 4M of memory are accessed in a single fault
instead of 8k (the payoff is that after COW the program can run faster).
So we might want to switch the copy_huge_page (and clear_huge_page too) to
not temporal stores. I also extensively researched ways to avoid this
cache trashing with a full prefault logic that would cow in 8k/16k/32k/64k
up to 1M (I can send those patches that fully implemented prefault) but I
concluded they're not worth it and they add an huge additional complexity
and they remove all tlb benefits until the full hugepage has been faulted
in, to save a little bit of memory and some cache during app startup, but
they still don't improve substantially the cache-trashing during startup
if the prefault happens in >4k chunks. One reason is that those 4k pte
entries copied are still mapped on a perfectly cache-colored hugepage, so
the trashing is the worst one can generate in those copies (cow of 4k page
copies aren't so well colored so they trashes less, but again this results
in software running faster after the page fault). Those prefault patches
allowed things like a pte where post-cow pages were local 4k regular anon
pages and the not-yet-cowed pte entries were pointing in the middle of
some hugepage mapped read-only. If it doesn't payoff substantially with
todays hardware it will payoff even less in the future with larger l2
caches, and the prefault logic would blot the VM a lot. If one is
emebdded transparent_hugepage can be disabled during boot with sysfs or
with the boot commandline parameter transparent_hugepage=0 (or
transparent_hugepage=2 to restrict hugepages inside madvise regions) that
will ensure not a single hugepage is allocated at boot time. It is simple
enough to just disable transparent hugepage globally and let transparent
hugepages be allocated selectively by applications in the MADV_HUGEPAGE
region (both at page fault time, and if enabled with the
collapse_huge_page too through the kernel daemon).

This patch supports only hugepages mapped in the pmd, archs that have
smaller hugepages will not fit in this patch alone. Also some archs like
power have certain tlb limits that prevents mixing different page size in
the same regions so they will not fit in this framework that requires
"graceful fallback" to basic PAGE_SIZE in case of physical memory
fragmentation. hugetlbfs remains a perfect fit for those because its
software limits happen to match the hardware limits. hugetlbfs also
remains a perfect fit for hugepage sizes like 1GByte that cannot be hoped
to be found not fragmented after a certain system uptime and that would be
very expensive to defragment with relocation, so requiring reservation.
hugetlbfs is the "reservation way", the point of transparent hugepages is
not to have any reservation at all and maximizing the use of cache and
hugepages at all times automatically.

Some performance result:

vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largep
ages3
memset page fault 1566023
memset tlb miss 453854
memset second tlb miss 453321
random access tlb miss 41635
random access second tlb miss 41658
vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566471
memset tlb miss 453375
memset second tlb miss 453320
random access tlb miss 41636
random access second tlb miss 41637
vmx andrea # ./largepages3
memset page fault 1566642
memset tlb miss 453417
memset second tlb miss 453313
random access tlb miss 41630
random access second tlb miss 41647
vmx andrea # ./largepages3
memset page fault 1566872
memset tlb miss 453418
memset second tlb miss 453315
random access tlb miss 41618
random access second tlb miss 41659
vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
vmx andrea # ./largepages3
memset page fault 2182476
memset tlb miss 460305
memset second tlb miss 460179
random access tlb miss 44483
random access second tlb miss 44186
vmx andrea # ./largepages3
memset page fault 2182791
memset tlb miss 460742
memset second tlb miss 459962
random access tlb miss 43981
random access second tlb miss 43988

============
#include
#include
#include
#include

#define SIZE (3UL*1024*1024*1024)

int main()
{
char *p = malloc(SIZE), *p2;
struct timeval before, after;

gettimeofday(&before, NULL);
memset(p, 0, SIZE);
gettimeofday(&after, NULL);
printf("memset page fault %Lu\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);

gettimeofday(&before, NULL);
memset(p, 0, SIZE);
gettimeofday(&after, NULL);
printf("memset tlb miss %Lu\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);

gettimeofday(&before, NULL);
memset(p, 0, SIZE);
gettimeofday(&after, NULL);
printf("memset second tlb miss %Lu\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);

gettimeofday(&before, NULL);
for (p2 = p; p2 < p+SIZE; p2 += 4096)
*p2 = 0;
gettimeofday(&after, NULL);
printf("random access tlb miss %Lu\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);

gettimeofday(&before, NULL);
for (p2 = p; p2 < p+SIZE; p2 += 4096)
*p2 = 0;
gettimeofday(&after, NULL);
printf("random access second tlb miss %Lu\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);

return 0;
}
============

Signed-off-by: Andrea Arcangeli
Acked-by: Rik van Riel
Signed-off-by: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2011-01-14 09:32:42 +0800
a95a82e96 thp: put_page: recheck PageHead after releasing the compound_lock ... Browse Code »

After releasing the compound_lock split_huge_page can still run and release the
page before put_page_testzero runs.

Signed-off-by: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2011-01-14 09:32:39 +0800
918070634 thp: alter compound get_page/put_page ... Browse Code »

Alter compound get_page/put_page to keep references on subpages too, in
order to allow __split_huge_page_refcount to split an hugepage even while
subpages have been pinned by one of the get_user_pages() variants.

Signed-off-by: Andrea Arcangeli
Acked-by: Rik van Riel
Acked-by: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2011-01-14 09:32:39 +0800

28 Oct, 2010

1 commit

0be8557bc fuse: use release_pages() ... Browse Code »

Replace iterated page_cache_release() with release_pages(), which is
faster and shorter.

Needs release_pages() to be exported to modules.

Suggested-by: Andrew Morton
Signed-off-by: Miklos Szeredi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Miklos Szeredi
2010-10-28 09:03:17 +0800

25 May, 2010

1 commit

47846b065 mm: export lru_cache_add_*() to modules ... Browse Code »

This is needed to enable moving pages into the page cache in fuse with
splice(..., SPLICE_F_MOVE).

Signed-off-by: Miklos Szeredi

Miklos Szeredi
2010-05-25 21:06:06 +0800

30 Mar, 2010

1 commit

5a0e3ad6a include cleanup: Update gfp.h and slab.h includes to prepare for breaking implic… ... Browse Code »

…it slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

Tejun Heo
2010-03-30 21:02:32 +0800

07 Mar, 2010

1 commit

fc91668ea mm: remove free_hot_page() ... Browse Code »

free_hot_page() is just a wrapper around free_hot_cold_page() with
parameter 'cold = 0'. After adding a clear comment for
free_hot_cold_page(), it is reasonable to remove a level of call.

[akpm@linux-foundation.org: fix build]
Signed-off-by: Li Hong
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Ingo Molnar
Cc: Larry Woodman
Cc: Peter Zijlstra
Cc: Li Ming Chun
Cc: KOSAKI Motohiro
Cc: Americo Wang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Li Hong
2010-03-07 03:26:25 +0800

22 Sep, 2009

3 commits

4481374ce mm: replace various uses of num_physpages by totalram_pages ... Browse Code »

Sizing of memory allocations shouldn't depend on the number of physical
pages found in a system, as that generally includes (perhaps a huge amount
of) non-RAM pages. The amount of what actually is usable as storage
should instead be used as a basis here.

Some of the calculations (i.e. those not intending to use high memory)
should likely even use (totalram_pages - totalhigh_pages).

Signed-off-by: Jan Beulich
Acked-by: Rusty Russell
Acked-by: Ingo Molnar
Cc: Dave Airlie
Cc: Kyle McMartin
Cc: Jeremy Fitzhardinge
Cc: Pekka Enberg
Cc: Hugh Dickins
Cc: "David S. Miller"
Cc: Patrick McHardy
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jan Beulich
2009-09-22 22:17:38 +0800
6c0b13519 mm: return boolean from page_is_file_cache() ... Browse Code »

page_is_file_cache() has been used for both boolean checks and LRU
arithmetic, which was always a bit weird.

Now that page_lru_base_type() exists for LRU arithmetic, make
page_is_file_cache() a real predicate function and adjust the
boolean-using callsites to drop those pesky double negations.

Signed-off-by: Johannes Weiner
Reviewed-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2009-09-22 22:17:37 +0800
401a8e1c1 mm: introduce page_lru_base_type() ... Browse Code »

Instead of abusing page_is_file_cache() for LRU list index arithmetic, add
another helper with a more appropriate name and convert the non-boolean
users of page_is_file_cache() accordingly.

This new helper gives the LRU base type a page is supposed to live on,
inactive anon or inactive file.

[hugh.dickins@tiscali.co.uk: convert del_page_from_lru() also]
Signed-off-by: Johannes Weiner
Reviewed-by: Rik van Riel
Cc: Minchan Kim
Reviewed-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2009-09-22 22:17:35 +0800

03 May, 2009

1 commit

00a62ce91 mm: fix Committed_AS underflow on large NR_CPUS environment ... Browse Code »
43

The Committed_AS field can underflow in certain situations:

> # while true; do cat /proc/meminfo | grep _AS; sleep 1; done | uniq -c
> 1 Committed_AS: 18446744073709323392 kB
> 11 Committed_AS: 18446744073709455488 kB
> 6 Committed_AS: 35136 kB
> 5 Committed_AS: 18446744073709454400 kB
> 7 Committed_AS: 35904 kB
> 3 Committed_AS: 18446744073709453248 kB
> 2 Committed_AS: 34752 kB
> 9 Committed_AS: 18446744073709453248 kB
> 8 Committed_AS: 34752 kB
> 3 Committed_AS: 18446744073709320960 kB
> 7 Committed_AS: 18446744073709454080 kB
> 3 Committed_AS: 18446744073709320960 kB
> 5 Committed_AS: 18446744073709454080 kB
> 6 Committed_AS: 18446744073709320960 kB

Because NR_CPUS can be greater than 1000 and meminfo_proc_show() does
not check for underflow.

But NR_CPUS proportional isn't good calculation. In general,
possibility of lock contention is proportional to the number of online
cpus, not theorical maximum cpus (NR_CPUS).

The current kernel has generic percpu-counter stuff. using it is right
way. it makes code simplify and percpu_counter_read_positive() don't
make underflow issue.

Reported-by: Dave Hansen
Signed-off-by: KOSAKI Motohiro
Cc: Eric B Munson
Cc: Mel Gorman
Cc: Christoph Lameter
Cc: [All kernel versions]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KOSAKI Motohiro
2009-05-03 06:36:10 +0800

03 Apr, 2009

1 commit

266cf658e FS-Cache: Recruit a page flags for cache management ... Browse Code »

Recruit a page flag to aid in cache management. The following extra flag is
defined:

(1) PG_fscache (PG_private_2)

The marked page is backed by a local cache and is pinning resources in the
cache driver.

If PG_fscache is set, then things that checked for PG_private will now also
check for that. This includes things like truncation and page invalidation.
The function page_has_private() had been added to make the checks for both
PG_private and PG_private_2 at the same time.

Signed-off-by: David Howells
Acked-by: Steve Dickson
Acked-by: Trond Myklebust
Acked-by: Rik van Riel
Acked-by: Al Viro
Tested-by: Daire Byrne

David Howells
2009-04-03 23:42:36 +0800

01 Apr, 2009

1 commit

d1d748717 mm: remove pagevec_swap_free() ... Browse Code »

pagevec_swap_free() is now unused.

Signed-off-by: KOSAKI Motohiro
Cc: Johannes Weiner
Cc: Rik van Riel
Acked-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KOSAKI Motohiro
2009-04-01 23:59:13 +0800

09 Jan, 2009

3 commits

3e2f41f1f memcg: add zone_reclaim_stat ... Browse Code »

Introduce mem_cgroup_per_zone::reclaim_stat member and its statics
collecting function.

Now, get_scan_ratio() can calculate correct value on memcg reclaim.

[hugh@veritas.com: avoid reclaim_stat oops when disabled]
Acked-by: KAMEZAWA Hiroyuki
Acked-by: Rik van Riel
Signed-off-by: KOSAKI Motohiro
Cc: Balbir Singh
Cc: Daisuke Nishimura
Cc: Hugh Dickins
Cc: KOSAKI Motohiro
Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KOSAKI Motohiro
2009-01-09 00:31:08 +0800
6e9015716 mm: introduce zone_reclaim struct ... Browse Code »

Add zone_reclam_stat struct for later enhancement.

A later patch uses this. This patch doesn't any behavior change (yet).

Reviewed-by: KAMEZAWA Hiroyuki
Signed-off-by: KOSAKI Motohiro
Acked-by: Rik van Riel
Cc: Balbir Singh
Cc: Daisuke Nishimura
Cc: Hugh Dickins
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KOSAKI Motohiro
2009-01-09 00:31:07 +0800
08e552c69 memcg: synchronized LRU ... Browse Code »
43

A big patch for changing memcg's LRU semantics.

Now,
- page_cgroup is linked to mem_cgroup's its own LRU (per zone).

- LRU of page_cgroup is not synchronous with global LRU.

- page and page_cgroup is one-to-one and statically allocated.

- To find page_cgroup is on what LRU, you have to check pc->mem_cgroup as
- lru = page_cgroup_zoneinfo(pc, nid_of_pc, zid_of_pc);

- SwapCache is handled.

And, when we handle LRU list of page_cgroup, we do following.

pc = lookup_page_cgroup(page);
lock_page_cgroup(pc); .....................(1)
mz = page_cgroup_zoneinfo(pc);
spin_lock(&mz->lru_lock);
.....add to LRU
spin_unlock(&mz->lru_lock);
unlock_page_cgroup(pc);

But (1) is spin_lock and we have to be afraid of dead-lock with zone->lru_lock.
So, trylock() is used at (1), now. Without (1), we can't trust "mz" is correct.

This is a trial to remove this dirty nesting of locks.
This patch changes mz->lru_lock to be zone->lru_lock.
Then, above sequence will be written as

spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU
mem_cgroup_add/remove/etc_lru() {
pc = lookup_page_cgroup(page);
mz = page_cgroup_zoneinfo(pc);
if (PageCgroupUsed(pc)) {
....add to LRU
}
spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU

This is much simpler.
(*) We're safe even if we don't take lock_page_cgroup(pc). Because..
1. When pc->mem_cgroup can be modified.
- at charge.
- at account_move().
2. at charge
the PCG_USED bit is not set before pc->mem_cgroup is fixed.
3. at account_move()
the page is isolated and not on LRU.

Pros.
- easy for maintenance.
- memcg can make use of laziness of pagevec.
- we don't have to duplicated LRU/Active/Unevictable bit in page_cgroup.
- LRU status of memcg will be synchronized with global LRU's one.
- # of locks are reduced.
- account_move() is simplified very much.
Cons.
- may increase cost of LRU rotation.
(no impact if memcg is not configured.)

Signed-off-by: KAMEZAWA Hiroyuki
Cc: Li Zefan
Cc: Balbir Singh
Cc: Pavel Emelyanov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2009-01-09 00:31:05 +0800

07 Jan, 2009

1 commit

a2c43eed8 mm: try_to_free_swap replaces remove_exclusive_swap_page ... Browse Code »

remove_exclusive_swap_page(): its problem is in living up to its name.

It doesn't matter if someone else has a reference to the page (raised
page_count); it doesn't matter if the page is mapped into userspace
(raised page_mapcount - though that hints it may be worth keeping the
swap): all that matters is that there be no more references to the swap
(and no writeback in progress).

swapoff (try_to_unuse) has been removing pages from swapcache for years,
with no concern for page count or page mapcount, and we used to have a
comment in lookup_swap_cache() recognizing that: if you go for a page of
swapcache, you'll get the right page, but it could have been removed from
swapcache by the time you get page lock.

So, give up asking for exclusivity: get rid of
remove_exclusive_swap_page(), and remove_exclusive_swap_page_ref() and
remove_exclusive_swap_page_count() which were spawned for the recent LRU
work: replace them by the simpler try_to_free_swap() which just checks
page_swapcount().

Similarly, remove the page_count limitation from free_swap_and_count(),
but assume that it's worth holding on to the swap if page is mapped and
swap nowhere near full. Add a vm_swap_full() test in free_swap_cache()?
It would be consistent, but I think we probably have enough for now.

Signed-off-by: Hugh Dickins
Cc: Lee Schermerhorn
Cc: Rik van Riel
Cc: Nick Piggin
Cc: KAMEZAWA Hiroyuki
Cc: Robin Holt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2009-01-07 07:59:03 +0800