05 Jun, 2014

9 commits

  • aops->write_begin may allocate a new page and make it visible only to have
    mark_page_accessed called almost immediately after. Once the page is
    visible, the atomic operations become necessary, which is noticeable
    overhead when writing to an in-memory filesystem like tmpfs but should
    also be noticeable with fast storage. The objective of the patch is to
    initialise the accessed information with non-atomic operations before the
    page is visible.

    The bulk of filesystems directly or indirectly use
    grab_cache_page_write_begin or find_or_create_page for the initial
    allocation of a page cache page. This patch adds an init_page_accessed()
    helper which behaves like the first call to mark_page_accessed() but may
    be called before the page is visible and can be done non-atomically.

    The primary APIs of concern in this case are the following, and they are
    used by most filesystems.

    find_get_page
    find_lock_page
    find_or_create_page
    grab_cache_page_nowait
    grab_cache_page_write_begin

    All of them are very similar in detail, so the patch creates a core helper
    pagecache_get_page() which takes a flags parameter that affects its
    behaviour, such as whether the page should be marked accessed or not. The
    old API is preserved but is basically a thin wrapper around this core
    function.

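    As a rough sketch of that wrapper pattern (the flag names and prototypes
    below follow my reading of this series and should be treated as
    illustrative rather than the exact kernel API):

    /* Illustrative sketch, not the exact kernel prototypes. */
    struct page *pagecache_get_page(struct address_space *mapping,
                                    pgoff_t index, int fgp_flags,
                                    gfp_t gfp_mask);

    static inline struct page *find_get_page(struct address_space *mapping,
                                             pgoff_t index)
    {
            return pagecache_get_page(mapping, index, 0, 0);
    }

    static inline struct page *find_or_create_page(struct address_space *mapping,
                                                   pgoff_t index, gfp_t gfp_mask)
    {
            /* FGP_ACCESSED lets the core helper handle the accessed bit,
             * using init_page_accessed() while the page is still private. */
            return pagecache_get_page(mapping, index,
                                      FGP_LOCK | FGP_ACCESSED | FGP_CREAT,
                                      gfp_mask);
    }
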
    Each of the filesystems is then updated to avoid calling
    mark_page_accessed when it is known that the VM interfaces have already
    done the job. There is a slight snag in that the timing of
    mark_page_accessed() has now changed, so in rare cases it's possible a
    page gets to the end of the LRU as PageReferenced whereas previously it
    might have been repromoted. This is expected to be rare, but it's worth
    the filesystem people thinking about it in case they see a problem with
    the timing change. It is also the case that some filesystems may now be
    marking pages accessed where they previously did not, but it makes sense
    that filesystems have consistent behaviour in this regard.

    The test case used to evaluate this is a simple dd of a large file done
    multiple times with the file deleted on each iteration. The size of the
    file is 1/10th of physical memory to avoid dirty page balancing. In the
    async case it is possible that the workload completes without even
    hitting the disk and the results will be variable, but they highlight the
    impact of mark_page_accessed for async IO. The sync results are expected
    to be more stable. The exception is tmpfs, where the normal case is for
    the "IO" to not hit the disk.

    The test machine was single socket and UMA to avoid any scheduling or NUMA
    artifacts. Throughput and wall times are presented for sync IO; only wall
    times are shown for async, as the granularity reported by dd and the
    variability make it unsuitable for comparison. As the async results were
    variable due to writeback timings, I'm only reporting the maximum figures.
    The sync results were stable enough to make the mean and stddev
    uninteresting.

    The performance results are reported based on a run with no profiling.
    Profile data is based on a separate run with oprofile running.

    async dd
                            3.15.0-rc3           3.15.0-rc3
                               vanilla          accessed-v2
    ext3   Max elapsed   13.9900 (  0.00%)   11.5900 ( 17.16%)
    tmpfs  Max elapsed    0.5100 (  0.00%)    0.4900 (  3.92%)
    btrfs  Max elapsed   12.8100 (  0.00%)   12.7800 (  0.23%)
    ext4   Max elapsed   18.6000 (  0.00%)   13.3400 ( 28.28%)
    xfs    Max elapsed   12.5600 (  0.00%)    2.0900 ( 83.36%)

    The XFS figure is a bit strange as it managed to avoid a worst case by
    sheer luck but the average figures looked reasonable.

            samples  percentage  kernel                             function
    ext3      86107      0.9783  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
    ext3      23833      0.2710  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed
    ext3       5036      0.0573  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
    ext4      64566      0.8961  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
    ext4       5322      0.0713  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed
    ext4       2869      0.0384  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
    xfs       62126      1.7675  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
    xfs        1904      0.0554  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
    xfs         103      0.0030  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed
    btrfs     10655      0.1338  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
    btrfs      2020      0.0273  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
    btrfs       587      0.0079  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed
    tmpfs     59562      3.2628  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
    tmpfs      1210      0.0696  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
    tmpfs        94      0.0054  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed

    [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Tested-by: Prabhakar Lad
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When adding pages to the LRU we clear the active bit unconditionally.
    As the page could be reachable from other paths, we cannot use unlocked
    operations without risk of corruption, such as a parallel
    mark_page_accessed. This patch tests whether it is necessary to clear the
    active flag before using an atomic operation. This potentially opens a
    tiny race when PageActive is checked, as mark_page_accessed could be
    called after PageActive was checked. The race already exists, but this
    patch changes it slightly. The consequence is that a page may be promoted
    to the active list that would have been left on the inactive list before
    the patch. It's too tiny a race and too marginal a consequence to always
    use atomic operations for.

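    A minimal sketch of that pattern (the function name here is hypothetical;
    only the test-before-clear idiom is the point):

    /* Sketch only: skip the atomic bit op when the bit is already clear. */
    static void lru_cache_add_sketch(struct page *page)
    {
            if (PageActive(page))           /* cheap, non-atomic test */
                    ClearPageActive(page);  /* atomic clear only when needed */
            __lru_cache_add(page);
    }
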
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • There should be no references to it any more and a parallel mark should
    not be reordered against us. Use the non-locked variant to clear the page
    active flag.

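    Roughly, in the release path (a sketch of the idea, not the verbatim diff):

    /* The page is about to be freed and nobody else can reach it, so the
     * non-atomic variant of the flag operation is sufficient here. */
    __ClearPageActive(page);
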
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • cold is a bool, so make it one. Make the likely case the "if" part of the
    block instead of the "else", as according to the optimisation manual this
    is preferred.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently, in put_compound_page(), we have

    ======
    /* put_compound_page(): */
    if (likely(!PageTail(page))) {                  <------ (1)
            ...
            return;
    }
    ...

    /* compound_head(): */
    static inline struct page *compound_head(struct page *page)
    {
            if (unlikely(PageTail(page))) {         <------ (3)
                    struct page *head = page->first_page;

                    smp_rmb();
                    if (likely(PageTail(page)))
                            return head;
            }
            return page;
    }
    ======

    Here, the unlikely() at (3) is a negative hint, because in this call path
    the page is *likely* a tail page, so the check at (3) is not a good fit
    for this case.

    This patch therefore introduces compound_head_by_tail(), which deals with
    a possible tail page (though it could be split by a racing thread), and
    makes compound_head() a wrapper around it.

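    A sketch of what such a helper can look like, based on the compound_head()
    body quoted above (treat the details as approximate):

    /* The caller already believes @tail is a tail page, so skip the badly
     * hinted PageTail check and read first_page directly, revalidating
     * after the barrier in case of a racing split. */
    static inline struct page *compound_head_by_tail(struct page *tail)
    {
            struct page *head = tail->first_page;

            smp_rmb();
            if (likely(PageTail(tail)))
                    return head;
            return tail;
    }
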
    This patch has no functional change, and it reduces the object
    size slightly:
     text    data    bss      dec     hex  filename
    11003    1328     16    12347    303b  mm/swap.o.orig
    10971    1328     16    12315    301b  mm/swap.o.patched

    I ran "perf top -e branch-miss" to observe branch misses in this case.
    As Michael points out, it's a slow path, so this case happens only very
    rarely. But I grepped the code base and found there are still some other
    call sites that could benefit from this helper. And given that it bloats
    the source by only 5 lines while reducing the object size, I still
    believe this helper deserves to exist.

    Signed-off-by: Jianyu Zhan
    Cc: Kirill A. Shutemov
    Cc: Rik van Riel
    Cc: Jiang Liu
    Cc: Peter Zijlstra
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Sasha Levin
    Cc: Wanpeng Li
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     
  • Currently, put_compound_page() carefully handles tricky cases to avoid
    racing with compound page releasing or splitting, which makes it quite
    lengthy (about 200+ lines) and requires deep tab indentation, which in
    turn makes it quite hard to follow and maintain.

    Now, based on two helpers introduced in the previous patch ("mm/swap.c:
    introduce put_[un]refcounted_compound_page helpers for splitting
    put_compound_page"), this patch replaces those two lengthy code paths with
    these two helpers, respectively. Also, it has some comment rephrasing.

    After this patch, the put_compound_page() is very compact, thus easy to
    read and maintain.

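    The compact shape described above looks roughly like the following (the
    helper names come from this changelog; their exact argument lists are
    assumed here, so treat this purely as a sketch):

    static void put_compound_page(struct page *page)
    {
            struct page *page_head;

            if (likely(!PageTail(page))) {
                    if (put_page_testzero(page)) {
                            if (PageHead(page))
                                    __put_compound_page(page);
                            else
                                    __put_single_page(page);
                    }
                    return;
            }

            /* @page is a tail page of a compound page */
            page_head = compound_head(page);
            if (!__compound_tail_refcounted(page_head))
                    put_unrefcounted_compound_page(page_head, page);
            else
                    put_refcounted_compound_page(page_head, page);
    }
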
    After splitting, the object file is of the same size as the original one.
    Actually, I diffed put_compound_page()'s original disassembly against the
    patched disassembly, and they are 100% the same!

    This fact shows that this splitting has no functional change, but it
    brings readability.

    This patch and the previous one grow the code by 32 lines, mostly due to
    comments.

    Signed-off-by: Jianyu Zhan
    Cc: Kirill A. Shutemov
    Cc: Rik van Riel
    Cc: Jiang Liu
    Cc: Peter Zijlstra
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Sasha Levin
    Cc: Wanpeng Li
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     
  • Currently, put_compound_page() carefully handles tricky cases to avoid
    racing with compound page releasing or splitting, which makes it quite
    lengthy (about 200+ lines) and requires deep tab indentation, which in
    turn makes it quite hard to follow and maintain.

    This patch and the next patch refactor this function.

    Based on the code skeleton of put_compound_page:

    put_compound_page:
        if !PageTail(page)
            put head page fastpath;
            return;

        /* else PageTail */
        page_head = compound_head(page)
        if !__compound_tail_refcounted(page_head)
            put head page optimal path;

    Cc: Kirill A. Shutemov
    Cc: Rik van Riel
    Cc: Jiang Liu
    Cc: Peter Zijlstra
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Sasha Levin
    Cc: Wanpeng Li
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     
  • Replace places where __get_cpu_var() is used for an address calculation
    with this_cpu_ptr().

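    For example (the per-cpu pagevec name is the one I recall from mm/swap.c;
    treat it as illustrative):

    /* Before: address calculation via __get_cpu_var() */
    struct pagevec *pvec = &__get_cpu_var(lru_add_pvec);

    /* After: the same address, expressed with this_cpu_ptr() */
    struct pagevec *pvec = this_cpu_ptr(&lru_add_pvec);
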
    Signed-off-by: Christoph Lameter
    Cc: Tejun Heo
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • In mm/swap.c, __lru_cache_add() is exported, but actually there are no
    users outside this file.

    This patch unexports __lru_cache_add() and makes it static. It also
    exports lru_cache_add_file(), as it is used by cifs and fuse, which can
    be loaded as modules.

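    Roughly (a sketch rather than the verbatim diff):

    /* Now file-local; outside users go through wrappers like the one below. */
    static void __lru_cache_add(struct page *page);

    void lru_cache_add_file(struct page *page)
    {
            /* ... flag handling elided ... */
            __lru_cache_add(page);
    }
    EXPORT_SYMBOL(lru_cache_add_file);  /* loadable cifs and fuse need this */
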
    Signed-off-by: Jianyu Zhan
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Bob Liu
    Cc: Seth Jennings
    Cc: Joonsoo Kim
    Cc: Rafael Aquini
    Cc: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Khalid Aziz
    Cc: Christoph Hellwig
    Reviewed-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     

04 Apr, 2014

2 commits

  • The VM maintains cached filesystem pages on two types of lists. One
    list holds the pages recently faulted into the cache, the other list
    holds pages that have been referenced repeatedly on that first list.
    The idea is to prefer reclaiming young pages over those that have shown
    to benefit from caching in the past. We call the recently used list the
    "inactive list" and the frequently used list the "active list". The
    previous approach to sizing the inactive list ultimately was not
    significantly better than a FIFO policy and still thrashed the cache
    based on eviction speed, rather than actual demand for cache.

    This patch solves one half of the problem by decoupling the ability to
    detect working set changes from the inactive list size. By maintaining
    a history of recently evicted file pages it can detect frequently used
    pages with an arbitrarily small inactive list size, and subsequently
    apply pressure on the active list based on actual demand for cache, not
    just overall eviction speed.

    Every zone maintains a counter that tracks inactive list aging speed.
    When a page is evicted, a snapshot of this counter is stored in the
    now-empty page cache radix tree slot. On refault, the minimum access
    distance of the page can be assessed, to evaluate whether the page
    should be part of the active list or not.

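    A toy model of that mechanism (plain C, heavily simplified; the real code
    packs the counter into a radix-tree "shadow" entry and compares the
    refault distance against the size of the active file list):

    /* Simplified model, not kernel code. */
    struct zone_model {
            unsigned long inactive_age;     /* bumped as the list churns */
            unsigned long nr_active_file;
    };

    /* On eviction, remember how old the list was when the page left it;
     * the value is stored in the now-empty page cache slot. */
    static unsigned long remember_eviction(struct zone_model *z)
    {
            return ++z->inactive_age;
    }

    /* On refault, a small distance means the page came back quickly
     * relative to list churn and deserves to be activated. */
    static int refault_should_activate(const struct zone_model *z,
                                       unsigned long shadow)
    {
            unsigned long refault_distance = z->inactive_age - shadow;

            return refault_distance <= z->nr_active_file;
    }
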
    This fixes the VM's blindness towards working set changes in excess of
    the inactive list. And it's the foundation to further improve the
    protection ability and reduce the minimum inactive list size of 50%.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Reviewed-by: Bob Liu
    Cc: Andrea Arcangeli
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • shmem mappings already contain exceptional entries where swap slot
    information is remembered.

    To be able to store eviction information for regular page cache, prepare
    every site dealing with the radix trees directly to handle entries other
    than pages.

    The common lookup functions will filter out non-page entries and return
    NULL for page cache holes, just as before. But provide a raw version of
    the API which returns non-page entries as well, and switch shmem over to
    use it.

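    A sketch of how a caller distinguishes the two kinds of entries after this
    change (the raw lookup name follows my reading of this series; the handler
    is hypothetical):

    /* The "raw" lookup may return a non-page (exceptional) entry. */
    struct page *page = find_get_entry(mapping, index);

    if (radix_tree_exceptional_entry(page)) {
            /* Not a page: a shmem swap slot today, eviction info later. */
            handle_exceptional_entry(page);         /* hypothetical handler */
    } else if (page) {
            /* A real page cache page, exactly as before. */
    }
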
    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

04 Mar, 2014

1 commit

  • Commit bf6bddf1924e ("mm: introduce compaction and migration for
    ballooned pages") introduces page_count(page) into memory compaction
    which dereferences page->first_page if PageTail(page).

    This results in a very rare NULL pointer dereference on the
    aforementioned page_count(page). Indeed, anything that does
    compound_head(), including page_count() is susceptible to racing with
    prep_compound_page() and seeing a NULL or dangling page->first_page
    pointer.

    This patch uses Andrea's implementation of compound_trans_head() that
    deals with such a race and makes it the default compound_head()
    implementation. This includes a read memory barrier that ensures that
    if PageTail(head) is true that we return a head page that is neither
    NULL nor dangling. The patch then adds a store memory barrier to
    prep_compound_page() to ensure page->first_page is set.

    This is the safest way to ensure we see the head page that we are
    expecting; PageTail(page) is already in the unlikely() path and the
    memory barriers are unfortunately required.

    Hugetlbfs is the exception: we don't enforce a store memory barrier
    during init since no race is possible.

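    The ordering being enforced is roughly the following (a sketch, not the
    exact patch; most of prep_compound_page() is elided):

    /* Publish first_page before the tail flag can be observed. */
    static void prep_compound_page_sketch(struct page *page, unsigned long order)
    {
            int i, nr_pages = 1 << order;

            __SetPageHead(page);
            for (i = 1; i < nr_pages; i++) {
                    struct page *p = page + i;

                    p->first_page = page;
                    smp_wmb();  /* pairs with the smp_rmb() in compound_head() */
                    __SetPageTail(p);
            }
    }
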
    Signed-off-by: David Rientjes
    Cc: Holger Kiehl
    Cc: Christoph Lameter
    Cc: Rafael Aquini
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: "Kirill A. Shutemov"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

24 Jan, 2014

1 commit

  • Most of the VM_BUG_ON assertions are performed on a page. Usually, when
    one of these assertions fails we'll get a BUG_ON with a call stack and
    the registers.

    Based on recent requests to add a small piece of code that dumps the page
    at various VM_BUG_ON sites, I've noticed that the page dump is quite
    useful to people debugging issues in mm.

    This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what
    VM_BUG_ON() does, also dumps the page before executing the actual
    BUG_ON.

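    In rough terms the macro does the following (a sketch of the
    CONFIG_DEBUG_VM case, not the verbatim definition):

    /* Like VM_BUG_ON(), but dump the offending page first. */
    #define VM_BUG_ON_PAGE(cond, page)                              \
            do {                                                    \
                    if (unlikely(cond)) {                           \
                            dump_page(page);                        \
                            BUG();                                  \
                    }                                               \
            } while (0)
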
    [akpm@linux-foundation.org: fix up includes]
    Signed-off-by: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

22 Jan, 2014

4 commits

  • Tweak it to save a tab stop and make the code layout slightly less nutty.

    Signed-off-by: Andrea Arcangeli
    Cc: Khalid Aziz
    Cc: Pravin Shelar
    Cc: Greg Kroah-Hartman
    Cc: Ben Hutchings
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Also remove hugetlb.h which isn't needed anymore as PageHeadHuge is
    handled in mm.h.

    Signed-off-by: Andrea Arcangeli
    Cc: Khalid Aziz
    Cc: Pravin Shelar
    Cc: Greg Kroah-Hartman
    Cc: Ben Hutchings
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This skips the _mapcount mangling for slab and hugetlbfs pages.

    The main trouble in doing this is to guarantee that PageSlab and
    PageHeadHuge remain constant for all get_page/put_page calls run on the
    tail of slab or hugetlbfs compound pages. Otherwise, if they're set during
    get_page but not set during put_page, the _mapcount of the tail page
    would underflow.

    PageHeadHuge will remain true until the compound page is released and
    enters the buddy allocator, so it won't risk changing even if the tail
    page is the last reference left on the page.

    PG_slab instead is cleared before the slab frees the head page with
    put_page, so if the tail pin is released after the slab freed the page,
    we would have a problem. But in the slab case the tail pin cannot be
    the last reference left on the page. This is because the slab code is
    free to reuse the compound page after a kfree/kmem_cache_free without
    having to check if there's any tail pin left. In turn all tail pins
    must be always released while the head is still pinned by the slab code
    and so we know PG_slab will be still set too.

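    The rule boils down to a predicate like this (a sketch; the helper name is
    the one used elsewhere in this series):

    /* Slab and hugetlbfs compound pages skip tail refcounting in _mapcount. */
    static inline bool __compound_tail_refcounted(struct page *head)
    {
            return !PageSlab(head) && !PageHeadHuge(head);
    }
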
    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Khalid Aziz
    Cc: Pravin Shelar
    Cc: Greg Kroah-Hartman
    Cc: Ben Hutchings
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • We don't actually need a reference on the head page in the slab and
    hugetlbfs paths, as long as we add a smp_rmb() which should be faster
    than get_page_unless_zero.

    [akpm@linux-foundation.org: fix typo in comment]
    Signed-off-by: Andrea Arcangeli
    Cc: Khalid Aziz
    Cc: Pravin Shelar
    Cc: Greg Kroah-Hartman
    Cc: Ben Hutchings
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

22 Nov, 2013

1 commit

  • Commit 7cb2ef56e6a8 ("mm: fix aio performance regression for database
    caused by THP") can cause a dereference of a dangling pointer if
    split_huge_page runs during PageHuge() while there are updates to the
    tail_page->private field.

    Also it is repeating compound_head twice for hugetlbfs and it is running
    compound_head+compound_trans_head for THP when a single one is needed in
    both cases.

    The new code within the PageSlab() check doesn't need to verify that the
    THP page size is never bigger than the smallest hugetlbfs page size, to
    avoid memory corruption.

    A longstanding theoretical race condition was found while fixing the
    above (see the change right after the skip_unlock label, that is
    relevant for the compound_lock path too).

    By re-establishing the _mapcount tail refcounting for all compound
    pages, this also fixes the below problem:

    echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

    BUG: Bad page state in process bash pfn:59a01
    page:ffffea000139b038 count:0 mapcount:10 mapping: (null) index:0x0
    page flags: 0x1c00000000008000(tail)
    Modules linked in:
    CPU: 6 PID: 2018 Comm: bash Not tainted 3.12.0+ #25
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    Call Trace:
    dump_stack+0x55/0x76
    bad_page+0xd5/0x130
    free_pages_prepare+0x213/0x280
    __free_pages+0x36/0x80
    update_and_free_page+0xc1/0xd0
    free_pool_huge_page+0xc2/0xe0
    set_max_huge_pages.part.58+0x14c/0x220
    nr_hugepages_store_common.isra.60+0xd0/0xf0
    nr_hugepages_store+0x13/0x20
    kobj_attr_store+0xf/0x20
    sysfs_write_file+0x189/0x1e0
    vfs_write+0xc5/0x1f0
    SyS_write+0x55/0xb0
    system_call_fastpath+0x16/0x1b

    Signed-off-by: Khalid Aziz
    Signed-off-by: Andrea Arcangeli
    Tested-by: Khalid Aziz
    Cc: Pravin Shelar
    Cc: Greg Kroah-Hartman
    Cc: Ben Hutchings
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

08 Nov, 2013

1 commit


13 Sep, 2013

1 commit

  • Make lru_add_drain_all() only selectively interrupt the CPUs that have
    per-cpu free pages that can be drained.

    This is important in nohz mode where calling mlockall(), for example,
    otherwise will interrupt every core unnecessarily.

    This is important on workloads where nohz cores are handling 10 Gb traffic
    in userspace. Those CPUs do not enter the kernel and place pages into LRU
    pagevecs and they really, really don't want to be interrupted, or they
    drop packets on the floor.

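    The selection is along these lines (a sketch; the per-cpu pagevec and work
    item names follow my recollection of mm/swap.c and should be treated as
    assumptions):

    /* Queue drain work only on CPUs that actually have pages to drain. */
    for_each_online_cpu(cpu) {
            struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);

            if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
                pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
                pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
                need_activate_page_drain(cpu)) {
                    INIT_WORK(work, lru_add_drain_per_cpu);
                    schedule_work_on(cpu, work);
                    cpumask_set_cpu(cpu, &has_work);
            }
    }

    /* Wait only for the CPUs that were actually scheduled. */
    for_each_cpu(cpu, &has_work)
            flush_work(&per_cpu(lru_add_drain_work, cpu));
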
    Signed-off-by: Chris Metcalf
    Reviewed-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Metcalf
     

12 Sep, 2013

1 commit

  • I am working with a tool that simulates an Oracle database I/O workload.
    This tool (orion, to be specific) allocates hugetlbfs pages using
    shmget() with the SHM_HUGETLB flag. It then does aio into these pages
    from flash disks using various common block sizes used by databases. I am
    looking at performance with two of the most common block sizes - 1M and
    64K. aio performance with these two block sizes plunged after Transparent
    HugePages was introduced in the kernel.
    Here are performance numbers:

                  pre-THP        2.6.39      3.11-rc5
    1M read     8384 MB/s     5629 MB/s     6501 MB/s
    64K read    7867 MB/s     4576 MB/s     4251 MB/s

    I have narrowed the performance impact down to the overheads introduced by
    THP in __get_page_tail() and put_compound_page() routines. perf top shows
    >40% of cycles being spent in these two routines. Every time direct I/O
    to hugetlbfs pages starts, kernel calls get_page() to grab a reference to
    the pages and calls put_page() when I/O completes to put the reference
    away. THP introduced significant amount of locking overhead to get_page()
    and put_page() when dealing with compound pages because hugepages can be
    split underneath get_page() and put_page(). It added this overhead
    irrespective of whether it is dealing with hugetlbfs pages or transparent
    hugepages. This resulted in 20%-45% drop in aio performance when using
    hugetlbfs pages.

    Since hugetlbfs pages can not be split, there is no reason to go through
    all the locking overhead for these pages from what I can see. I added
    code to __get_page_tail() and put_compound_page() to bypass all the
    locking code when working with hugetlbfs pages. This improved performance
    significantly. Performance numbers with this patch:

                  pre-THP      3.11-rc5    3.11-rc5 + Patch
    1M read     8384 MB/s     6501 MB/s     8371 MB/s
    64K read    7867 MB/s     4251 MB/s     6510 MB/s

    Performance with 64K read is still lower than what it was before THP, but
    still a 53% improvement. It does mean there is more work to be done but I
    will take a 53% improvement for now.

    Please take a look at the following patch and let me know if it looks
    reasonable.

    [akpm@linux-foundation.org: tweak comments]
    Signed-off-by: Khalid Aziz
    Cc: Pravin B Shelar
    Cc: Christoph Lameter
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Andi Kleen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Khalid Aziz
     

01 Aug, 2013

2 commits

  • active/inactive lru lists can contain unevictable pages (e.g. ramfs pages
    that have been placed on the LRU lists when first allocated), but these
    pages must not have PageUnevictable set - otherwise shrink_[in]active_list
    goes crazy:

    kernel BUG at /home/space/kas/git/public/linux-next/mm/vmscan.c:1122!

    1090 static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
    1091 struct lruvec *lruvec, struct list_head *dst,
    1092 unsigned long *nr_scanned, struct scan_control *sc,
    1093 isolate_mode_t mode, enum lru_list lru)
    1094 {
    ...
    1108 switch (__isolate_lru_page(page, mode)) {
    1109 case 0:
    ...
    1116 case -EBUSY:
    ...
    1121 default:
    1122 BUG();
    1123 }
    1124 }
    ...
    1130 }

    __isolate_lru_page() returns EINVAL for PageUnevictable(page).

    For lru_add_page_tail(), it means we should not set PageUnevictable()
    for tail pages unless we're sure that it will go to LRU_UNEVICTABLE.
    Let's just copy PG_active and PG_unevictable from head page in
    __split_huge_page_refcount(), it will simplify lru_add_page_tail().

    This will fix one more bug in lru_add_page_tail(): if
    page_evictable(page_tail) is false and PageLRU(page) is true, page_tail
    will go to the same lru as page, but nobody cares to sync page_tail
    active/inactive state with page. So we can end up with inactive page on
    active lru. The patch will fix it as well since we copy PG_active from
    head page.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Dave Hansen
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • As a result of commit 13f7f78981e4 ("mm: pagevec: defer deciding which
    LRU to add a page to until pagevec drain time"), pages on unevictable
    lists can have both PageActive and PageUnevictable set. This is not
    only confusing, but also corrupts page migration and
    shrink_[in]active_list.

    This patch fixes the problem by adding ClearPageActive before adding
    pages into unevictable list. It also cleans up VM_BUG_ONs.

    Signed-off-by: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

04 Jul, 2013

5 commits

  • Similar to __pagevec_lru_add, this patch removes the LRU parameter from
    __lru_cache_add and lru_cache_add_lru as the caller does not control the
    exact LRU the page gets added to. lru_cache_add_lru gets renamed to
    lru_cache_add, as the name is silly without the lru parameter. With the
    parameter removed, it is required that the caller indicate whether they
    want the page added to the active or inactive list by setting or clearing
    PageActive respectively.

    [akpm@linux-foundation.org: Suggested the patch]
    [gang.chen@asianux.com: fix used-unintialized warning]
    Signed-off-by: Mel Gorman
    Signed-off-by: Chen Gang
    Cc: Jan Kara
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Alexey Lyahkov
    Cc: Andrew Perepechko
    Cc: Robin Dong
    Cc: Theodore Tso
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Bernd Schubert
    Cc: David Howells
    Cc: Trond Myklebust
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Now that the LRU to add a page to is decided at LRU-add time, remove the
    misleading lru parameter from __pagevec_lru_add. A consequence of this
    is that the pagevec_lru_add_file, pagevec_lru_add_anon and similar
    helpers are misleading as the caller no longer has direct control over
    what LRU the page is added to. Unused helpers are removed by this patch
    and existing users of pagevec_lru_add_file() are converted to use
    lru_cache_add_file() directly and use the per-cpu pagevecs instead of
    creating their own pagevec.

    Signed-off-by: Mel Gorman
    Reviewed-by: Jan Kara
    Reviewed-by: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Alexey Lyahkov
    Cc: Andrew Perepechko
    Cc: Robin Dong
    Cc: Theodore Tso
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Bernd Schubert
    Cc: David Howells
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • If a page is on a pagevec then it is !PageLRU and mark_page_accessed()
    may fail to move a page to the active list as expected. Now that the
    LRU is selected at LRU drain time, mark pages PageActive if they are on
    the local pagevec so they get moved to the correct list at LRU drain
    time. Using a debugging patch it was found that, for a simple git
    checkout based workload, pages were never added to the active file
    list in practice, but with this patch applied they are.

                             before      after
    LRU Add Active File           0     750583
    LRU Add Active Anon     2640587    2702818
    LRU Add Inactive File   8833662    8068353
    LRU Add Inactive Anon       207        200

    Note that only pages on the local pagevec are considered on purpose. A
    !PageLRU page could be in the process of being released, reclaimed,
    migrated or on a remote pagevec that is currently being drained.
    Marking it PageActive is vulnerable to races where PageLRU and Active
    bits are checked at the wrong time. Page reclaim will trigger
    VM_BUG_ONs, but depending on when the race hits, it could also free a
    PageActive page to the page allocator and trigger a bad_page warning.
    Similarly, a potential race exists between a per-cpu drain on a pagevec
    list and an activation on a remote CPU.

    CPU A (draining local pagevec)          CPU B (mark_page_accessed)
    lru_add_drain_cpu
      __pagevec_lru_add
        lru = page_lru(page);
                                            mark_page_accessed
                                              if (PageLRU(page))
                                                activate_page
                                              else
                                                SetPageActive
        SetPageLRU(page);
        add_page_to_lru_list(page, lruvec, lru);

    In this case a PageActive page is added to the inactive list and later
    the inactive/active stats will get skewed. While the PageActive checks
    in vmscan could be removed and potentially dealt with, a skew in the
    statistics would be very difficult to detect. Hence this patch deals
    just with the common case where a page being marked accessed has just
    been added to the local pagevec.

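    In sketch form, mark_page_accessed() ends up doing something like the
    following (the helper name reflects my reading of this patch and is an
    assumption; details approximate):

    void mark_page_accessed(struct page *page)
    {
            if (!PageActive(page) && !PageUnevictable(page) &&
                PageReferenced(page)) {
                    if (PageLRU(page))
                            activate_page(page);
                    else
                            /* only searches this CPU's local pagevec */
                            __lru_cache_activate_page(page);
                    ClearPageReferenced(page);
            } else if (!PageReferenced(page)) {
                    SetPageReferenced(page);
            }
    }
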
    Signed-off-by: Mel Gorman
    Cc: Jan Kara
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Alexey Lyahkov
    Cc: Andrew Perepechko
    Cc: Robin Dong
    Cc: Theodore Tso
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Bernd Schubert
    Cc: David Howells
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • mark_page_accessed() cannot activate an inactive page that is located on
    an inactive LRU pagevec. Hints from filesystems may be ignored as a
    result. In preparation for fixing that problem, this patch removes the
    per-LRU pagevecs and leaves just one pagevec. The decision on the final
    LRU the page is added to is deferred until the pagevec is drained.

    This means that fewer pagevecs are available and potentially there is
    greater contention on the LRU lock. However, this only applies in the
    case where there is an almost perfect mix of file, anon, active and
    inactive pages being added to the LRU. In practice I expect that we are
    adding a stream of pages of a particular type and that the changes in
    contention will barely be measurable.

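    At drain time the target LRU is then derived from the page's own flags,
    roughly as follows (a sketch):

    /* One pagevec; the LRU is chosen only when the pagevec is drained. */
    static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
                                     void *arg)
    {
            enum lru_list lru = page_lru(page);  /* from PG_active etc. */

            SetPageLRU(page);
            add_page_to_lru_list(page, lruvec, lru);
    }
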
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Jan Kara
    Acked-by: Johannes Weiner
    Cc: Alexey Lyahkov
    Cc: Andrew Perepechko
    Cc: Robin Dong
    Cc: Theodore Tso
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Bernd Schubert
    Cc: David Howells
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Andrew Perepechko reported a problem whereby pages are being prematurely
    evicted as the mark_page_accessed() hint is ignored for pages that are
    currently on a pagevec --
    http://www.spinics.net/lists/linux-ext4/msg37340.html .

    Alexey Lyahkov and Robin Dong have also reported problems recently that
    could be due to hot pages reaching the end of the inactive list too
    quickly and being reclaimed.

    Rather than addressing this on a per-filesystem basis, this series aims
    to fix the mark_page_accessed() interface by deferring the decision of
    which LRU a page is added to until pagevec drain time and allowing
    mark_page_accessed() to call SetPageActive on a pagevec page.

    Patch 1 adds two tracepoints for LRU page activation and insertion. Using
    these tracepoints it is possible to build a model of pages in the
    LRU that can be processed offline.

    Patch 2 defers making the decision on what LRU to add a page to until when
    the pagevec is drained.

    Patch 3 searches the local pagevec for pages to mark PageActive on
    mark_page_accessed. The changelog explains why only the local
    pagevec is examined.

    Patches 4 and 5 tidy up the API.

    postmark, a dd-based test and fs-mark in both single and threaded mode
    were run, but none of them showed any performance degradation or gain as
    a result of the patch.

    Using patch 1, I built a *very* basic model of the LRU to examine
    offline what the average age of different page types on the LRU were in
    milliseconds. Of course, capturing the trace distorts the test as it's
    written to local disk but it does not matter for the purposes of this
    test. The average age of pages in milliseconds were

                                     vanilla   deferdrain
    Average age mapped anon:            1454         1250
    Average age mapped file:          127841       155552
    Average age unmapped anon:            85          235
    Average age unmapped file:         73633        38884
    Average age unmapped buffers:      74054       116155

    The LRU activity was mostly files which you'd expect for a dd-based
    workload. Note that the average age of buffer pages is increased by the
    series and it is expected this is due to the fact that the buffer pages
    are now getting added to the active list when drained from the pagevecs.
    Note that the average age of the unmapped file data is decreased as they
    are still added to the inactive list and are reclaimed before the
    buffers.

    There is no guarantee this is a universal win for all workloads and it
    would be nice if the filesystem people gave some thought as to whether
    this decision is generally a win or a loss.

    This patch:

    Using these tracepoints it is possible to model LRU activity and the
    average residency of pages of different types. This can be used to
    debug problems related to premature reclaim of pages of particular
    types.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Jan Kara
    Cc: Johannes Weiner
    Cc: Alexey Lyahkov
    Cc: Andrew Perepechko
    Cc: Robin Dong
    Cc: Theodore Tso
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Bernd Schubert
    Cc: David Howells
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

08 May, 2013

1 commit

  • Faster kernel compiles by way of fewer unnecessary includes.

    [akpm@linux-foundation.org: fix fallout]
    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kent Overstreet
     

30 Apr, 2013

1 commit

  • In page reclaim, a huge page is split and split_huge_page() adds the tail
    pages to the LRU list. Since we are reclaiming a huge page, it's better
    we reclaim all subpages of the huge page instead of just the head page.
    This patch adds the split tail pages to the shrink page list so the tail
    pages can be reclaimed soon.

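    In vmscan terms the idea is roughly the following (a sketch; the exact
    call site and labels are assumptions):

    /* Hand reclaim's local list to the THP split so the freshly created
     * tail pages are considered for reclaim in the same pass. */
    if (PageTransHuge(page) &&
        split_huge_page_to_list(page, page_list))
            goto keep_locked;       /* split failed, keep the huge page */
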
    Before this patch, run a swap workload:
    thp_fault_alloc 3492
    thp_fault_fallback 608
    thp_collapse_alloc 6
    thp_collapse_alloc_failed 0
    thp_split 916

    With this patch:
    thp_fault_alloc 4085
    thp_fault_fallback 16
    thp_collapse_alloc 90
    thp_collapse_alloc_failed 0
    thp_split 1272

    fallback allocation is reduced a lot.

    [akpm@linux-foundation.org: fix CONFIG_SWAP=n build]
    Signed-off-by: Shaohua Li
    Acked-by: Rik van Riel
    Acked-by: Minchan Kim
    Acked-by: Johannes Weiner
    Reviewed-by: Wanpeng Li
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

24 Feb, 2013

1 commit

  • When I use several fast SSDs for swap, swapper_space.tree_lock is
    heavily contended. This patch makes each swap partition have its own
    address_space to reduce the lock contention. There is an array of
    address_spaces for swap; the swap entry type is the index into the array.

    In my test with 3 SSDs, this increases the swapout throughput by 20%.

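    The mapping from swap entry to address_space is along these lines (a
    sketch of the scheme described above):

    /* One address_space per swap type, indexed by the entry's type. */
    struct address_space swapper_spaces[MAX_SWAPFILES];

    #define swap_address_space(entry) \
            (&swapper_spaces[swp_type(entry)])
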
    [akpm@linux-foundation.org: revert unneeded change to __add_to_swap_cache]
    Signed-off-by: Shaohua Li
    Cc: Hugh Dickins
    Acked-by: Rik van Riel
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

09 Oct, 2012

2 commits

  • page_evictable(page, vma) is an irritant: almost all its callers pass
    NULL for vma. Remove the vma arg and use mlocked_vma_newpage(vma, page)
    explicitly in the couple of places it's needed. But in those places we
    don't even need page_evictable() itself! They're dealing with a freshly
    allocated anonymous page, which has no "mapping" and cannot be mlocked yet.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Michel Lespinasse
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When writing a new file with a 2048-byte buffer, such as write(fd, buffer,
    2048), generic_perform_write() is invoked twice for every page:

    write_begin
    mark_page_accessed(page)
    write_end

    write_begin
    mark_page_accessed(page)
    write_end

    Pages 1-13 will be added to lru-pvecs in write_begin() and will *NOT* be
    added to active_list even though they have been accessed twice, because
    they are not PageLRU(page). But when the 14th page comes, all pages in
    lru-pvecs will be moved to inactive_list (by __lru_cache_add()) in the
    first write_begin(); now the 14th page *is* PageLRU(page). And after the
    second write_end() only the 14th page will be in active_list.

    In a Hadoop environment, we do come to this situation: after writing a
    file, we find out that only the 14th, 28th, 42nd... pages are in
    active_list and the others are in inactive_list. Now kswapd works,
    shrinks the inactive_list; the file only has the 14th, 28th... pages in
    memory, the readahead request size will be broken to only 52k (13*4k),
    and the system's performance falls dramatically.

    This problem can also replay by below steps (the machine has 8G memory):

    1. dd if=/dev/zero of=/test/file.out bs=1024 count=1048576
    2. cat another 7.5G file to /dev/null
    3. vmtouch -m 1G -v /test/file.out, it will show:

    /test/file.out
    [oooooooooooooooooooOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO] 187847/262144

    the 'o' means some pages are in memory but some are not.

    The solution for this problem is simple: the 14th page should be added to
    lru_add_pvecs before mark_page_accessed() just as other pages.

    [akpm@linux-foundation.org: tweak comment]
    [akpm@linux-foundation.org: grab better comment from the v3 patch]
    Signed-off-by: Robin Dong
    Reviewed-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Reviewed-by: Johannes Weiner
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Dong
     

01 Aug, 2012

2 commits

  • The patch "mm: add support for a filesystem to activate swap files and use
    direct_IO for writing swap pages" added support for using direct_IO to
    write swap pages but it is insufficient for highmem pages.

    To support highmem pages, this patch kmap()s the page before calling the
    direct_IO() handler. As direct_IO deals with virtual addresses, an
    additional helper is necessary for get_kernel_pages() to look up the
    struct page for a kmap virtual address.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Christoph Hellwig
    Cc: David S. Miller
    Cc: Eric B Munson
    Cc: Eric Paris
    Cc: James Morris
    Cc: Mel Gorman
    Cc: Mike Christie
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: Trond Myklebust
    Cc: Xiaotian Feng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patch adds two new APIs get_kernel_pages() and get_kernel_page() that
    may be used to pin a vector of kernel addresses for IO. The initial user
    is expected to be NFS for allowing pages to be written to swap using
    aops->direct_IO(). Strictly speaking, swap-over-NFS only needs to pin one
    page for IO but it makes sense to express the API in terms of a vector and
    add a helper for pinning single pages.

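    Usage is expected to look something like this (a sketch based on the
    description above):

    /* Pin a single kernel (e.g. kmap'd) address for IO. */
    struct page *page;
    unsigned long kaddr = (unsigned long)buffer;    /* kernel virtual address */

    if (get_kernel_page(kaddr, 1 /* write */, &page) == 1) {
            /* page can now be handed to ->direct_IO(); release afterwards */
            put_page(page);
    }
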
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Christoph Hellwig
    Cc: David S. Miller
    Cc: Eric B Munson
    Cc: Eric Paris
    Cc: James Morris
    Cc: Mel Gorman
    Cc: Mike Christie
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: Trond Myklebust
    Cc: Xiaotian Feng
    Cc: Mark Salter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

30 May, 2012

3 commits

  • Take lruvec further: pass it instead of zone to add_page_to_lru_list() and
    del_page_from_lru_list(), and have pagevec_lru_move_fn() pass lruvec down
    to its target functions.

    This cleanup eliminates a swathe of cruft in memcontrol.c, including
    mem_cgroup_lru_add_list(), mem_cgroup_lru_del_list() and
    mem_cgroup_lru_move_lists() - which never actually touched the lists.

    In their place, mem_cgroup_page_lruvec() to decide the lruvec, previously
    a side-effect of add, and mem_cgroup_update_lru_size() to maintain the
    lru_size stats.

    Whilst these are simplifications in their own right, the goal is to bring
    the evaluation of lruvec next to the spin_locking of the lrus, in
    preparation for a future patch.

    Signed-off-by: Hugh Dickins
    Cc: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Acked-by: Konstantin Khlebnikov
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • With mem_cgroup_disabled() now explicit, it becomes clear that the
    zone_reclaim_stat structure actually belongs in lruvec, per-zone when
    memcg is disabled but per-memcg per-zone when it's enabled.

    We can delete mem_cgroup_get_reclaim_stat(), and change
    update_page_reclaim_stat() to update just the one set of stats, the one
    which get_scan_count() will actually use.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Konstantin Khlebnikov
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Reviewed-by: Minchan Kim
    Reviewed-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
    Transparent huge pages can change page->flags (PG_compound_lock) without
    taking the slab lock. Since THP cannot break up slab pages, we can safely
    access the compound page without taking the compound lock.

    Specifically this patch fixes a race between compound_unlock() and slab
    functions which perform page-flags updates. This can occur when
    get_page()/put_page() is called on a page from slab.

    [akpm@linux-foundation.org: tweak comment text, fix comment layout, fix label indenting]
    Reported-by: Amey Bhide
    Signed-off-by: Pravin B Shelar
    Reviewed-by: Christoph Lameter
    Acked-by: Andrea Arcangeli
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pravin B Shelar
     

22 Mar, 2012

1 commit

  • This cpu hotplug hook was accidentally removed in commit 00a62ce91e55
    ("mm: fix Committed_AS underflow on large NR_CPUS environment")

    The visible effect of this accident: some pages are borrowed in per-cpu
    page-vectors. Truncate can deal with it, but these pages cannot be
    reused while this cpu is offline. So this is like a temporary memory
    leak.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Dave Hansen
    Cc: KOSAKI Motohiro
    Cc: Eric B Munson
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

06 Mar, 2012

1 commit

  • When moving tasks from old memcg (with move_charge_at_immigrate on new
    memcg), followed by removal of old memcg, hit General Protection Fault in
    mem_cgroup_lru_del_list() (called from release_pages called from
    free_pages_and_swap_cache from tlb_flush_mmu from tlb_finish_mmu from
    exit_mmap from mmput from exit_mm from do_exit).

    Somewhat reproducible, takes a few hours: the old struct mem_cgroup has
    been freed and poisoned by SLAB_DEBUG, but mem_cgroup_lru_del_list() is
    still trying to update its stats, and take page off lru before freeing.

    A task, or a charge, or a page on lru: each secures a memcg against
    removal. In this case, the last task has been moved out of the old memcg,
    and it is exiting: anonymous pages are uncharged one by one from the
    memcg, as they are zapped from its pagetables, so the charge gets down to
    0; but the pages themselves are queued in an mmu_gather for freeing.

    Most of those pages will be on lru (and force_empty is careful to
    lru_add_drain_all, to add pages from pagevec to lru first), but not
    necessarily all: perhaps some have been isolated for page reclaim, perhaps
    some isolated for other reasons. So, force_empty may find no task, no
    charge and no page on lru, and let the removal proceed.

    There would still be no problem if these pages were immediately freed; but
    typically (and the put_page_testzero protocol demands it) they have to be
    added back to lru before they are found freeable, then removed from lru
    and freed. We don't see the issue when adding, because the
    mem_cgroup_iter() loops keep their own reference to the memcg being
    scanned; but when it comes to mem_cgroup_lru_del_list().

    I believe this was not an issue in v3.2: there, PageCgroupAcctLRU and
    PageCgroupUsed flags were used (like a trick with mirrors) to deflect view
    of pc->mem_cgroup to the stable root_mem_cgroup when neither set.
    38c5d72f3ebe ("memcg: simplify LRU handling by new rule") mercifully
    removed those convolutions, but left this General Protection Fault.

    But it's surprisingly easy to restore the old behaviour: just check
    PageCgroupUsed in mem_cgroup_lru_add_list() (which decides on which lruvec
    to add), and reset pc to root_mem_cgroup if page is uncharged. A risky
    change? just going back to how it worked before; testing, and an audit of
    uses of pc->mem_cgroup, show no problem.

    And there's a nice bonus: with mem_cgroup_lru_add_list() itself making
    sure that an uncharged page goes to root lru, mem_cgroup_reset_owner() no
    longer has any purpose, and we can safely revert 4e5f01c2b9b9 ("memcg:
    clear pc->mem_cgroup if necessary").

    Calling update_page_reclaim_stat() after add_page_to_lru_list() in swap.c
    is not strictly necessary: the lru_lock there, with RCU before memcg
    structures are freed, makes mem_cgroup_get_reclaim_stat_from_page safe
    without that; but it seems cleaner to rely on one dependency less.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins