23 Mar, 2012

3 commits

  • Pull cleancache changes from Konrad Rzeszutek Wilk:
    "This has some patches for the cleancache API that should have been
    submitted a _long_ time ago. They are basically cleanups:

    - rename of flush to invalidate

    - moving reporting of statistics into debugfs

    - use __read_mostly as necessary.

    Oh, and also the MAINTAINERS file change. The files (except the
    MAINTAINERS file) have been in #linux-next for months now. The late
    addition of MAINTAINERS file is a brain-fart on my side - didn't
    realize I needed that just until I was typing this up - and I based
    that patch on v3.3 - so the tree is on top of v3.3."

    * tag 'stable/for-linus-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm:
    MAINTAINERS: Adding cleancache API to the list.
    mm: cleancache: Use __read_mostly as appropiate.
    mm: cleancache: report statistics via debugfs instead of sysfs.
    mm: zcache/tmem/cleancache: s/flush/invalidate/
    mm: cleancache: s/flush/invalidate/

    Linus Torvalds
     
  • Pull MCE changes from Ingo Molnar.

    * 'x86-mce-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/mce: Fix return value of mce_chrdev_read() when erst is disabled
    x86/mce: Convert static array of pointers to per-cpu variables
    x86/mce: Replace hard coded hex constants with symbolic defines
    x86/mce: Recognise machine check bank signature for data path error
    x86/mce: Handle "action required" errors
    x86/mce: Add mechanism to safely save information in MCE handler
    x86/mce: Create helper function to save addr/misc when needed
    HWPOISON: Add code to handle "action required" errors.
    HWPOISON: Clean up memory_failure() vs. __memory_failure()

    Linus Torvalds
     
  • Merge first batch of patches from Andrew Morton:
    "A few misc things and all the MM queue"

    * emailed from Andrew Morton: (92 commits)
    memcg: avoid THP split in task migration
    thp: add HPAGE_PMD_* definitions for !CONFIG_TRANSPARENT_HUGEPAGE
    memcg: clean up existing move charge code
    mm/memcontrol.c: remove unnecessary 'break' in mem_cgroup_read()
    mm/memcontrol.c: remove redundant BUG_ON() in mem_cgroup_usage_unregister_event()
    mm/memcontrol.c: s/stealed/stolen/
    memcg: fix performance of mem_cgroup_begin_update_page_stat()
    memcg: remove PCG_FILE_MAPPED
    memcg: use new logic for page stat accounting
    memcg: remove PCG_MOVE_LOCK flag from page_cgroup
    memcg: simplify move_account() check
    memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat)
    memcg: kill dead prev_priority stubs
    memcg: remove PCG_CACHE page_cgroup flag
    memcg: let css_get_next() rely upon rcu_read_lock()
    cgroup: revert ss_id_lock to spinlock
    idr: make idr_get_next() good for rcu_read_lock()
    memcg: remove unnecessary thp check in page stat accounting
    memcg: remove redundant returns
    memcg: enum lru_list lru
    ...

    Linus Torvalds
     

22 Mar, 2012

37 commits

  • Currently we can't do task migration among memory cgroups without THP
    split, which means processes heavily using THP experience large overhead
    in task migration. This patch introduces the code for moving charge of
    THP and makes THP more valuable.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Andrea Arcangeli
    Acked-by: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • These macros will be used in a later patch, where all usages are expected
    to be optimized away without #ifdef CONFIG_TRANSPARENT_HUGEPAGE. But to
    detect unexpected usages, we convert the existing BUG() to BUILD_BUG().

    [akpm@linux-foundation.org: fix build in mm/pgtable-generic.c]
    Signed-off-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: David Rientjes
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
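
    A minimal sketch of the build-time guard this describes, following the
    huge_mm.h convention (treat the exact definitions as illustrative rather
    than a quote of the patch):

        #ifdef CONFIG_TRANSPARENT_HUGEPAGE
        #define HPAGE_PMD_SHIFT PMD_SHIFT
        #define HPAGE_PMD_SIZE  ((1UL) << HPAGE_PMD_SHIFT)
        #define HPAGE_PMD_MASK  (~(HPAGE_PMD_SIZE - 1))
        #else
        /* any use that survives to compilation is a build error, not a runtime BUG() */
        #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
        #define HPAGE_PMD_SIZE  ({ BUILD_BUG(); 0; })
        #define HPAGE_PMD_MASK  ({ BUILD_BUG(); 0; })
        #endif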
     
  • - Replace lengthy function name is_target_pte_for_mc() with a shorter
    one in order to avoid ugly line breaks.

    - explicitly use MC_TARGET_* instead of simply using integers.

    Signed-off-by: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Hillf Danton
    Cc: David Rientjes
    Acked-by: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
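
    A sketch of the resulting shape (the shorter name shown here,
    get_mctgt_type(), is how it ended up in memcontrol.c; take the exact
    signature as illustrative):

        enum mc_target_type {
                MC_TARGET_NONE = 0,     /* nothing to move at this pte */
                MC_TARGET_PAGE,         /* move the charge of a page */
                MC_TARGET_SWAP,         /* move the charge of a swap entry */
        };

        /* formerly is_target_pte_for_mc() */
        static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
                                                  unsigned long addr,
                                                  pte_t ptent,
                                                  union mc_target *target);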
     
  • Signed-off-by: Jie Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Liu
     
  • In the following code:

        if (type == _MEM)
                thresholds = &memcg->thresholds;
        else if (type == _MEMSWAP)
                thresholds = &memcg->memsw_thresholds;
        else
                BUG();

        BUG_ON(!thresholds);

    the BUG_ON() is redundant: if neither branch assigned thresholds, the
    BUG() has already fired.

    Signed-off-by: Anton Vorontsov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Vorontsov
     
  • A grammatical fix.

    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • mem_cgroup_begin_update_page_stat() should be very fast because it's
    called very frequently. Right now it needs to look up the page_cgroup and
    its memcg, and that is slow.

    This patch adds a global variable that records "any memcg is moving or
    not". With it, the caller doesn't need to visit the page_cgroup and memcg.

    Here is a test result. A test program page-faults a MAP_SHARED file
    mapping so that each page's page_mapcount(page) > 1, frees the range with
    madvise(), and faults it in again. The program causes 26214400 page
    faults on a 1G file and shows the cost of
    mem_cgroup_begin_update_page_stat().

    Before this patch (profile around mem_cgroup_begin_update_page_stat()):

    [kamezawa@bluextal test]$ time ./mmap 1G

    real 0m21.765s
    user 0m5.999s
    sys 0m15.434s

    27.46% mmap mmap [.] reader
    21.15% mmap [kernel.kallsyms] [k] page_fault
    9.17% mmap [kernel.kallsyms] [k] filemap_fault
    2.96% mmap [kernel.kallsyms] [k] __do_fault
    2.83% mmap [kernel.kallsyms] [k] __mem_cgroup_begin_update_page_stat

    After this patch

    [root@bluextal test]# time ./mmap 1G

    real 0m21.373s
    user 0m6.113s
    sys 0m15.016s

    In the usual path, the calls to __mem_cgroup_begin_update_page_stat() go
    away.

    Note: we may be able to remove this optimization in the future if we can
    get a pointer to the memcg directly from struct page.

    [akpm@linux-foundation.org: don't return a void]
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Greg Thelen
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: KOSAKI Motohiro
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
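
    A sketch of the fast path this adds, assuming a global counter along the
    lines of memcg_moving (the identifier and wrapper shown here are
    illustrative, not a quote of the patch):

        static atomic_t memcg_moving __read_mostly;   /* "any memcg is moving" */

        static inline void mem_cgroup_begin_update_page_stat(struct page *page,
                                                             bool *locked,
                                                             unsigned long *flags)
        {
                if (mem_cgroup_disabled())
                        return;
                rcu_read_lock();
                *locked = false;
                /* only pay for the page_cgroup/memcg lookup while a move is in flight */
                if (unlikely(atomic_read(&memcg_moving)))
                        __mem_cgroup_begin_update_page_stat(page, locked, flags);
        }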
     
  • With the new lock scheme for updating memcg's page stat, we don't need a
    flag PCG_FILE_MAPPED which was duplicated information of page_mapped().

    [hughd@google.com: cosmetic fix]
    [hughd@google.com: add comment to MEM_CGROUP_CHARGE_TYPE_MAPPED case in __mem_cgroup_uncharge_common()]
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Greg Thelen
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: KOSAKI Motohiro
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Now, per-memcg page stats are recorded in per-page_cgroup flags by
    duplicating the page's status into the flag. The reason is that memcg has
    a feature to move a page from one group to another, so there is a race
    between "move" and "page stat accounting".

    Under the current logic, assume CPU-A and CPU-B. CPU-A does "move" and
    CPU-B does "page stat accounting".

    When CPU-A goes first:

        CPU-A                             CPU-B
                                          update "struct page" info.
        move_lock_mem_cgroup(memcg)
        see pc->flags
        copy page stat to new group
        overwrite pc->mem_cgroup.
        move_unlock_mem_cgroup(memcg)
                                          move_lock_mem_cgroup(mem)
                                          set pc->flags
                                          update page stat accounting
                                          move_unlock_mem_cgroup(mem)

    stat accounting is guarded by move_lock_mem_cgroup() and "move" logic
    (CPU-A) doesn't see changes in "struct page" information.

    But it's costly to have the same information both in 'struct page' and
    'struct page_cgroup'. And, there is a potential problem.

    For example, assume we have PG_dirty accounting in memcg.
    PG_... is a flag for struct page.
    PCG_... is a flag for struct page_cgroup.
    (This is just an example. The same problem can be found in any
    kind of page stat accounting.)

        CPU-A                             CPU-B
        TestSet PG_dirty
        (delay)                           TestClear PG_dirty
                                          if (TestClear(PCG_dirty))
                                                  memcg->nr_dirty--
        if (TestSet(PCG_dirty))
                memcg->nr_dirty++

    Here, memcg->nr_dirty = +1, which is wrong. This race was reported by
    Greg Thelen. Right now only FILE_MAPPED is supported and, fortunately, it
    is serialized by the page table lock, so this is not a real bug _now_.

    Since this potential problem is caused by having duplicated information
    in struct page and struct page_cgroup, we might be able to fix it by
    using only the original 'struct page' information. But then we'd have a
    problem in "move account".

    Assume we use only PG_dirty:

        CPU-A                             CPU-B
        TestSet PG_dirty
        (delay)                           move_lock_mem_cgroup()
                                          if (PageDirty(page))
                                                  new_memcg->nr_dirty++
                                          pc->mem_cgroup = new_memcg;
                                          move_unlock_mem_cgroup()
        move_lock_mem_cgroup()
        memcg = pc->mem_cgroup
        new_memcg->nr_dirty++

    The accounting information may be double-counted. This was the original
    reason to have the PCG_xxx flags, but it seems PCG_xxx has a problem of
    its own.

    I think we need a bigger lock, as in:

        move_lock_mem_cgroup(page)
        TestSetPageDirty(page)
        update page stats (without any checks)
        move_unlock_mem_cgroup(page)

    This fixes both problems and we don't have to duplicate the page flag
    into page_cgroup. Please note: move_lock_mem_cgroup() is held only when
    there is a possibility of an "account move" in the system, so in most
    paths the status update runs without atomic locks.

    This patch introduces mem_cgroup_begin_update_page_stat() and
    mem_cgroup_end_update_page_stat(); both should be called when modifying
    'struct page' information that memcg accounts, as:

        mem_cgroup_begin_update_page_stat()
        modify page information
        mem_cgroup_update_page_stat()
            => never checks any 'struct page' info, just updates counters
        mem_cgroup_end_update_page_stat()

    This patch adds overhead because begin_update_page_stat()/
    end_update_page_stat() must be called regardless of whether the accounted
    state actually changes. A following patch adds an easy optimization and
    reduces the cost.

    [akpm@linux-foundation.org: s/lock/locked/]
    [hughd@google.com: fix deadlock by avoiding stat lock when anon]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Greg Thelen
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: KOSAKI Motohiro
    Cc: Ying Han
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
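
    A hedged sketch of the caller pattern, roughly in the shape of the
    FILE_MAPPED accounting in rmap (argument names and the stat index are
    illustrative):

        void page_add_file_rmap(struct page *page)
        {
                bool locked;
                unsigned long flags;

                mem_cgroup_begin_update_page_stat(page, &locked, &flags);
                if (atomic_inc_and_test(&page->_mapcount)) {
                        __inc_zone_page_state(page, NR_FILE_MAPPED);
                        /* counter-only update; never re-reads 'struct page' state */
                        mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
                }
                mem_cgroup_end_update_page_stat(page, &locked, &flags);
        }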
     
  • PCG_MOVE_LOCK is used as a bit spinlock to avoid the race between
    overwriting pc->mem_cgroup and per-memcg page statistics accounting. The
    lock avoids the race, but the race is very rare because moving tasks
    between cgroups is not a usual job, so spending one bit per page on it is
    too costly.

    This patch changes the lock to a per-memcg spinlock and removes
    PCG_MOVE_LOCK.

    If a finer-grained lock is required, we'll be able to add some hashing,
    but I'd like to start from this.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Greg Thelen
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KOSAKI Motohiro
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
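
    A sketch of the replacement lock, assuming a spinlock embedded in struct
    mem_cgroup (field and helper names follow the wording above but are
    illustrative):

        struct mem_cgroup {
                /* ... */
                spinlock_t      move_lock;  /* replaces the per-page PCG_MOVE_LOCK bit */
        };

        static void move_lock_mem_cgroup(struct mem_cgroup *memcg,
                                         unsigned long *flags)
        {
                spin_lock_irqsave(&memcg->move_lock, *flags);
        }

        static void move_unlock_mem_cgroup(struct mem_cgroup *memcg,
                                           unsigned long *flags)
        {
                spin_unlock_irqrestore(&memcg->move_lock, *flags);
        }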
     
  • In memcg, to avoid taking a lock with interrupts off on every page_cgroup
    access, a "flag + rcu_read_lock()" scheme is used. It works as follows:

        CPU-A                             CPU-B
                                          rcu_read_lock()
        set flag
                                          if (flag is set)
                                                  take heavy lock
                                          do job.
        synchronize_rcu()                 rcu_read_unlock()
        take heavy lock.

    In a recent discussion it was argued that using a per-cpu value for this
    flag just complicates the code, because 'set flag' is very rare.

    This patch changes the 'flag' implementation from per-cpu to atomic_t,
    which is much simpler.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Greg Thelen
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KOSAKI Motohiro
    Cc: Ying Han
    Cc: "Paul E. McKenney"
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
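
    A sketch of the simplified flag, assuming it becomes an atomic counter in
    struct mem_cgroup (names are illustrative; the point is that the reader
    side shrinks to a single atomic_read() under rcu_read_lock()):

        struct mem_cgroup {
                /* ... */
                atomic_t        moving_account; /* was a per-cpu counter */
        };

        static void mem_cgroup_start_move(struct mem_cgroup *memcg)
        {
                atomic_inc(&memcg->moving_account);     /* "set flag" */
                synchronize_rcu();                      /* wait out readers that missed it */
        }

        static void mem_cgroup_end_move(struct mem_cgroup *memcg)
        {
                atomic_dec(&memcg->moving_account);
        }

        /* reader side, called under rcu_read_lock(): "is the flag set?" */
        static bool mem_cgroup_stealed(struct mem_cgroup *memcg)
        {
                return atomic_read(&memcg->moving_account) > 0;
        }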
     
  • As described in the log, I guess the EXPORT was in preparation for dirty
    accounting. But _now_ we don't need to export this. Remove it for now.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Greg Thelen
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KOSAKI Motohiro
    Cc: Ying Han
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • We record 'the page is cache' with the PCG_CACHE bit in page_cgroup.
    Here, "CACHE" means anonymous user pages (and SwapCache). This doesn't
    include shmem.

    Considering callers, at charge/uncharge, the caller should know what the
    page is and we don't need to record it by using one bit per page.

    This patch removes the PCG_CACHE bit and makes the callers of
    mem_cgroup_charge_statistics() specify what the page is.

    About page migration: the mapping of the used page is not touched during
    migration (see page_remove_rmap), so we can rely on it and push the
    correct charge type down to __mem_cgroup_uncharge_common() from
    end_migration() for the unused page. The force flag was misleading; it
    was abused for skipping the needless page_mapped() / PageCgroupMigration()
    check, as we know the unused page is no longer mapped and cleared the
    migration flag just a few lines up. But doing the checks is no biggie and
    it's not worth adding another flag just to skip them.

    [akpm@linux-foundation.org: checkpatch fixes]
    [hughd@google.com: fix PageAnon uncharging]
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Ying Han
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
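
    A sketch of the resulting interface: the caller says whether the page is
    anonymous instead of the statistics code reading a per-page flag (names
    and counters are illustrative):

        static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
                                                 bool anon, int nr_pages)
        {
                if (anon)
                        __this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_RSS],
                                       nr_pages);
                else
                        __this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_CACHE],
                                       nr_pages);
                /* ... pgpgin/pgpgout events etc. ... */
        }

        /* charge/uncharge sites already know the answer: */
        /* mem_cgroup_charge_statistics(memcg, PageAnon(page), nr_pages); */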
     
  • Commit e94c8a9cbce1 ("memcg: make mem_cgroup_split_huge_fixup() more
    efficient") removed move_lock_page_cgroup(). So we do not have to check
    PageTransHuge in mem_cgroup_update_page_stat() and fallback into the
    locked accounting because both move_account() and thp split are done
    with compound_lock so they cannot race.

    The race between update vs. move is protected by mem_cgroup_stealed.

    PageTransHuge pages shouldn't appear in this code path currently because
    we are tracking only file pages at the moment but later we are planning
    to track also other pages (e.g. mlocked ones).

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Andrea Arcangeli
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Acked-by: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Remove redundant returns from ends of functions, and one blank line.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Mostly we use "enum lru_list lru": change those few "l"s to "lru"s.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • I never understood why we need a MEM_CGROUP_ZSTAT(mz, idx) macro to
    obscure the LRU counts. For easier searching? So call it lru_size
    rather than bare count (lru_length sounds better, but would be wrong,
    since each huge page raises lru_size hugely).

    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Replace mem and mem_cont stragglers in memcontrol.c by memcg.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Even when swapon() is not passed the SWAP_FLAG_DISCARD option,
    sys_swapon() will still perform a discard operation. This can cause
    problems if discard is slow or buggy.

    Reverse the order of the check so that a discard operation is performed
    only if the sys_swapon() caller is attempting to enable discard.

    Signed-off-by: Shaohua Li
    Reported-by: Holger Kiehl
    Tested-by: Holger Kiehl
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
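
    In rough terms, the reordered setup in sys_swapon() looks like this (flag
    and helper names are from swapfile.c of that era; treat it as a sketch):

        /* before: discard_swap() ran unconditionally, and its result only
         * decided whether SWP_DISCARDABLE was set */
        if ((swap_flags & SWAP_FLAG_DISCARD) && discard_swap(p) == 0)
                p->flags |= SWP_DISCARDABLE;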
     
  • Reset the reclaim mode in shrink_active_list() to RECLAIM_MODE_SINGLE |
    RECLAIM_MODE_ASYNC. (The sync/async flag is used only in
    shrink_page_list() and does not affect shrink_active_list().)

    Currently shrink_active_list() sometimes works in lumpy-reclaim mode if
    RECLAIM_MODE_LUMPYRECLAIM is left over from an earlier
    shrink_inactive_list(). Meanwhile, in age_active_anon()
    sc->reclaim_mode is totally zero. So the current behavior is too
    complex and confusing, and this looks like a bug.

    In general, shrink_active_list() populates the inactive list for the
    next shrink_inactive_list(). A lumpy shrink_inactive_list() isolates
    pages around the chosen one from both the active and inactive lists.
    So there is no reason for lumpy isolation in shrink_active_list().

    See also: https://lkml.org/lkml/2012/3/15/583

    Signed-off-by: Konstantin Khlebnikov
    Proposed-by: Hugh Dickins
    Acked-by: Johannes Weiner
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • The comment above __insert_vm_struct seems to suggest that this function
    is also going to link the VMA with the anon_vma, but this is not true.
    This function only links the VMA to the mm->mm_rb tree and the mm->mmap
    linked list.

    [akpm@linux-foundation.org: improve comment layout and text]
    Signed-off-by: Kautuk Consul
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kautuk Consul
     
  • find_zone_movable_pfns_for_nodes() does not use its argument.

    Signed-off-by: Kautuk Consul
    Cc: David Rientjes
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kautuk Consul
     
  • add_from_early_node_map() is unused.

    Signed-off-by: Kautuk Consul
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kautuk Consul
     
  • When calling shmget() with SHM_HUGETLB, shmget aligns the request size to
    PAGE_SIZE, but this is not sufficient.

    Modify hugetlb_file_setup() to align requests to the huge page size, and
    to accept an address argument so that all alignment checks can be
    performed in hugetlb_file_setup(), rather than in its callers. Change
    newseg() and mmap_pgoff() to match the new prototype and eliminate a now
    redundant alignment check.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Steven Truelove
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Truelove
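
    In rough terms, the alignment now done inside hugetlb_file_setup()
    amounts to the following, where hstate stands for the huge page size in
    use (a paraphrase of the description above, not the patch itself):

        /* a fixed mapping address must itself be huge-page aligned */
        if (addr & ~huge_page_mask(hstate))
                return ERR_PTR(-EINVAL);

        /* round the byte size up to a whole number of huge pages */
        size = ALIGN(size, huge_page_size(hstate));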
     
  • There's no difference between sync_mm_rss() and __sync_task_rss_stat(),
    so fold the latter into the former.

    Signed-off-by: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • sync_mm_rss() can only be used for current to avoid race conditions in
    iterating and clearing its per-task counters. Remove the task argument
    for it and its helper function, __sync_task_rss_stat(), to avoid thinking
    it can be used safely for anything other than current.

    Signed-off-by: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • hugetlbfs_{get,put}_quota() are badly named. They don't interact with the
    general quota handling code, and they don't much resemble its behaviour.
    Rather than being about maintaining limits on on-disk block usage by
    particular users, they are instead about maintaining limits on in-memory
    page usage (including anonymous MAP_PRIVATE copied-on-write pages)
    associated with a particular hugetlbfs filesystem instance.

    Worse, they work by having callbacks to the hugetlbfs filesystem code from
    the low-level page handling code, in particular from free_huge_page().
    This is a layering violation of itself, but more importantly, if the
    kernel does a get_user_pages() on hugepages (which can happen from KVM
    amongst others), then the free_huge_page() can be delayed until after the
    associated inode has already been freed. If an unmount occurs at the
    wrong time, even the hugetlbfs superblock where the "quota" limits are
    stored may have been freed.

    Andrew Barry proposed a patch to fix this by having hugepages store
    pointers directly to the superblock, instead of storing a pointer to
    their address_space and reaching the superblock from there, bumping the
    superblock's reference count as appropriate to avoid it being freed.
    Andrew Morton rejected that version, however, on the grounds that it made
    the existing layering violation worse.

    This is a reworked version of Andrew's patch, which removes the extra, and
    some of the existing, layering violation. It works by introducing the
    concept of a hugepage "subpool" at the lower hugepage mm layer - that is a
    finite logical pool of hugepages to allocate from. hugetlbfs now creates
    a subpool for each filesystem instance with a page limit set, and a
    pointer to the subpool gets added to each allocated hugepage, instead of
    the address_space pointer used now. The subpool has its own lifetime and
    is only freed once all pages in it _and_ all other references to it (i.e.
    superblocks) are gone.

    subpools are optional - a NULL subpool pointer is taken by the code to
    mean that no subpool limits are in effect.

    Previous discussion of this bug found in: "Fix refcounting in hugetlbfs
    quota handling.". See: https://lkml.org/lkml/2011/8/11/28 or
    http://marc.info/?l=linux-mm&m=126928970510627&w=1

    v2: Fixed a bug spotted by Hillf Danton, and removed the extra parameter to
    alloc_huge_page() - since it already takes the vma, it is not necessary.

    Signed-off-by: Andrew Barry
    Signed-off-by: David Gibson
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Hillf Danton
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
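
    A sketch of the subpool object described above, with the lifetime rule
    made explicit (field names closely follow what landed in hugetlb.h, but
    treat them as illustrative):

        struct hugepage_subpool {
                spinlock_t lock;
                long count;             /* refs: superblock plus pages still pointing here */
                long max_hpages;        /* page limit for this hugetlbfs instance, or -1 */
                long used_hpages;       /* pages currently charged against max_hpages */
        };

        struct hugepage_subpool *hugepage_new_subpool(long nr_blocks);
        void hugepage_put_subpool(struct hugepage_subpool *spool);
        /* the subpool is only freed once count drops to zero, i.e. after the
         * superblock is gone AND the last huge page pointing at it is freed */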
     
  • There are multiple places which perform the same check. Add a new
    find_mergeable_vma() to handle this.

    Signed-off-by: Bob Liu
    Acked-by: Hugh Dickins
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
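
    The consolidated check ends up roughly like this (a sketch of the ksm.c
    helper; the exit-check detail may differ):

        static struct vm_area_struct *find_mergeable_vma(struct mm_struct *mm,
                                                         unsigned long addr)
        {
                struct vm_area_struct *vma;

                if (ksm_test_exit(mm))          /* mm is going away */
                        return NULL;
                vma = find_vma(mm, addr);
                if (!vma || vma->vm_start > addr)
                        return NULL;
                if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
                        return NULL;
                return vma;
        }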
     
  • Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
    changing cpuset's mems") wins a super prize for the largest number of
    memory barriers entered into fast paths for one commit.

    [get|put]_mems_allowed is incredibly heavy, with pairs of full memory
    barriers inserted into a number of hot paths. This was detected while
    investigating a large page allocator slowdown introduced some time
    after 2.6.32. The largest portion of this overhead was shown by
    oprofile to be at an mfence introduced by this commit into the page
    allocator hot path.

    For extra style points, the commit introduced the use of yield() in an
    implementation of what looks like a spinning mutex.

    This patch replaces the full memory barriers on both read and write
    sides with a sequence counter with just read barriers on the fast path
    side. This is much cheaper on some architectures, including x86. The
    main bulk of the patch is the retry logic if the nodemask changes in a
    manner that can cause a false failure.

    While updating the nodemask, a check is made to see if a false failure
    is a risk. If it is, the sequence number gets bumped and parallel
    allocators will briefly stall while the nodemask update takes place.

    In a page fault test microbenchmark, oprofile samples from
    __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
    actual results were

    3.3.0-rc3 3.3.0-rc3
    rc3-vanilla nobarrier-v2r1
    Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%)
    Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%)
    Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%)
    Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%)
    Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%)
    Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%)
    Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%)
    Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%)
    Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%)
    Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%)
    Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%)
    Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%)
    Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%)
    Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%)
    Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%)
    MMTests Statistics: duration
    Sys Time Running Test (seconds) 135.68 132.17
    User+Sys Time Running Test (seconds) 164.2 160.13
    Total Elapsed Time (seconds) 123.46 120.87

    The overall improvement is small but the System CPU time is much
    improved and roughly in correlation to what oprofile reported (these
    performance figures are without profiling so skew is expected). The
    actual number of page faults is noticeably improved.

    For benchmarks like kernel builds, the overall benefit is marginal but
    the system CPU time is slightly reduced.

    To test the actual bug the commit fixed I opened two terminals. The
    first ran within a cpuset and continually ran a small program that
    faulted 100M of anonymous data. In a second window, the nodemask of the
    cpuset was continually randomised in a loop.

    Without the commit, the program would fail every so often (usually
    within 10 seconds) and obviously with the commit everything worked fine.
    With this patch applied, it also worked fine so the fix should be
    functionally equivalent.

    Signed-off-by: Mel Gorman
    Cc: Miao Xie
    Cc: David Rientjes
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
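
    A sketch of the read-side retry this introduces, assuming the
    get/put_mems_allowed() helpers wrap a per-task seqcount and return a
    cookie (names follow later mainline; try_to_allocate() is a hypothetical
    stand-in for the allocation step):

        /* cpuset side (read path) */
        static inline unsigned int get_mems_allowed(void)
        {
                return read_seqcount_begin(&current->mems_allowed_seq);
        }

        static inline bool put_mems_allowed(unsigned int seq)
        {
                /* true == the nodemask did not change while we used it */
                return !read_seqcount_retry(&current->mems_allowed_seq, seq);
        }

        /* allocation path: retry only on a possible false failure */
        static struct page *alloc_with_mems_allowed(unsigned int order)
        {
                struct page *page;
                unsigned int cookie;

                do {
                        cookie = get_mems_allowed();
                        page = try_to_allocate(order); /* hypothetical allocation step */
                } while (!put_mems_allowed(cookie) && !page);
                return page;
        }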
     
  • The oom killer typically displays the allocation order at the time of oom
    as a part of its diagnostic messages (for global, cpuset, and mempolicy
    ooms).

    The memory controller may also pass the charge order to the oom killer so
    it can emit the same information. This is useful in determining how large
    the memory allocation is that triggered the oom killer.

    Signed-off-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • s/noticable/noticeable/

    Signed-off-by: Copot Alexandru
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Copot Alexandru
     
  • When i_mmap_lock changed to a mutex the locking order in memory failure
    was changed to take the sleeping lock first. But the big fat mm lock
    ordering comment (BFMLO) wasn't updated. Do this here.

    Pointed out by Andrew.

    Signed-off-by: Andi Kleen
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • When starting a memory hog task, a desktop box w/o swap is found to go
    unresponsive for a long time. It's solely caused by lots of congestion
    waits in throttle_vm_writeout():

    gnome-system-mo-4201 553.073384: congestion_wait: throttle_vm_writeout+0x70/0x7f shrink_mem_cgroup_zone+0x48f/0x4a1
    gnome-system-mo-4201 553.073386: writeback_congestion_wait: usec_timeout=100000 usec_delayed=100000
    gtali-4237 553.080377: congestion_wait: throttle_vm_writeout+0x70/0x7f shrink_mem_cgroup_zone+0x48f/0x4a1
    gtali-4237 553.080378: writeback_congestion_wait: usec_timeout=100000 usec_delayed=100000
    Xorg-3483 553.103375: congestion_wait: throttle_vm_writeout+0x70/0x7f shrink_mem_cgroup_zone+0x48f/0x4a1
    Xorg-3483 553.103377: writeback_congestion_wait: usec_timeout=100000 usec_delayed=100000

    The root cause is that the dirty threshold is knocked down a lot by the
    memory hog task. Fix this by using global_dirty_limit, which decreases
    gradually on such events and guarantees that we stay above the (also
    decreasing) nr_dirty while it tracks down to the new dirty threshold.

    Signed-off-by: Fengguang Wu
    Cc: Johannes Weiner
    Cc: Jan Kara
    Cc: Greg Thelen
    Cc: Ying Han
    Cc: KAMEZAWA Hiroyuki
    Reviewed-by: Rik van Riel
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
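
    A hedged sketch of the kind of change described, using the
    hard_dirty_limit() helper that tracks global_dirty_limit (whether the
    actual patch used exactly this helper is an assumption here):

        /* in throttle_vm_writeout() */
        global_dirty_limits(&background_thresh, &dirty_thresh);
        /* don't trust an instantaneously collapsed threshold; follow the
         * slowly-decreasing global_dirty_limit instead */
        dirty_thresh = hard_dirty_limit(dirty_thresh);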
     
  • There is not much point in skipping zones during allocation based on the
    dirty usage which they'll never contribute to. And we'd like to avoid
    page reclaim waits when writing to ramfs/sysfs etc.

    Signed-off-by: Fengguang Wu
    Acked-by: Johannes Weiner
    Cc: Jan Kara
    Cc: Greg Thelen
    Cc: Ying Han
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Cc: Mel Gorman
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • While testing AMS (Active Memory Sharing) / CMO (Cooperative Memory
    Overcommit) on powerpc, we tripped the following:

    kernel BUG at mm/bootmem.c:483!
    cpu 0x0: Vector: 700 (Program Check) at [c000000000c03940]
    pc: c000000000a62bd8: .alloc_bootmem_core+0x90/0x39c
    lr: c000000000a64bcc: .sparse_early_usemaps_alloc_node+0x84/0x29c
    sp: c000000000c03bc0
    msr: 8000000000021032
    current = 0xc000000000b0cce0
    paca = 0xc000000001d80000
    pid = 0, comm = swapper
    kernel BUG at mm/bootmem.c:483!
    enter ? for help
    [c000000000c03c80] c000000000a64bcc
    .sparse_early_usemaps_alloc_node+0x84/0x29c
    [c000000000c03d50] c000000000a64f10 .sparse_init+0x12c/0x28c
    [c000000000c03e20] c000000000a474f4 .setup_arch+0x20c/0x294
    [c000000000c03ee0] c000000000a4079c .start_kernel+0xb4/0x460
    [c000000000c03f90] c000000000009670 .start_here_common+0x1c/0x2c

    This is

    BUG_ON(limit && goal + size > limit);

    and after some debugging, it seems that

    goal = 0x7ffff000000
    limit = 0x80000000000

    and sparse_early_usemaps_alloc_node ->
    sparse_early_usemaps_alloc_pgdat_section calls

    return alloc_bootmem_section(usemap_size() * count, section_nr);

    This is on a system with 8TB available via the AMS pool, and as a quirk
    of AMS in firmware, all of that memory shows up in node 0. So, we end
    up with an allocation that will fail the goal/limit constraints.

    In theory, we could "fall-back" to alloc_bootmem_node() in
    sparse_early_usemaps_alloc_node(), but since we actually have HOTREMOVE
    defined, we'll BUG_ON() instead. A simple solution appears to be to
    unconditionally remove the limit condition in alloc_bootmem_section,
    meaning allocations are allowed to cross section boundaries (necessary
    for systems of this size).

    Johannes Weiner pointed out that if alloc_bootmem_section() no longer
    guarantees section-locality, we need check_usemap_section_nr() to print
    possible cross-dependencies between node descriptors and the usemaps
    allocated through it. That makes the two loops in
    sparse_early_usemaps_alloc_node() identical, so re-factor the code a
    bit.

    [akpm@linux-foundation.org: code simplification]
    Signed-off-by: Nishanth Aravamudan
    Cc: Dave Hansen
    Cc: Anton Blanchard
    Cc: Paul Mackerras
    Cc: Ben Herrenschmidt
    Cc: Robert Jennings
    Acked-by: Johannes Weiner
    Acked-by: Mel Gorman
    Cc: [3.3.1]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     
  • This cpu hotplug hook was accidentally removed in commit 00a62ce91e55
    ("mm: fix Committed_AS underflow on large NR_CPUS environment")

    The visible effect of this accident: some pages are borrowed in per-cpu
    page-vectors. Truncate can deal with it, but these pages cannot be
    reused while this cpu is offline. So this is like a temporary memory
    leak.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Dave Hansen
    Cc: KOSAKI Motohiro
    Cc: Eric B Munson
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Migration functions perform the rcu_read_unlock too early. As a result
    the task pointed to may change from under us. This can result in an oops,
    as reported by Dave Hansen in https://lkml.org/lkml/2012/2/23/302.

    The following patch extends the period of the rcu_read_lock until after the
    permissions checks are done. We also take a refcount so that the task
    reference is stable when calling security check functions and performing
    cpuset node validation (which takes a mutex).

    The refcount is dropped before actual page migration occurs so there is no
    change to the refcounts held during page migration.

    Also move the determination of the mm of the task struct to immediately
    before the do_migrate*() calls so that it is clear that we switch from
    handling the task during permission checks to the mm for the actual
    migration. Since the determination is only done once and we then no
    longer use the task_struct we can be sure that we operate on a specific
    address space that will not change from under us.

    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Christoph Lameter
    Cc: "Eric W. Biederman"
    Reported-by: Dave Hansen
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
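
    A sketch of the resulting pattern in the migration syscalls (structure
    only; error handling and the permission checks themselves are elided):

        struct task_struct *task;
        struct mm_struct *mm;

        rcu_read_lock();
        task = pid ? find_task_by_vpid(pid) : current;
        if (!task) {
                rcu_read_unlock();
                return -ESRCH;
        }
        get_task_struct(task);          /* pin the task beyond the RCU section */

        /* ... credential checks, still under rcu_read_lock() ... */

        rcu_read_unlock();

        /* ... security / cpuset-node checks against the pinned task
           (these may take a mutex, hence the reference) ... */

        mm = get_task_mm(task);         /* switch from the task to its address space */
        put_task_struct(task);          /* drop the temporary reference */
        if (!mm)
                return -EINVAL;
        /* page migration proper runs against 'mm' from here on */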