01 Oct, 2013

9 commits

  • hwpoison_inject takes no reference on the poisoned page when
    hwpoison_filter is not enabled, so the hwpoison code later reports that
    -1 users still reference the page, although the count should be 0 (apart
    from the single reference the poison handler holds after a successful
    unmap). This patch fixes it by holding one reference on the poisoned
    page in hwpoison_inject with and without hwpoison_filter enabled.

    Before patch:

    [ 71.902112] Injecting memory failure at pfn 224706
    [ 71.902137] MCE 0x224706: dirty LRU page recovery: Failed
    [ 71.902138] MCE 0x224706: dirty LRU page still referenced by -1 users

    After patch:

    [ 94.710860] Injecting memory failure at pfn 215b68
    [ 94.710885] MCE 0x215b68: dirty LRU page recovery: Recovered
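
    The dmesg lines above come from the hwpoison software injector. As a
    rough, hedged illustration (assuming the debugfs interface provided by
    CONFIG_HWPOISON_INJECT at /sys/kernel/debug/hwpoison/corrupt-pfn; the
    path and availability depend on the kernel configuration), an injection
    for a given pfn can be triggered from userspace roughly like this:

    #include <stdio.h>

    int main(int argc, char **argv)
    {
        FILE *f;

        if (argc < 2) {
            fprintf(stderr, "usage: %s <pfn>\n", argv[0]);
            return 1;
        }

        /* Assumed debugfs path; needs root and a mounted debugfs. */
        f = fopen("/sys/kernel/debug/hwpoison/corrupt-pfn", "w");
        if (!f) {
            perror("corrupt-pfn");
            return 1;
        }
        fprintf(f, "%s\n", argv[1]);   /* pfn of the page to poison */
        fclose(f);
        return 0;
    }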

    Reviewed-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Signed-off-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • If the page is poisoned by software injection with the MF_COUNT_INCREASED
    flag set, recovery is falsely reported as a second attempt even though it
    is the first.

    This patch fixes it by reporting free buddy page recovery as the first
    attempt when MF_COUNT_INCREASED is set.

    Before patch:

    [ 346.332041] Injecting memory failure at pfn 200010
    [ 346.332189] MCE 0x200010: free buddy, 2nd try page recovery: Delayed

    After patch:

    [ 297.742600] Injecting memory failure at pfn 200010
    [ 297.742941] MCE 0x200010: free buddy page recovery: Delayed

    Reviewed-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Signed-off-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • PageTransHuge() can't guarantee that the page is a transparent huge page,
    since it returns true for both transparent huge pages and hugetlbfs pages.

    This patch fixes it by additionally checking that the page is not a
    hugetlbfs page.

    Before patch:

    [ 121.571128] Injecting memory failure at pfn 23a200
    [ 121.571141] MCE 0x23a200: huge page recovery: Delayed
    [ 140.355100] MCE: Memory failure is now running on 0x23a200

    After patch:

    [ 94.290793] Injecting memory failure at pfn 23a000
    [ 94.290800] MCE 0x23a000: huge page recovery: Delayed
    [ 105.722303] MCE: Software-unpoisoned page 0x23a000
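
    As a minimal user-space model of the corrected check (illustrative only;
    the structure and helpers below are stand-ins, not the kernel's struct
    page or page-flag API), the point is that "huge" alone is ambiguous and a
    separate "is it hugetlbfs?" test is needed before taking the THP-only
    path:

    #include <stdbool.h>
    #include <stdio.h>

    struct page_model {
        bool compound_huge;   /* head of a huge mapping (THP or hugetlbfs) */
        bool hugetlbfs;       /* backed by hugetlbfs rather than anon THP  */
    };

    /* Models PageTransHuge(): true for THP *and* hugetlbfs pages. */
    static bool page_trans_huge(const struct page_model *p)
    {
        return p->compound_huge;
    }

    /* The corrected test: huge and not hugetlbfs means transparent huge. */
    static bool is_anon_thp(const struct page_model *p)
    {
        return page_trans_huge(p) && !p->hugetlbfs;
    }

    int main(void)
    {
        struct page_model thp = { true, false }, hugetlb = { true, true };

        printf("thp: %d, hugetlbfs: %d\n",
               is_anon_thp(&thp), is_anon_thp(&hugetlb));
        return 0;
    }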

    Signed-off-by: Wanpeng Li
    Reviewed-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • madvise_hwpoison does not check whether the page is a small page or a
    huge page, and unconditionally traverses the range at small-page
    granularity, which results in a printk flood of "MCE xxx: already
    hardware poisoned" messages when the page is a huge page.

    This patch fixes it by using compound_order(compound_head(page)) to
    advance the iterator across huge pages.

    Testcase:

    #define _GNU_SOURCE
    #include <sys/mman.h>

    #define PAGES_TO_TEST 3
    #define PAGE_SIZE (4096 * 512)

    int main(void)
    {
        char *mem;

        mem = mmap(NULL, PAGES_TO_TEST * PAGE_SIZE,
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, 0, 0);
        if (mem == MAP_FAILED)
            return -1;

        if (madvise(mem, PAGES_TO_TEST * PAGE_SIZE, MADV_HWPOISON) == -1)
            return -1;

        munmap(mem, PAGES_TO_TEST * PAGE_SIZE);

        return 0;
    }

    Signed-off-by: Wanpeng Li
    Reviewed-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • The function __munlock_pagevec_fill() introduced in commit 7a8010cd3627
    ("mm: munlock: manual pte walk in fast path instead of
    follow_page_mask()") uses pmd_addr_end() to restrict its operation to
    the current page table.

    This is insufficient on architectures/configurations where pmd is folded
    and pmd_addr_end() just returns the end of the full range to be walked.
    In this case, it allows pte++ to walk off the end of a page table
    resulting in unpredictable behaviour.

    This patch fixes the function by using pgd_addr_end() and pud_addr_end()
    before pmd_addr_end(), which will yield correct page table boundary on
    all configurations. This is similar to what existing page walkers do
    when walking each level of the page table.

    Additionally, the patch clarifies the comment for the get_locked_pte()
    call in the function.
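
    As a minimal user-space model of the boundary problem (illustrative only;
    the constants and helpers below are simplified stand-ins for the kernel's
    pmd_addr_end() and are not the real implementation), compare the generic
    boundary clamp with the folded-level behaviour:

    #include <stdio.h>

    #define PMD_SHIFT 21UL                  /* 2 MiB pmd entries, x86_64-like */
    #define PMD_SIZE  (1UL << PMD_SHIFT)
    #define PMD_MASK  (~(PMD_SIZE - 1))

    /* Generic (non-folded) behaviour: clamp to the next pmd boundary. */
    static unsigned long pmd_addr_end_generic(unsigned long addr, unsigned long end)
    {
        unsigned long boundary = (addr + PMD_SIZE) & PMD_MASK;

        return boundary - 1 < end - 1 ? boundary : end;
    }

    /* Folded-pmd behaviour: the whole remaining range looks like one "pmd". */
    static unsigned long pmd_addr_end_folded(unsigned long addr, unsigned long end)
    {
        (void)addr;
        return end;
    }

    int main(void)
    {
        unsigned long start = 0x1ff000, end = 0x600000;

        printf("generic pmd_addr_end: %#lx\n", pmd_addr_end_generic(start, end));
        printf("folded  pmd_addr_end: %#lx\n", pmd_addr_end_folded(start, end));
        /* With the folded variant a pte walk starting at 'start' would run
         * all the way to 'end', stepping past the page table that actually
         * maps 'start'; clamping with pgd_addr_end() and pud_addr_end()
         * first restores a correct boundary on such configurations. */
        return 0;
    }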

    Signed-off-by: Vlastimil Babka
    Reported-by: Fengguang Wu
    Reviewed-by: Bob Liu
    Cc: Jörn Engel
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Isolated balloon pages can wrongly end up on LRU lists when
    migrate_pages() finishes its round without draining all of the isolated
    page list.

    The same issue can happen when reclaim_clean_pages_from_list() tries to
    reclaim pages from an isolated page list, before migration, in the CMA
    path. Such a balloon page leak opens a race window against LRU list
    shrinkers that leads us to the following kernel panic:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
    IP: [] shrink_page_list+0x24e/0x897
    PGD 3cda2067 PUD 3d713067 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 0 PID: 340 Comm: kswapd0 Not tainted 3.12.0-rc1-22626-g4367597 #87
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    RIP: shrink_page_list+0x24e/0x897
    RSP: 0000:ffff88003da499b8 EFLAGS: 00010286
    RAX: 0000000000000000 RBX: ffff88003e82bd60 RCX: 00000000000657d5
    RDX: 0000000000000000 RSI: 000000000000031f RDI: ffff88003e82bd40
    RBP: ffff88003da49ab0 R08: 0000000000000001 R09: 0000000081121a45
    R10: ffffffff81121a45 R11: ffff88003c4a9a28 R12: ffff88003e82bd40
    R13: ffff88003da0e800 R14: 0000000000000001 R15: ffff88003da49d58
    FS: 0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000067d9000 CR3: 000000003ace5000 CR4: 00000000000407b0
    Call Trace:
    shrink_inactive_list+0x240/0x3de
    shrink_lruvec+0x3e0/0x566
    __shrink_zone+0x94/0x178
    shrink_zone+0x3a/0x82
    balance_pgdat+0x32a/0x4c2
    kswapd+0x2f0/0x372
    kthread+0xa2/0xaa
    ret_from_fork+0x7c/0xb0
    Code: 80 7d 8f 01 48 83 95 68 ff ff ff 00 4c 89 e7 e8 5a 7b 00 00 48 85 c0 49 89 c5 75 08 80 7d 8f 00 74 3e eb 31 48 8b 80 18 01 00 00 8b 74 0d 48 8b 78 30 be 02 00 00 00 ff d2 eb
    RIP [] shrink_page_list+0x24e/0x897
    RSP
    CR2: 0000000000000028
    ---[ end trace 703d2451af6ffbfd ]---
    Kernel panic - not syncing: Fatal exception

    This patch fixes the issue by ensuring that the proper tests are made at
    putback_movable_pages() and reclaim_clean_pages_from_list() to avoid
    isolated balloon pages being wrongly reinserted into LRU lists.

    [akpm@linux-foundation.org: clarify awkward comment text]
    Signed-off-by: Rafael Aquini
    Reported-by: Luiz Capitulino
    Tested-by: Luiz Capitulino
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     
  • The "force" parameter in __blk_queue_bounce was being ignored, which
    means that stable page snapshots are not always happening (on ext3).
    This of course leads to DIF disks reporting checksum errors, so fix this
    regression.

    The regression was introduced in commit 6bc454d15004 ("bounce: Refactor
    __blk_queue_bounce to not use bi_io_vec").

    Reported-by: Mel Gorman
    Signed-off-by: Darrick J. Wong
    Cc: Kent Overstreet
    Cc: [3.10+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     
  • We've been getting warnings about an excessive amount of time spent
    allocating pages for migration during memory compaction without
    scheduling. isolate_freepages_block() already periodically checks for
    contended locks or the need to schedule, but isolate_freepages() never
    does.

    When a zone is very large and no suitable targets can be found, this
    iteration can become quite expensive without ever calling cond_resched().

    Check periodically for the need to reschedule while the compaction free
    scanner iterates.
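
    As a minimal user-space model of the pattern (illustrative only;
    sched_yield() stands in for the kernel's cond_resched(), and the batch
    size is an arbitrary assumption), a long scan loop offers to reschedule
    periodically instead of never:

    #include <sched.h>
    #include <stdio.h>

    #define RESCHED_BATCH 1024              /* arbitrary rescheduling interval */

    /* Count "suitable" entries in a long range, yielding the CPU periodically. */
    static size_t scan_range(const unsigned char *map, size_t nr)
    {
        size_t i, found = 0;

        for (i = 0; i < nr; i++) {
            if (map[i])                     /* stand-in for "suitable free page" */
                found++;
            if ((i % RESCHED_BATCH) == RESCHED_BATCH - 1)
                sched_yield();              /* stand-in for cond_resched() */
        }
        return found;
    }

    int main(void)
    {
        static unsigned char map[1 << 20];  /* pretend this is a very large zone */

        map[12345] = 1;
        printf("found %zu suitable entries\n", scan_range(map, sizeof(map)));
        return 0;
    }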

    Signed-off-by: David Rientjes
    Reviewed-by: Rik van Riel
    Reviewed-by: Wanpeng Li
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • This reverts commit cea27eb2a202 ("mm/memory-hotplug: fix lowmem count
    overflow when offline pages").

    The bug fixed by commit cea27eb2a202 was later fixed in a different way
    by commit 3dcc0571cd64 ("mm: correctly update zone->managed_pages"),
    which enhances memory_hotplug.c to adjust totalhigh_pages when
    hot-removing memory; for details please refer to:

    http://marc.info/?l=linux-mm&m=136957578620221&w=2

    As a result, commit cea27eb2a202 now causes totalhigh_pages to be
    decreased twice, hence the revert.

    Signed-off-by: Joonyoung Shim
    Reviewed-by: Wanpeng Li
    Cc: Jiang Liu
    Cc: KOSAKI Motohiro
    Cc: Bartlomiej Zolnierkiewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonyoung Shim
     

25 Sep, 2013

9 commits

  • There is a loop in do_mlockall() that lacks a preemption point, which
    means that the following can happen on non-preemptible builds of the
    kernel. Dave Jones reports:

    "My fuzz tester keeps hitting this. Every instance shows the non-irq
    stack came in from mlockall. I'm only seeing this on one box, but
    that has more ram (8gb) than my other machines, which might explain
    it.

    INFO: rcu_preempt self-detected stall on CPU { 3} (t=6500 jiffies g=470344 c=470343 q=0)
    sending NMI to all CPUs:
    NMI backtrace for cpu 3
    CPU: 3 PID: 29664 Comm: trinity-child2 Not tainted 3.11.0-rc1+ #32
    Call Trace:
    lru_add_drain_all+0x15/0x20
    SyS_mlockall+0xa5/0x1a0
    tracesys+0xdd/0xe2"

    This commit addresses this problem by inserting the required preemption
    point.

    Reported-by: Dave Jones
    Signed-off-by: Paul E. McKenney
    Cc: KOSAKI Motohiro
    Cc: Michel Lespinasse
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul E. McKenney
     
  • Revert commit 3b38722efd9f ("memcg, vmscan: integrate soft reclaim
    tighter with zone shrinking code")

    I merged this prematurely - Michal and Johannes still disagree about the
    overall design direction and the future remains unclear.

    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Revert commit e883110aad71 ("memcg: get rid of soft-limit tree
    infrastructure")

    I merged this prematurely - Michal and Johannes still disagree about the
    overall design direction and the future remains unclear.

    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Revert commit a5b7c87f9207 ("vmscan, memcg: do softlimit reclaim also
    for targeted reclaim")

    I merged this prematurely - Michal and Johannes still disagree about the
    overall design direction and the future remains unclear.

    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Revert commit de57780dc659 ("memcg: enhance memcg iterator to support
    predicates")

    I merged this prematurely - Michal and Johannes still disagree about the
    overall design direction and the future remains unclear.

    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Revert commit 7d910c054be4 ("memcg: track children in soft limit excess
    to improve soft limit")

    I merged this prematurely - Michal and Johannes still disagree about the
    overall design direction and the future remains unclear.

    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Revert commit e839b6a1c8d0 ("memcg, vmscan: do not attempt soft limit
    reclaim if it would not scan anything")

    I merged this prematurely - Michal and Johannes still disagree about the
    overall design direction and the future remains unclear.

    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Revert commit 1be171d60bdd ("memcg: track all children over limit in the
    root")

    I merged this prematurely - Michal and Johannes still disagree about the
    overall design direction and the future remains unclear.

    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Revert commit e975de998b96 ("memcg, vmscan: do not fall into reclaim-all
    pass too quickly")

    I merged this prematurely - Michal and Johannes still disagree about the
    overall design direction and the future remains unclear.

    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

15 Sep, 2013

1 commit

  • Pull SLAB update from Pekka Enberg:
    "Nothing terribly exciting here apart from Christoph's kmalloc
    unification patches that brings sl[aou]b implementations closer to
    each other"

    * 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
    slab: Use correct GFP_DMA constant
    slub: remove verify_mem_not_deleted()
    mm/sl[aou]b: Move kmallocXXX functions to common code
    mm, slab_common: add 'unlikely' to size check of kmalloc_slab()
    mm/slub.c: beautify code for removing redundancy 'break' statement.
    slub: Remove unnecessary page NULL check
    slub: don't use cpu partial pages on UP
    mm/slub: beautify code for 80 column limitation and tab alignment
    mm/slub: remove 'per_cpu' which is useless variable

    Linus Torvalds
     

14 Sep, 2013

1 commit

  • Pull aio changes from Ben LaHaise:
    "First off, sorry for this pull request being late in the merge window.
    Al had raised a couple of concerns about 2 items in the series below.
    I addressed the first issue (the race introduced by Gu's use of
    mm_populate()), but he has not provided any further details on how he
    wants to rework the anon_inode.c changes (which were sent out months
    ago but have yet to be commented on).

    The bulk of the changes have been sitting in the -next tree for a few
    months, with all the issues raised being addressed"

    * git://git.kvack.org/~bcrl/aio-next: (22 commits)
    aio: rcu_read_lock protection for new rcu_dereference calls
    aio: fix race in ring buffer page lookup introduced by page migration support
    aio: fix rcu sparse warnings introduced by ioctx table lookup patch
    aio: remove unnecessary debugging from aio_free_ring()
    aio: table lookup: verify ctx pointer
    staging/lustre: kiocb->ki_left is removed
    aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3"
    aio: be defensive to ensure request batching is non-zero instead of BUG_ON()
    aio: convert the ioctx list to table lookup v3
    aio: double aio_max_nr in calculations
    aio: Kill ki_dtor
    aio: Kill ki_users
    aio: Kill unneeded kiocb members
    aio: Kill aio_rw_vect_retry()
    aio: Don't use ctx->tail unnecessarily
    aio: io_cancel() no longer returns the io_event
    aio: percpu ioctx refcount
    aio: percpu reqs_available
    aio: reqs_active -> reqs_available
    aio: fix build when migration is disabled
    ...

    Linus Torvalds
     

13 Sep, 2013

20 commits

  • Merge more patches from Andrew Morton:
    "The rest of MM. Plus one misc cleanup"

    * emailed patches from Andrew Morton : (35 commits)
    mm/Kconfig: add MMU dependency for MIGRATION.
    kernel: replace strict_strto*() with kstrto*()
    mm, thp: count thp_fault_fallback anytime thp fault fails
    thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page()
    thp: do_huge_pmd_anonymous_page() cleanup
    thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
    mm: cleanup add_to_page_cache_locked()
    thp: account anon transparent huge pages into NR_ANON_PAGES
    truncate: drop 'oldsize' truncate_pagecache() parameter
    mm: make lru_add_drain_all() selective
    memcg: document cgroup dirty/writeback memory statistics
    memcg: add per cgroup writeback pages accounting
    memcg: check for proper lock held in mem_cgroup_update_page_stat
    memcg: remove MEMCG_NR_FILE_MAPPED
    memcg: reduce function dereference
    memcg: avoid overflow caused by PAGE_ALIGN
    memcg: rename RESOURCE_MAX to RES_COUNTER_MAX
    memcg: correct RESOURCE_MAX to ULLONG_MAX
    mm: memcg: do not trap chargers with full callstack on OOM
    mm: memcg: rework and document OOM waiting and wakeup
    ...

    Linus Torvalds
     
  • MIGRATION must depend on MMU, or allmodconfig for the nommu sh
    architecture fails to build:

    CC mm/migrate.o
    mm/migrate.c: In function 'remove_migration_pte':
    mm/migrate.c:134:3: error: implicit declaration of function 'pmd_trans_huge' [-Werror=implicit-function-declaration]
    if (pmd_trans_huge(*pmd))
    ^
    mm/migrate.c:149:2: error: implicit declaration of function 'is_swap_pte' [-Werror=implicit-function-declaration]
    if (!is_swap_pte(pte))
    ^
    ...

    Also make CMA depend on MMU; otherwise, with NOMMU, selecting CMA would
    force-select MIGRATION.

    Signed-off-by: Chen Gang
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • Currently, thp_fault_fallback in vmstat only gets incremented if a
    hugepage allocation fails. If current's memcg hits its limit or the page
    fault handler returns an error, it is incorrectly accounted as a
    successful thp_fault_alloc.

    Count thp_fault_fallback anytime the page fault handler falls back to
    using regular pages and only count thp_fault_alloc when a hugepage has
    actually been faulted.

    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • do_huge_pmd_anonymous_page() contains a copy-pasted piece of
    handle_mm_fault() to handle its fallback path.

    Let's consolidate the code by introducing a VM_FAULT_FALLBACK return
    code.
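
    As a minimal user-space model of the resulting control flow (illustrative
    only; the names below are stand-ins, not the kernel's fault-handling
    API), the huge-page path signals fallback to its caller instead of
    duplicating the small-page path itself:

    #include <stdbool.h>
    #include <stdio.h>

    enum fault_result { FAULT_OK, FAULT_OOM, FAULT_FALLBACK };

    /* Stand-in for the huge-page fault path: it may decline and ask the
     * caller to retry with regular pages instead of open-coding that path. */
    static enum fault_result huge_fault(bool can_use_huge_page)
    {
        return can_use_huge_page ? FAULT_OK : FAULT_FALLBACK;
    }

    static enum fault_result small_fault(void)
    {
        return FAULT_OK;
    }

    /* Stand-in for the top-level fault handler: one shared fallback path. */
    static enum fault_result handle_fault(bool try_huge, bool can_use_huge_page)
    {
        if (try_huge) {
            enum fault_result r = huge_fault(can_use_huge_page);

            if (r != FAULT_FALLBACK)
                return r;
            /* fall through to the regular small-page path */
        }
        return small_fault();
    }

    int main(void)
    {
        printf("huge path ok:    %d\n", handle_fault(true, true));
        printf("fell back to 4k: %d\n", handle_fault(true, false));
        return 0;
    }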

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Hillf Danton
    Cc: Andrea Arcangeli
    Cc: Al Viro
    Cc: Hugh Dickins
    Cc: Wu Fengguang
    Cc: Jan Kara
    Cc: Mel Gorman
    Cc: Andi Kleen
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Minor cleanup: unindent most of the code in the function by inverting
    one condition. It's preparation for the next patch.

    No functional changes.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Hillf Danton
    Cc: Andrea Arcangeli
    Cc: Al Viro
    Cc: Hugh Dickins
    Cc: Wu Fengguang
    Cc: Jan Kara
    Cc: Mel Gorman
    Cc: Andi Kleen
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • It's confusing that mk_huge_pmd() has semantics different from mk_pte()
    or mk_pmd(). I spent some time debugging an issue caused by this
    inconsistency.

    Let's move maybe_pmd_mkwrite() out of mk_huge_pmd() and adjust its
    prototype to match mk_pte().

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Dave Hansen
    Cc: Andrea Arcangeli
    Cc: Al Viro
    Cc: Hugh Dickins
    Cc: Wu Fengguang
    Cc: Jan Kara
    Cc: Mel Gorman
    Cc: Andi Kleen
    Cc: Matthew Wilcox
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Make add_to_page_cache_locked() cleaner:

    - unindent most code of the function by inverting one condition;
    - streamline the no-error code path;
    - move insert error path outside normal code path;
    - call radix_tree_preload_end() earlier;

    No functional changes.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Dave Hansen
    Cc: Andrea Arcangeli
    Cc: Al Viro
    Cc: Hugh Dickins
    Cc: Wu Fengguang
    Cc: Jan Kara
    Cc: Mel Gorman
    Cc: Andi Kleen
    Cc: Matthew Wilcox
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We use NR_ANON_PAGES as the base for reporting AnonPages to userspace.
    There is little sense in excluding transparent huge pages from that
    counter only to add them back when printing the value for the user.

    Let's account transparent huge pages in NR_ANON_PAGES in the first place.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Dave Hansen
    Cc: Andrea Arcangeli
    Cc: Al Viro
    Cc: Hugh Dickins
    Cc: Wu Fengguang
    Cc: Jan Kara
    Cc: Mel Gorman
    Cc: Andi Kleen
    Cc: Matthew Wilcox
    Cc: Hillf Danton
    Cc: Ning Qu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • truncate_pagecache() doesn't care about old size since commit
    cedabed49b39 ("vfs: Fix vmtruncate() regression"). Let's drop it.

    Signed-off-by: Kirill A. Shutemov
    Cc: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Make lru_add_drain_all() selectively interrupt only the cpus that have
    per-cpu free pages that can be drained.

    This is important in nohz mode where calling mlockall(), for example,
    otherwise will interrupt every core unnecessarily.

    This is important on workloads where nohz cores are handling 10 Gb traffic
    in userspace. Those CPUs do not enter the kernel and place pages into LRU
    pagevecs and they really, really don't want to be interrupted, or they
    drop packets on the floor.
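
    As a minimal user-space model of the idea (illustrative only; real
    per-cpu pagevecs and work scheduling are nothing like this), only the
    cpus that actually have pending pages get disturbed:

    #include <stdio.h>

    #define NR_CPUS 4

    /* Stand-in for per-cpu pagevec fill levels. */
    static int pending[NR_CPUS] = { 3, 0, 0, 7 };

    static void drain_cpu(int cpu)
    {
        printf("draining cpu %d (%d pages)\n", cpu, pending[cpu]);
        pending[cpu] = 0;
    }

    int main(void)
    {
        /* Old behaviour: drain every cpu unconditionally.
         * New behaviour: skip the cpus with nothing queued. */
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
            if (pending[cpu])
                drain_cpu(cpu);
        return 0;
    }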

    Signed-off-by: Chris Metcalf
    Reviewed-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Metcalf
     
  • Add memcg routines to count writeback pages; dirty pages will also be
    accounted later.

    After Kame's commit 89c06bd52fb9 ("memcg: use new logic for page stat
    accounting"), we can use 'struct page' flag to test page state instead
    of per page_cgroup flag. But memcg has a feature to move a page from a
    cgroup to another one and may have race between "move" and "page stat
    accounting". So in order to avoid the race we have designed a new lock:

    mem_cgroup_begin_update_page_stat()
    modify page information -->(a)
    mem_cgroup_update_page_stat() -->(b)
    mem_cgroup_end_update_page_stat()

    It requires both (a) and (b) (writeback pages accounting) to be protected
    by mem_cgroup_{begin/end}_update_page_stat(). It's a full no-op for
    !CONFIG_MEMCG, almost a no-op if memcg is disabled (but compiled in), an
    rcu read lock in most cases (no task is moving), and spin_lock_irqsave on
    top in the slow path.

    There are two writeback interfaces to modify:
    test_{clear/set}_page_writeback(). The lock order is:
    --> memcg->move_lock
    --> mapping->tree_lock

    Signed-off-by: Sha Zhengju
    Acked-by: Michal Hocko
    Reviewed-by: Greg Thelen
    Cc: Fengguang Wu
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
     
  • We should call mem_cgroup_begin_update_page_stat() before
    mem_cgroup_update_page_stat() to take the proper locks, but the latter
    doesn't do any checking that we use proper locking, which would be hard.
    As suggested by Michal Hocko, we can at least test for
    rcu_read_lock_held(), because RCU is held if !mem_cgroup_disabled().

    Signed-off-by: Sha Zhengju
    Acked-by: Michal Hocko
    Reviewed-by: Greg Thelen
    Cc: Fengguang Wu
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
     
  • While accounting memcg page stats, it's not worth using
    MEMCG_NR_FILE_MAPPED as an extra layer of indirection given the
    complexity and presumed performance overhead. We can use
    MEM_CGROUP_STAT_FILE_MAPPED directly.

    Signed-off-by: Sha Zhengju
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Acked-by: Fengguang Wu
    Reviewed-by: Greg Thelen
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
     
  • RESOURCE_MAX is far too general a name; change it to RES_COUNTER_MAX.

    Signed-off-by: Sha Zhengju
    Signed-off-by: Qiang Huang
    Acked-by: Michal Hocko
    Cc: Daisuke Nishimura
    Cc: Jeff Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
     
  • The memcg OOM handling is incredibly fragile and can deadlock. When a
    task fails to charge memory, it invokes the OOM killer and loops right
    there in the charge code until it succeeds. Comparably, any other task
    that enters the charge path at this point will go to a waitqueue right
    then and there and sleep until the OOM situation is resolved. The problem
    is that these tasks may hold filesystem locks and the mmap_sem; locks that
    the selected OOM victim may need to exit.

    For example, in one reported case, the task invoking the OOM killer was
    about to charge a page cache page during a write(), which holds the
    i_mutex. The OOM killer selected a task that was just entering truncate()
    and trying to acquire the i_mutex:

    OOM invoking task:
    mem_cgroup_handle_oom+0x241/0x3b0
    mem_cgroup_cache_charge+0xbe/0xe0
    add_to_page_cache_locked+0x4c/0x140
    add_to_page_cache_lru+0x22/0x50
    grab_cache_page_write_begin+0x8b/0xe0
    ext3_write_begin+0x88/0x270
    generic_file_buffered_write+0x116/0x290
    __generic_file_aio_write+0x27c/0x480
    generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
    do_sync_write+0xea/0x130
    vfs_write+0xf3/0x1f0
    sys_write+0x51/0x90
    system_call_fastpath+0x18/0x1d

    OOM kill victim:
    do_truncate+0x58/0xa0 # takes i_mutex
    do_last+0x250/0xa30
    path_openat+0xd7/0x440
    do_filp_open+0x49/0xa0
    do_sys_open+0x106/0x240
    sys_open+0x20/0x30
    system_call_fastpath+0x18/0x1d

    The OOM handling task will retry the charge indefinitely while the OOM
    killed task is not releasing any resources.

    A similar scenario can happen when the kernel OOM killer for a memcg is
    disabled and a userspace task is in charge of resolving OOM situations.
    In this case, ALL tasks that enter the OOM path will be made to sleep on
    the OOM waitqueue and wait for userspace to free resources or increase
    the group's limit. But a userspace OOM handler is prone to deadlock
    itself on the locks held by the waiting tasks. For example one of the
    sleeping tasks may be stuck in a brk() call with the mmap_sem held for
    writing but the userspace handler, in order to pick an optimal victim,
    may need to read files from /proc/, which tries to acquire the same
    mmap_sem for reading and deadlocks.

    This patch changes the way tasks behave after detecting a memcg OOM and
    makes sure nobody loops or sleeps with locks held:

    1. When OOMing in a user fault, invoke the OOM killer and restart the
    fault instead of looping on the charge attempt. This way, the OOM
    victim can not get stuck on locks the looping task may hold.

    2. When OOMing in a user fault but somebody else is handling it
    (either the kernel OOM killer or a userspace handler), don't go to
    sleep in the charge context. Instead, remember the OOMing memcg in
    the task struct and then fully unwind the page fault stack with
    -ENOMEM. pagefault_out_of_memory() will then call back into the
    memcg code to check if the -ENOMEM came from the memcg, and then
    either put the task to sleep on the memcg's OOM waitqueue or just
    restart the fault. The OOM victim can no longer get stuck on any
    lock a sleeping task may hold.
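
    As a minimal user-space model of the two points above (illustrative only;
    the names are stand-ins, not the kernel's charge or fault-handling API),
    the charge path only records the OOM situation and unwinds, and the
    outermost, lock-free fault handler decides whether to wait or restart:

    #include <stdbool.h>
    #include <stdio.h>

    enum charge_result { CHARGE_OK, CHARGE_ENOMEM };

    /* Stand-in for the OOM context remembered in the task struct. */
    static bool oom_pending;

    /* Deep in the charge path: possibly holding locks, so never sleep here.
     * Just note the OOM condition and unwind with an error. */
    static enum charge_result try_charge(bool out_of_memory)
    {
        if (out_of_memory) {
            oom_pending = true;
            return CHARGE_ENOMEM;
        }
        return CHARGE_OK;
    }

    /* Outermost fault handler: all locks have been dropped by the time we
     * get here, so it is safe to wait for OOM handling and retry. */
    static int handle_fault(bool out_of_memory)
    {
        if (try_charge(out_of_memory) == CHARGE_ENOMEM && oom_pending) {
            oom_pending = false;
            /* wait on the memcg's OOM waitqueue or kick the OOM killer,
             * then restart the fault */
            return 1;
        }
        return 0;   /* fault completed */
    }

    int main(void)
    {
        printf("normal fault: restart=%d\n", handle_fault(false));
        printf("oom fault:    restart=%d\n", handle_fault(true));
        return 0;
    }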

    Debugged by Michal Hocko.

    Signed-off-by: Johannes Weiner
    Reported-by: azurIt
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The memcg OOM handler open-codes a sleeping lock for OOM serialization
    (trylock, wait, repeat) because the required locking is so specific to
    memcg hierarchies. However, it would be nice if this construct would be
    clearly recognizable and not be as obfuscated as it is right now. Clean
    up as follows:

    1. Remove the return value of mem_cgroup_oom_unlock()

    2. Rename mem_cgroup_oom_lock() to mem_cgroup_oom_trylock().

    3. Pull the prepare_to_wait() out of the memcg_oom_lock scope. This
    makes it more obvious that the task has to be on the waitqueue
    before attempting to OOM-trylock the hierarchy, to not miss any
    wakeups before going to sleep. It just didn't matter until now
    because it was all lumped together into the global memcg_oom_lock
    spinlock section.

    4. Pull the mem_cgroup_oom_notify() out of the memcg_oom_lock scope.
    It is protected by the hierarchical OOM-lock.

    5. The memcg_oom_lock spinlock is only required to propagate the OOM
    lock in any given hierarchy atomically. Restrict its scope to
    mem_cgroup_oom_(trylock|unlock).

    6. Do not wake up the waitqueue unconditionally at the end of the
    function. Only the lockholder has to wake up the next in line
    after releasing the lock.

    Note that the lockholder kicks off the OOM-killer, which in turn
    leads to wakeups from the uncharges of the exiting task. But a
    contender is not guaranteed to see them if it enters the OOM path
    after the OOM kills but before the lockholder releases the lock.
    Thus there has to be an explicit wakeup after releasing the lock.

    7. Put the OOM task on the waitqueue before marking the hierarchy as
    under OOM as that is the point where we start to receive wakeups.
    No point in listening before being on the waitqueue.

    8. Likewise, unmark the hierarchy before finishing the sleep, for
    symmetry.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: azurIt
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • System calls and kernel faults (uaccess, gup) can handle an out of memory
    situation gracefully and just return -ENOMEM.

    Enable the memcg OOM killer only for user faults, where it's really the
    only option available.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: azurIt
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Clean up some mess made by the "Soft limit rework" series, and a few other
    things.

    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • shrink_zone starts with soft reclaim pass first and then falls back to
    regular reclaim if nothing has been scanned. This behavior is natural
    but there is a catch. Memcg iterators, when used with the reclaim
    cookie, are designed to help to prevent from over reclaim by
    interleaving reclaimers (per node-zone-priority) so the tree walk might
    miss many (even all) nodes in the hierarchy e.g. when there are direct
    reclaimers racing with each other or with kswapd in the global case or
    multiple allocators reaching the limit for the target reclaim case. To
    make it even more complicated, targeted reclaim doesn't do the whole
    tree walk because it stops reclaiming once it reclaims sufficient pages.
    As a result groups over the limit might be missed, thus nothing is
    scanned, and reclaim would fall back to the reclaim all mode.

    This patch checks for the incomplete tree walk in shrink_zone. If no
    group has been visited and the hierarchy is soft reclaimable then we
    must have missed some groups, in which case the __shrink_zone is called
    again. This doesn't guarantee there will be some progress of course
    because the current reclaimer might be still racing with others but it
    would at least give a chance to start the walk without a big risk of
    reclaim latencies.

    Signed-off-by: Michal Hocko
    Cc: Balbir Singh
    Cc: Glauber Costa
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Michel Lespinasse
    Cc: Tejun Heo
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Children in soft limit excess are currently tracked up the hierarchy in
    memcg->children_in_excess. Nevertheless there still might exist tons of
    groups that are not in hierarchy relation to the root cgroup (e.g. all
    first level groups if root_mem_cgroup->use_hierarchy == false).

    As the whole tree walk has to be done when the iteration starts at
    root_mem_cgroup the iterator should be able to skip the walk if there is
    no child above the limit without iterating them. This can be done
    easily if the root tracks all children rather than only hierarchical
    children. This is done by this patch which updates root_mem_cgroup
    children_in_excess if root_mem_cgroup->use_hierarchy == false so the
    root knows about all children in excess.

    Please note that this is not an issue for inner memcgs which have
    use_hierarchy == false because then only the single group is visited so
    no special optimization is necessary.

    Signed-off-by: Michal Hocko
    Cc: Balbir Singh
    Cc: Glauber Costa
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Michel Lespinasse
    Cc: Tejun Heo
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko