Doug / smarc-fsl-linux-kernel | Embedian Git Server

24 Feb, 2013

27 commits

c22c0d634 mm: remove flags argument to mmap_region ... Browse Code »

After the MAP_POPULATE handling has been moved to mmap_region() call
sites, the only remaining use of the flags argument is to pass the
MAP_NORESERVE flag. This can be just as easily handled by
do_mmap_pgoff(), so do that and remove the mmap_region() flags
parameter.

[akpm@linux-foundation.org: remove double parens]
Signed-off-by: Michel Lespinasse
Acked-by: Rik van Riel
Tested-by: Andy Lutomirski
Cc: Greg Ungerer
Cc: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2013-02-24 09:50:11 +0800
81909b842 mm: use mm_populate() for mremap() of VM_LOCKED vmas ... Browse Code »

Signed-off-by: Michel Lespinasse
Acked-by: Rik van Riel
Tested-by: Andy Lutomirski
Cc: Greg Ungerer
Cc: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2013-02-24 09:50:11 +0800
128557ffe mm: use mm_populate() when adjusting brk with MCL_FUTURE in effect ... Browse Code »

Signed-off-by: Michel Lespinasse
Acked-by: Rik van Riel
Tested-by: Andy Lutomirski
Cc: Greg Ungerer
Cc: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2013-02-24 09:50:11 +0800
a1ea9549a mm: use mm_populate() for blocking remap_file_pages() ... Browse Code »

Signed-off-by: Michel Lespinasse
Reviewed-by: Rik van Riel
Tested-by: Andy Lutomirski
Cc: Greg Ungerer
Cc: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2013-02-24 09:50:11 +0800
bebeb3d68 mm: introduce mm_populate() for populating new vmas ... Browse Code »

When creating new mappings using the MAP_POPULATE / MAP_LOCKED flags (or
with MCL_FUTURE in effect), we want to populate the pages within the
newly created vmas. This may take a while as we may have to read pages
from disk, so ideally we want to do this outside of the write-locked
mmap_sem region.

This change introduces mm_populate(), which is used to defer populating
such mappings until after the mmap_sem write lock has been released.
This is implemented as a generalization of the former do_mlock_pages(),
which accomplished the same task but was using during mlock() /
mlockall().

Signed-off-by: Michel Lespinasse
Reported-by: Andy Lutomirski
Acked-by: Rik van Riel
Tested-by: Andy Lutomirski
Cc: Greg Ungerer
Cc: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2013-02-24 09:50:10 +0800
940e7da51 mm: remap_file_pages() fixes ... Browse Code »

We have many vma manipulation functions that are fast in the typical
case, but can optionally be instructed to populate an unbounded number
of ptes within the region they work on:

- mmap with MAP_POPULATE or MAP_LOCKED flags;
- remap_file_pages() with MAP_NONBLOCK not set or when working on a
VM_LOCKED vma;
- mmap_region() and all its wrappers when mlock(MCL_FUTURE) is in
effect;
- brk() when mlock(MCL_FUTURE) is in effect.

Current code handles these pte operations locally, while the
sourrounding code has to hold the mmap_sem write side since it's
manipulating vmas. This means we're doing an unbounded amount of pte
population work with mmap_sem held, and this causes problems as Andy
Lutomirski reported (we've hit this at Google as well, though it's not
entirely clear why people keep trying to use mlock(MCL_FUTURE) in the
first place).

I propose introducing a new mm_populate() function to do this pte
population work after the mmap_sem has been released. mm_populate()
does need to acquire the mmap_sem read side, but critically, it doesn't
need to hold it continuously for the entire duration of the operation -
it can drop it whenever things take too long (such as when hitting disk
for a file read) and re-acquire it later on.

The following patches are included

- Patches 1 fixes some issues I noticed while working on the existing code.
If needed, they could potentially go in before the rest of the patches.

- Patch 2 introduces the new mm_populate() function and changes
mmap_region() call sites to use it after they drop mmap_sem. This is
inspired from Andy Lutomirski's proposal and is built as an extension
of the work I had previously done for mlock() and mlockall() around
v2.6.38-rc1. I had tried doing something similar at the time but had
given up as there were so many do_mmap() call sites; the recent cleanups
by Linus and Viro are a tremendous help here.

- Patches 3-5 convert some of the less-obvious places doing unbounded
pte populates to the new mm_populate() mechanism.

- Patches 6-7 are code cleanups that are made possible by the
mm_populate() work. In particular, they remove more code than the
entire patch series added, which should be a good thing :)

- Patch 8 is optional to this entire series. It only helps to deal more
nicely with racy userspace programs that might modify their mappings
while we're trying to populate them. It adds a new VM_POPULATE flag
on the mappings we do want to populate, so that if userspace replaces
them with mappings it doesn't want populated, mm_populate() won't
populate those replacement mappings.

This patch:

Assorted small fixes. The first two are quite small:

- Move check for vma->vm_private_data && !(vma->vm_flags & VM_NONLINEAR)
within existing if (!(vma->vm_flags & VM_NONLINEAR)) block.
Purely cosmetic.

- In the VM_LOCKED case, when dropping PG_Mlocked for the over-mapped
range, make sure we own the mmap_sem write lock around the
munlock_vma_pages_range call as this manipulates the vma's vm_flags.

Last fix requires a longer explanation. remap_file_pages() can do its work
either through VM_NONLINEAR manipulation or by creating extra vmas.
These two cases were inconsistent with each other (and ultimately, both wrong)
as to exactly when did they fault in the newly mapped file pages:

- In the VM_NONLINEAR case, new file pages would be populated if
the MAP_NONBLOCK flag wasn't passed. If MAP_NONBLOCK was passed,
new file pages wouldn't be populated even if the vma is already
marked as VM_LOCKED.

- In the linear (emulated) case, the work is passed to the mmap_region()
function which would populate the pages if the vma is marked as
VM_LOCKED, and would not otherwise - regardless of the value of the
MAP_NONBLOCK flag, because MAP_POPULATE wasn't being passed to
mmap_region().

The desired behavior is that we want the pages to be populated and locked
if the vma is marked as VM_LOCKED, or to be populated if the MAP_NONBLOCK
flag is not passed to remap_file_pages().

Signed-off-by: Michel Lespinasse
Acked-by: Rik van Riel
Tested-by: Andy Lutomirski
Cc: Greg Ungerer
Cc: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2013-02-24 09:50:10 +0800
dafcb73e3 mm: avoid calling pgdat_balanced() needlessly ... Browse Code »

Now that balance_pgdat() is slightly tidied up, thanks to more capable
pgdat_balanced(), it's become obvious that pgdat_balanced() is called to
check the status, then break the loop if pgdat is balanced, just to be
immediately called again. The second call is completely unnecessary, of
course.

The patch introduces pgdat_is_balanced boolean, which helps resolve the
above suboptimal behavior, with the added benefit of slightly better
documenting one other place in the function where we jump and skip lots
of code.

Signed-off-by: Zlatko Calusic
Cc: Mel Gorman
Cc: Hugh Dickins
Cc: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Zlatko Calusic
2013-02-24 09:50:10 +0800
7103f16db mm: compaction: make __compact_pgdat() and compact_pgdat() return void ... Browse Code »

These functions always return 0. Formalise this.

Cc: Jason Liu
Cc: KAMEZAWA Hiroyuki
Cc: Mel Gorman
Cc: Minchan Kim
Cc: Rik van Riel
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Morton
2013-02-24 09:50:10 +0800
1998cc048 mm: make madvise(MADV_WILLNEED) support swap file prefetch ... Browse Code »

Make madvise(MADV_WILLNEED) support swap file prefetch. If memory is
swapout, this syscall can do swapin prefetch. It has no impact if the
memory isn't swapout.

[akpm@linux-foundation.org: fix CONFIG_SWAP=n build]
[sasha.levin@oracle.com: fix BUG on madvise early failure]
Signed-off-by: Shaohua Li
Cc: Hugh Dickins
Cc: Rik van Riel
Signed-off-by: Sasha Levin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Shaohua Li
2013-02-24 09:50:10 +0800
a394cb8ee memcg,vmscan: do not break out targeted reclaim without reclaimed pages ... Browse Code »

Targeted (hard resp soft) reclaim has traditionally tried to scan one
group with decreasing priority until nr_to_reclaim (SWAP_CLUSTER_MAX
pages) is reclaimed or all priorities are exhausted. The reclaim is
then retried until the limit is met.

This approach, however, doesn't work well with deeper hierarchies where
groups higher in the hierarchy do not have any or only very few pages
(this usually happens if those groups do not have any tasks and they
have only re-parented pages after some of their children is removed).
Those groups are reclaimed with decreasing priority pointlessly as there
is nothing to reclaim from them.

An easiest fix is to break out of the memcg iteration loop in
shrink_zone only if the whole hierarchy has been visited or sufficient
pages have been reclaimed. This is also more natural because the
reclaimer expects that the hierarchy under the given root is reclaimed.
As a result we can simplify the soft limit reclaim which does its own
iteration.

[yinghan@google.com: break out of the hierarchy loop only if nr_reclaimed exceeded nr_to_reclaim]
[akpm@linux-foundation.org: use conventional comparison order]
Signed-off-by: Michal Hocko
Reported-by: Ying Han
Cc: KAMEZAWA Hiroyuki
Cc: Johannes Weiner
Cc: Tejun Heo
Cc: Glauber Costa
Cc: Li Zefan
Signed-off-by: Ying Han
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2013-02-24 09:50:10 +0800
4ca3a69bc mm/ksm.c: use new hashtable implementation ... Browse Code »

Switch ksm to use the new hashtable implementation. This reduces the
amount of generic unrelated code in the ksm module.

Signed-off-by: Sasha Levin
Acked-by: Hugh Dickins
Cc: Michal Hocko
Cc: Konstantin Khlebnikov
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sasha Levin
2013-02-24 09:50:10 +0800
43b5fbbd2 mm/huge_memory.c: use new hashtable implementation ... Browse Code »

Switch hugemem to use the new hashtable implementation. This reduces
the amount of generic unrelated code in the hugemem.

This also removes the dymanic allocation of the hash table. The upside
is that we save a pointer dereference when accessing the hashtable, but
we lose 8KB if CONFIG_TRANSPARENT_HUGEPAGE is enabled but the processor
doesn't support hugepages.

Signed-off-by: Sasha Levin
Cc: David Rientjes
Cc: "Kirill A. Shutemov"
Cc: Xiao Guangrong
Cc: Andrea Arcangeli
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sasha Levin
2013-02-24 09:50:10 +0800
a9aacbccf mm: compaction: do not accidentally skip pageblocks in the migrate scanner ... Browse Code »

Compaction uses the ALIGN macro incorrectly with the migrate scanner by
adding pageblock_nr_pages to a PFN. It happened to work when initially
implemented as the starting PFN was also aligned but with caching
restarts and isolating in smaller chunks this is no longer always true.

The impact is that the migrate scanner scans outside its current
pageblock. As pfn_valid() is still checked properly it does not cause
any failure and the impact of the bug is that in some cases it will scan
more than necessary when it crosses a page boundary but by no more than
COMPACT_CLUSTER_MAX. It is highly unlikely this is even measurable but
it's still wrong so this patch addresses the problem.

Signed-off-by: Mel Gorman
Reviewed-by: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2013-02-24 09:50:10 +0800
62b726c1b mm/vmscan.c:__zone_reclaim(): replace max_t() with max() ... Browse Code »

"mm: vmscan: save work scanning (almost) empty LRU lists" made
SWAP_CLUSTER_MAX an unsigned long.

Cc: Johannes Weiner
Cc: Rik van Riel
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Hugh Dickins
Cc: Satoru Moriya
Cc: Simon Jeons
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Morton
2013-02-24 09:50:10 +0800
90ae8d670 mm/page_alloc.c:__setup_per_zone_wmarks: make min_pages unsigned long ... Browse Code »

`int' is an inappropriate type for a number-of-pages counter.

While we're there, use the clamp() macro.

Acked-by: Johannes Weiner
Cc: Rik van Riel
Cc: Mel Gorman
Reviewed-by: Michal Hocko
Cc: Hugh Dickins
Cc: Satoru Moriya
Cc: Simon Jeons
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Morton
2013-02-24 09:50:10 +0800
af34770e5 mm: reduce rmap overhead for ex-KSM page copies created on swap faults ... Browse Code »

When ex-KSM pages are faulted from swap cache, the fault handler is not
capable of re-establishing anon_vma-spanning KSM pages. In this case, a
copy of the page is created instead, just like during a COW break.

These freshly made copies are known to be exclusive to the faulting VMA
and there is no reason to go look for this page in parent and sibling
processes during rmap operations.

Use page_add_new_anon_rmap() for these copies. This also puts them on
the proper LRU lists and marks them SwapBacked, so we can get rid of
doing this ad-hoc in the KSM copy code.

Signed-off-by: Johannes Weiner
Reviewed-by: Rik van Riel
Acked-by: Hugh Dickins
Cc: Simon Jeons
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Satoru Moriya
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2013-02-24 09:50:09 +0800
9b4f98cda mm: vmscan: compaction works against zones, not lruvecs ... Browse Code »

The restart logic for when reclaim operates back to back with compaction
is currently applied on the lruvec level. But this does not make sense,
because the container of interest for compaction is a zone as a whole,
not the zone pages that are part of a certain memory cgroup.

Negative impact is bounded. For one, the code checks that the lruvec
has enough reclaim candidates, so it does not risk getting stuck on a
condition that can not be fulfilled. And the unfairness of hammering on
one particular memory cgroup to make progress in a zone will be
amortized by the round robin manner in which reclaim goes through the
memory cgroups. Still, this can lead to unnecessary allocation
latencies when the code elects to restart on a hard to reclaim or small
group when there are other, more reclaimable groups in the zone.

Move this logic to the zone level and restart reclaim for all memory
cgroups in a zone when compaction requires more free pages from it.

[akpm@linux-foundation.org: no need for min_t]
Signed-off-by: Johannes Weiner
Reviewed-by: Rik van Riel
Acked-by: Mel Gorman
Reviewed-by: Michal Hocko
Cc: Hugh Dickins
Cc: Satoru Moriya
Cc: Simon Jeons
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2013-02-24 09:50:09 +0800
9a2651140 mm: vmscan: clean up get_scan_count() ... Browse Code »

Reclaim pressure balance between anon and file pages is calculated
through a tuple of numerators and a shared denominator.

Exceptional cases that want to force-scan anon or file pages configure
the numerators and denominator such that one list is preferred, which is
not necessarily the most obvious way:

fraction[0] = 1;
fraction[1] = 0;
denominator = 1;
goto out;

Make this easier by making the force-scan cases explicit and use the
fractionals only in case they are calculated from reclaim history.

[akpm@linux-foundation.org: avoid using unintialized_var()]
Signed-off-by: Johannes Weiner
Reviewed-by: Rik van Riel
Acked-by: Mel Gorman
Reviewed-by: Michal Hocko
Cc: Hugh Dickins
Cc: Satoru Moriya
Cc: Simon Jeons
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2013-02-24 09:50:09 +0800
11d16c25b mm: vmscan: improve comment on low-page cache handling ... Browse Code »

Fix comment style and elaborate on why anonymous memory is force-scanned
when file cache runs low.

Signed-off-by: Johannes Weiner
Reviewed-by: Rik van Riel
Acked-by: Mel Gorman
Reviewed-by: Michal Hocko
Cc: Hugh Dickins
Cc: Satoru Moriya
Cc: Simon Jeons
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2013-02-24 09:50:09 +0800
10316b313 mm: vmscan: clarify how swappiness, highest priority, memcg interact ... Browse Code »

A swappiness of 0 has a slightly different meaning for global reclaim
(may swap if file cache really low) and memory cgroup reclaim (never
swap, ever).

In addition, global reclaim at highest priority will scan all LRU lists
equal to their size and ignore other balancing heuristics. UNLESS
swappiness forbids swapping, then the lists are balanced based on recent
reclaim effectiveness. UNLESS file cache is running low, then anonymous
pages are force-scanned.

This (total mess of a) behaviour is implicit and not obvious from the
way the code is organized. At least make it apparent in the code flow
and document the conditions. It will be it easier to come up with sane
semantics later.

Signed-off-by: Johannes Weiner
Reviewed-by: Rik van Riel
Reviewed-by: Satoru Moriya
Reviewed-by: Michal Hocko
Acked-by: Mel Gorman
Cc: Hugh Dickins
Cc: Simon Jeons
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2013-02-24 09:50:09 +0800
d778df51c mm: vmscan: save work scanning (almost) empty LRU lists ... Browse Code »

In certain cases (kswapd reclaim, memcg target reclaim), a fixed minimum
amount of pages is scanned from the LRU lists on each iteration, to make
progress.

Do not make this minimum bigger than the respective LRU list size,
however, and save some busy work trying to isolate and reclaim pages
that are not there.

Empty LRU lists are quite common with memory cgroups in NUMA
environments because there exists a set of LRU lists for each zone for
each memory cgroup, while the memory of a single cgroup is expected to
stay on just one node. The number of expected empty LRU lists is thus

memcgs * (nodes - 1) * lru types

Each attempt to reclaim from an empty LRU list does expensive size
comparisons between lists, acquires the zone's lru lock etc. Avoid
that.

Signed-off-by: Johannes Weiner
Reviewed-by: Rik van Riel
Acked-by: Mel Gorman
Reviewed-by: Michal Hocko
Cc: Hugh Dickins
Cc: Satoru Moriya
Cc: Simon Jeons
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2013-02-24 09:50:09 +0800
7c5bd705d mm: memcg: only evict file pages when we have plenty ... Browse Code »

Commit e9868505987a ("mm, vmscan: only evict file pages when we have
plenty") makes a point of not going for anonymous memory while there is
still enough inactive cache around.

The check was added only for global reclaim, but it is just as useful to
reduce swapping in memory cgroup reclaim:

200M-memcg-defconfig-j2

vanilla patched
Real time 454.06 ( +0.00%) 453.71 ( -0.08%)
User time 668.57 ( +0.00%) 668.73 ( +0.02%)
System time 128.92 ( +0.00%) 129.53 ( +0.46%)
Swap in 1246.80 ( +0.00%) 814.40 ( -34.65%)
Swap out 1198.90 ( +0.00%) 827.00 ( -30.99%)
Pages allocated 16431288.10 ( +0.00%) 16434035.30 ( +0.02%)
Major faults 681.50 ( +0.00%) 593.70 ( -12.86%)
THP faults 237.20 ( +0.00%) 242.40 ( +2.18%)
THP collapse 241.20 ( +0.00%) 248.50 ( +3.01%)
THP splits 157.30 ( +0.00%) 161.40 ( +2.59%)

Signed-off-by: Johannes Weiner
Acked-by: Michal Hocko
Acked-by: Rik van Riel
Acked-by: Mel Gorman
Cc: Hugh Dickins
Cc: Satoru Moriya
Cc: Simon Jeons
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2013-02-24 09:50:09 +0800
2a6f51241 CMA: make putback_lru_pages() call conditional ... Browse Code »

As per documentation and other places calling putback_lru_pages(),
putback_lru_pages() is called on error only. Make the CMA code behave
consistently.

[akpm@linux-foundation.org: remove a test-n-branch in the wrapup code]
Signed-off-by: Srinivas Pandruvada
Acked-by: Michal Nazarewicz
Cc: Marek Szyprowski
Cc: Bartlomiej Zolnierkiewicz
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Srinivas Pandruvada
2013-02-24 09:50:09 +0800
ffb22af5b mm/hugetlb.c: convert to pr_foo() ... Browse Code »

Cc: Michal Hocko
Cc: KAMEZAWA Hiroyuki
Acked-by: Hillf Danton
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Morton
2013-02-24 09:50:09 +0800
d045197ff mm/memcontrol.c: convert printk(KERN_FOO) to pr_foo() ... Browse Code »

Acked-by: Sha Zhengju
Acked-by: Michal Hocko
Cc: KAMEZAWA Hiroyuki
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Morton
2013-02-24 09:50:09 +0800
58cf188ed memcg, oom: provide more precise dump info while memcg oom happening ... Browse Code »

Currently when a memcg oom is happening the oom dump messages is still
global state and provides few useful info for users. This patch prints
more pointed memcg page statistics for memcg-oom and take hierarchy into
consideration:

Based on Michal's advice, we take hierarchy into consideration: supppose
we trigger an OOM on A's limit

root_memcg
|
A (use_hierachy=1)
/ \
B C
|
D
then the printed info will be:

Memory cgroup stats for /A:...
Memory cgroup stats for /A/B:...
Memory cgroup stats for /A/C:...
Memory cgroup stats for /A/B/D:...

Following are samples of oom output:

(1) Before change:

mal-80 invoked oom-killer:gfp_mask=0xd0, order=0, oom_score_adj=0
mal-80 cpuset=/ mems_allowed=0
Pid: 2976, comm: mal-80 Not tainted 3.7.0+ #10
Call Trace:
[] dump_header+0x83/0x1ca
..... (call trace)
[] page_fault+0x28/0x30
<<<<<<<<<<<<<<<<<<<<< memcg specific information
Task in /A/B/D killed as a result of limit of /A
memory: usage 101376kB, limit 101376kB, failcnt 57
memory+swap: usage 101376kB, limit 101376kB, failcnt 0
kmem: usage 0kB, limit 9007199254740991kB, failcnt 0
<<<<<<<<<<<<<<<<<<<<< print per cpu pageset stat
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
......
CPU 3: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 173
......
CPU 3: hi: 186, btch: 31 usd: 130
<<<<<<<<<<<<<<<<<<<<< print global page state
active_anon:92963 inactive_anon:40777 isolated_anon:0
active_file:33027 inactive_file:51718 isolated_file:0
unevictable:0 dirty:3 writeback:0 unstable:0
free:729995 slab_reclaimable:6897 slab_unreclaimable:6263
mapped:20278 shmem:35971 pagetables:5885 bounce:0
free_cma:0
<<<<<<<<<<<<<<<<<<<<< print per zone page state
Node 0 DMA free:15836kB ... all_unreclaimable? no
lowmem_reserve[]: 0 3175 3899 3899
Node 0 DMA32 free:2888564kB ... all_unrelaimable? no
lowmem_reserve[]: 0 0 724 724
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 1*4kB (U) ... 3*4096kB (M) = 15836kB
Node 0 DMA32: 41*4kB (UM) ... 702*4096kB (MR) = 2888316kB
120710 total pagecache pages
0 pages in swap cache
<<<<<<<<<<<<<<<<<<<<< print global swap cache stat
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 499708kB
Total swap = 499708kB
1040368 pages RAM
58678 pages reserved
169065 pages shared
173632 pages non-shared
[ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[ 2693] 0 2693 6005 1324 17 0 0 god
[ 2754] 0 2754 6003 1320 16 0 0 god
[ 2811] 0 2811 5992 1304 18 0 0 god
[ 2874] 0 2874 6005 1323 18 0 0 god
[ 2935] 0 2935 8720 7742 21 0 0 mal-30
[ 2976] 0 2976 21520 17577 42 0 0 mal-80
Memory cgroup out of memory: Kill process 2976 (mal-80) score 665 or sacrifice child
Killed process 2976 (mal-80) total-vm:86080kB, anon-rss:69964kB, file-rss:344kB

We can see that messages dumped by show_free_areas() are longsome and can
provide so limited info for memcg that just happen oom.

(2) After change
mal-80 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
mal-80 cpuset=/ mems_allowed=0
Pid: 2704, comm: mal-80 Not tainted 3.7.0+ #10
Call Trace:
[] dump_header+0x83/0x1d1
.......(call trace)
[] page_fault+0x28/0x30
Task in /A/B/D killed as a result of limit of /A
<<<<<<<<<<<<<<<<<<<<< memcg specific information
memory: usage 102400kB, limit 102400kB, failcnt 140
memory+swap: usage 102400kB, limit 102400kB, failcnt 0
kmem: usage 0kB, limit 9007199254740991kB, failcnt 0
Memory cgroup stats for /A: cache:32KB rss:30984KB mapped_file:0KB swap:0KB inactive_anon:6912KB active_anon:24072KB inactive_file:32KB active_file:0KB unevictable:0KB
Memory cgroup stats for /A/B: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
Memory cgroup stats for /A/C: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
Memory cgroup stats for /A/B/D: cache:32KB rss:71352KB mapped_file:0KB swap:0KB inactive_anon:6656KB active_anon:64696KB inactive_file:16KB active_file:16KB unevictable:0KB
[ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[ 2260] 0 2260 6006 1325 18 0 0 god
[ 2383] 0 2383 6003 1319 17 0 0 god
[ 2503] 0 2503 6004 1321 18 0 0 god
[ 2622] 0 2622 6004 1321 16 0 0 god
[ 2695] 0 2695 8720 7741 22 0 0 mal-30
[ 2704] 0 2704 21520 17839 43 0 0 mal-80
Memory cgroup out of memory: Kill process 2704 (mal-80) score 669 or sacrifice child
Killed process 2704 (mal-80) total-vm:86080kB, anon-rss:71016kB, file-rss:340kB

This version provides more pointed info for memcg in "Memory cgroup stats
for XXX" section.

Signed-off-by: Sha Zhengju
Acked-by: Michal Hocko
Cc: KAMEZAWA Hiroyuki
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sha Zhengju
2013-02-24 09:50:08 +0800
df8557982 drivers/md/persistent-data/dm-transaction-manager.c: rename HASH_SIZE ... Browse Code »

Fix the warning:

drivers/md/persistent-data/dm-transaction-manager.c:28:1: warning: "HASH_SIZE" redefined
In file included from include/linux/elevator.h:5,
from include/linux/blkdev.h:216,
from drivers/md/persistent-data/dm-block-manager.h:11,
from drivers/md/persistent-data/dm-transaction-manager.h:10,
from drivers/md/persistent-data/dm-transaction-manager.c:6:
include/linux/hashtable.h:22:1: warning: this is the location of the previous definition

Cc: Alasdair Kergon
Cc: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Morton
2013-02-24 09:50:08 +0800

22 Feb, 2013

13 commits

2ef14f465 Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull x86 mm changes from Peter Anvin:
"This is a huge set of several partly interrelated (and concurrently
developed) changes, which is why the branch history is messier than
one would like.

The *really* big items are two humonguous patchsets mostly developed
by Yinghai Lu at my request, which completely revamps the way we
create initial page tables. In particular, rather than estimating how
much memory we will need for page tables and then build them into that
memory -- a calculation that has shown to be incredibly fragile -- we
now build them (on 64 bits) with the aid of a "pseudo-linear mode" --
a #PF handler which creates temporary page tables on demand.

This has several advantages:

1. It makes it much easier to support things that need access to data
very early (a followon patchset uses this to load microcode way
early in the kernel startup).

2. It allows the kernel and all the kernel data objects to be invoked
from above the 4 GB limit. This allows kdump to work on very large
systems.

3. It greatly reduces the difference between Xen and native (Xen's
equivalent of the #PF handler are the temporary page tables created
by the domain builder), eliminating a bunch of fragile hooks.

The patch series also gets us a bit closer to W^X.

Additional work in this pull is the 64-bit get_user() work which you
were also involved with, and a bunch of cleanups/speedups to
__phys_addr()/__pa()."

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (105 commits)
x86, mm: Move reserving low memory later in initialization
x86, doc: Clarify the use of asm("%edx") in uaccess.h
x86, mm: Redesign get_user with a __builtin_choose_expr hack
x86: Be consistent with data size in getuser.S
x86, mm: Use a bitfield to mask nuisance get_user() warnings
x86/kvm: Fix compile warning in kvm_register_steal_time()
x86-32: Add support for 64bit get_user()
x86-32, mm: Remove reference to alloc_remap()
x86-32, mm: Remove reference to resume_map_numa_kva()
x86-32, mm: Rip out x86_32 NUMA remapping code
x86/numa: Use __pa_nodebug() instead
x86: Don't panic if can not alloc buffer for swiotlb
mm: Add alloc_bootmem_low_pages_nopanic()
x86, 64bit, mm: hibernate use generic mapping_init
x86, 64bit, mm: Mark data/bss/brk to nx
x86: Merge early kernel reserve for 32bit and 64bit
x86: Add Crash kernel low reservation
x86, kdump: Remove crashkernel range find limit for 64bit
memblock: Add memblock_mem_size()
x86, boot: Not need to check setup_header version for setup_data
...

Linus Torvalds
2013-02-22 10:06:55 +0800
cb715a836 Merge branch 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull x86 cpu updates from Peter Anvin:
"This is a corrected attempt at the x86/cpu branch, this time with the
fixes in that makes it not break on KVM (current or past), or any
other virtualizer which traps on this configuration.

Again, the biggest change here is enabling the WC+ memory type on AMD
processors, if the BIOS doesn't."

* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, kvm: Add MSR_AMD64_BU_CFG2 to the list of ignored MSRs
x86, cpu, amd: Fix WC+ workaround for older virtual hosts
x86, AMD: Enable WC+ memory type on family 10 processors
x86, AMD: Clean up init_amd()
x86/process: Change %8s to %s for pr_warn() in release_thread()
x86/cpu/hotplug: Remove CONFIG_EXPERIMENTAL dependency

Linus Torvalds
2013-02-22 10:03:39 +0800
27ea6dfdc Merge tag 'please-pull-misc-3.9' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux ... Browse Code »

Pull misc ia64 bits from Tony Luck.

* tag 'please-pull-misc-3.9' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
MAINTAINERS: update SGI & ia64 Altix stuff
sysctl: Enable IA64 "ignore-unaligned-usertrap" to be used cross-arch

Linus Torvalds
2013-02-22 09:55:48 +0800
81ec44a6c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux ... Browse Code »

Pull s390 update from Martin Schwidefsky:
"The most prominent change in this patch set is the software dirty bit
patch for s390. It removes __HAVE_ARCH_PAGE_TEST_AND_CLEAR_DIRTY and
the page_test_and_clear_dirty primitive which makes the common memory
management code a bit less obscure.

Heiko fixed most of the PCI related fallout, more often than not
missing GENERIC_HARDIRQS dependencies. Notable is one of the 3270
patches which adds an export to tty_io to be able to resize a tty.

The rest is the usual bunch of cleanups and bug fixes."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (42 commits)
s390/module: Add missing R_390_NONE relocation type
drivers/gpio: add missing GENERIC_HARDIRQ dependency
drivers/input: add couple of missing GENERIC_HARDIRQS dependencies
s390/cleanup: rename SPP to LPP
s390/mm: implement software dirty bits
s390/mm: Fix crst upgrade of mmap with MAP_FIXED
s390/linker skript: discard exit.data at runtime
drivers/media: add missing GENERIC_HARDIRQS dependency
s390/bpf,jit: add vlan tag support
drivers/net,AT91RM9200: add missing GENERIC_HARDIRQS dependency
iucv: fix kernel panic at reboot
s390/Kconfig: sort list of arch selected config options
phylib: remove !S390 dependeny from Kconfig
uio: remove !S390 dependency from Kconfig
dasd: fix sysfs cleanup in dasd_generic_remove
s390/pci: fix hotplug module init
s390/pci: cleanup clp page allocation
s390/pci: cleanup clp inline assembly
s390/perf: cpum_cf: fallback to software sampling events
s390/mm: provide PAGE_SHARED define
...

Linus Torvalds
2013-02-22 09:54:03 +0800
48a732dfa Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid ... Browse Code »

Pull HID subsystem updates from Jiri Kosina:
"HID subsystem and drivers update. Highlights:

- new support of a group of Win7/Win8 multitouch devices, from
Benjamin Tissoires

- fix for compat interface brokenness in uhid, from Dmitry Torokhov

- conversion of drivers to use hid_driver helper, by H Hartley
Sweeten

- HID over I2C transport received ACPI enumeration support, written
by Mika Westerberg

- there is an ongoing effort to make HID sensor hubs independent of
USB transport. The first self-contained part of this work is
provided here, done by Mika Westerberg

- a few smaller fixes here and there, support for a couple new
devices added"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (43 commits)
HID: Correct Logitech order in hid-ids.h
HID: LG4FF: Remove unnecessary deadzone code
HID: LG: Prevent the Logitech Gaming Wheels deadzone
HID: LG: Fix detection of Logitech Speed Force Wireless (WiiWheel)
HID: LG: Add support for Logitech Momo Force (Red) Wheel
HID: hidraw: print message when succesfully initialized
HID: logitech: split accel, brake for Driving Force wheel
HID: logitech: add report descriptor for Driving Force wheel
HID: add ThingM blink(1) USB RGB LED support
HID: uhid: make creating devices work on 64/32 systems
HID: wiimote: fix nunchuck button parser
HID: blacklist Velleman data acquisition boards
HID: sensor-hub: don't limit the driver only to USB bus
HID: sensor-hub: get rid of unused sensor_hub_grabbed_usages[] table
HID: extend autodetect to handle I2C sensors as well
HID: ntrig: use input_configured() callback to set the name
HID: multitouch: do not use pointers towards hid-core
HID: add missing GENERIC_HARDIRQ dependency
HID: multitouch: make MT_CLS_ALWAYS_TRUE the new default class
HID: multitouch: fix protocol for Elo panels
...

Linus Torvalds
2013-02-22 09:41:38 +0800
9afa3195b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial ... Browse Code »

Pull trivial tree from Jiri Kosina:
"Assorted tiny fixes queued in trivial tree"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (22 commits)
DocBook: update EXPORT_SYMBOL entry to point at export.h
Documentation: update top level 00-INDEX file with new additions
ARM: at91/ide: remove unsused at91-ide Kconfig entry
percpu_counter.h: comment code for better readability
x86, efi: fix comment typo in head_32.S
IB: cxgb3: delay freeing mem untill entirely done with it
net: mvneta: remove unneeded version.h include
time: x86: report_lost_ticks doesn't exist any more
pcmcia: avoid static analysis complaint about use-after-free
fs/jfs: Fix typo in comment : 'how may' -> 'how many'
of: add missing documentation for of_platform_populate()
btrfs: remove unnecessary cur_trans set before goto loop in join_transaction
sound: soc: Fix typo in sound/codecs
treewide: Fix typo in various drivers
btrfs: fix comment typos
Update ibmvscsi module name in Kconfig.
powerpc: fix typo (utilties -> utilities)
of: fix spelling mistake in comment
h8300: Fix home page URL in h8300/README
xtensa: Fix home page URL in Kconfig
...

Linus Torvalds
2013-02-22 09:40:58 +0800
7c2db36e7 Merge branch 'akpm' (incoming from Andrew) ... Browse Code »

Merge misc patches from Andrew Morton:

- Florian has vanished so I appear to have become fbdev maintainer
again :(

- Joel and Mark are distracted to welcome to the new OCFS2 maintainer

- The backlight queue

- Small core kernel changes

- lib/ updates

- The rtc queue

- Various random bits

* akpm: (164 commits)
rtc: rtc-davinci: use devm_*() functions
rtc: rtc-max8997: use devm_request_threaded_irq()
rtc: rtc-max8907: use devm_request_threaded_irq()
rtc: rtc-da9052: use devm_request_threaded_irq()
rtc: rtc-wm831x: use devm_request_threaded_irq()
rtc: rtc-tps80031: use devm_request_threaded_irq()
rtc: rtc-lp8788: use devm_request_threaded_irq()
rtc: rtc-coh901331: use devm_clk_get()
rtc: rtc-vt8500: use devm_*() functions
rtc: rtc-tps6586x: use devm_request_threaded_irq()
rtc: rtc-imxdi: use devm_clk_get()
rtc: rtc-cmos: use dev_warn()/dev_dbg() instead of printk()/pr_debug()
rtc: rtc-pcf8583: use dev_warn() instead of printk()
rtc: rtc-sun4v: use pr_warn() instead of printk()
rtc: rtc-vr41xx: use dev_info() instead of printk()
rtc: rtc-rs5c313: use pr_err() instead of printk()
rtc: rtc-at91rm9200: use dev_dbg()/dev_err() instead of printk()/pr_debug()
rtc: rtc-rs5c372: use dev_dbg()/dev_warn() instead of printk()/pr_debug()
rtc: rtc-ds2404: use dev_err() instead of printk()
rtc: rtc-efi: use dev_err()/dev_warn()/pr_err() instead of printk()
...

Linus Torvalds
2013-02-22 09:38:49 +0800
a47a376f1 rtc: rtc-davinci: use devm_*() functions ... Browse Code »

Use devm_*() functions to make cleanup paths more simple.

Signed-off-by: Jingoo Han
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jingoo Han
2013-02-22 09:22:31 +0800
c1879fe80 rtc: rtc-max8997: use devm_request_threaded_irq() ... Browse Code »

Use devm_request_threaded_irq() to make cleanup paths more simple.

Signed-off-by: Jingoo Han
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jingoo Han
2013-02-22 09:22:30 +0800
83a72c87e rtc: rtc-max8907: use devm_request_threaded_irq() ... Browse Code »

Use devm_request_threaded_irq() to make cleanup paths more simple.

Signed-off-by: Jingoo Han
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jingoo Han
2013-02-22 09:22:30 +0800
27239a149 rtc: rtc-da9052: use devm_request_threaded_irq() ... Browse Code »

Use devm_request_threaded_irq() to make cleanup paths more simple.

Signed-off-by: Jingoo Han
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jingoo Han
2013-02-22 09:22:30 +0800
fd5231ce3 rtc: rtc-wm831x: use devm_request_threaded_irq() ... Browse Code »

Use devm_request_threaded_irq() to make cleanup paths more simple.

Signed-off-by: Jingoo Han
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jingoo Han
2013-02-22 09:22:30 +0800
6d77bdca2 rtc: rtc-tps80031: use devm_request_threaded_irq() ... Browse Code »

Use devm_request_threaded_irq() to make cleanup paths more simple.

Signed-off-by: Jingoo Han
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jingoo Han
2013-02-22 09:22:30 +0800