30 May, 2012

40 commits

  • Presently, at removal of a cgroup, ->pre_destroy() is called and moves
    charges to the parent cgroup. A major reason for returning -EBUSY from
    ->pre_destroy() is that the move hits the parent's resource limit. This
    happens only when use_hierarchy=0.

    With use_hierarchy=0 all cgroups are flat, so there is no justification
    for moving charges to the parent: parent and children are in a flat
    configuration, not a hierarchical one.

    This patch modifies the code to move charges to the root cgroup at
    rmdir/force_empty if use_hierarchy==0. This greatly simplifies rmdir()
    and reduces the error cases in ->pre_destroy().
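
    A conceptual sketch of the new target selection (hypothetical helper;
    parent_mem_cgroup(), root_mem_cgroup and use_hierarchy are the real
    kernel symbols, the surrounding function is illustrative only):

        /* Charges left in a dying memcg are reparented to the real parent
         * only when the hierarchy is in use; in a flat setup they go
         * straight to root, so the parent's limit can never make
         * ->pre_destroy() fail with -EBUSY. */
        static struct mem_cgroup *charge_move_target(struct mem_cgroup *memcg)
        {
                struct mem_cgroup *parent = parent_mem_cgroup(memcg);

                if (!parent || !parent->use_hierarchy)
                        return root_mem_cgroup;
                return parent;
        }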

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Aneesh Kumar K.V
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Frederic Weisbecker
    Cc: Ying Han
    Cc: Glauber Costa
    Reviewed-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • By using res_counter_uncharge_until(), we can avoid races and
    unnecessary charging.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Aneesh Kumar K.V
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Frederic Weisbecker
    Cc: Ying Han
    Cc: Glauber Costa
    Reviewed-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • When killing a res_counter which is a child of another counter, we need
    to do

    res_counter_uncharge(child, xxx)
    res_counter_charge(parent, xxx)

    This is not atomic and wastes CPU. This patch adds
    res_counter_uncharge_until(), whose uncharge propagates up the
    ancestors until the specified res_counter is reached:

    res_counter_uncharge_until(child, parent, xxx)

    Now the operation is atomic and efficient.
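
    A simplified sketch of how such a helper can walk the hierarchy
    (illustrative, using a hypothetical name; the in-tree helper
    additionally makes the walk irq-safe):

        /* Uncharge 'val' from every counter on the path from 'counter' up
         * to, but not including, 'top'; passing NULL for 'top' uncharges
         * the whole chain up to the root. */
        static void uncharge_until(struct res_counter *counter,
                                   struct res_counter *top,
                                   unsigned long val)
        {
                struct res_counter *c;

                for (c = counter; c != top; c = c->parent) {
                        spin_lock(&c->lock);
                        res_counter_uncharge_locked(c, val);
                        spin_unlock(&c->lock);
                }
        }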

    Signed-off-by: Frederic Weisbecker
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Aneesh Kumar K.V
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Ying Han
    Cc: Glauber Costa
    Reviewed-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frederic Weisbecker
     
  • Kill struct mem_cgroup_zone and rename shrink_mem_cgroup_zone() to
    shrink_lruvec(); it always shrinks the one lruvec it takes as an
    argument.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Signed-off-by: Konstantin Khlebnikov
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Signed-off-by: Konstantin Khlebnikov
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Signed-off-by: Konstantin Khlebnikov
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Switch mem_cgroup_inactive_anon_is_low() to lruvec pointers;
    mem_cgroup_get_lruvec_size() is more efficient than
    mem_cgroup_zone_nr_lru_pages().

    Signed-off-by: Konstantin Khlebnikov
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • If the memory cgroup is enabled we always use lruvecs which are embedded
    into struct mem_cgroup_per_zone, so we can reach the lru_size counters
    via container_of().
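
    A minimal sketch of that container_of() step (the helper name is
    hypothetical; it assumes the layout described above, with the lruvec
    embedded next to the lru_size array in struct mem_cgroup_per_zone):

        /* Recover the owning mem_cgroup_per_zone from an embedded lruvec,
         * so the per-lru size counters can be reached without carrying an
         * extra pointer around. */
        static struct mem_cgroup_per_zone *lruvec_to_mz(struct lruvec *lruvec)
        {
                return container_of(lruvec, struct mem_cgroup_per_zone, lruvec);
        }

        /* e.g.: lruvec_to_mz(lruvec)->lru_size[lru] += nr_pages; */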

    Signed-off-by: Konstantin Khlebnikov
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • As zone_reclaim_stat is now located in the lruvec, we can reach it
    directly.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • update_isolated_counts() is no longer required, because lumpy reclaim
    was removed. Insanity is over; now there is only one kind of inactive
    page.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • It doesn't need a pointer to the cgroup - a pointer to the zone is
    enough. This patch also kills the "mz" argument of
    page_check_references() - it is unused after "mm: memcg: count pte
    references from every member of the reclaimed hierarchy".

    Signed-off-by: Konstantin Khlebnikov
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Move the mem_cgroup_zone_lruvec() call from isolate_lru_pages() into
    shrink_[in]active_list(). Further patches push it to shrink_zone() step
    by step.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • This is the first stage of struct mem_cgroup_zone removal. Further
    patches replace struct mem_cgroup_zone with a pointer to struct lruvec.

    If CONFIG_CGROUP_MEM_RES_CTLR=n lruvec_zone() is just container_of().

    Signed-off-by: Konstantin Khlebnikov
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • In memory reclaim some functions have too many arguments - "priority"
    is one of them. It can be stored in struct scan_control, which is
    constructed at the same level. Instead of an open-coded loop we set the
    initial sc.priority, and do_try_to_free_pages() decreases it down to
    zero.
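
    A sketch of the resulting shape (simplified; DEF_PRIORITY and
    shrink_zones() are the real vmscan symbols, the rest is illustrative):

        struct scan_control sc = {
                .priority = DEF_PRIORITY,    /* was a separate argument */
                /* ... other reclaim parameters ... */
        };

        do {
                shrink_zones(zonelist, &sc); /* reads sc.priority itself */
                /* ... check whether enough pages were reclaimed ... */
        } while (--sc.priority >= 0);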

    Signed-off-by: Konstantin Khlebnikov
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Use vm_swappiness from the memory cgroup which triggered this memory
    reclaim. This is more reasonable and allows us to drop one argument.

    [akpm@linux-foundation.org: fix build (patch skew)]
    Signed-off-by: Konstantin Khlebnikov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Glauber Costa
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • The index current_threshold may point to a threshold exactly equal to
    the usage after the last call of __mem_cgroup_threshold. But after
    registering a new event it will change, pointing to the threshold just
    below the usage. So make it consistent here.

    For example:
    now:
    threshold array: 3 [5] 7 9 (usage = 6, [index] = 5)

    next turn (after calling __mem_cgroup_threshold):
    threshold array: 3 5 [7] 9 (usage = 7, [index] = 7)

    after registering a new event (threshold = 10):
    threshold array: 3 [5] 7 9 10 (usage = 7, [index] = 5)
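
    Purely as an illustration of the invariant being aligned (hypothetical
    names, not the actual memcontrol.c code): after rebuilding the array on
    registration, the index should land on the last entry <= usage, just as
    __mem_cgroup_threshold() leaves it.

        /* Hypothetical: keep current_threshold at the last entry whose
         * threshold does not exceed the current usage. */
        int i = 0;

        while (i < size && entries[i].threshold <= usage)
                i++;
        current_threshold = i - 1;  /* -1 if every threshold is above usage */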

    Signed-off-by: Sha Zhengju
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
     
  • Signed-off-by: Kirill A. Shutemov
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • It fixes a lot of sparse warnings.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • mm/memcontrol.c: In function `mc_handle_file_pte':
    mm/memcontrol.c:5206:16: warning: variable `inode' set but not used [-Wunused-but-set-variable]

    Signed-off-by: Kirill A. Shutemov
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Based on sparse output.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • This patch kills mem_cgroup_lru_del(); we can use
    mem_cgroup_lru_del_list() instead. On 0-order isolation we already have
    the right lru list id.

    Signed-off-by: Konstantin Khlebnikov
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Glauber Costa
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • After the patch "mm: forbid lumpy-reclaim in shrink_active_list()" we
    can completely remove the anon/file and active/inactive lru type
    filters from __isolate_lru_page(), because 0-order reclaim always
    isolates pages from the right lru list. Page isolation for lumpy
    shrink_inactive_list() or memory compaction is in any case allowed to
    take pages from all evictable lru lists.

    Signed-off-by: Konstantin Khlebnikov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Acked-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • GCC sometimes ignores "inline" directives even for small and simple
    functions. This is supposed to be fixed in gcc 4.7, but that was
    released only yesterday.

    Signed-off-by: Konstantin Khlebnikov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Glauber Costa
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Let's toss the lru index through the call stack to isolate_lru_pages();
    this is better than reconstructing it from individual bits.

    [akpm@linux-foundation.org: fix kerneldoc, per Minchan]
    Signed-off-by: Konstantin Khlebnikov
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Cc: Glauber Costa
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • With mem_cgroup_disabled() now explicit, it becomes clear that the
    zone_reclaim_stat structure actually belongs in lruvec, per-zone when
    memcg is disabled but per-memcg per-zone when it's enabled.

    We can delete mem_cgroup_get_reclaim_stat(), and change
    update_page_reclaim_stat() to update just the one set of stats, the one
    which get_scan_count() will actually use.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Konstantin Khlebnikov
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Reviewed-by: Minchan Kim
    Reviewed-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Although one has to admire the skill with which it has been concealed,
    scanning_global_lru(mz) is actually just an interesting way to test
    mem_cgroup_disabled(). Too many developer hours have been wasted on
    confusing it with global_reclaim(): just use mem_cgroup_disabled().

    Signed-off-by: Hugh Dickins
    Signed-off-by: Konstantin Khlebnikov
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Glauber Costa
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • That stuff __mem_cgroup_commit_charge_swapin() does with a swap entry, it
    has a name and even a declaration: just use mem_cgroup_uncharge_swap().

    Signed-off-by: Hugh Dickins
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The need_fixup arg to mem_cgroup_move_swap_account() is always false,
    so just remove it.

    Signed-off-by: Hugh Dickins
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • This patch changes memcg's behavior at task_move().

    At task_move(), the kernel scans a task's page table and moves the
    charges for mapped pages from the source cgroup to the target cgroup.
    There has been a bug in handling shared anonymous pages for a long
    time.

    Before patch:
    - The spec says 'shared anonymous pages are not moved.'
    - The implementation was 'shared anonymous pages may be moved'.
    If page_mapcount
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Naoya Horiguchi
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • The overall memblock has been organized into memory regions and
    reserved regions. Initially, the memory regions and reserved regions
    are stored in predetermined arrays of "struct memblock_region". It's
    possible for the arrays to be enlarged when we have newly added regions
    but no free space left there. The policy here is to create a
    double-sized array, either with the slab allocator or with the memblock
    allocator. Unfortunately, we didn't free the old array, which might
    have been allocated through the slab allocator before. That would cause
    a memory leak.

    The patch introduces two variables to track where (slab or memblock)
    the memory and reserved region arrays come from. The old memory or
    reserved region array is then deallocated with kfree() if it was
    allocated by the slab allocator, which fixes the memory leak.
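
    A simplified sketch of the doubling logic with origin tracking (not the
    actual memblock code; early_alloc() stands in for the memblock-based
    allocation path, and the flag would exist once per region array):

        static bool regions_from_slab;

        static int double_region_array(struct memblock_region **arr,
                                       unsigned long *max)
        {
                size_t old_size = *max * sizeof(**arr);
                bool use_slab = slab_is_available();
                struct memblock_region *new;

                new = use_slab ? kmalloc(old_size * 2, GFP_KERNEL)
                               : early_alloc(old_size * 2); /* hypothetical */
                if (!new)
                        return -ENOMEM;
                memcpy(new, *arr, old_size);

                /* Only a slab-allocated old array may be kfree()d; the
                 * static bootstrap array and memblock allocations must
                 * not be handed to the slab allocator. */
                if (regions_from_slab)
                        kfree(*arr);

                *arr = new;
                *max *= 2;
                regions_from_slab = use_slab;
                return 0;
        }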

    Signed-off-by: Gavin Shan
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     
  • The overall memblock has been organized into memory regions and
    reserved regions. Initially, the memory regions and reserved regions
    are stored in predetermined arrays of "struct memblock_region". It's
    possible for the arrays to be enlarged when newly added regions no
    longer fit, and in that situation we create a double-sized array to
    meet the requirement. However, the original implementation converted
    the VA (Virtual Address) of the newly allocated array of regions to a
    PA (Physical Address) and then translated it back when the new array
    was allocated from slab. That's actually unnecessary.

    The patch removes the duplicate VA/PA conversion.

    Signed-off-by: Gavin Shan
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     
  • Transparent huge pages can change page->flags (PG_compound_lock)
    without taking the slab lock. Since THP cannot break up slab pages, we
    can safely access a compound slab page without taking the compound
    lock.

    Specifically, this patch fixes a race between compound_unlock() and
    slab functions which perform page-flags updates. This can occur when
    get_page()/put_page() is called on a page from slab.

    [akpm@linux-foundation.org: tweak comment text, fix comment layout, fix label indenting]
    Reported-by: Amey Bhide
    Signed-off-by: Pravin B Shelar
    Reviewed-by: Christoph Lameter
    Acked-by: Andrea Arcangeli
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pravin B Shelar
     
  • The transfer of ->flags causes some of the static mapping virtual
    addresses to be prematurely freed (before the mapping is removed) because
    VM_LAZY_FREE gets "set" if tmp->flags has VM_IOREMAP set. This might
    cause subsequent vmalloc/ioremap calls to fail because it might allocate
    one of the freed virtual address ranges that aren't unmapped.

    va->flags has different types of flags from tmp->flags. If a region with
    VM_IOREMAP set is registered with vm_area_add_early(), it will be removed
    by __purge_vmap_area_lazy().

    Fix vmalloc_init() to correctly initialize vmap_area for the given
    vm_struct.

    Also initialise va->vm. If it is not set, find_vm_area() for the early
    vm regions will always fail.

    Signed-off-by: KyongHo Cho
    Cc: "Olav Haugan"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KyongHo
     
  • When holding the mmap_sem for reading, pmd_offset_map_lock should only
    run on a pmd_t that has been read atomically from the pmdp pointer,
    otherwise we may read only half of it leading to this crash.

    PID: 11679 TASK: f06e8000 CPU: 3 COMMAND: "do_race_2_panic"
    #0 [f06a9dd8] crash_kexec at c049b5ec
    #1 [f06a9e2c] oops_end at c083d1c2
    #2 [f06a9e40] no_context at c0433ded
    #3 [f06a9e64] bad_area_nosemaphore at c043401a
    #4 [f06a9e6c] __do_page_fault at c0434493
    #5 [f06a9eec] do_page_fault at c083eb45
    #6 [f06a9f04] error_code (via page_fault) at c083c5d5
    EAX: 01fb470c EBX: fff35000 ECX: 00000003 EDX: 00000100 EBP: 00000000
    DS: 007b ESI: 9e201000 ES: 007b EDI: 01fb4700 GS: 00e0
    CS: 0060 EIP: c083bc14 ERR: ffffffff EFLAGS: 00010246
    #7 [f06a9f38] _spin_lock at c083bc14
    #8 [f06a9f44] sys_mincore at c0507b7d
    #9 [f06a9fb0] system_call at c083becd
    start len
    EAX: ffffffda EBX: 9e200000 ECX: 00001000 EDX: 6228537f
    DS: 007b ESI: 00000000 ES: 007b EDI: 003d0f00
    SS: 007b ESP: 62285354 EBP: 62285388 GS: 0033
    CS: 0073 EIP: 00291416 ERR: 000000da EFLAGS: 00000286

    This should be a longstanding bug affecting x86 32bit PAE without THP.
    Only archs with 64bit large pmd_t and 32bit unsigned long should be
    affected.

    With THP enabled, the barrier() in pmd_none_or_trans_huge_or_clear_bad()
    would partly hide the bug when the pmd transitions from none to stable,
    by forcing a re-read of the *pmd in pmd_offset_map_lock, but with THP
    enabled a new set of problems arises from the fact that the pmd could
    then transition freely among the none, pmd_trans_huge and
    pmd_trans_stable states. So making the barrier in
    pmd_none_or_trans_huge_or_clear_bad() unconditional isn't a good idea;
    it would be a flaky solution.

    This should be fully fixed by introducing a pmd_read_atomic that reads
    the pmd in order with THP disabled, or by reading the pmd atomically
    with cmpxchg8b with THP enabled.
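
    A simplified sketch of the ordered read for the !THP case (close to
    what a 32bit PAE pmd_read_atomic() looks like, but trimmed down):

        /* Read the low word first; only if the pmd is populated read the
         * high word, with a read barrier in between, so a concurrent
         * pmd_populate() on another CPU cannot make us see a torn value. */
        static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
        {
                pmdval_t ret;
                u32 *tmp = (u32 *)pmdp;

                ret = (pmdval_t)*tmp;                        /* low 32 bits */
                if (ret) {
                        smp_rmb();
                        ret |= ((pmdval_t)*(tmp + 1)) << 32; /* high 32 bits */
                }
                return (pmd_t) { ret };
        }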

    Luckily this new race condition only triggers in the places that must
    already be covered by pmd_none_or_trans_huge_or_clear_bad() so the fix
    is localized there but this bug is not related to THP.

    NOTE: this can only trigger on x86 32bit systems with PAE enabled and
    more than 4G of RAM; otherwise the high part of the pmd would be zero
    at all times and never at risk of being truncated, which in turn hides
    the SMP race.

    This bug was discovered and fully debugged by Ulrich, quote:

    ----
    [..]
    pmd_none_or_trans_huge_or_clear_bad() loads the content of edx and
    eax.

    496 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
    497 {
    498         /* depend on compiler for an atomic pmd read */
    499         pmd_t pmdval = *pmd;

    // edi = pmd pointer
    0xc0507a74 : mov 0x8(%esp),%edi
    ...
    // edx = PTE page table high address
    0xc0507a84 : mov 0x4(%edi),%edx
    ...
    // eax = PTE page table low address
    0xc0507a8e : mov (%edi),%eax

    [..]

    Please note that the PMD is not read atomically. These are two "mov"
    instructions where the high order bits of the PMD entry are fetched
    first. Hence, the above machine code is prone to the following race.

    - The PMD entry {high|low} is 0x0000000000000000.
    The "mov" at 0xc0507a84 loads 0x00000000 into edx.

    - A page fault (on another CPU) sneaks in between the two "mov"
    instructions and instantiates the PMD.

    - The PMD entry {high|low} is now 0x00000003fda38067.
    The "mov" at 0xc0507a8e loads 0xfda38067 into eax.
    ----

    Reported-by: Ulrich Obergfell
    Signed-off-by: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Larry Woodman
    Cc: Petr Matousek
    Cc: Rik van Riel
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • The oom_score_adj scale ranges from -1000 to 1000 and represents the
    proportion of memory available to the process at allocation time. This
    means an oom_score_adj value of 300, for example, will bias a process as
    though it was using an extra 30.0% of available memory and a value of
    -350 will discount 35.0% of available memory from its usage.

    The oom killer badness heuristic also uses this scale to report the oom
    score for each eligible process in determining the "best" process to
    kill. Thus, it can only differentiate each process's memory usage by
    0.1% of system RAM.

    On large systems, this can end up being a large amount of memory: 256MB
    on 256GB systems, for example.

    This can be fixed by having the badness heuristic use the actual memory
    usage in scoring threads and then normalizing it to the oom_score_adj
    scale for userspace. This results in better comparison between eligible
    threads for kill and no change from the userspace perspective.
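
    Roughly, the idea looks like this (hypothetical helper, following the
    description above rather than the exact oom_kill.c code):

        /* Score internally by actual usage in pages; the oom_score_adj
         * bias is converted into the same units. */
        static unsigned long badness_pages(unsigned long usage_pages,
                                           int oom_score_adj,
                                           unsigned long totalpages)
        {
                long points = usage_pages +
                              oom_score_adj * (long)totalpages / 1000;

                return points > 0 ? points : 1;
        }

        /* Only when reporting to userspace is the score normalized back
         * to the oom_score_adj scale:
         *     oom_score = badness_pages(...) * 1000 / totalpages
         */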

    Suggested-by: KOSAKI Motohiro
    Tested-by: Dave Jones
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Sometimes we'd like to avoid swapping out anonymous memory. In
    particular, we'd like to avoid swapping out pages of important
    processes or process groups while there is a reasonable amount of
    pagecache in RAM, so that we can satisfy our customers' requirements.

    OTOH, we can control how aggressively the kernel swaps memory pages
    with /proc/sys/vm/swappiness for global reclaim and
    /sys/fs/cgroup/memory/memory.swappiness for each memcg.

    But with the current reclaim implementation, the kernel may swap out
    even if we set swappiness=0 and there is pagecache in RAM.

    This patch changes the behavior with swappiness==0. If we set
    swappiness==0, the kernel does not swap out at all (for global reclaim)
    until the amount of free pages and file-backed pages in a zone has been
    reduced to something very, very small (nr_free + nr_filebacked < high
    watermark).
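
    Conceptually (a hedged sketch of the condition, not the exact
    get_scan_count() change; nr_free, nr_file and scan_anon are
    illustrative names):

        /* With swappiness == 0, skip anon scanning during global reclaim
         * as long as free plus file-backed pages still exceed the zone's
         * high watermark; only below that point may anon be reclaimed. */
        if (swappiness == 0 &&
            nr_free + nr_file > high_wmark_pages(zone))
                scan_anon = false;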

    Signed-off-by: Satoru Moriya
    Acked-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Acked-by: Jerome Marchand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Satoru Moriya
     
  • When called for anonymous (non-shared) mappings, hugetlb_reserve_pages()
    does a resv_map_alloc(). It depends on code in hugetlbfs's
    vm_ops->close() to release that allocation.

    However, in the mmap() failure path, we do a plain unmap_region() without
    the remove_vma() which actually calls vm_ops->close().

    This is a decent fix. This leak could get reintroduced if new code (say,
    after hugetlb_reserve_pages() in hugetlbfs_file_mmap()) decides to return
    an error. But, I think it would have to unroll the reservation anyway.

    Christoph's test case:

    http://marc.info/?l=linux-mm&m=133728900729735

    This patch applies to 3.4 and later. A version for earlier kernels is at
    https://lkml.org/lkml/2012/5/22/418.

    Signed-off-by: Dave Hansen
    Acked-by: Mel Gorman
    Acked-by: KOSAKI Motohiro
    Reported-by: Christoph Lameter
    Tested-by: Christoph Lameter
    Cc: Andrea Arcangeli
    Cc: [2.6.32+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • The objects of "struct bootmem_data_t" are linked together to form a
    doubly-linked list, ordered by their minimal page frame numbers.

    The current implementation implicitly supports the following cases,
    which means the insertion point for the current bootmem data depends on
    how "list_for_each" works. That makes the code a little hard to read.
    Besides, "list_for_each" and "list_entry" can be replaced with
    "list_for_each_entry", as sketched after the list below.

    - The linked list is empty.
    - There is no entry in the linked list whose minimal page frame number
      is bigger than the current one.
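
    A minimal sketch of the simplified insertion (field and list names such
    as node_min_pfn and bdata_list follow the bootmem code of that era, but
    treat them as assumptions):

        static void __init link_bootmem(bootmem_data_t *bdata)
        {
                bootmem_data_t *ent;

                /* Walk the pfn-ordered list and insert before the first
                 * entry with a larger minimal page frame number. */
                list_for_each_entry(ent, &bdata_list, list) {
                        if (bdata->node_min_pfn < ent->node_min_pfn) {
                                list_add_tail(&bdata->list, &ent->list);
                                return;
                        }
                }
                /* Covers both the empty-list case and appending at the
                 * tail when no entry has a larger minimal pfn. */
                list_add_tail(&bdata->list, &bdata_list);
        }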

    Signed-off-by: Gavin Shan
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     
  • Commit 645747462435 ("vmscan: detect mapped file pages used only once")
    made mapped pages get another round on the inactive list because they
    might be just short lived and so we could consider them again next
    time. This heuristic helps to reduce pressure on the active list with
    streaming IO workloads.

    This patch fixes a regression introduced by that commit for heavy
    shmem-based workloads, because unlike anon pages, which are excluded
    from this heuristic since they are usually long lived, shmem pages are
    handled as regular page cache.

    This doesn't work quite well, unfortunately, if the workload is mostly
    backed by shmem (in memory database sitting on 80% of memory) with a
    streaming IO in the background (backup - up to 20% of memory). Anon
    inactive list is full of (dirty) shmem pages when watermarks are hit.
    Shmem pages are kept in the inactive list (they are referenced) in the
    first round and it is hard to reclaim anything else so we reach lower
    scanning priorities very quickly which leads to an excessive swap out.

    Let's fix this by excluding all swap backed pages (they tend to be long
    lived wrt. the regular page cache anyway) from used-once heuristic and
    rather activate them if they are referenced.
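
    A sketch of the idea in page_check_references() (hedged; close to the
    shape of the fix, with details trimmed):

        if (referenced_ptes) {
                /* All swap-backed pages (anon, shmem, tmpfs) are long
                 * lived relative to regular page cache: activate them on
                 * first reference instead of giving them the used-once
                 * second round on the inactive list. */
                if (PageSwapBacked(page))
                        return PAGEREF_ACTIVATE;

                /* ... used-once handling for regular page cache ... */
        }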

    The customer's workload is shmem backed database (80% of RAM) and they
    are measuring transactions/s with an IO in the background (20%).
    Transactions touch more or less random rows in the table. The
    transaction rate fell by a factor of 3 (in the worst case) because of
    commit 64574746. This patch restores the previous numbers.

    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Reviewed-by: Rik van Riel
    Cc: [2.6.34+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko