03 Nov, 2011

1 commit

  • Michel, while working on the working set estimation code, noticed that
    calling get_page_unless_zero() on a random pfn_to_page(random_pfn)
    wasn't safe if the pfn ended up being a tail page of a transparent
    hugepage under splitting by __split_huge_page_refcount().

    He then found that the problem could also theoretically materialize with
    page_cache_get_speculative() during the speculative radix tree lookups
    that use get_page_unless_zero() on SMP, if the radix tree page is freed
    and reallocated and get_user_pages() is called on it before
    page_cache_get_speculative() has a chance to call get_page_unless_zero().

    So the best way to fix the problem is to keep page_tail->_count zero at
    all times. This will guarantee that get_page_unless_zero() can never
    succeed on any tail page. page_tail->_mapcount is guaranteed zero and
    is unused for all tail pages of a compound page, so we can simply
    account the tail page references there and transfer them to
    tail_page->_count in __split_huge_page_refcount() (in addition to the
    head_page->_mapcount).
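
    As a rough illustration of that invariant (a hedged sketch with made-up
    helper names, not the patch itself): tail page references are counted in
    _mapcount, so _count stays zero and get_page_unless_zero() can never
    succeed on a tail page.

        static inline void get_tail_page_ref(struct page *tail)
        {
                atomic_inc(&tail->_mapcount);   /* tail->_count remains 0 */
        }

        /* at split time the accumulated references move over to _count: */
        static void transfer_tail_refs_at_split(struct page *tail)
        {
                /* _mapcount uses -1 as its "unused" base value */
                int refs = atomic_read(&tail->_mapcount) + 1;

                atomic_add(refs, &tail->_count);
                atomic_set(&tail->_mapcount, -1);
        }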

    While debugging this s/_count/_mapcount/ change I also noticed that
    get_page() is called by direct-io.c on pages returned by get_user_pages().
    That wasn't entirely safe because the two atomic_inc() in get_page()
    weren't atomic as a pair. Other get_user_pages() users, like the
    secondary-MMU page fault handlers that establish shadow pagetables, never
    call a superfluous get_page() after get_user_pages() returns. It's safer
    to make get_page() universally safe for tail pages and to use
    get_page_foll() within follow_page() (inside get_user_pages()).
    get_page_foll() is safe to do the refcounting for tail pages without
    taking any locks because it runs within PT-lock-protected critical
    sections (the PT lock for pte and page_table_lock for pmd_trans_huge).

    The standard get_page(), as invoked by direct-io, will instead now take
    the compound_lock, but still only for tail pages. The direct-io paths
    are usually I/O bound and the compound_lock is per THP, so it is very
    fine-grained and there's no risk of scalability issues with it. A simple
    direct-io benchmark with all lockdep prove-locking and spinlock debugging
    infrastructure enabled shows identical performance and no overhead, so
    it's worth it. Ideally direct-io should stop calling get_page() on pages
    returned by get_user_pages() altogether. The spinlock in get_page() is
    already optimized away for no-THP builds, and in any case doing
    get_page() on tail pages returned by GUP is a rare operation and usually
    only run in I/O paths.

    This new refcounting on page_tail->_mapcount, in addition to avoiding new
    RCU critical sections, will also allow the working set estimation code to
    work without any further complexity associated with the tail page
    refcounting under THP.

    Signed-off-by: Andrea Arcangeli
    Reported-by: Michel Lespinasse
    Reviewed-by: Michel Lespinasse
    Reviewed-by: Minchan Kim
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Benjamin Herrenschmidt
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

01 Nov, 2011

3 commits

  • Add comments to explain the page statistics field in the mm_struct.

    [akpm@linux-foundation.org: add missing ;]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Some kernel components (infiniband and perf) pin user space memory by
    increasing the page count and account that memory as "mlocked".

    The difference between mlocking and pinning is:

    A. mlocked pages are marked with PG_mlocked and are exempt from
    swapping. Page migration may move them around though.
    They are kept on a special LRU list.

    B. Pinned pages cannot be moved because something needs to
    directly access physical memory. They may not be on any
    LRU list.

    I recently saw an mlockall()ed process where mm->locked_vm became
    bigger than the virtual size of the process (!) because some
    memory was accounted for twice:

    once when the page was mlocked and once when the Infiniband
    layer increased the refcount because it needed to pin the RDMA
    memory.

    This patch introduces a separate counter for pinned pages and
    accounts them separately.
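
    As a rough sketch of what the separate accounting looks like for a
    pinning driver (an illustrative helper, assuming the new field is
    called pinned_vm):

        /* Hedged sketch: a driver that pins user pages accounts them in
         * the dedicated counter instead of inflating locked_vm. */
        static void account_pinned(struct mm_struct *mm, unsigned long npages)
        {
                down_write(&mm->mmap_sem);
                mm->pinned_vm += npages;  /* previously: mm->locked_vm += npages */
                up_write(&mm->mmap_sem);
        }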

    Signed-off-by: Christoph Lameter
    Cc: Mike Marciniszyn
    Cc: Roland Dreier
    Cc: Sean Hefty
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This removes mm->oom_disable_count entirely since it's unnecessary and
    currently buggy. The counter was intended to be per-process but it's
    currently decremented in the exit path for each thread that exits, causing
    it to underflow.

    The count was originally intended to prevent oom killing threads that
    share memory with threads that cannot be killed since it doesn't lead to
    future memory freeing. The counter could be fixed to represent all
    threads sharing the same mm, but it's better to remove the count since:

    - it is possible that the OOM_DISABLE thread sharing memory with the
    victim is waiting on that thread to exit and will actually cause
    future memory freeing, and

    - there is no guarantee that a thread is disabled from oom killing just
    because another thread sharing its mm is oom disabled.

    Signed-off-by: David Rientjes
    Reported-by: Oleg Nesterov
    Reviewed-by: Oleg Nesterov
    Cc: Ying Han
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

20 Oct, 2011

1 commit

  • A few network drivers currently use skb_frag_struct to describe sub-page
    regions of memory, but I have patches which add additional fields and
    semantics there which these other users do not want.

    A structure for referencing sub-page regions seems like a generally
    useful thing, so add one instead of adding a network-subsystem-specific
    structure.
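
    A purely illustrative sketch of what such a generic descriptor could
    look like (field names here are assumptions, not the structure that was
    actually merged):

        struct sub_page_region {
                struct page *page;      /* containing page                */
                unsigned int offset;    /* byte offset within that page   */
                unsigned int size;      /* length of the region, in bytes */
        };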

    Signed-off-by: Ian Campbell
    Acked-by: Jens Axboe
    Acked-by: David Rientjes
    Signed-off-by: David S. Miller

    Ian Campbell
     

20 Aug, 2011

1 commit

  • Allow filling out the rest of the kmem_cache_cpu cacheline with pointers to
    partial pages. The partial page list is used in slab_free() to avoid
    per node lock taking.

    In __slab_alloc() we can then take multiple partial pages off the per
    node partial list in one go reducing node lock pressure.

    We can also use the per cpu partial list in slab_alloc() to avoid scanning
    partial lists for pages with free objects.

    The main effect of a per cpu partial list is that the per node list_lock
    is taken for batches of partial pages instead of individual ones.
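
    A hedged sketch of the batching idea (pop_first_partial() and
    push_cpu_partial() are illustrative stand-ins, not SLUB functions):

        /*
         * Refill under one list_lock acquisition so that later allocations
         * can be served from the per-cpu partial list without touching the
         * per-node lock at all.
         */
        static struct page *get_partial_batch(struct kmem_cache_node *n,
                                              struct kmem_cache_cpu *c,
                                              int batch)
        {
                struct page *page, *first = NULL;

                spin_lock(&n->list_lock);     /* once per batch, not per page */
                while (batch-- > 0 && (page = pop_first_partial(n)) != NULL) {
                        if (!first)
                                first = page; /* allocate from this one now */
                        else
                                push_cpu_partial(c, page);
                }
                spin_unlock(&n->list_lock);
                return first;
        }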

    Potential future enhancements:

    1. The pickup from the partial list could perhaps be done without
    disabling interrupts, with some work. The free path already puts the
    page into the per cpu partial list without disabling interrupts.

    2. __slab_free() may have some code paths that could use optimization.

    Performance:

                                       Before        After
    ./hackbench 100 process 200000
        Time:                        1953.047     1564.614
    ./hackbench 100 process 20000
        Time:                         207.176      156.940
    ./hackbench 100 process 20000
        Time:                         204.468      156.940
    ./hackbench 100 process 20000
        Time:                         204.879      158.772
    ./hackbench 10 process 20000
        Time:                          20.153       15.853
    ./hackbench 10 process 20000
        Time:                          20.153       15.986
    ./hackbench 10 process 20000
        Time:                          19.363       16.111
    ./hackbench 1 process 20000
        Time:                           2.518        2.307
    ./hackbench 1 process 20000
        Time:                           2.258        2.339
    ./hackbench 1 process 20000
        Time:                           2.864        2.163

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

08 Jul, 2011

1 commit

  • On Wed, 6 Jul 2011, Jonathan Cameron wrote:

    > Getting:
    >
    > CHK include/linux/version.h
    > CHK include/generated/utsrelease.h
    > make[1]: `include/generated/mach-types.h' is up to date.
    > CC arch/arm/kernel/asm-offsets.s
    > In file included from include/linux/sched.h:64:0,
    > from arch/arm/kernel/asm-offsets.c:13:
    > include/linux/mm_types.h:74:15: error: duplicate member '_count'
    > make[1]: *** [arch/arm/kernel/asm-offsets.s] Error 1
    > make: *** [prepare0] Error 2
    >
    > Issue looks to have been introduced by
    >
    > mm: Rearrange struct page
    >
    > fc9bb8c768abe7ae10861c3510e01a95f98d5933
    >
    > Guessing it's a known issue, but just thought I'd flag it up in case
    > it's something very specific about my build.
    >
    > gcc-2.6 armv7a
    >
    > Reverting that patch works, but given I don't know the history, I'm
    > not proposing doing that in general!

    Well _count exists in two unionized structs but always has the same offset
    within the larger struct. Maybe ARM creates different offsets there for
    some reason?

    The following is a patch to restructure the union / structs combo in such
    a way that only a single definition of _count remains.
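
    The restructured layout ends up roughly like this hedged sketch
    (abbreviated; the real header carries more fields and comments), with
    _count declared exactly once alongside the union it used to be
    duplicated in:

        struct {                        /* second doubleword of struct page */
                union {
                        atomic_t _mapcount;   /* mapping count, normal pages */
                        struct {              /* SLUB's view of the same bits */
                                unsigned inuse:16;
                                unsigned objects:15;
                                unsigned frozen:1;
                        };
                };
                atomic_t _count;        /* single definition, same offset */
        };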

    Reported-by: Jonathan Cameron
    Tested-by: Piotr Hosowicz
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

02 Jul, 2011

2 commits

  • We need to be able to use cmpxchg_double on the freelist and object count
    field in struct page. Rearrange the fields in struct page according to
    doubleword entities so that the freelist pointer comes before the counters.
    Do the rearranging with a future in mind where we use more doubleword
    atomics to avoid locking of updates to flags/mapping or lru pointers.

    Create another union to allow access to counters in struct page as a
    single unsigned long value.

    The doublewords must be properly aligned for cmpxchg_double to work.
    Sadly this increases the size of page struct by one word on some architectures.
    But as a result, page structs are now cacheline aligned on x86_64.
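
    As a hedged illustration of why the layout matters (the helper below is
    hypothetical; cmpxchg_double() is the real primitive), the freelist
    pointer and the counters word can then be replaced together in one
    atomic operation:

        static inline bool update_freelist_and_counters(struct page *page,
                        void *old_fl, unsigned long old_cnt,
                        void *new_fl, unsigned long new_cnt)
        {
                /* requires freelist and counters to form one naturally
                 * aligned doubleword, hence the rearrangement above */
                return cmpxchg_double(&page->freelist, &page->counters,
                                      old_fl, old_cnt, new_fl, new_cnt);
        }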

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • Do not use a page flag for the frozen bit. It needs to be part
    of the state that is handled with cmpxchg_double(). So use a bit
    in the counter struct in the page struct for that purpose.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

30 May, 2011

1 commit

  • Thomas Gleixner reports that we now have a boot crash triggered by
    CONFIG_CPUMASK_OFFSTACK=y:

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] find_next_bit+0x55/0xb0
    Call Trace:
    [] cpumask_any_but+0x2a/0x70
    [] flush_tlb_mm+0x2b/0x80
    [] pud_populate+0x35/0x50
    [] pgd_alloc+0x9a/0xf0
    [] mm_init+0xec/0x120
    [] mm_alloc+0x53/0xd0

    which was introduced by commit de03c72cfce5 ("mm: convert
    mm->cpu_vm_cpumask into cpumask_var_t"), and is due to wrong ordering of
    mm_init() vs mm_init_cpumask().

    Thomas wrote a patch to just fix the ordering of initialization, but I
    hate the new double allocation in the fork path, so I ended up instead
    doing some more radical surgery to clean it all up.
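
    The shape of the cleanup ends up roughly like this hedged sketch
    (simplified; the real code also handles the !CONFIG_CPUMASK_OFFSTACK
    case, where cpu_vm_mask_var is just a plain embedded cpumask):

        struct mm_struct {
                /* ... */
                cpumask_var_t cpu_vm_mask_var;
        #ifdef CONFIG_CPUMASK_OFFSTACK
                struct cpumask cpumask_allocation;  /* storage embedded in the
                                                     * mm itself, so no second
                                                     * allocation in fork     */
        #endif
        };

        static inline void mm_init_cpumask(struct mm_struct *mm)
        {
        #ifdef CONFIG_CPUMASK_OFFSTACK
                mm->cpu_vm_mask_var = &mm->cpumask_allocation;
        #endif
        }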

    Reported-by: Thomas Gleixner
    Reported-by: Ingo Molnar
    Cc: KOSAKI Motohiro
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

27 May, 2011

2 commits

  • Setup and cleanup of mm_struct->exe_file is currently done in fs/proc/.
    This was because exe_file was needed only for /proc/<pid>/exe. Since we
    will also need the exe_file functionality for core dumps (so the core
    name can contain the full binary path), build this functionality into
    the kernel unconditionally.

    To achieve that, move it out of proc FS into kernel/, where in fact it
    belongs. By doing that we can make dup_mm_exe_file static. Also we can
    drop the linux/proc_fs.h inclusion in fs/exec.c and kernel/fork.c.

    Signed-off-by: Jiri Slaby
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     
  • The type of vma->vm_flags is 'unsigned long', not 'int' nor
    'unsigned int'. This patch fixes such misuse.
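
    A hedged sketch of the typedef approach mentioned in the bracketed note
    below (the helper is hypothetical):

        typedef unsigned long vm_flags_t;   /* leaves room to grow to 64 bits */

        /* interfaces take the full-width type instead of a truncating int: */
        static inline int vma_flags_set(struct vm_area_struct *vma,
                                        vm_flags_t flags)
        {
                return (vma->vm_flags & flags) == flags;
        }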

    Signed-off-by: KOSAKI Motohiro
    [ Changed to use a typedef - we'll extend it to cover more cases
    later, since there has been discussion about making it a 64-bit
    type.. - Linus ]
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

25 May, 2011

3 commits

  • The problem with having two different types of counters is that developers
    adding new code need to keep in mind whether it's safe to use both the
    atomic and non-atomic implementations. For example, when adding new
    callers of the *_mm_counter() functions a developer needs to ensure that
    those paths are always executed with page_table_lock held, in case we're
    using the non-atomic implementation of mm counters.

    Hugh Dickins introduced the atomic mm counters in commit f412ac08c986
    ("[PATCH] mm: fix rss and mmlist locking"). When asked why he left the
    non-atomic counters around he said,

    | The only reason was to avoid adding costly atomic operations into a
    | configuration that had no need for them there: the page_table_lock
    | sufficed.
    |
    | Certainly it would be simpler just to delete the non-atomic variant.
    |
    | And I think it's fair to say that any configuration on which we're
    | measuring performance to that degree (rather than "does it boot fast?"
    | type measurements), would already be going the split ptlocks route.

    Removing the non-atomic counters eases the maintenance burden because
    developers no longer have to be mindful of the two implementations when
    using *_mm_counter().

    Note that all architectures provide a means of atomically updating
    atomic_long_t variables, even if they have to revert to the generic
    spinlock implementation because they don't support 64-bit atomic
    instructions (see lib/atomic64.c).
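
    A hedged sketch of the single remaining (atomic) flavour (simplified;
    the actual field names may differ slightly):

        static inline void add_mm_counter(struct mm_struct *mm,
                                          int member, long value)
        {
                atomic_long_add(value, &mm->rss_stat.count[member]);
        }

        static inline unsigned long get_mm_counter(struct mm_struct *mm,
                                                   int member)
        {
                long val = atomic_long_read(&mm->rss_stat.count[member]);

                return val > 0 ? val : 0;  /* transient negatives read as 0 */
        }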

    Signed-off-by: Matt Fleming
    Acked-by: Dave Hansen
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Fleming
     
  • cpumask_t is a very big struct and cpu_vm_mask is placed at the wrong
    position in mm_struct, which may reduce the cache hit ratio.

    This patch makes two changes:
    1) Move the cpumask to the end of mm_struct, because usually only the
    front bits of a cpumask are accessed when the system has cpu-hotplug
    capability.
    2) Convert cpu_vm_mask into cpumask_var_t. It may help to reduce the
    memory footprint if cpumask_size() uses nr_cpumask_bits properly in the
    future.

    In addition, this patch renames cpu_vm_mask to cpu_vm_mask_var. It may
    help to detect out-of-tree cpu_vm_mask users.

    This patch has no functional change.

    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KOSAKI Motohiro
    Cc: David Howells
    Cc: Koichi Yasutake
    Cc: Hugh Dickins
    Cc: Chris Metcalf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Hugh says:
    "The only significant loser, I think, would be page reclaim (when
    concurrent with truncation): could spin for a long time waiting for
    the i_mmap_mutex it expects would soon be dropped? "

    Counter points:
    - cpu contention makes the spin stop (need_resched())
    - zap pages should be freeing pages at a higher rate than reclaim
    ever can

    I think the simplification of the truncate code is definitely worth it.

    Effectively reverts: 2aa15890f3c ("mm: prevent concurrent
    unmap_mapping_range() on the same inode") and takes out the code that
    caused its problem.

    Signed-off-by: Peter Zijlstra
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

23 Mar, 2011

1 commit

  • Reorder mm_struct to remove 16 bytes of alignment padding on 64 bit
    builds. On my config this shrinks mm_struct by enough to fit in one
    fewer cache lines and allows more objects per slab in mm_struct
    kmem_cache under SLUB.

    slabinfo before patch :-
    Sizes (bytes)            Slabs
    --------------------------------
    Object : 848             Total  : 9
    SlabObj: 896             Full   : 2
    SlabSiz: 16384           Partial: 5
    Loss   : 48              CpuSlab: 2
    Align  : 64              Objects: 18

    slabinfo after :-
    Sizes (bytes)            Slabs
    --------------------------------
    Object : 832             Total  : 7
    SlabObj: 832             Full   : 2
    SlabSiz: 16384           Partial: 3
    Loss   : 0               CpuSlab: 2
    Align  : 64              Objects: 19
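
    For readers unfamiliar with why reordering removes padding, a tiny hedged
    example (not from the patch) of how member order changes a struct's size
    on a 64-bit build:

        struct padded { int a; void *p; int b; };  /* 4 + 4(pad) + 8 + 4 + 4(pad) = 24 */
        struct packed { void *p; int a; int b; };  /* 8 + 4 + 4                  = 16 */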

    Signed-off-by: Richard Kennedy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Kennedy
     

14 Jan, 2011

1 commit

  • This increases the size of the mm struct a bit, but it is needed to
    preallocate one pte for each hugepage so that split_huge_page will not
    require a fail path. Guarantee of success is a fundamental property of
    split_huge_page: it avoids decreasing swapping reliability and avoids
    adding -ENOMEM fail paths that would otherwise force the hugepage-unaware
    VM code to learn rolling back in the middle of its pte mangling
    operations (if anything, we need it to learn to handle pmd_trans_huge
    natively rather than to be capable of rollback). When split_huge_page
    runs, a pte page is needed for the split to succeed, to map the newly
    split regular pages with regular ptes. This way all existing VM code
    remains backwards compatible by just adding a split_huge_page* one-liner.
    The memory waste of those preallocated ptes is negligible, so it is
    worth it.
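
    A hedged sketch of the preallocation scheme (the deposit/withdraw helpers
    are illustrative names; the spare page-table page is kept per mm under
    page_table_lock):

        /* at hugepage instantiation time: reserve the pte page up front */
        pgtable_t pgtable = pte_alloc_one(mm, haddr);
        if (unlikely(!pgtable))
                return VM_FAULT_OOM;          /* fail here, never at split time */
        deposit_spare_pgtable(mm, pgtable);   /* illustrative helper */

        /* at split_huge_page() time: guaranteed to succeed */
        pgtable = withdraw_spare_pgtable(mm); /* illustrative helper */
        pmd_populate(mm, pmd, pgtable);       /* map the regular ptes */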

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

27 Oct, 2010

1 commit

  • It's pointless to kill a task if another thread sharing its mm cannot be
    killed to allow future memory freeing. A subsequent patch will prevent
    kills in such cases, but first it's necessary to have a way to flag a task
    that shares memory with an OOM_DISABLE task that doesn't incur an
    additional tasklist scan, which would make select_bad_process() an O(n^2)
    function.

    This patch adds an atomic counter to struct mm_struct that follows how
    many threads attached to it have an oom_score_adj of OOM_SCORE_ADJ_MIN.
    They cannot be killed by the kernel, so their memory cannot be freed in
    oom conditions.

    This only requires task_lock() on the task that we're operating on, it
    does not require mm->mmap_sem since task_lock() pins the mm and the
    operation is atomic.

    [rientjes@google.com: changelog and sys_unshare() code]
    [rientjes@google.com: protect oom_disable_count with task_lock in fork]
    [rientjes@google.com: use old_mm for oom_disable_count in exec]
    Signed-off-by: Ying Han
    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     

21 Aug, 2010

1 commit

  • It's a really simple list, and several of the users want to go backwards
    in it to find the previous vma. So rather than have to look up the
    previous entry with 'find_vma_prev()' or something similar, just make it
    doubly linked instead.
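
    A hedged sketch of the change (abbreviated):

        struct vm_area_struct {
                struct vm_area_struct *vm_next, *vm_prev;  /* now doubly linked */
                /* ... */
        };

        /* callers that previously needed find_vma_prev() can simply do: */
        struct vm_area_struct *prev = vma->vm_prev;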

    Tested-by: Ian Campbell
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

20 Aug, 2010

1 commit

  • This adds annotations for RCU operations in core kernel components

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Paul E. McKenney
    Cc: Al Viro
    Cc: Jens Axboe
    Cc: Andrew Morton
    Reviewed-by: Josh Triplett

    Arnd Bergmann
     

13 Mar, 2010

1 commit

  • Commit 34e55232e59f7b19050267a05ff1226e5cd122a5 ("mm: avoid false sharing
    of mm_counter") added sync_mm_rss() for syncing loosely accounted rss
    counters. It's for CONFIG_MMU, but sync_mm_rss() is called even in the
    NOMMU environment (kernel/exit.c, fs/exec.c). The above commit doesn't
    handle that well.

    This patch makes SPLIT_RSS_COUNTING depend on SPLIT_PTLOCKS && CONFIG_MMU.

    And to avoid unnecessary function calls, sync_mm_rss() is changed to an
    inline noop function in the header file when the feature is disabled.
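
    A hedged sketch of the resulting arrangement (simplified):

        #if defined(SPLIT_RSS_COUNTING)   /* now implies SPLIT_PTLOCKS && MMU */
        void sync_mm_rss(struct task_struct *task, struct mm_struct *mm);
        #else
        static inline void sync_mm_rss(struct task_struct *task,
                                       struct mm_struct *mm)
        {
                /* nothing to flush without the split counters */
        }
        #endif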

    Reported-by: David Howells
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Mike Frysinger
    Signed-off-by: Michal Simek
    Signed-off-by: David Howells
    Cc: Greg Ungerer
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

07 Mar, 2010

4 commits

  • The old anon_vma code can lead to scalability issues with heavily forking
    workloads. Specifically, each anon_vma will be shared between the parent
    process and all its child processes.

    In a workload with 1000 child processes and a VMA with 1000 anonymous
    pages per process that get COWed, this leads to a system with a million
    anonymous pages in the same anon_vma, each of which is mapped in just one
    of the 1000 processes. However, the current rmap code needs to walk them
    all, leading to O(N) scanning complexity for each page.

    This can result in systems where one CPU is walking the page tables of
    1000 processes in page_referenced_one, while all other CPUs are stuck on
    the anon_vma lock. This leads to catastrophic failure for a benchmark
    like AIM7, where the total number of processes can reach in the tens of
    thousands. Real workloads are still a factor 10 less process intensive
    than AIM7, but they are catching up.

    This patch changes the way anon_vmas and VMAs are linked, which allows us
    to associate multiple anon_vmas with a VMA. At fork time, each child
    process gets its own anon_vmas, in which its COWed pages will be
    instantiated. The parents' anon_vma is also linked to the VMA, because
    non-COWed pages could be present in any of the children.

    This reduces rmap scanning complexity to O(1) for the pages of the 1000
    child processes, with O(N) complexity for at most 1/N pages in the system.
    This reduces the average scanning cost in heavily forking workloads from
    O(N) to 2.

    The only real complexity in this patch stems from the fact that linking a
    VMA to anon_vmas now involves memory allocations. This means vma_adjust
    can fail, if it needs to attach a VMA to anon_vma structures. This in
    turn means error handling needs to be added to the calling functions.

    A second source of complexity is that, because there can be multiple
    anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
    "the" anon_vma lock. To prevent the rmap code from walking up an
    incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
    flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
    to make sure it is impossible to compile a kernel that needs both symbolic
    values for the same bitflag.
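
    A hedged sketch of the linking object that implements the many-to-many
    association (abbreviated):

        struct anon_vma_chain {
                struct vm_area_struct *vma;     /* the VMA this link belongs to     */
                struct anon_vma *anon_vma;      /* one of that VMA's anon_vmas      */
                struct list_head same_vma;      /* chain of all anon_vmas of a VMA  */
                struct list_head same_anon_vma; /* chain of all VMAs of an anon_vma */
        };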

    Some test results:

    Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
    box with 16GB RAM and not quite enough IO), the system ends up running
    >99% in system time, with every CPU on the same anon_vma lock in the
    pageout code.

    With these changes, AIM7 hits the cross-over point around 29.7k users.
    This happens with ~99% IO wait time, there never seems to be any spike in
    system time. The anon_vma lock contention appears to be resolved.

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Cc: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • A frequent question from users about memory management is how many swap
    entries are used by a process. This information also gives some hints
    to the oom-killer.

    Although we can count the number of swap entries per process by scanning
    /proc/<pid>/smaps, this is very slow and not suitable for the usual
    process-information handlers which work like 'ps' or 'top' (ps and top
    are already slow enough..).

    This patch adds a counter of swap entries to mm_counter and updates it at
    each swap event. The information is exported via the /proc/<pid>/status
    file as

    [kamezawa@bluextal memory]$ cat /proc/self/status
    Name: cat
    State: R (running)
    Tgid: 2910
    Pid: 2910
    PPid: 2823
    TracerPid: 0
    Uid: 500 500 500 500
    Gid: 500 500 500 500
    FDSize: 256
    Groups: 500
    VmPeak: 82696 kB
    VmSize: 82696 kB
    VmLck: 0 kB
    VmHWM: 432 kB
    VmRSS: 432 kB
    VmData: 172 kB
    VmStk: 84 kB
    VmExe: 48 kB
    VmLib: 1568 kB
    VmPTE: 40 kB
    VmSwap: 0 kB

    Reviewed-by: Minchan Kim
    Reviewed-by: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Considering the nature of per-mm stats, they are objects shared among
    threads and can be a cache-miss point in the page fault path.

    This patch adds a per-thread cache for mm_counter. RSS values are
    counted into a struct in task_struct and synchronized with the mm's
    counters at certain events.

    In this patch, the event is the number of calls to handle_mm_fault:
    the per-thread value is added to the mm at every 64 calls.
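
    A hedged sketch of the caching scheme (simplified; the real threshold
    macro and field names may differ):

        struct task_rss_stat {
                int events;                 /* flush trigger counter */
                int count[NR_MM_COUNTERS];  /* per-thread deltas     */
        };

        static void check_sync_rss_stat(struct task_struct *task)
        {
                if (unlikely(task->rss_stat.events++ > 64)) {
                        /* fold the thread-local deltas into the mm counters */
                        sync_mm_rss(task, task->mm);
                        task->rss_stat.events = 0;
                }
        }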

    A rough estimation with a small benchmark on parallel threads (2 threads)
    shows:
    [before]
    4.5 cache-miss/faults
    [after]
    4.0 cache-miss/faults
    Anyway, the most contended object is mmap_sem if the number of threads grows.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Presently, the per-mm statistics counters are defined by macros in sched.h.

    This patch modifies them to be
    - defined in mm.h as inline functions
    - backed by an array instead of macro-generated names.

    This patch is for reducing the size of a future patch that modifies the
    implementation of the per-mm counters.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

07 Jan, 2010

1 commit

  • When working with FDPIC, there are many shared mappings of read-only
    code regions between applications (the C library, applet packages like
    busybox, etc.), but the current do_mmap_pgoff() function will issue an
    icache flush whenever a VMA is added to an MM instead of only doing it
    when the map is initially created.

    The flush can instead be done when a region is first mmapped PROT_EXEC.
    Note that we may not rely on the first mapping of a region being
    executable - it's possible for it to be PROT_READ only, so we have to
    remember whether we've flushed the region or not, and then flush the
    entire region when a bit of it is made executable.

    However, this also affects the brk area. That will no longer be
    executable. We can mprotect() it to PROT_EXEC on MPU-mode kernels, but
    for NOMMU mode kernels, when it increases the brk allocation, making
    sys_brk() flush the extra from the icache should suffice. The brk area
    probably isn't used by NOMMU programs since the brk area can only use up
    the leavings from the stack allocation, where the stack allocation is
    larger than requested.

    Signed-off-by: David Howells
    Signed-off-by: Mike Frysinger
    Signed-off-by: Linus Torvalds

    Mike Frysinger
     

24 Sep, 2009

2 commits

  • Because the binfmt is not different between threads in the same process,
    it can be moved from task_struct to mm_struct. The binfmt module is then
    handled per mm_struct instead of per task_struct.

    Signed-off-by: Hiroshi Shimamoto
    Acked-by: Oleg Nesterov
    Cc: Rusty Russell
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hiroshi Shimamoto
     
  • ->ioctx_lock and ->ioctx_list are used only under CONFIG_AIO.

    Signed-off-by: Alexey Dobriyan
    Cc: Zach Brown
    Cc: Benjamin LaHaise
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

19 Aug, 2009

1 commit

  • Commit 2ff05b2b (oom: move oom_adj value) moved the oom_adj value to the
    mm_struct. It was a very good first step toward sanitizing OOM.

    However, Paul Menage reported that the commit causes a regression in his
    job scheduler: the current OOM logic can kill an OOM_DISABLED process.

    Why? His program has code similar to the following.

    ...
    set_oom_adj(OOM_DISABLE);   /* the job scheduler is never killed by oom */
    ...
    if (vfork() == 0) {
            set_oom_adj(0);     /* the invoked child can be killed */
            execve("foo-bar-cmd");
    }
    ....

    vfork() parent and child share the same mm_struct, so the above
    set_oom_adj(0) doesn't change oom_adj only for the vfork() child, it
    also changes it for the vfork() parent. The vfork() parent (the job
    scheduler) thus lost its OOM immunity and was killed.

    Actually, the fork-set-exec idiom is used very frequently in userland
    programs. We must not break this assumption.

    Therefore, this patch reverts commit 2ff05b2b and the related commits.

    Reverted commit list
    ---------------------
    - commit 2ff05b2b4e (oom: move oom_adj value from task_struct to mm_struct)
    - commit 4d8b9135c3 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE)
    - commit 8123681022 (oom: only oom kill exiting tasks with attached memory)
    - commit 933b787b57 (mm: copy over oom_adj value at fork time)

    Signed-off-by: KOSAKI Motohiro
    Cc: Paul Menage
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Nick Piggin
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

17 Jun, 2009

2 commits

  • * akpm: (182 commits)
    fbdev: bf54x-lq043fb: use kzalloc over kmalloc/memset
    fbdev: *bfin*: fix __dev{init,exit} markings
    fbdev: *bfin*: drop unnecessary calls to memset
    fbdev: bfin-t350mcqb-fb: drop unused local variables
    fbdev: blackfin has __raw I/O accessors, so use them in fb.h
    fbdev: s1d13xxxfb: add accelerated bitblt functions
    tcx: use standard fields for framebuffer physical address and length
    fbdev: add support for handoff from firmware to hw framebuffers
    intelfb: fix a bug when changing video timing
    fbdev: use framebuffer_release() for freeing fb_info structures
    radeon: P2G2CLK_ALWAYS_ONb tested twice, should 2nd be P2G2CLK_DAC_ALWAYS_ONb?
    s3c-fb: CPUFREQ frequency scaling support
    s3c-fb: fix resource releasing on error during probing
    carminefb: fix possible access beyond end of carmine_modedb[]
    acornfb: remove fb_mmap function
    mb862xxfb: use CONFIG_OF instead of CONFIG_PPC_OF
    mb862xxfb: restrict compliation of platform driver to PPC
    Samsung SoC Framebuffer driver: add Alpha Channel support
    atmel-lcdc: fix pixclock upper bound detection
    offb: use framebuffer_alloc() to allocate fb_info struct
    ...

    Manually fix up conflicts due to kmemcheck in mm/slab.c

    Linus Torvalds
     
  • The per-task oom_adj value is a characteristic of its mm more than the
    task itself since it's not possible to oom kill any thread that shares the
    mm. If a task were to be killed while attached to an mm that could not be
    freed because another thread were set to OOM_DISABLE, it would have
    needlessly been terminated since there is no potential for future memory
    freeing.

    This patch moves oomkilladj (now more appropriately named oom_adj) from
    struct task_struct to struct mm_struct. This requires task_lock() on a
    task to check its oom_adj value to protect against exec, but it's already
    necessary to take the lock when dereferencing the mm to find the total VM
    size for the badness heuristic.

    This fixes a livelock if the oom killer chooses a task and another thread
    sharing the same memory has an oom_adj value of OOM_DISABLE. This occurs
    because oom_kill_task() repeatedly returns 1 and refuses to kill the
    chosen task while select_bad_process() will repeatedly choose the same
    task during the next retry.

    Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in
    oom_kill_task() to check for threads sharing the same memory will be
    removed in the next patch in this series where it will no longer be
    necessary.

    Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since
    these threads are immune from oom killing already. They simply report an
    oom_adj value of OOM_DISABLE.

    Cc: Nick Piggin
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

13 Jun, 2009

1 commit

  • General description: kmemcheck is a patch to the linux kernel that
    detects use of uninitialized memory. It does this by trapping every
    read and write to memory that was allocated dynamically (e.g. using
    kmalloc()). If a memory address is read that has not previously been
    written to, a message is printed to the kernel log.

    Thanks to Andi Kleen for the set_memory_4k() solution.

    Andrew Morton suggested documenting the shadow member of struct page.

    Signed-off-by: Vegard Nossum
    Signed-off-by: Pekka Enberg

    [export kmemcheck_mark_initialized]
    [build fix for setup_max_cpus]
    Signed-off-by: Ingo Molnar

    [rebased for mainline inclusion]
    Signed-off-by: Vegard Nossum

    Vegard Nossum
     

03 Apr, 2009

1 commit

  • This fixes a build failure with generic debug pagealloc:

    mm/debug-pagealloc.c: In function 'set_page_poison':
    mm/debug-pagealloc.c:8: error: 'struct page' has no member named 'debug_flags'
    mm/debug-pagealloc.c: In function 'clear_page_poison':
    mm/debug-pagealloc.c:13: error: 'struct page' has no member named 'debug_flags'
    mm/debug-pagealloc.c: In function 'page_poison':
    mm/debug-pagealloc.c:18: error: 'struct page' has no member named 'debug_flags'
    mm/debug-pagealloc.c: At top level:
    mm/debug-pagealloc.c:120: error: redefinition of 'kernel_map_pages'
    include/linux/mm.h:1278: error: previous definition of 'kernel_map_pages' was here
    mm/debug-pagealloc.c: In function 'kernel_map_pages':
    mm/debug-pagealloc.c:122: error: 'debug_pagealloc_enabled' undeclared (first use in this function)

    by fixing

    - debug_flags should be in struct page
    - define DEBUG_PAGEALLOC config option for all architectures
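
    A hedged sketch of the struct page side of the fix (simplified; the
    config symbol guarding the field is stated as an assumption here):

        struct page {
                /* ... */
        #ifdef CONFIG_WANT_PAGE_DEBUG_FLAGS
                unsigned long debug_flags;  /* carries e.g. the poison bit used
                                             * by mm/debug-pagealloc.c          */
        #endif
        };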

    Signed-off-by: Akinobu Mita
    Reported-by: Alexander Beregalov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita