24 Jun, 2009

2 commits

  • If a kthread happens to use get_user_pages() on an mm (as KSM does),
    there's a chance that it will end up trying to read in a swap page, then
    oops in grab_swap_token() because the kthread has no mm: GUP passes down
    the right mm, so grab_swap_token() ought to be using it.

    We have not identified a stronger case than KSM's daemon (not yet in
    mainline), but the issue must have come up before, since RHEL has included
    a fix for this for years (though a different fix, they just back out of
    grab_swap_token if current->mm is unset: which is what we first proposed,
    but using the right mm here seems more correct).
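
    [ For illustration, a minimal sketch of the idea; the exact lines are
    assumed rather than quoted from the patch: grab_swap_token() takes the
    faulting mm as an argument, and the swap-fault path passes down the mm
    that GUP supplied instead of dereferencing current->mm. ]

        /* swap-fault path, which already has the right mm: */
        grab_swap_token(mm);    /* was grab_swap_token(), which used current->mm */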

    Reported-by: Izik Eidus
    Signed-off-by: Johannes Weiner
    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Indeed FOLL_WRITE matches FAULT_FLAG_WRITE, matches GUP_FLAGS_WRITE,
    and it's tempting to devise a set of Grand Unified Paging flags;
    but not today. So until then, let's rely upon the compiler to spot
    the coincidence, "rather than have that subtle dependency and a
    comment for it" - as you remarked in another context yesterday.
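
    [ A sketch of the kind of explicit conversion this describes; variable
    names are assumed from the surrounding GUP code. The compiler can fold
    the translation away wherever the flag values happen to coincide: ]

        unsigned int fault_flags = 0;

        if (foll_flags & FOLL_WRITE)
                fault_flags |= FAULT_FLAG_WRITE;    /* folds to a no-op if the values match */

        ret = handle_mm_fault(mm, vma, start, fault_flags);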

    Signed-off-by: Hugh Dickins
    Acked-by: Wu Fengguang
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

22 Jun, 2009

2 commits

  • This allows the callers to now pass down the full set of FAULT_FLAG_xyz
    flags to handle_mm_fault(). All callers have been (mechanically)
    converted to the new calling convention; there's almost certainly room
    for architectures to clean up their code and then add FAULT_FLAG_RETRY
    when that support is added.
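
    [ A sketch of a mechanically converted caller -- a hypothetical arch
    fault handler; the 'write' flag and the label are illustrative: ]

        fault = handle_mm_fault(mm, vma, address,
                                write ? FAULT_FLAG_WRITE : 0);
        if (fault & VM_FAULT_OOM)
                goto out_of_memory;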

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • The fault handling routines really want more fine-grained flags than a
    single "was it a write fault" boolean - the callers will want to set
    flags like "you can return a retry error" etc.

    And that's actually how the VM works internally, but right now the
    top-level fault handling functions in mm/memory.c all pass just the
    'write_access' boolean around.

    This switches them over to pass around the FAULT_FLAG_xyzzy 'flags'
    variable instead. The 'write_access' calling convention still exists
    for the exported 'handle_mm_fault()' function, but that is next.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

17 Jun, 2009

4 commits

  • Analogous to follow_phys(), add a helper that looks up the PFN at a
    user virtual address in an IO mapping or a raw PFN mapping.
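
    [ A usage sketch, assuming the helper is named follow_pfn() by analogy
    with follow_phys(); error handling abbreviated: ]

        unsigned long pfn;

        if (follow_pfn(vma, address, &pfn))
                return -EINVAL;         /* not an IO/PFN mapping, or no pte there */
        /* pfn now identifies the physical frame backing 'address' */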

    Signed-off-by: Johannes Weiner
    Cc: Christoph Hellwig
    Acked-by: Magnus Damm
    Cc: Hans Verkuil
    Cc: Paul Mundt
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Signed-off-by: Johannes Weiner
    Cc: Christoph Hellwig
    Acked-by: Magnus Damm
    Cc: Hans Verkuil
    Cc: Paul Mundt
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • A generic readonly page table lookup helper to map an address space and an
    address from it to a pte.
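
    [ A usage sketch; the helper name follow_pte() and the convention of
    handing back the pte together with the lock that guards it are assumed: ]

        pte_t *ptep, pte;
        spinlock_t *ptl;

        if (follow_pte(vma->vm_mm, address, &ptep, &ptl))
                return -EINVAL;
        pte = *ptep;                    /* read-only snapshot of the entry */
        pte_unmap_unlock(ptep, ptl);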

    Signed-off-by: Johannes Weiner
    Cc: Christoph Hellwig
    Acked-by: Magnus Damm
    Cc: Hans Verkuil
    Cc: Paul Mundt
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Move more documentation for get_user_pages_fast into the new kerneldoc comment.
    Add some comments for get_user_pages as well.

    Also, move get_user_pages_fast declaration up to get_user_pages. It wasn't
    there initially because it was once a static inline function.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Nick Piggin
    Cc: Andy Grover
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

03 May, 2009

2 commits

  • Change page_mkwrite to allow implementations to return with the page
    locked, and also change its callers (in page fault paths) to hold the
    lock until the page is marked dirty. This allows the filesystem to have
    full control of page dirtying events coming from the VM.

    Rather than simply hold the page locked over the page_mkwrite call, we
    call page_mkwrite with the page unlocked and allow callers to return with
    it locked, so filesystems can avoid LOR conditions with page lock.

    The problem with the current scheme is this: a filesystem that wants to
    associate some metadata with a page as long as the page is dirty, will
    perform this manipulation in its ->page_mkwrite. It currently then must
    return with the page unlocked and may not hold any other locks (according
    to existing page_mkwrite convention).

    In this window, the VM could write out the page, clearing page-dirty. The
    filesystem has no good way to detect that a dirty pte is about to be
    attached, so it will happily write out the page, at which point, the
    filesystem may manipulate the metadata to reflect that the page is no
    longer dirty.

    It is not always possible to perform the required metadata manipulation in
    ->set_page_dirty, because that function cannot block or fail. The
    filesystem may need to allocate some data structure, for example.

    And the VM cannot mark the pte dirty before page_mkwrite, because
    page_mkwrite is allowed to fail, so we must not allow any window where the
    page could be written to if page_mkwrite does fail.

    This solution of holding the page locked over the 3 critical operations
    (page_mkwrite, setting the pte dirty, and finally setting the page dirty)
    closes out races nicely, preventing page cleaning for writeout from being
    initiated in that window. This provides the filesystem with a strong
    synchronisation against the VM here.
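
    [ A minimal sketch of a filesystem ->page_mkwrite() under the new
    convention; myfs_page_mkwrite() is hypothetical, and VM_FAULT_LOCKED
    tells the caller that the page is being returned still locked: ]

        static int myfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
        {
                struct page *page = vmf->page;

                lock_page(page);
                if (page->mapping != vma->vm_file->f_mapping) {
                        unlock_page(page);
                        return VM_FAULT_NOPAGE;     /* page was truncated meanwhile */
                }
                /* attach dirty-tracking metadata here, while the page is locked */
                return VM_FAULT_LOCKED;             /* caller keeps the lock until the page is dirtied */
        }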

    - Sage needs this race closed for ceph filesystem.
    - Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913).
    - I need it for fsblock.
    - I suspect other filesystems may need it too (eg. btrfs).
    - I have converted buffer.c to the new locking. Even simple block allocation
    under dirty pages might be susceptible to i_size changing under partial page
    at the end of file (we also have a buffer.c-side problem here, but it cannot
    be fixed properly without this patch).
    - Other filesystems (eg. NFS, maybe btrfs) will need to change their
    page_mkwrite functions themselves.

    [ This also moves page_mkwrite another step closer to fault, which should
    eventually allow page_mkwrite to be moved into ->fault, and thus avoiding a
    filesystem calldown and page lock/unlock cycle in __do_fault. ]

    [akpm@linux-foundation.org: fix derefs of NULL ->mapping]
    Cc: Sage Weil
    Cc: Trond Myklebust
    Signed-off-by: Nick Piggin
    Cc: Valdis Kletnieks
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • By the time the memory cgroup code is notified about a swapin we
    already hold a reference on the fault page.

    If the cgroup callback fails, make sure to unlock AND release the page
    reference which was taken by lookup_swap_cache(), or we leak the reference.
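
    [ A sketch of the corrected error path in do_swap_page(); 'ptr' is the
    memcg handle from the surrounding code, and the exact labels are assumed: ]

        if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) {
                ret = VM_FAULT_OOM;
                unlock_page(page);
                page_cache_release(page);   /* drop the reference from lookup_swap_cache() */
                goto out;
        }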

    Signed-off-by: Johannes Weiner
    Cc: Balbir Singh
    Reviewed-by: Minchan Kim
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

01 Apr, 2009

3 commits

  • Change the page_mkwrite prototype to take a struct vm_fault, and return
    VM_FAULT_xxx flags. There should be no functional change.

    This makes it possible to return much more detailed error information to
    the VM (and also can provide more information eg. virtual_address to the
    driver, which might be important in some special cases).

    This is required for a subsequent fix. And will also make it easier to
    merge page_mkwrite() with fault() in future.
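
    [ The shape of the change, for reference; the old form is reconstructed
    from context: ]

        /* old: */
        int (*page_mkwrite)(struct vm_area_struct *vma, struct page *page);
        /* new: fault details arrive in vm_fault, VM_FAULT_xxx flags come back */
        int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);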

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Trond Myklebust
    Cc: Miklos Szeredi
    Cc: Steven Whitehouse
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Artem Bityutskiy
    Cc: Felix Blyakher
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • At first look, mark_page_accessed() in follow_page() seems a bit strange.
    It seems pte_mkyoung() would be more consistent with other kernel code.

    However, it is intentional. The commit log said:

    ------------------------------------------------
    commit 9e45f61d69be9024a2e6bef3831fb04d90fac7a8
    Author: akpm
    Date: Fri Aug 15 07:24:59 2003 +0000

    [PATCH] Use mark_page_accessed() in follow_page()

    Touching a page via follow_page() counts as a reference so we should be
    either setting the referenced bit in the pte or running mark_page_accessed().

    Altering the pte is tricky because we haven't implemented an atomic
    pte_mkyoung(). And mark_page_accessed() is better anyway because it has more
    aging state: it can move the page onto the active list.

    BKrev: 3f3c8acbplT8FbwBVGtth7QmnqWkIw
    ------------------------------------------------

    The atomic issue is still true nowadays. Adding a comment helps readers
    understand the code's intention, so it is better to have one.
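
    [ The relevant fragment of follow_page(), roughly, with the comment this
    patch adds; flag names follow the GUP code of this era: ]

        if (flags & FOLL_TOUCH) {
                if ((flags & FOLL_WRITE) &&
                    !pte_dirty(pte) && !PageDirty(page))
                        set_page_dirty(page);
                /*
                 * pte_mkyoung() would be more correct here, but atomic care
                 * is needed to avoid losing the dirty bit: it is easier to
                 * use mark_page_accessed().
                 */
                mark_page_accessed(page);
        }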

    [akpm@linux-foundation.org: clarify text]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • commit bf3f3bc5e734706730c12a323f9b2068052aa1f0 (mm: don't
    mark_page_accessed in fault path) only removed the mark_page_accessed() in
    filemap_fault().

    Therefore, swap-backed pages and file-backed pages have inconsistent
    behavior. mark_page_accessed() should be removed from do_swap_page().

    Signed-off-by: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

13 Mar, 2009

1 commit

  • Impact: fix false positive PAT warnings - also fix VirtualBox hang

    Use of vma->vm_pgoff to identify the pfnmaps that are fully
    mapped at mmap time is broken. vm_pgoff is set by generic mmap
    code even for cases where drivers are setting up the mappings
    at the fault time.

    The problem was originally reported here:

    http://marc.info/?l=linux-kernel&m=123383810628583&w=2

    Change is_linear_pfn_mapping logic to overload VM_INSERTPAGE
    flag along with VM_PFNMAP to mean full PFNMAP setup at mmap
    time.

    Problem also tracked at:

    http://bugzilla.kernel.org/show_bug.cgi?id=12800
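
    [ A sketch of the overloaded test; the helper macro name is assumed, the
    point being that both flags together mean "fully mapped at mmap time": ]

        #define VM_PFN_AT_MMAP  (VM_PFNMAP | VM_INSERTPAGE)     /* name assumed */

        static inline int is_linear_pfn_mapping(struct vm_area_struct *vma)
        {
                return (vma->vm_flags & VM_PFN_AT_MMAP) == VM_PFN_AT_MMAP;
        }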

    Reported-by: Thomas Hellstrom
    Tested-by: Frans Pop
    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Suresh Siddha
    Cc: Nick Piggin
    Cc: "ebiederm@xmission.com"
    Cc: # only for 2.6.29.1, not .28
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Pallipadi, Venkatesh
     

06 Feb, 2009

1 commit

  • Fix do_wp_page for VM_MIXEDMAP mappings.

    In the case where pfn_valid returns 0 for a pfn at the beginning of
    do_wp_page and the mapping is not shared writable, the code branches to
    label `gotten:' with old_page == NULL.

    In case the vma is locked (vma->vm_flags & VM_LOCKED), lock_page,
    clear_page_mlock, and unlock_page try to access the old_page.

    This patch checks whether old_page is valid before it is dereferenced.

    The regression was introduced by "mlock: mlocked pages are unevictable"
    (commit b291f000393f5a0b679012b39d79fbc85c018233).

    Signed-off-by: Carsten Otte
    Cc: Nick Piggin
    Cc: Heiko Carstens
    Cc: [2.6.28.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Carsten Otte
     

14 Jan, 2009

2 commits

  • Impact: cleanup

    Change the protection parameter for track_pfn_vma_new() into a pgprot_t pointer.
    A subsequent patch changes the x86 PAT handling to return a compatible
    memtype in pgprot_t, if what was requested cannot be allowed due to conflicts.
    No functionality change in this patch.
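
    [ The shape of the interface change; argument names are assumed: ]

        /* old: the requested protection is passed by value */
        int track_pfn_vma_new(struct vm_area_struct *vma, pgprot_t prot,
                              unsigned long pfn, unsigned long size);
        /* new: PAT may write a compatible memtype back through the pointer */
        int track_pfn_vma_new(struct vm_area_struct *vma, pgprot_t *prot,
                              unsigned long pfn, unsigned long size);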

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Suresh Siddha
    Signed-off-by: Ingo Molnar

    venkatesh.pallipadi@intel.com
     
  • Impact: fix (harmless) double-free of memtype entries and avoid warning

    On track_pfn_vma_new() failure, reset the vm_flags so that there will be
    no second cleanup happening when upper level routines call unmap_vmas().

    This patch fixes part of the bug reported here:

    http://marc.info/?l=linux-kernel&m=123108883716357&w=2

    Specifically the error message:

    X:5010 freeing invalid memtype d0000000-d0101000

    is due to multiple frees on the error path and will not happen with the patch below.

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Suresh Siddha
    Signed-off-by: Ingo Molnar

    venkatesh.pallipadi@intel.com
     

12 Jan, 2009

1 commit

  • Some code (nfs/sunrpc) uses socket ops on kernel memory while holding
    the mmap_sem. This is safe because kernel memory doesn't get paged out,
    so we'll never actually fault, and the might_fault() annotations
    only generate false positives.

    Reported-by: "J. Bruce Fields"
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

09 Jan, 2009

5 commits

  • Fix swapin charge operation of memcg.

    Now, memcg has hooks to the swap-out operation and checks whether SwapCache
    is really unused or not. That check depends on the contents of struct page,
    i.e. if PageAnon(page) && page_mapped(page), the page is recognized as
    still-in-use.

    Now, reuse_swap_page() calls delete_from_swap_cache() before establishment
    of any rmap. Then, in the following sequence

    (Page fault with WRITE)
    try_charge() (charge += PAGESIZE)
    commit_charge() (Check page_cgroup is used or not..)
    reuse_swap_page()
    -> delete_from_swapcache()
    -> mem_cgroup_uncharge_swapcache() (charge -= PAGESIZE)
    ......
    New charge is uncharged soon....
    To avoid this, move commit_charge() after page_mapcount() goes up to 1.
    By this,

    try_charge() (usage += PAGESIZE)
    reuse_swap_page() (may usage -= PAGESIZE if PCG_USED is set)
    commit_charge() (If page_cgroup is not marked as PCG_USED,
    add new charge.)
    Accounting will be correct.

    Changelog (v2) -> (v3)
    - fixed invalid charge to swp_entry==0.
    - updated documentation.
    Changelog (v1) -> (v2)
    - fixed comment.

    [nishimura@mxp.nes.nec.co.jp: swap accounting leak doc fix]
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Tested-by: Balbir Singh
    Cc: Hugh Dickins
    Cc: Daisuke Nishimura
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • My patch, memcg-fix-gfp_mask-of-callers-of-charge.patch changed gfp_mask
    of callers of charge to be GFP_HIGHUSER_MOVABLE for showing what will
    happen at memory reclaim.

    But in recent discussion, it's NACKed because it sounds ugly.

    This patch reverts it and adds some cleanup to the gfp_mask of the
    callers of charge. No behavior change, but it needs review before
    generating hunks deeper in the queue.

    This patch also adds explanation to meaning of gfp_mask passed to charge
    functions in memcontrol.h.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch implements a per-cgroup limit for usage of memory+swap. Although
    there is SwapCache, double counting of swap-cache and swap-entry is
    avoided.

    Mem+Swap controller works as following.
    - memory usage is limited by memory.limit_in_bytes.
    - memory + swap usage is limited by memory.memsw_limit_in_bytes.

    This has following benefits.
    - A user can limit total resource usage of mem+swap.

    Without this, because memory resource controller doesn't take care of
    usage of swap, a process can exhaust all the swap (by memory leak.)
    We can avoid this case.

    And Swap is shared resource but it cannot be reclaimed (goes back to memory)
    until it's used. This characteristic can be trouble when the memory
    is divided into some parts by cpuset or memcg.
    Assume group A and group B.
    After some application executes, the system can be..

    Group A -- very large free memory space but occupy 99% of swap.
    Group B -- under memory shortage but cannot use swap...it's nearly full.

    Ability to set appropriate swap limit for each group is required.

    Maybe someone wonders "why not swap but mem+swap?"

    - The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
    to move account from memory to swap...there is no change in usage of
    mem+swap.

    In other words, when we want to limit the usage of swap without affecting
    global LRU, mem+swap limit is better than just limiting swap.

    Accounting target information is stored in swap_cgroup which is
    per swap entry record.

    Charge is done as following.
    map
    - charge page and memsw.

    unmap
    - uncharge page/memsw if not SwapCache.

    swap-out (__delete_from_swap_cache)
    - uncharge page
    - record mem_cgroup information to swap_cgroup.

    swap-in (do_swap_page)
    - charged as page and memsw.
    record in swap_cgroup is cleared.
    memsw accounting is decremented.

    swap-free (swap_free())
    - if swap entry is freed, memsw is uncharged by PAGE_SIZE.

    There are people who work in never-swap environments and consider swap as
    something bad. For such people, this mem+swap controller extension is just an
    overhead. This overhead is avoided by config or boot option.
    (see Kconfig. detail is not in this patch.)

    TODO:
    - maybe more optimization can be done in the swap-in path. (but not very safe.)
    But we just do simple accounting at this stage.

    [nishimura@mxp.nes.nec.co.jp: make resize limit hold mutex]
    [hugh@veritas.com: memswap controller core swapcache fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Fix misuse of GFP_KERNEL.

    Now, most of callers of mem_cgroup_charge_xxx functions uses GFP_KERNEL.

    I think that this is from the fact that page_cgroup *was* dynamically
    allocated.

    But now, we allocate all page_cgroup at boot. And
    mem_cgroup_try_to_free_pages() reclaim memory from GFP_HIGHUSER_MOVABLE +
    specified GFP_RECLAIM_MASK.

    * This is because we just want to reduce memory usage.
    "Where we should reclaim from ?" is not a problem in memcg.

    This patch modifies gfp masks to be GFP_HIGHUSER_MOVABLE if possible.

    Note: This patch is not for fixing behavior but for showing sane information
    in source code.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • There is a small race in do_swap_page(). When the page swapped-in is
    charged, the mapcount can be greater than 0. But, at the same time, some
    process (which shares it) may call unmap, making the mapcount go from 1 to 0,
    and the page is uncharged.

    CPUA                                CPUB
                                        mapcount == 1.
    (1) charge if mapcount==0           zap_pte_range()
                                        (2) mapcount 1 => 0.
                                        (3) uncharge(). (success)
    (4) set page's rmap()
        mapcount 0=>1

    Then, this swap page's account is leaked.

    For fixing this, I added a new interface.
    - charge
    account to res_counter by PAGE_SIZE and try to free pages if necessary.
    - commit
    register page_cgroup and add to LRU if necessary.
    - cancel
    uncharge PAGE_SIZE because of do_swap_page failure.

    CPUA
    (1) charge (always)
    (2) set page's rmap (mapcount > 0)
    (3) commit charge was necessary or not after set_pte().

    This protocol uses PCG_USED bit on page_cgroup for avoiding over accounting.
    Usual mem_cgroup_charge_common() does charge -> commit at a time.
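
    [ A sketch of the three-step interface described above; treat the exact
    signatures as illustrative: ]

        /* charge: account PAGE_SIZE to the res_counter, reclaiming if necessary */
        int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page,
                                         gfp_t gfp_mask, struct mem_cgroup **ptr);
        /* commit: register the page_cgroup and add it to the LRU if necessary */
        void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr);
        /* cancel: uncharge PAGE_SIZE when do_swap_page() fails after the charge */
        void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr);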

    And this patch also adds following function to clarify all charges.

    - mem_cgroup_newpage_charge() ....replacement for mem_cgroup_charge()
    called against newly allocated anon pages.

    - mem_cgroup_charge_migrate_fixup()
    called only from remove_migration_ptes().
    we'll have to rewrite this later.(this patch just keeps old behavior)
    This function will be removed by additional patch to make migration
    clearer.

    Good for clarifying "what we do"

    Then, we have 4 following charge points.
    - newpage
    - swap-in
    - add-to-cache.
    - migration.

    [akpm@linux-foundation.org: add missing inline directives to stubs]
    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

07 Jan, 2009

15 commits

  • The initial implementation of checking TIF_MEMDIE covers the cases of OOM
    killing. If the process has been OOM killed, the TIF_MEMDIE is set and it
    return immediately. This patch includes:

    1. add the case that the SIGKILL is sent by user processes. The
    process can try to get_user_pages() unlimited memory even if a user
    process has sent a SIGKILL to it (maybe a monitor finds the process
    exceeding its memory limit and tries to kill it). In the old
    implementation, the SIGKILL won't be handled until the get_user_pages()
    returns.

    2. change the return value to be ERESTARTSYS. It makes no sense to
    return ENOMEM if get_user_pages() returns because it got a SIGKILL
    signal. The general convention for a system call interrupted by a
    signal is ERESTARTNOSYS, so the current return value is consistent
    with that.
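
    [ A sketch of the check this adds inside the __get_user_pages() loop;
    placement is assumed: ]

        if (unlikely(fatal_signal_pending(current)))
                return i ? i : -ERESTARTSYS;    /* bail out once SIGKILL is pending */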

    Lee:

    An unfortunate side effect of "make-get_user_pages-interruptible" is that
    it prevents a SIGKILL'd task from munlock-ing pages that it had mlocked,
    resulting in freeing of mlocked pages. Freeing of mlocked pages, in
    itself, is not so bad. We just count them now--altho' I had hoped to
    remove this stat and add PG_MLOCKED to the free pages flags check.

    However, consider pages in shared libraries mapped by more than one task
    that a task mlocked--e.g., via mlockall(). If the task that mlocked the
    pages exits via SIGKILL, these pages would be left mlocked and
    unevictable.

    Proposed fix:

    Add another GUP flag to ignore sigkill when calling get_user_pages from
    munlock()--similar to Kosaki Motohiro's 'IGNORE_VMA_PERMISSIONS flag for
    the same purpose. We are not actually allocating memory in this case,
    which "make-get_user_pages-interruptible" intends to avoid. We're just
    munlocking pages that are already resident and mapped, and we're reusing
    get_user_pages() to access those pages.

    ?? Maybe we should combine 'IGNORE_VMA_PERMISSIONS and '_IGNORE_SIGKILL
    into a single flag: GUP_FLAGS_MUNLOCK ???

    [Lee.Schermerhorn@hp.com: ignore sigkill in get_user_pages during munlock]
    Signed-off-by: Paul Menage
    Signed-off-by: Ying Han
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Pekka Enberg
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Cc: Oleg Nesterov
    Cc: Lee Schermerhorn
    Cc: Rohit Seth
    Cc: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • bad_page() and rmap Eeek messages have said KERN_EMERG for a few years,
    which I've followed in print_bad_pte(). These are serious system errors,
    on a par with BUGs, but they're not quite emergencies, and we do our best
    to carry on: say KERN_ALERT "BUG: " like the x86 oops does.

    And remove the "Trying to fix it up, but a reboot is needed" line: it's
    not untrue, but I hope the KERN_ALERT "BUG: " conveys as much.

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • print_bad_pte() and bad_page() might each need ratelimiting - especially
    for their dump_stacks, almost never of interest, yet not quite
    dispensable. Correlating corruption across neighbouring entries can be
    very helpful, so allow a burst of 60 reports before keeping quiet for the
    remainder of that minute (or allow a steady drip of one report per
    second).
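
    [ A condensed sketch of that policy, as it might sit at the top of
    print_bad_pte(); the static names are assumed: ]

        static unsigned long resume, nr_shown, nr_unshown;

        /* allow a burst of 60 reports, then keep quiet for the rest of the minute */
        if (nr_shown == 60) {
                if (time_before(jiffies, resume)) {
                        nr_unshown++;
                        return;
                }
                if (nr_unshown) {
                        printk(KERN_ALERT "BUG: Bad page map: %lu messages suppressed\n",
                               nr_unshown);
                        nr_unshown = 0;
                }
                nr_shown = 0;
        }
        if (nr_shown++ == 0)
                resume = jiffies + 60 * HZ;     /* ...or a steady drip of one per second */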

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove page_remove_rmap()'s vma arg, which was only for the Eeek message.
    And remove the BUG_ON(page_mapcount(page) == 0) from CONFIG_DEBUG_VM's
    page_dup_rmap(): we're trying to be more resilient about that than BUGs.

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Complete zap_pte_range()'s coverage of bad pagetable entries by calling
    print_bad_pte() on a pte_file in a linear vma and on a bad swap entry.
    That needs free_swap_and_cache() to tell it, which will also have shown
    one of those "swap_free" errors (but with much less information).
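
    [ A sketch of the added checks in zap_pte_range(); the surrounding loop
    is omitted and print_bad_pte() is shown with its extended arguments: ]

        if (pte_file(ptent)) {
                if (unlikely(!(vma->vm_flags & VM_NONLINEAR)))
                        print_bad_pte(vma, addr, ptent, NULL);  /* pte_file in a linear vma */
        } else {
                if (unlikely(!free_swap_and_cache(pte_to_swp_entry(ptent))))
                        print_bad_pte(vma, addr, ptent, NULL);  /* bad swap entry */
        }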

    Similar checks in fork's copy_one_pte()? No, that would be more noisy
    than helpful: we'll see them when parent and child exec or exit.

    Where do_nonlinear_fault() calls print_bad_pte(): omit !VM_CAN_NONLINEAR
    case, that could only be a bug in sys_remap_file_pages(), not a bad pte.
    VM_FAULT_OOM rather than VM_FAULT_SIGBUS? Well, okay, that is consistent
    with what happens if do_swap_page() operates a bad swap entry; but don't
    we have patches to be more careful about killing when VM_FAULT_OOM?

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • print_bad_pte() is so far being called only when zap_pte_range() finds
    negative page_mapcount, or there's a fault on a pte_file where it does not
    belong. That's weak coverage when we suspect pagetable corruption.

    Originally, it was called when vm_normal_page() found an invalid pfn: but
    pfn_valid is expensive on some architectures and configurations, so 2.6.24
    put that under CONFIG_DEBUG_VM (which doesn't help in the field), then
    2.6.26 replaced it by a VM_BUG_ON (likewise).

    Reinstate the print_bad_pte() in vm_normal_page(), but use a cheaper test
    than pfn_valid(): memmap_init_zone() (used in bootup and hotplug) keeps a
    __read_mostly note of the highest_memmap_pfn, and vm_normal_page() then
    checks the pfn against that. We could call this pfn_plausible() or pfn_sane(), but I
    doubt we'll need it elsewhere: of course it's not reliable, but gives much
    stronger pagetable validation on many boxes.

    Also use print_bad_pte() when the pte_special bit is found outside a
    VM_PFNMAP or VM_MIXEDMAP area, instead of VM_BUG_ON.
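
    [ A sketch of the cheaper test in vm_normal_page(); highest_memmap_pfn is
    the __read_mostly note mentioned above: ]

        if (unlikely(pfn > highest_memmap_pfn)) {
                print_bad_pte(vma, addr, pte, NULL);
                return NULL;
        }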

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Now that bad pages are kept out of circulation, there is no need for the
    infamous page_remove_rmap() BUG() - once that page is freed, its negative
    mapcount will issue a "Bad page state" message and the page won't be
    freed. Removing the BUG() allows more info, on subsequent pages, to be
    gathered.

    We do have more info about the page at this point than bad_page() can know
    - notably, what the pmd is, which might pinpoint something like low 64kB
    corruption - but page_remove_rmap() isn't given the address to find that.

    In practice, there is only one call to page_remove_rmap() which has ever
    reported anything, that from zap_pte_range() (usually on exit, sometimes
    on munmap). It has all the info, so remove page_remove_rmap()'s "Eeek"
    message and leave it all to zap_pte_range().

    mm/memory.c already has a hardly used print_bad_pte() function, showing
    some of the appropriate info: extend it to show what we want for the rmap
    case: pte info, page info (when there is a page) and vma info to compare.
    zap_pte_range() already knows the pmd, but print_bad_pte() is easier to
    use if it works that out for itself.

    Some of this info is also shown in bad_page()'s "Bad page state" message.
    Keep them separate, but adjust them to match each other as far as
    possible. Say "Bad page map" in print_bad_pte(), and add a TAINT_BAD_PAGE
    there too.

    print_bad_pte() shows current->comm unconditionally (though it should get
    repeated in the usually irrelevant stack trace): sorry, I misled Nick
    Piggin to make it conditional on vm_mm == current->mm, but current->mm is
    already NULL in the exit case. Usually current->comm is good, though
    exceptionally it may not be that of the mm (when "swapoff" for example).

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • sparse outputs the following warnings:

    mm/memory.c:2936:8: warning: incorrect type in assignment (different address spaces)
    mm/memory.c:2936:8: expected void *maddr
    mm/memory.c:2936:8: got void [noderef]

    Clean this up.

    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • remove_exclusive_swap_page(): its problem is in living up to its name.

    It doesn't matter if someone else has a reference to the page (raised
    page_count); it doesn't matter if the page is mapped into userspace
    (raised page_mapcount - though that hints it may be worth keeping the
    swap): all that matters is that there be no more references to the swap
    (and no writeback in progress).

    swapoff (try_to_unuse) has been removing pages from swapcache for years,
    with no concern for page count or page mapcount, and we used to have a
    comment in lookup_swap_cache() recognizing that: if you go for a page of
    swapcache, you'll get the right page, but it could have been removed from
    swapcache by the time you get page lock.

    So, give up asking for exclusivity: get rid of
    remove_exclusive_swap_page(), and remove_exclusive_swap_page_ref() and
    remove_exclusive_swap_page_count() which were spawned for the recent LRU
    work: replace them by the simpler try_to_free_swap() which just checks
    page_swapcount().

    Similarly, remove the page_count limitation from free_swap_and_cache(),
    but assume that it's worth holding on to the swap if page is mapped and
    swap nowhere near full. Add a vm_swap_full() test in free_swap_cache()?
    It would be consistent, but I think we probably have enough for now.
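
    [ A sketch of the simpler replacement, close to what the description
    implies; details are assumed: ]

        int try_to_free_swap(struct page *page)
        {
                VM_BUG_ON(!PageLocked(page));

                if (!PageSwapCache(page))
                        return 0;
                if (PageWriteback(page))
                        return 0;
                if (page_swapcount(page))       /* only the swap count matters now */
                        return 0;

                delete_from_swap_cache(page);
                SetPageDirty(page);
                return 1;
        }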

    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Cc: Rik van Riel
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • A good place to free up old swap is where do_wp_page(), or do_swap_page(),
    is about to redirty the page: the data on disk is then stale and won't be
    read again; and if we do decide to write the page out later, using the
    previous swap location makes an unnecessary disk seek very likely.

    So give can_share_swap_page() the side-effect of delete_from_swap_cache()
    when it safely can. And can_share_swap_page() was always a misleading
    name, the more so if it has a side-effect: rename it reuse_swap_page().

    Irrelevant cleanup nearby: remove swap_token_default_timeout definition
    from swap.h: it's used nowhere.
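
    [ A sketch of the renamed helper with its new side effect; the exact
    conditions are assumed: ]

        int reuse_swap_page(struct page *page)
        {
                int count;

                VM_BUG_ON(!PageLocked(page));
                count = page_mapcount(page);
                if (count <= 1 && PageSwapCache(page)) {
                        count += page_swapcount(page);
                        if (count == 1 && !PageWriteback(page)) {
                                /* about to be redirtied: the swap copy is stale */
                                delete_from_swap_cache(page);
                                SetPageDirty(page);
                        }
                }
                return count == 1;
        }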

    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Acked-by: Rik van Riel
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • An application may rely on get_user_pages() to give it pages writable from
    userspace and shared with a driver, GUP breaking COW if necessary. It may
    mprotect() the pages' writability, off and on, from time to time.

    Normally this works fine (so long as the app does not fork); but just
    occasionally, under memory pressure, a readonly pte in a newly writable
    area is COWed unnecessarily, breaking the link with the driver: because
    do_wp_page() does trylock_page, and falls back to COW whenever that fails.

    For reliable behaviour in the unshared case, when the trylock_page fails,
    now unlock pagetable, lock page and relock pagetable, before deciding
    whether Copy-On-Write is really necessary.
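
    [ A sketch of the new fallback in do_wp_page() when trylock_page() fails;
    variable names are taken from the surrounding code and assumed: ]

        if (!trylock_page(old_page)) {
                page_cache_get(old_page);
                pte_unmap_unlock(page_table, ptl);      /* drop the pagetable lock */
                lock_page(old_page);                    /* sleep for the page lock */
                page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
                if (!pte_same(*page_table, orig_pte)) { /* pte changed meanwhile */
                        unlock_page(old_page);
                        page_cache_release(old_page);
                        goto unlock;
                }
                page_cache_release(old_page);
        }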

    Reported-by: Zhou Yingchao
    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Cc: Rik van Riel
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • do_wp_page()'s VM_FAULT_WRITE return value tells __get_user_pages() that
    COW has been done if necessary, though it may be leaving the pte without
    write permission - for the odd case of forced writing to a readonly vma
    for ptrace. At present GUP then retries the follow_page() without asking
    for write permission, to escape an endless loop when forced.

    But an application may be relying on GUP to guarantee a writable page
    which won't be COWed again when written from userspace, whereas a race
    here might leave a readonly pte in place? Change the VM_FAULT_WRITE
    handling to ask follow_page() for write permission again, except in that
    odd case of forced writing to a readonly vma.
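
    [ A sketch of the revised handling in __get_user_pages(); the
    forced-write-to-readonly-vma exception is the !VM_WRITE test: ]

        if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE))
                foll_flags &= ~FOLL_WRITE;      /* ptrace forced write: don't insist on a writable pte */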

    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Cc: Rik van Riel
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Moving lru_cache_add_active_or_unevictable() into page_add_new_anon_rmap()
    was good but stupid: we can and should SetPageSwapBacked() there too; and
    we know for sure that this anonymous, swap-backed page is not file cache.

    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Cc: Nick Piggin
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • lru_cache_add_active_or_unevictable() and page_add_new_anon_rmap() always
    appear together. Save some symbol table space and some jumping around by
    removing lru_cache_add_active_or_unevictable(), folding its code into
    page_add_new_anon_rmap(): like how we add file pages to lru just after
    adding them to page cache.

    Remove the nearby "TODO: is this safe?" comments (yes, it is safe), and
    change page_add_new_anon_rmap()'s address BUG_ON to VM_BUG_ON as
    originally intended.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Nick Piggin
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Make the pte-level function in apply_to_range be called in lazy mmu mode,
    so that any pagetable modifications can be batched.
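
    [ A sketch of the change inside apply_to_pte_range(); the loop body is
    abbreviated to the callback invocation: ]

        arch_enter_lazy_mmu_mode();             /* batch the pagetable updates made by fn() */
        do {
                err = fn(pte++, token, addr, data);
                if (err)
                        break;
        } while (addr += PAGE_SIZE, addr != end);
        arch_leave_lazy_mmu_mode();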

    Signed-off-by: Jeremy Fitzhardinge
    Cc: Johannes Weiner
    Cc: Nick Piggin
    Cc: Venkatesh Pallipadi
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeremy Fitzhardinge