08 May, 2007

3 commits

  • Introduce a macro for suppressing the gcc warning about a variable being
    possibly used uninitialized.

    Example:

    - spinlock_t *ptl;
    + spinlock_t *uninitialized_var(ptl);

    Not a happy solution, but those warnings are obnoxious.
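
    The macro can be built on the `x = x' trick mentioned below; a minimal
    sketch of the definition (the exact form in the tree may differ):

        /* Silence gcc's "may be used uninitialized" warning; the
         * self-assignment generates no code. */
        #define uninitialized_var(x) x = x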

    - Using the usual pointlessly-set-it-to-zero approach wastes several
    bytes of text.

    - Using a macro means we can (hopefully) do something else if gcc changes
    cause the `x = x' hack to stop working.

    - Using a macro means that people who are worried about hiding true bugs
    can easily turn it off.

    Signed-off-by: Borislav Petkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Borislav Petkov
     
  • The minimum gcc version is 3.2 now. However, with likely-profiling enabled,
    even modern gcc versions cannot always eliminate the call.

    Replace the placeholder functions with the more conventional empty static
    inlines, which should be optimal for everyone.
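
    A sketch of the pattern being applied, with a hypothetical hook name:

        /* Before: an out-of-line placeholder; the call itself remains in the
         * caller and gcc cannot always eliminate it. */
        void profile_likely_hit(void);

        /* After: an empty static inline; the compiler sees the empty body at
         * every call site and drops the call entirely. */
        static inline void profile_likely_hit(void) { }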

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Add a new mm function apply_to_page_range() which applies a given function to
    every pte in a given virtual address range in a given mm structure. This is a
    generic alternative to cut-and-pasting the Linux idiomatic pagetable walking
    code in every place that a sequence of PTEs must be accessed.

    Although this interface is intended to be useful in a wide range of
    situations, it is currently used specifically by several Xen subsystems, for
    example: to ensure that pagetables have been allocated for a virtual address
    range, and to construct batched special pagetable update requests to map I/O
    memory (in ioremap()).
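
    A usage sketch; the callback's argument list follows the description above
    and may differ slightly from the final interface:

        /* illustrative callback: invoked once for every pte in the range */
        static int count_present_pte(pte_t *pte, struct page *pmd_page,
                                     unsigned long addr, void *data)
        {
                if (pte_present(*pte))
                        (*(unsigned long *)data)++;
                return 0;               /* a non-zero return aborts the walk */
        }

        unsigned long present = 0;
        /* walk [addr, addr + size) in mm, allocating page tables as needed */
        err = apply_to_page_range(mm, addr, size, count_present_pte, &present);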

    [akpm@linux-foundation.org: fix warning, unpleasantly]
    Signed-off-by: Ian Pratt
    Signed-off-by: Christian Limpach
    Signed-off-by: Chris Wright
    Signed-off-by: Jeremy Fitzhardinge
    Cc: Christoph Lameter
    Cc: Matt Mackall
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeremy Fitzhardinge
     

13 Feb, 2007

2 commits

  • Add a NOPFN_REFAULT return code for vm_ops->nopfn(), equivalent to
    NOPAGE_REFAULT for vm_ops->nopage(), indicating that the handler requests a
    re-execution of the faulting instruction.

    Signed-off-by: Benjamin Herrenschmidt
    Cc: Arnd Bergmann
    Cc: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • Add a vm_insert_pfn helper, so that ->fault handlers can have nopfn
    functionality by installing their own pte and returning NULL.
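
    A minimal sketch of the helper's use (the surrounding fault plumbing is
    omitted):

        /* install a pte for a raw page frame number that has no struct page
         * behind it; returns 0 on success or a negative errno */
        err = vm_insert_pfn(vma, address, pfn);

    On success the handler then returns NULL, as described above, because the
    pte is already in place.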

    Signed-off-by: Nick Piggin
    Signed-off-by: Benjamin Herrenschmidt
    Cc: Arnd Bergmann
    Cc: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

12 Feb, 2007

3 commits

  • A variety of (mostly) innocuous fixes to the embedded kernel-doc content in
    source files, including:

    * make multi-line initial descriptions single line
    * denote some function names, constants and structs as such
    * change erroneous opening '/*' to '/**' in a few places
    * reword some text for clarity

    Signed-off-by: Robert P. J. Day
    Cc: "Randy.Dunlap"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert P. J. Day
     
  • When the kernel unmaps an address range, it needs to transfer PTE state
    into the page struct. Currently, the kernel transfers the accessed bit via
    mark_page_accessed(). The call to mark_page_accessed() in the unmap path
    doesn't look logically correct.

    At unmap time, calling mark_page_accessed() causes the page's LRU state to
    be bumped one step closer to the most-recently-used state. It causes quite
    a headache in a scenario where a process creates a shmem segment, touches
    a whole bunch of pages, then unmaps it. The unmapping takes a long time
    because mark_page_accessed() will start moving pages from the inactive to
    the active list.

    I'm not too concerned with moving the page from one LRU list to another.
    Sooner or later it might be moved because of multiple mappings from
    various processes. But it just doesn't look logical: when the user asks
    for a range to be unmapped, the intention is that the process is no longer
    interested in these pages. Moving those pages to the active list (or
    bumping their state towards more active) seems to be an overreaction. It
    also prolongs unmapping latency, which is the core issue I'm trying to
    solve.

    As suggested by Peter, we should still preserve the pte-young information
    on those pages, but no more.
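
    A sketch of the kind of change described, in zap_pte_range() (the actual
    patch may differ in detail):

        - if (pte_young(ptent))
        -         mark_page_accessed(page);   /* promotes the page in the LRU */
        + if (pte_young(ptent))
        +         SetPageReferenced(page);    /* just record the young bit */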

    Signed-off-by: Peter Zijlstra
    Acked-by: Ken Chen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
     
  • After do_wp_page has tested page_mkwrite, it must release old_page after
    acquiring page table lock, not before: at some stage that ordering got
    reversed, leaving a (very unlikely) window in which old_page might be
    truncated, freed, and reused in the same position.

    Signed-off-by: Hugh Dickins
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

27 Jan, 2007

2 commits

  • This patch fixes core dumps to include the vDSO vma, which is left out now.
    It removes the special-case core writing macros, which were not doing the
    right thing for the vDSO vma anyway. Instead, it uses VM_ALWAYSDUMP in the
    vma; there is no need for the fixmap page to be installed. It handles the
    CONFIG_COMPAT_VDSO case by making elf_core_dump use the fake vma from
    get_gate_vma after real vmas in the same way the /proc/PID/maps code does.

    This changes core dumps so they no longer include the non-PT_LOAD phdrs from
    the vDSO. I made the change to add them in the first place, but it turned
    out that nothing ever wanted them there since the advent of NT_AUXV. It's cleaner
    to leave them out, and just let the phdrs inside the vDSO image speak for
    themselves.

    Signed-off-by: Roland McGrath
    Cc: Ingo Molnar
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
  • This patch fixes the initialization of gate_vma.vm_flags and
    gate_vma.vm_page_prot to reflect reality. This makes the "[vdso]" line in
    /proc/PID/maps correctly show r-xp instead of ---p, when gate_vma is used
    (CONFIG_COMPAT_VDSO on i386).

    Signed-off-by: Roland McGrath
    Cc: Ingo Molnar
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     

09 Jan, 2007

1 commit

  • Since get_user_pages() may be used with processes other than the
    current process and calls flush_anon_page(), flush_anon_page() has to
    cope in some way with non-current processes.

    It may not be appropriate, or even desirable to flush a region of
    virtual memory cache in the current process when that is different to
    the process that we want the flush to occur for.

    Therefore, pass the vma into flush_anon_page() so that the architecture
    can work out whether the 'vmaddr' is for the current process or not.
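
    The interface change, roughly (callers such as get_user_pages() pass the
    vma they already hold):

        - void flush_anon_page(struct page *page, unsigned long vmaddr);
        + void flush_anon_page(struct vm_area_struct *vma,
        +                      struct page *page, unsigned long vmaddr);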

    Signed-off-by: Russell King

    Russell King
     

14 Dec, 2006

1 commit

  • To allow a more effective copy_user_highpage() on certain architectures,
    a vma argument is added to the function and cow_user_page() allowing
    the implementation of these functions to check for the VM_EXEC bit.

    The main part of this patch was originally written by Ralf Baechle;
    Atsushi Nemoto did the debugging.

    Signed-off-by: Atsushi Nemoto
    Signed-off-by: Ralf Baechle
    Signed-off-by: Linus Torvalds

    Atsushi Nemoto
     

11 Dec, 2006

1 commit

  • Ramiro Voicu hits the BUG_ON(!pte_none(*pte)) in zeromap_pte_range: kernel
    bugzilla 7645. Right: read_zero_pagealigned uses down_read of mmap_sem,
    but another thread's racing read of /dev/zero, or a normal fault, can
    easily set that pte again, in between zap_page_range and zeromap_page_range
    getting there. It's been wrong ever since 2.4.3.

    The simple fix is to use down_write instead, but that would serialize reads
    of /dev/zero more than at present: perhaps some app would be badly
    affected. So instead let zeromap_page_range return the error instead of
    BUG_ON, and read_zero_pagealigned break to the slower clear_user loop in
    that case - there's no need to optimize for it.

    Use -EEXIST for when a pte is found: BUG_ON in mmap_zero (the other user of
    zeromap_page_range), though it really isn't interesting there. And since
    mmap_zero wants -EAGAIN for out-of-memory, the zeromaps better return that
    than -ENOMEM.
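
    In sketch form, the read_zero_pagealigned() side of the fix (argument
    details are illustrative):

        /* try the fast path; any error (e.g. -EEXIST from a racing pte)
         * means falling back to the slower clear_user() loop */
        if (zeromap_page_range(vma, addr, size, vma->vm_page_prot))
                break;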

    Signed-off-by: Hugh Dickins
    Cc: Ramiro Voicu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

21 Oct, 2006

1 commit

  • From mm/memory.c, cow_user_page() (quoted from around line 1434):

        static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va)
        {
                /*
                 * If the source page was a PFN mapping, we don't have
                 * a "struct page" for it. We do a best-effort copy by
                 * just copying from the original user address. If that
                 * fails, we just zero-fill it. Live with it.
                 */
                if (unlikely(!src)) {
                        void *kaddr = kmap_atomic(dst, KM_USER0);
                        void __user *uaddr = (void __user *)(va & PAGE_MASK);

                        /*
                         * This really shouldn't fail, because the page is there
                         * in the page tables. But it might just be unreadable,
                         * in which case we just give up and fill the result with
                         * zeroes.
                         */
                        if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE))
                                memset(kaddr, 0, PAGE_SIZE);
                        kunmap_atomic(kaddr, KM_USER0);
        #### The D-cache has to be flushed here.
        #### It seems this was simply forgotten.
                        return;

                }
                copy_user_highpage(dst, src, va);
        #### OK here: flush_dcache_page() is called from this function if the
        #### arch needs it.
        }

    The patch fixes this by adding the missing D-cache flush.
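
    In sketch form (placed right after the atomic copy path, as the
    annotations above describe):

          if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE))
                  memset(kaddr, 0, PAGE_SIZE);
          kunmap_atomic(kaddr, KM_USER0);
        + flush_dcache_page(dst);       /* push the copied data out of the D-cache */
          return;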

    Signed-off-by: Dmitriy Monakhov
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitriy Monakhov
     

06 Oct, 2006

1 commit

  • Add a way for a no_page() handler to request a retry of the faulting
    instruction. It goes back to userland on page faults and just tries again
    in get_user_pages(). I added a cond_resched() in the loop in that latter
    case.

    The problem I have with signal and spufs is an actual bug affecting apps and I
    don't see other ways of fixing it.

    In addition, we are having issues with infiniband and 64k pages (related to
    the way the hypervisor deals with some HV cards) that will require us to muck
    around with the MMU from within the IB driver's no_page() (it's a pSeries
    specific driver) and return to the caller the same way using NOPAGE_REFAULT.

    And to add to this, the graphics folks have been following a new approach
    to memory management that involves transparently swapping objects between
    video ram and main memory. To do that, they need to install PTEs from a
    no_page() handler as well, and that also requires returning with
    NOPAGE_REFAULT.

    (For the latter, they are currently using io_remap_pfn_range to install one
    PTE from no_page(), which is a bit racy; we need to add a check for the PTE
    having already been installed after taking the lock, but that's ok, they
    are only at the proof-of-concept stage. I'll send a patch adding a "clean"
    function to do
    that, we can use that from spufs too and get rid of the sparsemem hacks we do
    to create struct page for SPEs. Basically, that provides a generic solution
    for being able to have no_page() map hardware devices, which is something that
    I think sound driver folks have been asking for some time too).

    All of these things depend on having the NOPAGE_REFAULT exit path from
    no_page() handlers.
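
    A sketch of a ->nopage handler using the new exit path (the pte-installing
    helper is hypothetical):

        static struct page *mydrv_nopage(struct vm_area_struct *vma,
                                         unsigned long address, int *type)
        {
                if (mydrv_install_pte(vma, address) == 0)
                        return NOPAGE_REFAULT;  /* pte installed: just retry */
                return NOPAGE_SIGBUS;
        }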

    Signed-off-by: Benjamin Herrenschmidt
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     

01 Oct, 2006

3 commits

  • Implement lazy MMU update hooks which are SMP safe for both direct and shadow
    page tables. The idea is that PTE updates and page invalidations while in
    lazy mode can be batched into a single hypercall. We use this in VMI for
    shadow page table synchronization, and it is a win. It also can be used by
    PPC and for direct page tables on Xen.

    For SMP, the enter / leave must happen under protection of the page table
    locks for page tables which are being modified. This is because otherwise,
    you end up with stale state in the batched hypercall, which other CPUs can
    race ahead of. Doing this under the protection of the locks guarantees the
    synchronization is correct, and also means that spurious faults which are
    generated during this window by remote CPUs are properly handled, as the page
    fault handler must re-check the PTE under protection of the same lock.
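
    The resulting pattern in the pte-walking loops, as a sketch:

        pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        arch_enter_lazy_mmu_mode();             /* start batching pte updates */
        do {
                set_pte_at(mm, addr, pte, mk_pte(page, vma->vm_page_prot));
        } while (pte++, addr += PAGE_SIZE, addr != end);
        arch_leave_lazy_mmu_mode();             /* flush the batch, still under ptl */
        pte_unmap_unlock(pte - 1, ptl);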

    Signed-off-by: Zachary Amsden
    Signed-off-by: Jeremy Fitzhardinge
    Cc: Rusty Russell
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • Change pte_clear_full to a more appropriately named pte_clear_not_present,
    allowing optimizations when not-present mapping changes need not be reflected
    in the hardware TLB for protected page table modes. There is also another
    case that can use it in the fremap code.

    Signed-off-by: Zachary Amsden
    Signed-off-by: Jeremy Fitzhardinge
    Cc: Rusty Russell
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • We don't want to read PTEs directly like this after they have been modified,
    as a lazy MMU implementation of direct page tables may not have written the
    updated PTE back to memory yet.

    Signed-off-by: Zachary Amsden
    Signed-off-by: Jeremy Fitzhardinge
    Cc: Rusty Russell
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     

30 Sep, 2006

1 commit

  • The failing context is a multi-threaded process and the failing sequence
    is as follows.

    One thread T0 doing self modifying code on page X on processor P0 and
    another thread T1 doing COW (breaking the COW setup as part of just
    happened fork() in another thread T2) on the same page X on processor P1.
    T0 doing SMC can end up modifying the new page Y (allocated by T1 doing
    COW on P1), but because of the separate I/D TLBs, P0's ITLB will not see
    the new mapping till the TLB flush IPI from P1 is received. During this
    interval, if T0 executes the code created by SMC it can result in an app
    error (as the ITLB still points to the old page X, and T0 ends up
    executing the content in page X rather than the content in page Y).

    Fix this issue by first clearing the PTE and flushing it, before updating
    it with the new entry.
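
    In sketch form, the ordering becomes:

        /* clear the old pte and flush it (this is what sends the TLB
         * flush IPI) ... */
        ptep_clear_flush(vma, address, page_table);
        /* ... and only then establish the new mapping */
        set_pte_at(mm, address, page_table, entry);
        update_mmu_cache(vma, address, entry);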

    Hugh sayeth:

    I was a bit sceptical, in the habit of thinking that Self Modifying Code
    must look after such issues itself: but I guess there's nothing it can do
    to avoid this one.

    Fair enough, what you're changing it to is pretty much what powerpc and
    s390 were already doing, and is a more robust way of proceeding, consistent
    with how ptes are set everywhere else.

    The ptep_clear_flush is a bit heavy-handed (it's anxious to return the pte
    that was atomically cleared), but we'd have to wander through lots of arches
    to get the right minimal behaviour. It'd also be nice to eliminate
    ptep_establish completely, now only used to define other macros/inlines: it
    always seemed obfuscation to me, what you've got there now is clearer.
    Let's put those cleanups on a TODO list.

    Signed-off-by: Suresh Siddha
    Acked-by: "David S. Miller"
    Acked-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Siddha, Suresh B
     

27 Sep, 2006

2 commits

  • Check that access_process_vm() is accessing a valid mapping in the target
    process.

    This limits ptrace() accesses and accesses through /proc/PID/maps to only
    those regions actually mapped by a program.

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Implement do_no_pfn() for handling mapping of memory without a struct page
    backing it. This avoids creating fake page table entries for regions which
    are not backed by real memory.

    This feature is used by the MSPEC driver and other users, where it is
    highly undesirable to have a struct page sitting behind the page (for
    instance if the page is accessed in cached mode via the struct page in
    parallel to the driver accessing it uncached, which can result in data
    corruption on some architectures, such as ia64).

    This version uses specific NOPFN_{SIGBUS,OOM} return values, rather than
    expecting all negative pfn values to be errors. It also BUGs on COW
    mappings, as these would not work with the VM.
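
    A sketch of a ->nopfn handler using the new return values (the lookup
    helper is hypothetical):

        static unsigned long mydrv_nopfn(struct vm_area_struct *vma,
                                         unsigned long address)
        {
                unsigned long pfn = mydrv_addr_to_pfn(vma, address);

                if (!pfn)
                        return NOPFN_SIGBUS;    /* no backing memory here */
                return pfn;                     /* the VM installs the pte */
        }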

    [akpm@osdl.org: micro-optimise]
    Signed-off-by: Jes Sorensen
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jes Sorensen
     

26 Sep, 2006

4 commits

  • These functions are already documented quite well with long comments. Now
    add kerneldoc-style headers to make them turn up in everyone's favorite
    doc format.

    Signed-off-by: Rolf Eike Beer
    Cc: "Randy.Dunlap"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rolf Eike Beer
     
  • Wrt. the recent modifications in do_wp_page() Hugh Dickins pointed out:

    "I now realize it's right to the first order (normal case) and to the
    second order (ptrace poke), but not to the third order (ptrace poke
    anon page here to be COWed - perhaps can't occur without intervening
    mprotects)."

    This patch restores the old COW behaviour for anonymous pages.

    Signed-off-by: Peter Zijlstra
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Now that we can detect writers of shared mappings, throttle them. Avoids OOM
    by surprise.

    Signed-off-by: Peter Zijlstra
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Tracking of dirty pages in shared writeable mmap()s.

    The idea is simple: write protect clean shared writeable pages, catch the
    write-fault, make writeable and set dirty. On page write-back clean all the
    PTE dirty bits and write protect them once again.

    The implementation is a tad harder, mainly because the default
    backing_dev_info capabilities were too loosely maintained. Hence it is not
    enough to test the backing_dev_info for cap_account_dirty.

    The current heuristic is as follows (see the sketch below); a VMA is
    eligible when:
    - it is shared writable
    (vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED)
    - it is not a 'special' mapping
    (vm_flags & (VM_PFNMAP|VM_INSERTPAGE)) == 0
    - the backing_dev_info is cap_account_dirty
    mapping_cap_account_dirty(vma->vm_file->f_mapping)
    - f_op->mmap() didn't change the default page protection

    Pages from remap_pfn_range() are explicitly excluded because their COW
    semantics are already horrid enough (see vm_normal_page() in do_wp_page())
    and because they don't have a backing store anyway.
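
    A sketch of that eligibility test gathered into one helper (the helper
    name is illustrative; the fourth condition, unchanged default page
    protection, is omitted here):

        static int vma_wants_dirty_tracking(struct vm_area_struct *vma)
        {
                if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) !=
                    (VM_WRITE|VM_SHARED))
                        return 0;
                if (vma->vm_flags & (VM_PFNMAP|VM_INSERTPAGE))
                        return 0;
                return vma->vm_file &&
                       mapping_cap_account_dirty(vma->vm_file->f_mapping);
        }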

    mprotect() is taught about the new behaviour as well. However it overrides
    the last condition.

    Cleaning the pages on write-back is done with page_mkclean(), a new rmap
    call. It can be called on any page, but is currently only implemented for
    mapped pages; if the page is found to be in a VMA that accounts dirty
    pages, it will also write-protect the PTE.

    Finally, in fs/buffer.c:try_to_free_buffers(), remove clear_page_dirty()
    from
    under ->private_lock. This seems to be safe, since ->private_lock is used to
    serialize access to the buffers, not the page itself. This is needed because
    clear_page_dirty() will call into page_mkclean() and would thereby violate
    locking order.

    [dhowells@redhat.com: Provide a page_mkclean() implementation for NOMMU]
    Signed-off-by: Peter Zijlstra
    Cc: Hugh Dickins
    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

15 Jul, 2006

2 commits

  • Unlike earlier iterations of the delay accounting patches, delays are now
    only collected for the actual I/O waits rather than trying to cover the
    delays seen in I/O submission paths.

    Account separately for block I/O delays incurred as a result of swapin page
    faults whose frequency can be affected by the task/process' rss limit. Hence
    swapin delays can act as feedback for rss limit changes independent of I/O
    priority changes.

    Signed-off-by: Shailabh Nagar
    Signed-off-by: Balbir Singh
    Cc: Jes Sorensen
    Cc: Peter Chubb
    Cc: Erich Focht
    Cc: Levent Serinol
    Cc: Jay Lan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shailabh Nagar
     
  • There is a race condition that showed up in a threaded JIT environment.
    The situation is that a process with a JIT code page forks, so the page is
    marked read-only, then some threads are created in the child. One of the
    threads attempts to add a new code block to the JIT page, so a
    copy-on-write fault is taken, and the kernel allocates a new page, copies
    the data, installs the new pte, and then calls lazy_mmu_prot_update() to
    flush caches to make sure that the icache and dcache are in sync.
    Unfortunately, the other thread runs right after the new pte is installed,
    but before the caches have been flushed. It tries to execute some old JIT
    code that was already in this page, but it sees some garbage in the i-cache
    from the previous users of the new physical page.

    Fix: we must make the caches consistent before installing the pte. This is
    an ia64 only fix because lazy_mmu_prot_update() is a no-op on all other
    architectures.

    Signed-off-by: Anil Keshavamurthy
    Signed-off-by: Tony Luck
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anil Keshavamurthy
     

04 Jul, 2006

1 commit

  • Teach special (recursive) locking code to the lock validator. Has no effect
    on non-lockdep kernels.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

01 Jul, 2006

2 commits

  • The remaining counters in page_state after the zoned VM counter patches
    have been applied are all just for show in /proc/vmstat. They have no
    essential function for the VM.

    We use a simple increment of per-cpu variables. In order to avoid the most
    severe races we disable preemption. Disabling preemption does not prevent
    the race between an increment and an interrupt handler incrementing the
    same statistics counter; however, that race is exceedingly rare, we may
    only lose an increment or so, and there is no requirement (at least not in
    the kernel) that the VM event counters be accurate.

    In the non preempt case this results in a simple increment for each
    counter. For many architectures this will be reduced by the compiler to a
    single instruction. This single instruction is atomic for i386 and x86_64.
    And therefore even the rare race condition in an interrupt is avoided for
    both architectures in most cases.

    The patchset also adds an off switch for embedded systems that allows
    building Linux kernels without these counters.

    The implementation of these counters is through inline code that hopefully
    results in only a single increment instruction being emitted (i386,
    x86_64), or in the increment being hidden through instruction concurrency
    (EPIC architectures such as ia64 can get that done).
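
    The increment itself comes out along these lines (a sketch of the per-cpu
    pattern; the real header may differ in detail):

        DECLARE_PER_CPU(struct vm_event_state, vm_event_states);

        static inline void count_vm_event(enum vm_event_item item)
        {
                get_cpu_var(vm_event_states).event[item]++; /* disables preemption */
                put_cpu_var(vm_event_states);               /* re-enables it */
        }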

    Benefits:
    - VM event counter operations usually reduce to a single inline instruction
    on i386 and x86_64.
    - No interrupt disable, only preempt disable for the preempt case.
    Preempt disable can also be avoided by moving the counter into a spinlock.
    - Handling is similar to zoned VM counters.
    - Simple and easily extendable.
    - Can be omitted to reduce memory use for embedded use.

    References:

    RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=113512330605497&w=2
    RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=114988082814934&w=2
    local_t http://marc.theaimsgroup.com/?l=linux-kernel&m=114991748606690&w=2
    V2 http://marc.theaimsgroup.com/?t=115014808400007&r=1&w=2
    V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767022346&w=2
    V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115047968808926&w=2

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Conversion of nr_page_table_pages to a per zone counter

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

23 Jun, 2006

3 commits

  • Add a new VMA operation to notify a filesystem or other driver about the
    MMU generating a fault because userspace attempted to write to a page
    mapped through a read-only PTE.

    This facility permits the filesystem or driver to:

    (*) Implement storage allocation/reservation on attempted write, and so to
    deal with problems such as ENOSPC more gracefully (perhaps by generating
    SIGBUS).

    (*) Delay making the page writable until the contents have been written to a
    backing cache. This is useful for NFS/AFS when using FS-Cache/CacheFS.
    It permits the filesystem to have some guarantee about the state of the
    cache.

    (*) Account and limit number of dirty pages. This is one piece of the puzzle
    needed to make shared writable mapping work safely in FUSE.

    Needed by cachefs (Or is it cachefiles? Or fscache?).

    At least four other groups have stated an interest in it or a desire to use
    the functionality it provides: FUSE, OCFS2, NTFS and JFFS2. Also, things like
    EXT3 really ought to use it to deal with the case of shared-writable mmap
    encountering ENOSPC before we permit the page to be dirtied.
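
    A sketch of a filesystem hooking the new operation (the names and the
    reservation helper are illustrative):

        static int myfs_page_mkwrite(struct vm_area_struct *vma,
                                     struct page *page)
        {
                /* reserve backing storage before the page goes writable; a
                 * failure here propagates back to the faulting process */
                return myfs_reserve_blocks(vma->vm_file, page);
        }

        static struct vm_operations_struct myfs_vm_ops = {
                .nopage       = filemap_nopage,
                .page_mkwrite = myfs_page_mkwrite,
        };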

    From: Peter Zijlstra

    get_user_pages(.write=1, .force=1) can generate COW hits on read-only
    shared mappings; this patch traps those as page_mkwrite candidates and
    fails to handle them the old way.

    Signed-off-by: David Howells
    Cc: Miklos Szeredi
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc: Anton Altaparmakov
    Cc: David Woodhouse
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Implement read/write migration ptes

    We take the upper two swapfiles for the two types of migration ptes and define
    a series of macros in swapops.h.

    The VM is modified to handle the migration entries. Migration entries can
    only be encountered when the page they are pointing to is locked. This
    limits the number of places one has to fix. We also check in
    copy_pte_range and in mprotect_pte_range() for migration ptes.

    We check for migration ptes in do_swap_page and call a function that will
    then wait on the page lock. This allows us to effectively stop all
    accesses to a page.

    Migration entries are created by try_to_unmap if called for migration and
    removed by local functions in migrate.c

    From: Hugh Dickins

    Several times while testing swapless page migration (I've no NUMA, just
    hacking it up to migrate recklessly while running load), I've hit the
    BUG_ON(!PageLocked(p)) in migration_entry_to_page.

    This comes from an orphaned migration entry, unrelated to the current
    correctly locked migration, but hit by remove_anon_migration_ptes as it
    checks an address in each vma of the anon_vma list.

    Such an orphan may be left behind if an earlier migration raced with fork:
    copy_one_pte can duplicate a migration entry from parent to child, after
    remove_anon_migration_ptes has checked the child vma, but before it has
    removed it from the parent vma. (If the process were later to fault on this
    orphaned entry, it would hit the same BUG from migration_entry_wait.)

    This could be fixed by locking anon_vma in copy_one_pte, but we'd rather
    not. There's no such problem with file pages, because vma_prio_tree_add
    adds child vma after parent vma, and the page table locking at each end is
    enough to serialize. Follow that example with anon_vma: add new vmas to the
    tail instead of the head.

    (There's no corresponding problem when inserting migration entries,
    because a missed pte will leave the page count and mapcount high, which is
    allowed for. And there's no corresponding problem when migrating via swap,
    because a leftover swap entry will be correctly faulted. But the swapless
    method has no refcounting of its entries.)

    From: Ingo Molnar

    pte_unmap_unlock() takes the pte pointer as an argument.

    From: Hugh Dickins

    Several times while testing swapless page migration, gcc has tried to exec
    a pointer instead of a string: smells like COW mappings are not being
    properly write-protected on fork.

    The protection in copy_one_pte looks very convincing, until at last you
    realize that the second arg to make_migration_entry is a boolean "write",
    and SWP_MIGRATION_READ is 30.

    Anyway, it's better done like in change_pte_range, using
    is_write_migration_entry and make_migration_entry_read.
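
    In sketch form, the change_pte_range() handling looks like:

        swp_entry_t entry = pte_to_swp_entry(oldpte);

        if (is_write_migration_entry(entry)) {
                /* the protection change makes the range read-only, so
                 * downgrade the migration entry too */
                make_migration_entry_read(&entry);
                set_pte_at(mm, addr, pte, swp_entry_to_pte(entry));
        }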

    From: Hugh Dickins

    Remove unnecessary obfuscation from sys_swapon's range check on swap type,
    which blew up causing memory corruption once swapless migration made
    MAX_SWAPFILES no longer 2 ^ MAX_SWAPFILES_SHIFT.

    Signed-off-by: Hugh Dickins
    Acked-by: Martin Schwidefsky
    Signed-off-by: Hugh Dickins
    Signed-off-by: Christoph Lameter
    Signed-off-by: Ingo Molnar
    From: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • It is better to redo the complete fault if do_swap_page() finds that the
    page is not in PageSwapCache() because the page migration code may have
    replaced the swap pte already with a pte pointing to valid memory.

    do_swap_page() may interpret an invalid swap entry without this patch
    because we do not reload the pte if we are looping back. The page
    migration code may already have reused the swap entry referenced by our
    local swp_entry.

    Signed-off-by: Christoph Lameter
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

01 Apr, 2006

1 commit

  • The boot cmdline is parsed in parse_early_param() and
    parse_args(,unknown_bootoption).

    And __setup() is used in obsolete_checksetup().

    start_kernel()
    -> parse_args()
    -> unknown_bootoption()
    -> obsolete_checksetup()

    If __setup()'s callback (->setup_func()) returns 1 in
    obsolete_checksetup(), obsolete_checksetup() thinks a parameter was
    handled.

    If ->setup_func() returns 0, obsolete_checksetup() tries the other
    ->setup_func()s. If all ->setup_func()s that matched the parameter return
    0, the parameter is put into argv_init[].

    Then, when running /sbin/init or init=app, argv_init[] is passed to the
    app. If the app doesn't ignore those arguments, it will warn and exit.

    This patch fixes wrong usages of it, though only the obvious ones.
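
    A sketch of the convention a __setup() handler has to follow (the option
    and variable names are made up for illustration):

        static int example_disabled;

        static int __init noexample_setup(char *str)
        {
                example_disabled = 1;
                return 1;       /* tell obsolete_checksetup() the parameter was
                                 * handled, keeping it out of argv_init[] */
        }
        __setup("noexample", noexample_setup);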

    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
     

27 Mar, 2006

2 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial:
    drivers/char/ftape/lowlevel/fdc-io.c: Correct a comment
    Kconfig help: MTD_JEDECPROBE already supports Intel
    Remove ugly debugging stuff
    do_mounts.c: Minor ROOT_DEV comment cleanup
    BUG_ON() Conversion in drivers/s390/block/dasd_devmap.c
    BUG_ON() Conversion in mm/mempool.c
    BUG_ON() Conversion in mm/memory.c
    BUG_ON() Conversion in kernel/fork.c
    BUG_ON() Conversion in ipc/sem.c
    BUG_ON() Conversion in fs/ext2/
    BUG_ON() Conversion in fs/hfs/
    BUG_ON() Conversion in fs/dcache.c
    BUG_ON() Conversion in fs/buffer.c
    BUG_ON() Conversion in input/serio/hp_sdc_mlc.c
    BUG_ON() Conversion in md/dm-table.c
    BUG_ON() Conversion in md/dm-path-selector.c
    BUG_ON() Conversion in drivers/isdn
    BUG_ON() Conversion in drivers/char
    BUG_ON() Conversion in drivers/mtd/

    Linus Torvalds
     
  • Currently, get_user_pages() returns fully coherent pages to the kernel for
    anything other than anonymous pages. This is a problem for things like
    fuse and the SCSI generic ioctl SG_IO which can potentially wish to do DMA
    to anonymous pages passed in by users.

    The fix is to add a new memory management API: flush_anon_page() which
    is used in get_user_pages() to make anonymous pages coherent.

    Signed-off-by: James Bottomley
    Cc: Russell King
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Bottomley