08 May, 2007

3 commits

  • Introduce a macro for suppressing the gcc warning about a variable being
    possibly used uninitialized.

    Example:

    - spinlock_t *ptl;
    + spinlock_t *uninitialized_var(ptl);

    Not a happy solution, but those warnings are obnoxious.
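
    The macro can be built on the `x = x' trick mentioned below; a minimal
    sketch of the definition (the exact form in the tree may differ):

        /* Silence gcc's "may be used uninitialized" warning; the
         * self-assignment generates no code. */
        #define uninitialized_var(x) x = x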

    - Using the usual pointlessly-set-it-to-zero approach wastes several
    bytes of text.

    - Using a macro means we can (hopefully) do something else if gcc changes
    cause the `x = x' hack to stop working.

    - Using a macro means that people who are worried about hiding true bugs
    can easily turn it off.

    Signed-off-by: Borislav Petkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Borislav Petkov
     
  • The minimum gcc version is 3.2 now. However, with likely-profiling enabled,
    even modern gcc versions cannot always eliminate the call.

    Replace the placeholder functions with the more conventional empty static
    inlines, which should be optimal for everyone.
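
    A sketch of the pattern being applied, with a hypothetical hook name:

        /* Before: an out-of-line placeholder; the call itself remains in the
         * caller and gcc cannot always eliminate it. */
        void profile_likely_hit(void);

        /* After: an empty static inline; the compiler sees the empty body at
         * every call site and drops the call entirely. */
        static inline void profile_likely_hit(void) { }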

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Add a new mm function apply_to_page_range() which applies a given function to
    every pte in a given virtual address range in a given mm structure. This is a
    generic alternative to cut-and-pasting the Linux idiomatic pagetable walking
    code in every place that a sequence of PTEs must be accessed.

    Although this interface is intended to be useful in a wide range of
    situations, it is currently used specifically by several Xen subsystems, for
    example: to ensure that pagetables have been allocated for a virtual address
    range, and to construct batched special pagetable update requests to map I/O
    memory (in ioremap()).
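
    A usage sketch; the callback's argument list follows the description above
    and may differ slightly from the final interface:

        /* illustrative callback: invoked once for every pte in the range */
        static int count_present_pte(pte_t *pte, struct page *pmd_page,
                                     unsigned long addr, void *data)
        {
                if (pte_present(*pte))
                        (*(unsigned long *)data)++;
                return 0;               /* a non-zero return aborts the walk */
        }

        unsigned long present = 0;
        /* walk [addr, addr + size) in mm, allocating page tables as needed */
        err = apply_to_page_range(mm, addr, size, count_present_pte, &present);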

    [akpm@linux-foundation.org: fix warning, unpleasantly]
    Signed-off-by: Ian Pratt
    Signed-off-by: Christian Limpach
    Signed-off-by: Chris Wright
    Signed-off-by: Jeremy Fitzhardinge
    Cc: Christoph Lameter
    Cc: Matt Mackall
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeremy Fitzhardinge
     

13 Feb, 2007

2 commits

  • Add a NOPFN_REFAULT return code for vm_ops->nopfn(), equivalent to
    NOPAGE_REFAULT for vm_ops->nopage(), indicating that the handler requests a
    re-execution of the faulting instruction.

    Signed-off-by: Benjamin Herrenschmidt
    Cc: Arnd Bergmann
    Cc: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • Add a vm_insert_pfn helper, so that ->fault handlers can have nopfn
    functionality by installing their own pte and returning NULL.
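
    A minimal sketch of the helper's use (the surrounding fault plumbing is
    omitted):

        /* install a pte for a raw page frame number that has no struct page
         * behind it; returns 0 on success or a negative errno */
        err = vm_insert_pfn(vma, address, pfn);

    On success the handler then returns NULL, as described above, because the
    pte is already in place.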

    Signed-off-by: Nick Piggin
    Signed-off-by: Benjamin Herrenschmidt
    Cc: Arnd Bergmann
    Cc: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

12 Feb, 2007

3 commits

  • A variety of (mostly) innocuous fixes to the embedded kernel-doc content in
    source files, including:

    * make multi-line initial descriptions single line
    * denote some function names, constants and structs as such
    * change erroneous opening '/*' to '/**' in a few places
    * reword some text for clarity

    Signed-off-by: Robert P. J. Day
    Cc: "Randy.Dunlap"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert P. J. Day
     
  • When the kernel unmaps an address range, it needs to transfer PTE state
    into the page struct. Currently, the kernel transfers the accessed bit via
    mark_page_accessed(). The call to mark_page_accessed() in the unmap path
    doesn't look logically correct.

    At unmap time, calling mark_page_accessed() causes the page's LRU state to
    be bumped one step closer to the most-recently-used state. It causes quite
    a headache in a scenario where a process creates a shmem segment, touches
    a whole bunch of pages, then unmaps it. The unmapping takes a long time
    because mark_page_accessed() will start moving pages from the inactive to
    the active list.

    I'm not too concerned with moving the page from one LRU list to another.
    Sooner or later it might be moved because of multiple mappings from
    various processes. But it just doesn't look logical: when the user asks
    for a range to be unmapped, the intention is that the process is no longer
    interested in these pages. Moving those pages to the active list (or
    bumping their state towards more active) seems to be an overreaction. It
    also prolongs unmapping latency, which is the core issue I'm trying to
    solve.

    As suggested by Peter, we should still preserve the pte-young information
    on those pages, but no more.
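
    A sketch of the kind of change described, in zap_pte_range() (the actual
    patch may differ in detail):

        - if (pte_young(ptent))
        -         mark_page_accessed(page);   /* promotes the page in the LRU */
        + if (pte_young(ptent))
        +         SetPageReferenced(page);    /* just record the young bit */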

    Signed-off-by: Peter Zijlstra
    Acked-by: Ken Chen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
     
  • After do_wp_page has tested page_mkwrite, it must release old_page after
    acquiring page table lock, not before: at some stage that ordering got
    reversed, leaving a (very unlikely) window in which old_page might be
    truncated, freed, and reused in the same position.

    Signed-off-by: Hugh Dickins
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

27 Jan, 2007

2 commits

  • This patch fixes core dumps to include the vDSO vma, which is left out now.
    It removes the special-case core writing macros, which were not doing the
    right thing for the vDSO vma anyway. Instead, it uses VM_ALWAYSDUMP in the
    vma; there is no need for the fixmap page to be installed. It handles the
    CONFIG_COMPAT_VDSO case by making elf_core_dump use the fake vma from
    get_gate_vma after real vmas in the same way the /proc/PID/maps code does.

    This changes core dumps so they no longer include the non-PT_LOAD phdrs from
    the vDSO. I made the change to add them in the first place, but it turned
    out that nothing ever wanted them there since the advent of NT_AUXV. It's cleaner
    to leave them out, and just let the phdrs inside the vDSO image speak for
    themselves.

    Signed-off-by: Roland McGrath
    Cc: Ingo Molnar
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
  • This patch fixes the initialization of gate_vma.vm_flags and
    gate_vma.vm_page_prot to reflect reality. This makes the "[vdso]" line in
    /proc/PID/maps correctly show r-xp instead of ---p, when gate_vma is used
    (CONFIG_COMPAT_VDSO on i386).

    Signed-off-by: Roland McGrath
    Cc: Ingo Molnar
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     

09 Jan, 2007

1 commit

  • Since get_user_pages() may be used with processes other than the
    current process and calls flush_anon_page(), flush_anon_page() has to
    cope in some way with non-current processes.

    It may not be appropriate, or even desirable to flush a region of
    virtual memory cache in the current process when that is different to
    the process that we want the flush to occur for.

    Therefore, pass the vma into flush_anon_page() so that the architecture
    can work out whether the 'vmaddr' is for the current process or not.
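
    The interface change, roughly (callers such as get_user_pages() pass the
    vma they already hold):

        - void flush_anon_page(struct page *page, unsigned long vmaddr);
        + void flush_anon_page(struct vm_area_struct *vma,
        +                      struct page *page, unsigned long vmaddr);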

    Signed-off-by: Russell King

    Russell King
     

14 Dec, 2006

1 commit

  • To allow a more effective copy_user_highpage() on certain architectures,
    a vma argument is added to the function and cow_user_page() allowing
    the implementation of these functions to check for the VM_EXEC bit.

    The main part of this patch was originally written by Ralf Baechle;
    Atsushi Nemoto did the debugging.

    Signed-off-by: Atsushi Nemoto
    Signed-off-by: Ralf Baechle
    Signed-off-by: Linus Torvalds

    Atsushi Nemoto
     

11 Dec, 2006

1 commit

  • Ramiro Voicu hits the BUG_ON(!pte_none(*pte)) in zeromap_pte_range: kernel
    bugzilla 7645. Right: read_zero_pagealigned uses down_read of mmap_sem,
    but another thread's racing read of /dev/zero, or a normal fault, can
    easily set that pte again, in between zap_page_range and zeromap_page_range
    getting there. It's been wrong ever since 2.4.3.

    The simple fix is to use down_write instead, but that would serialize reads
    of /dev/zero more than at present: perhaps some app would be badly
    affected. So instead let zeromap_page_range return the error instead of
    BUG_ON, and read_zero_pagealigned break to the slower clear_user loop in
    that case - there's no need to optimize for it.

    Use -EEXIST for when a pte is found: BUG_ON in mmap_zero (the other user of
    zeromap_page_range), though it really isn't interesting there. And since
    mmap_zero wants -EAGAIN for out-of-memory, the zeromaps better return that
    than -ENOMEM.
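
    In sketch form, the read_zero_pagealigned() side of the fix (argument
    details are illustrative):

        /* try the fast path; any error (e.g. -EEXIST from a racing pte)
         * means falling back to the slower clear_user() loop */
        if (zeromap_page_range(vma, addr, size, vma->vm_page_prot))
                break;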

    Signed-off-by: Hugh Dickins
    Cc: Ramiro Voicu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

21 Oct, 2006

1 commit

  • From mm/memory.c, cow_user_page() (quoted from around line 1434):

        static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va)
        {
                /*
                 * If the source page was a PFN mapping, we don't have
                 * a "struct page" for it. We do a best-effort copy by
                 * just copying from the original user address. If that
                 * fails, we just zero-fill it. Live with it.
                 */
                if (unlikely(!src)) {
                        void *kaddr = kmap_atomic(dst, KM_USER0);
                        void __user *uaddr = (void __user *)(va & PAGE_MASK);

                        /*
                         * This really shouldn't fail, because the page is there
                         * in the page tables. But it might just be unreadable,
                         * in which case we just give up and fill the result with
                         * zeroes.
                         */
                        if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE))
                                memset(kaddr, 0, PAGE_SIZE);
                        kunmap_atomic(kaddr, KM_USER0);
        #### The D-cache has to be flushed here.
        #### It seems this was simply forgotten.
                        return;

                }
                copy_user_highpage(dst, src, va);
        #### OK here: flush_dcache_page() is called from this function if the
        #### arch needs it.
        }

    The patch fixes this by adding the missing D-cache flush.
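
    In sketch form (placed right after the atomic copy path, as the
    annotations above describe):

          if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE))
                  memset(kaddr, 0, PAGE_SIZE);
          kunmap_atomic(kaddr, KM_USER0);
        + flush_dcache_page(dst);       /* push the copied data out of the D-cache */
          return;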

    Signed-off-by: Dmitriy Monakhov
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitriy Monakhov
     

06 Oct, 2006

1 commit

  • Add a way for a no_page() handler to request a retry of the faulting
    instruction. It goes back to userland on page faults and just tries again
    in get_user_pages(). I added a cond_resched() in the loop in that latter
    case.

    The problem I have with signal and spufs is an actual bug affecting apps and I
    don't see other ways of fixing it.

    In addition, we are having issues with infiniband and 64k pages (related to
    the way the hypervisor deals with some HV cards) that will require us to muck
    around with the MMU from within the IB driver's no_page() (it's a pSeries
    specific driver) and return to the caller the same way using NOPAGE_REFAULT.

    And to add to this, the graphics folks have been following a new approach
    to memory management that involves transparently swapping objects between
    video ram and main memory. To do that, they need to install PTEs from a
    no_page() handler as well, and that also requires returning with
    NOPAGE_REFAULT.

    (For the latter, they are currently using io_remap_pfn_range to install one
    PTE from no_page(), which is a bit racy; we need to add a check for the PTE
    having already been installed after taking the lock, but that's ok, they
    are only at the proof-of-concept stage. I'll send a patch adding a "clean"
    function to do
    that, we can use that from spufs too and get rid of the sparsemem hacks we do
    to create struct page for SPEs. Basically, that provides a generic solution
    for being able to have no_page() map hardware devices, which is something that
    I think sound driver folks have been asking for some time too).

    All of these things depend on having the NOPAGE_REFAULT exit path from
    no_page() handlers.
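
    A sketch of a ->nopage handler using the new exit path (the pte-installing
    helper is hypothetical):

        static struct page *mydrv_nopage(struct vm_area_struct *vma,
                                         unsigned long address, int *type)
        {
                if (mydrv_install_pte(vma, address) == 0)
                        return NOPAGE_REFAULT;  /* pte installed: just retry */
                return NOPAGE_SIGBUS;
        }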

    Signed-off-by: Benjamin Herrenschmidt
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     

01 Oct, 2006

3 commits

  • Implement lazy MMU update hooks which are SMP safe for both direct and shadow
    page tables. The idea is that PTE updates and page invalidations while in
    lazy mode can be batched into a single hypercall. We use this in VMI for
    shadow page table synchronization, and it is a win. It also can be used by
    PPC and for direct page tables on Xen.

    For SMP, the enter / leave must happen under protection of the page table
    locks for page tables which are being modified. This is because otherwise,
    you end up with stale state in the batched hypercall, which other CPUs can
    race ahead of. Doing this under the protection of the locks guarantees the
    synchronization is correct, and also means that spurious faults which are
    generated during this window by remote CPUs are properly handled, as the page
    fault handler must re-check the PTE under protection of the same lock.
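
    The resulting pattern in the pte-walking loops, as a sketch:

        pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        arch_enter_lazy_mmu_mode();             /* start batching pte updates */
        do {
                set_pte_at(mm, addr, pte, mk_pte(page, vma->vm_page_prot));
        } while (pte++, addr += PAGE_SIZE, addr != end);
        arch_leave_lazy_mmu_mode();             /* flush the batch, still under ptl */
        pte_unmap_unlock(pte - 1, ptl);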

    Signed-off-by: Zachary Amsden
    Signed-off-by: Jeremy Fitzhardinge
    Cc: Rusty Russell
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • Change pte_clear_full to a more appropriately named pte_clear_not_present,
    allowing optimizations when not-present mapping changes need not be reflected
    in the hardware TLB for protected page table modes. There is also another
    case that can use it in the fremap code.

    Signed-off-by: Zachary Amsden
    Signed-off-by: Jeremy Fitzhardinge
    Cc: Rusty Russell
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • We don't want to read PTEs directly like this after they have been modified,
    as a lazy MMU implementation of direct page tables may not have written the
    updated PTE back to memory yet.

    Signed-off-by: Zachary Amsden
    Signed-off-by: Jeremy Fitzhardinge
    Cc: Rusty Russell
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     

30 Sep, 2006

1 commit

  • The failing context is a multi-threaded process and the failing sequence
    is as follows.

    One thread T0 doing self modifying code on page X on processor P0 and
    another thread T1 doing COW (breaking the COW setup as part of just
    happened fork() in another thread T2) on the same page X on processor P1.
    T0 doing SMC can end up modifying the new page Y (allocated by T1 doing
    COW on P1), but because of the separate I/D TLBs, P0's ITLB will not see
    the new mapping till the TLB flush IPI from P1 is received. During this
    interval, if T0 executes the code created by SMC it can result in an app
    error (as the ITLB still points to the old page X, and T0 ends up
    executing the content in page X rather than the content in page Y).

    Fix this issue by first clearing the PTE and flushing it, before updating
    it with the new entry.
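
    In sketch form, the ordering becomes:

        /* clear the old pte and flush it (this is what sends the TLB
         * flush IPI) ... */
        ptep_clear_flush(vma, address, page_table);
        /* ... and only then establish the new mapping */
        set_pte_at(mm, address, page_table, entry);
        update_mmu_cache(vma, address, entry);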

    Hugh sayeth:

    I was a bit sceptical, in the habit of thinking that Self Modifying Code
    must look after such issues itself: but I guess there's nothing it can do
    to avoid this one.

    Fair enough, what you're changing it to is pretty much what powerpc and
    s390 were already doing, and is a more robust way of proceeding, consistent
    with how ptes are set everywhere else.

    The ptep_clear_flush is a bit heavy-handed (it's anxious to return the pte
    that was atomically cleared), but we'd have to wander through lots of arches
    to get the right minimal behaviour. It'd also be nice to eliminate
    ptep_establish completely, now only used to define other macros/inlines: it
    always seemed obfuscation to me, what you've got there now is clearer.
    Let's put those cleanups on a TODO list.

    Signed-off-by: Suresh Siddha
    Acked-by: "David S. Miller"
    Acked-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Siddha, Suresh B
     

27 Sep, 2006

2 commits

  • Check that access_process_vm() is accessing a valid mapping in the target
    process.

    This limits ptrace() accesses and accesses through /proc/PID/maps to only
    those regions actually mapped by a program.

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Implement do_no_pfn() for handling mapping of memory without a struct page
    backing it. This avoids creating fake page table entries for regions which
    are not backed by real memory.

    This feature is used by the MSPEC driver and other users, where it is
    highly undesirable to have a struct page sitting behind the page (for
    instance if the page is accessed in cached mode via the struct page in
    parallel to the driver accessing it uncached, which can result in data
    corruption on some architectures, such as ia64).

    This version uses specific NOPFN_{SIGBUS,OOM} return values, rather than
    expecting all negative pfn values to be errors. It also BUGs on COW
    mappings, as these would not work with the VM.
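
    A sketch of a ->nopfn handler using the new return values (the lookup
    helper is hypothetical):

        static unsigned long mydrv_nopfn(struct vm_area_struct *vma,
                                         unsigned long address)
        {
                unsigned long pfn = mydrv_addr_to_pfn(vma, address);

                if (!pfn)
                        return NOPFN_SIGBUS;    /* no backing memory here */
                return pfn;                     /* the VM installs the pte */
        }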

    [akpm@osdl.org: micro-optimise]
    Signed-off-by: Jes Sorensen
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jes Sorensen
     

26 Sep, 2006

4 commits

  • These functions are already documented quite well with long comments. Now
    add kerneldoc-style headers to make them turn up in everyone's favorite
    doc format.

    Signed-off-by: Rolf Eike Beer
    Cc: "Randy.Dunlap"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rolf Eike Beer
     
  • Wrt. the recent modifications in do_wp_page() Hugh Dickins pointed out:

    "I now realize it's right to the first order (normal case) and to the
    second order (ptrace poke), but not to the third order (ptrace poke
    anon page here to be COWed - perhaps can't occur without intervening
    mprotects)."

    This patch restores the old COW behaviour for anonymous pages.

    Signed-off-by: Peter Zijlstra
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Now that we can detect writers of shared mappings, throttle them. Avoids OOM
    by surprise.

    Signed-off-by: Peter Zijlstra
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Tracking of dirty pages in shared writeable mmap()s.

    The idea is simple: write protect clean shared writeable pages, catch the
    write-fault, make writeable and set dirty. On page write-back clean all the
    PTE dirty bits and write protect them once again.

    The implementation is a tad harder, mainly because the default
    backing_dev_info capabilities were too loosely maintained. Hence it is not
    enough to test the backing_dev_info for cap_account_dirty.

    The current heuristic is as follows (see the sketch below); a VMA is
    eligible when:
    - it is shared writable
    (vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED)
    - it is not a 'special' mapping
    (vm_flags & (VM_PFNMAP|VM_INSERTPAGE)) == 0
    - the backing_dev_info is cap_account_dirty
    mapping_cap_account_dirty(vma->vm_file->f_mapping)
    - f_op->mmap() didn't change the default page protection

    Pages from remap_pfn_range() are explicitly excluded because their COW
    semantics are already horrid enough (see vm_normal_page() in do_wp_page())
    and because they don't have a backing store anyway.
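
    A sketch of that eligibility test gathered into one helper (the helper
    name is illustrative; the fourth condition, unchanged default page
    protection, is omitted here):

        static int vma_wants_dirty_tracking(struct vm_area_struct *vma)
        {
                if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) !=
                    (VM_WRITE|VM_SHARED))
                        return 0;
                if (vma->vm_flags & (VM_PFNMAP|VM_INSERTPAGE))
                        return 0;
                return vma->vm_file &&
                       mapping_cap_account_dirty(vma->vm_file->f_mapping);
        }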

    mprotect() is taught about the new behaviour as well. However it overrides
    the last condition.

    Cleaning the pages on write-back is done with page_mkclean(), a new rmap
    call. It can be called on any page, but is currently only implemented for
    mapped pages; if the page is found to be in a VMA that accounts dirty
    pages, it will also write-protect the PTE.

    Finally, in fs/buffer.c:try_to_free_buffers(), remove clear_page_dirty()
    from
    under ->private_lock. This seems to be safe, since ->private_lock is used to
    serialize access to the buffers, not the page itself. This is needed because
    clear_page_dirty() will call into page_mkclean() and would thereby violate
    locking order.

    [dhowells@redhat.com: Provide a page_mkclean() implementation for NOMMU]
    Signed-off-by: Peter Zijlstra
    Cc: Hugh Dickins
    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

15 Jul, 2006

2 commits

  • Unlike earlier iterations of the delay accounting patches, delays are now
    only collected for the actual I/O waits rather than trying to cover the
    delays seen in I/O submission paths.

    Account separately for block I/O delays incurred as a result of swapin page
    faults whose frequency can be affected by the task/process' rss limit. Hence
    swapin delays can act as feedback for rss limit changes independent of I/O
    priority changes.

    Signed-off-by: Shailabh Nagar
    Signed-off-by: Balbir Singh
    Cc: Jes Sorensen
    Cc: Peter Chubb
    Cc: Erich Focht
    Cc: Levent Serinol
    Cc: Jay Lan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shailabh Nagar
     
  • There is a race condition that showed up in a threaded JIT environment.
    The situation is that a process with a JIT code page forks, so the page is
    marked read-only, then some threads are created in the child. One of the
    threads attempts to add a new code block to the JIT page, so a
    copy-on-write fault is taken, and the kernel allocates a new page, copies
    the data, installs the new pte, and then calls lazy_mmu_prot_update() to
    flush caches to make sure that the icache and dcache are in sync.
    Unfortunately, the other thread runs right after the new pte is installed,
    but before the caches have been flushed. It tries to execute some old JIT
    code that was already in this page, but it sees some garbage in the i-cache
    from the previous users of the new physical page.

    Fix: we must make the caches consistent before installing the pte. This is
    an ia64 only fix because lazy_mmu_prot_update() is a no-op on all other
    architectures.

    Signed-off-by: Anil Keshavamurthy
    Signed-off-by: Tony Luck
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anil Keshavamurthy
     

04 Jul, 2006

1 commit

  • Teach special (recursive) locking code to the lock validator. Has no effect
    on non-lockdep kernels.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

01 Jul, 2006

2 commits

  • The remaining counters in page_state after the zoned VM counter patches
    have been applied are all just for show in /proc/vmstat. They have no
    essential function for the VM.

    We use a simple increment of per-cpu variables. In order to avoid the most
    severe races we disable preemption. Disabling preemption does not prevent
    the race between an increment and an interrupt handler incrementing the
    same statistics counter; however, that race is exceedingly rare, we may
    only lose an increment or so, and there is no requirement (at least not in
    the kernel) that the VM event counters be accurate.

    In the non preempt case this results in a simple increment for each
    counter. For many architectures this will be reduced by the compiler to a
    single instruction. This single instruction is atomic for i386 and x86_64.
    And therefore even the rare race condition in an interrupt is avoided for
    both architectures in most cases.

    The patchset also adds an off switch for embedded systems that allows
    building Linux kernels without these counters.

    The implementation of these counters is through inline code that hopefully
    results in only a single increment instruction being emitted (i386,
    x86_64), or in the increment being hidden through instruction concurrency
    (EPIC architectures such as ia64 can get that done).
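
    The increment itself comes out along these lines (a sketch of the per-cpu
    pattern; the real header may differ in detail):

        DECLARE_PER_CPU(struct vm_event_state, vm_event_states);

        static inline void count_vm_event(enum vm_event_item item)
        {
                get_cpu_var(vm_event_states).event[item]++; /* disables preemption */
                put_cpu_var(vm_event_states);               /* re-enables it */
        }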

    Benefits:
    - VM event counter operations usually reduce to a single inline instruction
    on i386 and x86_64.
    - No interrupt disable, only preempt disable for the preempt case.
    Preempt disable can also be avoided by moving the counter into a spinlock.
    - Handling is similar to zoned VM counters.
    - Simple and easily extendable.
    - Can be omitted to reduce memory use for embedded use.

    References:

    RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=113512330605497&w=2
    RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=114988082814934&w=2
    local_t http://marc.theaimsgroup.com/?l=linux-kernel&m=114991748606690&w=2
    V2 http://marc.theaimsgroup.com/?t=115014808400007&r=1&w=2
    V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767022346&w=2
    V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115047968808926&w=2

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Conversion of nr_page_table_pages to a per zone counter

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

23 Jun, 2006

3 commits

  • Add a new VMA operation to notify a filesystem or other driver about the
    MMU generating a fault because userspace attempted to write to a page
    mapped through a read-only PTE.

    This facility permits the filesystem or driver to:

    (*) Implement storage allocation/reservation on attempted write, and so to
    deal with problems such as ENOSPC more gracefully (perhaps by generating
    SIGBUS).

    (*) Delay making the page writable until the contents have been written to a
    backing cache. This is useful for NFS/AFS when using FS-Cache/CacheFS.
    It permits the filesystem to have some guarantee about the state of the
    cache.

    (*) Account and limit number of dirty pages. This is one piece of the puzzle
    needed to make shared writable mapping work safely in FUSE.

    Needed by cachefs (Or is it cachefiles? Or fscache?).

    At least four other groups have stated an interest in it or a desire to use
    the functionality it provides: FUSE, OCFS2, NTFS and JFFS2. Also, things like
    EXT3 really ought to use it to deal with the case of shared-writable mmap
    encountering ENOSPC before we permit the page to be dirtied.
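
    A sketch of a filesystem hooking the new operation (the names and the
    reservation helper are illustrative):

        static int myfs_page_mkwrite(struct vm_area_struct *vma,
                                     struct page *page)
        {
                /* reserve backing storage before the page goes writable; a
                 * failure here propagates back to the faulting process */
                return myfs_reserve_blocks(vma->vm_file, page);
        }

        static struct vm_operations_struct myfs_vm_ops = {
                .nopage       = filemap_nopage,
                .page_mkwrite = myfs_page_mkwrite,
        };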

    From: Peter Zijlstra

    get_user_pages(.write=1, .force=1) can generate COW hits on read-only
    shared mappings; this patch traps those as page_mkwrite candidates and
    fails to handle them the old way.

    Signed-off-by: David Howells
    Cc: Miklos Szeredi
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc: Anton Altaparmakov
    Cc: David Woodhouse
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Implement read/write migration ptes

    We take the upper two swapfiles for the two types of migration ptes and define
    a series of macros in swapops.h.

    The VM is modified to handle the migration entries. Migration entries can
    only be encountered when the page they are pointing to is locked. This
    limits the number of places one has to fix. We also check in
    copy_pte_range and in mprotect_pte_range() for migration ptes.

    We check for migration ptes in do_swap_page and call a function that will
    then wait on the page lock. This allows us to effectively stop all
    accesses to a page.

    Migration entries are created by try_to_unmap if called for migration and
    removed by local functions in migrate.c

    From: Hugh Dickins

    Several times while testing swapless page migration (I've no NUMA, just
    hacking it up to migrate recklessly while running load), I've hit the
    BUG_ON(!PageLocked(p)) in migration_entry_to_page.

    This comes from an orphaned migration entry, unrelated to the current
    correctly locked migration, but hit by remove_anon_migration_ptes as it
    checks an address in each vma of the anon_vma list.

    Such an orphan may be left behind if an earlier migration raced with fork:
    copy_one_pte can duplicate a migration entry from parent to child, after
    remove_anon_migration_ptes has checked the child vma, but before it has
    removed it from the parent vma. (If the process were later to fault on this
    orphaned entry, it would hit the same BUG from migration_entry_wait.)

    This could be fixed by locking anon_vma in copy_one_pte, but we'd rather
    not. There's no such problem with file pages, because vma_prio_tree_add
    adds child vma after parent vma, and the page table locking at each end is
    enough to serialize. Follow that example with anon_vma: add new vmas to the
    tail instead of the head.

    (There's no corresponding problem when inserting migration entries,
    because a missed pte will leave the page count and mapcount high, which is
    allowed for. And there's no corresponding problem when migrating via swap,
    because a leftover swap entry will be correctly faulted. But the swapless
    method has no refcounting of its entries.)

    From: Ingo Molnar

    pte_unmap_unlock() takes the pte pointer as an argument.

    From: Hugh Dickins

    Several times while testing swapless page migration, gcc has tried to exec
    a pointer instead of a string: smells like COW mappings are not being
    properly write-protected on fork.

    The protection in copy_one_pte looks very convincing, until at last you
    realize that the second arg to make_migration_entry is a boolean "write",
    and SWP_MIGRATION_READ is 30.

    Anyway, it's better done like in change_pte_range, using
    is_write_migration_entry and make_migration_entry_read.
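
    In sketch form, the change_pte_range() handling looks like:

        swp_entry_t entry = pte_to_swp_entry(oldpte);

        if (is_write_migration_entry(entry)) {
                /* the protection change makes the range read-only, so
                 * downgrade the migration entry too */
                make_migration_entry_read(&entry);
                set_pte_at(mm, addr, pte, swp_entry_to_pte(entry));
        }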

    From: Hugh Dickins

    Remove unnecessary obfuscation from sys_swapon's range check on swap type,
    which blew up causing memory corruption once swapless migration made
    MAX_SWAPFILES no longer 2 ^ MAX_SWAPFILES_SHIFT.

    Signed-off-by: Hugh Dickins
    Acked-by: Martin Schwidefsky
    Signed-off-by: Hugh Dickins
    Signed-off-by: Christoph Lameter
    Signed-off-by: Ingo Molnar
    From: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • It is better to redo the complete fault if do_swap_page() finds that the
    page is not in PageSwapCache() because the page migration code may have
    replaced the swap pte already with a pte pointing to valid memory.

    do_swap_page() may interpret an invalid swap entry without this patch
    because we do not reload the pte if we are looping back. The page
    migration code may already have reused the swap entry referenced by our
    local swp_entry.

    Signed-off-by: Christoph Lameter
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

01 Apr, 2006

1 commit

  • The boot cmdline is parsed in parse_early_param() and
    parse_args(,unknown_bootoption).

    And __setup() is used in obsolete_checksetup().

    start_kernel()
    -> parse_args()
    -> unknown_bootoption()
    -> obsolete_checksetup()

    If __setup()'s callback (->setup_func()) returns 1 in
    obsolete_checksetup(), obsolete_checksetup() thinks a parameter was
    handled.

    If ->setup_func() returns 0, obsolete_checksetup() tries the other
    ->setup_func()s. If all ->setup_func()s that matched the parameter return
    0, the parameter is put into argv_init[].

    Then, when running /sbin/init or init=app, argv_init[] is passed to the
    app. If the app doesn't ignore those arguments, it will warn and exit.

    This patch fixes wrong usages of it, though only the obvious ones.
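
    A sketch of the convention a __setup() handler has to follow (the option
    and variable names are made up for illustration):

        static int example_disabled;

        static int __init noexample_setup(char *str)
        {
                example_disabled = 1;
                return 1;       /* tell obsolete_checksetup() the parameter was
                                 * handled, keeping it out of argv_init[] */
        }
        __setup("noexample", noexample_setup);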

    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
     

27 Mar, 2006

2 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial:
    drivers/char/ftape/lowlevel/fdc-io.c: Correct a comment
    Kconfig help: MTD_JEDECPROBE already supports Intel
    Remove ugly debugging stuff
    do_mounts.c: Minor ROOT_DEV comment cleanup
    BUG_ON() Conversion in drivers/s390/block/dasd_devmap.c
    BUG_ON() Conversion in mm/mempool.c
    BUG_ON() Conversion in mm/memory.c
    BUG_ON() Conversion in kernel/fork.c
    BUG_ON() Conversion in ipc/sem.c
    BUG_ON() Conversion in fs/ext2/
    BUG_ON() Conversion in fs/hfs/
    BUG_ON() Conversion in fs/dcache.c
    BUG_ON() Conversion in fs/buffer.c
    BUG_ON() Conversion in input/serio/hp_sdc_mlc.c
    BUG_ON() Conversion in md/dm-table.c
    BUG_ON() Conversion in md/dm-path-selector.c
    BUG_ON() Conversion in drivers/isdn
    BUG_ON() Conversion in drivers/char
    BUG_ON() Conversion in drivers/mtd/

    Linus Torvalds
     
  • Currently, get_user_pages() returns fully coherent pages to the kernel for
    anything other than anonymous pages. This is a problem for things like
    fuse and the SCSI generic ioctl SG_IO which can potentially wish to do DMA
    to anonymous pages passed in by users.

    The fix is to add a new memory management API: flush_anon_page() which
    is used in get_user_pages() to make anonymous pages coherent.

    Signed-off-by: James Bottomley
    Cc: Russell King
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Bottomley