05 Mar, 2008

2 commits

  • Don't uncharge when do_swap_page's call to do_wp_page fails: the page which
    was charged for is there in the pagetable, and will be correctly uncharged
    when that area is unmapped - it was only its COWing which failed.

    And while we're here, remove earlier XXX comment: yes, OR in do_wp_page's
    return value (maybe VM_FAULT_WRITE) with do_swap_page's there; but if it
    fails, mask out success bits, which might confuse some arches e.g. sparc.

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
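
    The OR-and-mask idea above can be sketched roughly as follows; this is an
    illustrative fragment in the shape of do_swap_page's tail, not the verbatim
    patch, and the exact do_wp_page argument list is assumed:

        if (write_access) {
                /* fold do_wp_page's result (e.g. VM_FAULT_WRITE) into ours */
                ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, pte);
                if (ret & VM_FAULT_ERROR)
                        ret &= VM_FAULT_ERROR;  /* on failure, keep only error bits */
                goto out;
        }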
     
  • There's nothing wrong with mem_cgroup_charge failure in do_wp_page and
    do_anonymous_page using __free_page, but it does look odd when nearby code
    uses page_cache_release: use that instead (while turning a blind eye to
    ancient inconsistencies of page_cache_release versus put_page).

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

15 Feb, 2008

3 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86:
    x86: cpa, fix out of date comment
    KVM is not seen under X86 config with latest git (32 bit compile)
    x86: cpa: ensure page alignment
    x86: include proper prototypes for rodata_test
    x86: fix gart_iommu_init()
    x86: EFI set_memory_x()/set_memory_uc() fixes
    x86: make dump_pagetable() static
    x86: fix "BUG: sleeping function called from invalid context" in print_vma_addr()

    Linus Torvalds
     
    d_path() is used on a (dentry, vfsmount) pair. Let's use a struct path to
    reflect this.

    [akpm@linux-foundation.org: fix build in mm/memory.c]
    Signed-off-by: Jan Blunck
    Acked-by: Bryan Wu
    Acked-by: Christoph Hellwig
    Cc: Al Viro
    Cc: "J. Bruce Fields"
    Cc: Neil Brown
    Cc: Michael Halcrow
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Blunck
     
  • Jiri Kosina reported the following deadlock scenario with
    show_unhandled_signals enabled:

    [ 68.379022] gnome-settings-[2941] trap int3 ip:3d2c840f34
    sp:7fff36f5d100 error:0BUG: sleeping function called from invalid
    context at kernel/rwsem.c:21
    [ 68.379039] in_atomic():1, irqs_disabled():0
    [ 68.379044] no locks held by gnome-settings-/2941.
    [ 68.379050] Pid: 2941, comm: gnome-settings- Not tainted 2.6.25-rc1 #30
    [ 68.379054]
    [ 68.379056] Call Trace:
    [ 68.379061] [] ? __debug_show_held_locks+0x13/0x30
    [ 68.379109] [] __might_sleep+0xe5/0x110
    [ 68.379123] [] down_read+0x20/0x70
    [ 68.379137] [] print_vma_addr+0x3a/0x110
    [ 68.379152] [] do_trap+0xf5/0x170
    [ 68.379168] [] do_int3+0x7b/0xe0
    [ 68.379180] [] int3+0x9f/0xd0
    [ 68.379203] <>
    [ 68.379229] in libglib-2.0.so.0.1505.0[3d2c800000+dc000]

    and tracked it down to:

    commit 03252919b79891063cf99145612360efbdf9500b
    Author: Andi Kleen
    Date: Wed Jan 30 13:33:18 2008 +0100

    x86: print which shared library/executable faulted in segfault etc. messages

    the problem is that we call down_read() from an atomic context.

    Solve this by returning from print_vma_addr() if the preempt count is
    elevated. Update preempt_conditional_sti / preempt_conditional_cli to
    unconditionally lift the preempt count even on !CONFIG_PREEMPT.

    Reported-by: Jiri Kosina
    Signed-off-by: Ingo Molnar

    Ingo Molnar
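
    A minimal sketch of the fix described above (assumed shape, not the exact
    diff): print_vma_addr() bails out when called from atomic context, since
    taking mmap_sem may sleep there.

        void print_vma_addr(char *prefix, unsigned long ip)
        {
                struct mm_struct *mm = current->mm;

                /*
                 * We may be called from an atomic context, e.g. a trap handler
                 * running with preemption disabled; down_read() can sleep, so
                 * simply skip the printout in that case.
                 */
                if (preempt_count())
                        return;

                down_read(&mm->mmap_sem);
                /* look up the vma containing ip, print file name and offset */
                up_read(&mm->mmap_sem);
        }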
     

12 Feb, 2008

1 commit

  • So I spent a while pounding my head against my monitor trying to figure
    out the vmsplice() vulnerability - how could a failure to check for
    *read* access turn into a root exploit? It turns out that it's a buffer
    overflow problem which is made easy by the way get_user_pages() is
    coded.

    In particular, "len" is a signed int, and it is only checked at the
    *end* of a do {} while() loop. So, if it is passed in as zero, the loop
    will execute once and decrement len to -1. At that point, the loop will
    proceed until the next invalid address is found; in the process, it will
    likely overflow the pages array passed in to get_user_pages().

    I think that, if get_user_pages() has been asked to grab zero pages,
    that's what it should do. Thus this patch; it is, among other things,
    enough to block the (already fixed) root exploit and any others which
    might be lurking in similar code. I also think that the number of pages
    should be unsigned, but changing the prototype of this function probably
    requires some more careful review.

    Signed-off-by: Jonathan Corbet
    Signed-off-by: Linus Torvalds

    Jonathan Corbet
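
    The pitfall is easy to reproduce outside the kernel. Below is a small,
    self-contained C program (illustrative only; the names are made up) showing
    how a signed count that is only tested at the bottom of a do {} while loop
    walks past the end of the caller's array when zero is passed in, and the
    trivial guard that prevents it:

        #include <stdio.h>

        #define NR_SLOTS 4

        /* mimics a caller that sized its array for "npages" entries */
        static int fill_slots(int *slots, int nslots, int npages)
        {
                int i = 0;

                /* the guard: asked for zero pages, return zero pages */
                if (npages <= 0)
                        return 0;

                do {
                        if (i >= nslots) {      /* belt and braces for the demo */
                                fprintf(stderr, "stopped before overflow, i=%d\n", i);
                                break;
                        }
                        slots[i++] = 42;        /* "grab a page" */
                        npages--;
                } while (npages != 0);          /* 0 in => one pass, then -1, -2, ... */

                return i;
        }

        int main(void)
        {
                int slots[NR_SLOTS];

                /* without the guard, npages == 0 would run the body once,
                 * decrement to -1, and keep writing past slots[] */
                printf("filled %d slots\n", fill_slots(slots, NR_SLOTS, 0));
                printf("filled %d slots\n", fill_slots(slots, NR_SLOTS, 3));
                return 0;
        }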
     

09 Feb, 2008

1 commit

  • Background: I've implemented 1K/2K page tables for s390. These sub-page
    page tables are required to properly support the s390 virtualization
    instruction with KVM. The SIE instruction requires that the page tables
    have 256 page table entries (pte) followed by 256 page status table entries
    (pgste). The pgstes are only required if the process is using the SIE
    instruction. The pgstes are updated by the hardware and by the hypervisor
    for a number of reasons, one of them is dirty and reference bit tracking.
    To avoid wasting memory the standard pte table allocation should return
    1K/2K (31/64 bit) and 2K/4K if the process is using SIE.

    Problem: Page size on s390 is 4K, page table size is 1K or 2K. That means
    the s390 version for pte_alloc_one cannot return a pointer to a struct
    page. Trouble is that with the CONFIG_HIGHPTE feature on x86 pte_alloc_one
    cannot return a pointer to a pte either, since that would require more than
    32 bit for the return value of pte_alloc_one (and the pte * would not be
    accessible since it's not kmapped).

    Solution: The only solution I found to this dilemma is a new typedef: a
    pgtable_t. For s390 pgtable_t will be a (pte *) - to be introduced with a
    later patch. For everybody else it will be a (struct page *). The
    additional problem with the initialization of the ptl lock and the
    NR_PAGETABLE accounting is solved with a constructor pgtable_page_ctor and
    a destructor pgtable_page_dtor. The page table allocation and free
    functions need to call these two whenever a page table page is allocated or
    freed. pmd_populate will get a pgtable_t instead of a struct page pointer.
    To get the pgtable_t back from a pmd entry that has been installed with
    pmd_populate a new function pmd_pgtable is added. It replaces the pmd_page
    call in free_pte_range and apply_to_pte_range.

    Signed-off-by: Martin Schwidefsky
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Schwidefsky
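
    For most architectures the resulting allocate/free pair might look roughly
    like the sketch below (illustrative: pgtable_t is struct page * here, the
    s390 pte_t * variant comes in a later patch, and the exact GFP flags and
    helper signatures are assumed):

        typedef struct page *pgtable_t;   /* s390 will later use: typedef pte_t *pgtable_t; */

        pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long address)
        {
                struct page *pte;

                pte = alloc_pages(GFP_KERNEL | __GFP_ZERO, 0);
                if (pte)
                        pgtable_page_ctor(pte);   /* init ptl, account NR_PAGETABLE */
                return pte;
        }

        void pte_free(struct mm_struct *mm, pgtable_t pte)
        {
                pgtable_page_dtor(pte);           /* undo the constructor's work */
                __free_page(pte);
        }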
     

08 Feb, 2008

2 commits

  • Nick Piggin pointed out that swap cache and page cache addition routines
    could be called from non GFP_KERNEL contexts. This patch makes the
    charging routine aware of the gfp context. Charging might fail if the
    cgroup is over its limit, in which case a suitable error is returned.

    This patch was tested on a Powerpc box. I am still looking at being able
    to test the path through which allocations happen in non-GFP_KERNEL
    contexts.

    [kamezawa.hiroyu@jp.fujitsu.com: problem with ZONE_MOVABLE]
    Signed-off-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Add the accounting hooks. The accounting is carried out for RSS and Page
    Cache (unmapped) pages. There is now a common limit and accounting for both.
    RSS is accounted at page_add_*_rmap() and page_remove_rmap()
    time. Page cache is accounted at add_to_page_cache(),
    __delete_from_page_cache(). Swap cache is also accounted for.

    Each page's page_cgroup is protected with the last bit of the
    page_cgroup pointer; this makes handling of race conditions involving
    simultaneous mappings of a page easier. A reference count is kept in the
    page_cgroup to deal with cases where a page might be unmapped from the RSS
    of all tasks, but still lives in the page cache.

    Credits go to Vaidyanathan Srinivasan for helping with reference counting work
    of the page cgroup. Almost all of the page cache accounting code has help
    from Vaidyanathan Srinivasan.

    [hugh@veritas.com: fix swapoff breakage]
    [akpm@linux-foundation.org: fix locking]
    Signed-off-by: Vaidyanathan Srinivasan
    Signed-off-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc:
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     

07 Feb, 2008

1 commit

  • based on similar patch from: Pavel Machek

    Introduce CONFIG_COMPAT_BRK. If disabled then the kernel is free
    (but not obliged) to randomize the brk area.

    Heap randomization breaks ancient binaries, so we keep COMPAT_BRK
    enabled by default.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

06 Feb, 2008

8 commits

    After running SetPageUptodate, preceding stores to the page contents to
    actually bring it uptodate may not be ordered with the store to set the
    page uptodate.

    Therefore, another CPU which checks PageUptodate is true, then reads the
    page contents can get stale data.

    Fix this by having an smp_wmb before SetPageUptodate, and smp_rmb after
    PageUptodate.

    Many places that test PageUptodate do so with the page locked, and this
    would be enough to ensure memory ordering in those places if
    SetPageUptodate were only called while the page is locked. Unfortunately
    that is not always the case for some filesystems, but it could be an idea
    for the future.

    Also bring the handling of anonymous page uptodateness in line with that of
    file backed page management, by marking anon pages as uptodate when they
    _are_ uptodate, rather than when our implementation requires that they be
    marked as such. Doing so allows us to get rid of the smp_wmb's in the page
    copying functions, which were especially added for anonymous pages for an
    analogous memory ordering problem. Both file and anonymous pages are
    handled with the same barriers.

    FAQ:
    Q. Why not do this in flush_dcache_page?
    A. Firstly, flush_dcache_page handles only one side (the smp_wmb side) of the
    ordering protocol; we'd still need smp_rmb somewhere. Secondly, hiding away
    memory barriers in a completely unrelated function is nasty; at least in the
    PageUptodate macros, they are located together with (half) the operations
    involved in the ordering. Thirdly, the smp_wmb is only required when first
    bringing the page uptodate, whereas flush_dcache_page should be called each time
    it is written to through the kernel mapping. It is logically the wrong place to
    put it.

    Q. Why does this increase my text size / reduce my performance / etc.
    A. Because it is adding the necessary instructions to eliminate the data-race.

    Q. Can it be improved?
    A. Yes, eg. if you were to create a rule that all SetPageUptodate operations
    run under the page lock, we could avoid the smp_rmb places where PageUptodate
    is queried under the page lock. Requires audit of all filesystems and at least
    some would need reworking. That's great you're interested, I'm eagerly awaiting
    your patches.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
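
    The ordering rule generalizes beyond the kernel. Here is a small,
    self-contained C11 analogy (not kernel code): the writer's release store
    plays the role of the smp_wmb folded into SetPageUptodate, and the reader's
    acquire load plays the role of PageUptodate followed by smp_rmb, so a reader
    that sees the flag is guaranteed to see the contents too.

        #include <pthread.h>
        #include <stdatomic.h>
        #include <stdio.h>

        static char page_contents[64];
        static atomic_int page_uptodate;        /* 0 = not uptodate, 1 = uptodate */

        static void *writer(void *arg)
        {
                snprintf(page_contents, sizeof(page_contents), "real data");
                /* release: contents above are visible before the flag
                 * (the smp_wmb + SetPageUptodate half of the protocol) */
                atomic_store_explicit(&page_uptodate, 1, memory_order_release);
                return NULL;
        }

        static void *reader(void *arg)
        {
                /* acquire: once we see the flag we also see the contents
                 * (the PageUptodate + smp_rmb half of the protocol) */
                while (!atomic_load_explicit(&page_uptodate, memory_order_acquire))
                        ;                       /* spin until "uptodate" */
                printf("reader sees: %s\n", page_contents);
                return NULL;
        }

        int main(void)
        {
                pthread_t w, r;

                pthread_create(&r, NULL, reader, NULL);
                pthread_create(&w, NULL, writer, NULL);
                pthread_join(w, NULL);
                pthread_join(r, NULL);
                return 0;
        }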
     
  • fastcall is always defined to be empty, remove it

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Harvey Harrison
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     
  • (with Martin Schwidefsky )

    The pgd/pud/pmd/pte page table allocation functions get a mm_struct pointer as
    first argument. The free functions do not get the mm_struct argument. This
    is 1) asymmetrical, and 2) to do mm-related page table allocations the mm
    argument is needed in the free functions as well.

    [kamalesh@linux.vnet.ibm.com: i386 fix]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Martin Schwidefsky
    Cc:
    Signed-off-by: Kamalesh Babulal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
    vmtruncate is a twisted maze of gotos; this patch cleans it up to have a
    proper if/else for the two major cases of a truncate (extending and shrinking),
    and thus makes it a lot more readable while keeping exactly the same
    functionality.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Building in a filesystem on a loop device on a tmpfs file can hang when
    swapping, the loop thread caught in that infamous throttle_vm_writeout.

    In theory this is a long standing problem, which I've either never seen in
    practice, or long ago suppressed the recollection, after discounting my load
    and my tmpfs size as unrealistically high. But now, with the new aops, it has
    become easy to hang on one machine.

    Loop used to grab_cache_page before the old prepare_write to tmpfs, which
    seems to have been enough to free up some memory for any swapin needed; but
    the new write_begin lets tmpfs find or allocate the page (much nicer, since
    grab_cache_page missed tmpfs pages in swapcache).

    When allocating a fresh page, tmpfs respects loop's mapping_gfp_mask, which
    has __GFP_IO|__GFP_FS stripped off, and throttle_vm_writeout is designed to
    break out when __GFP_IO or __GFP_FS is unset; but when tmpfs swaps in,
    read_swap_cache_async allocates with GFP_HIGHUSER_MOVABLE regardless of the
    mapping_gfp_mask - hence the hang.

    So, pass gfp_mask down the line from shmem_getpage to shmem_swapin to
    swapin_readahead to read_swap_cache_async to add_to_swap_cache.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Acked-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • swapin_readahead has never sat well in mm/memory.c: move it to mm/swap_state.c
    beside its kindred read_swap_cache_async. Why were its args in a different
    order? Rearrange them. And since it was always followed by a
    read_swap_cache_async of the target page, fold that in and return struct
    page*. Then CONFIG_SWAP=n no longer needs valid_swaphandles and
    read_swap_cache_async stubs.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • For three years swapin_readahead has been cluttered with fanciful CONFIG_NUMA
    code, advancing addr, and stepping on to the next vma at the boundary, to line
    up the mempolicy for each page allocation.

    It _might_ be a good idea to allocate swap more according to vma layout; but
    the fact is, that's not how we do it at all, 2.6 even less than 2.4: swap is
    allocated as needed for pages as they sink to the bottom of the inactive LRUs.
    Sometimes that may match vma layout, but not so often that it's worth going
    to these misleading vma->vm_next lengths: rip all that out.

    Originally I intended to retain the incrementation of addr, but correct its
    initial value: valid_swaphandles generally supplies an offset below the target
    addr (this is readaround rather than readahead), but addr has not been
    adjusted accordingly, so in the interleave case it has usually been allocating
    the target page from the "wrong" node (though that may not matter very much).

    But look at the equivalent shmem_swapin code: either by oversight or by
    design, though it has all the apparatus for choosing a new mempolicy per page,
    it uses the same idx throughout, choosing the same mempolicy and interleave
    node for each page of the cluster.

    Which is actually a much better strategy: each node has its own LRUs and its
    own kswapd, so if you're betting on any particular relationship between swap
    and node, the best bet is that nearby swap entries belong to pages from the
    same node - even when the mempolicy of the target page is to interleave. And
    examining a map of nodes corresponding to swap entries on a numa=fake system
    bears this out. (We could later tweak swap allocation to make it even more
    likely, but this patch is merely about removing cruft.)

    So, neither adjust nor increment addr in swapin_readahead, and then
    shmem_swapin can use it too; the pseudo-vma to pass policy need only be set up
    once per cluster, and so few fields of pvma are used, let's skip the memset -
    from shmem_alloc_page also.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: Andi Kleen
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • We already have page table manipulation for vmalloc in vmalloc.c. Move the
    vmalloc_to_page() function there as well.

    Move the definitions for vmalloc related functions in mm.h to a newly created
    section. A better place would be vmalloc.h but mm.h is basic and may depend
    on these functions. An alternative would be to include vmalloc.h in mm.h
    (like done for vmstat.h).

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

30 Jan, 2008

2 commits

  • They now look like:

    hal-resmgr[13791]: segfault at 3c rip 2b9c8caec182 rsp 7fff1e825d30 error 4 in libacl.so.1.1.0[2b9c8caea000+6000]

    This makes it easier to pinpoint bugs to specific libraries.

    And printing the offset into a mapping also always makes it possible to find the
    correct fault point in a library even with randomized mappings. Previously
    there was no way to actually find the correct code address inside
    the randomized mapping.

    Relies on earlier patch to shorten the printk formats.

    They are often now longer than 80 characters, but I think that's worth it.

    [includes fix from Eric Dumazet to check d_path error value]

    Signed-off-by: Andi Kleen
    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner

    Andi Kleen
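
    The value of "path[start+size]" plus an offset can be seen from userspace as
    well. The following small, self-contained Linux C program (illustrative; it
    is not the kernel's print_vma_addr) resolves an address to the mapping that
    contains it by scanning /proc/self/maps and prints the same kind of
    information:

        #include <stdio.h>

        /* print which mapping contains "addr" and the offset into it */
        static void where_is(unsigned long addr)
        {
                FILE *f = fopen("/proc/self/maps", "r");
                char line[512];

                if (!f)
                        return;
                while (fgets(line, sizeof(line), f)) {
                        unsigned long start, end;
                        char path[256] = "";

                        /* fields: start-end perms offset dev inode [path] */
                        if (sscanf(line, "%lx-%lx %*s %*s %*s %*s %255s",
                                   &start, &end, path) < 2)
                                continue;
                        if (addr >= start && addr < end) {
                                printf("%#lx is in %s[%lx+%lx], offset %#lx\n",
                                       addr, path[0] ? path : "(anonymous)",
                                       start, end - start, addr - start);
                                break;
                        }
                }
                fclose(f);
        }

        int main(void)
        {
                where_is((unsigned long)&where_is);   /* inside our own text mapping */
                where_is((unsigned long)&stdin);      /* a data address, usually in libc or our data */
                return 0;
        }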
     
  • The break_lock data structure and code for spinlocks is quite nasty.
    Not only does it double the size of a spinlock but it changes locking to
    a potentially less optimal trylock.

    Put all of that under CONFIG_GENERIC_LOCKBREAK, and introduce a
    __raw_spin_is_contended that uses the lock data itself to determine whether
    there are waiters on the lock, to be used if CONFIG_GENERIC_LOCKBREAK is
    not set.

    Rename need_lockbreak to spin_needbreak, make it use spin_is_contended to
    decouple it from the spinlock implementation, and make it typesafe (rwlocks
    do not have any need_lockbreak sites -- why do they even get bloated up
    with that break_lock then?).

    Signed-off-by: Nick Piggin
    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner

    Nick Piggin
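
    The "use the lock data itself" idea can be illustrated with a self-contained
    ticket-lock sketch in C11 (this is not the kernel's spinlock code): the lock
    is contended exactly when more than one ticket separates the next ticket to
    be handed out from the one currently being served, so no extra break_lock
    field is needed.

        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stdio.h>

        struct ticket_lock {
                atomic_uint next;       /* next ticket to hand out */
                atomic_uint owner;      /* ticket currently being served */
        };

        static void ticket_lock(struct ticket_lock *l)
        {
                unsigned int me = atomic_fetch_add(&l->next, 1);

                while (atomic_load(&l->owner) != me)
                        ;               /* spin until our ticket is served */
        }

        static void ticket_unlock(struct ticket_lock *l)
        {
                atomic_fetch_add(&l->owner, 1);
        }

        /* contended: at least one waiter queued behind the current holder */
        static bool ticket_is_contended(struct ticket_lock *l)
        {
                return atomic_load(&l->next) - atomic_load(&l->owner) > 1;
        }

        int main(void)
        {
                struct ticket_lock l = { 0, 0 };

                ticket_lock(&l);
                printf("held, no waiters -> contended: %d\n", ticket_is_contended(&l));
                atomic_fetch_add(&l.next, 1);   /* simulate one waiter arriving */
                printf("held, one waiter -> contended: %d\n", ticket_is_contended(&l));
                ticket_unlock(&l);
                return 0;
        }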
     

24 Jan, 2008

1 commit


18 Jan, 2008

1 commit

  • This patch puts #ifdef CONFIG_DEBUG_VM around a check in vm_normal_page
    that verifies that a pfn is valid. This patch increases performance of the
    page fault microbenchmark in lmbench by 13% and overall dbench performance
    by 7% on s390x. pfn_valid() is an expensive operation on s390 that takes a
    high double-digit number of CPU cycles. Nick Piggin suggested that
    pfn_valid() involves an array lookup on systems with sparsemem, and
    therefore is an expensive operation there too.

    The check looks like a clear debug thing to me, it should never trigger on
    regular kernels. And if a pte is created for an invalid pfn, we'll find
    out once the memory gets accessed later on anyway. Please consider
    inclusion of this patch into mm.

    Signed-off-by: Carsten Otte
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Carsten Otte
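
    The shape of the change is presumably just wrapping the existing check
    (sketch, not the exact diff; print_bad_pte's argument list is assumed):

        /* in vm_normal_page(): the expensive pfn sanity check becomes debug-only */
        #ifdef CONFIG_DEBUG_VM
                if (unlikely(!pfn_valid(pfn))) {
                        print_bad_pte(vma, pte, addr);
                        return NULL;
                }
        #endif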
     

15 Nov, 2007

2 commits

  • The delay incurred in lock_page() should also be accounted in swap delay
    accounting

    Reported-by: Nick Piggin
    Signed-off-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • When calling get_user_pages(), a write flag is passed in by the caller to
    indicate if write access is required on the faulted-in pages. Currently,
    follow_hugetlb_page() ignores this flag and always faults pages for
    read-only access. This can cause data corruption because a device driver
    that calls get_user_pages() with write set will not expect COW faults to
    occur on the returned pages.

    This patch passes the write flag down to follow_hugetlb_page() and makes
    sure hugetlb_fault() is called with the right write_access parameter.

    [ezk@cs.sunysb.edu: build fix]
    Signed-off-by: Adam Litke
    Reviewed-by: Ken Chen
    Cc: David Gibson
    Cc: William Lee Irwin III
    Cc: Badari Pulavarty
    Signed-off-by: Erez Zadok
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     

05 Nov, 2007

1 commit


20 Oct, 2007

2 commits


17 Oct, 2007

3 commits

    Current ia64 kernel flushes icache by lazy_mmu_prot_update() *after*
    set_pte(). This is too late. This patch removes lazy_mmu_prot_update and
    adds a modified set_pte() for flushing if necessary.

    This patch flushes the icache of a page when
    new pte has exec bit
    && new pte has present bit
    && new pte is user's page
    && (old *ptep is not present
    || new pte's pfn is not the same as old *ptep's pfn)
    && new pte's page has no Pg_arch_1 bit.
    Pg_arch_1 is set when a page is cache consistent.

    I think these condition checks are much easier to understand than considering
    "Where should sync_icache_dcache() be inserted?".

    pte_user() for ia64 was removed by http://lkml.org/lkml/2007/6/12/67 as
    clean-up. So, I added it again.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Luck, Tony"
    Cc: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • The calculation of pgoff in do_linear_fault() should use PAGE_SHIFT and not
    PAGE_CACHE_SHIFT since vma->vm_pgoff is in units of PAGE_SIZE and not
    PAGE_CACHE_SIZE. At the moment linux/pagemap.h has PAGE_CACHE_SHIFT
    defined as PAGE_SHIFT, but should that ever change this calculation would
    break.

    Signed-off-by: Dean Nelson
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dean Nelson
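
    A self-contained arithmetic illustration of the calculation (the addresses
    and vma values are made up): the byte offset within the vma has to be
    converted to a page index with PAGE_SHIFT before vm_pgoff, itself in units
    of PAGE_SIZE, is added.

        #include <stdio.h>

        #define PAGE_SHIFT      12
        #define PAGE_SIZE       (1UL << PAGE_SHIFT)
        #define PAGE_MASK       (~(PAGE_SIZE - 1))

        int main(void)
        {
                unsigned long vm_start = 0x7f0000000000UL;  /* where the vma begins */
                unsigned long vm_pgoff = 16;                /* file page the vma maps first */
                unsigned long address  = 0x7f0000003abcUL;  /* faulting address */

                /* the do_linear_fault() style calculation described above */
                unsigned long pgoff = (((address & PAGE_MASK) - vm_start) >> PAGE_SHIFT)
                                      + vm_pgoff;

                /* 0x3abc into the vma is its page 3, so file page 16 + 3 = 19 */
                printf("fault at %#lx -> file page index %lu\n", address, pgoff);
                return 0;
        }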
     
  • The commit b5810039a54e5babf428e9a1e89fc1940fabff11 contains the note

    A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
    (and thus mapcounted and count towards shared rss). These writes to
    the struct page could cause excessive cacheline bouncing on big
    systems. There are a number of ways this could be addressed if it is
    an issue.

    And indeed this cacheline bouncing has shown up on large SGI systems.
    There was a situation where an Altix system was essentially livelocked
    tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
    This situation can be avoided in userspace, but it does highlight the
    potential scalability problem with refcounting ZERO_PAGE, and corner
    cases where it can really hurt (we don't want the system to livelock!).

    There are several broad ways to fix this problem:
    1. add back some special casing to avoid refcounting ZERO_PAGE
    2. per-node or per-cpu ZERO_PAGES
    3. remove the ZERO_PAGE completely

    I will argue for 3. The others should also fix the problem, but they
    result in more complex code than does 3, with little or no real benefit
    that I can see.

    Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a
    false optimisation: if an application is performance critical, it would
    not be doing many read faults of new memory, or at least it could be
    expected to write to that memory soon afterwards. If cache or memory use
    is critical, it should not be working with a significant number of
    ZERO_PAGEs anyway (a more compact representation of zeroes should be
    used).

    As a sanity check -- measuring on my desktop system, there are never many
    mappings to the ZERO_PAGE (eg. 2 or 3), thus memory usage here should not
    increase much without it.

    When running a make -j4 kernel compile on my dual core system, there are
    about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000
    ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second
    is torn down without being COWed). So removing ZERO_PAGE will save 1,000
    page faults per second when running kbuild, while keeping it only saves
    less than 1 page clearing operation per second. 1 page clear is cheaper
    than a thousand faults, presumably, so there isn't an obvious loss.

    Neither the logical argument nor these basic tests give a guarantee of no
    regressions. However, this is a reasonable opportunity to try to remove
    the ZERO_PAGE from the pagefault path. If it is found to cause regressions,
    we can reintroduce it and just avoid refcounting it.

    The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked. I don't see
    much use to them except on benchmarks. All other users of ZERO_PAGE are
    converted just to use ZERO_PAGE(0) for simplicity. We can look at
    replacing them all and maybe ripping out ZERO_PAGE completely when we are
    more satisfied with this solution.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus "snif" Torvalds

    Nick Piggin
     

09 Oct, 2007

1 commit

    All the current page_mkwrite() implementations also set the page dirty, which
    results in the set_page_dirty_balance() call _not_ calling balance, because the
    page is already found dirty.

    This allows us to dirty a _lot_ of pages without ever hitting
    balance_dirty_pages(). Not good (tm).

    Force a balance call if ->page_mkwrite() was successful.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

05 Oct, 2007

1 commit

  • Gurudas Pai reports kernel BUG at arch/i386/mm/highmem.c:15! below
    sys_remap_file_pages, while running Oracle database test on x86 in 6GB
    RAM: kunmap thinks we're in_interrupt because the preempt count has
    wrapped.

    That's because __do_fault expected to unmap page_table, but one of its
    two callers do_nonlinear_fault already unmapped it: let do_linear_fault
    unmap it first too, and then there's no need to pass the page_table arg
    down.

    Why have we been so slow to notice this? Probably through forgetting
    that the mapping_cap_account_dirty test means that sys_remap_file_pages
    nowadays only goes the full nonlinear vma route on a few memory-backed
    filesystems like ramfs, tmpfs and hugetlbfs.

    [ It also depends on CONFIG_HIGHPTE, so it becomes even harder to
    trigger in practice. Many who have need of large memory have probably
    migrated to x86-64..

    Problem introduced by commit d0217ac04ca6591841e5665f518e38064f4e65bd
    ("mm: fault feedback #1") -- Linus ]

    Signed-off-by: Hugh Dickins
    Cc: gurudas pai
    Cc: Nick Piggin
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

22 Jul, 2007

1 commit

  • Now that arch/powerpc/platforms/cell/spufs/fault.c is always built in
    the kernel there is no need to export handle_mm_fault anymore.

    Signed-off-by: Christoph Hellwig
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

20 Jul, 2007

7 commits

  • lguest does some fairly lowlevel things to support a host, which
    normal modules don't need:

    math_state_restore:
    When the guest triggers a Device Not Available fault, we need
    to be able to restore the FPU

    __put_task_struct:
    We need to hold a reference to another task for inter-guest
    I/O, and put_task_struct() is an inline function which calls
    __put_task_struct.

    access_process_vm:
    We need to access another task for inter-guest I/O.

    map_vm_area & __get_vm_area:
    We need to map the switcher shim (ie. monitor) at 0xFFC01000.

    Signed-off-by: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rusty Russell
     
  • Fix msync data loss and (less importantly) dirty page accounting
    inaccuracies due to the race remaining in clear_page_dirty_for_io().

    The deleted comment explains what the race was, and the added comments
    explain how it is fixed.

    Signed-off-by: Nick Piggin
    Acked-by: Linus Torvalds
    Cc: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • This patch completes Linus's wish that the fault return codes be made into
    bit flags, which I agree makes everything nicer. This requires
    all handle_mm_fault callers to be modified (possibly the modifications
    should go further and do things like fault accounting in handle_mm_fault --
    however that would be for another patch).

    [akpm@linux-foundation.org: fix alpha build]
    [akpm@linux-foundation.org: fix s390 build]
    [akpm@linux-foundation.org: fix sparc build]
    [akpm@linux-foundation.org: fix sparc64 build]
    [akpm@linux-foundation.org: fix ia64 build]
    Signed-off-by: Nick Piggin
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Russell King
    Cc: Ian Molton
    Cc: Bryan Wu
    Cc: Mikael Starvik
    Cc: David Howells
    Cc: Yoshinori Sato
    Cc: "Luck, Tony"
    Cc: Hirokazu Takata
    Cc: Geert Uytterhoeven
    Cc: Roman Zippel
    Cc: Greg Ungerer
    Cc: Matthew Wilcox
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Paul Mundt
    Cc: Kazumoto Kojima
    Cc: Richard Curnow
    Cc: William Lee Irwin III
    Cc: "David S. Miller"
    Cc: Jeff Dike
    Cc: Paolo 'Blaisorblade' Giarrusso
    Cc: Miles Bader
    Cc: Chris Zankel
    Acked-by: Kyle McMartin
    Acked-by: Haavard Skinnemoen
    Acked-by: Ralf Baechle
    Acked-by: Andi Kleen
    Signed-off-by: Andrew Morton
    [ Still apparently needs some ARM and PPC loving - Linus ]
    Signed-off-by: Linus Torvalds

    Nick Piggin
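
    After this change an arch fault handler tests bits rather than comparing
    against an enumeration; a rough sketch of the calling pattern (illustrative,
    assuming the handle_mm_fault prototype of that era):

        fault = handle_mm_fault(mm, vma, address, write_access);
        if (unlikely(fault & VM_FAULT_ERROR)) {
                if (fault & VM_FAULT_OOM)
                        goto out_of_memory;
                else if (fault & VM_FAULT_SIGBUS)
                        goto do_sigbus;
                BUG();
        }
        if (fault & VM_FAULT_MAJOR)
                current->maj_flt++;     /* major fault: I/O was needed */
        else
                current->min_flt++;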
     
  • Change ->fault prototype. We now return an int, which contains
    VM_FAULT_xxx code in the low byte, and FAULT_RET_xxx code in the next byte.
    FAULT_RET_ code tells the VM whether a page was found, whether it has been
    locked, and potentially other things. This is not quite the way Linus wanted
    it yet, but that's changed in the next patch (which requires changes to
    arch code).

    This means we no longer set VM_CAN_INVALIDATE in the vma in order to say
    that a page is locked which requires filemap_nopage to go away (because we
    can no longer remain backward compatible without that flag), but we were
    going to do that anyway.

    struct fault_data is renamed to struct vm_fault as Linus asked. address
    is now a void __user * that we should firmly encourage drivers not to use
    without really good reason.

    The page is now returned via a page pointer in the vm_fault struct.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • __do_fault() was calling ->page_mkwrite() with the page lock held, which
    violates the locking rules for that callback. Release and retake the page
    lock around the callback to avoid deadlocking file systems which manually
    take it.

    Signed-off-by: Mark Fasheh
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Fasheh
     
  • Nonlinear mappings are (AFAIKS) simply a virtual memory concept that encodes
    the virtual address -> file offset differently from linear mappings.

    ->populate is a layering violation because the filesystem/pagecache code
    should not need to know anything about the virtual memory mapping. The hitch here
    is that the ->nopage handler didn't pass down enough information (ie. pgoff).
    But it is more logical to pass pgoff rather than have the ->nopage function
    calculate it itself anyway (because that's a similar layering violation).

    Having the populate handler install the pte itself is likewise a nasty thing
    to be doing.

    This patch introduces a new fault handler that replaces ->nopage and
    ->populate and (later) ->nopfn. Most of the old mechanism is still in place
    so there is a lot of duplication and nice cleanups that can be removed if
    everyone switches over.

    The rationale for doing this in the first place is that nonlinear mappings are
    subject to the pagefault vs invalidate/truncate race too, and it seemed stupid
    to duplicate the synchronisation logic rather than just consolidate the two.

    After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in
    pagecache. Seems like a fringe functionality anyway.

    NOPAGE_REFAULT is removed. This should be implemented with ->fault, and no
    users have hit mainline yet.

    [akpm@linux-foundation.org: cleanup]
    [randy.dunlap@oracle.com: doc. fixes for readahead]
    [akpm@linux-foundation.org: build fix]
    Signed-off-by: Nick Piggin
    Signed-off-by: Randy Dunlap
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Fix the race between invalidate_inode_pages and do_no_page.

    Andrea Arcangeli identified a subtle race between invalidation of pages from
    pagecache with userspace mappings, and do_no_page.

    The issue is that invalidation has to shoot down all mappings to the page,
    before it can be discarded from the pagecache. Between shooting down ptes to
    a particular page, and actually dropping the struct page from the pagecache,
    do_no_page from any process might fault on that page and establish a new
    mapping to the page just before it gets discarded from the pagecache.

    The most common case where such invalidation is used is in file truncation.
    This case was catered for by doing a sort of open-coded seqlock between the
    file's i_size, and its truncate_count.

    Truncation will decrease i_size, then increment truncate_count before
    unmapping userspace pages; do_no_page will read truncate_count, then find the
    page if it is within i_size, and then check truncate_count under the page
    table lock and back out and retry if it had subsequently been changed (ptl
    will serialise against unmapping, and ensure a potentially updated
    truncate_count is actually visible).

    Complexity and documentation issues aside, the locking protocol fails in the
    case where we would like to invalidate pagecache inside i_size. do_no_page
    can come in anytime and filemap_nopage is not aware of the invalidation in
    progress (as it is when it is outside i_size). The end result is that
    dangling (->mapping == NULL) pages that appear to be from a particular file
    may be mapped into userspace with nonsense data. Valid mappings to the same
    place will see a different page.

    Andrea implemented two working fixes, one using a real seqlock, another using
    a page->flags bit. He also proposed using the page lock in do_no_page, but
    that was initially considered too heavyweight. However, it is not a global or
    per-file lock, and the page cacheline is modified in do_no_page to increment
    _count and _mapcount anyway, so a further modification should not be a large
    performance hit. Scalability is not an issue.

    This patch implements this latter approach. ->nopage implementations return
    with the page locked if it is possible for their underlying file to be
    invalidated (in that case, they must set a special vm_flags bit to indicate
    so). do_no_page only unlocks the page after setting up the mapping
    completely. invalidation is excluded because it holds the page lock during
    invalidation of each page (and ensures that the page is not mapped while
    holding the lock).

    This also allows significant simplifications in do_no_page, because we have
    the page locked in the right place in the pagecache from the start.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin