14 Jan, 2011

3 commits

  • Natively handle huge pmds when changing page tables on behalf of
    mprotect().

    I left out update_mmu_cache() because we do not need it on x86 anyway but
    more importantly the interface works on ptes, not pmds.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Flushing the tlb for huge pmds requires the vma's anon_vma, so pass along
    the vma instead of the mm, we can always get the latter when we need it.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • split_huge_page_pmd compat code. Each one of those would need to be
    expanded to hundred of lines of complex code without a fully reliable
    split_huge_page_pmd design.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

10 Nov, 2010

1 commit

  • As pointed out by Linus, commit dab5855 ("perf_counter: Add mmap event hooks to
    mprotect()") is fundamentally wrong as mprotect_fixup() can free 'vma' due to
    merging. Fix the problem by moving perf_event_mmap() hook to
    mprotect_fixup().

    Note: there's another successful return path from mprotect_fixup() if old
    flags equal to new flags. We don't, however, need to call
    perf_event_mmap() there because 'perf' already knows the VMA is
    executable.

    Reported-by: Dave Jones
    Analyzed-by: Linus Torvalds
    Cc: Ingo Molnar
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Pekka Enberg
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the followings.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and try to put the new include such that its order conforms
    to its surrounding. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    wildly available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build test were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable easily on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

21 Sep, 2009

1 commit

  • Bye-bye Performance Counters, welcome Performance Events!

    In the past few months the perfcounters subsystem has grown out its
    initial role of counting hardware events, and has become (and is
    becoming) a much broader generic event enumeration, reporting, logging,
    monitoring, analysis facility.

    Naming its core object 'perf_counter' and naming the subsystem
    'perfcounters' has become more and more of a misnomer. With pending
    code like hw-breakpoints support the 'counter' name is less and
    less appropriate.

    All in one, we've decided to rename the subsystem to 'performance
    events' and to propagate this rename through all fields, variables
    and API names. (in an ABI compatible fashion)

    The word 'event' is also a bit shorter than 'counter' - which makes
    it slightly more convenient to write/handle as well.

    Thanks goes to Stephane Eranian who first observed this misnomer and
    suggested a rename.

    User-space tooling and ABI compatibility is not affected - this patch
    should be function-invariant. (Also, defconfigs were not touched to
    keep the size down.)

    This patch has been generated via the following script:

    FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')

    sed -i \
    -e 's/PERF_EVENT_/PERF_RECORD_/g' \
    -e 's/PERF_COUNTER/PERF_EVENT/g' \
    -e 's/perf_counter/perf_event/g' \
    -e 's/nb_counters/nb_events/g' \
    -e 's/swcounter/swevent/g' \
    -e 's/tpcounter_event/tp_event/g' \
    $FILES

    for N in $(find . -name perf_counter.[ch]); do
    M=$(echo $N | sed 's/perf_counter/perf_event/g')
    mv $N $M
    done

    FILES=$(find . -name perf_event.*)

    sed -i \
    -e 's/COUNTER_MASK/REG_MASK/g' \
    -e 's/COUNTER/EVENT/g' \
    -e 's/\/event_id/g' \
    -e 's/counter/event/g' \
    -e 's/Counter/Event/g' \
    $FILES

    ... to keep it as correct as possible. This script can also be
    used by anyone who has pending perfcounters patches - it converts
    a Linux kernel tree over to the new naming. We tried to time this
    change to the point in time where the amount of pending patches
    is the smallest: the end of the merge window.

    Namespace clashes were fixed up in a preparatory patch - and some
    stylistic fallout will be fixed up in a subsequent patch.

    ( NOTE: 'counters' are still the proper terminology when we deal
    with hardware registers - and these sed scripts are a bit
    over-eager in renaming them. I've undone some of that, but
    in case there's something left where 'counter' would be
    better than 'event' we can undo that on an individual basis
    instead of touching an otherwise nicely automated patch. )

    Suggested-by: Stephane Eranian
    Acked-by: Peter Zijlstra
    Acked-by: Paul Mackerras
    Reviewed-by: Arjan van de Ven
    Cc: Mike Galbraith
    Cc: Arnaldo Carvalho de Melo
    Cc: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: Kyle McMartin
    Cc: Martin Schwidefsky
    Cc: "David S. Miller"
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc:
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

09 Jun, 2009

1 commit

  • Some JIT compilers allocate memory for generated code with
    posix_memalign() + mprotect() so we need to hook into mprotect()
    to make sure 'perf' is aware that we're executing code in
    anonymous memory.

    [ penberg@cs.helsinki.fi: move the hook to sys_mprotect() ]
    Cc: Arnaldo Carvalho de Melo
    Signed-off-by: Pekka Enberg
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

11 Feb, 2009

1 commit

  • When overcommit is disabled, the core VM accounts for pages used by anonymous
    shared, private mappings and special mappings. It keeps track of VMAs that
    should be accounted for with VM_ACCOUNT and VMAs that never had a reserve
    with VM_NORESERVE.

    Overcommit for hugetlbfs is much riskier than overcommit for base pages
    due to contiguity requirements. It avoids overcommiting on both shared and
    private mappings using reservation counters that are checked and updated
    during mmap(). This ensures (within limits) that hugepages exist in the
    future when faults occurs or it is too easy to applications to be SIGKILLed.

    As hugetlbfs makes its own reservations of a different unit to the base page
    size, VM_ACCOUNT should never be set. Even if the units were correct, we would
    double account for the usage in the core VM and hugetlbfs. VM_NORESERVE may
    be set because an application can request no reserves be made for hugetlbfs
    at the risk of getting killed later.

    With commit fc8744adc870a8d4366908221508bb113d8b72ee, VM_NORESERVE and
    VM_ACCOUNT are getting unconditionally set for hugetlbfs-backed mappings. This
    breaks the accounting for both the core VM and hugetlbfs, can trigger an
    OOM storm when hugepage pools are too small lockups and corrupted counters
    otherwise are used. This patch brings hugetlbfs more in line with how the
    core VM treats VM_NORESERVE but prevents VM_ACCOUNT being set.

    Signed-off-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

14 Jan, 2009

1 commit


07 Jan, 2009

1 commit

  • #ifdef in *.c file decrease source readability a bit. removing is better.

    This patch doesn't have any functional change.

    Signed-off-by: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

06 Jan, 2009

1 commit


29 Jul, 2008

1 commit

  • With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to pages.
    There are secondary MMUs (with secondary sptes and secondary tlbs) too.
    sptes in the kvm case are shadow pagetables, but when I say spte in
    mmu-notifier context, I mean "secondary pte". In GRU case there's no
    actual secondary pte and there's only a secondary tlb because the GRU
    secondary MMU has no knowledge about sptes and every secondary tlb miss
    event in the MMU always generates a page fault that has to be resolved by
    the CPU (this is not the case of KVM where the a secondary tlb miss will
    walk sptes in hardware and it will refill the secondary tlb transparently
    to software if the corresponding spte is present). The same way
    zap_page_range has to invalidate the pte before freeing the page, the spte
    (and secondary tlb) must also be invalidated before any page is freed and
    reused.

    Currently we take a page_count pin on every page mapped by sptes, but that
    means the pages can't be swapped whenever they're mapped by any spte
    because they're part of the guest working set. Furthermore a spte unmap
    event can immediately lead to a page to be freed when the pin is released
    (so requiring the same complex and relatively slow tlb_gather smp safe
    logic we have in zap_page_range and that can be avoided completely if the
    spte unmap event doesn't require an unpin of the page previously mapped in
    the secondary MMU).

    The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
    when the VM is swapping or freeing or doing anything on the primary MMU so
    that the secondary MMU code can drop sptes before the pages are freed,
    avoiding all page pinning and allowing 100% reliable swapping of guest
    physical address space. Furthermore it avoids the code that teardown the
    mappings of the secondary MMU, to implement a logic like tlb_gather in
    zap_page_range that would require many IPI to flush other cpu tlbs, for
    each fixed number of spte unmapped.

    To make an example: if what happens on the primary MMU is a protection
    downgrade (from writeable to wrprotect) the secondary MMU mappings will be
    invalidated, and the next secondary-mmu-page-fault will call
    get_user_pages and trigger a do_wp_page through get_user_pages if it
    called get_user_pages with write=1, and it'll re-establishing an updated
    spte or secondary-tlb-mapping on the copied page. Or it will setup a
    readonly spte or readonly tlb mapping if it's a guest-read, if it calls
    get_user_pages with write=0. This is just an example.

    This allows to map any page pointed by any pte (and in turn visible in the
    primary CPU MMU), into a secondary MMU (be it a pure tlb like GRU, or an
    full MMU with both sptes and secondary-tlb like the shadow-pagetable layer
    with kvm), or a remote DMA in software like XPMEM (hence needing of
    schedule in XPMEM code to send the invalidate to the remote node, while no
    need to schedule in kvm/gru as it's an immediate event like invalidating
    primary-mmu pte).

    At least for KVM without this patch it's impossible to swap guests
    reliably. And having this feature and removing the page pin allows
    several other optimizations that simplify life considerably.

    Dependencies:

    1) mm_take_all_locks() to register the mmu notifier when the whole VM
    isn't doing anything with "mm". This allows mmu notifier users to keep
    track if the VM is in the middle of the invalidate_range_begin/end
    critical section with an atomic counter incraese in range_begin and
    decreased in range_end. No secondary MMU page fault is allowed to map
    any spte or secondary tlb reference, while the VM is in the middle of
    range_begin/end as any page returned by get_user_pages in that critical
    section could later immediately be freed without any further
    ->invalidate_page notification (invalidate_range_begin/end works on
    ranges and ->invalidate_page isn't called immediately before freeing
    the page). To stop all page freeing and pagetable overwrites the
    mmap_sem must be taken in write mode and all other anon_vma/i_mmap
    locks must be taken too.

    2) It'd be a waste to add branches in the VM if nobody could possibly
    run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only enabled if
    CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of
    mmu notifiers, but this already allows to compile a KVM external module
    against a kernel with mmu notifiers enabled and from the next pull from
    kvm.git we'll start using them. And GRU/XPMEM will also be able to
    continue the development by enabling KVM=m in their config, until they
    submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can
    also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n).
    This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM
    are all =n.

    The mmu_notifier_register call can fail because mm_take_all_locks may be
    interrupted by a signal and return -EINTR. Because mmu_notifier_reigster
    is used when a driver startup, a failure can be gracefully handled. Here
    an example of the change applied to kvm to register the mmu notifiers.
    Usually when a driver startups other allocations are required anyway and
    -ENOMEM failure paths exists already.

    struct kvm *kvm_arch_create_vm(void)
    {
    struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
    + int err;

    if (!kvm)
    return ERR_PTR(-ENOMEM);

    INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);

    + kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
    + err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
    + if (err) {
    + kfree(kvm);
    + return ERR_PTR(err);
    + }
    +
    return kvm;
    }

    mmu_notifier_unregister returns void and it's reliable.

    The patch also adds a few needed but missing includes that would prevent
    kernel to compile after these changes on non-x86 archs (x86 didn't need
    them by luck).

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix mm/filemap_xip.c build]
    [akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Nick Piggin
    Signed-off-by: Christoph Lameter
    Cc: Jack Steiner
    Cc: Robin Holt
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Kanoj Sarcar
    Cc: Roland Dreier
    Cc: Steve Wise
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Cc: Rusty Russell
    Cc: Anthony Liguori
    Cc: Chris Wright
    Cc: Marcelo Tosatti
    Cc: Eric Dumazet
    Cc: "Paul E. McKenney"
    Cc: Izik Eidus
    Cc: Anthony Liguori
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

25 Jul, 2008

1 commit

  • With Mel's hugetlb private reservation support patches applied, strict
    overcommit semantics are applied to both shared and private huge page
    mappings. This can be a problem if an application relied on unlimited
    overcommit semantics for private mappings. An example of this would be an
    application which maps a huge area with the intention of using it very
    sparsely. These application would benefit from being able to opt-out of
    the strict overcommit. It should be noted that prior to hugetlb
    supporting demand faulting all mappings were fully populated and so
    applications of this type should be rare.

    This patch stack implements the MAP_NORESERVE mmap() flag for huge page
    mappings. This flag has the same meaning as for small page mappings,
    suppressing reservations for that mapping.

    Thanks to Mel Gorman for reviewing a number of early versions of these
    patches.

    This patch:

    When a small page mapping is created with mmap() reservations are created
    by default for any memory pages required. When the region is read/write
    the reservation is increased for every page, no reservation is needed for
    read-only regions (as they implicitly share the zero page). Reservations
    are tracked via the VM_ACCOUNT vma flag which is present when the region
    has reservation backing it. When we convert a region from read-only to
    read-write new reservations are aquired and VM_ACCOUNT is set. However,
    when a read-only map is created with MAP_NORESERVE it is indistinguishable
    from a normal mapping. When we then convert that to read/write we are
    forced to incorrectly create reservations for it as we have no record of
    the original MAP_NORESERVE.

    This patch introduces a new vma flag VM_NORESERVE which records the
    presence of the original MAP_NORESERVE flag. This allows us to
    distinguish these two circumstances and correctly account the reserve.

    As well as fixing this FIXME in the code, this makes it much easier to
    introduce MAP_NORESERVE support for huge pages as this flag is available
    consistantly for the life of the mapping. VM_ACCOUNT on the other hand is
    heavily used at the generic level in association with small pages.

    Signed-off-by: Andy Whitcroft
    Cc: Mel Gorman
    Cc: Adam Litke
    Cc: Johannes Weiner
    Cc: Andy Whitcroft
    Cc: William Lee Irwin III
    Cc: Hugh Dickins
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     

15 Jul, 2008

1 commit


09 Jul, 2008

1 commit

  • This patch allows architectures to define functions to deal with
    additional protections bits for mmap() and mprotect().

    arch_calc_vm_prot_bits() maps additonal protection bits to vm_flags
    arch_vm_get_page_prot() maps additional vm_flags to the vma's vm_page_prot
    arch_validate_prot() checks for valid values of the protection bits

    Note: vm_get_page_prot() is now pretty ugly, but the generated code
    should be identical for architectures that don't define additional
    protection bits.

    Signed-off-by: Dave Kleikamp
    Acked-by: Andrew Morton
    Acked-by: Hugh Dickins
    Signed-off-by: Benjamin Herrenschmidt

    Dave Kleikamp
     

25 Jun, 2008

1 commit

  • This patch adds an API for doing read-modify-write updates to a pte's
    protection bits which may race against hardware updates to the pte.
    After reading the pte, the hardware may asynchonously set the accessed
    or dirty bits on a pte, which would be lost when writing back the
    modified pte value.

    The existing technique to handle this race is to use
    ptep_get_and_clear() atomically fetch the old pte value and clear it
    in memory. This has the effect of marking the pte as non-present,
    which will prevent the hardware from updating its state. When the new
    value is written back, the pte will be present again, and the hardware
    can resume updating the access/dirty flags.

    When running in a virtualized environment, pagetable updates are
    relatively expensive, since they generally involve some trap into the
    hypervisor. To mitigate the cost of these updates, we tend to batch
    them.

    However, because of the atomic nature of ptep_get_and_clear(), it is
    inherently non-batchable. This new interface allows batching by
    giving the underlying implementation enough information to open a
    transaction between the read and write phases:

    ptep_modify_prot_start() returns the current pte value, and puts the
    pte entry into a state where either the hardware will not update the
    pte, or if it does, the updates will be preserved on commit.

    ptep_modify_prot_commit() writes back the updated pte, makes sure that
    any hardware updates made since ptep_modify_prot_start() are
    preserved.

    ptep_modify_prot_start() and _commit() must be exactly paired, and
    used while holding the appropriate pte lock. They do not protect
    against other software updates of the pte in any way.

    The current implementations of ptep_modify_prot_start and _commit are
    functionally unchanged from before: _start() uses ptep_get_and_clear()
    fetch the pte and zero the entry, preventing any hardware updates.
    _commit() simply writes the new pte value back knowing that the
    hardware has not updated the pte in the meantime.

    The only current user of this interface is mprotect

    Signed-off-by: Jeremy Fitzhardinge
    Acked-by: Linus Torvalds
    Acked-by: Hugh Dickins
    Signed-off-by: Ingo Molnar

    Jeremy Fitzhardinge
     

15 May, 2008

1 commit

  • There is a defect in mprotect, which lets the user change the page cache
    type bits by-passing the kernel reserve_memtype and free_memtype
    wrappers. Fix the problem by not letting mprotect change the PAT bits.

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Suresh Siddha
    Signed-off-by: Ingo Molnar
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Venki Pallipadi
     

23 Oct, 2007

1 commit

  • Fix mprotect bug in recent commit 3ed75eb8f1cd89565966599c4f77d2edb086d5b0
    (setup vma->vm_page_prot by vm_get_page_prot()): the vma_wants_writenotify
    case was setting the same prot as when not.

    Nothing wrong with the use of protection_map[] in mmap_region(),
    but use vm_get_page_prot() there too in the same ~VM_SHARED way.

    Signed-off-by: Hugh Dickins
    Cc: Coly Li
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

20 Oct, 2007

1 commit

  • This patch uses vm_get_page_prot() to setup vma->vm_page_prot.

    Though inside vm_get_page_prot() the protection flags is AND with
    (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED), it does not hurt correct code.

    Signed-off-by: Coly Li
    Cc: Hugh Dickins
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Coly Li
     

17 Oct, 2007

1 commit

  • Current ia64 kernel flushes icache by lazy_mmu_prot_update() *after*
    set_pte(). This is too late. This patch removes lazy_mmu_prot_update and
    add modfied set_pte() for flushing if necessary.

    This patch flush icache of a page when
    new pte has exec bit.
    && new pte has present bit
    && new pte is user's page.
    && (old *ptep is not present
    || new pte's pfn is not same to old *ptep's ptn)
    && new pte's page has no Pg_arch_1 bit.
    Pg_arch_1 is set when a page is cache consistent.

    I think this condition checks are much easier to understand than considering
    "Where sync_icache_dcache() should be inserted ?".

    pte_user() for ia64 was removed by http://lkml.org/lkml/2007/6/12/67 as
    clean-up. So, I added it again.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Luck, Tony"
    Cc: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

20 Jul, 2007

1 commit

  • Remove the arg+env limit of MAX_ARG_PAGES by copying the strings directly from
    the old mm into the new mm.

    We create the new mm before the binfmt code runs, and place the new stack at
    the very top of the address space. Once the binfmt code runs and figures out
    where the stack should be, we move it downwards.

    It is a bit peculiar in that we have one task with two mm's, one of which is
    inactive.

    [a.p.zijlstra@chello.nl: limit stack size]
    Signed-off-by: Ollie Wild
    Signed-off-by: Peter Zijlstra
    Cc:
    Cc: Hugh Dickins
    [bunk@stusta.de: unexport bprm_mm_init]
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ollie Wild
     

01 Oct, 2006

1 commit

  • Implement lazy MMU update hooks which are SMP safe for both direct and shadow
    page tables. The idea is that PTE updates and page invalidations while in
    lazy mode can be batched into a single hypercall. We use this in VMI for
    shadow page table synchronization, and it is a win. It also can be used by
    PPC and for direct page tables on Xen.

    For SMP, the enter / leave must happen under protection of the page table
    locks for page tables which are being modified. This is because otherwise,
    you end up with stale state in the batched hypercall, which other CPUs can
    race ahead of. Doing this under the protection of the locks guarantees the
    synchronization is correct, and also means that spurious faults which are
    generated during this window by remote CPUs are properly handled, as the page
    fault handler must re-check the PTE under protection of the same lock.

    Signed-off-by: Zachary Amsden
    Signed-off-by: Jeremy Fitzhardinge
    Cc: Rusty Russell
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     

26 Sep, 2006

2 commits

  • mprotect() resets the page protections, which could result in extra write
    faults for those pages whose dirty state we track using write faults and are
    dirty already.

    Signed-off-by: Peter Zijlstra
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Tracking of dirty pages in shared writeable mmap()s.

    The idea is simple: write protect clean shared writeable pages, catch the
    write-fault, make writeable and set dirty. On page write-back clean all the
    PTE dirty bits and write protect them once again.

    The implementation is a tad harder, mainly because the default
    backing_dev_info capabilities were too loosely maintained. Hence it is not
    enough to test the backing_dev_info for cap_account_dirty.

    The current heuristic is as follows, a VMA is eligible when:
    - its shared writeable
    (vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED)
    - it is not a 'special' mapping
    (vm_flags & (VM_PFNMAP|VM_INSERTPAGE)) == 0
    - the backing_dev_info is cap_account_dirty
    mapping_cap_account_dirty(vma->vm_file->f_mapping)
    - f_op->mmap() didn't change the default page protection

    Page from remap_pfn_range() are explicitly excluded because their COW
    semantics are already horrid enough (see vm_normal_page() in do_wp_page()) and
    because they don't have a backing store anyway.

    mprotect() is taught about the new behaviour as well. However it overrides
    the last condition.

    Cleaning the pages on write-back is done with page_mkclean() a new rmap call.
    It can be called on any page, but is currently only implemented for mapped
    pages, if the page is found the be of a VMA that accounts dirty pages it will
    also wrprotect the PTE.

    Finally, in fs/buffers.c:try_to_free_buffers(); remove clear_page_dirty() from
    under ->private_lock. This seems to be safe, since ->private_lock is used to
    serialize access to the buffers, not the page itself. This is needed because
    clear_page_dirty() will call into page_mkclean() and would thereby violate
    locking order.

    [dhowells@redhat.com: Provide a page_mkclean() implementation for NOMMU]
    Signed-off-by: Peter Zijlstra
    Cc: Hugh Dickins
    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

23 Jun, 2006

3 commits

  • Add a new VMA operation to notify a filesystem or other driver about the
    MMU generating a fault because userspace attempted to write to a page
    mapped through a read-only PTE.

    This facility permits the filesystem or driver to:

    (*) Implement storage allocation/reservation on attempted write, and so to
    deal with problems such as ENOSPC more gracefully (perhaps by generating
    SIGBUS).

    (*) Delay making the page writable until the contents have been written to a
    backing cache. This is useful for NFS/AFS when using FS-Cache/CacheFS.
    It permits the filesystem to have some guarantee about the state of the
    cache.

    (*) Account and limit number of dirty pages. This is one piece of the puzzle
    needed to make shared writable mapping work safely in FUSE.

    Needed by cachefs (Or is it cachefiles? Or fscache? ).

    At least four other groups have stated an interest in it or a desire to use
    the functionality it provides: FUSE, OCFS2, NTFS and JFFS2. Also, things like
    EXT3 really ought to use it to deal with the case of shared-writable mmap
    encountering ENOSPC before we permit the page to be dirtied.

    From: Peter Zijlstra

    get_user_pages(.write=1, .force=1) can generate COW hits on read-only
    shared mappings, this patch traps those as mkpage_write candidates and fails
    to handle them the old way.

    Signed-off-by: David Howells
    Cc: Miklos Szeredi
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc: Anton Altaparmakov
    Cc: David Woodhouse
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Implement read/write migration ptes

    We take the upper two swapfiles for the two types of migration ptes and define
    a series of macros in swapops.h.

    The VM is modified to handle the migration entries. migration entries can
    only be encountered when the page they are pointing to is locked. This limits
    the number of places one has to fix. We also check in copy_pte_range and in
    mprotect_pte_range() for migration ptes.

    We check for migration ptes in do_swap_cache and call a function that will
    then wait on the page lock. This allows us to effectively stop all accesses
    to apge.

    Migration entries are created by try_to_unmap if called for migration and
    removed by local functions in migrate.c

    From: Hugh Dickins

    Several times while testing swapless page migration (I've no NUMA, just
    hacking it up to migrate recklessly while running load), I've hit the
    BUG_ON(!PageLocked(p)) in migration_entry_to_page.

    This comes from an orphaned migration entry, unrelated to the current
    correctly locked migration, but hit by remove_anon_migration_ptes as it
    checks an address in each vma of the anon_vma list.

    Such an orphan may be left behind if an earlier migration raced with fork:
    copy_one_pte can duplicate a migration entry from parent to child, after
    remove_anon_migration_ptes has checked the child vma, but before it has
    removed it from the parent vma. (If the process were later to fault on this
    orphaned entry, it would hit the same BUG from migration_entry_wait.)

    This could be fixed by locking anon_vma in copy_one_pte, but we'd rather
    not. There's no such problem with file pages, because vma_prio_tree_add
    adds child vma after parent vma, and the page table locking at each end is
    enough to serialize. Follow that example with anon_vma: add new vmas to the
    tail instead of the head.

    (There's no corresponding problem when inserting migration entries,
    because a missed pte will leave the page count and mapcount high, which is
    allowed for. And there's no corresponding problem when migrating via swap,
    because a leftover swap entry will be correctly faulted. But the swapless
    method has no refcounting of its entries.)

    From: Ingo Molnar

    pte_unmap_unlock() takes the pte pointer as an argument.

    From: Hugh Dickins

    Several times while testing swapless page migration, gcc has tried to exec
    a pointer instead of a string: smells like COW mappings are not being
    properly write-protected on fork.

    The protection in copy_one_pte looks very convincing, until at last you
    realize that the second arg to make_migration_entry is a boolean "write",
    and SWP_MIGRATION_READ is 30.

    Anyway, it's better done like in change_pte_range, using
    is_write_migration_entry and make_migration_entry_read.

    From: Hugh Dickins

    Remove unnecessary obfuscation from sys_swapon's range check on swap type,
    which blew up causing memory corruption once swapless migration made
    MAX_SWAPFILES no longer 2 ^ MAX_SWAPFILES_SHIFT.

    Signed-off-by: Hugh Dickins
    Acked-by: Martin Schwidefsky
    Signed-off-by: Hugh Dickins
    Signed-off-by: Christoph Lameter
    Signed-off-by: Ingo Molnar
    From: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • With likely/unlikely profiling on my not-so-busy-typical-developmentsystem
    there are 5k misses vs 2k hits. So I guess we should remove the unlikely.

    Signed-off-by: Hua Zhong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hua Zhong
     

22 Mar, 2006

1 commit

  • 2.6.16-rc3 uses hugetlb on-demand paging, but it doesn_t support hugetlb
    mprotect.

    From: David Gibson

    Remove a test from the mprotect() path which checks that the mprotect()ed
    range on a hugepage VMA is hugepage aligned (yes, really, the sense of
    is_aligned_hugepage_range() is the opposite of what you'd guess :-/).

    In fact, we don't need this test. If the given addresses match the
    beginning/end of a hugepage VMA they must already be suitably aligned. If
    they don't, then mprotect_fixup() will attempt to split the VMA. The very
    first test in split_vma() will check for a badly aligned address on a
    hugepage VMA and return -EINVAL if necessary.

    From: "Chen, Kenneth W"

    On i386 and x86-64, pte flag _PAGE_PSE collides with _PAGE_PROTNONE. The
    identify of hugetlb pte is lost when changing page protection via mprotect.
    A page fault occurs later will trigger a bug check in huge_pte_alloc().

    The fix is to always make new pte a hugetlb pte and also to clean up
    legacy code where _PAGE_PRESENT is forced on in the pre-faulting day.

    Signed-off-by: Zhang Yanmin
    Cc: David Gibson
    Cc: "David S. Miller"
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: William Lee Irwin III
    Signed-off-by: Ken Chen
    Signed-off-by: Nishanth Aravamudan
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang, Yanmin
     

23 Nov, 2005

1 commit

  • The PageReserved removal in 2.6.15-rc1 issued a "deprecated" message when you
    tried to mmap or mprotect MAP_PRIVATE PROT_WRITE a VM_RESERVED, and failed
    with -EACCES: because do_wp_page lacks the refinement to COW pages in those
    areas, nor do we expect to find anonymous pages in them; and it seemed just
    bloat to add code for handling such a peculiar case. But immediately it
    caused vbetool and ddcprobe (using lrmi) to fail.

    So revert the "deprecated" messages, letting mmap and mprotect succeed. But
    leave do_wp_page's BUG_ON(vma->vm_flags & VM_RESERVED) in place until we've
    added the code to do it right: so this particular patch is only good if the
    app doesn't really need to write to that private area.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

30 Oct, 2005

3 commits

  • Convert those common loops using page_table_lock on the outside and
    pte_offset_map within to use just pte_offset_map_lock within instead.

    These all hold mmap_sem (some exclusively, some not), so at no level can a
    page table be whipped away from beneath them. But whereas pte_alloc loops
    tested with the "atomic" pmd_present, these loops are testing with pmd_none,
    which on i386 PAE tests both lower and upper halves.

    That's now unsafe, so add a cast into pmd_none to test only the vital lower
    half: we lose a little sensitivity to a corrupt middle directory, but not
    enough to worry about. It appears that i386 and UML were the only
    architectures vulnerable in this way, and pgd and pud no problem.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove PageReserved() calls from core code by tightening VM_RESERVED
    handling in mm/ to cover PageReserved functionality.

    PageReserved special casing is removed from get_page and put_page.

    All setting and clearing of PageReserved is retained, and it is now flagged
    in the page_alloc checks to help ensure we don't introduce any refcount
    based freeing of Reserved pages.

    MAP_PRIVATE, PROT_WRITE of VM_RESERVED regions is tentatively being
    deprecated. We never completely handled it correctly anyway, and is be
    reintroduced in future if required (Hugh has a proof of concept).

    Once PageReserved() calls are removed from kernel/power/swsusp.c, and all
    arch/ and driver code, the Set and Clear calls, and the PG_reserved bit can
    be trivially removed.

    Last real user of PageReserved is swsusp, which uses PageReserved to
    determine whether a struct page points to valid memory or not. This still
    needs to be addressed (a generic page_is_ram() should work).

    A last caveat: the ZERO_PAGE is now refcounted and managed with rmap (and
    thus mapcounted and count towards shared rss). These writes to the struct
    page could cause excessive cacheline bouncing on big systems. There are a
    number of ways this could be addressed if it is an issue.

    Signed-off-by: Nick Piggin

    Refcount bug fix for filemap_xip.c

    Signed-off-by: Carsten Otte
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • The original vm_stat_account has fallen into disuse, with only one user, and
    only one user of vm_stat_unaccount. It's easier to keep track if we convert
    them all to __vm_stat_account, then free it from its __shackles.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

22 Sep, 2005

1 commit

  • Hugh made me note this line for permission checking in mprotect():

    if ((newflags & ~(newflags >> 4)) & 0xf) {

    after figuring out what's that about, I decided it's nasty enough. Btw
    Hugh itself didn't like the 0xf.

    We can safely change it to VM_READ|VM_WRITE|VM_EXEC because we never change
    VM_SHARED, so no need to check that.

    Signed-off-by: Paolo 'Blaisorblade' Giarrusso
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paolo 'Blaisorblade' Giarrusso
     

17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds