31 Jan, 2007

1 commit

  • Nick Piggin points out that page accounting on MIPS, with its multiple
    ZERO_PAGEs, is not maintained by its move_pte, and could lead to freeing
    a ZERO_PAGE.

    Instead of complicating that move_pte, just forget the minor optimization
    when mremapping, and change the one thing which needed it for correctness:
    make filemap_xip use ZERO_PAGE(0) throughout, instead of a ZERO_PAGE
    chosen according to address.

    [ "There is no block device driver one could use for XIP on mips
    platforms" - Carsten Otte ]

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Andrew Morton
    Cc: Ralf Baechle
    Cc: Carsten Otte
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

01 Oct, 2006

1 commit

  • Implement lazy MMU update hooks which are SMP safe for both direct and shadow
    page tables. The idea is that PTE updates and page invalidations while in
    lazy mode can be batched into a single hypercall. We use this in VMI for
    shadow page table synchronization, and it is a win. It can also be used by
    PPC and for direct page tables on Xen.

    For SMP, the enter / leave must happen under protection of the page table
    locks for page tables which are being modified. This is because otherwise,
    you end up with stale state in the batched hypercall, which other CPUs can
    race ahead of. Doing this under the protection of the locks guarantees the
    synchronization is correct, and also means that spurious faults which are
    generated during this window by remote CPUs are properly handled, as the page
    fault handler must re-check the PTE under protection of the same lock.
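
    A minimal sketch of the intended pattern (the enter / leave hooks and
    pte_offset_map_lock are the real interfaces; mm, pmd, addr, end and the
    write-protect loop body are assumed for the example):

        spinlock_t *ptl;
        pte_t *pte = pte_offset_map_lock(mm, pmd, addr, &ptl);

        arch_enter_lazy_mmu_mode();             /* start batching updates */
        for (; addr != end; pte++, addr += PAGE_SIZE) {
                if (!pte_present(*pte))
                        continue;
                /* may be queued in the batch rather than applied at once */
                set_pte_at(mm, addr, pte, pte_wrprotect(*pte));
        }
        arch_leave_lazy_mmu_mode();             /* flush: one hypercall */

        pte_unmap_unlock(pte - 1, ptl);         /* leave before unlocking */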

    Signed-off-by: Zachary Amsden
    Signed-off-by: Jeremy Fitzhardinge
    Cc: Rusty Russell
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     

04 Jul, 2006

1 commit

  • Teach special (recursive) locking code to the lock validator. Has no effect
    on non-lockdep kernels.
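
    The annotation pattern in question, sketched for mremap's case of taking
    two pte locks of the same lock class (the lock variables are assumed;
    SINGLE_DEPTH_NESTING is the lockdep subclass for one level of nesting):

        spin_lock(old_ptl);
        if (new_ptl != old_ptl)
                /* tell the validator this same-class nesting is intended */
                spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);

    Without lockdep configured, spin_lock_nested compiles down to a plain
    spin_lock, hence no effect on non-lockdep kernels.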

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

12 Jan, 2006

1 commit

  • - Move capable() from sched.h to capability.h;

    - Use <linux/capability.h> where capable() is used
    (in include/, block/, ipc/, kernel/, a few drivers/,
    mm/, security/, & sound/;
    many more drivers/ to go)
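
    For reference, the usage the new header supports (the caller is
    hypothetical; capable() and CAP_SYS_ADMIN are the real interfaces):

        #include <linux/capability.h>
        #include <linux/errno.h>

        static int foo_check_privileged(void)
        {
                if (!capable(CAP_SYS_ADMIN))    /* privileged operation? */
                        return -EPERM;
                return 0;
        }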

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy.Dunlap
     

30 Oct, 2005

7 commits

  • Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
    a many-threaded application which concurrently initializes different parts of
    a large anonymous area.

    This patch corrects that, by using a separate spinlock per page table page, to
    guard the page table entries in that page, instead of using the mm's single
    page_table_lock. (But even then, page_table_lock is still used to guard page
    table allocation, and anon_vma allocation.)

    In this implementation, the spinlock is tucked inside the struct page of the
    page table page: with a BUILD_BUG_ON in case it overflows - which it would in
    the case of 32-bit PA-RISC with spinlock debugging enabled.

    Splitting the lock is not quite for free: another cacheline access. Ideally,
    I suppose we would use split ptlock only for multi-threaded processes on
    multi-cpu machines; but deciding that dynamically would have its own costs.
    So for now enable it by config, at some number of cpus - since the Kconfig
    language doesn't support inequalities, let the preprocessor compare that with
    NR_CPUS. But I don't think it's worth being user-configurable: for good
    testing of both split and unsplit configs, split now at 4 cpus, and perhaps
    change that to 8 later.

    There is a benefit even for singly threaded processes: kswapd can be attacking
    one part of the mm while another part is busy faulting.
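
    A condensed sketch of the arrangement (macro and field names follow the
    form the patch takes; treat the details as illustrative):

        /* split: the lock lives in the struct page of the page table page */
        #if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
        #define pte_lockptr(mm, pmd)  ({ (void)(mm); &pmd_page(*(pmd))->ptl; })
        #else
        /* unsplit: fall back to the single per-mm lock */
        #define pte_lockptr(mm, pmd)  ({ (void)(pmd); &(mm)->page_table_lock; })
        #endif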

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Second step in pushing down the page_table_lock. Remove the temporary
    bridging hack from __pud_alloc, __pmd_alloc, __pte_alloc: expect callers not
    to hold page_table_lock, whether it's on init_mm or a user mm; take
    page_table_lock internally to check if a racing task already allocated.

    Convert their callers from common code. But avoid coming back to change them
    again later: instead of moving the spin_lock(&mm->page_table_lock) down,
    switch over to new macros pte_alloc_map_lock and pte_unmap_unlock, which
    encapsulate the mapping+locking and unlocking+unmapping together, and in the
    end may use alternatives to the mm page_table_lock itself.

    These callers all hold mmap_sem (some exclusively, some not), so at no level
    can a page table be whipped away from beneath them; and pte_alloc uses the
    "atomic" pmd_present to test whether it needs to allocate. It appears that on
    all arches we can safely descend without page_table_lock.
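
    The new helpers pair up as in this sketch (mm, pmd and addr assumed in
    scope; the two macros are the interfaces this patch introduces):

        spinlock_t *ptl;
        pte_t *pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
        if (!pte)
                return -ENOMEM;
        /* ... examine or modify the entry under the lock ... */
        pte_unmap_unlock(pte, ptl);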

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • It seems odd to me that, whereas pud_alloc and pmd_alloc test inline, only
    calling out-of-line __pud_alloc or __pmd_alloc if allocation is needed,
    pte_alloc_map and pte_alloc_kernel are entirely out-of-line. Though it does
    add a little to kernel size, change them to macros testing inline, calling
    __pte_alloc or __pte_alloc_kernel to allocate out-of-line. Mark none of them
    as fastcalls, leave that to CONFIG_REGPARM or not.

    It also seems more natural for the out-of-line functions to leave the offset
    calculation and map to the inline, which has to do it anyway for the common
    case. At least mremap move wants __pte_alloc without _map.

    Macros rather than inline functions, certainly to avoid the header file issues
    which arise from CONFIG_HIGHPTE needing kmap_types.h, but also in case any
    architectures I haven't built would have other such problems.
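
    The shape of the resulting macro, as a sketch (inline pmd_present test,
    calling out-of-line __pte_alloc only when the pmd is not yet populated):

        #define pte_alloc_map(mm, pmd, address)                          \
                ((unlikely(!pmd_present(*(pmd))) &&                      \
                  __pte_alloc(mm, pmd, address)) ?                       \
                        NULL : pte_offset_map(pmd, address))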

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • update_mem_hiwater has attracted various criticisms, in particular from those
    concerned with mm scalability. Originally it was called whenever rss or
    total_vm got raised. Then many of those callsites were replaced by a timer
    tick call from account_system_time. Now Frank van Maarseveen reports that to
    be inadequate. How about this? Works for Frank.

    Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros
    update_hiwater_rss and update_hiwater_vm. Don't attempt to keep
    mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually
    by 1): those are hot paths. Do the opposite, update only when about to lower
    rss (usually by many), or just before final accounting in do_exit. Handle
    mm->hiwater_vm in the same way, though it's much less of an issue. Demand
    that whoever collects these hiwater statistics do the work of taking the
    maximum with rss or total_vm.
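
    A sketch of what such a macro amounts to (names per the patch; get_mm_rss
    is the existing rss accessor):

        #define update_hiwater_rss(mm)  do {                    \
                unsigned long _rss = get_mm_rss(mm);            \
                if ((mm)->hiwater_rss < _rss)                   \
                        (mm)->hiwater_rss = _rss;               \
        } while (0)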

    And there has been no collector of these hiwater statistics in the tree. The
    new convention needs an example, so match Frank's usage by adding a VmPeak
    line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS
    (High-Water-Mark or High-Water-Memory).

    There was a particular anomaly during mremap move, that hiwater_vm might be
    captured too high. A fleeting such anomaly remains, but it's quickly
    corrected now, whereas before it would stick.

    What locking? None: if the app is racy then these statistics will be racy,
    it's not worth any overhead to make them exact. But whenever it suits,
    hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under
    page_table_lock (for now) or with preemption disabled (later on): without
    going to any trouble, minimize the time between reading current values and
    updating, to minimize those occasions when a racing thread bumps a count up
    and back down in between.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Cleanup: relieve do_mremap from its surfeit of current->mms.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Speeding up mremap's moving of ptes has never been a priority, but the locking
    will get more complicated shortly, and is already too baroque.

    Scrap the current one-by-one moving, do an extent at a time: curtailed by end
    of src and dst pmds (have to use PMD_SIZE: the way pmd_addr_end gets elided
    doesn't match this usage), and by latency considerations.
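
    In outline, the move loop becomes something like this sketch (the pmd
    lookup and allocation between the bounds checks are elided; old_pmd,
    new_pmd, vma, new_vma and LATENCY_LIMIT follow the patch's direction):

        for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
                cond_resched();
                /* extent is curtailed by the end of the src pmd ... */
                next = (old_addr + PMD_SIZE) & PMD_MASK;
                if (next > old_end)
                        next = old_end;
                extent = next - old_addr;
                /* ... by the end of the dst pmd ... */
                next = (new_addr + PMD_SIZE) & PMD_MASK;
                if (extent > next - new_addr)
                        extent = next - new_addr;
                /* ... and by latency considerations */
                if (extent > LATENCY_LIMIT)
                        extent = LATENCY_LIMIT;
                move_ptes(vma, old_pmd, old_addr, old_addr + extent,
                                new_vma, new_pmd, new_addr);
        }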

    One nice property of the old method is lost: it never allocated a page table
    unless absolutely necessary, so you could free empty page tables by mremapping
    to and fro. Whereas this way, it allocates a dst table wherever there was a
    src table. I keep diving in to reinstate the old behaviour, then come out
    preferring not to clutter how it now is.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The original vm_stat_account has fallen into disuse, with only one user, and
    only one user of vm_stat_unaccount. It's easier to keep track if we convert
    them all to __vm_stat_account, then free it from its __shackles.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

28 Sep, 2005

1 commit

  • Move the ZERO_PAGE remapping complexity to the move_pte macro in
    asm-generic, have it conditionally depend on
    __HAVE_ARCH_MULTIPLE_ZERO_PAGE, which gets defined for MIPS.

    For architectures without __HAVE_ARCH_MULTIPLE_ZERO_PAGE, move_pte becomes
    a noop.

    From: Hugh Dickins

    Fix nasty little bug we've missed in Nick's mremap move ZERO_PAGE patch.
    The "pte" at that point may be a swap entry or a pte_file entry: we must
    check pte_present before perhaps corrupting such an entry.
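
    With the fix folded in, the conditional asm-generic version looks like
    this sketch (the leading pte_present test is the point of the fix):

        #ifdef __HAVE_ARCH_MULTIPLE_ZERO_PAGE
        static inline pte_t move_pte(pte_t pte, pgprot_t prot,
                        unsigned long old_addr, unsigned long new_addr)
        {
                if (pte_present(pte) &&
                    page_to_pfn(ZERO_PAGE(old_addr)) == pte_pfn(pte))
                        pte = mk_pte(ZERO_PAGE(new_addr), prot);
                return pte;
        }
        #else
        #define move_pte(pte, prot, old_addr, new_addr) (pte)
        #endif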

    Patch below against 2.6.14-rc2-mm1, but the same bug is in 2.6.14-rc2's
    mm/mremap.c, and more dangerous there since it's affecting all arches: I
    think the safest course is to send Nick's patch and Yoichi's build fix and
    this fix (build tested) on to Linus - so only MIPS can be affected.

    Signed-off-by: Nick Piggin
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

05 Sep, 2005

1 commit

  • filemap_xip's nopage routine maps the ZERO_PAGE into readonly mappings, if it
    has no data page to map there: then if the hole in the file is later filled,
    __xip_unmap uses an rmap technique to replace the ZERO_PAGEs mapped for that
    offset by the newly allocated file page, so that established mappings will see
    the newly written data.

    However, on MIPS (alone) there's not one but as many as eight ZERO_PAGEs,
    chosen for coloring by user virtual address; and if mremap has meanwhile been
    used to move a mapping containing a ZERO_PAGE, it will generally not match the
    ZERO_PAGE(address) __xip_unmap is looking for.

    To maintain XIP's established mappings correctly on MIPS, we need Nick's fix
    to mremap's move_one_page (originally presented as an optimization), to
    replace the ZERO_PAGE appropriate to the old address by the ZERO_PAGE
    appropriate to the new address.

    (But when I first saw this, I was thinking the ZERO_PAGEs themselves would get
    corrupted, very bad. Now I think it's the other way round, that the
    established mappings will fail to see the newly written data: incorrect, but
    not corrupting everything else. Whether filemap_xip's technique is generally
    safe, I'd hesitate to say in a hurry: it's interesting, but we've never tried
    to do that in tmpfs.)

    Signed-off-by: Hugh Dickins
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

05 Aug, 2005

1 commit

  • mremap's move_vma is applying __vm_stat_account to the old vma which may
    have already been freed: move it to just before the do_munmap.

    mremapping to and fro with CONFIG_DEBUG_SLAB=y showed /proc/<pid>/status
    VmSize and VmData wrapping just like in kernel bugzilla #4842, and fixed by
    this patch - worth including in 2.6.13, though it is not yet confirmed to
    fix that specific report from Frank van Maarseveen.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

01 May, 2005

1 commit

  • Address bug #4508: there's potential for wraparound in the various places
    where we perform RLIMIT_AS checking.

    (I'm a bit worried about acct_stack_growth(). Are we sure that vma->vm_mm is
    always equal to current->mm? If not, then we're comparing some other
    process's total_vm with the calling process's rlimits).
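
    A sketch of the overflow-safe check this introduces (shape per the
    may_expand_vm helper; comparing in pages rather than bytes is what
    avoids the wraparound):

        int may_expand_vm(struct mm_struct *mm, unsigned long npages)
        {
                unsigned long cur = mm->total_vm;       /* pages */
                unsigned long lim;

                lim = current->signal->rlim[RLIMIT_AS].rlim_cur >> PAGE_SHIFT;

                if (cur + npages > lim)
                        return 0;
                return 1;
        }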

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@osdl.org
     

17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds