30 Oct, 2005

4 commits

  • Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
    a many-threaded application which concurrently initializes different parts of
    a large anonymous area.

    This patch corrects that, by using a separate spinlock per page table page, to
    guard the page table entries in that page, instead of using the mm's single
    page_table_lock. (But even then, page_table_lock is still used to guard page
    table allocation, and anon_vma allocation.)

    In this implementation, the spinlock is tucked inside the struct page of the
    page table page: with a BUILD_BUG_ON in case it overflows - which it would in
    the case of 32-bit PA-RISC with spinlock debugging enabled.

    Splitting the lock is not quite for free: another cacheline access. Ideally,
    I suppose we would use split ptlock only for multi-threaded processes on
    multi-cpu machines; but deciding that dynamically would have its own costs.
    So for now enable it by config, at some number of cpus - since the Kconfig
    language doesn't support inequalities, let the preprocessor compare that with
    NR_CPUS. But I don't think it's worth being user-configurable: for good
    testing of both split and unsplit configs, split now at 4 cpus, and perhaps
    change that to 8 later.

    There is a benefit even for singly threaded processes: kswapd can be attacking
    one part of the mm while another part is busy faulting.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
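
    A userspace sketch of the split-lock idea above, not the kernel code: one
    lock per chunk of a table instead of a single table-wide lock, so threads
    initializing different chunks no longer serialize. Chunk sizes, names and
    the pthread locking are all illustrative stand-ins.

        /* One lock per table chunk instead of a single table-wide lock, so
         * writers touching different chunks do not serialize.  Chunk size and
         * counts are arbitrary illustration values. */
        #include <pthread.h>
        #include <stdio.h>

        #define CHUNKS    4
        #define PER_CHUNK 1024

        struct chunk {
            pthread_mutex_t lock;       /* analogue of the per-page ptl       */
            long entry[PER_CHUNK];      /* analogue of the page table entries */
        };

        static struct chunk table[CHUNKS];

        static void *populate(void *arg)
        {
            int c = (int)(long)arg;     /* each thread owns one chunk */

            pthread_mutex_lock(&table[c].lock);
            for (int i = 0; i < PER_CHUNK; i++)
                table[c].entry[i] = (long)c * PER_CHUNK + i;
            pthread_mutex_unlock(&table[c].lock);
            return NULL;
        }

        int main(void)
        {
            pthread_t tid[CHUNKS];

            for (int c = 0; c < CHUNKS; c++)
                pthread_mutex_init(&table[c].lock, NULL);

            /* With one global lock these threads would run one at a time;
             * with per-chunk locks they proceed in parallel. */
            for (int c = 0; c < CHUNKS; c++)
                pthread_create(&tid[c], NULL, populate, (void *)(long)c);
            for (int c = 0; c < CHUNKS; c++)
                pthread_join(tid[c], NULL);

            printf("last entry = %ld\n", table[CHUNKS - 1].entry[PER_CHUNK - 1]);
            return 0;
        }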
     
  • Convert those common loops using page_table_lock on the outside and
    pte_offset_map within to use just pte_offset_map_lock within instead.

    These all hold mmap_sem (some exclusively, some not), so at no level can a
    page table be whipped away from beneath them. But whereas pte_alloc loops
    tested with the "atomic" pmd_present, these loops are testing with pmd_none,
    which on i386 PAE tests both lower and upper halves.

    That's now unsafe, so add a cast into pmd_none to test only the vital lower
    half: we lose a little sensitivity to a corrupt middle directory, but not
    enough to worry about. It appears that i386 and UML were the only
    architectures vulnerable in this way, and that pgd and pud present no
    problem.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
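
    The conversion above pairs a lookup with taking the lock that guards what
    was looked up. A minimal userspace analogy of that pairing, with made-up
    names (entry_offset_lock/entry_unlock) echoing pte_offset_map_lock and
    pte_unmap_unlock; none of this is the kernel implementation.

        /* A helper that both locates an entry and takes the lock covering its
         * chunk, paired with a helper that releases it. */
        #include <pthread.h>
        #include <stdio.h>

        #define CHUNK_SHIFT 10
        #define CHUNKS      4
        #define ENTRIES     (CHUNKS << CHUNK_SHIFT)

        static long entries[ENTRIES];
        static pthread_mutex_t chunk_lock[CHUNKS] = {
            PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
            PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
        };

        /* Find the entry and take the lock guarding its chunk, returning both,
         * in the style of pte_offset_map_lock(mm, pmd, addr, &ptl). */
        static long *entry_offset_lock(unsigned idx, pthread_mutex_t **lockp)
        {
            *lockp = &chunk_lock[idx >> CHUNK_SHIFT];
            pthread_mutex_lock(*lockp);
            return &entries[idx];
        }

        /* Counterpart of pte_unmap_unlock(pte, ptl). */
        static void entry_unlock(pthread_mutex_t *lock)
        {
            pthread_mutex_unlock(lock);
        }

        int main(void)
        {
            pthread_mutex_t *lock;
            long *e = entry_offset_lock(1234, &lock);

            *e = 42;            /* the entry is only touched under its lock */
            entry_unlock(lock);

            printf("entries[1234] = %ld\n", entries[1234]);
            return 0;
        }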
     
  • I was lazy when we added anon_rss, and chose to change as few places as
    possible. So currently each anonymous page has to be counted twice, in rss
    and in anon_rss. Which won't be so good if those are atomic counts in some
    configurations.

    Change that around: keep file_rss and anon_rss separately, and add them
    together (with get_mm_rss macro) when the total is needed - reading two
    atomics is much cheaper than updating two atomics. And update anon_rss
    upfront, typically in memory.c, not tucked away in page_add_anon_rmap.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
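
    The accounting split above, sketched in userspace: keep two counters and
    add them only when the total is read. The struct and the C11 atomics here
    are illustrative; only the names file_rss, anon_rss and get_mm_rss come
    from the entry.

        /* Two separate counters, summed only when the total is wanted: two
         * atomic reads instead of two atomic updates per page. */
        #include <stdatomic.h>
        #include <stdio.h>

        struct mm_counters {
            atomic_long file_rss;       /* pages backed by files */
            atomic_long anon_rss;       /* anonymous pages       */
        };

        static long get_mm_rss(struct mm_counters *mm)
        {
            return atomic_load(&mm->file_rss) + atomic_load(&mm->anon_rss);
        }

        int main(void)
        {
            struct mm_counters mm = { 0 };      /* hypothetical process */

            atomic_fetch_add(&mm.anon_rss, 3);  /* three anonymous faults  */
            atomic_fetch_add(&mm.file_rss, 2);  /* two file-backed faults  */

            printf("rss = %ld pages\n", get_mm_rss(&mm));
            return 0;
        }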
     
  • do_anonymous_page's pte_wrprotect causes some confusion: in such a case,
    vm_page_prot must already be forcing COW, so must omit write permission, and
    so the pte_wrprotect is redundant. Replace it by a comment to that effect,
    and reword the comment on unuse_pte which also caused confusion.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

23 Sep, 2005

1 commit

  • Problem: In some circumstances, bd_claim() is returning the wrong error
    code.

    If we try to swapon an unused block device that isn't swap formatted, we
    get -EINVAL. But if that same block device is already mounted, we instead
    get -EBUSY, even though it still isn't a valid swap device.

    This issue came up on the busybox list trying to get the error message
    from "swapon -a" right. If a swap device is already enabled, we get -EBUSY,
    and we shouldn't report this as an error. But we can't distinguish the two
    -EBUSY conditions, which are very different errors.

    In the code, bd_claim() returns either 0 or -EBUSY, but in this case busy
    means "somebody other than sys_swapon has already claimed this", and
    _that_ means this block device can't be a valid swap device. So return
    -EINVAL there.

    Signed-off-by: Rob Landley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rob Landley
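
    The error mapping above, as a hedged sketch: a claim that fails because
    someone other than swapon holds the device means it cannot be a valid swap
    device, so report -EINVAL rather than passing -EBUSY through. The
    try_claim_blockdev and swapon_claim names are invented for illustration;
    only bd_claim's 0/-EBUSY behaviour comes from the entry.

        #include <errno.h>
        #include <stdio.h>

        /* Stand-in for bd_claim: 0 on success, -EBUSY if another holder exists. */
        static int try_claim_blockdev(int already_held)
        {
            return already_held ? -EBUSY : 0;
        }

        static int swapon_claim(int already_held)
        {
            int err = try_claim_blockdev(already_held);

            if (err)
                return -EINVAL; /* held by someone else: not a valid swap device */
            return 0;
        }

        int main(void)
        {
            printf("free device:    %d\n", swapon_claim(0));  /* 0       */
            printf("mounted device: %d\n", swapon_claim(1));  /* -EINVAL */
            return 0;
        }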
     

05 Sep, 2005

12 commits

  • The idea of a swap_device_lock per device, and a swap_list_lock over them all,
    is appealing; but in practice almost every holder of swap_device_lock must
    already hold swap_list_lock, which defeats the purpose of the split.

    The only exceptions have been swap_duplicate, valid_swaphandles and an
    untrodden path in try_to_unuse (plus a few places added in this series).
    valid_swaphandles doesn't show up high in profiles, but swap_duplicate does
    demand attention. However, with the hold time in get_swap_pages so much
    reduced, I've not yet found a load and set of swap device priorities to show
    even swap_duplicate benefitting from the split. Certainly the split is mere
    overhead in the common case of a single swap device.

    So, replace swap_list_lock and swap_device_lock by spinlock_t swap_lock
    (generally we seem to prefer an _ in the name, and not hide in a macro).

    If someone can show a regression in swap_duplicate, then probably we should
    add a hashlock for the swap_map entries alone (shorts not being atomic), so as
    to help the case of the single swap device too.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
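
    The fallback floated above - a hashlock for the swap_map entries - might
    look roughly like this in userspace terms: a small array of locks indexed
    by entry offset, so per-entry count updates rarely contend. Sizes and
    names are arbitrary; this is only the shape of the idea.

        #include <pthread.h>
        #include <stdio.h>

        #define MAP_ENTRIES 4096
        #define HASH_LOCKS  64              /* must be a power of two here */

        static unsigned short swap_map[MAP_ENTRIES];   /* per-entry use counts */
        static pthread_mutex_t map_lock[HASH_LOCKS];

        static pthread_mutex_t *lock_for(unsigned long offset)
        {
            return &map_lock[offset & (HASH_LOCKS - 1)];
        }

        /* Bump an entry's count under its hashed lock (shorts not being atomic). */
        static void duplicate_entry(unsigned long offset)
        {
            pthread_mutex_t *l = lock_for(offset);

            pthread_mutex_lock(l);
            swap_map[offset]++;
            pthread_mutex_unlock(l);
        }

        int main(void)
        {
            for (int i = 0; i < HASH_LOCKS; i++)
                pthread_mutex_init(&map_lock[i], NULL);

            duplicate_entry(7);
            duplicate_entry(7);
            printf("swap_map[7] = %u\n", swap_map[7]);
            return 0;
        }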
     
  • The get_swap_page/scan_swap_map latency can be so bad that even those without
    preemption configured deserve relief: periodically cond_resched.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
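
    The relief described above is just a periodic yield inside the long scan.
    cond_resched() is the kernel call named in the entry; sched_yield() stands
    in here so the fragment compiles in userspace, and the interval and the
    "candidate" test are made up.

        #include <sched.h>
        #include <stdio.h>

        #define YIELD_INTERVAL 256          /* arbitrary */

        int main(void)
        {
            unsigned long found = 0;

            for (unsigned long i = 0; i < 1000000; i++) {
                if ((i & (YIELD_INTERVAL - 1)) == 0)
                    sched_yield();          /* kernel code would cond_resched() */
                if ((i % 4097) == 0)        /* stand-in for "slot looks usable" */
                    found++;
            }
            printf("found %lu candidates\n", found);
            return 0;
        }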
     
  • get_swap_page has often shown up on latency traces, doing lengthy scans while
    holding two spinlocks. swap_list_lock is already dropped; now scan_swap_map
    drops swap_device_lock before scanning the swap_map.

    While scanning for an empty cluster, don't worry that racing tasks may
    allocate what was free and free what was allocated; but when allocating an
    entry, check it's still free after retaking the lock. Avoid dropping the lock
    in the expected common path. No barriers beyond the locks, just let the
    cookie crumble; highest_bit limit is volatile, but benign.

    Guard against swapoff: must check SWP_WRITEOK before allocating, and must
    raise the SWP_SCANNING reference count while in scan_swap_map; swapoff waits
    for that to fall - just use schedule_timeout, since we don't want to burden
    scan_swap_map itself, and it's very unlikely that anyone can really still be
    in scan_swap_map once swapoff gets this far.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
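
    The locking shape described above - scan with the lock dropped, then
    recheck under the lock before claiming - sketched with a single mutex and
    an array of slots. The SWP_WRITEOK and SWP_SCANNING guards are omitted;
    everything here is an illustrative stand-in.

        #include <pthread.h>
        #include <stdio.h>

        #define SLOTS 1024

        static unsigned short map[SLOTS];   /* 0 means free */
        static pthread_mutex_t swap_lock = PTHREAD_MUTEX_INITIALIZER;

        /* Called with swap_lock held; returns with it held, but drops it for
         * the scan itself.  Returns a claimed slot index, or -1 if none. */
        static int scan_for_slot(void)
        {
            for (;;) {
                int candidate = -1;

                pthread_mutex_unlock(&swap_lock);   /* scan without the lock */
                for (int i = 0; i < SLOTS; i++) {
                    if (map[i] == 0) {              /* racy read, only a hint */
                        candidate = i;
                        break;
                    }
                }
                pthread_mutex_lock(&swap_lock);

                if (candidate < 0)
                    return -1;
                if (map[candidate] == 0) {          /* still free? then take it */
                    map[candidate] = 1;
                    return candidate;
                }
                /* Someone grabbed it while we were unlocked: scan again. */
            }
        }

        int main(void)
        {
            pthread_mutex_lock(&swap_lock);
            int slot = scan_for_slot();
            pthread_mutex_unlock(&swap_lock);

            printf("claimed slot %d\n", slot);
            return 0;
        }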
     
  • Rewrite scan_swap_map to allocate in just the same way as before (taking the
    next free entry SWAPFILE_CLUSTER-1 times, then restarting at the lowest wholly
    empty cluster, falling back to lowest entry if none), but with a view towards
    dropping the lock in the next patch.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
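
    The allocation order restated above, reduced to a plain function: take the
    next free entry for a cluster's worth of allocations, then jump to the
    lowest wholly empty cluster, falling back to the lowest free entry if none
    exists. The map, cluster size and bookkeeping are all illustrative, not
    the kernel's scan_swap_map.

        #include <stdio.h>

        #define SLOTS   64
        #define CLUSTER 8                  /* stands in for SWAPFILE_CLUSTER */

        static unsigned short map[SLOTS];  /* 0 = free */
        static int cluster_left;           /* allocations left in current run */
        static int next_slot;              /* where the current run continues */

        static int lowest_empty_cluster(void)
        {
            for (int base = 0; base < SLOTS; base += CLUSTER) {
                int empty = 1;

                for (int i = base; i < base + CLUSTER; i++)
                    if (map[i]) { empty = 0; break; }
                if (empty)
                    return base;
            }
            return -1;
        }

        static int lowest_free_slot(void)
        {
            for (int i = 0; i < SLOTS; i++)
                if (map[i] == 0)
                    return i;
            return -1;
        }

        static int alloc_slot(void)
        {
            if (cluster_left == 0) {
                int base = lowest_empty_cluster();

                next_slot = (base >= 0) ? base : lowest_free_slot();
                cluster_left = CLUSTER;
            }
            if (next_slot < 0)
                return -1;

            while (next_slot < SLOTS && map[next_slot])
                next_slot++;               /* skip entries already in use */
            if (next_slot >= SLOTS) {
                cluster_left = 0;
                return -1;
            }

            map[next_slot] = 1;
            cluster_left--;
            return next_slot++;
        }

        int main(void)
        {
            map[0] = map[1] = 1;           /* pretend two entries are in use */
            for (int i = 0; i < 10; i++)
                printf("allocated slot %d\n", alloc_slot());
            return 0;
        }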
     
  • Rewrite get_swap_page to allocate in just the same sequence as before, but
    without holding swap_list_lock across its scan_swap_map. Decrement
    nr_swap_pages and update swap_list.next in advance, while still holding
    swap_list_lock. Skip full devices by testing highest_bit. Swapoff now holds
    swap_device_lock as well as swap_list_lock to clear SWP_WRITEOK. This reduces
    lock contention when there are parallel swap devices of the same priority.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
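
    The reservation shape described above, in rough userspace terms: account
    for the page under the list lock (decrement the free count, advance the
    rotor), then run the per-device scan without that lock, refunding on
    failure. The scan and all names here are placeholders, not get_swap_page.

        #include <pthread.h>
        #include <stdio.h>

        static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
        static long nr_free_pages = 100;
        static int next_device;

        static int scan_device(int dev)     /* placeholder per-device scan */
        {
            return dev * 1000 + 1;          /* pretend it found an entry */
        }

        static int get_entry(void)
        {
            int dev, entry;

            pthread_mutex_lock(&list_lock);
            if (nr_free_pages <= 0) {
                pthread_mutex_unlock(&list_lock);
                return -1;
            }
            nr_free_pages--;                /* reserve before scanning */
            dev = next_device;
            next_device = (next_device + 1) % 2;   /* rotate two devices */
            pthread_mutex_unlock(&list_lock);

            entry = scan_device(dev);       /* no list lock across the scan */
            if (entry < 0) {
                pthread_mutex_lock(&list_lock);
                nr_free_pages++;            /* refund the reservation */
                pthread_mutex_unlock(&list_lock);
            }
            return entry;
        }

        int main(void)
        {
            printf("entry %d\n", get_entry());
            printf("entry %d\n", get_entry());
            return 0;
        }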
     
    This makes a negligible difference in practice: but swap_list.next should not
    be updated to a higher prio in the general helper swap_info_get, but rather
    in swap_entry_free; and then only when the entry is actually freed.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The swap header's unsigned int last_page determines the range of swap pages,
    but swap_info has been using int or unsigned long in some cases: use unsigned
    int throughout (except, in several places a local unsigned long is useful to
    avoid overflows when adding).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
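
    Why a local unsigned long helps when adding, as noted above: the sum of
    two values that each fit in unsigned int can wrap if computed in 32-bit
    arithmetic, while widening first keeps the full result. This assumes
    unsigned long is wider than unsigned int (as on LP64 targets); the values
    are only there to show the wrap.

        #include <stdio.h>

        int main(void)
        {
            unsigned int base  = 0xFFFF0000u;   /* near the top of the range */
            unsigned int extra = 0x00020000u;

            unsigned int  wrapped = base + extra;                /* wraps     */
            unsigned long widened = (unsigned long)base + extra; /* full sum  */

            printf("32-bit sum:  %#x\n", wrapped);
            printf("widened sum: %#lx\n", widened);
            return 0;
        }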
     
  • The "Adding %dk swap" message shows the number of swap extents, as a guide to
    how fragmented the swapfile may be. But a useful further guide is the total
    extent they span across (sometimes scarily large).

    And there's no need to keep nr_extents in swap_info: it's unused after the
    initial message, so save a little space by keeping it on stack.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • There are several comments that swap's extent_list.prev points to the lowest
    extent: that's not so, it's extent_list.next which points to it, as you'd
    expect. And a couple of loops in add_swap_extent go all the way through the
    list, when they should just add at the other end.

    Fix those up, and let map_swap_page search the list forwards: profiles show
    it to be twice as quick that way - perhaps because prefetch works better on
    how the structs are typically kmalloc'ed, or because usually more is written
    to than read from swap, and swap is allocated ascendingly.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
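
    A minimal sketch of the orientation point above: keep extents sorted so a
    forward walk starts from the lowest, append new (higher) extents at the
    tail rather than walking the whole list, and search forwards. A plain
    singly linked list stands in for the kernel's extent_list; the fields are
    simplified.

        #include <stdio.h>
        #include <stdlib.h>

        struct extent {
            unsigned long start_page;       /* first swap page covered  */
            unsigned long nr_pages;         /* how many pages it covers */
            struct extent *next;
        };

        static struct extent *head, *tail;

        /* Extents arrive in ascending order, so just append at the tail. */
        static void add_extent(unsigned long start, unsigned long nr)
        {
            struct extent *e = malloc(sizeof(*e));

            e->start_page = start;
            e->nr_pages = nr;
            e->next = NULL;
            if (tail)
                tail->next = e;
            else
                head = e;
            tail = e;
        }

        /* Forward search: the first extent reached is the lowest. */
        static struct extent *find_extent(unsigned long page)
        {
            for (struct extent *e = head; e; e = e->next)
                if (page >= e->start_page && page < e->start_page + e->nr_pages)
                    return e;
            return NULL;
        }

        int main(void)
        {
            add_extent(0, 100);
            add_extent(100, 50);

            struct extent *e = find_extent(120);
            printf("page 120 is in the extent starting at page %lu\n",
                   e ? e->start_page : 0);
            return 0;
        }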
     
  • sys_swapon's call to destroy_swap_extents on failure is made after the final
    swap_list_unlock, which is faintly unsafe: another sys_swapon might already be
    setting up that swap_info_struct. Calling it earlier, before taking
    swap_list_lock, is safe. sys_swapoff's call to destroy_swap_extents was safe,
    but likewise move it earlier, before taking the locks (once try_to_unuse has
    completed, nothing can be needing the swap extents).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • If a regular swapfile lies on a filesystem whose blocksize is less than
    PAGE_SIZE, then setup_swap_extents may have to cut the number of usable swap
    pages; but sys_swapon's nr_good_pages was not expecting that. Also,
    setup_swap_extents takes no account of badpages listed in the swap header: not
    worth doing so, but ensure nr_badpages is 0 for a regular swapfile.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Update swap extents comment: nowadays we guard with S_SWAPFILE not i_sem.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

22 Jun, 2005

1 commit

    Remember that ironic get_user_pages race, when the raised page_count on a
    swapped-out page led do_wp_page to decide that it had to copy on write, and
    so substitute a different page into userspace? 2.6.7 onwards have Andrea's
    solution, where try_to_unmap_one backs out if it finds page_count raised.

    Which works, but is unsatisfying (rmap.c has no other page_count heuristics),
    and was found a few months ago to hang an intensive page migration test. A
    year ago I was hesitant to engage page_mapcount; now it seems the right fix.

    So remove the page_count hack from try_to_unmap_one; and use activate_page in
    unuse_mm when dropping lock, to replace its secondary effect of helping
    swapoff to make progress in that case.

    Simplify can_share_swap_page (now called only on anonymous pages) to check
    page_mapcount + page_swapcount == 1: still needs the page lock to stabilize
    their (pessimistic) sum, but does not need swapper_space.tree_lock for that.

    In do_swap_page, move swap_free and unlock_page below page_add_anon_rmap, to
    keep the sum on the high side, and correct by the time can_share_swap_page
    is called.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
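
    The check described above, in rough shape: a locked anonymous page can be
    reused in place only when its map count plus its swap count is exactly
    one, and the page lock is what keeps that (pessimistic) sum stable. The
    struct and fields below are made up for illustration; they are not the
    kernel's struct page or can_share_swap_page.

        #include <stdbool.h>
        #include <stdio.h>

        struct fake_page {
            bool locked;        /* stands in for PageLocked()     */
            int  mapcount;      /* stands in for page_mapcount()  */
            int  swapcount;     /* stands in for page_swapcount() */
        };

        static bool can_reuse_in_place(const struct fake_page *page)
        {
            if (!page->locked)  /* the sum is only meaningful under the lock */
                return false;
            return page->mapcount + page->swapcount == 1;
        }

        int main(void)
        {
            struct fake_page sole_user     = { true, 1, 0 };
            struct fake_page still_swapped = { true, 1, 1 };

            printf("sole user:     %d\n", can_reuse_in_place(&sole_user));
            printf("still on swap: %d\n", can_reuse_in_place(&still_swapped));
            return 0;
        }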
     

17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds