24 Sep, 2009

1 commit

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
    HWPOISON: Enable error_remove_page on btrfs
    HWPOISON: Add simple debugfs interface to inject hwpoison on arbitrary PFNs
    HWPOISON: Add madvise() based injector for hardware poisoned pages v4
    HWPOISON: Enable error_remove_page for NFS
    HWPOISON: Enable .remove_error_page for migration aware file systems
    HWPOISON: The high level memory error handler in the VM v7
    HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
    HWPOISON: shmem: call set_page_dirty() with locked page
    HWPOISON: Define a new error_remove_page address space op for async truncation
    HWPOISON: Add invalidate_inode_page
    HWPOISON: Refactor truncate to allow direct truncating of page v2
    HWPOISON: check and isolate corrupted free pages v2
    HWPOISON: Handle hardware poisoned pages in try_to_unmap
    HWPOISON: Use bitmask/action code for try_to_unmap behaviour
    HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
    HWPOISON: Add poison check to page fault handling
    HWPOISON: Add basic support for poisoned pages in fault handler v3
    HWPOISON: Add new SIGBUS error codes for hardware poison signals
    HWPOISON: Add support for poison swap entries v2
    HWPOISON: Export some rmap vma locking to outside world
    ...

    Linus Torvalds
     

22 Sep, 2009

1 commit

  • Just as the swapoff system call allocates many pages of RAM to various
    processes, perhaps triggering OOM, so "echo 2 >/sys/kernel/mm/ksm/run"
    (unmerge) is liable to allocate many pages of RAM to various processes,
    perhaps triggering OOM; and each is normally run from a modest admin
    process (swapoff or shell), easily repeated until it succeeds.

    So treat unmerge_and_remove_all_rmap_items() in the same way that we treat
    try_to_unuse(): generalize PF_SWAPOFF to PF_OOM_ORIGIN, and bracket both
    with that, to ask the OOM killer to kill them first, to prevent them from
    spawning more and more OOM kills.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
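
    The bracketing described above is easy to picture with a small
    self-contained C model; the names and the victim-selection logic below
    are illustrative only, not the kernel's oom-killer code.

        /* Toy model: a task marks itself as the origin of the memory
         * pressure, and victim selection prefers such tasks. */
        #include <stdio.h>

        #define PF_OOM_ORIGIN 0x1   /* this task caused the pressure */

        struct task {
            const char *comm;
            unsigned int flags;
            unsigned long rss_pages;
        };

        static struct task *select_oom_victim(struct task *tasks, int n)
        {
            struct task *victim = NULL;
            for (int i = 0; i < n; i++) {
                if (tasks[i].flags & PF_OOM_ORIGIN)
                    return &tasks[i];            /* kill the origin first   */
                if (!victim || tasks[i].rss_pages > victim->rss_pages)
                    victim = &tasks[i];          /* otherwise: biggest user */
            }
            return victim;
        }

        int main(void)
        {
            struct task tasks[] = {
                { "firefox", 0, 200000 },
                { "swapoff", 0, 100 },
            };

            tasks[1].flags |= PF_OOM_ORIGIN;     /* entering try_to_unuse()  */
            printf("victim: %s\n", select_oom_victim(tasks, 2)->comm);
            tasks[1].flags &= ~PF_OOM_ORIGIN;    /* heavy allocation is done */
            printf("victim: %s\n", select_oom_victim(tasks, 2)->comm);
            return 0;
        }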
     

16 Sep, 2009

1 commit

  • Memory migration uses special swap entry types to trigger special actions on
    page faults. Extend this mechanism to also support poisoned swap entries, to
    trigger poison handling on page faults. This allows follow-on patches to
    prevent processes from faulting in poisoned pages again.

    v2: Fix overflow in MAX_SWAPFILES (Fengguang Wu)
    v3: Better overflow fix (Hidehiro Kawai)

    Signed-off-by: Andi Kleen

    Andi Kleen
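
    A self-contained C sketch of the special swap-entry idea above; the bit
    layout and type values are illustrative (a 64-bit build is assumed), not
    the kernel's swp_entry_t encoding.

        #include <assert.h>
        #include <stdio.h>

        #define SWP_TYPE_BITS  5                           /* MAX_SWAPFILES_SHIFT */
        #define SWP_MIGRATION  ((1u << SWP_TYPE_BITS) - 2)  /* reserved type      */
        #define SWP_HWPOISON   ((1u << SWP_TYPE_BITS) - 1)  /* reserved type      */

        typedef struct { unsigned long long val; } swp_entry;

        static swp_entry make_entry(unsigned type, unsigned long long offset)
        {
            return (swp_entry){ ((unsigned long long)type << 58) | offset };
        }
        static unsigned entry_type(swp_entry e)   { return (unsigned)(e.val >> 58); }
        static int is_hwpoison_entry(swp_entry e) { return entry_type(e) == SWP_HWPOISON; }

        /* What a fault handler would do when it finds such a non-present pte. */
        static const char *handle_fault(swp_entry e)
        {
            if (is_hwpoison_entry(e))
                return "SIGBUS: the page was hardware poisoned";
            if (entry_type(e) == SWP_MIGRATION)
                return "wait for migration to finish, then retry";
            return "swap the page back in";
        }

        int main(void)
        {
            assert(SWP_HWPOISON < (1u << SWP_TYPE_BITS));
            printf("%s\n", handle_fault(make_entry(SWP_HWPOISON, 0)));
            return 0;
        }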
     

14 Sep, 2009

1 commit

  • blk_ioctl_discard duplicates large amounts of code from blkdev_issue_discard;
    the only difference between the two is that blkdev_issue_discard needs to
    send a barrier discard request and blk_ioctl_discard a non-barrier one,
    and blk_ioctl_discard needs to wait on the request. To facilitate this,
    add a flags argument to blkdev_issue_discard to control both aspects of the
    behaviour. This will be very useful later on for using the waiting
    functionality for other callers.

    Based on an earlier patch from Matthew Wilcox .

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
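
    A minimal C sketch of the refactoring above: one helper whose flags
    argument selects barrier vs. non-barrier and whether to wait. The names
    and flag values are hypothetical, not the block layer's real API.

        #include <stdio.h>

        #define DISCARD_FL_BARRIER  (1u << 0)  /* issue the discard as a barrier */
        #define DISCARD_FL_WAIT     (1u << 1)  /* wait for the request to finish */

        static int issue_discard(unsigned long start, unsigned long nr,
                                 unsigned int flags)
        {
            printf("discard [%lu, +%lu) barrier=%d\n",
                   start, nr, !!(flags & DISCARD_FL_BARRIER));
            if (flags & DISCARD_FL_WAIT) {
                /* ... block until the device acknowledges the request ... */
            }
            return 0;
        }

        int main(void)
        {
            /* filesystem caller: barrier discard, fire and forget */
            issue_discard(0, 256, DISCARD_FL_BARRIER);
            /* ioctl caller: non-barrier discard, but wait for completion */
            issue_discard(0, 256, DISCARD_FL_WAIT);
            return 0;
        }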
     

30 Jul, 2009

1 commit

  • Create bdgrab(). This function copies an existing reference to a
    block_device. It is safe to call from any context.

    Hibernation code wishes to copy a reference to the active swap device.
    Right now it calls bdget() under a spinlock, but this is wrong because
    bdget() can sleep. It doesn't need a full bdget() because we already
    hold a reference to active swap devices (and the spinlock protects
    against swapoff).

    Fixes http://bugzilla.kernel.org/show_bug.cgi?id=13827

    Signed-off-by: Alan Jenkins
    Signed-off-by: Rafael J. Wysocki

    Alan Jenkins
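
    A user-space C model of the distinction above: copying an already-held
    reference is a plain refcount increment and never sleeps, so it is safe
    under a spinlock, while a full lookup may allocate and therefore may
    sleep. All names are illustrative.

        #include <stdatomic.h>
        #include <stdio.h>
        #include <stdlib.h>

        struct blockdev {
            atomic_int refcount;
            const char *name;
        };

        /* May sleep: allocates a new object if none exists yet. */
        static struct blockdev *bdev_lookup(const char *name)
        {
            struct blockdev *bd = malloc(sizeof(*bd));  /* may block */
            if (!bd)
                return NULL;
            atomic_init(&bd->refcount, 1);
            bd->name = name;
            return bd;
        }

        /* Never sleeps: the caller already holds a reference, just copy it. */
        static struct blockdev *bdev_grab(struct blockdev *bd)
        {
            atomic_fetch_add(&bd->refcount, 1);
            return bd;
        }

        int main(void)
        {
            struct blockdev *swapdev = bdev_lookup("sda2"); /* at swapon time */
            if (!swapdev)
                return 1;
            /* spin_lock(&swap_lock): only bdev_grab() is safe in here */
            struct blockdev *ref = bdev_grab(swapdev);
            /* spin_unlock(&swap_lock) */
            printf("%s refcount=%d\n", ref->name, atomic_load(&ref->refcount));
            free(swapdev);
            return 0;
        }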
     

19 Jun, 2009

1 commit

  • This patch fixes mis-accounting of swap usage in memcg.

    In the current implementation, memcg's swap account is uncharged only when
    the swap entry is completely freed. But there are several cases where swap
    cannot be freed cleanly. To handle them, this patch changes memcg to
    uncharge the swap account as soon as the swap entry has no references
    other than the swap cache.

    With this, memcg's swap entry accounting can be fully synchronous with the
    application's behavior.

    This patch also changes memcg's hooks for swap-out.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Acked-by: Balbir Singh
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Li Zefan
    Cc: Dhaval Giani
    Cc: YAMAMOTO Takashi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

17 Jun, 2009

3 commits

  • Presently we can tell from swap_map that a swap entry is used only as
    SwapCache, without looking up the swap cache.

    This gives us a chance to reuse swap-cache-only swap entries in
    get_swap_pages().

    This patch tries to free swap-cache-only swap entries when free swap is
    running short.

    Note: we hit this path when the swap_cluster code cannot find a free
    cluster; in that case vm_swap_full() is no longer the only condition that
    allows the kernel to reclaim unused swap.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Li Zefan
    Cc: Dhaval Giani
    Cc: YAMAMOTO Takashi
    Tested-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
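
    A toy C sketch of the reclaim idea above, assuming the convention noted
    in the next entry that a swap-map count of 1 plus a page in the swap
    cache means the slot is cache-only; the values and names are
    illustrative.

        #include <stdio.h>

        #define NR_SWAP_SLOTS 8

        /* per-slot reference counts and "is there a swap cache page?" flags */
        static unsigned char swap_map[NR_SWAP_SLOTS] = { 0, 1, 3, 1, 0, 1, 2, 1 };
        static int in_swap_cache[NR_SWAP_SLOTS]      = { 0, 1, 1, 0, 0, 1, 0, 1 };

        static int reclaim_cache_only_entries(void)
        {
            int freed = 0;
            for (int i = 0; i < NR_SWAP_SLOTS; i++) {
                /* only the swap cache references this slot: safe to drop */
                if (swap_map[i] == 1 && in_swap_cache[i]) {
                    in_swap_cache[i] = 0;
                    swap_map[i] = 0;
                    freed++;
                }
            }
            return freed;
        }

        int main(void)
        {
            /* called when get_swap_pages() finds free swap running short */
            printf("reclaimed %d cache-only slots\n", reclaim_cache_only_entries());
            return 0;
        }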
     
  • This is part of the patches fixing memcg's swap accounting leak.
    But, IMHO, it is not a bad patch even without memcg.

    There are 2 kinds of references to swap:
    - reference from a swap entry
    - reference from the swap cache

    Then,

    - if there is a swap cache and swap's refcnt is 1, only the swap cache holds it.
    (*) swapcount(entry) == 1 && find_get_page(swapper_space, entry) != NULL

    This counting logic has worked well for a long time. But considering
    that swap_map[] cannot tell whether a _real_ reference exists or not,
    the current use of the counter is not very good.

    This patch adds a flag, SWAP_HAS_CACHE, and records whether a swap
    entry has a cache or not. This removes the -1 magic used in swapfile.c
    and helps avoid unnecessary find_get_page() calls.

    Signed-off-by: KAMEZAWA Hiroyuki
    Tested-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Li Zefan
    Cc: Dhaval Giani
    Cc: YAMAMOTO Takashi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
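
    A small C model of the SWAP_HAS_CACHE encoding above; the bit layout is
    illustrative, not the kernel's swap_map[] format.

        #include <assert.h>
        #include <stdio.h>

        #define SWAP_HAS_CACHE  0x80u   /* top bit: a swap cache page exists */
        #define COUNT_MASK      0x7Fu   /* low bits: real references         */

        static unsigned char entry;     /* one slot of a swap map */

        static void swap_duplicate(void)  { entry++; }  /* count stays < 127 */
        static void swap_free_ref(void)   { entry--; }
        static void set_has_cache(void)   { entry |= SWAP_HAS_CACHE; }
        static void clear_has_cache(void) { entry &= (unsigned char)~SWAP_HAS_CACHE; }

        static unsigned swap_count(void)  { return entry & COUNT_MASK; }
        static int cache_only(void)
        {
            /* no -1 magic, no find_get_page(): the flag says it directly */
            return swap_count() == 0 && (entry & SWAP_HAS_CACHE);
        }

        int main(void)
        {
            swap_duplicate();        /* a pte references the entry          */
            set_has_cache();         /* and the page sits in the swap cache */
            assert(!cache_only());

            swap_free_ref();         /* the pte goes away (e.g. unmap)      */
            assert(cache_only());    /* only the cache holds it now         */

            clear_has_cache();
            printf("slot fully free: %d\n", entry == 0);
            return 0;
        }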
     
  • In a following patch, swap cache usage is recorded in swap_map.
    This patch makes the interface changes necessary for that.

    2 interfaces:

    - swapcache_prepare()
    - swapcache_free()

    are added to take/release a swap-cache reference on existing swap
    entries. The implementation itself is not changed by this patch. While
    adding swapcache_free(), memcg's hook code is moved under
    swapcache_free(); this is better than using scattered hooks.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Acked-by: Balbir Singh
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Li Zefan
    Cc: Dhaval Giani
    Cc: YAMAMOTO Takashi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

22 Feb, 2009

1 commit

  • http://bugzilla.kernel.org/show_bug.cgi?id=12239

    The image writing code dropped a reference to the current swap device.
    This doesn't show up if the hibernation succeeds - because it doesn't
    affect the image which gets resumed. But it means multiple _failed_
    hibernations end up freeing the swap device while it is still in use!

    swsusp_write() finds the block device for the swap file using swap_type_of().
    It then uses blkdev_get() / blkdev_put() to open and close the block device.

    Unfortunately, blkdev_get() assumes ownership of the inode of the block_device
    passed to it. So blkdev_put() calls iput() on the inode. This is by design
    and other callers expect this behaviour. The fix is for swap_type_of() to take
    a reference on the inode using bdget().

    Signed-off-by: Alan Jenkins
    Signed-off-by: Rafael J. Wysocki
    Cc: Len Brown
    Cc: Greg KH
    Signed-off-by: Linus Torvalds

    Alan Jenkins
     

30 Jan, 2009

1 commit

  • Currently, at swapoff, commit is executed even when try_charge() fails.
    This is a bug which makes the refcnt of cgroup_subsys_state go negative.

    Reported-by: Li Zefan
    Tested-by: Li Zefan
    Tested-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

14 Jan, 2009

1 commit


09 Jan, 2009

5 commits

  • My patch, memcg-fix-gfp_mask-of-callers-of-charge.patch, changed the
    gfp_mask of charge callers to GFP_HIGHUSER_MOVABLE to show what will
    happen at memory reclaim.

    But in recent discussion it was NACKed because it looks ugly.

    This patch reverts it and adds some cleanup to the gfp_mask of charge
    callers. There is no behavior change, but it needs review before it
    generates rejected hunks deeper in the queue.

    This patch also adds explanation to meaning of gfp_mask passed to charge
    functions in memcontrol.h.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch implements a per-cgroup limit on memory+swap usage. When a page
    is in SwapCache, double counting of the swap cache and the swap entry is
    avoided.

    The Mem+Swap controller works as follows.
    - memory usage is limited by memory.limit_in_bytes.
    - memory + swap usage is limited by memory.memsw_limit_in_bytes.

    This has the following benefit.
    - A user can limit the total resource usage of mem+swap.

    Without this, because the memory resource controller does not account for
    swap usage, a process can exhaust all the swap (through a memory leak).
    We can avoid that case.

    Also, swap is a shared resource, but it cannot be reclaimed (returned to
    memory) until it is actually used. This characteristic can cause trouble
    when memory is divided into parts by cpuset or memcg.
    Assume group A and group B.
    After some applications run, the system can end up as:

    Group A -- very large free memory space but occupies 99% of swap.
    Group B -- under memory shortage but cannot use swap... it's nearly full.

    The ability to set an appropriate swap limit for each group is required.

    Some may wonder "why mem+swap rather than just swap?"

    - The global LRU (kswapd) can swap out arbitrary pages. Swap-out means
    moving the account from memory to swap... there is no change in the usage
    of mem+swap.

    In other words, when we want to limit swap usage without affecting the
    global LRU, a mem+swap limit is better than limiting swap alone.

    Accounting target information is stored in swap_cgroup, a per-swap-entry
    record.

    Charging is done as follows.
    map
    - charge page and memsw.

    unmap
    - uncharge page/memsw if not SwapCache.

    swap-out (__delete_from_swap_cache)
    - uncharge page
    - record mem_cgroup information in swap_cgroup.

    swap-in (do_swap_page)
    - charged as page and memsw.
    The record in swap_cgroup is cleared.
    memsw accounting is decremented.

    swap-free (swap_free())
    - if the swap entry is freed, memsw is uncharged by PAGE_SIZE.

    Some people work in never-swap environments and consider swap to be
    something bad. For them, this mem+swap controller extension is just
    overhead. That overhead can be avoided by a config or boot option
    (see Kconfig; details are not in this patch).

    TODO:
    - maybe more optimization can be done in the swap-in path (but it is not
    very safe). For now we just do simple accounting at this stage.

    [nishimura@mxp.nes.nec.co.jp: make resize limit hold mutex]
    [hugh@veritas.com: memswap controller core swapcache fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
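
    A toy C model of the charge points listed above, simplified to two
    counters so that memsw is literally mem + swap; it only illustrates why
    swap-out and swap-in leave memsw unchanged, and is not the memcg
    implementation.

        #include <stdio.h>

        #define PAGE_SIZE 4096L
        static long mem, swp;                        /* bytes charged per group */
        static long memsw(void) { return mem + swp; }

        static void map_new_page(void)   { mem += PAGE_SIZE; }
        static void unmap_page(void)     { mem -= PAGE_SIZE; }
        static void swap_out(void)       { mem -= PAGE_SIZE; swp += PAGE_SIZE; }
        static void swap_in(void)        { mem += PAGE_SIZE; swp -= PAGE_SIZE; }
        static void swap_slot_free(void) { swp -= PAGE_SIZE; }

        static void show(const char *when)
        {
            printf("%-12s mem=%ld swp=%ld memsw=%ld\n", when, mem, swp, memsw());
        }

        int main(void)
        {
            /* page swapped out and later touched again */
            map_new_page();  show("mapped");
            swap_out();      show("swapped out");  /* memsw does not move */
            swap_in();       show("swapped in");   /* memsw does not move */
            unmap_page();    show("unmapped");

            /* page swapped out, then the owner exits without touching it */
            map_new_page();
            swap_out();
            swap_slot_free(); show("slot freed");  /* memsw finally drops */
            return 0;
        }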
     
  • For accounting swap, we need a record per swap entry, at least.

    This patch adds following function.
    - swap_cgroup_swapon() .... called from swapon
    - swap_cgroup_swapoff() ... called at the end of swapoff.

    - swap_cgroup_record() .... record information of swap entry.
    - swap_cgroup_lookup() .... lookup information of swap entry.

    This patch just implements "how to record information"; there is no actual
    method for limiting swap usage yet. These routines use a flat table for
    record and lookup. A "wise" lookup system like a radix tree requires memory
    allocation when adding new records, but swap-out is usually called under
    memory shortage (or when memcg hits its limit), so I used static
    allocation. (Dynamic allocation may not be very hard, but it adds an extra
    memory allocation to the memory-shortage path.)

    Note1: here we use a pointer to record information, which means
    8 bytes per swap entry. I think we can reduce this when we
    create a "cgroup id" in the range 0-65535 or 0-255.

    Reported-by: Daisuke Nishimura
    Reviewed-by: Daisuke Nishimura
    Tested-by: Daisuke Nishimura
    Reported-by: Hugh Dickins
    Reported-by: Balbir Singh
    Reported-by: Andrew Morton
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Pavel Emelianov
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
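
    A self-contained C sketch of the flat record/lookup table above; the
    function names follow the commit, but the implementation is illustrative,
    not the kernel's swap_cgroup code.

        #include <stdio.h>
        #include <stdlib.h>

        struct mem_cgroup;                             /* opaque owner record */

        static struct mem_cgroup **swap_cgroup_table;  /* one slot per swap entry */
        static unsigned long       swap_cgroup_slots;

        /* Allocate the whole table up front, at swapon time. */
        static int swap_cgroup_swapon(unsigned long max_pages)
        {
            swap_cgroup_table = calloc(max_pages, sizeof(*swap_cgroup_table));
            if (!swap_cgroup_table)
                return -1;
            swap_cgroup_slots = max_pages;
            return 0;
        }

        static void swap_cgroup_swapoff(void)
        {
            free(swap_cgroup_table);
            swap_cgroup_table = NULL;
            swap_cgroup_slots = 0;
        }

        /* Recording at swap-out never allocates: memory may already be short. */
        static void swap_cgroup_record(unsigned long off, struct mem_cgroup *memcg)
        {
            if (off < swap_cgroup_slots)
                swap_cgroup_table[off] = memcg;
        }

        static struct mem_cgroup *swap_cgroup_lookup(unsigned long off)
        {
            return off < swap_cgroup_slots ? swap_cgroup_table[off] : NULL;
        }

        int main(void)
        {
            static struct { int dummy; } owner;        /* stand-in cgroup */
            if (swap_cgroup_swapon(1024) != 0)
                return 1;
            swap_cgroup_record(7, (struct mem_cgroup *)&owner);
            printf("slot 7 owner: %p\n", (void *)swap_cgroup_lookup(7));
            swap_cgroup_swapoff();
            return 0;
        }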
     
  • Fix misuse of GFP_KERNEL.

    Now, most callers of the mem_cgroup_charge_xxx functions use GFP_KERNEL.

    I think this comes from the fact that page_cgroup *was* dynamically
    allocated.

    But now we allocate all page_cgroup at boot, and
    mem_cgroup_try_to_free_pages() reclaims memory with GFP_HIGHUSER_MOVABLE +
    the specified GFP_RECLAIM_MASK.

    * This is because we just want to reduce memory usage.
    "Where should we reclaim from?" is not a problem in memcg.

    This patch modifies gfp masks to be GFP_HIGHUSER_MOVABLE where possible.

    Note: This patch is not for fixing behavior but for showing sane information
    in source code.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • There is a small race in do_swap_page(). When the swapped-in page is
    charged, its mapcount can be greater than 0. But at the same time some
    process (sharing it) calls unmap, the mapcount goes 1->0, and the page is
    uncharged.

    CPUA                              CPUB
                                      mapcount == 1.
    (1) charge if mapcount==0         zap_pte_range()
                                      (2) mapcount 1 => 0.
                                      (3) uncharge(). (success)
    (4) set page's rmap()
        mapcount 0 => 1

    Then, this swap page's account is leaked.

    To fix this, I added a new interface.
    - charge
      account PAGE_SIZE to the res_counter and try to free pages if necessary.
    - commit
      register the page_cgroup and add it to the LRU if necessary.
    - cancel
      uncharge PAGE_SIZE because do_swap_page() failed.

    CPUA
    (1) charge (always)
    (2) set page's rmap (mapcount > 0)
    (3) commit the charge (checking whether it was necessary) after set_pte().

    This protocol uses the PCG_USED bit on page_cgroup to avoid over-accounting.
    The usual mem_cgroup_charge_common() does charge -> commit in one step.

    This patch also adds the following functions to clarify all charges.

    - mem_cgroup_newpage_charge() .... replacement for mem_cgroup_charge(),
      called against newly allocated anon pages.

    - mem_cgroup_charge_migrate_fixup()
      called only from remove_migration_ptes().
      We'll have to rewrite this later (this patch just keeps the old behavior).
      This function will be removed by an additional patch to make migration
      clearer.

    This is good for clarifying "what we do".

    We then have the following 4 charge points:
    - newpage
    - swap-in
    - add-to-cache
    - migration

    [akpm@linux-foundation.org: add missing inline directives to stubs]
    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
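
    A toy C model of the charge -> commit / cancel protocol above, using a
    USED marker to show why the provisional charge cannot be accounted
    twice; the names are illustrative, not the memcg code.

        #include <stdbool.h>
        #include <stdio.h>

        #define PAGE_SIZE 4096L
        static long charged;                /* res_counter usage, in bytes */

        struct page_rec { bool used; };     /* plays the role of PCG_USED  */

        static void charge(void) { charged += PAGE_SIZE; }  /* may reclaim */
        static void cancel(void) { charged -= PAGE_SIZE; }

        static void commit(struct page_rec *pc)
        {
            if (pc->used) {     /* someone else already accounted this page */
                cancel();       /* drop our provisional charge              */
                return;
            }
            pc->used = true;    /* register the page_cgroup, add to LRU ... */
        }

        int main(void)
        {
            struct page_rec pc = { .used = false };

            charge();           /* (1) charge, always                */
            /* (2) set the page's rmap, mapcount 0 -> 1 ...          */
            commit(&pc);        /* (3) commit after set_pte()        */

            printf("charged bytes: %ld, used=%d\n", charged, pc.used);
            return 0;
        }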
     

07 Jan, 2009

18 commits

  • page_queue_congested() was introduced in 2002, but it was never used.

    Signed-off-by: KOSAKI Motohiro
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Complete zap_pte_range()'s coverage of bad pagetable entries by calling
    print_bad_pte() on a pte_file in a linear vma and on a bad swap entry.
    That needs free_swap_and_cache() to tell it, which will also have shown
    one of those "swap_free" errors (but with much less information).

    Similar checks in fork's copy_one_pte()? No, that would be more noisy
    than helpful: we'll see them when parent and child exec or exit.

    Where do_nonlinear_fault() calls print_bad_pte(): omit !VM_CAN_NONLINEAR
    case, that could only be a bug in sys_remap_file_pages(), not a bad pte.
    VM_FAULT_OOM rather than VM_FAULT_SIGBUS? Well, okay, that is consistent
    with what happens if do_swap_page() operates a bad swap entry; but don't
    we have patches to be more careful about killing when VM_FAULT_OOM?

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove the srandom32((u32)get_seconds()) from non-rotational swapon:
    there's been a coincidental discussion of earlier randomization, assume
    that goes ahead, let swapon be a client rather than stirring for itself.

    Signed-off-by: Hugh Dickins
    Cc: David Woodhouse
    Cc: Donjun Shin
    Cc: James Bottomley
    Cc: Jens Axboe
    Cc: Joern Engel
    Cc: KAMEZAWA Hiroyuki
    Cc: Matthew Wilcox
    Cc: Nick Piggin
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Change pgoff_t nr_blocks in discard_swap() and discard_swap_cluster() to
    sector_t: given the constraints on swap offsets (in particular, the 5 bits
    of swap type accommodated in the same unsigned long), pgoff_t was actually
    safe as is, but it certainly looked worrying when shifted left.

    [akpm@linux-foundation.org: fix shift overflow]
    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Cc: David Woodhouse
    Cc: Jens Axboe
    Cc: Matthew Wilcox
    Cc: Joern Engel
    Cc: James Bottomley
    Cc: Donjun Shin
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Though attempting to find free clusters (Andrea), swap allocation has
    always restarted its searches from the beginning of the swap area (sct),
    to reduce seek times between swap pages, by not scattering them all over
    the partition.

    But on a solidstate swap device, seeks are cheap, and block remapping to
    level the wear may be limited by zones: in that case it's better to cycle
    around the whole partition.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Cc: David Woodhouse
    Cc: Jens Axboe
    Cc: Matthew Wilcox
    Cc: Joern Engel
    Cc: James Bottomley
    Cc: Donjun Shin
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Swap allocation has always started from the beginning of the swap area;
    but if we're dealing with a solidstate swap device which can only remap
    blocks within limited zones, that would sooner wear out the first zone.

    Therefore sys_swapon() tests whether blk_queue is non-rotational, and if so
    randomizes the cluster_next starting position for allocation.

    If blk_queue is nonrot, note SWP_SOLIDSTATE for later use, and report it
    with an "SS" at the right end of the kernel's "Adding ... swap" message
    (so that if it's both nonrot and discardable, "SSD" will be shown there).
    Perhaps something should be shown in /proc/swaps (swapon -s), but we have
    to be more cautious before making any addition to that format.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Cc: David Woodhouse
    Cc: Jens Axboe
    Cc: Matthew Wilcox
    Cc: Joern Engel
    Cc: James Bottomley
    Cc: Donjun Shin
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
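
    A short C sketch of the randomized starting position described above;
    the numbers and calls are illustrative, not sys_swapon()'s code.

        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        int main(void)
        {
            unsigned long max_pages = 1UL << 20;  /* pages in the swap area    */
            int nonrot = 1;                       /* queue says non-rotational */
            unsigned long cluster_next = 1;       /* default: after the header */

            if (nonrot) {
                srandom((unsigned)time(NULL));
                /* pick a random starting page, skipping header page 0 */
                cluster_next = 1 + (unsigned long)(random() % (long)(max_pages - 1));
            }
            printf("allocation will start at page %lu\n", cluster_next);
            return 0;
        }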
     
  • When scan_swap_map() finds a free cluster of swap pages to allocate,
    discard the old contents of the cluster if the device supports discard.
    But don't bother when swap is so fragmented that we allocate single pages.

    Be careful about racing allocations made while we're scanning for a
    cluster; and hold up allocations made while we're discarding.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Cc: David Woodhouse
    Cc: Jens Axboe
    Cc: Matthew Wilcox
    Cc: Joern Engel
    Cc: James Bottomley
    Cc: Donjun Shin
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When adding swap, all the old data on swap can be forgotten: sys_swapon()
    discards all but the header page of the swap partition (or every extent but
    the header of the swap file), to give a solidstate swap device the
    opportunity to optimize its wear-levelling.

    If that succeeds, note SWP_DISCARDABLE for later use, and report it with a
    "D" at the right end of the kernel's "Adding ... swap" message. Perhaps
    something should be shown in /proc/swaps (swapon -s), but we have to be
    more cautious before making any addition to that format.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Cc: David Woodhouse
    Cc: Jens Axboe
    Cc: Matthew Wilcox
    Cc: Joern Engel
    Cc: James Bottomley
    Cc: Donjun Shin
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Before making functional changes, rearrange scan_swap_map() to simplify
    subsequent diffs. Actually, there is one functional change in there:
    leave cluster_nr negative while scanning for a new cluster - resetting it
    early increased the likelihood that when we have difficulty finding a free
    cluster, another task may come in and try doing exactly the same - just a
    waste of cpu.

    Before making functional changes, rearrange struct swap_info_struct
    slightly: flags will be needed as an unsigned long (for wait_on_bit), next
    is a good int to pair with prio, old_block_size is uninteresting so shift
    it to the end.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The kernel has not supported v0 SWAP-SPACE since 2.5.22: I think we can
    now safely drop its "version 0 swap is no longer supported" message - just
    say "Unable to find swap-space signature" as usual. This removes one
    level of indentation from a stretch of sys_swapon().

    I'd have liked to be specific, saying "Unable to find SWAPSPACE2
    signature", but it's just too confusing that the version 1 signature shows
    the number 2.

    Irrelevant nearby cleanup: kmap(page) already gives page_address(page).

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove trailing whitespace from swapfile.c, and odd swap_show() alignment.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove the SWP_ACTIVE mask: it just obscures the SWP_WRITEOK flag.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • sys_swapon()'s swapfilesize (better renamed swapfilepages) is declared as
    an int, but should be an unsigned long like the maxpages it's compared
    against: on 64-bit (with 4kB pages) a swapfile of 2^44 bytes was rejected
    with "Swap area shorter than signature indicates".

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
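
    A quick C illustration of the overflow fixed above (assuming a 64-bit
    build): 2^44 bytes of swapfile is 2^32 pages, which does not fit in an
    int.

        #include <stdio.h>

        int main(void)
        {
            unsigned long long bytes = 1ULL << 44;          /* 16 TiB swapfile */
            unsigned long pages_ok  = (unsigned long)(bytes >> 12);
            int           pages_bad = (int)(bytes >> 12);   /* the old int     */

            printf("unsigned long: %lu pages\n", pages_ok); /* 4294967296      */
            printf("int          : %d pages\n", pages_bad); /* truncated       */
            return 0;
        }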
     
  • Rik suggests a simplified get_scan_ratio() for !CONFIG_SWAP. Yes, the gcc
    optimizer gives us that, when nr_swap_pages is #defined as 0L. Move usual
    declaration to swapfile.c: it never belonged in page_alloc.c.

    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Acked-by: Rik van Riel
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • There's a possible race in try_to_unuse() which Nick Piggin led me to two
    years ago. Where it does lock_page() after read_swap_cache_async(), what
    if another task removed that page from swapcache just before we locked it?

    It would sail through the (*swap_map > 1) tests doing nothing (because it
    could not have been removed from swapcache before its swap references were
    gone), until it reaches the delete_from_swap_cache(page) near the bottom.

    Now imagine that this page has been allocated to swap on a different swap
    area while we dropped page lock (perhaps at the top, perhaps in unuse_mm):
    we could wrongly remove from swap cache before the page has been written
    to swap, so a subsequent do_swap_page() would read in stale data from
    swap.

    I think this case could not happen before: remove_exclusive_swap_page()
    refused while page count was raised. But now with reuse_swap_page() and
    try_to_free_swap() removing from swap cache without minding page count, I
    think it could happen - the previous patch argued that it was safe because
    try_to_unuse() already ignored page count, but overlooked that it might be
    breaking the assumptions in try_to_unuse() itself.

    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Cc: Rik van Riel
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • remove_exclusive_swap_page(): its problem is in living up to its name.

    It doesn't matter if someone else has a reference to the page (raised
    page_count); it doesn't matter if the page is mapped into userspace
    (raised page_mapcount - though that hints it may be worth keeping the
    swap): all that matters is that there be no more references to the swap
    (and no writeback in progress).

    swapoff (try_to_unuse) has been removing pages from swapcache for years,
    with no concern for page count or page mapcount, and we used to have a
    comment in lookup_swap_cache() recognizing that: if you go for a page of
    swapcache, you'll get the right page, but it could have been removed from
    swapcache by the time you get page lock.

    So, give up asking for exclusivity: get rid of
    remove_exclusive_swap_page(), and remove_exclusive_swap_page_ref() and
    remove_exclusive_swap_page_count() which were spawned for the recent LRU
    work: replace them by the simpler try_to_free_swap() which just checks
    page_swapcount().

    Similarly, remove the page_count limitation from free_swap_and_cache(),
    but assume that it's worth holding on to the swap if page is mapped and
    swap nowhere near full. Add a vm_swap_full() test in free_swap_cache()?
    It would be consistent, but I think we probably have enough for now.

    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Cc: Rik van Riel
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • A good place to free up old swap is where do_wp_page(), or do_swap_page(),
    is about to redirty the page: the data on disk is then stale and won't be
    read again; and if we do decide to write the page out later, using the
    previous swap location makes an unnecessary disk seek very likely.

    So give can_share_swap_page() the side-effect of delete_from_swap_cache()
    when it safely can. And can_share_swap_page() was always a misleading
    name, the more so if it has a side-effect: rename it reuse_swap_page().

    Irrelevant cleanup nearby: remove swap_token_default_timeout definition
    from swap.h: it's used nowhere.

    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Acked-by: Rik van Riel
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The swap code is over-provisioned with BUG_ONs on assorted page flags,
    mostly dating back to 2.3. They're good documentation, and guard against
    developer error, but a waste of space on most systems: change them to
    VM_BUG_ONs, conditional on CONFIG_DEBUG_VM. Just delete the PagePrivate
    ones: they're later, from 2.5.69, but even less interesting now.

    Signed-off-by: Hugh Dickins
    Reviewed-by: Christoph Lameter
    Cc: Nick Piggin
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

17 Dec, 2008

1 commit

  • Impact: cleanup, code robustization

    The __swp_...() macros silently relied upon which bits are used for
    _PAGE_FILE and _PAGE_PROTNONE. After having changed _PAGE_PROTNONE in
    our Xen kernel to no longer overlap _PAGE_PAT, live locks and crashes
    were reported that could have been avoided if these macros properly
    used the symbolic constants. Since, as pointed out earlier, for Xen
    Dom0 support mainline likewise will need to eliminate the conflict
    between _PAGE_PAT and _PAGE_PROTNONE, this patch does all the necessary
    adjustments, plus it introduces a mechanism to check consistency
    between MAX_SWAPFILES_SHIFT and the actual encoding macros.

    This also fixes a latent bug in that x86-64 used a 6-bit mask in
    __swp_type(), and if MAX_SWAPFILES_SHIFT was increased beyond 5 in (the
    seemingly unrelated) linux/swap.h, this would have resulted in a
    collision with _PAGE_FILE.

    Non-PAE 32-bit code gets similarly adjusted for its pte_to_pgoff() and
    pgoff_to_pte() calculations.

    Signed-off-by: Jan Beulich
    Signed-off-by: Ingo Molnar

    Jan Beulich
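
    A C sketch of the kind of build-time consistency check described above,
    using C11 static assertions and made-up bit positions rather than x86's
    real pte layout.

        #include <stdio.h>

        #define MAX_SWAPFILES_SHIFT  5
        #define PTE_FILE_BIT         (1ul << 6)   /* stand-in for _PAGE_FILE     */
        #define PTE_PROTNONE_BIT     (1ul << 8)   /* stand-in for _PAGE_PROTNONE */
        #define SWP_TYPE_FIRST_BIT   1            /* where the swap type lives   */
        #define SWP_TYPE_MASK \
            (((1ul << MAX_SWAPFILES_SHIFT) - 1) << SWP_TYPE_FIRST_BIT)

        /* If the encoding ever grows into the software pte bits, the build fails. */
        _Static_assert((SWP_TYPE_MASK & PTE_FILE_BIT) == 0,
                       "swap type bits collide with the file bit");
        _Static_assert((SWP_TYPE_MASK & PTE_PROTNONE_BIT) == 0,
                       "swap type bits collide with the protnone bit");

        int main(void)
        {
            printf("swap type mask: 0x%lx\n", SWP_TYPE_MASK);
            return 0;
        }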
     

20 Oct, 2008

2 commits

  • trylock_page, unlock_page open and close a critical section. Hence,
    we can use the lock bitops to get the desired memory ordering.

    Also, mark trylock as likely to succeed (and remove the annotation from
    callers).

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • If vm_swap_full() (swap space more than 50% full), the system will free
    swap space at swapin time. With this patch, the system will also free the
    swap space in the pageout code, when we decide that the page is not a
    candidate for swapout (and just wasting swap space).

    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: MinChan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

05 Aug, 2008

1 commit

  • Converting page lock to new locking bitops requires a change of page flag
    operation naming, so we might as well convert it to something nicer
    (!TestSetPageLocked_Lock => trylock_page, SetPageLocked => set_page_locked).

    This also facilitates lockdeping of page lock.

    Signed-off-by: Nick Piggin
    Acked-by: KOSAKI Motohiro
    Acked-by: Peter Zijlstra
    Acked-by: Andrew Morton
    Acked-by: Benjamin Herrenschmidt
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

31 Jul, 2008

1 commit