22 May, 2010

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6: (577 commits)
    Staging: ramzswap: Handler for swap slot free callback
    swap: Add swap slot free callback to block_device_operations
    swap: Add flag to identify block swap devices
    Staging: vt6655: use ETH_FRAME_LEN macro instead of custom one
    Staging: vt6655: use ETH_DATA_LEN macro instead of custom one
    Staging: vt6655: use ETH_FCS_LEN macro instead of custom one
    Staging: vt6656: use ETH_HLEN macro instead of custom one
    Staging: comedi: quatech_daqp_cs.c Replace eos semaphore with a completion.
    Staging: dt3155v4l: remove private memory allocator
    Staging: crystalhd: Remove typedefs from driver
    Staging: winbond: Fix for pointer name format issue in mds.c
    Staging: vt6656: removed custom UCHAR/USHORT/UINT/ULONG/ULONGLONG typedefs
    Staging: vt6656: removed custom CHAR/SHORT/INT/LONG typedefs
    Staging: comedi: Altered the way printk is used in 8255.c
    staging: iio: adis16350 and similar IMU driver
    Staging: iio: max1363 Fix two bugs in single_channel_from_ring
    Staging: iio: adis16220 extract bin_attribute structures from state
    Staging: iio: adis16220 vibration sensor driver
    Staging: comedi: Kconfig dependancy fixes
    Staging: comedi: fix up build error from last Kconfig changes
    ...

    Linus Torvalds
     

19 May, 2010

2 commits

  • This callback is required when RAM based devices are used as swap disks.
    One such device is ramzswap which is used as compressed in-memory swap
    disk. For such devices, we need a callback as soon as a swap slot is no
    longer used to allow freeing memory allocated for this slot. Without this
    callback, stale data can quickly accumulate in memory defeating the whole
    purpose of such devices.
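
    The callback idea above can be sketched in userspace C. This is only a
    mock under stated assumptions: the struct names, the field name
    swap_slot_free_notify and the per-slot byte accounting are illustrative
    and are not taken from this patch.

```c
#include <stddef.h>

/* Hypothetical stand-in for a RAM-backed swap device (not a kernel type). */
struct ram_swap_dev {
    size_t slot_bytes[8];   /* memory held for each swap slot */
    size_t total_bytes;     /* memory held by the whole device */
};

/* Mock operations table carrying an optional slot-free callback. */
struct mock_block_ops {
    void (*swap_slot_free_notify)(struct ram_swap_dev *dev, unsigned long slot);
};

/* Device side: release the memory backing a slot as soon as it is unused. */
static void ram_swap_slot_free(struct ram_swap_dev *dev, unsigned long slot)
{
    dev->total_bytes -= dev->slot_bytes[slot];
    dev->slot_bytes[slot] = 0;
}

/* Swap-core side: notify the device only if it registered a callback. */
static void mock_swap_entry_free(const struct mock_block_ops *ops,
                                 struct ram_swap_dev *dev, unsigned long slot)
{
    if (ops->swap_slot_free_notify)
        ops->swap_slot_free_notify(dev, slot);
}
```

    Devices that leave the callback NULL are simply skipped, so ordinary
    disks would see no change in this model.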

    Signed-off-by: Nitin Gupta
    Acked-by: Linus Torvalds
    Acked-by: Nigel Cunningham
    Acked-by: Pekka Enberg
    Reviewed-by: Minchan Kim
    Signed-off-by: Greg Kroah-Hartman

    Nitin Gupta
     
  • Added SWP_BLKDEV flag to distinguish block and regular file backed
    swap devices. We could also check if a swap is entire block device,
    rather than a file, by:
    S_ISBLK(swap_info_struct->swap_file->f_mapping->host->i_mode)
    but, I think, simply checking this flag is more convenient.

    Signed-off-by: Nitin Gupta
    Acked-by: Linus Torvalds
    Acked-by: Nigel Cunningham
    Acked-by: Pekka Enberg
    Reviewed-by: Minchan Kim
    Signed-off-by: Greg Kroah-Hartman

    Nitin Gupta
     

29 Apr, 2010

1 commit


13 Mar, 2010

1 commit

  • This patch is another core part of this move-charge-at-task-migration
    feature. It enables moving charges of anonymous swaps.

    To move the charge of swap, we need to exchange swap_cgroup's record.

    In current implementation, swap_cgroup's record is protected by:

    - page lock: if the entry is on swap cache.
    - swap_lock: if the entry is not on swap cache.

    This works well in usual swap-in/out activity.

    But this behavior makes the feature of moving swap charge check many
    conditions to exchange swap_cgroup's record safely.

    So I changed modification of swap_cgroup's record (swap_cgroup_record())
    to use xchg, and defined a new function to cmpxchg swap_cgroup's record.
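
    The xchg/cmpxchg approach can be illustrated with C11 atomics. This is
    a sketch only: the record type, the function names and the convention
    that 0 means "no owner" are assumptions, not the kernel's code.

```c
#include <stdatomic.h>

/* One record per swap entry; 0 means "no owner" in this sketch. */
typedef _Atomic unsigned short mock_record_t;

/* xchg: unconditionally store new_id and return the previous owner. */
static unsigned short mock_record(mock_record_t *rec, unsigned short new_id)
{
    return atomic_exchange(rec, new_id);
}

/*
 * cmpxchg: replace old_id with new_id only if the record still holds
 * old_id. Returns old_id on success and 0 if someone else changed it.
 */
static unsigned short mock_cmpxchg(mock_record_t *rec,
                                   unsigned short old_id,
                                   unsigned short new_id)
{
    unsigned short expected = old_id;

    return atomic_compare_exchange_strong(rec, &expected, new_id) ? old_id : 0;
}
```

    With this shape the mover does not need to know which lock protects the
    record: a failed compare-and-exchange just means the charge was not moved.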

    This patch also enables moving charge of non pte_present but not uncharged
    swap caches, which can exist on the swap-out path, by getting the target
    pages via find_get_page() as do_mincore() does.

    [kosaki.motohiro@jp.fujitsu.com: fix ia64 build]
    [akpm@linux-foundation.org: fix typos]
    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: Daisuke Nishimura
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     

07 Mar, 2010

4 commits

  • swap_duplicate()'s loop appears to miss out on returning the error code
    from __swap_duplicate(), except when that's -ENOMEM. In fact this is
    intentional: prior to -ENOMEM for swap_count_continuation,
    swap_duplicate() was void (and the case only occurs when copy_one_pte()
    hits a corrupt pte). But that's surprising behaviour, which certainly
    deserves a comment.

    Signed-off-by: Hugh Dickins
    Reported-by: Huang Shijie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • There's an off-by-one disagreement between mkswap and swapon about the
    meaning of swap_header last_page: mkswap (in all versions I've looked at:
    util-linux-ng and BusyBox and old util-linux; probably as far back as
    1999) consistently means the offset (in page units) of the last page of
    the swap area, whereas kernel sys_swapon (as far back as 2.2 and 2.3)
    strangely takes it to mean the size (in page units) of the swap area.

    This disagreement is the safe way round; but it's worrying people, and
    loses us one page of swap.
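
    The disagreement is plain arithmetic. A small sketch with hypothetical
    helper names: if last_page is the offset of the final page, the area
    holds last_page + 1 pages, while reading it as a size gives one fewer.

```c
/* mkswap's meaning: last_page is an offset, so the size is one more. */
static unsigned long pages_from_last_page(unsigned int last_page)
{
    /* Widen first so that the largest 32-bit offset cannot wrap. */
    return (unsigned long)last_page + 1;
}

/* The old sys_swapon reading: last_page taken as the size itself. */
static unsigned long pages_old_reading(unsigned int last_page)
{
    return last_page;
}
```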

    The fix is not just to add one to nr_good_pages: we need to get maxpages
    (the size of the swap_map array) right before that; and though that is an
    unsigned long, be careful not to overflow the unsigned int p->max which
    later holds it (probably why header uses __u32 last_page instead of size).

    Why did we subtract one from the maximum swp_offset to calculate maxpages?
    Though it was probably me who made that change in 2.4.10, I don't get it:
    and now we should be adding one (without risk of overflow in this case).

    Fix the handling of swap_header badpages: it could have overrun the
    swap_map when very large swap area used on a more limited architecture.

    Remove pre-initializations of swap_header, nr_good_pages and maxpages:
    those date from when sys_swapon was supporting other versions of header.

    Reported-by: Nitin Gupta
    Reported-by: Jarkko Lavinen
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • A frequent question from users about memory management is how many
    swap entries are used by each process. This information will also give
    some hints to the oom-killer.

    Although we can count the number of swapents per process by scanning
    /proc/<pid>/smaps, this is very slow and not good for the usual process
    information handlers which work like 'ps' or 'top'. (ps or top is slow
    enough already.)

    This patch adds a counter of swapents to mm_counter and updates it at each
    swap event. Information is exported via the /proc/<pid>/status file as

    [kamezawa@bluextal memory]$ cat /proc/self/status
    Name: cat
    State: R (running)
    Tgid: 2910
    Pid: 2910
    PPid: 2823
    TracerPid: 0
    Uid: 500 500 500 500
    Gid: 500 500 500 500
    FDSize: 256
    Groups: 500
    VmPeak: 82696 kB
    VmSize: 82696 kB
    VmLck: 0 kB
    VmHWM: 432 kB
    VmRSS: 432 kB
    VmData: 172 kB
    VmStk: 84 kB
    VmExe: 48 kB
    VmLib: 1568 kB
    VmPTE: 40 kB
    VmSwap: 0 kB
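
    A tool in the style of 'ps' or 'top' could pick the new field out of the
    status text along these lines. This is a sketch: parse_vmswap_kb() is a
    hypothetical helper and works on text that the caller has already read.

```c
#include <stdio.h>
#include <string.h>

/* Return the VmSwap value in kB from status-file text, or -1 if absent. */
static long parse_vmswap_kb(const char *status_text)
{
    const char *line = status_text;

    while (line && *line) {
        long kb;

        if (strncmp(line, "VmSwap:", 7) == 0 &&
            sscanf(line + 7, "%ld", &kb) == 1)
            return kb;

        /* Advance to the start of the next line, if there is one. */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}
```
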
    Reviewed-by: Minchan Kim
    Reviewed-by: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Presently, the per-mm statistics counters are defined by macros in sched.h.

    This patch modifies them to be
    - defined in mm.h as inline functions
    - stored in an array instead of using the macros' name creation.

    This patch reduces the size of a future patch that modifies the
    implementation of the per-mm counters.
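
    The array-plus-inline-functions shape might look like the following
    userspace sketch. The enum members, struct and function names are
    hypothetical stand-ins, not the definitions added to mm.h.

```c
/* Index names for the counter array; the last member gives its size. */
enum {
    MOCK_MM_FILEPAGES,
    MOCK_MM_ANONPAGES,
    MOCK_MM_SWAPENTS,
    MOCK_NR_MM_COUNTERS
};

struct mock_mm {
    long counters[MOCK_NR_MM_COUNTERS];
};

static inline long mock_get_mm_counter(const struct mock_mm *mm, int member)
{
    return mm->counters[member];
}

static inline void mock_add_mm_counter(struct mock_mm *mm, int member, long value)
{
    mm->counters[member] += value;
}

static inline void mock_inc_mm_counter(struct mock_mm *mm, int member)
{
    mm->counters[member]++;
}

static inline void mock_dec_mm_counter(struct mock_mm *mm, int member)
{
    mm->counters[member]--;
}
```

    Adding a counter then means adding one enum member, and a later change
    of implementation only has to touch these four functions.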

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

16 Dec, 2009

11 commits

  • Initial implementation for swapping out KSM's shared pages: add
    page_referenced_ksm() and try_to_unmap_ksm(), which rmap.c calls when
    faced with a PageKsm page.

    Most of what's needed can be got from the rmap_items listed from the
    stable_node of the ksm page, without discovering the actual vma: so in
    this patch just fake up a struct vma for page_referenced_one() or
    try_to_unmap_one(), then refine that in the next patch.

    Add VM_NONLINEAR to ksm_madvise()'s list of exclusions: it has always been
    implicit there (being only set with VM_SHARED, already excluded), but
    let's make it explicit, to help justify the lack of nonlinear unmap.

    Rely on the page lock to protect against concurrent modifications to that
    page's node of the stable tree.

    The awkward part is not swapout but swapin: do_swap_page() and
    page_add_anon_rmap() now have to allow for new possibilities - perhaps a
    ksm page still in swapcache, perhaps a swapcache page associated with one
    location in one anon_vma now needed for another location or anon_vma.
    (And the vma might even be no longer VM_MERGEABLE when that happens.)

    ksm_might_need_to_copy() checks for that case, and supplies a duplicate
    page when necessary, simply leaving it to a subsequent pass of ksmd to
    rediscover the identity and merge them back into one ksm page.
    Disappointingly primitive: but the alternative would have to accumulate
    unswappable info about the swapped out ksm pages, limiting swappability.

    Remove page_add_ksm_rmap(): page_add_anon_rmap() now has to allow for the
    particular case it was handling, so just use it instead.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • At present we define PageAnon(page) by the low PAGE_MAPPING_ANON bit set
    in page->mapping, with the higher bits a pointer to the anon_vma; and have
    defined PageKsm(page) as that with NULL anon_vma.

    But KSM swapping will need to store a pointer there: so in preparation for
    that, now define PAGE_MAPPING_FLAGS as the low two bits, including
    PAGE_MAPPING_KSM (always set along with PAGE_MAPPING_ANON, until some
    other use for the bit emerges).

    Declare page_rmapping(page) to return the pointer part of page->mapping,
    and page_anon_vma(page) to return the anon_vma pointer when that's what it
    is. Use these in a few appropriate places: notably, unuse_vma() has been
    testing page->mapping, but is better to be testing page_anon_vma() (cases
    may be added in which flag bits are set without any pointer).
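
    The low-bit tagging described above can be sketched in userspace C. The
    MOCK_ names and struct layouts are illustrative assumptions; only the
    idea (two flag bits below an aligned pointer) comes from the text.

```c
#include <stddef.h>
#include <stdint.h>

#define MOCK_MAPPING_ANON  ((uintptr_t)1)
#define MOCK_MAPPING_KSM   ((uintptr_t)2)
#define MOCK_MAPPING_FLAGS (MOCK_MAPPING_ANON | MOCK_MAPPING_KSM)

struct mock_anon_vma { int dummy; };
struct mock_page { uintptr_t mapping; };   /* pointer bits plus flag bits */

/* Return only the pointer part of the mapping word. */
static void *mock_page_rmapping(const struct mock_page *page)
{
    return (void *)(page->mapping & ~MOCK_MAPPING_FLAGS);
}

/* Return the anon_vma only when the word is tagged ANON and nothing else. */
static struct mock_anon_vma *mock_page_anon_vma(const struct mock_page *page)
{
    if ((page->mapping & MOCK_MAPPING_FLAGS) != MOCK_MAPPING_ANON)
        return NULL;
    return mock_page_rmapping(page);
}
```

    Testing the accessor rather than the raw word matters once flag bits can
    be set without any pointer behind them.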

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Reviewed-by: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Seems that page_io.c doesn't really need to know that page_private(page)
    is the swp_entry 'val'. Rework map_swap_page() to do what its name says
    and map a page to a page offset in the swap space.

    The only other caller of map_swap_page() is internal to mm/swapfile.c and
    it does want to map a swap entry to the 'sector'. So rename
    map_swap_page() to map_swap_entry(), make it 'static' and implement
    map_swap_page() as a wrapper around that.

    Signed-off-by: Lee Schermerhorn
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • While we're fiddling with the swap_map values, let's assign a particular
    value to shmem/tmpfs swap pages: their swap counts are never incremented,
    and it helps swapoff's try_to_unuse() a little if it can immediately
    distinguish those pages from process pages.

    Since we've no use for SWAP_MAP_BAD | COUNT_CONTINUED,
    we might as well use that 0xbf value for SWAP_MAP_SHMEM.
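
    As a quick check of the value: the text gives 0xbf for
    SWAP_MAP_BAD | COUNT_CONTINUED. The individual values 0x3f and 0x80 used
    here are assumptions chosen to be consistent with that result.

```c
#define MOCK_SWAP_MAP_BAD     0x3f   /* assumed value */
#define MOCK_COUNT_CONTINUED  0x80   /* assumed value */
#define MOCK_SWAP_MAP_SHMEM   (MOCK_SWAP_MAP_BAD | MOCK_COUNT_CONTINUED)

/* swapoff-style test: is this swap_map byte owned by shmem/tmpfs? */
static int mock_is_shmem_entry(unsigned char map_value)
{
    return map_value == MOCK_SWAP_MAP_SHMEM;
}
```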

    Signed-off-by: Hugh Dickins
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Swap is duplicated (reference count incremented by one) whenever the same
    swap page is inserted into another mm (when forking finds a swap entry in
    place of a pte, or when reclaim unmaps a pte to insert the swap entry).

    swap_info_struct's vmalloc'ed swap_map is the array of these reference
    counts: but what happens when the unsigned short (or unsigned char since
    the preceding patch) is full? (and its high bit is kept for a cache flag)

    We then lose track of it, never freeing, leaving it in use until swapoff:
    at which point we _hope_ that a single pass will have found all instances,
    assume there are no more, and will lose user data if we're wrong.

    Swapping of KSM pages has not yet been enabled; but it is implemented,
    and makes it very easy for a user to overflow the maximum swap count:
    possible with ordinary process pages, but unlikely, even when pid_max
    has been raised from PID_MAX_DEFAULT.

    This patch implements swap count continuations: when the count overflows,
    a continuation page is allocated and linked to the original vmalloc'ed
    map page, and this used to hold the continuation counts for that entry
    and its neighbours. These continuation pages are seldom referenced:
    the common paths all work on the original swap_map, only referring to
    a continuation page when the low "digit" of a count is incremented or
    decremented through SWAP_MAP_MAX.
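
    The "digit" picture can be made concrete with a deliberately simplified
    model. It is not the kernel's encoding: a single integer stands in for
    the chain of continuation pages, and the value 0x3e for SWAP_MAP_MAX is
    an assumption.

```c
#define MOCK_SWAP_MAP_MAX 0x3e   /* assumed largest value of the low digit */

struct mock_count {
    unsigned char low;    /* the byte in the ordinary swap_map */
    unsigned long cont;   /* stands in for the continuation page(s) */
};

/* Increment: touch the continuation only when the low digit overflows. */
static void mock_count_inc(struct mock_count *c)
{
    if (c->low == MOCK_SWAP_MAP_MAX) {
        c->low = 0;
        c->cont++;
    } else {
        c->low++;
    }
}

/* Decrement: borrow from the continuation only when the low digit is 0. */
static void mock_count_dec(struct mock_count *c)
{
    if (c->low > 0) {
        c->low--;
    } else if (c->cont > 0) {
        c->low = MOCK_SWAP_MAP_MAX;
        c->cont--;
    }
}

/* Full reference count represented by the two parts. */
static unsigned long mock_count_value(const struct mock_count *c)
{
    return c->cont * (MOCK_SWAP_MAP_MAX + 1UL) + c->low;
}
```

    In this model most increments and decrements touch only the low byte,
    which matches the claim that continuation pages are seldom referenced.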

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Halve the vmalloc'ed swap_map array from unsigned shorts to unsigned
    chars: it's still very unusual to reach a swap count of 126, and the
    next patch allows it to be extended indefinitely.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Though swap_count() is useful, I'm finding that swap_has_cache() and
    encode_swapmap() obscure what happens in the swap_map entry, just at
    those points where I need to understand it. Remove them, and pass
    more usable "usage" values to scan_swap_map(), swap_entry_free() and
    __swap_duplicate(), instead of the SWAP_MAP and SWAP_CACHE enum.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Move CONFIG_HIBERNATION's swapdev_block() into the main CONFIG_HIBERNATION
    block, remove extraneous whitespace and return, fix typo in a comment.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Make better use of the space by folding first swap_extent into its
    swap_info_struct, instead of just the list_head: swap partitions need
    only that one, and for others it's used as a circular list anyway.

    [jirislaby@gmail.com: fix crash on double swapon]
    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Jiri Slaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The swap_info_struct is only 76 or 104 bytes, but it does seem wrong
    to reserve an array of about 30 of them in bss, when most people will
    want only one. Change swap_info[] to an array of pointers.

    That does need a "type" field in the structure: pack it as a char with
    next type and short prio (aha, char is unsigned by default on PowerPC).
    Use the (admittedly peculiar) name "type" throughout for this index.

    /proc/swaps does not take swap_lock: I wouldn't want it to, but do take
    care with barriers when adding a new item to the array (never removed).

    Signed-off-by: Hugh Dickins
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The swap_info_struct is mostly private to mm/swapfile.c, with only
    one other in-tree user: get_swap_bio(). Adjust its interface to
    map_swap_page(), so that we can then remove get_swap_info_struct().

    But there is a popular user out-of-tree, TuxOnIce: so leave the
    declaration of swap_info_struct in linux/swap.h.

    Signed-off-by: Hugh Dickins
    Cc: Nigel Cunningham
    Cc: KAMEZAWA Hiroyuki
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

03 Nov, 2009

1 commit

  • In try_to_unuse(), swcount is a local copy of *swap_map, including the
    SWAP_HAS_CACHE bit; but a wrong comparison against swap_count(*swap_map),
    which masks off the SWAP_HAS_CACHE bit, succeeded where it should fail.

    That had the effect of resetting the mm from which to start searching
    for the next swap page, to an irrelevant mm instead of to an mm in which
    this swap page had been found: which may increase search time by ~20%.
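
    The effect of the wrong comparison can be reproduced in a few lines.
    This sketch assumes a less-than comparison and a hypothetical bit value
    for SWAP_HAS_CACHE; the function names are illustrative.

```c
#define MOCK_SWAP_HAS_CACHE 0x8000   /* hypothetical bit position */

/* Mask off the cache flag, leaving only the reference count. */
static unsigned short mock_swap_count(unsigned short ent)
{
    return (unsigned short)(ent & ~MOCK_SWAP_HAS_CACHE);
}

/* Buggy test: masked current value against an unmasked saved copy. */
static int mock_dropped_buggy(unsigned short swcount, unsigned short now)
{
    return mock_swap_count(now) < swcount;
}

/* Fixed test: compare like with like. */
static int mock_dropped_fixed(unsigned short swcount, unsigned short now)
{
    return now < swcount;
}
```

    With the cache bit set and the entry unchanged, the buggy test reports a
    drop even though nothing was freed, while the fixed test does not.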
    But we're used to swapoff being slow, so never noticed the slowdown.

    Remove that one spurious use of swap_count(): Bo Liu thought it merely
    redundant, Hugh rewrote the description since it was measurably wrong.

    Signed-off-by: Bo Liu
    Signed-off-by: Hugh Dickins
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Bo Liu
     

02 Oct, 2009

1 commit

  • While testing the Swap over NFS patchset, I noticed an oops that was triggered
    during swapon. Investigating further, the NULL pointer dereference is due to the
    SSD device check/optimization in the swapon code that assumes s_bdev could never
    be NULL.

    inode->i_sb->s_bdev could be NULL in a few cases. One such case is a
    loopback NFS mount; there could be others as well. Fix this by ensuring s_bdev
    is not NULL before we try to dereference s_bdev.

    Signed-off-by: Suresh Jayaraman
    Signed-off-by: Jens Axboe

    Suresh Jayaraman
     

24 Sep, 2009

1 commit

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
    HWPOISON: Enable error_remove_page on btrfs
    HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
    HWPOISON: Add madvise() based injector for hardware poisoned pages v4
    HWPOISON: Enable error_remove_page for NFS
    HWPOISON: Enable .remove_error_page for migration aware file systems
    HWPOISON: The high level memory error handler in the VM v7
    HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
    HWPOISON: shmem: call set_page_dirty() with locked page
    HWPOISON: Define a new error_remove_page address space op for async truncation
    HWPOISON: Add invalidate_inode_page
    HWPOISON: Refactor truncate to allow direct truncating of page v2
    HWPOISON: check and isolate corrupted free pages v2
    HWPOISON: Handle hardware poisoned pages in try_to_unmap
    HWPOISON: Use bitmask/action code for try_to_unmap behaviour
    HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
    HWPOISON: Add poison check to page fault handling
    HWPOISON: Add basic support for poisoned pages in fault handler v3
    HWPOISON: Add new SIGBUS error codes for hardware poison signals
    HWPOISON: Add support for poison swap entries v2
    HWPOISON: Export some rmap vma locking to outside world
    ...

    Linus Torvalds
     

22 Sep, 2009

1 commit

  • Just as the swapoff system call allocates many pages of RAM to various
    processes, perhaps triggering OOM, so "echo 2 >/sys/kernel/mm/ksm/run"
    (unmerge) is liable to allocate many pages of RAM to various processes,
    perhaps triggering OOM; and each is normally run from a modest admin
    process (swapoff or shell), easily repeated until it succeeds.

    So treat unmerge_and_remove_all_rmap_items() in the same way that we treat
    try_to_unuse(): generalize PF_SWAPOFF to PF_OOM_ORIGIN, and bracket both
    with that, to ask the OOM killer to kill them first, to prevent them from
    spawning more and more OOM kills.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

16 Sep, 2009

1 commit

  • Memory migration uses special swap entry types to trigger special actions on
    page faults. Extend this mechanism to also support poisoned swap entries, to
    trigger poison handling on page faults. This allows follow-on patches to
    prevent processes from faulting in poisoned pages again.
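
    One way to picture the special entry types is as type numbers carved off
    the top of the swap type space. Every value and name below is an
    assumption made for illustration (a 5-bit type field, two migration
    types, one poison type), not a quotation of the patch.

```c
#define MOCK_MAX_SWAPFILES_SHIFT 5   /* assumed width of the type field */
#define MOCK_SWP_MIGRATION_NUM   2   /* assumed: read and write entries */
#define MOCK_SWP_HWPOISON_NUM    1   /* assumed: one poison entry type */

/* Real swap areas get whatever is left after the special types. */
#define MOCK_MAX_SWAPFILES \
    ((1 << MOCK_MAX_SWAPFILES_SHIFT) - MOCK_SWP_MIGRATION_NUM - MOCK_SWP_HWPOISON_NUM)

#define MOCK_SWP_HWPOISON        MOCK_MAX_SWAPFILES
#define MOCK_SWP_MIGRATION_READ  (MOCK_MAX_SWAPFILES + MOCK_SWP_HWPOISON_NUM)
#define MOCK_SWP_MIGRATION_WRITE (MOCK_SWP_MIGRATION_READ + 1)

/* Types at or above MAX_SWAPFILES never name a real swap area. */
static int mock_is_special_type(unsigned int type)
{
    return type >= MOCK_MAX_SWAPFILES;
}
```

    Subtracting the special types from the total keeps the highest special
    type inside the type field, which is the kind of overflow the v2 and v3
    notes refer to.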

    v2: Fix overflow in MAX_SWAPFILES (Fengguang Wu)
    v3: Better overflow fix (Hidehiro Kawai)

    Signed-off-by: Andi Kleen

    Andi Kleen
     

14 Sep, 2009

1 commit

  • blk_ioctl_discard duplicates large amounts of code from blkdev_issue_discard;
    the only differences between the two are that blkdev_issue_discard needs to
    send a barrier discard request and blk_ioctl_discard a non-barrier one,
    and that blk_ioctl_discard needs to wait on the request. To facilitate this,
    add a flags argument to blkdev_issue_discard to control both aspects of the
    behaviour. This will be very useful later on for using the waiting
    functionality for other callers.

    Based on an earlier patch from Matthew Wilcox.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

30 Jul, 2009

1 commit

  • Create bdgrab(). This function copies an existing reference to a
    block_device. It is safe to call from any context.

    Hibernation code wishes to copy a reference to the active swap device.
    Right now it calls bdget() under a spinlock, but this is wrong because
    bdget() can sleep. It doesn't need a full bdget() because we already
    hold a reference to active swap devices (and the spinlock protects
    against swapoff).

    Fixes http://bugzilla.kernel.org/show_bug.cgi?id=13827

    Signed-off-by: Alan Jenkins
    Signed-off-by: Rafael J. Wysocki

    Alan Jenkins
     

19 Jun, 2009

1 commit

  • This patch fixes mis-accounting of swap usage in memcg.

    In the current implementation, memcg's swap account is uncharged only when
    swap is completely freed. But there are several cases where swap cannot
    be freed cleanly. For handling that, this patch changes that memcg
    uncharges swap account when swap has no references other than cache.

    By this, memcg's swap entry accounting can be fully synchronous with the
    application's behavior.

    This patch also changes memcg's hooks for swap-out.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Acked-by: Balbir Singh
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Li Zefan
    Cc: Dhaval Giani
    Cc: YAMAMOTO Takashi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

17 Jun, 2009

3 commits

  • Presently we can tell from swap_map that a swap entry is used only as
    SwapCache, without looking up the swap cache.

    So we have a chance to reuse swap-cache-only swap entries in
    get_swap_pages().

    This patch tries to free swap-cache-only swap entries if there is not
    enough swap.

    Note: We hit the following path when the swap_cluster code cannot find a free
    cluster. So vm_swap_full() is not the only condition that allows the kernel
    to reclaim unused swap.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Li Zefan
    Cc: Dhaval Giani
    Cc: YAMAMOTO Takashi
    Tested-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This is a part of the patches for fixing memcg's swap accounting leak.
    But, IMHO, not a bad patch even without memcg.

    There are 2 kinds of references to swap.
    - reference from swap entry
    - reference from swap cache

    Then,

    - If there is swap cache && swap's refcnt is 1, there is only swap cache.
    (*) swapcount(entry) == 1 && find_get_page(swapper_space, entry) != NULL

    This counting logic has worked well for a long time. But considering
    that we cannot know from swap_map[] whether there is a _real_ reference
    or not, the current usage of the counter is not very good.

    This patch adds a flag SWAP_HAS_CACHE and records whether a swap
    entry has a cache or not. This will remove the -1 magic used in swapfile.c
    and help to avoid unnecessary find_get_page() calls.
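
    A sketch of how such a flag lets the swap_map value alone answer "is
    only the swap cache holding this entry?". The bit position and the
    helper names are assumptions for illustration.

```c
#define MOCK_SWAP_HAS_CACHE 0x8000   /* hypothetical bit position */

/* Pack a reference count and the cache flag into one swap_map value. */
static unsigned short mock_encode(unsigned short count, int has_cache)
{
    return (unsigned short)(count | (has_cache ? MOCK_SWAP_HAS_CACHE : 0));
}

/* Does the swap cache hold this entry? */
static int mock_has_cache(unsigned short ent)
{
    return (ent & MOCK_SWAP_HAS_CACHE) != 0;
}

/* True when the cache is the only user: flag set, no real references. */
static int mock_cache_only(unsigned short ent)
{
    return ent == MOCK_SWAP_HAS_CACHE;
}
```

    Under the old scheme the same question needed a count of 1 plus a
    successful find_get_page() lookup; here it is a single comparison.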

    Signed-off-by: KAMEZAWA Hiroyuki
    Tested-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Li Zefan
    Cc: Dhaval Giani
    Cc: YAMAMOTO Takashi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • In a following patch, the usage of swap cache is recorded into swap_map.
    This patch is for necessary interface changes to do that.

    2 interfaces:

    - swapcache_prepare()
    - swapcache_free()

    are added for allocating/freeing refcnt from swap-cache to existing swap
    entries. But implementation itself is not changed under this patch. At
    adding swapcache_free(), memcg's hook code is moved under
    swapcache_free(). This is better than using scattered hooks.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Acked-by: Balbir Singh
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Li Zefan
    Cc: Dhaval Giani
    Cc: YAMAMOTO Takashi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

22 Feb, 2009

1 commit

  • http://bugzilla.kernel.org/show_bug.cgi?id=12239

    The image writing code dropped a reference to the current swap device.
    This doesn't show up if the hibernation succeeds - because it doesn't
    affect the image which gets resumed. But it means multiple _failed_
    hibernations end up freeing the swap device while it is still in use!

    swsusp_write() finds the block device for the swap file using swap_type_of().
    It then uses blkdev_get() / blkdev_put() to open and close the block device.

    Unfortunately, blkdev_get() assumes ownership of the inode of the block_device
    passed to it. So blkdev_put() calls iput() on the inode. This is by design
    and other callers expect this behaviour. The fix is for swap_type_of() to take
    a reference on the inode using bdget().

    Signed-off-by: Alan Jenkins
    Signed-off-by: Rafael J. Wysocki
    Cc: Len Brown
    Cc: Greg KH
    Signed-off-by: Linus Torvalds

    Alan Jenkins
     

30 Jan, 2009

1 commit

  • Now, at swapoff, even while try_charge() fails, commit is executed. This
    is a bug which turns the refcnt of cgroup_subsys_state negative.

    Reported-by: Li Zefan
    Tested-by: Li Zefan
    Tested-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

14 Jan, 2009

1 commit


09 Jan, 2009

5 commits

  • My patch, memcg-fix-gfp_mask-of-callers-of-charge.patch changed gfp_mask
    of callers of charge to be GFP_HIGHUSER_MOVABLE for showing what will
    happen at memory reclaim.

    But in recent discussion, it's NACKed because it sounds ugly.

    This patch is for reverting it and add some clean up to gfp_mask of
    callers of charge. No behavior change but need review before generating
    HUNK in deep queue.

    This patch also adds explanation to meaning of gfp_mask passed to charge
    functions in memcontrol.h.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch implements a per-cgroup limit on usage of memory+swap. Although
    SwapCache exists, double counting of swap-cache and swap-entry is
    avoided.

    Mem+Swap controller works as following.
    - memory usage is limited by memory.limit_in_bytes.
    - memory + swap usage is limited by memory.memsw_limit_in_bytes.

    This has following benefits.
    - A user can limit total resource usage of mem+swap.

    Without this, because memory resource controller doesn't take care of
    usage of swap, a process can exhaust all the swap (by memory leak.)
    We can avoid this case.

    And Swap is shared resource but it cannot be reclaimed (goes back to memory)
    until it's used. This characteristic can be trouble when the memory
    is divided into some parts by cpuset or memcg.
    Assume group A and group B.
    After some application executes, the system can be..

    Group A -- very large free memory space but occupy 99% of swap.
    Group B -- under memory shortage but cannot use swap...it's nearly full.

    Ability to set appropriate swap limit for each group is required.

    Some may wonder "why not swap but mem+swap ?"

    - The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
    to move account from memory to swap...there is no change in usage of
    mem+swap.

    In other words, when we want to limit the usage of swap without affecting
    global LRU, mem+swap limit is better than just limiting swap.

    Accounting target information is stored in swap_cgroup which is
    per swap entry record.

    Charge is done as following.
    map
    - charge page and memsw.

    unmap
    - uncharge page/memsw if not SwapCache.

    swap-out (__delete_from_swap_cache)
    - uncharge page
    - record mem_cgroup information to swap_cgroup.

    swap-in (do_swap_page)
    - charged as page and memsw.
    record in swap_cgroup is cleared.
    memsw accounting is decremented.

    swap-free (swap_free())
    - if swap entry is freed, memsw is uncharged by PAGE_SIZE.
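
    The charge steps above can be traced with two counters. This is a toy
    model counting in pages with hypothetical names; it ignores limits,
    SwapCache and the swap_cgroup record itself.

```c
struct mock_memcg {
    long mem;     /* pages charged as memory */
    long memsw;   /* pages charged as memory + swap */
};

/* map: charge page and memsw. */
static void mock_map(struct mock_memcg *g)       { g->mem++; g->memsw++; }

/* unmap (not SwapCache): uncharge page and memsw. */
static void mock_unmap(struct mock_memcg *g)     { g->mem--; g->memsw--; }

/* swap-out: uncharge the page only; the memsw charge stays with the entry. */
static void mock_swap_out(struct mock_memcg *g)  { g->mem--; }

/* swap-in: charge page and memsw, then drop the entry's memsw charge. */
static void mock_swap_in(struct mock_memcg *g)
{
    g->mem++;
    g->memsw++;
    g->memsw--;   /* the swap record is cleared */
}

/* swap-free: the entry is gone, so memsw is uncharged. */
static void mock_swap_free(struct mock_memcg *g) { g->memsw--; }
```

    Swap-out and swap-in move usage between memory and swap without
    changing memsw, which is why global reclaim does not disturb the
    mem+swap limit.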

    There are people who work in never-swap environments and consider swap as
    something bad. For such people, this mem+swap controller extension is just an
    overhead. This overhead is avoided by config or boot option.
    (see Kconfig. detail is not in this patch.)

    TODO:
    - maybe more optimization can be done in swap-in path. (but not very safe.)
    But we just do simple accounting at this stage.

    [nishimura@mxp.nes.nec.co.jp: make resize limit hold mutex]
    [hugh@veritas.com: memswap controller core swapcache fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • For accounting swap, we need a record per swap entry, at least.

    This patch adds following function.
    - swap_cgroup_swapon() .... called from swapon
    - swap_cgroup_swapoff() ... called at the end of swapoff.

    - swap_cgroup_record() .... record information of swap entry.
    - swap_cgroup_lookup() .... lookup information of swap entry.

    This patch just implements "how to record information". No actual method
    for limiting the usage of swap. These routines use a flat table to record and
    look up. A "wise" lookup system like radix-tree requires memory
    allocation at new records but swap-out is usually called under memory
    shortage (or memcg hits limit.) So, I used static allocation. (maybe
    dynamic allocation is not very hard but it adds additional memory
    allocation in memory shortage path.)

    Note1: In this, we use pointer to record information and this means
    8bytes per swap entry. I think we can reduce this when we
    create "id of cgroup" in the range of 0-65535 or 0-255.
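
    A flat table of one pointer per swap entry can be sketched as follows.
    The names and the single calloc() are illustrative assumptions; the
    point is that record and lookup never allocate.

```c
#include <stdlib.h>

struct mock_swap_cgroup_table {
    unsigned long nr;   /* number of swap entries covered */
    void **owner;       /* one pointer-sized record per entry */
};

/* swapon: allocate every record up front, outside any reclaim path. */
static int mock_table_swapon(struct mock_swap_cgroup_table *t, unsigned long nr)
{
    t->owner = calloc(nr, sizeof(*t->owner));
    t->nr = t->owner ? nr : 0;
    return t->owner ? 0 : -1;
}

/* swapoff: release the whole table. */
static void mock_table_swapoff(struct mock_swap_cgroup_table *t)
{
    free(t->owner);
    t->owner = NULL;
    t->nr = 0;
}

/* record: store a new owner and return the old one; no allocation. */
static void *mock_table_record(struct mock_swap_cgroup_table *t,
                               unsigned long offset, void *owner)
{
    void *old = t->owner[offset];

    t->owner[offset] = owner;
    return old;
}

/* lookup: plain array indexing by swap offset. */
static void *mock_table_lookup(const struct mock_swap_cgroup_table *t,
                               unsigned long offset)
{
    return t->owner[offset];
}
```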

    Reported-by: Daisuke Nishimura
    Reviewed-by: Daisuke Nishimura
    Tested-by: Daisuke Nishimura
    Reported-by: Hugh Dickins
    Reported-by: Balbir Singh
    Reported-by: Andrew Morton
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Pavel Emelianov
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Fix misuse of gfp_kernel.

    Now, most of callers of mem_cgroup_charge_xxx functions uses GFP_KERNEL.

    I think that this is from the fact that page_cgroup *was* dynamically
    allocated.

    But now, we allocate all page_cgroup at boot. And
    mem_cgroup_try_to_free_pages() reclaim memory from GFP_HIGHUSER_MOVABLE +
    specified GFP_RECLAIM_MASK.

    * This is because we just want to reduce memory usage.
    "Where we should reclaim from ?" is not a problem in memcg.

    This patch modifies gfp masks to be GFP_HIGHUSER_MOVABLE if possible.

    Note: This patch is not for fixing behavior but for showing sane information
    in source code.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • There is a small race in do_swap_page(). When the page swapped-in is
    charged, the mapcount can be greater than 0. But, at the same time some
    process (which shares it) calls unmap and makes mapcount 1->0 and the page is
    uncharged.

        CPUA                              CPUB
        mapcount == 1.
    (1) charge if mapcount==0             zap_pte_range()
                                          (2) mapcount 1 => 0.
                                          (3) uncharge(). (success)
    (4) set page's rmap()
        mapcount 0=>1

    Then, this swap page's account is leaked.

    For fixing this, I added a new interface.
    - charge
      account to res_counter by PAGE_SIZE and try to free pages if necessary.
    - commit
      register page_cgroup and add to LRU if necessary.
    - cancel
      uncharge PAGE_SIZE because of do_swap_page failure.

    CPUA
    (1) charge (always)
    (2) set page's rmap (mapcount > 0)
    (3) commit charge was necessary or not after set_pte().

    This protocol uses PCG_USED bit on page_cgroup for avoiding over accounting.
    Usual mem_cgroup_charge_common() does charge -> commit at a time.
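
    The charge/commit/cancel protocol can be modelled in a few lines. This
    is a sketch with hypothetical names: it counts in pages instead of
    PAGE_SIZE bytes, and the used field plays the role of the PCG_USED bit.

```c
struct mock_counter {
    long usage;   /* pages currently charged */
    long limit;   /* maximum pages allowed */
};

struct mock_page_cgroup {
    int used;     /* nonzero once the page is accounted */
};

/* charge: always account one page first; fail if over the limit. */
static int mock_try_charge(struct mock_counter *c)
{
    if (c->usage >= c->limit)
        return -1;
    c->usage++;
    return 0;
}

/* commit: keep the charge, or give it back if the page was already used. */
static void mock_commit_charge(struct mock_counter *c, struct mock_page_cgroup *pc)
{
    if (pc->used) {
        c->usage--;   /* avoid accounting the same page twice */
        return;
    }
    pc->used = 1;
}

/* cancel: the swap-in failed, so return the charge. */
static void mock_cancel_charge(struct mock_counter *c)
{
    c->usage--;
}
```

    Because the charge is taken before the rmap is set and settled only at
    commit, a concurrent unmap can no longer leave the page unaccounted.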

    And this patch also adds following function to clarify all charges.

    - mem_cgroup_newpage_charge() ....replacement for mem_cgroup_charge()
    called against newly allocated anon pages.

    - mem_cgroup_charge_migrate_fixup()
    called only from remove_migration_ptes().
    we'll have to rewrite this later.(this patch just keeps old behavior)
    This function will be removed by additional patch to make migration
    clearer.

    Good for clarifying "what we do"

    Then, we have 4 following charge points.
    - newpage
    - swap-in
    - add-to-cache.
    - migration.

    [akpm@linux-foundation.org: add missing inline directives to stubs]
    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

07 Jan, 2009

1 commit