14 Jan, 2011

1 commit

  • Paging logic that splits the page before it is unmapped and added to swap
    to ensure backwards compatibility with the legacy swap code. Eventually
    swap should natively pageout the hugepages to increase performance and
    decrease seeking and fragmentation of swap space. swapoff can just skip
    over huge pmd as they cannot be part of swap yet. In add_to_swap be
    careful to split the page only if we got a valid swap entry so we don't
    split hugepages with a full swap.

    In theory we could split pages before isolating them during the lru scan,
    but for khugepaged to be safe, I'm relying on either mmap_sem write mode,
    or PG_lock taken, so split_huge_page has to run either with mmap_sem
    read/write mode or PG_lock taken. Calling it from isolate_lru_page would
    make locking more complicated, in addition to that split_huge_page would
    deadlock if called by __isolate_lru_page because it has to take the lru
    lock to add the tail pages.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

13 Nov, 2010

1 commit

  • Over time, block layer has accumulated a set of APIs dealing with bdev
    open, close, claim and release.

    * blkdev_get/put() are the primary open and close functions.

    * bd_claim/release() deal with exclusive open.

    * open/close_bdev_exclusive() are combination of open and claim and
    the other way around, respectively.

    * bd_link/unlink_disk_holder() to create and remove holder/slave
    symlinks.

    * open_by_devnum() wraps bdget() + blkdev_get().

    The interface is a bit confusing and the decoupling of open and claim
    makes it impossible to properly guarantee exclusive access as
    in-kernel open + claim sequence can disturb the existing exclusive
    open even before the block layer knows the current open if for another
    exclusive access. Reorganize the interface such that,

    * blkdev_get() is extended to include exclusive access management.
    @holder argument is added and, if is @FMODE_EXCL specified, it will
    gain exclusive access atomically w.r.t. other exclusive accesses.

    * blkdev_put() is similarly extended. It now takes @mode argument and
    if @FMODE_EXCL is set, it releases an exclusive access. Also, when
    the last exclusive claim is released, the holder/slave symlinks are
    removed automatically.

    * bd_claim/release() and close_bdev_exclusive() are no longer
    necessary and either made static or removed.

    * bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
    is no longer necessary and removed.

    * open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
    and blkdev_get(). It also has an unexpected extra bdev_read_only()
    test which probably should be moved into blkdev_get().

    * open_by_devnum() is modified to take @holder argument and pass it to
    blkdev_get().

    Most of bdev open/close operations are unified into blkdev_get/put()
    and most exclusive accesses are tested atomically at the open time (as
    it should). This cleans up code and removes some, both valid and
    invalid, but unnecessary all the same, corner cases.

    open_bdev_exclusive() and open_by_devnum() can use further cleanup -
    rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
    special features. Well, let's leave them for another day.

    Most conversions are straight-forward. drbd conversion is a bit more
    involved as there was some reordering, but the logic should stay the
    same.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Brown
    Acked-by: Ryusuke Konishi
    Acked-by: Mike Snitzer
    Acked-by: Philipp Reisner
    Cc: Peter Osterlund
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Jan Kara
    Cc: Andrew Morton
    Cc: Andreas Dilger
    Cc: "Theodore Ts'o"
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Alex Elder
    Cc: Christoph Hellwig
    Cc: dm-devel@redhat.com
    Cc: drbd-dev@lists.linbit.com
    Cc: Leo Chen
    Cc: Scott Branden
    Cc: Chris Mason
    Cc: Steven Whitehouse
    Cc: Dave Kleikamp
    Cc: Joern Engel
    Cc: reiserfs-devel@vger.kernel.org
    Cc: Alexander Viro

    Tejun Heo
     

27 Oct, 2010

1 commit

  • System management wants to subscribe to changes in swap configuration.
    Make /proc/swaps pollable like /proc/mounts.

    [akpm@linux-foundation.org: document proc_poll_event]
    Signed-off-by: Kay Sievers
    Acked-by: Greg KH
    Cc: Jonathan Corbet
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kay Sievers
     

19 Oct, 2010

1 commit


17 Sep, 2010

1 commit

  • All the blkdev_issue_* helpers can only sanely be used for synchronous
    caller. To issue cache flushes or barriers asynchronously the caller needs
    to set up a bio by itself with a completion callback to move the asynchronous
    state machine ahead. So drop the BLKDEV_IFL_WAIT flag that is always
    specified when calling blkdev_issue_* and also remove the now unused flags
    argument to blkdev_issue_flush and blkdev_issue_zeroout. For
    blkdev_issue_discard we need to keep it for the secure discard flag, which
    gains a more descriptive name and loses the bitops vs flag confusion.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

10 Sep, 2010

5 commits

  • The swap code already uses synchronous discards, no need to add I/O barriers.

    tj: superflous newlines removed.

    Signed-off-by: Christoph Hellwig
    Acked-by: Hugh Dickins
    Tested-by: Nigel Cunningham
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Tests with recent firmware on Intel X25-M 80GB and OCZ Vertex 60GB SSDs
    show a shift since I last tested in December: in part because of firmware
    updates, in part because of the necessary move from barriers to awaiting
    completion at the block layer. While discard at swapon still shows as
    slightly beneficial on both, discarding 1MB swap cluster when allocating
    is now disadvanteous: adds 25% overhead on Intel, adds 230% on OCZ (YMMV).

    Surrender: discard as presently implemented is more hindrance than help
    for swap; but might prove useful on other devices, or with improvements.
    So continue to do the discard at swapon, but make discard while swapping
    conditional on a SWAP_FLAG_DISCARD to sys_swapon() (which has been using
    only the lower 16 bits of int flags).

    We can add a --discard or -d to swapon(8), and a "discard" to swap in
    /etc/fstab: matching the mount option for btrfs, ext4, fat, gfs2, nilfs2.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Hellwig
    Cc: Nigel Cunningham
    Cc: Tejun Heo
    Cc: Jens Axboe
    Cc: James Bottomley
    Cc: "Martin K. Petersen"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The swap code already uses synchronous discards, no need to add I/O
    barriers.

    This fixes the worst of the terrible slowdown in swap allocation for
    hibernation, reported on 2.6.35 by Nigel Cunningham; but does not entirely
    eliminate that regression.

    [tj@kernel.org: superflous newlines removed]
    Signed-off-by: Christoph Hellwig
    Tested-by: Nigel Cunningham
    Signed-off-by: Tejun Heo
    Signed-off-by: Hugh Dickins
    Cc: Jens Axboe
    Cc: James Bottomley
    Cc: "Martin K. Petersen"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Move the hibernation check from scan_swap_map() into try_to_free_swap():
    to catch not only the common case when hibernation's allocation itself
    triggers swap reuse, but also the less likely case when concurrent page
    reclaim (shrink_page_list) might happen to try_to_free_swap from a page.

    Hibernation already clears __GFP_IO from the gfp_allowed_mask, to stop
    reclaim from going to swap: check that to prevent swap reuse too.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: "Rafael J. Wysocki"
    Cc: Ondrej Zary
    Cc: Andrea Gelmini
    Cc: Balbir Singh
    Cc: Andrea Arcangeli
    Cc: Nigel Cunningham
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Please revert 2.6.36-rc commit d2997b1042ec150616c1963b5e5e919ffd0b0ebf
    "hibernation: freeze swap at hibernation". It complicated matters by
    adding a second swap allocation path, just for hibernation; without in any
    way fixing the issue that it was intended to address - page reclaim after
    fixing the hibernation image might free swap from a page already imaged as
    swapcache, letting its swap be reallocated to store a different page of
    the image: resulting in data corruption if the imaged page were freed as
    clean then swapped back in. Pages freed to si->swap_map were still in
    danger of being reallocated by the alternative allocation path.

    I guess it inadvertently fixed slow SSD swap allocation for hibernation,
    as reported by Nigel Cunningham: by missing out the discards that occur on
    the usual swap allocation path; but that was unintentional, and needs a
    separate fix.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: "Rafael J. Wysocki"
    Cc: Ondrej Zary
    Cc: Andrea Gelmini
    Cc: Balbir Singh
    Cc: Andrea Arcangeli
    Cc: Nigel Cunningham
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

10 Aug, 2010

2 commits

  • When taking a memory snapshot in hibernate_snapshot(), all (directly
    called) memory allocations use GFP_ATOMIC. Hence swap misusage during
    hibernation never occurs.

    But from a pessimistic point of view, there is no guarantee that no page
    allcation has __GFP_WAIT. It is better to have a global indication "we
    enter hibernation, don't use swap!".

    This patch tries to freeze new-swap-allocation during hibernation. (All
    user processes are frozenm so swapin is not a concern).

    This way, no updates will happen to swap_map[] between
    hibernate_snapshot() and save_image(). Swap is thawed when swsusp_free()
    is called. We can be assured that swap corruption will not occur.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Rafael J. Wysocki"
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Ondrej Zary
    Cc: Balbir Singh
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Since 2.6.31, swap_map[]'s refcounting was changed to show that a used
    swap entry is just for swap-cache, can be reused. Then, while scanning
    free entry in swap_map[], a swap entry may be able to be reclaimed and
    reused. It was caused by commit c9e444103b5e7a5 ("mm: reuse unused swap
    entry if necessary").

    But this caused deta corruption at resume. The scenario is

    - Assume a clean-swap cache, but mapped.

    - at hibernation_snapshot[], clean-swap-cache is saved as
    clean-swap-cache and swap_map[] is marked as SWAP_HAS_CACHE.

    - then, save_image() is called. And reuse SWAP_HAS_CACHE entry to save
    image, and break the contents.

    After resume:

    - the memory reclaim runs and finds clean-not-referenced-swap-cache and
    discards it because it's marked as clean. But here, the contents on
    disk and swap-cache is inconsistent.

    Hance memory is corrupted.

    This patch avoids the bug by not reclaiming swap-entry during hibernation.
    This is a quick fix for backporting.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Rafael J. Wysocki
    Reported-by: Ondreg Zary
    Tested-by: Ondreg Zary
    Tested-by: Andrea Gelmini
    Acked-by: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

22 May, 2010

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6: (577 commits)
    Staging: ramzswap: Handler for swap slot free callback
    swap: Add swap slot free callback to block_device_operations
    swap: Add flag to identify block swap devices
    Staging: vt6655: use ETH_FRAME_LEN macro instead of custom one
    Staging: vt6655: use ETH_DATA_LEN macro instead of custom one
    Staging: vt6655: use ETH_FCS_LEN macro instead of custom one
    Staging: vt6656: use ETH_HLEN macro instead of custom one
    Staging: comedi: quatech_daqp_cs.c Replace eos semaphore with a completion.
    Staging: dt3155v4l: remove private memory allocator
    Staging: crystalhd: Remove typedefs from driver
    Staging: winbond: Fix for pointer name format issue in mds.c
    Staging: vt6656: removed custom UCHAR/USHORT/UINT/ULONG/ULONGLONG typedefs
    Staging: vt6656: removed custom CHAR/SHORT/INT/LONG typedefs
    Staging: comedi: Altered the way printk is used in 8255.c
    staging: iio: adis16350 and similar IMU driver
    Staging: iio: max1363 Fix two bugs in single_channel_from_ring
    Staging: iio: adis16220 extract bin_attribute structures from state
    Staging: iio: adis16220 vibration sensor driver
    Staging: comedi: Kconfig dependancy fixes
    Staging: comedi: fix up build error from last Kconfig changes
    ...

    Linus Torvalds
     

19 May, 2010

2 commits

  • This callback is required when RAM based devices are used as swap disks.
    One such device is ramzswap which is used as compressed in-memory swap
    disk. For such devices, we need a callback as soon as a swap slot is no
    longer used to allow freeing memory allocated for this slot. Without this
    callback, stale data can quickly accumulate in memory defeating the whole
    purpose of such devices.

    Signed-off-by: Nitin Gupta
    Acked-by: Linus Torvalds
    Acked-by: Nigel Cunningham
    Acked-by: Pekka Enberg
    Reviewed-by: Minchan Kim
    Signed-off-by: Greg Kroah-Hartman

    Nitin Gupta
     
  • Added SWP_BLKDEV flag to distinguish block and regular file backed
    swap devices. We could also check if a swap is entire block device,
    rather than a file, by:
    S_ISBLK(swap_info_struct->swap_file->f_mapping->host->i_mode)
    but, I think, simply checking this flag is more convenient.

    Signed-off-by: Nitin Gupta
    Acked-by: Linus Torvalds
    Acked-by: Nigel Cunningham
    Acked-by: Pekka Enberg
    Reviewed-by: Minchan Kim
    Signed-off-by: Greg Kroah-Hartman

    Nitin Gupta
     

29 Apr, 2010

1 commit


13 Mar, 2010

1 commit

  • This patch is another core part of this move-charge-at-task-migration
    feature. It enables moving charges of anonymous swaps.

    To move the charge of swap, we need to exchange swap_cgroup's record.

    In current implementation, swap_cgroup's record is protected by:

    - page lock: if the entry is on swap cache.
    - swap_lock: if the entry is not on swap cache.

    This works well in usual swap-in/out activity.

    But this behavior make the feature of moving swap charge check many
    conditions to exchange swap_cgroup's record safely.

    So I changed modification of swap_cgroup's recored(swap_cgroup_record())
    to use xchg, and define a new function to cmpxchg swap_cgroup's record.

    This patch also enables moving charge of non pte_present but not uncharged
    swap caches, which can be exist on swap-out path, by getting the target
    pages via find_get_page() as do_mincore() does.

    [kosaki.motohiro@jp.fujitsu.com: fix ia64 build]
    [akpm@linux-foundation.org: fix typos]
    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: Daisuke Nishimura
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     

07 Mar, 2010

4 commits

  • swap_duplicate()'s loop appears to miss out on returning the error code
    from __swap_duplicate(), except when that's -ENOMEM. In fact this is
    intentional: prior to -ENOMEM for swap_count_continuation,
    swap_duplicate() was void (and the case only occurs when copy_one_pte()
    hits a corrupt pte). But that's surprising behaviour, which certainly
    deserves a comment.

    Signed-off-by: Hugh Dickins
    Reported-by: Huang Shijie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • There's an off-by-one disagreement between mkswap and swapon about the
    meaning of swap_header last_page: mkswap (in all versions I've looked at:
    util-linux-ng and BusyBox and old util-linux; probably as far back as
    1999) consistently means the offset (in page units) of the last page of
    the swap area, whereas kernel sys_swapon (as far back as 2.2 and 2.3)
    strangely takes it to mean the size (in page units) of the swap area.

    This disagreement is the safe way round; but it's worrying people, and
    loses us one page of swap.

    The fix is not just to add one to nr_good_pages: we need to get maxpages
    (the size of the swap_map array) right before that; and though that is an
    unsigned long, be careful not to overflow the unsigned int p->max which
    later holds it (probably why header uses __u32 last_page instead of size).

    Why did we subtract one from the maximum swp_offset to calculate maxpages?
    Though it was probably me who made that change in 2.4.10, I don't get it:
    and now we should be adding one (without risk of overflow in this case).

    Fix the handling of swap_header badpages: it could have overrun the
    swap_map when very large swap area used on a more limited architecture.

    Remove pre-initializations of swap_header, nr_good_pages and maxpages:
    those date from when sys_swapon was supporting other versions of header.

    Reported-by: Nitin Gupta
    Reported-by: Jarkko Lavinen
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • A frequent questions from users about memory management is what numbers of
    swap ents are user for processes. And this information will give some
    hints to oom-killer.

    Besides we can count the number of swapents per a process by scanning
    /proc//smaps, this is very slow and not good for usual process
    information handler which works like 'ps' or 'top'. (ps or top is now
    enough slow..)

    This patch adds a counter of swapents to mm_counter and update is at each
    swap events. Information is exported via /proc//status file as

    [kamezawa@bluextal memory]$ cat /proc/self/status
    Name: cat
    State: R (running)
    Tgid: 2910
    Pid: 2910
    PPid: 2823
    TracerPid: 0
    Uid: 500 500 500 500
    Gid: 500 500 500 500
    FDSize: 256
    Groups: 500
    VmPeak: 82696 kB
    VmSize: 82696 kB
    VmLck: 0 kB
    VmHWM: 432 kB
    VmRSS: 432 kB
    VmData: 172 kB
    VmStk: 84 kB
    VmExe: 48 kB
    VmLib: 1568 kB
    VmPTE: 40 kB
    VmSwap: 0 kB
    Reviewed-by: Minchan Kim
    Reviewed-by: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Presently, per-mm statistics counter is defined by macro in sched.h

    This patch modifies it to
    - defined in mm.h as inlinf functions
    - use array instead of macro's name creation.

    This patch is for reducing patch size in future patch to modify
    implementation of per-mm counter.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

16 Dec, 2009

11 commits

  • Initial implementation for swapping out KSM's shared pages: add
    page_referenced_ksm() and try_to_unmap_ksm(), which rmap.c calls when
    faced with a PageKsm page.

    Most of what's needed can be got from the rmap_items listed from the
    stable_node of the ksm page, without discovering the actual vma: so in
    this patch just fake up a struct vma for page_referenced_one() or
    try_to_unmap_one(), then refine that in the next patch.

    Add VM_NONLINEAR to ksm_madvise()'s list of exclusions: it has always been
    implicit there (being only set with VM_SHARED, already excluded), but
    let's make it explicit, to help justify the lack of nonlinear unmap.

    Rely on the page lock to protect against concurrent modifications to that
    page's node of the stable tree.

    The awkward part is not swapout but swapin: do_swap_page() and
    page_add_anon_rmap() now have to allow for new possibilities - perhaps a
    ksm page still in swapcache, perhaps a swapcache page associated with one
    location in one anon_vma now needed for another location or anon_vma.
    (And the vma might even be no longer VM_MERGEABLE when that happens.)

    ksm_might_need_to_copy() checks for that case, and supplies a duplicate
    page when necessary, simply leaving it to a subsequent pass of ksmd to
    rediscover the identity and merge them back into one ksm page.
    Disappointingly primitive: but the alternative would have to accumulate
    unswappable info about the swapped out ksm pages, limiting swappability.

    Remove page_add_ksm_rmap(): page_add_anon_rmap() now has to allow for the
    particular case it was handling, so just use it instead.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • At present we define PageAnon(page) by the low PAGE_MAPPING_ANON bit set
    in page->mapping, with the higher bits a pointer to the anon_vma; and have
    defined PageKsm(page) as that with NULL anon_vma.

    But KSM swapping will need to store a pointer there: so in preparation for
    that, now define PAGE_MAPPING_FLAGS as the low two bits, including
    PAGE_MAPPING_KSM (always set along with PAGE_MAPPING_ANON, until some
    other use for the bit emerges).

    Declare page_rmapping(page) to return the pointer part of page->mapping,
    and page_anon_vma(page) to return the anon_vma pointer when that's what it
    is. Use these in a few appropriate places: notably, unuse_vma() has been
    testing page->mapping, but is better to be testing page_anon_vma() (cases
    may be added in which flag bits are set without any pointer).

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Reviewed-by: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Seems that page_io.c doesn't really need to know that page_private(page)
    is the swp_entry 'val'. Rework map_swap_page() to do what its name says
    and map a page to a page offset in the swap space.

    The only other caller of map_swap_page() is internal to mm/swapfile.c and
    it does want to map a swap entry to the 'sector'. So rename
    map_swap_page() to map_swap_entry(), make it 'static' and and implement
    map_swap_page() as a wrapper around that.

    Signed-off-by: Lee Schermerhorn
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • While we're fiddling with the swap_map values, let's assign a particular
    value to shmem/tmpfs swap pages: their swap counts are never incremented,
    and it helps swapoff's try_to_unuse() a little if it can immediately
    distinguish those pages from process pages.

    Since we've no use for SWAP_MAP_BAD | COUNT_CONTINUED,
    we might as well use that 0xbf value for SWAP_MAP_SHMEM.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Swap is duplicated (reference count incremented by one) whenever the same
    swap page is inserted into another mm (when forking finds a swap entry in
    place of a pte, or when reclaim unmaps a pte to insert the swap entry).

    swap_info_struct's vmalloc'ed swap_map is the array of these reference
    counts: but what happens when the unsigned short (or unsigned char since
    the preceding patch) is full? (and its high bit is kept for a cache flag)

    We then lose track of it, never freeing, leaving it in use until swapoff:
    at which point we _hope_ that a single pass will have found all instances,
    assume there are no more, and will lose user data if we're wrong.

    Swapping of KSM pages has not yet been enabled; but it is implemented,
    and makes it very easy for a user to overflow the maximum swap count:
    possible with ordinary process pages, but unlikely, even when pid_max
    has been raised from PID_MAX_DEFAULT.

    This patch implements swap count continuations: when the count overflows,
    a continuation page is allocated and linked to the original vmalloc'ed
    map page, and this used to hold the continuation counts for that entry
    and its neighbours. These continuation pages are seldom referenced:
    the common paths all work on the original swap_map, only referring to
    a continuation page when the low "digit" of a count is incremented or
    decremented through SWAP_MAP_MAX.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Halve the vmalloc'ed swap_map array from unsigned shorts to unsigned
    chars: it's still very unusual to reach a swap count of 126, and the
    next patch allows it to be extended indefinitely.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Though swap_count() is useful, I'm finding that swap_has_cache() and
    encode_swapmap() obscure what happens in the swap_map entry, just at
    those points where I need to understand it. Remove them, and pass
    more usable "usage" values to scan_swap_map(), swap_entry_free() and
    __swap_duplicate(), instead of the SWAP_MAP and SWAP_CACHE enum.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Move CONFIG_HIBERNATION's swapdev_block() into the main CONFIG_HIBERNATION
    block, remove extraneous whitespace and return, fix typo in a comment.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Make better use of the space by folding first swap_extent into its
    swap_info_struct, instead of just the list_head: swap partitions need
    only that one, and for others it's used as a circular list anyway.

    [jirislaby@gmail.com: fix crash on double swapon]
    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Jiri Slaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The swap_info_struct is only 76 or 104 bytes, but it does seem wrong
    to reserve an array of about 30 of them in bss, when most people will
    want only one. Change swap_info[] to an array of pointers.

    That does need a "type" field in the structure: pack it as a char with
    next type and short prio (aha, char is unsigned by default on PowerPC).
    Use the (admittedly peculiar) name "type" throughout for this index.

    /proc/swaps does not take swap_lock: I wouldn't want it to, but do take
    care with barriers when adding a new item to the array (never removed).

    Signed-off-by: Hugh Dickins
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The swap_info_struct is mostly private to mm/swapfile.c, with only
    one other in-tree user: get_swap_bio(). Adjust its interface to
    map_swap_page(), so that we can then remove get_swap_info_struct().

    But there is a popular user out-of-tree, TuxOnIce: so leave the
    declaration of swap_info_struct in linux/swap.h.

    Signed-off-by: Hugh Dickins
    Cc: Nigel Cunningham
    Cc: KAMEZAWA Hiroyuki
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

03 Nov, 2009

1 commit

  • In try_to_unuse(), swcount is a local copy of *swap_map, including the
    SWAP_HAS_CACHE bit; but a wrong comparison against swap_count(*swap_map),
    which masks off the SWAP_HAS_CACHE bit, succeeded where it should fail.

    That had the effect of resetting the mm from which to start searching
    for the next swap page, to an irrelevant mm instead of to an mm in which
    this swap page had been found: which may increase search time by ~20%.
    But we're used to swapoff being slow, so never noticed the slowdown.

    Remove that one spurious use of swap_count(): Bo Liu thought it merely
    redundant, Hugh rewrote the description since it was measurably wrong.

    Signed-off-by: Bo Liu
    Signed-off-by: Hugh Dickins
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Bo Liu
     

02 Oct, 2009

1 commit

  • While testing Swap over NFS patchset, I noticed an oops that was triggered
    during swapon. Investigating further, the NULL pointer deference is due to the
    SSD device check/optimization in the swapon code that assumes s_bdev could never
    be NULL.

    inode->i_sb->s_bdev could be NULL in a few cases. For e.g. one such case is
    loopback NFS mount, there could be others as well. Fix this by ensuring s_bdev
    is not NULL before we try to deference s_bdev.

    Signed-off-by: Suresh Jayaraman
    Signed-off-by: Jens Axboe

    Suresh Jayaraman
     

24 Sep, 2009

1 commit

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
    HWPOISON: Enable error_remove_page on btrfs
    HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
    HWPOISON: Add madvise() based injector for hardware poisoned pages v4
    HWPOISON: Enable error_remove_page for NFS
    HWPOISON: Enable .remove_error_page for migration aware file systems
    HWPOISON: The high level memory error handler in the VM v7
    HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
    HWPOISON: shmem: call set_page_dirty() with locked page
    HWPOISON: Define a new error_remove_page address space op for async truncation
    HWPOISON: Add invalidate_inode_page
    HWPOISON: Refactor truncate to allow direct truncating of page v2
    HWPOISON: check and isolate corrupted free pages v2
    HWPOISON: Handle hardware poisoned pages in try_to_unmap
    HWPOISON: Use bitmask/action code for try_to_unmap behaviour
    HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
    HWPOISON: Add poison check to page fault handling
    HWPOISON: Add basic support for poisoned pages in fault handler v3
    HWPOISON: Add new SIGBUS error codes for hardware poison signals
    HWPOISON: Add support for poison swap entries v2
    HWPOISON: Export some rmap vma locking to outside world
    ...

    Linus Torvalds
     

22 Sep, 2009

1 commit

  • Just as the swapoff system call allocates many pages of RAM to various
    processes, perhaps triggering OOM, so "echo 2 >/sys/kernel/mm/ksm/run"
    (unmerge) is liable to allocate many pages of RAM to various processes,
    perhaps triggering OOM; and each is normally run from a modest admin
    process (swapoff or shell), easily repeated until it succeeds.

    So treat unmerge_and_remove_all_rmap_items() in the same way that we treat
    try_to_unuse(): generalize PF_SWAPOFF to PF_OOM_ORIGIN, and bracket both
    with that, to ask the OOM killer to kill them first, to prevent them from
    spawning more and more OOM kills.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

16 Sep, 2009

1 commit

  • Memory migration uses special swap entry types to trigger special actions on
    page faults. Extend this mechanism to also support poisoned swap entries, to
    trigger poison handling on page faults. This allows follow-on patches to
    prevent processes from faulting in poisoned pages again.

    v2: Fix overflow in MAX_SWAPFILES (Fengguang Wu)
    v3: Better overflow fix (Hidehiro Kawai)

    Signed-off-by: Andi Kleen

    Andi Kleen
     

14 Sep, 2009

1 commit

  • blk_ioctl_discard duplicates large amounts of code from blkdev_issue_discard,
    the only difference between the two is that blkdev_issue_discard needs to
    send a barrier discard request and blk_ioctl_discard a non-barrier one,
    and blk_ioctl_discard needs to wait on the request. To facilitates this
    add a flags argument to blkdev_issue_discard to control both aspects of the
    behaviour. This will be very useful later on for using the waiting
    funcitonality for other callers.

    Based on an earlier patch from Matthew Wilcox .

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

30 Jul, 2009

1 commit

  • Create bdgrab(). This function copies an existing reference to a
    block_device. It is safe to call from any context.

    Hibernation code wishes to copy a reference to the active swap device.
    Right now it calls bdget() under a spinlock, but this is wrong because
    bdget() can sleep. It doesn't need a full bdget() because we already
    hold a reference to active swap devices (and the spinlock protects
    against swapoff).

    Fixes http://bugzilla.kernel.org/show_bug.cgi?id=13827

    Signed-off-by: Alan Jenkins
    Signed-off-by: Rafael J. Wysocki

    Alan Jenkins
     

19 Jun, 2009

1 commit

  • This patch fixes mis-accounting of swap usage in memcg.

    In the current implementation, memcg's swap account is uncharged only when
    swap is completely freed. But there are several cases where swap cannot
    be freed cleanly. For handling that, this patch changes that memcg
    uncharges swap account when swap has no references other than cache.

    By this, memcg's swap entry accounting can be fully synchronous with the
    application's behavior.

    This patch also changes memcg's hooks for swap-out.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Acked-by: Balbir Singh
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Li Zefan
    Cc: Dhaval Giani
    Cc: YAMAMOTO Takashi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki