Doug / smarc-fsl-linux-kernel | Embedian Git Server

10 Aug, 2010

2 commits

d2997b104 hibernation: freeze swap at hibernation ... Browse Code »

When taking a memory snapshot in hibernate_snapshot(), all (directly
called) memory allocations use GFP_ATOMIC. Hence swap misusage during
hibernation never occurs.

But from a pessimistic point of view, there is no guarantee that no page
allcation has __GFP_WAIT. It is better to have a global indication "we
enter hibernation, don't use swap!".

This patch tries to freeze new-swap-allocation during hibernation. (All
user processes are frozenm so swapin is not a concern).

This way, no updates will happen to swap_map[] between
hibernate_snapshot() and save_image(). Swap is thawed when swsusp_free()
is called. We can be assured that swap corruption will not occur.

Signed-off-by: KAMEZAWA Hiroyuki
Cc: "Rafael J. Wysocki"
Cc: Hugh Dickins
Cc: KOSAKI Motohiro
Cc: Ondrej Zary
Cc: Balbir Singh
Cc: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2010-08-10 11:45:04 +0800
966cca029 mm: fix corruption of hibernation caused by reusing swap during image saving ... Browse Code »

Since 2.6.31, swap_map[]'s refcounting was changed to show that a used
swap entry is just for swap-cache, can be reused. Then, while scanning
free entry in swap_map[], a swap entry may be able to be reclaimed and
reused. It was caused by commit c9e444103b5e7a5 ("mm: reuse unused swap
entry if necessary").

But this caused deta corruption at resume. The scenario is

- Assume a clean-swap cache, but mapped.

- at hibernation_snapshot[], clean-swap-cache is saved as
clean-swap-cache and swap_map[] is marked as SWAP_HAS_CACHE.

- then, save_image() is called. And reuse SWAP_HAS_CACHE entry to save
image, and break the contents.

After resume:

- the memory reclaim runs and finds clean-not-referenced-swap-cache and
discards it because it's marked as clean. But here, the contents on
disk and swap-cache is inconsistent.

Hance memory is corrupted.

This patch avoids the bug by not reclaiming swap-entry during hibernation.
This is a quick fix for backporting.

Signed-off-by: KAMEZAWA Hiroyuki
Cc: Rafael J. Wysocki
Reported-by: Ondreg Zary
Tested-by: Ondreg Zary
Tested-by: Andrea Gelmini
Acked-by: Hugh Dickins
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2010-08-10 11:45:04 +0800

22 May, 2010

1 commit

d79df0b1e Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6 ... Browse Code »

* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6: (577 commits)
Staging: ramzswap: Handler for swap slot free callback
swap: Add swap slot free callback to block_device_operations
swap: Add flag to identify block swap devices
Staging: vt6655: use ETH_FRAME_LEN macro instead of custom one
Staging: vt6655: use ETH_DATA_LEN macro instead of custom one
Staging: vt6655: use ETH_FCS_LEN macro instead of custom one
Staging: vt6656: use ETH_HLEN macro instead of custom one
Staging: comedi: quatech_daqp_cs.c Replace eos semaphore with a completion.
Staging: dt3155v4l: remove private memory allocator
Staging: crystalhd: Remove typedefs from driver
Staging: winbond: Fix for pointer name format issue in mds.c
Staging: vt6656: removed custom UCHAR/USHORT/UINT/ULONG/ULONGLONG typedefs
Staging: vt6656: removed custom CHAR/SHORT/INT/LONG typedefs
Staging: comedi: Altered the way printk is used in 8255.c
staging: iio: adis16350 and similar IMU driver
Staging: iio: max1363 Fix two bugs in single_channel_from_ring
Staging: iio: adis16220 extract bin_attribute structures from state
Staging: iio: adis16220 vibration sensor driver
Staging: comedi: Kconfig dependancy fixes
Staging: comedi: fix up build error from last Kconfig changes
...

Linus Torvalds
2010-05-22 06:26:46 +0800

19 May, 2010

2 commits

b3a27d052 swap: Add swap slot free callback to block_device_operations ... Browse Code »

This callback is required when RAM based devices are used as swap disks.
One such device is ramzswap which is used as compressed in-memory swap
disk. For such devices, we need a callback as soon as a swap slot is no
longer used to allow freeing memory allocated for this slot. Without this
callback, stale data can quickly accumulate in memory defeating the whole
purpose of such devices.

Signed-off-by: Nitin Gupta
Acked-by: Linus Torvalds
Acked-by: Nigel Cunningham
Acked-by: Pekka Enberg
Reviewed-by: Minchan Kim
Signed-off-by: Greg Kroah-Hartman

Nitin Gupta
2010-05-19 06:07:52 +0800
b27256439 swap: Add flag to identify block swap devices ... Browse Code »

Added SWP_BLKDEV flag to distinguish block and regular file backed
swap devices. We could also check if a swap is entire block device,
rather than a file, by:
S_ISBLK(swap_info_struct->swap_file->f_mapping->host->i_mode)
but, I think, simply checking this flag is more convenient.

Signed-off-by: Nitin Gupta
Acked-by: Linus Torvalds
Acked-by: Nigel Cunningham
Acked-by: Pekka Enberg
Reviewed-by: Minchan Kim
Signed-off-by: Greg Kroah-Hartman

Nitin Gupta
2010-05-19 06:07:52 +0800

29 Apr, 2010

1 commit

fbd9b09a1 blkdev: generalize flags for blkdev_issue_fn functions ... Browse Code »

The patch just convert all blkdev_issue_xxx function to common
set of flags. Wait/allocation semantics preserved.

Signed-off-by: Dmitry Monakhov
Signed-off-by: Jens Axboe

Dmitry Monakhov
2010-04-29 01:47:36 +0800

13 Mar, 2010

1 commit

024914477 memcg: move charges of anonymous swap ... Browse Code »

This patch is another core part of this move-charge-at-task-migration
feature. It enables moving charges of anonymous swaps.

To move the charge of swap, we need to exchange swap_cgroup's record.

In current implementation, swap_cgroup's record is protected by:

- page lock: if the entry is on swap cache.
- swap_lock: if the entry is not on swap cache.

This works well in usual swap-in/out activity.

But this behavior make the feature of moving swap charge check many
conditions to exchange swap_cgroup's record safely.

So I changed modification of swap_cgroup's recored(swap_cgroup_record())
to use xchg, and define a new function to cmpxchg swap_cgroup's record.

This patch also enables moving charge of non pte_present but not uncharged
swap caches, which can be exist on swap-out path, by getting the target
pages via find_get_page() as do_mincore() does.

[kosaki.motohiro@jp.fujitsu.com: fix ia64 build]
[akpm@linux-foundation.org: fix typos]
Signed-off-by: Daisuke Nishimura
Cc: Balbir Singh
Acked-by: KAMEZAWA Hiroyuki
Cc: Li Zefan
Cc: Paul Menage
Cc: Daisuke Nishimura
Signed-off-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Daisuke Nishimura
2010-03-13 07:52:36 +0800

07 Mar, 2010

4 commits

08259d58e mm: add comment on swap_duplicate's error code ... Browse Code »

swap_duplicate()'s loop appears to miss out on returning the error code
from __swap_duplicate(), except when that's -ENOMEM. In fact this is
intentional: prior to -ENOMEM for swap_count_continuation,
swap_duplicate() was void (and the case only occurs when copy_one_pte()
hits a corrupt pte). But that's surprising behaviour, which certainly
deserves a comment.

Signed-off-by: Hugh Dickins
Reported-by: Huang Shijie
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2010-03-07 03:26:27 +0800
ad2bd7e0e mm/swapfile.c: fix swapon size off-by-one ... Browse Code »

There's an off-by-one disagreement between mkswap and swapon about the
meaning of swap_header last_page: mkswap (in all versions I've looked at:
util-linux-ng and BusyBox and old util-linux; probably as far back as
1999) consistently means the offset (in page units) of the last page of
the swap area, whereas kernel sys_swapon (as far back as 2.2 and 2.3)
strangely takes it to mean the size (in page units) of the swap area.

This disagreement is the safe way round; but it's worrying people, and
loses us one page of swap.

The fix is not just to add one to nr_good_pages: we need to get maxpages
(the size of the swap_map array) right before that; and though that is an
unsigned long, be careful not to overflow the unsigned int p->max which
later holds it (probably why header uses __u32 last_page instead of size).

Why did we subtract one from the maximum swp_offset to calculate maxpages?
Though it was probably me who made that change in 2.4.10, I don't get it:
and now we should be adding one (without risk of overflow in this case).

Fix the handling of swap_header badpages: it could have overrun the
swap_map when very large swap area used on a more limited architecture.

Remove pre-initializations of swap_header, nr_good_pages and maxpages:
those date from when sys_swapon was supporting other versions of header.

Reported-by: Nitin Gupta
Reported-by: Jarkko Lavinen
Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2010-03-07 03:26:26 +0800
b084d4353 mm: count swap usage ... Browse Code »

A frequent questions from users about memory management is what numbers of
swap ents are user for processes. And this information will give some
hints to oom-killer.

Besides we can count the number of swapents per a process by scanning
/proc//smaps, this is very slow and not good for usual process
information handler which works like 'ps' or 'top'. (ps or top is now
enough slow..)

This patch adds a counter of swapents to mm_counter and update is at each
swap events. Information is exported via /proc//status file as

[kamezawa@bluextal memory]$ cat /proc/self/status
Name: cat
State: R (running)
Tgid: 2910
Pid: 2910
PPid: 2823
TracerPid: 0
Uid: 500 500 500 500
Gid: 500 500 500 500
FDSize: 256
Groups: 500
VmPeak: 82696 kB
VmSize: 82696 kB
VmLck: 0 kB
VmHWM: 432 kB
VmRSS: 432 kB
VmData: 172 kB
VmStk: 84 kB
VmExe: 48 kB
VmLib: 1568 kB
VmPTE: 40 kB
VmSwap: 0 kB
Reviewed-by: Minchan Kim
Reviewed-by: Christoph Lameter
Cc: Lee Schermerhorn
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2010-03-07 03:26:24 +0800
d559db086 mm: clean up mm_counter ... Browse Code »

Presently, per-mm statistics counter is defined by macro in sched.h

This patch modifies it to
- defined in mm.h as inlinf functions
- use array instead of macro's name creation.

This patch is for reducing patch size in future patch to modify
implementation of per-mm counter.

Signed-off-by: KAMEZAWA Hiroyuki
Reviewed-by: Minchan Kim
Cc: Christoph Lameter
Cc: Lee Schermerhorn
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2010-03-07 03:26:23 +0800

16 Dec, 2009

11 commits

5ad646880 ksm: let shared pages be swappable ... Browse Code »

Initial implementation for swapping out KSM's shared pages: add
page_referenced_ksm() and try_to_unmap_ksm(), which rmap.c calls when
faced with a PageKsm page.

Most of what's needed can be got from the rmap_items listed from the
stable_node of the ksm page, without discovering the actual vma: so in
this patch just fake up a struct vma for page_referenced_one() or
try_to_unmap_one(), then refine that in the next patch.

Add VM_NONLINEAR to ksm_madvise()'s list of exclusions: it has always been
implicit there (being only set with VM_SHARED, already excluded), but
let's make it explicit, to help justify the lack of nonlinear unmap.

Rely on the page lock to protect against concurrent modifications to that
page's node of the stable tree.

The awkward part is not swapout but swapin: do_swap_page() and
page_add_anon_rmap() now have to allow for new possibilities - perhaps a
ksm page still in swapcache, perhaps a swapcache page associated with one
location in one anon_vma now needed for another location or anon_vma.
(And the vma might even be no longer VM_MERGEABLE when that happens.)

ksm_might_need_to_copy() checks for that case, and supplies a duplicate
page when necessary, simply leaving it to a subsequent pass of ksmd to
rediscover the identity and merge them back into one ksm page.
Disappointingly primitive: but the alternative would have to accumulate
unswappable info about the swapped out ksm pages, limiting swappability.

Remove page_add_ksm_rmap(): page_add_anon_rmap() now has to allow for the
particular case it was handling, so just use it instead.

Signed-off-by: Hugh Dickins
Cc: Izik Eidus
Cc: Andrea Arcangeli
Cc: Chris Wright
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2009-12-16 00:53:19 +0800
3ca7b3c5b mm: define PAGE_MAPPING_FLAGS ... Browse Code »

At present we define PageAnon(page) by the low PAGE_MAPPING_ANON bit set
in page->mapping, with the higher bits a pointer to the anon_vma; and have
defined PageKsm(page) as that with NULL anon_vma.

But KSM swapping will need to store a pointer there: so in preparation for
that, now define PAGE_MAPPING_FLAGS as the low two bits, including
PAGE_MAPPING_KSM (always set along with PAGE_MAPPING_ANON, until some
other use for the bit emerges).

Declare page_rmapping(page) to return the pointer part of page->mapping,
and page_anon_vma(page) to return the anon_vma pointer when that's what it
is. Use these in a few appropriate places: notably, unuse_vma() has been
testing page->mapping, but is better to be testing page_anon_vma() (cases
may be added in which flag bits are set without any pointer).

Signed-off-by: Hugh Dickins
Cc: Izik Eidus
Cc: Andrea Arcangeli
Cc: Nick Piggin
Cc: KOSAKI Motohiro
Reviewed-by: Rik van Riel
Cc: Lee Schermerhorn
Cc: Andi Kleen
Cc: KAMEZAWA Hiroyuki
Cc: Wu Fengguang
Cc: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2009-12-16 00:53:17 +0800
d4906e1aa swap: rework map_swap_page() again ... Browse Code »

Seems that page_io.c doesn't really need to know that page_private(page)
is the swp_entry 'val'. Rework map_swap_page() to do what its name says
and map a page to a page offset in the swap space.

The only other caller of map_swap_page() is internal to mm/swapfile.c and
it does want to map a swap entry to the 'sector'. So rename
map_swap_page() to map_swap_entry(), make it 'static' and and implement
map_swap_page() as a wrapper around that.

Signed-off-by: Lee Schermerhorn
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Lee Schermerhorn
2009-12-16 00:53:16 +0800
aaa468653 swap_info: note SWAP_MAP_SHMEM ... Browse Code »

While we're fiddling with the swap_map values, let's assign a particular
value to shmem/tmpfs swap pages: their swap counts are never incremented,
and it helps swapoff's try_to_unuse() a little if it can immediately
distinguish those pages from process pages.

Since we've no use for SWAP_MAP_BAD | COUNT_CONTINUED,
we might as well use that 0xbf value for SWAP_MAP_SHMEM.

Signed-off-by: Hugh Dickins
Reviewed-by: KAMEZAWA Hiroyuki
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2009-12-16 00:53:16 +0800
570a335b8 swap_info: swap count continuations ... Browse Code »

Swap is duplicated (reference count incremented by one) whenever the same
swap page is inserted into another mm (when forking finds a swap entry in
place of a pte, or when reclaim unmaps a pte to insert the swap entry).

swap_info_struct's vmalloc'ed swap_map is the array of these reference
counts: but what happens when the unsigned short (or unsigned char since
the preceding patch) is full? (and its high bit is kept for a cache flag)

We then lose track of it, never freeing, leaving it in use until swapoff:
at which point we _hope_ that a single pass will have found all instances,
assume there are no more, and will lose user data if we're wrong.

Swapping of KSM pages has not yet been enabled; but it is implemented,
and makes it very easy for a user to overflow the maximum swap count:
possible with ordinary process pages, but unlikely, even when pid_max
has been raised from PID_MAX_DEFAULT.

This patch implements swap count continuations: when the count overflows,
a continuation page is allocated and linked to the original vmalloc'ed
map page, and this used to hold the continuation counts for that entry
and its neighbours. These continuation pages are seldom referenced:
the common paths all work on the original swap_map, only referring to
a continuation page when the low "digit" of a count is incremented or
decremented through SWAP_MAP_MAX.

Signed-off-by: Hugh Dickins
Cc: KAMEZAWA Hiroyuki
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2009-12-16 00:53:15 +0800
8d69aaee8 swap_info: swap_map of chars not shorts ... Browse Code »

Halve the vmalloc'ed swap_map array from unsigned shorts to unsigned
chars: it's still very unusual to reach a swap count of 126, and the
next patch allows it to be extended indefinitely.

Signed-off-by: Hugh Dickins
Reviewed-by: KAMEZAWA Hiroyuki
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2009-12-16 00:53:15 +0800
253d553ba swap_info: SWAP_HAS_CACHE cleanups ... Browse Code »

Though swap_count() is useful, I'm finding that swap_has_cache() and
encode_swapmap() obscure what happens in the swap_map entry, just at
those points where I need to understand it. Remove them, and pass
more usable "usage" values to scan_swap_map(), swap_entry_free() and
__swap_duplicate(), instead of the SWAP_MAP and SWAP_CACHE enum.

Signed-off-by: Hugh Dickins
Reviewed-by: KAMEZAWA Hiroyuki
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2009-12-16 00:53:15 +0800
73c34b6ac swap_info: miscellaneous minor cleanups ... Browse Code »

Move CONFIG_HIBERNATION's swapdev_block() into the main CONFIG_HIBERNATION
block, remove extraneous whitespace and return, fix typo in a comment.

Signed-off-by: Hugh Dickins
Reviewed-by: KAMEZAWA Hiroyuki
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2009-12-16 00:53:15 +0800
9625a5f28 swap_info: include first_swap_extent ... Browse Code »

Make better use of the space by folding first swap_extent into its
swap_info_struct, instead of just the list_head: swap partitions need
only that one, and for others it's used as a circular list anyway.

[jirislaby@gmail.com: fix crash on double swapon]
Signed-off-by: Hugh Dickins
Cc: KAMEZAWA Hiroyuki
Cc: Rik van Riel
Signed-off-by: Jiri Slaby
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2009-12-16 00:53:15 +0800
efa90a981 swap_info: change to array of pointers ... Browse Code »

The swap_info_struct is only 76 or 104 bytes, but it does seem wrong
to reserve an array of about 30 of them in bss, when most people will
want only one. Change swap_info[] to an array of pointers.

That does need a "type" field in the structure: pack it as a char with
next type and short prio (aha, char is unsigned by default on PowerPC).
Use the (admittedly peculiar) name "type" throughout for this index.

/proc/swaps does not take swap_lock: I wouldn't want it to, but do take
care with barriers when adding a new item to the array (never removed).

Signed-off-by: Hugh Dickins
Reviewed-by: KAMEZAWA Hiroyuki
Acked-by: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2009-12-16 00:53:15 +0800
f29ad6a99 swap_info: private to swapfile.c ... Browse Code »

The swap_info_struct is mostly private to mm/swapfile.c, with only
one other in-tree user: get_swap_bio(). Adjust its interface to
map_swap_page(), so that we can then remove get_swap_info_struct().

But there is a popular user out-of-tree, TuxOnIce: so leave the
declaration of swap_info_struct in linux/swap.h.

Signed-off-by: Hugh Dickins
Cc: Nigel Cunningham
Cc: KAMEZAWA Hiroyuki
Reviewed-by: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2009-12-16 00:53:13 +0800

03 Nov, 2009

1 commit

32c5fc10e mm: remove incorrect swap_count() from try_to_unuse() ... Browse Code »

In try_to_unuse(), swcount is a local copy of *swap_map, including the
SWAP_HAS_CACHE bit; but a wrong comparison against swap_count(*swap_map),
which masks off the SWAP_HAS_CACHE bit, succeeded where it should fail.

That had the effect of resetting the mm from which to start searching
for the next swap page, to an irrelevant mm instead of to an mm in which
this swap page had been found: which may increase search time by ~20%.
But we're used to swapoff being slow, so never noticed the slowdown.

Remove that one spurious use of swap_count(): Bo Liu thought it merely
redundant, Hugh rewrote the description since it was measurably wrong.

Signed-off-by: Bo Liu
Signed-off-by: Hugh Dickins
Reviewed-by: KAMEZAWA Hiroyuki
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds

Bo Liu
2009-11-03 01:44:41 +0800

02 Oct, 2009

1 commit

3bd0f0c76 swapfile: avoid NULL pointer dereference in swapon when s_bdev is NULL ... Browse Code »

While testing Swap over NFS patchset, I noticed an oops that was triggered
during swapon. Investigating further, the NULL pointer deference is due to the
SSD device check/optimization in the swapon code that assumes s_bdev could never
be NULL.

inode->i_sb->s_bdev could be NULL in a few cases. For e.g. one such case is
loopback NFS mount, there could be others as well. Fix this by ensuring s_bdev
is not NULL before we try to deference s_bdev.

Signed-off-by: Suresh Jayaraman
Signed-off-by: Jens Axboe

Suresh Jayaraman
2009-10-02 03:15:46 +0800

24 Sep, 2009

1 commit

db1682636 Merge branch 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6 ... Browse Code »

* 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
HWPOISON: Enable error_remove_page on btrfs
HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
HWPOISON: Add madvise() based injector for hardware poisoned pages v4
HWPOISON: Enable error_remove_page for NFS
HWPOISON: Enable .remove_error_page for migration aware file systems
HWPOISON: The high level memory error handler in the VM v7
HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
HWPOISON: shmem: call set_page_dirty() with locked page
HWPOISON: Define a new error_remove_page address space op for async truncation
HWPOISON: Add invalidate_inode_page
HWPOISON: Refactor truncate to allow direct truncating of page v2
HWPOISON: check and isolate corrupted free pages v2
HWPOISON: Handle hardware poisoned pages in try_to_unmap
HWPOISON: Use bitmask/action code for try_to_unmap behaviour
HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
HWPOISON: Add poison check to page fault handling
HWPOISON: Add basic support for poisoned pages in fault handler v3
HWPOISON: Add new SIGBUS error codes for hardware poison signals
HWPOISON: Add support for poison swap entries v2
HWPOISON: Export some rmap vma locking to outside world
...

Linus Torvalds
2009-09-24 22:53:22 +0800

22 Sep, 2009

1 commit

35451beec ksm: unmerge is an origin of OOMs ... Browse Code »

Just as the swapoff system call allocates many pages of RAM to various
processes, perhaps triggering OOM, so "echo 2 >/sys/kernel/mm/ksm/run"
(unmerge) is liable to allocate many pages of RAM to various processes,
perhaps triggering OOM; and each is normally run from a modest admin
process (swapoff or shell), easily repeated until it succeeds.

So treat unmerge_and_remove_all_rmap_items() in the same way that we treat
try_to_unuse(): generalize PF_SWAPOFF to PF_OOM_ORIGIN, and bracket both
with that, to ask the OOM killer to kill them first, to prevent them from
spawning more and more OOM kills.

Signed-off-by: Hugh Dickins
Acked-by: Izik Eidus
Cc: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2009-09-22 22:17:33 +0800

16 Sep, 2009

1 commit

a7420aa54 HWPOISON: Add support for poison swap entries v2 ... Browse Code »

Memory migration uses special swap entry types to trigger special actions on
page faults. Extend this mechanism to also support poisoned swap entries, to
trigger poison handling on page faults. This allows follow-on patches to
prevent processes from faulting in poisoned pages again.

v2: Fix overflow in MAX_SWAPFILES (Fengguang Wu)
v3: Better overflow fix (Hidehiro Kawai)

Signed-off-by: Andi Kleen

Andi Kleen
2009-09-16 17:50:05 +0800

14 Sep, 2009

1 commit

746cd1e7e block: use blkdev_issue_discard in blk_ioctl_discard ... Browse Code »

blk_ioctl_discard duplicates large amounts of code from blkdev_issue_discard,
the only difference between the two is that blkdev_issue_discard needs to
send a barrier discard request and blk_ioctl_discard a non-barrier one,
and blk_ioctl_discard needs to wait on the request. To facilitates this
add a flags argument to blkdev_issue_discard to control both aspects of the
behaviour. This will be very useful later on for using the waiting
funcitonality for other callers.

Based on an earlier patch from Matthew Wilcox .

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2009-09-14 14:24:53 +0800

30 Jul, 2009

1 commit

dddac6a7b PM / Hibernate: Replace bdget call with simple atomic_inc of i_count ... Browse Code »

Create bdgrab(). This function copies an existing reference to a
block_device. It is safe to call from any context.

Hibernation code wishes to copy a reference to the active swap device.
Right now it calls bdget() under a spinlock, but this is wrong because
bdget() can sleep. It doesn't need a full bdget() because we already
hold a reference to active swap devices (and the spinlock protects
against swapoff).

Fixes http://bugzilla.kernel.org/show_bug.cgi?id=13827

Signed-off-by: Alan Jenkins
Signed-off-by: Rafael J. Wysocki

Alan Jenkins
2009-07-30 03:07:55 +0800

19 Jun, 2009

1 commit

8a9478ca7 memcg: fix swap accounting ... Browse Code »

This patch fixes mis-accounting of swap usage in memcg.

In the current implementation, memcg's swap account is uncharged only when
swap is completely freed. But there are several cases where swap cannot
be freed cleanly. For handling that, this patch changes that memcg
uncharges swap account when swap has no references other than cache.

By this, memcg's swap entry accounting can be fully synchronous with the
application's behavior.

This patch also changes memcg's hooks for swap-out.

Signed-off-by: KAMEZAWA Hiroyuki
Cc: Daisuke Nishimura
Acked-by: Balbir Singh
Cc: Hugh Dickins
Cc: Johannes Weiner
Cc: Li Zefan
Cc: Dhaval Giani
Cc: YAMAMOTO Takashi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2009-06-19 04:03:47 +0800

17 Jun, 2009

3 commits

c9e444103 mm: reuse unused swap entry if necessary ... Browse Code »

Presently we can know a swap entry is just used as SwapCache via swap_map,
without looking up swap cache.

Then, we have a chance to reuse swap-cache-only swap entries in
get_swap_pages().

This patch tries to free swap-cache-only swap entries if swap is not
enough.

Note: We hit following path when swap_cluster code cannot find a free
cluster. Then, vm_swap_full() is not only condition to allow the kernel
to reclaim unused swap.

Signed-off-by: KAMEZAWA Hiroyuki
Acked-by: Balbir Singh
Cc: Hugh Dickins
Cc: Johannes Weiner
Cc: Li Zefan
Cc: Dhaval Giani
Cc: YAMAMOTO Takashi
Tested-by: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2009-06-17 10:47:42 +0800
355cfa73d mm: modify swap_map and add SWAP_HAS_CACHE flag ... Browse Code »

This is a part of the patches for fixing memcg's swap accountinf leak.
But, IMHO, not a bad patch even if no memcg.

There are 2 kinds of references to swap.
- reference from swap entry
- reference from swap cache

Then,

- If there is swap cache && swap's refcnt is 1, there is only swap cache.
(*) swapcount(entry) == 1 && find_get_page(swapper_space, entry) != NULL

This counting logic have worked well for a long time. But considering
that we cannot know there is a _real_ reference or not by swap_map[],
current usage of counter is not very good.

This patch adds a flag SWAP_HAS_CACHE and recored information that a swap
entry has a cache or not. This will remove -1 magic used in swapfile.c
and be a help to avoid unnecessary find_get_page().

Signed-off-by: KAMEZAWA Hiroyuki
Tested-by: Daisuke Nishimura
Cc: Balbir Singh
Cc: Hugh Dickins
Cc: Johannes Weiner
Cc: Li Zefan
Cc: Dhaval Giani
Cc: YAMAMOTO Takashi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2009-06-17 10:47:42 +0800
cb4b86ba4 mm: add swap cache interface for swap reference ... Browse Code »

In a following patch, the usage of swap cache is recorded into swap_map.
This patch is for necessary interface changes to do that.

2 interfaces:

- swapcache_prepare()
- swapcache_free()

are added for allocating/freeing refcnt from swap-cache to existing swap
entries. But implementation itself is not changed under this patch. At
adding swapcache_free(), memcg's hook code is moved under
swapcache_free(). This is better than using scattered hooks.

Signed-off-by: KAMEZAWA Hiroyuki
Reviewed-by: Daisuke Nishimura
Acked-by: Balbir Singh
Cc: Hugh Dickins
Cc: Johannes Weiner
Cc: Li Zefan
Cc: Dhaval Giani
Cc: YAMAMOTO Takashi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2009-06-17 10:47:42 +0800

22 Feb, 2009

1 commit

a1bb7d612 PM/hibernate: fix "swap breaks after hibernation failures" ... Browse Code »

http://bugzilla.kernel.org/show_bug.cgi?id=12239

The image writing code dropped a reference to the current swap device.
This doesn't show up if the hibernation succeeds - because it doesn't
affect the image which gets resumed. But it means multiple _failed_
hibernations end up freeing the swap device while it is still use!

swsusp_write() finds the block device for the swap file using swap_type_of().
It then uses blkdev_get() / blkdev_put() to open and close the block device.

Unfortunately, blkdev_get() assumes ownership of the inode of the block_device
passed to it. So blkdev_put() calls iput() on the inode. This is by design
and other callers expect this behaviour. The fix is for swap_type_of() to take
a reference on the inode using bdget().

Signed-off-by: Alan Jenkins
Signed-off-by: Rafael J. Wysocki
Cc: Len Brown
Cc: Greg KH
Signed-off-by: Linus Torvalds

Alan Jenkins
2009-02-22 06:17:17 +0800

30 Jan, 2009

1 commit

85d9fc89f memcg: fix refcnt handling at swapoff ... Browse Code »

Now, at swapoff, even while try_charge() fails, commit is executed. This
is a bug which turns the refcnt of cgroup_subsys_state negative.

Reported-by: Li Zefan
Tested-by: Li Zefan
Tested-by: Daisuke Nishimura
Signed-off-by: KAMEZAWA Hiroyuki
Reviewed-by: Daisuke Nishimura
Cc: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2009-01-30 10:04:43 +0800

14 Jan, 2009

1 commit

c4ea37c26 [CVE-2009-0029] System call wrappers part 26 ... Browse Code »

Signed-off-by: Heiko Carstens

Heiko Carstens
2009-01-14 21:15:29 +0800

09 Jan, 2009

4 commits

2c26fdd70 memcg: revert gfp mask fix ... Browse Code »

My patch, memcg-fix-gfp_mask-of-callers-of-charge.patch changed gfp_mask
of callers of charge to be GFP_HIGHUSER_MOVABLE for showing what will
happen at memory reclaim.

But in recent discussion, it's NACKed because it sounds ugly.

This patch is for reverting it and add some clean up to gfp_mask of
callers of charge. No behavior change but need review before generating
HUNK in deep queue.

This patch also adds explanation to meaning of gfp_mask passed to charge
functions in memcontrol.h.

Signed-off-by: KAMEZAWA Hiroyuki
Cc: Balbir Singh
Cc: Daisuke Nishimura
Cc: Hugh Dickins
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2009-01-09 00:31:06 +0800
8c7c6e34a memcg: mem+swap controller core ... Browse Code »

This patch implements per cgroup limit for usage of memory+swap. However
there are SwapCache, double counting of swap-cache and swap-entry is
avoided.

Mem+Swap controller works as following.
- memory usage is limited by memory.limit_in_bytes.
- memory + swap usage is limited by memory.memsw_limit_in_bytes.

This has following benefits.
- A user can limit total resource usage of mem+swap.

Without this, because memory resource controller doesn't take care of
usage of swap, a process can exhaust all the swap (by memory leak.)
We can avoid this case.

And Swap is shared resource but it cannot be reclaimed (goes back to memory)
until it's used. This characteristic can be trouble when the memory
is divided into some parts by cpuset or memcg.
Assume group A and group B.
After some application executes, the system can be..

Group A -- very large free memory space but occupy 99% of swap.
Group B -- under memory shortage but cannot use swap...it's nearly full.

Ability to set appropriate swap limit for each group is required.

Maybe someone wonder "why not swap but mem+swap ?"

- The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
to move account from memory to swap...there is no change in usage of
mem+swap.

In other words, when we want to limit the usage of swap without affecting
global LRU, mem+swap limit is better than just limiting swap.

Accounting target information is stored in swap_cgroup which is
per swap entry record.

Charge is done as following.
map
- charge page and memsw.

unmap
- uncharge page/memsw if not SwapCache.

swap-out (__delete_from_swap_cache)
- uncharge page
- record mem_cgroup information to swap_cgroup.

swap-in (do_swap_page)
- charged as page and memsw.
record in swap_cgroup is cleared.
memsw accounting is decremented.

swap-free (swap_free())
- if swap entry is freed, memsw is uncharged by PAGE_SIZE.

There are people work under never-swap environments and consider swap as
something bad. For such people, this mem+swap controller extension is just an
overhead. This overhead is avoided by config or boot option.
(see Kconfig. detail is not in this patch.)

TODO:
- maybe more optimization can be don in swap-in path. (but not very safe.)
But we just do simple accounting at this stage.

[nishimura@mxp.nes.nec.co.jp: make resize limit hold mutex]
[hugh@veritas.com: memswap controller core swapcache fixes]
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Li Zefan
Cc: Balbir Singh
Cc: Pavel Emelyanov
Signed-off-by: Daisuke Nishimura
Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2009-01-09 00:31:05 +0800
27a7faa07 memcg: swap cgroup for remembering usage ... Browse Code »

For accounting swap, we need a record per swap entry, at least.

This patch adds following function.
- swap_cgroup_swapon() .... called from swapon
- swap_cgroup_swapoff() ... called at the end of swapoff.

- swap_cgroup_record() .... record information of swap entry.
- swap_cgroup_lookup() .... lookup information of swap entry.

This patch just implements "how to record information". No actual method
for limit the usage of swap. These routine uses flat table to record and
lookup. "wise" lookup system like radix-tree requires requires memory
allocation at new records but swap-out is usually called under memory
shortage (or memcg hits limit.) So, I used static allocation. (maybe
dynamic allocation is not very hard but it adds additional memory
allocation in memory shortage path.)

Note1: In this, we use pointer to record information and this means
8bytes per swap entry. I think we can reduce this when we
create "id of cgroup" in the range of 0-65535 or 0-255.

Reported-by: Daisuke Nishimura
Reviewed-by: Daisuke Nishimura
Tested-by: Daisuke Nishimura
Reported-by: Hugh Dickins
Reported-by: Balbir Singh
Reported-by: Andrew Morton
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Pavel Emelianov
Cc: Li Zefan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2009-01-09 00:31:05 +0800
bced0520f memcg: fix gfp_mask of callers of charge ... Browse Code »

Fix misuse of gfp_kernel.

Now, most of callers of mem_cgroup_charge_xxx functions uses GFP_KERNEL.

I think that this is from the fact that page_cgroup *was* dynamically
allocated.

But now, we allocate all page_cgroup at boot. And
mem_cgroup_try_to_free_pages() reclaim memory from GFP_HIGHUSER_MOVABLE +
specified GFP_RECLAIM_MASK.

* This is because we just want to reduce memory usage.
"Where we should reclaim from ?" is not a problem in memcg.

This patch modifies gfp masks to be GFP_HIGUSER_MOVABLE if possible.

Note: This patch is not for fixing behavior but for showing sane information
in source code.

Signed-off-by: KAMEZAWA Hiroyuki
Reviewed-by: Daisuke Nishimura
Cc: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2009-01-09 00:31:04 +0800