Commit 8c7c6e34a1256a5082d38c8e9bd1474476912715

Authored by KAMEZAWA Hiroyuki
Committed by Linus Torvalds
1 parent 27a7faa077

memcg: mem+swap controller core

This patch implements a per-cgroup limit on the usage of memory+swap.
Although a page may sit in SwapCache (present in memory while still owning
a swap slot), double counting of the swap cache and the swap entry is
avoided.

The mem+swap controller works as follows (a toy sketch of the charge
ordering appears right after this list).
  - memory usage is limited by memory.limit_in_bytes.
  - memory + swap usage is limited by memory.memsw.limit_in_bytes.
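
A minimal user-space sketch of the charge ordering this patch uses in
__mem_cgroup_try_charge() (charge the memory counter first, then the
mem+swap counter, and roll back on failure). The struct and function
names here are illustrative stand-ins, not the kernel API:

  #include <stdbool.h>
  #include <stdio.h>

  /* toy stand-in for res_counter: usage capped by a limit (bytes) */
  struct counter { long usage, limit; };

  static int counter_charge(struct counter *c, long sz)
  {
      if (c->usage + sz > c->limit)
          return -1;              /* would go over the limit */
      c->usage += sz;
      return 0;
  }

  /* ordering: memory first, then mem+swap, undo memory on failure */
  static int try_charge(struct counter *res, struct counter *memsw,
                        bool do_swap_account, long sz)
  {
      if (counter_charge(res, sz))
          return -1;              /* memory.limit_in_bytes hit */
      if (do_swap_account && counter_charge(memsw, sz)) {
          res->usage -= sz;       /* memory.memsw.limit_in_bytes hit */
          return -1;
      }
      return 0;
  }

  int main(void)
  {
      struct counter res = { 0, 4096 }, memsw = { 0, 8192 };
      printf("%d\n", try_charge(&res, &memsw, true, 4096)); /* 0: fits  */
      printf("%d\n", try_charge(&res, &memsw, true, 4096)); /* -1: res  */
      return 0;
  }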

This has the following benefits.
  - A user can limit the total resource usage of mem+swap.

    Without this, because the memory resource controller does not account
    for swap usage, a process can exhaust all of the swap (e.g. through a
    memory leak). This limit lets us avoid that case.

    Swap is also a shared resource, but a used swap slot cannot be reclaimed
    (brought back into memory) until its contents are needed again. This
    characteristic can cause trouble when memory is partitioned by cpuset or
    memcg. Assume groups A and B; after some applications have run, the
    system can end up as:

    Group A -- plenty of free memory, but occupying 99% of swap.
    Group B -- under memory shortage, but unable to use swap because it is
               nearly full.

    The ability to set an appropriate swap limit for each group is required.

One might wonder: "why mem+swap rather than just swap?"

  - The global LRU (kswapd) can swap out arbitrary pages. Swap-out only
    moves the charge from memory to swap, so there is no change in the
    usage of mem+swap.

    In other words, when we want to limit swap usage without affecting the
    global LRU, a mem+swap limit is better than just limiting swap.

Accounting target information is stored in swap_cgroup, which keeps one
record per swap entry.
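
The swap_cgroup lookup/record helpers themselves come from a companion
patch; conceptually they are just one owner slot per swap entry. A rough,
hypothetical model (swap_owner, record() and lookup() are made-up names
standing in for swap_cgroup_record()/lookup_swap_cgroup()):

  #define NR_SWAP_ENTRIES 1024    /* toy size; real code sizes per device */

  struct mem_cgroup;              /* opaque for this sketch */

  /* one owner record per swap entry, indexed by swap offset */
  static struct mem_cgroup *swap_owner[NR_SWAP_ENTRIES];

  /* store a new owner, return the previous one */
  static struct mem_cgroup *record(unsigned long offset, struct mem_cgroup *mem)
  {
      struct mem_cgroup *old = swap_owner[offset];

      swap_owner[offset] = mem;
      return old;
  }

  /* read the current owner */
  static struct mem_cgroup *lookup(unsigned long offset)
  {
      return swap_owner[offset];
  }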

Charging works as follows (a small illustrative model follows this list).
  map
    - charge page and memsw.

  unmap
    - uncharge page/memsw if not SwapCache.

  swap-out (__delete_from_swap_cache)
    - uncharge page.
    - record the mem_cgroup information in swap_cgroup.

  swap-in (do_swap_page)
    - charge both page and memsw.
      The record in swap_cgroup is cleared and the memsw accounting is
      decremented accordingly, to avoid double counting.

  swap-free (swap_free())
    - if the swap entry is freed, memsw is uncharged by PAGE_SIZE.
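
To see why mem+swap usage stays flat across swap-out/swap-in, here is a
small illustrative model of the events above. The two counters stand in
for mem->res and mem->memsw usage (in pages) and swap_rec models whether
swap_cgroup still records a charge; this is a sketch, not kernel code:

  #include <assert.h>
  #include <stdio.h>

  static long mem, memsw;   /* page counts: mem->res / mem->memsw usage */
  static int swap_rec;      /* does swap_cgroup still record a charge?  */

  static void map_page(void)  { mem++; memsw++; }       /* map: charge both     */
  static void swap_out(void)  { mem--; swap_rec = 1; }  /* page out, keep memsw */

  static void swap_in(void)
  {
      mem++; memsw++;           /* charge page and memsw */
      if (swap_rec) {
          memsw--;              /* drop the charge recorded for the entry */
          swap_rec = 0;
      }
  }

  static void swap_free_entry(void)
  {
      if (swap_rec) {           /* entry freed while contents still on swap */
          memsw--;
          swap_rec = 0;
      }
  }

  int main(void)
  {
      map_page();                        /* fault in an anonymous page        */
      assert(mem == 1 && memsw == 1);
      swap_out();                        /* global LRU pushes the page out    */
      assert(mem == 0 && memsw == 1);    /* mem+swap usage did not change     */
      swap_in();                         /* page faulted back in              */
      assert(mem == 1 && memsw == 1);    /* no double count for the SwapCache */
      swap_free_entry();                 /* no-op: charge already moved back  */
      assert(memsw == 1);
      printf("mem=%ld memsw=%ld\n", mem, memsw);
      return 0;
  }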

Some people work in never-swap environments and consider swap to be
something bad. For them, this mem+swap controller extension is just
overhead. The overhead can be avoided by a config or boot option
(see Kconfig; the details are not in this patch).

TODO:
 - More optimization may be possible in the swap-in path (but it is not
   very safe), so we just do simple accounting at this stage.

[nishimura@mxp.nes.nec.co.jp: make resize limit hold mutex]
[hugh@veritas.com: memswap controller core swapcache fixes]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Showing 8 changed files with 440 additions and 54 deletions

Documentation/controllers/memory.txt
... ... @@ -137,13 +137,33 @@
137 137 page will eventually get charged for it (once it is uncharged from
138 138 the cgroup that brought it in -- this will happen on memory pressure).
139 139  
140   -Exception: When you do swapoff and make swapped-out pages of shmem(tmpfs) to
  140 +Exception: If CONFIG_CGROUP_MEM_RES_CTLR_SWAP is not used..
  141 +When you do swapoff and make swapped-out pages of shmem(tmpfs) to
141 142 be backed into memory in force, charges for pages are accounted against the
142 143 caller of swapoff rather than the users of shmem.
143 144  
144 145  
145   -2.4 Reclaim
  146 +2.4 Swap Extension (CONFIG_CGROUP_MEM_RES_CTLR_SWAP)
  147 +Swap Extension allows you to record charge for swap. A swapped-in page is
  148 +charged back to original page allocator if possible.
146 149  
  150 +When swap is accounted, following files are added.
  151 + - memory.memsw.usage_in_bytes.
  152 + - memory.memsw.limit_in_bytes.
  153 +
  154 +usage of mem+swap is limited by memsw.limit_in_bytes.
  155 +
  156 +Note: why 'mem+swap' rather than swap.
  157 +The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
  158 +to move account from memory to swap...there is no change in usage of
  159 +mem+swap.
  160 +
  161 +In other words, when we want to limit the usage of swap without affecting
  162 +global LRU, mem+swap limit is better than just limiting swap from OS point
  163 +of view.
  164 +
  165 +2.5 Reclaim
  166 +
147 167 Each cgroup maintains a per cgroup LRU that consists of an active
148 168 and inactive list. When a cgroup goes over its limit, we first try
149 169 to reclaim memory from the cgroup so as to make space for the new
... ... @@ -245,6 +265,11 @@
245 265 Such charges are freed(at default) or moved to its parent. When moved,
246 266 both of RSS and CACHES are moved to parent.
247 267 If both of them are busy, rmdir() returns -EBUSY. See 5.1 Also.
  268 +
  269 +Charges recorded in swap information is not updated at removal of cgroup.
  270 +Recorded information is discarded and a cgroup which uses swap (swapcache)
  271 +will be charged as a new owner of it.
  272 +
248 273  
249 274 5. Misc. interfaces.
250 275  
include/linux/memcontrol.h
... ... @@ -32,6 +32,8 @@
32 32 /* for swap handling */
33 33 extern int mem_cgroup_try_charge(struct mm_struct *mm,
34 34 gfp_t gfp_mask, struct mem_cgroup **ptr);
  35 +extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
  36 + struct page *page, gfp_t mask, struct mem_cgroup **ptr);
35 37 extern void mem_cgroup_commit_charge_swapin(struct page *page,
36 38 struct mem_cgroup *ptr);
37 39 extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr);
... ... @@ -80,7 +82,6 @@
80 82 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
81 83 extern int do_swap_account;
82 84 #endif
83   -
84 85 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
85 86 struct mem_cgroup;
86 87  
... ... @@ -97,7 +98,13 @@
97 98 }
98 99  
99 100 static inline int mem_cgroup_try_charge(struct mm_struct *mm,
100   - gfp_t gfp_mask, struct mem_cgroup **ptr)
  101 + gfp_t gfp_mask, struct mem_cgroup **ptr)
  102 +{
  103 + return 0;
  104 +}
  105 +
  106 +static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
  107 + struct page *page, gfp_t gfp_mask, struct mem_cgroup **ptr)
101 108 {
102 109 return 0;
103 110 }
include/linux/swap.h
... ... @@ -214,7 +214,7 @@
214 214 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
215 215 gfp_t gfp_mask);
216 216 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
217   - gfp_t gfp_mask);
  217 + gfp_t gfp_mask, bool noswap);
218 218 extern int __isolate_lru_page(struct page *page, int mode, int file);
219 219 extern unsigned long shrink_all_memory(unsigned long nr_pages);
220 220 extern int vm_swappiness;
... ... @@ -336,7 +336,7 @@
336 336 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
337 337 extern int mem_cgroup_cache_charge_swapin(struct page *page,
338 338 struct mm_struct *mm, gfp_t mask, bool locked);
339   -extern void mem_cgroup_uncharge_swapcache(struct page *page);
  339 +extern void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent);
340 340 #else
341 341 static inline
342 342 int mem_cgroup_cache_charge_swapin(struct page *page,
... ... @@ -344,7 +344,15 @@
344 344 {
345 345 return 0;
346 346 }
347   -static inline void mem_cgroup_uncharge_swapcache(struct page *page)
  347 +static inline void
  348 +mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
  349 +{
  350 +}
  351 +#endif
  352 +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
  353 +extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
  354 +#else
  355 +static inline void mem_cgroup_uncharge_swap(swp_entry_t ent)
348 356 {
349 357 }
350 358 #endif
mm/memcontrol.c
... ... @@ -27,6 +27,7 @@
27 27 #include <linux/backing-dev.h>
28 28 #include <linux/bit_spinlock.h>
29 29 #include <linux/rcupdate.h>
  30 +#include <linux/mutex.h>
30 31 #include <linux/slab.h>
31 32 #include <linux/swap.h>
32 33 #include <linux/spinlock.h>
... ... @@ -132,12 +133,18 @@
132 133 */
133 134 struct res_counter res;
134 135 /*
  136 + * the counter to account for mem+swap usage.
  137 + */
  138 + struct res_counter memsw;
  139 + /*
135 140 * Per cgroup active and inactive list, similar to the
136 141 * per zone LRU lists.
137 142 */
138 143 struct mem_cgroup_lru_info info;
139 144  
140 145 int prev_priority; /* for recording reclaim priority */
  146 + int obsolete;
  147 + atomic_t refcnt;
141 148 /*
142 149 * statistics. This must be placed at the end of memcg.
143 150 */
... ... @@ -167,6 +174,17 @@
167 174 0, /* FORCE */
168 175 };
169 176  
  177 +
  178 +/* for encoding cft->private value on file */
  179 +#define _MEM (0)
  180 +#define _MEMSWAP (1)
  181 +#define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val))
  182 +#define MEMFILE_TYPE(val) (((val) >> 16) & 0xffff)
  183 +#define MEMFILE_ATTR(val) ((val) & 0xffff)
  184 +
  185 +static void mem_cgroup_get(struct mem_cgroup *mem);
  186 +static void mem_cgroup_put(struct mem_cgroup *mem);
  187 +
170 188 /*
171 189 * Always modified under lru lock. Then, not necessary to preempt_disable()
172 190 */
... ... @@ -485,7 +503,8 @@
485 503 * oom-killer can be invoked.
486 504 */
487 505 static int __mem_cgroup_try_charge(struct mm_struct *mm,
488   - gfp_t gfp_mask, struct mem_cgroup **memcg, bool oom)
  506 + gfp_t gfp_mask, struct mem_cgroup **memcg,
  507 + bool oom)
489 508 {
490 509 struct mem_cgroup *mem;
491 510 int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
... ... @@ -513,12 +532,25 @@
513 532 css_get(&mem->css);
514 533 }
515 534  
  535 + while (1) {
  536 + int ret;
  537 + bool noswap = false;
516 538  
517   - while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE))) {
  539 + ret = res_counter_charge(&mem->res, PAGE_SIZE);
  540 + if (likely(!ret)) {
  541 + if (!do_swap_account)
  542 + break;
  543 + ret = res_counter_charge(&mem->memsw, PAGE_SIZE);
  544 + if (likely(!ret))
  545 + break;
  546 + /* mem+swap counter fails */
  547 + res_counter_uncharge(&mem->res, PAGE_SIZE);
  548 + noswap = true;
  549 + }
518 550 if (!(gfp_mask & __GFP_WAIT))
519 551 goto nomem;
520 552  
521   - if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
  553 + if (try_to_free_mem_cgroup_pages(mem, gfp_mask, noswap))
522 554 continue;
523 555  
524 556 /*
... ... @@ -527,9 +559,14 @@
527 559 * moved to swap cache or just unmapped from the cgroup.
528 560 * Check the limit again to see if the reclaim reduced the
529 561 * current usage of the cgroup before giving up
  562 + *
530 563 */
531   - if (res_counter_check_under_limit(&mem->res))
  564 + if (!do_swap_account &&
  565 + res_counter_check_under_limit(&mem->res))
532 566 continue;
  567 + if (do_swap_account &&
  568 + res_counter_check_under_limit(&mem->memsw))
  569 + continue;
533 570  
534 571 if (!nr_retries--) {
535 572 if (oom)
... ... @@ -582,6 +619,8 @@
582 619 if (unlikely(PageCgroupUsed(pc))) {
583 620 unlock_page_cgroup(pc);
584 621 res_counter_uncharge(&mem->res, PAGE_SIZE);
  622 + if (do_swap_account)
  623 + res_counter_uncharge(&mem->memsw, PAGE_SIZE);
585 624 css_put(&mem->css);
586 625 return;
587 626 }
... ... @@ -646,6 +685,8 @@
646 685 __mem_cgroup_remove_list(from_mz, pc);
647 686 css_put(&from->css);
648 687 res_counter_uncharge(&from->res, PAGE_SIZE);
  688 + if (do_swap_account)
  689 + res_counter_uncharge(&from->memsw, PAGE_SIZE);
649 690 pc->mem_cgroup = to;
650 691 css_get(&to->css);
651 692 __mem_cgroup_add_list(to_mz, pc, false);
... ... @@ -692,8 +733,11 @@
692 733 /* drop extra refcnt */
693 734 css_put(&parent->css);
694 735 /* uncharge if move fails */
695   - if (ret)
  736 + if (ret) {
696 737 res_counter_uncharge(&parent->res, PAGE_SIZE);
  738 + if (do_swap_account)
  739 + res_counter_uncharge(&parent->memsw, PAGE_SIZE);
  740 + }
697 741  
698 742 return ret;
699 743 }
... ... @@ -791,7 +835,42 @@
791 835 MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
792 836 }
793 837  
  838 +int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
  839 + struct page *page,
  840 + gfp_t mask, struct mem_cgroup **ptr)
  841 +{
  842 + struct mem_cgroup *mem;
  843 + swp_entry_t ent;
  844 +
  845 + if (mem_cgroup_subsys.disabled)
  846 + return 0;
  847 +
  848 + if (!do_swap_account)
  849 + goto charge_cur_mm;
  850 +
  851 + /*
  852 + * A racing thread's fault, or swapoff, may have already updated
  853 + * the pte, and even removed page from swap cache: return success
  854 + * to go on to do_swap_page()'s pte_same() test, which should fail.
  855 + */
  856 + if (!PageSwapCache(page))
  857 + return 0;
  858 +
  859 + ent.val = page_private(page);
  860 +
  861 + mem = lookup_swap_cgroup(ent);
  862 + if (!mem || mem->obsolete)
  863 + goto charge_cur_mm;
  864 + *ptr = mem;
  865 + return __mem_cgroup_try_charge(NULL, mask, ptr, true);
  866 +charge_cur_mm:
  867 + if (unlikely(!mm))
  868 + mm = &init_mm;
  869 + return __mem_cgroup_try_charge(mm, mask, ptr, true);
  870 +}
  871 +
794 872 #ifdef CONFIG_SWAP
  873 +
795 874 int mem_cgroup_cache_charge_swapin(struct page *page,
796 875 struct mm_struct *mm, gfp_t mask, bool locked)
797 876 {
... ... @@ -808,8 +887,28 @@
808 887 * we reach here.
809 888 */
810 889 if (PageSwapCache(page)) {
  890 + struct mem_cgroup *mem = NULL;
  891 + swp_entry_t ent;
  892 +
  893 + ent.val = page_private(page);
  894 + if (do_swap_account) {
  895 + mem = lookup_swap_cgroup(ent);
  896 + if (mem && mem->obsolete)
  897 + mem = NULL;
  898 + if (mem)
  899 + mm = NULL;
  900 + }
811 901 ret = mem_cgroup_charge_common(page, mm, mask,
812   - MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
  902 + MEM_CGROUP_CHARGE_TYPE_SHMEM, mem);
  903 +
  904 + if (!ret && do_swap_account) {
  905 + /* avoid double counting */
  906 + mem = swap_cgroup_record(ent, NULL);
  907 + if (mem) {
  908 + res_counter_uncharge(&mem->memsw, PAGE_SIZE);
  909 + mem_cgroup_put(mem);
  910 + }
  911 + }
813 912 }
814 913 if (!locked)
815 914 unlock_page(page);
... ... @@ -828,6 +927,23 @@
828 927 return;
829 928 pc = lookup_page_cgroup(page);
830 929 __mem_cgroup_commit_charge(ptr, pc, MEM_CGROUP_CHARGE_TYPE_MAPPED);
  930 + /*
  931 + * Now swap is on-memory. This means this page may be
  932 + * counted both as mem and swap....double count.
  933 + * Fix it by uncharging from memsw. This SwapCache is stable
  934 + * because we're still under lock_page().
  935 + */
  936 + if (do_swap_account) {
  937 + swp_entry_t ent = {.val = page_private(page)};
  938 + struct mem_cgroup *memcg;
  939 + memcg = swap_cgroup_record(ent, NULL);
  940 + if (memcg) {
  941 + /* If memcg is obsolete, memcg can be != ptr */
  942 + res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
  943 + mem_cgroup_put(memcg);
  944 + }
  945 +
  946 + }
831 947 }
832 948  
833 949 void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
... ... @@ -837,6 +953,8 @@
837 953 if (!mem)
838 954 return;
839 955 res_counter_uncharge(&mem->res, PAGE_SIZE);
  956 + if (do_swap_account)
  957 + res_counter_uncharge(&mem->memsw, PAGE_SIZE);
840 958 css_put(&mem->css);
841 959 }
842 960  
... ... @@ -844,29 +962,31 @@
844 962 /*
845 963 * uncharge if !page_mapped(page)
846 964 */
847   -static void
  965 +static struct mem_cgroup *
848 966 __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
849 967 {
850 968 struct page_cgroup *pc;
851   - struct mem_cgroup *mem;
  969 + struct mem_cgroup *mem = NULL;
852 970 struct mem_cgroup_per_zone *mz;
853 971 unsigned long flags;
854 972  
855 973 if (mem_cgroup_subsys.disabled)
856   - return;
  974 + return NULL;
857 975  
858 976 if (PageSwapCache(page))
859   - return;
  977 + return NULL;
860 978  
861 979 /*
862 980 * Check if our page_cgroup is valid
863 981 */
864 982 pc = lookup_page_cgroup(page);
865 983 if (unlikely(!pc || !PageCgroupUsed(pc)))
866   - return;
  984 + return NULL;
867 985  
868 986 lock_page_cgroup(pc);
869 987  
  988 + mem = pc->mem_cgroup;
  989 +
870 990 if (!PageCgroupUsed(pc))
871 991 goto unlock_out;
872 992  
... ... @@ -886,8 +1006,11 @@
886 1006 break;
887 1007 }
888 1008  
  1009 + res_counter_uncharge(&mem->res, PAGE_SIZE);
  1010 + if (do_swap_account && (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
  1011 + res_counter_uncharge(&mem->memsw, PAGE_SIZE);
  1012 +
889 1013 ClearPageCgroupUsed(pc);
890   - mem = pc->mem_cgroup;
891 1014  
892 1015 mz = page_cgroup_zoneinfo(pc);
893 1016 spin_lock_irqsave(&mz->lru_lock, flags);
... ... @@ -895,14 +1018,13 @@
895 1018 spin_unlock_irqrestore(&mz->lru_lock, flags);
896 1019 unlock_page_cgroup(pc);
897 1020  
898   - res_counter_uncharge(&mem->res, PAGE_SIZE);
899 1021 css_put(&mem->css);
900 1022  
901   - return;
  1023 + return mem;
902 1024  
903 1025 unlock_out:
904 1026 unlock_page_cgroup(pc);
905   - return;
  1027 + return NULL;
906 1028 }
907 1029  
908 1030 void mem_cgroup_uncharge_page(struct page *page)
... ... @@ -922,12 +1044,44 @@
922 1044 __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
923 1045 }
924 1046  
925   -void mem_cgroup_uncharge_swapcache(struct page *page)
  1047 +/*
  1048 + * called from __delete_from_swap_cache() and drop "page" account.
  1049 + * memcg information is recorded to swap_cgroup of "ent"
  1050 + */
  1051 +void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
926 1052 {
927   - __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
  1053 + struct mem_cgroup *memcg;
  1054 +
  1055 + memcg = __mem_cgroup_uncharge_common(page,
  1056 + MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
  1057 + /* record memcg information */
  1058 + if (do_swap_account && memcg) {
  1059 + swap_cgroup_record(ent, memcg);
  1060 + mem_cgroup_get(memcg);
  1061 + }
928 1062 }
929 1063  
  1064 +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
930 1065 /*
  1066 + * called from swap_entry_free(). remove record in swap_cgroup and
  1067 + * uncharge "memsw" account.
  1068 + */
  1069 +void mem_cgroup_uncharge_swap(swp_entry_t ent)
  1070 +{
  1071 + struct mem_cgroup *memcg;
  1072 +
  1073 + if (!do_swap_account)
  1074 + return;
  1075 +
  1076 + memcg = swap_cgroup_record(ent, NULL);
  1077 + if (memcg) {
  1078 + res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
  1079 + mem_cgroup_put(memcg);
  1080 + }
  1081 +}
  1082 +#endif
  1083 +
  1084 +/*
931 1085 * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
932 1086 * page belongs to.
933 1087 */
... ... @@ -1034,7 +1188,7 @@
1034 1188 rcu_read_unlock();
1035 1189  
1036 1190 do {
1037   - progress = try_to_free_mem_cgroup_pages(mem, gfp_mask);
  1191 + progress = try_to_free_mem_cgroup_pages(mem, gfp_mask, true);
1038 1192 progress += res_counter_check_under_limit(&mem->res);
1039 1193 } while (!progress && --retry);
1040 1194  
... ... @@ -1044,26 +1198,84 @@
1044 1198 return 0;
1045 1199 }
1046 1200  
  1201 +static DEFINE_MUTEX(set_limit_mutex);
  1202 +
1047 1203 static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
1048   - unsigned long long val)
  1204 + unsigned long long val)
1049 1205 {
1050 1206  
1051 1207 int retry_count = MEM_CGROUP_RECLAIM_RETRIES;
1052 1208 int progress;
  1209 + u64 memswlimit;
1053 1210 int ret = 0;
1054 1211  
1055   - while (res_counter_set_limit(&memcg->res, val)) {
  1212 + while (retry_count) {
1056 1213 if (signal_pending(current)) {
1057 1214 ret = -EINTR;
1058 1215 break;
1059 1216 }
1060   - if (!retry_count) {
1061   - ret = -EBUSY;
  1217 + /*
  1218 + * Rather than hide all in some function, I do this in
  1219 + * open coded manner. You see what this really does.
  1220 + * We have to guarantee mem->res.limit < mem->memsw.limit.
  1221 + */
  1222 + mutex_lock(&set_limit_mutex);
  1223 + memswlimit = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
  1224 + if (memswlimit < val) {
  1225 + ret = -EINVAL;
  1226 + mutex_unlock(&set_limit_mutex);
1062 1227 break;
1063 1228 }
  1229 + ret = res_counter_set_limit(&memcg->res, val);
  1230 + mutex_unlock(&set_limit_mutex);
  1231 +
  1232 + if (!ret)
  1233 + break;
  1234 +
1064 1235 progress = try_to_free_mem_cgroup_pages(memcg,
1065   - GFP_HIGHUSER_MOVABLE);
1066   - if (!progress)
  1236 + GFP_HIGHUSER_MOVABLE, false);
  1237 + if (!progress) retry_count--;
  1238 + }
  1239 + return ret;
  1240 +}
  1241 +
  1242 +int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
  1243 + unsigned long long val)
  1244 +{
  1245 + int retry_count = MEM_CGROUP_RECLAIM_RETRIES;
  1246 + u64 memlimit, oldusage, curusage;
  1247 + int ret;
  1248 +
  1249 + if (!do_swap_account)
  1250 + return -EINVAL;
  1251 +
  1252 + while (retry_count) {
  1253 + if (signal_pending(current)) {
  1254 + ret = -EINTR;
  1255 + break;
  1256 + }
  1257 + /*
  1258 + * Rather than hide all in some function, I do this in
  1259 + * open coded manner. You see what this really does.
  1260 + * We have to guarantee mem->res.limit < mem->memsw.limit.
  1261 + */
  1262 + mutex_lock(&set_limit_mutex);
  1263 + memlimit = res_counter_read_u64(&memcg->res, RES_LIMIT);
  1264 + if (memlimit > val) {
  1265 + ret = -EINVAL;
  1266 + mutex_unlock(&set_limit_mutex);
  1267 + break;
  1268 + }
  1269 + ret = res_counter_set_limit(&memcg->memsw, val);
  1270 + mutex_unlock(&set_limit_mutex);
  1271 +
  1272 + if (!ret)
  1273 + break;
  1274 +
  1275 + oldusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
  1276 + try_to_free_mem_cgroup_pages(memcg, GFP_HIGHUSER_MOVABLE, true);
  1277 + curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
  1278 + if (curusage >= oldusage)
1067 1279 retry_count--;
1068 1280 }
1069 1281 return ret;
... ... @@ -1193,7 +1405,7 @@
1193 1405 goto out;
1194 1406 }
1195 1407 progress = try_to_free_mem_cgroup_pages(mem,
1196   - GFP_HIGHUSER_MOVABLE);
  1408 + GFP_HIGHUSER_MOVABLE, false);
1197 1409 if (!progress) {
1198 1410 nr_retries--;
1199 1411 /* maybe some writeback is necessary */
... ... @@ -1216,8 +1428,25 @@
1216 1428  
1217 1429 static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
1218 1430 {
1219   - return res_counter_read_u64(&mem_cgroup_from_cont(cont)->res,
1220   - cft->private);
  1431 + struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
  1432 + u64 val = 0;
  1433 + int type, name;
  1434 +
  1435 + type = MEMFILE_TYPE(cft->private);
  1436 + name = MEMFILE_ATTR(cft->private);
  1437 + switch (type) {
  1438 + case _MEM:
  1439 + val = res_counter_read_u64(&mem->res, name);
  1440 + break;
  1441 + case _MEMSWAP:
  1442 + if (do_swap_account)
  1443 + val = res_counter_read_u64(&mem->memsw, name);
  1444 + break;
  1445 + default:
  1446 + BUG();
  1447 + break;
  1448 + }
  1449 + return val;
1221 1450 }
1222 1451 /*
1223 1452 * The user of this function is...
... ... @@ -1227,15 +1456,22 @@
1227 1456 const char *buffer)
1228 1457 {
1229 1458 struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
  1459 + int type, name;
1230 1460 unsigned long long val;
1231 1461 int ret;
1232 1462  
1233   - switch (cft->private) {
  1463 + type = MEMFILE_TYPE(cft->private);
  1464 + name = MEMFILE_ATTR(cft->private);
  1465 + switch (name) {
1234 1466 case RES_LIMIT:
1235 1467 /* This function does all necessary parse...reuse it */
1236 1468 ret = res_counter_memparse_write_strategy(buffer, &val);
1237   - if (!ret)
  1469 + if (ret)
  1470 + break;
  1471 + if (type == _MEM)
1238 1472 ret = mem_cgroup_resize_limit(memcg, val);
  1473 + else
  1474 + ret = mem_cgroup_resize_memsw_limit(memcg, val);
1239 1475 break;
1240 1476 default:
1241 1477 ret = -EINVAL; /* should be BUG() ? */
... ... @@ -1247,14 +1483,23 @@
1247 1483 static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
1248 1484 {
1249 1485 struct mem_cgroup *mem;
  1486 + int type, name;
1250 1487  
1251 1488 mem = mem_cgroup_from_cont(cont);
1252   - switch (event) {
  1489 + type = MEMFILE_TYPE(event);
  1490 + name = MEMFILE_ATTR(event);
  1491 + switch (name) {
1253 1492 case RES_MAX_USAGE:
1254   - res_counter_reset_max(&mem->res);
  1493 + if (type == _MEM)
  1494 + res_counter_reset_max(&mem->res);
  1495 + else
  1496 + res_counter_reset_max(&mem->memsw);
1255 1497 break;
1256 1498 case RES_FAILCNT:
1257   - res_counter_reset_failcnt(&mem->res);
  1499 + if (type == _MEM)
  1500 + res_counter_reset_failcnt(&mem->res);
  1501 + else
  1502 + res_counter_reset_failcnt(&mem->memsw);
1258 1503 break;
1259 1504 }
1260 1505 return 0;
... ... @@ -1315,24 +1560,24 @@
1315 1560 static struct cftype mem_cgroup_files[] = {
1316 1561 {
1317 1562 .name = "usage_in_bytes",
1318   - .private = RES_USAGE,
  1563 + .private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
1319 1564 .read_u64 = mem_cgroup_read,
1320 1565 },
1321 1566 {
1322 1567 .name = "max_usage_in_bytes",
1323   - .private = RES_MAX_USAGE,
  1568 + .private = MEMFILE_PRIVATE(_MEM, RES_MAX_USAGE),
1324 1569 .trigger = mem_cgroup_reset,
1325 1570 .read_u64 = mem_cgroup_read,
1326 1571 },
1327 1572 {
1328 1573 .name = "limit_in_bytes",
1329   - .private = RES_LIMIT,
  1574 + .private = MEMFILE_PRIVATE(_MEM, RES_LIMIT),
1330 1575 .write_string = mem_cgroup_write,
1331 1576 .read_u64 = mem_cgroup_read,
1332 1577 },
1333 1578 {
1334 1579 .name = "failcnt",
1335   - .private = RES_FAILCNT,
  1580 + .private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
1336 1581 .trigger = mem_cgroup_reset,
1337 1582 .read_u64 = mem_cgroup_read,
1338 1583 },
... ... @@ -1346,6 +1591,47 @@
1346 1591 },
1347 1592 };
1348 1593  
  1594 +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
  1595 +static struct cftype memsw_cgroup_files[] = {
  1596 + {
  1597 + .name = "memsw.usage_in_bytes",
  1598 + .private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE),
  1599 + .read_u64 = mem_cgroup_read,
  1600 + },
  1601 + {
  1602 + .name = "memsw.max_usage_in_bytes",
  1603 + .private = MEMFILE_PRIVATE(_MEMSWAP, RES_MAX_USAGE),
  1604 + .trigger = mem_cgroup_reset,
  1605 + .read_u64 = mem_cgroup_read,
  1606 + },
  1607 + {
  1608 + .name = "memsw.limit_in_bytes",
  1609 + .private = MEMFILE_PRIVATE(_MEMSWAP, RES_LIMIT),
  1610 + .write_string = mem_cgroup_write,
  1611 + .read_u64 = mem_cgroup_read,
  1612 + },
  1613 + {
  1614 + .name = "memsw.failcnt",
  1615 + .private = MEMFILE_PRIVATE(_MEMSWAP, RES_FAILCNT),
  1616 + .trigger = mem_cgroup_reset,
  1617 + .read_u64 = mem_cgroup_read,
  1618 + },
  1619 +};
  1620 +
  1621 +static int register_memsw_files(struct cgroup *cont, struct cgroup_subsys *ss)
  1622 +{
  1623 + if (!do_swap_account)
  1624 + return 0;
  1625 + return cgroup_add_files(cont, ss, memsw_cgroup_files,
  1626 + ARRAY_SIZE(memsw_cgroup_files));
  1627 +};
  1628 +#else
  1629 +static int register_memsw_files(struct cgroup *cont, struct cgroup_subsys *ss)
  1630 +{
  1631 + return 0;
  1632 +}
  1633 +#endif
  1634 +
1349 1635 static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
1350 1636 {
1351 1637 struct mem_cgroup_per_node *pn;
... ... @@ -1404,15 +1690,45 @@
1404 1690 return mem;
1405 1691 }
1406 1692  
  1693 +/*
  1694 + * At destroying mem_cgroup, references from swap_cgroup can remain.
  1695 + * (scanning all at force_empty is too costly...)
  1696 + *
  1697 + * Instead of clearing all references at force_empty, we remember
  1698 + * the number of reference from swap_cgroup and free mem_cgroup when
  1699 + * it goes down to 0.
  1700 + *
  1701 + * When mem_cgroup is destroyed, mem->obsolete will be set to 0 and
  1702 + * entry which points to this memcg will be ignore at swapin.
  1703 + *
  1704 + * Removal of cgroup itself succeeds regardless of refs from swap.
  1705 + */
  1706 +
1407 1707 static void mem_cgroup_free(struct mem_cgroup *mem)
1408 1708 {
  1709 + if (atomic_read(&mem->refcnt) > 0)
  1710 + return;
1409 1711 if (mem_cgroup_size() < PAGE_SIZE)
1410 1712 kfree(mem);
1411 1713 else
1412 1714 vfree(mem);
1413 1715 }
1414 1716  
  1717 +static void mem_cgroup_get(struct mem_cgroup *mem)
  1718 +{
  1719 + atomic_inc(&mem->refcnt);
  1720 +}
1415 1721  
  1722 +static void mem_cgroup_put(struct mem_cgroup *mem)
  1723 +{
  1724 + if (atomic_dec_and_test(&mem->refcnt)) {
  1725 + if (!mem->obsolete)
  1726 + return;
  1727 + mem_cgroup_free(mem);
  1728 + }
  1729 +}
  1730 +
  1731 +
1416 1732 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
1417 1733 static void __init enable_swap_cgroup(void)
1418 1734 {
... ... @@ -1436,6 +1752,7 @@
1436 1752 return ERR_PTR(-ENOMEM);
1437 1753  
1438 1754 res_counter_init(&mem->res);
  1755 + res_counter_init(&mem->memsw);
1439 1756  
1440 1757 for_each_node_state(node, N_POSSIBLE)
1441 1758 if (alloc_mem_cgroup_per_zone_info(mem, node))
... ... @@ -1456,6 +1773,7 @@
1456 1773 struct cgroup *cont)
1457 1774 {
1458 1775 struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
  1776 + mem->obsolete = 1;
1459 1777 mem_cgroup_force_empty(mem, false);
1460 1778 }
1461 1779  
... ... @@ -1474,8 +1792,14 @@
1474 1792 static int mem_cgroup_populate(struct cgroup_subsys *ss,
1475 1793 struct cgroup *cont)
1476 1794 {
1477   - return cgroup_add_files(cont, ss, mem_cgroup_files,
1478   - ARRAY_SIZE(mem_cgroup_files));
  1795 + int ret;
  1796 +
  1797 + ret = cgroup_add_files(cont, ss, mem_cgroup_files,
  1798 + ARRAY_SIZE(mem_cgroup_files));
  1799 +
  1800 + if (!ret)
  1801 + ret = register_memsw_files(cont, ss);
  1802 + return ret;
1479 1803 }
1480 1804  
1481 1805 static void mem_cgroup_move_task(struct cgroup_subsys *ss,
mm/memory.c
... ... @@ -2431,7 +2431,8 @@
2431 2431 lock_page(page);
2432 2432 delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
2433 2433  
2434   - if (mem_cgroup_try_charge(mm, GFP_HIGHUSER_MOVABLE, &ptr) == -ENOMEM) {
  2434 + if (mem_cgroup_try_charge_swapin(mm, page,
  2435 + GFP_HIGHUSER_MOVABLE, &ptr) == -ENOMEM) {
2435 2436 ret = VM_FAULT_OOM;
2436 2437 unlock_page(page);
2437 2438 goto out;
... ... @@ -2449,8 +2450,20 @@
2449 2450 goto out_nomap;
2450 2451 }
2451 2452  
2452   - /* The page isn't present yet, go ahead with the fault. */
  2453 + /*
  2454 + * The page isn't present yet, go ahead with the fault.
  2455 + *
  2456 + * Be careful about the sequence of operations here.
  2457 + * To get its accounting right, reuse_swap_page() must be called
  2458 + * while the page is counted on swap but not yet in mapcount i.e.
  2459 + * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
  2460 + * must be called after the swap_free(), or it will never succeed.
  2461 + * And mem_cgroup_commit_charge_swapin(), which uses the swp_entry
  2462 + * in page->private, must be called before reuse_swap_page(),
  2463 + * which may delete_from_swap_cache().
  2464 + */
2453 2465  
  2466 + mem_cgroup_commit_charge_swapin(page, ptr);
2454 2467 inc_mm_counter(mm, anon_rss);
2455 2468 pte = mk_pte(page, vma->vm_page_prot);
2456 2469 if (write_access && reuse_swap_page(page)) {
... ... @@ -2461,7 +2474,6 @@
2461 2474 flush_icache_page(vma, page);
2462 2475 set_pte_at(mm, address, page_table, pte);
2463 2476 page_add_anon_rmap(page, vma, address);
2464   - mem_cgroup_commit_charge_swapin(page, ptr);
2465 2477  
2466 2478 swap_free(entry);
2467 2479 if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
mm/swap_state.c
... ... @@ -17,6 +17,7 @@
17 17 #include <linux/backing-dev.h>
18 18 #include <linux/pagevec.h>
19 19 #include <linux/migrate.h>
  20 +#include <linux/page_cgroup.h>
20 21  
21 22 #include <asm/pgtable.h>
22 23  
... ... @@ -108,6 +109,8 @@
108 109 */
109 110 void __delete_from_swap_cache(struct page *page)
110 111 {
  112 + swp_entry_t ent = {.val = page_private(page)};
  113 +
111 114 VM_BUG_ON(!PageLocked(page));
112 115 VM_BUG_ON(!PageSwapCache(page));
113 116 VM_BUG_ON(PageWriteback(page));
... ... @@ -118,7 +121,7 @@
118 121 total_swapcache_pages--;
119 122 __dec_zone_page_state(page, NR_FILE_PAGES);
120 123 INC_CACHE_INFO(del_total);
121   - mem_cgroup_uncharge_swapcache(page);
  124 + mem_cgroup_uncharge_swapcache(page, ent);
122 125 }
123 126  
124 127 /**
mm/swapfile.c
... ... @@ -471,8 +471,9 @@
471 471 return NULL;
472 472 }
473 473  
474   -static int swap_entry_free(struct swap_info_struct *p, unsigned long offset)
  474 +static int swap_entry_free(struct swap_info_struct *p, swp_entry_t ent)
475 475 {
  476 + unsigned long offset = swp_offset(ent);
476 477 int count = p->swap_map[offset];
477 478  
478 479 if (count < SWAP_MAP_MAX) {
... ... @@ -487,6 +488,7 @@
487 488 swap_list.next = p - swap_info;
488 489 nr_swap_pages++;
489 490 p->inuse_pages--;
  491 + mem_cgroup_uncharge_swap(ent);
490 492 }
491 493 }
492 494 return count;
... ... @@ -502,7 +504,7 @@
502 504  
503 505 p = swap_info_get(entry);
504 506 if (p) {
505   - swap_entry_free(p, swp_offset(entry));
  507 + swap_entry_free(p, entry);
506 508 spin_unlock(&swap_lock);
507 509 }
508 510 }
... ... @@ -582,7 +584,7 @@
582 584  
583 585 p = swap_info_get(entry);
584 586 if (p) {
585   - if (swap_entry_free(p, swp_offset(entry)) == 1) {
  587 + if (swap_entry_free(p, entry) == 1) {
586 588 page = find_get_page(&swapper_space, entry.val);
587 589 if (page && !trylock_page(page)) {
588 590 page_cache_release(page);
... ... @@ -696,7 +698,8 @@
696 698 pte_t *pte;
697 699 int ret = 1;
698 700  
699   - if (mem_cgroup_try_charge(vma->vm_mm, GFP_HIGHUSER_MOVABLE, &ptr))
  701 + if (mem_cgroup_try_charge_swapin(vma->vm_mm, page,
  702 + GFP_HIGHUSER_MOVABLE, &ptr))
700 703 ret = -ENOMEM;
701 704  
702 705 pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
mm/vmscan.c
... ... @@ -1661,7 +1661,8 @@
1661 1661 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
1662 1662  
1663 1663 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
1664   - gfp_t gfp_mask)
  1664 + gfp_t gfp_mask,
  1665 + bool noswap)
1665 1666 {
1666 1667 struct scan_control sc = {
1667 1668 .may_writepage = !laptop_mode,
... ... @@ -1673,6 +1674,9 @@
1673 1674 .isolate_pages = mem_cgroup_isolate_pages,
1674 1675 };
1675 1676 struct zonelist *zonelist;
  1677 +
  1678 + if (noswap)
  1679 + sc.may_swap = 0;
1676 1680  
1677 1681 sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
1678 1682 (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);