10 Jan, 2012

2 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
    igmp: Avoid zero delay when receiving odd mixture of IGMP queries
    netdev: make net_device_ops const
    bcm63xx: make ethtool_ops const
    usbnet: make ethtool_ops const
    net: Fix build with INET disabled.
    net: introduce netif_addr_lock_nested() and call it when appropriate
    net: correct lock name in dev_[uc/mc]_sync documentations.
    net: sk_update_clone is only used in net/core/sock.c
    8139cp: fix missing napi_gro_flush.
    pktgen: set correct max and min in pktgen_setup_inject()
    smsc911x: Unconditionally include linux/smscphy.h in smsc911x.h
    asix: fix infinite loop in rx_fixup()
    net: Default UDP and UNIX diag to 'n'.
    r6040: fix typo in use of MCR0 register bits
    net: fix sock_clone reference mismatch with tcp memcontrol

    Linus Torvalds
     
  • * 'for-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
    cgroup: fix to allow mounting a hierarchy by name
    cgroup: move assignment out of condition in cgroup_attach_proc()
    cgroup: Remove task_lock() from cgroup_post_fork()
    cgroup: add sparse annotation to cgroup_iter_start() and cgroup_iter_end()
    cgroup: mark cgroup_rmdir_waitq and cgroup_attach_proc() as static
    cgroup: only need to check oldcgrp==newgrp once
    cgroup: remove redundant get/put of task struct
    cgroup: remove redundant get/put of old css_set from migrate
    cgroup: Remove unnecessary task_lock before fetching css_set on migration
    cgroup: Drop task_lock(parent) on cgroup_fork()
    cgroups: remove redundant get/put of css_set from css_set_check_fetched()
    resource cgroups: remove bogus cast
    cgroup: kill subsys->can_attach_task(), pre_attach() and attach_task()
    cgroup, cpuset: don't use ss->pre_attach()
    cgroup: don't use subsys->can_attach_task() or ->attach_task()
    cgroup: introduce cgroup_taskset and use it in subsys->can_attach(), cancel_attach() and attach()
    cgroup: improve old cgroup handling in cgroup_attach_proc()
    cgroup: always lock threadgroup during migration
    threadgroup: extend threadgroup_lock() to cover exit and exec
    threadgroup: rename signal->threadgroup_fork_lock to ->group_rwsem
    ...

    Fix up conflict in kernel/cgroup.c due to commit e0197aae59e5: "cgroups:
    fix a css_set not found bug in cgroup_attach_proc" that already
    mentioned that the bug is fixed (differently) in Tejun's cgroup
    patchset. This one, in other words.

    Linus Torvalds
     

08 Jan, 2012

1 commit

  • Sockets can also be created through sock_clone. Because it copies
    all data in the sock structure, it also copies the memcg-related pointer,
    so accounting should still be fine. However, since socket creation now
    takes reference counts, we are left with cloned sockets that hold no
    reference, and the mismatch shows up when we destroy them. A sketch of
    the idea follows below.

    Signed-off-by: Glauber Costa
    CC: David S. Miller
    CC: Greg Thelen
    CC: Hiroyuki Kamezawa
    CC: Laurent Chavey
    Signed-off-by: David S. Miller

    Glauber Costa
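
    A hedged sketch of the idea behind this fix (helper and field names are
    the ones this series uses; treat the body as illustrative rather than the
    exact mainline hunk):

        static inline void sk_update_clone(const struct sock *sk, struct sock *newsk)
        {
                /* The clone copied sk->sk_cgrp along with everything else;
                 * take the per-memcg reference for the copy so the later
                 * destruction path has a matching reference to drop. */
                if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
                        sock_update_memcg(newsk);
        }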
     


23 Dec, 2011

1 commit

  • This reverts commit e5671dfae59b165e2adfd4dfbdeab11ac8db5bda.

    After a follow up discussion with Michal, it was agreed it would
    be better to leave the kmem controller with just the tcp files,
    deferring the behavior of the other general memory.kmem.* files
    for a later time, when more caches are controlled. This is because
    generic kmem files are not used by tcp accounting and it is
    not clear how other slab caches would fit into the scheme.

    We are reverting the original commit so we can track the reference.
    Part of the patch is kept, because it was used by the later tcp
    code. Conflicts are shown at the bottom. init/Kconfig is removed from
    the revert entirely.

    Signed-off-by: Glauber Costa
    Acked-by: Michal Hocko
    CC: Kirill A. Shutemov
    CC: Paul Menage
    CC: Greg Thelen
    CC: Johannes Weiner
    CC: David S. Miller

    Conflicts:

    Documentation/cgroups/memory.txt
    mm/memcontrol.c
    Signed-off-by: David S. Miller

    Glauber Costa
     

21 Dec, 2011

1 commit

  • If the request is to create a non-root group and we fail to meet it, we
    should leave the root unchanged.

    Signed-off-by: Hillf Danton
    Acked-by: Hugh Dickins
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     

13 Dec, 2011

4 commits

  • Currently, there's no way to pass multiple tasks to cgroup_subsys
    methods, which necessitates separate per-process and per-task methods.
    This patch introduces cgroup_taskset, which can be used to pass
    multiple tasks and their associated cgroups to cgroup_subsys methods
    (a usage sketch follows below).

    Three methods - can_attach(), cancel_attach() and attach() - are
    converted to use cgroup_taskset. This unifies passed parameters so
    that all methods have access to all information. Conversions in this
    patchset are identical and don't introduce any behavior change.

    -v2: documentation updated as per Paul Menage's suggestion.

    Signed-off-by: Tejun Heo
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Frederic Weisbecker
    Acked-by: Paul Menage
    Acked-by: Li Zefan
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: KAMEZAWA Hiroyuki
    Cc: James Morris

    Tejun Heo
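
    A rough sketch of how a controller method consumes the new taskset
    (cgroup_taskset_first()/cgroup_taskset_next() are the iterators this
    series introduces; example_task_is_allowed() is a hypothetical check):

        static int example_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
                                      struct cgroup_taskset *tset)
        {
                struct task_struct *task;

                /* Walk every task being migrated in this operation. */
                for (task = cgroup_taskset_first(tset); task;
                     task = cgroup_taskset_next(tset)) {
                        if (!example_task_is_allowed(task))
                                return -EPERM;
                }
                return 0;
        }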
     
  • This patch introduces memory pressure controls for the tcp
    protocol. It uses the generic socket memory pressure code
    introduced in earlier patches, and fills in the
    necessary data in cg_proto struct.

    Signed-off-by: Glauber Costa
    Reviewed-by: KAMEZAWA Hiroyuki
    CC: Eric W. Biederman
    Signed-off-by: David S. Miller

    Glauber Costa
     
  • The goal of this work is to move the memory pressure tcp
    controls to a cgroup, instead of just relying on global
    conditions.

    To avoid excessive overhead in the network fast paths,
    the code that accounts allocated memory to a cgroup is
    hidden inside a static_branch(). This branch is patched out
    until the first non-root cgroup is created, so when nobody is using
    memory cgroups, even if the controller is mounted, there should be no
    significant performance penalty (see the sketch below).

    This patch handles the generic part of the code, and has nothing
    tcp-specific.

    Signed-off-by: Glauber Costa
    Reviewed-by: KAMEZAWA Hiroyuki
    CC: Kirill A. Shutemov
    CC: David S. Miller
    CC: Eric W. Biederman
    CC: Eric Dumazet
    Signed-off-by: David S. Miller

    Glauber Costa
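
    A minimal sketch of the fast-path guard described above, using the
    jump-label API of this era (the key and helper names are the ones this
    series adds; the snippet is illustrative, not the exact mainline code):

        static inline void example_sock_account(struct sock *sk)
        {
                /* static_branch() compiles to a no-op jump until the key is
                 * enabled, which only happens once a non-root memcg starts
                 * limiting socket memory. */
                if (static_branch(&memcg_socket_limit_enabled))
                        sock_update_memcg(sk);
        }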
     
  • This patch lays down the foundation for the kernel memory component
    of the Memory Controller.

    As of today, I am only laying down the following files:

    * memory.independent_kmem_limit
    * memory.kmem.limit_in_bytes (currently ignored)
    * memory.kmem.usage_in_bytes (always zero)

    Signed-off-by: Glauber Costa
    CC: Kirill A. Shutemov
    CC: Paul Menage
    CC: Greg Thelen
    CC: Johannes Weiner
    CC: Michal Hocko
    Signed-off-by: David S. Miller

    Glauber Costa
     

07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of
    net: inet_timewait_sock doesnt need
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     

03 Nov, 2011

6 commits

  • Various code in memcontrol.c calls this_cpu_read() on calculations
    spanning two different percpu variables, or does an open-coded
    read-modify-write on a single percpu variable.

    Disable preemption throughout these operations so that the writes go to
    the correct places (the pattern is sketched below).

    [hannes@cmpxchg.org: added this_cpu to __this_cpu conversion]
    Signed-off-by: Johannes Weiner
    Signed-off-by: Steven Rostedt
    Cc: Greg Thelen
    Cc: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
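
    A minimal sketch of the pattern the fix enforces (field and variable
    names are illustrative, not the exact memcontrol.c code):

        /* An open-coded read-modify-write of a per-cpu counter must not be
         * preempted and migrated between the read and the write, or the
         * write lands in another CPU's copy. */
        preempt_disable();
        __this_cpu_add(memcg->stat->count[idx], val);
        preempt_enable();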
     
  • There is a potential race between a thread charging a page and another
    thread putting it back to the LRU list:

    charge:                          putback:
    SetPageCgroupUsed                SetPageLRU
    PageLRU && add to memcg LRU      PageCgroupUsed && add to memcg LRU

    The order of setting one flag and checking the other is crucial, otherwise
    the charge may observe !PageLRU while the putback observes !PageCgroupUsed
    and the page is not linked to the memcg LRU at all.

    Global memory pressure may fix this by trying to isolate and putback the
    page for reclaim, where that putback would link it to the memcg LRU again.
    Without that, the memory cgroup is undeletable due to a charge whose
    physical page can not be found and moved out.

    Signed-off-by: Johannes Weiner
    Cc: Ying Han
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Reclaim skips scanning an active list when the corresponding inactive
    list is above a certain relative size, so that the assumed working set
    is left alone while there are still enough reclaim candidates around.

    The memcg implementation of comparing those lists instead reports whether
    the whole memcg is low on the requested type of inactive pages,
    considering all nodes and zones.

    This can lead to an oversized active list not being scanned because of the
    state of the other lists in the memcg, as well as an active list being
    scanned while its corresponding inactive list has enough pages.

    Not only is this wrong, it's also a scalability hazard, because the global
    memory state over all nodes and zones has to be gathered for each memcg
    and zone scanned.

    Make these calculations purely based on the size of the two LRU lists
    that are actually affected by the outcome of the decision.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Reviewed-by: Minchan Kim
    Reviewed-by: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • If somebody is touching data too early, it might be easier to diagnose a
    problem when dereferencing NULL at mem->info.nodeinfo[node] than trying to
    understand why mem_cgroup_per_zone is [un|partly]initialized.

    Signed-off-by: Igor Mammedov
    Acked-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Igor Mammedov
     
  • Before calling schedule_timeout(), the task state should be changed
    (see the sketch below).

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
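
    The canonical pattern the fix restores (sketch; the timeout value is
    arbitrary): without setting the task state first, schedule_timeout()
    returns immediately instead of sleeping.

        set_current_state(TASK_INTERRUPTIBLE);  /* must precede the call */
        schedule_timeout(msecs_to_jiffies(10));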
     
  • The memcg code sometimes uses "struct mem_cgroup *mem" and sometimes uses
    "struct mem_cgroup *memcg". Rename all mem variables to memcg in the
    source file.

    Signed-off-by: Raghavendra K T
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Raghavendra K T
     

01 Nov, 2011

1 commit

  • Change the ISOLATE_XXX macros to a bitwise isolate_mode_t type. Normally,
    a macro isn't recommended, as it's type-unsafe and makes debugging harder
    since the symbol cannot be passed through to the debugger.

    Quote from Johannes
    " Hmm, it would probably be cleaner to fully convert the isolation mode
    into independent flags. INACTIVE, ACTIVE, BOTH is currently a
    tri-state among flags, which is a bit ugly."

    This patch moves isolate mode from swap.h to mmzone.h by memcontrol.h

    Signed-off-by: Minchan Kim
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     


15 Sep, 2011

1 commit

  • Revert the post-3.0 commit 82f9d486e59f5 ("memcg: add
    memory.vmscan_stat").

    The implementation of per-memcg reclaim statistics violates how memcg
    hierarchies usually behave: hierarchically.

    The reclaim statistics are accounted to child memcgs and the parent
    hitting the limit, but not to hierarchy levels in between. Usually,
    hierarchical statistics are perfectly recursive, with each level
    representing the sum of itself and all its children.

    Since this exports statistics to userspace, this may lead to confusion
    and problems with changing things after the release, so revert it now,
    we can try again later.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Michal Hocko
    Cc: Ying Han
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

26 Aug, 2011

2 commits

  • Commit 79dfdaccd1d5 ("memcg: make oom_lock 0 and 1 based rather than
    counter") tried to oom lock the hierarchy and roll back upon
    encountering an already locked memcg.

    The code is confused when it comes to detecting a locked memcg, though,
    so it would fail and rollback after locking one memcg and encountering
    an unlocked second one.

    The result is that oom-locking hierarchies fails unconditionally and
    that every oom killer invocation simply goes to sleep on the oom
    waitqueue forever. The tasks practically hang forever without anyone
    intervening, possibly holding locks that trip up unrelated tasks, too.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit d1a05b6973c7 ("memcg do not try to drain per-cpu caches without
    pages") added a drain_local_stock() call to a preemptible section.

    The draining task looks up the cpu-local stock twice to set the
    draining-flag, then to drain the stock and clear the flag again. If the
    task is migrated to a different CPU in between, no one will clear the
    flag on the first stock and it will be forever undrainable. Its charge
    cannot be recovered and the cgroup cannot be deleted anymore.

    Properly pin the task to the executing CPU while draining stocks
    (sketched below).

    Signed-off-by: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
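
    A sketch of the shape of the fix (the loop mirrors drain_all_stock() as
    described above, but treat the details as illustrative): keep the task on
    one CPU between flagging a stock for draining and draining it.

        int cpu, curcpu = get_cpu();    /* disables preemption: no migration */

        for_each_online_cpu(cpu) {
                struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);

                /* ... set FLUSHING_CACHED_CHARGE on the stock ... */
                if (cpu == curcpu)
                        drain_local_stock(&stock->work);  /* drain right here */
                else
                        schedule_work_on(cpu, &stock->work);
        }
        put_cpu();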
     

10 Aug, 2011

1 commit

  • This reverts commit 8521fc50d433507a7cdc96bec280f9e5888a54cc.

    The patch incorrectly assumes that using atomic FLUSHING_CACHED_CHARGE
    bit operations is sufficient but that is not true. Johannes Weiner has
    reported a crash during parallel memory cgroup removal:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
    IP: [] css_is_ancestor+0x20/0x70
    Oops: 0000 [#1] PREEMPT SMP
    Pid: 19677, comm: rmdir Tainted: G W 3.0.0-mm1-00188-gf38d32b #35 ECS MCP61M-M3/MCP61M-M3
    RIP: 0010:[] css_is_ancestor+0x20/0x70
    RSP: 0018:ffff880077b09c88 EFLAGS: 00010202
    Process rmdir (pid: 19677, threadinfo ffff880077b08000, task ffff8800781bb310)
    Call Trace:
    [] mem_cgroup_same_or_subtree+0x33/0x40
    [] drain_all_stock+0x11f/0x170
    [] mem_cgroup_force_empty+0x231/0x6d0
    [] mem_cgroup_pre_destroy+0x14/0x20
    [] cgroup_rmdir+0xb9/0x500
    [] vfs_rmdir+0x86/0xe0
    [] do_rmdir+0xfb/0x110
    [] sys_rmdir+0x16/0x20
    [] system_call_fastpath+0x16/0x1b

    We are crashing because we try to dereference the cached memcg while
    checking whether we should wait for draining on the cache, but the cache
    has already been cleaned up.

    There is also a theoretical chance that the cached memcg gets freed
    between the test for FLUSHING_CACHED_CHARGE and the dereference in
    mem_cgroup_same_or_subtree:

    CPU0 CPU1 CPU2
    mem=stock->cached
    stock->cached=NULL
    clear_bit
    test_and_set_bit
    test_bit() ...
    mem_cgroup_destroy
    use after free

    The percpu_charge_mutex protected against this race because sync draining
    is exclusive.

    It is safer to revert now and come up with a more parallel
    implementation later.

    Signed-off-by: Michal Hocko
    Reported-by: Johannes Weiner
    Acked-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

04 Aug, 2011

1 commit

  • Remove mem_cgroup_shmem_charge_fallback(): it was only required when we
    had to move swappage to filecache with GFP_NOWAIT.

    Remove the GFP_NOWAIT special case from mem_cgroup_cache_charge(), by
    moving its call out from shmem_add_to_page_cache() to two of its three
    callers. But leave it doing mem_cgroup_uncharge_cache_page() on error:
    although asymmetrical, it's easier for all three callers to handle.

    These two changes would also be appropriate if anyone were to start
    using shmem_read_mapping_page_gfp() with GFP_NOWAIT.

    Remove mem_cgroup_get_shmem_target(): mc_handle_file_pte() can test
    radix_tree_exceptional_entry() to get what it needs for itself.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

27 Jul, 2011

10 commits

  • percpu_charge_mutex protects from multiple simultaneous per-cpu charge
    caches draining because we might end up having too many work items. At
    least this was the case until commit 26fe61684449 ("memcg: fix percpu
    cached charge draining frequency") when we introduced a more targeted
    draining for async mode.

    Now that also sync draining is targeted we can safely remove mutex
    because we will not send more work than the current number of CPUs.
    FLUSHING_CACHED_CHARGE protects against sending the same work multiple
    times, and stock->nr_pages == 0 protects against pointlessly sending work
    when there is obviously nothing to be done. This is of course racy but we
    can live with it as the race window is really small (we would have to
    see FLUSHING_CACHED_CHARGE cleared while nr_pages would be still
    non-zero).

    The only remaining place where we can race is synchronous mode when we
    rely on FLUSHING_CACHED_CHARGE test which might have been set by other
    drainer on the same group but we should wait in that case as well.

    Signed-off-by: Michal Hocko
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • We check whether two given groups are the same, or at least in the same
    subtree of a hierarchy, in several places. Let's add a helper for this
    to make the code easier to read (sketched below).

    Signed-off-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
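
    The shape of the helper being introduced (sketch; the hierarchy check is
    as described above, other details are illustrative):

        static bool mem_cgroup_same_or_subtree(struct mem_cgroup *root_memcg,
                                               struct mem_cgroup *memcg)
        {
                if (root_memcg == memcg)
                        return true;
                if (!root_memcg->use_hierarchy)
                        return false;
                return css_is_ancestor(&memcg->css, &root_memcg->css);
        }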
     
  • Currently we have two ways to drain per-CPU caches for charges.
    drain_all_stock_sync will synchronously drain all caches, while
    drain_all_stock_async will asynchronously drain only those that refer to
    a given memory cgroup or its subtree in the hierarchy. Targeted async
    draining was introduced by 26fe6168 ("memcg: fix percpu cached
    charge draining frequency") to reduce the number of CPU workers.

    sync draining is currently triggered only from mem_cgroup_force_empty
    which is triggered only by userspace (mem_cgroup_force_empty_write) or
    when a cgroup is removed (mem_cgroup_pre_destroy). Although these are
    not usually frequent operations it still makes some sense to do targeted
    draining as well, especially if the box has many CPUs.

    This patch unifies both methods to use a single implementation
    (drain_all_stock) which relies on the original async code and just adds
    flush_work to wait, in sync mode, on all caches that are still under
    work. We use the FLUSHING_CACHED_CHARGE bit check to avoid waiting on
    work that we haven't triggered (see the sketch below). Please note that
    both the sync and async functions are currently protected by
    percpu_charge_mutex, so we cannot race with other drainers.

    Signed-off-by: Michal Hocko
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
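
    A sketch of the sync-mode tail described above (the structure follows the
    commit message; field and flag names are from memcontrol.c of this era,
    other details are illustrative):

        /* Sync mode: after kicking the per-cpu works, wait for every stock
         * that is still flagged as being flushed.  The FLUSHING_CACHED_CHARGE
         * check keeps us from waiting on work we never triggered. */
        if (sync) {
                for_each_online_cpu(cpu) {
                        struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);

                        if (test_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
                                flush_work(&stock->work);
                }
        }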
     
  • drain_all_stock_async tries to optimize the work to be done on the work
    queue by excluding any work for the current CPU, because it assumes that
    the context we are called from has already tried to charge from that
    cache and failed, so it must already be empty.

    While the assumption is correct, we can optimize even further by checking
    the current number of pages in the cache. This also avoids queueing work
    for other CPUs whose stock is empty.

    For the current CPU we can simply call drain_local_stock rather than
    deferring it to the work queue.

    [kamezawa.hiroyu@jp.fujitsu.com: use drain_local_stock for current CPU optimization]
    Signed-off-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The commit log of 0ae5e89c60c9 ("memcg: count the soft_limit reclaim
    in...") says it adds scanning stats to the memory.stat file. But it
    doesn't, because we decided we needed to reach a consensus on such new APIs.

    This patch is a trial to add memory.scan_stat. This shows
    - the number of scanned pages(total, anon, file)
    - the number of rotated pages(total, anon, file)
    - the number of freed pages(total, anon, file)
    - the elapsed time (including sleep/pause time)

    for both of direct/soft reclaim.

    The biggest difference from Ying's original version is that this file
    can be reset by a write, as

    # echo 0 ...../memory.scan_stat

    Example of output is here. This is a result after make -j 6 kernel
    under 300M limit.

    [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.scan_stat
    [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.vmscan_stat
    scanned_pages_by_limit 9471864
    scanned_anon_pages_by_limit 6640629
    scanned_file_pages_by_limit 2831235
    rotated_pages_by_limit 4243974
    rotated_anon_pages_by_limit 3971968
    rotated_file_pages_by_limit 272006
    freed_pages_by_limit 2318492
    freed_anon_pages_by_limit 962052
    freed_file_pages_by_limit 1356440
    elapsed_ns_by_limit 351386416101
    scanned_pages_by_system 0
    scanned_anon_pages_by_system 0
    scanned_file_pages_by_system 0
    rotated_pages_by_system 0
    rotated_anon_pages_by_system 0
    rotated_file_pages_by_system 0
    freed_pages_by_system 0
    freed_anon_pages_by_system 0
    freed_file_pages_by_system 0
    elapsed_ns_by_system 0
    scanned_pages_by_limit_under_hierarchy 9471864
    scanned_anon_pages_by_limit_under_hierarchy 6640629
    scanned_file_pages_by_limit_under_hierarchy 2831235
    rotated_pages_by_limit_under_hierarchy 4243974
    rotated_anon_pages_by_limit_under_hierarchy 3971968
    rotated_file_pages_by_limit_under_hierarchy 272006
    freed_pages_by_limit_under_hierarchy 2318492
    freed_anon_pages_by_limit_under_hierarchy 962052
    freed_file_pages_by_limit_under_hierarchy 1356440
    elapsed_ns_by_limit_under_hierarchy 351386416101
    scanned_pages_by_system_under_hierarchy 0
    scanned_anon_pages_by_system_under_hierarchy 0
    scanned_file_pages_by_system_under_hierarchy 0
    rotated_pages_by_system_under_hierarchy 0
    rotated_anon_pages_by_system_under_hierarchy 0
    rotated_file_pages_by_system_under_hierarchy 0
    freed_pages_by_system_under_hierarchy 0
    freed_anon_pages_by_system_under_hierarchy 0
    freed_file_pages_by_system_under_hierarchy 0
    elapsed_ns_by_system_under_hierarchy 0

    total_xxxx is for hierarchy management.

    This will be useful for further memcg development and needs to be
    developed before we do any complicated rework of LRU/softlimit
    management.

    This patch adds a new struct memcg_scanrecord into scan_control struct.
    sc->nr_scanned et al. are not designed for exporting information. For
    example, nr_scanned is reset frequently and incremented by +2 when
    scanning mapped pages.

    To avoid complexity, I added a new param in scan_control which is for
    exporting scanning score.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Michal Hocko
    Cc: Ying Han
    Cc: Andrew Bresticker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Commit 22a668d7c3ef ("memcg: fix behavior under memory.limit equals to
    memsw.limit") introduced "memsw_is_minimum" flag, which becomes true
    when mem_limit == memsw_limit. The flag is checked at the beginning of
    reclaim, and "noswap" is set if the flag is true, because using swap is
    meaningless in this case.

    This works well in most cases, but when we try to shrink mem_limit,
    which is now the same as memsw_limit, we might fail to shrink mem_limit
    because swap isn't used.

    This patch fixes this behavior by:
    - checking MEM_CGROUP_RECLAIM_SHRINK at the beginning of reclaim
    - if it is set, not setting the "noswap" flag even if memsw_is_minimum is true.

    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ying Han
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • memcg_oom_mutex is used to protect memcg OOM path and eventfd interface
    for oom_control. None of the critical sections which it protects sleep
    (eventfd_signal works from atomic context and the rest are simple linked
    list resp. oom_lock atomic operations).

    A mutex is also too heavyweight for those code paths because it triggers
    a lot of scheduling. It also makes convoy effects more visible when we
    have a large number of oom kills, because we take the lock multiple times
    during mem_cgroup_handle_oom, so there are multiple places where many
    processes can sleep.

    Signed-off-by: Michal Hocko
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Commit 867578cb ("memcg: fix oom kill behavior") introduced an oom_lock
    counter which is incremented by mem_cgroup_oom_lock when we are about to
    handle a memcg OOM situation. mem_cgroup_handle_oom falls back to a sleep
    if oom_lock > 1 to prevent multiple oom kills at the same time.
    The counter is then decremented by mem_cgroup_oom_unlock called from the
    same function.

    This works correctly but it can lead to serious starvations when we have
    many processes triggering OOM and many CPUs available for them (I have
    tested with 16 CPUs).

    Consider a process (call it A) which gets the oom_lock (the first one
    that got to mem_cgroup_handle_oom and grabbed memcg_oom_mutex) and other
    processes that are blocked on the mutex. While A releases the mutex and
    calls mem_cgroup_out_of_memory others will wake up (one after another)
    and increase the counter and fall into sleep (memcg_oom_waitq).

    Once A finishes mem_cgroup_out_of_memory it takes the mutex again and
    decreases oom_lock and wakes other tasks (if releasing memory by
    somebody else - e.g. killed process - hasn't done it yet).

    A testcase would look like:
    Assume malloc XXX is a program allocating XXX Megabytes of memory
    which touches all allocated pages in a tight loop
    # swapoff SWAP_DEVICE
    # cgcreate -g memory:A
    # cgset -r memory.oom_control=0 A
    # cgset -r memory.limit_in_bytes= 200M
    # for i in `seq 100`
    # do
    # cgexec -g memory:A malloc 10 &
    # done

    The main problem here is that all processes still race for the mutex and
    there is no guarantee that we will get the counter back to 0 for those
    that got back to mem_cgroup_handle_oom. In the end the whole convoy
    in/decreases the counter but we never get down to 1, which would enable
    killing, so nothing useful can be done. The time is basically unbounded
    because it highly depends on scheduling and ordering on mutex (I have
    seen this taking hours...).

    This patch replaces the counter by a simple {un}lock semantic. As
    mem_cgroup_oom_{un}lock works on a subtree of a hierarchy, we have to
    make sure that nobody else races with us, which is guaranteed by
    memcg_oom_mutex.

    We have to be careful while locking subtrees because we can encounter a
    subtree which is already locked. Consider the hierarchy:

          A
         / \
        B   \
       / \   \
      C   D   E

    B - C - D tree might be already locked. While we want to enable locking
    E subtree because OOM situations cannot influence each other we
    definitely do not want to allow locking A.

    Therefore we have to refuse lock if any subtree is already locked and
    clear up the lock for all nodes that have been set up to the failure
    point.

    On the other hand we have to make sure that the rest of the world will
    recognize that a group is under OOM even though it doesn't have a lock.
    Therefore we introduce an under_oom variable which is incremented and
    decremented for the whole subtree when we enter and leave
    mem_cgroup_handle_oom, respectively (see the sketch below). under_oom,
    unlike oom_lock, doesn't need to be updated under memcg_oom_mutex because
    its users only check a single group and they use atomic operations for
    that.

    This can be checked easily by the following test case:

    # cgcreate -g memory:A
    # cgset -r memory.use_hierarchy=1 A
    # cgset -r memory.oom_control=1 A
    # cgset -r memory.limit_in_bytes= 100M
    # cgset -r memory.memsw.limit_in_bytes= 100M
    # cgcreate -g memory:A/B
    # cgset -r memory.oom_control=1 A/B
    # cgset -r memory.limit_in_bytes=20M
    # cgset -r memory.memsw.limit_in_bytes=20M
    # cgexec -g memory:A/B malloc 30 & #->this will be blocked by OOM of group B
    # cgexec -g memory:A malloc 80 & #->this will be blocked by OOM of group A

    While B gets oom_lock A will not get it. Both of them go into sleep and
    wait for an external action. We can make the limit higher for A to
    enforce waking it up

    # cgset -r memory.memsw.limit_in_bytes=300M A
    # cgset -r memory.limit_in_bytes=300M A

    malloc in A has to wake up even though it doesn't have oom_lock.

    Finally, the unlock path is very easy because we always unlock only the
    subtree we have locked previously while we always decrement under_oom.

    Signed-off-by: Michal Hocko
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
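
    A hedged sketch of the under_oom bookkeeping described above (the
    iterator and field names follow memcontrol.c of this era; local names and
    the surrounding flow are illustrative):

        struct mem_cgroup *iter;

        /* Entering mem_cgroup_handle_oom: mark the whole subtree. */
        for_each_mem_cgroup_tree(iter, memcg)
                atomic_inc(&iter->under_oom);

        /* ... try to take oom_lock, sleep on memcg_oom_waitq otherwise ... */

        /* Leaving: unconditionally unmark the subtree again. */
        for_each_mem_cgroup_tree(iter, memcg)
                atomic_dec(&iter->under_oom);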
     
  • In mm/memcontrol.c, there are many lru stat functions as..

    mem_cgroup_zone_nr_lru_pages
    mem_cgroup_node_nr_file_lru_pages
    mem_cgroup_nr_file_lru_pages
    mem_cgroup_node_nr_anon_lru_pages
    mem_cgroup_nr_anon_lru_pages
    mem_cgroup_node_nr_unevictable_lru_pages
    mem_cgroup_nr_unevictable_lru_pages
    mem_cgroup_node_nr_lru_pages
    mem_cgroup_nr_lru_pages
    mem_cgroup_get_local_zonestat

    Some of them are under #if MAX_NUMNODES > 1 and others are not.
    This seems bad. This patch consolidates all functions into

    mem_cgroup_zone_nr_lru_pages()
    mem_cgroup_node_nr_lru_pages()
    mem_cgroup_nr_lru_pages()

    For these functions, "which LRU?" information is passed by a mask.

    example:
    mem_cgroup_nr_lru_pages(mem, BIT(LRU_ACTIVE_ANON))

    And I added some macros: ALL_LRU, ALL_LRU_FILE, ALL_LRU_ANON.

    example:
    mem_cgroup_nr_lru_pages(mem, ALL_LRU)

    BTW, considering the NUMA placement of the counters in memory, this patch
    also seems better.

    Now, when we gather all LRU information, we scan in the following order:
    for_each_lru -> for_each_node -> for_each_zone.

    This means we touch cache lines in different nodes in turn.

    After the patch, we scan
    for_each_node -> for_each_zone -> for_each_lru(mask)

    Then, we gather the information in the same cacheline at once.

    [akpm@linux-foundation.org: fix warnings, build error]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Michal Hocko
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Each memory cgroup has a 'swappiness' value which can be accessed by
    get_swappiness(memcg). The major user is try_to_free_mem_cgroup_pages()
    and swappiness is passed by argument. It's propagated by scan_control.

    get_swappiness() is a static function, but some planned updates will need
    to get swappiness from files other than memcontrol.c. This patch exports
    get_swappiness() as mem_cgroup_swappiness(). With this, we can remove the
    swappiness argument from try_to_free... and drop swappiness from
    scan_control; only memcg uses it.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Michal Hocko
    Cc: Ying Han
    Cc: Shaohua Li
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

09 Jul, 2011

2 commits

  • Commit 889976dbcb12 ("memcg: reclaim memory from nodes in round-robin
    order") adds a NUMA node round-robin for memcg, but the information is
    updated only once per 10 seconds.

    This patch changes the update trigger from jiffies to memcg's event count.
    After this patch, numa scan information will be updated when we see 1024
    events of pagein/pageout under a memcg.

    [akpm@linux-foundation.org: attempt to repair code layout]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ying Han
    Cc: Johannes Weiner
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Now, in mem_cgroup_hierarchical_reclaim(), mem_cgroup_local_usage() is
    used to check whether the memcg contains reclaimable pages or not. If
    there are no pages in it, the routine skips it.

    But mem_cgroup_local_usage() includes unevictable pages and cannot handle
    the "noswap" condition correctly, so it doesn't work on a swapless system.

    This patch adds test_mem_cgroup_reclaimable() and replaces
    mem_cgroup_local_usage() with it. test_mem_cgroup_reclaimable() looks at
    the LRU counters and returns the correct answer to the caller. The new
    function takes a "noswap" argument and can look at only the file LRU if
    necessary.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix kerneldoc layout]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ying Han
    Cc: Johannes Weiner
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

28 Jun, 2011

1 commit

  • Before adding any more global entry points into shmem.c, gather such
    prototypes into shmem_fs.h. Remove mm's own declarations from swap.h,
    but for now leave the ones in mm.h: because shmem_file_setup() and
    shmem_zero_setup() are called from various places, and we should not
    force other subsystems to update immediately.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

16 Jun, 2011

3 commits

  • Based on Michal Hocko's comment.

    We are not draining per-cpu cached charges during soft limit reclaim
    because background reclaim doesn't care about charges. It tries to free
    some memory, and draining charges will not free any.

    Cached charges might influence only the selection of the biggest soft
    limit offender, but as the call is made only after the selection has
    already been done, it makes no difference.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • For performance, memory cgroup caches some "charge" from res_counter into
    per cpu cache. This works well but because it's cache, it needs to be
    flushed in some cases. Typical cases are

    1. when someone hit limit.

    2. when rmdir() is called and charges need to be 0.

    But "1" has a problem.

    Recently, with large SMP machines, we see many kworker runs caused by
    flushing memcg's cache. The bad part of the implementation is that the
    drain code is called even if a CPU holds a cache for a memcg unrelated
    to the one that hit its limit.

    This patch:
    A) checks whether the percpu cache contains useful data,
    B) checks that no other asynchronous percpu draining is running,
    C) doesn't call the local cpu callback.

    (*) This patch avoids changing the calling conditions for the hard limit.

    When I run "cat 1Gfile > /dev/null" under 300M limit memcg,

    [Before]
    13767 kamezawa 20 0 98.6m 424 416 D 10.0 0.0 0:00.61 cat
    58 root 20 0 0 0 0 S 0.6 0.0 0:00.09 kworker/2:1
    60 root 20 0 0 0 0 S 0.6 0.0 0:00.08 kworker/4:1
    4 root 20 0 0 0 0 S 0.3 0.0 0:00.02 kworker/0:0
    57 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/1:1
    61 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/5:1
    62 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/6:1
    63 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/7:1

    [After]
    2676 root 20 0 98.6m 416 416 D 9.3 0.0 0:00.87 cat
    2626 kamezawa 20 0 15192 1312 920 R 0.3 0.0 0:00.28 top
    1 root 20 0 19384 1496 1204 S 0.0 0.0 0:00.66 init
    2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
    3 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0
    4 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0

    [akpm@linux-foundation.org: make percpu_charge_mutex static, tweak comments]
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Daisuke Nishimura
    Reviewed-by: Michal Hocko
    Tested-by: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Hierarchical reclaim doesn't swap out if the memsw and resource limits are
    the same (memsw_is_minimum == true), because we would hit the mem+swap
    limit anyway (during hard limit reclaim).

    When it comes to the soft limit we shouldn't consider memsw_is_minimum at
    all, because it doesn't make much sense there. Either the soft limit is
    below the hard limit, and then we cannot hit the mem+swap limit, or
    direct reclaim takes precedence.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Acked-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki