09 Jan, 2009

40 commits

  • This is to avoid name clashes for the introduction of a global swap()
    macro.
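
    For reference, the generic swap() macro being prepared for looks
    roughly like this (a sketch of the kernel.h definition, not a line
    quoted from this patch):

    #define swap(a, b) \
            do { typeof(a) __tmp = (a); (a) = (b); (b) = __tmp; } while (0)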

    Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • In preparation for the introduction of a generic swap() macro.

    Signed-off-by: Wu Fengguang
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • In preparation for the introduction of a generic swap() macro.

    Signed-off-by: Wu Fengguang
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Acked-by: Theodore Ts'o
    Acked-by: Mark Fasheh
    Acked-by: David S. Miller
    Cc: James Morris
    Acked-by: Casey Schaufler
    Acked-by: Takashi Iwai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fernando Carrijo
     
  • romfs_strnlen() returns int, so storing its result in an unsigned
    variable makes the error check "unsigned X >= 0" always true.
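
    A minimal sketch of the bug class (illustrative values, not the
    actual romfs code):

    #include <stdio.h>

    int main(void)
    {
            int ret = -1;          /* e.g. an error returned as int */
            unsigned long n = ret; /* converts to a huge positive value */

            if (n >= 0)            /* "unsigned X >= 0": always true */
                    printf("dead error check, n = %lu\n", n);
            return 0;
    }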

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: roel kluin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    roel kluin
     
  • Remove the saved_max_pfn check from the /proc/vmcore function
    read_from_oldmem(). There is no need to verify; we should be able to
    just trust that "elfcorehdr=" is correctly passed to the crash kernel
    on the kernel command line, as we do with other parameters.

    The read_from_oldmem() function in fs/proc/vmcore.c is quite similar to
    read_from_oldmem() in drivers/char/mem.c, but only in the latter does it
    make sense to use saved_max_pfn. For oldmem it is used to determine when
    to stop reading. For vmcore we already have the ELF header info pointing
    out the physical memory regions; there is no need to pass the end of old
    memory twice.

    Removing the saved_max_pfn check from vmcore makes it possible for
    architectures to skip oldmem but still support crash dump through vmcore -
    without the need for the old saved_max_pfn cruft.

    Architectures that want to play safe can do the saved_max_pfn check in
    copy_oldmem_page(). It is not clear why anyone would want to do that,
    but it is even safer than today: the saved_max_pfn check in vmcore
    removed by this patch only checked the first page.
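
    An illustrative sketch of such a check (the signature matches the
    crash_dump interface of this era; the body is a placeholder, not a
    real architecture implementation):

    ssize_t copy_oldmem_page(unsigned long pfn, char *buf, size_t csize,
                             unsigned long offset, int userbuf)
    {
            if (pfn > saved_max_pfn)  /* refuse reads past old memory */
                    return -EINVAL;
            /* ... map the old-memory pfn and copy csize bytes to buf ... */
            return csize;
    }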

    Signed-off-by: Magnus Damm
    Acked-by: Vivek Goyal
    Acked-by: Simon Horman
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Magnus Damm
     
  • Send the completion status of commands to userspace. The message and
    protocol are described in the documentation.

    Signed-off-by: Evgeniy Polyakov
    Cc: Paul Alfille
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Polyakov
     
  • Add a command that allows resetting the bus.

    Signed-off-by: Evgeniy Polyakov
    Cc: Paul Alfille
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Polyakov
     
  • Signed-off-by: Evgeniy Polyakov
    Cc: Paul Alfille
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Polyakov
     
  • This small patchset extends the existing commands with reset, master
    IO and status messages. Reset is used to reset the bus for a given
    master device; the master IO command allows initiating IO against the
    bus itself, without selecting a slave device first, which can be used
    to probe for devices, for example. Status messages carry command
    completion status back to userspace (notably useful to get -ENODEV
    back when the requested device was not found).

    Great thanks to Paul Alfille of OWFS for testing and commands suggestions.

    This patch:

    Allow starting IO not only against already-found slave devices but
    also against the bus itself, which can be used, for example, to probe
    for devices.

    [akpm@linux-foundation.org: reindent switch statements]
    Signed-off-by: Evgeniy Polyakov
    Cc: Paul Alfille
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Polyakov
     
  • Signed-off-by: Evgeniy Polyakov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Polyakov
     
  • Initiates a search (or alarm search) and returns all found devices to
    userspace. Found devices are not added into the system (i.e. they are
    not attached to family devices or bus masters); that is done (if it
    has not happened already) by the usual timed searching.

    Signed-off-by: Evgeniy Polyakov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Polyakov
     
  • Writes data and returns the sampled data back to userspace.

    Signed-off-by: Evgeniy Polyakov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Polyakov
     
  • This patch series introduces and extends several userspace commands
    used with the netlink protocol.

    The touch block command allows writing data and returning sampled
    data to userspace.

    The search and alarm search commands are extended to return the list
    of slave devices found during the given search.

    The list masters command allows sending all registered master IDs to
    userspace.

    Great thanks to Paul Alfille (owfs), who tested this implementation
    and wrote the w1-to-network daemon
    http://sourceforge.net/projects/w1repeater/, and to Frederik Deweerdt
    and Randy Dunlap for review.

    This patch:

    Returns the list of registered bus master devices.
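
    A hedged userspace sketch of the request this command answers. The
    struct mirrors w1_netlink.h of this series; wrapping the message in
    a struct cn_msg and sending it over NETLINK_CONNECTOR is omitted:

    #include <stdint.h>
    #include <string.h>

    struct w1_netlink_msg {
            uint8_t  type;      /* e.g. W1_LIST_MASTERS */
            uint8_t  status;    /* filled in by the kernel in replies */
            uint16_t len;       /* payload length */
            union {
                    uint8_t id[8];                              /* slave id */
                    struct { uint32_t id; uint32_t res; } mst;  /* master id */
            } id;
            uint8_t  data[0];   /* command payload */
    };

    enum { W1_LIST_MASTERS = 6 };  /* value as in w1_netlink.h */

    int main(void)
    {
            struct w1_netlink_msg req;

            memset(&req, 0, sizeof(req));
            req.type = W1_LIST_MASTERS;  /* no payload; kernel replies with ids */
            /* ... wrap req in a struct cn_msg and send via the connector ... */
            return 0;
    }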

    Signed-off-by: Evgeniy Polyakov
    Cc: Paul Alfille
    Cc: Frederik Deweerdt
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Polyakov
     
  • This patch adds support for the 1-wire master interface for i.MX27 and
    i.MX31.

    Signed-off-by: Luotao Fu
    Signed-off-by: Sascha Hauer
    Signed-off-by: Evgeniy Polyakov
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sascha Hauer
     
  • Signed-off-by: Marcel Selhorst
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Smith
     
  • Add a driver for controlling Dell-specific backlight and rfkill interfaces.
    This driver makes use of the dcdbas interface to the Dell firmware to
    allow the backlight and rfkill interfaces on Dell systems to be driven
    through the standardised sysfs interfaces.

    Signed-off-by: Matthew Garrett
    Cc: Matt Domsch
    Cc: Ivo van Doorn
    Cc: Len Brown
    Cc: Richard Purdie
    Cc: Henrique de Moraes Holschuh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Garrett
     
  • The dcdbas code allows calls to be made into the firmware on Dell systems.
    Exporting this to other drivers allows them to implement Dell-specific
    functionality in a safe way.

    Signed-off-by: Matthew Garrett
    Cc: Matt Domsch
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Garrett
     
  • While discussing[1] the need for glibc to have access to random bytes
    during program load, it emerged that an earlier attempt to implement
    AT_RANDOM had stalled. This implements a random 16-byte string,
    available to every ELF program via a new AT_RANDOM auxv vector.

    [1] http://sourceware.org/ml/libc-alpha/2008-10/msg00006.html

    Ulrich said:

    Right after startup, glibc needs a bit of random data for internal
    protections (stack canary etc). What is now in upstream glibc is that
    we always unconditionally open /dev/urandom, read some data, and use
    it, for every process startup. That's slow.

    ...

    The solution is to provide a limited amount of random data to the
    starting process in the aux vector. I suggested 16 bytes and this is
    what the patch implements. If we need only 16 bytes or less we use the
    data directly. If we need more, we use the 16 bytes to seed a PRNG.
    This avoids the costly /dev/urandom use and it allows the kernel to
    use the most adequate source of random data for this purpose. It might
    not be the same pool as that for /dev/urandom.

    Concerns were expressed about the depletion of the randomness pool. But
    this patch doesn't make the situation worse, it doesn't deplete entropy
    more than happens now.
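
    A userspace sketch that prints the 16 AT_RANDOM bytes (it uses the
    later getauxval(3) helper for brevity; at the time of this patch one
    would walk the auxv directly):

    #include <elf.h>       /* AT_RANDOM */
    #include <stdio.h>
    #include <sys/auxv.h>  /* getauxval() */

    int main(void)
    {
            const unsigned char *rnd =
                    (const unsigned char *)getauxval(AT_RANDOM);

            if (rnd) {
                    for (int i = 0; i < 16; i++)
                            printf("%02x", rnd[i]);
                    printf("\n");
            }
            return 0;
    }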

    Signed-off-by: Kees Cook
    Cc: Jakub Jelinek
    Cc: Andi Kleen
    Cc: Ulrich Drepper
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • If a process registers for asynchronous notification on a POSIX message
    queue, it gets a signal and a siginfo_t structure when a message arrives
    on the message queue. The si_pid in the siginfo_t structure is set to the
    PID of the process that sent the message to the message queue.

    The principle is the following:
    - When mq_notify(SIGEV_SIGNAL) is called, the caller registers for
      notification when a message arrives. The associated pid structure
      is stored into inode_info->notify_owner. Let's call this process P1.
    - When mq_send() is called by, say, P2, P2 sends a signal to P1 to
      notify it of the message's arrival.

    The way .si_pid is set today is not correct, since it doesn't take
    into account the fact that the process sending the message might not
    be in the same pid namespace as the notified one.

    This patch sets si_pid to the sender's pid as seen from the
    notify_owner's pid namespace.
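
    A userspace sketch of the scenario above (the queue name is
    illustrative; link with -lrt; printf() in a signal handler is for
    demonstration only):

    #include <fcntl.h>
    #include <mqueue.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static void on_msg(int sig, siginfo_t *si, void *ctx)
    {
            /* si->si_pid: the sender's PID, translated into the
             * receiver's pid namespace by this patch. */
            printf("mq_send() came from pid %d\n", (int)si->si_pid);
    }

    int main(void)
    {
            struct sigaction sa = { .sa_sigaction = on_msg,
                                    .sa_flags = SA_SIGINFO };
            struct sigevent sev = { .sigev_notify = SIGEV_SIGNAL,
                                    .sigev_signo  = SIGUSR1 };
            mqd_t q = mq_open("/demo", O_RDONLY | O_CREAT, 0600, NULL);

            sigaction(SIGUSR1, &sa, NULL);
            mq_notify(q, &sev);  /* we are "P1" in the description */
            pause();             /* wait for some "P2" to call mq_send() */
            return 0;
    }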

    Signed-off-by: Nadia Derbey
    Signed-off-by: Sukadev Bhattiprolu
    Acked-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Bastian Blank
    Cc: Pavel Emelyanov
    Cc: Eric W. Biederman
    Acked-by: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • Currently task_active_pid_ns is not safe to call after a task becomes a
    zombie and exit_task_namespaces is called, as nsproxy becomes NULL. By
    reading the pid namespace from the pid of the task we can trivially solve
    this problem at the cost of one extra memory read in what should be the
    same cacheline as we read the namespace from.

    When moving things around I have made task_active_pid_ns out of line
    because keeping it in pid_namespace.h would require adding includes of
    pid.h and sched.h that I don't think we want.

    This change does make task_active_pid_ns unsafe to call during
    copy_process until we attach a pid to the task_struct, which seems to
    be a reasonable trade-off.
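
    A sketch of the out-of-line helper after this change (close to the
    actual code: the namespace now comes from the task's struct pid, so
    it works for zombies too):

    struct pid_namespace *task_active_pid_ns(struct task_struct *tsk)
    {
            return ns_of_pid(task_pid(tsk));
    }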

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Sukadev Bhattiprolu
    Cc: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Bastian Blank
    Cc: Pavel Emelyanov
    Cc: Nadia Derbey
    Acked-by: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • A current problem with the pid namespace is that it is easy to do pid
    related work after exit_task_namespaces which drops the nsproxy pointer.

    However if we are doing pid namespace related work we are always operating
    on some struct pid which retains the pid_namespace pointer of the pid
    namespace it was allocated in.

    So provide ns_of_pid which allows us to find the pid namespace a pid was
    allocated in.

    Using this we have the needed infrastructure to do pid namespace
    related work at any time we have a struct pid, removing the chance of
    an accidental NULL pointer dereference when accessing current->nsproxy.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Sukadev Bhattiprolu
    Cc: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Bastian Blank
    Cc: Pavel Emelyanov
    Cc: Nadia Derbey
    Acked-by: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Impact: cleanups, use new cpumask API

    Final trivial cleanups: mainly s/cpumask_t/struct cpumask

    Note there is a FIXME in generate_sched_domains(). A future patch will
    change struct cpumask *doms to struct cpumask *doms[].
    (I suppose Rusty will do this.)

    Signed-off-by: Li Zefan
    Cc: Ingo Molnar
    Cc: Rusty Russell
    Acked-by: Mike Travis
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Impact: use new cpumask API

    This patch mainly does the following things (see the sketch of the
    allocation pattern after this list):
    - change cs->cpus_allowed from cpumask_t to cpumask_var_t
    - call alloc_bootmem_cpumask_var() for top_cpuset in cpuset_init_early()
    - call alloc_cpumask_var() for other cpusets
    - replace cpus_xxx() with cpumask_xxx()
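
    A minimal sketch of the allocation pattern used throughout the
    conversion (typical kernel usage, not a line from this patch):

    cpumask_var_t mask;

    if (!alloc_cpumask_var(&mask, GFP_KERNEL))
            return -ENOMEM;
    cpumask_copy(mask, cpu_online_mask);  /* cpus_*() -> cpumask_*() */
    /* ... use mask ... */
    free_cpumask_var(mask);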

    Signed-off-by: Li Zefan
    Cc: Ingo Molnar
    Cc: Rusty Russell
    Acked-by: Mike Travis
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Impact: cleanups, reduce stack usage

    This patch prepares for the next patch. When we convert
    cpuset.cpus_allowed to cpumask_var_t, (trialcs = *cs) no longer works.

    Another result of this patch is reduced stack usage for trialcs:
    sizeof(*cs) can be as large as 148 bytes on x86_64, so it is really
    not good to have it on the stack.

    Signed-off-by: Li Zefan
    Cc: Ingo Molnar
    Cc: Rusty Russell
    Acked-by: Mike Travis
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Impact: reduce stack usage

    Allocate a global cpumask_var_t at boot and use it in cpuset_attach(),
    so that cpuset_attach() cannot fail due to a failed allocation.

    Signed-off-by: Li Zefan
    Cc: Ingo Molnar
    Cc: Rusty Russell
    Acked-by: Mike Travis
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Impact: reduce stack usage

    Just use cs->cpus_allowed directly; there is no need to allocate a
    cpumask_var_t.

    Signed-off-by: Li Zefan
    Cc: Ingo Molnar
    Cc: Rusty Russell
    Acked-by: Mike Travis
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • This patchset converts cpuset to the new cpumask API, removing
    on-stack cpumask_t variables to reduce stack usage.

    Before:
    # cat kernel/cpuset.c include/linux/cpuset.h | grep -c cpumask_t
    21
    After:
    # cat kernel/cpuset.c include/linux/cpuset.h | grep -c cpumask_t
    0

    This patch:

    Impact: reduce stack usage

    It's safe to call cpulist_scnprintf() inside callback_mutex, so we can
    just remove the on-stack cpumask_t; there is no need to allocate a
    cpumask_var_t either.

    Signed-off-by: Li Zefan
    Cc: Ingo Molnar
    Cc: Rusty Russell
    Acked-by: Mike Travis
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • I found a bug on my dual-CPU box. I created a sub cpuset in the top
    cpuset and assigned CPU 1 to its cpus. Then we attached some tasks to
    this sub cpuset. After this, we offlined CPU1. The tasks in this new
    cpuset were moved into the top cpuset automatically because there was
    no CPU left in the sub cpuset. When we then onlined CPU1 again, all
    the tasks that did not originally belong to the top cpuset ran only
    on CPU0.

    We fix this bug by setting a task's cpus_allowed to cpu_possible_map
    when attaching it to the top cpuset. This method does not modify the
    current behavior of cpusets on CPU hotplug, and all tasks in the top
    cpuset use cpu_possible_map to initialize their cpus_allowed.

    Signed-off-by: Miao Xie
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     
  • task_cs() calls task_subsys_state().

    We must use rcu_read_lock() to protect cgroup_subsys_state().

    It is true that top_cpuset is never freed, but cgroup_subsys_state()
    accesses a css_set, and this css_set may be freed while task_cs() is
    running.

    We therefore use rcu_read_lock() to protect it.

    Signed-off-by: Lai Jiangshan
    Acked-by: Paul Menage
    Cc: KAMEZAWA Hiroyuki
    Cc: Pavel Emelyanov
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • Add css_tryget(), which obtains a counted reference on a CSS. It is
    used in situations where the caller has a "weak" reference to the CSS,
    i.e. one that does not protect the cgroup from removal via a reference
    count, but would instead be cleaned up by a destroy() callback.

    css_tryget() will return true on success, or false if the cgroup is being
    removed.

    This is similar to Kamezawa Hiroyuki's patch from a week or two ago, but
    with the difference that in the event of css_tryget() racing with a
    cgroup_rmdir(), css_tryget() will only return false if the cgroup really
    does get removed.

    This implementation is done by biasing css->refcnt, so that a refcnt of 1
    means "releasable" and 0 means "released or releasing". In the event of a
    race, css_tryget() distinguishes between "released" and "releasing" by
    checking for the CSS_REMOVED flag in css->flags.
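
    A minimal sketch of that scheme (close to, but not necessarily
    identical to, the actual code):

    static inline bool css_tryget(struct cgroup_subsys_state *css)
    {
            while (!atomic_inc_not_zero(&css->refcnt)) {
                    if (test_bit(CSS_REMOVED, &css->flags))
                            return false;  /* really removed */
                    /* otherwise an rmdir is in flight and may still fail */
            }
            return true;
    }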

    Signed-off-by: Paul Menage
    Tested-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Update the memory controller to use its hierarchy_mutex rather than
    calling cgroup_lock() to protect against cgroup_mkdir()/cgroup_rmdir()
    occurring in its hierarchy.

    Signed-off-by: Paul Menage
    Tested-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • These patches introduce new locking/refcount support for cgroups to
    reduce the need for subsystems to call cgroup_lock(). This will
    ultimately allow the atomicity of cgroup_rmdir() (which was removed
    recently) to be restored.

    These three patches give:

    1/3 - introduce a per-subsystem hierarchy_mutex which a subsystem can
    use to prevent changes to its own cgroup tree

    2/3 - use hierarchy_mutex in place of calling cgroup_lock() in the
    memory controller

    3/3 - introduce a css_tryget() function similar to the one recently
    proposed by Kamezawa, but avoiding spurious refcount failures in
    the event of a race between a css_tryget() and an unsuccessful
    cgroup_rmdir()

    Future patches will likely involve:

    - using hierarchy mutex in place of cgroup_lock() in more subsystems
    where appropriate

    - restoring the atomicity of cgroup_rmdir() with respect to cgroup_create()

    This patch:

    Add a hierarchy_mutex to the cgroup_subsys object that protects changes to
    the hierarchy observed by that subsystem. It is taken by the cgroup
    subsystem (in addition to cgroup_mutex) for the following operations:

    - linking a cgroup into that subsystem's cgroup tree
    - unlinking a cgroup from that subsystem's cgroup tree
    - moving the subsystem to/from a hierarchy (including across the
    bind() callback)

    Thus if the subsystem holds its own hierarchy_mutex, it can safely
    traverse its own hierarchy.
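
    A sketch of the resulting usage pattern (the traversal itself is
    illustrative, not a real cgroup API):

    mutex_lock(&ss->hierarchy_mutex);
    /* ... safely walk this subsystem's cgroup tree; no cgroup can be
     * linked or unlinked while we hold the mutex ... */
    mutex_unlock(&ss->hierarchy_mutex);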

    Signed-off-by: Paul Menage
    Tested-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Now, you can see the following even when swap accounting is enabled:

    1. Create groups 01 and 02.
    2. Allocate a "file" on tmpfs from a task under 01.
    3. Swap out the "file" (by memory pressure).
    4. Read the "file" from a task in group 02.
    5. The charge for the "file" is moved to group 02.

    This is not ideal behavior. It happens because SwapCache loaded by
    read-ahead is not taken into account.

    This is a patch to fix shmem's swapcache behavior:
    - remove mem_cgroup_cache_charge_swapin();
    - add a SwapCache handler routine to mem_cgroup_cache_charge(); by
      this, shmem's file cache is charged at add_to_page_cache() with
      GFP_NOWAIT;
    - pass the swapcache page to shrink_mem_cgroup.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Now, a page can be deleted from SwapCache during do_swap_page().
    memcg-fix-swap-accounting-leak-v3.patch handles that, but LRU handling
    is still broken (the behavior above broke an assumption of the
    memcg-synchronized-lru patch).

    This patch is a fix for LRU handling (especially for the per-zone
    counters). When charging SwapCache:
    - remove the page_cgroup from the LRU if it is not in use;
    - add the page_cgroup to the LRU if it is not linked to one yet.

    Reported-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Paul Menage
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • From: KAMEZAWA Hiroyuki

    css_tryget() is newly added; it lets us know whether a css is alive
    and take a reference on it in a very safe way. ("Alive" here means
    that rmdir/destroy has not been called.)

    This patch replaces css_get() with css_tryget() wherever I cannot
    explain why css_get() is safe, and removes the memcg->obsolete flag.

    Reviewed-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Paul Menage
    Cc: Daisuke Nishimura
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • 1. Fix a double-free BUG in the error path of mem_cgroup_create():
       mem_cgroup_free() itself frees the per-zone info.
    2. Simplify the memcg refcount: take one reference at creation and
       free the memcg when the refcount drops to 0.

    Reviewed-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Paul Menage
    Cc: Daisuke Nishimura
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Fix the swapin charge operation of memcg.

    Now, memcg has hooks into the swap-out operation and checks whether a
    SwapCache page is really unused or not. That check depends on the
    contents of struct page, i.e. if PageAnon(page) && page_mapped(page),
    the page is recognized as still in use.

    Now, reuse_swap_page() calls delete_from_swap_cache() before the
    establishment of any rmap. Then, in the following sequence

    (Page fault with WRITE)
    try_charge()     (charge += PAGESIZE)
    commit_charge()  (check whether page_cgroup is in use or not)
    reuse_swap_page()
      -> delete_from_swapcache()
         -> mem_cgroup_uncharge_swapcache()  (charge -= PAGESIZE)

    the new charge is uncharged soon after. To avoid this, move
    commit_charge() to after page_mapcount() goes up to 1. By this,

    try_charge()       (usage += PAGESIZE)
    reuse_swap_page()  (may do usage -= PAGESIZE if PCG_USED is set)
    commit_charge()    (if page_cgroup is not marked PCG_USED,
                        add the new charge)

    accounting will be correct.

    Changelog (v2) -> (v3)
    - fixed invalid charge to swp_entry==0.
    - updated documentation.
    Changelog (v1) -> (v2)
    - fixed comment.

    [nishimura@mxp.nes.nec.co.jp: swap accounting leak doc fix]
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Tested-by: Balbir Singh
    Cc: Hugh Dickins
    Cc: Daisuke Nishimura
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • mem_cgroup_hierarchical_reclaim() works properly even when
    !use_hierarchy now (by memcg-hierarchy-avoid-unnecessary-reclaim.patch),
    so it should be used in many cases instead of
    try_to_free_mem_cgroup_pages().

    The only exception is force_empty; the group has no children in that
    case.

    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • mpol_rebind_mm(), which can be called from cpuset_attach(), does
    down_write(mm->mmap_sem). This means down_write(mm->mmap_sem) can be
    called under cgroup_mutex.

    OTOH, the page fault path does down_read(mm->mmap_sem) and calls
    mem_cgroup_try_charge_xxx(), which may eventually call
    mem_cgroup_out_of_memory(), and mem_cgroup_out_of_memory() calls
    cgroup_lock(). This means cgroup_lock() can be called under
    down_read(mm->mmap_sem).

    If those two paths race, a deadlock can happen.

    This patch avoids the deadlock by:
    - removing cgroup_lock() from mem_cgroup_out_of_memory();
    - defining a new mutex (memcg_tasklist) and serializing
      mem_cgroup_move_task() (the ->attach handler of the memory cgroup)
      and mem_cgroup_out_of_memory() with it, as sketched below.
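
    A minimal sketch of that serialization (names from the description;
    the critical sections are placeholders):

    static DEFINE_MUTEX(memcg_tasklist);

    /* taken in both mem_cgroup_move_task() and
     * mem_cgroup_out_of_memory(): */
    mutex_lock(&memcg_tasklist);
    /* ... critical section ... */
    mutex_unlock(&memcg_tasklist);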

    Signed-off-by: Daisuke Nishimura
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura