Eric Lee / smarc-fsl-linux-kernel

28 May, 2010

40 commits

d344193a0 exit: avoid sig->count in de_thread/__exit_signal synchronization

de_thread() and __exit_signal() use signal_struct->count/notify_count for
synchronization. We can simplify the code and use ->notify_count only.
Instead of comparing these two counters, we can change de_thread() to set
->notify_count = nr_of_sub_threads, then change __exit_signal() to
dec-and-test this counter and notify group_exit_task.

Note that __exit_signal() checks "notify_count > 0" just for symmetry with
exit_notify(), we could just check it is != 0.

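The dec-and-test scheme described above can be sketched in userspace as follows. This is a minimal model, not the kernel code: `signal_model`, `model_de_thread()`, `model_exit_signal()` and the `group_exit_woken` flag are hypothetical names, and the real code runs under the appropriate locks.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only; names are illustrative, not kernel identifiers. */
struct signal_model {
    int notify_count;       /* sub-threads the exec'ing thread still waits for */
    bool group_exit_woken;  /* stands in for waking group_exit_task */
};

/* Models de_thread(): record how many sub-threads must exit first. */
static void model_de_thread(struct signal_model *sig, int nr_of_sub_threads)
{
    assert(nr_of_sub_threads >= 0);
    sig->notify_count = nr_of_sub_threads;
    sig->group_exit_woken = false;
}

/* Models __exit_signal(): every exiting sub-thread decrements the
 * counter, and the one that brings it to zero notifies the waiter. */
static void model_exit_signal(struct signal_model *sig)
{
    if (sig->notify_count > 0 && --sig->notify_count == 0)
        sig->group_exit_woken = true;
}
```

With three sub-threads, the waiter is notified only by the third call to `model_exit_signal()`.
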
Signed-off-by: Oleg Nesterov
Acked-by: Roland McGrath
Cc: Veaceslav Falico
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2010-05-28 00:12:46 +0800
09faef11d exit: change zap_other_threads() to count sub-threads

Change zap_other_threads() to return the number of other sub-threads found
on ->thread_group list.

Other changes are cosmetic:

- change the code to use while_each_thread() helper

- remove the obsolete comment about SIGKILL/SIGSTOP

Signed-off-by: Oleg Nesterov
Acked-by: Roland McGrath
Cc: Veaceslav Falico
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2010-05-28 00:12:46 +0800
9c3391684 exit: exit_notify() can trust signal->notify_count < 0

signal_struct->count in its current form must die.

- it has no reasons to be atomic_t

- it looks like a reference counter, but it is not

- otoh, we really need to make task->signal refcountable, just look at
the extremely ugly task_rq_unlock_wait() called from __exit_signal().

- we should change the lifetime rules for task->signal, it should be
pinned to task_struct. We have a lot of code which can be simplified
after that.

- it is not needed! while the code is correct, any usage of this
counter is artificial, except fs/proc uses it correctly to show the
number of threads.

This series removes the usage of sig->count from exit paths.

This patch:

Now that Veaceslav changed copy_signal() to use zalloc(), exit_notify()
can just check notify_count < 0 to ensure the execing sub-thread needs
the notification from us. No need to do other checks, notify_count != 0
must always mean ->group_exit_task != NULL is waiting for us.

Signed-off-by: Oleg Nesterov
Acked-by: Roland McGrath
Cc: Veaceslav Falico
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2010-05-28 00:12:45 +0800
269b005a2 coredump: shift down_write(mmap_sem) into coredump_wait()

- move the cprm.mm_flags checks up, before we take mmap_sem

- move down_write(mmap_sem) and ->core_state check from do_coredump()
to coredump_wait()

This simplifies the code and makes the locking symmetrical.

Signed-off-by: Oleg Nesterov
Cc: David Howells
Cc: Neil Horman
Cc: Roland McGrath
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2010-05-28 00:12:45 +0800
5e43aef53 coredump: factor out put_cred() calls

Given that do_coredump() calls put_cred() on exit path, it is a bit ugly
to do put_cred() + "goto fail" twice, just add the new "fail_creds" label.

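The single-label cleanup pattern referred to above looks roughly like this. It is a userspace sketch: `dump_model()`, its `fail_step` parameter and the `creds_released` counter are hypothetical, and a heap allocation stands in for the credentials that do_coredump() releases with put_cred().

```c
#include <assert.h>
#include <stdlib.h>

/* Counts releases so the single-exit behaviour can be observed. */
static int creds_released;

/* fail_step selects which (hypothetical) step fails; 0 means success. */
static int dump_model(int fail_step)
{
    int ret = -1;
    int *cred = malloc(sizeof(*cred));  /* stands in for the acquired creds */

    assert(fail_step >= 0);
    if (!cred)
        goto fail;
    if (fail_step == 1)
        goto fail_creds;    /* no put + "goto fail" pair needed here */
    if (fail_step == 2)
        goto fail_creds;
    ret = 0;
fail_creds:
    free(cred);             /* the one place where the creds are dropped */
    creds_released++;
fail:
    return ret;
}
```

Every path that has acquired the credentials leaves through `fail_creds`, so the release appears exactly once.
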
Signed-off-by: Oleg Nesterov
Cc: David Howells
Cc: Neil Horman
Cc: Roland McGrath
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2010-05-28 00:12:45 +0800
d5bf4c4f5 coredump: cleanup "ispipe" code

- kill "int helper_argc", argv_split(argcp) accepts argcp == NULL.

- move "int dump_count" under the "if (ispipe)" branch, fail_dropcount
can check ispipe.

- move "char **helper_argv" as well, change the code to do argv_free()
right after call_usermodehelper_fns().

- If call_usermodehelper_fns() fails goto close_fail label instead
of closing the file by hand.

Signed-off-by: Oleg Nesterov
Cc: David Howells
Cc: Neil Horman
Cc: Roland McGrath
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2010-05-28 00:12:45 +0800
c71354112 coredump: factor out the not-ispipe file checks

do_coredump() does a lot of file checks after it opens the file or calls
usermode helper. But all of these checks are only needed in !ispipe case.

Move this code into the "else" branch and kill the ugly repetitive ispipe
checks.

Signed-off-by: Oleg Nesterov
Cc: David Howells
Cc: Neil Horman
Cc: Roland McGrath
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2010-05-28 00:12:45 +0800
04b1c384f call_usermodehelper: UMH_WAIT_EXEC ignores kernel_thread() failure

UMH_WAIT_EXEC should report the error if kernel_thread() fails, like
UMH_WAIT_PROC does.

Signed-off-by: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2010-05-28 00:12:45 +0800
d47419cd9 call_usermodehelper: simplify/fix UMH_NO_WAIT case

__call_usermodehelper(UMH_NO_WAIT) has 2 problems:

- if kernel_thread() fails, call_usermodehelper_freeinfo()
is not called.

- for unknown reason UMH_NO_WAIT has UMH_WAIT_PROC logic,
we spawn yet another thread which waits until the user
mode application exits.

Change the UMH_NO_WAIT code to use ____call_usermodehelper() instead of
wait_for_helper(), and do call_usermodehelper_freeinfo() unconditionally.
We can rely on CLONE_VFORK, do_fork(CLONE_VFORK) sleeps until the child
exits or execs.

With or without this patch UMH_NO_WAIT does not report the error if
kernel_thread() fails, this is correct since the caller doesn't wait for
result.

Signed-off-by: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2010-05-28 00:12:45 +0800
7d6422421 wait_for_helper: SIGCHLD from user-space can lead to use-after-free

1. wait_for_helper() calls allow_signal(SIGCHLD) to ensure the child
can't autoreap itself.

However, this means that a spurious SIGCHLD from user-space can
set TIF_SIGPENDING and:

- kernel_thread() or sys_wait4() can fail due to signal_pending()

- worse, wait4() can fail before ____call_usermodehelper() execs
or exits. In this case the caller may kfree(subprocess_info)
while the child still uses this memory.

Change the code to use SIG_DFL instead of magic "(void __user *)2"
set by allow_signal(). This means that SIGCHLD won't be delivered,
yet the child won't autoreap itself.

The problem is minor, only root can send a signal to this kthread.

2. If sys_wait4(&ret) fails it doesn't populate "ret", in this case
wait_for_helper() reports a random value from uninitialized var.

With this patch sys_wait4() should never fail, but still it makes
sense to initialize ret = -ECHILD so that the caller can notice
the problem.

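Point 2 can be illustrated in userspace with waitpid(). This is only an analogy under stated assumptions: `wait_for_child()` and `spawn_and_wait()` are hypothetical helpers, and the kernel code calls sys_wait4() rather than waitpid().

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the raw wait status, or -ECHILD if the wait itself failed.
 * Initializing ret up front gives the caller a defined value instead
 * of whatever happened to be on the stack. */
static int wait_for_child(pid_t pid)
{
    int status;
    int ret = -ECHILD;

    if (waitpid(pid, &status, 0) == pid)
        ret = status;
    return ret;
}

/* Forks a child that exits with exit_code, then waits for it. */
static int spawn_and_wait(int exit_code)
{
    pid_t pid = fork();

    assert(exit_code >= 0 && exit_code <= 255);
    if (pid < 0)
        return -errno;
    if (pid == 0)
        _exit(exit_code);
    return wait_for_child(pid);
}
```

Waiting on a pid that is not a child of the caller (for example the caller's own pid) makes waitpid() fail, and `wait_for_child()` then reports -ECHILD rather than an uninitialized value.
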
Signed-off-by: Oleg Nesterov
Acked-by: Neil Horman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2010-05-28 00:12:45 +0800
363da4022 call_usermodehelper: no need to unblock signals

____call_usermodehelper() correctly calls flush_signal_handlers() to set
SIG_DFL, but sigemptyset(->blocked) and recalc_sigpending() are not
needed.

This kthread was forked by workqueue thread, all signals must be unblocked
and ignored, no pending signal is possible.

Signed-off-by: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2010-05-28 00:12:45 +0800
c70a626d3 umh: creds: kill subprocess_info->cred logic

Now that nobody ever changes subprocess_info->cred we can kill this member
and related code. ____call_usermodehelper() always runs in the context of
freshly forked kernel thread, it has the proper ->cred copied from its
parent kthread, keventd.

Signed-off-by: Oleg Nesterov
Acked-by: Neil Horman
Acked-by: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2010-05-28 00:12:45 +0800
685bfd2c4 umh: creds: convert call_usermodehelper_keys() to use subprocess_info->init()

call_usermodehelper_keys() uses call_usermodehelper_setkeys() to change
subprocess_info->cred in advance. Now that we have info->init() we can
change this code to set tgcred->session_keyring in context of execing
kernel thread.

Note: since currently call_usermodehelper_keys() is never called with
UMH_NO_WAIT, call_usermodehelper_keys()->key_get() and umh_keys_cleanup()
are not really needed, we could rely on install_session_keyring_to_cred()
which does key_get() on success.

Signed-off-by: Oleg Nesterov
Acked-by: Neil Horman
Acked-by: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2010-05-28 00:12:45 +0800
898b374af exec: replace call_usermodehelper_pipe with use of umh init function and resolve limit

The first patch in this series introduced an init function to the
call_usermodehelper api so that processes could be customized by caller.
This patch takes advantage of that fact, by customizing the helper in
do_coredump to create the pipe and set its core limit to one (for our
recursion check). This lets us clean up the previous ugliness in the
usermodehelper internals and factor call_usermodehelper_pipe out entirely.
While I'm at it, we can also modify the helper setup to look for a core
limit value of 1 rather than zero for our recursion check.

Signed-off-by: Neil Horman
Reviewed-by: Oleg Nesterov
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Neil Horman
2010-05-28 00:12:44 +0800
a06a4dc3a kmod: add init function to usermodehelper

About 6 months ago, I made a set of changes to how the core-dump-to-a-pipe
feature in the kernel works. We had reports of several races, including
some reports of apps bypassing our recursion check so that a process that
was forked as part of a core_pattern setup could infinitely crash and
refork until the system crashed.

We fixed those by improving our recursion checks. The new check basically
refuses to fork a process if its core limit is zero, which works well.

Unfortunately, I've been getting grief from maintainers of user space
programs that are inserted as the forked process of core_pattern. They
contend that in order for their programs (such as abrt and apport) to
work, all the running processes in a system must have their core limits
set to a non-zero value, to which I say 'yes'. I did this by design, and
think that's the right way to do things.

But I've been asked to ease this burden on user space enough times that I
thought I would take a look at it. The first suggestion was to make the
recursion check fail on a non-zero 'special' number, like one. That way
the core collector process could set its core size ulimit to 1, and enable
the kernel's recursion detection. This isn't a bad idea on the surface,
but I don't like it since it's opt-in, in that if a program like abrt or
apport has a bug and fails to set such a core limit, we're left with a
recursively crashing system again.

So I've come up with this. What I've done is modify the
call_usermodehelper api such that an extra parameter is added, a function
pointer which will be called by the user helper task, after it forks, but
before it exec's the required process. This will give the caller the
opportunity to get a call back in the processes context, allowing it to do
whatever it needs to do to the process in the kernel prior to exec-ing the
user space code. In the case of do_coredump, this callback is used to set
the core ulimit of the helper process to 1. This eliminates the opt-in
problem that I had above, as it allows the ulimit for core sizes to be set
to the value of 1, which is what the recursion check looks for in
do_coredump.

This patch:

Create new function call_usermodehelper_fns() and allow it to assign both
an init and cleanup function, as well as arbitrary data.

The init function is called from the context of the forked process and
allows for customization of the helper process prior to calling exec. Its
return code gates the continuation of the process, or causes its exit.
Also add an arbitrary data pointer to the subprocess_info struct allowing
for data to be passed from the caller to the new process, and the
subsequent cleanup process.

Also, use this patch to clean up the cleanup function. It currently takes
an argp and envp pointer for freeing, which is ugly. Let's instead just
make the subprocess_info structure public, and pass that to the cleanup
and init routines.

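The init/cleanup/data arrangement described above can be modelled in userspace as follows. This is a sketch under stated assumptions, not the kernel API: `helper_info`, `run_helper()`, the `payload` callback (which stands in for exec of the helper binary) and the example callbacks are all hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

struct helper_info {
    const void *data;                           /* caller-supplied context */
    int  (*init)(struct helper_info *info);     /* runs in the child, before the payload */
    void (*cleanup)(struct helper_info *info);  /* runs in the parent, after the wait */
    int  (*payload)(struct helper_info *info);  /* stands in for exec of the helper */
};

/* Returns the child's exit code, or -1 if it could not be obtained. */
static int run_helper(struct helper_info *info)
{
    int status = 0;
    int ret = -1;
    pid_t pid = fork();

    assert(info->payload);
    if (pid == 0) {
        /* The return code of init gates the continuation of the child. */
        if (info->init && info->init(info) != 0)
            _exit(127);
        _exit(info->payload(info));
    }
    if (pid > 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        ret = WEXITSTATUS(status);
    if (info->cleanup)
        info->cleanup(info);
    return ret;
}

/* Example callbacks: init customizes the child's environment,
 * the payload verifies that the customization is visible. */
static int cleanups_run;

static int mark_env(struct helper_info *info)
{
    return setenv("HELPER_STAGE", (const char *)info->data, 1);
}

static int check_env(struct helper_info *info)
{
    const char *v = getenv("HELPER_STAGE");
    return (v && strcmp(v, (const char *)info->data) == 0) ? 0 : 1;
}

static int refuse_init(struct helper_info *info)
{
    (void)info;
    return -1;
}

static void count_cleanup(struct helper_info *info)
{
    (void)info;
    cleanups_run++;
}
```

Because `mark_env()` runs after the fork, the environment change is visible to the payload but not to the caller. A failing init (`refuse_init()`) makes the child exit before the payload runs, while cleanup still runs in the parent.
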
Signed-off-by: Neil Horman
Reviewed-by: Oleg Nesterov
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Neil Horman
2010-05-28 00:12:44 +0800
065add394 signals: check_kill_permission(): don't check creds if same_thread_group()

Andrew Tridgell reports that aio_read(SIGEV_SIGNAL) can fail if the
notification from the helper thread races with setresuid(), see
http://samba.org/~tridge/junkcode/aio_uid.c

This happens because check_kill_permission() doesn't permit sending a
signal to the task with the different cred->xids. But there is not any
security reason to check ->cred's when the task sends a signal (private or
group-wide) to its sub-thread. Whatever we do, any thread can bypass all
security checks and send SIGKILL to all threads, or it can block a signal
SIG and do kill(gettid(), SIG) to deliver this signal to another
sub-thread. Not to mention that CLONE_THREAD implies CLONE_VM.

Change check_kill_permission() to avoid the credentials check when the
sender and the target are from the same thread group.

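The shape of the new check can be sketched like this. It is a heavily simplified model: `task_model`, `same_thread_group_model()` and `may_signal()` are hypothetical, a single uid stands in for the full set of credentials, and uid 0 stands in for the capability check the kernel performs.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified task: one uid instead of the full credential set. */
struct task_model {
    int tgid;           /* thread group id */
    unsigned int uid;
};

static bool same_thread_group_model(const struct task_model *a,
                                    const struct task_model *b)
{
    return a->tgid == b->tgid;
}

/* Threads of one group may always signal each other; the
 * credentials are only compared across thread groups. */
static bool may_signal(const struct task_model *sender,
                       const struct task_model *target)
{
    assert(sender && target);
    if (same_thread_group_model(sender, target))
        return true;
    return sender->uid == 0 || sender->uid == target->uid;
}
```

In this model a helper thread whose sibling has already changed its uid (the setresuid() race from the report) can still deliver the notification, while an unrelated task with a different uid is still refused.
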
Also, move "cred = current_cred()" down to avoid calling get_current()
twice.

Note: David Howells pointed out we could relax this even more, the
CLONE_SIGHAND (without CLONE_THREAD) case probably does not need
these checks too.

Roland said:
: The glibc (libpthread) that does set*id across threads has
: been in use for a while (2.3.4?), probably in distros using kernels as old
: or older than any active -stable streams. In the race in question, this
: kernel bug is breaking valid POSIX application expectations.

Reported-by: Andrew Tridgell
Signed-off-by: Oleg Nesterov
Acked-by: Roland McGrath
Acked-by: David Howells
Cc: Eric Paris
Cc: Jakub Jelinek
Cc: James Morris
Cc: Roland McGrath
Cc: Stephen Smalley
Cc: [all kernel versions]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2010-05-28 00:12:44 +0800
e0129ef91 ptrace: PTRACE_GETFDPIC: fix the unsafe usage of child->mm

Now that Mike Frysinger unified the FDPIC ptrace code, we can fix the
unsafe usage of child->mm in ptrace_request(PTRACE_GETFDPIC).

We have the reference to task_struct, and ptrace_check_attach() verified
the tracee is stopped. But nothing can protect from SIGKILL after that,
we must not assume child->mm != NULL.

Signed-off-by: Oleg Nesterov
Acked-by: Mike Frysinger
Acked-by: David Howells
Cc: Paul Mundt
Cc: Greg Ungerer
Acked-by: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2010-05-28 00:12:44 +0800
9c1a12592 ptrace: unify FDPIC implementations

The Blackfin/FRV/SuperH guys all have the same exact FDPIC ptrace code in
their arch handlers (since they were probably copied & pasted). Since
these ptrace interfaces are an arch independent aspect of the FDPIC code,
unify them in the common ptrace code so new FDPIC ports don't need to copy
and paste this fundamental stuff yet again.

Signed-off-by: Mike Frysinger
Acked-by: Roland McGrath
Acked-by: David Howells
Acked-by: Paul Mundt
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mike Frysinger
2010-05-28 00:12:44 +0800
0ac0c0d0f cpusets: randomize node rotor used in cpuset_mem_spread_node()

Some workloads that create a large number of small files tend to assign
too many pages to node 0 (multi-node systems). Part of the reason is that
the rotor (in cpuset_mem_spread_node()) used to assign nodes starts at
node 0 for newly created tasks.

This patch changes the rotor to be initialized to a random node number of
the cpuset.

[akpm@linux-foundation.org: fix layout]
[Lee.Schermerhorn@hp.com: Define stub numa_random() for !NUMA configuration]
Signed-off-by: Jack Steiner
Signed-off-by: Lee Schermerhorn
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: Paul Menage
Cc: Jack Steiner
Cc: Robin Holt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jack Steiner
2010-05-28 00:12:44 +0800
6adef3ebe cpusets: new round-robin rotor for SLAB allocations

We have observed several workloads running on multi-node systems where
memory is assigned unevenly across the nodes in the system. There are
numerous reasons for this but one is the round-robin rotor in
cpuset_mem_spread_node().

For example, a simple test that writes a multi-page file will allocate
pages on nodes 0 2 4 6 ... Odd nodes are skipped. (Sometimes it
allocates on odd nodes & skips even nodes).

An example is shown below. The program "lfile" writes a file consisting
of 10 pages. The program then mmaps the file & uses get_mempolicy(...,
MPOL_F_NODE) to determine the nodes where the file pages were allocated.
The output is shown below:

# ./lfile
allocated on nodes: 2 4 6 0 1 2 6 0 2

There is a single rotor that is used for allocating both file pages & slab
pages. Writing the file allocates both a data page & a slab page
(buffer_head). This advances the RR rotor 2 nodes for each page
allocated.

A quick test seems to confirm this is the cause of the uneven
allocation:

# echo 0 >/dev/cpuset/memory_spread_slab
# ./lfile
allocated on nodes: 6 7 8 9 0 1 2 3 4 5

This patch introduces a second rotor that is used for slab allocations.

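The effect of sharing one rotor can be reproduced with a small model. This is a sketch, not the cpuset code: `rotor_model`, `advance()` and `write_file_pages()` are hypothetical, nodes are simply numbered 0..nr_nodes-1, and each page write is assumed to make exactly one slab allocation.

```c
#include <assert.h>

struct rotor_model {
    int mem_rotor;   /* next node for file pages */
    int slab_rotor;  /* next node for slab objects (the second rotor) */
};

/* Returns the current node and moves the rotor one node forward. */
static int advance(int *rotor, int nr_nodes)
{
    int node = *rotor;

    *rotor = (node + 1) % nr_nodes;
    return node;
}

/* Each page write allocates one data page and one slab object
 * (the buffer_head). page_nodes[] records where the data pages land. */
static void write_file_pages(struct rotor_model *r, int nr_nodes,
                             int separate_slab_rotor, int nr_pages,
                             int *page_nodes)
{
    assert(nr_nodes > 0);
    for (int i = 0; i < nr_pages; i++) {
        page_nodes[i] = advance(&r->mem_rotor, nr_nodes);
        (void)advance(separate_slab_rotor ? &r->slab_rotor : &r->mem_rotor,
                      nr_nodes);
    }
}
```

With a shared rotor on 8 nodes the data pages land on nodes 0 2 4 6; with a separate slab rotor they land on 0 1 2 3.
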
Signed-off-by: Jack Steiner
Acked-by: Christoph Lameter
Cc: Pekka Enberg
Cc: Paul Menage
Cc: Jack Steiner
Cc: Robin Holt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jack Steiner
2010-05-28 00:12:44 +0800
2c488db27 memcg: clean up memory thresholds

Introduce struct mem_cgroup_thresholds. It helps to reduce number of
checks of thresholds type (memory or mem+swap).

[akpm@linux-foundation.org: repair comment]
Signed-off-by: Kirill A. Shutemov
Cc: Phil Carmody
Cc: Balbir Singh
Cc: Daisuke Nishimura
Cc: KAMEZAWA Hiroyuki
Acked-by: Paul Menage
Cc: Li Zefan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kirill A. Shutemov
2010-05-28 00:12:44 +0800
907860ed3 cgroups: make cftype.unregister_event() void-returning

Since we are unable to handle an error returned by
cftype.unregister_event() properly, let's make the callback
void-returning.

mem_cgroup_unregister_event() has been rewritten to be a "never fail"
function. On mem_cgroup_usage_register_event() we save old buffer for
thresholds array and reuse it in mem_cgroup_usage_unregister_event() to
avoid allocation.

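The "save the old buffer at register time, reuse it at unregister time" idea can be sketched in userspace as below. It is a minimal model under stated assumptions, not the memcg implementation: `thresholds_model`, `register_threshold()` and `unregister_threshold()` are hypothetical, and thresholds are plain values rather than eventfd registrations.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct thresholds_model {
    unsigned long *primary;  /* active thresholds */
    size_t size;             /* entries used in primary */
    unsigned long *spare;    /* buffer kept from the previous state */
    size_t spare_cap;        /* entries known to fit in spare */
};

/* May fail: allocates a larger array and keeps the old one as spare. */
static int register_threshold(struct thresholds_model *t, unsigned long value)
{
    unsigned long *bigger = malloc((t->size + 1) * sizeof(*bigger));

    if (!bigger)
        return -1;
    if (t->size)
        memcpy(bigger, t->primary, t->size * sizeof(*bigger));
    bigger[t->size] = value;
    free(t->spare);
    t->spare = t->primary;      /* old buffer is saved, not freed */
    t->spare_cap = t->size;
    t->primary = bigger;
    t->size++;
    return 0;
}

/* Never fails: the shrunken array is built in the spare buffer,
 * so no allocation is needed on this path. */
static void unregister_threshold(struct thresholds_model *t, unsigned long value)
{
    unsigned long *old = t->primary;
    size_t old_size = t->size, matches = 0, kept = 0;

    for (size_t i = 0; i < old_size; i++)
        if (old[i] == value)
            matches++;
    if (matches == 0)
        return;
    assert(old_size - matches <= t->spare_cap);
    for (size_t i = 0; i < old_size; i++)
        if (old[i] != value)
            t->spare[kept++] = old[i];
    t->primary = t->spare;
    t->size = kept;
    t->spare = old;             /* swap roles for the next unregister */
    t->spare_cap = old_size;
}
```

The spare buffer always holds at least one entry fewer than the active array, which is exactly what an unregister needs, so the unregister path has nothing that can fail.
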
Signed-off-by: Kirill A. Shutemov
Acked-by: KAMEZAWA Hiroyuki
Cc: Phil Carmody
Cc: Balbir Singh
Cc: Daisuke Nishimura
Cc: Paul Menage
Cc: Li Zefan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kirill A. Shutemov
2010-05-28 00:12:44 +0800
ac39cf8cb memcg: fix mis-accounting of file mapped racy with migration

FILE_MAPPED per memcg of migrated file cache is not properly updated,
because our hook in page_add_file_rmap() can't know to which memcg
FILE_MAPPED should be counted.

Basically, this patch is for fixing the bug but includes some big changes
to fix up other messes.

Currently, when migrating a mapped file, events happen in the following sequence.

1. allocate a new page.
2. get memcg of an old page.
3. charge against a new page before migration. But at this point,
no changes to new page's page_cgroup, no commit for the charge.
(IOW, PCG_USED bit is not set.)
4. page migration replaces radix-tree, old-page and new-page.
5. page migration remaps the new page if the old page was mapped.
6. Here, the new page is unlocked.
7. memcg commits the charge for newpage, Mark the new page's page_cgroup
as PCG_USED.

Because "commit" happens after page-remap, we can't count FILE_MAPPED
at "5", because we should avoid trusting page_cgroup->mem_cgroup
if the PCG_USED bit is unset.
(Note: memcg's LRU removal code does that but LRU-isolation logic is used
for helping it. When we overwrite page_cgroup->mem_cgroup, page_cgroup is
not on LRU or page_cgroup->mem_cgroup is NULL.)

We can lose file_mapped accounting information at 5 because FILE_MAPPED
is updated only when mapcount changes 0->1. So we should catch it.

BTW, historically, the above implementation comes from migration-failure
of anonymous page. Because we charge both of old page and new page
with mapcount=0, we can't catch
- the page is really freed before remap.
- migration fails but it's freed before remap
or .....corner cases.

New migration sequence with memcg is:

1. allocate a new page.
2. mark PageCgroupMigration to the old page.
3. charge against a new page onto the old page's memcg. (here, new page's pc
is marked as PageCgroupUsed.)
4. page migration replaces radix-tree, page table, etc...
5. At remapping, new page's page_cgroup is now marked as "USED"
We can catch 0->1 event and FILE_MAPPED will be properly updated.

And we can catch SWAPOUT event after unlock this and freeing this
page by unmap() can be caught.

6. Clear PageCgroupMigration of the old page.

So, FILE_MAPPED will be correctly updated.

So what is the MIGRATION flag for?
Without it, at migration failure, we may have to charge the old page again
because it may be fully unmapped. "charge" means that we have to dive into
memory reclaim or something complicated. So, it's better to avoid
charging it again. Before this patch, __commit_charge() was working for
both the old and new pages and fixed up everything. But this technique has
some race conditions around FILE_MAPPED and SWAPOUT etc...
Now, the kernel uses the MIGRATION flag and doesn't uncharge the old page
until the end of migration.

I hope this change will make memcg's page migration much simpler. This
page migration has caused several troubles. It is worth adding a flag for
simplification.

Reviewed-by: Daisuke Nishimura
Tested-by: Daisuke Nishimura
Reported-by: Daisuke Nishimura
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Balbir Singh
Cc: Christoph Lameter
Cc: "Kirill A. Shutemov"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

akpm@linux-foundation.org
2010-05-28 00:12:44 +0800
315c1998e mm: memcontrol - uninitialised return value

Only an out of memory error will cause ret to be set.

Signed-off-by: Phil Carmody
Acked-by: Kirill A. Shutemov
Cc: Balbir Singh
Cc: Daisuke Nishimura
Acked-by: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Phil Carmody
2010-05-28 00:12:44 +0800
5407a5625 mm: remove unnecessary use of atomic

The bottom 4 hunks are atomically changing memory to which there are no
aliases as it's freshly allocated, so there's no need to use atomic
operations.

The other hunks are just atomic_read and atomic_set, and do not involve
any read-modify-write. The use of atomic_{read,set} doesn't prevent a
read/write or write/write race, so if a race were possible (I'm not saying
one is), then it would still be there even with atomic_set.

See:
http://digitalvampire.org/blog/index.php/2007/05/13/atomic-cargo-cults/

Signed-off-by: Phil Carmody
Acked-by: Kirill A. Shutemov
Cc: Balbir Singh
Cc: Daisuke Nishimura
Acked-by: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Phil Carmody
2010-05-28 00:12:43 +0800
df64f81bb memcg: make oom killer a no-op when no killable task can be found

It's pointless to try to kill current if select_bad_process() did not find
an eligible task to kill in mem_cgroup_out_of_memory() since it's
guaranteed that current is a member of the memcg that is oom and it is, by
definition, unkillable.

Signed-off-by: David Rientjes
Acked-by: KAMEZAWA Hiroyuki
Cc: Balbir Singh
Cc: Li Zefan
Cc: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2010-05-28 00:12:43 +0800
dc10e281f memcg: update documentation

Some information is old, and I think the current document doesn't work as "a
guide for users". We need a summary of all of our controls, at least.

Signed-off-by: KAMEZAWA Hiroyuki
Reviewed-by: Randy Dunlap
Cc: Balbir Singh
Cc: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2010-05-28 00:12:43 +0800
87946a722 memcg: move charge of file pages

This patch adds support for moving charge of file pages, which include
normal file, tmpfs file and swaps of tmpfs file. It's enabled by setting
bit 1 of /memory.move_charge_at_immigrate.

Unlike the case of anonymous pages, file pages(and swaps) in the range
mmapped by the task will be moved even if the task hasn't done page fault,
i.e. they might not be the task's "RSS", but other task's "RSS" that maps
the same file. And mapcount of the page is ignored(the page can be moved
even if page_mapcount(page) > 1). So, the conditions that a page/swap
must meet to be moved are that it must be in the range mmapped by the
target task and it must be charged to the old cgroup.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix warning]
Signed-off-by: Daisuke Nishimura
Acked-by: KAMEZAWA Hiroyuki
Cc: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Daisuke Nishimura
2010-05-28 00:12:43 +0800
90254a658 memcg: clean up move charge

This patch cleans up move charge code by:

- define functions to handle pte for each types, and make
is_target_pte_for_mc() cleaner.

- instead of checking the MOVE_CHARGE_TYPE_ANON bit, define a function
that checks the bit.

Signed-off-by: Daisuke Nishimura
Acked-by: KAMEZAWA Hiroyuki
Cc: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Daisuke Nishimura
2010-05-28 00:12:43 +0800
3c11ecf44 memcg: oom kill disable and oom status

This adds a feature to disable oom-killer for memcg, if disabled, of
course, tasks under memcg will stop.

But now, we have oom-notifier for memcg. And the world around memcg is
not under out-of-memory. memcg's out-of-memory just shows memcg hits
limit. Then, administrator or management daemon can recover the situation
by

- kill some process
- enlarge limit, add more swap.
- migrate some tasks
- remove file cache on tmpfs (difficult ?)

Unlike oom-killer, you can take enough information before killing tasks.
(by gcore, or, ps etc.)

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Daisuke Nishimura
Cc: Balbir Singh
Cc: Daisuke Nishimura
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2010-05-28 00:12:43 +0800
9490ff275 memcg: oom notifier

Considering containers or other resource management software in userland,
event notification of OOM in memcg should be implemented. Now, memcg has
"threshold" notifier which uses eventfd, we can make use of it for oom
notification.

This patch adds oom notification eventfd callback for memcg. The usage is
very similar to threshold notifier, but control file is memory.oom_control
and no arguments other than eventfd are required.

% cgroup_event_notifier /cgroup/A/memory.oom_control dummy
(About cgroup_event_notifier, see Documentation/cgroup/)

Signed-off-by: KAMEZAWA Hiroyuki
Cc: Daisuke Nishimura
Cc: Balbir Singh
Cc: Daisuke Nishimura
Cc: David Rientjes
Cc: Davide Libenzi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2010-05-28 00:12:43 +0800
dc98df5a1 memcg: oom wakeup filter

memcg's oom waitqueue is a system-wide wait_queue (for handling
hierarchy.) So, it's better to add custom wake function and do filtering
in wake up path.

This patch adds a filtering feature for waking up oom-waiters. Hierarchy
is properly handled.

Signed-off-by: KAMEZAWA Hiroyuki
Reviewed-by: Daisuke Nishimura
Cc: Balbir Singh
Cc: Daisuke Nishimura
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2010-05-28 00:12:43 +0800
595f4b694 Documentation/cgroups/cgroups.txt: fix reference to "numtasks"

Signed-off-by: Trevor Woerner
Cc: Paul Menage
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Trevor Woerner
2010-05-28 00:12:43 +0800
6d06b81bc Documentation: SubmittingDrivers: Resources

- Add additional location (Git) for the kernel master tree
- Add reference to Git Project

Signed-off-by: Abraham Arce
Acked-by: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Arce, Abraham
2010-05-28 00:12:43 +0800
d27d7a9a7 ufs: permit mounting of BorderWare filesystems

I recently had to recover some files from an old broken machine that was
running BorderWare Document Gateway. It's basically a drop in web server
for sharing files. From the look of the init process and using strings on
a few files it seems to be based on FreeBSD 3.3.

The process turned out to be more difficult than I imagined, but to cut a
long story short BorderWare in their wisdom use a nonstandard magic number
in their UFS (ufstype=44bsd) file systems. Thus Linux refuses to mount
the file systems in order to recover the data. After a bit of hunting I
was able to make a quick fix to fs/ufs/super.c in order to detect the new
magic number.

I assume that this number is the same for all installations. It's quite
easy to find out from ufs_fs.h. The superblock sits 8k into the block
device and the magic number is 1372 bytes into the superblock struct.

# dd if=/dev/sda5 skip=$(( 8192 + 1372 )) bs=1 count=4 2> /dev/null | hd
00000000 97 26 24 0f |.&$.|
#

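The same lookup as the dd command above can be done with a few lines of C. This is a sketch: `read_ufs_magic()` is a hypothetical helper, the two offsets are the ones quoted in the message, and the bytes are assumed to be stored little-endian on disk.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define SUPERBLOCK_OFFSET 8192L  /* superblock sits 8k into the device */
#define MAGIC_OFFSET      1372L  /* magic number offset inside the superblock */

/* Reads the 4 magic bytes and assembles them as a little-endian value.
 * Returns 0 on success, -1 if the seek or the read fails. */
static int read_ufs_magic(FILE *dev, uint32_t *magic)
{
    unsigned char b[4];

    assert(dev && magic);
    if (fseek(dev, SUPERBLOCK_OFFSET + MAGIC_OFFSET, SEEK_SET) != 0)
        return -1;
    if (fread(b, 1, sizeof(b), dev) != sizeof(b))
        return -1;
    *magic = (uint32_t)b[0] | (uint32_t)b[1] << 8 |
             (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24;
    return 0;
}
```

Under the little-endian assumption, the bytes 97 26 24 0f shown in the hd output correspond to the value 0x0f242697.
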
Signed-off-by: Thomas Stewart
Cc: Evgeniy Dushistov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Thomas Stewart
2010-05-28 00:12:43 +0800
b8d6b0d6b drivers/telephony/ixj.c: use memdup_user

Use memdup_user when user data is immediately copied into the
allocated region.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

//
@@
expression from,to,size,flag;
position p;
identifier l1,l2;
@@

- to = \(kmalloc@p\|kzalloc@p\)(size,flag);
+ to = memdup_user(from,size);
if (
- to==NULL
+ IS_ERR(to)
|| ...) {

}
- if (copy_from_user(to, from, size) != 0) {
-
- }
//

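The calling convention this conversion relies on (one call that allocates and copies, with errors encoded in the returned pointer) can be modelled in userspace. This is an analogy only: `memdup_sketch()`, `err_ptr()`, `ptr_err()` and `is_err()` are hypothetical stand-ins for memdup_user(), ERR_PTR(), PTR_ERR() and IS_ERR(), and a plain memcpy() replaces the copy from user space, so no -EFAULT case is modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ERRNO 4095  /* error codes live in the top 4095 pointer values */

static void *err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

static long ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

static int is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Allocate and copy in one step; the caller tests is_err()
 * instead of comparing the result against NULL. */
static void *memdup_sketch(const void *src, size_t len)
{
    void *p;

    assert(src);
    p = malloc(len ? len : 1);
    if (!p)
        return err_ptr(-ENOMEM);
    memcpy(p, src, len);
    return p;
}
```

A caller then writes `p = memdup_sketch(src, len); if (is_err(p)) return ptr_err(p);`, which mirrors the `to==NULL` to `IS_ERR(to)` change in the semantic patch.
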
Signed-off-by: Julia Lawall
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Julia Lawall
2010-05-28 00:12:42 +0800
8fc809d17 fbdev: bf54x-lq043fb: fix unused warnings with backlight code

The current backlight code is stubbed out, so the new props changes added
some warnings:
drivers/video/bf54x-lq043fb.c: In function 'bfin_bf54x_probe':
drivers/video/bf54x-lq043fb.c:666: warning: label 'out9' defined but not used
drivers/video/bf54x-lq043fb.c:504: warning: unused variable 'props'

Fix em !

Signed-off-by: Mike Frysinger
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mike Frysinger
2010-05-28 00:12:42 +0800
d11991cba fbdev: bfin-t350mcqb-fb: avoid unused warnings in backlight code

The current backlight code is stubbed out, so the new props changes added
some warnings about unused label/prop.

Signed-off-by: Mike Frysinger
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mike Frysinger
2010-05-28 00:12:42 +0800
a51faabc6 drivers/video/via: use memdup_user

Use memdup_user when user data is immediately copied into the
allocated region.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

//
@@
expression from,to,size,flag;
position p;
identifier l1,l2;
@@

- to = \(kmalloc@p\|kzalloc@p\)(size,flag);
+ to = memdup_user(from,size);
if (
- to==NULL
+ IS_ERR(to)
|| ...) {

}
- if (copy_from_user(to, from, size) != 0) {
-
- }
//

Signed-off-by: Julia Lawall
Cc: Joseph Chan
Cc: Scott Fang
Cc: Florian Tobias Schandinat
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Julia Lawall
2010-05-28 00:12:42 +0800
9966c4fea add support for S3 Trio3D/1X/2X

Add support for S3 Trio3D/1X (86C360) and S3 Trio3D/2X (86C362 and 86C368)
cards to s3fb driver. Tested with 86C362 AGP and 86C368 PCI&AGP.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Ondrej Zary
Acked-by: Ondrej Zajicek
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ondrej Zary
2010-05-28 00:12:42 +0800