28 May, 2010

40 commits

  • de_thread() and __exit_signal() use signal_struct->count/notify_count for
    synchronization. We can simplify the code and use ->notify_count only.
    Instead of comparing these two counters, we can change de_thread() to set
    ->notify_count = nr_of_sub_threads, then change __exit_signal() to
    dec-and-test this counter and notify group_exit_task.

    Note that __exit_signal() checks "notify_count > 0" just for symmetry with
    exit_notify(), we could just check it is != 0.
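
    The dec-and-test scheme described above can be modelled in user space
    roughly as follows. This is only a sketch, not the kernel code: the
    SignalStruct class, the wakeups list and the function bodies are
    hypothetical stand-ins for signal_struct, wake_up_process(), de_thread()
    and __exit_signal().

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SignalStruct:
    """Toy stand-in for the two signal_struct fields discussed above."""
    notify_count: int = 0
    group_exit_task: Optional[str] = None
    wakeups: List[str] = field(default_factory=list)  # records who was woken


def de_thread(sig: SignalStruct, execing_task: str, nr_of_sub_threads: int) -> None:
    """The execing thread records how many sub-threads it has to wait for."""
    sig.group_exit_task = execing_task
    sig.notify_count = nr_of_sub_threads


def exit_signal(sig: SignalStruct) -> None:
    """Each exiting sub-thread does dec-and-test and wakes the waiter on zero."""
    if sig.notify_count > 0:
        sig.notify_count -= 1
        if sig.notify_count == 0:
            sig.wakeups.append(sig.group_exit_task)


sig = SignalStruct()
de_thread(sig, "execing-thread", nr_of_sub_threads=3)
for _ in range(3):
    exit_signal(sig)
print(sig.wakeups)
```

    Only the last of the three exiting sub-threads triggers the wakeup, and
    no second counter has to be compared against.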

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Veaceslav Falico
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Change zap_other_threads() to return the number of other sub-threads found
    on ->thread_group list.

    Other changes are cosmetic:

    - change the code to use while_each_thread() helper

    - remove the obsolete comment about SIGKILL/SIGSTOP

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Veaceslav Falico
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • signal_struct->count in its current form must die.

    - it has no reasons to be atomic_t

    - it looks like a reference counter, but it is not

    - otoh, we really need to make task->signal refcountable, just look at
    the extremely ugly task_rq_unlock_wait() called from __exit_signal().

    - we should change the lifetime rules for task->signal, it should be
    pinned to task_struct. We have a lot of code which can be simplified
    after that.

    - it is not needed! while the code is correct, any usage of this
    counter is artificial, except fs/proc uses it correctly to show the
    number of threads.

    This series removes the usage of sig->count from exit paths.

    This patch:

    Now that Veaceslav changed copy_signal() to use zalloc(), exit_notify()
    can just check notify_count < 0 to ensure the execing sub-thread needs
    the notification from us. No need to do other checks, notify_count != 0
    must always mean ->group_exit_task != NULL is waiting for us.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Veaceslav Falico
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • - move the cprm.mm_flags checks up, before we take mmap_sem

    - move down_write(mmap_sem) and ->core_state check from do_coredump()
    to coredump_wait()

    This simplifies the code and makes the locking symmetrical.

    Signed-off-by: Oleg Nesterov
    Cc: David Howells
    Cc: Neil Horman
    Cc: Roland McGrath
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Given that do_coredump() calls put_cred() on exit path, it is a bit ugly
    to do put_cred() + "goto fail" twice, just add the new "fail_creds" label.

    Signed-off-by: Oleg Nesterov
    Cc: David Howells
    Cc: Neil Horman
    Cc: Roland McGrath
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • - kill "int dump_count", argv_split(argcp) accepts argcp == NULL.

    - move "int dump_count" under " if (ispipe)" branch, fail_dropcount
    can check ispipe.

    - move "char **helper_argv" as well, change the code to do argv_free()
    right after call_usermodehelper_fns().

    - If call_usermodehelper_fns() fails goto close_fail label instead
    of closing the file by hand.

    Signed-off-by: Oleg Nesterov
    Cc: David Howells
    Cc: Neil Horman
    Cc: Roland McGrath
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • do_coredump() does a lot of file checks after it opens the file or calls
    usermode helper. But all of these checks are only needed in !ispipe case.

    Move this code into the "else" branch and kill the ugly repetitive ispipe
    checks.

    Signed-off-by: Oleg Nesterov
    Cc: David Howells
    Cc: Neil Horman
    Cc: Roland McGrath
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • UMH_WAIT_EXEC should report the error if kernel_thread() fails, like
    UMH_WAIT_PROC does.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • __call_usermodehelper(UMH_NO_WAIT) has 2 problems:

    - if kernel_thread() fails, call_usermodehelper_freeinfo()
    is not called.

    - for unknown reason UMH_NO_WAIT has UMH_WAIT_PROC logic,
    we spawn yet another thread which waits until the user
    mode application exits.

    Change the UMH_NO_WAIT code to use ____call_usermodehelper() instead of
    wait_for_helper(), and do call_usermodehelper_freeinfo() unconditionally.
    We can rely on CLONE_VFORK: do_fork(CLONE_VFORK) sleeps until the child
    exits or execs.

    With or without this patch UMH_NO_WAIT does not report the error if
    kernel_thread() fails, this is correct since the caller doesn't wait for
    result.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • 1. wait_for_helper() calls allow_signal(SIGCHLD) to ensure the child
    can't autoreap itself.

    However, this means that a spurious SIGCHLD from user-space can
    set TIF_SIGPENDING and:

    - kernel_thread() or sys_wait4() can fail due to signal_pending()

    - worse, wait4() can fail before ____call_usermodehelper() execs
    or exits. In this case the caller may kfree(subprocess_info)
    while the child still uses this memory.

    Change the code to use SIG_DFL instead of magic "(void __user *)2"
    set by allow_signal(). This means that SIGCHLD won't be delivered,
    yet the child won't autoreap itself.

    The problem is minor, only root can send a signal to this kthread.

    2. If sys_wait4(&ret) fails it doesn't populate "ret", in this case
    wait_for_helper() reports a random value from uninitialized var.

    With this patch sys_wait4() should never fail, but still it makes
    sense to initialize ret = -ECHILD so that the caller can notice
    the problem.

    Signed-off-by: Oleg Nesterov
    Acked-by: Neil Horman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • ____call_usermodehelper() correctly calls flush_signal_handlers() to set
    SIG_DFL, but sigemptyset(->blocked) and recalc_sigpending() are not
    needed.

    This kthread was forked by workqueue thread, all signals must be unblocked
    and ignored, no pending signal is possible.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Now that nobody ever changes subprocess_info->cred we can kill this member
    and related code. ____call_usermodehelper() always runs in the context of
    freshly forked kernel thread, it has the proper ->cred copied from its
    parent kthread, keventd.

    Signed-off-by: Oleg Nesterov
    Acked-by: Neil Horman
    Acked-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • call_usermodehelper_keys() uses call_usermodehelper_setkeys() to change
    subprocess_info->cred in advance. Now that we have info->init() we can
    change this code to set tgcred->session_keyring in context of execing
    kernel thread.

    Note: since currently call_usermodehelper_keys() is never called with
    UMH_NO_WAIT, call_usermodehelper_keys()->key_get() and umh_keys_cleanup()
    are not really needed, we could rely on install_session_keyring_to_cred()
    which does key_get() on success.

    Signed-off-by: Oleg Nesterov
    Acked-by: Neil Horman
    Acked-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The first patch in this series introduced an init function to the
    call_usermodehelper api so that processes could be customized by caller.
    This patch takes advantage of that fact, by customizing the helper in
    do_coredump to create the pipe and set its core limit to one (for our
    recursion check). This lets us clean up the previous ugliness in the
    usermodehelper internals and factor call_usermodehelper out entirely.
    While I'm at it, we can also modify the helper setup to look for a core
    limit value of 1 rather than zero for our recursion check.

    Signed-off-by: Neil Horman
    Reviewed-by: Oleg Nesterov
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Horman
     
  • About 6 months ago, I made a set of changes to how the core-dump-to-a-pipe
    feature in the kernel works. We had reports of several races, including
    some reports of apps bypassing our recursion check so that a process that
    was forked as part of a core_pattern setup could infinitely crash and
    refork until the system crashed.

    We fixed those by improving our recursion checks. The new check basically
    refuses to fork a process if its core limit is zero, which works well.

    Unfortunately, I've been getting grief from maintainers of user space
    programs that are inserted as the forked process of core_pattern. They
    contend that in order for their programs (such as abrt and apport) to
    work, all the running processes in a system must have their core limits
    set to a non-zero value, to which I say 'yes'. I did this by design, and
    think that's the right way to do things.

    But I've been asked to ease this burden on user space enough times that I
    thought I would take a look at it. The first suggestion was to make the
    recursion check fail on a non-zero 'special' number, like one. That way
    the core collector process could set its core size ulimit to 1, and enable
    the kernel's recursion detection. This isn't a bad idea on the surface,
    but I don't like it since it's opt-in, in that if a program like abrt or
    apport has a bug and fails to set such a core limit, we're left with a
    recursively crashing system again.

    So I've come up with this. What I've done is modify the
    call_usermodehelper api such that an extra parameter is added, a function
    pointer which will be called by the user helper task, after it forks, but
    before it exec's the required process. This will give the caller the
    opportunity to get a call back in the processes context, allowing it to do
    whatever it needs to do to the process in the kernel prior to exec-ing the
    user space code. In the case of do_coredump, this callback is used to set
    the core ulimit of the helper process to 1. This eliminates the opt-in
    problem that I had above, as it allows the ulimit for core sizes to be set
    to the value of 1, which is what the recursion check looks for in
    do_coredump.

    This patch:

    Create new function call_usermodehelper_fns() and allow it to assign both
    an init and cleanup function, as well as arbitrary data.

    The init function is called from the context of the forked process and
    allows for customization of the helper process prior to calling exec. Its
    return code gates the continuation of the process, or causes its exit.
    Also add an arbitrary data pointer to the subprocess_info struct allowing
    for data to be passed from the caller to the new process, and the
    subsequent cleanup process.

    Also, use this patch to clean up the cleanup function. It currently takes
    an argp and envp pointer for freeing, which is ugly. Let's instead just
    make the subprocess_info structure public, and pass that to the cleanup
    and init routines.
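
    A rough user-space analogue of the init hook described above is the
    preexec_fn argument of Python's subprocess module, which also runs a
    callback in the child after fork() and before exec(). The sketch below
    uses it to set the child's soft core limit to 1. It illustrates the idea
    only; it is not the kernel API, and the names core_limit_hook and
    run_helper are made up for this example.

```python
import resource
import subprocess
import sys


def core_limit_hook() -> None:
    """Runs in the child after fork() and before exec(), like the init hook."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # Stay within the hard limit so the call cannot fail for lack of privilege.
    soft = 1 if hard == resource.RLIM_INFINITY or hard >= 1 else hard
    resource.setrlimit(resource.RLIMIT_CORE, (soft, hard))


def run_helper(argv):
    """Spawn argv with the hook installed and return its stripped stdout."""
    done = subprocess.run(argv, preexec_fn=core_limit_hook,
                          stdout=subprocess.PIPE, check=True, text=True)
    return done.stdout.strip()


# The child reports its own soft RLIMIT_CORE; the parent's limit is untouched.
child_code = "import resource; print(resource.getrlimit(resource.RLIMIT_CORE)[0])"
child_limit = run_helper([sys.executable, "-c", child_code])
print("helper sees RLIMIT_CORE =", child_limit)
```

    Because the hook is installed by whoever spawns the helper, the helper
    program itself does not have to opt in, which is the point made above.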

    Signed-off-by: Neil Horman
    Reviewed-by: Oleg Nesterov
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Horman
     
  • Andrew Tridgell reports that aio_read(SIGEV_SIGNAL) can fail if the
    notification from the helper thread races with setresuid(), see
    http://samba.org/~tridge/junkcode/aio_uid.c

    This happens because check_kill_permission() doesn't permit sending a
    signal to the task with the different cred->xids. But there is not any
    security reason to check ->cred's when the task sends a signal (private or
    group-wide) to its sub-thread. Whatever we do, any thread can bypass all
    security checks and send SIGKILL to all threads, or it can block a signal
    SIG and do kill(gettid(), SIG) to deliver this signal to another
    sub-thread. Not to mention that CLONE_THREAD implies CLONE_VM.

    Change check_kill_permission() to avoid the credentials check when the
    sender and the target are from the same thread group.

    Also, move "cred = current_cred()" down to avoid calling get_current()
    twice.
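
    The rule introduced here can be sketched as a small user-space model.
    The Task class and the simplified uid comparison are assumptions made
    for illustration; the real check_kill_permission() also considers
    capabilities and other cases that are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    tgid: int   # thread-group id shared by all threads of one process
    uid: int
    euid: int
    suid: int


def same_thread_group(a: Task, b: Task) -> bool:
    return a.tgid == b.tgid


def creds_allow_kill(sender: Task, target: Task) -> bool:
    """Simplified match: sender's uid/euid must equal target's uid or suid."""
    return bool({sender.uid, sender.euid} & {target.uid, target.suid})


def kill_permitted(sender: Task, target: Task) -> bool:
    # Threads of one group may always signal each other; only cross-group
    # signals fall through to the credentials comparison.
    if same_thread_group(sender, target):
        return True
    return creds_allow_kill(sender, target)


# A helper thread still running with the old ids, and a sibling thread that
# has already switched ids via setresuid().
helper = Task(tgid=100, uid=1000, euid=1000, suid=1000)
sibling = Task(tgid=100, uid=2000, euid=2000, suid=2000)
stranger = Task(tgid=200, uid=2000, euid=2000, suid=2000)
print(kill_permitted(helper, sibling), kill_permitted(helper, stranger))
```

    In this model the notification to the sibling thread is allowed even
    though the ids differ, while a task in another thread group with the
    same ids is still refused.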

    Note: David Howells pointed out we could relax this even more, the
    CLONE_SIGHAND (without CLONE_THREAD) case probably does not need
    these checks too.

    Roland said:
    : The glibc (libpthread) that does set*id across threads has
    : been in use for a while (2.3.4?), probably in distro's using kernels as old
    : or older than any active -stable streams. In the race in question, this
    : kernel bug is breaking valid POSIX application expectations.

    Reported-by: Andrew Tridgell
    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Acked-by: David Howells
    Cc: Eric Paris
    Cc: Jakub Jelinek
    Cc: James Morris
    Cc: Roland McGrath
    Cc: Stephen Smalley
    Cc: [all kernel versions]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Now that Mike Frysinger unified the FDPIC ptrace code, we can fix the
    unsafe usage of child->mm in ptrace_request(PTRACE_GETFDPIC).

    We have the reference to task_struct, and ptrace_check_attach() verified
    the tracee is stopped. But nothing can protect from SIGKILL after that,
    we must not assume child->mm != NULL.

    Signed-off-by: Oleg Nesterov
    Acked-by: Mike Frysinger
    Acked-by: David Howells
    Cc: Paul Mundt
    Cc: Greg Ungerer
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The Blackfin/FRV/SuperH guys all have the same exact FDPIC ptrace code in
    their arch handlers (since they were probably copied & pasted). Since
    these ptrace interfaces are an arch independent aspect of the FDPIC code,
    unify them in the common ptrace code so new FDPIC ports don't need to copy
    and paste this fundamental stuff yet again.

    Signed-off-by: Mike Frysinger
    Acked-by: Roland McGrath
    Acked-by: David Howells
    Acked-by: Paul Mundt
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Frysinger
     
  • Some workloads that create a large number of small files tend to assign
    too many pages to node 0 (multi-node systems). Part of the reason is that
    the rotor (in cpuset_mem_spread_node()) used to assign nodes starts at
    node 0 for newly created tasks.

    This patch changes the rotor to be initialized to a random node number of
    the cpuset.

    [akpm@linux-foundation.org: fix layout]
    [Lee.Schermerhorn@hp.com: Define stub numa_random() for !NUMA configuration]
    Signed-off-by: Jack Steiner
    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Paul Menage
    Cc: Jack Steiner
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jack Steiner
     
  • We have observed several workloads running on multi-node systems where
    memory is assigned unevenly across the nodes in the system. There are
    numerous reasons for this but one is the round-robin rotor in
    cpuset_mem_spread_node().

    For example, a simple test that writes a multi-page file will allocate
    pages on nodes 0 2 4 6 ... Odd nodes are skipped. (Sometimes it
    allocates on odd nodes & skips even nodes).

    An example is shown below. The program "lfile" writes a file consisting
    of 10 pages. The program then mmaps the file & uses get_mempolicy(...,
    MPOL_F_NODE) to determine the nodes where the file pages were allocated.
    The output is shown below:

    # ./lfile
    allocated on nodes: 2 4 6 0 1 2 6 0 2

    There is a single rotor that is used for allocating both file pages & slab
    pages. Writing the file allocates both a data page & a slab page
    (buffer_head). This advances the RR rotor 2 nodes for each page
    allocated.

    A quick test seems to confirm this is the cause of the uneven
    allocation:

    # echo 0 >/dev/cpuset/memory_spread_slab
    # ./lfile
    allocated on nodes: 6 7 8 9 0 1 2 3 4 5

    This patch introduces a second rotor that is used for slab allocations.
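
    The effect of sharing one rotor can be reproduced with a toy simulation.
    It assumes 10 nodes and one slab allocation per file page, and the Rotor
    class is a simplification invented for this sketch rather than the code
    in cpuset_mem_spread_node().

```python
NODES = 10


class Rotor:
    """Round-robin node selector: hand out the current node, then advance."""

    def __init__(self, start: int = 0) -> None:
        self.next_node = start

    def pick(self) -> int:
        node = self.next_node
        self.next_node = (self.next_node + 1) % NODES
        return node


def write_file(pages: int, page_rotor: Rotor, slab_rotor: Rotor) -> list:
    """Each written page takes one data page and one slab object."""
    placed = []
    for _ in range(pages):
        placed.append(page_rotor.pick())   # data page
        slab_rotor.pick()                  # buffer_head from slab
    return placed


shared = Rotor()
one_rotor = write_file(10, shared, shared)
two_rotors = write_file(10, Rotor(), Rotor())
print("one rotor: ", one_rotor)
print("two rotors:", two_rotors)
```

    With a single shared rotor the file pages land only on every other node;
    with separate rotors for pages and slab they cover all ten nodes.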

    Signed-off-by: Jack Steiner
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Paul Menage
    Cc: Jack Steiner
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jack Steiner
     
  • Introduce struct mem_cgroup_thresholds. It helps to reduce number of
    checks of thresholds type (memory or mem+swap).

    [akpm@linux-foundation.org: repair comment]
    Signed-off-by: Kirill A. Shutemov
    Cc: Phil Carmody
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Paul Menage
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Since we are unable to handle an error returned by
    cftype.unregister_event() properly, let's make the callback
    void-returning.

    mem_cgroup_unregister_event() has been rewritten to be a "never fail"
    function. On mem_cgroup_usage_register_event() we save old buffer for
    thresholds array and reuse it in mem_cgroup_usage_unregister_event() to
    avoid allocation.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Phil Carmody
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Paul Menage
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • FILE_MAPPED per memcg of migrated file cache is not properly updated,
    because our hook in page_add_file_rmap() can't know to which memcg
    FILE_MAPPED should be counted.

    Basically, this patch is for fixing the bug but includes some big changes
    to fix up other messes.

    Currently, when migrating a mapped file page, events happen in the
    following sequence.

    1. allocate a new page.
    2. get memcg of an old page.
    3. charge against a new page before migration. But at this point,
    no changes to new page's page_cgroup, no commit for the charge.
    (IOW, PCG_USED bit is not set.)
    4. page migration replaces radix-tree, old-page and new-page.
    5. page migration remaps the new page if the old page was mapped.
    6. Here, the new page is unlocked.
    7. memcg commits the charge for newpage, Mark the new page's page_cgroup
    as PCG_USED.

    Because "commit" happens after page-remap, we cannot count FILE_MAPPED
    at "5": we must not trust page_cgroup->mem_cgroup if the PCG_USED bit
    is unset.
    (Note: memcg's LRU removal code does that but LRU-isolation logic is used
    for helping it. When we overwrite page_cgroup->mem_cgroup, page_cgroup is
    not on LRU or page_cgroup->mem_cgroup is NULL.)

    We can lose file_mapped accounting information at 5 because FILE_MAPPED
    is updated only when mapcount changes 0->1. So we should catch it.

    BTW, historically, the above implementation comes from migration-failure
    of anonymous page. Because we charge both of old page and new page
    with mapcount=0, we can't catch
    - the page is really freed before remap.
    - migration fails but it's freed before remap
    or .....corner cases.

    New migration sequence with memcg is:

    1. allocate a new page.
    2. mark PageCgroupMigration to the old page.
    3. charge against a new page onto the old page's memcg. (here, new page's pc
    is marked as PageCgroupUsed.)
    4. page migration replaces radix-tree, page table, etc...
    5. At remapping, new page's page_cgroup is now marked as "USED"
    We can catch 0->1 event and FILE_MAPPED will be properly updated.

    And we can catch SWAPOUT event after unlock this and freeing this
    page by unmap() can be caught.

    7. Clear PageCgroupMigration of the old page.

    So, FILE_MAPPED will be correctly updated.
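
    Why the ordering matters can be shown with a toy model of the two
    sequences. PageCgroup, Stats and the functions below are hypothetical
    simplifications: the only rule modelled is that the 0->1 mapcount event
    is accounted to FILE_MAPPED when the page_cgroup is already marked USED.

```python
from dataclasses import dataclass


@dataclass
class PageCgroup:
    memcg: str = ""
    used: bool = False        # models the PCG_USED bit


@dataclass
class Stats:
    file_mapped: int = 0


def remap(pc: PageCgroup, stats: Stats) -> None:
    """Mapcount goes 0->1; account FILE_MAPPED only for a committed charge."""
    if pc.used:
        stats.file_mapped += 1


def migrate_old_order() -> int:
    pc, stats = PageCgroup(), Stats()
    remap(pc, stats)                  # old step 5: remap before the commit
    pc.memcg, pc.used = "A", True     # old step 7: commit arrives too late
    return stats.file_mapped


def migrate_new_order() -> int:
    pc, stats = PageCgroup(), Stats()
    pc.memcg, pc.used = "A", True     # new step 3: charge marks the page USED
    remap(pc, stats)                  # new step 5: the 0->1 event is caught
    return stats.file_mapped


print("old order:", migrate_old_order(), "new order:", migrate_new_order())
```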

    Then, what is the MIGRATION flag for?
    Without it, at migration failure, we may have to charge the old page again
    because it may be fully unmapped. "charge" means that we have to dive into
    memory reclaim or something complicated. So, it's better to avoid
    charging it again. Before this patch, __commit_charge() was working for
    both the old and new pages and fixed up everything. But this technique has
    some racy conditions around FILE_MAPPED, SWAPOUT, etc.
    Now, the kernel uses the MIGRATION flag and doesn't uncharge the old page
    until the end of migration.

    I hope this change will make memcg's page migration much simpler. This
    page migration has caused several problems, so it is worth adding a flag
    for simplification.

    Reviewed-by: Daisuke Nishimura
    Tested-by: Daisuke Nishimura
    Reported-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@linux-foundation.org
     
  • Only an out of memory error will cause ret to be set.

    Signed-off-by: Phil Carmody
    Acked-by: Kirill A. Shutemov
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Phil Carmody
     
  • The bottom 4 hunks are atomically changing memory to which there are no
    aliases as it's freshly allocated, so there's no need to use atomic
    operations.

    The other hunks are just atomic_read and atomic_set, and do not involve
    any read-modify-write. The use of atomic_{read,set} doesn't prevent a
    read/write or write/write race, so if a race were possible (I'm not saying
    one is), then it would still be there even with atomic_set.

    See:
    http://digitalvampire.org/blog/index.php/2007/05/13/atomic-cargo-cults/

    Signed-off-by: Phil Carmody
    Acked-by: Kirill A. Shutemov
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Phil Carmody
     
  • It's pointless to try to kill current if select_bad_process() did not find
    an eligible task to kill in mem_cgroup_out_of_memory() since it's
    guaranteed that current is a member of the memcg that is oom and it is, by
    definition, unkillable.

    Signed-off-by: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Li Zefan
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Some information is outdated, and I think the current document doesn't work
    as "a guide for users". We need a summary of all of our controls, at least.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Randy Dunlap
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch adds support for moving charge of file pages, which include
    normal file, tmpfs file and swaps of tmpfs file. It's enabled by setting
    bit 1 of /memory.move_charge_at_immigrate.

    Unlike the case of anonymous pages, file pages(and swaps) in the range
    mmapped by the task will be moved even if the task hasn't done page fault,
    i.e. they might not be the task's "RSS", but other task's "RSS" that maps
    the same file. And mapcount of the page is ignored(the page can be moved
    even if page_mapcount(page) > 1). So, the conditions that a page/swap
    must meet to be moved are that it is in the range mmapped by the
    target task and that it is charged to the old cgroup.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • This patch cleans up move charge code by:

    - define functions to handle pte for each types, and make
    is_target_pte_for_mc() cleaner.

    - instead of checking the MOVE_CHARGE_TYPE_ANON bit, define a function
    that checks the bit.

    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • This adds a feature to disable the oom-killer for a memcg. If it is
    disabled, of course, tasks under the memcg will stop.

    But now, we have oom-notifier for memcg. And the world around memcg is
    not under out-of-memory. memcg's out-of-memory just shows memcg hits
    limit. Then, administrator or management daemon can recover the situation
    by

    - kill some process
    - enlarge limit, add more swap.
    - migrate some tasks
    - remove file cache on tmpfs (difficult ?)

    Unlike oom-killer, you can take enough information before killing tasks.
    (by gcore, or, ps etc.)

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Considering containers or other resource management software in userland,
    event notification of OOM in memcg should be implemented. Now, memcg has
    "threshold" notifier which uses eventfd, we can make use of it for oom
    notification.

    This patch adds oom notification eventfd callback for memcg. The usage is
    very similar to threshold notifier, but control file is memory.oom_control
    and no arguments other than eventfd are required.

    % cgroup_event_notifier /cgroup/A/memory.oom_control dummy
    (About cgroup_event_notifier, see Documentation/cgroup/)

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: David Rientjes
    Cc: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • memcg's oom waitqueue is a system-wide wait_queue (for handling
    hierarchy.) So, it's better to add custom wake function and do filtering
    in wake up path.

    This patch adds a filtering feature for waking up oom-waiters. Hierarchy
    is properly handled.
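
    The filtering idea can be sketched in user space as below. The Memcg and
    Waiter classes are invented for this example, and the matching rule
    (same group, or one group an ancestor of the other) is an assumption
    about what "hierarchy is properly handled" means here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)
class Memcg:
    name: str
    parent: Optional["Memcg"] = None

    def is_ancestor_of(self, other: "Memcg") -> bool:
        node = other.parent
        while node is not None:
            if node is self:
                return True
            node = node.parent
        return False


@dataclass
class Waiter:
    task: str
    memcg: Memcg


def oom_wake_function(waiter: Waiter, wake_memcg: Memcg) -> bool:
    """Return True only if the waiter belongs to the hierarchy being woken."""
    m = waiter.memcg
    return (m is wake_memcg
            or m.is_ancestor_of(wake_memcg)
            or wake_memcg.is_ancestor_of(m))


def wake_up(queue: List[Waiter], wake_memcg: Memcg) -> List[str]:
    """Walk the single shared queue and wake only the matching waiters."""
    return [w.task for w in queue if oom_wake_function(w, wake_memcg)]


a = Memcg("A")
b = Memcg("A/B", parent=a)
c = Memcg("C")
queue = [Waiter("t1", b), Waiter("t2", c)]
print(wake_up(queue, a), wake_up(queue, c))
```

    All waiters sit on one queue, but a wakeup for A reaches only the task
    waiting in A/B, and a wakeup for C reaches only the task waiting in C.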

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Signed-off-by: Trevor Woerner
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Trevor Woerner
     
  • - Add additional location (Git) for the kernel master tree
    - Add reference to Git Project

    Signed-off-by: Abraham Arce
    Acked-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arce, Abraham
     
  • I recently had to recover some files from an old broken machine that was
    running BorderWare Document Gateway. It's basically a drop in web server
    for sharing files. From the look of the init process and using strings on
    a few files it seems to be based on FreeBSD 3.3.

    The process turned out to be more difficult than I imagined, but to cut a
    long story short BorderWare in their wisdom use a nonstandard magic number
    in their UFS (ufstype=44bsd) file systems. Thus Linux refuses to mount
    the file systems in order to recover the data. After a bit of hunting I
    was able to make a quick fix to fs/ufs/super.c in order to detect the new
    magic number.

    I assume that this number is the same for all installations. It's quite
    easy to find out from ufs_fs.h. The superblock sits 8k into the block
    device and the magic number sits 1372 bytes into the superblock struct.

    # dd if=/dev/sda5 skip=$(( 8192 + 1372 )) bs=1 count=4 2> /dev/null | hd
    00000000 97 26 24 0f |.&$.|
    #
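
    The same lookup as the dd command above can be sketched in Python. The
    offsets and the four bytes come from the text; reading them as a
    little-endian 32-bit value and the constant names are assumptions of
    this sketch, and the in-memory image stands in for a real block device.

```python
import io
import struct

SUPERBLOCK_OFFSET = 8192      # superblock sits 8k into the block device
MAGIC_OFFSET = 1372           # magic number offset within the superblock
UFS_MAGIC = 0x00011954        # standard UFS magic
UFS_MAGIC_BW = 0x0F242697     # value seen on the BorderWare disk above


def read_ufs_magic(dev) -> int:
    """Read the 4-byte little-endian magic from a seekable binary stream."""
    dev.seek(SUPERBLOCK_OFFSET + MAGIC_OFFSET)
    raw = dev.read(4)
    if len(raw) != 4:
        raise ValueError("device too small to hold a UFS superblock")
    (magic,) = struct.unpack("<I", raw)
    return magic


def is_known_ufs(magic: int) -> bool:
    return magic in (UFS_MAGIC, UFS_MAGIC_BW)


# Build a fake device image that carries the bytes shown by hd above.
image = bytearray(SUPERBLOCK_OFFSET + MAGIC_OFFSET + 4)
image[-4:] = bytes([0x97, 0x26, 0x24, 0x0F])
magic = read_ufs_magic(io.BytesIO(bytes(image)))
print(hex(magic), is_known_ufs(magic))
```

    On a real system the stream would be the partition opened in binary
    mode, for example open("/dev/sda5", "rb"), which needs read permission
    on the device.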

    Signed-off-by: Thomas Stewart
    Cc: Evgeniy Dushistov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Stewart
     
  • Use memdup_user when user data is immediately copied into the
    allocated region.

    The semantic patch that makes this change is as follows:
    (http://coccinelle.lip6.fr/)

    //
    @@
    expression from,to,size,flag;
    position p;
    identifier l1,l2;
    @@

    - to = \(kmalloc@p\|kzalloc@p\)(size,flag);
    + to = memdup_user(from,size);
    if (
    - to==NULL
    + IS_ERR(to)
    || ...) {

    }
    - if (copy_from_user(to, from, size) != 0) {
    -
    - }
    //

    Signed-off-by: Julia Lawall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Julia Lawall
     
  • The current backlight code is stubbed out, so the new props changes added
    some warnings:
    drivers/video/bf54x-lq043fb.c: In function 'bfin_bf54x_probe':
    drivers/video/bf54x-lq043fb.c:666: warning: label 'out9' defined but not used
    drivers/video/bf54x-lq043fb.c:504: warning: unused variable 'props'

    Fix em !

    Signed-off-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Frysinger
     
  • The current backlight code is stubbed out, so the new props changes added
    some warnings about unused label/prop.

    Signed-off-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Frysinger
     
  • Use memdup_user when user data is immediately copied into the
    allocated region.

    The semantic patch that makes this change is as follows:
    (http://coccinelle.lip6.fr/)

    //
    @@
    expression from,to,size,flag;
    position p;
    identifier l1,l2;
    @@

    - to = \(kmalloc@p\|kzalloc@p\)(size,flag);
    + to = memdup_user(from,size);
    if (
    - to==NULL
    + IS_ERR(to)
    || ...) {

    }
    - if (copy_from_user(to, from, size) != 0) {
    -
    - }
    //

    Signed-off-by: Julia Lawall
    Cc: Joseph Chan
    Cc: Scott Fang
    Cc: Florian Tobias Schandinat
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Julia Lawall
     
  • Add support for S3 Trio3D/1X (86C360) and S3 Trio3D/2X (86C362 and 86C368)
    cards to s3fb driver. Tested with 86C362 AGP and 86C368 PCI&AGP.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Ondrej Zary
    Acked-by: Ondrej Zajicek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ondrej Zary