09 Jan, 2009

40 commits

  • This is to avoid name clashes for the introduction of a global swap()
    macro.
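
    For reference, the generic swap() macro being prepared for looks
    roughly like this (a sketch of the kernel.h definition, not a line
    quoted from this patch):

    #define swap(a, b) \
            do { typeof(a) __tmp = (a); (a) = (b); (b) = __tmp; } while (0)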

    Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • In preparation for the introduction of a generic swap() macro.

    Signed-off-by: Wu Fengguang
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • In preparation for the introduction of a generic swap() macro.

    Signed-off-by: Wu Fengguang
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Acked-by: Theodore Ts'o
    Acked-by: Mark Fasheh
    Acked-by: David S. Miller
    Cc: James Morris
    Acked-by: Casey Schaufler
    Acked-by: Takashi Iwai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fernando Carrijo
     
  • romfs_strnlen() returns int, so storing its result in an unsigned
    variable makes the error check "unsigned X >= 0" always true.
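
    A minimal sketch of the bug class (illustrative values, not the
    actual romfs code):

    #include <stdio.h>

    int main(void)
    {
            int ret = -1;          /* e.g. an error returned as int */
            unsigned long n = ret; /* converts to a huge positive value */

            if (n >= 0)            /* "unsigned X >= 0": always true */
                    printf("dead error check, n = %lu\n", n);
            return 0;
    }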

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: roel kluin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    roel kluin
     
  • Remove the saved_max_pfn check from the /proc/vmcore function
    read_from_oldmem(). There is no need to verify; we should be able to
    just trust that "elfcorehdr=" is correctly passed to the crash kernel
    on the kernel command line, as we do with other parameters.

    The read_from_oldmem() function in fs/proc/vmcore.c is quite similar to
    read_from_oldmem() in drivers/char/mem.c, but only in the latter does it
    make sense to use saved_max_pfn. For oldmem it is used to determine when
    to stop reading. For vmcore we already have the ELF header info pointing
    out the physical memory regions; there is no need to pass the end of old
    memory twice.

    Removing the saved_max_pfn check from vmcore makes it possible for
    architectures to skip oldmem but still support crash dump through vmcore -
    without the need for the old saved_max_pfn cruft.

    Architectures that want to play safe can do the saved_max_pfn check in
    copy_oldmem_page(). It is not clear why anyone would want to do that,
    but it is even safer than today: the saved_max_pfn check in vmcore
    removed by this patch only checked the first page.
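
    An illustrative sketch of such a check (the signature matches the
    crash_dump interface of this era; the body is a placeholder, not a
    real architecture implementation):

    ssize_t copy_oldmem_page(unsigned long pfn, char *buf, size_t csize,
                             unsigned long offset, int userbuf)
    {
            if (pfn > saved_max_pfn)  /* refuse reads past old memory */
                    return -EINVAL;
            /* ... map the old-memory pfn and copy csize bytes to buf ... */
            return csize;
    }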

    Signed-off-by: Magnus Damm
    Acked-by: Vivek Goyal
    Acked-by: Simon Horman
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Magnus Damm
     
  • Send the completion status of commands to userspace. The message and
    protocol are described in the documentation.

    Signed-off-by: Evgeniy Polyakov
    Cc: Paul Alfille
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Polyakov
     
  • Add a command that allows resetting the bus.

    Signed-off-by: Evgeniy Polyakov
    Cc: Paul Alfille
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Polyakov
     
  • Signed-off-by: Evgeniy Polyakov
    Cc: Paul Alfille
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Polyakov
     
  • This small patchset extends the existing commands with reset, master
    IO and status messages. Reset is used to reset the bus for a given
    master device; the master IO command allows initiating IO against the
    bus itself, without selecting a slave device first, which can be used
    to probe for devices, for example. Status messages carry command
    completion status back to userspace (notably useful to get -ENODEV
    back when the requested device was not found).

    Great thanks to Paul Alfille of OWFS for testing and commands suggestions.

    This patch:

    Allow starting IO not only against already-found slave devices but
    also against the bus itself, which can be used, for example, to probe
    for devices.

    [akpm@linux-foundation.org: reindent switch statements]
    Signed-off-by: Evgeniy Polyakov
    Cc: Paul Alfille
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Polyakov
     
  • Signed-off-by: Evgeniy Polyakov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Polyakov
     
  • Initiates a search (or alarm search) and returns all found devices to
    userspace. Found devices are not added into the system (i.e. they are
    not attached to family devices or bus masters); that is done (if it
    has not happened already) by the usual timed searching.

    Signed-off-by: Evgeniy Polyakov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Polyakov
     
  • Writes data and returns the sampled data back to userspace.

    Signed-off-by: Evgeniy Polyakov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Polyakov
     
  • This patch series introduces and extends several userspace commands
    used with the netlink protocol.

    The touch block command allows writing data and returning sampled
    data to userspace.

    The search and alarm search commands are extended to return the list
    of slave devices found during the given search.

    The list masters command allows sending all registered master IDs to
    userspace.

    Great thanks to Paul Alfille (owfs), who tested this implementation
    and wrote the w1-to-network daemon
    http://sourceforge.net/projects/w1repeater/, and to Frederik Deweerdt
    and Randy Dunlap for review.

    This patch:

    Returns the list of registered bus master devices.
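
    A hedged userspace sketch of the request this command answers. The
    struct mirrors w1_netlink.h of this series; wrapping the message in
    a struct cn_msg and sending it over NETLINK_CONNECTOR is omitted:

    #include <stdint.h>
    #include <string.h>

    struct w1_netlink_msg {
            uint8_t  type;      /* e.g. W1_LIST_MASTERS */
            uint8_t  status;    /* filled in by the kernel in replies */
            uint16_t len;       /* payload length */
            union {
                    uint8_t id[8];                              /* slave id */
                    struct { uint32_t id; uint32_t res; } mst;  /* master id */
            } id;
            uint8_t  data[0];   /* command payload */
    };

    enum { W1_LIST_MASTERS = 6 };  /* value as in w1_netlink.h */

    int main(void)
    {
            struct w1_netlink_msg req;

            memset(&req, 0, sizeof(req));
            req.type = W1_LIST_MASTERS;  /* no payload; kernel replies with ids */
            /* ... wrap req in a struct cn_msg and send via the connector ... */
            return 0;
    }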

    Signed-off-by: Evgeniy Polyakov
    Cc: Paul Alfille
    Cc: Frederik Deweerdt
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Polyakov
     
  • This patch adds support for the 1-wire master interface for i.MX27 and
    i.MX31.

    Signed-off-by: Luotao Fu
    Signed-off-by: Sascha Hauer
    Signed-off-by: Evgeniy Polyakov
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sascha Hauer
     
  • Signed-off-by: Marcel Selhorst
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Smith
     
  • Add a driver for controlling Dell-specific backlight and rfkill interfaces.
    This driver makes use of the dcdbas interface to the Dell firmware to
    allow the backlight and rfkill interfaces on Dell systems to be driven
    through the standardised sysfs interfaces.

    Signed-off-by: Matthew Garrett
    Cc: Matt Domsch
    Cc: Ivo van Doorn
    Cc: Len Brown
    Cc: Richard Purdie
    Cc: Henrique de Moraes Holschuh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Garrett
     
  • The dcdbas code allows calls to be made into the firmware on Dell systems.
    Exporting this to other drivers allows them to implement Dell-specific
    functionality in a safe way.

    Signed-off-by: Matthew Garrett
    Cc: Matt Domsch
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Garrett
     
  • While discussing[1] the need for glibc to have access to random bytes
    during program load, it emerged that an earlier attempt to implement
    AT_RANDOM had stalled. This implements a random 16-byte string,
    available to every ELF program via a new AT_RANDOM auxv vector.

    [1] http://sourceware.org/ml/libc-alpha/2008-10/msg00006.html

    Ulrich said:

    Right after startup, glibc needs a bit of random data for internal
    protections (stack canary etc). What is now in upstream glibc is that
    we always unconditionally open /dev/urandom, read some data, and use
    it, for every process startup. That's slow.

    ...

    The solution is to provide a limited amount of random data to the
    starting process in the aux vector. I suggested 16 bytes and this is
    what the patch implements. If we need only 16 bytes or less we use the
    data directly. If we need more, we use the 16 bytes to seed a PRNG.
    This avoids the costly /dev/urandom use and it allows the kernel to
    use the most adequate source of random data for this purpose. It might
    not be the same pool as that for /dev/urandom.

    Concerns were expressed about the depletion of the randomness pool. But
    this patch doesn't make the situation worse, it doesn't deplete entropy
    more than happens now.
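
    A userspace sketch that prints the 16 AT_RANDOM bytes (it uses the
    later getauxval(3) helper for brevity; at the time of this patch one
    would walk the auxv directly):

    #include <elf.h>       /* AT_RANDOM */
    #include <stdio.h>
    #include <sys/auxv.h>  /* getauxval() */

    int main(void)
    {
            const unsigned char *rnd =
                    (const unsigned char *)getauxval(AT_RANDOM);

            if (rnd) {
                    for (int i = 0; i < 16; i++)
                            printf("%02x", rnd[i]);
                    printf("\n");
            }
            return 0;
    }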

    Signed-off-by: Kees Cook
    Cc: Jakub Jelinek
    Cc: Andi Kleen
    Cc: Ulrich Drepper
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • If a process registers for asynchronous notification on a POSIX message
    queue, it gets a signal and a siginfo_t structure when a message arrives
    on the message queue. The si_pid in the siginfo_t structure is set to the
    PID of the process that sent the message to the message queue.

    The principle is the following:
    - When mq_notify(SIGEV_SIGNAL) is called, the caller registers for
      notification when a message arrives. The associated pid structure
      is stored into inode_info->notify_owner. Let's call this process P1.
    - When mq_send() is called by, say, P2, P2 sends a signal to P1 to
      notify it of the message's arrival.

    The way .si_pid is set today is not correct, since it doesn't take
    into account the fact that the process sending the message might not
    be in the same pid namespace as the notified one.

    This patch sets si_pid to the sender's pid as seen from the
    notify_owner's pid namespace.
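
    A userspace sketch of the scenario above (the queue name is
    illustrative; link with -lrt; printf() in a signal handler is for
    demonstration only):

    #include <fcntl.h>
    #include <mqueue.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static void on_msg(int sig, siginfo_t *si, void *ctx)
    {
            /* si->si_pid: the sender's PID, translated into the
             * receiver's pid namespace by this patch. */
            printf("mq_send() came from pid %d\n", (int)si->si_pid);
    }

    int main(void)
    {
            struct sigaction sa = { .sa_sigaction = on_msg,
                                    .sa_flags = SA_SIGINFO };
            struct sigevent sev = { .sigev_notify = SIGEV_SIGNAL,
                                    .sigev_signo  = SIGUSR1 };
            mqd_t q = mq_open("/demo", O_RDONLY | O_CREAT, 0600, NULL);

            sigaction(SIGUSR1, &sa, NULL);
            mq_notify(q, &sev);  /* we are "P1" in the description */
            pause();             /* wait for some "P2" to call mq_send() */
            return 0;
    }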

    Signed-off-by: Nadia Derbey
    Signed-off-by: Sukadev Bhattiprolu
    Acked-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Bastian Blank
    Cc: Pavel Emelyanov
    Cc: Eric W. Biederman
    Acked-by: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • Currently task_active_pid_ns is not safe to call after a task becomes a
    zombie and exit_task_namespaces is called, as nsproxy becomes NULL. By
    reading the pid namespace from the pid of the task we can trivially solve
    this problem at the cost of one extra memory read in what should be the
    same cacheline as we read the namespace from.

    When moving things around I have made task_active_pid_ns out of line
    because keeping it in pid_namespace.h would require adding includes of
    pid.h and sched.h that I don't think we want.

    This change does make task_active_pid_ns unsafe to call during
    copy_process until we attach a pid to the task_struct, which seems to
    be a reasonable trade-off.
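
    A sketch of the out-of-line helper after this change (close to the
    actual code: the namespace now comes from the task's struct pid, so
    it works for zombies too):

    struct pid_namespace *task_active_pid_ns(struct task_struct *tsk)
    {
            return ns_of_pid(task_pid(tsk));
    }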

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Sukadev Bhattiprolu
    Cc: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Bastian Blank
    Cc: Pavel Emelyanov
    Cc: Nadia Derbey
    Acked-by: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • A current problem with the pid namespace is that it is easy to do pid
    related work after exit_task_namespaces which drops the nsproxy pointer.

    However if we are doing pid namespace related work we are always operating
    on some struct pid which retains the pid_namespace pointer of the pid
    namespace it was allocated in.

    So provide ns_of_pid which allows us to find the pid namespace a pid was
    allocated in.

    Using this we have the needed infrastructure to do pid namespace
    related work at any time we have a struct pid, removing the chance of
    an accidental NULL pointer dereference when accessing current->nsproxy.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Sukadev Bhattiprolu
    Cc: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Bastian Blank
    Cc: Pavel Emelyanov
    Cc: Nadia Derbey
    Acked-by: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Impact: cleanups, use new cpumask API

    Final trivial cleanups: mainly s/cpumask_t/struct cpumask

    Note there is a FIXME in generate_sched_domains(). A future patch will
    change struct cpumask *doms to struct cpumask *doms[].
    (I suppose Rusty will do this.)

    Signed-off-by: Li Zefan
    Cc: Ingo Molnar
    Cc: Rusty Russell
    Acked-by: Mike Travis
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Impact: use new cpumask API

    This patch mainly does the following things (see the sketch of the
    allocation pattern after this list):
    - change cs->cpus_allowed from cpumask_t to cpumask_var_t
    - call alloc_bootmem_cpumask_var() for top_cpuset in cpuset_init_early()
    - call alloc_cpumask_var() for other cpusets
    - replace cpus_xxx() with cpumask_xxx()
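
    A minimal sketch of the allocation pattern used throughout the
    conversion (typical kernel usage, not a line from this patch):

    cpumask_var_t mask;

    if (!alloc_cpumask_var(&mask, GFP_KERNEL))
            return -ENOMEM;
    cpumask_copy(mask, cpu_online_mask);  /* cpus_*() -> cpumask_*() */
    /* ... use mask ... */
    free_cpumask_var(mask);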

    Signed-off-by: Li Zefan
    Cc: Ingo Molnar
    Cc: Rusty Russell
    Acked-by: Mike Travis
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Impact: cleanups, reduce stack usage

    This patch prepares for the next patch. When we convert
    cpuset.cpus_allowed to cpumask_var_t, (trialcs = *cs) no longer works.

    Another result of this patch is reduced stack usage for trialcs:
    sizeof(*cs) can be as large as 148 bytes on x86_64, so it is really
    not good to have it on the stack.

    Signed-off-by: Li Zefan
    Cc: Ingo Molnar
    Cc: Rusty Russell
    Acked-by: Mike Travis
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Impact: reduce stack usage

    Allocate a global cpumask_var_t at boot and use it in cpuset_attach(),
    so that cpuset_attach() cannot fail due to a failed allocation.

    Signed-off-by: Li Zefan
    Cc: Ingo Molnar
    Cc: Rusty Russell
    Acked-by: Mike Travis
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Impact: reduce stack usage

    Just use cs->cpus_allowed directly; there is no need to allocate a
    cpumask_var_t.

    Signed-off-by: Li Zefan
    Cc: Ingo Molnar
    Cc: Rusty Russell
    Acked-by: Mike Travis
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • This patchset converts cpuset to the new cpumask API, removing
    on-stack cpumask_t variables to reduce stack usage.

    Before:
    # cat kernel/cpuset.c include/linux/cpuset.h | grep -c cpumask_t
    21
    After:
    # cat kernel/cpuset.c include/linux/cpuset.h | grep -c cpumask_t
    0

    This patch:

    Impact: reduce stack usage

    It's safe to call cpulist_scnprintf() inside callback_mutex, so we can
    just remove the on-stack cpumask_t; there is no need to allocate a
    cpumask_var_t either.

    Signed-off-by: Li Zefan
    Cc: Ingo Molnar
    Cc: Rusty Russell
    Acked-by: Mike Travis
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • I found a bug on my dual-CPU box. I created a sub cpuset in the top
    cpuset and assigned CPU 1 to its cpus. Then we attached some tasks to
    this sub cpuset. After this, we offlined CPU1. The tasks in this new
    cpuset were moved into the top cpuset automatically because there was
    no CPU left in the sub cpuset. When we then onlined CPU1 again, all
    the tasks that did not originally belong to the top cpuset ran only
    on CPU0.

    We fix this bug by setting a task's cpus_allowed to cpu_possible_map
    when attaching it to the top cpuset. This method does not modify the
    current behavior of cpusets on CPU hotplug, and all tasks in the top
    cpuset use cpu_possible_map to initialize their cpus_allowed.

    Signed-off-by: Miao Xie
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     
  • task_cs() calls task_subsys_state().

    We must use rcu_read_lock() to protect cgroup_subsys_state().

    It is true that top_cpuset is never freed, but cgroup_subsys_state()
    accesses a css_set, and this css_set may be freed while task_cs() is
    running.

    We therefore use rcu_read_lock() to protect it.

    Signed-off-by: Lai Jiangshan
    Acked-by: Paul Menage
    Cc: KAMEZAWA Hiroyuki
    Cc: Pavel Emelyanov
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • Add css_tryget(), which obtains a counted reference on a CSS. It is
    used in situations where the caller has a "weak" reference to the CSS,
    i.e. one that does not protect the cgroup from removal via a reference
    count, but would instead be cleaned up by a destroy() callback.

    css_tryget() will return true on success, or false if the cgroup is being
    removed.

    This is similar to Kamezawa Hiroyuki's patch from a week or two ago, but
    with the difference that in the event of css_tryget() racing with a
    cgroup_rmdir(), css_tryget() will only return false if the cgroup really
    does get removed.

    This implementation is done by biasing css->refcnt, so that a refcnt of 1
    means "releasable" and 0 means "released or releasing". In the event of a
    race, css_tryget() distinguishes between "released" and "releasing" by
    checking for the CSS_REMOVED flag in css->flags.
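
    A minimal sketch of that scheme (close to, but not necessarily
    identical to, the actual code):

    static inline bool css_tryget(struct cgroup_subsys_state *css)
    {
            while (!atomic_inc_not_zero(&css->refcnt)) {
                    if (test_bit(CSS_REMOVED, &css->flags))
                            return false;  /* really removed */
                    /* otherwise an rmdir is in flight and may still fail */
            }
            return true;
    }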

    Signed-off-by: Paul Menage
    Tested-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Update the memory controller to use its hierarchy_mutex rather than
    calling cgroup_lock() to protect against cgroup_mkdir()/cgroup_rmdir()
    occurring in its hierarchy.

    Signed-off-by: Paul Menage
    Tested-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • These patches introduce new locking/refcount support for cgroups to
    reduce the need for subsystems to call cgroup_lock(). This will
    ultimately allow the atomicity of cgroup_rmdir() (which was removed
    recently) to be restored.

    These three patches give:

    1/3 - introduce a per-subsystem hierarchy_mutex which a subsystem can
    use to prevent changes to its own cgroup tree

    2/3 - use hierarchy_mutex in place of calling cgroup_lock() in the
    memory controller

    3/3 - introduce a css_tryget() function similar to the one recently
    proposed by Kamezawa, but avoiding spurious refcount failures in
    the event of a race between a css_tryget() and an unsuccessful
    cgroup_rmdir()

    Future patches will likely involve:

    - using hierarchy mutex in place of cgroup_lock() in more subsystems
    where appropriate

    - restoring the atomicity of cgroup_rmdir() with respect to cgroup_create()

    This patch:

    Add a hierarchy_mutex to the cgroup_subsys object that protects changes to
    the hierarchy observed by that subsystem. It is taken by the cgroup
    subsystem (in addition to cgroup_mutex) for the following operations:

    - linking a cgroup into that subsystem's cgroup tree
    - unlinking a cgroup from that subsystem's cgroup tree
    - moving the subsystem to/from a hierarchy (including across the
    bind() callback)

    Thus if the subsystem holds its own hierarchy_mutex, it can safely
    traverse its own hierarchy.
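
    A sketch of the resulting usage pattern (the traversal itself is
    illustrative, not a real cgroup API):

    mutex_lock(&ss->hierarchy_mutex);
    /* ... safely walk this subsystem's cgroup tree; no cgroup can be
     * linked or unlinked while we hold the mutex ... */
    mutex_unlock(&ss->hierarchy_mutex);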

    Signed-off-by: Paul Menage
    Tested-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Now, you can see the following even when swap accounting is enabled:

    1. Create groups 01 and 02.
    2. Allocate a "file" on tmpfs from a task under 01.
    3. Swap out the "file" (by memory pressure).
    4. Read the "file" from a task in group 02.
    5. The charge for the "file" is moved to group 02.

    This is not ideal behavior. It happens because SwapCache loaded by
    read-ahead is not taken into account.

    This is a patch to fix shmem's swapcache behavior:
    - remove mem_cgroup_cache_charge_swapin();
    - add a SwapCache handler routine to mem_cgroup_cache_charge(); by
      this, shmem's file cache is charged at add_to_page_cache() with
      GFP_NOWAIT;
    - pass the swapcache page to shrink_mem_cgroup.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Now, a page can be deleted from SwapCache during do_swap_page().
    memcg-fix-swap-accounting-leak-v3.patch handles that, but LRU handling
    is still broken (the behavior above broke an assumption of the
    memcg-synchronized-lru patch).

    This patch is a fix for LRU handling (especially for the per-zone
    counters). When charging SwapCache:
    - remove the page_cgroup from the LRU if it is not in use;
    - add the page_cgroup to the LRU if it is not linked to one yet.

    Reported-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Paul Menage
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • From: KAMEZAWA Hiroyuki

    css_tryget() is newly added; it lets us know whether a css is alive
    and take a reference on it in a very safe way. ("Alive" here means
    that rmdir/destroy has not been called.)

    This patch replaces css_get() with css_tryget() wherever I cannot
    explain why css_get() is safe, and removes the memcg->obsolete flag.

    Reviewed-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Paul Menage
    Cc: Daisuke Nishimura
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • 1. Fix a double-free BUG in the error path of mem_cgroup_create():
       mem_cgroup_free() itself frees the per-zone info.
    2. Simplify the memcg refcount: take one reference at creation and
       free the memcg when the refcount drops to 0.

    Reviewed-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Paul Menage
    Cc: Daisuke Nishimura
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Fix the swapin charge operation of memcg.

    Now, memcg has hooks into the swap-out operation and checks whether a
    SwapCache page is really unused or not. That check depends on the
    contents of struct page, i.e. if PageAnon(page) && page_mapped(page),
    the page is recognized as still in use.

    Now, reuse_swap_page() calls delete_from_swap_cache() before the
    establishment of any rmap. Then, in the following sequence

    (Page fault with WRITE)
    try_charge()     (charge += PAGESIZE)
    commit_charge()  (check whether page_cgroup is in use or not)
    reuse_swap_page()
      -> delete_from_swapcache()
         -> mem_cgroup_uncharge_swapcache()  (charge -= PAGESIZE)

    the new charge is uncharged soon after. To avoid this, move
    commit_charge() to after page_mapcount() goes up to 1. By this,

    try_charge()       (usage += PAGESIZE)
    reuse_swap_page()  (may do usage -= PAGESIZE if PCG_USED is set)
    commit_charge()    (if page_cgroup is not marked PCG_USED,
                        add the new charge)

    accounting will be correct.

    Changelog (v2) -> (v3)
    - fixed invalid charge to swp_entry==0.
    - updated documentation.
    Changelog (v1) -> (v2)
    - fixed comment.

    [nishimura@mxp.nes.nec.co.jp: swap accounting leak doc fix]
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Tested-by: Balbir Singh
    Cc: Hugh Dickins
    Cc: Daisuke Nishimura
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • mem_cgroup_hierarchical_reclaim() works properly even when
    !use_hierarchy now (by memcg-hierarchy-avoid-unnecessary-reclaim.patch),
    so it should be used in many cases instead of
    try_to_free_mem_cgroup_pages().

    The only exception is force_empty; the group has no children in that
    case.

    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • mpol_rebind_mm(), which can be called from cpuset_attach(), does
    down_write(mm->mmap_sem). This means down_write(mm->mmap_sem) can be
    called under cgroup_mutex.

    OTOH, the page fault path does down_read(mm->mmap_sem) and calls
    mem_cgroup_try_charge_xxx(), which may eventually call
    mem_cgroup_out_of_memory(), and mem_cgroup_out_of_memory() calls
    cgroup_lock(). This means cgroup_lock() can be called under
    down_read(mm->mmap_sem).

    If those two paths race, a deadlock can happen.

    This patch avoids the deadlock by:
    - removing cgroup_lock() from mem_cgroup_out_of_memory();
    - defining a new mutex (memcg_tasklist) and serializing
      mem_cgroup_move_task() (the ->attach handler of the memory cgroup)
      and mem_cgroup_out_of_memory() with it, as sketched below.
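
    A minimal sketch of that serialization (names from the description;
    the critical sections are placeholders):

    static DEFINE_MUTEX(memcg_tasklist);

    /* taken in both mem_cgroup_move_task() and
     * mem_cgroup_out_of_memory(): */
    mutex_lock(&memcg_tasklist);
    /* ... critical section ... */
    mutex_unlock(&memcg_tasklist);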

    Signed-off-by: Daisuke Nishimura
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura