28 May, 2010

1 commit

  • We have observed several workloads running on multi-node systems where
    memory is assigned unevenly across the nodes in the system. There are
    numerous reasons for this but one is the round-robin rotor in
    cpuset_mem_spread_node().

    For example, a simple test that writes a multi-page file will allocate
    pages on nodes 0 2 4 6 ... Odd nodes are skipped. (Sometimes it
    allocates on odd nodes & skips even nodes).

    An example is shown below. The program "lfile" writes a file consisting
    of 10 pages. The program then mmaps the file & uses get_mempolicy(...,
    MPOL_F_NODE) to determine the nodes where the file pages were allocated.
    The output is shown below:

    # ./lfile
    allocated on nodes: 2 4 6 0 1 2 6 0 2

    There is a single rotor that is used for allocating both file pages & slab
    pages. Writing the file allocates both a data page & a slab page
    (buffer_head). This advances the RR rotor 2 nodes for each page
    allocated.

    A quick test appears to confirm that this is the cause of the uneven
    allocation:

    # echo 0 >/dev/cpuset/memory_spread_slab
    # ./lfile
    allocated on nodes: 6 7 8 9 0 1 2 3 4 5

    This patch introduces a second rotor that is used for slab allocations.
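
    A minimal sketch of the two-rotor idea (simplified; the helper and
    rotor-field names follow the description above, but the details are
    illustrative rather than the exact kernel code):

    /* Advance one rotor round-robin through the allowed nodes. */
    static int cpuset_spread_node(int *rotor)
    {
            int node;

            node = next_node(*rotor, current->mems_allowed);
            if (node == MAX_NUMNODES)
                    node = first_node(current->mems_allowed);
            *rotor = node;
            return node;
    }

    /* Page-cache spreading keeps the original rotor... */
    int cpuset_mem_spread_node(void)
    {
            return cpuset_spread_node(&current->cpuset_mem_spread_rotor);
    }

    /* ...while slab spreading now advances an independent one, so the
     * interleaved data-page + buffer_head pattern no longer skips
     * every other node. */
    int cpuset_slab_spread_node(void)
    {
            return cpuset_spread_node(&current->cpuset_slab_spread_rotor);
    }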

    Signed-off-by: Jack Steiner
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Paul Menage
    Cc: Jack Steiner
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jack Steiner
     

25 May, 2010

2 commits

  • Before applying this patch, cpuset updates task->mems_allowed and
    mempolicy by setting all new bits in the nodemask first, and clearing all
    old unallowed bits later. But during that window, the allocator may find
    that there is no node from which to allocate memory.

    The reason is that when cpuset rebinds the task's mempolicy, it clears
    the nodes the allocator can allocate pages from, for example:

    (mpol: mempolicy)
    task1                     task1's mpol    task2
    alloc page                1
      alloc on node0? NO      1
                              1               change mems from 1 to 0
                              1               rebind task1's mpol
                              0-1             set new bits
                              0               clear disallowed bits
      alloc on node1? NO      0
      ...
    can't alloc page
      goto oom

    This patch fixes the problem by expanding the node range first (set the
    newly allowed bits) and shrinking it lazily (clear the newly disallowed
    bits). We use a variable to tell the write-side task that a read-side
    task is reading the nodemask, and the write-side task clears the newly
    disallowed nodes only after the read-side task finishes its current
    memory allocation, as sketched below.
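
    A minimal sketch of that protocol (field and function names follow the
    description above; the real patch's barriers and locking are simplified
    here):

    /* Read side: bracket an allocation so the writer can wait for it. */
    static inline void get_mems_allowed(void)
    {
            current->mems_allowed_change_disable++;
            smp_mb();       /* increment visible before the nodemask is read */
    }

    static inline void put_mems_allowed(void)
    {
            smp_mb();       /* nodemask reads done before the decrement */
            current->mems_allowed_change_disable--;
    }

    /* Write side: expand first, shrink only after readers drain. */
    static void cpuset_change_task_nodemask(struct task_struct *tsk,
                                            nodemask_t *newmems)
    {
            nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);

            while (tsk->mems_allowed_change_disable)
                    cpu_relax();    /* a read-side allocation is in flight */

            tsk->mems_allowed = *newmems;
    }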

    [akpm@linux-foundation.org: fix spello]
    Signed-off-by: Miao Xie
    Cc: David Rientjes
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     
  • Nick Piggin reported that the allocator may see an empty nodemask when
    changing cpuset's mems[1]. It happens only on kernels that do not do
    atomic nodemask_t stores (MAX_NUMNODES > BITS_PER_LONG).

    But I found that there is also a problem on kernels that can do atomic
    nodemask_t stores: the allocator can't find a node to allocate pages
    from when changing cpuset's mems, even though there is plenty of free
    memory. The reason is as follows:

    (mpol: mempolicy)
    task1                     task1's mpol    task2
    alloc page                1
      alloc on node0? NO      1
                              1               change mems from 1 to 0
                              1               rebind task1's mpol
                              0-1             set new bits
                              0               clear disallowed bits
      alloc on node1? NO      0
      ...
    can't alloc page
      goto oom

    I can reproduce it with the attached program by the following steps:

    # mkdir /dev/cpuset
    # mount -t cpuset cpuset /dev/cpuset
    # mkdir /dev/cpuset/1
    # echo `cat /dev/cpuset/cpus` > /dev/cpuset/1/cpus
    # echo `cat /dev/cpuset/mems` > /dev/cpuset/1/mems
    # echo $$ > /dev/cpuset/1/tasks
    # numactl --membind=`cat /dev/cpuset/mems` ./cpuset_mem_hog <nr_tasks> &
      (where <nr_tasks> = max(nr_cpus - 1, 1))
    # killall -s SIGUSR1 cpuset_mem_hog
    # ./change_mems.sh

    Several hours later, an OOM happens even though there is plenty of free
    memory.

    This patchset fixes the problem by expanding the node range first (set
    the newly allowed bits) and shrinking it lazily (clear the newly
    disallowed bits). We use a variable to tell the write-side task that a
    read-side task is reading the nodemask, and the write-side task clears
    the newly disallowed nodes only after the read-side task finishes its
    current memory allocation.

    This patch:

    To fix the case where there is no node to allocate memory from, when we
    want to update the mempolicy and mems_allowed we expand the set of
    nodes first (set all the newly allowed nodes) and shrink it lazily
    (clear the newly disallowed nodes). But the mempolicy's rebind
    functions may break the expanding.

    So we restructure the mempolicy's rebind functions and split the rebind
    work into two steps, just like the update of cpuset's mems: step 1
    expands the set of the mempolicy's nodes, and step 2 shrinks it. The
    two-step form is used when there is no real lock to protect the
    mempolicy on the read side; otherwise we can do the rebind work at once.

    In order to implement it, we define

    enum mpol_rebind_step {
            MPOL_REBIND_ONCE,       /* do the rebind work at once, not in two steps */
            MPOL_REBIND_STEP1,      /* first step: set all the newly allowed nodes */
            MPOL_REBIND_STEP2,      /* second step: clear the newly disallowed nodes */
            MPOL_REBIND_NSTEP,
    };

    If the mempolicy need not be updated in two steps, we can pass
    MPOL_REBIND_ONCE to the rebind functions. Otherwise we pass
    MPOL_REBIND_STEP1 to do the first step of the rebind work, and later
    MPOL_REBIND_STEP2 to do the second.

    Besides that, a long time may pass between the two steps, and we have
    to release the lock that protects mempolicy and mems_allowed. When we
    take the lock again, we must check whether the current mempolicy is in
    the middle of a rebind (the first step has been done), because the task
    may have installed a new mempolicy while we did not hold the lock. So
    we define the following flag to identify that state:

    #define MPOL_F_REBINDING (1 << 2)

    The new functions will be used in the next patch.
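
    A sketch of how a caller might drive the two steps (locking elided; the
    flow is illustrative, based on the description above):

    static void rebind_in_two_steps(struct mempolicy *pol,
                                    const nodemask_t *newmask)
    {
            /* Step 1: expand - set the newly allowed nodes.  The
             * mempolicy carries MPOL_F_REBINDING until step 2 runs. */
            mpol_rebind_policy(pol, newmask, MPOL_REBIND_STEP1);

            /* The protecting lock may be dropped and retaken here;
             * readers meanwhile see the union of old and new nodes. */

            /* Step 2: shrink - clear the newly disallowed nodes. */
            if (pol->flags & MPOL_F_REBINDING)
                    mpol_rebind_policy(pol, newmask, MPOL_REBIND_STEP2);
    }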

    Signed-off-by: Miao Xie
    Cc: David Rientjes
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     

03 Apr, 2010

2 commits

  • Introduce cpuset_cpus_allowed_fallback() helper to fix the cpuset problems
    with select_fallback_rq(). It can be called from any context and can't use
    any cpuset locks including task_lock(). It is called when the task doesn't
    have online cpus in ->cpus_allowed but ttwu/etc must be able to find a
    suitable cpu.
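
    A rough sketch of the helper's shape (illustrative; the real version
    carries the "fat comment" mentioned below):

    void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
    {
            /* RCU is sufficient: no cpuset locks, no task_lock(). */
            rcu_read_lock();
            cpumask_copy(&tsk->cpus_allowed, task_cs(tsk)->cpus_allowed);
            rcu_read_unlock();

            /* If even this mask contains no online cpu, the caller
             * falls back further, e.g. to cpu_possible_mask. */
    }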

    I am not proud of this patch. Everything which needs such a fat comment
    can't be good even if correct. But I'd prefer to not change the locking
    rules in code I hardly understand, and in any case I believe this simple
    change makes the code much more correct compared to the deadlocks we
    currently have.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • This patch just states the fact the cpusets/cpuhotplug interaction is
    broken and removes the deadlockable code which only pretends to work.

    - cpuset_lock() doesn't really work. It is needed for
    cpuset_cpus_allowed_locked() but we can't take this lock in
    try_to_wake_up()->select_fallback_rq() path.

    - cpuset_lock() is deadlockable. Suppose that a task T bound to a CPU
    takes callback_mutex. If cpu_down(CPU) happens before T drops
    callback_mutex, stop_machine() preempts T, and then
    migration_call(CPU_DEAD) tries to take cpuset_lock() and hangs forever,
    because the CPU is already dead and thus T can't be scheduled.

    - cpuset_cpus_allowed_locked() is deadlockable too. It takes task_lock()
    which is not irq-safe, but try_to_wake_up() can be called from irq.

    Kill them, and change select_fallback_rq() to use cpu_possible_mask, like
    we currently do without CONFIG_CPUSETS.
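
    A sketch of the resulting last-resort path (simplified, illustrative):

    static int select_fallback_rq(int cpu, struct task_struct *p)
    {
            int dest_cpu;

            /* Any allowed, active CPU? */
            dest_cpu = cpumask_any_and(&p->cpus_allowed, cpu_active_mask);
            if (dest_cpu < nr_cpu_ids)
                    return dest_cpu;

            /* No more Mr. Nice Guy: widen to the possible mask. */
            cpumask_copy(&p->cpus_allowed, cpu_possible_mask);
            return cpumask_any_and(&p->cpus_allowed, cpu_active_mask);
    }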

    Also, with or without this patch, with or without CONFIG_CPUSETS, the
    callers of select_fallback_rq() can race with each other or with
    set_cpus_allowed() paths.

    The subsequent patches try to fix these problems.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     

07 Dec, 2009

2 commits

  • Since (e761b77: cpu hotplug, sched: Introduce cpu_active_map and redo
    sched domain managment) we have cpu_active_mask, which is supposed to
    rule scheduler migration and load-balancing, except it never (fully) did.

    The particular problem being solved here is a crash in try_to_wake_up()
    where select_task_rq() ends up selecting an offline cpu, because
    select_task_rq_fair() trusts the sched_domain tree to reflect the
    current state of affairs; similarly, select_task_rq_rt() trusts the
    root_domain.

    However, the sched_domains are updated from CPU_DEAD, which is after the
    cpu is taken offline and after stop_machine is done. Therefore it can
    race perfectly well with code assuming the domains are right.

    Cure this by building the domains from cpu_active_mask on
    CPU_DOWN_PREPARE.
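
    A sketch of the idea in hotplug-notifier form (illustrative; the actual
    patch touches more places):

    static int update_sched_domains(struct notifier_block *nfb,
                                    unsigned long action, void *hcpu)
    {
            switch (action) {
            case CPU_ONLINE:
            case CPU_DOWN_PREPARE:  /* was CPU_DEAD: rebuild early, while
                                     * the dying cpu is already cleared
                                     * from cpu_active_mask */
                    partition_sched_domains(1, NULL, NULL);
                    return NOTIFY_OK;
            default:
                    return NOTIFY_DONE;
            }
    }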

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Commit acc3f5d7cabbfd6cec71f0c1f9900621fa2d6ae7 ("cpumask:
    Partition_sched_domains takes array of cpumask_var_t") changed
    the function signature of generate_sched_domains() for the
    CONFIG_SMP=y case, but forgot to update the corresponding
    function for the CONFIG_SMP=n case, causing:

    kernel/cpuset.c:2073: warning: passing argument 1 of 'generate_sched_domains' from incompatible pointer type
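
    So the fix is to give the CONFIG_SMP=n stub the same argument types as
    the SMP version; a sketch (the stub body shown is illustrative):

    #ifndef CONFIG_SMP
    static int generate_sched_domains(cpumask_var_t **domains,
                                      struct sched_domain_attr **attributes)
    {
            *domains = NULL;
            return 1;
    }
    #endif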

    Signed-off-by: Geert Uytterhoeven
    Cc: Rusty Russell
    Cc: Linus Torvalds
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Geert Uytterhoeven
     

04 Nov, 2009

1 commit

  • Currently partition_sched_domains() takes a 'struct cpumask
    *doms_new' which is a kmalloc'ed array of cpumask_t. You can't
    have such an array if 'struct cpumask' is undefined, as we plan
    for CONFIG_CPUMASK_OFFSTACK=y.

    So, we make this an array of cpumask_var_t instead: this is the
    same for the CONFIG_CPUMASK_OFFSTACK=n case, but requires
    multiple allocations for the CONFIG_CPUMASK_OFFSTACK=y case.
    Hence we add alloc_sched_domains() and free_sched_domains()
    functions.
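
    A sketch of the pair of helpers (illustrative, following the
    description above):

    cpumask_var_t *alloc_sched_domains(unsigned int ndoms)
    {
            int i;
            cpumask_var_t *doms;

            doms = kmalloc(sizeof(*doms) * ndoms, GFP_KERNEL);
            if (!doms)
                    return NULL;
            for (i = 0; i < ndoms; i++) {
                    if (!alloc_cpumask_var(&doms[i], GFP_KERNEL)) {
                            free_sched_domains(doms, i);
                            return NULL;
                    }
            }
            return doms;
    }

    void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms)
    {
            unsigned int i;

            for (i = 0; i < ndoms; i++)
                    free_cpumask_var(doms[i]);
            kfree(doms);
    }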

    Signed-off-by: Rusty Russell
    Cc: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Rusty Russell
     

24 Sep, 2009

1 commit

  • Alter the ss->can_attach and ss->attach functions to be able to deal with
    a whole threadgroup at a time, for use in cgroup_attach_proc. (This is a
    pre-patch to cgroup-procs-writable.patch.)

    Currently, the new mode of the attach function can only tell the
    subsystem about the old cgroup of the threadgroup leader. No subsystem
    needs that information for each thread that's being moved, but if one were
    to be added (for example, one that counts tasks within a group) this bit
    would need to be reworked a bit to tell the subsystem the right
    information.
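
    A sketch of the widened callback (the attach_one_task() helper here is
    hypothetical; the parameter layout follows the description above):

    static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont,
                              struct cgroup *oldcont, struct task_struct *tsk,
                              bool threadgroup)
    {
            attach_one_task(cont, tsk);     /* hypothetical per-task helper */

            if (threadgroup) {
                    struct task_struct *c;

                    /* apply the same move to every thread in the group */
                    rcu_read_lock();
                    list_for_each_entry_rcu(c, &tsk->thread_group, thread_group)
                            attach_one_task(cont, c);
                    rcu_read_unlock();
            }
    }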

    [hidave.darkstar@gmail.com: fix build]
    Signed-off-by: Ben Blum
    Signed-off-by: Paul Menage
    Acked-by: Li Zefan
    Reviewed-by: Matt Helsley
    Cc: "Eric W. Biederman"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Dave Young
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     

21 Sep, 2009

1 commit

  • The Cpus_allowed fields in /proc/<pid>/status are currently only
    shown in case of CONFIG_CPUSETS. However, their contents are also
    useful for the !CONFIG_CPUSETS case.

    So change the current behaviour and always show these fields.

    Signed-off-by: Heiko Carstens
    Cc: Andrew Morton
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Heiko Carstens
     

17 Jun, 2009

3 commits

  • Fix allocation of page cache/slab objects on an unallowed node when
    memory spread is set, by updating tasks' mems_allowed as soon as their
    cpuset's mems is changed.

    In order to update tasks' mems_allowed in time, we must modify the
    memory-policy code, because the memory policy was originally applied in
    the process's own context. After applying this patch, one task directly
    manipulates another's mems_allowed, and we use alloc_lock in the
    task_struct to protect the task's mems_allowed and memory policy.

    In the fast path, however, we don't take the lock to protect them,
    because adding a lock may lead to a performance regression. But without
    the lock, the task might see an empty nodemask when its cpuset's
    mems_allowed is changed to some non-overlapping set. To avoid that, we
    set all the newly allowed nodes first, then clear the newly disallowed
    ones, as sketched below.
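
    A minimal sketch of that ordering (simplified; mpol_rebind_task() is
    the mempolicy hook, the rest is illustrative):

    static void cpuset_change_task_nodemask(struct task_struct *tsk,
                                            const nodemask_t *newmems)
    {
            task_lock(tsk);         /* tsk->alloc_lock guards these fields */
            /* expand: old | new, so a racing reader never sees an empty mask */
            nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
            mpol_rebind_task(tsk, newmems);
            /* shrink: now it is safe to drop the old-only nodes */
            tsk->mems_allowed = *newmems;
            task_unlock(tsk);
    }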

    [lee.schermerhorn@hp.com:
    The rework of mpol_new() to extract the adjusting of the node mask to
    apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind()
    with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local
    allocation. Fix this by adding the check for MPOL_PREFERRED and empty
    node mask to mpol_new_mempolicy().

    Remove the now unneeded 'nodes = NULL' from mpol_new().

    Note that mpol_new_mempolicy() is always called with a non-NULL
    'nodes' parameter now that it has been removed from mpol_new().
    Therefore, we don't need to test nodes for NULL before testing it for
    'empty'. However, just to be extra paranoid, add a VM_BUG_ON() to
    verify this assumption.]
    [lee.schermerhorn@hp.com:

    I don't think the function name 'mpol_new_mempolicy' is descriptive
    enough to differentiate it from mpol_new().

    This function applies cpuset set context, usually constraining nodes
    to those allowed by the cpuset. However, when the 'RELATIVE_NODES' flag
    is set, it also translates the nodes. So I settled on
    'mpol_set_nodemask()', because the comment block for mpol_new() mentions
    that we need to call this function to "set nodes".

    Some additional minor line length, whitespace and typo cleanup.]
    Signed-off-by: Miao Xie
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Cc: Paul Menage
    Cc: Nick Piggin
    Cc: Yasunori Goto
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     
  • Fix the bug that the kernel didn't spread page cache/slab object evenly
    over all the allowed nodes when spread flags were set by updating tasks'
    page/slab spread flags in time.

    Signed-off-by: Miao Xie
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Cc: Paul Menage
    Cc: Nick Piggin
    Cc: Yasunori Goto
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     
  • The kernel still allocates page cache on the old nodes after a task's
    cpuset's mems is modified when 'memory_spread_page' is set, and it
    doesn't spread the page cache evenly over all the nodes that the
    faulting task is allowed to use after memory_spread_page is set. This
    is caused by the task's stale mems_allowed and flags: the current
    kernel doesn't update them unless some function invokes
    cpuset_update_task_memory_state(), which is sometimes too late. We must
    update the tasks' mems_allowed and flags in time.

    Slab has the same problem.

    The following patches fix this bug by updating tasks' mems_allowed and
    spread flags after their cpuset's mems or spread flags are changed.

    This patch:

    Extract a function from cpuset_update_task_memory_state(). It will be
    used later to update tasks' page/slab spread flags after their cpuset's
    flag is changed.

    Signed-off-by: Miao Xie
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Cc: Paul Menage
    Cc: Nick Piggin
    Cc: Yasunori Goto
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     

03 Apr, 2009

8 commits

  • Kthreads that have the PF_THREAD_BOUND bit set in their flags are bound to a
    specific cpu. Thus, their set of allowed cpus shall not change.

    This patch prevents such threads from attaching to non-root cpusets. They do
    not have mempolicies that restrict them to a subset of system nodes and, since
    their cpumask may never change, they cannot use any of the features of
    cpusets.

    The tasks will forever be a member of the root cpuset and will be returned
    when listing the tasks attached to that cpuset.
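
    The check itself is small; a sketch of its placement in the cpuset
    can_attach callback (other existing checks elided):

    static int cpuset_can_attach(struct cgroup_subsys *ss,
                                 struct cgroup *cont, struct task_struct *tsk)
    {
            /* A kthread bound to a specific cpu may never change its
             * cpumask, so only the root cpuset can hold it. */
            if (tsk->flags & PF_THREAD_BOUND)
                    return -EINVAL;

            return 0;
    }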

    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: Dhaval Giani
    Signed-off-by: David Rientjes
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Allow cpusets to be configured/built on non-SMP systems

    Currently it's impossible to build cpusets under UML on x86-64, since
    cpusets depends on SMP and x86-64 UML doesn't support SMP.

    There's code in cpusets that doesn't depend on SMP. This patch surrounds
    the minimum amount of cpusets code with #ifdef CONFIG_SMP in order to
    allow cpusets to build/run on UP systems (for testing purposes under UML).

    Reviewed-by: Li Zefan
    Signed-off-by: Paul Menage
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • The cpuset_zone_allowed() variants are actually only a function of the
    zone's node.
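
    So the zone variants can become thin wrappers around node-based checks;
    a sketch (illustrative):

    static inline int cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask)
    {
            return cpuset_node_allowed_softwall(zone_to_nid(z), gfp_mask);
    }

    static inline int cpuset_zone_allowed_hardwall(struct zone *z, gfp_t gfp_mask)
    {
            return cpuset_node_allowed_hardwall(zone_to_nid(z), gfp_mask);
    }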

    Cc: Paul Menage
    Acked-by: Christoph Lameter
    Cc: Randy Dunlap
    Signed-off-by: David Rientjes
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Use cgroup_scanner.data, instead of introducing cpuset_hotplug_scanner.

    Signed-off-by: Li Zefan
    Cc: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • When writing to cpuset.mems, cpuset has to update its mems_allowed before
    calling update_tasks_nodemask(), but this function might return -ENOMEM.

    To avoid this rare case, we allocate the memory before changing
    mems_allowed, and then pass it to update_tasks_nodemask(), similar to
    what update_cpumask() does.
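
    A sketch of the ordering (illustrative; the real code allocates a
    ptr_heap for cgroup_scan_tasks()):

    /* 1. allocate up front, where failure is still harmless */
    retval = heap_init(&heap, PAGE_SIZE, GFP_KERNEL, NULL);
    if (retval)
            return retval;

    /* 2. commit the new nodemask */
    mutex_lock(&callback_mutex);
    cs->mems_allowed = trialcs->mems_allowed;
    mutex_unlock(&callback_mutex);

    /* 3. update the tasks; this step can no longer fail with -ENOMEM */
    update_tasks_nodemask(cs, &oldmem, &heap);
    heap_free(&heap);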

    Signed-off-by: Li Zefan
    Cc: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • This patch uses cgroup_scan_tasks() to rebind tasks' vmas to new cpuset's
    mems_allowed.

    This not only simplifies the code considerably, but also avoids
    allocating an array to hold the mm pointers of all the tasks in the
    cpuset. That array can be big (size > PAGE_SIZE) if the cpuset has lots
    of tasks, so the allocation has a chance to fail under memory stress.

    Signed-off-by: Li Zefan
    Cc: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Changes to cpuset->cpus_allowed and cpuset->mems_allowed should be
    protected by callback_mutex, otherwise a reader may read wrong
    cpus/mems. This is cpuset's locking rule.

    Signed-off-by: Li Zefan
    Cc: Paul Menage
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • We have some read-only files and write-only files, but currently they
    are all set to 0644, which is counter-intuitive and causes trouble for
    some cgroup tools like libcgroup.

    This patch adds 'mode' to struct cftype to allow a cgroup subsystem to
    set its own files' modes; in most cases cft->mode can be left at its
    default of 0 and cgroup will figure out a proper mode.
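
    A sketch of how a subsystem might use it (illustrative entry):

    static struct cftype cft_example = {
            .name = "memory_pressure_enabled",
            .read_u64 = cpuset_read_u64,
            .mode = S_IRUGO,        /* explicitly read-only */
    };

    /* With .mode left at 0, cgroup derives a sensible mode from the
     * handlers that are present (readable and/or writable). */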

    Acked-by: Paul Menage
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     

19 Jan, 2009

1 commit

  • Lockdep reported a possible circular locking dependency when we tested
    cpuset on a NUMA/fake-NUMA box.

    =======================================================
    [ INFO: possible circular locking dependency detected ]
    2.6.29-rc1-00224-ga652504 #111
    -------------------------------------------------------
    bash/2968 is trying to acquire lock:
    (events){--..}, at: [] flush_work+0x24/0xd8

    but task is already holding lock:
    (cgroup_mutex){--..}, at: [] cgroup_lock_live_group+0x12/0x29

    which lock already depends on the new lock.
    ......
    -------------------------------------------------------

    Steps to reproduce:
    # mkdir /dev/cpuset
    # mount -t cpuset xxx /dev/cpuset
    # mkdir /dev/cpuset/0
    # echo 0 > /dev/cpuset/0/cpus
    # echo 0 > /dev/cpuset/0/mems
    # echo 1 > /dev/cpuset/0/memory_migrate
    # cat /dev/zero > /dev/null &
    # echo $! > /dev/cpuset/0/tasks

    This is because async_rebuild_sched_domains has the following lock sequence:
    run_workqueue(async_rebuild_sched_domains)
    -> do_rebuild_sched_domains -> cgroup_lock

    But attaching tasks when memory_migrate is set has the following:
    cgroup_lock_live_group(cgroup_tasks_write)
    -> do_migrate_pages -> flush_work

    This patch fixes it by using a separate workqueue thread.

    Signed-off-by: Miao Xie
    Signed-off-by: Lai Jiangshan
    Signed-off-by: Ingo Molnar

    Miao Xie
     

16 Jan, 2009

1 commit

  • Move Documentation/cpusets.txt and Documentation/controllers/* to
    Documentation/cgroups/

    Signed-off-by: Li Zefan
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Acked-by: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     

09 Jan, 2009

8 commits

  • Impact: cleanups, use new cpumask API

    Final trivial cleanups: mainly s/cpumask_t/struct cpumask

    Note there is a FIXME in generate_sched_domains(). A future patch will
    change struct cpumask *doms to struct cpumask *doms[].
    (I suppose Rusty will do this.)

    Signed-off-by: Li Zefan
    Cc: Ingo Molnar
    Cc: Rusty Russell
    Acked-by: Mike Travis
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Impact: use new cpumask API

    This patch mainly does the following things:
    - change cs->cpus_allowed from cpumask_t to cpumask_var_t
    - call alloc_bootmem_cpumask_var() for top_cpuset in cpuset_init_early()
    - call alloc_cpumask_var() for other cpusets
    - replace cpus_xxx() with cpumask_xxx() (see the sketch below)
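
    A sketch of the conversion pattern (fragment, illustrative):

    /* was: cpumask_t cpus_allowed; now: */
    cpumask_var_t cpus_allowed;

    /* allocation replaces the old by-value embedding */
    if (!alloc_cpumask_var(&cs->cpus_allowed, GFP_KERNEL))
            return ERR_PTR(-ENOMEM);
    cpumask_copy(cs->cpus_allowed, parent->cpus_allowed);

    /* operators change from value form to pointer form */
    cpumask_and(cs->cpus_allowed, cs->cpus_allowed, cpu_online_mask);

    free_cpumask_var(cs->cpus_allowed);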

    Signed-off-by: Li Zefan
    Cc: Ingo Molnar
    Cc: Rusty Russell
    Acked-by: Mike Travis
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Impact: cleanups, reduce stack usage

    This patch prepares for the next patch. When we convert
    cpuset.cpus_allowed to cpumask_var_t, (trialcs = *cs) no longer works.

    Another result of this patch is reduced stack usage for trialcs:
    sizeof(*cs) can be as large as 148 bytes on x86_64, so it's really not
    good to have it on the stack.
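
    A sketch of the heap-allocated replacement (illustrative):

    static struct cpuset *alloc_trial_cpuset(const struct cpuset *cs)
    {
            /* duplicate cs on the heap instead of "trialcs = *cs" */
            return kmemdup(cs, sizeof(*cs), GFP_KERNEL);
    }

    static void free_trial_cpuset(struct cpuset *trial)
    {
            kfree(trial);
    }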

    Signed-off-by: Li Zefan
    Cc: Ingo Molnar
    Cc: Rusty Russell
    Acked-by: Mike Travis
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Impact: reduce stack usage

    Allocate a global cpumask_var_t at boot, and use it in cpuset_attach(), so
    we won't fail cpuset_attach().

    Signed-off-by: Li Zefan
    Cc: Ingo Molnar
    Cc: Rusty Russell
    Acked-by: Mike Travis
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Impact: reduce stack usage

    Just use cs->cpus_allowed; there is no need to allocate a cpumask_var_t.

    Signed-off-by: Li Zefan
    Cc: Ingo Molnar
    Cc: Rusty Russell
    Acked-by: Mike Travis
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • This patchset converts cpuset to the new cpumask API, and thus removes
    on-stack cpumask_t to reduce stack usage.

    Before:
    # cat kernel/cpuset.c include/linux/cpuset.h | grep -c cpumask_t
    21
    After:
    # cat kernel/cpuset.c include/linux/cpuset.h | grep -c cpumask_t
    0

    This patch:

    Impact: reduce stack usage

    It's safe to call cpulist_scnprintf inside callback_mutex, and thus we
    can just remove the on-stack cpumask_t; there is no need to allocate a
    cpumask_var_t.

    Signed-off-by: Li Zefan
    Cc: Ingo Molnar
    Cc: Rusty Russell
    Acked-by: Mike Travis
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • I found a bug on my dual-CPU box. I created a sub cpuset in the top
    cpuset and assigned 1 to its cpus, then attached some tasks to this sub
    cpuset. After that, we offlined CPU1; the tasks in the sub cpuset were
    moved into the top cpuset automatically because there was no cpu left
    in the sub cpuset. When we onlined CPU1 again, all the tasks that
    didn't originally belong to the top cpuset ran only on CPU0.

    We fix this bug by setting a task's cpus_allowed to cpu_possible_map
    when attaching it to the top cpuset. This method doesn't change the
    current behavior of cpusets on CPU hotplug, and all tasks in the top
    cpuset use cpu_possible_map to initialize their cpus_allowed.
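
    A sketch of the fix inside cpuset_attach() (illustrative):

    if (cs == &top_cpuset)
            cpumask_copy(cpus_attach, cpu_possible_mask);
    else
            guarantee_online_cpus(cs, cpus_attach);
    set_cpus_allowed_ptr(tsk, cpus_attach);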

    Signed-off-by: Miao Xie
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     
  • task_cs() calls task_subsys_state(), so we must use rcu_read_lock() to
    protect cgroup_subsys_state().

    It's true that top_cpuset is never freed, but cgroup_subsys_state()
    accesses a css_set, and this css_set may be freed while task_cs() is
    called.

    We use rcu_read_lock() to protect it.
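
    The pattern is simply (illustrative):

    rcu_read_lock();
    cs = task_cs(tsk);
    /* ... use cs only while still inside the RCU read section ... */
    rcu_read_unlock();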

    Signed-off-by: Lai Jiangshan
    Acked-by: Paul Menage
    Cc: KAMEZAWA Hiroyuki
    Cc: Pavel Emelyanov
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     

07 Jan, 2009

1 commit

  • When cpusets are enabled, it's necessary to print the triggering task's
    set of allowable nodes so the subsequently printed meminfo can be
    interpreted correctly.

    We also print the task's cpuset name for informational purposes.

    [rientjes@google.com: task lock current before dereferencing cpuset]
    Cc: Paul Menage
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

13 Dec, 2008

1 commit

  • …t_scnprintf to take pointers.

    Impact: change calling convention of existing cpumask APIs

    Most cpumask functions started with cpus_: these have been replaced by
    cpumask_ ones which take struct cpumask pointers as expected.

    These four functions don't have good replacement names; fortunately
    they're rarely used, so we just change them over.

    Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
    Signed-off-by: Mike Travis <travis@sgi.com>
    Acked-by: Ingo Molnar <mingo@elte.hu>
    Cc: paulus@samba.org
    Cc: mingo@redhat.com
    Cc: tony.luck@intel.com
    Cc: ralf@linux-mips.org
    Cc: Greg Kroah-Hartman <gregkh@suse.de>
    Cc: cl@linux-foundation.org
    Cc: srostedt@redhat.com

    Rusty Russell
     

30 Nov, 2008

1 commit

  • this warning:

    kernel/cpuset.c: In function ‘generate_sched_domains’:
    kernel/cpuset.c:588: warning: ‘ndoms’ may be used uninitialized in this function

    triggers because GCC does not recognize that ndoms stays uninitialized
    only if doms is NULL - but that flow is covered at the end of
    generate_sched_domains().

    Help out GCC by initializing this variable to 0. (that's prudent anyway)

    Also, this function needs a split-up and code-flow simplification:
    at 160 lines it's clearly too long.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

20 Nov, 2008

1 commit

  • After adding a node into the machine, top cpuset's mems isn't updated.

    By reviewing the code, we found that the update function

    cpuset_track_online_nodes()

    was invoked after node_states[N_ONLINE] changed. This is wrong, because
    N_ONLINE just means the node has a pgdat; when a node has (or gains)
    memory, we use N_HIGH_MEMORY. So we should invoke the update function
    after node_states[N_HIGH_MEMORY] changes, just as its commit message
    says.

    This patch fixes it, and uses a memory-hotplug notifier instead of
    calling cpuset_track_online_nodes() directly, as sketched below.
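
    A sketch of the notifier hookup (handler body simplified,
    illustrative):

    static int cpuset_track_online_nodes(struct notifier_block *self,
                                         unsigned long action, void *arg)
    {
            cgroup_lock();
            switch (action) {
            case MEM_ONLINE:
            case MEM_OFFLINE:
                    /* follow the nodes that actually have memory */
                    top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
                    break;
            }
            cgroup_unlock();
            return NOTIFY_OK;
    }

    /* and, at init time:
     *      hotplug_memory_notifier(cpuset_track_online_nodes, 10);
     */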

    Signed-off-by: Miao Xie
    Acked-by: Yasunori Goto
    Cc: David Rientjes
    Cc: Paul Menage
    Signed-off-by: Linus Torvalds

    Miao Xie
     

18 Nov, 2008

1 commit

  • Impact: properly rebuild sched-domains on kmalloc() failure

    When cpuset fails to generate sched domains due to a kmalloc()
    failure, the scheduler should fall back to the single partition
    'fallback_doms' and rebuild the sched domains, but currently it only
    destroys the sched domains without rebuilding them.

    The regression was introduced by:

    | commit dfb512ec4834116124da61d6c1ee10fd0aa32bd6
    | Author: Max Krasnyansky
    | Date: Fri Aug 29 13:11:41 2008 -0700
    |
    | sched: arch_reinit_sched_domains() must destroy domains to force rebuild

    After the above commit, partition_sched_domains(0, NULL, NULL) will
    only destroy sched domains and partition_sched_domains(1, NULL, NULL)
    will create the default sched domain.

    Signed-off-by: Li Zefan
    Cc: Max Krasnyansky
    Cc:
    Signed-off-by: Ingo Molnar

    Li Zefan