Eric Lee / linux-smarc-t335x-v3.2

27 Jul, 2011

2 commits

60063497a atomic: use <linux/atomic.h>

This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>

Signed-off-by: Arun Sharma
Reviewed-by: Eric Dumazet
Cc: Ingo Molnar
Cc: David Miller
Cc: Eric Dumazet
Acked-by: Mike Frysinger
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Arun Sharma
2011-07-27 07:49:47 +0800
778d3b0ff cpusets: randomize node rotor used in cpuset_mem_spread_node()

[ This patch has already been accepted as commit 0ac0c0d0f837 but later
reverted (commit 35926ff5fba8) because it introduced arch specific
__node_random which was defined only for x86 code so it broke other
archs. This is a followup without any arch specific code. Other than
that there are no functional changes.]

Some workloads that create a large number of small files tend to assign
too many pages to node 0 (multi-node systems). Part of the reason is
that the rotor (in cpuset_mem_spread_node()) used to assign nodes starts
at node 0 for newly created tasks.

This patch changes the rotor to be initialized to a random node number
of the cpuset.
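A minimal user-space sketch of the idea, not the kernel implementation: choose the initial rotor uniformly among the nodes set in an allowed-node mask. The helper names (`mask_weight`, `nth_node`, `pick_random_node`), the 64-bit mask, and the use of `rand()` are assumptions made for illustration only.

```c
#include <stdint.h>
#include <stdlib.h>

/* Count the nodes set in a 64-bit allowed-node mask. */
static int mask_weight(uint64_t mask)
{
    int n = 0;

    for (; mask; mask &= mask - 1)
        n++;
    return n;
}

/* Return the index of the k-th set bit (k counts from 0), or -1 if none. */
static int nth_node(uint64_t mask, int k)
{
    for (int node = 0; node < 64; node++) {
        if (mask & (UINT64_C(1) << node)) {
            if (k-- == 0)
                return node;
        }
    }
    return -1;
}

/* Hypothetical helper: pick a random allowed node as the starting rotor. */
static int pick_random_node(uint64_t allowed)
{
    int w = mask_weight(allowed);

    if (w == 0)
        return -1;              /* no node is allowed */
    return nth_node(allowed, rand() % w);
}
```

With a random starting point, newly created tasks no longer all begin their round-robin at node 0.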

[akpm@linux-foundation.org: fix layout]
[Lee.Schermerhorn@hp.com: Define stub numa_random() for !NUMA configuration]
[mhocko@suse.cz: Make it arch independent]
[akpm@linux-foundation.org: fix CONFIG_NUMA=y, MAX_NUMNODES>1 build]
Signed-off-by: Jack Steiner
Signed-off-by: Lee Schermerhorn
Signed-off-by: Michal Hocko
Reviewed-by: KOSAKI Motohiro
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: Paul Menage
Cc: Jack Steiner
Cc: Robin Holt
Cc: David Rientjes
Cc: Christoph Lameter
Cc: David Rientjes
Cc: Jack Steiner
Cc: KOSAKI Motohiro
Cc: Lee Schermerhorn
Cc: Michal Hocko
Cc: Paul Menage
Cc: Pekka Enberg
Cc: Robin Holt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2011-07-27 07:49:43 +0800

28 May, 2011

1 commit

1e1b6c511 cpuset: Fix cpuset_cpus_allowed_fallback(), don't update tsk->rt.nr_cpus_allowed

The rule is, we have to update tsk->rt.nr_cpus_allowed if we change
tsk->cpus_allowed. Otherwise the RT scheduler may get confused.
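The rule can be illustrated with a small user-space sketch in which the mask and its cached weight are only ever changed together. The struct and function names below are hypothetical stand-ins, not the kernel's definitions.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two task fields named in the commit. */
struct task_affinity {
    uint64_t cpus_allowed;      /* bit i set means CPU i is allowed */
    int      nr_cpus_allowed;   /* cached number of set bits */
};

/* Count the CPUs set in the mask. */
static int count_cpus(uint64_t mask)
{
    int n = 0;

    for (; mask; mask &= mask - 1)
        n++;
    return n;
}

/* Update the mask and the cached count in one place so they never disagree. */
static void set_cpus_allowed_sketch(struct task_affinity *t, uint64_t new_mask)
{
    t->cpus_allowed = new_mask;
    t->nr_cpus_allowed = count_cpus(new_mask);
}
```

Any code path that writes the mask directly, bypassing such a helper, leaves the cached count stale.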

Signed-off-by: KOSAKI Motohiro
Cc: Oleg Nesterov
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/4DD4B3FA.5060901@jp.fujitsu.com
Signed-off-by: Ingo Molnar

KOSAKI Motohiro
2011-05-28 23:02:57 +0800

27 May, 2011

2 commits

a77aea920 cgroup: remove the ns_cgroup

The ns_cgroup is an annoying cgroup at the namespace / cgroup frontier and
leads to some problems:

* cgroup creation is out-of-control
* cgroup name can conflict when pids are looping
* it is not possible to have a single process handling a lot of
namespaces without falling into an exponential creation time
* we may want to create a namespace without creating a cgroup

The ns_cgroup was replaced by a compatibility flag 'clone_children',
where a newly created cgroup will copy the parent cgroup values.
The userspace has to manually create a cgroup and add a task to
the 'tasks' file.

This patch removes the ns_cgroup as suggested in the following thread:

https://lists.linux-foundation.org/pipermail/containers/2009-June/018616.html

The 'cgroup_clone' function is removed because it is no longer used.

This is a userspace-visible change. Commit 45531757b45c ("cgroup: notify
ns_cgroup deprecated") (merged into 2.6.27) caused the kernel to emit a
printk warning users that the feature is planned for removal. Since that
time we have heard from XXX users who were affected by this.

Signed-off-by: Daniel Lezcano
Signed-off-by: Serge E. Hallyn
Cc: Eric W. Biederman
Cc: Jamal Hadi Salim
Reviewed-by: Li Zefan
Acked-by: Paul Menage
Acked-by: Matt Helsley
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Daniel Lezcano
2011-05-27 08:12:34 +0800
f780bdb7c cgroups: add per-thread subsystem callbacks

Add cgroup subsystem callbacks for per-thread attachment in atomic contexts

Add can_attach_task(), pre_attach(), and attach_task() as new callbacks
for the cgroups subsystem interface. Unlike can_attach and attach, these
are for per-thread operations, to be called potentially many times when
attaching an entire threadgroup.

Also, the old "bool threadgroup" interface is removed, as replaced by
this. All subsystems are modified for the new interface - of note is
cpuset, which requires from/to nodemasks for attach to be globally scoped
(though per-cpuset would work too) to persist from its pre_attach to
attach_task and attach.

This is a pre-patch for cgroup-procs-writable.patch.

Signed-off-by: Ben Blum
Cc: "Eric W. Biederman"
Cc: Li Zefan
Cc: Matt Helsley
Reviewed-by: Paul Menage
Cc: Oleg Nesterov
Cc: David Rientjes
Cc: Miao Xie
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ben Blum
2011-05-27 08:12:34 +0800

11 Apr, 2011

1 commit

60495e776 sched: Dynamic sched_domain::level

Remove the SD_LV_ enum and use dynamic level assignments.

Signed-off-by: Peter Zijlstra
Cc: Mike Galbraith
Cc: Nick Piggin
Cc: Linus Torvalds
Cc: Andrew Morton
Link: http://lkml.kernel.org/r/20110407122942.969433965@chello.nl
Signed-off-by: Ingo Molnar

Peter Zijlstra
2011-04-11 20:09:32 +0800

24 Mar, 2011

4 commits

523fb486b cpuset: hold callback_mutex in cpuset_post_clone()

Changing cpuset->mems/cpuset->cpus should be protected by
callback_mutex.

cpuset_clone() doesn't follow this rule. It's ok because it's
called when creating and initializing a cgroup, but we'd better
hold the lock to avoid subtle breakage in the future.

Signed-off-by: Li Zefan
Acked-by: Paul Menage
Acked-by: David Rientjes
Cc: Miao Xie
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Li Zefan
2011-03-24 10:46:35 +0800
ee24d3797 cpuset: fix unchecked calls to NODEMASK_ALLOC()

Those functions that use NODEMASK_ALLOC() can't propagate errno
to users, but will fail silently.

Fix it by using a static nodemask_t variable for each function; those
variables are protected by cgroup_mutex.

[akpm@linux-foundation.org: fix comment spelling, strengthen cgroup_lock comment]
Signed-off-by: Li Zefan
Cc: Paul Menage
Acked-by: David Rientjes
Cc: Miao Xie
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Li Zefan
2011-03-24 10:46:35 +0800
c8163ca8a cpuset: remove unneeded NODEMASK_ALLOC() in cpuset_attach()

oldcs->mems_allowed is not modified during cpuset_attach(), so we don't
have to copy it to a buffer allocated by NODEMASK_ALLOC(). Just pass it
to cpuset_migrate_mm().

Signed-off-by: Li Zefan
Cc: Paul Menage
Acked-by: David Rientjes
Cc: Miao Xie
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Li Zefan
2011-03-24 10:46:34 +0800
9303e0c48 cpuset: remove unneeded NODEMASK_ALLOC() in cpuset_sprintf_memlist()

It's not necessary to copy cpuset->mems_allowed to a buffer allocated by
NODEMASK_ALLOC(). Just pass it to nodelist_scnprintf().

As spotted by Paul, a side effect is we fix a bug that the function can
return -ENOMEM but the caller doesn't expect negative return value.
Therefore change the return value of cpuset_sprintf_cpulist() and
cpuset_sprintf_memlist() from int to size_t.

Signed-off-by: Li Zefan
Acked-by: Paul Menage
Acked-by: David Rientjes
Cc: Miao Xie
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Li Zefan
2011-03-24 10:46:34 +0800

05 Mar, 2011

1 commit

b75f38d65 cpuset: add a missing unlock in cpuset_write_resmask()

Don't forget to release cgroup_mutex if alloc_trial_cpuset() fails.
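The single-exit pattern behind this kind of fix can be sketched in user-space C. This is an illustration under stated assumptions, not the kernel code: `demo_lock()`/`demo_unlock()` are a toy lock that only tracks nesting depth, and `write_resmask_sketch()` with its `fail_alloc` switch is a hypothetical stand-in for a function that allocates while holding a lock.

```c
#include <errno.h>
#include <stdlib.h>

static int lock_depth;                      /* toy lock: 1 while held */

static void demo_lock(void)   { lock_depth++; }
static void demo_unlock(void) { lock_depth--; }

/*
 * Hypothetical operation that allocates a scratch buffer under the lock.
 * 'fail_alloc' simulates the allocation failing.
 */
static int write_resmask_sketch(int fail_alloc)
{
    int retval = 0;
    char *trial;

    demo_lock();

    trial = fail_alloc ? NULL : malloc(64);
    if (!trial) {
        retval = -ENOMEM;
        goto out;                           /* never return with the lock held */
    }

    /* ... work on 'trial' would go here ... */
    free(trial);
out:
    demo_unlock();                          /* the only exit path */
    return retval;
}
```

Routing every failure through one label keeps the unlock in a single place, which is what avoiding multiple return points buys.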

[akpm@linux-foundation.org: avoid multiple return points]
Signed-off-by: Li Zefan
Cc: Paul Menage
Acked-by: David Rientjes
Cc: Miao Xie
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Li Zefan
2011-03-05 09:53:38 +0800

29 Oct, 2010

1 commit

f7e835710 convert cgroup and cpuset

Signed-off-by: Al Viro

Al Viro
2010-10-29 16:17:06 +0800

21 Oct, 2010

1 commit

b0ae19811 security: remove unused parameter from security_task_setscheduler()

Security modules shouldn't change the sched_param parameter of
security_task_setscheduler(). Doing so is not only meaningless, but also
produces a harmful result if the caller passes a static variable.

This patch removes the policy and sched_param parameters from
security_task_setscheduler() because no security module is
using them.

Cc: James Morris
Signed-off-by: KOSAKI Motohiro
Signed-off-by: James Morris

KOSAKI Motohiro
2010-10-21 07:12:44 +0800

07 Aug, 2010

1 commit

c4efd6b56 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (27 commits)
sched: Use correct macro to display sched_child_runs_first in /proc/sched_debug
sched: No need for bootmem special cases
sched: Revert nohz_ratelimit() for now
sched: Reduce update_group_power() calls
sched: Update rq->clock for nohz balanced cpus
sched: Fix spelling of sibling
sched, cpuset: Drop __cpuexit from cpu hotplug callbacks
sched: Fix the racy usage of thread_group_cputimer() in fastpath_timer_check()
sched: run_posix_cpu_timers: Don't check ->exit_state, use lock_task_sighand()
sched: thread_group_cputime: Simplify, document the "alive" check
sched: Remove the obsolete exit_state/signal hacks
sched: task_tick_rt: Remove the obsolete ->signal != NULL check
sched: __sched_setscheduler: Read the RLIMIT_RTPRIO value lockless
sched: Fix comments to make them DocBook happy
sched: Fix fix_small_capacity
powerpc: Exclude arch_sd_sibiling_asym_packing() on UP
powerpc: Enable asymmetric SMT scheduling on POWER7
sched: Add asymmetric group packing option for sibling domain
sched: Fix capacity calculations for SMT4
sched: Change nohz idle load balancing logic to push model
...

Linus Torvalds
2010-08-07 00:39:22 +0800

22 Jun, 2010

1 commit

0b2e918aa sched, cpuset: Drop __cpuexit from cpu hotplug callbacks

Commit 3a101d05 (sched: adjust when cpu_active and cpuset
configurations are updated during cpu on/offlining) added
hotplug notifiers marked with __cpuexit; however, ia64 drops
text in __cpuexit during link unlike x86.

This means that functions which are referenced during init but used
only for cpu hot unplugging afterwards shouldn't be marked with
__cpuexit. Drop __cpuexit from those functions.

Reported-by: Tony Luck
Signed-off-by: Tejun Heo
Acked-by: Tony Luck
Cc: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar

Tejun Heo
2010-06-22 14:07:39 +0800

17 Jun, 2010

2 commits

f1bbbb691 Merge branch 'master' into for-next

Jiri Kosina
2010-06-17 00:08:13 +0800
732bee7af fix typos concerning "hierarchy"

Signed-off-by: Uwe Kleine-König
Signed-off-by: Jiri Kosina

Uwe Kleine-König
2010-06-17 00:03:14 +0800

09 Jun, 2010

1 commit

3a101d054 sched: adjust when cpu_active and cpuset configurations are updated during cpu on/offlining

Currently, when a cpu goes down, cpu_active is cleared before
CPU_DOWN_PREPARE starts and cpuset configuration is updated from a
default priority cpu notifier. When a cpu is coming up, it's set
before CPU_ONLINE but cpuset configuration again is updated from the
same cpu notifier.

For cpu notifiers, this presents an inconsistent state. Threads which
a CPU_DOWN_PREPARE notifier expects to be bound to the CPU can be
migrated to other cpus because the cpu is no longer active.

Fix it by updating cpu_active in the highest priority cpu notifier and
cpuset configuration in the second highest when a cpu is coming up.
Down path is updated similarly. This guarantees that all other cpu
notifiers see consistent cpu_active and cpuset configuration.

cpuset_track_online_cpus() notifier is converted to
cpuset_update_active_cpus() which just updates the configuration and
is now called from the cpuset_cpu_[in]active() notifiers registered from
sched_init_smp(). If cpuset is disabled, cpuset_update_active_cpus()
degenerates into partition_sched_domains() making separate notifier
for !CONFIG_CPUSETS unnecessary.

This problem is triggered by cmwq. During CPU_DOWN_PREPARE, hotplug
callback creates a kthread and kthread_bind()s it to the target cpu,
and the thread is expected to run on that cpu.

* Ingo's test discovered __cpuinit/exit markups were incorrect.
Fixed.

Signed-off-by: Tejun Heo
Acked-by: Peter Zijlstra
Cc: Rusty Russell
Cc: Ingo Molnar
Cc: Paul Menage

Tejun Heo
2010-06-09 03:40:36 +0800

28 May, 2010

1 commit

6adef3ebe cpusets: new round-robin rotor for SLAB allocations

We have observed several workloads running on multi-node systems where
memory is assigned unevenly across the nodes in the system. There are
numerous reasons for this but one is the round-robin rotor in
cpuset_mem_spread_node().

For example, a simple test that writes a multi-page file will allocate
pages on nodes 0 2 4 6 ... Odd nodes are skipped. (Sometimes it
allocates on odd nodes & skips even nodes).

An example is shown below. The program "lfile" writes a file consisting
of 10 pages. The program then mmaps the file & uses get_mempolicy(...,
MPOL_F_NODE) to determine the nodes where the file pages were allocated.
The output is shown below:

# ./lfile
allocated on nodes: 2 4 6 0 1 2 6 0 2

There is a single rotor that is used for allocating both file pages & slab
pages. Writing the file allocates both a data page & a slab page
(buffer_head). This advances the RR rotor 2 nodes for each page
allocated.

A quick test seems to confirm this is the cause of the uneven
allocation:

# echo 0 >/dev/cpuset/memory_spread_slab
# ./lfile
allocated on nodes: 6 7 8 9 0 1 2 3 4 5

This patch introduces a second rotor that is used for slab allocations.
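The effect of sharing one rotor can be reproduced with a small simulation. This is a sketch under simplifying assumptions (8 nodes, exactly one slab object per file page, hypothetical names), not the kernel's allocator.

```c
#define NR_NODES 8

struct rotors {
    int mem_rotor;      /* used for file (page cache) pages */
    int slab_rotor;     /* used for slab objects */
};

/* Return the current node and advance the rotor round-robin. */
static int advance(int *rotor)
{
    int node = *rotor;

    *rotor = (*rotor + 1) % NR_NODES;
    return node;
}

/*
 * Write 'npages' file pages; each page also needs one slab object.
 * With 'separate' == 0 both allocations share mem_rotor (old behaviour);
 * with 'separate' != 0 slab objects use their own rotor.
 */
static void write_file_sketch(struct rotors *r, int separate,
                              int npages, int *page_nodes)
{
    for (int i = 0; i < npages; i++) {
        page_nodes[i] = advance(&r->mem_rotor);
        (void)advance(separate ? &r->slab_rotor : &r->mem_rotor);
    }
}
```

In this model the shared rotor places file pages on every second node, while separate rotors place them on consecutive nodes.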

Signed-off-by: Jack Steiner
Acked-by: Christoph Lameter
Cc: Pekka Enberg
Cc: Paul Menage
Cc: Jack Steiner
Cc: Robin Holt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jack Steiner
2010-05-28 00:12:44 +0800

25 May, 2010

2 commits

c0ff7453b cpuset,mm: fix no node to alloc memory when changing cpuset's mems

Before applying this patch, cpuset updates task->mems_allowed and
mempolicy by setting all new bits in the nodemask first, and clearing all
old unallowed bits later. But done this way, the allocator may find that
there is no node to alloc memory.

The reason is that when cpuset rebinds the task's mempolicy, it clears the
nodes which the allocator can alloc pages on, for example:

(mpol: mempolicy)
task1                   task1's mpol    task2
alloc page              1
  alloc on node0? NO    1
                        1               change mems from 1 to 0
                        1               rebind task1's mpol
                        0-1               set new bits
                        0                 clear disallowed bits
  alloc on node1? NO    0
  ...
can't alloc page
  goto oom

This patch fixes this problem by expanding the nodes range first (set newly
allowed bits) and shrinking it lazily (clear newly disallowed bits). So we
use a variable to tell the write-side task that read-side task is reading
nodemask, and the write-side task clears newly disallowed nodes after
read-side task ends the current memory allocation.
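The interleaving in the diagram can be replayed deterministically in user-space C. This is a single-threaded simulation with hypothetical names, intended only to show why deferring the shrink helps; it does not model the kernel's locking or the variable used to signal the writer.

```c
#include <stdint.h>

enum update_mode { UPDATE_EAGER, UPDATE_LAZY_SHRINK };

/*
 * A reader tests node 0, is interrupted by a writer changing the allowed
 * set from {1} to {0}, then tests node 1.
 * Returns the node it allocated on, or -1 if it found none.
 */
static int alloc_during_update(enum update_mode mode)
{
    uint32_t mems = 1u << 1;                /* old set: node 1 only */
    const uint32_t new_mems = 1u << 0;      /* new set: node 0 only */
    int found = -1;

    for (int node = 0; node < 2; node++) {
        if (mems & (1u << node)) {
            found = node;
            break;
        }
        if (node == 0) {
            /* The writer runs here, between the reader's two checks. */
            mems |= new_mems;               /* set newly allowed bits */
            if (mode == UPDATE_EAGER)
                mems &= new_mems;           /* clear old bits immediately */
        }
    }

    if (mode == UPDATE_LAZY_SHRINK)
        mems &= new_mems;                   /* shrink once the reader is done */

    return found;
}
```

In eager mode the reader misses both nodes and would head for OOM; with the lazy shrink it still finds node 1, and the mask ends up as the new set afterwards.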

[akpm@linux-foundation.org: fix spello]
Signed-off-by: Miao Xie
Cc: David Rientjes
Cc: Nick Piggin
Cc: Paul Menage
Cc: Lee Schermerhorn
Cc: Hugh Dickins
Cc: Ravikiran Thirumalai
Cc: KOSAKI Motohiro
Cc: Christoph Lameter
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Miao Xie
2010-05-25 23:06:57 +0800
708c1bbc9 mempolicy: restructure rebinding-mempolicy functions

Nick Piggin reported that the allocator may see an empty nodemask when
changing cpuset's mems[1]. It happens only on kernels that do not do
atomic nodemask_t stores. (MAX_NUMNODES > BITS_PER_LONG)

But I found that there is also a problem on kernels that can do atomic
nodemask_t stores. The problem is that the allocator can't find a node to
alloc a page from when changing cpuset's mems, even though there is a lot
of free memory. The reason is like this:

(mpol: mempolicy)
task1                   task1's mpol    task2
alloc page              1
  alloc on node0? NO    1
                        1               change mems from 1 to 0
                        1               rebind task1's mpol
                        0-1               set new bits
                        0                 clear disallowed bits
  alloc on node1? NO    0
  ...
can't alloc page
  goto oom

I can use the attached program to reproduce it with the following steps:

# mkdir /dev/cpuset
# mount -t cpuset cpuset /dev/cpuset
# mkdir /dev/cpuset/1
# echo `cat /dev/cpuset/cpus` > /dev/cpuset/1/cpus
# echo `cat /dev/cpuset/mems` > /dev/cpuset/1/mems
# echo $$ > /dev/cpuset/1/tasks
# numactl --membind=`cat /dev/cpuset/mems` ./cpuset_mem_hog &
= max(nr_cpus - 1, 1)
# killall -s SIGUSR1 cpuset_mem_hog
# ./change_mems.sh

Several hours later, OOM will happen even though there is a lot of free memory.

This patchset fixes this problem by expanding the nodes range first (set
newly allowed bits) and shrinking it lazily (clear newly disallowed bits). So
we use a variable to tell the write-side task that read-side task is
reading nodemask, and the write-side task clears newly disallowed nodes
after read-side task ends the current memory allocation.

This patch:

In order to fix the "no node to alloc memory" problem, when we want to update
mempolicy and mems_allowed, we expand the set of nodes first (set all the
newly allowed nodes) and shrink the set of nodes lazily (clear disallowed
nodes). But the mempolicy's rebind functions may break the expanding.

So we restructure the mempolicy's rebind functions and split the rebind
work to two steps, just like the update of cpuset's mems: The 1st step:
expand the set of the mempolicy's nodes. The 2nd step: shrink the set of
the mempolicy's nodes. It is used when there is no real lock to protect
the mempolicy in the read-side. Otherwise we can do rebind work at once.

In order to implement it, we define

	enum mpol_rebind_step {
		MPOL_REBIND_ONCE,
		MPOL_REBIND_STEP1,
		MPOL_REBIND_STEP2,
		MPOL_REBIND_NSTEP,
	};

If the mempolicy needn't be updated by two steps, we can pass
MPOL_REBIND_ONCE to the rebind functions. Or we can pass
MPOL_REBIND_STEP1 to do the first step of the rebind work and pass
MPOL_REBIND_STEP2 to do the second step work.
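The step semantics described here can be sketched in user-space C as follows. This is an assumption-laden illustration: the enum and function names are hypothetical, a 64-bit integer stands in for nodemask_t, and the real rebind functions also handle policy modes and flags that are ignored here.

```c
#include <stdint.h>

/* Mirrors the three usable steps of the enum quoted above. */
enum rebind_step { REBIND_ONCE, REBIND_STEP1, REBIND_STEP2 };

/* Apply one rebind step to a policy's node set. */
static void rebind_nodes_sketch(uint64_t *nodes, uint64_t new_nodes,
                                enum rebind_step step)
{
    switch (step) {
    case REBIND_ONCE:
        *nodes = new_nodes;     /* a lock protects readers: switch at once */
        break;
    case REBIND_STEP1:
        *nodes |= new_nodes;    /* expand: old and new nodes both allowed */
        break;
    case REBIND_STEP2:
        *nodes &= new_nodes;    /* shrink: drop nodes no longer allowed */
        break;
    }
}
```

Between STEP1 and STEP2 the set is the union of old and new nodes, so a lockless reader always sees at least one valid node.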

Besides that, there may be a long time between these two steps, and we have
to release the lock that protects mempolicy and mems_allowed. When we take
the lock again, we must check whether the current mempolicy is under
rebinding (the first step has been done) or not, because the task may
alloc a new mempolicy when we don't hold the lock. So we define the
following flag to identify it:

#define MPOL_F_REBINDING (1 << 2)

The new functions will be used in the next patch.

Signed-off-by: Miao Xie
Cc: David Rientjes
Cc: Nick Piggin
Cc: Paul Menage
Cc: Lee Schermerhorn
Cc: Hugh Dickins
Cc: Ravikiran Thirumalai
Cc: KOSAKI Motohiro
Cc: Christoph Lameter
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Miao Xie
2010-05-25 23:06:57 +0800

03 Apr, 2010

2 commits

9084bb824 sched: Make select_fallback_rq() cpuset friendly

Introduce cpuset_cpus_allowed_fallback() helper to fix the cpuset problems
with select_fallback_rq(). It can be called from any context and can't use
any cpuset locks including task_lock(). It is called when the task doesn't
have online cpus in ->cpus_allowed but ttwu/etc must be able to find a
suitable cpu.

I am not proud of this patch. Everything which needs such a fat comment
can't be good even if correct. But I'd prefer to not change the locking
rules in the code I hardly understand, and in any case I believe this
simple change makes the code much more correct compared to deadlocks we
currently have.

Signed-off-by: Oleg Nesterov
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar

Oleg Nesterov
2010-04-03 02:12:03 +0800
897f0b3c3 sched: Kill the broken and deadlockable cpuset_lock/cpuset_cpus_allowed_locked code

This patch just states the fact the cpusets/cpuhotplug interaction is
broken and removes the deadlockable code which only pretends to work.

- cpuset_lock() doesn't really work. It is needed for
cpuset_cpus_allowed_locked() but we can't take this lock in
try_to_wake_up()->select_fallback_rq() path.

- cpuset_lock() is deadlockable. Suppose that a task T bound to CPU takes
callback_mutex. If cpu_down(CPU) happens before T drops callback_mutex,
stop_machine() preempts T, then migration_call(CPU_DEAD) tries to take
cpuset_lock() and hangs forever because CPU is already dead and thus
T can't be scheduled.

- cpuset_cpus_allowed_locked() is deadlockable too. It takes task_lock()
which is not irq-safe, but try_to_wake_up() can be called from irq.

Kill them, and change select_fallback_rq() to use cpu_possible_mask, like
we currently do without CONFIG_CPUSETS.

Also, with or without this patch, with or without CONFIG_CPUSETS, the
callers of select_fallback_rq() can race with each other or with
set_cpus_allowed() paths.

The subsequent patches try to fix these problems.

Signed-off-by: Oleg Nesterov
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar

Oleg Nesterov
2010-04-03 02:12:01 +0800

25 Mar, 2010

2 commits

53feb2976 cpuset: alloc nodemask_t on the heap rather than the stack

Signed-off-by: Miao Xie
Acked-by: David Rientjes
Cc: Nick Piggin
Cc: Paul Menage
Cc: Li Zefan
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Miao Xie
2010-03-25 07:31:21 +0800
5ab116c93 cpuset: fix the problem that cpuset_mem_spread_node() returns an offline node

cpuset_mem_spread_node() returns an offline node, and causes an oops.

This patch fixes it by initializing task->mems_allowed to
node_states[N_HIGH_MEMORY], and updating task->mems_allowed when doing
memory hotplug.

Signed-off-by: Miao Xie
Acked-by: David Rientjes
Reported-by: Nick Piggin
Tested-by: Nick Piggin
Cc: Paul Menage
Cc: Li Zefan
Cc: Ingo Molnar
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Miao Xie
2010-03-25 07:31:21 +0800

07 Dec, 2009

2 commits

6ad4c1888 sched: Fix balance vs hotplug race

Since (e761b77: cpu hotplug, sched: Introduce cpu_active_map and redo
sched domain managment) we have cpu_active_mask which is supposed to rule
scheduler migration and load-balancing, except it never (fully) did.

The particular problem being solved here is a crash in try_to_wake_up()
where select_task_rq() ends up selecting an offline cpu because
select_task_rq_fair() trusts the sched_domain tree to reflect the
current state of affairs, similarly select_task_rq_rt() trusts the
root_domain.

However, the sched_domains are updated from CPU_DEAD, which is after the
cpu is taken offline and after stop_machine is done. Therefore it can
race perfectly well with code assuming the domains are right.

Cure this by building the domains from cpu_active_mask on
CPU_DOWN_PREPARE.

Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar

Peter Zijlstra
2009-12-07 04:10:56 +0800
e1b8090bd cpumask: Fix generate_sched_domains() for UP

Commit acc3f5d7cabbfd6cec71f0c1f9900621fa2d6ae7 ("cpumask:
Partition_sched_domains takes array of cpumask_var_t") changed
the function signature of generate_sched_domains() for the
CONFIG_SMP=y case, but forgot to update the corresponding
function for the CONFIG_SMP=n case, causing:

kernel/cpuset.c:2073: warning: passing argument 1 of 'generate_sched_domains' from incompatible pointer type

Signed-off-by: Geert Uytterhoeven
Cc: Rusty Russell
Cc: Linus Torvalds
LKML-Reference:
Signed-off-by: Ingo Molnar

Geert Uytterhoeven
2009-12-07 04:08:41 +0800

04 Nov, 2009

1 commit

acc3f5d7c cpumask: Partition_sched_domains takes array of cpumask_var_t

Currently partition_sched_domains() takes a 'struct cpumask
*doms_new' which is a kmalloc'ed array of cpumask_t. You can't
have such an array if 'struct cpumask' is undefined, as we plan
for CONFIG_CPUMASK_OFFSTACK=y.

So, we make this an array of cpumask_var_t instead: this is the
same for the CONFIG_CPUMASK_OFFSTACK=n case, but requires
multiple allocations for the CONFIG_CPUMASK_OFFSTACK=y case.
Hence we add alloc_sched_domains() and free_sched_domains()
functions.

Signed-off-by: Rusty Russell
Cc: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar

Rusty Russell
2009-11-04 20:16:40 +0800

26 Oct, 2009

1 commit

0b9e31e92 Merge branch 'linus' into sched/core

Conflicts:
fs/proc/array.c

Merge reason: resolve conflict and queue up dependent patch.

Signed-off-by: Ingo Molnar

Ingo Molnar
2009-10-26 00:30:53 +0800

24 Sep, 2009

1 commit

be367d099 cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time

Alter the ss->can_attach and ss->attach functions to be able to deal with
a whole threadgroup at a time, for use in cgroup_attach_proc. (This is a
pre-patch to cgroup-procs-writable.patch.)

Currently, the new mode of the attach function can only tell the subsystem
about the old cgroup of the threadgroup leader. No subsystem currently
needs that information for each thread that's being moved, but if one were
to be added (for example, one that counts tasks within a group) this bit
would need to be reworked a bit to tell the subsystem the right
information.

[hidave.darkstar@gmail.com: fix build]
Signed-off-by: Ben Blum
Signed-off-by: Paul Menage
Acked-by: Li Zefan
Reviewed-by: Matt Helsley
Cc: "Eric W. Biederman"
Cc: Oleg Nesterov
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Dave Young
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ben Blum
2009-09-24 22:20:58 +0800

21 Sep, 2009

1 commit

d01d48278 sched: Always show Cpus_allowed field in /proc/<pid>/status

The Cpus_allowed fields in /proc/<pid>/status are currently only
shown in case of CONFIG_CPUSETS. However their contents are also
useful for the !CONFIG_CPUSETS case.

So change the current behaviour and always show these fields.

Signed-off-by: Heiko Carstens
Cc: Andrew Morton
Cc: Oleg Nesterov
Cc: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar

Heiko Carstens
2009-09-21 17:37:27 +0800

17 Jun, 2009

3 commits

58568d2a8 cpuset,mm: update tasks' mems_allowed in time

Fix allocating page cache/slab object on the unallowed node when memory
spread is set by updating tasks' mems_allowed after its cpuset's mems is
changed.

In order to update tasks' mems_allowed in time, we must modify the code of
memory policy, because the memory policy was originally applied in the
process's context. After applying this patch, one task directly
manipulates another's mems_allowed, and we use alloc_lock in the
task_struct to protect mems_allowed and memory policy of the task.

But in the fast path, we didn't use a lock to protect them, because adding a
lock may lead to performance regression. But if we don't add a lock, the
task might see no nodes when changing cpuset's mems_allowed to some
non-overlapping set. In order to avoid it, we set all new allowed nodes,
then clear newly disallowed ones.

[lee.schermerhorn@hp.com:
The rework of mpol_new() to extract the adjusting of the node mask to
apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind()
with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local
allocation. Fix this by adding the check for MPOL_PREFERRED and empty
node mask to mpol_new_mpolicy().

Remove the now unneeded 'nodes = NULL' from mpol_new().

Note that mpol_new_mempolicy() is always called with a non-NULL
'nodes' parameter now that it has been removed from mpol_new().
Therefore, we don't need to test nodes for NULL before testing it for
'empty'. However, just to be extra paranoid, add a VM_BUG_ON() to
verify this assumption.]
[lee.schermerhorn@hp.com:

I don't think the function name 'mpol_new_mempolicy' is descriptive
enough to differentiate it from mpol_new().

This function applies cpuset set context, usually constraining nodes
to those allowed by the cpuset. However, when the 'RELATIVE_NODES flag
is set, it also translates the nodes. So I settled on
'mpol_set_nodemask()', because the comment block for mpol_new() mentions
that we need to call this function to "set nodes".

Some additional minor line length, whitespace and typo cleanup.]
Signed-off-by: Miao Xie
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Christoph Lameter
Cc: Paul Menage
Cc: Nick Piggin
Cc: Yasunori Goto
Cc: Pekka Enberg
Cc: David Rientjes
Signed-off-by: Lee Schermerhorn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Miao Xie
2009-06-17 10:47:31 +0800
950592f7b cpusets: update tasks' page/slab spread flags in time

Fix the bug that the kernel didn't spread page cache/slab object evenly
over all the allowed nodes when spread flags were set by updating tasks'
page/slab spread flags in time.

Signed-off-by: Miao Xie
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Christoph Lameter
Cc: Paul Menage
Cc: Nick Piggin
Cc: Yasunori Goto
Cc: Pekka Enberg
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Miao Xie
2009-06-17 10:47:31 +0800
f3b39d47e cpusets: restructure the function cpuset_update_task_memory_state()

The kernel still allocates the page caches on the old node after modifying its
cpuset's mems when 'memory_spread_page' was set, or it didn't spread the
page cache evenly over all the nodes that the faulting task is allowed to use
after memory_spread_page was set. It is caused by the old mem_allowed and
flags of the task: the current kernel doesn't update them unless some
function invokes cpuset_update_task_memory_state(), which is sometimes too
late. We must update the mem_allowed and the flags of the tasks in
time.

Slab has the same problem.

The following patches fix this bug by updating tasks' mem_allowed and
spread flag after its cpuset's mems or spread flag is changed.

This patch:

Extract a function from cpuset_update_task_memory_state(). It will be
used later to update tasks' page/slab spread flags after its cpuset's
flag is set.

Signed-off-by: Miao Xie
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Christoph Lameter
Cc: Paul Menage
Cc: Nick Piggin
Cc: Yasunori Goto
Cc: Pekka Enberg
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Miao Xie
2009-06-17 10:47:31 +0800

12 Jun, 2009

1 commit

38c7fed2f x86: remove some alloc_bootmem_cpumask_var calling

Now that we set up the slab allocator earlier, we can get rid of some
alloc_bootmem_cpumask_var() calls in boot code.

Cc: Ingo Molnar
Cc: Johannes Weiner
Cc: Linus Torvalds
Signed-off-by: Yinghai Lu
Signed-off-by: Pekka Enberg

Yinghai Lu
2009-06-12 00:27:07 +0800

03 Apr, 2009

5 commits

6d7b2f5f9 cpusets: prevent PF_THREAD_BOUND tasks from attaching to non-root cpusets

Kthreads that have the PF_THREAD_BOUND bit set in their flags are bound to a
specific cpu. Thus, their set of allowed cpus shall not change.

This patch prevents such threads from attaching to non-root cpusets. They do
not have mempolicies that restrict them to a subset of system nodes and, since
their cpumask may never change, they cannot use any of the features of
cpusets.

The tasks will forever be a member of the root cpuset and will be returned
when listing the tasks attached to that cpuset.

Cc: Paul Menage
Cc: Peter Zijlstra
Cc: Dhaval Giani
Signed-off-by: David Rientjes
Cc: Ingo Molnar
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2009-04-03 10:04:57 +0800
db7f47cf4 cpusets: allow cpusets to be configured/built on non-SMP systems

Allow cpusets to be configured/built on non-SMP systems

Currently it's impossible to build cpusets under UML on x86-64, since
cpusets depends on SMP and x86-64 UML doesn't support SMP.

There's code in cpusets that doesn't depend on SMP. This patch surrounds
the minimum amount of cpusets code with #ifdef CONFIG_SMP in order to
allow cpusets to build/run on UP systems (for testing purposes under UML).

Reviewed-by: Li Zefan
Signed-off-by: Paul Menage
Cc: Ingo Molnar
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Paul Menage
2009-04-03 10:04:57 +0800
a1bc5a4ee cpusets: replace zone allowed functions with node allowed

The cpuset_zone_allowed() variants are actually only a function of the
zone's node.

Cc: Paul Menage
Acked-by: Christoph Lameter
Cc: Randy Dunlap
Signed-off-by: David Rientjes
Cc: Ingo Molnar
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2009-04-03 10:04:57 +0800
7f81b1ae1 cpuset: remove struct cpuset_hotplug_scanner

Use cgroup_scanner.data, instead of introducing cpuset_hotplug_scanner.

Signed-off-by: Li Zefan
Cc: KAMEZAWA Hiroyuki
Cc: Paul Menage
Cc: Ingo Molnar
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Li Zefan
2009-04-03 10:04:57 +0800
010cfac4c cpuset: avoid changing cpuset's mems when errno returned

When writing to cpuset.mems, cpuset has to update its mems_allowed before
calling update_tasks_nodemask(), but this function might return -ENOMEM.

To avoid this rare case, we allocate the memory before changing
mems_allowed, and then pass to update_tasks_nodemask(). Similar to what
update_cpumask() does.

Signed-off-by: Li Zefan
Cc: KAMEZAWA Hiroyuki
Cc: Paul Menage
Cc: Ingo Molnar
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Li Zefan
2009-04-03 10:04:57 +0800