07 Jun, 2008

1 commit

  • Adding a nonexistent cpu to a cpuset is quietly ignored. It should
    return -EINVAL.

    Example: (real_nr_cpus < NR_CPUS or cpu#4 was just offline)

    # cat cpus
    0-1
    # /bin/echo 4 > cpus
    # /bin/echo $?
    0
    # cat cpus

    #

    The same occurs when adding a nonexistent mem.
    This patch fixes the bug.
    And when *buf == "", the check is unneeded.
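
    A minimal userspace sketch of the intended check (the names and the flat
    bitmask model are illustrative, not the kernel's):

        #include <errno.h>
        #include <stdio.h>

        /* Model: cpu masks as flat bitmaps of cpu ids. */
        static int validate_cpus(unsigned long requested, unsigned long present)
        {
                if (requested == 0)             /* empty write clears the mask */
                        return 0;
                if (requested & ~present)       /* a nonexistent cpu requested */
                        return -EINVAL;
                return 0;
        }

        int main(void)
        {
                unsigned long present = 0x3;    /* cpus 0-1, as in the example */
                printf("%d\n", validate_cpus(1UL << 4, present)); /* prints -22 */
                return 0;
        }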

    Signed-off-by: Lai Jiangshan
    Acked-by: Paul Jackson
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     

09 May, 2008

1 commit

  • Due to a merge conflict, the sched_relax_domain_level control file was marked
    as being handled by cpuset_read/write_u64, but the code to handle it was
    actually in cpuset_common_file_read/write.

    Since the value being written/read is in fact a signed integer, it should be
    treated as such; this patch adds cpuset_read/write_s64 functions, and uses
    them to handle the sched_relax_domain_level file.

    With this patch, the sched_relax_domain_level can be read and written, and the
    correct contents seen/updated.
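
    A sketch of the signed write path this patch needs (names are stand-ins;
    the kernel routes this through the cgroup s64 handlers, and its bounds
    checking may differ):

        #include <errno.h>
        #include <stdlib.h>

        static long long relax_level = -1;      /* -1 means system default */

        /* A u64 handler would mangle or reject "-1"; an s64 one must not. */
        static int write_relax_level_s64(const char *buf)
        {
                char *end;
                long long val = strtoll(buf, &end, 10);

                if (end == buf || val < -1)
                        return -EINVAL;
                relax_level = val;
                return 0;
        }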

    Signed-off-by: Paul Menage
    Cc: Hidetoshi Seto
    Cc: Paul Jackson
    Cc: Ingo Molnar
    Reviewed-by: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     

29 Apr, 2008

5 commits

  • This flag provides the hardwalling properties of mem_exclusive, without
    enforcing the exclusivity. Either mem_hardwall or mem_exclusive is sufficient
    to prevent GFP_KERNEL allocations from passing outside the cpuset's assigned
    nodes.
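
    The resulting check is a simple disjunction; a sketch of the semantics
    (flag names modeled on the cpuset flag bits, simplified here):

        /* Either flag hardwalls GFP_KERNEL allocations to the cpuset. */
        enum {
                CS_MEM_EXCLUSIVE = 1 << 0,
                CS_MEM_HARDWALL  = 1 << 1,
        };

        static int is_hardwalled(unsigned int flags)
        {
                return flags & (CS_MEM_EXCLUSIVE | CS_MEM_HARDWALL);
        }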

    Signed-off-by: Paul Menage
    Acked-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Currently the cpusets mem_exclusive flag is overloaded to mean both
    "no-overlapping" and "no GFP_KERNEL allocations outside this cpuset".

    These patches add a new mem_hardwall flag with just the allocation restriction
    part of the mem_exclusive semantics, without breaking backwards-compatibility
    for those who continue to use just mem_exclusive. Additionally, the cgroup
    control file registration for cpusets is cleaned up to reduce boilerplate.

    This patch:

    This change tidies up the cpusets control file definitions, and reduces the
    amount of boilerplate required to add/change control files in the future.
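
    The cleanup is essentially table-driven registration; a simplified sketch
    of the idea (the handler names here are stand-ins, not the kernel's):

        struct ctl_file {
                const char *name;
                unsigned long long (*read_u64)(void);
        };

        static unsigned long long read_cpu_exclusive(void) { return 0; }
        static unsigned long long read_mem_exclusive(void) { return 0; }

        /* Adding or changing a control file becomes a one-line table edit,
         * registered in a loop instead of one hand-written block per file. */
        static const struct ctl_file cpuset_files[] = {
                { "cpu_exclusive", read_cpu_exclusive },
                { "mem_exclusive", read_mem_exclusive },
        };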

    Signed-off-by: Paul Menage
    Reviewed-by: Li Zefan
    Acked-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Make the following needlessly global functions static:

    - cpuset_test_cpumask()
    - cpuset_change_cpumask()
    - cpuset_do_move_task()

    Signed-off-by: Adrian Bunk
    Acked-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Many of the cpusets control files are simple integer values, which don't
    require the overhead of memory allocations for reads and writes.

    Move the handlers for these control files into cpuset_read_u64() and
    cpuset_write_u64().
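
    A sketch of the shape of these handlers, assuming a per-file type tag as
    the dispatch key (the [akpm] fixup below restores a `break' of exactly
    this kind):

        enum cpuset_filetype { FILE_CPU_EXCLUSIVE, FILE_MEM_EXCLUSIVE };

        /* One integer handler for many files: no buffer allocation needed. */
        static unsigned long long cpuset_read_u64_sketch(unsigned int flags,
                                                enum cpuset_filetype type)
        {
                unsigned long long val = 0;

                switch (type) {
                case FILE_CPU_EXCLUSIVE:
                        val = flags & 1;
                        break;  /* a missing break here would fall through */
                case FILE_MEM_EXCLUSIVE:
                        val = (flags >> 1) & 1;
                        break;
                }
                return val;
        }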

    [akpm@linux-foundation.org: add missing `break']
    Signed-off-by: Paul Menage
    Cc: "Li Zefan"
    Cc: Balbir Singh
    Cc: Paul Jackson
    Cc: Pavel Emelyanov
    Cc: KAMEZAWA Hiroyuki
    Cc: "YAMAMOTO Takashi"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • kernel/cpuset.c:1268:52: warning: Using plain integer as NULL pointer
    kernel/pid_namespace.c:95:24: warning: Using plain integer as NULL pointer

    Signed-off-by: Harvey Harrison
    Reviewed-by: Paul Jackson
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     

28 Apr, 2008

3 commits

  • This patch renames mpol_copy() to mpol_dup() because, well, that's what it
    does. Like, e.g., strdup() for strings, mpol_dup() takes a pointer to an
    existing mempolicy, allocates a new one and copies the contents.

    In a later patch, I want to use the name mpol_copy() to copy the contents from
    one mempolicy to another like, e.g., strcpy() does for strings.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • The MPOL_BIND policy creates a zonelist that is used for allocations
    controlled by that mempolicy. As the per-node zonelist is already being
    filtered based on a zone id, this patch adds a version of __alloc_pages() that
    takes a nodemask for further filtering. This eliminates the need for
    MPOL_BIND to create a custom zonelist.

    A positive benefit of this is that allocations using MPOL_BIND now use the
    local node's distance-ordered zonelist instead of a custom node-id-ordered
    zonelist. I.e., pages will be allocated from the closest allowed node with
    available memory.
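
    A userspace model of the filtering walk (struct and helper names are
    illustrative; the kernel threads a nodemask_t through the allocator):

        #include <stddef.h>
        #include <stdbool.h>

        struct zone { int node; };

        static bool node_allowed(const struct zone *z, unsigned long nodemask)
        {
                return nodemask & (1UL << z->node);
        }

        /* Walk the local node's distance-ordered zonelist and return the
         * first zone on an allowed node; no custom zonelist required. */
        static const struct zone *first_allowed_zone(const struct zone *zl,
                                        size_t n, unsigned long nodemask)
        {
                for (size_t i = 0; i < n; i++)
                        if (node_allowed(&zl[i], nodemask))
                                return &zl[i];
                return NULL;
        }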

    [Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments]
    [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask]
    [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework]
    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Filtering zonelists requires very frequent use of zone_idx(). This is costly
    as it involves a lookup of another structure and a subtraction operation. As
    the zone_idx is often required, it should be quickly accessible. The node idx
    could also be stored here if accessing zone->node proved significant, which
    may be the case on workloads where nodemasks are heavily used.

    This patch introduces a struct zoneref to store a zone pointer and a zone
    index. The zonelist then consists of an array of these struct zonerefs which
    are looked up as necessary. Helpers are given for accessing the zone index as
    well as the node index.
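
    The struct is small; a sketch close to what the patch introduces (the
    helper name is abbreviated here):

        struct zone;                    /* opaque for this sketch */

        struct zoneref {
                struct zone *zone;      /* pointer to the actual zone */
                int zone_idx;           /* cached zone_idx(zone) */
        };

        /* The index is read from the zoneref itself: no dereference of
         * zone and no subtraction, which is the cost being avoided. */
        static inline int zoneref_zone_idx(const struct zoneref *z)
        {
                return z->zone_idx;
        }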

    [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers]
    [hugh@veritas.com: mm-have-zonelist: fix memcg ooms]
    [hugh@veritas.com: just return do_try_to_free_pages]
    [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant]
    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

20 Apr, 2008

3 commits

  • [rebased for sched-devel/latest]

    - Add a new cpuset file, having levels:
    sched_relax_domain_level

    - Modify partition_sched_domains() and build_sched_domains()
    to take an attributes parameter passed from cpuset (sketched below).

    - Fill in newidle_idx for node domains; it is currently unused but
    might be required if sched_relax_domain_level becomes higher.

    - We can change the default level by boot option 'relax_domain_level='.
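
    The attributes parameter mentioned above boils down to a small per-domain
    struct; a sketch close to the interface this adds (exact signature details
    may differ):

        /* Attributes passed from cpuset into the sched domain rebuild. */
        struct sched_domain_attr {
                int relax_domain_level;     /* -1 requests the default */
        };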

    Signed-off-by: Hidetoshi Seto
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto
     
  • * Cleaned up references to cpumask_scnprintf() and added new
    cpulist_scnprintf() interfaces where appropriate.

    * Fix some small bugs (and make some code efficiency improvements) for
    various uses of cpumask_scnprintf.

    * Clean up some checkpatch errors.

    Signed-off-by: Mike Travis
    Signed-off-by: Ingo Molnar

    Mike Travis
     
  • * Modify cpuset_cpus_allowed to return the currently allowed cpuset
    via a pointer argument instead of as the function return value.

    * Use new set_cpus_allowed_ptr function.

    * Cleanup CPU_MASK_ALL and NODE_MASK_ALL uses.

    Depends on:
    [sched-devel]: sched: add new set_cpus_allowed_ptr function

    Signed-off-by: Mike Travis
    Signed-off-by: Ingo Molnar

    Mike Travis
     

06 Mar, 2008

1 commit

  • mm migration is no longer done in cpuset_update_task_memory_state(), so it
    no longer needs to take current->mm->mmap_sem; fix the obsolete comment.

    [ This changed in commit 04c19fa6f16047abff2288ddbc1f0798ede5a849
    ("cpuset: migrate all tasks in cpuset at once") when the mm migration
    was moved from cpuset_update_task_memory_state() to update_nodemask() ]

    Signed-off-by: David Rientjes
    Cc: Paul Jackson
    Signed-off-by: Linus Torvalds

    David Rientjes
     

09 Feb, 2008

1 commit

  • Currently we may look up the pid in the wrong pid namespace. So convert
    proc_pid_status to the seq_file interface, which ensures the proper pid
    namespace is passed in.
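
    A userspace model of the seq_file idea: the namespace is captured when the
    file is opened and carried to show time, instead of being guessed from the
    reader (all names here are illustrative):

        #include <stdio.h>

        struct pidns { int id; };

        struct status_ctx {
                struct pidns *ns;   /* opener's pid namespace, fixed at open */
                int pid;            /* the pid number valid in that namespace */
        };

        static int status_show(const struct status_ctx *ctx, char *buf, int len)
        {
                /* ctx->ns scopes any pid translation; here it tags output */
                return snprintf(buf, len, "NSpid:\t%d:%d\n",
                                ctx->ns->id, ctx->pid);
        }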

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: another build fix]
    [akpm@linux-foundation.org: s390 build fix]
    [akpm@linux-foundation.org: fix task_name() output]
    [akpm@linux-foundation.org: fix nommu build]
    Signed-off-by: Eric W. Biederman
    Cc: Andrew Morgan
    Cc: Serge Hallyn
    Cc: Cedric Le Goater
    Cc: Pavel Emelyanov
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Paul Menage
    Cc: Paul Jackson
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

08 Feb, 2008

5 commits

  • - Narrow the scope of callback_mutex in scan_for_empty_cpusets().

    - Avoid rewriting the cpus, mems of cpusets except when it is likely that
    we'll be changing them.

    - Have remove_tasks_in_empty_cpuset() also check for empty mems.

    Signed-off-by: Paul Jackson
    Acked-by: Cliff Wickman
    Cc: David Rientjes
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Various minor formatting and comment tweaks to Cliff Wickman's
    [PATCH_3_of_3]_cpusets__update_cpumask_revision.patch

    I had had "iff", meaning "if and only if" in a comment. However, except for
    ancient mathematicians, the abbreviation "iff" was a tad too cryptic. Cliff
    changed it to "if", presumably figuring that the "iff" was a typo. However,
    it was the "only if" half of the conjunction that was most interesting.
    Reword to emphasize the "only if" aspect.

    The locking comment for remove_tasks_in_empty_cpuset() was wrong; it said
    callback_mutex had to be held on entry. The opposite is true.

    Several mentions of attach_task() in comments needed to be
    changed to cgroup_attach_task().

    A comment about notify_on_release was no longer relevant,
    as the line of code it had commented, namely:
    set_bit(CS_RELEASED_RESOURCE, &parent->flags);
    is no longer present in that place in the cpuset.c code.

    Similarly a comment about notify_on_release before the
    scan_for_empty_cpusets() routine was no longer relevant.

    Removed extra parentheses and unnecessary return statement.

    Renamed attach_task() to cpuset_attach() in various comments.

    Removed comment about not needing memory migration, as it seems the migration
    is done anyway, via the cpuset_attach() callback from cgroup_attach_task().

    Signed-off-by: Paul Jackson
    Acked-by: Cliff Wickman
    Cc: David Rientjes
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Some of the comments in kernel/cpuset.c were stale following the
    transition to control groups; this patch updates them to more closely
    match reality.

    Signed-off-by: Paul Menage
    Acked-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Use the new function cgroup_scan_tasks() to step through all tasks in a
    cpuset.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Cliff Wickman
    Cc: Paul Menage
    Cc: Paul Jackson
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cliff Wickman
     
  • This patch corrects a situation that occurs when one disables all the cpus in
    a cpuset.

    Currently, the disabled (cpu-less) cpuset inherits the cpus of its parent,
    which is incorrect because it may then overlap its cpu-exclusive sibling.

    Tasks of an empty cpuset should be moved to the cpuset which is the parent of
    their current cpuset. Or if the parent cpuset has no cpus, to its parent,
    etc.

    And the empty cpuset should be released (if it is flagged notify_on_release).

    Depends on the cgroup_scan_tasks() function (proposed by David Rientjes) to
    iterate through all tasks in the cpu-less cpuset. We are deliberately
    avoiding a walk of the tasklist.
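
    The ancestor walk is simple; a sketch of the rule (flat bitmap model,
    illustrative names):

        struct cpuset_node {
                struct cpuset_node *parent;
                unsigned long cpus;     /* cpus_allowed as a flat bitmap */
        };

        /* Find the nearest ancestor that still has cpus; tasks of the
         * emptied cpuset are moved there. */
        static struct cpuset_node *nearest_nonempty(struct cpuset_node *cs)
        {
                struct cpuset_node *p = cs->parent;

                while (p && p->cpus == 0)
                        p = p->parent;
                return p;
        }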

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Cliff Wickman
    Cc: Paul Menage
    Cc: Paul Jackson
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cliff Wickman
     

26 Jan, 2008

1 commit

  • Replace all lock_cpu_hotplug/unlock_cpu_hotplug from the kernel and use
    get_online_cpus and put_online_cpus instead as it highlights the
    refcount semantics in these operations.

    The new API guarantees protection against the cpu-hotplug operation, but
    it doesn't guarantee serialized access to any of the local data
    structures. Hence the changes need to be reviewed.

    In case of pseries_add_processor/pseries_remove_processor, use
    cpu_maps_update_begin()/cpu_maps_update_done() as we're modifying the
    cpu_present_map there.
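
    The typical conversion looks like this (the API names are the ones the
    message describes; the bodies are placeholders, kernel context assumed):

        #include <linux/cpu.h>

        static void reader_example(void)
        {
                get_online_cpus();      /* refcounted: holds off cpu hotplug */
                /* ... safely walk cpu_online_map / per-cpu data ... */
                put_online_cpus();
        }

        static void writer_example(void)
        {
                cpu_maps_update_begin(); /* serializes cpu_present_map updates */
                /* ... add or remove a cpu from cpu_present_map ... */
                cpu_maps_update_done();
        }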

    Signed-off-by: Gautham R Shenoy
    Signed-off-by: Ingo Molnar

    Gautham R Shenoy
     

20 Oct, 2007

6 commits

  • When a cpu is disabled, move_task_off_dead_cpu() is called for tasks that have
    been running on that cpu.

    Currently, such a task is migrated:
    1) to any cpu on the same node as the disabled cpu, which is both online
    and among that task's cpus_allowed
    2) to any cpu which is both online and among that task's cpus_allowed

    It is typical of a multithreaded application running on a large NUMA system to
    have its tasks confined to a cpuset so as to cluster them near the memory that
    they share. Furthermore, it is typical to explicitly place such a task on a
    specific cpu in that cpuset. And in that case the task's cpus_allowed
    includes only a single cpu.

    This patch inserts a preference to migrate such a task to some cpu within
    its cpuset (and sets its cpus_allowed to its entire cpuset).

    With this patch, the task is migrated:
    1) to any cpu on the same node as the disabled cpu, which is both online
    and among that task's cpus_allowed
    2) to any online cpu within the task's cpuset
    3) to any cpu which is both online and among that task's cpus_allowed

    In order to do this, move_task_off_dead_cpu() must make a call to
    cpuset_cpus_allowed_locked(), a new subset of cpuset_cpus_allowed(), that will
    not block. (name change - per Oleg's suggestion)

    Calls are made to cpuset_lock() and cpuset_unlock() in migration_call() to set
    the cpuset mutex during the whole migrate_live_tasks() and
    migrate_dead_tasks() procedure.

    [akpm@linux-foundation.org: build fix]
    [pj@sgi.com: Fix indentation and spacing]
    Signed-off-by: Cliff Wickman
    Cc: Oleg Nesterov
    Cc: Christoph Lameter
    Cc: Paul Jackson
    Cc: Ingo Molnar
    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cliff Wickman
     
  • Cause writes to cpuset "cpus" file to update cpus_allowed for member tasks:

    - collect batches of tasks under tasklist_lock and then call
    set_cpus_allowed() on them outside the lock (since this can sleep).

    - add a simple generic priority heap type (sketched below) to allow
    efficient collection of batches of tasks to be processed without
    duplicating or missing any tasks in subsequent batches.

    - make "cpus" file update a no-op if the mask hasn't changed

    - fix race between update_cpumask() and sched_setaffinity() by making
    sched_setaffinity() post-check that it's not running on any cpus outside
    cpuset_cpus_allowed().
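
    A userspace sketch of the heap trick referenced in the second item above:
    keep the N smallest keys (think task start times) seen so far, so each
    batch resumes exactly where the previous one stopped. The kernel's version
    is generic over pointers; plain ints here for brevity.

        /* Bounded max-heap of the N smallest keys seen so far. */
        struct heap {
                int *data;
                int size, max;
        };

        static void heap_sift_down(struct heap *h, int i)
        {
                for (;;) {
                        int l = 2 * i + 1, r = l + 1, big = i, t;

                        if (l < h->size && h->data[l] > h->data[big])
                                big = l;
                        if (r < h->size && h->data[r] > h->data[big])
                                big = r;
                        if (big == i)
                                return;
                        t = h->data[i];
                        h->data[i] = h->data[big];
                        h->data[big] = t;
                        i = big;
                }
        }

        /* Keep the N smallest keys: the root is the largest kept, so a new
         * smaller key evicts it; batches processed in key order then never
         * miss or repeat an element. */
        static void heap_insert(struct heap *h, int key)
        {
                int i, p, t;

                if (h->size < h->max) {
                        i = h->size++;
                        h->data[i] = key;
                        while (i > 0) {
                                p = (i - 1) / 2;
                                if (h->data[p] >= h->data[i])
                                        break;
                                t = h->data[p];
                                h->data[p] = h->data[i];
                                h->data[i] = t;
                                i = p;
                        }
                } else if (key < h->data[0]) {
                        h->data[0] = key;
                        heap_sift_down(h, 0);
                }
        }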

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Paul Menage
    Cc: Paul Jackson
    Cc: David Rientjes
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Balbir Singh
    Cc: Cedric Le Goater
    Cc: "Eric W. Biederman"
    Cc: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Decrustify the kernel/cpuset.c 'cpus' and 'mems' updating code.

    Other than subtle improvements in the consistency of identifying
    white space at the beginning and end of passed in masks, this
    doesn't make any visible difference in behaviour. But it's
    one or two hundred kernel text bytes smaller, and easier to
    understand.

    [akpm@linux-foundation.org: coding-style fix]
    Signed-off-by: Paul Jackson
    Reviewed-by: Paul Menage
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Add a new per-cpuset flag called 'sched_load_balance'.

    When enabled in a cpuset (the default value) it tells the kernel scheduler
    that the scheduler should provide the normal load balancing on the CPUs in
    that cpuset, sometimes moving tasks from one CPU to a second CPU if the
    second CPU is less loaded and if that task is allowed to run there.

    When disabled (write "0" to the file) then it tells the kernel scheduler
    that load balancing is not required for the CPUs in that cpuset.

    Now even if this flag is disabled for some cpuset, the kernel may still
    have to load balance some or all the CPUs in that cpuset, if some
    overlapping cpuset has its sched_load_balance flag enabled.

    If there are some CPUs that are not in any cpuset whose sched_load_balance
    flag is enabled, the kernel scheduler will not load balance tasks to those
    CPUs.

    Moreover the kernel will partition the 'sched domains' (non-overlapping
    sets of CPUs over which load balancing is attempted) into the finest
    granularity partition that it can find, while still keeping any two CPUs
    that are in the same sched_load_balance enabled cpuset in the same element
    of the partition.

    This serves two purposes:
    1) It provides a mechanism for real time isolation of some CPUs, and
    2) it can be used to improve performance on systems with many CPUs
    by supporting configurations in which load balancing is not done
    across all CPUs at once, but rather only done in several smaller
    disjoint sets of CPUs.

    This mechanism replaces the earlier overloading of the per-cpuset flag
    'cpu_exclusive'; that overloading was removed in an earlier patch:
    cpuset-remove-sched-domain-hooks-from-cpusets

    See further the Documentation and comments in the code itself.
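
    The partitioning rule reduces to a pairwise test; a sketch of the idea
    (flat bitmask model, illustrative names):

        struct cs {
                unsigned long cpus;     /* cpus_allowed as a flat bitmap */
                int sched_load_balance;
        };

        /* Two cpusets must land in the same sched domain element iff both
         * ask for load balancing and their cpus overlap. */
        static int must_share_domain(const struct cs *a, const struct cs *b)
        {
                return a->sched_load_balance && b->sched_load_balance &&
                       (a->cpus & b->cpus) != 0;
        }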

    [akpm@linux-foundation.org: don't be weird]
    Signed-off-by: Paul Jackson
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Remove the filesystem support logic from the cpusets system and make
    cpusets a cgroup subsystem.

    The "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get
    passed through to the cgroup filesystem with the appropriate options to
    emulate the old cpuset filesystem behaviour.

    Signed-off-by: Paul Menage
    Cc: Serge E. Hallyn
    Cc: "Eric W. Biederman"
    Cc: Dave Hansen
    Cc: Balbir Singh
    Cc: Paul Jackson
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: Srivatsa Vaddagiri
    Cc: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • The cpuset code to present a list of tasks using a cpuset to user space could
    write to an array that it had kmalloc'd, after a kmalloc request of zero size.

    The problem was that the code didn't check for writes past the allocated end
    of the array until -after- the first write.

    This is a race condition that is likely rare -- it would only show up if a
    cpuset went from being empty to having a task in it, during the brief time
    between the allocation and the first write.

    Prior to roughly 2.6.22 kernels, this was also a benign problem, because a
    zero kmalloc returned a few usable bytes anyway, and no harm was done with the
    bogus write.

    With the 2.6.22 kernel changes that issue a warning if code tries to write
    to the location returned from a zero-size allocation, this problem is no
    longer benign. This cpuset code would occasionally trigger that warning.

    The fix is trivial -- check before storing into the array, not after, whether
    the array is big enough to hold the store.
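
    The shape of the fix, in a simplified sketch (names illustrative):

        /* Return the next free slot, or -1 when the array is full.
         * The bounds test runs before the store, never after it. */
        static int store_pid(int *pids, int npids, int n, int pid)
        {
                if (n >= npids)
                        return -1;
                pids[n] = pid;
                return n + 1;
        }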

    Cc: "Eric W. Biederman"
    Cc: "Serge E. Hallyn"
    Cc: Balbir Singh
    Cc: Dave Hansen
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Paul Menage
    Cc: Srivatsa Vaddagiri
    Cc: Christoph Lameter
    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

17 Oct, 2007

4 commits

  • Instead of testing for overlap in the memory nodes of the nearest
    exclusive ancestor of both current and the candidate task, it is better to
    simply test for intersection between the tasks' mems_allowed in their task
    descriptors. This does not require taking callback_mutex since it is only
    used as a hint in the badness scoring.

    Tasks that do not have an intersection in their mems_allowed with the current
    task are not explicitly restricted from being OOM killed because it is quite
    possible that the candidate task has allocated memory there before and has
    since changed its mems_allowed.
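
    The hint is just an intersection test on the two masks; sketched with flat
    bitmaps (the kernel uses nodemask_t and its intersection helper):

        /* Lockless hint for OOM badness: do the two tasks' mems overlap? */
        static int mems_intersect(unsigned long a_mems, unsigned long b_mems)
        {
                return (a_mems & b_mems) != 0;
        }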

    Cc: Andrea Arcangeli
    Acked-by: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Remove the cpuset hooks that defined sched domains depending on the setting
    of the 'cpu_exclusive' flag.

    The cpu_exclusive flag can only be set on a child if it is set on the
    parent.

    This made that flag painfully unsuitable for use as a flag defining a
    partitioning of a system.

    It was entirely unobvious to a cpuset user what partitioning of sched
    domains they would be causing when they set that one cpu_exclusive bit on
    one cpuset, because it depended on what CPUs were in the remainder of that
    cpuset's siblings and child cpusets, after subtracting out other
    cpu_exclusive cpusets.

    Furthermore, there was no way on production systems to query the
    result.

    Using the cpu_exclusive flag for this was simply wrong from the get go.

    Fortunately, it was sufficiently borked that so far as I know, almost no
    successful use has been made of this. One real time group did use it to
    effectively isolate CPUs from any load balancing efforts. They are willing
    to adapt to alternative mechanisms for this, such as some way to manipulate
    the list of isolated CPUs on a running system. They can do without this
    present cpu_exclusive based mechanism while we develop an alternative.

    There is a real risk, to the best of my understanding, of users
    accidentally setting up partitioned scheduler domains, inhibiting desired
    load balancing across all their CPUs, due to the nonobvious (from the
    cpuset perspective) side effects of the cpu_exclusive flag.

    Furthermore, since there was no way on a running system to see what one was
    doing with sched domains, this change will be invisible to any code using
    it. Unless they have real insight into the scheduler's load balancing
    choices, users will be unable to detect that this change has been made in
    the kernel's behaviour.

    Initial discussion on lkml of this patch has generated much comment. My
    (probably controversial) take on that discussion is that it has reached a
    rough consensus that the current cpuset cpu_exclusive mechanism for
    defining sched domains is borked. There is no consensus on the
    replacement. But since we can remove this mechanism, and since its
    continued presence risks causing unwanted partitioning of the scheduler's
    load balancing, we should remove it while we can, as we proceed to work on
    the replacement scheduler domain mechanisms.

    Signed-off-by: Paul Jackson
    Cc: Ingo Molnar
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Dinakar Guniguntala
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • This patch marks a number of allocations that are either short-lived such as
    network buffers or are reclaimable such as inode allocations. When something
    like updatedb is called, long-lived and unmovable kernel allocations tend to
    be spread throughout the address space which increases fragmentation.

    This patch groups these allocations together as much as possible by adding a
    new MIGRATE_TYPE. The MIGRATE_RECLAIMABLE type is for allocations that can be
    reclaimed on demand, but not moved. i.e. they can be migrated by deleting
    them and re-reading the information from elsewhere.
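
    Sketched as the allocator-side dispatch (simplified; the real series
    derives the type from a gfp flag set at the marked call sites):

        enum migratetype {
                MIGRATE_UNMOVABLE,      /* long-lived kernel allocations */
                MIGRATE_RECLAIMABLE,    /* freeable on demand, not movable */
                MIGRATE_MOVABLE,        /* user pages that can be migrated */
        };

        /* Group pages of one type into the same blocks so they fragment
         * each other rather than the rest of memory. */
        static enum migratetype alloc_migratetype(int gfp_reclaimable)
        {
                return gfp_reclaimable ? MIGRATE_RECLAIMABLE
                                       : MIGRATE_UNMOVABLE;
        }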

    Signed-off-by: Mel Gorman
    Cc: Andy Whitcroft
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • cpusets try to ensure that any node added to a cpuset's mems_allowed is
    on-line and contains memory. The assumption was that online nodes contained
    memory. Thus, it is possible to add memoryless nodes to a cpuset and then add
    tasks to this cpuset. This results in a continuous series of oom-kills and
    an apparent system hang.

    Change cpusets to use node_states[N_HIGH_MEMORY] [a.k.a. node_memory_map] in
    place of node_online_map when vetting memories. Return error if admin
    attempts to write a non-empty mems_allowed node mask containing only
    memoryless-nodes.
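
    A flat-bitmap sketch of the vetting rule (illustrative names; the kernel
    checks against node_states[N_HIGH_MEMORY]):

        #include <errno.h>

        /* Reject a non-empty mems mask that names only memoryless nodes. */
        static int validate_mems(unsigned long requested,
                                 unsigned long has_memory)
        {
                if (requested != 0 && (requested & has_memory) == 0)
                        return -EINVAL;
                return 0;
        }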

    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Bob Picco
    Signed-off-by: Nishanth Aravamudan
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

18 Jul, 2007

2 commits

  • Rather than using a tri-state integer for the wait flag in
    call_usermodehelper_exec, define a proper enum, and use that. I've
    preserved the integer values so that any callers I've missed should
    still work OK.
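
    The resulting enum, with the old tri-state integer values preserved
    (close to what the patch adds):

        enum umh_wait {
                UMH_NO_WAIT   = -1,     /* don't wait at all */
                UMH_WAIT_EXEC = 0,      /* wait for the exec only */
                UMH_WAIT_PROC = 1,      /* wait for the process to complete */
        };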

    Signed-off-by: Jeremy Fitzhardinge
    Cc: James Bottomley
    Cc: Randy Dunlap
    Cc: Christoph Hellwig
    Cc: Andi Kleen
    Cc: Paul Mackerras
    Cc: Johannes Berg
    Cc: Ralf Baechle
    Cc: Bjorn Helgaas
    Cc: Joel Becker
    Cc: Tony Luck
    Cc: Kay Sievers
    Cc: Srivatsa Vaddagiri
    Cc: Oleg Nesterov
    Cc: David Howells

    Jeremy Fitzhardinge
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm: (80 commits)
    KVM: Use CPU_DYING for disabling virtualization
    KVM: Tune hotplug/suspend IPIs
    KVM: Keep track of which cpus have virtualization enabled
    SMP: Allow smp_call_function_single() to current cpu
    i386: Allow smp_call_function_single() to current cpu
    x86_64: Allow smp_call_function_single() to current cpu
    HOTPLUG: Adapt thermal throttle to CPU_DYING
    HOTPLUG: Adapt cpuset hotplug callback to CPU_DYING
    HOTPLUG: Add CPU_DYING notifier
    KVM: Clean up #includes
    KVM: Remove kvmfs in favor of the anonymous inodes source
    KVM: SVM: Reliably detect if SVM was disabled by BIOS
    KVM: VMX: Remove unnecessary code in vmx_tlb_flush()
    KVM: MMU: Fix Wrong tlb flush order
    KVM: VMX: Reinitialize the real-mode tss when entering real mode
    KVM: Avoid useless memory write when possible
    KVM: Fix x86 emulator writeback
    KVM: Add support for in-kernel pio handlers
    KVM: VMX: Fix interrupt checking on lightweight exit
    KVM: Adds support for in-kernel mmio handlers
    ...

    Linus Torvalds
     

17 Jun, 2007

1 commit

  • The cpuset code to present a list of tasks using a cpuset to user space could
    write to an array that it had kmalloc'd, after a kmalloc request of zero size.

    The problem was that the code didn't check for writes past the allocated end
    of the array until -after- the first write.

    This is a race condition that is likely rare -- it would only show up if a
    cpuset went from being empty to having a task in it, during the brief time
    between the allocation and the first write.

    Prior to roughly 2.6.22 kernels, this was also a benign problem, because a
    zero kmalloc returned a few usable bytes anyway, and no harm was done with the
    bogus write.

    With the 2.6.22 kernel changes that issue a warning if code tries to write
    to the location returned from a zero-size allocation, this problem is no
    longer benign. This cpuset code would occasionally trigger that warning.

    The fix is trivial -- check before storing into the array, not after, whether
    the array is big enough to hold the store.

    Cc: "Eric W. Biederman"
    Cc: "Serge E. Hallyn"
    Cc: Balbir Singh
    Cc: Dave Hansen
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Paul Menage
    Cc: Srivatsa Vaddagiri
    Cc: Christoph Lameter
    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
