Eric Lee / smarc-fsl-linux-kernel

20 Oct, 2007

2 commits

8793d854e Task Control Groups: make cpusets a client of cgroups ... Browse Code »

Remove the filesystem support logic from the cpusets system and makes cpusets
a cgroup subsystem

The "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get
passed through to the cgroup filesystem with the appropriate options to
emulate the old cpuset filesystem behaviour.

Signed-off-by: Paul Menage
Cc: Serge E. Hallyn
Cc: "Eric W. Biederman"
Cc: Dave Hansen
Cc: Balbir Singh
Cc: Paul Jackson
Cc: Kirill Korotaev
Cc: Herbert Poetzl
Cc: Srivatsa Vaddagiri
Cc: Cedric Le Goater
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Paul Menage
2007-10-20 02:53:36 +0800
55a230aae cpuset: zero malloc - revert the old cpuset fix ... Browse Code »

The cpuset code to present a list of tasks using a cpuset to user space could
write to an array that it had kmalloc'd, after a kmalloc request of zero size.

The problem was that the code didn't check for writes past the allocated end
of the array until -after- the first write.

This is a race condition that is likely rare -- it would only show up if a
cpuset went from being empty to having a task in it, during the brief time
between the allocation and the first write.

Prior to roughly 2.6.22 kernels, this was also a benign problem, because a
zero kmalloc returned a few usable bytes anyway, and no harm was done with the
bogus write.

With the 2.6.22 kernel changes to make issue a warning if code tries to write
to the location returned from a zero size allocation, this problem is no
longer benign. This cpuset code would occassionally trigger that warning.

The fix is trivial -- check before storing into the array, not after, whether
the array is big enough to hold the store.

Cc: "Eric W. Biederman"
Cc: "Serge E. Hallyn"
Cc: Balbir Singh
Cc: Dave Hansen
Cc: Herbert Poetzl
Cc: Kirill Korotaev
Cc: Paul Menage
Cc: Srivatsa Vaddagiri
Cc: Christoph Lameter
Signed-off-by: Paul Jackson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Paul Jackson
2007-10-20 02:53:35 +0800

19 Oct, 2007

1 commit

dedf8b79e whitespace fixes: cpuset ... Browse Code »

Signed-off-by: Daniel Walker
Cc: Paul Jackson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Daniel Walker
2007-10-19 05:37:24 +0800

17 Oct, 2007

4 commits

bbe373f2c oom: compare cpuset mems_allowed instead of exclusive ancestors ... Browse Code »

Instead of testing for overlap in the memory nodes of the the nearest
exclusive ancestor of both current and the candidate task, it is better to
simply test for intersection between the task's mems_allowed in their task
descriptors. This does not require taking callback_mutex since it is only
used as a hint in the badness scoring.

Tasks that do not have an intersection in their mems_allowed with the current
task are not explicitly restricted from being OOM killed because it is quite
possible that the candidate task has allocated memory there before and has
since changed its mems_allowed.

Cc: Andrea Arcangeli
Acked-by: Christoph Lameter
Signed-off-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2007-10-17 23:42:46 +0800
607717a65 cpuset: remove sched domain hooks from cpusets ... Browse Code »

Remove the cpuset hooks that defined sched domains depending on the setting
of the 'cpu_exclusive' flag.

The cpu_exclusive flag can only be set on a child if it is set on the
parent.

This made that flag painfully unsuitable for use as a flag defining a
partitioning of a system.

It was entirely unobvious to a cpuset user what partitioning of sched
domains they would be causing when they set that one cpu_exclusive bit on
one cpuset, because it depended on what CPUs were in the remainder of that
cpusets siblings and child cpusets, after subtracting out other
cpu_exclusive cpusets.

Furthermore, there was no way on production systems to query the
result.

Using the cpu_exclusive flag for this was simply wrong from the get go.

Fortunately, it was sufficiently borked that so far as I know, almost no
successful use has been made of this. One real time group did use it to
affectively isolate CPUs from any load balancing efforts. They are willing
to adapt to alternative mechanisms for this, such as someway to manipulate
the list of isolated CPUs on a running system. They can do without this
present cpu_exclusive based mechanism while we develop an alternative.

There is a real risk, to the best of my understanding, of users
accidentally setting up a partitioned scheduler domains, inhibiting desired
load balancing across all their CPUs, due to the nonobvious (from the
cpuset perspective) side affects of the cpu_exclusive flag.

Furthermore, since there was no way on a running system to see what one was
doing with sched domains, this change will be invisible to any using code.
Unless they have real insight to the scheduler load balancing choices, they
will be unable to detect that this change has been made in the kernel's
behaviour.

Initial discussion on lkml of this patch has generated much comment. My
(probably controversial) take on that discussion is that it has reached a
rough concensus that the current cpuset cpu_exclusive mechanism for
defining sched domains is borked. There is no concensus on the
replacement. But since we can remove this mechanism, and since its
continued presence risks causing unwanted partitioning of the schedulers
load balancing, we should remove it while we can, as we proceed to work the
replacement scheduler domain mechanisms.

Signed-off-by: Paul Jackson
Cc: Ingo Molnar
Cc: Nick Piggin
Cc: Christoph Lameter
Cc: Dinakar Guniguntala
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Paul Jackson
2007-10-17 00:43:09 +0800
e12ba74d8 Group short-lived and reclaimable kernel allocations ... Browse Code »

This patch marks a number of allocations that are either short-lived such as
network buffers or are reclaimable such as inode allocations. When something
like updatedb is called, long-lived and unmovable kernel allocations tend to
be spread throughout the address space which increases fragmentation.

This patch groups these allocations together as much as possible by adding a
new MIGRATE_TYPE. The MIGRATE_RECLAIMABLE type is for allocations that can be
reclaimed on demand, but not moved. i.e. they can be migrated by deleting
them and re-reading the information from elsewhere.

Signed-off-by: Mel Gorman
Cc: Andy Whitcroft
Cc: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2007-10-17 00:43:00 +0800
0e1e7c7a7 Memoryless nodes: Use N_HIGH_MEMORY for cpusets ... Browse Code »

cpusets try to ensure that any node added to a cpuset's mems_allowed is
on-line and contains memory. The assumption was that online nodes contained
memory. Thus, it is possible to add memoryless nodes to a cpuset and then add
tasks to this cpuset. This results in continuous series of oom-kill and
apparent system hang.

Change cpusets to use node_states[N_HIGH_MEMORY] [a.k.a. node_memory_map] in
place of node_online_map when vetting memories. Return error if admin
attempts to write a non-empty mems_allowed node mask containing only
memoryless-nodes.

Signed-off-by: Lee Schermerhorn
Signed-off-by: Bob Picco
Signed-off-by: Nishanth Aravamudan
Cc: KAMEZAWA Hiroyuki
Cc: Mel Gorman
Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Lameter
2007-10-17 00:42:59 +0800

18 Jul, 2007

2 commits

86313c488 usermodehelper: Tidy up waiting ... Browse Code »

Rather than using a tri-state integer for the wait flag in
call_usermodehelper_exec, define a proper enum, and use that. I've
preserved the integer values so that any callers I've missed should
still work OK.

Signed-off-by: Jeremy Fitzhardinge
Cc: James Bottomley
Cc: Randy Dunlap
Cc: Christoph Hellwig
Cc: Andi Kleen
Cc: Paul Mackerras
Cc: Johannes Berg
Cc: Ralf Baechle
Cc: Bjorn Helgaas
Cc: Joel Becker
Cc: Tony Luck
Cc: Kay Sievers
Cc: Srivatsa Vaddagiri
Cc: Oleg Nesterov
Cc: David Howells

Jeremy Fitzhardinge
2007-07-18 23:47:40 +0800
49c13b51a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm ... Browse Code »

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm: (80 commits)
KVM: Use CPU_DYING for disabling virtualization
KVM: Tune hotplug/suspend IPIs
KVM: Keep track of which cpus have virtualization enabled
SMP: Allow smp_call_function_single() to current cpu
i386: Allow smp_call_function_single() to current cpu
x86_64: Allow smp_call_function_single() to current cpu
HOTPLUG: Adapt thermal throttle to CPU_DYING
HOTPLUG: Adapt cpuset hotplug callback to CPU_DYING
HOTPLUG: Add CPU_DYING notifier
KVM: Clean up #includes
KVM: Remove kvmfs in favor of the anonymous inodes source
KVM: SVM: Reliably detect if SVM was disabled by BIOS
KVM: VMX: Remove unnecessary code in vmx_tlb_flush()
KVM: MMU: Fix Wrong tlb flush order
KVM: VMX: Reinitialize the real-mode tss when entering real mode
KVM: Avoid useless memory write when possible
KVM: Fix x86 emulator writeback
KVM: Add support for in-kernel pio handlers
KVM: VMX: Fix interrupt checking on lightweight exit
KVM: Adds support for in-kernel mmio handlers
...

Linus Torvalds
2007-07-18 02:50:26 +0800

17 Jul, 2007

1 commit

c2aef333c Reduce cpuset.c write_lock_irq() to read_lock() ... Browse Code »

cpuset.c:update_nodemask() uses a write_lock_irq() on tasklist_lock to
block concurrent forks; a read_lock() suffices and is less intrusive.

Signed-off-by: Paul Menage
Acked-by: Paul Jackson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Paul Menage
2007-07-17 00:05:43 +0800

16 Jul, 2007

1 commit

ac076758b HOTPLUG: Adapt cpuset hotplug callback to CPU_DYING ... Browse Code »

CPU_DYING is called in atomic context, so don't try to take any locks.

Signed-off-by: Avi Kivity

Avi Kivity
2007-07-16 17:05:49 +0800

17 Jun, 2007

1 commit

3e903e7b1 cpuset: zero malloc - fix for old cpusets ... Browse Code »

The cpuset code to present a list of tasks using a cpuset to user space could
write to an array that it had kmalloc'd, after a kmalloc request of zero size.

The problem was that the code didn't check for writes past the allocated end
of the array until -after- the first write.

This is a race condition that is likely rare -- it would only show up if a
cpuset went from being empty to having a task in it, during the brief time
between the allocation and the first write.

Prior to roughly 2.6.22 kernels, this was also a benign problem, because a
zero kmalloc returned a few usable bytes anyway, and no harm was done with the
bogus write.

With the 2.6.22 kernel changes to make issue a warning if code tries to write
to the location returned from a zero size allocation, this problem is no
longer benign. This cpuset code would occassionally trigger that warning.

The fix is trivial -- check before storing into the array, not after, whether
the array is big enough to hold the store.

Cc: "Eric W. Biederman"
Cc: "Serge E. Hallyn"
Cc: Balbir Singh
Cc: Dave Hansen
Cc: Herbert Poetzl
Cc: Kirill Korotaev
Cc: Paul Menage
Cc: Srivatsa Vaddagiri
Cc: Christoph Lameter
Signed-off-by: Paul Jackson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Paul Jackson
2007-06-17 04:16:15 +0800

10 May, 2007

1 commit

85badbdf5 use simple_read_from_buffer in kernel/ ... Browse Code »

Cleanup using simple_read_from_buffer() for /dev/cpuset/tasks and
/proc/config.gz.

Cc: Paul Jackson
Cc: Randy Dunlap
Signed-off-by: Akinobu Mita
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Akinobu Mita
2007-05-10 03:30:49 +0800

09 May, 2007

3 commits

6f7f02e78 cpusets: allow empty {cpus,mems}_allowed to be set for unpopulated cpuset ... Browse Code »

You currently cannot remove all cpus or mems from cpus_allowed or
mems_allowed of a cpuset. We now allow both if there are no attached
tasks.

Acked-by: Paul Jackson
Cc: Christoph Lameter
Signed-off-by: Paul Menage
Signed-off-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2007-05-09 02:15:14 +0800
e63340ae6 header cleaning: don't include smp_lock.h when not used ... Browse Code »

Remove includes of where it is not used/needed.
Suggested by Al Viro.

Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc,
sparc64, and arm (all 59 defconfigs).

Signed-off-by: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Randy Dunlap
2007-05-09 02:15:07 +0800
dd9037a26 Fix race between attach_task and cpuset_exit ... Browse Code »

Currently cpuset_exit() changes the exiting task's ->cpuset pointer w/o
taking task_lock(). This can lead to ugly races between attach_task and
cpuset_exit. Details of the races are described at
http://lkml.org/lkml/2007/3/24/132.

Patch below closes those races.

Signed-off-by: Srivatsa Vaddagiri
Cc: Paul Jackson
Cc: Balbir Singh
Cc: Paul Menage
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Srivatsa Vaddagiri
2007-05-09 02:15:05 +0800

08 May, 2007

1 commit

c596d9f32 cpusets: allow TIF_MEMDIE threads to allocate anywhere ... Browse Code »

OOM killed tasks have access to memory reserves as specified by the
TIF_MEMDIE flag in the hopes that it will quickly exit. If such a task has
memory allocations constrained by cpusets, we may encounter a deadlock if a
blocking task cannot exit because it cannot allocate the necessary memory.

We allow tasks that have the TIF_MEMDIE flag to allocate memory anywhere,
including outside its cpuset restriction, so that it can quickly die
regardless of whether it is __GFP_HARDWALL.

Cc: Andi Kleen
Cc: Paul Jackson
Cc: Christoph Lameter
Signed-off-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2007-05-08 03:12:53 +0800

13 Feb, 2007

2 commits

92e1d5be9 [PATCH] mark struct inode_operations const 2 ... Browse Code »

Many struct inode_operations in the kernel can be "const". Marking them const
moves these to the .rodata section, which avoids false sharing with potential
dirty data. In addition it'll catch accidental writes at compile time to
these shared resources.

Signed-off-by: Arjan van de Ven
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Arjan van de Ven
2007-02-13 01:48:46 +0800
9a32144e9 [PATCH] mark struct file_operations const 7 ... Browse Code »

Many struct file_operations in the kernel can be "const". Marking them const
moves these to the .rodata section, which avoids false sharing with potential
dirty data. In addition it'll catch accidental writes at compile time to
these shared resources.

Signed-off-by: Arjan van de Ven
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Arjan van de Ven
2007-02-13 01:48:46 +0800

31 Dec, 2006

1 commit

089e34b60 [PATCH] cpuset procfs warning fix ... Browse Code »

fs/proc/base.c:1869: warning: initialization discards qualifiers from pointer target type
fs/proc/base.c:2150: warning: initialization discards qualifiers from pointer target type

Cc: Paul Jackson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Morton
2006-12-31 02:56:43 +0800

14 Dec, 2006

1 commit

02a0e53d8 [PATCH] cpuset: rework cpuset_zone_allowed api ... Browse Code »

Elaborate the API for calling cpuset_zone_allowed(), so that users have to
explicitly choose between the two variants:

cpuset_zone_allowed_hardwall()
cpuset_zone_allowed_softwall()

Until now, whether or not you got the hardwall flavor depended solely on
whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask
argument.

If you didn't specify __GFP_HARDWALL, you implicitly got the softwall
version.

Unfortunately, this meant that users would end up with the softwall version
without thinking about it. Since only the softwall version might sleep,
this led to bugs with possible sleeping in interrupt context on more than
one occassion.

The hardwall version requires that the current tasks mems_allowed allows
the node of the specified zone (or that you're in interrupt or that
__GFP_THISNODE is set or that you're on a one cpuset system.)

The softwall version, depending on the gfp_mask, might allow a node if it
was allowed in the nearest enclusing cpuset marked mem_exclusive (which
requires taking the cpuset lock 'callback_mutex' to evaluate.)

This patch removes the cpuset_zone_allowed() call, and forces the caller to
explicitly choose between the hardwall and the softwall case.

If the caller wants the gfp_mask to determine this choice, they should (1)
be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the
cpuset_zone_allowed_softwall() routine.

This adds another 100 or 200 bytes to the kernel text space, due to the few
lines of nearly duplicate code at the top of both cpuset_zone_allowed_*
routines. It should save a few instructions executed for the calls that
turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to
set (before the call) then check (within the call) the __GFP_HARDWALL flag.

For the most critical call, from get_page_from_freelist(), the same
instructions are executed as before -- the old cpuset_zone_allowed()
routine it used to call is the same code as the
cpuset_zone_allowed_softwall() routine that it calls now.

Not a perfect win, but seems worth it, to reduce this chance of hitting a
sleeping with irq off complaint again.

Signed-off-by: Paul Jackson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Paul Jackson
2006-12-14 01:05:49 +0800

09 Dec, 2006

1 commit

a7a005fd1 [PATCH] struct path: convert kernel ... Browse Code »

Signed-off-by: Josef Sipek
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Josef Sipek
2006-12-09 00:28:46 +0800

08 Dec, 2006

4 commits

d3ed11c35 [PATCH] cpuset: allow a larger buffer for writes to cpuset files ... Browse Code »

When using fake NUMA setup, the number of memory nodes can greatly exceed
the number of CPUs. So the current limit in cpuset_common_file_write() is
insufficient.

Signed-off-by: Paul Menage
Acked-by: Paul Jackson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Paul Menage
2006-12-08 00:39:48 +0800
15ad7cdcf [PATCH] struct seq_operations and struct file_operations constification ... Browse Code »

- move some file_operations structs into the .rodata section

- move static strings from policy_types[] array into the .rodata section

- fix generic seq_operations usages, so that those structs may be defined
as "const" as well

[akpm@osdl.org: couple of fixes]
Signed-off-by: Helge Deller
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Helge Deller
2006-12-08 00:39:46 +0800
023160678 [PATCH] hotplug CPU: clean up hotcpu_notifier() use ... Browse Code »

There was lots of #ifdef noise in the kernel due to hotcpu_notifier(fn,
prio) not correctly marking 'fn' as used in the !HOTPLUG_CPU case, and thus
generating compiler warnings of unused symbols, hence forcing people to add
#ifdefs.

the compiler can skip truly unused functions just fine:

text data bss dec hex filename
1624412 728710 3674856 6027978 5bfaca vmlinux.before
1624412 728710 3674856 6027978 5bfaca vmlinux.after

[akpm@osdl.org: topology.c fix]
Signed-off-by: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ingo Molnar
2006-12-08 00:39:39 +0800
696040670 [PATCH] cpuset: minor code refinements ... Browse Code »

A couple of minor code simplifications to the kernel/cpuset.c code. No
functional change. Just a little less code and a little more readable.

Signed-off-by: Paul Jackson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Paul Jackson
2006-12-08 00:39:32 +0800

11 Oct, 2006

1 commit

1af989281 [PATCH] cpuset ANSI prototype ... Browse Code »

Signed-off-by: Al Viro
Acked-by: Paul Jackson
Signed-off-by: Linus Torvalds

Al Viro
2006-10-11 06:37:23 +0800

01 Oct, 2006

1 commit

d8c76e6f4 [PATCH] r/o bind mount prepwork: inc_nlink() helper ... Browse Code »

This is mostly included for parity with dec_nlink(), where we will have some
more hooks. This one should stay pretty darn straightforward for now.

Signed-off-by: Dave Hansen
Acked-by: Christoph Hellwig
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dave Hansen
2006-10-01 15:39:30 +0800

30 Sep, 2006

4 commits

181b64803 [PATCH] cpuset: fix obscure attach_task vs exiting race ... Browse Code »

Fix obscure race condition in kernel/cpuset.c attach_task() code.

There is basically zero chance of anyone accidentally being harmed by this
race.

It requires a special 'micro-stress' load and a special timing loop hacks
in the kernel to hit in less than an hour, and even then you'd have to hit
it hundreds or thousands of times, followed by some unusual and senseless
cpuset configuration requests, including removing the top cpuset, to cause
any visibly harm affects.

One could, with perhaps a few days or weeks of such effort, get the
reference count on the top cpuset below zero, and manage to crash the
kernel by asking to remove the top cpuset.

I found it by code inspection.

The race was introduced when 'the_top_cpuset_hack' was introduced, and one
piece of code was not updated. An old check for a possibly null task
cpuset pointer needed to be changed to a check for a task marked
PF_EXITING. The pointer can't be null anymore, thanks to
the_top_cpuset_hack (documented in kernel/cpuset.c). But the task could
have gone into PF_EXITING state after it was found in the task_list scan.

If a task is PF_EXITING in this code, it is possible that its task->cpuset
pointer is pointing to the top cpuset due to the_top_cpuset_hack, rather
than because the top_cpuset was that tasks last valid cpuset. In that
case, the wrong cpuset reference counter would be decremented.

The fix is trivial. Instead of failing the system call if the tasks cpuset
pointer is null here, fail it if the task is in PF_EXITING state.

The code for 'the_top_cpuset_hack' that changes an exiting tasks cpuset to
the top_cpuset is done without locking, so could happen at anytime. But it
is done during the exit handling, after the PF_EXITING flag is set. So if
we verify that a task is still not PF_EXITING after we copy out its cpuset
pointer (into 'oldcs', below), we know that 'oldcs' is not one of these
hack references to the top_cpuset.

Signed-off-by: Paul Jackson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Paul Jackson
2006-09-30 00:18:25 +0800
b1aac8bb8 [PATCH] cpuset: hotunplug cpus and mems in all cpusets ... Browse Code »

The cpuset code handling hot unplug of CPUs or Memory Nodes was incorrect -
it could remove a CPU or Node from the top cpuset, while leaving it still
in some child cpusets.

One basic rule of cpusets is that each cpusets cpus and mems are subsets of
its parents. The cpuset hot unplug code violated this rule.

So the cpuset hotunplug handler must walk down the tree, removing any
removed CPU or Node from all cpusets.

However, it is not allowed to make a cpusets cpus or mems become empty.
They can only transition from empty to non-empty, not back.

So if the last CPU or Node would be removed from a cpuset by the above
walk, we scan back up the cpuset hierarchy, finding the nearest ancestor
that still has something online, and copy its CPU or Memory placement.

Signed-off-by: Paul Jackson
Cc: Nathan Lynch
Cc: Anton Blanchard
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Paul Jackson
2006-09-30 00:18:21 +0800
38837fc75 [PATCH] cpuset: top_cpuset tracks hotplug changes to node_online_map ... Browse Code »

Change the list of memory nodes allowed to tasks in the top (root) nodeset
to dynamically track what cpus are online, using a call to a cpuset hook
from the memory hotplug code. Make this top cpus file read-only.

On systems that have cpusets configured in their kernel, but that aren't
actively using cpusets (for some distros, this covers the majority of
systems) all tasks end up in the top cpuset.

If that system does support memory hotplug, then these tasks cannot make
use of memory nodes that are added after system boot, because the memory
nodes are not allowed in the top cpuset. This is a surprising regression
over earlier kernels that didn't have cpusets enabled.

One key motivation for this change is to remain consistent with the
behaviour for the top_cpuset's 'cpus', which is also read-only, and which
automatically tracks the cpu_online_map.

This change also has the minor benefit that it fixes a long standing,
little noticed, minor bug in cpusets. The cpuset performance tweak to
short circuit the cpuset_zone_allowed() check on systems with just a single
cpuset (see 'number_of_cpusets', in linux/cpuset.h) meant that simply
changing the 'mems' of the top_cpuset had no affect, even though the change
(the write system call) appeared to succeed. With the following change,
that write to the 'mems' file fails -EACCES, and the 'mems' file stubbornly
refuses to be changed via user space writes. Thus no one should be mislead
into thinking they've changed the top_cpusets's 'mems' when in affect they
haven't.

In order to keep the behaviour of cpusets consistent between systems
actively making use of them and systems not using them, this patch changes
the behaviour of the 'mems' file in the top (root) cpuset, making it read
only, and making it automatically track the value of node_online_map. Thus
tasks in the top cpuset will have automatic use of hot plugged memory nodes
allowed by their cpuset.

[akpm@osdl.org: build fix]
[bunk@stusta.de: build fix]
Signed-off-by: Paul Jackson
Signed-off-by: Adrian Bunk
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Paul Jackson
2006-09-30 00:18:21 +0800
f400e198b [PATCH] pidspace: is_init() ... Browse Code »

This is an updated version of Eric Biederman's is_init() patch.
(http://lkml.org/lkml/2006/2/6/280). It applies cleanly to 2.6.18-rc3 and
replaces a few more instances of ->pid == 1 with is_init().

Further, is_init() checks pid and thus removes dependency on Eric's other
patches for now.

Eric's original description:

There are a lot of places in the kernel where we test for init
because we give it special properties. Most significantly init
must not die. This results in code all over the kernel test
->pid == 1.

Introduce is_init to capture this case.

With multiple pid spaces for all of the cases affected we are
looking for only the first process on the system, not some other
process that has pid == 1.

Signed-off-by: Eric W. Biederman
Signed-off-by: Sukadev Bhattiprolu
Cc: Dave Hansen
Cc: Serge Hallyn
Cc: Cedric Le Goater
Cc:
Acked-by: Paul Mackerras
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sukadev Bhattiprolu
2006-09-30 00:18:12 +0800

27 Sep, 2006

1 commit

ba52de123 [PATCH] inode-diet: Eliminate i_blksize from the inode structure ... Browse Code »

This eliminates the i_blksize field from struct inode. Filesystems that want
to provide a per-inode st_blksize can do so by providing their own getattr
routine instead of using the generic_fillattr() function.

Note that some filesystems were providing pretty much random (and incorrect)
values for i_blksize.

[bunk@stusta.de: cleanup]
[akpm@osdl.org: generic_fillattr() fix]
Signed-off-by: "Theodore Ts'o"
Signed-off-by: Adrian Bunk
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Theodore Ts'o
2006-09-27 23:26:18 +0800

26 Sep, 2006

2 commits

89fa30242 [PATCH] NUMA: Add zone_to_nid function ... Browse Code »

There are many places where we need to determine the node of a zone.
Currently we use a difficult to read sequence of pointer dereferencing.
Put that into an inline function and use throughout VM. Maybe we can find
a way to optimize the lookup in the future.

Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Lameter
2006-09-26 23:48:52 +0800
9b819d204 [PATCH] Add __GFP_THISNODE to avoid fallback to other nodes and ignore cpuset/me… ... Browse Code »

…mory policy restrictions

Add a new gfp flag __GFP_THISNODE to avoid fallback to other nodes. This
flag is essential if a kernel component requires memory to be located on a
certain node. It will be needed for alloc_pages_node() to force allocation
on the indicated node and for alloc_pages() to force allocation on the
current node.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

Christoph Lameter
2006-09-26 23:48:50 +0800

28 Aug, 2006

2 commits

0d673a5a4 [PATCH] cpuset: oom panic fix ... Browse Code »

cpuset_excl_nodes_overlap always returns 0 if current is exiting. This caused
customer's systems to panic in the OOM killer when processes were having
trouble getting memory for the final put_user in mm_release. Even though
there were lots of processes to kill.

Change to returning 1 in this case. This achieves parity with !CONFIG_CPUSETS
case, and was observed to fix the problem.

Signed-off-by: Nick Piggin
Acked-by: Paul Jackson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2006-08-28 02:01:32 +0800
4c4d50f7b [PATCH] cpuset: top_cpuset tracks hotplug changes to cpu_online_map ... Browse Code »

Change the list of cpus allowed to tasks in the top (root) cpuset to
dynamically track what cpus are online, using a CPU hotplug notifier. Make
this top cpus file read-only.

On systems that have cpusets configured in their kernel, but that aren't
actively using cpusets (for some distros, this covers the majority of
systems) all tasks end up in the top cpuset.

If that system does support CPU hotplug, then these tasks cannot make use
of CPUs that are added after system boot, because the CPUs are not allowed
in the top cpuset. This is a surprising regression over earlier kernels
that didn't have cpusets enabled.

In order to keep the behaviour of cpusets consistent between systems
actively making use of them and systems not using them, this patch changes
the behaviour of the 'cpus' file in the top (root) cpuset, making it read
only, and making it automatically track the value of cpu_online_map. Thus
tasks in the top cpuset will have automatic use of hot plugged CPUs allowed
by their cpuset.

Thanks to Anton Blanchard and Nathan Lynch for reporting this problem,
driving the fix, and earlier versions of this patch.

Signed-off-by: Paul Jackson
Cc: Nathan Lynch
Cc: Anton Blanchard
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Paul Jackson
2006-08-28 02:01:32 +0800

24 Jul, 2006

1 commit

abb5a5cc6 [PATCH] Cpuset: fix ABBA deadlock with cpu hotplug lock ... Browse Code »

Fix ABBA deadlock between lock_cpu_hotplug() and the cpuset
callback_mutex lock.

It only happens on cpu_exclusive cpusets, due to the dynamic
sched domain code trying to take the cpu hotplug lock inside
the cpuset callback_mutex lock.

This bug has apparently been here for several months, but didn't
get hit until the right customer load on a large system.

This fix appears right from inspection, but it will take a few
more days running it on that customers workload to be confident
we nailed it. We don't have any other reproducible test case.

The cpu_hotplug_lock() tends to cover large runs of code.
The other places that hold both that lock and the cpuset callback
mutex lock always nest the cpuset lock inside the hotplug lock.
This place tries to do the reverse, risking an ABBA deadlock.

This is in the cpuset_rmdir() code, where we:
* take the callback_mutex lock
* mark the cpuset CS_REMOVED
* call update_cpu_domains for cpu_exclusive cpusets
* in that call, take the cpu_hotplug lock if the
cpuset is marked for removal.

Thanks to Jack Steiner for identifying this deadlock.

The fix is to tear down the dynamic sched domain before we grab
the cpuset callback_mutex lock. This way, the two locks are
serialized, with the hotplug lock taken and released before
trying for the cpuset lock.

I suspect that this bug was introduced when I changed the
cpuset locking from one lock to two. The dynamic sched domain
dependency on cpu_exclusive cpusets and its hotplug hooks were
added to this code earlier, when cpusets had only a single lock.
It may well have been fine then.

Signed-off-by: Paul Jackson
Signed-off-by: Linus Torvalds

Paul Jackson
2006-07-24 04:03:05 +0800

01 Jul, 2006

2 commits

6ab3d5624 Remove obsolete #include <linux/config.h> ... Browse Code »

Signed-off-by: Jörn Engel
Signed-off-by: Adrian Bunk

Jörn Engel
2006-07-01 01:25:36 +0800
80f7228b5 typo fixes: occuring -> occurring ... Browse Code »

Signed-off-by: Adrian Bunk

Adrian Bunk
2006-07-01 00:27:16 +0800