02 Feb, 2006

1 commit


19 Jan, 2006

8 commits

  • Move the interrupt check from slab_node into ___cache_alloc and add an
    "unlikely()" to avoid pipeline stalls on some architectures.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
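
    A minimal sketch of the pattern described above (the interrupt check hoisted
    into the allocation path behind unlikely()); the helper names are
    hypothetical, not the actual mm/slab.c code:

        /* Sketch only: not the upstream ___cache_alloc(). */
        static void *cache_alloc_sketch(struct my_cache *cachep, gfp_t flags)
        {
            /*
             * unlikely() keeps the rarely taken branch off the hot path, so
             * the common case does not pay for a mispredicted branch or
             * pipeline stall.
             */
            if (unlikely(!in_interrupt() && current_has_mempolicy()))  /* hypothetical check */
                return alloc_obeying_policy(cachep, flags);            /* hypothetical */

            return alloc_from_local_node(cachep, flags);               /* hypothetical */
        }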
     
  • This patch fixes a regression in 2.6.14 against 2.6.13 that causes an
    imbalance in memory allocation during bootup.

    The slab allocator in 2.6.13 is not NUMA aware and simply calls
    alloc_pages(). This means that memory policies may control the behavior of
    alloc_pages(). During bootup the memory policy is set to MPOL_INTERLEAVE
    resulting in the spreading out of allocations during bootup over all
    available nodes. The slab allocator in 2.6.13 has only a single list of
    slab pages. As a result the per cpu slab cache and the spinlock controlled
    page lists may contain slab entries from off node memory. The slab
    allocator in 2.6.13 makes no effort to discern the locality of an entry on
    its lists.

    The NUMA aware slab allocator in 2.6.14 controls locality of the slab pages
    explicitly by calling alloc_pages_node(). The NUMA slab allocator manages
    slab entries by having lists of available slab pages for each node. The
    per cpu slab cache can only contain slab entries associated with the node
    local to the processor. This guarantees that the default allocation mode
    of the slab allocator always assigns local memory if available.

    Setting MPOL_INTERLEAVE as a default policy during bootup has no effect
    anymore. In 2.6.14 all node unspecific slab allocations are performed on
    the boot processor. This means that most of the key data structures are
    allocated on one node. Most processors will have to refer to these
    structures making the boot node a potential bottleneck. This may reduce
    performance and cause unnecessary memory pressure on the boot node.

    This patch implements NUMA policies in the slab layer. NUMA memory policies
    must be applied explicitly by the slab allocator itself, since the NUMA slab
    allocator no longer lets the page allocator control locality.

    The check for policies is made directly at the beginning of __cache_alloc
    using current->mempolicy. The memory policy is already frequently checked
    by the page allocator (alloc_page_vma() and alloc_pages_current()). So it
    is highly likely that the cacheline is present. For MPOL_INTERLEAVE
    kmalloc() will spread out each request to one node after another so that an
    equal distribution of allocations can be obtained during bootup.

    It is not possible to push the policy check to lower layers of the NUMA
    slab allocator since the per cpu caches now only contain slab entries from
    the current node. If the policy says that the local node is not to be
    preferred, or is forbidden, then there is no point in checking the slab
    cache or the local list of slab pages. The allocation is better directed
    immediately to the lists containing slab entries for the allowed set of
    nodes.

    This way of applying policy also fixes another strange behavior in 2.6.13.
    alloc_pages() is controlled by the memory allocation policy of the current
    process. It could therefore be that one process is running with
    MPOL_INTERLEAVE and would, for example, obtain a new page following that
    policy since no slab entries are in the lists anymore. A page can typically
    be used for multiple slab entries, but let's say that the current process is
    only using one. The other entries are then added to the slab lists. These
    are now non-local entries in the slab lists despite the possible
    availability of local pages that would provide faster access and increase
    the performance of the application.

    Another process without MPOL_INTERLEAVE may now run and expect a local slab
    entry from kmalloc(). However, there are still these free slab entries
    from the off node page obtained from the other process via MPOL_INTERLEAVE
    in the cache. The process will then get an off-node slab entry although
    other slab entries may be available that are local to that process. This
    means that the policy of one process may contaminate the locality of the
    slab caches for other processes.

    This patch in effect ensures that a per-process policy is followed for the
    allocation of slab entries and that there cannot be a memory policy
    influence from one process to another. A process with the default policy
    will always get a local slab entry if one is available. And the process
    using memory policies will get its memory arranged as requested. Off-node
    slab allocation will require the use of spinlocks and makes the use of
    per-cpu caches impossible. A process using memory policies to redirect
    allocations off-node will have to cope with additional lock overhead in
    addition to the latency added by the need to access a remote slab entry.

    Changes V1->V2
    - Remove an #ifdef CONFIG_NUMA by moving the forward declaration into the
    prior #ifdef CONFIG_NUMA section.

    - Give the function determining the node number to use a saner
    name.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
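
    A minimal sketch of the allocation flow described above: the memory policy
    is checked at the top of the allocation path, and MPOL_INTERLEAVE picks the
    next node in round-robin fashion. Helper names are hypothetical; this is
    not the exact mm/slab.c code:

        /* Sketch: apply the caller's mempolicy before touching per-cpu caches. */
        static void *slab_alloc_sketch(struct my_cache *cachep, gfp_t flags)
        {
            if (unlikely(current->mempolicy && !in_interrupt())) {
                /*
                 * For MPOL_INTERLEAVE this advances to the next allowed node
                 * on every call, spreading boot-time kmalloc()s evenly
                 * instead of piling them onto the boot node.
                 */
                int nid = policy_pick_node(current->mempolicy);       /* hypothetical */

                if (nid != numa_node_id())
                    /* Off node: per-node lists, spinlock required. */
                    return alloc_from_node_lists(cachep, flags, nid); /* hypothetical */
            }
            /* Default: the per-cpu cache, which only holds node-local objects. */
            return alloc_from_percpu_cache(cachep, flags);            /* hypothetical */
        }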
     
  • Convert mm/swapfile.c's swapon_sem to swapon_mutex.

    Signed-off-by: Ingo Molnar
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
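
    The conversion follows the standard semaphore-to-mutex pattern of that era;
    a sketch of the before and after (the real change is confined to
    mm/swapfile.c):

        /* Before: a semaphore used purely for mutual exclusion. */
        static DECLARE_MUTEX(swapon_sem);
            down(&swapon_sem);
            /* ... manipulate swap state ... */
            up(&swapon_sem);

        /* After: the dedicated mutex type, which documents intent and hooks
         * into the mutex debugging infrastructure. */
        static DEFINE_MUTEX(swapon_mutex);
            mutex_lock(&swapon_mutex);
            /* ... manipulate swap state ... */
            mutex_unlock(&swapon_mutex);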
     
  • Some bits for zone reclaim exist in 2.6.15, but they are not usable. This
    patch fixes them up, removes unused code and makes zone reclaim usable.

    Zone reclaim allows the reclaiming of pages from a zone if the number of
    free pages falls below the watermarks even if other zones still have enough
    pages available. Zone reclaim is of particular importance for NUMA
    machines. It can be more beneficial to reclaim a page than taking the
    performance penalties that come with allocating a page on a remote zone.

    Zone reclaim is enabled if the maximum distance to another node is higher
    than RECLAIM_DISTANCE, which may be defined by an arch. By default
    RECLAIM_DISTANCE is 20, which on IA64 is the distance to another node in
    the same component (enclosure or motherboard). The meaning of the NUMA
    distance information seems to vary by arch.

    If zone reclaim is not successful then no further reclaim attempts will
    occur for a certain time period (ZONE_RECLAIM_INTERVAL).

    This patch was discussed before. See

    http://marc.theaimsgroup.com/?l=linux-kernel&m=113519961504207&w=2
    http://marc.theaimsgroup.com/?l=linux-kernel&m=113408418232531&w=2
    http://marc.theaimsgroup.com/?l=linux-kernel&m=113389027420032&w=2
    http://marc.theaimsgroup.com/?l=linux-kernel&m=113380938612205&w=2

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
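
    A sketch of how the distance check can gate zone reclaim; the names follow
    the commit text, but the exact upstream code differs:

        #ifndef RECLAIM_DISTANCE
        #define RECLAIM_DISTANCE 20    /* an arch may override this default */
        #endif

        /* Sketch: turn zone reclaim on for a node with a "far" neighbour. */
        static void consider_zone_reclaim(int nid)
        {
            int other;

            for_each_online_node(other)
                if (node_distance(nid, other) > RECLAIM_DISTANCE) {
                    zone_reclaim_mode = 1;  /* prefer reclaiming local pages */
                    return;
                }
        }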
     
  • Zone reclaim has a huge impact on NUMA performance (for example, our maximum
    throughput with XFS is raised from 4GB/sec to 6GB/sec; page cache
    contamination of NUMA nodes destroys locality if one just does a large copy
    operation, which results in performance dropping for good until reboot).

    This patch:

    Resurrect may_swap in struct scan_control

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Simplify migrate_page_add after feedback from Hugh. This also allows us to
    drop one parameter from migrate_page_add.

    Signed-off-by: Christoph Lameter
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Migration code currently does not take a reference to target page
    properly, so between unlocking the pte and trying to take a new
    reference to the page with isolate_lru_page, anything could happen to
    it.

    Fix this by holding the pte lock until we get a chance to elevate the
    refcount.

    Other small cleanups while we're here.

    Signed-off-by: Nick Piggin
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
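
    A sketch of the ordering fix: elevate the page's refcount while the pte
    lock still pins it, and only then drop the lock; details and error handling
    are omitted, so this is not the literal upstream diff:

        spinlock_t *ptl;
        pte_t *pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        struct page *page = vm_normal_page(vma, addr, *pte);

        if (page)
            get_page(page);             /* take the reference under the lock */
        pte_unmap_unlock(pte, ptl);     /* only now is it safe to unlock */

        if (page) {
            isolate_lru_page(page);     /* page cannot have been freed meanwhile
                                         * (return value handling omitted) */
            put_page(page);
        }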
     
  • Ravikiran reports that this variable is bouncing all around nodes on NUMA
    machines, causing measurable performance problems. Fix that up by only
    writing to it when it actually changed.

    And put it in a new cacheline to prevent it from sharing a cacheline with
    other things (which was happening).

    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
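
    A sketch of the two ingredients of the fix: only write when the value
    actually changes, and give the variable a cacheline of its own. The
    variable name here is illustrative, not the one touched by the patch:

        /* Illustrative only. */
        static int shared_flag ____cacheline_aligned_in_smp;

        static void update_shared_flag(int new_value)
        {
            /*
             * Skip the store when nothing changed, so the cacheline is not
             * dirtied and does not bounce between NUMA nodes.
             */
            if (shared_flag != new_value)
                shared_flag = new_value;
        }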
     

17 Jan, 2006

1 commit

  • Add __meminit to the __init lineup to ensure functions default
    to __init when memory hotplug is not enabled. Replace __devinit
    with __meminit on functions that were changed when the memory
    hotplug code was introduced.

    Signed-off-by: Matt Tolentino
    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Matt Tolentino
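
    A sketch of the idea behind the annotation (simplified; see
    include/linux/init.h for the real definitions): when memory hotplug is
    enabled the code must stay resident, otherwise it can be discarded along
    with the init text.

        #ifdef CONFIG_MEMORY_HOTPLUG
        #define __meminit
        #define __meminitdata
        #else
        #define __meminit       __init
        #define __meminitdata   __initdata
        #endif

        /* Hypothetical usage: discarded after boot unless hotplug may call it. */
        int __meminit setup_new_memory_section(unsigned long start_pfn,
                                               unsigned long nr_pages);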
     

15 Jan, 2006

2 commits

  • The problem, reported in:

    http://bugzilla.kernel.org/show_bug.cgi?id=5859

    and by various other email messages and lkml posts is that the cpuset hook
    in the oom (out of memory) code can try to take a cpuset semaphore while
    holding the tasklist_lock (a spinlock).

    One must not sleep while holding a spinlock.

    The fix seems easy enough - move the cpuset semaphore region outside the
    tasklist_lock region.

    This required a few lines of mechanism to implement. The oom code where
    the locking needs to be changed does not have access to the cpuset locks,
    which are internal to kernel/cpuset.c only. So I provided a couple more
    cpuset interface routines, available to the rest of the kernel, which
    simply take and drop the lock needed here (the cpuset's callback_sem).

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
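
    A sketch of the resulting lock ordering in the OOM path; the cpuset_lock()
    and cpuset_unlock() names follow the commit's description of the new
    interface routines, and everything else is condensed:

        cpuset_lock();                  /* may sleep: takes callback_sem inside cpuset.c */
        read_lock(&tasklist_lock);      /* spinlock: no sleeping allowed from here on */

        /* ... walk the tasklist and pick an OOM victim, consulting cpuset
         *     state without taking the cpuset semaphore here ... */

        read_unlock(&tasklist_lock);
        cpuset_unlock();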
     
  • Anything that writes into a tmpfs filesystem is liable to disproportionately
    decrease the available memory on a particular node. Since there's no telling
    what sort of application (e.g. dd/cp/cat) might be dropping large files
    there, this lets the admin choose the appropriate default behavior for their
    site's situation.

    Introduce a tmpfs mount option which allows specifying a memory policy and
    a second option to specify the nodelist for that policy. With the default
    policy, tmpfs will behave as it does today. This patch adds support for
    preferred, bind, and interleave policies.

    The default policy will cause pages to be added to tmpfs files on the node
    which is doing the writing. Some jobs expect a single process to create
    and manage the tmpfs files. This results in a node which has a
    significantly reduced number of free pages.

    With this patch, the administrator can specify the policy and nodes for
    that policy where they would prefer allocations.

    This patch was originally written by Brent Casavant and Hugh Dickins. I
    added support for the bind and preferred policies and the mpol_nodelist
    mount option.

    Signed-off-by: Brent Casavant
    Signed-off-by: Hugh Dickins
    Signed-off-by: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
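
    For illustration, a userspace sketch of mounting tmpfs with a memory
    policy. The exact option string is an assumption based on the commit text;
    check Documentation/filesystems/tmpfs.txt for the syntax of the kernel at
    hand:

        #include <stdio.h>
        #include <sys/mount.h>

        int main(void)
        {
            /* Assumed syntax: interleave tmpfs pages across nodes 0-3. */
            if (mount("tmpfs", "/mnt/scratch", "tmpfs", 0,
                      "size=512m,mpol=interleave:0-3") != 0) {
                perror("mount");
                return 1;
            }
            return 0;
        }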
     

13 Jan, 2006

4 commits


12 Jan, 2006

5 commits


11 Jan, 2006

3 commits


10 Jan, 2006

3 commits


09 Jan, 2006

13 commits

  • The patch makes posix_fadvise return ESPIPE on FIFO/pipe in order to be
    fully POSIX-compliant.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Valentine Barshak
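
    The change amounts to an early check in the fadvise path; a sketch of the
    kind of test involved (not the literal upstream code):

        /* POSIX: posix_fadvise() shall fail with ESPIPE when fd refers to a pipe. */
        if (S_ISFIFO(inode->i_mode))
            return -ESPIPE;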
     
  • This patch adds EXPORT_SYMBOL(filemap_write_and_wait) and uses it.

    See mm/filemap.c.

    It also changes filemap_write_and_wait() and filemap_write_and_wait_range().

    The current filemap_write_and_wait() doesn't wait if filemap_fdatawrite()
    returns an error. However, even if filemap_fdatawrite() returned an error,
    it may have already submitted part of the data pages to the device (e.g. in
    the case of -ENOSPC).

    Andrew Morton writes,

    If filemap_fdatawrite() returns an error, this might be due to some
    I/O problem: dead disk, unplugged cable, etc. Given the generally
    crappy quality of the kernel's handling of such exceptions, there's a
    good chance that the filemap_fdatawait() will get stuck in D state
    forever.

    So, this patch doesn't wait if filemap_fdatawrite() returns -EIO.

    Trond, could you please review the nfs part? In particular, I'm not sure
    whether nfs must use "filemap_fdatawrite(inode->i_mapping) == 0" or not.

    Acked-by: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
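
    A sketch of the resulting behaviour, close to (but not necessarily
    identical to) the merged helper:

        /*
         * Still wait after most write errors, since pages may already be in
         * flight (e.g. after -ENOSPC), but not after -EIO, where waiting
         * risks hanging in D state forever.
         */
        int filemap_write_and_wait_sketch(struct address_space *mapping)
        {
            int err = 0;

            if (mapping->nrpages) {
                err = filemap_fdatawrite(mapping);
                if (err != -EIO) {
                    int err2 = filemap_fdatawait(mapping);
                    if (!err)
                        err = err2;
                }
            }
            return err;
        }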
     
  • This exports and changes sync_page_range()/sync_page_range_nolock(). The
    fatfs needs sync_page_range()/sync_page_range_nolock() for expanding
    truncate, and the patch changes "size_t count" to "loff_t count".

    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
     
  • Fix more of longstanding bug in cpuset/mempolicy interaction.

    NUMA mempolicies (mm/mempolicy.c) are constrained by the current task's cpuset
    to just the Memory Nodes allowed by that cpuset. The kernel maintains
    internal state for each mempolicy, tracking what nodes are used for the
    MPOL_INTERLEAVE, MPOL_BIND or MPOL_PREFERRED policies.

    When a task's cpuset memory placement changes, whether because the cpuset
    changed, or because the task was attached to a different cpuset, then the
    task's mempolicies have to be rebound to the new cpuset placement, so as to
    preserve the cpuset-relative numbering of the nodes in that policy.

    An earlier fix handled such mempolicy rebinding for mempolicies attached to a
    task.

    This fix rebinds mempolicies attached to vma's (address ranges in a task's
    address space). Due to the need to hold the task->mm->mmap_sem semaphore while
    updating vma's, the rebinding of vma mempolicies has to be done when the
    cpuset memory placement is changed, at which time mmap_sem can be safely
    acquired. The task's mempolicy is rebound later, when the task next attempts
    to allocate memory and notices that its task->cpuset_mems_generation is
    out of date with its cpuset's mems_generation.

    Because walking the tasklist to find all tasks attached to a changing cpuset
    requires holding tasklist_lock, a spinlock, one cannot update the vma's of the
    affected tasks while doing the tasklist scan. In general, one cannot acquire
    a semaphore (which can sleep) while already holding a spinlock (such as
    tasklist_lock). So a list of mm references has to be built up during the
    tasklist scan, then the tasklist lock dropped, then for each mm, its mmap_sem
    acquired, and the vma's in that mm rebound.

    Once the tasklist lock is dropped, affected tasks may fork new tasks, before
    their mm's are rebound. A kernel global 'cpuset_being_rebound' is set to
    point to the cpuset being rebound (there can only be one; cpuset modifications
    are done under a global 'manage_sem' semaphore), and the mpol_copy code that
    is used to copy a task's mempolicies during fork catches such forking tasks,
    and ensures their children are also rebound.

    When a task is moved to a different cpuset, it is easier, as there is only one
    task involved. Its mm->vma's are scanned, using the same
    mpol_rebind_policy() as used above.

    It may happen that both the mpol_copy hook and the update done via the
    tasklist scan update the same mm twice. This is ok, as the mempolicies of
    each vma in an mm keep track of what mems_allowed they are relative to, and
    safely no-op a second request to rebind to the same nodes.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
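
    A sketch of the two-phase scheme described above: collect mm references
    under the tasklist spinlock, then rebind each mm's vma policies under
    mmap_sem once the spinlock has been dropped. Several helpers here are
    hypothetical condensations of the cpuset.c code, not its actual
    identifiers:

        /* Phase 1: under tasklist_lock (a spinlock), only collect mm references. */
        read_lock(&tasklist_lock);
        for_each_process(p) {
            if (!task_in_cpuset(p, cs))         /* hypothetical */
                continue;
            mm = get_task_mm(p);                /* takes a reference, does not sleep */
            if (mm)
                mmarray[n++] = mm;
        }
        read_unlock(&tasklist_lock);

        /* Phase 2: sleeping locks are safe again; rebind each vma's policy. */
        for (i = 0; i < n; i++) {
            mm = mmarray[i];
            down_write(&mm->mmap_sem);
            for (vma = mm->mmap; vma; vma = vma->vm_next)
                mpol_rebind_policy(vma->vm_policy, &cs->mems_allowed);
            up_write(&mm->mmap_sem);
            mmput(mm);
        }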
     
  • Clean up, reorganize, and make more robust the mempolicy.c code that rebinds
    mempolicies relative to the containing cpuset after a task's memory placement
    changes.

    The real motivator for this cleanup patch is to lay more groundwork for the
    upcoming patch to correctly rebind NUMA mempolicies that are attached to vma's
    after the containing cpuset memory placement changes.

    NUMA mempolicies are constrained by the cpuset their task is a member of.
    When either (1) a task is moved to a different cpuset, or (2) the 'mems'
    mems_allowed of a cpuset is changed, then the NUMA mempolicies have embedded
    node numbers (for MPOL_BIND, MPOL_INTERLEAVE and MPOL_PREFERRED) that need to
    be recalculated, relative to their new cpuset placement.

    The old code used an unreliable method of determining what was the old
    mems_allowed constraining the mempolicy. It just looked at the task's
    mems_allowed value. This sort of worked with the present code, which just
    rebinds the -task- mempolicy, and leaves any -vma- mempolicies broken,
    referring to the old nodes. But in an upcoming patch, the vma mempolicies
    will be rebound as well. Then the order in which the various task and vma
    mempolicies are updated will no longer be deterministic, and one can no longer
    count on the task->mems_allowed holding the old value for as long as needed.
    It's not even clear if the current code was guaranteed to work reliably for
    task mempolicies.

    So I added a mems_allowed field to each mempolicy, stating exactly what
    mems_allowed the policy is relative to, and updated synchronously and reliably
    anytime that the mempolicy is rebound.

    Also removed a useless wrapper routine, numa_policy_rebind(), and had its
    caller, cpuset_update_task_memory_state(), call directly to the rewritten
    policy_rebind() routine, and made that rebind routine extern instead of
    static, and added a "mpol_" prefix to its name, making it
    mpol_rebind_policy().

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Provide a cpuset_mems_allowed() method, which the sys_migrate_pages() code
    needed, to obtain the mems_allowed vector of a cpuset, and replace the
    workaround in sys_migrate_pages() with a call to this new method.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • The important code paths through alloc_pages_current() and alloc_page_vma(),
    by which most kernel page allocations go, both called
    cpuset_update_current_mems_allowed(), which in turn called refresh_mems().
    -Both- of these latter two routines took the task lock, got the task's cpuset
    pointer, and checked for an out-of-date cpuset->mems_generation.

    That was a silly duplication of code and waste of CPU cycles on an important
    code path.

    Consolidated those two routines into a single routine, called
    cpuset_update_task_memory_state(), since it updates more than just
    mems_allowed.

    Changed all callers of either routine to call the new consolidated routine.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Provide a simple per-cpuset metric of memory pressure, tracking the -rate-
    at which the tasks in a cpuset call try_to_free_pages(), the synchronous
    (direct) memory reclaim code.

    This enables batch managers monitoring jobs running in dedicated cpusets to
    efficiently detect what level of memory pressure that job is causing.

    This is useful both on tightly managed systems running a wide mix of
    submitted jobs, which may choose to terminate or reprioritize jobs that are
    trying to use more memory than allowed on the nodes assigned them, and with
    tightly coupled, long running, massively parallel scientific computing jobs
    that will dramatically fail to meet required performance goals if they
    start to use more memory than allowed to them.

    This patch just provides a very economical way for the batch manager to
    monitor a cpuset for signs of memory pressure. It's up to the batch
    manager or other user code to decide what to do about it and take action.

    ==> Unless this feature is enabled by writing "1" to the special file
    /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
    code of __alloc_pages() for this metric reduces to simply noticing
    that the cpuset_memory_pressure_enabled flag is zero. So only
    systems that enable this feature will compute the metric.

    Why a per-cpuset, running average:

    Because this meter is per-cpuset, rather than per-task or mm, the
    system load imposed by a batch scheduler monitoring this metric is
    sharply reduced on large systems, because a scan of the tasklist can be
    avoided on each set of queries.

    Because this meter is a running average, instead of an accumulating
    counter, a batch scheduler can detect memory pressure with a single
    read, instead of having to read and accumulate results for a period of
    time.

    Because this meter is per-cpuset rather than per-task or mm, the
    batch scheduler can obtain the key information, memory pressure in a
    cpuset, with a single read, rather than having to query and accumulate
    results over all the (dynamically changing) set of tasks in the cpuset.

    A per-cpuset simple digital filter (requires a spinlock and 3 words of data
    per-cpuset) is kept, and updated by any task attached to that cpuset, if it
    enters the synchronous (direct) page reclaim code.

    A per-cpuset file provides an integer number representing the recent
    (half-life of 10 seconds) rate of direct page reclaims caused by the tasks
    in the cpuset, in units of reclaims attempted per second, times 1000.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
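
    The per-cpuset meter is essentially an exponentially decaying event-rate
    filter. Below is a self-contained userspace model of the idea (10-second
    half-life, reported as events per second times 1000); the constants, field
    names, and decay scheme are illustrative rather than the kernel's, and
    locking is omitted:

        #include <stdio.h>
        #include <time.h>

        #define FM_HALFLIFE 10   /* seconds for the measured rate to decay by half */
        #define FM_SCALE    1000 /* report events/sec multiplied by 1000 */

        struct fmeter {
            int    val;          /* decayed rate, in events/sec * FM_SCALE */
            int    pending;      /* events not yet folded into val */
            time_t updated;      /* last time val was decayed */
        };

        /* Halve val for every FM_HALFLIFE seconds elapsed, then fold in pending
         * events as if they all occurred within the last second. */
        static void fmeter_update(struct fmeter *fm)
        {
            time_t now = time(NULL);

            while (fm->updated + FM_HALFLIFE <= now) {
                fm->val /= 2;
                fm->updated += FM_HALFLIFE;
            }
            fm->val += fm->pending * FM_SCALE;
            fm->pending = 0;
        }

        /* Called from the direct-reclaim path in the kernel analogy. */
        static void fmeter_markevent(struct fmeter *fm)
        {
            fmeter_update(fm);
            fm->pending++;
        }

        /* What a batch manager would read from the per-cpuset file. */
        static int fmeter_getrate(struct fmeter *fm)
        {
            fmeter_update(fm);
            return fm->val;
        }

        int main(void)
        {
            struct fmeter fm = { 0, 0, time(NULL) };

            fmeter_markevent(&fm);
            fmeter_markevent(&fm);
            printf("rate = %d (events/sec * 1000)\n", fmeter_getrate(&fm));
            return 0;
        }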
     
  • Finish converting mm/mempolicy.c from bitmaps to nodemasks. The previous
    conversion had left one routine using bitmaps, since it involved a
    corresponding change to kernel/cpuset.c.

    Fix that interface by replacing it with a simple macro that calls
    nodes_subset(), or, if !CONFIG_CPUSET, returns (1).

    Signed-off-by: Paul Jackson
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
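
    A sketch of the replacement macro's shape; the macro name here follows the
    commit's description and may not match the final identifier exactly:

        #ifdef CONFIG_CPUSETS
        #define cpuset_nodes_subset_current_mems_allowed(nodes) \
                nodes_subset((nodes), current->mems_allowed)
        #else
        #define cpuset_nodes_subset_current_mems_allowed(nodes) (1)
        #endif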
     
  • configurable replacement for slab allocator

    This adds a CONFIG_SLAB option under CONFIG_EMBEDDED. When CONFIG_SLAB is
    disabled, the kernel falls back to using the 'SLOB' allocator.

    SLOB is a traditional K&R/UNIX allocator with a SLAB emulation layer,
    similar to the original Linux kmalloc allocator that SLAB replaced. Its
    code is significantly smaller and it is more memory efficient. But like all
    similar allocators, it scales poorly and suffers more from fragmentation
    than SLAB, so it is only appropriate for small systems.

    It's been tested extensively in the Linux-tiny tree. I've also
    stress-tested it with make -j 8 compiles on a 3G SMP+PREEMPT box (not
    recommended).

    Here's a comparison for otherwise identical builds, showing SLOB saving
    nearly half a megabyte of RAM:

    $ size vmlinux*
          text    data     bss     dec    hex filename
       3336372  529360  190812 4056544 3de5e0 vmlinux-slab
       3323208  527948  190684 4041840 3dac70 vmlinux-slob

    $ size mm/{slab,slob}.o
          text    data     bss     dec    hex filename
         13221     752      48   14021   36c5 mm/slab.o
          1896      52       8    1956    7a4 mm/slob.o

    /proc/meminfo:
                        SLAB        SLOB      delta
    MemTotal:       27964 kB    27980 kB     +16 kB
    MemFree:        24596 kB    25092 kB    +496 kB
    Buffers:           36 kB       36 kB       0 kB
    Cached:          1188 kB     1188 kB       0 kB
    SwapCached:         0 kB        0 kB       0 kB
    Active:           608 kB      600 kB      -8 kB
    Inactive:         808 kB      812 kB      +4 kB
    HighTotal:          0 kB        0 kB       0 kB
    HighFree:           0 kB        0 kB       0 kB
    LowTotal:       27964 kB    27980 kB     +16 kB
    LowFree:        24596 kB    25092 kB    +496 kB
    SwapTotal:          0 kB        0 kB       0 kB
    SwapFree:           0 kB        0 kB       0 kB
    Dirty:              4 kB       12 kB      +8 kB
    Writeback:          0 kB        0 kB       0 kB
    Mapped:           560 kB      556 kB      -4 kB
    Slab:            1756 kB        0 kB   -1756 kB
    CommitLimit:    13980 kB    13988 kB      +8 kB
    Committed_AS:    4208 kB     4208 kB       0 kB
    PageTables:        28 kB       28 kB       0 kB
    VmallocTotal: 1007312 kB  1007312 kB       0 kB
    VmallocUsed:       48 kB       48 kB       0 kB
    VmallocChunk: 1007264 kB  1007264 kB       0 kB

    (this work has been sponsored in part by CELF)

    From: Ingo Molnar

    Fix 32-bitness bugs in mm/slob.c.

    Signed-off-by: Matt Mackall
    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
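
    To illustrate what a "SLAB emulation layer" means, here is a self-contained
    userspace sketch: the cache-style API is preserved, but every allocation
    simply falls through to a general-purpose allocator (malloc standing in for
    SLOB's first-fit heap). This is a model of the concept, not the kernel
    code:

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        /* Minimal stand-in for struct kmem_cache: remembers only size and ctor. */
        struct kmem_cache {
            size_t size;
            void (*ctor)(void *obj);
        };

        struct kmem_cache *kmem_cache_create(size_t size, void (*ctor)(void *))
        {
            struct kmem_cache *c = malloc(sizeof(*c));

            if (c) {
                c->size = size;
                c->ctor = ctor;
            }
            return c;
        }

        /* "Emulation": no per-cache slabs or per-cpu queues; each object comes
         * straight from the underlying general-purpose allocator. */
        void *kmem_cache_alloc(struct kmem_cache *c)
        {
            void *obj = malloc(c->size);

            if (obj && c->ctor)
                c->ctor(obj);
            return obj;
        }

        void kmem_cache_free(struct kmem_cache *c, void *obj)
        {
            (void)c;
            free(obj);
        }

        int main(void)
        {
            struct kmem_cache *names = kmem_cache_create(32, NULL);
            char *n = names ? kmem_cache_alloc(names) : NULL;

            if (!n)
                return 1;
            strcpy(n, "slob");
            printf("%s\n", n);
            kmem_cache_free(names, n);
            free(names);
            return 0;
        }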
     
  • Add mm/util.c for functions common between SLAB and SLOB.

    Signed-off-by: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • ____cacheline_maxaligned_in_smp is currently used to align critical structures
    and avoid false sharing. It uses per-arch L1_CACHE_SHIFT_MAX and people find
    L1_CACHE_SHIFT_MAX useless.

    However, we have been using ____cacheline_maxaligned_in_smp to align
    structures on the internode cacheline size. As per Andi's suggestion, the
    following patch kills ____cacheline_maxaligned_in_smp and introduces
    INTERNODE_CACHE_SHIFT, which defaults to L1_CACHE_SHIFT for all arches.
    Arches needing L3/internode cacheline alignment can define
    INTERNODE_CACHE_SHIFT in the arch asm/cache.h. The patch replaces
    ____cacheline_maxaligned_in_smp with ____cacheline_internodealigned_in_smp.

    With this patch, L1_CACHE_SHIFT_MAX can be killed.

    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Shai Fultheim
    Signed-off-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
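
    A sketch of the new macros' shape, following the commit text (the in-tree
    version has additional CONFIG_SMP conditionals):

        /* Default to the L1 shift; arches with a larger internode (e.g. L3)
         * cacheline define INTERNODE_CACHE_SHIFT in their asm/cache.h. */
        #ifndef INTERNODE_CACHE_SHIFT
        #define INTERNODE_CACHE_SHIFT L1_CACHE_SHIFT
        #endif

        #define INTERNODE_CACHE_BYTES (1 << INTERNODE_CACHE_SHIFT)

        #define ____cacheline_internodealigned_in_smp \
                __attribute__((__aligned__(INTERNODE_CACHE_BYTES)))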
     
  • When the oom_killer kills current there's no need to call
    schedule_timeout_interruptible() since the task must die ASAP.

    Signed-Off-By: Pavel Emelianov
    Signed-Off-By: Kirill Korotaev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev