08 May, 2007

38 commits

  • This is a new slab allocator which was motivated by the complexity of the
    existing code in mm/slab.c. It attempts to address a variety of concerns
    with the existing implementation.

    A. Management of object queues

    A particular concern was the complex management of the numerous object
    queues in SLAB. SLUB has no such queues. Instead we dedicate a slab for
    each allocating CPU and use objects from a slab directly instead of
    queueing them up.

    B. Storage overhead of object queues

    SLAB object queues exist per node, per CPU. The alien cache queue even
    has a queue array that contains a queue for each processor on each
    node. For very large systems the number of queues and the number of
    objects that may be caught in those queues grows exponentially. On our
    systems with 1k nodes / processors we have several gigabytes just tied up
    for storing references to objects for those queues. This does not include
    the objects that could be on those queues. One fears that the whole
    memory of the machine could one day be consumed by those queues.

    C. SLAB meta data overhead

    SLAB has overhead at the beginning of each slab. This means that data
    cannot be naturally aligned at the beginning of a slab block. SLUB keeps
    all meta data in the corresponding page_struct. Objects can be naturally
    aligned in the slab. For example, a 128 byte object will be aligned at
    128 byte boundaries and can fit tightly into a 4k page with no bytes left
    over.
    SLAB cannot do this.

    D. SLAB has a complex cache reaper

    SLUB does not need a cache reaper for UP systems. On SMP systems
    the per CPU slab may be pushed back into partial list but that
    operation is simple and does not require an iteration over a list
    of objects. SLAB expires per CPU, shared and alien object queues
    during cache reaping which may cause strange hold offs.

    E. SLAB has complex NUMA policy layer support

    SLUB pushes NUMA policy handling into the page allocator. This means that
    allocation is coarser (SLUB does interleave on a page level) but that
    situation was also present before 2.6.13. SLAB's application of
    policies to individual slab objects is certainly a performance
    concern due to the frequent references to memory policies, which
    may cause a sequence of objects to come from one node after
    another. SLUB will get a slab full of objects from one node and
    then will switch to the next.

    F. Reduction of the size of partial slab lists

    SLAB has per node partial lists. This means that over time a large
    number of partial slabs may accumulate on those lists. These can
    only be reused if allocations occur on specific nodes. SLUB has a global
    pool of partial slabs and will consume slabs from that pool to
    decrease fragmentation.

    G. Tunables

    SLAB has sophisticated tuning abilities for each slab cache. One can
    manipulate the queue sizes in detail. However, filling the queues still
    requires the use of the spin lock to check out slabs. SLUB has a global
    parameter (slub_min_order) for tuning. Increasing the minimum slab
    order can decrease the locking overhead. The bigger the slab order, the
    fewer motions of pages between the per CPU and partial lists occur, and
    the better SLUB will scale.

    H. Slab merging

    We often have slab caches with similar parameters. SLUB detects those
    on boot up and merges them into the corresponding general caches. This
    leads to more effective memory use. About 50% of all caches can
    be eliminated through slab merging. This will also decrease
    slab fragmentation because partially allocated slabs can be filled
    up again. Slab merging can be switched off by specifying
    slub_nomerge on boot up.

    Note that merging can expose heretofore unknown bugs in the kernel
    because corrupted objects may now be placed differently and corrupt
    differing neighboring objects. Enable sanity checks to find those.

    I. Diagnostics

    The current slab diagnostics are difficult to use and require a
    recompilation of the kernel. SLUB contains debugging code that
    is always available (but is kept out of the hot code paths).
    SLUB diagnostics can be enabled via the "slub_debug" option.
    Parameters can be specified to select a single or a group of
    slab caches for diagnostics. This means that the system is running
    with the usual performance and it is much more likely that
    race conditions can be reproduced.

    J. Resiliency

    If basic sanity checks are on then SLUB is capable of detecting
    common error conditions and recovering as best as possible to allow the
    system to continue.

    K. Tracing

    Tracing can be enabled via the slub_debug=T,<slab cache> option
    during boot. SLUB will then log all actions on that slab cache
    and dump the object contents on free.

    L. On demand DMA cache creation

    Generally DMA caches are not needed. If a kmalloc is used with
    __GFP_DMA then just the single slab cache that is needed is created.
    For systems that have no ZONE_DMA requirement the support is
    completely eliminated.

    M. Performance increase

    Some benchmarks have shown speed improvements on kernbench in the
    range of 5-10%. The locking overhead of SLUB depends on the
    underlying base allocation size. If we can reliably allocate
    larger order pages then it is possible to increase SLUB
    performance much further. The anti-fragmentation patches may
    enable further performance increases.

    Tested on:
    i386 UP + SMP, x86_64 UP + SMP + NUMA emulation, IA64 NUMA + Simulator

    SLUB Boot options

    slub_nomerge          Disable merging of slabs
    slub_min_order=x      Require a minimum order for slab caches. This
                          increases the managed chunk size and therefore
                          reduces meta data and locking overhead.
    slub_min_objects=x    Minimum objects per slab. Default is 8.
    slub_max_order=x      Avoid generating slabs larger than order specified.
    slub_debug            Enable all diagnostics for all caches
    slub_debug=<options>  Enable selective options for all caches
    slub_debug=<options>,<slab name>
                          Enable selective options for a certain set of
                          caches

    Available Debug options
    F Double Free checking, sanity and resiliency
    R Red zoning
    P Object / padding poisoning
    U Track last free / alloc
    T Trace all allocs / frees (only use for individual slabs).

    To use SLUB: Apply this patch and then select SLUB as the default slab
    allocator.

    [hugh@veritas.com: fix an oops-causing locking error]
    [akpm@linux-foundation.org: various stupid cleanups and small fixes]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • If device->num is zero we attempt to kmalloc() zero bytes. When SLUB is
    enabled this returns a null pointer, which we take as an allocation
    failure, causing device registration to fail. Check for the no-device
    case and avoid the allocation.
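
    A rough sketch of the shape of the fix (the structure and variable names
    here are illustrative, not taken from the driver):

        /* device->num == 0: kmalloc(0) under SLUB yields NULL, which the old
         * code mistook for an allocation failure. */
        if (!device->num)
                return 0;                    /* nothing to register, not an error */

        entries = kzalloc(device->num * sizeof(*entries), GFP_KERNEL);
        if (!entries)
                return -ENOMEM;              /* a genuine allocation failure */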

    [akpm: opportunistic kzalloc() conversion]
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • i386 uses kmalloc to allocate the threadinfo structure, assuming that the
    allocation is page aligned. That has worked so far because SLAB exempts
    page sized slabs from debugging and aligns them in special ways that go
    beyond the restrictions imposed by KMALLOC_ARCH_MINALIGN valid for other
    slabs in the kmalloc array.

    SLUB also works fine without debugging since page sized allocations neatly
    align at page boundaries. However, if debugging is switched on then SLUB
    will extend the slab with debug information. The resulting slab is no
    longer page sized. It will only be aligned following the requirements
    imposed by KMALLOC_ARCH_MINALIGN. As a result the threadinfo structure may
    not be page aligned which makes i386 fail to boot with SLUB debug on.

    Replace the calls to kmalloc with calls into the page allocator.

    An alternate solution may be to create a custom slab cache where the
    alignment is set to PAGE_SIZE. That would allow slub debugging to be
    applied to the threadinfo structure.
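
    A sketch of what such a replacement typically looks like (macro names
    follow the i386 convention of the time; the exact hunk in the patch may
    differ):

        #define alloc_thread_info(tsk) \
                ((struct thread_info *)__get_free_pages(GFP_KERNEL, \
                                                        get_order(THREAD_SIZE)))
        #define free_thread_info(info) \
                free_pages((unsigned long)(info), get_order(THREAD_SIZE))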

    Signed-off-by: Christoph Lameter
    Cc: William Lee Irwin III
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • OOM killed tasks have access to memory reserves as specified by the
    TIF_MEMDIE flag in the hopes that it will quickly exit. If such a task has
    memory allocations constrained by cpusets, we may encounter a deadlock if a
    blocking task cannot exit because it cannot allocate the necessary memory.

    We allow tasks that have the TIF_MEMDIE flag to allocate memory anywhere,
    including outside their cpuset restriction, so that they can quickly die
    regardless of whether the allocation is __GFP_HARDWALL.
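
    The core of the idea, as a sketch of the check inside the cpuset
    allocation path (treat this as a model, not the exact hunk):

        if (unlikely(test_thread_flag(TIF_MEMDIE)))
                return 1;       /* dying task: allow memory from any node */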

    Cc: Andi Kleen
    Cc: Paul Jackson
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • It is only ever used prior to free_initmem().

    (It will cause a warning when we run the section checking, but that's a
    false-positive and it simply changes the source of an existing warning, which
    is also a false-positive)

    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The sysctl handler for min_free_kbytes calls setup_per_zone_pages_min() on
    read or write. This function iterates through every zone and calls
    spin_lock_irqsave() on the zone LRU lock. When reading min_free_kbytes,
    this is a total waste of time that disables interrupts on the local
    processor. It might even be noticeable on machines with large numbers of
    zones if a process started constantly reading min_free_kbytes.

    This patch calls setup_per_zone_pages_min() only on write. Tested on
    an x86 laptop and it did the right thing.
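
    Roughly how the sysctl handler looks after the change (the argument list
    follows the sysctl interface of that era; treat details as approximate):

        int min_free_kbytes_sysctl_handler(ctl_table *table, int write,
                struct file *file, void __user *buffer, size_t *length,
                loff_t *ppos)
        {
                proc_dointvec(table, write, file, buffer, length, ppos);
                if (write)                      /* only a write changes the value */
                        setup_per_zone_pages_min();
                return 0;
        }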

    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Some NUMA machines have a big MAX_NUMNODES (possibly 1024), but fewer
    possible nodes. This patch dynamically sizes the 'struct kmem_cache' to
    allocate only needed space.

    I moved the nodelists[] field to the end of struct kmem_cache, and use the
    following computation in kmem_cache_init():

    cache_cache.buffer_size = offsetof(struct kmem_cache, nodelists) +
                              nr_node_ids * sizeof(struct kmem_list3 *);

    On my two-node x86_64 machine, kmem_cache.obj_size is now 192 instead of
    704 (this is because on x86_64, MAX_NUMNODES is 64).

    On bigger NUMA setups, this might reduce the gfporder of "cache_cache".
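
    A stand-alone model of the arithmetic (field sizes are made up; only the
    offsetof() trick matters):

        #include <stddef.h>
        #include <stdio.h>

        struct kmem_list3;                      /* stand-in, never dereferenced */
        #define MAX_NUMNODES 64

        struct kmem_cache_model {
                char other_fields[160];                     /* everything else  */
                struct kmem_list3 *nodelists[MAX_NUMNODES]; /* moved to the end */
        };

        int main(void)
        {
                int nr_node_ids = 2;    /* what a two-node x86_64 box reports */
                size_t full = sizeof(struct kmem_cache_model);
                size_t trimmed = offsetof(struct kmem_cache_model, nodelists) +
                                 nr_node_ids * sizeof(struct kmem_list3 *);

                printf("static size %zu, dynamically sized %zu\n", full, trimmed);
                return 0;
        }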

    Signed-off-by: Eric Dumazet
    Cc: Pekka Enberg
    Cc: Andy Whitcroft
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • We can avoid allocating empty shared caches and avoid an unnecessary check
    of cache->limit. We save some memory. We avoid bringing unnecessary cache
    lines into the CPU cache.

    All accesses to l3->shared are already checking NULL pointers so this patch is
    safe.

    Signed-off-by: Eric Dumazet
    Acked-by: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • The existing comment in mm/slab.c is *perfect*, so I reproduce it:

    /*
     * CPU bound tasks (e.g. network routing) can exhibit cpu bound
     * allocation behaviour: Most allocs on one cpu, most free operations
     * on another cpu. For these cases, an efficient object passing between
     * cpus is necessary. This is provided by a shared array. The array
     * replaces Bonwick's magazine layer.
     * On uniprocessor, it's functionally equivalent (but less efficient)
     * to a larger limit. Thus disabled by default.
     */

    As most shipped Linux kernels are now compiled with CONFIG_SMP, there is no
    way a preprocessor #if can detect if the machine is UP or SMP. Better to use
    num_possible_cpus().

    This means on UP we allocate a 'size=0 shared array', to be more efficient.

    Another patch can later avoid the allocations of 'empty shared arrays', to
    save some memory.
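
    The shape of the runtime decision, as a sketch (not the exact hunk):

        /* old code: #ifdef CONFIG_SMP decided this at compile time */
        if (num_possible_cpus() > 1)
                cachep->shared = 8;     /* SMP: enable the shared array */
        else
                cachep->shared = 0;     /* UP: size-0 shared array */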

    Signed-off-by: Eric Dumazet
    Acked-by: Pekka Enberg
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • Rename file_ra_state.prev_page to prev_index and file_ra_state.offset to
    prev_offset. Also, the update of prev_index in do_generic_mapping_read() is
    now moved close to the update of prev_offset.

    [wfg@mail.ustc.edu.cn: fix it]
    Signed-off-by: Jan Kara
    Cc: Nick Piggin
    Cc: WU Fengguang
    Signed-off-by: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Introduce ra.offset and store in it an offset where the previous read
    ended. This way we can detect whether reads are really sequential (and
    thus we should not mark the page as accessed repeatedly) or whether they
    are random and just happen to be in the same page (and the page should
    really be marked accessed again).
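
    A sketch of the resulting check in do_generic_mapping_read() (variable
    names approximate):

        if (prev_index != index || offset != prev_offset)
                mark_page_accessed(page);       /* a genuinely new access */
        prev_index = index;
        prev_offset = offset;                   /* where this read ended */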

    Signed-off-by: Jan Kara
    Acked-by: Nick Piggin
    Cc: WU Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Adds /proc/pid/clear_refs. When any non-zero number is written to this file,
    pte_mkold() and ClearPageReferenced() are called for each pte and its
    corresponding page, respectively, in that task's VMAs. This file is only
    writable by the user who owns the task.

    It is now possible to measure _approximately_ how much memory a task is using
    by clearing the reference bits with

    echo 1 > /proc/pid/clear_refs

    and checking the reference count for each VMA from the /proc/pid/smaps output
    at a measured time interval. For example, to observe the approximate change
    in memory footprint for a task, write a script that clears the references
    (echo 1 > /proc/pid/clear_refs), sleeps, and then greps for Pgs_Referenced and
    extracts the size in kB. Add the sizes for each VMA together for the total
    referenced footprint. Moments later, repeat the process and observe the
    difference.

    For example, using an efficient Mozilla:

    accumulated time    referenced memory
    ----------------    -----------------
               0 s               408 kB
               1 s               408 kB
               2 s               556 kB
               3 s              1028 kB
               4 s               872 kB
               5 s              1956 kB
               6 s               416 kB
               7 s              1560 kB
               8 s              2336 kB
               9 s              1044 kB
              10 s               416 kB

    This is a valuable tool to get an approximate measurement of the memory
    footprint for a task.
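
    A stand-alone example of such a script, written in C rather than shell
    (hypothetical, not from the patch; it assumes the smaps field is spelled
    "Referenced:", so adjust the string if your kernel uses the Pgs_Referenced
    spelling mentioned above):

        #include <stdio.h>

        int main(int argc, char **argv)
        {
                char path[64], line[256];
                long kb, total = 0;
                FILE *f;

                if (argc != 2)
                        return 1;
                snprintf(path, sizeof(path), "/proc/%s/smaps", argv[1]);
                f = fopen(path, "r");
                if (!f)
                        return 1;
                while (fgets(line, sizeof(line), f))
                        if (sscanf(line, "Referenced: %ld kB", &kb) == 1)
                                total += kb;    /* sum across all VMAs */
                fclose(f);
                printf("referenced: %ld kB\n", total);
                return 0;
        }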

    Cc: Hugh Dickins
    Cc: Paul Mundt
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    [akpm@linux-foundation.org: build fixes]
    [mpm@selenic.com: rename for_each_pmd]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Adds an additional unsigned long field to struct mem_size_stats called
    'referenced'. For each pte walked in the smaps code, this field is
    incremented by PAGE_SIZE if it has pte-reference bits.

    An additional line was added to the /proc/pid/smaps output for each VMA to
    indicate how many pages within it are currently marked as referenced or
    accessed.
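
    The accumulation itself is a one-liner; roughly (a sketch of the smaps
    pte walk, not the exact hunk):

        pte_t ptent = *pte;
        struct page *page = vm_normal_page(vma, addr, ptent);

        if (page && (pte_young(ptent) || PageReferenced(page)))
                mss->referenced += PAGE_SIZE;   /* count this page as referenced */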

    Cc: Hugh Dickins
    Cc: Paul Mundt
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Extracts the pmd walker from smaps-specific code in fs/proc/task_mmu.c.

    The new struct pmd_walker includes the struct vm_area_struct of the memory to
    walk over. Iteration begins at the vma->vm_start and completes at
    vma->vm_end. A pointer to another data structure may be stored in the private
    field such as struct mem_size_stats, which acts as the smaps accumulator. For
    each pmd in the VMA, the action function is called with a pointer to its
    struct vm_area_struct, a pointer to the pmd_t, its start and end addresses,
    and the private field.

    The interface for walking pmd's in a VMA for fs/proc/task_mmu.c is now:

    void for_each_pmd(struct vm_area_struct *vma,
                      void (*action)(struct vm_area_struct *vma,
                                     pmd_t *pmd, unsigned long addr,
                                     unsigned long end,
                                     void *private),
                      void *private);

    Since the pmd walker is now extracted from the smaps code, smaps_one_pmd() is
    invoked for each pmd in the VMA. Its behavior and efficiency are identical to
    the existing implementation.
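
    A usage sketch based on the prototype above (the body of the action is
    elided):

        static void smaps_one_pmd(struct vm_area_struct *vma, pmd_t *pmd,
                                  unsigned long addr, unsigned long end,
                                  void *private)
        {
                struct mem_size_stats *mss = private;
                /* walk the ptes under this pmd and update *mss ... */
        }

        struct mem_size_stats mss;

        memset(&mss, 0, sizeof(mss));
        for_each_pmd(vma, smaps_one_pmd, &mss);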

    Cc: Hugh Dickins
    Cc: Paul Mundt
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • If you actually clear the bit, you need to:

    + pte_update_defer(vma->vm_mm, addr, ptep);

    The reason is, when updating PTEs, the hypervisor must be notified. Using
    atomic operations to do this is fine for all hypervisors I am aware of.
    However, for hypervisors which shadow page tables, if these PTE
    modifications are not trapped, you need a post-modification call to fulfill
    the update of the shadow page table.

    Acked-by: Zachary Amsden
    Cc: Hugh Dickins
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • Add ptep_test_and_clear_{dirty,young} to i386. They advertise that they
    have it and there is at least one place where it needs to be called without
    the page table lock: to clear the accessed bit on write to
    /proc/pid/clear_refs.

    ptep_clear_flush_{dirty,young} are updated to use the new functions. The
    overall net effect to current users of ptep_clear_flush_{dirty,young} is
    that we introduce an additional branch.

    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Signed-off-by: David Rientjes
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Introduce a macro for suppressing gcc from generating a warning about a
    probable uninitialized state of a variable.

    Example:

    - spinlock_t *ptl;
    + spinlock_t *uninitialized_var(ptl);

    Not a happy solution, but those warnings are obnoxious.

    - Using the usual pointlessly-set-it-to-zero approach wastes several
    bytes of text.

    - Using a macro means we can (hopefully) do something else if gcc changes
    cause the `x = x' hack to stop working.

    - Using a macro means that people who are worried about hiding true bugs
    can easily turn it off.
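
    For reference, the kernel's definition is essentially the self-assignment
    trick, which also works in plain C (a compilable illustration):

        #define uninitialized_var(x) x = x

        void example(int flag)
        {
                int value = 0;
                int *uninitialized_var(p);  /* expands to: int *p = p; */

                if (flag)
                        p = &value;
                if (flag)
                        *p = 1;             /* gcc no longer warns about p */
        }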

    Signed-off-by: Borislav Petkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Borislav Petkov
     
  • An identical block is duplicated: contrary to the comment, we have been
    re-reading the page *twice* in filemap_nopage rather than once.

    If any retry logic or anything is needed, it belongs in lower levels anyway.
    Only retry once. Linus agrees.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Generally we work under the assumption that the mem_map array is contiguous
    and valid out to a MAX_ORDER_NR_PAGES block of pages, i.e. that if we have
    validated any page within this MAX_ORDER_NR_PAGES block we need not check
    any other. This is not true when CONFIG_HOLES_IN_ZONE is set and we must
    check each and every reference we make from a pfn.

    Add a pfn_valid_within() helper which should be used when scanning pages
    within a MAX_ORDER_NR_PAGES block when we have already checked the validity
    of the block normally with pfn_valid(). This can then be optimised away when
    we do not have holes within a MAX_ORDER_NR_PAGES block of pages.
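
    The helper is a thin wrapper that compiles away when a zone cannot contain
    holes; a sketch of the include/linux/mmzone.h change:

        #ifdef CONFIG_HOLES_IN_ZONE
        #define pfn_valid_within(pfn)   pfn_valid(pfn)
        #else
        #define pfn_valid_within(pfn)   (1)
        #endif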

    Signed-off-by: Andy Whitcroft
    Acked-by: Mel Gorman
    Acked-by: Bob Picco
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • Add proper prototypes in include/linux/slab.h.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Architectures that don't support DMA can say so by adding a config NO_DMA
    to their Kconfig file. This will prevent compilation of some DMA-specific
    driver code. Also, dma-mapping-broken.h isn't needed anymore on at least
    s390. This avoids compilation and linking of otherwise dead/broken code.

    Other architectures that include dma-mapping-broken.h are arm26, h8300,
    m68k, m68knommu and v850. If these could be converted as well we could get
    rid of the header file.

    Signed-off-by: Heiko Carstens
    "John W. Linville"
    Cc: Kyle McMartin
    Cc: Tejun Heo
    Cc: Jeff Garzik
    Cc: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • If the badness of a process is zero then oom_adj>0 has no effect. This
    patch makes sure that the oom_adj shift actually increases badness points
    appropriately.
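
    The arithmetic that motivates the fix, as a stand-alone illustration (the
    "points = 1" line is the essence of the change; the real patch may be
    structured differently):

        #include <stdio.h>

        int main(void)
        {
                unsigned long points = 0;   /* badness of an "innocent" task */
                int oom_adj = 10;

                printf("old: %lu\n", points << oom_adj);    /* still 0 */
                if (points == 0)
                        points = 1;         /* let the shift have an effect */
                printf("new: %lu\n", points << oom_adj);    /* 1024 */
                return 0;
        }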

    Signed-off-by: Joshua N. Pritikin
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joshua N Pritikin
     
  • __block_write_full_page is calling SetPageUptodate without the page locked.
    This is unusual, but not incorrect, as PG_writeback is still set.

    However the next patch will require that SetPageUptodate always be called with
    the page locked. Simply don't bother setting the page uptodate in this case
    (it is unusual that the write path does such a thing anyway). Instead just
    leave it to the read side to bring the page uptodate when it notices that all
    buffers are uptodate.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Ensure pages are uptodate after returning from read_cache_page, which allows
    us to cut out most of the filesystem-internal PageUptodate calls.

    I didn't have a great look down the call chains, but this appears to fix 7
    possible use-before-uptodate bugs in hfs, 2 in hfsplus, 1 in jfs, a few in
    ecryptfs, 1 in jffs2, and a possible case of cleared data being overwritten
    with readpage in block2mtd. All depending on whether the filler is async
    and/or can return with a !uptodate page.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • If slab->inuse is corrupted, cache_alloc_refill can enter an infinite
    loop, as detailed by Michael Richardson in his report. This adds a BUG_ON
    to catch those cases.

    Cc: Michael Richardson
    Acked-by: Christoph Lameter
    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
  • Minimum gcc version is 3.2 now. However, with likely profiling, even
    modern gcc versions cannot always eliminate the call.

    Replace the placeholder functions with the more conventional empty static
    inlines, which should be optimal for everyone.
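
    The "conventional empty static inline" pattern referred to here, with a
    hypothetical function name (the actual functions touched by the patch are
    not named in this log):

        /* before: a macro placeholder that gcc may fail to eliminate under
         *         likely profiling, e.g. #define arch_hook(x) do { } while (0)
         * after:  an empty static inline, trivially optimized away */
        static inline void arch_hook(const void *x) { }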

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Add a proper prototype for hugetlb_get_unmapped_area() in
    include/linux/hugetlb.h.

    Signed-off-by: Adrian Bunk
    Acked-by: William Irwin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • We can use the global ZVC counters to establish the exact size of the LRU
    and the free pages. This allows a more accurate determination of the dirty
    ratio.

    This patch will fix the broken ratio calculations if large amounts of
    memory are allocated to huge pages or other consumers that do not put the
    pages onto the LRU.

    Notes:
    - I did not add NR_SLAB_RECLAIMABLE to the calculation of the
      dirtyable pages. Those may be reclaimable but they are at this
      point not dirtyable. If NR_SLAB_RECLAIMABLE were considered,
      then a huge number of reclaimable pages would stop writeback
      from occurring.

    - This patch used to be in mm as the last one in a series of patches.
      It was removed when Linus updated the treatment of highmem because
      there was a conflict. I updated the patch to follow Linus' approach.
      This patch is needed to fulfill the claims made in the beginning of the
      patchset that is now in Linus' tree.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The nr_cpu_ids value is currently only calculated in smp_init. However, it
    may be needed before that (SLUB needs it in kmem_cache_init!) and other
    kernel components may also want to allocate dynamically sized per-cpu
    arrays before smp_init. So move the determination of possible cpus into
    sched_init(), where we already loop over all possible cpus early in boot.

    Also initialize both nr_node_ids and nr_cpu_ids with the highest value they
    could take. If we have accidental users before these values are determined
    then the current value of 0 may cause too-small per-cpu and per-node arrays
    to be allocated. If it is set to the maximum possible then we only waste
    some memory for early boot users.
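
    A sketch of the determination as moved into sched_init() (not the exact
    hunk):

        int highest_cpu = 0;
        int i;

        for_each_possible_cpu(i)
                highest_cpu = i;            /* possible cpus iterate in order */
        nr_cpu_ids = highest_cpu + 1;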

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Add a new mm function apply_to_page_range() which applies a given function to
    every pte in a given virtual address range in a given mm structure. This is a
    generic alternative to cut-and-pasting the Linux idiomatic pagetable walking
    code in every place that a sequence of PTEs must be accessed.

    Although this interface is intended to be useful in a wide range of
    situations, it is currently used specifically by several Xen subsystems, for
    example: to ensure that pagetables have been allocated for a virtual address
    range, and to construct batched special pagetable update requests to map I/O
    memory (in ioremap()).
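
    The interface, roughly as added (check include/linux/mm.h for the exact
    callback arguments; treat this as an approximation):

        typedef int (*pte_fn_t)(pte_t *pte, struct page *pmd_page,
                                unsigned long addr, void *data);

        int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
                                unsigned long size, pte_fn_t fn, void *data);

        /* A caller supplies a small function that is invoked once per pte: */
        static int touch_pte(pte_t *pte, struct page *pmd_page,
                             unsigned long addr, void *data)
        {
                /* inspect or rewrite *pte for address addr ... */
                return 0;
        }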

    [akpm@linux-foundation.org: fix warning, unpleasantly]
    Signed-off-by: Ian Pratt
    Signed-off-by: Christian Limpach
    Signed-off-by: Chris Wright
    Signed-off-by: Jeremy Fitzhardinge
    Cc: Christoph Lameter
    Cc: Matt Mackall
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeremy Fitzhardinge
     
  • serial_core, use pr_debug

    Signed-off-by: Jiri Slaby
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     
  • The MPSC serial driver assumes that interrupts are always on to pick up the
    DMA transmit ops that aren't submitted while the DMA engine is active.
    However, when irqs are off for a period of time, such as during operations
    under kernel crash dump, console messages do not show up because the
    additional DMA ops are dropped. This change makes console writes process
    all the tx DMAs queued up before submitting a new request.

    Also, the current locking mechanism does not protect the hardware registers
    and ring buffer when a printk is done during the serial write operations.
    The additional per-port transmit lock provides finer grained locking and
    protects registers from being clobbered while printks are nested within
    UART writes.

    Signed-off-by: Dave Jiang
    Signed-off-by: Mark A. Greer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jiang
     
  • At present, the serial core always allows setserial in userspace to change the
    port address, irq and base clock of any serial port. That makes sense for
    legacy ISA ports, but not for (say) embedded ns16550 compatible serial ports
    at peculiar addresses. In these cases, the kernel code configuring the ports
    must know exactly where they are, and their clocking arrangements (which can
    be unusual on embedded boards). It doesn't make sense for userspace to change
    these settings.

    Therefore, this patch defines a UPF_FIXED_PORT flag for the uart_port
    structure. If this flag is set when the serial port is configured, any
    attempts to alter the port's type, io address, irq or base clock with
    setserial are ignored.

    In addition this patch uses the new flag for on-chip serial ports probed in
    arch/powerpc/kernel/legacy_serial.c, and for other hard-wired serial ports
    probed by drivers/serial/of_serial.c.
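
    A sketch of how a platform might mark such a port when describing it
    (field values are made up; only the flag matters):

        struct uart_port port = {
                .iobase  = 0x3f8,           /* fixed by the board design */
                .irq     = 4,
                .uartclk = 1843200,
                .flags   = UPF_BOOT_AUTOCONF | UPF_FIXED_PORT,
        };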

    Signed-off-by: David Gibson
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
     
  • Add support for the integrated serial ports of the MIPS RM9122 processor
    and its relatives.

    The patch also does some whitespace cleanup.

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: Thomas Koeller
    Cc: Ralf Baechle
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Koeller
     
  • Serial driver patch for the PMC-Sierra MSP71xx devices.

    There are three different fixes:

    1 Fix for DesignWare APB THRE errata: In brief, this is a non-standard
    16550 in that the THRE interrupt will not re-assert itself simply by
    disabling and re-enabling the THRI bit in the IER, it is only re-enabled
    if a character is actually sent out.

    It appears that the "8250-uart-backup-timer.patch" in the "mm" tree
    also fixes it so we have dropped our initial workaround. This patch now
    needs to be applied on top of that "mm" patch.

    2 Fix for Busy Detect on LCR write: The DesignWare APB UART has a feature
    which causes a new Busy Detect interrupt to be generated if it's busy
    when the LCR is written. This fix saves the value of the LCR and
    rewrites it after clearing the interrupt.

    3 Workaround for interrupt/data concurrency issue: The SoC needs to
    ensure that writes that can cause interrupts to be cleared reach the UART
    before returning from the ISR. This fix reads a non-destructive register
    on the UART so the read transaction completion ensures the previously
    queued write transaction has also completed.

    Signed-off-by: Marc St-Jean
    Cc: Russell King
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marc St-Jean
     
  • PCI drivers have the new_id file in sysfs which allows new IDs to be added
    at runtime. The advantage is to avoid re-compilation of a driver that
    works for a new device but whose ID table doesn't contain the new device.
    This mechanism is only meant for testing: after the driver has been tested
    successfully, the ID should be added in source code so that new revisions
    of the kernel automatically detect the device.

    The implementation follows the PCI implementation. The interface is documented
    in Documentation/pcmcia/driver.txt. Computations should be done in userspace,
    so the sysfs string contains the raw structure members for matching.

    Signed-off-by: Bernhard Walle
    Cc: Dominik Brodowski
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bernhard Walle
     
  • This is a minor correctness fix: since the at91_cf driver probe() routine
    is in the init section, it should use platform_driver_probe() instead of
    leaving that pointer around in the driver struct after init section
    removal.

    Signed-off-by: David Brownell
    Cc: Dominik Brodowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Brownell
     
  • This introduces krealloc(), which reallocates memory while keeping the
    contents unchanged. The allocator avoids reallocation if the new size fits
    the currently used cache. I also added a simple non-optimized version to
    mm/slob.c for compatibility.
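
    Typical use, as a sketch (the usual kmalloc conventions apply):

        new_buf = krealloc(buf, new_len, GFP_KERNEL);
        if (!new_buf)
                return -ENOMEM;     /* buf is still valid and unchanged */
        buf = new_buf;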

    [akpm@linux-foundation.org: fix warnings]
    Acked-by: Josef Sipek
    Acked-by: Matt Mackall
    Acked-by: Christoph Lameter
    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     

07 May, 2007

2 commits

  • This was broken. It adds complexity, for no good reason. Rather than
    separate __pa() and __pa_symbol(), we should deprecate __pa_symbol(),
    and preferably __pa() too - and just use "virt_to_phys()" instead, which
    is more readable and has nicer semantics.

    However, right now, just undo the separation, and make __pa_symbol() be
    the exact same as __pa(). That fixes the bugs this patch introduced,
    and we can do the fairly obvious cleanups later.

    Do the new __phys_addr() function (which is now the actual workhorse for
    the unified __pa()/__pa_symbol()) as a real external function, that way
    all the potential issues with compile/link-time optimizations of
    constant symbol addresses go away, and we can also, if we choose to, add
    more sanity-checking of the argument.

    Cc: Eric W. Biederman
    Cc: Vivek Goyal
    Cc: Andi Kleen
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild: (38 commits)
    kconfig: fix mconf segmentation fault
    kbuild: enable use of code from a different dir
    kconfig: error out if recursive dependencies are found
    kbuild: scripts/basic/fixdep segfault on pathological string-o-death
    kconfig: correct minor typo in Kconfig warning message.
    kconfig: fix path to modules.txt in Kconfig help
    usr/Kconfig: fix typo
    kernel-doc: alphabetically-sorted entries in index.html of 'htmldocs'
    kbuild: be more explicit on missing .config file
    kbuild: clarify the creation of the LOCALVERSION_AUTO string.
    kbuild: propagate errors from find in scripts/gen_initramfs_list.sh
    kconfig: refer to qt3 if we cannot find qt libraries
    kbuild: handle compressed cpio initramfs-es
    kbuild: ignore section mismatch warning for references from .paravirtprobe to .init.text
    kbuild: remove stale comment in modpost.c
    kbuild/mkuboot.sh: allow spaces in CROSS_COMPILE
    kbuild: fix make mrproper for Documentation/DocBook/man
    kbuild: remove kconfig binaries during make mrproper
    kconfig/menuconfig: do not hardcode '.config'
    kbuild: override build timestamp & version
    ...

    Linus Torvalds