07 Feb, 2008

4 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86:
    x86: fix deadlock, make pgd_lock irq-safe
    virtio: fix trivial build bug
    x86: fix mtrr trimming
    x86: delay CPA self-test and repeat it
    x86: fix 64-bit sections
    generic: add __FINITDATA
    x86: remove spurious ifdefs from pageattr.c
    x86: mark the .rodata section also NX
    x86: fix iret exception recovery on 64-bit
    cpuidle: dubious one-bit signed bitfield in cpuidle.h
    x86: fix sparse warnings in powernow-k8.c
    x86: fix sparse error in traps_32.c
    x86: trivial sparse/checkpatch in quirks.c
    x86 ptrace: disallow null cs/ss
    MAINTAINERS: RDC R-321x SoC maintainer
    brk randomization: introduce CONFIG_COMPAT_BRK
    brk: check the lower bound properly
    x86: remove X2 workaround
    x86: make spurious fault handler aware of large mappings
    x86: make traps on entry code be debuggable in user space, 64-bit

    Linus Torvalds
     
  • based on similar patch from: Pavel Machek

    Introduce CONFIG_COMPAT_BRK. If disabled, the kernel is free (but not
    obliged) to randomize the brk area.

    Heap randomization breaks ancient binaries, so we keep COMPAT_BRK
    enabled by default.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • There is a check in sys_brk() that tries to make sure we do not underflow
    the area that is dedicated to the brk heap.

    The check is however wrong, as it assumes that the brk area starts
    immediately after the end of the code (+bss), which is wrong for example in
    environments with a randomized brk start. The proper way is to check that
    the address is not below the start_brk address.
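
    A minimal sketch of the corrected bound in sys_brk(); the rest of the
    syscall and the old (wrong) comparison are only hinted at in comments:

    asmlinkage unsigned long sys_brk(unsigned long brk)
    {
            struct mm_struct *mm = current->mm;
            unsigned long retval;

            down_write(&mm->mmap_sem);
            /* previously this compared against the end of the code segment,
             * which breaks once the brk start is randomized */
            if (brk < mm->start_brk)
                    goto out;
            /* ... grow or shrink the brk area as before ... */
    out:
            retval = mm->brk;
            up_write(&mm->mmap_sem);
            return retval;
    }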

    Signed-off-by: Jiri Kosina
    Signed-off-by: Ingo Molnar

    Jiri Kosina
     
  • Instead of allocating a fixed-size array of NR_CPUS pointers for
    percpu_data, we can use nr_cpu_ids, which is generally smaller than
    NR_CPUS.
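
    A rough sketch of the idea (the names are illustrative, not the exact
    mm/allocpercpu.c code): size the per-cpu pointer array by nr_cpu_ids
    rather than the compile-time NR_CPUS ceiling.

    struct percpu_data_sketch {
            void *ptrs[1];                  /* really nr_cpu_ids entries */
    };

    static void *percpu_alloc_sketch(gfp_t gfp)
    {
            /* nr_cpu_ids is the highest possible CPU id + 1, usually < NR_CPUS */
            return kzalloc(nr_cpu_ids * sizeof(void *), gfp);
    }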

    Signed-off-by: Eric Dumazet
    Cc: Christoph Lameter
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

06 Feb, 2008

36 commits

  • * 'dmapool' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc:
    pool: Improve memory usage for devices which can't cross boundaries
    Change dmapool free block management
    dmapool: Tidy up includes and add comments
    dmapool: Validate parameters to dma_pool_create
    Avoid taking waitqueue lock in dmapool
    dmapool: Fix style problems
    Move dmapool.c to mm/ directory

    Linus Torvalds
     
  • This builds on top of the earlier vmalloc_32_user() work introduced by
    b50731732f926d6c49fd0724616a7344c31cd5cf, as we now have places in the nommu
    allmodconfig that hit up against these missing APIs.

    As vmalloc_32_user() is already implemented, its implementation is moved
    over to vmalloc_user() and vmalloc_32_user() is simply made a wrapper. As
    all current nommu platforms are 32-bit addressable, there's no special
    casing we have to do for ZONE_DMA and things of that nature as per
    GFP_VMALLOC32.

    remap_vmalloc_range() needs to check VM_USERMAP in order to figure out whether
    we permit the remap or not, which means that we also have to rework the
    vmalloc_user() code to grovel for the VMA and set the flag.

    Signed-off-by: Paul Mundt
    Acked-by: David McCullough
    Acked-by: David Howells
    Acked-by: Greg Ungerer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Mundt
     
  • Root processes are considered more important when the system is out of
    memory and killing processes. The check for CAP_SYS_ADMIN was augmented
    with a check for uid==0 or euid==0.

    There are several possible ways to look at this:

    1. uid comparisons are unnecessary, trust CAP_SYS_ADMIN
    alone. However CAP_SYS_RESOURCE is the one that really
    means "give me extra resources" so allow for that as
    well.
    2. Any privileged code should be protected, but uid is not
    an indication of privilege. So we should check whether
    any capabilities are raised.
    3. uid==0 makes processes on the host as well as in containers
    more important, so we should keep the existing checks.
    4. uid==0 makes processes only on the host more important,
    even without any capabilities. So we should be keeping
    the (uid==0||euid==0) check but only when
    userns==&init_user_ns.

    I'm following number 1 here.
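
    A rough sketch of what option 1 looks like in the badness() heuristic;
    task_has_cap() is a stand-in for the kernel's real per-task capability
    test, and the /4 discount is the existing one for privileged tasks:

    /* Sketch of option 1: trust capabilities, ignore uid/euid entirely. */
    static unsigned long privilege_discount(struct task_struct *p,
                                            unsigned long points)
    {
            if (task_has_cap(p, CAP_SYS_ADMIN) ||
                task_has_cap(p, CAP_SYS_RESOURCE))
                    points /= 4;    /* privileged tasks are less likely victims */
            return points;
    }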

    Signed-off-by: Serge Hallyn
    Cc: Andrew Morgan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • The patch supports legacy (32-bit) capability userspace, and where
    possible translates 32-bit capabilities to/from userspace and the VFS to
    64-bit kernel-space capabilities. If a capability set cannot be compressed
    into 32 bits for consumption by user space, the system call fails with
    -ERANGE.

    FWIW, libcap-2.00 supports this change (and earlier capability formats):

    http://www.kernel.org/pub/linux/libs/security/linux-privs/kernel-2.6/

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: use get_task_comm()]
    [ezk@cs.sunysb.edu: build fix]
    [akpm@linux-foundation.org: do not initialise statics to 0 or NULL]
    [akpm@linux-foundation.org: unused var]
    [serue@us.ibm.com: export __cap_ symbols]
    Signed-off-by: Andrew G. Morgan
    Cc: Stephen Smalley
    Acked-by: Serge Hallyn
    Cc: Chris Wright
    Cc: James Morris
    Cc: Casey Schaufler
    Signed-off-by: Erez Zadok
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morgan
     
  • This patch modifies the interface to inode_getsecurity to have the function
    return a buffer containing the security blob and its length via parameters
    instead of relying on the calling function to give it an appropriately sized
    buffer.

    Security blobs obtained with this function should be freed using the
    release_secctx LSM hook. This alleviates the problem of the caller having
    to guess a length and preallocate a buffer for this function, allowing it
    to be used elsewhere, for example for Labeled NFS.

    The patch also removes the unused err parameter. The conversion is similar
    to the one performed by Al Viro for the security_getprocattr hook.

    Signed-off-by: David P. Quigley
    Cc: Stephen Smalley
    Cc: Chris Wright
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Cc: Casey Schaufler
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David P. Quigley
     
  • By putting smaller objects on their own list, we greatly reduce overall
    external fragmentation and increase repeatability. This reduces total SLOB
    overhead from > 50% to ~6% on a simple boot test.

    Signed-off-by: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • We weren't merging freed blocks at the beginning of the free list. Fixing
    this showed a 2.5% efficiency improvement in a userspace test harness.

    Signed-off-by: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • After dirtying a 100M file, the normal behavior is to start writeback for
    all data after a 30s delay. But sometimes the following happens instead:

    - after 30s: ~4M
    - after 5s: ~4M
    - after 5s: all remaining 92M

    Some analysis shows that the internal io dispatch queues go like this:

             s_io            s_more_io
             -------------------------
        1)   100M,1K         0
        2)   1K              96M
        3)   0               96M

    1) initial state with a 100M file and a 1K file
    2) 4M written, nr_to_write <= 0, no more writes (BUG)

    nr_to_write > 0 in (3) fools the upper layer into thinking that all data
    have been written out. The big dirty file is actually still sitting in
    s_more_io. We cannot simply splice s_more_io back to s_io as soon as s_io
    becomes empty, and let the loop in generic_sync_sb_inodes() continue: this
    may starve newly expired inodes in s_dirty. It is also not an option to
    draw inodes from both s_more_io and s_dirty and let the loop go on: this
    might lead to livelocks, and might also starve other superblocks in sync
    time (well, kupdate may still starve some superblocks, but that's another
    bug).

    We have to return when a full scan of s_io completes. So nr_to_write > 0
    does not necessarily mean that "all data are written". This patch
    introduces a flag, writeback_control.more_io, to indicate that more io
    should be done. With it the big dirty file no longer has to wait for the
    next kupdate invocation 5s later.

    In sync_sb_inodes() we only set more_io on super_blocks we actually
    visited. This avoids the interaction between two pdflush daemons.

    Also in __sync_single_inode() we don't blindly keep requeuing the io if the
    filesystem cannot progress. Failing to do so may lead to 100% iowait.
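
    A minimal sketch of the mechanism described above (field and function
    bodies abridged): the flag lives in writeback_control and is only raised
    for superblocks whose s_more_io list is left non-empty.

    struct writeback_control {
            /* ... existing fields ... */
            unsigned more_io:1;             /* more io to be dispatched */
    };

    static void sync_sb_inodes_sketch(struct super_block *sb,
                                      struct writeback_control *wbc)
    {
            /* ... write back inodes queued on sb->s_io as before ... */
            if (!list_empty(&sb->s_more_io))
                    wbc->more_io = 1;       /* tell the caller to come back */
    }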

    Tested-by: Mike Snitzer
    Signed-off-by: Fengguang Wu
    Cc: Michael Rubin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Fix the following warning:
    WARNING: mm/built-in.o(.text+0x22069): Section mismatch in reference from the function sparse_early_usemap_alloc() to the function .init.text:__alloc_bootmem_node()

    The static function sparse_early_usemap_alloc() is used only by
    sparse_init(), and since sparse_init() is annotated __init it is safe to
    annotate sparse_early_usemap_alloc() with __init too.

    Signed-off-by: Sam Ravnborg
    Cc: Andy Whitcroft
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sam Ravnborg
     
  • After running SetPageUptodate, preceding stores to the page contents to
    actually bring it uptodate may not be ordered with the store to set the
    page uptodate.

    Therefore, another CPU which checks that PageUptodate is true and then
    reads the page contents can get stale data.

    Fix this by having an smp_wmb() before SetPageUptodate, and an smp_rmb()
    after PageUptodate.
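
    In outline, the pairing looks like the sketch below (simplified from the
    real SetPageUptodate/PageUptodate macros): the writer orders the content
    stores before the flag store, the reader orders the flag load before the
    content loads.

    static inline void SetPageUptodate_sketch(struct page *page)
    {
            smp_wmb();      /* page contents stores -> uptodate bit store */
            set_bit(PG_uptodate, &page->flags);
    }

    static inline int PageUptodate_sketch(struct page *page)
    {
            int ret = test_bit(PG_uptodate, &page->flags);

            if (ret)
                    smp_rmb();      /* uptodate bit load -> page contents loads */
            return ret;
    }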

    Many places that test PageUptodate do so with the page locked, and this
    would be enough to ensure memory ordering in those places if
    SetPageUptodate were only called while the page is locked. Unfortunately
    that is not always the case for some filesystems, but it could be an idea
    for the future.

    Also bring the handling of anonymous page uptodateness in line with that
    of file-backed page management, by marking anon pages as uptodate when
    they _are_ uptodate, rather than when our implementation requires that
    they be marked as such. Doing so allows us to get rid of the smp_wmb's in
    the page copying functions, which were added especially for anonymous
    pages for an analogous memory ordering problem. Both file and anonymous
    pages are now handled with the same barriers.

    FAQ:
    Q. Why not do this in flush_dcache_page?
    A. Firstly, flush_dcache_page handles only one side (the smp side) of the
    ordering protocol; we'd still need smp_rmb somewhere. Secondly, hiding away
    memory barriers in a completely unrelated function is nasty; at least in
    the PageUptodate macros, they are located together with (half) the
    operations involved in the ordering. Thirdly, the smp_wmb is only required
    when first bringing the page uptodate, whereas flush_dcache_page should be
    called each time it is written to through the kernel mapping. It is
    logically the wrong place to put it.

    Q. Why does this increase my text size / reduce my performance / etc.?
    A. Because it is adding the necessary instructions to eliminate the data
    race.

    Q. Can it be improved?
    A. Yes, e.g. if you were to create a rule that all SetPageUptodate
    operations run under the page lock, we could avoid the smp_rmb in places
    where PageUptodate is queried under the page lock. That requires an audit
    of all filesystems, and at least some would need reworking. That's great
    you're interested, I'm eagerly awaiting your patches.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • An orphaned page might still have fs-private metadata after the page has
    been truncated. As the page has no mapping, page migration refuses to
    migrate it. It appears the page is only freed by page reclaim, and only
    when the zone watermark is low; otherwise the page is never freed, and as
    a result migration always fails. We could free the metadata so such a page
    can be freed during migration, making migration more reliable.

    [akpm@linux-foundation.org: go direct to try_to_free_buffers()]
    Signed-off-by: Shaohua Li
    Acked-by: Nick Piggin
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • I've written some test programs for the LTP project. While writing them I
    met a problem which I cannot solve in user land, so I wrote a patch for the
    Linux kernel. Please include this patch if it is acceptable.

    The test program tests the 4th parameter of fadvise64_64:

    long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice);

    My test case calls fadvise64_64 with an invalid advice value and checks
    that errno is set to EINVAL. About the advice parameter the man page says:

    ...
    Permissible values for advice include:

    POSIX_FADV_NORMAL
    ...
    POSIX_FADV_SEQUENTIAL
    ...
    POSIX_FADV_RANDOM
    ...
    POSIX_FADV_NOREUSE
    ...
    POSIX_FADV_WILLNEED
    ...
    POSIX_FADV_DONTNEED
    ...
    ERRORS
    ...
    EINVAL An invalid value was specified for advice.

    However, I got a bug report that the system call invocations
    in my test case returned 0 unexpectedly.

    I've inspected the kernel code:

    asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
    {
            struct file *file = fget(fd);
            struct address_space *mapping;
            struct backing_dev_info *bdi;
            loff_t endbyte; /* inclusive */
            pgoff_t start_index;
            pgoff_t end_index;
            unsigned long nrpages;
            int ret = 0;

            if (!file)
                    return -EBADF;

            if (S_ISFIFO(file->f_path.dentry->d_inode->i_mode)) {
                    ret = -ESPIPE;
                    goto out;
            }

            mapping = file->f_mapping;
            if (!mapping || len < 0) {
                    ret = -EINVAL;
                    goto out;
            }

            if (mapping->a_ops->get_xip_page)
                    /* no bad return value, but ignore advice */
                    goto out;
            ...
    out:
            fput(file);
            return ret;
    }

    I found that the advice parameter is just ignored when
    mapping->a_ops->get_xip_page is set. This behavior is different from what
    is written in the man page. Is this OK?

    get_xip_page is set if CONFIG_EXT2_FS_XIP is true. Anyway, I cannot find
    an easy way to detect from user space whether the get_xip_page field is
    set or CONFIG_EXT2_FS_XIP is true.

    I propose the following patch, which checks the advice parameter even when
    get_xip_page is set.
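
    A sketch of the proposed change, replacing the bare early exit in the
    quoted code: validate advice before honoring the XIP shortcut, so an
    unknown value still yields -EINVAL (constants as in the man page excerpt
    above).

            if (mapping->a_ops->get_xip_page) {
                    switch (advice) {
                    case POSIX_FADV_NORMAL:
                    case POSIX_FADV_RANDOM:
                    case POSIX_FADV_SEQUENTIAL:
                    case POSIX_FADV_WILLNEED:
                    case POSIX_FADV_NOREUSE:
                    case POSIX_FADV_DONTNEED:
                            /* no bad return value, but ignore advice */
                            break;
                    default:
                            ret = -EINVAL;
                    }
                    goto out;
            }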

    Signed-off-by: Masatake YAMATO
    Acked-by: Carsten Otte
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masatake YAMATO
     
  • The show_mem() output does not include the total number of pagecache
    pages. This would be helpful when analyzing the debug information in
    the /var/log/messages file after OOM kills occur.

    This patch includes the total pagecache pages in that output.

    Signed-off-by: Larry Woodman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Larry Woodman
     
  • In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 ("Clean up and make
    try_to_free_buffers() not race with dirty pages"), try_to_free_buffers
    was changed to bail out if the page was dirty.

    That in turn caused truncate_complete_page to leak massive amounts of
    memory, because the dirty bit was only cleared after the call to
    try_to_free_buffers.

    So the call to cancel_dirty_page was moved up to have the dirty bit
    cleared early in 3e67c0987d7567ad666641164a153dca9a43b11d ("truncate:
    clear page dirtiness before running try_to_free_buffers()").

    The problem with that fix is that the page can be redirtied after
    cancel_dirty_page was called, e.g. like this:

    truncate_complete_page()
      cancel_dirty_page() // PG_dirty cleared, decr. dirty pages
      do_invalidatepage()
        ext3_invalidatepage()
          journal_invalidatepage()
            journal_unmap_buffer()
              __dispose_buffer()
                __journal_unfile_buffer()
                  __journal_temp_unlink_buffer()
                    mark_buffer_dirty(); // PG_dirty set, incr. dirty pages

    And then we end up with dirty pages being wrongly accounted.

    As a result, in ecdfc9787fe527491baefc22dce8b2dbd5b2908d ("Resurrect
    'try_to_free_buffers()' VM hackery") the changes to try_to_free_buffers
    were reverted, so the original reason for the massive memory leak is
    gone, and we can also revert the move of the call to cancel_dirty_page
    from truncate_complete_page and get the accounting right again.

    I'm not sure if it matters, but as opposed to the final check in
    __remove_from_page_cache, this one also cares about the task io
    accounting, so maybe we want to use this instead, although it's not
    quite the clean fix either.

    Signed-off-by: Björn Steinbrink
    Tested-by: Krzysztof Piotr Oledzki
    Cc: Jan Kara
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Thomas Osterried
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bjorn Steinbrink
     
  • The current PageTail semantic is that a PageTail page is first a
    PageCompound page. So remove the redundant PageCompound test in
    set_page_refcounted().

    Signed-off-by: Qi Yong
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qi Yong
     
  • fastcall is always defined to be empty, remove it

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Harvey Harrison
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     
  • Signed-off-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • try_to_unmap always fails on a page found in a VM_LOCKED vma (unless
    migrating), and recycles it back to the active list. But if it's an
    anonymous page, we've already allocated swap to it: just wasting swap.
    Spot locked pages in page_referenced_one and treat them as referenced.

    Signed-off-by: Hugh Dickins
    Tested-by: KAMEZAWA Hiroyuki
    Cc: Ethan Solomita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove the prefetch logic in order to avoid touching impossible per cpu
    areas.

    Signed-off-by: Christoph Lameter
    Cc: Mike Travis
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Add vm.highmem_is_dirtyable toggle

    A 32-bit machine with HIGHMEM64 enabled running DCC has an MMAPed file of
    approximately 2GB which contains a hash format that is written randomly by
    the dbclean process. On 2.6.16 this process took a few minutes. With
    lowmem-only accounting of dirty ratios, it takes about 12 hours of 100%
    disk IO, all random writes.

    Include a toggle in /proc/sys/vm/highmem_is_dirtyable which can be set to 1 to
    add the highmem back to the total available memory count.

    [akpm@linux-foundation.org: Fix the CONFIG_DETECT_SOFTLOCKUP=y build]
    Signed-off-by: Bron Gondwana
    Cc: Ethan Solomita
    Cc: Peter Zijlstra
    Cc: WU Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bron Gondwana
     
  • We have repeatedly discussed whether the cold pages still have a point.
    There is one way to join the two lists: use a single list, putting the
    cold pages at the end and the hot pages at the beginning. That way a
    single list can serve both types of allocations.

    The discussion of the RFC for this and Mel's measurements indicate that
    there may not be too much of a point left to having separate lists for
    hot and cold pages (see http://marc.info/?t=119492914200001&r=1&w=2).

    Signed-off-by: Christoph Lameter
    Cc: Mel Gorman
    Cc: Martin Bligh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • When running with a 16M IOREMAP_MAX_ORDER (on armv7) we found that the
    vmlist search routine in __get_vm_area_node can mistakenly allow a driver
    to ioremap a range larger than vmalloc space.

    If at the time of the ioremap all existing vmlist areas sit below the
    determined alignment then the search routine continues past all entries and
    exits the for loop - straight into the found: label - without ever testing
    for integer wrapping or that the requested size fits.

    We were seeing a driver successfully ioremap 128M of flash even though
    there was only 120M of vmalloc space. From that point the system was left
    with the remainder of the first 16M of space to vmalloc/ioremap within.
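
    A sketch of the kind of bound test the fall-through path was missing;
    range_fits() is an illustrative helper, not the actual patch:

    /* A candidate [addr, addr + size) is only usable if it neither wraps
     * around nor runs past the end of the vmalloc/ioremap window. */
    static bool range_fits(unsigned long addr, unsigned long size,
                           unsigned long end)
    {
            if (addr + size < addr)         /* integer wrap */
                    return false;
            if (addr > end - size)          /* extends past 'end' */
                    return false;
            return true;
    }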

    Signed-off-by: Robert Bragg
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert Bragg
     
  • 1. Add comments explaining how the function can be called.

    2. Collect global diffs in a local array and only spill
    them once into the global counters when the zone scan
    is finished. This means that we only touch each global
    counter once instead of each time we fold cpu counters
    into zone counters.
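
    A sketch of point 2 (the zone walk itself is elided): deltas accumulate
    in a local array and hit the global vm_stat counters only once.

    static void refresh_cpu_vm_stats_sketch(void)
    {
            int global_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
            int i;

            /* ... for each zone, move per-cpu deltas into the zone counters
             *     and add them to global_diff[] ... */

            for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
                    if (global_diff[i])
                            atomic_long_add(global_diff[i], &vm_stat[i]);
    }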

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • In order to change the layout of the page tables after an mmap has crossed
    the address space limit of the current page table layout, an architecture
    hook in get_unmapped_area is needed. The arguments are the address of the
    new mapping and its length.

    Cc: Benjamin Herrenschmidt
    Signed-off-by: Martin Schwidefsky
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Schwidefsky
     
  • (with Martin Schwidefsky)

    The pgd/pud/pmd/pte page table allocation functions get a mm_struct
    pointer as their first argument. The free functions do not get the
    mm_struct argument. This is 1) asymmetrical, and 2) to do mm-related page
    table allocations the mm argument is needed on the free functions as well.
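
    In terms of prototypes, the change roughly makes the free side mirror the
    alloc side (pte level shown as an example; other levels follow the same
    pattern):

    /* before: pte_free_kernel(pte_t *pte);  pte_free(struct page *pte); */
    pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address);
    void pte_free_kernel(struct mm_struct *mm, pte_t *pte);

    struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address);
    void pte_free(struct mm_struct *mm, struct page *pte);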

    [kamalesh@linux.vnet.ibm.com: i386 fix]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Martin Schwidefsky
    Cc:
    Signed-off-by: Kamalesh Babulal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • - Add comments explaining how drain_pages() works.

    - Eliminate useless functions

    - Rename drain_all_local_pages to drain_all_pages(). It drains
    all pages, not only those of the local processor.

    - Eliminate useless interrupt off / on sequences. drain_pages()
    disables interrupts on its own. The execution thread is
    pinned to processor by the caller. So there is no need to
    disable interrupts.

    - Put drain_all_pages() declaration in gfp.h and remove the
    declarations from suspend.h and from mm/memory_hotplug.c

    - Make software suspend call drain_all_pages(). Draining only
    processor-local pages may not be the right approach if
    software suspend wants to support SMP. If it calls drain_all_pages()
    then we can make drain_pages() static.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Christoph Lameter
    Acked-by: Mel Gorman
    Cc: "Rafael J. Wysocki"
    Cc: Daniel Walker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Most pagecache (and some other) radix tree insertions have the great
    opportunity to preallocate a few nodes with relaxed gfp flags. But the
    preallocation is squandered: when it comes time to allocate a node, we
    default to first attempting a GFP_ATOMIC allocation -- that doesn't
    normally fail, but it can eat into atomic memory reserves that we don't
    need to be using.

    Another upshot of this is that it removes the sometimes highly contended
    zone->lock from underneath tree_lock. Pagecache insertions are always
    performed with a radix tree preload, and after this change, such a
    situation will never fall back to kmem_cache_alloc within
    radix_tree_node_alloc.
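
    A sketch of the reworked allocation order in radix_tree_node_alloc()
    (structure and variable names follow lib/radix-tree.c, but the body is
    abridged): in process context, a node set aside by radix_tree_preload()
    is consumed before falling back to the slab.

    static struct radix_tree_node *node_alloc_sketch(gfp_t gfp_mask)
    {
            struct radix_tree_node *ret = NULL;

            if (!(gfp_mask & __GFP_WAIT) && !in_interrupt()) {
                    struct radix_tree_preload *rtp;

                    rtp = &__get_cpu_var(radix_tree_preloads);
                    if (rtp->nr) {
                            /* take a preloaded node instead of hitting the slab */
                            ret = rtp->nodes[rtp->nr - 1];
                            rtp->nodes[rtp->nr - 1] = NULL;
                            rtp->nr--;
                    }
            }
            if (ret == NULL)
                    ret = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);

            return ret;
    }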

    David Miller reports seeing this allocation fail on a highly threaded
    sparc64 system:

    [527319.459981] dd: page allocation failure. order:0, mode:0x20
    [527319.460403] Call Trace:
    [527319.460568] [00000000004b71e0] __slab_alloc+0x1b0/0x6a8
    [527319.460636] [00000000004b7bbc] kmem_cache_alloc+0x4c/0xa8
    [527319.460698] [000000000055309c] radix_tree_node_alloc+0x20/0x90
    [527319.460763] [0000000000553238] radix_tree_insert+0x12c/0x260
    [527319.460830] [0000000000495cd0] add_to_page_cache+0x38/0xb0
    [527319.460893] [00000000004e4794] mpage_readpages+0x6c/0x134
    [527319.460955] [000000000049c7fc] __do_page_cache_readahead+0x170/0x280
    [527319.461028] [000000000049cc88] ondemand_readahead+0x208/0x214
    [527319.461094] [0000000000496018] do_generic_mapping_read+0xe8/0x428
    [527319.461152] [0000000000497948] generic_file_aio_read+0x108/0x170
    [527319.461217] [00000000004badac] do_sync_read+0x88/0xd0
    [527319.461292] [00000000004bb5cc] vfs_read+0x78/0x10c
    [527319.461361] [00000000004bb920] sys_read+0x34/0x60
    [527319.461424] [0000000000406294] linux_sparc_syscall32+0x3c/0x40

    The calltrace is significant: __do_page_cache_readahead allocates a number
    of pages with GFP_KERNEL, and hence it should have reclaimed sufficient
    memory to satisfy GFP_ATOMIC allocations. However after the list of pages
    goes to mpage_readpages, there can be significant intervals (including disk
    IO) before all the pages are inserted into the radix-tree. So the reserves
    can easily be depleted at that point. The patch is confirmed to fix the
    problem.

    Signed-off-by: Nick Piggin
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • __vmalloc_area_node() can become static.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • This code in mm/tiny-shmem.c is under #if 0 - remove it.

    Signed-off-by: Balbir Singh
    Acked-by: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • task_dirty_limit() can become static.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Make /proc/ page monitoring configurable

    This puts the following files under an embedded config option:

    /proc/pid/clear_refs
    /proc/pid/smaps
    /proc/pid/pagemap
    /proc/kpagecount
    /proc/kpageflags

    [akpm@linux-foundation.org: Kconfig fix]
    Signed-off-by: Matt Mackall
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • Introduce a general page table walker

    Signed-off-by: Matt Mackall
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • Move is_swap_pte helper function to swapops.h for use by pagemap code

    Signed-off-by: Matt Mackall
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • vmtruncate is a twisted maze of gotos. This patch cleans it up to have a
    proper if/else for the two major cases of extending and truncating, and
    thus makes it a lot more readable while keeping exactly the same
    functionality.
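
    The resulting shape, in outline (a sketch, not the full function):

    int vmtruncate_sketch(struct inode *inode, loff_t offset)
    {
            if (offset > inode->i_size) {
                    /* extending: check rlimits/fs limits, then grow i_size */
            } else {
                    /* truncating: shrink i_size, unmap and drop pagecache
                     * beyond the new end of file */
            }
            return 0;
    }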

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Intensive swapoff testing shows shmem_unuse spinning on an entry in
    shmem_swaplist pointing to itself: how does that come about? Days pass...

    First guess is this: shmem_delete_inode tests list_empty without taking the
    global mutex (so the swapping case doesn't slow down the common case); but
    there's an instant in shmem_unuse_inode's list_move_tail when the list entry
    may appear empty (a rare case, because it's actually moving the head, not
    the list member). So there's a danger of leaving the inode on the swaplist
    when it's freed, then reinitialized to point to itself when reused. Fix that
    by skipping the list_move_tail when it's a no-op, which happens to plug this.

    But this same spinning then surfaces on another machine. Ah, I'd never
    suspected it, but shmem_writepage's swaplist manipulation is unsafe: though
    we still hold the page lock, which would hold off inode deletion if the page
    were in pagecache, it doesn't hold off once it's in swapcache
    (free_swap_and_cache doesn't wait on locked pages). Hmm: we could put the
    inode on the swaplist earlier, but then shmem_unuse_inode could never prune
    unswapped inodes.

    Fix this with an igrab before dropping info->lock, as in shmem_unuse_inode;
    though I am a little uneasy about the iput which has to follow - it works, and
    I see nothing wrong with it, but it is surprising that shmem inode deletion
    may now occur below shmem_writepage. Revisit this fix later?

    And while we're looking at these races: the way shmem_unuse tests swapped
    without holding info->lock looks unsafe, if we've more than one swap area: a
    racing shmem_writepage on another page of the same inode could be putting it
    in swapcache, just as we're deciding to remove the inode from swaplist -
    there's a danger of going on swap without being listed, so a later swapoff
    would hang, being unable to locate the entry. Move that test and removal down
    into shmem_unuse_inode, once info->lock is held.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Nick has observed that shmem.c still uses GFP_ATOMIC when adding to page cache
    or swap cache, without any radix tree preload: so tending to deplete emergency
    reserves of memory.

    GFP_ATOMIC remains appropriate in shmem_writepage's add_to_swap_cache: it's
    being called under memory pressure, so must not wait for more memory to become
    available. But shmem_unuse_inode now has a window in which it can and should
    preload with GFP_KERNEL, and say GFP_NOWAIT instead of GFP_ATOMIC in its
    add_to_page_cache.

    shmem_getpage is not so straightforward: its filepage/swappage integrity
    relies upon exchanging between caches under spinlock, and it would need a lot
    of restructuring to place the preloads correctly. Instead, follow its pattern
    of retrying on races: use GFP_NOWAIT instead of GFP_ATOMIC in
    add_to_page_cache, and begin each circuit of the repeat loop with a sleeping
    radix_tree_preload, followed immediately by radix_tree_preload_end - that
    won't guarantee success in the next add_to_page_cache, but doesn't need to.

    And we can then remove that bothersome congestion_wait: when needed, it'll
    automatically get done in the course of the radix_tree_preload.
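
    A sketch of that retry pattern (the shmem-specific state checks are
    elided, and the wrapper name is illustrative): each pass starts with a
    sleeping radix_tree_preload() that is released immediately, so the later
    GFP_NOWAIT insertion has a warm per-cpu node without dipping into atomic
    reserves.

    static int add_page_sketch(struct page *page, struct address_space *mapping,
                               pgoff_t index)
    {
            int error;

            do {
                    error = radix_tree_preload(GFP_KERNEL);  /* may sleep */
                    if (error)
                            return error;
                    radix_tree_preload_end();

                    /* ... recheck filepage/swappage state under the lock ... */

                    error = add_to_page_cache(page, mapping, index, GFP_NOWAIT);
            } while (error == -EEXIST);     /* lost a race: go round again */

            return error;
    }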

    Signed-off-by: Hugh Dickins
    Looks-good-to: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins