07 Jan, 2009

40 commits

  • At this point we already know that 'addr' is not NULL, so get rid of
    the redundant 'if'. gcc probably eliminates it in an optimization pass
    anyway.

    [akpm@linux-foundation.org: use __weak, too]
    Signed-off-by: Cyrill Gorcunov
    Reviewed-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Wassim Dagash reported the following kswapd infinite-loop problem.

    kswapd runs in some infinite loop trying to swap until order 10 of zone
    highmem is OK.... kswapd will continue to try to balance order 10 of zone
    highmem forever (or until someone releases a very large chunk of highmem).

    For non order-0 allocations, the system may never be balanced due to
    fragmentation but kswapd should not infinitely loop as a result.

    Instead, recheck all watermarks at order-0, as they are the most
    important ones. If those watermarks are OK, kswapd will go back to
    sleep.
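    A sketch of the shape of the fix, at the bottom of balance_pgdat() in
    mm/vmscan.c (hedged: the control flow here is illustrative, not the
    literal patch; zone_watermark_ok() and zone->pages_high are the real
    2.6.28-era symbols):

        /*
         * Fragmentation may mean the node can never be balanced for a
         * high order.  Before looping again, recheck the (most
         * important) order-0 watermarks; if those are fine, stop
         * reclaiming and go back to sleep.
         */
        if (!all_zones_ok && order) {
                int i, zeroes_ok = 1;

                for (i = 0; i < pgdat->nr_zones; i++) {
                        struct zone *zone = pgdat->node_zones + i;

                        if (!populated_zone(zone))
                                continue;
                        if (!zone_watermark_ok(zone, 0, zone->pages_high,
                                               0, 0))
                                zeroes_ok = 0;
                }
                if (zeroes_ok)
                        goto out;       /* balanced at order-0: sleep */
        }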

    [akpm@linux-foundation.org: fix comment]
    Reported-by: wassim dagash
    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Nick Piggin
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Moving the request details print-out before the sanity checks that
    might panic() enables us to analyse invalid requests without having
    access to the line information of the stack dump.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When dup_mmap() OOMs we can end up with mm->mmap == NULL. The error
    path does mmput(), and unmap_vmas() gets a NULL vma which it
    dereferences.

    In exit_mmap() there is nothing to do at all for this case, so we can
    bail out of the call path right there.
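    The fix itself is a two-line guard; a simplified sketch of exit_mmap()
    in mm/mmap.c (the real function does more setup first):

        void exit_mmap(struct mm_struct *mm)
        {
                struct vm_area_struct *vma = mm->mmap;

                /* Can happen if dup_mmap() received an OOM */
                if (!vma)
                        return;

                /* ... otherwise proceed: unmap_vmas(), free_pgtables() ... */
        }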

    [akpm@linux-foundation.org: add sorely-needed comment]
    Signed-off-by: Johannes Weiner
    Reported-by: Akinobu Mita
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • page_queue_congested() was introduced in 2002, but it was never used.
    Remove it.

    Signed-off-by: KOSAKI Motohiro
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • No architectures use CONFIG_OUT_OF_LINE_PFN_TO_PAGE - it can be removed.

    Signed-off-by: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • xacct_add_tsk() relies on do_exit()->update_hiwater_xxx() and uses
    mm->hiwater_xxx directly; this leads to two problems:

    - taskstats_user_cmd() can call fill_pid()->xacct_add_tsk() at any
    moment before the task exits, so we should check the current values of
    rss/vm anyway.

    - do_exit()->update_hiwater_xxx() calls are racy. An exiting thread can
    be preempted right before mm->hiwater_xxx = new_val, and another thread
    can use A_LOT of memory and exit in between. When the first thread
    resumes it can be the last thread in the thread group, in that case we
    report the wrong hiwater_xxx values which do not take A_LOT into
    account.

    Introduce get_mm_hiwater_rss() and get_mm_hiwater_vm() helpers and change
    xacct_add_tsk() to use them. The first helper will also be used by
    rusage->ru_maxrss accounting.

    Kill the do_exit()->update_hiwater_xxx() calls. Unless we are going to
    decrease rss/vm there is no point in updating mm->hiwater_xxx, and
    nobody can look at this mm_struct by the time exit_mmap() actually
    unmaps the memory.
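    The helpers are one-liners; a sketch of their likely shape, given the
    description above (hedged: field and helper names as used in this
    changelog):

        static inline unsigned long get_mm_hiwater_rss(struct mm_struct *mm)
        {
                return max(mm->hiwater_rss, get_mm_rss(mm));
        }

        static inline unsigned long get_mm_hiwater_vm(struct mm_struct *mm)
        {
                return max(mm->hiwater_vm, mm->total_vm);
        }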

    Signed-off-by: Oleg Nesterov
    Acked-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Frustratingly, gfp_t is really divided into two classes of flags. One
    class is the context-dependent flags (can we sleep? can we enter the
    filesystem? the block subsystem? should we use some extra reserves?
    etc.). The other is the type of memory required, which depends on how
    the algorithm is implemented rather than on the point at which the
    memory is allocated (highmem? DMA memory? etc.).

    Some of the functions which allocate a page and add it to page cache
    take a gfp_t, but sometimes those functions or their callers aren't
    really doing the right thing: when allocating a pagecache page, the
    memory type should be mapping_gfp_mask(mapping); when allocating
    radix-tree nodes, the memory type should be kernel-mapped (not highmem)
    memory. The gfp_t argument should really only be needed for the
    context-dependent options.

    This patch doesn't really solve that tangle in a nice way, but it does
    attempt to fix a couple of bugs.

    - find_or_create_page() changes its radix-tree allocation to include
    only the main context-dependent flags, so that the pagecache page may
    be allocated from arbitrary types of memory without affecting the
    radix tree. In practice, slab allocations don't come from highmem
    anyway, and the radix tree only uses slab allocations, so there is no
    practical change (unless some fs uses GFP_DMA for pages).

    - grab_cache_page_nowait() is changed to allocate radix-tree nodes with
    GFP_NOFS, because it is not supposed to reenter the filesystem. This
    bug could cause lock recursion if a filesystem is not expecting the
    function to reenter the fs (as-per documentation).

    Filesystems should be careful about exactly what semantics they want
    and what they get when fiddling with gfp_t masks to allocate pagecache:
    be as liberal as possible with the type of memory that can be used, and
    the same for the context-specific flags.
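    To illustrate the first fix, a hedged sketch of find_or_create_page()
    after the change: the page itself may come from any memory type, while
    the radix-tree nodes see only the context-dependent bits (the mask of
    those bits is the one mm/internal.h calls GFP_RECLAIM_MASK):

        struct page *find_or_create_page(struct address_space *mapping,
                                         pgoff_t index, gfp_t gfp_mask)
        {
                struct page *page;
                int err;
        repeat:
                page = find_lock_page(mapping, index);
                if (!page) {
                        page = __page_cache_alloc(gfp_mask);
                        if (!page)
                                return NULL;
                        /* radix-tree nodes are slab: strip memory-type bits */
                        err = add_to_page_cache_lru(page, mapping, index,
                                            gfp_mask & GFP_RECLAIM_MASK);
                        if (unlikely(err)) {
                                page_cache_release(page);
                                page = NULL;
                                if (err == -EEXIST)
                                        goto repeat;
                        }
                }
                return page;
        }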

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Direct IO can invalidate and sync a lot of pagecache pages in the mapping.
    A 4K direct IO will actually try to sync and/or invalidate the pagecache
    of the entire file, for example (which might be many GB or TB large).

    Improve this by doing range syncs. Also, memory no longer has to be
    unmapped to catch the dirty bits for syncing, as dirty bits would remain
    coherent due to dirty mmap accounting.

    This fixes the immediate DM deadlocks when doing direct IO reads to a
    block device with a mounted filesystem, if only by papering over the
    problem somewhat rather than addressing the fsync starvation cases.
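    The range-sync idea, sketched (hedged: filemap_write_and_wait_range()
    and invalidate_inode_pages2_range() are existing kernel helpers; the
    variable names here are illustrative):

        /* Sync and invalidate only what the direct IO actually covers */
        loff_t pos = offset;
        loff_t end = offset + count - 1;
        int ret;

        ret = filemap_write_and_wait_range(mapping, pos, end);
        if (!ret)
                ret = invalidate_inode_pages2_range(mapping,
                                        pos >> PAGE_CACHE_SHIFT,
                                        end >> PAGE_CACHE_SHIFT);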

    Signed-off-by: Nick Piggin
    Reviewed-by: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Fix a little of the coding style in mm/mmap.c

    [akpm@linux-foundation.org: cleanup]
    Signed-off-by: ZhenwenXu
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    ZhenwenXu
     
  • tiny-shmem shares most of its 130 lines of code with shmem and tends to
    break when particular bits of shmem get modified. Unifying saves code and
    makes keeping these two in sync much easier.

    before:
       text    data     bss     dec     hex filename
      14367     392      24   14783    39bf mm/shmem.o
        396      72       8     476     1dc mm/tiny-shmem.o

    after:
       text    data     bss     dec     hex filename
      14367     392      24   14783    39bf mm/shmem.o
        412      72       8     492     1ec mm/shmem.o (tiny config)

    Signed-off-by: Matt Mackall
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • The initial implementation of checking TIF_MEMDIE covers the OOM-kill
    case: if the process has been OOM killed, TIF_MEMDIE is set and
    get_user_pages() returns immediately. This patch does two things:

    1. Adds the case where the SIGKILL is sent by a user process. The
    process can try to get_user_pages() unlimited memory even if a user
    process has sent it a SIGKILL (maybe a monitor found the process
    exceeding its memory limit and tried to kill it). In the old
    implementation, the SIGKILL wasn't handled until get_user_pages()
    returned.

    2. Changes the return value to ERESTARTSYS. It makes no sense to
    return ENOMEM if get_user_pages() returned because a SIGKILL was
    received. The general convention for a system call interrupted by a
    signal is ERESTARTSYS, so the new return value is consistent with
    that.

    Lee:

    An unfortunate side effect of "make-get_user_pages-interruptible" is
    that it prevents a SIGKILL'd task from munlock-ing pages that it had
    mlocked, resulting in freeing of mlocked pages. Freeing of mlocked
    pages, in itself, is not so bad. We just count them now, although I had
    hoped to remove this stat and add PG_MLOCKED to the free-page flags
    check.

    However, consider pages in shared libraries mapped by more than one task
    that a task mlocked--e.g., via mlockall(). If the task that mlocked the
    pages exits via SIGKILL, these pages would be left mlocked and
    unevictable.

    Proposed fix:

    Add another GUP flag to ignore SIGKILL when calling get_user_pages()
    from munlock(), similar to KOSAKI Motohiro's 'IGNORE_VMA_PERMISSIONS'
    flag for the same purpose. We are not actually allocating memory in
    this case, which "make-get_user_pages-interruptible" intends to avoid.
    We're just munlocking pages that are already resident and mapped, and
    we're reusing get_user_pages() to access those pages.

    ?? Maybe we should combine 'IGNORE_VMA_PERMISSIONS' and
    'IGNORE_SIGKILL' into a single flag: GUP_FLAGS_MUNLOCK ???
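    A sketch of the combined result (hedged: the flag spelling here follows
    the discussion above; the exact name in the patch may differ):

        /* Inside the per-page loop of get_user_pages(): */
        if (!(gup_flags & GUP_FLAGS_IGNORE_SIGKILL) &&
            fatal_signal_pending(current)) {
                /*
                 * Don't keep faulting in (and possibly allocating)
                 * pages for a task with a pending SIGKILL.  Callers
                 * like munlock(), which only touch already-resident
                 * pages, pass the ignore flag instead.
                 */
                return i ? i : -ERESTARTSYS;
        }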

    [Lee.Schermerhorn@hp.com: ignore sigkill in get_user_pages during munlock]
    Signed-off-by: Paul Menage
    Signed-off-by: Ying Han
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Pekka Enberg
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Cc: Oleg Nesterov
    Cc: Lee Schermerhorn
    Cc: Rohit Seth
    Cc: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • These three statements manipulate local variables and do not need lock
    coverage.

    Cc: Johannes Weiner
    Cc: Lee Schermerhorn
    Cc: Rik van Riel
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • bad_page() and rmap Eeek messages have said KERN_EMERG for a few years,
    which I've followed in print_bad_pte(). These are serious system errors,
    on a par with BUGs, but they're not quite emergencies, and we do our best
    to carry on: say KERN_ALERT "BUG: " like the x86 oops does.

    And remove the "Trying to fix it up, but a reboot is needed" line: it's
    not untrue, but I hope the KERN_ALERT "BUG: " conveys as much.

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • print_bad_pte() and bad_page() might each need ratelimiting, especially
    for their dump_stack()s, which are almost never of interest yet not
    quite dispensable. Correlating corruption across neighbouring entries
    can be very helpful, so allow a burst of 60 reports before keeping
    quiet for the remainder of that minute (or allow a steady drip of one
    report per second).
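    A sketch of the burst-then-quiet logic (static state inside
    bad_page()/print_bad_pte(); jiffies arithmetic as usual):

        static unsigned long resume;
        static unsigned long nr_shown;
        static unsigned long nr_unshown;

        /*
         * Allow a burst of 60 reports, then keep quiet for that minute;
         * or allow a steady drip of one report per second.
         */
        if (nr_shown == 60) {
                if (time_before(jiffies, resume)) {
                        nr_unshown++;   /* suppressed; just count it */
                        return;
                }
                if (nr_unshown) {
                        printk(KERN_ALERT
                               "BUG: Bad page: %lu messages suppressed\n",
                               nr_unshown);
                        nr_unshown = 0;
                }
                nr_shown = 0;
        }
        if (nr_shown++ == 0)
                resume = jiffies + 60 * HZ;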

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove page_remove_rmap()'s vma arg, which was only for the Eeek message.
    And remove the BUG_ON(page_mapcount(page) == 0) from CONFIG_DEBUG_VM's
    page_dup_rmap(): we're trying to be more resilient about that than BUGs.

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Complete zap_pte_range()'s coverage of bad pagetable entries by calling
    print_bad_pte() on a pte_file in a linear vma and on a bad swap entry.
    That needs free_swap_and_cache() to tell it, which will also have shown
    one of those "swap_free" errors (but with much less information).

    Similar checks in fork's copy_one_pte()? No, that would be more noisy
    than helpful: we'll see them when parent and child exec or exit.

    Where do_nonlinear_fault() calls print_bad_pte(): omit !VM_CAN_NONLINEAR
    case, that could only be a bug in sys_remap_file_pages(), not a bad pte.
    VM_FAULT_OOM rather than VM_FAULT_SIGBUS? Well, okay, that is consistent
    with what happens if do_swap_page() operates a bad swap entry; but don't
    we have patches to be more careful about killing when VM_FAULT_OOM?

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • print_bad_pte() is so far being called only when zap_pte_range() finds
    a negative page_mapcount, or when there's a fault on a pte_file where
    it does not belong. That's weak coverage when we suspect pagetable
    corruption.

    Originally, it was called when vm_normal_page() found an invalid pfn: but
    pfn_valid is expensive on some architectures and configurations, so 2.6.24
    put that under CONFIG_DEBUG_VM (which doesn't help in the field), then
    2.6.26 replaced it by a VM_BUG_ON (likewise).

    Reinstate the print_bad_pte() in vm_normal_page(), but use a cheaper
    test than pfn_valid(): memmap_init_zone() (used in bootup and hotplug)
    keeps a __read_mostly note of the highest_memmap_pfn, and
    vm_normal_page() then checks the pfn against that. We could call this
    pfn_plausible() or pfn_sane(), but I doubt we'll need it elsewhere: of
    course it's not reliable, but it gives much stronger pagetable
    validation on many boxes.

    Also use print_bad_pte() when the pte_special bit is found outside a
    VM_PFNMAP or VM_MIXEDMAP area, instead of VM_BUG_ON.
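    Sketched (hedged: print_bad_pte()'s signature here follows the earlier
    patches in this series):

        /* mm/page_alloc.c */
        unsigned long highest_memmap_pfn __read_mostly;

        /* in memmap_init_zone(), for each pfn range being initialized: */
        if (highest_memmap_pfn < end_pfn - 1)
                highest_memmap_pfn = end_pfn - 1;

        /* mm/memory.c, vm_normal_page(): cheaper than pfn_valid() */
        if (unlikely(pfn > highest_memmap_pfn)) {
                print_bad_pte(vma, addr, pte, NULL);
                return NULL;
        }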

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Now that bad pages are kept out of circulation, there is no need for the
    infamous page_remove_rmap() BUG() - once that page is freed, its negative
    mapcount will issue a "Bad page state" message and the page won't be
    freed. Removing the BUG() allows more info, on subsequent pages, to be
    gathered.

    We do have more info about the page at this point than bad_page() can know
    - notably, what the pmd is, which might pinpoint something like low 64kB
    corruption - but page_remove_rmap() isn't given the address to find that.

    In practice, there is only one call to page_remove_rmap() which has ever
    reported anything, that from zap_pte_range() (usually on exit, sometimes
    on munmap). It has all the info, so remove page_remove_rmap()'s "Eeek"
    message and leave it all to zap_pte_range().

    mm/memory.c already has a hardly used print_bad_pte() function, showing
    some of the appropriate info: extend it to show what we want for the rmap
    case: pte info, page info (when there is a page) and vma info to compare.
    zap_pte_range() already knows the pmd, but print_bad_pte() is easier to
    use if it works that out for itself.

    Some of this info is also shown in bad_page()'s "Bad page state" message.
    Keep them separate, but adjust them to match each other as far as
    possible. Say "Bad page map" in print_bad_pte(), and add a TAINT_BAD_PAGE
    there too.

    print_bad_pte() shows current->comm unconditionally (though it should
    get repeated in the usually irrelevant stack trace): sorry, I misled
    Nick Piggin into making it conditional on vm_mm == current->mm, but
    current->mm is already NULL in the exit case. Usually current->comm is
    good, though exceptionally it may not be that of the mm (when "swapoff"
    for example).

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Until now the bad_page() checkers have special-cased PageReserved, keeping
    those pages out of circulation thereafter. Now extend the special case to
    all: we want to keep ANY page with bad state out of circulation - the
    "free" page may well be in use by something.

    Leave the bad state of those pages untouched, for examination by
    debuggers; except for PageBuddy - leaving that set would risk bringing the
    page back.

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Simplify the PAGE_FLAGS checking and clearing when freeing and allocating
    a page: check the same flags as before when freeing, clear ALL the flags
    (unless PageReserved) when freeing, check ALL flags off when allocating.

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • zone_is_near_oom() is unused.

    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The vmscan bail-out patch moved the nr_reclaimed variable into struct
    scan_control. Unfortunately, the indirect access can easily cause
    cache misses.

    Under heavy memory pressure that's OK: cache misses are already
    plentiful, so it is not observable. But if memory pressure is light,
    the performance degradation is observable.

    I compared the following three patterns (each measured 10 times):

    hackbench 125 process 3000
    hackbench 130 process 3000
    hackbench 135 process 3000

                  2.6.28-rc6                bail-out
    num      125    130    135         125    130    135
    ==============================================================
    71.866 75.86 81.274 93.414 73.254 193.382
    74.145 78.295 77.27 74.897 75.021 80.17
    70.305 77.643 75.855 70.134 77.571 79.896
    74.288 73.986 75.955 77.222 78.48 80.619
    72.029 79.947 78.312 75.128 82.172 79.708
    71.499 77.615 77.042 74.177 76.532 77.306
    76.188 74.471 83.562 73.839 72.43 79.833
    73.236 75.606 78.743 76.001 76.557 82.726
    69.427 77.271 76.691 76.236 79.371 103.189
    72.473 76.978 80.643 69.128 78.932 75.736

    avg 72.545 76.767 78.534 76.017 77.03 93.256
    std 1.89 1.71 2.41 6.29 2.79 34.16
    min 69.427 73.986 75.855 69.128 72.43 75.736
    max 76.188 79.947 83.562 93.414 82.172 193.382

    about a 4-5% degradation.

    This patch therefore introduces a temporary local variable.
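    The shape of the fix, sketched: accumulate on the stack and write back
    to scan_control once (the elided loop body is the existing code):

        static void shrink_zone(int priority, struct zone *zone,
                                struct scan_control *sc)
        {
                /*
                 * Local copy: the hot loop below would otherwise take a
                 * cache miss dereferencing sc on every iteration.
                 */
                unsigned long nr_reclaimed = sc->nr_reclaimed;

                /* ... for each LRU list, while there is work to do: */
                nr_reclaimed += shrink_list(l, nr_to_scan, zone, sc,
                                            priority);

                /* single write-back at the end */
                sc->nr_reclaimed = nr_reclaimed;
        }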

    result:

                  2.6.28-rc6               this patch
    num      125    130    135         125    130    135
    ==============================================================
    71.866 75.86 81.274 67.302 68.269 77.161
    74.145 78.295 77.27 72.616 72.712 79.06
    70.305 77.643 75.855 72.475 75.712 77.735
    74.288 73.986 75.955 69.229 73.062 78.814
    72.029 79.947 78.312 71.551 74.392 78.564
    71.499 77.615 77.042 69.227 74.31 78.837
    76.188 74.471 83.562 70.759 75.256 76.6
    73.236 75.606 78.743 69.966 76.001 78.464
    69.427 77.271 76.691 69.068 75.218 80.321
    72.473 76.978 80.643 72.057 77.151 79.068

    avg 72.545 76.767 78.534 70.425 74.2083 78.462
    std 1.89 1.71 2.41 1.66 2.34 1.00
    min 69.427 73.986 75.855 67.302 68.269 76.6
    max 76.188 79.947 83.562 72.616 77.151 80.321

    The degradation has disappeared.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • When the VM is under pressure, it can happen that several direct reclaim
    processes are in the pageout code simultaneously. It also happens that
    the reclaiming processes run into mostly referenced, mapped and dirty
    pages in the first round.

    This results in multiple direct reclaim processes having a lower
    pageout priority, which corresponds to a higher target of pages to
    scan.

    This in turn can result in each direct reclaim process freeing
    many pages. Together, they can end up freeing way too many pages.

    This kicks useful data out of memory (in some cases more than half
    of all memory is swapped out). It also impacts performance by
    keeping tasks stuck in the pageout code for too long.

    A 30% improvement in hackbench has been observed with this patch.

    The fix is relatively simple: in shrink_zone() we check how many pages
    we have already freed; direct reclaim tasks break out of the scanning
    loop if they have already freed enough pages and have reached a lower
    priority level.

    We do not break out of shrink_zone() when priority == DEF_PRIORITY,
    to ensure that equal pressure is applied to every zone in the common
    case.

    However, in order to do this we do need to know how many pages we already
    freed, so move nr_reclaimed into scan_control.
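
    The check itself is small; a sketch of the bail-out at the bottom of
    shrink_zone()'s scan loop (DEF_PRIORITY and current_is_kswapd() are the
    real kernel symbols):

        /*
         * Freed enough?  Direct reclaimers below the starting priority
         * can stop; kswapd keeps going so zones stay balanced.
         */
        if (sc->nr_reclaimed > sc->swap_cluster_max &&
            priority < DEF_PRIORITY && !current_is_kswapd())
                break;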

    akpm: a historical interlude...

    We tried this in 2004:

    :commit e468e46a9bea3297011d5918663ce6d19094cf87
    :Author: akpm
    :Date: Thu Jun 24 15:53:52 2004 +0000
    :
    :[PATCH] vmscan.c: dont reclaim too many pages
    :
    : The shrink_zone() logic can, under some circumstances, cause far too many
    : pages to be reclaimed. Say, we're scanning at high priority and suddenly hit
    : a large number of reclaimable pages on the LRU.
    : Change things so we bale out when SWAP_CLUSTER_MAX pages have been reclaimed.

    And we reverted it in 2006:

    :commit 210fe530305ee50cd889fe9250168228b2994f32
    :Author: Andrew Morton
    :Date: Fri Jan 6 00:11:14 2006 -0800
    :
    : [PATCH] vmscan: balancing fix
    :
    : Revert a patch which went into 2.6.8-rc1. The changelog for that patch was:
    :
    : The shrink_zone() logic can, under some circumstances, cause far too many
    : pages to be reclaimed. Say, we're scanning at high priority and suddenly
    : hit a large number of reclaimable pages on the LRU.
    :
    : Change things so we bale out when SWAP_CLUSTER_MAX pages have been
    : reclaimed.
    :
    : Problem is, this change caused significant imbalance in inter-zone scan
    : balancing by truncating scans of larger zones.
    :
    : Suppose, for example, ZONE_HIGHMEM is 10x the size of ZONE_NORMAL. The zone
    : balancing algorithm would require that if we're scanning 100 pages of
    : ZONE_HIGHMEM, we should scan 10 pages of ZONE_NORMAL. But this logic will
    : cause the scanning of ZONE_HIGHMEM to bale out after only 32 pages are
    : reclaimed. Thus effectively causing smaller zones to be scanned relatively
    : harder than large ones.
    :
    : Now I need to remember what the workload was which caused me to write this
    : patch originally, then fix it up in a different way...

    And we haven't demonstrated that whatever problem caused that reversion is
    not being reintroduced by this change in 2008.

    Signed-off-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Fix the following sparse warnings:

    mm/hugetlb.c:375:3: warning: returning void-valued expression
    mm/hugetlb.c:408:3: warning: returning void-valued expression
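    The fix for this class of warning is mechanical; a minimal illustration
    (hypothetical helper names, not the hugetlb code itself):

        static void frob(void);

        static void caller(void)
        {
                /*
                 * before: "returning void-valued expression"
                 *      return frob();
                 * after: plain call, then return
                 */
                frob();
                return;
        }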

    Signed-off-by: Hannes Eder
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hannes Eder
     
  • Remove the srandom32((u32)get_seconds()) from non-rotational swapon:
    there's been a coincidental discussion of earlier randomization; assume
    that goes ahead, and let swapon be a client rather than stirring for
    itself.

    Signed-off-by: Hugh Dickins
    Cc: David Woodhouse
    Cc: Donjun Shin
    Cc: James Bottomley
    Cc: Jens Axboe
    Cc: Joern Engel
    Cc: KAMEZAWA Hiroyuki
    Cc: Matthew Wilcox
    Cc: Nick Piggin
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Change pgoff_t nr_blocks in discard_swap() and discard_swap_cluster() to
    sector_t: given the constraints on swap offsets (in particular, the 5 bits
    of swap type accommodated in the same unsigned long), pgoff_t was actually
    safe as is, but it certainly looked worrying when shifted left.

    [akpm@linux-foundation.org: fix shift overflow]
    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Cc: David Woodhouse
    Cc: Jens Axboe
    Cc: Matthew Wilcox
    Cc: Joern Engel
    Cc: James Bottomley
    Cc: Donjun Shin
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Though attempting to find free clusters (Andrea), swap allocation has
    always restarted its searches from the beginning of the swap area (sct),
    to reduce seek times between swap pages, by not scattering them all over
    the partition.

    But on a solidstate swap device, seeks are cheap, and block remapping to
    level the wear may be limited by zones: in that case it's better to cycle
    around the whole partition.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Cc: David Woodhouse
    Cc: Jens Axboe
    Cc: Matthew Wilcox
    Cc: Joern Engel
    Cc: James Bottomley
    Cc: Donjun Shin
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Swap allocation has always started from the beginning of the swap
    area; but if we're dealing with a solidstate swap device which can only
    remap blocks within limited zones, that would sooner wear out the first
    zone.

    Therefore sys_swapon() tests whether the block queue is non-rotational,
    and if so randomizes the cluster_next starting position for allocation.

    If blk_queue is nonrot, note SWP_SOLIDSTATE for later use, and report it
    with an "SS" at the right end of the kernel's "Adding ... swap" message
    (so that if it's both nonrot and discardable, "SSD" will be shown there).
    Perhaps something should be shown in /proc/swaps (swapon -s), but we have
    to be more cautious before making any addition to that format.
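    Sketched (hedged: blk_queue_nonrot() and random32() are the 2.6.28-era
    interfaces; 'p' is the swap_info_struct being set up):

        /* in sys_swapon(), once the backing bdev is known: */
        if (p->bdev && blk_queue_nonrot(bdev_get_queue(p->bdev))) {
                p->flags |= SWP_SOLIDSTATE;
                /* random start position helps SSD wear levelling */
                p->cluster_next = 1 + (random32() % p->highest_bit);
        }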

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Cc: David Woodhouse
    Cc: Jens Axboe
    Cc: Matthew Wilcox
    Cc: Joern Engel
    Cc: James Bottomley
    Cc: Donjun Shin
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When scan_swap_map() finds a free cluster of swap pages to allocate,
    discard the old contents of the cluster if the device supports discard.
    But don't bother when swap is so fragmented that we allocate single pages.

    Be careful about racing allocations made while we're scanning for a
    cluster; and hold up allocations made while we're discarding.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Cc: David Woodhouse
    Cc: Jens Axboe
    Cc: Matthew Wilcox
    Cc: Joern Engel
    Cc: James Bottomley
    Cc: Donjun Shin
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When adding swap, all the old data on swap can be forgotten:
    sys_swapon() discards all but the header page of the swap partition (or
    every extent but the header of the swap file), to give a solidstate
    swap device the opportunity to optimize its wear-levelling.

    If that succeeds, note SWP_DISCARDABLE for later use, and report it with a
    "D" at the right end of the kernel's "Adding ... swap" message. Perhaps
    something should be shown in /proc/swaps (swapon -s), but we have to be
    more cautious before making any addition to that format.
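    The discard is delegated to the block layer; a hedged sketch of the
    per-extent loop (blkdev_issue_discard() is the real helper; the extent
    walk is simplified):

        static int discard_swap(struct swap_info_struct *si)
        {
                struct swap_extent *se;
                int err = 0;

                list_for_each_entry(se, &si->extent_list, list) {
                        sector_t start_block = se->start_block
                                                << (PAGE_SHIFT - 9);
                        sector_t nr_blocks = (sector_t)se->nr_pages
                                                << (PAGE_SHIFT - 9);

                        if (se->start_page == 0) {
                                /* never discard the swap header page */
                                start_block += 1 << (PAGE_SHIFT - 9);
                                nr_blocks -= 1 << (PAGE_SHIFT - 9);
                                if (!nr_blocks)
                                        continue;
                        }
                        err = blkdev_issue_discard(si->bdev, start_block,
                                                   nr_blocks, GFP_KERNEL);
                        if (err)
                                break;
                        cond_resched();
                }
                return err;     /* caller notes SWP_DISCARDABLE on success */
        }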

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Cc: David Woodhouse
    Cc: Jens Axboe
    Cc: Matthew Wilcox
    Cc: Joern Engel
    Cc: James Bottomley
    Cc: Donjun Shin
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Before making functional changes, rearrange scan_swap_map() to simplify
    subsequent diffs. Actually, there is one functional change in there:
    leave cluster_nr negative while scanning for a new cluster - resetting it
    early increased the likelihood that when we have difficulty finding a free
    cluster, another task may come in and try doing exactly the same - just a
    waste of cpu.

    Before making functional changes, rearrange struct swap_info_struct
    slightly: flags will be needed as an unsigned long (for wait_on_bit), next
    is a good int to pair with prio, old_block_size is uninteresting so shift
    it to the end.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The kernel has not supported v0 SWAP-SPACE since 2.5.22: I think we can
    now safely drop its "version 0 swap is no longer supported" message - just
    say "Unable to find swap-space signature" as usual. This removes one
    level of indentation from a stretch of sys_swapon().

    I'd have liked to be specific, saying "Unable to find SWAPSPACE2
    signature", but it's just too confusing that the version 1 signature shows
    the number 2.

    Irrelevant nearby cleanup: kmap(page) already gives page_address(page).

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove trailing whitespace from swapfile.c, and odd swap_show() alignment.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove the SWP_ACTIVE mask: it just obscures the SWP_WRITEOK flag.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • sys_swapon()'s swapfilesize (better renamed swapfilepages) is declared
    as an int, but should be an unsigned long like the maxpages it's
    compared against: on 64-bit (with 4kB pages) a swapfile of 2^44 bytes
    is 2^32 pages, which overflows an int, so it was rejected with "Swap
    area shorter than signature indicates".

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Sparse emits the following warning:

    mm/page_alloc.c:4301:6: warning: symbol 'setup_per_zone_inactive_ratio' was not declared. Should it be static?

    Clean it up by declaring the function static.

    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Sparse emits the following warning:

    mm/vmscan.c:2507:6: warning: symbol 'scan_zone_unevictable_pages' was not declared. Should it be static?

    Clean it up by declaring the function static.

    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Sparse emits the following warning:

    mm/vmscan.c:2549:6: warning: symbol 'scan_all_zones_unevictable_pages' was not declared. Should it be static?

    Clean it up by declaring the function static.

    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Sparse emits the following warning:

    mm/memcontrol.c:782:5: warning: symbol 'mem_cgroup_resize_limit' was not declared. Should it be static?

    Clean it up by declaring the function static.

    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro