24 Oct, 2008

1 commit

  • * 'proc' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc: (35 commits)
    proc: remove fs/proc/proc_misc.c
    proc: move /proc/vmcore creation to fs/proc/vmcore.c
    proc: move pagecount stuff to fs/proc/page.c
    proc: move all /proc/kcore stuff to fs/proc/kcore.c
    proc: move /proc/schedstat boilerplate to kernel/sched_stats.h
    proc: move /proc/modules boilerplate to kernel/module.c
    proc: move /proc/diskstats boilerplate to block/genhd.c
    proc: move /proc/zoneinfo boilerplate to mm/vmstat.c
    proc: move /proc/vmstat boilerplate to mm/vmstat.c
    proc: move /proc/pagetypeinfo boilerplate to mm/vmstat.c
    proc: move /proc/buddyinfo boilerplate to mm/vmstat.c
    proc: move /proc/vmallocinfo to mm/vmalloc.c
    proc: move /proc/slabinfo boilerplate to mm/slub.c, mm/slab.c
    proc: move /proc/slab_allocators boilerplate to mm/slab.c
    proc: move /proc/interrupts boilerplate code to fs/proc/interrupts.c
    proc: move /proc/stat to fs/proc/stat.c
    proc: move rest of /proc/partitions code to block/genhd.c
    proc: move /proc/cpuinfo code to fs/proc/cpuinfo.c
    proc: move /proc/devices code to fs/proc/devices.c
    proc: move rest of /proc/locks to fs/locks.c
    ...

    Linus Torvalds
     

20 Oct, 2008

26 commits

  • This patch makes the needlessly global anon_vma_cachep static.

    Signed-off-by: Adrian Bunk
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Allocate all page_cgroup at boot and remove the page_cgroup pointer from
    struct page. This patch adds the interface:

    struct page_cgroup *lookup_page_cgroup(struct page*)

    All of FLATMEM, DISCONTIGMEM, SPARSEMEM and MEMORY_HOTPLUG are supported.
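
    As a rough illustration of the lookup (not the literal patch; the struct
    layout and the base/base_pfn variables below are assumptions), the flat
    memory case can reduce to indexing a boot-allocated array by page frame
    number:

    /* Hedged sketch of a flat-array lookup; illustrative names only. */
    struct page_cgroup {
            unsigned long flags;
            struct mem_cgroup *mem_cgroup;
            struct page *page;
            struct list_head lru;
    };

    static struct page_cgroup *base_page_cgroup;    /* allocated at boot */
    static unsigned long base_pfn;                   /* first managed pfn */

    struct page_cgroup *lookup_page_cgroup(struct page *page)
    {
            unsigned long pfn = page_to_pfn(page);
            return base_page_cgroup + (pfn - base_pfn);
    }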

    Removing the page_cgroup pointer reduces memory usage by
    - 4 bytes per page on 32-bit,
    - 8 bytes per page on 64-bit,
    when the memory controller is disabled (even if it is configured in).

    On a typical 8GB x86-32 server, this saves 8MB of NORMAL_ZONE memory.
    On my x86-64 server with 48GB of memory, this saves 96MB of memory.
    I think this reduction makes sense.

    With pre-allocation, the kmalloc/kfree calls in charge/uncharge are removed.
    This means
    - we no longer need to worry about kmalloc failure
      (which can happen because of the gfp_mask type),
    - we avoid calling kmalloc/kfree,
    - we avoid allocating tons of small objects which can become fragmented,
    - we know how much memory will be used for this extra LRU handling.

    I added printk messages:

    "allocated %ld bytes of page_cgroup"
    "please try cgroup_disable=memory option if you don't want"

    which should be informative enough for users.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch converts page_cgroup->flags to atomic bit operations and defines
    functions (and macros) to access it.

    Before the memory resource controller can be reworked, atomic operations on
    the flags are necessary. Most of the flags in this patch are for the LRU
    and are modified under mz->lru_lock, but we will soon add other flags which
    are not LRU-related. For example, we will place a LOCK bit in the flags
    field, and we need atomic operations to modify the LRU bits without holding
    that lock.
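
    A hedged sketch of the accessor pattern (the flag and macro names here are
    assumptions, not necessarily those added by the patch): the flags word
    becomes a bitmap manipulated with the kernel's atomic bitops:

    /* Illustrative only: flag bits and macro names are assumed. */
    enum page_cgroup_flags {
            PCG_LOCK,       /* assumed: per-page_cgroup lock bit */
            PCG_CACHE,      /* assumed: charged as page cache */
            PCG_ACTIVE,     /* assumed: on the active LRU */
    };

    #define SetPageCgroupFlag(pc, bit)   set_bit(bit, &(pc)->flags)
    #define ClearPageCgroupFlag(pc, bit) clear_bit(bit, &(pc)->flags)
    #define TestPageCgroupFlag(pc, bit)  test_bit(bit, &(pc)->flags)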

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Some obvious optimizations to memcg.

    I found that mem_cgroup_charge_statistics() is a little big (in object
    size) and does unnecessary address calculation. This patch reduces the
    size of that function.

    It also annotates res_counter_charge() as 'likely' to succeed.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • There are not-on-LRU pages which can be mapped, and they are not worth
    accounting (because we cannot shrink them, and handling the special case
    would need ugly code). We would like to use the usual objrmap/radix-tree
    protocol and avoid accounting pages that are outside of the VM's control.

    When special_mapping_fault() is called, page->mapping tends to be NULL and
    the page is charged as an anonymous page. insert_page() also handles some
    special pages from drivers.

    This patch avoids accounting such special pages.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch tries to make page->mapping NULL before
    mem_cgroup_uncharge_cache_page() is called.

    "page->mapping == NULL" is a good check for "whether the page is still in
    the radix-tree or not". This patch also adds a BUG_ON() to
    mem_cgroup_uncharge_cache_page().

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • While page-cache charge/uncharge is done under the page lock, swap-cache
    charge/uncharge is not. (An anonymous page is charged when it is newly
    allocated.)

    This patch moves do_swap_page()'s charge() call under the lock. I don't see
    any actual problem *now*, but this fix is good for the future, to avoid an
    unnecessarily racy state.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • To prepare for the chunking, move the sys_move_pages() code that is used
    when nodes != NULL into do_pages_move(), and rename do_move_pages() to
    do_move_page_to_node_array().

    Signed-off-by: Brice Goglin
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     
  • do_pages_stat() does not actually need any page_to_node entries. Just pass
    the pointers to the user-space page address array and to the user-space
    status array, and have do_pages_stat() traverse the former and fill the
    latter directly.
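
    In outline (a hedged sketch with assumed helper and variable names, not the
    literal diff), the loop walks the two user-space arrays directly:

    /* Sketch only: nr_pages, pages, status and lookup_node() are assumed
     * names; get_user()/put_user() are the real user-access helpers. */
    unsigned long i;

    for (i = 0; i < nr_pages; i++) {
            const void __user *addr;
            int err;

            if (get_user(addr, pages + i))
                    return -EFAULT;
            err = lookup_node(mm, (unsigned long)addr);   /* hypothetical */
            if (put_user(err, status + i))
                    return -EFAULT;
    }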

    Signed-off-by: Brice Goglin
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     
  • A patchset reworking sys_move_pages(). It removes the possibly large
    vmalloc by using multiple chunks when migrating large buffers. It also
    dramatically increases the throughput for large buffers, since the lookup
    in new_page_node() is now limited to a single chunk, which greatly reduces
    the impact of its quadratic complexity. There is no need to use any
    radix-tree-like structure to improve this lookup.

    sys_move_pages() duration on a machine with four quad-core Opteron 2347HE
    processors (1.9 GHz), migrating between nodes #2 and #3:

    length    move_pages (us)    move_pages+patch (us)
      4kB          126                   98
     40kB          198                  168
    400kB          963                  937
      4MB        12503                11930
     40MB       246867                11848

    Patches #1 and #4 are the important ones:
    1) stop returning -ENOENT from sys_move_pages() if nothing got migrated
    2) don't vmalloc a huge page_to_node array for do_pages_stat()
    3) extract do_pages_move() out of sys_move_pages()
    4) rework do_pages_move() to work on page_sized chunks
    5) move_pages: no need to set pp->page to ZERO_PAGE(0) by default

    This patch:

    There is no point in returning -ENOENT from sys_move_pages() if all pages
    were already on the right node, while we return 0 if only one page was not.
    Most applications don't know where their pages are allocated, so it's not
    an error to try to migrate them anyway.

    Just return 0 and let the status array in user-space be checked if the
    application needs details.

    It will make the upcoming chunked-move_pages() support much easier.
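
    From user space (a hedged sketch; the pages/nodes/status arrays are assumed
    to be set up by the caller), a return of 0 now simply means "inspect the
    status array"; pages that were already on the requested node are no longer
    an error for the call as a whole:

    #include <numaif.h>
    #include <stdio.h>

    long rc = move_pages(0 /* this process */, count, pages, nodes, status,
                         MPOL_MF_MOVE);
    if (rc == 0) {
            unsigned long i;

            for (i = 0; i < count; i++)
                    if (status[i] < 0)
                            fprintf(stderr, "page %lu not moved: %d\n",
                                    i, status[i]);
    }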

    Signed-off-by: Brice Goglin
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     
  • During hotplug memory remove, memory regions should be released in
    PAGES_PER_SECTION sized chunks. This mirrors the code in add_memory(),
    where resources are requested in PAGES_PER_SECTION sized chunks.

    Attempting to release the entire memory region fails because there is not
    a single resource covering the total number of pages being removed.
    Instead, the resources for the pages are split into PAGES_PER_SECTION
    sized chunks, as they were requested during memory add.
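
    A hedged sketch of the per-section release (the pfn range variables are
    assumptions; release_mem_region() and PAGES_PER_SECTION are the real kernel
    symbols):

    unsigned long pfn;

    /* Release the region one memory section at a time, mirroring how the
     * resources were requested during add_memory(). */
    for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn += PAGES_PER_SECTION)
            release_mem_region((resource_size_t)pfn << PAGE_SHIFT,
                               PAGES_PER_SECTION << PAGE_SHIFT);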

    Signed-off-by: Nathan Fontenot
    Signed-off-by: Badari Pulavarty
    Acked-by: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nathan Fontenot
     
  • This replaces zone->lru_lock in setup_per_zone_pages_min() with zone->lock.
    There seems to be no need for the lru_lock anymore, but there is a need for
    zone->lock instead, because that function may call move_freepages() via
    setup_zone_migrate_reserve().

    Signed-off-by: Gerald Schaefer
    Acked-by: KAMEZAWA Hiroyuki
    Tested-by: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
  • Presently hugepages don't use the zero page at all, because the zero page
    is only used for core dumping and hugepages couldn't be core dumped.

    However, we have now implemented hugepage core dumping. Therefore we should
    implement zero page support for hugepages.

    Implementation note:

    o Why do we only check VM_SHARED for the zero page?
    The normal page case is checked as:

    static inline int use_zero_page(struct vm_area_struct *vma)
    {
            if (vma->vm_flags & (VM_LOCKED | VM_SHARED))
                    return 0;

            return !vma->vm_ops || !vma->vm_ops->fault;
    }

    First, hugepages are never mlock()ed. We aren't concerned with VM_LOCKED.

    Second, hugetlbfs is a pseudo filesystem, not a real filesystem and it
    doesn't have any file backing. Thus ops->fault checking is meaningless.

    o Why don't we use the zero page if !pte?

    !pte indicates that the {pud, pmd} doesn't exist or that some error
    happened, so we shouldn't return the zero page if any error occurred.
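
    Given the two observations above, the hugepage-side check can reduce to the
    VM_SHARED test alone. A hedged sketch (the helper name is an assumption,
    not necessarily what this patch adds):

    static inline int huge_use_zero_page(struct vm_area_struct *vma)
    {
            /* Hugepages are never mlock()ed and have no ->fault to consult,
             * so only the shared-mapping case rules the zero page out. */
            return !(vma->vm_flags & VM_SHARED);
    }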

    Signed-off-by: KOSAKI Motohiro
    Cc: Adam Litke
    Cc: Hugh Dickins
    Cc: Kawai Hidehiro
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Improve debuggability of memory setup problems.

    Signed-off-by: Yinghai Lu
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     
  • mm/hugetlb.c:265:17: warning: symbol 'resv_map_alloc' was not declared. Should it be static?
    mm/hugetlb.c:277:6: warning: symbol 'resv_map_release' was not declared. Should it be static?
    mm/hugetlb.c:292:9: warning: Using plain integer as NULL pointer
    mm/hugetlb.c:1750:5: warning: symbol 'unmap_ref_private' was not declared. Should it be static?

    Signed-off-by: Harvey Harrison
    Acked-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     
  • Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
    provide a fast, scalable percpu frontend for small vmaps (requires a
    slightly different API, though).

    The biggest problem with vmap is actually vunmap. Presently this requires
    a global kernel TLB flush, which on most architectures is a broadcast IPI
    to all CPUs to flush the cache. This is all done under a global lock. As
    the number of CPUs increases, so will the number of vunmaps a scaled
    workload will want to perform, and so will the cost of a global TLB flush.
    This gives terrible quadratic scalability characteristics.

    Another problem is that the entire vmap subsystem works under a single
    lock. It is a rwlock, but it is actually taken for write in all the fast
    paths, and the read locking would likely never be run concurrently anyway,
    so it's just pointless.

    This is a rewrite of vmap subsystem to solve those problems. The existing
    vmalloc API is implemented on top of the rewritten subsystem.

    The TLB flushing problem is solved by using lazy TLB unmapping. vmap
    addresses do not have to be flushed immediately when they are vunmapped,
    because the kernel will not reuse them again (would be a use-after-free)
    until they are reallocated. So the addresses aren't allocated again until
    a subsequent TLB flush. A single TLB flush then can flush multiple
    vunmaps from each CPU.

    XEN and PAT and such do not like deferred TLB flushing because they can't
    always handle multiple aliasing virtual addresses to a physical address.
    They now call vm_unmap_aliases() in order to flush any deferred mappings.
    That call is very expensive (well, actually not a lot more expensive than
    a single vunmap under the old scheme), however it should be OK if not
    called too often.

    The virtual memory extent information is stored in an rbtree rather than a
    linked list to improve the algorithmic scalability.

    There is a per-CPU allocator for small vmaps, which amortizes or avoids
    global locking.

    To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces
    must be used in place of vmap and vunmap. Vmalloc does not use these
    interfaces at the moment, so it will not be quite so scalable (although it
    will use lazy TLB flushing).
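
    A hedged usage sketch of the per-CPU interface, assuming the signatures
    introduced by this series (the page array and count are caller-provided):

    /* Map a caller-provided array of pages into a temporary linear range.
     * Signature assumed as introduced here: (pages, count, node, prot). */
    void *buf = vm_map_ram(pages, nr_pages, -1 /* any node */, PAGE_KERNEL);
    if (!buf)
            return -ENOMEM;

    /* ... use the linearly mapped buffer ... */

    vm_unmap_ram(buf, nr_pages);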

    As a quick test of performance, I ran a test that loops in the kernel,
    linearly mapping, then touching, then unmapping 4 pages. Different numbers
    of tests were run in parallel on a 4-core, 2-socket Opteron. Results are
    in nanoseconds per map+touch+unmap.

    threads    vanilla (ns)    vmap rewrite (ns)
       1          14700              2900
       2          33600              3000
       4          49500              2800
       8          70631              2900

    So with 8 cores, the rewritten version is already 25x faster.

    In a slightly more realistic test (although with an older and less
    scalable version of the patch), I ripped the not-very-good vunmap batching
    code out of XFS, and implemented the large buffer mapping with vm_map_ram
    and vm_unmap_ram... along with a couple of other tricks, I was able to
    speed up a large directory workload by 20x on a 64 CPU system. I believe
    vmap/vunmap is actually sped up a lot more than 20x on such a system, but
    I'm running into other locks now. vmap is pretty well blown off the
    profiles.

    Before:
    1352059 total 0.1401
    798784 _write_lock 8320.6667
    Cc: Jeremy Fitzhardinge
    Cc: Krzysztof Helt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • The __vma_link_file and expand_downwards functions are not small, yet they
    are marked inline. They probably had one call site some time in the past,
    but now they have more. To prevent a similar thing from happening again, I
    also deinlined expand_upwards, despite it having only one call site.
    Nowadays gcc auto-inlines such static functions anyway. In find_extend_vma,
    I removed one extra level of indirection.

    Patch is deliberately generated with -U $BIGNUM to make
    it easier to see that functions are big.

    Result:

    # size */*/mmap.o */vmlinux
       text    data     bss      dec     hex  filename
       9514     188      16     9718    25f6  0.org/mm/mmap.o
       9237     188      16     9441    24e1  deinline/mm/mmap.o
    6124402  858996  389480  7372878  70804e  0.org/vmlinux
    6124113  858996  389480  7372589  707f2d  deinline/vmlinux

    Signed-off-by: Denys Vlasenko
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Denys Vlasenko
     
  • trylock_page and unlock_page open and close a critical section. Hence,
    we can use the lock bitops to get the desired memory ordering.

    Also, mark trylock as likely to succeed (and remove the annotation from
    callers).
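
    A hedged sketch of the pattern (simplified; the real functions also handle
    waiters and debug checks): acquire/release-ordered bitops provide the
    critical-section ordering.

    /* trylock: acquire semantics via test_and_set_bit_lock() */
    static inline int trylock_page(struct page *page)
    {
            return likely(!test_and_set_bit_lock(PG_locked, &page->flags));
    }

    /* unlock: release semantics via clear_bit_unlock(); the real
     * unlock_page() then wakes any waiters on PG_locked. */
    static inline void __unlock_page_sketch(struct page *page)
    {
            clear_bit_unlock(PG_locked, &page->flags);
    }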

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • unlock_page is fairly expensive. It can be avoided in the page reclaim
    success path. By definition, if we have any other references to the page
    there, it would be a bug anyway.

    Signed-off-by: Nick Piggin
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Setting and clearing the page's locked bit when inserting it into the
    swapcache / pagecache, while it has no other references, can use non-atomic
    page flag operations, because no other CPU may be operating on it at that
    time.

    This saves one atomic operation when inserting a page into pagecache.
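
    A hedged sketch of the idea (the helper names are assumptions): a page with
    no other users can take the non-atomic __set_bit/__clear_bit variants
    instead of the atomic ones.

    /* Safe only while no other CPU can see the page, e.g. right before it
     * is first inserted into the pagecache or swapcache. */
    static inline void __set_page_locked(struct page *page)
    {
            __set_bit(PG_locked, &page->flags);
    }

    static inline void __clear_page_locked(struct page *page)
    {
            __clear_bit(PG_locked, &page->flags);
    }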

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Rework POSIX error returns for mlock().

    POSIX requires error codes for the mlock*() system calls for some
    conditions that differ from what kernel low-level functions, such as
    get_user_pages(), return for those conditions. For more info, see:

    http://marc.info/?l=linux-kernel&m=121750892930775&w=2

    This patch provides the same translation of get_user_pages() error codes to
    the POSIX-specified error codes in the context of the mlock rework for the
    unevictable LRU.
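
    A hedged sketch of the kind of translation involved (simplified; the exact
    call sites differ, and make_pages_present() is named here only as the usual
    wrapper): get_user_pages() reports an unmapped range as -EFAULT, while
    POSIX wants mlock() to return ENOMEM for that case.

    /* Sketch only: translate the low-level error to the POSIX-mandated one. */
    ret = make_pages_present(start, end);   /* may call get_user_pages() */
    if (ret == -EFAULT)
            ret = -ENOMEM;   /* POSIX: some of the range was not mapped */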

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • This change is intended to make mlock() error returns correct.
    make_pages_present() is a lower-level function used by more than just
    mlock(). Subsequent patch[es] will add this error-return fixup in an
    mlock-specific path.

    Cc: KOSAKI Motohiro
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • During each reclaim scan we accumulate scan pressure on unrelated lists,
    which will eventually result in bogus scans and unwanted reclaims.

    Scanning lists with few reclaim candidates results in a lot of rotation and
    therefore also disturbs the list balancing, putting even more pressure on
    the wrong lists.

    In a test case with a lot of streaming IO, and therefore a crowded inactive
    file page list, swapping started because

    a) anon pages were reclaimed after swap_cluster_max reclaim
    invocations -- nr_scan of this list has just accumulated

    b) active file pages were scanned because *their* nr_scan had also
    accumulated through the same logic. And this in turn created a
    lot of rotation for file pages and resulted in a decrease of file
    list priority, again increasing the pressure on anon pages.

    The result was an evicted working set of anon pages while there were
    tons of inactive file pages that should have been taken instead.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Allow freeing of mlock()ed pages. This shouldn't happen, but during
    development, it occasionally did.

    This patch allows us to survive that condition, while keeping the
    statistics and events correct for debug.

    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • This patch adds a function to scan individual or all zones' unevictable
    lists and move any pages that have become evictable onto the respective
    zone's inactive list, where shrink_inactive_list() will deal with them.

    Adds a sysctl to scan all nodes, and per-node attributes to scan individual
    nodes' zones.

    Kosaki: if an evictable page is found on the unevictable LRU when writing
    to /proc/sys/vm/scan_unevictable_pages, print the filename and file offset
    of those pages.

    [akpm@linux-foundation.org: fix one CONFIG_MMU=n build error]
    [kosaki.motohiro@jp.fujitsu.com: adapt vmscan-unevictable-lru-scan-sysctl.patch to new sysfs API]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • In the fault paths that install new anonymous pages, check whether the
    page is evictable or not using lru_cache_add_active_or_unevictable(). If
    the page is evictable, just add it to the active lru list [via the pagevec
    cache], else add it to the unevictable list.

    This "proactive" culling in the fault path mimics the handling of mlocked
    pages in Nick Piggin's series to keep mlocked pages off the lru lists.
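
    A hedged sketch of the fault-path pattern described here (surrounding code
    simplified; the ordering relative to set_pte_at() follows note 4 below):

    /* Put the new anon page on the right LRU before exposing the mapping. */
    page_add_new_anon_rmap(page, vma, address);
    lru_cache_add_active_or_unevictable(page, vma);
    set_pte_at(mm, address, page_table, entry);   /* see note 4 below */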

    Notes:

    1) This patch is optional--e.g., if one is concerned about the
    additional test in the fault path. We can defer the moving of
    nonreclaimable pages until when vmscan [shrink_*_list()]
    encounters them. Vmscan will only need to handle such pages
    once, but if there are a lot of them it could impact system
    performance.

    2) The 'vma' argument to page_evictable() is required to notice that
    we're faulting a page into an mlock()ed vma without having to scan the
    page's rmap in the fault path. Culling mlock()ed anon pages is
    currently the only reason for this patch.

    3) We can't cull swap pages in read_swap_cache_async() because the
    vma argument doesn't necessarily correspond to the swap cache
    offset passed in by swapin_readahead(). This could [did!] result
    in mlocking pages in non-VM_LOCKED vmas if [when] we tried to
    cull in this path.

    4) Move set_pte_at() to after where we add the page to the lru, to keep
    it hidden from other tasks that might walk the page table.
    We already do it in this order in do_anonymous_page(). And,
    these are COW'd anon pages. Is this safe?

    [riel@redhat.com: undo an overzealous code cleanup]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn