23 Mar, 2011

40 commits

  • Commit 6caa76b ("tty: now phase out the ioctl file pointer for good")
    removed the file pointer from the tty ioctl handlers. User Mode Linux's
    line driver implements such an ioctl handler, so its signature needs the
    same update.

    Signed-off-by: Richard Weinberger
    Cc: Alan Cox
    Cc: Greg KH
    Cc: Jeff Dike
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Weinberger
     
  • One of our users reported that when a user-level program SIGSEGVs under
    a UML kernel, the resulting core dump is not very usable.

    I have reproduced that with the latest kernel:

    make ARCH=um defconfig; make ARCH=um

    Run the resulting kernel, then "inside" run this program:

    #include <pthread.h>
    #include <stdlib.h>

    void *fn(void *p)
    {
        abort();
    }

    int main(void)
    {
        pthread_t tid;
        pthread_create(&tid, 0, fn, 0);
        pthread_join(tid, 0);
        return 0;
    }

    Analyze the coredump with GDB. Here is what you'll see:

    sudo gdb -q -ex 'set solib-absolute-prefix ../root_fs' -ex 'file ../root_fs/var/tmp/mt-abort' -ex 'core ../root_fs/var/tmp/core.762'
    Reading symbols from /usr/local/google/root_fs/var/tmp/mt-abort...done.
    [New Thread 763]
    [New Thread 762]
    Core was generated by `./mt-abort'.
    Program terminated with signal 6, Aborted.
    #0 0x0000000040255250 in raise () from ../root_fs/lib64/libc.so.6
    (gdb) info thread
    2 Thread 762 0x0000000000000000 in ?? ()
    * 1 Thread 763 0x0000000040255250 in raise () from ../root_fs/lib64/libc.so.6

    Note that thread#2 looks funny.

    (gdb) thread 2
    [Switching to thread 2 (Thread 762)]#0 0x0000000000000000 in ?? ()
    (gdb) info reg
    rax 0x0 0
    rbx 0x0 0
    rcx 0x0 0
    rdx 0x0 0
    rsi 0x0 0
    rdi 0x0 0
    rbp 0x0 0x0
    rsp 0x0 0x0
    r8 0x0 0
    r9 0x0 0
    r10 0x0 0
    r11 0x0 0
    r12 0x0 0
    r13 0x0 0
    r14 0x0 0
    r15 0x0 0
    rip 0x0 0
    eflags 0x0 [ ]
    cs 0x0 0
    ss 0x0 0
    ds 0x0 0
    es 0x0 0
    fs 0x0 0
    gs 0x0 0

    Examining the core shows that NT_PRSTATUS notes for all threads other than
    the one that crashed are zeroed out.

    I believe this is happening because neither ELF_CORE_COPY_TASK_REGS nor
    task_pt_regs are defined under ARCH=um, and so elf_core_copy_task_regs()
    becomes a no-op.

    The attached patch fixes this for SUBARCH={x86_64,i386}.

    Signed-off-by: Paul Pluzhnikov
    Cc: Jeff Dike
    Acked-by: WANG Cong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Pluzhnikov
     
  • Clean up and remove duplicated code. The next patch will also use the
    pagevec_lru_move_fn() introduced here.

    Signed-off-by: Shaohua Li
    Cc: KOSAKI Motohiro
    Cc: Hiroyuki Kamezawa
    Cc: Andi Kleen
    Reviewed-by: Minchan Kim
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
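    The helper being introduced follows a common batch-with-callback shape;
    a user-space sketch under assumed names (batch_move, move_fn_t and the
    int payload are all invented, and the locking is only indicated in
    comments):

```c
#include <stddef.h>

typedef void (*move_fn_t)(int *item, void *arg);

/* Example move_fn: shift each payload by a delta. */
static void add_delta(int *item, void *arg)
{
    *item += *(int *)arg;
}

/* Apply move_fn to a whole batch under one (notional) lock round-trip,
 * in the spirit of pagevec_lru_move_fn(). */
static void batch_move(int *items, size_t n, move_fn_t move_fn, void *arg)
{
    /* lock();  -- in the kernel, the zone LRU lock, taken once per batch */
    for (size_t i = 0; i < n; i++)
        move_fn(&items[i], arg);
    /* unlock(); */
}
```

    The point of the pattern is that callers which previously duplicated
    the lock/loop/unlock boilerplate now only supply the per-item callback.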
     
  • Up to 2.6.22, you could use remap_file_pages(2) on a tmpfs file or a
    shared mapping of /dev/zero or a shared anonymous mapping. In 2.6.23 we
    disabled it by default, but set VM_CAN_NONLINEAR to enable it on safe
    mappings. We made sure to set it in shmem_mmap() for tmpfs files, but
    missed it in shmem_zero_setup() for the others. Fix that at last.

    Reported-by: Kenny Simpson
    Signed-off-by: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Currently memblock_reserve() or memblock_free() don't handle overlaps of
    any kind. There is some special casing for coalescing exactly adjacent
    regions but that's about it.

    This is annoying because typically memblock_reserve() is used to mark
    regions passed by the firmware as reserved and we all know how much we can
    trust our firmwares...

    Also, with the current code, if we do something it doesn't handle right,
    such as trying to memblock_reserve() a large range spanning multiple
    existing smaller reserved regions, or making overlapping reservations,
    it can silently corrupt the internal region array, causing odd errors
    much later on, such as allocations returning reserved regions, etc.

    This patch rewrites the underlying functions that add or remove a region
    to the arrays. The new code is a lot more robust as it fully handles
    overlapping regions. It's also, imho, simpler than the previous
    implementation.

    In addition, while doing so, I found a bug where, if we fail to double
    the array while adding a region, we would remove the last region of the
    array rather than the region we just allocated. This patch fixes that
    too.

    Signed-off-by: Benjamin Herrenschmidt
    Acked-by: Yinghai Lu
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
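    The robust behaviour described above amounts to inserting a range into a
    sorted array of disjoint ranges and coalescing everything it overlaps or
    touches. A user-space sketch with invented names (region_add,
    struct region_array; not the actual memblock API):

```c
#include <stddef.h>
#include <string.h>

#define MAX_REGIONS 16

struct region { unsigned long base, size; };

struct region_array {
    struct region r[MAX_REGIONS];
    size_t cnt;
};

/* Insert [base, base+size) into a sorted, disjoint array, merging any
 * region it overlaps or abuts. Returns 0, or -1 if the array is full.
 * Hypothetical user-space analogue of the behaviour described above. */
static int region_add(struct region_array *a, unsigned long base,
                      unsigned long size)
{
    unsigned long end = base + size;
    size_t i = 0, j;

    /* Skip regions lying entirely below the new range. */
    while (i < a->cnt && a->r[i].base + a->r[i].size < base)
        i++;

    /* Absorb every region that overlaps or touches [base, end). */
    j = i;
    while (j < a->cnt && a->r[j].base <= end) {
        if (a->r[j].base < base)
            base = a->r[j].base;
        if (a->r[j].base + a->r[j].size > end)
            end = a->r[j].base + a->r[j].size;
        j++;
    }

    if (a->cnt - (j - i) + 1 > MAX_REGIONS)
        return -1;

    /* Close (or open) the gap: memmove handles both directions. */
    memmove(&a->r[i + 1], &a->r[j], (a->cnt - j) * sizeof(a->r[0]));
    a->cnt = a->cnt - (j - i) + 1;
    a->r[i].base = base;
    a->r[i].size = end - base;
    return 0;
}
```

    A reservation spanning several smaller existing regions simply collapses
    them into one entry instead of corrupting the array.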
     
  • Signed-off-by: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • KM_USER1 is never used on the vwrite() path, so the caller doesn't need
    to guarantee it is unused. The only thing the caller must guarantee is
    KM_USER0, and that is already documented in the comment.

    Signed-off-by: Namhyung Kim
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • For range-cyclic writeback (e.g. kupdate), the writeback code sets the
    continuation point of the next writeback in mapping->writeback_index,
    which is set to the page after the last written page. This is done so
    that we write the whole file evenly even if pages in it get continuously
    redirtied.

    However, in some cases a sequential writer works in the middle of a page
    and simply redirties the last written page as it continues. For example,
    with an application which uses a file as a big ring buffer we see:

    [1st writeback session]
    ...
    flush-8:0-2743 4571: block_bio_queue: 8,0 W 94898514 + 8
    flush-8:0-2743 4571: block_bio_queue: 8,0 W 94898522 + 8
    flush-8:0-2743 4571: block_bio_queue: 8,0 W 94898530 + 8
    flush-8:0-2743 4571: block_bio_queue: 8,0 W 94898538 + 8
    flush-8:0-2743 4571: block_bio_queue: 8,0 W 94898546 + 8
    kworker/0:1-11 4571: block_rq_issue: 8,0 W 0 () 94898514 + 40
    >> flush-8:0-2743 4571: block_bio_queue: 8,0 W 94898554 + 8
    >> flush-8:0-2743 4571: block_rq_issue: 8,0 W 0 () 94898554 + 8

    [2nd writeback session after 35sec]
    flush-8:0-2743 4606: block_bio_queue: 8,0 W 94898562 + 8
    flush-8:0-2743 4606: block_bio_queue: 8,0 W 94898570 + 8
    flush-8:0-2743 4606: block_bio_queue: 8,0 W 94898578 + 8
    ...
    kworker/0:1-11 4606: block_rq_issue: 8,0 W 0 () 94898562 + 640
    kworker/0:1-11 4606: block_rq_issue: 8,0 W 0 () 94899202 + 72
    ...
    flush-8:0-2743 4606: block_bio_queue: 8,0 W 94899962 + 8
    flush-8:0-2743 4606: block_bio_queue: 8,0 W 94899970 + 8
    flush-8:0-2743 4606: block_bio_queue: 8,0 W 94899978 + 8
    flush-8:0-2743 4606: block_bio_queue: 8,0 W 94899986 + 8
    flush-8:0-2743 4606: block_bio_queue: 8,0 W 94899994 + 8
    kworker/0:1-11 4606: block_rq_issue: 8,0 W 0 () 94899962 + 40
    >> flush-8:0-2743 4606: block_bio_queue: 8,0 W 94898554 + 8
    >> flush-8:0-2743 4606: block_rq_issue: 8,0 W 0 () 94898554 + 8

    So we seek back to 94898554 after we have written all the pages at the
    end of the file.

    This extra seek seems unnecessary. If we continue writeback from the
    last written page, we can avoid it without harming other cases: the
    original intent of even writeout over the whole file is preserved, and
    if the page does not get redirtied, pagevec_lookup_tag() just skips it.

    As an exception, when an I/O error happens, set done_index to the next
    page, as the comment in the code suggests.

    Tested-by: Wu Fengguang
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jun'ichi Nomura
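    A toy model of the change (writeback_pass and its flat dirty[] array are
    invented for this sketch): the pass returns the index of the last
    written page rather than one past it, so a page redirtied at the tail is
    picked up next time without a seek back.

```c
#include <stddef.h>

/* Toy range-cyclic writeback pass over a small "file": dirty[] holds
 * per-page dirty flags, start plays mapping->writeback_index. Every
 * dirty page from start onwards is "written" (its flag cleared), and
 * the function returns the resume point for the next pass: the last
 * page written (the old behaviour returned one past it). If nothing
 * is written, resume where we started. */
static size_t writeback_pass(int *dirty, size_t npages, size_t start)
{
    size_t done_index = start;

    for (size_t i = start; i < npages; i++) {
        if (!dirty[i])
            continue;      /* clean pages are simply skipped */
        dirty[i] = 0;      /* "write" the page */
        done_index = i;    /* resume here, not at i + 1 */
    }
    return done_index;
}
```

    If the last page stays clean on the next pass, the lookup just skips it,
    which is why resuming at the same index costs nothing in the common case.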
     
  • scan_swap_map() is a large function (224 lines), with several loops and a
    complex control flow involving several gotos.

    Given all that, it is a bit silly that it is marked inline. The compiler
    agrees: on an x86-64 build, it did not inline the function.

    Remove the "inline" and let the compiler decide instead.

    Signed-off-by: Cesar Eduardo Barros
    Reviewed-by: Pekka Enberg
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • The block in sys_swapon which does the final adjustments to the
    swap_info_struct and to swap_list is the same as the block which
    re-inserts it again at sys_swapoff on failure of try_to_unuse(). Move
    this code to a separate function, and use it both in sys_swapon and
    sys_swapoff.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Pekka Enberg
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • The block in sys_swapon which does the final adjustments to the
    swap_info_struct and to swap_list is the same as the block which
    re-inserts it again at sys_swapoff on failure of try_to_unuse(), except
    for the order of the operations within the lock. Since the order should
    not matter, arbitrarily change sys_swapoff to match sys_swapon, in
    preparation to making both share the same code.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Pekka Enberg
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • The block in sys_swapon which does the final adjustments to the
    swap_info_struct and to swap_list is the same as the block which
    re-inserts it again at sys_swapoff on failure of try_to_unuse(). To be
    able to make both share the same code, move the printk() call in the
    middle of it to just after it.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • The nr_good_pages variable still exists within
    setup_swap_map_and_extents(), but after that function returns,
    nr_good_pages == p->pages.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Pekka Enberg
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • Since there is no cleanup to do, there is no reason to jump to a label.
    Return directly instead.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Pekka Enberg
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • Move the code which parses the bad block list and the extents to a
    separate function. Only code movement, no functional changes.

    This change uses the fact that, after the success path, nr_good_pages ==
    p->pages.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Pekka Enberg
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • The call to swap_cgroup_swapon is in the middle of loading the swap map
    and extents. As it only does memory allocation and does not depend on
    the swapfile layout (map/extents), it can be called earlier (or later).

    Move it to just after the allocation of swap_map, since it is
    conceptually similar (allocates a map).

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Pekka Enberg
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • Since there is no cleanup to do, there is no reason to jump to a label.
    Return directly instead.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • Move the code which parses and checks the swapfile header (except for
    the bad block list) to a separate function. Only code movement, no
    functional changes.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • There is no reason I can see to read inode->i_size long before it is
    needed. Move its read to just before it is needed, to reduce the
    variable lifetime.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Jesper Juhl
    Reviewed-by: Pekka Enberg
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • Since there is no cleanup to do, there is no reason to jump to a label.
    Return directly instead.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • Move the code which claims the bdev (S_ISBLK) or locks the inode
    (S_ISREG) to a separate function. Only code movement, no functional
    changes.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • sys_swapon currently has two error labels, bad_swap and bad_swap_2.
    bad_swap does the same as bad_swap_2 plus destroy_swap_extents() and
    swap_cgroup_swapoff(); both are noops in the places where bad_swap_2 is
    jumped to. With a single extra test for inode (matching the one in the
    S_ISREG case below), all the error paths in the function can go to
    bad_swap.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • The only way error is 0 in the cleanup blocks is when the function is
    returning successfully. In this case, the cleanup blocks were setting
    S_SWAPFILE in the S_ISREG case. But this is not a cleanup.

    Move the setting of S_SWAPFILE to just before the "goto out;" to make
    this more clear. At this point, we do not need to test for inode because
    it will never be NULL.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • The bdev variable is always equivalent to (S_ISBLK(inode->i_mode) ?
    p->bdev : NULL), as long as its assignment is moved a bit earlier. Use
    this fact to remove the bdev variable.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • Move the setting of the error variable nearer the goto in a few places.

    Avoids calling PTR_ERR() if not IS_ERR() in two places, and makes the
    error condition more explicit in two other places.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Jesper Juhl
    Reviewed-by: Pekka Enberg
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • Since mutex_lock(&inode->i_mutex) is called just after setting inode,
    did_down is always equivalent to (inode && S_ISREG(inode->i_mode)).

    Use this fact to remove the did_down variable.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • Now there is nothing which jumps to the cleanup blocks before the name
    variable is set. There is no need to set it initially to NULL anymore.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Pekka Enberg
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • Since there is no cleanup to do, there is no reason to jump to a label.
    Return directly instead.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Pekka Enberg
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • At this point in sys_swapon, there is nothing to free. Return directly
    instead of jumping to the cleanup block at the end of the function.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Pekka Enberg
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • Move the swap_info allocation to its own function. Only code movement,
    no functional changes.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Pekka Enberg
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • Within sys_swapon, after the swap_info entry has been allocated, we
    always have type == p->type and swap_info[type] == p. Use this fact to
    reduce the dependency on the "type" local variable within the function,
    as a preparation to move the allocation of the swap_info entry to a
    separate function.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Pekka Enberg
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • Changelogs belong in the git history instead of in the source code.

    Also, "The swapon system call" is redundant with
    "SYSCALL_DEFINE2(swapon, ...)".

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Pekka Enberg
    Reviewed-by: Jesper Juhl
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    [ Gaah. That's a _historical_ comment. But the patch-series depends on removal ]
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • This patch series refactors the sys_swapon function.

    sys_swapon is currently a very large function, with 313 lines (more than
    12 25-line screens), which can make it a bit hard to read. This patch
    series reduces this size by half, by extracting large chunks of related
    code to new helper functions.

    One of these chunks of code was nearly identical to the part of
    sys_swapoff which is used in case of a failure return from
    try_to_unuse(), so this patch series also makes both share the same
    code.

    As a side effect of all this refactoring, the compiled code gets a bit
    smaller (from v1 of this patch series):

    text data bss dec hex filename
    14012 944 276 15232 3b80 mm/swapfile.o.before
    13941 944 276 15161 3b39 mm/swapfile.o.after

    This patch:

    Use vzalloc() instead of vmalloc/memset.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Pekka Enberg
    Reviewed-by: Jesper Juhl
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
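    The vzalloc() conversion in this first patch can be pictured in user
    space, with calloc() standing in for vzalloc() (the helper name below is
    made up for this sketch):

```c
#include <stdlib.h>

/* Before: allocate, then clear by hand, as the old vmalloc/memset pair
 * did:
 *
 *     unsigned char *map = malloc(maxpages);
 *     if (map)
 *         memset(map, 0, maxpages);
 *
 * After: one zeroing allocation, as vzalloc() does in the kernel. */
static unsigned char *alloc_swap_map(size_t maxpages)
{
    return calloc(maxpages, 1);   /* zeroed in a single call */
}
```

    Besides being shorter, the single-call form cannot forget the memset on
    any path.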
     
  • Pass __GFP_OTHER_NODE for transparent hugepage NUMA allocations done by
    the khugepaged daemon. This way the low-level accounting of local versus
    remote pages works correctly.

    Contains improvements from Andrea Arcangeli

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Andi Kleen
    Cc: Andrea Arcangeli
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Add a new __GFP_OTHER_NODE flag to tell the low-level NUMA statistics in
    zone_statistics() that an allocation is on behalf of another thread.
    This way the local and remote counters can still be correct, even when
    background daemons like khugepaged are changing memory mappings.

    This only affects the accounting, but I think it's worth doing that
    right to avoid confusing users.

    I first tried to just pass down the right node, but this required a lot
    of changes to pass the parameter down, including at least one addition
    of a tenth argument to a nine-argument function. Using the flag is a lot
    less intrusive.

    Open question: should this also be used for migration?

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Andi Kleen
    Cc: Andrea Arcangeli
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
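    The effect on the counters can be sketched in user space (the flag
    value, struct, and function below are invented for this sketch; only the
    flag's name comes from the patch):

```c
/* Toy local/other NUMA counters. __GFP_OTHER_NODE tells the stats code
 * the allocation is on behalf of another thread, so a node match against
 * the *allocating* thread should not be credited as "local". */
#define __GFP_OTHER_NODE 0x1000u

struct numa_stats { unsigned long local, other; };

static void account_alloc(struct numa_stats *s, int alloc_node,
                          int current_node, unsigned int gfp_flags)
{
    if (alloc_node == current_node && !(gfp_flags & __GFP_OTHER_NODE))
        s->local++;
    else
        s->other++;
}
```

    This is why the flag is less intrusive than threading a node id through
    the call chain: only the accounting site needs to look at it.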
     
  • __GFP_NO_KSWAPD allocations are usually very expensive and not required
    to succeed, as they have a graceful fallback. Waiting for I/O in them
    tends to be overkill in terms of latency, so we can reduce their latency
    by disabling sync migration.

    Unfortunately, even with async migration it's still possible for the
    process to be blocked waiting for a request slot (e.g. get_request_wait
    in the block layer) when ->writepage is called. To prevent
    __GFP_NO_KSWAPD allocations from blocking, this patch prevents
    ->writepage from being called on dirty page cache during asynchronous
    migration.

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=31142

    [mel@csn.ul.ie: Avoid writebacks for NFS, retry locked pages, use bool]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Mel Gorman
    Cc: Arthur Marsh
    Cc: Clemens Ladisch
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Reported-by: Alex Villacis Lasso
    Tested-by: Alex Villacis Lasso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • compaction_alloc() isolates pages for migration in isolate_migratepages.
    While it's scanning, IRQs are disabled on the mistaken assumption that
    the scan should be short. Tests show this to be true for the most part,
    but contention times on the LRU lock can be increased. Before this
    patch, the IRQs-disabled times for a simple test looked like:

    Total sampled time IRQs off (not real total time): 5493
    Event shrink_inactive_list..shrink_zone 1596 us count 1
    Event shrink_inactive_list..shrink_zone 1530 us count 1
    Event shrink_inactive_list..shrink_zone 956 us count 1
    Event shrink_inactive_list..shrink_zone 541 us count 1
    Event shrink_inactive_list..shrink_zone 531 us count 1
    Event split_huge_page..add_to_swap 232 us count 1
    Event save_args..call_softirq 36 us count 1
    Event save_args..call_softirq 35 us count 2
    Event __wake_up..__wake_up 1 us count 1

    This patch reduces the worst-case IRQs-disabled latencies by releasing
    the lock every SWAP_CLUSTER_MAX pages scanned and yielding the CPU if
    necessary. The cost is that the process performing compaction will be
    slower, but IRQs being disabled for too long has worse consequences, as
    the following report shows:

    Total sampled time IRQs off (not real total time): 4367
    Event shrink_inactive_list..shrink_zone 881 us count 1
    Event shrink_inactive_list..shrink_zone 875 us count 1
    Event shrink_inactive_list..shrink_zone 868 us count 1
    Event shrink_inactive_list..shrink_zone 555 us count 1
    Event split_huge_page..add_to_swap 495 us count 1
    Event compact_zone..compact_zone_order 269 us count 1
    Event split_huge_page..add_to_swap 266 us count 1
    Event shrink_inactive_list..shrink_zone 85 us count 1
    Event save_args..call_softirq 36 us count 2
    Event __wake_up..__wake_up 1 us count 1

    [akpm@linux-foundation.org: simplify with s/unlocked/locked/]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Arthur Marsh
    Cc: Clemens Ladisch
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
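    The lock-dropping pattern can be sketched in user space with a plain
    mutex (all names except SWAP_CLUSTER_MAX are invented; the kernel
    version also re-enables IRQs and calls cond_resched()):

```c
#include <pthread.h>

#define SWAP_CLUSTER_MAX 32

/* Scan n items under *lock, but release it briefly every
 * SWAP_CLUSTER_MAX items so waiters (standing in for IRQs and other
 * CPUs) are not starved. Returns how many times the lock was dropped. */
static int scan_with_lock_breaks(pthread_mutex_t *lock, int n)
{
    int drops = 0;

    pthread_mutex_lock(lock);
    for (int i = 0; i < n; i++) {
        if (i && (i % SWAP_CLUSTER_MAX) == 0) {
            pthread_mutex_unlock(lock);  /* let waiters in */
            drops++;
            pthread_mutex_lock(lock);
        }
        /* ... examine item i ... */
    }
    pthread_mutex_unlock(lock);
    return drops;
}
```

    The scan gets marginally slower, but the worst-case hold time is bounded
    by one cluster rather than by the whole scan.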
     
  • compaction_alloc() isolates free pages to be used as migration targets.
    While it's scanning, IRQs are disabled on the mistaken assumption that
    the scan should be short. Analysis showed that IRQs were in fact being
    disabled for a substantial time. A simple test was run using large
    anonymous mappings with transparent hugepage support enabled to trigger
    frequent compactions. A monitor sampled the worst IRQs-off latencies,
    and a post-processing tool found the following:

    Total sampled time IRQs off (not real total time): 22355
    Event compaction_alloc..compaction_alloc 8409 us count 1
    Event compaction_alloc..compaction_alloc 7341 us count 1
    Event compaction_alloc..compaction_alloc 2463 us count 1
    Event compaction_alloc..compaction_alloc 2054 us count 1
    Event shrink_inactive_list..shrink_zone 1864 us count 1
    Event shrink_inactive_list..shrink_zone 88 us count 1
    Event save_args..call_softirq 36 us count 1
    Event save_args..call_softirq 35 us count 2
    Event __make_request..__blk_run_queue 24 us count 1
    Event __alloc_pages_nodemask..__alloc_pages_nodemask 6 us count 1

    i.e. compaction disabled IRQs for a prolonged period of time, 8ms in one
    instance. The full report generated by the tool can be found at

    http://www.csn.ul.ie/~mel/postings/minfree-20110225/irqsoff-vanilla-micro.report

    This patch reduces the time IRQs are disabled by simply disabling them
    at the last possible moment. An updated IRQs-off summary report then
    looks like:

    Total sampled time IRQs off (not real total time): 5493
    Event shrink_inactive_list..shrink_zone 1596 us count 1
    Event shrink_inactive_list..shrink_zone 1530 us count 1
    Event shrink_inactive_list..shrink_zone 956 us count 1
    Event shrink_inactive_list..shrink_zone 541 us count 1
    Event shrink_inactive_list..shrink_zone 531 us count 1
    Event split_huge_page..add_to_swap 232 us count 1
    Event save_args..call_softirq 36 us count 1
    Event save_args..call_softirq 35 us count 2
    Event __wake_up..__wake_up 1 us count 1

    A full report is again available at

    http://www.csn.ul.ie/~mel/postings/minfree-20110225/irqsoff-minimiseirq-free-v1r4-micro.report

    As should be obvious, IRQ-disabled latencies due to compaction are
    almost eliminated for this particular test.

    [aarcange@redhat.com: Fix initialisation of isolated]
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Acked-by: Andrea Arcangeli
    Cc: Arthur Marsh
    Cc: Clemens Ladisch
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Callers of find_get_pages(), or its wrapper pagevec_lookup() - notably
    truncate_inode_pages_range() - stop looking further when it returns 0.

    But if an interrupt comes just after its radix_tree_gang_lookup_slot(),
    especially with preemptible RCU enabled, isn't it conceivable that all
    14 pages returned could be removed from the page cache by
    shrink_page_list() before find_get_pages() gets to process them, causing
    it to return 0 although there may be plenty more pages beyond?

    Make find_get_pages() and find_get_pages_tag() check for this unlikely
    case, and restart should it occur; but callers of find_get_pages_contig()
    have no such expectation, it's okay for that to return 0 early.

    I have not seen this in practice, just worried by the possibility.

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Acked-by: Peter Zijlstra
    Cc: Wu Fengguang
    Cc: Salman Qazi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
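    The defensive restart can be sketched generically (everything below is
    invented; the real code retries radix_tree_gang_lookup_slot()): if the
    raw scan saw candidate slots but every page vanished before a reference
    could be taken, retry instead of returning 0, which callers treat as end
    of data.

```c
#include <stddef.h>

/* Toy gang lookup over a slot array: 1 means a live page, -1 models a
 * slot the raw scan still sees but whose page is freed before we can
 * take a reference (the race described above; it settles to empty, as
 * a retry would observe), 0 means empty. Indices of pages we keep are
 * written to out[]. */
static size_t gang_lookup(int *slots, size_t n, size_t *out, size_t max)
{
    for (;;) {
        size_t found = 0, got = 0;

        for (size_t i = 0; i < n && got < max; i++) {
            if (slots[i] == 0)
                continue;            /* empty slot */
            found++;
            if (slots[i] < 0) {      /* page freed under us */
                slots[i] = 0;        /* the race resolves itself */
                continue;
            }
            out[got++] = i;
        }
        if (got > 0 || found == 0)
            return got;              /* real result, or truly empty */
        /* saw candidates but kept none: restart rather than return 0 */
    }
}
```

    find_get_pages_contig() keeps the early-return behaviour because its
    callers, unlike truncate_inode_pages_range(), don't read 0 as "no more
    pages anywhere".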
     
  • The radix_tree_deref_retry() case in find_get_pages() has a strange little
    excrescence, not seen in the other gang lookups: it looks like the start
    of an abandoned attempt to guarantee forward progress in a case that
    cannot arise.

    ret should always be 0 here: if it isn't, then going back to restart will
    leak references to pages already gotten. There used to be a comment
    saying nr_found is necessarily 1 here: that's not quite true, but the
    radix_tree_deref_retry() case is peculiar to the entry at index 0, when we
    race with it being moved out of the radix_tree root or back.

    Remove the worrisome two lines, add a brief comment here and in
    find_get_pages_contig() and find_get_pages_tag(), and a WARN_ON in
    find_get_pages() should it ever be seen elsewhere than at 0.

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Acked-by: Peter Zijlstra
    Cc: Wu Fengguang
    Cc: Salman Qazi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins