07 Jan, 2012

1 commit

  • * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
    x86: Fix atomic64_xxx_cx8() functions
    x86: Fix and improve cmpxchg_double{,_local}()
    x86_64, asm: Optimise fls(), ffs() and fls64()
    x86, bitops: Move fls64.h inside __KERNEL__
    x86: Fix and improve percpu_cmpxchg{8,16}b_double()
    x86: Report cpb and eff_freq_ro flags correctly
    x86/i386: Use less assembly in strlen(), speed things up a bit
    x86: Use the same node_distance for 32 and 64-bit
    x86: Fix rflags in FAKE_STACK_FRAME
    x86: Clean up and extend do_int3()
    x86: Call do_notify_resume() with interrupts enabled
    x86/div64: Add a micro-optimization shortcut if base is power of two
    x86-64: Cleanup some assembly entry points
    x86-64: Slightly shorten line system call entry and exit paths
    x86-64: Reduce amount of redundant code generated for invalidate_interruptNN
    x86-64: Slightly shorten int_ret_from_sys_call
    x86, efi: Convert efi_phys_get_time() args to physical addresses
    x86: Default to vsyscall=emulate
    x86-64: Set siginfo and context on vsyscall emulation faults
    x86: consolidate xchg and xadd macros
    ...

    Linus Torvalds
     

06 Jan, 2012

1 commit

  • * 'core-memblock-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
    memblock: Reimplement memblock allocation using reverse free area iterator
    memblock: Kill early_node_map[]
    score: Use HAVE_MEMBLOCK_NODE_MAP
    s390: Use HAVE_MEMBLOCK_NODE_MAP
    mips: Use HAVE_MEMBLOCK_NODE_MAP
    ia64: Use HAVE_MEMBLOCK_NODE_MAP
    SuperH: Use HAVE_MEMBLOCK_NODE_MAP
    sparc: Use HAVE_MEMBLOCK_NODE_MAP
    powerpc: Use HAVE_MEMBLOCK_NODE_MAP
    memblock: Implement memblock_add_node()
    memblock: s/memblock_analyze()/memblock_allow_resize()/ and update users
    memblock: Track total size of regions automatically
    powerpc: Cleanup memblock usage
    memblock: Reimplement memblock_enforce_memory_limit() using __memblock_remove()
    memblock: Make memblock functions handle overflowing range @size
    memblock: Reimplement __memblock_remove() using memblock_isolate_range()
    memblock: Separate out memblock_isolate_range() from memblock_set_node()
    memblock: Kill memblock_init()
    memblock: Kill sentinel entries at the end of static region arrays
    memblock: Add __memblock_dump_all()
    ...

    Linus Torvalds
     

04 Jan, 2012

1 commit

  • Just like the per-CPU ones they had several
    problems/shortcomings:

    Only the first memory operand was mentioned in the asm()
    operands, and the 2x64-bit version didn't have a memory clobber
    while the 2x32-bit one did. The former allowed the compiler to
    not recognize the need to re-load the data in case it had it
    cached in some register, while the latter was overly
    destructive.

    The types of the local copies of the old and new values were
    incorrect (the types of the pointed-to variables should be used
    here, to make sure the respective old/new variable types are
    compatible).

    The __dummy/__junk variables were pointless, given that local
    copies of the inputs already existed (and can hence be used for
    discarded outputs).

    The 32-bit variant of cmpxchg_double_local() referenced
    cmpxchg16b_local().

    At once also:

    - change the return value type to what it really is: 'bool'
    - unify 32- and 64-bit variants
    - abstract out the common part of the 'normal' and 'local' variants
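
    For illustration, a minimal user-space sketch along the same lines (not
    the kernel's actual macro; assumes x86-64 with cmpxchg16b, and that p1
    and p2 are adjacent and 16-byte aligned):

    #include <stdbool.h>
    #include <stdio.h>

    /* Both words are "+m" operands, so the compiler cannot keep stale
     * copies in registers and no blanket "memory" clobber is needed;
     * the result is a bool, as described above. */
    static inline bool cmpxchg_double_sketch(unsigned long *p1,
                                             unsigned long *p2,
                                             unsigned long o1,
                                             unsigned long o2,
                                             unsigned long n1,
                                             unsigned long n2)
    {
        bool ret;

        asm volatile("lock cmpxchg16b %2; sete %0"
                     : "=a" (ret), "+d" (o2), "+m" (*p1), "+m" (*p2)
                     : "a" (o1), "b" (n1), "c" (n2)
                     : "cc");
        return ret;
    }

    int main(void)
    {
        _Alignas(16) unsigned long pair[2] = { 1, 2 };

        printf("%d\n", cmpxchg_double_sketch(&pair[0], &pair[1],
                                             1, 2, 3, 4));    /* 1 */
        printf("%lu %lu\n", pair[0], pair[1]);                /* 3 4 */
        return 0;
    }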

    Signed-off-by: Jan Beulich
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/4F01F12A020000780006A19B@nat28.tlf.novell.com
    Signed-off-by: Ingo Molnar

    Jan Beulich
     

30 Dec, 2011

2 commits

  • If a huge page is enqueued under the protection of hugetlb_lock, then the
    operation is atomic and safe.

    Signed-off-by: Hillf Danton
    Reviewed-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: [2.6.37+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
  • Commit 8aacc9f550 ("mm/mempolicy.c: fix pgoff in mbind vma merge") is a
    slightly incorrect fix.

    Why? Consider the following case.

    1. map 4 pages of a file at offset 0

    [0123]

    2. map 2 pages just after the first mapping of the same file but with
    page offset 2

    [0123][23]

    3. mbind() 2 pages from the first mapping at offset 2.
    mbind_range() should treat the new vma as

    [0123][23]
    |23|
    mbind vma

    but it does

    [0123][23]
    |01|
    mbind vma

    Oops. It then performs a wrong vma merge and split ([01][0123] or similar).

    This patch fixes it.

    [testcase]
    test result - before the patch

    case4: 126: test failed. expect '2,4', actual '2,2,2'
    case5: passed
    case6: passed
    case7: passed
    case8: passed
    case_n: 246: test failed. expect '4,2', actual '1,4'

    ------------[ cut here ]------------
    kernel BUG at mm/filemap.c:135!
    invalid opcode: 0000 [#4] SMP DEBUG_PAGEALLOC

    (snip long BUG_ON messages)

    test result - after the patch

    case4: passed
    case5: passed
    case6: passed
    case7: passed
    case8: passed
    case_n: passed

    source: mbind_vma_test.c
    ============================================================
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <numa.h>
    #include <numaif.h>

    static unsigned long pagesize;
    void* mmap_addr;
    struct bitmask *nmask;
    char buf[1024];
    FILE *file;
    char retbuf[10240] = "";
    int mapped_fd;

    char *rubysrc = "ruby -e '\
        pid = %d; \
        vstart = 0x%llx; \
        vend = 0x%llx; \
        s = `pmap -q #{pid}`; \
        rary = []; \
        s.each_line {|line|; \
            ary=line.split(\" \"); \
            addr = ary[0].to_i(16); \
            if(vstart <= addr && addr < vend) then \
                rary.push(ary[1].to_i()/4); \
            end; \
        }; \
        print rary.join(\",\"); \
    '";

    void init(void)
    {
        void* addr;
        char buf[128];

        nmask = numa_allocate_nodemask();
        numa_bitmask_setbit(nmask, 0);

        pagesize = getpagesize();

        sprintf(buf, "%s", "mbind_vma_XXXXXX");
        mapped_fd = mkstemp(buf);
        if (mapped_fd == -1)
            perror("mkstemp "), exit(1);
        unlink(buf);

        if (lseek(mapped_fd, pagesize*8, SEEK_SET) < 0)
            perror("lseek "), exit(1);
        if (write(mapped_fd, "\0", 1) < 0)
            perror("write "), exit(1);

        addr = mmap(NULL, pagesize*8, PROT_NONE,
                    MAP_SHARED, mapped_fd, 0);
        if (addr == MAP_FAILED)
            perror("mmap "), exit(1);

        if (mprotect(addr+pagesize, pagesize*6, PROT_READ|PROT_WRITE) < 0)
            perror("mprotect "), exit(1);

        mmap_addr = addr + pagesize;

        /* make page populate */
        memset(mmap_addr, 0, pagesize*6);
    }

    void fin(void)
    {
        void* addr = mmap_addr - pagesize;
        munmap(addr, pagesize*8);

        memset(buf, 0, sizeof(buf));
        memset(retbuf, 0, sizeof(retbuf));
    }

    void mem_bind(int index, int len)
    {
        int err;

        err = mbind(mmap_addr+pagesize*index, pagesize*len,
                    MPOL_BIND, nmask->maskp, nmask->size, 0);
        if (err)
            perror("mbind "), exit(err);
    }

    void mem_interleave(int index, int len)
    {
        int err;

        err = mbind(mmap_addr+pagesize*index, pagesize*len,
                    MPOL_INTERLEAVE, nmask->maskp, nmask->size, 0);
        if (err)
            perror("mbind "), exit(err);
    }

    void mem_unbind(int index, int len)
    {
        int err;

        err = mbind(mmap_addr+pagesize*index, pagesize*len,
                    MPOL_DEFAULT, NULL, 0, 0);
        if (err)
            perror("mbind "), exit(err);
    }

    void Assert(char *expected, char *value, char *name, int line)
    {
        if (strcmp(expected, value) == 0) {
            fprintf(stderr, "%s: passed\n", name);
            return;
        }
        else {
            fprintf(stderr, "%s: %d: test failed. expect '%s', actual '%s'\n",
                    name, line,
                    expected, value);
            // exit(1);
        }
    }

    /*
    AAAA
    PPPPPPNNNNNN
    might become
    PPNNNNNNNNNN
    case 4 below
    */
    void case4(void)
    {
        init();
        sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);

        mem_bind(0, 4);
        mem_unbind(2, 2);

        file = popen(buf, "r");
        fread(retbuf, sizeof(retbuf), 1, file);
        Assert("2,4", retbuf, "case4", __LINE__);

        fin();
    }

    /*
    AAAA
    PPPPPPNNNNNN
    might become
    PPPPPPPPPPNN
    case 5 below
    */
    void case5(void)
    {
        init();
        sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);

        mem_bind(0, 2);
        mem_bind(2, 2);

        file = popen(buf, "r");
        fread(retbuf, sizeof(retbuf), 1, file);
        Assert("4,2", retbuf, "case5", __LINE__);

        fin();
    }

    /*
    AAAA
    PPPPNNNNXXXX
    might become
    PPPPPPPPPPPP 6
    */
    void case6(void)
    {
        init();
        sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);

        mem_bind(0, 2);
        mem_bind(4, 2);
        mem_bind(2, 2);

        file = popen(buf, "r");
        fread(retbuf, sizeof(retbuf), 1, file);
        Assert("6", retbuf, "case6", __LINE__);

        fin();
    }

    /*
    AAAA
    PPPPNNNNXXXX
    might become
    PPPPPPPPXXXX 7
    */
    void case7(void)
    {
        init();
        sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);

        mem_bind(0, 2);
        mem_interleave(4, 2);
        mem_bind(2, 2);

        file = popen(buf, "r");
        fread(retbuf, sizeof(retbuf), 1, file);
        Assert("4,2", retbuf, "case7", __LINE__);

        fin();
    }

    /*
    AAAA
    PPPPNNNNXXXX
    might become
    PPPPNNNNNNNN 8
    */
    void case8(void)
    {
        init();
        sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);

        mem_bind(0, 2);
        mem_interleave(4, 2);
        mem_interleave(2, 2);

        file = popen(buf, "r");
        fread(retbuf, sizeof(retbuf), 1, file);
        Assert("2,4", retbuf, "case8", __LINE__);

        fin();
    }

    void case_n(void)
    {
        init();
        sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);

        /* make redundant mappings [0][1234][34][7] */
        mmap(mmap_addr + pagesize*4, pagesize*2, PROT_READ|PROT_WRITE,
             MAP_FIXED|MAP_SHARED, mapped_fd, pagesize*3);

        /* Expect to do nothing. */
        mem_unbind(2, 2);

        file = popen(buf, "r");
        fread(retbuf, sizeof(retbuf), 1, file);
        Assert("4,2", retbuf, "case_n", __LINE__);

        fin();
    }

    int main(int argc, char** argv)
    {
        case4();
        case5();
        case6();
        case7();
        case8();
        case_n();

        return 0;
    }
    =============================================================

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Johannes Weiner
    Cc: Minchan Kim
    Cc: Caspar Zhang
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Lee Schermerhorn
    Cc: [3.1.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

21 Dec, 2011

3 commits

  • Static storage is not required for the struct vmap_area in
    __get_vm_area_node.

    Remove "static" so the variable is stored on the stack instead.

    Signed-off-by: Kautuk Consul
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kautuk Consul
     
  • An integer overflow will happen on 64-bit archs if a task's sum of rss,
    swapents and nr_ptes exceeds the value (2^31)/1000. This was introduced
    by commit

    f755a04 oom: use pte pages in OOM score

    where the oom score computation was divided into several steps and is no
    longer computed as one expression in unsigned long (rss, swapents and
    nr_ptes are unsigned long), where the result value assigned to points
    (an int) is in the range (1..1000). So there could be an int overflow
    while computing

    176 points *= 1000;

    and points may have a negative value, meaning the oom score for a mem-hog
    task will be one:

    196 if (points <= 0)
    197 return 1;
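
    For illustration, a small stand-alone program (with a hypothetical page
    count) showing the same truncation and overflow:

    #include <stdio.h>

    int main(void)
    {
        /* ~8.6 GB resident: 2.2M pages of 4 KiB, > (2^31)/1000 */
        unsigned long pages = 2200000UL;
        int points = pages;     /* as in the kernel, points is an int */

        points *= 1000;         /* exceeds INT_MAX: signed overflow,
                                   typically wraps to a negative value */
        printf("points = %d\n", points);
        return 0;
    }
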
    Acked-by: KOSAKI Motohiro
    Acked-by: Oleg Nesterov
    Acked-by: David Rientjes
    Cc: [2.6.36+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frantisek Hrbata
     
  • If the request is to create a non-root group and we fail to meet it, we
    should leave the root unchanged.

    Signed-off-by: Hillf Danton
    Acked-by: Hugh Dickins
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     

16 Dec, 2011

1 commit

  • per_cpu_ptr_to_phys() incorrectly rounds up its result for the non-kmalloc
    case to the page boundary, which is bogus for any non-page-aligned
    address.

    This affects the only in-tree user of this function - the sysfs handler
    for the per-cpu 'crash_notes' physical address. The trouble is that the
    crash_notes per-cpu variable is not page-aligned:

    crash_notes = 0xc08e8ed4
    PER-CPU OFFSET VALUES:
    CPU 0: 3711f000
    CPU 1: 37129000
    CPU 2: 37133000
    CPU 3: 3713d000

    So, the per-cpu addresses are:
    crash_notes on CPU 0: f7a07ed4 => phys 36b57ed4
    crash_notes on CPU 1: f7a11ed4 => phys 36b4ded4
    crash_notes on CPU 2: f7a1bed4 => phys 36b43ed4
    crash_notes on CPU 3: f7a25ed4 => phys 36b39ed4

    However, /sys/devices/system/cpu/cpu*/crash_notes says:
    /sys/devices/system/cpu/cpu0/crash_notes: 36b57000
    /sys/devices/system/cpu/cpu1/crash_notes: 36b4d000
    /sys/devices/system/cpu/cpu2/crash_notes: 36b43000
    /sys/devices/system/cpu/cpu3/crash_notes: 36b39000

    As you can see, all values are rounded down to a page
    boundary. Consequently, this is where kexec sets up the NOTE segments,
    and thus where the secondary kernel is looking for them. However, when
    the first kernel crashes, it saves the notes to the unaligned
    addresses, where they are not found.

    Fix it by adding offset_in_page() to the translated page address.
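
    In essence (a sketch; the real function also handles the kmalloc and
    first-chunk cases):

    /* before: page-aligned result, in-page offset lost */
    return page_to_phys(vmalloc_to_page(addr));

    /* after: keep the offset within the page */
    return page_to_phys(vmalloc_to_page(addr)) + offset_in_page(addr);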

    -tj: Combined Eugene's and Petr's commit messages.

    Signed-off-by: Eugene Surovegin
    Signed-off-by: Tejun Heo
    Reported-by: Petr Tesarik
    Cc: stable@kernel.org

    Eugene Surovegin
     

09 Dec, 2011

21 commits

  • Commit f5252e00 ("mm: avoid null pointer access in vm_struct via
    /proc/vmallocinfo") adds newly allocated vm_structs to the vmlist after
    they are fully initialised. Unfortunately, it did not check that
    __vmalloc_area_node() successfully populated the area. In the event of
    allocation failure, the vmalloc area is freed but the pointer to the
    freed memory is inserted into the vmlist, leading to a crash later in
    get_vmalloc_info().

    This patch adds a check for __vmalloc_area_node() failure within
    __vmalloc_node_range. It does not use "goto fail" as in the previous
    error path, as a warning was already displayed by __vmalloc_area_node()
    before it called vfree in its failure path.
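
    The added check is essentially (a sketch; insert_vmalloc_vmlist() is the
    helper introduced by f5252e00):

    addr = __vmalloc_area_node(area, gfp_mask, prot, node, caller);
    if (!addr)
        return NULL;    /* area already freed, warning already printed */

    /* only a fully populated area is made visible on the vmlist */
    insert_vmalloc_vmlist(area);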

    Credit goes to Luciano Chavez for doing all the real work of identifying
    exactly where the problem was.

    Signed-off-by: Mel Gorman
    Reported-by: Luciano Chavez
    Tested-by: Luciano Chavez
    Reviewed-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: [3.1.x+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • setup_zone_migrate_reserve() expects zone->start_pfn to start at a
    pageblock_nr_pages-aligned pfn, otherwise we could access beyond an
    existing memblock, resulting in the following panic if
    CONFIG_HOLES_IN_ZONE is not configured and we do not check pfn_valid:

    IP: [] setup_zone_migrate_reserve+0xcd/0x180
    *pdpt = 0000000000000000 *pde = f000ff53f000ff53
    Oops: 0000 [#1] SMP
    Pid: 1, comm: swapper Not tainted 3.0.7-0.7-pae #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
    EIP: 0060:[] EFLAGS: 00010006 CPU: 0
    EIP is at setup_zone_migrate_reserve+0xcd/0x180
    EAX: 000c0000 EBX: f5801fc0 ECX: 000c0000 EDX: 00000000
    ESI: 000c01fe EDI: 000c01fe EBP: 00140000 ESP: f2475f58
    DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
    Process swapper (pid: 1, ti=f2474000 task=f2472cd0 task.ti=f2474000)
    Call Trace:
    [] __setup_per_zone_wmarks+0xec/0x160
    [] setup_per_zone_wmarks+0xf/0x20
    [] init_per_zone_wmark_min+0x27/0x86
    [] do_one_initcall+0x2b/0x160
    [] kernel_init+0xbe/0x157
    [] kernel_thread_helper+0x6/0xd
    Code: a5 39 f5 89 f7 0f 46 fd 39 cf 76 40 8b 03 f6 c4 08 74 32 eb 91 90 89 c8 c1 e8 0e 0f be 80 80 2f 86 c0 8b 14 85 60 2f 86 c0 89 c8 82 b4 12 00 00 c1 e0 05 03 82 ac 12 00 00 8b 00 f6 c4 08 0f
    EIP: [] setup_zone_migrate_reserve+0xcd/0x180 SS:ESP 0068:f2475f58
    CR2: 00000000000012b4

    We crashed in pageblock_is_reserved() when accessing pfn 0xc0000 because
    highstart_pfn = 0x36ffe.

    The issue was introduced in 3.0-rc1 by 6d3163ce ("mm: check if any page
    in a pageblock is reserved before marking it MIGRATE_RESERVE").

    Make sure that start_pfn is always aligned to pageblock_nr_pages to
    ensure that pfn_valid is always called at the start of each pageblock.
    Architectures with holes in pageblocks will be correctly handled by
    pfn_valid_within in pageblock_is_reserved.
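
    For illustration, a small stand-alone program showing the alignment step,
    using the unaligned start pfn from the oops above and an illustrative
    pageblock size:

    #include <stdio.h>

    #define pageblock_nr_pages 1024UL   /* illustrative value */
    #define roundup(x, y) ((((x) + (y) - 1) / (y)) * (y))

    int main(void)
    {
        unsigned long start_pfn = 0x36ffe;  /* highstart_pfn above */

        /* align up before walking pageblocks, as the fix does */
        printf("%#lx\n", roundup(start_pfn, pageblock_nr_pages)); /* 0x37000 */
        return 0;
    }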

    Signed-off-by: Michal Hocko
    Signed-off-by: Mel Gorman
    Tested-by: Dang Bo
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Arve Hjønnevåg
    Cc: KOSAKI Motohiro
    Cc: John Stultz
    Cc: Dave Hansen
    Cc: [3.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Avoid unlocking an unlocked page if we failed to lock it.

    Signed-off-by: Hillf Danton
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
  • Commit 70b50f94f1644 ("mm: thp: tail page refcounting fix") keeps all
    page_tail->_count zero at all times. But the current kernel does not
    set page_tail->_count to zero if a 1GB page is utilized. So when an
    IOMMU 1GB page is used by KVM, it will result in a kernel oops because a
    tail page's _count does not equal zero.

    kernel BUG at include/linux/mm.h:386!
    invalid opcode: 0000 [#1] SMP
    Call Trace:
    gup_pud_range+0xb8/0x19d
    get_user_pages_fast+0xcb/0x192
    ? trace_hardirqs_off+0xd/0xf
    hva_to_pfn+0x119/0x2f2
    gfn_to_pfn_memslot+0x2c/0x2e
    kvm_iommu_map_pages+0xfd/0x1c1
    kvm_iommu_map_memslots+0x7c/0xbd
    kvm_iommu_map_guest+0xaa/0xbf
    kvm_vm_ioctl_assigned_device+0x2ef/0xa47
    kvm_vm_ioctl+0x36c/0x3a2
    do_vfs_ioctl+0x49e/0x4e4
    sys_ioctl+0x5a/0x7c
    system_call_fastpath+0x16/0x1b
    RIP gup_huge_pud+0xf2/0x159
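
    The fix is essentially (a sketch of the compound-page setup; the real
    change is in mm/page_alloc.c):

    void prep_compound_page(struct page *page, unsigned long order)
    {
        int i;
        int nr_pages = 1 << order;

        set_compound_page_dtor(page, free_compound_page);
        set_compound_order(page, order);
        __SetPageHead(page);
        for (i = 1; i < nr_pages; i++) {
            struct page *p = page + i;
            __SetPageTail(p);
            set_page_count(p, 0);   /* tail _count must start at zero */
            p->first_page = page;
        }
    }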

    Signed-off-by: Youquan Song
    Reviewed-by: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Youquan Song
     
  • khugepaged can sometimes cause suspend to fail, requiring that the user
    retry the suspend operation.

    Use wait_event_freezable_timeout() instead of
    schedule_timeout_interruptible() to avoid missing freezer wakeups. A
    try_to_freeze() would have been needed in the khugepaged_alloc_hugepage
    tight loop too in case of the allocation failing repeatedly, and
    wait_event_freezable_timeout will provide it too.

    khugepaged would still freeze just fine by trying again the next minute
    but it's better if it freezes immediately.
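
    The sleep then becomes, in essence (a sketch):

    static void khugepaged_alloc_sleep(void)
    {
        /* freezer-aware: wakes up for the freezer instead of ignoring it */
        wait_event_freezable_timeout(khugepaged_wait, false,
            msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
    }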

    Reported-by: Jiri Slaby
    Signed-off-by: Andrea Arcangeli
    Tested-by: Jiri Slaby
    Cc: Tejun Heo
    Cc: Oleg Nesterov
    Cc: "Srivatsa S. Bhat"
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Use atomic-long operations instead of looping around cmpxchg().
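
    In essence (a sketch; field names follow the shrinker batching code):

    /* before: open-coded cmpxchg() loops */
    do {
        nr = shrinker->nr;
    } while (cmpxchg(&shrinker->nr, nr, nr + total_scan) != nr);

    /* after: plain atomic-long operations */
    nr = atomic_long_xchg(&shrinker->nr_in_batch, 0);
    atomic_long_add(total_scan, &shrinker->nr_in_batch);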

    [akpm@linux-foundation.org: massage atomic.h inclusions]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • A shrinker function can return -1, meaning that it cannot do anything
    without a risk of deadlock. For example prune_super() does this if it
    cannot grab a superblock reference, even if nr_to_scan=0. Currently we
    interpret this -1 as a ULONG_MAX size shrinker and evaluate `total_scan'
    accordingly. So the next time around this shrinker can cause
    really big pressure. Let's skip such shrinkers instead.

    Also make total_scan signed, otherwise the check (total_scan < 0) below
    never works.
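
    The shrink_slab() change amounts to (a sketch):

    long max_pass = do_shrinker_shrink(shrinker, shrink, 0);
    if (max_pass <= 0)
        continue;   /* e.g. prune_super() returned -1: skip this shrinker */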

    Signed-off-by: Konstantin Khlebnikov
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Now that all early memory information is in memblock when enabled, we
    can implement a reverse free area iterator and use it to implement a
    NUMA aware allocator, which is then wrapped for simpler variants,
    instead of the confusing and inefficient mending of information in a
    separate NUMA aware allocator.

    Implement for_each_free_mem_range_reverse(), use it to reimplement
    memblock_find_in_range_node() which in turn is used by all allocators.

    The visible allocator interface is inconsistent and can probably use
    some cleanup too.
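
    The reimplemented allocator core looks roughly like this (a sketch of
    the top-down scan):

    phys_addr_t __init memblock_find_in_range_node(phys_addr_t start,
                                                   phys_addr_t end,
                                                   phys_addr_t size,
                                                   phys_addr_t align, int nid)
    {
        phys_addr_t this_start, this_end, cand;
        u64 i;

        /* scan free ranges top-down and take the highest fit */
        for_each_free_mem_range_reverse(i, nid, &this_start, &this_end, NULL) {
            this_start = clamp(this_start, start, end);
            this_end = clamp(this_end, start, end);

            if (this_end < size)
                continue;

            cand = round_down(this_end - size, align);
            if (cand >= this_start)
                return cand;
        }
        return 0;
    }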

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu

    Tejun Heo
     
  • Now all ARCH_POPULATES_NODE_MAP archs select HAVE_MEMBLOCK_NODE_MAP -
    there's no user of early_node_map[] left. Kill early_node_map[] and
    replace ARCH_POPULATES_NODE_MAP with HAVE_MEMBLOCK_NODE_MAP. Also,
    relocate for_each_mem_pfn_range() and helper from mm.h to memblock.h
    as page_alloc.c would no longer host an alternative implementation.

    This change is ultimately a one-to-one mapping and shouldn't cause any
    observable difference; however, after the recent changes, there are
    some functions which now would fit memblock.c better than page_alloc.c
    and dependency on HAVE_MEMBLOCK_NODE_MAP instead of HAVE_MEMBLOCK
    doesn't make much sense on some of them. Further cleanups for
    functions inside HAVE_MEMBLOCK_NODE_MAP in mm.h would be nice.

    -v2: Fix compile bug introduced by mis-spelling
    CONFIG_HAVE_MEMBLOCK_NODE_MAP to CONFIG_MEMBLOCK_HAVE_NODE_MAP in
    mmzone.h. Reported by Stephen Rothwell.

    Signed-off-by: Tejun Heo
    Cc: Stephen Rothwell
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu
    Cc: Tony Luck
    Cc: Ralf Baechle
    Cc: Martin Schwidefsky
    Cc: Chen Liqin
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Cc: "H. Peter Anvin"

    Tejun Heo
     
  • Implement memblock_add_node() which can add a new memblock memory
    region with a specific node ID.
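
    Usage is, for example (with illustrative addresses and sizes):

    /* register two early-memory ranges with their home nodes */
    memblock_add_node(0x00000000, 0x20000000, 0);   /* 512 MB on node 0 */
    memblock_add_node(0x20000000, 0x20000000, 1);   /* 512 MB on node 1 */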

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu

    Tejun Heo
     
  • The only remaining function of memblock_analyze() is allowing resize of
    the memblock region arrays. Rename it to memblock_allow_resize() and
    update its users.

    * The following users remain the same other than renaming.

    arm/mm/init.c::arm_memblock_init()
    microblaze/kernel/prom.c::early_init_devtree()
    powerpc/kernel/prom.c::early_init_devtree()
    openrisc/kernel/prom.c::early_init_devtree()
    sh/mm/init.c::paging_init()
    sparc/mm/init_64.c::paging_init()
    unicore32/mm/init.c::uc32_memblock_init()

    * In the following users, analyze was used to update total size which
    is no longer necessary.

    powerpc/kernel/machine_kexec.c::reserve_crashkernel()
    powerpc/kernel/prom.c::early_init_devtree()
    powerpc/mm/init_32.c::MMU_init()
    powerpc/mm/tlb_nohash.c::__early_init_mmu()
    powerpc/platforms/ps3/mm.c::ps3_mm_add_memory()
    powerpc/platforms/embedded6xx/wii.c::wii_memory_fixups()
    sh/kernel/machine_kexec.c::reserve_crashkernel()

    * x86/kernel/e820.c::memblock_x86_fill() was directly setting
    memblock_can_resize before populating memblock and calling analyze
    afterwards. Call memblock_allow_resize() before starting to populate.

    memblock_can_resize is now static inside memblock.c.

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu
    Cc: Russell King
    Cc: Michal Simek
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Cc: Guan Xuetao
    Cc: "H. Peter Anvin"

    Tejun Heo
     
  • The total size of memory regions was calculated by memblock_analyze(),
    requiring the function to be called explicitly between operations which
    can change memory regions and any possible user of the total size, which
    is cumbersome and fragile.

    This patch makes each memblock_type track its total size automatically,
    with minor modifications to the memblock manipulation functions, and
    removes the requirement to call memblock_analyze().
    [__]memblock_dump_all() now also dumps the total size of reserved regions.

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu

    Tejun Heo
     
  • With recent updates, the basic memblock operations are robust enough
    that there's no reason for memblock_enforce_memory_limit() to directly
    manipulate the memblock region arrays. Reimplement it using
    __memblock_remove().
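
    The reimplementation is roughly (a sketch):

    void __init memblock_enforce_memory_limit(phys_addr_t limit)
    {
        phys_addr_t max_addr = (phys_addr_t)ULLONG_MAX;
        unsigned long i;

        /* find the address corresponding to @limit */
        for (i = 0; i < memblock.memory.cnt; i++) {
            struct memblock_region *r = &memblock.memory.regions[i];

            if (limit <= r->size) {
                max_addr = r->base + limit;
                break;
            }
            limit -= r->size;
        }

        /* truncate both memory and reserved regions above it */
        __memblock_remove(&memblock.memory, max_addr,
                          (phys_addr_t)ULLONG_MAX);
        __memblock_remove(&memblock.reserved, max_addr,
                          (phys_addr_t)ULLONG_MAX);
    }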

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu

    Tejun Heo
     
  • Allow memblock users to specify a range where @base + @size overflows
    and automatically cap it at the maximum. This makes the interface more
    robust and makes specifying till-the-end-of-memory easier.
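
    For illustration, a stand-alone sketch of the capping:

    #include <stdio.h>

    typedef unsigned long long phys_addr_t;
    #define PHYS_ADDR_MAX (~(phys_addr_t)0)

    /* clamp @size so that base + size cannot wrap around */
    static phys_addr_t cap_size(phys_addr_t base, phys_addr_t size)
    {
        return size < PHYS_ADDR_MAX - base ? size : PHYS_ADDR_MAX - base;
    }

    int main(void)
    {
        /* "till the end of memory": just pass the maximum size */
        phys_addr_t base = 0x100000;
        phys_addr_t size = cap_size(base, PHYS_ADDR_MAX);

        printf("end = %#llx\n", base + size);   /* no overflow */
        return 0;
    }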

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu

    Tejun Heo
     
  • __memblock_remove()'s open-coded region manipulation can be trivially
    replaced with memblock_isolate_range(). This increases code sharing
    and eases improving region tracking.

    This pulls memblock_isolate_range() out of HAVE_MEMBLOCK_NODE_MAP.
    Make it use memblock_get_region_node() instead of assuming rgn->nid is
    available.

    -v2: Fixed build failure on !HAVE_MEMBLOCK_NODE_MAP caused by direct
    rgn->nid access.

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu

    Tejun Heo
     
  • memblock_set_node() operates in three steps - break regions crossing
    boundaries, set nid and merge back regions. This patch separates the
    first part into a separate function - memblock_isolate_range(), which
    breaks regions crossing range boundaries and returns range index range
    for regions properly contained in the specified memory range.

    This doesn't introduce any behavior change and will be used to further
    unify region handling.
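
    With it, memblock_set_node() becomes roughly (a sketch of the three
    steps):

    int __init memblock_set_node(phys_addr_t base, phys_addr_t size, int nid)
    {
        struct memblock_type *type = &memblock.memory;
        int start_rgn, end_rgn, i, ret;

        /* 1. break regions crossing the range boundaries */
        ret = memblock_isolate_range(type, base, size,
                                     &start_rgn, &end_rgn);
        if (ret)
            return ret;

        /* 2. set nid on the now fully contained regions */
        for (i = start_rgn; i < end_rgn; i++)
            type->regions[i].nid = nid;

        /* 3. merge back neighbours that became compatible */
        memblock_merge_regions(type);
        return 0;
    }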

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu

    Tejun Heo
     
  • memblock_init() initializes arrays for regions and memblock itself;
    however, all these can be done with struct initializers and
    memblock_init() can be removed. This patch kills memblock_init() and
    initializes memblock with struct initializer.

    The only difference is that the first dummy entries don't have .nid
    set to MAX_NUMNODES initially. This doesn't cause any behavior
    difference.
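
    The initializer looks roughly like this (a sketch):

    struct memblock memblock __initdata_memblock = {
        .memory.regions     = memblock_memory_init_regions,
        .memory.cnt         = 1,    /* empty dummy entry */
        .memory.max         = INIT_MEMBLOCK_REGIONS,

        .reserved.regions   = memblock_reserved_init_regions,
        .reserved.cnt       = 1,    /* empty dummy entry */
        .reserved.max       = INIT_MEMBLOCK_REGIONS,

        .current_limit      = MEMBLOCK_ALLOC_ANYWHERE,
    };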

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu
    Cc: Russell King
    Cc: Michal Simek
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Cc: Guan Xuetao
    Cc: "H. Peter Anvin"

    Tejun Heo
     
  • memblock no longer depends on having one more entry at the end during
    addition, making the sentinel entries at the end of the region arrays
    not too useful. Remove the sentinels. This eases further updates.

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu

    Tejun Heo
     
  • Add __memblock_dump_all() which dumps memblock configuration whether
    memblock_debug is enabled or not.

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu

    Tejun Heo
     
  • Make memblock_double_array(), __memblock_alloc_base() and
    memblock_alloc_nid() use memblock_reserve() instead of calling
    memblock_add_region() with reserved array directly. This eases
    debugging and updates to memblock_add_region().

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu

    Tejun Heo
     
  • memblock_{add|remove|free|reserve}() return either 0 or -errno but had
    long as the return type. Change it to int. Also, drop 'extern' from all
    prototypes in memblock.h - they are unnecessary and used
    inconsistently (especially if mm.h is included in the picture).

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu

    Tejun Heo
     

08 Dec, 2011

3 commits

  • Some traces show lots of bdi_dirty=0 lines, where bdi_dirty would
    actually be some small value if not for the accounting errors in the
    per-cpu bdi stats.

    In this case the max pause time should really be set to the smallest
    (non-zero) value to avoid IO queue underrun and improve throughput.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • On a system with 1 local mount and 1 NFS mount, if the NFS server
    stops responding while dd is writing to the NFS mount, the NFS dirty
    pages may exceed the global dirty limit and _every_ task involving
    writing will be blocked. The whole system appears unresponsive.

    The workaround is to let through the bdi's that have only a small
    number of dirty pages. The number chosen (bdi_stat_error pages) is not
    enough to enable the local disk to run at optimal throughput, however it
    is enough to make the system responsive on a broken NFS mount. The user
    can then kill the dirtiers on the NFS mount and increase the global dirty
    limit to bring up the local disk's throughput.

    It risks allowing dirty pages to grow much larger than the global dirty
    limit when there are 1000+ mounts, however that's very unlikely to happen,
    especially in low memory profiles.
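
    The pass-through amounts to (a sketch of the check in
    balance_dirty_pages()):

    /*
     * bdi_dirty is within the accounting error of zero: let this
     * bdi through so a broken NFS mount cannot stall local writers.
     */
    if (bdi_dirty <= bdi_stat_error(bdi))
        break;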

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • We do "floating proportions" to let active devices to grow its target
    share of dirty pages and stalled/inactive devices to decrease its target
    share over time.

    It works well except in the case of "an inactive disk suddenly goes
    busy", where the initial target share may be too small. To mitigate
    this, bdi_position_ratio() has the below line to raise a small
    bdi_thresh when it's safe to do so, so that the disk be feed with enough
    dirty pages for efficient IO and in turn fast rampup of bdi_thresh:

    bdi_thresh = max(bdi_thresh, (limit - dirty) / 8);

    balance_dirty_pages() normally does negative feedback control which
    adjusts ratelimit to balance the bdi dirty pages around the target.
    In some extreme cases when that is not enough, it will have to block
    the tasks completely until the bdi dirty pages drop below bdi_thresh.

    Acked-by: Jan Kara
    Acked-by: Peter Zijlstra
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

05 Dec, 2011

1 commit

  • Commit 30765b92 ("slab, lockdep: Annotate the locks before using
    them") moves the init_lock_keys() call from after g_cpucache_up =
    FULL to before it, and overlooks the fact that init_node_lock_keys()
    tests for it and ignores everything !FULL.

    Introduce a LATE stage and change the lockdep test to be < LATE.

    Cc: Pekka Enberg
    Cc: stable@kernel.org
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

02 Dec, 2011

1 commit

  • Currently write(2) to a file is not interruptible by any signal.
    Sometimes this is desirable, e.g. when you want to quickly kill a
    process hogging your disk. Also, with commit 499d05ecf990 ("mm: Make
    task in balance_dirty_pages() killable"), it's necessary to abort the
    current write accordingly to avoid it quickly dirtying lots more pages
    at an unthrottled rate.

    This patch makes write interruptible by SIGKILL. We do not allow write
    to be interruptible by any other signal because that has a larger
    potential of breaking some badly written applications.
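
    The core of the change is (a sketch of the check added to the buffered
    write loop):

    if (fatal_signal_pending(current)) {
        status = -EINTR;
        break;
    }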

    Reported-by: Kazuya Mio
    Tested-by: Kazuya Mio
    Acked-by: Matthew Wilcox
    Signed-off-by: Jan Kara
    Signed-off-by: Wu Fengguang

    Jan Kara
     

29 Nov, 2011

1 commit