16 Dec, 2009
40 commits
-
According to feature-removal-schedule.txt, it is the time to remove
print_fn_descriptor_symbol().And a quick grep shows that it no longer has any callers.
Signed-off-by: WANG Cong
Cc: Bjorn Helgaas
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
rwsem_is_locked() tests ->activity without locks, so we should always keep
->activity consistent. However, the code in __rwsem_do_wake() breaks this
rule, it updates ->activity after _all_ readers waken up, this may give
some reader a wrong ->activity value, thus cause rwsem_is_locked() behaves
wrong.Quote from Andrew:
"
- we have one or more processes sleeping in down_read(), waiting for access.- we wake one or more processes up without altering ->activity
- they start to run and they do rwsem_is_locked(). This incorrectly
returns "false", because the waker process is still crunching away in
__rwsem_do_wake().- the waker now alters ->activity, but it was too late.
"So we need get a spinlock to protect this. And rwsem_is_locked() should
not block, thus we use spin_trylock_irqsave().[akpm@linux-foundation.org: simplify code]
Reported-by: Brian Behlendorf
Cc: Ben Woodard
Cc: David Howells
Signed-off-by: WANG Cong
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
These functions need not to be exported, since no drivers should use them.
__init_rwsem() is an exception, because init_rwsem(), which is a macro,
is used.Signed-off-by: WANG Cong
Cc: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Don't initialize __print_once. Invert the test to reduce initialized
data.defconfig before: $size vmlinux
text data bss dec hex filename
6976022 679572 1359668 9015262 898fde vmlinuxdefconfig after: $size vmlinux
text data bss dec hex filename
6976006 679508 1359700 9015214 898fae vmlinuxSigned-off-by: Joe Perches
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The symbol 'call' is a static symbol used for initcall_debug. This same
symbol name is used locally by a couple functions and produces the
following sparse warnings:warning: symbol 'call' shadows an earlier one
Fix this noise by renaming the local symbols.
Signed-off-by: H Hartley Sweeten
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Signed-off-by: Daniel Mack
Cc: "H Hartley Sweeten"
Cc: David Brownell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use smp_processor_id() instead of get_cpu() and put_cpu() in
generic_smp_call_function_interrupt(), It's no need to disable preempt,
because we must call generic_smp_call_function_interrupt() with interrupts
disabled.Signed-off-by: Xiao Guangrong
Acked-by: Ingo Molnar
Cc: Jens Axboe
Cc: Nick Piggin
Cc: Peter Zijlstra
Cc: Rusty Russell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This driver supports the non-volatile digital potentiometers via I2C:
AD5258, AD5259, AD5251, AD5252, AD5253, AD5254, and AD5255It provides a sysfs interface to each device for reading/writing which
is documented in Documentation/misc-devices/ad525x_dpot.txt.Signed-off-by: Michael Hennerich
Signed-off-by: Chris Verges
Signed-off-by: Mike Frysinger
Cc: Jean Delvare
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
If CONFIG_DYNAMIC_DEBUG is enabled and a source file has:
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
#includedynamic_debug.h will duplicate KBUILD_MODNAME
in the output string.Remove the use of KBUILD_MODNAME from the
output format string generated by dynamic_debug.hIf CONFIG_DYNAMIC_DEBUG is not enabled, no compile-time
check is done to printk/dev_printk arguments.Add it.
Signed-off-by: Joe Perches
Cc: Jason Baron
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Commit 70867453092297be9afb2249e712a1f960ec0a09 ("printk_once(): use bool
for boolean flag") changed printk_once() to use bool instead of int for
its guard variable. Do the same change to WARN_ONCE() and WARN_ON_ONCE(),
for the same reasons.This resulted in a reduction of 1462 bytes on a x86-64 defconfig:
text data bss dec hex filename
8101271 1207116 992764 10301151 9d2edf vmlinux.before
8100553 1207148 991988 10299689 9d2929 vmlinux.afterSigned-off-by: Cesar Eduardo Barros
Cc: Roland Dreier
Cc: Daniel Walker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Convert code away from ->read_proc/->write_proc interfaces. Switch to
proc_create()/proc_create_data() which make addition of proc entries
reliable wrt NULL ->proc_fops, NULL ->data and so on.Problem with ->read_proc et al is described here commit
786d7e1612f0b0adb6046f19b906609e4fe8b1ba "Fix rmmod/read/write races in
/proc entries"Signed-off-by: Alexey Dobriyan
Cc: Jeff Dike
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
gcc is not convinced that the floppy.c ioctl has sufficient bound checks:
In function `copy_from_user',
inlined from `fd_copyin' at drivers/block/floppy.c:3080,
inlined from `fd_ioctl' at drivers/block/floppy.c:3503:
arch/x86/include/asm/uaccess_32.h:211:
warning: call to `copy_from_user_overflow' declared with attribute
warning: copy_from_user buffer size is not provably correctAnd frankly, as a human I have a hard time proving the same more or less
(the size comes from the ioctl argument. humpf. maybe. the code isn't
very nice)This patch adds an explicit check to make 100% sure it's safe, better than
finding out later that there indeed was a gap.[akpm@linux-foundation.org: add WARN_ON()]
Signed-off-by: Arjan van de Ven
Acked-by: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
It does not seem possible that ldev can be NULL, so drop the unnecessary
test. If ldev can somehow be NULL, then the initialization of last_idx
should be moved below the test.A simplified version of the semantic match that detects this problem is as
follows (http://coccinelle.lip6.fr/)://
@match exists@
expression x, E;
identifier fld;
@@* x->fld
... when != \(x = E\|&x\)
* x == NULL
//Signed-off-by: Julia Lawall
Acked-by: Arjan van de Ven
Cc: Ingo Molnar
Cc: Venkatesh Pallipadi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Signed-off-by: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Convert code away from ->read_proc/->write_proc interfaces. Switch to
proc_create()/proc_create_data() which make addition of proc entries
reliable wrt NULL ->proc_fops, NULL ->data and so on.Problem with ->read_proc et al is described here commit
786d7e1612f0b0adb6046f19b906609e4fe8b1ba "Fix rmmod/read/write races in
/proc entries"Signed-off-by: Alexey Dobriyan
Cc: Richard Henderson
Cc: Ivan Kokshaysky
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Setting a thread's comm to be something unique is a very useful ability
and is helpful for debugging complicated threaded applications. However
currently the only way to set a thread name is for the thread to name
itself via the PR_SET_NAME prctl.However, there may be situations where it would be advantageous for a
thread dispatcher to be naming the threads its managing, rather then
having the threads self-describe themselves. This sort of behavior is
available on other systems via the pthread_setname_np() interface.This patch exports a task's comm via proc/pid/comm and
proc/pid/task/tid/comm interfaces, and allows thread siblings to write to
these values.[akpm@linux-foundation.org: cleanups]
Signed-off-by: John Stultz
Cc: Andi Kleen
Cc: Arjan van de Ven
Cc: Mike Fulton
Cc: Sean Foley
Cc: Darren Hart
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
On no-MMU systems, sizes reported in /proc/n/statm have units of bytes.
Per Documentation/filesystems/proc.txt, these values should be in pages.Signed-off-by: Steven J. Magnani
Cc: Greg Ungerer
Cc: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The NOMMU code currently clears all anonymous mmapped memory. While this
is what we want in the default case, all memory allocation from userspace
under NOMMU has to go through this interface, including malloc() which is
allowed to return uninitialized memory. This can easily be a significant
performance penalty. So for constrained embedded systems were security is
irrelevant, allow people to avoid clearing memory unnecessarily.This also alters the ELF-FDPIC binfmt such that it obtains uninitialised
memory for the brk and stack region.Signed-off-by: Jie Zhang
Signed-off-by: Robin Getz
Signed-off-by: Mike Frysinger
Signed-off-by: David Howells
Acked-by: Paul Mundt
Acked-by: Greg Ungerer
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This patch enables extraction of the pfn of a hugepage from
/proc/pid/pagemap in an architecture independent manner.Details
-------
My test program (leak_pagemap) works as follows:
- creat() and mmap() a file on hugetlbfs (file size is 200MB == 100 hugepages,)
- read()/write() something on it,
- call page-types with option -p,
- munmap() and unlink() the file on hugetlbfsWithout my patches
------------------
$ ./leak_pagemap
flags page-count MB symbolic-flags long-symbolic-flags
0x0000000000000000 1 0 __________________________________
0x0000000000000804 1 0 __R________M______________________ referenced,mmap
0x000000000000086c 81 0 __RU_lA____M______________________ referenced,uptodate,lru,active,mmap
0x0000000000005808 5 0 ___U_______Ma_b___________________ uptodate,mmap,anonymous,swapbacked
0x0000000000005868 12 0 ___U_lA____Ma_b___________________ uptodate,lru,active,mmap,anonymous,swapbacked
0x000000000000586c 1 0 __RU_lA____Ma_b___________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
total 101 0The output of page-types don't show any hugepage.
With my patches
---------------
$ ./leak_pagemap
flags page-count MB symbolic-flags long-symbolic-flags
0x0000000000000000 1 0 __________________________________
0x0000000000030000 51100 199 ________________TG________________ compound_tail,huge
0x0000000000028018 100 0 ___UD__________H_G________________ uptodate,dirty,compound_head,huge
0x0000000000000804 1 0 __R________M______________________ referenced,mmap
0x000000000000080c 1 0 __RU_______M______________________ referenced,uptodate,mmap
0x000000000000086c 80 0 __RU_lA____M______________________ referenced,uptodate,lru,active,mmap
0x0000000000005808 4 0 ___U_______Ma_b___________________ uptodate,mmap,anonymous,swapbacked
0x0000000000005868 12 0 ___U_lA____Ma_b___________________ uptodate,lru,active,mmap,anonymous,swapbacked
0x000000000000586c 1 0 __RU_lA____Ma_b___________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
total 51300 200The output of page-types shows 51200 pages contributing to hugepages,
containing 100 head pages and 51100 tail pages as expected.[akpm@linux-foundation.org: build fix]
Signed-off-by: Naoya Horiguchi
Cc: Andi Kleen
Cc: Wu Fengguang
Cc: Hugh Dickins
Cc: Mel Gorman
Cc: Lee Schermerhorn
Cc: Andy Whitcroft
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Most callers of pmd_none_or_clear_bad() check whether the target page is
in a hugepage or not, but walk_page_range() do not check it. So if we
read /proc/pid/pagemap for the hugepage on x86 machine, the hugepage
memory is leaked as shown below. This patch fixes it.Details
=======
My test program (leak_pagemap) works as follows:
- creat() and mmap() a file on hugetlbfs (file size is 200MB == 100 hugepages,)
- read()/write() something on it,
- call page-types with option -p (walk around the page tables),
- munmap() and unlink() the file on hugetlbfsWithout my patches
------------------
$ cat /proc/meminfo |grep "HugePage"
HugePages_Total: 1000
HugePages_Free: 1000
HugePages_Rsvd: 0
HugePages_Surp: 0
$ ./leak_pagemap
[snip output]
$ cat /proc/meminfo |grep "HugePage"
HugePages_Total: 1000
HugePages_Free: 900
HugePages_Rsvd: 0
HugePages_Surp: 0
$ ls /hugetlbfs/
$100 hugepages are accounted as used while there is no file on hugetlbfs.
With my patches
---------------
$ cat /proc/meminfo |grep "HugePage"
HugePages_Total: 1000
HugePages_Free: 1000
HugePages_Rsvd: 0
HugePages_Surp: 0
$ ./leak_pagemap
[snip output]
$ cat /proc/meminfo |grep "HugePage"
HugePages_Total: 1000
HugePages_Free: 1000
HugePages_Rsvd: 0
HugePages_Surp: 0
$ ls /hugetlbfs
$No memory leaks.
Signed-off-by: Naoya Horiguchi
Cc: Andi Kleen
Cc: Wu Fengguang
Cc: Hugh Dickins
Cc: Mel Gorman
Cc: Lee Schermerhorn
Cc: Andy Whitcroft
Cc: David Rientjes
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Most callers of pmd_none_or_clear_bad() check whether the target page is
in a hugepage or not, but mincore() and walk_page_range() do not check it.
So if we use mincore() on a hugepage on x86 machine, the hugepage memory
is leaked as shown below. This patch fixes it by extending mincore()
system call to support hugepages.Details
=======
My test program (leak_mincore) works as follows:
- creat() and mmap() a file on hugetlbfs (file size is 200MB == 100 hugepages,)
- read()/write() something on it,
- call mincore() for first ten pages and printf() the values of *vec
- munmap() and unlink() the file on hugetlbfsWithout my patch
----------------
$ cat /proc/meminfo| grep "HugePage"
HugePages_Total: 1000
HugePages_Free: 1000
HugePages_Rsvd: 0
HugePages_Surp: 0
$ ./leak_mincore
vec[0] 0
vec[1] 0
vec[2] 0
vec[3] 0
vec[4] 0
vec[5] 0
vec[6] 0
vec[7] 0
vec[8] 0
vec[9] 0
$ cat /proc/meminfo |grep "HugePage"
HugePages_Total: 1000
HugePages_Free: 999
HugePages_Rsvd: 0
HugePages_Surp: 0
$ ls /hugetlbfs/
$Return values in *vec from mincore() are set to 0, while the hugepage
should be in memory, and 1 hugepage is still accounted as used while
there is no file on hugetlbfs.With my patch
-------------
$ cat /proc/meminfo| grep "HugePage"
HugePages_Total: 1000
HugePages_Free: 1000
HugePages_Rsvd: 0
HugePages_Surp: 0
$ ./leak_mincore
vec[0] 1
vec[1] 1
vec[2] 1
vec[3] 1
vec[4] 1
vec[5] 1
vec[6] 1
vec[7] 1
vec[8] 1
vec[9] 1
$ cat /proc/meminfo |grep "HugePage"
HugePages_Total: 1000
HugePages_Free: 1000
HugePages_Rsvd: 0
HugePages_Surp: 0
$ ls /hugetlbfs/
$Return value in *vec set to 1 and no memory leaks.
[akpm@linux-foundation.org: cleanup]
[akpm@linux-foundation.org: build fix]
Signed-off-by: Naoya Horiguchi
Cc: Andi Kleen
Cc: Wu Fengguang
Cc: Hugh Dickins
Cc: Mel Gorman
Cc: Lee Schermerhorn
Cc: Andy Whitcroft
Cc: David Rientjes
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
If a user asks for a hugepage pool resize but specified a large number,
the machine can begin trashing. In response, they might hit ctrl-c but
signals are ignored and the pool resize continues until it fails an
allocation. This can take a considerable amount of time so this patch
aborts a pool resize if a signal is pending.Suggested by Dave Hansen.
Signed-off-by: Mel Gorman
Cc: Dave Hansen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Cleanup stale comments on munlock_vma_page().
Signed-off-by: Lee Schermerhorn
Acked-by: Hugh Dickins
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
unevictable_migrate_page() in mm/internal.h is a relic of the since
removed UNEVICTABLE_LRU Kconfig option. This patch removes the function
and open codes the test in migrate_page_copy().Signed-off-by: Lee Schermerhorn
Reviewed-by: Christoph Lameter
Acked-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When the owner of a mapping fails COW because a child process is holding a
reference, the children VMAs are walked and the page is unmapped. The
i_mmap_lock is taken for the unmapping of the page but not the walking of
the prio_tree. In theory, that tree could be changing if the lock is not
held. This patch takes the i_mmap_lock properly for the duration of the
prio_tree walk.[hugh.dickins@tiscali.co.uk: Spotted the problem in the first place]
Signed-off-by: Mel Gorman
Acked-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Jan Engelhardt reported we have this problem:
setting max_map_count to a value large enough results in programs dying at
first try. This is on 2.6.31.6:15:59 borg:/proc/sys/vm # echo $[1<max_map_count
15:59 borg:/proc/sys/vm # cat max_map_count
1073741824
15:59 borg:/proc/sys/vm # echo $[1<max_map_count
15:59 borg:/proc/sys/vm # cat max_map_count
KilledThis is because we have a chance to make 'max_map_count' negative. but
it's meaningless. Make it only accept non-negative values.Reported-by: Jan Engelhardt
Signed-off-by: WANG Cong
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: James Morris
Cc: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The check code for CONFIG_SWAP is redundant, because there is a
non-CONFIG_SWAP version for PageSwapCache() which just returns 0.Signed-off-by: Huang Shijie
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Modify the generic mmap() code to keep the cache attribute in
vma->vm_page_prot regardless if writenotify is enabled or not. Without
this patch the cache configuration selected by f_op->mmap() is overwritten
if writenotify is enabled, making it impossible to keep the vma uncached.Needed by drivers such as drivers/video/sh_mobile_lcdcfb.c which uses
deferred io together with uncached memory.Signed-off-by: Magnus Damm
Cc: Nick Piggin
Cc: Hugh Dickins
Cc: Paul Mundt
Cc: Jaya Kumar
Cc: Arnd Bergmann
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Simplify the code for shrink_inactive_list().
Signed-off-by: Huang Shijie
Reviewed-by: KOSAKI Motohiro
Reviewed-by: Wu Fengguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
In AIM7 runs, recent kernels start swapping out anonymous pages well
before they should. This is due to shrink_list falling through to
shrink_inactive_list if !inactive_anon_is_low(zone, sc), when all we
really wanted to do is pre-age some anonymous pages to give them extra
time to be referenced while on the inactive list.The obvious fix is to make sure that shrink_list does not fall through to
scanning/reclaiming inactive pages when we called it to scan one of the
active lists.This change should be safe because the loop in shrink_zone ensures that we
will still shrink the anon and file inactive lists whenever we should.[kosaki.motohiro@jp.fujitsu.com: inactive_file_is_low() should be inactive_anon_is_low()]
Reported-by: Larry Woodman
Signed-off-by: Rik van Riel
Acked-by: Johannes Weiner
Cc: Tomasz Chmielewski
Signed-off-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Signed-off-by: Jan Beulich
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Nodemasks should not be allocated on the stack for large systems (when it
is larger than 256 bytes) since there is a threat of overflow.This patch causes the unregister_mem_sect_under_nodes() nodemask to be
allocated on the stack for smaller systems and be allocated by slab for
larger systems.GFP_KERNEL is used since remove_memory_block() can block.
Cc: Gary Hade
Cc: Badari Pulavarty
Cc: Alex Chiang
Signed-off-by: David Rientjes
Cc: Greg Kroah-Hartman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
SWAP_MLOCK mean "We marked the page as PG_MLOCK, please move it to
unevictable-lru". So, following code is easy confusable.if (vma->vm_flags & VM_LOCKED) {
ret = SWAP_MLOCK;
goto out_unmap;
}Plus, if the VMA doesn't have VM_LOCKED, We don't need to check
the needed of calling mlock_vma_page().Also, add some commentary to try_to_unmap_one().
Acked-by: Hugh Dickins
Signed-off-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
__free_pages_bootmem() is a __meminit function - which has been called
from put_pages_bootmem thus causes a section mismatch warning.We were warned by the following warning:
LD mm/built-in.o
WARNING: mm/built-in.o(.text+0x26b22): Section mismatch in reference
from the function put_page_bootmem() to the function
.meminit.text:__free_pages_bootmem()
The function put_page_bootmem() references
the function __meminit __free_pages_bootmem().
This is often because put_page_bootmem lacks a __meminit
annotation or the annotation of __free_pages_bootmem is wrong.Signed-off-by: Rakib Mullick
Cc: Yasunori Goto
Cc: Badari Pulavarty
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
hugetlb_fault() takes the mm->page_table_lock spinlock then calls
hugetlb_cow(). If the alloc_huge_page() in hugetlb_cow() fails due to an
insufficient huge page pool it calls unmap_ref_private() with the
mm->page_table_lock held. unmap_ref_private() then calls
unmap_hugepage_range() which tries to acquire the mm->page_table_lock.[] print_circular_bug_tail+0x80/0x9f
[] ? check_noncircular+0xb0/0xe8
[] __lock_acquire+0x956/0xc0e
[] lock_acquire+0xee/0x12e
[] ? unmap_hugepage_range+0x3e/0x84
[] ? unmap_hugepage_range+0x3e/0x84
[] _spin_lock+0x40/0x89
[] ? unmap_hugepage_range+0x3e/0x84
[] ? alloc_huge_page+0x218/0x318
[] unmap_hugepage_range+0x3e/0x84
[] hugetlb_cow+0x1e2/0x3f4
[] ? hugetlb_fault+0x453/0x4f6
[] hugetlb_fault+0x480/0x4f6
[] follow_hugetlb_page+0x116/0x2d9
[] ? _spin_unlock_irq+0x3a/0x5c
[] __get_user_pages+0x2a3/0x427
[] get_user_pages+0x3e/0x54
[] get_user_pages_fast+0x170/0x1b5
[] dio_get_page+0x64/0x14a
[] __blockdev_direct_IO+0x4b7/0xb31
[] blkdev_direct_IO+0x58/0x6e
[] ? blkdev_get_blocks+0x0/0xb8
[] generic_file_aio_read+0xdd/0x528
[] ? avc_has_perm+0x66/0x8c
[] do_sync_read+0xf5/0x146
[] ? autoremove_wake_function+0x0/0x5a
[] ? security_file_permission+0x24/0x3a
[] vfs_read+0xb5/0x126
[] ? fget_light+0x5e/0xf8
[] sys_read+0x54/0x8c
[] system_call_fastpath+0x16/0x1bThis can be fixed by dropping the mm->page_table_lock around the call to
unmap_ref_private() if alloc_huge_page() fails, its dropped right below in
the normal path anyway. However, earlier in the that function, it's also
possible to call into the page allocator with the same spinlock held.What this patch does is drop the spinlock before the page allocator is
potentially entered. The check for page allocation failure can be made
without the page_table_lock as well as the copy of the huge page. Even if
the PTE changed while the spinlock was held, the consequence is that a
huge page is copied unnecessarily. This resolves both the double taking
of the lock and sleeping with the spinlock held.[mel@csn.ul.ie: Cover also the case where process can sleep with spinlock]
Signed-off-by: Larry Woodman
Signed-off-by: Mel Gorman
Acked-by: Adam Litke
Cc: Andy Whitcroft
Cc: Lee Schermerhorn
Cc: Hugh Dickins
Cc: David Gibson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
It has no references outside memory_hotplug.c.
Cc: "Rafael J. Wysocki"
Cc: Andi Kleen
Cc: Gerald Schaefer
Cc: KOSAKI Motohiro
Cc: Yasunori Goto
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Now that ksm pages are swappable, and the known holes plugged, remove
mention of unswappable kernel pages from KSM documentation and comments.Remove the totalram_pages/4 initialization of max_kernel_pages. In fact,
remove max_kernel_pages altogether - we can reinstate it if removal turns
out to break someone's script; but if we later want to limit KSM's memory
usage, limiting the stable nodes would not be an effective approach.Signed-off-by: Hugh Dickins
Cc: Izik Eidus
Cc: Andrea Arcangeli
Cc: Chris Wright
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The previous patch enables page migration of ksm pages, but that soon gets
into trouble: not surprising, since we're using the ksm page lock to lock
operations on its stable_node, but page migration switches the page whose
lock is to be used for that. Another layer of locking would fix it, but
do we need that yet?Do we actually need page migration of ksm pages? Yes, memory hotremove
needs to offline sections of memory: and since we stopped allocating ksm
pages with GFP_HIGHUSER, they will tend to be GFP_HIGHUSER_MOVABLE
candidates for migration.But KSM is currently unconscious of NUMA issues, happily merging pages
from different NUMA nodes: at present the rule must be, not to use
MADV_MERGEABLE where you care about NUMA. So no, NUMA page migration of
ksm pages does not make sense yet.So, to complete support for ksm swapping we need to make hotremove safe.
ksm_memory_callback() take ksm_thread_mutex when MEM_GOING_OFFLINE and
release it when MEM_OFFLINE or MEM_CANCEL_OFFLINE. But if mapped pages
are freed before migration reaches them, stable_nodes may be left still
pointing to struct pages which have been removed from the system: the
stable_node needs to identify a page by pfn rather than page pointer, then
it can safely prune them when MEM_OFFLINE.And make NUMA migration skip PageKsm pages where it skips PageReserved.
But it's only when we reach unmap_and_move() that the page lock is taken
and we can be sure that raised pagecount has prevented a PageAnon from
being upgraded: so add offlining arg to migrate_pages(), to migrate ksm
page when offlining (has sufficient locking) but reject it otherwise.Signed-off-by: Hugh Dickins
Cc: Izik Eidus
Cc: Andrea Arcangeli
Cc: Chris Wright
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
A side-effect of making ksm pages swappable is that they have to be placed
on the LRUs: which then exposes them to isolate_lru_page() and hence to
page migration.Add rmap_walk() for remove_migration_ptes() to use: rmap_walk_anon() and
rmap_walk_file() in rmap.c, but rmap_walk_ksm() in ksm.c. Perhaps some
consolidation with existing code is possible, but don't attempt that yet
(try_to_unmap needs to handle nonlinears, but migration pte removal does
not).rmap_walk() is sadly less general than it appears: rmap_walk_anon(), like
remove_anon_migration_ptes() which it replaces, avoids calling
page_lock_anon_vma(), because that includes a page_mapped() test which
fails when all migration ptes are in place. That was valid when NUMA page
migration was introduced (holding mmap_sem provided the missing guarantee
that anon_vma's slab had not already been destroyed), but I believe not
valid in the memory hotremove case added since.For now do the same as before, and consider the best way to fix that
unlikely race later on. When fixed, we can probably use rmap_walk() on
hwpoisoned ksm pages too: for now, they remain among hwpoison's various
exceptions (its PageKsm test comes before the page is locked, but its
page_lock_anon_vma fails safely if an anon gets upgraded).Signed-off-by: Hugh Dickins
Cc: Izik Eidus
Cc: Andrea Arcangeli
Cc: Chris Wright
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
But ksm swapping does require one small change in mem cgroup handling.
When do_swap_page()'s call to ksm_might_need_to_copy() does indeed
substitute a duplicate page to accommodate a different anon_vma (or a the
!PageSwapCache check in mem_cgroup_try_charge_swapin().That was returning success without charging, on the assumption that
pte_same() would fail after, which is not the case here. Originally I
proposed that success, so that an unshrinkable mem cgroup at its limit
would not fail unnecessarily; but that's a minor point, and there are
plenty of other places where we may fail an overallocation which might
later prove unnecessary. So just go ahead and do what all the other
exceptions do: proceed to charge current mm.Signed-off-by: Hugh Dickins
Cc: Izik Eidus
Cc: Andrea Arcangeli
Cc: Chris Wright
Acked-by: KAMEZAWA Hiroyuki
Acked-by: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds