28 Apr, 2013
1 commit
-
I think we could just move the full vm_iomap_memory() function into
util.h or similar, but I didn't get any reply from anybody actually
using nommu even to this trivial patch, so I'm not going to touch it any
more than required.

Here's the fairly minimal stub to make the nommu case at least
potentially work. It doesn't seem like anybody cares, though.

Signed-off-by: Linus Torvalds
18 Apr, 2013
2 commits
-
Fix the error return value in kswapd_run(). The bug was introduced by
commit d5dc0ad928fb ("mm/vmscan: fix error number for failed kthread").

Signed-off-by: Xishi Qiu
Reviewed-by: Wanpeng Li
Reviewed-by: Rik van Riel
Reported-by: Wu Fengguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
After applying the previous patch "hugetlbfs: stop setting VM_DONTDUMP in
initializing vma(VM_HUGETLB)" to reenable hugepage coredump, if a memory
error happens on a hugepage and the affected processes try to access the
error hugepage, we hit VM_BUG_ON(atomic_read(&page->_count) <= 0) in
get_page().

Cc: Rik van Riel
Acked-by: Michal Hocko
Cc: HATAYAMA Daisuke
Acked-by: KOSAKI Motohiro
Acked-by: David Rientjes
Cc: [2.6.34+?]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
17 Apr, 2013
1 commit
-
Various drivers end up replicating the code to mmap() their memory
buffers into user space, and our core memory remapping function may be
very flexible but it is unnecessarily complicated for the common cases
to use.

Our internal VM uses pfn's ("page frame numbers") which simplifies
things for the VM, and allows us to pass physical addresses around in a
denser and more efficient format than passing a "phys_addr_t" around,
and having to shift it up and down by the page size. But it just means
that drivers end up doing that shifting instead at the interface level.

It also means that drivers end up mucking around with internal VM things
like the vma details (vm_pgoff, vm_start/end) way more than they really
need to.

So this just exports a function to map a certain physical memory range
into user space (using a phys_addr_t based interface that is much more
natural for a driver) and hides all the complexity from the driver.
Some drivers will still end up tweaking the vm_page_prot details for
things like prefetching or cacheability etc, but that's actually
relevant to the driver, rather than caring about what the page offset of
the mapping is into the particular IO memory region.

Acked-by: Greg Kroah-Hartman
Signed-off-by: Linus Torvalds
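To make the intended use concrete, a driver's mmap() handler can now be
essentially a one-liner. A hedged sketch ("mydrv", "phys_base" and
"region_size" are invented names, not from the patch):

    static int mydrv_mmap(struct file *file, struct vm_area_struct *vma)
    {
            struct mydrv *drv = file->private_data;

            /* vm_iomap_memory() checks that the vma fits inside the
             * physical region and does all the pfn arithmetic */
            return vm_iomap_memory(vma, drv->phys_base, drv->region_size);
    }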
13 Apr, 2013
1 commit
-
This patch attempts to fix:
https://bugzilla.kernel.org/show_bug.cgi?id=56461
The symptom is a crash and messages like this:
chrome: Corrupted page table at address 34a03000
*pdpt = 0000000000000000 *pde = 0000000000000000
Bad pagetable: 000f [#1] PREEMPT SMP

Ingo guesses this got introduced by commit 611ae8e3f520 ("x86/tlb:
enable tlb flush range support for x86") since that code started to free
unused pagetables.

On x86-32 PAE kernels, that new code has the potential to free an entire
PMD page and will clear one of the four page-directory-pointer-table
(aka pgd_t entries).

The hardware aggressively "caches" these top-level entries and invlpg
does not actually affect the CPU's copy. If we clear one we *HAVE* to
do a full TLB flush, otherwise we might continue using a freed pmd page.
(note, we do this properly on the population side in pud_populate()).

This patch tracks whenever we clear one of these entries in the 'struct
mmu_gather', and ensures that we follow up with a full tlb flush.

BTW, I disassembled and checked that:

    if (tlb->fullmm == 0)

and

    if (!tlb->fullmm && !tlb->need_flush_all)

generate essentially the same code, so there should be zero impact there
to the !PAE case.

Signed-off-by: Dave Hansen
Cc: Peter Anvin
Cc: Ingo Molnar
Cc: Artem S Tashkinov
Signed-off-by: Linus Torvalds
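A hedged sketch of the mechanism described above (the need_flush_all
name comes from the message itself; the surrounding code is
illustrative, not the literal patch):

    /* when freeing pagetables clears a top-level PAE entry: */
    tlb->need_flush_all = 1;

    /* at flush time, a ranged invlpg-based flush is only safe if no
     * top-level entry was cleared: */
    if (!tlb->fullmm && !tlb->need_flush_all)
            flush_tlb_mm_range(tlb->mm, tlb->start, tlb->end, 0UL);
    else
            flush_tlb_mm(tlb->mm);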
05 Apr, 2013
1 commit
-
find_vma() can be called by multiple threads with read lock
held on mm->mmap_sem and any of them can update mm->mmap_cache.
Prevent compiler from re-fetching mm->mmap_cache, because other
readers could update it in the meantime:

  thread 1                              thread 2
                                        |
  find_vma()                            |  find_vma()
    struct vm_area_struct *vma = NULL;  |
    vma = mm->mmap_cache;               |
    if (!(vma && vma->vm_end > addr     |
        && vma->vm_start <= addr)) {    |
                                        |    mm->mmap_cache = vma;
    return vma;                         |
     ^^ compiler may optimize this      |
        local variable out and re-read  |
        mm->mmap_cache                  |

This issue can be reproduced with gcc-4.8.0-1 on s390x by running
mallocstress testcase from LTP, which triggers:

kernel BUG at mm/rmap.c:1088!
Call Trace:
([] 0x3d100c57000)
[] do_wp_page+0x2fc/0xa88
[] handle_pte_fault+0x41a/0xac8
[] handle_mm_fault+0x17a/0x268
[] do_protection_exception+0x1e2/0x394
[] pgm_check_handler+0x138/0x13c
[] 0x3fffcf1f07a
Last Breaking-Event-Address:
[] page_add_new_anon_rmap+0xc2/0x168

Thanks to Jakub Jelinek for his insight on gcc and helping to
track this down.

Signed-off-by: Jan Stancek
Acked-by: David Rientjes
Signed-off-by: Hugh Dickins
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds
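The fix pattern implied here is to force a single load of the cache
field; a hedged sketch (assuming the ACCESS_ONCE() approach the
description suggests):

    struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
    {
            /* read mmap_cache exactly once; other readers holding
             * mmap_sem for read may store to it concurrently */
            struct vm_area_struct *vma = ACCESS_ONCE(mm->mmap_cache);

            if (!(vma && vma->vm_end > addr && vma->vm_start <= addr)) {
                    /* ... walk the rbtree, then update mm->mmap_cache ... */
            }
            return vma;
    }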
29 Mar, 2013
1 commit
-
This reverts commit 186930500985 ("mm: introduce VM_POPULATE flag to
better deal with racy userspace programs").

VM_POPULATE only has any effect when userspace plays racy games with
vmas by trying to unmap and remap memory regions that mmap or mlock are
operating on.

Also, the only effect of VM_POPULATE when userspace plays such games is
that it avoids populating new memory regions that get remapped into the
address range that was being operated on by the original mmap or mlock
calls.

Let's remove VM_POPULATE as there isn't any strong argument to mandate a
new vm_flag.

Signed-off-by: Michel Lespinasse
Signed-off-by: Hugh Dickins
Signed-off-by: Linus Torvalds
23 Mar, 2013
2 commits
-
zone->wait_table may be allocated from bootmem; in that case it cannot
be freed.
Signed-off-by: Jianguo Wu
Reviewed-by: Tang Chen
Cc: Tang Chen
Cc: Jiang Liu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
hugetlb_total_pages is used for overcommit calculations but the current
implementation considers only the default hugetlb page size (which is
either the first defined hugepage size or the one specified by
default_hugepagesz kernel boot parameter).

If the system is configured for more than one hugepage size, which is
possible since commit a137e1cc6d6e ("hugetlbfs: per mount huge page
sizes") then the overcommit estimation done by __vm_enough_memory()
(resp. shown by meminfo_proc_show) is not precise - there is an
impression of more available/allowed memory. This can lead to an
unexpected ENOMEM/EFAULT resp. SIGSEGV when memory is accounted.

Testcase:
  boot: hugepagesz=1G hugepages=1
  the default overcommit ratio is 50

before patch:

  egrep 'CommitLimit' /proc/meminfo
  CommitLimit:     55434168 kB

after patch:

  egrep 'CommitLimit' /proc/meminfo
  CommitLimit:     54909880 kB

[akpm@linux-foundation.org: coding-style tweak]
Signed-off-by: Wanpeng Li
Acked-by: Michal Hocko
Cc: "Aneesh Kumar K.V"
Cc: Hillf Danton
Cc: KAMEZAWA Hiroyuki
Cc: [3.0+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
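The natural shape of such a fix is to sum over every registered hstate
instead of only the default one; a hedged sketch:

    unsigned long hugetlb_total_pages(void)
    {
            struct hstate *h;
            unsigned long nr_total_pages = 0;

            /* account every configured hugepage size, not just the
             * default hstate */
            for_each_hstate(h)
                    nr_total_pages += h->nr_huge_pages *
                                      pages_per_huge_page(h);
            return nr_total_pages;
    }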
15 Mar, 2013
1 commit
-
The vm_flags introduced in 6d7825b10dbe ("mm/fremap.c: fix oops on error
path") is supposed to avoid a compiler warning about uninitialized
vm_flags without changing the generated code.

However I am concerned that this is going to be very brittle, and fail
with some compiler versions. The failure could be either of:

- compiler could actually load vma->vm_flags before checking for the
  !vma condition, thus reintroducing the oops

- compiler could optimize out the !vma check, since the pointer just got
  dereferenced shortly before (so the compiler knows it can't be NULL!)

I propose reversing this part of the change and initializing vm_flags to 0
just to avoid the bogus uninitialized use warning.

Signed-off-by: Michel Lespinasse
Cc: Tommi Rantala
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
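The proposed code shape, as a hedged sketch (labels and surrounding
logic are illustrative):

    unsigned long vm_flags = 0;     /* plain initialization ... */
    struct vm_area_struct *vma;

    vma = find_vma(mm, start);
    if (!vma)
            goto out;               /* NULL checked before any dereference */
    vm_flags = vma->vm_flags;       /* ... so this load happens only here */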
14 Mar, 2013
2 commits
-
If find_vma() fails, sys_remap_file_pages() will dereference `vma', which
contains NULL. Fix it by checking the pointer.

(We could alternatively check for err==0, but this seems more direct.)

(The vm_flags change is to squish a bogus used-uninitialised warning
without adding extra code.)

Reported-by: Tommi Rantala
Cc: Michel Lespinasse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
remove_memory() calls walk_memory_range() with [start_pfn, end_pfn), where
end_pfn is exclusive in this range. Therefore, end_pfn needs to be set to
the next page of the end address.

Signed-off-by: Toshi Kani
Cc: Wen Congyang
Cc: Tang Chen
Cc: Kamezawa Hiroyuki
Cc: KOSAKI Motohiro
Cc: Jiang Liu
Cc: Jianguo Wu
Cc: Lai Jiangshan
Cc: Wu Jianguo
Cc: Yasuaki Ishimatsu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
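In other words, for a byte range [start, start + size) the pfn bounds
should be computed as below; a hedged sketch (the callback name is a
made-up placeholder):

    /* end_pfn is exclusive: one past the last page of the range */
    start_pfn = PFN_DOWN(start);
    end_pfn = PFN_UP(start + size);
    ret = walk_memory_range(start_pfn, end_pfn, NULL, my_callback);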
13 Mar, 2013
2 commits
-
In commit 887cbce0adea ("arch Kconfig: centralise ARCH_NO_VIRT_TO_BUS")
I introduced the config symbol HAVE_VIRT_TO_BUS and selected that where
needed. I am not sure what I was thinking. Instead, just directly
select VIRT_TO_BUS where it is needed.

Signed-off-by: Stephen Rothwell
Signed-off-by: Linus Torvalds
-
Looking at mm/process_vm_access.c:process_vm_rw() and comparing it to
compat_process_vm_rw() shows that the compatibility code requires an
explicit "access_ok()" check before calling
compat_rw_copy_check_uvector(). The same difference seems to appear when
we compare fs/read_write.c:do_readv_writev() to
fs/compat.c:compat_do_readv_writev().

This subtle difference between the compat and non-compat requirements
should probably be debated, as it seems to be error-prone. In fact,
there are two other sites that use this function in the Linux kernel,
and they both seem to get it wrong:

Now shifting our attention to fs/aio.c, we see that aio_setup_iocb()
also ends up calling compat_rw_copy_check_uvector() through
aio_setup_vectored_rw(). Unfortunately, the access_ok() check appears to
be missing. Same situation for
security/keys/compat.c:compat_keyctl_instantiate_key_iov().

I propose that we add the access_ok() check directly into
compat_rw_copy_check_uvector(), so callers don't have to worry about it,
and it therefore makes the compat call code similar to its non-compat
counterpart. Place the access_ok() check in the same location where
copy_from_user() can trigger a -EFAULT error in the non-compat code, so
the ABI behaviors are alike on both compat and non-compat.

While we are here, fix compat_do_readv_writev() so it checks for
compat_rw_copy_check_uvector() negative return values.

And also, fix a memory leak in compat_keyctl_instantiate_key_iov() error
handling.

Acked-by: Linus Torvalds
Acked-by: Al Viro
Signed-off-by: Mathieu Desnoyers
Signed-off-by: Linus Torvalds
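A hedged sketch of the centralized check (mirroring what the native
rw_copy_check_uvector() path achieves via copy_from_user(); the exact
placement in the real patch may differ):

    /* in compat_rw_copy_check_uvector(): validate the whole
     * compat_iovec array up front, so every caller gets the same
     * -EFAULT behavior as the non-compat code */
    if (!access_ok(VERIFY_READ, uvector, nr_segs * sizeof(*uvector)))
            return -EFAULT;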
09 Mar, 2013
4 commits
-
Fix a warning from lockdep caused by calling cancel_work_sync() for
uninitialized struct work. This path has been triggered by destruction
of a kmem-cache hierarchy via destroying its root kmem-cache.

cache ffff88003c072d80
obj ffff88003b410000 cache ffff88003c072d80
obj ffff88003b924000 cache ffff88003c20bd40
INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
Pid: 2825, comm: insmod Tainted: G O 3.9.0-rc1-next-20130307+ #611
Call Trace:
__lock_acquire+0x16a2/0x1cb0
lock_acquire+0x8a/0x120
flush_work+0x38/0x2a0
__cancel_work_timer+0x89/0xf0
cancel_work_sync+0xb/0x10
kmem_cache_destroy_memcg_children+0x81/0xb0
kmem_cache_destroy+0xf/0xe0
init_module+0xcb/0x1000 [kmem_test]
do_one_initcall+0x11a/0x170
load_module+0x19b0/0x2320
SyS_init_module+0xc6/0xf0
system_call_fastpath+0x16/0x1b

Example module to demonstrate:

    #include <linux/module.h>
    #include <linux/slab.h>
    #include <linux/mm.h>
    #include <linux/workqueue.h>

    int __init mod_init(void)
    {
            int size = 256;
            struct kmem_cache *cache;
            void *obj;
            struct page *page;

            cache = kmem_cache_create("kmem_cache_test", size, size, 0, NULL);
            if (!cache)
                    return -ENOMEM;

            printk("cache %p\n", cache);

            obj = kmem_cache_alloc(cache, GFP_KERNEL);
            if (obj) {
                    page = virt_to_head_page(obj);
                    printk("obj %p cache %p\n", obj, page->slab_cache);
                    kmem_cache_free(cache, obj);
            }

            flush_scheduled_work();

            obj = kmem_cache_alloc(cache, GFP_KERNEL);
            if (obj) {
                    page = virt_to_head_page(obj);
                    printk("obj %p cache %p\n", obj, page->slab_cache);
                    kmem_cache_free(cache, obj);
            }

            kmem_cache_destroy(cache);
            return -EBUSY;
    }

    module_init(mod_init);
    MODULE_LICENSE("GPL");

Signed-off-by: Konstantin Khlebnikov
Cc: Glauber Costa
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
A CONFIG_DISCONTIGMEM=y m68k config gave
mm/ksm.c: In function `get_kpfn_nid':
mm/ksm.c:492: error: implicit declaration of function `pfn_to_nid'

linux/mmzone.h declares it for CONFIG_SPARSEMEM and CONFIG_FLATMEM, but
expects the arch's asm/mmzone.h to declare it for CONFIG_DISCONTIGMEM
(see arch/mips/include/asm/mmzone.h for example).

Or perhaps it is only expected when CONFIG_NUMA=y: too much of a maze,
and m68k got away without it so far, so fix the build in mm/ksm.c.

Signed-off-by: Hugh Dickins
Reported-by: Geert Uytterhoeven
Cc: Petr Holasek
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Currently, n_new is wrongly initialized: the start and end parameters
are inverted. Let's fix it.

Signed-off-by: KOSAKI Motohiro
Cc: Hillf Danton
Cc: Sasha Levin
Cc: Hugh Dickins
Cc: Mel Gorman
Cc: Dave Jones
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
n->end is accessed in sp_insert(), so it must be updated before
sp_insert() is called. This mistake can cause a kernel panic.

Signed-off-by: Hillf Danton
Signed-off-by: KOSAKI Motohiro
Cc: Sasha Levin
Cc: Hugh Dickins
Cc: Mel Gorman
Cc: Dave Jones
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
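A hedged sketch of the ordering fix (context is the shared-policy
rbtree in mm/mempolicy.c; the surrounding code is illustrative):

    /* set the node's range first: sp_insert() reads n->end while
     * placing n in the tree */
    n->end = end;
    sp_insert(sp, n);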
04 Mar, 2013
1 commit
-
Pull more VFS bits from Al Viro:
"Unfortunately, it looks like xattr series will have to wait until the
next cycle ;-/

This pile contains 9p cleanups and fixes (races in v9fs_fid_add()
etc), fixup for nommu breakage in shmem.c, several cleanups and a bit
more file_inode() work"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
constify path_get/path_put and fs_struct.c stuff
fix nommu breakage in shmem.c
cache the value of file_inode() in struct file
9p: if v9fs_fid_lookup() gets to asking server, it'd better have hashed dentry
9p: make sure ->lookup() adds fid to the right dentry
9p: untangle ->lookup() a bit
9p: double iput() in ->lookup() if d_materialise_unique() fails
9p: v9fs_fid_add() can't fail now
v9fs: get rid of v9fs_dentry
9p: turn fid->dlist into hlist
9p: don't bother with private lock in ->d_fsdata; dentry->d_lock will do just fine
more file_inode() open-coded instances
selinux: opened file can't have NULL or negative ->f_path.dentry

(In the meantime, the hlist traversal macros have changed, so this
required a semantic conflict fixup for the newly hlistified fid->dlist)
03 Mar, 2013
1 commit
-
Tim found:
WARNING: at arch/x86/kernel/smpboot.c:324 topology_sane.isra.2+0x6f/0x80()
Hardware name: S2600CP
sched: CPU #1's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
smpboot: Booting Node 1, Processors #1
Modules linked in:
Pid: 0, comm: swapper/1 Not tainted 3.9.0-0-generic #1
Call Trace:
set_cpu_sibling_map+0x279/0x449
start_secondary+0x11d/0x1e5

Don Morris reproduced on a HP z620 workstation, and bisected it to
commit e8d195525809 ("acpi, memory-hotplug: parse SRAT before memblock
is ready").

It turns out movable_map has some problems, and it breaks several things:
1. numa_init is called several times, NOT just for srat. so those
nodes_clear(numa_nodes_parsed)
memset(&numa_meminfo, 0, sizeof(numa_meminfo))
can not just be removed. Need to consider the sequence: numaq, srat,
amd, dummy. And make the fall back path work.

2. simply split acpi_numa_init to early_parse_srat.
a. that early_parse_srat is NOT called for ia64, so you break ia64.
b. for (i = 0; i < MAX_LOCAL_APIC; i++)
set_apicid_to_node(i, NUMA_NO_NODE)
still left in numa_init. So it will just clear result from early_parse_srat.
it should be moved before that....
c. it breaks ACPI_TABLE_OVERIDE...as the acpi table scan is moved
early before override from INITRD is settled.

3. that patch's title is totally misleading: there is NO x86 in the
title, but it changes critical x86 code. That caused the x86 guys not
to pay attention and find the problem early. Those patches really
should be routed via tip/x86/mm.

4. after that commit, the following ranges can not use movable ram:
a. real_mode code.... well..funny, legacy Node0 [0,1M) could be hot-removed?
b. initrd... it will be freed after booting, so it could be on movable...
c. crashkernel for kdump...: looks like we can not put kdump kernel above 4G
anymore.
d. init_mem_mapping: can not put page table high anymore.
e. initmem_init: vmemmap can not be high local node anymore. That is
not good.

If a node is hotpluggable, the mem-related ranges like page table and
vmemmap could be on that node without problem, and should be on that
node.

We have workaround patches that could fix some problems, but some can not
be fixed.

So just remove that offending commit and related ones, including:
f7210e6c4ac7 ("mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to
protect movablecore_map in memblock_overlaps_region().")

01a178a94e8e ("acpi, memory-hotplug: support getting hotplug info from
SRAT")

27168d38fa20 ("acpi, memory-hotplug: extend movablemem_map ranges to
the end of node")

e8d195525809 ("acpi, memory-hotplug: parse SRAT before memblock is
ready")

fb06bc8e5f42 ("page_alloc: bootmem limit with movablecore_map")

42f47e27e761 ("page_alloc: make movablemem_map have higher priority")

6981ec31146c ("page_alloc: introduce zone_movable_limit[] to keep
movable limit for nodes")

34b71f1e04fc ("page_alloc: add movable_memmap kernel parameter")

4d59a75125d5 ("x86: get pg_data_t's memory from other node")
Later we should have patches that will make sure the kernel puts page
tables and vmemmap on local node ram instead of pushing them down to
node0. Also we need to find a way to put other kernel-used ram on local
node ram.

Reported-by: Tim Gardner
Reported-by: Don Morris
Bisected-by: Don Morris
Tested-by: Don Morris
Signed-off-by: Yinghai Lu
Cc: Tony Luck
Cc: Thomas Renninger
Cc: Tejun Heo
Cc: Tang Chen
Cc: Yasuaki Ishimatsu
Signed-off-by: Linus Torvalds
02 Mar, 2013
1 commit
-
Signed-off-by: Al Viro
01 Mar, 2013
2 commits
-
Pull writeback fixes from Wu Fengguang:
"Two writeback fixes- fix negative (setpoint - dirty) in 32bit archs
- use down_read_trylock() in writeback_inodes_sb(_nr)_if_idle()"
* tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
Negative (setpoint-dirty) in bdi_position_ratio()
vfs: re-implement writeback_inodes_sb(_nr)_if_idle() and rename them
-
Pull block IO core bits from Jens Axboe:
"Below are the core block IO bits for 3.9. It was delayed a few days
since my workstation kept crashing every 2-8h after pulling it into
current -git, but turns out it is a bug in the new pstate code (divide
by zero, will report separately). In any case, it contains:

 - The big cfq/blkcg update from Tejun and Vivek.

 - Additional block and writeback tracepoints from Tejun.

 - Improvement of the should sort (based on queues) logic in the plug
   flushing.

 - _io() variants of the wait_for_completion() interface, using
   io_schedule() instead of schedule() to contribute to io wait
   properly.

 - Various little fixes.

You'll get two trivial merge conflicts, which should be easy enough to
fix up"

Fix up the trivial conflicts due to hlist traversal cleanups (commit
b67bfe0d42ca: "hlist: drop the node parameter from iterators").

* 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
block: remove redundant check to bd_openers()
block: use i_size_write() in bd_set_size()
cfq: fix lock imbalance with failed allocations
drivers/block/swim3.c: fix null pointer dereference
block: don't select PERCPU_RWSEM
block: account iowait time when waiting for completion of IO request
sched: add wait_for_completion_io[_timeout]
writeback: add more tracepoints
block: add block_{touch|dirty}_buffer tracepoint
buffer: make touch_buffer() an exported function
block: add @req to bio_{front|back}_merge tracepoints
block: add missing block_bio_complete() tracepoint
block: Remove should_sort judgement when flush blk_plug
block,elevator: use new hashtable implementation
cfq-iosched: add hierarchical cfq_group statistics
cfq-iosched: collect stats from dead cfqgs
cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
block: RCU free request_queue
blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
...
28 Feb, 2013
4 commits
-
I'm not sure why, but the hlist for each entry iterators were conceived

    list_for_each_entry(pos, head, member)

The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

Why did they need an extra pos parameter? I'm not quite sure. Not only
do they not really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.

Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.

The semantic patch which is mostly the work of Peter Senna Tschudin is here:

@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
type T;
expression a,c,d,e;
identifier b;
statement S;
@@

-T b;
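The effect on call sites is easiest to see side by side; a hedged
sketch (the struct is invented for illustration):

    struct item {
            int value;
            struct hlist_node member;
    };

    struct item *tpos;
    struct hlist_node *pos;

    /* before: a useless extra cursor */
    hlist_for_each_entry(tpos, pos, head, member)
            printk("%d\n", tpos->value);

    /* after: same shape as list_for_each_entry() */
    hlist_for_each_entry(tpos, head, member)
            printk("%d\n", tpos->value);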
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foundation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin
Acked-by: Paul E. McKenney
Signed-off-by: Sasha Levin
Cc: Wu Fengguang
Cc: Marcelo Tosatti
Cc: Gleb Natapov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Change it to CONFIG_HAVE_VIRT_TO_BUS and set it in all architectures
that already provide virt_to_bus().

Signed-off-by: Stephen Rothwell
Reviewed-by: James Hogan
Cc: Bjorn Helgaas
Cc: H Hartley Sweeten
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: "David S. Miller"
Cc: Paul Mundt
Cc: Vineet Gupta
Cc: James Bottomley
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE
at a time. When munlocking THP pages (or the huge zero page), this
resulted in taking the mm->page_table_lock 512 times in a row.

We can do better by making use of the page_mask returned by
follow_page_mask (for the huge zero page case), or the size of the page
munlock_vma_page() operated on (for the true THP page case).

Signed-off-by: Michel Lespinasse
Cc: Andrea Arcangeli
Cc: Rik van Riel
Cc: Mel Gorman
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
The stack vma is designed to grow automatically (marked with VM_GROWSUP
or VM_GROWSDOWN depending on architecture) when an access is made beyond
the existing boundary. However, particularly if you have not limited
your stack at all ("ulimit -s unlimited"), this can cause the stack to
grow even if the access was really just one past *another* segment.

And that's wrong, especially since we first grow the segment, but then
immediately later enforce the stack guard page on the last page of the
segment. So _despite_ first growing the stack segment as a result of
the access, the kernel will then make the access cause a SIGSEGV anyway!

So do the same logic as the guard page check does, and consider an
access to within one page of the next segment to be a bad access, rather
than growing the stack to abut the next segment.

Reported-and-tested-by: Heiko Carstens
Signed-off-by: Linus Torvalds
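A hedged sketch of the check this implies for the VM_GROWSUP direction
(the vma field names are real; the exact comparison in the patch may
differ):

    /* treat an access within one guard page of the next segment as a
     * bad access instead of growing the stack to abut it */
    if (vma->vm_next && address + PAGE_SIZE >= vma->vm_next->vm_start)
            return -ENOMEM;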
27 Feb, 2013
1 commit
-
Pull vfs pile (part one) from Al Viro:
"Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
locking violations, etc.The most visible changes here are death of FS_REVAL_DOT (replaced with
"has ->d_weak_revalidate()") and a new helper getting from struct file
to inode. Some bits of preparation to xattr method interface changes.Misc patches by various people sent this cycle *and* ocfs2 fixes from
several cycles ago that should've been upstream right then.PS: the next vfs pile will be xattr stuff."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
saner proc_get_inode() calling conventions
proc: avoid extra pde_put() in proc_fill_super()
fs: change return values from -EACCES to -EPERM
fs/exec.c: make bprm_mm_init() static
ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
ocfs2: fix possible use-after-free with AIO
ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
target: writev() on single-element vector is pointless
export kernel_write(), convert open-coded instances
fs: encode_fh: return FILEID_INVALID if invalid fid_type
kill f_vfsmnt
vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
nfsd: handle vfs_getattr errors in acl protocol
switch vfs_getattr() to struct path
default SET_PERSONALITY() in linux/elf.h
ceph: prepopulate inodes only when request is aborted
d_hash_and_lookup(): export, switch open-coded instances
9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
9p: split dropping the acls from v9fs_set_create_acl()
...
26 Feb, 2013
4 commits
-
This patch is a follow-up on the patch below:
[PATCH] exportfs: add FILEID_INVALID to indicate invalid fid_type
commit: 216b6cbdcbd86b1db0754d58886b466ae31f5a63

Signed-off-by: Namjae Jeon
Signed-off-by: Vivek Trivedi
Acked-by: Steven Whitehouse
Acked-by: Sage Weil
Signed-off-by: Al Viro
-
Note that provided ->d_dname() reproduces what we used to get for
those guys in e.g. /proc/self/maps; it might be a good idea to change
that to something less ugly, but for now let's keep the existing
user-visible behaviour.

Signed-off-by: Al Viro
-
Pull user namespace and namespace infrastructure changes from Eric W Biederman:
"This set of changes starts with a few small enhnacements to the user
namespace. reboot support, allowing more arbitrary mappings, and
support for mounting devpts, ramfs, tmpfs, and mqueuefs as just the
user namespace root.I do my best to document that if you care about limiting your
unprivileged users that when you have the user namespace support
enabled you will need to enable memory control groups.There is a minor bug fix to prevent overflowing the stack if someone
creates way too many user namespaces.The bulk of the changes are a continuation of the kuid/kgid push down
work through the filesystems. These changes make using uids and gids
typesafe which ensures that these filesystems are safe to use when
multiple user namespaces are in use. The filesystems converted for
3.9 are ceph, 9p, afs, ocfs2, gfs2, ncpfs, nfs, nfsd, and cifs. The
changes for these filesystems were a little more involved so I split
the changes into smaller hopefully obviously correct changes.XFS is the only filesystem that remains. I was hoping I could get
that in this release so that user namespace support would be enabled
with an allyesconfig or an allmodconfig but it looks like the xfs
changes need another couple of days before it they are ready."* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (93 commits)
cifs: Enable building with user namespaces enabled.
cifs: Convert struct cifs_ses to use a kuid_t and a kgid_t
cifs: Convert struct cifs_sb_info to use kuids and kgids
cifs: Modify struct smb_vol to use kuids and kgids
cifs: Convert struct cifsFileInfo to use a kuid
cifs: Convert struct cifs_fattr to use kuid and kgids
cifs: Convert struct tcon_link to use a kuid.
cifs: Modify struct cifs_unix_set_info_args to hold a kuid_t and a kgid_t
cifs: Convert from a kuid before printing current_fsuid
cifs: Use kuids and kgids SID to uid/gid mapping
cifs: Pass GLOBAL_ROOT_UID and GLOBAL_ROOT_GID to keyring_alloc
cifs: Use BUILD_BUG_ON to validate uids and gids are the same size
cifs: Override unmappable incoming uids and gids
nfsd: Enable building with user namespaces enabled.
nfsd: Properly compare and initialize kuids and kgids
nfsd: Store ex_anon_uid and ex_anon_gid as kuids and kgids
nfsd: Modify nfsd4_cb_sec to use kuids and kgids
nfsd: Handle kuids and kgids in the nfs4acl to posix_acl conversion
nfsd: Convert nfsxdr to use kuids and kgids
nfsd: Convert nfs3xdr to use kuids and kgids
...
-
Pull module update from Rusty Russell:
"The sweeping change is to make add_taint() explicitly indicate whether
to disable lockdep, but it's a mechanical change."

* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
MODSIGN: Add option to not sign modules during modules_install
MODSIGN: Add -s option to sign-file
MODSIGN: Specify the hash algorithm on sign-file command line
MODSIGN: Simplify Makefile with a Kconfig helper
module: clean up load_module a little more.
modpost: Ignore ARC specific non-alloc sections
module: constify within_module_*
taint: add explicit flag to show whether lock dep is still OK.
module: printk message when module signature fail taints kernel.
24 Feb, 2013
8 commits
-
It is a pity to have MAX_NUMNODES+MAX_NUMNODES tree roots statically
allocated, particularly when very few users will ever actually tune
merge_across_nodes 0 to use more than 1+1 of those trees. Not a big
deal (only 16kB wasted on each machine with CONFIG_MAXSMP), but a pity.

Start off with 1+1 statically allocated, then if merge_across_nodes is
ever tuned, allocate for nr_node_ids+nr_node_ids. Do not attempt to
free up the extra if it's tuned back, that would be a waste of effort.

Signed-off-by: Hugh Dickins
Cc: Mel Gorman
Cc: Petr Holasek
Cc: Andrea Arcangeli
Cc: Izik Eidus
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
I dislike the way in which "swapcache" gets used in do_swap_page():
there is always a page from swapcache there (even if maybe uncached by
the time we lock it), but tests are made according to "swapcache".
Rework that with "page != swapcache", as has been done in unuse_pte().Signed-off-by: Hugh Dickins
Cc: Mel Gorman
Cc: Petr Holasek
Cc: Andrea Arcangeli
Cc: Izik Eidus
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Before establishing that KSM page migration was the cause of my
WARN_ON_ONCE(page_mapped(page))s, I suspected that they came from the
lack of a ksm_might_need_to_copy() in swapoff's unuse_pte() - which in
many respects is equivalent to faulting in a page.

In fact I've never caught that as the cause: but in theory it does at
least need the KSM_RUN_UNMERGE check in ksm_might_need_to_copy(), to
avoid bringing a KSM page back in when it's not supposed to be.

I intended to copy how it's done in do_swap_page(), but have a strong
aversion to how "swapcache" ends up being used there: rework it with
"page != swapcache".

Signed-off-by: Hugh Dickins
Cc: Mel Gorman
Cc: Petr Holasek
Cc: Andrea Arcangeli
Cc: Izik Eidus
Acked-by: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
In "ksm: remove old stable nodes more thoroughly" I said that I'd never
seen its WARN_ON_ONCE(page_mapped(page)). True at the time of writing,
but it soon appeared once I tried fuller tests on the whole series.

It turned out to be due to the KSM page migration itself: unmerge_and_
remove_all_rmap_items() failed to locate and replace all the KSM pages,
because of that hiatus in page migration when old pte has been replaced
by migration entry, but not yet by new pte. follow_page() finds no page
at that instant, but a KSM page reappears shortly after, without a
fault.

Add FOLL_MIGRATION flag, so follow_page() can do migration_entry_wait()
for KSM's break_cow(). I'd have preferred to avoid another flag, and do
it every time, in case someone else makes the same easy mistake; but did
not find another transgressor (the common get_user_pages() is of course
safe), and cannot be sure that every follow_page() caller is prepared to
sleep - ia64's xencomm_vtop()? Now, THP's wait_split_huge_page() can
already sleep there, since anon_vma locking was changed to mutex, but
maybe that's somehow excluded.

Signed-off-by: Hugh Dickins
Cc: Mel Gorman
Cc: Petr Holasek
Cc: Andrea Arcangeli
Cc: Izik Eidus
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
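A hedged sketch of the follow_page() path this describes (built from
the description above; details of the real patch may vary):

    if (!pte_present(pte)) {
            swp_entry_t entry;

            /* only callers passing FOLL_MIGRATION are prepared to
             * sleep until migration finishes */
            if (!(flags & FOLL_MIGRATION))
                    goto no_page;
            entry = pte_to_swp_entry(pte);
            if (!is_migration_entry(entry))
                    goto no_page;
            pte_unmap_unlock(ptep, ptl);
            migration_entry_wait(mm, pmd, address);
            goto retry;     /* the pte should now be present */
    }

-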
Think of struct rmap_item as an extension of struct page (restricted to
MADV_MERGEABLE areas): there may be a lot of them, we need to keep them
small, especially on 32-bit architectures of limited lowmem.

Siting "int nid" after "unsigned int checksum" works nicely on 64-bit,
making no change to its 64-byte struct rmap_item; but bloats the 32-bit
struct rmap_item from (nicely cache-aligned) 32 bytes to 36 bytes, which
rounds up to 40 bytes once allocated from slab. We'd better avoid that.

Hey, I only just remembered that the anon_vma pointer in struct
rmap_item has no purpose until the rmap_item is hung from a stable tree
node (which has its own nid field); and rmap_item's nid field has no
purpose other than to say which tree root to tell rb_erase() when
unlinking from an unstable tree.

Double them up in a union. There's just one place where we set anon_vma
early (when we already hold mmap_sem): now we must remove tree_rmap_item
from its unstable tree there, before overwriting nid. No need to
spatter BUG()s around: we'd be seeing oopses if this were wrong.

Signed-off-by: Hugh Dickins
Cc: Mel Gorman
Cc: Petr Holasek
Cc: Andrea Arcangeli
Cc: Izik Eidus
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
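A hedged sketch of the union described above (other members abridged;
not the literal struct layout):

    struct rmap_item {
            /* ... */
            union {
                    struct anon_vma *anon_vma; /* when hung from stable tree */
    #ifdef CONFIG_NUMA
                    int nid;                   /* when node of unstable tree */
    #endif
            };
            /* ... */
    };

-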
An inconsistency emerged in reviewing the NUMA node changes to KSM: when
meeting a page from the wrong NUMA node in a stable tree, we say that
it's okay for comparisons, but not as a leaf for merging; whereas when
meeting a page from the wrong NUMA node in an unstable tree, we bail out
immediately.

Now, it might be that a wrong NUMA node in an unstable tree is more
likely to correlate with instability (different content, with rbnode
now misplaced) than page migration; but even so, we are accustomed to
instability in the unstable tree.

Without strong evidence for which strategy is generally better, I'd
rather be consistent with what's done in the stable tree: accept a page
from the wrong NUMA node for comparison, but not as a leaf for merging.

Signed-off-by: Hugh Dickins
Cc: Mel Gorman
Cc: Petr Holasek
Cc: Andrea Arcangeli
Cc: Izik Eidus
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Added slightly more detail to the Documentation of merge_across_nodes, a
few comments in areas indicated by review, and renamed get_ksm_page()'s
argument from "locked" to "lock_it". No functional change.Signed-off-by: Hugh Dickins
Cc: Mel Gorman
Cc: Petr Holasek
Cc: Andrea Arcangeli
Cc: Izik Eidus
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Fix several mempolicy leaks in the tmpfs mount logic. These leaks are
slow - on the order of one object leaked per mount attempt.

Leak 1 (umount doesn't free mpol allocated in mount):

    while true; do
        mount -t tmpfs -o mpol=interleave,size=100M nodev /mnt
        umount /mnt
    done

Leak 2 (errors parsing remount options will leak mpol):

    mount -t tmpfs -o size=100M nodev /mnt
    while true; do
        mount -o remount,mpol=interleave,size=x /mnt 2> /dev/null
    done
    umount /mnt

Leak 3 (multiple mpol per mount leak mpol):

    while true; do
        mount -t tmpfs -o mpol=interleave,mpol=interleave,size=100M nodev /mnt
        umount /mnt
    done

This patch fixes all of the above. I could have broken the patch into
three pieces but it seemed easier to review as one.

[akpm@linux-foundation.org: fix handling of mpol_parse_str() errors, per Hugh]
Signed-off-by: Greg Thelen
Acked-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
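For leak 1, the natural fix is to drop the mount's mpol reference when
the superblock goes away; a hedged sketch (assuming the change lands in
shmem_put_super(), which the description implies but does not show):

    static void shmem_put_super(struct super_block *sb)
    {
            struct shmem_sb_info *sbinfo = SHMEM_SB(sb);

            mpol_put(sbinfo->mpol);  /* free mpol allocated at mount time */
            percpu_counter_destroy(&sbinfo->used_blocks);
            kfree(sbinfo);
            sb->s_fs_info = NULL;
    }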