20 Sep, 2018
1 commit
-
commit 7a9cdebdcc17e426fb5287e4a82db1dfe86339b2 upstream.
Jann Horn points out that the vmacache_flush_all() function is not only
potentially expensive, it's buggy too. It also happens to be entirely
unnecessary, because the sequence number overflow case can be avoided by
simply making the sequence number be 64-bit. That doesn't even grow the
data structures in question, because the other adjacent fields are
already 64-bit.
So simplify the whole thing by just making the sequence number overflow
case go away entirely, which gets rid of all the complications and makes
the code faster too. Win-win.
[ Oleg Nesterov points out that the VMACACHE_FULL_FLUSHES statistic
also just goes away entirely with this ]
Reported-by: Jann Horn
Suggested-by: Will Deacon
Acked-by: Davidlohr Bueso
Cc: Oleg Nesterov
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
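For illustration, the shape of the fix as described above (a sketch, not the verbatim upstream hunk; the helper lives in include/linux/vmacache.h):

    /* mm->vmacache_seqnum becomes 64-bit, so wraparound stops being a
     * practical concern: at one invalidation per nanosecond, a u64
     * counter takes roughly 584 years to overflow. */
    static inline void vmacache_invalidate(struct mm_struct *mm)
    {
            mm->vmacache_seqnum++;

            /* The old overflow handling, which called the expensive and
             * buggy vmacache_flush_all(), is simply gone. */
    }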
15 Sep, 2018
1 commit
-
[ Upstream commit a718e28f538441a3b6612da9ff226973376cdf0f ]
Signed integer overflow is undefined according to the C standard. The
overflow in ksys_fadvise64_64() is deliberate, but since it is signed
overflow, UBSAN complains:
UBSAN: Undefined behaviour in mm/fadvise.c:76:10
signed integer overflow:
4 + 9223372036854775805 cannot be represented in type 'long long int'
Use unsigned types to do the math. Unsigned overflow is defined, so UBSAN
will not complain about it. This patch doesn't change the generated code.
[akpm@linux-foundation.org: add comment explaining the casts]
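For reference, the overflow-free form of the computation (a userspace sketch of the same arithmetic; the kernel hunk differs only in its surrounding context):

    #include <stdint.h>

    /* Compute the inclusive end byte of an (offset, len) range. Doing
     * the addition in unsigned types makes any wraparound well defined,
     * and casting back preserves the old behaviour without the UB. */
    static int64_t fadvise_endbyte(int64_t offset, int64_t len)
    {
            return (int64_t)((uint64_t)offset + (uint64_t)len - 1);
    }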
Link: http://lkml.kernel.org/r/20180629184453.7614-1-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin
Reported-by:
Reviewed-by: Andrew Morton
Cc: Alexander Potapenko
Cc: Dmitry Vyukov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman
10 Sep, 2018
2 commits
-
commit a6f572084fbee8b30f91465f4a085d7a90901c57 upstream.
Will noted that only checking mm_users is incorrect; we should also
check mm_count in order to cover CPUs that have a lazy reference to
this mm (and could do speculative TLB operations).
If removing this turns out to be a performance issue, we can
re-instate a more complete check, but in tlb_table_flush() eliding the
call_rcu_sched().
Fixes: 267239116987 ("mm, powerpc: move the RCU page-table freeing into generic code")
Reported-by: Will Deacon
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Rik van Riel
Acked-by: Will Deacon
Cc: Nicholas Piggin
Cc: David Miller
Cc: Martin Schwidefsky
Cc: Michael Ellerman
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
-
commit dc30b96ab6d569060741572cf30517d3179429a8 upstream.
ondemand_readahead() checks bdi->io_pages to cap the maximum pages
that need to be processed. This works until the readit section. If
we do an async-only readahead (async size = sync size) and the
target is at the beginning of the window, we expand the pages by another
get_next_ra_size() pages. Btrace for large reads shows that the kernel
always issues a doubled size read at the beginning of processing.
Add an additional check for io_pages in the lower part of the function.
The fix helps devices that hard limit bio pages and rely on proper
handling of max_hw_read_sectors (e.g. older FusionIO cards). For
that reason it could qualify for stable.
Fixes: 9491ae4a ("mm: don't cap request size based on read-ahead setting")
Cc: stable@vger.kernel.org
Signed-off-by: Markus Stockhausen stockhausen@collogia.de
Signed-off-by: Jens Axboe
Signed-off-by: Greg Kroah-Hartman
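For illustration, the shape of the added clamp (a sketch following the ondemand_readahead() naming; treat as approximate, not the verbatim hunk):

    /* When merging the async window back into the current window, cap
     * the expanded size at the bdi's io_pages limit (max_pages here) so
     * a doubled-size read is never issued past the device's hard cap. */
    if (offset == ra->start && ra->size == ra->async_size) {
            add_pages = get_next_ra_size(ra, max_pages);
            if (ra->size + add_pages <= max_pages) {
                    ra->async_size = add_pages;
                    ra->size += add_pages;
            } else {
                    ra->size = max_pages;
                    ra->async_size = max_pages >> 1;
            }
    }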
05 Sep, 2018
6 commits
-
commit d86564a2f085b79ec046a5cba90188e612352806 upstream.
Jann reported that x86 was missing required TLB invalidates when he
hit the !*batch slow path in tlb_remove_table().
This is indeed the case; RCU_TABLE_FREE does not provide TLB (cache)
invalidates, the PowerPC-hash where this code originated and the
Sparc-hash where this was subsequently used did not need that. ARM
which later used this put an explicit TLB invalidate in their
__p*_free_tlb() functions, and PowerPC-radix followed that example.
But when we hooked up x86 we failed to consider this. Fix this by
(optionally) hooking tlb_remove_table() into the TLB invalidate code.
NOTE: s390 was also needing something like this and might now
be able to use the generic code again.
[ Modified to be on top of Nick's cleanups, which simplified this patch
now that tlb_flush_mmu_tlbonly() really only flushes the TLB - Linus ]
Fixes: 9e52fc2b50de ("x86/mm: Enable RCU based page table freeing (CONFIG_HAVE_RCU_TABLE_FREE=y)")
Reported-by: Jann Horn
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Rik van Riel
Cc: Nicholas Piggin
Cc: David Miller
Cc: Will Deacon
Cc: Martin Schwidefsky
Cc: Michael Ellerman
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
-
commit db7ddef301128dad394f1c0f77027f86ee9a4edb upstream.
There is no need to call this from tlb_flush_mmu_tlbonly, it logically
belongs with tlb_flush_mmu_free. This makes future fixes simpler.
[ This was originally done to allow code consolidation for the
mmu_notifier fix, but it also ends up helping simplify the
HAVE_RCU_TABLE_INVALIDATE fix. - Linus ]
Signed-off-by: Nicholas Piggin
Acked-by: Will Deacon
Cc: Peter Zijlstra
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
-
[ Upstream commit 24eee1e4c47977bdfb71d6f15f6011e7b6188d04 ]
ioremap_prot() can return NULL which could lead to an oops.
Link: http://lkml.kernel.org/r/1533195441-58594-1-git-send-email-chenjie6@huawei.com
Signed-off-by: chen jie
Reviewed-by: Andrew Morton
Cc: Li Zefan
Cc: chenjie
Cc: Yang Shi
Cc: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman
-
[ Upstream commit 7e97de0b033bcac4fa9a35cef72e0c06e6a22c67 ]
In case of memcg_online_kmem() failure, mem_cgroup::id remains hashed
in mem_cgroup_idr even after the memcg memory is freed. This leads to a
leak of IDs in mem_cgroup_idr.
This patch adds the removal into mem_cgroup_css_alloc(), which fixes the
problem. For better readability, it adds a generic helper which is used
in mem_cgroup_alloc() and mem_cgroup_id_put_many() as well.
Link: http://lkml.kernel.org/r/152354470916.22460.14397070748001974638.stgit@localhost.localdomain
Fixes: 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure after many small jobs")
Signed-off-by: Kirill Tkhai
Acked-by: Johannes Weiner
Acked-by: Vladimir Davydov
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman
-
[ Upstream commit 53406ed1bcfdabe4b5bc35e6d17946c6f9f563e2 ]
Delete the old VM_BUG_ON_VMA() from zap_pmd_range(), which asserted
that mmap_sem must be held when splitting an "anonymous" vma there.
Whether that's still strictly true nowadays is not entirely clear,
but the danger of sometimes crashing on the BUG is now fairly clear.
Even with the new stricter rules for anonymous vma marking, the
condition it checks for can possibly trigger. Commit 44960f2a7b63
("staging: ashmem: Fix SIGBUS crash when traversing mmaped ashmem
pages") is good, and originally I thought it was safe from that
VM_BUG_ON_VMA(), because the /dev/ashmem fd exposed to the user is
disconnected from the vm_file in the vma, and madvise(,,MADV_REMOVE)
insists on VM_SHARED.
But after I read John's earlier mail, drawing attention to the
vfs_fallocate() in there: I may be wrong, and I don't know if Android
has THP in the config anyway, but it looks to me like an
unmap_mapping_range() from ashmem's vfs_fallocate() could hit precisely
the VM_BUG_ON_VMA(), once it's vma_is_anonymous().
Signed-off-by: Hugh Dickins
Cc: John Stultz
Cc: Kirill Shutemov
Cc: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman
-
[ Upstream commit 16e536ef47f567289a5699abee9ff7bb304bc12d ]
/sys/../zswap/stored_pages keeps rising in a zswap test with
"zswap.max_pool_percent=0" parameter. But it should not compress or
store pages any more since there is no space in the compressed pool.
Reproduce steps:
1. Boot kernel with "zswap.enabled=1"
2. Set the max_pool_percent to 0
# echo 0 > /sys/module/zswap/parameters/max_pool_percent
3. Do memory stress test to see if some pages have been compressed
# stress --vm 1 --vm-bytes $mem_available"M" --timeout 60s
4. Watching the 'stored_pages' number increasing or not
The root cause is:
When zswap_max_pool_percent is set to 0 via kernel parameter,
zswap_is_full() will always return true due to zswap_shrink(). But if
the shrinking is able to reclaim a page successfully, the code then
proceeds to compressing/storing another page, so the value of
stored_pages will keep changing.
To solve the issue, this patch adds a zswap_is_full() check again after
zswap_shrink() to make sure it's now under the max_pool_percent, and to
not compress/store if we reached the limit.
Link: http://lkml.kernel.org/r/20180530103936.17812-1-liwang@redhat.com
Signed-off-by: Li Wang
Acked-by: Dan Streetman
Cc: Seth Jennings
Cc: Huang Ying
Cc: Yu Zhao
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman
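The shape of the fix, per the description above (a sketch of the limit handling in zswap_frontswap_store(); close to, but not guaranteed to match, the exact hunk):

    if (zswap_is_full()) {
            zswap_pool_limit_hit++;
            if (zswap_shrink()) {
                    zswap_reject_reclaim_fail++;
                    ret = -ENOMEM;
                    goto reject;
            }
            /* A second zswap_is_full() check after zswap_shrink(), to
             * make sure the pool really is under max_pool_percent again
             * before compressing/storing another page. */
            if (zswap_is_full()) {
                    ret = -ENOMEM;
                    goto reject;
            }
    }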
24 Aug, 2018
1 commit
-
[ Upstream commit 1e8e18f694a52d703665012ca486826f64bac29d ]
There is a special case that the size is "(N << KASAN_SHADOW_SCALE_SHIFT)
Pages plus X", the value of X is [1, KASAN_SHADOW_SCALE_SIZE-1]. The
operation "size >> KASAN_SHADOW_SCALE_SHIFT" will drop X, and the
roundup operation can not retrieve the missed one page. For example:
size=0x28006, PAGE_SIZE=0x1000, KASAN_SHADOW_SCALE_SHIFT=3, we will get
shadow_size=0x5000, but actually we need 6 pages.
shadow_size = round_up(size >> KASAN_SHADOW_SCALE_SHIFT, PAGE_SIZE);
This can lead to a kernel crash when kasan is enabled and the value of
mod->core_layout.size or mod->init_layout.size is like above. Because
the shadow memory of X has not been allocated and mapped.
move_module:
ptr = module_alloc(mod->core_layout.size);
...
memset(ptr, 0, mod->core_layout.size); //crashed
Unable to handle kernel paging request at virtual address ffff0fffff97b000
......
Call trace:
__asan_storeN+0x174/0x1a8
memset+0x24/0x48
layout_and_allocate+0xcd8/0x1800
load_module+0x190/0x23e8
SyS_finit_module+0x148/0x180
Link: http://lkml.kernel.org/r/1529659626-12660-1-git-send-email-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei
Reviewed-by: Dmitriy Vyukov
Acked-by: Andrey Ryabinin
Cc: Alexander Potapenko
Cc: Hanjun Guo
Cc: Libin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman
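The arithmetic can be checked in isolation. A standalone sketch using the constants quoted above (hypothetical test program, not kernel code): round the size up to a multiple of the shadow scale before shifting, so the trailing X bytes still get their shadow page.

    #include <stdio.h>

    #define PAGE_SIZE                 0x1000UL
    #define KASAN_SHADOW_SCALE_SHIFT  3
    #define KASAN_SHADOW_SCALE_SIZE   (1UL << KASAN_SHADOW_SCALE_SHIFT)
    #define round_up(x, y)            ((((x) - 1) | ((y) - 1)) + 1)

    int main(void)
    {
            unsigned long size = 0x28006;

            /* Buggy: shifting first drops the low bits (X), giving
             * 0x5000 -- one shadow page short. */
            unsigned long buggy =
                    round_up(size >> KASAN_SHADOW_SCALE_SHIFT, PAGE_SIZE);

            /* Fixed: round up to the scale first, then shift, giving
             * 0x6000 -- the 6 pages the example above requires. */
            unsigned long fixed =
                    round_up((size + KASAN_SHADOW_SCALE_SIZE - 1) >>
                             KASAN_SHADOW_SCALE_SHIFT, PAGE_SIZE);

            printf("buggy=%#lx fixed=%#lx\n", buggy, fixed);
            return 0;
    }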
16 Aug, 2018
2 commits
-
commit 377eeaa8e11fe815b1d07c81c4a0e2843a8c15eb upstream
For the L1TF workaround it's necessary to limit the swap file size to below
MAX_PA/2, so that the higher bits of the inverted swap offset never point
to valid memory.
Add a mechanism for the architecture to override the swap file size check
in swapfile.c and add a x86 specific max swapfile check function that
enforces that limit.
The check is only enabled if the CPU is vulnerable to L1TF.
In VMs with 42bit MAX_PA the typical limit is 2TB now, on a native system
with 46bit PA it is 32TB. The limit is only per individual swap file, so
it's always possible to exceed these limits with multiple swap files or
partitions.
Signed-off-by: Andi Kleen
Signed-off-by: Thomas Gleixner
Reviewed-by: Josh Poimboeuf
Acked-by: Michal Hocko
Acked-by: Dave Hansen
Signed-off-by: Greg Kroah-Hartman
-
commit 42e4089c7890725fcd329999252dc489b72f2921 upstream
For L1TF, PROT_NONE mappings are protected by inverting the PFN in the page
table entry. This sets the high bits in the CPU's address space, thus
making sure it does not point to valid cached memory.
Some server system BIOSes put the MMIO mappings high up in the physical
address space. If such a high mapping was mapped to unprivileged users
they could attack low memory by setting such a mapping to PROT_NONE. This
could happen through a special device driver which is not access
protected. Normal /dev/mem is of course access protected.
To avoid this, forbid PROT_NONE mappings or mprotect for high MMIO mappings.
Valid page mappings are allowed because the system is then unsafe anyway.
It's not expected that users commonly use PROT_NONE on MMIO. But to
minimize any impact this is only enforced if the mapping actually refers to
a high MMIO address (defined as the MAX_PA-1 bit being set), and the
check is also skipped for root.
For mmaps this is straightforward and can be handled in vm_insert_pfn and
in remap_pfn_range().
For mprotect it's a bit trickier. At the point where the actual PTEs are
accessed a lot of state has been changed and it would be difficult to undo
on an error. Since this is an uncommon case, use a separate early page
table walk pass for MMIO PROT_NONE mappings that checks for this condition
early. For non-MMIO and non-PROT_NONE there are no changes.
Signed-off-by: Andi Kleen
Signed-off-by: Thomas Gleixner
Reviewed-by: Josh Poimboeuf
Acked-by: Dave Hansen
Signed-off-by: Greg Kroah-Hartman
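A sketch of the check described above (modeled on the pfn_modify_allowed() helper this series introduces on x86; details are approximate):

    /* Allow a PTE to be set up only if an inverted (PROT_NONE) entry
     * cannot end up pointing at unmapped high physical addresses. */
    bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot)
    {
            if (!boot_cpu_has_bug(X86_BUG_L1TF))
                    return true;
            if (!__pte_needs_invert(pgprot_val(prot)))
                    return true;
            /* Real (valid) memory: mapping it is safe anyway. */
            if (pfn_valid(pfn))
                    return true;
            /* High MMIO PFN: refuse for unprivileged users. */
            if (pfn >= l1tf_pfn_limit() && !capable(CAP_SYS_ADMIN))
                    return false;
            return true;
    }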
03 Aug, 2018
2 commits
-
[ Upstream commit a38965bf941b7c2af50de09c96bc5f03e136caef ]
__printf is useful to verify format and arguments. Remove the following
warning (with W=1):
mm/slub.c:721:2: warning: function might be possible candidate for `gnu_printf' format attribute [-Wsuggest-attribute=format]
Link: http://lkml.kernel.org/r/20180505200706.19986-1-malat@debian.org
Signed-off-by: Mathieu Malaterre
Reviewed-by: Andrew Morton
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman
-
[ Upstream commit f3c01d2f3ade6790db67f80fef60df84424f8964 ]
Currently, __vunmap flow is,
1) Release the VM area
2) Free the debug objects corresponding to that vm area.
This leaves some race window open.
1) Release the VM area
1.5) Some other client gets the same vm area
1.6) This client allocates new debug objects on the same
vm area
2) Free the debug objects corresponding to this vm area.
Here, we actually free the 'other' client's debug objects.
Fix this by freeing the debug objects first and then releasing the VM
area.
Link: http://lkml.kernel.org/r/1523961828-9485-2-git-send-email-cpandya@codeaurora.org
Signed-off-by: Chintan Pandya
Reviewed-by: Andrew Morton
Cc: Ard Biesheuvel
Cc: Byungchul Park
Cc: Catalin Marinas
Cc: Florian Fainelli
Cc: Johannes Weiner
Cc: Laura Abbott
Cc: Vlastimil Babka
Cc: Wei Yang
Cc: Yisheng Xie
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman
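The resulting order of operations, sketched with the real vmalloc helpers (simplified; not the verbatim hunk):

    static void __vunmap(const void *addr, int deallocate_pages)
    {
            struct vm_struct *area = find_vm_area(addr);

            /* Free the debug objects while the area is still ours... */
            debug_check_no_locks_freed(addr, get_vm_area_size(area));
            debug_check_no_obj_freed(addr, get_vm_area_size(area));

            /* ...and only then release it for reuse by other clients. */
            remove_vm_area(addr);
            /* ... free pages, kfree(area), etc. ... */
    }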
25 Jul, 2018
2 commits
-
commit e1f1b1572e8db87a56609fd05bef76f98f0e456a upstream.
__split_huge_pmd_locked() must check if the cleared huge pmd was dirty,
and propagate that to PageDirty: otherwise, data may be lost when a huge
tmpfs page is modified then split then reclaimed.
How has this taken so long to be noticed? Because there was no problem
when the huge page is written by a write system call (shmem_write_end()
calls set_page_dirty()), nor when the page is allocated for a write fault
(fault_dirty_shared_page() calls set_page_dirty()); but when allocated for
a read fault (which MAP_POPULATE simulates), no set_page_dirty().
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1807111741430.1106@eggly.anvils
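The essence of the fix (a sketch of the __split_huge_pmd_locked() change; surrounding context omitted):

    _pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
    page = pmd_page(_pmd);
    /* Propagate the huge pmd's dirty bit to the struct page before
     * the mapping is torn down, so a modified-then-split tmpfs page
     * is not lost on reclaim. */
    if (!PageDirty(page) && pmd_dirty(_pmd))
            set_page_dirty(page);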
Fixes: d21b9e57c74c ("thp: handle file pages in split_huge_pmd()")
Signed-off-by: Hugh Dickins
Reported-by: Ashwin Chaugule
Reviewed-by: Yang Shi
Reviewed-by: Kirill A. Shutemov
Cc: "Huang, Ying"
Cc: [4.8+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
-
commit 9f15bde671355c351cf20d9f879004b234353100 upstream.
It was reported that a kernel crash happened in mem_cgroup_iter(), which
can be triggered if the legacy cgroup-v1 non-hierarchical mode is used.
Unable to handle kernel paging request at virtual address 6b6b6b6b6b6b8f
......
Call trace:
mem_cgroup_iter+0x2e0/0x6d4
shrink_zone+0x8c/0x324
balance_pgdat+0x450/0x640
kswapd+0x130/0x4b8
kthread+0xe8/0xfc
ret_from_fork+0x10/0x20
mem_cgroup_iter():
......
if (css_tryget(css)) <- crash here
break;
......
The crash happens because mem_cgroup_iter() uses the memcg object whose
pointer is stored in iter->position, which has been freed before and
filled with POISON_FREE(0x6b).
And the root cause of the use-after-free issue is that
invalidate_reclaim_iterators() fails to reset the value of iter->position
to NULL when the css of the memcg is released in non-hierarchical mode.
Link: http://lkml.kernel.org/r/1531994807-25639-1-git-send-email-jing.xia@unisoc.com
Fixes: 6df38689e0e9 ("mm: memcontrol: fix possible memcg leak due to interrupted reclaim")
Signed-off-by: Jing Xia
Acked-by: Michal Hocko
Cc: Johannes Weiner
Cc: Vladimir Davydov
Cc:
Cc: Shakeel Butt
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
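The fix, per the description above (a sketch of invalidate_reclaim_iterators(); approximate, but the point is that the dying memcg itself is now included, not only its parents):

    static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
    {
            struct mem_cgroup *memcg = dead_memcg;

            /* Was: while ((memcg = parent_mem_cgroup(memcg))) -- which
             * skipped dead_memcg itself, leaving a stale iter->position
             * behind in non-hierarchical mode. */
            for (; memcg; memcg = parent_mem_cgroup(memcg)) {
                    /* ... cmpxchg(&iter->position, dead_memcg, NULL) for
                     * each node/priority iterator of this memcg ... */
            }
    }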
22 Jul, 2018
1 commit
-
commit 3ee7e8697d5860b173132606d80a9cd35e7113ee upstream.
syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to
wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
WB_shutting_down after wb->bdi->dev became NULL. This indicates that
unregister_bdi() failed to call wb_shutdown() on one of the wb objects.
The problem is in cgwb_bdi_unregister() which does cgwb_kill() and thus
drops bdi's reference to wb structures before going through the list of
wbs again and calling wb_shutdown() on each of them. This way the loop
iterating through all wbs can easily miss a wb if that wb has already
passed through cgwb_remove_from_bdi_list() called from wb_shutdown()
from cgwb_release_workfn() and as a result fully shutdown bdi although
wb_workfn() for this wb structure is still running. In fact there are
also other ways cgwb_bdi_unregister() can race with
cgwb_release_workfn() leading e.g. to use-after-free issues:
CPU1                                            CPU2
                                                cgwb_bdi_unregister()
                                                  cgwb_kill(*slot);
cgwb_release()
  queue_work(cgwb_release_wq, &wb->release_work);
cgwb_release_workfn()
                                                  wb = list_first_entry(&bdi->wb_list, ...)
                                                  spin_unlock_irq(&cgwb_lock);
  wb_shutdown(wb);
  ...
  kfree_rcu(wb, rcu);
                                                  wb_shutdown(wb); -> oops use-after-free
We solve these issues by synchronizing writeback structure shutdown from
cgwb_bdi_unregister() with cgwb_release_workfn() using a new mutex. That
way we also no longer need synchronization using WB_shutting_down as the
mutex provides it for CONFIG_CGROUP_WRITEBACK case and without
CONFIG_CGROUP_WRITEBACK wb_shutdown() can be called only once from
bdi_unregister().
Reported-by: syzbot
Acked-by: Tejun Heo
Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe
Signed-off-by: Greg Kroah-Hartman
17 Jul, 2018
2 commits
-
commit bb177a732c4369bb58a1fe1df8f552b6f0f7db5f upstream.
syzbot has noticed that a specially crafted library can easily hit
VM_BUG_ON in __mm_populate
kernel BUG at mm/gup.c:1242!
invalid opcode: 0000 [#1] SMP
CPU: 2 PID: 9667 Comm: a.out Not tainted 4.18.0-rc3 #644
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
RIP: 0010:__mm_populate+0x1e2/0x1f0
Code: 55 d0 65 48 33 14 25 28 00 00 00 89 d8 75 21 48 83 c4 20 5b 41 5c 41 5d 41 5e 41 5f 5d c3 e8 75 18 f1 ff 0f 0b e8 6e 18 f1 ff 0b 31 db eb c9 e8 93 06 e0 ff 0f 1f 00 55 48 89 e5 53 48 89 fb
Call Trace:
vm_brk_flags+0xc3/0x100
vm_brk+0x1f/0x30
load_elf_library+0x281/0x2e0
__ia32_sys_uselib+0x170/0x1e0
do_fast_syscall_32+0xca/0x420
entry_SYSENTER_compat+0x70/0x7f
The reason is that the length of the new brk is not page aligned when we
try to populate it. There is no reason to bug on that though.
do_brk_flags already aligns the length properly so the mapping is
expanded as it should. All we need is to tell mm_populate about it.
Besides that, there is absolutely no reason to BUG_ON in the first
place. The worst thing that could happen is that the last page wouldn't
get populated and that is far from putting the system into an inconsistent
state.
Fix the issue by moving the length sanitization code from do_brk_flags
up to vm_brk_flags. The only other caller of do_brk_flags is brk
syscall entry and it makes sure to provide the proper length, so there
is no need for sanitization and so we can use do_brk_flags without it.
Also remove the bogus BUG_ONs.
[osalvador@techadventures.net: fix up vm_brk_flags s@request@len@]
Link: http://lkml.kernel.org/r/20180706090217.GI32658@dhcp22.suse.cz
Signed-off-by: Michal Hocko
Reported-by: syzbot
Tested-by: Tetsuo Handa
Reviewed-by: Oscar Salvador
Cc: Zi Yan
Cc: "Aneesh Kumar K.V"
Cc: Dan Williams
Cc: "Kirill A. Shutemov"
Cc: Michael S. Tsirkin
Cc: Al Viro
Cc: "Huang, Ying"
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
-
commit bce73e4842390f7b7309c8e253e139db71288ac3 upstream.
KVM guests on s390 can notify the host of unused pages. This can result
in the pte_unused callback returning true for KVM guest memory.
If a page is unused (checked with pte_unused) we might drop this page
instead of paging it. This can have side-effects on userfaultfd, when
the page in question was already migrated:
The next access of that page will trigger a fault and a user fault
instead of faulting in a new and empty zero page. As QEMU does not
expect a userfault on an already migrated page, this migration will fail.
The most straightforward solution is to ignore the pte_unused hint if a
userfault context is active for this VMA.
Link: http://lkml.kernel.org/r/20180703171854.63981-1-borntraeger@de.ibm.com
Signed-off-by: Christian Borntraeger
Cc: Martin Schwidefsky
Cc: Andrea Arcangeli
Cc: Mike Rapoport
Cc: Janosch Frank
Cc: David Hildenbrand
Cc: Cornelia Huck
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
11 Jul, 2018
3 commits
-
commit 28557cc106e6d2aa8b8c5c7687ea9f8055ff3911 upstream.
Revert commit c7f26ccfb2c3 ("mm/vmstat.c: fix vmstat_update() preemption
BUG"). Steven saw a "using smp_processor_id() in preemptible" message
and added a preempt_disable() section around it to keep it quiet. This
is not the right thing to do; it does not fix the real problem.
vmstat_update() is invoked by a kworker on a specific CPU. This worker
is bound to that CPU. The name of the worker was "kworker/1:1" so it
should have been a worker which was bound to CPU1. A worker which can
run on any CPU would have a `u' before the first digit.
smp_processor_id() can be used in a preempt-enabled region as long as
the task is bound to a single CPU which is the case here. If it could
run on an arbitrary CPU then this is a problem we have and should seek
to resolve.
And it is not only this smp_processor_id() that must not be migrated to
another CPU: refresh_cpu_vm_stats() might also access the wrong per-CPU
variables.
Not to mention that other code relies on the fact that such a worker
runs on one specific CPU only.
Therefore revert that commit and instead look into what broke the
affinity mask of the kworker.Link: http://lkml.kernel.org/r/20180504104451.20278-1-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior
Cc: Steven J. Hill
Cc: Tejun Heo
Cc: Vlastimil Babka
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
-
commit 31286a8484a85e8b4e91ddb0f5415aee8a416827 upstream.
Recently the following BUG was reported:
Injecting memory failure for pfn 0x3c0000 at process virtual address 0x7fe300000000
Memory failure: 0x3c0000: recovery action for huge page: Recovered
BUG: unable to handle kernel paging request at ffff8dfcc0003000
IP: gup_pgd_range+0x1f0/0xc20
PGD 17ae72067 P4D 17ae72067 PUD 0
Oops: 0000 [#1] SMP PTI
...
CPU: 3 PID: 5467 Comm: hugetlb_1gb Not tainted 4.15.0-rc8-mm1-abc+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
You can easily reproduce this by calling madvise(MADV_HWPOISON) twice on
a 1GB hugepage. This happens because get_user_pages_fast() is not aware
of a migration entry on pud that was created in the 1st madvise() event.
I think that conversion to pud-aligned migration entry is working, but
other MM code walking over page table isn't prepared for it. We need
some time and effort to make all this work properly, so this patch
avoids the reported bug by just disabling error handling for 1GB
hugepage.
[n-horiguchi@ah.jp.nec.com: v2]
Link: http://lkml.kernel.org/r/1517284444-18149-1-git-send-email-n-horiguchi@ah.jp.nec.com
Link: http://lkml.kernel.org/r/1517207283-15769-1-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi
Acked-by: Michal Hocko
Reviewed-by: Andrew Morton
Reviewed-by: Mike Kravetz
Acked-by: Punit Agrawal
Tested-by: Michael Ellerman
Cc: Anshuman Khandual
Cc: "Aneesh Kumar K.V"
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Sudip Mukherjee
Signed-off-by: Greg Kroah-Hartman
-
commit 520495fe96d74e05db585fc748351e0504d8f40d upstream.
When booting with very large numbers of gigantic (i.e. 1G) pages, the
operations in the loop of gather_bootmem_prealloc, and specifically
prep_compound_gigantic_page, take a very long time, and can cause a
softlockup if enough pages are requested at boot.
For example booting with 3844 1G pages requires prepping
(set_compound_head, init the count) over 1 billion 4K tail pages, which
takes considerable time.
Add a cond_resched() to the outer loop in gather_bootmem_prealloc() to
prevent this lockup.
Tested: booted with softlockup_panic=1 hugepagesz=1G hugepages=3844 and
no softlockup is reported, and the hugepages are reported as
successfully set up.
Link: http://lkml.kernel.org/r/20180627214447.260804-1-cannonmatthews@google.com
Signed-off-by: Cannon Matthews
Reviewed-by: Andrew Morton
Reviewed-by: Mike Kravetz
Acked-by: Michal Hocko
Cc: Andres Lagar-Cavilla
Cc: Peter Feiner
Cc: Greg Thelen
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
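The fix is a single rescheduling point in the boot-time loop (sketch; surrounding code abbreviated):

    static void __init gather_bootmem_prealloc(void)
    {
            struct huge_bootmem_page *m;

            list_for_each_entry(m, &huge_boot_pages, list) {
                    /* ... prep_compound_gigantic_page() etc. ... */

                    /* With thousands of 1G pages this loop touches over
                     * a billion tail pages; yield regularly so the
                     * softlockup watchdog does not fire. */
                    cond_resched();
            }
    }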
03 Jul, 2018
3 commits
-
commit d50d82faa0c964e31f7a946ba8aba7c715ca7ab0 upstream.
In kernel 4.17 I removed some code from dm-bufio that did slab cache
merging (commit 21bb13276768: "dm bufio: remove code that merges slab
caches") - both slab and slub support merging caches with identical
attributes, so dm-bufio now just calls kmem_cache_create and relies on
implicit merging.
This uncovered a bug in the slub subsystem - if we delete a cache and
immediately create another cache with the same attributes, it fails
because of duplicate filename in /sys/kernel/slab/. The slub subsystem
offloads freeing the cache to a workqueue - and if we create the new
cache before the workqueue runs, it complains because of duplicate
filename in sysfs.
This patch fixes the bug by moving the call of kobject_del from
sysfs_slab_remove_workfn to shutdown_cache. kobject_del must be called
while we hold slab_mutex - so that the sysfs entry is deleted before a
cache with the same attributes could be created.
Running device-mapper-test-suite with:
dmtest run --suite thin-provisioning -n /commit_failure_causes_fallback/
triggered:
Buffer I/O error on dev dm-0, logical block 1572848, async page read
device-mapper: thin: 253:1: metadata operation 'dm_pool_alloc_data_block' failed: error = -5
device-mapper: thin: 253:1: aborting current metadata transaction
sysfs: cannot create duplicate filename '/kernel/slab/:a-0000144'
CPU: 2 PID: 1037 Comm: kworker/u48:1 Not tainted 4.17.0.snitm+ #25
Hardware name: Supermicro SYS-1029P-WTR/X11DDW-L, BIOS 2.0a 12/06/2017
Workqueue: dm-thin do_worker [dm_thin_pool]
Call Trace:
dump_stack+0x5a/0x73
sysfs_warn_dup+0x58/0x70
sysfs_create_dir_ns+0x77/0x80
kobject_add_internal+0xba/0x2e0
kobject_init_and_add+0x70/0xb0
sysfs_slab_add+0xb1/0x250
__kmem_cache_create+0x116/0x150
create_cache+0xd9/0x1f0
kmem_cache_create_usercopy+0x1c1/0x250
kmem_cache_create+0x18/0x20
dm_bufio_client_create+0x1ae/0x410 [dm_bufio]
dm_block_manager_create+0x5e/0x90 [dm_persistent_data]
__create_persistent_data_objects+0x38/0x940 [dm_thin_pool]
dm_pool_abort_metadata+0x64/0x90 [dm_thin_pool]
metadata_operation_failed+0x59/0x100 [dm_thin_pool]
alloc_data_block.isra.53+0x86/0x180 [dm_thin_pool]
process_cell+0x2a3/0x550 [dm_thin_pool]
do_worker+0x28d/0x8f0 [dm_thin_pool]
process_one_work+0x171/0x370
worker_thread+0x49/0x3f0
kthread+0xf8/0x130
ret_from_fork+0x35/0x40
kobject_add_internal failed for :a-0000144 with -EEXIST, don't try to register things with the same name in the same directory.
kmem_cache_create(dm_bufio_buffer-16) failed with error -17
Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1806151817130.6333@file01.intranet.prod.int.rdu2.redhat.com
Signed-off-by: Mikulas Patocka
Reported-by: Mike Snitzer
Tested-by: Mike Snitzer
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
-
commit 1105a2fc022f3c7482e32faf516e8bc44095f778 upstream.
In our armv8a server (QDF2400), I noticed lots of WARN_ONs caused by a
PAGE_SIZE-unaligned rmap_item->address under memory pressure
tests (start 20 guests and run memhog in the host).
WARNING: CPU: 4 PID: 4641 at virt/kvm/arm/mmu.c:1826 kvm_age_hva_handler+0xc0/0xc8
CPU: 4 PID: 4641 Comm: memhog Tainted: G W 4.17.0-rc3+ #8
Call trace:
kvm_age_hva_handler+0xc0/0xc8
handle_hva_to_gpa+0xa8/0xe0
kvm_age_hva+0x4c/0xe8
kvm_mmu_notifier_clear_flush_young+0x54/0x98
__mmu_notifier_clear_flush_young+0x6c/0xa0
page_referenced_one+0x154/0x1d8
rmap_walk_ksm+0x12c/0x1d0
rmap_walk+0x94/0xa0
page_referenced+0x194/0x1b0
shrink_page_list+0x674/0xc28
shrink_inactive_list+0x26c/0x5b8
shrink_node_memcg+0x35c/0x620
shrink_node+0x100/0x430
do_try_to_free_pages+0xe0/0x3a8
try_to_free_pages+0xe4/0x230
__alloc_pages_nodemask+0x564/0xdc0
alloc_pages_vma+0x90/0x228
do_anonymous_page+0xc8/0x4d0
__handle_mm_fault+0x4a0/0x508
handle_mm_fault+0xf8/0x1b0
do_page_fault+0x218/0x4b8
do_translation_fault+0x90/0xa0
do_mem_abort+0x68/0xf0
el0_da+0x24/0x28
In rmap_walk_ksm, the rmap_item->address might still have the
STABLE_FLAG, then the start and end in handle_hva_to_gpa might not be
PAGE_SIZE aligned. Thus it will cause exceptions in handle_hva_to_gpa
on arm64.
This patch fixes it by ignoring (not removing) the low bits of address
when doing rmap_walk_ksm.
IMO, it should be backported to the stable tree. The storm of WARN_ONs is
very easy for me to reproduce. More than that, I watched a panic (not
reproducible) as follows:
page:ffff7fe003742d80 count:-4871 mapcount:-2126053375 mapping: (null) index:0x0
flags: 0x1fffc00000000000()
raw: 1fffc00000000000 0000000000000000 0000000000000000 ffffecf981470000
raw: dead000000000100 dead000000000200 ffff8017c001c000 0000000000000000
page dumped because: nonzero _refcount
CPU: 29 PID: 18323 Comm: qemu-kvm Tainted: G W 4.14.15-5.hxt.aarch64 #1
Hardware name:
Call trace:
dump_backtrace+0x0/0x22c
show_stack+0x24/0x2c
dump_stack+0x8c/0xb0
bad_page+0xf4/0x154
free_pages_check_bad+0x90/0x9c
free_pcppages_bulk+0x464/0x518
free_hot_cold_page+0x22c/0x300
__put_page+0x54/0x60
unmap_stage2_range+0x170/0x2b4
kvm_unmap_hva_handler+0x30/0x40
handle_hva_to_gpa+0xb0/0xec
kvm_unmap_hva_range+0x5c/0xd0
I even injected a fault on purpose in kvm_unmap_hva_range by setting
size=size-0x200, the call trace is similar as above. So I thought the
panic is similarly caused by the root cause of WARN_ON.
Andrea said:
: It looks a straightforward safe fix, on x86 hva_to_gfn_memslot would
: zap those bits and hide the misalignment caused by the low metadata
: bits being erroneously left set in the address, but the arm code
: notices when that's the last page in the memslot and the hva_end is
: getting aligned and the size is below one page.
:
: I think the problem triggers in the addr += PAGE_SIZE of
: unmap_stage2_ptes that never matches end because end is aligned but
: addr is not.
:
: } while (pte++, addr += PAGE_SIZE, addr != end);
:
: x86 again only works on hva_start/hva_end after converting it to
: gfn_start/end and that being in pfn units the bits are zapped before
: they risk to cause trouble.
Jia He said:
: I've tested by myself in arm64 server (QDF2400,46 cpus,96G mem) Without
: this patch, the WARN_ON is very easy for reproducing. After this patch, I
: have run the same benchmark for a whole day without any WARN_ONs
Link: http://lkml.kernel.org/r/1525403506-6750-1-git-send-email-hejianet@gmail.com
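The fix, as described (a sketch of the rmap_walk_ksm() change; the mask covers the stable/unstable/seqnr flag bits and its exact definition follows the upstream patch):

    anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root, 0, ULONG_MAX) {
            unsigned long addr;

            cond_resched();
            vma = vmac->vma;

            /* Ignore (do not strip) the flag bits kept in the low bits
             * of rmap_item->address, so start/end stay page aligned. */
            addr = rmap_item->address & ~KSM_FLAG_MASK;

            if (addr < vma->vm_start || addr >= vma->vm_end)
                    continue;
            /* ... */
    }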
Signed-off-by: Jia He
Reviewed-by: Andrea Arcangeli
Tested-by: Jia He
Cc: Suzuki K Poulose
Cc: Minchan Kim
Cc: Claudio Imbrenda
Cc: Arvind Yadav
Cc: Mike Rapoport
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
-
commit a9b6de77b1a3ff729f7bfc54b2e17711776a416c upstream.
get_user_pages_fast() for device pages is missing the typical validation
that all page references have been taken while the mapping was valid.
Without this validation truncate operations can not reliably coordinate
against new page reference events like O_DIRECT.
Cc:
Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
Reported-by: Jan Kara
Reviewed-by: Jan Kara
Signed-off-by: Dan Williams
Signed-off-by: Greg Kroah-Hartman
26 Jun, 2018
2 commits
-
commit 7810e6781e0fcbca78b91cf65053f895bf59e85f upstream.
In __alloc_pages_slowpath() we reset zonelist and preferred_zoneref for
allocations that can ignore memory policies. The zonelist is obtained
from current CPU's node. This is a problem for __GFP_THISNODE
allocations that want to allocate on a different node, e.g. because the
allocating thread has been migrated to a different CPU.
This has been observed to break SLAB in our 4.4-based kernel, because
there it relies on __GFP_THISNODE working as intended. If a slab page
is put on wrong node's list, then further list manipulations may corrupt
the list because page_to_nid() is used to determine which node's
list_lock should be locked and thus we may take a wrong lock and race.
Current SLAB implementation seems to be immune by luck thanks to commit
511e3a058812 ("mm/slab: make cache_grow() handle the page allocated on
arbitrary node") but there may be others assuming that __GFP_THISNODE
works as promised.
We can fix it by simply removing the zonelist reset completely. There
is actually no reason to reset it, because memory policies and cpusets
don't affect the zonelist choice in the first place. This was different
when commit 183f6371aac2 ("mm: ignore mempolicies when using
ALLOC_NO_WATERMARK") introduced the code, as mempolicies provided their
own restricted zonelists.
We might consider this for 4.17 although I don't know if there's
anything currently broken.
SLAB is currently not affected, but in kernels older than 4.7 that don't
yet have 511e3a058812 ("mm/slab: make cache_grow() handle the page
allocated on arbitrary node") it is. That's at least 4.4 LTS. Older
ones I'll have to check.
So stable backports should be more important, but will have to be
reviewed carefully, as the code went through many changes. BTW I think
that also the ac->preferred_zoneref reset is currently useless if we
don't also reset ac->nodemask from a mempolicy to NULL first (which we
probably should for the OOM victims etc?), but I would leave that for a
separate patch.
Link: http://lkml.kernel.org/r/20180525130853.13915-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka
Fixes: 183f6371aac2 ("mm: ignore mempolicies when using ALLOC_NO_WATERMARK")
Acked-by: Mel Gorman
Cc: Michal Hocko
Cc: David Rientjes
Cc: Joonsoo Kim
Cc: Vlastimil Babka
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
-
commit f183464684190bacbfb14623bd3e4e51b7575b4c upstream.
cgwb_release() punts the actual release to cgwb_release_workfn() on
system_wq. Depending on the number of cgroups or block devices, there
can be a lot of cgwb_release_workfn() in flight at the same time.We're periodically seeing close to 256 kworkers getting stuck with the
following stack trace and over time the entire system gets stuck.
[] _synchronize_rcu_expedited.constprop.72+0x2fc/0x330
[] synchronize_rcu_expedited+0x24/0x30
[] bdi_unregister+0x53/0x290
[] release_bdi+0x89/0xc0
[] wb_exit+0x85/0xa0
[] cgwb_release_workfn+0x54/0xb0
[] process_one_work+0x150/0x410
[] worker_thread+0x6d/0x520
[] kthread+0x12c/0x160
[] ret_from_fork+0x29/0x40
[] 0xffffffffffffffff
The events leading to the lockup are...
1. A lot of cgwb_release_workfn() is queued at the same time and all
system_wq kworkers are assigned to execute them.
2. They all end up calling synchronize_rcu_expedited(). One of them
wins and tries to perform the expedited synchronization.
3. However, that involves queueing rcu_exp_work to system_wq and
waiting for it. Because #1 is holding all available kworkers on
system_wq, rcu_exp_work can't be executed. cgwb_release_workfn()
is waiting for synchronize_rcu_expedited() which in turn is waiting
for cgwb_release_workfn() to free up some of the kworkers.
We shouldn't be scheduling hundreds of cgwb_release_workfn() at the
same time. There's nothing to be gained from that. This patch
updates cgwb release path to use a dedicated percpu workqueue with
@max_active of 1.
While this resolves the problem at hand, it might be a good idea to
isolate rcu_exp_work to its own workqueue too as it can be used from
various paths and is prone to this sort of indirect A-A deadlock.
Signed-off-by: Tejun Heo
Cc: "Paul E. McKenney"
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe
Signed-off-by: Greg Kroah-Hartman
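The gist of the change (sketch; the workqueue name and @max_active follow the description above, not necessarily the verbatim hunk):

    /* A dedicated workqueue, one work item in flight at a time, so cgwb
     * releases can no longer occupy every system_wq worker and deadlock
     * against the expedited-RCU work they wait on. */
    static struct workqueue_struct *cgwb_release_wq;

    static void cgwb_release(struct percpu_ref *refcnt)
    {
            struct bdi_writeback *wb =
                    container_of(refcnt, struct bdi_writeback, refcnt);

            /* was: schedule_work(&wb->release_work); */
            queue_work(cgwb_release_wq, &wb->release_work);
    }

    static int __init cgwb_init(void)
    {
            cgwb_release_wq = alloc_workqueue("cgwb_release", 0, 1);
            if (!cgwb_release_wq)
                    return -ENOMEM;
            return 0;
    }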
21 Jun, 2018
1 commit
-
[ Upstream commit c892fd82cc0632d425ae011a4dd75eb59e9f84ee ]
If there is heavy memory pressure, page allocation with __GFP_NOWAIT
fails easily although it's an order-0 request. I got the below warning 9
times during a normal boot.
: page allocation failure: order:0, mode:0x2200000(GFP_NOWAIT|__GFP_NOTRACK)
.. snip ..
Call trace:
dump_backtrace+0x0/0x4
dump_stack+0xa4/0xc0
warn_alloc+0xd4/0x15c
__alloc_pages_nodemask+0xf88/0x10fc
alloc_slab_page+0x40/0x18c
new_slab+0x2b8/0x2e0
___slab_alloc+0x25c/0x464
__kmalloc+0x394/0x498
memcg_kmem_get_cache+0x114/0x2b8
kmem_cache_alloc+0x98/0x3e8
mmap_region+0x3bc/0x8c0
do_mmap+0x40c/0x43c
vm_mmap_pgoff+0x15c/0x1e4
sys_mmap+0xb0/0xc8
el0_svc_naked+0x24/0x28
Mem-Info:
active_anon:17124 inactive_anon:193 isolated_anon:0
active_file:7898 inactive_file:712955 isolated_file:55
unevictable:0 dirty:27 writeback:18 unstable:0
slab_reclaimable:12250 slab_unreclaimable:23334
mapped:19310 shmem:212 pagetables:816 bounce:0
free:36561 free_pcp:1205 free_cma:35615
Node 0 active_anon:68496kB inactive_anon:772kB active_file:31592kB inactive_file:2851820kB unevictable:0kB isolated(anon):0kB isolated(file):220kB mapped:77240kB dirty:108kB writeback:72kB shmem:848kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
DMA free:142188kB min:3056kB low:3820kB high:4584kB active_anon:10052kB inactive_anon:12kB active_file:312kB inactive_file:1412620kB unevictable:0kB writepending:0kB present:1781412kB managed:1604728kB mlocked:0kB slab_reclaimable:3592kB slab_unreclaimable:876kB kernel_stack:400kB pagetables:52kB bounce:0kB free_pcp:1436kB local_pcp:124kB free_cma:142492kB
lowmem_reserve[]: 0 1842 1842
Normal free:4056kB min:4172kB low:5212kB high:6252kB active_anon:58376kB inactive_anon:760kB active_file:31348kB inactive_file:1439040kB unevictable:0kB writepending:180kB present:2000636kB managed:1923688kB mlocked:0kB slab_reclaimable:45408kB slab_unreclaimable:92460kB kernel_stack:9680kB pagetables:3212kB bounce:0kB free_pcp:3392kB local_pcp:688kB free_cma:0kB
lowmem_reserve[]: 0 0 0
DMA: 0*4kB 0*8kB 1*16kB (C) 0*32kB 0*64kB 0*128kB 1*256kB (C) 1*512kB (C) 0*1024kB 1*2048kB (C) 34*4096kB (C) = 142096kB
Normal: 228*4kB (UMEH) 172*8kB (UMH) 23*16kB (UH) 24*32kB (H) 5*64kB (H) 1*128kB (H) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3872kB
721350 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 0kB
Total swap = 0kB
945512 pages RAM
0 pages HighMem/MovableOnly
63408 pages reserved
51200 pages cma reserved
__memcg_schedule_kmem_cache_create() tries to create a shadow slab cache
and the worker allocation failure is not really critical because we will
retry on the next kmem charge. We might miss some charges but that
shouldn't be critical. The excessive allocation failure report is not
very helpful.
[mhocko@kernel.org: changelog update]
Link: http://lkml.kernel.org/r/20180418022912.248417-1-minchan@kernel.org
Signed-off-by: Minchan Kim
Acked-by: Johannes Weiner
Reviewed-by: Andrew Morton
Cc: Michal Hocko
Cc: Vladimir Davydov
Cc: Minchan Kim
Cc: Matthew Wilcox
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman
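The fix amounts to silencing the warning on this best-effort allocation (sketch of __memcg_schedule_kmem_cache_create(); treat the surrounding details as approximate):

    static void __memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
                                                   struct kmem_cache *cachep)
    {
            struct memcg_kmem_cache_create_work *cw;

            /* Best effort: a failure here just delays the per-memcg
             * cache until the next kmem charge, so don't warn. */
            cw = kmalloc(sizeof(*cw), GFP_NOWAIT | __GFP_NOWARN);
            if (!cw)
                    return;
            /* ... fill in cw and queue the work ... */
    }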
12 Jun, 2018
2 commits
-
commit 423913ad4ae5b3e8fb8983f70969fb522261ba26 upstream.
Commit be83bbf80682 ("mmap: introduce sane default mmap limits") was
introduced to catch problems in various ad-hoc character device drivers
doing mmap and getting the size limits wrong. In the process, it used
"known good" limits for the normal cases of mapping regular files and
block device drivers.
It turns out that the "s_maxbytes" limit was less "known good" than I
thought. In particular, /proc doesn't set it, but exposes one regular
file to mmap: /proc/vmcore. As a result, that file got limited to the
default MAX_INT s_maxbytes value.
This went unnoticed for a while, because apparently the only thing that
needs it is the s390 kernel zfcpdump, but there might be other tools
that use this too.
Vasily suggested just changing s_maxbytes for all of /proc, which isn't
wrong, but makes me nervous at this stage. So instead, just make the
new mmap limit always be MAX_LFS_FILESIZE for regular files, which won't
affect anything else. It wasn't the regular file case I was worried
about.
I'd really prefer for maxsize to have been per-inode, but that is not
how things are today.
Fixes: be83bbf80682 ("mmap: introduce sane default mmap limits")
Reported-by: Vasily Gorbik
Cc: Al Viro
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
-
commit be83bbf806822b1b89e0a0f23cd87cddc409e429 upstream.
The internal VM "mmap()" interfaces are based on the mmap target doing
everything using page indexes rather than byte offsets, because
traditionally (ie 32-bit) we had the situation that the byte offset
didn't fit in a register. So while the mmap virtual address was limited
by the word size of the architecture, the backing store was not.
So we're basically passing "pgoff" around as a page index, in order to
be able to describe backing store locations that are much bigger than
the word size (think files larger than 4GB etc).
But while this all makes a ton of sense conceptually, we've been dogged
by various drivers that don't really understand this, and internally
work with byte offsets, and then try to work with the page index by
turning it into a byte offset with "pgoff << PAGE_SHIFT".
Which obviously can overflow.
Adding the size of the mapping to it to get the byte offset of the end
of the backing store just exacerbates the problem, and if you then use
this overflow-prone value to check various limits of your device driver
mmap capability, you're just setting yourself up for problems.
The correct thing for drivers to do is to do their limit math in page
indices, the way the interface is designed. Because the generic mmap
code _does_ test that the index doesn't overflow, since that's what the
mmap code really cares about.
HOWEVER.
Finding and fixing various random drivers is a sisyphean task, so let's
just see if we can just make the core mmap() code do the limiting for
us. Realistically, the only "big" backing stores we need to care about
are regular files and block devices, both of which are known to do this
properly, and which have nice well-defined limits for how much data they
can access.
So let's special-case just those two known cases, and then limit other
random mmap users to a backing store that still fits in "unsigned long".
Realistically, that's not much of a limit at all on 64-bit, and on
32-bit architectures the only worry might be the GPU drivers, which can
have big physical address spaces.
To make it possible for drivers like that to say that they are 64-bit
clean, this patch does repurpose the "FMODE_UNSIGNED_OFFSET" bit in the
file flags to allow drivers to mark their file descriptors as safe in
the full 64-bit mmap address space.
[ The timing for doing this is less than optimal, and this should really
go in a merge window. But realistically, this needs wide testing more
than it needs anything else, and being main-line is the only way to do
that.
So the earlier the better, even if it's outside the proper development
cycle - Linus ]
Cc: Kees Cook
Cc: Dan Carpenter
Cc: Al Viro
Cc: Willy Tarreau
Cc: Dave Airlie
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
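Together with the /proc/vmcore follow-up above, the core check ends up shaped roughly like this (a condensed sketch of the mm/mmap.c helpers, not the verbatim hunks):

    static inline bool file_mmap_ok(struct file *file, struct inode *inode,
                                    unsigned long pgoff, unsigned long len)
    {
            /* Regular files and block devices have well-defined limits;
             * other mmap targets get a conservative word-sized one. */
            u64 maxsize = (S_ISREG(inode->i_mode) || S_ISBLK(inode->i_mode))
                            ? MAX_LFS_FILESIZE : ULONG_MAX;

            if (len > maxsize)
                    return false;
            /* Compare in page units, the way mmap passes offsets around;
             * unlike "pgoff << PAGE_SHIFT" this cannot overflow. */
            return pgoff <= (maxsize - len) >> PAGE_SHIFT;
    }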
05 Jun, 2018
2 commits
-
commit 2d077d4b59924acd1f5180c6fb73b57f4771fde6 upstream.
Swapping load on huge=always tmpfs (with khugepaged tuned up to be very
eager, but I'm not sure that is relevant) soon hung uninterruptibly,
waiting for page lock in shmem_getpage_gfp()'s find_lock_entry(), most
often when "cp -a" was trying to write to a smallish file. Debug showed
that the page in question was not locked, and page->mapping NULL by now,
but page->index consistent with having been in a huge page before.
Reproduced in minutes on a 4.15 kernel, even with 4.17's 605ca5ede764
("mm/huge_memory.c: reorder operations in __split_huge_page_tail()") added
in; but took hours to reproduce on a 4.17 kernel (no idea why).
The culprit proved to be the __ClearPageDirty() on tails beyond i_size in
__split_huge_page(): the non-atomic bit operation may have been safe when
4.8's baa355fd3314 ("thp: file pages support for split_huge_page()")
introduced it, but liable to erase PageWaiters after 4.10's 62906027091f
("mm: add PageWaiters indicating tasks are waiting for a page bit").Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1805291841070.3197@eggly.anvils
Fixes: 62906027091f ("mm: add PageWaiters indicating tasks are waiting for a page bit")
Signed-off-by: Hugh Dickins
Acked-by: Kirill A. Shutemov
Cc: Konstantin Khlebnikov
Cc: Nicholas Piggin
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
-
commit 145e1a71e090575c74969e3daa8136d1e5b99fc8 upstream.
George Boole would have noticed a slight error in 4.16 commit
69d763fc6d3a ("mm: pin address_space before dereferencing it while
isolating an LRU page"). Fix it, to match both the comment above it,
and the original behaviour.
Although anonymous pages are not marked PageDirty at first, we have an
old habit of calling SetPageDirty when a page is removed from swap
cache: so there's a category of ex-swap pages that are easily
migratable, but were inadvertently excluded from compaction's async
migration in 4.16.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1805302014001.12558@eggly.anvils
Fixes: 69d763fc6d3a ("mm: pin address_space before dereferencing it while isolating an LRU page")
Signed-off-by: Hugh Dickins
Acked-by: Minchan Kim
Acked-by: Mel Gorman
Reported-by: Ivan Kalvachev
Cc: "Huang, Ying"
Cc: Jan Kara
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
30 May, 2018
7 commits
-
[ Upstream commit f0849ac0b8e072073ec5fcc7fadd05a77434364e ]
For PTE-mapped THP, the compound THP has not been split to normal 4K
pages yet; the whole THP is considered referenced if any one of its sub
pages is referenced.
When walking a PTE-mapped THP by pvmw, all relevant PTEs will be checked
to retrieve referenced bit. But, the current code just returns the
result of the last PTE. If the last PTE has not been referenced, the
referenced flag will be cleared.
Just set referenced when ptep{pmdp}_clear_young_notify() returns true.
Link: http://lkml.kernel.org/r/1518212451-87134-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi
Reported-by: Gang Deng
Suggested-by: Kirill A. Shutemov
Reviewed-by: Andrew Morton
Cc: "Kirill A. Shutemov"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman
-
[ Upstream commit e92bb4dd9673945179b1fc738c9817dd91bfb629 ]
When page_mapping() is called and the mapping is dereferenced in
page_evictable() through shrink_active_list(), it is possible for the
inode to be truncated and the embedded address space to be freed at the
same time. This may lead to the following race.
CPU1                                            CPU2
truncate(inode)                                 shrink_active_list()
  ...                                             page_evictable(page)
  truncate_inode_page(mapping, page);
    delete_from_page_cache(page)
      spin_lock_irqsave(&mapping->tree_lock, flags);
        __delete_from_page_cache(page, NULL)
          page_cache_tree_delete(..)
            ...                                   mapping = page_mapping(page);
            page->mapping = NULL;
            ...
      spin_unlock_irqrestore(&mapping->tree_lock, flags);
      page_cache_free_page(mapping, page)
        put_page(page)
          if (put_page_testzero(page)) -> false
- inode now has no pages and can be freed including embedded address_space
                                                  mapping_unevictable(mapping)
                                                    test_bit(AS_UNEVICTABLE, &mapping->flags);
- we've dereferenced mapping which is potentially already free.
A similar race exists between swap cache freeing and page_evictable()
too.
The address_space in inode and swap cache will be freed after an RCU
grace period. So the races are fixed via enclosing the page_mapping()
and address_space usage in rcu_read_lock/unlock(). Some comments are
added in code to make it clear what is protected by the RCU read lock.
Link: http://lkml.kernel.org/r/20180212081227.1940-1-ying.huang@intel.com
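The fix in its simplest form (sketch of the page_evictable() change; upstream also comments the other page_mapping() call sites):

    int page_evictable(struct page *page)
    {
            int ret;

            /* Prevent the address_space of inode and swap cache from
             * being freed while we look at it. */
            rcu_read_lock();
            ret = !mapping_unevictable(page_mapping(page)) &&
                  !PageMlocked(page);
            rcu_read_unlock();
            return ret;
    }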
Signed-off-by: "Huang, Ying"
Reviewed-by: Jan Kara
Reviewed-by: Andrew Morton
Cc: Mel Gorman
Cc: Minchan Kim
Cc: "Huang, Ying"
Cc: Johannes Weiner
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman
-
[ Upstream commit 77da2ba0648a4fd52e5ff97b8b2b8dd312aec4b0 ]
This patch fixes a corner case for KSM. When two pages belong or
belonged to the same transparent hugepage, and they should be merged,
KSM fails to split the page, and therefore no merging happens.
This bug can be reproduced by:
* making sure ksm is running (in case disabling ksmtuned)
* enabling transparent hugepages
* allocating a THP-aligned 1-THP-sized buffer
e.g. on amd64: posix_memalign(&p, 1<<21, 1<<21)
Co-authored-by: Gerald Schaefer
Reviewed-by: Andrew Morton
Cc: Andrea Arcangeli
Cc: Minchan Kim
Cc: Kirill A. Shutemov
Cc: Hugh Dickins
Cc: Christian Borntraeger
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman
-
[ Upstream commit 1ec6995d1290bfb87cc3a51f0836c889e857cef9 ]
In z3fold_create_pool(), the memory allocated by __alloc_percpu() is not
released on the error path where pool->compact_wq, which holds the
return value of create_singlethread_workqueue(), is NULL. This
results in a memory leak.
[akpm@linux-foundation.org: fix oops on kzalloc() failure, check __alloc_percpu() retval]
Link: http://lkml.kernel.org/r/1522803111-29209-1-git-send-email-wangxidong_97@163.com
Signed-off-by: Xidong Wang
Reviewed-by: Andrew Morton
Cc: Vitaly Wool
Cc: Mike Rapoport
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman
-
[ Upstream commit a06ad633a37c64a0cd4c229fc605cee8725d376e ]
Calling swapon() on a zero length swap file on SSD can lead to a
divide-by-zero.
Although creating such files isn't possible with mkswap and they would be
considered invalid, it would be better for the swapon code to be more
robust and handle this condition gracefully (return -EINVAL).
Especially since the fix is small and straightforward.
To help with wear leveling on SSD, the swapon syscall calculates a
random position in the swap file using modulo p->highest_bit, which is
set to maxpages - 1 in read_swap_header.
If the swap file is zero length, read_swap_header sets maxpages=1 and
last_page=0, resulting in p->highest_bit=0 and we divide-by-zero when we
modulo p->highest_bit in the swapon syscall.
This can be prevented by having read_swap_header return zero if
last_page is zero.
Link: http://lkml.kernel.org/r/5AC747C1020000A7001FA82C@prv-mh.provo.novell.com
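The guard, per the description (sketch of the read_swap_header() change; the warning text is illustrative):

    last_page = swap_header->info.last_page;
    if (!last_page) {
            pr_warn("Empty swap-file\n");
            return 0;       /* rejected: avoids highest_bit == 0 and the
                             * later divide-by-zero in the SSD path */
    }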
Signed-off-by: Thomas Abraham
Reported-by:
Reviewed-by: Andrew Morton
Cc: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman
-
[ Upstream commit 914b6dfff790544d9b77dfd1723adb3745ec9700 ]
A crash is observed when kmemleak_scan accesses the object->pointer,
likely due to the following race.
TASK A                  TASK B                  TASK C
kmemleak_write
 (with "scan" and
 NOT "scan=on")
kmemleak_scan()
                        create_object
                        kmem_cache_alloc fails
                        kmemleak_disable
                        kmemleak_do_cleanup
                        kmemleak_free_enabled = 0
                                                kfree
                                                kmemleak_free bails out
                                                 (kmemleak_free_enabled is 0)
                                                slub frees object->pointer
update_checksum
 crash - object->pointer
 freed (DEBUG_PAGEALLOC)
kmemleak_do_cleanup waits for the scan thread to complete, but not for
direct call to kmemleak_scan via kmemleak_write. So add a wait for
kmemleak_scan completion before disabling kmemleak_free, and while at it
fix the comment on stop_scan_thread.
[vinmenon@codeaurora.org: fix stop_scan_thread comment]
Link: http://lkml.kernel.org/r/1522219972-22809-1-git-send-email-vinmenon@codeaurora.org
Link: http://lkml.kernel.org/r/1522063429-18992-1-git-send-email-vinmenon@codeaurora.org
Signed-off-by: Vinayak Menon
Reviewed-by: Catalin Marinas
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman
-
[ Upstream commit c7f26ccfb2c31eb1bf810ba13d044fcf583232db ]
Attempting to hotplug CPUs with CONFIG_VM_EVENT_COUNTERS enabled can
cause vmstat_update() to report a BUG due to preemption not being
disabled around smp_processor_id().
Discovered on Ubiquiti EdgeRouter Pro with Cavium Octeon II processor.
BUG: using smp_processor_id() in preemptible [00000000] code:
kworker/1:1/269
caller is vmstat_update+0x50/0xa0
CPU: 0 PID: 269 Comm: kworker/1:1 Not tainted
4.16.0-rc4-Cavium-Octeon-00009-gf83bbd5-dirty #1
Workqueue: mm_percpu_wq vmstat_update
Call Trace:
show_stack+0x94/0x128
dump_stack+0xa4/0xe0
check_preemption_disabled+0x118/0x120
vmstat_update+0x50/0xa0
process_one_work+0x144/0x348
worker_thread+0x150/0x4b8
kthread+0x110/0x140
ret_from_kernel_thread+0x14/0x1c
Link: http://lkml.kernel.org/r/1520881552-25659-1-git-send-email-steven.hill@cavium.com
Signed-off-by: Steven J. Hill
Reviewed-by: Andrew Morton
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman