21 Jan, 2016
4 commits
-
By checking the effective credentials instead of the real UID / permitted
capabilities, ensure that the calling process actually intended to use its
credentials.

To ensure that all ptrace checks use the correct caller credentials (e.g.
in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
flag), use two new flags and require one of them to be set.

The problem was that when a privileged task had temporarily dropped its
privileges, e.g. by calling setreuid(0, user_uid), with the intent to
perform following syscalls with the credentials of a user, it still passed
ptrace access checks that the user would not be able to pass.

While an attacker should not be able to convince the privileged task to
perform a ptrace() syscall, this is a problem because the ptrace access
check is reused for things in procfs.

In particular, the following somewhat interesting procfs entries only rely
on ptrace access checks:

 /proc/$pid/stat - uses the check for determining whether pointers
     should be visible, useful for bypassing ASLR
 /proc/$pid/maps - also useful for bypassing ASLR
 /proc/$pid/cwd - useful for gaining access to restricted
     directories that contain files with lax permissions, e.g. in
     this scenario:
     lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
     drwx------ root root /root
     drwxr-xr-x root root /root/foobar
     -rw-r--r-- root root /root/foobar/secret

Therefore, on a system where a root-owned mode 6755 binary changes its
effective credentials as described and then dumps a user-specified file,
this could be used by an attacker to reveal the memory layout of root's
processes or reveal the contents of files he is not allowed to access
(through /proc/$pid/cwd).

[akpm@linux-foundation.org: fix warning]
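For illustration (not part of the patch), a minimal userspace sketch of the
pattern described above; the helper and its file argument are hypothetical:

    #include <stdio.h>
    #include <unistd.h>

    /* Root-owned mode 6755 helper: keeps real UID 0, switches only the
     * effective UID to the caller, then dumps a caller-chosen file such
     * as /proc/1/maps. Under the old real-UID/permitted-capability based
     * ptrace check this read succeeded, leaking root's memory layout. */
    int main(int argc, char *argv[])
    {
        char buf[4096];
        size_t n;
        FILE *f;

        if (argc < 2)
            return 1;
        if (setreuid(0, getuid()) < 0)
            return 1;
        f = fopen(argv[1], "r");
        if (!f)
            return 1;
        while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
            fwrite(buf, 1, n, stdout);
        fclose(f);
        return 0;
    }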
Signed-off-by: Jann Horn
Acked-by: Kees Cook
Cc: Casey Schaufler
Cc: Oleg Nesterov
Cc: Ingo Molnar
Cc: James Morris
Cc: "Serge E. Hallyn"
Cc: Andy Shevchenko
Cc: Andy Lutomirski
Cc: Al Viro
Cc: "Eric W. Biederman"
Cc: Willy Tarreau
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
record_obj() in migrate_zspage() does not preserve the handle's
HANDLE_PIN_BIT, set by find_alloced_obj()->trypin_tag(), and implicitly
(accidentally) un-pins the handle, while migrate_zspage() still performs
an explicit unpin_tag() on that handle. This additional explicit
unpin_tag() introduces a race condition with zs_free(), which can pin
that handle by this time, so the handle becomes un-pinned.

Schematically, it goes like this:
    CPU0                                      CPU1
    migrate_zspage
      find_alloced_obj
        trypin_tag
          set HANDLE_PIN_BIT                  zs_free()
                                                pin_tag()
    obj_malloc() -- new object, no tag
    record_obj() -- remove HANDLE_PIN_BIT       set HANDLE_PIN_BIT
    unpin_tag()  -- remove zs_free's HANDLE_PIN_BIT

The race condition may result in a NULL pointer dereference:
Unable to handle kernel NULL pointer dereference at virtual address 00000000
CPU: 0 PID: 19001 Comm: CookieMonsterCl Tainted:
PC is at get_zspage_mapping+0x0/0x24
LR is at obj_free.isra.22+0x64/0x128
Call trace:
get_zspage_mapping+0x0/0x24
zs_free+0x88/0x114
zram_free_page+0x64/0xcc
zram_slot_free_notify+0x90/0x108
swap_entry_free+0x278/0x294
free_swap_and_cache+0x38/0x11c
unmap_single_vma+0x480/0x5c8
unmap_vmas+0x44/0x60
exit_mmap+0x50/0x110
mmput+0x58/0xe0
do_exit+0x320/0x8dc
do_group_exit+0x44/0xa8
get_signal+0x538/0x580
do_signal+0x98/0x4b8
do_notify_resume+0x14/0x5c

This patch keeps the lock bit in the migration path and updates the
value atomically.
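For illustration (not the literal patch), a sketch of the idea; the helper
name is hypothetical and the handle layout (pin bit at HANDLE_PIN_BIT of the
value stored at the handle) is assumed:

    /* Publish the new object value with the pin bit already set, in a
     * single store, so the handle never transiently appears unpinned and
     * zs_free() cannot pin it before our final unpin_tag(). */
    static void record_obj_pinned(unsigned long handle, unsigned long obj)
    {
        WRITE_ONCE(*(unsigned long *)handle, obj | (1UL << HANDLE_PIN_BIT));
    }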
Signed-off-by: Junil Lee
Signed-off-by: Minchan Kim
Acked-by: Vlastimil Babka
Cc: Sergey Senozhatsky
Cc: [4.1+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
split_queue_lock can be taken from interrupt context in some cases, but
I forgot to convert locking in split_huge_page() to interrupt-safe
primitives.

Let's fix this.
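For illustration (not the literal patch), a sketch of the conversion,
assuming the global split_queue_lock in mm/huge_memory.c at the time:

    unsigned long flags;

    /* free_transhuge_page() can take this lock from softirq context, so
     * process-context users must disable interrupts while holding it. */
    spin_lock_irqsave(&split_queue_lock, flags);
    /* ... manipulate the deferred split queue ... */
    spin_unlock_irqrestore(&split_queue_lock, flags);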
lockdep output:
======================================================
[ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
4.4.0+ #259 Tainted: G W
------------------------------------------------------
syz-executor/18183 [HC0[0]:SC0[2]:HE0:SE0] is trying to acquire:
(split_queue_lock){+.+...}, at: free_transhuge_page+0x24/0x90 mm/huge_memory.c:3436

and this task is already holding:
(slock-AF_INET){+.-...}, at: spin_lock_bh include/linux/spinlock.h:307
(slock-AF_INET){+.-...}, at: lock_sock_fast+0x45/0x120 net/core/sock.c:2462
which would create a new lock dependency:
(slock-AF_INET){+.-...} -> (split_queue_lock){+.+...}

but this new dependency connects a SOFTIRQ-irq-safe lock:
(slock-AF_INET){+.-...}
... which became SOFTIRQ-irq-safe at:
mark_irqflags kernel/locking/lockdep.c:2799
__lock_acquire+0xfd8/0x4700 kernel/locking/lockdep.c:3162
lock_acquire+0x1dc/0x430 kernel/locking/lockdep.c:3585
__raw_spin_lock include/linux/spinlock_api_smp.h:144
_raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
spin_lock include/linux/spinlock.h:302
udp_queue_rcv_skb+0x781/0x1550 net/ipv4/udp.c:1680
flush_stack+0x50/0x330 net/ipv6/udp.c:799
__udp4_lib_mcast_deliver+0x694/0x7f0 net/ipv4/udp.c:1798
__udp4_lib_rcv+0x17dc/0x23e0 net/ipv4/udp.c:1888
udp_rcv+0x21/0x30 net/ipv4/udp.c:2108
ip_local_deliver_finish+0x2b3/0xa50 net/ipv4/ip_input.c:216
NF_HOOK_THRESH include/linux/netfilter.h:226
NF_HOOK include/linux/netfilter.h:249
ip_local_deliver+0x1c4/0x2f0 net/ipv4/ip_input.c:257
dst_input include/net/dst.h:498
ip_rcv_finish+0x5ec/0x1730 net/ipv4/ip_input.c:365
NF_HOOK_THRESH include/linux/netfilter.h:226
NF_HOOK include/linux/netfilter.h:249
ip_rcv+0x963/0x1080 net/ipv4/ip_input.c:455
__netif_receive_skb_core+0x1620/0x2f80 net/core/dev.c:4154
__netif_receive_skb+0x2a/0x160 net/core/dev.c:4189
netif_receive_skb_internal+0x1b5/0x390 net/core/dev.c:4217
napi_skb_finish net/core/dev.c:4542
napi_gro_receive+0x2bd/0x3c0 net/core/dev.c:4572
e1000_clean_rx_irq+0x4e2/0x1100 drivers/net/ethernet/intel/e1000e/netdev.c:1038
e1000_clean+0xa08/0x24a0 drivers/net/ethernet/intel/e1000/e1000_main.c:3819
napi_poll net/core/dev.c:5074
net_rx_action+0x7eb/0xdf0 net/core/dev.c:5139
__do_softirq+0x26a/0x920 kernel/softirq.c:273
invoke_softirq kernel/softirq.c:350
irq_exit+0x18f/0x1d0 kernel/softirq.c:391
exiting_irq ./arch/x86/include/asm/apic.h:659
do_IRQ+0x86/0x1a0 arch/x86/kernel/irq.c:252
ret_from_intr+0x0/0x20 arch/x86/entry/entry_64.S:520
arch_safe_halt ./arch/x86/include/asm/paravirt.h:117
default_idle+0x52/0x2e0 arch/x86/kernel/process.c:304
arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:295
default_idle_call+0x48/0xa0 kernel/sched/idle.c:92
cpuidle_idle_call kernel/sched/idle.c:156
cpu_idle_loop kernel/sched/idle.c:252
cpu_startup_entry+0x554/0x710 kernel/sched/idle.c:300
rest_init+0x192/0x1a0 init/main.c:412
start_kernel+0x678/0x69e init/main.c:683
x86_64_start_reservations+0x2a/0x2c arch/x86/kernel/head64.c:195
x86_64_start_kernel+0x158/0x167 arch/x86/kernel/head64.c:184

to a SOFTIRQ-irq-unsafe lock:
(split_queue_lock){+.+...}
which became SOFTIRQ-irq-unsafe at:
mark_irqflags kernel/locking/lockdep.c:2817
__lock_acquire+0x146e/0x4700 kernel/locking/lockdep.c:3162
lock_acquire+0x1dc/0x430 kernel/locking/lockdep.c:3585
__raw_spin_lock include/linux/spinlock_api_smp.h:144
_raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
spin_lock include/linux/spinlock.h:302
split_huge_page_to_list+0xcc0/0x1c50 mm/huge_memory.c:3399
split_huge_page include/linux/huge_mm.h:99
queue_pages_pte_range+0xa38/0xef0 mm/mempolicy.c:507
walk_pmd_range mm/pagewalk.c:50
walk_pud_range mm/pagewalk.c:90
walk_pgd_range mm/pagewalk.c:116
__walk_page_range+0x653/0xcd0 mm/pagewalk.c:204
walk_page_range+0xfe/0x2b0 mm/pagewalk.c:281
queue_pages_range+0xfb/0x130 mm/mempolicy.c:687
migrate_to_node mm/mempolicy.c:1004
do_migrate_pages+0x370/0x4e0 mm/mempolicy.c:1109
SYSC_migrate_pages mm/mempolicy.c:1453
SyS_migrate_pages+0x640/0x730 mm/mempolicy.c:1374
entry_SYSCALL_64_fastpath+0x16/0x7a arch/x86/entry/entry_64.S:185

other info that might help us debug this:
Possible interrupt unsafe locking scenario:
        CPU0                    CPU1
        ----                    ----
   lock(split_queue_lock);
                                local_irq_disable();
                                lock(slock-AF_INET);
                                lock(split_queue_lock);
   lock(slock-AF_INET);

Signed-off-by: Kirill A. Shutemov
Reported-by: Dmitry Vyukov
Acked-by: David Rientjes
Reviewed-by: Aneesh Kumar K.V
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
A newly added tracepoint in the hugepage code uses a variable in the
error handling that is not initialized at that point:

  include/trace/events/huge_memory.h:81:230: error: 'isolated' may be used uninitialized in this function [-Werror=maybe-uninitialized]

The result is relatively harmless, as the trace data will in rare
cases contain incorrect data.

This works around the problem by adding an explicit initialization.
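The workaround amounts to something like the following (the variable name is
taken from the warning; the surrounding function is assumed):

    int isolated = 0;    /* explicit init keeps the trace data defined on
                          * error paths and silences -Wmaybe-uninitialized */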
Signed-off-by: Arnd Bergmann
Fixes: 7d2eba0557c1 ("mm: add tracepoint for scanning pages")
Reviewed-by: Ebru Akagunduz
Acked-by: David Rientjes
Cc: Kirill A. Shutemov
Signed-off-by: Linus Torvalds
19 Jan, 2016
1 commit
-
Pull virtio barrier rework+fixes from Michael Tsirkin:
 "This adds a new kind of barrier, and reworks virtio and xen to use it.

  Plus some fixes here and there"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (44 commits)
checkpatch: add virt barriers
checkpatch: check for __smp outside barrier.h
checkpatch.pl: add missing memory barriers
virtio: make find_vqs() checkpatch.pl-friendly
virtio_balloon: fix race between migration and ballooning
virtio_balloon: fix race by fill and leak
s390: more efficient smp barriers
s390: use generic memory barriers
xen/events: use virt_xxx barriers
xen/io: use virt_xxx barriers
xenbus: use virt_xxx barriers
virtio_ring: use virt_store_mb
sh: move xchg_cmpxchg to a header by itself
sh: support 1 and 2 byte xchg
virtio_ring: update weak barriers to use virt_xxx
Revert "virtio_ring: Update weak barriers to use dma_wmb/rmb"
asm-generic: implement virt_xxx memory barriers
x86: define __smp_xxx
xtensa: define __smp_xxx
tile: define __smp_xxx
...
18 Jan, 2016
1 commit
-
Commit b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when
MADV_FREE syscall is called") introduced this new function, but got the
error handling for when pmd_trans_huge_lock() fails wrong. In the
failure case, the lock has not been taken, and we should not unlock on
the way out.
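For illustration (not the literal fix), the correct pattern, assuming the
pmd_trans_huge_lock() variant that returns the held spinlock or NULL:

    spinlock_t *ptl;

    ptl = pmd_trans_huge_lock(pmd, vma);
    if (!ptl)
        return 0;    /* lock was NOT taken: do not unlock on this path */

    /* ... operate on the huge pmd ... */

    spin_unlock(ptl);
    return 0;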
Cc: Minchan Kim
Cc: Andrew Morton
Signed-off-by: Linus Torvalds
16 Jan, 2016
34 commits
-
A spare array holding mem cgroup threshold events is kept around to make
sure we can always safely deregister an event and have an array to store
the new set of events in.

In the scenario where we're going from 1 to 0 registered events, the
pointer to the primary array containing 1 event is copied to the spare
slot, and then the spare slot is freed because no events are left.
However, it is freed before calling synchronize_rcu(), which means
readers may still be accessing threshold->primary after it is freed.

Fixed by only freeing after synchronize_rcu().
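For illustration (not the literal patch), the intended ordering with assumed
variable names:

    rcu_assign_pointer(thresholds->primary, new);  /* new may be NULL: 0 events */

    /* Wait for readers that may still be walking the old primary array. */
    synchronize_rcu();

    /* Only now is it safe to free the old storage. */
    if (!new) {
        kfree(thresholds->spare);
        thresholds->spare = NULL;
    }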
Signed-off-by: Martijn Coenen
Cc: Johannes Weiner
Acked-by: Michal Hocko
Cc: Vladimir Davydov
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Currently memory_failure() doesn't handle the non-anonymous thp case,
because we can hardly expect the error handling to be successful, and it
can just hit some corner case which results in BUG_ON or something
severe like that. This is also the case for the soft offline code, so
let's make it behave in the same way.

Original code has a MF_COUNT_INCREASED check before put_hwpoison_page(),
but it's unnecessary because get_any_page() is already called when
running this code, which takes a refcount of the target page
regardless of the flag. So this patch also removes it.

[akpm@linux-foundation.org: fix build]
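For illustration (function and message wording assumed), the soft offline
path now bails out on non-anonymous THP just like memory_failure() does:

    if (PageTransHuge(hpage) && !PageAnon(hpage)) {
        pr_info("soft offline: %#lx: non anonymous thp\n", pfn);
        put_hwpoison_page(page);    /* drop the ref get_any_page() took */
        return -EBUSY;
    }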
Signed-off-by: Naoya Horiguchi
Cc: Andi Kleen
Cc: "Kirill A. Shutemov"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
soft_offline_page() has some deeply indented code; that's a sign of
demand for cleanup. So let's do this. No functional change.

Signed-off-by: Naoya Horiguchi
Cc: Andi Kleen
Cc: "Kirill A. Shutemov"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Both s390 and powerpc have hit the issue of swapoff hanging, when
CONFIG_HAVE_ARCH_SOFT_DIRTY and CONFIG_MEM_SOFT_DIRTY ifdefs were not
quite as x86_64 had them. I think it would be much clearer if
HAVE_ARCH_SOFT_DIRTY was just a Kconfig option set by architectures to
determine whether the MEM_SOFT_DIRTY option should be offered, and the
actual code depend upon CONFIG_MEM_SOFT_DIRTY alone.

But I won't embark on that change myself: instead make swapoff more
robust, by using pte_swp_clear_soft_dirty() on each pte it encounters,
without an explicit #ifdef CONFIG_MEM_SOFT_DIRTY. That is a no-op,
whether the bit in question is defined as 0 or the asm-generic fallback
is used, unless soft dirty is fully turned on.

Why "maybe" in maybe_same_pte()? Rename it pte_same_as_swp().
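The resulting helper, roughly (a sketch; it relies on
pte_swp_clear_soft_dirty() being a no-op unless soft dirty is fully enabled):

    static inline int pte_same_as_swp(pte_t pte, pte_t swp_pte)
    {
        /* No #ifdef CONFIG_MEM_SOFT_DIRTY needed: the helper falls back
         * to a no-op when soft dirty is not in use. */
        return pte_same(pte_swp_clear_soft_dirty(pte), swp_pte);
    }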
Signed-off-by: Hugh Dickins
Reviewed-by: Aneesh Kumar K.V
Acked-by: Cyrill Gorcunov
Cc: Laurent Dufour
Cc: Michael Ellerman
Cc: Martin Schwidefsky
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Dmitry Vyukov has reported[1] possible deadlock (triggered by his
syzkaller fuzzer):

Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&hugetlbfs_i_mmap_rwsem_key);
                                lock(&mapping->i_mmap_rwsem);
                                lock(&hugetlbfs_i_mmap_rwsem_key);
   lock(&mapping->i_mmap_rwsem);

Both traces point to mm_take_all_locks() as a source of the problem.
It doesn't take care of the ordering of hugetlbfs_i_mmap_rwsem_key (aka
mapping->i_mmap_rwsem for hugetlb mappings) vs. i_mmap_rwsem.

huge_pmd_share() does memory allocation under hugetlbfs_i_mmap_rwsem_key
and the allocator can take i_mmap_rwsem if it hits reclaim. So we need to
take i_mmap_rwsem from all hugetlb VMAs before taking i_mmap_rwsem from
the rest of the VMAs.

The patch also documents the locking order for hugetlbfs_i_mmap_rwsem_key.
[1] http://lkml.kernel.org/r/CACT4Y+Zu95tBs-0EvdiAKzUOsb4tczRRfCRTpLr4bg_OP9HuVg@mail.gmail.com
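For illustration (helper names assumed), the resulting ordering in
mm_take_all_locks() amounts to two passes:

    /* Pass 1: i_mmap_rwsem of hugetlb mappings first, since
     * huge_pmd_share() can allocate (and hence reclaim can take a regular
     * mapping's i_mmap_rwsem) while holding the hugetlb one. */
    for (vma = mm->mmap; vma; vma = vma->vm_next) {
        if (vma->vm_file && vma->vm_file->f_mapping &&
            is_vm_hugetlb_page(vma))
            vm_lock_mapping(mm, vma->vm_file->f_mapping);
    }

    /* Pass 2: i_mmap_rwsem of all remaining file mappings. */
    for (vma = mm->mmap; vma; vma = vma->vm_next) {
        if (vma->vm_file && vma->vm_file->f_mapping &&
            !is_vm_hugetlb_page(vma))
            vm_lock_mapping(mm, vma->vm_file->f_mapping);
    }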
Signed-off-by: Kirill A. Shutemov
Reported-by: Dmitry Vyukov
Reviewed-by: Michal Hocko
Cc: Peter Zijlstra
Cc: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
MPOL_MF_LAZY is not visible from userspace since a720094ded8c ("mm:
mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now"), but
it should still skip non-migratable VMAs such as VM_IO, VM_PFNMAP, and
VM_HUGETLB VMAs, and avoid useless overhead of minor faults.
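For illustration (exact placement and flag set assumed), the check in the
queue_pages walk looks roughly like:

    if (flags & MPOL_MF_LAZY) {
        /* Only lay down NUMA-hinting protection on VMAs whose pages can
         * actually be migrated; skip VM_IO, VM_PFNMAP and hugetlb VMAs. */
        if (!is_vm_hugetlb_page(vma) &&
            !(vma->vm_flags & (VM_IO | VM_PFNMAP)))
            change_prot_numa(vma, start, endvma);
        return 1;
    }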
Signed-off-by: Liang Chen
Signed-off-by: Gavin Guo
Acked-by: Rik van Riel
Cc: Mel Gorman
Cc: Andi Kleen
Cc: Vlastimil Babka
Cc: David Rientjes
Cc: Naoya Horiguchi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Remove the unused struct zone *z variable which appeared in 86051ca5eaf5
("mm: fix usemap initialization").

Signed-off-by: Alexander Kuleshov
Acked-by: Kirill A. Shutemov
Acked-by: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Since can_do_mlock() only returns 1 or 0, make it a boolean.
No functional change.
[akpm@linux-foundation.org: update declaration in mm.h]
Signed-off-by: Wang Xiaoqiang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Just cleanup, no functional change.
Signed-off-by: Wang Xiaoqiang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
In earlier versions, mem_cgroup_css_from_page() could return non-root
css on a legacy hierarchy which can go away and required rcu locking;
however, the eventual version simply returns the root cgroup if memcg is
on a legacy hierarchy and thus doesn't need rcu locking around or in it.
Remove spurious rcu lockings.

Signed-off-by: Tejun Heo
Reported-by: Johannes Weiner
Cc: Michal Hocko
Cc: Vladimir Davydov
Cc: Jens Axboe
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Use "IS_ALIGNED" to check the alignment, rather than open-coding the check.
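For example (illustrative only):

    /* before: open-coded mask test */
    if ((vaddr & (PAGE_SIZE - 1)) != 0)
        return -EINVAL;

    /* after: the intent is explicit */
    if (!IS_ALIGNED(vaddr, PAGE_SIZE))
        return -EINVAL;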
Signed-off-by: Wang Xiaoqiang
Cc: Naoya Horiguchi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
During Jason's work with postcopy migration support for s390 a problem
regarding gmap faults was discovered.

The gmap code will call fixup_user_fault, which will always end up in
handle_mm_fault. Till now we never cared about retries, but since the
userfaultfd code kind of relies on them, this needs some fix.

This patchset does not take care of the futex code. I will now look
closer at this.

This patch (of 2):
With the introduction of userfaultfd, kvm on s390 needs fixup_user_fault
to pass in FAULT_FLAG_ALLOW_RETRY and give feedback if during the
faulting we ever unlocked mmap_sem.

This patch brings in the logic to handle retries as well as cleaning up
the current documentation. fixup_user_fault did not have the same
semantics as filemap_fault. It never indicated if a retry happened and
so a caller wasn't able to handle that case. So we now changed the
behaviour to always retry a locked mmap_sem.
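For illustration (assuming the signature this series introduces), the
caller-side pattern becomes:

    bool unlocked = false;
    int ret;

    down_read(&mm->mmap_sem);
    ret = fixup_user_fault(current, mm, address,
                           FAULT_FLAG_WRITE | FAULT_FLAG_ALLOW_RETRY,
                           &unlocked);
    /* If unlocked is true, mmap_sem was dropped and re-taken inside, so
     * any state derived from it before the call must be revalidated. */
    up_read(&mm->mmap_sem);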
Signed-off-by: Dominik Dingel
Reviewed-by: Andrea Arcangeli
Cc: "Kirill A. Shutemov"
Cc: Martin Schwidefsky
Cc: Christian Borntraeger
Cc: "Jason J. Herne"
Cc: David Rientjes
Cc: Eric B Munson
Cc: Naoya Horiguchi
Cc: Mel Gorman
Cc: Heiko Carstens
Cc: Dominik Dingel
Cc: Paolo Bonzini
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
A dax mapping establishes a pte with _PAGE_DEVMAP set when the driver
has established a devm_memremap_pages() mapping, i.e. when the pfn_t
return from ->direct_access() has PFN_DEV and PFN_MAP set. Later, when
encountering _PAGE_DEVMAP during a page table walk we lookup and pin a
struct dev_pagemap instance to keep the result of pfn_to_page() valid
until put_page().
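For illustration (a sketch of the walk-side logic, assuming the
get_dev_pagemap() interface added by this series):

    if (pte_devmap(pte)) {
        /* Pin the hosting dev_pagemap so pfn_to_page() stays valid. */
        pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
        if (unlikely(!pgmap))
            return -EFAULT;        /* the device mapping went away */
    }
    page = pte_page(pte);
    get_page(page);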
Signed-off-by: Dan Williams
Tested-by: Logan Gunthorpe
Cc: Dave Hansen
Cc: Mel Gorman
Cc: Peter Zijlstra
Cc: Andrea Arcangeli
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
A dax-huge-page mapping, while it uses some thp helpers, is ultimately
not a transparent huge page. The distinction is especially important in
the get_user_pages() path. pmd_devmap() is used to distinguish dax-pmds
from pmd_huge() and pmd_trans_huge(), which have slightly different
semantics.

Explicitly mark the pmd_trans_huge() helpers that dax needs by adding
pmd_devmap() checks.

[kirill.shutemov@linux.intel.com: fix regression in handling mlocked pages in __split_huge_pmd()]
Signed-off-by: Dan Williams
Cc: Dave Hansen
Cc: Mel Gorman
Cc: Peter Zijlstra
Cc: Andrea Arcangeli
Cc: Matthew Wilcox
Signed-off-by: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Similar to the conversion of vm_insert_mixed(), use pfn_t in
vmf_insert_pfn_pmd() to tag the resulting pte with _PAGE_DEVMAP when the
pfn is backed by a devm_memremap_pages() mapping.

Signed-off-by: Dan Williams
Cc: Dave Hansen
Cc: Matthew Wilcox
Cc: Alexander Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Convert the raw unsigned long 'pfn' argument to pfn_t for the purpose of
evaluating the PFN_MAP and PFN_DEV flags. When both are set it triggers
_PAGE_DEVMAP to be set in the resulting pte.

There are no functional changes to the gpu drivers as a result of this
conversion.

Signed-off-by: Dan Williams
Cc: Dave Hansen
Cc: David Airlie
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
In support of providing struct page for large persistent memory
capacities, use struct vmem_altmap to change the default policy for
allocating memory for the memmap array. The default vmemmap_populate()
allocates page table storage area from the page allocator. Given
persistent memory capacities relative to DRAM it may not be feasible to
store the memmap in 'System Memory'. Instead vmem_altmap represents
pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf()
requests.
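Roughly, the new descriptor carries the bookkeeping for that pre-allocated
area (the exact field set here is my reading, not quoted from the patch):

    struct vmem_altmap {
        const unsigned long base_pfn;   /* first pfn of the device region */
        const unsigned long reserve;    /* pages to skip at the start */
        unsigned long free;             /* pages still available for memmap */
        unsigned long align;            /* pages lost to alignment */
        unsigned long alloc;            /* pages handed out so far */
    };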
Signed-off-by: Dan Williams
Reported-by: kbuild test robot
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Dave Hansen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Prior to this change DAX PMD mappings that were made read-only were
never able to be made writable again. This is because the code in
insert_pfn_pmd() that calls pmd_mkdirty() and pmd_mkwrite() would skip
these calls if the PMD already existed in the page table.

Instead, if we are doing a write, always mark the PMD entry as dirty and
writeable. Without this code we can get into a condition where we mark
the PMD as read-only, and then on a subsequent write fault we get into
an infinite loop of PMD faults where we try unsuccessfully to make the
PMD writeable.
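For illustration (helper names assumed), the intended behaviour in
insert_pfn_pmd():

    entry = pmd_mkhuge(pfn_pmd(pfn, prot));
    if (write) {
        /* Always dirty and allow writes for a write mapping, even when a
         * (read-only) PMD entry already exists, so a later write fault
         * cannot loop forever on a clean read-only PMD. */
        entry = pmd_mkyoung(pmd_mkdirty(entry));
        entry = maybe_pmd_mkwrite(entry, vma);
    }
    set_pmd_at(mm, addr, pmd, entry);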
Signed-off-by: Ross Zwisler
Signed-off-by: Dan Williams
Reported-by: Jeff Moyer
Reported-by: Toshi Kani
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Sasha Levin has reported a KASAN out-of-bounds bug[1]. It points to "if
(!is_swap_pte(pte[i]))" in unfreeze_page_vma() as a problematic access.

The cause is that split_huge_page() doesn't handle THP correctly if it's
not aligned to a PMD boundary. It can happen after mremap().

Test-case (not always triggers the bug):
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define MB (1024UL*1024)
    #define SIZE (2*MB)
    #define BASE ((void *)0x400000000000)

    int main()
    {
        char *p;

        p = mmap(BASE, SIZE, PROT_READ | PROT_WRITE,
                 MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
                 -1, 0);
        if (p == MAP_FAILED)
            perror("mmap"), exit(1);

        p = mremap(BASE, SIZE, SIZE, MREMAP_FIXED | MREMAP_MAYMOVE,
                   BASE + SIZE + 8192);
        if (p == MAP_FAILED)
            perror("mremap"), exit(1);

        system("echo 1 > /sys/kernel/debug/split_huge_pages");
        return 0;
    }

The patch fixes the freeze and unfreeze paths to handle page table boundary
crossing.

It also makes the mapcount vs. count check in split_huge_page_to_list()
stricter:

 - after freeze we don't expect any subpage mapped, as we remove them
   from rmap when setting up migration entries;

 - count must be 1, meaning only the caller has a reference to the page;

[1] https://gist.github.com/sashalevin/c67fbea55e7c0576972a
Signed-off-by: Kirill A. Shutemov
Reported-by: Sasha Levin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
We don't need to split a THP page when the MADV_FREE syscall is called
if [start, len] is aligned with the THP size. The split can be done when
the VM decides to free it in the reclaim path if memory pressure is
heavy. With that, we can avoid an unnecessary THP split.

For the feature, this patch changes the pte dirtiness marking logic of
THP. Now, it marks every pte of the pages dirty unconditionally in
splitting, which makes MADV_FREE void. So, instead, this patch
propagates pmd dirtiness to all pages via PG_dirty and restores pte
dirtiness from PG_dirty. With this, if the pmd is clean (i.e.,
MADV_FREEed) when the split happens (e.g., shrink_page_list), all of the
pages are clean too, so we can discard them.
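For illustration (a simplification of the PMD-split path; names assumed):

    /* Record pmd dirtiness once on the compound page ... */
    if (pmd_dirty(*pmd))
        SetPageDirty(page);

    /* ... and restore pte dirtiness from PG_dirty instead of marking
     * every pte dirty unconditionally, so a clean (MADV_FREEed) THP
     * stays discardable after the split. */
    entry = mk_pte(page + i, vma->vm_page_prot);
    if (PageDirty(page))
        entry = pte_mkdirty(entry);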
Signed-off-by: Minchan Kim
Cc: Kirill A. Shutemov
Cc: Hugh Dickins
Cc: Andrea Arcangeli
Cc: "James E.J. Bottomley"
Cc: "Kirill A. Shutemov"
Cc: Shaohua Li
Cc:
Cc: Andy Lutomirski
Cc: Arnd Bergmann
Cc: Benjamin Herrenschmidt
Cc: Catalin Marinas
Cc: Chen Gang
Cc: Chris Zankel
Cc: Daniel Micay
Cc: Darrick J. Wong
Cc: David S. Miller
Cc: Helge Deller
Cc: Ivan Kokshaysky
Cc: Jason Evans
Cc: Johannes Weiner
Cc: KOSAKI Motohiro
Cc: Matt Turner
Cc: Max Filippov
Cc: Mel Gorman
Cc: Michael Kerrisk
Cc: Michal Hocko
Cc: Mika Penttil
Cc: Ralf Baechle
Cc: Richard Henderson
Cc: Rik van Riel
Cc: Roland Dreier
Cc: Russell King
Cc: Shaohua Li
Cc: Will Deacon
Cc: Wu Fengguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
The MADV_FREE patchset changes page reclaim to simply free a clean
anonymous page with no dirty ptes, instead of swapping it out; but KSM
uses clean write-protected ptes to reference the stable ksm page. So be
sure to mark that page dirty, so it's never mistakenly discarded.

[hughd@google.com: adjusted comments]
Signed-off-by: Minchan Kim
Acked-by: Hugh Dickins
Cc: "James E.J. Bottomley"
Cc: "Kirill A. Shutemov"
Cc: Shaohua Li
Cc:
Cc: Andrea Arcangeli
Cc: Andy Lutomirski
Cc: Arnd Bergmann
Cc: Benjamin Herrenschmidt
Cc: Catalin Marinas
Cc: Chen Gang
Cc: Chris Zankel
Cc: Daniel Micay
Cc: Darrick J. Wong
Cc: David S. Miller
Cc: Helge Deller
Cc: Ivan Kokshaysky
Cc: Jason Evans
Cc: Johannes Weiner
Cc: KOSAKI Motohiro
Cc: Kirill A. Shutemov
Cc: Matt Turner
Cc: Max Filippov
Cc: Mel Gorman
Cc: Michael Kerrisk
Cc: Michal Hocko
Cc: Mika Penttil
Cc: Ralf Baechle
Cc: Richard Henderson
Cc: Rik van Riel
Cc: Roland Dreier
Cc: Russell King
Cc: Shaohua Li
Cc: Will Deacon
Cc: Wu Fengguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
MADV_FREE is a hint that it's okay to discard pages if there is memory
pressure, and we use reclaimers (i.e., kswapd and direct reclaim) to free
them, so there is no value in keeping them in the active anonymous LRU;
this patch moves them to the inactive LRU list's head.

This means that MADV_FREE-ed pages which were living on the inactive
list are reclaimed first because they are more likely to be cold rather
than recently active pages.

An arguable issue for the approach would be whether we should put the
page at the head or tail of the inactive list. I chose head because the
kernel cannot make sure it's really cold or warm for every MADV_FREE
usecase, but at least we know it's not *hot*, so landing at the inactive
head would be a compromise for various usecases.

This fixes suboptimal behavior of MADV_FREE where pages living on the
active list will sit there for a long time even under memory pressure
while the inactive list is reclaimed heavily. This basically breaks the
whole purpose of using MADV_FREE to help the system to free memory which
might not be used.

Signed-off-by: Minchan Kim
Acked-by: Hugh Dickins
Acked-by: Michal Hocko
Cc: Johannes Weiner
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Shaohua Li
Cc: "James E.J. Bottomley"
Cc: "Kirill A. Shutemov"
Cc:
Cc: Andrea Arcangeli
Cc: Andy Lutomirski
Cc: Arnd Bergmann
Cc: Benjamin Herrenschmidt
Cc: Catalin Marinas
Cc: Chen Gang
Cc: Chris Zankel
Cc: Daniel Micay
Cc: Darrick J. Wong
Cc: David S. Miller
Cc: Helge Deller
Cc: Ivan Kokshaysky
Cc: Jason Evans
Cc: KOSAKI Motohiro
Cc: Kirill A. Shutemov
Cc: Matt Turner
Cc: Max Filippov
Cc: Michael Kerrisk
Cc: Mika Penttil
Cc: Ralf Baechle
Cc: Richard Henderson
Cc: Roland Dreier
Cc: Russell King
Cc: Will Deacon
Cc: Wu Fengguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
When I test the below piece of code with 12 processes (i.e., 512M * 12 =
6G consumed) on my machine (3G ram + 12 cpu + 8G swap), madvise_free is
significantly slower (i.e., 2x) than madvise_dontneed.

    loop = 5;
    mmap(512M);
    while (loop--) {
        memset(512M);
        madvise(MADV_FREE or MADV_DONTNEED);
    }

The reason is lots of swapin.
1) dontneed: 1,612 swapin
2) madvfree: 879,585 swapin

If we find hinted pages were already swapped out when the syscall is
called, it's pointless to keep the swapped-out pages in the pte.
Instead, let's free the cold page, because swapin is more expensive than
(alloc page + zeroing).

With this patch, it reduced swapin from 879,585 to 1,878, so elapsed time

1) dontneed: 6.10user 233.50system 0:50.44elapsed
2) madvfree: 6.03user 401.17system 1:30.67elapsed
2) madvfree + below patch: 6.70user 339.14system 1:04.45elapsed

Signed-off-by: Minchan Kim
Acked-by: Michal Hocko
Acked-by: Hugh Dickins
Cc: "James E.J. Bottomley"
Cc: "Kirill A. Shutemov"
Cc: Shaohua Li
Cc:
Cc: Andrea Arcangeli
Cc: Andy Lutomirski
Cc: Arnd Bergmann
Cc: Benjamin Herrenschmidt
Cc: Catalin Marinas
Cc: Chen Gang
Cc: Chris Zankel
Cc: Daniel Micay
Cc: Darrick J. Wong
Cc: David S. Miller
Cc: Helge Deller
Cc: Ivan Kokshaysky
Cc: Jason Evans
Cc: Johannes Weiner
Cc: KOSAKI Motohiro
Cc: Kirill A. Shutemov
Cc: Matt Turner
Cc: Max Filippov
Cc: Mel Gorman
Cc: Michael Kerrisk
Cc: Mika Penttil
Cc: Ralf Baechle
Cc: Richard Henderson
Cc: Rik van Riel
Cc: Roland Dreier
Cc: Russell King
Cc: Shaohua Li
Cc: Will Deacon
Cc: Wu Fengguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Linux doesn't have the ability to free pages lazily, while other OSes
have already supported it, named madvise(MADV_FREE).

The gain is clear: the kernel can discard freed pages rather than
swapping them out or OOMing if memory pressure happens.

Without memory pressure, freed pages would be reused by userspace
without additional overhead (e.g., page fault + allocation + zeroing).

Jason Evans said:
: Facebook has been using MAP_UNINITIALIZED
: (https://lkml.org/lkml/2012/1/18/308) in some of its applications for
: several years, but there are operational costs to maintaining this
: out-of-tree in our kernel and in jemalloc, and we are anxious to retire it
: in favor of MADV_FREE. When we first enabled MAP_UNINITIALIZED it
: increased throughput for much of our workload by ~5%, and although the
: benefit has decreased using newer hardware and kernels, there is still
: enough benefit that we cannot reasonably retire it without a replacement.
:
: Aside from Facebook operations, there are numerous broadly used
: applications that would benefit from MADV_FREE. The ones that immediately
: come to mind are redis, varnish, and MariaDB. I don't have much insight
: into Android internals and development process, but I would hope to see
: MADV_FREE support eventually end up there as well to benefit applications
: linked with the integrated jemalloc.
:
: jemalloc will use MADV_FREE once it becomes available in the Linux kernel.
: In fact, jemalloc already uses MADV_FREE or equivalent everywhere it's
: available: *BSD, OS X, Windows, and Solaris -- every platform except Linux
: (and AIX, but I'm not sure it even compiles on AIX). The lack of
: MADV_FREE on Linux forced me down a long series of increasingly
: sophisticated heuristics for madvise() volume reduction, and even so this
: remains a common performance issue for people using jemalloc on Linux.
: Please integrate MADV_FREE; many people will benefit substantially.

How it works:

When the madvise syscall is called, the VM clears the dirty bit of the
ptes of the range. If memory pressure happens, the VM checks the dirty
bit of the page table, and if it finds it still "clean", it means it's a
"lazyfree" page, so the VM can discard the page instead of swapping it
out. Once there was a store operation to the page before the VM picks a
page to reclaim, the dirty bit is set, so the VM can swap out the page
instead of discarding it.

One thing we should notice is that, basically, MADV_FREE relies on the
dirty bit in the page table entry to decide whether the VM is allowed to
discard the page or not. IOW, if the page table entry has the dirty bit
marked, the VM shouldn't discard the page.

However, as an example, if swap-in by read fault happens, the page table
entry doesn't have the dirty bit set, so MADV_FREE could discard the
page wrongly.

To avoid the problem, MADV_FREE did more checks with PageDirty and
PageSwapCache. It worked out because a swapped-in page lives in the swap
cache, and once it is evicted from the swap cache, the page has the
PG_dirty flag. So both page flag checks effectively prevent wrong
discarding by MADV_FREE.

However, a problem in the above logic is that a swapped-in page keeps
PG_dirty after it is removed from the swap cache, so the VM cannot
consider the page as freeable any more, even if madvise_free is called
in future.

Look at the below example for detail.
ptr = malloc();
memset(ptr);
..
..
.. heavy memory pressure so all of pages are swapped out
..
..
var = *ptr; -> a page swapped-in and could be removed from
swapcache. Then, page table doesn't mark
dirty bit and page descriptor includes PG_dirty
..
..
madvise_free(ptr); -> It doesn't clear PG_dirty of the page.
..
..
..
.. heavy memory pressure again.
.. In this time, VM cannot discard the page because the page
.. has *PG_dirty*

To solve the problem, this patch clears PG_dirty only if the page is
owned exclusively by the current process when madvise is called, because
PG_dirty represents the ptes' dirtiness across several processes, so we
can clear it only if we own it exclusively.

Firstly, heavy users would be general allocators (e.g., jemalloc,
tcmalloc, and hopefully glibc supports it one day), and jemalloc/tcmalloc
already support the feature for other OSes (e.g., FreeBSD).

barrios@blaptop:~/benchmark/ebizzy$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 12
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 2
Stepping: 3
CPU MHz: 3200.185
BogoMIPS: 6400.53
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
NUMA node0 CPU(s): 0-11
ebizzy benchmark (./ebizzy -S 10 -n 512)

Higher avg is better.
vanilla-jemalloc MADV_free-jemalloc
1 thread
records: 10 records: 10
avg: 2961.90 avg: 12069.70
std: 71.96(2.43%) std: 186.68(1.55%)
max: 3070.00 max: 12385.00
min: 2796.00 min: 11746.00

2 thread
records: 10 records: 10
avg: 5020.00 avg: 17827.00
std: 264.87(5.28%) std: 358.52(2.01%)
max: 5244.00 max: 18760.00
min: 4251.00 min: 17382.00

4 thread
records: 10 records: 10
avg: 8988.80 avg: 27930.80
std: 1175.33(13.08%) std: 3317.33(11.88%)
max: 9508.00 max: 30879.00
min: 5477.00 min: 21024.00

8 thread
records: 10 records: 10
avg: 13036.50 avg: 33739.40
std: 170.67(1.31%) std: 5146.22(15.25%)
max: 13371.00 max: 40572.00
min: 12785.00 min: 24088.00

16 thread
records: 10 records: 10
avg: 11092.40 avg: 31424.20
std: 710.60(6.41%) std: 3763.89(11.98%)
max: 12446.00 max: 36635.00
min: 9949.00 min: 25669.00

32 thread
records: 10 records: 10
avg: 11067.00 avg: 34495.80
std: 971.06(8.77%) std: 2721.36(7.89%)
max: 12010.00 max: 38598.00
min: 9002.00 min: 30636.00

In summary, MADV_FREE is much faster than MADV_DONTNEED.
This patch (of 12):
Add core MADV_FREE implementation.
[akpm@linux-foundation.org: small cleanups]
Signed-off-by: Minchan Kim
Acked-by: Michal Hocko
Acked-by: Hugh Dickins
Cc: Mika Penttil
Cc: Michael Kerrisk
Cc: Johannes Weiner
Cc: Rik van Riel
Cc: Mel Gorman
Cc: KOSAKI Motohiro
Cc: Jason Evans
Cc: Daniel Micay
Cc: "Kirill A. Shutemov"
Cc: Shaohua Li
Cc:
Cc: Andy Lutomirski
Cc: "James E.J. Bottomley"
Cc: "Kirill A. Shutemov"
Cc: "Shaohua Li"
Cc: Andrea Arcangeli
Cc: Arnd Bergmann
Cc: Benjamin Herrenschmidt
Cc: Catalin Marinas
Cc: Chen Gang
Cc: Chris Zankel
Cc: Darrick J. Wong
Cc: David S. Miller
Cc: Helge Deller
Cc: Ivan Kokshaysky
Cc: Matt Turner
Cc: Max Filippov
Cc: Ralf Baechle
Cc: Richard Henderson
Cc: Roland Dreier
Cc: Russell King
Cc: Shaohua Li
Cc: Will Deacon
Cc: Wu Fengguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
page_referenced_one() and page_idle_clear_pte_refs_one() duplicate the
code for looking up pte of a (possibly transhuge) page. Move this code
to a new helper function, page_check_address_transhuge(), and make the
above mentioned functions use it.

This is just a cleanup, no functional changes are intended.
Signed-off-by: Vladimir Davydov
Reviewed-by: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
During freeze_page(), we remove the page from rmap. It munlocks the
page if it was mlocked. clear_page_mlock() uses the lru cache, which
temporarily pins the page.

Let's drain the lru cache before checking the page's count vs. mapcount.
The change makes a mlocked page split on the first attempt, if it was
not pinned by somebody else.

Signed-off-by: Kirill A. Shutemov
Cc: Sasha Levin
Acked-by: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Writing 1 into 'split_huge_pages' will try to find and split all huge
pages in the system. This is useful for debugging.

[akpm@linux-foundation.org: fix printk text, per Vlastimil]
Signed-off-by: Kirill A. Shutemov
Cc: Sasha Levin
Cc: Andrea Arcangeli
Acked-by: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Both page_referenced() and page_idle_clear_pte_refs_one() assume that
THP can only be mapped with a PMD, so there's no reason to look at PTEs
for PageTransHuge() pages. That's not true anymore: THP can be mapped
with PTEs too.

The patch removes the PageTransHuge() test from the functions and
opencodes the page table check.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Kirill A. Shutemov
Cc: Vladimir Davydov
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: Naoya Horiguchi
Cc: Sasha Levin
Cc: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Before the THP refcounting rework, THP was not allowed to cross a VMA
boundary. So, if we have a THP and we split it, PG_mlocked can be safely
transferred to the small pages.

With the new THP refcounting and a naive approach to mlocking we can end
up with this scenario:

 1. we have a mlocked THP, which belongs to one VM_LOCKED VMA.
 2. the process does munlock() on a *part* of the THP:
    - the VMA is split into two, one of them VM_LOCKED;
    - the huge PMD is split into a PTE table;
    - the THP is still mlocked;
 3. split_huge_page():
    - it transfers PG_mlocked to *all* small pages regardless of whether
      they belong to any VM_LOCKED VMA.

We probably could munlock() all small pages on split_huge_page(), but I
think we have an accounting issue already at step two.

Instead of forbidding mlocked pages altogether, we just avoid mlocking
PTE-mapped THPs and munlock THPs on split_huge_pmd().

This means PTE-mapped THPs will be on normal lru lists and will be split
under memory pressure by vmscan. After the split vmscan will detect
unevictable small pages and mlock them.

With this approach we shouldn't hit a situation like described above.
Signed-off-by: Kirill A. Shutemov
Cc: Sasha Levin
Cc: Aneesh Kumar K.V
Cc: Jerome Marchand
Cc: Vlastimil Babka
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: Dave Hansen
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Naoya Horiguchi
Cc: Steve Capper
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Christoph Lameter
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
All parts of THP with the new refcounting are now in place. We can now
allow THP to be enabled.

Signed-off-by: Kirill A. Shutemov
Tested-by: Sasha Levin
Tested-by: Aneesh Kumar K.V
Acked-by: Jerome Marchand
Cc: Vlastimil Babka
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: Dave Hansen
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Naoya Horiguchi
Cc: Steve Capper
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Christoph Lameter
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Currently we don't split a huge page on partial unmap. It's not an ideal
situation. It can lead to memory overhead.

Fortunately, we can detect partial unmap in page_remove_rmap(). But we
cannot call split_huge_page() from there due to the locking context.

It's also counterproductive to do it directly from the munmap() codepath:
in many cases we will hit this from exit(2), and splitting the huge page
just to free it up in small pages is not what we really want.

The patch introduces deferred_split_huge_page(), which puts the huge page
into a queue for splitting. The splitting itself will happen when we get
memory pressure via the shrinker interface. The page will be dropped from
the list on freeing, through the compound page destructor.
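For illustration (list and lock names assumed), the queueing side looks
roughly like:

    void deferred_split_huge_page(struct page *page)
    {
        unsigned long flags;

        spin_lock_irqsave(&split_queue_lock, flags);
        /* Queue the compound page once; the shrinker splits it later and
         * freeing removes it via the compound page destructor. */
        if (list_empty(page_deferred_list(page))) {
            list_add_tail(page_deferred_list(page), &split_queue);
            split_queue_len++;
        }
        spin_unlock_irqrestore(&split_queue_lock, flags);
    }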
Signed-off-by: Kirill A. Shutemov
Tested-by: Sasha Levin
Tested-by: Aneesh Kumar K.V
Acked-by: Vlastimil Babka
Acked-by: Jerome Marchand
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: Dave Hansen
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Naoya Horiguchi
Cc: Steve Capper
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Christoph Lameter
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
We are not able to migrate THPs. It means it's not enough to split only
the PMD on migration -- we need to split the compound page under it too.

Signed-off-by: Kirill A. Shutemov
Tested-by: Aneesh Kumar K.V
Acked-by: Jerome Marchand
Cc: Sasha Levin
Cc: Vlastimil Babka
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: Dave Hansen
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Naoya Horiguchi
Cc: Steve Capper
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Christoph Lameter
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
This patch adds an implementation of split_huge_page() for the new
refcounting.

Unlike the previous implementation, the new split_huge_page() can fail if
somebody holds a GUP pin on the page. It also means that a pin on a page
would prevent it from being split under you. It makes the situation in
many places much cleaner.

The basic scheme of split_huge_page():

 - Check that the sum of mapcounts of all subpages is equal to
   page_count() plus one (caller pin). Fail with -EBUSY otherwise. This
   way we can avoid useless PMD-splits.

 - Freeze the page counters by splitting all PMDs and setting up
   migration PTEs.

 - Re-check the sum of mapcounts against page_count(). The page's counts
   are stable now. -EBUSY if the page is pinned.

 - Split the compound page.

 - Unfreeze the page by removing migration entries.
Signed-off-by: Kirill A. Shutemov
Tested-by: Sasha Levin
Tested-by: Aneesh Kumar K.V
Acked-by: Jerome Marchand
Cc: Vlastimil Babka
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: Dave Hansen
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Naoya Horiguchi
Cc: Steve Capper
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Christoph Lameter
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Some mm-related BUG_ON()s could trigger from hwpoison code due to recent
changes in the thp refcounting rule. This patch fixes them up.

In the new refcounting, we no longer use tail->_mapcount to keep the
tail's refcount, and thereby we can simplify get/put_hwpoison_page().

And another change is that the tail's refcount is not transferred to the
raw page during thp split (more precisely, in the new rule we don't take
a refcount on the tail page any more.) So when we need a thp split, we
have to transfer the refcount properly to the 4kB soft-offlined page
before migration.

The thp split code goes into core code only when the precheck
(total_mapcount(head) == page_count(head) - 1) passes, to avoid useless
splits, where we assume that one refcount is held by the caller of thp
split and the others are taken via mapping. To meet this assumption,
this patch moves the thp split part in soft_offline_page() after
get_any_page().

[akpm@linux-foundation.org: remove unneeded #define, per Kirill]
Signed-off-by: Naoya Horiguchi
Acked-by: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds