21 Mar, 2016
1 commit
-
Pull x86 protection key support from Ingo Molnar:
"This tree adds support for a new memory protection hardware feature
that is available in upcoming Intel CPUs: 'protection keys' (pkeys).

There's a background article at LWN.net:
https://lwn.net/Articles/643797/
The gist is that protection keys allow the encoding of
user-controllable permission masks in the pte. So instead of having a
fixed protection mask in the pte (which needs a system call to change
and works on a per page basis), the user can map a (handful of)
protection mask variants and can change the masks runtime relatively
cheaply, without having to change every single page in the affected
virtual memory range.

This allows the dynamic switching of the protection bits of large
amounts of virtual memory, via user-space instructions. It also
allows more precise control of MMU permission bits: for example the
executable bit is separate from the read bit (see more about that
below).

This tree adds the MM infrastructure and low level x86 glue needed for
that, plus it adds a high level API to make use of protection keys -
if a user-space application calls:

    mmap(..., PROT_EXEC);

or

    mprotect(ptr, sz, PROT_EXEC);
(note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
this special case, and will set a special protection key on this
memory range. It also sets the appropriate bits in the Protection
Keys User Rights (PKRU) register so that the memory becomes unreadable
and unwritable.

So using protection keys the kernel is able to implement 'true'
PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
PROT_READ as well. Unreadable executable mappings have security
advantages: they cannot be read via information leaks to figure out
ASLR details, nor can they be scanned for ROP gadgets - and they
cannot be used by exploits for data purposes either.

We know about no user-space code that relies on pure PROT_EXEC
mappings today, but binary loaders could start making use of this new
feature to map binaries and libraries in a more secure fashion.

There is other pending pkeys work that offers more high level system
call APIs to manage protection keys - but those are not part of this
pull request.

Right now there's a Kconfig that controls this feature
(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
(like most x86 CPU feature enablement code that has no runtime
overhead), but it's not user-configurable at the moment. If there's
any serious problem with this then we can make it configurable and/or
flip the default"* 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
mm/core, x86/mm/pkeys: Add execute-only protection keys support
x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
x86/mm/pkeys: Allow kernel to modify user pkey rights register
x86/fpu: Allow setting of XSAVE state
x86/mm: Factor out LDT init from context init
mm/core, x86/mm/pkeys: Add arch_validate_pkey()
mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
x86/mm/pkeys: Add Kconfig prompt to existing config option
x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
x86/mm/pkeys: Dump PKRU with other kernel registers
mm/core, x86/mm/pkeys: Differentiate instruction fetches
x86/mm/pkeys: Optimize fault handling in access_error()
mm/core: Do not enforce PKEY permissions on remote mm access
um, pkeys: Add UML arch_*_access_permitted() methods
mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
x86/mm/gup: Simplify get_user_pages() PTE bit handling
...
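A hedged user-space sketch of the PROT_EXEC-only behaviour described in
the pull message above. It assumes a pkeys-capable CPU running a kernel
with this tree merged; everywhere else, PROT_EXEC implies PROT_READ and
the final read simply succeeds (the handler and messages are
illustrative, not part of the series):

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/mman.h>

    static void on_segv(int sig)
    {
            /* printf in a signal handler is tolerable for a demo */
            printf("read faulted: mapping is truly execute-only\n");
            _exit(0);
    }

    int main(void)
    {
            unsigned char *p;

            signal(SIGSEGV, on_segv);
            p = mmap(NULL, 4096, PROT_EXEC,        /* no PROT_READ/WRITE */
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            /* faults with execute-only pkeys; succeeds without them */
            printf("first byte: %d (PROT_EXEC implied PROT_READ)\n", p[0]);
            return 0;
    }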
18 Mar, 2016
3 commits
-
Kernel style prefers a single string over split strings when the string is
'user-visible'.

Miscellanea:
- Add a missing newline
- Realign arguments

Signed-off-by: Joe Perches
Acked-by: Tejun Heo [percpu]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Currently we have two copies of the same code which implements memory
overcommitment logic. Let's move it into mm/util.c and hence avoid
duplication. No functional changes here.

Signed-off-by: Andrey Ryabinin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The max_map_count sysctl is unrelated to the scheduler. Move its bits
from include/linux/sched/sysctl.h to include/linux/mm.h.

Signed-off-by: Andrey Ryabinin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
07 Mar, 2016
1 commit
-
Signed-off-by: Ingo Molnar
19 Feb, 2016
3 commits
-
Grazvydas Ignotas has reported a regression in remap_file_pages()
emulation.

Testcase:

    #define _GNU_SOURCE
    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define SIZE (4096 * 3)

    int main(int argc, char **argv)
    {
            unsigned long *p;
            long i;

            p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    return -1;
            }

            for (i = 0; i < SIZE / 4096; i++)
                    p[i * 4096 / sizeof(*p)] = i;

            if (remap_file_pages(p, 4096, 0, 1, 0)) {
                    perror("remap_file_pages");
                    return -1;
            }

            if (remap_file_pages(p, 4096 * 2, 0, 1, 0)) {
                    perror("remap_file_pages");
                    return -1;
            }

            assert(p[0] == 1);

            munmap(p, SIZE);
            return 0;
    }

The second remap_file_pages() fails with -EINVAL.
The reason is that remap_file_pages() emulation assumes that the target
vma covers the whole area we want to overmap. That assumption is broken
by the first remap_file_pages() call: it splits the area into two vmas.

The solution is to check the next adjacent vmas and see if they map the
same file with the same flags.

Fixes: c8d78c1823f4 ("mm: replace remap_file_pages() syscall with emulation")
Signed-off-by: Kirill A. Shutemov
Reported-by: Grazvydas Ignotas
Tested-by: Grazvydas Ignotas
Cc: [4.0+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Protection keys provide new page-based protection in hardware.
But, they have an interesting attribute: they only affect data
accesses and never affect instruction fetches. That means that
if we set up some memory which is set as "access-disabled" via
protection keys, we can still execute from it.

This patch uses protection keys to set up mappings to do just that.
If a user calls:

    mmap(..., PROT_EXEC);

or

    mprotect(ptr, sz, PROT_EXEC);

(note PROT_EXEC-only without PROT_READ/WRITE), the kernel will
notice this, and set a special protection key on the memory. It
also sets the appropriate bits in the Protection Keys User Rights
(PKRU) register so that the memory becomes unreadable and
unwritable.

I haven't found any userspace that does this today. With this
facility in place, we expect userspace to move to use it
eventually. Userspace _could_ start doing this today. Any
PROT_EXEC calls get converted to PROT_READ inside the kernel, and
would transparently be upgraded to "true" PROT_EXEC with this
code. IOW, userspace never has to do any PROT_EXEC runtime
detection.

This feature provides enhanced protection against leaking
executable memory contents. This helps thwart attacks which are
attempting to find ROP gadgets on the fly.

But, the security provided by this approach is not comprehensive.
The PKRU register which controls access permissions is a normal
user register writable from unprivileged userspace. An attacker
who can execute the 'wrpkru' instruction can easily disable the
protection provided by this feature.

The protection key that is used for execute-only support is
permanently dedicated at compile time. This is fine for now
because there is currently no API to set a protection key other
than this one.

Despite there being a constant PKRU value across the entire
system, we do not set it unless this feature is in use in a
process. That is to preserve the PKRU XSAVE 'init state',
which can lead to faster context switches.

PKRU *is* a user register and the kernel is modifying it. That
means that code doing:

    pkru = rdpkru()
    pkru |= 0x100;
    mmap(..., PROT_EXEC);
    wrpkru(pkru);

could lose the bits in PKRU that enforce execute-only
permissions. To avoid this, we suggest avoiding ever calling
mmap() or mprotect() when the PKRU value is expected to be
unstable.

Signed-off-by: Dave Hansen
Reviewed-by: Thomas Gleixner
Cc: Andrea Arcangeli
Cc: Andrew Morton
Cc: Andy Lutomirski
Cc: Aneesh Kumar K.V
Cc: Borislav Petkov
Cc: Brian Gerst
Cc: Chen Gang
Cc: Dan Williams
Cc: Dave Chinner
Cc: Dave Hansen
Cc: David Hildenbrand
Cc: Denys Vlasenko
Cc: H. Peter Anvin
Cc: Kees Cook
Cc: Kirill A. Shutemov
Cc: Konstantin Khlebnikov
Cc: Linus Torvalds
Cc: Mel Gorman
Cc: Oleg Nesterov
Cc: Peter Zijlstra
Cc: Piotr Kwapulinski
Cc: Rik van Riel
Cc: Stephen Smalley
Cc: Vladimir Murzin
Cc: Will Deacon
Cc: keescook@google.com
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210240.CB4BB5CA@viggo.jf.intel.com
Signed-off-by: Ingo Molnar
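The PKRU race described in the entry above presumes user-space
rdpkru()/wrpkru() helpers. A hedged sketch of what those look like,
using the raw instruction encodings (only meaningful on a pkeys-capable
x86 CPU; the helper names are illustrative):

    static inline unsigned int rdpkru(void)
    {
            unsigned int eax, edx;

            /* RDPKRU: requires ECX = 0 */
            asm volatile(".byte 0x0f,0x01,0xee"
                         : "=a" (eax), "=d" (edx)
                         : "c" (0));
            return eax;
    }

    static inline void wrpkru(unsigned int pkru)
    {
            /* WRPKRU: requires ECX = 0 and EDX = 0 */
            asm volatile(".byte 0x0f,0x01,0xef"
                         : /* no outputs */
                         : "a" (pkru), "c" (0), "d" (0));
    }

-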
This plumbs a protection key through calc_vm_flag_bits(). We
could have done this in calc_vm_prot_bits(), but I did not feel
super strongly which way to go. It was pretty arbitrary which
one to use.

Signed-off-by: Dave Hansen
Reviewed-by: Thomas Gleixner
Cc: Andrea Arcangeli
Cc: Andrew Morton
Cc: Andy Lutomirski
Cc: Arve Hjønnevåg
Cc: Benjamin Herrenschmidt
Cc: Borislav Petkov
Cc: Brian Gerst
Cc: Chen Gang
Cc: Dan Williams
Cc: Dave Chinner
Cc: Dave Hansen
Cc: David Airlie
Cc: Denys Vlasenko
Cc: Eric W. Biederman
Cc: Geliang Tang
Cc: Greg Kroah-Hartman
Cc: H. Peter Anvin
Cc: Kirill A. Shutemov
Cc: Konstantin Khlebnikov
Cc: Leon Romanovsky
Cc: Linus Torvalds
Cc: Masahiro Yamada
Cc: Maxime Coquelin
Cc: Mel Gorman
Cc: Michael Ellerman
Cc: Oleg Nesterov
Cc: Paul Gortmaker
Cc: Paul Mackerras
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: Riley Andrews
Cc: Vladimir Davydov
Cc: devel@driverdev.osuosl.org
Cc: linux-api@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/20160212210231.E6F1F0D6@viggo.jf.intel.com
Signed-off-by: Ingo Molnar
18 Feb, 2016
1 commit
-
Signed-off-by: Ingo Molnar
16 Feb, 2016
1 commit
-
Provide a stable basis for the pkeys patches, which touch various
x86 details.

Signed-off-by: Ingo Molnar
06 Feb, 2016
2 commits
-
The sequence vma_lock_anon_vma() - vma_unlock_anon_vma() isn't safe if
anon_vma appears between the lock and the unlock. We have to check
anon_vma first or call anon_vma_prepare() to be sure that it's there.
There are only a few users of these legacy helpers. Let's get rid of
them.

This patch fixes an anon_vma lock imbalance in validate_mm(). A write
lock isn't required here; a read lock is enough.

It also reorders expand_downwards()/expand_upwards(): the
security_mmap_addr() and wrapping-around checks don't have to be under
the anon_vma lock.

Link: https://lkml.kernel.org/r/CACT4Y+Y908EjM2z=706dv4rV6dWtxTLK9nFg9_7DhRMLppBo2g@mail.gmail.com
Signed-off-by: Konstantin Khlebnikov
Reported-by: Dmitry Vyukov
Acked-by: Kirill A. Shutemov
Cc: Andrea Arcangeli
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Holding mmap_sem for reading in validate_mm(), called from
expand_stack(), is not enough to prevent the augmented rbtree's
rb_subtree_gap information from changing under us, because
expand_stack() may be running concurrently in other threads which also
hold mmap_sem for reading.

The augmented rbtree is updated with vma_gap_update() under the
page_table_lock, so take it in browse_rb() too to avoid false positives.

Signed-off-by: Andrea Arcangeli
Reported-by: Dmitry Vyukov
Tested-by: Dmitry Vyukov
Cc: Konstantin Khlebnikov
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
04 Feb, 2016
1 commit
-
This patch provides a way of working around a slight regression
introduced by commit 84638335900f ("mm: rework virtual memory
accounting").

Before that commit RLIMIT_DATA had control only over the size of the brk
region. But that change has caused problems with all existing versions
of valgrind, because valgrind sets RLIMIT_DATA to zero.

This patch fixes the rlimit check (the limit is actually in bytes, not
pages) and by default turns it into a warning which is printed on the
first VmData misuse:

    "mmap: top (795): VmData 516096 exceed data ulimit 512000. Will be forbidden soon."

Behavior is controlled by the boot param ignore_rlimit_data=y/n and by
sysfs /sys/module/kernel/parameters/ignore_rlimit_data. For now it is
set to "y".

[akpm@linux-foundation.org: tweak kernel-parameters.txt text]
Signed-off-by: Konstantin Khlebnikov
Link: http://lkml.kernel.org/r/20151228211015.GL2194@uranus
Reported-by: Christian Borntraeger
Cc: Cyrill Gorcunov
Cc: Linus Torvalds
Cc: Vegard Nossum
Cc: Peter Zijlstra
Cc: Vladimir Davydov
Cc: Andy Lutomirski
Cc: Quentin Casasnovas
Cc: Kees Cook
Cc: Willy Tarreau
Cc: Pavel Emelyanov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
29 Jan, 2016
1 commit
-
Signed-off-by: Ingo Molnar
16 Jan, 2016
1 commit
-
Dmitry Vyukov has reported[1] a possible deadlock (triggered by his
syzkaller fuzzer):

    Possible unsafe locking scenario:

          CPU0                                CPU1
          ----                                ----
     lock(&hugetlbfs_i_mmap_rwsem_key);
                                         lock(&mapping->i_mmap_rwsem);
                                         lock(&hugetlbfs_i_mmap_rwsem_key);
     lock(&mapping->i_mmap_rwsem);

Both traces point to mm_take_all_locks() as the source of the problem.
It doesn't take care of the ordering of hugetlbfs_i_mmap_rwsem_key (aka
mapping->i_mmap_rwsem for hugetlb mappings) vs. i_mmap_rwsem.

huge_pmd_share() does memory allocation under hugetlbfs_i_mmap_rwsem_key
and the allocator can take i_mmap_rwsem if it hits reclaim. So we need
to take i_mmap_rwsem from all hugetlb VMAs before taking i_mmap_rwsem
from the rest of the VMAs.

The patch also documents the locking order for hugetlbfs_i_mmap_rwsem_key.

[1] http://lkml.kernel.org/r/CACT4Y+Zu95tBs-0EvdiAKzUOsb4tczRRfCRTpLr4bg_OP9HuVg@mail.gmail.com
Signed-off-by: Kirill A. Shutemov
Reported-by: Dmitry Vyukov
Reviewed-by: Michal Hocko
Cc: Peter Zijlstra
Cc: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
15 Jan, 2016
4 commits
-
While inspecting the vague code inside the prctl(PR_SET_MM_MEM) call
(which tests the RLIMIT_DATA value to figure out if we're allowed to
assign new @start_brk, @brk, @start_data and @end_data in mm_struct),
it became clear that RLIMIT_DATA as currently implemented doesn't do
anything useful, because most user-space libraries use the mmap()
syscall for dynamic memory allocations.

Linus suggested converting the RLIMIT_DATA rlimit into something
suitable for anonymous memory accounting. But in this patch we go
further, and the changes are bundled together as:

* keep vma counting if CONFIG_PROC_FS=n, will be used for limits
* replace mm->shared_vm with better defined mm->data_vm
* account anonymous executable areas as executable
* account file-backed growsdown/up areas as stack
* drop struct file* argument from vm_stat_account
* enforce RLIMIT_DATA for size of data areas

This way the code looks cleaner: now the code/stack/data classification
depends only on vm_flags state:

    VM_EXEC & ~VM_WRITE            -> code (VmExe + VmLib in proc)
    VM_GROWSUP | VM_GROWSDOWN      -> stack (VmStk)
    VM_WRITE & ~VM_SHARED & !stack -> data (VmData)

The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called
"shared", but that might be a strange beast like a readonly-private or
VM_IO area.

- RLIMIT_AS limits whole address space "VmSize"
- RLIMIT_STACK limits stack "VmStk" (but each vma individually)
- RLIMIT_DATA now limits "VmData"

Signed-off-by: Konstantin Khlebnikov
Signed-off-by: Cyrill Gorcunov
Cc: Quentin Casasnovas
Cc: Vegard Nossum
Acked-by: Linus Torvalds
Cc: Willy Tarreau
Cc: Andy Lutomirski
Cc: Kees Cook
Cc: Vladimir Davydov
Cc: Pavel Emelyanov
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
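The classification table above translates almost mechanically into
vm_flags tests. A hedged sketch (helper names hypothetical; the real
patch may factor this differently):

    static inline bool is_exec_mapping(vm_flags_t flags)
    {
            /* VM_EXEC & ~VM_WRITE -> code (VmExe + VmLib) */
            return (flags & (VM_EXEC | VM_WRITE)) == VM_EXEC;
    }

    static inline bool is_stack_mapping(vm_flags_t flags)
    {
            /* VM_GROWSUP | VM_GROWSDOWN -> stack (VmStk) */
            return flags & (VM_GROWSUP | VM_GROWSDOWN);
    }

    static inline bool is_data_mapping(vm_flags_t flags)
    {
            /* VM_WRITE & ~VM_SHARED & !stack -> data (VmData) */
            return (flags & (VM_WRITE | VM_SHARED |
                             VM_GROWSUP | VM_GROWSDOWN)) == VM_WRITE;
    }

-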
Address Space Layout Randomization (ASLR) provides a barrier to
exploitation of user-space processes in the presence of security
vulnerabilities by making it more difficult to find desired code/data
which could help an attack. This is done by adding a random offset to
the location of regions in the process address space, with a greater
range of potential offset values corresponding to better protection/a
larger search-space for brute force, but also to greater potential for
fragmentation.

The offset added to the mmap_base address, which provides the basis for
the majority of the mappings for a process, is set once on process exec
in arch_pick_mmap_layout() and is done via hard-coded per-arch values,
which reflect, hopefully, the best compromise for all systems. The
trade-off between increased entropy in the offset value generation and
the corresponding increased variability in address space fragmentation
is not absolute, however, and some platforms may tolerate higher amounts
of entropy. This patch introduces both new Kconfig values and a sysctl
interface which may be used to change the amount of entropy used for
offset generation on a system.

The direct motivation for this change was in response to the
libstagefright vulnerabilities that affected Android, specifically to
information provided by Google's project zero at:

http://googleprojectzero.blogspot.com/2015/09/stagefrightened.html
The attack presented therein, by Google's project zero, specifically
targeted the limited randomness used to generate the offset added to the
mmap_base address in order to craft a brute-force-based attack.
Concretely, the attack was against the mediaserver process, which was
limited to respawning every 5 seconds, on an arm device. The hard-coded
8 bits used resulted in an average expected success rate of defeating
the mmap ASLR after just over 10 minutes (128 tries at 5 seconds a
piece). With this patch, and an accompanying increase in the entropy
value to 16 bits, the same attack would take an average expected time of
over 45 hours (32768 tries), which makes it both less feasible and more
likely to be noticed.

The introduced Kconfig and sysctl options are limited by per-arch
minimum and maximum values, the minimum of which was chosen to match the
current hard-coded value and the maximum of which was chosen so as to
give the greatest flexibility without generating an invalid mmap_base
address, generally 3-4 bits less than the number of bits in the
user-space accessible virtual address space.

When deciding whether or not to change the default value, a system
developer should consider that mmap_base address could be placed
anywhere up to 2^(value) bits away from the non-randomized location,
which would introduce variable-sized areas above and below the mmap_base
address such that the maximum vm_area_struct size may be reduced,
preventing very large allocations.

This patch (of 4):
ASLR only uses as few as 8 bits to generate the random offset for the
mmap base address on 32 bit architectures. This value was chosen to
prevent a poorly chosen value from dividing the address space in such a
way as to prevent large allocations. This may not be an issue on all
platforms. Allow the specification of a minimum number of bits so that
platforms desiring greater ASLR protection may determine where to place
the trade-off.

Signed-off-by: Daniel Cashman
Cc: Russell King
Acked-by: Kees Cook
Cc: Ingo Molnar
Cc: Jonathan Corbet
Cc: Don Zickus
Cc: Eric W. Biederman
Cc: Heinrich Schuchardt
Cc: Josh Poimboeuf
Cc: Kirill A. Shutemov
Cc: Naoya Horiguchi
Cc: Andrea Arcangeli
Cc: Mel Gorman
Cc: Thomas Gleixner
Cc: David Rientjes
Cc: Mark Salyzyn
Cc: Jeff Vander Stoep
Cc: Nick Kralevich
Cc: Catalin Marinas
Cc: Will Deacon
Cc: "H. Peter Anvin"
Cc: Hector Marco-Gisbert
Cc: Borislav Petkov
Cc: Ralf Baechle
Cc: Heiko Carstens
Cc: Martin Schwidefsky
Cc: Benjamin Herrenschmidt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
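A hedged back-of-the-envelope check of the figures quoted above (one
try per 5 seconds, expected success after half the search space):

    #include <stdio.h>

    int main(void)
    {
            int bits;

            for (bits = 8; bits <= 16; bits += 8) {
                    unsigned long space = 1UL << bits;     /* possible offsets */
                    double avg_sec = (space / 2.0) * 5.0;  /* expected tries * 5s */

                    printf("%2d bits: avg %5.0f tries, %7.1f min (%5.1f h)\n",
                           bits, space / 2.0, avg_sec / 60, avg_sec / 3600);
            }
            return 0;
    }

which reproduces the "just over 10 minutes" and "over 45 hours" numbers
for 8 and 16 bits respectively.

-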
The following flag comparison in mmap_region makes no sense:
    if (!(vm_flags & MAP_FIXED))
            return -ENOMEM;

The condition is always false and thus the above "return -ENOMEM" is
never executed. The vm_flags must not be compared with the MAP_FIXED
flag; vm_flags may only be compared with VM_* flags. MAP_FIXED happens
to have the same value as VM_MAYREAD.

Hitting the rlimit is a slow path and find_vma_intersection should
realize that there is no overlapping VMA for the !MAP_FIXED case pretty
quickly.

Signed-off-by: Piotr Kwapulinski
Acked-by: Michal Hocko
Cc: Oleg Nesterov
Cc: Chris Metcalf
Reviewed-by: Naoya Horiguchi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Simplify may_expand_vm().
[akpm@linux-foundation.org: further simplification, per Naoya Horiguchi]
Signed-off-by: Chen Gang
Cc: Naoya Horiguchi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
12 Jan, 2016
1 commit
-
Requiring special mappings to give a list of struct pages is
inflexible: it prevents sane use of IO memory in a special
mapping, it's inefficient (it requires arch code to initialize a
list of struct pages, and it requires the mm core to walk the
entire list just to figure out how long it is), and it prevents
arch code from doing anything fancy when a special mapping fault
occurs.

Add a .fault method as an alternative to filling in a .pages
array.

Looks-OK-to: Andrew Morton
Signed-off-by: Andy Lutomirski
Reviewed-by: Kees Cook
Cc: Andy Lutomirski
Cc: Borislav Petkov
Cc: Dave Hansen
Cc: Fenghua Yu
Cc: H. Peter Anvin
Cc: Linus Torvalds
Cc: Oleg Nesterov
Cc: Peter Zijlstra
Cc: Quentin Casasnovas
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/a26d1677c0bc7e774c33f469451a78ca31e9e6af.1451446564.git.luto@kernel.org
Signed-off-by: Ingo Molnar
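A hedged kernel-side sketch of the new interface: a special mapping
backed by a .fault handler instead of a .pages array (the handler body
and my_lookup_page() are hypothetical):

    static int my_special_fault(const struct vm_special_mapping *sm,
                                struct vm_area_struct *vma,
                                struct vm_fault *vmf)
    {
            /* compute or look up the backing page on demand */
            struct page *page = my_lookup_page(vmf->pgoff);

            if (!page)
                    return VM_FAULT_SIGBUS;
            get_page(page);
            vmf->page = page;
            return 0;
    }

    static const struct vm_special_mapping my_mapping = {
            .name  = "[my_mapping]",
            .fault = my_special_fault,
    };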
06 Nov, 2015
8 commits
-
The cost of faulting in all memory to be locked can be very high when
working with large mappings. If only portions of the mapping will be used
this can incur a high penalty for locking.

For the example of a large file, this is the usage pattern for a large
statistical language model (probably applies to other statistical or
graphical models as well). For the security example, any application
transacting in data that cannot be swapped out (credit card data,
medical records, etc).

This patch introduces the ability to request that pages are not
pre-faulted, but are placed on the unevictable LRU when they are finally
faulted in. The VM_LOCKONFAULT flag will be used together with VM_LOCKED
and has no effect when set without VM_LOCKED. Setting the VM_LOCKONFAULT
flag for a VMA will cause pages faulted into that VMA to be added to the
unevictable LRU when they are faulted or if they are already present, but
will not cause any missing pages to be faulted in.

Exposing this new lock state means that we cannot overload the meaning of
the FOLL_POPULATE flag any longer. Prior to this patch it was used to
mean that the VMA for a fault was locked. This means we need the new
FOLL_MLOCK flag to communicate the locked state of a VMA. FOLL_POPULATE
will now only control if the VMA should be populated and in the case of
VM_LOCKONFAULT, it will not be set.

Signed-off-by: Eric B Munson
Acked-by: Kirill A. Shutemov
Acked-by: Vlastimil Babka
Cc: Michal Hocko
Cc: Jonathan Corbet
Cc: Catalin Marinas
Cc: Geert Uytterhoeven
Cc: Guenter Roeck
Cc: Heiko Carstens
Cc: Michael Kerrisk
Cc: Ralf Baechle
Cc: Shuah Khan
Cc: Stephen Rothwell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
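A hedged user-space sketch of lock-on-fault, assuming the mlock2()
syscall added alongside VM_LOCKONFAULT in this series (invoked via
syscall(2) since no libc wrapper existed at the time; SYS_mlock2 needs
kernel headers from this series):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>

    #ifndef MLOCK_ONFAULT
    #define MLOCK_ONFAULT 0x01      /* from the new uapi header */
    #endif

    int main(void)
    {
            size_t len = 1UL << 30;         /* large, sparsely used mapping */
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            /* nothing is pre-faulted; pages are locked as they arrive */
            if (syscall(SYS_mlock2, p, len, MLOCK_ONFAULT)) {
                    perror("mlock2");
                    return 1;
            }
            p[0] = 1;       /* faults in (and locks) just this one page */
            return 0;
    }

-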
Make the __install_special_mapping() argument order match the caller's,
so the caller can pass its register arguments directly to the callee
untouched.

For most architectures, arguments (at least the first five) are passed
in registers, so this change will have an effect on most architectures.

For -O2, __install_special_mapping() may be inlined on most
architectures, but for -Os it should not be. So this change can get a
little better performance for -Os, at least.

Signed-off-by: Chen Gang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When fget() fails we can return -EBADF directly.
Signed-off-by: Chen Gang
Acked-by: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
It is still a little better to remove it, although it should be skipped
by "-O2".

Signed-off-by: Chen Gang
Acked-by: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Cosmetic, but expand_upwards() and expand_downwards() overuse vma->vm_mm,
a local variable makes sense imho.

Signed-off-by: Oleg Nesterov
Acked-by: Hugh Dickins
Cc: Andrey Konovalov
Cc: Davidlohr Bueso
Cc: "Kirill A. Shutemov"
Cc: Sasha Levin
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
"mm->locked_vm += grow" and vm_stat_account() in acct_stack_growth() are
not safe; multiple threads using the same ->mm can do this at the same
time trying to expans different vma's under down_read(mmap_sem). This
means that one of the "locked_vm += grow" changes can be lost and we can
miss munlock_vma_pages_all() later.Move this code into the caller(s) under mm->page_table_lock. All other
updates to ->locked_vm hold mmap_sem for writing.Signed-off-by: Oleg Nesterov
Acked-by: Hugh Dickins
Cc: Andrey Konovalov
Cc: Davidlohr Bueso
Cc: "Kirill A. Shutemov"
Cc: Sasha Levin
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
linux/mm.h provides the offset_in_page() macro. Let's use the already
predefined macro instead of open-coding (addr & ~PAGE_MASK).

Signed-off-by: Alexander Kuleshov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
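For reference, the macro and a typical conversion (the example hunk is
illustrative):

    /* linux/mm.h */
    #define offset_in_page(p)       ((unsigned long)(p) & ~PAGE_MASK)

so a conversion looks like:

    -       if (addr & ~PAGE_MASK)
    +       if (offset_in_page(addr))

-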
Before the main loop, vma is already NULL. There is no need to set it
to NULL again.

Signed-off-by: Chen Gang
Reviewed-by: Oleg Nesterov
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
23 Sep, 2015
1 commit
-
For VM_PFNMAP and VM_MIXEDMAP we use vm_ops->pfn_mkwrite instead of
vm_ops->page_mkwrite to be notified about write access. This means we
want vma->vm_page_prot to be write-protected if the VMA provides this
vm_ops.

A theoretical scenario that will cause these missed events is:

On a writable mapping with vm_ops->pfn_mkwrite, but without
vm_ops->page_mkwrite: a read fault followed by a write access to the
pfn. A writable pte will be set up on the read fault and the write
fault will not be generated.

I found it examining Dave's complaint on generic/080:

http://lkml.kernel.org/g/20150831233803.GO3902@dastard

Although I don't think it's the reason.

It shouldn't be a problem for ext2/ext4 as they provide both pfn_mkwrite
and page_mkwrite.

[akpm@linux-foundation.org: add local vm_ops to avoid 80-cols mess]
Signed-off-by: Kirill A. Shutemov
Cc: Yigal Korman
Acked-by: Boaz Harrosh
Cc: Matthew Wilcox
Cc: Jan Kara
Cc: Dave Chinner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
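A hedged sketch of the check this fix extends in
vma_wants_writenotify(): write-notification is wanted when either
mkwrite hook is present, not just page_mkwrite (the local variable
matches the [akpm] note above):

    const struct vm_operations_struct *vm_ops = vma->vm_ops;

    /* The backer wishes to know when pages are first written to? */
    if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite))
            return 1;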
18 Sep, 2015
1 commit
-
Revert commit 6dc296e7df4c "mm: make sure all file VMAs have ->vm_ops
set".

Will Deacon reports that it "causes some mmap regressions in LTP, which
appears to use a MAP_PRIVATE mmap of /dev/zero as a way to get anonymous
pages in some of its tests (specifically mmap10 [1])".

William Shuman reports Oracle crashes.
So revert the patch while we work out what to do.
Reported-by: William Shuman
Reported-by: Will Deacon
Cc: Kirill A. Shutemov
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
11 Sep, 2015
2 commits
-
We rely on vma->vm_ops == NULL to detect an anonymous VMA: see
vma_is_anonymous(), but some drivers don't set ->vm_ops.

As a result we can end up with an anonymous page in a private file
mapping. That should not lead to serious misbehaviour, but it is
nevertheless wrong.

Let's fix this by setting up a dummy ->vm_ops for a file mapping if
f_op->mmap() didn't set its own.

The patch also adds a sanity check into __vma_link_rb(). It will help
catch broken VMAs which are inserted directly into mm_struct via
insert_vm_struct().

Signed-off-by: Kirill A. Shutemov
Reviewed-by: Oleg Nesterov
Cc: "H. Peter Anvin"
Cc: Andy Lutomirski
Cc: Dave Hansen
Cc: Ingo Molnar
Cc: Minchan Kim
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
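For reference, the helper this commit keeps honest is trivial
(simplified from include/linux/mm.h):

    static inline bool vma_is_anonymous(struct vm_area_struct *vma)
    {
            return !vma->vm_ops;
    }

which is why a driver leaving ->vm_ops unset made a file-backed VMA
masquerade as anonymous.

-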
Add the additional "vm_flags_t vm_flags" argument to do_mmap_pgoff(),
rename it to do_mmap(), and re-introduce do_mmap_pgoff() as a simple
wrapper on top of do_mmap(). Perhaps we should update the callers of
do_mmap_pgoff() and kill it later.

This way mpx_mmap() can simply call do_mmap(vm_flags => VM_MPX) and not
play with vm internals.

After this change mmap_region() has a single user outside of mmap.c:
arch/tile/mm/elf.c:arch_setup_additional_pages(). It would be nice to
change arch/tile/ and unexport mmap_region().

[kirill@shutemov.name: fix build]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Oleg Nesterov
Acked-by: Dave Hansen
Tested-by: Dave Hansen
Signed-off-by: Kirill A. Shutemov
Cc: "H. Peter Anvin"
Cc: Andy Lutomirski
Cc: Ingo Molnar
Cc: Minchan Kim
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
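A hedged sketch of the resulting split (declarations abridged):
do_mmap() gains the vm_flags argument and do_mmap_pgoff() becomes a
thin wrapper passing 0:

    extern unsigned long do_mmap(struct file *file, unsigned long addr,
            unsigned long len, unsigned long prot, unsigned long flags,
            vm_flags_t vm_flags, unsigned long pgoff,
            unsigned long *populate);

    static inline unsigned long
    do_mmap_pgoff(struct file *file, unsigned long addr, unsigned long len,
                  unsigned long prot, unsigned long flags,
                  unsigned long pgoff, unsigned long *populate)
    {
            return do_mmap(file, addr, len, prot, flags, 0, pgoff, populate);
    }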
09 Sep, 2015
4 commits
-
There's no point in initializing vma->vm_pgoff if the insertion attempt
will fail anyway. Run the checks before performing the initialization.

Signed-off-by: Chen Gang
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
__split_vma() doesn't need the out_err label, nor does err need
initializing.
copy_vma() can return NULL directly when kmem_cache_alloc() fails.
Signed-off-by: Chen Gang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Test-case:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <assert.h>
    #include <sys/mman.h>

    void *find_vdso_vaddr(void)
    {
            FILE *perl;
            char buf[32] = {};

            perl = popen("perl -e 'open STDIN,qq|/proc/@{[getppid]}/maps|;"
                         "/^(.*?)-.*vdso/ && print hex $1 while <>'", "r");
            fread(buf, sizeof(buf), 1, perl);
            fclose(perl);

            return (void *)atol(buf);
    }

    #define PAGE_SIZE 4096

    void *get_unmapped_area(void)
    {
            void *p = mmap(0, PAGE_SIZE, PROT_NONE,
                           MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
            assert(p != MAP_FAILED);
            munmap(p, PAGE_SIZE);
            return p;
    }

    char save[2][PAGE_SIZE];

    int main(void)
    {
            void *vdso = find_vdso_vaddr();
            void *page[2];

            assert(vdso);
            memcpy(save, vdso, sizeof (save));
            // force another fault on the next check
            assert(madvise(vdso, 2 * PAGE_SIZE, MADV_DONTNEED) == 0);

            page[0] = mremap(vdso,
                             PAGE_SIZE, PAGE_SIZE,
                             MREMAP_FIXED | MREMAP_MAYMOVE,
                             get_unmapped_area());
            page[1] = mremap(vdso + PAGE_SIZE,
                             PAGE_SIZE, PAGE_SIZE,
                             MREMAP_FIXED | MREMAP_MAYMOVE,
                             get_unmapped_area());
            assert(page[0] != MAP_FAILED && page[1] != MAP_FAILED);

            printf("match: %d %d\n",
                   !memcmp(save[0], page[0], PAGE_SIZE),
                   !memcmp(save[1], page[1], PAGE_SIZE));

            return 0;
    }

fails without this patch. Before the previous commit it gets the wrong
page; now it segfaults (which is imho better).

This is because copy_vma() wrongly assumes that vma->vm_file == NULL
is irrelevant until the first fault, which will use do_anonymous_page().
This is obviously wrong for the special mapping.

Signed-off-by: Oleg Nesterov
Acked-by: Kirill A. Shutemov
Cc: Andy Lutomirski
Cc: Hugh Dickins
Cc: Pavel Emelyanov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Test-case:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <assert.h>
    #include <sys/mman.h>

    void *find_vdso_vaddr(void)
    {
            FILE *perl;
            char buf[32] = {};

            perl = popen("perl -e 'open STDIN,qq|/proc/@{[getppid]}/maps|;"
                         "/^(.*?)-.*vdso/ && print hex $1 while <>'", "r");
            fread(buf, sizeof(buf), 1, perl);
            fclose(perl);

            return (void *)atol(buf);
    }

    #define PAGE_SIZE 4096

    int main(void)
    {
            void *vdso = find_vdso_vaddr();
            assert(vdso);

            // of course they should differ, and they do so far
            printf("vdso pages differ: %d\n",
                   !!memcmp(vdso, vdso + PAGE_SIZE, PAGE_SIZE));

            // split into 2 vma's
            assert(mprotect(vdso, PAGE_SIZE, PROT_READ) == 0);

            // force another fault on the next check
            assert(madvise(vdso, 2 * PAGE_SIZE, MADV_DONTNEED) == 0);

            // now they no longer differ, the 2nd vm_pgoff is wrong
            printf("vdso pages differ: %d\n",
                   !!memcmp(vdso, vdso + PAGE_SIZE, PAGE_SIZE));

            return 0;
    }

Output:

    vdso pages differ: 1
    vdso pages differ: 0

This is because split_vma() correctly updates ->vm_pgoff, but the logic
in insert_vm_struct() and special_mapping_fault() is absolutely broken,
so the fault at vdso + PAGE_SIZE returns the 1st page. The same happens
if you simply unmap the 1st page.

special_mapping_fault() does:

    pgoff = vmf->pgoff - vma->vm_pgoff;

and this is _only_ correct if vma->vm_start mmaps the first page from
the ->vm_private_data array.

vdso or any other user of install_special_mapping() is not anonymous;
it has the "backing storage" even if it is just the array of pages.
So we actually need to make vm_pgoff work as an offset in this array.

Note: this also allows us to fix another problem: currently gdb can't
access "[vvar]" memory because in this case special_mapping_fault()
doesn't work. Now that we can use ->vm_pgoff we can implement
->access() and fix this.

Signed-off-by: Oleg Nesterov
Acked-by: Kirill A. Shutemov
Cc: Andy Lutomirski
Cc: Hugh Dickins
Cc: Pavel Emelyanov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
05 Sep, 2015
1 commit
-
vma->vm_userfaultfd_ctx is yet another vma parameter that vma_merge
must be aware of, so that we can merge vmas back like they were
originally before arming the userfaultfd on some memory range.

Signed-off-by: Andrea Arcangeli
Acked-by: Pavel Emelyanov
Cc: Sanidhya Kashyap
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov"
Cc: Andres Lagar-Cavilla
Cc: Dave Hansen
Cc: Paolo Bonzini
Cc: Rik van Riel
Cc: Mel Gorman
Cc: Andy Lutomirski
Cc: Hugh Dickins
Cc: Peter Feiner
Cc: "Dr. David Alan Gilbert"
Cc: Johannes Weiner
Cc: "Huangpeng (Peter)"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
10 Jul, 2015
1 commit
-
Today proc and sysfs do not contain any executable files. Several
applications today mount proc or sysfs without noexec and nosuid and
then depend on there being no executable files on proc or sysfs.
Having any executable files show up on proc or sysfs would cause a
user space visible regression, and most likely security problems.

Therefore commit to never allowing executables on proc and sysfs by
adding a new flag to mark them as filesystems without executables and
enforce that flag.

Test the flag where MNT_NOEXEC is tested today, so that the only user
visible effect will be that executables will be treated as if the
execute bit is cleared.

The filesystems proc and sysfs do not currently incorporate any
executable files so this does not result in any user visible effects.

This makes it unnecessary to vet changes to proc and sysfs tightly for
adding executable files or changes to chattr that would modify existing
files, as no matter what the individual files say they will not be
treated as executable files by the vfs.

Not having to vet changes too closely is important as without this we
are only one proc_create call (or another goof up in the implementation
of notify_change) from having problematic executables on proc. Those
mistakes are all too easy to make and would create a situation where
there are security issues or the assumptions of some programs would
have to be broken (and cause userspace regressions).

Signed-off-by: "Eric W. Biederman"
25 Jun, 2015
1 commit
-
The simple check for a zero-length memory mapping may be performed
earlier, so that in the zero-length case some unnecessary code is not
executed at all. It does not make the code less readable and saves some
CPU cycles.

Signed-off-by: Piotr Kwapulinski
Acked-by: Michal Hocko
Acked-by: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
16 Apr, 2015
1 commit
-
The creators of the C language gave us the while keyword. Let's use
that instead of synthesizing it from if+goto.

Made possible by 6597d783397a ("mm/mmap.c: replace find_vma_prepare()
with clearer find_vma_links()").

[akpm@linux-foundation.org: fix 80-col overflows]
Signed-off-by: Rasmus Villemoes
Cc: "Kirill A. Shutemov"
Cc: Sasha Levin
Cc: Cyrill Gorcunov
Cc: Roman Gushchin
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds