29 Apr, 2016
1 commit
-
In gather_pte_stats() a THP pmd is cast into a pte, which is wrong
because the layouts may differ depending on the architecture. On s390
this will lead to inaccurate numa_maps accounting in /proc because of
misguided pte_present() and pte_dirty() checks on the fake pte.On other architectures pte_present() and pte_dirty() may work by chance,
but there may be an issue with direct-access (dax) mappings w/o
underlying struct pages when HAVE_PTE_SPECIAL is set and THP is
available. In vm_normal_page() the fake pte will be checked with
pte_special() and because there is no "special" bit in a pmd, this will
always return false and the VM_PFNMAP | VM_MIXEDMAP checking will be
skipped. On dax mappings w/o struct pages, an invalid struct page
pointer would then be returned that can crash the kernel.This patch fixes the numa_maps THP handling by introducing new "_pmd"
variants of the can_gather_numa_stats() and vm_normal_page() functions.Signed-off-by: Gerald Schaefer
Cc: Naoya Horiguchi
Cc: "Kirill A . Shutemov"
Cc: Konstantin Khlebnikov
Cc: Michal Hocko
Cc: Vlastimil Babka
Cc: Jerome Marchand
Cc: Johannes Weiner
Cc: Dave Hansen
Cc: Mel Gorman
Cc: Dan Williams
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Cc: Michael Holzheu
Cc: [4.3+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
05 Apr, 2016
1 commit
-
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.Let's stop pretending that pages in page cache are special. They are
not.The changes are pretty straight-forward:
- << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> ;
- >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> ;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)Signed-off-by: Kirill A. Shutemov
Acked-by: Michal Hocko
Signed-off-by: Linus Torvalds
22 Mar, 2016
1 commit
-
Pull cgroup namespace support from Tejun Heo:
"These are changes to implement namespace support for cgroup which has
been pending for quite some time now. It is very straight-forward and
only affects what part of cgroup hierarchies are visible.After unsharing, mounting a cgroup fs will be scoped to the cgroups
the task belonged to at the time of unsharing and the cgroup paths
exposed to userland would be adjusted accordingly"* 'for-4.6-ns' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: fix and restructure error handling in copy_cgroup_ns()
cgroup: fix alloc_cgroup_ns() error handling in copy_cgroup_ns()
Add FS_USERNS_FLAG to cgroup fs
cgroup: Add documentation for cgroup namespaces
cgroup: mount cgroupns-root when inside non-init cgroupns
kernfs: define kernfs_node_dentry
cgroup: cgroup namespace setns support
cgroup: introduce cgroup namespaces
sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
kernfs: Add API to generate relative kernfs path
21 Mar, 2016
1 commit
-
Pull x86 protection key support from Ingo Molnar:
"This tree adds support for a new memory protection hardware feature
that is available in upcoming Intel CPUs: 'protection keys' (pkeys).There's a background article at LWN.net:
https://lwn.net/Articles/643797/
The gist is that protection keys allow the encoding of
user-controllable permission masks in the pte. So instead of having a
fixed protection mask in the pte (which needs a system call to change
and works on a per page basis), the user can map a (handful of)
protection mask variants and can change the masks runtime relatively
cheaply, without having to change every single page in the affected
virtual memory range.This allows the dynamic switching of the protection bits of large
amounts of virtual memory, via user-space instructions. It also
allows more precise control of MMU permission bits: for example the
executable bit is separate from the read bit (see more about that
below).This tree adds the MM infrastructure and low level x86 glue needed for
that, plus it adds a high level API to make use of protection keys -
if a user-space application calls:mmap(..., PROT_EXEC);
or
mprotect(ptr, sz, PROT_EXEC);
(note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
this special case, and will set a special protection key on this
memory range. It also sets the appropriate bits in the Protection
Keys User Rights (PKRU) register so that the memory becomes unreadable
and unwritable.So using protection keys the kernel is able to implement 'true'
PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
PROT_READ as well. Unreadable executable mappings have security
advantages: they cannot be read via information leaks to figure out
ASLR details, nor can they be scanned for ROP gadgets - and they
cannot be used by exploits for data purposes either.We know about no user-space code that relies on pure PROT_EXEC
mappings today, but binary loaders could start making use of this new
feature to map binaries and libraries in a more secure fashion.There is other pending pkeys work that offers more high level system
call APIs to manage protection keys - but those are not part of this
pull request.Right now there's a Kconfig that controls this feature
(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
(like most x86 CPU feature enablement code that has no runtime
overhead), but it's not user-configurable at the moment. If there's
any serious problem with this then we can make it configurable and/or
flip the default"* 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
mm/core, x86/mm/pkeys: Add execute-only protection keys support
x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
x86/mm/pkeys: Allow kernel to modify user pkey rights register
x86/fpu: Allow setting of XSAVE state
x86/mm: Factor out LDT init from context init
mm/core, x86/mm/pkeys: Add arch_validate_pkey()
mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
x86/mm/pkeys: Add Kconfig prompt to existing config option
x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
x86/mm/pkeys: Dump PKRU with other kernel registers
mm/core, x86/mm/pkeys: Differentiate instruction fetches
x86/mm/pkeys: Optimize fault handling in access_error()
mm/core: Do not enforce PKEY permissions on remote mm access
um, pkeys: Add UML arch_*_access_permitted() methods
mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
x86/mm/gup: Simplify get_user_pages() PTE bit handling
...
18 Mar, 2016
7 commits
-
On i686 PAE enabled machine the contiguous physical area could be large
and it can cause trimming down variables in below calculation in
read_vmcore() and mmap_vmcore():tsz = min_t(size_t, m->offset + m->size - *fpos, buflen);
That is, the types being used is like below on i686:
m->offset: unsigned long long int
m->size: unsigned long long int
*fpos: loff_t (long long int)
buflen: size_t (unsigned int)So casting (m->offset + m->size - *fpos) by size_t means truncating a
given value by 4GB.Suppose (m->offset + m->size - *fpos) being truncated to 0, buflen >0
then we will get tsz = 0. It is of course not an expected result.
Similarly we could also get other truncated values less than buflen.
Then the real size passed down is not correct any more.If (m->offset + m->size - *fpos) is above 4GB, read_vmcore or
mmap_vmcore use the min_t result with truncated values being compared to
buflen. Then, fpos proceeds with the wrong value so that we reach below
bugs:1) read_vmcore will refuse to continue so makedumpfile fails.
2) mmap_vmcore will trigger BUG_ON() in remap_pfn_range().Use unsigned long long in min_t instead so that the variables in are not
truncated.Signed-off-by: Baoquan He
Signed-off-by: Dave Young
Cc: HATAYAMA Daisuke
Cc: Vivek Goyal
Cc: Jianyu Zhan
Cc: Minfei Huang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
It is not elegant that prompt shell does not start from new line after
executing "cat /proc/$pid/wchan". Make prompt shell start from new
line.Signed-off-by: Minfei Huang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
`proc_timers_operations` is only used when CONFIG_CHECKPOINT_RESTORE is
enabled.Signed-off-by: Eric Engestrom
Acked-by: Cyrill Gorcunov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This patch provides a proc/PID/timerslack_ns interface which exposes a
task's timerslack value in nanoseconds and allows it to be changed.This allows power/performance management software to set timer slack for
other threads according to its policy for the thread (such as when the
thread is designated foreground vs. background activity)If the value written is non-zero, slack is set to that value. Otherwise
sets it to the default for the thread.This interface checks that the calling task has permissions to to use
PTRACE_MODE_ATTACH_FSCREDS on the target task, so that we can ensure
arbitrary apps do not change the timer slack for other apps.Signed-off-by: John Stultz
Acked-by: Kees Cook
Cc: Arjan van de Ven
Cc: Thomas Gleixner
Cc: Oren Laadan
Cc: Ruchi Kandoi
Cc: Rom Lemarchand
Cc: Android Kernel Team
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory
statistics protocol, corresponding to 'Available' in /proc/meminfo.It indicates to the hypervisor how big the balloon can be inflated
without pushing the guest system to swap. This metric would be very
useful in VM orchestration software to improve memory management of
different VMs under overcommit.This patch (of 2):
Factor out calculation of the available memory counter into a separate
exportable function, in order to be able to use it in other parts of the
kernel.In particular, it appears a relevant metric to report to the hypervisor
via virtio-balloon statistics interface (in a followup patch).Signed-off-by: Igor Redko
Signed-off-by: Denis V. Lunev
Reviewed-by: Roman Kagan
Cc: Michael S. Tsirkin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Currently /proc/kpageflags returns just KPF_COMPOUND_TAIL for slab tail
pages, which is inconvenient when grasping how slab pages are
distributed (userspace always needs to check which kind of tail pages by
itself). This patch sets KPF_SLAB for such pages.With this patch:
$ grep Slab /proc/meminfo ; tools/vm/page-types -b slab
Slab: 64880 kB
flags page-count MB symbolic-flags long-symbolic-flags
0x0000000000000080 16220 63 _______S__________________________________ slab
total 16220 6316220 pages equals to 64880 kB, so returned result is consistent with the
global counter.Signed-off-by: Naoya Horiguchi
Reviewed-by: Vladimir Davydov
Cc: Konstantin Khlebnikov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Currently /proc/kpageflags returns nothing for "tail" buddy pages, which
is inconvenient when grasping how free pages are distributed. This
patch sets KPF_BUDDY for such pages.With this patch:
$ grep MemFree /proc/meminfo ; tools/vm/page-types -b buddy
MemFree: 3134992 kB
flags page-count MB symbolic-flags long-symbolic-flags
0x0000000000000400 779272 3044 __________B_______________________________ buddy
0x0000000000000c00 4385 17 __________BM______________________________ buddy,mmap
total 783657 3061783657 pages is 3134628 kB (roughly consistent with the global counter,)
so it's OK.[akpm@linux-foundation.org: update comment, per Naoya]
Signed-off-by: Naoya Horiguchi
Reviewed-by: Vladimir Davydov >
Cc: Konstantin Khlebnikov
Cc: Naoya Horiguchi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
19 Feb, 2016
1 commit
-
The protection key can now be just as important as read/write
permissions on a VMA. We need some debug mechanism to help
figure out if it is in play. smaps seems like a logical
place to expose it.arch/x86/kernel/setup.c is a bit of a weirdo place to put
this code, but it already had seq_file.h and there was not
a much better existing place to put it.We also use no #ifdef. If protection keys is .config'd out we
will effectively get the same function as if we used the weak
generic function.Signed-off-by: Dave Hansen
Reviewed-by: Thomas Gleixner
Cc: Al Viro
Cc: Andrew Morton
Cc: Andy Lutomirski
Cc: Baoquan He
Cc: Borislav Petkov
Cc: Borislav Petkov
Cc: Brian Gerst
Cc: Dave Hansen
Cc: Dave Young
Cc: Denys Vlasenko
Cc: H. Peter Anvin
Cc: Jerome Marchand
Cc: Jiri Kosina
Cc: Joerg Roedel
Cc: Johannes Weiner
Cc: Kirill A. Shutemov
Cc: Konstantin Khlebnikov
Cc: Laurent Dufour
Cc: Linus Torvalds
Cc: Mark Salter
Cc: Mark Williamson
Cc: Michal Hocko
Cc: Naoya Horiguchi
Cc: Paolo Bonzini
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: Vlastimil Babka
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210227.4F8EB3F8@viggo.jf.intel.com
Signed-off-by: Ingo Molnar
17 Feb, 2016
1 commit
-
Introduce the ability to create new cgroup namespace. The newly created
cgroup namespace remembers the cgroup of the process at the point
of creation of the cgroup namespace (referred as cgroupns-root).
The main purpose of cgroup namespace is to virtualize the contents
of /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root
(unless they are moved outside of their cgroupns-root, at which point
they will see a relative path from their cgroupns-root).
For a correctly setup container this enables container-tools
(like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
containers without leaking system level cgroup hierarchy to the task.
This patch only implements the 'unshare' part of the cgroupns.Signed-off-by: Aditya Kali
Signed-off-by: Serge Hallyn
Signed-off-by: Tejun Heo
04 Feb, 2016
2 commits
-
Commit b76437579d13 ("procfs: mark thread stack correctly in
proc//maps") added [stack:TID] annotation to /proc//maps.Finding the task of a stack VMA requires walking the entire thread list,
turning this into quadratic behavior: a thousand threads means a
thousand stacks, so the rendering of /proc//maps needs to look at a
million combinations.The cost is not in proportion to the usefulness as described in the
patch.Drop the [stack:TID] annotation to make /proc//maps (and
/proc//numa_maps) usable again for higher thread counts.The [stack] annotation inside /proc//task//maps is retained, as
identifying the stack VMA there is an O(1) operation.Siddesh said:
"The end users needed a way to identify thread stacks programmatically and
there wasn't a way to do that. I'm afraid I no longer remember (or have
access to the resources that would aid my memory since I changed
employers) the details of their requirement. However, I did do this on my
own time because I thought it was an interesting project for me and nobody
really gave any feedback then as to its utility, so as far as I am
concerned you could roll back the main thread maps information since the
information is available in the thread-specific files"Signed-off-by: Johannes Weiner
Cc: "Kirill A. Shutemov"
Cc: Siddhesh Poyarekar
Cc: Shaohua Li
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When working with hugetlbfs ptes (which are actually pmds) is not valid to
directly use pte functions like pte_present() because the hardware bit
layout of pmds and ptes can be different. This is the case on s390.
Therefore we have to convert the hugetlbfs ptes first into a valid pte
encoding with huge_ptep_get().Currently the /proc//numa_maps code uses hugetlbfs ptes without
huge_ptep_get(). On s390 this leads to the following two problems:1) The pte_present() function returns false (instead of true) for
PROT_NONE hugetlb ptes. Therefore PROT_NONE vmas are missing
completely in the "numa_maps" output.2) The pte_dirty() function always returns false for all hugetlb ptes.
Therefore these pages are reported as "mapped=xxx" instead of
"dirty=xxx".Therefore use huge_ptep_get() to correctly convert the hugetlb ptes.
Signed-off-by: Michael Holzheu
Reviewed-by: Gerald Schaefer
Cc: [4.3+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
23 Jan, 2016
1 commit
-
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.Signed-off-by: Al Viro
22 Jan, 2016
1 commit
-
After THP refcounting rework we have only two possible return values
from pmd_trans_huge_lock(): success and failure. Return-by-pointer for
ptl doesn't make much sense in this case.Let's convert pmd_trans_huge_lock() to return ptl on success and NULL on
failure.Signed-off-by: Kirill A. Shutemov
Suggested-by: Linus Torvalds
Cc: Minchan Kim
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
21 Jan, 2016
3 commits
-
Only functions doing more than one read are modified. Consumeres
happened to deal with possibly changing data, but it does not seem like
a good thing to rely on.Signed-off-by: Mateusz Guzik
Acked-by: Cyrill Gorcunov
Cc: Alexey Dobriyan
Cc: Jarod Wilson
Cc: Jan Stancek
Cc: Al Viro
Cc: Anshuman Khandual
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
By checking the effective credentials instead of the real UID / permitted
capabilities, ensure that the calling process actually intended to use its
credentials.To ensure that all ptrace checks use the correct caller credentials (e.g.
in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
flag), use two new flags and require one of them to be set.The problem was that when a privileged task had temporarily dropped its
privileges, e.g. by calling setreuid(0, user_uid), with the intent to
perform following syscalls with the credentials of a user, it still passed
ptrace access checks that the user would not be able to pass.While an attacker should not be able to convince the privileged task to
perform a ptrace() syscall, this is a problem because the ptrace access
check is reused for things in procfs.In particular, the following somewhat interesting procfs entries only rely
on ptrace access checks:/proc/$pid/stat - uses the check for determining whether pointers
should be visible, useful for bypassing ASLR
/proc/$pid/maps - also useful for bypassing ASLR
/proc/$pid/cwd - useful for gaining access to restricted
directories that contain files with lax permissions, e.g. in
this scenario:
lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
drwx------ root root /root
drwxr-xr-x root root /root/foobar
-rw-r--r-- root root /root/foobar/secretTherefore, on a system where a root-owned mode 6755 binary changes its
effective credentials as described and then dumps a user-specified file,
this could be used by an attacker to reveal the memory layout of root's
processes or reveal the contents of files he is not allowed to access
(through /proc/$pid/cwd).[akpm@linux-foundation.org: fix warning]
Signed-off-by: Jann Horn
Acked-by: Kees Cook
Cc: Casey Schaufler
Cc: Oleg Nesterov
Cc: Ingo Molnar
Cc: James Morris
Cc: "Serge E. Hallyn"
Cc: Andy Shevchenko
Cc: Andy Lutomirski
Cc: Al Viro
Cc: "Eric W. Biederman"
Cc: Willy Tarreau
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
For THP=n, HPAGE_PMD_NR in smaps_account() expands to BUILD_BUG().
That's fine since this codepath is eliminated by modern compilers.But older compilers have not that efficient dead code elimination. It
causes problem at least with gcc 4.1.2 on m68k:fs/built-in.o: In function `smaps_account':
task_mmu.c:(.text+0x4f8fa): undefined reference to `__compiletime_assert_471'Let's replace HPAGE_PMD_NR with 1 << compound_order(page).
Signed-off-by: Kirill A. Shutemov
Reported-by: Geert Uytterhoeven
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
16 Jan, 2016
3 commits
-
Let's define page_mapped() to be true for compound pages if any
sub-pages of the compound page is mapped (with PMD or PTE).On other hand page_mapcount() return mapcount for this particular small
page.This will make cases like page_get_anon_vma() behave correctly once we
allow huge pages to be mapped with PTE.Most users outside core-mm should use page_mapcount() instead of
page_mapped().Signed-off-by: Kirill A. Shutemov
Tested-by: Sasha Levin
Tested-by: Aneesh Kumar K.V
Acked-by: Jerome Marchand
Cc: Vlastimil Babka
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: Dave Hansen
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Naoya Horiguchi
Cc: Steve Capper
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Christoph Lameter
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
With new refcounting we don't need to mark PMDs splitting. Let's drop
code to handle this.Signed-off-by: Kirill A. Shutemov
Tested-by: Sasha Levin
Tested-by: Aneesh Kumar K.V
Acked-by: Vlastimil Babka
Acked-by: Jerome Marchand
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: Dave Hansen
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Naoya Horiguchi
Cc: Steve Capper
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Christoph Lameter
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The goal of this patchset is to make refcounting on THP pages cheaper
with simpler semantics and allow the same THP compound page to be mapped
with PMD and PTEs. This is required to get reasonable THP-pagecache
implementation.With the new refcounting design it's much easier to protect against
split_huge_page(): simple reference on a page will make you the deal.
It makes gup_fast() implementation simpler and doesn't require
special-case in futex code to handle tail THP pages.It should improve THP utilization over the system since splitting THP in
one process doesn't necessary lead to splitting the page in all other
processes have the page mapped.The patchset drastically lower complexity of get_page()/put_page()
codepaths. I encourage people look on this code before-and-after to
justify time budget on reviewing this patchset.This patch (of 37):
With new refcounting all subpages of the compound page are not necessary
have the same mapcount. We need to take into account mapcount of every
sub-page.Signed-off-by: Kirill A. Shutemov
Tested-by: Sasha Levin
Tested-by: Aneesh Kumar K.V
Acked-by: Jerome Marchand
Acked-by: Vlastimil Babka
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: Dave Hansen
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Naoya Horiguchi
Cc: Steve Capper
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Christoph Lameter
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
15 Jan, 2016
9 commits
-
When inspecting a vague code inside prctl(PR_SET_MM_MEM) call (which
testing the RLIMIT_DATA value to figure out if we're allowed to assign
new @start_brk, @brk, @start_data, @end_data from mm_struct) it's been
commited that RLIMIT_DATA in a form it's implemented now doesn't do
anything useful because most of user-space libraries use mmap() syscall
for dynamic memory allocations.Linus suggested to convert RLIMIT_DATA rlimit into something suitable
for anonymous memory accounting. But in this patch we go further, and
the changes are bundled together as:* keep vma counting if CONFIG_PROC_FS=n, will be used for limits
* replace mm->shared_vm with better defined mm->data_vm
* account anonymous executable areas as executable
* account file-backed growsdown/up areas as stack
* drop struct file* argument from vm_stat_account
* enforce RLIMIT_DATA for size of data areasThis way code looks cleaner: now code/stack/data classification depends
only on vm_flags state:VM_EXEC & ~VM_WRITE -> code (VmExe + VmLib in proc)
VM_GROWSUP | VM_GROWSDOWN -> stack (VmStk)
VM_WRITE & ~VM_SHARED & !stack -> data (VmData)The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called
"shared", but that might be strange beast like readonly-private or VM_IO
area.- RLIMIT_AS limits whole address space "VmSize"
- RLIMIT_STACK limits stack "VmStk" (but each vma individually)
- RLIMIT_DATA now limits "VmData"Signed-off-by: Konstantin Khlebnikov
Signed-off-by: Cyrill Gorcunov
Cc: Quentin Casasnovas
Cc: Vegard Nossum
Acked-by: Linus Torvalds
Cc: Willy Tarreau
Cc: Andy Lutomirski
Cc: Kees Cook
Cc: Vladimir Davydov
Cc: Pavel Emelyanov
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
clear_soft_dirty_pmd() is called by clear_refs_write(CLEAR_REFS_SOFT_DIRTY),
VM_SOFTDIRTY was already cleared before walk_page_range().Signed-off-by: Oleg Nesterov
Acked-by: Kirill A. Shutemov
Acked-by: Cyrill Gorcunov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The MemAvailable item in /proc/meminfo is to give users a hint of how
much memory is allocatable without causing swapping, so it excludes the
zones' low watermarks as unavailable to userspace.However, for a userspace allocation, kswapd will actually reclaim until
the free pages hit a combination of the high watermark and the page
allocator's lowmem protection that keeps a certain amount of DMA and
DMA32 memory from userspace as well.Subtract the full amount we know to be unavailable to userspace from the
number of free pages when calculating MemAvailable.Signed-off-by: Johannes Weiner
Cc: Rik van Riel
Cc: Mel Gorman
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
There are several shortcomings with the accounting of shared memory
(SysV shm, shared anonymous mapping, mapping of a tmpfs file). The
values in /proc//status and /statm don't allow to distinguish
between shmem memory and a shared mapping to a regular file, even though
theirs implication on memory usage are quite different: during reclaim,
file mapping can be dropped or written back on disk, while shmem needs a
place in swap.Also, to distinguish the memory occupied by anonymous and file mappings,
one has to read the /proc/pid/statm file, which has a field for the file
mappings (again, including shmem) and total memory occupied by these
mappings (i.e. equivalent to VmRSS in the /status file. Getting
the value for anonymous mappings only is thus not exactly user-friendly
(the statm file is intended to be rather efficiently machine-readable).To address both of these shortcomings, this patch adds a breakdown of
VmRSS in /proc//status via new fields RssAnon, RssFile and
RssShmem, making use of the previous preparatory patch. These fields
tell the user the memory occupied by private anonymous pages, mapped
regular files and shmem, respectively. Other existing fields in /status
and /statm files are left without change. The /statm file can be
extended in the future, if there's a need for that.Example (part of) /proc/pid/status output including the new Rss* fields:
VmPeak: 2001008 kB
VmSize: 2001004 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 5108 kB
VmRSS: 5108 kB
RssAnon: 92 kB
RssFile: 1324 kB
RssShmem: 3692 kB
VmData: 192 kB
VmStk: 136 kB
VmExe: 4 kB
VmLib: 1784 kB
VmPTE: 3928 kB
VmPMD: 20 kB
VmSwap: 0 kB
HugetlbPages: 0 kB[vbabka@suse.cz: forward-porting, tweak changelog]
Signed-off-by: Jerome Marchand
Signed-off-by: Vlastimil Babka
Acked-by: Konstantin Khlebnikov
Acked-by: Michal Hocko
Acked-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Currently looking at /proc//status or statm, there is no way to
distinguish shmem pages from pages mapped to a regular file (shmem pages
are mapped to /dev/zero), even though their implication in actual memory
use is quite different.The internal accounting currently counts shmem pages together with
regular files. As a preparation to extend the userspace interfaces,
this patch adds MM_SHMEMPAGES counter to mm_rss_stat to account for
shmem pages separately from MM_FILEPAGES. The next patch will expose it
to userspace - this patch doesn't change the exported values yet, by
adding up MM_SHMEMPAGES to MM_FILEPAGES at places where MM_FILEPAGES was
used before. The only user-visible change after this patch is the OOM
killer message that separates the reported "shmem-rss" from "file-rss".[vbabka@suse.cz: forward-porting, tweak changelog]
Signed-off-by: Jerome Marchand
Signed-off-by: Vlastimil Babka
Acked-by: Konstantin Khlebnikov
Acked-by: Michal Hocko
Acked-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Following the previous patch, further reduction of /proc/pid/smaps cost
is possible for private writable shmem mappings with unpopulated areas
where the page walk invokes the .pte_hole function. We can use radix
tree iterator for each such area instead of calling find_get_entry() in
a loop. This is possible at the extra maintenance cost of introducing
another shmem function shmem_partial_swap_usage().To demonstrate the diference, I have measured this on a process that
creates a private writable 2GB mapping of a partially swapped out
/dev/shm/file (which cannot employ the optimizations from the prvious
patch) and doesn't populate it at all. I time how long does it take to
cat /proc/pid/smaps of this process 100 times.Before this patch:
real 0m3.831s
user 0m0.180s
sys 0m3.212sAfter this patch:
real 0m1.176s
user 0m0.180s
sys 0m0.684sThe time is similar to the case where a radix tree iterator is employed
on the whole mapping.Signed-off-by: Vlastimil Babka
Cc: Hugh Dickins
Cc: Jerome Marchand
Cc: Konstantin Khlebnikov
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The previous patch has improved swap accounting for shmem mapping, which
however made /proc/pid/smaps more expensive for shmem mappings, as we
consult the radix tree for each pte_none entry, so the overal complexity
is O(n*log(n)).We can reduce this significantly for mappings that cannot contain COWed
pages, because then we can either use the statistics tha shmem object
itself tracks (if the mapping contains the whole object, or the swap
usage of the whole object is zero), or use the radix tree iterator,
which is much more effective than repeated find_get_entry() calls.This patch therefore introduces a function shmem_swap_usage(vma) and
makes /proc/pid/smaps use it when possible. Only for writable private
mappings of shmem objects (i.e. tmpfs files) with the shmem object
itself (partially) swapped outwe have to resort to the find_get_entry()
approach.Hopefully such mappings are relatively uncommon.
To demonstrate the diference, I have measured this on a process that
creates a 2GB mapping and dirties single pages with a stride of 2MB, and
time how long does it take to cat /proc/pid/smaps of this process 100
times.Private writable mapping of a /dev/shm/file (the most complex case):
real 0m3.831s
user 0m0.180s
sys 0m3.212sShared mapping of an almost full mapping of a partially swapped /dev/shm/file
(which needs to employ the radix tree iterator).real 0m1.351s
user 0m0.096s
sys 0m0.768sSame, but with /dev/shm/file not swapped (so no radix tree walk needed)
real 0m0.935s
user 0m0.128s
sys 0m0.344sPrivate anonymous mapping:
real 0m0.949s
user 0m0.116s
sys 0m0.348sThe cost is now much closer to the private anonymous mapping case, unless
the shmem mapping is private and writable.Signed-off-by: Vlastimil Babka
Cc: Hugh Dickins
Cc: Jerome Marchand
Cc: Konstantin Khlebnikov
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Currently, /proc/pid/smaps will always show "Swap: 0 kB" for
shmem-backed mappings, even if the mapped portion does contain pages
that were swapped out. This is because unlike private anonymous
mappings, shmem does not change pte to swap entry, but pte_none when
swapping the page out. In the smaps page walk, such page thus looks
like it was never faulted in.This patch changes smaps_pte_entry() to determine the swap status for
such pte_none entries for shmem mappings, similarly to how
mincore_page() does it. Swapped out shmem pages are thus accounted for.
For private mappings of tmpfs files that COWed some of the pages, swaped
out status of the original shmem pages is naturally ignored. If some of
the private copies was also swapped out, they are accounted via their
page table swap entries, so the resulting reported swap usage is then a
sum of both swapped out private copies, and swapped out shmem pages that
were not COWed. No double accounting can thus happen.The accounting is arguably still not as precise as for private anonymous
mappings, since now we will count also pages that the process in
question never accessed, but another process populated them and then let
them become swapped out. I believe it is still less confusing and
subtle than not showing any swap usage by shmem mappings at all.
Swapped out counter might of interest of users who would like to prevent
from future swapins during performance critical operation and pre-fault
them at their convenience. Especially for larger swapped out regions
the cost of swapin is much higher than a fresh page allocation. So a
differentiation between pte_none vs. swapped out is important for those
usecases.One downside of this patch is that it makes /proc/pid/smaps more
expensive for shmem mappings, as we consult the radix tree for each
pte_none entry, so the overal complexity is O(n*log(n)). I have
measured this on a process that creates a 2GB mapping and dirties single
pages with a stride of 2MB, and time how long does it take to cat
/proc/pid/smaps of this process 100 times.Private anonymous mapping:
real 0m0.949s
user 0m0.116s
sys 0m0.348sMapping of a /dev/shm/file:
real 0m3.831s
user 0m0.180s
sys 0m3.212sThe difference is rather substantial, so the next patch will reduce the
cost for shared or read-only mappings.In a less controlled experiment, I've gathered pids of processes on my
desktop that have either '/dev/shm/*' or 'SYSV*' in smaps. This
included the Chrome browser and some KDE processes. Again, I've run cat
/proc/pid/smaps on each 100 times.Before this patch:
real 0m9.050s
user 0m0.518s
sys 0m8.066sAfter this patch:
real 0m9.221s
user 0m0.541s
sys 0m8.187sThis suggests low impact on average systems.
Note that this patch doesn't attempt to adjust the SwapPss field for
shmem mappings, which would need extra work to determine who else could
have the pages mapped. Thus the value stays zero except for COWed
swapped out pages in a shmem mapping, which are accounted as usual.Signed-off-by: Vlastimil Babka
Acked-by: Konstantin Khlebnikov
Acked-by: Jerome Marchand
Acked-by: Michal Hocko
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Mark those kmem allocations that are known to be easily triggered from
userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
memcg. For the list, see below:- threadinfo
- task_struct
- task_delay_info
- pid
- cred
- mm_struct
- vm_area_struct and vm_region (nommu)
- anon_vma and anon_vma_chain
- signal_struct
- sighand_struct
- fs_struct
- files_struct
- fdtable and fdtable->full_fds_bits
- dentry and external_name
- inode for all filesystems. This is the most tedious part, because
most filesystems overwrite the alloc_inode method.The list is far from complete, so feel free to add more objects.
Nevertheless, it should be close to "account everything" approach and
keep most workloads within bounds. Malevolent users will be able to
breach the limit, but this was possible even with the former "account
everything" approach (simply because it did not account everything in
fact).[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Vladimir Davydov
Acked-by: Johannes Weiner
Acked-by: Michal Hocko
Cc: Tejun Heo
Cc: Greg Thelen
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
13 Jan, 2016
1 commit
-
Pull misc vfs updates from Al Viro:
"All kinds of stuff. That probably should've been 5 or 6 separate
branches, but by the time I'd realized how large and mixed that bag
had become it had been too close to -final to play with rebasing.Some fs/namei.c cleanups there, memdup_user_nul() introduction and
switching open-coded instances, burying long-dead code, whack-a-mole
of various kinds, several new helpers for ->llseek(), assorted
cleanups and fixes from various people, etc.One piece probably deserves special mention - Neil's
lookup_one_len_unlocked(). Similar to lookup_one_len(), but gets
called without ->i_mutex and tries to avoid ever taking it. That, of
course, means that it's not useful for any directory modifications,
but things like getting inode attributes in nfds readdirplus are fine
with that. I really should've asked for moratorium on lookup-related
changes this cycle, but since I hadn't done that early enough... I
*am* asking for that for the coming cycle, though - I'm going to try
and get conversion of i_mutex to rwsem with ->lookup() done under lock
taken shared.There will be a patch closer to the end of the window, along the lines
of the one Linus had posted last May - mechanical conversion of
->i_mutex accesses to inode_lock()/inode_unlock()/inode_trylock()/
inode_is_locked()/inode_lock_nested(). To quote Linus back then:-----
| This is an automated patch using
|
| sed 's/mutex_lock(&\(.*\)->i_mutex)/inode_lock(\1)/'
| sed 's/mutex_unlock(&\(.*\)->i_mutex)/inode_unlock(\1)/'
| sed 's/mutex_lock_nested(&\(.*\)->i_mutex,[ ]*I_MUTEX_\([A-Z0-9_]*\))/inode_lock_nested(\1, I_MUTEX_\2)/'
| sed 's/mutex_is_locked(&\(.*\)->i_mutex)/inode_is_locked(\1)/'
| sed 's/mutex_trylock(&\(.*\)->i_mutex)/inode_trylock(\1)/'
|
| with a very few manual fixups
-----I'm going to send that once the ->i_mutex-affecting stuff in -next
gets mostly merged (or when Linus says he's about to stop taking
merges)"* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
nfsd: don't hold i_mutex over userspace upcalls
fs:affs:Replace time_t with time64_t
fs/9p: use fscache mutex rather than spinlock
proc: add a reschedule point in proc_readfd_common()
logfs: constify logfs_block_ops structures
fcntl: allow to set O_DIRECT flag on pipe
fs: __generic_file_splice_read retry lookup on AOP_TRUNCATED_PAGE
fs: xattr: Use kvfree()
[s390] page_to_phys() always returns a multiple of PAGE_SIZE
nbd: use ->compat_ioctl()
fs: use block_device name vsprintf helper
lib/vsprintf: add %*pg format specifier
fs: use gendisk->disk_name where possible
poll: plug an unused argument to do_poll
amdkfd: don't open-code memdup_user()
cdrom: don't open-code memdup_user()
rsxx: don't open-code memdup_user()
mtip32xx: don't open-code memdup_user()
[um] mconsole: don't open-code memdup_user_nul()
[um] hostaudio: don't open-code memdup_user()
...
12 Jan, 2016
1 commit
-
Pull vfs RCU symlink updates from Al Viro:
"Replacement of ->follow_link/->put_link, allowing to stay in RCU mode
even if the symlink is not an embedded one.No changes since the mailbomb on Jan 1"
* 'work.symlinks' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
switch ->get_link() to delayed_call, kill ->put_link()
kill free_page_put_link()
teach nfs_get_link() to work in RCU mode
teach proc_self_get_link()/proc_thread_self_get_link() to work in RCU mode
teach shmem_get_link() to work in RCU mode
teach page_get_link() to work in RCU mode
replace ->follow_link() with new method that could stay in RCU mode
don't put symlink bodies in pagecache into highmem
namei: page_getlink() and page_follow_link_light() are the same thing
ufs: get rid of ->setattr() for symlinks
udf: don't duplicate page_symlink_inode_operations
logfs: don't duplicate page_symlink_inode_operations
switch befs long symlinks to page_symlink_operations
09 Jan, 2016
1 commit
-
User can pass an arbitrary large buffer to getdents().
It is typically a 32KB buffer used by libc scandir() implementation.
When scanning /proc/{pid}/fd, we can hold cpu way too long,
so add a cond_resched() to be kind with other tasks.We've seen latencies of more than 50ms on real workloads.
Signed-off-by: Eric Dumazet
Cc: Alexander Viro
Signed-off-by: Al Viro
04 Jan, 2016
1 commit
-
Signed-off-by: Al Viro
31 Dec, 2015
1 commit
-
Signed-off-by: Al Viro
19 Dec, 2015
1 commit
-
Writing to /proc/$pid/coredump_filter always returns -ESRCH because commit
774636e19ed51 ("proc: convert to kstrto*()/kstrto*_from_user()") removed
the setting of ret after the get_proc_task call and incorrectly left it as
-ESRCH. Instead, return 0 when successful.Example breakage:
echo 0 > /proc/self/coredump_filter
bash: echo: write error: No such processFixes: 774636e19ed51 ("proc: convert to kstrto*()/kstrto*_from_user()")
Signed-off-by: Colin Ian King
Acked-by: Kees Cook
Cc: [4.3+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
09 Dec, 2015
2 commits
-
Signed-off-by: Al Viro
-
new method: ->get_link(); replacement of ->follow_link(). The differences
are:
* inode and dentry are passed separately
* might be called both in RCU and non-RCU mode;
the former is indicated by passing it a NULL dentry.
* when called that way it isn't allowed to block
and should return ERR_PTR(-ECHILD) if it needs to be called
in non-RCU mode.It's a flagday change - the old method is gone, all in-tree instances
converted. Conversion isn't hard; said that, so far very few instances
do not immediately bail out when called in RCU mode. That'll change
in the next commits.Signed-off-by: Al Viro