11 Dec, 2014

1 commit

  • Like the small zero page, the huge zero page should not be accounted
    as a normal page in the smaps report.

    For small pages we rely on vm_normal_page() to filter out the zero page,
    but vm_normal_page() is not designed to handle pmds. We only get here due
    to a hackish cast of pmd to pte in smaps_pte_range() -- the pte and pmd
    formats are not necessarily compatible on each and every architecture.

    Let's add a separate codepath to handle pmds. follow_trans_huge_pmd()
    will detect the huge zero page for us.

    We need a pmd_dirty() helper to do this properly. The patch adds it
    to the THP-enabled architectures which don't yet have one.
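
    For reference, a minimal sketch of what such a helper looks like on x86,
    where the dirty bit lives in the pmd flags (the patch adds an equivalent
    to the other THP-enabled architectures):

    static inline int pmd_dirty(pmd_t pmd)
    {
            return pmd_flags(pmd) & _PAGE_DIRTY;
    }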

    [akpm@linux-foundation.org: use do_div to fix 32-bit build]
    Signed-off-by: "Kirill A. Shutemov"
    Reported-by: Fengguang Wu
    Tested-by: Fengwei Yin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

18 Nov, 2014

1 commit

  • MPX-enabled applications using large swaths of memory can
    potentially have large numbers of bounds tables in the process
    address space to save bounds information. These tables can take
    up huge amounts of memory (as much as 80% of the memory on the
    system) even if we clean them up aggressively. In the worst-case
    scenario, the tables can be 4x the size of the data structure
    being tracked. IOW, a 1-page structure can require 4 bounds-table
    pages.

    Being this huge, our expectation is that folks using MPX are
    going to be keen on figuring out how much memory is being
    dedicated to it. So we need a way to track memory use for MPX.

    If we want to specifically track MPX VMAs we need to be able to
    distinguish them from normal VMAs, and keep them from getting
    merged with normal VMAs. A new VM_ flag set only on MPX VMAs does
    both of those things. With this flag, MPX bounds-table VMAs can
    be distinguished from other VMAs, and userspace can also walk
    /proc/$pid/smaps to get memory usage for MPX.

    In addition to this flag, we also introduce a special ->vm_ops
    specific to MPX VMAs (see the patch "add MPX specific mmap
    interface"), but currently different ->vm_ops do not by
    themselves prevent VMA merging, so we still need this flag.

    We understand that VM_ flags are scarce and are open to other
    options.
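
    As a hedged illustration of the smaps angle, userspace could total the
    bounds-table memory by scanning /proc/self/smaps and summing the Size:
    of mappings whose VmFlags line carries the flag's two-letter mnemonic --
    assumed here to be "mp"; the real mnemonic is whatever the patch
    registers for the new VM_ flag:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            FILE *f = fopen("/proc/self/smaps", "r");
            char line[256];
            long total_kb = 0, cur_kb = 0;

            if (!f)
                    return 1;
            while (fgets(line, sizeof(line), f)) {
                    /* remember the size of the mapping we are inside */
                    if (sscanf(line, "Size: %ld kB", &cur_kb) == 1)
                            continue;
                    /* count it if its VmFlags carry the MPX mnemonic */
                    if (!strncmp(line, "VmFlags:", 8) && strstr(line, " mp "))
                            total_kb += cur_kb;
            }
            fclose(f);
            printf("MPX bounds tables: %ld kB\n", total_kb);
            return 0;
    }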

    Signed-off-by: Qiaowei Ren
    Signed-off-by: Dave Hansen
    Cc: linux-mm@kvack.org
    Cc: linux-mips@linux-mips.org
    Cc: Dave Hansen
    Link: http://lkml.kernel.org/r/20141114151825.565625B3@viggo.jf.intel.com
    Signed-off-by: Thomas Gleixner

    Qiaowei Ren
     

14 Oct, 2014

1 commit

  • For VMAs that don't want write notifications, PTEs created for read faults
    have their write bit set. If the read fault happens after VM_SOFTDIRTY is
    cleared, then the PTE's softdirty bit will remain clear after subsequent
    writes.

    Here's a simple code snippet to demonstrate the bug:

    char *m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
                   MAP_ANONYMOUS | MAP_SHARED, -1, 0);
    system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
    assert(*m == '\0');     /* new PTE allows write access */
    assert(!soft_dirty(m));
    *m = 'x';               /* should dirty the page */
    assert(soft_dirty(m));  /* fails */
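
    (soft_dirty() is not spelled out above; a minimal sketch of it, assuming
    the documented pagemap format where bit 55 of an entry is the soft-dirty
    bit -- needs <fcntl.h>, <stdint.h>, <unistd.h>:)

    static int soft_dirty(void *addr)
    {
            uint64_t pme = 0;
            int fd = open("/proc/self/pagemap", O_RDONLY);

            pread(fd, &pme, 8, (uintptr_t) addr / getpagesize() * 8);
            close(fd);
            return (pme >> 55) & 1;
    }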

    With this patch, write notifications are enabled when VM_SOFTDIRTY is
    cleared. Furthermore, to avoid unnecessary faults, write notifications
    are disabled when VM_SOFTDIRTY is set.

    As a side effect of enabling and disabling write notifications with
    care, this patch fixes a bug in mprotect where vm_page_prot bits set by
    drivers were zapped on mprotect. An analogous bug was fixed in mmap by
    commit c9d0bf241451 ("mm: uncached vma support with writenotify").

    Signed-off-by: Peter Feiner
    Reported-by: Peter Feiner
    Suggested-by: Kirill A. Shutemov
    Cc: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Jamie Liu
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: Bjorn Helgaas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Feiner
     

10 Oct, 2014

15 commits

  • If a /proc/pid/pagemap read spans [a VMA, an unmapped region, then a
    VM_SOFTDIRTY VMA], the virtual pages in the unmapped region are reported
    as softdirty. Here's a program to demonstrate the bug:

    #include <assert.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            const uint64_t PAGEMAP_SOFTDIRTY = 1ul << 55;
            uint64_t pme[3];
            int fd = open("/proc/self/pagemap", O_RDONLY);
            char *m = mmap(NULL, 3 * getpagesize(), PROT_READ,
                           MAP_ANONYMOUS | MAP_SHARED, -1, 0);

            munmap(m + getpagesize(), getpagesize());
            pread(fd, pme, 24, (unsigned long) m / getpagesize() * 8);
            assert(pme[0] & PAGEMAP_SOFTDIRTY);    /* passes */
            assert(!(pme[1] & PAGEMAP_SOFTDIRTY)); /* fails */
            assert(pme[2] & PAGEMAP_SOFTDIRTY);    /* passes */
            return 0;
    }

    (Note that all pages in new VMAs are softdirty until cleared).

    Tested:
    Used the program given above. I'm going to include this code in
    a selftest in the future.

    [n-horiguchi@ah.jp.nec.com: prevent pagemap_pte_range() from overrunning]
    Signed-off-by: Peter Feiner
    Cc: "Kirill A. Shutemov"
    Cc: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Jamie Liu
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Feiner
     
  • 9e7814404b77 "hold task->mempolicy while numa_maps scans." fixed the
    race with the exiting task but this is not enough.

    The current code assumes that get_vma_policy(task) should either see
    task->mempolicy == NULL or it should be equal to ->task_mempolicy saved
    by hold_task_mempolicy(), so we can never race with __mpol_put(). But
    this can only work if we can't race with do_set_mempolicy(), and thus
    we can't race with another do_set_mempolicy() or do_exit() after that.

    However, do_set_mempolicy()->down_write(mmap_sem) can not prevent this
    race. This task can exec, change its ->mm, and call do_set_mempolicy()
    after that; in this case they take 2 different locks.

    Change hold_task_mempolicy() to use get_task_policy(), which never
    returns NULL, and change show_numa_map() to use __get_vma_policy() or
    fall back to proc_priv->task_mempolicy.

    Note: this is the minimal fix, we will clean up this code later. I think
    hold_task_mempolicy() and release_task_mempolicy() should die; we can
    move this logic into show_numa_map(). Or at least we can move
    get_task_policy() outside of ->mmap_sem and the !CONFIG_NUMA code.

    Signed-off-by: Oleg Nesterov
    Cc: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Alexander Viro
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Cc: "Kirill A. Shutemov"
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Andi Kleen
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • - Rename vm_is_stack() to task_of_stack() and change it to return
    "struct task_struct *" rather than the global (and thus wrong in
    general) pid_t.

    - Add the new pid_of_stack() helper which calls task_of_stack() and
    uses the right namespace to report the correct pid_t.

    Unfortunately we need to define this helper twice, in task_mmu.c
    and in task_nommu.c. Perhaps it makes sense to add fs/proc/util.c
    and move at least pid_of_stack/task_of_stack there to avoid the
    code duplication.

    - Change show_map_vma() and show_numa_map() to use the new helper.
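
    Roughly, the new helper looks like this (a sketch from the description
    above, assuming the proc_maps_private->inode introduced earlier in this
    series):

    static pid_t pid_of_stack(struct proc_maps_private *priv,
                              struct vm_area_struct *vma, bool is_pid)
    {
            struct inode *inode = priv->inode;
            struct task_struct *task;
            pid_t ret = 0;

            rcu_read_lock();
            task = pid_task(proc_pid(inode), PIDTYPE_PID);
            if (task) {
                    task = task_of_stack(task, vma, is_pid);
                    if (task)
                            /* report the pid in the inode's pid namespace */
                            ret = task_pid_nr_ns(task, inode->i_sb->s_fs_info);
            }
            rcu_read_unlock();
            return ret;
    }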

    Signed-off-by: Oleg Nesterov
    Cc: Alexander Viro
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Cc: Greg Ungerer
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • m_start() can use get_proc_task() instead, and "struct inode *"
    provides more potentially useful info, see the next changes.

    Signed-off-by: Oleg Nesterov
    Cc: Alexander Viro
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Cc: Greg Ungerer
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Change the main loop in m_start() to update m->version. Mostly for
    consistency, but this can help to avoid the same loop if the very
    1st ->show() fails due to seq_overflow().

    Signed-off-by: Oleg Nesterov
    Cc: Kirill A. Shutemov
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Add the "last_addr" optimization back. Like before, every ->show()
    method checks !seq_overflow() and sets m->version = vma->vm_start.

    However, it also checks that m_next_vma(vma) != NULL, otherwise it
    sets m->version = -1 for the lockless "EOF" fast-path in m_start().

    m_start() can simply do find_vma() + m_next_vma() if last_addr is
    not zero; the code looks clear and simple, and this case is cleanly
    separated from the "scan vmas" path, as the sketch below shows.
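
    A sketch of the fast path this adds to m_start() (using the m_next_vma()
    helper from the previous patch):

    /* the lockless EOF fast-path: ->show() saw the last vma */
    if (last_addr == -1UL)
            return NULL;

    if (last_addr) {
            vma = find_vma(mm, last_addr);
            if (vma && (vma = m_next_vma(priv, vma)))
                    return vma;
    }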

    Signed-off-by: Oleg Nesterov
    Cc: Kirill A. Shutemov
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Extract the tail_vma/vm_next calculation from m_next() into the new
    trivial helper, m_next_vma().
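
    The helper, as described (a sketch):

    static struct vm_area_struct *
    m_next_vma(struct proc_maps_private *priv, struct vm_area_struct *vma)
    {
            if (vma == priv->tail_vma)
                    return NULL;
            /* the gate vma, if any, comes after the last real vma */
            return vma->vm_next ?: priv->tail_vma;
    }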

    Signed-off-by: Oleg Nesterov
    Cc: Kirill A. Shutemov
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Now that m->version is gone we can cleanup m_start(). In particular,

    - Remove the "unsigned long" typecast; m->index can't be negative
    or exceed ->map_count. But let's use "unsigned int pos" to make
    it clear that "pos < map_count" is safe.

    - Remove the unnecessary "vma != NULL" check in the main loop. It
    can't be NULL unless we have a vm bug.

    - This also means that the "pos < map_count" case can simply return
    the valid vma and avoid the "goto" and subsequent checks, as the
    sketch below shows.
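
    A sketch of what the main loop boils down to after these changes
    (simplified, error paths omitted):

    unsigned int pos = *ppos;

    if (pos < mm->map_count) {
            struct vm_area_struct *vma = mm->mmap;

            while (pos-- > 0)
                    vma = vma->vm_next;
            return vma;
    }

    /* pos == map_count stands for the gate vma, anything past it is EOF */
    return pos == mm->map_count ? priv->tail_vma : NULL;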

    Signed-off-by: Oleg Nesterov
    Cc: Kirill A. Shutemov
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • m_start() carefully documents, checks, and sets "m->version = -1" if
    we are going to return NULL. The only problem is that we will never be
    called again if m_start() returns NULL, so this is simply pointless
    and misleading.

    Otoh, the ->show() methods set m->version = 0 if vma == tail_vma, and
    this is just wrong: we want -1 in this case. And in fact we also want
    -1 if ->vm_next == NULL and ->tail_vma == NULL.

    And it is not used consistently; the "scan vmas" loop in m_start()
    should update last_addr too.

    Finally, imo the whole "last_addr" logic in m_start() looks horrible.
    find_vma(last_addr) is called unconditionally even if we are not going
    to use the result. But the main problem is that this code participates
    in the tail_vma-or-NULL mess, and this looks simply unfixable.

    Remove this optimization. We will add it back after some cleanups.

    Signed-off-by: Oleg Nesterov
    Cc: Kirill A. Shutemov
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • 1. There is no reason to reset ->tail_vma in m_start(); if we return
    IS_ERR_OR_NULL() it won't be used.

    2. m_start() also clears priv->task to ensure that m_stop() won't use
    the stale pointer if we fail before get_task_struct(). But this is
    ugly and confusing; move this initialization into m_stop().

    Signed-off-by: Oleg Nesterov
    Acked-by: Kirill A. Shutemov
    Acked-by: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • 1. Kill the first "vma != NULL" check. Firstly, this is not possible:
    m_next() won't be called if ->start() or the previous ->next()
    returned NULL.

    And if it were possible, the 2nd "vma != tail_vma" check would be
    buggy; we should not wrongly return ->tail_vma.

    2. Make this function readable. The logic is very simple: check
    "vma != tail_vma" once and return "vm_next || tail_vma", as sketched
    below.
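
    The simplified function, roughly (a sketch; vma_stop() releases the mm
    as in the surrounding patches):

    static void *m_next(struct seq_file *m, void *v, loff_t *pos)
    {
            struct proc_maps_private *priv = m->private;
            struct vm_area_struct *tail_vma = priv->tail_vma;
            struct vm_area_struct *vma = v, *next = NULL;

            (*pos)++;
            if (vma != tail_vma)
                    next = vma->vm_next ?: tail_vma;
            if (!next)
                    vma_stop(priv);
            return next;
    }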

    Signed-off-by: Oleg Nesterov
    Acked-by: Kirill A. Shutemov
    Acked-by: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • m_start() drops ->mmap_sem and does mmput() if it returns the vsyscall
    vma. This is because in this case m_stop()->vma_stop() obviously
    can't use gate_vma->vm_mm.

    Now that we have proc_maps_private->mm we can simplify this logic:

    - Change m_start() to return with ->mmap_sem held unless it returns
    IS_ERR_OR_NULL().

    - Change vma_stop() to use priv->mm and avoid the ugly vma checks,
    this makes "vm_area_struct *vma" unnecessary.

    - This also allows m_start() to use vma_stop().

    - Cleanup m_next() to follow the new locking rule.

    Note: m_stop() looks very ugly, and this temporarily uglifies it
    even more. Fixed by the next change.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Oleg Nesterov
    Acked-by: Kirill A. Shutemov
    Acked-by: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • A simple test-case from Kirill Shutemov

    cat /proc/self/maps >/dev/null
    chmod +x /proc/self/net/packet
    exec /proc/self/net/packet

    makes lockdep unhappy, cat/exec take seq_file->lock + cred_guard_mutex in
    the opposite order.

    It's a false positive and probably we should not allow "chmod +x" on proc
    files. Still I think that we should avoid mm_access() and cred_guard_mutex
    in sys_read() paths, security checking should happen at open time. Besides,
    this doesn't even look right if the task changes its ->mm between m_stop()
    and m_start().

    Add the new "mm_struct *mm" member into struct proc_maps_private and change
    proc_maps_open() to initialize it using proc_mem_open(). Change m_start() to
    use priv->mm if atomic_inc_not_zero(mm_users) succeeds or return NULL (eof)
    otherwise.

    The only complication is that proc_maps_open() users should additionally
    do mmdrop() in fop->release(); add the new proc_map_release() helper for
    that.

    Note: this is a user-visible change. If the task execs after open("maps"),
    the new ->mm won't be visible via this file. I hope this is fine, and this
    matches /proc/pid/mem behaviour.
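
    The resulting scheme, roughly (a sketch; error handling trimmed):

    /* proc_maps_open(): stash the mm once, at open time */
    priv->mm = proc_mem_open(inode, PTRACE_MODE_READ);

    /* m_start(): only use the mm if it is still alive */
    mm = priv->mm;
    if (!mm || !atomic_inc_not_zero(&mm->mm_users))
            return NULL;    /* eof */

    /* proc_map_release(): drop the reference taken at open time */
    if (priv->mm)
            mmdrop(priv->mm);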

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Oleg Nesterov
    Reported-by: "Kirill A. Shutemov"
    Acked-by: Kirill A. Shutemov
    Acked-by: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • do_maps_open() and numa_maps_open() are overcomplicated, they could use
    __seq_open_private(). Plus they do the same thing; only sizeof(*priv)
    differs.

    Change them to use a new simple helper, proc_maps_open(ops, psize). This
    simplifies the code and allows us to do the next changes.
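
    A sketch of the helper (the exact fields initialized may differ):

    static int proc_maps_open(struct inode *inode, struct file *file,
                              const struct seq_operations *ops, int psize)
    {
            struct proc_maps_private *priv =
                    __seq_open_private(file, ops, psize);

            if (!priv)
                    return -ENOMEM;

            priv->pid = proc_pid(inode);
            return 0;
    }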

    Signed-off-by: Oleg Nesterov
    Acked-by: Kirill A. Shutemov
    Acked-by: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • get_gate_vma(priv->task->mm) looks ugly and wrong: task->mm can be NULL,
    or it can be changed by exec right after mm_access().

    And in theory this race is not harmless, the task can exec and then later
    exit and free the new mm_struct. In this case get_task_mm(oldmm) can't
    help, get_gate_vma(task->mm) can read the freed/unmapped memory.

    I think that priv->task should simply die and hold_task_mempolicy() logic
    can be simplified. tail_vma logic asks for cleanups too.

    Signed-off-by: Oleg Nesterov
    Acked-by: Kirill A. Shutemov
    Acked-by: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

26 Sep, 2014

1 commit

  • In PTE holes that contain VM_SOFTDIRTY VMAs, unmapped addresses before
    VM_SOFTDIRTY VMAs are reported as softdirty by /proc/pid/pagemap. This
    bug was introduced in commit 68b5a6524856 ("mm: softdirty: respect
    VM_SOFTDIRTY in PTE holes"). That commit made /proc/pid/pagemap look at
    VM_SOFTDIRTY in PTE holes but neglected to observe the start of VMAs
    returned by find_vma.
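
    A simplified sketch of the fixed hole walk (error handling dropped;
    pme_clean and pme_sd are stand-ins for the not-present pagemap entries
    without and with __PM_SOFT_DIRTY): only addresses inside the VMA inherit
    VM_SOFTDIRTY, while the gap before vma->vm_start stays clean:

    while (addr < end) {
            struct vm_area_struct *vma = find_vma(walk->mm, addr);
            unsigned long hole_end = vma ? min(end, vma->vm_start) : end;

            /* a true hole before the next VMA: never soft-dirty */
            for (; addr < hole_end; addr += PAGE_SIZE)
                    add_to_pagemap(addr, &pme_clean, pm);

            if (!vma)
                    break;

            /* unpopulated page tables inside the VMA: honour VM_SOFTDIRTY */
            for (; addr < min(end, vma->vm_end); addr += PAGE_SIZE)
                    add_to_pagemap(addr, vma->vm_flags & VM_SOFTDIRTY ?
                                         &pme_sd : &pme_clean, pm);
    }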

    Tested:
    Wrote a selftest that creates a PMD-sized VMA then unmaps the first
    page and asserts that the page is not softdirty. I'm going to send the
    pagemap selftest in a later commit.

    Signed-off-by: Peter Feiner
    Cc: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: "Kirill A. Shutemov"
    Cc: Jamie Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Feiner
     

07 Aug, 2014

1 commit

  • After a VMA is created with the VM_SOFTDIRTY flag set, /proc/pid/pagemap
    should report that the VMA's virtual pages are soft-dirty until
    VM_SOFTDIRTY is cleared (i.e., by the next write of "4" to
    /proc/pid/clear_refs). However, pagemap ignores the VM_SOFTDIRTY flag
    for virtual addresses that fall in PTE holes (i.e., virtual addresses
    that don't have a PMD, PUD, or PGD allocated yet).

    To observe this bug, use mmap to create a VMA large enough such that
    there's a good chance that the VMA will occupy an unused PMD, then test
    the soft-dirty bit on its pages. In practice, I found that a VMA that
    covered a PMD's worth of address space was big enough.

    This patch adds the necessary VMA lookup to the PTE hole callback in
    /proc/pid/pagemap's page walk and sets soft-dirty according to the VMAs'
    VM_SOFTDIRTY flag.

    Signed-off-by: Peter Feiner
    Acked-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Hugh Dickins
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Feiner
     

09 Jun, 2014

1 commit

  • Now that 3.15 is released, this merges the 'next' branch into 'master',
    bringing us to the normal situation where my 'master' branch is the
    merge window.

    * accumulated work in next: (6809 commits)
    ufs: sb mutex merge + mutex_destroy
    powerpc: update comments for generic idle conversion
    cris: update comments for generic idle conversion
    idle: remove cpu_idle() forward declarations
    nbd: zero from and len fields in NBD_CMD_DISCONNECT.
    mm: convert some level-less printks to pr_*
    MAINTAINERS: adi-buildroot-devel is moderated
    MAINTAINERS: add linux-api for review of API/ABI changes
    mm/kmemleak-test.c: use pr_fmt for logging
    fs/dlm/debug_fs.c: replace seq_printf by seq_puts
    fs/dlm/lockspace.c: convert simple_str to kstr
    fs/dlm/config.c: convert simple_str to kstr
    mm: mark remap_file_pages() syscall as deprecated
    mm: memcontrol: remove unnecessary memcg argument from soft limit functions
    mm: memcontrol: clean up memcg zoneinfo lookup
    mm/memblock.c: call kmemleak directly from memblock_(alloc|free)
    mm/mempool.c: update the kmemleak stack trace for mempool allocations
    lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations
    mm: introduce kmemleak_update_trace()
    mm/kmemleak.c: use %u to print ->checksum
    ...

    Linus Torvalds
     

07 Jun, 2014

2 commits

  • Signed-off-by: Fabian Frederick
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • The page table walker doesn't check for non-present hugetlb entries in
    the common path, so hugetlb_entry() callbacks must check it. The reason
    for this behavior is that some callers want to handle it in their own
    way.

    [ I think that reason is bogus, btw - it should just do what the regular
    code does, which is to call the "pte_hole()" function for such hugetlb
    entries - Linus ]

    However, some callers don't check it now, which causes unpredictable
    results, for example when we have a race between migrating a hugepage
    and reading /proc/pid/numa_maps. This patch fixes it by adding
    !pte_present checks to the buggy callbacks.

    This bug has existed for years and became visible with the introduction
    of hugepage migration.

    ChangeLog v2:
    - fix if condition (check !pte_present() instead of pte_present())

    Reported-by: Sasha Levin
    Signed-off-by: Naoya Horiguchi
    Cc: Rik van Riel
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    [ Backported to 3.15. Signed-off-by: Josh Boyer ]
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

05 Jun, 2014

2 commits

  • Pull x86 vdso updates from Peter Anvin:
    "Vdso cleanups and improvements largely from Andy Lutomirski. This
    makes the vdso a lot less 'special'"

    * 'x86/vdso' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/vdso, build: Make LE access macros clearer, host-safe
    x86/vdso, build: Fix cross-compilation from big-endian architectures
    x86/vdso, build: When vdso2c fails, unlink the output
    x86, vdso: Fix an OOPS accessing the HPET mapping w/o an HPET
    x86, mm: Replace arch_vma_name with vm_ops->name for vsyscalls
    x86, mm: Improve _install_special_mapping and fix x86 vdso naming
    mm, fs: Add vm_ops->name as an alternative to arch_vma_name
    x86, vdso: Fix an OOPS accessing the HPET mapping w/o an HPET
    x86, vdso: Remove vestiges of VDSO_PRELINK and some outdated comments
    x86, vdso: Move the vvar and hpet mappings next to the 64-bit vDSO
    x86, vdso: Move the 32-bit vdso special pages after the text
    x86, vdso: Reimplement vdso.so preparation in build-time C
    x86, vdso: Move syscall and sysenter setup into kernel/cpu/common.c
    x86, vdso: Clean up 32-bit vs 64-bit vdso params
    x86, mm: Ensure correct alignment of the fixmap

    Linus Torvalds
     
  • clear_refs_write() is called earlier than clear_soft_dirty() and it is
    more natural to clear VM_SOFTDIRTY (which belongs to the VMA entry, not
    the PTEs) that early, instead of clearing it way deeper in the call
    chain.

    Signed-off-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     

21 May, 2014

1 commit

  • arch_vma_name sucks. It's a silly hack, and it's annoying to
    implement correctly. In fact, AFAICS, even the straightforward x86
    implementation is incorrect (I suspect that it breaks if the vdso
    mapping is split or gets remapped).

    This adds a new vm_ops->name operation that can replace it. The
    followup patches will remove all uses of arch_vma_name on x86,
    fixing a couple of annoyances in the process.

    Signed-off-by: Andy Lutomirski
    Link: http://lkml.kernel.org/r/2eee21791bb36a0a408c5c2bdb382a9e6a41ca4a.1400538962.git.luto@amacapital.net
    Signed-off-by: H. Peter Anvin

    Andy Lutomirski
     

08 Apr, 2014

1 commit

  • This patch is a continuation of efforts trying to optimize find_vma(),
    avoiding potentially expensive rbtree walks to locate a vma upon faults.
    The original approach (https://lkml.org/lkml/2013/11/1/410), where the
    largest vma was also cached, ended up being too specific and random,
    thus further comparison with other approaches was needed. There are
    two things to consider when dealing with this: the cache hit rate and
    the latency of find_vma(). Improving the hit rate does not necessarily
    translate into finding the vma any faster, as the overhead of any fancy
    caching scheme can be too high to consider.

    We currently cache the last used vma for the whole address space, which
    provides a nice optimization, reducing the total cycles spent in
    find_vma() by as much as 2.5x for workloads with good locality. On the
    other hand, this simple scheme is pretty much useless for workloads with
    poor locality.
    Analyzing ebizzy runs shows that, no matter how many threads are
    running, the mmap_cache hit rate is less than 2%, and in many situations
    below 1%.

    The proposed approach is to replace this scheme with a small per-thread
    cache, maximizing hit rates at a very low maintenance cost.
    Invalidations are performed by simply bumping up a 32-bit sequence
    number. The only expensive operation is in the rare case of a seq
    number overflow, where all caches that share the same address space are
    flushed. Upon a miss, the proposed replacement policy is based on the
    page number that contains the virtual address in question. Concretely,
    the following results are seen on an 80 core, 8 socket x86-64 box:

    1) System bootup: Most programs are single threaded, so the per-thread
    scheme improves the ~50% baseline hit rate just by adding a few more
    slots to the cache.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |   50.61% |            19.90 |
    | patched        |   73.45% |            13.58 |
    +----------------+----------+------------------+

    2) Kernel build: This one is already pretty good with the current
    approach as we're dealing with good locality.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |   75.28% |            11.03 |
    | patched        |   88.09% |             9.31 |
    +----------------+----------+------------------+

    3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |   70.66% |            17.14 |
    | patched        |   91.15% |            12.57 |
    +----------------+----------+------------------+

    4) Ebizzy: There's a fair amount of variation from run to run, but this
    approach always shows nearly perfect hit rates, while the baseline is
    just about non-existent. The cycle count can fluctuate anywhere from
    ~60 to ~116 billion for the baseline scheme, but this approach reduces
    it considerably. For instance, with 80 threads:

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |    1.06% |            91.54 |
    | patched        |   99.97% |            14.18 |
    +----------------+----------+------------------+
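
    A sketch of the lookup and the page-number replacement policy described
    above (constants and helpers as in the patch, modulo details):

    #define VMACACHE_BITS        2
    #define VMACACHE_SIZE        (1U << VMACACHE_BITS)
    #define VMACACHE_MASK        (VMACACHE_SIZE - 1)
    #define VMACACHE_HASH(addr)  ((addr >> PAGE_SHIFT) & VMACACHE_MASK)

    struct vm_area_struct *vmacache_find(struct mm_struct *mm,
                                         unsigned long addr)
    {
            int i;

            /* seqnum mismatch means the cache was invalidated: flush it */
            if (!vmacache_valid(mm))
                    return NULL;

            for (i = 0; i < VMACACHE_SIZE; i++) {
                    struct vm_area_struct *vma = current->vmacache[i];

                    if (vma && vma->vm_start <= addr && vma->vm_end > addr)
                            return vma;
            }
            return NULL;
    }

    void vmacache_update(unsigned long addr, struct vm_area_struct *newvma)
    {
            /* on a miss, replace the slot picked by the faulting page number */
            if (vmacache_valid_mm(newvma->vm_mm))
                    current->vmacache[VMACACHE_HASH(addr)] = newvma;
    }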

    [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
    [akpm@linux-foundation.org: document vmacache_valid() logic]
    [akpm@linux-foundation.org: attempt to untangle header files]
    [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
    [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: adjust and enhance comments]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Acked-by: Linus Torvalds
    Reviewed-by: Michel Lespinasse
    Cc: Oleg Nesterov
    Tested-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

15 Nov, 2013

3 commits

  • All seq_printf() users here use "%n" to calculate the padding size;
    convert them to the seq_setwidth()/seq_pad() pair.

    Signed-off-by: Tetsuo Handa
    Signed-off-by: Kees Cook
    Cc: Joe Perches
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • With split ptlock it's important to know which lock
    pmd_trans_huge_lock() took. This patch adds one more parameter to the
    function to return the lock.

    In most places migration to the new api is trivial. The exception is
    move_huge_pmd(): we need to take two locks if the pmd tables are
    different.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With split page table lock for PMD level we can't hold mm->page_table_lock
    while updating nr_ptes.

    Let's convert it to atomic_long_t to avoid races.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: Naoya Horiguchi
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

13 Nov, 2013

2 commits

  • This flag shows that the VMA is "newly created" and thus represents
    "dirty" in the task's VM.

    You can clear it by "echo 4 > /proc/pid/clear_refs."

    Signed-off-by: Naoya Horiguchi
    Cc: Wu Fengguang
    Cc: Pavel Emelyanov
    Acked-by: Cyrill Gorcunov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • mpol_to_str() should not fail. Currently, it either fails because the
    string buffer is too small or because a string hasn't been defined for a
    mempolicy mode.

    If a new mempolicy mode is introduced and no string is defined for it,
    just warn and return "unknown".

    If the buffer is too small, just truncate the string and return, the
    same behavior as snprintf().

    This also fixes a bug where there was no NULL-byte termination when
    doing *p++ = '=' or *p++ = ':' and maxlen had been reached.

    Signed-off-by: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Chen Gang
    Cc: Rik van Riel
    Cc: Dave Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

17 Oct, 2013

1 commit

  • If a page we are inspecting is in swap we may occasionally report it as
    having the soft dirty bit set (even if it is clean). The pte_soft_dirty()
    helper should be called on present ptes only.
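
    The shape of the fix, sketched: pick the helper that matches the pte
    state instead of calling pte_soft_dirty() unconditionally:

    if (pte_present(pte)) {
            if (pte_soft_dirty(pte))
                    flags2 |= __PM_SOFT_DIRTY;
    } else if (is_swap_pte(pte)) {
            if (pte_swp_soft_dirty(pte))
                    flags2 |= __PM_SOFT_DIRTY;
    }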

    Signed-off-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Andy Lutomirski
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Cc: Peter Zijlstra
    Cc: "Aneesh Kumar K.V"
    Reviewed-by: Naoya Horiguchi
    Cc: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     

12 Sep, 2013

2 commits

  • mpol_to_str() may fail and leave the buffer unfilled (e.g. on -EINVAL),
    so we need to check for this, or the buffer may not be zero-terminated
    and the next seq_printf() will cause an issue.

    The failure return needs to come after mpol_cond_put() to match
    get_vma_policy().

    Signed-off-by: Chen Gang
    Cc: Cyrill Gorcunov
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • Pavel reported that if a vma area gets unmapped and then mapped (or
    expanded) in place, the soft dirty tracker won't be able to recognize
    this situation since it works on the pte level and ptes get zapped on
    unmap, losing the soft dirty bit of course.

    So to resolve this situation we need to track actions on the vma level;
    this is where the VM_SOFTDIRTY flag comes in. When a new vma area is
    created (or an old one expanded) we set this bit, and keep it there
    until the application asks to clear the soft dirty bit.

    Thus when a user space application tracks memory changes it can now
    detect whether a vma area has been renewed.

    Reported-by: Pavel Emelyanov
    Signed-off-by: Cyrill Gorcunov
    Cc: Andy Lutomirski
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Cc: Peter Zijlstra
    Cc: "Aneesh Kumar K.V"
    Cc: Rob Landley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     

14 Aug, 2013

3 commits

  • Recently we met quite a lot of random kernel panic issues after enabling
    CONFIG_PROC_PAGE_MONITOR. After debugging, we found they had something
    to do with the following bug in pagemap:

    In struct pagemapread:

    struct pagemapread {
            int pos, len;
            pagemap_entry_t *buffer;
            bool v2;
    };

    pos counts PM_ENTRY_BYTES-sized entries in buffer, but len is the size
    of buffer in bytes, so comparing pos and len in add_page_map() to check
    whether the buffer is full is a mistake, and it can lead to a buffer
    overflow and random kernel panics.

    Correct len to be the total number of PM_ENTRY_BYTES-sized entries that
    fit in buffer.
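
    The fix, sketched: with len counting entries too, the full-buffer check
    in add_page_map() compares like with like:

    #define PM_ENTRY_BYTES  sizeof(pagemap_entry_t)

    /* allocation: pm.len is now a count of entries, not bytes */
    pm.len = PAGEMAP_WALK_SIZE >> PAGE_SHIFT;
    pm.buffer = kmalloc(pm.len * PM_ENTRY_BYTES, GFP_TEMPORARY);

    /* buffer-full check: both sides count entries */
    if (pm->pos >= pm->len)
            return PM_END_OF_BUFFER;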

    [akpm@linux-foundation.org: document pagemapread.pos and .len units, fix PM_ENTRY_BYTES definition]
    Signed-off-by: Yonghua Zheng
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    yonghua zheng
     
  • Andy reported that if a file page gets reclaimed we lose the soft-dirty
    bit if it was there, so save the _PAGE_BIT_SOFT_DIRTY bit when the page
    address gets encoded into the pte entry. Thus when a #pf happens on such
    a non-present pte we can restore it back.

    Reported-by: Andy Lutomirski
    Signed-off-by: Cyrill Gorcunov
    Acked-by: Pavel Emelyanov
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Cc: Peter Zijlstra
    Cc: "Aneesh Kumar K.V"
    Cc: Minchan Kim
    Cc: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Andy Lutomirski reported that if a page with the _PAGE_SOFT_DIRTY bit
    set gets swapped out, the bit is lost and no longer available when the
    pte is read back.

    To resolve this we introduce the _PTE_SWP_SOFT_DIRTY bit, which is saved
    in the pte entry for the page being swapped out. When such a page is to
    be read back from the swap cache we check for the bit's presence and, if
    it's there, we clear it and restore the former _PAGE_SOFT_DIRTY bit.

    One of the problems was to find a place in the pte entry where we can
    save the _PTE_SWP_SOFT_DIRTY bit while the page is in swap. _PAGE_PSE
    was chosen for that: it doesn't intersect with the swap entry format
    stored in the pte.
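
    On x86 the new helpers boil down to flag twiddling on the swap pte (a
    sketch; the bit is an alias for _PAGE_PSE as explained above):

    #define _PAGE_SWP_SOFT_DIRTY    _PAGE_PSE

    static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
    {
            return pte_set_flags(pte, _PAGE_SWP_SOFT_DIRTY);
    }

    static inline int pte_swp_soft_dirty(pte_t pte)
    {
            return pte_flags(pte) & _PAGE_SWP_SOFT_DIRTY;
    }

    static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
    {
            return pte_clear_flags(pte, _PAGE_SWP_SOFT_DIRTY);
    }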

    Reported-by: Andy Lutomirski
    Signed-off-by: Cyrill Gorcunov
    Acked-by: Pavel Emelyanov
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Cc: Peter Zijlstra
    Cc: "Aneesh Kumar K.V"
    Reviewed-by: Minchan Kim
    Reviewed-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     

04 Jul, 2013

2 commits

  • In order to reuse bits from pagemap entries gracefully, we leave the
    entries as is, but on pagemap open we emit a warning in dmesg that bits
    55-60 are about to change in a couple of releases. Next, if a user
    issues the soft-dirty clear command via the clear_refs file (it was
    disabled before v3.9) we assume that he's aware of the new pagemap
    format, note that fact, and report the bits in pagemap in the new
    manner.

    The "migration strategy" then looks like this:

    1. existing users are not affected -- they don't touch the soft-dirty
    feature, thus see old bits in pagemap, but are warned and have time to
    fix themselves
    2. those who use soft-dirty know about the new pagemap format
    3. some time soon we get rid of any signs of page-shift in pagemap, as
    well as this trick with clear-soft-dirty affecting the pagemap format.

    Signed-off-by: Pavel Emelyanov
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Glauber Costa
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • The soft-dirty bit is a bit on a PTE which helps to track which pages a
    task writes to. In order to do this tracking one should

    1. Clear the soft-dirty bits from the PTEs ("echo 4 > /proc/PID/clear_refs")
    2. Wait some time.
    3. Read the soft-dirty bits (the 55th bit in /proc/PID/pagemap2 entries)

    To do this tracking, the writable bit is cleared from PTEs when the
    soft-dirty bit is, as sketched below. Thus, after this, when the task
    tries to modify a page at some virtual address, the #PF occurs and the
    kernel sets the soft-dirty bit on the respective PTE.

    Note that although all of the task's address space is marked as r/o
    after the soft-dirty bits are cleared, the #PF-s that occur after that
    are processed fast. This is so since the pages are still mapped to
    physical memory, and thus all the kernel does is find this out and put
    back the writable, dirty and soft-dirty bits on the PTE.

    Another thing to note is that when mremap moves PTEs they are marked
    with soft-dirty as well, since from the user perspective mremap modifies
    the virtual memory at mremap's new address.
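
    The write-protect-and-clear step, sketched for x86 (clear_refs drives
    this for every pte in the task's VMAs):

    static inline void clear_soft_dirty(struct vm_area_struct *vma,
                                        unsigned long addr, pte_t *pte)
    {
            pte_t ptent = *pte;

            /* drop the write bit so the next write faults ... */
            ptent = pte_wrprotect(ptent);
            /* ... and the fault handler re-marks the pte soft-dirty */
            ptent = pte_clear_flags(ptent, _PAGE_SOFT_DIRTY);
            set_pte_at(vma->vm_mm, addr, pte, ptent);
    }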

    Signed-off-by: Pavel Emelyanov
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Glauber Costa
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov