23 Dec, 2014

1 commit

  • This reverts commit c8475d144abb1e62958cc5ec281d2a9e161c1946.

    There are several bug reports[1][2] that point to this commit as a
    potential cause[3].

    Let's revert it until we figure out what's going on.

    [1] https://lkml.org/lkml/2014/11/14/342
    [2] https://lkml.org/lkml/2014/12/22/213
    [3] https://lkml.org/lkml/2014/12/9/741

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Sasha Levin
    Acked-by: Davidlohr Bueso
    Cc: Hugh Dickins
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra (Intel)
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Cc: Mel Gorman
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

21 Dec, 2014

1 commit

  • Pull ACCESS_ONCE cleanup preparation from Christian Borntraeger:
    "kernel: Provide READ_ONCE and ASSIGN_ONCE

    As discussed on LKML http://marc.info/?i=54611D86.4040306%40de.ibm.com
    ACCESS_ONCE might fail with specific compilers for non-scalar
    accesses.

    Here is a set of patches to tackle that problem.

    The first patch introduces READ_ONCE and ASSIGN_ONCE. If the data
    structure is larger than the machine word size, memcpy is used and a
    warning is emitted. The next patches fix up several in-tree users of
    ACCESS_ONCE on non-scalar types.

    This does not yet contain a patch that forces ACCESS_ONCE to work only
    on scalar types. That is targeted for the next merge window, as
    linux-next already contains new offenders regarding ACCESS_ONCE vs.
    non-scalar types"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux:
    s390/kvm: REPLACE barrier fixup with READ_ONCE
    arm/spinlock: Replace ACCESS_ONCE with READ_ONCE
    arm64/spinlock: Replace ACCESS_ONCE with READ_ONCE
    mips/gup: Replace ACCESS_ONCE with READ_ONCE
    x86/gup: Replace ACCESS_ONCE with READ_ONCE
    x86/spinlock: Replace ACCESS_ONCE with READ_ONCE
    mm: replace ACCESS_ONCE with READ_ONCE or barriers
    kernel: Provide READ_ONCE and ASSIGN_ONCE

    Linus Torvalds
     

19 Dec, 2014

1 commit

  • Belatedly document the changes in commit f0c6d4d295e4 ("mm: introduce
    do_shared_fault() and drop do_fault()").

    Cc: Andi Kleen
    Cc: Bob Liu
    Cc: Dave Hansen
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Naoya Horiguchi
    Cc: Rik van Riel
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

18 Dec, 2014

2 commits

  • ACCESS_ONCE does not work reliably on non-scalar types. For
    example, gcc 4.6 and 4.7 might remove the volatile tag for such
    accesses during the SRA (scalar replacement of aggregates) step
    (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145).

    Let's change the gup code to access the page table elements with
    READ_ONCE, which performs implicit scalar accesses.

    mm_find_pmd is tricky, because m68k and sparc (32-bit) define pmd_t
    as an array of longs. That code only requires that the pmd_present
    and pmd_trans_huge checks are done on the same value, so a barrier
    is sufficient.

    A similar case is in handle_pte_fault. On ppc44x the word size is
    32 bit, but a pte is 64 bit. A barrier is ok as well.
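
    A sketch of the barrier variant in mm_find_pmd, following the
    description above (details may differ from the in-tree diff):

    pmd_t pmde;

    /*
     * pmd_t can be an array of longs here, so READ_ONCE cannot be
     * used; all that is required is that pmd_present() and
     * pmd_trans_huge() test the same snapshot, so a compiler
     * barrier after the read is sufficient.
     */
    pmde = *pmd;
    barrier();
    if (!pmd_present(pmde) || pmd_trans_huge(pmde))
            pmd = NULL;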

    Signed-off-by: Christian Borntraeger
    Cc: linux-mm@kvack.org
    Acked-by: Paul E. McKenney

    Christian Borntraeger
     
  • Dave Hansen reports that commit fb7332a9fedf ("mmu_gather: move minimal
    range calculations into generic code") caused a performance problem:

    "tlb_finish_mmu() goes up about 9x in the profiles (~0.4%->3.6%) and
    tlb_flush_mmu_free() takes about 3.1% of CPU time with the patch
    applied, but does not show up at all on the commit before"

    and the reason is that Will moved the test for whether we need to flush
    from tlb_flush_mmu() into tlb_flush_mmu_tlbonly(). But that meant that
    tlb_flush_mmu_free() basically lost that check.

    Move it back into tlb_flush_mmu() where it belongs, so that it covers
    both tlb_flush_mmu_tlbonly() _and_ tlb_flush_mmu_free().
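
    The resulting shape is roughly the following (a sketch consistent
    with the description above, not the verbatim diff):

    void tlb_flush_mmu(struct mmu_gather *tlb)
    {
            /* nothing batched: skip both the TLB invalidation and
             * the (expensive) batched page freeing */
            if (!tlb->end)
                    return;

            tlb_flush_mmu_tlbonly(tlb);
            tlb_flush_mmu_free(tlb);
    }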

    Reported-and-tested-by: Dave Hansen
    Acked-by: Will Deacon
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

16 Dec, 2014

1 commit

  • Pull drm updates from Dave Airlie:
    "Highlights:

    - AMD KFD driver merge

    This is the AMD HSA interface for exposing a lowlevel interface for
    GPGPU use. They have an open source userspace built on top of this
    interface, and the code looks as good as it was going to get out of
    tree.

    - Initial atomic modesetting work

    The need for an atomic modesetting interface to allow userspace to
    try and send a complete set of modesetting state to the driver has
    arisen, and been suffering from neglect this past year. No more,
    the start of the common code and changes for msm driver to use it
    are in this tree. Ongoing work to get the userspace ioctl finished
    and the code clean will probably wait until next kernel.

    - DisplayID 1.3 and tiled monitor exposed to userspace.

    Tiled monitor property is now exposed for userspace to make use of.

    - Rockchip drm driver merged.

    - imx gpu driver moved out of staging

    Other stuff:

    - core:
    panel - MIPI DSI + new panels.
    expose suggested x/y properties for virtual GPUs

    - i915:
    Initial Skylake (SKL) support
    gen3/4 reset work
    start of dri1/ums removal
    infoframe tracking
    fixes for lots of things.

    - nouveau:
    tegra k1 voltage support
    GM204 modesetting support
    GT21x memory reclocking work

    - radeon:
    CI dpm fixes
    GPUVM improvements
    Initial DPM fan control

    - rcar-du:
    HDMI support added
    removed some support for old boards
    slave encoder driver for Analog Devices adv7511

    - exynos:
    Exynos4415 SoC support

    - msm:
    a4xx gpu support
    atomic helper conversion

    - tegra:
    iommu support
    universal plane support
    ganged-mode DSI support

    - sti:
    HDMI i2c improvements

    - vmwgfx:
    some late fixes.

    - qxl:
    use suggested x/y properties"

    * 'drm-next' of git://people.freedesktop.org/~airlied/linux: (969 commits)
    drm: sti: fix module compilation issue
    drm/i915: save/restore GMBUS freq across suspend/resume on gen4
    drm: sti: correctly cleanup CRTC and planes
    drm: sti: add HQVDP plane
    drm: sti: add cursor plane
    drm: sti: enable auxiliary CRTC
    drm: sti: fix delay in VTG programming
    drm: sti: prepare sti_tvout to support auxiliary crtc
    drm: sti: use drm_crtc_vblank_{on/off} instead of drm_vblank_{on/off}
    drm: sti: fix hdmi avi infoframe
    drm: sti: remove event lock while disabling vblank
    drm: sti: simplify gdp code
    drm: sti: clear all mixer control
    drm: sti: remove gpio for HDMI hot plug detection
    drm: sti: allow to change hdmi ddc i2c adapter
    drm/doc: Document drm_add_modes_noedid() usage
    drm/i915: Remove '& 0xffff' from the mask given to WA_REG()
    drm/i915: Invert the mask and val arguments in wa_add() and WA_REG()
    drm: Zero out DRM object memory upon cleanup
    drm/i915/bdw: Fix the write setting up the WIZ hashing mode
    ...

    Linus Torvalds
     

14 Dec, 2014

3 commits

  • This lets drivers like the AMD IOMMUv2 driver handle faults a bit more
    simply, rather than doing tricks with page refs and get_user_pages().

    Signed-off-by: Jesse Barnes
    Cc: Oded Gabbay
    Cc: Joerg Roedel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesse Barnes
     
  • The unmap_mapping_range family of functions unmap user pages
    (ultimately via zap_page_range_single) without touching the actual
    interval tree, and can thus share the lock.
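
    A sketch of the call-site change this implies (assuming the i_mmap
    rwsem conversion earlier in this series):

    /* readers of the interval tree may now run concurrently */
    i_mmap_lock_read(mapping);
    if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
            unmap_mapping_range_tree(&mapping->i_mmap, &details);
    i_mmap_unlock_read(mapping);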

    Signed-off-by: Davidlohr Bueso
    Cc: "Kirill A. Shutemov"
    Acked-by: Hugh Dickins
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra (Intel)
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Convert all open coded mutex_lock/unlock calls to the
    i_mmap_[lock/unlock]_write() helpers.
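
    For reference, a sketch of the helpers being introduced (at this
    point in the series they still wrap the i_mmap mutex):

    static inline void i_mmap_lock_write(struct address_space *mapping)
    {
            mutex_lock(&mapping->i_mmap_mutex);
    }

    static inline void i_mmap_unlock_write(struct address_space *mapping)
    {
            mutex_unlock(&mapping->i_mmap_mutex);
    }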

    Signed-off-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Acked-by: "Kirill A. Shutemov"
    Acked-by: Hugh Dickins
    Cc: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Cc: Srikar Dronamraju
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

12 Dec, 2014

1 commit

  • Pull s390 updates from Martin Schwidefsky:
    "The most notable change for this pull request is the ftrace rework
    from Heiko. It brings a small performance improvement and the ground
    work to support a new gcc option to replace the mcount blocks with a
    single nop.

    Two new s390-specific system calls are added to emulate user space
    mmio for PCI, an artifact of how PCI memory is accessed.

    Two patches for memory management with changes to common code. For
    KVM, mm_forbids_zeropage is added, which disables the empty zero
    page for an mm that is used by a KVM process. And an optimization:
    pmdp_get_and_clear_full is added, analogous to ptep_get_and_clear_full.

    Some micro-optimizations for the cmpxchg and the spinlock code.

    And as usual bug fixes and cleanups"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (46 commits)
    s390/cputime: fix 31-bit compile
    s390/scm_block: make the number of reqs per HW req configurable
    s390/scm_block: handle multiple requests in one HW request
    s390/scm_block: allocate aidaw pages only when necessary
    s390/scm_block: use mempool to manage aidaw requests
    s390/eadm: change timeout value
    s390/mm: fix memory leak of ptlock in pmd_free_tlb
    s390: use local symbol names in entry[64].S
    s390/ptrace: always include vector registers in core files
    s390/simd: clear vector register pointer on fork/clone
    s390: translate cputime magic constants to macros
    s390/idle: convert open coded idle time seqcount
    s390/idle: add missing irq off lockdep annotation
    s390/debug: avoid function call for debug_sprintf_*
    s390/kprobes: fix instruction copy for out of line execution
    s390: remove diag 44 calls from cpu_relax()
    s390/dasd: retry partition detection
    s390/dasd: fix list corruption for sleep_on requests
    s390/dasd: fix infinite term I/O loop
    s390/dasd: remove unused code
    ...

    Linus Torvalds
     

10 Dec, 2014

1 commit

  • Pull arm64 updates from Will Deacon:
    "Here's the usual mixed bag of arm64 updates, also including some
    related EFI changes (Acked by Matt) and the MMU gather range cleanup
    (Acked by you).

    Changes include:
    - support for alternative instruction patching from Andre
    - seccomp from Akashi
    - some AArch32 instruction emulation, required by the Android folks
    - optimisations for exception entry/exit code, cmpxchg, pcpu atomics
    - mmu_gather range calculations moved into core code
    - EFI updates from Ard, including long-awaited SMBIOS support
    - /proc/cpuinfo fixes to align with the format used by arch/arm/
    - a few non-critical fixes across the architecture"

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (70 commits)
    arm64: remove the unnecessary arm64_swiotlb_init()
    arm64: add module support for alternatives fixups
    arm64: perf: Prevent wraparound during overflow
    arm64/include/asm: Fixed a warning about 'struct pt_regs'
    arm64: Provide a namespace to NCAPS
    arm64: bpf: lift restriction on last instruction
    arm64: Implement support for read-mostly sections
    arm64: compat: align cacheflush syscall with arch/arm
    arm64: add seccomp support
    arm64: add SIGSYS siginfo for compat task
    arm64: add seccomp syscall for compat task
    asm-generic: add generic seccomp.h for secure computing mode 1
    arm64: ptrace: allow tracer to skip a system call
    arm64: ptrace: add NT_ARM_SYSTEM_CALL regset
    arm64: Move some head.text functions to executable section
    arm64: jump labels: NOP out NOP -> NOP replacement
    arm64: add support to dump the kernel page tables
    arm64: Add FIX_HOLE to permanent fixed addresses
    arm64: alternatives: fix pr_fmt string for consistency
    arm64: vmlinux.lds.S: don't discard .exit.* sections at link-time
    ...

    Linus Torvalds
     

08 Dec, 2014

1 commit

  • Linux 3.18

    Backmerge Linus tree into -next as we had conflicts in i915/radeon/nouveau,
    and everyone was solving them individually.

    * tag 'v3.18': (57 commits)
    Linux 3.18
    watchdog: s3c2410_wdt: Fix the mask bit offset for Exynos7
    uapi: fix to export linux/vm_sockets.h
    i2c: cadence: Set the hardware time-out register to maximum value
    i2c: davinci: generate STP always when NACK is received
    ahci: disable MSI on SAMSUNG 0xa800 SSD
    context_tracking: Restore previous state in schedule_user
    slab: fix nodeid bounds check for non-contiguous node IDs
    lib/genalloc.c: export devm_gen_pool_create() for modules
    mm: fix anon_vma_clone() error treatment
    mm: fix swapoff hang after page migration and fork
    fat: fix oops on corrupted vfat fs
    ipc/sem.c: fully initialize sem_array before making it visible
    drivers/input/evdev.c: don't kfree() a vmalloc address
    cxgb4: Fill in supported link mode for SFP modules
    xen-netfront: Remove BUGs on paged skb data which crosses a page boundary
    mm/vmpressure.c: fix race in vmpressure_work_fn()
    mm: frontswap: invalidate expired data on a dup-store failure
    mm: do not overwrite reserved pages counter at show_mem()
    drm/radeon: kernel panic in drm_calc_vbltimestamp_from_scanoutpos with 3.18.0-rc6
    ...

    Conflicts:
    drivers/gpu/drm/i915/intel_display.c
    drivers/gpu/drm/nouveau/nouveau_drm.c
    drivers/gpu/drm/radeon/radeon_cs.c

    Dave Airlie
     

04 Dec, 2014

1 commit

  • I've been seeing swapoff hangs in recent testing: it's cycling around
    trying unsuccessfully to find an mm for some remaining pages of swap.

    I have been exercising swap and page migration more heavily recently,
    and now notice a long-standing error in copy_one_pte(): it's trying to
    add dst_mm to swapoff's mmlist when it finds a swap entry, but is doing
    so even when it's a migration entry or an hwpoison entry.

    Which wouldn't matter much, except it adds dst_mm next to src_mm,
    assuming src_mm is already on the mmlist: which may not be so. Then if
    pages are later swapped out from dst_mm, swapoff won't be able to find
    where to replace them.

    There's already a !non_swap_entry() test for stats: move that up before
    the swap_duplicate() and the addition to mmlist.
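
    A simplified sketch of the reordered copy_one_pte() logic (the
    migration and hwpoison branches are elided):

    if (unlikely(!pte_present(pte)) && !pte_file(pte)) {
            swp_entry_t entry = pte_to_swp_entry(pte);

            if (likely(!non_swap_entry(entry))) {
                    /* genuine swap entry: only now take a reference
                     * and put dst_mm on swapoff's mmlist */
                    if (swap_duplicate(entry) < 0)
                            return entry.val;

                    /* src_mm holds a real swap entry, so it is
                     * already on the mmlist */
                    if (unlikely(list_empty(&dst_mm->mmlist))) {
                            spin_lock(&mmlist_lock);
                            if (list_empty(&dst_mm->mmlist))
                                    list_add(&dst_mm->mmlist,
                                             &src_mm->mmlist);
                            spin_unlock(&mmlist_lock);
                    }
                    rss[MM_SWAPENTS]++;
            }
            /* migration/hwpoison entries never touch the mmlist */
    }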

    Signed-off-by: Hugh Dickins
    Cc: Kelley Nielsen
    Cc: [2.6.18+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

17 Nov, 2014

1 commit

  • On architectures with hardware broadcasting of TLB invalidation
    messages, it makes sense to reduce the range of the mmu_gather
    structure when unmapping page ranges based on the dirty address
    information passed to tlb_remove_tlb_entry.

    arm64 already does this by directly manipulating the start/end fields
    of the gather structure, but this confuses the generic code which
    does not expect these fields to change and can end up calculating
    invalid, negative ranges when forcing a flush in zap_pte_range.

    This patch moves the minimal range calculation out of the arm64 code
    and into the generic implementation, simplifying zap_pte_range in the
    process (which no longer needs to care about start/end, since they will
    point to the appropriate ranges already). With the range being tracked
    by core code, the need_flush flag is dropped in favour of checking that
    the end of the range has actually been set.
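
    The core of the generic tracking is a small range-adjust helper,
    roughly (a sketch; names follow the description above):

    static inline void __tlb_adjust_range(struct mmu_gather *tlb,
                                          unsigned long address)
    {
            tlb->start = min(tlb->start, address);
            tlb->end = max(tlb->end, address + PAGE_SIZE);
    }

    with "do we need to flush?" becoming a check that tlb->end != 0.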

    Cc: Benjamin Herrenschmidt
    Cc: Peter Zijlstra
    Cc: Russell King - ARM Linux
    Cc: Michal Simek
    Acked-by: Linus Torvalds
    Signed-off-by: Will Deacon

    Will Deacon
     

29 Oct, 2014

1 commit

  • When unmapping a range of pages in zap_pte_range, the page being
    unmapped is added to an mmu_gather_batch structure for asynchronous
    freeing. If we run out of space in the batch structure before the range
    has been completely unmapped, then we break out of the loop, force a
    TLB flush and free the pages that we have batched so far. If there are
    further pages to unmap, then we resume the loop where we left off.

    Unfortunately, we forget to update addr when we break out of the loop,
    which causes us to truncate the range being invalidated as the end
    address is exclusive. When we re-enter the loop at the same address, the
    page has already been freed and the pte_present test will fail, meaning
    that we do not reconsider the address for invalidation.

    This patch fixes the problem by incrementing addr by the PAGE_SIZE
    before breaking out of the loop on batch failure.
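
    The fix, sketched at the batching point in zap_pte_range():

    if (unlikely(!__tlb_remove_page(tlb, page))) {
            force_flush = 1;
            /* step past the page just batched, so the forced
             * flush covers it and the loop resumes at the first
             * unprocessed address */
            addr += PAGE_SIZE;
            break;
    }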

    Signed-off-by: Will Deacon
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Will Deacon
     

14 Oct, 2014

1 commit

  • For VMAs that don't want write notifications, PTEs created for read faults
    have their write bit set. If the read fault happens after VM_SOFTDIRTY is
    cleared, then the PTE's softdirty bit will remain clear after subsequent
    writes.

    Here's a simple code snippet to demonstrate the bug:

    char *m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
                   MAP_ANONYMOUS | MAP_SHARED, -1, 0);
    system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
    assert(*m == '\0');     /* new PTE allows write access */
    assert(!soft_dirty(m)); /* soft_dirty() reads bit 55 of the page's
                               /proc/$PPID/pagemap entry */
    *m = 'x';               /* should dirty the page */
    assert(soft_dirty(m));  /* fails */

    With this patch, write notifications are enabled when VM_SOFTDIRTY is
    cleared. Furthermore, to avoid unnecessary faults, write notifications
    are disabled when VM_SOFTDIRTY is set.

    As a side effect of enabling and disabling write notifications with
    care, this patch fixes a bug in mprotect where vm_page_prot bits set by
    drivers were zapped on mprotect. An analogous bug was fixed in mmap by
    commit c9d0bf241451 ("mm: uncached vma support with writenotify").

    Signed-off-by: Peter Feiner
    Reported-by: Peter Feiner
    Suggested-by: Kirill A. Shutemov
    Cc: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Jamie Liu
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: Bjorn Helgaas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Feiner
     

26 Sep, 2014

1 commit

  • This fixes the same bug as b43790eedd31 ("mm: softdirty: don't forget to
    save file map softdiry bit on unmap") and 9aed8614af5a ("mm/memory.c:
    don't forget to set softdirty on file mapped fault") where the return
    value of pte_*mksoft_dirty was being ignored.

    To be sure that no other pte/pmd "mk" function return values were being
    ignored, I annotated the functions in arch/x86/include/asm/pgtable.h
    with __must_check and rebuilt.

    The userspace effect of this bug is that the softdirty mark might be
    lost if a file mapped pte gets zapped.
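
    The pattern being fixed, sketched for the zap path (ptfile is the
    file pte about to be written back):

    /* buggy: the new pte is returned, so the result was dropped */
    if (pte_soft_dirty(ptent))
            pte_file_mksoft_dirty(ptfile);

    /* fixed: assign the returned value back */
    if (pte_soft_dirty(ptent))
            ptfile = pte_file_mksoft_dirty(ptfile);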

    Signed-off-by: Peter Feiner
    Acked-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Jamie Liu
    Cc: Hugh Dickins
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Feiner
     

23 Sep, 2014

1 commit

  • Pull KVM fixes from Paolo Bonzini:
    "Two very simple bugfixes, affecting all supported architectures"

    [ Two? There's three commits in here. Oh well, I guess Paolo didn't
    count the preparatory symbol export ]

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
    KVM: correct null pid check in kvm_vcpu_yield_to()
    KVM: check for !is_zero_pfn() in kvm_is_mmio_pfn()
    mm: export symbol dependencies of is_zero_pfn()

    Linus Torvalds
     

14 Sep, 2014

1 commit

  • In order to make the static inline function is_zero_pfn() callable by
    modules, export its symbol dependencies 'zero_pfn' and (for s390 and
    mips) 'zero_page_mask'.

    We need this for KVM, as CONFIG_KVM is a tristate for all supported
    architectures except ARM and arm64, and testing a pfn whether it refers
    to the zero page is required to correctly distinguish the zero page
    from other special RAM ranges that may also have the PG_reserved bit
    set, but need to be treated as MMIO memory.
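
    The export itself is a one-liner per symbol, e.g. in mm/memory.c:

    unsigned long zero_pfn __read_mostly;
    EXPORT_SYMBOL(zero_pfn);

    with s390 and mips additionally exporting their zero_page_mask.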

    Signed-off-by: Ard Biesheuvel
    Acked-by: Andrew Morton
    Signed-off-by: Paolo Bonzini

    Ard Biesheuvel
     

30 Aug, 2014

1 commit

  • Sasha Levin has shown oopses on ffffea0003480048 and ffffea0003480008 at
    mm/memory.c:1132, running Trinity on different 3.16-rc-next kernels:
    where zap_pte_range() checks page->mapping to see if PageAnon(page).

    Those addresses fit struct pages for pfns d2001 and d2000, and in each
    dump a register or a stack slot showed d2001730 or d2000730: pte flags
    0x730 are PCD ACCESSED PROTNONE SPECIAL IOMAP; and Sasha's e820 map has
    a hole between cfffffff and 100000000, which would need special access.

    Commit c46a7c817e66 ("x86: define _PAGE_NUMA by reusing software bits on
    the PMD and PTE levels") has broken vm_normal_page(): a PROTNONE SPECIAL
    pte no longer passes the pte_special() test, so zap_pte_range() goes on
    to try to access a non-existent struct page.

    Fix this by refining pte_special() (SPECIAL with PRESENT or PROTNONE) to
    complement pte_numa() (SPECIAL with neither PRESENT nor PROTNONE). A
    hint that this was a problem was that c46a7c817e66 added pte_numa() test
    to vm_normal_page(), and moved its is_zero_pfn() test from slow to fast
    path: This was papering over a pte_special() snag when the zero page was
    encountered during zap. This patch reverts vm_normal_page() to how it
    was before, relying on pte_special().

    It still appears that this patch may be incomplete: aren't there other
    places which need to be handling PROTNONE along with PRESENT? For
    example, pte_mknuma() clears _PAGE_PRESENT and sets _PAGE_NUMA, but on a
    PROT_NONE area, that would make it pte_special(). This is side-stepped
    by the fact that NUMA hinting faults skipped PROT_NONE VMAs and there
    are no grounds where a NUMA hinting fault on a PROT_NONE VMA would be
    interesting.
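
    A sketch of the refined x86 test described above (the exact in-tree
    expression may differ):

    static inline int pte_special(pte_t pte)
    {
            /* SPECIAL counts only with PRESENT or PROTNONE set;
             * SPECIAL with neither is a NUMA hinting pte, which is
             * covered by pte_numa() instead */
            return (pte_flags(pte) & _PAGE_SPECIAL) &&
                    (pte_flags(pte) & (_PAGE_PRESENT | _PAGE_PROTNONE));
    }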

    Fixes: c46a7c817e66 ("x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels")
    Reported-by: Sasha Levin
    Tested-by: Sasha Levin
    Signed-off-by: Hugh Dickins
    Signed-off-by: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Cyrill Gorcunov
    Cc: Matthew Wilcox
    Cc: [3.16]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

09 Aug, 2014

3 commits

  • The core mm code will provide a default gate area based on
    FIXADDR_USER_START and FIXADDR_USER_END if
    !defined(__HAVE_ARCH_GATE_AREA) && defined(AT_SYSINFO_EHDR).

    This default is only useful for ia64. arm64, ppc, s390, sh, tile, 64-bit
    UML, and x86_32 have their own code just to disable it. arm, 32-bit UML,
    and x86_64 have gate areas, but they have their own implementations.

    This gets rid of the default and moves the code into ia64.

    This should save some code on architectures without a gate area: it's now
    possible to inline the gate_area functions in the default case.

    Signed-off-by: Andy Lutomirski
    Acked-by: Nathan Lynch
    Acked-by: H. Peter Anvin
    Acked-by: Benjamin Herrenschmidt [in principle]
    Acked-by: Richard Weinberger [for um]
    Acked-by: Will Deacon [for arm64]
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Chris Metcalf
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Nathan Lynch
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • The memcg uncharging code that is involved towards the end of a page's
    lifetime - truncation, reclaim, swapout, migration - is impressively
    complicated and fragile.

    Because anonymous and file pages were always charged before they had their
    page->mapping established, uncharges had to happen when the page type
    could still be known from the context; as in unmap for anonymous, page
    cache removal for file and shmem pages, and swap cache truncation for swap
    pages. However, these operations happen well before the page is actually
    freed, and so a lot of synchronization is necessary:

    - Charging, uncharging, page migration, and charge migration all need
    to take a per-page bit spinlock as they could race with uncharging.

    - Swap cache truncation happens during both swap-in and swap-out, and
    possibly repeatedly before the page is actually freed. This means
    that the memcg swapout code is called from many contexts that make
    no sense and it has to figure out the direction from page state to
    make sure memory and memory+swap are always correctly charged.

    - On page migration, the old page might be unmapped but then reused,
    so memcg code has to prevent untimely uncharging in that case.
    Because this code - which should be a simple charge transfer - is so
    special-cased, it is not reusable for replace_page_cache().

    But now that charged pages always have a page->mapping, introduce
    mem_cgroup_uncharge(), which is called after the final put_page(), when we
    know for sure that nobody is looking at the page anymore.

    For page migration, introduce mem_cgroup_migrate(), which is called after
    the migration is successful and the new page is fully rmapped. Because
    the old page is no longer uncharged after migration, prevent double
    charges by decoupling the page's memcg association (PCG_USED and
    pc->mem_cgroup) from the page holding an actual charge. The new bits
    PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
    to the new page during migration.

    mem_cgroup_migrate() is suitable for replace_page_cache() as well,
    which gets rid of mem_cgroup_replace_page_cache(). However, care
    needs to be taken because both the source and the target page can
    already be charged and on the LRU when fuse is splicing: grab the page
    lock on the charge moving side to prevent changing pc->mem_cgroup of a
    page under migration. Also, the lruvecs of both pages change as we
    uncharge the old and charge the new during migration, and putback may
    race with us, so grab the lru lock and isolate the pages iff on LRU to
    prevent races and ensure the pages are on the right lruvec afterward.

    Swap accounting is massively simplified: because the page is no longer
    uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
    transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
    before the final put_page() in page reclaim.

    Finally, page_cgroup changes are now protected by whatever protection the
    page itself offers: anonymous pages are charged under the page table lock,
    whereas page cache insertions, swapin, and migration hold the page lock.
    Uncharging happens under full exclusion with no outstanding references.
    Charging and uncharging also ensure that the page is off-LRU, which
    serializes against charge migration. Remove the very costly page_cgroup
    lock and set pc->flags non-atomically.

    [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
    [vdavydov@parallels.com: fix flags definition]
    Signed-off-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Tested-by: Jet Chen
    Acked-by: Michal Hocko
    Tested-by: Felipe Balbi
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • These patches rework memcg charge lifetime to integrate more naturally
    with the lifetime of user pages. This drastically simplifies the code and
    reduces charging and uncharging overhead. The most expensive part of
    charging and uncharging is the page_cgroup bit spinlock, which is removed
    entirely after this series.

    Here are the top-10 profile entries of a stress test that reads a 128G
    sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
    executing in the root memcg). Before:

    15.36% cat [kernel.kallsyms] [k] copy_user_generic_string
    13.31% cat [kernel.kallsyms] [k] memset
    11.48% cat [kernel.kallsyms] [k] do_mpage_readpage
    4.23% cat [kernel.kallsyms] [k] get_page_from_freelist
    2.38% cat [kernel.kallsyms] [k] put_page
    2.32% cat [kernel.kallsyms] [k] __mem_cgroup_commit_charge
    2.18% kswapd0 [kernel.kallsyms] [k] __mem_cgroup_uncharge_common
    1.92% kswapd0 [kernel.kallsyms] [k] shrink_page_list
    1.86% cat [kernel.kallsyms] [k] __radix_tree_lookup
    1.62% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn

    After:

    15.67% cat [kernel.kallsyms] [k] copy_user_generic_string
    13.48% cat [kernel.kallsyms] [k] memset
    11.42% cat [kernel.kallsyms] [k] do_mpage_readpage
    3.98% cat [kernel.kallsyms] [k] get_page_from_freelist
    2.46% cat [kernel.kallsyms] [k] put_page
    2.13% kswapd0 [kernel.kallsyms] [k] shrink_page_list
    1.88% cat [kernel.kallsyms] [k] __radix_tree_lookup
    1.67% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn
    1.39% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk
    1.30% cat [kernel.kallsyms] [k] kfree

    As you can see, the memcg footprint has shrunk quite a bit.

    text data bss dec hex filename
    37970 9892 400 48262 bc86 mm/memcontrol.o.old
    35239 9892 400 45531 b1db mm/memcontrol.o

    This patch (of 4):

    The memcg charge API charges pages before they are rmapped - i.e. have an
    actual "type" - and so every callsite needs its own set of charge and
    uncharge functions to know what type is being operated on. Worse,
    uncharge has to happen from a context that is still type-specific, rather
    than at the end of the page's lifetime with exclusive access, and so
    requires a lot of synchronization.

    Rewrite the charge API to provide a generic set of try_charge(),
    commit_charge() and cancel_charge() transaction operations, much like
    what's currently done for swap-in:

    mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
    pages from the memcg if necessary.

    mem_cgroup_commit_charge() commits the page to the charge once it
    has a valid page->mapping and PageAnon() reliably tells the type.

    mem_cgroup_cancel_charge() aborts the transaction.

    This reduces the charge API and enables subsequent patches to
    drastically simplify uncharging.

    As pages need to be committed after rmap is established but before they
    are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
    additions again. Revive lru_cache_add_active_or_unevictable().
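
    A sketch of the resulting call pattern at an anonymous fault site
    (shape taken from the description above; details assumed):

    struct mem_cgroup *memcg;

    if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg))
            goto oom;       /* reserve failed even after reclaim */

    /* rmap first, so the page has a type to commit against */
    page_add_new_anon_rmap(page, vma, address);
    mem_cgroup_commit_charge(page, memcg, false);
    lru_cache_add_active_or_unevictable(page, vma);

    with mem_cgroup_cancel_charge(page, memcg) on any error path between
    try and commit.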

    [hughd@google.com: fix shmem_unuse]
    [hughd@google.com: Add comments on the private use of -EAGAIN]
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Signed-off-by: Hugh Dickins
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

07 Aug, 2014

7 commits

  • This patch changes confusing #ifdef use in __access_remote_vm into
    merely ugly #ifdef use.

    Addresses bug https://bugzilla.kernel.org/show_bug.cgi?id=81651

    Signed-off-by: Rik van Riel
    Reported-by: David Binderman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • fault_around_bytes can only be changed via debugfs. Let's mark it
    read-mostly.
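
    That is, a one-line annotation (sketch; the initial value is the 64k
    default from the earlier faultaround rework):

    static unsigned long fault_around_bytes __read_mostly =
            rounddown_pow_of_two(65536);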

    Signed-off-by: Kirill A. Shutemov
    Suggested-by: David Rientjes
    Acked-by: David Rientjes
    Cc: Dave Hansen
    Cc: Andrey Ryabinin
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Things can go wrong if fault_around_bytes is changed while
    do_fault_around() is running: between its uses in
    fault_around_mask() and fault_around_pages().

    Let's read fault_around_bytes only once during do_fault_around() and
    calculate the mask based on that reading.

    Note: fault_around_bytes can only be updated via the debug
    interface. Also, I've tried but was not able to trigger bad
    behaviour without the patch, so I would not consider this patch
    urgent.
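
    A sketch of the single-read version (assumed shape, using the
    pre-READ_ONCE-era ACCESS_ONCE):

    static void do_fault_around(struct vm_area_struct *vma,
                    unsigned long address, pte_t *pte, pgoff_t pgoff,
                    unsigned int flags)
    {
            unsigned long start_addr, nr_pages, mask;

            /* one snapshot: the mask and page count are derived
             * from the same value even if the debugfs knob changes
             * concurrently */
            nr_pages = ACCESS_ONCE(fault_around_bytes) >> PAGE_SHIFT;
            mask = ~(nr_pages * PAGE_SIZE - 1) & PAGE_MASK;
            /* rest unchanged */
    }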

    Signed-off-by: Kirill A. Shutemov
    Cc: Dave Hansen
    Cc: Andrey Ryabinin
    Cc: Sasha Levin
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Add a comment describing the circumstances in which
    __lock_page_or_retry() will or will not release the mmap_sem when
    returning 0.

    Add comments to lock_page_or_retry()'s callers (filemap_fault(),
    do_swap_page()) noting the impact on VM_FAULT_RETRY returns.

    Add comments on up the call tree, particularly replacing the false "We
    return with mmap_sem still held" comments.

    Signed-off-by: Paul Cassella
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Cassella
     
  • Otherwise we may not notice that a pte was softdirty, because the
    pte_mksoft_dirty helper _returns_ the new pte but doesn't modify its
    argument.

    If a page fault happened on a dirty file mapping, the newly created
    pte could lose its softdirty bit; thus, if a userspace program is
    tracking memory changes with a memory tracker (CONFIG_MEM_SOFT_DIRTY),
    it might miss a modification of a memory page (which in the worst
    case may lead to data inconsistency).

    Signed-off-by: Cyrill Gorcunov
    Acked-by: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Commit 71e3aac0724f ("thp: transparent hugepage core") added a
    copy_pte_range prototype to huge_mm.h. I'm not sure why (or if) this
    function has ever been used outside of memory.c, but it currently
    isn't. This patch makes copy_pte_range() static again.

    Signed-off-by: Jerome Marchand
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     
  • Use ACCESS_ONCE() in handle_pte_fault() when getting the entry or
    orig_pte upon which all subsequent decisions and pte_same() tests will
    be made.

    I have no evidence that its lack is responsible for the mm/filemap.c:202
    BUG_ON(page_mapped(page)) in __delete_from_page_cache() found by
    trinity, and I am not optimistic that it will fix it. But I have found
    no other explanation, and ACCESS_ONCE() here will surely not hurt.

    If gcc does re-access the pte before passing it down, then that would be
    disastrous for correct page fault handling, and certainly could explain
    the page_mapped() BUGs seen (concurrent fault causing page to be mapped
    in a second time on top of itself: mapcount 2 for a single pte).
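
    The change itself is one line in handle_pte_fault (sketch):

    /* snapshot the pte once; every later pte_same() test compares
     * against this copy, not a fresh re-read of *pte */
    entry = ACCESS_ONCE(*pte);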

    Signed-off-by: Hugh Dickins
    Cc: Sasha Levin
    Cc: Linus Torvalds
    Cc: "Kirill A. Shutemov"
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

31 Jul, 2014

1 commit

  • do_fault_around() expects fault_around_bytes rounded down to the
    nearest page order. Instead of calling rounddown_pow_of_two every
    time in fault_around_pages()/fault_around_mask(), we can round down
    once, when the user changes fault_around_bytes via the debugfs
    interface.

    This also fixes a bug when the user sets fault_around_bytes to 0:
    the result of rounddown_pow_of_two(0) is not defined, so
    fault_around_bytes == 0 didn't work without this patch.

    Let's set fault_around_bytes to PAGE_SIZE if the user sets it to
    something less than PAGE_SIZE.
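
    A sketch of the debugfs setter after this change (names assumed from
    the description above):

    static int fault_around_bytes_set(void *data, u64 val)
    {
            if (val / PAGE_SIZE > PTRS_PER_PTE)
                    return -EINVAL;

            /* rounddown_pow_of_two(0) is undefined, so clamp up front */
            if (val > PAGE_SIZE)
                    fault_around_bytes = rounddown_pow_of_two(val);
            else
                    fault_around_bytes = PAGE_SIZE;
            return 0;
    }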

    [akpm@linux-foundation.org: tweak code layout]
    Fixes: a9b0f861 ("mm: nominate faultaround area in bytes rather than page order")
    Signed-off-by: Andrey Ryabinin
    Reported-by: Sasha Levin
    Acked-by: Kirill A. Shutemov
    Cc: [3.15.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

24 Jul, 2014

1 commit

  • Ingo Korb reported that "repeated mapping of the same file on tmpfs
    using remap_file_pages sometimes triggers a BUG at mm/filemap.c:202 when
    the process exits".

    He bisected the bug to d7c1755179b8 ("mm: implement ->map_pages for
    shmem/tmpfs"), although the bug was actually added by commit
    8c6e50b0290c ("mm: introduce vm_ops->map_pages()").

    The problem is caused by calling do_fault_around for a _non-linear_
    fault. In this case pgoff is shifted and might become negative
    during the calculation.

    Faulting around a non-linear page fault makes no sense and breaks
    the logic in do_fault_around, because pgoff is shifted.
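
    The guard, sketched at the do_read_fault() call site (assumed shape):

    /* fault around only for linear faults, where pgoff really
     * corresponds to the file offset */
    if (vma->vm_ops->map_pages && !(flags & FAULT_FLAG_NONLINEAR) &&
        fault_around_pages() > 1) {
            pte = pte_offset_map_lock(mm, pmd, address, &ptl);
            do_fault_around(vma, address, pte, pgoff, flags);
    }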

    Signed-off-by: Konstantin Khlebnikov
    Reported-by: Ingo Korb
    Tested-by: Ingo Korb
    Cc: Hugh Dickins
    Cc: Sasha Levin
    Cc: Dave Jones
    Cc: Ning Qu
    Cc: "Kirill A. Shutemov"
    Cc: [3.15.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

05 Jun, 2014

5 commits

  • Some clarification on how faultaround works.

    [akpm@linux-foundation.org: tweak comment text]
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • There is evidence that the faultaround feature is less relevant on
    architectures with a page size bigger than 4k, which makes sense
    since the page fault overhead per byte of mapped area should be
    lower there.

    Let's rework the feature to specify the faultaround area in bytes
    instead of page order. It's 64 kilobytes for now.

    The patch effectively disables faultaround on architectures with a
    page size >= 64k (like ppc64).

    It's possible that some other size of faultaround area is relevant
    for a platform. We can expose the `fault_around_bytes' variable to
    arch-specific code once such platforms are found.

    Signed-off-by: Kirill A. Shutemov
    Cc: Rusty Russell
    Cc: Hugh Dickins
    Cc: Madhavan Srinivasan
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andi Kleen
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • mm/memory.c is overloaded: over 4k lines. The get_user_pages() code
    is pretty much self-contained, so let's move it to a separate file.

    No other changes made.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • _PAGE_NUMA is currently an alias of _PROT_PROTNONE to trap NUMA
    hinting faults on x86. Care is taken such that _PAGE_NUMA is used
    only in situations where the VMA flags distinguish between NUMA
    hinting faults and prot_none faults. This decision was x86-specific,
    and conceptually it is difficult, requiring special casing to
    distinguish between PROTNONE and NUMA ptes based on context.

    Fundamentally, we only need the _PAGE_NUMA bit to tell the difference
    between an entry that is really unmapped and a page that is protected
    for NUMA hinting faults as if the PTE is not present then a fault will
    be trapped.

    Swap PTEs on x86-64 use the bits after _PAGE_GLOBAL for the offset.
    This patch shrinks the maximum possible swap size and uses the bit to
    uniquely distinguish between NUMA hinting ptes and swap ptes.
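
    A sketch of the x86-64 bit layout change described above (names
    assumed):

    /* reuse the software bit just above _PAGE_BIT_GLOBAL */
    #define _PAGE_BIT_NUMA  (_PAGE_BIT_GLOBAL + 1)
    #define _PAGE_NUMA      (_AT(pteval_t, 1) << _PAGE_BIT_NUMA)

    /* swap offsets now start one bit higher, halving max swap size */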

    Signed-off-by: Mel Gorman
    Cc: David Vrabel
    Cc: Ingo Molnar
    Cc: Peter Anvin
    Cc: Fengguang Wu
    Cc: Linus Torvalds
    Cc: Steven Noonan
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Srikar Dronamraju
    Cc: Cyrill Gorcunov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

07 May, 2014

1 commit

  • Changing PTEs and PMDs to pte_numa & pmd_numa is done with the
    mmap_sem held for reading, which means a pmd can be instantiated
    and turned into a numa one while __handle_mm_fault() is examining
    the value of old_pmd.

    If that happens, __handle_mm_fault() should just return and let
    the page fault retry, instead of throwing an oops. This is
    handled by the test for pmd_trans_huge(*pmd) below.
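
    The retry logic the changelog refers to, as a sketch:

    /* a huge pmd can materialize from under us, since mmap_sem is
     * only held for reading: bail out and let the fault retry */
    if (unlikely(pmd_trans_huge(*pmd)))
            return 0;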

    Signed-off-by: Rik van Riel
    Reviewed-by: Naoya Horiguchi
    Reported-by: Sunil Pandey
    Signed-off-by: Peter Zijlstra
    Cc: Andrew Morton
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: linux-mm@kvack.org
    Cc: lwoodman@redhat.com
    Cc: dave.hansen@intel.com
    Link: http://lkml.kernel.org/r/20140429153615.2d72098e@annuminas.surriel.com
    Signed-off-by: Ingo Molnar

    Rik van Riel