23 Dec, 2014

1 commit

  • This reverts commit c8475d144abb1e62958cc5ec281d2a9e161c1946.

    There are several bug reports[1][2] that point to this commit as a
    potential cause[3].

    Let's revert it until we figure out what's going on.

    [1] https://lkml.org/lkml/2014/11/14/342
    [2] https://lkml.org/lkml/2014/12/22/213
    [3] https://lkml.org/lkml/2014/12/9/741

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Sasha Levin
    Acked-by: Davidlohr Bueso
    Cc: Hugh Dickins
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra (Intel)
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Cc: Mel Gorman
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

21 Dec, 2014

1 commit

  • Pull ACCESS_ONCE cleanup preparation from Christian Borntraeger:
    "kernel: Provide READ_ONCE and ASSIGN_ONCE

    As discussed on LKML http://marc.info/?i=54611D86.4040306%40de.ibm.com
    ACCESS_ONCE might fail with specific compilers for non-scalar
    accesses.

    Here is a set of patches to tackle that problem.

    The first patch introduces READ_ONCE and ASSIGN_ONCE. If the data
    structure is larger than the machine word size, memcpy is used and a
    warning is emitted. The next patches fix up several in-tree users of
    ACCESS_ONCE on non-scalar types.

    This does not yet contain a patch that forces ACCESS_ONCE to work only
    on scalar types. That is targeted for the next merge window, as
    linux-next already contains new offenders regarding ACCESS_ONCE vs.
    non-scalar types"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux:
    s390/kvm: REPLACE barrier fixup with READ_ONCE
    arm/spinlock: Replace ACCESS_ONCE with READ_ONCE
    arm64/spinlock: Replace ACCESS_ONCE with READ_ONCE
    mips/gup: Replace ACCESS_ONCE with READ_ONCE
    x86/gup: Replace ACCESS_ONCE with READ_ONCE
    x86/spinlock: Replace ACCESS_ONCE with READ_ONCE
    mm: replace ACCESS_ONCE with READ_ONCE or barriers
    kernel: Provide READ_ONCE and ASSIGN_ONCE

    Linus Torvalds
     

19 Dec, 2014

1 commit

  • Belatedly document the changes in commit f0c6d4d295e4 ("mm: introduce
    do_shared_fault() and drop do_fault()").

    Cc: Andi Kleen
    Cc: Bob Liu
    Cc: Dave Hansen
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Naoya Horiguchi
    Cc: Rik van Riel
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

18 Dec, 2014

2 commits

  • ACCESS_ONCE does not work reliably on non-scalar types. For
    example, gcc 4.6 and 4.7 might remove the volatile tag for such
    accesses during the SRA (scalar replacement of aggregates) step
    (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145).

    Let's change the gup code to access the page table elements with
    READ_ONCE, which performs implicit scalar accesses.

    mm_find_pmd is tricky, because m68k and sparc (32-bit) define pmd_t
    as an array of longs. That code only requires that the pmd_present
    and pmd_trans_huge checks are done on the same value, so a barrier
    is sufficient.

    A similar case is in handle_pte_fault. On ppc44x the word size is
    32 bit, but a pte is 64 bit. A barrier is ok as well.
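
    A sketch of the barrier variant in mm_find_pmd, following the
    description above (details may differ from the in-tree diff):

    pmd_t pmde;

    /*
     * pmd_t can be an array of longs here, so READ_ONCE cannot be
     * used; all that is required is that pmd_present() and
     * pmd_trans_huge() test the same snapshot, so a compiler
     * barrier after the read is sufficient.
     */
    pmde = *pmd;
    barrier();
    if (!pmd_present(pmde) || pmd_trans_huge(pmde))
            pmd = NULL;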

    Signed-off-by: Christian Borntraeger
    Cc: linux-mm@kvack.org
    Acked-by: Paul E. McKenney

    Christian Borntraeger
     
  • Dave Hansen reports that commit fb7332a9fedf ("mmu_gather: move minimal
    range calculations into generic code") caused a performance problem:

    "tlb_finish_mmu() goes up about 9x in the profiles (~0.4%->3.6%) and
    tlb_flush_mmu_free() takes about 3.1% of CPU time with the patch
    applied, but does not show up at all on the commit before"

    and the reason is that Will moved the test for whether we need to flush
    from tlb_flush_mmu() into tlb_flush_mmu_tlbonly(). But that meant that
    tlb_flush_mmu_free() basically lost that check.

    Move it back into tlb_flush_mmu() where it belongs, so that it covers
    both tlb_flush_mmu_tlbonly() _and_ tlb_flush_mmu_free().
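
    The resulting shape is roughly the following (a sketch consistent
    with the description above, not the verbatim diff):

    void tlb_flush_mmu(struct mmu_gather *tlb)
    {
            /* nothing batched: skip both the TLB invalidation and
             * the (expensive) batched page freeing */
            if (!tlb->end)
                    return;

            tlb_flush_mmu_tlbonly(tlb);
            tlb_flush_mmu_free(tlb);
    }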

    Reported-and-tested-by: Dave Hansen
    Acked-by: Will Deacon
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

16 Dec, 2014

1 commit

  • Pull drm updates from Dave Airlie:
    "Highlights:

    - AMD KFD driver merge

    This is the AMD HSA interface for exposing a lowlevel interface for
    GPGPU use. They have an open source userspace built on top of this
    interface, and the code looks as good as it was going to get out of
    tree.

    - Initial atomic modesetting work

    The need for an atomic modesetting interface to allow userspace to
    try and send a complete set of modesetting state to the driver has
    arisen, and been suffering from neglect this past year. No more,
    the start of the common code and changes for msm driver to use it
    are in this tree. Ongoing work to get the userspace ioctl finished
    and the code clean will probably wait until next kernel.

    - DisplayID 1.3 and tiled monitor exposed to userspace.

    Tiled monitor property is now exposed for userspace to make use of.

    - Rockchip drm driver merged.

    - imx gpu driver moved out of staging

    Other stuff:

    - core:
    panel - MIPI DSI + new panels.
    expose suggested x/y properties for virtual GPUs

    - i915:
    Initial Skylake (SKL) support
    gen3/4 reset work
    start of dri1/ums removal
    infoframe tracking
    fixes for lots of things.

    - nouveau:
    tegra k1 voltage support
    GM204 modesetting support
    GT21x memory reclocking work

    - radeon:
    CI dpm fixes
    GPUVM improvements
    Initial DPM fan control

    - rcar-du:
    HDMI support added
    removed some support for old boards
    slave encoder driver for Analog Devices adv7511

    - exynos:
    Exynos4415 SoC support

    - msm:
    a4xx gpu support
    atomic helper conversion

    - tegra:
    iommu support
    universal plane support
    ganged-mode DSI support

    - sti:
    HDMI i2c improvements

    - vmwgfx:
    some late fixes.

    - qxl:
    use suggested x/y properties"

    * 'drm-next' of git://people.freedesktop.org/~airlied/linux: (969 commits)
    drm: sti: fix module compilation issue
    drm/i915: save/restore GMBUS freq across suspend/resume on gen4
    drm: sti: correctly cleanup CRTC and planes
    drm: sti: add HQVDP plane
    drm: sti: add cursor plane
    drm: sti: enable auxiliary CRTC
    drm: sti: fix delay in VTG programming
    drm: sti: prepare sti_tvout to support auxiliary crtc
    drm: sti: use drm_crtc_vblank_{on/off} instead of drm_vblank_{on/off}
    drm: sti: fix hdmi avi infoframe
    drm: sti: remove event lock while disabling vblank
    drm: sti: simplify gdp code
    drm: sti: clear all mixer control
    drm: sti: remove gpio for HDMI hot plug detection
    drm: sti: allow to change hdmi ddc i2c adapter
    drm/doc: Document drm_add_modes_noedid() usage
    drm/i915: Remove '& 0xffff' from the mask given to WA_REG()
    drm/i915: Invert the mask and val arguments in wa_add() and WA_REG()
    drm: Zero out DRM object memory upon cleanup
    drm/i915/bdw: Fix the write setting up the WIZ hashing mode
    ...

    Linus Torvalds
     

14 Dec, 2014

3 commits

  • This lets drivers like the AMD IOMMUv2 driver handle faults a bit more
    simply, rather than doing tricks with page refs and get_user_pages().

    Signed-off-by: Jesse Barnes
    Cc: Oded Gabbay
    Cc: Joerg Roedel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesse Barnes
     
  • The unmap_mapping_range family of functions unmap user pages
    (ultimately via zap_page_range_single) without touching the actual
    interval tree, and can thus share the lock.
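
    A sketch of the call-site change this implies (assuming the i_mmap
    rwsem conversion earlier in this series):

    /* readers of the interval tree may now run concurrently */
    i_mmap_lock_read(mapping);
    if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
            unmap_mapping_range_tree(&mapping->i_mmap, &details);
    i_mmap_unlock_read(mapping);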

    Signed-off-by: Davidlohr Bueso
    Cc: "Kirill A. Shutemov"
    Acked-by: Hugh Dickins
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra (Intel)
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Convert all open coded mutex_lock/unlock calls to the
    i_mmap_[lock/unlock]_write() helpers.
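
    For reference, a sketch of the helpers being introduced (at this
    point in the series they still wrap the i_mmap mutex):

    static inline void i_mmap_lock_write(struct address_space *mapping)
    {
            mutex_lock(&mapping->i_mmap_mutex);
    }

    static inline void i_mmap_unlock_write(struct address_space *mapping)
    {
            mutex_unlock(&mapping->i_mmap_mutex);
    }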

    Signed-off-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Acked-by: "Kirill A. Shutemov"
    Acked-by: Hugh Dickins
    Cc: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Cc: Srikar Dronamraju
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

12 Dec, 2014

1 commit

  • Pull s390 updates from Martin Schwidefsky:
    "The most notable change for this pull request is the ftrace rework
    from Heiko. It brings a small performance improvement and the ground
    work to support a new gcc option to replace the mcount blocks with a
    single nop.

    Two new s390-specific system calls are added to emulate user space
    mmio for PCI, an artifact of how PCI memory is accessed.

    Two patches for memory management with changes to common code. For
    KVM, mm_forbids_zeropage is added, which disables the empty zero
    page for an mm that is used by a KVM process. And an optimization:
    pmdp_get_and_clear_full is added, analogous to ptep_get_and_clear_full.

    Some micro-optimizations for the cmpxchg and the spinlock code.

    And as usual bug fixes and cleanups"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (46 commits)
    s390/cputime: fix 31-bit compile
    s390/scm_block: make the number of reqs per HW req configurable
    s390/scm_block: handle multiple requests in one HW request
    s390/scm_block: allocate aidaw pages only when necessary
    s390/scm_block: use mempool to manage aidaw requests
    s390/eadm: change timeout value
    s390/mm: fix memory leak of ptlock in pmd_free_tlb
    s390: use local symbol names in entry[64].S
    s390/ptrace: always include vector registers in core files
    s390/simd: clear vector register pointer on fork/clone
    s390: translate cputime magic constants to macros
    s390/idle: convert open coded idle time seqcount
    s390/idle: add missing irq off lockdep annotation
    s390/debug: avoid function call for debug_sprintf_*
    s390/kprobes: fix instruction copy for out of line execution
    s390: remove diag 44 calls from cpu_relax()
    s390/dasd: retry partition detection
    s390/dasd: fix list corruption for sleep_on requests
    s390/dasd: fix infinite term I/O loop
    s390/dasd: remove unused code
    ...

    Linus Torvalds
     

10 Dec, 2014

1 commit

  • Pull arm64 updates from Will Deacon:
    "Here's the usual mixed bag of arm64 updates, also including some
    related EFI changes (Acked by Matt) and the MMU gather range cleanup
    (Acked by you).

    Changes include:
    - support for alternative instruction patching from Andre
    - seccomp from Akashi
    - some AArch32 instruction emulation, required by the Android folks
    - optimisations for exception entry/exit code, cmpxchg, pcpu atomics
    - mmu_gather range calculations moved into core code
    - EFI updates from Ard, including long-awaited SMBIOS support
    - /proc/cpuinfo fixes to align with the format used by arch/arm/
    - a few non-critical fixes across the architecture"

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (70 commits)
    arm64: remove the unnecessary arm64_swiotlb_init()
    arm64: add module support for alternatives fixups
    arm64: perf: Prevent wraparound during overflow
    arm64/include/asm: Fixed a warning about 'struct pt_regs'
    arm64: Provide a namespace to NCAPS
    arm64: bpf: lift restriction on last instruction
    arm64: Implement support for read-mostly sections
    arm64: compat: align cacheflush syscall with arch/arm
    arm64: add seccomp support
    arm64: add SIGSYS siginfo for compat task
    arm64: add seccomp syscall for compat task
    asm-generic: add generic seccomp.h for secure computing mode 1
    arm64: ptrace: allow tracer to skip a system call
    arm64: ptrace: add NT_ARM_SYSTEM_CALL regset
    arm64: Move some head.text functions to executable section
    arm64: jump labels: NOP out NOP -> NOP replacement
    arm64: add support to dump the kernel page tables
    arm64: Add FIX_HOLE to permanent fixed addresses
    arm64: alternatives: fix pr_fmt string for consistency
    arm64: vmlinux.lds.S: don't discard .exit.* sections at link-time
    ...

    Linus Torvalds
     

08 Dec, 2014

1 commit

  • Linux 3.18

    Backmerge Linus tree into -next as we had conflicts in i915/radeon/nouveau,
    and everyone was solving them individually.

    * tag 'v3.18': (57 commits)
    Linux 3.18
    watchdog: s3c2410_wdt: Fix the mask bit offset for Exynos7
    uapi: fix to export linux/vm_sockets.h
    i2c: cadence: Set the hardware time-out register to maximum value
    i2c: davinci: generate STP always when NACK is received
    ahci: disable MSI on SAMSUNG 0xa800 SSD
    context_tracking: Restore previous state in schedule_user
    slab: fix nodeid bounds check for non-contiguous node IDs
    lib/genalloc.c: export devm_gen_pool_create() for modules
    mm: fix anon_vma_clone() error treatment
    mm: fix swapoff hang after page migration and fork
    fat: fix oops on corrupted vfat fs
    ipc/sem.c: fully initialize sem_array before making it visible
    drivers/input/evdev.c: don't kfree() a vmalloc address
    cxgb4: Fill in supported link mode for SFP modules
    xen-netfront: Remove BUGs on paged skb data which crosses a page boundary
    mm/vmpressure.c: fix race in vmpressure_work_fn()
    mm: frontswap: invalidate expired data on a dup-store failure
    mm: do not overwrite reserved pages counter at show_mem()
    drm/radeon: kernel panic in drm_calc_vbltimestamp_from_scanoutpos with 3.18.0-rc6
    ...

    Conflicts:
    drivers/gpu/drm/i915/intel_display.c
    drivers/gpu/drm/nouveau/nouveau_drm.c
    drivers/gpu/drm/radeon/radeon_cs.c

    Dave Airlie
     

04 Dec, 2014

1 commit

  • I've been seeing swapoff hangs in recent testing: it's cycling around
    trying unsuccessfully to find an mm for some remaining pages of swap.

    I have been exercising swap and page migration more heavily recently,
    and now notice a long-standing error in copy_one_pte(): it's trying to
    add dst_mm to swapoff's mmlist when it finds a swap entry, but is doing
    so even when it's a migration entry or an hwpoison entry.

    Which wouldn't matter much, except it adds dst_mm next to src_mm,
    assuming src_mm is already on the mmlist: which may not be so. Then if
    pages are later swapped out from dst_mm, swapoff won't be able to find
    where to replace them.

    There's already a !non_swap_entry() test for stats: move that up before
    the swap_duplicate() and the addition to mmlist.
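
    A simplified sketch of the reordered copy_one_pte() logic (the
    migration and hwpoison branches are elided):

    if (unlikely(!pte_present(pte)) && !pte_file(pte)) {
            swp_entry_t entry = pte_to_swp_entry(pte);

            if (likely(!non_swap_entry(entry))) {
                    /* genuine swap entry: only now take a reference
                     * and put dst_mm on swapoff's mmlist */
                    if (swap_duplicate(entry) < 0)
                            return entry.val;

                    /* src_mm holds a real swap entry, so it is
                     * already on the mmlist */
                    if (unlikely(list_empty(&dst_mm->mmlist))) {
                            spin_lock(&mmlist_lock);
                            if (list_empty(&dst_mm->mmlist))
                                    list_add(&dst_mm->mmlist,
                                             &src_mm->mmlist);
                            spin_unlock(&mmlist_lock);
                    }
                    rss[MM_SWAPENTS]++;
            }
            /* migration/hwpoison entries never touch the mmlist */
    }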

    Signed-off-by: Hugh Dickins
    Cc: Kelley Nielsen
    Cc: [2.6.18+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

17 Nov, 2014

1 commit

  • On architectures with hardware broadcasting of TLB invalidation
    messages, it makes sense to reduce the range of the mmu_gather
    structure when unmapping page ranges based on the dirty address
    information passed to tlb_remove_tlb_entry.

    arm64 already does this by directly manipulating the start/end fields
    of the gather structure, but this confuses the generic code which
    does not expect these fields to change and can end up calculating
    invalid, negative ranges when forcing a flush in zap_pte_range.

    This patch moves the minimal range calculation out of the arm64 code
    and into the generic implementation, simplifying zap_pte_range in the
    process (which no longer needs to care about start/end, since they will
    point to the appropriate ranges already). With the range being tracked
    by core code, the need_flush flag is dropped in favour of checking that
    the end of the range has actually been set.
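
    The core of the generic tracking is a small range-adjust helper,
    roughly (a sketch; names follow the description above):

    static inline void __tlb_adjust_range(struct mmu_gather *tlb,
                                          unsigned long address)
    {
            tlb->start = min(tlb->start, address);
            tlb->end = max(tlb->end, address + PAGE_SIZE);
    }

    with "do we need to flush?" becoming a check that tlb->end != 0.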

    Cc: Benjamin Herrenschmidt
    Cc: Peter Zijlstra
    Cc: Russell King - ARM Linux
    Cc: Michal Simek
    Acked-by: Linus Torvalds
    Signed-off-by: Will Deacon

    Will Deacon
     

29 Oct, 2014

1 commit

  • When unmapping a range of pages in zap_pte_range, the page being
    unmapped is added to an mmu_gather_batch structure for asynchronous
    freeing. If we run out of space in the batch structure before the range
    has been completely unmapped, then we break out of the loop, force a
    TLB flush and free the pages that we have batched so far. If there are
    further pages to unmap, then we resume the loop where we left off.

    Unfortunately, we forget to update addr when we break out of the loop,
    which causes us to truncate the range being invalidated as the end
    address is exclusive. When we re-enter the loop at the same address, the
    page has already been freed and the pte_present test will fail, meaning
    that we do not reconsider the address for invalidation.

    This patch fixes the problem by incrementing addr by the PAGE_SIZE
    before breaking out of the loop on batch failure.
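
    The fix, sketched at the batching point in zap_pte_range():

    if (unlikely(!__tlb_remove_page(tlb, page))) {
            force_flush = 1;
            /* step past the page just batched, so the forced
             * flush covers it and the loop resumes at the first
             * unprocessed address */
            addr += PAGE_SIZE;
            break;
    }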

    Signed-off-by: Will Deacon
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Will Deacon
     

14 Oct, 2014

1 commit

  • For VMAs that don't want write notifications, PTEs created for read faults
    have their write bit set. If the read fault happens after VM_SOFTDIRTY is
    cleared, then the PTE's softdirty bit will remain clear after subsequent
    writes.

    Here's a simple code snippet to demonstrate the bug:

    char *m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
                   MAP_ANONYMOUS | MAP_SHARED, -1, 0);
    system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
    assert(*m == '\0');     /* new PTE allows write access */
    assert(!soft_dirty(m)); /* soft_dirty() reads bit 55 of the page's
                               /proc/$PPID/pagemap entry */
    *m = 'x';               /* should dirty the page */
    assert(soft_dirty(m));  /* fails */

    With this patch, write notifications are enabled when VM_SOFTDIRTY is
    cleared. Furthermore, to avoid unnecessary faults, write notifications
    are disabled when VM_SOFTDIRTY is set.

    As a side effect of enabling and disabling write notifications with
    care, this patch fixes a bug in mprotect where vm_page_prot bits set by
    drivers were zapped on mprotect. An analogous bug was fixed in mmap by
    commit c9d0bf241451 ("mm: uncached vma support with writenotify").

    Signed-off-by: Peter Feiner
    Reported-by: Peter Feiner
    Suggested-by: Kirill A. Shutemov
    Cc: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Jamie Liu
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: Bjorn Helgaas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Feiner
     

26 Sep, 2014

1 commit

  • This fixes the same bug as b43790eedd31 ("mm: softdirty: don't forget to
    save file map softdiry bit on unmap") and 9aed8614af5a ("mm/memory.c:
    don't forget to set softdirty on file mapped fault") where the return
    value of pte_*mksoft_dirty was being ignored.

    To be sure that no other pte/pmd "mk" function return values were being
    ignored, I annotated the functions in arch/x86/include/asm/pgtable.h
    with __must_check and rebuilt.

    The userspace effect of this bug is that the softdirty mark might be
    lost if a file mapped pte gets zapped.
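
    The pattern being fixed, sketched for the zap path (ptfile is the
    file pte about to be written back):

    /* buggy: the new pte is returned, so the result was dropped */
    if (pte_soft_dirty(ptent))
            pte_file_mksoft_dirty(ptfile);

    /* fixed: assign the returned value back */
    if (pte_soft_dirty(ptent))
            ptfile = pte_file_mksoft_dirty(ptfile);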

    Signed-off-by: Peter Feiner
    Acked-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Jamie Liu
    Cc: Hugh Dickins
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Feiner
     

23 Sep, 2014

1 commit

  • Pull KVM fixes from Paolo Bonzini:
    "Two very simple bugfixes, affecting all supported architectures"

    [ Two? There's three commits in here. Oh well, I guess Paolo didn't
    count the preparatory symbol export ]

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
    KVM: correct null pid check in kvm_vcpu_yield_to()
    KVM: check for !is_zero_pfn() in kvm_is_mmio_pfn()
    mm: export symbol dependencies of is_zero_pfn()

    Linus Torvalds
     

14 Sep, 2014

1 commit

  • In order to make the static inline function is_zero_pfn() callable by
    modules, export its symbol dependencies 'zero_pfn' and (for s390 and
    mips) 'zero_page_mask'.

    We need this for KVM, as CONFIG_KVM is a tristate for all supported
    architectures except ARM and arm64, and testing a pfn whether it refers
    to the zero page is required to correctly distinguish the zero page
    from other special RAM ranges that may also have the PG_reserved bit
    set, but need to be treated as MMIO memory.
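
    The export itself is a one-liner per symbol, e.g. in mm/memory.c:

    unsigned long zero_pfn __read_mostly;
    EXPORT_SYMBOL(zero_pfn);

    with s390 and mips additionally exporting their zero_page_mask.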

    Signed-off-by: Ard Biesheuvel
    Acked-by: Andrew Morton
    Signed-off-by: Paolo Bonzini

    Ard Biesheuvel
     

30 Aug, 2014

1 commit

  • Sasha Levin has shown oopses on ffffea0003480048 and ffffea0003480008 at
    mm/memory.c:1132, running Trinity on different 3.16-rc-next kernels:
    where zap_pte_range() checks page->mapping to see if PageAnon(page).

    Those addresses fit struct pages for pfns d2001 and d2000, and in each
    dump a register or a stack slot showed d2001730 or d2000730: pte flags
    0x730 are PCD ACCESSED PROTNONE SPECIAL IOMAP; and Sasha's e820 map has
    a hole between cfffffff and 100000000, which would need special access.

    Commit c46a7c817e66 ("x86: define _PAGE_NUMA by reusing software bits on
    the PMD and PTE levels") has broken vm_normal_page(): a PROTNONE SPECIAL
    pte no longer passes the pte_special() test, so zap_pte_range() goes on
    to try to access a non-existent struct page.

    Fix this by refining pte_special() (SPECIAL with PRESENT or PROTNONE) to
    complement pte_numa() (SPECIAL with neither PRESENT nor PROTNONE). A
    hint that this was a problem was that c46a7c817e66 added pte_numa() test
    to vm_normal_page(), and moved its is_zero_pfn() test from slow to fast
    path: This was papering over a pte_special() snag when the zero page was
    encountered during zap. This patch reverts vm_normal_page() to how it
    was before, relying on pte_special().

    It still appears that this patch may be incomplete: aren't there other
    places which need to be handling PROTNONE along with PRESENT? For
    example, pte_mknuma() clears _PAGE_PRESENT and sets _PAGE_NUMA, but on a
    PROT_NONE area, that would make it pte_special(). This is side-stepped
    by the fact that NUMA hinting faults skipped PROT_NONE VMAs and there
    are no grounds where a NUMA hinting fault on a PROT_NONE VMA would be
    interesting.
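
    A sketch of the refined x86 test described above (the exact in-tree
    expression may differ):

    static inline int pte_special(pte_t pte)
    {
            /* SPECIAL counts only with PRESENT or PROTNONE set;
             * SPECIAL with neither is a NUMA hinting pte, which is
             * covered by pte_numa() instead */
            return (pte_flags(pte) & _PAGE_SPECIAL) &&
                    (pte_flags(pte) & (_PAGE_PRESENT | _PAGE_PROTNONE));
    }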

    Fixes: c46a7c817e66 ("x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels")
    Reported-by: Sasha Levin
    Tested-by: Sasha Levin
    Signed-off-by: Hugh Dickins
    Signed-off-by: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Cyrill Gorcunov
    Cc: Matthew Wilcox
    Cc: [3.16]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

09 Aug, 2014

3 commits

  • The core mm code will provide a default gate area based on
    FIXADDR_USER_START and FIXADDR_USER_END if
    !defined(__HAVE_ARCH_GATE_AREA) && defined(AT_SYSINFO_EHDR).

    This default is only useful for ia64. arm64, ppc, s390, sh, tile, 64-bit
    UML, and x86_32 have their own code just to disable it. arm, 32-bit UML,
    and x86_64 have gate areas, but they have their own implementations.

    This gets rid of the default and moves the code into ia64.

    This should save some code on architectures without a gate area: it's now
    possible to inline the gate_area functions in the default case.

    Signed-off-by: Andy Lutomirski
    Acked-by: Nathan Lynch
    Acked-by: H. Peter Anvin
    Acked-by: Benjamin Herrenschmidt [in principle]
    Acked-by: Richard Weinberger [for um]
    Acked-by: Will Deacon [for arm64]
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Chris Metcalf
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Nathan Lynch
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • The memcg uncharging code that is involved towards the end of a page's
    lifetime - truncation, reclaim, swapout, migration - is impressively
    complicated and fragile.

    Because anonymous and file pages were always charged before they had their
    page->mapping established, uncharges had to happen when the page type
    could still be known from the context; as in unmap for anonymous, page
    cache removal for file and shmem pages, and swap cache truncation for swap
    pages. However, these operations happen well before the page is actually
    freed, and so a lot of synchronization is necessary:

    - Charging, uncharging, page migration, and charge migration all need
    to take a per-page bit spinlock as they could race with uncharging.

    - Swap cache truncation happens during both swap-in and swap-out, and
    possibly repeatedly before the page is actually freed. This means
    that the memcg swapout code is called from many contexts that make
    no sense and it has to figure out the direction from page state to
    make sure memory and memory+swap are always correctly charged.

    - On page migration, the old page might be unmapped but then reused,
    so memcg code has to prevent untimely uncharging in that case.
    Because this code - which should be a simple charge transfer - is so
    special-cased, it is not reusable for replace_page_cache().

    But now that charged pages always have a page->mapping, introduce
    mem_cgroup_uncharge(), which is called after the final put_page(), when we
    know for sure that nobody is looking at the page anymore.

    For page migration, introduce mem_cgroup_migrate(), which is called after
    the migration is successful and the new page is fully rmapped. Because
    the old page is no longer uncharged after migration, prevent double
    charges by decoupling the page's memcg association (PCG_USED and
    pc->mem_cgroup) from the page holding an actual charge. The new bits
    PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
    to the new page during migration.

    mem_cgroup_migrate() is suitable for replace_page_cache() as well,
    which gets rid of mem_cgroup_replace_page_cache(). However, care
    needs to be taken because both the source and the target page can
    already be charged and on the LRU when fuse is splicing: grab the page
    lock on the charge moving side to prevent changing pc->mem_cgroup of a
    page under migration. Also, the lruvecs of both pages change as we
    uncharge the old and charge the new during migration, and putback may
    race with us, so grab the lru lock and isolate the pages iff on LRU to
    prevent races and ensure the pages are on the right lruvec afterward.

    Swap accounting is massively simplified: because the page is no longer
    uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
    transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
    before the final put_page() in page reclaim.

    Finally, page_cgroup changes are now protected by whatever protection the
    page itself offers: anonymous pages are charged under the page table lock,
    whereas page cache insertions, swapin, and migration hold the page lock.
    Uncharging happens under full exclusion with no outstanding references.
    Charging and uncharging also ensure that the page is off-LRU, which
    serializes against charge migration. Remove the very costly page_cgroup
    lock and set pc->flags non-atomically.

    [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
    [vdavydov@parallels.com: fix flags definition]
    Signed-off-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Tested-by: Jet Chen
    Acked-by: Michal Hocko
    Tested-by: Felipe Balbi
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • These patches rework memcg charge lifetime to integrate more naturally
    with the lifetime of user pages. This drastically simplifies the code and
    reduces charging and uncharging overhead. The most expensive part of
    charging and uncharging is the page_cgroup bit spinlock, which is removed
    entirely after this series.

    Here are the top-10 profile entries of a stress test that reads a 128G
    sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
    executing in the root memcg). Before:

    15.36% cat [kernel.kallsyms] [k] copy_user_generic_string
    13.31% cat [kernel.kallsyms] [k] memset
    11.48% cat [kernel.kallsyms] [k] do_mpage_readpage
    4.23% cat [kernel.kallsyms] [k] get_page_from_freelist
    2.38% cat [kernel.kallsyms] [k] put_page
    2.32% cat [kernel.kallsyms] [k] __mem_cgroup_commit_charge
    2.18% kswapd0 [kernel.kallsyms] [k] __mem_cgroup_uncharge_common
    1.92% kswapd0 [kernel.kallsyms] [k] shrink_page_list
    1.86% cat [kernel.kallsyms] [k] __radix_tree_lookup
    1.62% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn

    After:

    15.67% cat [kernel.kallsyms] [k] copy_user_generic_string
    13.48% cat [kernel.kallsyms] [k] memset
    11.42% cat [kernel.kallsyms] [k] do_mpage_readpage
    3.98% cat [kernel.kallsyms] [k] get_page_from_freelist
    2.46% cat [kernel.kallsyms] [k] put_page
    2.13% kswapd0 [kernel.kallsyms] [k] shrink_page_list
    1.88% cat [kernel.kallsyms] [k] __radix_tree_lookup
    1.67% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn
    1.39% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk
    1.30% cat [kernel.kallsyms] [k] kfree

    As you can see, the memcg footprint has shrunk quite a bit.

    text data bss dec hex filename
    37970 9892 400 48262 bc86 mm/memcontrol.o.old
    35239 9892 400 45531 b1db mm/memcontrol.o

    This patch (of 4):

    The memcg charge API charges pages before they are rmapped - i.e. have an
    actual "type" - and so every callsite needs its own set of charge and
    uncharge functions to know what type is being operated on. Worse,
    uncharge has to happen from a context that is still type-specific, rather
    than at the end of the page's lifetime with exclusive access, and so
    requires a lot of synchronization.

    Rewrite the charge API to provide a generic set of try_charge(),
    commit_charge() and cancel_charge() transaction operations, much like
    what's currently done for swap-in:

    mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
    pages from the memcg if necessary.

    mem_cgroup_commit_charge() commits the page to the charge once it
    has a valid page->mapping and PageAnon() reliably tells the type.

    mem_cgroup_cancel_charge() aborts the transaction.

    This reduces the charge API and enables subsequent patches to
    drastically simplify uncharging.

    As pages need to be committed after rmap is established but before they
    are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
    additions again. Revive lru_cache_add_active_or_unevictable().
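
    A sketch of the resulting call pattern at an anonymous fault site
    (shape taken from the description above; details assumed):

    struct mem_cgroup *memcg;

    if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg))
            goto oom;       /* reserve failed even after reclaim */

    /* rmap first, so the page has a type to commit against */
    page_add_new_anon_rmap(page, vma, address);
    mem_cgroup_commit_charge(page, memcg, false);
    lru_cache_add_active_or_unevictable(page, vma);

    with mem_cgroup_cancel_charge(page, memcg) on any error path between
    try and commit.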

    [hughd@google.com: fix shmem_unuse]
    [hughd@google.com: Add comments on the private use of -EAGAIN]
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Signed-off-by: Hugh Dickins
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

07 Aug, 2014

7 commits

  • This patch changes confusing #ifdef use in __access_remote_vm into
    merely ugly #ifdef use.

    Addresses bug https://bugzilla.kernel.org/show_bug.cgi?id=81651

    Signed-off-by: Rik van Riel
    Reported-by: David Binderman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • fault_around_bytes can only be changed via debugfs. Let's mark it
    read-mostly.
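
    That is, a one-line annotation (sketch; the initial value is the 64k
    default from the earlier faultaround rework):

    static unsigned long fault_around_bytes __read_mostly =
            rounddown_pow_of_two(65536);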

    Signed-off-by: Kirill A. Shutemov
    Suggested-by: David Rientjes
    Acked-by: David Rientjes
    Cc: Dave Hansen
    Cc: Andrey Ryabinin
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Things can go wrong if fault_around_bytes is changed while
    do_fault_around() is running: between its uses in
    fault_around_mask() and fault_around_pages().

    Let's read fault_around_bytes only once during do_fault_around() and
    calculate the mask based on that reading.

    Note: fault_around_bytes can only be updated via the debug
    interface. Also, I've tried but was not able to trigger bad
    behaviour without the patch, so I would not consider this patch
    urgent.
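
    A sketch of the single-read version (assumed shape, using the
    pre-READ_ONCE-era ACCESS_ONCE):

    static void do_fault_around(struct vm_area_struct *vma,
                    unsigned long address, pte_t *pte, pgoff_t pgoff,
                    unsigned int flags)
    {
            unsigned long start_addr, nr_pages, mask;

            /* one snapshot: the mask and page count are derived
             * from the same value even if the debugfs knob changes
             * concurrently */
            nr_pages = ACCESS_ONCE(fault_around_bytes) >> PAGE_SHIFT;
            mask = ~(nr_pages * PAGE_SIZE - 1) & PAGE_MASK;
            /* rest unchanged */
    }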

    Signed-off-by: Kirill A. Shutemov
    Cc: Dave Hansen
    Cc: Andrey Ryabinin
    Cc: Sasha Levin
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Add a comment describing the circumstances in which
    __lock_page_or_retry() will or will not release the mmap_sem when
    returning 0.

    Add comments to lock_page_or_retry()'s callers (filemap_fault(),
    do_swap_page()) noting the impact on VM_FAULT_RETRY returns.

    Add comments on up the call tree, particularly replacing the false "We
    return with mmap_sem still held" comments.

    Signed-off-by: Paul Cassella
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Cassella
     
  • Otherwise we may not notice that a pte was softdirty, because the
    pte_mksoft_dirty helper _returns_ the new pte but doesn't modify its
    argument.

    If a page fault happened on a dirty file mapping, the newly created
    pte could lose its softdirty bit; thus, if a userspace program is
    tracking memory changes with a memory tracker (CONFIG_MEM_SOFT_DIRTY),
    it might miss a modification of a memory page (which in the worst
    case may lead to data inconsistency).

    Signed-off-by: Cyrill Gorcunov
    Acked-by: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Commit 71e3aac0724f ("thp: transparent hugepage core") added a
    copy_pte_range prototype to huge_mm.h. I'm not sure why (or if) this
    function has ever been used outside of memory.c, but it currently
    isn't. This patch makes copy_pte_range() static again.

    Signed-off-by: Jerome Marchand
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     
  • Use ACCESS_ONCE() in handle_pte_fault() when getting the entry or
    orig_pte upon which all subsequent decisions and pte_same() tests will
    be made.

    I have no evidence that its lack is responsible for the mm/filemap.c:202
    BUG_ON(page_mapped(page)) in __delete_from_page_cache() found by
    trinity, and I am not optimistic that it will fix it. But I have found
    no other explanation, and ACCESS_ONCE() here will surely not hurt.

    If gcc does re-access the pte before passing it down, then that would be
    disastrous for correct page fault handling, and certainly could explain
    the page_mapped() BUGs seen (concurrent fault causing page to be mapped
    in a second time on top of itself: mapcount 2 for a single pte).
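
    The change itself is one line in handle_pte_fault (sketch):

    /* snapshot the pte once; every later pte_same() test compares
     * against this copy, not a fresh re-read of *pte */
    entry = ACCESS_ONCE(*pte);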

    Signed-off-by: Hugh Dickins
    Cc: Sasha Levin
    Cc: Linus Torvalds
    Cc: "Kirill A. Shutemov"
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

31 Jul, 2014

1 commit

  • do_fault_around() expects fault_around_bytes rounded down to the
    nearest page order. Instead of calling rounddown_pow_of_two every
    time in fault_around_pages()/fault_around_mask(), we can round down
    once, when the user changes fault_around_bytes via the debugfs
    interface.

    This also fixes a bug when the user sets fault_around_bytes to 0:
    the result of rounddown_pow_of_two(0) is not defined, so
    fault_around_bytes == 0 didn't work without this patch.

    Let's set fault_around_bytes to PAGE_SIZE if the user sets it to
    something less than PAGE_SIZE.
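
    A sketch of the debugfs setter after this change (names assumed from
    the description above):

    static int fault_around_bytes_set(void *data, u64 val)
    {
            if (val / PAGE_SIZE > PTRS_PER_PTE)
                    return -EINVAL;

            /* rounddown_pow_of_two(0) is undefined, so clamp up front */
            if (val > PAGE_SIZE)
                    fault_around_bytes = rounddown_pow_of_two(val);
            else
                    fault_around_bytes = PAGE_SIZE;
            return 0;
    }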

    [akpm@linux-foundation.org: tweak code layout]
    Fixes: a9b0f861 ("mm: nominate faultaround area in bytes rather than page order")
    Signed-off-by: Andrey Ryabinin
    Reported-by: Sasha Levin
    Acked-by: Kirill A. Shutemov
    Cc: [3.15.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

24 Jul, 2014

1 commit

  • Ingo Korb reported that "repeated mapping of the same file on tmpfs
    using remap_file_pages sometimes triggers a BUG at mm/filemap.c:202 when
    the process exits".

    He bisected the bug to d7c1755179b8 ("mm: implement ->map_pages for
    shmem/tmpfs"), although the bug was actually added by commit
    8c6e50b0290c ("mm: introduce vm_ops->map_pages()").

    The problem is caused by calling do_fault_around for a _non-linear_
    fault. In this case pgoff is shifted and might become negative
    during the calculation.

    Faulting around a non-linear page fault makes no sense and breaks
    the logic in do_fault_around, because pgoff is shifted.
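
    The guard, sketched at the do_read_fault() call site (assumed shape):

    /* fault around only for linear faults, where pgoff really
     * corresponds to the file offset */
    if (vma->vm_ops->map_pages && !(flags & FAULT_FLAG_NONLINEAR) &&
        fault_around_pages() > 1) {
            pte = pte_offset_map_lock(mm, pmd, address, &ptl);
            do_fault_around(vma, address, pte, pgoff, flags);
    }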

    Signed-off-by: Konstantin Khlebnikov
    Reported-by: Ingo Korb
    Tested-by: Ingo Korb
    Cc: Hugh Dickins
    Cc: Sasha Levin
    Cc: Dave Jones
    Cc: Ning Qu
    Cc: "Kirill A. Shutemov"
    Cc: [3.15.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

05 Jun, 2014

5 commits

  • Some clarification on how faultaround works.

    [akpm@linux-foundation.org: tweak comment text]
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • There is evidence that the faultaround feature is less relevant on
    architectures with a page size bigger than 4k, which makes sense
    since the page fault overhead per byte of mapped area should be
    lower there.

    Let's rework the feature to specify the faultaround area in bytes
    instead of page order. It's 64 kilobytes for now.

    The patch effectively disables faultaround on architectures with a
    page size >= 64k (like ppc64).

    It's possible that some other size of faultaround area is relevant
    for a platform. We can expose the `fault_around_bytes' variable to
    arch-specific code once such platforms are found.

    Signed-off-by: Kirill A. Shutemov
    Cc: Rusty Russell
    Cc: Hugh Dickins
    Cc: Madhavan Srinivasan
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andi Kleen
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • mm/memory.c is overloaded: over 4k lines. The get_user_pages() code
    is pretty much self-contained, so let's move it to a separate file.

    No other changes made.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • _PAGE_NUMA is currently an alias of _PROT_PROTNONE to trap NUMA
    hinting faults on x86. Care is taken such that _PAGE_NUMA is used
    only in situations where the VMA flags distinguish between NUMA
    hinting faults and prot_none faults. This decision was x86-specific,
    and conceptually it is difficult, requiring special casing to
    distinguish between PROTNONE and NUMA ptes based on context.

    Fundamentally, we only need the _PAGE_NUMA bit to tell the difference
    between an entry that is really unmapped and a page that is protected
    for NUMA hinting faults as if the PTE is not present then a fault will
    be trapped.

    Swap PTEs on x86-64 use the bits after _PAGE_GLOBAL for the offset.
    This patch shrinks the maximum possible swap size and uses the bit to
    uniquely distinguish between NUMA hinting ptes and swap ptes.
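
    A sketch of the x86-64 bit layout change described above (names
    assumed):

    /* reuse the software bit just above _PAGE_BIT_GLOBAL */
    #define _PAGE_BIT_NUMA  (_PAGE_BIT_GLOBAL + 1)
    #define _PAGE_NUMA      (_AT(pteval_t, 1) << _PAGE_BIT_NUMA)

    /* swap offsets now start one bit higher, halving max swap size */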

    Signed-off-by: Mel Gorman
    Cc: David Vrabel
    Cc: Ingo Molnar
    Cc: Peter Anvin
    Cc: Fengguang Wu
    Cc: Linus Torvalds
    Cc: Steven Noonan
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Srikar Dronamraju
    Cc: Cyrill Gorcunov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

07 May, 2014

1 commit

  • Changing PTEs and PMDs to pte_numa & pmd_numa is done with the
    mmap_sem held for reading, which means a pmd can be instantiated
    and turned into a numa one while __handle_mm_fault() is examining
    the value of old_pmd.

    If that happens, __handle_mm_fault() should just return and let
    the page fault retry, instead of throwing an oops. This is
    handled by the test for pmd_trans_huge(*pmd) below.
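
    The retry logic the changelog refers to, as a sketch:

    /* a huge pmd can materialize from under us, since mmap_sem is
     * only held for reading: bail out and let the fault retry */
    if (unlikely(pmd_trans_huge(*pmd)))
            return 0;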

    Signed-off-by: Rik van Riel
    Reviewed-by: Naoya Horiguchi
    Reported-by: Sunil Pandey
    Signed-off-by: Peter Zijlstra
    Cc: Andrew Morton
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: linux-mm@kvack.org
    Cc: lwoodman@redhat.com
    Cc: dave.hansen@intel.com
    Link: http://lkml.kernel.org/r/20140429153615.2d72098e@annuminas.surriel.com
    Signed-off-by: Ingo Molnar

    Rik van Riel