14 Dec, 2014

3 commits

  • The slab shrinkers are currently invoked from the zonelist walkers in
    kswapd, direct reclaim, and zone reclaim, all of which roughly gauge the
    eligible LRU pages and assemble a nodemask to pass to NUMA-aware
    shrinkers, which then again have to walk over the nodemask. This is
    redundant code, extra runtime work, and fairly inaccurate when it comes to
    the estimation of actually scannable LRU pages. The code duplication will
    only get worse when making the shrinkers cgroup-aware and requiring them
    to have out-of-band cgroup hierarchy walks as well.

    Instead, invoke the shrinkers from shrink_zone(), which is where all
    reclaimers end up, to avoid this duplication.

    Take the count for eligible LRU pages out of get_scan_count(), which
    considers many more factors than just the availability of swap space, like
    zone_reclaimable_pages() currently does. Accumulate the number over all
    visited lruvecs to get the per-zone value.

    Some nodes have multiple zones due to memory addressing restrictions. To
    avoid putting too much pressure on the shrinkers, only invoke them once
    for each such node, using the class zone of the allocation as the pivot
    zone.

    For now, this integrates the slab shrinking better into the reclaim logic
    and gets rid of duplicative invocations from kswapd, direct reclaim, and
    zone reclaim. It also prepares for cgroup-awareness, allowing
    memcg-capable shrinkers to be added at the lruvec level without much
    duplication of both code and runtime work.

    This changes kswapd behavior, which used to invoke the shrinkers for each
    zone, but with scan ratios gathered from the entire node, resulting in
    meaningless pressure quantities on multi-zone nodes.

    Zone reclaim behavior also changes. It used to shrink slabs until the
    same number of pages was shrunk as had been reclaimed from the LRUs. Now
    it merely invokes the shrinkers once with the zone's scan ratio, which
    makes the shrinkers go easier on caches that implement aging and would
    prefer feeding back pressure from recently used slab objects to unused
    LRU pages.
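
    As a rough sketch of the idea (not the literal patch): shrink_zone()
    now calls shrink_slab() once per node, only when it is scanning the
    allocation's class zone:

        /*
         * Hedged sketch; function and variable names follow the text
         * above, the real code differs in detail.
         */
        static void shrink_zone(struct zone *zone, struct scan_control *sc,
                                bool is_classzone)
        {
                unsigned long lru_pages = 0;    /* accumulated over lruvecs */
                unsigned long nr_scanned = sc->nr_scanned;

                /* ... walk the zone's lruvecs, adding up their LRU pages ... */

                if (global_reclaim(sc) && is_classzone)
                        shrink_slab(sc->gfp_mask, zone_to_nid(zone), NULL,
                                    sc->nr_scanned - nr_scanned, lru_pages);
        }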

    [vdavydov@parallels.com: assure class zone is populated]
    Signed-off-by: Johannes Weiner
    Cc: Dave Chinner
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • We are now prepared to avoid enabling debug-pagealloc at boot time. So
    introduce a new kernel parameter that disables debug-pagealloc at boot,
    and disable the related functions in that case.

    The only non-intuitive part is the change to the guard page functions.
    Because guard pages are effective only when debug-pagealloc is enabled,
    turning them off along with debug-pagealloc is the reasonable thing to do.
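
    A minimal sketch of such a boot-time switch, assuming an early_param
    hook and a global flag (names are illustrative, not the literal patch):

        bool _debug_pagealloc_enabled __read_mostly;

        static int __init early_debug_pagealloc(char *buf)
        {
                if (!buf)
                        return -EINVAL;

                if (strcmp(buf, "on") == 0)
                        _debug_pagealloc_enabled = true;

                return 0;
        }
        early_param("debug_pagealloc", early_debug_pagealloc);

    With something like this in place, booting with "debug_pagealloc=on"
    enables the checks, and the guard page machinery can test the same flag.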

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Dave Hansen
    Cc: Michal Nazarewicz
    Cc: Jungsoo Son
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Until now, debug-pagealloc has needed extra flags in struct page, so we
    had to recompile the whole source tree whenever we decided to use it.
    This is really painful, because recompiling takes time, and sometimes a
    rebuild is not even possible due to third-party modules that depend on
    the layout of struct page. So we couldn't use this good feature in many
    cases.

    Now we have the page extension feature, which allows us to keep extra
    flags outside of struct page. This gets rid of the third-party module
    issue mentioned above, and it lets us decide at boot time whether we
    need the extra memory for page extensions. With these properties, we can
    avoid enabling debug-pagealloc at boot time with low computational
    overhead in a kernel built with CONFIG_DEBUG_PAGEALLOC. This will help
    our development process greatly.

    This patch is the preparation step to achieve the above goal.
    debug-pagealloc originally used an extra field of struct page, but after
    this patch it uses a field of struct page_ext. Because the memory for
    page_ext is allocated later than the initialization of the page
    allocator with CONFIG_SPARSEMEM, we must disable the debug-pagealloc
    feature temporarily until page_ext is initialized. This patch implements
    this.
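
    A hedged sketch of where the mark ends up: the debug-pagealloc state is
    kept in page_ext flags instead of struct page (helper and flag names
    are assumptions):

        static inline void set_page_poison(struct page *page)
        {
                struct page_ext *page_ext = lookup_page_ext(page);

                __set_bit(PAGE_EXT_DEBUG_POISON, &page_ext->flags);
        }

        static inline bool page_poison(struct page *page)
        {
                struct page_ext *page_ext = lookup_page_ext(page);

                return test_bit(PAGE_EXT_DEBUG_POISON, &page_ext->flags);
        }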

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Dave Hansen
    Cc: Michal Nazarewicz
    Cc: Jungsoo Son
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

12 Dec, 2014

1 commit

  • Pull s390 updates from Martin Schwidefsky:
    "The most notable change for this pull request is the ftrace rework
    from Heiko. It brings a small performance improvement and the ground
    work to support a new gcc option to replace the mcount blocks with a
    single nop.

    Two new s390 specific system calls are added to emulate user space
    mmio for PCI, an artifact of how PCI memory is accessed.

    Two patches for memory management with changes to common code. For
    KVM, mm_forbids_zeropage is added, which disables the empty zero
    page for an mm that is used by a KVM process. And an optimization,
    pmdp_get_and_clear_full, is added analogous to ptep_get_and_clear_full.

    Some micro optimization for the cmpxchg and the spinlock code.

    And as usual bug fixes and cleanups"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (46 commits)
    s390/cputime: fix 31-bit compile
    s390/scm_block: make the number of reqs per HW req configurable
    s390/scm_block: handle multiple requests in one HW request
    s390/scm_block: allocate aidaw pages only when necessary
    s390/scm_block: use mempool to manage aidaw requests
    s390/eadm: change timeout value
    s390/mm: fix memory leak of ptlock in pmd_free_tlb
    s390: use local symbol names in entry[64].S
    s390/ptrace: always include vector registers in core files
    s390/simd: clear vector register pointer on fork/clone
    s390: translate cputime magic constants to macros
    s390/idle: convert open coded idle time seqcount
    s390/idle: add missing irq off lockdep annotation
    s390/debug: avoid function call for debug_sprintf_*
    s390/kprobes: fix instruction copy for out of line execution
    s390: remove diag 44 calls from cpu_relax()
    s390/dasd: retry partition detection
    s390/dasd: fix list corruption for sleep_on requests
    s390/dasd: fix infinite term I/O loop
    s390/dasd: remove unused code
    ...

    Linus Torvalds
     

18 Nov, 2014

1 commit

  • MPX-enabled applications using large swaths of memory can
    potentially have large numbers of bounds tables in process
    address space to save bounds information. These tables can take
    up huge swaths of memory (as much as 80% of the memory on the
    system) even if we clean them up aggressively. In the worst-case
    scenario, the tables can be 4x the size of the data structure
    being tracked. IOW, a 1-page structure can require 4 bounds-table
    pages.

    Being this huge, our expectation is that folks using MPX are
    going to be keen on figuring out how much memory is being
    dedicated to it. So we need a way to track memory use for MPX.

    If we want to specifically track MPX VMAs we need to be able to
    distinguish them from normal VMAs, and keep them from getting
    merged with normal VMAs. A new VM_ flag set only on MPX VMAs does
    both of those things. With this flag, MPX bounds-table VMAs can
    be distinguished from other VMAs, and userspace can also walk
    /proc/$pid/smaps to get memory usage for MPX.

    In addition to this flag, we also introduce a special ->vm_ops
    specific to MPX VMAs (see the patch "add MPX specific mmap
    interface"), but currently different ->vm_ops do not by
    themselves prevent VMA merging, so we still need this flag.

    We understand that VM_ flags are scarce and are open to other
    options.
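
    A hedged sketch of what such a flag could look like in
    include/linux/mm.h (the actual bit assignment is an assumption):

        #ifdef CONFIG_X86_INTEL_MPX
        # define VM_MPX         VM_HIGH_ARCH_2  /* MPX bounds-table VMA */
        #else
        # define VM_MPX         VM_NONE
        #endif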

    Signed-off-by: Qiaowei Ren
    Signed-off-by: Dave Hansen
    Cc: linux-mm@kvack.org
    Cc: linux-mips@linux-mips.org
    Cc: Dave Hansen
    Link: http://lkml.kernel.org/r/20141114151825.565625B3@viggo.jf.intel.com
    Signed-off-by: Thomas Gleixner

    Qiaowei Ren
     


21 Oct, 2014

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "A large number of cleanups and bug fixes, with some (minor) journal
    optimizations"

    [ This got sent to me before -rc1, but was stuck in my spam folder. - Linus ]

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (67 commits)
    ext4: check s_chksum_driver when looking for bg csum presence
    ext4: move error report out of atomic context in ext4_init_block_bitmap()
    ext4: Replace open coded mdata csum feature to helper function
    ext4: delete useless comments about ext4_move_extents
    ext4: fix reservation overflow in ext4_da_write_begin
    ext4: add ext4_iget_normal() which is to be used for dir tree lookups
    ext4: don't orphan or truncate the boot loader inode
    ext4: grab missed write_count for EXT4_IOC_SWAP_BOOT
    ext4: optimize block allocation on grow indepth
    ext4: get rid of code duplication
    ext4: fix over-defensive complaint after journal abort
    ext4: fix return value of ext4_do_update_inode
    ext4: fix mmap data corruption when blocksize < pagesize
    vfs: fix data corruption when blocksize < pagesize for mmaped data
    ext4: fold ext4_nojournal_sops into ext4_sops
    ext4: support freezing ext2 (nojournal) file systems
    ext4: fold ext4_sync_fs_nojournal() into ext4_sync_fs()
    ext4: don't check quota format when there are no quota files
    jbd2: simplify calling convention around __jbd2_journal_clean_checkpoint_list
    jbd2: avoid pointless scanning of checkpoint lists
    ...

    Linus Torvalds
     

14 Oct, 2014

2 commits

  • For VMAs that don't want write notifications, PTEs created for read faults
    have their write bit set. If the read fault happens after VM_SOFTDIRTY is
    cleared, then the PTE's softdirty bit will remain clear after subsequent
    writes.

    Here's a simple code snippet to demonstrate the bug:

        char *m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
                       MAP_ANONYMOUS | MAP_SHARED, -1, 0);
        system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
        assert(*m == '\0');     /* new PTE allows write access */
        assert(!soft_dirty(m)); /* softdirty bit starts out clear */
        *m = 'x';               /* should dirty the page */
        assert(soft_dirty(m));  /* fails */

    With this patch, write notifications are enabled when VM_SOFTDIRTY is
    cleared. Furthermore, to avoid unnecessary faults, write notifications
    are disabled when VM_SOFTDIRTY is set.

    As a side effect of enabling and disabling write notifications with
    care, this patch fixes a bug in mprotect where vm_page_prot bits set by
    drivers were zapped on mprotect. An analogous bug was fixed in mmap by
    commit c9d0bf241451 ("mm: uncached vma support with writenotify").

    Signed-off-by: Peter Feiner
    Reported-by: Peter Feiner
    Suggested-by: Kirill A. Shutemov
    Cc: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Jamie Liu
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: Bjorn Helgaas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Feiner
     
  • We have a large university system in the UK that is experiencing very long
    delays modprobing the driver for a specific I/O device. The delay is from
    8-10 minutes per device and there are 31 devices in the system. This 4 to
    5 hour delay in starting up those I/O devices is very much a burden on the
    customer.

    There are two causes for requiring a restart/reload of the drivers. First
    is periodic preventive maintenance (PM) and the second is if any of the
    devices experience a fatal error. Both of these trigger this excessively
    long delay in bringing the system back up to full capability.

    The problem was tracked down to a very slow IOREMAP operation and the
    excessively long ioresource lookup used to ensure that the user is not
    attempting to ioremap RAM. These patches speed up that
    function.

    The modprobe time appears to be affected quite a bit by previous activity
    on the ioresource list, which I suspect is due to cache preloading. While
    the overall improvement is impacted by other overhead of starting the
    devices, this drastically improves the modprobe time.

    Also our system is considerably smaller so the percentages gained will not
    be the same. Best case improvement with the modprobe on our 20 device
    smallish system was from 'real 5m51.913s' to 'real 0m18.275s'.

    This patch (of 2):

    Since the ioremap operation is verifying that the specified address range
    is NOT RAM, it will search the entire ioresource list if the condition is
    true. To make matters worse, it does this one 4k page at a time. For a
    128M BAR region this is 32 passes to determine the entire region does not
    contain any RAM addresses.

    This patch provides another resource lookup function, region_is_ram, that
    searches for the entire region specified, verifying that it is completely
    contained within the resource region. If it is found, then it is checked
    to be RAM or not, within a single pass.

    The return value reflects whether the region was found (-1 if not) and,
    if found, whether it is RAM (1) or not (0). This allows the caller to
    fall back to the previous page-by-page search if the region was not found.
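
    A hedged usage sketch for the ioremap path, following the return
    convention described above (the surrounding code is illustrative):

        int is_ram = region_is_ram(phys_addr, size);

        if (is_ram == 1)
                return NULL;    /* region is RAM: refuse to ioremap it */

        if (is_ram == -1) {
                /*
                 * Region not found in a single pass: fall back to the
                 * old page-by-page page_is_ram() walk.
                 */
        }
        /* is_ram == 0: known not to be RAM, proceed with the mapping */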

    [akpm@linux-foundation.org: fix spellos and typos in comment]
    Signed-off-by: Mike Travis
    Acked-by: Alex Thorlton
    Reviewed-by: Cliff Wickman
    Cc: Thomas Gleixner
    Cc: H. Peter Anvin
    Cc: Mark Salter
    Cc: Dave Young
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Mel Gorman
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Travis
     

10 Oct, 2014

3 commits

  • Sasha Levin reported a KASAN splat inside isolate_migratepages_range().
    The problem is in the function __is_movable_balloon_page(), which tests
    AS_BALLOON_MAP in page->mapping->flags. This function has no protection
    against anonymous pages, so it ended up checking address space flags
    inside struct anon_vma.

    Further investigation shows more problems in current implementation:

    * The special branch in __unmap_and_move() never works:
    balloon_page_movable() checks page flags and page_count. In
    __unmap_and_move() the page is locked and its reference count is
    elevated, thus balloon_page_movable() always fails. As a result,
    execution goes down the normal migration path. virtballoon_migratepage()
    returns MIGRATEPAGE_BALLOON_SUCCESS instead of MIGRATEPAGE_SUCCESS, so
    move_to_new_page() thinks this is an error code and assigns
    newpage->mapping to NULL. The newly migrated page loses its connection
    to the balloon and all ability to be migrated further.

    * lru_lock is erroneously required in isolate_migratepages_range() for
    isolating ballooned pages. That function releases lru_lock periodically,
    which makes migration mostly impossible for some pages.

    * balloon_page_dequeue has a tight race with balloon_page_isolate:
    balloon_page_isolate can run in parallel with dequeue, between picking
    the page from the list and taking the page lock. The race is rare
    because both use trylock_page() for locking.

    This patch fixes all of them.

    Instead of a fake mapping with a special flag, this patch uses a special
    state of page->_mapcount: PAGE_BALLOON_MAPCOUNT_VALUE = -256. The buddy
    allocator uses PAGE_BUDDY_MAPCOUNT_VALUE = -128 for a similar purpose.
    Storing the mark directly in struct page makes everything safer and easier.

    PagePrivate is used to mark pages present in page list (i.e. not
    isolated, like PageLRU for normal pages). It replaces special rules for
    reference counter and makes balloon migration similar to migration of
    normal pages. This flag is protected by page_lock together with link to
    the balloon device.
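
    A hedged sketch of the new mark (the helpers at the time looked roughly
    like this):

        #define PAGE_BALLOON_MAPCOUNT_VALUE (-256)

        static inline int PageBalloon(struct page *page)
        {
                return atomic_read(&page->_mapcount) ==
                                PAGE_BALLOON_MAPCOUNT_VALUE;
        }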

    Signed-off-by: Konstantin Khlebnikov
    Reported-by: Sasha Levin
    Link: http://lkml.kernel.org/p/53E6CEAA.9020105@oracle.com
    Cc: Rafael Aquini
    Cc: Andrey Ryabinin
    Cc: [3.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • To eliminate code duplication, let's introduce a check_data_rlimit
    helper which we will use in the brk() and prctl() syscalls.
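
    A minimal sketch of such a helper, assuming it compares the proposed
    data segment size against RLIMIT_DATA (details are assumptions):

        static inline int check_data_rlimit(unsigned long rlim,
                                            unsigned long new,
                                            unsigned long start,
                                            unsigned long end_data,
                                            unsigned long start_data)
        {
                if (rlim < RLIM_INFINITY) {
                        if ((new - start) + (end_data - start_data) > rlim)
                                return -ENOSPC;
                }

                return 0;
        }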

    Signed-off-by: Cyrill Gorcunov
    Cc: Kees Cook
    Cc: Tejun Heo
    Cc: Andrew Vagin
    Cc: Eric W. Biederman
    Cc: H. Peter Anvin
    Acked-by: Serge Hallyn
    Cc: Pavel Emelyanov
    Cc: Vasiliy Kulikov
    Cc: KAMEZAWA Hiroyuki
    Cc: Michael Kerrisk
    Cc: Julien Tinnes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • - Rename vm_is_stack() to task_of_stack() and change it to return
    "struct task_struct *" rather than the global (and thus wrong in
    general) pid_t.

    - Add the new pid_of_stack() helper which calls task_of_stack() and
    uses the right namespace to report the correct pid_t.

    Unfortunately we need to define this helper twice, in task_mmu.c
    and in task_nommu.c. Perhaps it makes sense to add fs/proc/util.c
    and move at least pid_of_stack/task_of_stack there to avoid the
    code duplication.

    - Change show_map_vma() and show_numa_map() to use the new helper.
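
    A hedged sketch of the pid_of_stack() helper as described above (only
    the names come from the text; the body follows the pattern used by
    fs/proc at the time and is a best-effort assumption):

        static pid_t pid_of_stack(struct proc_maps_private *priv,
                                  struct vm_area_struct *vma, bool is_pid)
        {
                struct inode *inode = priv->inode;
                struct task_struct *task;
                pid_t ret = 0;

                rcu_read_lock();
                task = pid_task(proc_pid(inode), PIDTYPE_PID);
                if (task) {
                        task = task_of_stack(task, vma, is_pid);
                        if (task)
                                /* report the pid in the proc mount's namespace */
                                ret = task_pid_nr_ns(task, inode->i_sb->s_fs_info);
                }
                rcu_read_unlock();

                return ret;
        }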

    Signed-off-by: Oleg Nesterov
    Cc: Alexander Viro
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Cc: Greg Ungerer
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

02 Oct, 2014

1 commit

  • ->page_mkwrite() is used by filesystems to allocate blocks under a page
    which is becoming writeably mmapped in some process' address space. This
    allows a filesystem to fail the page fault if there is not enough space
    available, the user exceeds their quota, or a similar problem happens,
    rather than silently discarding data later when writepage is called.

    However, the VFS fails to call ->page_mkwrite() in all the cases where
    filesystems need it when blocksize < pagesize. For example, when
    blocksize = 1024 and pagesize = 4096, the following is problematic:

        ftruncate(fd, 0);
        pwrite(fd, buf, 1024, 0);
        map = mmap(NULL, 1024, PROT_WRITE, MAP_SHARED, fd, 0);
        map[0] = 'a';          ----> page_mkwrite() for index 0 is called
        ftruncate(fd, 10000);  /* or even pwrite(fd, buf, 1, 10000) */
        mremap(map, 1024, 10000, 0);
        map[4095] = 'a';       ----> no page_mkwrite() called

    At the moment ->page_mkwrite() is called, filesystem can allocate only
    one block for the page because i_size == 1024. Otherwise it would create
    blocks beyond i_size which is generally undesirable. But later at
    ->writepage() time, we also need to store data at offset 4095 but we
    don't have block allocated for it.

    This patch introduces a helper function filesystems can use to have
    ->page_mkwrite() called at all the necessary moments.
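
    A hedged usage sketch from a filesystem's point of view (the helper
    name here is an assumption):

        /* when extending i_size past blocks within the old last page: */
        i_size_write(inode, new_size);
        pagecache_isize_extended(inode, old_size, new_size);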

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Jan Kara
     

24 Sep, 2014

1 commit

  • When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory
    has been swapped out or is behind a filemap, this will trigger async
    readahead and return immediately. The rationale is that KVM will kick
    back the guest with an "async page fault" and allow for some other
    guest process to take over.

    If async PFs are enabled the fault is retried asap from an async
    workqueue. If not, it's retried immediately in the same code path. In
    either case the retry will not relinquish the mmap semaphore and will
    block on the IO. This is a bad thing, as other mmap semaphore users
    now stall as a function of swap or filemap latency.

    This patch ensures both the regular and async PF path re-enter the
    fault allowing for the mmap semaphore to be relinquished in the case
    of IO wait.

    Reviewed-by: Radim Krčmář
    Signed-off-by: Andres Lagar-Cavilla
    Acked-by: Andrew Morton
    Signed-off-by: Paolo Bonzini

    Andres Lagar-Cavilla
     

09 Aug, 2014

1 commit

  • The core mm code will provide a default gate area based on
    FIXADDR_USER_START and FIXADDR_USER_END if
    !defined(__HAVE_ARCH_GATE_AREA) && defined(AT_SYSINFO_EHDR).

    This default is only useful for ia64. arm64, ppc, s390, sh, tile, 64-bit
    UML, and x86_32 have their own code just to disable it. arm, 32-bit UML,
    and x86_64 have gate areas, but they have their own implementations.

    This gets rid of the default and moves the code into ia64.

    This should save some code on architectures without a gate area: it's now
    possible to inline the gate_area functions in the default case.
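
    For architectures without a gate area, the default presumably reduces
    to inline stubs along these lines (a sketch, not the literal patch):

        #ifndef __HAVE_ARCH_GATE_AREA
        static inline struct vm_area_struct *get_gate_vma(struct mm_struct *mm)
        {
                return NULL;
        }

        static inline int in_gate_area_no_mm(unsigned long addr)
        {
                return 0;
        }

        static inline int in_gate_area(struct mm_struct *mm, unsigned long addr)
        {
                return 0;
        }
        #endif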

    Signed-off-by: Andy Lutomirski
    Acked-by: Nathan Lynch
    Acked-by: H. Peter Anvin
    Acked-by: Benjamin Herrenschmidt [in principle]
    Acked-by: Richard Weinberger [for um]
    Acked-by: Will Deacon [for arm64]
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Chris Metcalf
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Nathan Lynch
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     

05 Jun, 2014

2 commits

  • Pull x86 vdso updates from Peter Anvin:
    "Vdso cleanups and improvements largely from Andy Lutomirski. This
    makes the vdso a lot less 'special'"

    * 'x86/vdso' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/vdso, build: Make LE access macros clearer, host-safe
    x86/vdso, build: Fix cross-compilation from big-endian architectures
    x86/vdso, build: When vdso2c fails, unlink the output
    x86, vdso: Fix an OOPS accessing the HPET mapping w/o an HPET
    x86, mm: Replace arch_vma_name with vm_ops->name for vsyscalls
    x86, mm: Improve _install_special_mapping and fix x86 vdso naming
    mm, fs: Add vm_ops->name as an alternative to arch_vma_name
    x86, vdso: Fix an OOPS accessing the HPET mapping w/o an HPET
    x86, vdso: Remove vestiges of VDSO_PRELINK and some outdated comments
    x86, vdso: Move the vvar and hpet mappings next to the 64-bit vDSO
    x86, vdso: Move the 32-bit vdso special pages after the text
    x86, vdso: Reimplement vdso.so preparation in build-time C
    x86, vdso: Move syscall and sysenter setup into kernel/cpu/common.c
    x86, vdso: Clean up 32-bit vs 64-bit vdso params
    x86, mm: Ensure correct alignment of the fixmap

    Linus Torvalds
     
  • Currently, in put_compound_page(), when the page is likely a tail page
    we resolve its head via compound_head(), which looks like:

    ======
    static inline struct page *compound_head(struct page *page)
    {
            if (unlikely(PageTail(page))) {       <------  (3)
                    struct page *head = page->first_page;

                    smp_rmb();
                    if (likely(PageTail(page)))
                            return head;
            }
            return page;
    }
    ======

    Here, the unlikely at (3) is a negative hint, because in this case the
    page is *likely* a tail page. So the check at (3) is not good here, so
    I introduce a helper for this case.

    So this patch introduces compound_head_by_tail(), which deals with a
    possible tail page (though it could be split by a racy thread), and
    makes compound_head() a wrapper around it.
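
    A sketch of the helper and the wrapper as described (the actual patch
    may differ in detail):

        static inline struct page *compound_head_by_tail(struct page *tail)
        {
                struct page *head = tail->first_page;

                /*
                 * The page could have been split by a racy thread, so
                 * re-check PageTail after reading first_page.
                 */
                smp_rmb();
                if (likely(PageTail(tail)))
                        return head;
                return tail;
        }

        static inline struct page *compound_head(struct page *page)
        {
                if (unlikely(PageTail(page)))
                        return compound_head_by_tail(page);
                return page;
        }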

    This patch has no functional change, and it reduces the object
    size slightly:

        text    data  bss  dec    hex   filename
        11003   1328  16   12347  303b  mm/swap.o.orig
        10971   1328  16   12315  301b  mm/swap.o.patched

    I've run "perf top -e branch-miss" to observe branch misses in this
    case. As Michael points out, it's a slow path, so this case happens
    only very rarely. But I grep'ed the code base and found there are
    still some other call sites that could benefit from this helper. And
    given that it bloats the source by only 5 lines but reduces the object
    size, I still believe this helper deserves to exist.

    Signed-off-by: Jianyu Zhan
    Cc: Kirill A. Shutemov
    Cc: Rik van Riel
    Cc: Jiang Liu
    Cc: Peter Zijlstra
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Sasha Levin
    Cc: Wanpeng Li
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     

21 May, 2014

2 commits

  • Using arch_vma_name to give special mappings a name is awkward. x86
    currently implements it by comparing the start address of the vma to
    the expected address of the vdso. This requires tracking the start
    address of special mappings and is probably buggy if a special vma
    is split or moved.

    Improve _install_special_mapping to just name the vma directly. Use
    it to give the x86 vvar area a name, which should make CRIU's life
    easier.

    As a side effect, the vvar area will show up in core dumps. This
    could be considered weird and is fixable.

    [hpa: I say we accept this as-is but be prepared to deal with knocking
    out the vvars from core dumps if this becomes a problem.]
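
    A hedged sketch of the descriptor this implies, with the mapping
    carrying its own name (field layout is an assumption):

        struct vm_special_mapping {
                const char *name;       /* reported in /proc/PID/maps */
                struct page **pages;
        };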

    Cc: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Signed-off-by: Andy Lutomirski
    Link: http://lkml.kernel.org/r/276b39b6b645fb11e345457b503f17b83c2c6fd0.1400538962.git.luto@amacapital.net
    Signed-off-by: H. Peter Anvin

    Andy Lutomirski
     
  • arch_vma_name sucks. It's a silly hack, and it's annoying to
    implement correctly. In fact, AFAICS, even the straightforward x86
    implementation is incorrect (I suspect that it breaks if the vdso
    mapping is split or gets remapped).

    This adds a new vm_ops->name operation that can replace it. The
    followup patches will remove all uses of arch_vma_name on x86,
    fixing a couple of annoyances in the process.
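
    A hedged sketch of the new hook (its placement within
    vm_operations_struct is an assumption):

        struct vm_operations_struct {
                /* ... existing operations ... */

                /*
                 * Called by the /proc/PID/maps code to ask how to name
                 * this vma, replacing arch_vma_name() for such mappings.
                 */
                const char *(*name)(struct vm_area_struct *vma);
        };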

    Signed-off-by: Andy Lutomirski
    Link: http://lkml.kernel.org/r/2eee21791bb36a0a408c5c2bdb382a9e6a41ca4a.1400538962.git.luto@amacapital.net
    Signed-off-by: H. Peter Anvin

    Andy Lutomirski
     

13 Apr, 2014

1 commit

  • Pull audit updates from Eric Paris.

    * git://git.infradead.org/users/eparis/audit: (28 commits)
    AUDIT: make audit_is_compat depend on CONFIG_AUDIT_COMPAT_GENERIC
    audit: renumber AUDIT_FEATURE_CHANGE into the 1300 range
    audit: do not cast audit_rule_data pointers pointlesly
    AUDIT: Allow login in non-init namespaces
    audit: define audit_is_compat in kernel internal header
    kernel: Use RCU_INIT_POINTER(x, NULL) in audit.c
    sched: declare pid_alive as inline
    audit: use uapi/linux/audit.h for AUDIT_ARCH declarations
    syscall_get_arch: remove useless function arguments
    audit: remove stray newline from audit_log_execve_info() audit_panic() call
    audit: remove stray newlines from audit_log_lost messages
    audit: include subject in login records
    audit: remove superfluous new- prefix in AUDIT_LOGIN messages
    audit: allow user processes to log from another PID namespace
    audit: anchor all pid references in the initial pid namespace
    audit: convert PPIDs to the inital PID namespace.
    pid: get pid_t ppid of task in init_pid_ns
    audit: rename the misleading audit_get_context() to audit_take_context()
    audit: Add generic compat syscall support
    audit: Add CONFIG_HAVE_ARCH_AUDITSYSCALL
    ...

    Linus Torvalds
     

08 Apr, 2014

5 commits

  • LAST_CPUPID_MASK is calculated using LAST_CPUPID_WIDTH. However,
    LAST_CPUPID_WIDTH itself can be 0 (when LAST_CPUPID_NOT_IN_PAGE_FLAGS
    is set). In such a case LAST_CPUPID_MASK turns out to be 0.

    But with recent commit 1ae71d0319 ("mm: numa: bugfix for
    LAST_CPUPID_NOT_IN_PAGE_FLAGS"), if LAST_CPUPID_MASK is 0,
    page_cpupid_xchg_last() and page_cpupid_reset_last() cause
    page->_last_cpupid to be set to 0.

    This causes a performance regression. It's almost as if numa_balancing
    were off.

    Fix LAST_CPUPID_MASK by using LAST_CPUPID_SHIFT instead of
    LAST_CPUPID_WIDTH.
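
    The gist of the fix, sketched against the definitions above (the
    literal header change may differ):

        /* before: evaluates to 0 when LAST_CPUPID_WIDTH == 0 */
        #define LAST_CPUPID_MASK ((1UL << LAST_CPUPID_WIDTH) - 1)

        /* after: always a usable mask */
        #define LAST_CPUPID_MASK ((1UL << LAST_CPUPID_SHIFT) - 1)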

    Some performance numbers and perf stats with and without the fix.

    (3.14-rc6)
    ----------
    numa01

    Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa01':

    12,27,462 cs [100.00%]
    2,41,957 migrations [100.00%]
    1,68,01,713 faults [100.00%]
    7,99,35,29,041 cache-misses
    98,808 migrate:mm_migrate_pages [100.00%]

    1407.690148814 seconds time elapsed

    numa02

    Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa02':

    63,065 cs [100.00%]
    14,364 migrations [100.00%]
    2,08,118 faults [100.00%]
    25,32,59,404 cache-misses
    12 migrate:mm_migrate_pages [100.00%]

    63.840827219 seconds time elapsed

    (3.14-rc6 with fix)
    -------------------
    numa01

    Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa01':

    9,68,911 cs [100.00%]
    1,01,414 migrations [100.00%]
    88,38,697 faults [100.00%]
    4,42,92,51,042 cache-misses
    4,25,060 migrate:mm_migrate_pages [100.00%]

    685.965331189 seconds time elapsed

    numa02

    Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa02':

    17,543 cs [100.00%]
    2,962 migrations [100.00%]
    1,17,843 faults [100.00%]
    11,80,61,644 cache-misses
    12,358 migrate:mm_migrate_pages [100.00%]

    20.380132343 seconds time elapsed

    Signed-off-by: Srikar Dronamraju
    Cc: Liu Ping Fan
    Reviewed-by: Aneesh Kumar K.V
    Cc: Peter Zijlstra
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srikar Dronamraju
     
  • Commit f9acc8c7b35a ("readahead: sanify file_ra_state names") left
    ra_submit with a single function call.

    Move ra_submit to internal.h and inline it to save some stack. Thanks
    to Andrew Morton for commenting on different versions.
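
    The inlined helper presumably ends up as a thin wrapper in
    mm/internal.h, along these lines (a sketch under that assumption):

        static inline unsigned long ra_submit(struct file_ra_state *ra,
                                              struct address_space *mapping,
                                              struct file *filp)
        {
                return __do_page_cache_readahead(mapping, filp,
                                                 ra->start, ra->size,
                                                 ra->async_size);
        }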

    Signed-off-by: Fabian Frederick
    Suggested-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • filemap_map_pages() is a generic implementation of ->map_pages() for
    filesystems that use the page cache.

    It should be safe to use filemap_map_pages() for ->map_pages() if the
    filesystem uses filemap_fault() for ->fault().
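
    Adoption is then a one-line change per vm_ops table, e.g. (illustrative
    table, based on how such tables typically look):

        static const struct vm_operations_struct generic_file_vm_ops = {
                .fault          = filemap_fault,
                .map_pages      = filemap_map_pages,
        };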

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Linus Torvalds
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Alexander Viro
    Cc: Dave Chinner
    Cc: Ning Qu
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Here's a new version of the faultaround patchset. It took a while to
    tune it and collect performance data.

    The first patch adds a new callback, ->map_pages, to
    vm_operations_struct.

    ->map_pages() is called when the VM asks to map easily accessible
    pages. The filesystem should find and map pages associated with
    offsets from "pgoff" till "max_pgoff". ->map_pages() is called with
    the page table locked and must not block. If it's not possible to
    reach a page without blocking, the filesystem should skip it. The
    filesystem should use do_set_pte() to set up the page table entry. A
    pointer to the entry associated with offset "pgoff" is passed in the
    "pte" field of the vm_fault structure. Pointers to entries for other
    offsets should be calculated relative to "pte".
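
    A hedged sketch of the callback's shape, per the description above
    (field placement is an assumption):

        struct vm_fault {
                /* ... */
                pgoff_t max_pgoff;      /* last offset to map, inclusive */
                pte_t *pte;             /* page table entry for "pgoff" */
        };

        /* in vm_operations_struct: */
        void (*map_pages)(struct vm_area_struct *vma, struct vm_fault *vmf);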

    Currently the VM uses ->map_pages only on the read page fault path. We
    try to map FAULT_AROUND_PAGES at a time. FAULT_AROUND_PAGES is 16 for
    now. Performance data for different FAULT_AROUND_ORDER is below.

    TODO:
    - implement ->map_pages() for shmem/tmpfs;
    - modify get_user_pages() to be able to use ->map_pages() and implement
    mmap(MAP_POPULATE|MAP_NONBLOCK) on top.

    =========================================================================
    Tested on 4-socket machine (120 threads) with 128GiB of RAM.

    A few real-world workloads. The sweet spot for FAULT_AROUND_ORDER here
    is somewhere between 3 and 5. Let's say 4 :)

    Linux build (make -j60)
    FAULT_AROUND_ORDER Baseline 1 3 4 5 7 9
    minor-faults 283,301,572 247,151,987 212,215,789 204,772,882 199,568,944 194,703,779 193,381,485
    time, seconds 151.227629483 153.920996480 151.356125472 150.863792049 150.879207877 151.150764954 151.450962358
    Linux rebuild (make -j60)
    FAULT_AROUND_ORDER Baseline 1 3 4 5 7 9
    minor-faults 5,396,854 4,148,444 2,855,286 2,577,282 2,361,957 2,169,573 2,112,643
    time, seconds 27.404543757 27.559725591 27.030057426 26.855045126 26.678618635 26.974523490 26.761320095
    Git test suite (make -j60 test)
    FAULT_AROUND_ORDER Baseline 1 3 4 5 7 9
    minor-faults 129,591,823 99,200,751 66,106,718 57,606,410 51,510,808 45,776,813 44,085,515
    time, seconds 66.087215026 64.784546905 64.401156567 65.282708668 66.034016829 66.793780811 67.237810413

    Two synthetic tests: access every word in file in sequential/random order.
    It doesn't improve much after FAULT_AROUND_ORDER == 4.

    Sequential access 16GiB file
    FAULT_AROUND_ORDER Baseline 1 3 4 5 7 9
    1 thread
    minor-faults 4,195,437 2,098,275 525,068 262,251 131,170 32,856 8,282
    time, seconds 7.250461742 6.461711074 5.493859139 5.488488147 5.707213983 5.898510832 5.109232856
    8 threads
    minor-faults 33,557,540 16,892,728 4,515,848 2,366,999 1,423,382 442,732 142,339
    time, seconds 16.649304881 9.312555263 6.612490639 6.394316732 6.669827501 6.75078944 6.371900528
    32 threads
    minor-faults 134,228,222 67,526,810 17,725,386 9,716,537 4,763,731 1,668,921 537,200
    time, seconds 49.164430543 29.712060103 12.938649729 10.175151004 11.840094583 9.594081325 9.928461797
    60 threads
    minor-faults 251,687,988 126,146,952 32,919,406 18,208,804 10,458,947 2,733,907 928,217
    time, seconds 86.260656897 49.626551828 22.335007632 17.608243696 16.523119035 16.339489186 16.326390902
    120 threads
    minor-faults 503,352,863 252,939,677 67,039,168 35,191,827 19,170,091 4,688,357 1,471,862
    time, seconds 124.589206333 79.757867787 39.508707872 32.167281632 29.972989292 28.729834575 28.042251622
    Random access 1GiB file
    1 thread
    minor-faults 262,636 132,743 34,369 17,299 8,527 3,451 1,222
    time, seconds 15.351890914 16.613802482 16.569227308 15.179220992 16.557356122 16.578247824 15.365266994
    8 threads
    minor-faults 2,098,948 1,061,871 273,690 154,501 87,110 25,663 7,384
    time, seconds 15.040026343 15.096933500 14.474757288 14.289129964 14.411537468 14.296316837 14.395635804
    32 threads
    minor-faults 8,390,734 4,231,023 1,054,432 528,847 269,242 97,746 26,881
    time, seconds 20.430433109 21.585235358 22.115062928 14.872878951 14.880856305 14.883370649 14.821261690
    60 threads
    minor-faults 15,733,258 7,892,809 1,973,393 988,266 594,789 164,994 51,691
    time, seconds 26.577302548 25.692397770 18.728863715 20.153026398 21.619101933 17.745086260 17.613215273
    120 threads
    minor-faults 31,471,111 15,816,616 3,959,209 1,978,685 1,008,299 264,635 96,010
    time, seconds 41.835322703 40.459786095 36.085306105 35.313894834 35.814445675 36.552633793 34.289210594

    Touch only one page in page table in 16GiB file
    FAULT_AROUND_ORDER Baseline 1 3 4 5 7 9
    1 thread
    minor-faults 8,372 8,324 8,270 8,260 8,249 8,239 8,237
    time, seconds 0.039892712 0.045369149 0.051846126 0.063681685 0.079095975 0.17652406 0.541213386
    8 threads
    minor-faults 65,731 65,681 65,628 65,620 65,608 65,599 65,596
    time, seconds 0.124159196 0.488600638 0.156854426 0.191901957 0.242631486 0.543569456 1.677303984
    32 threads
    minor-faults 262,388 262,341 262,285 262,276 262,266 262,257 263,183
    time, seconds 0.452421421 0.488600638 0.565020946 0.648229739 0.789850823 1.651584361 5.000361559
    60 threads
    minor-faults 491,822 491,792 491,723 491,711 491,701 491,691 491,825
    time, seconds 0.763288616 0.869620515 0.980727360 1.161732354 1.466915814 3.04041448 9.308612938
    120 threads
    minor-faults 983,466 983,655 983,366 983,372 983,363 984,083 984,164
    time, seconds 1.595846553 1.667902182 2.008959376 2.425380942 2.941368804 5.977807890 18.401846125

    This patch (of 2):

    Introduce a new vm_ops callback, ->map_pages(), and use it to map
    easily accessible pages around the fault address.

    On a read page fault, if the filesystem provides ->map_pages(), we try
    to map up to FAULT_AROUND_PAGES pages around the page fault address, in
    the hope of reducing the number of minor page faults.

    We call ->map_pages() first and use ->fault() as a fallback if the page
    at a given offset is not ready to be mapped (cold page cache or similar).

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Linus Torvalds
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Alexander Viro
    Cc: Dave Chinner
    Cc: Ning Qu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Add VM_INIT_DEF_MASK, to allow us to set the default flags for VMAs,
    and a prctl control which sets the THP disable bit in mm->def_flags so
    that VMAs will pick up the setting as they are created.
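
    From user space, the control presumably looks like this (hedged usage
    sketch; PR_SET_THP_DISABLE / PR_GET_THP_DISABLE are the constants this
    change is known for):

        #include <sys/prctl.h>

        /* disable THP for all future VMAs of this process */
        prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);

        /* query the current setting */
        int thp_disabled = prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0);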

    Signed-off-by: Alex Thorlton
    Suggested-by: Oleg Nesterov
    Cc: Gerald Schaefer
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Christian Borntraeger
    Cc: Paolo Bonzini
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Alexander Viro
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Paolo Bonzini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Thorlton
     

04 Apr, 2014

3 commits

  • The ifdef conditions in include/linux/mm.h present three cases:

    - !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID)

    There is no actual definition of the function, but include/linux/mm.h
    has a static inline stub defined.

    - defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID)

    linux/mm.h does not define a prototype, but mm/page_alloc.c defines
    the function.

    Hence, the compiler reports the following warning:

    mm/page_alloc.c:4300:15: warning: no previous prototype for `__early_pfn_to_nid' [-Wmissing-prototypes]

    - defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID)

    The architecture defines the function, and linux/mm.h has a
    prototype.

    Thus, join the conditions of Cases 2 and 3, i.e., eliminate the ifdef
    condition on CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID, to eliminate the
    missing prototype warning from mm/page_alloc.c.

    Signed-off-by: Rashika Kheria
    Reviewed-by: Josh Triplett
    Reviewed-by: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rashika Kheria
     
  • Reclaim will be leaving shadow entries in the page cache radix tree upon
    evicting the real page. As those pages are found from the LRU, an
    iput() can lead to the inode being freed concurrently. At this point,
    reclaim must no longer install shadow pages because the inode freeing
    code needs to ensure the page tree is really empty.

    Add an address_space flag, AS_EXITING, that the inode freeing code sets
    under the tree lock before doing the final truncate. Reclaim will check
    for this flag before installing shadow pages.
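
    A hedged sketch of the flag's helpers (names follow the text; the
    exact form is an assumption):

        static inline void mapping_set_exiting(struct address_space *mapping)
        {
                set_bit(AS_EXITING, &mapping->flags);
        }

        static inline int mapping_exiting(struct address_space *mapping)
        {
                return test_bit(AS_EXITING, &mapping->flags);
        }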

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • shmem mappings already contain exceptional entries where swap slot
    information is remembered.

    To be able to store eviction information for regular page cache, prepare
    every site dealing with the radix trees directly to handle entries other
    than pages.

    The common lookup functions will filter out non-page entries and return
    NULL for page cache holes, just as before. But provide a raw version of
    the API which returns non-page entries as well, and switch shmem over to
    use it.
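
    A sketch of the resulting split, assuming the raw lookup is called
    find_get_entry() and the filtering lookup keeps its old name:

        /* may return a shadow (exceptional) entry instead of a page */
        struct page *find_get_entry(struct address_space *mapping,
                                    pgoff_t offset);

        struct page *find_get_page(struct address_space *mapping,
                                   pgoff_t offset)
        {
                struct page *page = find_get_entry(mapping, offset);

                /* filter out non-page entries for the common lookup */
                if (radix_tree_exceptional_entry(page))
                        page = NULL;
                return page;
        }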

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

03 Apr, 2014

1 commit

  • Pull x86 vdso changes from Peter Anvin:
    "This is the revamp of the 32-bit vdso and the associated cleanups.

    This adds timekeeping support to the 32-bit vdso that we already have
    in the 64-bit vdso. Although 32-bit x86 is legacy, it is likely to
    remain in the embedded space for a very long time to come.

    This removes the traditional COMPAT_VDSO support; the configuration
    variable is reused for simply removing the 32-bit vdso, which will
    produce correct results but obviously suffer a performance penalty.
    Only one beta version of glibc was affected, but that version was
    unfortunately included in one OpenSUSE release.

    This is not the end of the vdso cleanups. Stefani and Andy have
    agreed to continue work for the next kernel cycle; in fact Andy has
    already produced another set of cleanups that came too late for this
    cycle.

    An incidental, but arguably important, change is that this ensures
    that unused space in the VVAR page is properly zeroed. It wasn't
    before, and would contain whatever garbage was left in memory by BIOS
    or the bootloader. Since the VVAR page is accessible to user space
    this had the potential of information leaks"

    * 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
    x86, vdso: Fix the symbol versions on the 32-bit vDSO
    x86, vdso, build: Don't rebuild 32-bit vdsos on every make
    x86, vdso: Actually discard the .discard sections
    x86, vdso: Fix size of get_unmapped_area()
    x86, vdso: Finish removing VDSO32_PRELINK
    x86, vdso: Move more vdso definitions into vdso.h
    x86: Load the 32-bit vdso in place, just like the 64-bit vdsos
    x86, vdso32: handle 32 bit vDSO larger one page
    x86, vdso32: Disable stack protector, adjust optimizations
    x86, vdso: Zero-pad the VVAR page
    x86, vdso: Add 32 bit VDSO time support for 64 bit kernel
    x86, vdso: Add 32 bit VDSO time support for 32 bit kernel
    x86, vdso: Patch alternatives in the 32-bit VDSO
    x86, vdso: Introduce VVAR marco for vdso32
    x86, vdso: Cleanup __vdso_gettimeofday()
    x86, vdso: Replace VVAR(vsyscall_gtod_data) by gtod macro
    x86, vdso: __vdso_clock_gettime() cleanup
    x86, vdso: Revamp vclock_gettime.c
    mm: Add new func _install_special_mapping() to mmap.c
    x86, vdso: Make vsyscall_gtod_data handling x86 generic
    ...

    Linus Torvalds
     

01 Apr, 2014

1 commit

  • Pull s390 updates from Martin Schwidefsky:
    "There are two memory management related changes, the CMMA support for
    KVM to avoid swap-in of freed pages and the split page table lock for
    the PMD level. These two come with common code changes in mm/.

    A fix for the long standing theoretical TLB flush problem, this one
    comes with a common code change in kernel/sched/.

    Another set of changes is Heiko's uaccess work; included is the initial
    set of patches, with more to come.

    And fixes and cleanups as usual"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (36 commits)
    s390/con3270: optionally disable auto update
    s390/mm: remove unecessary parameter from pgste_ipte_notify
    s390/mm: remove unnecessary parameter from gmap_do_ipte_notify
    s390/mm: fixing comment so that parameter name match
    s390/smp: limit number of cpus in possible cpu mask
    hypfs: Add clarification for "weight_min" attribute
    s390: update defconfigs
    s390/ptrace: add support for PTRACE_SINGLEBLOCK
    s390/perf: make print_debug_cf() static
    s390/topology: Remove call to update_cpu_masks()
    s390/compat: remove compat exec domain
    s390: select CONFIG_TTY for use of tty in unconditional keyboard driver
    s390/appldata_os: fix cpu array size calculation
    s390/checksum: remove memset() within csum_partial_copy_from_user()
    s390/uaccess: remove copy_from_user_real()
    s390/sclp_early: Return correct HSA block count also for zero
    s390: add some drivers/subsystems to the MAINTAINERS file
    s390: improve debug feature usage
    s390/airq: add support for irq ranges
    s390/mm: enable split page table lock for PMD level
    ...

    Linus Torvalds
     

19 Mar, 2014

1 commit

  • The _install_special_mapping() function is the new base function for
    install_special_mapping(). It returns a pointer to the created VMA, or
    an error code wrapped in an ERR_PTR().

    This new function is needed by the vdso 32-bit support to map the
    additional vvar and hpet pages into the 32-bit address space. This will
    be done with io_remap_pfn_range() and remap_pfn_range(), which require
    a vm_area_struct.
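
    A hedged sketch of the signature this implies (the return type comes
    from the text; the parameter list is an assumption):

        struct vm_area_struct *_install_special_mapping(struct mm_struct *mm,
                                                        unsigned long addr,
                                                        unsigned long len,
                                                        unsigned long vm_flags,
                                                        struct page **pages);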

    Reviewed-by: Andy Lutomirski
    Signed-off-by: Stefani Seibold
    Link: http://lkml.kernel.org/r/1395094933-14252-3-git-send-email-stefani@seibold.net
    Signed-off-by: H. Peter Anvin

    Stefani Seibold
     

04 Mar, 2014

3 commits

  • When doing some NUMA tests on powerpc, I triggered an oops. I found it
    was caused by the use of page->_last_cpupid: it should be initialized
    as "-1 & LAST_CPUPID_MASK", not "-1". Otherwise, in task_numa_fault()
    we miss the check (last_cpupid == (-1 & LAST_CPUPID_MASK)), which
    finally causes an oops in task_numa_group(), since the number of online
    CPUs is less than the number of possible CPUs. This happens with
    CONFIG_SPARSE_VMEMMAP disabled.

    Call trace:

    SMP NR_CPUS=64 NUMA PowerNV
    Modules linked in:
    CPU: 24 PID: 804 Comm: systemd-udevd Not tainted 3.13.0-rc1+ #32
    task: c000001e2746aa80 ti: c000001e32c50000 task.ti: c000001e32c50000
    REGS: c000001e32c53510 TRAP: 0300 Not tainted (3.13.0-rc1+)
    MSR: 9000000000009032 CR: 28024424 XER: 20000000
    CFAR: c000000000009324 DAR: 7265717569726857 DSISR: 40000000 SOFTE: 1
    NIP .task_numa_fault+0x1470/0x2370
    LR .task_numa_fault+0x1468/0x2370
    Call Trace:
    .task_numa_fault+0x1468/0x2370 (unreliable)
    .do_numa_page+0x480/0x4a0
    .handle_mm_fault+0x4ec/0xc90
    .do_page_fault+0x3a8/0x890
    handle_page_fault+0x10/0x30
    Instruction dump:
    3c82fefb 3884b138 48d9cff1 60000000 48000574 3c62fefb3863af78 3c82fefb
    3884b138 48d9cfd5 60000000 e93f0100 7d2907b45529063e 7d2a07b4
    ---[ end trace 15f2510da5ae07cf ]---
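
    The gist of the fix, sketched (hedged; the real patch touches the
    page_cpupid helpers):

        static inline void page_cpupid_reset_last(struct page *page)
        {
                /* "-1 & LAST_CPUPID_MASK", not plain -1 */
                page->_last_cpupid = -1 & LAST_CPUPID_MASK;
        }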

    Signed-off-by: Liu Ping Fan
    Signed-off-by: Aneesh Kumar K.V
    Acked-by: Peter Zijlstra
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Liu Ping Fan
     
  • Daniel Borkmann reported a VM_BUG_ON assertion failing:

    ------------[ cut here ]------------
    kernel BUG at mm/mlock.c:528!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: ccm arc4 iwldvm [...]
    video
    CPU: 3 PID: 2266 Comm: netsniff-ng Not tainted 3.14.0-rc2+ #8
    Hardware name: LENOVO 2429BP3/2429BP3, BIOS G4ET37WW (1.12 ) 05/29/2012
    task: ffff8801f87f9820 ti: ffff88002cb44000 task.ti: ffff88002cb44000
    RIP: 0010:[] [] munlock_vma_pages_range+0x2e0/0x2f0
    Call Trace:
    do_munmap+0x18f/0x3b0
    vm_munmap+0x41/0x60
    SyS_munmap+0x22/0x30
    system_call_fastpath+0x1a/0x1f
    RIP munlock_vma_pages_range+0x2e0/0x2f0
    ---[ end trace a0088dcf07ae10f2 ]---

    because munlock_vma_pages_range() thinks it's unexpectedly in the middle
    of a THP page. This can be reproduced with default config since 3.11
    kernels. A reproducer can be found in the kernel's selftest directory
    for networking by running ./psock_tpacket.

    The problem is that an order=2 compound page (allocated by
    alloc_one_pg_vec_page()) is part of the munlocked VM_MIXEDMAP vma
    (mapped by packet_mmap()) and mistaken for a THP page and assumed to be
    order=9.

    The checks for THP in munlock came with commit ff6a6da60b89 ("mm:
    accelerate munlock() treatment of THP pages"), i.e. since 3.9, but did
    not trigger a bug. It just makes munlock_vma_pages_range() skip such
    compound pages until the next 512-pages-aligned page, when it encounters
    a head page. This is however not a problem for vma's where mlocking has
    no effect anyway, but it can distort the accounting.

    Since commit 7225522bb429 ("mm: munlock: batch non-THP page isolation
    and munlock+putback using pagevec") this can trigger a VM_BUG_ON in
    PageTransHuge() check.

    This patch fixes the issue by adding VM_MIXEDMAP flag to VM_SPECIAL, a
    list of flags that make vma's non-mlockable and non-mergeable. The
    reasoning is that VM_MIXEDMAP vma's are similar to VM_PFNMAP, which is
    already on the VM_SPECIAL list, and both are intended for non-LRU pages
    where mlocking makes no sense anyway. Related Lkml discussion can be
    found in [2].

    [1] tools/testing/selftests/net/psock_tpacket
    [2] https://lkml.org/lkml/2014/1/10/427

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Daniel Borkmann
    Reported-by: Daniel Borkmann
    Tested-by: Daniel Borkmann
    Cc: Thomas Hellstrom
    Cc: John David Anglin
    Cc: HATAYAMA Daisuke
    Cc: Konstantin Khlebnikov
    Cc: Carsten Otte
    Cc: Jared Hulbert
    Tested-by: Hannes Frederic Sowa
    Cc: Kirill A. Shutemov
    Acked-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: [3.11.x+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Commit bf6bddf1924e ("mm: introduce compaction and migration for
    ballooned pages") introduces page_count(page) into memory compaction
    which dereferences page->first_page if PageTail(page).

    This results in a very rare NULL pointer dereference on the
    aforementioned page_count(page). Indeed, anything that does
    compound_head(), including page_count() is susceptible to racing with
    prep_compound_page() and seeing a NULL or dangling page->first_page
    pointer.

    This patch uses Andrea's implementation of compound_trans_head() that
    deals with such a race and makes it the default compound_head()
    implementation. This includes a read memory barrier that ensures that
    if PageTail(head) is true that we return a head page that is neither
    NULL nor dangling. The patch then adds a store memory barrier to
    prep_compound_page() to ensure page->first_page is set.

    This is the safest way to ensure we see the head page that we are
    expecting, PageTail(page) is already in the unlikely() path and the
    memory barriers are unfortunately required.

    Hugetlbfs is the exception, we don't enforce a store memory barrier
    during init since no race is possible.
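
    A hedged sketch of the store side of the pairing (condensed; the real
    loop in prep_compound_page() differs in detail):

        for (i = 1; i < (1 << order); i++) {
                struct page *p = page + i;

                set_page_count(p, 0);
                p->first_page = page;
                /* make first_page visible before PageTail can be seen */
                smp_wmb();
                __SetPageTail(p);
        }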

    Signed-off-by: David Rientjes
    Cc: Holger Kiehl
    Cc: Christoph Lameter
    Cc: Rafael Aquini
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: "Kirill A. Shutemov"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

21 Feb, 2014

1 commit

  • The pmd pointer passed to pmd_lockptr/pmd_huge_pte can point to any
    entry in a pmd table. With USE_SPLIT_PMD_PTLOCKS==1 the code uses
    virt_to_page to get a struct page for the pmd table. The virt_to_page
    function automatically masks the lower PAGE_SHIFT bits from the
    address. But if the size of a pmd table is larger than PAGE_SIZE the
    additional bits are not removed from the pmd address and the wrong
    page struct is used.

    Fix this by explicitly masking the offset in the pmd table from
    the pmd pointer.
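
    The fix presumably boils down to a helper like this (a sketch under
    that assumption):

        static inline struct page *pmd_to_page(pmd_t *pmd)
        {
                unsigned long mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);

                return virt_to_page((void *)((unsigned long)pmd & mask));
        }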

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky