16 Apr, 2015

2 commits

  • The creators of the C language gave us the while keyword. Let's use
    that instead of synthesizing it from if+goto.
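    The change is purely mechanical; an illustrative sketch (not the actual
    mm/mmap.c hunk -- "struct node" and the helpers here are made up):

    struct node { struct node *next; };

    /* before: a loop spelled with if+goto */
    static struct node *last_goto(struct node *n)
    {
    again:
            if (n->next) {
                    n = n->next;
                    goto again;
            }
            return n;
    }

    /* after: the same loop written with while */
    static struct node *last_while(struct node *n)
    {
            while (n->next)
                    n = n->next;
            return n;
    }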

    Made possible by 6597d783397a ("mm/mmap.c: replace find_vma_prepare()
    with clearer find_vma_links()").

    [akpm@linux-foundation.org: fix 80-col overflows]
    Signed-off-by: Rasmus Villemoes
    Cc: "Kirill A. Shutemov"
    Cc: Sasha Levin
    Cc: Cyrill Gorcunov
    Cc: Roman Gushchin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • We converted some of the usages of ACCESS_ONCE to READ_ONCE in the mm/
    tree, since ACCESS_ONCE doesn't work reliably on non-scalar types.

    This patch removes the rest of the usages of ACCESS_ONCE and uses the new
    READ_ONCE API for the read accesses. This makes things cleaner, avoiding
    separate/multiple sets of APIs.
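    The conversion itself is just s/ACCESS_ONCE/READ_ONCE/ on the read side; an
    illustrative (not verbatim) example of the pattern, assuming a lockless read
    of a pointer field:

    struct vm_area_struct *vma;

    /* before */
    vma = ACCESS_ONCE(mm->mmap);

    /* after */
    vma = READ_ONCE(mm->mmap);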

    Signed-off-by: Jason Low
    Acked-by: Michal Hocko
    Acked-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Reviewed-by: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Low
     

15 Apr, 2015

1 commit

  • __mlock_vma_pages_range() doesn't necessarily mlock pages. It depends on
    vma flags. The same codepath is used for MAP_POPULATE.

    Let's rename __mlock_vma_pages_range() to populate_vma_page_range().

    This patch also drops mlock_vma_pages_range() references from the
    documentation; that function went away in cea10a19b797 ("mm: directly use
    __mlock_vma_pages_range() in find_extend_vma()").

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Linus Torvalds
    Acked-by: David Rientjes
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

26 Mar, 2015

1 commit

  • I have constantly stumbled upon "kernel BUG at mm/rmap.c:399!" after
    upgrading to 3.19 and had no luck with 4.0-rc1 either.

    So, after looking into new logic introduced by commit 7a3ef208e662 ("mm:
    prevent endless growth of anon_vma hierarchy"), I found chances are that
    unlink_anon_vmas() is called without incrementing dst->anon_vma->degree
    in anon_vma_clone() due to allocation failure. If dst->anon_vma is not
    NULL in error path, its degree will be incorrectly decremented in
    unlink_anon_vmas() and eventually underflow when exiting as a result of
    another call to unlink_anon_vmas(). That's how "kernel BUG at
    mm/rmap.c:399!" is triggered for me.

    This patch fixes the underflow by dropping dst->anon_vma when allocation
    fails. It's safe to do so regardless of the original value of dst->anon_vma,
    because dst->anon_vma has no valid meaning if anon_vma_clone() fails.
    Besides, callers don't care about dst->anon_vma in that case either.
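    A sketch of the resulting error path in anon_vma_clone() (close to, though
    not necessarily verbatim, the final mm/rmap.c change):

    enomem_failure:
            /*
             * dst->anon_vma is dropped here: otherwise its degree could be
             * incorrectly decremented in unlink_anon_vmas().  Callers don't
             * care about dst->anon_vma when anon_vma_clone() fails.
             */
            dst->anon_vma = NULL;
            unlink_anon_vmas(dst);
            return -ENOMEM;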

    As also suggested by Michal Hocko, we can clean up vma_adjust() a bit, since
    anon_vma_clone() now does the work.

    [akpm@linux-foundation.org: tweak comment]
    Fixes: 7a3ef208e662 ("mm: prevent endless growth of anon_vma hierarchy")
    Signed-off-by: Leon Yu
    Signed-off-by: Konstantin Khlebnikov
    Reviewed-by: Michal Hocko
    Acked-by: Rik van Riel
    Acked-by: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Leon Yu
     

12 Feb, 2015

3 commits

  • I noticed that "allowed" can easily overflow by falling below 0, because
    (total_vm / 32) can be larger than "allowed". The problem occurs in
    OVERCOMMIT_NONE mode.

    In this case, a huge allocation can succeed and overcommit the system
    (despite OVERCOMMIT_NONE mode). All subsequent allocations will fail
    (system-wide), so the system becomes unusable.

    The problem was masked out by commit c9b1d0981fcc
    ("mm: limit growth of 3% hardcoded other user reserve"),
    but it's easy to reproduce it on older kernels:
    1) set overcommit_memory sysctl to 2
    2) mmap() large file multiple times (with VM_SHARED flag)
    3) try to malloc() large amount of memory

    It can also be reproduced on newer kernels, but a misconfigured
    sysctl_user_reserve_kbytes is required.

    Fix this issue by switching to signed arithmetic here.
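    A userspace illustration of the underflow (the kernel values involved are
    unsigned long, so "allowed - total_vm/32" wraps to a huge number instead of
    going negative):

    #include <stdio.h>

    int main(void)
    {
            unsigned long allowed = 1000;           /* pages we may hand out */
            unsigned long reserve = 100000 / 32;    /* (total_vm / 32) > allowed */

            allowed -= reserve;                     /* wraps around */
            printf("unsigned: %lu\n", allowed);     /* absurdly large "limit" */

            long signed_allowed = 1000 - (long)reserve;
            printf("signed:   %ld\n", signed_allowed);  /* properly negative */
            return 0;
    }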

    [akpm@linux-foundation.org: use min_t]
    Signed-off-by: Roman Gushchin
    Cc: Andrew Shewmaker
    Cc: Rik van Riel
    Cc: Konstantin Khlebnikov
    Reviewed-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • The problem is that we check nr_ptes/nr_pmds in exit_mmap(), which happens
    *before* pgd_free(). If an arch allocates pte/pmd tables in pgd_alloc() and
    frees them in pgd_free(), the counters are offset by the time of the checks.

    We tried to work around this by offsetting the expected counter value
    according to FIRST_USER_ADDRESS for both nr_ptes and nr_pmds in exit_mmap(),
    but it doesn't work in some cases:

    1. ARM with LPAE enabled also has a non-zero USER_PGTABLES_CEILING, but the
    upper addresses are occupied by huge pmd entries, so the trick of offsetting
    the expected counter value gets really ugly: we would have to apply it to
    nr_pmds, but not to nr_ptes.

    2. Metag has a non-zero FIRST_USER_ADDRESS, but doesn't allocate pte/pmd
    page tables in pgd_alloc(); it just sets up a pgd entry which is allocated
    at boot and shared across all processes.

    The proposal is to move the check to check_mm(), which happens *after*
    pgd_free(), and to do proper accounting during pgd_alloc() and pgd_free(),
    which brings the counters to zero if nothing leaked.
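    A hedged sketch of the kind of check that ends up in check_mm() (which runs
    after pgd_free(), so any pgd_alloc()/pgd_free() accounting has already been
    undone):

    if (mm_nr_pmds(mm))
            pr_alert("BUG: non-zero nr_pmds on freeing mm: %ld\n",
                     mm_nr_pmds(mm));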

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Tyler Baker
    Tested-by: Tyler Baker
    Tested-by: Nishanth Menon
    Cc: Russell King
    Cc: James Hogan
    Cc: Guan Xuetao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Dave noticed that an unprivileged process can allocate a significant amount
    of memory -- >500 MiB on x86_64 -- and stay unnoticed by the oom-killer and
    memory cgroup. The trick is to allocate a lot of PMD page tables: the Linux
    kernel doesn't account PMD tables to the process, only PTE tables.

    The test case below uses a few tricks to allocate a lot of PMD page tables
    while keeping VmRSS and VmPTE low. oom_score for the process will be 0.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/prctl.h>

    #define PUD_SIZE (1UL << 30)
    #define PMD_SIZE (1UL << 21)

    #define NR_PUD 130000

    int main(void)
    {
            char *addr = NULL;
            unsigned long i;

            /* Keep THP out of the way so each mapping gets its own PMD table. */
            prctl(PR_SET_THP_DISABLE);
            for (i = 0; i < NR_PUD ; i++) {
                    addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
                                    MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
                    if (addr == MAP_FAILED) {
                            perror("mmap");
                            break;
                    }
                    /* Fault in one byte: allocates a PMD table, a PTE table and a page. */
                    *addr = 'x';
                    /*
                     * Unmap and re-map the first 2M: the page and the PTE table go
                     * away again, but the PMD table for the 1G range stays behind.
                     */
                    munmap(addr, PMD_SIZE);
                    mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
                                    MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
                    if (addr == MAP_FAILED)
                            perror("re-mmap"), exit(1);
            }
            printf("PID %d consumed %lu KiB in PMD page tables\n",
                            getpid(), i * 4096 >> 10);
            return pause();
    }

    The patch addresses the issue by accounting PMD tables to the process the
    same way we account PTE tables.

    The main places where PMD tables are accounted are __pmd_alloc() and
    free_pmd_range(), but there are a few corner cases:

    - HugeTLB can share PMD page tables. The patch handles this by accounting
    the table to every process that shares it.

    - x86 PAE pre-allocates a few PMD tables on fork.

    - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust the sanity
    check on exit(2).

    Accounting only happens on configurations where the PMD page table level is
    present (PMD is not folded). As with nr_ptes, we use a per-mm counter. The
    counter value is used to calculate the baseline for the badness score by
    the oom-killer.

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dave Hansen
    Cc: Hugh Dickins
    Reviewed-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: David Rientjes
    Tested-by: Sedat Dilek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

11 Feb, 2015

2 commits

  • We don't create non-linear mappings anymore. Let's drop code which
    handles them in rmap.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • remap_file_pages(2) was invented to efficiently map parts of a huge file
    into a limited 32-bit virtual address space, such as in database workloads.

    Nonlinear mappings are a pain to support, and it seems there are no
    legitimate use-cases nowadays since 64-bit systems are widely available.

    Let's drop it and get rid of all this special-cased code.

    The patch replaces the syscall with an emulation which creates a new VMA on
    each remap_file_pages(), unless it can be merged with an adjacent one.

    I didn't find *any* real code that uses remap_file_pages(2) to test
    emulation impact on. I've checked Debian code search and source of all
    packages in ALT Linux. No real users: libc wrappers, mentions in
    strace, gdb, valgrind and this kind of stuff.

    There are a few basic tests in LTP for the syscall. They work just fine
    with the emulation.

    To test the performance impact, I've written a small test case which
    demonstrates pretty much the worst-case scenario: map a 4G shmfs file,
    write each page's pgoff to the beginning of the page, remap the pages in
    reverse order, and read every page.

    The test creates 1 million VMAs if the emulation is in use, so I had to set
    vm.max_map_count to 1100000 to avoid -ENOMEM.

    Before: 23.3 ( +- 4.31% ) seconds
    After: 43.9 ( +- 0.85% ) seconds
    Slowdown: 1.88x

    I believe we can live with that.

    Test case:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <assert.h>
    #include <sys/mman.h>

    #define MB (1024UL * 1024)
    #define SIZE (4096 * MB)

    int main(int argc, char **argv)
    {
            unsigned long *p;
            long i, pass;

            for (pass = 0; pass < 10; pass++) {
                    p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
                                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
                    if (p == MAP_FAILED) {
                            perror("mmap");
                            return -1;
                    }

                    /* Write each page's pgoff into its first word. */
                    for (i = 0; i < SIZE / 4096; i++)
                            p[i * 4096 / sizeof(*p)] = i;

                    /* Remap the pages in reverse order. */
                    for (i = 0; i < SIZE / 4096; i++) {
                            if (remap_file_pages(p + i * 4096 / sizeof(*p), 4096,
                                            0, (SIZE - 4096 * (i + 1)) >> 12, 0)) {
                                    perror("remap_file_pages");
                                    return -1;
                            }
                    }

                    /* Read everything back, checking the reversed layout. */
                    for (i = SIZE / 4096 - 1; i >= 0; i--)
                            assert(p[i * 4096 / sizeof(*p)] == SIZE / 4096 - i - 1);

                    munmap(p, SIZE);
            }

            return 0;
    }

    [akpm@linux-foundation.org: fix spello]
    [sasha.levin@oracle.com: initialize populate before usage]
    [sasha.levin@oracle.com: grab file ref to prevent race while mmaping]
    Signed-off-by: "Kirill A. Shutemov"
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Dave Jones
    Cc: Linus Torvalds
    Cc: Armin Rigo
    Signed-off-by: Sasha Levin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

12 Jan, 2015

2 commits

  • Fix for BUG_ON(anon_vma->degree) splashes in unlink_anon_vmas() ("kernel
    BUG at mm/rmap.c:399!") caused by commit 7a3ef208e662 ("mm: prevent
    endless growth of anon_vma hierarchy")

    anon_vma_clone() is usually called with a copy of the source vma as the
    destination argument. If the source vma has an anon_vma, it should already
    be in dst->anon_vma. NULL in dst->anon_vma is used as a sign that it's
    called from anon_vma_fork(); in this case anon_vma_clone() finds an
    anon_vma to reuse.

    vma_adjust() calls it differently, and this breaks the anon_vma reuse
    logic: anon_vma_clone() links the vma to the old anon_vma and updates the
    degree counters, but vma_adjust() overrides vma->anon_vma right after that.
    As a result, the final unlink_anon_vmas() decrements the degree of the
    wrong anon_vma.

    This patch assigns ->anon_vma before calling anon_vma_clone().

    Signed-off-by: Konstantin Khlebnikov
    Reported-and-tested-by: Chris Clayton
    Reported-and-tested-by: Oded Gabbay
    Reported-and-tested-by: Chih-Wei Huang
    Acked-by: Rik van Riel
    Acked-by: Vlastimil Babka
    Cc: Daniel Forrest
    Cc: Michal Hocko
    Cc: stable@vger.kernel.org # to match back-porting of 7a3ef208e662
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Commit fee7e49d4514 ("mm: propagate error from stack expansion even for
    guard page") made sure that we return the error properly for stack
    growth conditions. It also theorized that counting the guard page
    towards the stack limit might break something, but also said "Let's see
    if anybody notices".

    Somebody did notice. Apparently android-x86 sets the stack limit very
    close to the limit indeed, and including the guard page in the rlimit
    check causes the android 'zygote' process problems.

    So this adds the (fairly trivial) code to make the stack rlimit check be
    against the actual real stack size, rather than the size of the vma that
    includes the guard page.
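    The stack limit test in acct_stack_growth() then looks roughly like this
    (a sketch, not necessarily the verbatim hunk):

    /* Compare the real stack size, i.e. without the guard page */
    actual_size = size;
    if (size && (vma->vm_flags & (VM_GROWSUP | VM_GROWSDOWN)))
            actual_size -= PAGE_SIZE;
    if (actual_size > rlim[RLIMIT_STACK].rlim_cur)
            return -ENOMEM;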

    Reported-and-tested-by: Chih-Wei Huang
    Cc: Jay Foad
    Cc: stable@kernel.org # to match back-porting of fee7e49d4514
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

14 Dec, 2014

3 commits

  • This lets drivers like the AMD IOMMUv2 driver handle faults a bit more
    simply, rather than doing tricks with page refs and get_user_pages().

    Signed-off-by: Jesse Barnes
    Cc: Oded Gabbay
    Cc: Joerg Roedel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesse Barnes
     
  • The i_mmap_mutex is a close cousin of the anon vma lock, both protecting
    similar data, one for file backed pages and the other for anon memory. To
    this end, this lock can also be a rwsem. In addition, there are some
    important opportunities to share the lock when there are no tree
    modifications.

    This conversion is straightforward. For now, all users take the write
    lock.

    [sfr@canb.auug.org.au: update fremap.c]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Acked-by: "Kirill A. Shutemov"
    Acked-by: Hugh Dickins
    Cc: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Cc: Srikar Dronamraju
    Acked-by: Mel Gorman
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Convert all open coded mutex_lock/unlock calls to the
    i_mmap_[lock/unlock]_write() helpers.
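    That is, sites that used to do

    mutex_lock(&mapping->i_mmap_mutex);
    ...
    mutex_unlock(&mapping->i_mmap_mutex);

    now read

    i_mmap_lock_write(mapping);
    ...
    i_mmap_unlock_write(mapping);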

    Signed-off-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Acked-by: "Kirill A. Shutemov"
    Acked-by: Hugh Dickins
    Cc: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Cc: Srikar Dronamraju
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

11 Dec, 2014

1 commit

  • Pull x86 MPX support from Thomas Gleixner:
    "This enables support for x86 MPX.

    MPX is a new debug feature for bound checking in user space. It
    requires kernel support to handle the bound tables and decode the
    bound violating instruction in the trap handler"

    * 'x86-mpx-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    asm-generic: Remove asm-generic arch_bprm_mm_init()
    mm: Make arch_unmap()/bprm_mm_init() available to all architectures
    x86: Cleanly separate use of asm-generic/mm_hooks.h
    x86 mpx: Change return type of get_reg_offset()
    fs: Do not include mpx.h in exec.c
    x86, mpx: Add documentation on Intel MPX
    x86, mpx: Cleanup unused bound tables
    x86, mpx: On-demand kernel allocation of bounds tables
    x86, mpx: Decode MPX instruction to get bound violation information
    x86, mpx: Add MPX-specific mmap interface
    x86, mpx: Introduce VM_MPX to indicate that a VMA is MPX specific
    x86, mpx: Add MPX to disabled features
    ia64: Sync struct siginfo with general version
    mips: Sync struct siginfo with general version
    mpx: Extend siginfo structure to include bound violation information
    x86, mpx: Rename cfg_reg_u and status_reg
    x86: mpx: Give bndX registers actual names
    x86: Remove arbitrary instruction size limit in instruction decoder

    Linus Torvalds
     

04 Dec, 2014

1 commit

  • Andrew Morton noticed that the error return from anon_vma_clone() was
    being dropped and replaced with -ENOMEM (which is not itself a bug
    because the only error return value from anon_vma_clone() is -ENOMEM).

    I did an audit of callers of anon_vma_clone() and discovered an actual
    bug where the error return was being lost. In __split_vma(), between
    Linux 3.11 and 3.12 the code was changed so the err variable is used
    before the call to anon_vma_clone() and the default initial value of
    -ENOMEM is overwritten. So a failure of anon_vma_clone() will return
    success since err at this point is now zero.

    Below is a patch which fixes this bug and also propagates the error
    return value from anon_vma_clone() in all cases.

    Fixes: ef0855d334e1 ("mm: mempolicy: turn vma_set_policy() into vma_dup_policy()")
    Signed-off-by: Daniel Forrest
    Reviewed-by: Michal Hocko
    Cc: Konstantin Khlebnikov
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Tim Hartrick
    Cc: Hugh Dickins
    Cc: Michel Lespinasse
    Cc: Vlastimil Babka
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Forrest
     

18 Nov, 2014

1 commit

  • The previous patch allocates bounds tables on-demand. As noted in
    an earlier description, these can add up to *HUGE* amounts of
    memory. This has caused OOMs in practice when running tests.

    This patch adds support for freeing bounds tables when they are no
    longer in use.

    There are two types of mappings in play when unmapping tables:
    1. The mapping with the actual data, which userspace is
    munmap()ing or brk()ing away, etc...
    2. The mapping for the bounds table *backing* the data
    (is tagged with VM_MPX, see the patch "add MPX specific
    mmap interface").

    If userspace uses the prctl() introduced earlier in this patch set
    to enable kernel management of bounds tables, then when it unmaps
    the first type of mapping (the one with the actual data), the kernel
    needs to free the mapping for the bounds table backing that data.
    This patch hooks in at the very end of do_unmap() to do so.
    We look at the addresses being unmapped and find the bounds
    directory entries and tables which cover those addresses. If
    an entire table is unused, we clear the associated directory entry
    and free the table.

    Once we unmap the bounds table, we would have a bounds directory
    entry pointing at empty address space. That address space might
    now be allocated for some other (random) use, and the MPX
    hardware might now try to walk it as if it were a bounds table.
    That would be bad. So any unmapping of an entire bounds table
    has to be accompanied by a corresponding write to the bounds
    directory entry to invalidate it. That write to the bounds
    directory can fault, which causes the following problem:

    Since we are doing the freeing from munmap() (and other paths
    like it), we hold mmap_sem for write. If we fault, the page
    fault handler will attempt to acquire mmap_sem for read and
    we will deadlock. To avoid the deadlock, we pagefault_disable()
    when touching the bounds directory entry and use a
    get_user_pages() to resolve the fault.
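    The pattern is roughly the following (identifiers illustrative, not the
    exact arch/x86/mm/mpx.c code):

    pagefault_disable();            /* we hold mmap_sem for write */
    ret = __put_user(0, bd_entry);  /* try to invalidate the directory entry */
    pagefault_enable();
    if (ret)
            /*
             * The write faulted: resolve the fault explicitly with
             * get_user_pages() and retry, instead of letting the page
             * fault handler take mmap_sem for read and deadlock.
             */
            ret = resolve_and_retry(bd_entry);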

    The unmapping of bounds tables happens under vm_munmap(). We
    also (indirectly) call vm_munmap() to _do_ the unmapping of the
    bounds tables. We avoid unbounded recursion by disallowing
    freeing of bounds tables *for* bounds tables. This would not
    occur normally, so should not have any practical impact. Being
    strict about it here helps ensure that we do not have an
    exploitable stack overflow.

    Based-on-patch-by: Qiaowei Ren
    Signed-off-by: Dave Hansen
    Cc: linux-mm@kvack.org
    Cc: linux-mips@linux-mips.org
    Cc: Dave Hansen
    Link: http://lkml.kernel.org/r/20141114151831.E4531C4A@viggo.jf.intel.com
    Signed-off-by: Thomas Gleixner

    Dave Hansen
     

30 Oct, 2014

1 commit

  • If an anonymous mapping is not allowed to fault thp memory and then
    madvise(MADV_HUGEPAGE) is used after fault, khugepaged will never
    collapse this memory into thp memory.

    This occurs because the madvise(2) handler for thp, hugepage_madvise(),
    clears VM_NOHUGEPAGE on the stack and it isn't stored in vma->vm_flags
    until the final action of madvise_behavior(). This causes the
    khugepaged_enter_vma_merge() to be a no-op in hugepage_madvise() when
    the vma had previously had VM_NOHUGEPAGE set.

    Fix this by passing the correct vma flags to the khugepaged mm slot
    handler. There's no chance khugepaged can run on this vma until after
    madvise_behavior() returns since we hold mm->mmap_sem.
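    A sketch of the call site in hugepage_madvise() after the fix (the flags
    come from the madvise handler's on-stack copy, not vma->vm_flags):

    if (unlikely(khugepaged_enter_vma_merge(vma, *vm_flags)))
            return -ENOMEM;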

    It would be possible to clear VM_NOHUGEPAGE directly from vma->vm_flags
    in hugepage_madvise(), but I didn't want to introduce special-case
    behavior into madvise_behavior(). I think it's best to just let it
    always set vma->vm_flags itself.

    Signed-off-by: David Rientjes
    Reported-by: Suleiman Souhlal
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

14 Oct, 2014

1 commit

  • For VMAs that don't want write notifications, PTEs created for read faults
    have their write bit set. If the read fault happens after VM_SOFTDIRTY is
    cleared, then the PTE's softdirty bit will remain clear after subsequent
    writes.

    Here's a simple code snippet to demonstrate the bug:

    char *m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
                   MAP_ANONYMOUS | MAP_SHARED, -1, 0);
    system("echo 4 > /proc/$PPID/clear_refs");  /* clear VM_SOFTDIRTY */
    assert(*m == '\0');      /* new PTE allows write access */
    assert(!soft_dirty(m));  /* soft_dirty(): helper (not shown) reading the
                                page's soft-dirty bit, e.g. from pagemap */
    *m = 'x';                /* should dirty the page */
    assert(soft_dirty(m));   /* fails */

    With this patch, write notifications are enabled when VM_SOFTDIRTY is
    cleared. Furthermore, to avoid unnecessary faults, write notifications
    are disabled when VM_SOFTDIRTY is set.

    As a side effect of enabling and disabling write notifications with
    care, this patch fixes a bug in mprotect where vm_page_prot bits set by
    drivers were zapped on mprotect. An analogous bug was fixed in mmap by
    commit c9d0bf241451 ("mm: uncached vma support with writenotify").

    Signed-off-by: Peter Feiner
    Reported-by: Peter Feiner
    Suggested-by: Kirill A. Shutemov
    Cc: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Jamie Liu
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: Bjorn Helgaas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Feiner
     

10 Oct, 2014

6 commits

  • Pull percpu updates from Tejun Heo:
    "A lot of activities on percpu front. Notable changes are...

    - percpu allocator now can take @gfp. If @gfp doesn't contain
    GFP_KERNEL, it tries to allocate from what's already available to
    the allocator and a work item tries to keep the reserve around
    certain level so that these atomic allocations usually succeed.

    This will replace the ad-hoc percpu memory pool used by
    blk-throttle and also be used by the planned blkcg support for
    writeback IOs.

    Please note that I noticed a bug in how @gfp is interpreted while
    preparing this pull request and applied the fix 6ae833c7fe0c
    ("percpu: fix how @gfp is interpreted by the percpu allocator")
    just now.

    - percpu_ref now uses longs for percpu and global counters instead of
    ints. It leads to more sparse packing of the percpu counters on
    64bit machines but the overhead should be negligible and this
    allows using percpu_ref for refcnting pages and in-memory objects
    directly.

    - The switching between percpu and single counter modes of a
    percpu_ref is made independent of putting the base ref and a
    percpu_ref can now optionally be initialized in single or killed
    mode. This allows avoiding percpu shutdown latency for cases where
    the refcounted objects may be synchronously created and destroyed
    in rapid succession with only a fraction of them reaching fully
    operational status (SCSI probing does this when combined with
    blk-mq support). It's also planned to be used to implement forced
    single mode to detect underflow more timely for debugging.

    There's a separate branch percpu/for-3.18-consistent-ops which cleans
    up the duplicate percpu accessors. That branch causes a number of
    conflicts with s390 and other trees. I'll send a separate pull
    request w/ resolutions once other branches are merged"

    * 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (33 commits)
    percpu: fix how @gfp is interpreted by the percpu allocator
    blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode
    percpu_ref: make INIT_ATOMIC and switch_to_atomic() sticky
    percpu_ref: add PERCPU_REF_INIT_* flags
    percpu_ref: decouple switching to percpu mode and reinit
    percpu_ref: decouple switching to atomic mode and killing
    percpu_ref: add PCPU_REF_DEAD
    percpu_ref: rename things to prepare for decoupling percpu/atomic mode switch
    percpu_ref: replace pcpu_ prefix with percpu_
    percpu_ref: minor code and comment updates
    percpu_ref: relocate percpu_ref_reinit()
    Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe"
    Revert "percpu: free percpu allocation info for uniprocessor system"
    percpu-refcount: make percpu_ref based on longs instead of ints
    percpu-refcount: improve WARN messages
    percpu: fix locking regression in the failure path of pcpu_alloc()
    percpu-refcount: add @gfp to percpu_ref_init()
    proportions: add @gfp to init functions
    percpu_counter: add @gfp to percpu_counter_init()
    percpu_counter: make percpu_counters_lock irq-safe
    ...

    Linus Torvalds
     
  • Dump the contents of the relevant mm_struct when we hit the bug condition.

    Signed-off-by: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
  • - be consistent in printing the test which failed

    - one message was actually wrong (aa)

    - don't print second bogus warning if browse_rb() failed

    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Trivially convert a few VM_BUG_ON calls to VM_BUG_ON_VMA to extract
    more information when they trigger.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Sasha Levin
    Reviewed-by: Naoya Horiguchi
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Vlastimil Babka
    Cc: Michel Lespinasse
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
  • Signed-off-by: Cyrill Gorcunov
    Cc: Kees Cook
    Cc: Tejun Heo
    Cc: Andrew Vagin
    Cc: Eric W. Biederman
    Cc: H. Peter Anvin
    Acked-by: Serge Hallyn
    Cc: Pavel Emelyanov
    Cc: Vasiliy Kulikov
    Cc: KAMEZAWA Hiroyuki
    Cc: Michael Kerrisk
    Cc: Julien Tinnes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Signed-off-by: vishnu.ps
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    vishnu.ps
     

25 Sep, 2014

1 commit

  • …linux-block into for-3.18

    This is to receive 0a30288da1ae ("blk-mq, percpu_ref: implement a
    kludge for SCSI blk-mq stall during probe") which implements
    __percpu_ref_kill_expedited() to work around SCSI blk-mq stall. The
    commit will be reverted and patches to implement a proper fix will be added.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: Kent Overstreet <kmo@daterainc.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Christoph Hellwig <hch@lst.de>

    Tejun Heo
     

11 Sep, 2014

1 commit


08 Sep, 2014

1 commit

  • Percpu allocator now supports allocation mask. Add @gfp to
    percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
    with percpu_counters too.

    We could have left percpu_counter_init() alone and added
    percpu_counter_init_gfp(); however, the number of users isn't that
    high and introducing _gfp variants to all percpu data structures would
    be quite ugly, so let's just do the conversion. This is the one with
    the most users. Other percpu data structures are a lot easier to
    convert.

    This patch doesn't make any functional difference.
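    Callers now pass the gfp mask explicitly, e.g.:

    /* was: err = percpu_counter_init(&counter, 0); */
    err = percpu_counter_init(&counter, 0, GFP_KERNEL);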

    Signed-off-by: Tejun Heo
    Acked-by: Jan Kara
    Acked-by: "David S. Miller"
    Cc: x86@kernel.org
    Cc: Jens Axboe
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andrew Morton

    Tejun Heo
     

09 Aug, 2014

1 commit

  • This patch (of 6):

    The i_mmap_writable field counts existing writable mappings of an
    address_space. To allow drivers to prevent new writable mappings, make
    this counter signed and prevent new writable mappings if it is negative.
    This is modelled after i_writecount and DENYWRITE.

    This will be required by the shmem-sealing infrastructure to prevent any
    new writable mappings after the WRITE seal has been set. In case there
    exists a writable mapping, this operation will fail with EBUSY.

    Note that we rely on the fact that iff you already own a writable mapping,
    you can increase the counter without using the helpers. This is the same as
    what we do for i_writecount.
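    The counter manipulation boils down to helpers along these lines (a sketch
    of the fs.h additions):

    /* map a new shared-writable mapping, unless writes are currently denied */
    static inline int mapping_map_writable(struct address_space *mapping)
    {
            return atomic_inc_unless_negative(&mapping->i_mmap_writable) ?
                    0 : -EPERM;
    }

    /* deny new writable mappings, unless some already exist */
    static inline int mapping_deny_writable(struct address_space *mapping)
    {
            return atomic_dec_unless_positive(&mapping->i_mmap_writable) ?
                    0 : -EBUSY;
    }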

    Signed-off-by: David Herrmann
    Acked-by: Hugh Dickins
    Cc: Michael Kerrisk
    Cc: Ryan Lortie
    Cc: Lennart Poettering
    Cc: Daniel Mack
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Herrmann
     

07 Aug, 2014

1 commit

  • Print a warning (if CONFIG_DEBUG_VM=y) when memory commitment becomes
    too negative.

    This shouldn't happen any more - the previous two patches fixed the
    committed_as underflow issues.
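    The check added is roughly (a sketch, not necessarily the verbatim hunk):

    VM_WARN_ONCE(percpu_counter_read(&vm_committed_as) <
                    -(s64)vm_committed_as_batch * num_online_cpus(),
                    "memory commitment underflow");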

    [akpm@linux-foundation.org: use VM_WARN_ONCE, per Dave]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

07 Jun, 2014

1 commit

  • printk is meant to be used with an associated log level. There are some
    instances of printk scattered around the mm code where the log level is
    missing. Add a log level and adhere to suggestions by
    scripts/checkpatch.pl by moving to the pr_* macros.

    Also add the typical pr_fmt definition so that print statements can be
    easily traced back to the modules where they occur, correlated one with
    another, etc. This will require the removal of some (now redundant)
    prefixes on a few print statements.
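    That is, the usual pattern (the actual prefix string is per-file):

    #define pr_fmt(fmt) "mm/mmap: " fmt   /* must come before the #includes */

    ...
    pr_warn("something went wrong: %d\n", err);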

    Signed-off-by: Mitchel Humpherys
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mitchel Humpherys
     

05 Jun, 2014

3 commits

  • Pull x86 vdso updates from Peter Anvin:
    "Vdso cleanups and improvements largely from Andy Lutomirski. This
    makes the vdso a lot less 'special'"

    * 'x86/vdso' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/vdso, build: Make LE access macros clearer, host-safe
    x86/vdso, build: Fix cross-compilation from big-endian architectures
    x86/vdso, build: When vdso2c fails, unlink the output
    x86, vdso: Fix an OOPS accessing the HPET mapping w/o an HPET
    x86, mm: Replace arch_vma_name with vm_ops->name for vsyscalls
    x86, mm: Improve _install_special_mapping and fix x86 vdso naming
    mm, fs: Add vm_ops->name as an alternative to arch_vma_name
    x86, vdso: Fix an OOPS accessing the HPET mapping w/o an HPET
    x86, vdso: Remove vestiges of VDSO_PRELINK and some outdated comments
    x86, vdso: Move the vvar and hpet mappings next to the 64-bit vDSO
    x86, vdso: Move the 32-bit vdso special pages after the text
    x86, vdso: Reimplement vdso.so preparation in build-time C
    x86, vdso: Move syscall and sysenter setup into kernel/cpu/common.c
    x86, vdso: Clean up 32-bit vs 64-bit vdso params
    x86, mm: Ensure correct alignment of the fixmap

    Linus Torvalds
     
  • Remove the first mapping check for vma_link. Move the mutex_lock into the
    braces when vma->vm_file is true.

    Signed-off-by: Huang Shijie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     
  • Fix a coccinelle error regarding usage of IS_ERR and PTR_ERR instead of
    PTR_ERR_OR_ZERO.
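    That is, something like

    return IS_ERR(vma) ? PTR_ERR(vma) : 0;

    becomes

    return PTR_ERR_OR_ZERO(vma);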

    Signed-off-by: Duan Jiong
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Duan Jiong
     

21 May, 2014

1 commit

  • Using arch_vma_name to give special mappings a name is awkward. x86
    currently implements it by comparing the start address of the vma to
    the expected address of the vdso. This requires tracking the start
    address of special mappings and is probably buggy if a special vma
    is split or moved.

    Improve _install_special_mapping to just name the vma directly. Use
    it to give the x86 vvar area a name, which should make CRIU's life
    easier.
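    With this, a special mapping carries its name directly; a sketch of the new
    interface (the pages array here stands in for the real x86 vvar pages):

    static const struct vm_special_mapping vvar_mapping = {
            .name = "[vvar]",
            .pages = vvar_pages,
    };

    vma = _install_special_mapping(mm, addr, size,
                                   VM_READ | VM_MAYREAD,
                                   &vvar_mapping);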

    As a side effect, the vvar area will show up in core dumps. This
    could be considered weird and is fixable.

    [hpa: I say we accept this as-is but be prepared to deal with knocking
    out the vvars from core dumps if this becomes a problem.]

    Cc: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Signed-off-by: Andy Lutomirski
    Link: http://lkml.kernel.org/r/276b39b6b645fb11e345457b503f17b83c2c6fd0.1400538962.git.luto@amacapital.net
    Signed-off-by: H. Peter Anvin

    Andy Lutomirski
     

08 Apr, 2014

1 commit

  • This patch is a continuation of efforts trying to optimize find_vma(),
    avoiding potentially expensive rbtree walks to locate a vma upon faults.
    The original approach (https://lkml.org/lkml/2013/11/1/410), where the
    largest vma was also cached, ended up being too specific and random,
    thus further comparison with other approaches was needed. There are two
    things to consider when dealing with this: the cache hit rate and the
    latency of find_vma(). Improving the hit rate does not necessarily
    translate into finding the vma any faster, as the overhead of any fancy
    caching scheme can be too high to consider.

    We currently cache the last used vma for the whole address space, which
    provides a nice optimization, reducing the total cycles in find_vma() by
    up to 250%, for workloads with good locality. On the other hand, this
    simple scheme is pretty much useless for workloads with poor locality.
    Analyzing ebizzy runs shows that, no matter how many threads are
    running, the mmap_cache hit rate is less than 2%, and in many situations
    below 1%.

    The proposed approach is to replace this scheme with a small per-thread
    cache, maximizing hit rates at a very low maintenance cost.
    Invalidations are performed by simply bumping up a 32-bit sequence
    number. The only expensive operation is in the rare case of a seq
    number overflow, where all caches that share the same address space are
    flushed. Upon a miss, the proposed replacement policy is based on the
    page number that contains the virtual address in question. Concretely,
    the following results are seen on an 80 core, 8 socket x86-64 box:

    1) System bootup: Most programs are single threaded, so the per-thread
    scheme does improve ~50% hit rate by just adding a few more slots to
    the cache.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline | 50.61% | 19.90 |
    | patched | 73.45% | 13.58 |
    +----------------+----------+------------------+

    2) Kernel build: This one is already pretty good with the current
    approach as we're dealing with good locality.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline | 75.28% | 11.03 |
    | patched | 88.09% | 9.31 |
    +----------------+----------+------------------+

    3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline | 70.66% | 17.14 |
    | patched | 91.15% | 12.57 |
    +----------------+----------+------------------+

    4) Ebizzy: There's a fair amount of variation from run to run, but this
    approach always shows nearly perfect hit rates, while baseline is just
    about non-existent. The amounts of cycles can fluctuate between
    anywhere from ~60 to ~116 for the baseline scheme, but this approach
    reduces it considerably. For instance, with 80 threads:

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline | 1.06% | 91.54 |
    | patched | 99.97% | 14.18 |
    +----------------+----------+------------------+
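    A condensed sketch of the scheme (simplified from the mm/vmacache.c this
    introduces): the replacement slot is picked from the address's page number,
    and invalidation just bumps a per-mm sequence number that lookups compare
    against.

    #define VMACACHE_BITS   2
    #define VMACACHE_SIZE   (1U << VMACACHE_BITS)
    #define VMACACHE_HASH(addr) (((addr) >> PAGE_SHIFT) & (VMACACHE_SIZE - 1))

    void vmacache_update(unsigned long addr, struct vm_area_struct *newvma)
    {
            if (vmacache_valid_mm(newvma->vm_mm))
                    current->vmacache[VMACACHE_HASH(addr)] = newvma;
    }

    static inline void vmacache_invalidate(struct mm_struct *mm)
    {
            mm->vmacache_seqnum++;  /* all caches sharing this mm go stale */
    }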

    [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
    [akpm@linux-foundation.org: document vmacache_valid() logic]
    [akpm@linux-foundation.org: attempt to untangle header files]
    [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
    [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: adjust and enhance comments]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Acked-by: Linus Torvalds
    Reviewed-by: Michel Lespinasse
    Cc: Oleg Nesterov
    Tested-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

05 Apr, 2014

1 commit

  • Pull file locking updates from Jeff Layton:
    "Highlights:

    - maintainership change for fs/locks.c. Willy's not interested in
    maintaining it these days, and is OK with Bruce and I taking it.
    - fix for open vs setlease race that Al ID'ed
    - cleanup and consolidation of file locking code
    - eliminate unneeded BUG() call
    - merge of file-private lock implementation"

    * 'locks-3.15' of git://git.samba.org/jlayton/linux:
    locks: make locks_mandatory_area check for file-private locks
    locks: fix locks_mandatory_locked to respect file-private locks
    locks: require that flock->l_pid be set to 0 for file-private locks
    locks: add new fcntl cmd values for handling file private locks
    locks: skip deadlock detection on FL_FILE_PVT locks
    locks: pass the cmd value to fcntl_getlk/getlk64
    locks: report l_pid as -1 for FL_FILE_PVT locks
    locks: make /proc/locks show IS_FILE_PVT locks as type "FLPVT"
    locks: rename locks_remove_flock to locks_remove_file
    locks: consolidate checks for compatible filp->f_mode values in setlk handlers
    locks: fix posix lock range overflow handling
    locks: eliminate BUG() call when there's an unexpected lock on file close
    locks: add __acquires and __releases annotations to locks_start and locks_stop
    locks: remove "inline" qualifier from fl_link manipulation functions
    locks: clean up comment typo
    locks: close potential race between setlease and open
    MAINTAINERS: update entry for fs/locks.c

    Linus Torvalds
     

04 Apr, 2014

1 commit

  • Mark a function as static in mmap.c because it is not used outside this
    file.

    This eliminates the following warning in mm/mmap.c:

    mm/mmap.c:407:6: warning: no previous prototype for `validate_mm' [-Wmissing-prototypes]

    Signed-off-by: Rashika Kheria
    Reviewed-by: Josh Triplett
    Reviewed-by: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rashika Kheria
     

31 Mar, 2014

1 commit

  • As Trond pointed out, you can currently deadlock yourself by setting a
    file-private lock on a file that requires mandatory locking and then
    trying to do I/O on it.

    Avoid this problem by plumbing some knowledge of file-private locks into
    the mandatory locking code. In order to do this, we must pass down
    information about the struct file that's being used to
    locks_verify_locked.

    Reported-by: Trond Myklebust
    Signed-off-by: Jeff Layton
    Acked-by: J. Bruce Fields

    Jeff Layton
     

19 Mar, 2014

1 commit

  • _install_special_mapping() is the new base function for
    install_special_mapping(). This function returns a pointer to the created
    VMA or an error code wrapped in an ERR_PTR().

    This new function will be needed by the vdso 32-bit support to map the
    additional vvar and hpet pages into the 32-bit address space. This will be
    done with io_remap_pfn_range() and remap_pfn_range(), which require a
    vm_area_struct.
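    Callers can then do the usual ERR_PTR dance (illustrative usage):

    vma = _install_special_mapping(mm, addr, len, VM_READ | VM_EXEC, pages);
    if (IS_ERR(vma))
            return PTR_ERR(vma);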

    Reviewed-by: Andy Lutomirski
    Signed-off-by: Stefani Seibold
    Link: http://lkml.kernel.org/r/1395094933-14252-3-git-send-email-stefani@seibold.net
    Signed-off-by: H. Peter Anvin

    Stefani Seibold