22 Dec, 2013

1 commit

  • The arbitrary restriction on page counts offered by the core
    migrate_page_move_mapping() code results in rather suspicious looking
    fiddling with page reference counts in the aio_migratepage() operation.
    To fix this, make migrate_page_move_mapping() take an extra_count parameter
    that allows aio to tell the code about its own reference count on the page
    being migrated.

    While cleaning up aio_migratepage(), make it validate that the old page
    being passed in is actually what aio_migratepage() expects to prevent
    misbehaviour in the case of races.
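
    A minimal sketch of the idea (simplified; the mapping and buffer-head
    handling in the real function are omitted, so treat this as a hedged
    illustration rather than the exact kernel code): the expected reference
    count now includes whatever the caller declares via extra_count, which
    lets aio account for its own pin on the ring page.

    /* inside migrate_page_move_mapping(), conceptually: */
    expected_count = 1 + extra_count + page_has_private(page);
    if (page_count(page) != expected_count)
            return -EAGAIN;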

    Signed-off-by: Benjamin LaHaise

    Benjamin LaHaise
     

21 Dec, 2013

4 commits

  • Commit 597d795a2a78 ('mm: do not allocate page->ptl dynamically, if
    spinlock_t fits to long') restructures some allocators that are
    compiled even if USE_SPLIT_PTLOCKS isn't used, which results in a
    compilation failure:

    mm/memory.c:4282:6: error: 'struct page' has no member named 'ptl'
    mm/memory.c:4288:12: error: 'struct page' has no member named 'ptl'

    Add in the missing ifdef.
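
    A sketch of the shape of the fix (the exact guard condition is an
    assumption here; it should be whatever configuration actually provides
    a dynamically allocated page->ptl, so the helpers below are only
    compiled when that field exists):

    #if USE_SPLIT_PTE_PTLOCKS && ALLOC_SPLIT_PTLOCKS
    bool ptlock_alloc(struct page *page)
    {
            spinlock_t *ptl;

            ptl = kmalloc(sizeof(spinlock_t), GFP_KERNEL);
            if (!ptl)
                    return false;
            page->ptl = ptl;
            return true;
    }

    void ptlock_free(struct page *page)
    {
            kfree(page->ptl);
    }
    #endif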

    Fixes: 597d795a2a78 ('mm: do not allocate page->ptl dynamically, if spinlock_t fits to long')
    Signed-off-by: Olof Johansson
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Olof Johansson
     
  • In struct page we have enough space to fit a long-sized page->ptl
    there, but we use a dynamically allocated page->ptl if
    sizeof(spinlock_t) is larger than sizeof(int).

    It hurts 64-bit architectures with CONFIG_GENERIC_LOCKBREAK, where
    sizeof(spinlock_t) == 8, but it easily fits into struct page.
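
    A sketch of the resulting struct page arrangement (simplified; the
    macro spellings are best-effort and should be treated as illustrative,
    with SPINLOCK_SIZE being the build-time spinlock size exported via
    kernel/bounds.c as described in a separate entry below):

    #define ALLOC_SPLIT_PTLOCKS     (SPINLOCK_SIZE > BITS_PER_LONG/8)

    struct page {
            /* ... */
    #if ALLOC_SPLIT_PTLOCKS
            spinlock_t *ptl;        /* lock does not fit: allocate it */
    #else
            spinlock_t ptl;         /* lock fits: embed it directly */
    #endif
            /* ... */
    };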

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Commit 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") meant
    to bring aging fairness among zones in system, but it was overzealous
    and badly regressed basic workloads on NUMA systems.

    Due to the way kswapd and the page allocator interact, we still want to
    make sure that all zones in any given node are used equally for all
    allocations to maximize memory utilization and prevent thrashing on the
    highest zone in the node.

    While the same principle applies to NUMA nodes - memory utilization is
    obviously improved by spreading allocations throughout all nodes -
    remote references can be costly and so many workloads prefer locality
    over memory utilization. The original change assumed that
    zone_reclaim_mode would be a good enough predictor for that, but it
    turned out to be as indicative as a coin flip.

    Revert the NUMA aspect of the fairness until we can find a proper way to
    make it configurable and agree on a sane default.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Michal Hocko
    Signed-off-by: Mel Gorman
    Cc: # 3.12
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • This reverts commit 73f038b863df. The NUMA behaviour of this patch is
    less than ideal. An alternative approach is to interleave allocations
    only within local zones, which is implemented in the next patch.

    Cc: stable@vger.kernel.org
    Signed-off-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

19 Dec, 2013

18 commits

  • In __page_check_address(), if the address's pud is not present,
    huge_pte_offset() will return NULL; we should check the return value.
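
    A minimal fragment of the kind of check meant here (simplified from
    the surrounding __page_check_address() logic):

    pte = huge_pte_offset(mm, address);
    if (!pte)
            return NULL;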

    Signed-off-by: Jianguo Wu
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: qiuxishi
    Cc: Hanjun Guo
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianguo Wu
     
  • The BUG_ON(!vma) assumption was introduced by commit 0bf598d863e3
    ("mbind: add BUG_ON(!vma) in new_vma_page()"); however, even if
    address = __vma_address(page, vma);

    and

    vma->start < address < vma->end

    page_address_in_vma() may still return -EFAULT because of many other
    conditions in it. As a result the while loop in new_vma_page() may end
    with vma=NULL.

    This patch reverts the commit and also fixes the potential NULL
    pointer dereference reported by Dan.

    http://marc.info/?l=linux-mm&m=137689530323257&w=2

    kernel BUG at mm/mempolicy.c:1204!
    invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    CPU: 3 PID: 7056 Comm: trinity-child3 Not tainted 3.13.0-rc3+ #2
    task: ffff8801ca5295d0 ti: ffff88005ab20000 task.ti: ffff88005ab20000
    RIP: new_vma_page+0x70/0x90
    RSP: 0000:ffff88005ab21db0 EFLAGS: 00010246
    RAX: fffffffffffffff2 RBX: 0000000000000000 RCX: 0000000000000000
    RDX: 0000000008040075 RSI: ffff8801c3d74600 RDI: ffffea00079a8b80
    RBP: ffff88005ab21dc8 R08: 0000000000000004 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: fffffffffffffff2
    R13: ffffea00079a8b80 R14: 0000000000400000 R15: 0000000000400000

    FS: 00007ff49c6f4740(0000) GS:ffff880244e00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007ff49c68f994 CR3: 000000005a205000 CR4: 00000000001407e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Stack:
    ffffea00079a8b80 ffffea00079a8bc0 ffffea00079a8ba0 ffff88005ab21e50
    ffffffff811adc7a 0000000000000000 ffff8801ca5295d0 0000000464e224f8
    0000000000000000 0000000000000002 0000000000000000 ffff88020ce75c00
    Call Trace:
    migrate_pages+0x12a/0x850
    SYSC_mbind+0x513/0x6a0
    SyS_mbind+0xe/0x10
    ia32_do_call+0x13/0x13
    Code: 85 c0 75 2f 4c 89 e1 48 89 da 31 f6 bf da 00 02 00 65 44 8b 04 25 08 f7 1c 00 e8 ec fd ff ff 5b 41 5c 41 5d 5d c3 0f 1f 44 00 00 0b 66 0f 1f 44 00 00 4c 89 e6 48 89 df ba 01 00 00 00 e8 48
    RIP [] new_vma_page+0x70/0x90
    RSP

    Signed-off-by: Wanpeng Li
    Reported-by: Dave Jones
    Reported-by: Sasha Levin
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • After a successful hugetlb page migration by soft offline, the source
    page will be freed either onto the hugepage_freelists or, for an
    over-committed page, back to the buddy allocator. If the page is in
    the buddy allocator, page_hstate(page) will be NULL, and
    dequeue_hwpoisoned_huge_page() will hit a NULL pointer dereference.

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
    IP: [] dequeue_hwpoisoned_huge_page+0x131/0x1d0
    PGD c23762067 PUD c24be2067 PMD 0
    Oops: 0000 [#1] SMP

    So check PageHuge(page) after migrate_pages() completes successfully.
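
    A simplified fragment of the check being described (variable names are
    illustrative; the real code also distinguishes the head page and
    updates the poisoned-page counters):

    if (!ret && PageHuge(page))
            dequeue_hwpoisoned_huge_page(page);
    /* if the page is no longer huge it went back to the buddy allocator:
       there is no hstate to consult and nothing to dequeue */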

    Signed-off-by: Jianguo Wu
    Tested-by: Naoya Horiguchi
    Reviewed-by: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianguo Wu
     
  • update_pageblock_skip() only makes sense for compaction, which
    isolates in pageblock units. If isolate_migratepages_range() is called
    by CMA, it isolates regardless of pageblock boundaries and does not
    consult get_pageblock_skip() because of ignore_skip_hint. We should
    respect ignore_skip_hint in update_pageblock_skip() as well, to avoid
    recording the wrong information.
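
    The shape of the fix is a simple early return at the top of
    update_pageblock_skip() (sketch):

    if (cc->ignore_skip_hint)
            return;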

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Wanpeng Li
    Cc: Christoph Lameter
    Cc: Rafael Aquini
    Cc: Vlastimil Babka
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Zhang Yanfei
    Cc: [3.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • queue_pages_range() isolates hugetlbfs pages, but putback_lru_pages()
    can't handle them. We should change it to putback_movable_pages().

    Naoya said that it is worth going into stable, because it can break
    in-use hugepage list.

    Signed-off-by: Joonsoo Kim
    Acked-by: Rafael Aquini
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Wanpeng Li
    Cc: Christoph Lameter
    Cc: Vlastimil Babka
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Zhang Yanfei
    Cc: [3.12.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Eliminate the following (rand)config warning by adding missing PROC_FS
    dependency:

    warning: (HWPOISON_INJECT && MEM_SOFT_DIRTY) selects PROC_PAGE_MONITOR which has unmet direct dependencies (PROC_FS && MMU)

    Signed-off-by: Sima Baymani
    Suggested-by: David Rientjes
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sima Baymani
     
  • Dave Hansen noted a regression in a microbenchmark that loops around
    open() and close() on an 8-node NUMA machine and bisected it down to
    commit 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy").
    That change forces the slab allocations of the file descriptor to spread
    out to all 8 nodes, causing remote references in the page allocator and
    slab.

    The round-robin policy is only there to provide fairness among memory
    allocations that are reclaimed involuntarily based on pressure in each
    zone. It does not make sense to apply it to unreclaimable kernel
    allocations that are freed manually, in this case instantly after the
    allocation, and incur the remote reference costs twice for no reason.

    Only round-robin allocations that are usually freed through page reclaim
    or slab shrinking.
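
    A hedged sketch of the gate (the exact condition in the kernel may
    differ, and the predicate name below is purely illustrative): the fair
    policy flag is no longer set unconditionally, but only for allocation
    types that are normally freed back through reclaim.

    int alloc_flags = ALLOC_WMARK_LOW | ALLOC_CPUSET;

    if (gfp_mask_indicates_reclaimable_allocation(gfp_mask))  /* illustrative */
            alloc_flags |= ALLOC_FAIR;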

    Bisected by Dave Hansen.

    Signed-off-by: Johannes Weiner
    Cc: Dave Hansen
    Cc: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • THP migration can fail for a variety of reasons. Avoid flushing the TLB
    to deal with THP migration races until the copy is ready to start.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • There are a few subtle races, between change_protection_range (used by
    mprotect and change_prot_numa) on one side, and NUMA page migration and
    compaction on the other side.

    The basic race is that there is a time window between when the PTE gets
    made non-present (PROT_NONE or NUMA), and the TLB is flushed.

    During that time, a CPU may continue writing to the page.

    This is fine most of the time, however compaction or the NUMA migration
    code may come in, and migrate the page away.

    When that happens, the CPU may continue writing, through the cached
    translation, to what is no longer the current memory location of the
    process.

    This only affects x86, which has a somewhat optimistic pte_accessible.
    All other architectures appear to be safe, and will either always flush,
    or flush whenever there is a valid mapping, even with no permissions
    (SPARC).

    The basic race looks like this:

    CPU C: load TLB entry
    CPU A: make entry PTE/PMD_NUMA
    CPU B: fault on entry
    CPU C: read/write old page
    CPU B: start migrating page
    CPU B: change PTE/PMD to new page
    CPU C: read/write old page [*]
    CPU A: flush TLB
    CPU C: reload TLB from new entry
    CPU C: read/write new page
    CPU C: lose data

    [*] the old page may belong to a new user at this point!

    The obvious fix is to flush remote TLB entries, by making sure that
    pte_accessible is aware of the fact that PROT_NONE and PROT_NUMA
    memory may still be accessible if there is a TLB flush pending for
    the mm.

    This should fix both NUMA migration and compaction.
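
    A sketch of the x86 side of this approach (the pending-flush helper is
    part of what this change introduces, so treat the exact names here as
    illustrative):

    static inline bool pte_accessible(struct mm_struct *mm, pte_t a)
    {
            if (pte_flags(a) & _PAGE_PRESENT)
                    return true;

            if ((pte_flags(a) & (_PAGE_PROTNONE | _PAGE_NUMA)) &&
                mm_tlb_flush_pending(mm))
                    return true;

            return false;
    }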

    [mgorman@suse.de: fix build]
    Signed-off-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • do_huge_pmd_numa_page() handles the case where there is parallel THP
    migration. However, by the time it is checked the NUMA hinting
    information has already been disrupted. This patch adds an earlier
    check with some helpers.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • On a protection change it is no longer clear if the page should be still
    accessible. This patch clears the NUMA hinting fault bits on a
    protection change.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • If a PMD changes during a THP migration then migration aborts but the
    failure path is doing more work than is necessary.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The anon_vma lock prevents parallel THP splits and any associated
    complexity that arises when handling splits during THP migration. This
    patch checks if the lock was successfully acquired and bails from THP
    migration if it failed for any reason.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The TLB must be flushed if the PTE is updated but change_pte_range is
    clearing the PTE while marking PTEs pte_numa without necessarily
    flushing the TLB if it reinserts the same entry. Without the flush,
    it's conceivable that two processors have different TLBs for the same
    virtual address and at the very least it would generate spurious faults.

    This patch only unmaps the pages in change_pte_range for a full
    protection change.

    [riel@redhat.com: write pte_numa pte back to the page tables]
    Signed-off-by: Mel Gorman
    Signed-off-by: Rik van Riel
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc: Chegu Vinod
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • If the PMD is flushed then a parallel fault in handle_mm_fault() will
    enter the pmd_none and do_huge_pmd_anonymous_page() path where it'll
    attempt to insert a huge zero page. This is wasteful so the patch
    avoids clearing the PMD when setting pmd_numa.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • On x86, PMD entries are similar to _PAGE_PROTNONE protection and are
    handled as NUMA hinting faults. The following two page table protection
    bits are what define them:

    _PAGE_NUMA: set, _PAGE_PRESENT: clear

    A PMD is considered present if any of the _PAGE_PRESENT, _PAGE_PROTNONE,
    _PAGE_PSE or _PAGE_NUMA bits are set. If pmdp_invalidate encounters a
    pmd_numa, it clears the present bit leaving _PAGE_NUMA which will be
    considered not present by the CPU but present by pmd_present. The
    existing caller of pmdp_invalidate should handle it but it's an
    inconsistent state for a PMD. This patch keeps the state consistent
    when calling pmdp_invalidate.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • MMU notifiers must be called on THP page migration or secondary MMUs
    will get very confused.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Base pages are unmapped and flushed from cache and TLB during normal
    page migration and replaced with a migration entry that causes any
    parallel NUMA hinting fault or gup to block until migration completes.

    THP does not unmap pages due to a lack of support for migration entries
    at a PMD level. This allows races with get_user_pages and
    get_user_pages_fast which commit 3f926ab945b6 ("mm: Close races between
    THP migration and PMD numa clearing") made worse by introducing a
    pmd_clear_flush().

    This patch forces get_user_page (fast and normal) on a pmd_numa page to
    go through the slow get_user_page path where it will serialise against
    THP migration and properly account for the NUMA hinting fault. On the
    migration side the page table lock is taken for each PTE update.
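
    The gup_fast side of this can be sketched as a single bail-out when a
    pmd_numa() entry is seen (fragment, simplified):

    if (pmd_numa(pmd))
            return 0;       /* fall back to the slow get_user_pages() path */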

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

13 Dec, 2013

4 commits

  • Commit 4942642080ea ("mm: memcg: handle non-error OOM situations more
    gracefully") allowed tasks that already entered a memcg OOM condition to
    bypass the memcg limit on subsequent allocation attempts hoping this
    would expedite finishing the page fault and executing the kill.

    David Rientjes is worried that this breaks memcg isolation guarantees,
    and since there is no evidence that the bypass actually speeds up fault
    processing, just change it so that these subsequent charge attempts
    fail outright. The notable exception is __GFP_NOFAIL charges, which
    are required to bypass the limit regardless.

    Signed-off-by: Johannes Weiner
    Reported-by: David Rientjes
    Acked-by: Michal Hocko
    Acked-by: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • There is a race condition between a memcg being torn down and a swapin
    triggered from a different memcg of a page that was recorded to belong
    to the exiting memcg on swapout (with CONFIG_MEMCG_SWAP extension). The
    result is unreclaimable pages pointing to dead memcgs, which can lead to
    anything from endless loops in later memcg teardown (the page is
    charged to all hierarchical parents but is not on any LRU list) to
    crashes from following the dangling memcg pointer.

    Memcgs with tasks in them can not be torn down and usually charges don't
    show up in memcgs without tasks. Swapin with the CONFIG_MEMCG_SWAP
    extension is the notable exception because it charges the cgroup that
    was recorded as owner during swapout, which may be empty and in the
    process of being torn down when a task in another memcg triggers the
    swapin:

    swapin:    lookup_swap_cgroup_id()
    swapin:    rcu_read_lock()
    swapin:    mem_cgroup_lookup()
    swapin:    css_tryget()
    swapin:    rcu_read_unlock()
    teardown:  disable css_tryget()
    teardown:  call_rcu()
    teardown:    offline_css()
    teardown:      reparent_charges()
    swapin:    res_counter_charge() (hierarchical!)
    swapin:    css_put()
    swapin:      css_free()
    swapin:    pc->mem_cgroup = dead memcg
    swapin:    add page to dead lru

    Add a final reparenting step into css_free() to make sure any such raced
    charges are moved out of the memcg before it's finally freed.
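
    A simplified sketch of what that final step looks like (the
    surrounding teardown details are omitted):

    static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
    {
            struct mem_cgroup *memcg = mem_cgroup_from_css(css);

            /*
             * A late swapin charge may have raced with offline_css();
             * move any such charges to the parent so no page is left
             * pointing at a dead memcg.
             */
            mem_cgroup_reparent_charges(memcg);

            memcg_destroy_kmem(memcg);
            __mem_cgroup_free(memcg);
    }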

    In the longer term it would be cleaner to have the css_tryget() and the
    res_counter charge under the same RCU lock section so that the charge
    reparenting is deferred until the last charge whose tryget succeeded is
    visible. But this will require more invasive changes that will be
    harder to evaluate and backport into stable, so better defer them to a
    separate change set.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Andrey Vagin reported a crash on VM_BUG_ON() in pgtable_pmd_page_dtor()
    with the following backtrace:

    free_pgd_range+0x2bf/0x410
    free_pgtables+0xce/0x120
    unmap_region+0xe0/0x120
    do_munmap+0x249/0x360
    move_vma+0x144/0x270
    SyS_mremap+0x3b9/0x510
    system_call_fastpath+0x16/0x1b

    The crash can be reproduced with this test case:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define MB (1024 * 1024UL)
    #define GB (1024 * MB)

    int main(int argc, char **argv)
    {
            char *p;
            int i;

            p = mmap((void *) GB, 10 * MB, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
            for (i = 0; i < 10 * MB; i += 4096)
                    p[i] = 1;
            mremap(p, 10 * MB, 10 * MB, MREMAP_FIXED | MREMAP_MAYMOVE, 2 * GB);
            return 0;
    }

    Due to the split PMD lock, we now store preallocated PTE tables for
    THP pages per PMD table. This means we need to move them to the other
    PMD table if the huge PMD is moved there.

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Andrey Vagin
    Tested-by: Andrey Vagin
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Commit 84235de394d9 ("fs: buffer: move allocation failure loop into the
    allocator") started recognizing __GFP_NOFAIL in memory cgroups but
    forgot to disable the OOM killer.

    Any task that does not fail allocation will also not enter the OOM
    completion path. So don't declare an OOM state in this case or it'll be
    leaked and the task will be able to bypass the limit until the next
    userspace-triggered page fault cleans up the OOM state.

    Reported-by: William Dauchy
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: [3.12.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

02 Dec, 2013

1 commit

  • We have a problem where the big_key key storage implementation uses a
    shmem-backed inode to hold the key contents. Because of this
    implementation detail, LSM checks are being done between processes
    trying to read the keys and the tmpfs-backed inode. The LSM checks are
    already being handled at the key interface level and should not be
    enforced at the inode level (since the inode is an implementation
    detail, not part of the security model).

    This patch implements a new function, shmem_kernel_file_setup(), which
    returns the equivalent of shmem_file_setup() except that the underlying
    inode has S_PRIVATE set. This means that all LSM checks for the inode in
    question are skipped. It should only be used for kernel internal
    operations where the inode is not exposed to userspace without proper
    LSM checking. It is possible that some other users of
    shmem_file_setup() should use the new interface, but this has not been
    explored.
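
    A minimal sketch of the new interface, written here as a simple
    wrapper for clarity (the in-tree implementation may plumb the flag
    through shmem's internal setup path instead):

    struct file *shmem_kernel_file_setup(const char *name, loff_t size,
                                         unsigned long flags)
    {
            struct file *file = shmem_file_setup(name, size, flags);

            if (!IS_ERR(file))
                    file_inode(file)->i_flags |= S_PRIVATE;

            return file;
    }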

    Reproducing this bug is a little bit difficult. The steps I used on
    Fedora are:

    (1) Turn off selinux enforcing:

    setenforce 0

    (2) Create a huge key

    k=`dd if=/dev/zero bs=8192 count=1 | keyctl padd big_key test-key @s`

    (3) Access the key in another context:

    runcon system_u:system_r:httpd_t:s0-s0:c0.c1023 keyctl print $k >/dev/null

    (4) Examine the audit logs:

    ausearch -m AVC -i --subject httpd_t | audit2allow

    If the last command's output includes a line that looks like:

    allow httpd_t user_tmpfs_t:file { open read };

    then there was an inode check between httpd and the tmpfs filesystem.
    With this patch no such denial will be seen. (NOTE! you should clear
    your audit log if you have tested for this previously.)

    (Please return your box to enforcing.)

    Signed-off-by: Eric Paris
    Signed-off-by: David Howells
    cc: Hugh Dickins
    cc: linux-mm@kvack.org

    Eric Paris
     

23 Nov, 2013

1 commit

  • Pull SLAB changes from Pekka Enberg:
    "The patches from Joonsoo Kim switch mm/slab.c to use 'struct page' for
    slab internals similar to mm/slub.c. This reduces memory usage and
    improves performance:

    https://lkml.org/lkml/2013/10/16/155

    Rest of the changes are bug fixes from various people"

    * 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: (21 commits)
    mm, slub: fix the typo in mm/slub.c
    mm, slub: fix the typo in include/linux/slub_def.h
    slub: Handle NULL parameter in kmem_cache_flags
    slab: replace non-existing 'struct freelist *' with 'void *'
    slab: fix to calm down kmemleak warning
    slub: proper kmemleak tracking if CONFIG_SLUB_DEBUG disabled
    slab: rename slab_bufctl to slab_freelist
    slab: remove useless statement for checking pfmemalloc
    slab: use struct page for slab management
    slab: replace free and inuse in struct slab with newly introduced active
    slab: remove SLAB_LIMIT
    slab: remove kmem_bufctl_t
    slab: change the management method of free objects of the slab
    slab: use __GFP_COMP flag for allocating slab pages
    slab: use well-defined macro, virt_to_slab()
    slab: overloading the RCU head over the LRU for RCU free
    slab: remove cachep in struct slab_rcu
    slab: remove nodeid in struct slab
    slab: remove colouroff in struct slab
    slab: change return type of kmem_getpages() to struct page
    ...

    Linus Torvalds
     

22 Nov, 2013

3 commits

  • Fengguang Wu reports that compiling mm/mempolicy.c results in a warning:

    mm/mempolicy.c: In function 'mpol_to_str':
    mm/mempolicy.c:2878:2: error: format not a string literal and no format arguments

    Kees says this is because he is using -Wformat-security.

    Silence the warning.
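
    The generic shape of a -Wformat-security fix, for illustration (the
    actual mpol_to_str() change differs in detail; the variable names here
    are made up):

    /* warns: the format string is not a literal */
    p += snprintf(p, maxlen, mode_str);
    /* quiet and safe */
    p += snprintf(p, maxlen, "%s", mode_str);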

    Signed-off-by: David Rientjes
    Reported-by: Fengguang Wu
    Suggested-by: Kees Cook
    Acked-by: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Commit 7cb2ef56e6a8 ("mm: fix aio performance regression for database
    caused by THP") can cause dereference of a dangling pointer if
    split_huge_page runs during PageHuge() if there are updates to the
    tail_page->private field.

    Also it is repeating compound_head twice for hugetlbfs and it is running
    compound_head+compound_trans_head for THP when a single one is needed in
    both cases.

    The new code within the PageSlab() check doesn't need to verify that the
    THP page size is never bigger than the smallest hugetlbfs page size, to
    avoid memory corruption.

    A longstanding theoretical race condition was found while fixing the
    above (see the change right after the skip_unlock label, that is
    relevant for the compound_lock path too).

    By re-establishing the _mapcount tail refcounting for all compound
    pages, this also fixes the below problem:

    echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

    BUG: Bad page state in process bash pfn:59a01
    page:ffffea000139b038 count:0 mapcount:10 mapping: (null) index:0x0
    page flags: 0x1c00000000008000(tail)
    Modules linked in:
    CPU: 6 PID: 2018 Comm: bash Not tainted 3.12.0+ #25
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    Call Trace:
    dump_stack+0x55/0x76
    bad_page+0xd5/0x130
    free_pages_prepare+0x213/0x280
    __free_pages+0x36/0x80
    update_and_free_page+0xc1/0xd0
    free_pool_huge_page+0xc2/0xe0
    set_max_huge_pages.part.58+0x14c/0x220
    nr_hugepages_store_common.isra.60+0xd0/0xf0
    nr_hugepages_store+0x13/0x20
    kobj_attr_store+0xf/0x20
    sysfs_write_file+0x189/0x1e0
    vfs_write+0xc5/0x1f0
    SyS_write+0x55/0xb0
    system_call_fastpath+0x16/0x1b

    Signed-off-by: Khalid Aziz
    Signed-off-by: Andrea Arcangeli
    Tested-by: Khalid Aziz
    Cc: Pravin Shelar
    Cc: Greg Kroah-Hartman
    Cc: Ben Hutchings
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Right now, the migration code in migrate_page_copy() uses copy_huge_page()
    for hugetlbfs and thp pages:

    if (PageHuge(page) || PageTransHuge(page))
            copy_huge_page(newpage, page);

    So, yay for code reuse. But:

    void copy_huge_page(struct page *dst, struct page *src)
    {
            struct hstate *h = page_hstate(src);

    and a non-hugetlbfs page has no page_hstate(). This works 99% of the
    time because page_hstate() determines the hstate from the page order
    alone. Since the page order of a THP page matches the default hugetlbfs
    page order, it works.

    But, if you change the default huge page size on the boot command-line
    (say default_hugepagesz=1G), then we might not even *have* a 2MB hstate
    so page_hstate() returns null and copy_huge_page() oopses pretty fast
    since copy_huge_page() dereferences the hstate:

    void copy_huge_page(struct page *dst, struct page *src)
    {
            struct hstate *h = page_hstate(src);

            if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
                    ...

    Mel noticed that the migration code is really the only user of these
    functions. This moves all the copy code over to migrate.c and makes
    copy_huge_page() work for THP by checking for it explicitly.
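
    A simplified sketch of the fixed helper (the gigantic-page special
    case is omitted here): the hstate is consulted only for genuine
    hugetlbfs pages, while THP sizes come from the compound order.

    static void copy_huge_page(struct page *dst, struct page *src)
    {
            int i, nr_pages;

            if (PageHuge(src)) {
                    /* hugetlbfs page: ask the hstate for the size */
                    struct hstate *h = page_hstate(src);

                    nr_pages = pages_per_huge_page(h);
            } else {
                    /* THP: no hstate involved at all */
                    BUG_ON(!PageTransHuge(src));
                    nr_pages = hpage_nr_pages(src);
            }

            for (i = 0; i < nr_pages; i++) {
                    cond_resched();
                    copy_highpage(dst + i, src + i);
            }
    }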

    I believe the bug was introduced in commit b32967ff101a ("mm: numa: Add
    THP migration for the NUMA working set scanning fault case")

    [akpm@linux-foundation.org: fix coding-style and comment text, per Naoya Horiguchi]
    Signed-off-by: Dave Hansen
    Acked-by: Mel Gorman
    Reviewed-by: Naoya Horiguchi
    Cc: Hillf Danton
    Cc: Andrea Arcangeli
    Tested-by: Dave Jiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

21 Nov, 2013

1 commit

  • This reverts commit ea1e7ed33708c7a760419ff9ded0a6cb90586a50.

    Al points out that while the commit *does* actually create a separate
    slab for the page->ptl allocation, that slab is never actually used, and
    the code continues to use kmalloc/kfree.

    Damien Wyart points out that the original patch did have the conversion
    to use kmem_cache_alloc/free, so it got lost somewhere on its way to me.

    Revert the half-arsed attempt that didn't do anything. If we really do
    want the special slab (remember: this is all relevant just for debug
    builds, so it's not necessarily all that critical) we might as well redo
    the patch fully.

    Reported-by: Al Viro
    Acked-by: Andrew Morton
    Cc: Kirill A Shutemov
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

16 Nov, 2013

1 commit

  • Pull trivial tree updates from Jiri Kosina:
    "Usual earth-shaking, news-breaking, rocket science pile from
    trivial.git"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (23 commits)
    doc: usb: Fix typo in Documentation/usb/gadget_configs.txt
    doc: add missing files to timers/00-INDEX
    timekeeping: Fix some trivial typos in comments
    mm: Fix some trivial typos in comments
    irq: Fix some trivial typos in comments
    NUMA: fix typos in Kconfig help text
    mm: update 00-INDEX
    doc: Documentation/DMA-attributes.txt fix typo
    DRM: comment: `halve' -> `half'
    Docs: Kconfig: `devlopers' -> `developers'
    doc: typo on word accounting in kprobes.c in mutliple architectures
    treewide: fix "usefull" typo
    treewide: fix "distingush" typo
    mm/Kconfig: Grammar s/an/a/
    kexec: Typo s/the/then/
    Documentation/kvm: Update cpuid documentation for steal time and pv eoi
    treewide: Fix common typo in "identify"
    __page_to_pfn: Fix typo in comment
    Correct some typos for word frequency
    clk: fixed-factor: Fix a trivial typo
    ...

    Linus Torvalds
     

15 Nov, 2013

6 commits

  • This patch enhances the type safety of the kfifo API. It is now safe
    to put const data into a non-const FIFO, and the API will now generate
    a compiler warning when reading from the FIFO into a destination
    address that points to a const variable.

    As a side effect, kfifo_put() now expects the value of an element
    instead of a pointer to the element. This was suggested by Russell
    King. It makes the handling of kfifo_put() easier since there is no
    need to create a helper variable to take the address of an element or
    to pass integers of different sizes.

    IMHO the API break is okay, since there are currently only six users of
    kfifo_put().

    The code is also cleaner by kicking out the "if (0)" expressions.
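
    Illustrative before/after usage of the changed API (the fifo name and
    setup here are made up for the example):

    #include <linux/kfifo.h>

    static DECLARE_KFIFO(example_fifo, int, 16);

    static void example(void)
    {
            INIT_KFIFO(example_fifo);

            /* old API: needed an lvalue to take the address of */
            /*     int v = 42; kfifo_put(&example_fifo, &v);    */

            /* new API: pass the value itself */
            kfifo_put(&example_fifo, 42);
    }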

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Stefani Seibold
    Cc: Russell King
    Cc: Hauke Mehrtens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefani Seibold
     
  • If DEBUG_SPINLOCK and DEBUG_LOCK_ALLOC are enabled, spinlock_t on
    x86_64 is 72 bytes. For page->ptl it will then be allocated from the
    kmalloc-96 slab, so we lose 24 bytes on each. An average system can
    easily allocate a few tens of thousands of page->ptl locks, so the
    overhead is significant.

    Let's create a separate slab for page->ptl allocation to solve this.
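
    A sketch of the approach (simplified; the init hook naming is
    illustrative): create an exactly-sized cache and allocate page->ptl
    from it instead of from the generic kmalloc caches.

    static struct kmem_cache *page_ptl_cachep;

    void __init ptlock_cache_init(void)
    {
            page_ptl_cachep = kmem_cache_create("page->ptl",
                                                sizeof(spinlock_t), 0,
                                                SLAB_PANIC, NULL);
    }

    bool ptlock_alloc(struct page *page)
    {
            spinlock_t *ptl;

            ptl = kmem_cache_alloc(page_ptl_cachep, GFP_KERNEL);
            if (!ptl)
                    return false;
            page->ptl = (unsigned long)ptl;  /* page->ptl is an unsigned long here */
            return true;
    }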

    Signed-off-by: Kirill A. Shutemov
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Use kernel/bounds.c to convert build-time spinlock_t size check into a
    preprocessor symbol and apply that to properly separate the page::ptl
    situation.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Kirill A. Shutemov
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • If the split page table lock is in use, we embed the lock into the
    struct page of the table's page. We have to disable the split lock if
    spinlock_t is too big to be embedded, as when DEBUG_SPINLOCK or
    DEBUG_LOCK_ALLOC is enabled.

    This patch adds support for dynamic allocation of the split page table
    lock if we can't embed it in struct page.

    page->ptl is unsigned long now; we use it as a spinlock_t if
    sizeof(spinlock_t) <= sizeof(unsigned long), otherwise we store in it
    a pointer to a dynamically allocated ptl.
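
    A sketch of how the field is interpreted (the helper and macro names
    here are illustrative; the real code derives the size check at build
    time):

    #if SPLIT_PTL_IS_EMBEDDED       /* sizeof(spinlock_t) <= sizeof(long) */
    static inline spinlock_t *page_ptl(struct page *page)
    {
            return (spinlock_t *)&page->ptl;   /* lock lives in the field */
    }
    #else
    static inline spinlock_t *page_ptl(struct page *page)
    {
            return (spinlock_t *)page->ptl;    /* field holds a kmalloc'd lock */
    }
    #endif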

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The basic idea is the same as at the PTE level: the lock is embedded
    into the struct page of the table's page.

    We can't use mm->pmd_huge_pte to store pgtables for THP, since we no
    longer take mm->page_table_lock. Let's reuse page->lru of the table's
    page for that.

    pgtable_pmd_page_ctor() returns true if initialization is successful
    and false otherwise. The current implementation never fails, but the
    assumption that the constructor can fail will help port it to -rt,
    where spinlock_t is rather huge and cannot be embedded into struct
    page -- dynamic allocation is required.
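
    A simplified sketch of the constructor (the ptl helper is assumed to
    mirror the PTE-level one and may fail when the lock has to be
    allocated separately):

    static inline bool pgtable_pmd_page_ctor(struct page *page)
    {
    #ifdef CONFIG_TRANSPARENT_HUGEPAGE
            page->pmd_huge_pte = NULL;      /* deposit list for THP pgtables */
    #endif
            return ptlock_init(page);       /* false if ptl allocation failed */
    }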

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Reviewed-by: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Only trivial cases left. Let's convert them altogether.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov