07 Feb, 2014

2 commits

  • commit 8790c71a18e5d2d93532ae250bcf5eddbba729cd upstream.

    As a result of commit 5606e3877ad8 ("mm: numa: Migrate on reference
    policy"), /proc/<pid>/numa_maps prints the mempolicy for any <vma> as
    "prefer:N" for the local node, N, of the process reading the file.

    This should only be printed when the mempolicy of <vma> is
    MPOL_PREFERRED for node N.

    If the process is actually only using the default mempolicy for local
    node allocation, make sure "default" is printed as expected.
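
    For reference, the gist of the fix is in the policy-to-string
    conversion used by numa_maps: the NUMA-balancing pseudo-policy must be
    reported as "default" rather than resolved to "prefer:N". A rough
    sketch of the relevant condition, assuming the check lives in
    mpol_to_str() and that the pseudo-policy carries the internal
    MPOL_F_MORON flag (not a literal quote of the upstream diff):

            unsigned short mode = MPOL_DEFAULT;
            unsigned short flags = 0;

            /*
             * Only report a policy the process actually asked for; the
             * migrate-on-reference pseudo-policy installed by NUMA
             * balancing keeps being reported as "default".
             */
            if (pol && pol != &default_policy && !(pol->flags & MPOL_F_MORON)) {
                    mode = pol->mode;
                    flags = pol->flags;
            }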

    Signed-off-by: David Rientjes
    Reported-by: Robert Lippert
    Cc: Peter Zijlstra
    Acked-by: Mel Gorman
    Cc: Ingo Molnar
    Cc: [3.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Rientjes
     
  • commit 27c73ae759774e63313c1fbfeb17ba076cea64c5 upstream.

    Commit 7cb2ef56e6a8 ("mm: fix aio performance regression for database
    caused by THP") can cause a dereference of a dangling pointer if
    split_huge_page runs during PageHuge() while there are updates to the
    tail_page->private field.

    It also calls compound_head twice for hugetlbfs and runs
    compound_head+compound_trans_head for THP, when a single call is
    needed in both cases.

    The new code within the PageSlab() check doesn't need to verify that the
    THP page size is never bigger than the smallest hugetlbfs page size, to
    avoid memory corruption.

    A longstanding theoretical race condition was found while fixing the
    above (see the change right after the skip_unlock label, that is
    relevant for the compound_lock path too).

    By re-establishing the _mapcount tail refcounting for all compound
    pages, this also fixes the below problem:

    echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

    BUG: Bad page state in process bash pfn:59a01
    page:ffffea000139b038 count:0 mapcount:10 mapping: (null) index:0x0
    page flags: 0x1c00000000008000(tail)
    Modules linked in:
    CPU: 6 PID: 2018 Comm: bash Not tainted 3.12.0+ #25
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    Call Trace:
    dump_stack+0x55/0x76
    bad_page+0xd5/0x130
    free_pages_prepare+0x213/0x280
    __free_pages+0x36/0x80
    update_and_free_page+0xc1/0xd0
    free_pool_huge_page+0xc2/0xe0
    set_max_huge_pages.part.58+0x14c/0x220
    nr_hugepages_store_common.isra.60+0xd0/0xf0
    nr_hugepages_store+0x13/0x20
    kobj_attr_store+0xf/0x20
    sysfs_write_file+0x189/0x1e0
    vfs_write+0xc5/0x1f0
    SyS_write+0x55/0xb0
    system_call_fastpath+0x16/0x1b

    Signed-off-by: Khalid Aziz
    Signed-off-by: Andrea Arcangeli
    Tested-by: Khalid Aziz
    Cc: Pravin Shelar
    Cc: Greg Kroah-Hartman
    Cc: Ben Hutchings
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Cc: Guillaume Morin
    Signed-off-by: Greg Kroah-Hartman

    Andrea Arcangeli
     

26 Jan, 2014

2 commits

  • commit 03e5ac2fc3bf6f4140db0371e8bb4243b24e3e02 upstream.

    Commit 8456a648cf44 ("slab: use struct page for slab management") causes
    a crash in the LVM2 testsuite on PA-RISC (the crashing test is
    fsadm.sh). The testsuite doesn't crash on 3.12; it crashes on 3.13-rc1
    and later.

    Bad Address (null pointer deref?): Code=15 regs=000000413edd89a0 (Addr=000006202224647d)
    CPU: 3 PID: 24008 Comm: loop0 Not tainted 3.13.0-rc6 #5
    task: 00000001bf3c0048 ti: 000000413edd8000 task.ti: 000000413edd8000

    YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
    PSW: 00001000000001101111100100001110 Not tainted
    r00-03 000000ff0806f90e 00000000405c8de0 000000004013e6c0 000000413edd83f0
    r04-07 00000000405a95e0 0000000000000200 00000001414735f0 00000001bf349e40
    r08-11 0000000010fe3d10 0000000000000001 00000040829c7778 000000413efd9000
    r12-15 0000000000000000 000000004060d800 0000000010fe3000 0000000010fe3000
    r16-19 000000413edd82a0 00000041078ddbc0 0000000000000010 0000000000000001
    r20-23 0008f3d0d83a8000 0000000000000000 00000040829c7778 0000000000000080
    r24-27 00000001bf349e40 00000001bf349e40 202d66202224640d 00000000405a95e0
    r28-31 202d662022246465 000000413edd88f0 000000413edd89a0 0000000000000001
    sr00-03 000000000532c000 0000000000000000 0000000000000000 000000000532c000
    sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000

    IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000401fe42c 00000000401fe430
    IIR: 539c0030 ISR: 00000000202d6000 IOR: 000006202224647d
    CPU: 3 CR30: 000000413edd8000 CR31: 0000000000000000
    ORIG_R28: 00000000405a95e0
    IAOQ[0]: vma_interval_tree_iter_first+0x14/0x48
    IAOQ[1]: vma_interval_tree_iter_first+0x18/0x48
    RP(r2): flush_dcache_page+0x128/0x388
    Backtrace:
    flush_dcache_page+0x128/0x388
    lo_splice_actor+0x90/0x148 [loop]
    splice_from_pipe_feed+0xc0/0x1d0
    __splice_from_pipe+0xac/0xc0
    lo_direct_splice_actor+0x1c/0x70 [loop]
    splice_direct_to_actor+0xec/0x228
    lo_receive+0xe4/0x298 [loop]
    loop_thread+0x478/0x640 [loop]
    kthread+0x134/0x168
    end_fault_vector+0x20/0x28
    xfs_setsize_buftarg+0x0/0x90 [xfs]

    Kernel panic - not syncing: Bad Address (null pointer deref?)

    Commit 8456a648cf44 changes the page structure so that the slab
    subsystem reuses the page->mapping field.

    The crash happens in the following way:
    * XFS allocates some memory from slab and issues a bio to read data
    into it.
    * the bio is sent to the loopback device.
    * lo_receive creates an actor and calls splice_direct_to_actor.
    * lo_splice_actor copies data to the target page.
    * lo_splice_actor calls flush_dcache_page because the page may be
    mapped by userspace. In that case we need to flush the kernel cache.
    * flush_dcache_page asks for the list of userspace mappings, however
    that page->mapping field is reused by the slab subsystem for a
    different purpose. This causes the crash.

    Note that other architectures without coherent caches (sparc, arm, mips)
    also call page_mapping from flush_dcache_page, so they may crash in the
    same way.

    This patch fixes this bug by testing if the page is a slab page in
    page_mapping and returning NULL if it is.
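
    The shape of the fix, sketched (the existing swap-cache and
    anonymous-mapping handling in page_mapping() is elided here):

            struct address_space *page_mapping(struct page *page)
            {
                    struct address_space *mapping = page->mapping;

                    /*
                     * Slab reuses page->mapping for its own bookkeeping, so
                     * it never points at a struct address_space here; this
                     * is hit when flush_dcache_page() runs on a slab page.
                     */
                    if (unlikely(PageSlab(page)))
                            return NULL;

                    /* ... existing swapcache / anon-mapping handling ... */
                    return mapping;
            }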

    The patch also fixes VM_BUG_ON(PageSlab(page)) that could happen in
    earlier kernels in the same scenario on architectures without cache
    coherence when CONFIG_DEBUG_VM is enabled - so it should be backported
    to stable kernels.

    In the old kernels, the function page_mapping is placed in
    include/linux/mm.h, so you should modify the patch accordingly when
    backporting it.

    Signed-off-by: Mikulas Patocka
    Cc: John David Anglin
    Cc: Andi Kleen
    Cc: Christoph Lameter
    Acked-by: Pekka Enberg
    Reviewed-by: Joonsoo Kim
    Cc: Helge Deller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     
  • commit eecc1e426d681351a6026a7d3e7d225f38955b6c upstream.

    We see General Protection Fault on RSI in copy_page_rep: that RSI is
    what you get from a NULL struct page pointer.

    RIP: 0010:[] [] copy_page_rep+0x5/0x10
    RSP: 0000:ffff880136e15c00 EFLAGS: 00010286
    RAX: ffff880000000000 RBX: ffff880136e14000 RCX: 0000000000000200
    RDX: 6db6db6db6db6db7 RSI: db73880000000000 RDI: ffff880dd0c00000
    RBP: ffff880136e15c18 R08: 0000000000000200 R09: 000000000005987c
    R10: 000000000005987c R11: 0000000000000200 R12: 0000000000000001
    R13: ffffea00305aa000 R14: 0000000000000000 R15: 0000000000000000
    FS: 00007f195752f700(0000) GS:ffff880c7fc20000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000093010000 CR3: 00000001458e1000 CR4: 00000000000027e0
    Call Trace:
    copy_user_huge_page+0x93/0xab
    do_huge_pmd_wp_page+0x710/0x815
    handle_mm_fault+0x15d8/0x1d70
    __do_page_fault+0x14d/0x840
    do_page_fault+0x2f/0x90
    page_fault+0x22/0x30

    do_huge_pmd_wp_page() tests is_huge_zero_pmd(orig_pmd) four times: but
    since shrink_huge_zero_page() can free the huge_zero_page, and we have
    no hold of our own on it here (except where the fourth test holds
    page_table_lock and has checked pmd_same), it's possible for it to
    answer yes the first time, but no to the second or third test. Change
    all those last three to tests for NULL page.
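
    The pattern of the fix, sketched generically rather than as the
    literal diff: evaluate the racy predicate once under the page table
    lock, capture the resulting page pointer (NULL standing for the huge
    zero page), pin it, and test the captured value later instead of
    re-running is_huge_zero_pmd():

            struct page *page = NULL;

            spin_lock(&mm->page_table_lock);
            if (pmd_same(*pmd, orig_pmd) && !is_huge_zero_pmd(orig_pmd)) {
                    page = pmd_page(orig_pmd);  /* NULL will mean "huge zero page" */
                    get_page(page);             /* pin before dropping the lock */
            }
            spin_unlock(&mm->page_table_lock);

            /* later decisions test the stable local value, not the pmd */
            if (!page) {
                    /* huge zero page (or pmd changed): allocate/split path */
            } else {
                    /* regular THP copy-on-write path; drop the pin when done */
            }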

    (Note: this is not the same issue as trinity's DEBUG_PAGEALLOC BUG
    in copy_page_rep with RSI: ffff88009c422000, reported by Sasha Levin
    in https://lkml.org/lkml/2013/3/29/103. I believe that one is due
    to the source page being split, and a tail page freed, while copy
    is in progress; and not a problem without DEBUG_PAGEALLOC, since
    the pmd_same check will prevent a miscopy from being made visible.)

    Fixes: 97ae17497e99 ("thp: implement refcounting for huge zero page")
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     

10 Jan, 2014

22 commits

  • commit 8e321fefb0e60bae4e2a28d20fc4fa30758d27c6 upstream.

    The arbitrary restriction on page counts offered by the core
    migrate_page_move_mapping() code results in rather suspicious looking
    fiddling with page reference counts in the aio_migratepage() operation.
    To fix this, make migrate_page_move_mapping() take an extra_count parameter
    that allows aio to tell the code about its own reference count on the page
    being migrated.

    While cleaning up aio_migratepage(), make it validate that the old page
    being passed in is actually what aio_migratepage() expects to prevent
    misbehaviour in the case of races.
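
    Conceptually the new parameter just feeds into the reference count
    that migration expects to see on the old page. A rough sketch of the
    reworked entry point (the radix-tree slot replacement and buffer
    handling are elided):

            int migrate_page_move_mapping(struct address_space *mapping,
                            struct page *newpage, struct page *page,
                            struct buffer_head *head, enum migrate_mode mode,
                            int extra_count)
            {
                    /* our own pin plus whatever extra pins the owner admits
                     * to (aio passes the pin held by its ring buffer here) */
                    int expected_count = 1 + extra_count;

                    if (!mapping) {
                            /* anonymous page: nobody else may hold a ref */
                            if (page_count(page) != expected_count)
                                    return -EAGAIN;
                            return MIGRATEPAGE_SUCCESS;
                    }

                    /* ... the mapped case additionally counts the mapping's
                     *     reference and page_has_private(page), then swaps
                     *     the radix tree slot under mapping->tree_lock ... */
            }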

    Signed-off-by: Benjamin LaHaise
    Signed-off-by: Greg Kroah-Hartman

    Benjamin LaHaise
     
  • commit 695c60830764945cf61a2cc623eb1392d137223e upstream.

    The mem_cgroup structure contains nr_node_ids pointers to
    mem_cgroup_per_node objects, not the objects themselves.
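
    In other words, the flexible array at the end of struct mem_cgroup
    holds nr_node_ids pointers, so the allocation size must be computed
    from the pointer size. A sketch of the corrected sizing helper
    (memcg_size() is the helper in mm/memcontrol.c):

            static size_t memcg_size(void)
            {
                    /* the struct is followed by nr_node_ids *pointers* to
                     * mem_cgroup_per_node, not by the structs themselves */
                    return sizeof(struct mem_cgroup) +
                            nr_node_ids * sizeof(struct mem_cgroup_per_node *);
            }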

    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vladimir Davydov
     
  • commit a3e0f9e47d5ef7858a26cc12d90ad5146e802d47 upstream.

    Memory failures on thp tail pages cause kernel panic like below:

    mce: [Hardware Error]: Machine check events logged
    MCE exception done on CPU 7
    BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
    IP: [] dequeue_hwpoisoned_huge_page+0x131/0x1e0
    PGD bae42067 PUD ba47d067 PMD 0
    Oops: 0000 [#1] SMP
    ...
    CPU: 7 PID: 128 Comm: kworker/7:2 Tainted: G M O 3.13.0-rc4-131217-1558-00003-g83b7df08e462 #25
    ...
    Call Trace:
    me_huge_page+0x3e/0x50
    memory_failure+0x4bb/0xc20
    mce_process_work+0x3e/0x70
    process_one_work+0x171/0x420
    worker_thread+0x11b/0x3a0
    ? manage_workers.isra.25+0x2b0/0x2b0
    kthread+0xe4/0x100
    ? kthread_create_on_node+0x190/0x190
    ret_from_fork+0x7c/0xb0
    ? kthread_create_on_node+0x190/0x190
    ...
    RIP dequeue_hwpoisoned_huge_page+0x131/0x1e0
    CR2: 0000000000000058

    The cause of this problem is as follows:
    - When we have a memory error on a thp tail page, the memory error
    handler grabs a refcount of the head page to keep the thp under us.
    - Before unmapping the error page from processes, we split the thp,
    where the page refcounts of both head and tail pages don't change.
    - Then we call try_to_unmap() over the error page (which was a tail
    page before). Since we didn't pin the error page to handle the memory
    error, this error page is freed and removed from the LRU list.
    - We never have the error page on the LRU list, so the first page state
    check returns "unknown page," then we move to the second check
    with the saved page flag.
    - The saved page flag has PG_tail set, so the second page state check
    returns "hugepage."
    - We call me_huge_page() for the freed error page, then we hit the above panic.

    The root cause is that we didn't move the refcount from the head page
    to the tail page after splitting the thp. This patch does exactly that.

    This panic was introduced by commit 524fca1e73 ("HWPOISON: fix
    misjudgement of page_action() for errors on mlocked pages"). Note that we
    did have the same refcount problem before this commit, but it was just
    ignored because we had only the first page state check, which returned
    "unknown page." The commit changed the refcount problem from "doesn't
    work" to "kernel panic."

    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Wanpeng Li
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Naoya Horiguchi
     
  • commit 4eb919825e6c3c7fb3630d5621f6d11e98a18b3a upstream.

    remap_file_pages calls mmap_region, which may merge the VMA with other
    existing VMAs, and free "vma". This can lead to a use-after-free bug.
    Avoid the bug by remembering vm_flags before calling mmap_region, and
    not trying to dereference vma later.
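
    The pattern of the fix, roughly (a sketch of the relevant part of the
    remap_file_pages() syscall path, not the full diff):

            unsigned long vm_flags;

            /*
             * mmap_region() may merge "vma" with a neighbouring VMA and
             * free it, so snapshot the flags we still need beforehand.
             */
            vm_flags = vma->vm_flags;
            addr = mmap_region(file, start, size, vm_flags, pgoff);

            /*
             * From here on only the saved vm_flags may be consulted (for
             * example the VM_LOCKED handling); "vma" may be dangling.
             */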

    Signed-off-by: Rik van Riel
    Reported-by: Dmitry Vyukov
    Cc: PaX Team
    Cc: Kees Cook
    Cc: Michel Lespinasse
    Cc: Cyrill Gorcunov
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Rik van Riel
     
  • commit 3b25df93c6e37e323b86a2a8c1e00c0a2821c6c9 upstream.

    Commit 7225522bb429 ("mm: munlock: batch non-THP page isolation and
    munlock+putback using pagevec") introduced __munlock_pagevec() to speed
    up munlock by holding lru_lock over multiple isolated pages. Pages that
    fail to be isolated are put_page()d immediately, also within the lock.

    This can lead to deadlock when __munlock_pagevec() becomes the holder of
    the last page pin and put_page() leads to __page_cache_release() which
    also locks lru_lock. The deadlock has been observed by Sasha Levin
    using trinity.

    This patch avoids the deadlock by deferring put_page() operations until
    lru_lock is released, as sketched below. Another pagevec (which is also
    used by later phases of the function) is reused to gather the pages for
    the put_page() operation.
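
    A rough sketch of the resulting structure (pvec_putback and the
    isolation_failed() check are named here for illustration; the real
    function also batches the NR_MLOCK accounting under the same lock):

            struct pagevec pvec_putback;
            int i;

            pagevec_init(&pvec_putback, 0);

            spin_lock_irq(&zone->lru_lock);
            for (i = 0; i < nr; i++) {
                    struct page *page = pvec->pages[i];

                    /* ... try to isolate "page" for munlock ... */
                    if (isolation_failed(page)) {
                            /*
                             * Cannot put_page() here: if this is the last
                             * pin, __page_cache_release() would take
                             * lru_lock again and deadlock.  Defer it.
                             */
                            pagevec_add(&pvec_putback, page);
                            pvec->pages[i] = NULL;
                    }
            }
            spin_unlock_irq(&zone->lru_lock);

            /* safe to drop the deferred pins now that lru_lock is released */
            pagevec_release(&pvec_putback);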

    Signed-off-by: Vlastimil Babka
    Reported-by: Sasha Levin
    Cc: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     
  • commit c424be1cbbf852e46acc84d73162af3066cd2c86 upstream.

    Since commit ff6a6da60b89 ("mm: accelerate munlock() treatment of THP
    pages") munlock skips tail pages of a munlocked THP page. However, when
    the head page already has PageMlocked unset, it will not skip the tail
    pages.

    Commit 7225522bb429 ("mm: munlock: batch non-THP page isolation and
    munlock+putback using pagevec") has added a PageTransHuge() check which
    contains VM_BUG_ON(PageTail(page)). Sasha Levin found this triggered
    using trinity, on the first tail page of a THP page without PageMlocked
    flag.

    This patch fixes the issue by skipping tail pages also in the case when
    PageMlocked flag is unset. There is still a possibility of race with
    THP page split between clearing PageMlocked and determining how many
    pages to skip. The race might result in former tail pages not being
    skipped, which is however no longer a bug, as during the skip the
    PageTail flags are cleared.

    However this race also affects correctness of NR_MLOCK accounting, which
    is to be fixed in a separate patch.

    Signed-off-by: Vlastimil Babka
    Reported-by: Sasha Levin
    Cc: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     
  • commit fff4068cba484e6b0abe334ed6b15d5a215a3b25 upstream.

    Commit 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") meant
    to bring aging fairness among zones in system, but it was overzealous
    and badly regressed basic workloads on NUMA systems.

    Due to the way kswapd and the page allocator interact, we still want to
    make sure that all zones in any given node are used equally for all
    allocations to maximize memory utilization and prevent thrashing on the
    highest zone in the node.

    While the same principle applies to NUMA nodes - memory utilization is
    obviously improved by spreading allocations throughout all nodes -
    remote references can be costly and so many workloads prefer locality
    over memory utilization. The original change assumed that
    zone_reclaim_mode would be a good enough predictor for that, but it
    turned out to be as indicative as a coin flip.

    Revert the NUMA aspect of the fairness until we can find a proper way to
    make it configurable and agree on a sane default.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Michal Hocko
    Signed-off-by: Mel Gorman
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Johannes Weiner
     
  • commit 98398c32f6687ee1e1f3ae084effb4b75adb0747 upstream.

    In __page_check_address(), if address's pud is not present,
    huge_pte_offset() will return NULL, we should check the return value.

    Signed-off-by: Jianguo Wu
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: qiuxishi
    Cc: Hanjun Guo
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jianguo Wu
     
  • commit a49ecbcd7b0d5a1cda7d60e03df402dd0ef76ac8 upstream.

    After a successful hugetlb page migration by soft offline, the source
    page will either be freed into the hugepage_freelists or into the buddy
    allocator (for an over-committed page). If the page is in the buddy
    allocator, page_hstate(page) will be NULL, and we will hit a NULL
    pointer dereference in dequeue_hwpoisoned_huge_page().

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
    IP: [] dequeue_hwpoisoned_huge_page+0x131/0x1d0
    PGD c23762067 PUD c24be2067 PMD 0
    Oops: 0000 [#1] SMP

    So check PageHuge(page) after migrate_pages() completes successfully.
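
    The resulting logic in the soft-offline path looks roughly like this
    (a sketch; the migrate_pages() error handling is elided):

            /* migration succeeded and the source hugepage was released */
            if (PageHuge(page)) {
                    /* still a hugetlbfs page: poison it and pull it off
                     * the free list so it cannot be handed out again */
                    set_page_hwpoison_huge_page(hpage);
                    dequeue_hwpoisoned_huge_page(hpage);
                    atomic_long_add(1 << compound_order(hpage),
                                    &num_poisoned_pages);
            } else {
                    /* over-committed hugepage was freed to the buddy
                     * allocator; treat it as a normal poisoned page */
                    SetPageHWPoison(page);
                    atomic_long_inc(&num_poisoned_pages);
            }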

    Signed-off-by: Jianguo Wu
    Tested-by: Naoya Horiguchi
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jianguo Wu
     
  • commit 6815bf3f233e0b10c99a758497d5d236063b010b upstream.

    update_pageblock_skip() is only suited to compaction, which isolates
    in pageblock units. When isolate_migratepages_range() is called by CMA,
    it tries to isolate regardless of the pageblock unit and does not
    consult get_pageblock_skip(), because ignore_skip_hint is set. We should
    also respect ignore_skip_hint in update_pageblock_skip() to avoid
    recording the wrong information.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Wanpeng Li
    Cc: Christoph Lameter
    Cc: Rafael Aquini
    Cc: Vlastimil Babka
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Joonsoo Kim
     
  • commit b0e5fd7359f1ce8db4ccb862b3aa80d2f2cbf4d0 upstream.

    queue_pages_range() isolates hugetlbfs pages and putback_lru_pages()
    can't handle these. We should change it to putback_movable_pages().

    Naoya said that it is worth going into stable, because it can break the
    in-use hugepage list.

    Signed-off-by: Joonsoo Kim
    Acked-by: Rafael Aquini
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Wanpeng Li
    Cc: Christoph Lameter
    Cc: Vlastimil Babka
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Joonsoo Kim
     
  • commit b0943d61b8fa420180f92f64ef67662b4f6cc493 upstream.

    THP migration can fail for a variety of reasons. Avoid flushing the TLB
    to deal with THP migration races until the copy is ready to start.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit 20841405940e7be0617612d521e206e4b6b325db upstream.

    There are a few subtle races, between change_protection_range (used by
    mprotect and change_prot_numa) on one side, and NUMA page migration and
    compaction on the other side.

    The basic race is that there is a time window between when the PTE gets
    made non-present (PROT_NONE or NUMA), and the TLB is flushed.

    During that time, a CPU may continue writing to the page.

    This is fine most of the time, however compaction or the NUMA migration
    code may come in, and migrate the page away.

    When that happens, the CPU may continue writing, through the cached
    translation, to what is no longer the current memory location of the
    process.

    This only affects x86, which has a somewhat optimistic pte_accessible.
    All other architectures appear to be safe, and will either always flush,
    or flush whenever there is a valid mapping, even with no permissions
    (SPARC).

    The basic race looks like this:

    CPU A                        CPU B                        CPU C
                                                              load TLB entry
    make entry PTE/PMD_NUMA
                                 fault on entry
                                                              read/write old page
                                 start migrating page
                                 change PTE/PMD to new page
                                                              read/write old page [*]
    flush TLB
                                                              reload TLB from new entry
                                                              read/write new page
                                                              lose data

    [*] the old page may belong to a new user at this point!

    The obvious fix is to flush remote TLB entries, by making
    pte_accessible aware of the fact that PROT_NONE and PROT_NUMA memory may
    still be accessible if there is a TLB flush pending for the mm.

    This should fix both NUMA migration and compaction.
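
    On x86 this boils down to teaching pte_accessible() that a
    non-present PROT_NONE/NUMA entry may still be live in remote TLBs
    while a flush is pending; roughly:

            static inline bool pte_accessible(struct mm_struct *mm, pte_t a)
            {
                    if (pte_flags(a) & _PAGE_PRESENT)
                            return true;

                    /* PROT_NONE/NUMA entries remain "accessible" until the
                     * flush queued by change_protection_range() completes */
                    if ((pte_flags(a) & (_PAGE_PROTNONE | _PAGE_NUMA)) &&
                                    mm_tlb_flush_pending(mm))
                            return true;

                    return false;
            }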

    [mgorman@suse.de: fix build]
    Signed-off-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Cc: Alex Thorlton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Rik van Riel
     
  • commit de466bd628e8d663fdf3f791bc8db318ee85c714 upstream.

    do_huge_pmd_numa_page() handles the case where there is parallel THP
    migration. However, by the time it is checked the NUMA hinting
    information has already been disrupted. This patch adds an earlier
    check with some helpers.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit 1667918b6483b12a6496bf54151b827b8235d7b1 upstream.

    On a protection change it is no longer clear if the page should be still
    accessible. This patch clears the NUMA hinting fault bits on a
    protection change.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit eb4489f69f224356193364dc2762aa009738ca7f upstream.

    If a PMD changes during a THP migration then migration aborts but the
    failure path is doing more work than is necessary.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit c3a489cac38d43ea6dc4ac240473b44b46deecf7 upstream.

    The anon_vma lock prevents parallel THP splits and any associated
    complexity that arises when handling splits during THP migration. This
    patch checks if the lock was successfully acquired and bails from THP
    migration if it failed for any reason.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit 0c5f83c23ca703d32f930393825487257a5cde6d upstream.

    The TLB must be flushed if the PTE is updated but change_pte_range is
    clearing the PTE while marking PTEs pte_numa without necessarily
    flushing the TLB if it reinserts the same entry. Without the flush,
    it's conceivable that two processors have different TLBs for the same
    virtual address and at the very least it would generate spurious faults.

    This patch only unmaps the pages in change_pte_range for a full
    protection change.

    [riel@redhat.com: write pte_numa pte back to the page tables]
    Signed-off-by: Mel Gorman
    Signed-off-by: Rik van Riel
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc: Chegu Vinod
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit 5a6dac3ec5f583cc8ee7bc53b5500a207c4ca433 upstream.

    If the PMD is flushed then a parallel fault in handle_mm_fault() will
    enter the pmd_none and do_huge_pmd_anonymous_page() path where it'll
    attempt to insert a huge zero page. This is wasteful so the patch
    avoids clearing the PMD when setting pmd_numa.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit 67f87463d3a3362424efcbe8b40e4772fd34fc61 upstream.

    On x86, PMD entries are similar to _PAGE_PROTNONE protection and are
    handled as NUMA hinting faults. The following two page table protection
    bits are what defines them

    _PAGE_NUMA:set _PAGE_PRESENT:clear

    A PMD is considered present if any of the _PAGE_PRESENT, _PAGE_PROTNONE,
    _PAGE_PSE or _PAGE_NUMA bits are set. If pmdp_invalidate encounters a
    pmd_numa, it clears the present bit leaving _PAGE_NUMA which will be
    considered not present by the CPU but present by pmd_present. The
    existing caller of pmdp_invalidate should handle it but it's an
    inconsistent state for a PMD. This patch keeps the state consistent
    when calling pmdp_invalidate.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit f714f4f20e59ea6eea264a86b9a51fd51b88fc54 upstream.

    MMU notifiers must be called on THP page migration or secondary MMUs
    will get very confused.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit 2b4847e73004c10ae6666c2e27b5c5430aed8698 upstream.

    Base pages are unmapped and flushed from cache and TLB during normal
    page migration and replaced with a migration entry that causes any
    parallel NUMA hinting fault or gup to block until migration completes.

    THP does not unmap pages due to a lack of support for migration entries
    at a PMD level. This allows races with get_user_pages and
    get_user_pages_fast which commit 3f926ab945b6 ("mm: Close races between
    THP migration and PMD numa clearing") made worse by introducing a
    pmd_clear_flush().

    This patch forces get_user_page (fast and normal) on a pmd_numa page to
    go through the slow get_user_page path where it will serialise against
    THP migration and properly account for the NUMA hinting fault. On the
    migration side the page table lock is taken for each PTE update.
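
    In the fast-gup path the change amounts to bailing out when a
    pmd_numa entry is seen, which makes get_user_pages_fast() fall back
    to the slow path; a rough sketch of the x86 gup_pmd_range() hunk:

                    /*
                     * NUMA hinting faults must be handled by the slow GUP
                     * path so they are accounted and serialised against
                     * THP migration; returning 0 forces the fallback.
                     */
                    if (pmd_numa(pmd))
                            return 0;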

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     

20 Dec, 2013

3 commits

  • commit 96f1c58d853497a757463e0b57fed140d6858f3a upstream.

    There is a race condition between a memcg being torn down and a swapin
    triggered from a different memcg of a page that was recorded to belong
    to the exiting memcg on swapout (with the CONFIG_MEMCG_SWAP extension).
    The result is unreclaimable pages pointing to dead memcgs, which can
    lead to anything from endless loops in later memcg teardown (the page
    is charged to all hierarchical parents but is not on any LRU list) to
    crashes from following the dangling memcg pointer.

    Memcgs with tasks in them can not be torn down and usually charges don't
    show up in memcgs without tasks. Swapin with the CONFIG_MEMCG_SWAP
    extension is the notable exception because it charges the cgroup that
    was recorded as owner during swapout, which may be empty and in the
    process of being torn down when a task in another memcg triggers the
    swapin:

    teardown:                            swapin:

                                         lookup_swap_cgroup_id()
                                         rcu_read_lock()
                                         mem_cgroup_lookup()
                                         css_tryget()
                                         rcu_read_unlock()
    disable css_tryget()
    call_rcu()
      offline_css()
        reparent_charges()
                                         res_counter_charge() (hierarchical!)
                                         css_put()
                                           css_free()
                                         pc->mem_cgroup = dead memcg
                                         add page to dead lru

    Add a final reparenting step into css_free() to make sure any such raced
    charges are moved out of the memcg before it's finally freed.
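
    In code terms the fix is small: repeat the reparenting from the memcg
    ->css_free() callback, which runs after the RCU grace period and
    therefore after any swapin charge that raced with offlining; roughly:

            /* in the memcg css_free() callback, after the grace period: */
            mem_cgroup_reparent_charges(memcg);  /* catch charges that raced with offline */
            /* ... followed by the existing kmem teardown and final free ... */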

    In the longer term it would be cleaner to have the css_tryget() and the
    res_counter charge under the same RCU lock section so that the charge
    reparenting is deferred until the last charge whose tryget succeeded is
    visible. But this will require more invasive changes that will be
    harder to evaluate and backport into stable, so better defer them to a
    separate change set.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Johannes Weiner
     
  • commit 1f14c1ac19aa45118054b6d5425873c5c7fc23a1 upstream.

    Commit 4942642080ea ("mm: memcg: handle non-error OOM situations more
    gracefully") allowed tasks that already entered a memcg OOM condition to
    bypass the memcg limit on subsequent allocation attempts hoping this
    would expedite finishing the page fault and executing the kill.

    David Rientjes is worried that this breaks memcg isolation guarantees
    and since there is no evidence that the bypass actually speeds up fault
    processing just change it so that these subsequent charge attempts fail
    outright. The notable exception being __GFP_NOFAIL charges which are
    required to bypass the limit regardless.

    Signed-off-by: Johannes Weiner
    Reported-by: David Rientjes
    Acked-by: Michal Hocko
    Acked-by: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Johannes Weiner
     
  • commit a0d8b00a3381f9d75764b3377590451cb0b4fe41 upstream.

    Commit 84235de394d9 ("fs: buffer: move allocation failure loop into the
    allocator") started recognizing __GFP_NOFAIL in memory cgroups but
    forgot to disable the OOM killer.

    Any task that does not fail the allocation will also not enter the OOM
    completion path. So don't declare an OOM state in this case, or it'll be
    leaked and the task will be able to bypass the limit until the next
    userspace-triggered page fault cleans up the OOM state.

    Reported-by: William Dauchy
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Johannes Weiner
     

08 Dec, 2013

1 commit

  • commit 72403b4a0fbdf433c1fe0127e49864658f6f6468 upstream.

    Commit 0255d4918480 ("mm: Account for a THP NUMA hinting update as one
    PTE update") was added to account for the number of PTE updates when
    marking pages prot_numa. task_numa_work was using the old return value
    to track how much address space had been updated. Altering the return
    value causes the scanner to do more work than it is configured or
    documented to in a single unit of work.

    This patch reverts that commit and accounts for the number of THP
    updates separately in vmstat. It is up to the administrator to
    interpret the pair of values correctly. This is a straight-forward
    operation and likely to only be of interest when actively debugging NUMA
    balancing problems.

    The impact of this patch is that the NUMA PTE scanner will scan more
    slowly when THP is enabled and workloads may converge more slowly as a
    result. On the flip side, system CPU usage should be lower than recent
    tests reported. This is an illustrative example of a short single-JVM
    specjbb test:

    specjbb
                            3.12.0                3.12.0
                           vanilla           acctupdates
    TPut 1      26143.00 (  0.00%)     25747.00 ( -1.51%)
    TPut 7     185257.00 (  0.00%)    183202.00 ( -1.11%)
    TPut 13    329760.00 (  0.00%)    346577.00 (  5.10%)
    TPut 19    442502.00 (  0.00%)    460146.00 (  3.99%)
    TPut 25    540634.00 (  0.00%)    549053.00 (  1.56%)
    TPut 31    512098.00 (  0.00%)    519611.00 (  1.47%)
    TPut 37    461276.00 (  0.00%)    474973.00 (  2.97%)
    TPut 43    403089.00 (  0.00%)    414172.00 (  2.75%)

                  3.12.0      3.12.0
                 vanilla acctupdates
    User         5169.64     5184.14
    System        100.45       80.02
    Elapsed       252.75      251.85

    Performance is similar but note the reduction in system CPU time. While
    this showed a performance gain, it will not be universal but at least
    it'll be behaving as documented. The vmstats are obviously different but
    here is an obvious interpretation of them from mmtests.

                                 3.12.0      3.12.0
                                vanilla acctupdates
    NUMA page range updates     1408326    11043064
    NUMA huge PMD updates             0       21040
    NUMA PTE updates            1408326      291624

    "NUMA page range updates" == nr_pte_updates and is the value returned to
    the NUMA pte scanner. NUMA huge PMD updates were the number of THP
    updates which in combination can be used to calculate how many ptes were
    updated from userspace.

    Signed-off-by: Mel Gorman
    Reported-by: Alex Thorlton
    Reviewed-by: Rik van Riel
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     

05 Dec, 2013

2 commits

  • commit 67d13fe846c57a54d12578e7a4518f68c5c86ad7 upstream.

    Consider the following scenario:

    thread 0: reclaims entry x (gets a refcount, but has not yet called
              zswap_get_swap_cache_page)
    thread 1: calls zswap_frontswap_invalidate_page to invalidate entry x.
              It finishes, but entry x and its zbud are not freed because
              the refcount != 0; now swap_map[x] = 0.
    thread 0: now calls zswap_get_swap_cache_page;
              swapcache_prepare returns -ENOENT because entry x is no longer in use;
              zswap_get_swap_cache_page returns ZSWAP_SWAPCACHE_NOMEM;
              zswap_writeback_entry does nothing except put the refcount.

    Now the memory of zswap_entry x and its zpage leak.

    The fix (see the sketch after this list):
    - check the refcount in the fail path and free the memory if it is no
    longer referenced.

    - use ZSWAP_SWAPCACHE_FAIL instead of ZSWAP_SWAPCACHE_NOMEM, as the fail
    path can be triggered not only by nomem but also by invalidate.
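
    A rough sketch of what the writeback fail path now does; the helper
    names (zswap_entry_put(), zswap_free_entry()) and the exact locking
    are approximations of zswap's internal refcounting, not a literal
    quote of the patch:

            fail:
                    spin_lock(&tree->lock);
                    /*
                     * Drop the reference taken for this writeback attempt.
                     * If a concurrent invalidate already removed the entry
                     * from the tree and this was the last reference, free
                     * the entry and its zbud allocation instead of leaking.
                     */
                    if (zswap_entry_put(entry) <= 0)
                            zswap_free_entry(tree, entry);
                    spin_unlock(&tree->lock);
                    return ret;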

    Signed-off-by: Weijie Yang
    Reviewed-by: Bob Liu
    Reviewed-by: Minchan Kim
    Acked-by: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Weijie Yang
     
  • commit 2afc745f3e3079ab16c826be4860da2529054dd2 upstream.

    This patch fixes a problem where get_unmapped_area() can return an
    illegal address and cause mmap(2) etc. to fail.

    If an address higher than PAGE_SIZE is set in
    /proc/sys/vm/mmap_min_addr, get_unmapped_area() can return an address
    lower than mmap_min_addr, even if you do not pass any virtual address
    hint (i.e. the second argument is zero).

    This is because the current get_unmapped_area() code does not take
    mmap_min_addr into account.

    This leads to two actual problems as follows:

    1. mmap(2) can fail with EPERM on a process without CAP_SYS_RAWIO,
    even though no illegal parameter is passed.

    2. The bottom-up search path after the top-down search might not work in
    arch_get_unmapped_area_topdown().

    Note: The first and third chunks of my patch, which change the "len"
    check, are for a more precise check using mmap_min_addr, and not for
    solving the above problem.
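
    The core of the fix is to clamp the unmapped-area search window with
    mmap_min_addr (and to include it in the len/addr sanity checks).
    Roughly, for the top-down case:

            struct vm_unmapped_area_info info;

            info.flags = VM_UNMAPPED_AREA_TOPDOWN;
            info.length = len;
            /* never hand out an address below mmap_min_addr */
            info.low_limit = max(PAGE_SIZE, mmap_min_addr);
            info.high_limit = mm->mmap_base;
            /* (alignment fields omitted) */
            addr = vm_unmapped_area(&info);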

    [How to reproduce]

    --- test.c -------------------------------------------------
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <errno.h>

    int main(int argc, char *argv[])
    {
            void *ret = NULL, *last_map;
            size_t pagesize = sysconf(_SC_PAGESIZE);

            do {
                    last_map = ret;
                    ret = mmap(0, pagesize, PROT_NONE,
                               MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
                    // printf("ret=%p\n", ret);
            } while (ret != MAP_FAILED);

            if (errno != ENOMEM) {
                    printf("ERR: unexpected errno: %d (last map=%p)\n",
                           errno, last_map);
            }

            return 0;
    }
    ---------------------------------------------------------------

    $ gcc -m32 -o test test.c
    $ sudo sysctl -w vm.mmap_min_addr=65536
    vm.mmap_min_addr = 65536
    $ ./test   (run as a non-privileged user)
    ERR: unexpected errno: 1 (last map=0x10000)

    Signed-off-by: Akira Takeuchi
    Signed-off-by: Kiyoshi Owada
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Akira Takeuchi
     

30 Nov, 2013

1 commit

  • commit c6f58d9b362b45c52afebe4342c9137d0dabe47f upstream.

    Andreas Herrmann writes:

    When I've used slub_debug kernel option (e.g.
    "slub_debug=,skbuff_fclone_cache" or similar) on a debug session I've
    seen a panic like:

    Highbank #setenv bootargs console=ttyAMA0 root=/dev/sda2 kgdboc.kgdboc=ttyAMA0,115200 slub_debug=,kmalloc-4096 earlyprintk=ttyAMA0
    ...
    Unable to handle kernel NULL pointer dereference at virtual address 00000000
    pgd = c0004000
    [00000000] *pgd=00000000
    Internal error: Oops: 5 [#1] SMP ARM
    Modules linked in:
    CPU: 0 PID: 0 Comm: swapper Tainted: G W 3.12.0-00048-gbe408cd #314
    task: c0898360 ti: c088a000 task.ti: c088a000
    PC is at strncmp+0x1c/0x84
    LR is at kmem_cache_flags.isra.46.part.47+0x44/0x60
    pc : [] lr : [] psr: 200001d3
    sp : c088bea8 ip : c088beb8 fp : c088beb4
    r10: 00000000 r9 : 413fc090 r8 : 00000001
    r7 : 00000000 r6 : c2984a08 r5 : c0966e78 r4 : 00000000
    r3 : 0000006b r2 : 0000000c r1 : 00000000 r0 : c2984a08
    Flags: nzCv IRQs off FIQs off Mode SVC_32 ISA ARM Segment kernel
    Control: 10c5387d Table: 0000404a DAC: 00000015
    Process swapper (pid: 0, stack limit = 0xc088a248)
    Stack: (0xc088bea8 to 0xc088c000)
    bea0: c088bed4 c088beb8 c0110a3c c02c6d90 c0966e78 00000040
    bec0: ef001f00 00000040 c088bf14 c088bed8 c0112070 c0110a04 00000005 c010fac8
    bee0: c088bf5c c088bef0 c010fac8 ef001f00 00000040 00000000 00000040 00000001
    bf00: 413fc090 00000000 c088bf34 c088bf18 c0839190 c0112040 00000000 ef001f00
    bf20: 00000000 00000000 c088bf54 c088bf38 c0839200 c083914c 00000006 c0961c4c
    bf40: c0961c28 00000000 c088bf7c c088bf58 c08392ac c08391c0 c08a2ed8 c0966e78
    bf60: c086b874 c08a3f50 c0961c28 00000001 c088bfb4 c088bf80 c083b258 c0839248
    bf80: 2f800000 0f000000 c08935b4 ffffffff c08cd400 ffffffff c08cd400 c0868408
    bfa0: c29849c0 00000000 c088bff4 c088bfb8 c0824974 c083b1e4 ffffffff ffffffff
    bfc0: c08245c0 00000000 00000000 c0868408 00000000 10c5387d c0892bcc c0868404
    bfe0: c0899440 0000406a 00000000 c088bff8 00008074 c0824824 00000000 00000000
    [] (strncmp+0x1c/0x84) from [] (kmem_cache_flags.isra.46.part.47+0x44/0x60)
    [] (kmem_cache_flags.isra.46.part.47+0x44/0x60) from [] (__kmem_cache_create+0x3c/0x410)
    [] (__kmem_cache_create+0x3c/0x410) from [] (create_boot_cache+0x50/0x74)
    [] (create_boot_cache+0x50/0x74) from [] (create_kmalloc_cache+0x4c/0x88)
    [] (create_kmalloc_cache+0x4c/0x88) from [] (create_kmalloc_caches+0x70/0x114)
    [] (create_kmalloc_caches+0x70/0x114) from [] (kmem_cache_init+0x80/0xe0)
    [] (kmem_cache_init+0x80/0xe0) from [] (start_kernel+0x15c/0x318)
    [] (start_kernel+0x15c/0x318) from [] (0x8074)
    Code: e3520000 01a00002 089da800 e5d03000 (e5d1c000)
    ---[ end trace 1b75b31a2719ed1d ]---
    Kernel panic - not syncing: Fatal exception

    The problem is that the slub_debug option is not parsed before
    create_boot_cache is called. Solve this by changing slub_debug to
    an early_param.

    Kernels 3.11, 3.10 are also affected. I am not sure about older
    kernels.

    Christoph Lameter explains:

    kmem_cache_flags may be called with NULL parameter during early boot.
    Skip the test in that case.
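
    The corresponding hunk is essentially a NULL guard on the cache name
    in kmem_cache_flags(), so the slub_debug slab-name match is skipped
    for the boot caches created before a usable name exists; roughly:

            /*
             * Enable debugging if selected on the kernel command line.
             * "name" may be NULL for caches created very early in boot;
             * skip the per-slab name match in that case.
             */
            if (slub_debug && (!slub_debug_slabs || (name &&
                    !strncmp(slub_debug_slabs, name, strlen(slub_debug_slabs)))))
                    flags |= slub_debug;

            return flags;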

    Reported-by: Andreas Herrmann
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg
    Signed-off-by: Greg Kroah-Hartman

    Christoph Lameter
     

02 Nov, 2013

1 commit

  • When a memcg is deleted, mem_cgroup_reparent_charges() moves charged
    memory to the parent memcg. As of v3.11-9444-g3ea67d0 "memcg: add per
    cgroup writeback pages accounting" there is a bad pointer read. The
    goal was to check for counter underflow. The counter is a per-cpu
    counter and there are two problems with the code:

    (1) the per-cpu access function isn't used; instead a naked pointer is
    used, which easily causes an oops.
    (2) the check doesn't sum over all CPUs.

    Test:
    $ cd /sys/fs/cgroup/memory
    $ mkdir x
    $ echo 3 > /proc/sys/vm/drop_caches
    $ (echo $BASHPID >> x/tasks && exec cat) &
    [1] 7154
    $ grep ^mapped x/memory.stat
    mapped_file 53248
    $ echo 7154 > tasks
    $ rmdir x

    The fix is to remove the check. It's currently dangerous and isn't
    worth replacing with something expensive, such as percpu_counter_sum(),
    for each reparented page. __this_cpu_read() isn't enough to fix this
    because there are no guarantees about the current CPU's count. The only
    guarantee is that the sum of all per-cpu counters is >= nr_pages.

    Fixes: 3ea67d06e467 ("memcg: add per cgroup writeback pages accounting")
    Reported-and-tested-by: Flavio Leitner
    Signed-off-by: Greg Thelen
    Reviewed-by: Sha Zhengju
    Acked-by: Johannes Weiner
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Greg Thelen
     

01 Nov, 2013

5 commits

  • Merge four more fixes from Andrew Morton.

    * emailed patches from Andrew Morton:
    lib/scatterlist.c: don't flush_kernel_dcache_page on slab page
    mm: memcg: fix test for child groups
    mm: memcg: lockdep annotation for memcg OOM lock
    mm: memcg: use proper memcg in limit bypass

    Linus Torvalds
     
  • When memcg code needs to know whether any given memcg has children, it
    uses the cgroup child iteration primitives and returns true/false
    depending on whether the iteration loop is executed at least once or
    not.

    Because a cgroup's list of children is RCU protected, these primitives
    require the RCU read-lock to be held, which is not the case for all
    memcg callers. This results in the following splat when e.g. enabling
    hierarchy mode:

    WARNING: CPU: 3 PID: 1 at kernel/cgroup.c:3043 css_next_child+0xa3/0x160()
    CPU: 3 PID: 1 Comm: systemd Not tainted 3.12.0-rc5-00117-g83f11a9-dirty #18
    Hardware name: LENOVO 3680B56/3680B56, BIOS 6QET69WW (1.39 ) 04/26/2012
    Call Trace:
    dump_stack+0x54/0x74
    warn_slowpath_common+0x78/0xa0
    warn_slowpath_null+0x1a/0x20
    css_next_child+0xa3/0x160
    mem_cgroup_hierarchy_write+0x5b/0xa0
    cgroup_file_write+0x108/0x2a0
    vfs_write+0xbd/0x1e0
    SyS_write+0x4c/0xa0
    system_call_fastpath+0x16/0x1b

    In the memcg case, we only care about children when we are attempting to
    modify inheritable attributes interactively. Racing with deletion could
    mean a spurious -EBUSY, no problem. Racing with addition is handled
    just fine as well through the memcg_create_mutex: if the child group is
    not on the list after the mutex is acquired, it won't be initialized
    from the parent's attributes until after the unlock.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The memcg OOM lock is a mutex-type lock that is open-coded due to
    memcg's special needs. Add annotations for lockdep coverage.

    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit 84235de394d9 ("fs: buffer: move allocation failure loop into the
    allocator") allowed __GFP_NOFAIL allocations to bypass the limit if they
    fail to reclaim enough memory for the charge. But because the main test
    case was on a 3.2-based system, the patch missed the fact that on newer
    kernels the charge function needs to return root_mem_cgroup when
    bypassing the limit, and not NULL. This will corrupt whatever memory is
    at NULL + percpu pointer offset. Fix this quickly before problems are
    reported.
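
    The fix itself is a one-liner in the charge bypass path; roughly:

            bypass:
                    /*
                     * Callers dereference the returned memcg (including
                     * per-cpu data hanging off it), so bypassing the limit
                     * must hand back the root memcg rather than NULL.
                     */
                    *ptr = root_mem_cgroup;
                    return -EINTR;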

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Pull NUMA balancing memory corruption fixes from Ingo Molnar:
    "So these fixes are definitely not something I'd like to sit on, but as
    I said to Mel at the KS the timing is quite tight, with Linus planning
    v3.12-final within a week.

    Fedora-19 is affected:

    comet:~> grep NUMA_BALANCING /boot/config-3.11.3-201.fc19.x86_64

    CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
    CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
    CONFIG_NUMA_BALANCING=y

    AFAICS Ubuntu will be affected as well, once it updates the kernel:

    hubble:~> grep NUMA_BALANCING /boot/config-3.8.0-32-generic

    CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
    CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
    CONFIG_NUMA_BALANCING=y

    These 6 commits are a minimalized set of cherry-picks needed to fix
    the memory corruption bugs. All commits are fixes, except "mm: numa:
    Sanitize task_numa_fault() callsites" which is a cleanup that made two
    followup fixes simpler.

    I've done targeted testing with just this SHA1 to try to make sure
    there are no cherry-picking artifacts. The original non-cherry-picked
    set of fixes were exposed to linux-next for a couple of weeks"

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    mm: Account for a THP NUMA hinting update as one PTE update
    mm: Close races between THP migration and PMD numa clearing
    mm: numa: Sanitize task_numa_fault() callsites
    mm: Prevent parallel splits during THP migration
    mm: Wait for THP migrations to complete during NUMA hinting faults
    mm: numa: Do not account for a hinting fault if we raced

    Linus Torvalds
     

31 Oct, 2013

1 commit