08 Jan, 2013

1 commit

  • When running the following command under a shell, it returns an error:
    sh/$ echo 1 > /proc/sys/vm/compact_memory
    sh/$ sh: write error: Bad address

    After strace, I found the following log:
    ...
    write(1, "1\n", 2) = 3
    write(1, "", 4294967295) = -1 EFAULT (Bad address)
    write(2, "echo: write error: Bad address\n", 31echo: write error: Bad address
    ) = 31

    This shows that the system returned 3 (COMPACT_COMPLETE) after writing data to compact_memory.

    The fix is to have sysctl_compaction_handler return 0, instead of 3
    (COMPACT_COMPLETE), after compaction_nodes has finished.
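
    A minimal sketch of the shape of such a fix, assuming the 3.x-era handler
    signature in mm/compaction.c (the node-compaction loop is the routine
    called compaction_nodes above); this is not the literal upstream diff:

        int sysctl_compaction_handler(struct ctl_table *table, int write,
                        void __user *buffer, size_t *length, loff_t *ppos)
        {
                if (write)
                        compact_nodes();        /* compact every online node */

                /* Do not propagate COMPACT_COMPLETE (3) back to the write()
                 * path; report success instead. */
                return 0;
        }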

    Signed-off-by: Jason Liu
    Suggested-by: David Rientjes
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc:
    Signed-off-by: Andrew Morton

    Jason Liu
     

26 Nov, 2012

1 commit

  • Commit 88f5acf88ae6 ("mm: page allocator: adjust the per-cpu counter
    threshold when memory is low") changed how free_pages is calculated,
    but forgot that we used to do free_pages - ((1 << order) - 1), so we
    ended up off by two when calculating free_pages.
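
    In outline, the fix restores the subtraction early in
    __zone_watermark_ok() (a sketch of the relevant line, not the full
    function):

        /* a high-order request needs (1 << order) contiguous pages; check
         * the watermark as if all but one of them were already gone */
        free_pages -= (1 << order) - 1;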

    Reported-by: Wang Sheng-Hui
    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

25 Jul, 2012

3 commits

  • mm core part

    - After the USB driver primes a bulk transfer (IN or OUT, take OUT
    for example) on ep1, only one dTD is primed. A USB interrupt
    (bit 0 of USBSTS) is issued, and the endptcomplete register reads
    0x2, which means an OUT transfer on ep1 has completed. At this point
    the ep1 OUT queue head status is 0x1e18000 and the next dTD pointer
    is 0x1, which means the transfer is done and everything is OK, while
    the dTD token status is 0x2008080, which means this dTD is still
    active and not yet completed.
    - Audio SDMA and Ethernet have a similar issue
    - The root cause has not been found yet
    - Workaround (see the sketch below):
    change the non-cacheable bufferable memory to non-cacheable
    non-bufferable memory to make this issue disappear.
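
    On ARM, "non-cacheable bufferable" versus "non-cacheable non-bufferable"
    typically corresponds to the memory type selected by
    pgprot_writecombine() versus pgprot_noncached(). A hypothetical
    illustration of such a workaround in a driver mmap path (not the actual
    i.MX patch):

        /* Before: Normal memory, non-cacheable, bufferable */
        vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);

        /* Workaround: Strongly-ordered, non-cacheable, non-bufferable */
        vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);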

    Signed-off-by: Tony LIU

    Tony LIU
     
  • commit 9ab4233dd08036fe34a89c7dc6f47a8bf2eb29eb upstream.

    Otherwise the code races with munmap (causing a use-after-free
    of the vma) or with close (causing a use-after-free of the struct
    file).

    The bug was introduced by commit 90ed52ebe481 ("[PATCH] holepunch: fix
    mmap_sem i_mutex deadlock")

    [bwh: Backported to 3.2:
    - Adjust context
    - madvise_remove() calls vmtruncate_range(), not do_fallocate()]
    [luto: Backported to 3.0: Adjust context]

    Cc: Hugh Dickins
    Cc: Miklos Szeredi
    Cc: Badari Pulavarty
    Cc: Nick Piggin
    Signed-off-by: Ben Hutchings
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Greg Kroah-Hartman

    Andy Lutomirski
     
  • Add a function to check the end address of memory, including reserved
    memory. This API provides the top address of physical memory, which can
    be used to check whether a physical address is valid in drivers such as
    the VPU driver.
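
    A hypothetical sketch of such a helper (the function name and the use of
    memblock are assumptions, not the vendor API):

        /* Highest physical address covered by RAM, including reserved regions */
        phys_addr_t imx_get_phys_mem_end(void)
        {
                return memblock_end_of_DRAM();
        }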

    Signed-off-by: Zhang Jiejing

    Zhang Jiejing
     

18 Jun, 2012

3 commits

  • commit c50ac050811d6485616a193eb0f37bfbd191cc89 and
    4523e1458566a0e8ecfaff90f380dd23acc44d27 upstream.

    When called for anonymous (non-shared) mappings, hugetlb_reserve_pages()
    does a resv_map_alloc(). It depends on code in hugetlbfs's
    vm_ops->close() to release that allocation.

    However, in the mmap() failure path, we do a plain unmap_region() without
    the remove_vma() which actually calls vm_ops->close().

    This is a decent fix. This leak could get reintroduced if new code (say,
    after hugetlb_reserve_pages() in hugetlbfs_file_mmap()) decides to return
    an error. But, I think it would have to unroll the reservation anyway.

    Christoph's test case:

    http://marc.info/?l=linux-mm&m=133728900729735

    This patch applies to 3.4 and later. A version for earlier kernels is at
    https://lkml.org/lkml/2012/5/22/418.
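
    A sketch of the shape of the fix described above (the error-path context
    in hugetlb_reserve_pages() is abbreviated and partly assumed):

        static void resv_map_put(struct vm_area_struct *vma)
        {
                struct resv_map *reservations = vma_resv_map(vma);

                if (!reservations)
                        return;
                kref_put(&reservations->refs, resv_map_release);
        }

        /* ... in hugetlb_reserve_pages()'s error path ... */
        out_err:
                if (vma && !(vma->vm_flags & VM_MAYSHARE))
                        resv_map_put(vma);      /* undo resv_map_alloc() */
                return ret;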

    Signed-off-by: Dave Hansen
    Acked-by: Mel Gorman
    Acked-by: KOSAKI Motohiro
    Reported-by: Christoph Lameter
    Tested-by: Christoph Lameter
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dave Hansen
     
  • commit dbda591d920b4c7692725b13e3f68ecb251e9080 upstream.

    The transfer of ->flags causes some of the static mapping virtual
    addresses to be prematurely freed (before the mapping is removed) because
    VM_LAZY_FREE gets "set" if tmp->flags has VM_IOREMAP set. This might
    cause subsequent vmalloc/ioremap calls to fail because it might allocate
    one of the freed virtual address ranges that aren't unmapped.

    va->flags has different types of flags from tmp->flags. If a region with
    VM_IOREMAP set is registered with vm_area_add_early(), it will be removed
    by __purge_vmap_area_lazy().

    Fix vmalloc_init() to correctly initialize vmap_area for the given
    vm_struct.

    Also initialise va->vm. If it is not set, find_vm_area() for the early
    vm regions will always fail.
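
    A rough sketch of the corrected loop in vmalloc_init() (surrounding
    context assumed):

        for (tmp = vmlist; tmp; tmp = tmp->next) {
                va = kzalloc(sizeof(struct vmap_area), GFP_NOWAIT);
                va->flags = VM_VM_AREA;   /* not tmp->flags: a VM_IOREMAP bit
                                             would read as VM_LAZY_FREE here */
                va->va_start = (unsigned long)tmp->addr;
                va->va_end = va->va_start + tmp->size;
                va->vm = tmp;             /* so find_vm_area() works for early regions */
                __insert_vmap_area(va);
        }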

    Signed-off-by: KyongHo Cho
    Cc: "Olav Haugan"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    KyongHo
     
  • commit db1aecafef58b5dda39c4228debe2c845e4a27ab upstream.

    vmap_area->private is a void *, but the field is never used for anything
    other than a vm_struct pointer. So change it to a struct vm_struct *
    named vm, to improve readability and type checking.
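
    In outline, the change looks like this (struct trimmed to the relevant
    member; other fields omitted):

        struct vmap_area {
                ...
                struct vm_struct *vm;   /* was: void *private */
                ...
        };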

    Signed-off-by: Minchan Kim
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Minchan Kim
     

09 Jun, 2012

1 commit

  • commit e48982734ea0500d1eba4f9d96195acc5406cad6 upstream.

    Commit 645747462435 ("vmscan: detect mapped file pages used only once")
    made mapped pages have another round in inactive list because they might
    be just short lived and so we could consider them again next time. This
    heuristic helps to reduce pressure on the active list under streaming
    IO workloads.

    This patch fixes a regression introduced by this commit for heavy shmem
    based workloads because unlike Anon pages, which are excluded from this
    heuristic because they are usually long lived, shmem pages are handled
    as a regular page cache.

    This doesn't work quite well, unfortunately, if the workload is mostly
    backed by shmem (in memory database sitting on 80% of memory) with a
    streaming IO in the background (backup - up to 20% of memory). Anon
    inactive list is full of (dirty) shmem pages when watermarks are hit.
    Shmem pages are kept in the inactive list (they are referenced) in the
    first round and it is hard to reclaim anything else so we reach lower
    scanning priorities very quickly which leads to an excessive swap out.

    Let's fix this by excluding all swap backed pages (they tend to be long
    lived wrt. the regular page cache anyway) from used-once heuristic and
    rather activate them if they are referenced.
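
    The shape of the fix in page_check_references() (sketch; the rest of the
    function is omitted):

        if (referenced_ptes) {
                /* Swap-backed pages (anon, shmem, tmpfs) tend to be long
                 * lived: skip the used-once heuristic and activate them. */
                if (PageSwapBacked(page))
                        return PAGEREF_ACTIVATE;
                ...
        }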

    The customer's workload is shmem backed database (80% of RAM) and they
    are measuring transactions/s with an IO in the background (20%).
    Transactions touch more or less random rows in the table. The
    transaction rate fell by a factor of 3 (in the worst case) because of
    commit 64574746. This patch restores the previous numbers.

    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

01 Jun, 2012

1 commit

  • commit 05f144a0d5c2207a0349348127f996e104ad7404 upstream.

    Dave Jones' system call fuzz testing tool "trinity" triggered the
    following bug error with slab debugging enabled

    =============================================================================
    BUG numa_policy (Not tainted): Poison overwritten
    -----------------------------------------------------------------------------

    INFO: 0xffff880146498250-0xffff880146498250. First byte 0x6a instead of 0x6b
    INFO: Allocated in mpol_new+0xa3/0x140 age=46310 cpu=6 pid=32154
    __slab_alloc+0x3d3/0x445
    kmem_cache_alloc+0x29d/0x2b0
    mpol_new+0xa3/0x140
    sys_mbind+0x142/0x620
    system_call_fastpath+0x16/0x1b
    INFO: Freed in __mpol_put+0x27/0x30 age=46268 cpu=6 pid=32154
    __slab_free+0x2e/0x1de
    kmem_cache_free+0x25a/0x260
    __mpol_put+0x27/0x30
    remove_vma+0x68/0x90
    exit_mmap+0x118/0x140
    mmput+0x73/0x110
    exit_mm+0x108/0x130
    do_exit+0x162/0xb90
    do_group_exit+0x4f/0xc0
    sys_exit_group+0x17/0x20
    system_call_fastpath+0x16/0x1b
    INFO: Slab 0xffffea0005192600 objects=27 used=27 fp=0x (null) flags=0x20000000004080
    INFO: Object 0xffff880146498250 @offset=592 fp=0xffff88014649b9d0

    This implied a reference counting bug and the problem happened during
    mbind().

    mbind() applies a new memory policy to a range and uses mbind_range() to
    merge existing VMAs or split them as necessary. In the event of splits,
    mpol_dup() will allocate a new struct mempolicy and maintain existing
    reference counts whose rules are documented in
    Documentation/vm/numa_memory_policy.txt .

    The problem occurs with shared memory policies. The vm_op->set_policy
    increments the reference count if necessary and split_vma() and
    vma_merge() have already handled the existing reference counts.
    However, policy_vma() screws it up by replacing an existing
    vma->vm_policy with one that potentially has the wrong reference count
    leading to a premature free. This patch removes the damage caused by
    policy_vma().

    With this patch applied Dave's trinity tool runs an mbind test for 5
    minutes without error. /proc/slabinfo reported that there are no
    numa_policy or shared_policy_node objects allocated after the test
    completed and the shared memory region was deleted.

    Signed-off-by: Mel Gorman
    Cc: Dave Jones
    Cc: KOSAKI Motohiro
    Cc: Stephen Wilson
    Cc: Christoph Lameter
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     

22 May, 2012

4 commits

  • commit 8c7577637ca31385e92769a77e2ab5b428e8b99c upstream.

    When the last event is unregistered, there is no need to keep the spare
    array anymore. So free it to avoid memory leak.

    Signed-off-by: Sha Zhengju
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Sha Zhengju
     
  • commit 6bc2e853c6b46a6041980d58200ad9b0a73a60ff upstream.

    Systems with 8 TBytes of memory or greater can hit a problem where only
    the first 8 TB of memory shows up. This is due to "int i" being
    smaller than "unsigned long start_aligned", causing the high bits to be
    dropped.

    The fix is to change `i' to unsigned long to match start_aligned
    and end_aligned.
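
    Why a 32-bit "int" truncates here (illustrative arithmetic, not the
    patch text): with 4 KiB pages the PFN at the 8 TiB boundary is
    2^43 / 2^12 = 2^31, which no longer fits in a signed 32-bit int
    (maximum 2^31 - 1), so assigning the unsigned long PFN to the int loop
    counter drops the high bits. The declaration change is simply:

        unsigned long i;        /* was: int i -- truncated PFNs >= 2^31 */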

    Thanks to Jack Steiner for assistance tracking this down.

    Signed-off-by: Russ Anderson
    Cc: Jack Steiner
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: David S. Miller
    Cc: Yinghai Lu
    Cc: Gavin Shan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Russ Anderson
     
  • commit 4998a6c0edce7fae9c0a5463f6ec3fa585258ee7 upstream.

    Commit 66aebce747eaf ("hugetlb: fix race condition in hugetlb_fault()")
    added code to avoid a race condition by elevating the page refcount in
    hugetlb_fault() while calling hugetlb_cow().

    However, one code path in hugetlb_cow() includes an assertion that the
    page count is 1, whereas it may now also have the value 2 in this path.

    The consensus is that this BUG_ON has served its purpose, so rather than
    extending it to cover both cases, we just remove it.
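
    The removed assertion, in sketch form (the exact surrounding context in
    hugetlb_cow()'s retry path is assumed):

        BUG_ON(page_count(old_page) != 1);   /* count may legitimately be 2 now */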

    Signed-off-by: Chris Metcalf
    Acked-by: Mel Gorman
    Acked-by: Hillf Danton
    Acked-by: Hugh Dickins
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Chris Metcalf
     
  • commit 42b64281453249dac52861f9b97d18552a7ec62b upstream.

    pcpu_embed_first_chunk() allocates memory for each node, copies percpu
    data and frees unused portions of it before proceeding to the next
    group. This assumes that allocations for different nodes don't
    overlap; however, depending on memory topology, the bootmem allocator
    may end up allocating memory from a different node than the requested
    one which may overlap with the portion freed from one of the previous
    percpu areas. This leads to percpu groups for different nodes
    overlapping which is a serious bug.

    This patch separates out copy & partial free from the allocation loop
    such that all allocations are complete before partial frees happen.
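
    Conceptually, the restructuring looks like this (pseudo-C; the helper
    names are invented for illustration and are not the real percpu
    internals):

        /* before: allocate, copy and partially free one group at a time */
        for_each_group(g) {
                areas[g] = bootmem_alloc_near_node(g);
                copy_percpu_data(g);
                free_unused_tail(areas[g]);   /* may free memory that a later
                                                 allocation then reuses */
        }

        /* after: finish every allocation first, then copy and trim */
        for_each_group(g)
                areas[g] = bootmem_alloc_near_node(g);
        for_each_group(g) {
                copy_percpu_data(g);
                free_unused_tail(areas[g]);
        }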

    This also fixes overlapping frees which could happen on allocation
    failure path - out_free_areas path frees whole groups but the groups
    could have portions freed at that point.

    Signed-off-by: Tejun Heo
    Reported-by: "Pavel V. Panteleev"
    Tested-by: "Pavel V. Panteleev"
    LKML-Reference:
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

28 Apr, 2012

1 commit

  • commit aca50bd3b4c4bb5528a1878158ba7abce41de534 upstream.

    Mel reports a BUG_ON(slot == NULL) in radix_tree_tag_set() on s390
    3.0.13: called from __set_page_dirty_nobuffers() when page_remove_rmap()
    tries to transfer dirty flag from s390 storage key to struct page and
    radix_tree.

    That would be because of reclaim's shrink_page_list() calling
    add_to_swap() on this page at the same time: first PageSwapCache is set
    (causing page_mapping(page) to appear as &swapper_space), then
    page->private set, then tree_lock taken, then page inserted into
    radix_tree - so there's an interval before taking the lock when the
    radix_tree slot is empty.

    We could fix this by moving __add_to_swap_cache()'s spin_lock_irq up
    before the SetPageSwapCache. But a better fix is simply to do what's
    five years overdue: Ken Chen introduced __set_page_dirty_no_writeback()
    (if !PageDirty TestSetPageDirty) for tmpfs to skip all the radix_tree
    overhead, and swap is just the same - it ignores the radix_tree tag, and
    does not participate in dirty page accounting, so should be using
    __set_page_dirty_no_writeback() too.
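
    The fix, in outline (mm/swap_state.c; a sketch of the relevant
    operation, other members elided):

        static const struct address_space_operations swap_aops = {
                .writepage      = swap_writepage,
                .set_page_dirty = __set_page_dirty_no_writeback,
                                        /* was __set_page_dirty_nobuffers */
                ...
        };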

    s390 testing now confirms that this does indeed fix the problem.

    Reported-by: Mel Gorman
    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Andrew Morton
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Rik van Riel
    Cc: Ken Chen
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     

23 Apr, 2012

1 commit

  • commit 66aebce747eaf9bc456bf1f1b217d8db843031d0 upstream.

    The race is as follows:

    Suppose a multi-threaded task forks a new process (on cpu A), thus
    bumping up the ref count on all the pages. While the fork is occurring
    (and thus we have marked all the PTEs as read-only), another thread in
    the original process (on cpu B) tries to write to a huge page, taking an
    access violation from the write-protect and calling hugetlb_cow(). Now,
    suppose the fork() fails. It will undo the COW and decrement the ref
    count on the pages, so the ref count on the huge page drops back to 1.
    Meanwhile hugetlb_cow() also decrements the ref count by one on the
    original page, since the original address space doesn't need it any
    more, having copied a new page to replace the original page. This
    leaves the ref count at zero, and when we call unlock_page(), we panic.

    fork on CPU A                           fault on CPU B
    =============                           ==============
    ...
    down_write(&parent->mmap_sem);
    down_write_nested(&child->mmap_sem);
    ...
    while duplicating vmas
      if error
        break;
    ...
    up_write(&child->mmap_sem);
    up_write(&parent->mmap_sem);            ...
                                            down_read(&parent->mmap_sem);
                                            ...
                                            lock_page(page);
                                            handle COW
                                            page_mapcount(old_page) == 2
                                            alloc and prepare new_page
    ...
    handle error
    page_remove_rmap(page);
    put_page(page);
    ...
                                            fold new_page into pte
                                            page_remove_rmap(page);
                                            put_page(page);
                                            ...
                                   oops ==> unlock_page(page);
                                            up_read(&parent->mmap_sem);

    The solution is to take an extra reference to the page while we are
    holding the lock on it.
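
    The shape of the fix in hugetlb_fault() (sketch; surrounding code
    abbreviated):

        page = pte_page(entry);
        get_page(page);                 /* extra ref held across hugetlb_cow() */
        if (page != pagecache_page)
                lock_page(page);
        ...
        if (page != pagecache_page)
                unlock_page(page);
        put_page(page);                 /* drop the extra reference */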

    Signed-off-by: Chris Metcalf
    Cc: Hillf Danton
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Chris Metcalf
     

03 Apr, 2012

3 commits

  • commit 66c4c35c6bc5a1a452b024cf0364635b28fd94e4 upstream.

    sysfs_slab_add() calls various sysfs functions that actually may
    end up in userspace doing all sorts of things.

    Release the slub_lock after adding the kmem_cache structure to the list.
    At that point the address of the kmem_cache is not known so we are
    guaranteed exclusive access to the following modifications to the
    kmem_cache structure.

    If the sysfs_slab_add fails then reacquire the slub_lock to
    remove the kmem_cache structure from the list.
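
    The shape of the change in kmem_cache_create() (sketch; error handling
    abbreviated):

        list_add(&s->list, &slab_caches);
        up_write(&slub_lock);           /* drop the lock before sysfs work */
        if (sysfs_slab_add(s)) {
                down_write(&slub_lock); /* reacquire only to undo the list_add */
                list_del(&s->list);
                ...
        }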

    Reported-by: Sasha Levin
    Acked-by: Eric Dumazet
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg
    Signed-off-by: Greg Kroah-Hartman

    Christoph Lameter
     
  • commit f5bf18fa22f8c41a13eb8762c7373eb3a93a7333 upstream.

    While testing AMS (Active Memory Sharing) / CMO (Cooperative Memory
    Overcommit) on powerpc, we tripped the following:

    kernel BUG at mm/bootmem.c:483!
    cpu 0x0: Vector: 700 (Program Check) at [c000000000c03940]
    pc: c000000000a62bd8: .alloc_bootmem_core+0x90/0x39c
    lr: c000000000a64bcc: .sparse_early_usemaps_alloc_node+0x84/0x29c
    sp: c000000000c03bc0
    msr: 8000000000021032
    current = 0xc000000000b0cce0
    paca = 0xc000000001d80000
    pid = 0, comm = swapper
    kernel BUG at mm/bootmem.c:483!
    enter ? for help
    [c000000000c03c80] c000000000a64bcc
    .sparse_early_usemaps_alloc_node+0x84/0x29c
    [c000000000c03d50] c000000000a64f10 .sparse_init+0x12c/0x28c
    [c000000000c03e20] c000000000a474f4 .setup_arch+0x20c/0x294
    [c000000000c03ee0] c000000000a4079c .start_kernel+0xb4/0x460
    [c000000000c03f90] c000000000009670 .start_here_common+0x1c/0x2c

    This is

    BUG_ON(limit && goal + size > limit);

    and after some debugging, it seems that

    goal = 0x7ffff000000
    limit = 0x80000000000

    and sparse_early_usemaps_alloc_node ->
    sparse_early_usemaps_alloc_pgdat_section calls

    return alloc_bootmem_section(usemap_size() * count, section_nr);

    This is on a system with 8TB available via the AMS pool, and as a quirk
    of AMS in firmware, all of that memory shows up in node 0. So, we end
    up with an allocation that will fail the goal/limit constraints.

    In theory, we could "fall-back" to alloc_bootmem_node() in
    sparse_early_usemaps_alloc_node(), but since we actually have HOTREMOVE
    defined, we'll BUG_ON() instead. A simple solution appears to be to
    unconditionally remove the limit condition in alloc_bootmem_section,
    meaning allocations are allowed to cross section boundaries (necessary
    for systems of this size).
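
    The shape of the change in alloc_bootmem_section() (sketch; the exact
    function body is assumed from the description above):

        pfn = section_nr_to_pfn(section_nr);
        goal = pfn << PAGE_SHIFT;
        /* no per-section "limit" anymore: pass 0 so the allocation may
         * cross section boundaries */
        return alloc_bootmem_core(bdata, size, SMP_CACHE_BYTES, goal, 0);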

    Johannes Weiner pointed out that if alloc_bootmem_section() no longer
    guarantees section-locality, we need check_usemap_section_nr() to print
    possible cross-dependencies between node descriptors and the usemaps
    allocated through it. That makes the two loops in
    sparse_early_usemaps_alloc_node() identical, so re-factor the code a
    bit.

    [akpm@linux-foundation.org: code simplification]
    Signed-off-by: Nishanth Aravamudan
    Cc: Dave Hansen
    Cc: Anton Blanchard
    Cc: Paul Mackerras
    Cc: Ben Herrenschmidt
    Cc: Robert Jennings
    Acked-by: Johannes Weiner
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Nishanth Aravamudan
     
  • commit 1a5a9906d4e8d1976b701f889d8f35d54b928f25 upstream.

    In some cases it may happen that pmd_none_or_clear_bad() is called with
    the mmap_sem hold in read mode. In those cases the huge page faults can
    allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a
    false positive from pmd_bad() that will not like to see a pmd
    materializing as trans huge.

    It's not khugepaged causing the problem, khugepaged holds the mmap_sem
    in write mode (and all those sites must hold the mmap_sem in read mode
    to prevent pagetables to go away from under them, during code review it
    seems vm86 mode on 32bit kernels requires that too unless it's
    restricted to 1 thread per process or UP builds). The race is only with
    the huge pagefaults that can convert a pmd_none() into a
    pmd_trans_huge().

    Effectively all these pmd_none_or_clear_bad() sites running with
    mmap_sem in read mode are somewhat speculative with the page faults, and
    the result is always undefined when they run simultaneously. This is
    probably why it wasn't common to run into this. For example if the
    madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
    fault, the hugepage will not be zapped, if the page fault runs first it
    will be zapped.

    Altering pmd_bad() not to error out if it finds hugepmds won't be enough
    to fix this, because zap_pmd_range would then proceed to call
    zap_pte_range (which would be incorrect if the pmd become a
    pmd_trans_huge()).

    The simplest way to fix this is to read the pmd in the local stack
    (regardless of what we read, no need of actual CPU barriers, only
    compiler barrier needed), and be sure it is not changing under the code
    that computes its value. Even if the real pmd is changing under the
    value we hold on the stack, we don't care. If we actually end up in
    zap_pte_range it means the pmd was not none already and it was not huge,
    and it can't become huge from under us (khugepaged locking explained
    above).

    All we need is to enforce that there is no way anymore that in a code
    path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
    can run into a hugepmd. The overhead of a barrier() is just a compiler
    tweak and should not be measurable (I only added it for THP builds). I
    don't exclude different compiler versions may have prevented the race
    too by caching the value of *pmd on the stack (that hasn't been
    verified, but it wouldn't be impossible considering
    pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
    and there's no external function called in between pmd_trans_huge and
    pmd_none_or_clear_bad).

    if (pmd_trans_huge(*pmd)) {
    if (next-addr != HPAGE_PMD_SIZE) {
    VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
    split_huge_page_pmd(vma->vm_mm, pmd);
    } else if (zap_huge_pmd(tlb, vma, pmd, addr))
    continue;
    /* fall through */
    }
    if (pmd_none_or_clear_bad(pmd))
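
    A condensed sketch of the helper this approach introduces (body
    abbreviated; treat the details as approximate):

        static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
        {
                pmd_t pmdval = *pmd;    /* snapshot onto the stack */
        #ifdef CONFIG_TRANSPARENT_HUGEPAGE
                barrier();              /* keep the compiler from re-reading *pmd */
        #endif
                if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
                        return 1;       /* caller must not descend to pte level */
                if (unlikely(pmd_bad(pmdval))) {
                        pmd_clear_bad(pmd);
                        return 1;
                }
                return 0;
        }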

    Because this race condition could be exercised without special
    privileges this was reported in CVE-2012-1179.

    The race was identified and fully explained by Ulrich who debugged it.
    I'm quoting his accurate explanation below, for reference.

    ====== start quote =======
    mapcount 0 page_mapcount 1
    kernel BUG at mm/huge_memory.c:1384!

    At some point prior to the panic, a "bad pmd ..." message similar to the
    following is logged on the console:

    mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).

    The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
    the page's PMD table entry.

    143 void pmd_clear_bad(pmd_t *pmd)
    144 {
    -> 145 pmd_ERROR(*pmd);
    146 pmd_clear(pmd);
    147 }

    After the PMD table entry has been cleared, there is an inconsistency
    between the actual number of PMD table entries that are mapping the page
    and the page's map count (_mapcount field in struct page). When the page
    is subsequently reclaimed, __split_huge_page() detects this inconsistency.

    1381 if (mapcount != page_mapcount(page))
    1382 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
    1383 mapcount, page_mapcount(page));
    -> 1384 BUG_ON(mapcount != page_mapcount(page));

    The root cause of the problem is a race of two threads in a multithreaded
    process. Thread B incurs a page fault on a virtual address that has never
    been accessed (PMD entry is zero) while Thread A is executing an madvise()
    system call on a virtual address within the same 2 MB (huge page) range.

            virtual address space
            .---------------------.
            |                     |
            |                     |
          .-|---------------------|
          | |                     |
          | |                     |<-- B(fault)
          | |                     |
     2 MB | |/////////////////////|-.
     huge < |/////////////////////|  > A(range)
     page | |/////////////////////|-'
          | |                     |
          | |                     |
          '-|---------------------|
            |                     |
            |                     |
            '---------------------'

    - Thread A is executing an madvise(..., MADV_DONTNEED) system call
    on the virtual address range "A(range)" shown in the picture.

    sys_madvise
    // Acquire the semaphore in shared mode.
    down_read(&current->mm->mmap_sem)
    ...
    madvise_vma
    switch (behavior)
    case MADV_DONTNEED:
    madvise_dontneed
    zap_page_range
    unmap_vmas
    unmap_page_range
    zap_pud_range
    zap_pmd_range
    //
    // Assume that this huge page has never been accessed.
    // I.e. content of the PMD entry is zero (not mapped).
    //
    if (pmd_trans_huge(*pmd)) {
    // We don't get here due to the above assumption.
    }
    //
    // Assume that Thread B incurred a page fault and
    .---------> // sneaks in here as shown below.
    | //
    | if (pmd_none_or_clear_bad(pmd))
    | {
    | if (unlikely(pmd_bad(*pmd)))
    | pmd_clear_bad
    | {
    | pmd_ERROR
    | // Log "bad pmd ..." message here.
    | pmd_clear
    | // Clear the page's PMD entry.
    | // Thread B incremented the map count
    | // in page_add_new_anon_rmap(), but
    | // now the page is no longer mapped
    | // by a PMD entry (-> inconsistency).
    | }
    | }
    |
    v
    - Thread B is handling a page fault on virtual address "B(fault)" shown
    in the picture.

    ...
    do_page_fault
    __do_page_fault
    // Acquire the semaphore in shared mode.
    down_read_trylock(&mm->mmap_sem)
    ...
    handle_mm_fault
    if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
    // We get here due to the above assumption (PMD entry is zero).
    do_huge_pmd_anonymous_page
    alloc_hugepage_vma
    // Allocate a new transparent huge page here.
    ...
    __do_huge_pmd_anonymous_page
    ...
    spin_lock(&mm->page_table_lock)
    ...
    page_add_new_anon_rmap
    // Here we increment the page's map count (starts at -1).
    atomic_set(&page->_mapcount, 0)
    set_pmd_at
    // Here we set the page's PMD entry which will be cleared
    // when Thread A calls pmd_clear_bad().
    ...
    spin_unlock(&mm->page_table_lock)

    The mmap_sem does not prevent the race because both threads are acquiring
    it in shared mode (down_read). Thread B holds the page_table_lock while
    the page's map count and PMD table entry are updated. However, Thread A
    does not synchronize on that lock.

    ====== end quote =======

    [akpm@linux-foundation.org: checkpatch fixes]
    Reported-by: Ulrich Obergfell
    Signed-off-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Dave Jones
    Acked-by: Larry Woodman
    Acked-by: Rik van Riel
    Cc: Mark Salter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrea Arcangeli
     

13 Mar, 2012

3 commits

  • commit 1c641e84719429bbfe62a95ed3545ee7fe24408f upstream.

    Dave Jones reports a few Fedora users hitting the BUG_ON(mm->nr_ptes...)
    in exit_mmap() recently.

    Quoting Hugh's discovery and explanation of the SMP race condition:

    "mm->nr_ptes had unusual locking: down_read mmap_sem plus
    page_table_lock when incrementing, down_write mmap_sem (or mm_users
    0) when decrementing; whereas THP is careful to increment and
    decrement it under page_table_lock.

    Now most of those paths in THP also hold mmap_sem for read or write
    (with appropriate checks on mm_users), but two do not: when
    split_huge_page() is called by hwpoison_user_mappings(), and when
    called by add_to_swap().

    It's conceivable that the latter case is responsible for the
    exit_mmap() BUG_ON mm->nr_ptes that has been reported on Fedora."

    The simplest way to fix it without having to alter the locking is to make
    split_huge_page() a noop in nr_ptes terms, by counting the preallocated
    pagetables that exist for every mapped hugepage. It was an arbitrary
    choice not to count them and either way is not wrong or right, because
    they are not used but they're still allocated.

    Reported-by: Dave Jones
    Reported-by: Hugh Dickins
    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: David Rientjes
    Cc: Josh Boyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrea Arcangeli
     
  • commit b94cfaf6685d691dc3fab023cf32f65e9b7be09c upstream.

    Don't clear vm_mm in a deleted VMA as it's unnecessary and might
    conceivably break the filesystem or driver VMA close routine.

    Reported-by: Al Viro
    Signed-off-by: David Howells
    Acked-by: Al Viro
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Howells
     
  • commit 371528caec553785c37f73fa3926ea0de84f986f upstream.

    There is an issue when memcg unregisters events that were attached to
    the same eventfd:

    - On the first call mem_cgroup_usage_unregister_event() removes all
    events attached to a given eventfd, and if there were no events left,
    thresholds->primary would become NULL;

    - Since there were several events registered, cgroups core will call
    mem_cgroup_usage_unregister_event() again, but now kernel will oops,
    as the function doesn't expect that threshold->primary may be NULL.

    That's a good question whether mem_cgroup_usage_unregister_event()
    should actually remove all events in one go, but nowadays it can't
    do any better as cftype->unregister_event callback doesn't pass
    any private event-associated cookie. So, let's fix the issue by
    simply checking for threshold->primary.
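
    The minimal guard the patch adds near the top of
    mem_cgroup_usage_unregister_event() (sketch; locking context omitted):

        if (!thresholds->primary)
                goto unlock;    /* everything was removed on the first call */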

    FWIW, w/o the patch the following oops may be observed:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
    IP: [] mem_cgroup_usage_unregister_event+0x9c/0x1f0
    Pid: 574, comm: kworker/0:2 Not tainted 3.3.0-rc4+ #9 Bochs Bochs
    RIP: 0010:[] [] mem_cgroup_usage_unregister_event+0x9c/0x1f0
    RSP: 0018:ffff88001d0b9d60 EFLAGS: 00010246
    Process kworker/0:2 (pid: 574, threadinfo ffff88001d0b8000, task ffff88001de91cc0)
    Call Trace:
    [] cgroup_event_remove+0x2b/0x60
    [] process_one_work+0x174/0x450
    [] worker_thread+0x123/0x2d0

    Signed-off-by: Anton Vorontsov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Anton Vorontsov
     

01 Mar, 2012

1 commit

  • commit 918e556ec214ed2f584e4cac56d7b29e4bb6bf27 upstream.

    Lock i_mmap_mutex for access to the VMA prio list to prevent concurrent
    access. Currently, certain parts of the mmap handling are protected by
    the region mutex, but not all.

    Reported-by: Al Viro
    Signed-off-by: David Howells
    Acked-by: Al Viro
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Howells
     

21 Feb, 2012

1 commit

  • commit 73736e0387ba0e6d2b703407b4d26168d31516a7 upstream.

    Zhihua Che reported a possible memleak in slub allocator on
    CONFIG_PREEMPT=y builds.

    It is possible that the current thread migrates right before disabling
    irqs in __slab_alloc(). We must check c->freelist again, and perform a
    normal allocation instead of scratching c->freelist.
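
    The shape of the added check in __slab_alloc(), after irqs have been
    disabled (sketch):

        /* must check c->freelist again: we may have migrated to another cpu,
         * or an IRQ may have refilled the freelist in the meantime */
        object = c->freelist;
        if (object)
                goto load_freelist;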

    Many thanks to Zhihua Che for spotting this bug, introduced in 2.6.39

    V2: It's also possible that an IRQ freed one (or several) object(s) and
    populated c->freelist, so it's not a CONFIG_PREEMPT-only problem.

    Reported-by: Zhihua Che
    Signed-off-by: Eric Dumazet
    Acked-by: Christoph Lameter
    Signed-off-by: Pekka Enberg
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

14 Feb, 2012

5 commits

  • commit b9980cdcf2524c5fe15d8cbae9c97b3ed6385563 upstream.

    Fix CONFIG_TRANSPARENT_HUGEPAGE=y CONFIG_SMP=n CONFIG_DEBUG_VM=y
    CONFIG_DEBUG_SPINLOCK=n kernel: spin_is_locked() is then always false,
    and so triggers some BUGs in Transparent HugePage codepaths.

    asm-generic/bug.h mentions this problem, and provides a WARN_ON_SMP(x);
    but being too lazy to add VM_BUG_ON_SMP, BUG_ON_SMP, WARN_ON_SMP_ONCE,
    VM_WARN_ON_SMP_ONCE, just test NR_CPUS != 1 in the existing VM_BUG_ONs.
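
    The shape of the change (a sketch of one of the adjusted assertions):

        VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&mm->page_table_lock));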

    Signed-off-by: Hugh Dickins
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     
  • commit dc9086004b3d5db75997a645b3fe08d9138b7ad0 upstream.

    When isolating pages for migration, migration starts at the start of a
    zone while the free scanner starts at the end of the zone. Migration
    avoids entering a new zone by never going beyond the free scanner.

    Unfortunately, in very rare cases nodes can overlap. When this happens,
    migration isolates pages without the LRU lock held, corrupting lists
    which will trigger errors in reclaim or during page free such as in the
    following oops

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    IP: [] free_pcppages_bulk+0xcc/0x450
    PGD 1dda554067 PUD 1e1cb58067 PMD 0
    Oops: 0000 [#1] SMP
    CPU 37
    Pid: 17088, comm: memcg_process_s Tainted: G X
    RIP: free_pcppages_bulk+0xcc/0x450
    Process memcg_process_s (pid: 17088, threadinfo ffff881c2926e000, task ffff881c2926c0c0)
    Call Trace:
    free_hot_cold_page+0x17e/0x1f0
    __pagevec_free+0x90/0xb0
    release_pages+0x22a/0x260
    pagevec_lru_move_fn+0xf3/0x110
    putback_lru_page+0x66/0xe0
    unmap_and_move+0x156/0x180
    migrate_pages+0x9e/0x1b0
    compact_zone+0x1f3/0x2f0
    compact_zone_order+0xa2/0xe0
    try_to_compact_pages+0xdf/0x110
    __alloc_pages_direct_compact+0xee/0x1c0
    __alloc_pages_slowpath+0x370/0x830
    __alloc_pages_nodemask+0x1b1/0x1c0
    alloc_pages_vma+0x9b/0x160
    do_huge_pmd_anonymous_page+0x160/0x270
    do_page_fault+0x207/0x4c0
    page_fault+0x25/0x30

    The "X" in the taint flag means that external modules were loaded but but
    is unrelated to the bug triggering. The real problem was because the PFN
    layout looks like this

    Zone PFN ranges:
    DMA 0x00000010 -> 0x00001000
    DMA32 0x00001000 -> 0x00100000
    Normal 0x00100000 -> 0x01e80000
    Movable zone start PFN for each node
    early_node_map[14] active PFN ranges
    0: 0x00000010 -> 0x0000009b
    0: 0x00000100 -> 0x0007a1ec
    0: 0x0007a354 -> 0x0007a379
    0: 0x0007f7ff -> 0x0007f800
    0: 0x00100000 -> 0x00680000
    1: 0x00680000 -> 0x00e80000
    0: 0x00e80000 -> 0x01080000
    1: 0x01080000 -> 0x01280000
    0: 0x01280000 -> 0x01480000
    1: 0x01480000 -> 0x01680000
    0: 0x01680000 -> 0x01880000
    1: 0x01880000 -> 0x01a80000
    0: 0x01a80000 -> 0x01c80000
    1: 0x01c80000 -> 0x01e80000

    The fix is straight-forward. isolate_migratepages() has to make a
    similar check to isolate_freepages() to ensure that it never isolates pages
    from a zone it does not hold the LRU lock for.
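
    The shape of the added check in isolate_migratepages() (sketch):

        page = pfn_to_page(low_pfn);
        if (page_zone(page) != zone)    /* nodes can overlap: skip foreign pages */
                continue;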

    This was discovered in a 3.0-based kernel but it affects 3.1.x, 3.2.x
    and current mainline.

    Signed-off-by: Mel Gorman
    Acked-by: Michal Nazarewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • …ing isolation for migration

    commit 0bf380bc70ecba68cb4d74dc656cc2fa8c4d801a upstream.

    When isolating for migration, migration starts at the start of a zone
    which is not necessarily pageblock aligned. Further, it stops isolating
    when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally
    not aligned. This allows isolate_migratepages() to call pfn_to_page() on
    an invalid PFN which can result in a crash. This was originally reported
    against a 3.0-based kernel with the following trace in a crash dump.

    PID: 9902 TASK: d47aecd0 CPU: 0 COMMAND: "memcg_process_s"
    #0 [d72d3ad0] crash_kexec at c028cfdb
    #1 [d72d3b24] oops_end at c05c5322
    #2 [d72d3b38] __bad_area_nosemaphore at c0227e60
    #3 [d72d3bec] bad_area at c0227fb6
    #4 [d72d3c00] do_page_fault at c05c72ec
    #5 [d72d3c80] error_code (via page_fault) at c05c47a4
    EAX: 00000000 EBX: 000c0000 ECX: 00000001 EDX: 00000807 EBP: 000c0000
    DS: 007b ESI: 00000001 ES: 007b EDI: f3000a80 GS: 6f50
    CS: 0060 EIP: c030b15a ERR: ffffffff EFLAGS: 00010002
    #6 [d72d3cb4] isolate_migratepages at c030b15a
    #7 [d72d3d14] zone_watermark_ok at c02d26cb
    #8 [d72d3d2c] compact_zone at c030b8de
    #9 [d72d3d68] compact_zone_order at c030bba1
    #10 [d72d3db4] try_to_compact_pages at c030bc84
    #11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7
    #12 [d72d3e08] __alloc_pages_slowpath at c02d66c7
    #13 [d72d3e78] __alloc_pages_nodemask at c02d6a97
    #14 [d72d3eb8] alloc_pages_vma at c030a845
    #15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb
    #16 [d72d3f00] handle_mm_fault at c02f36c6
    #17 [d72d3f30] do_page_fault at c05c70ed
    #18 [d72d3fb0] error_code (via page_fault) at c05c47a4
    EAX: b71ff000 EBX: 00000001 ECX: 00001600 EDX: 00000431
    DS: 007b ESI: 08048950 ES: 007b EDI: bfaa3788
    SS: 007b ESP: bfaa36e0 EBP: bfaa3828 GS: 6f50
    CS: 0073 EIP: 080487c8 ERR: ffffffff EFLAGS: 00010202

    It was also reported by Herbert van den Bergh against 3.1-based kernel
    with the following snippet from the console log.

    BUG: unable to handle kernel paging request at 01c00008
    IP: [<c0522399>] isolate_migratepages+0x119/0x390
    *pdpt = 000000002f7ce001 *pde = 0000000000000000

    It is expected that it also affects 3.2.x and current mainline.

    The problem is that pfn_valid is only called on the first PFN being
    checked and that PFN is not necessarily aligned. Let's say we have a case
    like this

    H = MAX_ORDER_NR_PAGES boundary
    | = pageblock boundary
    m = cc->migrate_pfn
    f = cc->free_pfn
    o = memory hole

    H------|------H------|----m-Hoooooo|ooooooH-f----|------H

    The migrate_pfn is just below a memory hole and the free scanner is beyond
    the hole. When isolate_migratepages started, it scans from migrate_pfn to
    migrate_pfn+pageblock_nr_pages which is now in a memory hole. It checks
    pfn_valid() on the first PFN but then scans into the hole where there are
    not necessarily valid struct pages.

    This patch ensures that isolate_migratepages calls pfn_valid when
    necessary.
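
    The shape of the added check in isolate_migratepages() (sketch):

        /* Only call pfn_valid() when crossing a MAX_ORDER_NR_PAGES boundary;
         * within such a block validity cannot change. */
        if ((low_pfn & (MAX_ORDER_NR_PAGES - 1)) == 0 && !pfn_valid(low_pfn)) {
                low_pfn += MAX_ORDER_NR_PAGES - 1;
                continue;
        }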

    Reported-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
    Tested-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Michal Nazarewicz <mina86@mina86.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

    Mel Gorman
     
  • commit 99f02ef1f18631eb0a4e0ea0a3d56878dbcb4b90 upstream.

    Fix a race condition that shows in conjunction with xip_file_fault() when
    two threads of the same user process fault on the same memory page.

    In this case, the race winner will install the page table entry and the
    unlucky loser will cause an oops: xip_file_fault calls vm_insert_pfn (via
    vm_insert_mixed) which drops out at this check:

    retval = -EBUSY;
    if (!pte_none(*pte))
    goto out_unlock;

    The resulting -EBUSY return value will trigger a BUG_ON() in
    xip_file_fault.

    This fix simply considers the fault as fixed in this case, because the
    race winner has successfully installed the pte.
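
    The shape of the fix in xip_file_fault() (sketch):

        err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address, xip_pfn);
        if (err == -ENOMEM)
                return VM_FAULT_OOM;
        /* -EBUSY means another thread already installed the pte: fine */
        if (err != -EBUSY)
                BUG_ON(err);
        return VM_FAULT_NOPAGE;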

    [akpm@linux-foundation.org: use conventional (and consistent) comment layout]
    Reported-by: David Sadler
    Signed-off-by: Carsten Otte
    Reported-by: Louis Alex Eisner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Carsten Otte
     
  • commit 3deaa7190a8da38453c4fabd9dec7f66d17fff67 upstream.

    Herbert Poetzl reported a performance regression since 2.6.39. The test
    is a simple dd read, but with big block size. The reason is:

    T1: ra (A, A+128k), (A+128k, A+256k)
    T2: lock_page for page A, submit the 256k
    T3: hit page A+128k, ra (A+256k, A+384k). The range isn't submitted
    because of the plug, and there isn't any lock_page till we hit page A+256k
    because all pages from A to A+256k are in memory
    T4: hit page A+256k, ra (A+384k, A+512k). Because of the plug, the range
    isn't submitted again.
    T5: lock_page A+256k, so (A+256k, A+512k) will be submitted. The task is
    waiting for (A+256k, A+512k) to finish.

    There is no request to disk in T3 and T4, so readahead pipeline breaks.

    We really don't need block plug for generic_file_aio_read() for buffered
    I/O. The readahead already has plug and has fine grained control when I/O
    should be submitted. Deleting plug for buffered I/O fixes the regression.
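
    In outline, the fix deletes the top-level plug from
    generic_file_aio_read(); the removed calls were of this shape (sketch):

        struct blk_plug plug;

        blk_start_plug(&plug);
        /* ... buffered/direct read path ... */
        blk_finish_plug(&plug);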

    One side effect is that the plug makes the request size 256k, whereas it
    is 128k without it. This is because the default readahead size is 128k,
    which is not a reason we need the plug here.

    Vivek said:

    : We submit some readahead IO to device request queue but because of nested
    : plug, queue never gets unplugged. When read logic reaches a page which is
    : not in page cache, it waits for page to be read from the disk
    : (lock_page_killable()) and that time we flush the plug list.
    :
    : So effectively read ahead logic is kind of broken in parts because of
    : nested plugging. Removing top level plug (generic_file_aio_read()) for
    : buffered reads, will allow unplugging queue earlier for readahead.

    Signed-off-by: Shaohua Li
    Signed-off-by: Wu Fengguang
    Reported-by: Herbert Poetzl
    Tested-by: Eric Dumazet
    Cc: Christoph Hellwig
    Cc: Jens Axboe
    Cc: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Shaohua Li
     

26 Jan, 2012

2 commits

  • commit 687875fb7de4a95223af20ee024282fa9099f860 upstream.

    Fix the following NULL ptr dereference caused by

    cat /sys/devices/system/memory/memory0/removable

    Pid: 13979, comm: sed Not tainted 3.0.13-0.5-default #1 IBM BladeCenter LS21 -[7971PAM]-/Server Blade
    RIP: __count_immobile_pages+0x4/0x100
    Process sed (pid: 13979, threadinfo ffff880221c36000, task ffff88022e788480)
    Call Trace:
    is_pageblock_removable_nolock+0x34/0x40
    is_mem_section_removable+0x74/0xf0
    show_mem_removable+0x41/0x70
    sysfs_read_file+0xfe/0x1c0
    vfs_read+0xc7/0x130
    sys_read+0x53/0xa0
    system_call_fastpath+0x16/0x1b

    We are crashing because we are trying to dereference NULL zone which
    came from pfn=0 (struct page ffffea0000000000). According to the boot
    log this page is marked reserved:
    e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)

    and early_node_map confirms that:
    early_node_map[3] active PFN ranges
    1: 0x00000010 -> 0x0000009c
    1: 0x00000100 -> 0x000bffa3
    1: 0x00100000 -> 0x00240000

    The problem is that memory_present works in PAGE_SECTION_MASK aligned
    blocks so the reserved range sneaks into the section as well. This
    also means that free_area_init_node will not take care of those reserved
    pages and they stay uninitialized.

    When we try to read the removable status we walk through all available
    sections and hope that the zone is valid for all pages in the section.
    But this is not true in this case as the zone and nid are not initialized.

    We have only one node in this particular case and it is marked as node=1
    (rather than 0) and that made the problem visible because page_to_nid will
    return 0 and there are no zones on the node.

    Let's check that the zone is valid and that the given pfn falls into its
    boundaries and mark the section not removable. This might cause some
    false positives, probably, but we do not have any sane way to find out
    whether the page is reserved by the platform or it is just not used for
    whatever other reasons.
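
    The shape of the added validity check in is_pageblock_removable_nolock()
    (sketch; the exact bounds test is assumed from the description above):

        zone = page_zone(page);
        pfn = page_to_pfn(page);
        /* the section may extend past the zone: such pfns are not removable */
        if (zone->zone_start_pfn > pfn ||
            zone->zone_start_pfn + zone->spanned_pages <= pfn)
                return 0;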

    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     
  • commit ab936cbcd02072a34b60d268f94440fd5cf1970b upstream.

    Commit ef6a3c6311 ("mm: add replace_page_cache_page() function") added a
    function replace_page_cache_page(). This function replaces a page in the
    radix-tree with a new page. When doing this, the memory cgroup needs to fix
    up the accounting information: memcg needs to check the PCG_USED bit etc.

    In some (many?) cases, 'newpage' is on the LRU before calling
    replace_page_cache(). So, memcg's LRU accounting information should be
    fixed, too.

    This patch adds mem_cgroup_replace_page_cache() and removes the old hooks.
    In that function, old pages will be unaccounted without touching
    res_counter and new page will be accounted to the memcg (of old page).
    When overwriting pc->mem_cgroup of newpage, take zone->lru_lock and avoid
    races with LRU handling.

    Background:
    replace_page_cache_page() is called by FUSE code in its splice() handling.
    Here, 'newpage' is replacing oldpage but this newpage is not a newly allocated
    page and may be on LRU. LRU mis-accounting will be critical for memory cgroup
    because rmdir() checks the whole LRU is empty and there is no account leak.
    If a page is on the other LRU than it should be, rmdir() will fail.

    This bug was added in March 2011, but no bug report yet. I guess there
    are not many people who use memcg and FUSE at the same time with upstream
    kernels.

    The result of this bug is that admin cannot destroy a memcg because of
    account leak. So, no panic, no deadlock. And, even if an active cgroup
    exists, umount can succeed. So there is no problem at shutdown.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Miklos Szeredi
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    KAMEZAWA Hiroyuki
     

07 Jan, 2012

5 commits

  • commit b0365c8d0cb6e79eb5f21418ae61ab511f31b575 upstream.

    If a huge page is enqueued under the protection of hugetlb_lock, then the
    operation is atomic and safe.

    Signed-off-by: Hillf Danton
    Reviewed-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hillf Danton
     
  • commit a41c58a6665cc995e237303b05db42100b71b65e upstream.

    If the request is to create a non-root group and we fail to meet it, we
    should leave the root unchanged.

    Signed-off-by: Hillf Danton
    Signed-off-by: Hugh Dickins
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hillf Danton
     
  • commit e6f67b8c05f5e129e126f4409ddac6f25f58ffcb upstream.

    lockdep reports a deadlock in jfs because a special inode's rw semaphore
    is taken recursively. The mapping's gfp mask is GFP_NOFS, but is not
    used when __read_cache_page() calls add_to_page_cache_lru().
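
    In outline, the fix makes __read_cache_page() honour the caller's gfp
    mask (sketch):

        err = add_to_page_cache_lru(page, mapping, index, gfp);
                                        /* was: GFP_KERNEL, which ignores the
                                           mapping's GFP_NOFS restriction */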

    Signed-off-by: Dave Kleikamp
    Acked-by: Hugh Dickins
    Acked-by: Al Viro
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dave Kleikamp
     
  • commit ff05b6f7ae762b6eb464183eec994b28ea09f6dd upstream.

    An integer overflow will happen on 64bit archs if task's sum of rss,
    swapents and nr_ptes exceeds (2^31)/1000 value. This was introduced by
    commit

    f755a04 oom: use pte pages in OOM score

    where the oom score computation was divided into several steps and it's no
    longer computed as one expression in unsigned long(rss, swapents, nr_pte
    are unsigned long), where the result value assigned to points(int) is in
    range(1..1000). So there could be an int overflow while computing

    176 points *= 1000;

    and points may have a negative value, meaning the oom score for a mem hog
    task will be one, because only the negative value is considered in:

    196 if (points <= 0)
    197         return 1;

    Acked-by: KOSAKI Motohiro
    Acked-by: Oleg Nesterov
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Frantisek Hrbata
     
  • commit 9f57bd4d6dc69a4e3bf43044fa00fcd24dd363e3 upstream.

    per_cpu_ptr_to_phys() incorrectly rounds its result for the non-kmalloc
    case down to the page boundary, which is bogus for any non-page-aligned
    address.

    This affects the only in-tree user of this function - sysfs handler
    for per-cpu 'crash_notes' physical address. The trouble is that the
    crash_notes per-cpu variable is not page-aligned:

    crash_notes = 0xc08e8ed4
    PER-CPU OFFSET VALUES:
    CPU 0: 3711f000
    CPU 1: 37129000
    CPU 2: 37133000
    CPU 3: 3713d000

    So, the per-cpu addresses are:
    crash_notes on CPU 0: f7a07ed4 => phys 36b57ed4
    crash_notes on CPU 1: f7a11ed4 => phys 36b4ded4
    crash_notes on CPU 2: f7a1bed4 => phys 36b43ed4
    crash_notes on CPU 3: f7a25ed4 => phys 36b39ed4

    However, /sys/devices/system/cpu/cpu*/crash_notes says:
    /sys/devices/system/cpu/cpu0/crash_notes: 36b57000
    /sys/devices/system/cpu/cpu1/crash_notes: 36b4d000
    /sys/devices/system/cpu/cpu2/crash_notes: 36b43000
    /sys/devices/system/cpu/cpu3/crash_notes: 36b39000

    As you can see, all values are rounded down to a page
    boundary. Consequently, this is where kexec sets up the NOTE segments,
    and thus where the secondary kernel is looking for them. However, when
    the first kernel crashes, it saves the notes to the unaligned
    addresses, where they are not found.

    Fix it by adding offset_in_page() to the translated page address.
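
    The shape of the fix in per_cpu_ptr_to_phys() (sketch; only the return
    expressions shown, taken from the description above):

        return page_to_phys(pcpu_addr_to_page(addr)) + offset_in_page(addr);
        /* and likewise for the vmalloc-backed case:
         *   page_to_phys(vmalloc_to_page(addr)) + offset_in_page(addr) */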

    -tj: Combined Eugene's and Petr's commit messages.

    Signed-off-by: Eugene Surovegin
    Signed-off-by: Tejun Heo
    Reported-by: Petr Tesarik
    Signed-off-by: Greg Kroah-Hartman

    Eugene Surovegin
     

22 Dec, 2011

3 commits

  • commit a855b84c3d8c73220d4d3cd392a7bee7c83de70e upstream.

    Percpu allocator recorded the cpus which map to the first and last
    units in pcpu_first/last_unit_cpu respectively and used them to
    determine the address range of a chunk - e.g. it assumed that the
    first unit has the lowest address in a chunk while the last unit has
    the highest address.

    This simply isn't true. Groups in a chunk can have arbitrary positive
    or negative offsets from the previous one and there is no guarantee
    that the first unit occupies the lowest offset while the last one the
    highest.

    Fix it by actually comparing unit offsets to determine cpus occupying
    the lowest and highest offsets. Also, rename pcpu_first/last_unit_cpu
    to pcpu_low/high_unit_cpu to avoid confusion.

    The chunk address range is used to flush cache on vmalloc area
    map/unmap and decide whether a given address is in the first chunk by
    per_cpu_ptr_to_phys() and the bug was discovered by invalid
    per_cpu_ptr_to_phys() translation for crash_note.

    Kudos to Dave Young for tracking down the problem.

    Signed-off-by: Tejun Heo
    Reported-by: WANG Cong
    Reported-by: Dave Young
    Tested-by: Dave Young
    LKML-Reference:
    Signed-off-by: Thomas Renninger
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • commit 1368edf0647ac112d8cfa6ce47257dc950c50f5c upstream.

    Commit f5252e00 ("mm: avoid null pointer access in vm_struct via
    /proc/vmallocinfo") adds newly allocated vm_structs to the vmlist after
    it is fully initialised. Unfortunately, it did not check that
    __vmalloc_area_node() successfully populated the area. In the event of
    allocation failure, the vmalloc area is freed but the pointer to freed
    memory is inserted into the vmlist, leading to a crash later in
    get_vmalloc_info().

    This patch adds a check for __vmalloc_area_node() failure within
    __vmalloc_node_range. It does not use "goto fail" as in the previous
    error path as a warning was already displayed by __vmalloc_area_node()
    before it called vfree in its failure path.
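
    The shape of the added check in __vmalloc_node_range() (sketch):

        addr = __vmalloc_area_node(area, gfp_mask, prot, node, caller);
        if (!addr)
                return NULL;    /* area already freed and warned about */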

    Credit goes to Luciano Chavez for doing all the real work of identifying
    exactly where the problem was.

    Signed-off-by: Mel Gorman
    Reported-by: Luciano Chavez
    Tested-by: Luciano Chavez
    Reviewed-by: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit d021563888312018ca65681096f62e36c20e63cc upstream.

    setup_zone_migrate_reserve() expects that zone->start_pfn starts at
    pageblock_nr_pages aligned pfn otherwise we could access beyond an
    existing memblock resulting in the following panic if
    CONFIG_HOLES_IN_ZONE is not configured and we do not check pfn_valid:

    IP: [] setup_zone_migrate_reserve+0xcd/0x180
    *pdpt = 0000000000000000 *pde = f000ff53f000ff53
    Oops: 0000 [#1] SMP
    Pid: 1, comm: swapper Not tainted 3.0.7-0.7-pae #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
    EIP: 0060:[] EFLAGS: 00010006 CPU: 0
    EIP is at setup_zone_migrate_reserve+0xcd/0x180
    EAX: 000c0000 EBX: f5801fc0 ECX: 000c0000 EDX: 00000000
    ESI: 000c01fe EDI: 000c01fe EBP: 00140000 ESP: f2475f58
    DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
    Process swapper (pid: 1, ti=f2474000 task=f2472cd0 task.ti=f2474000)
    Call Trace:
    [] __setup_per_zone_wmarks+0xec/0x160
    [] setup_per_zone_wmarks+0xf/0x20
    [] init_per_zone_wmark_min+0x27/0x86
    [] do_one_initcall+0x2b/0x160
    [] kernel_init+0xbe/0x157
    [] kernel_thread_helper+0x6/0xd
    Code: a5 39 f5 89 f7 0f 46 fd 39 cf 76 40 8b 03 f6 c4 08 74 32 eb 91 90 89 c8 c1 e8 0e 0f be 80 80 2f 86 c0 8b 14 85 60 2f 86 c0 89 c8 82 b4 12 00 00 c1 e0 05 03 82 ac 12 00 00 8b 00 f6 c4 08 0f
    EIP: [] setup_zone_migrate_reserve+0xcd/0x180 SS:ESP 0068:f2475f58
    CR2: 00000000000012b4

    We crashed in pageblock_is_reserved() when accessing pfn 0xc0000 because
    highstart_pfn = 0x36ffe.

    The issue was introduced in 3.0-rc1 by 6d3163ce ("mm: check if any page
    in a pageblock is reserved before marking it MIGRATE_RESERVE").

    Make sure that start_pfn is always aligned to pageblock_nr_pages to
    ensure that pfn_valid is always called at the start of each pageblock.
    Architectures with holes in pageblocks will be correctly handled by
    pfn_valid_within in pageblock_is_reserved.
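
    The shape of the fix in setup_zone_migrate_reserve() (sketch):

        start_pfn = zone->zone_start_pfn;
        end_pfn = start_pfn + zone->spanned_pages;
        start_pfn = roundup(start_pfn, pageblock_nr_pages);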

    Signed-off-by: Michal Hocko
    Signed-off-by: Mel Gorman
    Tested-by: Dang Bo
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Arve Hjønnevåg
    Cc: KOSAKI Motohiro
    Cc: John Stultz
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko