16 Jan, 2015

4 commits

  • commit 690eac53daff34169a4d74fc7bfbd388c4896abb upstream.

    Commit fee7e49d4514 ("mm: propagate error from stack expansion even for
    guard page") made sure that we return the error properly for stack
    growth conditions. It also theorized that counting the guard page
    towards the stack limit might break something, but also said "Let's see
    if anybody notices".

    Somebody did notice. Apparently android-x86 sets the stack limit very
    close to the limit indeed, and including the guard page in the rlimit
    check causes the android 'zygote' process problems.

    So this adds the (fairly trivial) code to make the stack rlimit check be
    against the actual real stack size, rather than the size of the vma that
    includes the guard page.

    Reported-and-tested-by: Chih-Wei Huang
    Cc: Jay Foad
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
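
    As a rough illustration of the check described above (a standalone C
    sketch, not the kernel change itself; the helper name and parameters are
    invented for the example):

        /* Model of the adjusted rlimit check: subtract the guard page from
         * the vma size before comparing against RLIMIT_STACK, so the guard
         * page no longer counts toward the stack limit. */
        #include <stdbool.h>
        #include <stddef.h>

        #define GUARD_PAGE_SIZE 4096UL

        static bool stack_growth_within_rlimit(size_t vma_size,
                                               bool has_guard_page,
                                               size_t stack_rlimit)
        {
                size_t actual_size = vma_size;

                if (has_guard_page)
                        actual_size -= GUARD_PAGE_SIZE; /* ignore guard page */

                return actual_size <= stack_rlimit;
        }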
     
  • commit fee7e49d45149fba60156f5b59014f764d3e3728 upstream.

    Jay Foad reports that the address sanitizer test (asan) sometimes gets
    confused by a stack pointer that ends up being outside the stack vma
    that is reported by /proc/maps.

    This happens due to an interaction between RLIMIT_STACK and the guard
    page: when we do the guard page check, we ignore the potential error
    from the stack expansion, which effectively results in a missing guard
    page, since the expected stack expansion won't have been done.

    And since /proc/maps explicitly ignores the guard page (commit
    d7824370e263: "mm: fix up some user-visible effects of the stack guard
    page"), the stack pointer ends up being outside the reported stack area.

    This is the minimal patch: it just propagates the error. It also
    effectively makes the guard page part of the stack limit, which in turn
    means that the actual real stack is one page less than the stack limit.

    Let's see if anybody notices. We could teach acct_stack_growth() to
    allow an extra page for a grow-up/grow-down stack in the rlimit test,
    but I don't want to add more complexity if it isn't needed.

    Reported-and-tested-by: Jay Foad
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     
  • commit 9e5e3661727eaf960d3480213f8e87c8d67b6956 upstream.

    Charles Shirron and Paul Cassella from Cray Inc have reported kswapd
    stuck in a busy loop with nothing left to balance, but
    kswapd_try_to_sleep() failing to sleep. Their analysis found the cause
    to be a combination of several factors:

    1. A process is waiting in throttle_direct_reclaim() on pgdat->pfmemalloc_wait

    2. The process has been killed (by OOM in this case), but has not yet been
    scheduled to remove itself from the waitqueue and die.

    3. kswapd checks for throttled processes in prepare_kswapd_sleep():

    if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
            wake_up(&pgdat->pfmemalloc_wait);
            return false; /* kswapd will not go to sleep */
    }

    However, for a process that was already killed, wake_up() does not remove
    the process from the waitqueue, since try_to_wake_up() checks its state
    first and returns false when the process is no longer waiting.

    4. kswapd is running on the same CPU as the only CPU that the process is
    allowed to run on (through cpus_allowed, or possibly single-cpu system).

    5. CONFIG_PREEMPT_NONE=y kernel is used. If there's nothing to balance, kswapd
    encounters no voluntary preemption points and repeatedly fails
    prepare_kswapd_sleep(), blocking the process from running and removing
    itself from the waitqueue, which would let kswapd sleep.

    So, the source of the problem is that we prevent kswapd from going to
    sleep while there are processes waiting on the pfmemalloc_wait queue,
    and a process waiting on a queue is guaranteed to be removed from the
    queue only when it gets scheduled. This was done to make sure that no
    process is left sleeping on pfmemalloc_wait when kswapd itself goes to
    sleep.

    However, it isn't necessary to postpone kswapd sleep until the
    pfmemalloc_wait queue actually empties. To prevent processes from being
    left sleeping, it's actually enough to guarantee that all processes
    waiting on pfmemalloc_wait queue have been woken up by the time we put
    kswapd to sleep.

    This patch therefore fixes this issue by substituting 'wake_up' with
    'wake_up_all' and removing 'return false' in the code snippet from
    prepare_kswapd_sleep() above. Note that if any process puts itself in
    the queue after this waitqueue_active() check, or after the wake up
    itself, it means that the process will also wake up kswapd - and since
    we are under prepare_to_wait(), the wake up won't be missed. We also
    update the comment in prepare_kswapd_sleep() to more clearly describe
    the races it is preventing.

    Fixes: 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage")
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Vladimir Davydov
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
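
    Based on the description above, the fixed check plausibly looks
    something like this (a sketch, not the verbatim upstream diff):

        if (waitqueue_active(&pgdat->pfmemalloc_wait))
                wake_up_all(&pgdat->pfmemalloc_wait);
        /* and no "return false" here any more: kswapd may still go to
         * sleep, since every throttled process has now been woken */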
     
  • commit 2d6d7f98284648c5ed113fe22a132148950b140f upstream.

    Tejun, while reviewing the code, spotted the following race condition
    between the dirtying and truncation of a page:

    __set_page_dirty_nobuffers()       __delete_from_page_cache()
      if (TestSetPageDirty(page))
                                         page->mapping = NULL
                                         if (PageDirty())
                                           dec_zone_page_state(page, NR_FILE_DIRTY);
                                           dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
        if (page->mapping)
          account_page_dirtied(page)
            __inc_zone_page_state(page, NR_FILE_DIRTY);
            __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);

    which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.

    Dirtiers usually lock out truncation, either by holding the page lock
    directly, or in case of zap_pte_range(), by pinning the mapcount with
    the page table lock held. The notable exception to this rule, though,
    is do_wp_page(), for which this race exists. However, do_wp_page()
    already waits for a locked page to unlock before setting the dirty bit,
    in order to prevent a race where clear_page_dirty() misses the page bit
    in the presence of dirty ptes. Upgrade that wait to a fully locked
    set_page_dirty() to also cover the situation explained above.

    Afterwards, the code in set_page_dirty() dealing with a truncation race
    is no longer needed. Remove it.

    Reported-by: Tejun Heo
    Signed-off-by: Johannes Weiner
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Johannes Weiner
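
    A minimal sketch of the "fully locked set_page_dirty()" idea in
    do_wp_page() (illustrative only; the surrounding fault-handling details
    are omitted):

        /* Take the page lock and dirty the page while holding it, so a
         * concurrent truncation cannot slip in between the dirty-bit test
         * and the accounting. */
        lock_page(page);
        set_page_dirty(page);
        unlock_page(page);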
     

09 Jan, 2015

1 commit

  • commit 6b101e2a3ce4d2a0312087598bd1ab4a1db2ac40 upstream.

    high_memory isn't direct-mapped memory, so retrieving its physical
    address isn't appropriate. But it is useful to check the physical
    address of the highmem boundary, so it is justifiable to get a physical
    address from it. On x86, there is a validation check when
    CONFIG_DEBUG_VIRTUAL is enabled, and it triggers the following boot
    failure reported by Ingo.

    ...
    BUG: Int 6: CR2 00f06f53
    ...
    Call Trace:
    dump_stack+0x41/0x52
    early_idt_handler+0x6b/0x6b
    cma_declare_contiguous+0x33/0x212
    dma_contiguous_reserve_area+0x31/0x4e
    dma_contiguous_reserve+0x11d/0x125
    setup_arch+0x7b5/0xb63
    start_kernel+0xb8/0x3e6
    i386_start_kernel+0x79/0x7d

    To fix the boot regression, this patch implements a workaround to avoid
    the validation check on x86 when retrieving the physical address of
    high_memory. __pa_nodebug(), used by this patch, is implemented only on
    x86, so there is no choice but to use a dirty #ifdef.

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: Joonsoo Kim
    Reported-by: Ingo Molnar
    Tested-by: Ingo Molnar
    Cc: Marek Szyprowski
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Joonsoo Kim
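
    The workaround plausibly looks roughly like this (a sketch, assuming the
    boundary is computed where the CMA region is declared; 'highmem_start'
    is a local variable name used for illustration):

        phys_addr_t highmem_start;

        #ifdef CONFIG_X86
                /* high_memory isn't direct mapped, so __pa() would trip the
                 * CONFIG_DEBUG_VIRTUAL validation; use the unchecked
                 * variant, which only exists on x86. */
                highmem_start = __pa_nodebug(high_memory);
        #else
                highmem_start = __pa(high_memory);
        #endif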
     

04 Dec, 2014

3 commits

  • The bounds check for nodeid in ____cache_alloc_node gives false
    positives on machines where the node IDs are not contiguous, leading to
    a panic at boot time. For example, on a POWER8 machine the node IDs are
    typically 0, 1, 16 and 17. This means that num_online_nodes() returns
    4, so when ____cache_alloc_node is called with nodeid = 16 the VM_BUG_ON
    triggers, like this:

    kernel BUG at /home/paulus/kernel/kvm/mm/slab.c:3079!
    Call Trace:
    .____cache_alloc_node+0x5c/0x270 (unreliable)
    .kmem_cache_alloc_node_trace+0xdc/0x360
    .init_list+0x3c/0x128
    .kmem_cache_init+0x1dc/0x258
    .start_kernel+0x2a0/0x568
    start_here_common+0x20/0xa8

    To fix this, we instead compare the nodeid with MAX_NUMNODES, and
    additionally make sure it isn't negative (since nodeid is an int). The
    check is there mainly to protect the array dereference in the get_node()
    call in the next line, and the array being dereferenced is of size
    MAX_NUMNODES. If the nodeid is in range but invalid (for example if the
    node is off-line), the BUG_ON in the next line will catch that.

    Fixes: 14e50c6a9bc2 ("mm: slab: Verify the nodeid passed to ____cache_alloc_node")
    Signed-off-by: Paul Mackerras
    Reviewed-by: Yasuaki Ishimatsu
    Reviewed-by: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Mackerras
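
    The adjusted check, roughly (a sketch based on the description above):

        /* nodeid indexes an array of size MAX_NUMNODES, so bound it by that
         * (and reject negative values) instead of num_online_nodes(), which
         * undercounts when node IDs are sparse. */
        VM_BUG_ON(nodeid < 0 || nodeid >= MAX_NUMNODES);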
     
  • Andrew Morton noticed that the error return from anon_vma_clone() was
    being dropped and replaced with -ENOMEM (which is not itself a bug
    because the only error return value from anon_vma_clone() is -ENOMEM).

    I did an audit of callers of anon_vma_clone() and discovered an actual
    bug where the error return was being lost. In __split_vma(), between
    Linux 3.11 and 3.12 the code was changed so the err variable is used
    before the call to anon_vma_clone() and the default initial value of
    -ENOMEM is overwritten. So a failure of anon_vma_clone() will return
    success since err at this point is now zero.

    Below is a patch which fixes this bug and also propagates the error
    return value from anon_vma_clone() in all cases.

    Fixes: ef0855d334e1 ("mm: mempolicy: turn vma_set_policy() into vma_dup_policy()")
    Signed-off-by: Daniel Forrest
    Reviewed-by: Michal Hocko
    Cc: Konstantin Khlebnikov
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Tim Hartrick
    Cc: Hugh Dickins
    Cc: Michel Lespinasse
    Cc: Vlastimil Babka
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Forrest
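
    A sketch of the fix in __split_vma() (based on the description; the
    label name is shown only for illustration):

        /* Capture the return value instead of letting an earlier assignment
         * to err be the value that is eventually returned. */
        err = anon_vma_clone(new, vma);
        if (err)
                goto out_free_mpol;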
     
  • I've been seeing swapoff hangs in recent testing: it's cycling around
    trying unsuccessfully to find an mm for some remaining pages of swap.

    I have been exercising swap and page migration more heavily recently,
    and now notice a long-standing error in copy_one_pte(): it's trying to
    add dst_mm to swapoff's mmlist when it finds a swap entry, but is doing
    so even when it's a migration entry or an hwpoison entry.

    Which wouldn't matter much, except it adds dst_mm next to src_mm,
    assuming src_mm is already on the mmlist: which may not be so. Then if
    pages are later swapped out from dst_mm, swapoff won't be able to find
    where to replace them.

    There's already a !non_swap_entry() test for stats: move that up before
    the swap_duplicate() and the addition to mmlist.

    Signed-off-by: Hugh Dickins
    Cc: Kelley Nielsen
    Cc: [2.6.18+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
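
    A sketch of the reordered swap-entry branch in copy_one_pte()
    (simplified; migration and hwpoison handling is left out):

        swp_entry_t entry = pte_to_swp_entry(pte);

        if (likely(!non_swap_entry(entry))) {
                if (swap_duplicate(entry) < 0)
                        return entry.val;

                /* Only genuine swap entries put dst_mm on swapoff's mmlist,
                 * next to src_mm. */
                if (unlikely(list_empty(&dst_mm->mmlist))) {
                        spin_lock(&mmlist_lock);
                        if (list_empty(&dst_mm->mmlist))
                                list_add(&dst_mm->mmlist,
                                         &src_mm->mmlist);
                        spin_unlock(&mmlist_lock);
                }
                rss[MM_SWAPENTS]++;
        }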
     

03 Dec, 2014

2 commits

    On some Android devices, a "divide by zero" exception can occur because
    vmpr->scanned can be zero before spin_lock(&vmpr->sr_lock) is taken.

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=88051

    [akpm@linux-foundation.org: neaten]
    Reported-by: ji_ang
    Cc: Anton Vorontsov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
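
    A sketch of the guard (based on the description above), taken at the
    top of the work function before the pressure ratio is computed:

        unsigned long scanned;

        spin_lock(&vmpr->sr_lock);
        scanned = vmpr->scanned;
        if (!scanned) {
                /* nothing scanned yet: bail out, since the pressure
                 * calculation divides by 'scanned' */
                spin_unlock(&vmpr->sr_lock);
                return;
        }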
     
    If a frontswap dup-store fails, the expired page in the backend should
    be invalidated, or it can trigger data corruption. For example:

    1. use zswap as the frontswap backend with the writeback feature
    2. store a swap page (version_1) to entry A: success
    3. dup-store a newer page (version_2) to the same entry A: fail
    4. use __swap_writepage() to write the version_2 page to the swapfile: success
    5. zswap shrinks and writes the version_1 page back to the swapfile
    6. the version_2 page is overwritten by version_1, corrupting the data

    This patch fixes the issue by invalidating the expired data immediately
    when a dup-store failure occurs.

    Signed-off-by: Weijie Yang
    Cc: Konrad Rzeszutek Wilk
    Cc: Seth Jennings
    Cc: Dan Streetman
    Cc: Minchan Kim
    Cc: Bob Liu
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang
     

14 Nov, 2014

12 commits

    When memory is hot-added, all of it is in the offline state. So clear
    all zones' present_pages, because they will be updated in online_pages()
    and offline_pages(). Otherwise, /proc/zoneinfo shows corrupt values:

    When the memory of node2 is offline:

    # cat /proc/zoneinfo
    ......
    Node 2, zone Movable
    ......
    spanned 8388608
    present 8388608
    managed 0

    When we online memory on node2:

    # cat /proc/zoneinfo
    ......
    Node 2, zone Movable
    ......
    spanned 8388608
    present 16777216
    managed 8388608

    Signed-off-by: Tang Chen
    Reviewed-by: Yasuaki Ishimatsu
    Cc: [3.16+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
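
    A sketch of the idea (helper name chosen for illustration): when a node
    is hot-added, walk its zones and zero the present page counts;
    online_pages()/offline_pages() will account them properly later.

        static void reset_node_present_pages(pg_data_t *pgdat)
        {
                struct zone *z;

                for (z = pgdat->node_zones;
                     z < pgdat->node_zones + MAX_NR_ZONES; z++)
                        z->present_pages = 0;

                pgdat->node_present_pages = 0;
        }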
     
  • In free_area_init_core(), zone->managed_pages is set to an approximate
    value for lowmem, and will be adjusted when the bootmem allocator frees
    pages into the buddy system.

    But free_area_init_core() is also called by hotadd_new_pgdat() when
    hot-adding memory. As a result, zone->managed_pages of the newly added
    node's pgdat is set to an approximate value in the very beginning.

    Even if the memory on that node has not been onlined,
    /sys/device/system/node/nodeXXX/meminfo reports a wrong value:

    hot-add node2 (memory not onlined)
    cat /sys/device/system/node/node2/meminfo
    Node 2 MemTotal: 33554432 kB
    Node 2 MemFree: 0 kB
    Node 2 MemUsed: 33554432 kB
    Node 2 Active: 0 kB

    This patch fixes the problem by resetting the node's managed pages to 0
    after hot-adding a new node.

    1. Move reset_managed_pages_done from reset_node_managed_pages() to
    reset_all_zones_managed_pages()
    2. Make reset_node_managed_pages() non-static
    3. Call reset_node_managed_pages() in hotadd_new_pgdat() after pgdat
    is initialized

    Signed-off-by: Tang Chen
    Signed-off-by: Yasuaki Ishimatsu
    Cc: [3.16+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
    One thing I fixed in this patch is freepage accounting. If we clear a
    guard page and link it onto the isolate buddy list, we should not
    increase the freepage count. This patch adds a conditional branch to
    skip the counting in that case. Without this patch, the overcounting
    happens frequently if a guard order is set and CMA is used.

    Another thing fixed in this patch is which page's order gets reset. In
    __free_one_page(), we check whether the buddy page is a guard page or
    not, and if so, we should clear the guard attribute on the buddy page
    and reset its order to 0. But the current code resets the original
    page's order rather than the buddy's. This may not cause any problem,
    because the whole merged page's order is re-assigned soon afterwards,
    but it is better to correct the code.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Gioh Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Several people have reported occasionally seeing processes stuck in
    compact_zone(), even triggering soft lockups, in 3.18-rc2+.

    Testing a revert of commit e14c720efdd7 ("mm, compaction: remember
    position within pageblock in free pages scanner") fixed the issue,
    although the stuck processes do not appear to involve the free scanner.

    Finally, by code inspection, the bug was found in isolate_migratepages()
    which uses a slightly different condition to detect if the migration and
    free scanners have met, than compact_finished(). That has not been a
    problem until commit e14c720efdd7 allowed the free scanner position
    between individual invocations to be in the middle of a pageblock.

    In a relatively rare case, the migration scanner position can end up at
    the beginning of a pageblock, with the free scanner position in the
    middle of the same pageblock. If it's the migration scanner's turn,
    isolate_migratepages() exits immediately (without updating the
    position), while compact_finished() decides to continue compaction,
    resulting in a potentially infinite loop. The system can recover only
    if another process creates enough high-order pages to make the watermark
    checks in compact_finished() pass.

    This patch fixes the immediate problem by bumping the migration
    scanner's position to meet the free scanner in isolate_migratepages(),
    when both are within the same pageblock. This causes compact_finished()
    to terminate properly. A more robust check in compact_finished() is
    planned as a cleanup for better future maintainability.

    Fixes: e14c720efdd73 ("mm, compaction: remember position within pageblock in free pages scanner")
    Signed-off-by: Vlastimil Babka
    Reported-by: P. Christeas
    Tested-by: P. Christeas
    Link: http://marc.info/?l=linux-mm&m=141508604232522&w=2
    Reported-by: Norbert Preining
    Tested-by: Norbert Preining
    Link: https://lkml.org/lkml/2014/11/4/904
    Reported-by: Pavel Machek
    Link: https://lkml.org/lkml/2014/11/7/164
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
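
    A sketch of the "bump" where isolate_migratepages() records the
    migration scanner's restart position (field names are from struct
    compact_control; treat the exact expression as illustrative):

        /* If the pageblock just scanned contains the free scanner's
         * position, advance migrate_pfn all the way to free_pfn so that
         * compact_finished() sees the scanners as having met and
         * terminates compaction. */
        cc->migrate_pfn = (end_pfn <= cc->free_pfn) ? low_pfn : cc->free_pfn;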
     
    Having the test_pages_isolated failure message be a warning confuses
    users into thinking that it is more serious than it really is. In
    reality, if called via CMA, the allocation will be retried, so a single
    test_pages_isolated failure does not prevent allocation from succeeding.

    Demote the warning message to an info message and reformat it so that
    the text "failed" does not appear; instead, a less worrying "PFNs busy"
    is used.

    This message is trivially reproducible on a 10GB x86 machine on 3.16.y
    kernels configured with CONFIG_DMA_CMA.

    Signed-off-by: Michal Nazarewicz
    Cc: Laurent Pinchart
    Cc: Peter Hurley
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Nazarewicz
     
    Unlike SLUB, in SLAB an object does not always start at the beginning
    of the slab. This causes an alignment problem now that slab merging is
    supported by commit 12220dea07f1 ("mm/slab: support slab merge").

    Following is the report from Markos of a failure to boot on Malta with EVA.

    Calibrating delay loop... 19.86 BogoMIPS (lpj=99328)
    pid_max: default: 32768 minimum: 301
    Mount-cache hash table entries: 4096 (order: 0, 16384 bytes)
    Mountpoint-cache hash table entries: 4096 (order: 0, 16384 bytes)
    Kernel bug detected[#1]:
    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.17.0-05639-g12220dea07f1 #1631
    task: 1f04f5d8 ti: 1f050000 task.ti: 1f050000
    epc : 80141190 alloc_unbound_pwq+0x234/0x304
    Not tainted
    ra : 80141184 alloc_unbound_pwq+0x228/0x304
    Process swapper/0 (pid: 1, threadinfo=1f050000, task=1f04f5d8, tls=00000000)
    Call Trace:
    alloc_unbound_pwq+0x234/0x304
    apply_workqueue_attrs+0x11c/0x294
    __alloc_workqueue_key+0x23c/0x470
    init_workqueues+0x320/0x400
    do_one_initcall+0xe8/0x23c
    kernel_init_freeable+0x9c/0x224
    kernel_init+0x10/0x100
    ret_from_kernel_thread+0x14/0x1c
    [ end trace cb88537fdc8fa200 ]
    Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b

    alloc_unbound_pwq() allocates a slab object from the pool_workqueue
    cache. This kmem_cache requires 256-byte alignment, but the current
    merging code doesn't honor that and merges it with kmalloc-256.
    kmalloc-256 requires only cacheline-size alignment, so the above
    failure occurs. On x86, however, kmalloc-256 happens to be aligned to
    256 bytes, so the problem did not show up there.

    To fix this, the patch introduces an alignment mismatch check in
    find_mergeable().

    Signed-off-by: Joonsoo Kim
    Reported-by: Markos Chandras
    Tested-by: Markos Chandras
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
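
    The kind of check added to find_mergeable(), roughly (a sketch based on
    the description above):

        /* Reject a candidate cache whose object size is not a multiple of
         * the requested alignment; merging into it (e.g. pool_workqueue
         * into kmalloc-256) would silently break the 256-byte alignment
         * on SLAB. */
        if (s->size & (align - 1))
                continue;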
     
    The current pageblock isolation logic can isolate each pageblock
    individually. This causes a freepage accounting problem if a freepage
    with pageblock order on an isolated pageblock is merged with another
    freepage on a normal pageblock. We can prevent the merging by
    restricting the max merge order to pageblock order when a freepage is
    on an isolated pageblock.

    A side-effect of this change is that there could be non-merged buddy
    freepages even after pageblock isolation is finished, because undoing
    pageblock isolation just moves freepages from the isolate buddy list to
    the normal buddy list without considering merging. So the patch also
    makes undoing pageblock isolation consider freepage merging. When
    un-isolating, a freepage with order higher than pageblock order and its
    buddy are checked. If they are on a normal pageblock, instead of just
    moving them, we isolate the freepage and free it so that it gets merged.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Yasuaki Ishimatsu
    Cc: Zhang Yanfei
    Cc: Tang Chen
    Cc: Naoya Horiguchi
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Wen Congyang
    Cc: Marek Szyprowski
    Cc: Michal Nazarewicz
    Cc: Laura Abbott
    Cc: Heesub Shin
    Cc: "Aneesh Kumar K.V"
    Cc: Ritesh Harjani
    Cc: Gioh Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
    All the callers of __free_one_page() have similar freepage counting
    logic, so we can move it into __free_one_page(). This reduces lines of
    code and helps future maintenance.

    This is also a preparation step for "mm/page_alloc: restrict max order
    of merging on isolated pageblock", which fixes the freepage counting
    problem for freepages with order higher than pageblock order.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Yasuaki Ishimatsu
    Cc: Zhang Yanfei
    Cc: Tang Chen
    Cc: Naoya Horiguchi
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Wen Congyang
    Cc: Marek Szyprowski
    Cc: Michal Nazarewicz
    Cc: Laura Abbott
    Cc: Heesub Shin
    Cc: "Aneesh Kumar K.V"
    Cc: Ritesh Harjani
    Cc: Gioh Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
    In free_pcppages_bulk(), we use the cached migratetype of a freepage to
    determine which buddy list the freepage will be added to. This
    information is stored when the freepage is added to the pcp list, so if
    isolation of this freepage's pageblock begins after it was stored, the
    cached information can be stale. In other words, it holds the original
    migratetype rather than MIGRATE_ISOLATE.

    There are two problems caused by this stale information.

    One is that we can't keep these freepages from being allocated.
    Although the pageblock is isolated, the freepage will be added to a
    normal buddy list, so it can be allocated without any restriction. The
    other problem is incorrect freepage accounting: freepages on an
    isolated pageblock should not be counted as freepages.

    Following is the code snippet in free_pcppages_bulk().

    /* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
    __free_one_page(page, page_to_pfn(page), zone, 0, mt);
    trace_mm_page_pcpu_drain(page, 0, mt);
    if (likely(!is_migrate_isolate_page(page))) {
            __mod_zone_page_state(zone, NR_FREE_PAGES, 1);
            if (is_migrate_cma(mt))
                    __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1);
    }

    As you can see in the snippet above, the current code already handles
    the second problem, incorrect freepage accounting, by re-fetching the
    pageblock migratetype through is_migrate_isolate_page(page).

    But because this re-fetched information isn't used for
    __free_one_page(), the first problem is not solved. This patch
    addresses that by re-fetching the pageblock migratetype before
    __free_one_page() and using it for __free_one_page().

    In addition to moving this re-fetch up, the patch applies an
    optimization: the migratetype is re-fetched only if there is an
    isolated pageblock. Pageblock isolation is a rare event, so we can
    avoid the re-fetch in the common case.

    This patch also corrects the migratetype in the tracepoint output.

    Signed-off-by: Joonsoo Kim
    Acked-by: Minchan Kim
    Acked-by: Michal Nazarewicz
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Yasuaki Ishimatsu
    Cc: Zhang Yanfei
    Cc: Tang Chen
    Cc: Naoya Horiguchi
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Wen Congyang
    Cc: Marek Szyprowski
    Cc: Laura Abbott
    Cc: Heesub Shin
    Cc: "Aneesh Kumar K.V"
    Cc: Ritesh Harjani
    Cc: Gioh Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
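
    Based on the description, the reworked part of the snippet plausibly
    becomes something like this (not the verbatim diff):

        /* Re-fetch only when some pageblock in the zone is isolated, which
         * is rare, so the common case stays cheap. */
        if (unlikely(has_isolate_pageblock(zone)))
                mt = get_pageblock_migratetype(page);

        /* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
        __free_one_page(page, page_to_pfn(page), zone, 0, mt);
        trace_mm_page_pcpu_drain(page, 0, mt);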
     
    Before describing the bugs themselves, I first explain the definition
    of a freepage.

    1. pages on a buddy list are counted as freepages.
    2. pages on the isolate migratetype buddy list are *not* counted as freepages.
    3. pages on a CMA buddy list are also counted as CMA freepages.

    Now I describe the problems and the related patches.

    Patch 1: There are race conditions when getting the pageblock
    migratetype, resulting in misplacement of freepages on the buddy list,
    an incorrect freepage count and unavailability of freepages.

    Patch 2: Freepages on the pcp list can carry stale cached information
    used to determine which buddy list they should go to. This causes
    misplacement of freepages on the buddy list and an incorrect freepage
    count.

    Patch 4: Merging between freepages on pageblocks of different
    migratetypes causes a freepage accounting problem. This patch fixes it.

    Without patchset [3], the above problems don't happen in my CMA
    allocation test, because the CMA reserved pages aren't used at all, so
    there is no chance for the races above.

    With patchset [3], I did simple CMA allocation test and get below
    result:

    - Virtual machine, 4 cpus, 1024 MB memory, 256 MB CMA reservation
    - run kernel build (make -j16) on background
    - 30 times CMA allocation(8MB * 30 = 240MB) attempts in 5 sec interval
    - Result: more than 5000 freepages are missing from the count

    With patchset [3] and this patchset, no freepage counts are missed, so
    I conclude that the problems are solved.

    These problems also occur in my simple memory offlining test
    environment.

    This patch (of 4):

    There are two paths to reach the core free function of the buddy
    allocator, __free_one_page(): one is free_one_page()->__free_one_page()
    and the other is
    free_hot_cold_page()->free_pcppages_bulk()->__free_one_page(). Each
    path has a race condition causing serious problems. This patch focuses
    on the first type of freepath; the following patch solves the problem
    in the second type.

    In the first type of freepath, we get the migratetype of the freeing
    page without holding the zone lock, so it can be racy. There are two
    cases of this race.

    1. pages are added to the isolate buddy list after restoring the
    original migratetype

    CPU1                                          CPU2

    get migratetype => return MIGRATE_ISOLATE
    call free_one_page() with MIGRATE_ISOLATE

                                                  grab the zone lock
                                                  unisolate pageblock
                                                  release the zone lock

    grab the zone lock
    call __free_one_page() with MIGRATE_ISOLATE
    freepage goes onto the isolate buddy list,
    although the pageblock is already unisolated

    This may cause two problems. One is that we can't use this page anymore
    until the next isolation attempt on this pageblock, because the
    freepage is on the isolate buddy list. The other is that the freepage
    accounting could be wrong due to merging between different buddy lists:
    freepages on the isolate buddy list aren't counted as freepages, but
    ones on the normal buddy list are. If a merge happens, a buddy freepage
    on the normal buddy list is inevitably moved to the isolate buddy list
    without any freepage accounting, so the count can become incorrect.

    2. pages are added to the normal buddy list while the pageblock is
    isolated. This is similar to the case above.

    This may also cause two problems. One is that we can't keep these
    freepages from being allocated: although the pageblock is isolated, the
    freepage would be added to the normal buddy list, so it could be
    allocated without any restriction. The other problem is the same as in
    case 1, that is, incorrect freepage accounting.

    This race condition can be prevented by checking the migratetype again
    while holding the zone lock. Because that is a somewhat heavy operation
    and it isn't needed in the common case, we want to avoid rechecking as
    much as possible. So this patch introduces a new variable,
    nr_isolate_pageblock, in struct zone to check whether there is an
    isolated pageblock. With this, we can avoid re-checking the migratetype
    in the common case and do it only if there is an isolated pageblock or
    the migratetype is MIGRATE_ISOLATE. This solves the above-mentioned
    problems.

    Changes from v3:
    Add one more check in free_one_page() that checks whether the
    migratetype is MIGRATE_ISOLATE or not. Without this, case 1 above could
    happen.

    Signed-off-by: Joonsoo Kim
    Acked-by: Minchan Kim
    Acked-by: Michal Nazarewicz
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Yasuaki Ishimatsu
    Cc: Zhang Yanfei
    Cc: Tang Chen
    Cc: Naoya Horiguchi
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Wen Congyang
    Cc: Marek Szyprowski
    Cc: Laura Abbott
    Cc: Heesub Shin
    Cc: "Aneesh Kumar K.V"
    Cc: Ritesh Harjani
    Cc: Gioh Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
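
    A sketch of the resulting recheck in free_one_page() (based on the
    description above; not the verbatim diff):

        /* Recheck under the zone lock only if some pageblock is isolated
         * or the cached migratetype already says MIGRATE_ISOLATE. */
        if (unlikely(has_isolate_pageblock(zone) ||
                     is_migrate_isolate(migratetype)))
                migratetype = get_pfnblock_migratetype(page, pfn);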
     
  • Commit 7d49d8868336 ("mm, compaction: reduce zone checking frequency in
    the migration scanner") has a side-effect that changes the iteration
    range calculation. Before the change, block_end_pfn is calculated using
    start_pfn, but now it blindly adds pageblock_nr_pages to the previous
    value.

    This causes isolation_start_pfn to be larger than block_end_pfn when we
    isolate a page with order higher than pageblock order. In that case,
    isolation fails due to an invalid range parameter.

    To prevent this, this patch skips ahead until a proper target pageblock
    is reached. Without this patch, CMA with an order higher than pageblock
    order always fails, but with this patch it succeeds.

    Signed-off-by: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • The branches of the if (i->type & ITER_BVEC) statement in
    iov_iter_single_seg_count() are the wrong way around; if ITER_BVEC is
    clear then we use i->bvec, when we should be using i->iov. This fixes
    it.

    In my case, the symptom that this caused was that a KVM guest doing
    filesystem operations on a virtual disk would result in one of qemu's
    threads on the host going into an infinite loop in
    generic_perform_write(). The loop would hit the copied == 0 case and
    call iov_iter_single_seg_count() to reduce the number of bytes to try
    to process, but because of the error, iov_iter_single_seg_count()
    would just return i->count and the loop made no progress and continued
    forever.

    Cc: stable@vger.kernel.org # 3.16+
    Signed-off-by: Paul Mackerras
    Signed-off-by: Al Viro

    Paul Mackerras
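
    The corrected branches, roughly:

        if (i->type & ITER_BVEC)
                return min(i->count, i->bvec->bv_len - i->iov_offset);
        else
                return min(i->count, i->iov->iov_len - i->iov_offset);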
     

08 Nov, 2014

1 commit

  • Pull xfs fixes from Dave Chinner:
    "This update fixes a warning in the new pagecache_isize_extended() and
    updates some related comments, another fix for zero-range
    misbehaviour, and an unfortunately large set of fixes for regressions
    in the bulkstat code.

    The bulkstat fixes are large but necessary. I wouldn't normally push
    such a rework for a -rcX update, but right now xfsdump can silently
    create incomplete dumps on 3.17 and it's possible that even xfsrestore
    won't notice that the dumps were incomplete. Hence we need to get
    this update into 3.17-stable kernels ASAP.

    In more detail, the refactoring work I committed in 3.17 has exposed a
    major hole in our QA coverage. With both xfsdump (the major user of
    bulkstat) and xfsrestore silently ignoring missing files in the
    dump/restore process, incomplete dumps were going unnoticed if they
    were being triggered. Many of the dump/restore filesets were so small
    that they didn't even have a chance of triggering the loop iteration
    bugs we introduced in 3.17, so we didn't exercise the code
    sufficiently, either.

    We have already taken steps to improve QA coverage in xfstests to
    avoid this happening again, and I've done a lot of manual verification
    of dump/restore on very large data sets (tens of millions of inodes)
    over the past week to verify that this patch set results in bulkstat behaving
    the same way as it does on 3.16.

    Unfortunately, the fixes are not exactly simple - in tracking down the
    problem, historic API warts were discovered (e.g. xfsdump has been
    working around a 20 year old bug in the bulkstat API for the past 10
    years) and so that complicated the process of diagnosing and fixing
    the problems. i.e. we had to fix bugs in the code as well as
    discover and re-introduce the userspace visible API bugs that we
    unwittingly "fixed" in 3.17 that xfsdump relied on to work correctly.

    Summary:

    - incorrect warnings about i_mutex locking in pagecache_isize_extended()
    and updates comments to match expected locking
    - another zero-range bug fix for stray file size updates
    - a bunch of fixes for regression in the bulkstat code introduced in
    3.17"

    * tag 'xfs-for-linus-3.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
    xfs: track bulkstat progress by agino
    xfs: bulkstat error handling is broken
    xfs: bulkstat main loop logic is a mess
    xfs: bulkstat chunk-formatter has issues
    xfs: bulkstat chunk formatting cursor is broken
    xfs: bulkstat btree walk doesn't terminate
    mm: Fix comment before truncate_setsize()
    xfs: rework zero range to prevent invalid i_size updates
    mm: Remove false WARN_ON from pagecache_isize_extended()
    xfs: Check error during inode btree iteration in xfs_bulkstat()
    xfs: bulkstat doesn't release AGI buffer on error

    Linus Torvalds
     

07 Nov, 2014

1 commit

  • XFS doesn't always hold i_mutex when calling truncate_setsize() and it
    uses a different lock to serialize truncates and writes. So fix the
    comment before truncate_setsize().

    Reported-by: Jan Beulich
    Signed-off-by: Jan Kara
    Signed-off-by: Dave Chinner

    Jan Kara
     

04 Nov, 2014

1 commit

  • Pull CMA and DMA-mapping fixes from Marek Szyprowski:
    "This contains important fixes for recently introduced highmem support
    for default contiguous memory region used for dma-mapping subsystem"

    * 'fixes-for-v3.18' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping:
    mm, cma: make parameters order consistent in func declaration and definition
    mm: cma: Use %pa to print physical addresses
    mm: cma: Ensure that reservations never cross the low/high mem boundary
    mm: cma: Always consider a 0 base address reservation as dynamic
    mm: cma: Don't crash on allocation if CMA area can't be activated

    Linus Torvalds
     

30 Oct, 2014

11 commits

  • The WARN_ON checking whether i_mutex is held in
    pagecache_isize_extended() was wrong because some filesystems (e.g.
    XFS) use different locks for serialization of truncates / writes. So
    just remove the check.

    Signed-off-by: Jan Kara
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Jan Kara
     
    If CONFIG_BALLOON_COMPACTION=n, balloon_page_insert() does not link
    pages with the balloon and doesn't set the PagePrivate flag; as a
    result, balloon_page_dequeue() cannot get any pages because it thinks
    that all of them are isolated. Without balloon compaction nobody can
    isolate ballooned pages, so it's safe to remove this check.

    Fixes: d6d86c0a7f8d ("mm/balloon_compaction: redesign ballooned pages management").
    Signed-off-by: Konstantin Khlebnikov
    Reported-by: Matt Mullins
    Cc: [3.17]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
    The SLUB allocator merges caches with the same size and alignment, and
    there was a long-standing bug with this behavior:

    - create the cache named "foo"
    - create the cache named "bar" (which is merged with "foo")
    - delete the cache named "foo" (but it stays allocated because "bar"
    uses it)
    - create the cache named "foo" again - it fails because the name "foo"
    is already used

    That bug was fixed in commit 694617474e33 ("slab_common: fix the check
    for duplicate slab names") by not warning on duplicate cache names when
    the SLUB subsystem is used.

    Recently, cache merging was implemented for the SLAB subsystem too, in
    commit 12220dea07f1 ("mm/slab: support slab merge"). Therefore we need
    to stop checking for duplicate names for the SLAB subsystem as well.

    This patch fixes the bug by removing the check.

    Signed-off-by: Mikulas Patocka
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     
  • page_remove_rmap() has too many branches on PageAnon() and is hard to
    follow. Move the file part into a separate function.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API") changed
    page migration to uncharge the old page right away. The page is locked,
    unmapped, truncated, and off the LRU, but it could race with writeback
    ending, which then doesn't unaccount the page properly:

    test_clear_page_writeback()                   migration
                                                    wait_on_page_writeback()
      TestClearPageWriteback()
                                                    mem_cgroup_migrate()
                                                      clear PCG_USED
      mem_cgroup_update_page_stat()
        if (PageCgroupUsed(pc))
          decrease memcg pages under writeback

      release pc->mem_cgroup->move_lock

    The per-page statistics interface is heavily optimized to avoid a
    function call and a lookup_page_cgroup() in the file unmap fast path,
    which means it doesn't verify whether a page is still charged before
    clearing PageWriteback() and it has to do it in the stat update later.

    Rework it so that it looks up the page's memcg once at the beginning of
    the transaction and then uses it throughout. The charge will be
    verified before clearing PageWriteback() and migration can't uncharge
    the page as long as that is still set. The RCU lock will protect the
    memcg past uncharge.

    As far as losing the optimization goes, the following test results are
    from a microbenchmark that maps, faults, and unmaps a 4GB sparse file
    three times in a nested fashion, so that there are two negative passes
    that don't account but still go through the new transaction overhead.
    There is no actual difference:

    old: 33.195102545 seconds time elapsed ( +- 0.01% )
    new: 33.199231369 seconds time elapsed ( +- 0.03% )

    The time spent in page_remove_rmap()'s callees still adds up to the
    same, but the time spent in the function itself seems reduced:

    # Children Self Command Shared Object Symbol
    old: 0.12% 0.11% filemapstress [kernel.kallsyms] [k] page_remove_rmap
    new: 0.12% 0.08% filemapstress [kernel.kallsyms] [k] page_remove_rmap

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: [3.17.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • A follow-up patch would have changed the call signature. To save the
    trouble, just fold it instead.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: [3.17.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When hot adding the same memory after hot removal, the following
    messages are shown:

    WARNING: CPU: 20 PID: 6 at mm/page_alloc.c:4968 free_area_init_node+0x3fe/0x426()
    ...
    Call Trace:
    dump_stack+0x46/0x58
    warn_slowpath_common+0x81/0xa0
    warn_slowpath_null+0x1a/0x20
    free_area_init_node+0x3fe/0x426
    hotadd_new_pgdat+0x90/0x110
    add_memory+0xd4/0x200
    acpi_memory_device_add+0x1aa/0x289
    acpi_bus_attach+0xfd/0x204
    acpi_bus_attach+0x178/0x204
    acpi_bus_scan+0x6a/0x90
    acpi_device_hotplug+0xe8/0x418
    acpi_hotplug_work_fn+0x1f/0x2b
    process_one_work+0x14e/0x3f0
    worker_thread+0x11b/0x510
    kthread+0xe1/0x100
    ret_from_fork+0x7c/0xb0

    The detailed explanation is as follows:

    When hot removing memory, the pgdat is set to 0 in try_offline_node().
    But if the pgdat was allocated by the bootmem allocator, the clearing
    step is skipped.

    And when hot adding the same memory, the stale (uncleared) pgdat is
    reused. But free_area_init_node() checks whether the pgdat is set to
    zero, and as a result it hits the WARN_ON().

    This patch clears pgdat which is allocated by bootmem allocator in
    try_offline_node().

    Signed-off-by: Yasuaki Ishimatsu
    Cc: Zhang Zhen
    Cc: Wang Nan
    Cc: Tang Chen
    Reviewed-by: Toshi Kani
    Cc: Dave Hansen
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasuaki Ishimatsu
     
  • If an anonymous mapping is not allowed to fault thp memory and then
    madvise(MADV_HUGEPAGE) is used after fault, khugepaged will never
    collapse this memory into thp memory.

    This occurs because the madvise(2) handler for thp, hugepage_madvise(),
    clears VM_NOHUGEPAGE on the stack and it isn't stored in vma->vm_flags
    until the final action of madvise_behavior(). This causes the
    khugepaged_enter_vma_merge() to be a no-op in hugepage_madvise() when
    the vma had previously had VM_NOHUGEPAGE set.

    Fix this by passing the correct vma flags to the khugepaged mm slot
    handler. There's no chance khugepaged can run on this vma until after
    madvise_behavior() returns since we hold mm->mmap_sem.

    It would be possible to clear VM_NOHUGEPAGE directly from vma->vm_flags
    in hugepage_madvise(), but I didn't want to introduce special-case
    behavior into madvise_behavior(). I think it's best to just let it
    always set vma->vm_flags itself.

    Signed-off-by: David Rientjes
    Reported-by: Suleiman Souhlal
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
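
    A sketch of the resulting MADV_HUGEPAGE handling in hugepage_madvise()
    (based on the description; khugepaged_enter_vma_merge() gains a flags
    parameter):

        *vm_flags &= ~VM_NOHUGEPAGE;
        *vm_flags |= VM_HUGEPAGE;
        /* register with khugepaged using the new flags, not the stale
         * vma->vm_flags that madvise_behavior() has not yet updated */
        if (unlikely(khugepaged_enter_vma_merge(vma, *vm_flags)))
                return -ENOMEM;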
     
  • Compound page should be freed by put_page() or free_pages() with correct
    order. Not doing so will cause tail pages leaked.

    The compound order can be obtained by compound_order() or use
    HPAGE_PMD_ORDER in our case. Some people would argue the latter is
    faster but I prefer the former which is more general.

    This bug was observed not just on our servers (the worst case we saw is
    11G leaked on a 48G machine) but also on our workstations running an
    Ubuntu-based distro.

    $ cat /proc/vmstat | grep thp_zero_page_alloc
    thp_zero_page_alloc 55
    thp_zero_page_alloc_failed 0

    This means there is (thp_zero_page_alloc - 1) * (2M - 4K) memory leaked.

    Fixes: 97ae17497e99 ("thp: implement refcounting for huge zero page")
    Signed-off-by: Yu Zhao
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Bob Liu
    Cc: [3.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yu Zhao
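
    A sketch of the fix in the huge zero page shrinker (based on the
    description above):

        struct page *zero_page = xchg(&huge_zero_page, NULL);

        /* free with the compound order so the tail pages are returned too,
         * instead of __free_page(), which leaks (2M - 4K) per zero page */
        __free_pages(zero_page, compound_order(zero_page));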
     
    Commit edc2ca612496 ("mm, compaction: move pageblock checks up from
    isolate_migratepages_range()") commonizes the isolate_migratepages
    variants and makes them use isolate_migratepages_block().

    isolate_migratepages_block() can stop execution when enough pages are
    isolated, but there is no code in isolate_migratepages_range() to
    handle this case. As a result, even if isolate_migratepages_block()
    returns prematurely without checking all pages in the range, it is
    called repeatedly on the following pageblocks and some pages in the
    previous range are never checked. CMA therefore fails frequently.

    To fix this problem, this patch lets isolate_migratepages_range() know
    when enough pages have been isolated and stops the isolation in that
    case.

    Note that isolate_migratepages() has no such problem, because, it always
    stops the isolation after just one call of isolate_migratepages_block().

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: David Rientjes
    Cc: Minchan Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
    Commit ff7ee93f4715 ("cgroup/kmemleak: Annotate alloc_page() for cgroup
    allocations") introduces kmemleak_alloc() for alloc_page_cgroup(), but
    the corresponding kmemleak_free() is missing, which causes kmemleak to
    be wrongly disabled after memory offlining. A log is pasted at the end
    of this commit message.

    This patch adds kmemleak_free() to free_page_cgroup(). During page
    offlining, it removes the corresponding entries from the kmemleak
    rbtree. After that, the freed memory can be allocated again by other
    subsystems without killing kmemleak.

    bash # for x in 1 2 3 4; do echo offline > /sys/devices/system/memory/memory$x/state ; sleep 1; done ; dmesg | grep leak

    Offlined Pages 32768
    kmemleak: Cannot insert 0xffff880016969000 into the object search tree (overlaps existing)
    CPU: 0 PID: 412 Comm: sleep Not tainted 3.17.0-rc5+ #86
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    Call Trace:
    dump_stack+0x46/0x58
    create_object+0x266/0x2c0
    kmemleak_alloc+0x26/0x50
    kmem_cache_alloc+0xd3/0x160
    __sigqueue_alloc+0x49/0xd0
    __send_signal+0xcb/0x410
    send_signal+0x45/0x90
    __group_send_sig_info+0x13/0x20
    do_notify_parent+0x1bb/0x260
    do_exit+0x767/0xa40
    do_group_exit+0x44/0xa0
    SyS_exit_group+0x17/0x20
    system_call_fastpath+0x16/0x1b

    kmemleak: Kernel memory leak detector disabled
    kmemleak: Object 0xffff880016900000 (size 524288):
    kmemleak: comm "swapper/0", pid 0, jiffies 4294667296
    kmemleak: min_count = 0
    kmemleak: count = 0
    kmemleak: flags = 0x1
    kmemleak: checksum = 0
    kmemleak: backtrace:
    log_early+0x63/0x77
    kmemleak_alloc+0x4b/0x50
    init_section_page_cgroup+0x7f/0xf5
    page_cgroup_init+0xc5/0xd0
    start_kernel+0x333/0x408
    x86_64_start_reservations+0x2a/0x2c
    x86_64_start_kernel+0xf5/0xfc

    Fixes: ff7ee93f4715 (cgroup/kmemleak: Annotate alloc_page() for cgroup allocations)
    Signed-off-by: Wang Nan
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Steven Rostedt
    Cc: [3.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Nan
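
    A sketch of the fix in free_page_cgroup() (assuming the non-vmalloc
    branch; table_size stands for the size computed at allocation time):

        /* Mirror the kmemleak_alloc() done when the page_cgroup array was
         * allocated, so offlining doesn't leave a stale entry that later
         * overlaps a new allocation and kills kmemleak. */
        kmemleak_free(addr);
        free_pages_exact(addr, table_size);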
     

29 Oct, 2014

1 commit

  • When unmapping a range of pages in zap_pte_range, the page being
    unmapped is added to an mmu_gather_batch structure for asynchronous
    freeing. If we run out of space in the batch structure before the range
    has been completely unmapped, then we break out of the loop, force a
    TLB flush and free the pages that we have batched so far. If there are
    further pages to unmap, then we resume the loop where we left off.

    Unfortunately, we forget to update addr when we break out of the loop,
    which causes us to truncate the range being invalidated as the end
    address is exclusive. When we re-enter the loop at the same address, the
    page has already been freed and the pte_present test will fail, meaning
    that we do not reconsider the address for invalidation.

    This patch fixes the problem by incrementing addr by PAGE_SIZE before
    breaking out of the loop on batch failure.

    Signed-off-by: Will Deacon
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Will Deacon
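
    A sketch of the fix at the point where zap_pte_range() bails out to
    flush (based on the description above):

        if (force_flush) {
                /* account for the pte we just handled before breaking out,
                 * so the resumed loop and the TLB invalidation range
                 * include it */
                addr += PAGE_SIZE;
                break;
        }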
     

27 Oct, 2014

2 commits

  • Casting physical addresses to unsigned long and using %lu truncates the
    values on systems where physical addresses are larger than 32 bits. Use
    %pa and get rid of the cast instead.

    Signed-off-by: Laurent Pinchart
    Acked-by: Michal Nazarewicz
    Acked-by: Geert Uytterhoeven
    Signed-off-by: Marek Szyprowski

    Laurent Pinchart
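
    For example, instead of casting a phys_addr_t for %lu, print it via
    %pa, which takes a pointer to the value and is correct on both 32- and
    64-bit physical address configurations (helper name invented for the
    illustration):

        static void report_region(phys_addr_t base, phys_addr_t size)
        {
                /* %pa prints the full phys_addr_t, unlike a cast to
                 * unsigned long, which truncates on 32-bit systems with
                 * wider physical addresses (e.g. LPAE). */
                pr_info("CMA: reserved %pa bytes at %pa\n", &size, &base);
        }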
     
  • Commit 95b0e655f914 ("ARM: mm: don't limit default CMA region only to
    low memory") extended CMA memory reservation to allow usage of high
    memory. It relied on commit f7426b983a6a ("mm: cma: adjust address limit
    to avoid hitting low/high memory boundary") to ensure that the reserved
    block never crossed the low/high memory boundary. While the
    implementation correctly lowered the limit, it failed to consider the
    case where the base..limit range crossed the low/high memory boundary
    with enough space on each side to reserve the requested size on either
    low or high memory.

    Rework the base and limit adjustment to fix the problem. The function
    now starts by rejecting the reservation altogether for fixed
    reservations that cross the boundary, tries to reserve from high memory
    first and then falls back to low memory.

    Signed-off-by: Laurent Pinchart
    Signed-off-by: Marek Szyprowski

    Laurent Pinchart