27 Nov, 2012

2 commits

  • Commit 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC
    reserves are low and swap is backed by network storage") introduced a
    check for fatal signals after a process gets throttled for network
    storage. The intention was that a process which got killed while
    throttled should not trigger the OOM killer. As pointed out by
    Minchan Kim and David Rientjes, this check is in the wrong place and too
    broad. If a system is in an OOM situation and a process is exiting, it
    can loop in __alloc_pages_slowpath(), calling direct reclaim in a
    loop. As the fatal signal is pending, direct reclaim returns 1 as if it
    were making forward progress, and the process can effectively deadlock.

    This patch moves the fatal_signal_pending() check after throttling to
    throttle_direct_reclaim() where it belongs. If the process is killed
    while throttled, it will return immediately without direct reclaim
    except now it will have TIF_MEMDIE set and will use the PFMEMALLOC
    reserves.

    Minchan pointed out that it may be better to direct reclaim before
    returning, to avoid using the reserves: there may be pages that could
    easily be reclaimed, which would avoid dipping into the reserves.
    However, we do no such targeted reclaim and there is no guarantee that
    suitable pages are available. As this throttling is expected to happen
    when swap-over-NFS is used, there is a possibility that the process
    will instead swap, which may allocate network buffers from the
    PFMEMALLOC reserves. Hence, in the swap-over-NFS case where a process
    can be throttled and killed, it can either use the reserves to exit or
    potentially use them to swap a few pages and then exit. This patch
    takes the option of using the reserves if necessary to allow the
    process to exit quickly.
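
    A rough sketch of the shape of the change (simplified; the real
    throttle_direct_reclaim() also takes the allocation context and decides
    whether throttling applies at all):

    /*
     * Sketch: check for a fatal signal only once the process has actually
     * been throttled.  A killed task then skips direct reclaim and exits
     * quickly using the PFMEMALLOC reserves.
     */
    static bool throttle_direct_reclaim(pg_data_t *pgdat)
    {
            wait_event_killable(pgdat->pfmemalloc_wait,
                                pfmemalloc_watermark_ok(pgdat));

            return fatal_signal_pending(current);
    }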

    If this patch passes review it should be considered a -stable candidate
    for 3.6.

    Signed-off-by: Mel Gorman
    Cc: David Rientjes
    Cc: Luigi Semenzato
    Cc: Dan Magenheimer
    Cc: KOSAKI Motohiro
    Cc: Sonny Rao
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • With "mm: vmscan: scale number of pages reclaimed by reclaim/compaction
    based on failures" reverted, Zdenek Kabelac reported the following

    Hmm, so it's just took longer to hit the problem and observe
    kswapd0 spinning on my CPU again - it's not as endless like before -
    but still it easily eats minutes - it helps to turn off Firefox
    or TB (memory hungry apps) so kswapd0 stops soon - and restart
    those apps again. (And I still have like >1GB of cached memory)

    kswapd0 R running task 0 30 2 0x00000000
    Call Trace:
    preempt_schedule+0x42/0x60
    _raw_spin_unlock+0x55/0x60
    put_super+0x31/0x40
    drop_super+0x22/0x30
    prune_super+0x149/0x1b0
    shrink_slab+0xba/0x510

    The sysrq+m indicates the system has no swap so it'll never reclaim
    anonymous pages as part of reclaim/compaction. That is one part of the
    problem but not the root cause as file-backed pages could also be
    reclaimed.

    The likely underlying problem is that kswapd is woken up or kept awake
    for each THP allocation request in the page allocator slow path.

    If compaction fails for the requesting process then compaction will be
    deferred for a time and direct reclaim is avoided. However, if there
    is a storm of THP requests that are simply rejected, it will still be
    the case that kswapd is awake for a prolonged period of time as
    pgdat->kswapd_max_order is updated each time. This is noticed by the
    main kswapd() loop and it will not call kswapd_try_to_sleep(). Instead
    it will loop, shrinking a small number of pages and calling
    shrink_slab() on each iteration.

    The temptation is to supply a patch that checks if kswapd was woken for
    THP and if so ignore pgdat->kswapd_max_order but it'll be a hack and not
    backed up by proper testing. As 3.7 is very close to release and this
    is not a bug we should release with, a safer path is to revert "mm:
    remove __GFP_NO_KSWAPD" for now and revisit it with the view to ironing
    out the balance_pgdat() logic in general.

    Signed-off-by: Mel Gorman
    Cc: Zdenek Kabelac
    Cc: Seth Jennings
    Cc: Valdis Kletnieks
    Cc: Jiri Slaby
    Cc: Rik van Riel
    Cc: Robert Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

22 Nov, 2012

1 commit

  • There have been some 3.7-rc reports of vm issues, including some kswapd
    bugs and, more importantly, some memory "leaks":

    http://www.spinics.net/lists/linux-mm/msg46187.html
    https://bugzilla.kernel.org/show_bug.cgi?id=50181

    Commit 1fb3f8ca0e92 ("mm: compaction: capture a suitable high-order page
    immediately when it is made available") took split_free_page() and
    reused it for the compaction code. It does something curious with
    capture_free_page() (previously known as split_free_page()):

    int capture_free_page(struct page *page, int alloc_order,
    ...
    __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));

    - /* Split into individual pages */
    - set_page_refcounted(page);
    - split_page(page, order);
    + if (alloc_order != order)
    + expand(zone, page, alloc_order, order,
    + &zone->free_area[order], migratetype);

    Note that expand() puts the pages _back_ in the allocator, but it does
    not bump NR_FREE_PAGES. We "return" 'alloc_order' worth of pages, but
    we accounted for removing 'order' in the __mod_zone_page_state() call.

    For the old split_page()-style use (order==alloc_order) the bug will not
    trigger. But, when called from the compaction code where we
    occasionally get a larger page out of the buddy allocator than we need,
    we will run into this.

    This patch simply changes the NR_FREE_PAGES manipulation to the correct
    'alloc_order' instead of 'order'.
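
    That is, the accounting line in the diff context above becomes:

    /* account only for the 'alloc_order' pages actually handed out;
     * expand() returns the remainder to the free lists untouched */
    __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << alloc_order));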

    I've been able to repeatedly trigger this in my testing environment.
    The amount "leaked" very closely tracks the imbalance I see in buddy
    pages vs. NR_FREE_PAGES. I have confirmed that this patch fixes the
    imbalance.

    Signed-off-by: Dave Hansen
    Acked-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

17 Nov, 2012

10 commits

  • Revert commit 7f1290f2f2a4 ("mm: fix-up zone present pages")

    That patch tried to fix an issue when calculating zone->present_pages,
    but it caused a regression on 32bit systems with HIGHMEM. With that
    change, reset_zone_present_pages() resets all zone->present_pages to
    zero, and fixup_zone_present_pages() is called to recalculate
    zone->present_pages when the boot allocator frees core memory pages into
    the buddy allocator. Because highmem pages are not freed by the bootmem
    allocator, all highmem zones' present_pages become zero.

    Various options for improving the situation are being discussed but for
    now, let's return to the 3.6 code.

    Cc: Jianguo Wu
    Cc: Jiang Liu
    Cc: Petr Tesarik
    Cc: "Luck, Tony"
    Cc: Mel Gorman
    Cc: Yinghai Lu
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Acked-by: David Rientjes
    Tested-by: Chris Clayton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Under a particular load on one machine, I have hit shmem_evict_inode()'s
    BUG_ON(inode->i_blocks), enough times to narrow it down to a particular
    race between swapout and eviction.

    It comes from the "if (freed > 0)" asymmetry in shmem_recalc_inode(),
    and the lack of coherent locking between mapping's nrpages and shmem's
    swapped count. There's a window in shmem_writepage(), between lowering
    nrpages in shmem_delete_from_page_cache() and then raising swapped
    count, when the freed count appears to be +1 when it should be 0, and
    then the asymmetry stops it from being corrected with -1 before hitting
    the BUG.

    One answer is coherent locking: using tree_lock throughout, without
    info->lock; reasonable, but the raw_spin_lock in percpu_counter_add() on
    used_blocks makes that messier than expected. Another answer may be a
    further effort to eliminate the weird shmem_recalc_inode() altogether,
    but previous attempts at that failed.

    So far undecided, but for now change the BUG_ON to WARN_ON: in usual
    circumstances it remains a useful consistency check.
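
    In other words, the final assertion in shmem_evict_inode() simply
    becomes non-fatal:

    WARN_ON(inode->i_blocks);       /* was BUG_ON(inode->i_blocks) */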

    Signed-off-by: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Fuzzing with trinity hit the "impossible" VM_BUG_ON(error) (which Fedora
    has converted to WARNING) in shmem_getpage_gfp():

    WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70()
    Pid: 29795, comm: trinity-child4 Not tainted 3.7.0-rc2+ #49
    Call Trace:
    warn_slowpath_common+0x7f/0xc0
    warn_slowpath_null+0x1a/0x20
    shmem_getpage_gfp+0xa5c/0xa70
    shmem_fault+0x4f/0xa0
    __do_fault+0x71/0x5c0
    handle_pte_fault+0x97/0xae0
    handle_mm_fault+0x289/0x350
    __do_page_fault+0x18e/0x530
    do_page_fault+0x2b/0x50
    page_fault+0x28/0x30
    tracesys+0xe1/0xe6

    Thanks to Johannes for pointing to truncation: free_swap_and_cache()
    only does a trylock on the page, so the page lock we've held since
    before confirming swap is not enough to protect against truncation.

    What cleanup is needed in this case? Just delete_from_swap_cache(),
    which takes care of the memcg uncharge.
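
    A sketch of the resulting error path (simplified from the description
    above; the surrounding retry logic is elided):

    error = shmem_add_to_page_cache(page, mapping, index,
                                    gfp, swp_to_radix_entry(swap));
    if (error)
            /* truncation raced in despite our page lock: undo the
             * swapcache state, including the memcg uncharge */
            delete_from_swap_cache(page);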

    Signed-off-by: Hugh Dickins
    Reported-by: Dave Jones
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • kmap_to_page returns the corresponding struct page for a virtual address
    of an arbitrary mapping. This works by checking whether the address
    falls in the pkmap region and using the pkmap page tables instead of the
    linear mapping if appropriate.

    Unfortunately, the bounds checking means that PKMAP_ADDR(LAST_PKMAP) is
    incorrectly treated as a highmem address and we can end up walking off
    the end of pkmap_page_table and subsequently passing junk to pte_page.

    This patch fixes the bounds check to stay within the pkmap tables.
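
    A sketch of the corrected check: the upper bound must be exclusive,
    because PKMAP_ADDR(LAST_PKMAP) is one past the last pkmap slot.

    struct page *kmap_to_page(void *vaddr)
    {
            unsigned long addr = (unsigned long)vaddr;

            if (addr >= PKMAP_ADDR(0) && addr < PKMAP_ADDR(LAST_PKMAP)) {
                    int i = PKMAP_NR(addr);
                    return pte_page(pkmap_page_table[i]);
            }

            return virt_to_page(addr);
    }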

    Signed-off-by: Will Deacon
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Will Deacon
     
  • Jiri Slaby reported the following:

    (It's an effective revert of "mm: vmscan: scale number of pages
    reclaimed by reclaim/compaction based on failures".) Given kswapd
    had hours of runtime in ps/top output yesterday in the morning
    and after the revert it's now 2 minutes in sum for the last 24h,
    I would say, it's gone.

    The intention of the patch in question was to compensate for the loss of
    lumpy reclaim. Part of the reason lumpy reclaim worked is because it
    aggressively reclaimed pages and this patch was meant to be a sane
    compromise.

    When compaction fails, it gets deferred and both compaction and
    reclaim/compaction are deferred to avoid excessive reclaim. However, since
    commit c654345924f7 ("mm: remove __GFP_NO_KSWAPD"), kswapd is woken up
    each time and continues reclaiming which was not taken into account when
    the patch was developed.

    Attempts to address the problem ended up just changing the shape of the
    problem instead of fixing it. The release window gets closer and while
    a THP allocation failing is not a major problem, kswapd chewing up a lot
    of CPU is.

    This patch reverts commit 83fde0f22872 ("mm: vmscan: scale number of
    pages reclaimed by reclaim/compaction based on failures") and will be
    revisited in the future.

    Signed-off-by: Mel Gorman
    Cc: Zdenek Kabelac
    Tested-by: Valdis Kletnieks
    Cc: Jiri Slaby
    Cc: Rik van Riel
    Cc: Johannes Hirte
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • There's a name leak introduced by commit 91a27b2a7567 ("vfs: define
    struct filename and have getname() return it"). Add the missing
    putname.

    [akpm@linux-foundation.org: cleanup]
    Signed-off-by: Xiaotian Feng
    Reviewed-by: Jeff Layton
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiaotian Feng
     
  • When MEMCG is configured on (even when it's disabled by boot option),
    when adding or removing a page to/from its lru list, the zone pointer
    used for stats updates is nowadays taken from the struct lruvec. (On
    many configurations, calculating zone from page is slower.)

    But we have no code to update all the lruvecs (per zone, per memcg) when
    a memory node is hotadded. Here's an extract from the oops which
    results when running numactl to bind a program to a newly onlined node:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000f60
    IP: __mod_zone_page_state+0x9/0x60
    Pid: 1219, comm: numactl Not tainted 3.6.0-rc5+ #180 Bochs Bochs
    Process numactl (pid: 1219, threadinfo ffff880039abc000, task ffff8800383c4ce0)
    Call Trace:
    __pagevec_lru_add_fn+0xdf/0x140
    pagevec_lru_move_fn+0xb1/0x100
    __pagevec_lru_add+0x1c/0x30
    lru_add_drain_cpu+0xa3/0x130
    lru_add_drain+0x2f/0x40
    ...

    The natural solution might be to use a memcg callback whenever memory is
    hotadded; but that solution has not been scoped out, and it happens that
    we do have an easy location at which to update lruvec->zone. The lruvec
    pointer is discovered either by mem_cgroup_zone_lruvec() or by
    mem_cgroup_page_lruvec(), and both of those do know the right zone.

    So check and set lruvec->zone in those; and remove the inadequate
    attempt to set lruvec->zone from lruvec_init(), which is called before
    NODE_DATA(node) has been allocated in such cases.

    Ah, there was one exception. For no particularly good reason,
    mem_cgroup_force_empty_list() has its own code for deciding lruvec.
    Change it to use the standard mem_cgroup_zone_lruvec() and
    mem_cgroup_get_lru_size() too. In fact it was already safe against such
    an oops (the lru lists in danger could only be empty), but we're better
    proofed against future changes this way.

    I've marked this for stable (3.6) since we introduced the problem in 3.5
    (now closed to stable); but I have no idea if this is the only fix
    needed to get memory hotadd working with memcg in 3.6, and received no
    answer when I enquired twice before.

    Reported-by: Tang Chen
    Signed-off-by: Hugh Dickins
    Acked-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Konstantin Khlebnikov
    Cc: Wen Congyang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • oom_badness() takes a totalpages argument which says how many pages are
    available and it uses it as a base for the score calculation. The value
    is calculated by mem_cgroup_get_limit which considers both limit and
    total_swap_pages (resp. memsw portion of it).

    This is usually correct but since fe35004fbf9e ("mm: avoid swapping out
    with swappiness==0") we do not swap when swappiness is 0 which means
    that we cannot really use up all the totalpages pages. This in turn
    confuses the oom score calculation if the memcg limit is much smaller
    than the available swap, because the used memory (capped by the limit)
    is negligible compared to totalpages, so the resulting score is too
    small if adj!=0 (typically a task with CAP_SYS_ADMIN or a non-zero
    oom_score_adj). The wrong process might be selected as a result.

    The problem can be worked around by checking mem_cgroup_swappiness==0
    and not considering swap at all in such a case.
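
    A sketch of the workaround in mem_cgroup_get_limit() (simplified; the
    capping against the memsw limit is elided):

    u64 limit = res_counter_read_u64(&memcg->res, RES_LIMIT);

    /* do not consider swap space if we cannot swap due to swappiness */
    if (mem_cgroup_swappiness(memcg))
            limit += total_swap_pages << PAGE_SHIFT;

    return limit;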

    Signed-off-by: Michal Hocko
    Acked-by: David Rientjes
    Acked-by: Johannes Weiner
    Acked-by: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • do_wp_page() sets mmun_called if mmun_start and mmun_end were
    initialized and, if so, may call mmu_notifier_invalidate_range_end()
    with these values. This doesn't prevent gcc from emitting a build
    warning though:

    mm/memory.c: In function `do_wp_page':
    mm/memory.c:2530: warning: `mmun_start' may be used uninitialized in this function
    mm/memory.c:2531: warning: `mmun_end' may be used uninitialized in this function

    It's much easier to initialize the variables to impossible values and
    do a simple comparison to determine whether they were initialized,
    which removes the bool entirely.
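
    A sketch of the idea: an empty range can never be a valid invalidation
    range, so it doubles as the "not initialized" marker.

    unsigned long mmun_start = 0;   /* for mmu_notifiers */
    unsigned long mmun_end = 0;     /* for mmu_notifiers */
    ...
    if (mmun_end > mmun_start)
            mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);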

    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
    Iterating over the vma->anon_vma_chain without anon_vma_lock may cause
    a NULL pointer dereference in anon_vma_interval_tree_verify(), because
    a node in the chain might have been removed.

    BUG: unable to handle kernel paging request at fffffffffffffff0
    IP: [] anon_vma_interval_tree_verify+0xc/0xa0
    PGD 4e28067 PUD 4e29067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    CPU 0
    Pid: 9050, comm: trinity-child64 Tainted: G W 3.7.0-rc2-next-20121025-sasha-00001-g673f98e-dirty #77
    RIP: 0010: anon_vma_interval_tree_verify+0xc/0xa0
    Process trinity-child64 (pid: 9050, threadinfo ffff880045f80000, task ffff880048eb0000)
    Call Trace:
    validate_mm+0x58/0x1e0
    vma_adjust+0x635/0x6b0
    __split_vma.isra.22+0x161/0x220
    split_vma+0x24/0x30
    sys_madvise+0x5da/0x7b0
    tracesys+0xe1/0xe6
    RIP anon_vma_interval_tree_verify+0xc/0xa0
    CR2: fffffffffffffff0

    Figured out by Bob Liu.
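
    A sketch of the fix in validate_mm(), assuming the 3.7-era
    vma_lock_anon_vma() helpers: hold the anon_vma lock across the walk so
    nodes cannot be removed underneath it.

    vma_lock_anon_vma(vma);
    list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
            anon_vma_interval_tree_verify(avc);
    vma_unlock_anon_vma(vma);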

    Reported-by: Sasha Levin
    Cc: Bob Liu
    Signed-off-by: Michel Lespinasse
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

09 Nov, 2012

1 commit

    In kswapd(), set current->reclaim_state to NULL before returning, as
    current->reclaim_state holds a reference to a variable on kswapd()'s
    stack.

    In rare cases, while returning from kswapd() during memory offlining,
    __free_slab() and freepages() can access the dangling pointer of
    current->reclaim_state.
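
    That is, on the exit path of kswapd():

    current->reclaim_state = NULL;  /* stop pointing into our stack */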

    Signed-off-by: Takamori Yamaguchi
    Signed-off-by: Aaditya Kumar
    Acked-by: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Takamori Yamaguchi
     

27 Oct, 2012

1 commit

  • Pull x86 fixes from Ingo Molnar:
    "This fixes a couple of nasty page table initialization bugs which were
    causing kdump regressions. A clean rearchitecturing of the code is in
    the works - meanwhile these are reverts that restore the
    best-known-working state of the kernel.

    There's also EFI fixes and other small fixes."

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86, mm: Undo incorrect revert in arch/x86/mm/init.c
    x86: efi: Turn off efi_enabled after setup on mixed fw/kernel
    x86, mm: Find_early_table_space based on ranges that are actually being mapped
    x86, mm: Use memblock memory loop instead of e820_RAM
    x86, mm: Trim memory in memblock to be page aligned
    x86/irq/ioapic: Check for valid irq_cfg pointer in smp_irq_move_cleanup_interrupt
    x86/efi: Fix oops caused by incorrect set_memory_uc() usage
    x86-64: Fix page table accounting
    Revert "x86/mm: Fix the size calculation of mapping tables"
    MAINTAINERS: Add EFI git repository location

    Linus Torvalds
     

26 Oct, 2012

4 commits

  • Commit 957f822a0ab9 ("mm, numa: reclaim from all nodes within reclaim
    distance") caused zone_reclaim_mode to be set for all systems where two
    nodes are within RECLAIM_DISTANCE of each other. This is the opposite
    of what we actually want: zone_reclaim_mode should be set if two nodes
    are sufficiently distant.
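
    A sketch of the corrected logic in init_zone_allows_reclaim() (a hedged
    simplification): nearby nodes go into the reclaim_nodes mask, and only
    a sufficiently distant node turns zone_reclaim_mode on.

    for_each_online_node(i)
            if (node_distance(nid, i) <= RECLAIM_DISTANCE)
                    node_set(i, NODE_DATA(nid)->reclaim_nodes);
            else
                    zone_reclaim_mode = 1;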

    Signed-off-by: David Rientjes
    Reported-by: Julian Wollrath
    Tested-by: Julian Wollrath
    Cc: Hugh Dickins
    Cc: Patrik Kullman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
    Allocating the mmu_notifier with GFP_KERNEL can start swapping when
    available memory is tight, which eventually leads to a deadlock while
    the swap daemon swaps out anonymous pages. It was caused by commit
    e0f3c3f78da29b ("mm/mmu_notifier: init notifier if necessary").

    =================================
    [ INFO: inconsistent lock state ]
    3.7.0-rc1+ #518 Not tainted
    ---------------------------------
    inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
    kswapd0/35 [HC0[0]:SC0[0]:HE1:SE1] takes:
    (&mapping->i_mmap_mutex){+.+.?.}, at: page_referenced+0x9c/0x2e0
    {RECLAIM_FS-ON-W} state was registered at:
    mark_held_locks+0x86/0x150
    lockdep_trace_alloc+0x67/0xc0
    kmem_cache_alloc_trace+0x33/0x230
    do_mmu_notifier_register+0x87/0x180
    mmu_notifier_register+0x13/0x20
    kvm_dev_ioctl+0x428/0x510
    do_vfs_ioctl+0x98/0x570
    sys_ioctl+0x91/0xb0
    system_call_fastpath+0x16/0x1b
    irq event stamp: 825
    hardirqs last enabled at (825): _raw_spin_unlock_irq+0x30/0x60
    hardirqs last disabled at (824): _raw_spin_lock_irq+0x19/0x80
    softirqs last enabled at (0): copy_process+0x630/0x17c0
    softirqs last disabled at (0): (null)
    ...

    Simply back out the above commit, which was a small performance
    optimization.

    Signed-off-by: Gavin Shan
    Reported-by: Andrea Righi
    Tested-by: Andrea Righi
    Cc: Wanpeng Li
    Cc: Andrea Arcangeli
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Cc: Marcelo Tosatti
    Cc: Xiao Guangrong
    Cc: Sagi Grimberg
    Cc: Haggai Eran
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     
    If start_isolate_page_range() fails, unset_migratetype_isolate() has
    already been done inside it, so the caller must not undo the isolation
    a second time.
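
    A sketch of the corresponding caller change (hedged; names as in the
    3.7-era alloc_contig_range() in mm/page_alloc.c):

    ret = start_isolate_page_range(pfn_max_align_down(start),
                                   pfn_max_align_up(end), migratetype);
    if (ret)
            return ret;     /* isolation already undone internally */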

    Signed-off-by: Bob Liu
    Cc: Ni zhan Chen
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
     
    On s390 any write to a page (even from the kernel itself) sets the
    architecture-specific page dirty bit. Thus when a page is written to
    via a buffered write, the HW dirty bit gets set, and when we later map
    and unmap the page, page_remove_rmap() finds the dirty bit and calls
    set_page_dirty().

    Dirtying of a page which shouldn't be dirty can cause all sorts of
    problems to filesystems. The bug we observed in practice is that
    buffers from the page get freed, so when the page gets later marked as
    dirty and writeback writes it, XFS crashes due to an assertion
    BUG_ON(!PagePrivate(page)) in page_buffers() called from
    xfs_count_page_state().

    A similar problem can also happen when a zero_user_segment() call from
    xfs_vm_writepage() (or block_write_full_page() for that matter) sets
    the hardware dirty bit during writeback; later the buffers get freed,
    and then the page is unmapped.

    Fix the issue by ignoring s390 HW dirty bit for page cache pages of
    mappings with mapping_cap_account_dirty(). This is safe because for
    such mappings when a page gets marked as writeable in PTE it is also
    marked dirty in do_wp_page() or do_page_fault(). When the dirty bit is
    cleared by clear_page_dirty_for_io(), the page gets writeprotected in
    page_mkclean(). So pagecache page is writeable if and only if it is
    dirty.
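
    A sketch of the idea in page_remove_rmap() (a hedged simplification of
    the actual patch):

    struct address_space *mapping = page_mapping(page);

    if (mapping && mapping_cap_account_dirty(mapping)) {
            /* PTE tracking keeps these pages dirty iff writeable: clear
             * the stray s390 HW dirty bit, do not dirty the page */
            page_test_and_clear_dirty(page_to_pfn(page), 1);
    } else if ((!PageAnon(page) || PageSwapCache(page)) &&
               page_test_and_clear_dirty(page_to_pfn(page), 1)) {
            set_page_dirty(page);
    }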

    Thanks to Hugh Dickins for pointing out mapping has to have
    mapping_cap_account_dirty() for things to work and proposing a cleaned
    up variant of the patch.

    The patch has survived about two hours of running fsx-linux on tmpfs
    while heavily swapping, and several days of running on our build
    machines where the original problem was triggered.

    Signed-off-by: Jan Kara
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Heiko Carstens
    Cc: [3.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

25 Oct, 2012

1 commit

    We will not map partial pages, so we need to make sure the memblock
    allocator does not hand out those bytes.

    Also, use for_each_mem_pfn_range() to loop over the memory ranges being
    mapped, to keep them consistent.

    Signed-off-by: Yinghai Lu
    Link: http://lkml.kernel.org/r/CAE9FiQVZirvaBMFYRfXMmWEcHbKSicQEHz4VAwUv0xFCk51ZNw@mail.gmail.com
    Acked-by: Jacob Shin
    Signed-off-by: H. Peter Anvin
    Cc:

    Yinghai Lu
     

20 Oct, 2012

5 commits

  • Pull ARM soc fixes from Olof Johansson:
    "A set of fixes and some minor cleanups for -rc2:

    - A series from Arnd that fixes warnings in drivers and other code
      included by ARM defconfigs. Most have been acked by corresponding
      maintainers (and seem quite hard to argue not picking up anyway in
      the few exception cases).
    - A few misc patches from the list for integrator/vt8500/i.MX
    - A batch of fixes to OMAP platforms, fixing:
      - boot problems on beaglebone,
      - regression fixes for local timers
      - clockdomain locking fixes
      - a few boot/sparse warnings
    - For Tegra:
      - Clock rate calculation overflow fix
      - Revert a change that removed timer clocks and a fix for symbol
        name clashes
    - For Renesas:
      - IO accessor / annotation cleanups to remove warnings
    - For Kirkwood/Dove/mvebu:
      - Fixes for device trees for Dove (some minor cleanups, some fixes)
      - Fixes for the mvebu gpio driver
      - Fix build problem for Feroceon due to missing ifdefs
      - Fix lsxl DTS files"

    * tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (31 commits)
    ARM: kirkwood: fix buttons on lsxl boards
    ARM: kirkwood: fix LEDs names for lsxl boards
    ARM: Kirkwood: fix disabling CACHE_FEROCEON_L2
    gpio: mvebu: Add missing breaks in mvebu_gpio_irq_set_type
    ARM: dove: Add crypto engine to DT
    ARM: dove: Remove watchdog from DT
    ARM: dove: Restructure SoC device tree descriptor
    ARM: dove: Fix clock names of sata and gbe
    ARM: dove: Fix tauros2 device tree init
    ARM: dove: Add pcie clock support
    ARM: OMAP2+: Allow kernel to boot even if GPMC fails to reserve memory
    ARM: OMAP: clockdomain: Fix locking on _clkdm_clk_hwmod_enable / disable
    ARM: s3c: mark s3c2440_clk_add as __init_refok
    spi/s3c64xx: use correct dma_transfer_direction type
    ARM: OMAP4: devices: fixup OMAP4 DMIC platform device error message
    ARM: OMAP2+: clock data: Add dev-id for the omap-gpmc dummy fck
    ARM: OMAP: resolve sparse warning concerning debug_card_init()
    ARM: OMAP4: Fix twd_local_timer_register regression
    ARM: tegra: add tegra_timer clock
    ARM: tegra: rename tegra system timer
    ...

    Linus Torvalds
     
  • …nel/git/arm/arm-soc into fixes

    A collection of warning fixes on non-ARM code from Arnd Bergmann:

    * 'testing/driver-warnings' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
    ARM: s3c: mark s3c2440_clk_add as __init_refok
    spi/s3c64xx: use correct dma_transfer_direction type
    pcmcia: sharpsl: don't discard sharpsl_pcmcia_ops
    USB: EHCI: mark ehci_orion_conf_mbus_windows __devinit
    mm/slob: use min_t() to compare ARCH_SLAB_MINALIGN
    SCSI: ARM: make fas216_dumpinfo function conditional
    SCSI: ARM: ncr5380/oak uses no interrupts

    Olof Johansson
     
  • Merge misc fixes from Andrew Morton:
    "Seven fixes"

    * emailed patches from Andrew Morton : (7 patches)
    lib/dma-debug.c: fix __hash_bucket_find()
    mm: compaction: correct the nr_strict va isolated check for CMA
    firmware/memmap: avoid type conflicts with the generic memmap_init()
    pidns: remove recursion from free_pid_ns()
    drivers/video/backlight/lm3639_bl.c: return proper error in lm3639_bled_mode_store() error paths
    kernel/sys.c: fix stack memory content leak via UNAME26
    linux/coredump.h needs asm/siginfo.h

    Linus Torvalds
     
    Thierry reported that the "iron out" patch for isolate_freepages_block()
    had problems: without "mm: compaction: Iron out
    isolate_freepages_block() and isolate_freepages_range() -fix1", the
    strict check is too strict. It's possible that more pages than
    necessary are isolated but the check still fails, and I missed that
    this fix was not picked up before -rc1. This same problem has been
    identified in 3.7-rc1 by Tony Prisk and should be addressed by the
    following patch.

    Signed-off-by: Mel Gorman
    Tested-by: Tony Prisk
    Reported-by: Thierry Reding
    Acked-by: Rik van Riel
    Acked-by: Minchan Kim
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    In commit 0b173bc4daa8 ("mm: kill vma flag VM_CAN_NONLINEAR") we
    replaced the VM_CAN_NONLINEAR test with checking whether the mapping
    has a '->remap_pages()' vm operation, but there is no guarantee that
    the vma even has a vm_ops pointer at all.

    Add the appropriate test for NULL vm_ops.
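
    A sketch of the added guard in sys_remap_file_pages():

    /* check vm_ops itself before chasing ->remap_pages */
    if (!vma->vm_ops || !vma->vm_ops->remap_pages)
            goto out;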

    Reported-by: Sasha Levin
    Cc: Konstantin Khlebnikov
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

17 Oct, 2012

1 commit

  • When reading /proc/pid/numa_maps, it's possible to return the contents of
    the stack where the mempolicy string should be printed if the policy gets
    freed from beneath us.

    This happens because mpol_to_str() may return an error, in which case
    the stack-allocated buffer is printed without ever having been written
    to.

    There are two possible error conditions in mpol_to_str():

    - if the buffer allocated is insufficient for the string to be stored,
    and

    - if the mempolicy has an invalid mode.

    The first error condition is not triggered in any of the callers to
    mpol_to_str(): at least 50 bytes is always allocated on the stack and this
    is sufficient for the string to be written. A future patch should convert
    this into BUILD_BUG_ON() since we know the maximum strlen possible, but
    that's not -rc material.

    The second error condition is possible if a race occurs in dropping a
    reference to a task's mempolicy causing it to be freed during the read().
    The slab poison value is then used for the mode and mpol_to_str() returns
    -EINVAL.

    This race is only possible because get_vma_policy() believes that
    mm->mmap_sem protects task->mempolicy, which isn't true. The exit path
    does not hold mm->mmap_sem when dropping the reference or setting
    task->mempolicy to NULL: it uses task_lock(task) instead.

    Thus, it's required for the caller of a task mempolicy to hold
    task_lock(task) while grabbing the mempolicy and reading it. Callers with
    a vma policy store their mempolicy earlier and can simply increment the
    reference count so it's guaranteed not to be freed.
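
    A sketch of the locking rule this imposes on readers of a task
    mempolicy (hypothetical caller):

    task_lock(task);
    pol = task->mempolicy; /* cannot be freed while task_lock is held */
    mpol_get(pol);         /* take our own reference under the lock */
    task_unlock(task);
    /* ... use pol ... */
    mpol_put(pol);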

    Reported-by: Dave Jones
    Signed-off-by: David Rientjes
    Signed-off-by: Linus Torvalds

    David Rientjes
     

15 Oct, 2012

1 commit

    Certain configurations won't implicitly pull in <linux/pagemap.h>,
    resulting in the following build error:

    mm/huge_memory.c: In function 'release_pte_page':
    mm/huge_memory.c:1697:2: error: implicit declaration of function 'unlock_page' [-Werror=implicit-function-declaration]
    mm/huge_memory.c: In function '__collapse_huge_page_isolate':
    mm/huge_memory.c:1757:3: error: implicit declaration of function 'trylock_page' [-Werror=implicit-function-declaration]
    cc1: some warnings being treated as errors
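
    The fix is to add the explicit include to mm/huge_memory.c:

    #include <linux/pagemap.h>      /* trylock_page(), unlock_page() */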

    Reported-by: David Daney
    Signed-off-by: Ralf Baechle
    Signed-off-by: Linus Torvalds

    Ralf Baechle
     

13 Oct, 2012

3 commits

  • Pull third pile of VFS updates from Al Viro:
    "Stuff from Jeff Layton, mostly. Sanitizing interplay between audit
    and namei, removing a lot of insanity from audit_inode() mess and
    getting things ready for his ESTALE patchset."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    procfs: don't need a PATH_MAX allocation to hold a string representation of an int
    vfs: embed struct filename inside of names_cache allocation if possible
    audit: make audit_inode take struct filename
    vfs: make path_openat take a struct filename pointer
    vfs: turn do_path_lookup into wrapper around struct filename variant
    audit: allow audit code to satisfy getname requests from its names_list
    vfs: define struct filename and have getname() return it
    vfs: unexport getname and putname symbols
    acct: constify the name arg to acct_on
    vfs: allocate page instead of names_cache buffer in mount_block_root
    audit: overhaul __audit_inode_child to accomodate retrying
    audit: optimize audit_compare_dname_path
    audit: make audit_compare_dname_path use parent_len helper
    audit: remove dirlen argument to audit_compare_dname_path
    audit: set the name_len in audit_inode for parent lookups
    audit: add a new "type" field to audit_names struct
    audit: reverse arguments to audit_inode_child
    audit: no need to walk list in audit_inode if name is NULL
    audit: pass in dentry to audit_copy_inode wherever possible
    audit: remove unnecessary NULL ptr checks from do_path_lookup

    Linus Torvalds
     
  • ...and fix up the callers. For do_file_open_root, just declare a
    struct filename on the stack and fill out the .name field. For
    do_filp_open, make it also take a struct filename pointer, and fix up its
    callers to call it appropriately.

    For filp_open, add a variant that takes a struct filename pointer and turn
    filp_open into a wrapper around it.

    Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
     
  • getname() is intended to copy pathname strings from userspace into a
    kernel buffer. The result is just a string in kernel space. It would
    however be quite helpful to be able to attach some ancillary info to
    the string.

    For instance, we could attach some audit-related info to reduce the
    amount of audit-related processing needed. When auditing is enabled,
    we could also call getname() on the string more than once and not
    need to recopy it from userspace.

    This patchset converts the getname()/putname() interfaces to return
    a struct instead of a string. For now, the struct just tracks the
    string in kernel space and the original userland pointer for it.

    Later, we'll add other information to the struct as it becomes
    convenient.
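
    The struct as introduced by this patch:

    struct filename {
            const char              *name;  /* pointer to actual string */
            const __user char       *uptr;  /* original userland pointer */
    };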

    Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
     

12 Oct, 2012

3 commits

  • Pull SLAB fix from Pekka Enberg:
    "This contains a lockdep false positive fix from Jiri Kosina I missed
    from the previous pull request."

    * 'slab/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
    mm, slab: release slab_mutex earlier in kmem_cache_destroy()

    Linus Torvalds
     
  • Pull pile 2 of vfs updates from Al Viro:
    "Stuff in this one - assorted fixes, lglock tidy-up, death to
    lock_super().

    There'll be a VFS pile tomorrow (with patches from Jeff Layton,
    sanitizing getname() and related parts of audit and preparing for
    ESTALE fixes), but I'd rather push the stuff in this one ASAP - some
    of the bugs closed here are quite unpleasant."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: bogus warnings in fs/namei.c
    consitify do_mount() arguments
    lglock: add DEFINE_STATIC_LGLOCK()
    lglock: make the per_cpu locks static
    lglock: remove unused DEFINE_LGLOCK_LOCKDEP()
    MAX_LFS_FILESIZE definition for 64bit needs LL...
    tmpfs,ceph,gfs2,isofs,reiserfs,xfs: fix fh_len checking
    vfs: drop lock/unlock super
    ufs: drop lock/unlock super
    sysv: drop lock/unlock super
    hpfs: drop lock/unlock super
    fat: drop lock/unlock super
    ext3: drop lock/unlock super
    exofs: drop lock/unlock super
    dup3: Return an error when oldfd == newfd.
    fs: handle failed audit_log_start properly
    fs: prevent use after free in auditing when symlink following was denied

    Linus Torvalds
     
  • Pull writeback fixes from Fengguang Wu:
    "Three trivial writeback fixes"

    * 'writeback-for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    CPU hotplug, writeback: Don't call writeback_set_ratelimit() too often during hotplug
    writeback: correct comment for move_expired_inodes()
    backing-dev: use kstrto* in preference to simple_strtoul

    Linus Torvalds
     

10 Oct, 2012

3 commits

  • Commit 1331e7a1bbe1 ("rcu: Remove _rcu_barrier() dependency on
    __stop_machine()") introduced slab_mutex -> cpu_hotplug.lock dependency
    through kmem_cache_destroy() -> rcu_barrier() -> _rcu_barrier() ->
    get_online_cpus().

    Lockdep thinks that this might actually result in ABBA deadlock,
    and reports it as below:

    === [ cut here ] ===
    ======================================================
    [ INFO: possible circular locking dependency detected ]
    3.6.0-rc5-00004-g0d8ee37 #143 Not tainted
    -------------------------------------------------------
    kworker/u:2/40 is trying to acquire lock:
    (rcu_sched_state.barrier_mutex){+.+...}, at: [] _rcu_barrier+0x26/0x1e0

    but task is already holding lock:
    (slab_mutex){+.+.+.}, at: [] kmem_cache_destroy+0x45/0xe0

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #2 (slab_mutex){+.+.+.}:
    [] validate_chain+0x632/0x720
    [] __lock_acquire+0x309/0x530
    [] lock_acquire+0x121/0x190
    [] __mutex_lock_common+0x5c/0x450
    [] mutex_lock_nested+0x3e/0x50
    [] cpuup_callback+0x2f/0xbe
    [] notifier_call_chain+0x93/0x140
    [] __raw_notifier_call_chain+0x9/0x10
    [] _cpu_up+0xba/0x14e
    [] cpu_up+0xbc/0x117
    [] smp_init+0x6b/0x9f
    [] kernel_init+0x147/0x1dc
    [] kernel_thread_helper+0x4/0x10

    -> #1 (cpu_hotplug.lock){+.+.+.}:
    [] validate_chain+0x632/0x720
    [] __lock_acquire+0x309/0x530
    [] lock_acquire+0x121/0x190
    [] __mutex_lock_common+0x5c/0x450
    [] mutex_lock_nested+0x3e/0x50
    [] get_online_cpus+0x37/0x50
    [] _rcu_barrier+0xbb/0x1e0
    [] rcu_barrier_sched+0x10/0x20
    [] rcu_barrier+0x9/0x10
    [] deactivate_locked_super+0x49/0x90
    [] deactivate_super+0x61/0x70
    [] mntput_no_expire+0x127/0x180
    [] sys_umount+0x6e/0xd0
    [] system_call_fastpath+0x16/0x1b

    -> #0 (rcu_sched_state.barrier_mutex){+.+...}:
    [] check_prev_add+0x3de/0x440
    [] validate_chain+0x632/0x720
    [] __lock_acquire+0x309/0x530
    [] lock_acquire+0x121/0x190
    [] __mutex_lock_common+0x5c/0x450
    [] mutex_lock_nested+0x3e/0x50
    [] _rcu_barrier+0x26/0x1e0
    [] rcu_barrier_sched+0x10/0x20
    [] rcu_barrier+0x9/0x10
    [] kmem_cache_destroy+0xd1/0xe0
    [] nf_conntrack_cleanup_net+0xe4/0x110 [nf_conntrack]
    [] nf_conntrack_cleanup+0x2a/0x70 [nf_conntrack]
    [] nf_conntrack_net_exit+0x5e/0x80 [nf_conntrack]
    [] ops_exit_list+0x39/0x60
    [] cleanup_net+0xfb/0x1b0
    [] process_one_work+0x26b/0x4c0
    [] worker_thread+0x12e/0x320
    [] kthread+0x9e/0xb0
    [] kernel_thread_helper+0x4/0x10

    other info that might help us debug this:

    Chain exists of:
    rcu_sched_state.barrier_mutex --> cpu_hotplug.lock --> slab_mutex

    Possible unsafe locking scenario:

    CPU0                                    CPU1
    ----                                    ----
    lock(slab_mutex);
                                            lock(cpu_hotplug.lock);
                                            lock(slab_mutex);
    lock(rcu_sched_state.barrier_mutex);

    *** DEADLOCK ***
    === [ cut here ] ===

    This is actually a false positive. Lockdep has no way of knowing the fact
    that the ABBA can actually never happen, because of special semantics of
    cpu_hotplug.refcount and its handling in cpu_hotplug_begin(); the mutual
    exclusion there is not achieved through mutex, but through
    cpu_hotplug.refcount.

    The "neither cpu_up() nor cpu_down() will proceed past cpu_hotplug_begin()
    until everyone who called get_online_cpus() will call put_online_cpus()"
    semantics is totally invisible to lockdep.

    This patch therefore moves the unlock of slab_mutex so that rcu_barrier()
    is called with it unlocked (see the sketch after the list below). It
    has two advantages:

    - it slightly reduces hold time of slab_mutex; as it's used to protect
    the cachep list, it's not necessary to hold it over kmem_cache_free()
    call any more
    - it silences the lockdep false positive warning, as it avoids lockdep ever
    learning about slab_mutex -> cpu_hotplug.lock dependency
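
    A sketch of the reordering in kmem_cache_destroy() (simplified; the
    error path is elided):

    list_del(&s->list);
    mutex_unlock(&slab_mutex);      /* drop before the RCU grace period */
    if (unlikely(s->flags & SLAB_DESTROY_BY_RCU))
            rcu_barrier();          /* no longer nests under slab_mutex */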

    Reviewed-by: Paul E. McKenney
    Reviewed-by: Srivatsa S. Bhat
    Acked-by: David Rientjes
    Signed-off-by: Jiri Kosina
    Signed-off-by: Pekka Enberg

    Jiri Kosina
     
    Fuzzing with trinity oopsed on the first instruction of
    shmem_fh_to_dentry(),

    u64 inum = fid->raw[2];

    which is unhelpfully reported as at the end of shmem_alloc_inode():

    BUG: unable to handle kernel paging request at ffff880061cd3000
    IP: [] shmem_alloc_inode+0x40/0x40
    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Call Trace:
    [] ? exportfs_decode_fh+0x79/0x2d0
    [] do_handle_open+0x163/0x2c0
    [] sys_open_by_handle_at+0xc/0x10
    [] tracesys+0xe1/0xe6

    Right, tmpfs is being stupid to access fid->raw[2] before validating that
    fh_len includes it: the buffer kmalloc'ed by do_sys_name_to_handle() may
    fall at the end of a page, and the next page not be present.

    But some other filesystems (ceph, gfs2, isofs, reiserfs, xfs) are being
    careless about fh_len too, in fh_to_dentry() and/or fh_to_parent(), and
    could oops in the same way: add the missing fh_len checks to those.
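
    For tmpfs, the missing validation is a short check at the top of
    shmem_fh_to_dentry():

    if (fh_len < 3)         /* fid->raw[2] must actually exist */
            return NULL;
    inum = fid->raw[2];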

    Reported-by: Sasha Levin
    Signed-off-by: Hugh Dickins
    Cc: Al Viro
    Cc: Sage Weil
    Cc: Steven Whitehouse
    Cc: Christoph Hellwig
    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Hugh Dickins
     
  • The definition of ARCH_SLAB_MINALIGN is architecture dependent
    and can be either of type size_t or int. Comparing that value
    with ARCH_KMALLOC_MINALIGN can cause harmless warnings on
    platforms where they are different. Since both are always
    small positive integer numbers, using the size_t type to compare
    them is safe and gets rid of the warning.

    Without this patch, building ARM collie_defconfig results in:

    mm/slob.c: In function '__kmalloc_node':
    mm/slob.c:431:152: warning: comparison of distinct pointer types lacks a cast [enabled by default]
    mm/slob.c: In function 'kfree':
    mm/slob.c:484:153: warning: comparison of distinct pointer types lacks a cast [enabled by default]
    mm/slob.c: In function 'ksize':
    mm/slob.c:503:153: warning: comparison of distinct pointer types lacks a cast [enabled by default]
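
    The warnings come from comparing an int with a size_t inside the
    type-checking kernel min()/max() macros; forcing a common type with the
    _t variants silences them. A sketch of the pattern (hedged; the exact
    lines in mm/slob.c may differ):

    int align = max_t(size_t, ARCH_KMALLOC_MINALIGN, ARCH_SLAB_MINALIGN);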

    Signed-off-by: Arnd Bergmann
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg

    Arnd Bergmann
     

09 Oct, 2012

4 commits

  • Invalidation sequences are handled in various ways on various
    architectures.

    One way, which sparc64 uses, is to let the set_*_at() functions accumulate
    pending flushes into a per-cpu array. Then the flush_tlb_range() et al.
    calls process the pending TLB flushes.

    In this regime, the __tlb_remove_*tlb_entry() implementations are
    essentially NOPs.

    The canonical PTE zap in mm/memory.c is:

    ptent = ptep_get_and_clear_full(mm, addr, pte,
                                    tlb->fullmm);
    tlb_remove_tlb_entry(tlb, pte, addr);

    With a subsequent tlb_flush_mmu() if needed.

    Mirror this in the THP PMD zapping using:

    orig_pmd = pmdp_get_and_clear(tlb->mm, addr, pmd);
    page = pmd_page(orig_pmd);
    tlb_remove_pmd_tlb_entry(tlb, pmd, addr);

    And we properly accommodate TLB flush mechanisms like the one described
    above.

    Signed-off-by: David S. Miller
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Gerald Schaefer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Miller
     
  • The transparent huge page code passes a PMD pointer in as the third
    argument of update_mmu_cache(), which expects a PTE pointer.

    This never got noticed because X86 implements update_mmu_cache() as a
    macro and thus we don't get any type checking, and X86 is the only
    architecture which supports transparent huge pages currently.

    Before other architectures can support transparent huge pages properly we
    need to add a new interface which will take a PMD pointer as the third
    argument rather than a PTE pointer.
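
    A sketch of the new interface (hedged): architectures such as x86 that
    do nothing in update_mmu_cache() can stub out the PMD flavour the same
    way.

    #define update_mmu_cache_pmd(vma, address, pmd) do { } while (0)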

    [akpm@linux-foundation.org: implement update_mm_cache_pmd() for s390]
    Signed-off-by: David S. Miller
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Gerald Schaefer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Miller
     
  • …YYYYYYYYYYYYYYYY>" warning

    When our x86 box calls __remove_pages(), release_mem_region() shows many
    warnings, and the x86 box cannot unregister its iomem_resource:

    "Trying to free nonexistent resource <XXXXXXXXXXXXXXXX-YYYYYYYYYYYYYYYY>"

    release_mem_region() was changed to be called on each PAGES_PER_SECTION
    chunk by commit de7f0cba9678 ("memory hotplug: release memory regions in
    PAGES_PER_SECTION chunks"), because powerpc registers iomem_resource in
    PAGES_PER_SECTION chunks. But when memory is hot-added on an x86 box,
    iomem_resource is registered per _CRS entry, not per PAGES_PER_SECTION
    chunk, so the x86 box cannot unregister it.

    The patch fixes the problem.

    Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Jiang Liu <liuj97@gmail.com>
    Cc: Len Brown <len.brown@intel.com>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: Paul Mackerras <paulus@samba.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Wen Congyang <wency@cn.fujitsu.com>
    Cc: Dave Hansen <dave@linux.vnet.ibm.com>
    Cc: Nathan Fontenot <nfont@austin.ibm.com>
    Cc: Badari Pulavarty <pbadari@us.ibm.com>
    Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Yasuaki Ishimatsu
     
  • Acked-by: David Rientjes
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton