07 Jan, 2012

4 commits

  • commit a41c58a6665cc995e237303b05db42100b71b65e upstream.

    If the request is to create a non-root group and we fail to meet it, we
    should leave the root unchanged.

    Signed-off-by: Hillf Danton
    Signed-off-by: Hugh Dickins
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hillf Danton
     
  • commit e6f67b8c05f5e129e126f4409ddac6f25f58ffcb upstream.

    lockdep reports a deadlock in jfs because a special inode's rw semaphore
    is taken recursively. The mapping's gfp mask is GFP_NOFS, but is not
    used when __read_cache_page() calls add_to_page_cache_lru().

    Signed-off-by: Dave Kleikamp
    Acked-by: Hugh Dickins
    Acked-by: Al Viro
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dave Kleikamp
     
  • commit ff05b6f7ae762b6eb464183eec994b28ea09f6dd upstream.

    An integer overflow will happen on 64bit archs if task's sum of rss,
    swapents and nr_ptes exceeds (2^31)/1000 value. This was introduced by
    commit

    f755a04 oom: use pte pages in OOM score

    where the oom score computation was divided into several steps and it's no
    longer computed as one expression in unsigned long (rss, swapents and nr_ptes
    are unsigned long), where the result value assigned to points (an int) is in
    the range 1..1000. So there could be an int overflow while computing

    176 points *= 1000;

    and points may end up with a negative value, meaning the oom score for a mem
    hog task will be one:

    196 if (points <= 0)
    197         return 1;
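
    A small userspace sketch of the arithmetic described above (the kernel code
    differs, but the overflow is the same): with points held in an int, any
    rss+swapents+nr_ptes sum above (2^31)/1000 pages wraps once multiplied by
    1000, while a long keeps the correct value.

        #include <stdio.h>

        int main(void)
        {
                /* ~3 million pages: above (2^31)/1000, i.e. a >11GB task */
                unsigned long pages = 3000000UL;

                int points_int = pages;     /* int, as in the buggy code */
                long points_long = pages;   /* long, as in the fix */

                points_int *= 1000;         /* exceeds INT_MAX: wraps (UB) */
                points_long *= 1000;

                printf("int:  %d\n", points_int);   /* negative/garbage */
                printf("long: %ld\n", points_long); /* 3000000000 */
                return 0;
        }
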
    Acked-by: KOSAKI Motohiro
    Acked-by: Oleg Nesterov
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Frantisek Hrbata
     
  • commit 9f57bd4d6dc69a4e3bf43044fa00fcd24dd363e3 upstream.

    per_cpu_ptr_to_phys() incorrectly rounds its result for the non-kmalloc
    case down to the page boundary, which is bogus for any non-page-aligned
    address.

    This affects the only in-tree user of this function - sysfs handler
    for per-cpu 'crash_notes' physical address. The trouble is that the
    crash_notes per-cpu variable is not page-aligned:

    crash_notes = 0xc08e8ed4
    PER-CPU OFFSET VALUES:
    CPU 0: 3711f000
    CPU 1: 37129000
    CPU 2: 37133000
    CPU 3: 3713d000

    So, the per-cpu addresses are:
    crash_notes on CPU 0: f7a07ed4 => phys 36b57ed4
    crash_notes on CPU 1: f7a11ed4 => phys 36b4ded4
    crash_notes on CPU 2: f7a1bed4 => phys 36b43ed4
    crash_notes on CPU 3: f7a25ed4 => phys 36b39ed4

    However, /sys/devices/system/cpu/cpu*/crash_notes says:
    /sys/devices/system/cpu/cpu0/crash_notes: 36b57000
    /sys/devices/system/cpu/cpu1/crash_notes: 36b4d000
    /sys/devices/system/cpu/cpu2/crash_notes: 36b43000
    /sys/devices/system/cpu/cpu3/crash_notes: 36b39000

    As you can see, all values are rounded down to a page
    boundary. Consequently, this is where kexec sets up the NOTE segments,
    and thus where the secondary kernel is looking for them. However, when
    the first kernel crashes, it saves the notes to the unaligned
    addresses, where they are not found.

    Fix it by adding offset_in_page() to the translated page address.
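
    A userspace illustration of the fix, using the CPU 0 values quoted above and
    assuming 4 KiB pages (offset_in_page() here is a local stand-in for the
    kernel macro):

        #include <stdio.h>

        #define PAGE_SIZE               4096UL
        #define offset_in_page(addr)    ((unsigned long)(addr) & (PAGE_SIZE - 1))

        int main(void)
        {
                unsigned long vaddr = 0xf7a07ed4;     /* crash_notes on CPU 0 */
                unsigned long page_phys = 0x36b57000; /* phys addr of its page */

                /* old behaviour: only the page's physical address is returned */
                printf("rounded:   %#lx\n", page_phys);
                /* fixed behaviour: add the in-page offset back */
                printf("corrected: %#lx\n", page_phys + offset_in_page(vaddr));
                return 0;
        }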

    -tj: Combined Eugene's and Petr's commit messages.

    Signed-off-by: Eugene Surovegin
    Signed-off-by: Tejun Heo
    Reported-by: Petr Tesarik
    Signed-off-by: Greg Kroah-Hartman

    Eugene Surovegin
     

22 Dec, 2011

4 commits

  • commit a855b84c3d8c73220d4d3cd392a7bee7c83de70e upstream.

    Percpu allocator recorded the cpus which map to the first and last
    units in pcpu_first/last_unit_cpu respectively and used them to
    determine the address range of a chunk - e.g. it assumed that the
    first unit has the lowest address in a chunk while the last unit has
    the highest address.

    This simply isn't true. Groups in a chunk can have arbitrary positive
    or negative offsets from the previous one and there is no guarantee
    that the first unit occupies the lowest offset while the last one the
    highest.

    Fix it by actually comparing unit offsets to determine the cpus occupying
    the lowest and highest offsets. Also, rename pcpu_first/last_unit_cpu
    to pcpu_low/high_unit_cpu to avoid confusion.

    The chunk address range is used to flush the cache on vmalloc area
    map/unmap and to decide whether a given address is in the first chunk by
    per_cpu_ptr_to_phys(); the bug was discovered through an invalid
    per_cpu_ptr_to_phys() translation for crash_notes.

    Kudos to Dave Young for tracking down the problem.
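
    A minimal sketch of the idea (not the exact mm/percpu.c code): instead of
    trusting the first/last unit, scan every unit offset and remember which
    possible CPU owns the lowest and highest one.

        /* sketch: unit_off[] holds each cpu's unit offset within the chunk */
        pcpu_low_unit_cpu = NR_CPUS;
        pcpu_high_unit_cpu = NR_CPUS;
        for_each_possible_cpu(cpu) {
                if (pcpu_low_unit_cpu == NR_CPUS ||
                    unit_off[cpu] < unit_off[pcpu_low_unit_cpu])
                        pcpu_low_unit_cpu = cpu;
                if (pcpu_high_unit_cpu == NR_CPUS ||
                    unit_off[cpu] > unit_off[pcpu_high_unit_cpu])
                        pcpu_high_unit_cpu = cpu;
        }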

    Signed-off-by: Tejun Heo
    Reported-by: WANG Cong
    Reported-by: Dave Young
    Tested-by: Dave Young
    LKML-Reference:
    Signed-off-by: Thomas Renninger
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • commit 1368edf0647ac112d8cfa6ce47257dc950c50f5c upstream.

    Commit f5252e00 ("mm: avoid null pointer access in vm_struct via
    /proc/vmallocinfo") adds newly allocated vm_structs to the vmlist only
    after they are fully initialised. Unfortunately, it did not check that
    __vmalloc_area_node() successfully populated the area. In the event of
    allocation failure, the vmalloc area is freed but the pointer to the freed
    memory is inserted into the vmlist, leading to a crash later in
    get_vmalloc_info().

    This patch adds a check for __vmalloc_area_node() failure within
    __vmalloc_node_range(). It does not use "goto fail" as in the previous
    error path because a warning was already displayed by __vmalloc_area_node()
    before it called vfree() in its failure path.
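
    The shape of the check, as a sketch rather than the literal diff:

        area = __get_vm_area_node(size, align, VM_ALLOC, start, end,
                                  node, gfp_mask, caller);
        if (!area)
                goto fail;

        addr = __vmalloc_area_node(area, gfp_mask, prot, node, caller);
        if (!addr)
                return NULL;    /* area already warned about and vfree()d */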

    Credit goes to Luciano Chavez for doing all the real work of identifying
    exactly where the problem was.

    Signed-off-by: Mel Gorman
    Reported-by: Luciano Chavez
    Tested-by: Luciano Chavez
    Reviewed-by: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit d021563888312018ca65681096f62e36c20e63cc upstream.

    setup_zone_migrate_reserve() expects that zone->start_pfn starts at
    pageblock_nr_pages aligned pfn otherwise we could access beyond an
    existing memblock resulting in the following panic if
    CONFIG_HOLES_IN_ZONE is not configured and we do not check pfn_valid:

    IP: [] setup_zone_migrate_reserve+0xcd/0x180
    *pdpt = 0000000000000000 *pde = f000ff53f000ff53
    Oops: 0000 [#1] SMP
    Pid: 1, comm: swapper Not tainted 3.0.7-0.7-pae #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
    EIP: 0060:[] EFLAGS: 00010006 CPU: 0
    EIP is at setup_zone_migrate_reserve+0xcd/0x180
    EAX: 000c0000 EBX: f5801fc0 ECX: 000c0000 EDX: 00000000
    ESI: 000c01fe EDI: 000c01fe EBP: 00140000 ESP: f2475f58
    DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
    Process swapper (pid: 1, ti=f2474000 task=f2472cd0 task.ti=f2474000)
    Call Trace:
    [] __setup_per_zone_wmarks+0xec/0x160
    [] setup_per_zone_wmarks+0xf/0x20
    [] init_per_zone_wmark_min+0x27/0x86
    [] do_one_initcall+0x2b/0x160
    [] kernel_init+0xbe/0x157
    [] kernel_thread_helper+0x6/0xd
    Code: a5 39 f5 89 f7 0f 46 fd 39 cf 76 40 8b 03 f6 c4 08 74 32 eb 91 90 89 c8 c1 e8 0e 0f be 80 80 2f 86 c0 8b 14 85 60 2f 86 c0 89 c8 82 b4 12 00 00 c1 e0 05 03 82 ac 12 00 00 8b 00 f6 c4 08 0f
    EIP: [] setup_zone_migrate_reserve+0xcd/0x180 SS:ESP 0068:f2475f58
    CR2: 00000000000012b4

    We crashed in pageblock_is_reserved() when accessing pfn 0xc0000 because
    highstart_pfn = 0x36ffe.

    The issue was introduced in 3.0-rc1 by 6d3163ce ("mm: check if any page
    in a pageblock is reserved before marking it MIGRATE_RESERVE").

    Make sure that start_pfn is always aligned to pageblock_nr_pages to
    ensure that pfn_valid is always called at the start of each pageblock.
    Architectures with holes in pageblocks will be correctly handled by
    pfn_valid_within in pageblock_is_reserved.
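
    A sketch of the alignment fix in setup_zone_migrate_reserve() (using the
    existing roundup() helper; details may differ from the actual diff):

        /* walk pageblocks from an aligned pfn so pfn_valid() is checked
         * once at the start of every pageblock */
        start_pfn = zone->zone_start_pfn;
        end_pfn = start_pfn + zone->spanned_pages;
        start_pfn = roundup(start_pfn, pageblock_nr_pages);

        for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
                if (!pfn_valid(pfn))
                        continue;
                /* existing per-pageblock MIGRATE_RESERVE logic runs here */
        }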

    Signed-off-by: Michal Hocko
    Signed-off-by: Mel Gorman
    Tested-by: Dang Bo
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Arve Hjønnevåg
    Cc: KOSAKI Motohiro
    Cc: John Stultz
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     
  • commit 58a84aa92723d1ac3e1cc4e3b0ff49291663f7e1 upstream.

    Commit 70b50f94f1644 ("mm: thp: tail page refcounting fix") keeps all
    page_tail->_count zero at all times. But the current kernel does not
    set page_tail->_count to zero if a 1GB page is utilized. So when an
    IOMMU 1GB page is used by KVM, it will result in a kernel oops because a
    tail page's _count does not equal zero.

    kernel BUG at include/linux/mm.h:386!
    invalid opcode: 0000 [#1] SMP
    Call Trace:
    gup_pud_range+0xb8/0x19d
    get_user_pages_fast+0xcb/0x192
    ? trace_hardirqs_off+0xd/0xf
    hva_to_pfn+0x119/0x2f2
    gfn_to_pfn_memslot+0x2c/0x2e
    kvm_iommu_map_pages+0xfd/0x1c1
    kvm_iommu_map_memslots+0x7c/0xbd
    kvm_iommu_map_guest+0xaa/0xbf
    kvm_vm_ioctl_assigned_device+0x2ef/0xa47
    kvm_vm_ioctl+0x36c/0x3a2
    do_vfs_ioctl+0x49e/0x4e4
    sys_ioctl+0x5a/0x7c
    system_call_fastpath+0x16/0x1b
    RIP gup_huge_pud+0xf2/0x159

    Signed-off-by: Youquan Song
    Reviewed-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Youquan Song
     

10 Dec, 2011

1 commit

  • commit ea4039a34c4c206d015d34a49d0b00868e37db1d upstream.

    If we fail to prepare an anon_vma, the {new, old}_page should be released,
    or they will leak.
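
    A sketch of the error path described above (the surrounding hugetlb_cow()
    code of that era; treat the details as approximate):

        if (unlikely(anon_vma_prepare(vma))) {
                page_cache_release(new_page);   /* was leaked before the fix */
                page_cache_release(old_page);
                /* caller expects the page table lock to be held */
                spin_lock(&mm->page_table_lock);
                return VM_FAULT_OOM;
        }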

    Signed-off-by: Hillf Danton
    Reviewed-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Cc: Michal Hocko
    Signed-off-by: Greg Kroah-Hartman

    Hillf Danton
     

22 Nov, 2011

1 commit

  • commit 7a401a972df8e184b3d1a3fc958c0a4ddee8d312 upstream.

    bdi_prune_sb() in bdi_unregister() attempts to remove the bdi links
    from all super_blocks and then del_timer_sync() the writeback timer.

    However, this can race with __mark_inode_dirty(), leading to
    bdi_wakeup_thread_delayed() rearming the writeback timer on the bdi
    we're unregistering, after we've called del_timer_sync().

    This can end up with the bdi being freed with an active timer inside it,
    as in the case of the following dump after the removal of an SD card.

    Fix this by redoing the del_timer_sync() in bdi_destroy().

    ------------[ cut here ]------------
    WARNING: at /home/rabin/kernel/arm/lib/debugobjects.c:262 debug_print_object+0x9c/0xc8()
    ODEBUG: free active (active state 0) object type: timer_list hint: wakeup_timer_fn+0x0/0x180
    Modules linked in:
    Backtrace:
    [] (dump_backtrace+0x0/0x110) from [] (dump_stack+0x18/0x1c)
    r6:c02bc638 r5:00000106 r4:c79f5d18 r3:00000000
    [] (dump_stack+0x0/0x1c) from [] (warn_slowpath_common+0x54/0x6c)
    [] (warn_slowpath_common+0x0/0x6c) from [] (warn_slowpath_fmt+0x38/0x40)
    r8:20000013 r7:c780c6f0 r6:c031613c r5:c780c6f0 r4:c02b1b29
    r3:00000009
    [] (warn_slowpath_fmt+0x0/0x40) from [] (debug_print_object+0x9c/0xc8)
    r3:c02b1b29 r2:c02bc662
    [] (debug_print_object+0x0/0xc8) from [] (debug_check_no_obj_freed+0xac/0x1dc)
    r6:c7964000 r5:00000001 r4:c7964000
    [] (debug_check_no_obj_freed+0x0/0x1dc) from [] (kmem_cache_free+0x88/0x1f8)
    [] (kmem_cache_free+0x0/0x1f8) from [] (blk_release_queue+0x70/0x78)
    [] (blk_release_queue+0x0/0x78) from [] (kobject_release+0x70/0x84)
    r5:c79641f0 r4:c796420c
    [] (kobject_release+0x0/0x84) from [] (kref_put+0x68/0x80)
    r7:00000083 r6:c74083d0 r5:c015289c r4:c796420c
    [] (kref_put+0x0/0x80) from [] (kobject_put+0x48/0x5c)
    r5:c79643b4 r4:c79641f0
    [] (kobject_put+0x0/0x5c) from [] (blk_cleanup_queue+0x68/0x74)
    r4:c7964000
    [] (blk_cleanup_queue+0x0/0x74) from [] (mmc_blk_put+0x78/0xe8)
    r5:00000000 r4:c794c400
    [] (mmc_blk_put+0x0/0xe8) from [] (mmc_blk_release+0x24/0x38)
    r5:c794c400 r4:c0322824
    [] (mmc_blk_release+0x0/0x38) from [] (__blkdev_put+0xe8/0x170)
    r5:c78d5e00 r4:c74083c0
    [] (__blkdev_put+0x0/0x170) from [] (blkdev_put+0x11c/0x12c)
    r8:c79f5f70 r7:00000001 r6:c74083d0 r5:00000083 r4:c74083c0
    r3:00000000
    [] (blkdev_put+0x0/0x12c) from [] (kill_block_super+0x60/0x6c)
    r7:c7942300 r6:c79f4000 r5:00000083 r4:c74083c0
    [] (kill_block_super+0x0/0x6c) from [] (deactivate_locked_super+0x44/0x70)
    r6:c79f4000 r5:c031af64 r4:c794dc00 r3:c00b06c4
    [] (deactivate_locked_super+0x0/0x70) from [] (deactivate_super+0x6c/0x70)
    r5:c794dc00 r4:c794dc00
    [] (deactivate_super+0x0/0x70) from [] (mntput_no_expire+0x188/0x194)
    r5:c794dc00 r4:c7942300
    [] (mntput_no_expire+0x0/0x194) from [] (sys_umount+0x2e4/0x310)
    r6:c7942300 r5:00000000 r4:00000000 r3:00000000
    [] (sys_umount+0x0/0x310) from [] (ret_fast_syscall+0x0/0x30)
    ---[ end trace e5c83c92ada51c76 ]---
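
    A sketch of the fix (the rest of the teardown is omitted; the timer field
    name follows the 3.x backing-dev code):

        void bdi_destroy(struct backing_dev_info *bdi)
        {
                /*
                 * A racing __mark_inode_dirty() may have re-armed the timer
                 * after bdi_unregister() ran, so delete it (again) before
                 * the bdi is freed.
                 */
                del_timer_sync(&bdi->wb.wakeup_timer);

                /* existing teardown continues here */
        }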

    Signed-off-by: Rabin Vincent
    Signed-off-by: Linus Walleij
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Rabin Vincent
     

12 Nov, 2011

2 commits

  • commit 70b50f94f1644e2aa7cb374819cfd93f3c28d725 upstream.

    Michel, while working on the working set estimation code, noticed that
    calling get_page_unless_zero() on a random pfn_to_page(random_pfn)
    wasn't safe if the pfn ended up being a tail page of a transparent
    hugepage under splitting by __split_huge_page_refcount().

    He then found the problem could also theoretically materialize with
    page_cache_get_speculative() during the speculative radix tree lookups
    that uses get_page_unless_zero() in SMP if the radix tree page is freed
    and reallocated and get_user_pages is called on it before
    page_cache_get_speculative has a chance to call get_page_unless_zero().

    So the best way to fix the problem is to keep page_tail->_count zero at
    all times. This will guarantee that get_page_unless_zero() can never
    succeed on any tail page. page_tail->_mapcount is guaranteed zero and
    is unused for all tail pages of a compound page, so we can simply
    account the tail page references there and transfer them to
    tail_page->_count in __split_huge_page_refcount() (in addition to the
    head_page->_mapcount).

    While debugging this s/_count/_mapcount/ change I also noticed get_page is
    called by direct-io.c on pages returned by get_user_pages. That wasn't
    entirely safe because the two atomic_inc in get_page weren't atomic. By
    contrast, other get_user_pages users, like the secondary-MMU page fault
    path that establishes the shadow pagetables, would never call any
    superfluous get_page after get_user_pages returns. It's safer to make
    get_page universally safe
    for tail pages and to use get_page_foll() within follow_page (inside
    get_user_pages()). get_page_foll() is safe to do the refcounting for tail
    pages without taking any locks because it is run within PT lock protected
    critical sections (PT lock for pte and page_table_lock for
    pmd_trans_huge).

    The standard get_page() as invoked by direct-io instead will now take
    the compound_lock but still only for tail pages. The direct-io paths
    are usually I/O bound and the compound_lock is per THP so very
    fine-grained, so there's no risk of scalability issues with it. A simple
    direct-io benchmark with all the lockdep prove-locking and spinlock
    debugging infrastructure enabled shows identical performance and no
    overhead. So it's worth it. Ideally direct-io should stop calling
    get_page() on pages returned by get_user_pages(). The spinlock in
    get_page() is already optimized away for no-THP builds but doing
    get_page() on tail pages returned by GUP is generally a rare operation
    and usually only run in I/O paths.

    This new refcounting on page_tail->_mapcount in addition to avoiding new
    RCU critical sections will also allow the working set estimation code to
    work without any further complexity associated to the tail page
    refcounting with THP.
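
    A sketch of the new helper described above (close in shape to the
    mm/internal.h code this commit adds, but treat it as illustrative):

        static inline void get_page_foll(struct page *page)
        {
                if (unlikely(PageTail(page)))
                        /*
                         * Safe without the compound_lock: the caller holds
                         * the PT lock, so __split_huge_page_refcount()
                         * cannot run under us.
                         */
                        __get_page_tail_foll(page, true);
                else {
                        /* a head or normal page must already have an
                         * elevated refcount */
                        VM_BUG_ON(atomic_read(&page->_count) <= 0);
                        atomic_inc(&page->_count);
                }
        }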

    Signed-off-by: Andrea Arcangeli
    Reported-by: Michel Lespinasse
    Reviewed-by: Michel Lespinasse
    Reviewed-by: Minchan Kim
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Benjamin Herrenschmidt
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrea Arcangeli
     
  • commit f5252e009d5b87071a919221e4f6624184005368 upstream.

    The /proc/vmallocinfo file shows information about vmalloc allocations in
    vmlist, which is a linked list of vm_struct. It may, however, access the
    pages field of a vm_struct where a page was not allocated. This results in
    a null pointer access and leads to a kernel panic.

    Why this happens: in __vmalloc_node_range() called from vmalloc(), a newly
    allocated vm_struct is added to the vmlist at __get_vm_area_node(), and only
    then are fields of the vm_struct such as nr_pages and pages set at
    __vmalloc_area_node(). In other words, it is added to the vmlist before it
    is fully initialized. If /proc/vmallocinfo is read at that point, it
    accesses the pages field of the vm_struct according to the nr_pages field
    at show_numa_info(), and a null pointer access happens.

    The patch adds the newly allocated vm_struct to the vmlist *after* it is
    fully initialized. So, it can avoid accessing the pages field with
    unallocated page when show_numa_info() is called.
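
    Sketched flow of the fix (the helper names here are illustrative, not
    necessarily the exact ones the patch introduces): allocate and populate the
    area first, and only then link it into the vmlist.

        /* set up the vm_struct without putting it on the vmlist yet */
        area = __get_vm_area_node(size, align, VM_ALLOC | VM_UNLIST,
                                  start, end, node, gfp_mask, caller);
        if (!area)
                goto fail;

        addr = __vmalloc_area_node(area, gfp_mask, prot, node, caller);
        if (!addr)
                return NULL;

        /* nr_pages and pages are now valid: publish the area for
         * /proc/vmallocinfo readers */
        insert_vmalloc_vmlist(area);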

    Signed-off-by: Mitsuo Hayasaka
    Cc: Andrew Morton
    Cc: David Rientjes
    Cc: Namhyung Kim
    Cc: "Paul E. McKenney"
    Cc: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mitsuo Hayasaka
     

25 Oct, 2011

1 commit

  • commit 486cf46f3f9be5f2a966016c1a8fe01e32cde09e upstream.

    I don't usually pay much attention to the stale "? " addresses in
    stack backtraces, but this lucky report from Pawel Sikora hints that
    mremap's move_ptes() has inadequate locking against page migration.

    3.0 BUG_ON(!PageLocked(p)) in migration_entry_to_page():
    kernel BUG at include/linux/swapops.h:105!
    RIP: 0010:[] []
    migration_entry_wait+0x156/0x160
    [] handle_pte_fault+0xae1/0xaf0
    [] ? __pte_alloc+0x42/0x120
    [] ? do_huge_pmd_anonymous_page+0xab/0x310
    [] handle_mm_fault+0x181/0x310
    [] ? vma_adjust+0x537/0x570
    [] do_page_fault+0x11d/0x4e0
    [] ? do_mremap+0x2d5/0x570
    [] page_fault+0x1f/0x30

    mremap's down_write of mmap_sem, together with i_mmap_mutex or lock,
    and pagetable locks, were good enough before page migration (with its
    requirement that every migration entry be found) came in, and enough
    while migration always held mmap_sem; but not enough nowadays, when
    there's memory hotremove and compaction.

    The danger is that move_ptes() lets a migration entry dodge around
    behind remove_migration_pte()'s back, so it's in the old location when
    looking at the new, then in the new location when looking at the old.

    Either mremap's move_ptes() must additionally take anon_vma lock(), or
    migration's remove_migration_pte() must stop peeking for is_swap_entry()
    before it takes pagetable lock.

    Consensus chooses the latter: we prefer to add overhead to migration
    than to mremapping, which gets used by JVMs and by exec stack setup.

    Reported-and-tested-by: Paweł Sikora
    Signed-off-by: Hugh Dickins
    Acked-by: Andrea Arcangeli
    Acked-by: Mel Gorman
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     

04 Oct, 2011

5 commits

  • commit 4508378b9523e22a2a0175d8bf64d932fb10a67d upstream.

    Commit 246e87a93934 ("memcg: fix get_scan_count() for small targets")
    fixes the memcg/kswapd behavior against small targets and prevents the
    vmscan priority from getting too high.

    But the implementation is too naive and adds another problem for small
    memcgs. It always forces a scan of 32 pages of file/anon and doesn't handle
    swappiness or other rotate_info. It makes vmscan scan the anon LRU
    regardless of swappiness and makes reclaim worse. This patch fixes it by
    adjusting the scan count with regard to swappiness as well.

    At a test "cat 1G file under 300M limit." (swappiness=20)
    before patch
    scanned_pages_by_limit 360919
    scanned_anon_pages_by_limit 180469
    scanned_file_pages_by_limit 180450
    rotated_pages_by_limit 31
    rotated_anon_pages_by_limit 25
    rotated_file_pages_by_limit 6
    freed_pages_by_limit 180458
    freed_anon_pages_by_limit 19
    freed_file_pages_by_limit 180439
    elapsed_ns_by_limit 429758872
    after patch
    scanned_pages_by_limit 180674
    scanned_anon_pages_by_limit 24
    scanned_file_pages_by_limit 180650
    rotated_pages_by_limit 35
    rotated_anon_pages_by_limit 24
    rotated_file_pages_by_limit 11
    freed_pages_by_limit 180634
    freed_anon_pages_by_limit 0
    freed_file_pages_by_limit 180634
    elapsed_ns_by_limit 367119089
    scanned_pages_by_system 0

    The number of anon pages scanned decreases (as expected), and the elapsed
    time is reduced. With this patch, small memcgs will work better.
    (*) Because the amount of file cache is much bigger than anon,
    reclaim_stat's rotate-scan counter makes files be scanned more.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Michal Hocko
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    KAMEZAWA Hiroyuki
     
  • commit 6e6938b6d3130305a5960c86b1a9b21e58cf6144 upstream.

    sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
    WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and
    do livelock prevention for it, too.

    Jan's commit f446daaea9 ("mm: implement writeback livelock avoidance
    using page tagging") is a partial fix in that it only fixed the
    WB_SYNC_ALL phase livelock.

    Although ext4 is tested to no longer livelock with commit f446daaea9,
    it may be due to some "redirty_tail() after pages_skipped" effect which
    is by no means a guarantee for _all_ the file systems.

    Note that writeback_inodes_sb() is called not only by sync(); its callers
    are treated the same because the other callers also need livelock
    prevention.

    Impact: It changes the order in which pages/inodes are synced to disk.
    Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode
    until finished with the current inode.
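
    A sketch of the mechanism in write_cache_pages() (the struct field is the
    one named above; surrounding code omitted):

        /* treat a tagged WB_SYNC_NONE pass like WB_SYNC_ALL for livelock
         * avoidance: only write pages tagged TOWRITE before we started */
        if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
                tag = PAGECACHE_TAG_TOWRITE;
        else
                tag = PAGECACHE_TAG_DIRTY;

    The WB_SYNC_NONE stage of sync(2) then simply sets .tagged_writepages = 1
    in its writeback_control.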

    Acked-by: Jan Kara
    CC: Dave Chinner
    Signed-off-by: Wu Fengguang
    Signed-off-by: Greg Kroah-Hartman

    Wu Fengguang
     
  • commit 461ae488ecb125b140d7ea29ceeedbcce9327003 upstream.

    Xen backend drivers (e.g., blkback and netback) would sometimes fail to
    map grant pages into the vmalloc address space allocated with
    alloc_vm_area(). The GNTTABOP_map_grant_ref would fail because Xen could
    not find the page (in the L2 table) containing the PTEs it needed to
    update.

    (XEN) mm.c:3846:d0 Could not find L1 PTE for address fbb42000

    netback and blkback were making the hypercall from a kernel thread where
    task->active_mm != &init_mm and alloc_vm_area() was only updating the page
    tables for init_mm. The usual method of deferring the update to the page
    tables of other processes (i.e., after taking a fault) doesn't work as a
    fault cannot occur during the hypercall.

    This would work on some systems depending on what else was using vmalloc.

    Fix this by reverting ef691947d8a3 ("vmalloc: remove vmalloc_sync_all()
    from alloc_vm_area()") and adding a comment to explain why it's needed.
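
    A sketch of the restored behaviour (PTE allocation details omitted):

        struct vm_struct *alloc_vm_area(size_t size)
        {
                struct vm_struct *area;

                area = get_vm_area_caller(size, VM_IOREMAP,
                                          __builtin_return_address(0));
                if (!area)
                        return NULL;

                /* ... construct the page tables covering the area ... */

                /*
                 * The area may be handed straight to a hypercall, where no
                 * page fault can lazily sync the kernel page tables into
                 * the current mm, so sync them now.
                 */
                vmalloc_sync_all();

                return area;
        }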

    Signed-off-by: David Vrabel
    Cc: Jeremy Fitzhardinge
    Cc: Konrad Rzeszutek Wilk
    Cc: Ian Campbell
    Cc: Keir Fraser
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Vrabel
     
  • commit 76d3fbf8fbf6cc78ceb63549e0e0c5bc8a88f838 upstream.

    With zone_reclaim_mode enabled, it's possible for zones to be considered
    full in the zonelist_cache so they are skipped in the future. If the
    process enters direct reclaim, the ZLC may still consider zones to be full
    even after reclaiming pages. Reconsider all zones for allocation if
    direct reclaim returns successfully.

    Signed-off-by: Mel Gorman
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Cc: Stefan Priebe
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit cd38b115d5ad79b0100ac6daa103c4fe2c50a913 upstream.

    There have been a small number of complaints about significant stalls
    while copying large amounts of data on NUMA machines reported on a
    distribution bugzilla. In these cases, zone_reclaim was enabled by
    default due to large NUMA distances. In general, the complaints have not
    been about the workload itself unless it was a file server (in which case
    the recommendation was to disable zone_reclaim).

    The stalls are mostly due to significant amounts of time spent scanning
    the preferred zone for pages to free. After a failure, it might fallback
    to another node (as zonelists are often node-ordered rather than
    zone-ordered) but stall quickly again when the next allocation attempt
    occurs. In bad cases, each page allocated results in a full scan of the
    preferred zone.

    Patch 1 checks the preferred zone for recent allocation failure
    which is particularly important if zone_reclaim has failed
    recently. This avoids rescanning the zone in the near future and
    instead falling back to another node. This may hurt node locality
    in some cases but a failure to zone_reclaim is more expensive than
    a remote access.

    Patch 2 clears the zlc information after direct reclaim.
    Otherwise, zone_reclaim can mark zones full, direct reclaim can
    reclaim enough pages but the zone is still not considered for
    allocation.

    This was tested on a 24-thread 2-node x86_64 machine. The tests were
    focused on large amounts of IO. All tests were bound to the CPUs on
    node-0 to avoid disturbances due to processes being scheduled on different
    nodes. The kernels tested are

    3.0-rc6-vanilla Vanilla 3.0-rc6
    zlcfirst Patch 1 applied
    zlcreconsider Patches 1+2 applied

    FS-Mark
    ./fs_mark -d /tmp/fsmark-10813 -D 100 -N 5000 -n 208 -L 35 -t 24 -S0 -s 524288
    fsmark-3.0-rc6 3.0-rc6 3.0-rc6
    vanilla zlcfirs zlcreconsider
    Files/s min 54.90 ( 0.00%) 49.80 (-10.24%) 49.10 (-11.81%)
    Files/s mean 100.11 ( 0.00%) 135.17 (25.94%) 146.93 (31.87%)
    Files/s stddev 57.51 ( 0.00%) 138.97 (58.62%) 158.69 (63.76%)
    Files/s max 361.10 ( 0.00%) 834.40 (56.72%) 802.40 (55.00%)
    Overhead min 76704.00 ( 0.00%) 76501.00 ( 0.27%) 77784.00 (-1.39%)
    Overhead mean 1485356.51 ( 0.00%) 1035797.83 (43.40%) 1594680.26 (-6.86%)
    Overhead stddev 1848122.53 ( 0.00%) 881489.88 (109.66%) 1772354.90 ( 4.27%)
    Overhead max 7989060.00 ( 0.00%) 3369118.00 (137.13%) 10135324.00 (-21.18%)
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 501.49 493.91 499.93
    Total Elapsed Time (seconds) 2451.57 2257.48 2215.92

    MMTests Statistics: vmstat
    Page Ins 46268 63840 66008
    Page Outs 90821596 90671128 88043732
    Swap Ins 0 0 0
    Swap Outs 0 0 0
    Direct pages scanned 13091697 8966863 8971790
    Kswapd pages scanned 0 1830011 1831116
    Kswapd pages reclaimed 0 1829068 1829930
    Direct pages reclaimed 13037777 8956828 8648314
    Kswapd efficiency 100% 99% 99%
    Kswapd velocity 0.000 810.643 826.346
    Direct efficiency 99% 99% 96%
    Direct velocity 5340.128 3972.068 4048.788
    Percentage direct scans 100% 83% 83%
    Page writes by reclaim 0 3 0
    Slabs scanned 796672 720640 720256
    Direct inode steals 7422667 7160012 7088638
    Kswapd inode steals 0 1736840 2021238

    Test completes far faster with a large increase in the number of files
    created per second. Standard deviation is high as a small number of
    iterations were much higher than the mean. The number of pages scanned by
    zone_reclaim is reduced and kswapd is used for more work.

    LARGE DD
    3.0-rc6 3.0-rc6 3.0-rc6
    vanilla zlcfirst zlcreconsider
    download tar 59 ( 0.00%) 59 ( 0.00%) 55 ( 7.27%)
    dd source files 527 ( 0.00%) 296 (78.04%) 320 (64.69%)
    delete source 36 ( 0.00%) 19 (89.47%) 20 (80.00%)
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 125.03 118.98 122.01
    Total Elapsed Time (seconds) 624.56 375.02 398.06

    MMTests Statistics: vmstat
    Page Ins 3594216 439368 407032
    Page Outs 23380832 23380488 23377444
    Swap Ins 0 0 0
    Swap Outs 0 436 287
    Direct pages scanned 17482342 69315973 82864918
    Kswapd pages scanned 0 519123 575425
    Kswapd pages reclaimed 0 466501 522487
    Direct pages reclaimed 5858054 2732949 2712547
    Kswapd efficiency 100% 89% 90%
    Kswapd velocity 0.000 1384.254 1445.574
    Direct efficiency 33% 3% 3%
    Direct velocity 27991.453 184832.737 208171.929
    Percentage direct scans 100% 99% 99%
    Page writes by reclaim 0 5082 13917
    Slabs scanned 17280 29952 35328
    Direct inode steals 115257 1431122 332201
    Kswapd inode steals 0 0 979532

    This test downloads a large tarfile and copies it with dd a number of
    times - similar to the most recent bug report I've dealt with. Time to
    completion is reduced. The number of pages scanned directly is still
    disturbingly high with a low efficiency but this is likely due to the
    number of dirty pages encountered. The figures could probably be improved
    with more work around how kswapd is used and how dirty pages are handled
    but that is separate work and this result is significant on its own.

    Streaming Mapped Writer
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 124.47 111.67 112.64
    Total Elapsed Time (seconds) 2138.14 1816.30 1867.56

    MMTests Statistics: vmstat
    Page Ins 90760 89124 89516
    Page Outs 121028340 120199524 120736696
    Swap Ins 0 86 55
    Swap Outs 0 0 0
    Direct pages scanned 114989363 96461439 96330619
    Kswapd pages scanned 56430948 56965763 57075875
    Kswapd pages reclaimed 27743219 27752044 27766606
    Direct pages reclaimed 49777 46884 36655
    Kswapd efficiency 49% 48% 48%
    Kswapd velocity 26392.541 31363.631 30561.736
    Direct efficiency 0% 0% 0%
    Direct velocity 53780.091 53108.759 51581.004
    Percentage direct scans 67% 62% 62%
    Page writes by reclaim 385 122 1513
    Slabs scanned 43008 39040 42112
    Direct inode steals 0 10 8
    Kswapd inode steals 733 534 477

    This test just creates a large file mapping and writes to it linearly.
    Time to completion is again reduced.

    The gains are mostly down to two things. In many cases, there is less
    scanning as zone_reclaim simply gives up faster due to recent failures.
    The second reason is that memory is used more efficiently. Instead of
    scanning the preferred zone every time, the allocator falls back to
    another zone and uses it instead improving overall memory utilisation.

    This patch: initialise ZLC for first zone eligible for zone_reclaim.

    The zonelist cache (ZLC) is used among other things to record if
    zone_reclaim() failed for a particular zone recently. The intention is to
    avoid a high cost scanning extremely long zonelists or scanning within the
    zone uselessly.

    Currently the zonelist cache is setup only after the first zone has been
    considered and zone_reclaim() has been called. The objective was to avoid
    a costly setup but zone_reclaim is itself quite expensive. If it is
    failing regularly such as the first eligible zone having mostly mapped
    pages, the cost in scanning and allocation stalls is far higher than the
    ZLC initialisation step.

    This patch initialises ZLC before the first eligible zone calls
    zone_reclaim(). Once initialised, it is checked whether the zone failed
    zone_reclaim recently. If it has, the zone is skipped. As the first zone
    is now being checked, additional care has to be taken about zones marked
    full. A zone can be marked "full" because it does not have enough
    unmapped pages for zone_reclaim, but this is excessive as direct reclaim or
    kswapd may succeed where zone_reclaim fails. Only mark zones "full" after
    zone_reclaim fails if it failed to reclaim enough pages after scanning.

    Signed-off-by: Mel Gorman
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Cc: Stefan Priebe
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     

18 Aug, 2011

1 commit

  • commit f982f91516fa4cfd9d20518833cd04ad714585be upstream.

    Commit db64fe02258f ("mm: rewrite vmap layer") introduced code that does
    address calculations under the assumption that VMAP_BLOCK_SIZE is a
    power of two. However, this might not be true if CONFIG_NR_CPUS is not
    set to a power of two.

    Wrong vmap_block index/offset values could lead to memory corruption.
    However, this has never been observed in practice (or never been
    diagnosed correctly); what caught this was the BUG_ON in vb_alloc() that
    checks for inconsistent vmap_block indices.

    To fix this, ensure that VMAP_BLOCK_SIZE always is a power of two.
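
    The assumption being protected: index/offset math of the form
    addr & ~(size - 1) only rounds down correctly when size is a power of two.
    A tiny userspace illustration:

        #include <stdio.h>

        int main(void)
        {
                unsigned long addr = 1000;

                /* power-of-two size: masking rounds down to a multiple of 64 */
                printf("%lu\n", addr & ~(64UL - 1));   /* 960 */

                /* non-power-of-two size: masking gives 928, not a multiple of 96 */
                printf("%lu\n", addr & ~(96UL - 1));   /* 928 */
                printf("%lu\n", addr / 96 * 96);       /* 960, correct rounding */
                return 0;
        }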

    BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=31572
    Reported-by: Pavel Kysilka
    Reported-by: Matias A. Fonzo
    Signed-off-by: Clemens Ladisch
    Signed-off-by: Stefan Richter
    Cc: Nick Piggin
    Cc: Jeremy Fitzhardinge
    Cc: Krzysztof Helt
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Clemens Ladisch
     

05 Aug, 2011

4 commits

  • commit c027a474a68065391c8773f6e83ed5412657e369 upstream.

    exit_mm() sets ->mm to NULL and then does mmput()->exit_mmap(), which
    frees the memory.

    However select_bad_process() checks ->mm != NULL before TIF_MEMDIE,
    so it continues to kill other tasks even if we have the oom-killed
    task freeing its memory.

    Change select_bad_process() to check ->mm after TIF_MEMDIE, but skip
    the tasks which have already passed exit_notify() to ensure that a zombie
    with TIF_MEMDIE set can't block the oom killer. Alternatively we could
    probably clear TIF_MEMDIE after exit_mmap().
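
    A sketch of the reordering described above (the surrounding
    select_bad_process() loop; details approximate):

        for_each_process(p) {
                /* skip tasks already past exit_notify(): a TIF_MEMDIE
                 * zombie must not block the oom killer forever */
                if (p->exit_state)
                        continue;
                if (oom_unkillable_task(p, mem, nodemask))
                        continue;

                /* check TIF_MEMDIE before ->mm, so an exiting victim that
                 * already passed exit_mm() is still honoured */
                if (test_tsk_thread_flag(p, TIF_MEMDIE))
                        return ERR_PTR(-1UL);

                if (!p->mm)
                        continue;

                /* score the task as before */
        }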

    Signed-off-by: Oleg Nesterov
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Oleg Nesterov
     
  • commit 108b6a78463bb8c7163e4f9779f36ad8bbade334 upstream.

    Commit 22a668d7c3ef ("memcg: fix behavior under memory.limit equals to
    memsw.limit") introduced "memsw_is_minimum" flag, which becomes true
    when mem_limit == memsw_limit. The flag is checked at the beginning of
    reclaim, and "noswap" is set if the flag is true, because using swap is
    meaningless in this case.

    This works well in most cases, but when we try to shrink mem_limit,
    which is the same as memsw_limit now, we might fail to shrink mem_limit
    because swap isn't used.

    This patch fixes this behavior by:
    - checking MEM_CGROUP_RECLAIM_SHRINK at the beginning of reclaim
    - if it is set, not setting the "noswap" flag even if memsw_is_minimum is true.
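
    Roughly, the condition becomes (a sketch; flag and variable names follow
    the description above):

        /* when shrinking the limit itself, swap may be the only way to
         * make progress, so don't force noswap in that case */
        if (!(reclaim_options & MEM_CGROUP_RECLAIM_SHRINK) &&
            root_mem->memsw_is_minimum)
                noswap = true;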

    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Daisuke Nishimura
     
  • commit ccb6108f5b0b541d3eb332c3a73e645c0f84278e upstream.

    Vito said:

    : The system has many usb disks coming and going day to day, with their
    : respective bdi's having min_ratio set to 1 when inserted. It works for
    : some time until eventually min_ratio can no longer be set, even when the
    : active set of bdi's seen in /sys/class/bdi/*/min_ratio doesn't add up to
    : anywhere near 100.
    :
    : This then leads to an unrelated starvation problem caused by write-heavy
    : fuse mounts being used atop the usb disks, a problem the min_ratio setting
    : at the underlying devices bdi effectively prevents.

    Fix this leakage by resetting the bdi min_ratio when unregistering the
    BDI.
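
    A sketch of the fix in bdi_unregister() (bdi_set_min_ratio() is the
    existing helper that enforces the 100% budget):

        void bdi_unregister(struct backing_dev_info *bdi)
        {
                if (bdi->dev) {
                        /* return this bdi's share of the min_ratio budget,
                         * otherwise it leaks on every device removal */
                        bdi_set_min_ratio(bdi, 0);

                        /* existing unregister work continues here */
                }
        }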

    Signed-off-by: Peter Zijlstra
    Reported-by: Vito Caputo
    Cc: Wu Fengguang
    Cc: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     
  • commit 2efaca927f5cd7ecd0f1554b8f9b6a9a2c329c03 upstream.

    I haven't reproduced it myself but the fail scenario is that on such
    machines (notably ARM and some embedded powerpc), if you manage to hit
    that futex path on a writable page whose dirty bit has gone from the PTE,
    you'll livelock inside the kernel from what I can tell.

    It will go in a loop of trying the atomic access, failing, trying gup to
    "fix it up", getting success from gup, going back to the atomic access,
    failing again because dirty wasn't fixed, etc...

    So I think you essentially hang in the kernel.

    The scenario is probably rare'ish because affected architecture are
    embedded and tend to not swap much (if at all) so we probably rarely hit
    the case where dirty is missing or young is missing, but I think Shan has
    a piece of SW that can reliably reproduce it using a shared writable
    mapping & fork or something like that.

    On archs that use SW tracking of dirty & young, a page without dirty is
    effectively mapped read-only and a page without young is inaccessible in
    the PTE.

    Additionally, some architectures might lazily flush the TLB when relaxing
    write protection (by doing only a local flush), and expect a fault to
    invalidate the stale entry if it's still present on another processor.

    The futex code assumes that if the "in_atomic()" access -EFAULTs, it can
    "fix it up" by calling get_user_pages(), which would then be equivalent to
    taking the fault.

    However that isn't the case. get_user_pages() will not call
    handle_mm_fault() in the case where the PTE seems to have the right
    permissions, regardless of the dirty and young state. It will eventually
    update those bits ... in the struct page, but not in the PTE.

    Additionally, it will not handle the lazy TLB flushing that can be
    required by some architectures in the fault case.

    Basically, gup is the wrong interface for the job. The patch provides a
    more appropriate one which boils down to just calling handle_mm_fault()
    since what we are trying to do is simulate a real page fault.

    The futex code currently attempts to write to user memory within a
    pagefault disabled section, and if that fails, tries to fix it up using
    get_user_pages().

    This doesn't work on archs where the dirty and young bits are maintained
    by software, since they will gate access permission in the TLB, and will
    not be updated by gup().

    In addition, there's an expectation on some archs that a spurious write
    fault triggers a local TLB flush, and that is missing from the picture as
    well.

    I decided that adding those "features" to gup() would be too much for this
    already too complex function, and instead added a new simpler
    fixup_user_fault() which is essentially a wrapper around handle_mm_fault()
    which the futex code can call.
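
    The futex side then becomes, roughly (a sketch close to the
    fault_in_user_writeable() the patch ends up with):

        static int fault_in_user_writeable(u32 __user *uaddr)
        {
                struct mm_struct *mm = current->mm;
                int ret;

                down_read(&mm->mmap_sem);
                /* simulate a real write fault: update pte dirty/young and do
                 * any lazy TLB flush the architecture needs */
                ret = fixup_user_fault(current, mm, (unsigned long)uaddr,
                                       FAULT_FLAG_WRITE);
                up_read(&mm->mmap_sem);

                return ret < 0 ? ret : 0;
        }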

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix some nits Darren saw, fiddle comment layout]
    Signed-off-by: Benjamin Herrenschmidt
    Reported-by: Shan Hai
    Tested-by: Shan Hai
    Cc: David Laight
    Acked-by: Peter Zijlstra
    Cc: Darren Hart
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Benjamin Herrenschmidt
     

20 Jul, 2011

1 commit

  • I'm running a workload which triggers a lot of swap in a machine with 4
    nodes. After I kill the workload, I found a kswapd livelock. Sometimes
    kswapd3 or kswapd2 are keeping running and I can't access filesystem,
    but most memory is free.

    This looks like a regression since commit 08951e545918c159 ("mm: vmscan:
    correct check for kswapd sleeping in sleeping_prematurely").

    Node 2 and 3 have only ZONE_NORMAL, but balance_pgdat() will return 0
    for classzone_idx. The reason is that end_zone in balance_pgdat() is 0 by
    default, and if all zones have the watermark ok, end_zone stays 0.

    Later sleeping_prematurely() always returns true. Because this is an
    order 3 wakeup, and if classzone_idx is 0, both balanced_pages and
    present_pages in pgdat_balanced() are 0. We add a special case here.
    If a zone has no page, we think it's balanced. This fixes the livelock.

    Signed-off-by: Shaohua Li
    Acked-by: Mel Gorman
    Cc: Minchan Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

09 Jul, 2011

8 commits

    remap_pfn_range() means map physical address pfn << PAGE_SHIFT to the user
    address addr. For a nommu arch it is implemented by vma->vm_start = pfn <<
    PAGE_SHIFT, which is wrong according to the original meaning of this
    function. And some driver developer using remap_pfn_range() with correct
    parameters will get an unexpected result because vm_start is changed. It
    should be implemented like addr = pfn << PAGE_SHIFT, but that is
    meaningless on a nommu arch, so this patch just makes it simply return.

    The parameter name and the setting of vma->vm_flags are also fixed.
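
    A sketch of the nommu implementation after the fix (the exact flag set may
    differ between kernel versions):

        int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
                            unsigned long pfn, unsigned long size, pgprot_t prot)
        {
                /* on nommu the mapping must already be the identity mapping
                 * the caller asked for; never rewrite vma->vm_start */
                if (addr != (pfn << PAGE_SHIFT))
                        return -EINVAL;

                vma->vm_flags |= VM_IO | VM_PFNMAP;
                return 0;
        }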

    Signed-off-by: Bob Liu
    Cc: Geert Uytterhoeven
    Cc: David Howells
    Acked-by: Greg Ungerer
    Cc: Mike Frysinger
    Cc: Bob Liu
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
     
    commit 889976dbcb12 ("memcg: reclaim memory from nodes in round-robin
    order") adds a numa node round-robin for memcg, but the information is
    updated only once per 10 seconds.

    This patch changes the update trigger from jiffies to memcg's event count.
    After this patch, numa scan information will be updated when we see 1024
    events of pagein/pageout under a memcg.

    [akpm@linux-foundation.org: attempt to repair code layout]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ying Han
    Cc: Johannes Weiner
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
    Now, in mem_cgroup_hierarchical_reclaim(), mem_cgroup_local_usage() is
    used for checking whether the memcg contains reclaimable pages or not. If
    there are no pages in it, the routine skips it.

    But mem_cgroup_local_usage() includes Unevictable pages and cannot handle
    the "noswap" condition correctly. This doesn't work on a swapless system.

    This patch adds test_mem_cgroup_reclaimable() and replaces
    mem_cgroup_local_usage(). test_mem_cgroup_reclaimable() looks at the LRU
    counters and returns the correct answer to the caller. This new function
    has a "noswap" argument and can look at only the FILE LRUs if necessary.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix kerneldoc layout]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ying Han
    Cc: Johannes Weiner
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • __tlb_remove_page() switches to a new batch page, but still checks space
    in the old batch. This check always fails, and causes a forced tlb flush.
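
    A sketch of the fix: after switching batches, reload the active batch
    before checking and filling it.

        batch = tlb->active;
        if (batch->nr == batch->max) {
                if (!tlb_next_batch(tlb))
                        return 0;       /* no space: force a flush */
                batch = tlb->active;    /* the fix: look at the new batch */
        }
        VM_BUG_ON(batch->nr > batch->max);

        batch->pages[batch->nr++] = page;
        return batch->max - batch->nr;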

    Signed-off-by: Shaohua Li
    Acked-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • During allocator-intensive workloads, kswapd will be woken frequently
    causing free memory to oscillate between the high and min watermark. This
    is expected behaviour. Unfortunately, if the highest zone is small, a
    problem occurs.

    When balance_pgdat() returns, it may be at a lower classzone_idx than it
    started because the highest zone was unreclaimable. Before checking if it
    should go to sleep though, it checks pgdat->classzone_idx which when there
    is no other activity will be MAX_NR_ZONES-1. It interprets this as it has
    been woken up while reclaiming, skips scheduling and reclaims again. As
    there is no useful reclaim work to do, it enters into a loop of shrinking
    slab consuming loads of CPU until the highest zone becomes reclaimable for
    a long period of time.

    There are two problems here. 1) If the returned classzone or order is
    lower, it'll continue reclaiming without scheduling. 2) if the highest
    zone was marked unreclaimable but balance_pgdat() returns immediately at
    DEF_PRIORITY, the new lower classzone is not communicated back to kswapd()
    for sleeping.

    This patch does two things that are related. If the end_zone is
    unreclaimable, this information is communicated back. Second, if the
    classzone or order was reduced due to failing to reclaim, new information
    is not read from pgdat and instead an attempt is made to go to sleep. Due
    to this, it is also necessary that pgdat->classzone_idx be initialised
    each time to pgdat->nr_zones - 1 to avoid re-reads being interpreted as
    wakeups.

    Signed-off-by: Mel Gorman
    Reported-by: Pádraig Brady
    Tested-by: Pádraig Brady
    Tested-by: Andrew Lutomirski
    Acked-by: Rik van Riel
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When deciding if kswapd is sleeping prematurely, the classzone is taken
    into account but this is different to what balance_pgdat() and the
    allocator are doing. Specifically, the DMA zone will be checked based on
    the classzone used when waking kswapd which could be for a GFP_KERNEL or
    GFP_HIGHMEM request. The lowmem reserve limit kicks in, the watermark is
    not met and kswapd thinks it's sleeping prematurely keeping kswapd awake in
    error.

    Signed-off-by: Mel Gorman
    Reported-by: Pádraig Brady
    Tested-by: Pádraig Brady
    Tested-by: Andrew Lutomirski
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • During allocator-intensive workloads, kswapd will be woken frequently
    causing free memory to oscillate between the high and min watermark. This
    is expected behaviour.

    When kswapd applies pressure to zones during node balancing, it checks if
    the zone is above a high+balance_gap threshold. If it is, it does not
    apply pressure but it unconditionally shrinks slab on a global basis which
    is excessive. In the event kswapd is being kept awake due to a high small
    unreclaimable zone, it skips zone shrinking but still calls shrink_slab().

    Once pressure has been applied, the check for the zone being unreclaimable
    is made before the check of whether all_unreclaimable should be set.
    This missed "unreclaimable" can cause has_under_min_watermark_zone to be
    set due to an unreclaimable zone, preventing kswapd from backing off in
    congestion_wait().

    Signed-off-by: Mel Gorman
    Reported-by: Pádraig Brady
    Tested-by: Pádraig Brady
    Tested-by: Andrew Lutomirski
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • During allocator-intensive workloads, kswapd will be woken frequently
    causing free memory to oscillate between the high and min watermark. This
    is expected behaviour. Unfortunately, if the highest zone is small, a
    problem occurs.

    This seems to happen most with recent Sandy Bridge laptops but it's
    probably a coincidence as some of these laptops just happen to have a
    small Normal zone. The reproduction case is almost always copying
    large files, during which kswapd pegs at 100% CPU until the file is deleted
    or the cache is dropped.

    The problem is mostly down to sleeping_prematurely() keeping kswapd awake
    when the highest zone is small and unreclaimable and compounded by the
    fact we shrink slabs even when not shrinking zones causing a lot of time
    to be spent in shrinkers and a lot of memory to be reclaimed.

    Patch 1 corrects sleeping_prematurely to check the zones matching
    the classzone_idx instead of all zones.

    Patch 2 avoids shrinking slab when we are not shrinking a zone.

    Patch 3 notes that sleeping_prematurely is checking lower zones against
    a high classzone which is not what allocators or balance_pgdat()
    are doing, leading to an artificial belief that kswapd should
    still be awake.

    Patch 4 notes that when balance_pgdat() gives up on a high zone that the
    decision is not communicated to sleeping_prematurely()

    This problem affects 2.6.38.8 for certain and is expected to affect 2.6.39
    and 3.0-rc4 as well. If accepted, they need to go to -stable to be picked
    up by distros and this series is against 3.0-rc4. I've cc'd people that
    reported similar problems recently to see if they still suffer from the
    problem and if this fixes it.

    This patch: correct the check for kswapd sleeping in sleeping_prematurely()

    During allocator-intensive workloads, kswapd will be woken frequently
    causing free memory to oscillate between the high and min watermark. This
    is expected behaviour.

    A problem occurs if the highest zone is small. balance_pgdat() only
    considers unreclaimable zones when priority is DEF_PRIORITY but
    sleeping_prematurely considers all zones. It's possible for this sequence
    to occur

    1. kswapd wakes up and enters balance_pgdat()
    2. At DEF_PRIORITY, marks highest zone unreclaimable
    3. At DEF_PRIORITY-1, ignores highest zone setting end_zone
    4. At DEF_PRIORITY-1, calls shrink_slab freeing memory from
    highest zone, clearing all_unreclaimable. Highest zone
    is still unbalanced
    5. kswapd returns and calls sleeping_prematurely
    6. sleeping_prematurely looks at *all* zones, not just the ones
    being considered by balance_pgdat. The highest small zone
    has all_unreclaimable cleared but the zone is not
    balanced. all_zones_ok is false so kswapd stays awake

    This patch corrects the behaviour of sleeping_prematurely to check the
    zones balance_pgdat() checked.
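
    The essence of the change, as a sketch: bound the premature-sleep check by
    the classzone kswapd was actually balancing instead of walking every zone
    in the node.

        /* before: for (i = 0; i < pgdat->nr_zones; i++) */
        for (i = 0; i <= classzone_idx; i++) {
                struct zone *zone = pgdat->node_zones + i;

                if (!populated_zone(zone))
                        continue;

                /* watermark / balance accounting as before */
        }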

    Signed-off-by: Mel Gorman
    Reported-by: Pádraig Brady
    Tested-by: Pádraig Brady
    Tested-by: Andrew Lutomirski
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

28 Jun, 2011

7 commits

    Commit d149e3b25d7c ("memcg: add the soft_limit reclaim in global direct
    reclaim") adds a softlimit hook to shrink_zones(). With this, soft limit
    reclaim is called as

    try_to_free_pages()
    do_try_to_free_pages()
    shrink_zones()
    mem_cgroup_soft_limit_reclaim()

    Then, direct reclaim is memcg softlimit hint aware, now.

    But the memory cgroup's "limit" path can also call the softlimit shrinker:

    try_to_free_mem_cgroup_pages()
    do_try_to_free_pages()
    shrink_zones()
    mem_cgroup_soft_limit_reclaim()

    This will cause a global reclaim when a memcg hits its limit.

    This is a bug. soft_limit_reclaim() should be called only when
    scanning_global_lru(sc) == true.

    And the commit adds a variable "total_scanned" for counting softlimit
    scanned pages....it's not "total". This patch removes the variable and
    updates sc->nr_scanned instead of it. This will affect shrink_slab()'s
    scan condition but, as the global LRU is scanned by softlimit, I think this
    change makes sense.

    TODO: avoid too much scanning of a zone when softlimit did enough work.
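
    A sketch of the resulting call site in shrink_zones() (argument list
    approximate):

        if (scanning_global_lru(sc)) {
                unsigned long nr_soft_scanned = 0;
                unsigned long nr_soft_reclaimed;

                nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
                                        sc->order, sc->gfp_mask,
                                        &nr_soft_scanned);
                sc->nr_reclaimed += nr_soft_reclaimed;
                sc->nr_scanned += nr_soft_scanned;
        }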

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Ying Han
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Under heavy memory and filesystem load, users observe the assertion
    mapping->nrpages == 0 in end_writeback() trigger. This can be caused by
    page reclaim reclaiming the last page from a mapping in the following
    race:

    CPU0                                    CPU1
    ...
    shrink_page_list()
      __remove_mapping()
        __delete_from_page_cache()
          radix_tree_delete()
                                            evict_inode()
                                              truncate_inode_pages()
                                                truncate_inode_pages_range()
                                                  pagevec_lookup() - finds nothing
                                              end_writeback()
                                                mapping->nrpages != 0 -> BUG
          page->mapping = NULL
          mapping->nrpages--

    Fix the problem by doing a reliable check of mapping->nrpages under
    mapping->tree_lock in end_writeback().
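
    A sketch of the fix (i_data is the inode's mapping; the rest of
    end_writeback() is unchanged):

        void end_writeback(struct inode *inode)
        {
                /*
                 * Cycling tree_lock makes sure any concurrent
                 * __delete_from_page_cache() has finished and nrpages has
                 * been decremented before we check it.
                 */
                spin_lock_irq(&inode->i_data.tree_lock);
                BUG_ON(inode->i_data.nrpages);
                spin_unlock_irq(&inode->i_data.tree_lock);

                BUG_ON(!list_empty(&inode->i_data.private_list));
        }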

    Analyzed by Jay, lost in LKML, and dug out by Miklos Szeredi.

    Cc: Jay
    Cc: Miklos Szeredi
    Signed-off-by: Jan Kara
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • We cannot take a mutex while holding a spinlock, so flip the order and
    fix the locking documentation.

    Signed-off-by: Peter Zijlstra
    Acked-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Although it is used (by i915) on nothing but tmpfs, read_cache_page_gfp()
    is unsuited to tmpfs, because it inserts a page into pagecache before
    calling the filesystem's ->readpage: tmpfs may have pages in swapcache
    which only it knows how to locate and switch to filecache.

    At present tmpfs provides a ->readpage method, and copes with this by
    copying pages; but soon we can simplify it by removing its ->readpage.
    Provide shmem_read_mapping_page_gfp() now, ready for that transition.

    Export shmem_read_mapping_page_gfp() and add it to list in shmem_fs.h,
    with shmem_read_mapping_page() inline for the common mapping_gfp case.

    (shmem_read_mapping_page_gfp or shmem_read_cache_page_gfp? Generally the
    read_mapping_page functions use the mapping's ->readpage, and the
    read_cache_page functions use the supplied filler, so I think
    read_cache_page_gfp was slightly misnamed.)
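
    The common-case wrapper mentioned above looks roughly like this (a sketch
    of the shmem_fs.h inline):

        static inline struct page *shmem_read_mapping_page(
                                struct address_space *mapping, pgoff_t index)
        {
                return shmem_read_mapping_page_gfp(mapping, index,
                                                   mapping_gfp_mask(mapping));
        }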

    Signed-off-by: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • 2.6.35's new truncate convention gave tmpfs the opportunity to control
    its file truncation, no longer enforced from outside by vmtruncate().
    We shall want to build upon that, to handle pagecache and swap together.

    Slightly redefine the ->truncate_range interface: let it now be called
    between the unmap_mapping_range()s, with the filesystem responsible for
    doing the truncate_inode_pages_range() from it - just as the filesystem
    is nowadays responsible for doing that from its ->setattr.

    Let's rename shmem_notify_change() to shmem_setattr(). Instead of
    calling the generic truncate_setsize(), bring that code in so we can
    call shmem_truncate_range() - which will later be updated to perform its
    own variant of truncate_inode_pages_range().

    Remove the punch_hole unmap_mapping_range() from shmem_truncate_range():
    now that the COW's unmap_mapping_range() comes after ->truncate_range,
    there is no need to call it a third time.

    Export shmem_truncate_range() and add it to the list in shmem_fs.h, so
    that i915_gem_object_truncate() can call it explicitly in future; get
    this patch in first, then update drm/i915 once this is available (until
    then, i915 will just be doing the truncate_inode_pages() twice).

    Though introduced five years ago, no other filesystem is implementing
    ->truncate_range, and its only other user is madvise(,,MADV_REMOVE): we
    expect to convert it to fallocate(,FALLOC_FL_PUNCH_HOLE,,) shortly,
    whereupon ->truncate_range can be removed from inode_operations -
    shmem_truncate_range() will help i915 across that transition too.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Before adding any more global entry points into shmem.c, gather such
    prototypes into shmem_fs.h. Remove mm's own declarations from swap.h,
    but for now leave the ones in mm.h: because shmem_file_setup() and
    shmem_zero_setup() are called from various places, and we should not
    force other subsystems to update immediately.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • You would expect to find vmtruncate_range() next to vmtruncate() in
    mm/truncate.c: move it there.

    Signed-off-by: Hugh Dickins
    Acked-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

23 Jun, 2011

1 commit

  • Commit 959ecc48fc75 ("mm/memory_hotplug.c: fix building of node hotplug
    zonelist") does not protect the build_all_zonelists() call with
    zonelists_mutex as needed. This can lead to races in constructing
    zonelist ordering if a concurrent build is underway. Protecting this
    with lock_memory_hotplug() is insufficient since zonelists can be
    rebuilt through sysfs as well.
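
    The shape of the fix, as a sketch of the hotplug path:

        /* rebuild zonelists only while holding the dedicated mutex, so a
         * concurrent sysfs-triggered rebuild cannot race with us */
        mutex_lock(&zonelists_mutex);
        build_all_zonelists(NULL);
        mutex_unlock(&zonelists_mutex);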

    Signed-off-by: David Rientjes
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Linus Torvalds

    David Rientjes