10 Mar, 2012

1 commit

  • Respectfully revert commit e6ca7b89dc76 "memcg: fix mapcount check
    in move charge code for anonymous page" for the 3.3 release, so that
    it behaves exactly like releases 2.6.35 through 3.2 in this respect.

    Horiguchi-san's commit is correct in itself, 1 makes much more sense
    than 2 in that check; but it does not go far enough - swapcount
    should be considered too - if we really want such a check at all.

    We appear to have reached agreement now, and expect that 3.4 will
    remove the mapcount check, but had better not make 3.3 different.

    Signed-off-by: Hugh Dickins
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

07 Mar, 2012

4 commits

  • Several users of "find_vma_prev()" were not in fact interested in the
    previous vma if there was no primary vma to be found either. And in
    those cases, we're much better off just using the regular "find_vma()",
    and then "prev" can be looked up by just checking vma->vm_prev.

    The find_vma_prev() semantics are fairly subtle (see Mikulas' recent
    commit 83cd904d271b: "mm: fix find_vma_prev"), and the whole "return
    prev by reference" means that it generates worse code too.

    Thus this "let's avoid using this inconvenient and clearly too subtle
    interface when we don't really have to" patch.

    Cc: Mikulas Patocka
    Cc: KOSAKI Motohiro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Commit 6bd4837de96e ("mm: simplify find_vma_prev()") broke memory
    management on PA-RISC.

    After application of the patch, programs that allocate big arrays on the
    stack crash with a segfault. For example, this will crash if compiled
    without optimization:

    int main()
    {
        char array[200000];
        array[199999] = 0;
        return 0;
    }

    The reason is that PA-RISC has an upward-growing stack and the stack is
    usually the last memory area. In the above example, a page fault happens
    above the stack.

    Previously, if we passed too high an address to find_vma_prev, it returned
    NULL and stored the last VMA in *pprev. After the "simplify find_vma_prev"
    change, it stores NULL in *pprev. Consequently, the stack area is not
    found and it is not expanded, as it used to be before the change.

    This patch restores the old behavior and makes it return the last VMA in
    *pprev if the requested address is higher than the address of any other VMA.
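
    The restored semantics can be sketched in userspace C. This is only an
    illustration, assuming a simplified, hypothetical struct vma kept on a
    doubly linked list sorted by address; it is not the kernel's code, which
    walks an rbtree.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's vm_area_struct. */
struct vma {
    unsigned long vm_start;   /* first address inside the area */
    unsigned long vm_end;     /* first address past the area */
    struct vma *vm_prev;
    struct vma *vm_next;
};

/* Return the first area with addr < vm_end, or NULL if addr is above all. */
struct vma *find_vma(struct vma *head, unsigned long addr)
{
    for (struct vma *v = head; v != NULL; v = v->vm_next)
        if (addr < v->vm_end)
            return v;
    return NULL;
}

/*
 * Same lookup, but also report the preceding area through *pprev.
 * When addr lies above every area, *pprev is the last area, not NULL,
 * so a caller can still find an upward-growing stack to expand.
 */
struct vma *find_vma_prev(struct vma *head, unsigned long addr,
                          struct vma **pprev)
{
    struct vma *v = find_vma(head, addr);

    if (v != NULL) {
        *pprev = v->vm_prev;
    } else {
        struct vma *last = head;
        while (last != NULL && last->vm_next != NULL)
            last = last->vm_next;
        *pprev = last;
    }
    return v;
}
```

    With two areas on the list, a lookup above both returns NULL and leaves
    the second area in *pprev.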

    Signed-off-by: Mikulas Patocka
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     
  • Currently error is -ENOMEM when rejecting VM_GROWSDOWN|VM_GROWSUP
    from shared anonymous: hoist the file case's -EINVAL up for both.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Why is memcg's swap accounting so broken? Insane counts, wrong
    ownership, unfreeable structures, which later get freed and then
    accessed after free.

    Turns out to be a tiny little 3.3-rc1 regression in 9fb4b7cc0724
    "page_cgroup: add helper function to get swap_cgroup": the helper
    function (actually named lookup_swap_cgroup()) returns an address using
    void* arithmetic, but the structure in question is a short.
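
    A minimal sketch of that class of bug, assuming a hypothetical
    struct swap_cgroup holding a single unsigned short. Arithmetic on void *
    is a GNU extension that steps one byte at a time, so it is modelled here
    with char *. The names lookup_bytewise() and lookup_typed() are
    illustrative and are not the kernel helper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in: each record is a single unsigned short id. */
struct swap_cgroup {
    unsigned short id;
};

/* Buggy form: byte-granular arithmetic, so the index is not scaled. */
void *lookup_bytewise(void *base, size_t index)
{
    return (char *)base + index;
}

/* Fixed form: typed arithmetic scales the index by the record size. */
void *lookup_typed(void *base, size_t index)
{
    return (struct swap_cgroup *)base + index;
}
```

    For any index greater than zero the two forms point at different
    records, which is enough to produce wrong counts and wrong ownership.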

    Signed-off-by: Hugh Dickins
    Reviewed-by: Bob Liu
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

06 Mar, 2012

8 commits

  • Merge the emailed series of 19 patches from Andrew Morton

    * akpm:
    rapidio/tsi721: fix queue wrapping bug in inbound doorbell handler
    memcg: fix mapcount check in move charge code for anonymous page
    mm: thp: fix BUG on mm->nr_ptes
    alpha: fix 32/64-bit bug in futex support
    memcg: fix GPF when cgroup removal races with last exit
    debugobjects: Fix selftest for static warnings
    floppy/scsi: fix setting of BIO flags
    memcg: fix deadlock by inverting lrucare nesting
    drivers/rtc/rtc-r9701.c: fix crash in r9701_remove()
    c2port: class_create() returns an ERR_PTR
    pps: class_create() returns an ERR_PTR, not NULL
    hung_task: fix the broken rcu_lock_break() logic
    vfork: kill PF_STARTING
    coredump_wait: don't call complete_vfork_done()
    vfork: make it killable
    vfork: introduce complete_vfork_done()
    aio: wake up waiters when freeing unused kiocbs
    kprobes: return proper error code from register_kprobe()
    kmsg_dump: don't run on non-error paths by default

    Linus Torvalds
     
  • Currently the charge on shared anonymous pages is supposed not to be moved
    in task migration. To implement this, we need to check that mapcount > 1,
    instead of > 2. So this patch fixes it.

    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Daisuke Nishimura
    Cc: Andrea Arcangeli
    Cc: KAMEZAWA Hiroyuki
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Dave Jones reports a few Fedora users hitting the BUG_ON(mm->nr_ptes...)
    in exit_mmap() recently.

    Quoting Hugh's discovery and explanation of the SMP race condition:

    "mm->nr_ptes had unusual locking: down_read mmap_sem plus
    page_table_lock when incrementing, down_write mmap_sem (or mm_users
    0) when decrementing; whereas THP is careful to increment and
    decrement it under page_table_lock.

    Now most of those paths in THP also hold mmap_sem for read or write
    (with appropriate checks on mm_users), but two do not: when
    split_huge_page() is called by hwpoison_user_mappings(), and when
    called by add_to_swap().

    It's conceivable that the latter case is responsible for the
    exit_mmap() BUG_ON mm->nr_ptes that has been reported on Fedora."

    The simplest way to fix it without having to alter the locking is to make
    split_huge_page() a no-op in nr_ptes terms, by counting the preallocated
    pagetables that exist for every mapped hugepage. It was an arbitrary
    choice not to count them, and either way is neither wrong nor right,
    because they are not used but they're still allocated.

    Reported-by: Dave Jones
    Reported-by: Hugh Dickins
    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: David Rientjes
    Cc: Josh Boyer
    Cc: [3.0.x, 3.1.x, 3.2.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • When moving tasks from old memcg (with move_charge_at_immigrate on new
    memcg), followed by removal of old memcg, hit General Protection Fault in
    mem_cgroup_lru_del_list() (called from release_pages called from
    free_pages_and_swap_cache from tlb_flush_mmu from tlb_finish_mmu from
    exit_mmap from mmput from exit_mm from do_exit).

    Somewhat reproducible, takes a few hours: the old struct mem_cgroup has
    been freed and poisoned by SLAB_DEBUG, but mem_cgroup_lru_del_list() is
    still trying to update its stats, and take page off lru before freeing.

    A task, or a charge, or a page on lru: each secures a memcg against
    removal. In this case, the last task has been moved out of the old memcg,
    and it is exiting: anonymous pages are uncharged one by one from the
    memcg, as they are zapped from its pagetables, so the charge gets down to
    0; but the pages themselves are queued in an mmu_gather for freeing.

    Most of those pages will be on lru (and force_empty is careful to
    lru_add_drain_all, to add pages from pagevec to lru first), but not
    necessarily all: perhaps some have been isolated for page reclaim, perhaps
    some isolated for other reasons. So, force_empty may find no task, no
    charge and no page on lru, and let the removal proceed.

    There would still be no problem if these pages were immediately freed; but
    typically (and the put_page_testzero protocol demands it) they have to be
    added back to lru before they are found freeable, then removed from lru
    and freed. We don't see the issue when adding, because the
    mem_cgroup_iter() loops keep their own reference to the memcg being
    scanned; but when it comes to mem_cgroup_lru_del_list().

    I believe this was not an issue in v3.2: there, PageCgroupAcctLRU and
    PageCgroupUsed flags were used (like a trick with mirrors) to deflect view
    of pc->mem_cgroup to the stable root_mem_cgroup when neither set.
    38c5d72f3ebe ("memcg: simplify LRU handling by new rule") mercifully
    removed those convolutions, but left this General Protection Fault.

    But it's surprisingly easy to restore the old behaviour: just check
    PageCgroupUsed in mem_cgroup_lru_add_list() (which decides on which lruvec
    to add), and reset pc to root_mem_cgroup if page is uncharged. A risky
    change? just going back to how it worked before; testing, and an audit of
    uses of pc->mem_cgroup, show no problem.

    And there's a nice bonus: with mem_cgroup_lru_add_list() itself making
    sure that an uncharged page goes to root lru, mem_cgroup_reset_owner() no
    longer has any purpose, and we can safely revert 4e5f01c2b9b9 ("memcg:
    clear pc->mem_cgroup if necessary").

    Calling update_page_reclaim_stat() after add_page_to_lru_list() in swap.c
    is not strictly necessary: the lru_lock there, with RCU before memcg
    structures are freed, makes mem_cgroup_get_reclaim_stat_from_page safe
    without that; but it seems cleaner to rely on one dependency less.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • We have forgotten the rules of lock nesting: the irq-safe ones must be
    taken inside the non-irq-safe ones, otherwise we are open to deadlock:

            CPU0                    CPU1
            ----                    ----
       lock(&(&pc->lock)->rlock);
                                    local_irq_disable();
                                    lock(&(&zone->lru_lock)->rlock);
                                    lock(&(&pc->lock)->rlock);
       <Interrupt>
         lock(&(&zone->lru_lock)->rlock);

    To check a different locking issue, I happened to add a spin_lock to
    memcg's bit_spin_lock in lock_page_cgroup(), and lockdep very quickly
    complained about __mem_cgroup_commit_charge_lrucare() (on CPU1 above).

    So delete __mem_cgroup_commit_charge_lrucare(), passing a bool lrucare to
    __mem_cgroup_commit_charge() instead, taking zone->lru_lock under
    lock_page_cgroup() in the lrucare case.

    The original was using spin_lock_irqsave, but we'd be in more trouble if
    it were ever called at interrupt time: unconditional _irq is enough. And
    ClearPageLRU before del from lru, SetPageLRU before add to lru: no strong
    reason, but that is the ordering used consistently elsewhere.

    Fixes 36b62ad539498d00c2d280a151a ("memcg: simplify corner case handling
    of LRU").

    Signed-off-by: Hugh Dickins
    Acked-by: Johannes Weiner
    Cc: Konstantin Khlebnikov
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Pull per-cpu patches from Tejun Heo:
    "This pull request contains four patches. One replaces manual clearing
    with bitmap_clear(), two fix generic definition of __this_cpu ops so
    that they don't choose unnecessarily strict arch version. One makes
    _this_cpu definition use raw_local_irq_*() so that it doesn't end up
    wrecking irq on/off state tracking when used from inside lockdep.

    Of the four patches, the raw_local_irq_*() update is the most
    important, so please feel free to cherry pick only that one patch and
    ignore the rest if you want to - commit e920d5971d 'percpu: use
    raw_local_irq_* in _this_cpu op'."

    * 'for-3.3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
    percpu: fix __this_cpu_{sub,inc,dec}_return() definition
    percpu: use raw_local_irq_* in _this_cpu op
    percpu: fix generic definition of __this_cpu_add_and_return()
    percpu: use bitmap_clear

    Linus Torvalds
     
  • All other callers already hold either ->mmap_sem (exclusive) or
    ->page_table_lock. And we need it because some page table flushing
    instances do work explicitly with page tables.

    See e.g. arch/powerpc/mm/tlb_hash32.c, flush_tlb_range() and
    flush_range() in there. The same goes for uml, with a lot more
    extensive playing with page tables.

    Almost all callers are actually fine - flush_tlb_range() may have no
    need to bother playing with page tables, but it can do so safely; again,
    this caller is the sole exception - everything else either has exclusive
    ->mmap_sem on the mm in question, or mm->page_table_lock is held.

    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
     
  • Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
     

01 Mar, 2012

1 commit

  • memblock allocator aligns @size to @align to reduce the amount
    of fragmentation. Commit:

    7bd0b0f0da ("memblock: Reimplement memblock allocation using reverse free area iterator")

    broke it by incorrectly relocating @size aligning to
    memblock_find_in_range_node(). As the aligned size is not
    propagated back to memblock_alloc_base_nid(), the actually
    reserved size isn't aligned.

    While this increases memory use for memblock reserved array,
    this shouldn't cause any critical failure; however, it seems
    that the size aligning was hiding a use-beyond-allocation bug in
    sparc64 and losing the aligning causes boot failure.

    The underlying problem is currently being debugged but this is a
    proper fix in itself, it's already pretty late in -rc cycle for
    boot failures and reverting the change for debugging isn't
    difficult. Restore the size aligning moving it to
    memblock_alloc_base_nid().
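
    The idea can be sketched as follows, as a userspace illustration only.
    The toy finder find_top_down(), struct region and alloc_region() are
    hypothetical names; the point is that the size is aligned once at the
    allocation entry point, so the same aligned value is used both for the
    search and for the reservation.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to the next multiple of align (align must be a power of two). */
uint64_t round_up_pow2(uint64_t x, uint64_t align)
{
    return (x + align - 1) & ~(align - 1);
}

/* Hypothetical record of a reserved region. */
struct region {
    uint64_t base;
    uint64_t size;
};

/* Toy finder: carve an aligned block from the top of [0, limit). */
uint64_t find_top_down(uint64_t limit, uint64_t size, uint64_t align)
{
    if (size > limit)
        return UINT64_MAX;                 /* nothing fits */
    return (limit - size) & ~(align - 1);
}

/* Align the size here, then search and reserve with that aligned size. */
int alloc_region(uint64_t limit, uint64_t size, uint64_t align,
                 struct region *out)
{
    size = round_up_pow2(size, align);

    uint64_t base = find_top_down(limit, size, align);
    if (base == UINT64_MAX)
        return -1;

    out->base = base;
    out->size = size;                      /* reserved size is aligned too */
    return 0;
}
```

    If the rounding happened only inside the finder, the caller would
    reserve the original, unaligned size, which is the behaviour described
    above.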

    Reported-by: Meelis Roos
    Signed-off-by: Tejun Heo
    Cc: David S. Miller
    Cc: Grant Likely
    Cc: Rob Herring
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20120228205621.GC3252@dhcp-172-17-108-109.mtv.corp.google.com
    Signed-off-by: Ingo Molnar
    LKML-Reference:

    Tejun Heo
     

25 Feb, 2012

3 commits

  • Don't clear vm_mm in a deleted VMA as it's unnecessary and might
    conceivably break the filesystem or driver VMA close routine.

    Reported-by: Al Viro
    Signed-off-by: David Howells
    Acked-by: Al Viro
    cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Lock i_mmap_mutex for access to the VMA prio list to prevent concurrent
    access. Currently, certain parts of the mmap handling are protected by
    the region mutex, but not all.

    Reported-by: Al Viro
    Signed-off-by: David Howells
    Acked-by: Al Viro
    cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    David Howells
     
  • There is an issue when memcg unregisters events that were attached to
    the same eventfd:

    - On the first call mem_cgroup_usage_unregister_event() removes all
    events attached to a given eventfd, and if there were no events left,
    thresholds->primary would become NULL;

    - Since there were several events registered, cgroups core will call
    mem_cgroup_usage_unregister_event() again, but now kernel will oops,
    as the function doesn't expect that thresholds->primary may be NULL.

    It's a good question whether mem_cgroup_usage_unregister_event()
    should actually remove all events in one go, but nowadays it can't
    do any better as cftype->unregister_event callback doesn't pass
    any private event-associated cookie. So, let's fix the issue by
    simply checking for thresholds->primary.
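
    A minimal sketch of the guard, assuming hypothetical, much-simplified
    structures (a fixed array of eventfd numbers instead of the kernel's
    threshold arrays). The function name unregister_event() is illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much-simplified threshold bookkeeping. */
struct threshold_ary {
    int size;
    int fds[8];
};

struct thresholds {
    struct threshold_ary *primary;
};

/*
 * Remove every entry registered for fd and return how many were removed.
 * If an earlier call already emptied the set, primary is NULL and the
 * function returns early instead of dereferencing it.
 */
int unregister_event(struct thresholds *t, int fd)
{
    if (t->primary == NULL)
        return 0;

    int kept = 0, removed = 0;
    for (int i = 0; i < t->primary->size; i++) {
        if (t->primary->fds[i] == fd)
            removed++;
        else
            t->primary->fds[kept++] = t->primary->fds[i];
    }
    t->primary->size = kept;
    if (kept == 0)
        t->primary = NULL;    /* nothing left, as in the first bullet above */
    return removed;
}
```

    Calling it twice for the same eventfd removes everything on the first
    call and does nothing on the second.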

    FWIW, w/o the patch the following oops may be observed:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
    IP: [] mem_cgroup_usage_unregister_event+0x9c/0x1f0
    Pid: 574, comm: kworker/0:2 Not tainted 3.3.0-rc4+ #9 Bochs Bochs
    RIP: 0010:[] [] mem_cgroup_usage_unregister_event+0x9c/0x1f0
    RSP: 0018:ffff88001d0b9d60 EFLAGS: 00010246
    Process kworker/0:2 (pid: 574, threadinfo ffff88001d0b8000, task ffff88001de91cc0)
    Call Trace:
    [] cgroup_event_remove+0x2b/0x60
    [] process_one_work+0x174/0x450
    [] worker_thread+0x123/0x2d0

    Cc: stable
    Signed-off-by: Anton Vorontsov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Signed-off-by: Linus Torvalds

    Anton Vorontsov
     

14 Feb, 2012

1 commit

  • When the number of dentry cache hash table entries gets too high
    (2147483648 entries), as happens by default on a 16TB system, use of a
    signed integer in the dcache_init() initialization loop prevents the
    dentry_hashtable from getting initialized, causing a panic in
    __d_lookup(). Fix this in dcache_init() and similar areas.
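
    The effect of a signed bound can be modelled without actually
    overflowing. The sketch below assumes a 32-bit two's-complement int and
    computes, in 64-bit arithmetic, how many iterations a loop of the form
    "for (loop = 0; loop < (1 << shift); loop++)" would run. The function
    names are illustrative, and an unsigned bound is shown as one possible
    remedy.

```c
#include <assert.h>
#include <limits.h>

/*
 * Iteration count when the bound is held in a 32-bit signed int.
 * The wraparound is computed explicitly, so this sketch itself never
 * overflows. Requires shift < 63.
 */
long long iterations_with_int_bound(unsigned shift)
{
    long long bound = 1LL << shift;

    if (bound > INT_MAX)
        bound -= 1LL << 32;      /* value a 32-bit int would end up holding */
    return bound > 0 ? bound : 0;
}

/* Iteration count with an unsigned 32-bit bound. Requires shift <= 31. */
long long iterations_with_unsigned_bound(unsigned shift)
{
    unsigned int bound = 1u << shift;

    return (long long)bound;
}
```

    With shift = 31 (2147483648 entries) the signed bound is negative and the
    loop body never runs, so the table is left uninitialized.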

    Signed-off-by: Dimitri Sivanich
    Acked-by: David S. Miller
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dimitri Sivanich
     

11 Feb, 2012

1 commit

  • fix 1 mysterious divide error
    fix 3 NULL dereference bugs in writeback tracing, on SD card removal w/o umount

    * tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: fix dereferencing NULL bdi->dev on trace_writeback_queue
    lib: proportion: lower PROP_MAX_SHIFT to 32 on 64-bit kernel
    writeback: fix NULL bdi->dev in trace writeback_single_inode
    backing-dev: fix wakeup timer races with bdi_unregister()

    Linus Torvalds
     

09 Feb, 2012

2 commits

  • Fix CONFIG_TRANSPARENT_HUGEPAGE=y CONFIG_SMP=n CONFIG_DEBUG_VM=y
    CONFIG_DEBUG_SPINLOCK=n kernel: spin_is_locked() is then always false,
    and so triggers some BUGs in Transparent HugePage codepaths.

    asm-generic/bug.h mentions this problem, and provides a WARN_ON_SMP(x);
    but being too lazy to add VM_BUG_ON_SMP, BUG_ON_SMP, WARN_ON_SMP_ONCE,
    VM_WARN_ON_SMP_ONCE, just test NR_CPUS != 1 in the existing VM_BUG_ONs.
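
    A rough userspace model of the problem and of the guard. It assumes a
    uniprocessor build where spin_is_locked() always reports false; the
    would_bug_*() helpers are hypothetical and stand in for the condition
    passed to VM_BUG_ON().

```c
#include <assert.h>

/* Assumed build configuration for this sketch: uniprocessor. */
#define NR_CPUS 1

typedef struct { int unused; } spinlock_t;

/* Without SMP or spinlock debugging the lock state is not tracked. */
int spin_is_locked(const spinlock_t *lock)
{
    (void)lock;
    return 0;
}

/* Unguarded condition: fires on this build although nothing is wrong. */
int would_bug_naive(const spinlock_t *lock)
{
    return !spin_is_locked(lock);
}

/* Guarded condition: the lock test only counts when NR_CPUS != 1. */
int would_bug_guarded(const spinlock_t *lock)
{
    return NR_CPUS != 1 && !spin_is_locked(lock);
}
```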

    Signed-off-by: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When isolating pages for migration, migration starts at the start of a
    zone while the free scanner starts at the end of the zone. Migration
    avoids entering a new zone by never going beyond the free scanner.

    Unfortunately, in very rare cases nodes can overlap. When this happens,
    migration isolates pages without the LRU lock held, corrupting lists
    which will trigger errors in reclaim or during page free such as in the
    following oops

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    IP: [] free_pcppages_bulk+0xcc/0x450
    PGD 1dda554067 PUD 1e1cb58067 PMD 0
    Oops: 0000 [#1] SMP
    CPU 37
    Pid: 17088, comm: memcg_process_s Tainted: G X
    RIP: free_pcppages_bulk+0xcc/0x450
    Process memcg_process_s (pid: 17088, threadinfo ffff881c2926e000, task ffff881c2926c0c0)
    Call Trace:
    free_hot_cold_page+0x17e/0x1f0
    __pagevec_free+0x90/0xb0
    release_pages+0x22a/0x260
    pagevec_lru_move_fn+0xf3/0x110
    putback_lru_page+0x66/0xe0
    unmap_and_move+0x156/0x180
    migrate_pages+0x9e/0x1b0
    compact_zone+0x1f3/0x2f0
    compact_zone_order+0xa2/0xe0
    try_to_compact_pages+0xdf/0x110
    __alloc_pages_direct_compact+0xee/0x1c0
    __alloc_pages_slowpath+0x370/0x830
    __alloc_pages_nodemask+0x1b1/0x1c0
    alloc_pages_vma+0x9b/0x160
    do_huge_pmd_anonymous_page+0x160/0x270
    do_page_fault+0x207/0x4c0
    page_fault+0x25/0x30

    The "X" in the taint flag means that external modules were loaded, but
    that is unrelated to the bug triggering. The real problem was that the
    PFN layout looks like this

    Zone PFN ranges:
    DMA 0x00000010 -> 0x00001000
    DMA32 0x00001000 -> 0x00100000
    Normal 0x00100000 -> 0x01e80000
    Movable zone start PFN for each node
    early_node_map[14] active PFN ranges
    0: 0x00000010 -> 0x0000009b
    0: 0x00000100 -> 0x0007a1ec
    0: 0x0007a354 -> 0x0007a379
    0: 0x0007f7ff -> 0x0007f800
    0: 0x00100000 -> 0x00680000
    1: 0x00680000 -> 0x00e80000
    0: 0x00e80000 -> 0x01080000
    1: 0x01080000 -> 0x01280000
    0: 0x01280000 -> 0x01480000
    1: 0x01480000 -> 0x01680000
    0: 0x01680000 -> 0x01880000
    1: 0x01880000 -> 0x01a80000
    0: 0x01a80000 -> 0x01c80000
    1: 0x01c80000 -> 0x01e80000

    The fix is straight-forward. isolate_migratepages() has to make a
    similar check to isolate_freepage to ensure that it never isolates pages
    from a zone it does not hold the LRU lock for.

    This was discovered in a 3.0-based kernel but it affects 3.1.x, 3.2.x
    and current mainline.

    Signed-off-by: Mel Gorman
    Acked-by: Michal Nazarewicz
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

05 Feb, 2012

1 commit

  • * akpm:
    mm: compaction: check pfn_valid when entering a new MAX_ORDER_NR_PAGES block during isolation for migration
    readahead: fix pipeline break caused by block plug
    kprobes: fix a memory leak in function pre_handler_kretprobe()
    drivers/tty/vt/vt_ioctl.c: fix KDFONTOP 32bit compatibility layer
    lkdtm: avoid calling lkdtm_do_action() with spinlock held
    mm/filemap_xip.c: fix race condition in xip_file_fault()
    mm/memcontrol.c: fix warning with CONFIG_NUMA=n
    avr32: select generic atomic64_t support
    mm: postpone migrated page mapping reset
    xtensa: fix memscan()
    MAINTAINERS: update lguest F: patterns
    MAINTAINERS: remove staging sections
    MAINTAINERS: remove iMX5 section
    MAINTAINERS: update partitions block F: patterns

    Linus Torvalds
     

04 Feb, 2012

6 commits

  • …ing isolation for migration

    When isolating for migration, migration starts at the start of a zone
    which is not necessarily pageblock aligned. Further, it stops isolating
    when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally
    not aligned. This allows isolate_migratepages() to call pfn_to_page() on
    an invalid PFN which can result in a crash. This was originally reported
    against a 3.0-based kernel with the following trace in a crash dump.

    PID: 9902 TASK: d47aecd0 CPU: 0 COMMAND: "memcg_process_s"
    #0 [d72d3ad0] crash_kexec at c028cfdb
    #1 [d72d3b24] oops_end at c05c5322
    #2 [d72d3b38] __bad_area_nosemaphore at c0227e60
    #3 [d72d3bec] bad_area at c0227fb6
    #4 [d72d3c00] do_page_fault at c05c72ec
    #5 [d72d3c80] error_code (via page_fault) at c05c47a4
    EAX: 00000000 EBX: 000c0000 ECX: 00000001 EDX: 00000807 EBP: 000c0000
    DS: 007b ESI: 00000001 ES: 007b EDI: f3000a80 GS: 6f50
    CS: 0060 EIP: c030b15a ERR: ffffffff EFLAGS: 00010002
    #6 [d72d3cb4] isolate_migratepages at c030b15a
    #7 [d72d3d14] zone_watermark_ok at c02d26cb
    #8 [d72d3d2c] compact_zone at c030b8de
    #9 [d72d3d68] compact_zone_order at c030bba1
    #10 [d72d3db4] try_to_compact_pages at c030bc84
    #11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7
    #12 [d72d3e08] __alloc_pages_slowpath at c02d66c7
    #13 [d72d3e78] __alloc_pages_nodemask at c02d6a97
    #14 [d72d3eb8] alloc_pages_vma at c030a845
    #15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb
    #16 [d72d3f00] handle_mm_fault at c02f36c6
    #17 [d72d3f30] do_page_fault at c05c70ed
    #18 [d72d3fb0] error_code (via page_fault) at c05c47a4
    EAX: b71ff000 EBX: 00000001 ECX: 00001600 EDX: 00000431
    DS: 007b ESI: 08048950 ES: 007b EDI: bfaa3788
    SS: 007b ESP: bfaa36e0 EBP: bfaa3828 GS: 6f50
    CS: 0073 EIP: 080487c8 ERR: ffffffff EFLAGS: 00010202

    It was also reported by Herbert van den Bergh against a 3.1-based kernel
    with the following snippet from the console log.

    BUG: unable to handle kernel paging request at 01c00008
    IP: [<c0522399>] isolate_migratepages+0x119/0x390
    *pdpt = 000000002f7ce001 *pde = 0000000000000000

    It is expected that it also affects 3.2.x and current mainline.

    The problem is that pfn_valid is only called on the first PFN being
    checked and that PFN is not necessarily aligned. Let's say we have a case
    like this

    H = MAX_ORDER_NR_PAGES boundary
    | = pageblock boundary
    m = cc->migrate_pfn
    f = cc->free_pfn
    o = memory hole

    H------|------H------|----m-Hoooooo|ooooooH-f----|------H

    The migrate_pfn is just below a memory hole and the free scanner is beyond
    the hole. When isolate_migratepages started, it scans from migrate_pfn to
    migrate_pfn+pageblock_nr_pages which is now in a memory hole. It checks
    pfn_valid() on the first PFN but then scans into the hole where there are
    not necessarily valid struct pages.

    This patch ensures that isolate_migratepages calls pfn_valid when
    necessary.
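
    The check can be sketched like this. It is a toy model, assuming a
    single hole aligned to MAX_ORDER_NR_PAGES and an illustrative block size
    of 1024 pages; struct memmap and scan_range() are hypothetical, and the
    caller is assumed to have validated the starting PFN.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative value; the real one depends on architecture and config. */
#define MAX_ORDER_NR_PAGES 1024UL

/* Toy memory map with one hole covering [hole_start, hole_end). */
struct memmap {
    unsigned long hole_start;
    unsigned long hole_end;
};

bool pfn_valid(const struct memmap *m, unsigned long pfn)
{
    return pfn < m->hole_start || pfn >= m->hole_end;
}

/*
 * Count the pages visited in [low_pfn, end_pfn). pfn_valid() is checked
 * again each time the scan enters a new MAX_ORDER_NR_PAGES block, and an
 * invalid block is skipped as a whole.
 */
unsigned long scan_range(const struct memmap *m,
                         unsigned long low_pfn, unsigned long end_pfn)
{
    unsigned long visited = 0;

    for (unsigned long pfn = low_pfn; pfn < end_pfn; pfn++) {
        if ((pfn & (MAX_ORDER_NR_PAGES - 1)) == 0 && !pfn_valid(m, pfn)) {
            /* The loop increment then moves to the next block. */
            pfn += MAX_ORDER_NR_PAGES - 1;
            continue;
        }
        visited++;   /* a real scanner would call pfn_to_page() here */
    }
    return visited;
}
```

    Starting just below the hole, the scan stops touching pages at the block
    boundary and resumes only in the next valid block.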

    Reported-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
    Tested-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Michal Nazarewicz <mina86@mina86.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     
  • Herbert Poetzl reported a performance regression since 2.6.39. The test
    is a simple dd read, but with big block size. The reason is:

    T1: ra (A, A+128k), (A+128k, A+256k)
    T2: lock_page for page A, submit the 256k
    T3: hit page A+128k, ra (A+256k, A+384k). The range isn't submitted
    because of the plug, and there isn't any lock_page until we hit page
    A+256k because all pages from A to A+256k are in memory
    T4: hit page A+256k, ra (A+384k, A+512k). Because of the plug, the range
    isn't submitted again.
    T5: lock_page A+256k, so (A+256k, A+512k) will be submitted. The task is
    waiting for (A+256k, A+512k) to finish.

    There is no request to disk in T3 and T4, so readahead pipeline breaks.

    We really don't need a block plug in generic_file_aio_read() for buffered
    I/O. The readahead already has a plug and has fine-grained control over
    when I/O should be submitted. Deleting the plug for buffered I/O fixes
    the regression.

    One side effect: with the plug the request size is 256k, whereas without
    it the size is 128k. This is because the default ra size is 128k, and it
    is not a reason to keep the plug here.

    Vivek said:

    : We submit some readahead IO to device request queue but because of nested
    : plug, queue never gets unplugged. When read logic reaches a page which is
    : not in page cache, it waits for page to be read from the disk
    : (lock_page_killable()) and that time we flush the plug list.
    :
    : So effectively read ahead logic is kind of broken in parts because of
    : nested plugging. Removing top level plug (generic_file_aio_read()) for
    : buffered reads, will allow unplugging queue earlier for readahead.

    Signed-off-by: Shaohua Li
    Signed-off-by: Wu Fengguang
    Reported-by: Herbert Poetzl
    Tested-by: Eric Dumazet
    Cc: Christoph Hellwig
    Cc: Jens Axboe
    Cc: Vivek Goyal
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Fix a race condition that shows in conjunction with xip_file_fault() when
    two threads of the same user process fault on the same memory page.

    In this case, the race winner will install the page table entry and the
    unlucky loser will cause an oops: xip_file_fault calls vm_insert_pfn (via
    vm_insert_mixed) which drops out at this check:

    retval = -EBUSY;
    if (!pte_none(*pte))
        goto out_unlock;

    The resulting -EBUSY return value will trigger a BUG_ON() in
    xip_file_fault.

    This fix simply considers the fault as fixed in this case, because the
    race winner has successfully installed the pte.
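
    The resulting error handling might look roughly like the sketch below.
    The enum values and the function name are hypothetical and are not the
    kernel's VM_FAULT_* codes; the -ENOMEM branch and the generic error
    branch are assumptions added to make the example complete.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical result codes for this sketch. */
enum fault_result { FAULT_DONE, FAULT_OOM, FAULT_ERROR };

/*
 * Translate the return value of the pte-install step into a fault result.
 * -EBUSY means another thread already installed the pte, so the fault is
 * treated as handled rather than as a fatal condition.
 */
enum fault_result fault_result_from_insert(int err)
{
    if (err == -ENOMEM)
        return FAULT_OOM;       /* assumed handling, for illustration */
    if (err == 0 || err == -EBUSY)
        return FAULT_DONE;
    return FAULT_ERROR;         /* anything else remains an error */
}
```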

    [akpm@linux-foundation.org: use conventional (and consistent) comment layout]
    Reported-by: David Sadler
    Signed-off-by: Carsten Otte
    Reported-by: Louis Alex Eisner
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Carsten Otte
     
  • mm/memcontrol.c: In function 'memcg_check_events':
    mm/memcontrol.c:779: warning: unused variable 'do_numainfo'

    Acked-by: Michal Hocko
    Cc: Li Zefan
    Cc: Hiroyuki KAMEZAWA
    Cc: Johannes Weiner
    Acked-by: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Postpone resetting page->mapping until the final remove_migration_ptes().
    Otherwise the expression PageAnon(migration_entry_to_page(entry)) does not
    work.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Trivial kmemleak bug-fixes:

    - Early logging doesn't stop when kmemleak is off by default.
    - Zero-size scanning areas should be ignored (currently it prints a
    warning).

    * tag 'kmemleak-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux:
    kmemleak: Disable early logging when kmemleak is off by default
    kmemleak: Only scan non-zero-size areas

    Linus Torvalds
     

03 Feb, 2012

1 commit

  • This fixes the race in process_vm_core found by Oleg (see

    http://article.gmane.org/gmane.linux.kernel/1235667/

    for details).

    This has been updated since I last sent it as the creation of the new
    mm_access() function did almost exactly the same thing as parts of the
    previous version of this patch did.

    In order to use mm_access() even when /proc isn't enabled, we move it to
    kernel/fork.c where other related process mm access functions already
    are.

    Signed-off-by: Chris Yeoh
    Signed-off-by: Linus Torvalds

    Christopher Yeoh
     

01 Feb, 2012

1 commit

  • While 7a401a972df8e18 ("backing-dev: ensure wakeup_timer is deleted")
    addressed the problem of the bdi being freed with a queued wakeup
    timer, there are other races that could happen if the wakeup timer
    expires after/during bdi_unregister(), before bdi_destroy() is called.

    wakeup_timer_fn() could attempt to wake up a task which has already
    been freed, or could access a NULL bdi->dev via the wake_forker_thread
    tracepoint.

    Cc:
    Cc: Jens Axboe
    Reported-by: Chanho Min
    Reviewed-by: Namjae Jeon
    Signed-off-by: Rabin Vincent
    Signed-off-by: Wu Fengguang

    Rabin Vincent
     

27 Jan, 2012

1 commit


25 Jan, 2012

1 commit

  • Davem says:

    1) Fix JIT code generation on x86-64 for divide by zero, from Eric Dumazet.

    2) tg3 header length computation correction from Eric Dumazet.

    3) More build and reference counting fixes for socket memory cgroup
    code from Glauber Costa.

    4) module.h snuck back into a core header after all the hard work we
    did to remove that, from Paul Gortmaker and Jesper Dangaard Brouer.

    5) Fix PHY naming regression and add some new PCI IDs in stmmac, from
    Alessandro Rubini.

    6) Netlink message generation fix in new team driver, should only advertise
    the entries that changed during events, from Jiri Pirko.

    7) SRIOV VF registration and unregistration fixes, and also add a
    missing PCI ID, from Roopa Prabhu.

    8) Fix infinite loop in tx queue flush code of brcmsmac, from Stanislaw Gruszka.

    9) ftgmac100/ftmac100 build fix, missing interrupt.h include.

    10) Memory leak fix in net/hyperv do_set_multicast() handling, from Wei Yongjun.

    11) Off by one fix in netem packet scheduler, from Vijay Subramanian.

    12) TCP loss detection fix from Yuchung Cheng.

    13) TCP reset packet MD5 calculation uses wrong address, fix from Shawn Lu.

    14) skge carrier assertion and DMA mapping fixes from Stephen Hemminger.

    15) Congestion recovery undo performed at the wrong spot in BIC and CUBIC
    congestion control modules, fix from Neal Cardwell.

    16) Ethtool ETHTOOL_GSSET_INFO is unnecessarily restrictive, from Michał Mirosław.

    17) Fix triggerable race in ipv6 sysctl handling, from Francesco Ruggeri.

    18) Statistics bug fixes in mlx4 from Eugenia Emantayev.

    19) rds locking bug fix during info dumps, from yours truly.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (67 commits)
    rds: Make rds_sock_lock BH rather than IRQ safe.
    netprio_cgroup.h: dont include module.h from other includes
    net: flow_dissector.c missing include linux/export.h
    team: send only changed options/ports via netlink
    net/hyperv: fix possible memory leak in do_set_multicast()
    drivers/net: dsa/mv88e6xxx.c files need linux/module.h
    stmmac: added PCI identifiers
    llc: Fix race condition in llc_ui_recvmsg
    stmmac: fix phy naming inconsistency
    dsa: Add reporting of silicon revision for Marvell 88E6123/88E6161/88E6165 switches.
    tg3: fix ipv6 header length computation
    skge: add byte queue limit support
    mv643xx_eth: Add Rx Discard and Rx Overrun statistics
    bnx2x: fix compilation error with SOE in fw_dump
    bnx2x: handle CHIP_REVISION during init_one
    bnx2x: allow user to change ring size in ISCSI SD mode
    bnx2x: fix Big-Endianess in ethtool -t
    bnx2x: fixed ethtool statistics for MF modes
    bnx2x: credit-leakage fixup on vlan_mac_del_all
    macvlan: fix a possible use after free
    ...

    Linus Torvalds
     

24 Jan, 2012

7 commits

  • Memory migration fills a pte with a migration entry and it doesn't
    update the rss counters. Then it replaces the migration entry with the
    new page (or the old one if migration failed). But between these two
    passes this pte can be unmapped, or a task can fork a child and it will
    get a copy of this migration entry. Nobody accounts for this in the rss
    counters.

    This patch properly adjusts rss counters for migration entries in
    zap_pte_range() and copy_one_pte(). Thus we avoid extra atomic
    operations on the migration fast-path.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
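
The accounting rule described in the commit above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: every `toy_*` name is hypothetical, a single `rss` counter stands in for the kernel's per-type counters, and none of this is the actual code from zap_pte_range() or copy_one_pte(). The point it shows is that a migration entry still represents a counted page, so the unmap and fork paths treat it like a present pte.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pte states for the model; not kernel types. */
enum toy_pte_kind { TOY_PTE_NONE, TOY_PTE_PRESENT, TOY_PTE_MIGRATION };

struct toy_pte { enum toy_pte_kind kind; };
struct toy_mm  { long rss; };

/* Mapping a page is the only place where rss goes up for the owner. */
static void toy_map_page(struct toy_mm *mm, struct toy_pte *pte)
{
    assert(pte->kind == TOY_PTE_NONE);
    pte->kind = TOY_PTE_PRESENT;
    mm->rss++;
}

/* Migration swaps the pte for a migration entry and leaves rss alone. */
static void toy_migrate_begin(struct toy_pte *pte)
{
    assert(pte->kind == TOY_PTE_PRESENT);
    pte->kind = TOY_PTE_MIGRATION;
}

/* Migration end restores a present pte, again without touching rss. */
static void toy_migrate_end(struct toy_pte *pte)
{
    if (pte->kind == TOY_PTE_MIGRATION)
        pte->kind = TOY_PTE_PRESENT;
}

/* Unmap path: a migration entry is uncharged just like a present pte. */
static void toy_zap_pte(struct toy_mm *mm, struct toy_pte *pte)
{
    if (pte->kind == TOY_PTE_PRESENT || pte->kind == TOY_PTE_MIGRATION)
        mm->rss--;
    pte->kind = TOY_PTE_NONE;
}

/* Fork path: the child is charged for an inherited migration entry too. */
static void toy_copy_pte(struct toy_mm *child, struct toy_pte *dst,
                         const struct toy_pte *src)
{
    *dst = *src;
    if (src->kind == TOY_PTE_PRESENT || src->kind == TOY_PTE_MIGRATION)
        child->rss++;
}
```

Because the adjustment lives in the unmap and fork paths, the migration begin and end helpers in this model never touch the counter, which mirrors the commit's remark about keeping extra operations off the migration fast path.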
     
  • Commit cc39c6a9bbde ("mm: account skipped entries to avoid looping in
    find_get_pages") correctly fixed an infinite loop; but left a problem
    that find_get_pages() on shmem would return 0 (appearing to callers to
    mean end of tree) when it meets a run of nr_pages swap entries.

    The only uses of find_get_pages() on shmem are via pagevec_lookup(),
    called from invalidate_mapping_pages(), and from shmctl SHM_UNLOCK's
    scan_mapping_unevictable_pages(). The first is already commented, and
    not worth worrying about; but the second can leave pages on the
    Unevictable list after an unusual sequence of swapping and locking.

    Fix that by using shmem_find_get_pages_and_swap() (then ignoring the
    swap) instead of pagevec_lookup().

    But I don't want to contaminate vmscan.c with shmem internals, nor
    shmem.c with LRU locking. So move scan_mapping_unevictable_pages() into
    shmem.c, renaming it shmem_unlock_mapping(); and rename
    check_move_unevictable_page() to check_move_unevictable_pages(), looping
    down an array of pages, oftentimes under the same lock.

    Leave out the "rotate unevictable list" block: that's a leftover from
    when this was used for /proc/sys/vm/scan_unevictable_pages, whose flawed
    handling involved looking at pages at tail of LRU.

    Was there significance to the sequence first ClearPageUnevictable, then
    test page_evictable, then SetPageUnevictable here? I think not, we're
    under LRU lock, and have no barriers between those.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Eric Dumazet
    Cc: Johannes Weiner
    Cc: Michel Lespinasse
    Cc: [back to 3.1 but will need respins]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
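
The early-termination problem described in the commit above can be reproduced in a toy model. This is a sketch under assumptions: the `toy_*` functions, the flat array standing in for the radix tree, and the batch size of 4 are all hypothetical, and the code is not the kernel's find_get_pages() or shmem_find_get_pages_and_swap(). It only shows why a lookup that filters out swap entries can return 0 in the middle of the mapping, and why returning the swap entries to the caller avoids that.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BATCH 4      /* assumed batch size, stands in for nr_pages */
#define TOY_SWAP  (-1)   /* marks a slot that holds a swap entry */

/* Examine up to TOY_BATCH slots and return only the pages among them. */
static size_t toy_lookup_pages(const int *slots, size_t n,
                               size_t *index, int *out)
{
    size_t found = 0, seen = 0;

    assert(index != NULL && out != NULL);
    while (*index < n && seen < TOY_BATCH) {
        int entry = slots[(*index)++];
        seen++;
        if (entry != TOY_SWAP)
            out[found++] = entry;
    }
    return found;   /* 0 for a full batch of swap entries */
}

/* Examine up to TOY_BATCH slots and return all of them, swap included. */
static size_t toy_lookup_pages_and_swap(const int *slots, size_t n,
                                        size_t *index, int *out)
{
    size_t seen = 0;

    assert(index != NULL && out != NULL);
    while (*index < n && seen < TOY_BATCH)
        out[seen++] = slots[(*index)++];
    return seen;    /* 0 only at the real end of the array */
}

/* Caller that treats 0 as "end of tree" and so can stop too early. */
static size_t toy_scan_pages_only(const int *slots, size_t n)
{
    int batch[TOY_BATCH];
    size_t index = 0, visited = 0, got;

    while ((got = toy_lookup_pages(slots, n, &index, batch)) != 0)
        visited += got;
    return visited;
}

/* Caller that receives swap entries, ignores them, and keeps going. */
static size_t toy_scan_with_swap(const int *slots, size_t n)
{
    int batch[TOY_BATCH];
    size_t index = 0, visited = 0, got, i;

    while ((got = toy_lookup_pages_and_swap(slots, n, &index, batch)) != 0)
        for (i = 0; i < got; i++)
            if (batch[i] != TOY_SWAP)
                visited++;
    return visited;
}
```

With a mapping whose second batch consists entirely of swap entries, the first scanner stops before reaching the pages that follow, while the second one visits every page.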
     
  • scan_mapping_unevictable_pages() is used to make SysV SHM_LOCKed pages
    evictable again once the shared memory is unlocked. It does this with
    pagevec_lookup()s across the whole object (which might occupy most of
    memory), and takes 300ms to unlock 7GB here. A cond_resched() every
    PAGEVEC_SIZE pages would be good.

    However, KOSAKI-san points out that this is called under shmem.c's
    info->lock, and it's also under shm.c's shm_lock(), both spinlocks.
    There is no strong reason for that: we need to take these pages off the
    unevictable list soonish, but those locks are not required for it.

    So move the call to scan_mapping_unevictable_pages() from shmem.c's
    unlock handling up to shm.c's unlock handling. Remove the recently
    added barrier, not needed now we have spin_unlock() before the scan.

    Use get_file(), with subsequent fput(), to make sure we have a reference
    to mapping throughout scan_mapping_unevictable_pages(): that's something
    that was previously guaranteed by the shm_lock().

    Remove shmctl's lru_add_drain_all(): we don't fault in pages at SHM_LOCK
    time, and we lazily discover them to be Unevictable later, so it serves
    no purpose for SHM_LOCK; and serves no purpose for SHM_UNLOCK, since
    pages still on pagevec are not marked Unevictable.

    The original code avoided redundant rescans by checking VM_LOCKED flag
    at its level: now avoid them by checking shp's SHM_LOCKED.

    The original code called scan_mapping_unevictable_pages() on a locked
    area at shm_destroy() time: perhaps we once had accounting cross-checks
    which required that, but not now, so skip the overhead and just let
    inode eviction deal with them.

    Put check_move_unevictable_page() and scan_mapping_unevictable_pages()
    under CONFIG_SHMEM (with stub for the TINY case when ramfs is used),
    more as comment than to save space; comment them used for SHM_UNLOCK.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Eric Dumazet
    Cc: Johannes Weiner
    Cc: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
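
The reordering described in the commit above (pin the file, drop the lock, then scan in batches with a reschedule point per batch) can be sketched in user space as follows. This is a model under assumptions, not the shm.c code: the `toy_*` names are hypothetical, an integer flag stands in for the spinlocks, a counter stands in for get_file()/fput(), and the batch size of 14 is an assumed value for PAGEVEC_SIZE.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGEVEC_SIZE 14   /* assumed batch size */

struct toy_shm {
    int    lock_held;       /* stands in for the spinlocks being held */
    int    file_refs;       /* stands in for get_file()/fput() */
    int    shm_locked;      /* stands in for the SHM_LOCKED flag */
    size_t nr_pages;        /* size of the object in pages */
    size_t scanned;         /* pages visited by the scan so far */
    size_t resched_points;  /* places where cond_resched() could run */
};

/* Batched scan; it must run unlocked because it may want to sleep. */
static void toy_scan_unevictable(struct toy_shm *shm)
{
    size_t done = 0;

    assert(!shm->lock_held);
    assert(shm->file_refs > 0);   /* mapping must stay alive */
    while (done < shm->nr_pages) {
        size_t batch = shm->nr_pages - done;

        if (batch > TOY_PAGEVEC_SIZE)
            batch = TOY_PAGEVEC_SIZE;
        done += batch;
        shm->scanned += batch;
        shm->resched_points++;    /* one reschedule point per batch */
    }
}

/* Unlock handling: flip the flag under the lock, scan after dropping it. */
static void toy_shm_unlock(struct toy_shm *shm)
{
    shm->lock_held = 1;
    if (!shm->shm_locked) {       /* already unlocked: skip the rescan */
        shm->lock_held = 0;
        return;
    }
    shm->shm_locked = 0;
    shm->file_refs++;             /* pin before the lock goes away */
    shm->lock_held = 0;

    toy_scan_unevictable(shm);
    shm->file_refs--;             /* drop the temporary reference */
}
```

Taking the reference before releasing the lock is the design choice that replaces the lifetime guarantee the lock used to provide during the scan.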
     
  • Page mapcount should be updated only if we are sure that the page ends
    up in the page table; otherwise we would leak if we couldn't COW due to
    reservations or if idx is out of bounds.

    Signed-off-by: Hillf Danton
    Reviewed-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
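
The ordering rule in the commit above can be shown with a minimal model. This is a sketch under assumptions: `toy_fault_map()`, its parameters, and the return codes are hypothetical and do not reproduce the hugetlb fault path. It only illustrates that the mapcount increment belongs after every check that can make the fault bail out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_page { int mapcount; };

enum toy_fault { TOY_FAULT_OK, TOY_FAULT_SIGBUS, TOY_FAULT_OOM };

/*
 * Model of a fault that maps `page` at offset `idx` of a file that is
 * `size` pages long. All failure exits come before the mapcount update,
 * so a failed fault leaves the count unchanged.
 */
static enum toy_fault toy_fault_map(struct toy_page *page,
                                    unsigned long idx, unsigned long size,
                                    bool need_cow, bool have_reserve)
{
    assert(page != NULL);

    if (idx >= size)                  /* idx is out of bounds */
        return TOY_FAULT_SIGBUS;
    if (need_cow && !have_reserve)    /* cannot COW without a reservation */
        return TOY_FAULT_OOM;

    page->mapcount++;                 /* the page really gets mapped now */
    return TOY_FAULT_OK;
}
```

Had the increment come first, both early returns would leave a mapcount that no page table entry accounts for.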
     
  • end_migration() passes the old page instead of the new page to commit
    the charge. This page descriptor is not used for committing itself,
    though, since we also pass the (correct) page_cgroup descriptor. But
    it's used to find the soft limit tree through the page's zone, so the
    soft limit tree of the old page's zone is updated instead of that of the
    new page's, which might get slightly out of date until the next charge
    reaches the ratelimit point.

    This glitch has been present since 5564e88 ("memcg: condense
    page_cgroup-to-page lookup points").

    This fixes a bug that I introduced in 2.6.38. It's benign enough (to my
    knowledge) that we probably don't want this for stable.

    Reported-by: Hugh Dickins
    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • page_zone() requires an online node; otherwise we are accessing NULL
    NODE_DATA. This is not an issue at the moment because node_zones are
    located at the beginning of the structure, but this might change in the
    future, so better be careful about that.

    Signed-off-by: Michal Hocko
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Fix the following NULL ptr dereference caused by

    cat /sys/devices/system/memory/memory0/removable

    Pid: 13979, comm: sed Not tainted 3.0.13-0.5-default #1 IBM BladeCenter LS21 -[7971PAM]-/Server Blade
    RIP: __count_immobile_pages+0x4/0x100
    Process sed (pid: 13979, threadinfo ffff880221c36000, task ffff88022e788480)
    Call Trace:
    is_pageblock_removable_nolock+0x34/0x40
    is_mem_section_removable+0x74/0xf0
    show_mem_removable+0x41/0x70
    sysfs_read_file+0xfe/0x1c0
    vfs_read+0xc7/0x130
    sys_read+0x53/0xa0
    system_call_fastpath+0x16/0x1b

    We are crashing because we are trying to dereference NULL zone which
    came from pfn=0 (struct page ffffea0000000000). According to the boot
    log this page is marked reserved:
    e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)

    and early_node_map confirms that:
    early_node_map[3] active PFN ranges
    1: 0x00000010 -> 0x0000009c
    1: 0x00000100 -> 0x000bffa3
    1: 0x00100000 -> 0x00240000

    The problem is that memory_present works in PAGE_SECTION_MASK aligned
    blocks so the reserved range sneaks into the section as well. This
    also means that free_area_init_node will not take care of those reserved
    pages and they stay uninitialized.

    When we try to read the removable status we walk through all available
    sections and hope that the zone is valid for all pages in the section.
    But this is not true in this case as the zone and nid are not initialized.

    We have only one node in this particular case and it is marked as node=1
    (rather than 0) and that made the problem visible because page_to_nid will
    return 0 and there are no zones on the node.

    Let's check that the zone is valid and that the given pfn falls into its
    boundaries and mark the section not removable. This might cause some
    false positives, probably, but we do not have any sane way to find out
    whether the page is reserved by the platform or it is just not used for
    whatever other reasons.

    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
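
The check described in the commit above (the zone must be valid and the pfn must fall inside its boundaries, else the section is reported as not removable) can be sketched like this. The `toy_*` names and the two-field zone structure are assumptions made for illustration, and the `has_immobile_pages` flag merely stands in for the page counting that the real code goes on to do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical zone description: first pfn and number of pfns spanned. */
struct toy_zone {
    unsigned long start_pfn;
    unsigned long spanned_pages;
};

/* True only for a non-NULL zone whose span contains pfn. */
static bool toy_pfn_in_zone(const struct toy_zone *zone, unsigned long pfn)
{
    if (zone == NULL)
        return false;
    if (pfn < zone->start_pfn)
        return false;
    return pfn - zone->start_pfn < zone->spanned_pages;
}

/*
 * Report a block as removable only after the zone and pfn checks pass.
 * Uninitialized pages (NULL zone, or pfn outside the zone) are treated
 * as not removable instead of being dereferenced.
 */
static bool toy_pageblock_removable(const struct toy_zone *zone,
                                    unsigned long pfn,
                                    bool has_immobile_pages)
{
    if (!toy_pfn_in_zone(zone, pfn))
        return false;
    return !has_immobile_pages;
}
```

Using the pfn ranges from the boot log above as sample data, a zone that starts at pfn 0x10 rejects pfn 0, which is the reserved page that triggered the crash.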
     

23 Jan, 2012

1 commit