25 Feb, 2012

3 commits

  • Don't clear vm_mm in a deleted VMA as it's unnecessary and might
    conceivably break the filesystem or driver VMA close routine.

    Reported-by: Al Viro
    Signed-off-by: David Howells
    Acked-by: Al Viro
    cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Lock i_mmap_mutex for access to the VMA prio list to prevent concurrent
    access. Currently, certain parts of the mmap handling are protected by
    the region mutex, but not all.
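
    A minimal sketch of the locking this describes, around the prio tree
    update in mm/nommu.c (illustrative only; names follow the 3.x nommu
    code, not necessarily the exact diff):

    mutex_lock(&mapping->i_mmap_mutex);
    flush_dcache_mmap_lock(mapping);
    vma_prio_tree_insert(vma, &mapping->i_mmap);
    flush_dcache_mmap_unlock(mapping);
    mutex_unlock(&mapping->i_mmap_mutex);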

    Reported-by: Al Viro
    Signed-off-by: David Howells
    Acked-by: Al Viro
    cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    David Howells
     
  • There is an issue when memcg unregisters events that were attached to
    the same eventfd:

    - On the first call mem_cgroup_usage_unregister_event() removes all
    events attached to a given eventfd, and if there were no events left,
    thresholds->primary would become NULL;

    - Since several events were registered, the cgroups core will call
    mem_cgroup_usage_unregister_event() again, but now the kernel will oops,
    as the function doesn't expect that thresholds->primary may be NULL.

    It is a fair question whether mem_cgroup_usage_unregister_event()
    should actually remove all events in one go, but for now it can't
    do any better, as the cftype->unregister_event callback doesn't pass
    any private event-associated cookie. So, let's fix the issue by
    simply checking thresholds->primary for NULL.

    Without the patch, the following oops may be observed:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
    IP: [] mem_cgroup_usage_unregister_event+0x9c/0x1f0
    Pid: 574, comm: kworker/0:2 Not tainted 3.3.0-rc4+ #9 Bochs Bochs
    RIP: 0010:[] [] mem_cgroup_usage_unregister_event+0x9c/0x1f0
    RSP: 0018:ffff88001d0b9d60 EFLAGS: 00010246
    Process kworker/0:2 (pid: 574, threadinfo ffff88001d0b8000, task ffff88001de91cc0)
    Call Trace:
    [] cgroup_event_remove+0x2b/0x60
    [] process_one_work+0x174/0x450
    [] worker_thread+0x123/0x2d0
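
    A minimal sketch of the resulting guard in
    mem_cgroup_usage_unregister_event() (illustrative, not the literal diff):

    /* A previous call may already have removed every event attached to
     * this eventfd; in that case there is nothing left to shrink. */
    if (!thresholds->primary)
        goto unlock;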

    Cc: stable
    Signed-off-by: Anton Vorontsov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Signed-off-by: Linus Torvalds

    Anton Vorontsov
     

14 Feb, 2012

1 commit

  • When the number of dentry cache hash table entries gets too high
    (2147483648 entries), as happens by default on a 16TB system, use of a
    signed integer in the dcache_init() initialization loop prevents the
    dentry_hashtable from getting initialized, causing a panic in
    __d_lookup(). Fix this in dcache_init() and similar areas.
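
    The shape of the fix, as a sketch (the point is the unsigned loop
    counter and shift; treat it as an illustration, not the exact diff):

    unsigned int loop;      /* was "int loop", which overflows at 2^31 */

    for (loop = 0; loop < (1U << d_hash_shift); loop++)
        INIT_HLIST_BL_HEAD(dentry_hashtable + loop);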

    Signed-off-by: Dimitri Sivanich
    Acked-by: David S. Miller
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dimitri Sivanich
     

11 Feb, 2012

1 commit

  • Fix one mysterious divide error, and three NULL dereference bugs in
    writeback tracing hit on SD card removal without umount.

    * tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: fix dereferencing NULL bdi->dev on trace_writeback_queue
    lib: proportion: lower PROP_MAX_SHIFT to 32 on 64-bit kernel
    writeback: fix NULL bdi->dev in trace writeback_single_inode
    backing-dev: fix wakeup timer races with bdi_unregister()

    Linus Torvalds
     

09 Feb, 2012

2 commits

  • Fix CONFIG_TRANSPARENT_HUGEPAGE=y CONFIG_SMP=n CONFIG_DEBUG_VM=y
    CONFIG_DEBUG_SPINLOCK=n kernel: spin_is_locked() is then always false,
    and so triggers some BUGs in Transparent HugePage codepaths.

    asm-generic/bug.h mentions this problem and provides a WARN_ON_SMP(x);
    but rather than adding VM_BUG_ON_SMP, BUG_ON_SMP, WARN_ON_SMP_ONCE and
    VM_WARN_ON_SMP_ONCE, just test NR_CPUS != 1 in the existing VM_BUG_ONs.
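
    In other words, the existing assertions take this shape (sketch; the
    lock shown is illustrative, the real ones are whatever the THP paths
    already assert):

    VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&mm->page_table_lock));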

    Signed-off-by: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When isolating pages for migration, the migration scanner starts at the
    start of a zone while the free scanner starts at the end of the zone.
    Migration avoids entering a new zone by never going beyond the free
    scanner's position.

    Unfortunately, in very rare cases nodes can overlap. When this happens,
    migration isolates pages without the LRU lock held, corrupting lists,
    which will trigger errors in reclaim or during page free such as the
    following oops:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    IP: [] free_pcppages_bulk+0xcc/0x450
    PGD 1dda554067 PUD 1e1cb58067 PMD 0
    Oops: 0000 [#1] SMP
    CPU 37
    Pid: 17088, comm: memcg_process_s Tainted: G X
    RIP: free_pcppages_bulk+0xcc/0x450
    Process memcg_process_s (pid: 17088, threadinfo ffff881c2926e000, task ffff881c2926c0c0)
    Call Trace:
    free_hot_cold_page+0x17e/0x1f0
    __pagevec_free+0x90/0xb0
    release_pages+0x22a/0x260
    pagevec_lru_move_fn+0xf3/0x110
    putback_lru_page+0x66/0xe0
    unmap_and_move+0x156/0x180
    migrate_pages+0x9e/0x1b0
    compact_zone+0x1f3/0x2f0
    compact_zone_order+0xa2/0xe0
    try_to_compact_pages+0xdf/0x110
    __alloc_pages_direct_compact+0xee/0x1c0
    __alloc_pages_slowpath+0x370/0x830
    __alloc_pages_nodemask+0x1b1/0x1c0
    alloc_pages_vma+0x9b/0x160
    do_huge_pmd_anonymous_page+0x160/0x270
    do_page_fault+0x207/0x4c0
    page_fault+0x25/0x30

    The "X" in the taint flag means that external modules were loaded but but
    is unrelated to the bug triggering. The real problem was because the PFN
    layout looks like this

    Zone PFN ranges:
    DMA 0x00000010 -> 0x00001000
    DMA32 0x00001000 -> 0x00100000
    Normal 0x00100000 -> 0x01e80000
    Movable zone start PFN for each node
    early_node_map[14] active PFN ranges
    0: 0x00000010 -> 0x0000009b
    0: 0x00000100 -> 0x0007a1ec
    0: 0x0007a354 -> 0x0007a379
    0: 0x0007f7ff -> 0x0007f800
    0: 0x00100000 -> 0x00680000
    1: 0x00680000 -> 0x00e80000
    0: 0x00e80000 -> 0x01080000
    1: 0x01080000 -> 0x01280000
    0: 0x01280000 -> 0x01480000
    1: 0x01480000 -> 0x01680000
    0: 0x01680000 -> 0x01880000
    1: 0x01880000 -> 0x01a80000
    0: 0x01a80000 -> 0x01c80000
    1: 0x01c80000 -> 0x01e80000

    The fix is straightforward. isolate_migratepages() has to make a check
    similar to isolate_freepages() to ensure that it never isolates pages
    from a zone it does not hold the LRU lock for.

    This was discovered in a 3.0-based kernel but it affects 3.1.x, 3.2.x
    and current mainline.
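
    A sketch of the check this adds to the isolate_migratepages() scan loop
    (variable names as in mm/compaction.c of that era; illustrative, not the
    literal diff):

    page = pfn_to_page(low_pfn);
    if (page_zone(page) != zone)
        continue;   /* overlapping node: not our zone, LRU lock not held */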

    Signed-off-by: Mel Gorman
    Acked-by: Michal Nazarewicz
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

05 Feb, 2012

1 commit

  • * akpm:
    mm: compaction: check pfn_valid when entering a new MAX_ORDER_NR_PAGES block during isolation for migration
    readahead: fix pipeline break caused by block plug
    kprobes: fix a memory leak in function pre_handler_kretprobe()
    drivers/tty/vt/vt_ioctl.c: fix KDFONTOP 32bit compatibility layer
    lkdtm: avoid calling lkdtm_do_action() with spinlock held
    mm/filemap_xip.c: fix race condition in xip_file_fault()
    mm/memcontrol.c: fix warning with CONFIG_NUMA=n
    avr32: select generic atomic64_t support
    mm: postpone migrated page mapping reset
    xtensa: fix memscan()
    MAINTAINERS: update lguest F: patterns
    MAINTAINERS: remove staging sections
    MAINTAINERS: remove iMX5 section
    MAINTAINERS: update partitions block F: patterns

    Linus Torvalds
     

04 Feb, 2012

6 commits

  • mm: compaction: check pfn_valid when entering a new MAX_ORDER_NR_PAGES
    block during isolation for migration

    When isolating for migration, migration starts at the start of a zone
    which is not necessarily pageblock aligned. Further, it stops isolating
    when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally
    not aligned. This allows isolate_migratepages() to call pfn_to_page() on
    an invalid PFN which can result in a crash. This was originally reported
    against a 3.0-based kernel with the following trace in a crash dump.

    PID: 9902 TASK: d47aecd0 CPU: 0 COMMAND: "memcg_process_s"
    #0 [d72d3ad0] crash_kexec at c028cfdb
    #1 [d72d3b24] oops_end at c05c5322
    #2 [d72d3b38] __bad_area_nosemaphore at c0227e60
    #3 [d72d3bec] bad_area at c0227fb6
    #4 [d72d3c00] do_page_fault at c05c72ec
    #5 [d72d3c80] error_code (via page_fault) at c05c47a4
    EAX: 00000000 EBX: 000c0000 ECX: 00000001 EDX: 00000807 EBP: 000c0000
    DS: 007b ESI: 00000001 ES: 007b EDI: f3000a80 GS: 6f50
    CS: 0060 EIP: c030b15a ERR: ffffffff EFLAGS: 00010002
    #6 [d72d3cb4] isolate_migratepages at c030b15a
    #7 [d72d3d14] zone_watermark_ok at c02d26cb
    #8 [d72d3d2c] compact_zone at c030b8de
    #9 [d72d3d68] compact_zone_order at c030bba1
    #10 [d72d3db4] try_to_compact_pages at c030bc84
    #11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7
    #12 [d72d3e08] __alloc_pages_slowpath at c02d66c7
    #13 [d72d3e78] __alloc_pages_nodemask at c02d6a97
    #14 [d72d3eb8] alloc_pages_vma at c030a845
    #15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb
    #16 [d72d3f00] handle_mm_fault at c02f36c6
    #17 [d72d3f30] do_page_fault at c05c70ed
    #18 [d72d3fb0] error_code (via page_fault) at c05c47a4
    EAX: b71ff000 EBX: 00000001 ECX: 00001600 EDX: 00000431
    DS: 007b ESI: 08048950 ES: 007b EDI: bfaa3788
    SS: 007b ESP: bfaa36e0 EBP: bfaa3828 GS: 6f50
    CS: 0073 EIP: 080487c8 ERR: ffffffff EFLAGS: 00010202

    It was also reported by Herbert van den Bergh against a 3.1-based kernel
    with the following snippet from the console log.

    BUG: unable to handle kernel paging request at 01c00008
    IP: [<c0522399>] isolate_migratepages+0x119/0x390
    *pdpt = 000000002f7ce001 *pde = 0000000000000000

    It is expected that it also affects 3.2.x and current mainline.

    The problem is that pfn_valid is only called on the first PFN being
    checked and that PFN is not necessarily aligned. Let's say we have a case
    like this:

    H = MAX_ORDER_NR_PAGES boundary
    | = pageblock boundary
    m = cc->migrate_pfn
    f = cc->free_pfn
    o = memory hole

    H------|------H------|----m-Hoooooo|ooooooH-f----|------H

    The migrate_pfn is just below a memory hole and the free scanner is beyond
    the hole. When isolate_migratepages starts, it scans from migrate_pfn to
    migrate_pfn + pageblock_nr_pages, which is now in a memory hole. It checks
    pfn_valid() on the first PFN but then scans into the hole, where there are
    not necessarily valid struct pages.

    This patch ensures that isolate_migratepages calls pfn_valid when
    necessary.
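
    A sketch of the extra check in the isolate_migratepages() scan loop
    (illustrative; variable names follow mm/compaction.c of that era):

    /* Re-check pfn_valid whenever crossing a MAX_ORDER_NR_PAGES boundary */
    if ((low_pfn & (MAX_ORDER_NR_PAGES - 1)) == 0) {
        if (!pfn_valid(low_pfn)) {
            low_pfn += MAX_ORDER_NR_PAGES - 1;
            continue;
        }
    }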

    Reported-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
    Tested-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Michal Nazarewicz <mina86@mina86.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     
  • Herbert Poetzl reported a performance regression since 2.6.39. The test
    is a simple dd read, but with a big block size. The reason is:

    T1: ra (A, A+128k), (A+128k, A+256k)
    T2: lock_page for page A, submit the 256k
    T3: hit page A+128k, ra (A+256k, A+384k). The range isn't submitted
    because of the plug, and there isn't any lock_page until we hit page
    A+256k because all pages from A to A+256k are in memory
    T4: hit page A+256k, ra (A+384k, A+512k). Because of the plug, the range
    isn't submitted again.
    T5: lock_page A+256k, so (A+256k, A+512k) will be submitted. The task is
    waiting for (A+256k, A+512k) to finish.

    No request goes to disk in T3 and T4, so the readahead pipeline breaks.

    We really don't need a block plug in generic_file_aio_read() for buffered
    I/O. The readahead code already has its own plug and fine-grained control
    over when I/O should be submitted. Deleting the plug for buffered I/O
    fixes the regression.

    One side effect is that the plug makes the request size 256k, whereas it
    is 128k without it. That is just because the default readahead size is
    128k, and is not a reason to keep the plug here.

    Vivek said:

    : We submit some readahead IO to device request queue but because of nested
    : plug, queue never gets unplugged. When read logic reaches a page which is
    : not in page cache, it waits for page to be read from the disk
    : (lock_page_killable()) and that time we flush the plug list.
    :
    : So effectively read ahead logic is kind of broken in parts because of
    : nested plugging. Removing top level plug (generic_file_aio_read()) for
    : buffered reads, will allow unplugging queue earlier for readahead.
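
    The shape of the change, as a sketch (illustrative only; the plug pair
    around the buffered path in generic_file_aio_read() simply goes away,
    leaving readahead to do its own plugging):

    /* before */
    blk_start_plug(&plug);
    do_generic_file_read(filp, ppos, &desc, file_read_actor);
    blk_finish_plug(&plug);

    /* after: no top-level plug; readahead plugs and unplugs on its own */
    do_generic_file_read(filp, ppos, &desc, file_read_actor);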

    Signed-off-by: Shaohua Li
    Signed-off-by: Wu Fengguang
    Reported-by: Herbert Poetzl
    Tested-by: Eric Dumazet
    Cc: Christoph Hellwig
    Cc: Jens Axboe
    Cc: Vivek Goyal
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Fix a race condition that shows in conjunction with xip_file_fault() when
    two threads of the same user process fault on the same memory page.

    In this case, the race winner will install the page table entry and the
    unlucky loser will cause an oops: xip_file_fault calls vm_insert_pfn (via
    vm_insert_mixed) which drops out at this check:

    retval = -EBUSY;
    if (!pte_none(*pte))
            goto out_unlock;

    The resulting -EBUSY return value will trigger a BUG_ON() in
    xip_file_fault.

    This fix simply considers the fault as fixed in this case, because the
    race winner has successfully installed the pte.
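
    A sketch of the resulting handling in xip_file_fault() (illustrative;
    the surrounding error handling is elided):

    err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address, xip_pfn);
    if (err == -EBUSY)
        /* the race winner already installed the pte: treat as handled */
        err = 0;
    BUG_ON(err);
    return VM_FAULT_NOPAGE;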

    [akpm@linux-foundation.org: use conventional (and consistent) comment layout]
    Reported-by: David Sadler
    Signed-off-by: Carsten Otte
    Reported-by: Louis Alex Eisner
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Carsten Otte
     
  • mm/memcontrol.c: In function 'memcg_check_events':
    mm/memcontrol.c:779: warning: unused variable 'do_numainfo'
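
    One way to silence it is to declare the variable only when it can be
    used, mirroring the existing guard on the numainfo path; a sketch, not
    necessarily the exact patch that was applied:

    bool do_softlimit;
    #if MAX_NUMNODES > 1
    bool do_numainfo;
    #endif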

    Acked-by: Michal Hocko
    Cc: Li Zefan
    Cc: Hiroyuki KAMEZAWA
    Cc: Johannes Weiner
    Acked-by: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Postpone resetting page->mapping until the final remove_migration_ptes().
    Otherwise the expression PageAnon(migration_entry_to_page(entry)) does not
    work.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Trivial kmemleak bug-fixes:

    - Early logging doesn't stop when kmemleak is off by default.
    - Zero-size scanning areas should be ignored (currently it prints a
    warning).

    * tag 'kmemleak-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux:
    kmemleak: Disable early logging when kmemleak is off by default
    kmemleak: Only scan non-zero-size areas

    Linus Torvalds
     

03 Feb, 2012

1 commit

  • This fixes the race in process_vm_core found by Oleg (see

    http://article.gmane.org/gmane.linux.kernel/1235667/

    for details).

    This has been updated since I last sent it as the creation of the new
    mm_access() function did almost exactly the same thing as parts of the
    previous version of this patch did.

    In order to use mm_access() even when /proc isn't enabled, we move it to
    kernel/fork.c where other related process mm access functions already
    are.

    Signed-off-by: Chris Yeoh
    Signed-off-by: Linus Torvalds

    Christopher Yeoh
     

01 Feb, 2012

1 commit

  • While 7a401a972df8e18 ("backing-dev: ensure wakeup_timer is deleted")
    addressed the problem of the bdi being freed with a queued wakeup
    timer, there are other races that could happen if the wakeup timer
    expires after/during bdi_unregister(), before bdi_destroy() is called.

    wakeup_timer_fn() could attempt to wake up a task which has already
    been freed, or could access a NULL bdi->dev via the wake_forker_thread
    tracepoint.

    Cc:
    Cc: Jens Axboe
    Reported-by: Chanho Min
    Reviewed-by: Namjae Jeon
    Signed-off-by: Rabin Vincent
    Signed-off-by: Wu Fengguang

    Rabin Vincent
     

25 Jan, 2012

1 commit

  • Davem says:

    1) Fix JIT code generation on x86-64 for divide by zero, from Eric Dumazet.

    2) tg3 header length computation correction from Eric Dumazet.

    3) More build and reference counting fixes for socket memory cgroup
    code from Glauber Costa.

    4) module.h snuck back into a core header after all the hard work we
    did to remove that, from Paul Gortmaker and Jesper Dangaard Brouer.

    5) Fix PHY naming regression and add some new PCI IDs in stmmac, from
    Alessandro Rubini.

    6) Netlink message generation fix in new team driver, should only advertise
    the entries that changed during events, from Jiri Pirko.

    7) SRIOV VF registration and unregistration fixes, and also add a
    missing PCI ID, from Roopa Prabhu.

    8) Fix infinite loop in tx queue flush code of brcmsmac, from Stanislaw Gruszka.

    9) ftgmac100/ftmac100 build fix, missing interrupt.h include.

    10) Memory leak fix in net/hyperv do_set_multicast() handling, from Wei Yongjun.

    11) Off by one fix in netem packet scheduler, from Vijay Subramanian.

    12) TCP loss detection fix from Yuchung Cheng.

    13) TCP reset packet MD5 calculation uses wrong address, fix from Shawn Lu.

    14) skge carrier assertion and DMA mapping fixes from Stephen Hemminger.

    15) Congestion recovery undo performed at the wrong spot in BIC and CUBIC
    congestion control modules, fix from Neal Cardwell.

    16) Ethtool ETHTOOL_GSSET_INFO is unnecessarily restrictive, from Michał Mirosław.

    17) Fix triggerable race in ipv6 sysctl handling, from Francesco Ruggeri.

    18) Statistics bug fixes in mlx4 from Eugenia Emantayev.

    19) rds locking bug fix during info dumps, from yours truly.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (67 commits)
    rds: Make rds_sock_lock BH rather than IRQ safe.
    netprio_cgroup.h: dont include module.h from other includes
    net: flow_dissector.c missing include linux/export.h
    team: send only changed options/ports via netlink
    net/hyperv: fix possible memory leak in do_set_multicast()
    drivers/net: dsa/mv88e6xxx.c files need linux/module.h
    stmmac: added PCI identifiers
    llc: Fix race condition in llc_ui_recvmsg
    stmmac: fix phy naming inconsistency
    dsa: Add reporting of silicon revision for Marvell 88E6123/88E6161/88E6165 switches.
    tg3: fix ipv6 header length computation
    skge: add byte queue limit support
    mv643xx_eth: Add Rx Discard and Rx Overrun statistics
    bnx2x: fix compilation error with SOE in fw_dump
    bnx2x: handle CHIP_REVISION during init_one
    bnx2x: allow user to change ring size in ISCSI SD mode
    bnx2x: fix Big-Endianess in ethtool -t
    bnx2x: fixed ethtool statistics for MF modes
    bnx2x: credit-leakage fixup on vlan_mac_del_all
    macvlan: fix a possible use after free
    ...

    Linus Torvalds
     

24 Jan, 2012

7 commits

  • Memory migration fills a pte with a migration entry without updating the
    rss counters. It then replaces the migration entry with the new page (or
    the old one if migration failed). But between these two passes the pte
    can be unmapped, or a task can fork a child which gets a copy of the
    migration entry. Nobody accounts for this in the rss counters.

    This patch properly adjusts the rss counters for migration entries in
    zap_pte_range() and copy_one_pte(). Thus we avoid extra atomic
    operations on the migration fast-path.
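
    A sketch of the zap_pte_range() side of the accounting (names follow the
    3.x mm/memory.c code; illustrative, not the literal diff):

    if (!pte_file(ptent)) {
        swp_entry_t entry = pte_to_swp_entry(ptent);

        if (is_migration_entry(entry)) {
            struct page *page = migration_entry_to_page(entry);

            if (PageAnon(page))
                rss[MM_ANONPAGES]--;
            else
                rss[MM_FILEPAGES]--;
        }
    }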

    Signed-off-by: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Commit cc39c6a9bbde ("mm: account skipped entries to avoid looping in
    find_get_pages") correctly fixed an infinite loop; but left a problem
    that find_get_pages() on shmem would return 0 (appearing to callers to
    mean end of tree) when it meets a run of nr_pages swap entries.

    The only uses of find_get_pages() on shmem are via pagevec_lookup(),
    called from invalidate_mapping_pages(), and from shmctl SHM_UNLOCK's
    scan_mapping_unevictable_pages(). The first is already commented, and
    not worth worrying about; but the second can leave pages on the
    Unevictable list after an unusual sequence of swapping and locking.

    Fix that by using shmem_find_get_pages_and_swap() (then ignoring the
    swap) instead of pagevec_lookup().

    But I don't want to contaminate vmscan.c with shmem internals, nor
    shmem.c with LRU locking. So move scan_mapping_unevictable_pages() into
    shmem.c, renaming it shmem_unlock_mapping(); and rename
    check_move_unevictable_page() to check_move_unevictable_pages(), looping
    down an array of pages, oftentimes under the same lock.

    Leave out the "rotate unevictable list" block: that's a leftover from
    when this was used for /proc/sys/vm/scan_unevictable_pages, whose flawed
    handling involved looking at pages at tail of LRU.

    Was there significance to the sequence first ClearPageUnevictable, then
    test page_evictable, then SetPageUnevictable here? I think not, we're
    under LRU lock, and have no barriers between those.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Eric Dumazet
    Cc: Johannes Weiner
    Cc: Michel Lespinasse
    Cc: [back to 3.1 but will need respins]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • scan_mapping_unevictable_pages() is used to make SysV SHM_LOCKed pages
    evictable again once the shared memory is unlocked. It does this with
    pagevec_lookup()s across the whole object (which might occupy most of
    memory), and takes 300ms to unlock 7GB here. A cond_resched() every
    PAGEVEC_SIZE pages would be good.

    However, KOSAKI-san points out that this is called under shmem.c's
    info->lock, and it's also under shm.c's shm_lock(), both spinlocks.
    There is no strong reason for that: we need to take these pages off the
    unevictable list soonish, but those locks are not required for it.

    So move the call to scan_mapping_unevictable_pages() from shmem.c's
    unlock handling up to shm.c's unlock handling. Remove the recently
    added barrier, not needed now we have spin_unlock() before the scan.

    Use get_file(), with subsequent fput(), to make sure we have a reference
    to mapping throughout scan_mapping_unevictable_pages(): that's something
    that was previously guaranteed by the shm_lock().

    Remove shmctl's lru_add_drain_all(): we don't fault in pages at SHM_LOCK
    time, and we lazily discover them to be Unevictable later, so it serves
    no purpose for SHM_LOCK; and serves no purpose for SHM_UNLOCK, since
    pages still on pagevec are not marked Unevictable.

    The original code avoided redundant rescans by checking VM_LOCKED flag
    at its level: now avoid them by checking shp's SHM_LOCKED.

    The original code called scan_mapping_unevictable_pages() on a locked
    area at shm_destroy() time: perhaps we once had accounting cross-checks
    which required that, but not now, so skip the overhead and just let
    inode eviction deal with them.

    Put check_move_unevictable_page() and scan_mapping_unevictable_pages()
    under CONFIG_SHMEM (with stub for the TINY case when ramfs is used),
    more as comment than to save space; comment them used for SHM_UNLOCK.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Eric Dumazet
    Cc: Johannes Weiner
    Cc: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The page mapcount should be updated only if we are sure that the page
    ends up in the page table; otherwise we would leak if we couldn't COW
    due to reservations or if idx is out of bounds.

    Signed-off-by: Hillf Danton
    Reviewed-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
  • end_migration() passes the old page instead of the new page to commit
    the charge. This page descriptor is not used for committing itself,
    though, since we also pass the (correct) page_cgroup descriptor. But
    it's used to find the soft limit tree through the page's zone, so the
    soft limit tree of the old page's zone is updated instead of that of the
    new page's, which might get slightly out of date until the next charge
    reaches the ratelimit point.

    This glitch has been present since 5564e88 ("memcg: condense
    page_cgroup-to-page lookup points").

    This fixes a bug that I introduced in 2.6.38. It's benign enough (to my
    knowledge) that we probably don't want this for stable.

    Reported-by: Hugh Dickins
    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • page_zone() requires an online node, otherwise we are accessing a NULL
    NODE_DATA. This is not an issue at the moment because node_zones are
    located at the beginning of the structure, but this might change in the
    future, so better be careful about that.

    Signed-off-by: Michal Hocko
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Fix the following NULL ptr dereference caused by

    cat /sys/devices/system/memory/memory0/removable

    Pid: 13979, comm: sed Not tainted 3.0.13-0.5-default #1 IBM BladeCenter LS21 -[7971PAM]-/Server Blade
    RIP: __count_immobile_pages+0x4/0x100
    Process sed (pid: 13979, threadinfo ffff880221c36000, task ffff88022e788480)
    Call Trace:
    is_pageblock_removable_nolock+0x34/0x40
    is_mem_section_removable+0x74/0xf0
    show_mem_removable+0x41/0x70
    sysfs_read_file+0xfe/0x1c0
    vfs_read+0xc7/0x130
    sys_read+0x53/0xa0
    system_call_fastpath+0x16/0x1b

    We are crashing because we are trying to dereference NULL zone which
    came from pfn=0 (struct page ffffea0000000000). According to the boot
    log this page is marked reserved:
    e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)

    and early_node_map confirms that:
    early_node_map[3] active PFN ranges
    1: 0x00000010 -> 0x0000009c
    1: 0x00000100 -> 0x000bffa3
    1: 0x00100000 -> 0x00240000

    The problem is that memory_present works in PAGE_SECTION_MASK-aligned
    blocks, so the reserved range sneaks into the section as well. This
    also means that free_area_init_node will not take care of those reserved
    pages and they stay uninitialized.

    When we try to read the removable status we walk through all available
    sections and hope that the zone is valid for all pages in the section.
    But this is not true in this case as the zone and nid are not initialized.

    We have only one node in this particular case and it is marked as node=1
    (rather than 0) and that made the problem visible because page_to_nid will
    return 0 and there are no zones on the node.

    Let's check that the zone is valid and that the given pfn falls within
    its boundaries, and otherwise mark the section as not removable. This
    might cause some false positives, but we do not have any sane way to
    find out whether the page is reserved by the platform or just not used
    for whatever other reason.
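
    A sketch of that check in the removable test (illustrative; field names
    as in struct zone of that era):

    zone = page_zone(page);
    pfn = page_to_pfn(page);
    if (zone->zone_start_pfn > pfn ||
        zone->zone_start_pfn + zone->spanned_pages <= pfn)
        return false;   /* treat the section as not removable */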

    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

21 Jan, 2012

2 commits

  • Commit b6693005 (kmemleak: When the early log buffer is exceeded, report
    the actual number) deferred the disabling of the early logging to
    kmemleak_init(). However, when CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=y, the
    early logging was no longer disabled causing __init kmemleak functions
    to be called even after the kernel freed the init memory. This patch
    disables the early logging during kmemleak_init() if kmemleak is left
    disabled.

    Reported-by: Dirk Gouders
    Tested-by: Dirk Gouders
    Tested-by: Josh Boyer
    Signed-off-by: Catalin Marinas

    Catalin Marinas
     
  • Kmemleak should only track valid scan areas with a non-zero size.
    Otherwise, such an area may reside just at the end of an object, and
    kmemleak would report "Adding scan area to unknown object".

    Signed-off-by: Tiejun Chen
    Signed-off-by: Catalin Marinas

    Tiejun Chen
     

18 Jan, 2012

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (47 commits)
    tg3: Fix single-vector MSI-X code
    openvswitch: Fix multipart datapath dumps.
    ipv6: fix per device IP snmp counters
    inetpeer: initialize ->redirect_genid in inet_getpeer()
    net: fix NULL-deref in WARN() in skb_gso_segment()
    net: WARN if skb_checksum_help() is called on skb requiring segmentation
    caif: Remove bad WARN_ON in caif_dev
    caif: Fix typo in Vendor/Product-ID for CAIF modems
    bnx2x: Disable AN KR work-around for BCM57810
    bnx2x: Remove AutoGrEEEn for BCM84833
    bnx2x: Remove 100Mb force speed for BCM84833
    bnx2x: Fix PFC setting on BCM57840
    bnx2x: Fix Super-Isolate mode for BCM84833
    net: fix some sparse errors
    net: kill duplicate included header
    net: sh-eth: Fix build error by the value which is not defined
    net: Use device model to get driver name in skb_gso_segment()
    bridge: BH already disabled in br_fdb_cleanup()
    net: move sock_update_memcg outside of CONFIG_INET
    mwl8k: Fixing Sparse ENDIAN CHECK warning
    ...

    Linus Torvalds
     

17 Jan, 2012

1 commit

  • Although currently only used for tcp sockets, this function
    is now used in common sock code (for sock_clone()).
    Commit 475f1b52645a29936b9df1d8fcd45f7e56bd4a9f moved the
    declaration of sock_update_clone() to inside sock.c, but
    this only fixes the problem when CONFIG_CGROUP_MEM_RES_CTLR_KMEM
    is also not defined.

    This patch here is verified to fix both problems, although
    reverting the previous one is not necessary.

    Signed-off-by: Glauber Costa
    CC: David S. Miller
    CC: Stephen Rothwell
    Reported-by: Randy Dunlap
    Acked-by: Randy Dunlap
    Signed-off-by: David S. Miller

    Glauber Costa
     

16 Jan, 2012

1 commit

  • Commit 7bd0b0f0da ("memblock: Reimplement memblock allocation using
    reverse free area iterator") implemented a simple top-down
    allocator using a reverse memblock iterator. To avoid underflow
    in the allocator loop, it simply raised the lower boundary to
    the requested size under the assumption that the requested size
    would be far smaller than the available memblocks.
    This causes early page table allocation failure under certain
    configurations in Xen. Fix it by checking for underflow directly
    instead of bumping up lower bound.
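
    A sketch of the reworked allocator loop in memblock_find_in_range_node()
    (illustrative; built on the reverse free-range iterator named above):

    for_each_free_mem_range_reverse(i, nid, &this_start, &this_end, NULL) {
        this_start = clamp(this_start, start, end);
        this_end = clamp(this_end, start, end);

        /* check for underflow instead of raising the lower bound */
        if (this_end < size)
            continue;

        cand = round_down(this_end - size, align);
        if (cand >= this_start)
            return cand;
    }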

    Signed-off-by: Tejun Heo
    Reported-by: Konrad Rzeszutek Wilk
    Cc: rjw@sisk.pl
    Cc: xen-devel@lists.xensource.com
    Cc: Benjamin Herrenschmidt
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20120113181412.GA11112@google.com
    Signed-off-by: Ingo Molnar

    Tejun Heo
     

15 Jan, 2012

1 commit

  • Kmemleak patches

    Main features:
    - Handle percpu memory allocations (only scanning them, not actually
    reporting).
    - Memory hotplug support.

    Usability improvements:
    - Show the origin of early allocations.
    - Report previously found leaks even if kmemleak has been disabled by
    some error.

    * tag 'kmemleak' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux:
    kmemleak: Add support for memory hotplug
    kmemleak: Handle percpu memory allocation
    kmemleak: Report previously found leaks even after an error
    kmemleak: When the early log buffer is exceeded, report the actual number
    kmemleak: Show where early_log issues come from

    Linus Torvalds
     

13 Jan, 2012

8 commits

  • If either of the vas or vms arrays is not properly kzalloced, then the
    code jumps to the err_free label.

    The err_free label runs a loop to check and free each member of the vas
    and vms arrays, which is not required in this situation as none of the
    array members has been allocated up to this point.

    Eliminate the extra loop we have to go through by introducing a new label,
    err_free2, and jumping to it instead.
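
    The resulting error handling, roughly (an illustrative sketch of
    pcpu_get_vm_areas(), not the literal diff):

    vms = kzalloc(sizeof(vms[0]) * nr_vms, GFP_KERNEL);
    vas = kzalloc(sizeof(vas[0]) * nr_vms, GFP_KERNEL);
    if (!vas || !vms)
        goto err_free2;     /* no per-area entries were allocated yet */
    ...
    err_free:
        for (area = 0; area < nr_vms; area++) {
            kfree(vas[area]);       /* kfree(NULL) is a no-op */
            kfree(vms[area]);
        }
    err_free2:
        kfree(vas);
        kfree(vms);
        return NULL;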

    [akpm@linux-foundation.org: remove now-unneeded tests]
    Signed-off-by: Kautuk Consul
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kautuk Consul
     
  • There is sometimes confusion between the global putback_lru_pages() in
    migrate.c and the static putback_lru_pages() in vmscan.c: rename the
    latter putback_inactive_pages(): it helps shrink_inactive_list() much as
    move_active_pages_to_lru() helps shrink_active_list().

    Remove unused scan_control arg from putback_inactive_pages() and from
    update_isolated_counts(). Move clear_active_flags() inside
    update_isolated_counts(). Move NR_ISOLATED accounting up into
    shrink_inactive_list() itself, so the balance is clearer.

    Do the spin_lock_irq() before calling putback_inactive_pages() and
    spin_unlock_irq() after return from it, so that it better matches
    update_isolated_counts() and move_active_pages_to_lru().

    Signed-off-by: Hugh Dickins
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The isolate_pages() level in vmscan.c offers little but indirection: merge
    it into isolate_lru_pages() as the compiler does, and use the names
    nr_to_scan and nr_scanned in each case.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • del_page_from_lru() repeats del_page_from_lru_list(), also working out
    which LRU the page was on, clearing the relevant bits. Decouple those
    functions: remove del_page_from_lru() and add page_off_lru().

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Mostly we use "enum lru_list lru": change those few "l"s to "lru"s.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • checkpatch rightly protests

    WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable

    so fix the five offenders in mm/swap.c.
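
    That is, each export moves to sit directly under its definition, e.g.
    (hypothetical function name):

    void foo(struct page *page)
    {
        ...
    }
    EXPORT_SYMBOL(foo);     /* immediately follows, with no blank line */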

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • What's so special about ____pagevec_lru_add() that it needs four leading
    underscores? Nothing, it just helped to distinguish from
    __pagevec_lru_add() in 2.6.28 development. Cut two leading underscores.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Replace pagevecs in putback_lru_pages() and move_active_pages_to_lru()
    by lists of pages_to_free: then apply Konstantin Khlebnikov's
    free_hot_cold_page_list() to them instead of pagevec_release().

    This simplifies the flow (no need to drop and retake the lock whenever
    a pagevec fills up) and reduces stale addresses in stack backtraces
    (which often showed through the pagevecs); but more importantly,
    removes another 120 bytes from the deepest stacks in page reclaim.
    Although I've not recently seen an actual stack overflow here with
    a vanilla kernel, move_active_pages_to_lru() has often featured in
    deep backtraces.

    However, free_hot_cold_page_list() does not handle compound pages
    (nor need it: a Transparent HugePage would have been split by the
    time it reaches the call in shrink_page_list()), but it is possible
    for putback_lru_pages() or move_active_pages_to_lru() to be left
    holding the last reference on a THP, so they must exclude the unlikely
    compound case before putting a page on pages_to_free.

    Remove pagevec_strip(), its work now done in move_active_pages_to_lru().
    The pagevec in scan_mapping_unevictable_pages() remains in mm/vmscan.c,
    but that is never on the reclaim path, and cannot be replaced by a list.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Konstantin Khlebnikov
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins