10 Sep, 2010

3 commits

  • When under significant memory pressure, a process enters direct reclaim
    and immediately afterwards tries to allocate a page. If it fails and no
    further progress is made, it's possible the system will go OOM. However,
    on systems with large amounts of memory, it's possible that a significant
    number of pages are on per-cpu lists and inaccessible to the calling
    process. This leads to a process entering direct reclaim more often than
    it should, increasing the pressure on the system and compounding the
    problem.

    This patch notes that if direct reclaim is making progress but allocations
    are still failing, the system is already under heavy pressure. In this
    case, it drains the per-cpu lists and tries the allocation a second
    time before continuing.

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Cc: Dave Chinner
    Cc: Wu Fengguang
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
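
    A minimal sketch of the retry-after-drain logic described above, roughly
    as it sits at the tail of the direct-reclaim slow path (argument lists
    abbreviated; illustrative rather than the literal patch):

    static struct page *
    __alloc_pages_direct_reclaim(/* gfp_mask, order, zonelist, ... */)
    {
        struct page *page;
        bool drained = false;

        /* ... run direct reclaim; return NULL if it made no progress ... */

    retry:
        page = get_page_from_freelist(/* same arguments as the fast path */);

        /*
         * Reclaim made progress but the allocation still failed: free pages
         * may be stranded on per-cpu lists.  Drain them once and retry.
         */
        if (!page && !drained) {
            drain_all_pages();
            drained = true;
            goto retry;
        }
        return page;
    }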
     
  • …low and kswapd is awake

    Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is
    cheaper than scanning a number of lists. To avoid synchronization
    overhead, counter deltas are maintained on a per-cpu basis and drained
    both periodically and when the delta is above a threshold. On large CPU
    systems, the difference between the estimated and real value of
    NR_FREE_PAGES can be very high. If NR_FREE_PAGES is much higher than the
    number of pages actually free in the buddy lists, the VM can allocate
    pages below the min watermark, at worst reducing the real number of free
    pages to zero. Even if the OOM killer kills a victim to free memory, it
    may not free anything if the exit path itself requires a new page,
    resulting in livelock.

    This patch introduces a zone_page_state_snapshot() function (courtesy of
    Christoph) that takes a slightly more accurate view of an arbitrary vmstat
    counter. It is used to read NR_FREE_PAGES while kswapd is awake to avoid
    the watermark being accidentally broken. The estimate is not perfect and
    may result in cache line bounces but is expected to be lighter than the
    IPI calls necessary to continually drain the per-cpu counters while kswapd
    is awake.

    Signed-off-by: Christoph Lameter <cl@linux.com>
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Christoph Lameter
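
    The snapshot function introduced here is small; roughly, it folds the
    undrained per-cpu deltas into the cached counter (simplified sketch):

    static inline unsigned long zone_page_state_snapshot(struct zone *zone,
                                                         enum zone_stat_item item)
    {
        long x = atomic_long_read(&zone->vm_stat[item]);
    #ifdef CONFIG_SMP
        int cpu;

        /* fold in per-cpu deltas that have not been drained yet */
        for_each_online_cpu(cpu)
            x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];

        if (x < 0)
            x = 0;
    #endif
        return x;
    }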
     
  • When allocating a page, the system uses NR_FREE_PAGES counters to
    determine if watermarks would remain intact after the allocation was made.
    This check is made without interrupts disabled or the zone lock held and
    so is race-prone by nature. Unfortunately, when pages are being freed in
    batch, the counters are updated before the pages are added on the list.
    During this window, the counters are misleading as the pages do not exist
    yet. When under significant pressure on systems with large numbers of
    CPUs, it's possible for processes to make progress even though they should
    have been stalled. This is particularly problematic if a number of the
    processes are using GFP_ATOMIC as the min watermark can be accidentally
    breached and in extreme cases, the system can livelock.

    This patch updates the counters after the pages have been added to the
    list. This makes the allocator more cautious with respect to preserving
    the watermarks and mitigates livelock possibilities.

    [akpm@linux-foundation.org: avoid modifying incoming args]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
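
    The reordering amounts to something like this in the bulk-free path
    (simplified sketch; multiple pcp lists and irq handling are elided):

    static void free_pcppages_bulk(struct zone *zone, int count,
                                   struct per_cpu_pages *pcp)
    {
        struct list_head *list = &pcp->lists[0];    /* one list, for brevity */
        int to_free = count;
        struct page *page;

        spin_lock(&zone->lock);
        while (to_free--) {
            page = list_entry(list->prev, struct page, lru);
            list_del(&page->lru);
            __free_one_page(page, zone, 0, page_private(page));
        }
        /* only now do the freed pages really exist on the buddy lists */
        __mod_zone_page_state(zone, NR_FREE_PAGES, count);
        spin_unlock(&zone->lock);
    }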
     

10 Aug, 2010

3 commits

    Since 2.6.28, zone->prev_priority has been unused, so it can be removed
    safely. This reduces stack usage slightly.

    Now I have to say that I'm sorry. Two years ago I thought prev_priority
    could usefully be integrated again, but four (or more) attempts have not
    produced good performance numbers, so I have given up on that approach.

    The rest of this changelog is notes on prev_priority: why it existed in
    the first place and why it may no longer be necessary. This information
    is based heavily on discussions between Andrew Morton, Rik van Riel and
    KOSAKI Motohiro, who is quoted heavily below.

    Historically, prev_priority was important because it determined when the
    VM would start unmapping PTE pages, i.e. it fed into the balances of note
    within the VM: Anon vs File and Mapped vs Unmapped. Without prev_priority,
    there is a potential risk of unnecessarily increasing minor faults, as a
    large amount of read activity on use-once pages could push mapped pages
    to the end of the LRU and get them unmapped.

    There is no proof this is still a problem, but currently it is not
    considered to be: active file pages are not deactivated if the active
    file list is smaller than the inactive list, reducing the likelihood that
    file-mapped pages are pushed off the LRU, and referenced executable pages
    are kept on the active list to avoid them being pushed out by read
    activity.

    Even if it is a problem, prev_priority wouldn't work nowadays. First of
    all, the current vmscan code is still quite UP-centric and shows some
    weaknesses on machines with dozens of CPUs; I think we need more and more
    improvement there.

    The problem is that the current vmscan code mixes up per-system pressure,
    per-zone pressure and per-task pressure a bit. For example, prev_priority
    tries to boost the priority of other concurrent reclaimers, but if another
    task has a mempolicy restriction this is unnecessary and only adds large
    latencies and excessive reclaim. Per-task priority plus the prev_priority
    adjustment emulates per-system pressure, but that has two issues: 1) the
    emulation is too rough and brutal, and 2) we need per-zone pressure, not
    per-system pressure.

    Another example: DEF_PRIORITY is currently 12, meaning the LRU is rotated
    about two full cycles (1/4096 + 1/2048 + 1/1024 + ... + 1) before the
    OOM killer is invoked. But if 10,000 threads enter DEF_PRIORITY reclaim at
    the same time, the system is under higher memory pressure than priority==0
    (1/4096 * 10,000 > 2). prev_priority can't solve such a multithreaded
    workload issue; in other words, the prev_priority concept assumes the
    system doesn't have lots of threads.

    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Reviewed-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Dave Chinner
    Cc: Chris Mason
    Cc: Nick Piggin
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Christoph Hellwig
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Andrea Arcangeli
    Cc: Michael Rubin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
    We have been using the names try_set_zone_oom and clear_zonelist_oom. The
    role of these functions is to lock the zonelist to prevent parallel OOM
    handling. clear_zonelist_oom makes sense, but try_set_zone_oom is rather
    awkward and does not match clear_zonelist_oom.

    Let's rename it to try_set_zonelist_oom.

    Signed-off-by: Minchan Kim
    Acked-by: David Rientjes
    Reviewed-by: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • If memory has been depleted in lowmem zones even with the protection
    afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that
    killing current users will help. The memory is either reclaimable (or
    migratable) already, in which case we should not invoke the oom killer at
    all, or it is pinned by an application for I/O. Killing such an
    application may leave the hardware in an unspecified state and there is no
    guarantee that it will be able to make a timely exit.

    Lowmem allocations are now failed in oom conditions when __GFP_NOFAIL is
    not used so that the task can perhaps recover or try again later.

    Previously, the heuristic provided some protection for those tasks with
    CAP_SYS_RAWIO, but this is no longer necessary since we will not be
    killing tasks for the purposes of ISA allocations.

    high_zoneidx is gfp_zone(gfp_flags), meaning that ZONE_NORMAL will be the
    default for all allocations that are not __GFP_DMA, __GFP_DMA32,
    __GFP_HIGHMEM, and __GFP_MOVABLE on kernels configured to support those
    flags. Testing for high_zoneidx being less than ZONE_NORMAL will only
    return true for allocations that have either __GFP_DMA or __GFP_DMA32.

    Acked-by: KOSAKI Motohiro
    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
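
    The check lands in the __alloc_pages_may_oom() path and is roughly the
    following (sketch, surrounding code elided):

        if (!(gfp_mask & __GFP_NOFAIL)) {
            /* The OOM killer will not help higher order allocs */
            if (order > PAGE_ALLOC_COSTLY_ORDER)
                goto out;
            /*
             * Do not kill tasks for lowmem: only __GFP_DMA and
             * __GFP_DMA32 allocations can end up here.
             */
            if (high_zoneidx < ZONE_NORMAL)
                goto out;
        }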
     

21 Jul, 2010

1 commit

  • Borislav Petkov reported his 32bit numa system has problem:

    [ 0.000000] Reserving total of 4c00 pages for numa KVA remap
    [ 0.000000] kva_start_pfn ~ 32800 max_low_pfn ~ 375fe
    [ 0.000000] max_pfn = 238000
    [ 0.000000] 8202MB HIGHMEM available.
    [ 0.000000] 885MB LOWMEM available.
    [ 0.000000] mapped low ram: 0 - 375fe000
    [ 0.000000] low ram: 0 - 375fe000
    [ 0.000000] alloc (nid=8 100000 - 7ee00000) (1000000 - ffffffff) 1000 1000 => 34e7000
    [ 0.000000] alloc (nid=8 100000 - 7ee00000) (1000000 - ffffffff) 200 40 => 34c9d80
    [ 0.000000] alloc (nid=0 100000 - 7ee00000) (1000000 - ffffffffffffffff) 180 40 => 34e6140
    [ 0.000000] alloc (nid=1 80000000 - c7e60000) (1000000 - ffffffffffffffff) 240 40 => 80000000
    [ 0.000000] BUG: unable to handle kernel paging request at 40000000
    [ 0.000000] IP: [] __alloc_memory_core_early+0x147/0x1d6
    [ 0.000000] *pdpt = 0000000000000000 *pde = f000ff53f000ff00
    ...
    [ 0.000000] Call Trace:
    [ 0.000000] [] ? __alloc_bootmem_node+0x216/0x22f
    [ 0.000000] [] ? sparse_early_usemaps_alloc_node+0x5a/0x10b
    [ 0.000000] [] ? sparse_init+0x1dc/0x499
    [ 0.000000] [] ? paging_init+0x168/0x1df
    [ 0.000000] [] ? native_pagetable_setup_start+0xef/0x1bb

    It looks like it allocates addresses that are too high for bootmem.

    Try to cap the limit using get_max_mapped().

    Reported-by: Borislav Petkov
    Tested-by: Conny Seidel
    Signed-off-by: Yinghai Lu
    Cc: [2.6.34.x]
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Johannes Weiner
    Cc: Lee Schermerhorn
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     

19 Jul, 2010

1 commit

  • With commits 08677214 and 59be5a8e, alloc_bootmem()/free_bootmem() and
    friends use the early_res functions for memory management when
    NO_BOOTMEM is enabled. This patch adds the kmemleak calls in the
    corresponding code paths for bootmem allocations.

    Signed-off-by: Catalin Marinas
    Acked-by: Pekka Enberg
    Acked-by: Yinghai Lu
    Cc: H. Peter Anvin
    Cc: stable@kernel.org

    Catalin Marinas
     

28 May, 2010

2 commits

  • Introduce numa_mem_id(), based on generic percpu variable infrastructure
    to track "nearest node with memory" for archs that support memoryless
    nodes.

    Define the API in <linux/topology.h> when CONFIG_HAVE_MEMORYLESS_NODES is
    defined, else provide stubs. Architectures will define
    HAVE_MEMORYLESS_NODES if/when they support memoryless nodes.

    Archs can override definitions of:

    numa_mem_id() - returns node number of "local memory" node
    set_numa_mem() - initialize [this cpu's] per cpu variable 'numa_mem'
    cpu_to_mem() - return numa_mem for specified cpu; may be used as lvalue

    Generic initialization of 'numa_mem' occurs in __build_all_zonelists().
    This will initialize the boot cpu at boot time, and all cpus on change of
    numa_zonelist_order, or when node or memory hot-plug requires zonelist
    rebuild. Archs that support memoryless nodes will need to initialize
    'numa_mem' for secondary cpus as they're brought on-line.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Christoph Lameter
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Nick Piggin
    Cc: David Rientjes
    Cc: Eric Whitney
    Cc: KAMEZAWA Hiroyuki
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: "Luck, Tony"
    Cc: Pekka Enberg
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
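
    A sketch of the API shape (the per-cpu variable name is assumed here for
    illustration; the stubs simply map back to the local node):

    #ifdef CONFIG_HAVE_MEMORYLESS_NODES
    DECLARE_PER_CPU(int, _numa_mem_);       /* nearest node with memory */

    static inline int numa_mem_id(void)
    {
        return __this_cpu_read(_numa_mem_);
    }

    static inline void set_numa_mem(int node)
    {
        __this_cpu_write(_numa_mem_, node);
    }

    static inline int cpu_to_mem(int cpu)
    {
        return per_cpu(_numa_mem_, cpu);
    }
    #else                                   /* every node has memory */
    static inline int numa_mem_id(void)
    {
        return numa_node_id();
    }

    static inline int cpu_to_mem(int cpu)
    {
        return cpu_to_node(cpu);
    }
    #endif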
     
  • Rework the generic version of the numa_node_id() function to use the new
    generic percpu variable infrastructure.

    Guard the new implementation with a new config option:

    CONFIG_USE_PERCPU_NUMA_NODE_ID.

    Archs which support this new implementation will default this option to 'y'
    when NUMA is configured. This config option could be removed if/when all
    archs switch over to the generic percpu implementation of numa_node_id().
    Arch support involves:

    1) converting any existing per cpu variable implementations to use
    this implementation. x86_64 is an instance of such an arch.
    2) archs that don't use a per cpu variable for numa_node_id() will
    need to initialize the new per cpu variable "numa_node" as cpus
    are brought on-line. ia64 is an example.
    3) Defining USE_PERCPU_NUMA_NODE_ID in arch dependent Kconfig--e.g.,
    when NUMA is configured. This is required because I have
    retained the old implementation by default to allow archs to
    be modified incrementally, as desired.

    Subsequent patches will convert x86_64 and ia64 to use this implementation.

    Signed-off-by: Lee Schermerhorn
    Cc: Tejun Heo
    Cc: Mel Gorman
    Reviewed-by: Christoph Lameter
    Cc: Nick Piggin
    Cc: David Rientjes
    Cc: Eric Whitney
    Cc: KAMEZAWA Hiroyuki
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: "Luck, Tony"
    Cc: Pekka Enberg
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
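
    Under the new option the generic implementation reduces to a per-cpu
    integer (simplified sketch):

    #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
    DECLARE_PER_CPU(int, numa_node);

    static inline int numa_node_id(void)
    {
        return __this_cpu_read(numa_node);
    }

    /* arch code calls this as each cpu is brought on-line */
    static inline void set_numa_node(int node)
    {
        __this_cpu_write(numa_node, node);
    }
    #endif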
     

25 May, 2010

10 commits

  • Add global mutex zonelists_mutex to fix the possible race:

    CPU0                                      CPU1                    CPU2
    (1) zone->present_pages += online_pages;
    (2)                                       build_all_zonelists();
    (3)                                                               alloc_page();
    (4)                                                               free_page();
    (5) build_all_zonelists();
    (6)   __build_all_zonelists();
    (7)     zone->pageset = alloc_percpu();

    In steps (3) and (4), zone->pageset still points to boot_pageset, so bad
    things may happen if 2+ nodes are in this state. Even if only one node is
    accessing the boot_pageset, (3) may still consume too much memory and
    cause the memory allocation in step (7) to fail.

    Besides, serializing these paths ensures that alloc_percpu() in step (7)
    will never fail, since a fresh memory block was added in step (6).

    [haicheng.li@linux.intel.com: hold zonelists_mutex when build_all_zonelists]
    Signed-off-by: Haicheng Li
    Signed-off-by: Wu Fengguang
    Reviewed-by: Andi Kleen
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haicheng Li
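
    The serialization boils down to something like this in the hotplug path
    (sketch; error handling and the exact signatures of the era are elided):

    static DEFINE_MUTEX(zonelists_mutex);

    int online_pages(unsigned long pfn, unsigned long nr_pages)
    {
        /* ... online the pages and account them ... */
        zone->present_pages += onlined_pages;

        if (need_zonelists_rebuild) {
            mutex_lock(&zonelists_mutex);
            build_all_zonelists(NULL);  /* may allocate per-cpu pagesets */
            mutex_unlock(&zonelists_mutex);
        }
        /* ... */
        return 0;
    }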
     
    For each newly populated zone of a hot-added node, its pagesets need to be
    updated with dynamically allocated per_cpu_pageset structs for all
    possible CPUs:

    1) Detach zone->pageset from the shared boot_pageset
    at the end of __build_all_zonelists().

    2) Use a mutex to protect zone->pageset while it is still
    shared in onlined_pages().

    Otherwise, multiple zones of different nodes would share the same
    bootstrapping boot_pageset for the same CPU, which finally causes the
    kernel panic below:

    ------------[ cut here ]------------
    kernel BUG at mm/page_alloc.c:1239!
    invalid opcode: 0000 [#1] SMP
    ...
    Call Trace:
    [] __alloc_pages_nodemask+0x131/0x7b0
    [] alloc_pages_current+0x87/0xd0
    [] __page_cache_alloc+0x67/0x70
    [] __do_page_cache_readahead+0x120/0x260
    [] ra_submit+0x21/0x30
    [] ondemand_readahead+0x166/0x2c0
    [] page_cache_async_readahead+0x80/0xa0
    [] generic_file_aio_read+0x364/0x670
    [] nfs_file_read+0xca/0x130
    [] do_sync_read+0xfa/0x140
    [] vfs_read+0xb5/0x1a0
    [] sys_read+0x51/0x80
    [] system_call_fastpath+0x16/0x1b
    RIP [] get_page_from_freelist+0x883/0x900
    RSP
    ---[ end trace 4bda28328b9990db ]

    [akpm@linux-foundation.org: merge fix]
    Signed-off-by: Haicheng Li
    Signed-off-by: Wu Fengguang
    Reviewed-by: Andi Kleen
    Reviewed-by: Christoph Lameter
    Cc: Mel Gorman
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haicheng Li
     
  • No behavior change here.

    Move some of setup_per_cpu_pageset() code into a new function
    setup_zone_pageset() that will be useful for memory hotplug.

    Signed-off-by: Wu Fengguang
    Signed-off-by: Haicheng Li
    Reviewed-by: Andi Kleen
    Reviewed-by: Christoph Lameter
    Cc: Mel Gorman
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • free_hot_cold_page() and __free_pages_ok() have very similar freeing
    preparation. Consolidate them.

    [akpm@linux-foundation.org: fix busted coding style]
    Signed-off-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The fragmentation index may indicate that a failure is due to external
    fragmentation but after a compaction run completes, it is still possible
    for an allocation to fail. There are two obvious reasons why:

    o Page migration cannot move all pages so fragmentation remains
    o A suitable page may exist but watermarks are not met

    In the event of compaction followed by an allocation failure, this patch
    defers further compaction in the zone (1 << compact_defer_shift) times.
    If the next compaction attempt also fails, compact_defer_shift is
    increased up to a maximum of 6. If compaction succeeds, the defer
    counters are reset again.

    The zone that is deferred is the first zone in the zonelist - i.e. the
    preferred zone. To defer compaction in the other zones, the information
    would need to be stored in the zonelist or implemented similar to the
    zonelist_cache. This would impact the fast-paths and is not justified at
    this time.

    Signed-off-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
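
    The bookkeeping is small enough to sketch (field names follow the
    description above; treat this as illustrative):

    #define COMPACT_MAX_DEFER_SHIFT 6

    /* compaction ran but the following allocation still failed */
    static inline void defer_compaction(struct zone *zone)
    {
        zone->compact_considered = 0;
        if (++zone->compact_defer_shift > COMPACT_MAX_DEFER_SHIFT)
            zone->compact_defer_shift = COMPACT_MAX_DEFER_SHIFT;
    }

    /* returns true if compaction should be skipped this time around */
    static inline bool compaction_deferred(struct zone *zone)
    {
        unsigned long defer_limit = 1UL << zone->compact_defer_shift;

        if (++zone->compact_considered > defer_limit)
            zone->compact_considered = defer_limit;

        return zone->compact_considered < defer_limit;
    }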
     
  • Ordinarily when a high-order allocation fails, direct reclaim is entered
    to free pages to satisfy the allocation. With this patch, it is
    determined if an allocation failed due to external fragmentation instead
    of low memory and if so, the calling process will compact until a suitable
    page is freed. Compaction by moving pages in memory is considerably
    cheaper than paging out to disk and works where there are locked pages or
    no swap. If compaction fails to free a page of a suitable size, then
    reclaim will still occur.

    Direct compaction returns as soon as possible. As each block is
    compacted, it is checked if a suitable page has been freed and if so, it
    returns.

    [akpm@linux-foundation.org: Fix build errors]
    [aarcange@redhat.com: fix count_vm_event preempt in memory compaction direct reclaim]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
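
    In the slow path, the ordering described above looks roughly like this
    (arguments elided; sketch only):

        /* try memory compaction first for high-order requests */
        page = __alloc_pages_direct_compact(/* gfp_mask, order, ... */);
        if (page)
            goto got_pg;

        /* compaction could not free a suitable page: fall back to reclaim */
        page = __alloc_pages_direct_reclaim(/* gfp_mask, order, ... */);
        if (page)
            goto got_pg;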
     
  • This patch is the core of a mechanism which compacts memory in a zone by
    relocating movable pages towards the end of the zone.

    A single compaction run involves a migration scanner and a free scanner.
    Both scanners operate on pageblock-sized areas in the zone. The migration
    scanner starts at the bottom of the zone and searches for all movable
    pages within each area, isolating them onto a private list called
    migratelist. The free scanner starts at the top of the zone, searches for
    suitable areas and consumes the free pages within them, making them
    available to the migration scanner. The pages isolated for migration are
    then migrated to the newly isolated free pages.

    [aarcange@redhat.com: Fix unsafe optimisation]
    [mel@csn.ul.ie: do not schedule work on other CPUs for compaction]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
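
    A heavily simplified sketch of a single compaction run with the two
    scanners (field and helper names are approximate):

    static int compact_zone(struct zone *zone, struct compact_control *cc)
    {
        /* migration scanner starts at the bottom, free scanner at the top */
        cc->migrate_pfn = zone->zone_start_pfn;
        cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;

        while (compact_finished(zone, cc) == COMPACT_CONTINUE) {
            /* isolate movable pages from the next pageblock(s) */
            if (!isolate_migratepages(zone, cc))
                continue;

            /* move them into free pages found by the free scanner */
            migrate_pages(&cc->migratepages, compaction_alloc,
                          (unsigned long)cc, 0);
        }
        return COMPACT_COMPLETE;
    }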
     
  • There are two types of zonelist ordering methodologies:

    - node order, preferring that allocations stay local to the node on which
    they were requested, and

    - zone order, preferring that allocations come from a higher zone to
    avoid allocating in lowmem zones even though they may not be local.

    The ordering technique used by the kernel is configurable on the command
    line, but also has some logic to determine what the default should be.

    This logic currently lacks knowledge of systems where a node may only have
    lowmem. For such systems, it is necessary to use node order so that
    GFP_KERNEL allocations may be satisfied by nodes consisting of only
    lowmem.

    If zone order is used, GFP_KERNEL allocations to such nodes are actually
    allocated on a node with local affinity that includes ZONE_NORMAL.

    This change defaults to node zonelist ordering if any node lacks
    ZONE_NORMAL.

    To force zone order, append 'numa_zonelist_order=zone' to the kernel
    command line.

    Signed-off-by: David Rientjes
    Acked-by: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
    Before applying this patch, cpuset updates task->mems_allowed and the
    mempolicy by setting all new bits in the nodemask first, and clearing all
    old disallowed bits later. But in this way, the allocator may find that
    there is no node to allocate memory from.

    The reason is that when cpuset rebinds the task's mempolicy, it clears
    the nodes the allocator can allocate pages on, for example:

    (mpol: mempolicy)
    task1                      task1's mpol    task2
    alloc page                 1
      alloc on node0? NO       1
                               1               change mems from 1 to 0
                               1               rebind task1's mpol
                               0-1               set new bits
                               0                 clear disallowed bits
      alloc on node1? NO       0
      ...
    can't alloc page
      goto oom

    This patch fixes the problem by expanding the node range first (setting
    the newly allowed bits) and shrinking it lazily (clearing the newly
    disallowed bits). So we use a variable to tell the write-side task that a
    read-side task is reading the nodemask, and the write-side task clears
    the newly disallowed nodes only after the read-side task ends its current
    memory allocation.

    [akpm@linux-foundation.org: fix spello]
    Signed-off-by: Miao Xie
    Cc: David Rientjes
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
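
    A simplified sketch of the two-step rebind; the counter that read-side
    tasks hold while looking at the nodemask is named illustratively:

    static void change_task_nodemask(struct task_struct *tsk, nodemask_t *newmems)
    {
        /* step 1: widen the mask so concurrent allocations always see a node */
        nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
        mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1);

        /* wait until no allocation is currently reading the old nodemask */
        while (ACCESS_ONCE(tsk->mems_allowed_change_disable))
            cpu_relax();

        /* step 2: now it is safe to drop the newly disallowed nodes */
        tsk->mems_allowed = *newmems;
        mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP2);
    }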
     
  • …re merging to the tail of the free lists

    In order to reduce fragmentation, this patch classifies freed pages in two
    groups according to their probability of being part of a high order merge.
    Pages belonging to a compound whose next-highest buddy is free are more
    likely to be part of a high order merge in the near future, so they will
    be added at the tail of the freelist. The remaining pages are put at the
    front of the freelist.

    In this way, the pages that are more likely to cause a big merge are kept
    free longer. Consequently there is a tendency to aggregate the
    long-living allocations on a subset of the compounds, reducing the
    fragmentation.

    This heuristic was tested on three machines, x86, x86-64 and ppc64 with
    3GB of RAM in each machine. The tests were kernbench, netperf, sysbench
    and STREAM for performance and a high-order stress test for huge page
    allocations.

    KernBench X86
    Elapsed mean 374.77 ( 0.00%) 375.10 (-0.09%)
    User mean 649.53 ( 0.00%) 650.44 (-0.14%)
    System mean 54.75 ( 0.00%) 54.18 ( 1.05%)
    CPU mean 187.75 ( 0.00%) 187.25 ( 0.27%)

    KernBench X86-64
    Elapsed mean 94.45 ( 0.00%) 94.01 ( 0.47%)
    User mean 323.27 ( 0.00%) 322.66 ( 0.19%)
    System mean 36.71 ( 0.00%) 36.50 ( 0.57%)
    CPU mean 380.75 ( 0.00%) 381.75 (-0.26%)

    KernBench PPC64
    Elapsed mean 173.45 ( 0.00%) 173.74 (-0.17%)
    User mean 587.99 ( 0.00%) 587.95 ( 0.01%)
    System mean 60.60 ( 0.00%) 60.57 ( 0.05%)
    CPU mean 373.50 ( 0.00%) 372.75 ( 0.20%)

    Nothing notable for kernbench.

    NetPerf UDP X86
    64 42.68 ( 0.00%) 42.77 ( 0.21%)
    128 85.62 ( 0.00%) 85.32 (-0.35%)
    256 170.01 ( 0.00%) 168.76 (-0.74%)
    1024 655.68 ( 0.00%) 652.33 (-0.51%)
    2048 1262.39 ( 0.00%) 1248.61 (-1.10%)
    3312 1958.41 ( 0.00%) 1944.61 (-0.71%)
    4096 2345.63 ( 0.00%) 2318.83 (-1.16%)
    8192 4132.90 ( 0.00%) 4089.50 (-1.06%)
    16384 6770.88 ( 0.00%) 6642.05 (-1.94%)*

    NetPerf UDP X86-64
    64 148.82 ( 0.00%) 154.92 ( 3.94%)
    128 298.96 ( 0.00%) 312.95 ( 4.47%)
    256 583.67 ( 0.00%) 626.39 ( 6.82%)
    1024 2293.18 ( 0.00%) 2371.10 ( 3.29%)
    2048 4274.16 ( 0.00%) 4396.83 ( 2.79%)
    3312 6356.94 ( 0.00%) 6571.35 ( 3.26%)
    4096 7422.68 ( 0.00%) 7635.42 ( 2.79%)*
    8192 12114.81 ( 0.00%)* 12346.88 ( 1.88%)
    16384 17022.28 ( 0.00%)* 17033.19 ( 0.06%)*
    1.64% 2.73%

    NetPerf UDP PPC64
    64 49.98 ( 0.00%) 50.25 ( 0.54%)
    128 98.66 ( 0.00%) 100.95 ( 2.27%)
    256 197.33 ( 0.00%) 191.03 (-3.30%)
    1024 761.98 ( 0.00%) 785.07 ( 2.94%)
    2048 1493.50 ( 0.00%) 1510.85 ( 1.15%)
    3312 2303.95 ( 0.00%) 2271.72 (-1.42%)
    4096 2774.56 ( 0.00%) 2773.06 (-0.05%)
    8192 4918.31 ( 0.00%) 4793.59 (-2.60%)
    16384 7497.98 ( 0.00%) 7749.52 ( 3.25%)

    The tests are run to have confidence limits within 1%. Results marked
    with a * did not meet that confidence limit, although in this case they
    are only outside it by small amounts. Even with some results that were
    not confident, the netperf UDP results were generally positive.

    NetPerf TCP X86
    64 652.25 ( 0.00%)* 648.12 (-0.64%)*
    23.80% 22.82%
    128 1229.98 ( 0.00%)* 1220.56 (-0.77%)*
    21.03% 18.90%
    256 2105.88 ( 0.00%) 1872.03 (-12.49%)*
    1.00% 16.46%
    1024 3476.46 ( 0.00%)* 3548.28 ( 2.02%)*
    13.37% 11.39%
    2048 4023.44 ( 0.00%)* 4231.45 ( 4.92%)*
    9.76% 12.48%
    3312 4348.88 ( 0.00%)* 4396.96 ( 1.09%)*
    6.49% 8.75%
    4096 4726.56 ( 0.00%)* 4877.71 ( 3.10%)*
    9.85% 8.50%
    8192 4732.28 ( 0.00%)* 5777.77 (18.10%)*
    9.13% 13.04%
    16384 5543.05 ( 0.00%)* 5906.24 ( 6.15%)*
    7.73% 8.68%

    NETPERF TCP X86-64
    netperf-tcp-vanilla-netperf netperf-tcp
    tcp-vanilla pgalloc-delay
    64 1895.87 ( 0.00%)* 1775.07 (-6.81%)*
    5.79% 4.78%
    128 3571.03 ( 0.00%)* 3342.20 (-6.85%)*
    3.68% 6.06%
    256 5097.21 ( 0.00%)* 4859.43 (-4.89%)*
    3.02% 2.10%
    1024 8919.10 ( 0.00%)* 8892.49 (-0.30%)*
    5.89% 6.55%
    2048 10255.46 ( 0.00%)* 10449.39 ( 1.86%)*
    7.08% 7.44%
    3312 10839.90 ( 0.00%)* 10740.15 (-0.93%)*
    6.87% 7.33%
    4096 10814.84 ( 0.00%)* 10766.97 (-0.44%)*
    6.86% 8.18%
    8192 11606.89 ( 0.00%)* 11189.28 (-3.73%)*
    7.49% 5.55%
    16384 12554.88 ( 0.00%)* 12361.22 (-1.57%)*
    7.36% 6.49%

    NETPERF TCP PPC64
    netperf-tcp-vanilla-netperf netperf-tcp
    tcp-vanilla pgalloc-delay
    64 594.17 ( 0.00%) 596.04 ( 0.31%)*
    1.00% 2.29%
    128 1064.87 ( 0.00%)* 1074.77 ( 0.92%)*
    1.30% 1.40%
    256 1852.46 ( 0.00%)* 1856.95 ( 0.24%)
    1.25% 1.00%
    1024 3839.46 ( 0.00%)* 3813.05 (-0.69%)
    1.02% 1.00%
    2048 4885.04 ( 0.00%)* 4881.97 (-0.06%)*
    1.15% 1.04%
    3312 5506.90 ( 0.00%) 5459.72 (-0.86%)
    4096 6449.19 ( 0.00%) 6345.46 (-1.63%)
    8192 7501.17 ( 0.00%) 7508.79 ( 0.10%)
    16384 9618.65 ( 0.00%) 9490.10 (-1.35%)

    There was a distinct lack of confidence in the X86* figures, so I included
    the deviation where the results were not confident. Many of the
    results, whether gains or losses were within the standard deviation so no
    solid conclusion can be reached on performance impact. Looking at the
    figures, only the X86-64 ones look suspicious with a few losses that were
    outside the noise. However, the results were so unstable that without
    knowing why they vary so much, a solid conclusion cannot be reached.

    SYSBENCH X86
    sysbench-vanilla pgalloc-delay
    1 7722.85 ( 0.00%) 7756.79 ( 0.44%)
    2 14901.11 ( 0.00%) 13683.44 (-8.90%)
    3 15171.71 ( 0.00%) 14888.25 (-1.90%)
    4 14966.98 ( 0.00%) 15029.67 ( 0.42%)
    5 14370.47 ( 0.00%) 14865.00 ( 3.33%)
    6 14870.33 ( 0.00%) 14845.57 (-0.17%)
    7 14429.45 ( 0.00%) 14520.85 ( 0.63%)
    8 14354.35 ( 0.00%) 14362.31 ( 0.06%)

    SYSBENCH X86-64
    1 17448.70 ( 0.00%) 17484.41 ( 0.20%)
    2 34276.39 ( 0.00%) 34251.00 (-0.07%)
    3 50805.25 ( 0.00%) 50854.80 ( 0.10%)
    4 66667.10 ( 0.00%) 66174.69 (-0.74%)
    5 66003.91 ( 0.00%) 65685.25 (-0.49%)
    6 64981.90 ( 0.00%) 65125.60 ( 0.22%)
    7 64933.16 ( 0.00%) 64379.23 (-0.86%)
    8 63353.30 ( 0.00%) 63281.22 (-0.11%)
    9 63511.84 ( 0.00%) 63570.37 ( 0.09%)
    10 62708.27 ( 0.00%) 63166.25 ( 0.73%)
    11 62092.81 ( 0.00%) 61787.75 (-0.49%)
    12 61330.11 ( 0.00%) 61036.34 (-0.48%)
    13 61438.37 ( 0.00%) 61994.47 ( 0.90%)
    14 62304.48 ( 0.00%) 62064.90 (-0.39%)
    15 63296.48 ( 0.00%) 62875.16 (-0.67%)
    16 63951.76 ( 0.00%) 63769.09 (-0.29%)

    SYSBENCH PPC64
    -sysbench-pgalloc-delay-sysbench
    sysbench-vanilla pgalloc-delay
    1 7645.08 ( 0.00%) 7467.43 (-2.38%)
    2 14856.67 ( 0.00%) 14558.73 (-2.05%)
    3 21952.31 ( 0.00%) 21683.64 (-1.24%)
    4 27946.09 ( 0.00%) 28623.29 ( 2.37%)
    5 28045.11 ( 0.00%) 28143.69 ( 0.35%)
    6 27477.10 ( 0.00%) 27337.45 (-0.51%)
    7 26489.17 ( 0.00%) 26590.06 ( 0.38%)
    8 26642.91 ( 0.00%) 25274.33 (-5.41%)
    9 25137.27 ( 0.00%) 24810.06 (-1.32%)
    10 24451.99 ( 0.00%) 24275.85 (-0.73%)
    11 23262.20 ( 0.00%) 23674.88 ( 1.74%)
    12 24234.81 ( 0.00%) 23640.89 (-2.51%)
    13 24577.75 ( 0.00%) 24433.50 (-0.59%)
    14 25640.19 ( 0.00%) 25116.52 (-2.08%)
    15 26188.84 ( 0.00%) 26181.36 (-0.03%)
    16 26782.37 ( 0.00%) 26255.99 (-2.00%)

    Again, there is little to conclude here. While there are a few losses,
    the results vary by +/- 8% in some cases. These are the results of most
    concern, as there are some large losses, but they are also within the
    variance typically seen between kernel releases.

    The STREAM results varied so little and are so verbose that I didn't
    include them here.

    The final test stressed how many huge pages can be allocated. The
    absolute number of huge pages allocated is the same with or without the
    patch. However, the "unusability free space index", which is a measure of
    external fragmentation, was slightly lower (lower is better) throughout
    the lifetime of the system. I also measured the latency of how long it
    took to successfully allocate a huge page. The latency was slightly
    lower, and on X86 and PPC64 more huge pages were allocated almost
    immediately from the free lists. The improvement is slight but there.

    [mel@csn.ul.ie: Tested, reworked for less branches]
    [czoccolo@gmail.com: fix oops by checking pfn_valid_within()]
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Christoph Lameter <cl@linux-foundation.org>
    Acked-by: Rik van Riel <riel@redhat.com>
    Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
    Cc: Corrado Zoccolo <czoccolo@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Corrado Zoccolo
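
    The heuristic described at the top of this entry reduces to a small
    change in the freeing path; a sketch, with the buddy-index arithmetic
    hidden behind a hypothetical helper:

        if (order < MAX_ORDER - 2 &&
            next_higher_buddy_is_free(zone, page, order)) {  /* hypothetical */
            /* likely to take part in a high-order merge soon: keep it
             * free longer by putting it at the tail of the freelist */
            list_add_tail(&page->lru,
                          &zone->free_area[order].free_list[migratetype]);
        } else {
            list_add(&page->lru,
                     &zone->free_area[order].free_list[migratetype]);
        }
        zone->free_area[order].nr_free++;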
     

16 Mar, 2010

1 commit


13 Mar, 2010

2 commits

  • - introduce dump_page() to print the page info for debugging some error
    condition.

    - convert three mm users: bad_page(), print_bad_pte() and memory offline
    failure.

    - print an extra field: the symbolic names of page->flags

    Example dump_page() output:

    [ 157.521694] page:ffffea0000a7cba8 count:2 mapcount:1 mapping:ffff88001c901791 index:0x147
    [ 157.525570] page flags: 0x100000000100068(uptodate|lru|active|swapbacked)

    Signed-off-by: Wu Fengguang
    Cc: Ingo Molnar
    Cc: Alex Chiang
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • __zone_pcp_update() iterates over NR_CPUS instead of limiting the access
    to the possible cpus. This might result in access to uninitialized areas
    as the per cpu allocator only populates the per cpu memory for possible
    cpus.

    This problem was created as a result of the dynamic allocation of pagesets
    from percpu memory that went in during the merge window - commit
    99dcc3e5a94ed491fbef402831d8c0bbb267f995 ("this_cpu: Page allocator
    conversion").

    Signed-off-by: Thomas Gleixner
    Acked-by: Pekka Enberg
    Acked-by: Tejun Heo
    Acked-by: Christoph Lameter
    Acked-by: Mel Gorman
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
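
    The fix is essentially a change of iterator, so that only pagesets the
    percpu allocator actually populated are touched (sketch):

        int cpu;

        /* was: for (cpu = 0; cpu < NR_CPUS; cpu++) */
        for_each_possible_cpu(cpu) {
            struct per_cpu_pageset *pset = per_cpu_ptr(zone->pageset, cpu);

            setup_pageset(pset, zone_batchsize(zone));
        }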
     

07 Mar, 2010

6 commits

  • free_area_init_nodes() emits pfn ranges for all zones on the system.
    There may be no pages on a higher zone, however, due to memory limitations
    or the use of the mem= kernel parameter. For example:

    Zone PFN ranges:
    DMA 0x00000001 -> 0x00001000
    DMA32 0x00001000 -> 0x00100000
    Normal 0x00100000 -> 0x00100000

    The implementation copies the previous zone's highest pfn, if any, as the
    next zone's lowest pfn. If its highest pfn is then greater than the
    amount of addressable memory, the upper memory limit is used instead.
    Thus, both the lowest and highest possible pfn for higher zones without
    memory may be the same.

    The pfn range for zones without memory is now shown as "empty" instead.

    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • There are quite a few GFP_KERNEL memory allocations made during
    suspend/hibernation and resume that may cause the system to hang, because
    the I/O operations they depend on cannot be completed due to the
    underlying devices being suspended.

    Avoid this problem by clearing the __GFP_IO and __GFP_FS bits in
    gfp_allowed_mask before suspend/hibernation and restoring the original
    values of these bits in gfp_allowed_mask during the subsequent resume.

    [akpm@linux-foundation.org: fix CONFIG_PM=n linkage]
    Signed-off-by: Rafael J. Wysocki
    Reported-by: Maxim Levitsky
    Cc: Sebastian Ott
    Cc: Benjamin Herrenschmidt
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
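
    The mechanism is a pair of small helpers around the global mask, called
    from the suspend/hibernation code; roughly (a sketch, helper names as I
    recall them from that era):

    gfp_t clear_gfp_allowed_mask(gfp_t mask)
    {
        gfp_t ret = gfp_allowed_mask;

        gfp_allowed_mask &= ~mask;
        return ret;
    }

    void set_gfp_allowed_mask(gfp_t mask)
    {
        gfp_allowed_mask = mask;
    }

    /* conceptually, in the suspend/hibernation path */
        saved_mask = clear_gfp_allowed_mask(GFP_IOFS);  /* __GFP_IO | __GFP_FS */
        /* ... suspend devices, write or read the image, resume devices ... */
        set_gfp_allowed_mask(saved_mask);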
     
    commit e815af95 ("change all_unreclaimable zone member to flags") changed
    the all_unreclaimable member to a bit flag. But it had an undesirable
    side effect: free_one_page() is one of the hottest paths in the Linux
    kernel, and adding atomic ops to it can reduce kernel performance a bit.

    Thus, this patch partially reverts that commit; at the very least,
    all_unreclaimable should not share a memory word with other zone flags.

    [akpm@linux-foundation.org: fix patch interaction]
    Signed-off-by: KOSAKI Motohiro
    Cc: David Rientjes
    Cc: Wu Fengguang
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Huang Shijie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • free_hot_page() is just a wrapper around free_hot_cold_page() with
    parameter 'cold = 0'. After adding a clear comment for
    free_hot_cold_page(), it is reasonable to remove a level of call.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Li Hong
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Cc: KOSAKI Motohiro
    Cc: Americo Wang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Hong
     
  • Move a call of trace_mm_page_free_direct() from free_hot_page() to
    free_hot_cold_page(). It is clearer and close to kmemcheck_free_shadow(),
    as it is done in function __free_pages_ok().

    Signed-off-by: Li Hong
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Hong
     
    trace_mm_page_free_direct() is called in function __free_pages(). But it
    is called again in free_hot_page() if order == 0, producing duplicate
    records in the trace file for the mm_page_free_direct event. As below:

    K-PID CPU# TIMESTAMP FUNCTION
    gnome-terminal-1567 [000] 4415.246466: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
    gnome-terminal-1567 [000] 4415.246468: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
    gnome-terminal-1567 [000] 4415.246506: mm_page_alloc: page=ffffea0003db9f40 pfn=1155800 order=0 migratetype=0 gfp_flags=GFP_KERNEL
    gnome-terminal-1567 [000] 4415.255557: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
    gnome-terminal-1567 [000] 4415.255557: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0

    This patch removes the first call and adds a call to
    trace_mm_page_free_direct() in __free_pages_ok().

    Signed-off-by: Li Hong
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Hong
     

04 Mar, 2010

1 commit

  • …l/git/tip/linux-2.6-tip

    * 'x86-bootmem-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (30 commits)
    early_res: Need to save the allocation name in drop_range_partial()
    sparsemem: Fix compilation on PowerPC
    early_res: Add free_early_partial()
    x86: Fix non-bootmem compilation on PowerPC
    core: Move early_res from arch/x86 to kernel/
    x86: Add find_fw_memmap_area
    Move round_up/down to kernel.h
    x86: Make 32bit support NO_BOOTMEM
    early_res: Enhance check_and_double_early_res
    x86: Move back find_e820_area to e820.c
    x86: Add find_early_area_size
    x86: Separate early_res related code from e820.c
    x86: Move bios page reserve early to head32/64.c
    sparsemem: Put mem map for one node together.
    sparsemem: Put usemap for one node together
    x86: Make 64 bit use early_res instead of bootmem before slab
    x86: Only call dma32_reserve_bootmem 64bit !CONFIG_NUMA
    x86: Make early_node_mem get mem > 4 GB if possible
    x86: Dynamically increase early_res array size
    x86: Introduce max_early_res and early_res_count
    ...

    Linus Torvalds
     

22 Feb, 2010

1 commit

    Fix these build errors on some non-x86 platforms (PowerPC, for example):

    mm/page_alloc.c: In function '__alloc_memory_core_early':
    mm/page_alloc.c:3468: error: implicit declaration of function 'find_early_area'
    mm/page_alloc.c:3483: error: implicit declaration of function 'reserve_early_without_check'

    The function is only needed on CONFIG_NO_BOOTMEM.

    Signed-off-by: Yinghai Lu
    Cc: Andrew Morton
    Cc: Johannes Weiner
    Cc: Mel Gorman
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Yinghai Lu
     

13 Feb, 2010

1 commit


02 Feb, 2010

1 commit


30 Jan, 2010

1 commit

  • After memory pressure has forced it to dip into the reserves, 2.6.32's
    5f8dcc21211a3d4e3a7a5ca366b469fb88117f61 "page-allocator: split per-cpu
    list into one-list-per-migrate-type" has been returning MIGRATE_RESERVE
    pages to the MIGRATE_MOVABLE free_list: in some sense depleting reserves.

    Fix that in the most straightforward way (which, considering the overheads
    of alternative approaches, is Mel's preference): the right migratetype is
    already in page_private(page), but free_pcppages_bulk() wasn't using it.

    How did this bug show up? As a 20% slowdown in my tmpfs loop kbuild
    swapping tests, on PowerMac G5 with SLUB allocator. Bisecting to that
    commit was easy, but explaining the magnitude of the slowdown not easy.

    The same effect appears, but much less markedly, with SLAB, and even
    less markedly on other machines (the PowerMac divides into fewer zones
    than x86, I think that may be a factor). We guess that lumpy reclaim
    of short-lived high-order pages is implicated in some way, and probably
    this bug has been tickling a poor decision somewhere in page reclaim.

    But instrumentation hasn't told me much, I've run out of time and
    imagination to determine exactly what's going on, and shouldn't hold up
    the fix any longer: it's valid, and might even fix other misbehaviours.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
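
    The fix itself is a one-liner inside free_pcppages_bulk(): free each page
    to the migratetype recorded in page_private(), not to the type of the
    pcp list it happened to sit on (sketch):

        page = list_entry(list->prev, struct page, lru);
        list_del(&page->lru);
        /* MIGRATE_MOVABLE list may include MIGRATE_RESERVE pages, so
         * trust page_private(), not the list index */
        __free_one_page(page, zone, 0, page_private(page));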
     

17 Jan, 2010

2 commits

    commit f2260e6b ("page allocator: update NR_FREE_PAGES only as necessary")
    introduced one minor regression: if __rmqueue() fails, the NR_FREE_PAGES
    statistic goes wrong. This patch fixes it.

    Signed-off-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Reviewed-by: Minchan Kim
    Reported-by: Huang Shijie
    Reviewed-by: Christoph Lameter
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
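
    The fix moves the counter update after the failure check in
    buffered_rmqueue(), roughly (irq handling elided):

        spin_lock(&zone->lock);
        page = __rmqueue(zone, order, migratetype);
        spin_unlock(&zone->lock);
        if (!page)
            goto failed;        /* NR_FREE_PAGES is left untouched */
        __mod_zone_page_state(zone, NR_FREE_PAGES, -(1 << order));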
     
  • The current check for 'backward merging' within add_active_range() does
    not seem correct. start_pfn must be compared against
    early_node_map[i].start_pfn (and NOT against .end_pfn) to find out whether
    the new region is backward-mergeable with the existing range.

    Signed-off-by: Kazuhisa Ichikawa
    Acked-by: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kazuhisa Ichikawa
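
    The corrected test in add_active_range() compares the new region's start
    against the existing range's start_pfn (sketch):

        /* Merge backward if suitable */
        if (start_pfn < early_node_map[i].start_pfn &&
            end_pfn >= early_node_map[i].start_pfn) {
            early_node_map[i].start_pfn = start_pfn;
            return;
        }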
     

05 Jan, 2010

1 commit

  • Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.

    This drastically reduces the size of struct zone for systems with large
    amounts of processors and allows placement of critical variables of struct
    zone in one cacheline even on very large systems.

    Another effect is that the pagesets of one processor are placed near one
    another. If multiple pagesets from different zones fit into one cacheline
    then additional cacheline fetches can be avoided on the hot paths when
    allocating memory from multiple zones.

    Bootstrap becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs
    are reduced and we can drop the zone_pcp macro.

    Hotplug handling is also simplified since cpu alloc can bring up and
    shut down cpu areas for a specific cpu as a whole. So there is no need to
    allocate or free individual pagesets.

    V7-V8:
    - Explain the chicken-and-egg dilemma with the percpu allocator.

    V4-V5:
    - Fix up cases where per_cpu_ptr is called before irq disable
    - Integrate the bootstrap logic that was separate before.

    tj: Build failure in pageset_cpuup_callback() due to missing ret
    variable fixed.

    Reviewed-by: Mel Gorman
    Signed-off-by: Christoph Lameter
    Signed-off-by: Tejun Heo

    Christoph Lameter
     

24 Dec, 2009

1 commit


23 Dec, 2009

1 commit

  • * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (36 commits)
    powerpc/gc/wii: Remove get_irq_desc()
    powerpc/gc/wii: hlwd-pic: convert irq_desc.lock to raw_spinlock
    powerpc/gamecube/wii: Fix off-by-one error in ugecon/usbgecko_udbg
    powerpc/mpic: Fix problem that affinity is not updated
    powerpc/mm: Fix stupid bug in subpge protection handling
    powerpc/iseries: use DECLARE_COMPLETION_ONSTACK for non-constant completion
    powerpc: Fix MSI support on U4 bridge PCIe slot
    powerpc: Handle VSX alignment faults correctly in little-endian mode
    powerpc/mm: Fix typo of cpumask_clear_cpu()
    powerpc/mm: Fix hash_utils_64.c compile errors with DEBUG enabled.
    powerpc: Convert BUG() to use unreachable()
    powerpc/pseries: Make declarations of cpu_hotplug_driver_lock() ANSI compatible.
    powerpc/pseries: Don't panic when H_PROD fails during cpu-online.
    powerpc/mm: Fix a WARN_ON() with CONFIG_DEBUG_PAGEALLOC and CONFIG_DEBUG_VM
    powerpc/defconfigs: Set HZ=100 on pseries and ppc64 defconfigs
    powerpc/defconfigs: Disable token ring in powerpc defconfigs
    powerpc/defconfigs: Reduce 64bit vmlinux by making acenic and cramfs modules
    powerpc/pseries: Select XICS and PCI_MSI PSERIES
    powerpc/85xx: Wrong variable returned on error
    powerpc/iseries: Convert to proc_fops
    ...

    Linus Torvalds
     

20 Dec, 2009

1 commit

  • …git/tip/linux-2.6-tip

    * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86, irq: Allow 0xff for /proc/irq/[n]/smp_affinity on an 8-cpu system
    Makefile: Unexport LC_ALL instead of clearing it
    x86: Fix objdump version check in arch/x86/tools/chkobjdump.awk
    x86: Reenable TSC sync check at boot, even with NONSTOP_TSC
    x86: Don't use POSIX character classes in gen-insn-attr-x86.awk
    Makefile: set LC_CTYPE, LC_COLLATE, LC_NUMERIC to C
    x86: Increase MAX_EARLY_RES; insufficient on 32-bit NUMA
    x86: Fix checking of SRAT when node 0 ram is not from 0
    x86, cpuid: Add "volatile" to asm in native_cpuid()
    x86, msr: msrs_alloc/free for CONFIG_SMP=n
    x86, amd: Get multi-node CPU info from NodeId MSR instead of PCI config space
    x86: Add IA32_TSC_AUX MSR and use it
    x86, msr/cpuid: Register enough minors for the MSR and CPUID drivers
    initramfs: add missing decompressor error check
    bzip2: Add missing checks for malloc returning NULL
    bzip2/lzma/gzip: pre-boot malloc doesn't return NULL on failure

    Linus Torvalds