27 Oct, 2010

5 commits

  • This removes the following warning from sparse:

    mm/page_alloc.c:1934:9: warning: restricted gfp_t degrades to integer

    Signed-off-by: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • …r if significant congestion is not being encountered in the current zone

    If congestion_wait() is called with no BDI congested, the caller will
    sleep for the full timeout and this may be an unnecessary sleep. This
    patch adds a wait_iff_congested() that checks congestion and only sleeps
    if a BDI is congested; otherwise, it calls cond_resched() to ensure the
    caller is not hogging the CPU longer than its quota but will not sleep.

    This is aimed at reducing some of the major desktop stalls reported during
    IO. For example, while kswapd is operating, it calls congestion_wait()
    but it could just have been reclaiming clean page cache pages with no
    congestion. Without this patch, it would sleep for a full timeout but
    after this patch, it'll just call schedule() if it has been on the CPU too
    long. Similar logic applies to direct reclaimers that are not making
    enough progress.
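
    The difference between the two waits can be modelled in userspace. This
    is a minimal sketch, not the kernel implementation: the sim_* names, the
    sim_state struct and the return-value convention are assumptions made for
    illustration.

```c
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel symbols. */
struct sim_state {
    bool bdi_congested; /* is any backing device congested right now? */
    int sleeps;         /* full-timeout sleeps taken so far */
    int yields;         /* cond_resched()-style yields taken so far */
};

/* Old behaviour: sleep for the whole timeout whether or not it helps. */
long sim_congestion_wait(struct sim_state *s, long timeout)
{
    (void)timeout;
    s->sleeps++;
    return 0; /* model: the entire timeout was consumed */
}

/* New behaviour: sleep only when congested, otherwise just yield the CPU. */
long sim_wait_iff_congested(struct sim_state *s, long timeout)
{
    if (!s->bdi_congested) {
        s->yields++;    /* stands in for cond_resched() */
        return timeout; /* model: none of the timeout was consumed */
    }
    return sim_congestion_wait(s, timeout);
}
```

    With no congestion, the caller gets its whole timeout back and only
    yields; with congestion, it falls through to the old full sleep.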

    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: Wu Fengguang <fengguang.wu@intel.com>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     
  • Currently, the sysfs interface of memory hotplug shows whether a section
    is removable or not. But it checks only the migratetype of pages and
    doesn't check the details of the cluster of pages.

    Memory hotplug's set_migratetype_isolate() has the same kind of check,
    too.

    This patch adds the function __count_unmovable_pages() and makes the
    above two checks use the same logic. As a result, is_removable and the
    hotremove code use the same logic. No changes in the hotremove logic
    itself.

    TODO: need to find a way to check RECLAIMABLE pages. But, thinking about
    it a bit, calling shrink_slab() against a range before starting memory
    hotremove sounds better. If so, this patch's logic doesn't need to be
    changed.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Reported-by: Michal Hocko
    Cc: Wu Fengguang
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Even if the notifier cannot find any pages, it doesn't mean no pages are
    available. And if there are no notifiers registered, this condition will
    always be true and memory hotplug will show -EBUSY.

    This is a bug but not critical.

    In most cases, a pageblock which will be offlined is MIGRATE_MOVABLE.
    This "notifier" is called only when the pageblock is _not_
    MIGRATE_MOVABLE. But if it is not MIGRATE_MOVABLE, memory hotplug
    commonly fails anyway. So no one noticed this bug.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Wu Fengguang
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • There is a bug in commit 6dda9d55 ("page allocator: reduce fragmentation
    in buddy allocator by adding buddies that are merging to the tail of the
    free lists") that means a buddy at order MAX_ORDER is checked for merging.
    A page of this order never exists so at times, an effectively random
    piece of memory is being checked.

    Alan Curry has reported that this is causing memory corruption in
    userspace data on a PPC32 platform (http://lkml.org/lkml/2010/10/9/32).
    It is not clear why this is happening. It could be a cache coherency
    problem where pages mapped in both user and kernel space are getting
    different cache lines due to the bad read from kernel space
    (http://lkml.org/lkml/2010/10/13/179). It could also be that there are
    some special registers being io-remapped at the end of the memmap array
    and that a read has special meaning on them. Compiler bugs have been
    ruled out because the assembly before and after the patch looks relatively
    harmless.

    This patch fixes the problem by ensuring we are not reading a possibly
    invalid location of memory. It's not clear why the read causes corruption
    but one way or the other it is a buggy read.
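
    The index arithmetic behind the bad read can be sketched as follows. This
    is an illustration under assumptions: SIM_MAX_ORDER = 11 and all sim_*
    helpers are hypothetical, and the exact condition used by the patch is
    not shown in the message above. The guard only expresses the idea that
    the next-highest buddy (at order + 1) is worth checking when merging with
    it would still produce a valid order.

```c
#include <stdbool.h>

#define SIM_MAX_ORDER 11 /* assumed value; valid orders are 0..10 */

/* Index of the buddy of the block starting at page_idx, at this order. */
unsigned long sim_find_buddy_index(unsigned long page_idx, unsigned int order)
{
    return page_idx ^ (1UL << order);
}

/* Index of the merged block formed by page_idx and its buddy. */
unsigned long sim_find_combined_index(unsigned long page_idx,
                                      unsigned int order)
{
    return page_idx & ~(1UL << order);
}

/*
 * The next-highest buddy lives at order + 1. Pairing it up would create a
 * block of order + 2, so only look at it when order + 2 is a valid order.
 */
bool sim_may_check_higher_buddy(unsigned int order)
{
    return order + 2 < SIM_MAX_ORDER;
}
```

    Under these assumptions, a page freed at order 9 would skip the check,
    because its order-10 parent has no buddy that could ever merge with it.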

    Signed-off-by: Mel Gorman
    Cc: Corrado Zoccolo
    Reported-by: Alan Curry
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

12 Oct, 2010

1 commit


08 Oct, 2010

2 commits


10 Sep, 2010

3 commits

  • When under significant memory pressure, a process enters direct reclaim
    and immediately afterwards tries to allocate a page. If it fails and no
    further progress is made, it's possible the system will go OOM. However,
    on systems with large amounts of memory, it's possible that a significant
    number of pages are on per-cpu lists and inaccessible to the calling
    process. This leads to a process entering direct reclaim more often than
    it should, increasing the pressure on the system and compounding the
    problem.

    This patch notes that if direct reclaim is making progress but allocations
    are still failing, the system is already under heavy pressure. In
    this case, it drains the per-cpu lists and tries the allocation a second
    time before continuing.
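
    The retry described above can be modelled roughly like this. It is a
    sketch with hypothetical sim_* names and a single-zone model, not the
    allocator's real slow path.

```c
#include <stdbool.h>

/* Hypothetical model of one zone; not a kernel structure. */
struct sim_zone {
    unsigned long buddy_free; /* pages visible to the allocating task */
    unsigned long pcp_pages;  /* pages parked on per-cpu lists */
};

/* Try to take one page from the buddy free pages. */
bool sim_try_alloc(struct sim_zone *z)
{
    if (z->buddy_free == 0)
        return false;
    z->buddy_free--;
    return true;
}

/* Move every per-cpu page back to where the allocator can see it. */
void sim_drain_all_pages(struct sim_zone *z)
{
    z->buddy_free += z->pcp_pages;
    z->pcp_pages = 0;
}

/* Allocate after direct reclaim; drain and retry once if reclaim progressed. */
bool sim_alloc_after_reclaim(struct sim_zone *z, unsigned long reclaimed)
{
    bool drained = false;

retry:
    if (sim_try_alloc(z))
        return true;
    if (reclaimed > 0 && !drained) {
        sim_drain_all_pages(z);
        drained = true;
        goto retry;
    }
    return false;
}
```

    The drain happens at most once per attempt, and only when reclaim made
    progress, so the common case pays nothing extra.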

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Cc: Dave Chinner
    Cc: Wu Fengguang
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • …low and kswapd is awake

    Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is
    cheaper than scanning a number of lists. To avoid synchronization
    overhead, counter deltas are maintained on a per-cpu basis and drained
    both periodically and when the delta is above a threshold. On systems
    with a large number of CPUs, the difference between the estimated and
    real value of NR_FREE_PAGES can be very high. If NR_FREE_PAGES is much
    higher than the number of real free pages in the buddy allocator, the VM
    can allocate pages below the min watermark, at worst reducing the real
    number of pages to zero. Even if the OOM killer kills some victim to free
    memory, it may not free memory if the exit path requires a new page,
    resulting in livelock.

    This patch introduces a zone_page_state_snapshot() function (courtesy of
    Christoph) that takes a slightly more accurate view of an arbitrary vmstat
    counter. It is used to read NR_FREE_PAGES while kswapd is awake to avoid
    the watermark being accidentally broken. The estimate is not perfect and
    may result in cache line bounces but is expected to be lighter than the
    IPI calls necessary to continually drain the per-cpu counters while kswapd
    is awake.
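
    The idea of a snapshot read can be sketched in userspace as below. The
    struct layout, the CPU count and the sim_* names are assumptions for
    illustration only; the real counters live in per-cpu data.

```c
#define SIM_NR_CPUS 4 /* assumed CPU count for the model */

/* Hypothetical model of one vmstat counter and its per-CPU deltas. */
struct sim_zone_stat {
    long global;                 /* shared counter, may lag behind */
    long cpu_delta[SIM_NR_CPUS]; /* deltas not yet folded into global */
};

/* Cheap read: looks at the shared counter only. */
long sim_zone_page_state(const struct sim_zone_stat *s)
{
    long x = s->global;

    return x < 0 ? 0 : x;
}

/* Snapshot read: also adds every pending per-CPU delta. */
long sim_zone_page_state_snapshot(const struct sim_zone_stat *s)
{
    long x = s->global;

    for (int cpu = 0; cpu < SIM_NR_CPUS; cpu++)
        x += s->cpu_delta[cpu];
    return x < 0 ? 0 : x;
}
```

    In this model, a shared counter of 100 with pending deltas summing to
    -95 reads as 100 through the cheap path but as 5 through the snapshot.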

    Signed-off-by: Christoph Lameter <cl@linux.com>
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Christoph Lameter
     
  • When allocating a page, the system uses NR_FREE_PAGES counters to
    determine if watermarks would remain intact after the allocation was made.
    This check is made without interrupts disabled or the zone lock held and
    so is race-prone by nature. Unfortunately, when pages are being freed in
    batch, the counters are updated before the pages are added on the list.
    During this window, the counters are misleading as the pages do not exist
    yet. When under significant pressure on systems with large numbers of
    CPUs, it's possible for processes to make progress even though they should
    have been stalled. This is particularly problematic if a number of the
    processes are using GFP_ATOMIC as the min watermark can be accidentally
    breached and in extreme cases, the system can livelock.

    This patch updates the counters after the pages have been added to the
    list. This makes the allocator more cautious with respect to preserving
    the watermarks and mitigates livelock possibilities.

    [akpm@linux-foundation.org: avoid modifying incoming args]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

31 Aug, 2010

1 commit


28 Aug, 2010

2 commits

  • 1. replace find_e820_area with memblock_find_in_range
    2. replace reserve_early with memblock_x86_reserve_range
    3. replace free_early with memblock_x86_free_range.
    4. NO_BOOTMEM will switch to use memblock too.
    5. use _e820, _early wrappers in this patch; a following patch will
    replace them all
    6. because memblock_x86_free_range supports partial free, we can remove some special care
    7. Need to make sure that memblock_find_in_range() is called after memblock_x86_fill(),
    so adjust some calls later in setup.c::setup_arch()
    -- corruption_check and mptable_update

    -v2: Move reserve_brk() early
    Before fill_memblock_area, to avoid overlap between brk and memblock_find_in_range().
    That could happen when we have more than 128 RAM entries in the E820 tables, and
    memblock_x86_fill() could use memblock_find_in_range() to find a new place for
    the memblock.memory.region array.
    And we don't need to use extend_brk() after fill_memblock_area(),
    so move reserve_brk() early, before fill_memblock_area().
    -v3: Move find_smp_config early
    To make sure memblock_find_in_range does not find the wrong place if the BIOS
    doesn't put the mptable in the right place.
    -v4: Treat RESERVED_KERN as RAM in memblock.memory; they are already in
    memblock.reserved.
    Use __NOT_KEEP_MEMBLOCK to make sure memblock related code can be freed later.
    -v5: The generic version of __memblock_find_in_range() goes from high to low, and
    on 32bit active_region does include high pages,
    so need to replace the limit with memblock.default_alloc_limit, aka get_max_mapped()
    -v6: Use current_limit instead
    -v7: check with MEMBLOCK_ERROR instead of -1ULL or -1L
    -v8: Set memblock_can_resize early to handle EFI with more RAM entries
    -v9: update after kmemleak changes in mainline

    Suggested-by: David S. Miller
    Suggested-by: Benjamin Herrenschmidt
    Suggested-by: Thomas Gleixner
    Signed-off-by: Yinghai Lu
    Signed-off-by: H. Peter Anvin

    Yinghai Lu
     
  • Use __memblock_find_in_range to find a free range according to the node
    range in early_node_map[].

    Will be used by memblock_x86_find_in_range_node()

    memblock_x86_find_in_range_node will be used to find right buffer for NODE_DATA

    Signed-off-by: Yinghai Lu
    Signed-off-by: H. Peter Anvin

    Yinghai Lu
     

10 Aug, 2010

3 commits

  • Since 2.6.28, zone->prev_priority has been unused, so it can be removed
    safely. This reduces stack usage slightly.

    Now I have to say that I'm sorry. Two years ago, I thought prev_priority
    could be integrated again and would be useful, but four (or more)
    attempts have not produced good performance numbers. Thus I am giving up
    on that approach.

    The rest of this changelog is notes on prev_priority and why it existed in
    the first place and why it might not be necessary any more. This
    information is based heavily on discussions between Andrew Morton, Rik van
    Riel and Kosaki Motohiro, who is heavily quoted.

    Historically prev_priority was important because it determined when the VM
    would start unmapping PTE pages. i.e. there are no balances of note within
    the VM, Anon vs File and Mapped vs Unmapped. Without prev_priority, there
    is a potential risk of unnecessarily increasing minor faults as a large
    amount of read activity of use-once pages could push mapped pages to the
    end of the LRU and get unmapped.

    There is no proof this is still a problem but currently it is not considered
    to be. Active files are not deactivated if the active file list is smaller
    than the inactive list, reducing the likelihood that file-mapped pages are
    being pushed off the LRU, and referenced executable pages are kept on the
    active list to avoid them getting pushed out by read activity.

    Even if it is a problem, prev_priority wouldn't work nowadays. First of
    all, current vmscan still has a lot of UP-centric code, which exposes
    some weaknesses on machines with dozens of CPUs. I think we need more and
    more improvement.

    The problem is that current vmscan mixes up per-system pressure, per-zone
    pressure and per-task pressure a bit. For example, prev_priority tries to
    boost the priority to that of other concurrent reclaimers. But if the
    other task has a mempolicy restriction, this is not only unnecessary, it
    also causes wrongly large latency and excessive reclaim. Per-task
    priority plus the prev_priority adjustment emulates per-system pressure,
    but it has two issues: 1) the emulation is too rough and brutal, and
    2) we need per-zone pressure, not per-system pressure.

    Another example: currently DEF_PRIORITY is 12. This means the LRU rotates
    about 2 cycles (1/4096 + 1/2048 + 1/1024 + .. + 1) before invoking the
    OOM killer. But if 10,000 threads enter DEF_PRIORITY reclaim at the same
    time, the system has higher memory pressure than at priority==0
    (1/4096*10,000 > 2). prev_priority can't solve such a multithreaded
    workload issue. In other words, the prev_priority concept assumes the
    system doesn't have lots of threads.
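
    The arithmetic in the DEF_PRIORITY example can be checked with a few
    lines of C. This is only a worked calculation; the sim_* names are
    hypothetical, and the 1/2^priority scan fraction is taken from the series
    quoted in the text.

```c
#define SIM_DEF_PRIORITY 12

/* Fraction of the LRU scanned in one pass at the given priority. */
double sim_scan_fraction(int priority)
{
    return 1.0 / (double)(1UL << priority);
}

/* LRU rotations for one task going from DEF_PRIORITY down to priority 0. */
double sim_total_rotations(void)
{
    double total = 0.0;

    for (int prio = SIM_DEF_PRIORITY; prio >= 0; prio--)
        total += sim_scan_fraction(prio);
    return total;
}

/* Combined rotations when nr_tasks all scan at DEF_PRIORITY at once. */
double sim_concurrent_pressure(unsigned long nr_tasks)
{
    return (double)nr_tasks * sim_scan_fraction(SIM_DEF_PRIORITY);
}
```

    The series sums to just under 2 rotations, while 10,000 concurrent tasks
    at DEF_PRIORITY already account for more than 2.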

    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Reviewed-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Dave Chinner
    Cc: Chris Mason
    Cc: Nick Piggin
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Christoph Hellwig
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Andrea Arcangeli
    Cc: Michael Rubin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • We have been using the names try_set_zone_oom and clear_zonelist_oom.
    The role of these functions is to lock the zonelist to prevent parallel
    OOM. So clear_zonelist_oom makes sense, but try_set_zone_oom is rather
    awkward and does not match clear_zonelist_oom.

    Let's rename it to try_set_zonelist_oom.

    Signed-off-by: Minchan Kim
    Acked-by: David Rientjes
    Reviewed-by: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • If memory has been depleted in lowmem zones even with the protection
    afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that
    killing current users will help. The memory is either reclaimable (or
    migratable) already, in which case we should not invoke the oom killer at
    all, or it is pinned by an application for I/O. Killing such an
    application may leave the hardware in an unspecified state and there is no
    guarantee that it will be able to make a timely exit.

    Lowmem allocations are now failed in oom conditions when __GFP_NOFAIL is
    not used so that the task can perhaps recover or try again later.

    Previously, the heuristic provided some protection for those tasks with
    CAP_SYS_RAWIO, but this is no longer necessary since we will not be
    killing tasks for the purposes of ISA allocations.

    high_zoneidx is gfp_zone(gfp_flags), meaning that ZONE_NORMAL will be the
    default for all allocations that are not __GFP_DMA, __GFP_DMA32,
    __GFP_HIGHMEM, and __GFP_MOVABLE on kernels configured to support those
    flags. Testing for high_zoneidx being less than ZONE_NORMAL will only
    return true for allocations that have either __GFP_DMA or __GFP_DMA32.
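
    The zone-index test can be sketched as follows. The flag values, the
    simplified mapping in sim_gfp_zone() and all SIM_/sim_ names are
    assumptions for illustration; the kernel's gfp_zone() is table driven and
    depends on the configuration.

```c
#include <stdbool.h>

/* Illustrative zone ordering, lowest zone first. */
enum sim_zone_type {
    SIM_ZONE_DMA,
    SIM_ZONE_DMA32,
    SIM_ZONE_NORMAL,
    SIM_ZONE_HIGHMEM,
    SIM_ZONE_MOVABLE
};

/* Illustrative flag bits; the values are assumptions. */
#define SIM_GFP_DMA     0x01u
#define SIM_GFP_HIGHMEM 0x02u
#define SIM_GFP_DMA32   0x04u
#define SIM_GFP_MOVABLE 0x08u
#define SIM_GFP_NOFAIL  0x10u

/* Simplified stand-in for gfp_zone(): highest zone the flags allow. */
enum sim_zone_type sim_gfp_zone(unsigned int flags)
{
    if (flags & SIM_GFP_DMA)
        return SIM_ZONE_DMA;
    if (flags & SIM_GFP_DMA32)
        return SIM_ZONE_DMA32;
    if ((flags & SIM_GFP_HIGHMEM) && (flags & SIM_GFP_MOVABLE))
        return SIM_ZONE_MOVABLE;
    if (flags & SIM_GFP_HIGHMEM)
        return SIM_ZONE_HIGHMEM;
    return SIM_ZONE_NORMAL;
}

/* True when a lowmem allocation should fail rather than trigger the OOM killer. */
bool sim_fail_instead_of_oom(unsigned int flags)
{
    enum sim_zone_type high_zoneidx = sim_gfp_zone(flags);

    return high_zoneidx < SIM_ZONE_NORMAL && !(flags & SIM_GFP_NOFAIL);
}
```

    Only the DMA and DMA32 cases sort below the normal zone, so they are the
    only ones that take the fail-early path in this model.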

    Acked-by: KOSAKI Motohiro
    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

21 Jul, 2010

1 commit

  • Borislav Petkov reported that his 32bit NUMA system has a problem:

    [ 0.000000] Reserving total of 4c00 pages for numa KVA remap
    [ 0.000000] kva_start_pfn ~ 32800 max_low_pfn ~ 375fe
    [ 0.000000] max_pfn = 238000
    [ 0.000000] 8202MB HIGHMEM available.
    [ 0.000000] 885MB LOWMEM available.
    [ 0.000000] mapped low ram: 0 - 375fe000
    [ 0.000000] low ram: 0 - 375fe000
    [ 0.000000] alloc (nid=8 100000 - 7ee00000) (1000000 - ffffffff) 1000 1000 => 34e7000
    [ 0.000000] alloc (nid=8 100000 - 7ee00000) (1000000 - ffffffff) 200 40 => 34c9d80
    [ 0.000000] alloc (nid=0 100000 - 7ee00000) (1000000 - ffffffffffffffff) 180 40 => 34e6140
    [ 0.000000] alloc (nid=1 80000000 - c7e60000) (1000000 - ffffffffffffffff) 240 40 => 80000000
    [ 0.000000] BUG: unable to handle kernel paging request at 40000000
    [ 0.000000] IP: [] __alloc_memory_core_early+0x147/0x1d6
    [ 0.000000] *pdpt = 0000000000000000 *pde = f000ff53f000ff00
    ...
    [ 0.000000] Call Trace:
    [ 0.000000] [] ? __alloc_bootmem_node+0x216/0x22f
    [ 0.000000] [] ? sparse_early_usemaps_alloc_node+0x5a/0x10b
    [ 0.000000] [] ? sparse_init+0x1dc/0x499
    [ 0.000000] [] ? paging_init+0x168/0x1df
    [ 0.000000] [] ? native_pagetable_setup_start+0xef/0x1bb

    It looks like it allocates too high an address for bootmem.

    Try to cut the limit with get_max_mapped().

    Reported-by: Borislav Petkov
    Tested-by: Conny Seidel
    Signed-off-by: Yinghai Lu
    Cc: [2.6.34.x]
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Johannes Weiner
    Cc: Lee Schermerhorn
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     

19 Jul, 2010

1 commit

  • With commits 08677214 and 59be5a8e, alloc_bootmem()/free_bootmem() and
    friends use the early_res functions for memory management when
    NO_BOOTMEM is enabled. This patch adds the kmemleak calls in the
    corresponding code paths for bootmem allocations.

    Signed-off-by: Catalin Marinas
    Acked-by: Pekka Enberg
    Acked-by: Yinghai Lu
    Cc: H. Peter Anvin
    Cc: stable@kernel.org

    Catalin Marinas
     

28 May, 2010

2 commits

  • Introduce numa_mem_id(), based on generic percpu variable infrastructure
    to track "nearest node with memory" for archs that support memoryless
    nodes.

    Define API in when CONFIG_HAVE_MEMORYLESS_NODES
    defined, else stubs. Architectures will define HAVE_MEMORYLESS_NODES
    if/when they support them.

    Archs can override definitions of:

    numa_mem_id() - returns node number of "local memory" node
    set_numa_mem() - initialize [this cpu's] per cpu variable 'numa_mem'
    cpu_to_mem() - return numa_mem for specified cpu; may be used as lvalue

    Generic initialization of 'numa_mem' occurs in __build_all_zonelists().
    This will initialize the boot cpu at boot time, and all cpus on change of
    numa_zonelist_order, or when node or memory hot-plug requires zonelist
    rebuild. Archs that support memoryless nodes will need to initialize
    'numa_mem' for secondary cpus as they're brought on-line.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Christoph Lameter
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Nick Piggin
    Cc: David Rientjes
    Cc: Eric Whitney
    Cc: KAMEZAWA Hiroyuki
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: "Luck, Tony"
    Cc: Pekka Enberg
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Rework the generic version of the numa_node_id() function to use the new
    generic percpu variable infrastructure.

    Guard the new implementation with a new config option:

    CONFIG_USE_PERCPU_NUMA_NODE_ID.

    Archs which support this new implementation will default this option to 'y'
    when NUMA is configured. This config option could be removed if/when all
    archs switch over to the generic percpu implementation of numa_node_id().
    Arch support involves:

    1) converting any existing per cpu variable implementations to use
    this implementation. x86_64 is an instance of such an arch.
    2) archs that don't use a per cpu variable for numa_node_id() will
    need to initialize the new per cpu variable "numa_node" as cpus
    are brought on-line. ia64 is an example.
    3) Defining USE_PERCPU_NUMA_NODE_ID in arch dependent Kconfig--e.g.,
    when NUMA is configured. This is required because I have
    retained the old implementation by default to allow archs to
    be modified incrementally, as desired.

    Subsequent patches will convert x86_64 and ia64 to use this implementation.

    Signed-off-by: Lee Schermerhorn
    Cc: Tejun Heo
    Cc: Mel Gorman
    Reviewed-by: Christoph Lameter
    Cc: Nick Piggin
    Cc: David Rientjes
    Cc: Eric Whitney
    Cc: KAMEZAWA Hiroyuki
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: "Luck, Tony"
    Cc: Pekka Enberg
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     

25 May, 2010

10 commits

  • Add global mutex zonelists_mutex to fix the possible race:

        CPU0                                  CPU1                 CPU2
    (1) zone->present_pages += online_pages;
    (2) build_all_zonelists();
                                          (3) alloc_page();
                                          (4) free_page();
                                                               (5) build_all_zonelists();
                                                               (6)   __build_all_zonelists();
                                                               (7)     zone->pageset = alloc_percpu();

    In step (3,4), zone->pageset still points to boot_pageset, so bad
    things may happen if 2+ nodes are in this state. Even if only 1 node
    is accessing the boot_pageset, (3) may still consume too much memory
    to fail the memory allocations in step (7).

    Besides, atomic operation ensures alloc_percpu() in step (7) will never fail
    since there is a new fresh memory block added in step(6).

    [haicheng.li@linux.intel.com: hold zonelists_mutex when build_all_zonelists]
    Signed-off-by: Haicheng Li
    Signed-off-by: Wu Fengguang
    Reviewed-by: Andi Kleen
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haicheng Li
     
  • For each newly populated zone of a hot-added node, we need to update its
    pagesets with a dynamically allocated per_cpu_pageset struct for all
    possible CPUs:

    1) Detach zone->pageset from the shared boot_pageset
    at end of __build_all_zonelists().

    2) Use mutex to protect zone->pageset when it's still
    shared in onlined_pages()

    Otherwise, multiple zones of different nodes would share the same
    bootstrapping boot_pageset for the same CPU, which will eventually cause
    the kernel panic below:

    ------------[ cut here ]------------
    kernel BUG at mm/page_alloc.c:1239!
    invalid opcode: 0000 [#1] SMP
    ...
    Call Trace:
    [] __alloc_pages_nodemask+0x131/0x7b0
    [] alloc_pages_current+0x87/0xd0
    [] __page_cache_alloc+0x67/0x70
    [] __do_page_cache_readahead+0x120/0x260
    [] ra_submit+0x21/0x30
    [] ondemand_readahead+0x166/0x2c0
    [] page_cache_async_readahead+0x80/0xa0
    [] generic_file_aio_read+0x364/0x670
    [] nfs_file_read+0xca/0x130
    [] do_sync_read+0xfa/0x140
    [] vfs_read+0xb5/0x1a0
    [] sys_read+0x51/0x80
    [] system_call_fastpath+0x16/0x1b
    RIP [] get_page_from_freelist+0x883/0x900
    RSP
    ---[ end trace 4bda28328b9990db ]

    [akpm@linux-foundation.org: merge fix]
    Signed-off-by: Haicheng Li
    Signed-off-by: Wu Fengguang
    Reviewed-by: Andi Kleen
    Reviewed-by: Christoph Lameter
    Cc: Mel Gorman
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haicheng Li
     
  • No behavior change here.

    Move some of setup_per_cpu_pageset() code into a new function
    setup_zone_pageset() that will be useful for memory hotplug.

    Signed-off-by: Wu Fengguang
    Signed-off-by: Haicheng Li
    Reviewed-by: Andi Kleen
    Reviewed-by: Christoph Lameter
    Cc: Mel Gorman
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • free_hot_cold_page() and __free_pages_ok() have very similar freeing
    preparation. Consolidate them.

    [akpm@linux-foundation.org: fix busted coding style]
    Signed-off-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The fragmentation index may indicate that a failure is due to external
    fragmentation but after a compaction run completes, it is still possible
    for an allocation to fail. There are two obvious reasons as to why:

    o Page migration cannot move all pages so fragmentation remains
    o A suitable page may exist but watermarks are not met

    In the event of compaction followed by an allocation failure, this patch
    defers further compaction in the zone (1 << compact_defer_shift) times.
    If the next compaction attempt also fails, compact_defer_shift is
    increased up to a maximum of 6. If compaction succeeds, the defer
    counters are reset again.
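
    The back-off can be sketched in userspace like this. The field names
    follow the text, but the struct, the sim_* functions and the exact
    off-by-one behaviour of the counter are modelling assumptions.

```c
#include <stdbool.h>

#define SIM_COMPACT_MAX_DEFER_SHIFT 6 /* maximum shift named in the text */

/* Hypothetical per-zone defer state. */
struct sim_zone {
    unsigned int compact_considered;  /* attempts seen since last failure */
    unsigned int compact_defer_shift; /* back-off exponent */
};

/* Compaction ran but the allocation still failed: back off further. */
void sim_defer_compaction(struct sim_zone *z)
{
    z->compact_considered = 0;
    if (z->compact_defer_shift < SIM_COMPACT_MAX_DEFER_SHIFT)
        z->compact_defer_shift++;
}

/* Should this compaction attempt be skipped? */
bool sim_compaction_deferred(struct sim_zone *z)
{
    unsigned int defer_limit = 1u << z->compact_defer_shift;

    if (z->compact_considered < defer_limit)
        z->compact_considered++;
    return z->compact_considered < defer_limit;
}

/* Compaction led to a successful allocation: reset the defer counters. */
void sim_compaction_succeeded(struct sim_zone *z)
{
    z->compact_considered = 0;
    z->compact_defer_shift = 0;
}
```

    Each consecutive failure doubles the window in which attempts are
    skipped, up to 1 << 6, and one success removes the back-off entirely.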

    The zone that is deferred is the first zone in the zonelist - i.e. the
    preferred zone. To defer compaction in the other zones, the information
    would need to be stored in the zonelist or implemented similar to the
    zonelist_cache. This would impact the fast-paths and is not justified at
    this time.

    Signed-off-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Ordinarily when a high-order allocation fails, direct reclaim is entered
    to free pages to satisfy the allocation. With this patch, it is
    determined if an allocation failed due to external fragmentation instead
    of low memory and if so, the calling process will compact until a suitable
    page is freed. Compaction by moving pages in memory is considerably
    cheaper than paging out to disk and works where there are locked pages or
    no swap. If compaction fails to free a page of a suitable size, then
    reclaim will still occur.

    Direct compaction returns as soon as possible. As each block is
    compacted, it is checked if a suitable page has been freed and if so, it
    returns.

    [akpm@linux-foundation.org: Fix build errors]
    [aarcange@redhat.com: fix count_vm_event preempt in memory compaction direct reclaim]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patch is the core of a mechanism which compacts memory in a zone by
    relocating movable pages towards the end of the zone.

    A single compaction run involves a migration scanner and a free scanner.
    Both scanners operate on pageblock-sized areas in the zone. The migration
    scanner starts at the bottom of the zone and searches for all movable
    pages within each area, isolating them onto a private list called
    migratelist. The free scanner starts at the top of the zone and searches
    for suitable areas and consumes the free pages within making them
    available for the migration scanner. The pages isolated for migration are
    then migrated to the newly isolated free pages.
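
    The two scanners can be illustrated with a toy model that works page by
    page on an array. This is a heavy simplification under assumptions: the
    patch works on pageblock-sized areas and isolates pages onto lists, while
    the sim_* names and the one-page-at-a-time moves here are hypothetical.

```c
/* State of one page frame in the toy zone. */
enum sim_page { SIM_FREE, SIM_MOVABLE, SIM_UNMOVABLE };

/*
 * Compact a toy zone in place and return the number of pages migrated.
 * The migration scanner walks up from the bottom looking for movable
 * pages; the free scanner walks down from the top looking for free pages.
 * The run ends when the two scanners meet.
 */
int sim_compact_zone(enum sim_page *zone, int nr_pages)
{
    int migrate_pfn = 0;         /* migration scanner position */
    int free_pfn = nr_pages - 1; /* free scanner position */
    int migrated = 0;

    while (migrate_pfn < free_pfn) {
        if (zone[migrate_pfn] != SIM_MOVABLE) {
            migrate_pfn++;
            continue;
        }
        if (zone[free_pfn] != SIM_FREE) {
            free_pfn--;
            continue;
        }
        /* "Migrate" the movable page into the free slot near the top. */
        zone[free_pfn] = SIM_MOVABLE;
        zone[migrate_pfn] = SIM_FREE;
        migrated++;
    }
    return migrated;
}
```

    Unmovable pages stay where they are, which is one reason fragmentation
    can remain after a run.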

    [aarcange@redhat.com: Fix unsafe optimisation]
    [mel@csn.ul.ie: do not schedule work on other CPUs for compaction]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • There are two types of zonelist ordering methodologies:

    - node order, preferring allocations on a node to stay local to it, and

    - zone order, preferring allocations come from a higher zone to avoid
    allocating in lowmem zones even though they may not be local.

    The ordering technique used by the kernel is configurable on the command
    line, but also has some logic to determine what the default should be.

    This logic currently lacks knowledge of systems where a node may only have
    lowmem. For such systems, it is necessary to use node order so that
    GFP_KERNEL allocations may be satisfied by nodes consisting of only
    lowmem.

    If zone order is used, GFP_KERNEL allocations to such nodes are actually
    allocated on a node with local affinity that includes ZONE_NORMAL.

    This change defaults to node zonelist ordering if any node lacks
    ZONE_NORMAL.

    To force zone order, append 'numa_zonelist_order=zone' to the kernel
    command line.

    Signed-off-by: David Rientjes
    Acked-by: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Before applying this patch, cpuset updates task->mems_allowed and
    mempolicy by setting all new bits in the nodemask first, and clearing all
    old unallowed bits later. But along the way, the allocator may find that
    there is no node to allocate memory from.

    The reason is that when cpuset rebinds the task's mempolicy, it clears
    the nodes which the allocator can allocate pages on, for example:

    (mpol: mempolicy)
    task1                   task1's mpol    task2
    alloc page              1
      alloc on node0? NO    1
                            1               change mems from 1 to 0
                            1               rebind task1's mpol
                            0-1               set new bits
                            0                 clear disallowed bits
      alloc on node1? NO    0
      ...
    can't alloc page
      goto oom

    This patch fixes this problem by expanding the nodes range first (setting
    newly allowed bits) and shrinking it lazily (clearing newly disallowed
    bits). So we use a variable to tell the write-side task that the read-side
    task is reading the nodemask, and the write-side task clears newly
    disallowed nodes after the read-side task ends the current memory
    allocation.
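
    The expand-first, shrink-lazily sequence can be sketched with a plain
    bitmask. The sim_task struct, its reading flag and the sim_* functions
    are hypothetical; the message above only says that "a variable" signals
    the reader, and this model uses a single flag for it.

```c
#include <stdbool.h>

/* Hypothetical model of a task's allowed-nodes state. */
struct sim_task {
    unsigned long mems_allowed; /* bit n set: node n is allowed */
    bool reading;               /* read side is inside an allocation */
};

/* Step 1 (write side): add the newly allowed nodes straight away. */
void sim_expand_mems(struct sim_task *t, unsigned long newmems)
{
    t->mems_allowed |= newmems;
}

/*
 * Step 2 (write side): drop the newly disallowed nodes, but only once the
 * read side is no longer in the middle of an allocation. Returns true when
 * the shrink was applied.
 */
bool sim_try_shrink_mems(struct sim_task *t, unsigned long newmems)
{
    if (t->reading)
        return false;
    t->mems_allowed &= newmems;
    return true;
}
```

    While the reader is active the mask is the union of old and new nodes,
    so an allocation in flight always has at least one node to try.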

    [akpm@linux-foundation.org: fix spello]
    Signed-off-by: Miao Xie
    Cc: David Rientjes
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     
  • …re merging to the tail of the free lists

    In order to reduce fragmentation, this patch classifies freed pages in two
    groups according to their probability of being part of a high order merge.
    Pages belonging to a compound whose next-highest buddy is free are more
    likely to be part of a high order merge in the near future, so they will
    be added at the tail of the freelist. The remaining pages are put at the
    front of the freelist.

    In this way, the pages that are more likely to cause a big merge are kept
    free longer. Consequently there is a tendency to aggregate the
    long-living allocations on a subset of the compounds, reducing the
    fragmentation.
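
    The head/tail placement can be modelled with a small array-backed free
    list. This is a sketch with hypothetical sim_* names; whether the
    next-highest buddy is free is passed in as a flag rather than computed,
    and allocation is assumed to take pages from the head.

```c
#include <stdbool.h>

#define SIM_LIST_CAP 16 /* capacity of the toy free list */

/* Toy free list; index 0 is the head. */
struct sim_free_list {
    unsigned long pfn[SIM_LIST_CAP];
    int count;
};

/* Free a page: tail if its next-highest buddy is free, head otherwise. */
void sim_free_page(struct sim_free_list *l, unsigned long pfn,
                   bool higher_buddy_free)
{
    if (l->count >= SIM_LIST_CAP)
        return; /* model only: ignore frees once the list is full */
    if (higher_buddy_free) {
        l->pfn[l->count] = pfn; /* tail: likely to merge soon */
    } else {
        for (int i = l->count; i > 0; i--)
            l->pfn[i] = l->pfn[i - 1];
        l->pfn[0] = pfn; /* head: reused first */
    }
    l->count++;
}

/* Allocate from the head; returns false when the list is empty. */
bool sim_alloc_page(struct sim_free_list *l, unsigned long *pfn)
{
    if (l->count == 0)
        return false;
    *pfn = l->pfn[0];
    for (int i = 1; i < l->count; i++)
        l->pfn[i - 1] = l->pfn[i];
    l->count--;
    return true;
}
```

    A page placed at the tail is handed out last, which gives its buddy more
    time to be freed and merged with it.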

    This heuristic was tested on three machines, x86, x86-64 and ppc64 with
    3GB of RAM in each machine. The tests were kernbench, netperf, sysbench
    and STREAM for performance and a high-order stress test for huge page
    allocations.

    KernBench X86
    Elapsed mean 374.77 ( 0.00%) 375.10 (-0.09%)
    User mean 649.53 ( 0.00%) 650.44 (-0.14%)
    System mean 54.75 ( 0.00%) 54.18 ( 1.05%)
    CPU mean 187.75 ( 0.00%) 187.25 ( 0.27%)

    KernBench X86-64
    Elapsed mean 94.45 ( 0.00%) 94.01 ( 0.47%)
    User mean 323.27 ( 0.00%) 322.66 ( 0.19%)
    System mean 36.71 ( 0.00%) 36.50 ( 0.57%)
    CPU mean 380.75 ( 0.00%) 381.75 (-0.26%)

    KernBench PPC64
    Elapsed mean 173.45 ( 0.00%) 173.74 (-0.17%)
    User mean 587.99 ( 0.00%) 587.95 ( 0.01%)
    System mean 60.60 ( 0.00%) 60.57 ( 0.05%)
    CPU mean 373.50 ( 0.00%) 372.75 ( 0.20%)

    Nothing notable for kernbench.

    NetPerf UDP X86
    64 42.68 ( 0.00%) 42.77 ( 0.21%)
    128 85.62 ( 0.00%) 85.32 (-0.35%)
    256 170.01 ( 0.00%) 168.76 (-0.74%)
    1024 655.68 ( 0.00%) 652.33 (-0.51%)
    2048 1262.39 ( 0.00%) 1248.61 (-1.10%)
    3312 1958.41 ( 0.00%) 1944.61 (-0.71%)
    4096 2345.63 ( 0.00%) 2318.83 (-1.16%)
    8192 4132.90 ( 0.00%) 4089.50 (-1.06%)
    16384 6770.88 ( 0.00%) 6642.05 (-1.94%)*

    NetPerf UDP X86-64
    64 148.82 ( 0.00%) 154.92 ( 3.94%)
    128 298.96 ( 0.00%) 312.95 ( 4.47%)
    256 583.67 ( 0.00%) 626.39 ( 6.82%)
    1024 2293.18 ( 0.00%) 2371.10 ( 3.29%)
    2048 4274.16 ( 0.00%) 4396.83 ( 2.79%)
    3312 6356.94 ( 0.00%) 6571.35 ( 3.26%)
    4096 7422.68 ( 0.00%) 7635.42 ( 2.79%)*
    8192 12114.81 ( 0.00%)* 12346.88 ( 1.88%)
    16384 17022.28 ( 0.00%)* 17033.19 ( 0.06%)*
    1.64% 2.73%

    NetPerf UDP PPC64
    64 49.98 ( 0.00%) 50.25 ( 0.54%)
    128 98.66 ( 0.00%) 100.95 ( 2.27%)
    256 197.33 ( 0.00%) 191.03 (-3.30%)
    1024 761.98 ( 0.00%) 785.07 ( 2.94%)
    2048 1493.50 ( 0.00%) 1510.85 ( 1.15%)
    3312 2303.95 ( 0.00%) 2271.72 (-1.42%)
    4096 2774.56 ( 0.00%) 2773.06 (-0.05%)
    8192 4918.31 ( 0.00%) 4793.59 (-2.60%)
    16384 7497.98 ( 0.00%) 7749.52 ( 3.25%)

    The tests are run to have confidence limits within 1%. Results marked
    with a * were not confident although in this case, it's only outside by
    small amounts. Even with some results that were not confident, the
    netperf UDP results were generally positive.

    NetPerf TCP X86
    64 652.25 ( 0.00%)* 648.12 (-0.64%)*
    23.80% 22.82%
    128 1229.98 ( 0.00%)* 1220.56 (-0.77%)*
    21.03% 18.90%
    256 2105.88 ( 0.00%) 1872.03 (-12.49%)*
    1.00% 16.46%
    1024 3476.46 ( 0.00%)* 3548.28 ( 2.02%)*
    13.37% 11.39%
    2048 4023.44 ( 0.00%)* 4231.45 ( 4.92%)*
    9.76% 12.48%
    3312 4348.88 ( 0.00%)* 4396.96 ( 1.09%)*
    6.49% 8.75%
    4096 4726.56 ( 0.00%)* 4877.71 ( 3.10%)*
    9.85% 8.50%
    8192 4732.28 ( 0.00%)* 5777.77 (18.10%)*
    9.13% 13.04%
    16384 5543.05 ( 0.00%)* 5906.24 ( 6.15%)*
    7.73% 8.68%

    NETPERF TCP X86-64
    tcp-vanilla pgalloc-delay
    64 1895.87 ( 0.00%)* 1775.07 (-6.81%)*
    5.79% 4.78%
    128 3571.03 ( 0.00%)* 3342.20 (-6.85%)*
    3.68% 6.06%
    256 5097.21 ( 0.00%)* 4859.43 (-4.89%)*
    3.02% 2.10%
    1024 8919.10 ( 0.00%)* 8892.49 (-0.30%)*
    5.89% 6.55%
    2048 10255.46 ( 0.00%)* 10449.39 ( 1.86%)*
    7.08% 7.44%
    3312 10839.90 ( 0.00%)* 10740.15 (-0.93%)*
    6.87% 7.33%
    4096 10814.84 ( 0.00%)* 10766.97 (-0.44%)*
    6.86% 8.18%
    8192 11606.89 ( 0.00%)* 11189.28 (-3.73%)*
    7.49% 5.55%
    16384 12554.88 ( 0.00%)* 12361.22 (-1.57%)*
    7.36% 6.49%

    NETPERF TCP PPC64
    tcp-vanilla pgalloc-delay
    64 594.17 ( 0.00%) 596.04 ( 0.31%)*
    1.00% 2.29%
    128 1064.87 ( 0.00%)* 1074.77 ( 0.92%)*
    1.30% 1.40%
    256 1852.46 ( 0.00%)* 1856.95 ( 0.24%)
    1.25% 1.00%
    1024 3839.46 ( 0.00%)* 3813.05 (-0.69%)
    1.02% 1.00%
    2048 4885.04 ( 0.00%)* 4881.97 (-0.06%)*
    1.15% 1.04%
    3312 5506.90 ( 0.00%) 5459.72 (-0.86%)
    4096 6449.19 ( 0.00%) 6345.46 (-1.63%)
    8192 7501.17 ( 0.00%) 7508.79 ( 0.10%)
    16384 9618.65 ( 0.00%) 9490.10 (-1.35%)

    There was a distinct lack of confidence in the X86* figures so I included
    what the deviation was where the results were not confident. Many of the
    results, whether gains or losses, were within the standard deviation so no
    solid conclusion can be reached on performance impact. Looking at the
    figures, only the X86-64 ones look suspicious, with a few losses that were
    outside the noise. However, the results were so unstable that without
    knowing why they vary so much, a solid conclusion cannot be reached.

    SYSBENCH X86
    sysbench-vanilla pgalloc-delay
    1 7722.85 ( 0.00%) 7756.79 ( 0.44%)
    2 14901.11 ( 0.00%) 13683.44 (-8.90%)
    3 15171.71 ( 0.00%) 14888.25 (-1.90%)
    4 14966.98 ( 0.00%) 15029.67 ( 0.42%)
    5 14370.47 ( 0.00%) 14865.00 ( 3.33%)
    6 14870.33 ( 0.00%) 14845.57 (-0.17%)
    7 14429.45 ( 0.00%) 14520.85 ( 0.63%)
    8 14354.35 ( 0.00%) 14362.31 ( 0.06%)

    SYSBENCH X86-64
    1 17448.70 ( 0.00%) 17484.41 ( 0.20%)
    2 34276.39 ( 0.00%) 34251.00 (-0.07%)
    3 50805.25 ( 0.00%) 50854.80 ( 0.10%)
    4 66667.10 ( 0.00%) 66174.69 (-0.74%)
    5 66003.91 ( 0.00%) 65685.25 (-0.49%)
    6 64981.90 ( 0.00%) 65125.60 ( 0.22%)
    7 64933.16 ( 0.00%) 64379.23 (-0.86%)
    8 63353.30 ( 0.00%) 63281.22 (-0.11%)
    9 63511.84 ( 0.00%) 63570.37 ( 0.09%)
    10 62708.27 ( 0.00%) 63166.25 ( 0.73%)
    11 62092.81 ( 0.00%) 61787.75 (-0.49%)
    12 61330.11 ( 0.00%) 61036.34 (-0.48%)
    13 61438.37 ( 0.00%) 61994.47 ( 0.90%)
    14 62304.48 ( 0.00%) 62064.90 (-0.39%)
    15 63296.48 ( 0.00%) 62875.16 (-0.67%)
    16 63951.76 ( 0.00%) 63769.09 (-0.29%)

    SYSBENCH PPC64
    sysbench-vanilla pgalloc-delay
    1 7645.08 ( 0.00%) 7467.43 (-2.38%)
    2 14856.67 ( 0.00%) 14558.73 (-2.05%)
    3 21952.31 ( 0.00%) 21683.64 (-1.24%)
    4 27946.09 ( 0.00%) 28623.29 ( 2.37%)
    5 28045.11 ( 0.00%) 28143.69 ( 0.35%)
    6 27477.10 ( 0.00%) 27337.45 (-0.51%)
    7 26489.17 ( 0.00%) 26590.06 ( 0.38%)
    8 26642.91 ( 0.00%) 25274.33 (-5.41%)
    9 25137.27 ( 0.00%) 24810.06 (-1.32%)
    10 24451.99 ( 0.00%) 24275.85 (-0.73%)
    11 23262.20 ( 0.00%) 23674.88 ( 1.74%)
    12 24234.81 ( 0.00%) 23640.89 (-2.51%)
    13 24577.75 ( 0.00%) 24433.50 (-0.59%)
    14 25640.19 ( 0.00%) 25116.52 (-2.08%)
    15 26188.84 ( 0.00%) 26181.36 (-0.03%)
    16 26782.37 ( 0.00%) 26255.99 (-2.00%)

    Again, there is little to conclude here. While there are a few losses,
    the results vary by +/- 8% in some cases. They are the results of most
    concern as there are some large losses but it's also within the variance
    typically seen between kernel releases.

    The STREAM results varied so little and are so verbose that I didn't
    include them here.

    The final test stressed how many huge pages can be allocated. The
    absolute number of huge pages allocated is the same with or without the
    patch. However, the "unusability free space index" which is a measure of
    external fragmentation was slightly lower (lower is better) throughout the
    lifetime of the system. I also measured the latency of how long it took
    to successfully allocate a huge page. The latency was slightly lower and
    on X86 and PPC64, more huge pages were allocated almost immediately from
    the free lists. The improvement is slight but there.

    [mel@csn.ul.ie: Tested, reworked for less branches]
    [czoccolo@gmail.com: fix oops by checking pfn_valid_within()]
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Christoph Lameter <cl@linux-foundation.org>
    Acked-by: Rik van Riel <riel@redhat.com>
    Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
    Cc: Corrado Zoccolo <czoccolo@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Corrado Zoccolo
     

16 Mar, 2010

1 commit


13 Mar, 2010

2 commits

  • - introduce dump_page() to print the page info for debugging some error
    condition.

    - convert three mm users: bad_page(), print_bad_pte() and memory offline
    failure.

    - print an extra field: the symbolic names of page->flags

    Example dump_page() output:

    [ 157.521694] page:ffffea0000a7cba8 count:2 mapcount:1 mapping:ffff88001c901791 index:0x147
    [ 157.525570] page flags: 0x100000000100068(uptodate|lru|active|swapbacked)
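
    The symbolic decoding shown in the sample output can be sketched as a
    simple mask-to-name table walk. The bit positions below are assumptions
    chosen to match the sample line only; real positions depend on the kernel
    version and configuration, and format_page_flags() is a hypothetical name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct flag_name {
    unsigned long long mask;
    const char *name;
};

/* Assumed bit positions, consistent with the sample output only. */
static const struct flag_name flag_names[] = {
    { 1ULL << 3,  "uptodate"   },
    { 1ULL << 5,  "lru"        },
    { 1ULL << 6,  "active"     },
    { 1ULL << 20, "swapbacked" },
};

/* Write the names of the known set bits into buf, separated by '|'. */
static void format_page_flags(unsigned long long flags, char *buf, size_t len)
{
    size_t used = 0;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';

    for (size_t i = 0; i < sizeof(flag_names) / sizeof(flag_names[0]); i++) {
        if (!(flags & flag_names[i].mask))
            continue;
        int n = snprintf(buf + used, len - used, "%s%s",
                         used ? "|" : "", flag_names[i].name);
        if (n < 0 || (size_t)n >= len - used)
            break;                  /* output truncated, stop here */
        used += (size_t)n;
    }
}
```

    Feeding it the value 0x100000000100068 from the sample yields the string
    uptodate|lru|active|swapbacked. Bits with no table entry, such as the high
    bit in that value, are ignored by this sketch.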

    Signed-off-by: Wu Fengguang
    Cc: Ingo Molnar
    Cc: Alex Chiang
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • __zone_pcp_update() iterates over NR_CPUS instead of limiting the access
    to the possible cpus. This might result in access to uninitialized areas
    as the per cpu allocator only populates the per cpu memory for possible
    cpus.

    This problem was created as a result of the dynamic allocation of pagesets
    from percpu memory that went in during the merge window - commit
    99dcc3e5a94ed491fbef402831d8c0bbb267f995 ("this_cpu: Page allocator
    conversion").

    Signed-off-by: Thomas Gleixner
    Acked-by: Pekka Enberg
    Acked-by: Tejun Heo
    Acked-by: Christoph Lameter
    Acked-by: Mel Gorman
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     

07 Mar, 2010

6 commits

  • free_area_init_nodes() emits pfn ranges for all zones on the system.
    There may be no pages on a higher zone, however, due to memory limitations
    or the use of the mem= kernel parameter. For example:

    Zone PFN ranges:
    DMA 0x00000001 -> 0x00001000
    DMA32 0x00001000 -> 0x00100000
    Normal 0x00100000 -> 0x00100000

    The implementation copies the previous zone's highest pfn, if any, as the
    next zone's lowest pfn. If its highest pfn is then greater than the
    amount of addressable memory, the upper memory limit is used instead.
    Thus, both the lowest and highest possible pfn for higher zones without
    memory may be the same.

    The pfn range for zones without memory is now shown as "empty" instead.
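
    A minimal sketch of the new output rule, assuming a hypothetical helper
    format_zone_range() that formats into a buffer instead of printing to the
    kernel log:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format one line of the "Zone PFN ranges" listing. A zone whose lowest and
 * highest pfn coincide has no pages, so it is reported as "empty".
 */
static void format_zone_range(const char *name, unsigned long lowest_pfn,
                              unsigned long highest_pfn,
                              char *buf, size_t len)
{
    assert(buf != NULL && len > 0);

    if (lowest_pfn == highest_pfn)
        snprintf(buf, len, "  %-8s empty", name);
    else
        snprintf(buf, len, "  %-8s %#010lx -> %#010lx",
                 name, lowest_pfn, highest_pfn);
}
```

    With the ranges from the example, DMA32 still shows 0x00001000 ->
    0x00100000, while Normal (0x00100000 to 0x00100000) is shown as empty.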

    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • There are quite a few GFP_KERNEL memory allocations made during
    suspend/hibernation and resume that may cause the system to hang, because
    the I/O operations they depend on cannot be completed due to the
    underlying devices being suspended.

    Avoid this problem by clearing the __GFP_IO and __GFP_FS bits in
    gfp_allowed_mask before suspend/hibernation and restoring the original
    values of these bits in gfp_allowed_mask during the subsequent resume.
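
    The save, clear and restore sequence can be sketched as follows. The flag
    values and all SKETCH_ and sketch_ names are illustrative assumptions, not
    the kernel's definitions; only gfp_allowed_mask is named in the text.

```c
#include <assert.h>

typedef unsigned int gfp_t;

/* Illustrative flag values; the kernel's actual values may differ. */
#define SKETCH_GFP_WAIT 0x10u
#define SKETCH_GFP_IO   0x40u
#define SKETCH_GFP_FS   0x80u
#define SKETCH_GFP_IOFS (SKETCH_GFP_IO | SKETCH_GFP_FS)

static gfp_t gfp_allowed_mask = ~0u;

/* Every allocation request is filtered through the allowed mask. */
static gfp_t sketch_effective_gfp(gfp_t requested)
{
    return requested & gfp_allowed_mask;
}

/* Before suspend: forbid I/O and FS activity, return the old mask. */
static gfp_t sketch_suspend_restrict_gfp(void)
{
    gfp_t saved = gfp_allowed_mask;

    gfp_allowed_mask &= ~SKETCH_GFP_IOFS;
    return saved;
}

/* During resume: put back the original values of the two bits. */
static void sketch_resume_restore_gfp(gfp_t saved)
{
    /* Expected to run only after the restrict step above. */
    assert((gfp_allowed_mask & SKETCH_GFP_IOFS) == 0);
    gfp_allowed_mask |= saved & SKETCH_GFP_IOFS;
}
```

    While the mask is restricted, a request carrying the I/O and FS bits is
    silently downgraded, so it cannot wait on a suspended device.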

    [akpm@linux-foundation.org: fix CONFIG_PM=n linkage]
    Signed-off-by: Rafael J. Wysocki
    Reported-by: Maxim Levitsky
    Cc: Sebastian Ott
    Cc: Benjamin Herrenschmidt
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
    Commit e815af95 ("change all_unreclaimable zone member to flags") changed
    the all_unreclaimable member to a bit flag, but this had an undesirable
    side effect: free_one_page() is one of the hottest paths in the Linux
    kernel, and adding atomic ops to it can reduce kernel performance a bit.

    Thus, this patch partially reverts that commit. At the least,
    all_unreclaimable shouldn't share a memory word with other zone flags.

    [akpm@linux-foundation.org: fix patch interaction]
    Signed-off-by: KOSAKI Motohiro
    Cc: David Rientjes
    Cc: Wu Fengguang
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Huang Shijie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • free_hot_page() is just a wrapper around free_hot_cold_page() with
    parameter 'cold = 0'. After adding a clear comment for
    free_hot_cold_page(), it is reasonable to remove a level of call.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Li Hong
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Cc: KOSAKI Motohiro
    Cc: Americo Wang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Hong
     
  • Move the call to trace_mm_page_free_direct() from free_hot_page() to
    free_hot_cold_page(). This is clearer and places the call close to
    kmemcheck_free_shadow(), as is done in __free_pages_ok().

    Signed-off-by: Li Hong
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Hong
     
  • trace_mm_page_free_direct() is called in __free_pages(). But it is
    called again in free_hot_page() if order == 0, which produces duplicate
    records in the trace file for the mm_page_free_direct event, as below:

    TASK-PID CPU# TIMESTAMP FUNCTION
    gnome-terminal-1567 [000] 4415.246466: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
    gnome-terminal-1567 [000] 4415.246468: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
    gnome-terminal-1567 [000] 4415.246506: mm_page_alloc: page=ffffea0003db9f40 pfn=1155800 order=0 migratetype=0 gfp_flags=GFP_KERNEL
    gnome-terminal-1567 [000] 4415.255557: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
    gnome-terminal-1567 [000] 4415.255557: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0

    This patch removes the first call and adds a call to
    trace_mm_page_free_direct() in __free_pages_ok().

    Signed-off-by: Li Hong
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Hong