19 Jan, 2006

1 commit

  • Some bits for zone reclaim exist in 2.6.15, but they are not usable. This
    patch fixes them up, removes unused code and makes zone reclaim usable.

    Zone reclaim allows reclaiming pages from a zone if the number of free
    pages falls below the watermarks, even if other zones still have enough
    pages available. Zone reclaim is of particular importance for NUMA
    machines: it can be more beneficial to reclaim a page than to take the
    performance penalty of allocating a page on a remote zone.

    Zone reclaim is enabled if the maximum distance to another node is higher
    than RECLAIM_DISTANCE, which may be defined by an arch. By default
    RECLAIM_DISTANCE is 20, which on IA64 is the distance to another node in
    the same component (enclosure or motherboard). The meaning of the NUMA
    distance information seems to vary by arch; a simplified sketch of the
    check follows this entry.

    If zone reclaim is not successful then no further reclaim attempts will
    occur for a certain time period (ZONE_RECLAIM_INTERVAL).

    This patch was discussed before. See

    http://marc.theaimsgroup.com/?l=linux-kernel&m=113519961504207&w=2
    http://marc.theaimsgroup.com/?l=linux-kernel&m=113408418232531&w=2
    http://marc.theaimsgroup.com/?l=linux-kernel&m=113389027420032&w=2
    http://marc.theaimsgroup.com/?l=linux-kernel&m=113380938612205&w=2

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
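
    The enabling check described above can be pictured with a small,
    self-contained C sketch (the node count, distance table and helper names
    are made up for the example; this is not the kernel's implementation):

        #include <stdio.h>

        #define RECLAIM_DISTANCE 20     /* default; an arch may define its own */

        /* Hypothetical 4-node distance table: nodes 0/1 and 2/3 each share an
         * enclosure (distance 20); the two pairs are further apart (40). */
        static const int node_distance[4][4] = {
            { 10, 20, 40, 40 },
            { 20, 10, 40, 40 },
            { 40, 40, 10, 20 },
            { 40, 40, 20, 10 },
        };

        /* Zone reclaim is worthwhile for a node whose most distant peer lies
         * beyond RECLAIM_DISTANCE. */
        static int zone_reclaim_enabled(int node)
        {
            int n, max = 0;

            for (n = 0; n < 4; n++)
                if (node_distance[node][n] > max)
                    max = node_distance[node][n];
            return max > RECLAIM_DISTANCE;
        }

        int main(void)
        {
            printf("node 0: zone reclaim %s\n",
                   zone_reclaim_enabled(0) ? "enabled" : "disabled");
            return 0;
        }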
     

17 Jan, 2006

1 commit

  • Add __meminit to the __init lineup to ensure functions default
    to __init when memory hotplug is not enabled. Replace __devinit
    with __meminit on functions that were changed when the memory
    hotplug code was introduced.

    Signed-off-by: Matt Tolentino
    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Matt Tolentino
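
    A rough sketch of what the annotation amounts to (simplified stand-ins;
    the real definitions live in the kernel's init headers and are more
    involved): without CONFIG_MEMORY_HOTPLUG the function is boot-only and
    can be discarded after init, with it the function has to stay resident.

        /* Illustrative only -- placeholders for the kernel's annotations. */
        #define __init          /* kernel: put the code in a discardable section */

        #ifdef CONFIG_MEMORY_HOTPLUG
        #define __meminit               /* may run again after boot: keep it */
        #else
        #define __meminit __init        /* boot-only: defaults to __init     */
        #endif

        /* A memory-setup helper annotated this way is thrown away after boot
         * unless memory hotplug might need to call it again later. */
        void __meminit init_new_memory_range(void) { }

        int main(void)
        {
            init_new_memory_range();
            return 0;
        }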
     

13 Jan, 2006

1 commit


12 Jan, 2006

3 commits


10 Jan, 2006

1 commit


09 Jan, 2006

6 commits

  • Provide a simple per-cpuset metric of memory pressure, tracking the rate
    at which the tasks in a cpuset call try_to_free_pages(), the synchronous
    (direct) memory reclaim code.

    This enables batch managers monitoring jobs running in dedicated cpusets to
    efficiently detect what level of memory pressure that job is causing.

    This is useful both on tightly managed systems running a wide mix of
    submitted jobs, which may choose to terminate or reprioritize jobs that are
    trying to use more memory than allowed on the nodes assigned them, and with
    tightly coupled, long running, massively parallel scientific computing jobs
    that will dramatically fail to meet required performance goals if they
    start to use more memory than allowed to them.

    This patch just provides a very economical way for the batch manager to
    monitor a cpuset for signs of memory pressure. It's up to the batch
    manager or other user code to decide what to do about it and take action.

    ==> Unless this feature is enabled by writing "1" to the special file
    /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
    code of __alloc_pages() for this metric reduces to simply noticing
    that the cpuset_memory_pressure_enabled flag is zero. So only
    systems that enable this feature will compute the metric.

    Why a per-cpuset, running average:

    Because this meter is per-cpuset, rather than per-task or mm, the
    system load imposed by a batch scheduler monitoring this metric is
    sharply reduced on large systems, because a scan of the tasklist can be
    avoided on each set of queries.

    Because this meter is a running average, instead of an accumulating
    counter, a batch scheduler can detect memory pressure with a single
    read, instead of having to read and accumulate results for a period of
    time.

    Because this meter is per-cpuset rather than per-task or mm, the
    batch scheduler can obtain the key information, memory pressure in a
    cpuset, with a single read, rather than having to query and accumulate
    results over all the (dynamically changing) set of tasks in the cpuset.

    A per-cpuset simple digital filter (requires a spinlock and 3 words of data
    per-cpuset) is kept, and updated by any task attached to that cpuset, if it
    enters the synchronous (direct) page reclaim code.

    A per-cpuset file provides an integer number representing the recent
    (half-life of 10 seconds) rate of direct page reclaims caused by the tasks
    in the cpuset, in units of reclaims attempted per second, times 1000.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
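
    A toy user-space model of the decaying filter described above (the
    structure, names and floating-point math are illustrative; the kernel
    keeps a fixed-point integer filter protected by a spinlock):

        #include <math.h>
        #include <stdio.h>

        #define HALF_LIFE 10.0          /* seconds, per the description above */

        struct fmeter {
            double val;                 /* decaying event count, scaled by 1000 */
            double last;                /* time of the last update, in seconds  */
        };

        static void fmeter_decay(struct fmeter *f, double now)
        {
            f->val *= pow(0.5, (now - f->last) / HALF_LIFE);
            f->last = now;
        }

        /* Conceptually called whenever a task in the cpuset enters the
         * synchronous (direct) page reclaim path. */
        static void fmeter_mark(struct fmeter *f, double now)
        {
            fmeter_decay(f, now);
            f->val += 1000.0;
        }

        /* Roughly what a read of the per-cpuset file reports: a decaying
         * measure of recent reclaims (the kernel's exact scaling differs). */
        static long fmeter_read(struct fmeter *f, double now)
        {
            fmeter_decay(f, now);
            return (long)f->val;
        }

        int main(void)
        {
            struct fmeter f = { 0.0, 0.0 };
            double t;

            for (t = 0.0; t < 5.0; t += 0.1)    /* ~10 reclaims/s for 5 seconds */
                fmeter_mark(&f, t);
            printf("pressure now:        %ld\n", fmeter_read(&f, 5.0));
            printf("pressure 20 s later: %ld\n", fmeter_read(&f, 25.0));
            return 0;
        }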
     
  • Try to streamline free_pages_bulk by ensuring callers don't pass in a
    'count' that exceeds the list size.

    Some cleanups:
    Rename __free_pages_bulk to __free_one_page.
    Put the page list manipulation from __free_pages_ok into free_one_page.
    Make __free_pages_ok static.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Use zone_pcp everywhere, even though the NUMA code "knows" the internal
    details of the zone. This stops other code from copying the open-coded
    form, and it looks nicer.

    Also, only print the pagesets of online cpus in zoneinfo.

    Signed-off-by: Nick Piggin
    Cc: "Seth, Rohit"
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • There has recently been a lot of traffic about the right values for the
    batch and high watermarks of the per_cpu_pagelists. This patch makes
    these two variables configurable through a /proc interface.

    A new tunable /proc/sys/vm/percpu_pagelist_fraction is added. This entry
    controls the maximum fraction of pages in each zone that may be allocated
    to any single per-cpu page list. The minimum value is 8, meaning that no
    more than 1/8th of the pages in a zone may sit in any single
    per_cpu_pagelist.

    The batch value of each per-cpu pagelist is updated as a result. It is
    set to pcp->high/4, with an upper limit of (PAGE_SHIFT * 8); a worked
    example of this arithmetic follows this entry.

    Signed-off-by: Rohit Seth
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rohit Seth
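
    A worked example of the arithmetic described above (PAGE_SHIFT, the zone
    size and the function name are assumptions made for the example; this is
    not the kernel code):

        #include <stdio.h>

        #define PAGE_SHIFT   12         /* assume 4 KiB pages for the example */
        #define MIN_FRACTION 8          /* smallest accepted tunable value    */

        static void recompute_pcp_limits(unsigned long zone_pages,
                                         unsigned long fraction,
                                         unsigned long *high,
                                         unsigned long *batch)
        {
            if (fraction < MIN_FRACTION)
                fraction = MIN_FRACTION;

            *high = zone_pages / fraction;  /* at most 1/fraction of the zone */
            *batch = *high / 4;             /* refill/drain granularity       */
            if (*batch < 1)
                *batch = 1;
            if (*batch > PAGE_SHIFT * 8)    /* upper limit quoted above: 96   */
                *batch = PAGE_SHIFT * 8;
        }

        int main(void)
        {
            unsigned long high, batch;

            /* A 1 GiB zone of 4 KiB pages with the minimum fraction of 8. */
            recompute_pcp_limits(262144, 8, &high, &batch);
            printf("pcp->high = %lu, pcp->batch = %lu\n", high, batch);
            return 0;
        }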
     
  • For some reason there is an #ifdef CONFIG_NUMA nested within another
    #ifdef CONFIG_NUMA in the page allocator. Remove the innermost
    #ifdef CONFIG_NUMA.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Hugh says:

    page_alloc_cpu_notify() specifically contains code to

    /* Add dead cpu's page_states to our own. */

    which handles this more efficiently.

    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

07 Jan, 2006

18 commits

  • Optimise page_state manipulations by introducing interrupt unsafe accessors
    to page_state fields. Callers must provide their own locking (either
    disable interrupts or not update from interrupt context).

    Switch over the hot callsites that can easily be moved under interrupts off
    sections.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
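
    A user-space analogue of the pattern (a pthread mutex stands in for
    disabling interrupts, and the names are made up): the double-underscore
    accessor relies on the caller for exclusion, the plain one supplies it
    itself.

        #include <pthread.h>
        #include <stdio.h>

        static unsigned long counter;
        static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

        static void __mod_counter(long delta)   /* caller must hold counter_lock */
        {
            counter += delta;
        }

        static void mod_counter(long delta)     /* safe to call from anywhere */
        {
            pthread_mutex_lock(&counter_lock);
            __mod_counter(delta);
            pthread_mutex_unlock(&counter_lock);
        }

        int main(void)
        {
            /* Hot path: one lock round-trip covers a batch of updates. */
            pthread_mutex_lock(&counter_lock);
            __mod_counter(1);
            __mod_counter(1);
            pthread_mutex_unlock(&counter_lock);

            mod_counter(1);             /* occasional, self-contained update */
            printf("counter = %lu\n", counter);
            return 0;
        }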
     
  • Give j and r meaningful names.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The use of k in the inner loop means that the highest zone nr is always
    used if any zone of a node is populated. This means that the policy zone
    is not correctly determined on arches that do not use HIGHMEM, such as
    ia64.

    Change the loop to decrement k which also simplifies the BUG_ON.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Currently the function that builds a zonelist for a BIND policy has the
    side effect of setting policy_zone. This seems a bit strange: policy_zone
    does not appear to be initialized elsewhere and is therefore 0. Do we
    then apply policies to ZONE_DMA if no BIND policy has been used yet?

    This patch moves the determination of the zone to apply policies to into
    the page allocator. We determine the zone while building the zonelist for
    nodes.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
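
    An illustrative model of the idea in this and the previous entry (the
    data and function are made up; only the zone names are real): while
    building a node's zonelist, walk the zones from highest to lowest and
    remember the highest populated zone seen on any node as the policy zone.

        #include <stdio.h>

        enum zone_type { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, MAX_NR_ZONES };

        /* Toy data: present pages per zone for two nodes (0 = not present). */
        static unsigned long present[2][MAX_NR_ZONES] = {
            { 4096, 262144, 0 },        /* node 0: DMA and NORMAL */
            {    0, 262144, 0 },        /* node 1: NORMAL only    */
        };

        static int policy_zone = ZONE_DMA;

        static void build_zonelist_for_node(int node)
        {
            int k;

            for (k = MAX_NR_ZONES - 1; k >= 0; k--) {   /* decrementing k */
                if (!present[node][k])
                    continue;
                if (k > policy_zone)
                    policy_zone = k;    /* highest populated zone so far */
                /* ...append zone k of this node to the zonelist here... */
            }
        }

        int main(void)
        {
            build_zonelist_for_node(0);
            build_zonelist_for_node(1);
            printf("policy_zone = %d (ZONE_NORMAL = %d)\n",
                   policy_zone, ZONE_NORMAL);
            return 0;
        }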
     
  • Simplify build_zonelists_node by removing the case statement.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • There are numerous places we check whether a zone is populated or not.

    Provide a helper function to check for populated zones and convert all
    checks for zone->present_pages.

    Signed-off-by: Con Kolivas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Con Kolivas
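
    The helper boils down to a one-line predicate over zone->present_pages.
    A sketch with a stub type (not the kernel's struct zone):

        #include <stdio.h>

        struct zone {                   /* stub with only the field we need */
            unsigned long present_pages;
        };

        static inline int populated_zone(const struct zone *zone)
        {
            return zone->present_pages != 0;
        }

        int main(void)
        {
            struct zone empty = { 0 }, normal = { 262144 };

            printf("%d %d\n", populated_zone(&empty), populated_zone(&normal));
            return 0;
        }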
     
  • Cut down size slightly by not passing bad_page the function name (it can
    be determined from dump_stack()), and cut down the number of printks in
    bad_page.

    Also, cut down some branching in the destroy_compound_page path.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Add dma32 to zone statistics. Also attempt to arrange struct page_state a
    bit better (visually).

    Signed-off-by: Nick Piggin
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • The attached patch cleans up the way the bootmem allocator frees pages.

    A new function, __free_pages_bootmem(), is provided in mm/page_alloc.c that is
    called from mm/bootmem.c to turn pages over to the main allocator. All the
    bits of code to initialise pages (clearing PG_reserved and setting the page
    count) are moved to here. The checks on page validity are removed, on the
    assumption that the struct page arrays will have been prepared correctly.

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Small cleanups that do not change the generated code with the gccs I've
    tested.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • read_page_state and __get_page_state only traverse online CPUs, which will
    cause results to fluctuate when CPUs are plugged in or out.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
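
    The effect can be seen in a small standalone sketch (the per-CPU array
    and the fix of summing every possible CPU are assumptions; the entry
    above only states the problem):

        #include <stdio.h>

        #define NR_CPUS 4

        static unsigned long pgalloc[NR_CPUS] = { 100, 200, 300, 400 };
        static int cpu_online[NR_CPUS]        = { 1, 1, 1, 1 };

        /* Summing only online CPUs: the total drops when a CPU goes offline,
         * even though its contribution still sits in its counter. */
        static unsigned long sum_online(void)
        {
            unsigned long sum = 0;
            int cpu;

            for (cpu = 0; cpu < NR_CPUS; cpu++)
                if (cpu_online[cpu])
                    sum += pgalloc[cpu];
            return sum;
        }

        /* Summing every possible CPU keeps the result stable across hotplug. */
        static unsigned long sum_possible(void)
        {
            unsigned long sum = 0;
            int cpu;

            for (cpu = 0; cpu < NR_CPUS; cpu++)
                sum += pgalloc[cpu];
            return sum;
        }

        int main(void)
        {
            printf("online: %lu  possible: %lu\n", sum_online(), sum_possible());
            cpu_online[3] = 0;                      /* CPU 3 goes offline */
            printf("online: %lu  possible: %lu\n", sum_online(), sum_possible());
            return 0;
        }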
     
  • struct per_cpu_pages.low is useless. Remove it.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • bad_range is supposed to be a temporary check. It would be a pity to throw it
    out. Make it depend on CONFIG_DEBUG_VM instead.

    CONFIG_HOLES_IN_ZONE systems were relying on this to check pfn_valid in the
    page allocator. Add that to page_is_buddy instead.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
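
    A sketch of the resulting shape (stub types and an elided check body;
    not the actual kernel source):

        struct zone;            /* stubs: the real types live in the kernel */
        struct page;

        #ifdef CONFIG_DEBUG_VM
        static int bad_range(struct zone *zone, struct page *page)
        {
            /* ...pfn-within-zone and zone-span sanity checks go here... */
            return 0;
        }
        #else
        static inline int bad_range(struct zone *zone, struct page *page)
        {
            (void)zone;
            (void)page;
            return 0;           /* checks compiled out; callers unchanged */
        }
        #endif

        int main(void)
        {
            return bad_range((struct zone *)0, (struct page *)0);
        }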
     
  • Micro optimise some conditionals where we don't need lazy evaluation.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
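
    One common shape for this kind of change (an assumption about the actual
    hunks, shown with a made-up struct): when every operand is a cheap,
    side-effect-free test, a bitwise '|' lets the compiler evaluate them all
    without the short-circuit branches that '||' implies.

        #include <stdio.h>

        struct pstate { unsigned long mapcount, count; void *mapping; };

        static int looks_bad_lazy(const struct pstate *p)
        {
            return p->mapcount != 0 || p->count != 0 || p->mapping != NULL;
        }

        static int looks_bad_eager(const struct pstate *p)
        {
            return (p->mapcount != 0) | (p->count != 0) | (p->mapping != NULL);
        }

        int main(void)
        {
            struct pstate ok = { 0, 0, 0 };

            printf("%d %d\n", looks_bad_lazy(&ok), looks_bad_eager(&ok));
            return 0;
        }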
     
  • Inline set_page_refs.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Slightly optimise some page allocation and freeing functions by taking
    advantage of knowing whether or not interrupts are disabled.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • The NODES_SPAN_OTHER_NODES config option was created so that DISCONTIGMEM
    could handle pSeries numa layouts. However, support for DISCONTIGMEM has
    been replaced by SPARSEMEM on powerpc. As a result, this config option and
    supporting code is no longer needed.

    I have already sent a patch to Paul that removes the option from powerpc
    specific code. This removes the arch independent piece. Doesn't really
    matter which is applied first.

    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Two changes to the setting of the ALLOC_CPUSET flag in
    mm/page_alloc.c:__alloc_pages()

    - A bug fix: the "ignoring mins" case should not be honoring ALLOC_CPUSET.
    This case, of all cases, is handling a request that will free up more
    memory than it asks for (exiting tasks, for example), so it should be
    allowed to escape cpuset constraints when memory is tight.

    - A logic change to make it simpler. Honor cpusets even on GFP_ATOMIC
    (!wait) requests. With this, cpuset confinement applies to all requests
    except ALLOC_NO_WATERMARKS, so that in a subsequent cleanup patch, I can
    remove the ALLOC_CPUSET flag entirely. Since I don't know any real reason
    this logic has to be either way, I am choosing the path of the simplest
    code.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
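
    A sketch of the resulting decision (the flag names come from the entry
    above; the values and the helper are made up): cpuset confinement now
    applies to every request except the one that ignores watermarks.

        #include <stdio.h>

        #define ALLOC_CPUSET        0x01    /* illustrative values */
        #define ALLOC_NO_WATERMARKS 0x02

        static int alloc_flags(int ignoring_mins)
        {
            if (ignoring_mins)                 /* frees more than it asks for: */
                return ALLOC_NO_WATERMARKS;    /* escapes cpuset constraints   */
            return ALLOC_CPUSET;               /* everyone else, even GFP_ATOMIC
                                                * (!wait), stays confined      */
        }

        int main(void)
        {
            printf("normal or atomic: %#x\n", alloc_flags(0));
            printf("ignoring mins:    %#x\n", alloc_flags(1));
            return 0;
        }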
     

16 Dec, 2005

1 commit


04 Dec, 2005

1 commit


29 Nov, 2005

1 commit

  • I believe this patch is required to fix breakage in the asynch reclaim
    watermark logic introduced by this patch:

    http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=7fb1d9fca5c6e3b06773b69165a73f3fb786b8ee

    Just some background of the watermark logic in case it isn't clear...
    Basically what we have is this:

    --- pages_high
    |
    | (a)
    |
    --- pages_low
    |
    | (b)
    |
    --- pages_min
    |
    | (c)
    |
    --- 0

    Now when pages_low is reached, we want to kick asynch reclaim, which gives us
    an interval of "b" before we must start synch reclaim, and gives kswapd an
    interval of "a" before it need go back to sleep.

    When pages_min is reached, normal allocators must enter synch reclaim, but
    PF_MEMALLOC, ALLOC_HARDER, and ALLOC_HIGH (ie. atomic allocations, recursive
    allocations, etc.) get access to varying amounts of the reserve "c".

    Signed-off-by: Nick Piggin
    Cc: "Seth, Rohit"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
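
    A toy walk through the bands in the diagram (the thresholds and the
    discount for privileged callers are illustrative; the real
    zone_watermark_ok() logic has more cases):

        #include <stdio.h>

        struct zone_marks { long min, low, high; };

        static const char *alloc_path(const struct zone_marks *z, long free,
                                      int atomic_or_harder)
        {
            long min = z->min;

            if (atomic_or_harder)
                min -= min / 2;     /* such callers may dip into reserve "c" */

            if (free <= min)
                return "synchronous (direct) reclaim";
            if (free <= z->low)
                return "wake kswapd (async reclaim), allocation proceeds";
            return "above pages_low: allocation proceeds, no reclaim kicked";
        }

        int main(void)
        {
            /* Once woken, kswapd keeps reclaiming until free climbs back
             * above z.high, i.e. through interval "a". */
            struct zone_marks z = { .min = 1000, .low = 1250, .high = 1500 };

            printf("free=1400:          %s\n", alloc_path(&z, 1400, 0));
            printf("free=1100:          %s\n", alloc_path(&z, 1100, 0));
            printf("free= 900:          %s\n", alloc_path(&z,  900, 0));
            printf("free= 900 (atomic): %s\n", alloc_path(&z,  900, 1));
            return 0;
        }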
     

23 Nov, 2005

2 commits

  • It used to be the case that PG_reserved pages were silently never freed, but
    in 2.6.15-rc1 they may be freed with a "Bad page state" message. We should
    work through such cases as they appear, fixing the code; but for now it's
    safer to issue the message without freeing the page, leaving PG_reserved set.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • It looks like snd_xxx is not the only nopage to be using PageReserved as a way
    of holding a high-order page together: which no longer works, but is masked by
    our failure to free from VM_RESERVED areas. We cannot fix that bug without
    first substituting another way to hold the high-order page together, while
    farming out the 0-order pages from within it.

    That's just what PageCompound is designed for, but it's been kept under
    CONFIG_HUGETLB_PAGE. Remove the #ifdefs: which saves some space (out-of-line
    put_page), doesn't slow down what most needs to be fast (already using
    hugetlb), and unifies the way we handle high-order pages.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

18 Nov, 2005

1 commit


15 Nov, 2005

3 commits