12 Oct, 2006

3 commits

  • Move the lock debug checks below the page reserved checks. Also, having
    debug_check_no_locks_freed in kernel_map_pages is wrong.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • After the PG_reserved check was added, arch_free_page was being called in the
    wrong place (it could be called for a page we don't actually want to free).
    Fix that.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
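
    A hedged sketch of the corrected ordering (the helper name free_page_prepare
    is illustrative, not the literal patch; the point is that the PG_reserved
    decision comes before the arch hook and lock-debug check):

        static void free_page_prepare(struct page *page)
        {
            /* Decide first whether the page is really being freed. */
            if (PageReserved(page))
                return;                 /* reserved page: leave it alone */

            /* Only now run the arch hook and the lock-debugging check. */
            arch_free_page(page, 0);
            debug_check_no_locks_freed(page_address(page), PAGE_SIZE);

            /* ... continue with the normal free path ... */
        }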
     
  • memmap_zone_idx() is not used anymore. It was required by an earlier
    version of
    account-for-memmap-and-optionally-the-kernel-image-as-holes.patch but not
    any more.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

04 Oct, 2006

2 commits

  • Fix kernel-doc and function declaration (missing "void") in
    mm/page_alloc.c.

    Add mm/page_alloc.c to kernel-api.tmpl in DocBook.

    mm/page_alloc.c:2589:38: warning: non-ANSI function declaration of function 'remove_all_active_ranges'

    Signed-off-by: Randy Dunlap
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
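
    For illustration, the kind of declaration change that silences the sparse
    warning (before/after prototypes, not the literal hunk):

        /* Non-ANSI (what sparse warns about): empty parentheses leave the
           argument list unspecified. */
        void remove_all_active_ranges();

        /* ANSI/C99 prototype, as fixed by the patch: */
        void remove_all_active_ranges(void);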
     
  • Having min be a signed quantity means gcc can't turn high-latency divides
    into shifts. There happen to be two such divides for GFP_ATOMIC (i.e.
    networking, i.e. important) allocations, one of which depends on the other.
    Fixing this makes the code smaller as a bonus.

    Shame on somebody (probably me).

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
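
    A standalone illustration of why the sign matters, assuming the divides in
    question are power-of-two adjustments such as min / 2:

        /* Signed division by 2 must round toward zero, so gcc emits a
           sign fix-up in addition to the shift. */
        long adjust_signed(long min)
        {
            return min - min / 2;
        }

        /* Unsigned division by 2 is exactly one right shift. */
        unsigned long adjust_unsigned(unsigned long min)
        {
            return min - min / 2;
        }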
     

27 Sep, 2006

9 commits

  • Use NULL instead of 0 for pointer value, eliminate sparse warnings.

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • We do not need to allocate pagesets for unpopulated zones.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
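
    A simplified, self-contained sketch of the idea (struct fields reduced to
    what the check needs; populated_zone() mirrors the kernel helper):

        struct zone {
            unsigned long present_pages;
            void *pageset;              /* stand-in for the per-cpu pagesets */
        };

        static int populated_zone(const struct zone *zone)
        {
            return zone->present_pages != 0;
        }

        void alloc_pagesets(struct zone *zones, int nr_zones)
        {
            for (int i = 0; i < nr_zones; i++) {
                if (!populated_zone(&zones[i]))
                    continue;           /* nothing to allocate for empty zones */
                /* allocate and initialise zones[i].pageset here */
            }
        }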
     
  • Add the node in order to optimize zone_to_nid.

    Signed-off-by: Christoph Lameter
    Acked-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
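
    A sketch of the optimisation (simplified structures; the real patch stores
    the node id in struct zone at init time under CONFIG_NUMA):

        struct pglist_data { int node_id; };

        struct zone {
            struct pglist_data *zone_pgdat;
            int node;                   /* cached node id */
        };

        /* Before: two dependent loads. */
        static inline int zone_to_nid_via_pgdat(const struct zone *z)
        {
            return z->zone_pgdat->node_id;
        }

        /* After: a single load from the zone itself. */
        static inline int zone_to_nid_cached(const struct zone *z)
        {
            return z->node;
        }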
     
  • The NUMA_BUILD constant is always available and will be set to 1 on NUMA
    builds. That way checks valid only under CONFIG_NUMA can easily be done
    without #ifdef CONFIG_NUMA.

    F.e.

    if (NUMA_BUILD && <numa-specific condition>) {
    ...
    }

    [akpm: not a thing we'd normally do, but CONFIG_NUMA is special: it is
    causing ifdef explosion in core kernel, so let's see if this is a comfortable
    way in which to control that]

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
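
    The definition described above boils down to the following; the usage
    function is a hypothetical example (the dead branch is removed by the
    compiler on !NUMA builds, but, unlike an #ifdef block, still gets
    syntax- and type-checked):

        #ifdef CONFIG_NUMA
        #define NUMA_BUILD 1
        #else
        #define NUMA_BUILD 0
        #endif

        static int needs_remote_handling(int nid, int local_nid)
        {
            return NUMA_BUILD && nid != local_nid;
        }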
     
  • On larger systems, the amount of output dumped on the console when you do
    SysRq-M is beyond insane. This patch is trying to reduce it somewhat as
    even with the smaller NUMA systems that have hit the desktop this seems to
    be a fair thing to do.

    The philosophy I have taken is as follows:
    1) If a zone is empty, don't report it; we don't need yet another line
    telling us so. The information is available anyway, since one can look up
    how many zones were initialized in the first place.
    2) Put as much information on a line as possible; if it can be done
    in one line rather than two, then do it in one. I tried to format
    the temperature stuff for easy reading.

    Change show_free_areas() to not print lines for empty zones. If no zone
    output is printed, the zone is empty. This reduces the number of lines
    dumped to the console in sysrq on a large system by several thousand lines.

    Change the zone temperature printouts to use one line per CPU instead of
    two lines (one hot, one cold). On a 1024 CPU, 1024 node system, this
    reduces the console output by over a million lines of output.

    While this is a bigger problem on large NUMA systems, it is also applicable
    to smaller desktop sized and mid range NUMA systems.

    Old format:

    Mem-info:
    Node 0 DMA per-cpu:
    cpu 0 hot: high 42, batch 7 used:24
    cpu 0 cold: high 14, batch 3 used:1
    cpu 1 hot: high 42, batch 7 used:34
    cpu 1 cold: high 14, batch 3 used:0
    cpu 2 hot: high 42, batch 7 used:0
    cpu 2 cold: high 14, batch 3 used:0
    cpu 3 hot: high 42, batch 7 used:0
    cpu 3 cold: high 14, batch 3 used:0
    cpu 4 hot: high 42, batch 7 used:0
    cpu 4 cold: high 14, batch 3 used:0
    cpu 5 hot: high 42, batch 7 used:0
    cpu 5 cold: high 14, batch 3 used:0
    cpu 6 hot: high 42, batch 7 used:0
    cpu 6 cold: high 14, batch 3 used:0
    cpu 7 hot: high 42, batch 7 used:0
    cpu 7 cold: high 14, batch 3 used:0
    Node 0 DMA32 per-cpu: empty
    Node 0 Normal per-cpu: empty
    Node 0 HighMem per-cpu: empty
    Node 1 DMA per-cpu:
    [snip]
    Free pages: 5410688kB (0kB HighMem)
    Active:9536 inactive:4261 dirty:6 writeback:0 unstable:0 free:338168 slab:1931 mapped:1900 pagetables:208
    Node 0 DMA free:1676304kB min:3264kB low:4080kB high:4896kB active:128048kB inactive:61568kB present:1970880kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0
    Node 0 DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0
    Node 0 Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0
    Node 0 HighMem free:0kB min:512kB low:512kB high:512kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0
    Node 1 DMA free:1951728kB min:3280kB low:4096kB high:4912kB active:5632kB inactive:1504kB present:1982464kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0
    ....

    New format:

    Mem-info:
    Node 0 DMA per-cpu:
    CPU 0: Hot: hi: 42, btch: 7 usd: 41 Cold: hi: 14, btch: 3 usd: 2
    CPU 1: Hot: hi: 42, btch: 7 usd: 40 Cold: hi: 14, btch: 3 usd: 1
    CPU 2: Hot: hi: 42, btch: 7 usd: 0 Cold: hi: 14, btch: 3 usd: 0
    CPU 3: Hot: hi: 42, btch: 7 usd: 0 Cold: hi: 14, btch: 3 usd: 0
    CPU 4: Hot: hi: 42, btch: 7 usd: 0 Cold: hi: 14, btch: 3 usd: 0
    CPU 5: Hot: hi: 42, btch: 7 usd: 0 Cold: hi: 14, btch: 3 usd: 0
    CPU 6: Hot: hi: 42, btch: 7 usd: 0 Cold: hi: 14, btch: 3 usd: 0
    CPU 7: Hot: hi: 42, btch: 7 usd: 0 Cold: hi: 14, btch: 3 usd: 0
    Node 1 DMA per-cpu:
    [snip]
    Free pages: 5411088kB (0kB HighMem)
    Active:9558 inactive:4233 dirty:6 writeback:0 unstable:0 free:338193 slab:1942 mapped:1918 pagetables:208
    Node 0 DMA free:1677648kB min:3264kB low:4080kB high:4896kB active:129296kB inactive:58864kB present:1970880kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0
    Node 1 DMA free:1948448kB min:3280kB low:4096kB high:4912kB active:6864kB inactive:3536kB present:1982464kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0

    Signed-off-by: Jes Sorensen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jes Sorensen
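
    A self-contained model of the per-cpu line compaction (simplified
    structures; printf stands in for printk):

        #include <stdio.h>

        struct per_cpu_pages { int count, high, batch; };
        struct per_cpu_pageset { struct per_cpu_pages pcp[2]; };  /* 0 = hot, 1 = cold */

        /* One line per CPU instead of two, matching the "New format" above. */
        static void show_pageset(int cpu, const struct per_cpu_pageset *p)
        {
            printf("CPU %4d: Hot: hi: %5d, btch: %4d usd: %4d   "
                   "Cold: hi: %5d, btch: %4d usd: %4d\n",
                   cpu,
                   p->pcp[0].high, p->pcp[0].batch, p->pcp[0].count,
                   p->pcp[1].high, p->pcp[1].batch, p->pcp[1].count);
        }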
     
  • Arch-independent zone-sizing determines the size of a node
    (pgdat->node_spanned_pages) based on the physical memory that was
    registered by the architecture. However, when
    CONFIG_MEMORY_HOTPLUG_RESERVE is set, the architecture expects that the
    spanned_pages will be much larger and that a mem_map will be allocated that
    is used later for memory hot-add.

    This patch allows an architecture that sets CONFIG_MEMORY_HOTPLUG_RESERVE
    to call push_node_boundaries() which will set the node beginning and end to
    at *least* the requested boundary.

    Cc: Dave Hansen
    Cc: Andy Whitcroft
    Cc: Andi Kleen
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Keith Mannthey"
    Cc: "Luck, Tony"
    Cc: KAMEZAWA Hiroyuki
    Cc: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
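
    A minimal sketch of the boundary-pushing semantics (the real function takes
    a nid and records boundaries in a per-node table; the struct here is
    illustrative):

        struct node_boundary { unsigned long start_pfn, end_pfn; };

        /* b holds the node range derived from registered memory; push it out
           so that it covers at *least* the requested span, never shrinking. */
        void push_node_boundary(struct node_boundary *b,
                                unsigned long start_pfn, unsigned long end_pfn)
        {
            if (start_pfn < b->start_pfn)
                b->start_pfn = start_pfn;
            if (end_pfn > b->end_pfn)
                b->end_pfn = end_pfn;
        }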
     
  • absent_pages_in_range() made the assumption that users of the API would not
    care about holes beyond the end of physical memory. This was not the
    case. This patch will account for ranges outside of physical memory as
    holes correctly.

    Cc: Dave Hansen
    Cc: Andy Whitcroft
    Cc: Andi Kleen
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Keith Mannthey"
    Cc: "Luck, Tony"
    Cc: KAMEZAWA Hiroyuki
    Cc: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
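
    A simplified model of the corrected accounting: holes are whatever part of
    the requested span is not covered by a registered active range, which
    naturally includes PFNs beyond the end of physical memory (struct and
    function names are illustrative):

        struct active_range { unsigned long start_pfn, end_pfn; };

        unsigned long absent_pages(unsigned long start_pfn, unsigned long end_pfn,
                                   const struct active_range *ranges, int nr)
        {
            unsigned long present = 0;

            for (int i = 0; i < nr; i++) {
                unsigned long s = ranges[i].start_pfn > start_pfn ?
                                  ranges[i].start_pfn : start_pfn;
                unsigned long e = ranges[i].end_pfn < end_pfn ?
                                  ranges[i].end_pfn : end_pfn;
                if (e > s)
                    present += e - s;
            }
            return (end_pfn - start_pfn) - present;
        }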
     
  • The x86_64 code accounted for memmap and some portions of the DMA zone as
    holes. This was because those areas would never be reclaimed and accounting
    for them as memory affects min watermarks. This patch will account for the
    memmap as a memory hole. Architectures may optionally use set_dma_reserve()
    if they wish to account for a portion of memory in ZONE_DMA as a hole.

    Signed-off-by: Mel Gorman
    Cc: Dave Hansen
    Cc: Andy Whitcroft
    Cc: Andi Kleen
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Keith Mannthey"
    Cc: "Luck, Tony"
    Cc: KAMEZAWA Hiroyuki
    Cc: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
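
    A sketch of the accounting (simplified; PAGE_SHIFT and the struct page
    stand-in are illustrative, as is the helper name):

        #define PAGE_SHIFT 12                          /* 4 KiB pages, for illustration */
        struct page { unsigned long flags; void *mapping; };  /* stand-in */

        unsigned long zone_real_size(unsigned long spanned_pages,
                                     unsigned long holes,
                                     unsigned long dma_reserve,
                                     int is_dma_zone)
        {
            unsigned long realsize = spanned_pages - holes;
            unsigned long memmap_pages =
                (spanned_pages * sizeof(struct page)) >> PAGE_SHIFT;

            realsize -= memmap_pages;          /* memmap is never reclaimable */
            if (is_dma_zone)
                realsize -= dma_reserve;       /* optional, via set_dma_reserve() */
            return realsize;
        }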
     
  • At a basic level, architectures define structures to record where active
    ranges of page frames are located. Once located, the code to calculate zone
    sizes and holes in each architecture is very similar. Some of this zone and
    hole sizing code is difficult to read for no good reason. This set of patches
    eliminates the similar-looking architecture-specific code.

    The patches introduce a mechanism where architectures register where the
    active ranges of page frames are with add_active_range(). When all areas have
    been discovered, free_area_init_nodes() is called to initialise the pgdat and
    zones. The zone sizes and holes are then calculated in an architecture
    independent manner.

    Patch 1 introduces the mechanism for registering and initialising PFN ranges
    Patch 2 changes ppc to use the mechanism - 139 arch-specific LOC removed
    Patch 3 changes x86 to use the mechanism - 136 arch-specific LOC removed
    Patch 4 changes x86_64 to use the mechanism - 74 arch-specific LOC removed
    Patch 5 changes ia64 to use the mechanism - 52 arch-specific LOC removed
    Patch 6 accounts for mem_map as a memory hole as the pages are not reclaimable.
    It adjusts the watermarks slightly

    Tony Luck has successfully tested for ia64 on Itanium with tiger_defconfig,
    gensparse_defconfig and defconfig. Bob Picco has also tested and debugged on
    IA64. Jack Steiner successfully boot tested on a mammoth SGI IA64-based
    machine. These were on patches against 2.6.17-rc1 and release 3 of these
    patches but there have been no ia64-changes since release 3.

    There are differences in the zone sizes for x86_64 as the arch-specific code
    for x86_64 accounts the kernel image and the starting mem_maps as memory holes
    but the architecture-independent code accounts the memory as present.

    The big benefit of this set of patches is a sizable reduction of
    architecture-specific code, some of which is very hairy. There should be a
    greater reduction when other architectures use the same mechanisms for zone
    and hole sizing but I lack the hardware to test on.

    Additional credit:
    Dave Hansen for the initial suggestion and comments on early patches
    Andy Whitcroft for reviewing early versions and catching numerous
    errors
    Tony Luck for testing and debugging on IA64
    Bob Picco for fixing bugs related to pfn registration, reviewing a
    number of patch revisions, providing a number of suggestions
    on future direction and testing heavily
    Jack Steiner and Robin Holt for testing on IA64 and clarifying
    issues related to memory holes
    Yasunori for testing on IA64
    Andi Kleen for reviewing and feeding back about x86_64
    Christian Kujau for providing valuable information related to ACPI
    problems on x86_64 and testing potential fixes

    This patch:

    Define the structure to represent an active range of page frames within a node
    in an architecture independent manner. Architectures are expected to register
    active ranges of PFNs using add_active_range(nid, start_pfn, end_pfn) and call
    free_area_init_nodes() passing the PFNs of the end of each zone.

    Signed-off-by: Mel Gorman
    Signed-off-by: Bob Picco
    Cc: Dave Hansen
    Cc: Andy Whitcroft
    Cc: Andi Kleen
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Keith Mannthey"
    Cc: "Luck, Tony"
    Cc: KAMEZAWA Hiroyuki
    Cc: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
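
    A condensed sketch of the registration interface this patch introduces (the
    real add_active_range() also merges adjacent and overlapping ranges; the
    bound below is illustrative):

        struct node_active_region {
            unsigned long start_pfn;
            unsigned long end_pfn;
            int nid;
        };

        #define MAX_ACTIVE_REGIONS 256                 /* illustrative bound */
        static struct node_active_region early_node_map[MAX_ACTIVE_REGIONS];
        static int nr_nodemap_entries;

        /* Architectures call this for each range of present memory they
           discover; free_area_init_nodes() later derives zone sizes and
           holes from the accumulated map. */
        void add_active_range(int nid, unsigned long start_pfn, unsigned long end_pfn)
        {
            if (nr_nodemap_entries >= MAX_ACTIVE_REGIONS)
                return;
            early_node_map[nr_nodemap_entries].nid = nid;
            early_node_map[nr_nodemap_entries].start_pfn = start_pfn;
            early_node_map[nr_nodemap_entries].end_pfn = end_pfn;
            nr_nodemap_entries++;
        }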
     

26 Sep, 2006

20 commits

  • Clean up mm/page_alloc.c#mark_free_pages() and make it avoid clearing
    PageNosaveFree for PageNosave pages. This allows us to get rid of an ugly
    hack in kernel/power/snapshot.c#copy_data_pages().

    Additionally, the page-copying loop in copy_data_pages() is moved to an
    inline function.

    Signed-off-by: Rafael J. Wysocki
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • There are many places where we need to determine the node of a zone.
    Currently we use a difficult-to-read sequence of pointer dereferences.
    Put that into an inline function and use it throughout the VM. Maybe we
    can find a way to optimize the lookup in the future.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Currently one can enable slab reclaim by setting an explicit option in
    /proc/sys/vm/zone_reclaim_mode. Slab reclaim is then used as a final
    option if freeing unmapped file-backed pages does not free enough pages
    to allow a local allocation.

    However, that means that the slab can grow excessively and that most memory
    of a node may be used by slabs. We have had a case where a machine with
    46GB of memory was using 40-42GB for slab. Zone reclaim was effective in
    dealing with pagecache pages. However, slab reclaim was only done during
    global reclaim (which is a bit rare on NUMA systems).

    This patch implements slab reclaim during zone reclaim. Zone reclaim
    occurs if there is a danger of an off node allocation. At that point we

    1. Shrink the per node page cache if the number of pagecache
    pages is more than min_unmapped_ratio percent of pages in a zone.

    2. Shrink the slab cache if the number of the node's reclaimable slab pages
    (this patch depends on an earlier one that implements that counter)
    is more than min_slab_ratio (a new /proc/sys/vm tunable).

    The shrinking of the slab cache is a bit problematic since it is not node
    specific. So we simply calculate what point in the slab we want to reach
    (current per-node slab use minus the number of pages that need to be
    allocated) and then repeatedly run the global reclaim until that is
    unsuccessful or we have reached the limit. I hope we will have zone-based
    slab reclaim at some point, which will make that easier.

    The default for min_slab_ratio is 5%.

    Also remove the slab option from /proc/sys/vm/zone_reclaim_mode.

    [akpm@osdl.org: cleanups]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
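
    A hedged sketch of the "shrink until we reach a node-wide target" loop
    described above; node_slab_pages() and shrink_some_slab() are hypothetical
    helpers standing in for the ZVC read and one pass of the global shrinker:

        unsigned long node_slab_pages(void);    /* hypothetical: reads NR_SLAB_RECLAIMABLE */
        unsigned long shrink_some_slab(void);   /* hypothetical: one global shrink pass */

        void zone_reclaim_slab(unsigned long pages_needed)
        {
            unsigned long current_pages = node_slab_pages();
            unsigned long target = current_pages > pages_needed ?
                                   current_pages - pages_needed : 0;

            /* Global slab shrinking is not node specific, so repeat until the
               node-wide reclaimable-slab count reaches the target or the
               shrinker stops making progress. */
            while (node_slab_pages() > target) {
                if (shrink_some_slab() == 0)
                    break;
            }
        }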
     
  • Remove the atomic counter for slab_reclaim_pages and replace the counter
    and NR_SLAB with two ZVC counters that account for unreclaimable and
    reclaimable slab pages: NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE.

    Change the check in vmscan.c to refer to NR_SLAB_RECLAIMABLE. The
    intent seems to be to check for slab pages that could be freed.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
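
    A sketch of the counter split (the two enum values are the ones named in
    the patch; everything else is illustrative):

        enum zone_stat_item {
            NR_SLAB_RECLAIMABLE,
            NR_SLAB_UNRECLAIMABLE,
            NR_VM_ZONE_STAT_ITEMS
        };

        /* Per-zone counters; reclaim looks only at the reclaimable one. */
        struct zone_stats { unsigned long vm_stat[NR_VM_ZONE_STAT_ITEMS]; };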
     
  • *_pages is a better description of the role of the variable.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • If a zone is unpopulated then we do not need to check for pages that are to
    be drained, nor for VM counters that may need to be updated.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • free_one_page() currently adds the page to a fake list and calls
    free_pages_bulk(). free_pages_bulk() takes it off again and then calls
    __free_one_page().

    Make free_one_page() go directly to __free_one_page(). This saves the list
    on/off and a temporary list in free_one_page() for higher-order pages.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • There are frequent references to *z in get_page_from_freelist.

    Add an explicit zone variable that can be used in all these places.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • …mory policy restrictions

    Add a new gfp flag __GFP_THISNODE to avoid fallback to other nodes. This
    flag is essential if a kernel component requires memory to be located on a
    certain node. It will be needed for alloc_pages_node() to force allocation
    on the indicated node and for alloc_pages() to force allocation on the
    current node.

    Signed-off-by: Christoph Lameter <clameter@sgi.com>
    Cc: Andy Whitcroft <apw@shadowen.org>
    Cc: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>

    Christoph Lameter
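
    A usage sketch in kernel context (alloc_pages_node() is the existing API;
    only the flag is new, and the wrapper function is hypothetical):

        /* Ask for an order-0 page strictly on 'nid'; with __GFP_THISNODE the
           allocator returns NULL rather than falling back to another node. */
        static struct page *alloc_page_on_node(int nid)
        {
            return alloc_pages_node(nid, GFP_KERNEL | __GFP_THISNODE, 0);
        }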
     
  • I wonder why we need this bitmask indexing into zone->node_zonelists[]?

    We always start with the highest zone and then include all lower zones
    if we build zonelists.

    Are there really cases where we need allocation from ZONE_DMA or
    ZONE_HIGHMEM but not ZONE_NORMAL? It seems that the current implementation
    of highest_zone() makes that already impossible.

    If we go linear on the index then gfp_zone() == highest_zone() and a lot
    of definitions fall by the wayside.

    We can now revert back to the use of gfp_zone() in mempolicy.c ;-)

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • After we have done this we can now do some typing cleanup.

    The memory policy layer keeps a policy_zone that specifies
    the zone that gets memory policies applied. This variable
    can now be of type enum zone_type.

    The check_highest_zone() function and the build_zonelists() function must
    then also take an enum zone_type parameter.

    Plus there are a number of loops over zones that also should use
    zone_type.

    We run into some trouble at some points with functions that need a
    zone_type variable to become -1. Fix that up.

    [pj@sgi.com: fix set_mempolicy() crash]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • There is a check in zonelist_policy() that compares pieces of the bitmap
    obtained from a gfp mask via GFP_ZONETYPES with a zone number.

    The bitmap is an ORed mask of __GFP_DMA, __GFP_DMA32 and __GFP_HIGHMEM.
    The policy_zone is a zone number with the possible values of ZONE_DMA,
    ZONE_DMA32, ZONE_HIGHMEM and ZONE_NORMAL. These are two different domains
    of values.

    For some reason this seemed to work before the zone reduction patchset (it
    definitely works on SGI boxes since we just have one zone and the check
    cannot fail).

    With the zone reduction patchset this check definitely fails on systems
    with two zones if the system actually has memory in both zones.

    This is because ZONE_NORMAL is selected using no __GFP flag at
    all and thus gfp_zone(gfpmask) == 0. ZONE_DMA is selected when __GFP_DMA
    is set. __GFP_DMA is 0x01. So gfp_zone(gfpmask) == 1.

    policy_zone is set to ZONE_NORMAL (==1) if ZONE_NORMAL and ZONE_DMA are
    populated.

    For ZONE_NORMAL gfp_zone() yields 0 which is <
    policy_zone(ZONE_NORMAL) and so policy is not applied to regular memory
    allocations!

    Instead gfp_zone(__GFP_DMA) == 1 which results in policy being applied
    to DMA allocations!

    What we really want in that place is to establish the highest allowable
    zone for a given gfp_mask. If the highest zone is higher than or equal to
    the policy_zone then memory policies need to be applied. We have such
    a highest_zone() function in page_alloc.c.

    So move the highest_zone() function from mm/page_alloc.c into
    include/linux/gfp.h. On the way we simplify the function and use the new
    zone_type that was also introduced with the zone reduction patchset plus we
    also specify the right type for the gfp flags parameter.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
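
    A hedged sketch of what a highest_zone()-style helper does: map a gfp mask
    to the highest zone it may allocate from, so the policy check compares
    zones with zones rather than bitmasks with zones (config guards omitted;
    the flag values and function name are illustrative, not the kernel's):

        enum zone_type { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_HIGHMEM };

        #define GFP_FLAG_DMA      0x01u
        #define GFP_FLAG_HIGHMEM  0x02u
        #define GFP_FLAG_DMA32    0x04u

        static enum zone_type highest_zone_for(unsigned int gfp_flags)
        {
            if (gfp_flags & GFP_FLAG_DMA)
                return ZONE_DMA;            /* DMA-only allocation */
            if (gfp_flags & GFP_FLAG_DMA32)
                return ZONE_DMA32;
            if (gfp_flags & GFP_FLAG_HIGHMEM)
                return ZONE_HIGHMEM;
            return ZONE_NORMAL;             /* no zone modifier */
        }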
     
  • Make ZONE_HIGHMEM optional

    - ifdef out code and definitions related to CONFIG_HIGHMEM

    - __GFP_HIGHMEM falls back to normal allocations if there is no
    ZONE_HIGHMEM

    - GFP_ZONEMASK becomes 0x01 if there is no DMA32 and no HIGHMEM
    zone.

    [jdike@addtoit.com: build fix]
    Signed-off-by: Jeff Dike
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Make ZONE_DMA32 optional

    - Add #ifdefs around ZONE_DMA32 specific code and definitions.

    - Add a CONFIG_ZONE_DMA32 config option and use it for x86_64,
    which alone needs this zone.

    - Remove the use of CONFIG_DMA_IS_DMA32 and CONFIG_DMA_IS_NORMAL
    for ia64 and fix up the way per node ZVCs are calculated.

    - Fall back to prior GFP_ZONEMASK of 0x03 if there is no
    DMA32 zone.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Use enum for zones and reformat zones dependent information

    Add comments explaining the use of zones and add a zones_t type for zone
    numbers.

    Line up information that will be #ifdefd by the following patches.

    [akpm@osdl.org: comment cleanups]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • page allocator ZONE_HIGHMEM fixups

    1. We do not need to do an #ifdef in si_meminfo since both counters
    in use are zero if !CONFIG_HIGHMEM.

    2. Add #ifdef in si_meminfo_node instead to avoid referencing zone
    information for ZONE_HIGHMEM if we do not have HIGHMEM
    (may not be there after the following patches).

    3. Replace the use of ZONE_HIGHMEM with MAX_NR_ZONES in build_zonelists_node

    4. build_zonelists_node: Remove BUG_ON for ZONE_HIGHMEM. Zone will
    be optional soon and thus BUG_ON cannot be triggered anymore.

    5. init_free_area_core: Replace a use of ZONE_HIGHMEM with MAX_NR_ZONES.

    [akpm@osdl.org: cleanups]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Move totalhigh_pages and nr_free_highpages() into highmem.c/.h

    Move the totalhigh_pages definition into highmem.c/.h. Move the
    nr_free_highpages function into highmem.c

    [yoichi_yuasa@tripeaks.co.jp: build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Yoichi Yuasa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Do not display HIGHMEM memory sizes if CONFIG_HIGHMEM is not set.

    Make HIGHMEM-dependent texts and the display of highmem counters optional.

    Some texts depend on CONFIG_HIGHMEM.

    Remove those strings and remove the display of highmem counter values if
    CONFIG_HIGHMEM is not set.

    [akpm@osdl.org: remove some ifdefs]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Introduce a VM_BUG_ON, which is turned on with CONFIG_DEBUG_VM. Use this
    in the lightweight, inline refcounting functions; in the PageLRU and
    PageActive checks in vmscan, because they're pretty well confined to
    vmscan; and in the page allocate/free fastpaths, which can be the hottest
    parts of the kernel for kbuilds.

    Unlike BUG_ON, VM_BUG_ON must not be used to execute statements with
    side-effects, and should not be used outside core mm code.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
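
    As described, the definition amounts to the following (BUG_ON is the
    existing kernel macro):

        #ifdef CONFIG_DEBUG_VM
        #define VM_BUG_ON(cond) BUG_ON(cond)
        #else
        #define VM_BUG_ON(cond) do { } while (0)
        #endif

    Note that in the !CONFIG_DEBUG_VM case the condition is not evaluated at
    all, which is why statements with side effects must not be placed inside it.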
     
  • Stops panic associated with attempting to free a non slab-allocated
    per_cpu_pageset.

    Signed-off-by: David Rientjes
    Acked-by: Christoph Lameter
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

23 Sep, 2006

1 commit

  • The grow algorithm is simple, we grow if:

    1) we see a hash chain collision at insert, and
    2) we haven't hit the hash size limit (currently 1*1024*1024 slots), and
    3) the number of xfrm_state objects is > the current hash mask

    All of this needs some tweaking.

    Remove __initdata from "hashdist" so we can use it safely at run time.

    Signed-off-by: David S. Miller

    David S. Miller
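
    A sketch of the grow test as described above (names are illustrative; the
    real code lives in net/xfrm/xfrm_state.c):

        static int xfrm_hash_should_grow(int saw_collision,
                                         unsigned int hmask,
                                         unsigned int nr_states)
        {
            const unsigned int hash_size_limit = 1024 * 1024;    /* slots */

            return saw_collision &&                  /* 1) collision at insert */
                   (hmask + 1) < hash_size_limit &&  /* 2) below the size limit */
                   nr_states > hmask;                /* 3) more objects than mask */
        }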
     

04 Jul, 2006

1 commit

  • It turns out that it is advantageous to leave a small portion of unmapped file
    backed pages if all of a zone's pages (or almost all pages) are allocated and
    so the page allocator has to go off-node.

    This allows recently used file I/O buffers to stay on the node and
    reduces the times that zone reclaim is invoked if file I/O occurs
    when we run out of memory in a zone.

    The problem is that zone reclaim runs too frequently when the page cache is
    used for file I/O (read/write, and therefore unmapped pages!) alone and we
    have almost all pages of the zone allocated. Zone reclaim may remove 32
    unmapped pages. File I/O will use these pages for the next read/write
    requests and the number of unmapped pages increases. After the zone has
    filled up again, zone reclaim will remove them again after only 32 pages.
    This cycle is too inefficient and there are potentially too many zone
    reclaim cycles.

    With the 1% boundary we may still remove all unmapped pages for file I/O in
    a zone reclaim pass. However, it will take a large number of reads and
    writes to get back to 1% again, where we trigger zone reclaim again.

    Zone reclaim in 2.6.16/17 does not show this behavior because we have a 30
    second timeout.

    [akpm@osdl.org: rename the /proc file and the variable]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
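
    A sketch of the resulting trigger condition (names are illustrative;
    zone_min_unmapped_pages stands for the per-zone translation of the renamed
    ~1% tunable):

        /* Only bother with zone reclaim when there is a meaningful amount of
           unmapped page cache to take back. */
        static int zone_reclaim_worth_trying(unsigned long unmapped_pagecache_pages,
                                             unsigned long zone_min_unmapped_pages)
        {
            return unmapped_pagecache_pages > zone_min_unmapped_pages;
        }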
     

01 Jul, 2006

4 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial:
    Remove obsolete #include
    remove obsolete swsusp_encrypt
    arch/arm26/Kconfig typos
    Documentation/IPMI typos
    Kconfig: Typos in net/sched/Kconfig
    v9fs: do not include linux/version.h
    Documentation/DocBook/mtdnand.tmpl: typo fixes
    typo fixes: specfic -> specific
    typo fixes in Documentation/networking/pktgen.txt
    typo fixes: occuring -> occurring
    typo fixes: infomation -> information
    typo fixes: disadvantadge -> disadvantage
    typo fixes: aquire -> acquire
    typo fixes: mecanism -> mechanism
    typo fixes: bandwith -> bandwidth
    fix a typo in the RTC_CLASS help text
    smb is no longer maintained

    Manually merged trivial conflict in arch/um/kernel/vmlinux.lds.S

    Linus Torvalds
     
  • The remaining counters in page_state after the zoned VM counter patches
    have been applied are all just for show in /proc/vmstat. They have no
    essential function for the VM.

    We use a simple increment of per-cpu variables. In order to avoid the most
    severe races we disable preempt. Preempt does not prevent the race between
    an increment and an interrupt handler incrementing the same statistics
    counter. However, that race is exceedingly rare; we may only lose one
    increment or so, and there is no requirement (at least not in the kernel)
    that the vm event counters have to be accurate.

    In the non preempt case this results in a simple increment for each
    counter. For many architectures this will be reduced by the compiler to a
    single instruction. This single instruction is atomic for i386 and x86_64.
    And therefore even the rare race condition in an interrupt is avoided for
    both architectures in most cases.

    The patchset also adds an off switch for embedded systems that allows
    building Linux kernels without these counters.

    The implementation of these counters is through inline code that hopefully
    results in only a single increment instruction being emitted
    (i386, x86_64), or in the increment being hidden through instruction-level
    concurrency (EPIC architectures such as ia64 can get that done).

    Benefits:
    - VM event counter operations usually reduce to a single inline instruction
    on i386 and x86_64.
    - No interrupt disable, only preempt disable for the preempt case.
    Preempt disable can also be avoided by moving the counter into a spinlock.
    - Handling is similar to zoned VM counters.
    - Simple and easily extendable.
    - Can be omitted to reduce memory use for embedded use.

    References:

    RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=113512330605497&w=2
    RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=114988082814934&w=2
    local_t http://marc.theaimsgroup.com/?l=linux-kernel&m=114991748606690&w=2
    V2 http://marc.theaimsgroup.com/?t=115014808400007&r=1&w=2
    V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767022346&w=2
    V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115047968808926&w=2

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
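
    A simplified userspace model of the event counters (one counter struct
    here; the kernel keeps one per CPU and brackets the increment with
    preempt_disable()/preempt_enable(); the event names are illustrative):

        enum vm_event_item { PGALLOC, PGFREE, PGFAULT, NR_VM_EVENT_ITEMS };

        struct vm_event_state { unsigned long event[NR_VM_EVENT_ITEMS]; };

        static struct vm_event_state this_cpu_events;

        /* A plain, non-atomic increment: an interrupt racing with this may
           lose the odd count, which is acceptable for statistics. */
        static inline void count_vm_event(enum vm_event_item item)
        {
            this_cpu_events.event[item]++;
        }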
     
  • The numa statistics are really event counters. But they are per node and
    so we have had special treatment for these counters through additional
    fields on the pcp structure. We can now use the per zone nature of the
    zoned VM counters to realize these.

    This will shrink the size of the pcp structure on NUMA systems. We will
    have some room to add additional per zone counters that will all still fit
    in the same cacheline.

    Bits   Prior pcp size          Size after patch        We can add
    ------------------------------------------------------------------
    64     128 bytes (16 words)    80 bytes (10 words)     48
    32      76 bytes (19 words)    56 bytes (14 words)      8 (64 byte cacheline)
                                                            72 (128 byte)

    Remove the special statistics for numa and replace them with zoned vm
    counters. This has the side effect that global sums of these events now
    show up in /proc/vmstat.

    Also take the opportunity to move the zone_statistics() function from
    page_alloc.c into vmstat.c.

    Discussions:
    V2 http://marc.theaimsgroup.com/?t=115048227000002&r=1&w=2

    Signed-off-by: Christoph Lameter
    Acked-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Conversion of nr_unstable to a per zone counter

    We need to do some special modifications to the nfs code since there are
    multiple cases of disposition and we need to have a page ref for proper
    accounting.

    This converts the last critical page state of the VM and therefore we need to
    remove several functions that were depending on GET_PAGE_STATE_LAST in order
    to make the kernel compile again. We are only left with event type counters
    in page state.

    [akpm@osdl.org: bugfixes]
    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter