21 Feb, 2007

3 commits

  • The alien cache is a per cpu, per node array allocated for every slab on the
    system. Currently we size this array for all nodes that the kernel can
    support. For IA64 this is 1024 nodes, so we allocate an array with 1024
    objects even if we only boot a system with 4 nodes.

    This patch uses "nr_node_ids" to determine the number of possible nodes
    supported by a hardware configuration and only allocates an alien cache
    sized for possible nodes.

    The initialization of nr_node_ids occurred too late relative to the bootstrap
    of the slab allocator, so I moved setup_nr_node_ids() into
    free_area_init_nodes().

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
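
    The following standalone C sketch illustrates the sizing change described
    above; MAX_NUMNODES, nr_node_ids and the slot structure are simplified
    stand-ins for the kernel's types, and all numbers are illustrative only.

    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_NUMNODES 1024            /* compile-time maximum, e.g. IA64 */

    struct alien_slot { void *entries; unsigned int avail; };

    int main(void)
    {
            int nr_node_ids = 4;         /* nodes actually possible on this boot */

            /* Old scheme: one slot for every node the kernel could ever support. */
            size_t old_size = MAX_NUMNODES * sizeof(struct alien_slot);

            /* New scheme: size the per-slab array by the possible node ids only. */
            struct alien_slot *alien = calloc(nr_node_ids, sizeof(*alien));
            size_t new_size = nr_node_ids * sizeof(*alien);

            printf("alien array per slab: %zu bytes instead of %zu bytes\n",
                   new_size, old_size);
            free(alien);
            return 0;
    }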
     
  • highest_possible_node_id() is currently used to calculate the last possible
    node id so that the network subsystem can figure out how to size per node
    arrays.

    I think having the ability to determine the maximum number of nodes in a
    system at runtime is useful, but then we should name this entry
    correspondingly: it should return the number of node ids, and the value
    needs to be set up only once at bootup. The node_possible_map does not
    change after bootup.

    This patch introduces nr_node_ids and replaces the use of
    highest_possible_node_id(). nr_node_ids is calculated at bootup when the
    page allocator's pagesets are initialized.

    [deweerdt@free.fr: fix oops]
    Signed-off-by: Christoph Lameter
    Cc: Neil Brown
    Cc: Trond Myklebust
    Signed-off-by: Frederik Deweerdt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
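
    A toy, single-word version of the idea: compute the number of node ids once
    from a "possible nodes" bitmap (the kernel walks node_possible_map; the
    bitmap value and helper below are made up for illustration).

    #include <stdio.h>

    /* Toy stand-in for node_possible_map: bit i set => node i may exist. */
    static unsigned long node_possible_map = 0x0bUL;    /* nodes 0, 1 and 3 */

    /* nr_node_ids = highest possible node id + 1, computed once at bootup. */
    static int count_node_ids(unsigned long map)
    {
            int highest = 0;

            for (int node = 0; node < (int)(8 * sizeof(map)); node++)
                    if (map & (1UL << node))
                            highest = node;
            return highest + 1;
    }

    int main(void)
    {
            int nr_node_ids = count_node_ids(node_possible_map);

            printf("nr_node_ids = %d\n", nr_node_ids);   /* prints 4 */
            return 0;
    }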
     
  • bind_zonelist() can create a zero-length zonelist if there is a
    memory-less node. This patch checks the length of the zonelist; if the
    length is 0, it returns -EINVAL.

    Tested on ia64/NUMA with a memory-less node.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Andi Kleen
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

17 Feb, 2007

1 commit

  • When NFSD receives a write request, the data is typically in a number of
    1448 byte segments and writev is used to collect them together.

    Unfortunately, generic_file_buffered_write passes these to the filesystem
    one at a time, so a 32K over-write, for example, becomes a series of
    partial-page writes to each page, causing the filesystem to have to
    pre-read those pages - wasted effort.

    generic_file_buffered_write handles one segment of the vector at a time
    because it has to pre-fault in each segment to avoid deadlocks. When
    writing from kernel-space (as nfsd does) this is not an issue, so
    generic_file_buffered_write does not need to break an iovec from nfsd into
    little pieces.

    This patch avoids the splitting when get_fs() is KERNEL_DS, as it is when
    called from NFSd.

    This issue was introduced by commit 6527c2bdf1f833cc18e8f42bd97973d583e4aa83

    Acked-by: Nick Piggin
    Cc: Norman Weathers
    Cc: Vladimir V. Saveliev
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
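
    As a userspace analogy of whole-vector versus per-segment handling (this is
    not the kernel change itself, just an illustration of the difference):

    #include <stdio.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    int main(void)
    {
            /* Three small segments, roughly like the 1448 byte chunks nfsd sees. */
            char a[] = "segment one ", b[] = "segment two ", c[] = "segment three\n";
            struct iovec iov[3] = {
                    { .iov_base = a, .iov_len = strlen(a) },
                    { .iov_base = b, .iov_len = strlen(b) },
                    { .iov_base = c, .iov_len = strlen(c) },
            };

            /* Whole-vector submission: one call covers all segments at once ... */
            if (writev(STDOUT_FILENO, iov, 3) < 0)
                    return 1;

            /* ... versus handling one segment at a time, which is what the
             * per-segment loop in generic_file_buffered_write amounted to. */
            for (int i = 0; i < 3; i++)
                    if (write(STDOUT_FILENO, iov[i].iov_base, iov[i].iov_len) < 0)
                            return 1;

            return 0;
    }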
     

13 Feb, 2007

4 commits

  • Many struct inode_operations in the kernel can be "const". Marking them
    const moves them to the .rodata section, which avoids false sharing with
    potentially dirty data. In addition, it catches accidental writes to these
    shared resources at compile time.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
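
    A minimal illustration of the const-ops-table idea (the struct and
    functions below are made up, not actual kernel ones): declaring the table
    const places it in a read-only section and makes accidental stores a
    compile-time error.

    #include <stdio.h>

    /* A toy ops table in the style of struct inode_operations. */
    struct toy_inode_operations {
            int (*lookup)(const char *name);
            int (*create)(const char *name);
    };

    static int toy_lookup(const char *name) { printf("lookup %s\n", name); return 0; }
    static int toy_create(const char *name) { printf("create %s\n", name); return 0; }

    /* const: lands in .rodata; `toy_iops.lookup = NULL;` would not compile. */
    static const struct toy_inode_operations toy_iops = {
            .lookup = toy_lookup,
            .create = toy_create,
    };

    int main(void)
    {
            return toy_iops.lookup("foo") | toy_iops.create("bar");
    }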
     
  • Make mincore work for anon mappings, nonlinear mappings, and migration
    entries. Based on a patch from Linus Torvalds.

    Signed-off-by: Nick Piggin
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
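
    The syscall side of this can be exercised from userspace; a small probe of
    an anonymous mapping (exactly the case this patch covers) might look like
    the following sketch.

    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            long page = sysconf(_SC_PAGESIZE);
            size_t len = 4 * (size_t)page;
            unsigned char vec[4];

            /* Anonymous private mapping. */
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED)
                    return 1;

            p[0] = 1;                      /* fault in only the first page */

            if (mincore(p, len, vec))      /* one status byte per page */
                    return 1;

            for (int i = 0; i < 4; i++)
                    printf("page %d: %s\n", i, (vec[i] & 1) ? "resident" : "not resident");

            munmap(p, len);
            return 0;
    }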
     
  • Add a NOPFN_REFAULT return code for vm_ops->nopfn(), equivalent to
    NOPAGE_REFAULT for vm_ops->nopage(), indicating that the handler requests
    re-execution of the faulting instruction.

    Signed-off-by: Benjamin Herrenschmidt
    Cc: Arnd Bergmann
    Cc: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • Add a vm_insert_pfn helper, so that ->fault handlers can have nopfn
    functionality by installing their own pte and returning NULL.

    Signed-off-by: Nick Piggin
    Signed-off-by: Benjamin Herrenschmidt
    Cc: Arnd Bergmann
    Cc: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

12 Feb, 2007

24 commits

  • Change a hard-coded constant 0 to the symbolic equivalent NOTIFY_DONE in
    the ratelimit_handler() CPU notifier handler function.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul E. McKenney
     
  • A variety of (mostly) innocuous fixes to the embedded kernel-doc content in
    source files, including:

    * make multi-line initial descriptions single line
    * denote some function names, constants and structs as such
    * change erroneous opening '/*' to '/**' in a few places
    * reword some text for clarity

    Signed-off-by: Robert P. J. Day
    Cc: "Randy.Dunlap"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert P. J. Day
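
    For reference, a well-formed kernel-doc block looks like the following (the
    function itself is made up): the opening marker uses two asterisks, the
    first line is a single-line description, and each parameter gets an
    @name line.

    #include <stdio.h>

    /**
     * clamp_to_byte - clamp an integer into the 0..255 range
     * @value: the raw value to clamp
     *
     * Values below zero become 0 and values above 255 become 255.
     *
     * Returns the clamped value.
     */
    static int clamp_to_byte(int value)
    {
            if (value < 0)
                    return 0;
            if (value > 255)
                    return 255;
            return value;
    }

    int main(void)
    {
            printf("%d\n", clamp_to_byte(300));    /* prints 255 */
            return 0;
    }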
     
  • Convert all calls to invalidate_inode_pages() into open-coded calls to
    invalidate_mapping_pages().

    Leave the invalidate_inode_pages() wrapper in place for now, marked as
    deprecated.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • It makes no sense to me to export invalidate_inode_pages() and not
    invalidate_mapping_pages(), and I actually need invalidate_mapping_pages()
    because of its range-specification ability...

    akpm: also remove the export of invalidate_inode_pages() by making it an
    inlined wrapper.

    Signed-off-by: Anton Altaparmakov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Altaparmakov
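
    The inlined wrapper described in the akpm note presumably reduces to
    something like this kernel-style sketch (not standalone code; the types
    come from the kernel headers):

    /* Compatibility only: whole-file invalidation is just the ranged call
     * over every possible page index. */
    static inline unsigned long invalidate_inode_pages(struct address_space *mapping)
    {
            return invalidate_mapping_pages(mapping, 0, ~0UL);
    }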
     
  • kmem_cache_free() was missing the check for freeing held locks.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • When the kernel unmaps an address range, it needs to transfer PTE state
    into the page struct. Currently, the kernel transfers the access bit via
    mark_page_accessed(). The call to mark_page_accessed() in the unmap path
    doesn't look logically correct.

    At unmap time, calling mark_page_accessed() will cause the page's LRU
    state to be bumped up one step closer to the more recently used state. It
    causes quite a bit of headache in a scenario where a process creates a
    shmem segment, touches a whole bunch of pages, then unmaps it. The
    unmapping takes a long time because mark_page_accessed() will start moving
    pages from the inactive to the active list.

    I'm not too concerned with moving the page from one LRU list to another.
    Sooner or later it might be moved because of multiple mappings from
    various processes. But it just doesn't look logical: when the user asks
    for a range to be unmapped, the intention is that the process is no longer
    interested in these pages. Moving those pages to the active list (or
    bumping their state towards more active) seems to be an over-reaction. It
    also prolongs unmapping latency, which is the core issue I'm trying to
    solve.

    As suggested by Peter, we should still preserve the information from
    pte-young pages, but nothing more.

    Signed-off-by: Peter Zijlstra
    Acked-by: Ken Chen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
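
    A kernel-style sketch of what "preserve the pte-young info, but nothing
    more" could look like in the unmap path (illustrative only; pte_young()
    and SetPageReferenced() are the usual kernel helpers, and the surrounding
    code is omitted):

    /* For each present pte being torn down in the unmap path: */
    if (pte_young(ptent))
            SetPageReferenced(page);    /* keep the referenced information ... */
    /* ... instead of mark_page_accessed(page), which would also push the
     * page one step up the inactive -> active LRU promotion ladder. */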
     
  • A shmem-backed file does not have page writeback, nor does it participate
    in the backing device's dirty or writeback accounting. So using the
    generic __set_page_dirty_nobuffers() for its .set_page_dirty aops method
    is a bit of overkill. It unnecessarily prolongs shm unmap latency.

    For example, on a densely populated large shm segment (several GBs), the
    unmapping operation becomes painfully long, because at unmap time the
    kernel transfers the dirty bit in the PTE into the page struct and to the
    radix tree tag. Tagging the radix tree is particularly expensive because
    it has to traverse the tree from the root to the leaf node for every dirty
    page. What's bothersome is that the radix tree tag is used for page
    writeback; however, shmem is memory backed and there is no page writeback
    for such a filesystem. In the end, we spend all that time tagging the
    radix tree and none of that fancy tagging is ever used. So let's simplify
    this by introducing a new aops method, __set_page_dirty_no_writeback; this
    will speed up shm unmap.

    Signed-off-by: Ken Chen
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
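
    The new aops method boils down to setting the page's dirty flag while
    skipping the radix tree tagging and dirty accounting; roughly (kernel-style
    sketch, not standalone code):

    /* For address_spaces that never do writeback (such as shmem): just set
     * the per-page dirty flag; no radix tree tag, no dirty accounting. */
    static int __set_page_dirty_no_writeback(struct page *page)
    {
            if (!PageDirty(page))
                    SetPageDirty(page);
            return 0;
    }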
     
  • As Andi pointed out: CONFIG_GENERIC_ISA_DMA only disables the ISA DMA
    channel management. Other functionality may still expect GFP_DMA to
    provide memory below 16M. So we need to make sure that CONFIG_ZONE_DMA is
    set independently of CONFIG_GENERIC_ISA_DMA. Undo the modifications to
    mm/Kconfig where we made ZONE_DMA dependent on GENERIC_ISA_DMA and set
    these explicitly in each arch's Kconfig.

    Reviews must occur for each arch in order to determine if ZONE_DMA can be
    switched off. It can only be switched off if we know that all devices
    supported by a platform are capable of performing DMA transfers to all of
    memory (some arches already support this: uml, avr32, sh, sh64, parisc and
    IA64/Altix).

    In order to switch ZONE_DMA off conditionally, one would have to establish
    a scheme by which one can assure that no drivers are enabled that are only
    capable of doing I/O to a part of memory, or one needs to provide an
    alternate means of performing an allocation from a specific range of
    memory (like that provided by alloc_pages_range()) and ensure that all
    drivers use that call. In that case the arch's alloc_dma_coherent() may
    need to be modified to call alloc_pages_range() instead of relying on
    GFP_DMA.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Make ZONE_DMA optional in core code.

    - ifdef all code for ZONE_DMA and related definitions following the example
    for ZONE_DMA32 and ZONE_HIGHMEM.

    - Without ZONE_DMA, ZONE_HIGHMEM and ZONE_DMA32 we get a ZONES_SHIFT of
    0.

    - Modify the VM statistics to work correctly without a DMA zone.

    - Modify slab to not create DMA slabs if there is no ZONE_DMA.

    [akpm@osdl.org: cleanup]
    [jdike@addtoit.com: build fix]
    [apw@shadowen.org: Simplify calculation of the number of bits we need for ZONES_SHIFT]
    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Cc: "Luck, Tony"
    Cc: Kyle McMartin
    Cc: Matthew Wilcox
    Cc: James Bottomley
    Cc: Paul Mundt
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Jeff Dike
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
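
    A toy model of the conditional zone list (compile with -DCONFIG_ZONE_DMA to
    include the DMA zone; the enum and the shift calculation are simplified
    stand-ins for the kernel's definitions):

    #include <stdio.h>

    enum zone_type {
    #ifdef CONFIG_ZONE_DMA
            ZONE_DMA,
    #endif
            ZONE_NORMAL,
    #ifdef CONFIG_HIGHMEM
            ZONE_HIGHMEM,
    #endif
            MAX_NR_ZONES
    };

    /* Number of bits needed to encode the zone number in page->flags:
     * with a single zone this is 0, matching the ZONES_SHIFT of 0 above. */
    static int zones_shift(int nr_zones)
    {
            int shift = 0;

            while ((1 << shift) < nr_zones)
                    shift++;
            return shift;
    }

    int main(void)
    {
            printf("zones configured: %d, ZONES_SHIFT: %d\n",
                   (int)MAX_NR_ZONES, zones_shift(MAX_NR_ZONES));
            return 0;
    }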
     
  • This patch simply defines CONFIG_ZONE_DMA for all arches. We later do special
    things with CONFIG_ZONE_DMA after the VM and an arch are prepared to work
    without ZONE_DMA.

    CONFIG_ZONE_DMA can be defined in two ways depending on how an architecture
    handles ISA DMA.

    First if CONFIG_GENERIC_ISA_DMA is set by the arch then we know that the arch
    needs ZONE_DMA because ISA DMA devices are supported. We can catch this in
    mm/Kconfig and do not need to modify arch code.

    Second, arches may use ZONE_DMA in an unknown way. We set CONFIG_ZONE_DMA
    for all arches that do not set CONFIG_GENERIC_ISA_DMA in order to ensure
    backwards compatibility. The arches may later undefine ZONE_DMA if their
    arch code has been verified not to depend on ZONE_DMA.

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Cc: "Luck, Tony"
    Cc: Kyle McMartin
    Cc: Matthew Wilcox
    Cc: James Bottomley
    Cc: Paul Mundt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This patchset follows up on the earlier work in Andrew's tree to reduce
    the number of zones. The patches allow going to a minimum of 2 zones.
    This one also makes ZONE_DMA optional, and therefore the number of zones
    can be reduced to one.

    ZONE_DMA is usually used for ISA DMA devices. There are a number of reasons
    why we would not want to have ZONE_DMA

    1. Some arches do not need ZONE_DMA at all.

    2. With the advent of IOMMUs, DMA zones are no longer needed.
    The necessity of DMA zones may be drastically reduced
    in the future. This patchset allows a compilation of
    a kernel without that overhead.

    3. Devices that require ISA DMA are rare these days. None of
    my systems has any need for ISA DMA.

    4. The presence of an additional zone unnecessarily complicates
    VM operations because it must be scanned and balancing
    logic must operate on it.

    5. With only ZONE_NORMAL one can reach the situation where
    we have only one zone. This will allow the unrolling of many
    loops in the VM and allows the optimization of various
    code paths in the VM.

    6. Having only a single zone in a NUMA system results in a
    1-1 correspondence between nodes and zones. Various additional
    optimizations to critical VM paths become possible.

    Many systems today can operate just fine with a single zone. If you look at
    what is in ZONE_DMA then one usually sees that nothing uses it. The DMA slabs
    are empty (Some arches use ZONE_DMA instead of ZONE_NORMAL, then ZONE_NORMAL
    will be empty instead).

    On all of my systems (i386, x86_64, ia64) ZONE_DMA is completely empty.
    Why constantly look at an empty zone in /proc/zoneinfo and an empty slab
    in /proc/slabinfo? Non-i386 systems also frequently have no need for
    ZONE_DMA, and the zones stay empty.

    The patchset was tested on i386 (UP / SMP), x86_64 (UP, NUMA) and ia64 (NUMA).

    The RFC posted earlier (see
    http://marc.theaimsgroup.com/?l=linux-kernel&m=115231723513008&w=2) had
    lots of #ifdefs in it. An effort has been made to minimize the number of
    #ifdefs and make this as compact as possible. The job was made much easier
    by the ongoing efforts of others to extract common arch-specific
    functionality.

    I have been running this for a while now on my desktop, and finally Linux
    is using all my available RAM instead of leaving the 16MB in ZONE_DMA
    untouched:

    christoph@pentium940:~$ cat /proc/zoneinfo
    Node 0, zone Normal
    pages free 4435
    min 1448
    low 1810
    high 2172
    active 241786
    inactive 210170
    scanned 0 (a: 0 i: 0)
    spanned 524224
    present 524224
    nr_anon_pages 61680
    nr_mapped 14271
    nr_file_pages 390264
    nr_slab_reclaimable 27564
    nr_slab_unreclaimable 1793
    nr_page_table_pages 449
    nr_dirty 39
    nr_writeback 0
    nr_unstable 0
    nr_bounce 0
    cpu: 0 pcp: 0
    count: 156
    high: 186
    batch: 31
    cpu: 0 pcp: 1
    count: 9
    high: 62
    batch: 15
    vm stats threshold: 20
    cpu: 1 pcp: 0
    count: 177
    high: 186
    batch: 31
    cpu: 1 pcp: 1
    count: 12
    high: 62
    batch: 15
    vm stats threshold: 20
    all_unreclaimable: 0
    prev_priority: 12
    temp_priority: 12
    start_pfn: 0

    This patch:

    In two places in the VM we use ZONE_DMA to refer to the first zone. If
    ZONE_DMA is optional then other zones may be first. So simply replace
    ZONE_DMA with zone 0.

    This also fixes ZONETABLE_PGSHIFT. If we have only a single zone then
    ZONES_PGSHIFT may become 0 because there is no need anymore to encode the zone
    number related to a pgdat. However, we still need a zonetable to index all
    the zones for each node if this is a NUMA system. Therefore define
    ZONETABLE_SHIFT unconditionally as the offset of the ZONE field in page flags.

    [apw@shadowen.org: fix mismerge]
    Acked-by: Christoph Hellwig
    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Cc: "Luck, Tony"
    Cc: Kyle McMartin
    Cc: Matthew Wilcox
    Cc: James Bottomley
    Cc: Paul Mundt
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Values are available via ZVC sums.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Values are readily available via ZVC per node and global sums.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Function is unnecessary now. We can use the summing features of the ZVCs to
    get the values we need.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • nr_free_pages is now a simple access to a global variable. Make it a
    macro instead of a function.

    nr_free_pages() now requires vmstat.h to be included. There is one
    occurrence in power management where we need to add the include. Directly
    refer to global_page_state() there to clarify why the #include was added.

    [akpm@osdl.org: arm build fix]
    [akpm@osdl.org: sparc64 build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
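
    The change presumably amounts to something like the following header sketch
    (kernel-style, not standalone; NR_FREE_PAGES is the ZVC item behind the
    count):

    /* nr_free_pages() is now just a read of the consolidated ZVC counter,
     * so callers must pull in the vmstat declarations. */
    #include <linux/vmstat.h>

    #define nr_free_pages() global_page_state(NR_FREE_PAGES)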
     
  • The global and per zone counter sums are in arrays of longs. Reorder the
    ZVCs so that the most frequently used ZVCs are put into the same
    cacheline. That way calculations of the global, node and per zone vm
    state touch only a single cacheline. This is mostly important for 64 bit
    systems, where one 128 byte cacheline holds only 16 longs.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This again simplifies some of the VM counter calculations through the use
    of the ZVC consolidated counters.

    [michal.k.k.piotrowski@gmail.com: build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Michal Piotrowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The dirty ratio used to determine writeback behavior is currently based on
    the total number of pages in the system.

    However, not all pages in the system may be dirtied. Thus the ratio is
    always too low and can never reach 100%. The ratio may be particularly
    skewed if large hugepage allocations, slab allocations or device driver
    buffers make large sections of memory unavailable. In that case we may get
    into a situation in which, for example, the background writeback ratio of
    40% can no longer be reached, which leads to undesired writeback behavior.

    This patchset fixes that issue by determining the ratio based on the actual
    pages that may potentially be dirty. These are the pages on the active and
    the inactive list plus free pages.

    The problem with those counts has so far been that they are expensive to
    calculate, because counts from multiple nodes and multiple zones have to
    be summed up. This patchset makes these counters ZVC counters. This means
    that a current sum per zone, per node and for the whole system is always
    available via global variables and is no longer expensive to calculate.

    The patchset results in some other good side effects:

    - Removal of the various functions that sum up free, active and inactive
    page counts

    - Cleanup of the functions that display information via the proc filesystem.

    This patch:

    The use of a ZVC for nr_inactive and nr_active allows a simplification of some
    counter operations. More ZVC functionality is used for sums etc in the
    following patches.

    [akpm@osdl.org: UP build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
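
    A standalone arithmetic sketch of the ratio change (all numbers are made
    up; in the kernel the counts would come from the ZVC sums for free, active
    and inactive pages):

    #include <stdio.h>

    int main(void)
    {
            unsigned long total_pages    = 1000000;  /* all of RAM, in pages */
            unsigned long free_pages     = 100000;
            unsigned long active_pages   = 300000;
            unsigned long inactive_pages = 200000;
            unsigned int dirty_background_ratio = 40;      /* percent */

            /* Old basis: every page in the system, including pages that can
             * never be dirtied (huge pages, slab, driver buffers, ...). */
            unsigned long old_thresh = total_pages * dirty_background_ratio / 100;

            /* New basis: only pages that may actually become dirty. */
            unsigned long dirtyable = free_pages + active_pages + inactive_pages;
            unsigned long new_thresh = dirtyable * dirty_background_ratio / 100;

            printf("background threshold: %lu pages (old) vs %lu pages (new)\n",
                   old_thresh, new_thresh);
            return 0;
    }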
     
  • After do_wp_page has tested page_mkwrite, it must release old_page after
    acquiring page table lock, not before: at some stage that ordering got
    reversed, leaving a (very unlikely) window in which old_page might be
    truncated, freed, and reused in the same position.

    Signed-off-by: Hugh Dickins
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • This early break prevents us from displaying info for the vm stats thresholds
    if the zone doesn't have any pages in its per-cpu pagesets.

    So my 800MB i386 box says:

    Node 0, zone DMA
    pages free 2365
    min 16
    low 20
    high 24
    active 0
    inactive 0
    scanned 0 (a: 0 i: 0)
    spanned 4096
    present 4044
    nr_anon_pages 0
    nr_mapped 1
    nr_file_pages 0
    nr_slab_reclaimable 0
    nr_slab_unreclaimable 0
    nr_page_table_pages 0
    nr_dirty 0
    nr_writeback 0
    nr_unstable 0
    nr_bounce 0
    nr_vmscan_write 0
    protection: (0, 868, 868)
    pagesets
    all_unreclaimable: 0
    prev_priority: 12
    start_pfn: 0
    Node 0, zone Normal
    pages free 199713
    min 934
    low 1167
    high 1401
    active 10215
    inactive 4507
    scanned 0 (a: 0 i: 0)
    spanned 225280
    present 222420
    nr_anon_pages 2685
    nr_mapped 1110
    nr_file_pages 12055
    nr_slab_reclaimable 2216
    nr_slab_unreclaimable 1527
    nr_page_table_pages 213
    nr_dirty 0
    nr_writeback 0
    nr_unstable 0
    nr_bounce 0
    nr_vmscan_write 0
    protection: (0, 0, 0)
    pagesets
    cpu: 0 pcp: 0
    count: 152
    high: 186
    batch: 31
    cpu: 0 pcp: 1
    count: 13
    high: 62
    batch: 15
    vm stats threshold: 16
    cpu: 1 pcp: 0
    count: 34
    high: 186
    batch: 31
    cpu: 1 pcp: 1
    count: 10
    high: 62
    batch: 15
    vm stats threshold: 16
    all_unreclaimable: 0
    prev_priority: 12
    start_pfn: 4096

    Just nuke all that search-for-the-first-non-empty-pageset code. Dunno why it
    was there in the first place..

    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • find_min_pfn_for_node() and find_min_pfn_with_active_regions() sort
    early_node_map[] on every call. This is an excessive amount of sorting
    that can be avoided. This patch always searches the whole
    early_node_map[] in find_min_pfn_for_node() instead of returning the
    first value found. The map is then only sorted once, when required.
    Successfully boot tested on a number of machines.

    [akpm@osdl.org: cleanup]
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Use the pointer passed to cache_reap to determine the work pointer and
    consolidate exit paths.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Clean up the __cache_alloc and __cache_alloc_node functions a bit. We no
    longer need to do NUMA_BUILD tricks and the UMA allocation path is much
    simpler. No functional changes in this patch.

    Note: this saves a few kernel text bytes on x86 NUMA builds due to using
    gotos in __cache_alloc_node() and moving the __GFP_THISNODE check into
    fallback_alloc().

    Cc: Andy Whitcroft
    Cc: Christoph Hellwig
    Cc: Manfred Spraul
    Acked-by: Christoph Lameter
    Cc: Paul Jackson
    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
  • The PageSlab debug check in kfree_debugcheck() is broken for compound
    pages. It is also redundant as we already do BUG_ON for non-slab pages in
    page_get_cache() and page_get_slab() which are always called before we free
    any actual objects.

    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg