12 Feb, 2007

20 commits

  • kmem_cache_free() was missing the check for freeing held locks.
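
    A minimal sketch of the kind of check described, assuming the lockdep
    helper debug_check_no_locks_freed() and slab's internal obj_size() and
    __cache_free() helpers; not necessarily the literal patch:

      void kmem_cache_free(struct kmem_cache *cachep, void *objp)
      {
          unsigned long flags;

          local_irq_save(flags);
          /* complain if a held lock lives inside the object being freed */
          debug_check_no_locks_freed(objp, obj_size(cachep));
          __cache_free(cachep, objp);
          local_irq_restore(flags);
      }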

    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • When the kernel unmaps an address range, it needs to transfer PTE state
    into the page struct. Currently, the kernel transfers the accessed bit
    via mark_page_accessed(). The call to mark_page_accessed() in the unmap
    path doesn't look logically correct.

    At unmap time, calling mark_page_accessed() causes the page's LRU state
    to be bumped one step closer to the most-recently-used state. This causes
    quite a bit of headache in a scenario where a process creates a shmem
    segment, touches a whole bunch of pages, then unmaps it. The unmapping
    takes a long time because mark_page_accessed() will start moving pages
    from the inactive to the active list.

    I'm not too concerned with moving the page from one LRU list to another;
    sooner or later it might be moved anyway because of multiple mappings from
    various processes. But it just doesn't look logical: when a user asks for
    a range to be unmapped, the intention is that the process is no longer
    interested in these pages. Moving those pages to the active list (or
    bumping them one step towards more active) seems to be an overreaction.
    It also prolongs unmapping latency, which is the core issue I'm trying to
    solve.

    As suggested by Peter, we should still preserve the information from
    pte-young pages, but nothing more.
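
    As an illustration only (assuming the zap_pte_range() path in mm/memory.c,
    and not necessarily the exact hunk that was merged), the young information
    can be preserved without promoting the page on the LRU by setting the
    referenced flag instead of calling mark_page_accessed():

      if (pte_dirty(ptent))
          set_page_dirty(page);
      if (pte_young(ptent))
          /* remember the young bit, but leave the LRU placement alone */
          SetPageReferenced(page);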

    Signed-off-by: Peter Zijlstra
    Acked-by: Ken Chen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
     
  • A shmem-backed file has no page writeback, nor does it participate in
    the backing device's dirty or writeback accounting. So using the generic
    __set_page_dirty_nobuffers() for its .set_page_dirty aops method is
    overkill, and it unnecessarily prolongs shm unmap latency.

    For example, on a densely populated large shm segment (several GBs), the
    unmapping operation becomes painfully long, because at unmap time the
    kernel transfers the dirty bit from the PTE into the page struct and into
    the radix tree tag. Tagging the radix tree is particularly expensive
    because it has to traverse the tree from the root to the leaf node for
    every dirty page. What's bothersome is that the radix tree tag is only
    used for page writeback; shmem is memory backed and there is no page
    writeback for such a filesystem. In the end we spend all that time
    tagging the radix tree and none of that fancy tagging is ever used. So
    simplify it by introducing a new aops method,
    __set_page_dirty_no_writeback, which speeds up shm unmap.
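
    A sketch of what such an aop can look like, mirroring the description
    above (treat the exact body as an assumption):

      /* mark the page dirty in the page struct only: no radix tree tagging
       * and no dirty accounting, since shmem never writes these pages back */
      static int __set_page_dirty_no_writeback(struct page *page)
      {
          if (!PageDirty(page))
              return !TestSetPageDirty(page);
          return 0;
      }

    shmem would then point its .set_page_dirty at this helper instead of
    __set_page_dirty_nobuffers().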

    Signed-off-by: Ken Chen
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
     
  • As Andi pointed out: CONFIG_GENERIC_ISA_DMA only disables ISA DMA
    channel management. Other functionality may still expect GFP_DMA to
    provide memory below 16M. So we need to make sure that CONFIG_ZONE_DMA
    is set independently of CONFIG_GENERIC_ISA_DMA. Undo the modifications
    to mm/Kconfig where we made ZONE_DMA dependent on GENERIC_ISA_DMA, and
    set it explicitly in each arch's Kconfig.

    Reviews must occur for each arch in order to determine if ZONE_DMA can be
    switched off. It can only be switched off if we know that all devices
    supported by a platform are capable of performing DMA transfers to all of
    memory (some arches already support this: uml, avr32, sh, sh64, parisc
    and IA64/Altix).

    In order to switch ZONE_DMA off conditionally, one would have to establish
    a scheme by which one can assure that no drivers are enabled that are only
    capable of doing I/O to a part of memory, or one needs to provide an
    alternate means of performing an allocation from a specific range of
    memory (like that provided by alloc_pages_range()) and ensure that all
    drivers use that call. In that case the arch's alloc_dma_coherent() may
    need to be modified to call alloc_pages_range() instead of relying on
    GFP_DMA.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Make ZONE_DMA optional in core code.

    - ifdef all code for ZONE_DMA and related definitions, following the
    example of ZONE_DMA32 and ZONE_HIGHMEM (see the sketch after this list).

    - Without ZONE_DMA, ZONE_HIGHMEM and ZONE_DMA32 we get to a ZONES_SHIFT of
    0.

    - Modify the VM statistics to work correctly without a DMA zone.

    - Modify slab to not create DMA slabs if there is no ZONE_DMA.
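
    A hedged sketch of the ifdef pattern referred to above, following the
    existing ZONE_DMA32/ZONE_HIGHMEM example in include/linux/mmzone.h:

      enum zone_type {
      #ifdef CONFIG_ZONE_DMA
          ZONE_DMA,
      #endif
      #ifdef CONFIG_ZONE_DMA32
          ZONE_DMA32,
      #endif
          ZONE_NORMAL,
      #ifdef CONFIG_HIGHMEM
          ZONE_HIGHMEM,
      #endif
          MAX_NR_ZONES
      };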

    [akpm@osdl.org: cleanup]
    [jdike@addtoit.com: build fix]
    [apw@shadowen.org: Simplify calculation of the number of bits we need for ZONES_SHIFT]
    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Cc: "Luck, Tony"
    Cc: Kyle McMartin
    Cc: Matthew Wilcox
    Cc: James Bottomley
    Cc: Paul Mundt
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Jeff Dike
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This patch simply defines CONFIG_ZONE_DMA for all arches. We later do special
    things with CONFIG_ZONE_DMA after the VM and an arch are prepared to work
    without ZONE_DMA.

    CONFIG_ZONE_DMA can be defined in two ways depending on how an architecture
    handles ISA DMA.

    First, if CONFIG_GENERIC_ISA_DMA is set by the arch, then we know that the
    arch needs ZONE_DMA because ISA DMA devices are supported. We can catch
    this in mm/Kconfig and do not need to modify arch code.

    Second, arches may use ZONE_DMA in an unknown way. We set CONFIG_ZONE_DMA
    for all arches that do not set CONFIG_GENERIC_ISA_DMA in order to ensure
    backwards compatibility. The arches may later undefine ZONE_DMA if their
    arch code has been verified not to depend on ZONE_DMA.

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Cc: "Luck, Tony"
    Cc: Kyle McMartin
    Cc: Matthew Wilcox
    Cc: James Bottomley
    Cc: Paul Mundt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This patchset follows up on the earlier work in Andrew's tree to reduce
    the number of zones. The earlier patches allow going down to a minimum of
    two zones; this one also makes ZONE_DMA optional, so the number of zones
    can be reduced to one.

    ZONE_DMA is usually used for ISA DMA devices. There are a number of
    reasons why we would not want to have ZONE_DMA:

    1. Some arches do not need ZONE_DMA at all.

    2. With the advent of IOMMUs, DMA zones are no longer needed.
    The necessity of DMA zones may be drastically reduced
    in the future. This patchset allows a compilation of
    a kernel without that overhead.

    3. Devices that require ISA DMA are becoming rare these days. None
    of my systems has any need for ISA DMA.

    4. The presence of an additional zone unnecessarily complicates
    VM operations because it must be scanned and balancing
    logic must operate on it.

    5. With only ZONE_NORMAL left, one can reach the situation where
    we have just a single zone. This will allow the unrolling of many
    loops in the VM and allows the optimization of various
    code paths in the VM.

    6. Having only a single zone in a NUMA system results in a
    1-1 correspondence between nodes and zones. Various additional
    optimizations to critical VM paths become possible.

    Many systems today can operate just fine with a single zone. If you look
    at what is in ZONE_DMA, you usually see that nothing uses it; the DMA
    slabs are empty (some arches use ZONE_DMA instead of ZONE_NORMAL, in
    which case ZONE_NORMAL is empty instead).

    On all of my systems (i386, x86_64, ia64) ZONE_DMA is completely empty.
    Why constantly look at an empty zone in /proc/zoneinfo and an empty slab
    in /proc/slabinfo? Non-i386 arches also frequently have no need for
    ZONE_DMA, and the zone stays empty.

    The patchset was tested on i386 (UP / SMP), x86_64 (UP, NUMA) and ia64 (NUMA).

    The RFC posted earlier (see
    http://marc.theaimsgroup.com/?l=linux-kernel&m=115231723513008&w=2) had
    lots of #ifdefs in it. An effort has been made to minimize the number of
    #ifdefs and make this as compact as possible. The job was made much
    easier by the ongoing efforts of others to extract common arch-specific
    functionality.

    I have been running this for awhile now on my desktop and finally Linux is
    using all my available RAM instead of leaving the 16MB in ZONE_DMA untouched:

    christoph@pentium940:~$ cat /proc/zoneinfo
    Node 0, zone   Normal
      pages free     4435
            min      1448
            low      1810
            high     2172
            active   241786
            inactive 210170
            scanned  0 (a: 0 i: 0)
            spanned  524224
            present  524224
        nr_anon_pages         61680
        nr_mapped             14271
        nr_file_pages         390264
        nr_slab_reclaimable   27564
        nr_slab_unreclaimable 1793
        nr_page_table_pages   449
        nr_dirty              39
        nr_writeback          0
        nr_unstable           0
        nr_bounce             0
      cpu: 0 pcp: 0
                count: 156
                high:  186
                batch: 31
      cpu: 0 pcp: 1
                count: 9
                high:  62
                batch: 15
      vm stats threshold: 20
      cpu: 1 pcp: 0
                count: 177
                high:  186
                batch: 31
      cpu: 1 pcp: 1
                count: 12
                high:  62
                batch: 15
      vm stats threshold: 20
      all_unreclaimable: 0
      prev_priority:     12
      temp_priority:     12
      start_pfn:         0

    This patch:

    In two places in the VM we use ZONE_DMA to refer to the first zone. If
    ZONE_DMA is optional then other zones may be first. So simply replace
    ZONE_DMA with zone 0.

    This also fixes ZONETABLE_PGSHIFT. If we have only a single zone then
    ZONES_PGSHIFT may become 0 because there is no longer any need to encode
    the zone number relative to a pgdat. However, we still need a zonetable
    to index all the zones for each node if this is a NUMA system. Therefore
    define ZONETABLE_SHIFT unconditionally as the offset of the ZONE field in
    the page flags.
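
    A hedged sketch of the kind of substitution described (illustrative, not
    the literal hunks):

      /* before: assumed the first zone is always ZONE_DMA */
      zone = pgdat->node_zones + ZONE_DMA;

      /* after: simply take zone 0, whichever zone that happens to be */
      zone = pgdat->node_zones;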

    [apw@shadowen.org: fix mismerge]
    Acked-by: Christoph Hellwig
    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Cc: "Luck, Tony"
    Cc: Kyle McMartin
    Cc: Matthew Wilcox
    Cc: James Bottomley
    Cc: Paul Mundt
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Values are available via ZVC sums.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Values are readily available via ZVC per node and global sums.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Function is unnecessary now. We can use the summing features of the ZVCs to
    get the values we need.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • nr_free_pages is now a simple access to a global variable, so make it a
    macro instead of a function.

    nr_free_pages now requires vmstat.h to be included. There is one
    occurrence, in power management, where we need to add the include; refer
    directly to global_page_state() there to clarify why the #include was
    added.
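
    The macro boils down to a ZVC read; roughly (treat the exact spelling as
    an assumption):

      #define nr_free_pages() global_page_state(NR_FREE_PAGES)

    which is why any user now needs vmstat.h for global_page_state() and
    NR_FREE_PAGES.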

    [akpm@osdl.org: arm build fix]
    [akpm@osdl.org: sparc64 build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The global and per-zone counter sums are in arrays of longs. Reorder the
    ZVCs so that the most frequently used ZVCs are put into the same
    cacheline. That way calculations of the global, node and per-zone vm
    state touch only a single cacheline. This is mostly important for 64-bit
    systems, where one 128-byte cacheline holds only 16 longs.
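
    A hedged sketch of the idea in enum zone_stat_item (the ordering and
    comment here are illustrative):

      enum zone_stat_item {
          /* First 128 byte cacheline (assuming 64 bit words) */
          NR_FREE_PAGES,
          NR_INACTIVE,
          NR_ACTIVE,
          NR_ANON_PAGES,
          NR_FILE_MAPPED,
          NR_FILE_PAGES,
          /* ... less frequently used counters follow ... */
          NR_VM_ZONE_STAT_ITEMS
      };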

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This again simplifies some of the VM counter calculations through the use
    of the ZVC consolidated counters.

    [michal.k.k.piotrowski@gmail.com: build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Michal Piotrowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The determination of the dirty ratio to determine writeback behavior is
    currently based on the number of total pages on the system.

    However, not all pages in the system may be dirtied. Thus the ratio is
    always too low and can never reach 100%. The ratio may be particularly
    skewed if large hugepage allocations, slab allocations or device driver
    buffers make large sections of memory unavailable. In that case we may
    get into a situation in which, for example, the background writeback
    ratio of 40% cannot be reached anymore, which leads to undesired
    writeback behavior.

    This patchset fixes that issue by determining the ratio based on the actual
    pages that may potentially be dirty. These are the pages on the active and
    the inactive list plus free pages.

    The problem with those counts has so far been that they are expensive to
    calculate, because counts from multiple nodes and multiple zones have to
    be summed up. This patchset makes these counters ZVC counters, which
    means that a current sum per zone, per node and for the whole system is
    always available via global variables and is no longer expensive to
    calculate.
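
    With the ZVCs in place, the dirtyable-memory base can be computed
    cheaply; a minimal sketch, with the helper name assumed:

      static unsigned long determine_dirtyable_memory(void)
      {
          /* free pages plus the pages on the active and inactive lists */
          return global_page_state(NR_FREE_PAGES)
               + global_page_state(NR_ACTIVE)
               + global_page_state(NR_INACTIVE);
      }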

    The patchset results in some other good side effects:

    - Removal of the various functions that sum up free, active and inactive
    page counts

    - Cleanup of the functions that display information via the proc filesystem.

    This patch:

    The use of a ZVC for nr_inactive and nr_active allows a simplification of some
    counter operations. More ZVC functionality is used for sums etc in the
    following patches.

    [akpm@osdl.org: UP build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • After do_wp_page has tested page_mkwrite, it must release old_page after
    acquiring page table lock, not before: at some stage that ordering got
    reversed, leaving a (very unlikely) window in which old_page might be
    truncated, freed, and reused in the same position.
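
    A hedged sketch of the corrected ordering, using do_wp_page()-style
    locals:

      /* retake the pte lock first ... */
      page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
      /* ... and only then drop the reference that was pinning old_page */
      if (old_page)
          page_cache_release(old_page);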

    Signed-off-by: Hugh Dickins
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • This early break prevents us from displaying info for the vm stats thresholds
    if the zone doesn't have any pages in its per-cpu pagesets.

    So my 800MB i386 box says:

    Node 0, zone      DMA
      pages free     2365
            min      16
            low      20
            high     24
            active   0
            inactive 0
            scanned  0 (a: 0 i: 0)
            spanned  4096
            present  4044
        nr_anon_pages         0
        nr_mapped             1
        nr_file_pages         0
        nr_slab_reclaimable   0
        nr_slab_unreclaimable 0
        nr_page_table_pages   0
        nr_dirty              0
        nr_writeback          0
        nr_unstable           0
        nr_bounce             0
        nr_vmscan_write       0
      protection: (0, 868, 868)
      pagesets
      all_unreclaimable: 0
      prev_priority:     12
      start_pfn:         0
    Node 0, zone   Normal
      pages free     199713
            min      934
            low      1167
            high     1401
            active   10215
            inactive 4507
            scanned  0 (a: 0 i: 0)
            spanned  225280
            present  222420
        nr_anon_pages         2685
        nr_mapped             1110
        nr_file_pages         12055
        nr_slab_reclaimable   2216
        nr_slab_unreclaimable 1527
        nr_page_table_pages   213
        nr_dirty              0
        nr_writeback          0
        nr_unstable           0
        nr_bounce             0
        nr_vmscan_write       0
      protection: (0, 0, 0)
      pagesets
        cpu: 0 pcp: 0
                  count: 152
                  high:  186
                  batch: 31
        cpu: 0 pcp: 1
                  count: 13
                  high:  62
                  batch: 15
        vm stats threshold: 16
        cpu: 1 pcp: 0
                  count: 34
                  high:  186
                  batch: 31
        cpu: 1 pcp: 1
                  count: 10
                  high:  62
                  batch: 15
        vm stats threshold: 16
      all_unreclaimable: 0
      prev_priority:     12
      start_pfn:         4096

    Just nuke all that search-for-the-first-non-empty-pageset code. Dunno why it
    was there in the first place..

    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • find_min_pfn_for_node() and find_min_pfn_with_active_regions() sort
    early_node_map[] on every call. This is an excessive amount of sorting
    that can be avoided. This patch always searches the whole
    early_node_map[] in find_min_pfn_for_node() instead of returning the
    first value found, so the map only needs to be sorted once, when
    required. Successfully boot tested on a number of machines.
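
    A hedged sketch of searching the whole map instead of relying on a sorted
    one (names follow the early_node_map[] machinery of the time; details
    assumed):

      static unsigned long __init find_min_pfn_for_node(unsigned long nid)
      {
          int i;
          unsigned long min_pfn = ULONG_MAX;

          /* scan every registered range; the map need not be sorted */
          for (i = 0; i < nr_nodemap_entries; i++) {
              if (nid != MAX_NUMNODES && early_node_map[i].nid != nid)
                  continue;
              if (early_node_map[i].start_pfn < min_pfn)
                  min_pfn = early_node_map[i].start_pfn;
          }
          return min_pfn;
      }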

    [akpm@osdl.org: cleanup]
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Use the pointer passed to cache_reap to determine the work pointer and
    consolidate exit paths.
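
    A hedged sketch of the shape of the change: derive the delayed_work from
    the pointer that was passed in, and reschedule from one exit path (slab
    internals such as cache_chain_mutex and REAPTIMEOUT_CPUC are assumed):

      static void cache_reap(struct work_struct *w)
      {
          struct delayed_work *work =
              container_of(w, struct delayed_work, work);

          if (!mutex_trylock(&cache_chain_mutex))
              goto out;   /* could not get the lock: just try again later */

          /* ... drain and reap the per-cpu and shared arrays ... */

          mutex_unlock(&cache_chain_mutex);
      out:
          /* single exit path, reusing the work item we were handed */
          schedule_delayed_work(work, REAPTIMEOUT_CPUC);
      }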

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Clean up __cache_alloc and __cache_alloc_node functions a bit. We no
    longer need to do NUMA_BUILD tricks and the UMA allocation path is much
    simpler. No functional changes in this patch.

    Note: this saves a few kernel text bytes on an x86 NUMA build due to
    using gotos in __cache_alloc_node() and moving the __GFP_THISNODE check
    into fallback_alloc().

    Cc: Andy Whitcroft
    Cc: Christoph Hellwig
    Cc: Manfred Spraul
    Acked-by: Christoph Lameter
    Cc: Paul Jackson
    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
  • The PageSlab debug check in kfree_debugcheck() is broken for compound
    pages. It is also redundant as we already do BUG_ON for non-slab pages in
    page_get_cache() and page_get_slab() which are always called before we free
    any actual objects.

    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     

10 Feb, 2007

4 commits

  • This patch adds a utility function install_special_mapping, for creating a
    special vma using a fixed set of preallocated pages as backing, such as for a
    vDSO. This consolidates some nearly identical code used for vDSO mapping
    reimplemented for different architectures.
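
    Usage then looks roughly like this on an architecture's vDSO setup path
    (the flags and names here are illustrative):

      /* map the preallocated vDSO pages at vdso_base in the new mm */
      ret = install_special_mapping(mm, vdso_base, PAGE_SIZE,
                                    VM_READ | VM_EXEC |
                                    VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC,
                                    vdso_pages);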

    Signed-off-by: Roland McGrath
    Cc: Ingo Molnar
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
  • Also split that long line up - people like to send us wordwrapped oom-kill
    traces.

    Cc: Nick Piggin
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • __unmap_hugepage_range() is buggy in that it does not preserve the dirty
    state of the huge_pte when unmapping a hugepage range. This causes data
    corruption when drop_caches is used by the sysadmin. For example, an
    application creates a hugetlb file, modifies pages, then unmaps it.
    While the hugetlb file is still alive, along comes the sysadmin doing an
    "echo 3 > /proc/sys/vm/drop_caches".

    drop_pagecache_sb() will happily free all pages that aren't marked dirty
    if there are no active mappings. Later, when the application maps the
    hugetlb file back, all the data is gone, which is catastrophic for the
    application.

    Not only that, the internal resv_huge_pages count also gets all messed
    up. Fix it up by marking the page dirty appropriately.
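
    The fix amounts to carrying the dirty bit over when the huge pte is torn
    down; sketched here in __unmap_hugepage_range() terms (details assumed):

      page = pte_page(pte);
      /* preserve the dirty state so drop_caches cannot toss the data */
      if (pte_dirty(pte))
          set_page_dirty(page);
      list_add(&page->lru, &page_list);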

    Signed-off-by: Ken Chen
    Cc: "Nish Aravamudan"
    Cc: Adam Litke
    Cc: David Gibson
    Cc: William Lee Irwin III
    Cc:
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
     
  • Remove find_trylock_page as per the removal schedule.

    Signed-off-by: Nick Piggin
    [ Let's see if anybody screams ]
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

01 Feb, 2007

1 commit

  • This reverts commit e80ee884ae0e3794ef2b65a18a767d502ad712ee.

    Pawel Sikora had a boot-time oops due to it - because the sign change
    invalidates the following comparisons, since 'free_pages' can be
    negative.

    The micro-optimization just isn't worth it.

    Bisected-by: Pawel Sikora
    Acked-by: Andrew Morton
    Cc: Nick Piggin
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

31 Jan, 2007

2 commits

  • When expanding the stack, we don't currently check if the VMA will cross
    into an area of the address space that is reserved for hugetlb pages.
    Subsequent faults on the expanded portion of such a VMA will confuse the
    low-level MMU code, resulting in an OOPS. Check for this.
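
    A hedged sketch of such a check in the stack-growth accounting path
    (exact placement and locals assumed):

      /* refuse to let the stack expand into a hugetlb-only address range */
      if (is_hugepage_only_range(vma->vm_mm, new_start, size))
          return -EFAULT;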

    Signed-off-by: Adam Litke
    Cc: David Gibson
    Cc: William Lee Irwin III
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     
  • Nick Piggin points out that page accounting on MIPS multiple ZERO_PAGEs
    is not maintained by its move_pte, and could lead to freeing a ZERO_PAGE.

    Instead of complicating that move_pte, just forget the minor optimization
    when mremapping, and change the one thing which needed it for
    correctness: have filemap_xip use ZERO_PAGE(0) throughout, instead of
    choosing a zero page according to address.

    [ "There is no block device driver one could use for XIP on mips
    platforms" - Carsten Otte ]

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Andrew Morton
    Cc: Ralf Baechle
    Cc: Carsten Otte
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

30 Jan, 2007

1 commit

  • This makes balance_dirty_pages() always base its calculations on the
    amount of non-highmem memory in the machine, rather than trying to base
    them on total memory and then falling back on non-highmem memory if the
    mapping being written to wasn't highmem capable.

    This not only fixes a situation where two different writers can have
    wildly different notions about what is a "balanced" dirty state, but it
    also means that people with highmem machines don't run into an OOM
    situation when regular memory fills up with dirty pages.

    We used to try to handle the latter case by scaling down the dirty_ratio
    if the machine had a lot of highmem pages in page_writeback_init(), but
    it wasn't aggressive enough for some situations, and since basing the
    dirty ratio on highmem memory was broken in the first place, let's just
    stop doing so.

    (A variation of this theme fixed Justin Piszcz's OOM problem when
    copying an 18GB file on a RAID setup).
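
    A hedged sketch of the idea (variable names assumed), sizing both
    thresholds against lowmem only:

      unsigned long available_memory = vm_total_pages - totalhigh_pages;
      long background = (dirty_background_ratio * available_memory) / 100;
      long dirty = (vm_dirty_ratio * available_memory) / 100;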

    Acked-by: Nick Piggin
    Cc: Justin Piszcz
    Cc: Andrew Morton
    Cc: Neil Brown
    Cc: Ingo Molnar
    Cc: Randy Dunlap
    Cc: Christoph Lameter
    Cc: Jens Axboe
    Cc: Peter Zijlstra
    Cc: Adrian Bunk
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

27 Jan, 2007

4 commits

  • NFS can handle the case where invalidate_inode_pages2_range() fails, so the
    premise behind commit 8258d4a574d3a8c01f0ef68aa26b969398a0e140 is now gone.

    Remove the WARN_ON_ONCE() which is causing users grief as we can see from
    http://bugzilla.kernel.org/show_bug.cgi?id=7826

    Signed-off-by: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Trond Myklebust
     
  • This patch fixes core dumps to include the vDSO vma, which is left out now.
    It removes the special-case core writing macros, which were not doing the
    right thing for the vDSO vma anyway. Instead, it uses VM_ALWAYSDUMP in the
    vma; there is no need for the fixmap page to be installed. It handles the
    CONFIG_COMPAT_VDSO case by making elf_core_dump use the fake vma from
    get_gate_vma after real vmas in the same way the /proc/PID/maps code does.

    This changes core dumps so they no longer include the non-PT_LOAD phdrs
    from the vDSO. I made the change to add them in the first place, but it
    turned out that nothing ever wanted them there since the advent of
    NT_AUXV. It's cleaner to leave them out, and just let the phdrs inside
    the vDSO image speak for themselves.
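
    On the dumper side, the vma flag makes the decision trivial; a sketch of
    the check in the ELF core dumper's vma filter:

      /* the vma can be set up to tell us the answer directly */
      if (vma->vm_flags & VM_ALWAYSDUMP)
          return 1;   /* e.g. the vDSO: always include it in the core */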

    Signed-off-by: Roland McGrath
    Cc: Ingo Molnar
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
  • This patch fixes the initialization of gate_vma.vm_flags and
    gate_vma.vm_page_prot to reflect reality. This makes the "[vdso]" line in
    /proc/PID/maps correctly show r-xp instead of ---p, when gate_vma is used
    (CONFIG_COMPAT_VDSO on i386).
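
    A hedged sketch of the initialization being fixed (the exact protection
    constant is an assumption):

      gate_vma.vm_flags = VM_READ | VM_MAYREAD | VM_EXEC | VM_MAYEXEC;
      gate_vma.vm_page_prot = __P101;   /* r-x, so maps shows r-xp not ---p */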

    Signed-off-by: Roland McGrath
    Cc: Ingo Molnar
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
  • It's not pretty, but it appears that ext3 with data=journal will clean
    pages without ever actually telling the VM that they are clean. This,
    in turn, results in the VM (and balance_dirty_pages() in particular)
    never realizing that the pages got cleaned, and waiting forever for an
    event that already happened.

    Technically, this seems to be a problem with ext3 itself, but it used to
    be hidden by 'try_to_free_buffers()' noticing this situation on its own,
    and just working around the filesystem problem.

    This commit re-instates that hack, in order to avoid a regression for
    the 2.6.20 release. This fixes bugzilla 7844:

    http://bugzilla.kernel.org/show_bug.cgi?id=7844

    Peter Zijlstra points out that we should probably retain the debugging
    code that this removes from cancel_dirty_page(), and I agree, but for
    the imminent release we might as well just silence the warning too
    (since it's not a new bug: anything that triggers that warning has been
    around forever).

    Acked-by: Randy Dunlap
    Acked-by: Jens Axboe
    Acked-by: Peter Zijlstra
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

23 Jan, 2007

1 commit

  • Currently one can specify an arbitrary node mask to mbind that includes
    nodes not allowed. If that is done with an interleave policy then we will
    go around all the nodes. Those outside of the currently allowed cpuset
    will be redirected to the border nodes. Interleave will then create
    imbalances at the borders of the cpuset.

    This patch restricts the nodes to the currently allowed cpuset.

    The RFC for this patch was discussed at
    http://marc.theaimsgroup.com/?t=116793842100004&r=1&w=2
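
    A hedged illustration (not the literal patch) of restricting the
    user-supplied mask before the interleave policy is installed:

      /* clamp the requested nodes to what the cpuset currently allows */
      nodes_and(nodes, nodes, cpuset_current_mems_allowed);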

    Signed-off-by: Christoph Lameter
    Cc: Paul Jackson
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

13 Jan, 2007

1 commit

  • Currently we issue a bounce trace whenever __blk_queue_bounce() is
    called, but that merely means that the device has a lower dma mask than
    the higher pages in the system; the bio itself may still contain only
    lower pages. So move the bounce trace to the point inside
    __blk_queue_bounce() where we know there will actually be page bouncing.
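
    Roughly, the trace point moves next to the code that decides a page
    really needs bouncing (a sketch; the surrounding loop is assumed):

      if (page_to_pfn(page) <= q->bounce_pfn)
          continue;                     /* this segment is fine as-is */

      /* first segment that actually needs bouncing: emit the trace once */
      if (!bio)
          blk_add_trace_bio(q, *bio_orig, BLK_TA_BOUNCE);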

    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

12 Jan, 2007

2 commits

  • NFS: Fix race in nfs_release_page()

    invalidate_inode_pages2() may find the dirty bit has been set on a page
    owing to the fact that the page may still be mapped after it was locked.
    Only after the call to unmap_mapping_range() are we sure that the page
    can no longer be dirtied.
    In order to fix this, NFS has hooked the releasepage() method and tries
    to write the page out between the call to unmap_mapping_range() and the
    call to remove_mapping(). This, however, leads to deadlocks in the page
    reclaim code, where the page may be locked without holding a reference
    to the inode or dentry.

    Fix is to add a new address_space_operation, launder_page(), which will
    attempt to write out a dirty page without releasing the page lock.
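
    The invalidation path can then launder a still-locked dirty page before
    trying to remove it; a sketch of such a helper (name and exact guards
    assumed):

      static int do_launder_page(struct address_space *mapping,
                                 struct page *page)
      {
          if (!PageDirty(page))
              return 0;
          if (page->mapping != mapping ||
              mapping->a_ops->launder_page == NULL)
              return 0;
          /* write the page out while still holding the page lock */
          return mapping->a_ops->launder_page(page);
      }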

    Signed-off-by: Trond Myklebust

    Also, the bare SetPageDirty() can skew all sorts of accounting, leading
    to other nasties.

    [akpm@osdl.org: cleanup]
    Signed-off-by: Peter Zijlstra
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Trond Myklebust
     
  • Fix an oops experienced on the Cell architecture when init-time
    functions, early_*(), are called at runtime. The patch alters the call
    paths to make sure that the callers explicitly say whether the call is
    being made on behalf of a hotplug event or is happening at boot time.

    It has been compile tested on ppc64, ia64, s390, i386 and x86_64.
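
    A hedged sketch of how callers can state their context explicitly (names
    assumed); functions such as memmap_init_zone() then take the context as
    an extra argument instead of guessing:

      enum memmap_context {
          MEMMAP_EARLY,     /* called while booting */
          MEMMAP_HOTPLUG,   /* called on behalf of memory hotplug */
      };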

    Acked-by: Arnd Bergmann
    Signed-off-by: Dave Hansen
    Cc: Yasunori Goto
    Acked-by: Andy Whitcroft
    Cc: Christoph Lameter
    Cc: Martin Schwidefsky
    Acked-by: Heiko Carstens
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

09 Jan, 2007

2 commits

  • * master.kernel.org:/home/rmk/linux-2.6-arm:
    [ARM] Provide basic printk_clock() implementation
    [ARM] Resolve fuse and direct-IO failures due to missing cache flushes
    [ARM] pass vma for flush_anon_page()
    [ARM] Fix potential MMCI bug
    [ARM] Fix kernel-mode undefined instruction aborts
    [ARM] 4082/1: iop3xx: fix iop33x gpio register offset
    [ARM] 4070/1: arch/arm/kernel: fix warnings from missing includes
    [ARM] 4079/1: iop: Update MAINTAINERS

    Linus Torvalds
     
  • Since get_user_pages() may be used with processes other than the
    current process and calls flush_anon_page(), flush_anon_page() has to
    cope in some way with non-current processes.

    It may not be appropriate, or even desirable to flush a region of
    virtual memory cache in the current process when that is different to
    the process that we want the flush to occur for.

    Therefore, pass the vma into flush_anon_page() so that the architecture
    can work out whether the 'vmaddr' is for the current process or not.
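
    The resulting interface, roughly:

      /* the vma tells the arch whose address space 'vmaddr' belongs to */
      void flush_anon_page(struct vm_area_struct *vma,
                           struct page *page, unsigned long vmaddr);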

    Signed-off-by: Russell King

    Russell King
     

06 Jan, 2007

2 commits

  • At the end of shrink_all_memory() we forget to recalculate lru_pages: it can
    be zero.

    Fix that up, and add a helper function for this operation too.

    Also, recalculate lru_pages each time around the inner loop to get the
    balancing correct.
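
    A hedged sketch of such a helper (name assumed), summing the per-zone LRU
    sizes:

      static unsigned long count_lru_pages(void)
      {
          struct zone *zone;
          unsigned long ret = 0;

          for_each_zone(zone)
              ret += zone->nr_active + zone->nr_inactive;
          return ret;
      }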

    Cc: "Rafael J. Wysocki"
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • These days, if you swapoff when there isn't enough memory, the OOM killer
    gives "BUG: scheduling while atomic" and the machine hangs: badness()
    needs to do its PF_SWAPOFF return after the task_unlock (tasklist_lock
    is also held here, so p isn't going to be freed: PF_SWAPOFF might get
    turned off at any moment, but that doesn't really matter).
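
    A sketch of the reordering in badness() (the "kill this first" return
    value follows the OOM killer's convention; details assumed):

      task_unlock(p);

      /* only now, outside task_lock, is it safe to take the early return */
      if (p->flags & PF_SWAPOFF)
          return ULONG_MAX;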

    Signed-off-by: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins