12 Feb, 2007

20 commits

  • kmem_cache_free() was missing the check for freeing held locks.
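
    A minimal sketch of the kind of check described, assuming the lockdep
    helper debug_check_no_locks_freed() and slab's internal obj_size() and
    __cache_free() helpers; not necessarily the literal patch:

      void kmem_cache_free(struct kmem_cache *cachep, void *objp)
      {
          unsigned long flags;

          local_irq_save(flags);
          /* complain if a held lock lives inside the object being freed */
          debug_check_no_locks_freed(objp, obj_size(cachep));
          __cache_free(cachep, objp);
          local_irq_restore(flags);
      }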

    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • When the kernel unmaps an address range, it needs to transfer PTE state
    into the page struct. Currently, the kernel transfers the accessed bit
    via mark_page_accessed(). The call to mark_page_accessed() in the unmap
    path doesn't look logically correct.

    At unmap time, calling mark_page_accessed() causes the page's LRU state
    to be bumped one step closer to the most-recently-used state. This causes
    quite a bit of headache in a scenario where a process creates a shmem
    segment, touches a whole bunch of pages, then unmaps it. The unmapping
    takes a long time because mark_page_accessed() will start moving pages
    from the inactive to the active list.

    I'm not too concerned with moving the page from one LRU list to another;
    sooner or later it might be moved anyway because of multiple mappings from
    various processes. But it just doesn't look logical: when a user asks for
    a range to be unmapped, the intention is that the process is no longer
    interested in these pages. Moving those pages to the active list (or
    bumping them one step towards more active) seems to be an overreaction.
    It also prolongs unmapping latency, which is the core issue I'm trying to
    solve.

    As suggested by Peter, we should still preserve the information from
    pte-young pages, but nothing more.
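
    As an illustration only (assuming the zap_pte_range() path in mm/memory.c,
    and not necessarily the exact hunk that was merged), the young information
    can be preserved without promoting the page on the LRU by setting the
    referenced flag instead of calling mark_page_accessed():

      if (pte_dirty(ptent))
          set_page_dirty(page);
      if (pte_young(ptent))
          /* remember the young bit, but leave the LRU placement alone */
          SetPageReferenced(page);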

    Signed-off-by: Peter Zijlstra
    Acked-by: Ken Chen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
     
  • A shmem-backed file has no page writeback, nor does it participate in
    the backing device's dirty or writeback accounting. So using the generic
    __set_page_dirty_nobuffers() for its .set_page_dirty aops method is
    overkill, and it unnecessarily prolongs shm unmap latency.

    For example, on a densely populated large shm segment (several GBs), the
    unmapping operation becomes painfully long, because at unmap time the
    kernel transfers the dirty bit from the PTE into the page struct and into
    the radix tree tag. Tagging the radix tree is particularly expensive
    because it has to traverse the tree from the root to the leaf node for
    every dirty page. What's bothersome is that the radix tree tag is only
    used for page writeback; shmem is memory backed and there is no page
    writeback for such a filesystem. In the end we spend all that time
    tagging the radix tree and none of that fancy tagging is ever used. So
    simplify it by introducing a new aops method,
    __set_page_dirty_no_writeback, which speeds up shm unmap.
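
    A sketch of what such an aop can look like, mirroring the description
    above (treat the exact body as an assumption):

      /* mark the page dirty in the page struct only: no radix tree tagging
       * and no dirty accounting, since shmem never writes these pages back */
      static int __set_page_dirty_no_writeback(struct page *page)
      {
          if (!PageDirty(page))
              return !TestSetPageDirty(page);
          return 0;
      }

    shmem would then point its .set_page_dirty at this helper instead of
    __set_page_dirty_nobuffers().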

    Signed-off-by: Ken Chen
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
     
  • As Andi pointed out: CONFIG_GENERIC_ISA_DMA only disables ISA DMA
    channel management. Other functionality may still expect GFP_DMA to
    provide memory below 16M. So we need to make sure that CONFIG_ZONE_DMA
    is set independently of CONFIG_GENERIC_ISA_DMA. Undo the modifications
    to mm/Kconfig where we made ZONE_DMA dependent on GENERIC_ISA_DMA, and
    set it explicitly in each arch's Kconfig.

    Reviews must occur for each arch in order to determine if ZONE_DMA can be
    switched off. It can only be switched off if we know that all devices
    supported by a platform are capable of performing DMA transfers to all of
    memory (some arches already support this: uml, avr32, sh, sh64, parisc
    and IA64/Altix).

    In order to switch ZONE_DMA off conditionally, one would have to establish
    a scheme by which one can assure that no drivers are enabled that are only
    capable of doing I/O to a part of memory, or one needs to provide an
    alternate means of performing an allocation from a specific range of
    memory (like that provided by alloc_pages_range()) and ensure that all
    drivers use that call. In that case the arch's alloc_dma_coherent() may
    need to be modified to call alloc_pages_range() instead of relying on
    GFP_DMA.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Make ZONE_DMA optional in core code.

    - ifdef all code for ZONE_DMA and related definitions, following the
    example of ZONE_DMA32 and ZONE_HIGHMEM (see the sketch after this list).

    - Without ZONE_DMA, ZONE_HIGHMEM and ZONE_DMA32 we get to a ZONES_SHIFT of
    0.

    - Modify the VM statistics to work correctly without a DMA zone.

    - Modify slab to not create DMA slabs if there is no ZONE_DMA.
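
    A hedged sketch of the ifdef pattern referred to above, following the
    existing ZONE_DMA32/ZONE_HIGHMEM example in include/linux/mmzone.h:

      enum zone_type {
      #ifdef CONFIG_ZONE_DMA
          ZONE_DMA,
      #endif
      #ifdef CONFIG_ZONE_DMA32
          ZONE_DMA32,
      #endif
          ZONE_NORMAL,
      #ifdef CONFIG_HIGHMEM
          ZONE_HIGHMEM,
      #endif
          MAX_NR_ZONES
      };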

    [akpm@osdl.org: cleanup]
    [jdike@addtoit.com: build fix]
    [apw@shadowen.org: Simplify calculation of the number of bits we need for ZONES_SHIFT]
    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Cc: "Luck, Tony"
    Cc: Kyle McMartin
    Cc: Matthew Wilcox
    Cc: James Bottomley
    Cc: Paul Mundt
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Jeff Dike
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This patch simply defines CONFIG_ZONE_DMA for all arches. We later do special
    things with CONFIG_ZONE_DMA after the VM and an arch are prepared to work
    without ZONE_DMA.

    CONFIG_ZONE_DMA can be defined in two ways depending on how an architecture
    handles ISA DMA.

    First, if CONFIG_GENERIC_ISA_DMA is set by the arch, then we know that the
    arch needs ZONE_DMA because ISA DMA devices are supported. We can catch
    this in mm/Kconfig and do not need to modify arch code.

    Second, arches may use ZONE_DMA in an unknown way. We set CONFIG_ZONE_DMA
    for all arches that do not set CONFIG_GENERIC_ISA_DMA in order to ensure
    backwards compatibility. The arches may later undefine ZONE_DMA if their
    arch code has been verified not to depend on ZONE_DMA.

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Cc: "Luck, Tony"
    Cc: Kyle McMartin
    Cc: Matthew Wilcox
    Cc: James Bottomley
    Cc: Paul Mundt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This patchset follows up on the earlier work in Andrew's tree to reduce
    the number of zones. The earlier patches allow going down to a minimum of
    two zones; this one also makes ZONE_DMA optional, so the number of zones
    can be reduced to one.

    ZONE_DMA is usually used for ISA DMA devices. There are a number of
    reasons why we would not want to have ZONE_DMA:

    1. Some arches do not need ZONE_DMA at all.

    2. With the advent of IOMMUs, DMA zones are no longer needed.
    The necessity of DMA zones may be drastically reduced
    in the future. This patchset allows a compilation of
    a kernel without that overhead.

    3. Devices that require ISA DMA are becoming rare these days. None
    of my systems has any need for ISA DMA.

    4. The presence of an additional zone unnecessarily complicates
    VM operations because it must be scanned and balancing
    logic must operate on it.

    5. With only ZONE_NORMAL left, one can reach the situation where
    we have just a single zone. This will allow the unrolling of many
    loops in the VM and allows the optimization of various
    code paths in the VM.

    6. Having only a single zone in a NUMA system results in a
    1-1 correspondence between nodes and zones. Various additional
    optimizations to critical VM paths become possible.

    Many systems today can operate just fine with a single zone. If you look
    at what is in ZONE_DMA, you usually see that nothing uses it; the DMA
    slabs are empty (some arches use ZONE_DMA instead of ZONE_NORMAL, in
    which case ZONE_NORMAL is empty instead).

    On all of my systems (i386, x86_64, ia64) ZONE_DMA is completely empty.
    Why constantly look at an empty zone in /proc/zoneinfo and an empty slab
    in /proc/slabinfo? Non-i386 arches also frequently have no need for
    ZONE_DMA, and the zone stays empty.

    The patchset was tested on i386 (UP / SMP), x86_64 (UP, NUMA) and ia64 (NUMA).

    The RFC posted earlier (see
    http://marc.theaimsgroup.com/?l=linux-kernel&m=115231723513008&w=2) had
    lots of #ifdefs in it. An effort has been made to minimize the number of
    #ifdefs and make this as compact as possible. The job was made much
    easier by the ongoing efforts of others to extract common arch-specific
    functionality.

    I have been running this for awhile now on my desktop and finally Linux is
    using all my available RAM instead of leaving the 16MB in ZONE_DMA untouched:

    christoph@pentium940:~$ cat /proc/zoneinfo
    Node 0, zone   Normal
      pages free     4435
            min      1448
            low      1810
            high     2172
            active   241786
            inactive 210170
            scanned  0 (a: 0 i: 0)
            spanned  524224
            present  524224
        nr_anon_pages         61680
        nr_mapped             14271
        nr_file_pages         390264
        nr_slab_reclaimable   27564
        nr_slab_unreclaimable 1793
        nr_page_table_pages   449
        nr_dirty              39
        nr_writeback          0
        nr_unstable           0
        nr_bounce             0
      cpu: 0 pcp: 0
                count: 156
                high:  186
                batch: 31
      cpu: 0 pcp: 1
                count: 9
                high:  62
                batch: 15
      vm stats threshold: 20
      cpu: 1 pcp: 0
                count: 177
                high:  186
                batch: 31
      cpu: 1 pcp: 1
                count: 12
                high:  62
                batch: 15
      vm stats threshold: 20
      all_unreclaimable: 0
      prev_priority:     12
      temp_priority:     12
      start_pfn:         0

    This patch:

    In two places in the VM we use ZONE_DMA to refer to the first zone. If
    ZONE_DMA is optional then other zones may be first. So simply replace
    ZONE_DMA with zone 0.

    This also fixes ZONETABLE_PGSHIFT. If we have only a single zone then
    ZONES_PGSHIFT may become 0 because there is no longer any need to encode
    the zone number relative to a pgdat. However, we still need a zonetable
    to index all the zones for each node if this is a NUMA system. Therefore
    define ZONETABLE_SHIFT unconditionally as the offset of the ZONE field in
    the page flags.
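
    A hedged sketch of the kind of substitution described (illustrative, not
    the literal hunks):

      /* before: assumed the first zone is always ZONE_DMA */
      zone = pgdat->node_zones + ZONE_DMA;

      /* after: simply take zone 0, whichever zone that happens to be */
      zone = pgdat->node_zones;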

    [apw@shadowen.org: fix mismerge]
    Acked-by: Christoph Hellwig
    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Cc: "Luck, Tony"
    Cc: Kyle McMartin
    Cc: Matthew Wilcox
    Cc: James Bottomley
    Cc: Paul Mundt
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Values are available via ZVC sums.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Values are readily available via ZVC per node and global sums.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Function is unnecessary now. We can use the summing features of the ZVCs to
    get the values we need.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • nr_free_pages is now a simple access to a global variable, so make it a
    macro instead of a function.

    nr_free_pages now requires vmstat.h to be included. There is one
    occurrence, in power management, where we need to add the include; refer
    directly to global_page_state() there to clarify why the #include was
    added.
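
    The macro boils down to a ZVC read; roughly (treat the exact spelling as
    an assumption):

      #define nr_free_pages() global_page_state(NR_FREE_PAGES)

    which is why any user now needs vmstat.h for global_page_state() and
    NR_FREE_PAGES.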

    [akpm@osdl.org: arm build fix]
    [akpm@osdl.org: sparc64 build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The global and per-zone counter sums are in arrays of longs. Reorder the
    ZVCs so that the most frequently used ZVCs are put into the same
    cacheline. That way calculations of the global, node and per-zone vm
    state touch only a single cacheline. This is mostly important for 64-bit
    systems, where one 128-byte cacheline holds only 16 longs.
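
    A hedged sketch of the idea in enum zone_stat_item (the ordering and
    comment here are illustrative):

      enum zone_stat_item {
          /* First 128 byte cacheline (assuming 64 bit words) */
          NR_FREE_PAGES,
          NR_INACTIVE,
          NR_ACTIVE,
          NR_ANON_PAGES,
          NR_FILE_MAPPED,
          NR_FILE_PAGES,
          /* ... less frequently used counters follow ... */
          NR_VM_ZONE_STAT_ITEMS
      };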

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This again simplifies some of the VM counter calculations through the use
    of the ZVC consolidated counters.

    [michal.k.k.piotrowski@gmail.com: build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Michal Piotrowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The determination of the dirty ratio to determine writeback behavior is
    currently based on the number of total pages on the system.

    However, not all pages in the system may be dirtied. Thus the ratio is
    always too low and can never reach 100%. The ratio may be particularly
    skewed if large hugepage allocations, slab allocations or device driver
    buffers make large sections of memory unavailable. In that case we may
    get into a situation in which, for example, the background writeback
    ratio of 40% cannot be reached anymore, which leads to undesired
    writeback behavior.

    This patchset fixes that issue by determining the ratio based on the actual
    pages that may potentially be dirty. These are the pages on the active and
    the inactive list plus free pages.

    The problem with those counts has so far been that they are expensive to
    calculate, because counts from multiple nodes and multiple zones have to
    be summed up. This patchset makes these counters ZVC counters, which
    means that a current sum per zone, per node and for the whole system is
    always available via global variables and is no longer expensive to
    calculate.
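
    With the ZVCs in place, the dirtyable-memory base can be computed
    cheaply; a minimal sketch, with the helper name assumed:

      static unsigned long determine_dirtyable_memory(void)
      {
          /* free pages plus the pages on the active and inactive lists */
          return global_page_state(NR_FREE_PAGES)
               + global_page_state(NR_ACTIVE)
               + global_page_state(NR_INACTIVE);
      }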

    The patchset results in some other good side effects:

    - Removal of the various functions that sum up free, active and inactive
    page counts

    - Cleanup of the functions that display information via the proc filesystem.

    This patch:

    The use of a ZVC for nr_inactive and nr_active allows a simplification of some
    counter operations. More ZVC functionality is used for sums etc in the
    following patches.

    [akpm@osdl.org: UP build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • After do_wp_page has tested page_mkwrite, it must release old_page after
    acquiring page table lock, not before: at some stage that ordering got
    reversed, leaving a (very unlikely) window in which old_page might be
    truncated, freed, and reused in the same position.
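
    A hedged sketch of the corrected ordering, using do_wp_page()-style
    locals:

      /* retake the pte lock first ... */
      page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
      /* ... and only then drop the reference that was pinning old_page */
      if (old_page)
          page_cache_release(old_page);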

    Signed-off-by: Hugh Dickins
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • This early break prevents us from displaying info for the vm stats thresholds
    if the zone doesn't have any pages in its per-cpu pagesets.

    So my 800MB i386 box says:

    Node 0, zone      DMA
      pages free     2365
            min      16
            low      20
            high     24
            active   0
            inactive 0
            scanned  0 (a: 0 i: 0)
            spanned  4096
            present  4044
        nr_anon_pages         0
        nr_mapped             1
        nr_file_pages         0
        nr_slab_reclaimable   0
        nr_slab_unreclaimable 0
        nr_page_table_pages   0
        nr_dirty              0
        nr_writeback          0
        nr_unstable           0
        nr_bounce             0
        nr_vmscan_write       0
      protection: (0, 868, 868)
      pagesets
      all_unreclaimable: 0
      prev_priority:     12
      start_pfn:         0
    Node 0, zone   Normal
      pages free     199713
            min      934
            low      1167
            high     1401
            active   10215
            inactive 4507
            scanned  0 (a: 0 i: 0)
            spanned  225280
            present  222420
        nr_anon_pages         2685
        nr_mapped             1110
        nr_file_pages         12055
        nr_slab_reclaimable   2216
        nr_slab_unreclaimable 1527
        nr_page_table_pages   213
        nr_dirty              0
        nr_writeback          0
        nr_unstable           0
        nr_bounce             0
        nr_vmscan_write       0
      protection: (0, 0, 0)
      pagesets
        cpu: 0 pcp: 0
                  count: 152
                  high:  186
                  batch: 31
        cpu: 0 pcp: 1
                  count: 13
                  high:  62
                  batch: 15
        vm stats threshold: 16
        cpu: 1 pcp: 0
                  count: 34
                  high:  186
                  batch: 31
        cpu: 1 pcp: 1
                  count: 10
                  high:  62
                  batch: 15
        vm stats threshold: 16
      all_unreclaimable: 0
      prev_priority:     12
      start_pfn:         4096

    Just nuke all that search-for-the-first-non-empty-pageset code. Dunno why it
    was there in the first place..

    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • find_min_pfn_for_node() and find_min_pfn_with_active_regions() sort
    early_node_map[] on every call. This is an excessive amount of sorting
    that can be avoided. This patch always searches the whole
    early_node_map[] in find_min_pfn_for_node() instead of returning the
    first value found, so the map only needs to be sorted once, when
    required. Successfully boot tested on a number of machines.
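
    A hedged sketch of searching the whole map instead of relying on a sorted
    one (names follow the early_node_map[] machinery of the time; details
    assumed):

      static unsigned long __init find_min_pfn_for_node(unsigned long nid)
      {
          int i;
          unsigned long min_pfn = ULONG_MAX;

          /* scan every registered range; the map need not be sorted */
          for (i = 0; i < nr_nodemap_entries; i++) {
              if (nid != MAX_NUMNODES && early_node_map[i].nid != nid)
                  continue;
              if (early_node_map[i].start_pfn < min_pfn)
                  min_pfn = early_node_map[i].start_pfn;
          }
          return min_pfn;
      }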

    [akpm@osdl.org: cleanup]
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Use the pointer passed to cache_reap to determine the work pointer and
    consolidate exit paths.
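
    A hedged sketch of the shape of the change: derive the delayed_work from
    the pointer that was passed in, and reschedule from one exit path (slab
    internals such as cache_chain_mutex and REAPTIMEOUT_CPUC are assumed):

      static void cache_reap(struct work_struct *w)
      {
          struct delayed_work *work =
              container_of(w, struct delayed_work, work);

          if (!mutex_trylock(&cache_chain_mutex))
              goto out;   /* could not get the lock: just try again later */

          /* ... drain and reap the per-cpu and shared arrays ... */

          mutex_unlock(&cache_chain_mutex);
      out:
          /* single exit path, reusing the work item we were handed */
          schedule_delayed_work(work, REAPTIMEOUT_CPUC);
      }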

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Clean up __cache_alloc and __cache_alloc_node functions a bit. We no
    longer need to do NUMA_BUILD tricks and the UMA allocation path is much
    simpler. No functional changes in this patch.

    Note: this saves a few kernel text bytes on an x86 NUMA build due to
    using gotos in __cache_alloc_node() and moving the __GFP_THISNODE check
    into fallback_alloc().

    Cc: Andy Whitcroft
    Cc: Christoph Hellwig
    Cc: Manfred Spraul
    Acked-by: Christoph Lameter
    Cc: Paul Jackson
    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
  • The PageSlab debug check in kfree_debugcheck() is broken for compound
    pages. It is also redundant as we already do BUG_ON for non-slab pages in
    page_get_cache() and page_get_slab() which are always called before we free
    any actual objects.

    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     

10 Feb, 2007

4 commits

  • This patch adds a utility function install_special_mapping, for creating a
    special vma using a fixed set of preallocated pages as backing, such as for a
    vDSO. This consolidates some nearly identical code used for vDSO mapping
    reimplemented for different architectures.
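
    Usage then looks roughly like this on an architecture's vDSO setup path
    (the flags and names here are illustrative):

      /* map the preallocated vDSO pages at vdso_base in the new mm */
      ret = install_special_mapping(mm, vdso_base, PAGE_SIZE,
                                    VM_READ | VM_EXEC |
                                    VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC,
                                    vdso_pages);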

    Signed-off-by: Roland McGrath
    Cc: Ingo Molnar
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
  • Also split that long line up - people like to send us wordwrapped oom-kill
    traces.

    Cc: Nick Piggin
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • __unmap_hugepage_range() is buggy in that it does not preserve the dirty
    state of the huge_pte when unmapping a hugepage range. This causes data
    corruption when drop_caches is used by the sysadmin. For example, an
    application creates a hugetlb file, modifies pages, then unmaps it.
    While the hugetlb file is still alive, along comes the sysadmin doing an
    "echo 3 > /proc/sys/vm/drop_caches".

    drop_pagecache_sb() will happily free all pages that aren't marked dirty
    if there are no active mappings. Later, when the application maps the
    hugetlb file back, all the data is gone, which is catastrophic for the
    application.

    Not only that, the internal resv_huge_pages count also gets all messed
    up. Fix it up by marking the page dirty appropriately.
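
    The fix amounts to carrying the dirty bit over when the huge pte is torn
    down; sketched here in __unmap_hugepage_range() terms (details assumed):

      page = pte_page(pte);
      /* preserve the dirty state so drop_caches cannot toss the data */
      if (pte_dirty(pte))
          set_page_dirty(page);
      list_add(&page->lru, &page_list);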

    Signed-off-by: Ken Chen
    Cc: "Nish Aravamudan"
    Cc: Adam Litke
    Cc: David Gibson
    Cc: William Lee Irwin III
    Cc:
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
     
  • Remove find_trylock_page as per the removal schedule.

    Signed-off-by: Nick Piggin
    [ Let's see if anybody screams ]
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

01 Feb, 2007

1 commit

  • This reverts commit e80ee884ae0e3794ef2b65a18a767d502ad712ee.

    Pawel Sikora had a boot-time oops due to it - because the sign change
    invalidates the following comparisons, since 'free_pages' can be
    negative.

    The micro-optimization just isn't worth it.

    Bisected-by: Pawel Sikora
    Acked-by: Andrew Morton
    Cc: Nick Piggin
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

31 Jan, 2007

2 commits

  • When expanding the stack, we don't currently check if the VMA will cross
    into an area of the address space that is reserved for hugetlb pages.
    Subsequent faults on the expanded portion of such a VMA will confuse the
    low-level MMU code, resulting in an OOPS. Check for this.
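
    A hedged sketch of such a check in the stack-growth accounting path
    (exact placement and locals assumed):

      /* refuse to let the stack expand into a hugetlb-only address range */
      if (is_hugepage_only_range(vma->vm_mm, new_start, size))
          return -EFAULT;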

    Signed-off-by: Adam Litke
    Cc: David Gibson
    Cc: William Lee Irwin III
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     
  • Nick Piggin points out that page accounting on MIPS multiple ZERO_PAGEs
    is not maintained by its move_pte, and could lead to freeing a ZERO_PAGE.

    Instead of complicating that move_pte, just forget the minor optimization
    when mremapping, and change the one thing which needed it for
    correctness: have filemap_xip use ZERO_PAGE(0) throughout, instead of
    choosing a zero page according to address.

    [ "There is no block device driver one could use for XIP on mips
    platforms" - Carsten Otte ]

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Andrew Morton
    Cc: Ralf Baechle
    Cc: Carsten Otte
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

30 Jan, 2007

1 commit

  • This makes balance_dirty_pages() always base its calculations on the
    amount of non-highmem memory in the machine, rather than trying to base
    them on total memory and then falling back on non-highmem memory if the
    mapping being written to wasn't highmem capable.

    This not only fixes a situation where two different writers can have
    wildly different notions about what is a "balanced" dirty state, but it
    also means that people with highmem machines don't run into an OOM
    situation when regular memory fills up with dirty pages.

    We used to try to handle the latter case by scaling down the dirty_ratio
    if the machine had a lot of highmem pages in page_writeback_init(), but
    it wasn't aggressive enough for some situations, and since basing the
    dirty ratio on highmem memory was broken in the first place, let's just
    stop doing so.

    (A variation of this theme fixed Justin Piszcz's OOM problem when
    copying an 18GB file on a RAID setup).
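
    A hedged sketch of the idea (variable names assumed), sizing both
    thresholds against lowmem only:

      unsigned long available_memory = vm_total_pages - totalhigh_pages;
      long background = (dirty_background_ratio * available_memory) / 100;
      long dirty = (vm_dirty_ratio * available_memory) / 100;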

    Acked-by: Nick Piggin
    Cc: Justin Piszcz
    Cc: Andrew Morton
    Cc: Neil Brown
    Cc: Ingo Molnar
    Cc: Randy Dunlap
    Cc: Christoph Lameter
    Cc: Jens Axboe
    Cc: Peter Zijlstra
    Cc: Adrian Bunk
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

27 Jan, 2007

4 commits

  • NFS can handle the case where invalidate_inode_pages2_range() fails, so the
    premise behind commit 8258d4a574d3a8c01f0ef68aa26b969398a0e140 is now gone.

    Remove the WARN_ON_ONCE() which is causing users grief as we can see from
    http://bugzilla.kernel.org/show_bug.cgi?id=7826

    Signed-off-by: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Trond Myklebust
     
  • This patch fixes core dumps to include the vDSO vma, which is left out now.
    It removes the special-case core writing macros, which were not doing the
    right thing for the vDSO vma anyway. Instead, it uses VM_ALWAYSDUMP in the
    vma; there is no need for the fixmap page to be installed. It handles the
    CONFIG_COMPAT_VDSO case by making elf_core_dump use the fake vma from
    get_gate_vma after real vmas in the same way the /proc/PID/maps code does.

    This changes core dumps so they no longer include the non-PT_LOAD phdrs
    from the vDSO. I made the change to add them in the first place, but it
    turned out that nothing ever wanted them there since the advent of
    NT_AUXV. It's cleaner to leave them out, and just let the phdrs inside
    the vDSO image speak for themselves.
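
    On the dumper side, the vma flag makes the decision trivial; a sketch of
    the check in the ELF core dumper's vma filter:

      /* the vma can be set up to tell us the answer directly */
      if (vma->vm_flags & VM_ALWAYSDUMP)
          return 1;   /* e.g. the vDSO: always include it in the core */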

    Signed-off-by: Roland McGrath
    Cc: Ingo Molnar
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
  • This patch fixes the initialization of gate_vma.vm_flags and
    gate_vma.vm_page_prot to reflect reality. This makes the "[vdso]" line in
    /proc/PID/maps correctly show r-xp instead of ---p, when gate_vma is used
    (CONFIG_COMPAT_VDSO on i386).
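
    A hedged sketch of the initialization being fixed (the exact protection
    constant is an assumption):

      gate_vma.vm_flags = VM_READ | VM_MAYREAD | VM_EXEC | VM_MAYEXEC;
      gate_vma.vm_page_prot = __P101;   /* r-x, so maps shows r-xp not ---p */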

    Signed-off-by: Roland McGrath
    Cc: Ingo Molnar
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
  • It's not pretty, but it appears that ext3 with data=journal will clean
    pages without ever actually telling the VM that they are clean. This,
    in turn, results in the VM (and balance_dirty_pages() in particular)
    never realizing that the pages got cleaned, and waiting forever for an
    event that already happened.

    Technically, this seems to be a problem with ext3 itself, but it used to
    be hidden by 'try_to_free_buffers()' noticing this situation on its own,
    and just working around the filesystem problem.

    This commit re-instates that hack, in order to avoid a regression for
    the 2.6.20 release. This fixes bugzilla 7844:

    http://bugzilla.kernel.org/show_bug.cgi?id=7844

    Peter Zijlstra points out that we should probably retain the debugging
    code that this removes from cancel_dirty_page(), and I agree, but for
    the imminent release we might as well just silence the warning too
    (since it's not a new bug: anything that triggers that warning has been
    around forever).

    Acked-by: Randy Dunlap
    Acked-by: Jens Axboe
    Acked-by: Peter Zijlstra
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

23 Jan, 2007

1 commit

  • Currently one can specify an arbitrary node mask to mbind that includes
    nodes not allowed. If that is done with an interleave policy then we will
    go around all the nodes. Those outside of the currently allowed cpuset
    will be redirected to the border nodes. Interleave will then create
    imbalances at the borders of the cpuset.

    This patch restricts the nodes to the currently allowed cpuset.

    The RFC for this patch was discussed at
    http://marc.theaimsgroup.com/?t=116793842100004&r=1&w=2
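
    A hedged illustration (not the literal patch) of restricting the
    user-supplied mask before the interleave policy is installed:

      /* clamp the requested nodes to what the cpuset currently allows */
      nodes_and(nodes, nodes, cpuset_current_mems_allowed);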

    Signed-off-by: Christoph Lameter
    Cc: Paul Jackson
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

13 Jan, 2007

1 commit

  • Currently we issue a bounce trace whenever __blk_queue_bounce() is
    called, but that merely means that the device has a lower dma mask than
    the higher pages in the system; the bio itself may still contain only
    lower pages. So move the bounce trace to the point inside
    __blk_queue_bounce() where we know there will actually be page bouncing.
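
    Roughly, the trace point moves next to the code that decides a page
    really needs bouncing (a sketch; the surrounding loop is assumed):

      if (page_to_pfn(page) <= q->bounce_pfn)
          continue;                     /* this segment is fine as-is */

      /* first segment that actually needs bouncing: emit the trace once */
      if (!bio)
          blk_add_trace_bio(q, *bio_orig, BLK_TA_BOUNCE);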

    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

12 Jan, 2007

2 commits

  • NFS: Fix race in nfs_release_page()

    invalidate_inode_pages2() may find the dirty bit has been set on a page
    owing to the fact that the page may still be mapped after it was locked.
    Only after the call to unmap_mapping_range() are we sure that the page
    can no longer be dirtied.
    In order to fix this, NFS has hooked the releasepage() method and tries
    to write the page out between the call to unmap_mapping_range() and the
    call to remove_mapping(). This, however, leads to deadlocks in the page
    reclaim code, where the page may be locked without holding a reference
    to the inode or dentry.

    Fix is to add a new address_space_operation, launder_page(), which will
    attempt to write out a dirty page without releasing the page lock.
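
    The invalidation path can then launder a still-locked dirty page before
    trying to remove it; a sketch of such a helper (name and exact guards
    assumed):

      static int do_launder_page(struct address_space *mapping,
                                 struct page *page)
      {
          if (!PageDirty(page))
              return 0;
          if (page->mapping != mapping ||
              mapping->a_ops->launder_page == NULL)
              return 0;
          /* write the page out while still holding the page lock */
          return mapping->a_ops->launder_page(page);
      }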

    Signed-off-by: Trond Myklebust

    Also, the bare SetPageDirty() can skew all sorts of accounting, leading
    to other nasties.

    [akpm@osdl.org: cleanup]
    Signed-off-by: Peter Zijlstra
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Trond Myklebust
     
  • Fix an oops experienced on the Cell architecture when init-time
    functions, early_*(), are called at runtime. The patch alters the call
    paths to make sure that the callers explicitly say whether the call is
    being made on behalf of a hotplug event or is happening at boot time.

    It has been compile tested on ppc64, ia64, s390, i386 and x86_64.
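
    A hedged sketch of how callers can state their context explicitly (names
    assumed); functions such as memmap_init_zone() then take the context as
    an extra argument instead of guessing:

      enum memmap_context {
          MEMMAP_EARLY,     /* called while booting */
          MEMMAP_HOTPLUG,   /* called on behalf of memory hotplug */
      };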

    Acked-by: Arnd Bergmann
    Signed-off-by: Dave Hansen
    Cc: Yasunori Goto
    Acked-by: Andy Whitcroft
    Cc: Christoph Lameter
    Cc: Martin Schwidefsky
    Acked-by: Heiko Carstens
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

09 Jan, 2007

2 commits

  • * master.kernel.org:/home/rmk/linux-2.6-arm:
    [ARM] Provide basic printk_clock() implementation
    [ARM] Resolve fuse and direct-IO failures due to missing cache flushes
    [ARM] pass vma for flush_anon_page()
    [ARM] Fix potential MMCI bug
    [ARM] Fix kernel-mode undefined instruction aborts
    [ARM] 4082/1: iop3xx: fix iop33x gpio register offset
    [ARM] 4070/1: arch/arm/kernel: fix warnings from missing includes
    [ARM] 4079/1: iop: Update MAINTAINERS

    Linus Torvalds
     
  • Since get_user_pages() may be used with processes other than the
    current process and calls flush_anon_page(), flush_anon_page() has to
    cope in some way with non-current processes.

    It may not be appropriate, or even desirable to flush a region of
    virtual memory cache in the current process when that is different to
    the process that we want the flush to occur for.

    Therefore, pass the vma into flush_anon_page() so that the architecture
    can work out whether the 'vmaddr' is for the current process or not.
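
    The resulting interface, roughly:

      /* the vma tells the arch whose address space 'vmaddr' belongs to */
      void flush_anon_page(struct vm_area_struct *vma,
                           struct page *page, unsigned long vmaddr);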

    Signed-off-by: Russell King

    Russell King
     

06 Jan, 2007

2 commits

  • At the end of shrink_all_memory() we forget to recalculate lru_pages: it can
    be zero.

    Fix that up, and add a helper function for this operation too.

    Also, recalculate lru_pages each time around the inner loop to get the
    balancing correct.
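
    A hedged sketch of such a helper (name assumed), summing the per-zone LRU
    sizes:

      static unsigned long count_lru_pages(void)
      {
          struct zone *zone;
          unsigned long ret = 0;

          for_each_zone(zone)
              ret += zone->nr_active + zone->nr_inactive;
          return ret;
      }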

    Cc: "Rafael J. Wysocki"
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • These days, if you swapoff when there isn't enough memory, the OOM killer
    gives "BUG: scheduling while atomic" and the machine hangs: badness()
    needs to do its PF_SWAPOFF return after the task_unlock (tasklist_lock
    is also held here, so p isn't going to be freed: PF_SWAPOFF might get
    turned off at any moment, but that doesn't really matter).
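
    A sketch of the reordering in badness() (the "kill this first" return
    value follows the OOM killer's convention; details assumed):

      task_unlock(p);

      /* only now, outside task_lock, is it safe to take the early return */
      if (p->flags & PF_SWAPOFF)
          return ULONG_MAX;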

    Signed-off-by: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins