23 Sep, 2006

1 commit

  • The grow algorithm is simple; we grow if:

    1) we see a hash chain collision at insert, and
    2) we haven't hit the hash size limit (currently 1*1024*1024 slots), and
    3) the number of xfrm_state objects is > the current hash mask

    All of this needs some tweaking.

    Remove __initdata from "hashdist" so we can use it safely at run time.
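
    A minimal sketch of that decision, with illustrative identifiers (the
    exact names in net/xfrm differ):

      #include <linux/workqueue.h>

      /* Illustrative globals; hash_work is set up elsewhere with
       * INIT_WORK() and performs the actual resize. */
      static unsigned int state_hmask;                /* current mask (table size - 1) */
      static unsigned int state_hashmax = 1 * 1024 * 1024;
      static unsigned int state_count;                /* number of xfrm_state objects  */
      static struct work_struct hash_work;

      static void hash_grow_check(int have_hash_collision)
      {
              if (have_hash_collision &&                      /* 1) collision at insert  */
                  (state_hmask + 1) < state_hashmax &&        /* 2) below the size limit */
                  state_count > state_hmask)                  /* 3) objects > hash mask  */
                      schedule_work(&hash_work);
      }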

    Signed-off-by: David S. Miller

    David S. Miller
     

04 Jul, 2006

1 commit

  • It turns out that it is advantageous to leave a small portion of unmapped
    file-backed pages if all (or almost all) of a zone's pages are allocated
    and the page allocator therefore has to go off-node.

    This allows recently used file I/O buffers to stay on the node and
    reduces the times that zone reclaim is invoked if file I/O occurs
    when we run out of memory in a zone.

    The problem is that zone reclaim runs too frequently when the page cache is
    used for file I/O alone (reads and writes, and therefore unmapped pages!)
    and almost all pages of the zone are allocated. Zone reclaim may remove 32
    unmapped pages, file I/O will then use these pages for the next read/write
    requests, and the number of unmapped pages grows again. After the zone has
    filled up again, zone reclaim runs again after only 32 pages. This cycle is
    too inefficient and there are potentially too many zone reclaim cycles.

    With the 1% boundary we may still remove all unmapped file I/O pages in a
    zone reclaim pass. However, it will take a large number of reads and writes
    to get back above 1%, which is where zone reclaim triggers again.

    Zone reclaim in 2.6.16/17 does not show this behavior because it has a 30
    second timeout.
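
    A sketch of the check this boundary implies in the zone reclaim path; the
    threshold field name (min_unmapped_pages, roughly 1% of the zone) is
    illustrative:

      #include <linux/mmzone.h>
      #include <linux/vmstat.h>

      /* Only invoke zone reclaim once the unmapped file-backed pages in the
       * zone exceed the configured minimum, so a small pool of file I/O
       * buffers always stays on the node. */
      static int zone_reclaim_worthwhile(struct zone *zone)
      {
              unsigned long unmapped = zone_page_state(zone, NR_FILE_PAGES) -
                                       zone_page_state(zone, NR_FILE_MAPPED);

              return unmapped > zone->min_unmapped_pages;
      }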

    [akpm@osdl.org: rename the /proc file and the variable]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

01 Jul, 2006

13 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial:
    Remove obsolete #include
    remove obsolete swsusp_encrypt
    arch/arm26/Kconfig typos
    Documentation/IPMI typos
    Kconfig: Typos in net/sched/Kconfig
    v9fs: do not include linux/version.h
    Documentation/DocBook/mtdnand.tmpl: typo fixes
    typo fixes: specfic -> specific
    typo fixes in Documentation/networking/pktgen.txt
    typo fixes: occuring -> occurring
    typo fixes: infomation -> information
    typo fixes: disadvantadge -> disadvantage
    typo fixes: aquire -> acquire
    typo fixes: mecanism -> mechanism
    typo fixes: bandwith -> bandwidth
    fix a typo in the RTC_CLASS help text
    smb is no longer maintained

    Manually merged trivial conflict in arch/um/kernel/vmlinux.lds.S

    Linus Torvalds
     
  • The remaining counters in page_state after the zoned VM counter patches
    have been applied are all just for show in /proc/vmstat. They have no
    essential function for the VM.

    We use a simple increment of per cpu variables. In order to avoid the most
    severe races we disable preemption. Disabling preemption does not prevent
    the race between an increment and an interrupt handler incrementing the
    same statistics counter. However, that race is exceedingly rare; we may
    lose an increment or so, and there is no requirement (at least not in the
    kernel) that the vm event counters be accurate.

    In the non-preempt case this results in a simple increment for each
    counter. For many architectures this will be reduced by the compiler to a
    single instruction. That single instruction is atomic on i386 and x86_64,
    so even the rare race with an interrupt is avoided on both architectures
    in most cases.

    The patchset also adds an off switch for embedded systems that allows
    building Linux kernels without these counters.

    The counters are implemented as inline code that hopefully results in a
    single increment instruction being emitted (i386, x86_64), or in the
    increment being hidden through instruction-level concurrency (EPIC
    architectures such as ia64 can get that done).

    Benefits:
    - VM event counter operations usually reduce to a single inline instruction
    on i386 and x86_64.
    - No interrupt disable, only preempt disable for the preempt case.
    Preempt disable can also be avoided by moving the counter into a spinlock.
    - Handling is similar to zoned VM counters.
    - Simple and easily extendable.
    - Can be omitted to reduce memory use on embedded systems.
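
    A sketch of the shape described above; the item names and struct are
    illustrative, not the exact definitions:

      #include <linux/percpu.h>

      enum vm_event_item { PGFAULT, PGMAJFAULT, PGFREE, NR_VM_EVENT_ITEMS };

      struct vm_event_state {
              unsigned long event[NR_VM_EVENT_ITEMS];
      };

      static DEFINE_PER_CPU(struct vm_event_state, vm_event_states);

      /* Disable preemption only; a racing interrupt may cost the odd lost
       * increment, which is acceptable for these counters. */
      static inline void count_vm_event(enum vm_event_item item)
      {
              get_cpu_var(vm_event_states).event[item]++;
              put_cpu_var(vm_event_states);
      }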

    References:

    RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=113512330605497&w=2
    RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=114988082814934&w=2
    local_t http://marc.theaimsgroup.com/?l=linux-kernel&m=114991748606690&w=2
    V2 http://marc.theaimsgroup.com/?t=115014808400007&r=1&w=2
    V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767022346&w=2
    V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115047968808926&w=2

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The NUMA statistics are really event counters, but they are per node, so
    we have had special treatment for them through additional fields in the
    pcp structure. We can now use the per zone nature of the zoned VM counters
    to implement them.

    This will shrink the size of the pcp structure on NUMA systems. We will
    have some room to add additional per zone counters that will all still fit
    in the same cacheline.

    Bits  Prior pcp size         Size after patch      We can add
    ------------------------------------------------------------------
    64    128 bytes (16 words)   80 bytes (10 words)   48
    32    76 bytes (19 words)    56 bytes (14 words)   8  (64 byte cacheline)
                                                       72 (128 byte cacheline)

    Remove the special statistics for numa and replace them with zoned vm
    counters. This has the side effect that global sums of these events now
    show up in /proc/vmstat.

    Also take the opportunity to move the zone_statistics() function from
    page_alloc.c into vmstat.c.
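
    A sketch of how the NUMA accounting can be expressed on top of the zoned
    counters; the item names (NUMA_HIT, NUMA_MISS, NUMA_FOREIGN) follow the
    patch, the surrounding function is illustrative:

      #include <linux/mmzone.h>
      #include <linux/vmstat.h>

      #ifdef CONFIG_NUMA
      /* 'preferred' is the first zone of the zonelist, 'z' the zone the
       * page was actually taken from. */
      static void account_numa_alloc(struct zone *preferred, struct zone *z)
      {
              if (z->zone_pgdat == preferred->zone_pgdat) {
                      __inc_zone_state(z, NUMA_HIT);
              } else {
                      __inc_zone_state(z, NUMA_MISS);
                      __inc_zone_state(preferred, NUMA_FOREIGN);
              }
      }
      #endif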

    Discussions:
    V2 http://marc.theaimsgroup.com/?t=115048227000002&r=1&w=2

    Signed-off-by: Christoph Lameter
    Acked-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Conversion of nr_unstable to a per zone counter

    We need to do some special modifications to the nfs code since there are
    multiple cases of disposition and we need to have a page ref for proper
    accounting.

    This converts the last critical page state of the VM, and therefore we
    need to remove several functions that depended on GET_PAGE_STATE_LAST in
    order to make the kernel compile again. We are left with only event-type
    counters in page state.

    [akpm@osdl.org: bugfixes]
    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Conversion of nr_writeback to per zone counter.

    This removes the last page_state counter from arch/i386/mm/pgtable.c so we
    drop the page_state from there.

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This makes nr_dirty a per zone counter. Looping over all processors is
    avoided during writeback state determination.

    The counter aggregation for nr_dirty had to be undone in the NFS layer since
    we summed up the page counts from multiple zones. Someone more familiar with
    NFS should probably review what I have done.

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Conversion of nr_page_table_pages to a per zone counter

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • - Allows reclaim to access the counter without looping over per-processor
    counts.

    - Allows accurate statistics on how many pages are used in a zone by the
    slab allocator. This may become useful for balancing slab allocations over
    various zones.

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Currently a single atomic variable is used to establish the size of the page
    cache in the whole machine. The zoned VM counters have the same method of
    implementation as the nr_pagecache code but also allow the determination of
    the pagecache size per zone.

    Remove the special implementation for nr_pagecache and make it a zoned counter
    named NR_FILE_PAGES.

    Updates of the page cache counters are always performed with interrupts off.
    We can therefore use the __ variant here.
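
    A sketch of the call pattern this enables; NR_FILE_PAGES is the new
    counter, the surrounding functions are illustrative call sites:

      #include <linux/mm.h>
      #include <linux/vmstat.h>

      /* The page cache add/remove paths already hold the mapping's tree_lock
       * with interrupts disabled, so the cheaper __ variants can be used. */
      static void account_cache_add(struct page *page)
      {
              __inc_zone_page_state(page, NR_FILE_PAGES);
      }

      static void account_cache_remove(struct page *page)
      {
              __dec_zone_page_state(page, NR_FILE_PAGES);
      }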

    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • nr_mapped is important because it allows a determination of how many pages of
    a zone are not mapped, which would allow a more efficient means of determining
    when we need to reclaim memory in a zone.

    We take the nr_mapped field out of the page state structure and define a new
    per zone counter named NR_FILE_MAPPED (the anonymous pages will be split off
    from NR_MAPPED in the next patch).

    We replace the use of nr_mapped in various kernel locations. This avoids
    looping over all processors in try_to_free_pages(), writeback and reclaim
    (swap + zone reclaim).

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Per zone counter infrastructure

    The counters that we currently have for the VM are split per processor.
    The processor, however, has little to do with the zone these pages belong
    to. We cannot tell, for example, how many ZONE_DMA pages are dirty.

    So we are blind to potential imbalances in the usage of memory in various
    zones. In a NUMA system, for example, we cannot tell how many pages are
    dirty on a particular node. If we knew, then we could put measures into
    the VM to balance the use of memory between different zones and different
    nodes in a NUMA system. For example, it would be possible to limit the
    dirty pages per node so that fast local memory is kept available even if a
    process is dirtying huge amounts of pages.

    Another example is zone reclaim. We do not know how many unmapped pages exist
    per zone. So we just have to try to reclaim. If it is not working then we
    pause and try again later. It would be better if we knew when it makes sense
    to reclaim unmapped pages from a zone. This patchset allows the determination
    of the number of unmapped pages per zone. We can remove the zone reclaim
    interval with the counters introduced here.

    Furthermore, the ability to have various usage statistics available will
    allow the development of new NUMA balancing algorithms that may improve
    the scheduler's decisions about when to move a process to another node,
    and will hopefully also enable automatic page migration through a
    user-space program that can analyse the memory load distribution and then
    rebalance memory use in order to increase performance.

    The counter framework here implements differential counters for each
    processor in struct zone. The differential counters are consolidated when
    a threshold is exceeded (as done in the current implementation for
    nr_pagecache), when slab reaping occurs or when a consolidation function
    is called.

    Consolidation uses atomic operations and accumulates counters per zone in the
    zone structure and also globally in the vm_stat array. VM functions can
    access the counts by simply indexing a global or zone specific array.

    The arrangement of counters in an array also simplifies processing when output
    has to be generated for /proc/*.

    Counters can be updated by calling inc/dec_zone_page_state or
    __inc/__dec_zone_page_state, analogous to *_page_state. The second group
    of functions can be called if it is known that interrupts are disabled.

    Special optimized increment and decrement functions are provided. These can
    avoid certain checks and use increment or decrement instructions that an
    architecture may provide.
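
    A sketch of the differential update described above; names and the
    threshold value approximate the patch rather than quote it:

      #include <linux/mmzone.h>
      #include <linux/smp.h>
      #include <linux/vmstat.h>

      #define STAT_THRESHOLD 32

      /* Called with interrupts (or at least preemption) off.  The small per
       * cpu delta is folded into the zone and global atomics only when it
       * crosses the threshold. */
      static void mod_zone_state(struct zone *zone, enum zone_stat_item item,
                                 int delta)
      {
              s8 *diff = &zone_pcp(zone, smp_processor_id())->vm_stat_diff[item];
              long x = *diff + delta;

              if (x > STAT_THRESHOLD || x < -STAT_THRESHOLD) {
                      atomic_long_add(x, &zone->vm_stat[item]);   /* per zone total   */
                      atomic_long_add(x, &vm_stat[item]);         /* machine-wide sum */
                      x = 0;
              }
              *diff = x;
      }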

    We also add a new CONFIG_DMA_IS_NORMAL that signifies that an architecture
    can do DMA to all memory and therefore ZONE_NORMAL will not be populated.
    This is currently set only for IA64 SGI SN2 and only affects
    node_page_state(). In the best case node_page_state() can be reduced to
    retrieving a single counter for the one zone on the node.

    [akpm@osdl.org: cleanups]
    [akpm@osdl.org: export vm_stat[] for filesystems]
    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • NOTE: ZVCs are *not* the lightweight event counters. ZVCs are reliable,
    whereas event counters do not need to be.

    Zone based VM statistics are necessary to be able to determine what the state
    of memory in one zone is. In a NUMA system this can be helpful for local
    reclaim and other memory optimizations that may be able to shift VM load in
    order to get more balanced memory use.

    It is also useful to know how the computing load affects the memory
    allocations on various zones. This patchset allows the retrieval of that data
    from userspace.

    The patchset introduces a framework for counters that is a cross between
    the existing page_state counters (which are simply global counters split
    per cpu) and the approach of deferred incremental updates implemented for
    nr_pagecache.

    Small per cpu 8 bit counters are added to struct zone. If a counter
    exceeds certain thresholds then the counts are folded into an array of
    atomic_long in the zone and into a global array that sums up all zone
    values. The small 8 bit counters sit next to the per cpu page pointers and
    so they are likely to be hot in the cpu cache when pages are allocated and
    freed.

    Access to VM counter information for a zone and for the whole machine is
    then possible by simply indexing an array (thanks to Nick Piggin for
    pointing out that approach). Access to the total number of pages of
    various types no longer requires summing up all per cpu counters.
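
    A sketch of the read side; the accessors are the ones the patchset adds,
    with NR_FILE_MAPPED as an example item from the series:

      #include <linux/vmstat.h>

      /* Totals are plain array reads, no per cpu summing. */
      static unsigned long mapped_in_zone(struct zone *zone)
      {
              return zone_page_state(zone, NR_FILE_MAPPED);
      }

      static unsigned long mapped_total(void)
      {
              return global_page_state(NR_FILE_MAPPED);
      }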

    Benefits of this patchset right now:

    - Ability for UP and SMP configurations to determine how memory
    is balanced between the DMA, NORMAL and HIGHMEM zones.

    - Loops over all processors are avoided in writeback and
    reclaim paths. We can avoid caching the writeback information
    because the needed information is directly accessible.

    - Special handling for nr_pagecache removed.

    - zone_reclaim_interval vanishes since VM stats can now determine
    when it is worthwhile to do local reclaim.

    - Fast inline per node page state determination.

    - Accurate counters in /sys/devices/system/node/node*/meminfo. The current
    counters simply count which processor allocated a page somewhere and
    guesstimate based on that, so they were not useful for showing the actual
    distribution of page use in a specific zone.

    - The swap_prefetch patch requires per node statistics in order to
    figure out when processors of a node can prefetch. This patch provides
    some of the needed numbers.

    - Detailed VM counters available in more /proc and /sys status files.

    References to earlier discussions:
    V1 http://marc.theaimsgroup.com/?l=linux-kernel&m=113511649910826&w=2
    V2 http://marc.theaimsgroup.com/?l=linux-kernel&m=114980851924230&w=2
    V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115014697910351&w=2
    V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767318740&w=2

    Performance tests with AIM7 did not show any regressions. Seems to be a tad
    faster even. Tested on ia64/NUMA. Builds fine on i386, SMP / UP. Includes
    fixes for s390/arm/uml arch code.

    This patch:

    Move counter code from page_alloc.c/page-flags.h to vmstat.c/h.

    Create vmstat.c/vmstat.h by separating the counter code and the proc
    functions.

    Move the vm_stat_text array before zoneinfo_show.

    [akpm@osdl.org: s390 build fix]
    [akpm@osdl.org: HOTPLUG_CPU build fix]
    Signed-off-by: Christoph Lameter
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Signed-off-by: Jörn Engel
    Signed-off-by: Adrian Bunk

    Jörn Engel
     

28 Jun, 2006

3 commits

  • Add debug_check_no_locks_freed(), as a central inline to add
    bad-lock-free-debugging functionality to.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Mark notifier_blocks associated with cpu_notifier as __cpuinitdata.

    __cpuinitdata makes sure that the data is init time only unless
    CONFIG_HOTPLUG_CPU is defined.
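
    A sketch of the annotation; the callback and notifier names are
    illustrative:

      #include <linux/cpu.h>
      #include <linux/init.h>
      #include <linux/notifier.h>

      static int __cpuinit example_cpu_callback(struct notifier_block *nb,
                                                unsigned long action, void *hcpu)
      {
              return NOTIFY_OK;
      }

      /* Discarded after init unless CONFIG_HOTPLUG_CPU is defined. */
      static struct notifier_block example_cpu_notifier __cpuinitdata = {
              .notifier_call = example_cpu_callback,
      };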

    Signed-off-by: Chandra Seetharaman
    Cc: Ashok Raj
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chandra Seetharaman
     
  • In 2.6.17, there was a problem with cpu_notifiers and XFS. I provided a
    band-aid solution to solve that problem. In the process, I undid all the
    changes you both were making to ensure that these notifiers were available
    only at init time (unless CONFIG_HOTPLUG_CPU is defined).

    We deferred the real fix to 2.6.18. Here is a set of patches that fixes the
    XFS problem cleanly and makes the cpu notifiers available only at init time
    (unless CONFIG_HOTPLUG_CPU is defined).

    If CONFIG_HOTPLUG_CPU is defined then cpu notifiers are available at run
    time.

    This patch reverts the notifier_call changes made in 2.6.17.

    Signed-off-by: Chandra Seetharaman
    Cc: Ashok Raj
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chandra Seetharaman
     

27 Jun, 2006

1 commit

  • acquired (aquired)
    contiguous (contigious)
    successful (succesful, succesfull)
    surprise (suprise)
    whether (weather)
    some other misspellings

    Signed-off-by: Andreas Mohr
    Signed-off-by: Adrian Bunk

    Andreas Mohr
     

26 Jun, 2006

1 commit


23 Jun, 2006

9 commits

  • This patch fixes printk() under zone->lock in show_free_areas(). It can
    be unsafe to call printk() under this lock, since the caller can try to
    allocate/free some memory and self-deadlock on this lock. I found such
    allocations/freeing of memory in both netconsole and the serial console.

    This issue was hit in reality when meminfo was periodically printed for
    debugging purposes and netconsole was in use.
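
    A sketch of the pattern the fix uses: snapshot the free lists under
    zone->lock, then printk() only after the lock has been dropped:

      #include <linux/kernel.h>
      #include <linux/mm.h>
      #include <linux/mmzone.h>

      static void show_zone_free(struct zone *zone)
      {
              unsigned long flags, total = 0, nr[MAX_ORDER];
              int order;

              spin_lock_irqsave(&zone->lock, flags);
              for (order = 0; order < MAX_ORDER; order++) {
                      nr[order] = zone->free_area[order].nr_free;
                      total += nr[order] << order;
              }
              spin_unlock_irqrestore(&zone->lock, flags);

              /* a console driver may allocate or free memory here without
               * recursing into zone->lock */
              for (order = 0; order < MAX_ORDER; order++)
                      printk("%lu*%lukB ", nr[order], (PAGE_SIZE / 1024) << order);
              printk("= %lukB\n", total * (PAGE_SIZE / 1024));
      }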

    Signed-off-by: Kirill Korotaev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev
     
  • Initialise total_memory earlier in boot: if for some reason we run page
    reclaim early in boot, we don't want total_memory to be zero when we use
    it as a divisor.

    And rename total_memory to vm_total_pages to avoid naming clashes with
    architectures.

    Cc: Yasunori Goto
    Cc: KAMEZAWA Hiroyuki
    Cc: Martin Bligh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • We have architectures where the sizes of page_to_pfn and pfn_to_page are
    significant enough to the overall image size that they wish to push them
    out of line. However, in the process we have grown a second copy of the
    implementation of each of these routines for each memory model. Share the
    implementation, exposing it either inline or out of line as required.

    Signed-off-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • In the current code, the zonelists are considered to be built once, with
    no later modification. But memory hotplug can add a new zone/pgdat, so
    they must be updated.

    This patch modifies build_all_zonelists() so that it can reconfigure a
    pgdat's zonelists.

    To update them safely, this patch uses stop_machine_run(), so that no
    other cpu touches the zonelists while they are being updated.

    In an old version (V2 of node hotadd), the kernel updated them right after
    zone initialization. But present_pages of the new zone was still 0 at that
    point, because online_page() had not been called yet, and
    build_zonelists() checks present_pages to find populated zones. That was
    too early, so I changed it to run after online_pages().
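
    A sketch of the update path; __rebuild_zonelists() stands in for the
    per-pgdat rebuild loop:

      #include <linux/stop_machine.h>

      static int __rebuild_zonelists(void *dummy)
      {
              /* ... call build_zonelists(pgdat) for each online pgdat ... */
              return 0;
      }

      /* stop_machine_run() keeps every other cpu quiescent, so nobody walks
       * a zonelist while it is being rewritten. */
      void rebuild_all_zonelists(void)
      {
              stop_machine_run(__rebuild_zonelists, NULL, NR_CPUS);
      }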

    Signed-off-by: Yasunori Goto
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     
  • The wait_table is sized according to the zone size at boot time. But we
    cannot know the maximum zone size when memory hotplug is enabled; it can
    change, and resizing the wait_table is hard.

    So the kernel allocates and initializes the wait_table at its maximum
    size.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     
  • …for init_current_empty_zone

    When add_zone() is called against an empty (not populated) zone, we have
    to initialize a zone which was not initialized at boot time. But
    init_currently_empty_zone() may fail due to the allocation of the wait
    table, so this patch catches its error code.

    The changes to the wait_table itself are in the next patch.

    Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>

    Yasunori Goto
     
  • Change definitions of some functions and data from __init to __meminit.

    With this patch, these functions and data can be used after bootup, by
    the hot-add code.
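
    A sketch of the annotation; the function name is illustrative:

      #include <linux/init.h>
      #include <linux/mmzone.h>

      /* __meminit code is kept when CONFIG_MEMORY_HOTPLUG is set so it can
       * run again at hot-add time; otherwise it is discarded after boot,
       * just like __init. */
      static void __meminit example_zone_setup(struct zone *zone)
      {
              /* ... initialization also needed for hot-added memory ... */
      }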

    Signed-off-by: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     
  • This just renames wait_table_size() to wait_table_hash_nr_entries().

    Signed-off-by: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     
  • The buddy allocator has a requirement that boundaries between contiguous
    zones occur aligned with the MAX_ORDER ranges. Where they do not, we will
    incorrectly merge pages across zone boundaries. This can lead to pages
    from the wrong zone being handed out.

    Originally the buddy allocator would check that buddies were in the same
    zone by referencing the zone start and end page frame numbers. This was
    removed as it became very expensive and the buddy allocator already made
    the assumption that zones boundaries were aligned.

    It is clear that not all configurations and architectures are honouring
    this alignment requirement. Therefore it seems safest to reintroduce
    support for non-aligned zone boundaries. This patch introduces a new
    check: when considering a page as a buddy, we compare the zone_table
    index of the two pages and refuse to merge them where the indices do not
    match. The zone_table index is unique for each node/zone combination when
    FLATMEM/DISCONTIGMEM is enabled, and for each section/zone combination
    when SPARSEMEM is enabled (a SPARSEMEM section is at least MAX_ORDER in
    size).
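
    A sketch of the extra check; page_zone_id() stands for the helper that
    extracts the zone_table index from page->flags:

      #include <linux/mm.h>

      /* Two pages may only be merged as buddies if they carry the same
       * zone_table index, i.e. the same node/zone (or section/zone)
       * combination. */
      static inline int buddies_in_same_zone(struct page *page, struct page *buddy)
      {
              return page_zone_id(page) == page_zone_id(buddy);
      }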

    Signed-off-by: Andy Whitcroft
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Yasunori Goto
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     

22 May, 2006

2 commits

  • Andy added code to the buddy allocator which does not require the zone's
    endpoints to be aligned to MAX_ORDER. An issue is that the buddy allocator
    requires the node_mem_map's endpoints to be MAX_ORDER aligned; otherwise
    __page_find_buddy could compute a buddy not in node_mem_map for partial
    MAX_ORDER regions at the zone's endpoints. page_is_buddy will detect that
    these pages at the endpoints are not PG_buddy (they were zeroed out by the
    bootmem allocator and are not part of the zone). Of course the negative
    here is that we could waste a little memory, but the positive is
    eliminating all the old checks for zone boundary conditions.

    SPARSEMEM won't encounter this issue because of the MAX_ORDER size
    constraint when SPARSEMEM is configured. ia64 VIRTUAL_MEM_MAP doesn't need
    the logic either, because holes and endpoints are handled differently.
    This leaves checking alloc_remap and other arches which privately allocate
    node_mem_map.
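
    For reference, the buddy of a page is found by flipping a single bit of
    its index within the map, which is why a partial MAX_ORDER block at an
    unaligned endpoint can yield a buddy outside node_mem_map:

      /* page_idx is the page's offset from the MAX_ORDER-aligned base of the
       * map; the buddy at 'order' is the block that differs only in bit
       * 'order'. */
      static inline unsigned long buddy_index(unsigned long page_idx,
                                              unsigned int order)
      {
              return page_idx ^ (1UL << order);
      }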

    Signed-off-by: Bob Picco
    Acked-by: Mel Gorman
    Cc: Dave Hansen
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Picco
     
  • Fix a couple of infrequently encountered 'sleeping function called from
    invalid context' warnings in the cpuset hooks in __alloc_pages, which
    could sleep while interrupts are disabled.

    The routine cpuset_zone_allowed() is called by code in mm/page_alloc.c
    __alloc_pages() to determine if a zone is allowed in the current task's
    cpuset. This routine can sleep, for certain GFP_KERNEL allocations, if the
    zone is on a memory node not allowed in the current cpuset but possibly
    allowed in a parent cpuset.

    But we can't sleep in __alloc_pages() if in interrupt, nor if called for a
    GFP_ATOMIC request (__GFP_WAIT not set in gfp_flags).

    The rule was intended to be:
    Don't call cpuset_zone_allowed() if you can't sleep, unless you
    pass in the __GFP_HARDWALL flag set in gfp_flag, which disables
    the code that might scan up ancestor cpusets and sleep.
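
    A sketch of the rule at a hypothetical call site:

      #include <linux/cpuset.h>
      #include <linux/gfp.h>

      /* If we may not sleep, force __GFP_HARDWALL so cpuset_zone_allowed()
       * skips the ancestor-cpuset scan that can block. */
      static int zone_allowed_here(struct zone *zone, gfp_t gfp_mask)
      {
              if (!(gfp_mask & __GFP_WAIT))
                      gfp_mask |= __GFP_HARDWALL;
              return cpuset_zone_allowed(zone, gfp_mask);
      }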

    This rule was being violated in a couple of places, due to a bogus change
    made (by myself, pj) to __alloc_pages() as part of the November 2005
    effort to clean up its logic, and also due to a later fix to constrain
    which swap daemons were awoken.

    The bogus change can be seen at:
    http://linux.derkeiler.com/Mailing-Lists/Kernel/2005-11/4691.html
    [PATCH 01/05] mm fix __alloc_pages cpuset ALLOC_* flags

    This was first noticed on a tight memory system, in code that was disabling
    interrupts and doing allocation requests with __GFP_WAIT not set, which
    resulted in __might_sleep() writing complaints to the log "Debug: sleeping
    function called ...", when the code in cpuset_zone_allowed() tried to take
    the callback_sem cpuset semaphore.

    We haven't seen a system hang on this 'might_sleep' yet, but we are at
    decent risk of seeing it fairly soon, especially since the additional
    cpuset_zone_allowed() check was added, conditioning wakeup_kswapd(), in
    March 2006.

    Special thanks to Dave Chinner, for figuring this out, and a tip of the hat
    to Nick Piggin who warned me of this back in Nov 2005, before I was ready
    to listen.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

16 May, 2006

1 commit

  • As pointed out in http://bugzilla.kernel.org/show_bug.cgi?id=6490, this
    function can experience overflows on 32-bit machines, causing our response to
    changed values of min_free_kbytes to go whacky.

    Fixing it efficiently is all too hard, so fix it with 64-bit math instead.
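
    A sketch of the overflow-safe form, with illustrative parameter names:

      #include <linux/types.h>
      #include <asm/div64.h>

      /* Do the multiply in 64 bits and divide with do_div() so 32-bit
       * machines cannot wrap when zones are large. */
      static unsigned long zone_pages_min_share(unsigned long pages_min,
                                                unsigned long zone_pages,
                                                unsigned long lowmem_pages)
      {
              u64 tmp = (u64)pages_min * zone_pages;

              do_div(tmp, lowmem_pages);
              return (unsigned long)tmp;
      }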

    Cc: Ake Sandgren
    Cc: Martin Bligh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

26 Apr, 2006

1 commit


20 Apr, 2006

1 commit


11 Apr, 2006

2 commits

  • These patches are an enhancement of OVERCOMMIT_GUESS algorithm in
    __vm_enough_memory().

    - why the kernel needed patching

    When the kernel can't allocate anonymous pages in practice, the current
    OVERCOMMIT_GUESS could still return success. This implementation might be
    the cause of OOM kills in memory-pressure situations.

    If Linux runs with page reservation features like
    /proc/sys/vm/lowmem_reserve_ratio and without a swap region, I think OOM
    kills occur easily.

    - the overall design approach in the patch

    When the OVERCOMMIT_GUESS algorithm calculates the number of free pages,
    the reserved free pages are regarded as non-free.

    This change helps avoid the pitfall that the number of free pages drops
    below the number which the kernel tries to keep free.

    - testing results

    I tested the patches using my test kernel module.

    If the patches aren't applied to the kernel, __vm_enough_memory() returns
    success in this situation, but the actual page allocation fails.

    On the other hand, if the patches are applied, the memory allocation
    failure is avoided since __vm_enough_memory() returns failure in this
    situation.

    I checked this on an i386 SMP machine with 16GB of memory. I haven't
    tested in a nommu environment yet.

    This patch adds totalreserve_pages for __vm_enough_memory().

    calculate_totalreserve_pages() checks the maximum lowmem_reserve pages and
    pages_high in each zone. Finally, the function stores the sum over all
    zones in totalreserve_pages.

    totalreserve_pages is calculated when the VM is initialized, and the
    variable is updated when /proc/sys/vm/lowmem_reserve_ratio or
    /proc/sys/vm/min_free_kbytes is changed.
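
    A sketch of the calculation described above (field names follow the
    kernel of that era; treat the details as approximate):

      #include <linux/mmzone.h>

      unsigned long totalreserve_pages;

      static void calculate_totalreserve_pages(void)
      {
              struct pglist_data *pgdat;
              unsigned long reserve_pages = 0;
              int i, j;

              for_each_online_pgdat(pgdat) {
                      for (i = 0; i < MAX_NR_ZONES; i++) {
                              struct zone *zone = pgdat->node_zones + i;
                              unsigned long max = 0;

                              /* largest lowmem_reserve that applies to this zone */
                              for (j = i; j < MAX_NR_ZONES; j++)
                                      if (zone->lowmem_reserve[j] > max)
                                              max = zone->lowmem_reserve[j];

                              max += zone->pages_high;    /* treat pages_high as reserved */
                              if (max > zone->present_pages)
                                      max = zone->present_pages;
                              reserve_pages += max;
                      }
              }
              totalreserve_pages = reserve_pages;
      }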

    Signed-off-by: Hideo Aoki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hideo AOKI
     
  • Rohit found an obscure bug causing buddy list corruption.

    page_is_buddy is using a non-atomic test (PagePrivate && page_count == 0)
    to determine whether or not a free page's buddy is itself free and in the
    buddy lists.

    Each of the conjuncts may be true at different times due to unrelated
    conditions, so the non-atomic page_is_buddy test may find each conjunct to
    be true even if they were not both true at the same time (i.e. the page
    was not on the buddy lists).
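
    One way to make the test atomic is a single dedicated page flag that means
    "free and on the buddy lists"; a sketch assuming such a PG_buddy-style
    flag and the usual page_order() helper:

      /* One atomic bit test replaces two conditions that can each be true
       * at different times for unrelated reasons. */
      static inline int page_is_buddy(struct page *page, int order)
      {
              return PageBuddy(page) && page_order(page) == order;
      }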

    Signed-off-by: Martin Bligh
    Signed-off-by: Rohit Seth
    Signed-off-by: Nick Piggin
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

28 Mar, 2006

4 commits

  • By using for_each_online_pgdat(), pgdat_list is no longer necessary. This
    patch removes it.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Replace for_each_pgdat() with for_each_online_pgdat().

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch removes zone_mem_map.

    pfn_to_page uses pgdat, while page_to_pfn uses zone. page_to_pfn can use
    pgdat instead of zone, and it is the only user of zone_mem_map. By
    modifying it, we can remove zone_mem_map.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Dave Hansen
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • There are 3 memory models: FLATMEM, DISCONTIGMEM and SPARSEMEM. Each arch
    has its own page_to_pfn() and pfn_to_page() for each model, but most of
    them can use the same arithmetic.

    This patch adds asm-generic/memory_model.h, which includes generic
    page_to_pfn() and pfn_to_page() definitions for each memory model.

    When CONFIG_OUT_OF_LINE_PFN_TO_PAGE=y, out-of-line functions are used
    instead of macros. This is enabled by some archs and reduces text size.
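
    A sketch of the shared arithmetic for the FLATMEM case; ARCH_PFN_OFFSET
    (the first valid pfn, 0 on most arches) is the only per-arch input:

      #define __pfn_to_page(pfn)   (mem_map + ((pfn) - ARCH_PFN_OFFSET))
      #define __page_to_pfn(page)  ((unsigned long)((page) - mem_map) + ARCH_PFN_OFFSET)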

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Andi Kleen
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Russell King
    Cc: Ian Molton
    Cc: Mikael Starvik
    Cc: David Howells
    Cc: Yoshinori Sato
    Cc: Hirokazu Takata
    Cc: Ralf Baechle
    Cc: Kyle McMartin
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Paul Mundt
    Cc: Kazumoto Kojima
    Cc: Richard Curnow
    Cc: William Lee Irwin III
    Cc: "David S. Miller"
    Cc: Jeff Dike
    Cc: Paolo 'Blaisorblade' Giarrusso
    Cc: Miles Bader
    Cc: Chris Zankel
    Cc: "Luck, Tony"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki