13 May, 2008

1 commit

  • When accessing cpu_online_map, we should prevent it from changing
    dynamically by calling get_online_cpus().

    Unfortunately, all_vm_events() doesn't do that.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Christoph Lameter
    Cc: Gautham R Shenoy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
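    A minimal sketch of the fix this entry describes, assuming the CPU-hotplug API of
    that era (get_online_cpus()/put_online_cpus()) and the sum_vm_events() helper from
    mm/vmstat.c; details may differ from the actual patch:

    #include <linux/cpu.h>
    #include <linux/vmstat.h>

    /*
     * Sum the per-cpu event counters over the online CPUs.  Taking
     * get_online_cpus() pins cpu_online_map so CPUs cannot come or go
     * while we walk it.
     */
    void all_vm_events(unsigned long *ret)
    {
        get_online_cpus();
        sum_vm_events(ret, &cpu_online_map);    /* helper assumed from mm/vmstat.c */
        put_online_cpus();
    }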
     

30 Apr, 2008

2 commits

  • Fuse will use temporary buffers to write back dirty data from memory mappings
    (normal writes are done synchronously). This is needed, because there cannot
    be any guarantee about the time in which a write will complete.

    By using temporary buffers, from the MM's point of view the page is written
    back immediately. If the writeout was due to memory pressure, this
    effectively migrates data from a full zone to a less full zone.

    This patch adds a new counter (NR_WRITEBACK_TEMP) for the number of pages used
    as temporary buffers.

    [Lee.Schermerhorn@hp.com: add vmstat_text for NR_WRITEBACK_TEMP]
    Signed-off-by: Miklos Szeredi
    Cc: Christoph Lameter
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
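    A hedged sketch of how a writeback path could account pages against the new
    NR_WRITEBACK_TEMP counter using the standard zone-statistics helpers; the wrapper
    function names are illustrative, not taken from the fuse patch itself:

    #include <linux/mm.h>
    #include <linux/vmstat.h>

    /* Illustrative only: account a temporary writeback buffer page. */
    static void account_tmp_writeback(struct page *tmp_page)
    {
        inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
    }

    /* ... and drop the accounting once writeback of the buffer completes. */
    static void unaccount_tmp_writeback(struct page *tmp_page)
    {
        dec_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
    }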
     
  • On a memoryless node, /proc/pagetypeinfo displays slightly odd output.
    This patch fixes it.

    Example output (the header is printed, but no data follows it):
    --------------------------------------------------------------
    Page block order: 14
    Pages per block: 16384

    Free pages count per migrate type at order 0 1 2 3 4 5 \
    6 7 8 9 10 11 12 13 14 15 16

    Number of blocks type Unmovable Reclaimable Movable Reserve Isolate
    Page block order: 14
    Pages per block: 16384

    Free pages count per migrate type at order 0 1 2 3 4 5 \
    6 7 8 9 10 11 12 13 14 15 16

    Signed-off-by: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
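    One plausible shape of the fix, assuming it simply skips memoryless nodes in
    pagetypeinfo_show(); the exact check used by the patch is an assumption here:

    #include <linux/mmzone.h>
    #include <linux/nodemask.h>
    #include <linux/seq_file.h>

    static int pagetypeinfo_show(struct seq_file *m, void *arg)
    {
        pg_data_t *pgdat = (pg_data_t *)arg;

        /* Memoryless node: print neither the headers nor the (empty) data. */
        if (!node_state(pgdat->node_id, N_HIGH_MEMORY))
            return 0;

        /* ... existing header and per-migratetype output ... */
        return 0;
    }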
     

28 Apr, 2008

3 commits

  • We've found that it can take quite a bit of time (hundreds of microseconds)
    to get through the zone loop in refresh_cpu_vm_stats().

    Add a cond_resched() to allow other threads to run in the non-preemptible
    case.

    Signed-off-by: Dimitri Sivanich
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dimitri Sivanich
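    A rough sketch of where the cond_resched() lands, assuming the zone loop of
    refresh_cpu_vm_stats() at the time; this is a simplified outline, not the literal
    patch:

    #include <linux/mmzone.h>
    #include <linux/sched.h>

    void refresh_cpu_vm_stats(int cpu)
    {
        struct zone *zone;

        for_each_zone(zone) {
            /* ... fold this zone's per-cpu counter diffs ... */

            /*
             * The zone loop can take hundreds of microseconds on large
             * machines; give other threads a chance to run when the
             * kernel is not preemptible.
             */
            cond_resched();
        }
    }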
     
  • Allocating huge pages directly from the buddy allocator is not guaranteed to
    succeed. Success depends on several factors (such as the amount of physical
    memory available and the level of fragmentation). With the addition of
    dynamic hugetlb pool resizing, allocations can occur much more frequently.
    For these reasons it is desirable to keep track of huge page allocation
    successes and failures.

    Add two new vmstat entries to track huge page allocations that succeed and
    fail. The presence of the two entries is contingent upon CONFIG_HUGETLB_PAGE
    being enabled.

    [akpm@linux-foundation.org: reduced ifdeffery]
    Signed-off-by: Adam Litke
    Signed-off-by: Eric Munson
    Tested-by: Mel Gorman
    Reviewed-by: Andy Whitcroft
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
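    A hedged sketch of how the allocation path might bump the two counters via the
    vm-event helpers; the event names HTLB_BUDDY_PGALLOC and HTLB_BUDDY_PGALLOC_FAIL
    and the wrapper function are my assumptions, not quotes from the patch:

    #include <linux/gfp.h>
    #include <linux/vmstat.h>

    /* Illustrative only: count huge page allocations from the buddy allocator. */
    static struct page *alloc_huge_page_from_buddy(int nid, gfp_t gfp, int order)
    {
        struct page *page = alloc_pages_node(nid, gfp, order);

        if (page)
            count_vm_event(HTLB_BUDDY_PGALLOC);        /* assumed event name */
        else
            count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);   /* assumed event name */

        return page;
    }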
     
  • On NUMA, zone_statistics() is used to record events like numa hit, miss and
    foreign. It assumes that the first zone in a zonelist is the preferred zone.
    When multiple zonelists are replaced by one that is filtered, this is no
    longer the case.

    This patch records what the preferred zone is rather than assuming the first
    zone in the zonelist is it. This simplifies the reading of later patches in
    this set.

    Signed-off-by: Mel Gorman
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Reviewed-by: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
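    A sketch of zone_statistics() taking the preferred zone explicitly instead of a
    zonelist; treat this as an approximation of the interface change described above:

    #include <linux/mmzone.h>
    #include <linux/topology.h>
    #include <linux/vmstat.h>

    void zone_statistics(struct zone *preferred_zone, struct zone *z)
    {
        if (z->zone_pgdat == preferred_zone->zone_pgdat) {
            __inc_zone_state(z, NUMA_HIT);
        } else {
            __inc_zone_state(z, NUMA_MISS);
            __inc_zone_state(preferred_zone, NUMA_FOREIGN);
        }
        if (zone_to_nid(z) == numa_node_id())
            __inc_zone_state(z, NUMA_LOCAL);
        else
            __inc_zone_state(z, NUMA_OTHER);
    }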
     

16 Apr, 2008

1 commit

  • Commit a5d76b54a3f3a40385d7f76069a2feac9f1bad63 (memory unplug: page isolation,
    by KAMEZAWA Hiroyuki) added the "isolate" migratetype, but unfortunately it
    did not update the /proc/pagetypeinfo display logic.

    This patch adds "Isolate" to the pagetype name field.

    /proc/pagetypeinfo
    before:
    ------------------------------------------------------------------------------------------------------------------------
    Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
    Node 0, zone DMA, type Unmovable 1 2 2 2 1 2 2 1 1 0 0
    Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA, type Movable 2 3 3 1 3 3 2 0 0 0 0
    Node 0, zone DMA, type Reserve 0 0 0 0 0 0 0 0 0 0 1
    Node 0, zone DMA, type 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone Normal, type Unmovable 1 9 7 4 1 1 1 1 0 0 0
    Node 0, zone Normal, type Reclaimable 5 2 0 0 1 1 0 0 0 1 0
    Node 0, zone Normal, type Movable 0 1 1 0 0 0 1 0 0 1 60
    Node 0, zone Normal, type Reserve 0 0 0 0 0 0 0 0 0 0 1
    Node 0, zone Normal, type 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone HighMem, type Unmovable 0 0 1 1 1 0 1 1 2 2 0
    Node 0, zone HighMem, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone HighMem, type Movable 236 62 6 2 2 1 1 0 1 1 16
    Node 0, zone HighMem, type Reserve 0 0 0 0 0 0 0 0 0 0 1
    Node 0, zone HighMem, type 0 0 0 0 0 0 0 0 0 0 0

    Number of blocks type Unmovable Reclaimable Movable Reserve
    Node 0, zone DMA 1 0 2 1 0
    Node 0, zone Normal 10 40 169 1 0
    Node 0, zone HighMem 2 0 283 1 0

    after:
    ------------------------------------------------------------------------------------------------------------------------
    Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
    Node 0, zone DMA, type Unmovable 1 2 2 2 1 2 2 1 1 0 0
    Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA, type Movable 2 3 3 1 3 3 2 0 0 0 0
    Node 0, zone DMA, type Reserve 0 0 0 0 0 0 0 0 0 0 1
    Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone Normal, type Unmovable 0 2 1 1 0 1 0 0 0 0 0
    Node 0, zone Normal, type Reclaimable 1 1 1 1 1 0 1 1 1 0 0
    Node 0, zone Normal, type Movable 0 1 1 1 0 1 0 1 0 0 196
    Node 0, zone Normal, type Reserve 0 0 0 0 0 0 0 0 0 0 1
    Node 0, zone Normal, type Isolate 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone HighMem, type Unmovable 0 1 0 0 0 1 1 1 2 2 0
    Node 0, zone HighMem, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone HighMem, type Movable 1 0 1 1 0 0 0 0 1 0 200
    Node 0, zone HighMem, type Reserve 0 0 0 0 0 0 0 0 0 0 1
    Node 0, zone HighMem, type Isolate 0 0 0 0 0 0 0 0 0 0 0

    Number of blocks type Unmovable Reclaimable Movable Reserve Isolate
    Node 0, zone DMA 1 0 2 1 0
    Node 0, zone Normal 8 4 207 1 0
    Node 0, zone HighMem 2 0 283 1 0

    Signed-off-by: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
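    The pagetype name table in mm/vmstat.c presumably just gains one entry; a sketch,
    assuming the array looks roughly like this:

    #include <linux/mmzone.h>

    /* Names printed by /proc/pagetypeinfo, indexed by migratetype. */
    static char * const migratetype_names[MIGRATE_TYPES] = {
        "Unmovable",
        "Reclaimable",
        "Movable",
        "Reserve",
        "Isolate",      /* newly added so MIGRATE_ISOLATE blocks get a label */
    };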
     

06 Feb, 2008

3 commits

  • Remove the prefetch logic in order to avoid touching impossible per cpu
    areas.

    Signed-off-by: Christoph Lameter
    Cc: Mike Travis
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • We have repeatedly discussed if the cold pages still have a point. There is
    one way to join the two lists: Use a single list and put the cold pages at the
    end and the hot pages at the beginning. That way a single list can serve for
    both types of allocations.

    The discussion of the RFC for this and Mel's measurements indicate that
    there may not be too much of a point left to having separate lists for
    hot and cold pages (see http://marc.info/?t=119492914200001&r=1&w=2).

    Signed-off-by: Christoph Lameter
    Cc: Mel Gorman
    Cc: Martin Bligh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
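    A hedged sketch of the single-list scheme: cold pages go to the tail of the
    per-cpu list and hot pages to the head, so one list serves both allocation types.
    The helper below is illustrative, simplified from the description above:

    #include <linux/list.h>
    #include <linux/mm.h>
    #include <linux/mmzone.h>

    /* Free a page into the zone's per-cpu pages (pcp) list. */
    static void pcp_free_page(struct per_cpu_pages *pcp, struct page *page, int cold)
    {
        if (cold)
            list_add_tail(&page->lru, &pcp->list);  /* cold: handed out last */
        else
            list_add(&page->lru, &pcp->list);       /* hot: handed out first */
        pcp->count++;
    }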
     
  • 1. Add comments explaining how the function can be called.

    2. Collect global diffs in a local array and only spill
    them once into the global counters when the zone scan
    is finished. This means that we only touch each global
    counter once instead of each time we fold cpu counters
    into zone counters.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
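    A condensed sketch of point 2, assuming the structure of refresh_cpu_vm_stats()
    at the time: per-cpu diffs are still folded into each zone, but the global
    counters are only touched once, at the end of the scan:

    #include <linux/mmzone.h>
    #include <linux/vmstat.h>

    void refresh_cpu_vm_stats(int cpu)
    {
        struct zone *zone;
        int i;
        int global_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };

        for_each_zone(zone) {
            struct per_cpu_pageset *p = zone_pcp(zone, cpu);

            for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
                if (p->vm_stat_diff[i]) {
                    /* fold into the zone counter as before (elided) ... */
                    global_diff[i] += p->vm_stat_diff[i];
                    p->vm_stat_diff[i] = 0;
                }
        }

        /* Spill the collected diffs into the global counters exactly once. */
        for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
            if (global_diff[i])
                atomic_long_add(global_diff[i], &vm_stat[i]);
    }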
     

15 Nov, 2007

1 commit

  • Mark start_cpu_timer() as __cpuinit instead of __devinit.
    Fixes this section warning:

    WARNING: vmlinux.o(.text+0x60e53): Section mismatch: reference to .init.text:start_cpu_timer (between 'vmstat_cpuup_callback' and 'vmstat_show')

    Signed-off-by: Randy Dunlap
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

17 Oct, 2007

3 commits

  • Convert the int all_unreclaimable member of struct zone to unsigned long
    flags. This can now be used to specify several different zone flags such as
    all_unreclaimable and reclaim_in_progress, which can now be removed and
    converted to a per-zone flag.

    Flags are set and cleared as follows:

    zone_set_flag(struct zone *zone, zone_flags_t flag)
    zone_clear_flag(struct zone *zone, zone_flags_t flag)

    Defines the first zone flags, ZONE_ALL_UNRECLAIMABLE and ZONE_RECLAIM_LOCKED,
    which have the same semantics as the old zone->all_unreclaimable and
    zone->reclaim_in_progress, respectively. Also converts all current users that
    set or clear either flag to use the new interface.

    Helper functions are defined to test the flags:

    int zone_is_all_unreclaimable(const struct zone *zone)
    int zone_is_reclaim_locked(const struct zone *zone)

    All flag operators are of the atomic variety because there are currently
    readers that do not take zone->lock.

    [akpm@linux-foundation.org: add needed include]
    Cc: Andrea Arcangeli
    Acked-by: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
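    A sketch of the interface spelled out above, assuming zone->flags is an unsigned
    long used as an atomic bit field:

    #include <linux/bitops.h>
    #include <linux/mmzone.h>

    typedef enum {
        ZONE_ALL_UNRECLAIMABLE,     /* replaces zone->all_unreclaimable */
        ZONE_RECLAIM_LOCKED,        /* replaces zone->reclaim_in_progress */
    } zone_flags_t;

    static inline void zone_set_flag(struct zone *zone, zone_flags_t flag)
    {
        set_bit(flag, &zone->flags);    /* atomic: readers may not hold zone->lock */
    }

    static inline void zone_clear_flag(struct zone *zone, zone_flags_t flag)
    {
        clear_bit(flag, &zone->flags);
    }

    static inline int zone_is_all_unreclaimable(const struct zone *zone)
    {
        return test_bit(ZONE_ALL_UNRECLAIMABLE, &zone->flags);
    }

    static inline int zone_is_reclaim_locked(const struct zone *zone)
    {
        return test_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
    }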
     
  • This patch contains the following cleanups:
    - make the needlessly global setup_vmstat() static
    - remove the unused refresh_vm_stats()

    Signed-off-by: Adrian Bunk
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • This patch provides fragmentation avoidance statistics via /proc/pagetypeinfo.
    The information is collected only on request so there is no runtime overhead.
    The statistics are in three parts:

    The first part prints information on the size of blocks that pages are
    being grouped on and looks like

    Page block order: 10
    Pages per block: 1024

    The second part is a more detailed version of /proc/buddyinfo and looks like

    Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
    Node 0, zone DMA, type Unmovable 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA, type Reclaimable 1 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA, type Movable 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA, type Reserve 0 4 4 0 0 0 0 1 0 1 0
    Node 0, zone Normal, type Unmovable 111 8 4 4 2 3 1 0 0 0 0
    Node 0, zone Normal, type Reclaimable 293 89 8 0 0 0 0 0 0 0 0
    Node 0, zone Normal, type Movable 1 6 13 9 7 6 3 0 0 0 0
    Node 0, zone Normal, type Reserve 0 0 0 0 0 0 0 0 0 0 4

    The third part looks like

    Number of blocks type Unmovable Reclaimable Movable Reserve
    Node 0, zone DMA 0 1 2 1
    Node 0, zone Normal 3 17 94 4

    To walk the zones within a node with interrupts disabled, walk_zones_in_node()
    is introduced and shared between /proc/buddyinfo, /proc/zoneinfo and
    /proc/pagetypeinfo to reduce code duplication. It seems specific to what
    vmstat.c requires but could be broken out as a general utility function in
    mmzone.c if there were other potential users.

    Signed-off-by: Mel Gorman
    Acked-by: Andy Whitcroft
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
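    A sketch of the shared walker, assuming the shape described above: iterate a
    node's populated zones with the zone lock held and interrupts disabled, calling a
    printer for each:

    #include <linux/mmzone.h>
    #include <linux/seq_file.h>
    #include <linux/spinlock.h>

    static void walk_zones_in_node(struct seq_file *m, pg_data_t *pgdat,
        void (*print)(struct seq_file *m, pg_data_t *pgdat, struct zone *zone))
    {
        struct zone *zone;
        struct zone *node_zones = pgdat->node_zones;
        unsigned long flags;

        for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
            if (!populated_zone(zone))
                continue;

            spin_lock_irqsave(&zone->lock, flags);
            print(m, pgdat, zone);      /* buddyinfo / zoneinfo / pagetypeinfo */
            spin_unlock_irqrestore(&zone->lock, flags);
        }
    }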
     

30 Jul, 2007

1 commit

  • Remove fs.h from mm.h. For this,
    1) Uninline vma_wants_writenotify(). It's pretty huge anyway.
    2) Add back fs.h or less bloated headers (err.h) to files that need it.

    As a result, on x86_64 allyesconfig, the number of files rebuilt due to fs.h
    dependencies drops from 3929 to 3444 (-12.3%).

    Cross-compile tested without regressions on my two usual configs and (sigh):

    alpha arm-mx1ads mips-bigsur powerpc-ebony
    alpha-allnoconfig arm-neponset mips-capcella powerpc-g5
    alpha-defconfig arm-netwinder mips-cobalt powerpc-holly
    alpha-up arm-netx mips-db1000 powerpc-iseries
    arm arm-ns9xxx mips-db1100 powerpc-linkstation
    arm-assabet arm-omap_h2_1610 mips-db1200 powerpc-lite5200
    arm-at91rm9200dk arm-onearm mips-db1500 powerpc-maple
    arm-at91rm9200ek arm-picotux200 mips-db1550 powerpc-mpc7448_hpc2
    arm-at91sam9260ek arm-pleb mips-ddb5477 powerpc-mpc8272_ads
    arm-at91sam9261ek arm-pnx4008 mips-decstation powerpc-mpc8313_rdb
    arm-at91sam9263ek arm-pxa255-idp mips-e55 powerpc-mpc832x_mds
    arm-at91sam9rlek arm-realview mips-emma2rh powerpc-mpc832x_rdb
    arm-ateb9200 arm-realview-smp mips-excite powerpc-mpc834x_itx
    arm-badge4 arm-rpc mips-fulong powerpc-mpc834x_itxgp
    arm-carmeva arm-s3c2410 mips-ip22 powerpc-mpc834x_mds
    arm-cerfcube arm-shannon mips-ip27 powerpc-mpc836x_mds
    arm-clps7500 arm-shark mips-ip32 powerpc-mpc8540_ads
    arm-collie arm-simpad mips-jazz powerpc-mpc8544_ds
    arm-corgi arm-spitz mips-jmr3927 powerpc-mpc8560_ads
    arm-csb337 arm-trizeps4 mips-malta powerpc-mpc8568mds
    arm-csb637 arm-versatile mips-mipssim powerpc-mpc85xx_cds
    arm-ebsa110 i386 mips-mpc30x powerpc-mpc8641_hpcn
    arm-edb7211 i386-allnoconfig mips-msp71xx powerpc-mpc866_ads
    arm-em_x270 i386-defconfig mips-ocelot powerpc-mpc885_ads
    arm-ep93xx i386-up mips-pb1100 powerpc-pasemi
    arm-footbridge ia64 mips-pb1500 powerpc-pmac32
    arm-fortunet ia64-allnoconfig mips-pb1550 powerpc-ppc64
    arm-h3600 ia64-bigsur mips-pnx8550-jbs powerpc-prpmc2800
    arm-h7201 ia64-defconfig mips-pnx8550-stb810 powerpc-ps3
    arm-h7202 ia64-gensparse mips-qemu powerpc-pseries
    arm-hackkit ia64-sim mips-rbhma4200 powerpc-up
    arm-integrator ia64-sn2 mips-rbhma4500 s390
    arm-iop13xx ia64-tiger mips-rm200 s390-allnoconfig
    arm-iop32x ia64-up mips-sb1250-swarm s390-defconfig
    arm-iop33x ia64-zx1 mips-sead s390-up
    arm-ixp2000 m68k mips-tb0219 sparc
    arm-ixp23xx m68k-amiga mips-tb0226 sparc-allnoconfig
    arm-ixp4xx m68k-apollo mips-tb0287 sparc-defconfig
    arm-jornada720 m68k-atari mips-workpad sparc-up
    arm-kafa m68k-bvme6000 mips-wrppmc sparc64
    arm-kb9202 m68k-hp300 mips-yosemite sparc64-allnoconfig
    arm-ks8695 m68k-mac parisc sparc64-defconfig
    arm-lart m68k-mvme147 parisc-allnoconfig sparc64-up
    arm-lpd270 m68k-mvme16x parisc-defconfig um-x86_64
    arm-lpd7a400 m68k-q40 parisc-up x86_64
    arm-lpd7a404 m68k-sun3 powerpc x86_64-allnoconfig
    arm-lubbock m68k-sun3x powerpc-cell x86_64-defconfig
    arm-lusl7200 mips powerpc-celleb x86_64-up
    arm-mainstone mips-atlas powerpc-chrp32

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

18 Jul, 2007

1 commit

  • The following 8 patches against 2.6.20-mm2 create a zone called ZONE_MOVABLE
    that is only usable by allocations that specify both __GFP_HIGHMEM and
    __GFP_MOVABLE. This has the effect of keeping all non-movable pages within a
    single memory partition while allowing movable allocations to be satisfied
    from either partition. The patches may be applied with the list-based
    anti-fragmentation patches that group pages together based on mobility.

    The size of the zone is determined by a kernelcore= parameter specified at
    boot-time. This specifies how much memory is usable by non-movable
    allocations and the remainder is used for ZONE_MOVABLE. Any range of pages
    within ZONE_MOVABLE can be released by migrating the pages or by reclaiming.

    When selecting a zone to take pages from for ZONE_MOVABLE, there are two
    things to consider. First, only memory from the highest populated zone is
    used for ZONE_MOVABLE. On the x86, this is probably going to be ZONE_HIGHMEM
    but it would be ZONE_DMA on ppc64 or possibly ZONE_DMA32 on x86_64. Second,
    the amount of memory usable by the kernel will be spread evenly throughout
    NUMA nodes where possible. If the nodes are not of equal size, the amount of
    memory usable by the kernel on some nodes may be greater than others.

    By default, the zone is not as useful for hugetlb allocations because they are
    pinned and non-migratable (currently at least). A sysctl is provided that
    allows huge pages to be allocated from that zone. This means that the huge
    page pool can be resized to the size of ZONE_MOVABLE during the lifetime of
    the system assuming that pages are not mlocked. Despite huge pages being
    non-movable, we do not introduce additional external fragmentation of note as
    huge pages are always the largest contiguous block we care about.

    Credit goes to Andy Whitcroft for catching a large variety of problems during
    review of the patches.

    This patch creates an additional zone, ZONE_MOVABLE. This zone is only usable
    by allocations which specify both __GFP_HIGHMEM and __GFP_MOVABLE. Hot-added
    memory continues to be placed in its existing destination as there is no
    mechanism to redirect it to a specific zone.

    [y-goto@jp.fujitsu.com: Fix section mismatch of memory hotplug related code]
    [akpm@linux-foundation.org: various fixes]
    Signed-off-by: Mel Gorman
    Cc: Andy Whitcroft
    Signed-off-by: Yasunori Goto
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
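    A hedged sketch of the zone selection rule: only allocations that set both flags
    fall back to the movable zone. This is roughly what gfp_zone() looks like with
    ZONE_MOVABLE folded in; the config #ifdefs of the real header are omitted and the
    name carries a _sketch suffix to mark it as illustrative:

    #include <linux/gfp.h>
    #include <linux/mmzone.h>

    static inline enum zone_type gfp_zone_sketch(gfp_t flags)
    {
        if (flags & __GFP_DMA)
            return ZONE_DMA;

        /* Only __GFP_HIGHMEM and __GFP_MOVABLE together may use ZONE_MOVABLE. */
        if ((flags & (__GFP_HIGHMEM | __GFP_MOVABLE)) ==
                (__GFP_HIGHMEM | __GFP_MOVABLE))
            return ZONE_MOVABLE;

        if (flags & __GFP_HIGHMEM)
            return ZONE_HIGHMEM;

        return ZONE_NORMAL;
    }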
     

07 Jul, 2007

1 commit

  • Line up the vmstat_text with zone_stat_item

    enum zone_stat_item {
    /* First 128 byte cacheline (assuming 64 bit words) */
    NR_FREE_PAGES,
    NR_INACTIVE,
    NR_ACTIVE,

    We currently have nr_active and nr_inactive reversed.

    [ "OK with patch, though using initializers can be handy to prevent such
    things in future:

    static const char * const vmstat_text[] = {
    [NR_FREE_PAGES] = "nr_free_pages",
    ..."
    - Alexey ]

    Signed-off-by: Peter Zijlstra
    Acked-by: Alexey Dobriyan
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

22 May, 2007

1 commit

  • The first thing mm.h does is include sched.h, solely for the can_do_mlock()
    inline function, which dereferences "current". By dealing with can_do_mlock(),
    mm.h can be detached from sched.h, which is good. See below for why.

    This patch
    a) removes the unconditional inclusion of sched.h from mm.h
    b) makes can_do_mlock() a normal function in mm/mlock.c
    c) exports can_do_mlock() so compilation does not break
    d) adds sched.h inclusions back to files that were getting it indirectly
    e) adds less bloated headers (asm/signal.h, jiffies.h) to some files that were
    getting them indirectly

    Net result is:
    a) mm.h users would get less code to open, read, preprocess, parse, ... if
    they don't need sched.h
    b) sched.h stops being a dependency for a significant number of files:
    on x86_64 allmodconfig, touching sched.h results in a recompile of 4083 files;
    after the patch it's only 3744 (-8.3%).

    Cross-compile tested on

    all arm defconfigs, all mips defconfigs, all powerpc defconfigs,
    alpha alpha-up
    arm
    i386 i386-up i386-defconfig i386-allnoconfig
    ia64 ia64-up
    m68k
    mips
    parisc parisc-up
    powerpc powerpc-up
    s390 s390-up
    sparc sparc-up
    sparc64 sparc64-up
    um-x86_64
    x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig

    as well as my two usual configs.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

11 May, 2007

1 commit

  • VM statistics updates do not matter if the kernel is in idle powersaving
    mode. So allow the timer to be deferred.

    It would be better though if we could switch the timer between deferrable
    and nondeferrable based on differentials present. The timer would start
    out nondeferrable and if we find that there were no updates in the last
    statistics interval then we would switch the timer to deferrable. If the
    timer later finds again that there are differentials then go to
    nondeferrable again.

    And yet another way would be to run the timer shortly before going to idle?

    The solution here means that the VM counters may be slightly off during
    idle since differentials may be still pending while the timer is deferred.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

10 May, 2007

4 commits

  • Currently the slab allocators contain callbacks into the page allocator to
    perform the draining of pagesets on remote nodes. This requires SLUB to have
    a whole subsystem in order to be compatible with SLAB. Moving node draining
    out of the slab allocators avoids a section of code in SLUB.

    Move the node draining so that it is done when the vm statistics are updated.
    At that point we are already touching all the cachelines with the pagesets of
    a processor.

    Add an expire counter there. If we have to update per zone or global vm
    statistics then assume that the pageset will require subsequent draining.

    The expire counter will be decremented on each vm stats update pass until it
    reaches zero. Then we will drain one batch from the pageset. The draining
    will cause vm counter updates which will then cause another expiration until
    the pcp is empty. So we will drain a batch every 3 seconds.

    Note that remote node draining is a somewhat esoteric feature that is required
    on large NUMA systems because otherwise significant portions of system memory
    can become trapped in pcp queues. The number of pcps is determined by the
    number of processors and nodes in a system. A system with 4 processors and 2
    nodes has 8 pcps, which is okay. But a system with 1024 processors and 512
    nodes has 512k pcps, with a high potential for a large amount of memory being
    caught in them.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
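    A rough sketch of the draining logic described above, as a pass over the zones
    during a vm-stats update on one CPU; the expire field and drain_zone_pages() are
    assumed to be what the patch introduces, and the exact placement is simplified:

    #include <linux/mmzone.h>
    #include <linux/topology.h>

    static void maybe_drain_remote_pagesets(int cpu)
    {
        struct zone *zone;

        for_each_zone(zone) {
            struct per_cpu_pageset *p = zone_pcp(zone, cpu);

            if (!p->expire || !p->pcp[0].count)
                continue;               /* nothing queued, or no drain pending */

            if (zone_to_nid(zone) == numa_node_id()) {
                p->expire = 0;          /* local zone: never drain */
                continue;
            }

            p->expire--;                /* one step per stats interval */
            if (p->expire)
                continue;

            drain_zone_pages(zone, &p->pcp[0]);     /* drain one batch */
        }
    }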
     
  • Make it configurable. The code in mm makes the vm statistics interval
    independent from the cache reaper; use that opportunity to make it
    configurable.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • vmstat is currently using the cache reaper to periodically bring the
    statistics up to date. The cache reaper only exists in SLUB as a way to
    provide compatibility with SLAB. This patch removes the vmstat calls from the
    slab allocators and gives vmstat its own handling.

    The advantage is also that we can use a different frequency for the updates.
    Refreshing vm stats is a pretty fast job so we can run this every second and
    stagger this by only one tick. This will lead to some overlap in large
    systems. F.e a system running at 250 HZ with 1024 processors will have 4 vm
    updates occurring at once.

    However, the vm stats update only accesses per node information. It is only
    necessary to stagger the vm statistics updates per processor in each node. Vm
    counter updates occurring on distant nodes will not cause cacheline
    contention.

    We could implement an alternate approach that runs the first processor on each
    node at the second and then each of the other processor on a node on a
    subsequent tick. That may be useful to keep a large amount of the second free
    of timer activity. Maybe the timer folks will have some feedback on this one?

    [jirislaby@gmail.com: add missing break]
    Cc: Arjan van de Ven
    Signed-off-by: Christoph Lameter
    Signed-off-by: Jiri Slaby
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
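    A sketch of the per-cpu timer setup with a one-tick stagger per processor. The
    deferrable work initialisation here also reflects the 11 May entry above; the
    interval is hard-coded to HZ, though a separate entry above makes it configurable:

    #include <linux/jiffies.h>
    #include <linux/percpu.h>
    #include <linux/smp.h>
    #include <linux/vmstat.h>
    #include <linux/workqueue.h>

    static DEFINE_PER_CPU(struct delayed_work, vmstat_work);

    static void vmstat_update(struct work_struct *w)
    {
        refresh_cpu_vm_stats(smp_processor_id());
        /* Re-arm ourselves for the next interval. */
        schedule_delayed_work(&__get_cpu_var(vmstat_work), HZ);
    }

    static void start_cpu_timer(int cpu)
    {
        struct delayed_work *work = &per_cpu(vmstat_work, cpu);

        INIT_DELAYED_WORK_DEFERRABLE(work, vmstat_update);
        /* Stagger the updates by one tick per CPU so processors do not
         * all touch their counters at the same moment. */
        schedule_delayed_work_on(cpu, work, HZ + cpu);
    }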
     
  • Since nonboot CPUs are now disabled after tasks and devices have been
    frozen and the CPU hotplug infrastructure is used for this purpose, we need
    special CPU hotplug notifications that will help the CPU-hotplug-aware
    subsystems distinguish normal CPU hotplug events from CPU hotplug events
    related to a system-wide suspend or resume operation in progress. This
    patch introduces such notifications and causes them to be used during
    suspend and resume transitions. It also changes all of the
    CPU-hotplug-aware subsystems to take these notifications into consideration
    (for now they are handled in the same way as the corresponding "normal"
    ones).

    [oleg@tv-sign.ru: cleanups]
    Signed-off-by: Rafael J. Wysocki
    Cc: Gautham R Shenoy
    Cc: Pavel Machek
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
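    A sketch of how a CPU-hotplug-aware subsystem such as vmstat treats the new
    *_FROZEN notifications the same way as the normal ones; the callback shape is an
    assumption based on the description above, and start_cpu_timer() is the per-cpu
    timer setup from mm/vmstat.c:

    #include <linux/cpu.h>
    #include <linux/notifier.h>

    static void start_cpu_timer(int cpu);   /* per-cpu vmstat timer setup */

    static int vmstat_cpuup_callback(struct notifier_block *nfb,
                                     unsigned long action, void *hcpu)
    {
        long cpu = (long)hcpu;

        switch (action) {
        case CPU_ONLINE:
        case CPU_ONLINE_FROZEN:     /* resume: treated like a normal online */
            start_cpu_timer(cpu);
            break;
        case CPU_DEAD:
        case CPU_DEAD_FROZEN:       /* suspend: treated like a normal offline */
            /* cancel the per-cpu vmstat work here (elided) */
            break;
        default:
            break;
        }
        return NOTIFY_OK;
    }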
     

12 Feb, 2007

7 commits

  • Make ZONE_DMA optional in core code.

    - ifdef all code for ZONE_DMA and related definitions following the example
    for ZONE_DMA32 and ZONE_HIGHMEM.

    - Without ZONE_DMA, ZONE_HIGHMEM and ZONE_DMA32 we get to a ZONES_SHIFT of
    0.

    - Modify the VM statistics to work correctly without a DMA zone.

    - Modify slab to not create DMA slabs if there is no ZONE_DMA.

    [akpm@osdl.org: cleanup]
    [jdike@addtoit.com: build fix]
    [apw@shadowen.org: Simplify calculation of the number of bits we need for ZONES_SHIFT]
    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Cc: "Luck, Tony"
    Cc: Kyle McMartin
    Cc: Matthew Wilcox
    Cc: James Bottomley
    Cc: Paul Mundt
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Jeff Dike
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Values are available via ZVC sums.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Values are readily available via ZVC per node and global sums.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The global and per zone counter sums are in arrays of longs. Reorder the ZVCs
    so that the most frequently used ZVCs are put into the same cacheline. That
    way calculations of the global, node and per zone vm state touches only a
    single cacheline. This is mostly important for 64 bit systems where one 128
    byte cacheline takes only 8 longs.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This again simplifies some of the VM counter calculations through the use
    of the ZVC consolidated counters.

    [michal.k.k.piotrowski@gmail.com: build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Michal Piotrowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The determination of the dirty ratio to determine writeback behavior is
    currently based on the number of total pages on the system.

    However, not all pages in the system may be dirtied. Thus the ratio is always
    too low and can never reach 100%. The ratio may be particularly skewed if
    large hugepage allocations, slab allocations or device driver buffers make
    large sections of memory not available anymore. In that case we may get into
    a situation in which f.e. the background writeback ratio of 40% cannot be
    reached anymore which leads to undesired writeback behavior.

    This patchset fixes that issue by determining the ratio based on the actual
    pages that may potentially be dirty. These are the pages on the active and
    the inactive list plus free pages.

    The problem with those counts has so far been that it is expensive to
    calculate these because counts from multiple nodes and multiple zones will
    have to be summed up. This patchset makes these counters ZVC counters. This
    means that a current sum per zone, per node and for the whole system is always
    available via global variables and not expensive anymore to calculate.

    The patchset results in some other good side effects:

    - Removal of the various functions that sum up free, active and inactive
    page counts

    - Cleanup of the functions that display information via the proc filesystem.

    This patch:

    The use of a ZVC for nr_inactive and nr_active allows a simplification of some
    counter operations. More ZVC functionality is used for sums etc in the
    following patches.

    [akpm@osdl.org: UP build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
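    A simplified sketch of the idea: derive the dirtyable base from the ZVC sums for
    free, active and inactive pages rather than from total pages. The name follows
    mm/page-writeback.c of that era, but the body is trimmed (the real version also
    corrects for non-dirtyable highmem):

    #include <linux/vmstat.h>

    static unsigned long determine_dirtyable_memory(void)
    {
        /* Cheap: these are globally maintained ZVC sums, no zone/node walk. */
        return global_page_state(NR_FREE_PAGES) +
               global_page_state(NR_ACTIVE) +
               global_page_state(NR_INACTIVE) + 1;  /* never return 0 */
    }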
     
  • This early break prevents us from displaying info for the vm stats thresholds
    if the zone doesn't have any pages in its per-cpu pagesets.

    So my 800MB i386 box says:

    Node 0, zone DMA
    pages free 2365
    min 16
    low 20
    high 24
    active 0
    inactive 0
    scanned 0 (a: 0 i: 0)
    spanned 4096
    present 4044
    nr_anon_pages 0
    nr_mapped 1
    nr_file_pages 0
    nr_slab_reclaimable 0
    nr_slab_unreclaimable 0
    nr_page_table_pages 0
    nr_dirty 0
    nr_writeback 0
    nr_unstable 0
    nr_bounce 0
    nr_vmscan_write 0
    protection: (0, 868, 868)
    pagesets
    all_unreclaimable: 0
    prev_priority: 12
    start_pfn: 0
    Node 0, zone Normal
    pages free 199713
    min 934
    low 1167
    high 1401
    active 10215
    inactive 4507
    scanned 0 (a: 0 i: 0)
    spanned 225280
    present 222420
    nr_anon_pages 2685
    nr_mapped 1110
    nr_file_pages 12055
    nr_slab_reclaimable 2216
    nr_slab_unreclaimable 1527
    nr_page_table_pages 213
    nr_dirty 0
    nr_writeback 0
    nr_unstable 0
    nr_bounce 0
    nr_vmscan_write 0
    protection: (0, 0, 0)
    pagesets
    cpu: 0 pcp: 0
    count: 152
    high: 186
    batch: 31
    cpu: 0 pcp: 1
    count: 13
    high: 62
    batch: 15
    vm stats threshold: 16
    cpu: 1 pcp: 0
    count: 34
    high: 186
    batch: 31
    cpu: 1 pcp: 1
    count: 10
    high: 62
    batch: 15
    vm stats threshold: 16
    all_unreclaimable: 0
    prev_priority: 12
    start_pfn: 4096

    Just nuke all that search-for-the-first-non-empty-pageset code. Dunno why it
    was there in the first place..

    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

08 Dec, 2006

2 commits

  • - move some file_operations structs into the .rodata section

    - move static strings from policy_types[] array into the .rodata section

    - fix generic seq_operations usages, so that those structs may be defined
    as "const" as well

    [akpm@osdl.org: couple of fixes]
    Signed-off-by: Helge Deller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Helge Deller
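    A small sketch of the .rodata move for one of the vmstat proc files, assuming the
    usual seq_file boilerplate of the time; the point is simply the added const
    qualifier:

    #include <linux/fs.h>
    #include <linux/seq_file.h>

    static int fragmentation_open(struct inode *inode, struct file *file);
                                    /* defined elsewhere in mm/vmstat.c */

    /* const: the structure now ends up in .rodata instead of writable data. */
    static const struct file_operations fragmentation_file_operations = {
        .open       = fragmentation_open,
        .read       = seq_read,
        .llseek     = seq_lseek,
        .release    = seq_release,
    };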
     
  • These patches introduced new switch statements which are indented contrary
    to the consensus in mm/*.c. Fix them up to match that consensus.

    [PATCH] node local per-cpu-pages
    [PATCH] ZVC: Scale thresholds depending on the size of the system
    commit e7c8d5c9955a4d2e88e36b640563f5d6d5aba48a
    commit df9ecaba3f152d1ea79f2a5e0b87505e03f47590

    Signed-off-by: Andy Whitcroft
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     

29 Oct, 2006

1 commit

  • The temp_priority field in zone is racy, as we can walk through a reclaim
    path, and just before we copy it into prev_priority, it can be overwritten
    (say with DEF_PRIORITY) by another reclaimer.

    The same bug is contained in both try_to_free_pages and balance_pgdat, but
    it is fixed slightly differently. In balance_pgdat, we keep a separate
    priority record per zone in a local array. In try_to_free_pages there is
    no need to do this, as the priority level is the same for all zones that we
    reclaim from.

    Impact of this bug is that temp_priority is copied into prev_priority, and
    setting this artificially high causes reclaimers to set distress
    artificially low. They then fail to reclaim mapped pages, when they are,
    in fact, under severe memory pressure (their priority may be as low as 0).
    This causes the OOM killer to fire incorrectly.

    From: Andrew Morton

    __zone_reclaim() isn't modifying zone->prev_priority. But zone->prev_priority
    is used in the decision whether or not to bring mapped pages onto the inactive
    list. Hence there's a risk here that __zone_reclaim() will fail because
    zone->prev_priority is large (i.e. low urgency) and lots of mapped pages end up
    stuck on the active list.

    Fix that up by decreasing (ie making more urgent) zone->prev_priority as
    __zone_reclaim() scans the zone's pages.

    This bug perhaps explains why ZONE_RECLAIM_PRIORITY was created. It should be
    possible to remove that now, and to just start out at DEF_PRIORITY?

    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Bligh
     

27 Sep, 2006

2 commits

  • Now that we have the node in the hot zone of struct zone we can avoid
    accessing zone_pgdat in zone_statistics.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The VM is supposed to minimise the number of pages which get written off the
    LRU (for IO scheduling efficiency, and for high reclaim-success rates). But
    we don't actually have a clear way of showing how true this is.

    So add `nr_vmscan_write' to /proc/vmstat and /proc/zoneinfo - the number of
    pages which have been written by the vm scanner in this zone and globally.

    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
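    A hedged sketch of where such a counter is bumped: once the scanner has actually
    queued an LRU page for writeback in the vmscan pageout path. The wrapper below is
    schematic, not the literal patch:

    #include <linux/mm.h>
    #include <linux/vmstat.h>

    /* Schematic: called after the scanner starts writeback of an LRU page. */
    static void note_vmscan_write(struct page *page)
    {
        inc_zone_page_state(page, NR_VMSCAN_WRITE); /* shows up in vmstat/zoneinfo */
    }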
     

02 Sep, 2006

1 commit

  • The ZVC counter update threshold is currently set to a fixed value of 32.
    This patch sets up the threshold depending on the number of processors and
    the sizes of the zones in the system.

    With the current threshold of 32, I was able to observe slight contention
    when more than 130-140 processors concurrently updated the counters. The
    contention vanished when I either increased the threshold to 64 or used
    Andrew's idea of overstepping the interval (see ZVC overstep patch).

    However, we saw contention again at 220-230 processors. So we need higher
    values for larger systems.

    But the current default is already a bit of an overkill for smaller
    systems. Some systems have tiny zones where precision matters. For
    example i386 and x86_64 have 16M DMA zones and either 900M ZONE_NORMAL or
    ZONE_DMA32. These are even present on SMP and NUMA systems.

    The patch here sets up a threshold based on the number of processors in the
    system and the size of the zone that these counters are used for. The
    threshold should grow logarithmically, so we use fls() as an easy
    approximation.

    Results of tests on a system with 1024 processors (4TB RAM)

    The following output is from a test allocating 1GB of memory concurrently
    on each processor (Forking the process. So contention on mmap_sem and the
    pte locks is not a factor):

    X MIN
    TYPE: CPUS WALL WALL SYS USER TOTCPU
    fork 1 0.552 0.552 0.540 0.012 0.552
    fork 4 0.552 0.548 2.164 0.036 2.200
    fork 16 0.564 0.548 8.812 0.164 8.976
    fork 128 0.580 0.572 72.204 1.208 73.412
    fork 256 1.300 0.660 310.400 2.160 312.560
    fork 512 3.512 0.696 1526.836 4.816 1531.652
    fork 1020 20.024 0.700 17243.176 6.688 17249.863

    So a threshold of 32 is fine up to 128 processors. At 256 processors contention
    becomes a factor.

    Overstepping the counter (earlier patch) improves the numbers a bit:

    fork 4 0.552 0.548 2.164 0.040 2.204
    fork 16 0.552 0.548 8.640 0.148 8.788
    fork 128 0.556 0.548 69.676 0.956 70.632
    fork 256 0.876 0.636 212.468 2.108 214.576
    fork 512 2.276 0.672 997.324 4.260 1001.584
    fork 1020 13.564 0.680 11586.436 6.088 11592.523

    Still contention at 512 and 1020. Contention at 1020 is down by a third.
    256 still has a slight bit of contention.

    After this patch the counter threshold will be set to 125 which reduces
    contention significantly:

    fork 128 0.560 0.548 69.776 0.932 70.708
    fork 256 0.636 0.556 143.460 2.036 145.496
    fork 512 0.640 0.548 284.244 4.236 288.480
    fork 1020 1.500 0.588 1326.152 8.892 1335.044

    [akpm@osdl.org: !SMP build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
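    A sketch of a threshold calculation with the properties described: it grows
    roughly logarithmically with the number of processors and with the zone size (via
    fls()) and is capped at 125. The exact formula is an approximation of the patch:

    #include <linux/bitops.h>
    #include <linux/cpumask.h>
    #include <linux/kernel.h>
    #include <linux/mm.h>

    static int calculate_threshold(struct zone *zone)
    {
        int threshold;
        int mem;        /* zone memory in units of 128MB */

        mem = zone->present_pages >> (27 - PAGE_SHIFT);

        /* fls() is a cheap log2 approximation. */
        threshold = 2 * fls(num_online_cpus()) * (1 + fls(mem));

        /* Never exceed 125; very large systems saturate here. */
        return min(125, threshold);
    }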