Eric Lee / smarc-fsl-linux-kernel

19 Oct, 2007

4 commits

fc6cd25b7 sysctl: Error on bad sysctl tables ... Browse Code »

After going through the kernels sysctl tables several times it has become
clear that code review and testing is just not effective in prevent
problematic sysctl tables from being used in the stable kernel. I certainly
can't seem to fix the problems as fast as they are introduced.

Therefore this patch adds sysctl_check_table which is called when a sysctl
table is registered and checks to see if we have a problematic sysctl table.

The biggest part of the code is the table of valid binary sysctl entries, but
since we have frozen our set of binary sysctls this table should not need to
change, and it makes it much easier to detect when someone unintentionally
adds a new binary sysctl value.

As best as I can determine all of the several hundred errors spewed on boot up
now are legitimate.

[bunk@kernel.org: kernel/sysctl_check.c must #include ]
Signed-off-by: Eric W. Biederman
Cc: Alexey Dobriyan
Signed-off-by: Adrian Bunk
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric W. Biederman
2007-10-19 05:37:23 +0800
f429cd37a sysctl: properly register the irda binary sysctl numbers ... Browse Code »

Grumble. These numbers should have been in sysctl.h from the beginning if we
ever expected anyone to use them. Oh well put them there now so we can find
them and make maintenance easier.

Signed-off-by: Eric W. Biederman
Acked-by: Samuel Ortiz
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric W. Biederman
2007-10-19 05:37:23 +0800
49a0c4583 sysctl: Factor out sysctl_data. ... Browse Code »

There as been no easy way to wrap the default sysctl strategy routine except
for returning 0. Which is not always what we want. The few instances I have
seen that want different behaviour have written their own version of
sysctl_data. While not too hard it is unnecessary code and has the potential
for extra bugs.

So to make these situations easier and make that part of sysctl more symetric
I have factord sysctl_data out of do_sysctl_strategy and exported as a
function everyone can use.

Further having sysctl_data be an explicit function makes checking for badly
formed sysctl tables much easier.

Signed-off-by: Eric W. Biederman
Cc: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric W. Biederman
2007-10-19 05:37:22 +0800
d8217f076 sysctl core: Stop using the unnecessary ctl_table typedef ... Browse Code »

In sysctl.h the typedef struct ctl_table ctl_table violates coding style isn't
needed and is a bit of a nuisance because it makes it harder to recognize
ctl_table is a type name.

So this patch removes it from the generic sysctl code. Hopefully I will have
enough energy to send the rest of my patches will follow and to remove it from
the rest of the kernel.

Signed-off-by: Eric W. Biederman
Cc: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric W. Biederman
2007-10-19 05:37:22 +0800

01 Aug, 2007

1 commit

0d0ed42e5 Add CTL_PROC back ... Browse Code »

commit eab03ac7bd3e0da99eb9dc068772a85a5e3f3577 aka
"[PATCH] Get rid of /proc/sys/proc" was good commit except strace(1) compile
breakage it introduced:

system.c:1581: error: 'CTL_PROC' undeclared here (not in a function)

So, add dummy enum back.

Signed-off-by: Alexey Dobriyan
Cc: Stephen Hemminger
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2007-08-01 06:39:39 +0800

26 Apr, 2007

4 commits

516299d2f [NETFILTER]: bridge-nf: filter bridged IPv4/IPv6 encapsulated in pppoe traffic ... Browse Code »

The attached patch by Michael Milner adds support for using iptables and
ip6tables on bridged traffic encapsulated in ppoe frames, similar to
what's already supported for vlan.

Signed-off-by: Michael Milner
Signed-off-by: Bart De Schuymer
Signed-off-by: Patrick McHardy
Signed-off-by: David S. Miller

Michael Milner
2007-04-26 13:28:57 +0800
a2a316fd0 [NET]: Replace CONFIG_NET_DEBUG with sysctl. ... Browse Code »

Covert network warning messages from a compile time to runtime choice.
Removes kernel config option and replaces it with new /proc/sys/net/core/warnings.

Signed-off-by: Stephen Hemminger
Signed-off-by: David S. Miller

Stephen Hemminger
2007-04-26 13:24:05 +0800
3cfe3baaf [TCP]: Add two new spurious RTO responses to FRTO ... Browse Code »

New sysctl tcp_frto_response is added to select amongst these
responses:
- Rate halving based; reuses CA_CWR state (default)
- Very conservative; used to be the only one available (=1)
- Undo cwr; undoes ssthresh and cwnd reductions (=2)

The response with rate halving requires a new parameter to
tcp_enter_cwr because FRTO has already reduced ssthresh and
doing a second reduction there has to be prevented. In addition,
to keep things nice on 80 cols screen, a local variable was
added.

Signed-off-by: Ilpo Järvinen
Signed-off-by: David S. Miller

Ilpo Järvinen
2007-04-26 13:23:23 +0800
886236c12 [TCP]: Add RFC3742 Limited Slow-Start, controlled by variable sysctl_tcp_max_ssthresh. ... Browse Code »

Signed-off-by: John Heffner
Signed-off-by: David S. Miller

John Heffner
2007-04-26 13:23:19 +0800

25 Apr, 2007

1 commit

0bcbc9262 [IPV6]: Disallow RH0 by default. ... Browse Code »

A security issue is emerging. Disallow Routing Header Type 0 by default
as we have been doing for IPv4.
Note: We allow RH2 by default because it is harmless.

Signed-off-by: YOSHIFUJI Hideaki
Signed-off-by: David S. Miller

YOSHIFUJI Hideaki
2007-04-25 05:58:30 +0800

15 Feb, 2007

13 commits

3fbfa9811 [PATCH] sysctl: remove the proc_dir_entry member for the sysctl tables ... Browse Code »

It isn't needed anymore, all of the users are gone, and all of the ctl_table
initializers have been converted to use explicit names of the fields they are
initializing.

[akpm@osdl.org: NTFS fix]
Signed-off-by: Eric W. Biederman
Acked-by: Stephen Smalley
Cc: James Morris
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric W. Biederman
2007-02-15 00:10:00 +0800
d912b0cc1 [PATCH] sysctl: add a parent entry to ctl_table and set the parent entry ... Browse Code »

Add a parent entry into the ctl_table so you can walk the list of parents and
find the entire path to a ctl_table entry.

Signed-off-by: Eric W. Biederman
Cc: Stephen Smalley
Cc: James Morris
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric W. Biederman
2007-02-15 00:10:00 +0800
77b14db50 [PATCH] sysctl: reimplement the sysctl proc support ... Browse Code »

With this change the sysctl inodes can be cached and nothing needs to be done
when removing a sysctl table.

For a cost of 2K code we will save about 4K of static tables (when we remove
de from ctl_table) and 70K in proc_dir_entries that we will not allocate, or
about half that on a 32bit arch.

The speed feels about the same, even though we can now cache the sysctl
dentries :(

We get the core advantage that we don't need to have a 1 to 1 mapping between
ctl table entries and proc files. Making it possible to have /proc/sys vary
depending on the namespace you are in. The currently merged namespaces don't
have an issue here but the network namespace under /proc/sys/net needs to have
different directories depending on which network adapters are visible. By
simply being a cache different directories being visible depending on who you
are is trivial to implement.

[akpm@osdl.org: fix uninitialised var]
[akpm@osdl.org: fix ARM build]
[bunk@stusta.de: make things static]
Signed-off-by: Eric W. Biederman
Cc: Russell King
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric W. Biederman
2007-02-15 00:10:00 +0800
1ff007eb8 [PATCH] sysctl: allow sysctl_perm to be called from outside of sysctl.c ... Browse Code »

Signed-off-by: Eric W. Biederman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric W. Biederman
2007-02-15 00:09:59 +0800
805b5d5e0 [PATCH] sysctl: factor out sysctl_head_next from do_sysctl ... Browse Code »

The current logic to walk through the list of sysctl table headers is slightly
painful and implement in a way it cannot be used by code outside sysctl.c

I am in the process of implementing a version of the sysctl proc support that
instead of using the proc generic non-caching monster, just uses the existing
sysctl data structure as backing store for building the dcache entries and for
doing directory reads. To use the existing data structures however I need a
way to get at them.

[akpm@osdl.org: warning fix]
Signed-off-by: Eric W. Biederman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric W. Biederman
2007-02-15 00:09:59 +0800
0b4d41471 [PATCH] sysctl: remove insert_at_head from register_sysctl ... Browse Code »

The semantic effect of insert_at_head is that it would allow new registered
sysctl entries to override existing sysctl entries of the same name. Which is
pain for caching and the proc interface never implemented.

I have done an audit and discovered that none of the current users of
register_sysctl care as (excpet for directories) they do not register
duplicate sysctl entries.

So this patch simply removes the support for overriding existing entries in
the sys_sysctl interface since no one uses it or cares and it makes future
enhancments harder.

Signed-off-by: Eric W. Biederman
Acked-by: Ralf Baechle
Acked-by: Martin Schwidefsky
Cc: Russell King
Cc: David Howells
Cc: "Luck, Tony"
Cc: Ralf Baechle
Cc: Paul Mackerras
Cc: Martin Schwidefsky
Cc: Andi Kleen
Cc: Jens Axboe
Cc: Corey Minyard
Cc: Neil Brown
Cc: "John W. Linville"
Cc: James Bottomley
Cc: Jan Kara
Cc: Trond Myklebust
Cc: Mark Fasheh
Cc: David Chinner
Cc: "David S. Miller"
Cc: Patrick McHardy
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric W. Biederman
2007-02-15 00:09:59 +0800
6703ddfcc [PATCH] sysctl: remove support for CTL_ANY ... Browse Code »

There are currently no users in the kernel for CTL_ANY and it only has effect
on the binary interface which is practically unused.

So this complicates sysctl lookups for no good reason so just remove it.

Signed-off-by: Eric W. Biederman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric W. Biederman
2007-02-15 00:09:59 +0800
0e03036c9 [PATCH] sysctl: register the ocfs2 sysctl numbers ... Browse Code »

ocfs2 was did not have the binary number it uses under CTL_FS registered in
sysctl.h. Register it to avoid future conflicts, and change the name of the
definition to be in line with the rest of the sysctl numbers.

Signed-off-by: Eric W. Biederman
Acked-by: Mark Fasheh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric W. Biederman
2007-02-15 00:09:58 +0800
59fc5313b [PATCH] sysctl: register the sysctl number used by the arlan driver ... Browse Code »

Signed-off-by: Eric W. Biederman
Cc: "John W. Linville"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric W. Biederman
2007-02-15 00:09:57 +0800
feceb63ec [PATCH] sysctl: s390: move sysctl definitions to sysctl.h ... Browse Code »

We need to have the the definition of all top level sysctl directories
registers in sysctl.h so we don't conflict by accident and cause abi problems.

Signed-off-by: Eric W. Biederman
Acked-by: Martin Schwidefsky
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric W. Biederman
2007-02-15 00:09:57 +0800
77f6dfb12 [PATCH] sysctl: move CTL_FRV into sysctl.h where it belongs ... Browse Code »

Signed-off-by: Eric W. Biederman
Acked-by: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric W. Biederman
2007-02-15 00:09:56 +0800
462591b88 [PATCH] sysctl: move CTL_PM into sysctl.h where it belongs ... Browse Code »

Signed-off-by: Eric W. Biederman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric W. Biederman
2007-02-15 00:09:56 +0800
50d851f72 [PATCH] sysctl: move CTL_SUNRPC to sysctl.h where it belongs ... Browse Code »

Signed-off-by: Eric W. Biederman
Cc: Trond Myklebust
Cc: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric W. Biederman
2007-02-15 00:09:55 +0800

09 Feb, 2007

1 commit

39e21c0d3 [X.25]: Adds /proc/sys/net/x25/x25_forward to control forwarding. ... Browse Code »

echo "1" > /proc/sys/net/x25/x25_forward
To turn on x25_forwarding, defaults to off
Requires the previous patch.

Signed-off-by: Andrew Hendry
Signed-off-by: David S. Miller

Andrew Hendry
2007-02-09 05:34:36 +0800

13 Dec, 2006

1 commit

93aec2040 Remove duplicate "have to" in comment ... Browse Code »

Introduced in commit 7cc13edc139108bb527b692f0548dce6bc648572.

Signed-off-by: Rolf Eike Beer
Acked-by: Eric W. Biederman
Signed-off-by: Adrian Bunk

Rolf Eike Beer
2006-12-13 02:23:02 +0800

11 Dec, 2006

1 commit

1f29bcd73 [PATCH] sysctl: remove unused "context" param ... Browse Code »

Signed-off-by: Alexey Dobriyan
Cc: Andi Kleen
Cc: "David S. Miller"
Cc: David Howells
Cc: Ralf Baechle
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2006-12-11 01:55:41 +0800

03 Dec, 2006

5 commits

438426044 [DCCP]: Remove allocation of sysctl numbers ... Browse Code »

This is in response to a request sent earlier by Eric W. Biederman
and replaces all sysctl numbers for net.dccp.default with CTL_UNNUMBERED.

It has been tested to compile and to work.

Commiter note: I've removed the use of CTL_UNNUMBERED, not setting .ctl_name
sets it to 0, that is the what CTL_UNNUMBERED is, reason is
to avoid unneeded source code cluttering.

Signed-off-by: Gerrit Renker
Signed-off-by: Ian McDonald
Signed-off-by: Arnaldo Carvalho de Melo

Gerrit Renker
2006-12-03 13:30:56 +0800
82e3ab9db [DCCP]: Adds the tx buffer sysctls ... Browse Code »

This one got lost on the way from Ian to Gerrit to me, fix it.

Signed-off-by: Ian McDonald
Signed-off-by: Arnaldo Carvalho de Melo

Ian McDonald
2006-12-03 13:24:42 +0800
2e2e9e92b [DCCP]: Add sysctls to control retransmission behaviour ... Browse Code »

This adds 3 sysctls which govern the retransmission behaviour of DCCP control
packets (3way handshake, feature negotiation).

It removes 4 FIXMEs from the code.

The close resemblance of sysctl variables to their TCP analogues is emphasised
not only by their name, but also by giving them the same initial values.
This is useful since there is not much practical experience with DCCP yet.

Furthermore, with regard to the previous patch, it is now possible to limit
the number of keepalive-Responses by setting net.dccp.default.request_retries
(also a bit like in TCP).

Lastly, added documentation of all existing DCCP sysctls.

Signed-off-by: Gerrit Renker
Signed-off-by: Arnaldo Carvalho de Melo

Gerrit Renker
2006-12-03 13:22:18 +0800
ce7bc3bf1 [TCP]: Restrict congestion control choices. ... Browse Code »

Allow normal users to only choose among a restricted set of congestion
control choices. The default is reno and what ever has been configured
as default. But the policy can be changed by administrator at any time.

For example, to allow any choice:
cp /proc/sys/net/ipv4/tcp_available_congestion_control \
/proc/sys/net/ipv4/tcp_allowed_congestion_control

Signed-off-by: Stephen Hemminger
Signed-off-by: David S. Miller

Stephen Hemminger
2006-12-03 13:21:49 +0800
3ff825b28 [TCP]: Add tcp_available_congestion_control sysctl. ... Browse Code »

Create /proc/sys/net/ipv4/tcp_available_congestion_control
that reflects currently available TCP choices.

Signed-off-by: Stephen Hemminger
Signed-off-by: David S. Miller

Stephen Hemminger
2006-12-03 13:21:48 +0800

06 Nov, 2006

2 commits

7cc13edc1 [PATCH] sysctl: implement CTL_UNNUMBERED ... Browse Code »

This patch takes the CTL_UNNUMBERD concept from NFS and makes it available to
all new sysctl users.

At the same time the sysctl binary interface maintenance documentation is
updated to mention and to describe what is needed to successfully maintain the
sysctl binary interface.

Signed-off-by: Eric W. Biederman
Acked-by: Alan Cox
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric W. Biederman
2006-11-06 17:46:23 +0800
d99f160ac [PATCH] sysctl: allow a zero ctl_name in the middle of a sysctl table ... Browse Code »

Since it is becoming clear that there are just enough users of the binary
sysctl interface that completely removing the binary interface from the kernel
will not be an option for foreseeable future, we need to find a way to address
the sysctl maintenance issues.

The basic problem is that sysctl requires one central authority to allocate
sysctl numbers, or else conflicts and ABI breakage occur. The proc interface
to sysctl does not have that problem, as names are not densely allocated.

By not terminating a sysctl table until I have neither a ctl_name nor a
procname, it becomes simple to add sysctl entries that don't show up in the
binary sysctl interface. Which allows people to avoid allocating a binary
sysctl value when not needed.

I have audited the kernel code and in my reading I have not found a single
sysctl table that wasn't terminated by a completely zero filled entry. So
this change in behavior should not affect anything.

I think this mechanism eases the pain enough that combined with a little
disciple we can solve the reoccurring sysctl ABI breakage.

Signed-off-by: Eric W. Biederman
Acked-by: Alan Cox
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric W. Biederman
2006-11-06 17:46:23 +0800

27 Sep, 2006

1 commit

b27824083 Merge branch 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6 ... Browse Code »

* 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6: (225 commits)
[PATCH] Don't set calgary iommu as default y
[PATCH] i386/x86-64: New Intel feature flags
[PATCH] x86: Add a cumulative thermal throttle event counter.
[PATCH] i386: Make the jiffies compares use the 64bit safe macros.
[PATCH] x86: Refactor thermal throttle processing
[PATCH] Add 64bit jiffies compares (for use with get_jiffies_64)
[PATCH] Fix unwinder warning in traps.c
[PATCH] x86: Allow disabling early pci scans with pci=noearly or disallowing conf1
[PATCH] x86: Move direct PCI scanning functions out of line
[PATCH] i386/x86-64: Make all early PCI scans dependent on CONFIG_PCI
[PATCH] Don't leak NT bit into next task
[PATCH] i386/x86-64: Work around gcc bug with noreturn functions in unwinder
[PATCH] Fix some broken white space in ia32_signal.c
[PATCH] Initialize argument registers for 32bit signal handlers.
[PATCH] Remove all traces of signal number conversion
[PATCH] Don't synchronize time reading on single core AMD systems
[PATCH] Remove outdated comment in x86-64 mmconfig code
[PATCH] Use string instructions for Core2 copy/clear
[PATCH] x86: - restore i8259A eoi status on resume
[PATCH] i386: Split multi-line printk in oops output.
...

Linus Torvalds
2006-09-27 04:07:55 +0800

26 Sep, 2006

3 commits

0ff38490c [PATCH] zone_reclaim: dynamic slab reclaim ... Browse Code »

Currently one can enable slab reclaim by setting an explicit option in
/proc/sys/vm/zone_reclaim_mode. Slab reclaim is then used as a final
option if the freeing of unmapped file backed pages is not enough to free
enough pages to allow a local allocation.

However, that means that the slab can grow excessively and that most memory
of a node may be used by slabs. We have had a case where a machine with
46GB of memory was using 40-42GB for slab. Zone reclaim was effective in
dealing with pagecache pages. However, slab reclaim was only done during
global reclaim (which is a bit rare on NUMA systems).

This patch implements slab reclaim during zone reclaim. Zone reclaim
occurs if there is a danger of an off node allocation. At that point we

1. Shrink the per node page cache if the number of pagecache
pages is more than min_unmapped_ratio percent of pages in a zone.

2. Shrink the slab cache if the number of the nodes reclaimable slab pages
(patch depends on earlier one that implements that counter)
are more than min_slab_ratio (a new /proc/sys/vm tunable).

The shrinking of the slab cache is a bit problematic since it is not node
specific. So we simply calculate what point in the slab we want to reach
(current per node slab use minus the number of pages that neeed to be
allocated) and then repeately run the global reclaim until that is
unsuccessful or we have reached the limit. I hope we will have zone based
slab reclaim at some point which will make that easier.

The default for the min_slab_ratio is 5%

Also remove the slab option from /proc/sys/vm/zone_reclaim_mode.

[akpm@osdl.org: cleanups]
Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Lameter
2006-09-26 23:48:51 +0800
8da5adda9 [PATCH] x86: Allow users to force a panic on NMI ... Browse Code »

To quote Alan Cox:

The default Linux behaviour on an NMI of either memory or unknown is to
continue operation. For many environments such as scientific computing
it is preferable that the box is taken out and the error dealt with than
an uncorrected parity/ECC error get propogated.

A small number of systems do generate NMI's for bizarre random reasons
such as power management so the default is unchanged. In other respects
the new proc/sys entry works like the existing panic controls already in
that directory.

This is separate to the edac support - EDAC allows supported chipsets to
handle ECC errors well, this change allows unsupported cases to at least
panic rather than cause problems further down the line.

Signed-off-by: Don Zickus
Signed-off-by: Andi Kleen

Don Zickus
2006-09-26 16:52:27 +0800
407984f1a [PATCH] x86: Add abilty to enable/disable nmi watchdog with sysctl ... Browse Code »

Adds a new /proc/sys/kernel/nmi call that will enable/disable the nmi
watchdog.

Signed-off-by: Don Zickus
Signed-off-by: Andi Kleen

Don Zickus
2006-09-26 16:52:27 +0800

23 Sep, 2006

2 commits

fbea49e1e [IPV6] NDISC: Add proxy_ndp sysctl. ... Browse Code »

We do not always need proxy NDP functionality even we
enable forwarding.

Signed-off-by: YOSHIFUJI Hideaki
Signed-off-by: David S. Miller

YOSHIFUJI Hideaki
2006-09-23 06:20:25 +0800
446fda4f2 [NetLabel]: CIPSOv4 engine ... Browse Code »

Add support for the Commercial IP Security Option (CIPSO) to the IPv4
network stack. CIPSO has become a de-facto standard for
trusted/labeled networking amongst existing Trusted Operating Systems
such as Trusted Solaris, HP-UX CMW, etc. This implementation is
designed to be used with the NetLabel subsystem to provide explicit
packet labeling to LSM developers.

The CIPSO/IPv4 packet labeling works by the LSM calling a NetLabel API
function which attaches a CIPSO label (IPv4 option) to a given socket;
this in turn attaches the CIPSO label to every packet leaving the
socket without any extra processing on the outbound side. On the
inbound side the individual packet's sk_buff is examined through a
call to a NetLabel API function to determine if a CIPSO/IPv4 label is
present and if so the security attributes of the CIPSO label are
returned to the caller of the NetLabel API function.

Signed-off-by: Paul Moore
Signed-off-by: David S. Miller

Paul Moore
2006-09-23 05:53:33 +0800

04 Jul, 2006

1 commit

9614634fe [PATCH] ZVC/zone_reclaim: Leave 1% of unmapped pagecache pages for file I/O ... Browse Code »

It turns out that it is advantageous to leave a small portion of unmapped file
backed pages if all of a zone's pages (or almost all pages) are allocated and
so the page allocator has to go off-node.

This allows recently used file I/O buffers to stay on the node and
reduces the times that zone reclaim is invoked if file I/O occurs
when we run out of memory in a zone.

The problem is that zone reclaim runs too frequently when the page cache is
used for file I/O (read write and therefore unmapped pages!) alone and we have
almost all pages of the zone allocated. Zone reclaim may remove 32 unmapped
pages. File I/O will use these pages for the next read/write requests and the
unmapped pages increase. After the zone has filled up again zone reclaim will
remove it again after only 32 pages. This cycle is too inefficient and there
are potentially too many zone reclaim cycles.

With the 1% boundary we may still remove all unmapped pages for file I/O in
zone reclaim pass. However. it will take a large number of read and writes
to get back to 1% again where we trigger zone reclaim again.

The zone reclaim 2.6.16/17 does not show this behavior because we have a 30
second timeout.

[akpm@osdl.org: rename the /proc file and the variable]
Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Lameter
2006-07-04 06:26:59 +0800