18 Jul, 2007
1 commit
-
This patch adds the kernelcore= parameter for x86.
Once all patches are applied, a new command-line parameter exist and a new
sysctl. This patch adds the necessary documentation.From: Yasunori Goto
When "kernelcore" boot option is specified, kernel can't boot up on ia64
because of an infinite loop. In addition, the parsing code can be handled
in an architecture-independent manner.This patch uses common code to handle the kernelcore= parameter. It is
only available to architectures that support arch-independent zone-sizing
(i.e. define CONFIG_ARCH_POPULATES_NODE_MAP). Other architectures will
ignore the boot parameter.[bunk@stusta.de: make cmdline_parse_kernelcore() static]
Signed-off-by: Mel Gorman
Signed-off-by: Yasunori Goto
Acked-by: Andy Whitcroft
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
17 Jul, 2007
2 commits
-
Poeple keep on adding new numbered sysctls, when they're supposed not to.
Add a documentation file which explain why new sysctls should use
CTL_UNNUMBERED. The next patch will sprinkle pointers to this throughout
sysctl.c.Eric provided the text (thanks)
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Make zonelist creation policy selectable from sysctl/boot option v6.
This patch makes NUMA's zonelist (of pgdat) order selectable.
Available order are Default(automatic)/ Node-based / Zone-based.[Default Order]
The kernel selects Node-based or Zone-based order automatically.[Node-based Order]
This policy treats the locality of memory as the most important parameter.
Zonelist order is created by each zone's locality. This means lower zones
(ex. ZONE_DMA) can be used before higher zone (ex. ZONE_NORMAL) exhausion.
IOW. ZONE_DMA will be in the middle of zonelist.
current 2.6.21 kernel uses this.Pros.
* A user can expect local memory as much as possible.
Cons.
* lower zone will be exhansted before higher zone. This may cause OOM_KILL.Maybe suitable if ZONE_DMA is relatively big and you never see OOM_KILL
because of ZONE_DMA exhaution and you need the best locality.(example)
assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.*node(0)'s memory allocation order:
node(0)'s NORMAL -> node(0)'s DMA -> node(1)'s NORMAL.
*node(1)'s memory allocation order:
node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.
[Zone-based order]
This policy treats the zone type as the most important parameter.
Zonelist order is created by zone-type order. This means lower zone
never be used bofere higher zone exhaustion.
IOW. ZONE_DMA will be always at the tail of zonelist.Pros.
* OOM_KILL(bacause of lower zone) occurs only if the whole zones are exhausted.
Cons.
* memory locality may not be best.(example)
assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.*node(0)'s memory allocation order:
node(0)'s NORMAL -> node(1)'s NORMAL -> node(0)'s DMA.
*node(1)'s memory allocation order:
node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.
bootoption "numa_zonelist_order=" and proc/sysctl is supporetd.
command:
%echo N > /proc/sys/vm/numa_zonelist_orderWill rebuild zonelist in Node-based order.
command:
%echo Z > /proc/sys/vm/numa_zonelist_orderWill rebuild zonelist in Zone-based order.
Thanks to Lee Schermerhorn, he gives me much help and codes.
[Lee.Schermerhorn@hp.com: add check_highest_zone to build_zonelists_in_zone_order]
[akpm@linux-foundation.org: build fix]
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Lee Schermerhorn
Cc: Christoph Lameter
Cc: Andi Kleen
Cc: "jesse.barnes@intel.com"
Signed-off-by: Lee Schermerhorn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
12 Jul, 2007
1 commit
-
Add a new security check on mmap operations to see if the user is attempting
to mmap to low area of the address space. The amount of space protected is
indicated by the new proc tunable /proc/sys/vm/mmap_min_addr and defaults to
0, preserving existing behavior.This patch uses a new SELinux security class "memprotect." Policy already
contains a number of allow rules like a_t self:process * (unconfined_t being
one of them) which mean that putting this check in the process class (its
best current fit) would make it useless as all user processes, which we also
want to protect against, would be allowed. By taking the memprotect name of
the new class it will also make it possible for us to move some of the other
memory protect permissions out of 'process' and into the new class next time
we bump the policy version number (which I also think is a good future idea)Acked-by: Stephen Smalley
Acked-by: Chris Wright
Signed-off-by: Eric Paris
Signed-off-by: James Morris
09 May, 2007
2 commits
-
Fix various typos in kernel docs and Kconfigs, 2.6.21-rc4.
Signed-off-by: Matt LaPlante
Signed-off-by: Adrian Bunk -
Fix the misspellings of "propogate", "writting" and (oh, the shame
:-) "kenrel" in the source tree.Signed-off-by: Robert P. J. Day
Signed-off-by: Adrian Bunk
08 May, 2007
1 commit
-
The current panic_on_oom may not work if there is a process using
cpusets/mempolicy, because other nodes' memory may remain. But some people
want failover by panic ASAP even if they are used. This patch makes new
setting for its request.This is tested on my ia64 box which has 3 nodes.
Signed-off-by: Yasunori Goto
Signed-off-by: Benjamin LaHaise
Cc: Christoph Lameter
Cc: Paul Jackson
Cc: Ethan Solomita
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
07 Dec, 2006
1 commit
-
Add sysctl for kstack_depth_to_print. This lets users change
the amount of raw stack data printed in dump_stack() without
having to reboot.Signed-off-by: Chuck Ebbert
Signed-off-by: Andi Kleen
30 Nov, 2006
1 commit
-
This patch fixes typos in various Documentation txts. The patch addresses some
misc words.Signed-off-by: Matt LaPlante
Acked-by: Randy Dunlap
Signed-off-by: Adrian Bunk
12 Oct, 2006
1 commit
-
The pipe-a-coredump-to-a-program feature was undocumented.
*Grumble*.NB: a good enhancement to that patch would be: save all the stuff that a
core file can get from the %x expansions in the environment.Signed-off-by: Matthias Urlichs
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
26 Sep, 2006
1 commit
-
Currently one can enable slab reclaim by setting an explicit option in
/proc/sys/vm/zone_reclaim_mode. Slab reclaim is then used as a final
option if the freeing of unmapped file backed pages is not enough to free
enough pages to allow a local allocation.However, that means that the slab can grow excessively and that most memory
of a node may be used by slabs. We have had a case where a machine with
46GB of memory was using 40-42GB for slab. Zone reclaim was effective in
dealing with pagecache pages. However, slab reclaim was only done during
global reclaim (which is a bit rare on NUMA systems).This patch implements slab reclaim during zone reclaim. Zone reclaim
occurs if there is a danger of an off node allocation. At that point we1. Shrink the per node page cache if the number of pagecache
pages is more than min_unmapped_ratio percent of pages in a zone.2. Shrink the slab cache if the number of the nodes reclaimable slab pages
(patch depends on earlier one that implements that counter)
are more than min_slab_ratio (a new /proc/sys/vm tunable).The shrinking of the slab cache is a bit problematic since it is not node
specific. So we simply calculate what point in the slab we want to reach
(current per node slab use minus the number of pages that neeed to be
allocated) and then repeately run the global reclaim until that is
unsuccessful or we have reached the limit. I hope we will have zone based
slab reclaim at some point which will make that easier.The default for the min_slab_ratio is 5%
Also remove the slab option from /proc/sys/vm/zone_reclaim_mode.
[akpm@osdl.org: cleanups]
Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
28 Aug, 2006
1 commit
-
Sergey Vlasov noticed that there is not kernel.suid_dumpable, but
fs.suid_dumpable.How KERN_SETUID_DUMPABLE ended up in fs_table[]? Hell knows...
Signed-off-by: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
06 Aug, 2006
1 commit
-
Signed-off-by: Maxime Bizon
Acked-by: Simon Horman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
04 Jul, 2006
1 commit
-
It turns out that it is advantageous to leave a small portion of unmapped file
backed pages if all of a zone's pages (or almost all pages) are allocated and
so the page allocator has to go off-node.This allows recently used file I/O buffers to stay on the node and
reduces the times that zone reclaim is invoked if file I/O occurs
when we run out of memory in a zone.The problem is that zone reclaim runs too frequently when the page cache is
used for file I/O (read write and therefore unmapped pages!) alone and we have
almost all pages of the zone allocated. Zone reclaim may remove 32 unmapped
pages. File I/O will use these pages for the next read/write requests and the
unmapped pages increase. After the zone has filled up again zone reclaim will
remove it again after only 32 pages. This cycle is too inefficient and there
are potentially too many zone reclaim cycles.With the 1% boundary we may still remove all unmapped pages for file I/O in
zone reclaim pass. However. it will take a large number of read and writes
to get back to 1% again where we trigger zone reclaim again.The zone reclaim 2.6.16/17 does not show this behavior because we have a 30
second timeout.[akpm@osdl.org: rename the /proc file and the variable]
Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
01 Jul, 2006
1 commit
-
The zone_reclaim_interval was necessary because we were not able to determine
how many unmapped pages exist in a zone. Therefore we had to scan in
intervals to figure out if any pages were unmapped.With the zoned counters and NR_ANON_PAGES we now know the number of pagecache
pages and the number of mapped pages in a zone. So we can simply skip the
reclaim if there is an insufficient number of unmapped pages. We use
SWAP_CLUSTER_MAX as the boundary.Drop all support for /proc/sys/vm/zone_reclaim_interval.
Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
23 Jun, 2006
1 commit
-
This patch adds panic_on_oom sysctl under sys.vm.
When sysctl vm.panic_on_oom = 1, the kernel panics intead of killing rogue
processes. And if vm.panic_on_oom is 0 the kernel will do oom_kill() in
the same way as it does today. Of course, the default value is 0 and only
root can modifies it.In general, oom_killer works well and kill rogue processes. So the whole
system can survive. But there are environments where panic is preferable
rather than kill some processes.Signed-off-by: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
21 Feb, 2006
1 commit
-
Currently, acpi video options can only be set on kernel command line. That's
little inflexible; I'd like userland s2ram application that just works, and
modifying kernel command line according to whitelist is not fun. It is better
to just allow s2ram application to set video options just before suspend
(according to the whitelist).This implements sysctl to allow setting suspend video options without reboot.
(akpm: Documentation updates for this new sysctl are pending..)
Signed-off-by: Pavel Machek
Cc: "Brown, Len"
Cc: "Antonino A. Daplas"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
02 Feb, 2006
3 commits
-
If large amounts of zone memory are used by empty slabs then zone_reclaim
becomes uneffective. This patch shakes the slab a bit.The problem with this patch is that the slab reclaim is not containable to a
zone. Thus slab reclaim may affect the whole system and be extremely slow.
This also means that we cannot determine how many pages were freed in this
zone. Thus we need to go off node for at least one allocation.The functionality is disabled by default.
We could modify the shrinkers to take a zone parameter but that would be quite
invasive. Better ideas are welcome.Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
In some situations one may want zone_reclaim to behave differently. For
example a process writing large amounts of memory will spew unto other nodes
to cache the writes if many pages in a zone become dirty. This may impact the
performance of processes running on other nodes.Allowing writes during reclaim puts a stop to that behavior and throttles the
process by restricting the pages to the local zone.Similarly one may want to contain processes to local memory by enabling
regular swap behavior during zone_reclaim. Off node memory allocation can
then be controlled through memory policies and cpusets.Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Currently the zone_reclaim code has a fixed window of 30 seconds of off node
allocations should a local zone have no unused pagecache pages left. Reclaim
will be attempted again after this timeout period to avoid repeated useless
scans for memory. This is also useful to established sufficiently large off
node allocation chunks to relieve the local node.It may be beneficial to adjust that time period for some special situations.
For example if memory use was exceeding node capacity one may want to give up
for longer periods of time. If memory spikes intermittendly then one may want
to shorten the time period to reduce the number of off node allocations.This patch allows just that....
Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
19 Jan, 2006
1 commit
-
proc support for zone reclaim
This patch creates a proc entry /proc/sys/vm/zone_reclaim_mode that may be
used to override the automatic determination of the zone reclaim made on
bootup.Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
09 Jan, 2006
2 commits
-
As recently there has been lot of traffic on the right values for batch and
high water marks for per_cpu_pagelists. This patch makes these two
variables configurable through /proc interface.A new tunable /proc/sys/vm/percpu_pagelist_fraction is added. This entry
controls the fraction of pages at most in each zone that are allocated for
each per cpu page list. The min value for this is 8. It means that we
don't allow more than 1/8th of pages in each zone to be allocated in any
single per_cpu_pagelist.The batch value of each per cpu pagelist is also updated as a result. It
is set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8)Signed-off-by: Rohit Seth
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Add /proc/sys/vm/drop_caches. When written to, this will cause the kernel to
discard as much pagecache and/or reclaimable slab objects as it can. THis
operation requires root permissions.It won't drop dirty data, so the user should run `sync' first.
Caveats:
a) Holds inode_lock for exorbitant amounts of time.
b) Needs to be taught about NUMA nodes: propagate these all the way through
so the discarding can be controlled on a per-node basis.This is a debugging feature: useful for getting consistent results between
filesystem benchmarks. We could possibly put it under a config option, but
it's less than 300 bytes.Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
24 Jun, 2005
1 commit
-
Add a new `suid_dumpable' sysctl:
This value can be used to query and set the core dump mode for setuid
or otherwise protected/tainted binaries. The modes are0 - (default) - traditional behaviour. Any process which has changed
privilege levels or is execute only will not be dumped1 - (debug) - all processes dump core when possible. The core dump is
owned by the current user and no security is applied. This is intended
for system debugging situations only. Ptrace is unchecked.2 - (suidsafe) - any binary which normally would not be dumped is dumped
readable by root only. This allows the end user to remove such a dump but
not access it directly. For security reasons core dumps in this mode will
not overwrite one another or other files. This mode is appropriate when
adminstrators are attempting to debug problems in a normal environment.(akpm:
> > +EXPORT_SYMBOL(suid_dumpable);
>
> EXPORT_SYMBOL_GPL?No problem to me.
> > if (current->euid == current->uid && current->egid == current->gid)
> > current->mm->dumpable = 1;
>
> Should this be SUID_DUMP_USER?Actually the feedback I had from last time was that the SUID_ defines
should go because its clearer to follow the numbers. They can go
everywhere (and there are lots of places where dumpable is tested/used
as a bool in untouched code)> Maybe this should be renamed to `dump_policy' or something. Doing that
> would help us catch any code which isn't using the #defines, too.Fair comment. The patch was designed to be easy to maintain for Red Hat
rather than for merging. Changing that field would create a gigantic
diff because it is used all over the place.)
Signed-off-by: Alan Cox
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
17 Apr, 2005
1 commit
-
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.Let it rip!