15 May, 2009

1 commit

  • This reverts commit fafd688e4c0c34da0f3de909881117d374e4c7af.

    Work is progressing to switch away from pdflush as the process backing
    for flushing out dirty data. So it seems pointless to add more knobs
    to control pdflush threads. The original author of the patch did not
    have any specific use cases for adding the knobs, so we can easily
    revert this before 2.6.30 to avoid having to maintain this API
    forever.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

03 May, 2009

1 commit

  • Avoid setting less than two pages for vm_dirty_bytes: this is necessary to
    avoid potential division by 0 (like the following) in get_dirty_limits().

    [ 49.951610] divide error: 0000 [#1] PREEMPT SMP
    [ 49.952195] last sysfs file: /sys/devices/pci0000:00/0000:00:01.1/host0/target0:0:0/0:0:0:0/block/sda/uevent
    [ 49.952195] CPU 1
    [ 49.952195] Modules linked in: pcspkr
    [ 49.952195] Pid: 3064, comm: dd Not tainted 2.6.30-rc3 #1
    [ 49.952195] RIP: 0010:[] [] get_dirty_limits+0xe9/0x2c0
    [ 49.952195] RSP: 0018:ffff88001de03a98 EFLAGS: 00010202
    [ 49.952195] RAX: 00000000000000c0 RBX: ffff88001de03b80 RCX: 28f5c28f5c28f5c3
    [ 49.952195] RDX: 0000000000000000 RSI: 00000000000000c0 RDI: 0000000000000000
    [ 49.952195] RBP: ffff88001de03ae8 R08: 0000000000000000 R09: 0000000000000000
    [ 49.952195] R10: ffff88001ddda9a0 R11: 0000000000000001 R12: 0000000000000001
    [ 49.952195] R13: ffff88001fbc8218 R14: ffff88001de03b70 R15: ffff88001de03b78
    [ 49.952195] FS: 00007fe9a435b6f0(0000) GS:ffff8800025d9000(0000) knlGS:0000000000000000
    [ 49.952195] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 49.952195] CR2: 00007fe9a39ab000 CR3: 000000001de38000 CR4: 00000000000006e0
    [ 49.952195] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 49.952195] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    [ 49.952195] Process dd (pid: 3064, threadinfo ffff88001de02000, task ffff88001ddda250)
    [ 49.952195] Stack:
    [ 49.952195] ffff88001fa0de00 ffff88001f2dbd70 ffff88001f9fe800 000080b900000000
    [ 49.952195] 00000000000000c0 ffff8800027a6100 0000000000000400 ffff88001fbc8218
    [ 49.952195] 0000000000000000 0000000000000600 ffff88001de03bb8 ffffffff802d3ed7
    [ 49.952195] Call Trace:
    [ 49.952195] [] balance_dirty_pages_ratelimited_nr+0x1d7/0x3f0
    [ 49.952195] [] ? ext3_writeback_write_end+0x9e/0x120
    [ 49.952195] [] generic_file_buffered_write+0x12f/0x330
    [ 49.952195] [] __generic_file_aio_write_nolock+0x26d/0x460
    [ 49.952195] [] ? generic_file_aio_write+0x52/0xd0
    [ 49.952195] [] generic_file_aio_write+0x69/0xd0
    [ 49.952195] [] ext3_file_write+0x26/0xc0
    [ 49.952195] [] do_sync_write+0xf1/0x140
    [ 49.952195] [] ? get_lock_stats+0x2a/0x60
    [ 49.952195] [] ? autoremove_wake_function+0x0/0x40
    [ 49.952195] [] vfs_write+0xcb/0x190
    [ 49.952195] [] sys_write+0x50/0x90
    [ 49.952195] [] system_call_fastpath+0x16/0x1b
    [ 49.952195] Code: 00 00 00 2b 05 09 1c 17 01 48 89 c6 49 0f af f4 48 c1 ee 02 48 89 f0 48 f7 e1 48 89 d6 31 d2 48 c1 ee 02 48 0f af 75 d0 48 89 f0 f7 f7 41 8b 95 ac 01 00 00 48 89 c7 49 0f af d4 48 c1 ea 02
    [ 49.952195] RIP [] get_dirty_limits+0xe9/0x2c0
    [ 49.952195] RSP
    [ 50.096523] ---[ end trace 008d7aa02f244d7b ]---

    Signed-off-by: Andrea Righi
    Cc: Peter Zijlstra
    Cc: David Rientjes
    Cc: Dave Chinner
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Righi
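
The two-page lower bound described in the commit above can be sketched as a simple range check. This is a minimal illustration, not the kernel code: the 4096-byte page size and the function name `set_dirty_bytes` are assumptions made for the example, and the commit does not say how the kernel reports a rejected write.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages; the real value depends on the architecture

# Smallest value accepted for dirty_bytes according to the commit text: two pages.
DIRTY_BYTES_MIN = 2 * PAGE_SIZE


def set_dirty_bytes(requested: int) -> int:
    """Hypothetical validator: refuse anything below two pages.

    Returns the accepted value, or raises ValueError so that a too-small
    setting never reaches the dirty-limit arithmetic.
    """
    if requested < DIRTY_BYTES_MIN:
        raise ValueError(
            f"dirty_bytes must be at least {DIRTY_BYTES_MIN} bytes (two pages)"
        )
    return requested
```

With these assumptions, a write of 8192 is accepted while a write of 4096 (a single page) is refused.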
     

14 Apr, 2009

1 commit


07 Apr, 2009

1 commit

  • Add /proc entries to give the admin the ability to control the minimum and
    maximum number of pdflush threads. This allows finer control of pdflush
    on both large and small machines.

    The rationale is simply one size does not fit all. Admins on large and/or
    small systems may want to tune the min/max pdflush thread count to best
    suit their needs. Right now the min/max is hardcoded to 2/8. While
    probably a fair estimate for smaller machines, large machines with large
    numbers of CPUs and large numbers of filesystems/block devices may benefit
    from larger numbers of threads working on different block devices.

    Even if the background flushing algorithm is radically changed, it is
    still likely that multiple threads will be involved and admins would still
    desire finer control on the min/max other than to have to recompile the
    kernel.

    The patch adds '/proc/sys/vm/nr_pdflush_threads_min' and
    '/proc/sys/vm/nr_pdflush_threads_max' with r/w permissions.

    The minimum value for nr_pdflush_threads_min is 1 and the maximum value is
    the current value of nr_pdflush_threads_max. This minimum is required
    since additional thread creation is performed in a pdflush thread itself.

    The minimum value for nr_pdflush_threads_max is the current value of
    nr_pdflush_threads_min and the maximum value can be 1000.

    Documentation/sysctl/vm.txt is also updated.

    [akpm@linux-foundation.org: fix comment, fix whitespace, use __read_mostly]
    Signed-off-by: Peter W Morreale
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter W Morreale
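
The interlocking bounds on the two knobs can be modelled as follows. This is a sketch of the rules stated in the commit message only; the class and method names are hypothetical, and the defaults of 2 and 8 are the hardcoded values the message mentions.

```python
PDFLUSH_MAX_CEILING = 1000  # upper bound for nr_pdflush_threads_max, from the commit text


class PdflushLimits:
    """Hypothetical model of nr_pdflush_threads_min / nr_pdflush_threads_max."""

    def __init__(self, min_threads: int = 2, max_threads: int = 8):
        self.min_threads = min_threads
        self.max_threads = max_threads

    def set_min(self, value: int) -> None:
        # The minimum may range from 1 up to the current maximum.
        if not 1 <= value <= self.max_threads:
            raise ValueError("min must be between 1 and the current max")
        self.min_threads = value

    def set_max(self, value: int) -> None:
        # The maximum may range from the current minimum up to 1000.
        if not self.min_threads <= value <= PDFLUSH_MAX_CEILING:
            raise ValueError("max must be between the current min and 1000")
        self.max_threads = value
```

Because each bound depends on the current value of the other, raising the minimum above 8 requires raising the maximum first.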
     

03 Apr, 2009

2 commits

  • The previous description of the system parameters in /proc/sys/net/unix/ is
    wrong (or missing). Simply add a new description of unix_dgram_qlen
    according to the latest kernel.

    Signed-off-by: Li Xiaodong
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Xiaodong
     
  • /proc/sys is now described in many places and much of the information is
    redundant. This patch updates proc.txt and moves the /proc/sys
    description out to the files in Documentation/sysctls.

    Details are:

    merge
    - 2.1 /proc/sys/fs - File system data
    - 2.11 /proc/sys/fs/mqueue - POSIX message queues filesystem
    - 2.17 /proc/sys/fs/epoll - Configuration options for the epoll interface
    with Documentation/sysctls/fs.txt.

    remove
    - 2.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats
    since it's no better than Documentation/binfmt_misc.txt.

    merge
    - 2.3 /proc/sys/kernel - general kernel parameters
    with Documentation/sysctls/kernel.txt

    remove
    - 2.5 /proc/sys/dev - Device specific parameters
    since it's obsolete; sysfs is used now.

    remove
    - 2.6 /proc/sys/sunrpc - Remote procedure calls
    since it's no better than Documentation/sysctls/sunrpc.txt

    move
    - 2.7 /proc/sys/net - Networking stuff
    - 2.9 Appletalk
    - 2.10 IPX
    to newly created Documentation/sysctls/net.txt.

    remove
    - 2.8 /proc/sys/net/ipv4 - IPV4 settings
    since it's no better than Documentation/networking/ip-sysctl.txt.

    add
    - Chapter 3 Per-Process Parameters
    to describe /proc/<pid>/xxx parameters.

    Signed-off-by: Shen Feng
    Cc: Randy Dunlap
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shen Feng
     

16 Jan, 2009

1 commit

  • Update Documentation/sysctl/vm.txt and Documentation/filesystems/proc.txt.
    More specifically, the section on /proc/sys/vm in
    Documentation/filesystems/proc.txt was removed and a link to
    Documentation/sysctl/vm.txt added.

    Most of the verbiage from proc.txt was simply moved into vm.txt, with new
    additional text for "swappiness" and "stat_interval".

    Signed-off-by: Peter W Morreale
    Acked-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter W Morreale
     

08 Jan, 2009

1 commit

  • NOMMU mmap allocates a piece of memory for an mmap that's rounded up in size to
    the nearest power-of-2 number of pages. Currently it then discards the excess
    pages back to the page allocator, making that memory available for use by other
    things. This can, however, cause a greater amount of fragmentation.

    To counter this, a sysctl is added in order to fine-tune the trimming
    behaviour. The default behaviour remains to trim pages aggressively, while
    this can either be disabled completely or set to a higher page-granular
    watermark in order to have finer-grained control.

    vm region vm_top bits taken from an earlier patch by David Howells.

    Signed-off-by: Paul Mundt
    Signed-off-by: David Howells
    Tested-by: Mike Frysinger

    Paul Mundt
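
The trimming behaviour can be illustrated with a small model. This is a sketch under stated assumptions: the watermark is taken to mean "trim only when the excess is at least this many pages", with 0 disabling trimming and 1 being the aggressive default; the function names are hypothetical.

```python
def round_up_pow2_pages(pages: int) -> int:
    """Round a page count up to the nearest power of two, as the NOMMU allocation does."""
    n = 1
    while n < pages:
        n *= 2
    return n


def pages_kept(requested_pages: int, trim_watermark: int = 1) -> int:
    """Return how many pages stay attached to the mapping after trimming.

    trim_watermark == 0  -> never trim (keep the whole power-of-2 block)
    trim_watermark >= 1  -> trim only if the excess reaches the watermark
    (assumed semantics, see the note above)
    """
    allocated = round_up_pow2_pages(requested_pages)
    excess = allocated - requested_pages
    if trim_watermark and excess >= trim_watermark:
        return requested_pages  # excess pages go back to the page allocator
    return allocated
```

For a 5-page request the allocation is 8 pages: the default keeps 5, a watermark of 4 keeps all 8 because only 3 pages are in excess, and 0 always keeps 8.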
     

07 Jan, 2009

1 commit

  • This change introduces two new sysctls to /proc/sys/vm:
    dirty_background_bytes and dirty_bytes.

    dirty_background_bytes is the counterpart to dirty_background_ratio and
    dirty_bytes is the counterpart to dirty_ratio.

    With growing memory capacities of individual machines, it's no longer
    sufficient to specify dirty thresholds as a percentage of the amount of
    dirtyable memory over the entire system.

    dirty_background_bytes and dirty_bytes specify quantities of memory, in
    bytes, that represent the dirty limits for the entire system. If either
    of these values is set, its value represents the amount of dirty memory
    that is needed to commence either background or direct writeback.

    When a `bytes' or `ratio' file is written, its counterpart becomes a
    function of the written value. For example, if dirty_bytes is written to
    be 8192, 8K of memory is required to commence direct writeback.
    dirty_ratio is then functionally equivalent to 8K / the amount of
    dirtyable memory:

    dirtyable_memory = free pages + mapped pages + file cache

    dirty_background_bytes = dirty_background_ratio * dirtyable_memory
    -or-
    dirty_background_ratio = dirty_background_bytes / dirtyable_memory

    AND

    dirty_bytes = dirty_ratio * dirtyable_memory
    -or-
    dirty_ratio = dirty_bytes / dirtyable_memory

    Only one of dirty_background_bytes and dirty_background_ratio may be
    specified at a time, and only one of dirty_bytes and dirty_ratio may be
    specified. When one sysctl is written, the other appears as 0 when read.

    The `bytes' files operate on a page size granularity since dirty limits
    are compared with ZVC values, which are in page units.

    Prior to this change, the minimum dirty_ratio was 5 as implemented by
    get_dirty_limits() although /proc/sys/vm/dirty_ratio would show any user
    written value between 0 and 100. This restriction is maintained, but
    dirty_bytes has a lower limit of only one page.

    Also prior to this change, the dirty_background_ratio could not equal or
    exceed dirty_ratio. This restriction is maintained in addition to
    restricting dirty_background_bytes. If either background threshold equals
    or exceeds that of the dirty threshold, it is implicitly set to half the
    dirty threshold.

    Acked-by: Peter Zijlstra
    Cc: Dave Chinner
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Cc: Andrea Righi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
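
The rules in the commit above (one knob of each pair active at a time, a floor of 5 on dirty_ratio, and the background threshold falling back to half the dirty threshold) can be sketched as a small model. This is an illustration only: the 4096-byte page size, the round-up of byte values to whole pages, and the class name `DirtyKnobs` are assumptions, not details taken from the commit.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def bytes_to_pages(nbytes: int) -> int:
    # Assumption: byte values are rounded up to whole pages.
    return -(-nbytes // PAGE_SIZE)


class DirtyKnobs:
    """Hypothetical model of the four /proc/sys/vm dirty_* files."""

    def __init__(self, dirty_ratio: int, dirty_background_ratio: int):
        self.dirty_ratio = dirty_ratio
        self.dirty_bytes = 0
        self.dirty_background_ratio = dirty_background_ratio
        self.dirty_background_bytes = 0

    # Writing one file of a pair makes its counterpart read back as 0.
    def write_dirty_bytes(self, value: int) -> None:
        self.dirty_bytes, self.dirty_ratio = value, 0

    def write_dirty_ratio(self, value: int) -> None:
        self.dirty_ratio, self.dirty_bytes = value, 0

    def write_dirty_background_bytes(self, value: int) -> None:
        self.dirty_background_bytes, self.dirty_background_ratio = value, 0

    def write_dirty_background_ratio(self, value: int) -> None:
        self.dirty_background_ratio, self.dirty_background_bytes = value, 0

    def limits(self, dirtyable_pages: int) -> tuple:
        """Return (background, dirty) thresholds in pages."""
        if self.dirty_bytes:
            dirty = bytes_to_pages(self.dirty_bytes)
        else:
            # The effective dirty_ratio is never below 5.
            dirty = max(self.dirty_ratio, 5) * dirtyable_pages // 100
        if self.dirty_background_bytes:
            background = bytes_to_pages(self.dirty_background_bytes)
        else:
            background = self.dirty_background_ratio * dirtyable_pages // 100
        # A background threshold at or above the dirty threshold is halved.
        if background >= dirty:
            background = dirty // 2
        return background, dirty
```

For example, with 100000 dirtyable pages and ratios of 10 and 20, writing 8192 to dirty_bytes drops the dirty threshold to 2 pages and forces the background threshold down to 1.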
     

30 Oct, 2008

1 commit


11 Oct, 2008

1 commit

  • We need to add a flag for all code that is in the drivers/staging/
    directory to prevent all other kernel developers from worrying about
    issues here, and to notify users that the drivers might not be as good
    as they are normally used to.

    Based on code from Andreas Gruenbacher and Jeff Mahoney to provide a
    TAINT flag for the support level of a kernel module in the Novell
    enterprise kernel release.

    This is the kernel portion of this feature. Setting the flag needs to
    be done in the build process and will happen in a follow-up patch.

    Cc: Andreas Gruenbacher
    Cc: Jeff Mahoney
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

23 Sep, 2008

1 commit


27 Jul, 2008

1 commit


14 Feb, 2008

1 commit


10 Feb, 2008

1 commit


08 Feb, 2008

1 commit

  • Adds a new sysctl, 'oom_dump_tasks', that enables the kernel to produce a
    dump of all system tasks (excluding kernel threads) when performing an
    OOM-killing. Information includes pid, uid, tgid, vm size, rss, cpu,
    oom_adj score, and name.

    This is helpful for determining why there was an OOM condition and which
    rogue task caused it.

    It is configurable so that large systems, such as those with several
    thousand tasks, do not incur a performance penalty associated with dumping
    data they may not desire.

    If an OOM was triggered as a result of a memory controller, the tasklist
    shall be filtered to exclude tasks that are not a member of the same
    cgroup.

    Cc: Andrea Arcangeli
    Cc: Christoph Lameter
    Cc: Balbir Singh
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

07 Feb, 2008

1 commit

  • NR_OPEN (historically set to 1024*1024) actually forbids processes to open
    more than 1024*1024 handles.

    Unfortunately, some production servers hit the not so 'ridiculously high
    value' of 1024*1024 file descriptors per process.

    Changing NR_OPEN is not considered safe because of potential vmalloc
    space exhaustion.

    This patch introduces a new sysctl (/proc/sys/fs/nr_open) which defaults to
    1024*1024, so that admins can decide to change this limit if their workload
    needs it.

    [akpm@linux-foundation.org: export it for sparc64]
    Signed-off-by: Eric Dumazet
    Cc: Alan Cox
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: "David S. Miller"
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
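
A program that wants to know the current per-process ceiling can read the new file. The helper below is a minimal sketch: the function name is hypothetical, and falling back to 1024*1024 when the file is missing or unreadable is a choice made for this example, not kernel behaviour.

```python
from pathlib import Path

DEFAULT_NR_OPEN = 1024 * 1024  # default value named in the commit message


def read_nr_open(path: str = "/proc/sys/fs/nr_open") -> int:
    """Return the value stored in the nr_open sysctl file.

    Falls back to DEFAULT_NR_OPEN if the file cannot be read or parsed
    (for instance on a kernel that predates this sysctl).
    """
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return DEFAULT_NR_OPEN
```

The path is a parameter so that the helper can be pointed at a test file instead of the live /proc entry.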
     

06 Feb, 2008

1 commit

  • Add vm.highmem_is_dirtyable toggle

    A 32 bit machine with HIGHMEM64 enabled running DCC has an MMAPed file of
    approximately 2Gb size which contains a hash format that is written
    randomly by the dbclean process. On 2.6.16 this process took a few
    minutes. With lowmem only accounting of dirty ratios, this takes about 12
    hours of 100% disk IO, all random writes.

    Include a toggle in /proc/sys/vm/highmem_is_dirtyable which can be set to 1 to
    add the highmem back to the total available memory count.

    [akpm@linux-foundation.org: Fix the CONFIG_DETECT_SOFTLOCKUP=y build]
    Signed-off-by: Bron Gondwana
    Cc: Ethan Solomita
    Cc: Peter Zijlstra
    Cc: WU Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bron Gondwana
     

18 Dec, 2007

1 commit

  • The hugetlb documentation has gotten a bit out of sync with the current code.
    Update the sysctl file to refer to Documentation/vm/hugetlbpage.txt. Update
    that file to contain the current state of affairs (with the newer named sysctl
    in place).

    Signed-off-by: Nishanth Aravamudan
    Acked-by: Adam Litke
    Cc: William Lee Irwin III
    Cc: Dave Hansen
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     

17 Oct, 2007

4 commits

  • min_free_pages is critical for correctness, document it as such.

    Signed-off-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Machek
     
  • Add a 00-INDEX file to Documentation/sysctl/

    Signed-off-by: Jesper Juhl
    Cc: Rob Landley
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Juhl
     
  • Control the trigger limit for softlockup warnings. This is useful for
    debugging softlockups, by lowering the softlockup_thresh to identify
    possible softlockups earlier.

    This patch:
    1. Adds a sysctl softlockup_thresh with valid values of 1-60s
    (Higher value to disable false positives)
    2. Changes the softlockup printk to print the cpu softlockup time

    [akpm@linux-foundation.org: Fix various warnings and add definition of "two"]
    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Shai Fultheim
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
     
  • Adds a new sysctl, 'oom_kill_allocating_task', which will automatically kill
    the OOM-triggering task instead of scanning through the tasklist to find a
    memory-hogging target. This is helpful for systems with an insanely large
    number of tasks where scanning the tasklist significantly degrades
    performance.

    Cc: Andrea Arcangeli
    Acked-by: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
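
The difference between the two modes can be shown with a toy victim selector. This is a sketch only: the function name is hypothetical, and a plain memory-usage number stands in for the real OOM scoring, which the commit message does not describe.

```python
def pick_oom_victim(tasks: dict, allocating_pid: int,
                    oom_kill_allocating_task: bool = False) -> int:
    """Return the pid to kill.

    tasks maps pid -> memory usage (a stand-in for the real score).
    With the sysctl enabled, the task that triggered the OOM is chosen
    directly and the task list is never scanned.
    """
    if oom_kill_allocating_task:
        return allocating_pid
    # Default behaviour: scan every task for the largest memory user.
    return max(tasks, key=tasks.get)
```

The scan in the default path touches every task, which is the cost the sysctl avoids on systems with very many tasks.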
     

18 Jul, 2007

1 commit

  • This patch adds the kernelcore= parameter for x86.

    Once all patches are applied, a new command-line parameter and a new
    sysctl exist. This patch adds the necessary documentation.

    From: Yasunori Goto

    When the "kernelcore" boot option is specified, the kernel can't boot on
    ia64 because of an infinite loop. In addition, the parsing code can be
    handled in an architecture-independent manner.

    This patch uses common code to handle the kernelcore= parameter. It is
    only available to architectures that support arch-independent zone-sizing
    (i.e. define CONFIG_ARCH_POPULATES_NODE_MAP). Other architectures will
    ignore the boot parameter.

    [bunk@stusta.de: make cmdline_parse_kernelcore() static]
    Signed-off-by: Mel Gorman
    Signed-off-by: Yasunori Goto
    Acked-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

17 Jul, 2007

2 commits

  • People keep on adding new numbered sysctls, when they're supposed not to.

    Add a documentation file which explains why new sysctls should use
    CTL_UNNUMBERED. The next patch will sprinkle pointers to this throughout
    sysctl.c.

    Eric provided the text (thanks)

    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Make zonelist creation policy selectable from sysctl/boot option v6.

    This patch makes NUMA's zonelist (of pgdat) order selectable.
    Available orders are Default (automatic) / Node-based / Zone-based.

    [Default Order]
    The kernel selects Node-based or Zone-based order automatically.

    [Node-based Order]
    This policy treats the locality of memory as the most important parameter.
    Zonelist order is created by each zone's locality. This means lower zones
    (e.g. ZONE_DMA) can be used before higher zones (e.g. ZONE_NORMAL) are
    exhausted. IOW, ZONE_DMA will be in the middle of the zonelist.
    The current 2.6.21 kernel uses this.

    Pros.
    * A user can expect local memory as much as possible.
    Cons.
    * lower zone will be exhausted before higher zone. This may cause OOM_KILL.

    Maybe suitable if ZONE_DMA is relatively big and you never see OOM_KILL
    because of ZONE_DMA exhaustion and you need the best locality.

    (example)
    assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.

    *node(0)'s memory allocation order:

    node(0)'s NORMAL -> node(0)'s DMA -> node(1)'s NORMAL.

    *node(1)'s memory allocation order:

    node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.

    [Zone-based order]
    This policy treats the zone type as the most important parameter.
    Zonelist order is created by zone-type order. This means a lower zone is
    never used before the higher zones are exhausted.
    IOW, ZONE_DMA will always be at the tail of the zonelist.

    Pros.
    * OOM_KILL (because of a lower zone) occurs only if all zones are exhausted.
    Cons.
    * memory locality may not be best.

    (example)
    assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.

    *node(0)'s memory allocation order:

    node(0)'s NORMAL -> node(1)'s NORMAL -> node(0)'s DMA.

    *node(1)'s memory allocation order:

    node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.

    The boot option "numa_zonelist_order=" and the proc/sysctl interface are supported.

    command:
    %echo N > /proc/sys/vm/numa_zonelist_order

    Will rebuild zonelist in Node-based order.

    command:
    %echo Z > /proc/sys/vm/numa_zonelist_order

    Will rebuild zonelist in Zone-based order.

    Thanks to Lee Schermerhorn, who gave me much help and code.

    [Lee.Schermerhorn@hp.com: add check_highest_zone to build_zonelists_in_zone_order]
    [akpm@linux-foundation.org: build fix]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Cc: "jesse.barnes@intel.com"
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
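
The two orderings in the commit above differ only in which loop is outermost, which the following sketch makes explicit for the 2-node example. It is a toy model, not the kernel's build_zonelists code: the function names are hypothetical, and "nearest node first" is simplified to "local node first, then the others by id".

```python
ZONE_PRIORITY = ["NORMAL", "DMA"]  # highest zone first

# The example topology from the commit message.
NODE_ZONES = {0: {"DMA", "NORMAL"}, 1: {"NORMAL"}}


def nodes_nearest_first(local_node, nodes):
    # Assumption for this sketch: the local node, then the rest sorted by id.
    return [local_node] + sorted(n for n in nodes if n != local_node)


def build_zonelist(local_node, order, node_zones=NODE_ZONES):
    """Return the allocation fallback list as (node, zone) pairs.

    order "N": node-based, exhaust every zone of a node before moving on.
    order "Z": zone-based, exhaust a zone type on all nodes before falling
               back to a lower zone type.
    """
    nodes = nodes_nearest_first(local_node, node_zones)
    if order == "N":
        pairs = [(n, z) for n in nodes for z in ZONE_PRIORITY]
    elif order == "Z":
        pairs = [(n, z) for z in ZONE_PRIORITY for n in nodes]
    else:
        raise ValueError("order must be 'N' or 'Z'")
    # Drop zones that a node does not actually have.
    return [(n, z) for n, z in pairs if z in node_zones[n]]
```

Calling `build_zonelist(0, "N")` and `build_zonelist(0, "Z")` reproduces the two node(0) orders listed in the commit message; for node(1) both orders give the same list.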
     

12 Jul, 2007

1 commit

  • Add a new security check on mmap operations to see if the user is attempting
    to mmap to low area of the address space. The amount of space protected is
    indicated by the new proc tunable /proc/sys/vm/mmap_min_addr and defaults to
    0, preserving existing behavior.

    This patch uses a new SELinux security class "memprotect." Policy already
    contains a number of allow rules like a_t self:process * (unconfined_t being
    one of them) which mean that putting this check in the process class (its
    best current fit) would make it useless as all user processes, which we also
    want to protect against, would be allowed. By taking the memprotect name of
    the new class it will also make it possible for us to move some of the other
    memory protect permissions out of 'process' and into the new class next time
    we bump the policy version number (which I also think is a good future idea)

    Acked-by: Stephen Smalley
    Acked-by: Chris Wright
    Signed-off-by: Eric Paris
    Signed-off-by: James Morris

    Eric Paris
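
The check itself amounts to a comparison against the tunable. The sketch below is illustrative only: the function name is hypothetical, and the boolean `privileged` stands in for whatever the security module decides (the commit routes that decision through the new SELinux "memprotect" class).

```python
def mmap_allowed(addr: int, mmap_min_addr: int = 0, privileged: bool = False) -> bool:
    """Decide whether a mapping at addr passes the low-address check.

    Addresses at or above mmap_min_addr always pass. Lower addresses pass
    only if the caller holds the (hypothetical) permission flag.
    With the default of 0 no address is below the limit, so existing
    behaviour is preserved.
    """
    if addr >= mmap_min_addr:
        return True
    return privileged
```

Setting the tunable to, say, 65536 would then refuse an unprivileged mapping at address 4096 while leaving mappings at 65536 and above untouched.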
     

09 May, 2007

2 commits


08 May, 2007

1 commit

  • The current panic_on_oom may not work if there is a process using
    cpusets/mempolicy, because other nodes' memory may remain. But some people
    want failover by panic ASAP even if they are used. This patch adds a new
    setting for that request.

    This is tested on my ia64 box which has 3 nodes.

    Signed-off-by: Yasunori Goto
    Signed-off-by: Benjamin LaHaise
    Cc: Christoph Lameter
    Cc: Paul Jackson
    Cc: Ethan Solomita
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     

07 Dec, 2006

1 commit


30 Nov, 2006

1 commit


12 Oct, 2006

1 commit

  • The pipe-a-coredump-to-a-program feature was undocumented.
    *Grumble*.

    NB: a good enhancement to that patch would be: save all the stuff that a
    core file can get from the %x expansions in the environment.

    Signed-off-by: Matthias Urlichs
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthias Urlichs
     

26 Sep, 2006

1 commit

  • Currently one can enable slab reclaim by setting an explicit option in
    /proc/sys/vm/zone_reclaim_mode. Slab reclaim is then used as a final
    option if the freeing of unmapped file backed pages is not enough to free
    enough pages to allow a local allocation.

    However, that means that the slab can grow excessively and that most memory
    of a node may be used by slabs. We have had a case where a machine with
    46GB of memory was using 40-42GB for slab. Zone reclaim was effective in
    dealing with pagecache pages. However, slab reclaim was only done during
    global reclaim (which is a bit rare on NUMA systems).

    This patch implements slab reclaim during zone reclaim. Zone reclaim
    occurs if there is a danger of an off node allocation. At that point we

    1. Shrink the per node page cache if the number of pagecache
    pages is more than min_unmapped_ratio percent of pages in a zone.

    2. Shrink the slab cache if the number of the node's reclaimable slab pages
    (patch depends on earlier one that implements that counter)
    is more than min_slab_ratio (a new /proc/sys/vm tunable).

    The shrinking of the slab cache is a bit problematic since it is not node
    specific. So we simply calculate what point in the slab we want to reach
    (current per node slab use minus the number of pages that need to be
    allocated) and then repeatedly run the global reclaim until that is
    unsuccessful or we have reached the limit. I hope we will have zone based
    slab reclaim at some point which will make that easier.

    The default for the min_slab_ratio is 5%

    Also remove the slab option from /proc/sys/vm/zone_reclaim_mode.

    [akpm@osdl.org: cleanups]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
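
The two numbered conditions above can be sketched as a decision function. This is a model of the description only: the function name is hypothetical, both ratios are assumed to be percentages of the pages in the zone, and only the 5% default for min_slab_ratio comes from the commit message.

```python
def zone_reclaim_actions(zone_pages: int,
                         unmapped_pagecache_pages: int,
                         reclaimable_slab_pages: int,
                         min_unmapped_ratio: int,
                         min_slab_ratio: int = 5) -> list:
    """Return which caches zone reclaim would shrink for this zone."""
    actions = []
    # 1. Page cache: only if unmapped page cache exceeds its percentage floor.
    if unmapped_pagecache_pages > zone_pages * min_unmapped_ratio // 100:
        actions.append("shrink_pagecache")
    # 2. Slab: only if reclaimable slab exceeds min_slab_ratio percent.
    if reclaimable_slab_pages > zone_pages * min_slab_ratio // 100:
        actions.append("shrink_slab")
    return actions
```

A zone of 100000 pages with 3000 reclaimable slab pages stays under the 5% default, so only the page cache would be considered.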
     

28 Aug, 2006

1 commit


06 Aug, 2006

1 commit


04 Jul, 2006

1 commit

  • It turns out that it is advantageous to leave a small portion of unmapped file
    backed pages if all of a zone's pages (or almost all pages) are allocated and
    so the page allocator has to go off-node.

    This allows recently used file I/O buffers to stay on the node and
    reduces the times that zone reclaim is invoked if file I/O occurs
    when we run out of memory in a zone.

    The problem is that zone reclaim runs too frequently when the page cache is
    used for file I/O (read write and therefore unmapped pages!) alone and we have
    almost all pages of the zone allocated. Zone reclaim may remove 32 unmapped
    pages. File I/O will use these pages for the next read/write requests and the
    unmapped pages increase. After the zone has filled up again zone reclaim will
    remove it again after only 32 pages. This cycle is too inefficient and there
    are potentially too many zone reclaim cycles.

    With the 1% boundary we may still remove all unmapped pages for file I/O in
    a zone reclaim pass. However, it will take a large number of reads and writes
    to get back to 1% again where we trigger zone reclaim again.

    The zone reclaim 2.6.16/17 does not show this behavior because we have a 30
    second timeout.

    [akpm@osdl.org: rename the /proc file and the variable]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

01 Jul, 2006

1 commit

  • The zone_reclaim_interval was necessary because we were not able to determine
    how many unmapped pages exist in a zone. Therefore we had to scan in
    intervals to figure out if any pages were unmapped.

    With the zoned counters and NR_ANON_PAGES we now know the number of pagecache
    pages and the number of mapped pages in a zone. So we can simply skip the
    reclaim if there is an insufficient number of unmapped pages. We use
    SWAP_CLUSTER_MAX as the boundary.

    Drop all support for /proc/sys/vm/zone_reclaim_interval.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
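
The counter-based skip described above can be sketched in a few lines. The function name is hypothetical, the value 32 for SWAP_CLUSTER_MAX is an assumption, and so is the strictness of the comparison; the commit message only says that SWAP_CLUSTER_MAX is used as the boundary.

```python
SWAP_CLUSTER_MAX = 32  # assumption: value of the kernel constant


def should_attempt_zone_reclaim(pagecache_pages: int, mapped_pages: int) -> bool:
    """Skip zone reclaim when too few unmapped page cache pages exist.

    The unmapped count is derived directly from the per-zone counters,
    so no periodic scan (and no zone_reclaim_interval) is needed.
    """
    unmapped = pagecache_pages - mapped_pages
    return unmapped > SWAP_CLUSTER_MAX
```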
     

23 Jun, 2006

1 commit

  • This patch adds panic_on_oom sysctl under sys.vm.

    When sysctl vm.panic_on_oom = 1, the kernel panics instead of killing rogue
    processes. And if vm.panic_on_oom is 0 the kernel will do oom_kill() in
    the same way as it does today. Of course, the default value is 0 and only
    root can modify it.

    In general, oom_killer works well and kills rogue processes, so the whole
    system can survive. But there are environments where a panic is preferable
    to killing some processes.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

21 Feb, 2006

1 commit

  • Currently, acpi video options can only be set on the kernel command line.
    That's a little inflexible; I'd like a userland s2ram application that just
    works, and modifying the kernel command line according to a whitelist is not
    fun. It is better to just allow the s2ram application to set video options
    just before suspend (according to the whitelist).

    This implements a sysctl to allow setting suspend video options without reboot.

    (akpm: Documentation updates for this new sysctl are pending..)

    Signed-off-by: Pavel Machek
    Cc: "Brown, Len"
    Cc: "Antonino A. Daplas"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Machek