06 Apr, 2011

1 commit


28 Oct, 2010

1 commit

  • When dirty_ratio or dirty_bytes is written the other parameter is disabled
    and set to 0 (in dirty_bytes_handler() / dirty_ratio_handler()).

    We do the same for dirty_background_ratio and dirty_background_bytes.

    However, in the sysctl documentation, we say that the counterpart becomes
    a function of the old value, that is not correct.

    Clarify the documentation reporting the actual behaviour.

    Reviewed-by: Greg Thelen
    Acked-by: David Rientjes
    Signed-off-by: Andrea Righi
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Righi
     

10 Aug, 2010

1 commit

  • The oom killer tasklist dump, enabled with the oom_dump_tasks sysctl, is
    very helpful information in diagnosing why a user's task has been killed.
    It emits useful information such as each eligible thread's memory usage
    that can determine why the system is oom, so it should be enabled by
    default.

    Signed-off-by: David Rientjes
    Acked-by: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

28 Jun, 2010

1 commit


25 May, 2010

2 commits

  • …hen it should be reclaimed

    The kernel applies some heuristics when deciding if memory should be
    compacted or reclaimed to satisfy a high-order allocation. One of these
    is based on the fragmentation. If the index is below 500, memory will not
    be compacted. This choice is arbitrary and not based on data. To help
    optimise the system and set a sensible default for this value, this patch
    adds a sysctl extfrag_threshold. The kernel will only compact memory if
    the fragmentation index is above the extfrag_threshold.

    [randy.dunlap@oracle.com: Fix build errors when proc fs is not configured]
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Christoph Lameter <cl@linux-foundation.org>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     
  • Add a proc file /proc/sys/vm/compact_memory. When an arbitrary value is
    written to the file, all zones are compacted. The expected user of such a
    trigger is a job scheduler that prepares the system before the target
    application runs.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

13 Mar, 2010

1 commit

  • Presently, if panic_on_oom=2, the whole system panics even if the oom
    happend in some special situation (as cpuset, mempolicy....). Then,
    panic_on_oom=2 means painc_on_oom_always.

    Now, memcg doesn't check panic_on_oom flag. This patch adds a check.

    BTW, how it's useful ?

    kdump+panic_on_oom=2 is the last tool to investigate what happens in
    oom-ed system. When a task is killed, the sysytem recovers and there will
    be few hint to know what happnes. In mission critical system, oom should
    never happen. Then, panic_on_oom=2+kdump is useful to avoid next OOM by
    knowing precise information via snapshot.

    TODO:
    - For memcg, it's for isolate system's memory usage, oom-notiifer and
    freeze_at_oom (or rest_at_oom) should be implemented. Then, management
    daemon can do similar jobs (as kdump) or taking snapshot per cgroup.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Nick Piggin
    Reviewed-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

04 Dec, 2009

1 commit

  • That is "success", "unknown", "through", "performance", "[re|un]mapping"
    , "access", "default", "reasonable", "[con]currently", "temperature"
    , "channel", "[un]used", "application", "example","hierarchy", "therefore"
    , "[over|under]flow", "contiguous", "threshold", "enough" and others.

    Signed-off-by: André Goddard Rosa
    Signed-off-by: Jiri Kosina

    André Goddard Rosa
     

24 Sep, 2009

1 commit

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
    HWPOISON: Enable error_remove_page on btrfs
    HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
    HWPOISON: Add madvise() based injector for hardware poisoned pages v4
    HWPOISON: Enable error_remove_page for NFS
    HWPOISON: Enable .remove_error_page for migration aware file systems
    HWPOISON: The high level memory error handler in the VM v7
    HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
    HWPOISON: shmem: call set_page_dirty() with locked page
    HWPOISON: Define a new error_remove_page address space op for async truncation
    HWPOISON: Add invalidate_inode_page
    HWPOISON: Refactor truncate to allow direct truncating of page v2
    HWPOISON: check and isolate corrupted free pages v2
    HWPOISON: Handle hardware poisoned pages in try_to_unmap
    HWPOISON: Use bitmask/action code for try_to_unmap behaviour
    HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
    HWPOISON: Add poison check to page fault handling
    HWPOISON: Add basic support for poisoned pages in fault handler v3
    HWPOISON: Add new SIGBUS error codes for hardware poison signals
    HWPOISON: Add support for poison swap entries v2
    HWPOISON: Export some rmap vma locking to outside world
    ...

    Linus Torvalds
     

22 Sep, 2009

1 commit


16 Sep, 2009

1 commit

  • Add the high level memory handler that poisons pages
    that got corrupted by hardware (typically by a two bit flip in a DIMM
    or a cache) on the Linux level. The goal is to prevent everyone
    from accessing these pages in the future.

    This done at the VM level by marking a page hwpoisoned
    and doing the appropriate action based on the type of page
    it is.

    The code that does this is portable and lives in mm/memory-failure.c

    To quote the overview comment:

    High level machine check handler. Handles pages reported by the
    hardware as being corrupted usually due to a 2bit ECC memory or cache
    failure.

    This focuses on pages detected as corrupted in the background.
    When the current CPU tries to consume corruption the currently
    running process can just be killed directly instead. This implies
    that if the error cannot be handled for some reason it's safe to
    just ignore it because no corruption has been consumed yet. Instead
    when that happens another machine check will happen.

    Handles page cache pages in various states. The tricky part
    here is that we can access any page asynchronous to other VM
    users, because memory failures could happen anytime and anywhere,
    possibly violating some of their assumptions. This is why this code
    has to be extremely careful. Generally it tries to use normal locking
    rules, as in get the standard locks, even if that means the
    error handling takes potentially a long time.

    Some of the operations here are somewhat inefficient and have non
    linear algorithmic complexity, because the data structures have not
    been optimized for this case. This is in particular the case
    for the mapping from a vma to a process. Since this case is expected
    to be rare we hope we can get away with this.

    There are in principle two strategies to kill processes on poison:
    - just unmap the data and wait for an actual reference before
    killing
    - kill as soon as corruption is detected.
    Both have advantages and disadvantages and should be used
    in different situations. Right now both are implemented and can
    be switched with a new sysctl vm.memory_failure_early_kill
    The default is early kill.

    The patch does some rmap data structure walking on its own to collect
    processes to kill. This is unusual because normally all rmap data structure
    knowledge is in rmap.c only. I put it here for now to keep
    everything together and rmap knowledge has been seeping out anyways

    Includes contributions from Johannes Weiner, Chris Mason, Fengguang Wu,
    Nick Piggin (who did a lot of great work) and others.

    Cc: npiggin@suse.de
    Cc: riel@redhat.com
    Signed-off-by: Andi Kleen
    Acked-by: Rik van Riel
    Reviewed-by: Hidehiro Kawai

    Andi Kleen
     

17 Jun, 2009

2 commits

  • A bug was brought to my attention against a distro kernel but it affects
    mainline and I believe problems like this have been reported in various
    guises on the mailing lists although I don't have specific examples at the
    moment.

    The reported problem was that malloc() stalled for a long time (minutes in
    some cases) if a large tmpfs mount was occupying a large percentage of
    memory overall. The pages did not get cleaned or reclaimed by
    zone_reclaim() because the zone_reclaim_mode was unsuitable, but the lists
    are uselessly scanned frequencly making the CPU spin at near 100%.

    This patchset intends to address that bug and bring the behaviour of
    zone_reclaim() more in line with expectations which were noticed during
    investigation. It is based on top of mmotm and takes advantage of
    Kosaki's work with respect to zone_reclaim().

    Patch 1 fixes the heuristics that zone_reclaim() uses to determine if the
    scan should go ahead. The broken heuristic is what was causing the
    malloc() stall as it uselessly scanned the LRU constantly. Currently,
    zone_reclaim is assuming zone_reclaim_mode is 1 and historically it
    could not deal with tmpfs pages at all. This fixes up the heuristic so
    that an unnecessary scan is more likely to be correctly avoided.

    Patch 2 notes that zone_reclaim() returning a failure automatically means
    the zone is marked full. This is not always true. It could have
    failed because the GFP mask or zone_reclaim_mode were unsuitable.

    Patch 3 introduces a counter zreclaim_failed that will increment each
    time the zone_reclaim scan-avoidance heuristics fail. If that
    counter is rapidly increasing, then zone_reclaim_mode should be
    set to 0 as a temporarily resolution and a bug reported because
    the scan-avoidance heuristic is still broken.

    This patch:

    On NUMA machines, the administrator can configure zone_reclaim_mode that
    is a more targetted form of direct reclaim. On machines with large NUMA
    distances for example, a zone_reclaim_mode defaults to 1 meaning that
    clean unmapped pages will be reclaimed if the zone watermarks are not
    being met.

    There is a heuristic that determines if the scan is worthwhile but the
    problem is that the heuristic is not being properly applied and is
    basically assuming zone_reclaim_mode is 1 if it is enabled. The lack of
    proper detection can manfiest as high CPU usage as the LRU list is scanned
    uselessly.

    Historically, once enabled it was depending on NR_FILE_PAGES which may
    include swapcache pages that the reclaim_mode cannot deal with. Patch
    vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by
    Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included
    pages that were not file-backed such as swapcache and made a calculation
    based on the inactive, active and mapped files. This is far superior when
    zone_reclaim==1 but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a
    reasonable starting figure.

    This patch alters how zone_reclaim() works out how many pages it might be
    able to reclaim given the current reclaim_mode. If RECLAIM_SWAP is set in
    the reclaim_mode it will either consider NR_FILE_PAGES as potential
    candidates or else use NR_{IN}ACTIVE}_PAGES-NR_FILE_MAPPED to discount
    swapcache and other non-file-backed pages. If RECLAIM_WRITE is not set,
    then NR_FILE_DIRTY number of pages are not candidates. If RECLAIM_SWAP is
    not set, then NR_FILE_MAPPED are not.

    [kosaki.motohiro@jp.fujitsu.com: Estimate unmapped pages minus tmpfs pages]
    [fengguang.wu@intel.com: Fix underflow problem in Kosaki's estimate]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Acked-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: Wu Fengguang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • ALLOC_WMARK_MIN, ALLOC_WMARK_LOW and ALLOC_WMARK_HIGH determin whether
    pages_min, pages_low or pages_high is used as the zone watermark when
    allocating the pages. Two branches in the allocator hotpath determine
    which watermark to use.

    This patch uses the flags as an array index into a watermark array that is
    indexed with WMARK_* defines accessed via helpers. All call sites that
    use zone->pages_* are updated to use the helpers for accessing the values
    and the array offsets for setting.

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

13 Jun, 2009

1 commit


15 May, 2009

1 commit

  • This reverts commit fafd688e4c0c34da0f3de909881117d374e4c7af.

    Work is progressing to switch away from pdflush as the process backing
    for flushing out dirty data. So it seems pointless to add more knobs
    to control pdflush threads. The original author of the patch did not
    have any specific use cases for adding the knobs, so we can easily
    revert this before 2.6.30 to avoid having to maintain this API
    forever.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

03 May, 2009

1 commit

  • Avoid setting less than two pages for vm_dirty_bytes: this is necessary to
    avoid potential division by 0 (like the following) in get_dirty_limits().

    [ 49.951610] divide error: 0000 [#1] PREEMPT SMP
    [ 49.952195] last sysfs file: /sys/devices/pci0000:00/0000:00:01.1/host0/target0:0:0/0:0:0:0/block/sda/uevent
    [ 49.952195] CPU 1
    [ 49.952195] Modules linked in: pcspkr
    [ 49.952195] Pid: 3064, comm: dd Not tainted 2.6.30-rc3 #1
    [ 49.952195] RIP: 0010:[] [] get_dirty_limits+0xe9/0x2c0
    [ 49.952195] RSP: 0018:ffff88001de03a98 EFLAGS: 00010202
    [ 49.952195] RAX: 00000000000000c0 RBX: ffff88001de03b80 RCX: 28f5c28f5c28f5c3
    [ 49.952195] RDX: 0000000000000000 RSI: 00000000000000c0 RDI: 0000000000000000
    [ 49.952195] RBP: ffff88001de03ae8 R08: 0000000000000000 R09: 0000000000000000
    [ 49.952195] R10: ffff88001ddda9a0 R11: 0000000000000001 R12: 0000000000000001
    [ 49.952195] R13: ffff88001fbc8218 R14: ffff88001de03b70 R15: ffff88001de03b78
    [ 49.952195] FS: 00007fe9a435b6f0(0000) GS:ffff8800025d9000(0000) knlGS:0000000000000000
    [ 49.952195] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 49.952195] CR2: 00007fe9a39ab000 CR3: 000000001de38000 CR4: 00000000000006e0
    [ 49.952195] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 49.952195] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    [ 49.952195] Process dd (pid: 3064, threadinfo ffff88001de02000, task ffff88001ddda250)
    [ 49.952195] Stack:
    [ 49.952195] ffff88001fa0de00 ffff88001f2dbd70 ffff88001f9fe800 000080b900000000
    [ 49.952195] 00000000000000c0 ffff8800027a6100 0000000000000400 ffff88001fbc8218
    [ 49.952195] 0000000000000000 0000000000000600 ffff88001de03bb8 ffffffff802d3ed7
    [ 49.952195] Call Trace:
    [ 49.952195] [] balance_dirty_pages_ratelimited_nr+0x1d7/0x3f0
    [ 49.952195] [] ? ext3_writeback_write_end+0x9e/0x120
    [ 49.952195] [] generic_file_buffered_write+0x12f/0x330
    [ 49.952195] [] __generic_file_aio_write_nolock+0x26d/0x460
    [ 49.952195] [] ? generic_file_aio_write+0x52/0xd0
    [ 49.952195] [] generic_file_aio_write+0x69/0xd0
    [ 49.952195] [] ext3_file_write+0x26/0xc0
    [ 49.952195] [] do_sync_write+0xf1/0x140
    [ 49.952195] [] ? get_lock_stats+0x2a/0x60
    [ 49.952195] [] ? autoremove_wake_function+0x0/0x40
    [ 49.952195] [] vfs_write+0xcb/0x190
    [ 49.952195] [] sys_write+0x50/0x90
    [ 49.952195] [] system_call_fastpath+0x16/0x1b
    [ 49.952195] Code: 00 00 00 2b 05 09 1c 17 01 48 89 c6 49 0f af f4 48 c1 ee 02 48 89 f0 48 f7 e1 48 89 d6 31 d2 48 c1 ee 02 48 0f af 75 d0 48 89 f0 f7 f7 41 8b 95 ac 01 00 00 48 89 c7 49 0f af d4 48 c1 ea 02
    [ 49.952195] RIP [] get_dirty_limits+0xe9/0x2c0
    [ 49.952195] RSP
    [ 50.096523] ---[ end trace 008d7aa02f244d7b ]---

    Signed-off-by: Andrea Righi
    Cc: Peter Zijlstra
    Cc: David Rientjes
    Cc: Dave Chinner
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Righi
     

07 Apr, 2009

1 commit

  • Add /proc entries to give the admin the ability to control the minimum and
    maximum number of pdflush threads. This allows finer control of pdflush
    on both large and small machines.

    The rationale is simply one size does not fit all. Admins on large and/or
    small systems may want to tune the min/max pdflush thread count to best
    suit their needs. Right now the min/max is hardcoded to 2/8. While
    probably a fair estimate for smaller machines, large machines with large
    numbers of CPUs and large numbers of filesystems/block devices may benefit
    from larger numbers of threads working on different block devices.

    Even if the background flushing algorithm is radically changed, it is
    still likely that multiple threads will be involved and admins would still
    desire finer control on the min/max other than to have to recompile the
    kernel.

    The patch adds '/proc/sys/vm/nr_pdflush_threads_min' and
    '/proc/sys/vm/nr_pdflush_threads_max' with r/w permissions.

    The minimum value for nr_pdflush_threads_min is 1 and the maximum value is
    the current value of nr_pdflush_threads_max. This minimum is required
    since additional thread creation is performed in a pdflush thread itself.

    The minimum value for nr_pdflush_threads_max is the current value of
    nr_pdflush_threads_min and the maximum value can be 1000.

    Documentation/sysctl/vm.txt is also updated.

    [akpm@linux-foundation.org: fix comment, fix whitespace, use __read_mostly]
    Signed-off-by: Peter W Morreale
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter W Morreale
     

16 Jan, 2009

1 commit

  • Update Documentation/sysctl/vm.txt and Documentation/filesystems/proc.txt.
    More specifically, the section on /proc/sys/vm in
    Documentation/filesystems/proc.txt was removed and a link to
    Documentation/sysctl/vm.txt added.

    Most of the verbiage from proc.txt was simply moved in vm.txt, with new
    addtional text for "swappiness" and "stat_interval".

    Signed-off-by: Peter W Morreale
    Acked-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter W Morreale
     

08 Jan, 2009

1 commit

  • NOMMU mmap allocates a piece of memory for an mmap that's rounded up in size to
    the nearest power-of-2 number of pages. Currently it then discards the excess
    pages back to the page allocator, making that memory available for use by other
    things. This can, however, cause greater amount of fragmentation.

    To counter this, a sysctl is added in order to fine-tune the trimming
    behaviour. The default behaviour remains to trim pages aggressively, while
    this can either be disabled completely or set to a higher page-granular
    watermark in order to have finer-grained control.

    vm region vm_top bits taken from an earlier patch by David Howells.

    Signed-off-by: Paul Mundt
    Signed-off-by: David Howells
    Tested-by: Mike Frysinger

    Paul Mundt
     

07 Jan, 2009

1 commit

  • This change introduces two new sysctls to /proc/sys/vm:
    dirty_background_bytes and dirty_bytes.

    dirty_background_bytes is the counterpart to dirty_background_ratio and
    dirty_bytes is the counterpart to dirty_ratio.

    With growing memory capacities of individual machines, it's no longer
    sufficient to specify dirty thresholds as a percentage of the amount of
    dirtyable memory over the entire system.

    dirty_background_bytes and dirty_bytes specify quantities of memory, in
    bytes, that represent the dirty limits for the entire system. If either
    of these values is set, its value represents the amount of dirty memory
    that is needed to commence either background or direct writeback.

    When a `bytes' or `ratio' file is written, its counterpart becomes a
    function of the written value. For example, if dirty_bytes is written to
    be 8096, 8K of memory is required to commence direct writeback.
    dirty_ratio is then functionally equivalent to 8K / the amount of
    dirtyable memory:

    dirtyable_memory = free pages + mapped pages + file cache

    dirty_background_bytes = dirty_background_ratio * dirtyable_memory
    -or-
    dirty_background_ratio = dirty_background_bytes / dirtyable_memory

    AND

    dirty_bytes = dirty_ratio * dirtyable_memory
    -or-
    dirty_ratio = dirty_bytes / dirtyable_memory

    Only one of dirty_background_bytes and dirty_background_ratio may be
    specified at a time, and only one of dirty_bytes and dirty_ratio may be
    specified. When one sysctl is written, the other appears as 0 when read.

    The `bytes' files operate on a page size granularity since dirty limits
    are compared with ZVC values, which are in page units.

    Prior to this change, the minimum dirty_ratio was 5 as implemented by
    get_dirty_limits() although /proc/sys/vm/dirty_ratio would show any user
    written value between 0 and 100. This restriction is maintained, but
    dirty_bytes has a lower limit of only one page.

    Also prior to this change, the dirty_background_ratio could not equal or
    exceed dirty_ratio. This restriction is maintained in addition to
    restricting dirty_background_bytes. If either background threshold equals
    or exceeds that of the dirty threshold, it is implicitly set to half the
    dirty threshold.

    Acked-by: Peter Zijlstra
    Cc: Dave Chinner
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Cc: Andrea Righi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

27 Jul, 2008

1 commit


08 Feb, 2008

1 commit

  • Adds a new sysctl, 'oom_dump_tasks', that enables the kernel to produce a
    dump of all system tasks (excluding kernel threads) when performing an
    OOM-killing. Information includes pid, uid, tgid, vm size, rss, cpu,
    oom_adj score, and name.

    This is helpful for determining why there was an OOM condition and which
    rogue task caused it.

    It is configurable so that large systems, such as those with several
    thousand tasks, do not incur a performance penalty associated with dumping
    data they may not desire.

    If an OOM was triggered as a result of a memory controller, the tasklist
    shall be filtered to exclude tasks that are not a member of the same
    cgroup.

    Cc: Andrea Arcangeli
    Cc: Christoph Lameter
    Cc: Balbir Singh
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

06 Feb, 2008

1 commit

  • Add vm.highmem_is_dirtyable toggle

    A 32 bit machine with HIGHMEM64 enabled running DCC has an MMAPed file of
    approximately 2Gb size which contains a hash format that is written
    randomly by the dbclean process. On 2.6.16 this process took a few
    minutes. With lowmem only accounting of dirty ratios, this takes about 12
    hours of 100% disk IO, all random writes.

    Include a toggle in /proc/sys/vm/highmem_is_dirtyable which can be set to 1 to
    add the highmem back to the total available memory count.

    [akpm@linux-foundation.org: Fix the CONFIG_DETECT_SOFTLOCKUP=y build]
    Signed-off-by: Bron Gondwana
    Cc: Ethan Solomita
    Cc: Peter Zijlstra
    Cc: WU Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bron Gondwana
     

18 Dec, 2007

1 commit

  • The hugetlb documentation has gotten a bit out of sync with the current code.
    Updated the sysctl file to refer to Documentation/vm/hugetlbpage.txt. Update
    that file to contain the current state of affairs (with the newer named sysctl
    in place).

    Signed-off-by: Nishanth Aravamudan
    Acked-by: Adam Litke
    Cc: William Lee Irwin III
    Cc: Dave Hansen
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     

17 Oct, 2007

2 commits

  • min_free_pages is critical for correctness, document it as such.

    Signed-off-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Machek
     
  • Adds a new sysctl, 'oom_kill_allocating_task', which will automatically kill
    the OOM-triggering task instead of scanning through the tasklist to find a
    memory-hogging target. This is helpful for systems with an insanely large
    number of tasks where scanning the tasklist significantly degrades
    performance.

    Cc: Andrea Arcangeli
    Acked-by: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

18 Jul, 2007

1 commit

  • This patch adds the kernelcore= parameter for x86.

    Once all patches are applied, a new command-line parameter exist and a new
    sysctl. This patch adds the necessary documentation.

    From: Yasunori Goto

    When "kernelcore" boot option is specified, kernel can't boot up on ia64
    because of an infinite loop. In addition, the parsing code can be handled
    in an architecture-independent manner.

    This patch uses common code to handle the kernelcore= parameter. It is
    only available to architectures that support arch-independent zone-sizing
    (i.e. define CONFIG_ARCH_POPULATES_NODE_MAP). Other architectures will
    ignore the boot parameter.

    [bunk@stusta.de: make cmdline_parse_kernelcore() static]
    Signed-off-by: Mel Gorman
    Signed-off-by: Yasunori Goto
    Acked-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

17 Jul, 2007

1 commit

  • Make zonelist creation policy selectable from sysctl/boot option v6.

    This patch makes NUMA's zonelist (of pgdat) order selectable.
    Available order are Default(automatic)/ Node-based / Zone-based.

    [Default Order]
    The kernel selects Node-based or Zone-based order automatically.

    [Node-based Order]
    This policy treats the locality of memory as the most important parameter.
    Zonelist order is created by each zone's locality. This means lower zones
    (ex. ZONE_DMA) can be used before higher zone (ex. ZONE_NORMAL) exhausion.
    IOW. ZONE_DMA will be in the middle of zonelist.
    current 2.6.21 kernel uses this.

    Pros.
    * A user can expect local memory as much as possible.
    Cons.
    * lower zone will be exhansted before higher zone. This may cause OOM_KILL.

    Maybe suitable if ZONE_DMA is relatively big and you never see OOM_KILL
    because of ZONE_DMA exhaution and you need the best locality.

    (example)
    assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.

    *node(0)'s memory allocation order:

    node(0)'s NORMAL -> node(0)'s DMA -> node(1)'s NORMAL.

    *node(1)'s memory allocation order:

    node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.

    [Zone-based order]
    This policy treats the zone type as the most important parameter.
    Zonelist order is created by zone-type order. This means lower zone
    never be used bofere higher zone exhaustion.
    IOW. ZONE_DMA will be always at the tail of zonelist.

    Pros.
    * OOM_KILL(bacause of lower zone) occurs only if the whole zones are exhausted.
    Cons.
    * memory locality may not be best.

    (example)
    assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.

    *node(0)'s memory allocation order:

    node(0)'s NORMAL -> node(1)'s NORMAL -> node(0)'s DMA.

    *node(1)'s memory allocation order:

    node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.

    bootoption "numa_zonelist_order=" and proc/sysctl is supporetd.

    command:
    %echo N > /proc/sys/vm/numa_zonelist_order

    Will rebuild zonelist in Node-based order.

    command:
    %echo Z > /proc/sys/vm/numa_zonelist_order

    Will rebuild zonelist in Zone-based order.

    Thanks to Lee Schermerhorn, he gives me much help and codes.

    [Lee.Schermerhorn@hp.com: add check_highest_zone to build_zonelists_in_zone_order]
    [akpm@linux-foundation.org: build fix]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Cc: "jesse.barnes@intel.com"
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

12 Jul, 2007

1 commit

  • Add a new security check on mmap operations to see if the user is attempting
    to mmap to low area of the address space. The amount of space protected is
    indicated by the new proc tunable /proc/sys/vm/mmap_min_addr and defaults to
    0, preserving existing behavior.

    This patch uses a new SELinux security class "memprotect." Policy already
    contains a number of allow rules like a_t self:process * (unconfined_t being
    one of them) which mean that putting this check in the process class (its
    best current fit) would make it useless as all user processes, which we also
    want to protect against, would be allowed. By taking the memprotect name of
    the new class it will also make it possible for us to move some of the other
    memory protect permissions out of 'process' and into the new class next time
    we bump the policy version number (which I also think is a good future idea)

    Acked-by: Stephen Smalley
    Acked-by: Chris Wright
    Signed-off-by: Eric Paris
    Signed-off-by: James Morris

    Eric Paris
     

08 May, 2007

1 commit

  • The current panic_on_oom may not work if there is a process using
    cpusets/mempolicy, because other nodes' memory may remain. But some people
    want failover by panic ASAP even if they are used. This patch makes new
    setting for its request.

    This is tested on my ia64 box which has 3 nodes.

    Signed-off-by: Yasunori Goto
    Signed-off-by: Benjamin LaHaise
    Cc: Christoph Lameter
    Cc: Paul Jackson
    Cc: Ethan Solomita
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     

30 Nov, 2006

1 commit


26 Sep, 2006

1 commit

  • Currently one can enable slab reclaim by setting an explicit option in
    /proc/sys/vm/zone_reclaim_mode. Slab reclaim is then used as a final
    option if the freeing of unmapped file backed pages is not enough to free
    enough pages to allow a local allocation.

    However, that means that the slab can grow excessively and that most memory
    of a node may be used by slabs. We have had a case where a machine with
    46GB of memory was using 40-42GB for slab. Zone reclaim was effective in
    dealing with pagecache pages. However, slab reclaim was only done during
    global reclaim (which is a bit rare on NUMA systems).

    This patch implements slab reclaim during zone reclaim. Zone reclaim
    occurs if there is a danger of an off node allocation. At that point we

    1. Shrink the per node page cache if the number of pagecache
    pages is more than min_unmapped_ratio percent of pages in a zone.

    2. Shrink the slab cache if the number of the nodes reclaimable slab pages
    (patch depends on earlier one that implements that counter)
    are more than min_slab_ratio (a new /proc/sys/vm tunable).

    The shrinking of the slab cache is a bit problematic since it is not node
    specific. So we simply calculate what point in the slab we want to reach
    (current per node slab use minus the number of pages that neeed to be
    allocated) and then repeately run the global reclaim until that is
    unsuccessful or we have reached the limit. I hope we will have zone based
    slab reclaim at some point which will make that easier.

    The default for the min_slab_ratio is 5%

    Also remove the slab option from /proc/sys/vm/zone_reclaim_mode.

    [akpm@osdl.org: cleanups]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

04 Jul, 2006

1 commit

  • It turns out that it is advantageous to leave a small portion of unmapped file
    backed pages if all of a zone's pages (or almost all pages) are allocated and
    so the page allocator has to go off-node.

    This allows recently used file I/O buffers to stay on the node and
    reduces the times that zone reclaim is invoked if file I/O occurs
    when we run out of memory in a zone.

    The problem is that zone reclaim runs too frequently when the page cache is
    used for file I/O (read write and therefore unmapped pages!) alone and we have
    almost all pages of the zone allocated. Zone reclaim may remove 32 unmapped
    pages. File I/O will use these pages for the next read/write requests and the
    unmapped pages increase. After the zone has filled up again zone reclaim will
    remove it again after only 32 pages. This cycle is too inefficient and there
    are potentially too many zone reclaim cycles.

    With the 1% boundary we may still remove all unmapped pages for file I/O in
    zone reclaim pass. However. it will take a large number of read and writes
    to get back to 1% again where we trigger zone reclaim again.

    The zone reclaim 2.6.16/17 does not show this behavior because we have a 30
    second timeout.

    [akpm@osdl.org: rename the /proc file and the variable]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

01 Jul, 2006

1 commit

  • The zone_reclaim_interval was necessary because we were not able to determine
    how many unmapped pages exist in a zone. Therefore we had to scan in
    intervals to figure out if any pages were unmapped.

    With the zoned counters and NR_ANON_PAGES we now know the number of pagecache
    pages and the number of mapped pages in a zone. So we can simply skip the
    reclaim if there is an insufficient number of unmapped pages. We use
    SWAP_CLUSTER_MAX as the boundary.

    Drop all support for /proc/sys/vm/zone_reclaim_interval.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

23 Jun, 2006

1 commit

  • This patch adds panic_on_oom sysctl under sys.vm.

    When sysctl vm.panic_on_oom = 1, the kernel panics intead of killing rogue
    processes. And if vm.panic_on_oom is 0 the kernel will do oom_kill() in
    the same way as it does today. Of course, the default value is 0 and only
    root can modifies it.

    In general, oom_killer works well and kill rogue processes. So the whole
    system can survive. But there are environments where panic is preferable
    rather than kill some processes.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

02 Feb, 2006

3 commits

  • If large amounts of zone memory are used by empty slabs then zone_reclaim
    becomes uneffective. This patch shakes the slab a bit.

    The problem with this patch is that the slab reclaim is not containable to a
    zone. Thus slab reclaim may affect the whole system and be extremely slow.
    This also means that we cannot determine how many pages were freed in this
    zone. Thus we need to go off node for at least one allocation.

    The functionality is disabled by default.

    We could modify the shrinkers to take a zone parameter but that would be quite
    invasive. Better ideas are welcome.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • In some situations one may want zone_reclaim to behave differently. For
    example a process writing large amounts of memory will spew unto other nodes
    to cache the writes if many pages in a zone become dirty. This may impact the
    performance of processes running on other nodes.

    Allowing writes during reclaim puts a stop to that behavior and throttles the
    process by restricting the pages to the local zone.

    Similarly one may want to contain processes to local memory by enabling
    regular swap behavior during zone_reclaim. Off node memory allocation can
    then be controlled through memory policies and cpusets.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Currently the zone_reclaim code has a fixed window of 30 seconds of off node
    allocations should a local zone have no unused pagecache pages left. Reclaim
    will be attempted again after this timeout period to avoid repeated useless
    scans for memory. This is also useful to established sufficiently large off
    node allocation chunks to relieve the local node.

    It may be beneficial to adjust that time period for some special situations.
    For example if memory use was exceeding node capacity one may want to give up
    for longer periods of time. If memory spikes intermittendly then one may want
    to shorten the time period to reduce the number of off node allocations.

    This patch allows just that....

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

19 Jan, 2006

1 commit

  • proc support for zone reclaim

    This patch creates a proc entry /proc/sys/vm/zone_reclaim_mode that may be
    used to override the automatic determination of the zone reclaim made on
    bootup.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

09 Jan, 2006

1 commit

  • As recently there has been lot of traffic on the right values for batch and
    high water marks for per_cpu_pagelists. This patch makes these two
    variables configurable through /proc interface.

    A new tunable /proc/sys/vm/percpu_pagelist_fraction is added. This entry
    controls the fraction of pages at most in each zone that are allocated for
    each per cpu page list. The min value for this is 8. It means that we
    don't allow more than 1/8th of pages in each zone to be allocated in any
    single per_cpu_pagelist.

    The batch value of each per cpu pagelist is also updated as a result. It
    is set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8)

    Signed-off-by: Rohit Seth
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rohit Seth