03 Dec, 2010

1 commit

  • The nr_dirty_[background_]threshold fields are misplaced before the
    numa_* fields, and users will read strange values.

    This is the right order. Before the patch, nr_dirty_background_threshold
    would read as 0 (the value from numa_miss).

    numa_hit 128501
    numa_miss 0
    numa_foreign 0
    numa_interleave 7388
    numa_local 128501
    numa_other 0
    nr_dirty_threshold 144291
    nr_dirty_background_threshold 72145
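
    The root cause is an ordering mismatch between the label array and the
    value array in mm/vmstat.c: the numa_* counters are zone stat items, so
    their values come before the derived thresholds that vmstat_start()
    appends. A condensed, illustrative sketch of the corrected vmstat_text
    ordering (not the full array):

        static const char * const vmstat_text[] = {
                /* ... enum zone_stat_item counters ... */
        #ifdef CONFIG_NUMA
                "numa_hit",
                "numa_miss",
                "numa_foreign",
                "numa_interleave",
                "numa_local",
                "numa_other",
        #endif
                /* derived thresholds must follow the zone counters */
                "nr_dirty_threshold",
                "nr_dirty_background_threshold",
                /* ... vm event counter names ... */
        };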

    Signed-off-by: Wu Fengguang
    Cc: Michael Rubin
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

04 Nov, 2010

1 commit

  • Fix regression introduced by commit 79da826aee6 ("writeback: report
    dirty thresholds in /proc/vmstat").

    The incorrect pointer arithmetic can result in problems like this:

    BUG: unable to handle kernel paging request at 07c06d16
    IP: [] strnlen+0x6/0x20
    Call Trace:
    [] ? string+0x39/0xe0
    [] ? __wake_up_common+0x4b/0x80
    [] ? vsnprintf+0x1ec/0x380
    [] ? seq_printf+0x2e/0x60
    [] ? vmstat_show+0x26/0x30
    [] ? seq_read+0xa6/0x380
    [] ? seq_read+0x0/0x380
    [] ? proc_reg_read+0x5f/0x90
    [] ? vfs_read+0xa1/0x140
    [] ? proc_reg_read+0x0/0x90
    [] ? sys_read+0x41/0x70
    [] ? sysenter_do_call+0x12/0x26

    Reported-by: Tetsuo Handa
    Cc: Michael Rubin
    Signed-off-by: Wu Fengguang
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

27 Oct, 2010

3 commits

  • This removes the following warning from sparse:

    mm/vmstat.c:466:5: warning: symbol 'fragmentation_index' was not declared. Should it be static?

    [akpm@linux-foundation.org: move the include to top-of-file]
    Signed-off-by: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • The kernel already exposes the user-desired thresholds in /proc/sys/vm
    with dirty_background_ratio and dirty_ratio. But the kernel may alter
    the number requested without giving the user any indication that this
    is the case.

    Knowing the actual ratios the kernel is honoring can help app developers
    understand how their buffered IO will be sent to the disk.

    $ grep threshold /proc/vmstat
    nr_dirty_threshold 409111
    nr_dirty_background_threshold 818223
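
    As a rough illustration of why the exported numbers can differ from the
    requested ratios: the effective limits are derived from the ratios and
    then adjusted by the kernel. A simplified sketch of the arithmetic (the
    real logic lives in mm/page-writeback.c and also handles dirty_bytes
    and per-task adjustments):

        /* simplified sketch, not the exact kernel code */
        background = (dirty_background_ratio * available_memory) / 100;
        dirty      = (vm_dirty_ratio * available_memory) / 100;
        if (background >= dirty)
                background = dirty / 2;  /* the request is silently overridden */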

    Signed-off-by: Michael Rubin
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jens Axboe
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Rubin
     
  • To help developers and applications gain visibility into writeback
    behaviour, this patch adds two entries to vm_stat_items and /proc/vmstat,
    allowing us to track the "written" and "dirtied" counts.

    # grep nr_dirtied /proc/vmstat
    nr_dirtied 3747
    # grep nr_written /proc/vmstat
    nr_written 3618

    Signed-off-by: Michael Rubin
    Reviewed-by: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jens Axboe
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Rubin
     

10 Sep, 2010

2 commits

  • …low and kswapd is awake

    Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is
    cheaper than scanning a number of lists. To avoid synchronization
    overhead, counter deltas are maintained on a per-cpu basis and drained
    both periodically and when the delta is above a threshold. On large CPU
    systems, the difference between the estimated and real value of
    NR_FREE_PAGES can be very high. If NR_FREE_PAGES is much higher than the
    number of pages actually free in the buddy allocator, the VM can
    allocate pages below the min watermark, at worst reducing the real
    number of free pages to zero. Even if the OOM killer kills some victim
    to free memory, it may not help if the exit path requires a new page,
    resulting in livelock.

    This patch introduces a zone_page_state_snapshot() function (courtesy of
    Christoph) that takes a slightly more accurate view of an arbitrary vmstat
    counter. It is used to read NR_FREE_PAGES while kswapd is awake to avoid
    the watermark being accidentally broken. The estimate is not perfect and
    may result in cache line bounces but is expected to be lighter than the
    IPI calls necessary to continually drain the per-cpu counters while kswapd
    is awake.
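
    The helper, roughly as introduced (a sketch of
    zone_page_state_snapshot() in include/linux/vmstat.h; it folds the
    per-cpu deltas back into the global counter instead of draining them):

        static inline unsigned long zone_page_state_snapshot(struct zone *zone,
                                                enum zone_stat_item item)
        {
                long x = atomic_long_read(&zone->vm_stat[item]);
        #ifdef CONFIG_SMP
                int cpu;

                /* add each CPU's not-yet-drained delta */
                for_each_online_cpu(cpu)
                        x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];

                if (x < 0)
                        x = 0;
        #endif
                return x;
        }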

    Signed-off-by: Christoph Lameter <cl@linux.com>
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Christoph Lameter
     
  • refresh_zone_stat_thresholds() calculates the per-cpu stat thresholds
    based on the number of online CPUs. It is called at CPU offlining but
    needs to be called at CPU onlining, too.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

10 Aug, 2010

2 commits

  • Since 2.6.28, zone->prev_priority has been unused, so it can be removed
    safely. This reduces stack usage slightly.

    Now I have to say that I'm sorry. Two years ago, I thought prev_priority
    could be integrated again and be useful, but four (or more) attempts
    have not produced good performance numbers. Thus I am giving up on that
    approach.

    The rest of this changelog consists of notes on prev_priority: why it
    existed in the first place, and why it may no longer be necessary. This
    information is based heavily on discussions between Andrew Morton, Rik
    van Riel and KOSAKI Motohiro, who is heavily quoted from.

    Historically, prev_priority was important because it determined when the
    VM would start unmapping PTE pages; there are no other balances of note
    within the VM, Anon vs File and Mapped vs Unmapped. Without
    prev_priority, there is a potential risk of unnecessarily increasing
    minor faults, as a large amount of read activity on use-once pages could
    push mapped pages to the end of the LRU where they get unmapped.

    There is no proof this is still a problem, but currently it is not
    considered to be. Active file pages are not deactivated if the active
    file list is smaller than the inactive list, reducing the likelihood
    that file-mapped pages are pushed off the LRU, and referenced executable
    pages are kept on the active list to avoid them being pushed out by read
    activity.

    Even if it is still a problem, prev_priority would not work nowadays.
    First of all, current vmscan still contains a lot of UP-centric code,
    which exposes weaknesses on machines with dozens of CPUs. I think we
    need more and more improvement there.

    The problem is that current vmscan mixes up per-system pressure,
    per-zone pressure and per-task pressure a bit. For example,
    prev_priority tries to boost priority to match other concurrent
    reclaimers, but if another task has a mempolicy restriction that is
    unnecessary, and it also causes large latencies and excessive reclaim.
    Per-task priority plus the prev_priority adjustment amounts to an
    emulation of per-system pressure, but it has two issues: 1) the
    emulation is too rough and brutal, and 2) we need per-zone pressure,
    not per-system.

    Another example: currently DEF_PRIORITY is 12, which means the LRU is
    rotated about 2 cycles (1/4096 + 1/2048 + 1/1024 + ... + 1 ≈ 2) before
    invoking the OOM killer. But if 10,000 threads enter DEF_PRIORITY
    reclaim at the same time, the system is under higher memory pressure
    than priority == 0 (1/4096 * 10,000 > 2). prev_priority can't solve such
    multithreaded workload issues. In other words, the prev_priority concept
    assumes the system doesn't have lots of threads.

    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Reviewed-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Dave Chinner
    Cc: Chris Mason
    Cc: Nick Piggin
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Christoph Hellwig
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Andrea Arcangeli
    Cc: Michael Rubin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • sum_vm_events() passes a cpumask to for_each_cpu(), but this is
    pointless since we have for_each_online_cpu(). Although the overhead is
    trivial, it is not good for coding consistency.

    Let's use for_each_online_cpu instead of for_each_cpu with cpumask
    argument.

    Signed-off-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

25 May, 2010

4 commits

  • Ordinarily when a high-order allocation fails, direct reclaim is entered
    to free pages to satisfy the allocation. With this patch, it is
    determined if an allocation failed due to external fragmentation instead
    of low memory and if so, the calling process will compact until a suitable
    page is freed. Compaction by moving pages in memory is considerably
    cheaper than paging out to disk and works where there are locked pages or
    no swap. If compaction fails to free a page of a suitable size, then
    reclaim will still occur.

    Direct compaction returns as soon as possible. As each block is
    compacted, it is checked whether a suitable page has been freed, and if
    so, compaction returns.

    [akpm@linux-foundation.org: Fix build errors]
    [aarcange@redhat.com: fix count_vm_event preempt in memory compaction direct reclaim]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patch is the core of a mechanism which compacts memory in a zone by
    relocating movable pages towards the end of the zone.

    A single compaction run involves a migration scanner and a free scanner.
    Both scanners operate on pageblock-sized areas in the zone. The migration
    scanner starts at the bottom of the zone and searches for all movable
    pages within each area, isolating them onto a private list called
    migratelist. The free scanner starts at the top of the zone and searches
    for suitable areas, consuming the free pages within them and making them
    available to the migration scanner. The pages isolated for migration are
    then migrated to the newly isolated free pages.
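
    A condensed, illustrative sketch of a single compaction run (the real
    loop is compact_zone() in mm/compaction.c; error handling and the
    scanner bookkeeping are omitted):

        /* run until the two scanners meet or a suitable page frees up */
        while (compact_finished(zone, cc) == COMPACT_CONTINUE) {
                /* migration scanner: isolate movable pages from the
                 * bottom of the zone onto cc->migratepages */
                if (!isolate_migratepages(zone, cc))
                        continue;

                /* the free scanner supplies target pages from the top
                 * of the zone via the compaction_alloc() callback */
                migrate_pages(&cc->migratepages, compaction_alloc,
                              (unsigned long)cc, 0);
        }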

    [aarcange@redhat.com: Fix unsafe optimisation]
    [mel@csn.ul.ie: do not schedule work on other CPUs for compaction]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The fragmentation index is only meaningful if an allocation would fail,
    and it indicates what the failure is due to. A value of -1, as in many
    of the examples below, states that the allocation would succeed. If it
    would fail, the value is between 0 and 1. A value tending towards 0
    implies the allocation failed due to a lack of memory. A value tending
    towards 1 implies it failed due to external fragmentation.

    For the most part, the huge page size will be the size of interest, but
    not necessarily, so it is exported on a per-order and per-zone basis via
    /sys/kernel/debug/extfrag/extfrag_index.

    > cat /sys/kernel/debug/extfrag/extfrag_index
    Node 0, zone DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
    Node 0, zone Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 0.954
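
    Conceptually, the index is derived from the free-page counts; a sketch
    of the fixed-point arithmetic (mm/vmstat.c scales by 1000 to avoid
    floating point; the fields are from struct contig_page_info):

        /* a block that can satisfy the request exists: index is -1 */
        if (info->free_blocks_suitable)
                return -1000;

        /* tends to 0 when free pages are scarce, to 1000 when plenty of
         * free pages exist but are split into unsuitably small blocks */
        return 1000 - div_u64((1000 +
                        div_u64(info->free_pages * 1000ULL, requested)),
                        info->free_blocks_total);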

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Acked-by: Rik van Riel
    Reviewed-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The unusable free space index measures how much of the available free
    memory cannot be used to satisfy an allocation of a given size and is a
    value between 0 and 1. The higher the value, the more of the free memory
    is unusable and, by implication, the worse the external fragmentation
    is. For the most part, the huge page size will be the size of interest,
    but not necessarily, so it is exported on a per-order and per-zone basis
    via /sys/kernel/debug/extfrag/unusable_index.

    > cat /sys/kernel/debug/extfrag/unusable_index
    Node 0, zone DMA 0.000 0.000 0.000 0.001 0.005 0.013 0.021 0.037 0.037 0.101 0.230
    Node 0, zone Normal 0.000 0.000 0.000 0.001 0.002 0.002 0.005 0.015 0.028 0.028 0.054
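
    The computation is straightforward; a sketch of the scaled-by-1000
    arithmetic from mm/vmstat.c (struct contig_page_info fields again):

        /* no free memory is treated as all free memory being unusable */
        if (info->free_pages == 0)
                return 1000;

        /* fraction of free pages sitting in blocks too small to
         * satisfy an allocation of this order */
        return div_u64((info->free_pages -
                        (info->free_blocks_suitable << order)) * 1000ULL,
                        info->free_pages);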

    [akpm@linux-foundation.org: Fix allnoconfig]
    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    The percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming availability. As this
    conversion needs to touch a large number of source files, the following
    script was used as the basis of the conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following:

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there, i.e. if only gfp is used,
    gfp.h; if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It is put in the include block which contains
    core kernel includes, in the same order that the rest are ordered:
    alphabetical, Christmas tree, reverse-Christmas-tree, or at the end
    if there doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files, but without automatically
    editing them, as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored, as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on the arch to make
    things build (like ipr on powerpc/64, which failed due to missing
    writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that they could be applied
    as a separate patch and serve as a bisection point.

    Given the fact that I had only a couple of failures from the tests in
    step 7, I'm fairly confident about the coverage of this conversion
    patch. If there is a breakage, it's likely to be something in one of
    the arch headers, which should be easily discoverable on most builds
    of the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

07 Mar, 2010

1 commit

  • commit e815af95 ("change all_unreclaimable zone member to flags") changed
    all_unreclaimable member to bit flag. But it had an undesireble side
    effect. free_one_page() is one of most hot path in linux kernel and
    increasing atomic ops in it can reduce kernel performance a bit.

    Thus, this patch revert such commit partially. at least
    all_unreclaimable shouldn't share memory word with other zone flags.

    [akpm@linux-foundation.org: fix patch interaction]
    Signed-off-by: KOSAKI Motohiro
    Cc: David Rientjes
    Cc: Wu Fengguang
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Huang Shijie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

05 Jan, 2010

2 commits

  • Remove the pageset notifier since it only marks that a processor
    exists on a specific node. Move that code into the vmstat notifier.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Tejun Heo

    Christoph Lameter
     
  • Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.

    This drastically reduces the size of struct zone for systems with large
    numbers of processors and allows placement of the critical variables of
    struct zone in one cacheline, even on very large systems.

    Another effect is that the pagesets of one processor are placed near one
    another. If multiple pagesets from different zones fit into one cacheline
    then additional cacheline fetches can be avoided on the hot paths when
    allocating memory from multiple zones.

    Bootstrap becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs
    are reduced and we can drop the zone_pcp macro.

    Hotplug handling is also simplified since cpu alloc can bring up and
    shut down cpu areas for a specific cpu as a whole. So there is no need to
    allocate or free individual pagesets.
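
    The structural change, in sketch form (the exact field layout is in
    struct zone; "before" is shown conceptually, since the old code varied
    between SMP and NUMA configurations):

        /* before: a per-CPU array embedded in (or hung off) struct zone */
        struct per_cpu_pageset pageset[NR_CPUS];

        /* after: a single pointer into the per-cpu allocator */
        struct per_cpu_pageset __percpu *pageset;

        /* hot-path access to a given CPU's pageset */
        struct per_cpu_pageset *pset = per_cpu_ptr(zone->pageset, cpu);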

    V7-V8:
    - Explain the chicken-and-egg dilemma with the percpu allocator.

    V4-V5:
    - Fix up cases where per_cpu_ptr is called before irq disable
    - Integrate the bootstrap logic that was separate before.

    tj: Build failure in pageset_cpuup_callback() due to missing ret
    variable fixed.

    Reviewed-by: Mel Gorman
    Signed-off-by: Christoph Lameter
    Signed-off-by: Tejun Heo

    Christoph Lameter
     

16 Dec, 2009

2 commits

  • If reclaim fails to make sufficient progress, the priority is raised.
    Once the priority is higher, kswapd starts waiting on congestion.
    However, if the zone is below the min watermark then kswapd needs to
    continue working without delay as there is a danger of an increased rate
    of GFP_ATOMIC allocation failure.

    This patch changes the conditions under which kswapd waits on congestion
    by only going to sleep if the min watermarks are being met.

    [mel@csn.ul.ie: add stats to track how relevant the logic is]
    [mel@csn.ul.ie: make kswapd only check its own zones and rename the relevant counters]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • After kswapd balances all zones in a pgdat, it goes to sleep. In the
    event of no IO congestion, kswapd can go to sleep very shortly after the
    high watermark was reached. If there is a constant stream of allocations
    from parallel processes, it can mean that kswapd went to sleep too
    quickly and the high watermark is not being maintained for a sufficient
    length of time.

    This patch makes kswapd go to sleep as a two-stage process. It first
    tries to sleep for HZ/10. If it is woken up by another process or the
    high watermark is no longer met, it's considered a premature sleep and
    kswapd continues work. Otherwise it goes fully to sleep.

    This adds more counters to distinguish between fast and slow breaches of
    watermarks. A "fast" premature sleep is one where the low watermark was
    hit in a very short time after kswapd went to sleep. A "slow" premature
    sleep indicates that the high watermark was breached after a very short
    interval.
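
    A condensed sketch of the two-stage sleep (the real code is in kswapd()
    in mm/vmscan.c; sleeping_prematurely() is the watermark re-check added
    by this series, and the event names match the new counters):

        prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);

        /* stage 1: catnap for HZ/10 and see whether the sleep holds */
        remaining = schedule_timeout(HZ / 10);

        if (!sleeping_prematurely(pgdat, order, remaining)) {
                schedule();     /* stage 2: the catnap held, sleep fully */
        } else if (remaining) {
                /* woken during the catnap: low watermark hit quickly */
                count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
        } else {
                /* timed out, but the high watermark is already breached */
                count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
        }
        finish_wait(&pgdat->kswapd_wait, &wait);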

    Signed-off-by: Mel Gorman
    Cc: Frans Pop
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

29 Oct, 2009

1 commit

  • This patch updates percpu related symbols under kernel/ and mm/ such
    that percpu symbols are unique and don't clash with local symbols.
    This serves two purposes of decreasing the possibility of global
    percpu symbol collision and allowing dropping per_cpu__ prefix from
    percpu symbols.

    * kernel/lockdep.c: s/lock_stats/cpu_lock_stats/

    * kernel/sched.c: s/init_rq_rt/init_rt_rq_var/ (any better idea?)
    s/sched_group_cpus/sched_groups/

    * kernel/softirq.c: s/ksoftirqd/run_ksoftirqd/

    * kernel/softlockup.c: s/(*)_timestamp/softlockup_\1_ts/
    s/watchdog_task/softlockup_watchdog/
    s/timestamp/ts/ for local variables

    * kernel/time/timer_stats: s/lookup_lock/tstats_lookup_lock/

    * mm/slab.c: s/reap_work/slab_reap_work/
    s/reap_node/slab_reap_node/

    * mm/vmstat.c: local variable changed to avoid collision with vmstat_work

    Partly based on Rusty Russell's "alloc_percpu: rename percpu vars
    which cause name clashes" patch.

    Signed-off-by: Tejun Heo
    Acked-by: (slab/vmstat) Christoph Lameter
    Reviewed-by: Christoph Lameter
    Cc: Rusty Russell
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Andrew Morton
    Cc: Nick Piggin

    Tejun Heo
     

22 Sep, 2009

3 commits

  • If the system is running a heavy load of processes then concurrent reclaim
    can isolate a large number of pages from the LRU. /proc/vmstat and the
    output generated for an OOM do not show how many pages were isolated.

    This has been observed during process fork bomb testing (mstctl11 in LTP).

    This patch shows the information about isolated pages.

    Reproduced via:

    -----------------------
    % ./hackbench 140 process 1000
    => OOM occur

    active_anon:146 inactive_anon:0 isolated_anon:49245
    active_file:79 inactive_file:18 isolated_file:113
    unevictable:0 dirty:0 writeback:0 unstable:0 buffer:39
    free:370 slab_reclaimable:309 slab_unreclaimable:5492
    mapped:53 shmem:15 pagetables:28140 bounce:0

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Acked-by: Wu Fengguang
    Reviewed-by: Minchan Kim
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Recently we encountered OOM problems due to the memory use of the GEM
    cache. Generally, a large amount of Shmem/Tmpfs pages tends to create a
    memory shortage problem.

    We often use the following calculation to determine the amount of shmem
    pages:

    shmem = NR_ACTIVE_ANON + NR_INACTIVE_ANON - NR_ANON_PAGES

    However, the expression does not consider isolated and mlocked pages.

    This patch adds explicit accounting for pages used by shmem and tmpfs.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Reviewed-by: Christoph Lameter
    Acked-by: Wu Fengguang
    Cc: David Rientjes
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The amount of memory allocated to kernel stacks can become significant and
    cause OOM conditions. However, we do not display the amount of memory
    consumed by stacks.

    Add code to display the amount of memory used for stacks in /proc/meminfo.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Reviewed-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

17 Jun, 2009

5 commits

  • On NUMA machines, the administrator can configure zone_reclaim_mode,
    which is a more targeted form of direct reclaim. On machines with large
    NUMA distances, for example, zone_reclaim_mode defaults to 1, meaning
    that clean unmapped pages will be reclaimed if the zone watermarks are
    not being met.

    There is a heuristic that determines if the scan is worthwhile but it is
    possible that the heuristic will fail and the CPU gets tied up scanning
    uselessly. Detecting the situation requires some guesswork and
    experimentation so this patch adds a counter "zreclaim_failed" to
    /proc/vmstat. If during high CPU utilisation this counter is increasing
    rapidly, then the resolution to the problem may be to set
    /proc/sys/vm/zone_reclaim_mode to 0.

    [akpm@linux-foundation.org: name things consistently]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently, nobody wants to turn UNEVICTABLE_LRU off. Thus this
    configurability is unnecessary.

    Signed-off-by: KOSAKI Motohiro
    Cc: Johannes Weiner
    Cc: Andi Kleen
    Acked-by: Minchan Kim
    Cc: David Woodhouse
    Cc: Matt Mackall
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The lru->nr_saved_scan's are not meaningful counters, even for kernel
    developers. They are typically smaller than 32 and are always 0 for
    large lists. So remove them from /proc/zoneinfo.

    Hopefully this interface change won't break too many scripts.
    /proc/zoneinfo is too unstructured to be script friendly, and I wonder
    whether the affected scripts - if there are any - are still bleeding
    since the not-so-long-ago commit "vmscan: split LRU lists into anon &
    file sets", which also touched the "scanned" line :)

    If we are to re-export accumulated vmscan counts in the future, they can
    go to new lines in /proc/zoneinfo instead of the current form, or to
    /sys/devices/system/node/node0/meminfo?

    Signed-off-by: Wu Fengguang
    Acked-by: Rik van Riel
    Cc: Nick Piggin
    Acked-by: Christoph Lameter
    Cc: Peter Zijlstra
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • The vmscan batching logic is convoluted. Move it into a standalone
    function, nr_scan_try_batch(), and document it. No behavior change.
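
    The factored-out helper looks roughly like this (a sketch following the
    patch: small scan requests accumulate in *nr_saved_scan until they reach
    the batch size, then are released all at once):

        static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
                                               unsigned long *nr_saved_scan,
                                               unsigned long swap_cluster_max)
        {
                unsigned long nr;

                *nr_saved_scan += nr_to_scan;
                nr = *nr_saved_scan;

                if (nr >= swap_cluster_max)
                        *nr_saved_scan = 0;   /* batch is full: scan it all */
                else
                        nr = 0;               /* keep accumulating */

                return nr;
        }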

    Signed-off-by: Wu Fengguang
    Acked-by: Rik van Riel
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Acked-by: Peter Zijlstra
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • ALLOC_WMARK_MIN, ALLOC_WMARK_LOW and ALLOC_WMARK_HIGH determine whether
    pages_min, pages_low or pages_high is used as the zone watermark when
    allocating pages. Two branches in the allocator hotpath determine which
    watermark to use.

    This patch uses the flags as an array index into a watermark array that is
    indexed with WMARK_* defines accessed via helpers. All call sites that
    use zone->pages_* are updated to use the helpers for accessing the values
    and the array offsets for setting.
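
    In sketch form, the watermarks become an array plus helpers (as in the
    patch; the WMARK_* names index the array):

        enum zone_watermarks {
                WMARK_MIN,
                WMARK_LOW,
                WMARK_HIGH,
                NR_WMARK
        };

        /* struct zone grows: unsigned long watermark[NR_WMARK]; */
        #define min_wmark_pages(z)  (z->watermark[WMARK_MIN])
        #define low_wmark_pages(z)  (z->watermark[WMARK_LOW])
        #define high_wmark_pages(z) (z->watermark[WMARK_HIGH])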

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

18 May, 2009

1 commit

  • pfn_valid() is meant to be able to tell if a given PFN has valid memmap
    associated with it or not. In FLATMEM, it is expected that holes always
    have valid memmap as long as there are valid PFNs on either side of the hole.
    In SPARSEMEM, it is assumed that a valid section has a memmap for the
    entire section.

    However, ARM and maybe other embedded architectures in the future free
    memmap backing holes to save memory on the assumption the memmap is never
    used. The page_zone linkages are then broken even though pfn_valid()
    returns true. A walker of the full memmap must then do this additional
    check to ensure the memmap they are looking at is sane by making sure the
    zone and PFN linkages are still valid. This is expensive, but walkers of
    the full memmap are extremely rare.

    This was caught before for FLATMEM and hacked around but it hits again for
    SPARSEMEM because the page_zone linkages can look ok where the PFN linkages
    are totally screwed. This looks like a hatchet job but the reality is that
    any clean solution would end up consuming all the memory saved by punching
    these unexpected holes in the memmap. For example, we tried marking the
    memmap within the section invalid but the section size exceeds the size of
    the hole in most cases so pfn_valid() starts returning false where valid
    memmap exists. Shrinking the size of the section would increase memory
    consumption offsetting the gains.

    This patch identifies when an architecture is punching unexpected holes
    in the memmap that the memory model cannot automatically detect and sets
    ARCH_HAS_HOLES_MEMORYMODEL. At the moment, this is restricted to EP93xx
    which is the model sub-architecture this has been reported on but may expand
    later. When set, walkers of the full memmap must call memmap_valid_within()
    for each PFN, passing in what they expect the page and zone to be for that
    PFN. If the linkages are found to be broken, the memmap is assumed to be
    invalid for that PFN.
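
    The check itself is small; roughly (a sketch of memmap_valid_within()
    as added to mm/vmstat.c):

        int memmap_valid_within(unsigned long pfn,
                                struct page *page, struct zone *zone)
        {
                /* the page must map back to the PFN we derived it from */
                if (page_to_pfn(page) != pfn)
                        return 0;

                /* and its zone linkage must match the expected zone */
                if (page_zone(page) != zone)
                        return 0;

                return 1;
        }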

    Signed-off-by: Mel Gorman
    Signed-off-by: Russell King

    Mel Gorman
     

03 Apr, 2009

1 commit

  • Even though vmstat_work is marked deferrable, there are still benefits to
    aligning it. For certain applications we want to keep OS jitter as low as
    possible and aligning timers and work so they occur together can reduce
    their overall impact.
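
    The change is a one-liner in spirit: round the initial delay so the
    per-cpu works fire together. A sketch (start_cpu_timer() in mm/vmstat.c;
    __round_jiffies_relative() keeps the CPUs spread but aligned):

        static void __cpuinit start_cpu_timer(int cpu)
        {
                struct delayed_work *work = &per_cpu(vmstat_work, cpu);

                INIT_DELAYED_WORK_DEFERRABLE(work, vmstat_update);
                schedule_delayed_work_on(cpu, work,
                                         __round_jiffies_relative(HZ, cpu));
        }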

    Signed-off-by: Anton Blanchard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Blanchard
     

01 Apr, 2009

1 commit

  • Impact: cleanup

    In almost all cases, for_each_zone() is used together with
    populated_zone(), because almost no function needs information about
    memoryless nodes. Therefore, for_each_populated_zone() can help simplify
    the code.

    This patch has no functional change.
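
    A sketch of the helper macro (as in include/linux/mmzone.h; it iterates
    over all zones and skips the unpopulated ones):

        #define for_each_populated_zone(zone)                   \
                for (zone = (first_online_pgdat())->node_zones; \
                     zone;                                      \
                     zone = next_zone(zone))                    \
                        if (!populated_zone(zone))              \
                                ; /* skip unpopulated zones */  \
                        else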

    [akpm@linux-foundation.org: small cleanup]
    Signed-off-by: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Reviewed-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

01 Jan, 2009

1 commit

  • Impact: Use new API

    Convert kernel mm functions to use struct cpumask.

    We skip include/linux/percpu.h and mm/allocpercpu.c, which are in flux.

    Signed-off-by: Rusty Russell
    Signed-off-by: Mike Travis
    Reviewed-by: Christoph Lameter

    Rusty Russell
     

20 Oct, 2008

3 commits

  • Allow freeing of mlock()ed pages. This shouldn't happen, but during
    development, it occasionally did.

    This patch allows us to survive that condition, while keeping the
    statistics and events correct for debug.

    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Add NR_MLOCK zone page state, which provides a (conservative) count of
    mlocked pages (actually, the number of mlocked pages moved off the LRU).

    Reworked by lts to fit in with the modified mlock page support in the
    Reclaim Scalability series.

    [kosaki.motohiro@jp.fujitsu.com: fix incorrect Mlocked field of /proc/meminfo]
    [lee.schermerhorn@hp.com: mlocked-pages: add event counting with statistics]
    Signed-off-by: Nick Piggin
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Report unevictable pages per zone and system wide.

    Kosaki Motohiro added support for memory controller unevictable
    statistics.

    [riel@redhat.com: fix printk in show_free_areas()]
    [akpm@linux-foundation.org: fix units in /proc/vmstats]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Debugged-by: Hiroshi Shimamoto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn