11 Aug, 2010

3 commits

  • * 'for-2.6.36' of git://git.kernel.dk/linux-2.6-block: (149 commits)
    block: make sure that REQ_* types are seen even with CONFIG_BLOCK=n
    xen-blkfront: fix missing out label
    blkdev: fix blkdev_issue_zeroout return value
    block: update request stacking methods to support discards
    block: fix missing export of blk_types.h
    writeback: fix bad _bh spinlock nesting
    drbd: revert "delay probes", feature is being re-implemented differently
    drbd: Initialize all members of sync_conf to their defaults [Bugz 315]
    drbd: Disable delay probes for the upcomming release
    writeback: cleanup bdi_register
    writeback: add new tracepoints
    writeback: remove unnecessary init_timer call
    writeback: optimize periodic bdi thread wakeups
    writeback: prevent unnecessary bdi threads wakeups
    writeback: move bdi threads exiting logic to the forker thread
    writeback: restructure bdi forker loop a little
    writeback: move last_active to bdi
    writeback: do not remove bdi from bdi_list
    writeback: simplify bdi code a little
    writeback: do not lose wake-ups in bdi threads
    ...

    Fixed up pretty trivial conflicts in drivers/block/virtio_blk.c and
    drivers/scsi/scsi_error.c as per Jens.

    Linus Torvalds
     
  • * 'kmemleak' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-2.6-cm:
    kmemleak: Fix typo in the comment
    lib/scatterlist: Hook sg_kmalloc into kmemleak (v2)
    kmemleak: Add DocBook style comments to kmemleak.c
    kmemleak: Introduce a default off mode for kmemleak
    kmemleak: Show more information for objects found by alias

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (96 commits)
    no need for list_for_each_entry_safe()/resetting with superblock list
    Fix sget() race with failing mount
    vfs: don't hold s_umount over close_bdev_exclusive() call
    sysv: do not mark superblock dirty on remount
    sysv: do not mark superblock dirty on mount
    btrfs: remove junk sb_dirt change
    BFS: clean up the superblock usage
    AFFS: wait for sb synchronization when needed
    AFFS: clean up dirty flag usage
    cifs: truncate fallout
    mbcache: fix shrinker function return value
    mbcache: Remove unused features
    add f_flags to struct statfs(64)
    pass a struct path to vfs_statfs
    update VFS documentation for method changes.
    All filesystems that need invalidate_inode_buffers() are doing that explicitly
    convert remaining ->clear_inode() to ->evict_inode()
    Make ->drop_inode() just return whether inode needs to be dropped
    fs/inode.c:clear_inode() is gone
    fs/inode.c:evict() doesn't care about delete vs. non-delete paths now
    ...

    Fix up trivial conflicts in fs/nilfs2/super.c

    Linus Torvalds
     

10 Aug, 2010

37 commits

  • * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
    powerpc: fix build with make 3.82
    Revert "Input: appletouch - fix integer overflow issue"
    memblock: Fix memblock_is_region_reserved() to return a boolean
    powerpc: Trim defconfigs
    powerpc: fix i8042 module build error
    sound/soc: mpc5200_psc_ac97: Use gpio pins for cold reset
    powerpc/5200: add mpc5200_psc_ac97_gpio_reset

    Linus Torvalds
     
  • When taking a memory snapshot in hibernate_snapshot(), all (directly
    called) memory allocations use GFP_ATOMIC. Hence swap misuse during
    hibernation never occurs.

    But from a pessimistic point of view, there is no guarantee that no
    page allocation has __GFP_WAIT. It is better to have a global
    indication "we are entering hibernation; don't use swap!".

    This patch tries to freeze new swap allocation during hibernation.
    (All user processes are frozen, so swapin is not a concern.)

    This way, no updates will happen to swap_map[] between
    hibernate_snapshot() and save_image(). Swap is thawed when swsusp_free()
    is called. We can be assured that swap corruption will not occur.
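
    A minimal sketch of the idea, assuming a simple global flag; the
    names and hook points below are illustrative, not the patch's actual
    identifiers:

        /* Illustrative sketch only -- names are hypothetical. */
        static bool swap_frozen;        /* set while hibernation owns swap */

        void freeze_swap(void) { swap_frozen = true; }
        void thaw_swap(void)   { swap_frozen = false; }

        /* In the swap allocator, refuse new allocations while frozen so
         * swap_map[] stays stable between snapshot and save_image(). */
        swp_entry_t get_swap_page(void)
        {
                if (swap_frozen)
                        return (swp_entry_t) { 0 };
                /* ... normal free-entry scan of swap_map[] ... */
        }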

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Rafael J. Wysocki"
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Ondrej Zary
    Cc: Balbir Singh
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Since 2.6.31, swap_map[]'s refcounting was changed so it can show
    that a used swap entry is only held by the swap cache and can be
    reused. Then, while scanning for a free entry in swap_map[], a swap
    entry may be reclaimed and reused. This was caused by commit
    c9e444103b5e7a5 ("mm: reuse unused swap entry if necessary").

    But this caused data corruption at resume. The scenario is:

    - Assume a clean-swap cache, but mapped.

    - At hibernation_snapshot(), the clean swap cache is saved as
    clean swap cache and swap_map[] is marked as SWAP_HAS_CACHE.

    - Then, save_image() is called, which reuses the SWAP_HAS_CACHE
    entry to save the image, breaking its contents.

    After resume:

    - Memory reclaim runs, finds the clean, not-referenced swap cache
    and discards it because it's marked as clean. But here, the
    contents on disk and in the swap cache are inconsistent.

    Hence memory is corrupted.

    This patch avoids the bug by not reclaiming swap-entry during hibernation.
    This is a quick fix for backporting.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Rafael J. Wysocki
    Reported-by: Ondrej Zary
    Tested-by: Ondrej Zary
    Tested-by: Andrea Gelmini
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Use memory allocated at compile time instead of dynamically
    allocated memory for mm_slots_hash.

    Use hash_ptr() instead of divisions for bucket calculation.
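
    A minimal sketch of the pattern (the size and helper below are
    illustrative, not the patch's exact code):

        #include <linux/hash.h>

        #define MM_SLOTS_HASH_SHIFT 10
        #define MM_SLOTS_HASH_HEADS (1 << MM_SLOTS_HASH_SHIFT)

        /* Allocated at compile time -- no kmalloc() needed at init. */
        static struct hlist_head mm_slots_hash[MM_SLOTS_HASH_HEADS];

        static struct hlist_head *mm_slot_bucket(struct mm_struct *mm)
        {
                /* hash_ptr() mixes the pointer bits; no division. */
                return &mm_slots_hash[hash_ptr(mm, MM_SLOTS_HASH_SHIFT)];
        }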

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Izik Eidus
    Cc: Avi Kivity
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • Fix "system goes unresponsive under memory pressure and lots of
    dirty/writeback pages" bug.

    http://lkml.org/lkml/2010/4/4/86

    In the above thread, Andreas Mohr described that

    Invoking any command locked up for minutes (note that I'm
    talking about attempted additional I/O to the _other_,
    _unaffected_ main system HDD - such as loading some shell
    binaries -, NOT the external SSD18M!!).

    This happens when two conditions are both met:
    - under memory pressure
    - writing heavily to a slow device

    OOM also happens in Andreas' system. The OOM trace shows that 3 processes
    are stuck in wait_on_page_writeback() in the direct reclaim path. One in
    do_fork() and the other two in unix_stream_sendmsg(). They are blocked on
    this condition:

    (sc->order && priority < DEF_PRIORITY - 2)

    which was introduced in commit 78dc583d (vmscan: low order lumpy reclaim
    also should use PAGEOUT_IO_SYNC) one year ago. That condition may be too
    permissive. In Andreas' case, 512MB/1024 = 512KB. If the direct reclaim
    for the order-1 fork() allocation runs into a range of 512KB
    hard-to-reclaim LRU pages, it will be stalled.

    It's a severe problem in three ways.

    Firstly, it can easily happen in daily desktop usage. vmscan priority can
    easily go below (DEF_PRIORITY - 2) on _local_ memory pressure. Even if
    the system has 50% globally reclaimable pages, it still has good
    opportunity to have 0.1% sized hard-to-reclaim ranges. For example, a
    simple dd can easily create a big range (up to 20%) of dirty pages in the
    LRU lists. And order-1 to order-3 allocations are more than common with
    SLUB. Try "grep -v '1 :' /proc/slabinfo" to get the list of high order
    slab caches. For example, the order-1 radix_tree_node slab cache may
    stall applications at swap-in time; the order-3 inode cache on most
    filesystems may stall applications when trying to read some file; the
    order-2 proc_inode_cache may stall applications when trying to open a
    /proc file.

    Secondly, once triggered, it will stall unrelated processes (not doing IO
    at all) in the system. This "one slow USB device stalls the whole system"
    avalanching effect is very bad.

    Thirdly, once stalled, the stall time could be intolerably long for
    the users. When there are 20MB of queued writeback pages and USB 1.1
    is writing them at 1MB/s, wait_on_page_writeback() will be stuck for
    up to 20 seconds. Not to mention it may be called multiple times.

    So raise the bar to only enable PAGEOUT_IO_SYNC when priority goes below
    DEF_PRIORITY/3, or 6.25% LRU size. As the default dirty throttle ratio is
    20%, it will hardly be triggered by pure dirty pages. We'd better treat
    PAGEOUT_IO_SYNC as some last resort workaround -- its stall time is so
    uncomfortably long (easily goes beyond 1s).

    The bar is only raised for (order < PAGE_ALLOC_COSTLY_ORDER) allocations,
    which are easy to satisfy in 1TB memory boxes. So, although 6.25% of
    memory could be an awful lot of pages to scan on a system with 1TB of
    memory, it won't really have to busy scan that much.
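
    A sketch of the threshold change, simplified from the changelog's
    description rather than taken from the exact diff (the flag name is
    a stand-in):

        /* Before: stall almost as soon as priority drops (too permissive). */
        if (sc->order && priority < DEF_PRIORITY - 2)
                use_pageout_io_sync = true;

        /* After (sketch): only below DEF_PRIORITY / 3, i.e. after 6.25%
         * of the LRU has been scanned, and only for cheap orders. */
        if (sc->order && sc->order < PAGE_ALLOC_COSTLY_ORDER &&
            priority < DEF_PRIORITY / 3)
                use_pageout_io_sync = true;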

    Andreas tested an older version of this patch and reported that it mostly
    fixed his problem. Mel Gorman helped improve it and KOSAKI Motohiro will
    fix it further in the next patch.

    Reported-by: Andreas Mohr
    Reviewed-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Signed-off-by: Wu Fengguang
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • kmalloc() may fail; if so, return -ENOMEM.
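
    The standard pattern, for illustration (buf and len are
    placeholders):

        buf = kmalloc(len, GFP_KERNEL);
        if (!buf)
                return -ENOMEM;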

    Signed-off-by: Kulikov Vasiliy
    Acked-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kulikov Vasiliy
     
  • Memcg also needs to trace page isolation information, as global
    reclaim does. This patch does that.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Memcg also needs to trace reclaim progress, as direct reclaim does.
    This patch adds that.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Presently shrink_slab() has the following scanning equation.

    basic_scan_objects = 4 x (lru_scanned / lru_pages)
                           x (max_pass / shrinker->seeks)

    (shrinker->seeks defaults to 2)

    scan_objects = min(basic_scan_objects, max_pass * 2)

    If we pass a very small value as lru_pages instead of the real
    number of LRU pages, shrink_slab() drops far more objects than
    necessary. And right now, __zone_reclaim() passes 'order' as
    lru_pages by mistake. That produces a bad result.

    For example, if we receive very low memory pressure (scan = 32,
    order = 0), shrink_slab() via zone_reclaim() always drops _all_
    icache/dcache objects. (See the above equation: a very small
    lru_pages makes scan_objects very big.)

    This patch fixes it.
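
    A sketch of the fix as described; the helper used for the LRU page
    count is my assumption about the mm API of that era:

        /* __zone_reclaim(): hand shrink_slab() the real number of LRU
         * pages, not the allocation order. */
        unsigned long lru_pages = zone_reclaimable_pages(zone);

        /* before: shrink_slab(sc.nr_scanned, gfp_mask, order); */
        shrink_slab(sc.nr_scanned, gfp_mask, lru_pages);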

    [akpm@linux-foundation.org: fix layout, typos]
    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Acked-by: Christoph Lameter
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • It is not appropriate for apply_to_page_range() to directly call any mmu
    notifiers, because it is a general purpose function whose effect depends
    on what context it is called in and what the callback function does.

    In particular, if it is being used as part of an mmu notifier
    implementation, the recursive calls can be particularly problematic.

    It is up to apply_to_page_range's caller to do any notifier calls if
    necessary. It does not affect any in-tree users because they all operate
    on init_mm, and mmu notifiers only pertain to usermode mappings.

    [stefano.stabellini@eu.citrix.com: remove unused local `start']
    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Stefano Stabellini
    Cc: Andrea Arcangeli
    Cc: Stefano Stabellini
    Cc: Avi Kivity
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeremy Fitzhardinge
     
  • Rik van Riel pointed out that reading reclaim_stat should be
    protected by lru_lock; otherwise vmscan might sweep twice as many
    pages as intended.
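
    A sketch of the fixed pattern (field and helper names follow the
    vmscan.c of that era as I recall them; treat them as illustrative):

        struct zone_reclaim_stat *rstat = get_reclaim_stat(zone, sc);
        unsigned long anon_rot, file_rot;

        spin_lock_irq(&zone->lru_lock);
        /* Read the rotation counters only while holding lru_lock. */
        anon_rot = rstat->recent_rotated[0];
        file_rot = rstat->recent_rotated[1];
        spin_unlock_irq(&zone->lru_lock);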

    This fault was introduced by

    commit 4f98a2fee8acdb4ac84545df98cccecfd130f8db
    Author: Rik van Riel
    Date: Sat Oct 18 20:26:32 2008 -0700

    vmscan: split LRU lists into anon & file sets

    Signed-off-by: KOSAKI Motohiro
    Cc: Rik van Riel
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • 'slab_reclaimable' and 'nr_pages' are unsigned. Subtraction is
    unsafe because a negative result would wrap around to a huge
    positive value and be misinterpreted.
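
    Why that matters, in a self-contained example (plain userspace C,
    purely illustrative):

        #include <stdio.h>

        int main(void)
        {
                unsigned long slab_reclaimable = 10, nr_pages = 32;

                /* Wrong: 10 - 32 wraps to a huge positive value, so the
                 * test is true for exactly the wrong reason. */
                if (slab_reclaimable - nr_pages > 0)
                        printf("underflow: %lu\n",
                               slab_reclaimable - nr_pages);

                /* Safe: compare instead of subtracting. */
                if (slab_reclaimable > nr_pages)
                        printf("pages to spare\n");
                return 0;
        }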

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Set the flag if do_swap_page is decowing the page, the same way
    do_wp_page would.

    Signed-off-by: Andrea Arcangeli
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • On swapin it is fairly common for a page to be owned exclusively by one
    process. In that case we want to add the page to the anon_vma of that
    process's VMA, instead of to the root anon_vma.

    This will reduce the amount of rmap searching that the swapout code needs
    to do.

    Signed-off-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • This is a complete rewrite of the oom killer's badness() heuristic
    which is used to determine which task to kill in oom conditions. The
    goal is to
    make it as simple and predictable as possible so the results are better
    understood and we end up killing the task which will lead to the most
    memory freeing while still respecting the fine-tuning from userspace.

    Instead of basing the heuristic on mm->total_vm for each task, the task's
    rss and swap space is used instead. This is a better indication of the
    amount of memory that will be freeable if the oom killed task is chosen
    and subsequently exits. This helps specifically in cases where KDE or
    GNOME is chosen for oom kill on desktop systems instead of a memory
    hogging task.

    The baseline for the heuristic is a proportion of memory that each task is
    currently using in memory plus swap compared to the amount of "allowable"
    memory. "Allowable," in this sense, means the system-wide resources for
    unconstrained oom conditions, the set of mempolicy nodes, the mems
    attached to current's cpuset, or a memory controller's limit. The
    proportion is given on a scale of 0 (never kill) to 1000 (always kill),
    roughly meaning that if a task has a badness() score of 500 that the task
    consumes approximately 50% of allowable memory resident in RAM or in swap
    space.

    The proportion is always relative to the amount of "allowable" memory and
    not the total amount of RAM systemwide so that mempolicies and cpusets may
    operate in isolation; they shall not need to know the true size of the
    machine on which they are running if they are bound to a specific set of
    nodes or mems, respectively.

    Root tasks are given 3% extra memory just like __vm_enough_memory()
    provides in LSMs. In the event of two tasks consuming similar amounts of
    memory, it is generally better to save root's task.

    Because of the change in the badness() heuristic's baseline, it is also
    necessary to introduce a new user interface to tune it. It's not possible
    to redefine the meaning of /proc/pid/oom_adj with a new scale since the
    ABI cannot be changed for backward compatibility. Instead, a new tunable,
    /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may
    be used to polarize the heuristic such that certain tasks are never
    considered for oom kill while others may always be considered. The value
    is added directly into the badness() score so a value of -500, for
    example, means to discount 50% of its memory consumption in comparison to
    other tasks either on the system, bound to the mempolicy, in the cpuset,
    or sharing the same memory controller.

    /proc/pid/oom_adj is changed so that its meaning is rescaled into the
    units used by /proc/pid/oom_score_adj, and vice versa. Changing one of
    these per-task tunables will rescale the value of the other to an
    equivalent meaning. Although /proc/pid/oom_adj was originally defined as
    a bitshift on the badness score, it now shares the same linear growth as
    /proc/pid/oom_score_adj but with different granularity. This is required
    so the ABI is not broken with userspace applications and allows oom_adj to
    be deprecated for future removal.
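
    A worked sketch of the scoring arithmetic described above,
    simplified (the in-kernel version also counts page tables, clamps,
    and special-cases unkillable tasks):

        /* Score = (rss + swap) as a proportion of "allowable" memory,
         * on a 0..1000 scale, then shifted by oom_score_adj. */
        long badness_sketch(unsigned long rss, unsigned long swapents,
                            unsigned long allowable, long oom_score_adj,
                            int is_root)
        {
                long points = (rss + swapents) * 1000 / allowable;

                if (is_root)
                        points -= 30;   /* 3% of allowable on this scale */

                points += oom_score_adj;        /* -1000 .. +1000 */
                return points < 0 ? 0 : points;
                /* e.g. task using 50% of allowable, adj 0  ->  500 */
        }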

    Signed-off-by: David Rientjes
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Oleg pointed out that the current PF_EXITING check is wrong, because
    PF_EXITING is a per-thread flag, not a per-process flag. He said:

    Two threads, group-leader L and its sub-thread T. T dumps the code.
    In this case both threads have ->mm != NULL, L has PF_EXITING.

    The first problem is that select_bad_process() always returns -1 in
    this case (even if the caller is T, this doesn't matter).

    The second problem is that we should add TIF_MEMDIE to T, not L.

    I think we can remove this dubious PF_EXITING check, but as a first
    step, this patch adds protection against the multi-threaded issue.

    Signed-off-by: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Minchan Kim
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • In a system under heavy load it was observed that even after the
    oom-killer selects a task to die, the task may take a long time to die.

    Right after sending a SIGKILL to the task selected by the oom-killer this
    task has its priority increased so that it can exit() soon, freeing
    memory. That is accomplished by:

        /*
         * We give our sacrificial lamb high priority and access to
         * all the memory it needs. That way it should be able to
         * exit() and clear out its resources quickly...
         */
        p->rt.time_slice = HZ;
        set_tsk_thread_flag(p, TIF_MEMDIE);

    It sounds plausible to give the dying task an even higher priority,
    to be sure it will be scheduled sooner and free the desired memory.
    It was suggested on LKML to use SCHED_FIFO:1, the lowest RT
    priority, so that this task won't interfere with any running RT
    task.

    If the dying task is already an RT task, leave it untouched.
    Another good suggestion, implemented here, was to avoid boosting the
    dying task's priority in case of mem_cgroup OOM.
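
    A sketch of that approach (the helper name is illustrative;
    rt_task() and sched_setscheduler_nocheck() are real kernel APIs):

        static void boost_dying_task_prio(struct task_struct *p,
                                          struct mem_cgroup *mem)
        {
                /* SCHED_FIFO:1 -- lowest RT prio, won't disturb RT work. */
                struct sched_param param = { .sched_priority = 1 };

                if (mem)
                        return;         /* don't boost for memcg OOM */
                if (!rt_task(p))
                        sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
        }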

    Signed-off-by: Luis Claudio R. Goncalves
    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis Claudio R. Goncalves
     
  • The current "child->mm == p->mm" check prevents selection of vfork()ed
    task. But we don't have any reason to don't consider vfork().

    Removed.

    Signed-off-by: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Presently has_intersects_mems_allowed() has its own thread iteration
    logic, but it should use while_each_thread().

    This slightly improves code readability.
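
    The standard iteration pattern, for illustration (tsk and the check
    are placeholders):

        struct task_struct *t = tsk;

        do {
                if (task_matches(t))    /* placeholder for real check */
                        return true;
        } while_each_thread(tsk, t);

        return false;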

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Presently, if oom_kill_allocating_task is enabled and current has
    OOM_DISABLE set, the following printk in oom_kill_process() is
    called twice:

        pr_err("%s: Kill process %d (%s) score %lu or sacrifice child\n",
               message, task_pid_nr(p), p->comm, points);

    So, the OOM_DISABLE check should happen earlier.

    Signed-off-by: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • select_bad_process() and badness() have the same OOM_DISABLE check. This
    patch kills one.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • If a kernel thread is using use_mm(), badness() returns a positive
    value. This is not a big issue because callers take care of it
    correctly. But there is one exception: /proc/pid/oom_score calls
    badness() directly and doesn't care whether the task is a regular
    process.

    Another example: /proc/1/oom_score returns a nonzero value, but init
    is unkillable. This incorrectness makes administration a little
    confusing.

    This patch fixes it.

    Signed-off-by: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • When oom_kill_allocating_task is enabled, the task argument of
    oom_kill_process() is not selected by select_bad_process(); it is
    just the out_of_memory() caller task. That means the task can be
    unkillable, so check it first.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Presently we have the same task check in two places. Unify it.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Presently select_bad_process() has a PF_KTHREAD check, but
    oom_kill_process() doesn't. That means oom_kill_process() may choose
    the wrong task, especially when a child is using use_mm().

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Presently, badness() cares about neither cpusets nor mempolicy. So
    if a victim's child process has a disjoint nodemask, the OOM killer
    might kill an innocent process.

    This patch fixes it.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: Minchan Kim
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • When shrink_inactive_list() isolates pages, it updates a number of
    counters using temporary variables to gather them. These consume
    stack, and they sit in the main path that calls ->writepage(). This
    patch moves the accounting updates outside of the main path to
    reduce stack usage.

    Signed-off-by: Mel Gorman
    Reviewed-by: Johannes Weiner
    Acked-by: Rik van Riel
    Cc: Dave Chinner
    Cc: Chris Mason
    Cc: Nick Piggin
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Christoph Hellwig
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Andrea Arcangeli
    Cc: Michael Rubin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • shrink_page_list() sets up a pagevec to release pages as they become
    free, and uses significant amounts of stack for that pagevec. This
    patch instead adds pages to be freed to a linked list which is then
    freed en masse at the end. This avoids using stack in the main path
    that potentially calls writepage().
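
    A sketch of the pattern; the final helper name is my recollection of
    this era's vmscan.c, so treat it as an assumption:

        LIST_HEAD(free_pages);  /* on-stack list head: two pointers */

        /* In the scan loop, queue each freed page on the list
         * instead of filling an on-stack pagevec: */
        list_add(&page->lru, &free_pages);

        /* After the loop, release everything in one pass: */
        free_page_list(&free_pages);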

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Dave Chinner
    Cc: Chris Mason
    Cc: Nick Piggin
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Christoph Hellwig
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Andrea Arcangeli
    Cc: Michael Rubin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • shrink_inactive_list() sets up a pagevec to release unfreeable pages. It
    uses significant amounts of stack doing this. This patch splits
    shrink_inactive_list() to take the stack usage out of the main path so
    that callers to writepage() do not contain an unused pagevec on the stack.

    Signed-off-by: Mel Gorman
    Reviewed-by: Johannes Weiner
    Acked-by: Rik van Riel
    Cc: Dave Chinner
    Cc: Chris Mason
    Cc: Nick Piggin
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Christoph Hellwig
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Andrea Arcangeli
    Cc: Michael Rubin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Remove temporary variable that is only used once and does not help clarify
    code.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Dave Chinner
    Cc: Chris Mason
    Cc: Nick Piggin
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Christoph Hellwig
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Andrea Arcangeli
    Cc: Michael Rubin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Now, max_scan of shrink_inactive_list() is always passed as less
    than SWAP_CLUSTER_MAX, so we can remove the page scanning loop in
    it. This patch also helps the stack diet.

    Details:
    - remove the "while (nr_scanned < max_scan)" loop
    - remove nr_freed (we now use nr_reclaimed directly)
    - remove nr_scan (we now use nr_scanned directly)
    - rename max_scan to nr_to_scan
    - pass nr_to_scan into isolate_pages() directly instead of using
      SWAP_CLUSTER_MAX

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Reviewed-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Dave Chinner
    Cc: Chris Mason
    Cc: Nick Piggin
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Christoph Hellwig
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Andrea Arcangeli
    Cc: Michael Rubin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Since 2.6.28, zone->prev_priority is unused, so it can be removed
    safely. This reduces stack usage slightly.

    Now I have to say that I'm sorry. Two years ago, I thought
    prev_priority could be integrated again and would be useful, but
    four (or more) attempts haven't produced good performance numbers.
    Thus I gave up that approach.

    The rest of this changelog consists of notes on prev_priority, why
    it existed in the first place, and why it might not be necessary any
    more. This information is based heavily on discussions between
    Andrew Morton, Rik van Riel and Kosaki Motohiro, who is quoted
    heavily below.

    Historically prev_priority was important because it determined when
    the VM would start unmapping PTE pages; it fed the balances of note
    within the VM, anon vs file and mapped vs unmapped. Without
    prev_priority, there is a potential risk of unnecessarily increasing
    minor faults, as a large amount of read activity on use-once pages
    could push mapped pages to the end of the LRU and get them unmapped.

    There is no proof this is still a problem, but currently it is not
    considered to be one. Active file pages are not deactivated if the
    active file list is smaller than the inactive list, reducing the
    likelihood that file-mapped pages are pushed off the LRU, and
    referenced executable pages are kept on the active list to avoid
    them getting pushed out by read activity.

    Even if it is a problem, prev_priority wouldn't work nowadays. First
    of all, current vmscan still has a lot of UP-centric code, which
    exposes weaknesses on machines with dozens of CPUs. I think we need
    more and more improvement.

    The problem is that current vmscan mixes up per-system pressure,
    per-zone pressure and per-task pressure a bit. For example,
    prev_priority tries to boost a task's priority to match other
    concurrent reclaimers' priority, but if another task has a mempolicy
    restriction that is unnecessary, and it also causes wrongly large
    latency and excessive reclaim. Per-task priority plus the
    prev_priority adjustment emulates per-system pressure, but that has
    two issues: 1) the emulation is too rough and brutal, and 2) we need
    per-zone pressure, not per-system pressure.

    Another example: currently DEF_PRIORITY is 12, which means the LRU
    rotates about 2 cycles (1/4096 + 1/2048 + 1/1024 + .. + 1) before
    invoking the OOM killer. But if 10,000 threads enter DEF_PRIORITY
    reclaim at the same time, the system is under higher memory pressure
    than priority == 0 (1/4096 * 10,000 > 2). prev_priority can't solve
    such a multithreaded workload issue. In other words, the
    prev_priority concept assumes the system doesn't have lots of
    threads.

    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Reviewed-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Dave Chinner
    Cc: Chris Mason
    Cc: Nick Piggin
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Christoph Hellwig
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Andrea Arcangeli
    Cc: Michael Rubin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Add a trace event for when page reclaim queues a page for IO and records
    whether it is synchronous or asynchronous. Excessive synchronous IO for a
    process can result in noticeable stalls during direct reclaim. Excessive
    IO from page reclaim may indicate that the system is seriously
    under-provisioned for the amount of dirty pages that exist.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: Larry Woodman
    Cc: Dave Chinner
    Cc: Chris Mason
    Cc: Nick Piggin
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Christoph Hellwig
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Andrea Arcangeli
    Cc: Michael Rubin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Add an event for when pages are isolated en-masse from the LRU lists.
    This event augments the information available on LRU traffic and can be
    used to evaluate lumpy reclaim.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: Larry Woodman
    Cc: Dave Chinner
    Cc: Chris Mason
    Cc: Nick Piggin
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Christoph Hellwig
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Andrea Arcangeli
    Cc: Michael Rubin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Add two trace events for kswapd waking up and going to sleep, for
    the purposes of tracking kswapd activity, and two trace events for
    direct
    reclaim beginning and ending. The information can be used to work out how
    much time a process or the system is spending on the reclamation of pages
    and in the case of direct reclaim, how many pages were reclaimed for that
    process. High frequency triggering of these events could point to memory
    pressure problems.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: Larry Woodman
    Cc: Dave Chinner
    Cc: Chris Mason
    Cc: Nick Piggin
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Christoph Hellwig
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Andrea Arcangeli
    Cc: Michael Rubin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • shrink_zones() can take a relatively long time, and lru_pages can
    change dramatically during shrink_zones(). So lru_pages should be
    recalculated for each priority.
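
    A simplified sketch of the resulting loop shape in
    do_try_to_free_pages() (helper names per that era's mm code; treat
    the exact signatures as assumptions):

        for (priority = DEF_PRIORITY; priority >= 0; priority--) {
                unsigned long lru_pages = 0;

                /* Recompute per priority: shrink_zones() can run long
                 * and the LRU sizes may have changed meanwhile. */
                for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
                        lru_pages += zone_reclaimable_pages(zone);

                shrink_zones(priority, zonelist, sc);
                shrink_slab(sc->nr_scanned, sc->gfp_mask, lru_pages);
                /* ... */
        }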

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The swap token hasn't worked when zone reclaim is enabled since the
    day it was born, because __zone_reclaim() always calls
    disable_swap_token() unconditionally. That kills the swap token
    feature completely. As far as I know, nobody wants that. Remove the
    call.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro