10 May, 2007

1 commit

  • Since nonboot CPUs are now disabled after tasks and devices have been
    frozen and the CPU hotplug infrastructure is used for this purpose, we need
    special CPU hotplug notifications that will help the CPU-hotplug-aware
    subsystems distinguish normal CPU hotplug events from CPU hotplug events
    related to a system-wide suspend or resume operation in progress. This
    patch introduces such notifications and causes them to be used during
    suspend and resume transitions. It also changes all of the
    CPU-hotplug-aware subsystems to take these notifications into consideration
    (for now they are handled in the same way as the corresponding "normal"
    ones).

    [oleg@tv-sign.ru: cleanups]
    Signed-off-by: Rafael J. Wysocki
    Cc: Gautham R Shenoy
    Cc: Pavel Machek
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     

09 May, 2007

1 commit


08 May, 2007

1 commit

  • Currently we can miss freeze_process()->signal_wake_up() in kswapd() if it
    happens between try_to_freeze() and prepare_to_wait(). To prevent this
    from happening we should check freezing(current) before calling schedule().

    Signed-off-by: Rafael J. Wysocki
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     

02 Mar, 2007

1 commit


12 Feb, 2007

1 commit

  • The determination of the dirty ratio to determine writeback behavior is
    currently based on the number of total pages on the system.

    However, not all pages in the system may be dirtied. Thus the ratio is always
    too low and can never reach 100%. The ratio may be particularly skewed if
    large hugepage allocations, slab allocations or device driver buffers make
    large sections of memory not available anymore. In that case we may get into
    a situation in which f.e. the background writeback ratio of 40% cannot be
    reached anymore which leads to undesired writeback behavior.

    This patchset fixes that issue by determining the ratio based on the actual
    pages that may potentially be dirty. These are the pages on the active and
    the inactive list plus free pages.

    The problem with those counts has so far been that it is expensive to
    calculate these because counts from multiple nodes and multiple zones will
    have to be summed up. This patchset makes these counters ZVC counters. This
    means that a current sum per zone, per node and for the whole system is always
    available via global variables and not expensive anymore to calculate.

    The patchset results in some other good side effects:

    - Removal of the various functions that sum up free, active and inactive
    page counts

    - Cleanup of the functions that display information via the proc filesystem.

    This patch:

    The use of a ZVC for nr_inactive and nr_active allows a simplification of some
    counter operations. More ZVC functionality is used for sums etc in the
    following patches.

    [akpm@osdl.org: UP build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

06 Jan, 2007

1 commit

  • At the end of shrink_all_memory() we forget to recalculate lru_pages: it can
    be zero.

    Fix that up, and add a helper function for this operation too.

    Also, recalculate lru_pages each time around the inner loop to get the
    balancing correct.

    Cc: "Rafael J. Wysocki"
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

31 Dec, 2006

1 commit


23 Dec, 2006

1 commit

  • The version of mm/vmscan.c in Linus' current tree has swapped parameters in
    the shrink_all_zones declaration and call, used by the various
    suspend-to-disk implementations. This doesn't seem to have any great
    adverse effect, but it's clearly wrong.

    Signed-off-by: Nigel Cunningham
    Cc: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nigel Cunningham
     

14 Dec, 2006

1 commit

  • Elaborate the API for calling cpuset_zone_allowed(), so that users have to
    explicitly choose between the two variants:

    cpuset_zone_allowed_hardwall()
    cpuset_zone_allowed_softwall()

    Until now, whether or not you got the hardwall flavor depended solely on
    whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask
    argument.

    If you didn't specify __GFP_HARDWALL, you implicitly got the softwall
    version.

    Unfortunately, this meant that users would end up with the softwall version
    without thinking about it. Since only the softwall version might sleep,
    this led to bugs with possible sleeping in interrupt context on more than
    one occassion.

    The hardwall version requires that the current tasks mems_allowed allows
    the node of the specified zone (or that you're in interrupt or that
    __GFP_THISNODE is set or that you're on a one cpuset system.)

    The softwall version, depending on the gfp_mask, might allow a node if it
    was allowed in the nearest enclusing cpuset marked mem_exclusive (which
    requires taking the cpuset lock 'callback_mutex' to evaluate.)

    This patch removes the cpuset_zone_allowed() call, and forces the caller to
    explicitly choose between the hardwall and the softwall case.

    If the caller wants the gfp_mask to determine this choice, they should (1)
    be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the
    cpuset_zone_allowed_softwall() routine.

    This adds another 100 or 200 bytes to the kernel text space, due to the few
    lines of nearly duplicate code at the top of both cpuset_zone_allowed_*
    routines. It should save a few instructions executed for the calls that
    turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to
    set (before the call) then check (within the call) the __GFP_HARDWALL flag.

    For the most critical call, from get_page_from_freelist(), the same
    instructions are executed as before -- the old cpuset_zone_allowed()
    routine it used to call is the same code as the
    cpuset_zone_allowed_softwall() routine that it calls now.

    Not a perfect win, but seems worth it, to reduce this chance of hitting a
    sleeping with irq off complaint again.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

08 Dec, 2006

4 commits

  • There was lots of #ifdef noise in the kernel due to hotcpu_notifier(fn,
    prio) not correctly marking 'fn' as used in the !HOTPLUG_CPU case, and thus
    generating compiler warnings of unused symbols, hence forcing people to add
    #ifdefs.

    the compiler can skip truly unused functions just fine:

    text data bss dec hex filename
    1624412 728710 3674856 6027978 5bfaca vmlinux.before
    1624412 728710 3674856 6027978 5bfaca vmlinux.after

    [akpm@osdl.org: topology.c fix]
    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Move process freezing functions from include/linux/sched.h to freezer.h, so
    that modifications to the freezer or the kernel configuration don't require
    recompiling just about everything.

    [akpm@osdl.org: fix ueagle driver]
    Signed-off-by: Nigel Cunningham
    Cc: "Rafael J. Wysocki"
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nigel Cunningham
     
  • Currently swsusp saves the contents of highmem pages by copying them to the
    normal zone which is quite inefficient (eg. it requires two normal pages
    to be used for saving one highmem page). This may be improved by using
    highmem for saving the contents of saveable highmem pages.

    Namely, during the suspend phase of the suspend-resume cycle we try to
    allocate as many free highmem pages as there are saveable highmem pages.
    If there are not enough highmem image pages to store the contents of all of
    the saveable highmem pages, some of them will be stored in the "normal"
    memory. Next, we allocate as many free "normal" pages as needed to store
    the (remaining) image data. We use a memory bitmap to mark the allocated
    free pages (ie. highmem as well as "normal" image pages).

    Now, we use another memory bitmap to mark all of the saveable pages
    (highmem as well as "normal") and the contents of the saveable pages are
    copied into the image pages. Then, the second bitmap is used to save the
    pfns corresponding to the saveable pages and the first one is used to save
    their data.

    During the resume phase the pfns of the pages that were saveable during the
    suspend are loaded from the image and used to mark the "unsafe" page
    frames. Next, we try to allocate as many free highmem page frames as to
    load all of the image data that had been in the highmem before the suspend
    and we allocate so many free "normal" page frames that the total number of
    allocated free pages (highmem and "normal") is equal to the size of the
    image. While doing this we have to make sure that there will be some extra
    free "normal" and "safe" page frames for two lists of PBEs constructed
    later.

    Now, the image data are loaded, if possible, into their "original" page
    frames. The image data that cannot be written into their "original" page
    frames are loaded into "safe" page frames and their "original" kernel
    virtual addresses, as well as the addresses of the "safe" pages containing
    their copies, are stored in one of two lists of PBEs.

    One list of PBEs is for the copies of "normal" suspend pages (ie. "normal"
    pages that were saveable during the suspend) and it is used in the same way
    as previously (ie. by the architecture-dependent parts of swsusp). The
    other list of PBEs is for the copies of highmem suspend pages. The pages
    in this list are restored (in a reversible way) right before the
    arch-dependent code is called.

    Signed-off-by: Rafael J. Wysocki
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Despaghettify balance_pdgat() a bit.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

29 Oct, 2006

2 commits

  • If try_to_free_pages / balance_pgdat are called with a gfp_mask specifying
    GFP_IO and/or GFP_FS, they will reclaim the requisite number of pages, and the
    reset prev_priority to DEF_PRIORITY (or to some other high (ie: unurgent)
    value).

    However, another reclaimer without those gfp_mask flags set (say, GFP_NOIO)
    may still be struggling to reclaim pages. The concurrent overwrite of
    zone->prev_priority will cause this GFP_NOIO thread to unexpectedly cease
    deactivating mapped pages, thus causing reclaim difficulties.

    Fix this is to key the distress calculation not off zone->prev_priority, but
    also take into account the local caller's priority by using
    min(zone->prev_priority, sc->priority)

    Signed-off-by: Martin J. Bligh
    Cc: Nick Piggin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Bligh
     
  • The temp_priority field in zone is racy, as we can walk through a reclaim
    path, and just before we copy it into prev_priority, it can be overwritten
    (say with DEF_PRIORITY) by another reclaimer.

    The same bug is contained in both try_to_free_pages and balance_pgdat, but
    it is fixed slightly differently. In balance_pgdat, we keep a separate
    priority record per zone in a local array. In try_to_free_pages there is
    no need to do this, as the priority level is the same for all zones that we
    reclaim from.

    Impact of this bug is that temp_priority is copied into prev_priority, and
    setting this artificially high causes reclaimers to set distress
    artificially low. They then fail to reclaim mapped pages, when they are,
    in fact, under severe memory pressure (their priority may be as low as 0).
    This causes the OOM killer to fire incorrectly.

    From: Andrew Morton

    __zone_reclaim() isn't modifying zone->prev_priority. But zone->prev_priority
    is used in the decision whether or not to bring mapped pages onto the inactive
    list. Hence there's a risk here that __zone_reclaim() will fail because
    zone->prev_priority ir large (ie: low urgency) and lots of mapped pages end up
    stuck on the active list.

    Fix that up by decreasing (ie making more urgent) zone->prev_priority as
    __zone_reclaim() scans the zone's pages.

    This bug perhaps explains why ZONE_RECLAIM_PRIORITY was created. It should be
    possible to remove that now, and to just start out at DEF_PRIORITY?

    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Bligh
     

21 Oct, 2006

1 commit

  • Separate out the concept of "queue congestion" from "backing-dev congestion".
    Congestion is a backing-dev concept, not a queue concept.

    The blk_* congestion functions are retained, as wrappers around the core
    backing-dev congestion functions.

    This proper layering is needed so that NFS can cleanly use the congestion
    functions, and so that CONFIG_BLOCK=n actually links.

    Cc: "Thomas Maier"
    Cc: "Jens Axboe"
    Cc: Trond Myklebust
    Cc: David Howells
    Cc: Peter Osterlund
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

17 Oct, 2006

1 commit

  • If remove_mapping() failed to remove the page from its mapping, don't go and
    mark it not uptodate! Makes kernel go dead.

    (Actually, I don't think the ClearPageUptodate is needed there at all).

    Says Nick Piggin:

    "Right, it isn't needed because at this point the page is guaranteed
    by remove_mapping to have no references (except us) and cannot pick
    up any new ones because it is removed from pagecache.

    We can delete it."

    Signed-off-by: Andrew Morton
    Acked-by: Nick Piggin
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

27 Sep, 2006

2 commits

  • Clean up the invalidate code, and use a common function to safely remove
    the page from pagecache.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • The VM is supposed to minimise the number of pages which get written off the
    LRU (for IO scheduling efficiency, and for high reclaim-success rates). But
    we don't actually have a clear way of showing how true this is.

    So add `nr_vmscan_write' to /proc/vmstat and /proc/zoneinfo - the number of
    pages which have been written by the vm scanner in this zone and globally.

    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

26 Sep, 2006

9 commits

  • There are many places where we need to determine the node of a zone.
    Currently we use a difficult to read sequence of pointer dereferencing.
    Put that into an inline function and use throughout VM. Maybe we can find
    a way to optimize the lookup in the future.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Minor performance fix.

    If we reclaimed enough slab pages from a zone then we can avoid going off
    node with the current allocation. Take care of updating nr_reclaimed when
    reclaiming from the slab.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Currently one can enable slab reclaim by setting an explicit option in
    /proc/sys/vm/zone_reclaim_mode. Slab reclaim is then used as a final
    option if the freeing of unmapped file backed pages is not enough to free
    enough pages to allow a local allocation.

    However, that means that the slab can grow excessively and that most memory
    of a node may be used by slabs. We have had a case where a machine with
    46GB of memory was using 40-42GB for slab. Zone reclaim was effective in
    dealing with pagecache pages. However, slab reclaim was only done during
    global reclaim (which is a bit rare on NUMA systems).

    This patch implements slab reclaim during zone reclaim. Zone reclaim
    occurs if there is a danger of an off node allocation. At that point we

    1. Shrink the per node page cache if the number of pagecache
    pages is more than min_unmapped_ratio percent of pages in a zone.

    2. Shrink the slab cache if the number of the nodes reclaimable slab pages
    (patch depends on earlier one that implements that counter)
    are more than min_slab_ratio (a new /proc/sys/vm tunable).

    The shrinking of the slab cache is a bit problematic since it is not node
    specific. So we simply calculate what point in the slab we want to reach
    (current per node slab use minus the number of pages that neeed to be
    allocated) and then repeately run the global reclaim until that is
    unsuccessful or we have reached the limit. I hope we will have zone based
    slab reclaim at some point which will make that easier.

    The default for the min_slab_ratio is 5%

    Also remove the slab option from /proc/sys/vm/zone_reclaim_mode.

    [akpm@osdl.org: cleanups]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Remove the atomic counter for slab_reclaim_pages and replace the counter
    and NR_SLAB with two ZVC counter that account for unreclaimable and
    reclaimable slab pages: NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE.

    Change the check in vmscan.c to refer to to NR_SLAB_RECLAIMABLE. The
    intend seems to be to check for slab pages that could be freed.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • *_pages is a better description of the role of the variable.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Potentially it takes several scans of the lru lists before we can even start
    reclaiming pages.

    mapped pages, with young ptes can take 2 passes on the active list + one on
    the inactive list. But reclaim_mapped may not always kick in instantly, so it
    could take even more than that.

    Raise the threshold for marking a zone as all_unreclaimable from a factor of 4
    time the pages in the zone to 6. Introduce a mechanism to force
    reclaim_mapped if we've reached a factor 3 and still haven't made progress.

    Previously, a customer doing stress testing was able to easily OOM the box
    after using only a small fraction of its swap (~100MB). After the patches, it
    would only OOM after having used up all swap (~800MB).

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • __alloc_pages currently starts shooting if page reclaim has failed to free up
    swap_cluster_max pages in one run through the priorities. This is not always
    a good indicator on its own, so make use of the all_unreclaimable logic as
    well: don't consider going OOM until all zones we're interested in are
    unreclaimable.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Some users of remove_mapping had been unsafe.

    Modify the remove_mapping precondition to ensure the caller has locked the
    page and obtained the correct mapping. Modify callers to ensure the
    mapping is the correct one.

    [hugh@veritas.com: swapper_space fix]
    Signed-off-by: Nick Piggin
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Introduce a VM_BUG_ON, which is turned on with CONFIG_DEBUG_VM. Use this
    in the lightweight, inline refcounting functions; PageLRU and PageActive
    checks in vmscan, because they're pretty well confined to vmscan. And in
    page allocate/free fastpaths which can be the hottest parts of the kernel
    for kbuilds.

    Unlike BUG_ON, VM_BUG_ON must not be used to execute statements with
    side-effects, and should not be used outside core mm code.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

04 Jul, 2006

1 commit

  • It turns out that it is advantageous to leave a small portion of unmapped file
    backed pages if all of a zone's pages (or almost all pages) are allocated and
    so the page allocator has to go off-node.

    This allows recently used file I/O buffers to stay on the node and
    reduces the times that zone reclaim is invoked if file I/O occurs
    when we run out of memory in a zone.

    The problem is that zone reclaim runs too frequently when the page cache is
    used for file I/O (read write and therefore unmapped pages!) alone and we have
    almost all pages of the zone allocated. Zone reclaim may remove 32 unmapped
    pages. File I/O will use these pages for the next read/write requests and the
    unmapped pages increase. After the zone has filled up again zone reclaim will
    remove it again after only 32 pages. This cycle is too inefficient and there
    are potentially too many zone reclaim cycles.

    With the 1% boundary we may still remove all unmapped pages for file I/O in
    zone reclaim pass. However. it will take a large number of read and writes
    to get back to 1% again where we trigger zone reclaim again.

    The zone reclaim 2.6.16/17 does not show this behavior because we have a 30
    second timeout.

    [akpm@osdl.org: rename the /proc file and the variable]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

01 Jul, 2006

6 commits

  • The remaining counters in page_state after the zoned VM counter patches
    have been applied are all just for show in /proc/vmstat. They have no
    essential function for the VM.

    We use a simple increment of per cpu variables. In order to avoid the most
    severe races we disable preempt. Preempt does not prevent the race between
    an increment and an interrupt handler incrementing the same statistics
    counter. However, that race is exceedingly rare, we may only loose one
    increment or so and there is no requirement (at least not in kernel) that
    the vm event counters have to be accurate.

    In the non preempt case this results in a simple increment for each
    counter. For many architectures this will be reduced by the compiler to a
    single instruction. This single instruction is atomic for i386 and x86_64.
    And therefore even the rare race condition in an interrupt is avoided for
    both architectures in most cases.

    The patchset also adds an off switch for embedded systems that allows a
    building of linux kernels without these counters.

    The implementation of these counters is through inline code that hopefully
    results in only a single instruction increment instruction being emitted
    (i386, x86_64) or in the increment being hidden though instruction
    concurrency (EPIC architectures such as ia64 can get that done).

    Benefits:
    - VM event counter operations usually reduce to a single inline instruction
    on i386 and x86_64.
    - No interrupt disable, only preempt disable for the preempt case.
    Preempt disable can also be avoided by moving the counter into a spinlock.
    - Handling is similar to zoned VM counters.
    - Simple and easily extendable.
    - Can be omitted to reduce memory use for embedded use.

    References:

    RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=113512330605497&w=2
    RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=114988082814934&w=2
    local_t http://marc.theaimsgroup.com/?l=linux-kernel&m=114991748606690&w=2
    V2 http://marc.theaimsgroup.com/?t=115014808400007&r=1&w=2
    V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767022346&w=2
    V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115047968808926&w=2

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • - Allows reclaim to access counter without looping over processor counts.

    - Allows accurate statistics on how many pages are used in a zone by
    the slab. This may become useful to balance slab allocations over
    various zones.

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The zone_reclaim_interval was necessary because we were not able to determine
    how many unmapped pages exist in a zone. Therefore we had to scan in
    intervals to figure out if any pages were unmapped.

    With the zoned counters and NR_ANON_PAGES we now know the number of pagecache
    pages and the number of mapped pages in a zone. So we can simply skip the
    reclaim if there is an insufficient number of unmapped pages. We use
    SWAP_CLUSTER_MAX as the boundary.

    Drop all support for /proc/sys/vm/zone_reclaim_interval.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The current NR_FILE_MAPPED is used by zone reclaim and the dirty load
    calculation as the number of mapped pagecache pages. However, that is not
    true. NR_FILE_MAPPED includes the mapped anonymous pages. This patch
    separates those and therefore allows an accurate tracking of the anonymous
    pages per zone.

    It then becomes possible to determine the number of unmapped pages per zone
    and we can avoid scanning for unmapped pages if there are none.

    Also it may now be possible to determine the mapped/unmapped ratio in
    get_dirty_limit. Isnt the number of anonymous pages irrelevant in that
    calculation?

    Note that this will change the meaning of the number of mapped pages reported
    in /proc/vmstat /proc/meminfo and in the per node statistics. This may affect
    user space tools that monitor these counters! NR_FILE_MAPPED works like
    NR_FILE_DIRTY. It is only valid for pagecache pages.

    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • We can now access the number of pages in a mapped state in an inexpensive way
    in shrink_active_list. So drop the nr_mapped field from scan_control.

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • nr_mapped is important because it allows a determination of how many pages of
    a zone are not mapped, which would allow a more efficient means of determining
    when we need to reclaim memory in a zone.

    We take the nr_mapped field out of the page state structure and define a new
    per zone counter named NR_FILE_MAPPED (the anonymous pages will be split off
    from NR_MAPPED in the next patch).

    We replace the use of nr_mapped in various kernel locations. This avoids the
    looping over all processors in try_to_free_pages(), writeback, reclaim (swap +
    zone reclaim).

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

28 Jun, 2006

2 commits

  • In 2.6.17, there was a problem with cpu_notifiers and XFS. I provided a
    band-aid solution to solve that problem. In the process, i undid all the
    changes you both were making to ensure that these notifiers were available
    only at init time (unless CONFIG_HOTPLUG_CPU is defined).

    We deferred the real fix to 2.6.18. Here is a set of patches that fixes the
    XFS problem cleanly and makes the cpu notifiers available only at init time
    (unless CONFIG_HOTPLUG_CPU is defined).

    If CONFIG_HOTPLUG_CPU is defined then cpu notifiers are available at run
    time.

    This patch reverts the notifier_call changes made in 2.6.17

    Signed-off-by: Chandra Seetharaman
    Cc: Ashok Raj
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chandra Seetharaman
     
  • When node is hot-added, kswapd for the node should start. This export kswapd
    start function as kswapd_run() to use at add_memory().

    [akpm@osdl.org: daemonize() isn't needed when using the kthread API]
    Signed-off-by: Yasunori Goto
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Dave Hansen
    Cc: "Brown, Len"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     

23 Jun, 2006

3 commits

  • Initialise total_memory earlier in boot. Because if for some reason we run
    page reclaim early in boot, we don't want total_memory to be zero when we use
    it as a divisor.

    And rename total_memory to vm_total_pages to avoid naming clashes with
    architectures.

    Cc: Yasunori Goto
    Cc: KAMEZAWA Hiroyuki
    Cc: Martin Bligh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • This implements the use of migration entries to preserve ptes of file backed
    pages during migration. Processes can therefore be migrated back and forth
    without loosing their connection to pagecache pages.

    Note that we implement the migration entries only for linear mappings.
    Nonlinear mappings still require the unmapping of the ptes for migration.

    And another writepage() ugliness shows up. writepage() can drop the page
    lock. Therefore we have to remove migration ptes before calling writepages()
    in order to avoid having migration entries point to unlocked pages.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • When a writeback_control's `start' and `end' fields are used to
    indicate a one-byte-range starting at file offset zero, the required
    values of .start=0,.end=0 mean that the ->writepages() implementation
    has no way of telling that it is being asked to perform a range
    request. Because we're currently overloading (start == 0 && end == 0)
    to mean "this is not a write-a-range request".

    To make all this sane, the patch changes range of writeback_control.

    So caller does: If it is calling ->writepages() to write pages, it
    sets range (range_start/end or range_cyclic) always.

    And if range_cyclic is true, ->writepages() thinks the range is
    cyclic, otherwise it just uses range_start and range_end.

    This patch does,

    - Add LLONG_MAX, LLONG_MIN, ULLONG_MAX to include/linux/kernel.h
    -1 is usually ok for range_end (type is long long). But, if someone did,

    range_end += val; range_end is "val - 1"
    u64val = range_end >> bits; u64val is "~(0ULL)"

    or something, they are wrong. So, this adds LLONG_MAX to avoid nasty
    things, and uses LLONG_MAX for range_end.

    - All callers of ->writepages() sets range_start/end or range_cyclic.

    - Fix updates of ->writeback_index. It seems already bit strange.
    If it starts at 0 and ended by check of nr_to_write, this last
    index may reduce chance to scan end of file. So, this updates
    ->writeback_index only if range_cyclic is true or whole-file is
    scanned.

    Signed-off-by: OGAWA Hirofumi
    Cc: Nathan Scott
    Cc: Anton Altaparmakov
    Cc: Steven French
    Cc: "Vladimir V. Saveliev"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi