20 Oct, 2008

4 commits

  • Make sure that mlocked pages also live on the unevictable LRU, so kswapd
    will not scan them over and over again.

    This is achieved through various strategies:

    1) add yet another page flag--PG_mlocked--to indicate that
    the page is locked for efficient testing in vmscan and,
    optionally, the fault path. This allows early culling of
    unevictable pages, preventing them from getting to
    page_referenced()/try_to_unmap(). Also allows separate
    accounting of mlock'd pages, as Nick's original patch
    did.

    Note: Nick's original mlock patch used a PG_mlocked
    flag. I had removed this in favor of the PG_unevictable
    flag + an mlock_count [new page struct member]. I
    restored the PG_mlocked flag to eliminate the new
    count field.

    2) add the mlock/unevictable infrastructure to mm/mlock.c,
    with internal APIs in mm/internal.h. This is a rework
    of Nick's original patch to these files, taking into
    account that mlocked pages are now kept on unevictable
    LRU list.

    3) update vmscan.c:page_evictable() to check PageMlocked()
    and, if a vma is passed in, its vm_flags. Note that the vma
    will only be passed in for new pages in the fault path,
    and then only if the "cull unevictable pages in fault
    path" patch is included. A sketch of this check appears
    after this list.

    4) add try_to_unlock() to rmap.c to walk a page's rmap and
    ClearPageMlocked() if no other vmas have it mlocked.
    Reuses as much of try_to_unmap() as possible. This
    effectively replaces the use of one of the lru list links
    as an mlock count. If this mechanism lets pages in mlocked
    vmas leak through without PG_mlocked set [I don't know that it
    does], we should catch them later in try_to_unmap(). One
    hopes this will be rare, as it will be relatively expensive.
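
    A minimal sketch of the item-3 check (modeled on this
    description; the exact function body is an assumption):

        /* may the page be put on a normal, evictable LRU list? */
        static int page_evictable(struct page *page,
                                  struct vm_area_struct *vma)
        {
                if (PageMlocked(page))
                        return 0;       /* cull to the unevictable list */

                /* vma is only non-NULL for new pages in the fault path */
                if (vma && (vma->vm_flags & VM_LOCKED))
                        return 0;

                return 1;
        }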

    Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
    Signed-off-by: Nick Piggin

    splitlru: introduce __get_user_pages():

    The new munlock processing needs GUP_FLAGS_IGNORE_VMA_PERMISSIONS,
    because the current get_user_pages() can't grab PROT_NONE pages and
    therefore PROT_NONE pages can't be munlocked.

    [akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
    [akpm@linux-foundation.org: untangle patch interdependencies]
    [akpm@linux-foundation.org: fix things after out-of-order merging]
    [hugh@veritas.com: fix page-flags mess]
    [lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
    [kosaki.motohiro@jp.fujitsu.com: build fix]
    [kosaki.motohiro@jp.fujitsu.com: fix truncate race and several comments]
    [kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Matt Mackall
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • When the system contains lots of mlocked or otherwise unevictable pages,
    the pageout code (kswapd) can spend lots of time scanning over these
    pages. Worse still, the presence of lots of unevictable pages can confuse
    kswapd into thinking that more aggressive pageout modes are required,
    resulting in all kinds of bad behaviour.

    This patch adds infrastructure to manage pages excluded from
    reclaim--i.e., hidden from vmscan. It is based on a patch by Larry
    Woodman of Red Hat, reworked to maintain "unevictable" pages on a
    separate per-zone LRU list, to "hide" them from vmscan.

    Kosaki Motohiro added the support for the memory controller unevictable
    lru list.

    Pages on the unevictable list have both PG_unevictable and PG_lru set.
    Thus, PG_unevictable is analogous to and mutually exclusive with
    PG_active--it specifies which LRU list the page is on.

    The unevictable infrastructure is enabled by a new mm Kconfig option
    [CONFIG_]UNEVICTABLE_LRU.

    A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
    not a page may be evictable. Subsequent patches will add the various
    !evictable tests. We'll want to keep these tests light-weight for use in
    shrink_active_list() and, possibly, the fault path.

    To avoid races between tasks putting pages [back] onto an LRU list and
    tasks that might be moving the page from non-evictable to evictable state,
    the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
    -- tests the "evictability" of a page after placing it on the LRU, before
    dropping the reference. If the page has become unevictable,
    putback_lru_page() will redo the 'putback', thus moving the page to the
    unevictable list. This way, we avoid "stranding" evictable pages on the
    unevictable list.
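
    A hedged sketch of that recheck; helper names such as
    add_page_to_unevictable_list() are assumptions based on this
    description, and reference handling is simplified:

        void putback_lru_page(struct page *page)
        {
        redo:
                ClearPageUnevictable(page);
                if (page_evictable(page, NULL)) {
                        lru_cache_add(page);    /* back on a normal LRU */
                } else {
                        SetPageUnevictable(page);
                        add_page_to_unevictable_list(page);
                }

                /*
                 * Recheck before dropping the reference: if the page
                 * changed state while we were putting it back, isolate
                 * it again and redo the putback.
                 */
                if (page_evictable(page, NULL) == !!PageUnevictable(page) &&
                    !isolate_lru_page(page))
                        goto redo;

                put_page(page);         /* drop the isolation reference */
        }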

    [akpm@linux-foundation.org: fix fallout from out-of-order merge]
    [riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
    [nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
    [kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
    [kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
    [kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
    [kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
    [kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Debugged-by: Benjamin Kidwell
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Define proper false/noop inline functions for noreclaim page flags when
    !defined(CONFIG_UNEVICTABLE_LRU)
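
    For illustration, the !CONFIG_UNEVICTABLE_LRU stubs follow the
    false/noop macro pattern of page-flags.h (a sketch; the exact
    macro names are assumptions consistent with that pattern):

        #ifdef CONFIG_UNEVICTABLE_LRU
        PAGEFLAG(Unevictable, unevictable)
        #else
        /* flag always reads false; the setters compile to nothing */
        PAGEFLAG_FALSE(Unevictable)
        SETPAGEFLAG_NOOP(Unevictable)
        CLEARPAGEFLAG_NOOP(Unevictable)
        #endif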

    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Define page_file_cache() function to answer the question:
    is page backed by a file?

    Originally part of Rik van Riel's split-lru patch. Extracted to make
    available for other, independent reclaim patches.

    Moved inline function to linux/mm_inline.h where it will be needed by
    subsequent "split LRU" and "noreclaim" patches.

    Unfortunately this needs to use a page flag, since the PG_swapbacked state
    needs to be preserved all the way to the point where the page is last
    removed from the LRU. Trying to derive the status from other info in the
    page resulted in wrong VM statistics in earlier split VM patchsets.

    The total number of page flags in use on a 32 bit machine after this patch
    is 19.
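
    A sketch of the test, assuming a PageSwapBacked() accessor
    generated for the new flag:

        /* is the page backed by a file, i.e. not swap-backed? */
        static inline int page_file_cache(struct page *page)
        {
                return !PageSwapBacked(page);
        }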

    [akpm@linux-foundation.org: fix up out-of-order merge fallout]
    [hugh@veritas.com: splitlru: shmem_getpage SetPageSwapBacked sooner]
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: MinChan Kim
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

05 Aug, 2008

1 commit

  • Converting page lock to new locking bitops requires a change of page flag
    operation naming, so we might as well convert it to something nicer
    (!TestSetPageLocked_Lock => trylock_page, SetPageLocked => set_page_locked).

    This also facilitates lockdeping of page lock.
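
    A sketch of the renamed operation in terms of the new locking
    bitops (the exact definition is assumed):

        /* non-blocking attempt to take the page lock */
        static inline int trylock_page(struct page *page)
        {
                return !test_and_set_bit_lock(PG_locked, &page->flags);
        }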

    Signed-off-by: Nick Piggin
    Acked-by: KOSAKI Motohiro
    Acked-by: Peter Zijlstra
    Acked-by: Andrew Morton
    Acked-by: Benjamin Herrenschmidt
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

01 Aug, 2008

1 commit

  • For anonymous pages without a swap cache backing, the check for the
    physical dirty bit in page_remove_rmap is unnecessary. The
    instructions used to check and reset the dirty bit are expensive.
    Removing the check noticeably speeds up process exit.
    In addition the clearing of the dirty bit in __SetPageUptodate is
    pointless as well. With these two changes there is no storage key
    operation for an anonymous page anymore if it does not hit the swap
    space.
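
    The resulting check in page_remove_rmap() has roughly this shape
    (a sketch; the exact condition is an assumption from this
    description):

        /*
         * Only bother with the storage-key dirty bit for pages that
         * can reach swap space: file-backed pages, and anonymous
         * pages that are in the swap cache.
         */
        if ((!PageAnon(page) || PageSwapCache(page)) &&
            page_test_dirty(page)) {
                page_clear_dirty(page);
                set_page_dirty(page);
        }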

    The micro benchmark which repeatedly executes an empty shell script
    gets about 5% faster.

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     

25 Jul, 2008

3 commits

  • SLOB reuses two page bits for internal purposes: it overlays PG_active
    and PG_private. This is hidden away in slob.c. Document these overlays
    explicitly in the main page-flags enum along with all the others.

    Signed-off-by: Andy Whitcroft
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Cc: Matt Mackall
    Cc: Nick Piggin
    Reviewed-by: KOSAKI Motohiro
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • SLUB reuses two page bits for internal purposes: it overlays PG_active
    and PG_error. This is hidden away in slub.c. Document these overlays
    explicitly in the main page-flags enum along with all the others.

    Signed-off-by: Andy Whitcroft
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Cc: Matt Mackall
    Cc: Nick Piggin
    Tested-by: KOSAKI Motohiro
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • With the recent page flag reorganisation we have a single enum which
    defines the valid page flags and their values, nice and clear. However
    there are a number of bits which are overloaded by different subsystems.
    Firstly there is PG_owner_priv_1 which is used by filesystems and by XEN.
    Secondly both SLOB and SLUB use a couple of extra page bits to manage
    internal state for pages they own; both overlay other bits. All of these
    "aliases" are scattered about the source making it very hard for a reader
    to know if the bits are safe to rely on in all contexts; confusion here is
    bad.

    As we now have a single place where the bits are clearly assigned it makes
    sense to clarify the reuse of bits by making the aliases explicit and
    visible with the original bit assignments. This patch creates explicit
    aliases within the enum itself for the overloaded bits, creates standard
    bit accessors PageFoo etc. and uses those throughout.

    This version pulls the bit manipulation out to standard named page bit
    accessors as suggested by Christoph, it retains the explicit mapping to
    the overlayed bits. A fusion of both ideas. SLUB and SLOB have been
    compile tested on x86_64 only, and SLUB boot tested. If people feel
    this is worth doing then I can run a fuller set of testing.

    This patch:

    Some page flags are used for more than one purpose, for example
    PG_owner_priv_1. Currently there are individual accessors for each user,
    each built using the common flag name far away from the bit definitions.
    This makes it hard to see all possible uses of these bits.

    Now that we have a single enum to generate the bit orders it makes sense
    to express overlays in the same place. So create per use aliases for this
    bit in the main page-flags enum and use those in the accessors.
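
    The resulting aliases look roughly like this (a sketch of the
    enum additions; the alias names are assumptions based on the
    overlays described above):

        enum pageflags {
                /* ... the real flag bits ... */

                /* aliases for overloaded bits */
                PG_checked = PG_owner_priv_1,   /* filesystems */
                PG_pinned = PG_owner_priv_1,    /* Xen pinned pagetable */
                PG_slob_page = PG_active,       /* SLOB */
                PG_slob_free = PG_private,
                PG_slub_frozen = PG_active,     /* SLUB */
                PG_slub_debug = PG_error,
        };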

    [akpm@linux-foundation.org: fix xen]
    Signed-off-by: Andy Whitcroft
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Cc: Matt Mackall
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Reviewed-by: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     

10 Jun, 2008

1 commit

  • Minor source code cleanup of page flags in mm/page_alloc.c.
    Move the definition of the groups of bits to page-flags.h.

    The purpose of this clean up is that the next patch will
    conditionally add a page flag to the groups. Doing that
    in a header file is cleaner than adding #ifdefs to the
    C code.

    Signed-off-by: Russ Anderson
    Signed-off-by: Linus Torvalds

    Russ Anderson
     

27 May, 2008

1 commit

  • This patch implements Xen save/restore and migration.

    Saving is triggered via xenbus, which is polled in
    drivers/xen/manage.c. When a suspend request comes in, the kernel
    prepares itself for saving by:

    1 - Freeze all processes. This is primarily to prevent any
    partially-completed pagetable updates from confusing the suspend
    process. If CONFIG_PREEMPT isn't defined, then this isn't necessary.

    2 - Suspend xenbus and other devices

    3 - Stop_machine, to make sure all the other vcpus are quiescent. The
    Xen tools require the domain to run its save off vcpu0.

    4 - Within the stop_machine state, it pins any unpinned pgds (under
    construction or destruction), canonicalizes various other pieces of
    state (mostly converting mfns to pfns), and finally

    5 - Suspend the domain

    Restore reverses the steps used to save the domain, ending when all
    the frozen processes are thawed.

    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Thomas Gleixner

    Jeremy Fitzhardinge
     

28 Apr, 2008

9 commits

  • Having separate page flags for the head and the tail of a compound page allows
    the compiler to use bitops instead of operations on a word to check for a tail
    page. That is f.e. important for virt_to_head_page() which is used in
    various critical code paths (kfree for example):

    Code for PageTail(page)

    Before:

        mov  (%rdi),%rdx      # page->flags
        mov  %rdx,%rax        # 3 bytes
        and  $0x12000,%eax    # 5 bytes
        cmp  $0x12000,%rax    # 6 bytes
        je   897

    After:

        mov  (%rdi),%rax
        test $0x40,%ah        # 3 bytes
        jne  887

    So we go from 14 bytes to 3 bytes and from 3 instructions to one. From the
    use of 2 registers we go to none.

    We can only use page flags for this if we have page flags available. This
    patch introduces CONFIG_PAGEFLAGS_EXTENDED that is set if pageflags are not
    scarce due to SPARSEMEM using page flags for its sectionid on 32 bit NUMA
    platforms.

    Additional page flag definitions can be added to the CONFIG_PAGEFLAGS_EXTENDED
    section in page-flags.h if the functionality depends on PAGEFLAGS_EXTENDED or
    if more page flag overlapping tricks are used for the !PAGEFLAGS_EXTENDED
    fallback (the upcoming virtual compound patch may hook in here and Rik's/Lee's
    additional page flags to solve the reclaim issues could also be added there
    [hint... hint... where are these patchsets?]).

    Avoiding the overlaying of PG_reclaim also clears the way for possible use of
    compound pages for the pagecache or on the LRU.
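
    With the extended flag space, head and tail pages get dedicated
    bits (a sketch using the flag-generator macros; the fallback
    overlay is elided):

        #ifdef CONFIG_PAGEFLAGS_EXTENDED
        /* dedicated bits: PageTail() is then a single test_bit() */
        __PAGEFLAG(Head, head)
        __PAGEFLAG(Tail, tail)
        #else
        /* scarce flags: mark tail pages by overlaying PG_compound
           with PG_reclaim, as before */
        #endif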

    Signed-off-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • It turns out that there are a number of cases where a page flag test
    simply always returns 0. Define a macro for that.
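
    The macro presumably reduces each such case to one line, along
    these lines (a sketch consistent with the generator macros in
    page-flags.h):

        /* generate a Page<uname>() test that is always false */
        #define TESTPAGEFLAG_FALSE(uname)                       \
        static inline int Page##uname(struct page *page)        \
        {                                                       \
                return 0;                                       \
        }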

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Remove the special setup for PG_uncached and simply make it part of the enum.
    The page flag will only be allocated when the kernel build includes the
    uncached allocator.

    Acked-by: Dean Nelson
    Cc: Jes Sorensen
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Remove aliases of PG_xxx. We can easily drop those now and alias by
    specifying the PG_xxx flag in the macro that generates the functions.
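
    Instead of a separate #define alias, the PG_xxx bit is named
    right where the accessors are generated, for example (sketch):

        /* PageChecked() et al. operate on the PG_owner_priv_1 bit */
        PAGEFLAG(Checked, owner_priv_1)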

    Signed-off-by: Christoph Lameter
    Cc: Andy Whitcroft
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Xen uses bitops to manipulate page flags. Make it use proper page flag
    functions.

    Signed-off-by: Christoph Lameter
    Cc: Andy Whitcroft
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Replace explicit definitions of page flags with macros. This
    significantly reduces the size of the definitions and removes a lot of
    opportunity for errors. Additional page flags can typically be
    generated with a single line.

    Signed-off-by: Christoph Lameter
    Cc: Andy Whitcroft
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Introduce a set of macros that generate functions to handle page flags.

    A page flag function group typically starts with either

    SETPAGEFLAG(uname, lname)

    to create a set of page flag operations that are atomic. Or

    __SETPAGEFLAG(uname, lname)

    to create a set of page flag operations that are not atomic.
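
    The generated functions follow this pattern (a sketch of the two
    variants):

        /* atomic variant */
        #define SETPAGEFLAG(uname, lname)                       \
        static inline void SetPage##uname(struct page *page)    \
        {                                                       \
                set_bit(PG_##lname, &page->flags);              \
        }

        /* non-atomic variant: only safe when nothing else can
           touch page->flags concurrently */
        #define __SETPAGEFLAG(uname, lname)                     \
        static inline void __SetPage##uname(struct page *page)  \
        {                                                       \
                __set_bit(PG_##lname, &page->flags);            \
        }
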
    Cc: Andy Whitcroft
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • NR_PAGEFLAGS specifies the number of page flags we are using. From that we
    can calculate the number of bits leftover that can be used for zone, node (and
    maybe the sections id). There is no need anymore for FLAGS_RESERVED if we use
    NR_PAGEFLAGS.

    Use the new methods to make NR_PAGEFLAGS available via the preprocessor.
    NR_PAGEFLAGS is used to calculate field boundaries in the page flags fields.
    These field widths have to be available to the preprocessor.

    Signed-off-by: Christoph Lameter
    Cc: David Miller
    Cc: Andy Whitcroft
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Use an enum to ease the maintenance of page flags. This changes the
    flag numbering, which now runs from 0 to 18.
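
    Abridged, the enum looks like this (the first few flags; the
    full list runs through bit 18):

        enum pageflags {
                PG_locked,      /* page is locked, do not touch */
                PG_error,
                PG_referenced,
                PG_uptodate,
                PG_dirty,
                PG_lru,
                PG_active,
                /* ... remaining flags up to 18 ... */
                __NR_PAGEFLAGS
        };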

    Signed-off-by: Christoph Lameter
    Cc: Andy Whitcroft
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

06 Feb, 2008

1 commit

  • After running SetPageUptodate, the preceding stores that actually
    bring the page contents uptodate may not be ordered with the store
    that sets the uptodate flag.

    Therefore, another CPU which checks PageUptodate is true, then reads the
    page contents can get stale data.

    Fix this by having an smp_wmb before SetPageUptodate, and smp_rmb after
    PageUptodate.
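
    In code, the pairing is roughly (a sketch of the generic,
    non-architecture-specific variants):

        static inline void SetPageUptodate(struct page *page)
        {
                /* all stores filling the page must be visible
                   before the uptodate bit is seen set */
                smp_wmb();
                set_bit(PG_uptodate, &page->flags);
        }

        static inline int PageUptodate(struct page *page)
        {
                int ret = test_bit(PG_uptodate, &page->flags);

                if (ret)
                        smp_rmb();      /* pairs with the smp_wmb() */
                return ret;
        }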

    Many places that test PageUptodate do so with the page locked, and this
    would be enough to ensure memory ordering in those places if
    SetPageUptodate were only called while the page is locked. Unfortunately
    that is not always the case for some filesystems, but it could be an idea
    for the future.

    Also bring the handling of anonymous page uptodateness in line with that of
    file backed page management, by marking anon pages as uptodate when they
    _are_ uptodate, rather than when our implementation requires that they be
    marked as such. Doing so allows us to get rid of the smp_wmb's in the page
    copying functions, which were especially added for anonymous pages for an
    analogous memory ordering problem. Both file and anonymous pages are
    handled with the same barriers.

    FAQ:
    Q. Why not do this in flush_dcache_page?
    A. Firstly, flush_dcache_page handles only one side (the smp_wmb side) of
    the ordering protocol; we'd still need smp_rmb somewhere. Secondly, hiding
    away memory barriers in a completely unrelated function is nasty; at least
    in the PageUptodate macros, they are located together with (half) the
    operations involved in the ordering. Thirdly, the smp_wmb is only required
    when first bringing the page uptodate, whereas flush_dcache_page should be
    called each time it is written to through the kernel mapping. It is
    logically the wrong place to put it.

    Q. Why does this increase my text size / reduce my performance / etc.
    A. Because it is adding the necessary instructions to eliminate the data-race.

    Q. Can it be improved?
    A. Yes, eg. if you were to create a rule that all SetPageUptodate operations
    run under the page lock, we could avoid the smp_rmb places where PageUptodate
    is queried under the page lock. Requires audit of all filesystems and at least
    some would need reworking. That's great you're interested, I'm eagerly awaiting
    your patches.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

20 Jul, 2007

3 commits

  • page-writeback accounting is presently performed in the page-flags macros.
    This is inconsistent and a bit ugly and makes it awkward to implement
    per-backing_dev under-writeback page accounting.

    So move this accounting down to the callsite(s).

    Acked-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Share the same page flag bit for PG_readahead and PG_reclaim.

    One is used only on file reads, another is only for emergency writes. One
    is used mostly for fresh/young pages, another is for old pages.

    Combinations of possible interactions are:

    a) clear PG_reclaim => implicit clear of PG_readahead
    it will delay an asynchronous readahead into a synchronous one
    it actually does _good_ for readahead:
    the pages will be reclaimed soon, it's readahead thrashing!
    in this case, synchronous readahead makes more sense.

    b) clear PG_readahead => implicit clear of PG_reclaim
    one (and only one) page will not be reclaimed in time
    it can be avoided by checking PageWriteback(page) in readahead first

    c) set PG_reclaim => implicit set of PG_readahead
    will confuse readahead and make it restart the size rampup process
    it's a trivial problem, and can mostly be avoided by checking
    PageWriteback(page) first in readahead

    d) set PG_readahead => implicit set of PG_reclaim
    PG_readahead will never be set on already cached pages.
    PG_reclaim will always be cleared on dirtying a page.
    so not a problem.

    In summary,
    a) we get better behavior
    b,d) possible interactions can be avoided
    c) racy condition exists that might affect readahead, but the chance
    is _really_ low, and the hurt on readahead is trivial.

    Compound pages also use PG_reclaim, but for now they do not interact with
    reclaim/readahead code.
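
    The sharing itself is a one-line alias (sketch):

        /* PG_readahead and PG_reclaim share one page flag bit */
        #define PG_readahead PG_reclaim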

    Signed-off-by: Fengguang Wu
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Introduce a new page flag: PG_readahead.

    It acts as a look-ahead mark, which tells the page reader: Hey, it's time to
    invoke the read-ahead logic. For the sake of I/O pipelining, don't wait until
    it runs out of cached pages!

    Signed-off-by: Fengguang Wu
    Cc: Steven Pratt
    Cc: Ram Pai
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     

08 May, 2007

3 commits

  • Remove the two page flags that were previously used by swsusp and are no
    longer needed.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • The patch adds PageTail(page) and PageHead(page) to check if a page is the
    head or the tail of a compound page. This is done by masking the two bits
    describing the state of a compound page and then comparing them. So one
    comparison and a branch instead of two bit checks and two branches.
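
    The masked comparison looks like this (a sketch matching the
    description):

        #define PG_head_tail_mask \
                ((1L << PG_compound) | (1L << PG_reclaim))

        /* tail pages have both bits set, head pages only PG_compound */
        static inline int PageTail(struct page *page)
        {
                return (page->flags & PG_head_tail_mask) ==
                        PG_head_tail_mask;
        }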

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • If we add a new flag so that we can distinguish between the first page and
    the tail pages, then we can avoid using page->private in the first page.
    page->private == page for the first page, so there is no real information
    in there.

    Freeing up page->private makes the use of compound pages more transparent.
    They become more usable like real pages. Right now we have to be careful,
    e.g., if we go beyond PAGE_SIZE allocations in the slab on i386, because we
    can then no longer use the private field. This is one of the issues that
    causes us not to support debugging for page-size slabs in SLAB.

    Having page->private available for SLUB would allow more meta information in
    the page struct. I can probably avoid the 16 bit ints that I have in there
    right now.

    Also if page->private is available then a compound page may be equipped with
    buffer heads. This may free up the way for filesystems to support larger
    blocks than page size.

    We add PageTail as an alias of PageReclaim. Compound pages cannot currently
    be reclaimed. Because of the alias one needs to check PageCompound first.

    The RFC for this approach was discussed at
    http://marc.info/?t=117574302800001&r=1&w=2

    [nacc@us.ibm.com: fix hugetlbfs]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Nishanth Aravamudan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

27 Apr, 2007

1 commit

  • The page_test_and_clear_dirty primitive really consists of two
    operations, page_test_dirty and page_clear_dirty. The combination of
    the two is not an atomic operation, so it makes more sense to have two
    separate operations instead of one.

    In addition to improving the readability of the s390 version of
    SetPageUptodate, this avoids the page_test_dirty operation, an
    insert-storage-key-extended (iske) instruction, which is expensive.
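
    Callers that need the old combined behaviour now spell out the
    two steps, roughly (sketch):

        /* was: if (page_test_and_clear_dirty(page)) ... */
        if (page_test_dirty(page)) {
                page_clear_dirty(page);
                set_page_dirty(page);   /* mirror into PG_dirty */
        }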

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     

02 Mar, 2007

1 commit

  • Rename PG_checked to PG_owner_priv_1 to reflect its availability as a
    private flag for use by the owner/allocator of the page. In the case of
    pagecache pages (which might be considered to be owned by the mm),
    filesystems may use the flag.

    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

22 Dec, 2006

1 commit

  • The clear_page_dirty() and test_clear_page_dirty() functions were
    horribly easy to mis-use because of their tempting naming, and they
    also did way more than any users of them generally wanted them to
    do.

    A dirty page can become clean under two circumstances:

    (a) when we write it out. We have "clear_page_dirty_for_io()" for
    this, and that function remains unchanged.

    In the "for IO" case it is not sufficient to just clear the dirty
    bit, you also have to mark the page as being under writeback etc.

    (b) when we actually remove a page due to it becoming inaccessible to
    users, notably because it was truncate()'d away or the file (or
    metadata) no longer exists, and we thus want to cancel any
    outstanding dirty state.

    For the (b) case, we now introduce "cancel_dirty_page()", which only
    touches the page state itself, and verifies that the page is not mapped
    (since cancelling writes on a mapped page would be actively wrong as it
    is still accessible to users).
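
    A simplified sketch of cancel_dirty_page() as described
    (accounting details elided; the warning form is an assumption):

        void cancel_dirty_page(struct page *page,
                               unsigned int account_size)
        {
                /* the page had better not be mapped any more */
                WARN_ON(page_mapped(page));

                if (TestClearPageDirty(page) && account_size) {
                        /* ... undo the dirty-page accounting ... */
                }
        }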

    Some filesystems need to be fixed up for this: CIFS, FUSE, JFS,
    ReiserFS, XFS all use the old confusing functions, and will be fixed
    separately in subsequent commits (with some of them just removing the
    offending logic, and others using clear_page_dirty_for_io()).

    This was confirmed by Martin Michlmayr to fix the apt database
    corruption on ARM.

    Cc: Martin Michlmayr
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Cc: Arjan van de Ven
    Cc: Andrei Popa
    Cc: Andrew Morton
    Cc: Dave Kleikamp
    Cc: Gordon Farquharson
    Cc: Martin Schwidefsky
    Cc: Trond Myklebust
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

30 Sep, 2006

1 commit

  • Convert s390 page handling macros to functions. In particular this fixes a
    problem with s390's SetPageUptodate macro, which uses its input parameter
    twice and can therefore cause subtle bugs.
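
    The class of bug being fixed: a macro that expands its argument
    twice misbehaves when the argument has side effects (hypothetical
    example):

        /* BAD: the 'page' argument expands twice */
        #define SetPageUptodate(page)                                   \
        do {                                                            \
                if (!test_and_set_bit(PG_uptodate, &(page)->flags))     \
                        page_test_and_clear_dirty(page);                \
        } while (0)

        /* SetPageUptodate(pages[i++]) would increment i twice;
           a static inline function evaluates its argument once. */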

    [akpm@osdl.org: build fix]
    Cc: Martin Schwidefsky
    Signed-off-by: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     

01 Jul, 2006

2 commits

  • Conversion of nr_writeback to per zone counter.

    This removes the last page_state counter from arch/i386/mm/pgtable.c so we
    drop the page_state from there.

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • NOTE: ZVCs are *not* the lightweight event counters. ZVCs are reliable
    whereas event counters do not need to be.

    Zone based VM statistics are necessary to be able to determine what the state
    of memory in one zone is. In a NUMA system this can be helpful for local
    reclaim and other memory optimizations that may be able to shift VM load in
    order to get more balanced memory use.

    It is also useful to know how the computing load affects the memory
    allocations on various zones. This patchset allows the retrieval of that data
    from userspace.

    The patchset introduces a framework for counters that is a cross between the
    existing page_stats --which are simply global counters split per cpu-- and the
    approach of deferred incremental updates implemented for nr_pagecache.

    Small per cpu 8 bit counters are added to struct zone. If the counter exceeds
    certain thresholds then the counters are accumulated in an array of
    atomic_long in the zone and in a global array that sums up all zone values.
    The small 8 bit counters are next to the per cpu page pointers and so they
    will be hot in the cpu cache when pages are allocated and freed.

    Access to VM counter information for a zone and for the whole machine is then
    possible by simply indexing an array (Thanks to Nick Piggin for pointing out
    that approach). Access to the total number of pages of various types no
    longer requires summing up all per cpu counters.
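
    The core of the scheme, sketched (field and helper names follow
    the description; the exact definitions are assumptions):

        /* per-zone, per-cpu differentials: small and cache-hot */
        struct per_cpu_pageset {
                /* ... per cpu page lists ... */
                s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
        };

        /* fold a differential into the zone and global counters
           once it crosses the threshold */
        static void zone_page_state_add(long x, struct zone *zone,
                                        enum zone_stat_item item)
        {
                atomic_long_add(x, &zone->vm_stat[item]);
                atomic_long_add(x, &vm_stat[item]);
        }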

    Benefits of this patchset right now:

    - Ability for UP and SMP configuration to determine how memory
    is balanced between the DMA, NORMAL and HIGHMEM zones.

    - loops over all processors are avoided in writeback and
    reclaim paths. We can avoid caching the writeback information
    because the needed information is directly accessible.

    - Special handling for nr_pagecache removed.

    - zone_reclaim_interval vanishes since VM stats can now determine
    when it is worth doing local reclaim.

    - Fast inline per node page state determination.

    - Accurate counters in /sys/devices/system/node/node*/meminfo. Current
    counters simply count which processor allocated a page somewhere and
    guesstimate based on that, so they were not useful for showing the
    actual distribution of page use on a specific zone.

    - The swap_prefetch patch requires per node statistics in order to
    figure out when processors of a node can prefetch. This patch provides
    some of the needed numbers.

    - Detailed VM counters available in more /proc and /sys status files.

    References to earlier discussions:
    V1 http://marc.theaimsgroup.com/?l=linux-kernel&m=113511649910826&w=2
    V2 http://marc.theaimsgroup.com/?l=linux-kernel&m=114980851924230&w=2
    V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115014697910351&w=2
    V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767318740&w=2

    Performance tests with AIM7 did not show any regressions. Seems to be a tad
    faster even. Tested on ia64/NUMA. Builds fine on i386, SMP / UP. Includes
    fixes for s390/arm/uml arch code.

    This patch:

    Move counter code from page_alloc.c/page-flags.h to vmstat.c/h.

    Create vmstat.c/vmstat.h by separating the counter code and the proc
    functions.

    Move the vm_stat_text array before zoneinfo_show.

    [akpm@osdl.org: s390 build fix]
    [akpm@osdl.org: HOTPLUG_CPU build fix]
    Signed-off-by: Christoph Lameter
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

23 Jun, 2006

1 commit

  • As Nick points out, only ia64 uses PG_uncached. So we can push it up into the
    higher bits of the lower half of page->flags and make room for another flag on
    32-bit machines.

    Cc: "Luck, Tony"
    Cc: Jesse Barnes
    Cc: Jes Sorensen
    Cc: Nick Piggin
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

11 Apr, 2006

2 commits

  • Add some documentation regarding the utilisation of the flags field in
    struct page. This field is overloaded for per page bits and to hold node,
    zone and SPARSEMEM information. Make it clear which areas are used for
    what and how many bits are in each area.
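
    The documented layout is along these lines (a sketch; the field
    widths depend on the configuration):

        /*
         * page->flags layout:
         *
         *   | SECTION | NODE | ZONE | ... free ... | FLAGS |
         *    (high bits)                             (low bits)
         *
         * SECTION is only present with SPARSEMEM; the per-page
         * flag bits always occupy the low end of the word.
         */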

    Signed-off-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • Rohit found an obscure bug causing buddy list corruption.

    page_is_buddy is using a non-atomic test (PagePrivate && page_count == 0)
    to determine whether or not a free page's buddy is itself free and in the
    buddy lists.

    Each of the conjuncts may be true at different times due to unrelated
    conditions, so the non-atomic page_is_buddy test may find each conjunct to
    be true even if they were not both true at the same time (i.e., the page
    was not on the buddy lists).
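
    The racy test, as described (a sketch of the pre-fix check):

        /* non-atomic: each condition can be true at different
           times, so both may test true even though the page was
           never actually free in the buddy lists */
        static inline int page_is_buddy(struct page *page)
        {
                return PagePrivate(page) && page_count(page) == 0;
        }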

    Signed-off-by: Martin Bligh
    Signed-off-by: Rohit Seth
    Signed-off-by: Nick Piggin
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Linus Torvalds

    Nick Piggin