30 Apr, 2008

1 commit

  • Add a new BDI capability flag: BDI_CAP_NO_ACCT_WB. If this flag is
    set, then don't update the per-bdi writeback stats from
    test_set_page_writeback() and test_clear_page_writeback().

    Misc cleanups:

    - convert bdi_cap_writeback_dirty() and friends to static inline functions
    - create a flag that includes all three dirty/writeback related flags,
    since almost all users will want to have them together

    Signed-off-by: Miklos Szeredi
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
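    A minimal sketch of the capability test this enables, assuming the flag
    names above and the backing_dev_info layout of that era (illustrative,
    not the exact merged code):

        /* skip the per-bdi WRITEBACK stat updates when the flag is set */
        static inline bool bdi_cap_account_writeback(struct backing_dev_info *bdi)
        {
                return !(bdi->capabilities & BDI_CAP_NO_ACCT_WB);
        }

        /* one mask covering all three dirty/writeback related flags, so most
         * callers can set them together (the mask name here is an assumption) */
        #define BDI_CAP_NO_ACCT_AND_WRITEBACK \
                (BDI_CAP_NO_WRITEBACK | BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_ACCT_WB)

    test_set_page_writeback() and test_clear_page_writeback() would then guard
    their per-bdi stat updates with a test like this.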
     

20 Mar, 2008

1 commit

  • Fix various kernel-doc notation in mm/:

    filemap.c: add function short description; convert 2 to kernel-doc
    fremap.c: change parameter 'prot' to @prot
    pagewalk.c: change "-" in function parameters to ":"
    slab.c: fix short description of kmem_ptr_validate()
    swap.c: fix description & parameters of put_pages_list()
    swap_state.c: fix function parameters
    vmalloc.c: change "@returns" to "Returns:" since that is not a parameter

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

08 Feb, 2008

5 commits

  • If we're charging rss and we're charging cache, it seems obvious that we
    should be charging swapcache - as has been done. But in practice that
    doesn't work out so well: both swapin readahead and swapoff leave the
    majority of pages charged to the wrong cgroup (the cgroup that happened to
    read them in, rather than the cgroup to which they belong).

    (Which is why unuse_pte's GFP_KERNEL while holding pte lock never showed up
    as a problem: no allocation was ever done there, every page read being
    already charged to the cgroup which initiated the swapoff.)

    It all works rather better if we leave the charging to do_swap_page and
    unuse_pte, and do nothing for swapcache itself: revert mm/swap_state.c to
    what it was before the memory-controller patches. This also significantly
    speeds up a contained process working at its limit, because it no longer
    needs to keep waiting for swap writeback to complete.

    Is it unfair that swap pages become uncharged once they're unmapped, even
    though they're still clearly private to particular cgroups? For a short
    while, yes; but PageReclaim arranges for those pages to go to the end of
    the inactive list and be reclaimed soon if necessary.

    shmem/tmpfs pages are a distinct case: their charging also benefits from
    this change, but their second life on the lists as swapcache pages may
    prove more unfair - that I need to check next.

    Signed-off-by: Hugh Dickins
    Cc: Pavel Emelianov
    Acked-by: Balbir Singh
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Move mem_controller_cache_charge() above radix_tree_preload().
    radix_tree_preload() disables preemption, so even though the gfp_mask
    passed in contains __GFP_WAIT, we cannot really do __GFP_WAIT allocations
    after it, and thus we hit a BUG_ON() in kmem_cache_alloc().

    This patch therefore moves the mem_controller_cache_charge() call for
    cache charging above radix_tree_preload().

    Signed-off-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
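    A sketch of the resulting ordering (illustrative; the charge function's
    argument list and the uncharge helper are assumptions, not the merged code):

        /* Charge first, while we are still allowed to sleep; only then call
         * radix_tree_preload(), which disables preemption, so a __GFP_WAIT
         * allocation after it would trip the BUG_ON() in kmem_cache_alloc(). */
        error = mem_controller_cache_charge(page, gfp_mask);
        if (error)
                return error;

        error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
        if (error) {
                mem_controller_uncharge_page(page);   /* hypothetical helper */
                return error;
        }
        /* ... insert the page into the radix tree under tree_lock ... */
        radix_tree_preload_end();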
     
  • Nick Piggin pointed out that swap cache and page cache addition routines
    could be called from non GFP_KERNEL contexts. This patch makes the
    charging routine aware of the gfp context. Charging might fail if the
    cgroup is over its limit, in which case a suitable error is returned.

    This patch was tested on a PowerPC box. I am still looking at being able
    to test the paths through which allocations happen in non-GFP_KERNEL
    contexts.

    [kamezawa.hiroyu@jp.fujitsu.com: problem with ZONE_MOVABLE]
    Signed-off-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Choose whether we want cached pages to be accounted or not. By default
    both are accounted for. A new set of tunables is added:

    echo -n 1 > mem_control_type

    switches the accounting to account for only mapped pages

    echo -n 3 > mem_control_type

    switches the behaviour back

    [bunk@kernel.org: mm/memcontrol.c: cleanups]
    [akpm@linux-foundation.org: fix sparc32 build]
    Signed-off-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Add the accounting hooks. The accounting is carried out for RSS and Page
    Cache (unmapped) pages. There is now a common limit and accounting for both.
    The RSS accounting is accounted at page_add_*_rmap() and page_remove_rmap()
    time. Page cache is accounted at add_to_page_cache(),
    __delete_from_page_cache(). Swap cache is also accounted for.

    Each page's page_cgroup is protected with the last bit of the
    page_cgroup pointer; this makes handling of race conditions involving
    simultaneous mappings of a page easier. A reference count is kept in the
    page_cgroup to deal with cases where a page might be unmapped from the RSS
    of all tasks, but still lives in the page cache.

    Credits go to Vaidyanathan Srinivasan for helping with reference counting work
    of the page cgroup. Almost all of the page cache accounting code has help
    from Vaidyanathan Srinivasan.

    [hugh@veritas.com: fix swapoff breakage]
    [akpm@linux-foundation.org: fix locking]
    Signed-off-by: Vaidyanathan Srinivasan
    Signed-off-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc:
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
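    A sketch of the lock-bit idea (illustrative; it assumes struct page carries
    the page_cgroup pointer in an unsigned long so its low bit can be used, and
    the helper names are assumptions):

        #define PAGE_CGROUP_LOCK_BIT 0

        /* the page_cgroup pointer is word aligned, so bit 0 is free to act
         * as a per-page lock for the accounting metadata */
        static void lock_page_cgroup(struct page *page)
        {
                bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
        }

        static void unlock_page_cgroup(struct page *page)
        {
                bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
        }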
     

06 Feb, 2008

6 commits

  • After running SetPageUptodate, preceding stores to the page contents to
    actually bring it uptodate may not be ordered with the store to set the
    page uptodate.

    Therefore, another CPU which checks that PageUptodate is true and then
    reads the page contents can get stale data.

    Fix this by having an smp_wmb before SetPageUptodate, and an smp_rmb after
    PageUptodate.

    Many places that test PageUptodate, do so with the page locked, and this
    would be enough to ensure memory ordering in those places if
    SetPageUptodate were only called while the page is locked. Unfortunately
    that is not always the case for some filesystems, but it could be an idea
    for the future.

    Also bring the handling of anonymous page uptodateness in line with that of
    file backed page management, by marking anon pages as uptodate when they
    _are_ uptodate, rather than when our implementation requires that they be
    marked as such. Doing so allows us to get rid of the smp_wmb's in the page
    copying functions, which were especially added for anonymous pages for an
    analogous memory ordering problem. Both file and anonymous pages are
    handled with the same barriers.

    FAQ:
    Q. Why not do this in flush_dcache_page?
    A. Firstly, flush_dcache_page handles only one side (the smp_wmb side) of the
    ordering protocol; we'd still need smp_rmb somewhere. Secondly, hiding away
    memory barriers in a completely unrelated function is nasty; at least in the
    PageUptodate macros, they are located together with (half) the operations
    involved in the ordering. Thirdly, the smp_wmb is only required when first
    bringing the page uptodate, whereas flush_dcache_page should be called each time
    it is written to through the kernel mapping. It is logically the wrong place to
    put it.

    Q. Why does this increase my text size / reduce my performance / etc.?
    A. Because it is adding the necessary instructions to eliminate the data-race.

    Q. Can it be improved?
    A. Yes, eg. if you were to create a rule that all SetPageUptodate operations
    run under the page lock, we could avoid the smp_rmb places where PageUptodate
    is queried under the page lock. Requires audit of all filesystems and at least
    some would need reworking. That's great you're interested, I'm eagerly awaiting
    your patches.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
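    A sketch of the barrier pairing described above (illustrative; simplified
    from what the PageUptodate/SetPageUptodate macros would contain):

        /* writer side: publish the page contents before the uptodate bit */
        static inline void sketch_set_page_uptodate(struct page *page)
        {
                smp_wmb();                      /* contents before flag */
                set_bit(PG_uptodate, &page->flags);
        }

        /* reader side: don't let content reads pass the flag read */
        static inline int sketch_page_uptodate(struct page *page)
        {
                int ret = test_bit(PG_uptodate, &page->flags);

                if (ret)
                        smp_rmb();              /* flag before contents */
                return ret;
        }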
     
  • move_to_swap_cache and move_from_swap_cache functions (which swizzle a page
    between tmpfs page cache and swap cache, to avoid page copying) are only used
    by shmem.c; and our subsequent fix for unionfs needs different treatments in
    the two instances of move_from_swap_cache. Move them from swap_state.c into
    their callsites shmem_writepage, shmem_unuse_inode and shmem_getpage, making
    add_to_swap_cache externally visible.

    shmem.c likes to say set_page_dirty where swap_state.c liked to say
    SetPageDirty: respect that diversity, which __set_page_dirty_no_writeback
    makes moot (and implies we should lose that "shift page from clean_pages to
    dirty_pages list" comment: it's on neither).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • add_to_swap_cache doesn't amount to much: merge it into its sole caller
    read_swap_cache_async. But we'll be needing to call __add_to_swap_cache from
    shmem.c, so promote it to the new add_to_swap_cache. Both were static, so
    there's no interface confusion to worry about.

    And lose that inappropriate "Anon pages are already on the LRU" comment in the
    merging: they're not already on the LRU, as Nick Piggin noticed.

    Signed-off-by: Hugh Dickins
    No-problems-with: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Both unionfs and memcgroups pose challenges to tmpfs and shmem. To help fix,
    it's best to move the swap swizzling functions from swap_state.c to shmem.c.
    As a preliminary to that, move swap stats updating down into
    __add_to_swap_cache, which will remain internal to swap_state.c.

    Well, actually, just move down the incrementation of add_total: remove
    noent_race and exist_race completely; they are relics of my 2.4.11 testing.
    Alt-SysRq-m users will be thrilled if 2.6.25 is at last free of "race M+N"s.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Building in a filesystem on a loop device on a tmpfs file can hang when
    swapping, the loop thread caught in that infamous throttle_vm_writeout.

    In theory this is a long standing problem, which I've either never seen in
    practice, or long ago suppressed the recollection, after discounting my load
    and my tmpfs size as unrealistically high. But now, with the new aops, it has
    become easy to hang on one machine.

    Loop used to grab_cache_page before the old prepare_write to tmpfs, which
    seems to have been enough to free up some memory for any swapin needed; but
    the new write_begin lets tmpfs find or allocate the page (much nicer, since
    grab_cache_page missed tmpfs pages in swapcache).

    When allocating a fresh page, tmpfs respects loop's mapping_gfp_mask, which
    has __GFP_IO|__GFP_FS stripped off, and throttle_vm_writeout is designed to
    break out when __GFP_IO or __GFP_FS is unset; but when tmpfs swaps in,
    read_swap_cache_async allocates with GFP_HIGHUSER_MOVABLE regardless of the
    mapping_gfp_mask - hence the hang.

    So, pass gfp_mask down the line from shmem_getpage to shmem_swapin to
    swapin_readahead to read_swap_cache_async to add_to_swap_cache.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Acked-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
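    A sketch of the plumbing (illustrative; argument order is approximate):

        /* shmem_getpage() takes the gfp mask from the mapping, so loop's
         * restricted mask (no __GFP_IO/__GFP_FS) reaches the swapin path
         * instead of a hard-coded GFP_HIGHUSER_MOVABLE */
        gfp_t gfp = mapping_gfp_mask(inode->i_mapping);

        page = shmem_swapin(swap, gfp, info, idx);
        /* ... which eventually does ... */
        page = read_swap_cache_async(swap, gfp, vma, addr);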
     
  • swapin_readahead has never sat well in mm/memory.c: move it to mm/swap_state.c
    beside its kindred read_swap_cache_async. Why were its args in a different
    order? Rearrange them. And since it was always followed by a
    read_swap_cache_async of the target page, fold that in and return struct
    page*. Then CONFIG_SWAP=n no longer needs valid_swaphandles and
    read_swap_cache_async stubs.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

17 Oct, 2007

1 commit

  • __add_to_swap_cache unconditionally sets the page locked, which can be a bit
    alarming to the unsuspecting reader: in the code paths where the page is
    visible to other CPUs, the page should be (and is) already locked.

    Instead, just add a check to ensure the page is locked here, and teach the one
    path relying on the old behaviour to call SetPageLocked itself.

    [hugh@veritas.com: locking fix]
    Signed-off-by: Nick Piggin
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
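    A sketch of the change (illustrative):

        /* in __add_to_swap_cache(): require the lock instead of taking it */
        BUG_ON(!PageLocked(page));

        /* and the one caller that relied on the old behaviour locks the
         * freshly allocated page itself before adding it */
        SetPageLocked(new_page);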
     

18 Jul, 2007

1 commit

  • It is often known at allocation time whether a page may be migrated or not.
    This patch adds a flag called __GFP_MOVABLE and a new mask called
    GFP_HIGH_MOVABLE. Allocations using __GFP_MOVABLE can either be migrated
    using the page migration mechanism or reclaimed by syncing with backing
    storage and discarding.

    An API function very similar to alloc_zeroed_user_highpage() is added for
    __GFP_MOVABLE allocations called alloc_zeroed_user_highpage_movable(). The
    flags used by alloc_zeroed_user_highpage() are not changed because it would
    change the semantics of an existing API. After this patch is applied there
    are no in-kernel users of alloc_zeroed_user_highpage() so it probably should
    be marked deprecated if this patch is merged.

    Note that this patch includes a minor cleanup to the use of __GFP_ZERO in
    shmem.c to keep all flag modifications to inode->mapping in the
    shmem_dir_alloc() helper function. This clean-up suggestion is courtesy of
    Hugh Dickins.

    Additional credit goes to Christoph Lameter and Linus Torvalds for shaping the
    concept. Credit to Hugh Dickins for catching issues with shmem swap vector
    and ramfs allocations.

    [akpm@linux-foundation.org: build fix]
    [hugh@veritas.com: __GFP_ZERO cleanup]
    Signed-off-by: Mel Gorman
    Cc: Andy Whitcroft
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

17 Jul, 2007

1 commit


04 Jul, 2006

1 commit


01 Jul, 2006

1 commit

  • Currently a single atomic variable is used to establish the size of the page
    cache in the whole machine. The zoned VM counters have the same method of
    implementation as the nr_pagecache code but also allow the determination of
    the pagecache size per zone.

    Remove the special implementation for nr_pagecache and make it a zoned counter
    named NR_FILE_PAGES.

    Updates of the page cache counters are always performed with interrupts off.
    We can therefore use the __ variant here.

    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
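    A sketch of the conversion at the call sites (illustrative):

        /* old: one global atomic for the whole machine */
        atomic_inc(&nr_pagecache);

        /* new: a zoned counter; the __ variant is fine because page cache
         * counters are only updated with interrupts disabled */
        __inc_zone_page_state(page, NR_FILE_PAGES);
        /* ...and symmetrically on removal: */
        __dec_zone_page_state(page, NR_FILE_PAGES);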
     

29 Jun, 2006

1 commit


01 Apr, 2006

1 commit


22 Mar, 2006

1 commit

  • Centralize the page migration functions in anticipation of additional
    tinkering. Creates a new file mm/migrate.c

    1. Extract buffer_migrate_page() from fs/buffer.c

    2. Extract central migration code from vmscan.c

    3. Extract some components from mempolicy.c

    4. Export pageout() and remove_from_swap() from vmscan.c

    5. Make it possible to configure NUMA systems without page migration
    and non-NUMA systems with page migration.

    I had to do some #ifdeffing in mempolicy.c that may need a cleanup.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

02 Feb, 2006

1 commit

  • Migrate a page with buffers without requiring writeback

    This introduces a new address space operation migratepage() that may be used
    by a filesystem to implement its own version of page migration.

    A version is provided that migrates buffers attached to pages. Some
    filesystems (ext2, ext3, xfs) are modified to utilize this feature.

    The swapper address space operations are modified so that a regular
    migrate_page() will occur for anonymous pages without writeback (migrate_pages
    forces every anonymous page to have a swap entry).

    Signed-off-by: Mike Kravetz
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
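    A sketch of how a filesystem opts in (illustrative; member layout abridged
    and the exact prototype of that era may differ):

        /* e.g. in ext2: reuse the generic helper for pages with buffers */
        static struct address_space_operations ext2_aops = {
                .readpage       = ext2_readpage,
                .writepage      = ext2_writepage,
                .migratepage    = buffer_migrate_page,
                /* ... */
        };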
     

09 Jan, 2006

1 commit

  • Add gfp_mask to add_to_swap

    add_to_swap does allocations with GFP_ATOMIC in order not to interfere with
    swapping. During migration we may use add_to_swap extensively, which may
    lead to out-of-memory errors.

    This patch makes add_to_swap take a parameter that specifies the gfp mask.
    The page migration code can then make add_to_swap use GFP_KERNEL.

    Signed-off-by: Hirokazu Takahashi
    Signed-off-by: Dave Hansen
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
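    The interface change, as a sketch:

        /* before */
        int add_to_swap(struct page *page);

        /* after: callers choose the allocation class, so page migration can
         * pass GFP_KERNEL while the swap-out path keeps GFP_ATOMIC */
        int add_to_swap(struct page *page, gfp_t gfp_mask);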
     

07 Jan, 2006

1 commit

  • Minor optimization (though it doesn't help in the PREEMPT case, severely
    constrained by small ZAP_BLOCK_SIZE). free_pages_and_swap_cache works in
    chunks of 16, calling release_pages which works in chunks of PAGEVEC_SIZE.
    But PAGEVEC_SIZE was dropped from 16 to 14 in 2.6.10, so we're now doing more
    spin_lock_irq'ing than necessary: use PAGEVEC_SIZE throughout.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
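    A sketch of the loop after the change (illustrative, close to but not
    necessarily identical to the merged code):

        void free_pages_and_swap_cache(struct page **pages, int nr)
        {
                struct page **pagep = pages;

                lru_add_drain();
                while (nr) {
                        int todo = min(nr, (int)PAGEVEC_SIZE);
                        int i;

                        /* drop swap cache, then release in one PAGEVEC_SIZE
                         * chunk so release_pages() takes its spin_lock_irq
                         * section only once per chunk */
                        for (i = 0; i < todo; i++)
                                free_swap_cache(pagep[i]);
                        release_pages(pagep, todo, 0);
                        pagep += todo;
                        nr -= todo;
                }
        }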
     

07 Nov, 2005

1 commit


30 Oct, 2005

2 commits

  • Updated several references to page_table_lock in common code comments.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
    a many-threaded application which concurrently initializes different parts of
    a large anonymous area.

    This patch corrects that, by using a separate spinlock per page table page, to
    guard the page table entries in that page, instead of using the mm's single
    page_table_lock. (But even then, page_table_lock is still used to guard page
    table allocation, and anon_vma allocation.)

    In this implementation, the spinlock is tucked inside the struct page of the
    page table page: with a BUILD_BUG_ON in case it overflows - which it would in
    the case of 32-bit PA-RISC with spinlock debugging enabled.

    Splitting the lock is not quite for free: another cacheline access. Ideally,
    I suppose we would use split ptlock only for multi-threaded processes on
    multi-cpu machines; but deciding that dynamically would have its own costs.
    So for now enable it by config, at some number of cpus - since the Kconfig
    language doesn't support inequalities, let the preprocessor compare that with
    NR_CPUS. But I don't think it's worth being user-configurable: for good
    testing of both split and unsplit configs, split now at 4 cpus, and perhaps
    change that to 8 later.

    There is a benefit even for singly threaded processes: kswapd can be attacking
    one part of the mm while another part is busy faulting.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
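    A sketch of the config-dependent lock selection (illustrative; names
    approximate the mainline implementation of that time):

        /* the lock lives in the page table page's struct page when split
         * ptlocks are enabled, otherwise fall back to mm->page_table_lock */
        #if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
        #define pte_lockptr(mm, pmd)    ({ (void)(mm); &pmd_page(*(pmd))->ptl; })
        #else
        #define pte_lockptr(mm, pmd)    ({ (void)(pmd); &(mm)->page_table_lock; })
        #endif

        #define pte_offset_map_lock(mm, pmd, address, ptlp)     \
        ({                                                      \
                spinlock_t *__ptl = pte_lockptr(mm, pmd);       \
                pte_t *__pte = pte_offset_map(pmd, address);    \
                *(ptlp) = __ptl;                                \
                spin_lock(__ptl);                               \
                __pte;                                          \
        })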
     

09 Oct, 2005

1 commit

  • - added typedef unsigned int __nocast gfp_t;

    - replaced __nocast uses for gfp flags with gfp_t - it gives exactly
    the same warnings as far as sparse is concerned, doesn't change
    generated code (from gcc's point of view we replaced unsigned int with
    typedef) and documents what's going on far better.

    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
     

11 Sep, 2005

1 commit


05 Sep, 2005

1 commit

  • Three of the four BUG_ONs in delete_from_swap_cache are immediately
    repeated in __delete_from_swap_cache: delete those and add the one. But
    perhaps mm/ is altogether overprovisioned with historic BUGs?

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

01 May, 2005

1 commit


17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds