30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the followings.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and try to put the new include such that its order conforms
    to its surrounding. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    wildly available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build test were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable easily on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

07 Mar, 2010

1 commit


02 Mar, 2010

1 commit

  • * 'for-linus' of master.kernel.org:/home/rmk/linux-2.6-arm: (100 commits)
    ARM: Eliminate decompressor -Dstatic= PIC hack
    ARM: 5958/1: ARM: U300: fix inverted clk round rate
    ARM: 5956/1: misplaced parentheses
    ARM: 5955/1: ep93xx: move timer defines into core.c and document
    ARM: 5954/1: ep93xx: move gpio interrupt support to gpio.c
    ARM: 5953/1: ep93xx: fix broken build of clock.c
    ARM: 5952/1: ARM: MM: Add ARM_L1_CACHE_SHIFT_6 for handle inside each ARCH Kconfig
    ARM: 5949/1: NUC900 add gpio virtual memory map
    ARM: 5948/1: Enable timer0 to time4 clock support for nuc910
    ARM: 5940/2: ARM: MMCI: remove custom DBG macro and printk
    ARM: make_coherent(): fix problems with highpte, part 2
    MM: Pass a PTE pointer to update_mmu_cache() rather than the PTE itself
    ARM: 5945/1: ep93xx: include correct irq.h in core.c
    ARM: 5933/1: amba-pl011: support hardware flow control
    ARM: 5930/1: Add PKMAP area description to memory.txt.
    ARM: 5929/1: Add checks to detect overlap of memory regions.
    ARM: 5928/1: Change type of VMALLOC_END to unsigned long.
    ARM: 5927/1: Make delimiters of DMA area globally visibly.
    ARM: 5926/1: Add "Virtual kernel memory..." printout.
    ARM: 5920/1: OMAP4: Enable L2 Cache
    ...

    Fix up trivial conflict in arch/arm/mach-mx25/clock.c

    Linus Torvalds
     

22 Feb, 2010

1 commit

  • x86-32 has had a static test for copy_on_user() overflow for a while.
    This test currently fails in mm/migrate.c resulting in an
    allyesconfig/allmodconfig build failure on x86-32:

    In function ‘copy_from_user’,
    inlined from ‘do_pages_stat’ at
    /home/hpa/kernel/git/mm/migrate.c:1012:
    /home/hpa/kernel/git/arch/x86/include/asm/uaccess_32.h:212: error:
    call to ‘copy_from_user_overflow’ declared

    Make the logic more explicit and therefore easier for gcc to
    understand.

    v2: rewrite the loop entirely using a more normal structure for a
    chunked-data loop (Linus Torvalds)

    Reported-by: Len Brown
    Signed-off-by: H. Peter Anvin
    Reviewed-and-Tested-by: KOSAKI Motohiro
    Cc: Arjan van de Ven
    Cc: Andrew Morton
    Cc: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Signed-off-by: Linus Torvalds

    H. Peter Anvin
     

21 Feb, 2010

1 commit

  • On VIVT ARM, when we have multiple shared mappings of the same file
    in the same MM, we need to ensure that we have coherency across all
    copies. We do this via make_coherent() by making the pages
    uncacheable.

    This used to work fine, until we allowed highmem with highpte - we
    now have a page table which is mapped as required, and is not available
    for modification via update_mmu_cache().

    Ralf Beache suggested getting rid of the PTE value passed to
    update_mmu_cache():

    On MIPS update_mmu_cache() calls __update_tlb() which walks pagetables
    to construct a pointer to the pte again. Passing a pte_t * is much
    more elegant. Maybe we might even replace the pte argument with the
    pte_t?

    Ben Herrenschmidt would also like the pte pointer for PowerPC:

    Passing the ptep in there is exactly what I want. I want that
    -instead- of the PTE value, because I have issue on some ppc cases,
    for I$/D$ coherency, where set_pte_at() may decide to mask out the
    _PAGE_EXEC.

    So, pass in the mapped page table pointer into update_mmu_cache(), and
    remove the PTE value, updating all implementations and call sites to
    suit.

    Includes a fix from Stephen Rothwell:

    sparc: fix fallout from update_mmu_cache API change

    Signed-off-by: Stephen Rothwell

    Acked-by: Benjamin Herrenschmidt
    Signed-off-by: Russell King

    Russell King
     

07 Feb, 2010

1 commit

  • We incorrectly depended on the 'node_state/node_isset()' functions
    testing the node range, rather than checking it explicitly. That's not
    reliable, even if it might often happen to work. So do the proper
    explicit test.

    Reported-by: Marcus Meissner
    Acked-and-tested-by: Brice Goglin
    Acked-by: Hugh Dickins
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

16 Dec, 2009

5 commits

  • unevictable_migrate_page() in mm/internal.h is a relic of the since
    removed UNEVICTABLE_LRU Kconfig option. This patch removes the function
    and open codes the test in migrate_page_copy().

    Signed-off-by: Lee Schermerhorn
    Reviewed-by: Christoph Lameter
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • The previous patch enables page migration of ksm pages, but that soon gets
    into trouble: not surprising, since we're using the ksm page lock to lock
    operations on its stable_node, but page migration switches the page whose
    lock is to be used for that. Another layer of locking would fix it, but
    do we need that yet?

    Do we actually need page migration of ksm pages? Yes, memory hotremove
    needs to offline sections of memory: and since we stopped allocating ksm
    pages with GFP_HIGHUSER, they will tend to be GFP_HIGHUSER_MOVABLE
    candidates for migration.

    But KSM is currently unconscious of NUMA issues, happily merging pages
    from different NUMA nodes: at present the rule must be, not to use
    MADV_MERGEABLE where you care about NUMA. So no, NUMA page migration of
    ksm pages does not make sense yet.

    So, to complete support for ksm swapping we need to make hotremove safe.
    ksm_memory_callback() take ksm_thread_mutex when MEM_GOING_OFFLINE and
    release it when MEM_OFFLINE or MEM_CANCEL_OFFLINE. But if mapped pages
    are freed before migration reaches them, stable_nodes may be left still
    pointing to struct pages which have been removed from the system: the
    stable_node needs to identify a page by pfn rather than page pointer, then
    it can safely prune them when MEM_OFFLINE.

    And make NUMA migration skip PageKsm pages where it skips PageReserved.
    But it's only when we reach unmap_and_move() that the page lock is taken
    and we can be sure that raised pagecount has prevented a PageAnon from
    being upgraded: so add offlining arg to migrate_pages(), to migrate ksm
    page when offlining (has sufficient locking) but reject it otherwise.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • A side-effect of making ksm pages swappable is that they have to be placed
    on the LRUs: which then exposes them to isolate_lru_page() and hence to
    page migration.

    Add rmap_walk() for remove_migration_ptes() to use: rmap_walk_anon() and
    rmap_walk_file() in rmap.c, but rmap_walk_ksm() in ksm.c. Perhaps some
    consolidation with existing code is possible, but don't attempt that yet
    (try_to_unmap needs to handle nonlinears, but migration pte removal does
    not).

    rmap_walk() is sadly less general than it appears: rmap_walk_anon(), like
    remove_anon_migration_ptes() which it replaces, avoids calling
    page_lock_anon_vma(), because that includes a page_mapped() test which
    fails when all migration ptes are in place. That was valid when NUMA page
    migration was introduced (holding mmap_sem provided the missing guarantee
    that anon_vma's slab had not already been destroyed), but I believe not
    valid in the memory hotremove case added since.

    For now do the same as before, and consider the best way to fix that
    unlikely race later on. When fixed, we can probably use rmap_walk() on
    hwpoisoned ksm pages too: for now, they remain among hwpoison's various
    exceptions (its PageKsm test comes before the page is locked, but its
    page_lock_anon_vma fails safely if an anon gets upgraded).

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • At present we define PageAnon(page) by the low PAGE_MAPPING_ANON bit set
    in page->mapping, with the higher bits a pointer to the anon_vma; and have
    defined PageKsm(page) as that with NULL anon_vma.

    But KSM swapping will need to store a pointer there: so in preparation for
    that, now define PAGE_MAPPING_FLAGS as the low two bits, including
    PAGE_MAPPING_KSM (always set along with PAGE_MAPPING_ANON, until some
    other use for the bit emerges).

    Declare page_rmapping(page) to return the pointer part of page->mapping,
    and page_anon_vma(page) to return the anon_vma pointer when that's what it
    is. Use these in a few appropriate places: notably, unuse_vma() has been
    testing page->mapping, but is better to be testing page_anon_vma() (cases
    may be added in which flag bits are set without any pointer).

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Reviewed-by: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Christoph pointed out inc_zone_page_state(NR_ISOLATED) should be placed
    in right after isolate_page().

    This patch does it.

    Reviewed-by: Christoph Lameter
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

12 Dec, 2009

1 commit

  • Slightly adjust the logic for determining the size of the
    copy_form_user() in do_pages_stat(); with this change, gcc can see
    that the copying is safe.

    Without this, we get a build error for i386 allyesconfig:

    /home/hpa/kernel/linux-2.6-tip.urgent/arch/x86/include/asm/uaccess_32.h:213:
    error: call to ‘copy_from_user_overflow’ declared with attribute
    error: copy_from_user() buffer size is not provably correct

    Unlike an earlier patch from Arjan, this doesn't introduce new
    variables; merely reshuffles the compare so that gcc can see that an
    overflow cannot happen.

    Signed-off-by: H. Peter Anvin
    Cc: Brice Goglin
    Cc: Arjan van de Ven
    Cc: Andrew Morton
    Cc: KOSAKI Motohiro
    LKML-Reference:

    H. Peter Anvin
     

12 Nov, 2009

1 commit

  • Lee Schermerhorn reported that he saw bad pointer dereference in
    mem_cgroup_end_migration() when he disabled memcg by boot option.

    memcg's page migration logic works as

    mem_cgroup_prepare_migration(page, &ptr);
    do page migration
    mem_cgroup_end_migration(page, ptr);

    Now, ptr is not initialized in prepare_migration when memcg is disabled
    by boot option. This causes panic in end_migration. This patch fixes it.

    Reported-by: Lee Schermerhorn
    Cc: Balbir Singh
    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

24 Sep, 2009

1 commit

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
    HWPOISON: Enable error_remove_page on btrfs
    HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
    HWPOISON: Add madvise() based injector for hardware poisoned pages v4
    HWPOISON: Enable error_remove_page for NFS
    HWPOISON: Enable .remove_error_page for migration aware file systems
    HWPOISON: The high level memory error handler in the VM v7
    HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
    HWPOISON: shmem: call set_page_dirty() with locked page
    HWPOISON: Define a new error_remove_page address space op for async truncation
    HWPOISON: Add invalidate_inode_page
    HWPOISON: Refactor truncate to allow direct truncating of page v2
    HWPOISON: check and isolate corrupted free pages v2
    HWPOISON: Handle hardware poisoned pages in try_to_unmap
    HWPOISON: Use bitmask/action code for try_to_unmap behaviour
    HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
    HWPOISON: Add poison check to page fault handling
    HWPOISON: Add basic support for poisoned pages in fault handler v3
    HWPOISON: Add new SIGBUS error codes for hardware poison signals
    HWPOISON: Add support for poison swap entries v2
    HWPOISON: Export some rmap vma locking to outside world
    ...

    Linus Torvalds
     

22 Sep, 2009

5 commits

  • Make page_has_private() return a true boolean value and remove the double
    negations from the two callsites using it for arithmetic.

    Signed-off-by: Johannes Weiner
    Cc: Christoph Lameter
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • page_is_file_cache() has been used for both boolean checks and LRU
    arithmetic, which was always a bit weird.

    Now that page_lru_base_type() exists for LRU arithmetic, make
    page_is_file_cache() a real predicate function and adjust the
    boolean-using callsites to drop those pesky double negations.

    Signed-off-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • If the system is running a heavy load of processes then concurrent reclaim
    can isolate a large number of pages from the LRU. /proc/vmstat and the
    output generated for an OOM do not show how many pages were isolated.

    This has been observed during process fork bomb testing (mstctl11 in LTP).

    This patch shows the information about isolated pages.

    Reproduced via:

    -----------------------
    % ./hackbench 140 process 1000
    => OOM occur

    active_anon:146 inactive_anon:0 isolated_anon:49245
    active_file:79 inactive_file:18 isolated_file:113
    unevictable:0 dirty:0 writeback:0 unstable:0 buffer:39
    free:370 slab_reclaimable:309 slab_unreclaimable:5492
    mapped:53 shmem:15 pagetables:28140 bounce:0

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Acked-by: Wu Fengguang
    Reviewed-by: Minchan Kim
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Recently we encountered OOM problems due to memory use of the GEM cache.
    Generally a large amuont of Shmem/Tmpfs pages tend to create a memory
    shortage problem.

    We often use the following calculation to determine the amount of shmem
    pages:

    shmem = NR_ACTIVE_ANON + NR_INACTIVE_ANON - NR_ANON_PAGES

    however the expression does not consider isolated and mlocked pages.

    This patch adds explicit accounting for pages used by shmem and tmpfs.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Reviewed-by: Christoph Lameter
    Acked-by: Wu Fengguang
    Cc: David Rientjes
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • In test, some pages in swap-cache can't be migrated, as they aren't rmap.

    unmap_and_move() ignores swap-cache page which is just read in and hasn't
    rmap (see the comments in the code), but swap_aops provides .migratepage.
    Better to migrate such pages instead of ignore them.

    Signed-off-by: Shaohua Li
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Yakui Zhao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

16 Sep, 2009

1 commit

  • try_to_unmap currently has multiple modi (migration, munlock, normal unmap)
    which are selected by magic flag variables. The logic is not very straight
    forward, because each of these flag change multiple behaviours (e.g.
    migration turns off aging, not only sets up migration ptes etc.)
    Also the different flags interact in magic ways.

    A later patch in this series adds another mode to try_to_unmap, so
    this becomes quickly unmanageable.

    Replace the different flags with a action code (migration, munlock, munmap)
    and some additional flags as modifiers (ignore mlock, ignore aging).
    This makes the logic more straight forward and allows easier extension
    to new behaviours. Change all the caller to declare what they want to
    do.

    This patch is supposed to be a nop in behaviour. If anyone can prove
    it is not that would be a bug.

    Cc: Lee.Schermerhorn@hp.com
    Cc: npiggin@suse.de

    Signed-off-by: Andi Kleen

    Andi Kleen
     

17 Jun, 2009

2 commits

  • migrate_prep() is fairly expensive (72us on 16-core barcelona 1.9GHz).
    Commit 3140a2273009c01c27d316f35ab76a37e105fdd8 improved move_pages()
    throughput by breaking it into chunks, but it also made migrate_prep() be
    called once per chunk (every 128pages or so) instead of once per
    move_pages().

    This patch reverts to calling migrate_prep() only once per chunk as we did
    before 2.6.29. It is also a followup to commit
    0aedadf91a70a11c4a3e7c7d99b21e5528af8d5d ("mm: move migrate_prep out from
    under mmap_sem").

    This improves migration throughput on the above machine from 600MB/s to
    750MB/s.

    Signed-off-by: Brice Goglin
    Acked-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: Heiko Carstens
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     
  • Callers of alloc_pages_node() can optionally specify -1 as a node to mean
    "allocate from the current node". However, a number of the callers in
    fast paths know for a fact their node is valid. To avoid a comparison and
    branch, this patch adds alloc_pages_exact_node() that only checks the nid
    with VM_BUG_ON(). Callers that know their node is valid are then
    converted.

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Pekka Enberg
    Acked-by: Paul Mundt [for the SLOB NUMA bits]
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

03 Apr, 2009

1 commit

  • Recruit a page flag to aid in cache management. The following extra flag is
    defined:

    (1) PG_fscache (PG_private_2)

    The marked page is backed by a local cache and is pinning resources in the
    cache driver.

    If PG_fscache is set, then things that checked for PG_private will now also
    check for that. This includes things like truncation and page invalidation.
    The function page_has_private() had been added to make the checks for both
    PG_private and PG_private_2 at the same time.

    Signed-off-by: David Howells
    Acked-by: Steve Dickson
    Acked-by: Trond Myklebust
    Acked-by: Rik van Riel
    Acked-by: Al Viro
    Tested-by: Daire Byrne

    David Howells
     

12 Feb, 2009

1 commit


14 Jan, 2009

1 commit


09 Jan, 2009

2 commits

  • Now, management of "charge" under page migration is done under following
    manner. (Assume migrate page contents from oldpage to newpage)

    before
    - "newpage" is charged before migration.
    at success.
    - "oldpage" is uncharged at somewhere(unmap, radix-tree-replace)
    at failure
    - "newpage" is uncharged.
    - "oldpage" is charged if necessary (*1)

    But (*1) is not reliable....because of GFP_ATOMIC.

    This patch tries to change behavior as following by charge/commit/cancel ops.

    before
    - charge PAGE_SIZE (no target page)
    success
    - commit charge against "newpage".
    failure
    - commit charge against "oldpage".
    (PCG_USED bit works effectively to avoid double-counting)
    - if "oldpage" is obsolete, cancel charge of PAGE_SIZE.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • There is a small race in do_swap_page(). When the page swapped-in is
    charged, the mapcount can be greater than 0. But, at the same time some
    process (shares it ) call unmap and make mapcount 1->0 and the page is
    uncharged.

    CPUA CPUB
    mapcount == 1.
    (1) charge if mapcount==0 zap_pte_range()
    (2) mapcount 1 => 0.
    (3) uncharge(). (success)
    (4) set page's rmap()
    mapcount 0=>1

    Then, this swap page's account is leaked.

    For fixing this, I added a new interface.
    - charge
    account to res_counter by PAGE_SIZE and try to free pages if necessary.
    - commit
    register page_cgroup and add to LRU if necessary.
    - cancel
    uncharge PAGE_SIZE because of do_swap_page failure.

    CPUA
    (1) charge (always)
    (2) set page's rmap (mapcount > 0)
    (3) commit charge was necessary or not after set_pte().

    This protocol uses PCG_USED bit on page_cgroup for avoiding over accounting.
    Usual mem_cgroup_charge_common() does charge -> commit at a time.

    And this patch also adds following function to clarify all charges.

    - mem_cgroup_newpage_charge() ....replacement for mem_cgroup_charge()
    called against newly allocated anon pages.

    - mem_cgroup_charge_migrate_fixup()
    called only from remove_migration_ptes().
    we'll have to rewrite this later.(this patch just keeps old behavior)
    This function will be removed by additional patch to make migration
    clearer.

    Good for clarifying "what we do"

    Then, we have 4 following charge points.
    - newpage
    - swap-in
    - add-to-cache.
    - migration.

    [akpm@linux-foundation.org: add missing inline directives to stubs]
    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

07 Jan, 2009

3 commits

  • If we add NOOP stubs for SetPageSwapCache() and ClearPageSwapCache(), then
    we can remove the #ifdef CONFIG_SWAPs from mm/migrate.c.

    Signed-off-by: Hugh Dickins
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • pp->page is never used when not set to the right page, so there is no need
    to set it to ZERO_PAGE(0) by default.

    Signed-off-by: Brice Goglin
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     
  • Rework do_pages_move() to work by page-sized chunks of struct page_to_node
    that are passed to do_move_page_to_node_array(). We now only have to
    allocate a single page instead a possibly very large vmalloc area to store
    all page_to_node entries.

    As a result, new_page_node() will now have a very small lookup, hidding
    much of the overall sys_move_pages() overhead.

    Signed-off-by: Brice Goglin
    Signed-off-by: Nathalie Furmento
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     

25 Dec, 2008

1 commit


17 Dec, 2008

1 commit

  • Commit 80bba1290ab5122c60cdb73332b26d288dc8aedd removed one necessary
    variable initialization. As a result following warning happened:

    CC mm/migrate.o
    mm/migrate.c: In function 'sys_move_pages':
    mm/migrate.c:1001: warning: 'err' may be used uninitialized in this function

    More unfortunately, if find_vma() failed, kernel read uninitialized
    memory.

    Signed-off-by: KOSAKI Motohiro
    CC: Brice Goglin
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

11 Dec, 2008

1 commit

  • Since commit 2f007e74bb85b9fc4eab28524052161703300f1a, do_pages_stat()
    gets the page address from user-space and puts the corresponding status
    back while holding the mmap_sem for read. There is no need to hold
    mmap_sem there while some page-faults may occur.

    This patch adds a temporary address and status buffer so as to only
    hold mmap_sem while working on these kernel buffers. This is
    implemented by extracting do_pages_stat_array() out of do_pages_stat().

    Signed-off-by: Brice Goglin
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     

04 Dec, 2008

1 commit


20 Nov, 2008

1 commit

  • Page migration's writeout() has got understandably confused by the nasty
    AOP_WRITEPAGE_ACTIVATE case: as in normal success, a writepage() error has
    unlocked the page, so writeout() then needs to relock it.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

14 Nov, 2008

4 commits

  • Conflicts:
    security/keys/internal.h
    security/keys/process_keys.c
    security/keys/request_key.c

    Fixed conflicts above by using the non 'tsk' versions.

    Signed-off-by: James Morris

    James Morris
     
  • Use RCU to access another task's creds and to release a task's own creds.
    This means that it will be possible for the credentials of a task to be
    replaced without another task (a) requiring a full lock to read them, and (b)
    seeing deallocated memory.

    Signed-off-by: David Howells
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    David Howells
     
  • Separate the task security context from task_struct. At this point, the
    security data is temporarily embedded in the task_struct with two pointers
    pointing to it.

    Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in
    entry.S via asm-offsets.

    With comment fixes Signed-off-by: Marc Dionne

    Signed-off-by: David Howells
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    David Howells
     
  • Wrap access to task credentials so that they can be separated more easily from
    the task_struct during the introduction of COW creds.

    Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().

    Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more
    sense to use RCU directly rather than a convenient wrapper; these will be
    addressed by later patches.

    Signed-off-by: David Howells
    Reviewed-by: James Morris
    Acked-by: Serge Hallyn
    Cc: Al Viro
    Cc: linux-audit@redhat.com
    Cc: containers@lists.linux-foundation.org
    Cc: linux-mm@kvack.org
    Signed-off-by: James Morris

    David Howells
     

07 Nov, 2008

1 commit

  • Move the migrate_prep outside the mmap_sem for the following system calls

    1. sys_move_pages
    2. sys_migrate_pages
    3. sys_mbind()

    It really does not matter when we flush the lru. The system is free to
    add pages onto the lru even during migration which will make the page
    migration either skip the page (mbind, migrate_pages) or return a busy
    state (move_pages).

    Fixes this lockdep warning (and potential deadlock):

    Some VM place has
    mmap_sem -> kevent_wq via lru_add_drain_all()

    net/core/dev.c::dev_ioctl() has
    rtnl_lock -> mmap_sem (*) the ioctl has copy_from_user() and it can do page fault.

    linkwatch_event has
    kevent_wq -> rtnl_lock

    Signed-off-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    Reported-by: Heiko Carstens
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter