13 Aug, 2008

6 commits

  • Signed-off-by: MinChan Kim
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    MinChan Kim
     
  • [Andrew this should replace the previous version which did not check
    the returns from the region prepare for errors. This has been tested by
    us and Gerald and it looks good.

    Bah, while reviewing the locking based on your previous email I spotted
    that we need to check the return from the vma_needs_reservation call for
    allocation errors. Here is an updated patch to correct this. This passes
    testing here.]

    Signed-off-by: Andy Whitcroft
    Tested-by: Gerald Schaefer
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • In the normal case, hugetlbfs reserves hugepages at map time so that the
    pages exist for future faults. A struct file_region is used to track when
    reservations have been consumed and where. These file_regions are
    allocated as necessary with kmalloc() which can sleep with the
    mm->page_table_lock held. This is wrong and triggers may-sleep warning
    when PREEMPT is enabled.

    Updates to the underlying file_region are done in two phases. The first
    phase prepares the region for the change, allocating any necessary memory,
    without actually making the change. The second phase actually commits the
    change. This patch makes use of this by checking the reservations before
    the page_table_lock is taken; triggering any necessary allocations. This
    may then be safely repeated within the locks without any allocations being
    required.

    Credit to Mel Gorman for diagnosing this failure and initial versions of
    the patch.

    Signed-off-by: Andy Whitcroft
    Tested-by: Gerald Schaefer
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • Got an oops in mem_cgroup_shrink_usage() when testing loop over tmpfs:
    yes, of course, loop0 has no mm: other entry points check but this didn't.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • .. since a failed allocation is being (initially) handled gracefully, and
    panic()-ed upon failure explicitly in the function if retries with smaller
    sizes failed.

    Signed-off-by: Jan Beulich
    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Beulich
     
  • The s390 software large page emulation implements shared page tables by
    using page->index of the first tail page from a compound large page to
    store page table information. This is set up in arch_prepare_hugepage(),
    which is called from alloc_fresh_huge_page_node().

    A similar call to arch_prepare_hugepage() is missing for surplus large
    pages that are allocated in alloc_buddy_huge_page(), which breaks the
    software emulation mode for (surplus) large pages on s390. This patch
    adds the missing call to arch_prepare_hugepage(). It will have no effect
    on other architectures where arch_prepare_hugepage() is a nop.

    Also, use the correct order in the error path in alloc_fresh_huge_page_node().

    Acked-by: Martin Schwidefsky
    Signed-off-by: Gerald Schaefer
    Acked-by: Nick Piggin
    Acked-by: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     

12 Aug, 2008

4 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
    fix spinlock recursion in hvc_console
    stop_machine: remove unused variable
    modules: extend initcall_debug functionality to the module loader
    export virtio_rng.h
    lguest: use get_user_pages_fast() instead of get_user_pages()
    mm: Make generic weak get_user_pages_fast and EXPORT_GPL it
    lguest: don't set MAC address for guest unless specified

    Linus Torvalds
     
  • Out of line get_user_pages_fast fallback implementation, make it a weak
    symbol, get rid of CONFIG_HAVE_GET_USER_PAGES_FAST.

    Export the symbol to modules so lguest can use it.

    Signed-off-by: Nick Piggin
    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • …/git/tip/linux-2.6-tip

    * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    lockdep: fix debug_lock_alloc
    lockdep: increase MAX_LOCKDEP_KEYS
    generic-ipi: fix stack and rcu interaction bug in smp_call_function_mask()
    lockdep: fix overflow in the hlock shrinkage code
    lockdep: rename map_[acquire|release]() => lock_map_[acquire|release]()
    lockdep: handle chains involving classes defined in modules
    mm: fix mm_take_all_locks() locking order
    lockdep: annotate mm_take_all_locks()
    lockdep: spin_lock_nest_lock()
    lockdep: lock protection locks
    lockdep: map_acquire
    lockdep: shrink held_lock structure
    lockdep: re-annotate scheduler runqueues
    lockdep: lock_set_subclass - reset a held lock's subclass
    lockdep: change scheduler annotation
    debug_locks: set oops_in_progress if we will log messages.
    lockdep: fix combinatorial explosion in lock subgraph traversal

    Linus Torvalds
     
  • Ingo Molnar
     

11 Aug, 2008

2 commits

  • Lockdep spotted:

    =======================================================
    [ INFO: possible circular locking dependency detected ]
    2.6.27-rc1 #270
    -------------------------------------------------------
    qemu-kvm/2033 is trying to acquire lock:
    (&inode->i_data.i_mmap_lock){----}, at: [] mm_take_all_locks+0xc2/0xea

    but task is already holding lock:
    (&anon_vma->lock){----}, at: [] mm_take_all_locks+0x70/0xea

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&anon_vma->lock){----}:
    [] __lock_acquire+0x11be/0x14d2
    [] lock_acquire+0x5e/0x7a
    [] _spin_lock+0x3b/0x47
    [] vma_adjust+0x200/0x444
    [] split_vma+0x12f/0x146
    [] mprotect_fixup+0x13c/0x536
    [] sys_mprotect+0x1a9/0x21e
    [] system_call_fastpath+0x16/0x1b
    [] 0xffffffffffffffff

    -> #0 (&inode->i_data.i_mmap_lock){----}:
    [] __lock_acquire+0xedb/0x14d2
    [] lock_release_non_nested+0x1c2/0x219
    [] lock_release+0x127/0x14a
    [] _spin_unlock+0x1e/0x50
    [] mm_drop_all_locks+0x7f/0xb0
    [] do_mmu_notifier_register+0xe2/0x112
    [] mmu_notifier_register+0xe/0x10
    [] kvm_dev_ioctl+0x11e/0x287 [kvm]
    [] vfs_ioctl+0x2a/0x78
    [] do_vfs_ioctl+0x257/0x274
    [] sys_ioctl+0x55/0x78
    [] system_call_fastpath+0x16/0x1b
    [] 0xffffffffffffffff

    other info that might help us debug this:

    5 locks held by qemu-kvm/2033:
    #0: (&mm->mmap_sem){----}, at: [] do_mmu_notifier_register+0x55/0x112
    #1: (mm_all_locks_mutex){--..}, at: [] mm_take_all_locks+0x34/0xea
    #2: (&anon_vma->lock){----}, at: [] mm_take_all_locks+0x70/0xea
    #3: (&anon_vma->lock){----}, at: [] mm_take_all_locks+0x70/0xea
    #4: (&anon_vma->lock){----}, at: [] mm_take_all_locks+0x70/0xea

    stack backtrace:
    Pid: 2033, comm: qemu-kvm Not tainted 2.6.27-rc1 #270

    Call Trace:
    [] print_circular_bug_tail+0xb8/0xc3
    [] __lock_acquire+0xedb/0x14d2
    [] ? add_lock_to_list+0x7e/0xad
    [] ? mm_take_all_locks+0x70/0xea
    [] ? mm_take_all_locks+0x70/0xea
    [] lock_release_non_nested+0x1c2/0x219
    [] ? mm_take_all_locks+0xc2/0xea
    [] ? mm_take_all_locks+0xc2/0xea
    [] ? trace_hardirqs_on_caller+0x4d/0x115
    [] ? mm_drop_all_locks+0x7f/0xb0
    [] lock_release+0x127/0x14a
    [] _spin_unlock+0x1e/0x50
    [] mm_drop_all_locks+0x7f/0xb0
    [] do_mmu_notifier_register+0xe2/0x112
    [] mmu_notifier_register+0xe/0x10
    [] kvm_dev_ioctl+0x11e/0x287 [kvm]
    [] ? file_has_perm+0x83/0x8e
    [] vfs_ioctl+0x2a/0x78
    [] do_vfs_ioctl+0x257/0x274
    [] sys_ioctl+0x55/0x78
    [] system_call_fastpath+0x16/0x1b

    Which the locking hierarchy in mm/rmap.c confirms as valid.

    Fix this by first taking all the mapping->i_mmap_lock instances and then
    take all anon_vma->lock instances.

    Signed-off-by: Peter Zijlstra
    Acked-by: Hugh Dickins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The nesting is correct due to holding mmap_sem, use the new annotation
    to annotate this.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

10 Aug, 2008

1 commit


07 Aug, 2008

1 commit


06 Aug, 2008

2 commits

  • gcc 4.3.0 correctly emits the following warnings.
    When a vma covering addr is found, find_vma_prepare indeed returns without
    setting pprev, rb_link, and rb_parent.

    mm/mmap.c: In function `insert_vm_struct':
    mm/mmap.c:2085: warning: `rb_parent' may be used uninitialized in this function
    mm/mmap.c:2085: warning: `rb_link' may be used uninitialized in this function
    mm/mmap.c:2084: warning: `prev' may be used uninitialized in this function
    mm/mmap.c: In function `copy_vma':
    mm/mmap.c:2124: warning: `rb_parent' may be used uninitialized in this function
    mm/mmap.c:2124: warning: `rb_link' may be used uninitialized in this function
    mm/mmap.c:2123: warning: `prev' may be used uninitialized in this function
    mm/mmap.c: In function `do_brk':
    mm/mmap.c:1951: warning: `rb_parent' may be used uninitialized in this function
    mm/mmap.c:1951: warning: `rb_link' may be used uninitialized in this function
    mm/mmap.c:1949: warning: `prev' may be used uninitialized in this function
    mm/mmap.c: In function `mmap_region':
    mm/mmap.c:1092: warning: `rb_parent' may be used uninitialized in this function
    mm/mmap.c:1092: warning: `rb_link' may be used uninitialized in this function
    mm/mmap.c:1089: warning: `prev' may be used uninitialized in this function

    Hugh adds: in fact, none of find_vma_prepare's callers use those values
    when a vma is found to be already covering addr, it's either an error or
    an occasion to munmap and repeat. Okay, let's quieten the compiler (but I
    would prefer it if pprev, rb_link and rb_parent were meaningful in that
    case, rather than whatever's in them from descending the tree).

    Signed-off-by: Benny Halevy
    Signed-off-by: Hugh Dickins
    Cc: "Ryan Hope"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benny Halevy
     
  • gcc-3.2:

    mm/mm_init.c:77:1: directives may not be used inside a macro argument
    mm/mm_init.c:76:47: unterminated argument list invoking macro "mminit_dprintk"
    mm/mm_init.c: In function `mminit_verify_pageflags_layout':
    mm/mm_init.c:80: `mminit_dprintk' undeclared (first use in this function)
    mm/mm_init.c:80: (Each undeclared identifier is reported only once
    mm/mm_init.c:80: for each function it appears in.)
    mm/mm_init.c:80: syntax error before numeric constant

    Also fix a typo in a comment.

    Reported-by: Adrian Bunk
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

05 Aug, 2008

4 commits

  • This patch changes the static MIN_PARTIAL to a dynamic per-cache ->min_partial
    value that is calculated from object size. The bigger the object size, the more
    pages we keep on the partial list.

    I tested SLAB, SLUB, and SLUB with this patch on Jens Axboe's 'netio' example
    script of the fio benchmarking tool. The script stresses the networking
    subsystem which should also give a fairly good beating of kmalloc() et al.

    To run the test yourself, first clone the fio repository:

    git clone git://git.kernel.dk/fio.git

    and then run the following command n times on your machine:

    time ./fio examples/netio

    The results on my 2-way 64-bit x86 machine are as follows:

    [ the minimum, maximum, and average are captured from 50 individual runs ]

    real time (seconds)
    min max avg sd
    SLAB 22.76 23.38 22.98 0.17
    SLUB 22.80 25.78 23.46 0.72
    SLUB (dynamic) 22.74 23.54 23.00 0.20

    sys time (seconds)
    min max avg sd
    SLAB 6.90 8.28 7.70 0.28
    SLUB 7.42 16.95 8.89 2.28
    SLUB (dynamic) 7.17 8.64 7.73 0.29

    user time (seconds)
    min max avg sd
    SLAB 36.89 38.11 37.50 0.29
    SLUB 30.85 37.99 37.06 1.67
    SLUB (dynamic) 36.75 38.07 37.59 0.32

    As you can see from the above numbers, this patch brings SLUB to the same level
    as SLAB for this particular workload fixing a ~2% regression. I'd expect this
    change to help similar workloads that allocate a lot of objects that are close
    to the size of a page.

    Cc: Matthew Wilcox
    Cc: Andrew Morton
    Acked-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Pekka Enberg
     
  • Converting page lock to new locking bitops requires a change of page flag
    operation naming, so we might as well convert it to something nicer
    (!TestSetPageLocked_Lock => trylock_page, SetPageLocked => set_page_locked).

    This also facilitates lockdeping of page lock.

    Signed-off-by: Nick Piggin
    Acked-by: KOSAKI Motohiro
    Acked-by: Peter Zijlstra
    Acked-by: Andrew Morton
    Acked-by: Benjamin Herrenschmidt
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6: (29 commits)
    sh: enable maple_keyb in dreamcast_defconfig.
    SH2(A) cache update
    nommu: Provide vmalloc_exec().
    add addrespace definition for sh2a.
    sh: Kill off ARCH_SUPPORTS_AOUT and remnants of a.out support.
    sh: define GENERIC_HARDIRQS_NO__DO_IRQ.
    sh: define GENERIC_LOCKBREAK.
    sh: Save NUMA node data in vmcore for crash dumps.
    sh: module_alloc() should be using vmalloc_exec().
    sh: Fix up __bug_table handling in module loader.
    sh: Add documentation and integrate into docbook build.
    sh: Fix up broken kerneldoc comments.
    maple: Kill useless private_data pointer.
    maple: Clean up maple_driver_register/unregister routines.
    input: Clean up maple keyboard driver
    maple: allow removal and reinsertion of keyboard driver module
    sh: /proc/asids depends on MMU.
    arch/sh/boards/mach-se/7343/irq.c: removed duplicated #include
    arch/sh/boards/board-ap325rxa.c: removed duplicated #include
    sh/boards/Makefile typo fix
    ...

    Linus Torvalds
     
  • Halesh says:

    Please find the below testcase provide to test mlock.

    Test Case :
    ===========================

    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    int main(void)
    {
    int fd,ret, i = 0;
    char *addr, *addr1 = NULL;
    unsigned int page_size;
    struct rlimit rlim;

    if (0 != geteuid())
    {
    printf("Execute this pgm as root\n");
    exit(1);
    }

    /* create a file */
    if ((fd = open("mmap_test.c",O_RDWR|O_CREAT,0755)) == -1)
    {
    printf("cant create test file\n");
    exit(1);
    }

    page_size = sysconf(_SC_PAGE_SIZE);

    /* set the MEMLOCK limit */
    rlim.rlim_cur = 2000;
    rlim.rlim_max = 2000;

    if ((ret = setrlimit(RLIMIT_MEMLOCK,&rlim)) != 0)
    {
    printf("Cant change limit values\n");
    exit(1);
    }

    addr = 0;
    while (1)
    {
    /* map a page into memory each time*/
    if ((addr = (char *) mmap(addr,page_size, PROT_READ |
    PROT_WRITE,MAP_SHARED,fd,0)) == MAP_FAILED)
    {
    printf("cant do mmap on file\n");
    exit(1);
    }

    if (0 == i)
    addr1 = addr;
    i++;
    errno = 0;
    /* lock the mapped memory pagewise*/
    if ((ret = mlock((char *)addr, 1500)) == -1)
    {
    printf("errno value is %d\n", errno);
    printf("cant lock maped region\n");
    exit(1);
    }
    addr = addr + page_size;
    }
    }
    ======================================================

    This testcase results in an mlock() failure with errno 14 that is EFAULT,
    but it has nowhere been specified that mlock() will return EFAULT. When I
    tested the same on older kernels like 2.6.18, I got the correct result i.e
    errno 12 (ENOMEM).

    I think in source code mlock(2), setting errno ENOMEM has been missed in
    do_mlock() , on mlock_fixup() failure.

    SUSv3 requires the following behavior frmo mlock(2).

    [ENOMEM]
    Some or all of the address range specified by the addr and
    len arguments does not correspond to valid mapped pages
    in the address space of the process.

    [EAGAIN]
    Some or all of the memory identified by the operation could not
    be locked when the call was made.

    This rule isn't so nice and slighly strange. but many people think
    POSIX/SUS compliance is important.

    Reported-by: Halesh Sadashiv
    Tested-by: Halesh Sadashiv
    Signed-off-by: KOSAKI Motohiro
    Cc: [2.6.25.x, 2.6.26.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

04 Aug, 2008

1 commit

  • Now that SH has switched to vmalloc_exec() for PAGE_KERNEL_EXEC usage,
    it's apparent that nommu has no vmalloc_exec() definition of its own.
    Stub in the one from mm/vmalloc.c.

    Signed-off-by: Paul Mundt

    Paul Mundt
     

03 Aug, 2008

1 commit

  • Brian Wang reported that a FUSE filesystem exported through NFS could
    return I/O errors on read. This was traced to splice_direct_to_actor()
    returning a short or zero count when racing with page invalidation.

    However this is not FUSE or NFSD specific, other filesystems (notably
    NFS) also call invalidate_inode_pages2() to purge stale data from the
    cache.

    If this happens while such pages are sitting in a pipe buffer, then
    splice(2) from the pipe can return zero, and read(2) from the pipe can
    return ENODATA.

    The zero return is especially bad, since it implies end-of-file or
    disconnected pipe/socket, and is documented as such for splice. But
    returning an error for read() is also nasty, when in fact there was no
    error (data becoming stale is not an error).

    The same problems can be triggered by "hole punching" with
    madvise(MADV_REMOVE).

    Fix this by not clearing the PG_uptodate flag on truncation and
    invalidation.

    Signed-off-by: Miklos Szeredi
    Acked-by: Nick Piggin
    Cc: Andrew Morton
    Cc: Jens Axboe
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

02 Aug, 2008

3 commits

  • Delete 2 EXPORTs that were accidentally sent upstream.

    Signed-off-by: Jack Steiner
    Signed-off-by: Linus Torvalds

    Jack Steiner
     
  • Some platform decide whether they support huge pages at boot time. On
    these, such as powerpc, HPAGE_SHIFT is a variable, not a constant, and is
    set to 0 when there is no such support.

    The patches to introduce multiple huge pages support broke that causing
    the kernel to crash at boot time on machines such as POWER3 which lack
    support for multiple page sizes.

    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6: (28 commits)
    mm/hugetlb.c must #include
    video: Fix up hp6xx driver build regressions.
    sh: defconfig updates.
    sh: Kill off stray mach-rsk7203 reference.
    serial: sh-sci: Fix up SH7760/SH7780/SH7785 early printk regression.
    sh: Move out individual boards without mach groups.
    sh: Make sure AT_SYSINFO_EHDR is exposed to userspace in asm/auxvec.h.
    sh: Allow SH-3 and SH-5 to use common headers.
    sh: Provide common CPU headers, prune the SH-2 and SH-2A directories.
    sh/maple: clean maple bus code
    sh: More header path fixups for mach dir refactoring.
    sh: Move out the solution engine headers to arch/sh/include/mach-se/
    sh: I2C fix for AP325RXA and Migo-R
    sh: Shuffle the board directories in to mach groups.
    sh: dma-sh: Fix up dreamcast dma.h mach path.
    sh: Switch KBUILD_DEFCONFIG to shx3_defconfig.
    sh: Add ARCH_DEFCONFIG entries for sh and sh64.
    sh: Fix compile error of Solution Engine
    sh: Proper __put_user_asm() size mismatch fix.
    sh: Stub in a dummy ENTRY_OFFSET for uImage offset calculation.
    ...

    Linus Torvalds
     

01 Aug, 2008

1 commit

  • For anonymous pages without a swap cache backing the check in
    page_remove_rmap for the physical dirty bit in page_remove_rmap is
    unnecessary. The instructions that are used to check and reset the dirty
    bit are expensive. Removing the check noticably speeds up process exit.
    In addition the clearing of the dirty bit in __SetPageUptodate is
    pointless as well. With these two changes there is no storage key
    operation for an anonymous page anymore if it does not hit the swap
    space.

    The micro benchmark which repeatedly executes an empty shell script
    gets about 5% faster.

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     

31 Jul, 2008

9 commits


30 Jul, 2008

2 commits

  • This patch removes the obsolete and no longer used exports of ksize.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Pekka Enberg

    Adrian Bunk
     
  • This patch fixes the following build error on sh caused by
    commit aa888a74977a8f2120ae9332376e179c39a6b07d
    (hugetlb: support larger than MAX_ORDER):

    ...
    CC mm/hugetlb.o
    /home/bunk/linux/kernel-2.6/git/linux-2.6/mm/hugetlb.c: In function 'alloc_bootmem_huge_page':
    /home/bunk/linux/kernel-2.6/git/linux-2.6/mm/hugetlb.c:958: error: implicit declaration of function 'virt_to_phys'
    make[2]: *** [mm/hugetlb.o] Error 1

    Reported-by: Adrian Bunk
    Signed-off-by: Adrian Bunk
    Signed-off-by: Paul Mundt

    Adrian Bunk
     

29 Jul, 2008

3 commits

  • When we read some part of a file through pagecache, if there is a
    pagecache of corresponding index but this page is not uptodate, read IO
    is issued and this page will be uptodate.

    I think this is good for pagesize == blocksize environment but there is
    room for improvement on pagesize != blocksize environment. Because in
    this case a page can have multiple buffers and even if a page is not
    uptodate, some buffers can be uptodate.

    So I suggest that when all buffers which correspond to a part of a file
    that we want to read are uptodate, use this pagecache and copy data from
    this pagecache to user buffer even if a page is not uptodate. This can
    reduce read IO and improve system throughput.

    I wrote a benchmark program and got result number with this program.

    This benchmark do:

    1: mount and open a test file.

    2: create a 512MB file.

    3: close a file and umount.

    4: mount and again open a test file.

    5: pwrite randomly 300000 times on a test file. offset is aligned
    by IO size(1024bytes).

    6: measure time of preading randomly 100000 times on a test file.

    The result was:
    2.6.26
    330 sec

    2.6.26-patched
    226 sec

    Arch:i386
    Filesystem:ext3
    Blocksize:1024 bytes
    Memory: 1GB

    On ext3/4, a file is written through buffer/block. So random read/write
    mixed workloads or random read after random write workloads are optimized
    with this patch under pagesize != blocksize environment. This test result
    showed this.

    The benchmark program is as follows:

    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    #define LEN 1024
    #define LOOP 1024*512 /* 512MB */

    main(void)
    {
    unsigned long i, offset, filesize;
    int fd;
    char buf[LEN];
    time_t t1, t2;

    if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
    perror("cannot mount\n");
    exit(1);
    }
    memset(buf, 0, LEN);
    fd = open("/root/test1/testfile", O_CREAT|O_RDWR|O_TRUNC);
    if (fd < 0) {
    perror("cannot open file\n");
    exit(1);
    }
    for (i = 0; i < LOOP; i++)
    write(fd, buf, LEN);
    close(fd);
    if (umount("/root/test1/") < 0) {
    perror("cannot umount\n");
    exit(1);
    }
    if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
    perror("cannot mount\n");
    exit(1);
    }
    fd = open("/root/test1/testfile", O_RDWR);
    if (fd < 0) {
    perror("cannot open file\n");
    exit(1);
    }

    filesize = LEN * LOOP;
    for (i = 0; i < 300000; i++){
    offset = (random() % filesize) & (~(LEN - 1));
    pwrite(fd, buf, LEN, offset);
    }
    printf("start test\n");
    time(&t1);
    for (i = 0; i < 100000; i++){
    offset = (random() % filesize) & (~(LEN - 1));
    pread(fd, buf, LEN, offset);
    }
    time(&t2);
    printf("%ld sec\n", t2-t1);
    close(fd);
    if (umount("/root/test1/") < 0) {
    perror("cannot umount\n");
    exit(1);
    }
    }

    Signed-off-by: Hisashi Hifumi
    Cc: Nick Piggin
    Cc: Christoph Hellwig
    Cc: Jan Kara
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hisashi Hifumi
     
  • This patch fixes the following build error on sh caused by commit
    aa888a74977a8f2120ae9332376e179c39a6b07d ("hugetlb: support larger than
    MAX_ORDER"):

    mm/hugetlb.c: In function 'alloc_bootmem_huge_page':
    mm/hugetlb.c:958: error: implicit declaration of function 'virt_to_phys'

    Signed-off-by: Adrian Bunk
    Cc: Hirokazu Takata
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to pages.
    There are secondary MMUs (with secondary sptes and secondary tlbs) too.
    sptes in the kvm case are shadow pagetables, but when I say spte in
    mmu-notifier context, I mean "secondary pte". In GRU case there's no
    actual secondary pte and there's only a secondary tlb because the GRU
    secondary MMU has no knowledge about sptes and every secondary tlb miss
    event in the MMU always generates a page fault that has to be resolved by
    the CPU (this is not the case of KVM where the a secondary tlb miss will
    walk sptes in hardware and it will refill the secondary tlb transparently
    to software if the corresponding spte is present). The same way
    zap_page_range has to invalidate the pte before freeing the page, the spte
    (and secondary tlb) must also be invalidated before any page is freed and
    reused.

    Currently we take a page_count pin on every page mapped by sptes, but that
    means the pages can't be swapped whenever they're mapped by any spte
    because they're part of the guest working set. Furthermore a spte unmap
    event can immediately lead to a page to be freed when the pin is released
    (so requiring the same complex and relatively slow tlb_gather smp safe
    logic we have in zap_page_range and that can be avoided completely if the
    spte unmap event doesn't require an unpin of the page previously mapped in
    the secondary MMU).

    The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
    when the VM is swapping or freeing or doing anything on the primary MMU so
    that the secondary MMU code can drop sptes before the pages are freed,
    avoiding all page pinning and allowing 100% reliable swapping of guest
    physical address space. Furthermore it avoids the code that teardown the
    mappings of the secondary MMU, to implement a logic like tlb_gather in
    zap_page_range that would require many IPI to flush other cpu tlbs, for
    each fixed number of spte unmapped.

    To make an example: if what happens on the primary MMU is a protection
    downgrade (from writeable to wrprotect) the secondary MMU mappings will be
    invalidated, and the next secondary-mmu-page-fault will call
    get_user_pages and trigger a do_wp_page through get_user_pages if it
    called get_user_pages with write=1, and it'll re-establishing an updated
    spte or secondary-tlb-mapping on the copied page. Or it will setup a
    readonly spte or readonly tlb mapping if it's a guest-read, if it calls
    get_user_pages with write=0. This is just an example.

    This allows to map any page pointed by any pte (and in turn visible in the
    primary CPU MMU), into a secondary MMU (be it a pure tlb like GRU, or an
    full MMU with both sptes and secondary-tlb like the shadow-pagetable layer
    with kvm), or a remote DMA in software like XPMEM (hence needing of
    schedule in XPMEM code to send the invalidate to the remote node, while no
    need to schedule in kvm/gru as it's an immediate event like invalidating
    primary-mmu pte).

    At least for KVM without this patch it's impossible to swap guests
    reliably. And having this feature and removing the page pin allows
    several other optimizations that simplify life considerably.

    Dependencies:

    1) mm_take_all_locks() to register the mmu notifier when the whole VM
    isn't doing anything with "mm". This allows mmu notifier users to keep
    track if the VM is in the middle of the invalidate_range_begin/end
    critical section with an atomic counter incraese in range_begin and
    decreased in range_end. No secondary MMU page fault is allowed to map
    any spte or secondary tlb reference, while the VM is in the middle of
    range_begin/end as any page returned by get_user_pages in that critical
    section could later immediately be freed without any further
    ->invalidate_page notification (invalidate_range_begin/end works on
    ranges and ->invalidate_page isn't called immediately before freeing
    the page). To stop all page freeing and pagetable overwrites the
    mmap_sem must be taken in write mode and all other anon_vma/i_mmap
    locks must be taken too.

    2) It'd be a waste to add branches in the VM if nobody could possibly
    run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only enabled if
    CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of
    mmu notifiers, but this already allows to compile a KVM external module
    against a kernel with mmu notifiers enabled and from the next pull from
    kvm.git we'll start using them. And GRU/XPMEM will also be able to
    continue the development by enabling KVM=m in their config, until they
    submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can
    also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n).
    This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM
    are all =n.

    The mmu_notifier_register call can fail because mm_take_all_locks may be
    interrupted by a signal and return -EINTR. Because mmu_notifier_reigster
    is used when a driver startup, a failure can be gracefully handled. Here
    an example of the change applied to kvm to register the mmu notifiers.
    Usually when a driver startups other allocations are required anyway and
    -ENOMEM failure paths exists already.

    struct kvm *kvm_arch_create_vm(void)
    {
    struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
    + int err;

    if (!kvm)
    return ERR_PTR(-ENOMEM);

    INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);

    + kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
    + err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
    + if (err) {
    + kfree(kvm);
    + return ERR_PTR(err);
    + }
    +
    return kvm;
    }

    mmu_notifier_unregister returns void and it's reliable.

    The patch also adds a few needed but missing includes that would prevent
    kernel to compile after these changes on non-x86 archs (x86 didn't need
    them by luck).

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix mm/filemap_xip.c build]
    [akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Nick Piggin
    Signed-off-by: Christoph Lameter
    Cc: Jack Steiner
    Cc: Robin Holt
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Kanoj Sarcar
    Cc: Roland Dreier
    Cc: Steve Wise
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Cc: Rusty Russell
    Cc: Anthony Liguori
    Cc: Chris Wright
    Cc: Marcelo Tosatti
    Cc: Eric Dumazet
    Cc: "Paul E. McKenney"
    Cc: Izik Eidus
    Cc: Anthony Liguori
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli