02 Aug, 2008

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6: (28 commits)
    mm/hugetlb.c must #include <asm/io.h>
    video: Fix up hp6xx driver build regressions.
    sh: defconfig updates.
    sh: Kill off stray mach-rsk7203 reference.
    serial: sh-sci: Fix up SH7760/SH7780/SH7785 early printk regression.
    sh: Move out individual boards without mach groups.
    sh: Make sure AT_SYSINFO_EHDR is exposed to userspace in asm/auxvec.h.
    sh: Allow SH-3 and SH-5 to use common headers.
    sh: Provide common CPU headers, prune the SH-2 and SH-2A directories.
    sh/maple: clean maple bus code
    sh: More header path fixups for mach dir refactoring.
    sh: Move out the solution engine headers to arch/sh/include/mach-se/
    sh: I2C fix for AP325RXA and Migo-R
    sh: Shuffle the board directories in to mach groups.
    sh: dma-sh: Fix up dreamcast dma.h mach path.
    sh: Switch KBUILD_DEFCONFIG to shx3_defconfig.
    sh: Add ARCH_DEFCONFIG entries for sh and sh64.
    sh: Fix compile error of Solution Engine
    sh: Proper __put_user_asm() size mismatch fix.
    sh: Stub in a dummy ENTRY_OFFSET for uImage offset calculation.
    ...

    Linus Torvalds
     

01 Aug, 2008

1 commit

  • For anonymous pages without a swap cache backing, the check for the
    physical dirty bit in page_remove_rmap is unnecessary. The
    instructions used to check and reset the dirty bit are expensive, and
    removing the check noticeably speeds up process exit. In addition,
    the clearing of the dirty bit in __SetPageUptodate is pointless as
    well. With these two changes there is no storage key operation for an
    anonymous page anymore if it does not hit swap space.
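
    A hedged sketch of the shape of the change in page_remove_rmap()
    (illustrative only; the guard condition is inferred from the
    description above, not copied from the patch):

    if ((!PageAnon(page) || PageSwapCache(page)) &&
        page_test_dirty(page)) {
        /* only file-backed or swap-cache-backed pages still need the
         * expensive storage key inspection */
        page_clear_dirty(page);
        set_page_dirty(page);
    }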

    The micro benchmark which repeatedly executes an empty shell script
    gets about 5% faster.

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     

31 Jul, 2008

9 commits


30 Jul, 2008

1 commit

  • This patch fixes the following build error on sh caused by
    commit aa888a74977a8f2120ae9332376e179c39a6b07d
    (hugetlb: support larger than MAX_ORDER):

    ...
    CC mm/hugetlb.o
    /home/bunk/linux/kernel-2.6/git/linux-2.6/mm/hugetlb.c: In function 'alloc_bootmem_huge_page':
    /home/bunk/linux/kernel-2.6/git/linux-2.6/mm/hugetlb.c:958: error: implicit declaration of function 'virt_to_phys'
    make[2]: *** [mm/hugetlb.o] Error 1

    Reported-by: Adrian Bunk
    Signed-off-by: Adrian Bunk
    Signed-off-by: Paul Mundt

    Adrian Bunk
     

29 Jul, 2008

5 commits

  • When we read part of a file through the pagecache and there is a
    pagecache page at the corresponding index but that page is not
    uptodate, read IO is issued to bring the whole page uptodate.

    This is fine when pagesize == blocksize, but there is room for
    improvement when pagesize != blocksize, because a page can then have
    multiple buffers, and even if the page as a whole is not uptodate,
    some of its buffers can be.

    So I suggest that when all the buffers corresponding to the part of
    the file we want to read are uptodate, we use the pagecache and copy
    data from it to the user buffer even though the page itself is not
    uptodate. This can reduce read IO and improve system throughput.
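
    As a rough sketch of the idea (modeled on the buffer_head walk used
    in fs/buffer.c; the helper name and details here are illustrative,
    not the exact hunk the patch adds):

    static int buffers_uptodate_in_range(struct page *page,
                                         unsigned from, unsigned to)
    {
        struct buffer_head *bh, *head;
        unsigned block_start = 0, block_end;

        if (!page_has_buffers(page))
            return 0;

        bh = head = page_buffers(page);
        do {
            block_end = block_start + bh->b_size;
            /* only buffers overlapping [from, to) matter */
            if (block_end > from && block_start < to &&
                !buffer_uptodate(bh))
                return 0;
            block_start = block_end;
            bh = bh->b_this_page;
        } while (bh != head);
        return 1;
    }

    When this returns 1 for the requested window, the read can be
    satisfied from the pagecache without issuing IO, even though
    PageUptodate(page) is false.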

    I wrote a benchmark program to measure this. The benchmark does the
    following:

    1: mount and open a test file.

    2: create a 512MB file.

    3: close the file and umount.

    4: mount again and reopen the test file.

    5: pwrite randomly 300000 times to the test file; offsets are aligned
    to the IO size (1024 bytes).

    6: measure the time of 100000 random preads on the test file.

    The result was:

    2.6.26:          330 sec
    2.6.26-patched:  226 sec

    Arch:       i386
    Filesystem: ext3
    Blocksize:  1024 bytes
    Memory:     1GB

    On ext3/4, files are written through buffers/blocks, so mixed random
    read/write workloads and random reads after random writes benefit
    from this patch when pagesize != blocksize. This test result shows
    that.

    The benchmark program is as follows:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <time.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/mount.h>

    #define LEN 1024
    #define LOOP (1024*512) /* 512MB */

    int main(void)
    {
        unsigned long i, offset, filesize;
        int fd;
        char buf[LEN];
        time_t t1, t2;

        /* 1: mount and open a test file */
        if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
            perror("cannot mount");
            exit(1);
        }
        memset(buf, 0, LEN);
        /* O_CREAT requires the mode argument */
        fd = open("/root/test1/testfile", O_CREAT|O_RDWR|O_TRUNC, 0644);
        if (fd < 0) {
            perror("cannot open file");
            exit(1);
        }
        /* 2: create a 512MB file */
        for (i = 0; i < LOOP; i++)
            write(fd, buf, LEN);
        /* 3: close the file and umount */
        close(fd);
        if (umount("/root/test1/") < 0) {
            perror("cannot umount");
            exit(1);
        }
        /* 4: mount again and reopen the test file */
        if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
            perror("cannot mount");
            exit(1);
        }
        fd = open("/root/test1/testfile", O_RDWR);
        if (fd < 0) {
            perror("cannot open file");
            exit(1);
        }

        /* 5: 300000 random pwrites, offsets aligned to the IO size */
        filesize = LEN * LOOP;
        for (i = 0; i < 300000; i++) {
            offset = (random() % filesize) & ~(LEN - 1);
            pwrite(fd, buf, LEN, offset);
        }
        /* 6: time 100000 random preads */
        printf("start test\n");
        time(&t1);
        for (i = 0; i < 100000; i++) {
            offset = (random() % filesize) & ~(LEN - 1);
            pread(fd, buf, LEN, offset);
        }
        time(&t2);
        printf("%ld sec\n", t2 - t1);
        close(fd);
        if (umount("/root/test1/") < 0) {
            perror("cannot umount");
            exit(1);
        }
        return 0;
    }

    Signed-off-by: Hisashi Hifumi
    Cc: Nick Piggin
    Cc: Christoph Hellwig
    Cc: Jan Kara
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hisashi Hifumi
     
  • This patch fixes the following build error on sh caused by commit
    aa888a74977a8f2120ae9332376e179c39a6b07d ("hugetlb: support larger than
    MAX_ORDER"):

    mm/hugetlb.c: In function 'alloc_bootmem_huge_page':
    mm/hugetlb.c:958: error: implicit declaration of function 'virt_to_phys'

    Signed-off-by: Adrian Bunk
    Cc: Hirokazu Takata
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to
    pages. There are secondary MMUs (with secondary sptes and secondary
    tlbs) too. sptes in the kvm case are shadow pagetables, but when I say
    spte in mmu-notifier context, I mean "secondary pte". In the GRU case
    there's no actual secondary pte and there's only a secondary tlb,
    because the GRU secondary MMU has no knowledge of sptes and every
    secondary tlb miss event in the MMU always generates a page fault that
    has to be resolved by the CPU (this is not the case for KVM, where a
    secondary tlb miss will walk sptes in hardware and will refill the
    secondary tlb transparently to software if the corresponding spte is
    present). In the same way zap_page_range has to invalidate the pte
    before freeing the page, the spte (and secondary tlb) must also be
    invalidated before any page is freed and reused.

    Currently we take a page_count pin on every page mapped by sptes, but
    that means the pages can't be swapped while they're mapped by any
    spte, because they're treated as part of the guest working set.
    Furthermore an spte unmap event can immediately lead to a page being
    freed when the pin is released (requiring the same complex and
    relatively slow tlb_gather SMP-safe logic we have in zap_page_range,
    which can be avoided completely if the spte unmap event doesn't
    require an unpin of the page previously mapped in the secondary MMU).

    The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and
    know when the VM is swapping or freeing or doing anything on the
    primary MMU, so that the secondary MMU code can drop sptes before the
    pages are freed, avoiding all page pinning and allowing 100% reliable
    swapping of guest physical address space. Furthermore it spares the
    code that tears down the secondary MMU mappings from implementing
    tlb_gather-like logic as in zap_page_range, which would require many
    IPIs to flush other cpu tlbs for each fixed number of sptes unmapped.

    To give an example: if what happens on the primary MMU is a protection
    downgrade (from writeable to wrprotect), the secondary MMU mappings
    will be invalidated, and the next secondary-mmu page fault will call
    get_user_pages; this triggers a do_wp_page if get_user_pages was
    called with write=1, and re-establishes an updated spte or
    secondary-tlb mapping on the copied page. Or it will set up a readonly
    spte or readonly tlb mapping if it's a guest read, i.e. if it calls
    get_user_pages with write=0. This is just an example.

    This allows mapping any page pointed to by any pte (and in turn
    visible in the primary CPU MMU) into a secondary MMU, be it a pure tlb
    like GRU, a full MMU with both sptes and a secondary tlb like kvm's
    shadow-pagetable layer, or remote DMA in software like XPMEM (hence
    the need to schedule in XPMEM code to send the invalidate to the
    remote node, while no scheduling is needed in kvm/gru as it's an
    immediate event like invalidating a primary-mmu pte).

    At least for KVM, it's impossible to swap guests reliably without
    this patch. And having this feature and removing the page pin allows
    several other optimizations that simplify life considerably.

    Dependencies:

    1) mm_take_all_locks() to register the mmu notifier when the whole VM
    isn't doing anything with "mm". This allows mmu notifier users to keep
    track of whether the VM is in the middle of the
    invalidate_range_begin/end critical section with an atomic counter
    increased in range_begin and decreased in range_end. No secondary MMU
    page fault is allowed to map any spte or secondary tlb reference while
    the VM is in the middle of range_begin/end, as any page returned by
    get_user_pages in that critical section could later be freed
    immediately without any further ->invalidate_page notification
    (invalidate_range_begin/end works on ranges and ->invalidate_page
    isn't called immediately before freeing the page). To stop all page
    freeing and pagetable overwrites, the mmap_sem must be taken in write
    mode and all other anon_vma/i_mmap locks must be taken too.

    2) It'd be a waste to add branches in the VM if nobody could possibly
    run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be
    enabled if CONFIG_KVM=m/y. In the current kernel kvm won't yet take
    advantage of mmu notifiers, but this already allows compiling a KVM
    external module against a kernel with mmu notifiers enabled, and from
    the next pull from kvm.git we'll start using them. And GRU/XPMEM will
    also be able to continue development by enabling KVM=m in their
    config, until they submit all GRU/XPMEM GPLv2 code to the mainline
    kernel. Then they can also enable MMU_NOTIFIERS in the same way KVM
    does it (even if KVM=n). This guarantees nobody selects
    MMU_NOTIFIER=y if KVM, GRU and XPMEM are all =n.

    The mmu_notifier_register call can fail because mm_take_all_locks may
    be interrupted by a signal and return -EINTR. Because
    mmu_notifier_register is used at driver startup, a failure can be
    gracefully handled. Here is an example of the change applied to kvm
    to register the mmu notifiers. Usually when a driver starts up, other
    allocations are required anyway and -ENOMEM failure paths exist
    already.

    struct kvm *kvm_arch_create_vm(void)
    {
        struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
    +   int err;

        if (!kvm)
            return ERR_PTR(-ENOMEM);

        INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);

    +   kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
    +   err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
    +   if (err) {
    +       kfree(kvm);
    +       return ERR_PTR(err);
    +   }
    +
        return kvm;
    }

    mmu_notifier_unregister returns void and it's reliable.
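
    For context, a minimal sketch of the ops table such a registration
    points at; the callbacks follow the mmu-notifier hooks described
    above, but the bodies here are placeholders, not kvm's real code:

    static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
                                                 struct mm_struct *mm,
                                                 unsigned long address)
    {
        /* drop the spte / flush the secondary tlb for this address */
    }

    static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
                                                        struct mm_struct *mm,
                                                        unsigned long start,
                                                        unsigned long end)
    {
        /* increase the in-flight counter (dependency 1 above); no
         * secondary page fault may map sptes until range_end runs */
    }

    static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
                                                      struct mm_struct *mm,
                                                      unsigned long start,
                                                      unsigned long end)
    {
        /* decrease the counter; secondary faults may map sptes again */
    }

    static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
        .invalidate_page        = kvm_mmu_notifier_invalidate_page,
        .invalidate_range_start = kvm_mmu_notifier_invalidate_range_start,
        .invalidate_range_end   = kvm_mmu_notifier_invalidate_range_end,
    };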

    The patch also adds a few needed but missing includes that would
    otherwise prevent the kernel from compiling on non-x86 archs after
    these changes (x86 didn't need them by luck).

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix mm/filemap_xip.c build]
    [akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Nick Piggin
    Signed-off-by: Christoph Lameter
    Cc: Jack Steiner
    Cc: Robin Holt
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Kanoj Sarcar
    Cc: Roland Dreier
    Cc: Steve Wise
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Cc: Rusty Russell
    Cc: Anthony Liguori
    Cc: Chris Wright
    Cc: Marcelo Tosatti
    Cc: Eric Dumazet
    Cc: "Paul E. McKenney"
    Cc: Izik Eidus
    Cc: Anthony Liguori
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • mm_take_all_locks holds off reclaim from an entire mm_struct. This allows
    mmu notifiers to register into the mm at any time with the guarantee that
    no mmu operation is in progress on the mm.

    This operation locks against the VM for all pte/vma/mm related operations
    that could ever happen on a certain mm. This includes vmtruncate,
    try_to_unmap, and all page faults.

    The caller must take the mmap_sem in write mode before calling
    mm_take_all_locks(). The caller isn't allowed to release the mmap_sem
    until mm_drop_all_locks() returns.

    mmap_sem in write mode is required in order to block all operations
    that could modify pagetables and free pages without needing to alter
    the vma layout (for example populate_range() with nonlinear vmas).
    It's also needed in write mode to prevent new anon_vmas from being
    associated with existing vmas.

    A single task can't take more than one mm_take_all_locks() in a row or it
    would deadlock.

    mm_take_all_locks() and mm_drop_all_locks() are expensive operations
    that may have to take thousands of locks.

    mm_take_all_locks() can fail if it's interrupted by signals.
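
    Putting those rules together, a registration-time caller looks
    roughly like this (a sketch only; the function name is hypothetical):

    int register_against_mm(struct mm_struct *mm)
    {
        int ret;

        down_write(&mm->mmap_sem);      /* must be held in write mode */
        ret = mm_take_all_locks(mm);    /* may return -EINTR on a signal */
        if (!ret) {
            /* no vmtruncate/rmap/page fault can run on this mm here;
             * safe to register an mmu notifier, then release */
            mm_drop_all_locks(mm);
        }
        up_write(&mm->mmap_sem);        /* only after mm_drop_all_locks() */
        return ret;
    }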

    When mmu_notifier_register returns, we must be sure that the driver
    is notified if some task is in the middle of a vmtruncate for the
    'mm' where the mmu notifier was registered
    (mmu_notifier_invalidate_range_start/end is run around the
    vmtruncation, but mmu_notifier_register can run after
    mmu_notifier_invalidate_range_start and before
    mmu_notifier_invalidate_range_end). The same problem exists for the
    rmap paths. And we have to remove page pinning to avoid replicating
    the tlb_gather logic inside KVM (and GRU doesn't work well with page
    pinning regardless of needing tlb_gather), so without
    mm_take_all_locks, when vmtruncate frees the page, kvm would have no
    way to notice that it mapped into sptes a page that is going into the
    freelist without a chance of any further mmu_notifier notification.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Andrea Arcangeli
    Acked-by: Linus Torvalds
    Cc: Christoph Lameter
    Cc: Jack Steiner
    Cc: Robin Holt
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Kanoj Sarcar
    Cc: Roland Dreier
    Cc: Steve Wise
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Cc: Rusty Russell
    Cc: Anthony Liguori
    Cc: Chris Wright
    Cc: Marcelo Tosatti
    Cc: Eric Dumazet
    Cc: "Paul E. McKenney"
    Cc: Izik Eidus
    Cc: Anthony Liguori
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • SuSE's insserve initscript ordering program hits a kernel BUG at
    mm/shmem.c:814 on 2.6.26. It uses posix_fadvise on directories, and
    the shmem_readpage method added in 2.6.23 lets POSIX_FADV_WILLNEED
    allocate useless pages to a tmpfs directory, incrementing the
    i_blocks count but never decrementing it.

    Fix this by assigning shmem_aops (pointing to readpage and writepage and
    set_page_dirty) only when it's needed, on a regular file or a long symlink.
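
    A hedged sketch of the shape of that fix in shmem's inode setup (the
    actual hunk may differ; long symlinks are handled separately in
    shmem_symlink()):

    switch (mode & S_IFMT) {
    default:
        /* directories, devices, sockets: no shmem_aops, so
         * POSIX_FADV_WILLNEED can no longer allocate pages here */
        break;
    case S_IFREG:
        inode->i_mapping->a_ops = &shmem_aops;
        break;
    }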

    Many thanks to Kel for an outstanding bug report and the steps to
    reproduce it.

    Reported-by: Kel Modderman
    Tested-by: Kel Modderman
    Signed-off-by: Hugh Dickins
    Cc: [2.6.25.x, 2.6.26.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

28 Jul, 2008

1 commit


27 Jul, 2008

21 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (39 commits)
    [PATCH] fix RLIM_NOFILE handling
    [PATCH] get rid of corner case in dup3() entirely
    [PATCH] remove remaining namei_{32,64}.h crap
    [PATCH] get rid of indirect users of namei.h
    [PATCH] get rid of __user_path_lookup_open
    [PATCH] f_count may wrap around
    [PATCH] dup3 fix
    [PATCH] don't pass nameidata to __ncp_lookup_validate()
    [PATCH] don't pass nameidata to gfs2_lookupi()
    [PATCH] new (local) helper: user_path_parent()
    [PATCH] sanitize __user_walk_fd() et.al.
    [PATCH] preparation to __user_walk_fd cleanup
    [PATCH] kill nameidata passing to permission(), rename to inode_permission()
    [PATCH] take noexec checks to very few callers that care
    Re: [PATCH 3/6] vfs: open_exec cleanup
    [patch 4/4] vfs: immutable inode checking cleanup
    [patch 3/4] fat: dont call notify_change
    [patch 2/4] vfs: utimes cleanup
    [patch 1/4] vfs: utimes: move owner check into inode_change_ok()
    [PATCH] vfs: use kstrdup() and check failing allocation
    ...

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
    netns: fix ip_rt_frag_needed rt_is_expired
    netfilter: nf_conntrack_extend: avoid unnecessary "ct->ext" dereferences
    netfilter: fix double-free and use-after free
    netfilter: arptables in netns for real
    netfilter: ip{,6}tables_security: fix future section mismatch
    selinux: use nf_register_hooks()
    netfilter: ebtables: use nf_register_hooks()
    Revert "pkt_sched: sch_sfq: dump a real number of flows"
    qeth: use dev->ml_priv instead of dev->priv
    syncookies: Make sure ECN is disabled
    net: drop unused BUG_TRAP()
    net: convert BUG_TRAP to generic WARN_ON
    drivers/net: convert BUG_TRAP to generic WARN_ON

    Linus Torvalds
     
  • mm/util.c: In function 'arch_pick_mmap_layout':
    mm/util.c:144: error: dereferencing pointer to incomplete type
    mm/util.c:145: error: 'arch_get_unmapped_area' undeclared (first use in this function)
    mm/util.c:145: error: (Each undeclared identifier is reported only once
    mm/util.c:145: error: for each function it appears in.)
    mm/util.c:146: error: 'arch_unmap_area' undeclared (first use in this function)

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • All calls to remove_suid() are made with a file pointer, because
    (similarly to file_update_time) it is called when the file is written.

    Clean up callers by passing in a file instead of a dentry.
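
    The resulting call-site change, sketched (the file-based helper was
    merged as file_remove_suid(); err is illustrative):

    /* before: callers had to dig the dentry out of the file */
    err = remove_suid(file->f_path.dentry);

    /* after: pass the file itself, mirroring file_update_time() */
    err = file_remove_suid(file);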

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • * kill nameidata * argument; map the 3 bits in ->flags anybody cares
    about to new MAY_... ones and pass with the mask.
    * kill redundant gfs2_iop_permission()
    * sanitize ecryptfs_permission()
    * fix remaining places where ->permission() instances might barf on new
    MAY_... found in mask.

    The obvious next target in that direction is permission(9).

    Folded in: a fix for nfs_permission() breakage from Miklos Szeredi.

    Signed-off-by: Al Viro

    Al Viro
     
  • As suggested by Patrick McHardy, introduce a __krealloc() that
    doesn't free the original buffer, to fix a double-free and
    use-after-free bug I introduced in netfilter code that uses RCU.
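
    The RCU-safe pattern this enables, sketched (ct->ext is the netfilter
    field in question; the surrounding names are illustrative):

    /* grow the extension area without freeing the old buffer, so
     * concurrent RCU readers of ct->ext still see valid memory */
    new = __krealloc(old, newlen, gfp);
    if (!new)
        return NULL;
    if (new != old) {
        rcu_assign_pointer(ct->ext, new);   /* publish the new buffer */
        call_rcu(&old->rcu, free_old_ext);  /* free old after a grace period */
    }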

    Reported-by: Patrick McHardy
    Signed-off-by: Pekka Enberg
    Tested-by: Dieter Ries
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Pekka Enberg
     
  • This patch makes the following needlessly global code static:
    - swap_lock
    - nr_swapfiles
    - struct swap_list

    Signed-off-by: Adrian Bunk
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • This patch makes the needlessly global print_bad_pte() static.

    Signed-off-by: Adrian Bunk
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • This patch makes the following needlessly global functions static:
    - percpu_depopulate()
    - __percpu_depopulate_mask()
    - percpu_populate()
    - __percpu_populate_mask()

    Signed-off-by: Adrian Bunk
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • This patch makes the needlessly global sparse_early_mem_map_alloc()
    static.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Every arch implements its own show_mem() function. Most of them share
    quite a bit of code, and some are completely identical.

    This series implements a generic version of this function and migrates
    almost all architectures to it.

    This patch:

    Most show_mem() implementations calculate the number of pages in the
    swapcache every time. Move the output to a more appropriate place and
    use the already-available total_swapcache_pages variable.

    Signed-off-by: Johannes Weiner
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Haavard Skinnemoen
    Cc: Bryan Wu
    Cc: Chris Zankel
    Cc: Ingo Molnar
    Cc: Jeff Dike
    Cc: David S. Miller
    Cc: Paul Mundt
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: David Howells
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Yoshinori Sato
    Cc: Ralf Baechle
    Cc: Greg Ungerer
    Cc: Geert Uytterhoeven
    Cc: Roman Zippel
    Cc: Hirokazu Takata
    Cc: Mikael Starvik
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • This adds tracehook_expect_breakpoints() as a formal hook for the
    nommu code to use for its "Is text-poking likely?" check at mmap
    time. This names the actual semantics the code means to test, and
    documents it.

    Signed-off-by: Roland McGrath
    Cc: Oleg Nesterov
    Reviewed-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
  • Use WARN() instead of a printk+WARN_ON() pair; this way the message becomes
    part of the warning section for better reporting/collection.
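
    The transformation, sketched:

    /* before: two steps; the message is not captured with the warning */
    if (bad_thing) {
        printk(KERN_ERR "bad thing happened: %d\n", err);
        WARN_ON(1);
    }

    /* after: one step; the message becomes part of the warning section */
    WARN(bad_thing, "bad thing happened: %d\n", err);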

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     
  • The kmem cache passed to the constructor is only needed for
    constructors that are themselves multiplexers. Nobody uses this
    "feature", nor does anybody use the passed kmem cache in a
    non-trivial way, so pass only a pointer to the object.

    Non-trivial places are:
    arch/powerpc/mm/init_64.c
    arch/powerpc/mm/hugetlbpage.c

    This is flag day, yes.
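
    The flag-day signature change, sketched (my_ctor is a placeholder):

    /* before: constructors received the kmem cache too */
    void my_ctor(struct kmem_cache *cachep, void *obj);

    /* after: only the object pointer is passed */
    void my_ctor(void *obj);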

    Signed-off-by: Alexey Dobriyan
    Acked-by: Pekka Enberg
    Acked-by: Christoph Lameter
    Cc: Jon Tollefson
    Cc: Nick Piggin
    Cc: Matt Mackall
    [akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c]
    [akpm@linux-foundation.org: fix mm/slab.c]
    [akpm@linux-foundation.org: fix ubifs]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • mapping->tree_lock has no read lockers. Convert the lock from an
    rwlock to a spinlock.
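
    The mechanical part of the conversion, sketched in the same diff
    style used elsewhere in this log (illustrative hunks):

    -   rwlock_t        tree_lock;  /* in struct address_space */
    +   spinlock_t      tree_lock;

    -   write_lock_irq(&mapping->tree_lock);
    +   spin_lock_irq(&mapping->tree_lock);
        radix_tree_delete(&mapping->page_tree, page->index);
    -   write_unlock_irq(&mapping->tree_lock);
    +   spin_unlock_irq(&mapping->tree_lock);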

    Signed-off-by: Nick Piggin
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Hugh Dickins
    Cc: "Paul E. McKenney"
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Combine page_cache_get_speculative with lockless radix tree lookups to
    introduce lockless page cache lookups (ie. no mapping->tree_lock on the
    read-side).

    The only atomicity change this introduces is that the gang pagecache
    lookup functions now behave as if they were implemented with multiple
    find_get_page calls, rather than operating on a snapshot of the
    pages. In practice this atomicity guarantee is not used anyway, and
    the functions are used to replace individual lookups, so these
    semantics are natural.
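
    A sketch of the resulting lockless lookup pattern (simplified; the
    merged code uses radix tree slots and more careful rechecks):

    struct page *lockless_lookup(struct address_space *mapping, pgoff_t index)
    {
        struct page *page;

        rcu_read_lock();
    repeat:
        page = radix_tree_lookup(&mapping->page_tree, index);
        if (page) {
            if (!page_cache_get_speculative(page))
                goto repeat;            /* page was being freed */
            /* recheck: the page may have been truncated/reused */
            if (page != radix_tree_lookup(&mapping->page_tree, index)) {
                page_cache_release(page);
                goto repeat;
            }
        }
        rcu_read_unlock();
        return page;
    }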

    Signed-off-by: Nick Piggin
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Hugh Dickins
    Cc: "Paul E. McKenney"
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • If we can be sure that elevating the page_count on a pagecache page will
    pin it, we can speculatively run this operation, and subsequently check to
    see if we hit the right page rather than relying on holding a lock or
    otherwise pinning a reference to the page.

    This can be done if get_page/put_page behaves consistently throughout the
    whole tree (ie. if we "get" the page after it has been used for something
    else, we must be able to free it with a put_page).

    Actually, there is a period where the count behaves differently: when the
    page is free or if it is a constituent page of a compound page. We need
    an atomic_inc_not_zero operation to ensure we don't try to grab the page
    in either case.

    This patch introduces the core locking protocol to the pagecache (ie.
    adds page_cache_get_speculative, and tweaks some update-side code to make
    it work).

    Thanks to Hugh for pointing out an improvement to the algorithm setting
    page_count to zero when we have control of all references, in order to
    hold off speculative getters.
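
    The heart of the protocol, sketched (simplified; the merged version
    has extra handling for !SMP and compound pages):

    static inline int page_cache_get_speculative(struct page *page)
    {
        /* refuse to take a reference if the count is zero: the page is
         * free or being freed, and the caller must retry its lookup */
        if (unlikely(!atomic_inc_not_zero(&page->_count)))
            return 0;
        return 1;
    }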

    [kamezawa.hiroyu@jp.fujitsu.com: fix migration_entry_wait()]
    [hugh@veritas.com: fix add_to_page_cache]
    [akpm@linux-foundation.org: repair a comment]
    Signed-off-by: Nick Piggin
    Cc: Jeff Garzik
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Hugh Dickins
    Cc: "Paul E. McKenney"
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • radix_tree_next_hole() is implemented as a series of radix_tree_lookup()s.
    So it can be called locklessly, under rcu_read_lock().

    Signed-off-by: Nick Piggin
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Hugh Dickins
    Cc: "Paul E. McKenney"
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Implement get_user_pages_fast without locking in the fastpath on x86.

    Do an optimistic lockless pagetable walk, without taking mmap_sem or
    any page table locks. Page table existence is guaranteed by turning
    interrupts off (which, combined with the fact that we're always
    looking up the current mm, means we can do the lockless page table
    walk within the constraints of the TLB shootdown design). Basically
    we can do the lockless pagetable walk in much the same way the CPU's
    pagetable walker finds present ptes without taking any locks.
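
    In outline (a sketch of the control flow only, with the actual
    pagetable walk elided):

    int get_user_pages_fast(unsigned long start, int nr_pages, int write,
                            struct page **pages)
    {
        unsigned long flags;
        int nr = 0;

        local_irq_save(flags);  /* holds off TLB-shootdown IPIs, so
                                 * pagetable pages can't be freed under us */
        /* walk current->mm's pgd/pud/pmd/pte over the range and
         * get_page() every present pte with adequate permissions,
         * counting successes in nr ... */
        local_irq_restore(flags);

        if (nr < nr_pages) {
            /* fall back: take mmap_sem and use get_user_pages() for
             * whatever the fast path couldn't resolve */
        }
        return nr;
    }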

    This patch (combined with the subsequent ones converting direct IO to
    use it) was found to give about a 10% performance improvement on a
    2-socket, 8-core Intel Xeon system running an OLTP workload on DB2
    v9.5:

    "To test the effects of the patch, an OLTP workload was run on an IBM
    x3850 M2 server with 2 processors (quad-core Intel Xeon processors at
    2.93 GHz) using IBM DB2 v9.5 running Linux 2.6.24rc7 kernel. Comparing
    runs with and without the patch resulted in an overall performance
    benefit of ~9.8%. Correspondingly, oprofiles showed that samples from
    __up_read and __down_read routines that is seen during thread contention
    for system resources was reduced from 2.8% down to .05%. Monitoring the
    /proc/vmstat output from the patched run showed that the counter for
    fast_gup contained a very high number while the fast_gup_slow value was
    zero."

    (fast_gup is the old name for get_user_pages_fast; fast_gup_slow is a
    counter we had for the number of times the slowpath was invoked.)

    The main reason for the improvement is that DB2 has multiple threads,
    each issuing direct IO. Direct IO uses get_user_pages, so the threads
    contend on the mmap_sem cacheline and can also contend on page table
    locks.

    I would anticipate larger performance gains on larger systems;
    however, I think DB2 uses an adaptive mix of threads and processes,
    so it could be that thread contention remains pretty constant as
    machine size increases, in which case we're stuck with "only" a 10%
    gain.

    The downside of using get_user_pages_fast is that if there is no pte
    with the correct permissions for the access, we fall back to
    get_user_pages, and so get_user_pages_fast is a bit of extra work.
    However this should not be the common case in most
    performance-critical code.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: Kconfig fix]
    [akpm@linux-foundation.org: Makefile fix/cleanup]
    [akpm@linux-foundation.org: warning fix]
    Signed-off-by: Nick Piggin
    Cc: Dave Kleikamp
    Cc: Andy Whitcroft
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Andi Kleen
    Cc: Dave Kleikamp
    Cc: Badari Pulavarty
    Cc: Zach Brown
    Cc: Jens Axboe
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Fixes a build failure reported by Alan Cox:

    mm/hugetlb.c: In function `hugetlb_acct_memory': mm/hugetlb.c:1507:
    error: implicit declaration of function `cpuset_mems_nr'

    Also reverts Ingo's

    commit e44d1b2998d62a1f2f4d7eb17b56ba396535509f
    Author: Ingo Molnar
    Date: Fri Jul 25 12:57:41 2008 +0200

    mm/hugetlb.c: fix build failure with !CONFIG_SYSCTL

    which fixed the build error but added some unused-static-function warnings.

    Signed-off-by: Nishanth Aravamudan
    Cc: Alan Cox
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     
  • Fix this, on avr32:

    include/linux/utsname.h:35,
    from init/main.c:20:
    include/linux/sched.h: In function 'arch_pick_mmap_layout':
    include/linux/sched.h:2149: error: implicit declaration of function 'PAGE_ALIGN'

    Reported-by: Adrian Bunk
    Cc: Haavard Skinnemoen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

26 Jul, 2008

1 commit

  • With !CONFIG_SYSCTL on x86 and latest -git I get:

    mm/hugetlb.c: In function 'decrement_hugepage_resv_vma':
    mm/hugetlb.c:83: error: 'reserve' undeclared (first use in this function)
    mm/hugetlb.c:83: error: (Each undeclared identifier is reported only once
    mm/hugetlb.c:83: error: for each function it appears in.)

    Signed-off-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Ingo Molnar