01 Nov, 2011

1 commit

  • The basic idea behind cross memory attach is to allow MPI programs doing
    intra-node communication to do a single copy of the message rather than a
    double copy of the message via shared memory.

    The following patch attempts to achieve this by allowing a destination
    process, given an address and size from a source process, to copy memory
    directly from the source process into its own address space via a system
    call. There is also a symmetrical ability to copy from the current
    process's address space into a destination process's address space.
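
    As a rough illustration of the proposed calls, here is a minimal sketch
    of reading from another process (it assumes the glibc wrapper declared
    in <sys/uio.h>, takes the source pid and a source address on the command
    line, and needs the usual ptrace-style permission on the target):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/uio.h>

    int main(int argc, char **argv)
    {
            char buf[128];
            struct iovec local = { .iov_base = buf, .iov_len = sizeof(buf) };
            struct iovec remote;
            ssize_t n;

            if (argc != 3)
                    return 1;
            remote.iov_base = (void *)strtoul(argv[2], NULL, 16);
            remote.iov_len = sizeof(buf);

            /* copy directly from the source process's address space into buf */
            n = process_vm_readv((pid_t)atoi(argv[1]), &local, 1, &remote, 1, 0);
            if (n < 0) {
                    perror("process_vm_readv");
                    return 1;
            }
            printf("read %zd bytes\n", n);
            return 0;
    }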

    - Use of /proc/pid/mem has been considered, but there are issues with
      using it:
      - It does not allow specifying iovecs for both src and dest; assuming
        preadv or pwritev were implemented, either the area read from or
        written to would need to be contiguous.
      - Currently mem_read allows only processes that are currently
        ptrace'ing the target (and are still able to ptrace it) to read
        from the target. This check could possibly be moved to the open
        call, but it's not clear exactly what race this restriction is
        stopping (the reason appears to have been lost).
      - Having to send the fd of /proc/self/mem via SCM_RIGHTS on a unix
        domain socket is a bit ugly from a userspace point of view,
        especially when you may have hundreds if not (eventually) thousands
        of processes that all need to do this with each other.
      - It doesn't allow for some future uses of the interface we would
        like to consider adding later (see below).
      - Interestingly, reading from /proc/pid/mem currently actually
        involves two copies! (But this could be fixed pretty easily.)

    As mentioned previously, use of vmsplice instead was considered, but it
    has problems. Since you need the reader and writer working
    co-operatively, if the pipe is not drained then you block, which
    requires some wrapping to do non-blocking sends or polling on the
    receive side. In all-to-all communication it requires ordering,
    otherwise you can deadlock. And in the example of many MPI tasks
    writing to one MPI task, vmsplice serialises the copying.

    There are some cases of MPI collectives where even a single-copy
    interface does not get us all the performance gain we could have. For
    example, in an MPI_Reduce, rather than copying the data from the source
    we would like to use it directly in a math op (say the reduce is doing a
    sum), as this would save us doing a copy; we don't need to keep a copy
    of the data from the source. I haven't implemented this, but I think
    this interface could do all of that in the future through the use of the
    flags - e.g. the flags could specify the math operation and type, and
    the kernel, rather than just copying the data, would apply the specified
    operation between the source and destination and store the result in the
    destination.

    Although we don't have a "second user" of the interface yet (though I've
    had some nibbles from people who may be interested in using it for
    intra-process messaging which is not MPI), this interface is something
    which hardware vendors are already implementing in their custom drivers
    for fast local communication. So in addition to being useful for
    OpenMPI, it would mean the driver maintainers don't have to fix things
    up when the mm changes.

    There was some discussion about how much faster a true zero copy would
    go. Here's a link back to the email with some testing I did on that:

    http://marc.info/?l=linux-mm&m=130105930902915&w=2

    There is a basic man page for the proposed interface here:

    http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt

    This has been implemented for x86 and powerpc; other architectures
    should mainly (I think) just need to add syscall numbers for
    process_vm_readv and process_vm_writev. There are 32-bit compatibility
    versions for
    64-bit kernels.

    For arch maintainers there are some simple tests to be able to quickly
    verify that the syscalls are working correctly here:

    http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz

    Signed-off-by: Chris Yeoh
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Arnd Bergmann
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christopher Yeoh
     

27 May, 2011

1 commit

  • This third patch of eight in this cleancache series provides
    the core code for cleancache that interfaces between the hooks in
    VFS and individual filesystems and a cleancache backend. It also
    includes build and config patches.

    Two new files are added: mm/cleancache.c and include/linux/cleancache.h.

    Note that CONFIG_CLEANCACHE can default to on; in systems that do
    not provide a cleancache backend, all hooks devolve to a simple
    check of a global enable flag, so performance impact should
    be negligible but can be reduced to zero impact if config'ed off.
    However for this first commit, it defaults to off.
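
    A standalone illustration of that pattern (not the kernel's actual
    definitions; the demo_* names below are made up for the example):

    #include <stdio.h>

    struct page { unsigned long index; };

    static int demo_cleancache_enabled;                  /* set when a backend registers */
    static int (*demo_backend_get_page)(struct page *);  /* backend callback */

    /* the VFS-side hook: nearly free when no backend is present */
    static inline int demo_cleancache_get_page(struct page *page)
    {
            if (!demo_cleancache_enabled)
                    return -1;      /* "not found": fall back to real I/O */
            return demo_backend_get_page(page);
    }

    int main(void)
    {
            struct page p = { .index = 0 };

            printf("get_page -> %d\n", demo_cleancache_get_page(&p));
            return 0;
    }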

    Details and a FAQ can be found in Documentation/vm/cleancache.txt

    Credits: the cleancache_ops design is derived from Jeremy Fitzhardinge's
    design for tmem.

    [v8: dan.magenheimer@oracle.com: fix exportfs call affecting btrfs]
    [v8: akpm@linux-foundation.org: use static inline function, not macro]
    [v7: dan.magenheimer@oracle.com: cleanup sysfs and remove cleancache prefix]
    [v6: JBeulich@novell.com: robustly handle buggy fs encode_fh actor definition]
    [v5: jeremy@goop.org: clean up global usage and static var names]
    [v5: jeremy@goop.org: simplify init hook and any future fs init changes]
    [v5: hch@infradead.org: cleaner non-global interface for ops registration]
    [v4: adilger@sun.com: interface must support exportfs FS's]
    [v4: hch@infradead.org: interface must support 64-bit FS on 32-bit kernel]
    [v3: akpm@linux-foundation.org: use one ops struct to avoid pointer hops]
    [v3: akpm@linux-foundation.org: document and ensure PageLocked reqts are met]
    [v3: ngupta@vflare.org: fix success/fail codes, change funcs to void]
    [v2: viro@ZenIV.linux.org.uk: use sane types]
    Signed-off-by: Dan Magenheimer
    Reviewed-by: Jeremy Fitzhardinge
    Reviewed-by: Konrad Rzeszutek Wilk
    Acked-by: Al Viro
    Acked-by: Andrew Morton
    Acked-by: Nitin Gupta
    Acked-by: Minchan Kim
    Acked-by: Andreas Dilger
    Acked-by: Jan Beulich
    Cc: Matthew Wilcox
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Rik Van Riel
    Cc: Chris Mason
    Cc: Ted Ts'o
    Cc: Mark Fasheh
    Cc: Joel Becker

    Dan Magenheimer
     

24 Feb, 2011

1 commit

    mm/bootmem.c contained code paths for both bootmem and no bootmem
    configurations. They implement about the same set of APIs in
    different ways and, as a result, bootmem.c contains a massive amount
    of #ifdef CONFIG_NO_BOOTMEM.

    Separate out CONFIG_NO_BOOTMEM code into mm/nobootmem.c. As the
    common part is relatively small, duplicate them in nobootmem.c instead
    of creating a common file or ifdef'ing in bootmem.c.

    The following are duplicated:

    * {min|max}_low_pfn, max_pfn, saved_max_pfn
    * free_bootmem_late()
    * ___alloc_bootmem()
    * __alloc_bootmem_low()

    The following are applicable only to nobootmem and are moved verbatim:

    * __free_pages_memory()
    * free_all_memory_core_early()

    The following are not applicable to nobootmem and are omitted from
    nobootmem.c:

    * reserve_bootmem_node()
    * reserve_bootmem()

    The rest are function bodies split according to CONFIG_NO_BOOTMEM.

    Makefile is updated so that only either bootmem.c or nobootmem.c is
    built according to CONFIG_NO_BOOTMEM.

    This patch doesn't introduce any behavior change.

    -tj: Rewrote commit description.

    Suggested-by: Ingo Molnar
    Signed-off-by: Yinghai Lu
    Acked-by: Andrew Morton
    Signed-off-by: Tejun Heo

    Yinghai Lu
     

14 Jan, 2011

2 commits

  • Lately I've been working to make KVM use hugepages transparently without
    the usual restrictions of hugetlbfs. Some of the restrictions I'd like to
    see removed:

    1) hugepages have to be swappable or the guest physical memory remains
    locked in RAM and can't be paged out to swap

    2) if a hugepage allocation fails, regular pages should be allocated
    instead and mixed in the same vma without any failure and without
    userland noticing

    3) if some task quits and more hugepages become available in the
    buddy, guest physical memory backed by regular pages should be
    relocated on hugepages automatically in regions under
    madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
    kernel daemon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes
    not null)

    4) avoidance of reservation and maximization of use of hugepages whenever
    possible. Reservation (needed to avoid runtime fatal failures) may be ok for
    1 machine with 1 database with 1 database cache with 1 database cache size
    known at boot time. It's definitely not feasible with a virtualization
    hypervisor usage like RHEV-H that runs an unknown number of virtual machines
    with an unknown size of each virtual machine with an unknown amount of
    pagecache that could be potentially useful in the host for guest not using
    O_DIRECT (aka cache=off).

    hugepages in the virtualization hypervisor (and also in the guest!) are
    much more important than in a regular host not using virtualization,
    because with NPT/EPT they decrease the tlb-miss cacheline accesses from 24
    to 19 in case only the hypervisor uses transparent hugepages, and they
    decrease the tlb-miss cacheline accesses from 19 to 15 in case both the
    Linux hypervisor and the Linux guest use this patch (though the
    guest will limit the additional speedup to anonymous regions only for
    now...). Even more important is that the tlb miss handler is much slower
    on a NPT/EPT guest than for a regular shadow paging or no-virtualization
    scenario. So maximizing the amount of virtual memory cached by the TLB
    pays off significantly more with NPT/EPT than without (even if there would
    be no significant speedup in the tlb-miss runtime).

    The first (and more tedious) part of this work requires allowing the VM to
    handle anonymous hugepages mixed with regular pages transparently on
    regular anonymous vmas. This is what this patch tries to achieve in the
    least intrusive possible way. We want hugepages and hugetlb to be used in
    a way so that all applications can benefit without changes (as usual we
    leverage the KVM virtualization design: by improving the Linux VM at
    large, KVM gets the performance boost too).

    The most important design choice is: always fallback to 4k allocation if
    the hugepage allocation fails! This is the _very_ opposite of some large
    pagecache patches that failed with -EIO back then if a 64k (or similar)
    allocation failed...

    Second important decision (to reduce the impact of the feature on the
    existing pagetable handling code) is that at any time we can split a
    hugepage into 512 regular pages and it has to be done with an operation
    that can't fail. This way the reliability of the swapping isn't decreased
    (no need to allocate memory when we are short on memory to swap) and it's
    trivial to plug a split_huge_page* one-liner where needed without
    polluting the VM. Over time we can teach mprotect, mremap and friends to
    handle pmd_trans_huge natively without calling split_huge_page*. The fact
    it can't fail isn't just for swap: if split_huge_page would return -ENOMEM
    (instead of the current void) we'd need to rollback the mprotect from the
    middle of it (ideally including undoing the split_vma) which would be a
    big change and in the very wrong direction (it'd likely be simpler not to
    call split_huge_page at all and to teach mprotect and friends to handle
    hugepages instead of rolling them back from the middle). In short the
    very value of split_huge_page is that it can't fail.

    The collapsing and madvise(MADV_HUGEPAGE) part will remain separated and
    incremental and it'll just be a "harmless" addition later if this initial
    part is agreed upon. It also should be noted that locking-wise replacing
    regular pages with hugepages is going to be very easy if compared to what
    I'm doing below in split_huge_page, as it will only happen when
    page_count(page) matches page_mapcount(page) if we can take the PG_lock
    and mmap_sem in write mode. collapse_huge_page will be a "best effort"
    that (unlike split_huge_page) can fail at the minimal sign of trouble and
    we can try again later. collapse_huge_page will be similar to how KSM
    works and the madvise(MADV_HUGEPAGE) will work similar to
    madvise(MADV_MERGEABLE).

    The default I like is that transparent hugepages are used at page fault
    time. This can be changed with
    /sys/kernel/mm/transparent_hugepage/enabled. The control knob can be set
    to three values "always", "madvise", "never" which mean respectively that
    hugepages are always used, or only inside madvise(MADV_HUGEPAGE) regions,
    or never used. /sys/kernel/mm/transparent_hugepage/defrag instead
    controls if the hugepage allocation should defrag memory aggressively
    "always", only inside "madvise" regions, or "never".

    The pmd_trans_splitting/pmd_trans_huge locking is very solid. The
    put_page (from get_user_page users that can't use mmu notifier like
    O_DIRECT) that runs against a __split_huge_page_refcount instead was a
    pain to serialize in a way that would result always in a coherent page
    count for both tail and head. I think my locking solution with a
    compound_lock taken only after the page_first is valid and is still a
    PageHead should be safe but it surely needs review from SMP race point of
    view. In short there is no current existing way to serialize the O_DIRECT
    final put_page against split_huge_page_refcount so I had to invent a new
    one (O_DIRECT loses knowledge on the mapping status by the time gup_fast
    returns so...). And I didn't want to impact all gup/gup_fast users for
    now, maybe if we change the gup interface substantially we can avoid this
    locking, I admit I didn't think too much about it because changing the gup
    unpinning interface would be invasive.

    If we ignored O_DIRECT we could stick to the existing compound refcounting
    code, by simply adding a get_user_pages_fast_flags(foll_flags) where KVM
    (and any other mmu notifier user) would call it without FOLL_GET (and if
    FOLL_GET isn't set we'd just BUG_ON if nobody registered itself in the
    current task mmu notifier list yet). But O_DIRECT is fundamental for
    decent performance of virtualized I/O on fast storage so we can't avoid it
    to solve the race of put_page against split_huge_page_refcount to achieve
    a complete hugepage feature for KVM.

    Swap and oom work fine (well, just like with regular pages ;). MMU
    notifier is handled transparently too, with the exception of the young bit
    on the pmd, that didn't have a range check but I think KVM will be fine
    because the whole point of hugepages is that EPT/NPT will also use a huge
    pmd when they notice gup returns pages with PageCompound set, so they
    won't care of a range and there's just the pmd young bit to check in that
    case.

    NOTE: in some cases, if the L2 cache is small, this may slow down and
    waste memory during COWs because 4M of memory are accessed in a single
    fault instead of 8k (the payoff is that after COW the program can run
    faster). So we might want to switch the copy_huge_page (and
    clear_huge_page too) to non-temporal stores. I also extensively
    researched ways to avoid this cache thrashing with a full prefault logic
    that would cow in 8k/16k/32k/64k up to 1M (I can send those patches that
    fully implemented prefault), but I concluded they're not worth it: they
    add a huge additional complexity and they remove all tlb benefits until
    the full hugepage has been faulted in, to save a little bit of memory
    and some cache during app startup, but they still don't substantially
    improve the cache thrashing during startup if the prefault happens in
    >4k chunks. One reason is that those 4k pte entries copied are still
    mapped on a perfectly cache-colored hugepage, so the thrashing is the
    worst one can generate in those copies (cow of 4k page copies aren't so
    well colored so they thrash less, but again this results in software
    running faster after the page fault). Those prefault patches allowed
    things like a pte where post-cow pages were local 4k regular anon pages
    and the not-yet-cowed pte entries were pointing in the middle of some
    hugepage mapped read-only. If it doesn't pay off substantially with
    today's hardware it will pay off even less in the future with larger l2
    caches, and the prefault logic would bloat the VM a lot. On embedded
    systems, transparent_hugepage can be disabled during boot with sysfs or
    with the boot commandline parameter transparent_hugepage=0 (or
    transparent_hugepage=2 to restrict hugepages inside madvise regions);
    that will ensure not a single hugepage is allocated at boot time. It is
    simple enough to just disable transparent hugepages globally and let
    transparent hugepages be allocated selectively by applications in the
    MADV_HUGEPAGE region (both at page fault time, and if enabled, with
    collapse_huge_page too through the kernel daemon).

    This patch supports only hugepages mapped in the pmd, archs that have
    smaller hugepages will not fit in this patch alone. Also some archs like
    power have certain tlb limits that prevent mixing different page sizes in
    the same regions so they will not fit in this framework that requires
    "graceful fallback" to basic PAGE_SIZE in case of physical memory
    fragmentation. hugetlbfs remains a perfect fit for those because its
    software limits happen to match the hardware limits. hugetlbfs also
    remains a perfect fit for hugepage sizes like 1GByte that cannot be hoped
    to be found not fragmented after a certain system uptime and that would be
    very expensive to defragment with relocation, so requiring reservation.
    hugetlbfs is the "reservation way", the point of transparent hugepages is
    not to have any reservation at all and maximizing the use of cache and
    hugepages at all times automatically.

    Some performance results:

    vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
    memset page fault 1566023
    memset tlb miss 453854
    memset second tlb miss 453321
    random access tlb miss 41635
    random access second tlb miss 41658
    vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
    memset page fault 1566471
    memset tlb miss 453375
    memset second tlb miss 453320
    random access tlb miss 41636
    random access second tlb miss 41637
    vmx andrea # ./largepages3
    memset page fault 1566642
    memset tlb miss 453417
    memset second tlb miss 453313
    random access tlb miss 41630
    random access second tlb miss 41647
    vmx andrea # ./largepages3
    memset page fault 1566872
    memset tlb miss 453418
    memset second tlb miss 453315
    random access tlb miss 41618
    random access second tlb miss 41659
    vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
    vmx andrea # ./largepages3
    memset page fault 2182476
    memset tlb miss 460305
    memset second tlb miss 460179
    random access tlb miss 44483
    random access second tlb miss 44186
    vmx andrea # ./largepages3
    memset page fault 2182791
    memset tlb miss 460742
    memset second tlb miss 459962
    random access tlb miss 43981
    random access second tlb miss 43988

    ============
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>

    #define SIZE (3UL*1024*1024*1024)

    int main()
    {
            char *p = malloc(SIZE), *p2;
            struct timeval before, after;

            gettimeofday(&before, NULL);
            memset(p, 0, SIZE);
            gettimeofday(&after, NULL);
            printf("memset page fault %Lu\n",
                   (after.tv_sec-before.tv_sec)*1000000UL +
                   after.tv_usec-before.tv_usec);

            gettimeofday(&before, NULL);
            memset(p, 0, SIZE);
            gettimeofday(&after, NULL);
            printf("memset tlb miss %Lu\n",
                   (after.tv_sec-before.tv_sec)*1000000UL +
                   after.tv_usec-before.tv_usec);

            gettimeofday(&before, NULL);
            memset(p, 0, SIZE);
            gettimeofday(&after, NULL);
            printf("memset second tlb miss %Lu\n",
                   (after.tv_sec-before.tv_sec)*1000000UL +
                   after.tv_usec-before.tv_usec);

            gettimeofday(&before, NULL);
            for (p2 = p; p2 < p+SIZE; p2 += 4096)
                    *p2 = 0;
            gettimeofday(&after, NULL);
            printf("random access tlb miss %Lu\n",
                   (after.tv_sec-before.tv_sec)*1000000UL +
                   after.tv_usec-before.tv_usec);

            gettimeofday(&before, NULL);
            for (p2 = p; p2 < p+SIZE; p2 += 4096)
                    *p2 = 0;
            gettimeofday(&after, NULL);
            printf("random access second tlb miss %Lu\n",
                   (after.tv_sec-before.tv_sec)*1000000UL +
                   after.tv_usec-before.tv_usec);

            return 0;
    }
    ============

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Some are needed to build but not actually used on archs not supporting
    transparent hugepages. Others like pmdp_clear_flush are used by x86 too.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

08 Sep, 2010

1 commit

  • On UP, percpu allocations were redirected to kmalloc. This has the
    following problems.

    * For a certain number of allocations (determined by
    PERCPU_DYNAMIC_EARLY_SLOTS and PERCPU_DYNAMIC_EARLY_SIZE), the percpu
    allocator can be used before the usual kernel memory allocator is
    brought online. On SMP, this is used to initialize the kernel
    memory allocator.

    * The percpu allocator honors alignment up to PAGE_SIZE but kmalloc()
    doesn't. For example, workqueue makes use of larger alignments for
    cpu_workqueues.

    Currently, users of percpu allocators need to handle UP differently,
    which is somewhat fragile and ugly. Other than a small amount of
    memory, there isn't much to lose by enabling the percpu allocator on UP.
    It can simply use kernel memory based chunk allocation which was added
    for SMP archs w/o MMUs.

    This patch removes mm/percpu_up.c, builds mm/percpu.c on UP too and
    makes UP build use percpu-km. As percpu addresses and kernel
    addresses are always identity mapped and static percpu variables don't
    need any special treatment, nothing is arch dependent and mm/percpu.c
    implements generic setup_per_cpu_areas() for UP.

    Signed-off-by: Tejun Heo
    Reviewed-by: Christoph Lameter
    Acked-by: Pekka Enberg

    Tejun Heo
     

14 Jul, 2010

1 commit

    via the following scripts

    FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')

    sed -i \
    -e 's/lmb/memblock/g' \
    -e 's/LMB/MEMBLOCK/g' \
    $FILES

    for N in $(find . -name lmb.[ch]); do
            M=$(echo $N | sed 's/lmb/memblock/g')
            mv $N $M
    done

    and remove some incorrect changes the scripts made, e.g. to lmbench and dlmb.

    also move memblock.c from lib/ to mm/

    Suggested-by: Ingo Molnar
    Acked-by: "H. Peter Anvin"
    Acked-by: Benjamin Herrenschmidt
    Acked-by: Linus Torvalds
    Signed-off-by: Yinghai Lu
    Signed-off-by: Benjamin Herrenschmidt

    Yinghai Lu
     

25 May, 2010

1 commit

  • This patch is the core of a mechanism which compacts memory in a zone by
    relocating movable pages towards the end of the zone.

    A single compaction run involves a migration scanner and a free scanner.
    Both scanners operate on pageblock-sized areas in the zone. The migration
    scanner starts at the bottom of the zone and searches for all movable
    pages within each area, isolating them onto a private list called
    migratelist. The free scanner starts at the top of the zone and searches
    for suitable areas, consuming the free pages within them and making them
    available to the migration scanner. The pages isolated for migration are
    then migrated to the newly isolated free pages.
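
    As a toy userspace sketch of the two-scanner idea (the "zone" here is
    just an array of flags, not the real data structures; migrate_pfn and
    free_pfn merely echo the kernel naming):

    #include <stdio.h>

    #define NPAGES 16
    enum { FREE, MOVABLE, UNMOVABLE };

    int main(void)
    {
            int zone[NPAGES] = { MOVABLE, FREE, UNMOVABLE, MOVABLE, FREE,
                                 MOVABLE, FREE, UNMOVABLE, MOVABLE, FREE,
                                 FREE, MOVABLE, FREE, MOVABLE, FREE, FREE };
            int migrate_pfn = 0;            /* migration scanner: bottom of zone */
            int free_pfn = NPAGES - 1;      /* free scanner: top of zone */

            while (migrate_pfn < free_pfn) {
                    if (zone[migrate_pfn] != MOVABLE) {     /* nothing to isolate */
                            migrate_pfn++;
                            continue;
                    }
                    if (zone[free_pfn] != FREE) {           /* not a usable target */
                            free_pfn--;
                            continue;
                    }
                    /* "migrate" the movable page towards the end of the zone */
                    zone[free_pfn--] = MOVABLE;
                    zone[migrate_pfn++] = FREE;
                    printf("moved page %d -> %d\n", migrate_pfn - 1, free_pfn + 1);
            }
            return 0;
    }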

    [aarcange@redhat.com: Fix unsafe optimisation]
    [mel@csn.ul.ie: do not schedule work on other CPUs for compaction]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

30 Mar, 2010

1 commit

  • percpu.h has always been including slab.h to get k[mz]alloc/free() for
    UP inline implementation. Because percpu.h is used by very low level
    headers, including module.h and sched.h, this meant that a lot of files
    unintentionally got slab.h included.

    Lee Schermerhorn was trying to make topology.h use percpu.h and got
    bitten by this implicit inclusion. The right thing to do is break
    this ultimately unnecessary dependency. The previous patch added
    explicit inclusion of either gfp.h or slab.h to the source files using
    them. This patch updates percpu.h such that slab.h is no longer
    included from percpu.h.

    Signed-off-by: Tejun Heo
    Reviewed-by: Christoph Lameter
    Cc: Ingo Molnar
    Cc: Lee Schermerhorn

    Tejun Heo
     

17 Dec, 2009

1 commit

  • Now that we cache the ACL pointers in the generic inode all the generic_acl
    cruft can go away and generic_acl.c can directly implement xattr handlers
    dealing with the full Posix ACL semantics for in-memory filesystems.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

02 Oct, 2009

1 commit


25 Sep, 2009

1 commit


24 Sep, 2009

1 commit

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
    HWPOISON: Enable error_remove_page on btrfs
    HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
    HWPOISON: Add madvise() based injector for hardware poisoned pages v4
    HWPOISON: Enable error_remove_page for NFS
    HWPOISON: Enable .remove_error_page for migration aware file systems
    HWPOISON: The high level memory error handler in the VM v7
    HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
    HWPOISON: shmem: call set_page_dirty() with locked page
    HWPOISON: Define a new error_remove_page address space op for async truncation
    HWPOISON: Add invalidate_inode_page
    HWPOISON: Refactor truncate to allow direct truncating of page v2
    HWPOISON: check and isolate corrupted free pages v2
    HWPOISON: Handle hardware poisoned pages in try_to_unmap
    HWPOISON: Use bitmask/action code for try_to_unmap behaviour
    HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
    HWPOISON: Add poison check to page fault handling
    HWPOISON: Add basic support for poisoned pages in fault handler v3
    HWPOISON: Add new SIGBUS error codes for hardware poison signals
    HWPOISON: Add support for poison swap entries v2
    HWPOISON: Export some rmap vma locking to outside world
    ...

    Linus Torvalds
     

23 Sep, 2009

1 commit

    A patch to give a better overview of userland application stack usage,
    especially for embedded Linux.

    Currently you are only able to dump the main process/thread stack usage,
    which is shown in /proc/pid/status by the "VmStk" value, but you get no
    information about the stack memory consumed by the threads.

    There is an enhancement in /proc//{task/*,}/*maps which marks the vm
    mapping where the thread stack pointer resides with "[thread stack
    xxxxxxxx]". xxxxxxxx is the maximum size of the stack. This is valuable
    information, because libpthread doesn't set the start of the stack to the
    top of the mapped area, depending on the pthread usage.

    A sample output of /proc//task//maps looks like:

    08048000-08049000 r-xp 00000000 03:00 8312 /opt/z
    08049000-0804a000 rw-p 00001000 03:00 8312 /opt/z
    0804a000-0806b000 rw-p 00000000 00:00 0 [heap]
    a7d12000-a7d13000 ---p 00000000 00:00 0
    a7d13000-a7f13000 rw-p 00000000 00:00 0 [thread stack: 001ff4b4]
    a7f13000-a7f14000 ---p 00000000 00:00 0
    a7f14000-a7f36000 rw-p 00000000 00:00 0
    a7f36000-a8069000 r-xp 00000000 03:00 4222 /lib/libc.so.6
    a8069000-a806b000 r--p 00133000 03:00 4222 /lib/libc.so.6
    a806b000-a806c000 rw-p 00135000 03:00 4222 /lib/libc.so.6
    a806c000-a806f000 rw-p 00000000 00:00 0
    a806f000-a8083000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0
    a8083000-a8084000 r--p 00013000 03:00 14462 /lib/libpthread.so.0
    a8084000-a8085000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0
    a8085000-a8088000 rw-p 00000000 00:00 0
    a8088000-a80a4000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2
    a80a4000-a80a5000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2
    a80a5000-a80a6000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2
    afaf5000-afb0a000 rw-p 00000000 00:00 0 [stack]
    ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso]

    Also there is a new entry "stack usage" in /proc//{task/*,}/status
    which will give you the current stack usage in kB.

    A sample output of /proc/self/status looks like:

    Name: cat
    State: R (running)
    Tgid: 507
    Pid: 507
    .
    .
    .
    CapBnd: fffffffffffffeff
    voluntary_ctxt_switches: 0
    nonvoluntary_ctxt_switches: 0
    Stack usage: 12 kB
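
    The new field can also be read programmatically, e.g. with this small
    sketch (the "Stack usage:" line only exists on kernels with this patch
    applied):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            char line[256];
            FILE *f = fopen("/proc/self/status", "r");

            if (!f) {
                    perror("fopen");
                    return 1;
            }
            while (fgets(line, sizeof(line), f))
                    if (!strncmp(line, "Stack usage:", 12))
                            fputs(line, stdout);    /* e.g. "Stack usage: 12 kB" */
            fclose(f);
            return 0;
    }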

    I also fixed the stack base address in /proc//{task/*,}/stat to be the
    base address of the associated thread stack and not that of the main
    process. This makes more sense.

    [akpm@linux-foundation.org: fs/proc/array.c now needs walk_page_range()]
    Signed-off-by: Stefani Seibold
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefani Seibold
     

22 Sep, 2009

2 commits

    Anyone who wants to do copy to/from user from a kernel thread needs
    use_mm (like what fs/aio has). Move that into mm/, to make reusing and
    exporting easier down the line, and make aio use it. Next intended user,
    besides aio, will be vhost-net.

    Acked-by: Andrea Arcangeli
    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael S. Tsirkin
     
  • This patch presents the mm interface to a dummy version of ksm.c, for
    better scrutiny of that interface: the real ksm.c follows later.

    When CONFIG_KSM is not set, madvise(2) rejects MADV_MERGEABLE and
    MADV_UNMERGEABLE with EINVAL, since that seems more helpful than
    pretending that they can be serviced. But when CONFIG_KSM=y, accept them
    even if KSM is not currently running, and even on areas which KSM will not
    touch (e.g. hugetlb or shared file or special driver mappings).

    Like other madvise calls, report ENOMEM despite success if any area in the
    range is unmapped, and use EAGAIN to report out of memory.

    Define vma flag VM_MERGEABLE to identify an area on which KSM may try
    merging pages: leave it to ksm_madvise() to decide whether to set it.
    Define mm flag MMF_VM_MERGEABLE to identify an mm which might contain
    VM_MERGEABLE areas, to minimize callouts when forking or exiting.
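
    A minimal userspace sketch of the new flags (it assumes CONFIG_KSM=y and
    that <sys/mman.h> defines MADV_MERGEABLE/MADV_UNMERGEABLE):

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t len = 64 * 4096;
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            memset(p, 0x5a, len);   /* identical pages are merge candidates */

            /* mark the area mergeable: sets VM_MERGEABLE on the vma */
            if (madvise(p, len, MADV_MERGEABLE))
                    perror("madvise(MADV_MERGEABLE)");

            /* ... and the area can later be opted back out again */
            if (madvise(p, len, MADV_UNMERGEABLE))
                    perror("madvise(MADV_UNMERGEABLE)");
            return 0;
    }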

    Based upon earlier patches by Chris Wright and Izik Eidus.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Chris Wright
    Signed-off-by: Izik Eidus
    Cc: Michael Kerrisk
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Wu Fengguang
    Cc: Balbir Singh
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Avi Kivity
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

16 Sep, 2009

3 commits

  • Useful for some testing scenarios, although specific testing is often
    done better through MADV_POISON

    This can be done with the x86-level MCE injector too, but this interface
    allows it to be done independently of low-level x86 changes.

    v2: Add module license (Haicheng Li)

    Signed-off-by: Andi Kleen

    Andi Kleen
     
  • Add the high level memory handler that poisons pages
    that got corrupted by hardware (typically by a two bit flip in a DIMM
    or a cache) on the Linux level. The goal is to prevent everyone
    from accessing these pages in the future.

    This is done at the VM level by marking a page hwpoisoned
    and doing the appropriate action based on the type of page
    it is.

    The code that does this is portable and lives in mm/memory-failure.c

    To quote the overview comment:

    High level machine check handler. Handles pages reported by the
    hardware as being corrupted usually due to a 2bit ECC memory or cache
    failure.

    This focuses on pages detected as corrupted in the background.
    When the current CPU tries to consume corruption the currently
    running process can just be killed directly instead. This implies
    that if the error cannot be handled for some reason it's safe to
    just ignore it because no corruption has been consumed yet. Instead
    when that happens another machine check will happen.

    Handles page cache pages in various states. The tricky part
    here is that we can access any page asynchronous to other VM
    users, because memory failures could happen anytime and anywhere,
    possibly violating some of their assumptions. This is why this code
    has to be extremely careful. Generally it tries to use normal locking
    rules, as in get the standard locks, even if that means the
    error handling takes potentially a long time.

    Some of the operations here are somewhat inefficient and have non
    linear algorithmic complexity, because the data structures have not
    been optimized for this case. This is in particular the case
    for the mapping from a vma to a process. Since this case is expected
    to be rare we hope we can get away with this.

    There are in principle two strategies to kill processes on poison:
    - just unmap the data and wait for an actual reference before
    killing
    - kill as soon as corruption is detected.
    Both have advantages and disadvantages and should be used
    in different situations. Right now both are implemented and can
    be switched with a new sysctl vm.memory_failure_early_kill
    The default is early kill.

    The patch does some rmap data structure walking on its own to collect
    processes to kill. This is unusual because normally all rmap data structure
    knowledge is in rmap.c only. I put it here for now to keep
    everything together, and rmap knowledge has been seeping out anyway.

    Includes contributions from Johannes Weiner, Chris Mason, Fengguang Wu,
    Nick Piggin (who did a lot of great work) and others.

    Cc: npiggin@suse.de
    Cc: riel@redhat.com
    Signed-off-by: Andi Kleen
    Acked-by: Rik van Riel
    Reviewed-by: Hidehiro Kawai

    Andi Kleen
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (46 commits)
    powerpc64: convert to dynamic percpu allocator
    sparc64: use embedding percpu first chunk allocator
    percpu: kill lpage first chunk allocator
    x86,percpu: use embedding for 64bit NUMA and page for 32bit NUMA
    percpu: update embedding first chunk allocator to handle sparse units
    percpu: use group information to allocate vmap areas sparsely
    vmalloc: implement pcpu_get_vm_areas()
    vmalloc: separate out insert_vmalloc_vm()
    percpu: add chunk->base_addr
    percpu: add pcpu_unit_offsets[]
    percpu: introduce pcpu_alloc_info and pcpu_group_info
    percpu: move pcpu_lpage_build_unit_map() and pcpul_lpage_dump_cfg() upward
    percpu: add @align to pcpu_fc_alloc_fn_t
    percpu: make @dyn_size mandatory for pcpu_setup_first_chunk()
    percpu: drop @static_size from first chunk allocators
    percpu: generalize first chunk allocator selection
    percpu: build first chunk allocators selectively
    percpu: rename 4k first chunk allocator to page
    percpu: improve boot messages
    percpu: fix pcpu_reclaim() locking
    ...

    Fix trivial conflict as by Tejun Heo in kernel/sched.c

    Linus Torvalds
     

11 Sep, 2009

1 commit


24 Jun, 2009

1 commit

  • This patch makes most !CONFIG_HAVE_SETUP_PER_CPU_AREA archs use
    dynamic percpu allocator. The first chunk is allocated using
    embedding helper and 8k is reserved for modules. This ensures that
    the new allocator behaves almost identically to the original allocator
    as long as static percpu variables are concerned, so it shouldn't
    introduce much breakage.

    s390 and alpha use a custom SHIFT_PERCPU_PTR() to work around the
    addressing range limit the addressing model imposes. Unfortunately, this
    breaks
    if the address is specified using a variable, so for now, the two
    archs aren't converted.

    The following architectures are affected by this change.

    * sh
    * arm
    * cris
    * mips
    * sparc(32)
    * blackfin
    * avr32
    * parisc (broken, under investigation)
    * m32r
    * powerpc(32)

    As this change makes the dynamic allocator the default one,
    CONFIG_HAVE_DYNAMIC_PER_CPU_AREA is replaced with its inverse -
    CONFIG_HAVE_LEGACY_PER_CPU_AREA, which is added to yet-to-be converted
    archs. These archs implement their own setup_per_cpu_areas() and the
    conversion is not trivial.

    * powerpc(64)
    * sparc(64)
    * ia64
    * alpha
    * s390

    Boot and batch alloc/free tests were run on x86_32 with debug code
    (x86_32 doesn't use the default first chunk initialization). Compile
    tested on
    sparc(32), powerpc(32), arm and alpha.

    Kyle McMartin reported that this change breaks parisc. The problem is
    still under investigation and he is okay with pushing this patch
    forward and fixing parisc later.

    [ Impact: use dynamic allocator for most archs w/o custom percpu setup ]

    Signed-off-by: Tejun Heo
    Acked-by: Rusty Russell
    Acked-by: David S. Miller
    Acked-by: Benjamin Herrenschmidt
    Acked-by: Martin Schwidefsky
    Reviewed-by: Christoph Lameter
    Cc: Paul Mundt
    Cc: Russell King
    Cc: Mikael Starvik
    Cc: Ralf Baechle
    Cc: Bryan Wu
    Cc: Kyle McMartin
    Cc: Matthew Wilcox
    Cc: Grant Grundler
    Cc: Hirokazu Takata
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Heiko Carstens
    Cc: Ingo Molnar

    Tejun Heo
     

17 Jun, 2009

2 commits

  • * akpm: (182 commits)
    fbdev: bf54x-lq043fb: use kzalloc over kmalloc/memset
    fbdev: *bfin*: fix __dev{init,exit} markings
    fbdev: *bfin*: drop unnecessary calls to memset
    fbdev: bfin-t350mcqb-fb: drop unused local variables
    fbdev: blackfin has __raw I/O accessors, so use them in fb.h
    fbdev: s1d13xxxfb: add accelerated bitblt functions
    tcx: use standard fields for framebuffer physical address and length
    fbdev: add support for handoff from firmware to hw framebuffers
    intelfb: fix a bug when changing video timing
    fbdev: use framebuffer_release() for freeing fb_info structures
    radeon: P2G2CLK_ALWAYS_ONb tested twice, should 2nd be P2G2CLK_DAC_ALWAYS_ONb?
    s3c-fb: CPUFREQ frequency scaling support
    s3c-fb: fix resource releasing on error during probing
    carminefb: fix possible access beyond end of carmine_modedb[]
    acornfb: remove fb_mmap function
    mb862xxfb: use CONFIG_OF instead of CONFIG_PPC_OF
    mb862xxfb: restrict compliation of platform driver to PPC
    Samsung SoC Framebuffer driver: add Alpha Channel support
    atmel-lcdc: fix pixclock upper bound detection
    offb: use framebuffer_alloc() to allocate fb_info struct
    ...

    Manually fix up conflicts due to kmemcheck in mm/slab.c

    Linus Torvalds
     
  • * create mm/init-mm.c, move init_mm there
    * remove INIT_MM, initialize init_mm with C99 initializer
    * unexport init_mm on all arches:

    init_mm is already unexported on x86.

    One strange place is some OMAP driver (drivers/video/omap/) which
    won't build modular, but it already wants a get_vm_area() export.
    Somebody should look there.

    [akpm@linux-foundation.org: add missing #includes]
    Signed-off-by: Alexey Dobriyan
    Cc: Mike Frysinger
    Cc: Americo Wang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

15 Jun, 2009

1 commit

  • With kmemcheck enabled, the slab allocator needs to do this:

    1. Tell kmemcheck to allocate the shadow memory which stores the status of
    each byte in the allocation proper, e.g. whether it is initialized or
    uninitialized.
    2. Tell kmemcheck which parts of memory that should be marked uninitialized.
    There are actually a few more states, such as "not yet allocated" and
    "recently freed".

    If a slab cache is set up using the SLAB_NOTRACK flag, it will never return
    memory that can take page faults because of kmemcheck.

    If a slab cache is NOT set up using the SLAB_NOTRACK flag, callers can still
    request memory with the __GFP_NOTRACK flag. This does not prevent the page
    faults from occurring, however, but marks the object in question as being
    initialized so that no warnings will ever be produced for this object.

    In addition to (and in contrast to) __GFP_NOTRACK, the
    __GFP_NOTRACK_FALSE_POSITIVE flag indicates that the allocation should
    not be tracked _because_ it would produce a false positive. Their values
    are identical, but need not be so in the future (for example, we could now
    enable/disable false positives with a config option).

    Parts of this patch were contributed by Pekka Enberg but merged for
    atomicity.

    Signed-off-by: Vegard Nossum
    Signed-off-by: Pekka Enberg
    Signed-off-by: Ingo Molnar

    [rebased for mainline inclusion]
    Signed-off-by: Vegard Nossum

    Vegard Nossum
     

12 Jun, 2009

2 commits


01 Apr, 2009

1 commit

  • CONFIG_DEBUG_PAGEALLOC is now supported by x86, powerpc, sparc64, and
    s390. This patch implements it for the rest of the architectures by
    filling the pages with poison byte patterns after free_pages() and
    verifying the poison patterns before alloc_pages().

    This generic one cannot detect invalid page accesses immediately but
    invalid read access may cause invalid dereference by poisoned memory and
    invalid write access can be detected after a long delay.

    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     

20 Feb, 2009

1 commit

  • Impact: new scalable dynamic percpu allocator which allows dynamic
    percpu areas to be accessed the same way as static ones

    Implement scalable dynamic percpu allocator which can be used for both
    static and dynamic percpu areas. This will allow static and dynamic
    areas to share faster direct access methods. This feature is optional
    and enabled only when CONFIG_HAVE_DYNAMIC_PER_CPU_AREA is defined by
    arch. Please read comment on top of mm/percpu.c for details.

    Signed-off-by: Tejun Heo
    Cc: Andrew Morton

    Tejun Heo
     

07 Jan, 2009

1 commit

  • tiny-shmem shares most of its 130 lines of code with shmem and tends to
    break when particular bits of shmem get modified. Unifying saves code and
    makes keeping these two in sync much easier.

    before:
       text    data     bss     dec     hex filename
      14367     392      24   14783    39bf mm/shmem.o
        396      72       8     476     1dc mm/tiny-shmem.o

    after:
       text    data     bss     dec     hex filename
      14367     392      24   14783    39bf mm/shmem.o
        412      72       8     492     1ec mm/shmem.o tiny

    Signed-off-by: Matt Mackall
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     

29 Dec, 2008

1 commit

    Currently the fault-injection capability for the slab allocator is only
    available to SLAB. This patch makes it available to SLUB, too.

    [penberg@cs.helsinki.fi: unify slab and slub implementations]
    Cc: Christoph Lameter
    Cc: Matt Mackall
    Signed-off-by: Akinobu Mita
    Signed-off-by: Pekka Enberg

    Akinobu Mita
     

20 Oct, 2008

1 commit

    Allocate all page_cgroup structures at boot and remove the page_cgroup
    pointer from struct page. This patch adds an interface:

    struct page_cgroup *lookup_page_cgroup(struct page*)

    All FLATMEM/DISCONTIGMEM/SPARSEMEM and MEMORY_HOTPLUG is supported.

    Removing the page_cgroup pointer reduces the amount of memory by
    - 4 bytes per PAGE_SIZE (32-bit), or
    - 8 bytes per PAGE_SIZE (64-bit),
    even if the memory controller is disabled (but still configured in).

    On usual 8GB x86-32 server, this saves 8MB of NORMAL_ZONE memory.
    On my x86-64 server with 48GB of memory, this saves 96MB of memory.
    I think this reduction makes sense.

    By pre-allocating, the kmalloc/kfree in charge/uncharge are removed.
    This means
    - we don't need to be afraid of kmalloc failure
    (which can happen because of the gfp_mask type.)
    - we can avoid calling kmalloc/kfree.
    - we can avoid allocating tons of small objects which can become fragmented.
    - we know what amount of memory will be used for this extra-lru handling.

    I added printk messages such as

    "allocated %ld bytes of page_cgroup"
    "please try cgroup_disable=memory option if you don't want"

    which are maybe informative enough for users.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

29 Jul, 2008

1 commit

    With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to pages.
    There are secondary MMUs (with secondary sptes and secondary tlbs) too.
    sptes in the kvm case are shadow pagetables, but when I say spte in
    mmu-notifier context, I mean "secondary pte". In GRU case there's no
    actual secondary pte and there's only a secondary tlb because the GRU
    secondary MMU has no knowledge about sptes and every secondary tlb miss
    event in the MMU always generates a page fault that has to be resolved by
    the CPU (this is not the case for KVM, where a secondary tlb miss will
    walk sptes in hardware and it will refill the secondary tlb transparently
    to software if the corresponding spte is present). The same way
    zap_page_range has to invalidate the pte before freeing the page, the spte
    (and secondary tlb) must also be invalidated before any page is freed and
    reused.

    Currently we take a page_count pin on every page mapped by sptes, but that
    means the pages can't be swapped whenever they're mapped by any spte
    because they're part of the guest working set. Furthermore a spte unmap
    event can immediately lead to a page to be freed when the pin is released
    (so requiring the same complex and relatively slow tlb_gather smp safe
    logic we have in zap_page_range and that can be avoided completely if the
    spte unmap event doesn't require an unpin of the page previously mapped in
    the secondary MMU).

    The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
    when the VM is swapping or freeing or doing anything on the primary MMU so
    that the secondary MMU code can drop sptes before the pages are freed,
    avoiding all page pinning and allowing 100% reliable swapping of guest
    physical address space. Furthermore it avoids the code that teardown the
    mappings of the secondary MMU, to implement a logic like tlb_gather in
    zap_page_range that would require many IPI to flush other cpu tlbs, for
    each fixed number of spte unmapped.

    To make an example: if what happens on the primary MMU is a protection
    downgrade (from writeable to wrprotect) the secondary MMU mappings will be
    invalidated, and the next secondary-mmu-page-fault will call
    get_user_pages and trigger a do_wp_page through get_user_pages if it
    called get_user_pages with write=1, and it'll re-establish an updated
    spte or secondary-tlb-mapping on the copied page. Or it will set up a
    readonly spte or readonly tlb mapping if it's a guest-read, if it calls
    get_user_pages with write=0. This is just an example.

    This allows to map any page pointed by any pte (and in turn visible in the
    primary CPU MMU), into a secondary MMU (be it a pure tlb like GRU, or a
    full MMU with both sptes and secondary-tlb like the shadow-pagetable layer
    with kvm), or a remote DMA in software like XPMEM (hence needing of
    schedule in XPMEM code to send the invalidate to the remote node, while no
    need to schedule in kvm/gru as it's an immediate event like invalidating
    primary-mmu pte).

    At least for KVM without this patch it's impossible to swap guests
    reliably. And having this feature and removing the page pin allows
    several other optimizations that simplify life considerably.

    Dependencies:

    1) mm_take_all_locks() to register the mmu notifier when the whole VM
    isn't doing anything with "mm". This allows mmu notifier users to keep
    track if the VM is in the middle of the invalidate_range_begin/end
    critical section with an atomic counter increase in range_begin and
    decreased in range_end. No secondary MMU page fault is allowed to map
    any spte or secondary tlb reference, while the VM is in the middle of
    range_begin/end as any page returned by get_user_pages in that critical
    section could later immediately be freed without any further
    ->invalidate_page notification (invalidate_range_begin/end works on
    ranges and ->invalidate_page isn't called immediately before freeing
    the page). To stop all page freeing and pagetable overwrites the
    mmap_sem must be taken in write mode and all other anon_vma/i_mmap
    locks must be taken too.

    2) It'd be a waste to add branches in the VM if nobody could possibly
    run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be enabled if
    CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of
    mmu notifiers, but this already allows to compile a KVM external module
    against a kernel with mmu notifiers enabled and from the next pull from
    kvm.git we'll start using them. And GRU/XPMEM will also be able to
    continue the development by enabling KVM=m in their config, until they
    submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can
    also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n).
    This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM
    are all =n.

    The mmu_notifier_register call can fail because mm_take_all_locks may be
    interrupted by a signal and return -EINTR. Because mmu_notifier_register
    is used when a driver starts up, a failure can be gracefully handled.
    Here is an example of the change applied to kvm to register the mmu
    notifiers. Usually when a driver starts up other allocations are
    required anyway and -ENOMEM failure paths exist already.

    struct kvm *kvm_arch_create_vm(void)
    {
            struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
    +       int err;

            if (!kvm)
                    return ERR_PTR(-ENOMEM);

            INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);

    +       kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
    +       err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
    +       if (err) {
    +               kfree(kvm);
    +               return ERR_PTR(err);
    +       }
    +
            return kvm;
    }

    mmu_notifier_unregister returns void and it's reliable.

    The patch also adds a few needed but missing includes that would prevent
    the kernel from compiling after these changes on non-x86 archs (x86
    didn't need them by luck).

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix mm/filemap_xip.c build]
    [akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Nick Piggin
    Signed-off-by: Christoph Lameter
    Cc: Jack Steiner
    Cc: Robin Holt
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Kanoj Sarcar
    Cc: Roland Dreier
    Cc: Steve Wise
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Cc: Rusty Russell
    Cc: Anthony Liguori
    Cc: Chris Wright
    Cc: Marcelo Tosatti
    Cc: Eric Dumazet
    Cc: "Paul E. McKenney"
    Cc: Izik Eidus
    Cc: Anthony Liguori
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

25 Jul, 2008

2 commits

  • Towards the end of putting all core mm initialization in mm_init.c, I
    plan on putting the creation of a mm kobject in a function in that file.
    However, the file is currently only compiled if CONFIG_DEBUG_MEMORY_INIT
    is set. Remove this dependency, but put the code under an #ifdef on the
    same config option. This should result in no functional changes.

    Signed-off-by: Nishanth Aravamudan
    Cc: Nick Piggin
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     
  • Boot initialisation is very complex, with significant numbers of
    architecture-specific routines, hooks and code ordering. While significant
    amounts of the initialisation are architecture-independent, it trusts the data
    received from the architecture layer. This is a mistake, and has resulted in
    a number of difficult-to-diagnose bugs.

    This patchset adds some validation and tracing to memory initialisation. It
    also introduces a few basic defensive measures. The validation code can be
    explicitly disabled for embedded systems.

    This patch:

    Add additional debugging and verification code for memory initialisation.

    Once enabled, the verification checks are always run and when required
    additional debugging information may be output via a mminit_loglevel=
    command-line parameter.

    The verification code is placed in a new file mm/mm_init.c. Ideally other mm
    initialisation code will be moved here over time.

    Signed-off-by: Mel Gorman
    Cc: Christoph Lameter
    Cc: Andy Whitcroft
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

18 Apr, 2008

1 commit


05 Mar, 2008

1 commit

  • Rename Memory Controller to Memory Resource Controller. Reflect the same
    changes in the CONFIG definition for the Memory Resource Controller. Group
    together the config options for Resource Counters and Memory Resource
    Controller.

    Signed-off-by: Balbir Singh
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     

08 Feb, 2008

1 commit

    Set up the memory cgroup and add basic hooks and controls to integrate
    and work with the cgroup.

    Signed-off-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     

06 Feb, 2008

3 commits