24 Jan, 2011

1 commit


14 Jan, 2011

3 commits

  • It's mostly a matter of replacing alloc_pages with alloc_pages_vma after
    introducing alloc_pages_vma. khugepaged needs special handling as the
    allocation has to happen inside collapse_huge_page where the vma is known
    and an error has to be returned to the outer loop to sleep
    alloc_sleep_millisecs in case of failure. But it retains the more
    efficient logic of handling allocation failures in khugepaged in case of
    CONFIG_NUMA=n.
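    In sketch form, the shape described above looks like this (illustrative
    names, not the literal patch):

    static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
                                   struct page **hpage,
                                   struct vm_area_struct *vma)
    {
            /* the vma (and hence the NUMA policy) is only known here */
            struct page *new_page = alloc_hugepage_vma(khugepaged_defrag(),
                                                       vma, address);
            if (unlikely(!new_page)) {
                    /* make the outer loop sleep alloc_sleep_millisecs */
                    *hpage = ERR_PTR(-ENOMEM);
                    return;
            }
            /* ... collapse the 512 regular pages into new_page ... */
    }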

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Lately I've been working to make KVM use hugepages transparently without
    the usual restrictions of hugetlbfs. Some of the restrictions I'd like to
    see removed:

    1) hugepages have to be swappable or the guest physical memory remains
    locked in RAM and can't be paged out to swap

    2) if a hugepage allocation fails, regular pages should be allocated
    instead and mixed in the same vma without any failure and without
    userland noticing

    3) if some task quits and more hugepages become available in the
    buddy, guest physical memory backed by regular pages should be
    relocated on hugepages automatically in regions under
    madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
    kernel daemon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes
    non-empty)

    4) avoidance of reservation and maximization of use of hugepages whenever
    possible. Reservation (needed to avoid runtime fatal failures) may be OK for
    1 machine with 1 database with 1 database cache with 1 database cache size
    known at boot time. It's definitely not feasible with a virtualization
    hypervisor usage like RHEV-H that runs an unknown number of virtual machines
    with an unknown size of each virtual machine with an unknown amount of
    pagecache that could be potentially useful in the host for guests not using
    O_DIRECT (aka cache=off).

    hugepages in the virtualization hypervisor (and also in the guest!) are
    much more important than in a regular host not using virtualization,
    because with NPT/EPT they decrease the tlb-miss cacheline accesses from 24
    to 19 in case only the hypervisor uses transparent hugepages, and they
    decrease the tlb-miss cacheline accesses from 19 to 15 in case both the
    Linux hypervisor and the Linux guest use this patch (though the
    guest will limit the additional speedup to anonymous regions only for
    now...). Even more important is that the tlb miss handler is much slower
    on a NPT/EPT guest than for a regular shadow paging or no-virtualization
    scenario. So maximizing the amount of virtual memory cached by the TLB
    pays off significantly more with NPT/EPT than without (even if there would
    be no significant speedup in the tlb-miss runtime).

    The first (and more tedious) part of this work requires allowing the VM to
    handle anonymous hugepages mixed with regular pages transparently on
    regular anonymous vmas. This is what this patch tries to achieve in the
    least intrusive way possible. We want hugepages and hugetlb to be used in
    such a way that all applications can benefit without changes (as usual we
    leverage the KVM virtualization design: by improving the Linux VM at
    large, KVM gets the performance boost too).

    The most important design choice is: always fall back to 4k allocation if
    the hugepage allocation fails! This is the _very_ opposite of some large
    pagecache patches that failed with -EIO back then if a 64k (or similar)
    allocation failed...
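    A sketch of that "always fall back" policy in the fault path; the helper
    names here (anonymous_fault, alloc_hugepage, map_huge_pmd,
    fault_in_4k_pages) are illustrative, not the literal patch:

    static int anonymous_fault(struct mm_struct *mm, struct vm_area_struct *vma,
                               unsigned long address, pmd_t *pmd,
                               unsigned int flags)
    {
            if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
                    struct page *hpage = alloc_hugepage(vma, address);
                    if (hpage)
                            return map_huge_pmd(mm, vma, address, pmd, hpage);
                    /* hugepage allocation failed: no -EIO, no retry loop, */
                    /* just take the regular 4k path below */
            }
            return fault_in_4k_pages(mm, vma, address, pmd, flags);
    }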

    Second important decision (to reduce the impact of the feature on the
    existing pagetable handling code) is that at any time we can split a
    hugepage into 512 regular pages, and the split has to be done with an
    operation that can't fail. This way the reliability of the swapping isn't
    decreased (no need to allocate memory when we are short on memory to swap)
    and it's trivial to plug a split_huge_page* one-liner where needed without
    polluting the VM. Over time we can teach mprotect, mremap and friends to
    handle pmd_trans_huge natively without calling split_huge_page*. The fact
    it can't fail isn't just for swap: if split_huge_page returned -ENOMEM
    (instead of the current void) we'd need to roll back the mprotect from the
    middle of it (ideally including undoing the split_vma), which would be a
    big change and in the very wrong direction (it'd likely be simpler not to
    call split_huge_page at all and to teach mprotect and friends to handle
    hugepages instead of rolling them back from the middle). In short the
    very value of split_huge_page is that it can't fail.
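    The "one-liner" is a guard macro along these lines (reconstructed from
    the patch's approach):

    /* Callers that only understand ptes split any huge pmd first; the
     * split cannot fail, so this can be dropped in anywhere. */
    #define split_huge_page_pmd(mm, pmd)                            \
            do {                                                    \
                    pmd_t *____pmd = (pmd);                         \
                    if (unlikely(pmd_trans_huge(*____pmd)))         \
                            __split_huge_page_pmd(mm, ____pmd);     \
            } while (0)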

    The collapsing and madvise(MADV_HUGEPAGE) part will remain separate and
    incremental, and it'll just be a "harmless" addition later if this initial
    part is agreed upon. It also should be noted that, locking-wise, replacing
    regular pages with hugepages is going to be very easy compared to what
    I'm doing below in split_huge_page, as it will only happen when
    page_count(page) matches page_mapcount(page), if we can take the PG_lock
    and mmap_sem in write mode. collapse_huge_page will be a "best effort"
    that (unlike split_huge_page) can fail at the minimal sign of trouble and
    we can try again later. collapse_huge_page will be similar to how KSM
    works, and madvise(MADV_HUGEPAGE) will work similarly to
    madvise(MADV_MERGEABLE).

    The default I like is that transparent hugepages are used at page fault
    time. This can be changed with
    /sys/kernel/mm/transparent_hugepage/enabled. The control knob can be set
    to three values, "always", "madvise", "never", which mean respectively that
    hugepages are always used, or only inside madvise(MADV_HUGEPAGE) regions,
    or never used. /sys/kernel/mm/transparent_hugepage/defrag instead
    controls whether the hugepage allocation should defrag memory aggressively
    "always", only inside "madvise" regions, or "never".

    The pmd_trans_splitting/pmd_trans_huge locking is very solid. The
    put_page (from get_user_page users that can't use mmu notifiers, like
    O_DIRECT) that runs against a __split_huge_page_refcount instead was a
    pain to serialize in a way that would always result in a coherent page
    count for both tail and head. I think my locking solution, with a
    compound_lock taken only after the page_first is valid and is still a
    PageHead, should be safe, but it surely needs review from an SMP race
    point of view. In short, there is currently no existing way to serialize
    the O_DIRECT final put_page against split_huge_page_refcount, so I had to
    invent a new one (O_DIRECT loses knowledge of the mapping status by the
    time gup_fast returns so...). And I didn't want to impact all gup/gup_fast
    users for now; maybe if we change the gup interface substantially we can
    avoid this locking. I admit I didn't think too much about it because
    changing the gup unpinning interface would be invasive.

    If we ignored O_DIRECT we could stick to the existing compound refcounting
    code, by simply adding a get_user_pages_fast_flags(foll_flags) that KVM
    (and any other mmu notifier user) would call without FOLL_GET (and if
    FOLL_GET isn't set we'd just BUG_ON if nobody registered itself in the
    current task's mmu notifier list yet). But O_DIRECT is fundamental for
    decent performance of virtualized I/O on fast storage, so we can't ignore
    it: the race of put_page against split_huge_page_refcount has to be solved
    to achieve a complete hugepage feature for KVM.

    Swap and oom work fine (well, just like with regular pages ;). MMU
    notifiers are handled transparently too, with the exception of the young
    bit on the pmd, which didn't have a range check; but I think KVM will be
    fine because the whole point of hugepages is that EPT/NPT will also use a
    huge pmd when they notice gup returns pages with PageCompound set, so they
    won't care about a range, and there's just the pmd young bit to check in
    that case.

    NOTE: in some cases, if the L2 cache is small, this may slow things down
    and waste memory during COWs because 4M of memory are accessed in a single
    fault instead of 8k (the payoff is that after the COW the program can run
    faster). So we might want to switch copy_huge_page (and clear_huge_page
    too) to non-temporal stores. I also extensively researched ways to avoid
    this cache thrashing with a full prefault logic that would COW in
    8k/16k/32k/64k up to 1M (I can send those patches that fully implemented
    prefault), but I concluded they're not worth it: they add a huge amount of
    additional complexity and they remove all tlb benefits until the full
    hugepage has been faulted in, to save a little bit of memory and some
    cache during app startup, but they still don't improve substantially the
    cache thrashing during startup if the prefault happens in >4k chunks. One
    reason is that those copied 4k pte entries are still mapped on a perfectly
    cache-colored hugepage, so the thrashing is the worst one can generate in
    those copies (COWs of 4k page copies aren't so well colored, so they
    thrash less, but again this results in software running faster after the
    page fault). Those prefault patches allowed things like a pte where
    post-COW pages were local 4k regular anon pages and the not-yet-COWed pte
    entries were pointing in the middle of some hugepage mapped read-only. If
    it doesn't pay off substantially with today's hardware, it will pay off
    even less in the future with larger L2 caches, and the prefault logic
    would bloat the VM a lot. On embedded systems, transparent_hugepage can be
    disabled with sysfs or with the boot command-line parameter
    transparent_hugepage=0 (or transparent_hugepage=2 to restrict hugepages to
    madvise regions), which will ensure not a single hugepage is allocated at
    boot time. It is simple enough to just disable transparent hugepages
    globally and let transparent hugepages be allocated selectively by
    applications in MADV_HUGEPAGE regions (both at page fault time, and, if
    enabled, through collapse_huge_page via the kernel daemon too).

    This patch supports only hugepages mapped in the pmd; archs that have
    smaller hugepages won't be covered by this patch alone. Also some archs,
    like power, have certain tlb limits that prevent mixing different page
    sizes in the same regions, so they won't fit in this framework, which
    requires "graceful fallback" to basic PAGE_SIZE in case of physical memory
    fragmentation. hugetlbfs remains a perfect fit for those because its
    software limits happen to match the hardware limits. hugetlbfs also
    remains a perfect fit for hugepage sizes like 1GByte that cannot be
    expected to be found unfragmented after a certain system uptime, and that
    would be very expensive to defragment with relocation, so they require
    reservation. hugetlbfs is the "reservation way"; the point of transparent
    hugepages is not to have any reservation at all and to maximize the use of
    cache and hugepages at all times automatically.

    Some performance results:

    vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
    memset page fault 1566023
    memset tlb miss 453854
    memset second tlb miss 453321
    random access tlb miss 41635
    random access second tlb miss 41658
    vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
    memset page fault 1566471
    memset tlb miss 453375
    memset second tlb miss 453320
    random access tlb miss 41636
    random access second tlb miss 41637
    vmx andrea # ./largepages3
    memset page fault 1566642
    memset tlb miss 453417
    memset second tlb miss 453313
    random access tlb miss 41630
    random access second tlb miss 41647
    vmx andrea # ./largepages3
    memset page fault 1566872
    memset tlb miss 453418
    memset second tlb miss 453315
    random access tlb miss 41618
    random access second tlb miss 41659
    vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
    vmx andrea # ./largepages3
    memset page fault 2182476
    memset tlb miss 460305
    memset second tlb miss 460179
    random access tlb miss 44483
    random access second tlb miss 44186
    vmx andrea # ./largepages3
    memset page fault 2182791
    memset tlb miss 460742
    memset second tlb miss 459962
    random access tlb miss 43981
    random access second tlb miss 43988

    ============
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>

    #define SIZE (3UL*1024*1024*1024)

    int main()
    {
            char *p = malloc(SIZE), *p2;
            struct timeval before, after;

            /* first touch: page faults dominate */
            gettimeofday(&before, NULL);
            memset(p, 0, SIZE);
            gettimeofday(&after, NULL);
            printf("memset page fault %lu\n",
                   (after.tv_sec-before.tv_sec)*1000000UL +
                   after.tv_usec-before.tv_usec);

            /* already mapped: tlb misses dominate */
            gettimeofday(&before, NULL);
            memset(p, 0, SIZE);
            gettimeofday(&after, NULL);
            printf("memset tlb miss %lu\n",
                   (after.tv_sec-before.tv_sec)*1000000UL +
                   after.tv_usec-before.tv_usec);

            gettimeofday(&before, NULL);
            memset(p, 0, SIZE);
            gettimeofday(&after, NULL);
            printf("memset second tlb miss %lu\n",
                   (after.tv_sec-before.tv_sec)*1000000UL +
                   after.tv_usec-before.tv_usec);

            /* touch one byte per 4k page */
            gettimeofday(&before, NULL);
            for (p2 = p; p2 < p+SIZE; p2 += 4096)
                    *p2 = 0;
            gettimeofday(&after, NULL);
            printf("random access tlb miss %lu\n",
                   (after.tv_sec-before.tv_sec)*1000000UL +
                   after.tv_usec-before.tv_usec);

            gettimeofday(&before, NULL);
            for (p2 = p; p2 < p+SIZE; p2 += 4096)
                    *p2 = 0;
            gettimeofday(&after, NULL);
            printf("random access second tlb miss %lu\n",
                   (after.tv_sec-before.tv_sec)*1000000UL +
                   after.tv_usec-before.tv_usec);

            return 0;
    }
    ============

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Transparent hugepage allocations must be allowed not to invoke kswapd or
    any other kind of indirect reclaim (especially when the defrag sysfs
    control is disabled). It's unacceptable to swap out anonymous pages
    (potentially anonymous transparent hugepages) in order to create new
    transparent hugepages. This is true for the MADV_HUGEPAGE areas too: it
    makes no sense to swap out a KVM virtual machine, making it suffer an
    unbearable slowdown, just so another one with guest physical memory marked
    MADV_HUGEPAGE can run 30% faster on memory-intensive workloads. If a
    transparent hugepage allocation fails, the slowdown is minor and there is
    total fallback, so kswapd should never be asked to swap out memory to
    allow the high order allocation to succeed.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

07 Dec, 2010

1 commit

  • There is a problem that swap pages allocated before the creation of
    a hibernation image can be released and used for storing the contents
    of different memory pages while the image is being saved. Since the
    kernel stored in the image doesn't know about that, it causes memory
    corruption to occur after resume from hibernation, especially on
    systems with relatively small RAM that need to swap often.

    This issue can be addressed by keeping the GFP_IOFS bits clear
    in gfp_allowed_mask during the entire hibernation, including the
    saving of the image, until the system is finally turned off or
    the hibernation is aborted. Unfortunately, for this purpose
    it's necessary to rework the way in which the hibernate and
    suspend code manipulates gfp_allowed_mask.
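    The rework ended up as a pair of helpers along these lines (reconstructed;
    GFP_IOFS is __GFP_IO | __GFP_FS):

    static gfp_t saved_gfp_mask;

    void pm_restrict_gfp_mask(void)
    {
            WARN_ON(!mutex_is_locked(&pm_mutex));
            WARN_ON(saved_gfp_mask);
            saved_gfp_mask = gfp_allowed_mask;
            gfp_allowed_mask &= ~GFP_IOFS;  /* stays clear until power-off */
    }

    void pm_restore_gfp_mask(void)
    {
            WARN_ON(!mutex_is_locked(&pm_mutex));
            if (saved_gfp_mask) {           /* safe even if never restricted */
                    gfp_allowed_mask = saved_gfp_mask;
                    saved_gfp_mask = 0;
            }
    }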

    This change is based on an earlier patch from Hugh Dickins.

    Signed-off-by: Rafael J. Wysocki
    Reported-by: Ondrej Zary
    Acked-by: Hugh Dickins
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: stable@kernel.org

    Rafael J. Wysocki
     

27 Oct, 2010

1 commit

  • Introduce ___GFP_* masks in order for gfp_t to not be mixed with plain
    integers which causes a lot of warnings like the following:

    warning: restricted gfp_t degrades to integer
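    The pattern is to define plain-integer ___GFP_* masks and build the
    sparse-typed flags from them; abridged from the patch as I understand it:

    /* Plain integer GFP bitmasks, for internal use only */
    #define ___GFP_DMA              0x01u
    #define ___GFP_HIGHMEM          0x02u
    #define ___GFP_DMA32            0x04u
    #define ___GFP_WAIT             0x10u
    /* ... */

    /* The public flags keep sparse's __bitwise gfp_t checking intact */
    #define __GFP_DMA       ((__force gfp_t)___GFP_DMA)
    #define __GFP_HIGHMEM   ((__force gfp_t)___GFP_HIGHMEM)
    #define __GFP_DMA32     ((__force gfp_t)___GFP_DMA32)
    #define __GFP_WAIT      ((__force gfp_t)___GFP_WAIT)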

    Signed-off-by: Namhyung Kim
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     

25 May, 2010

2 commits


07 Mar, 2010

3 commits

  • __GFP_NOFAIL was deprecated in dab48dab, so add a comment that no new
    users should be added.

    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • There are quite a few GFP_KERNEL memory allocations made during
    suspend/hibernation and resume that may cause the system to hang, because
    the I/O operations they depend on cannot be completed due to the
    underlying devices being suspended.

    Avoid this problem by clearing the __GFP_IO and __GFP_FS bits in
    gfp_allowed_mask before suspend/hibernation and restoring the original
    values of these bits in gfp_allowed_mask during the subsequent resume.

    [akpm@linux-foundation.org: fix CONFIG_PM=n linkage]
    Signed-off-by: Rafael J. Wysocki
    Reported-by: Maxim Levitsky
    Cc: Sebastian Ott
    Cc: Benjamin Herrenschmidt
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • free_hot_page() is just a wrapper around free_hot_cold_page() with
    parameter 'cold = 0'. After adding a clear comment for
    free_hot_cold_page(), it is reasonable to remove a level of call.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Li Hong
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Cc: KOSAKI Motohiro
    Cc: Americo Wang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Hong
     

23 Sep, 2009

1 commit

  • gcc permitting variable length arrays makes the current construct used for
    BUILD_BUG_ON() useless, as that doesn't produce any diagnostic if the
    controlling expression isn't really constant. Instead, this patch makes
    it so that a bit field gets used here. Consequently, those uses where the
    condition isn't really constant now also need fixing.

    Note that in the gfp.h, kmemcheck.h, and virtio_config.h cases
    MAYBE_BUILD_BUG_ON() really just serves documentation purposes - even if
    the expression is compile time constant (__builtin_constant_p() yields
    true), the array is still deemed of variable length by gcc, and hence the
    whole expression doesn't have the intended effect.
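    The bit-field construct, roughly as merged: a true condition produces a
    negative, and therefore invalid, bit-field width, which is a hard error
    even where gcc would accept a variable-length array:

    /* Force a compilation error if condition is true */
    #define BUILD_BUG_ON(condition) \
            ((void)sizeof(struct { int:-!!(condition); }))

    /* For the documentation-only cases noted above */
    #define MAYBE_BUILD_BUG_ON(cond) \
            ((void)sizeof(char[1 - 2 * !!(cond)]))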

    [akpm@linux-foundation.org: make arch/sparc/include/asm/vio.h compile]
    [akpm@linux-foundation.org: more nonsensical assertions in tpm.c..]
    Signed-off-by: Jan Beulich
    Cc: Andi Kleen
    Cc: Rusty Russell
    Cc: Catalin Marinas
    Cc: "David S. Miller"
    Cc: Rajiv Andrade
    Cc: Mimi Zohar
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Beulich
     

22 Sep, 2009

2 commits


19 Jun, 2009

1 commit


17 Jun, 2009

6 commits

  • * akpm: (182 commits)
    fbdev: bf54x-lq043fb: use kzalloc over kmalloc/memset
    fbdev: *bfin*: fix __dev{init,exit} markings
    fbdev: *bfin*: drop unnecessary calls to memset
    fbdev: bfin-t350mcqb-fb: drop unused local variables
    fbdev: blackfin has __raw I/O accessors, so use them in fb.h
    fbdev: s1d13xxxfb: add accelerated bitblt functions
    tcx: use standard fields for framebuffer physical address and length
    fbdev: add support for handoff from firmware to hw framebuffers
    intelfb: fix a bug when changing video timing
    fbdev: use framebuffer_release() for freeing fb_info structures
    radeon: P2G2CLK_ALWAYS_ONb tested twice, should 2nd be P2G2CLK_DAC_ALWAYS_ONb?
    s3c-fb: CPUFREQ frequency scaling support
    s3c-fb: fix resource releasing on error during probing
    carminefb: fix possible access beyond end of carmine_modedb[]
    acornfb: remove fb_mmap function
    mb862xxfb: use CONFIG_OF instead of CONFIG_PPC_OF
    mb862xxfb: restrict compliation of platform driver to PPC
    Samsung SoC Framebuffer driver: add Alpha Channel support
    atmel-lcdc: fix pixclock upper bound detection
    offb: use framebuffer_alloc() to allocate fb_info struct
    ...

    Manually fix up conflicts due to kmemcheck in mm/slab.c

    Linus Torvalds
     
  • …lags passed to the page allocator

    This simplifies the code in gfp_zone() and also keeps the ability of the
    compiler to use constant folding to get rid of gfp_zone processing.

    The lookup of the zone is done using a bitfield stored in an integer. So
    the code in gfp_zone is a simple extraction of bits from a constant
    bitfield. The compiler generates a load of a constant into a register
    and then performs a shift and mask operation to get the zone from a gfp_t.
    No cachelines are touched and no branches have to be predicted.

    We are doing some macro tricks here to convince the compiler to always do
    the constant folding if possible.
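    A condensed sketch of the mechanism (the real table enumerates every
    zone-modifier combination):

    /* ZONES_SHIFT bits per zone-modifier combination, in one constant */
    #define GFP_ZONE_TABLE ( \
            (ZONE_NORMAL << 0 * ZONES_SHIFT) \
            | (OPT_ZONE_DMA << (__force int)__GFP_DMA * ZONES_SHIFT) \
            | (OPT_ZONE_HIGHMEM << (__force int)__GFP_HIGHMEM * ZONES_SHIFT) \
            | (OPT_ZONE_DMA32 << (__force int)__GFP_DMA32 * ZONES_SHIFT) \
            /* ... movable combinations elided ... */ \
    )

    static inline enum zone_type gfp_zone(gfp_t flags)
    {
            int bit = (__force int)(flags & GFP_ZONEMASK);

            /* constant-foldable: one shift and one mask, no branches */
            return (GFP_ZONE_TABLE >> (bit * ZONES_SHIFT)) &
                    ((1 << ZONES_SHIFT) - 1);
    }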

    Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Reviewed-by: Mel Gorman <mel@csn.ul.ie>
    Cc: Nick Piggin <nickpiggin@yahoo.com.au>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Christoph Lameter
     
  • Currently, the following scenario appears to be possible in theory:

    * Tasks are frozen for hibernation or suspend.
    * Free pages are almost exhausted.
    * Certain piece of code in the suspend code path attempts to allocate
    some memory using GFP_KERNEL and allocation order less than or
    equal to PAGE_ALLOC_COSTLY_ORDER.
    * __alloc_pages_internal() cannot find a free page so it invokes the
    OOM killer.
    * The OOM killer attempts to kill a task, but the task is frozen, so
    it doesn't die immediately.
    * __alloc_pages_internal() jumps to 'restart', unsuccessfully tries
    to find a free page and invokes the OOM killer.
    * No progress can be made.

    Although it is now hard to trigger during hibernation due to the memory
    shrinking carried out by the hibernation code, it is theoretically
    possible to trigger during suspend after the memory shrinking has been
    removed from that code path. Moreover, since memory allocations are
    going to be used for the hibernation memory shrinking, it will be even
    more likely to happen during hibernation.

    To prevent it from happening, introduce the oom_killer_disabled switch
    that will cause __alloc_pages_internal() to fail in the situations in
    which the OOM killer would have been called and make the freezer set
    this switch after tasks have been successfully frozen.
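    The switch and its freezer hooks are tiny, roughly as added to gfp.h;
    the allocator then fails with NULL instead of invoking the OOM killer
    while the switch is set:

    extern bool oom_killer_disabled;

    static inline void oom_killer_disable(void)
    {
            oom_killer_disabled = true;     /* freezer: after tasks freeze */
    }

    static inline void oom_killer_enable(void)
    {
            oom_killer_disabled = false;    /* freezer: when tasks are thawed */
    }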

    [akpm@linux-foundation.org: be nicer to the namespace]
    Signed-off-by: Rafael J. Wysocki
    Cc: Fengguang Wu
    Cc: David Rientjes
    Acked-by: Pavel Machek
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Callers of alloc_pages_node() can optionally specify -1 as a node to mean
    "allocate from the current node". However, a number of the callers in
    fast paths know for a fact their node is valid. To avoid a comparison and
    branch, this patch adds alloc_pages_exact_node() that only checks the nid
    with VM_BUG_ON(). Callers that know their node is valid are then
    converted.
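    Side by side with the tolerant variant, the new helper is roughly:

    static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
                                                unsigned int order)
    {
            /* unknown node means current node */
            if (nid < 0)
                    nid = numa_node_id();
            return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
    }

    static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
                                                      unsigned int order)
    {
            /* debug-only assertion instead of a runtime branch */
            VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
            return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
    }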

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Pekka Enberg
    Acked-by: Paul Mundt [for the SLOB NUMA bits]
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • No user of the allocator API should be passing in an order >= MAX_ORDER
    but we check for it on each and every allocation. Delete this check and
    make it a VM_BUG_ON check further down the call path.
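    With the akpm adjustment noted below, the check that remains in the slow
    path is a fragment like:

    /* slow path only; the fast path no longer checks order at all */
    if (order >= MAX_ORDER) {
            WARN_ON_ONCE(!(gfp_mask & __GFP_NOWARN));
            return NULL;
    }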

    [akpm@linux-foundation.org: s/VM_BUG_ON/WARN_ON_ONCE/]
    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The start of a large patch series to clean up and optimise the page
    allocator.

    The performance improvements are in a wide range depending on the exact
    machine but the results I've seen so far are approximately:

    kernbench: 0 to 0.12% (elapsed time)
    0.49% to 3.20% (sys time)
    aim9: -4% to 30% (for page_test and brk_test)
    tbench: -1% to 4%
    hackbench: -2.5% to 3.45% (mostly within the noise though)
    netperf-udp -1.34% to 4.06% (varies between machines a bit)
    netperf-tcp -0.44% to 5.22% (varies between machines a bit)

    I don't have sysbench figures at hand, but previously they were within the
    -0.5% to 2% range.

    On netperf, the client and server were bound to opposite-numbered CPUs to
    maximise the problems with cache line bouncing of the struct pages so I
    expect different people to report different results for netperf depending
    on their exact machine and how they ran the test (different machines, same
    cpus client/server, shared cache but two threads client/server, different
    socket client/server etc).

    I also measured the vmlinux sizes for a single x86-based config with
    CONFIG_DEBUG_INFO enabled but not CONFIG_DEBUG_VM. The core of the
    .config is based on the Debian Lenny kernel config so I expect it to be
    reasonably typical.

    This patch:

    __alloc_pages_internal is the core page allocator function but essentially
    it is an alias of __alloc_pages_nodemask. Naming a publicly available and
    exported function "internal" is also a bit ugly. This patch renames
    __alloc_pages_internal() to __alloc_pages_nodemask() and deletes the old
    nodemask function.

    Warning - This patch renames an exported symbol. No in-tree kernel driver
    is affected, but external drivers calling __alloc_pages_internal() should
    change the call to __alloc_pages_nodemask() without any alteration of
    parameters.
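    After the rename the old entry point is just an inline wrapper, roughly:

    static inline struct page *
    __alloc_pages(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist)
    {
            return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
    }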

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

15 Jun, 2009

3 commits

  • Conflicts:
    MAINTAINERS

    Signed-off-by: Vegard Nossum

    Vegard Nossum
     
  • This adds support for tracking the initializedness of memory that
    was allocated with the page allocator. Highmem requests are not
    tracked.

    Cc: Dave Hansen
    Acked-by: Pekka Enberg

    [build fix for !CONFIG_KMEMCHECK]
    Signed-off-by: Ingo Molnar

    [rebased for mainline inclusion]
    Signed-off-by: Vegard Nossum

    Vegard Nossum
     
  • With kmemcheck enabled, the slab allocator needs to do this:

    1. Tell kmemcheck to allocate the shadow memory which stores the status of
    each byte in the allocation proper, e.g. whether it is initialized or
    uninitialized.
    2. Tell kmemcheck which parts of memory that should be marked uninitialized.
    There are actually a few more states, such as "not yet allocated" and
    "recently freed".

    If a slab cache is set up using the SLAB_NOTRACK flag, it will never return
    memory that can take page faults because of kmemcheck.

    If a slab cache is NOT set up using the SLAB_NOTRACK flag, callers can still
    request memory with the __GFP_NOTRACK flag. This does not prevent the page
    faults from occurring, however, but marks the object in question as being
    initialized so that no warnings will ever be produced for this object.

    In addition to (and in contrast to) __GFP_NOTRACK, the
    __GFP_NOTRACK_FALSE_POSITIVE flag indicates that the allocation should
    not be tracked _because_ it would produce a false positive. Their values
    are identical, but need not be so in the future (for example, we could now
    enable/disable false positives with a config option).
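    The flag definitions are small; roughly as merged:

    #ifdef CONFIG_KMEMCHECK
    #define __GFP_NOTRACK   ((__force gfp_t)0x200000u) /* no kmemcheck tracking */
    #else
    #define __GFP_NOTRACK   ((__force gfp_t)0)
    #endif

    /* Same value today, but a distinct name so false-positive suppression
     * can be configured separately later */
    #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)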

    Parts of this patch were contributed by Pekka Enberg but merged for
    atomicity.

    Signed-off-by: Vegard Nossum
    Signed-off-by: Pekka Enberg
    Signed-off-by: Ingo Molnar

    [rebased for mainline inclusion]
    Signed-off-by: Vegard Nossum

    Vegard Nossum
     

12 Jun, 2009

1 commit

  • As explained by Benjamin Herrenschmidt:

    Oh and btw, your patch alone doesn't fix powerpc, because it's missing
    a whole bunch of GFP_KERNEL's in the arch code... You would have to
    grep the entire kernel for things that check slab_is_available() and
    even then you'll be missing some.

    For example, slab_is_available() didn't always exist, and so in the
    early days on powerpc, we used a mem_init_done global that is set from
    mem_init() (not perfect but works in practice). And we still have code
    using that to do the test.

    Therefore, mask out __GFP_WAIT, __GFP_IO, and __GFP_FS in the slab allocators
    in early boot code to avoid enabling interrupts.
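    Conceptually (this sketch uses the gfp_allowed_mask formulation the fix
    soon evolved into; the first version used a slab-local mask, and the
    helper name below is illustrative):

    /* GFP flags that are safe to honor before boot is far enough along */
    #define GFP_BOOT_MASK \
            (__GFP_BITS_MASK & ~(__GFP_WAIT | __GFP_IO | __GFP_FS))

    gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;

    /* applied at the top of the allocation paths */
    static inline gfp_t boot_safe_gfp(gfp_t flags)
    {
            return flags & gfp_allowed_mask;
    }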

    Signed-off-by: Pekka Enberg

    Pekka Enberg
     

13 Mar, 2009

1 commit

  • Impact: cleanup, potential bugfix

    Not sure what changed to expose this, but clearly numa_node_id()
    doesn't belong in mmzone.h (the inline in gfp.h is probably overkill, too).

    In file included from include/linux/topology.h:34,
    from arch/x86/mm/numa.c:2:
    /home/rusty/patches-cpumask/linux-2.6/arch/x86/include/asm/topology.h:64:1: warning: "numa_node_id" redefined
    In file included from include/linux/topology.h:32,
    from arch/x86/mm/numa.c:2:
    include/linux/mmzone.h:770:1: warning: this is the location of the previous definition

    Signed-off-by: Rusty Russell
    Cc: Mike Travis
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Rusty Russell
     

07 Jan, 2009

1 commit

  • GFP_HIGHUSER_PAGECACHE is just an alias for GFP_HIGHUSER_MOVABLE, making
    that harder to track down: remove it, and its out-of-work brothers
    GFP_NOFS_PAGECACHE and GFP_USER_PAGECACHE.

    Since we're making that improvement to hotremove_migrate_alloc(), I think
    we can now also remove one of the "o"s from its comment.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

25 Jul, 2008

2 commits

  • alloc_pages_exact() is similar to alloc_pages(), except that it allocates
    the minimum number of pages to fulfill the request. This is useful if you
    want to allocate a very large buffer that is slightly larger than an even
    power-of-two number of pages. In that case, alloc_pages() will waste a
    lot of memory.

    I have a video driver that wants to allocate a 5MB buffer. alloc_pages()
    will waste 3MB of physically-contiguous memory.
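    The implementation allocates the next power-of-two size, splits it into
    order-0 pages, and frees the unused tail (close to the code as merged):

    void *alloc_pages_exact(size_t size, gfp_t gfp_mask)
    {
            unsigned int order = get_order(size);
            unsigned long addr = __get_free_pages(gfp_mask, order);

            if (addr) {
                    unsigned long alloc_end = addr + (PAGE_SIZE << order);
                    unsigned long used = addr + PAGE_ALIGN(size);

                    /* turn the high-order block into single pages ... */
                    split_page(virt_to_page(addr), order);
                    /* ... and give back the pages beyond the request */
                    while (used < alloc_end) {
                            free_page(used);
                            used += PAGE_SIZE;
                    }
            }
            return (void *)addr;
    }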

    Signed-off-by: Timur Tabi
    Cc: Andi Kleen
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Timur Tabi
     
  • Two zonelist patch series largely rewrote __alloc_pages(). Now it is just
    a wrapper function; inlining it will save a function call.

    [akpm@linux-foundation.org: export __alloc_pages_internal]
    Cc: Lee Schermerhorn
    Cc: Mel Gorman
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

29 Apr, 2008

1 commit

  • The definition and use of __GFP_REPEAT, __GFP_NOFAIL and __GFP_NORETRY in the
    core VM have somewhat differing comments as to their actual semantics.
    Annoyingly, the flags definition has inline and header comments, which might
    be interpreted as not being equivalent. Just add references to the header
    comments in the inline ones so they don't go out of sync in the future. In
    their use in __alloc_pages() clarify that the current implementation treats
    low-order allocations and __GFP_REPEAT allocations as distinct cases.

    To clarify, the flags' semantics are:

    __GFP_NORETRY means try no harder than one run through __alloc_pages

    __GFP_REPEAT means __GFP_NOFAIL

    __GFP_NOFAIL means repeat forever

    order <= PAGE_ALLOC_COSTLY_ORDER means __GFP_NOFAIL

    Signed-off-by: Nishanth Aravamudan
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     

28 Apr, 2008

5 commits

  • This hack, "base = MAX_NR_ZONES", at __GFP_THISNODE was used for the old
    zonelists.

    Now that the new zonelist[] has a list for __GFP_THISNODE, this hack is
    incorrect and should be removed.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • The MPOL_BIND policy creates a zonelist that is used for allocations
    controlled by that mempolicy. As the per-node zonelist is already being
    filtered based on a zone id, this patch adds a version of __alloc_pages() that
    takes a nodemask for further filtering. This eliminates the need for
    MPOL_BIND to create a custom zonelist.

    A positive benefit of this is that allocations using MPOL_BIND now use the
    local node's distance-ordered zonelist instead of a custom node-id-ordered
    zonelist. I.e., pages will be allocated from the closest allowed node with
    available memory.

    [Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments]
    [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask]
    [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework]
    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently a node has two sets of zonelists, one for each zone type in the
    system and a second set for GFP_THISNODE allocations. Based on the zones
    allowed by a gfp mask, one of these zonelists is selected. All of these
    zonelists consume memory and occupy cache lines.

    This patch replaces the multiple zonelists per-node with two zonelists. The
    first contains all populated zones in the system, ordered by distance, for
    fallback allocations when the target/preferred node has no free pages. The
    second contains all populated zones in the node suitable for GFP_THISNODE
    allocations.

    An iterator macro called for_each_zone_zonelist() is introduced that
    iterates through each zone allowed by the GFP flags in the selected
    zonelist.
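    Typical usage of the iterator, sketched (the zoneref cursor type comes
    from the same patch series; the surrounding variables are assumed):

    struct zoneref *z;
    struct zone *zone;

    /* walk every zone the GFP mask allows, in fallback (distance) order */
    for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
            if (zone_watermark_ok(zone, order, mark, classzone_idx,
                                  alloc_flags))
                    break;  /* allocate from this zone */
    }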

    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Introduce a node_zonelist() helper function. It is used to lookup the
    appropriate zonelist given a node and a GFP mask. The patch on its own is a
    cleanup but it helps clarify parts of the two-zonelist-per-node patchset. If
    necessary, it can be merged with the next patch in this set without problems.
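    The helper is tiny; shown here in its eventual two-zonelist form together
    with its companion gfp_zonelist():

    static inline int gfp_zonelist(gfp_t flags)
    {
            if (NUMA_BUILD && unlikely(flags & __GFP_THISNODE))
                    return 1;       /* the node-local zonelist */
            return 0;               /* the full, distance-ordered zonelist */
    }

    static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
    {
            return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
    }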

    Reviewed-by: Christoph Lameter
    Signed-off-by: Mel Gorman
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Migrate flags must be set on slab creation as agreed upon when the antifrag
    logic was reviewed. Otherwise some slabs of a slabcache will end up in the
    unmovable and others in the reclaimable section depending on which flag was
    active when a new slab page was allocated.

    This likely slid in somehow when antifrag was merged. Remove it.

    The buffer_heads are always allocated with __GFP_RECLAIMABLE because the
    SLAB_RECLAIM_ACCOUNT option is set. The set_migrateflags() never had any
    effect there.

    Radix tree allocations are not directly reclaimable but they are allocated
    with __GFP_RECLAIMABLE set on each allocation. We now set
    SLAB_RECLAIM_ACCOUNT on radix tree slab creation making sure that radix
    tree slabs are consistently placed in the reclaimable section. Radix tree
    slabs will also be accounted as such.

    There is then no user left of set_migrateflags(). So remove it.
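    For instance, the radix tree node cache creation mentioned above then
    looks roughly like:

    radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
                            sizeof(struct radix_tree_node), 0,
                            SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
                            radix_tree_node_ctor);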

    Signed-off-by: Christoph Lameter
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

14 Feb, 2008

1 commit


06 Feb, 2008

1 commit

  • - Add comments explaining how drain_pages() works.

    - Eliminate useless functions.

    - Rename drain_all_local_pages to drain_all_pages(). It drains
    all pages, not only those of the local processor.

    - Eliminate useless interrupt off / on sequences. drain_pages()
    disables interrupts on its own. The execution thread is
    pinned to a processor by the caller. So there is no need to
    disable interrupts.

    - Put the drain_all_pages() declaration in gfp.h and remove the
    declarations from suspend.h and from mm/memory_hotplug.c.

    - Make software suspend call drain_all_pages(). Draining only
    the processor-local pages may not be the right approach if
    software suspend wants to support SMP. If it calls drain_all_pages
    then we can make drain_pages() static.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Christoph Lameter
    Acked-by: Mel Gorman
    Cc: "Rafael J. Wysocki"
    Cc: Daniel Walker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

17 Oct, 2007

3 commits

  • This patch provides fragmentation avoidance statistics via /proc/pagetypeinfo.
    The information is collected only on request so there is no runtime overhead.
    The statistics are in three parts:

    The first part prints information on the size of blocks that pages are
    being grouped on and looks like

    Page block order: 10
    Pages per block: 1024

    The second part is a more detailed version of /proc/buddyinfo and looks like

    Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
    Node 0, zone DMA, type Unmovable 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA, type Reclaimable 1 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA, type Movable 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA, type Reserve 0 4 4 0 0 0 0 1 0 1 0
    Node 0, zone Normal, type Unmovable 111 8 4 4 2 3 1 0 0 0 0
    Node 0, zone Normal, type Reclaimable 293 89 8 0 0 0 0 0 0 0 0
    Node 0, zone Normal, type Movable 1 6 13 9 7 6 3 0 0 0 0
    Node 0, zone Normal, type Reserve 0 0 0 0 0 0 0 0 0 0 4

    The third part looks like

    Number of blocks type Unmovable Reclaimable Movable Reserve
    Node 0, zone DMA 0 1 2 1
    Node 0, zone Normal 3 17 94 4

    To walk the zones within a node with interrupts disabled, walk_zones_in_node()
    is introduced and shared between /proc/buddyinfo, /proc/zoneinfo and
    /proc/pagetypeinfo to reduce code duplication. It seems specific to what
    vmstat.c requires but could be broken out as a general utility function in
    mmzone.c if there were other potential users.

    Signed-off-by: Mel Gorman
    Acked-by: Andy Whitcroft
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patch marks a number of allocations that are either short-lived, such
    as network buffers, or reclaimable, such as inode allocations. When
    something like updatedb is called, long-lived and unmovable kernel
    allocations tend to be spread throughout the address space, which
    increases fragmentation.

    This patch groups these allocations together as much as possible by adding a
    new MIGRATE_TYPE. The MIGRATE_RECLAIMABLE type is for allocations that can be
    reclaimed on demand, but not moved, i.e. they can be migrated by deleting
    them and re-reading the information from elsewhere.
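    The mapping from GFP bits to migrate type is a two-bit encode, roughly:

    /* Convert GFP flags to their corresponding migrate type */
    static inline int allocflags_to_migratetype(gfp_t gfp_flags)
    {
            if (unlikely(page_group_by_mobility_disabled))
                    return MIGRATE_UNMOVABLE;

            /* Group based on mobility */
            return (((gfp_flags & __GFP_MOVABLE) != 0) << 1) |
                    ((gfp_flags & __GFP_RECLAIMABLE) != 0);
    }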

    Signed-off-by: Mel Gorman
    Cc: Andy Whitcroft
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The function of GFP_LEVEL_MASK seems to be unclear. In order to clear up
    the mystery we get rid of it and replace GFP_LEVEL_MASK with 3 sets of GFP
    flags:

    GFP_RECLAIM_MASK Flags used to control page allocator reclaim behavior.

    GFP_CONSTRAINT_MASK Flags used to limit where allocations can occur.

    GFP_SLAB_BUG_MASK Flags that the slab allocator BUG()s on.

    These replace the uses of GFP_LEVEL mask in the slab allocators and in
    vmalloc.c.

    The use of the flags not included in these sets may occur as a result of a
    slab allocation standing in for a page allocation when constructing scatter
    gather lists. Extraneous flags are cleared and not passed through to the
    page allocator. __GFP_MOVABLE/RECLAIMABLE, __GFP_COLD and __GFP_COMP will
    now be ignored if passed to a slab allocator.

    Change the allocation of allocator meta data in SLAB and vmalloc to not
    pass through flags listed in GFP_CONSTRAINT_MASK. SLAB already removes the
    __GFP_THISNODE flag for such allocations. Generalize that to also cover
    vmalloc. The use of GFP_CONSTRAINT_MASK also includes __GFP_HARDWALL.

    The impact of allocator metadata placement on access latency to the
    cachelines of the object itself is minimal since metadata is only
    referenced on alloc and free. The attempt is still made to place the meta
    data optimally but we consistently allow fallback both in SLAB and vmalloc
    (SLUB does not need to allocate metadata like that).

    Allocator metadata may serve multiple in-kernel users and thus should not
    be subject to the limitations arising from a single allocation context.
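    The three masks, approximately as defined by the patch:

    /* Flags the page allocator's reclaim logic responds to */
    #define GFP_RECLAIM_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS|\
                            __GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\
                            __GFP_NORETRY|__GFP_NOMEMALLOC)

    /* Flags that constrain where an allocation may be placed */
    #define GFP_CONSTRAINT_MASK (__GFP_HARDWALL|__GFP_THISNODE)

    /* Flags the slab allocators BUG() on */
    #define GFP_SLAB_BUG_MASK (__GFP_DMA32|__GFP_HIGHMEM|~__GFP_BITS_MASK)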

    [akpm@linux-foundation.org: fix fallback_alloc()]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter