07 Mar, 2008

5 commits

  • NUMA slab allocator cpu migration bugfix

    The NUMA slab allocator (specifically, cache_alloc_refill)
    does not refresh its local copies of which cpu and which
    numa node it is on when it drops and reacquires the irq
    block that it inherited from its caller. As a result,
    those values become stale if the process was migrated to
    another numa node while the irq block was dropped.

    The solution is to make cache_alloc_refill reload these
    variables whenever it drops and reacquires the irq block.
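
    A minimal sketch of the pattern (illustrative only, not the literal diff;
    the surrounding code is elided):

    /* in cache_alloc_refill(), once the irq block is reacquired */
    local_irq_disable();
    node = numa_node_id();          /* reload: the task may have migrated */
    ac = cpu_cache_get(cachep);     /* reload the per-cpu array cache too */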

    The error is very difficult to hit. When it does occur,
    one gets the following oops + stack traceback bits in
    check_spinlock_acquired:

    kernel BUG at mm/slab.c:2417
    cache_alloc_refill+0xe6
    kmem_cache_alloc+0xd0
    ...

    This patch was developed against 2.6.23, then ported to and
    compile-tested only against 2.6.25-rc4.

    Signed-off-by: Joe Korty
    Signed-off-by: Christoph Lameter

    Joe Korty
     
  • SLUB should pack even small objects nicely into cachelines if that is what
    has been asked for. Use the same algorithm as SLAB for this.

    The effect of this patch for a system with a cacheline size of 64
    bytes is that the 24 byte sized slab caches will now put exactly
    2 objects into a cacheline instead of 3 with some overlap into
    the next cacheline. This reduces the object density in a 4k slab
    from 170 to 128 objects (same as SLAB).
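
    The arithmetic behind those numbers, as a small standalone sketch (the
    alignment step mirrors the SLAB-style hwcache alignment described above;
    the real kernel code differs in detail):

    #include <stdio.h>

    int main(void)
    {
            unsigned int object = 24, cacheline = 64, slab = 4096;

            /* Tightly packed, some 24-byte objects straddle cachelines. */
            printf("packed : %u objects per %u-byte slab\n", slab / object, slab);

            /* SLAB-style hwcache alignment: start from the cacheline size and
             * halve it while the object still fits in half of it (64 -> 32). */
            unsigned int align = cacheline;
            while (object <= align / 2)
                    align /= 2;

            unsigned int aligned = (object + align - 1) & ~(align - 1);
            printf("aligned: %u-byte objects, %u per slab\n",
                   aligned, slab / aligned);    /* 32 bytes, 128 objects */

            return 0;
    }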

    Signed-off-by: Nick Piggin
    Signed-off-by: Christoph Lameter

    Nick Piggin
     
  • Make them all use angle brackets and the directory name.

    Acked-by: Pekka Enberg
    Signed-off-by: Joe Perches
    Signed-off-by: Christoph Lameter

    Joe Perches
     
    The NUMA fallback logic should be passing local_flags to kmem_get_pages()
    and not simply the flags passed in.

    Reviewed-by: Pekka Enberg
    Signed-off-by: Christoph Lameter

    Christoph Lameter
     
  • The remote frees are in the freelist of the page and not in the
    percpu freelist.

    Reviewed-by: Pekka Enberg
    Signed-off-by: Christoph Lameter

    Christoph Lameter
     

05 Mar, 2008

20 commits

  • Adam Litke noticed that currently we grow the hugepage pool independent of any
    cpuset the running process may be in, but when shrinking the pool, the cpuset
    is checked. This leads to inconsistency when shrinking the pool in a
    restricted cpuset -- an administrator may have been able to grow the pool on a
    node restricted by a containing cpuset, but they cannot shrink it there.

    There are two options: either prevent growing of the pool outside of the
    cpuset or allow shrinking outside of the cpuset. From previous discussions
    on linux-mm, /proc/sys/vm/nr_hugepages is an administrative interface that
    should not be restricted by cpusets. So allow shrinking the pool by removing
    pages from nodes outside of current's cpuset.

    Signed-off-by: Nishanth Aravamudan
    Acked-by: Adam Litke
    Cc: William Irwin
    Cc: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: Paul Jackson
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     
  • A hugetlb reservation may be inadequately backed in the event of racing
    allocations and frees when utilizing surplus huge pages. Consider the
    following series of events in processes A and B:

    A) Allocates some surplus pages to satisfy a reservation
    B) Frees some huge pages
    A) A notices the extra free pages and drops hugetlb_lock to free some of
       its surplus pages back to the buddy allocator.
    B) Allocates some huge pages
    A) Reacquires hugetlb_lock and returns from gather_surplus_huge_pages()

    Avoid this by committing the reservation after pages have been allocated but
    before dropping the lock to free excess pages. For parity, release the
    reservation in return_unused_surplus_pages().
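
    A hedged sketch of that ordering (abbreviated, not the exact diff):

    spin_lock(&hugetlb_lock);
    /* ... surplus pages have been allocated to back the reservation ... */
    resv_huge_pages += delta;      /* commit while the lock is still held */
    spin_unlock(&hugetlb_lock);    /* only now give any excess surplus
                                    * pages back to the buddy allocator   */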

    This patch also corrects the cpuset_mems_nr() error path in
    hugetlb_acct_memory(). If the cpuset check fails, uncommit the
    reservation, but also be sure to return any surplus huge pages that may
    have been allocated to back the failed reservation.

    Thanks to Andy Whitcroft for discovering this.

    Signed-off-by: Adam Litke
    Cc: Mel Gorman
    Cc: Andy Whitcroft
    Cc: Dave Hansen
    Cc: William Lee Irwin III
    Cc: Andy Whitcroft
    Cc: Mel Gorman
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     
  • While testing force_empty, during an exit_mmap, __mem_cgroup_remove_list
    called from mem_cgroup_uncharge_page oopsed on a NULL pointer in the lru list.
    I couldn't see what racing tasks on other cpus were doing, but surmise that
    another must have been in mem_cgroup_charge_common on the same page, between
    its unlock_page_cgroup and spin_lock_irqsave near done (thanks to that kzalloc
    which I'd almost changed to a kmalloc).

    Normally such a race cannot happen, the ref_cnt prevents it, the final
    uncharge cannot race with the initial charge. But force_empty buggers the
    ref_cnt, that's what it's all about; and thereafter forced pages are
    vulnerable to races such as this (just think of a shared page also mapped
    into an mm of a mem_cgroup other than the one just emptied). And they remain
    vulnerable until they're freed, perhaps indefinitely later.

    This patch just fixes the oops by moving the unlock_page_cgroups down below
    adding to and removing from the list (only possible given the previous patch);
    and while we're at it, we might as well make it an invariant that
    page->page_cgroup is always set while pc is on lru.

    But this behaviour of force_empty seems highly unsatisfactory to me: why have
    a ref_cnt if we always have to cope with it being violated (as in the earlier
    page migration patch). We may prefer force_empty to move pages to an orphan
    mem_cgroup (could be the root, but better not), from which other cgroups could
    recover them; we might need to reverse the locking again; but no time now for
    such concerns.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • As for force_empty, though this may not be the main topic here,
    mem_cgroup_force_empty_list() can be implemented more simply. It is possible to
    make the function just call mem_cgroup_uncharge_page() instead of releasing
    page_cgroups by itself. The tip is to call get_page() before invoking
    mem_cgroup_uncharge_page(), so the page won't be released during this
    function.
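
    Roughly (an illustrative sketch of that shape, not the exact patch):

    while (!list_empty(list)) {
            pc = list_entry(list->prev, struct page_cgroup, lru);
            page = pc->page;
            get_page(page);                  /* keep the page pinned...    */
            spin_unlock_irqrestore(&mz->lru_lock, flags);
            mem_cgroup_uncharge_page(page);  /* ...while it is uncharged   */
            put_page(page);
            spin_lock_irqsave(&mz->lru_lock, flags);
    }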

    Kamezawa-san points out that by the time mem_cgroup_uncharge_page() uncharges,
    the page might have been reassigned to an lru of a different mem_cgroup, and
    now be emptied from that; but Hugh claims that's okay, the end state is the
    same as when it hasn't gone to another list.

    And once force_empty stops taking lock_page_cgroup within mz->lru_lock,
    mem_cgroup_move_lists() can be simplified to take mz->lru_lock directly while
    holding page_cgroup lock (but still has to use try_lock_page_cgroup).

    Signed-off-by: Hirokazu Takahashi
    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hirokazu Takahashi
     
  • Ever since the VM_BUG_ON(page_get_page_cgroup(page)) (now Bad page state) went
    into page freeing, I've hit it from time to time in testing on some machines,
    sometimes only after many days. Recently found a machine which could usually
    produce it within a few hours, which got me there at last.

    The culprit is mem_cgroup_move_lists, whose locking is inadequate; and the
    arrangement of structures was such that you got page_cgroups from the lru list
    neatly put on to SLUB's freelist. Kamezawa-san identified the same hole
    independently.

    The main problem was that it was missing the lock_page_cgroup it needs to
    safely page_get_page_cgroup; but it's tricky to go beyond that too, and I
    couldn't do it with SLAB_DESTROY_BY_RCU as I'd expected. See the code for
    comments on the constraints.

    This patch immediately gets replaced by a simpler one from Hirokazu-san; but
    is it just foolish pride that tells me to put this one on record, in case we
    need to come back to it later?

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • mem_cgroup_uncharge_page does css_put on the mem_cgroup before uncharging from
    it, and before removing page_cgroup from one of its lru lists: isn't there a
    danger that struct mem_cgroup memory could be freed and reused before
    completing that, so corrupting something? Never seen it, and for all I know
    there may be other constraints which make it impossible; but let's be
    defensive and reverse the ordering there.

    mem_cgroup_force_empty_list is safe because there's an extra css_get around
    all its works; but even so, change its ordering the same way round, to help
    get in the habit of doing it like this.

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove clear_page_cgroup: it's an unhelpful helper, see for example how
    mem_cgroup_uncharge_page had to unlock_page_cgroup just in order to call it
    (serious races from that? I'm not sure).

    Once that's gone, you can see it's pointless for page_cgroup's ref_cnt to be
    atomic: it's always manipulated under lock_page_cgroup, except where
    force_empty unilaterally reset it to 0 (and how does uncharge's
    atomic_dec_and_test protect against that?).

    Simplify this page_cgroup locking: if you've got the lock and the pc is
    attached, then the ref_cnt must be positive: VM_BUG_ONs to check that, and to
    check that pc->page matches page (we're on the way to finding why sometimes it
    doesn't, but this patch doesn't fix that).
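
    For example (a hedged sketch of the invariant being asserted):

    lock_page_cgroup(page);
    pc = page_get_page_cgroup(page);
    if (pc) {
            VM_BUG_ON(pc->page != page);     /* pc must point back at page */
            VM_BUG_ON(pc->ref_cnt <= 0);     /* attached implies a ref     */
    }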

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • More cleanup to memcontrol.c, this time changing some of the code generated.
    Let the compiler decide what to inline (except for page_cgroup_locked which is
    only used when CONFIG_DEBUG_VM): the __always_inline on lock_page_cgroup etc.
    was quite a waste since bit_spin_lock etc. are inlines in a header file; made
    mem_cgroup_force_empty and mem_cgroup_write_strategy static.

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Sorry, before getting down to more important changes, I'd like to do some
    cleanup in memcontrol.c. This patch doesn't change the code generated, but
    cleans up whitespace, moves up a double declaration, removes an unused enum,
    removes void returns, removes misleading comments, that kind of thing.

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Nothing uses mem_cgroup_uncharge apart from mem_cgroup_uncharge_page, (a
    trivial wrapper around it) and mem_cgroup_end_migration (which does the same
    as mem_cgroup_uncharge_page). And it often ends up having to lock just to let
    its caller unlock. Remove it (but leave the silly locking until a later
    patch).

    Moved mem_cgroup_cache_charge next to mem_cgroup_charge in memcontrol.h.

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • My memcgroup patch to fix hang with shmem/tmpfs added NULL page handling to
    mem_cgroup_charge_common. It seemed convenient at the time, but hard to
    justify now: there's a perfectly appropriate swappage to charge and uncharge
    instead, this is not on any hot path through shmem_getpage, and no performance
    hit was observed from the slight extra overhead.

    So revert that NULL page handling from mem_cgroup_charge_common; and make it
    clearer by bringing page_cgroup_assign_new_page_cgroup into its body - that
    was a helper I found more of a hindrance to understanding.

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Replace free_hot_cold_page's VM_BUG_ON(page_get_page_cgroup(page)) by a "Bad
    page state" and clear: most users don't have CONFIG_DEBUG_VM on, and if it
    were set here, it'd likely cause corruption when the page is reused.

    Don't use page_assign_page_cgroup to clear it: that should be private to
    memcontrol.c, and always called with the lock taken; and memmap_init_zone
    doesn't need it either - like page->mapping and other pointers throughout the
    kernel, Linux assumes pointers in zeroed structures are NULL pointers.

    Instead use page_reset_bad_cgroup, added to memcontrol.h for this only.

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Page migration gave me free_hot_cold_page's VM_BUG_ON page->page_cgroup.
    remove_migration_pte was calling mem_cgroup_charge on the new page whenever it
    found a swap pte, before it had determined it to be a migration entry. That
    left a surplus reference count on the page_cgroup, so it was still attached
    when the page was later freed.

    Move that mem_cgroup_charge down to where we're sure it's a migration entry.
    We were already under i_mmap_lock or anon_vma->lock, so its GFP_KERNEL was
    already inappropriate: change that to GFP_ATOMIC.

    It's essential that remove_migration_pte removes all the migration entries,
    other crashes follow if not. So proceed even when the charge fails: normally
    it cannot, but after a mem_cgroup_force_empty it might - comment in the code.
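
    A hedged sketch of the resulting order (not the literal diff):

    entry = pte_to_swp_entry(pte);
    if (!is_migration_entry(entry) || migration_entry_to_page(entry) != old)
            goto out;                        /* not our migration entry    */
    /* Now we know it is one; charge, tolerating failure as noted above.   */
    mem_cgroup_charge(new, mm, GFP_ATOMIC);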

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Don't uncharge when do_swap_page's call to do_wp_page fails: the page which
    was charged for is there in the pagetable, and will be correctly uncharged
    when that area is unmapped - it was only its COWing which failed.

    And while we're here, remove earlier XXX comment: yes, OR in do_wp_page's
    return value (maybe VM_FAULT_WRITE) with do_swap_page's there; but if it
    fails, mask out success bits, which might confuse some arches e.g. sparc.
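
    A hedged sketch of that logic (close to, but not asserted to be, the
    final code):

    if (write_access) {
            ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, pte);
            if (ret & VM_FAULT_ERROR)
                    ret &= VM_FAULT_ERROR;   /* drop any success bits      */
            goto out;
    }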

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
    There's nothing wrong with mem_cgroup_charge failure in do_wp_page and
    do_anonymous_page using __free_page, but it does look odd when nearby code
    uses page_cache_release: use that instead (while turning a blind eye to
    ancient inconsistencies of page_cache_release versus put_page).

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Each caller of mem_cgroup_move_lists is having to use page_get_page_cgroup:
    it's more convenient if it acts upon the page itself not the page_cgroup; and
    in a later patch this becomes important to handle within memcontrol.c.

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • vm_match_cgroup is a perverse name for a macro to match mm with cgroup: rename
    it mm_match_cgroup, matching mm_init_cgroup and mm_free_cgroup.

    Signed-off-by: Hugh Dickins
    Acked-by: David Rientjes
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Rename Memory Controller to Memory Resource Controller. Reflect the same
    changes in the CONFIG definition for the Memory Resource Controller. Group
    together the config options for Resource Counters and Memory Resource
    Controller.

    Signed-off-by: Balbir Singh
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Some oprofile results obtained while using tbench on a 2x2 cpu machine were
    very surprising.

    For example, the loopback_xmit() function was using a high number of cpu
    cycles to perform the statistics updates, which are supposed to be really
    cheap since they use percpu data:

    pcpu_lstats = netdev_priv(dev);
    lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id());
    lb_stats->packets++; /* HERE : serious contention */
    lb_stats->bytes += skb->len;

    struct pcpu_lstats is a small structure containing two longs. It appears
    that on my 32-bit platform, alloc_percpu(8) allocates a single cache line,
    instead of giving each cpu a separate cache line.

    Using the following patch gave me an impressive boost in various benchmarks
    (6% in tbench).
    (All percpu_counters hit this bug too.)

    A long-term fix (i.e. >= 2.6.26) would be to let each CPU allocate its own
    block of memory, so that we don't need to round up sizes to L1_CACHE_BYTES,
    or to merge the SGI stuff, of course...

    Note: SLUB vs SLAB is important here to *show* the improvement, since they
    don't have the same minimum allocation sizes (8 bytes vs 32 bytes). This
    could very well explain regressions some users reported when they switched
    to SLUB.
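
    The idea of the fix, sketched (a hedged illustration, not the patch
    itself): round each per-cpu object up to a full cache line, so that two
    cpus' copies can never share one.

    size = roundup(size, cache_line_size());  /* e.g. 8 bytes -> 32 or 64  */
    /* ... then hand out 'size' bytes per cpu as before ... */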

    Signed-off-by: Eric Dumazet
    Acked-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • jiffies subtraction may cause an overflow problem. It should be using
    time_after().
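
    A small userspace model of why (time_after() here is the usual kernel
    idiom, reproduced for illustration):

    #include <stdio.h>
    #include <limits.h>

    /* Compare via a signed difference so the answer stays correct even
     * when the (unsigned) jiffies counter wraps around. */
    #define time_after(a, b)  ((long)((b) - (a)) < 0)

    int main(void)
    {
            unsigned long then = ULONG_MAX - 5;  /* just before the wrap   */
            unsigned long now  = then + 10;      /* wrapped past zero      */

            printf("naive now > then     : %d\n", now > then);             /* 0, wrong */
            printf("time_after(now, then): %d\n", time_after(now, then));  /* 1, right */
            return 0;
    }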

    [akpm@linux-foundation.org: include jiffies.h]
    Signed-off-by: KOSAKI Motohiro
    Cc: Lee Schermerhorn
    Cc: Paul Jackson
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

04 Mar, 2008

12 commits


24 Feb, 2008

3 commits

  • Cgroup requires the subsystem to return negative error code on error in the
    create method.
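
    For instance (a hedged sketch: the create method returns a pointer, so an
    allocation failure has to be reported with ERR_PTR() rather than NULL):

    mem = kzalloc(sizeof(*mem), GFP_KERNEL);
    if (mem == NULL)
            return ERR_PTR(-ENOMEM);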

    Signed-off-by: Li Zefan
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Remove this VM_BUG_ON(), as Balbir stated:

    We used to have a for loop with !list_empty() as a termination condition
    and VM_BUG_ON(!pc) is a spill over. With the new loop, VM_BUG_ON(!pc) does
    not make sense.

    Signed-off-by: Li Zefan
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • WARNING: vmlinux.o(.meminit.text+0x649):
    Section mismatch in reference from the
    function free_area_init_core() to the function .init.text:setup_usemap()
    The function __meminit free_area_init_core() references
    a function __init setup_usemap().
    If free_area_init_core is only used by setup_usemap then
    annotate free_area_init_core with a matching annotation.

    The warning covers this stack of functions in mm/page_alloc.c:

    alloc_bootmem_node must be marked __init.
    alloc_bootmem_node is used by setup_usemap, if !SPARSEMEM.
    (usemap_size is only used by setup_usemap, if !SPARSEMEM.)
    setup_usemap is only used by free_area_init_core.
    free_area_init_core is only used by free_area_init_node.

    free_area_init_node is used by:
    arch/alpha/mm/numa.c: __init paging_init()
    arch/arm/mm/init.c: __init bootmem_init_node()
    arch/avr32/mm/init.c: __init paging_init()
    arch/cris/arch-v10/mm/init.c: __init paging_init()
    arch/cris/arch-v32/mm/init.c: __init paging_init()
    arch/m32r/mm/discontig.c: __init zone_sizes_init()
    arch/m32r/mm/init.c: __init zone_sizes_init()
    arch/m68k/mm/motorola.c: __init paging_init()
    arch/m68k/mm/sun3mmu.c: __init paging_init()
    arch/mips/sgi-ip27/ip27-memory.c: __init paging_init()
    arch/parisc/mm/init.c: __init paging_init()
    arch/sparc/mm/srmmu.c: __init srmmu_paging_init()
    arch/sparc/mm/sun4c.c: __init sun4c_paging_init()
    arch/sparc64/mm/init.c: __init paging_init()
    mm/page_alloc.c: __init free_area_init_nodes()
    mm/page_alloc.c: __init free_area_init()
    and
    mm/memory_hotplug.c: hotadd_new_pgdat()

    hotadd_new_pgdat cannot be an __init function, but:

    It is compiled for MEMORY_HOTPLUG configurations only
    MEMORY_HOTPLUG depends on SPARSEMEM || X86_64_ACPI_NUMA
    X86_64_ACPI_NUMA depends on X86_64
    ARCH_FLATMEM_ENABLE depends on X86_32
    ARCH_DISCONTIGMEM_ENABLE depends on X86_32
    So X86_64_ACPI_NUMA implies SPARSEMEM, right?

    So we can mark the stack of functions __init for !SPARSEMEM, but we must mark
    them __meminit for SPARSEMEM configurations. This is ok, because then the
    calls to alloc_bootmem_node are also avoided.
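
    One way to express that (a hedged sketch; the helper-macro name is an
    assumption here):

    /* __init when everything ends up in bootmem code (!SPARSEMEM),
     * __meminit when memory hotplug may call in later (SPARSEMEM)  */
    #ifdef CONFIG_SPARSEMEM
    #define __paginginit __meminit
    #else
    #define __paginginit __init
    #endif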

    Compile-tested on:
    silly minimal config
    defconfig x86_32
    defconfig x86_64
    defconfig x86_64 -HIBERNATION +MEMORY_HOTPLUG

    Signed-off-by: Alexander van Heukelum
    Reviewed-by: Sam Ravnborg
    Acked-by: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander van Heukelum