25 May, 2011

27 commits

  • Rework the existing mmu_gather infrastructure.

    The direct purpose of these patches was to allow preemptible mmu_gather,
    but even without that I think these patches provide an improvement to the
    status quo.

    The first 9 patches rework the mmu_gather infrastructure. For review
    purposes I've split them into generic and per-arch patches, with the
    last of those a generic cleanup.

    The next patch provides generic RCU page-table freeing, and the followup
    is a patch converting s390 to use this. I've also got 4 patches from
    DaveM lined up (not included in this series) that use this to implement
    gup_fast() for sparc64.

    Then there is one patch that extends the generic mmu_gather batching.

    After that follow the mm preemptibility patches, which make parts of the
    mm a lot more preemptible. They convert i_mmap_lock and anon_vma->lock
    to mutexes, which together with the mmu_gather rework makes mmu_gather
    preemptible as well.

    Making i_mmap_lock a mutex also enables a clean-up of the truncate code.

    This also allows for preemptible mmu_notifiers, something that XPMEM I
    think wants.

    Furthermore, it removes the new and universally detested unmap_mutex.

    This patch:

    Remove the first obstacle towards a fully preemptible mmu_gather.

    The current scheme assumes mmu_gather is always done with preemption
    disabled and uses per-cpu storage for the page batches. Change this to
    try to allocate a page for batching and, in case of failure, use a small
    on-stack array to make some progress.
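
    A minimal sketch of that allocation pattern, assuming illustrative
    structure and field names rather than the exact mmu_gather layout:

    struct mmu_gather_batch {
            struct mmu_gather_batch *next;
            unsigned int            nr;
            struct page             *pages[];
    };

    /* Hedged sketch: try to get a full page for the next batch; if that
     * fails the caller keeps using the small on-stack array embedded in
     * the mmu_gather, so forward progress is still made. */
    static bool tlb_next_batch(struct mmu_gather *tlb)
    {
            struct mmu_gather_batch *batch;

            batch = (void *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
            if (!batch)
                    return false;   /* fall back to the on-stack batch */

            batch->next = NULL;
            batch->nr = 0;
            tlb->active->next = batch;
            tlb->active = batch;
            return true;
    }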

    Preemptible mmu_gather is desired in general and usable once i_mmap_lock
    becomes a mutex. Doing it before the mutex conversion saves us from
    having to rework the code by moving the mmu_gather bits inside the
    pte_lock.

    Also avoid flushing the TLB batches from under the pte lock; this is
    useful even without the i_mmap_lock conversion as it significantly
    reduces pte lock hold times.

    [akpm@linux-foundation.org: fix comment tpyo]
    Signed-off-by: Peter Zijlstra
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Currently we have expand_upwards exported while expand_downwards is
    accessible only via expand_stack or expand_stack_downwards.

    check_stack_guard_page is a nice example of the asymmetry. It uses
    expand_stack for the VM_GROWSDOWN case while expand_upwards is called
    for the VM_GROWSUP case.

    Let's clean this up by exporting both functions and making the names
    consistent. Let's use expand_{upwards,downwards} because expanding
    doesn't always involve stack manipulation (an example is
    ia64_do_page_fault which uses expand_upwards for registers backing store
    expansion). expand_downwards has to be defined for both
    CONFIG_STACK_GROWS{UP,DOWN} because get_arg_page calls the downwards
    version in the early process initialization phase for growsup
    configuration.
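
    For illustration, a hedged sketch of how the renamed pair could be used
    from check_stack_guard_page(); this is simplified and omits the actual
    guard-page conditions:

    /* Sketch only: grow the area by one page in the right direction. */
    static int check_stack_guard_page(struct vm_area_struct *vma,
                                      unsigned long address)
    {
            if ((vma->vm_flags & VM_GROWSDOWN) && address == vma->vm_start)
                    return expand_downwards(vma, address - PAGE_SIZE);
            if ((vma->vm_flags & VM_GROWSUP) && address + PAGE_SIZE == vma->vm_end)
                    return expand_upwards(vma, address + PAGE_SIZE);
            return 0;
    }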

    Signed-off-by: Michal Hocko
    Acked-by: Hugh Dickins
    Cc: James Bottomley
    Cc: "Luck, Tony"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The vmap allocator is used to, among other things, allocate per-cpu vmap
    blocks, where each vmap block is naturally aligned to its own size.
    Obviously, leaving a guard page after each vmap area forbids packing vmap
    blocks efficiently and can make the kernel run out of possible vmap blocks
    long before overall vmap space is exhausted.

    The new interface to map a user-supplied page array into linear vmalloc
    space (vm_map_ram) insists on allocating from a vmap block (instead of
    falling back to a custom area) when the area size is below a certain
    threshold. With heavy users of this interface (e.g. XFS) and limited
    vmalloc space on 32-bit, vmap block exhaustion is a real problem.
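
    For context, a hedged usage sketch of that interface, assuming the
    vm_map_ram()/vm_unmap_ram() signatures of this kernel era (vm_map_ram()
    still took a pgprot argument):

    /* Map a caller-supplied page array into contiguous vmalloc space. */
    void *addr = vm_map_ram(pages, nr_pages, -1 /* any node */, PAGE_KERNEL);

    if (addr) {
            /* ... use the linearly mapped buffer ... */
            vm_unmap_ram(addr, nr_pages);
    }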

    Remove the guard page from the core vmap allocator. vmalloc and the old
    vmap interface enforce a guard page on their own at a higher level.

    Note that without this patch, we had accidental guard pages after those
    vm_map_ram areas that happened to be at the end of a vmap block, but not
    between every area. This patch removes this accidental guard page only.

    If we want guard pages after every vm_map_ram area, this should be done
    separately. And just like with vmalloc and the old interface on a
    different level, not in the core allocator.

    Mel pointed out: "If necessary, the guard page could be reintroduced as a
    debugging-only option (CONFIG_DEBUG_PAGEALLOC?). Otherwise it seems
    reasonable."

    Signed-off-by: Johannes Weiner
    Cc: Nick Piggin
    Cc: Dave Chinner
    Acked-by: Mel Gorman
    Cc: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • There's a kernel-wide shortage of per-process flags, so it's always
    helpful to trim one when possible without incurring a significant penalty.
    It's even more important when you're planning on adding a per-process
    flag yourself, which I plan to do shortly for transparent hugepages.

    PF_OOM_ORIGIN is used by ksm and swapoff to prefer current since it has a
    tendency to allocate large amounts of memory and should be preferred for
    killing over other tasks. We'd rather immediately kill the task making
    the errant syscall than penalize an innocent task.

    This patch removes PF_OOM_ORIGIN since its behavior is equivalent to
    setting the process's oom_score_adj to OOM_SCORE_ADJ_MAX.

    The process's old oom_score_adj is stored and then set to
    OOM_SCORE_ADJ_MAX during the time it used to have PF_OOM_ORIGIN. The old
    value is then reinstated when the process should no longer be considered a
    high priority for oom killing.
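
    A minimal sketch of that save-and-restore pattern; the helper name is
    hypothetical, not necessarily what the patch adds:

    /* Hypothetical helper: mark current as the preferred oom victim and
     * return the previous oom_score_adj so it can be restored later. */
    static int oom_score_adj_set_max(void)
    {
            int old = current->signal->oom_score_adj;

            current->signal->oom_score_adj = OOM_SCORE_ADJ_MAX;
            return old;
    }

    /* ksm/swapoff style usage: */
    int old_adj = oom_score_adj_set_max();
    /* ... memory-hungry operation ... */
    current->signal->oom_score_adj = old_adj;
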

    Signed-off-by: David Rientjes
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: Hugh Dickins
    Cc: Izik Eidus
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • It's uncertain this has been beneficial, so it's safer to undo it. All
    other compaction users would still go in synchronous mode if a first
    attempt at async compaction failed. Hopefully we don't need to force
    special behavior for THP (which is the only __GFP_NO_KSWAPD user so far
    and the easiest to exercise and notice). This also makes __GFP_NO_KSWAPD
    return to its original strict semantics of only bypassing kswapd, as THP
    allocations have khugepaged for the async THP allocations/compactions.

    Signed-off-by: Andrea Arcangeli
    Cc: Alex Villacis Lasso
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Currently, cpu hotplug updates pcp->stat_threshold, but memory hotplug
    doesn't. There is no reason for this.
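
    A hedged sketch of the idea: let memory hotplug recompute the per-cpu
    stat thresholds just as cpu hotplug already does (the callback name is
    illustrative):

    /* Illustrative memory-hotplug notifier refreshing pcp->stat_threshold. */
    static int vmstat_memory_callback(struct notifier_block *self,
                                      unsigned long action, void *arg)
    {
            if (action == MEM_ONLINE || action == MEM_OFFLINE)
                    refresh_zone_stat_thresholds();
            return NOTIFY_OK;
    }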

    [akpm@linux-foundation.org: fix CONFIG_SMP=n build]
    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Currently, memory hotplug calls setup_per_zone_wmarks() and
    calculate_zone_inactive_ratio(), but doesn't call
    setup_per_zone_lowmem_reserve().

    This means the number of reserved pages isn't updated when memory
    hotplug occurs. This patch fixes it.
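
    A hedged sketch of the intent, assuming the online/offline path simply
    calls the existing recalculation helpers:

    /* In the memory online/offline path (sketch): */
    setup_per_zone_wmarks();             /* already called today */
    calculate_zone_inactive_ratio(zone); /* already called today */
    setup_per_zone_lowmem_reserve();     /* the missing call this patch adds */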

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Commit bce7394a3e ("page-allocator: reset wmark_min and inactive ratio of
    zone when hotplug happens") introduced invalid section references. Now,
    setup_per_zone_inactive_ratio() is marked __init and then it can't be
    referenced from memory hotplug code.

    This patch marks it as __meminit and also marks caller as __ref.
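
    A minimal sketch of that annotation pattern (bodies elided):

    /* Was __init; __meminit keeps it available for memory hotplug. */
    static void __meminit setup_per_zone_inactive_ratio(void)
    {
            /* ... */
    }

    /* The hotplug-path caller is marked __ref so the section-mismatch
     * checker accepts the reference to a __meminit function. */
    int __ref online_pages(unsigned long pfn, unsigned long nr_pages)
    {
            /* ... */
            setup_per_zone_inactive_ratio();
            return 0;
    }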

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: Yasunori Goto
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • When an oom killing occurs, almost all processes are getting stuck at the
    following two points.

    1) __alloc_pages_nodemask
    2) __lock_page_or_retry

    1) is not very problematic because TIF_MEMDIE leads to an allocation
    failure and the task gets out of the page allocator.

    2) is more problematic. In an OOM situation, zones typically don't have
    page cache at all and memory starvation might lead to greatly reduced IO
    performance. When a fork bomb occurs, TIF_MEMDIE tasks don't die quickly,
    meaning that the fork bomb may create new processes faster than the
    oom-killer can kill them. Then, the system may become livelocked.

    This patch makes the page fault interruptible by SIGKILL.
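
    A hedged sketch of the check this enables in an arch fault handler
    (simplified; loosely modeled on the x86 path of this era):

    fault = handle_mm_fault(mm, vma, address, flags);

    if (fault & VM_FAULT_RETRY) {
            /* The wait in the fault path was interrupted by SIGKILL:
             * bail out instead of retrying so the dying task can exit. */
            if (fatal_signal_pending(current))
                    return;
    }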

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Matthew Wilcox
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
    commit 2687a356 ("Add lock_page_killable") introduced killable
    lock_page(). Similarly, this patch introduces a killable
    wait_on_page_locked().
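
    A minimal sketch of the new helper, assuming a killable counterpart of
    wait_on_page_bit() is available:

    /* Sketch: return 0 once the page is unlocked, or an error if a fatal
     * signal arrived while waiting. */
    static inline int wait_on_page_locked_killable(struct page *page)
    {
            if (PageLocked(page))
                    return wait_on_page_bit_killable(page, PG_locked);
            return 0;
    }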

    Signed-off-by: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Matthew Wilcox
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
    commit 2ac390370a ("writeback: add
    /sys/devices/system/node/<node>/vmstat") added a per-node vmstat entry.
    But strangely it only shows nr_written and nr_dirtied.

    # cat /sys/devices/system/node/node20/vmstat
    nr_written 0
    nr_dirtied 0

    Of course, that's not adequate. With this patch, the vmstat file shows
    all VM statistics, like /proc/vmstat.

    # cat /sys/devices/system/node/node0/vmstat
    nr_free_pages 899224
    nr_inactive_anon 201
    nr_active_anon 17380
    nr_inactive_file 31572
    nr_active_file 28277
    nr_unevictable 0
    nr_mlock 0
    nr_anon_pages 17321
    nr_mapped 8640
    nr_file_pages 60107
    nr_dirty 33
    nr_writeback 0
    nr_slab_reclaimable 6850
    nr_slab_unreclaimable 7604
    nr_page_table_pages 3105
    nr_kernel_stack 175
    nr_unstable 0
    nr_bounce 0
    nr_vmscan_write 0
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem 260
    nr_dirtied 1050
    nr_written 938
    numa_hit 962872
    numa_miss 0
    numa_foreign 0
    numa_interleave 8617
    numa_local 962872
    numa_other 0
    nr_anon_transparent_hugepages 0

    [akpm@linux-foundation.org: no externs in .c files]
    Signed-off-by: KOSAKI Motohiro
    Cc: Michael Rubin
    Cc: Wu Fengguang
    Acked-by: David Rientjes
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
    Because 'ret' is declared as int, not unsigned long, there is no need to
    cast the error constants to unsigned long. If you compile this code on a
    64-bit machine somehow, you'll see the following warning:

    CC mm/nommu.o
    mm/nommu.c: In function `do_mmap_pgoff':
    mm/nommu.c:1411: warning: overflow in implicit constant conversion
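
    For illustration, the kind of assignment that triggers the warning
    versus the fixed form (a sketch, not the exact nommu.c lines):

    int ret;

    /* warns on 64-bit: the unsigned long constant does not fit in an int */
    ret = (unsigned long) -EINVAL;

    /* fixed: 'ret' is an int, so no cast is needed */
    ret = -EINVAL;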

    Signed-off-by: Namhyung Kim
    Acked-by: Greg Ungerer
    Cc: David Howells
    Cc: Paul Mundt
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • If f_op->read() fails and sysctl_nr_trim_pages > 1, there could be a
    memory leak between @region->vm_end and @region->vm_top.

    Signed-off-by: Namhyung Kim
    Acked-by: Greg Ungerer
    Cc: David Howells
    Cc: Paul Mundt
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • Now we have the sorted vma list, use it in do_munmap() to check that we
    have an exact match.

    Signed-off-by: Namhyung Kim
    Acked-by: Greg Ungerer
    Cc: David Howells
    Cc: Paul Mundt
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
    Now we have the sorted vma list, use it in find_vma[_exact]() rather
    than doing a linear search on the rb-tree.
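
    A hedged sketch of the list-based lookup (using the standard mm->mmap /
    vm_next linkage; details of the real nommu version may differ):

    /* Sketch: walk the sorted VMA list instead of the rb-tree. */
    struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
    {
            struct vm_area_struct *vma;

            for (vma = mm->mmap; vma; vma = vma->vm_next) {
                    if (vma->vm_start > addr)
                            return NULL;    /* sorted: we went past addr */
                    if (vma->vm_end > addr)
                            return vma;
            }
            return NULL;
    }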

    Signed-off-by: Namhyung Kim
    Acked-by: Greg Ungerer
    Cc: David Howells
    Cc: Paul Mundt
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • Since commit 297c5eee3724 ("mm: make the vma list be doubly linked") made
    it a doubly linked list, we don't need to scan the list when deleting
    @vma.

    And the original code didn't update the prev pointer. Fix it too.
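
    A minimal sketch of the O(1) unlink this enables, including the prev
    pointer update the original code missed:

    /* Sketch: unlink @vma from mm->mmap without scanning the list. */
    if (vma->vm_prev)
            vma->vm_prev->vm_next = vma->vm_next;
    else
            mm->mmap = vma->vm_next;

    if (vma->vm_next)
            vma->vm_next->vm_prev = vma->vm_prev;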

    Signed-off-by: Namhyung Kim
    Acked-by: Greg Ungerer
    Cc: David Howells
    Cc: Paul Mundt
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • When I was reading nommu code, I found that it handles the vma list/tree
    in an unusual way. IIUC, because there can be more than one
    identical/overlapping vma in the list/tree, it sorts the tree more
    strictly and does a linear search on the tree. But this isn't applied to
    the list (i.e. the list could be constructed in a different order than
    the tree, so we can't use the list when finding the first vma in that
    order).

    Since inserting/sorting a vma in the tree and linking it into the list
    are done at the same time, we can easily construct both in the same
    order. And since a linear search on the tree can be more costly than one
    on the list, the traversal can be converted to use the list.

    Also, after commit 297c5eee3724 ("mm: make the vma list be doubly
    linked") made the list doubly linked, there were a couple of places that
    needed to be fixed to construct the list properly.

    Patch 1/6 is a preparation. It maintains the list sorted same as the tree
    and construct doubly-linked list properly. Patch 2/6 is a simple
    optimization for the vma deletion. Patch 3/6 and 4/6 convert tree
    traversal to list traversal and the rest are simple fixes and cleanups.

    This patch:

    @vma added into @mm should be sorted by start addr, end addr and VMA
    struct addr in that order because we may get identical VMAs in the @mm.
    However this was true only for the rbtree, not for the list.

    This patch fixes this by remembering 'rb_prev' during the tree traversal
    like find_vma_prepare() does and linking the @vma via __vma_link_list().
    After this patch, we can iterate over all VMAs in the correct order
    simply by walking the @mm->mmap list.
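
    A hedged sketch of the list-link half of that: 'prev' is the in-order
    predecessor remembered while walking down the rb-tree, and the new @vma
    is linked into the doubly-linked list right after it so that list order
    matches tree order:

    vma->vm_prev = prev;
    if (prev) {
            vma->vm_next = prev->vm_next;
            prev->vm_next = vma;
    } else {
            vma->vm_next = mm->mmap;
            mm->mmap = vma;
    }
    if (vma->vm_next)
            vma->vm_next->vm_prev = vma;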

    [akpm@linux-foundation.org: avoid duplicating __vma_link_list()]
    Signed-off-by: Namhyung Kim
    Acked-by: Greg Ungerer
    Cc: David Howells
    Cc: Paul Mundt
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • Signed-off-by: Sergey Senozhatsky
    Reviewed-by: Christoph Lameter
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Avoid merging a VMA with another VMA which is cloned from the parent process.

    The cloned VMA shares the anon_vma lock with the parent process's VMA. If
    we do the merge, more VMAs (even though the new range is only for the
    current process) use the parent process's anon_vma lock. This introduces
    scalability issues. find_mergeable_anon_vma() already considers this
    case.

    Signed-off-by: Shaohua Li
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • If we only change vma->vm_end, we can avoid taking anon_vma lock even if
    'insert' isn't NULL, which is the case of split_vma.

    As I understand it, we need the lock before because rmap must get the
    'insert' VMA when we adjust old VMA's vm_end (the 'insert' VMA is linked
    to anon_vma list in __insert_vm_struct before).

    But now this isn't true any more. The 'insert' VMA is already linked to
    the anon_vma list in __split_vma() (with anon_vma_clone()) instead of in
    __insert_vm_struct(), so there is no race in which rmap could fail to
    get the required VMAs. The anon_vma lock is therefore unnecessary, and
    dropping it saves one lock acquisition in the brk case and improves
    scalability.

    Signed-off-by: Shaohua Li
    Cc: Rik van Riel
    Acked-by: Hugh Dickins
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
    Make some variables have correct alignment/section to avoid cache
    issues. In a workload which heavily does mmap/munmap, these variables
    are used frequently.
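
    A hedged illustration of the kind of annotation meant here; the variable
    names are examples, not necessarily the ones the patch touches:

    /* Keep frequently-read, rarely-written globals in the read-mostly
     * section so they don't share cache lines with write-hot data. */
    static struct kmem_cache *vm_area_cachep __read_mostly;
    int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;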

    Signed-off-by: Shaohua Li
    Cc: Andi Kleen
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Architectures that implement their own show_mem() function did not pass
    the filter argument to show_free_areas() to appropriately avoid emitting
    the state of nodes that are disallowed in the current context. This patch
    now passes the filter argument to show_free_areas() so those nodes are now
    avoided.

    This patch also removes the show_free_areas() wrapper around
    __show_free_areas() and converts existing callers to pass an empty filter.

    ia64 emits additional information for each node, so skip_free_areas_zone()
    must be made global to filter disallowed nodes and it is converted to use
    a nid argument rather than a zone for this use case.

    Signed-off-by: David Rientjes
    Cc: Russell King
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Kyle McMartin
    Cc: Helge Deller
    Cc: James Bottomley
    Cc: "David S. Miller"
    Cc: Guan Xuetao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • It has been reported on some laptops that kswapd is consuming large
    amounts of CPU and not being scheduled when SLUB is enabled during large
    amounts of file copying. It is expected that this is due to kswapd
    missing every cond_resched() point because:

    1) shrink_page_list() calls cond_resched() if inactive pages were
    isolated, which in turn may not happen if all_unreclaimable is set in
    shrink_zones(). If, for whatever reason, all_unreclaimable is set on
    all zones, we can miss calling cond_resched().

    2) balance_pgdat() only calls cond_resched() if the zones are not
    balanced. For a high-order allocation that is balanced, it checks
    order-0 again. During that window, order-0 might have become
    unbalanced, so it loops again for order-0 and returns that it was
    reclaiming for order-0 to kswapd(). It can then find that a caller has
    rewoken kswapd for a high-order allocation and re-enter balance_pgdat()
    without ever calling cond_resched().

    3) shrink_slab() only calls cond_resched() if we are reclaiming slab
    pages. If there are a large number of direct reclaimers, the
    shrinker_rwsem can be contended and prevent kswapd from calling
    cond_resched().

    This patch modifies the shrink_slab() case, as sketched below. If the
    semaphore is contended, the caller will still call cond_resched(). After
    each successful call into a shrinker, the cond_resched() check remains
    in case one shrinker is particularly slow.
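
    A hedged sketch of that control flow, simplified from shrink_slab() (the
    function name is changed to mark it as illustrative):

    unsigned long shrink_slab_sketch(unsigned long scanned, gfp_t gfp_mask,
                                     unsigned long lru_pages)
    {
            struct shrinker *shrinker;
            unsigned long ret = 0;

            if (!down_read_trylock(&shrinker_rwsem)) {
                    /* Contended: don't block, but still give the
                     * scheduler a chance before bailing out. */
                    cond_resched();
                    return 1;
            }

            list_for_each_entry(shrinker, &shrinker_list, list) {
                    /* ... call into the shrinker ... */
                    cond_resched();     /* kept after every shrinker call */
            }

            up_read(&shrinker_rwsem);
            return ret;
    }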

    [mgorman@suse.de: preserve call to cond_resched after each call into shrinker]
    Signed-off-by: Mel Gorman
    Signed-off-by: Minchan Kim
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Wu Fengguang
    Cc: James Bottomley
    Tested-by: Colin King
    Cc: Raghavendra D Prabhu
    Cc: Jan Kara
    Cc: Chris Mason
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: [2.6.38+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • There are a few reports of people experiencing hangs when copying large
    amounts of data with kswapd using a large amount of CPU which appear to be
    due to recent reclaim changes. SLUB using high orders is the trigger but
    not the root cause as SLUB has been using high orders for a while. The
    root cause was bugs introduced into reclaim which are addressed by the
    following two patches.

    Patch 1 corrects logic introduced by commit 1741c877 ("mm: kswapd:
    keep kswapd awake for high-order allocations until a percentage of
    the node is balanced") to allow kswapd to go to sleep when
    balanced for high orders.

    Patch 2 notes that it is possible for kswapd to miss every
    cond_resched() and updates shrink_slab() so it'll at least reach
    that scheduling point.

    Chris Wood reports that these two patches in isolation are sufficient to
    prevent the system hanging. AFAIK, they should also resolve similar hangs
    experienced by James Bottomley.

    This patch:

    Johannes Weiner pointed out that the logic in commit 1741c877 ("mm:
    kswapd: keep kswapd awake for high-order allocations until a percentage
    of the node is balanced") is backwards. Instead of allowing kswapd to go
    to sleep when balanced for high-order allocations, it keeps kswapd
    running uselessly.
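
    A hedged sketch of the inverted check; the helper names follow the
    commit text, but treat the exact code as illustrative:

    /* sleeping_prematurely() should report "premature" only when the node
     * is NOT yet balanced for the requested order. */
    if (order)
            return !pgdat_balanced(pgdat, balanced, classzone_idx);
            /* previously: return pgdat_balanced(...), which kept kswapd
             * awake exactly when it was allowed to sleep */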

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Signed-off-by: Johannes Weiner
    Reviewed-by: Wu Fengguang
    Cc: James Bottomley
    Tested-by: Colin King
    Cc: Raghavendra D Prabhu
    Cc: Jan Kara
    Cc: Chris Mason
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Reviewed-by: Minchan Kim
    Reviewed-by: Wu Fengguang
    Cc: [2.6.38+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit 442b06bcea23 ("slub: Remove node check in slab_free") added a
    call to deactivate_slab() in the debug case in __slab_alloc(), which
    unlocks the current slab used for allocation. Going to the label
    'unlock_out' then does it again.

    Also, in the debug case we do not need all the other processing that the
    'unlock_out' path does. We always fall back to the slow path in the
    debug case. So the tid update is useless.

    Similarly, ALLOC_SLOWPATH would just be incremented for all allocations.
    Also a pretty useless thing.

    So simply restore irq flags and return the object.

    Signed-off-by: Christoph Lameter
    Reported-and-bisected-by: James Morris
    Reported-by: Ingo Molnar
    Reported-by: Jens Axboe
    Cc: Pekka Enberg
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • * 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6: (29 commits)
    [S390] cpu hotplug: fix external interrupt subclass mask handling
    [S390] oprofile: dont access lowcore
    [S390] oprofile: add missing irq stats counter
    [S390] Ignore sendmmsg system call note wired up warning
    [S390] s390,oprofile: fix compile error for !CONFIG_SMP
    [S390] s390,oprofile: fix alert counter increment
    [S390] Remove unused includes in process.c
    [S390] get CPC image name
    [S390] sclp: event buffer dissection
    [S390] chsc: process channel-path-availability information
    [S390] refactor page table functions for better pgste support
    [S390] merge page_test_dirty and page_clear_dirty
    [S390] qdio: prevent compile warning
    [S390] sclp: remove unnecessary sendmask check
    [S390] convert old cpumask API into new one
    [S390] pfault: cleanup code
    [S390] pfault: cpu hotplug vs missing completion interrupts
    [S390] smp: add __noreturn attribute to cpu_die()
    [S390] percpu: implement arch specific irqsafe_cpu_ops
    [S390] vdso: disable gcov profiling
    ...

    Linus Torvalds
     
  • * 'for-2.6.40' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
    percpu: Unify input section names
    percpu: Avoid extra NOP in percpu_cmpxchg16b_double
    percpu: Cast away printk format warning
    percpu: Always align percpu output section to PAGE_SIZE

    Fix up fairly trivial conflict in arch/x86/include/asm/percpu.h as per Tejun

    Linus Torvalds
     

24 May, 2011

4 commits

  • Tejun Heo
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
    slub: Deal with hyperthetical case of PAGE_SIZE > 2M
    slub: Remove node check in slab_free
    slub: avoid label inside conditional
    slub: Make CONFIG_DEBUG_PAGE_ALLOC work with new fastpath
    slub: Avoid warning for !CONFIG_SLUB_DEBUG
    slub: Remove CONFIG_CMPXCHG_LOCAL ifdeffery
    slub: Move debug handlign in __slab_free
    slub: Move node determination out of hotpath
    slub: Eliminate repeated use of c->page through a new page variable
    slub: get_map() function to establish map of free objects in a slab
    slub: Use NUMA_NO_NODE in get_partial
    slub: Fix a typo in config name

    Linus Torvalds
     
  • Conflicts:
    mm/slub.c

    Pekka Enberg
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
    b43: fix comment typo reqest -> request
    Haavard Skinnemoen has left Atmel
    cris: typo in mach-fs Makefile
    Kconfig: fix copy/paste-ism for dell-wmi-aio driver
    doc: timers-howto: fix a typo ("unsgined")
    perf: Only include annotate.h once in tools/perf/util/ui/browsers/annotate.c
    md, raid5: Fix spelling error in comment ('Ofcourse' --> 'Of course').
    treewide: fix a few typos in comments
    regulator: change debug statement be consistent with the style of the rest
    Revert "arm: mach-u300/gpio: Fix mem_region resource size miscalculations"
    audit: acquire creds selectively to reduce atomic op overhead
    rtlwifi: don't touch with treewide double semicolon removal
    treewide: cleanup continuations and remove logging message whitespace
    ath9k_hw: don't touch with treewide double semicolon removal
    include/linux/leds-regulator.h: fix syntax in example code
    tty: fix typo in descripton of tty_termios_encode_baud_rate
    xtensa: remove obsolete BKL kernel option from defconfig
    m68k: fix comment typo 'occcured'
    arch:Kconfig.locks Remove unused config option.
    treewide: remove extra semicolons
    ...

    Linus Torvalds
     

23 May, 2011

1 commit

  • The page_clear_dirty primitive always sets the default storage key
    which resets the access control bits and the fetch protection bit.
    That will surprise a KVM guest that sets non-zero access control
    bits or the fetch protection bit. Merge page_test_dirty and
    page_clear_dirty back to a single function and only clear the
    dirty bit from the storage key.

    In addition move the function page_test_and_clear_dirty and
    page_test_and_clear_young to page.h where they belong. This
    requires changing the parameter from a struct page * to a page frame
    number.

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     

21 May, 2011

3 commits

    We can set the page pointer in the percpu structure to NULL to have the
    same effect as setting c->node to NUMA_NO_NODE.

    Gets rid of one check in slab_free() that was only used for
    forcing the slab_free to the slowpath for debugging.

    We still need to set c->node to NUMA_NO_NODE to force the
    slab_alloc() fastpath to the slowpath in case of debugging.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • Commit 778dd893ae78 ("tmpfs: fix race between umount and swapoff")
    forgot the new rules for strict atomic kmap nesting, causing

    WARNING: at arch/x86/mm/highmem_32.c:81

    from __kunmap_atomic(), then

    BUG: unable to handle kernel paging request at fffb9000

    from shmem_swp_set() when shmem_unuse_inode() is handling swapoff with
    highmem in use. My disgrace again.

    See
    https://bugzilla.kernel.org/show_bug.cgi?id=35352

    Reported-by: Witold Baryluk
    Signed-off-by: Hugh Dickins
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Commit e66eed651fd1 ("list: remove prefetching from regular list
    iterators") removed the include of prefetch.h from list.h, which
    uncovered several cases that had apparently relied on that rather
    obscure header file dependency.

    So this fixes things up a bit, using

    grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]')
    grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]')

    to guide us in finding files that either need
    inclusion, or have it despite not needing it.

    There are more of them around (mostly network drivers), but this gets
    many core ones.

    Reported-by: Stephen Rothwell
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

20 May, 2011

2 commits


18 May, 2011

3 commits

    ZONE_CONGESTED should be a state of global memory reclaim. If not, a
    busy memcg sets this and causes unnecessary throttling in
    wait_iff_congested() for memory reclaim in other contexts. This makes
    system performance bad.

    I'll think later about whether a "memcg is congested!" flag is required
    or not. But this fix is required first.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Daisuke Nishimura
    Acked-by: Ying Han
    Cc: Balbir Singh
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Jumping to a label inside a conditional is considered poor style,
    especially considering the current organization of __slab_alloc().

    This removes the 'load_from_page' label and just duplicates the three
    lines of code that it uses:

    c->node = page_to_nid(page);
    c->page = page;
    goto load_freelist;

    since it's probably not worth making this a separate helper function.

    Acked-by: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Pekka Enberg

    David Rientjes
     
    The fastpath can do a speculative access to a page that
    CONFIG_DEBUG_PAGE_ALLOC may have marked as invalid, in order to retrieve
    the pointer to the next free object.

    Use probe_kernel_read in that case in order not to cause a page fault.
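
    A hedged sketch of such a guarded read, modeled on a "safe" freelist
    pointer helper (treat the helper name and fields as illustrative):

    /* Read the next-free pointer stored inside the object without risking
     * a fault if DEBUG_PAGEALLOC has unmapped the containing page. */
    static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
    {
            void *p;

    #ifdef CONFIG_DEBUG_PAGEALLOC
            probe_kernel_read(&p, (void **)(object + s->offset), sizeof(p));
    #else
            p = *(void **)(object + s->offset);
    #endif
            return p;
    }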

    Cc: # 38.x
    Reported-by: Eric Dumazet
    Signed-off-by: Christoph Lameter
    Signed-off-by: Eric Dumazet
    Signed-off-by: Pekka Enberg

    Christoph Lameter