25 May, 2011

40 commits

  • cpumask_t is a very big struct and cpu_vm_mask was placed at the wrong
    position within mm_struct, which may reduce the cache hit ratio.

    This patch makes two changes (see the sketch after this entry):
    1) Move the cpumask to the end of mm_struct, because usually only the
    front bits of a cpumask are accessed when the system has cpu-hotplug
    capability.
    2) Convert cpu_vm_mask into cpumask_var_t. This may help reduce the
    memory footprint if cpumask_size() uses nr_cpumask_bits properly in the
    future.

    In addition, this patch renames cpu_vm_mask to cpu_vm_mask_var, which may
    help detect out-of-tree cpu_vm_mask users.

    This patch has no functional change.

    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KOSAKI Motohiro
    Cc: David Howells
    Cc: Koichi Yasutake
    Cc: Hugh Dickins
    Cc: Chris Metcalf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
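
    A minimal sketch of the layout change described above, for illustration
    only: the surrounding fields and the accessor name are assumptions, not
    the exact kernel definitions.

        #include <linux/cpumask.h>

        struct mm_sketch {
                /* ... hot, frequently accessed fields stay up front ... */
                unsigned long flags;

                /*
                 * Variable-size data goes last: with CONFIG_CPUMASK_OFFSTACK
                 * only a pointer lives here, otherwise the mask is embedded.
                 */
                cpumask_var_t cpu_vm_mask_var;
        };

        /* accessor; out-of-tree users of the old field name fail to build */
        static inline struct cpumask *mm_cpumask_sketch(struct mm_sketch *mm)
        {
                return mm->cpu_vm_mask_var;
        }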
     
  • We don't need to hold the mmap_sem through mem_cgroup_newpage_charge();
    the mmap_sem is only held to keep the vma stable, and we no longer need
    the vma to be stable after we return from alloc_hugepage_vma().

    Signed-off-by: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Some of these functions have grown beyond inline sanity, move them
    out-of-line.

    Signed-off-by: Peter Zijlstra
    Requested-by: Andrew Morton
    Requested-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Optimize the page_lock_anon_vma() fast path to be one atomic op, instead
    of two.

    Signed-off-by: Peter Zijlstra
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Straightforward conversion of anon_vma->lock to a mutex.

    Signed-off-by: Peter Zijlstra
    Acked-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Convert page_lock_anon_vma() over to use refcounts. This is done to
    prepare for the conversion of anon_vma from spinlock to mutex.

    Sadly this increases the cost of page_lock_anon_vma() from one atomic op
    to two; a follow-up patch addresses this, so let's keep this one simple
    for now. (A sketch of the refcount-based lookup follows this entry.)

    Signed-off-by: Peter Zijlstra
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: Mel Gorman
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
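
    A hedged sketch of the refcount idea: take a reference on an
    RCU-protected anon_vma only while it is still live, then sleep on its
    lock outside the RCU read section. The structure and helper names are
    illustrative, not the exact kernel code.

        #include <linux/atomic.h>
        #include <linux/mutex.h>
        #include <linux/rcupdate.h>

        struct anon_vma_sketch {
                atomic_t refcount;      /* 0 means the object is going away */
                struct mutex lock;
        };

        static struct anon_vma_sketch *get_anon_vma_sketch(struct anon_vma_sketch *av)
        {
                rcu_read_lock();
                if (av && !atomic_inc_not_zero(&av->refcount))
                        av = NULL;      /* already dying, don't touch it */
                rcu_read_unlock();
                return av;              /* caller holds a ref, may mutex_lock() */
        }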
     
  • A slightly more verbose comment to go along with the trickery in
    page_lock_anon_vma().

    Signed-off-by: Peter Zijlstra
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Acked-by: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • It's beyond ugly and gets in the way.

    Signed-off-by: Peter Zijlstra
    Acked-by: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Namhyung Kim
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Straightforward conversion of i_mmap_lock to a mutex.

    Signed-off-by: Peter Zijlstra
    Acked-by: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Hugh says:
    "The only significant loser, I think, would be page reclaim (when
    concurrent with truncation): could spin for a long time waiting for
    the i_mmap_mutex it expects would soon be dropped? "

    Counterpoints:
    - cpu contention makes the spin stop (need_resched())
    - zapping pages should free pages at a higher rate than reclaim ever can

    I think the simplification of the truncate code is definitely worth it.

    Effectively reverts: 2aa15890f3c ("mm: prevent concurrent
    unmap_mapping_range() on the same inode") and takes out the code that
    caused its problem.

    Signed-off-by: Peter Zijlstra
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • In order to convert i_mmap_lock to a mutex we need a mutex equivalent to
    spin_lock_nest_lock(), thus provide the mutex_lock_nest_lock() annotation.

    As with spin_lock_nest_lock(), mutex_lock_nest_lock() allows annotation of
    the locking pattern where an outer lock serializes the acquisition order
    of nested locks. That is, if every time you lock multiple locks A, say A1
    and A2, you first acquire the outer lock N, then the order of acquiring
    A1 and A2 is irrelevant. (A usage sketch follows this entry.)

    Signed-off-by: Peter Zijlstra
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
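
    A usage sketch of the annotation; the lock names are made up, and in real
    use a1/a2 would be two instances of the same lock class (e.g. two
    objects' inner locks) guarded by one outer lock.

        #include <linux/mutex.h>

        static DEFINE_MUTEX(N);         /* outer lock, always taken first */
        static DEFINE_MUTEX(a1);        /* nested locks                   */
        static DEFINE_MUTEX(a2);

        static void lock_both(void)
        {
                mutex_lock(&N);
                mutex_lock_nest_lock(&a1, &N);  /* order of a1/a2 is       */
                mutex_lock_nest_lock(&a2, &N);  /* irrelevant while N held */
        }

        static void unlock_both(void)
        {
                mutex_unlock(&a2);
                mutex_unlock(&a1);
                mutex_unlock(&N);
        }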
     
  • Instead of using a single batch (the small on-stack array or an allocated
    page), try to extend the batch every time it runs out, and only flush once
    either the extension fails or we're done. (A sketch of this loop follows
    this entry.)

    Signed-off-by: Peter Zijlstra
    Requested-by: Nick Piggin
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
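
    A hedged sketch of the "extend instead of flush" idea; the structure and
    sizes are made up rather than the kernel's actual mmu_gather fields.

        #include <linux/gfp.h>
        #include <linux/kernel.h>
        #include <linux/mm.h>

        struct batch_sketch {
                struct batch_sketch *next;
                unsigned int nr;
                struct page *pages[500];        /* roughly a page of pointers */
        };

        /* returns false when the caller must flush before gathering more */
        static bool remember_page_sketch(struct batch_sketch **active,
                                         struct page *page)
        {
                struct batch_sketch *b = *active;

                if (b->nr == ARRAY_SIZE(b->pages)) {
                        struct batch_sketch *nb;

                        nb = (void *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
                        if (!nb)
                                return false;   /* extension failed: flush now */
                        nb->next = NULL;
                        nb->nr = 0;
                        b->next = nb;
                        *active = b = nb;
                }
                b->pages[b->nr++] = page;
                return true;
        }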
     
  • In case other architectures require RCU-freed page tables to implement
    gup_fast() and software-filled hashes and similar things, provide the
    means to do so by moving the logic into generic code. (A sketch of the
    deferred-free idea follows this entry.)

    Signed-off-by: Peter Zijlstra
    Requested-by: David Miller
    Cc: Benjamin Herrenschmidt
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
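
    A minimal sketch of the deferred-free idea with made-up names (the real
    generic code batches table pointers; this shows a single table). The
    sched flavour of RCU is used because lockless walkers like gup_fast()
    run with interrupts disabled rather than inside rcu_read_lock().

        #include <linux/gfp.h>
        #include <linux/kernel.h>
        #include <linux/rcupdate.h>
        #include <linux/slab.h>

        struct deferred_table {
                struct rcu_head rcu;
                unsigned long table;            /* page-table page to free */
        };

        static void table_free_rcu(struct rcu_head *head)
        {
                struct deferred_table *d =
                        container_of(head, struct deferred_table, rcu);

                free_page(d->table);
                kfree(d);
        }

        static void remove_table_sketch(unsigned long table)
        {
                struct deferred_table *d = kmalloc(sizeof(*d), GFP_ATOMIC);

                if (d) {
                        d->table = table;
                        call_rcu_sched(&d->rcu, table_free_rcu);
                } else {
                        synchronize_sched();    /* fallback: wait, then free */
                        free_page(table);
                }
        }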
     
  • Fold all the mmu_gather rework patches into one for submission

    Signed-off-by: Peter Zijlstra
    Reported-by: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Fix up the um mmu_gather code to conform to the new API.

    Signed-off-by: Peter Zijlstra
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Fix up the ia64 mmu_gather code to conform to the new API.

    Signed-off-by: Peter Zijlstra
    Acked-by: Tony Luck
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Fix up the sh mmu_gather code to conform to the new API.

    Signed-off-by: Peter Zijlstra
    Acked-by: Paul Mundt
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Fix up the arm mmu_gather code to conform to the new API.

    Signed-off-by: Peter Zijlstra
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Adapt the stand-alone s390 mmu_gather implementation to the new
    preemptible mmu_gather interface.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Martin Schwidefsky
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Rework the sparc mmu_gather usage to conform to the new world order :-)

    Sparc mmu_gather does two things:
    - tracks vaddrs to unhash
    - tracks pages to free

    Split these two things like powerpc has done and keep the vaddrs
    in per-cpu data structures and flush them on context switch.

    The remaining bits can then use the generic mmu_gather.

    Signed-off-by: Peter Zijlstra
    Acked-by: David Miller
    Cc: Benjamin Herrenschmidt
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Fix up powerpc to the new mmu_gather stuff.

    PPC has an extra batching queue to RCU-free the actual pagetable
    allocations; use the ARCH extensions for that for now.

    For the ppc64_tlb_batch, which tracks the vaddrs to unhash from the
    hardware hash-table, keep using per-cpu arrays but flush on context switch
    and use a TLF bit to track the lazy_mmu state.

    Signed-off-by: Peter Zijlstra
    Acked-by: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Rework the existing mmu_gather infrastructure.

    The direct purpose of these patches was to allow preemptible mmu_gather,
    but even without that I think these patches provide an improvement to the
    status quo.

    The first 9 patches rework the mmu_gather infrastructure. For review
    purposes I've split them into generic and per-arch patches, with the last
    of those a generic cleanup.

    The next patch provides generic RCU page-table freeing, and the follow-up
    is a patch converting s390 to use this. I've also got 4 patches from
    DaveM lined up (not included in this series) that use this to implement
    gup_fast() for sparc64.

    Then there is one patch that extends the generic mmu_gather batching.

    After that follow the mm preemptibility patches, which make parts of the
    mm a lot more preemptible. They convert i_mmap_lock and anon_vma->lock to
    mutexes, which together with the mmu_gather rework makes mmu_gather
    preemptible as well.

    Making i_mmap_lock a mutex also enables a clean-up of the truncate code.

    This also allows for preemptible mmu_notifiers, something that XPMEM I
    think wants.

    Furthermore, it removes the new and universally detested unmap_mutex.

    This patch:

    Remove the first obstacle towards a fully preemptible mmu_gather.

    The current scheme assumes mmu_gather is always done with preemption
    disabled and uses per-cpu storage for the page batches. Change this to
    try to allocate a page for batching and, in case of failure, use a small
    on-stack array to make some progress. (A sketch of this fallback follows
    this entry.)

    Preemptible mmu_gather is desired in general and usable once i_mmap_lock
    becomes a mutex. Doing it before the mutex conversion saves us from
    having to rework the code by moving the mmu_gather bits inside the
    pte_lock.

    Also avoid flushing the tlb batches from under the pte lock, this is
    useful even without the i_mmap_lock conversion as it significantly reduces
    pte lock hold times.

    [akpm@linux-foundation.org: fix comment tpyo]
    Signed-off-by: Peter Zijlstra
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
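
    A hedged sketch of the allocation-with-fallback described above; field
    and function names are illustrative, not the exact mmu_gather layout.

        #include <linux/gfp.h>
        #include <linux/mm.h>

        #define LOCAL_NR 8      /* small on-stack fallback, size illustrative */

        struct gather_sketch {
                struct page **pages;    /* points at 'local' or a whole page */
                unsigned int nr, max;
                struct page *local[LOCAL_NR];
        };

        static void gather_init_sketch(struct gather_sketch *g)
        {
                g->pages = (struct page **)__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
                if (g->pages) {
                        g->max = PAGE_SIZE / sizeof(struct page *);
                } else {
                        g->pages = g->local;    /* no page: still make progress */
                        g->max = LOCAL_NR;
                }
                g->nr = 0;
        }

        static void gather_finish_sketch(struct gather_sketch *g)
        {
                /* ... flush the remembered pages here ... */
                if (g->pages != g->local)
                        free_page((unsigned long)g->pages);
        }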
     
  • Currently we have expand_upwards exported while expand_downwards is
    accessible only via expand_stack or expand_stack_downwards.

    check_stack_guard_page is a nice example of the asymmetry. It uses
    expand_stack for VM_GROWSDOWN while expand_upwards is called for
    VM_GROWSUP case.

    Let's clean this up by exporting both functions and making the names
    consistent. Let's use expand_{upwards,downwards} because expanding
    doesn't always involve stack manipulation (an example is
    ia64_do_page_fault, which uses expand_upwards for register backing store
    expansion). expand_downwards has to be defined for both
    CONFIG_STACK_GROWS{UP,DOWN} because get_arg_page calls the downwards
    version in the early process initialization phase for the growsup
    configuration.

    Signed-off-by: Michal Hocko
    Acked-by: Hugh Dickins
    Cc: James Bottomley
    Cc: "Luck, Tony"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The vmap allocator is used to, among other things, allocate per-cpu vmap
    blocks, where each vmap block is naturally aligned to its own size.
    Obviously, leaving a guard page after each vmap area forbids packing vmap
    blocks efficiently and can make the kernel run out of possible vmap blocks
    long before overall vmap space is exhausted.

    The new interface to map a user-supplied page array into linear vmalloc
    space (vm_map_ram) insists on allocating from a vmap block (instead of
    falling back to a custom area) when the area size is below a certain
    threshold. With heavy users of this interface (e.g. XFS) and limited
    vmalloc space on 32-bit, vmap block exhaustion is a real problem.

    Remove the guard page from the core vmap allocator. vmalloc and the old
    vmap interface enforce a guard page on their own at a higher level.

    Note that without this patch, we had accidental guard pages after those
    vm_map_ram areas that happened to be at the end of a vmap block, but not
    between every area. This patch removes this accidental guard page only.

    If we want guard pages after every vm_map_ram area, this should be done
    separately, and just like with vmalloc and the old interface, at a
    different level rather than in the core allocator.

    Mel pointed out: "If necessary, the guard page could be reintroduced as a
    debugging-only option (CONFIG_DEBUG_PAGEALLOC?). Otherwise it seems
    reasonable."

    Signed-off-by: Johannes Weiner
    Cc: Nick Piggin
    Cc: Dave Chinner
    Acked-by: Mel Gorman
    Cc: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • VM_BUG_ON() is effectively a BUG_ON() under CONFIG_DEBUG_VM. That is
    exactly what we have here now, and two different folks have suggested
    doing it this way. (A sketch of the macro follows this entry.)

    Signed-off-by: Dave Hansen
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
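
    A minimal sketch of the pattern; the CONFIG_DEBUG_VM guard is from the
    text above, the non-debug expansion is an assumption.

        #include <linux/bug.h>

        #ifdef CONFIG_DEBUG_VM
        #define VM_BUG_ON_SKETCH(cond)  BUG_ON(cond)      /* real check    */
        #else
        #define VM_BUG_ON_SKETCH(cond)  do { } while (0)  /* compiles away */
        #endif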
     
  • Running sparse on page_alloc.c today, it errors out:
    include/linux/gfp.h:254:17: error: bad constant expression
    include/linux/gfp.h:254:17: error: cannot size expression

    which is a line in gfp_zone():

    BUILD_BUG_ON((GFP_ZONE_BAD >> bit) & 1);

    That's really unfortunate, because it ends up hiding all of the other
    legitimate sparse messages like this:
    mm/page_alloc.c:5315:59: warning: incorrect type in argument 1 (different base types)
    mm/page_alloc.c:5315:59: expected unsigned long [unsigned] [usertype] size
    mm/page_alloc.c:5315:59: got restricted gfp_t [usertype]
    ...

    Having sparse be able to catch these very oopsable bugs is a lot more
    important than keeping a BUILD_BUG_ON(). Kill the BUILD_BUG_ON().

    Compiles on x86_64 with and without CONFIG_DEBUG_VM=y. defconfig boots
    fine for me.

    Signed-off-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • There's a kernel-wide shortage of per-process flags, so it's always
    helpful to trim one when possible without incurring a significant penalty.
    It's even more important when you're planning on adding a per-process
    flag yourself, which I plan to do shortly for transparent hugepages.

    PF_OOM_ORIGIN is used by ksm and swapoff to prefer current since it has a
    tendency to allocate large amounts of memory and should be preferred for
    killing over other tasks. We'd rather immediately kill the task making
    the errant syscall rather than penalizing an innocent task.

    This patch removes PF_OOM_ORIGIN since its behavior is equivalent to
    setting the process's oom_score_adj to OOM_SCORE_ADJ_MAX.

    The process's old oom_score_adj is stored and then set to
    OOM_SCORE_ADJ_MAX during the time it used to have PF_OOM_ORIGIN. The old
    value is then reinstated when the process should no longer be considered
    a high priority for oom killing. (A sketch of such a save/restore helper
    follows this entry.)

    Signed-off-by: David Rientjes
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: Hugh Dickins
    Cc: Izik Eidus
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
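
    A hedged sketch of the save/restore idea; the helper name and locking
    are assumptions, not necessarily what the patch adds.

        #include <linux/oom.h>
        #include <linux/sched.h>

        /* raise current's oom_score_adj and return the old value so the
         * caller can restore it after the memory-hungry operation */
        static int set_preferred_oom_victim(int new_adj)
        {
                int old_adj;

                spin_lock_irq(&current->sighand->siglock);
                old_adj = current->signal->oom_score_adj;
                current->signal->oom_score_adj = new_adj;
                spin_unlock_irq(&current->sighand->siglock);

                return old_adj;
        }

        /* usage sketch:
         *      int old = set_preferred_oom_victim(OOM_SCORE_ADJ_MAX);
         *      ... ksm merge / swapoff work ...
         *      set_preferred_oom_victim(old);
         */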
     
  • It's uncertain this has been beneficial, so it's safer to undo it. All
    other compaction users would still go into synchronous mode if a first
    attempt at async compaction failed. Hopefully we don't need to force
    special behavior for THP (which is the only __GFP_NO_KSWAPD user so far,
    and the easiest to exercise and to notice). This also makes
    __GFP_NO_KSWAPD return to its original strict semantics of only bypassing
    kswapd, as THP allocations have khugepaged for the async THP
    allocations/compactions.

    Signed-off-by: Andrea Arcangeli
    Cc: Alex Villacis Lasso
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Currently, cpu hotplug updates pcp->stat_threshold, but memory hotplug
    doesn't. There is no reason for this.

    [akpm@linux-foundation.org: fix CONFIG_SMP=n build]
    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Currently, memory hotplug calls setup_per_zone_wmarks() and
    calculate_zone_inactive_ratio(), but doesn't call
    setup_per_zone_lowmem_reserve().

    This means the number of reserved pages isn't updated even when memory
    hotplug occurs. This patch fixes that.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Commit bce7394a3e ("page-allocator: reset wmark_min and inactive ratio of
    zone when hotplug happens") introduced invalid section references. Now,
    setup_per_zone_inactive_ratio() is marked __init, and therefore it can't
    be referenced from memory hotplug code.

    This patch marks it as __meminit and also marks the caller as __ref. (A
    sketch of the annotation pattern follows this entry.)

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: Yasunori Goto
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
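
    A minimal sketch of the section-annotation pattern, with made-up function
    names.

        #include <linux/init.h>

        /* __meminit code is kept when memory hotplug is enabled and may be
         * discarded after boot otherwise */
        static void __meminit recompute_zone_ratios_sketch(void)
        {
                /* ... recompute per-zone tunables ... */
        }

        /* __ref tells modpost this cross-section reference is intentional */
        static int __ref online_pages_sketch(void)
        {
                recompute_zone_ratios_sketch();
                return 0;
        }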
     
  • When an oom killing occurs, almost all processes get stuck at one of the
    following two points.

    1) __alloc_pages_nodemask
    2) __lock_page_or_retry

    1) is not very problematic because TIF_MEMDIE leads to an allocation
    failure and getting out of the page allocator.

    2) is more problematic. In an OOM situation, zones typically don't have
    any page cache at all and memory starvation might lead to greatly reduced
    IO performance. When a fork bomb occurs, TIF_MEMDIE tasks don't die
    quickly, meaning that a fork bomb may create new processes faster than
    the oom-killer can kill them. Then, the system may become livelocked.

    This patch makes the page fault interruptible by SIGKILL. (A sketch of
    the killable page-lock path follows this entry.)

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Matthew Wilcox
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
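
    A hedged sketch of how the fault path can be made killable;
    FAULT_FLAG_KILLABLE is assumed as the opt-in flag name and the retry
    logic is simplified.

        #include <linux/mm.h>
        #include <linux/pagemap.h>

        /* when the fault handler opts in, use the killable lock so a
         * SIGKILL-ed task returns instead of sleeping indefinitely */
        static int lock_fault_page_sketch(struct page *page, unsigned int flags)
        {
                if (flags & FAULT_FLAG_KILLABLE)
                        return lock_page_killable(page);  /* -EINTR on SIGKILL */

                lock_page(page);
                return 0;
        }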
     
  • commit 2687a356 ("Add lock_page_killable") introduced a killable
    lock_page(). Similarly, this patch introduces a killable
    wait_on_page_locked(). (A sketch follows this entry.)

    Signed-off-by: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Matthew Wilcox
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
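
    A hedged sketch of the killable wrapper; wait_on_page_bit_killable() is
    assumed as the underlying primitive introduced alongside it.

        #include <linux/page-flags.h>
        #include <linux/pagemap.h>

        /* only sleep if the page is actually locked, and sleep killably so
         * a fatal signal wakes us with an error instead of waiting forever */
        static inline int wait_on_page_locked_killable_sketch(struct page *page)
        {
                if (PageLocked(page))
                        return wait_on_page_bit_killable(page, PG_locked);
                return 0;
        }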
     
  • commit 2ac390370a ("writeback: add
    /sys/devices/system/node/<node>/vmstat") added a per-node vmstat entry.
    But strangely it only shows nr_written and nr_dirtied.

    # cat /sys/devices/system/node/node20/vmstat
    nr_written 0
    nr_dirtied 0

    Of course, that's not adequate. With this patch, the vmstat file shows
    all VM statistics, as /proc/vmstat does.

    # cat /sys/devices/system/node/node0/vmstat
    nr_free_pages 899224
    nr_inactive_anon 201
    nr_active_anon 17380
    nr_inactive_file 31572
    nr_active_file 28277
    nr_unevictable 0
    nr_mlock 0
    nr_anon_pages 17321
    nr_mapped 8640
    nr_file_pages 60107
    nr_dirty 33
    nr_writeback 0
    nr_slab_reclaimable 6850
    nr_slab_unreclaimable 7604
    nr_page_table_pages 3105
    nr_kernel_stack 175
    nr_unstable 0
    nr_bounce 0
    nr_vmscan_write 0
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem 260
    nr_dirtied 1050
    nr_written 938
    numa_hit 962872
    numa_miss 0
    numa_foreign 0
    numa_interleave 8617
    numa_local 962872
    numa_other 0
    nr_anon_transparent_hugepages 0

    [akpm@linux-foundation.org: no externs in .c files]
    Signed-off-by: KOSAKI Motohiro
    Cc: Michael Rubin
    Cc: Wu Fengguang
    Acked-by: David Rientjes
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Because 'ret' is declared as int, not unsigned long, there is no need to
    cast the error constants to unsigned long. If you compile this code on a
    64-bit machine somehow, you'll see the following warning:

    CC mm/nommu.o
    mm/nommu.c: In function `do_mmap_pgoff':
    mm/nommu.c:1411: warning: overflow in implicit constant conversion

    Signed-off-by: Namhyung Kim
    Acked-by: Greg Ungerer
    Cc: David Howells
    Cc: Paul Mundt
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • If f_op->read() fails and sysctl_nr_trim_pages > 1, there could be a
    memory leak between @region->vm_end and @region->vm_top.

    Signed-off-by: Namhyung Kim
    Acked-by: Greg Ungerer
    Cc: David Howells
    Cc: Paul Mundt
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • Now that we have the sorted vma list, use it in do_munmap() to check that
    we have an exact match.

    Signed-off-by: Namhyung Kim
    Acked-by: Greg Ungerer
    Cc: David Howells
    Cc: Paul Mundt
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • Now that we have the sorted vma list, use it in find_vma[_exact]() rather
    than doing a linear search on the rb-tree. (A sketch follows this entry.)

    Signed-off-by: Namhyung Kim
    Acked-by: Greg Ungerer
    Cc: David Howells
    Cc: Paul Mundt
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
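
    A hedged sketch of a list-based lookup; the exact boundary checks in the
    real nommu find_vma() may differ.

        #include <linux/mm_types.h>

        /* with mm->mmap sorted by start address, the first vma whose end is
         * above 'addr' either contains it or proves nothing does */
        static struct vm_area_struct *find_vma_sketch(struct mm_struct *mm,
                                                      unsigned long addr)
        {
                struct vm_area_struct *vma;

                for (vma = mm->mmap; vma; vma = vma->vm_next) {
                        if (vma->vm_start > addr)
                                return NULL;    /* sorted list: no match */
                        if (vma->vm_end > addr)
                                return vma;     /* vm_start <= addr < vm_end */
                }
                return NULL;
        }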
     
  • Since commit 297c5eee3724 ("mm: make the vma list be doubly linked") made
    it a doubly linked list, we don't need to scan the list when deleting
    @vma.

    Also, the original code didn't update the prev pointer; fix that too. (A
    sketch of the unlink follows this entry.)

    Signed-off-by: Namhyung Kim
    Acked-by: Greg Ungerer
    Cc: David Howells
    Cc: Paul Mundt
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
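
    A minimal sketch of the O(1) unlink, including the prev-pointer update
    the original code missed; this is a sketch, not the patch itself.

        #include <linux/mm_types.h>

        static void unlink_vma_sketch(struct mm_struct *mm,
                                      struct vm_area_struct *vma)
        {
                if (vma->vm_prev)
                        vma->vm_prev->vm_next = vma->vm_next;
                else
                        mm->mmap = vma->vm_next;        /* vma was the head */

                if (vma->vm_next)
                        vma->vm_next->vm_prev = vma->vm_prev;
        }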
     
  • When I was reading the nommu code, I found that it handles the vma
    list/tree in an unusual way. IIUC, because there can be more than one
    identical/overlapping vma in the list/tree, it sorts the tree more
    strictly and does a linear search on the tree. But this wasn't applied to
    the list (i.e. the list could be constructed in a different order than
    the tree, so we can't use the list when finding the first vma in that
    order).

    Since inserting/sorting a vma into the tree and linking it into the list
    are done at the same time, we can easily construct both of them in the
    same order. And since a linear search on the tree can be more costly than
    one on the list, those searches can be converted to use the list.

    Also, after commit 297c5eee3724 ("mm: make the vma list be doubly
    linked") made the list doubly linked, a couple of places in the code
    needed to be fixed to construct the list properly.

    Patch 1/6 is a preparation. It keeps the list sorted the same as the tree
    and constructs the doubly-linked list properly. Patch 2/6 is a simple
    optimization of vma deletion. Patches 3/6 and 4/6 convert tree traversal
    to list traversal, and the rest are simple fixes and cleanups.

    This patch:

    A @vma added into @mm should be sorted by start addr, end addr and VMA
    struct addr, in that order, because we may get identical VMAs in the @mm.
    However, this was true only for the rbtree, not for the list.

    This patch fixes this by remembering 'rb_prev' during the tree traversal,
    like find_vma_prepare() does, and linking the @vma via __vma_link_list().
    After this patch, we can iterate over all the VMAs in the correct order
    simply by using the @mm->mmap list. (A sketch of the list linking follows
    this entry.)

    [akpm@linux-foundation.org: avoid duplicating __vma_link_list()]
    Signed-off-by: Namhyung Kim
    Acked-by: Greg Ungerer
    Cc: David Howells
    Cc: Paul Mundt
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim