23 Nov, 2013

1 commit

  • Pull SLAB changes from Pekka Enberg:
    "The patches from Joonsoo Kim switch mm/slab.c to use 'struct page' for
    slab internals similar to mm/slub.c. This reduces memory usage and
    improves performance:

    https://lkml.org/lkml/2013/10/16/155

    The rest of the changes are bug fixes from various people"

    * 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: (21 commits)
    mm, slub: fix the typo in mm/slub.c
    mm, slub: fix the typo in include/linux/slub_def.h
    slub: Handle NULL parameter in kmem_cache_flags
    slab: replace non-existing 'struct freelist *' with 'void *'
    slab: fix to calm down kmemleak warning
    slub: proper kmemleak tracking if CONFIG_SLUB_DEBUG disabled
    slab: rename slab_bufctl to slab_freelist
    slab: remove useless statement for checking pfmemalloc
    slab: use struct page for slab management
    slab: replace free and inuse in struct slab with newly introduced active
    slab: remove SLAB_LIMIT
    slab: remove kmem_bufctl_t
    slab: change the management method of free objects of the slab
    slab: use __GFP_COMP flag for allocating slab pages
    slab: use well-defined macro, virt_to_slab()
    slab: overloading the RCU head over the LRU for RCU free
    slab: remove cachep in struct slab_rcu
    slab: remove nodeid in struct slab
    slab: remove colouroff in struct slab
    slab: change return type of kmem_getpages() to struct page
    ...

    Linus Torvalds
     

22 Nov, 2013

1 commit

  • I don't know what went wrong, mis-merge or something, but ->pmd_huge_pte
    ended up in the wrong union within struct page.

    In the original patch [1] it is placed in the union with ->lru and ->slab,
    but in commit e009bb30c8df ("mm: implement split page table lock for PMD
    level") it is in the union with ->index and ->freelist.

    That union also seems unused for page table pages and safe to re-use,
    but it's not what I've tested.

    Let's move it to the original place. It fixes indentation at least. :)

    [1] https://lkml.org/lkml/2013/10/7/288

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Nov, 2013

5 commits

  • Use kernel/bounds.c to convert build-time spinlock_t size check into a
    preprocessor symbol and apply that to properly separate the page::ptl
    situation.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Kirill A. Shutemov
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
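
    The kernel/bounds.c trick referred to above boils down to generating a
    preprocessor constant from a sizeof() at build time, so that later code
    can test the lock's size in #if conditionals. A stand-alone sketch of the
    same idea, with invented names (MY_SPINLOCK_SIZE, and pthread_mutex_t
    standing in for spinlock_t) rather than the kernel's actual symbols:

        /* Hypothetical generator, compiled and run as a build step; its output
         * would be redirected into a generated header that other code includes. */
        #include <stdio.h>
        #include <pthread.h>

        int main(void)
        {
            /* Emit the size of the lock type as a preprocessor symbol, since
             * the preprocessor itself cannot evaluate sizeof(). */
            printf("#define MY_SPINLOCK_SIZE %zu\n", sizeof(pthread_mutex_t));
            return 0;
        }

    Code elsewhere could then write "#if MY_SPINLOCK_SIZE <= 8" to decide
    whether the lock can be embedded; the real kernel derives its symbol via
    kernel/bounds.c and the generated bounds.h header rather than a printf
    program, so treat this purely as an illustration.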
     
  • If the split page table lock is in use, we embed the lock into the struct
    page of the table's page. We have to disable the split lock if spinlock_t
    is too big to be embedded, such as when DEBUG_SPINLOCK or DEBUG_LOCK_ALLOC
    is enabled.

    This patch adds support for dynamic allocation of the split page table
    lock when we can't embed it into struct page.

    page->ptl is unsigned long now; we use it as a spinlock_t directly if
    sizeof(spinlock_t) <= sizeof(unsigned long), and otherwise allocate the
    spinlock dynamically and store a pointer to it in page->ptl.

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
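
    A minimal user-space sketch of the embed-if-it-fits, otherwise-allocate
    scheme described above, assuming invented names (struct fake_page,
    ptl_init(), ptl_of()) and pthread_mutex_t in place of spinlock_t; the
    real helpers operate on struct page in the kernel's mm code:

        #include <stdlib.h>
        #include <stdbool.h>
        #include <pthread.h>

        struct fake_page {
            unsigned long ptl;      /* either the lock itself or a pointer to it */
        };

        /* Initialize the per-page lock; may fail when a separate allocation is needed. */
        static bool ptl_init(struct fake_page *page)
        {
            if (sizeof(pthread_mutex_t) <= sizeof(page->ptl)) {
                /* small enough: construct the lock in place */
                pthread_mutex_init((pthread_mutex_t *)&page->ptl, NULL);
                return true;
            }
            /* too big: fall back to a separate allocation */
            pthread_mutex_t *lock = malloc(sizeof(*lock));
            if (!lock)
                return false;
            pthread_mutex_init(lock, NULL);
            page->ptl = (unsigned long)lock;
            return true;
        }

        /* Return the usable lock regardless of which representation is in effect. */
        static pthread_mutex_t *ptl_of(struct fake_page *page)
        {
            if (sizeof(pthread_mutex_t) <= sizeof(page->ptl))
                return (pthread_mutex_t *)&page->ptl;
            return (pthread_mutex_t *)page->ptl;
        }

        int main(void)
        {
            struct fake_page pg;
            if (!ptl_init(&pg))
                return 1;
            pthread_mutex_lock(ptl_of(&pg));
            pthread_mutex_unlock(ptl_of(&pg));
            return 0;
        }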
     
  • The basic idea is the same as with PTE level: the lock is embedded into
    struct page of table's page.

    We can't use mm->pmd_huge_pte to store pgtables for THP, since we don't
    take mm->page_table_lock anymore. Let's reuse page->lru of table's page
    for that.

    pgtable_pmd_page_ctor() returns true if initialization is successful
    and false otherwise. The current implementation never fails, but the
    assumption that the constructor can fail will help to port it to -rt,
    where spinlock_t is rather huge and cannot be embedded into struct
    page -- dynamic allocation is required.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Reviewed-by: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
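
    The calling convention described above (a page-table constructor that
    reports failure) can be sketched in plain C as follows. fake_pmd_page_ctor()
    and alloc_pmd_table() are invented stand-ins; only the
    pgtable_pmd_page_ctor() name comes from the commit itself:

        #include <stdbool.h>
        #include <stdlib.h>
        #include <pthread.h>

        struct fake_page {
            pthread_mutex_t *ptl;   /* stands in for the (possibly dynamic) split lock */
        };

        /* Constructor: may fail once the lock has to be allocated separately (-rt case). */
        static bool fake_pmd_page_ctor(struct fake_page *page)
        {
            page->ptl = malloc(sizeof(*page->ptl));
            if (!page->ptl)
                return false;
            pthread_mutex_init(page->ptl, NULL);
            return true;
        }

        /* Caller must check the result and back out of the page allocation on failure. */
        static struct fake_page *alloc_pmd_table(void)
        {
            struct fake_page *page = malloc(sizeof(*page));
            if (!page)
                return NULL;
            if (!fake_pmd_page_ctor(page)) {
                free(page);
                return NULL;
            }
            return page;
        }

        int main(void)
        {
            struct fake_page *pg = alloc_pmd_table();
            return pg ? 0 : 1;
        }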
     
  • With split page table lock for PMD level we can't hold mm->page_table_lock
    while updating nr_ptes.

    Let's convert it to atomic_long_t to avoid races.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: Naoya Horiguchi
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
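
    A small sketch of the conversion described above: a counter that used to
    rely on mm->page_table_lock becomes an atomic, so concurrent updaters no
    longer race. The snippet uses C11 atomics in user space; the kernel uses
    atomic_long_t with atomic_long_inc()/atomic_long_read():

        #include <stdatomic.h>
        #include <pthread.h>
        #include <stdio.h>

        /* Stand-in for mm->nr_ptes after the conversion: no lock needed to update it. */
        static atomic_long nr_ptes = 0;

        static void *worker(void *arg)
        {
            (void)arg;
            for (int i = 0; i < 100000; i++)
                atomic_fetch_add(&nr_ptes, 1);   /* kernel: atomic_long_inc(&mm->nr_ptes) */
            return NULL;
        }

        int main(void)
        {
            pthread_t t[4];
            for (int i = 0; i < 4; i++)
                pthread_create(&t[i], NULL, worker, NULL);
            for (int i = 0; i < 4; i++)
                pthread_join(t[i], NULL);
            /* Always prints 400000: the increments do not race. */
            printf("%ld\n", atomic_load(&nr_ptes));
            return 0;
        }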
     
  • We're going to introduce split page table lock for PMD level. Let's
    rename existing split ptlock for PTE level to avoid confusion.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: Naoya Horiguchi
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

25 Oct, 2013

2 commits

  • Now there are only a few fields left in struct slab, so we can overload
    them onto struct page. This will save some memory and reduce the cache
    footprint.

    After this change, slabp_cache and slab_size are no longer related to
    a struct slab, so rename them to freelist_cache and freelist_size.

    These changes are purely mechanical; there is no functional change.

    Acked-by: Andi Kleen
    Acked-by: Christoph Lameter
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Pekka Enberg

    Joonsoo Kim
     
  • With build-time size checking, we can overload the RCU head over the LRU
    of struct page to free the pages of a slab in RCU context. This really
    helps with overloading struct slab onto struct page, which ultimately
    reduces the memory usage and cache footprint of SLAB.

    Acked-by: Andi Kleen
    Acked-by: Christoph Lameter
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Pekka Enberg

    Joonsoo Kim
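
    The build-time size check mentioned above can be expressed as a static
    assertion: the RCU callback head may only be overlaid on the LRU linkage
    if it is guaranteed to fit. The struct and field names below are invented
    stand-ins, not the kernel's actual struct page layout:

        #include <assert.h>

        /* Minimal stand-ins for the two things being overlaid. */
        struct fake_list_head { struct fake_list_head *next, *prev; };
        struct fake_rcu_head  { struct fake_rcu_head *next;
                                void (*func)(struct fake_rcu_head *); };

        struct fake_page {
            union {
                struct fake_list_head lru;      /* normal use: LRU / slab list linkage */
                struct fake_rcu_head  rcu_head; /* reused while the slab's pages sit in RCU */
            };
        };

        /* Build-time check: the overlay is only legal if the RCU head fits where
         * the LRU pointers live; otherwise it would clobber neighbouring fields. */
        static_assert(sizeof(struct fake_rcu_head) <= sizeof(struct fake_list_head),
                      "rcu_head does not fit over lru");

        int main(void) { return 0; }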
     

09 Oct, 2013

4 commits

  • With scan rate adaptation based on whether the workload has properly
    converged or not, there should be no need for the scan period reset
    hammer. Get rid of it.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-60-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • Change the per-page last fault tracking to use cpu,pid instead of
    nid,pid. This will allow us to try and look up the alternate task more
    easily. Note that even though it is the cpu that is stored in the page
    flags, the mpol_misplaced decision is still based on the node.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Link: http://lkml.kernel.org/r/1381141781-10992-43-git-send-email-mgorman@suse.de
    [ Fixed build failure on 32-bit systems. ]
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
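
    The cpu,pid tracking above amounts to packing two small integers into
    spare page-flag bits and accepting that only the low bits of the pid
    survive. A user-space sketch with made-up field widths (8 bits each);
    the kernel's real widths and flag layout differ:

        #include <stdio.h>

        #define CPU_BITS   8
        #define PID_BITS   8
        #define CPU_MASK   ((1UL << CPU_BITS) - 1)
        #define PID_MASK   ((1UL << PID_BITS) - 1)

        /* Pack the last-access cpu and the low bits of the pid into one word. */
        static unsigned long pack_cpupid(unsigned int cpu, unsigned int pid)
        {
            return ((unsigned long)(cpu & CPU_MASK) << PID_BITS) | (pid & PID_MASK);
        }

        static unsigned int cpupid_to_cpu(unsigned long cpupid)
        {
            return (cpupid >> PID_BITS) & CPU_MASK;
        }

        static unsigned int cpupid_to_pid(unsigned long cpupid)
        {
            return cpupid & PID_MASK;
        }

        int main(void)
        {
            unsigned long flags = 0;                     /* stand-in for page->flags */
            unsigned long cpupid = pack_cpupid(3, 4242); /* cpu 3, pid 4242 */

            /* Only the low PID_BITS of the pid survive, so collisions are
             * possible, which is the trade-off the changelogs above accept. */
            flags |= cpupid;                             /* real code shifts into spare bits */
            printf("cpu=%u pid_low=%u\n", cpupid_to_cpu(flags), cpupid_to_pid(flags));
            return 0;
        }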
     
  • Ideally it would be possible to distinguish between NUMA hinting faults that
    are private to a task and those that are shared. If treated identically
    there is a risk that shared pages bounce between nodes depending on
    the order they are referenced by tasks. Ultimately what is desirable is
    that task private pages remain local to the task while shared pages are
    interleaved between sharing tasks running on different nodes to give good
    average performance. This is further complicated by THP as even
    applications that partition their data may not be partitioning on a huge
    page boundary.

    To start with, this patch assumes that multi-threaded or multi-process
    applications partition their data and that, in the general case, the
    private accesses are more important for cpu->memory locality. Also,
    no new infrastructure is required to treat private pages properly, but
    interleaving for shared pages requires additional infrastructure.

    To detect private accesses the pid of the last accessing task is required,
    but the storage requirements are high. This patch borrows heavily from
    Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking"
    to encode some bits from the last accessing task in the page flags as
    well as the node information. Collisions will occur but it is better than
    just depending on the node information. Node information is then used to
    determine if a page needs to migrate. The PID information is used to detect
    private/shared accesses. The preferred NUMA node is selected based on where
    the maximum number of approximately private faults were measured. Shared
    faults are not taken into consideration for a few reasons.

    First, if there are many tasks sharing the page then they'll all move
    towards the same node. The node will be compute overloaded and then
    scheduled away later only to bounce back again. Alternatively the shared
    tasks would just bounce around nodes because the fault information is
    effectively noise. Either way accounting for shared faults the same as
    private faults can result in lower performance overall.

    The second reason is based on a hypothetical workload that has a small
    number of very important, heavily accessed private pages but a large shared
    array. The shared array would dominate the number of faults and be selected
    as a preferred node even though it's the wrong decision.

    The third reason is that multiple threads in a process will race each
    other to fault the shared page making the fault information unreliable.

    Signed-off-by: Mel Gorman
    [ Fix compilation error when !NUMA_BALANCING. ]
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-30-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
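
    A sketch of how the stored pid bits can be used to classify a hinting
    fault as private or shared, following the reasoning above: if the faulting
    task's pid matches the bits recorded in the page, credit a private fault
    to the node, otherwise a shared one, and prefer the node with the most
    private faults. All names, widths and the node count are illustrative:

        #include <stdbool.h>
        #include <stdio.h>

        #define PID_BITS  8
        #define PID_MASK  ((1u << PID_BITS) - 1)
        #define MAX_NODES 4

        /* Per-task approximation of "faults seen per node". */
        static unsigned long faults_private[MAX_NODES];
        static unsigned long faults_shared[MAX_NODES];

        /* Classify one NUMA hinting fault and account it. */
        static void account_fault(unsigned int stored_pid_bits, unsigned int current_pid,
                                  int node)
        {
            bool priv = ((current_pid & PID_MASK) == (stored_pid_bits & PID_MASK));

            if (priv)
                faults_private[node]++;
            else
                faults_shared[node]++;
        }

        /* Preferred node: the one with the most (approximately) private faults. */
        static int preferred_node(void)
        {
            int best = 0;
            for (int n = 1; n < MAX_NODES; n++)
                if (faults_private[n] > faults_private[best])
                    best = n;
            return best;
        }

        int main(void)
        {
            account_fault(146, 4242, 1);   /* 4242 & 0xff == 146 -> private fault, node 1 */
            account_fault(17,  4242, 2);   /* mismatch -> shared fault, node 2 */
            printf("preferred node: %d\n", preferred_node());
            return 0;
        }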
     
  • PTE scanning and NUMA hinting fault handling is expensive so commit
    5bca2303 ("mm: sched: numa: Delay PTE scanning until a task is scheduled
    on a new node") deferred the PTE scan until a task had been scheduled on
    another node. The problem is that, in the purely shared memory case,
    this may never happen and no NUMA hinting fault information will be
    captured. We are not ruling out the possibility that something better
    can be done here but for now, this patch needs to be reverted and depend
    entirely on the scan_delay to avoid punishing short-lived processes.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-16-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     

14 Sep, 2013

1 commit

  • Pull aio changes from Ben LaHaise:
    "First off, sorry for this pull request being late in the merge window.
    Al had raised a couple of concerns about 2 items in the series below.
    I addressed the first issue (the race introduced by Gu's use of
    mm_populate()), but he has not provided any further details on how he
    wants to rework the anon_inode.c changes (which were sent out months
    ago but have yet to be commented on).

    The bulk of the changes have been sitting in the -next tree for a few
    months, with all the issues raised being addressed"

    * git://git.kvack.org/~bcrl/aio-next: (22 commits)
    aio: rcu_read_lock protection for new rcu_dereference calls
    aio: fix race in ring buffer page lookup introduced by page migration support
    aio: fix rcu sparse warnings introduced by ioctx table lookup patch
    aio: remove unnecessary debugging from aio_free_ring()
    aio: table lookup: verify ctx pointer
    staging/lustre: kiocb->ki_left is removed
    aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3"
    aio: be defensive to ensure request batching is non-zero instead of BUG_ON()
    aio: convert the ioctx list to table lookup v3
    aio: double aio_max_nr in calculations
    aio: Kill ki_dtor
    aio: Kill ki_users
    aio: Kill unneeded kiocb members
    aio: Kill aio_rw_vect_retry()
    aio: Don't use ctx->tail unnecessarily
    aio: io_cancel() no longer returns the io_event
    aio: percpu ioctx refcount
    aio: percpu reqs_available
    aio: reqs_active -> reqs_available
    aio: fix build when migration is disabled
    ...

    Linus Torvalds
     

23 Aug, 2013

1 commit

  • This is the updated version of df54d6fa5427 ("x86 get_unmapped_area():
    use proper mmap base for bottom-up direction") that only randomizes the
    mmap base address once.

    Signed-off-by: Radu Caragea
    Reported-and-tested-by: Jeff Shorey
    Cc: Andrew Morton
    Cc: Michel Lespinasse
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Adrian Sendroiu
    Cc: Greg KH
    Cc: Kamal Mostafa
    Signed-off-by: Linus Torvalds

    Radu Caragea
     

31 Jul, 2013

1 commit

  • On Wed, Jun 12, 2013 at 11:14:40AM -0700, Kent Overstreet wrote:
    > On Mon, Apr 15, 2013 at 02:40:55PM +0300, Octavian Purdila wrote:
    > > When using a large number of threads performing AIO operations the
    > > IOCTX list may get a significant number of entries which will cause
    > > significant overhead. For example, when running this fio script:
    > >
    > > rw=randrw; size=256k ;directory=/mnt/fio; ioengine=libaio; iodepth=1
    > > blocksize=1024; numjobs=512; thread; loops=100
    > >
    > > on an EXT2 filesystem mounted on top of a ramdisk we can observe up to
    > > 30% CPU time spent by lookup_ioctx:
    > >
    > > 32.51% [guest.kernel] [g] lookup_ioctx
    > > 9.19% [guest.kernel] [g] __lock_acquire.isra.28
    > > 4.40% [guest.kernel] [g] lock_release
    > > 4.19% [guest.kernel] [g] sched_clock_local
    > > 3.86% [guest.kernel] [g] local_clock
    > > 3.68% [guest.kernel] [g] native_sched_clock
    > > 3.08% [guest.kernel] [g] sched_clock_cpu
    > > 2.64% [guest.kernel] [g] lock_release_holdtime.part.11
    > > 2.60% [guest.kernel] [g] memcpy
    > > 2.33% [guest.kernel] [g] lock_acquired
    > > 2.25% [guest.kernel] [g] lock_acquire
    > > 1.84% [guest.kernel] [g] do_io_submit
    > >
    > > This patch converts the ioctx list to a radix tree. For a performance
    > > comparison the above FIO script was run on a 2-socket, 8-core
    > > machine. These are the results (average and %rsd of 10 runs) for the
    > > original list-based implementation and for the radix-tree-based
    > > implementation:
    > >
    > > cores             1          2          4          8         16         32
    > > list      109376 ms   69119 ms   35682 ms   22671 ms   19724 ms   16408 ms
    > > %rsd          0.69%      1.15%      1.17%      1.21%      1.71%      1.43%
    > > radix      73651 ms   41748 ms   23028 ms   16766 ms   15232 ms   13787 ms
    > > %rsd          1.19%      0.98%      0.69%      1.13%      0.72%      0.75%
    > > % of radix
    > > relative     66.12%     65.59%     66.63%     72.31%     77.26%     83.66%
    > > to list
    > >
    > > To consider the impact of the patch on the typical case of having
    > > only one ctx per process the following FIO script was run:
    > >
    > > rw=randrw; size=100m ;directory=/mnt/fio; ioengine=libaio; iodepth=1
    > > blocksize=1024; numjobs=1; thread; loops=100
    > >
    > > on the same system and the results are the following:
    > >
    > > list       58892 ms
    > > %rsd          0.91%
    > > radix      59404 ms
    > > %rsd          0.81%
    > > % of radix
    > > relative    100.87%
    > > to list
    >
    > So, I was just doing some benchmarking/profiling to get ready to send
    > out the aio patches I've got for 3.11 - and it looks like your patch is
    > causing a ~1.5% throughput regression in my testing :/
    ...

    I've got an alternate approach for fixing this wart in lookup_ioctx()...
    Instead of using an rbtree, just use the reserved id in the ring buffer
    header to index an array pointing to the ioctx. It's not finished yet, and
    it needs to be tidied up, but it is most of the way there.

    -ben
    --
    "Thought is the essence of where you are now."
    --
    kmo> And, a rework of Ben's code, but this was entirely his idea
    kmo> -Kent

    bcrl> And fix the code to use the right mm_struct in kill_ioctx(), so it
    actually frees the memory.

    Signed-off-by: Benjamin LaHaise

    Benjamin LaHaise
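
    A user-space sketch of the lookup scheme Ben describes: each io context
    records a small reserved id (also written into the ring buffer header)
    and the mm keeps a flat table indexed by that id, so lookup is a bounds
    check plus an array dereference instead of a list walk. The types and the
    simplified lookup_ioctx() below are stand-ins, not the kernel's
    implementation:

        #include <stdlib.h>
        #include <stdio.h>

        struct ioctx {
            unsigned int id;        /* reserved id, also stored in the ring header */
            /* ... real fields elided ... */
        };

        struct ctx_table {
            unsigned int nr;        /* number of slots */
            struct ioctx *table[];  /* slot i holds the ioctx whose id is i, or NULL */
        };

        /* O(1) lookup by id; the real series additionally verifies the pointer
         * it finds (see "aio: table lookup: verify ctx pointer" above). */
        static struct ioctx *lookup_ioctx(struct ctx_table *t, unsigned int id)
        {
            if (id >= t->nr)
                return NULL;
            return t->table[id];
        }

        int main(void)
        {
            struct ctx_table *t = calloc(1, sizeof(*t) + 8 * sizeof(struct ioctx *));
            struct ioctx ctx = { .id = 3 };

            t->nr = 8;
            t->table[ctx.id] = &ctx;
            printf("found: %p\n", (void *)lookup_ioctx(t, 3));
            free(t);
            return 0;
        }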
     

11 Jul, 2013

1 commit

  • Since all architectures have been converted to use vm_unmapped_area(),
    there is no remaining use for the free_area_cache.

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Cc: "James E.J. Bottomley"
    Cc: "Luck, Tony"
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Paul Mackerras
    Cc: Richard Henderson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

24 Feb, 2013

3 commits

  • page->_last_nid fits into page->flags on 64-bit. The unlikely 32-bit
    NUMA configuration with NUMA Balancing will still need an extra page
    field. As Peter notes "Completely dropping 32bit support for
    CONFIG_NUMA_BALANCING would simplify things, but it would also remove
    the warning if we grow enough 64bit only page-flags to push the last-cpu
    out."

    [mgorman@suse.de: minor modifications]
    Signed-off-by: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Ingo Molnar
    Cc: Simon Jeons
    Cc: Wanpeng Li
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • This is a preparation patch for moving page->_last_nid into page->flags
    that moves page flag layout information to a separate header. This
    patch is necessary because otherwise there would be a circular
    dependency between mm_types.h and mm.h.

    Signed-off-by: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Ingo Molnar
    Cc: Simon Jeons
    Cc: Wanpeng Li
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • s/me/be/ and clarify the comment a bit when we're changing it anyway.

    Signed-off-by: Mel Gorman
    Suggested-by: Simon Jeons
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Ingo Molnar
    Cc: Wanpeng Li
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

19 Dec, 2012

1 commit

  • Pull SLAB changes from Pekka Enberg:
    "This contains preparational work from Christoph Lameter and Glauber
    Costa for SLAB memcg and cleanups and improvements from Ezequiel
    Garcia and Joonsoo Kim.

    Please note that the SLOB cleanup commit from Arnd Bergmann already
    appears in your tree but I had also merged it myself which is why it
    shows up in the shortlog."

    * 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
    mm/sl[aou]b: Common alignment code
    slab: Use the new create_boot_cache function to simplify bootstrap
    slub: Use statically allocated kmem_cache boot structure for bootstrap
    mm, sl[au]b: create common functions for boot slab creation
    slab: Simplify bootstrap
    slub: Use correct cpu_slab on dead cpu
    mm: fix slab.c kernel-doc warnings
    mm/slob: use min_t() to compare ARCH_SLAB_MINALIGN
    slab: Ignore internal flags in cache creation
    mm/slob: Use free_page instead of put_page for page-size kmalloc allocations
    mm/sl[aou]b: Move common kmem_cache_size() to slab.h
    mm/slob: Use object_size field in kmem_cache_size()
    mm/slob: Drop usage of page->private for storing page-sized allocations
    slub: Commonize slab_cache field in struct page
    sl[au]b: Process slabinfo_show in common code
    mm/sl[au]b: Move print_slabinfo_header to slab_common.c
    mm/sl[au]b: Move slabinfo processing to slab_common.c
    slub: remove one code path and reduce lock contention in __slab_free()

    Linus Torvalds
     

17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for the workloads driven by perf, which is bad
    but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers, which is not reported by
    the tool by default and sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batch
    handles PTEs but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

12 Dec, 2012

2 commits

  • The kernel walks the VMA rbtree in various places, including the page
    fault path. However, the vm_rb node spanned two cache lines on 64-bit
    systems with 64-byte cache lines (most x86 systems).

    Rearrange vm_area_struct a little, so all the information we need to do a
    VMA tree walk is in the first cache line.

    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Michel Lespinasse
    Signed-off-by: Rik van Riel
    Cc: Hugh Dickins
    Cc: Russell King
    Cc: Ralf Baechle
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Cc: Chris Metcalf
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
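
    The point of the rearrangement above can be checked mechanically: assert
    at compile time that every field the tree walk needs ends within the first
    64 bytes. A hedged sketch with an invented miniature vma; the real
    vm_area_struct has many more fields and a different layout:

        #include <assert.h>
        #include <stddef.h>

        struct fake_rb_node { struct fake_rb_node *left, *right, *parent; };

        /* Invented miniature of a vma: the walk needs vm_start, vm_end, vm_rb
         * and rb_subtree_gap, so they are placed first. */
        struct fake_vma {
            unsigned long vm_start;
            unsigned long vm_end;
            struct fake_rb_node vm_rb;
            unsigned long rb_subtree_gap;
            /* colder fields would follow here */
            void *vm_mm;
            unsigned long vm_flags;
        };

        /* Everything the tree walk touches must live in the first 64-byte line. */
        static_assert(offsetof(struct fake_vma, rb_subtree_gap) + sizeof(unsigned long) <= 64,
                      "tree-walk fields spill into a second cache line");

        int main(void) { return 0; }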
     
  • Define vma->rb_subtree_gap as the largest gap between any vma in the
    subtree rooted at that vma, and their predecessor. Or, for a recursive
    definition, vma->rb_subtree_gap is the max of:

    - vma->vm_start - vma->vm_prev->vm_end
    - rb_subtree_gap fields of the vmas pointed to by vma->rb.rb_left and
    vma->rb.rb_right

    This will allow get_unmapped_area_* to find a free area of the right
    size in O(log(N)) time, instead of potentially having to do a linear
    walk across all the VMAs.

    Also define mm->highest_vm_end as the vm_end field of the highest vma,
    so that we can easily check if the following gap is suitable.

    This does have the potential to make unmapping VMAs more expensive,
    especially for processes with very large numbers of VMAs, where the VMA
    rbtree can grow quite deep.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Rik van Riel
    Cc: Hugh Dickins
    Cc: Russell King
    Cc: Ralf Baechle
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Cc: Chris Metcalf
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
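
    Following the recursive definition above, the augmented value of a node is
    the maximum of its own gap to its predecessor and the rb_subtree_gap of
    its two children. A stand-alone reference computation with an invented
    node type (the kernel maintains the value incrementally via rbtree augment
    callbacks rather than recomputing it recursively):

        struct fake_vma {
            unsigned long vm_start, vm_end;
            unsigned long rb_subtree_gap;
            unsigned long prev_end;          /* vm_prev->vm_end, 0 if no predecessor */
            struct fake_vma *rb_left, *rb_right;
        };

        static unsigned long max3(unsigned long a, unsigned long b, unsigned long c)
        {
            unsigned long m = a > b ? a : b;
            return m > c ? m : c;
        }

        /* Gap between this vma and its predecessor in address order. */
        static unsigned long own_gap(const struct fake_vma *v)
        {
            return v->vm_start - v->prev_end;
        }

        /* Recursive reference implementation of the augmented value. */
        static unsigned long compute_subtree_gap(const struct fake_vma *v)
        {
            if (!v)
                return 0;
            return max3(own_gap(v),
                        compute_subtree_gap(v->rb_left),
                        compute_subtree_gap(v->rb_right));
        }

        int main(void)
        {
            /* two vmas: [100,200) with nothing before it, and [500,600) after it */
            struct fake_vma low  = { .vm_start = 100, .vm_end = 200, .prev_end = 0 };
            struct fake_vma high = { .vm_start = 500, .vm_end = 600, .prev_end = 200,
                                     .rb_left = &low };
            /* largest gap in the subtree rooted at 'high' is max(100, 300) == 300 */
            return compute_subtree_gap(&high) == 300 ? 0 : 1;
        }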
     

11 Dec, 2012

5 commits

  • Due to the fact that migrations are driven by the CPU a task is running
    on, there is no point in tracking NUMA faults until one task runs on a new
    node. This patch tracks the first node used by an address space. Until
    it changes, PTE scanning is disabled and no NUMA hinting faults are
    trapped. This should help workloads that are short-lived, do not care
    about NUMA placement or have bound themselves to a single node.

    This takes advantage of the logic in "mm: sched: numa: Implement slow
    start for working set sampling" to delay when the checks are made. This
    will take advantage of processes that set their CPU and node bindings
    early in their lifetime. It will also potentially allow any initial load
    balancing to take place.

    Signed-off-by: Mel Gorman

    Mel Gorman
     
  • The PTE scanning rate and fault rates are two of the biggest sources of
    system CPU overhead with automatic NUMA placement. Ideally a proper policy
    would detect if a workload was properly placed, schedule and adjust the
    PTE scanning rate accordingly. We do not track the necessary information
    to do that but we at least know if we migrated or not.

    This patch scans slower if a page was not migrated as the result of a
    NUMA hinting fault up to sysctl_numa_balancing_scan_period_max which is
    now higher than the previous default. Once every minute it will reset
    the scanner in case of phase changes.

    This is hilariously crude and the numbers are arbitrary. Workloads will
    converge quite slowly in comparison to what a proper policy should be able
    to do. On the plus side, we will chew up less CPU for workloads that have
    no need for automatic balancing.

    Signed-off-by: Mel Gorman

    Mel Gorman
     
  • This patch introduces a last_nid field to the page struct. This is used
    to build a two-stage filter in the next patch that is aimed at
    mitigating a problem whereby pages migrate to the wrong node when
    referenced by a process that was running off its home node.

    Signed-off-by: Mel Gorman

    Mel Gorman
     
  • Previously, to probe the working set of a task, we'd use
    a very simple and crude method: mark all of its address
    space PROT_NONE.

    That method has various (obvious) disadvantages:

    - it samples the working set at dissimilar rates,
    giving some tasks a sampling quality advantage
    over others.

    - creates performance problems for tasks with very
    large working sets

    - over-samples processes with large address spaces but
    which only very rarely execute

    Improve that method by keeping a rotating offset into the
    address space that marks the current position of the scan,
    and advance it by a constant rate (in a CPU cycles execution
    proportional manner). If the offset reaches the last mapped
    address of the mm then it starts over at the first
    address.

    The per-task nature of the working set sampling functionality in this tree
    allows such constant rate, per task, execution-weight proportional sampling
    of the working set, with an adaptive sampling interval/frequency that
    goes from once per 100ms up to just once per 8 seconds. The current
    sampling volume is 256 MB per interval.

    As tasks mature and converge their working set, so does the
    sampling rate slow down to just a trickle, 256 MB per 8
    seconds of CPU time executed.

    This, beyond being adaptive, also rate-limits rarely
    executing systems and does not over-sample on overloaded
    systems.

    [ In AutoNUMA speak, this patch deals with the effective sampling
    rate of the 'hinting page fault'. AutoNUMA's scanning is
    currently rate-limited, but it is also fundamentally
    single-threaded, executing in the knuma_scand kernel thread,
    so the limit in AutoNUMA is global and does not scale up with
    the number of CPUs, nor does it scan tasks in an execution
    proportional manner.

    So the idea of rate-limiting the scanning was first implemented
    in the AutoNUMA tree via a global rate limit. This patch goes
    beyond that by implementing an execution rate proportional
    working set sampling rate that is not implemented via a single
    global scanning daemon. ]

    [ Dan Carpenter pointed out a possible NULL pointer dereference in the
    first version of this patch. ]

    Based-on-idea-by: Andrea Arcangeli
    Bug-Found-By: Dan Carpenter
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    [ Wrote changelog and fixed bug. ]
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel

    Peter Zijlstra
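
    The rotating-offset scanning described above reduces to keeping a per-mm
    cursor, advancing it by a fixed amount on each pass, and wrapping to the
    start of the address space when the end is reached. A simplified sketch
    with invented names and a fixed 256 MB chunk; the real scanner walks vma
    ranges and uses tunable scan sizes and periods:

        #include <stdio.h>

        struct fake_mm {
            unsigned long start;        /* lowest mapped address */
            unsigned long end;          /* highest mapped address */
            unsigned long scan_offset;  /* where the next scan pass begins */
        };

        #define SCAN_CHUNK (256UL << 20)   /* 256 MB per pass, as in the changelog */

        /* Return the [begin, end) range to mark for hinting faults this pass and
         * advance the cursor, wrapping around when the end of the mm is reached. */
        static void next_scan_window(struct fake_mm *mm, unsigned long *begin,
                                     unsigned long *out_end)
        {
            if (mm->scan_offset >= mm->end)
                mm->scan_offset = mm->start;       /* start over at the first address */

            *begin = mm->scan_offset;
            *out_end = (*begin + SCAN_CHUNK < mm->end) ? *begin + SCAN_CHUNK : mm->end;
            mm->scan_offset = *out_end;
        }

        int main(void)
        {
            struct fake_mm mm = { .start = 0, .end = 1UL << 30, .scan_offset = 0 };
            unsigned long b, e;

            for (int pass = 0; pass < 6; pass++) { /* 1 GB mm -> wraps after 4 passes */
                next_scan_window(&mm, &b, &e);
                printf("pass %d: [%#lx, %#lx)\n", pass, b, e);
            }
            return 0;
        }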
     
  • NOTE: This patch is based on "sched, numa, mm: Add fault driven
    placement and migration policy" but as it throws away all the policy
    to just leave a basic foundation I had to drop the signed-offs-by.

    This patch creates a bare-bones method for setting PTEs pte_numa in the
    context of the scheduler that when faulted later will be faulted onto the
    node the CPU is running on. In itself this does nothing useful but any
    placement policy will fundamentally depend on receiving hints on placement
    from fault context and doing something intelligent about it.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel

    Peter Zijlstra
     

24 Oct, 2012

1 commit

  • Right now, slab and slub have fields in struct page to derive which
    cache a page belongs to, but they do it slightly differently.

    slab uses a field called slab_cache, which lives in the third double
    word. slub uses a field called "slab", living outside of the
    doubleword area.

    Ideally, we could use the same field for this. Since slub heavily makes
    use of the doubleword region, there isn't really much room to move
    slub's slab_cache field around. Since slab does not have such strict
    placement restrictions, we can move it outside the doubleword area.

    The naming used by slab, "slab_cache", is less confusing, and it is
    preferred over slub's generic "slab".

    Signed-off-by: Glauber Costa
    Acked-by: Christoph Lameter
    CC: David Rientjes
    Signed-off-by: Pekka Enberg

    Glauber Costa
     

09 Oct, 2012

3 commits

  • Implement an interval tree as a replacement for the VMA prio_tree. The
    algorithms are similar to lib/interval_tree.c; however that code can't be
    directly reused as the interval endpoints are not explicitly stored in the
    VMA. So instead, the common algorithm is moved into a template and the
    details (node type, how to get interval endpoints from the node, etc) are
    filled in using the C preprocessor.

    Once the interval tree functions are available, using them as a
    replacement to the VMA prio tree is a relatively simple, mechanical job.

    Signed-off-by: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Hillf Danton
    Cc: Peter Zijlstra
    Cc: Catalin Marinas
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
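
    The template approach above can be imitated in miniature: a macro takes
    the node type plus accessors for the interval endpoints and expands into a
    type-specific function. The sketch below only generates a trivial overlap
    test, not the full augmented rbtree, and every name in it is invented:

        #include <stdio.h>
        #include <stdbool.h>

        /* "Template": instantiate an overlap test for any node type, given how to
         * obtain the interval's start and last endpoints from a node. */
        #define DEFINE_INTERVAL_OVERLAP(name, type, get_start, get_last)                 \
            static bool name(const type *node, unsigned long start, unsigned long last)  \
            {                                                                            \
                return get_start(node) <= last && start <= get_last(node);               \
            }

        /* A vma-like node that does not store its endpoints directly... */
        struct fake_vma { unsigned long pgoff, len; };

        /* ...so the accessors derive them, which is exactly why the generic
         * lib/interval_tree.c code could not be reused as-is. */
        static unsigned long vma_start(const struct fake_vma *v) { return v->pgoff; }
        static unsigned long vma_last(const struct fake_vma *v)  { return v->pgoff + v->len - 1; }

        DEFINE_INTERVAL_OVERLAP(vma_overlaps, struct fake_vma, vma_start, vma_last)

        int main(void)
        {
            struct fake_vma v = { .pgoff = 10, .len = 5 };   /* covers [10, 14] */
            printf("%d %d\n", vma_overlaps(&v, 14, 20), vma_overlaps(&v, 15, 20)); /* 1 0 */
            return 0;
        }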
     
  • A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA,
    currently it lost original meaning but still has some effects:

     | effect                 | alternative flags
    -+------------------------+---------------------------------------------
    1| account as reserved_vm | VM_IO
    2| skip in core dump      | VM_IO, VM_DONTDUMP
    3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
    4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP

    This patch removes the reserved_vm counter from mm_struct. Seems like
    nobody cares about it; it is not exported to userspace directly, it only
    reduces the total_vm shown in proc.

    Thus VM_RESERVED can be replaced with VM_IO or the pair
    VM_DONTEXPAND | VM_DONTDUMP.

    remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
    remap_vmalloc_range() sets VM_DONTEXPAND | VM_DONTDUMP.

    [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Currently the kernel sets mm->exe_file during sys_execve() and then tracks
    the number of vmas with the VM_EXECUTABLE flag in mm->num_exe_file_vmas;
    as soon as this counter drops to zero the kernel resets mm->exe_file to
    NULL. It also resets mm->exe_file at the last mmput(), when mm->mm_users
    drops to zero.

    A VMA with the VM_EXECUTABLE flag appears after mapping a file with the
    MAP_EXECUTABLE flag; such vmas can appear only at sys_execve() or after vma
    splitting, because sys_mmap ignores this flag. Usually the binfmt module
    sets mm->exe_file and mmaps executable vmas with this file, and they hold
    mm->exe_file while the task is running.

    comment from v2.6.25-6245-g925d1c4 ("procfs task exe symlink"),
    where all this stuff was introduced:

    > The kernel implements readlink of /proc/pid/exe by getting the file from
    > the first executable VMA. Then the path to the file is reconstructed and
    > reported as the result.
    >
    > Because of the VMA walk the code is slightly different on nommu systems.
    > This patch avoids separate /proc/pid/exe code on nommu systems. Instead of
    > walking the VMAs to find the first executable file-backed VMA we store a
    > reference to the exec'd file in the mm_struct.
    >
    > That reference would prevent the filesystem holding the executable file
    > from being unmounted even after unmapping the VMAs. So we track the number
    > of VM_EXECUTABLE VMAs and drop the new reference when the last one is
    > unmapped. This avoids pinning the mounted filesystem.

    exe_file's vma accounting is hooked into every file mmap/unmap and vma
    split/merge just to avoid a hypothetical case of an mm pinning a filesystem
    against unmounting after it has already unmapped all its executable files
    but is still alive.

    Seems like currently nobody depends on this behaviour. We can try to
    remove this logic and keep mm->exe_file until final mmput().

    mm->exe_file is still protected with mm->mmap_sem, because we want to
    change it via the new sys_prctl(PR_SET_MM_EXE_FILE). Via this syscall a
    task can also change its mm->exe_file and unpin the mountpoint explicitly.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

01 Aug, 2012

2 commits

  • __GFP_MEMALLOC will allow the allocation to disregard the watermarks, much
    like PF_MEMALLOC. It allows one to pass along the memalloc state in
    object related allocation flags as opposed to task related flags, such as
    sk->sk_allocation. This removes the need for ALLOC_PFMEMALLOC as callers
    using __GFP_MEMALLOC can get the ALLOC_NO_WATERMARK flag which is now
    enough to identify allocations related to page reclaim.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Neil Brown
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
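
    The flag's role described above can be modelled as follows: the allocator
    derives its internal "ignore watermarks" decision either from the calling
    task's state or from a bit carried in the gfp mask, so the memalloc
    property can travel with an object (such as a socket's allocation mask)
    instead of only with the task. Everything below is a toy model with
    invented names, not the kernel's allocator:

        #include <stdbool.h>
        #include <stdio.h>

        #define FAKE_GFP_MEMALLOC   0x1u     /* per-allocation: may dip into reserves */
        #define TASK_PF_MEMALLOC    0x2u     /* per-task: this task is reclaiming memory */

        struct fake_task { unsigned int flags; };

        /* Decide whether this allocation may ignore the watermarks. */
        static bool alloc_no_watermarks(const struct fake_task *task, unsigned int gfp_mask)
        {
            if (gfp_mask & FAKE_GFP_MEMALLOC)     /* carried by the object, e.g. sk->sk_allocation */
                return true;
            if (task->flags & TASK_PF_MEMALLOC)   /* carried by the task doing reclaim */
                return true;
            return false;
        }

        int main(void)
        {
            struct fake_task normal = { .flags = 0 };

            printf("%d %d\n",
                   alloc_no_watermarks(&normal, 0),                  /* 0: ordinary allocation */
                   alloc_no_watermarks(&normal, FAKE_GFP_MEMALLOC)); /* 1: reserve access granted */
            return 0;
        }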
     
  • When a user or administrator requires swap for their application, they
    create a swap partition and file, format it with mkswap and activate it
    with swapon. Swap over the network is considered as an option in diskless
    systems. The two likely scenarios are when blade servers are used as part
    of a cluster where the form factor or maintenance costs do not allow the
    use of disks and thin clients.

    The Linux Terminal Server Project recommends the use of the Network Block
    Device (NBD) for swap according to the manual at
    https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
    There is also documentation and tutorials on how to setup swap over NBD at
    places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The
    nbd-client also documents the use of NBD as swap. Despite this, the fact
    is that a machine using NBD for swap can deadlock within minutes if swap
    is used intensively. This patch series addresses the problem.

    The core issue is that network block devices do not use mempools like
    normal block devices do. As the host cannot control where they receive
    packets from, they cannot reliably work out in advance how much memory
    they might need. Some years ago, Peter Zijlstra developed a series of
    patches that supported swap over an NFS that at least one distribution is
    carrying within their kernels. This patch series borrows very heavily
    from Peter's work to support swapping over NBD as a pre-requisite to
    supporting swap-over-NFS. The bulk of the complexity is concerned with
    preserving memory that is allocated from the PFMEMALLOC reserves for use
    by the network layer which is needed for both NBD and NFS.

    Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
    preserve access to pages allocated under low memory situations
    to callers that are freeing memory.

    Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks

    Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
    reserves without setting PFMEMALLOC.

    Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
    for later use by network packet processing.

    Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required

    Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set.

    Patches 7-12 allows network processing to use PFMEMALLOC reserves when
    the socket has been marked as being used by the VM to clean pages. If
    packets are received and stored in pages that were allocated under
    low-memory situations and are unrelated to the VM, the packets
    are dropped.

    Patch 11 reintroduces __skb_alloc_page which the networking
    folk may object to but is needed in some cases to propagate
    pfmemalloc from a newly allocated page to an skb. If there is a
    strong objection, this patch can be dropped with the impact being
    that swap-over-network will be slower in some cases but it should
    not fail.

    Patch 13 is a micro-optimisation to avoid a function call in the
    common case.

    Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use
    PFMEMALLOC if necessary.

    Patch 15 notes that it is still possible for the PFMEMALLOC reserve
    to be depleted. To prevent this, direct reclaimers get throttled on
    a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is
    expected that kswapd and the direct reclaimers already running
    will clean enough pages for the low watermark to be reached and
    the throttled processes are woken up.

    Patch 16 adds a statistic to track how often processes get throttled

    Some basic performance testing was run using kernel builds, netperf on
    loopback for UDP and TCP, hackbench (pipes and sockets), iozone and
    sysbench. Each of them were expected to use the sl*b allocators
    reasonably heavily but there did not appear to be significant performance
    variances.

    For testing swap-over-NBD, a machine was booted with 2G of RAM with a
    swapfile backed by NBD. 8*NUM_CPU processes were started that create
    anonymous memory mappings and read them linearly in a loop. The total
    size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under
    memory pressure.

    Without the patches and using SLUB, the machine locks up within minutes,
    but runs to completion with them applied. With SLAB, the story is
    different, as an unpatched kernel runs to completion. However, the patched
    kernel completed the test 45% faster.

    MICRO
                                                3.5.0-rc2   3.5.0-rc2
                                                  vanilla     swapnbd
    Unrecognised test vmscan-anon-mmap-write
    MMTests Statistics: duration
    Sys Time Running Test (seconds)                197.80      173.07
    User+Sys Time Running Test (seconds)           206.96      182.03
    Total Elapsed Time (seconds)                  3240.70     1762.09

    This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages

    Allocations of pages below the min watermark run a risk of the machine
    hanging due to a lack of memory. To prevent this, only callers who have
    PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are
    allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to
    a slab though, nothing prevents other callers consuming free objects
    within those slabs. This patch limits access to slab pages that were
    alloced from the PFMEMALLOC reserves.

    When this patch is applied, pages allocated from below the low watermark
    are returned with page->pfmemalloc set and it is up to the caller to
    determine how the page should be protected. SLAB restricts access to any
    page with page->pfmemalloc set to callers which are known to be able to
    access the PFMEMALLOC reserve. If one is not available, an attempt is
    made to allocate a new page rather than use a reserve. SLUB is a bit more
    relaxed in that it only records if the current per-CPU page was allocated
    from PFMEMALLOC reserve and uses another partial slab if the caller does
    not have the necessary GFP or process flags. This was found to be
    sufficient in tests to avoid hangs due to SLUB generally maintaining
    smaller lists than SLAB.

    In low-memory conditions it does mean that !PFMEMALLOC allocators can fail
    a slab allocation even though free objects are available because they are
    being preserved for callers that are freeing pages.

    [a.p.zijlstra@chello.nl: Original implementation]
    [sebastian@breakpoint.cc: Correct order of page flag clearing]
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

31 Jul, 2012

1 commit

  • Pull SLAB changes from Pekka Enberg:
    "Most of the changes included are from Christoph Lameter's "common
    slab" patch series that unifies common parts of SLUB, SLAB, and SLOB
    allocators. The unification is needed for Glauber Costa's "kmem
    memcg" work that will hopefully appear for v3.7.

    The rest of the changes are fixes and speedups by various people."

    * 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: (32 commits)
    mm: Fix build warning in kmem_cache_create()
    slob: Fix early boot kernel crash
    mm, slub: ensure irqs are enabled for kmemcheck
    mm, sl[aou]b: Move kmem_cache_create mutex handling to common code
    mm, sl[aou]b: Use a common mutex definition
    mm, sl[aou]b: Common definition for boot state of the slab allocators
    mm, sl[aou]b: Extract common code for kmem_cache_create()
    slub: remove invalid reference to list iterator variable
    mm: Fix signal SIGFPE in slabinfo.c.
    slab: move FULL state transition to an initcall
    slab: Fix a typo in commit 8c138b "slab: Get rid of obj_size macro"
    mm, slab: Build fix for recent kmem_cache changes
    slab: rename gfpflags to allocflags
    slub: refactoring unfreeze_partials()
    slub: use __cmpxchg_double_slab() at interrupt disabled place
    slab/mempolicy: always use local policy from interrupt context
    slab: Get rid of obj_size macro
    mm, sl[aou]b: Extract common fields from struct kmem_cache
    slab: Remove some accessors
    slab: Use page struct fields instead of casting
    ...

    Linus Torvalds
     

21 Jun, 2012

1 commit

  • On arches that do not support this_cpu_cmpxchg_double(), slab_lock is used
    to do an atomic cmpxchg() on the double word which contains page->_count.
    The page count can be changed from get_page() or put_page() without taking
    slab_lock, which corrupts the page counter.

    Fix it by moving page->_count out of the cmpxchg_double data, so that slub
    does not change it while updating slub metadata in struct page.

    [akpm@linux-foundation.org: use standard comment layout, tweak comment text]
    Reported-by: Amey Bhide
    Signed-off-by: Pravin B Shelar
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pravin B Shelar
     

14 Jun, 2012

2 commits

  • Add fields to the page struct so that it is properly documented that
    slab overlays the lru fields.

    This cleans up some casts in slab.

    Reviewed-by: Glauber Costa
    Reviewed-by: Joonsoo Kim
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • Define the fields used by slob in mm_types.h and use struct page instead
    of struct slob_page in slob. This cleans up numerous typecasts in slob.c
    and makes readers aware of slob's use of page struct fields.

    [Also cleans up some bitrot in slob.c. The page struct field layout
    in slob.c is an old layout and does not match the one in mm_types.h]

    Reviewed-by: Glauber Costa
    Acked-by: David Rientjes
    Reviewed-by: Joonsoo Kim
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

30 May, 2012

1 commit

  • The swap token code no longer fits in with the current VM model. It
    does not play well with cgroups or the better NUMA placement code in
    development, since we have only one swap token globally.

    It also has the potential to mess with scalability of the system, by
    increasing the number of non-reclaimable pages on the active and
    inactive anon LRU lists.

    Last but not least, the swap token code has been broken for a year
    without complaints, as reported by Konstantin Khlebnikov. This suggests
    we no longer have much use for it.

    The days of sub-1G memory systems with heavy use of swap are over. If
    we ever need thrashing reducing code in the future, we will have to
    implement something that does scale.

    Signed-off-by: Rik van Riel
    Cc: Konstantin Khlebnikov
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Acked-by: Bob Picco
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel