26 Sep, 2014

1 commit

  • commit b13b1d2d8692b437203de7a404c6b809d2cc4d99 upstream.

    We use the accessed bit to age a page at page reclaim time,
    and currently we also flush the TLB when doing so.

    But in some workloads TLB flush overhead is very heavy. In my
    simple multithreaded app with a lot of swap to several pcie
    SSDs, removing the tlb flush gives about 20% ~ 30% swapout
    speedup.

    Fortunately just removing the TLB flush is a valid optimization:
    on x86 CPUs, clearing the accessed bit without a TLB flush
    doesn't cause data corruption.

    It could cause incorrect page aging and the (mistaken) reclaim of
    hot pages, but the chance of that should be relatively low.

    So as a performance optimization don't flush the TLB when
    clearing the accessed bit, it will eventually be flushed by
    a context switch or a VM operation anyway. [ In the rare
    event of it not getting flushed for a long time the delay
    shouldn't really matter because there's no real memory
    pressure for swapout to react to. ]
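
    A minimal sketch of what the x86 helper ends up doing after this change
    (an illustration based on the description above, not the literal diff):

    int ptep_clear_flush_young(struct vm_area_struct *vma,
                               unsigned long address, pte_t *ptep)
    {
            /*
             * Age the page by clearing the accessed bit, but skip the
             * TLB shootdown that used to follow: on x86 this can't
             * corrupt data, at worst a hot page gets aged too early.
             */
            return ptep_test_and_clear_young(vma, address, ptep);
    }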

    Suggested-by: Linus Torvalds
    Signed-off-by: Shaohua Li
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Acked-by: Hugh Dickins
    Acked-by: Johannes Weiner
    Cc: linux-mm@kvack.org
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140408075809.GA1764@kernel.org
    [ Rewrote the changelog and the code comments. ]
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Shaohua Li
     

10 Jul, 2013

1 commit

  • The old code accumulated addr to get the right pmd. However, pmds are now
    preallocated and transferred as a parameter, so there is no need to
    accumulate the addr variable any more; this patch removes it.

    Signed-off-by: Wanpeng Li
    Reviewed-by: Michal Hocko
    Reviewed-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     

13 Apr, 2013

1 commit

  • This patch attempts to fix:

    https://bugzilla.kernel.org/show_bug.cgi?id=56461

    The symptom is a crash and messages like this:

    chrome: Corrupted page table at address 34a03000
    *pdpt = 0000000000000000 *pde = 0000000000000000
    Bad pagetable: 000f [#1] PREEMPT SMP

    Ingo guesses this got introduced by commit 611ae8e3f520 ("x86/tlb:
    enable tlb flush range support for x86") since that code started to free
    unused pagetables.

    On x86-32 PAE kernels, that new code has the potential to free an entire
    PMD page and will clear one of the four page-directory-pointer-table
    (aka pgd_t) entries.

    The hardware aggressively "caches" these top-level entries and invlpg
    does not actually affect the CPU's copy. If we clear one we *HAVE* to
    do a full TLB flush, otherwise we might continue using a freed pmd page.
    (note, we do this properly on the population side in pud_populate()).

    This patch tracks whenever we clear one of these entries in the 'struct
    mmu_gather', and ensures that we follow up with a full tlb flush.

    BTW, I disassembled and checked that:

    if (tlb->fullmm == 0)
    and
    if (!tlb->fullmm && !tlb->need_flush_all)

    generate essentially the same code, so there should be zero impact there
    to the !PAE case.
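
    A condensed sketch of the mechanism (member and helper names follow the
    x86 code of that era, but treat this as an illustration rather than the
    exact patch):

    struct mmu_gather {
            /* ... existing members ... */
            unsigned int need_flush_all;    /* a PAE top-level entry was cleared */
    };

    /*
     * Freeing a PMD page on x86-32 PAE clears one of the four
     * page-directory-pointer-table entries; invlpg won't drop the CPU's
     * cached copy, so request a full flush instead of a ranged one.
     */
    void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
    {
    #ifdef CONFIG_X86_PAE
            tlb->need_flush_all = 1;
    #endif
            tlb_remove_page(tlb, virt_to_page(pmd));
    }

    /* ...and the arch tlb_flush() path then becomes: */
    if (!tlb->fullmm && !tlb->need_flush_all)
            flush_tlb_mm_range(tlb->mm, tlb->start, tlb->end, 0UL);
    else
            flush_tlb_mm(tlb->mm);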

    Signed-off-by: Dave Hansen
    Cc: Peter Anvin
    Cc: Ingo Molnar
    Cc: Artem S Tashkinov
    Signed-off-by: Linus Torvalds

    Dave Hansen
     


17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for these workloads driven by perf, which is bad
    but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers, which is not reported by
    the tool by default and sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions, but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore
    batch-handles PTEs, but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

11 Dec, 2012

2 commits

  • Intel has an architectural guarantee that the TLB entry causing
    a page fault gets invalidated automatically. This means
    we should be able to drop the local TLB invalidation.

    Because of the way other areas of the page fault code work,
    chances are good that all x86 CPUs do this. However, if
    someone somewhere has an x86 CPU that does not invalidate
    the TLB entry causing a page fault, this one-liner should
    be easy to revert.
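
    Roughly, the pte helper ends up like this after this change and the
    local-only-flush change in the next entry (a sketch, not the exact
    upstream code):

    int ptep_set_access_flags(struct vm_area_struct *vma,
                              unsigned long address, pte_t *ptep,
                              pte_t entry, int dirty)
    {
            int changed = !pte_same(*ptep, entry);

            if (changed && dirty) {
                    *ptep = entry;
                    pte_update_defer(vma->vm_mm, address, ptep);
                    /*
                     * No TLB flush, local or remote: the faulting entry
                     * is dropped by the CPU itself, and a stale entry on
                     * another CPU only causes a spurious fault there.
                     */
            }

            return changed;
    }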

    Signed-off-by: Rik van Riel
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Michel Lespinasse
    Cc: Peter Zijlstra
    Cc: Ingo Molnar

    Rik van Riel
     
  • The function ptep_set_access_flags() is only ever invoked to set access
    flags or add write permission on a PTE. The write bit is only ever set
    together with the dirty bit.

    Because we only ever upgrade a PTE, it is safe to skip flushing entries on
    remote TLBs. The worst that can happen is a spurious page fault on other
    CPUs, which would flush that TLB entry.

    Lazily letting another CPU incur a spurious page fault occasionally is
    (much!) cheaper than aggressively flushing everybody else's TLB.

    Signed-off-by: Rik van Riel
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Michel Lespinasse
    Cc: Ingo Molnar

    Rik van Riel
     


23 Nov, 2012

1 commit

  • If we have a write protection #PF and fix up the pmd then the
    hugetlb code [the only user of pmdp_set_access_flags], in its
    do_huge_pmd_wp_page() page fault resolution function calls
    pmdp_set_access_flags() to mark the pmd permissive again,
    and flushes the TLB.

    This TLB flush is unnecessary: a flush on #PF is guaranteed on
    most (all?) x86 CPUs, and even in the worst-case we'll generate
    a spurious fault.

    So remove it.

    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Paul Turner
    Cc: Lee Schermerhorn
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20121120120251.GA15742@gmail.com
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

18 Mar, 2011

1 commit

  • According to the Intel CPU manual, every time a PGD entry is changed in
    i386 PAE mode, we need to do a full TLB flush. The current code follows
    this, and there is a comment about it in the code, too.

    But the current code misses the multi-threaded case: a changed page table
    might be in use by several CPUs, and every such CPU should flush its TLB.
    Usually this isn't a problem, because we prepopulate all PGD entries at
    process fork. But when the process does a munmap followed by a new mmap,
    the issue is triggered.

    When it happens, some CPUs keep doing page faults:

    http://marc.info/?l=linux-kernel&m=129915020508238&w=2
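
    A sketch of the kind of fix this implies for the PAE pud_populate() path
    (illustrative; the exact upstream change may differ in detail):

    static inline void pud_populate(struct mm_struct *mm, pud_t *pudp, pmd_t *pmd)
    {
            paravirt_alloc_pmd(mm, __pa(pmd) >> PAGE_SHIFT);
            set_pud(pudp, __pud(__pa(pmd) | _PAGE_PRESENT));

            /*
             * A changed top-level entry in i386 PAE mode needs a full TLB
             * flush -- and not only on this CPU: other threads of the
             * process may be running on other CPUs, so flush the whole mm
             * instead of just reloading the local cr3.
             */
            flush_tlb_mm(mm);
    }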

    Reported-by: Yasunori Goto
    Tested-by: Yasunori Goto
    Reviewed-by: Rik van Riel
    Signed-off-by: Shaohua Li
    Cc: Mallick Asit K
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: linux-mm
    Cc: stable
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Shaohua Li
     

10 Mar, 2011

1 commit

  • It's forbidden to take the page_table_lock with irqs disabled: if there's
    contention, the IPIs (for TLB flushes) sent with the page_table_lock held
    will never run, leading to a deadlock.

    Nobody takes the pgd_lock from irq context so the _irqsave can be
    removed.
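
    In code terms the change is essentially this (sketch):

    /* Before: irqs disabled while pgd_lock is held */
    spin_lock_irqsave(&pgd_lock, flags);
    /* ... walk pgd_list, potentially racing with TLB-flush IPIs ... */
    spin_unlock_irqrestore(&pgd_lock, flags);

    /* After: nobody takes pgd_lock from irq context, so plain locking is
     * enough and the IPI deadlock scenario goes away. */
    spin_lock(&pgd_lock);
    /* ... walk pgd_list ... */
    spin_unlock(&pgd_lock);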

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Tested-by: Konrad Rzeszutek Wilk
    Signed-off-by: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Linus Torvalds
    Cc:
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Andrea Arcangeli
     

14 Jan, 2011

2 commits

  • Add support for transparent hugepages to x86 32bit.

    Share the same VM_ bitflag for VM_MAPPED_COPY. mm/nommu.c will never
    support transparent hugepages.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Add the needed pmd mangling functions, symmetric with their pte
    counterparts. pmdp_splitting_flush() is the only new addition among the
    pmd_ methods, and it's needed to serialize the VM against
    split_huge_page. It simply sets the splitting bit atomically, in much the
    same way pmdp_clear_flush_young atomically clears the accessed bit.
    pmdp_splitting_flush() also flushes the TLB to make it effective against
    gup_fast; strictly speaking the TLB flush isn't required, but it is the
    simplest operation we can invoke to serialize pmdp_splitting_flush()
    against gup_fast.
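
    For reference, a sketch of what such a helper looks like on x86 (close to,
    but not guaranteed identical with, the upstream version):

    void pmdp_splitting_flush(struct vm_area_struct *vma,
                              unsigned long address, pmd_t *pmdp)
    {
            int set;

            VM_BUG_ON(address & ~HPAGE_PMD_MASK);
            set = !test_and_set_bit(_PAGE_BIT_SPLITTING, (unsigned long *)pmdp);
            if (set) {
                    pmd_update(vma->vm_mm, address, pmdp);
                    /* the TLB flush only serializes against gup_fast */
                    flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
            }
    }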

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     


20 Oct, 2010

1 commit

  • Take mm->page_table_lock while syncing the vmalloc region. This prevents
    a race with the Xen pagetable pin/unpin code, which expects that the
    page_table_lock is already held. If this race occurs, then Xen can see
    an inconsistent page type (a page can either be read/write or a pagetable
    page, and pin/unpin converts it between them), which will cause either
    the pin or the set_p[gm]d to fail; either will crash the kernel.

    vmalloc_sync_all() should be called rarely, so this extra use of
    page_table_lock should not interfere with its normal users.

    The mm pointer is stashed in the pgd page's index field, as that won't
    be otherwise used for pgds.
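
    The "stash the mm in the pgd page" trick amounts to a pair of small
    helpers along these lines (sketch):

    static void pgd_set_mm(pgd_t *pgd, struct mm_struct *mm)
    {
            /* the pgd's struct page is never used as a mapping page,
             * so its index field is free to hold the owning mm */
            virt_to_page(pgd)->index = (pgoff_t)mm;
    }

    struct mm_struct *pgd_page_get_mm(struct page *page)
    {
            return (struct mm_struct *)page->index;
    }

    vmalloc_sync_all() can then look up the mm for each entry on pgd_list and
    take that mm's page_table_lock while syncing it.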

    Reported-by: Ian Campbell
    Originally-by: Jan Beulich
    LKML-Reference:
    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: H. Peter Anvin

    Jeremy Fitzhardinge
     


30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

25 Feb, 2010

1 commit

  • Distros generally (I looked at Debian, RHEL5 and SLES11) seem to
    enable CONFIG_HIGHPTE for any x86 configuration which has highmem
    enabled. This means that the overhead applies even to machines which
    have a fairly modest amount of high memory and which therefore do not
    really benefit from allocating PTEs in high memory but still pay the
    price of the additional mapping operations.

    Running kernbench on a 4G box I found that with CONFIG_HIGHPTE=y but
    no actual highptes being allocated there was a reduction in system
    time used from 59.737s to 55.9s.

    With CONFIG_HIGHPTE=y and highmem PTEs being allocated:
    Average Optimal load -j 4 Run (std deviation):
    Elapsed Time 175.396 (0.238914)
    User Time 515.983 (5.85019)
    System Time 59.737 (1.26727)
    Percent CPU 263.8 (71.6796)
    Context Switches 39989.7 (4672.64)
    Sleeps 42617.7 (246.307)

    With CONFIG_HIGHPTE=y but with no highmem PTEs being allocated:
    Average Optimal load -j 4 Run (std deviation):
    Elapsed Time 174.278 (0.831968)
    User Time 515.659 (6.07012)
    System Time 55.9 (1.07799)
    Percent CPU 263.8 (71.266)
    Context Switches 39929.6 (4485.13)
    Sleeps 42583.7 (373.039)

    This patch allows the user to control the allocation of PTEs in
    highmem from the command line ("userpte=nohigh") but retains the
    status-quo as the default.

    It is possible that some simple heuristic could be developed which
    allows auto-tuning of this option however I don't have a sufficiently
    large machine available to me to perform any particularly meaningful
    experiments. We could probably handwave up an argument for a threshold
    at 16G of total RAM.

    Assuming 768M of lowmem we have 196608 potential lowmem PTE
    pages. Each page can map 2M of RAM in a PAE-enabled configuration,
    meaning a maximum of 384G of RAM could potentially be mapped using
    lowmem PTEs.

    Even allowing generous factor of 10 to account for other required
    lowmem allocations, generous slop to account for page sharing (which
    reduces the total amount of RAM mappable by a given number of PT
    pages) and other inaccuracies in the estimations it would seem that
    even a 32G machine would not have a particularly pressing need for
    highmem PTEs. I think 32G could be considered to be at the upper bound
    of what might be sensible on a 32 bit machine (although I think in
    practice 64G is still supported).

    It seems questionable whether HIGHPTE is even a win for any amount of RAM
    you would sensibly run a 32 bit kernel on rather than going 64 bit.
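
    The resulting knob boils down to something like this (a sketch of the
    boot-parameter plumbing; the PGALLOC_* gfp masks are assumed from the x86
    code of that era):

    gfp_t __userpte_alloc_gfp = PGALLOC_GFP | PGALLOC_USER_GFP;

    pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long address)
    {
            struct page *pte;

            /* user PTE pages honour the (possibly downgraded) gfp mask */
            pte = alloc_pages(__userpte_alloc_gfp, 0);
            if (pte)
                    pgtable_page_ctor(pte);
            return pte;
    }

    static int __init setup_userpte(char *arg)
    {
            if (!arg)
                    return -EINVAL;

            /* "userpte=nohigh" keeps user pagetables out of highmem */
            if (!strcmp(arg, "nohigh"))
                    __userpte_alloc_gfp &= ~__GFP_HIGHMEM;
            else
                    return -EINVAL;
            return 0;
    }
    early_param("userpte", setup_userpte);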

    Signed-off-by: Ian Campbell
    LKML-Reference:
    Signed-off-by: H. Peter Anvin

    Ian Campbell
     


28 Jul, 2009

1 commit

  • mm: Pass virtual address to [__]p{te,ud,md}_free_tlb()

    Upcoming patches to support the new 64-bit "BookE" powerpc architecture
    will need to have the virtual address corresponding to PTE page when
    freeing it, due to the way the HW table walker works.

    Basically, the TLB can be loaded with "large" pages that cover the whole
    virtual space (well, sort-of, half of it actually) represented by a PTE
    page, and which contain an "indirect" bit indicating that this TLB entry
    RPN points to an array of PTEs from which the TLB can then create direct
    entries. Thus, in order to invalidate those when PTE pages are deleted,
    we need the virtual address to pass to tlbilx or tlbivax instructions.

    The old trick of sticking it somewhere in the PTE page's struct page sucks
    too much: the address is readily available in almost all call sites, and
    almost everybody implements these as macros, so we may as well add the
    argument everywhere. I added it to the pmd and pud variants for consistency.
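
    The interface change itself is just an extra argument threaded through the
    existing macros, roughly:

    /* before */
    #define pte_free_tlb(tlb, ptepage)          __pte_free_tlb(tlb, ptepage)

    /* after: the virtual address covered by the page being freed is passed
     * down, so an arch like 64-bit BookE can invalidate the indirect TLB
     * entry (tlbilx/tlbivax) that maps it */
    #define pte_free_tlb(tlb, ptepage, address) __pte_free_tlb(tlb, ptepage, address)
    #define pmd_free_tlb(tlb, pmdp, address)    __pmd_free_tlb(tlb, pmdp, address)
    #define pud_free_tlb(tlb, pudp, address)    __pud_free_tlb(tlb, pudp, address)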

    Signed-off-by: Benjamin Herrenschmidt
    Acked-by: David Howells [MN10300 & FRV]
    Acked-by: Nick Piggin
    Acked-by: Martin Schwidefsky [s390]
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     


10 Apr, 2009

1 commit

  • Use phys_addr_t for receiving a physical address argument instead of
    unsigned long. This allows fixmap to handle pages higher than 4GB on
    x86-32.
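
    The signature change is essentially (sketch):

    /* before: a 32-bit physical address on x86-32 */
    void native_set_fixmap(enum fixed_addresses idx,
                           unsigned long phys, pgprot_t flags);

    /* after: a 64-bit phys_addr_t lets a 32-bit PAE kernel map pages
     * above 4GB into fixmap slots */
    void native_set_fixmap(enum fixed_addresses idx,
                           phys_addr_t phys, pgprot_t flags);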

    Signed-off-by: Masami Hiramatsu
    Cc: Ingo Molnar
    Acked-by: Mathieu Desnoyers
    Signed-off-by: Linus Torvalds

    Masami Hiramatsu
     


07 Sep, 2008

1 commit

  • Giving pgd_ctor() a properly typed parameter allows eliminating a local
    variable. Adjust pgd_dtor() to match.

    Signed-off-by: Jan Beulich
    Acked-by: Jeremy Fitzhardinge
    Cc: "Jeremy Fitzhardinge"
    Signed-off-by: Ingo Molnar

    Jan Beulich
     

12 Aug, 2008

1 commit

  • Simon Horman reported that gcc-3.4.x crashes when compiling
    pgd_prepopulate_pmd() when PREALLOCATED_PMDS == 0 and CONFIG_DEBUG_INFO
    is enabled.

    Adding an extra check for PREALLOCATED_PMDS == 0 [which is compiled out
    by gcc] seems to avoid the problem.
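
    The workaround is a single early return that gcc folds away (sketch):

    static void pgd_prepopulate_pmd(struct mm_struct *mm, pgd_t *pgd, pmd_t *pmds[])
    {
            /*
             * With PREALLOCATED_PMDS == 0 gcc compiles the whole body out,
             * which sidesteps the gcc-3.4.x ICE seen with CONFIG_DEBUG_INFO.
             */
            if (PREALLOCATED_PMDS == 0)
                    return;

            /* ... populate the preallocated pmds as before ... */
    }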

    Reported-by: Simon Horman
    Signed-off-by: Jeremy Fitzhardinge
    Acked-by: Simon Horman
    Signed-off-by: Ingo Molnar

    Jeremy Fitzhardinge
     

08 Jul, 2008

3 commits

  • Jan Beulich points out that vmalloc_sync_all() assumes that the
    kernel's pmd is always expected to be present in the pgd. The current
    pgd construction code will add the pgd to the pgd_list before its pmds
    have been pre-populated, thereby making it visible to
    vmalloc_sync_all().

    However, because pgd_prepopulate_pmd also does the allocation, it may
    block and cannot be done under spinlock.

    The solution is to preallocate the pmds out of the spinlock, then
    populate them while holding the pgd_list lock.

    This patch also pulls the pmd preallocation and mop-up functions out
    to be common, assuming that the compiler will generate no code for
    them when PREALLOCATED_PMDS is 0. Also, there's no need for pgd_ctor
    to clear the pgd again, since it's allocated as a zeroed page.
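
    The resulting allocation flow is roughly the following (sketch; the
    paravirt hook and the pmd cleanup on the error path are omitted):

    pgd_t *pgd_alloc(struct mm_struct *mm)
    {
            pgd_t *pgd;
            pmd_t *pmds[PREALLOCATED_PMDS];
            unsigned long flags;

            /* allocated zeroed, so pgd_ctor need not clear it again */
            pgd = (pgd_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
            if (pgd == NULL)
                    goto out;

            mm->pgd = pgd;

            /* allocation may sleep, so do it before taking the lock */
            if (preallocate_pmds(pmds) != 0)
                    goto out_free_pgd;

            /*
             * Only a fully constructed pgd goes onto pgd_list, so
             * vmalloc_sync_all() never sees a pgd without its pmds.
             */
            spin_lock_irqsave(&pgd_lock, flags);
            pgd_ctor(pgd);
            pgd_prepopulate_pmd(mm, pgd, pmds);
            spin_unlock_irqrestore(&pgd_lock, flags);

            return pgd;

    out_free_pgd:
            free_page((unsigned long)pgd);
    out:
            return NULL;
    }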

    Signed-off-by: Jeremy Fitzhardinge
    Cc: xen-devel
    Cc: Stephen Tweedie
    Cc: Eduardo Habkost
    Cc: Mark McLoughlin
    Signed-off-by: Ingo Molnar
    Cc: Jan Beulich

    Signed-off-by: Ingo Molnar

    Jeremy Fitzhardinge
     
  • Add hooks which are called at pgd_alloc/free time. The pgd_alloc hook
    may return an error code which, if non-zero, causes the pgd allocation
    to fail. The hooks may be used to allocate/free auxiliary
    per-pgd information.

    also fix:

    > * Ingo Molnar wrote:
    >
    > include/asm/pgalloc.h: In function ‘paravirt_pgd_free':
    > include/asm/pgalloc.h:14: error: parameter name omitted
    > arch/x86/kernel/entry_64.S: In file included from
    > arch/x86/kernel/traps_64.c:51:include/asm/pgalloc.h: In function ‘paravirt_pgd_free':
    > include/asm/pgalloc.h:14: error: parameter name omitted
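
    The hooks boil down to something like this (sketch; the quoted build error
    came from an unnamed parameter in the non-paravirt stub):

    #ifdef CONFIG_PARAVIRT
    #include <asm/paravirt.h>
    #else
    static inline int  paravirt_pgd_alloc(struct mm_struct *mm) { return 0; }
    static inline void paravirt_pgd_free(struct mm_struct *mm, pgd_t *pgd) {}
    #endif

    /* pgd_alloc() then bails out cleanly if the hook rejects the pgd: */
    if (paravirt_pgd_alloc(mm) != 0)
            goto out_free_pmds;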

    Signed-off-by: Jeremy Fitzhardinge
    Cc: xen-devel
    Cc: Stephen Tweedie
    Cc: Eduardo Habkost
    Cc: Mark McLoughlin
    Signed-off-by: Ingo Molnar

    Jeremy Fitzhardinge
     
  • Conflicts:

    arch/x86/mm/init_64.c

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     


25 Apr, 2008

7 commits