20 Apr, 2013

1 commit

  • As reported by Dave Kleikamp, when we emit cross calls to do batched
    TLB flush processing we have a race because we do not synchronize on
    the sibling cpus completing the cross call.

    So meanwhile the TLB batch can be reset (tb->tlb_nr set to zero, etc.)
    and either flushes are missed or flushes will flush the wrong
    addresses.

    Fix this by using generic infrastructure to synchronize on the
    completion of the cross call.

    This first required getting the flush_tlb_pending() call out from
    switch_to() which operates with locks held and interrupts disabled.
    The problem is that smp_call_function_many() cannot be invoked with
    IRQs disabled and this is explicitly checked for with WARN_ON_ONCE().

    We get the batch processing outside of locked, IRQ-disabled sections by
    using some ideas from the powerpc port. Namely, we only batch inside
    of arch_{enter,leave}_lazy_mmu_mode() calls. If we're not in such a
    region, we flush TLBs synchronously.
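
    As a rough illustration, the gating can be pictured like the sketch
    below. The 'active' flag and the enter/leave hooks come from the
    change description; the tlb_batch layout and helper details beyond
    that are assumptions made for the example, not the port's code
    verbatim.

        /* Hedged sketch: batch only between enter/leave lazy MMU mode. */
        struct tlb_batch {
                struct mm_struct *mm;
                unsigned long tlb_nr;
                unsigned long vaddrs[TLB_BATCH_NR];
                bool active;                    /* in a lazy MMU region? */
        };

        static DEFINE_PER_CPU(struct tlb_batch, tlb_batch);

        void arch_enter_lazy_mmu_mode(void)
        {
                struct tlb_batch *tb = this_cpu_ptr(&tlb_batch);

                tb->active = true;              /* batching now allowed */
        }

        void arch_leave_lazy_mmu_mode(void)
        {
                struct tlb_batch *tb = this_cpu_ptr(&tlb_batch);

                if (tb->tlb_nr)
                        flush_tlb_pending();    /* drain the queued batch */
                tb->active = false;
        }

        static void tlb_batch_add_one(struct mm_struct *mm, unsigned long vaddr)
        {
                struct tlb_batch *tb = this_cpu_ptr(&tlb_batch);

                if (!tb->active) {
                        /* Not in a lazy MMU region: flush synchronously. */
                        global_flush_tlb_page(mm, vaddr);
                        return;
                }

                tb->vaddrs[tb->tlb_nr++] = vaddr;
                if (tb->tlb_nr == TLB_BATCH_NR)
                        flush_tlb_pending();
        }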

    1) Get rid of xcall_flush_tlb_pending and per-cpu type
    implementations.

    2) Do TLB batch cross calls instead via:

    smp_call_function_many()
        tlb_pending_func()
            __flush_tlb_pending()

    3) Batch only in lazy mmu sequences:

    a) Add 'active' member to struct tlb_batch
    b) Define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
    c) Set 'active' in arch_enter_lazy_mmu_mode()
    d) Run batch and clear 'active' in arch_leave_lazy_mmu_mode()
    e) Check 'active' in tlb_batch_add_one() and do a synchronous
    flush if it's clear.

    4) Add infrastructure for synchronous TLB page flushes.

    a) Implement __flush_tlb_page and per-cpu variants, patch
    as needed.
    b) Likewise for xcall_flush_tlb_page.
    c) Implement smp_flush_tlb_page() to invoke the cross-call.
    d) Wire up global_flush_tlb_page() to the right routine based
    upon CONFIG_SMP.

    5) It turns out that singleton batches are very common; 2 out of every
    3 batch flushes have only a single entry in them.

    The batch flush waiting is very expensive, both because of the poll
    on sibling cpu completion and because passing the tlb batch pointer
    to the sibling cpus forces a shared memory dereference.

    Therefore, in flush_tlb_pending(), if there is only one entry in
    the batch, perform a completely asynchronous global_flush_tlb_page()
    instead.
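
    Putting 2) and 5) together, the flush path might look roughly like
    the sketch below; the signatures and the batch layout are simplified
    assumptions for illustration, not the code as merged.

        /* Hedged sketch of the batch flush with the singleton fast path. */
        static void tlb_pending_func(void *info)
        {
                struct tlb_batch *tb = info;

                /* Runs on each sibling cpu: drain the batch it was handed. */
                __flush_tlb_pending(tb->mm, tb->tlb_nr, tb->vaddrs);
        }

        void flush_tlb_pending(void)
        {
                struct tlb_batch *tb = this_cpu_ptr(&tlb_batch);

                if (!tb->tlb_nr)
                        return;

                if (tb->tlb_nr == 1) {
                        /*
                         * Singleton batches are the common case (2 out of 3),
                         * and a single page flush can be broadcast without
                         * waiting, so skip the synchronous machinery entirely.
                         */
                        global_flush_tlb_page(tb->mm, tb->vaddrs[0]);
                } else {
                        /*
                         * wait == 1: do not return until every sibling cpu
                         * has finished tlb_pending_func(); otherwise the
                         * batch could be reset while flushes are in flight.
                         */
                        smp_call_function_many(mm_cpumask(tb->mm),
                                               tlb_pending_func, tb, 1);
                        /* Flush this cpu's own entries as well. */
                        __flush_tlb_pending(tb->mm, tb->tlb_nr, tb->vaddrs);
                }

                tb->tlb_nr = 0;
        }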

    Reported-by: Dave Kleikamp
    Signed-off-by: David S. Miller
    Acked-by: Dave Kleikamp

    David S. Miller
     

09 Oct, 2012

3 commits

  • This is relatively easy since PMD's now cover exactly 4MB of memory.

    Our PMD entries are 32 bits each, so we use a special encoding. The
    lowest bit, PMD_ISHUGE, determines the interpretation. This is possible
    because sparc64's page tables are purely software entities so we can use
    whatever encoding scheme we want. We just have to make the TLB miss
    assembler page table walkers aware of the layout.
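
    Concretely, a software encoding along these lines might look like the
    sketch below. Only the rule that the lowest bit (PMD_ISHUGE) selects
    the interpretation comes from the text above; the remaining bit
    assignments are placeholders invented for illustration.

        /* Hypothetical layout of a 32-bit software PMD entry. */
        #define PMD_ISHUGE         0x00000001UL  /* low bit picks the format */

        /* PMD_ISHUGE clear: the other bits locate a table of normal PTEs.   */
        /* PMD_ISHUGE set: the other bits describe one huge mapping, e.g.:   */
        #define PMD_HUGE_DIRTY     0x00000002UL  /* assumed software dirty   */
        #define PMD_HUGE_ACCESSED  0x00000004UL  /* assumed software young   */
        #define PMD_HUGE_WRITE     0x00000008UL  /* assumed write permission */
        #define PMD_HUGE_PADDR     0xfffff000UL  /* assumed huge page frame  */

        static inline int pmd_trans_huge(pmd_t pmd)
        {
                return pmd_val(pmd) & PMD_ISHUGE;
        }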

    set_pmd_at() works much like set_pte_at() but it has to operate in two
    regimes. In the first regime we are transitioning to a huge page from a
    table of non-huge PTEs, so we have to queue up TLB flushes based upon
    what mappings are valid in the PTE table. In the second regime we are
    going from huge-page to non-huge-page, and in that case we need only
    queue up a single TLB flush to push out the huge page mapping.
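
    Schematically, the two regimes could be handled along these lines;
    the helper names, the PTE-table walk and the masks here are
    illustrative only, not the actual sparc64 implementation.

        static void pmd_transition_flush(struct mm_struct *mm,
                                         unsigned long addr,
                                         pmd_t old_pmd, pmd_t new_pmd)
        {
                int was_huge = pmd_val(old_pmd) & PMD_ISHUGE;
                int is_huge = pmd_val(new_pmd) & PMD_ISHUGE;

                if (!was_huge && is_huge) {
                        /*
                         * Regime 1: non-huge -> huge.  The old PMD pointed
                         * at a table of normal PTEs, so queue a TLB flush
                         * for every mapping that was valid in that table.
                         */
                        pte_t *ptep = __va(pmd_val(old_pmd) & PMD_PADDR_MASK);
                        unsigned long i;

                        for (i = 0; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE)
                                if (pte_present(ptep[i]))
                                        tlb_batch_add_one(mm, addr);
                } else if (was_huge && !is_huge) {
                        /*
                         * Regime 2: huge -> non-huge.  A single flush pushes
                         * the one huge mapping out of the TLB.
                         */
                        tlb_batch_add_one(mm, addr & HPAGE_MASK);
                }
        }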

    We still have 5 bits remaining in the huge PMD encoding so we can very
    likely support any new pieces of THP state tracking that might get added
    in the future.

    With lots of help from Johannes Weiner.

    Signed-off-by: David S. Miller
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Gerald Schaefer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Miller
     
  • We've split up the PTE tables so that they take up half a page instead of
    a full page. This is in order to facilitate transparent huge page
    support, which works much better if our PMDs cover 4MB instead of 8MB.

    What we do is have a one-behind cache for PTE table allocations in the
    mm struct.

    This logic triggers only on allocations. For example, we don't try to
    keep track of freed up page table blocks in the style that the s390 port
    does.
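
    As a sketch of the one-behind idea (the signature, field names,
    locking and allocation flags here are simplified assumptions, not
    the actual code):

        pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
        {
                pte_t *pte;
                struct page *page;

                /* Fast path: hand out the cached second half of a page. */
                pte = mm->context.pte_cache;
                if (pte) {
                        mm->context.pte_cache = NULL;
                        return pte;
                }

                /* Slow path: take a whole page, cache the unused half. */
                page = alloc_page(GFP_KERNEL | __GFP_ZERO);
                if (!page)
                        return NULL;

                pte = page_address(page);
                mm->context.pte_cache = pte + (PAGE_SIZE / 2) / sizeof(pte_t);
                return pte;
        }

    Because the cache lives in the mm struct, it is also what
    init_new_context() has to zap after fork, as noted below.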

    There were only two slightly annoying aspects to this change:

    1) Changing pgtable_t to be a "pte_t *". There's all of this special
    logic in the TLB free paths that needed adjustments, as did the
    PMD populate interfaces.

    2) init_new_context() needs to zap the pointer, since the mm struct
    just gets copied from the parent on fork.

    Signed-off-by: David S. Miller
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Gerald Schaefer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Miller
     
  • Narrowing the scope of the page size configurations will make the
    transparent hugepage changes much simpler.

    In the end what we really want to do is have the kernel support multiple
    huge page sizes and use whatever is appropriate as the context dictates.

    Signed-off-by: David S. Miller
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Gerald Schaefer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Miller
     

26 Jul, 2011

1 commit

  • With the recent mmu_gather changes that included generic RCU freeing of
    page-tables, it is now quite straightforward to implement gup_fast() on
    sparc64.

    This patch:

    Remove the page table quicklists. They are pointless and make it harder
    to use RCU page table freeing and share code with other architectures.

    BTW, this is the second time this has happened, see commit 3c936465249f
    ("[SPARC64]: Kill pgtable quicklists and use SLAB.")

    Signed-off-by: David S. Miller
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David S. Miller
     

25 May, 2011

1 commit

  • Rework the sparc mmu_gather usage to conform to the new world order :-)

    Sparc mmu_gather does two things:
    - tracks vaddrs to unhash
    - tracks pages to free

    Split these two things like powerpc has done and keep the vaddrs
    in per-cpu data structures and flush them on context switch.

    The remaining bits can then use the generic mmu_gather.
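
    In outline, the unhash side becomes something like the sketch below;
    the structure layout, the capacity and the hook name are assumptions
    for illustration, not the code as merged.

        /* Per-cpu batch of virtual addresses waiting to be unhashed. */
        #define TLB_BATCH_NR    192             /* illustrative capacity */

        struct tlb_batch {
                struct mm_struct *mm;           /* mm the vaddrs belong to */
                unsigned long tlb_nr;           /* entries queued so far   */
                unsigned long vaddrs[TLB_BATCH_NR];
        };

        static DEFINE_PER_CPU(struct tlb_batch, tlb_batch);

        /* Drain the queued vaddrs before this cpu runs another mm. */
        static inline void switch_mm_tlb_flush(void)
        {
                if (this_cpu_ptr(&tlb_batch)->tlb_nr)
                        flush_tlb_pending();
        }

    The pages-to-free side is then handled entirely by the generic
    mmu_gather.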

    Signed-off-by: Peter Zijlstra
    Acked-by: David Miller
    Cc: Benjamin Herrenschmidt
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    The percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming availability. As this
    conversion needs to touch a large number of source files, the
    following script is used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there, i.e. if only gfp is used,
    gfp.h; if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition, and for others adding it to an
    implementation .h or embedding .c file was more appropriate. This
    step added inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed;
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs, requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. The percpu.h modifications were reverted so that they could be
    applied as a separate patch and serve as a bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

05 Dec, 2008

2 commits

  • Add a sysctl to tweak the RSS limit used to decide when to grow
    the TSB for an address space.

    In order to avoid expensive divides and multiplies, only simple
    positive and negative powers of two are supported.

    The function takes the number of TSB translations that will fit at
    one time in the TSB of a given size, and either adds or subtracts a
    percentage of those entries. This final value is the RSS limit.

    See tsb_size_to_rss_limit().
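
    A sketch of the computation it describes, assuming a hypothetical
    signed sysctl where the magnitude is the power-of-two shift and the
    sign selects add versus subtract (the names and the encoding are
    illustrative, not the actual interface):

        /* Hypothetical sysctl: +N adds, -N subtracts, 1/2^N of the entries. */
        static int sysctl_tsb_rss_limit_shift = 3;      /* e.g. +/- 12.5%    */

        static unsigned long tsb_size_to_rss_limit(unsigned long tsb_bytes)
        {
                /* How many translations fit in a TSB of this size. */
                unsigned long num_ents = tsb_bytes / sizeof(struct tsb);
                int shift = sysctl_tsb_rss_limit_shift;

                /* Shifts only: no divides or multiplies on this path. */
                if (shift >= 0)
                        return num_ents + (num_ents >> shift);
                return num_ents - (num_ents >> -shift);
        }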

    Signed-off-by: David S. Miller

    David S. Miller
     
  • - move all sparc64/mm/ files to arch/sparc/mm/
    - commonly named files are named _64.c
    - add files to sparc/mm/Makefile preserving link order
    - delete now unused sparc64/mm/Makefile
    - sparc64 now finds mm/ in sparc

    Signed-off-by: Sam Ravnborg
    Signed-off-by: David S. Miller

    Sam Ravnborg