13 Jan, 2015

1 commit

  • When batching up address ranges for TLB invalidation, we check tlb->end
    != 0 to indicate that some pages have actually been unmapped.

    As of commit f045bbb9fa1b ("mmu_gather: fix over-eager
    tlb_flush_mmu_free() calling"), we use the same check for freeing these
    pages in order to avoid a performance regression where we call
    free_pages_and_swap_cache even when no pages are actually queued up.

    Unfortunately, the range could have been reset (tlb->end = 0) by
    tlb_end_vma, which has been shown to cause memory leaks on arm64.
    Furthermore, investigation into these leaks revealed that the fullmm
    case on task exit no longer invalidates the TLB, by virtue of tlb->end
    == 0 (in 3.18, need_flush would have been set).

    This patch resolves the problem by reverting commit f045bbb9fa1b, using
    instead tlb->local.nr as the predicate for page freeing in
    tlb_flush_mmu_free and ensuring that tlb->end is initialised to a
    non-zero value in the fullmm case.

    Tested-by: Mark Langsdorf
    Tested-by: Dave Hansen
    Signed-off-by: Will Deacon
    Signed-off-by: Linus Torvalds

    Will Deacon
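
    As a rough illustration of the change described in this entry (a sketch
    based on the commit text, not the exact upstream diff), the page-freeing
    predicate moves from the range back to the batch contents, and the fullmm
    path pre-sets a non-zero end so the final flush still happens:

        /* Sketch: free pages based on what was actually batched, not on
         * tlb->end, which tlb_end_vma() may have reset to zero. */
        static void tlb_flush_mmu_free(struct mmu_gather *tlb)
        {
                struct mmu_gather_batch *batch;

                for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
                        free_pages_and_swap_cache(batch->pages, batch->nr);
                        batch->nr = 0;
                }
                tlb->active = &tlb->local;
        }

        /* Sketch: in tlb_gather_mmu(), a full-mm teardown must always flush,
         * so initialise tlb->end to a non-zero value up front. */
        if (tlb->fullmm) {
                tlb->start = 0;
                tlb->end   = -1UL;      /* any non-zero sentinel will do */
        }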
     

17 Nov, 2014

1 commit

  • On architectures with hardware broadcasting of TLB invalidation messages,
    it makes sense to reduce the range of the mmu_gather structure when
    unmapping page ranges based on the dirty address information passed to
    tlb_remove_tlb_entry.

    arm64 already does this by directly manipulating the start/end fields
    of the gather structure, but this confuses the generic code which
    does not expect these fields to change and can end up calculating
    invalid, negative ranges when forcing a flush in zap_pte_range.

    This patch moves the minimal range calculation out of the arm64 code
    and into the generic implementation, simplifying zap_pte_range in the
    process (which no longer needs to care about start/end, since they will
    point to the appropriate ranges already). With the range being tracked
    by core code, the need_flush flag is dropped in favour of checking that
    the end of the range has actually been set.

    Cc: Benjamin Herrenschmidt
    Cc: Peter Zijlstra
    Cc: Russell King - ARM Linux
    Cc: Michal Simek
    Acked-by: Linus Torvalds
    Signed-off-by: Will Deacon

    Will Deacon
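
    A minimal sketch of the range tracking described in this entry (names
    follow the commit text; the merged helpers may differ in detail): the
    generic code grows the gathered range as entries are queued, and the
    presence of an end address replaces the old need_flush flag:

        /* Sketch: widen the tracked range for every queued entry. */
        static inline void __tlb_adjust_range(struct mmu_gather *tlb,
                                              unsigned long address)
        {
                tlb->start = min(tlb->start, address);
                tlb->end   = max(tlb->end, address + PAGE_SIZE);
        }

        #define tlb_remove_tlb_entry(tlb, ptep, address)                \
                do {                                                    \
                        __tlb_adjust_range(tlb, address);               \
                        __tlb_remove_tlb_entry(tlb, ptep, address);     \
                } while (0)

        /* Sketch: flush only if an end address was ever recorded. */
        if (tlb->end)
                tlb_flush(tlb);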
     

16 Aug, 2013

1 commit

  • Ben Tebulin reported:

    "Since v3.7.2 on two independent machines a very specific Git
    repository fails in 9/10 cases on git-fsck due to an SHA1/memory
    failures. This only occurs on a very specific repository and can be
    reproduced stably on two independent laptops. Git mailing list ran
    out of ideas and for me this looks like some very exotic kernel issue"

    and bisected the failure to the backport of commit 53a59fc67f97 ("mm:
    limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT").

    That commit itself is not actually buggy, but what it does is to make it
    much more likely to hit the partial TLB invalidation case, since it
    introduces a new case in tlb_next_batch() that previously only ever
    happened when running out of memory.

    The real bug is that the TLB gather virtual memory range setup is subtly
    buggered. It was introduced in commit 597e1c3580b7 ("mm/mmu_gather:
    enable tlb flush range in generic mmu_gather"), and the range handling
    was already fixed at least once in commit e6c495a96ce0 ("mm: fix the TLB
    range flushed when __tlb_remove_page() runs out of slots"), but that fix
    was not complete.

    The problem with the TLB gather virtual address range is that it isn't
    set up by the initial tlb_gather_mmu() initialization (which didn't get
    the TLB range information), but it is set up ad-hoc later by the
    functions that actually flush the TLB. And so any such case that forgot
    to update the TLB range entries would potentially miss TLB invalidates.

    Rather than try to figure out exactly which particular ad-hoc range
    setup was missing (I personally suspect it's the hugetlb case in
    zap_huge_pmd(), which didn't have the same logic as zap_pte_range()
    did), this patch just gets rid of the problem at the source: make the
    TLB range information available to tlb_gather_mmu(), and initialize it
    when initializing all the other tlb gather fields.

    This makes the patch larger, but conceptually much simpler. And the end
    result is much more understandable; even if you want to play games with
    partial ranges when invalidating the TLB contents in chunks, now the
    range information is always there, and anybody who doesn't want to
    bother with it won't introduce subtle bugs.

    Ben verified that this fixes his problem.

    Reported-bisected-and-tested-by: Ben Tebulin
    Build-testing-by: Stephen Rothwell
    Build-testing-by: Richard Weinberger
    Reviewed-by: Michal Hocko
    Acked-by: Peter Zijlstra
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
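
    The shape of the fix, sketched from the description above (simplified;
    the real patch also updates every caller and the arch-specific variants):

        /* Before: the range was only filled in ad hoc by the flush paths. */
        void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
                            bool fullmm);

        /* After: the caller hands the range over at initialisation time,
         * so it is always available. */
        void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
                            unsigned long start, unsigned long end)
        {
                tlb->mm     = mm;
                tlb->fullmm = !(start | (end + 1)); /* (0, -1) means whole mm */
                tlb->start  = start;
                tlb->end    = end;
                /* ... remaining gather fields initialised as before ... */
        }

        /* e.g. a full address-space teardown in exit_mmap(): */
        tlb_gather_mmu(&tlb, mm, 0, -1);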
     

06 Jun, 2013

1 commit

  • Since the introduction of preemptible mmu_gather, TLB fast mode has been
    broken. TLB fast mode relies on there being absolutely no concurrency;
    it frees pages first and invalidates TLBs later.

    However now we can get concurrency and stuff goes *bang*.

    This patch removes all tlb_fast_mode() code; that was found to be the
    better option versus trying to patch the hole by entangling TLB
    invalidation with the scheduler.

    Cc: Thomas Gleixner
    Cc: Russell King
    Cc: Tony Luck
    Reported-by: Max Filippov
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

13 Apr, 2013

1 commit

  • This patch attempts to fix:

    https://bugzilla.kernel.org/show_bug.cgi?id=56461

    The symptom is a crash and messages like this:

    chrome: Corrupted page table at address 34a03000
    *pdpt = 0000000000000000 *pde = 0000000000000000
    Bad pagetable: 000f [#1] PREEMPT SMP

    Ingo guesses this got introduced by commit 611ae8e3f520 ("x86/tlb:
    enable tlb flush range support for x86") since that code started to free
    unused pagetables.

    On x86-32 PAE kernels, that new code has the potential to free an entire
    PMD page and will clear one of the four page-directory-pointer-table
    (aka pgd_t) entries.

    The hardware aggressively "caches" these top-level entries and invlpg
    does not actually affect the CPU's copy. If we clear one we *HAVE* to
    do a full TLB flush, otherwise we might continue using a freed pmd page.
    (note, we do this properly on the population side in pud_populate()).

    This patch tracks whenever we clear one of these entries in the 'struct
    mmu_gather', and ensures that we follow up with a full tlb flush.

    BTW, I disassembled and checked that:

    if (tlb->fullmm == 0)
    and
    if (!tlb->fullmm && !tlb->need_flush_all)

    generate essentially the same code, so there should be zero impact there
    to the !PAE case.

    Signed-off-by: Dave Hansen
    Cc: Peter Anvin
    Cc: Ingo Molnar
    Cc: Artem S Tashkinov
    Signed-off-by: Linus Torvalds

    Dave Hansen
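
    A sketch of the mechanism (field and function names follow the commit
    text; the flush call is simplified): the PAE pmd-free path records that a
    top-level entry was cleared, and the flush path then refuses to do a
    partial flush:

        /* When freeing a pmd page on x86-32 PAE, one of the four top-level
         * entries gets cleared; invlpg does not invalidate the CPU's cached
         * copy, so remember that a full flush is required. */
        #ifdef CONFIG_X86_PAE
                tlb->need_flush_all = 1;
        #endif

        /* Flush side: only do a ranged flush when it is actually safe. */
        if (!tlb->fullmm && !tlb->need_flush_all)
                flush_tlb_mm_range(tlb->mm, tlb->start, tlb->end);
        else
                flush_tlb_mm(tlb->mm);          /* full flush */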
     

05 Jan, 2013

1 commit

  • Since commit e303297e6c3a ("mm: extended batches for generic
    mmu_gather") we are batching pages to be freed until either
    tlb_next_batch cannot allocate a new batch or we are done.

    This works just fine most of the time, but we can get into trouble with a
    non-preemptible kernel (CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY)
    on large machines, where overly aggressive batching might lead to soft
    lockups during the process exit path (exit_mmap): there are no
    scheduling points down the free_pages_and_swap_cache path, so the
    freeing can take long enough to trigger the soft lockup.

    The lockup is harmless except when the system is set up to panic on
    soft lockup, which is not that unusual.

    The simplest way to work around this issue is to limit the maximum
    number of batches in a single mmu_gather. About 10k collected pages
    should be safe against soft lockups (roughly 2ms for one such run), even
    if they are all freed without an explicit scheduling point.

    This patch doesn't add any new explicit scheduling points; it relies on
    zap_pmd_range, which calls cond_resched once per PMD during page-table
    zapping.

    The following lockup has been reported for a 3.0 kernel with a huge
    process (on the order of hundreds of GBs, but I don't know any more
    details).

    BUG: soft lockup - CPU#56 stuck for 22s! [kernel:31053]
    Modules linked in: af_packet nfs lockd fscache auth_rpcgss nfs_acl sunrpc mptctl mptbase autofs4 binfmt_misc dm_round_robin dm_multipath bonding cpufreq_conservative cpufreq_userspace cpufreq_powersave pcc_cpufreq mperf microcode fuse loop osst sg sd_mod crc_t10dif st qla2xxx scsi_transport_fc scsi_tgt netxen_nic i7core_edac iTCO_wdt joydev e1000e serio_raw pcspkr edac_core iTCO_vendor_support acpi_power_meter rtc_cmos hpwdt hpilo button container usbhid hid dm_mirror dm_region_hash dm_log linear uhci_hcd ehci_hcd usbcore usb_common scsi_dh_emc scsi_dh_alua scsi_dh_hp_sw scsi_dh_rdac scsi_dh dm_snapshot pcnet32 mii edd dm_mod raid1 ext3 mbcache jbd fan thermal processor thermal_sys hwmon cciss scsi_mod
    Supported: Yes
    CPU 56
    Pid: 31053, comm: kernel Not tainted 3.0.31-0.9-default #1 HP ProLiant DL580 G7
    RIP: 0010: _raw_spin_unlock_irqrestore+0x8/0x10
    RSP: 0018:ffff883ec1037af0 EFLAGS: 00000206
    RAX: 0000000000000e00 RBX: ffffea01a0817e28 RCX: ffff88803ffd9e80
    RDX: 0000000000000200 RSI: 0000000000000206 RDI: 0000000000000206
    RBP: 0000000000000002 R08: 0000000000000001 R09: ffff887ec724a400
    R10: 0000000000000000 R11: dead000000200200 R12: ffffffff8144c26e
    R13: 0000000000000030 R14: 0000000000000297 R15: 000000000000000e
    FS: 00007ed834282700(0000) GS:ffff88c03f200000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 000000000068b240 CR3: 0000003ec13c5000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process kernel (pid: 31053, threadinfo ffff883ec1036000, task ffff883ebd5d4100)
    Call Trace:
    release_pages+0xc5/0x260
    free_pages_and_swap_cache+0x9d/0xc0
    tlb_flush_mmu+0x5c/0x80
    tlb_finish_mmu+0xe/0x50
    exit_mmap+0xbd/0x120
    mmput+0x49/0x120
    exit_mm+0x122/0x160
    do_exit+0x17a/0x430
    do_group_exit+0x3d/0xb0
    get_signal_to_deliver+0x247/0x480
    do_signal+0x71/0x1b0
    do_notify_resume+0x98/0xb0
    int_signal+0x12/0x17
    DWARF2 unwinder stuck at int_signal+0x12/0x17

    Signed-off-by: Michal Hocko
    Cc: stable@vger.kernel.org [3.0+]
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
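
    A sketch of the cap described in this entry (constant and field names
    follow the commit; unrelated parts of tlb_next_batch are elided):

        /* Roughly 10k pages per mmu_gather: freeing that many takes about
         * 2ms, well below the soft-lockup threshold. */
        #define MAX_GATHER_BATCH_COUNT  (10000UL / MAX_GATHER_BATCH)

        static bool tlb_next_batch(struct mmu_gather *tlb)
        {
                struct mmu_gather_batch *batch;

                /* ... reuse an existing batch if one is available ... */

                if (tlb->batch_count == MAX_GATHER_BATCH_COUNT)
                        return false;   /* force a flush instead of growing */

                batch = (void *)__get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0);
                if (!batch)
                        return false;

                tlb->batch_count++;
                /* ... link the new batch into the chain and make it active ... */
                return true;
        }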
     

28 Jun, 2012

2 commits

  • This patch enables TLB flush range support in the generic mmu layer.

    Most architectures have their own TLB flush range support, like ARM and
    IA64. x86 has no such hardware support yet, but the 'invlpg' instruction
    can implement it to some degree, so enable this feature in the generic
    layer for x86 now; it may also be useful for other architectures later.

    The generic mmu_gather struct is protected by the macro
    HAVE_GENERIC_MMU_GATHER. Other architectures with flush range support
    keep their own mmu_gather struct, so this change is safe for them.

    In the future we may unify this struct and the related functions across
    architectures.

    Thanks to Peter Zijlstra for his time and repeated reminders about
    keeping the code safe across multiple architectures!

    Signed-off-by: Alex Shi
    Link: http://lkml.kernel.org/r/1340845344-27557-7-git-send-email-alex.shi@intel.com
    Signed-off-by: H. Peter Anvin

    Alex Shi
     
  • Testing shows that different CPU types (micro-architectures and NUMA
    modes) have different balance points between a full TLB flush and
    multiple invlpg instructions, and there are also cases where the TLB
    flush change does not help at all.

    This patch gives x86 vendor developers an interface to set a different
    shift for each CPU type.

    For example, on the machines at hand the balance point is 16 entries on
    Romley-EP, 8 entries on Bloomfield NHM-EP, and 256 on an IVB mobile CPU,
    while on a model 15 Core2 Xeon using invlpg does not help at all.

    For untested machines, use a conservative setting, the same as for NHM
    CPUs.

    Signed-off-by: Alex Shi
    Link: http://lkml.kernel.org/r/1340845344-27557-5-git-send-email-alex.shi@intel.com
    Signed-off-by: H. Peter Anvin

    Alex Shi
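
    A simplified illustration of how such a per-CPU-model knob can be used
    (variable names here are illustrative; the exact condition in the merged
    flush path differs):

        /* Sketch: pick between a full flush and per-page invlpg based on a
         * per-CPU-model shift; -1 disables ranged flushing entirely. */
        unsigned long nr_pages = (end - start) >> PAGE_SHIFT;

        if (tlb_flushall_shift == -1 ||
            nr_pages > (tlb_entries >> tlb_flushall_shift)) {
                flush_tlb_mm(mm);                       /* flush everything */
        } else {
                unsigned long addr;

                for (addr = start; addr < end; addr += PAGE_SIZE)
                        __flush_tlb_one(addr);          /* one invlpg each */
        }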
     

13 Jan, 2012

1 commit

  • We have tlb_remove_tlb_entry to indicate that a pte TLB entry should be
    flushed, but no corresponding API for a pmd entry. This isn't a
    problem so far because THP is currently x86-only and tlb_flush() on
    x86 flushes the entire TLB. But it is confusing and could be
    missed if THP is ported to another arch.

    Also convert tlb->need_flush = 1 to a VM_BUG_ON(!tlb->need_flush) in
    __tlb_remove_page(), as suggested by Andrea Arcangeli. The
    __tlb_remove_page() function is supposed to be called after
    tlb_remove_xxx_tlb_entry(), so we can catch any misuse.

    Signed-off-by: Shaohua Li
    Reviewed-by: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
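
    A sketch of the new hook, mirroring the existing pte variant (per the
    description above; architectures may override __tlb_remove_pmd_tlb_entry):

        /* pmd counterpart of tlb_remove_tlb_entry(): mark that a flush is
         * needed and let the architecture record the entry if it wants to. */
        #define tlb_remove_pmd_tlb_entry(tlb, pmdp, address)            \
                do {                                                    \
                        tlb->need_flush = 1;                            \
                        __tlb_remove_pmd_tlb_entry(tlb, pmdp, address); \
                } while (0)

        /* ... and in __tlb_remove_page(), catch callers that did not go
         * through a tlb_remove_*_tlb_entry() first: */
        VM_BUG_ON(!tlb->need_flush);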
     

25 May, 2011

4 commits

  • Some of these functions have grown beyond inline sanity, move them
    out-of-line.

    Signed-off-by: Peter Zijlstra
    Requested-by: Andrew Morton
    Requested-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Instead of using a single batch (the small on-stack array, or an
    allocated page), try to extend the batch every time it runs out, and
    only flush once extending fails or we're done.

    Signed-off-by: Peter Zijlstra
    Requested-by: Nick Piggin
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
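
    Roughly, the batches become a chain that is only flushed when no further
    batch can be allocated (a sketch of the layout described above):

        struct mmu_gather_batch {
                struct mmu_gather_batch *next;  /* next batch in the chain */
                unsigned int            nr;     /* pages queued in this batch */
                unsigned int            max;    /* capacity of pages[] */
                struct page             *pages[];
        };

        /* When the active batch fills up, try to append a new one; only if
         * that fails, or the unmap is finished, is the whole chain flushed
         * and freed. */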
     
  • In case other architectures require RCU-freed page tables to implement
    gup_fast(), software-filled hashes and similar things, provide the
    means to do so by moving the logic into generic code.

    Signed-off-by: Peter Zijlstra
    Requested-by: David Miller
    Cc: Benjamin Herrenschmidt
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Rework the existing mmu_gather infrastructure.

    The direct purpose of these patches was to allow preemptible mmu_gather,
    but even without that I think these patches provide an improvement to the
    status quo.

    The first 9 patches rework the mmu_gather infrastructure. For review
    purpose I've split them into generic and per-arch patches with the last of
    those a generic cleanup.

    The next patch provides generic RCU page-table freeing, and the follow-up
    is a patch converting s390 to use this. I've also got 4 patches from
    DaveM lined up (not included in this series) that use this to implement
    gup_fast() for sparc64.

    Then there is one patch that extends the generic mmu_gather batching.

    After that follow the mm preemptibility patches, which make parts of the
    mm a lot more preemptible. They convert i_mmap_lock and anon_vma->lock to
    mutexes, which, together with the mmu_gather rework, makes mmu_gather
    preemptible as well.

    Making i_mmap_lock a mutex also enables a clean-up of the truncate code.

    This also allows for preemptible mmu_notifiers, something that XPMEM I
    think wants.

    Furthermore, it removes the new and universally detested unmap_mutex.

    This patch:

    Remove the first obstacle towards a fully preemptible mmu_gather.

    The current scheme assumes mmu_gather is always done with preemption
    disabled and uses per-cpu storage for the page batches. Change this to
    try and allocate a page for batching and in case of failure, use a small
    on-stack array to make some progress.

    Preemptible mmu_gather is desired in general and usable once i_mmap_lock
    becomes a mutex. Doing it before the mutex conversion saves us from
    having to rework the code by moving the mmu_gather bits inside the
    pte_lock.

    Also avoid flushing the tlb batches from under the pte lock; this is
    useful even without the i_mmap_lock conversion, as it significantly
    reduces pte lock hold times.

    [akpm@linux-foundation.org: fix comment tpyo]
    Signed-off-by: Peter Zijlstra
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
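
    A sketch of the resulting layout (per the description above): the gather
    always carries a small embedded batch to fall back on, so it no longer
    needs per-cpu storage or disabled preemption:

        #define MMU_GATHER_BUNDLE 8     /* small fallback when allocation fails */

        struct mmu_gather {
                struct mm_struct        *mm;
                struct mmu_gather_batch *active;    /* batch currently being filled */
                struct mmu_gather_batch local;      /* embedded fallback batch ... */
                struct page             *__pages[MMU_GATHER_BUNDLE]; /* ... and its storage */
                /* ... plus flags such as fullmm, need_flush, etc. ... */
        };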
     

28 Jul, 2009

1 commit

  • mm: Pass virtual address to [__]p{te,ud,md}_free_tlb()

    Upcoming patches to support the new 64-bit "BookE" powerpc architecture
    will need to have the virtual address corresponding to the PTE page when
    freeing it, due to the way the HW table walker works.

    Basically, the TLB can be loaded with "large" pages that cover the whole
    virtual space (well, sort-of, half of it actually) represented by a PTE
    page, and which contain an "indirect" bit indicating that this TLB entry
    RPN points to an array of PTEs from which the TLB can then create direct
    entries. Thus, in order to invalidate those when PTE pages are deleted,
    we need the virtual address to pass to tlbilx or tlbivax instructions.

    The old trick of sticking it somewhere in the PTE page's struct page sucks
    too much; the address is readily available in almost all call sites and
    almost everybody implements these as macros, so we may as well add the
    argument everywhere. I added it to the pmd and pud variants for
    consistency.

    Signed-off-by: Benjamin Herrenschmidt
    Acked-by: David Howells [MN10300 & FRV]
    Acked-by: Nick Piggin
    Acked-by: Martin Schwidefsky [s390]
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
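
    The interface change itself is small; sketched here for the pte case (the
    pmd and pud variants gain the same extra argument, and the generic
    fallback simply ignores it):

        /* Before: the freeing hook only saw the page being freed. */
        #define __pte_free_tlb(tlb, ptep)          pte_free((tlb)->mm, ptep)

        /* After: it also receives the virtual address the PTE page mapped,
         * so BookE powerpc can invalidate the indirect TLB entry for it. */
        #define __pte_free_tlb(tlb, ptep, address) pte_free((tlb)->mm, ptep)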
     

01 Feb, 2008

1 commit

  • Bring the avr32, blackfin, sh and sparc architectures back into working
    order by reverting the effects of this change, which came in via the x86
    tree:

    commit a5a19c63f4e55e32dc0bc3d936d7f94793d8b380
    Author: Jeremy Fitzhardinge
    Date: Wed Jan 30 13:33:39 2008 +0100

    x86: demacro asm-x86/pgalloc_32.h

    Sorry about that!

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

30 Jan, 2008

1 commit

  • Convert macros into inline functions, for better type-checking.

    This patch required a little bit of fiddling with headers in order to
    make __(pte|pmd)_free_tlb inline rather than macros.
    asm-generic/tlb.h includes asm/pgalloc.h, though it doesn't directly
    use any pgalloc definitions. I removed this include to avoid an
    include cycle, but it may cause secondary compile failures in code
    that depends on the indirect inclusion; arch/x86/mm/hugetlbpage.c was
    one such place, and there may be others.

    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner

    Jeremy Fitzhardinge
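
    The flavour of the conversion, shown on a representative helper (not a
    verbatim line from the patch):

        /* Before: a macro, with no type checking on its arguments. */
        #define pmd_populate_kernel(mm, pmd, pte) \
                set_pmd(pmd, __pmd(__pa(pte) | _PAGE_TABLE))

        /* After: an inline function, so argument types are checked. */
        static inline void pmd_populate_kernel(struct mm_struct *mm,
                                               pmd_t *pmd, pte_t *pte)
        {
                set_pmd(pmd, __pmd(__pa(pte) | _PAGE_TABLE));
        }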
     

27 Dec, 2007

1 commit

  • This did not fix the reported issue. Apart from other weirdness, it
    causes a bad link between the TLB flushing logic and the quicklists. If
    there is indeed an issue where an arch needs a TLB flush before the free,
    then the arch code needs to set tlb->need_flush before calling
    quicklist_free.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

30 Oct, 2005

3 commits

  • zap_pte_range has been counting the pages it frees in tlb->freed, then
    tlb_finish_mmu has used that to update the mm's rss. That got stranger when I
    added anon_rss, yet updated it by a different route; and stranger when rss and
    anon_rss became mm_counters with special access macros. And it would no
    longer be viable if we're relying on page_table_lock to stabilize the
    mm_counter, but calling tlb_finish_mmu outside that lock.

    Remove the mmu_gather's freed field, let tlb_finish_mmu stick to its own
    business, just decrement the rss mm_counter in zap_pte_range (yes, there was
    some point to batching the update, and a subsequent patch restores that). And
    forget the anal paranoia of first reading the counter to avoid going negative
    - if rss does go negative, just fix that bug.

    Remove the mmu_gather's flushes and avoided_flushes from arm and arm26: no use
    was being made of them. But arm26 alone was actually using the freed, in the
    way some others use need_flush: give it a need_flush. arm26 seems to prefer
    spaces to tabs here: respect that.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • tlb_is_full_mm? What does that mean? The TLB is full? No, it means that the
    mm's last user has gone and the whole mm is being torn down. And it's an
    inline function because sparc64 uses a different (slightly better)
    "tlb_frozen" name for the flag others call "fullmm".

    And now the ptep_get_and_clear_full macro used in zap_pte_range refers
    directly to tlb->fullmm, which would be wrong for sparc64. Rather than
    correct that, I'd prefer to scrap tlb_is_full_mm altogether, and change
    sparc64 to just use the same poor name as everyone else - is that okay?

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • tlb_gather_mmu dates from before kernel preemption was allowed, and uses
    smp_processor_id or __get_cpu_var to find its per-cpu mmu_gather. That works
    because it's currently only called after getting page_table_lock, which is not
    dropped until after the matching tlb_finish_mmu. But don't rely on that;
    it will soon change. So disable preemption internally, with a proper
    get_cpu_var in tlb_gather_mmu and a put_cpu_var in tlb_finish_mmu.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
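
    In outline (a sketch of the per-cpu access pattern described above):

        /* tlb_gather_mmu(): take the per-cpu gather with preemption disabled,
         * rather than assuming the caller already prevents migration. */
        struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);

        /* ... unmap, batch and flush ... */

        /* tlb_finish_mmu(): re-enable preemption once the gather is done. */
        put_cpu_var(mmu_gathers);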
     

13 Sep, 2005

1 commit

  • The generic TLB flush functions kept up to 506 pages per
    CPU to avoid too frequent IPIs.

    This value was sized for the L1 cache of older x86 CPUs,
    but with modern CPUs it does not make much sense anymore.
    TLB flushing is slow enough that using the L2 cache is fine.

    This patch increases the flush array on x86-64 to cache
    5350 pages. That is roughly 20MB with 4K pages. It speeds
    up large munmaps in multithreaded processes on SMP considerably.

    The cost is roughly 42k of memory per CPU, which is reasonable.

    I only increased it on x86-64 for now, but it would probably
    make sense to increase it everywhere. Embedded architectures
    with SMP may keep it smaller to save some memory per CPU.

    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds