26 Mar, 2016

1 commit


20 Mar, 2016

1 commit

  • Pull powerpc updates from Michael Ellerman:
    "This was delayed a day or two by some build-breakage on old toolchains
    which we've now fixed.

    There are two PCI commits, both acked by Bjorn.

    There's one commit to mm/hugepage.c which is (co)authored by Kirill.

    Highlights:
    - Restructure Linux PTE on Book3S/64 to Radix format from Paul
    Mackerras
    - Book3s 64 MMU cleanup in preparation for Radix MMU from Aneesh
    Kumar K.V
    - Add POWER9 cputable entry from Michael Neuling
    - FPU/Altivec/VSX save/restore optimisations from Cyril Bur
    - Add support for new ftrace ABI on ppc64le from Torsten Duwe

    Various cleanups & minor fixes from:
    - Adam Buchbinder, Andrew Donnellan, Balbir Singh, Christophe Leroy,
    Cyril Bur, Luis Henriques, Madhavan Srinivasan, Pan Xinhui, Russell
    Currey, Sukadev Bhattiprolu, Suraj Jitindar Singh.

    General:
    - atomics: Allow architectures to define their own __atomic_op_*
    helpers from Boqun Feng
    - Implement atomic{, 64}_*_return_* variants and acquire/release/
    relaxed variants for (cmp)xchg from Boqun Feng
    - Add powernv_defconfig from Jeremy Kerr
    - Fix BUG_ON() reporting in real mode from Balbir Singh
    - Add xmon command to dump OPAL msglog from Andrew Donnellan
    - Add xmon command to dump process/task similar to ps(1) from Douglas
    Miller
    - Clean up memory hotplug failure paths from David Gibson

    pci/eeh:
    - Redesign SR-IOV on PowerNV to give absolute isolation between VFs
    from Wei Yang.
    - EEH Support for SRIOV VFs from Wei Yang and Gavin Shan.
    - PCI/IOV: Rename and export virtfn_{add, remove} from Wei Yang
    - PCI: Add pcibios_bus_add_device() weak function from Wei Yang
    - MAINTAINERS: Update EEH details and maintainership from Russell
    Currey

    cxl:
    - Support added to the CXL driver for running on both bare-metal and
    hypervisor systems, from Christophe Lombard and Frederic Barrat.
    - Ignore probes for virtual afu pci devices from Vaibhav Jain

    perf:
    - Export Power8 generic and cache events to sysfs from Sukadev
    Bhattiprolu
    - hv-24x7: Fix usage with chip events, display change in counter
    values, display domain indices in sysfs, eliminate domain suffix in
    event names, from Sukadev Bhattiprolu

    Freescale:
    - Updates from Scott: "Highlights include 8xx optimizations, 32-bit
    checksum optimizations, 86xx consolidation, e5500/e6500 cpu
    hotplug, more fman and other dt bits, and minor fixes/cleanup"

    * tag 'powerpc-4.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (179 commits)
    powerpc: Fix unrecoverable SLB miss during restore_math()
    powerpc/8xx: Fix do_mtspr_cpu6() build on older compilers
    powerpc/rcpm: Fix build break when SMP=n
    powerpc/book3e-64: Use hardcoded mttmr opcode
    powerpc/fsl/dts: Add "jedec,spi-nor" flash compatible
    powerpc/T104xRDB: add tdm riser card node to device tree
    powerpc32: PAGE_EXEC required for inittext
    powerpc/mpc85xx: Add pcsphy nodes to FManV3 device tree
    powerpc/mpc85xx: Add MDIO bus muxing support to the board device tree(s)
    powerpc/86xx: Introduce and use common dtsi
    powerpc/86xx: Update device tree
    powerpc/86xx: Move dts files to fsl directory
    powerpc/86xx: Switch to kconfig fragments approach
    powerpc/86xx: Update defconfigs
    powerpc/86xx: Consolidate common platform code
    powerpc32: Remove one insn in mulhdu
    powerpc32: small optimisation in flush_icache_range()
    powerpc: Simplify test in __dma_sync()
    powerpc32: move xxxxx_dcache_range() functions inline
    powerpc32: Remove clear_pages() and define clear_page() inline
    ...

    Linus Torvalds
     

18 Mar, 2016

7 commits

  • split_huge_pmd() tries to munlock the page with munlock_vma_page(). That
    requires the page to be locked.

    If the page is already locked by the caller, we would get a deadlock:

    Unable to find swap-space signature
    INFO: task trinity-c85:1907 blocked for more than 120 seconds.
    Not tainted 4.4.0-00032-gf19d0bdced41-dirty #1606
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    trinity-c85 D ffff88084d997608 0 1907 309 0x00000000
    Call Trace:
    schedule+0x9f/0x1c0
    schedule_timeout+0x48e/0x600
    io_schedule_timeout+0x1c3/0x390
    bit_wait_io+0x29/0xd0
    __wait_on_bit_lock+0x94/0x140
    __lock_page+0x1d4/0x280
    __split_huge_pmd+0x5a8/0x10f0
    split_huge_pmd_address+0x1d9/0x230
    try_to_unmap_one+0x540/0xc70
    rmap_walk_anon+0x284/0x810
    rmap_walk_locked+0x11e/0x190
    try_to_unmap+0x1b1/0x4b0
    split_huge_page_to_list+0x49d/0x18a0
    follow_page_mask+0xa36/0xea0
    SyS_move_pages+0xaf3/0x1570
    entry_SYSCALL_64_fastpath+0x12/0x6b
    2 locks held by trinity-c85/1907:
    #0: (&mm->mmap_sem){++++++}, at: SyS_move_pages+0x933/0x1570
    #1: (&anon_vma->rwsem){++++..}, at: split_huge_page_to_list+0x402/0x18a0

    I don't think the deadlock is triggerable without the split_huge_page()
    simplification patchset.

    But munlock_vma_page() here is wrong: we want to munlock the page
    unconditionally, with no need for the rmap lookup that munlock_vma_page()
    does.

    Let's use clear_page_mlock() instead. It can be called under ptl.

    Fixes: e90309c9f772 ("thp: allow mlocked THP again")
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The freeze_page() and unfreeze_page() helpers have evolved into rather
    complex beasts. It would be nice to cut the complexity of this code.

    This patch rewrites freeze_page() using standard try_to_unmap().
    unfreeze_page() is rewritten with remove_migration_ptes().

    The result is much simpler.

    But the new variant is somewhat slower for PTE-mapped THPs. The current
    helpers iterate over the VMAs the compound page is mapped into, and then
    over the PTEs within each VMA. The new helpers iterate over the small
    pages, then over the VMAs each small page is mapped into, and only then
    find the relevant PTE.

    We have a shortcut for PMD-mapped THPs: we install migration entries
    directly on PMD split.

    I don't think the slowdown is critical, considering how much simpler the
    result is and that split_huge_page() is quite rare nowadays. It only
    happens due to memory pressure or migration.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Add support for two ttu_flags:

    - TTU_SPLIT_HUGE_PMD splits the PMD, if present, before trying to unmap
    the page;

    - TTU_RMAP_LOCKED indicates that the caller holds the relevant rmap lock;

    Also, change rwc->done to !page_mapcount() instead of !page_mapped().
    try_to_unmap() works on pte level, so we are really interested in the
    mappedness of this small page rather than of the compound page it's a
    part of.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Kernel style prefers a single string over split strings when the string is
    'user-visible' (a brief illustration follows this entry).

    Miscellanea:

    - Add a missing newline
    - Realign arguments

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
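
    A generic userspace illustration of the style rule above (not code from
    the patch): adjacent C string literals are concatenated, so a message can
    be split across source lines without changing the output, but the split
    breaks the ability to grep the source for the full user-visible message.

    #include <stdio.h>

    int main(void)
    {
            /* Preferred: one searchable string, even if the line is long. */
            printf("loading configuration from /etc/example.conf failed, using defaults\n");

            /*
             * Discouraged for user-visible messages: same output, but
             * grep "example.conf failed" no longer finds the source line
             * that printed it.
             */
            printf("loading configuration from /etc/example.conf "
                   "failed, using defaults\n");

            return 0;
    }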
     
  • The success of CMA allocation largely depends on the success of
    migration, and a key factor in that is the page reference count. Until
    now, the page reference count has been manipulated by directly calling
    atomic functions, so we cannot track who manipulates it and where. That
    makes it hard to find the actual reason for a CMA allocation failure.
    CMA allocation should be guaranteed to succeed, so finding the offending
    place is really important.

    In this patch, call sites where the page reference count is manipulated
    are converted to the newly introduced wrapper functions. This is a
    preparation step for adding a tracepoint to each page reference
    manipulation function. With this facility, we can easily find the reason
    for a CMA allocation failure. There is no functional change in this
    patch.

    In addition, this patch also converts the reference read sites. It will
    help the second step, which renames page._count to something else and
    prevents later attempts to access it directly (suggested by Andrew).

    Signed-off-by: Joonsoo Kim
    Acked-by: Michal Nazarewicz
    Acked-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Sergey Senozhatsky
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • THP defrag is enabled by default to direct reclaim/compact but not wake
    kswapd in the event of a THP allocation failure. The problem is that
    THP allocation requests potentially enter reclaim/compaction. This
    potentially incurs a severe stall that is not guaranteed to be offset by
    reduced TLB misses. While there has been considerable effort to reduce
    the impact of reclaim/compaction, it is still a high cost and workloads
    that should fit in memory fail to do so. Specifically, a simple
    anon/file streaming workload will enter direct reclaim on NUMA at least
    even though the working set size is 80% of RAM. It's been years and
    it's time to throw in the towel.

    First, this patch defines THP defrag as follows:

    - madvise: A failed allocation will direct reclaim/compact if the
    application requests it
    - never: Neither reclaim/compact nor wake kswapd
    - defer: A failed allocation will wake kswapd/kcompactd
    - always: A failed allocation will direct reclaim/compact (historical
    behaviour)

    khugepaged defrag will enter direct reclaim but not wake kswapd.

    Next it sets the default defrag option to be "madvise", so that direct
    reclaim/compaction is entered only for applications that specifically
    requested it (a userspace sketch of that request follows this entry).

    Lastly, it removes a check from the page allocator slowpath that is
    related to __GFP_THISNODE to allow "defer" to work. The callers that
    really care are slub/slab, and they are updated accordingly. The slab
    one may be surprising because it also corrects a comment, as kswapd was
    never woken up by that path.

    This means that by default a THP fault will no longer stall for most
    applications, which is the ideal for most users: they get THPs if they
    are immediately available. There are still options for users who prefer
    a stall at startup of a new application, by either restoring historical
    behaviour with "always" or picking a half-way point with "defer", where
    kswapd does some of the work in the background and wakes kcompactd if
    necessary. THP defrag for khugepaged remains enabled and will enter
    direct reclaim but will not wake kswapd or kcompactd.

    After this patch a THP allocation failure will quickly fall back and rely
    on khugepaged to recover the situation at some time in the future. In
    some cases this will reduce THP usage, but the benefit of THP is hard to
    measure and not a universal win, whereas a stall in reclaim/compaction
    is definitely measurable and can be painful.

    The first test for this is using "usemem" to read a large file and write
    a large anonymous mapping (to avoid the zero page) multiple times. The
    total size of the mappings is 80% of RAM and the benchmark simply
    measures how long it takes to complete. It uses multiple threads to see
    if that is a factor. On UMA, the performance is almost identical so is
    not reported but on NUMA, we see this

    usemem
    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Amean System-1 102.86 ( 0.00%) 46.81 ( 54.50%)
    Amean System-4 37.85 ( 0.00%) 34.02 ( 10.12%)
    Amean System-7 48.12 ( 0.00%) 46.89 ( 2.56%)
    Amean System-12 51.98 ( 0.00%) 56.96 ( -9.57%)
    Amean System-21 80.16 ( 0.00%) 79.05 ( 1.39%)
    Amean System-30 110.71 ( 0.00%) 107.17 ( 3.20%)
    Amean System-48 127.98 ( 0.00%) 124.83 ( 2.46%)
    Amean Elapsd-1 185.84 ( 0.00%) 105.51 ( 43.23%)
    Amean Elapsd-4 26.19 ( 0.00%) 25.58 ( 2.33%)
    Amean Elapsd-7 21.65 ( 0.00%) 21.62 ( 0.16%)
    Amean Elapsd-12 18.58 ( 0.00%) 17.94 ( 3.43%)
    Amean Elapsd-21 17.53 ( 0.00%) 16.60 ( 5.33%)
    Amean Elapsd-30 17.45 ( 0.00%) 17.13 ( 1.84%)
    Amean Elapsd-48 15.40 ( 0.00%) 15.27 ( 0.82%)

    For a single thread, the benchmark completes 43.23% faster with this
    patch applied, with smaller benefits as the thread count increases.
    Similarly, notice the large reduction in system CPU usage in most cases.
    The overall CPU time is

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    User 10357.65 10438.33
    System 3988.88 3543.94
    Elapsed 2203.01 1634.41

    Which is substantial. Now, the reclaim figures

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Minor Faults 128458477 278352931
    Major Faults 2174976 225
    Swap Ins 16904701 0
    Swap Outs 17359627 0
    Allocation stalls 43611 0
    DMA allocs 0 0
    DMA32 allocs 19832646 19448017
    Normal allocs 614488453 580941839
    Movable allocs 0 0
    Direct pages scanned 24163800 0
    Kswapd pages scanned 0 0
    Kswapd pages reclaimed 0 0
    Direct pages reclaimed 20691346 0
    Compaction stalls 42263 0
    Compaction success 938 0
    Compaction failures 41325 0

    This patch eliminates almost all swapping and direct reclaim activity.
    There is still overhead but it's from NUMA balancing which does not
    identify that it's pointless trying to do anything with this workload.

    I also tried the thpscale benchmark which forces a corner case where
    compaction can be used heavily and measures the latency of whether base
    or huge pages were used

    thpscale Fault Latencies
    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Amean fault-base-1 5288.84 ( 0.00%) 2817.12 ( 46.73%)
    Amean fault-base-3 6365.53 ( 0.00%) 3499.11 ( 45.03%)
    Amean fault-base-5 6526.19 ( 0.00%) 4363.06 ( 33.15%)
    Amean fault-base-7 7142.25 ( 0.00%) 4858.08 ( 31.98%)
    Amean fault-base-12 13827.64 ( 0.00%) 10292.11 ( 25.57%)
    Amean fault-base-18 18235.07 ( 0.00%) 13788.84 ( 24.38%)
    Amean fault-base-24 21597.80 ( 0.00%) 24388.03 (-12.92%)
    Amean fault-base-30 26754.15 ( 0.00%) 19700.55 ( 26.36%)
    Amean fault-base-32 26784.94 ( 0.00%) 19513.57 ( 27.15%)
    Amean fault-huge-1 4223.96 ( 0.00%) 2178.57 ( 48.42%)
    Amean fault-huge-3 2194.77 ( 0.00%) 2149.74 ( 2.05%)
    Amean fault-huge-5 2569.60 ( 0.00%) 2346.95 ( 8.66%)
    Amean fault-huge-7 3612.69 ( 0.00%) 2997.70 ( 17.02%)
    Amean fault-huge-12 3301.75 ( 0.00%) 6727.02 (-103.74%)
    Amean fault-huge-18 6696.47 ( 0.00%) 6685.72 ( 0.16%)
    Amean fault-huge-24 8000.72 ( 0.00%) 9311.43 (-16.38%)
    Amean fault-huge-30 13305.55 ( 0.00%) 9750.45 ( 26.72%)
    Amean fault-huge-32 9981.71 ( 0.00%) 10316.06 ( -3.35%)

    The average time to fault pages is substantially reduced in the majority
    of cases, but with the obvious caveat that fewer THPs are actually used
    in this adverse workload

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Percentage huge-1 0.71 ( 0.00%) 14.04 (1865.22%)
    Percentage huge-3 10.77 ( 0.00%) 33.05 (206.85%)
    Percentage huge-5 60.39 ( 0.00%) 38.51 (-36.23%)
    Percentage huge-7 45.97 ( 0.00%) 34.57 (-24.79%)
    Percentage huge-12 68.12 ( 0.00%) 40.07 (-41.17%)
    Percentage huge-18 64.93 ( 0.00%) 47.82 (-26.35%)
    Percentage huge-24 62.69 ( 0.00%) 44.23 (-29.44%)
    Percentage huge-30 43.49 ( 0.00%) 55.38 ( 27.34%)
    Percentage huge-32 50.72 ( 0.00%) 51.90 ( 2.35%)

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Minor Faults 37429143 47564000
    Major Faults 1916 1558
    Swap Ins 1466 1079
    Swap Outs 2936863 149626
    Allocation stalls 62510 3
    DMA allocs 0 0
    DMA32 allocs 6566458 6401314
    Normal allocs 216361697 216538171
    Movable allocs 0 0
    Direct pages scanned 25977580 17998
    Kswapd pages scanned 0 3638931
    Kswapd pages reclaimed 0 207236
    Direct pages reclaimed 8833714 88
    Compaction stalls 103349 5
    Compaction success 270 4
    Compaction failures 103079 1

    Note again that while this does swap as it's an aggressive workload, the
    direct reclaim activity and allocation stalls are substantially reduced.
    There is some kswapd activity but ftrace showed that the kswapd activity
    was due to normal wakeups from 4K pages being allocated.
    Compaction-related stalls and activity are almost eliminated.

    I also tried the stutter benchmark. For this, I do not have figures for
    NUMA but it's something that does impact UMA so I'll report what is
    available

    stutter
    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Min mmap 7.3571 ( 0.00%) 7.3438 ( 0.18%)
    1st-qrtle mmap 7.5278 ( 0.00%) 17.9200 (-138.05%)
    2nd-qrtle mmap 7.6818 ( 0.00%) 21.6055 (-181.25%)
    3rd-qrtle mmap 11.0889 ( 0.00%) 21.8881 (-97.39%)
    Max-90% mmap 27.8978 ( 0.00%) 22.1632 ( 20.56%)
    Max-93% mmap 28.3202 ( 0.00%) 22.3044 ( 21.24%)
    Max-95% mmap 28.5600 ( 0.00%) 22.4580 ( 21.37%)
    Max-99% mmap 29.6032 ( 0.00%) 25.5216 ( 13.79%)
    Max mmap 4109.7289 ( 0.00%) 4813.9832 (-17.14%)
    Mean mmap 12.4474 ( 0.00%) 19.3027 (-55.07%)

    This benchmark is trying to fault an anonymous mapping while there is a
    heavy IO load -- a scenario that desktop users used to complain about
    frequently. This shows a mix because the ideal case of mapping with THP
    is not hit as often. However, note that 99% of the mappings complete
    13.79% faster. The CPU usage here is particularly interesting

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    User 67.50 0.99
    System 1327.88 91.30
    Elapsed 2079.00 2128.98

    And once again we look at the reclaim figures

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Minor Faults 335241922 1314582827
    Major Faults 715 819
    Swap Ins 0 0
    Swap Outs 0 0
    Allocation stalls 532723 0
    DMA allocs 0 0
    DMA32 allocs 1822364341 1177950222
    Normal allocs 1815640808 1517844854
    Movable allocs 0 0
    Direct pages scanned 21892772 0
    Kswapd pages scanned 20015890 41879484
    Kswapd pages reclaimed 19961986 41822072
    Direct pages reclaimed 21892741 0
    Compaction stalls 1065755 0
    Compaction success 514 0
    Compaction failures 1065241 0

    Allocation stalls and all direct reclaim activity are eliminated, as are
    compaction-related stalls.

    THP gives impressive gains in some cases, but only if the huge pages are
    quickly available. We're not going to reach the point where they are
    completely free, so let's finally take the costs out of the fast paths
    and defer the cost to kswapd, kcompactd and khugepaged, where it belongs.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
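
    A minimal userspace sketch of the "madvise" request referenced above; an
    illustration, not code from the patch. The program maps an anonymous
    region, asks for huge pages with madvise(MADV_HUGEPAGE) and touches the
    memory; with defrag set to "madvise", only ranges marked like this may
    enter direct reclaim/compaction on a THP allocation failure. The 4MB size
    is arbitrary.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define SZ (4UL << 20)  /* 4MB: room for at least one PMD-sized block */

    int main(void)
    {
            char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            /* Opt this range into THP (and, with defrag=madvise, defrag). */
            if (madvise(p, SZ, MADV_HUGEPAGE))
                    perror("madvise(MADV_HUGEPAGE)");

            /* Touch every page so huge pages can actually be faulted in. */
            for (size_t i = 0; i < SZ; i += 4096)
                    p[i] = 1;

            /* The system-wide policy lives in
             * /sys/kernel/mm/transparent_hugepage/defrag. */
            munmap(p, SZ);
            return 0;
    }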
     
  • Count how many times we put a THP on the split queue. Currently, it
    happens on partial unmap of a THP.

    A rapidly growing value can indicate that an application behaves
    unfriendly with regard to THP: it often faults in a huge page and then
    unmaps part of it. This leads to unnecessary memory fragmentation and
    the application may require tuning.

    The event also can help with debugging kernel [mis-]behaviour (a sketch
    of reading the counter follows this entry).

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
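
    A minimal sketch of observing the new event from userspace; the counter
    is exported in /proc/vmstat and its name is assumed here to be
    "thp_deferred_split_page". The program simply dumps all THP-related
    counters.

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            FILE *f = fopen("/proc/vmstat", "r");
            char line[128];

            if (!f) {
                    perror("/proc/vmstat");
                    return 1;
            }
            /* Print every THP counter, including the deferred-split one. */
            while (fgets(line, sizeof(line), f))
                    if (strncmp(line, "thp_", 4) == 0)
                            fputs(line, stdout);
            fclose(f);
            return 0;
    }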
     

16 Mar, 2016

1 commit

  • After one of the bugfixes to freeze_page(), we no longer have frozen
    pages in rmap, therefore the mapcount of all subpages of a frozen THP is
    zero. And we have an assert for that.

    Let's drop the code which deals with non-zero mapcount of subpages.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

03 Mar, 2016

1 commit

  • With the next generation POWER processor, we have a new MMU model [1]
    that requires us to maintain a different Linux page table format.

    In order to support both current and future ppc64 systems with a single
    kernel, we need to make sure the kernel can select between the different
    page table formats at runtime. With the new MMU (radix MMU) added, we
    will have two different pmd hugepage sizes: 16MB for the hash model and
    2MB for the radix model. Hence make the HPAGE_PMD related values
    variables.

    The actual conversion of HPAGE_PMD to a variable for ppc64 happens in a
    follow-up patch.

    [1] http://ibm.biz/power-isa3 (Needs registration).

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Michael Ellerman

    Kirill A. Shutemov
     

25 Feb, 2016

2 commits

  • Pull in our current fixes from 4.5; in particular, the "Fix Multi hit
    ERAT" bug is causing folks some grief when testing next.

    Michael Ellerman
     
  • Sebastian Ott and Gerald Schaefer reported random crashes on s390.
    It was bisected to my THP refcounting patchset.

    The problem is that pmdp_invalidate() is called with the wrong virtual
    address: it gets offset up by HPAGE_PMD_SIZE by the loop over the ptes.

    The solution is to introduce a new variable to be used in the loop and
    not touch 'haddr'.

    Signed-off-by: Kirill A. Shutemov
    Reported-and-tested-by: Gerald Schaefer
    Reported-and-tested-by: Sebastian Ott
    Reviewed-by: Will Deacon
    Cc: Christian Borntraeger
    Cc: Martin Schwidefsky
    Cc: Aneesh Kumar K.V
    Cc: Andrea Arcangeli
    Cc: Sasha Levin
    Cc: Jerome Marchand
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

21 Feb, 2016

1 commit

  • Pull powerpc fixes from Michael Ellerman:
    - Fix build error on 32-bit with checkpoint restart from Aneesh Kumar
    - Fix dedotify for binutils >= 2.26 from Andreas Schwab
    - Don't trace hcalls on offline CPUs from Denis Kirjanov
    - eeh: Fix stale cached primary bus from Gavin Shan
    - eeh: Fix stale PE primary bus from Gavin Shan
    - mm: Fix Multi hit ERAT cause by recent THP update from Aneesh Kumar K.V
    - ioda: Set "read" permission when "write" is set from Alexey Kardashevskiy

    * tag 'powerpc-4.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
    powerpc/ioda: Set "read" permission when "write" is set
    powerpc/mm: Fix Multi hit ERAT cause by recent THP update
    powerpc/powernv: Fix stale PE primary bus
    powerpc/eeh: Fix stale cached primary bus
    powerpc/pseries: Don't trace hcalls on offline CPUs
    powerpc: Fix dedotify for binutils >= 2.26
    powerpc/book3s_32: Fix build error with checkpoint restart

    Linus Torvalds
     

19 Feb, 2016

1 commit


15 Feb, 2016

1 commit

  • With ppc64 we use the deposited pgtable_t to store the hash pte slot
    information. We should not withdraw the deposited pgtable_t without
    marking the pmd none. This ensures that the low level hash fault handling
    will skip this huge pte and we will handle it at upper levels.

    Recent change to pmd splitting changed the above in order to handle the
    race between pmd split and exit_mmap. The race is explained below.

    Consider following race:

    CPU0                                    CPU1
    shrink_page_list()
      add_to_swap()
        split_huge_page_to_list()
          __split_huge_pmd_locked()
            pmdp_huge_clear_flush_notify()
            // pmd_none() == true
                                            exit_mmap()
                                              unmap_vmas()
                                                zap_pmd_range()
                                                // no action on pmd since
                                                // pmd_none() == true
            pmd_populate()

    As a result the THP will not be freed. The leak is detected by check_mm():

    BUG: Bad rss-counter state mm:ffff880058d2e580 idx:1 val:512

    The above required us to not mark pmd none during a pmd split.

    The fix for ppc is to clear the huge pte of _PAGE_USER, so that the low
    level fault handling code skips this pte. At the higher level we do take
    the ptl lock. That should serialize us against the pmd split. Once the
    lock is acquired we do check the pmd again using pmd_same. That should
    always return false for us and hence we should retry the access. We do
    the pmd_same check in all cases after taking the ptl with
    THP (do_huge_pmd_wp_page, do_huge_pmd_numa_page and
    huge_pmd_set_accessed).

    Also make sure we wait for the irq disable section in other cpus to
    finish before flipping a huge pte entry with a regular pmd entry. Code
    paths like find_linux_pte_or_hugepte depend on irq disable to get
    a stable pte_t pointer. A parallel thp split needs to make sure we
    don't convert a pmd pte to a regular pmd entry without waiting for the
    irq disable section to finish.

    Fixes: eef1b3ba053a ("thp: implement split_huge_pmd()")
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Michael Ellerman

    Aneesh Kumar K.V
     

06 Feb, 2016

1 commit

  • We need to iterate over split_queue, not the local empty list, to get
    anything split from the shrinker.

    Fixes: e3ae19535c66 ("thp: limit number of object to scan on deferred_split_scan()")
    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

04 Feb, 2016

4 commits

  • We allocate a pgtable but do not attach it to anything if the PMD is in
    a DAX VMA, causing it to leak.

    We certainly try to not free pgtables associated with the huge zero page
    if the zero page is in a DAX VMA, so I think this is the right solution.
    This needs to be properly audited.

    Signed-off-by: Matthew Wilcox
    Cc: Dan Williams
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • If we have a lot of pages in the queue to be split, deferred_split_scan()
    can spend an unreasonable amount of time under the spinlock with
    interrupts disabled.

    Let's cap the number of pages to split per scan by sc->nr_to_scan.

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Andrea Arcangeli
    Reviewed-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • I got the meaning of shrinker::count_objects() wrong: it should return
    the number of potentially freeable objects, which does not necessarily
    correlate with freeable memory.

    Returning 256 per THP in the queue is not reasonable:
    shrinker::scan_objects() is never called with nr_to_scan > 128 in my
    setup.

    Let's return 1 per THP and correct scan_objects() accordingly.

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Andrea Arcangeli suggested making the split queue per-node to improve
    scalability. Let's do it.

    Signed-off-by: Kirill A. Shutemov
    Suggested-by: Andrea Arcangeli
    Reviewed-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

22 Jan, 2016

2 commits

  • This crash is caused by a NULL pointer dereference in the page_to_pfn()
    macro, when page == NULL:

    Unable to handle kernel NULL pointer dereference at virtual address 00000000
    Internal error: Oops: 94000006 [#1] SMP
    Modules linked in:
    CPU: 1 PID: 26 Comm: khugepaged Tainted: G W 4.3.0-rc6-next-20151022ajb-00001-g32f3386-dirty #3
    PC is at khugepaged+0x378/0x1af8
    LR is at khugepaged+0x418/0x1af8
    Process khugepaged (pid: 26, stack limit = 0xffffffc079638020)
    Call trace:
    khugepaged+0x378/0x1af8
    kthread+0xdc/0xf4
    ret_from_fork+0xc/0x40
    Code: 35001700 f0002c60 aa0703e3 f9009fa0 (f94000e0)
    ---[ end trace 637503d8e28ae69e ]---
    Kernel panic - not syncing: Fatal exception
    CPU2: stopping
    CPU: 2 PID: 0 Comm: swapper/2 Tainted: G D W 4.3.0-rc6-next-20151022ajb-00001-g32f3386-dirty #3
    Hardware name: linux,dummy-virt (DT)

    [akpm@linux-foundation.org: fix fat-fingered merge resolution]
    Signed-off-by: yalin wang
    Acked-by: Vlastimil Babka
    Acked-by: Kirill A. Shutemov
    Acked-by: David Rientjes
    Cc: Cyrill Gorcunov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    yalin wang
     
  • After THP refcounting rework we have only two possible return values
    from pmd_trans_huge_lock(): success and failure. Return-by-pointer for
    ptl doesn't make much sense in this case.

    Let's convert pmd_trans_huge_lock() to return ptl on success and NULL on
    failure.

    Signed-off-by: Kirill A. Shutemov
    Suggested-by: Linus Torvalds
    Cc: Minchan Kim
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

21 Jan, 2016

2 commits

  • split_queue_lock can be taken from interrupt context in some cases, but
    I forgot to convert locking in split_huge_page() to interrupt-safe
    primitives.

    Let's fix this.

    lockdep output:

    ======================================================
    [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
    4.4.0+ #259 Tainted: G W
    ------------------------------------------------------
    syz-executor/18183 [HC0[0]:SC0[2]:HE0:SE0] is trying to acquire:
    (split_queue_lock){+.+...}, at: free_transhuge_page+0x24/0x90 mm/huge_memory.c:3436

    and this task is already holding:
    (slock-AF_INET){+.-...}, at: spin_lock_bh include/linux/spinlock.h:307
    (slock-AF_INET){+.-...}, at: lock_sock_fast+0x45/0x120 net/core/sock.c:2462
    which would create a new lock dependency:
    (slock-AF_INET){+.-...} -> (split_queue_lock){+.+...}

    but this new dependency connects a SOFTIRQ-irq-safe lock:
    (slock-AF_INET){+.-...}
    ... which became SOFTIRQ-irq-safe at:
    mark_irqflags kernel/locking/lockdep.c:2799
    __lock_acquire+0xfd8/0x4700 kernel/locking/lockdep.c:3162
    lock_acquire+0x1dc/0x430 kernel/locking/lockdep.c:3585
    __raw_spin_lock include/linux/spinlock_api_smp.h:144
    _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
    spin_lock include/linux/spinlock.h:302
    udp_queue_rcv_skb+0x781/0x1550 net/ipv4/udp.c:1680
    flush_stack+0x50/0x330 net/ipv6/udp.c:799
    __udp4_lib_mcast_deliver+0x694/0x7f0 net/ipv4/udp.c:1798
    __udp4_lib_rcv+0x17dc/0x23e0 net/ipv4/udp.c:1888
    udp_rcv+0x21/0x30 net/ipv4/udp.c:2108
    ip_local_deliver_finish+0x2b3/0xa50 net/ipv4/ip_input.c:216
    NF_HOOK_THRESH include/linux/netfilter.h:226
    NF_HOOK include/linux/netfilter.h:249
    ip_local_deliver+0x1c4/0x2f0 net/ipv4/ip_input.c:257
    dst_input include/net/dst.h:498
    ip_rcv_finish+0x5ec/0x1730 net/ipv4/ip_input.c:365
    NF_HOOK_THRESH include/linux/netfilter.h:226
    NF_HOOK include/linux/netfilter.h:249
    ip_rcv+0x963/0x1080 net/ipv4/ip_input.c:455
    __netif_receive_skb_core+0x1620/0x2f80 net/core/dev.c:4154
    __netif_receive_skb+0x2a/0x160 net/core/dev.c:4189
    netif_receive_skb_internal+0x1b5/0x390 net/core/dev.c:4217
    napi_skb_finish net/core/dev.c:4542
    napi_gro_receive+0x2bd/0x3c0 net/core/dev.c:4572
    e1000_clean_rx_irq+0x4e2/0x1100 drivers/net/ethernet/intel/e1000e/netdev.c:1038
    e1000_clean+0xa08/0x24a0 drivers/net/ethernet/intel/e1000/e1000_main.c:3819
    napi_poll net/core/dev.c:5074
    net_rx_action+0x7eb/0xdf0 net/core/dev.c:5139
    __do_softirq+0x26a/0x920 kernel/softirq.c:273
    invoke_softirq kernel/softirq.c:350
    irq_exit+0x18f/0x1d0 kernel/softirq.c:391
    exiting_irq ./arch/x86/include/asm/apic.h:659
    do_IRQ+0x86/0x1a0 arch/x86/kernel/irq.c:252
    ret_from_intr+0x0/0x20 arch/x86/entry/entry_64.S:520
    arch_safe_halt ./arch/x86/include/asm/paravirt.h:117
    default_idle+0x52/0x2e0 arch/x86/kernel/process.c:304
    arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:295
    default_idle_call+0x48/0xa0 kernel/sched/idle.c:92
    cpuidle_idle_call kernel/sched/idle.c:156
    cpu_idle_loop kernel/sched/idle.c:252
    cpu_startup_entry+0x554/0x710 kernel/sched/idle.c:300
    rest_init+0x192/0x1a0 init/main.c:412
    start_kernel+0x678/0x69e init/main.c:683
    x86_64_start_reservations+0x2a/0x2c arch/x86/kernel/head64.c:195
    x86_64_start_kernel+0x158/0x167 arch/x86/kernel/head64.c:184

    to a SOFTIRQ-irq-unsafe lock:
    (split_queue_lock){+.+...}
    which became SOFTIRQ-irq-unsafe at:
    mark_irqflags kernel/locking/lockdep.c:2817
    __lock_acquire+0x146e/0x4700 kernel/locking/lockdep.c:3162
    lock_acquire+0x1dc/0x430 kernel/locking/lockdep.c:3585
    __raw_spin_lock include/linux/spinlock_api_smp.h:144
    _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
    spin_lock include/linux/spinlock.h:302
    split_huge_page_to_list+0xcc0/0x1c50 mm/huge_memory.c:3399
    split_huge_page include/linux/huge_mm.h:99
    queue_pages_pte_range+0xa38/0xef0 mm/mempolicy.c:507
    walk_pmd_range mm/pagewalk.c:50
    walk_pud_range mm/pagewalk.c:90
    walk_pgd_range mm/pagewalk.c:116
    __walk_page_range+0x653/0xcd0 mm/pagewalk.c:204
    walk_page_range+0xfe/0x2b0 mm/pagewalk.c:281
    queue_pages_range+0xfb/0x130 mm/mempolicy.c:687
    migrate_to_node mm/mempolicy.c:1004
    do_migrate_pages+0x370/0x4e0 mm/mempolicy.c:1109
    SYSC_migrate_pages mm/mempolicy.c:1453
    SyS_migrate_pages+0x640/0x730 mm/mempolicy.c:1374
    entry_SYSCALL_64_fastpath+0x16/0x7a arch/x86/entry/entry_64.S:185

    other info that might help us debug this:

    Possible interrupt unsafe locking scenario:

         CPU0                          CPU1
         ----                          ----
    lock(split_queue_lock);
                                  local_irq_disable();
                                  lock(slock-AF_INET);
                                  lock(split_queue_lock);

    lock(slock-AF_INET);

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dmitry Vyukov
    Acked-by: David Rientjes
    Reviewed-by: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • A newly added tracepoint in the hugepage code uses a variable in the
    error handling that is not initialized at that point:

    include/trace/events/huge_memory.h:81:230: error: 'isolated' may be used uninitialized in this function [-Werror=maybe-uninitialized]

    The result is relatively harmless, as the trace data will in rare
    cases contain incorrect data.

    This works around the problem by adding an explicit initialization (a
    generic illustration of the warning and the fix follows this entry).

    Signed-off-by: Arnd Bergmann
    Fixes: 7d2eba0557c1 ("mm: add tracepoint for scanning pages")
    Reviewed-by: Ebru Akagunduz
    Acked-by: David Rientjes
    Cc: Kirill A. Shutemov
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
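
    A generic illustration of this class of warning and of the fix, using a
    simplified stand-in for the real code (this is not the kernel function):
    a variable is only assigned on the success path but is still read on the
    error path, which gcc flags with -Wmaybe-uninitialized; giving it an
    explicit initial value silences the warning and makes the logged value
    well defined.

    #include <stdio.h>
    #include <stdlib.h>

    /* Stand-in for work that can fail (e.g. isolating pages). */
    static int do_work(int arg)
    {
            return arg < 0 ? -1 : arg;
    }

    static int scan(int arg)
    {
            int isolated = 0;       /* the fix: explicit initialization */
            int ret;

            ret = do_work(arg);
            if (ret < 0)
                    goto out;       /* without "= 0" above, "isolated" would
                                       be read uninitialized on this path */
            isolated = ret;
    out:
            /* Plays the role of the tracepoint that records the value even
               when the function failed. */
            printf("arg=%d isolated=%d\n", arg, isolated);
            return ret;
    }

    int main(int argc, char **argv)
    {
            scan(argc > 1 ? atoi(argv[1]) : -1);
            return 0;
    }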
     

18 Jan, 2016

1 commit

  • Commit b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when
    MADV_FREE syscall is called") introduced this new function, but got the
    error handling for when pmd_trans_huge_lock() fails wrong. In the
    failure case, the lock has not been taken, and we should not unlock on
    the way out.

    Cc: Minchan Kim
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

16 Jan, 2016

14 commits

  • A dax mapping establishes a pte with _PAGE_DEVMAP set when the driver
    has established a devm_memremap_pages() mapping, i.e. when the pfn_t
    returned from ->direct_access() has PFN_DEV and PFN_MAP set. Later, when
    encountering _PAGE_DEVMAP during a page table walk we look up and pin a
    struct dev_pagemap instance to keep the result of pfn_to_page() valid
    until put_page().

    Signed-off-by: Dan Williams
    Tested-by: Logan Gunthorpe
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • A dax-huge-page mapping, while it uses some thp helpers, is ultimately
    not a transparent huge page. The distinction is especially important in
    the get_user_pages() path. pmd_devmap() is used to distinguish dax-pmds
    from pmd_huge() and pmd_trans_huge(), which have slightly different
    semantics.

    Explicitly mark the pmd_trans_huge() helpers that dax needs by adding
    pmd_devmap() checks.

    [kirill.shutemov@linux.intel.com: fix regression in handling mlocked pages in __split_huge_pmd()]
    Signed-off-by: Dan Williams
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Matthew Wilcox
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Similar to the conversion of vm_insert_mixed(), use pfn_t in
    vmf_insert_pfn_pmd() to tag the resulting pte with _PAGE_DEVMAP when the
    pfn is backed by a devm_memremap_pages() mapping.

    Signed-off-by: Dan Williams
    Cc: Dave Hansen
    Cc: Matthew Wilcox
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Prior to this change DAX PMD mappings that were made read-only were
    never able to be made writable again. This is because the code in
    insert_pfn_pmd() that calls pmd_mkdirty() and pmd_mkwrite() would skip
    these calls if the PMD already existed in the page table.

    Instead, if we are doing a write always mark the PMD entry as dirty and
    writeable. Without this code we can get into a condition where we mark
    the PMD as read-only, and then on a subsequent write fault we get into
    an infinite loop of PMD faults where we try unsuccessfully to make the
    PMD writeable.

    Signed-off-by: Ross Zwisler
    Signed-off-by: Dan Williams
    Reported-by: Jeff Moyer
    Reported-by: Toshi Kani
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • Sasha Levin has reported a KASAN out-of-bounds bug [1]. It points to "if
    (!is_swap_pte(pte[i]))" in unfreeze_page_vma() as a problematic access.

    The cause is that split_huge_page() doesn't handle THP correctly if it's
    not aligned to a PMD boundary. It can happen after mremap().

    Test case (does not always trigger the bug):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define MB (1024UL*1024)
    #define SIZE (2*MB)
    #define BASE ((void *)0x400000000000)

    int main(void)
    {
            char *p;

            p = mmap(BASE, SIZE, PROT_READ | PROT_WRITE,
                     MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
                     -1, 0);
            if (p == MAP_FAILED)
                    perror("mmap"), exit(1);
            p = mremap(BASE, SIZE, SIZE, MREMAP_FIXED | MREMAP_MAYMOVE,
                       BASE + SIZE + 8192);
            if (p == MAP_FAILED)
                    perror("mremap"), exit(1);
            system("echo 1 > /sys/kernel/debug/split_huge_pages");
            return 0;
    }

    The patch fixes the freeze and unfreeze paths to handle page table
    boundary crossing.

    It also makes the mapcount vs. count check in split_huge_page_to_list()
    stricter:
    - after freeze we don't expect any subpage to be mapped, as we remove
    them from rmap when setting up migration entries;
    - the count must be 1, meaning only the caller has a reference to the
    page.

    [1] https://gist.github.com/sashalevin/c67fbea55e7c0576972a

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We don't need to split a THP page when the MADV_FREE syscall is called
    if [start, len] is aligned with the THP size. The split could be done
    when the VM decides to free it in the reclaim path if memory pressure is
    heavy. With that, we could avoid an unnecessary THP split.

    For this feature, this patch changes the pte dirtiness marking logic of
    THP. Currently, splitting marks every pte of the page dirty
    unconditionally, which makes MADV_FREE void. So, instead, this patch
    propagates pmd dirtiness to all pages via PG_dirty and restores pte
    dirtiness from PG_dirty. With this, if the pmd is clean (i.e.
    MADV_FREEed) when the split happens (e.g. in shrink_page_list), all of
    the pages are clean too, so we can discard them. (A userspace sketch of
    the aligned MADV_FREE call follows this entry.)

    Signed-off-by: Minchan Kim
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: "James E.J. Bottomley"
    Cc: "Kirill A. Shutemov"
    Cc: Shaohua Li
    Cc:
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Chen Gang
    Cc: Chris Zankel
    Cc: Daniel Micay
    Cc: Darrick J. Wong
    Cc: David S. Miller
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: Jason Evans
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Michal Hocko
    Cc: Mika Penttil
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Rik van Riel
    Cc: Roland Dreier
    Cc: Russell King
    Cc: Shaohua Li
    Cc: Will Deacon
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
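
    A userspace sketch of the aligned case described above; an illustration,
    not code from the patch. It faults in a PMD-sized (2MB) anonymous range
    and then marks exactly that THP-aligned range lazily freeable, so no
    split is needed at madvise() time; the page is only split (and its clean
    subpages discarded) if reclaim later needs the memory. MADV_FREE needs a
    kernel with this feature; the fallback define uses the value from the
    kernel UAPI headers.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>

    #ifndef MADV_FREE
    #define MADV_FREE 8             /* from the kernel UAPI, for older libc */
    #endif

    #define MB  (1024UL * 1024)
    #define THP (2 * MB)

    int main(void)
    {
            /* Over-allocate so a 2MB-aligned, 2MB window can be carved out. */
            char *raw = mmap(NULL, 2 * THP, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            char *p;

            if (raw == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            p = (char *)(((uintptr_t)raw + THP - 1) & ~(THP - 1));

            madvise(p, THP, MADV_HUGEPAGE); /* ask for a THP here */
            memset(p, 1, THP);              /* fault it in */

            /* Aligned [start, len]: the kernel need not split the THP. */
            if (madvise(p, THP, MADV_FREE))
                    perror("madvise(MADV_FREE)");

            return 0;
    }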
     
  • During freeze_page(), we remove the page from rmap. It munlocks the
    page if it was mlocked. clear_page_mlock() uses the lru cache, which
    temporarily pins the page.

    Let's drain the lru cache before checking the page's count vs. mapcount.
    The change makes a mlocked page split on the first attempt, if it was
    not pinned by somebody else.

    Signed-off-by: Kirill A. Shutemov
    Cc: Sasha Levin
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Writing 1 into 'split_huge_pages' will try to find and split all huge
    pages in the system. This is useful for debugging (a minimal sketch of
    the write follows this entry).

    [akpm@linux-foundation.org: fix printk text, per Vlastimil]
    Signed-off-by: Kirill A. Shutemov
    Cc: Sasha Levin
    Cc: Andrea Arcangeli
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
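
    The sketch referenced above: a minimal program (needs root and a mounted
    debugfs) that writes 1 into the new file to ask the kernel to split every
    huge page in the system.

    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/sys/kernel/debug/split_huge_pages", "w");

            if (!f) {
                    perror("split_huge_pages");
                    return 1;
            }
            if (fprintf(f, "1") < 0)
                    perror("write");
            if (fclose(f))
                    perror("close");
            return 0;
    }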
     
  • Both page_referenced() and page_idle_clear_pte_refs_one() assume that
    THP can only be mapped with a PMD, so there's no reason to look at PTEs
    for PageTransHuge() pages. That's not true anymore: THP can be mapped
    with PTEs too.

    The patch removes the PageTransHuge() test from the functions and
    open-codes the page table check.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Kirill A. Shutemov
    Cc: Vladimir Davydov
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: Sasha Levin
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Before THP refcounting rework, THP was not allowed to cross VMA
    boundary. So, if we have THP and we split it, PG_mlocked can be safely
    transferred to small pages.

    With the new THP refcounting and a naive approach to mlocking we can end
    up with this scenario:
    1. we have a mlocked THP, which belongs to one VM_LOCKED VMA.
    2. the process does munlock() on *part* of the THP:
    - the VMA is split into two, one of them VM_LOCKED;
    - the huge PMD is split into a PTE table;
    - the THP is still mlocked;
    3. split_huge_page():
    - it transfers PG_mlocked to *all* small pages regardless of whether
    they belong to any VM_LOCKED VMA.

    We probably could munlock() all small pages on split_huge_page(), but I
    think we have an accounting issue already at step two.

    Instead of forbidding mlocked pages altogether, we just avoid mlocking
    PTE-mapped THPs and munlock THPs on split_huge_pmd().

    This means PTE-mapped THPs will be on normal lru lists and will be split
    under memory pressure by vmscan. After the split vmscan will detect
    unevictable small pages and mlock them.

    With this approach we shouldn't hit the situation described above (a
    userspace illustration of the partial-munlock scenario follows this
    entry).

    Signed-off-by: Kirill A. Shutemov
    Cc: Sasha Levin
    Cc: Aneesh Kumar K.V
    Cc: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
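
    A userspace illustration of the partial-munlock scenario above (not code
    from the patch): the whole THP-backed range is mlocked, then only the
    first megabyte is munlocked, which splits the VMA and any huge PMD on
    the boundary while the compound page stays intact. Running it may
    require root or a raised RLIMIT_MEMLOCK.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #define MB (1024UL * 1024)
    #define SZ (4 * MB)

    int main(void)
    {
            char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            madvise(p, SZ, MADV_HUGEPAGE);  /* ask for THPs */
            memset(p, 1, SZ);               /* fault the range in */

            if (mlock(p, SZ)) {             /* step 1: the THP is mlocked */
                    perror("mlock");
                    return 1;
            }
            /* Step 2: munlock() only part of it; the VMA is split in two and
             * the huge PMD covering the boundary becomes a PTE table. */
            if (munlock(p, MB))
                    perror("munlock");

            return 0;
    }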
     
  • Currently we don't split a huge page on partial unmap. It's not an
    ideal situation. It can lead to memory overhead.

    Fortunately, we can detect partial unmap in page_remove_rmap(). But we
    cannot call split_huge_page() from there due to the locking context.

    It's also counterproductive to do it directly from the munmap() codepath:
    in many cases we will hit this from exit(2) and splitting the huge page
    just to free it up in small pages is not what we really want.

    The patch introduces deferred_split_huge_page(), which puts the huge page
    into a queue for splitting. The splitting itself will happen when we get
    memory pressure via the shrinker interface. The page will be dropped from
    the list on freeing, through the compound page destructor.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • This patch adds an implementation of split_huge_page() for the new
    refcounting.

    Unlike the previous implementation, the new split_huge_page() can fail if
    somebody holds a GUP pin on the page. It also means that a pin on a page
    will prevent it from being split under you. It makes the situation in
    many places much cleaner.

    The basic scheme of split_huge_page():

    - Check that the sum of mapcounts of all subpages is equal to
    page_count() plus one (the caller's pin). Fail with -EBUSY otherwise.
    This way we can avoid useless PMD-splits.

    - Freeze the page counters by splitting all PMDs and setting up migration
    PTEs.

    - Re-check the sum of mapcounts against page_count(). The page's counts
    are stable now. -EBUSY if the page is pinned.

    - Split compound page.

    - Unfreeze the page by removing migration entries.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We are going to use migration PTE entries to stabilize page counts. If
    the page is mapped with PMDs, we need to split the PMD and set up
    migration entries. It's reasonable to combine these operations to avoid
    double-scanning over the page table.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The original split_huge_page() combined two operations: splitting PMDs
    into tables of PTEs and splitting the underlying compound page. This
    patch implements split_huge_pmd(), which splits the given PMD without
    splitting other PMDs this page is mapped with, or the underlying compound
    page.

    Without tail page refcounting, the implementation of split_huge_pmd() is
    pretty straightforward.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov