04 Aug, 2015

1 commit

  • commit 6b7339f4c31ad69c8e9c0b2859276e22cf72176d upstream.

    Reading the page fault handler code, I've noticed that under the right
    circumstances the kernel would map anonymous pages into file mappings:
    if the VMA doesn't have vm_ops->fault() and the VMA wasn't fully
    populated on ->mmap(), the kernel would handle a page fault on a
    not-populated pte with do_anonymous_page().

    Let's change the page fault handler to use do_anonymous_page() only on
    anonymous VMAs (->vm_ops == NULL) and make sure that the VMA is not
    shared.

    For file mappings without vm_ops->fault(), or a shared VMA without
    vm_ops, a page fault on a pte_none() entry will now lead to SIGBUS.
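
    A sketch of the resulting dispatch (argument lists elided; this is a
    paraphrase of the change, not the literal diff):

        if (pte_none(entry)) {
                if (vma->vm_ops)
                        return do_fault(...);   /* SIGBUS if no ->fault() */
                return do_anonymous_page(...);
        }

        /* in do_anonymous_page(): file mapping without ->vm_ops? */
        if (vma->vm_flags & VM_SHARED)
                return VM_FAULT_SIGBUS;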

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Oleg Nesterov
    Cc: Andrew Morton
    Cc: Willy Tarreau
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     

16 Apr, 2015

3 commits

  • This will allow an FS that uses VM_PFNMAP | VM_MIXEDMAP (no page
    structs) to get notified when an access is a write to a read-only PFN.

    This can happen if we mmap() a file, then first mmap-read from it to
    page-in a read-only PFN, and then mmap-write to the same page.

    We need this functionality to fix a DAX bug, where in the scenario above
    we fail to set ctime/mtime though we modified the file. An xfstest is
    attached to this patchset that shows the failure and the fix. (A DAX
    patch will follow)

    This functionality is extra important for us, because upon dirtying of a
    pmem page we also want to RDMA the page to a remote cluster node.

    We define a new pfn_mkwrite and do not reuse page_mkwrite because:
    1 - The name ;-)
    2 - Mainly because reusing page_mkwrite would take a very long and
    tedious audit of all the page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
    users, to make sure they do not crash now that there is no page. For
    example, the current DAX code (which this is for) would crash.
    If we wanted to reuse page_mkwrite, we would need to first patch all
    users so they do not crash on no-page, and only then enable this patch.
    But even if I did that I would not sleep so well at night. Adding a new
    vector is the safest thing to do, and it is not that expensive: an extra
    pointer in a static function vector per driver.
    The new vector is also better for performance, because otherwise we
    would call all the current kernel vectors just so they can check that
    there is no page, do nothing, and return.

    No need to call it from do_shared_fault because do_wp_page is called to
    change pte permissions anyway.
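
    For illustration, a driver opting in might look roughly like this (a
    sketch; "my_fault" is a hypothetical handler, and the pfn_mkwrite
    signature has changed in later kernels):

        static int my_pfn_mkwrite(struct vm_area_struct *vma,
                                  struct vm_fault *vmf)
        {
                /* mark the backing store dirty, update ctime/mtime, etc. */
                return 0;
        }

        static const struct vm_operations_struct my_vm_ops = {
                .fault       = my_fault,        /* hypothetical */
                .pfn_mkwrite = my_pfn_mkwrite,
        };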

    Signed-off-by: Yigal Korman
    Signed-off-by: Boaz Harrosh
    Acked-by: Kirill A. Shutemov
    Cc: Matthew Wilcox
    Cc: Jan Kara
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Boaz Harrosh
     
  • A lot of filesystems use generic_file_mmap() and filemap_fault(), so
    f_op->mmap and vm_ops->fault aren't enough to identify the filesystem.

    This prints the file name, vm_ops->fault, f_op->mmap and a_ops->readpage
    (which is almost always implemented and filesystem-specific).

    Example:

    [ 23.676410] BUG: Bad page map in process sh pte:1b7e6025 pmd:19bbd067
    [ 23.676887] page:ffffea00006df980 count:4 mapcount:1 mapping:ffff8800196426c0 index:0x97
    [ 23.677481] flags: 0x10000000000000c(referenced|uptodate)
    [ 23.677896] page dumped because: bad pte
    [ 23.678205] addr:00007f52fcb17000 vm_flags:00000075 anon_vma: (null) mapping:ffff8800196426c0 index:97
    [ 23.678922] file:libc-2.19.so fault:filemap_fault mmap:generic_file_readonly_mmap readpage:v9fs_vfs_readpage
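
    A sketch of the extra reporting (assumed shape, not the literal patch;
    %pD prints the file name and %pf a symbol):

        if (vma->vm_file)
                pr_alert("file:%pD fault:%pf mmap:%pf readpage:%pf\n",
                         vma->vm_file,
                         vma->vm_ops ? vma->vm_ops->fault : NULL,
                         vma->vm_file->f_op ? vma->vm_file->f_op->mmap : NULL,
                         mapping ? mapping->a_ops->readpage : NULL);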

    [akpm@linux-foundation.org: use pr_alert, per Kirill]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Sasha Levin
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • We converted some of the usages of ACCESS_ONCE to READ_ONCE in the mm/
    tree, since ACCESS_ONCE doesn't work reliably on non-scalar types.

    This patch removes the remaining usages of ACCESS_ONCE and uses the new
    READ_ONCE API for the read accesses. This makes things cleaner: one API
    instead of separate/multiple sets of APIs.
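
    The conversion is mechanical; for a page-table read it is simply
    (illustrative, not a specific hunk from the patch):

        /* before */
        pmd_t entry = ACCESS_ONCE(*pmd);

        /* after */
        pmd_t entry = READ_ONCE(*pmd);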

    Signed-off-by: Jason Low
    Acked-by: Michal Hocko
    Acked-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Reviewed-by: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Low
     

15 Apr, 2015

4 commits

  • The do_wp_page function is extremely long. Extract the logic for
    handling a page belonging to a shared vma into a function of its own.

    This helps the readability of the code, without making any functional
    change to it.

    Signed-off-by: Shachar Raindel
    Acked-by: Linus Torvalds
    Acked-by: Kirill A. Shutemov
    Acked-by: Rik van Riel
    Acked-by: Andi Kleen
    Acked-by: Haggai Eran
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Peter Feiner
    Cc: Michel Lespinasse
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shachar Raindel
     
  • In some cases, do_wp_page had to copy the page suffering a write fault
    to a new location. When the function logic decided to do this, it
    jumped with a "goto" operation to the relevant code block. This made
    the code really hard to understand. It is also against the kernel
    coding style guidelines.

    This patch extracts the page copy and page table update logic to a
    separate function. It also cleans up the naming, from "gotten" to
    "wp_page_copy", and adds a few comments.

    Signed-off-by: Shachar Raindel
    Acked-by: Linus Torvalds
    Acked-by: Kirill A. Shutemov
    Acked-by: Rik van Riel
    Acked-by: Andi Kleen
    Acked-by: Haggai Eran
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Peter Feiner
    Cc: Michel Lespinasse
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shachar Raindel
     
  • When do_wp_page is ending, in several cases it needs to unlock the pages
    and ptls it was accessing.

    Currently, this logic is "called" by using a goto jump. This makes
    following the control flow of the function harder. Readability is
    further hampered by the unlock case containing a large amount of logic
    needed in only one of the 3 cases.

    Using goto for cleanup is generally allowed. However, moving the
    trivial unlocking flows to the relevant call sites allows deeper
    refactoring in the next patch.

    Signed-off-by: Shachar Raindel
    Acked-by: Linus Torvalds
    Acked-by: Kirill A. Shutemov
    Acked-by: Rik van Riel
    Acked-by: Andi Kleen
    Acked-by: Haggai Eran
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Peter Feiner
    Cc: Michel Lespinasse
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shachar Raindel
     
  • Currently do_wp_page contains 265 code lines. It also contains 9 goto
    statements, 5 of which target labels that are not cleanup related.
    This makes the function extremely difficult to understand.

    The following patches are an attempt at breaking the function into its
    basic components, making it easier to understand.

    The patches are straightforward function extractions from do_wp_page.
    As we extract functions, we remove unneeded parameters and simplify the
    code as much as possible. However, the functionality is supposed to
    remain completely unchanged. The patches also attempt to document the
    functionality of each extracted function. In patch 2, we split the
    unlock logic so that each use case contains only the logic relevant to
    its specific needs, instead of having a huge number of conditional
    decisions in a single unlock flow.

    This patch (of 4):

    When do_wp_page is ending, in several cases it needs to reuse the existing
    page. This is achieved by making the page table writable, and possibly
    updating the page-cache state.

    Currently, this logic is "called" by using a goto jump. This makes
    following the control flow of the function harder. It is also against
    the coding style guidelines for using goto.

    As the code can easily be refactored into a specialized function, refactor
    it out and simplify the code flow in do_wp_page.
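
    After the full series, the tail of do_wp_page() reads as a dispatch
    over the extracted helpers. Roughly (a sketch of the resulting
    structure, not the exact upstream code):

        /* anonymous page mapped only by us: keep it */
        if (reuse_swap_page(old_page))
                return wp_page_reuse(...);

        /* shared writable mapping: notify the FS, no copy */
        if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
                             (VM_WRITE|VM_SHARED))
                return wp_page_shared(...);

        /* otherwise, COW into a fresh page */
        return wp_page_copy(...);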

    Acked-by: Linus Torvalds
    Acked-by: Kirill A. Shutemov
    Acked-by: Rik van Riel
    Acked-by: Andi Kleen
    Acked-by: Haggai Eran
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Peter Feiner
    Cc: Michel Lespinasse
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shachar Raindel
     

26 Mar, 2015

3 commits

  • Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226

    Across the board the 4.0-rc1 numbers are much slower, and the degradation
    is far worse when using the large memory footprint configs. Perf points
    straight at the cause - this is from 4.0-rc1 on the "-o bhash=101073" config:

    -   56.07%  56.07%  [kernel]  [k] default_send_IPI_mask_sequence_phys
       - default_send_IPI_mask_sequence_phys
          - 99.99% physflat_send_IPI_mask
             - 99.37% native_send_call_func_ipi
                  smp_call_function_many
                - native_flush_tlb_others
                   - 99.85% flush_tlb_page
                        ptep_clear_flush
                        try_to_unmap_one
                        rmap_walk
                        try_to_unmap
                        migrate_pages
                        migrate_misplaced_page
                      - handle_mm_fault
                         - 99.73% __do_page_fault
                              trace_do_page_fault
                              do_async_page_fault
                            + async_page_fault
             - 0.63% native_send_call_func_single_ipi
                  generic_exec_single
                  smp_call_function_single

    This is showing excessive migration activity even though excessive
    migrations are meant to get throttled. Normally, the scan rate is tuned
    on a per-task basis depending on the locality of faults. However, if
    migrations fail for any reason then the PTE scanner may scan faster if
    the faults continue to be remote. This means there is higher system CPU
    overhead and fault trapping at exactly the time we know that migrations
    cannot happen. This patch tracks when migration failures occur and
    slows the PTE scanner.
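
    The mechanism is roughly as follows (a sketch based on the description
    above): the fault path flags failed migrations, and the scan-period
    update backs off when it sees them:

        migrated = migrate_misplaced_page(page, vma, target_nid);
        if (migrated) {
                page_nid = target_nid;
                flags |= TNF_MIGRATED;
        } else
                flags |= TNF_MIGRATE_FAIL;      /* slows the PTE scanner */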

    Signed-off-by: Mel Gorman
    Reported-by: Dave Chinner
    Tested-by: Dave Chinner
    Cc: Ingo Molnar
    Cc: Aneesh Kumar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Protecting a PTE to trap a NUMA hinting fault clears the writable bit,
    and further faults are needed after trapping a NUMA hinting fault to
    set the writable bit again. This patch preserves the writable bit when
    trapping NUMA hinting faults. The impact is obvious from the number of
    minor faults trapped during the basic balancing benchmark and from the
    system CPU usage:

    autonumabench
    4.0.0-rc4 4.0.0-rc4
    baseline preserve
    Time System-NUMA01 107.13 ( 0.00%) 103.13 ( 3.73%)
    Time System-NUMA01_THEADLOCAL 131.87 ( 0.00%) 83.30 ( 36.83%)
    Time System-NUMA02 8.95 ( 0.00%) 10.72 (-19.78%)
    Time System-NUMA02_SMT 4.57 ( 0.00%) 3.99 ( 12.69%)
    Time Elapsed-NUMA01 515.78 ( 0.00%) 517.26 ( -0.29%)
    Time Elapsed-NUMA01_THEADLOCAL 384.10 ( 0.00%) 384.31 ( -0.05%)
    Time Elapsed-NUMA02 48.86 ( 0.00%) 48.78 ( 0.16%)
    Time Elapsed-NUMA02_SMT 47.98 ( 0.00%) 48.12 ( -0.29%)

    4.0.0-rc4 4.0.0-rc4
    baseline preserve
    User 44383.95 43971.89
    System 252.61 201.24
    Elapsed 998.68 1000.94

    Minor Faults 2597249 1981230
    Major Faults 365 364

    There is a similar drop in system CPU usage using Dave Chinner's xfsrepair
    workload

    4.0.0-rc4 4.0.0-rc4
    baseline preserve
    Amean real-xfsrepair 454.14 ( 0.00%) 442.36 ( 2.60%)
    Amean syst-xfsrepair 277.20 ( 0.00%) 204.68 ( 26.16%)

    The patch looks hacky but the alternatives looked worse. The tidiest
    was to rewalk the page tables after a hinting fault, but it was more
    complex than this approach and the performance was worse. It's not
    generally safe to just mark the page writable during the fault if it's
    a write fault, as it may have been read-only for COW, so that approach
    was discarded.
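
    The core of the change is small; a sketch of change_pte_range():

        bool preserve_write = prot_numa && pte_write(oldpte);

        ptent = pte_modify(ptent, newprot);     /* makes the pte PROT_NONE */
        if (preserve_write)
                ptent = pte_mkwrite(ptent);     /* keep the write bit */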

    Signed-off-by: Mel Gorman
    Reported-by: Dave Chinner
    Tested-by: Dave Chinner
    Cc: Ingo Molnar
    Cc: Aneesh Kumar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • These are three follow-on patches based on the xfsrepair workload Dave
    Chinner reported was problematic in 4.0-rc1 due to changes in page table
    management -- https://lkml.org/lkml/2015/3/1/226.

    Much of the problem was reduced by commit 53da3bc2ba9e ("mm: fix up numa
    read-only thread grouping logic") and commit ba68bc0115eb ("mm: thp:
    Return the correct value for change_huge_pmd"). It was known that the
    performance in 3.19 was still better, even though it is far less safe.
    This series aims to restore the performance without compromising on
    safety.

    For the tests in this mail, I'm comparing 3.19 against 4.0-rc4 with the
    three patches applied on top

    autonumabench
    3.19.0 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4
    vanilla vanilla vmwrite-v5r8 preserve-v5r8 slowscan-v5r8
    Time System-NUMA01 124.00 ( 0.00%) 161.86 (-30.53%) 107.13 ( 13.60%) 103.13 ( 16.83%) 145.01 (-16.94%)
    Time System-NUMA01_THEADLOCAL 115.54 ( 0.00%) 107.64 ( 6.84%) 131.87 (-14.13%) 83.30 ( 27.90%) 92.35 ( 20.07%)
    Time System-NUMA02 9.35 ( 0.00%) 10.44 (-11.66%) 8.95 ( 4.28%) 10.72 (-14.65%) 8.16 ( 12.73%)
    Time System-NUMA02_SMT 3.87 ( 0.00%) 4.63 (-19.64%) 4.57 (-18.09%) 3.99 ( -3.10%) 3.36 ( 13.18%)
    Time Elapsed-NUMA01 570.06 ( 0.00%) 567.82 ( 0.39%) 515.78 ( 9.52%) 517.26 ( 9.26%) 543.80 ( 4.61%)
    Time Elapsed-NUMA01_THEADLOCAL 393.69 ( 0.00%) 384.83 ( 2.25%) 384.10 ( 2.44%) 384.31 ( 2.38%) 380.73 ( 3.29%)
    Time Elapsed-NUMA02 49.09 ( 0.00%) 49.33 ( -0.49%) 48.86 ( 0.47%) 48.78 ( 0.63%) 50.94 ( -3.77%)
    Time Elapsed-NUMA02_SMT 47.51 ( 0.00%) 47.15 ( 0.76%) 47.98 ( -0.99%) 48.12 ( -1.28%) 49.56 ( -4.31%)

    3.19.0 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4
    vanilla vanilla vmwrite-v5r8 preserve-v5r8 slowscan-v5r8
    User 46334.60 46391.94 44383.95 43971.89 44372.12
    System 252.84 284.66 252.61 201.24 249.00
    Elapsed 1062.14 1050.96 998.68 1000.94 1026.78

    Overall the system CPU usage is comparable and the test is naturally a
    bit variable. The slowing of the scanner hurts numa01 but on this
    machine it is an adverse workload and patches that dramatically help it
    often hurt absolutely everything else.

    Due to patch 2, the fault activity is interesting

    3.19.0 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4
    vanilla vanilla vmwrite-v5r8 preserve-v5r8 slowscan-v5r8
    Minor Faults 2097811 2656646 2597249 1981230 1636841
    Major Faults 362 450 365 364 365

    Note that preserving the write bit across protection updates and faults
    reduces minor faults.

    NUMA alloc hit 1229008 1217015 1191660 1178322 1199681
    NUMA alloc miss 0 0 0 0 0
    NUMA interleave hit 0 0 0 0 0
    NUMA alloc local 1228514 1216317 1190871 1177448 1199021
    NUMA base PTE updates 245706197 240041607 238195516 244704842 115012800
    NUMA huge PMD updates 479530 468448 464868 477573 224487
    NUMA page range updates 491225557 479886983 476207932 489222218 229950144
    NUMA hint faults 659753 656503 641678 656926 294842
    NUMA hint local faults 381604 373963 360478 337585 186249
    NUMA hint local percent 57 56 56 51 63
    NUMA pages migrated 5412140 6374899 6266530 5277468 5755096
    AutoNUMA cost 5121% 5083% 4994% 5097% 2388%

    Here the impact of slowing the PTE scanner on migration failures is
    obvious, as "NUMA base PTE updates" and "NUMA huge PMD updates" are
    massively reduced even though the headline performance is very similar.

    As xfsrepair was the reported workload, here is the impact of the
    series on it.

    xfsrepair
    3.19.0 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4
    vanilla vanilla vmwrite-v5r8 preserve-v5r8 slowscan-v5r8
    Min real-fsmark 1183.29 ( 0.00%) 1165.73 ( 1.48%) 1152.78 ( 2.58%) 1153.64 ( 2.51%) 1177.62 ( 0.48%)
    Min syst-fsmark 4107.85 ( 0.00%) 4027.75 ( 1.95%) 3986.74 ( 2.95%) 3979.16 ( 3.13%) 4048.76 ( 1.44%)
    Min real-xfsrepair 441.51 ( 0.00%) 463.96 ( -5.08%) 449.50 ( -1.81%) 440.08 ( 0.32%) 439.87 ( 0.37%)
    Min syst-xfsrepair 195.76 ( 0.00%) 278.47 (-42.25%) 262.34 (-34.01%) 203.70 ( -4.06%) 143.64 ( 26.62%)
    Amean real-fsmark 1188.30 ( 0.00%) 1177.34 ( 0.92%) 1157.97 ( 2.55%) 1158.21 ( 2.53%) 1182.22 ( 0.51%)
    Amean syst-fsmark 4111.37 ( 0.00%) 4055.70 ( 1.35%) 3987.19 ( 3.02%) 3998.72 ( 2.74%) 4061.69 ( 1.21%)
    Amean real-xfsrepair 450.88 ( 0.00%) 468.32 ( -3.87%) 454.14 ( -0.72%) 442.36 ( 1.89%) 440.59 ( 2.28%)
    Amean syst-xfsrepair 199.66 ( 0.00%) 290.60 (-45.55%) 277.20 (-38.84%) 204.68 ( -2.51%) 150.55 ( 24.60%)
    Stddev real-fsmark 4.12 ( 0.00%) 10.82 (-162.29%) 4.14 ( -0.28%) 5.98 (-45.05%) 4.60 (-11.53%)
    Stddev syst-fsmark 2.63 ( 0.00%) 20.32 (-671.82%) 0.37 ( 85.89%) 16.47 (-525.59%) 15.05 (-471.79%)
    Stddev real-xfsrepair 6.87 ( 0.00%) 4.55 ( 33.75%) 3.46 ( 49.58%) 1.78 ( 74.12%) 0.52 ( 92.50%)
    Stddev syst-xfsrepair 3.02 ( 0.00%) 10.30 (-241.37%) 13.17 (-336.37%) 0.71 ( 76.63%) 5.00 (-65.61%)
    CoeffVar real-fsmark 0.35 ( 0.00%) 0.92 (-164.73%) 0.36 ( -2.91%) 0.52 (-48.82%) 0.39 (-12.10%)
    CoeffVar syst-fsmark 0.06 ( 0.00%) 0.50 (-682.41%) 0.01 ( 85.45%) 0.41 (-543.22%) 0.37 (-478.78%)
    CoeffVar real-xfsrepair 1.52 ( 0.00%) 0.97 ( 36.21%) 0.76 ( 49.94%) 0.40 ( 73.62%) 0.12 ( 92.33%)
    CoeffVar syst-xfsrepair 1.51 ( 0.00%) 3.54 (-134.54%) 4.75 (-214.31%) 0.34 ( 77.20%) 3.32 (-119.63%)
    Max real-fsmark 1193.39 ( 0.00%) 1191.77 ( 0.14%) 1162.90 ( 2.55%) 1166.66 ( 2.24%) 1188.50 ( 0.41%)
    Max syst-fsmark 4114.18 ( 0.00%) 4075.45 ( 0.94%) 3987.65 ( 3.08%) 4019.45 ( 2.30%) 4082.80 ( 0.76%)
    Max real-xfsrepair 457.80 ( 0.00%) 474.60 ( -3.67%) 457.82 ( -0.00%) 444.42 ( 2.92%) 441.03 ( 3.66%)
    Max syst-xfsrepair 203.11 ( 0.00%) 303.65 (-49.50%) 294.35 (-44.92%) 205.33 ( -1.09%) 155.28 ( 23.55%)

    The really relevant lines are syst-xfsrepair, which is the system CPU
    usage when running xfsrepair. Note that on my machine the overhead was
    45% higher on 4.0-rc4, which may be part of what Dave is seeing. Once
    we preserve the write bit across faults, it's only 2.51% higher on
    average.
    With the full series applied, system CPU usage is 24.6% lower on
    average.

    Again, the impact of preserving the write bit on minor faults is obvious
    and the impact of slowing scanning after migration failures is obvious
    on the PTE updates. Note also that the number of pages migrated is much
    reduced even though the headline performance is comparable.

    3.19.0 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4
    vanilla vanilla vmwrite-v5r8 preserve-v5r8 slowscan-v5r8
    Minor Faults 153466827 254507978 249163829 153501373 105737890
    Major Faults 610 702 690 649 724
    NUMA base PTE updates 217735049 210756527 217729596 216937111 144344993
    NUMA huge PMD updates 129294 85044 106921 127246 79887
    NUMA pages migrated 21938995 29705270 28594162 22687324 16258075

    3.19.0 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4
    vanilla vanilla vmwrite-v5r8 preserve-v5r8 slowscan-v5r8
    Mean sdb-avgqusz 13.47 2.54 2.55 2.47 2.49
    Mean sdb-avgrqsz 202.32 140.22 139.50 139.02 138.12
    Mean sdb-await 25.92 5.09 5.33 5.02 5.22
    Mean sdb-r_await 4.71 0.19 0.83 0.51 0.11
    Mean sdb-w_await 104.13 5.21 5.38 5.05 5.32
    Mean sdb-svctm 0.59 0.13 0.14 0.13 0.14
    Mean sdb-rrqm 0.16 0.00 0.00 0.00 0.00
    Mean sdb-wrqm 3.59 1799.43 1826.84 1812.21 1785.67
    Max sdb-avgqusz 111.06 12.13 14.05 11.66 15.60
    Max sdb-avgrqsz 255.60 190.34 190.01 187.33 191.78
    Max sdb-await 168.24 39.28 49.22 44.64 65.62
    Max sdb-r_await 660.00 52.00 280.00 76.00 12.00
    Max sdb-w_await 7804.00 39.28 49.22 44.64 65.62
    Max sdb-svctm 4.00 2.82 2.86 1.98 2.84
    Max sdb-rrqm 8.30 0.00 0.00 0.00 0.00
    Max sdb-wrqm 34.20 5372.80 5278.60 5386.60 5546.15

    FWIW, I also checked SPECjbb in different configurations and the
    observations are similar -- minor faults lower, PTE update activity
    lower, and performance roughly comparable to 3.19.

    This patch (of 3):

    Threads that share writable data within pages are grouped together as
    related tasks. This decision is based on whether the PTE is marked
    dirty, which is subject to timing races between the PTE scanner update
    and when the application writes the page. If the page is file-backed,
    then background flushes and sync also affect placement. This is
    unpredictable behaviour which is impossible to reason about, so this
    patch makes grouping decisions based on the VMA flags.
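
    The grouping check in do_numa_page() becomes a VMA test instead of a
    page-table test (sketch):

        /* was a per-pte test; see also the 12 Mar entry below */
        if (!(vma->vm_flags & VM_WRITE))
                flags |= TNF_NO_GROUP;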

    Signed-off-by: Mel Gorman
    Reported-by: Dave Chinner
    Tested-by: Dave Chinner
    Cc: Ingo Molnar
    Cc: Aneesh Kumar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

12 Mar, 2015

1 commit

  • Dave Chinner reported that commit 4d9424669946 ("mm: convert
    p[te|md]_mknonnuma and remaining page table manipulations") slowed down
    his xfsrepair test enormously. In particular, it was using more system
    time due to extra TLB flushing.

    The ultimate reason turns out to be how the change to use the regular
    page table accessor functions broke the NUMA grouping logic. The old
    special mknuma/mknonnuma code accessed the page table present bit and
    the magic NUMA bit directly, while the new code just changes the page
    protections using PROT_NONE and the regular vma protections.

    That sounds equivalent, and from a fault standpoint it really is, but a
    subtle side effect is that the *other* protection bits of the page table
    entries also change. And the code to decide how to group the NUMA
    entries together used the writable bit to decide whether a particular
    page was likely to be shared read-only or not.

    And with the change to make the NUMA handling use the regular permission
    setting functions, that writable bit was basically always cleared for
    private mappings due to COW. So even if the page actually ends up being
    written to in the end, the NUMA balancing would act as if it was always
    shared RO.

    This code is a heuristic anyway, so the fix - at least for now - is to
    instead check whether the page is dirty rather than writable. The bit
    doesn't change with protection changes.
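
    The one-liner at the heart of the fix, sketched from the description:

        /* in do_numa_page(): was "if (!pte_write(pte))" */
        if (!pte_dirty(pte))
                flags |= TNF_NO_GROUP;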

    NOTE! This also adds a FIXME comment to revisit this issue:

    Not only should we probably re-visit the whole "is this a shared
    read-only page" heuristic (we might want to take the vma permissions
    into account and base this more on those than the per-page ones, and
    also look at whether the particular access that triggers it is a write
    or not), but the whole COW issue shows that we should think about the
    NUMA fault handling some more.

    For example, maybe we should do the early-COW thing that a regular fault
    does. Or maybe we should accept that while using the same bits as
    PROTNONE was a good thing (and got rid of the special NUMA bit), we
    might still want to just preserve the other protection bits across NUMA
    faulting.

    Those are bigger questions, left for later. This just fixes up the
    heuristic so that it at least approximates working again. More analysis
    and work needed.

    Reported-by: Dave Chinner
    Tested-by: Mel Gorman
    Cc: Andrew Morton
    Cc: Aneesh Kumar
    Cc: Ingo Molnar ,
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

17 Feb, 2015

2 commits

  • Currently COW of an XIP file is done by first bringing in a read-only
    mapping, then retrying the fault and copying the page. It is much more
    efficient to tell the fault handler that a COW is being attempted (by
    passing in the pre-allocated page in the vm_fault structure), and allow
    the handler to perform the COW operation itself.

    The handler cannot insert the page itself if there is already a read-only
    mapping at that address, so allow the handler to return VM_FAULT_LOCKED
    and set the fault_page to be NULL. This indicates to the MM code that the
    i_mmap_lock is held instead of the page lock.
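
    The resulting contract between the MM core and the fault handler is
    roughly (a sketch; field and flag names follow the description above):

        vmf.cow_page = new_page;        /* pre-allocated by the MM */
        ret = vma->vm_ops->fault(vma, &vmf);
        if (ret & VM_FAULT_ERROR)
                return ret;
        if (!vmf.page && (ret & VM_FAULT_LOCKED)) {
                /* handler performed the COW itself; it holds
                 * i_mmap_lock, not a page lock */
        }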

    Signed-off-by: Matthew Wilcox
    Acked-by: Kirill A. Shutemov
    Cc: Andreas Dilger
    Cc: Boaz Harrosh
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Jens Axboe
    Cc: Mathieu Desnoyers
    Cc: Randy Dunlap
    Cc: Ross Zwisler
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • DAX is a replacement for the variation of XIP currently supported by the
    ext2 filesystem. We have three different things in the tree called 'XIP',
    and the new focus is on access to data rather than executables, so a name
    change was in order. DAX stands for Direct Access. The X is for
    eXciting.

    The new focus on data access has resulted in more careful attention to
    races that exist in the current XIP code but are not hit by the use-case
    that it was designed for. XIP's architecture worked fine for ext2, but
    DAX is architected to work with modern filesystems such as ext4 and XFS.
    DAX is not intended for use with btrfs; the value that btrfs adds relies
    on manipulating data and writing data to different locations, while
    DAX's value is for write-in-place and keeping the kernel from touching
    the data.

    DAX was developed in order to support NV-DIMMs, but it's become clear
    that its usefulness extends beyond NV-DIMMs and there are several
    potential customers, including the tracing machinery. Other people want
    to place the kernel log in an area of memory, as long as they have a
    BIOS that does not clear DRAM on reboot.

    Patch 1 is a bug fix, probably worth including in 3.18.

    Patches 2 & 3 are infrastructure for DAX.

    Patches 4-8 replace the XIP code with its DAX equivalents, transforming
    ext2 to use the DAX code as we go. Note that patch 10 is the
    Documentation patch.

    Patches 9-15 clean up after the XIP code, removing the infrastructure
    that is no longer needed and renaming various XIP things to DAX.
    Most of these patches were added after Jan found things he didn't
    like in an earlier version of the ext4 patch ... that had been copied
    from ext2. So ext2 is being transformed to do things the same way that
    ext4 will later. The ability to mount ext2 filesystems with the 'xip'
    option is retained, although the 'dax' option is now preferred.

    Patch 16 adds some DAX infrastructure to support ext4.

    Patch 17 adds DAX support to ext4. It is broadly similar to ext2's DAX
    support, but it is more efficient than ext2's due to its support for
    unwritten extents.

    Patch 18 is another cleanup patch renaming XIP to DAX.

    My thanks to Mathieu Desnoyers for his reviews of the v11 patchset. Most
    of the changes below were based on his feedback.

    This patch (of 18):

    Pagecache faults recheck i_size after taking the page lock to ensure that
    the fault didn't race against a truncate. We don't have a page to lock in
    the XIP case, so use i_mmap_lock_read() instead. It is locked in the
    truncate path in unmap_mapping_range() after updating i_size. So while we
    hold it in the fault path, we are guaranteed that either i_size has
    already been updated in the truncate path, or that the truncate will
    subsequently call zap_page_range_single() and so remove the mapping we
    have just inserted.

    There is a window of time in which i_size has been reduced and the thread
    has a mapping to a page which will be removed from the file, but this is
    harmless as the page will not be allocated to a different purpose before
    the thread's access to it is revoked.
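
    A sketch of the recheck in the XIP/DAX fault path, per the reasoning
    above (names assumed; the upstream code uses the page-cache macros of
    the day):

        i_mmap_lock_read(mapping);
        /* i_size is stable, or the truncate will zap our mapping later */
        size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
        if (vmf->pgoff >= size) {
                i_mmap_unlock_read(mapping);
                return VM_FAULT_SIGBUS;
        }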

    [akpm@linux-foundation.org: switch to i_mmap_lock_read(), add comment in unmap_single_vma()]
    Signed-off-by: Matthew Wilcox
    Reviewed-by: Jan Kara
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Mathieu Desnoyers
    Cc: Andreas Dilger
    Cc: Boaz Harrosh
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jens Axboe
    Cc: Randy Dunlap
    Cc: Ross Zwisler
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

13 Feb, 2015

5 commits

  • For whatever reason, generic_access_phys() only remaps one page, but
    actually allows access of arbitrary size. It's quite easy to trigger
    large reads, like printing out a large structure with gdb, which leads
    to a crash. Fix it by remapping the correct size.
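
    The fix amounts to sizing the temporary mapping by the requested length
    (sketch):

        /* was: maddr = ioremap_prot(phys_addr, PAGE_SIZE, prot); */
        maddr = ioremap_prot(phys_addr, PAGE_ALIGN(len + offset), prot);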

    Fixes: 28b2ee20c7cb ("access_process_vm device memory infrastructure")
    Signed-off-by: Grazvydas Ignotas
    Cc: Rik van Riel
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Grazvydas Ignotas
     
  • pte_protnone_numa is only safe to use after VMA checks for PROT_NONE are
    complete. Treating a real PROT_NONE PTE as a NUMA hinting fault is going
    to result in strangeness, so add a check for it. BUG_ON looks like
    overkill, but if this is hit then it's a serious bug that could result
    in corruption, so do not even try recovering. It would have been more
    comprehensive to check VMA flags in pte_protnone_numa, but it would have
    made the API ugly just for a debugging check.

    Signed-off-by: Mel Gorman
    Cc: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Dave Jones
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Kirill Shutemov
    Cc: Linus Torvalds
    Cc: Paul Mackerras
    Cc: Rik van Riel
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Faults on the huge zero page are pointless and there is a BUG_ON to catch
    them during fault time. This patch reintroduces a check that avoids
    marking the zero page PAGE_NONE.
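
    The reintroduced check is essentially (a sketch of change_huge_pmd()):

        if (prot_numa && is_huge_zero_pmd(*pmd))
                goto unlock;    /* never protect the huge zero page */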

    Signed-off-by: Mel Gorman
    Cc: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Dave Jones
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Kirill Shutemov
    Cc: Linus Torvalds
    Cc: Paul Mackerras
    Cc: Rik van Riel
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • With PROT_NONE, the traditional page table manipulation functions are
    sufficient.

    [andre.przywara@arm.com: fix compiler warning in pmdp_invalidate()]
    [akpm@linux-foundation.org: fix build with STRICT_MM_TYPECHECKS]
    Signed-off-by: Mel Gorman
    Acked-by: Linus Torvalds
    Acked-by: Aneesh Kumar
    Tested-by: Sasha Levin
    Cc: Benjamin Herrenschmidt
    Cc: Dave Jones
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Kirill Shutemov
    Cc: Paul Mackerras
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Convert existing users of pte_numa and friends to the new helper. Note
    that the kernel is broken after this patch is applied until the other page
    table modifiers are also altered. This patch layout is to make review
    easier.

    Signed-off-by: Mel Gorman
    Acked-by: Linus Torvalds
    Acked-by: Aneesh Kumar
    Acked-by: Benjamin Herrenschmidt
    Tested-by: Sasha Levin
    Cc: Dave Jones
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Kirill Shutemov
    Cc: Paul Mackerras
    Cc: Rik van Riel
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

12 Feb, 2015

1 commit

  • Dave noticed that an unprivileged process can allocate a significant
    amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by the
    oom-killer and the memory cgroup. The trick is to allocate a lot of PMD
    page tables. The Linux kernel doesn't account PMD tables to the
    process, only PTE tables.

    The program below uses a few tricks to allocate a lot of PMD page
    tables while keeping VmRSS and VmPTE low. oom_score for the process
    will be 0.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/prctl.h>

    #define PUD_SIZE (1UL << 30)
    #define PMD_SIZE (1UL << 21)

    #define NR_PUD 130000

    int main(void)
    {
            char *addr = NULL;
            unsigned long i;

            /* avoid THP so the PMD tables are actually populated */
            prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);
            for (i = 0; i < NR_PUD; i++) {
                    addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
                                    MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
                    if (addr == MAP_FAILED) {
                            perror("mmap");
                            break;
                    }
                    /* touch one page to allocate a PMD (and PTE) table,
                     * then remap the 2 MiB piece to drop the PTE table
                     * while keeping the PMD table allocated */
                    *addr = 'x';
                    munmap(addr, PMD_SIZE);
                    addr = mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
                                    MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
                    if (addr == MAP_FAILED)
                            perror("re-mmap"), exit(1);
            }
            printf("PID %d consumed %lu KiB in PMD page tables\n",
                            getpid(), i * 4096 >> 10);
            return pause();
    }

    The patch addresses the issue by accounting PMD tables to the process
    the same way we account PTE tables.

    The main places where PMD tables are accounted are __pmd_alloc() and
    free_pmd_range(). But there are a few corner cases:

    - HugeTLB can share PMD page tables. The patch handles this by
    accounting the table to every process that shares it.

    - x86 PAE pre-allocates a few PMD tables on fork.

    - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust the
    sanity check on exit(2).

    Accounting only happens on configurations where the PMD page table
    level is present (PMD is not folded). As with nr_ptes, we use a per-mm
    counter. The counter value is used to calculate the baseline for the
    badness score by the oom-killer.
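
    The per-mm counter mirrors nr_ptes; a sketch of the bookkeeping:

        /* in struct mm_struct */
        atomic_long_t nr_pmds;

        /* from __pmd_alloc() and free_pmd_range() respectively */
        mm_inc_nr_pmds(mm);
        mm_dec_nr_pmds(mm);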

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dave Hansen
    Cc: Hugh Dickins
    Reviewed-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: David Rientjes
    Tested-by: Sedat Dilek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

11 Feb, 2015

7 commits

  • Merge misc updates from Andrew Morton:
    "Bite-sized chunks this time, to avoid the MTA ratelimiting woes.

    - fs/notify updates

    - ocfs2

    - some of MM"

    That laconic "some MM" is mainly the removal of remap_file_pages(),
    which is a big simplification of the VM, and which gets rid of a *lot*
    of random cruft and special cases because we no longer support the
    non-linear mappings that it used.

    From a user interface perspective, nothing has changed, because the
    remap_file_pages() syscall still exists; it's just implemented by
    emulating the old behavior, creating a lot of individual small mappings
    instead of one non-linear one.

    The emulation is slower than the old "native" non-linear mappings, but
    nobody really uses or cares about remap_file_pages(), and simplifying
    the VM is a big advantage.

    * emailed patches from Andrew Morton : (78 commits)
    memcg: zap memcg_slab_caches and memcg_slab_mutex
    memcg: zap memcg_name argument of memcg_create_kmem_cache
    memcg: zap __memcg_{charge,uncharge}_slab
    mm/page_alloc.c: place zone_id check before VM_BUG_ON_PAGE check
    mm: hugetlb: fix type of hugetlb_treat_as_movable variable
    mm, hugetlb: remove unnecessary lower bound on sysctl handlers"?
    mm: memory: merge shared-writable dirtying branches in do_wp_page()
    mm: memory: remove ->vm_file check on shared writable vmas
    xtensa: drop _PAGE_FILE and pte_file()-related helpers
    x86: drop _PAGE_FILE and pte_file()-related helpers
    unicore32: drop pte_file()-related helpers
    um: drop _PAGE_FILE and pte_file()-related helpers
    tile: drop pte_file()-related helpers
    sparc: drop pte_file()-related helpers
    sh: drop _PAGE_FILE and pte_file()-related helpers
    score: drop _PAGE_FILE and pte_file()-related helpers
    s390: drop pte_file()-related helpers
    parisc: drop _PAGE_FILE and pte_file()-related helpers
    openrisc: drop _PAGE_FILE and pte_file()-related helpers
    nios2: drop _PAGE_FILE and pte_file()-related helpers
    ...

    Linus Torvalds
     
  • Whether there is a vm_ops->page_mkwrite or not, the page dirtying is
    pretty much the same. Make sure the page references are the same in both
    cases, then merge the two branches.

    It's tempting to go even further and page-lock the !page_mkwrite case, to
    get it in line with everybody else setting the page table and thus further
    simplify the model. But that's not quite compelling enough to justify
    dropping the pte lock, then relocking and verifying the entry for
    filesystems without ->page_mkwrite, which notably includes tmpfs. Leave
    it for now and lock the page late in the !page_mkwrite case.

    Signed-off-by: Johannes Weiner
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Shared anonymous mmaps are implemented with shmem files, so all VMAs with
    shared writable semantics also have an underlying backing file.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Jan Kara
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • One bit in ->vm_flags is unused now!

    Signed-off-by: Kirill A. Shutemov
    Cc: Dan Carpenter
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We don't create non-linear mappings anymore. Let's drop code which
    handles them on page fault.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We have had remap_file_pages(2) emulation in the -mm tree for a few
    release cycles and we plan to have it mainline in v3.20. This patchset
    removes the rest of the VM_NONLINEAR infrastructure.

    Patches 1-8 take care of the generic code. They are pretty
    straightforward and can be applied without the other patches.

    The remaining patches remove pte_file()-related stuff from
    architecture-specific code. This usually frees up one bit in the
    non-present pte. Where I was able to figure out how, I've tried to
    reuse that bit for the swap offset.

    For obvious reasons I cannot test all that arch-specific code and would
    like to see acks from maintainers.

    In total, remap_file_pages(2) required about 1.4K lines of
    not-so-trivial kernel code. That's too much for functionality nobody
    uses.

    Tested-by: Felipe Balbi

    This patch (of 38):

    We don't create non-linear mappings anymore. Let's drop code which
    handles them on unmap/zap.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Pull xen features and fixes from David Vrabel:

    - Reworked handling for foreign (grant mapped) pages to simplify the
    code, enable a number of additional use cases and fix a number of
    long-standing bugs.

    - Prefer the TSC over the Xen PV clock when dom0 (and the TSC is
    stable).

    - Assorted other cleanup and minor bug fixes.

    * tag 'stable/for-linus-3.20-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (25 commits)
    xen/manage: Fix USB interaction issues when resuming
    xenbus: Add proper handling of XS_ERROR from Xenbus for transactions.
    xen/gntdev: provide find_special_page VMA operation
    xen/gntdev: mark userspace PTEs as special on x86 PV guests
    xen-blkback: safely unmap grants in case they are still in use
    xen/gntdev: safely unmap grants in case they are still in use
    xen/gntdev: convert priv->lock to a mutex
    xen/grant-table: add a mechanism to safely unmap pages that are in use
    xen-netback: use foreign page information from the pages themselves
    xen: mark grant mapped pages as foreign
    xen/grant-table: add helpers for allocating pages
    x86/xen: require ballooned pages for grant maps
    xen: remove scratch frames for ballooned pages and m2p override
    xen/grant-table: pre-populate kernel unmap ops for xen_gnttab_unmap_refs()
    mm: add 'foreign' alias for the 'pinned' page flag
    mm: provide a find_special_page vma operation
    x86/xen: cleanup arch/x86/xen/mmu.c
    x86/xen: add some __init annotations in arch/x86/xen/mmu.c
    x86/xen: add some __init and static annotations in arch/x86/xen/setup.c
    x86/xen: use correct types for addresses in arch/x86/xen/setup.c
    ...

    Linus Torvalds
     

30 Jan, 2015

1 commit

  • The stack guard page error case has long incorrectly caused a SIGBUS
    rather than a SIGSEGV, but nobody actually noticed until commit
    fee7e49d4514 ("mm: propagate error from stack expansion even for guard
    page") because that error case was never actually triggered in any
    normal situations.

    Now that we actually report the error, people noticed the wrong signal
    that resulted. So far, only the test suite of libsigsegv seems to have
    actually cared, but there are real applications that use libsigsegv, so
    let's not wait for any of those to break.
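
    The change itself is tiny; a sketch of the guard-page check on the
    fault path (VM_FAULT_SIGSEGV is introduced for this purpose):

        if (check_stack_guard_page(vma, address) < 0)
                return VM_FAULT_SIGSEGV;        /* was VM_FAULT_SIGBUS */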

    Reported-and-tested-by: Takashi Iwai
    Tested-by: Jan Engelhardt
    Acked-by: Heiko Carstens # "s390 still compiles and boots"
    Cc: linux-arch@vger.kernel.org
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

28 Jan, 2015

1 commit

  • The optional find_special_page VMA operation is used to look up the
    pages backing a VMA. This is useful in cases where the normal
    mechanisms for finding the page don't work. It is only called if
    the PTE is special.

    One use case is a Xen PV guest mapping foreign pages into userspace.

    In a Xen PV guest, the PTEs contain MFNs so get_user_pages() (for
    example) must do an MFN to PFN (M2P) lookup before it can get the
    page. For foreign pages (those owned by another guest) the M2P lookup
    returns the PFN as seen by the foreign guest (which would be
    completely the wrong page for the local guest).

    This cannot be fixed by improving the M2P lookup, since one MFN may be
    mapped onto two or more pages, so getting the right page is impossible
    given just the MFN.
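
    The hook itself is a single optional entry in vm_operations_struct
    (sketch; the gntdev handler name is illustrative of the Xen use case):

        struct page *(*find_special_page)(struct vm_area_struct *vma,
                                          unsigned long addr);

        static const struct vm_operations_struct gntdev_vmops = {
                /* ... */
                .find_special_page = gntdev_find_special_page,
        };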

    Signed-off-by: David Vrabel
    Acked-by: Andrew Morton

    David Vrabel
     

13 Jan, 2015

1 commit

  • When batching up address ranges for TLB invalidation, we check tlb->end
    != 0 to indicate that some pages have actually been unmapped.

    As of commit f045bbb9fa1b ("mmu_gather: fix over-eager
    tlb_flush_mmu_free() calling"), we use the same check for freeing these
    pages in order to avoid a performance regression where we call
    free_pages_and_swap_cache even when no pages are actually queued up.

    Unfortunately, the range could have been reset (tlb->end = 0) by
    tlb_end_vma, which has been shown to cause memory leaks on arm64.
    Furthermore, investigation into these leaks revealed that the fullmm
    case on task exit no longer invalidates the TLB, by virtue of tlb->end
    == 0 (in 3.18, need_flush would have been set).

    This patch resolves the problem by reverting commit f045bbb9fa1b, using
    instead tlb->local.nr as the predicate for page freeing in
    tlb_flush_mmu_free and ensuring that tlb->end is initialised to a
    non-zero value in the fullmm case.
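
    Schematically, invalidation and freeing now key off different state
    (a sketch of the shape described above, not the literal diff):

        void tlb_flush_mmu(struct mmu_gather *tlb)
        {
                tlb_flush_mmu_tlbonly(tlb); /* no-op unless tlb->end != 0 */
                tlb_flush_mmu_free(tlb);    /* no-op unless pages are
                                               batched (tlb->local.nr) */
        }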

    Tested-by: Mark Langsdorf
    Tested-by: Dave Hansen
    Signed-off-by: Will Deacon
    Signed-off-by: Linus Torvalds

    Will Deacon
     

09 Jan, 2015

1 commit

  • Tejun, while reviewing the code, spotted the following race condition
    between the dirtying and truncation of a page:

    __set_page_dirty_nobuffers()        __delete_from_page_cache()
      if (TestSetPageDirty(page))
                                          page->mapping = NULL
                                          if (PageDirty())
                                            dec_zone_page_state(page, NR_FILE_DIRTY);
                                            dec_bdi_stat(mapping->backing_dev_info,
                                                         BDI_RECLAIMABLE);
        if (page->mapping)
          account_page_dirtied(page)
            __inc_zone_page_state(page, NR_FILE_DIRTY);
            __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);

    which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.

    Dirtiers usually lock out truncation, either by holding the page lock
    directly, or in case of zap_pte_range(), by pinning the mapcount with
    the page table lock held. The notable exception to this rule, though,
    is do_wp_page(), for which this race exists. However, do_wp_page()
    already waits for a locked page to unlock before setting the dirty bit,
    in order to prevent a race where clear_page_dirty() misses the page bit
    in the presence of dirty ptes. Upgrade that wait to a fully locked
    set_page_dirty() to also cover the situation explained above.

    Afterwards, the code in set_page_dirty() dealing with a truncation race
    is no longer needed. Remove it.
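
    The upgrade in do_wp_page()'s reuse path is conceptually (a sketch;
    the surrounding bookkeeping is elided):

        /* was: wait_on_page_locked(dirty_page), then an unlocked
         * set_page_dirty() that could race with truncation */
        lock_page(dirty_page);
        set_page_dirty(dirty_page);
        unlock_page(dirty_page);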

    Reported-by: Tejun Heo
    Signed-off-by: Johannes Weiner
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Jan Kara
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

07 Jan, 2015

1 commit

  • Jay Foad reports that the address sanitizer test (asan) sometimes gets
    confused by a stack pointer that ends up being outside the stack vma
    that is reported by /proc/maps.

    This happens due to an interaction between RLIMIT_STACK and the guard
    page: when we do the guard page check, we ignore the potential error
    from the stack expansion, which effectively results in a missing guard
    page, since the expected stack expansion won't have been done.

    And since /proc/maps explicitly ignores the guard page (commit
    d7824370e263: "mm: fix up some user-visible effects of the stack guard
    page"), the stack pointer ends up being outside the reported stack area.

    This is the minimal patch: it just propagates the error. It also
    effectively makes the guard page part of the stack limit, which in turn
    means that the actual real stack is one page less than the stack limit.

    Let's see if anybody notices. We could teach acct_stack_growth() to
    allow an extra page for a grow-up/grow-down stack in the rlimit test,
    but I don't want to add more complexity if it isn't needed.
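
    The minimal fix is to stop ignoring the expansion result in the
    guard-page check (sketch; at this point the signal was still SIGBUS --
    the 30 Jan entry above later turns it into SIGSEGV):

        if (check_stack_guard_page(vma, address) < 0)
                return VM_FAULT_SIGBUS;         /* error now propagated */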

    Reported-and-tested-by: Jay Foad
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

23 Dec, 2014

1 commit

  • This reverts commit c8475d144abb1e62958cc5ec281d2a9e161c1946.

    There are several bug reports [1][2] which point to this commit as a
    potential cause [3].

    Let's revert it until we figure out what's going on.

    [1] https://lkml.org/lkml/2014/11/14/342
    [2] https://lkml.org/lkml/2014/12/22/213
    [3] https://lkml.org/lkml/2014/12/9/741

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Sasha Levin
    Acked-by: Davidlohr Bueso
    Cc: Hugh Dickins
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra (Intel)
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Cc: Mel Gorman
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

21 Dec, 2014

1 commit

  • Pull ACCESS_ONCE cleanup preparation from Christian Borntraeger:
    "kernel: Provide READ_ONCE and ASSIGN_ONCE

    As discussed on LKML http://marc.info/?i=54611D86.4040306%40de.ibm.com
    ACCESS_ONCE might fail with specific compilers for non-scalar
    accesses.

    Here is a set of patches to tackle that problem.

    The first patch introduces READ_ONCE and ASSIGN_ONCE. If the data
    structure is larger than the machine word size, memcpy is used and a
    warning is emitted. The next patches fix up several in-tree users of
    ACCESS_ONCE on non-scalar types.

    This does not yet contain a patch that forces ACCESS_ONCE to work only
    on scalar types. This is targeted for the next merge window, as
    linux-next already contains new offenders regarding ACCESS_ONCE vs.
    non-scalar types"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux:
    s390/kvm: REPLACE barrier fixup with READ_ONCE
    arm/spinlock: Replace ACCESS_ONCE with READ_ONCE
    arm64/spinlock: Replace ACCESS_ONCE READ_ONCE
    mips/gup: Replace ACCESS_ONCE with READ_ONCE
    x86/gup: Replace ACCESS_ONCE with READ_ONCE
    x86/spinlock: Replace ACCESS_ONCE with READ_ONCE
    mm: replace ACCESS_ONCE with READ_ONCE or barriers
    kernel: Provide READ_ONCE and ASSIGN_ONCE

    Linus Torvalds
     

19 Dec, 2014

1 commit

  • Belatedly document the changes in commit f0c6d4d295e4 ("mm: introduce
    do_shared_fault() and drop do_fault()").

    Cc: Andi Kleen
    Cc: Bob Liu
    Cc: Dave Hansen
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Naoya Horiguchi
    Cc: Rik van Riel
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

18 Dec, 2014

2 commits

  • ACCESS_ONCE does not work reliably on non-scalar types. For example,
    gcc 4.6 and 4.7 might remove the volatile tag for such accesses during
    the SRA (scalar replacement of aggregates) step
    (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145).

    Let's change the code to access the page table elements with
    READ_ONCE, which does implicit scalar accesses, for the gup code.

    mm_find_pmd is tricky, because m68k and sparc (32bit) define pmd_t
    as an array of longs. This code requires only that the pmd_present
    and pmd_trans_huge checks are done on the same value, so a barrier
    is sufficient.

    A similar case is in handle_pte_fault. On ppc44x the word size is
    32 bit, but a pte is 64 bit. A barrier is ok there as well.
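
    The two flavors of fix look roughly like this (illustrative):

        /* gup and friends: force a single scalar load */
        pmd_t pmd = READ_ONCE(*pmdp);

        /* mm_find_pmd / handle_pte_fault: a compiler barrier suffices,
         * as long as both checks use the same snapshot */
        pmde = *pmd;
        barrier();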

    Signed-off-by: Christian Borntraeger
    Cc: linux-mm@kvack.org
    Acked-by: Paul E. McKenney

    Christian Borntraeger
     
  • Dave Hansen reports that commit fb7332a9fedf ("mmu_gather: move minimal
    range calculations into generic code") caused a performance problem:

    "tlb_finish_mmu() goes up about 9x in the profiles (~0.4%->3.6%) and
    tlb_flush_mmu_free() takes about 3.1% of CPU time with the patch
    applied, but does not show up at all on the commit before"

    and the reason is that Will moved the test for whether we need to flush
    from tlb_flush_mmu() into tlb_flush_mmu_tlbonly(). But that meant that
    tlb_flush_mmu_free() basically lost that check.

    Move it back into tlb_flush_mmu() where it belongs, so that it covers
    both tlb_flush_mmu_tlbonly() _and_ tlb_flush_mmu_free().

    Reported-and-tested-by: Dave Hansen
    Acked-by: Will Deacon
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

16 Dec, 2014

1 commit

  • Pull drm updates from Dave Airlie:
    "Highlights:

    - AMD KFD driver merge

    This is the AMD HSA interface for exposing a lowlevel interface for
    GPGPU use. They have an open source userspace built on top of this
    interface, and the code looks as good as it was going to get out of
    tree.

    - Initial atomic modesetting work

    The need for an atomic modesetting interface to allow userspace to
    try and send a complete set of modesetting state to the driver has
    arisen, and been suffering from neglect this past year. No more,
    the start of the common code and changes for msm driver to use it
    are in this tree. Ongoing work to get the userspace ioctl finished
    and the code clean will probably wait until next kernel.

    - DisplayID 1.3 and tiled monitor exposed to userspace.

    Tiled monitor property is now exposed for userspace to make use of.

    - Rockchip drm driver merged.

    - imx gpu driver moved out of staging

    Other stuff:

    - core:
    panel - MIPI DSI + new panels.
    expose suggested x/y properties for virtual GPUs

    - i915:
    Initial Skylake (SKL) support
    gen3/4 reset work
    start of dri1/ums removal
    infoframe tracking
    fixes for lots of things.

    - nouveau:
    tegra k1 voltage support
    GM204 modesetting support
    GT21x memory reclocking work

    - radeon:
    CI dpm fixes
    GPUVM improvements
    Initial DPM fan control

    - rcar-du:
    HDMI support added
    removed some support for old boards
    slave encoder driver for Analog Devices adv7511

    - exynos:
    Exynos4415 SoC support

    - msm:
    a4xx gpu support
    atomic helper conversion

    - tegra:
    iommu support
    universal plane support
    ganged-mode DSI support

    - sti:
    HDMI i2c improvements

    - vmwgfx:
    some late fixes.

    - qxl:
    use suggested x/y properties"

    * 'drm-next' of git://people.freedesktop.org/~airlied/linux: (969 commits)
    drm: sti: fix module compilation issue
    drm/i915: save/restore GMBUS freq across suspend/resume on gen4
    drm: sti: correctly cleanup CRTC and planes
    drm: sti: add HQVDP plane
    drm: sti: add cursor plane
    drm: sti: enable auxiliary CRTC
    drm: sti: fix delay in VTG programming
    drm: sti: prepare sti_tvout to support auxiliary crtc
    drm: sti: use drm_crtc_vblank_{on/off} instead of drm_vblank_{on/off}
    drm: sti: fix hdmi avi infoframe
    drm: sti: remove event lock while disabling vblank
    drm: sti: simplify gdp code
    drm: sti: clear all mixer control
    drm: sti: remove gpio for HDMI hot plug detection
    drm: sti: allow to change hdmi ddc i2c adapter
    drm/doc: Document drm_add_modes_noedid() usage
    drm/i915: Remove '& 0xffff' from the mask given to WA_REG()
    drm/i915: Invert the mask and val arguments in wa_add() and WA_REG()
    drm: Zero out DRM object memory upon cleanup
    drm/i915/bdw: Fix the write setting up the WIZ hashing mode
    ...

    Linus Torvalds
     

14 Dec, 2014

2 commits

  • This lets drivers like the AMD IOMMUv2 driver handle faults a bit more
    simply, rather than doing tricks with page refs and get_user_pages().
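
    Judging by the summary, this boils down to exporting existing entry
    points so modules can call them directly (a sketch; the exported symbol
    is assumed from the description):

        EXPORT_SYMBOL_GPL(handle_mm_fault);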

    Signed-off-by: Jesse Barnes
    Cc: Oded Gabbay
    Cc: Joerg Roedel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesse Barnes
     
  • The unmap_mapping_range family of functions do the unmapping of user
    pages (ultimately via zap_page_range_single) without touching the
    actual interval tree, and can thus take the i_mmap lock in shared mode.
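
    In practice this downgrades the rwsem from exclusive to shared in those
    paths (sketch). Note that the 23 Dec entry above reverts this commit
    (c8475d144abb) after bug reports:

        i_mmap_lock_read(mapping);      /* was: i_mmap_lock_write() */
        /* ... unmap ... */
        i_mmap_unlock_read(mapping);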

    Signed-off-by: Davidlohr Bueso
    Cc: "Kirill A. Shutemov"
    Acked-by: Hugh Dickins
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra (Intel)
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso