17 Oct, 2015

1 commit

  • The following two locking commits in the DAX code:

    commit 843172978bb9 ("dax: fix race between simultaneous faults")
    commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for DAX")

    introduced a number of deadlocks and other issues which need to be fixed
    for the v4.3 kernel. The list of issues in DAX after these commits
    (some newly introduced by the commits, some preexisting) can be found
    here:

    https://lkml.org/lkml/2015/9/25/602 (Subject: "Re: [PATCH] dax: fix deadlock in __dax_fault").

    This undoes most of the changes introduced by those two commits,
    essentially returning us to the DAX locking scheme that was used in
    v4.2.

    Signed-off-by: Ross Zwisler
    Cc: Alexander Viro
    Cc: Dan Williams
    Tested-by: Dave Chinner
    Cc: Jan Kara
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

09 Sep, 2015

5 commits

  • __dax_fault() takes i_mmap_lock for write. Let's pair it with write
    unlock on do_cow_fault() side.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • DAX is not so special: we need i_mmap_lock to protect mapping->i_mmap.

    __dax_pmd_fault() uses unmap_mapping_range() to shoot the zero page out
    of all mappings. We need to drop i_mmap_lock there to avoid a deadlock.

    Re-acquiring the lock should be fine since we check i_size after that
    point.
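
    A minimal sketch of that pattern, assuming the write flavour of the
    i_mmap lock (illustrative only, not the literal diff; the names follow
    the __dax_pmd_fault() of that era):

    i_mmap_unlock_write(mapping);
    unmap_mapping_range(mapping, pgoff << PAGE_SHIFT, PMD_SIZE, 0);
    i_mmap_lock_write(mapping);

    /* Re-acquiring is safe because i_size is rechecked afterwards, so a
     * truncate that raced with the unlocked window is still caught. */
    size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
    if (pgoff >= size)
            return VM_FAULT_SIGBUS;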

    Signed-off-by: Kirill A. Shutemov
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • If two threads write-fault on the same hole at the same time, the winner
    of the race will return to userspace and complete their store, only to
    have the loser overwrite their store with zeroes. Fix this for now by
    taking the i_mmap_sem for write instead of read, and doing so outside
    the call to get_block(). Now the loser of the race will see that the
    block has already been zeroed, and will not zero it again.

    This severely limits our scalability. I have ideas for improving it, but
    those can wait for a later patch.

    Signed-off-by: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Allow non-anonymous VMAs to provide huge pages in response to a page fault.

    Signed-off-by: Matthew Wilcox
    Cc: Hillf Danton
    Cc: "Kirill A. Shutemov"
    Cc: Theodore Ts'o
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • special_mapping_fault() is absolutely broken. It seems it was always
    wrong, but this didn't matter until vdso/vvar started to use more than
    one page.

    And after this change vma_is_anonymous() becomes really trivial: it
    simply checks vm_ops == NULL. However, I do think the helper makes
    sense. There are a lot of ->vm_ops != NULL checks; the helper makes the
    caller's code more understandable (self-documenting) and more
    grep-friendly.

    This patch (of 3):

    Preparation. Add the new simple helper, vma_is_anonymous(vma), and change
    handle_pte_fault() to use it. It will have more users.

    The name is not strictly accurate; for example, an hpet_mmap()'ed vma is
    not anonymous. Perhaps it should be named vma_has_fault() instead, but
    it matches the logic in mmap.c/memory.c (see the next changes). "True"
    just means that a page fault will use do_anonymous_page().
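
    The helper itself is tiny; a sketch consistent with the description
    above (vm_ops == NULL is what "anonymous" means here):

    /* mm.h sketch: a VMA with no vm_ops takes the anonymous fault path,
     * i.e. do_anonymous_page(). */
    static inline bool vma_is_anonymous(struct vm_area_struct *vma)
    {
            return !vma->vm_ops;
    }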

    Signed-off-by: Oleg Nesterov
    Acked-by: Kirill A. Shutemov
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

05 Sep, 2015

2 commits

    This makes tlb_next_batch() return bool, since this particular function
    only ever returns one or zero.

    Signed-off-by: Nicholas Krause
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicholas Krause
     
  • This is where the page faults must be modified to call
    handle_userfault() if userfaultfd_missing() is true (so if the
    vma->vm_flags had VM_UFFD_MISSING set).

    handle_userfault() then takes care of blocking the page fault and
    delivering it to userland.

    The fault flags must also be passed as parameter so the "read|write"
    kind of fault can be passed to userland.
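
    A sketch of the kind of hook this adds to the anonymous fault path
    (simplified; the real call site sits in do_anonymous_page() and must
    drop the PTE lock first):

    if (userfaultfd_missing(vma)) {
            pte_unmap_unlock(page_table, ptl);
            /* Block the fault and hand it to the monitor; "reason" says
             * this is a missing page, read/write arrives via the fault
             * flags. */
            return handle_userfault(vma, address, flags, VM_UFFD_MISSING);
    }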

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

10 Jul, 2015

1 commit

    Reading the page fault handler code I've noticed that under the right
    circumstances the kernel would map anonymous pages into file mappings:
    if the VMA doesn't have vm_ops->fault() and the VMA wasn't fully
    populated on ->mmap(), the kernel would handle a page fault on a
    not-populated pte with do_anonymous_page().

    Let's change the page fault handler to use do_anonymous_page() only on
    an anonymous VMA (->vm_ops == NULL) and make sure that the VMA is not
    shared.

    For file mappings without vm_ops->fault(), or a shared VMA without
    vm_ops, a page fault on a pte_none() entry would lead to SIGBUS.
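
    A sketch of the resulting checks (simplified; names as in mm/memory.c
    of that era). The fault path only takes the anonymous route for VMAs
    without vm_ops:

    /* handle_pte_fault(), pte_none() case (sketch): */
    if (vma->vm_ops)
            return do_fault(mm, vma, address, pte, pmd, flags, entry);
    return do_anonymous_page(mm, vma, address, pte, pmd, flags);

    and do_anonymous_page() itself refuses a shared mapping instead of
    silently handing out anonymous pages (sketch):

    /* File mapping without ->vm_ops? */
    if (vma->vm_flags & VM_SHARED)
            return VM_FAULT_SIGBUS;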

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Oleg Nesterov
    Cc: Andrew Morton
    Cc: Willy Tarreau
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

05 Jul, 2015

1 commit

  • Pull more vfs updates from Al Viro:
    "Assorted VFS fixes and related cleanups (IMO the most interesting in
    that part are f_path-related things and Eric's descriptor-related
    stuff). UFS regression fixes (it got broken last cycle). 9P fixes.
    fs-cache series, DAX patches, Jan's file_remove_suid() work"

    [ I'd say this is much more than "fixes and related cleanups". The
    file_table locking rule change by Eric Dumazet is a rather big and
    fundamental update even if the patch isn't huge. - Linus ]

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
    9p: cope with bogus responses from server in p9_client_{read,write}
    p9_client_write(): avoid double p9_free_req()
    9p: forgetting to cancel request on interrupted zero-copy RPC
    dax: bdev_direct_access() may sleep
    block: Add support for DAX reads/writes to block devices
    dax: Use copy_from_iter_nocache
    dax: Add block size note to documentation
    fs/file.c: __fget() and dup2() atomicity rules
    fs/file.c: don't acquire files->file_lock in fd_install()
    fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
    vfs: avoid creation of inode number 0 in get_next_ino
    namei: make set_root_rcu() return void
    make simple_positive() public
    ufs: use dir_pages instead of ufs_dir_pages()
    pagemap.h: move dir_pages() over there
    remove the pointless include of lglock.h
    fs: cleanup slight list_entry abuse
    xfs: Correctly lock inode when removing suid and file capabilities
    fs: Call security_ops->inode_killpriv on truncate
    fs: Provide function telling whether file_remove_privs() will do anything
    ...

    Linus Torvalds
     

25 Jun, 2015

1 commit

  • Historically memcg overhead was high even if memcg was unused. This has
    improved a lot but it still showed up in a profile summary as being a
    problem.

    /usr/src/linux-4.0-vanilla/mm/memcontrol.c 6.6441 395842
    mem_cgroup_try_charge 2.950% 175781
    __mem_cgroup_count_vm_event 1.431% 85239
    mem_cgroup_page_lruvec 0.456% 27156
    mem_cgroup_commit_charge 0.392% 23342
    uncharge_list 0.323% 19256
    mem_cgroup_update_lru_size 0.278% 16538
    memcg_check_events 0.216% 12858
    mem_cgroup_charge_statistics.isra.22 0.188% 11172
    try_charge 0.150% 8928
    commit_charge 0.141% 8388
    get_mem_cgroup_from_mm 0.121% 7184

    That is showing that 6.64% of system CPU cycles were in memcontrol.c and
    dominated by mem_cgroup_try_charge. The annotation shows that the bulk
    of the cost was checking PageSwapCache which is expected to be cache hot
    but is very expensive. The problem appears to be that __SetPageUptodate
    is called just before the check which is a write barrier. It is
    required to make sure struct page and page data is written before the
    PTE is updated and the data visible to userspace. memcg charging does
    not require or need the barrier but gets unfairly hit with the cost so
    this patch attempts the charging before the barrier. Aside from the
    accidental cost to memcg there is the added benefit that the barrier is
    avoided if the page cannot be charged. When applied the relevant
    profile summary is as follows.

    /usr/src/linux-4.0-chargefirst-v2r1/mm/memcontrol.c 3.7907 223277
    __mem_cgroup_count_vm_event 1.143% 67312
    mem_cgroup_page_lruvec 0.465% 27403
    mem_cgroup_commit_charge 0.381% 22452
    uncharge_list 0.332% 19543
    mem_cgroup_update_lru_size 0.284% 16704
    get_mem_cgroup_from_mm 0.271% 15952
    mem_cgroup_try_charge 0.237% 13982
    memcg_check_events 0.222% 13058
    mem_cgroup_charge_statistics.isra.22 0.185% 10920
    commit_charge 0.140% 8235
    try_charge 0.131% 7716

    That brings the overhead down to 3.79% and leaves the memcg fault
    accounting to the root cgroup but it's an improvement. The difference
    in headline performance of the page fault microbench is marginal as
    memcg is such a small component of it.

    pft faults
    4.0.0 4.0.0
    vanilla chargefirst
    Hmean faults/cpu-1 1443258.1051 ( 0.00%) 1509075.7561 ( 4.56%)
    Hmean faults/cpu-3 1340385.9270 ( 0.00%) 1339160.7113 ( -0.09%)
    Hmean faults/cpu-5 875599.0222 ( 0.00%) 874174.1255 ( -0.16%)
    Hmean faults/cpu-7 601146.6726 ( 0.00%) 601370.9977 ( 0.04%)
    Hmean faults/cpu-8 510728.2754 ( 0.00%) 510598.8214 ( -0.03%)
    Hmean faults/sec-1 1432084.7845 ( 0.00%) 1497935.5274 ( 4.60%)
    Hmean faults/sec-3 3943818.1437 ( 0.00%) 3941920.1520 ( -0.05%)
    Hmean faults/sec-5 3877573.5867 ( 0.00%) 3869385.7553 ( -0.21%)
    Hmean faults/sec-7 3991832.0418 ( 0.00%) 3992181.4189 ( 0.01%)
    Hmean faults/sec-8 3987189.8167 ( 0.00%) 3986452.2204 ( -0.02%)

    It's only visible at single threaded. The overhead is there for higher
    threads but other factors dominate.
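
    A sketch of the reordering in the anonymous fault path, assuming the
    do_anonymous_page() of that era (illustrative, not the full function):

    page = alloc_zeroed_user_highpage_movable(vma, address);
    if (!page)
            goto oom;

    /* Charge before __SetPageUptodate(): the charge path no longer sits
     * behind the write barrier, and the barrier is skipped entirely when
     * the charge fails. */
    if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg))
            goto oom_free_page;

    /* The barrier ensures the page contents are visible before the PTE
     * is set and the page becomes visible to userspace. */
    __SetPageUptodate(page);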

    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

19 May, 2015

1 commit

  • Commit 662bbcb2747c ("mm, sched: Allow uaccess in atomic with
    pagefault_disable()") removed might_sleep() checks for all user access
    code (that uses might_fault()).

    The reason was to disable wrong "sleep in atomic" warnings in the
    following scenario:

    pagefault_disable()
    rc = copy_to_user(...)
    pagefault_enable()

    Which is valid, as pagefault_disable() increments the preempt counter
    and therefore disables the pagefault handler. copy_to_user() will not
    sleep and return an error code if a page is not available.

    However, as all might_sleep() checks are removed,
    CONFIG_DEBUG_ATOMIC_SLEEP would no longer detect the following scenario:

    spin_lock(&lock);
    rc = copy_to_user(...)
    spin_unlock(&lock)

    If the kernel is compiled with preemption turned on, preempt_disable()
    will make in_atomic() detect disabled preemption. The fault handler would
    correctly never sleep on user access.
    However, with preemption turned off, preempt_disable() is usually a NOP
    (with !CONFIG_PREEMPT_COUNT), therefore in_atomic() will not be able to
    detect disabled preemption nor disabled pagefaults. The fault handler
    could sleep.
    We really want to enable CONFIG_DEBUG_ATOMIC_SLEEP checks for user access
    functions again, otherwise we can end up with horrible deadlocks.

    Root of all evil is that pagefault_disable() acts almost as
    preempt_disable(), depending on preemption being turned on/off.

    As we now have pagefault_disabled(), we can use it to distinguish
    whether user access functions might sleep.

    Convert might_fault() into a macro that calls __might_fault(), to
    allow proper file + line messages in case of a might_sleep() warning.
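
    The resulting shape is roughly the following, under the configs that
    enable the check (sketch):

    /* Report the caller's file:line in the might_sleep() warning. */
    void __might_fault(const char *file, int line);
    #define might_fault() __might_fault(__FILE__, __LINE__)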

    Reviewed-and-tested-by: Thomas Gleixner
    Signed-off-by: David Hildenbrand
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: David.Laight@ACULAB.COM
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: airlied@linux.ie
    Cc: akpm@linux-foundation.org
    Cc: benh@kernel.crashing.org
    Cc: bigeasy@linutronix.de
    Cc: borntraeger@de.ibm.com
    Cc: daniel.vetter@intel.com
    Cc: heiko.carstens@de.ibm.com
    Cc: herbert@gondor.apana.org.au
    Cc: hocko@suse.cz
    Cc: hughd@google.com
    Cc: mst@redhat.com
    Cc: paulus@samba.org
    Cc: ralf@linux-mips.org
    Cc: schwidefsky@de.ibm.com
    Cc: yang.shi@windriver.com
    Link: http://lkml.kernel.org/r/1431359540-32227-3-git-send-email-dahi@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    David Hildenbrand
     

16 Apr, 2015

3 commits

    This will allow filesystems that use VM_PFNMAP | VM_MIXEDMAP (no page
    structs) to get notified when an access is a write to a read-only PFN.

    This can happen if we mmap() a file, then first mmap-read from it to
    page-in a read-only PFN, then mmap-write to the same page.

    We need this functionality to fix a DAX bug, where in the scenario above
    we fail to set ctime/mtime though we modified the file. An xfstest is
    attached to this patchset that shows the failure and the fix. (A DAX
    patch will follow)

    This functionality is extra important for us, because upon dirtying of a
    pmem page we also want to RDMA the page to a remote cluster node.

    We define a new pfn_mkwrite and do not reuse page_mkwrite because:
    1 - The name ;-)
    2 - But mainly because it would take a very long and tedious audit of
    all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP users to make sure
    they do not now crash. For example, the current DAX code (which this is
    for) would crash. If we wanted to reuse page_mkwrite, we would need to
    first patch all users so they do not crash on a NULL page, and only
    then enable this patch. But even if I did that I would not sleep so
    well at night. Adding a new vector is the safest thing to do, and it is
    not that expensive: an extra pointer in a static function vector per
    driver. The new vector is also better for performance, because
    otherwise we would call every current page_mkwrite implementation just
    so it can find there is no page, do nothing and return.

    No need to call it from do_shared_fault because do_wp_page is called to
    change pte permissions anyway.
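
    A sketch of the new vector (the signature matches the
    vm_operations_struct of that era):

    /* Notification that a previously read-only PFN is about to become
     * writable; the FS can update ctime/mtime here. */
    int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);

    and of how the write-protect fault path consults it for shared
    VM_PFNMAP/VM_MIXEDMAP VMAs (simplified, illustrative):

    if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
            pte_unmap_unlock(page_table, ptl);
            ret = vma->vm_ops->pfn_mkwrite(vma, &vmf);
            if (ret & VM_FAULT_ERROR)
                    return ret;
            /* retake the PTE lock and revalidate before making the PTE
             * writable */
    }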

    Signed-off-by: Yigal Korman
    Signed-off-by: Boaz Harrosh
    Acked-by: Kirill A. Shutemov
    Cc: Matthew Wilcox
    Cc: Jan Kara
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Boaz Harrosh
     
    A lot of filesystems use generic_file_mmap() and filemap_fault(), so
    f_op->mmap and vm_ops->fault aren't enough to identify the filesystem.

    This prints file name, vm_ops->fault, f_op->mmap and a_ops->readpage
    (which is almost always implemented and filesystem-specific).

    Example:

    [ 23.676410] BUG: Bad page map in process sh pte:1b7e6025 pmd:19bbd067
    [ 23.676887] page:ffffea00006df980 count:4 mapcount:1 mapping:ffff8800196426c0 index:0x97
    [ 23.677481] flags: 0x10000000000000c(referenced|uptodate)
    [ 23.677896] page dumped because: bad pte
    [ 23.678205] addr:00007f52fcb17000 vm_flags:00000075 anon_vma: (null) mapping:ffff8800196426c0 index:97
    [ 23.678922] file:libc-2.19.so fault:filemap_fault mmap:generic_file_readonly_mmap readpage:v9fs_vfs_readpage

    [akpm@linux-foundation.org: use pr_alert, per Kirill]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Sasha Levin
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
    We converted some of the usages of ACCESS_ONCE to READ_ONCE in the mm/
    tree since ACCESS_ONCE doesn't work reliably on non-scalar types.

    This patch removes the rest of the usages of ACCESS_ONCE and uses the
    new READ_ONCE API for the read accesses. This makes things cleaner,
    instead of using multiple separate sets of APIs.
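
    An illustrative example of the kind of one-line conversion this makes
    (a typical pattern, not necessarily one of the exact changed lines):

    /* before: breaks when pmd_t is a structure (non-scalar) */
    pmd_t pmdval = ACCESS_ONCE(*pmd);

    /* after: READ_ONCE copes with non-scalar types */
    pmd_t pmdval = READ_ONCE(*pmd);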

    Signed-off-by: Jason Low
    Acked-by: Michal Hocko
    Acked-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Reviewed-by: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Low
     

15 Apr, 2015

4 commits

  • The do_wp_page function is extremely long. Extract the logic for
    handling a page belonging to a shared vma into a function of its own.

    This helps the readability of the code without making any functional
    change.

    Signed-off-by: Shachar Raindel
    Acked-by: Linus Torvalds
    Acked-by: Kirill A. Shutemov
    Acked-by: Rik van Riel
    Acked-by: Andi Kleen
    Acked-by: Haggai Eran
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Peter Feiner
    Cc: Michel Lespinasse
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shachar Raindel
     
    In some cases, do_wp_page had to copy the page suffering a write fault
    to a new location. If the function logic decided to do this, it was
    done by jumping with a "goto" operation to the relevant code block.
    This made the code really hard to understand. It is also against the
    kernel coding style guidelines.

    This patch extracts the page copy and page table update logic to a
    separate function. It also cleans up the naming, from "gotten" to
    "wp_page_copy", and adds a few comments.

    Signed-off-by: Shachar Raindel
    Acked-by: Linus Torvalds
    Acked-by: Kirill A. Shutemov
    Acked-by: Rik van Riel
    Acked-by: Andi Kleen
    Acked-by: Haggai Eran
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Peter Feiner
    Cc: Michel Lespinasse
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shachar Raindel
     
  • When do_wp_page is ending, in several cases it needs to unlock the pages
    and ptls it was accessing.

    Currently, this logic was "called" by using a goto jump. This makes
    following the control flow of the function harder. Readability was
    further hampered by the unlock case containing a large amount of logic
    needed in only one of the 3 cases.

    Using goto for cleanup is generally allowed. However, moving the
    trivial unlocking flows to the relevant call sites allows deeper
    refactoring in the next patch.

    Signed-off-by: Shachar Raindel
    Acked-by: Linus Torvalds
    Acked-by: Kirill A. Shutemov
    Acked-by: Rik van Riel
    Acked-by: Andi Kleen
    Acked-by: Haggai Eran
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Peter Feiner
    Cc: Michel Lespinasse
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shachar Raindel
     
  • Currently do_wp_page contains 265 code lines. It also contains 9 goto
    statements, of which 5 are targeting labels which are not cleanup
    related. This makes the function extremely difficult to understand.

    The following patches are an attempt at breaking the function into its
    basic components, and making it easier to understand.

    The patches are straightforward function extractions from do_wp_page.
    As we extract functions, we remove unneeded parameters and simplify the
    code as much as possible. However, the functionality is supposed to
    remain completely unchanged. The patches also attempt to document the
    functionality of each extracted function. In patch 2, we split the
    unlock logic so that each use case contains only the logic relevant to
    its specific needs, instead of having a huge number of conditional
    decisions in a single unlock flow.

    This patch (of 4):

    When do_wp_page is ending, in several cases it needs to reuse the existing
    page. This is achieved by making the page table writable, and possibly
    updating the page-cache state.

    Currently, this logic was "called" by using a goto jump. This makes
    following the control flow of the function harder. It is also against the
    coding style guidelines for using goto.

    As the code can easily be refactored into a specialized function, refactor
    it out and simplify the code flow in do_wp_page.
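
    A sketch of the kind of helper this extracts (parameter list trimmed
    for illustration; the real wp_page_reuse() also handles the
    page_mkwrite/dirty-shared bookkeeping):

    static inline int wp_page_reuse(struct mm_struct *mm,
                    struct vm_area_struct *vma, unsigned long address,
                    pte_t *page_table, spinlock_t *ptl, pte_t orig_pte)
    {
            pte_t entry;

            /* Make the existing page writable, mark it accessed/dirty. */
            flush_cache_page(vma, address, pte_pfn(orig_pte));
            entry = pte_mkyoung(orig_pte);
            entry = maybe_mkwrite(pte_mkdirty(entry), vma);
            if (ptep_set_access_flags(vma, address, page_table, entry, 1))
                    update_mmu_cache(vma, address, page_table);
            pte_unmap_unlock(page_table, ptl);

            return VM_FAULT_WRITE;
    }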

    Acked-by: Linus Torvalds
    Acked-by: Kirill A. Shutemov
    Acked-by: Rik van Riel
    Acked-by: Andi Kleen
    Acked-by: Haggai Eran
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Peter Feiner
    Cc: Michel Lespinasse
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shachar Raindel
     

26 Mar, 2015

3 commits

  • Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226

    Across the board the 4.0-rc1 numbers are much slower, and the degradation
    is far worse when using the large memory footprint configs. Perf points
    straight at the cause - this is from 4.0-rc1 on the "-o bhash=101073" config:

    - 56.07% 56.07% [kernel] [k] default_send_IPI_mask_sequence_phys
    - default_send_IPI_mask_sequence_phys
    - 99.99% physflat_send_IPI_mask
    - 99.37% native_send_call_func_ipi
    smp_call_function_many
    - native_flush_tlb_others
    - 99.85% flush_tlb_page
    ptep_clear_flush
    try_to_unmap_one
    rmap_walk
    try_to_unmap
    migrate_pages
    migrate_misplaced_page
    - handle_mm_fault
    - 99.73% __do_page_fault
    trace_do_page_fault
    do_async_page_fault
    + async_page_fault
    0.63% native_send_call_func_single_ipi
    generic_exec_single
    smp_call_function_single

    This is showing excessive migration activity even though excessive
    migrations are meant to get throttled. Normally, the scan rate is tuned
    on a per-task basis depending on the locality of faults. However, if
    migrations fail for any reason then the PTE scanner may scan faster if
    the faults continue to be remote. This means there is higher system CPU
    overhead and fault trapping at exactly the time we know that migrations
    cannot happen. This patch tracks when migration failures occur and
    slows the PTE scanner.

    Signed-off-by: Mel Gorman
    Reported-by: Dave Chinner
    Tested-by: Dave Chinner
    Cc: Ingo Molnar
    Cc: Aneesh Kumar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Protecting a PTE to trap a NUMA hinting fault clears the writable bit
    and further faults are needed after trapping a NUMA hinting fault to set
    the writable bit again. This patch preserves the writable bit when
    trapping NUMA hinting faults. The impact is obvious from the number of
    minor faults trapped during the basis balancing benchmark and the system
    CPU usage;

    autonumabench
    4.0.0-rc4 4.0.0-rc4
    baseline preserve
    Time System-NUMA01 107.13 ( 0.00%) 103.13 ( 3.73%)
    Time System-NUMA01_THEADLOCAL 131.87 ( 0.00%) 83.30 ( 36.83%)
    Time System-NUMA02 8.95 ( 0.00%) 10.72 (-19.78%)
    Time System-NUMA02_SMT 4.57 ( 0.00%) 3.99 ( 12.69%)
    Time Elapsed-NUMA01 515.78 ( 0.00%) 517.26 ( -0.29%)
    Time Elapsed-NUMA01_THEADLOCAL 384.10 ( 0.00%) 384.31 ( -0.05%)
    Time Elapsed-NUMA02 48.86 ( 0.00%) 48.78 ( 0.16%)
    Time Elapsed-NUMA02_SMT 47.98 ( 0.00%) 48.12 ( -0.29%)

    4.0.0-rc4 4.0.0-rc4
    baseline preserve
    User 44383.95 43971.89
    System 252.61 201.24
    Elapsed 998.68 1000.94

    Minor Faults 2597249 1981230
    Major Faults 365 364

    There is a similar drop in system CPU usage using Dave Chinner's xfsrepair
    workload

    4.0.0-rc4 4.0.0-rc4
    baseline preserve
    Amean real-xfsrepair 454.14 ( 0.00%) 442.36 ( 2.60%)
    Amean syst-xfsrepair 277.20 ( 0.00%) 204.68 ( 26.16%)

    The patch looks hacky but the alternatives looked worse. The tidiest
    was to rewalk the page tables after a hinting fault, but it was more
    complex than this approach and the performance was worse. It's not generally
    safe to just mark the page writable during the fault if it's a write
    fault as it may have been read-only for COW so that approach was
    discarded.
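
    A sketch of the idea, assuming the change_pte_range()/do_numa_page()
    pair of that era (simplified). First, remember the writability when
    the PTE is made PROT_NONE for NUMA hinting:

    bool preserve_write = prot_numa && pte_write(oldpte);

    ptent = ptep_modify_prot_start(mm, addr, pte);
    ptent = pte_modify(ptent, newprot);
    if (preserve_write)
            ptent = pte_mkwrite(ptent);

    and correspondingly restore it when the hinting fault is handled, so no
    second write fault is needed (sketch of do_numa_page()):

    bool was_writable = pte_write(pte);

    /* ... NUMA hinting handled, make the PTE present again ... */
    pte = pte_modify(pte, vma->vm_page_prot);
    pte = pte_mkyoung(pte);
    if (was_writable)
            pte = pte_mkwrite(pte);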

    Signed-off-by: Mel Gorman
    Reported-by: Dave Chinner
    Tested-by: Dave Chinner
    Cc: Ingo Molnar
    Cc: Aneesh Kumar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • These are three follow-on patches based on the xfsrepair workload Dave
    Chinner reported was problematic in 4.0-rc1 due to changes in page table
    management -- https://lkml.org/lkml/2015/3/1/226.

    Much of the problem was reduced by commit 53da3bc2ba9e ("mm: fix up numa
    read-only thread grouping logic") and commit ba68bc0115eb ("mm: thp:
    Return the correct value for change_huge_pmd"). It was known that the
    performance in 3.19 was still better, even though it is far less safe. This
    series aims to restore the performance without compromising on safety.

    For the tests in this mail, I'm comparing 3.19 against 4.0-rc4 and the
    three patches applied on top

    autonumabench
    3.19.0 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4
    vanilla vanilla vmwrite-v5r8 preserve-v5r8 slowscan-v5r8
    Time System-NUMA01 124.00 ( 0.00%) 161.86 (-30.53%) 107.13 ( 13.60%) 103.13 ( 16.83%) 145.01 (-16.94%)
    Time System-NUMA01_THEADLOCAL 115.54 ( 0.00%) 107.64 ( 6.84%) 131.87 (-14.13%) 83.30 ( 27.90%) 92.35 ( 20.07%)
    Time System-NUMA02 9.35 ( 0.00%) 10.44 (-11.66%) 8.95 ( 4.28%) 10.72 (-14.65%) 8.16 ( 12.73%)
    Time System-NUMA02_SMT 3.87 ( 0.00%) 4.63 (-19.64%) 4.57 (-18.09%) 3.99 ( -3.10%) 3.36 ( 13.18%)
    Time Elapsed-NUMA01 570.06 ( 0.00%) 567.82 ( 0.39%) 515.78 ( 9.52%) 517.26 ( 9.26%) 543.80 ( 4.61%)
    Time Elapsed-NUMA01_THEADLOCAL 393.69 ( 0.00%) 384.83 ( 2.25%) 384.10 ( 2.44%) 384.31 ( 2.38%) 380.73 ( 3.29%)
    Time Elapsed-NUMA02 49.09 ( 0.00%) 49.33 ( -0.49%) 48.86 ( 0.47%) 48.78 ( 0.63%) 50.94 ( -3.77%)
    Time Elapsed-NUMA02_SMT 47.51 ( 0.00%) 47.15 ( 0.76%) 47.98 ( -0.99%) 48.12 ( -1.28%) 49.56 ( -4.31%)

    3.19.0 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4
    vanilla vanilla vmwrite-v5r8 preserve-v5r8 slowscan-v5r8
    User 46334.60 46391.94 44383.95 43971.89 44372.12
    System 252.84 284.66 252.61 201.24 249.00
    Elapsed 1062.14 1050.96 998.68 1000.94 1026.78

    Overall the system CPU usage is comparable and the test is naturally a
    bit variable. The slowing of the scanner hurts numa01 but on this
    machine it is an adverse workload and patches that dramatically help it
    often hurt absolutely everything else.

    Due to patch 2, the fault activity is interesting

    3.19.0 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4
    vanilla vanilla vmwrite-v5r8 preserve-v5r8 slowscan-v5r8
    Minor Faults 2097811 2656646 2597249 1981230 1636841
    Major Faults 362 450 365 364 365

    Note how preserving the write bit across protection updates and faults
    reduces the number of minor faults.

    NUMA alloc hit 1229008 1217015 1191660 1178322 1199681
    NUMA alloc miss 0 0 0 0 0
    NUMA interleave hit 0 0 0 0 0
    NUMA alloc local 1228514 1216317 1190871 1177448 1199021
    NUMA base PTE updates 245706197 240041607 238195516 244704842 115012800
    NUMA huge PMD updates 479530 468448 464868 477573 224487
    NUMA page range updates 491225557 479886983 476207932 489222218 229950144
    NUMA hint faults 659753 656503 641678 656926 294842
    NUMA hint local faults 381604 373963 360478 337585 186249
    NUMA hint local percent 57 56 56 51 63
    NUMA pages migrated 5412140 6374899 6266530 5277468 5755096
    AutoNUMA cost 5121% 5083% 4994% 5097% 2388%

    Here the impact of slowing the PTE scanner on migration failures is
    obvious, as "NUMA base PTE updates" and "NUMA huge PMD updates" are
    massively reduced even though the headline performance is very similar.

    As xfsrepair was the reported workload, here is the impact of the
    series on it.

    xfsrepair
    3.19.0 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4
    vanilla vanilla vmwrite-v5r8 preserve-v5r8 slowscan-v5r8
    Min real-fsmark 1183.29 ( 0.00%) 1165.73 ( 1.48%) 1152.78 ( 2.58%) 1153.64 ( 2.51%) 1177.62 ( 0.48%)
    Min syst-fsmark 4107.85 ( 0.00%) 4027.75 ( 1.95%) 3986.74 ( 2.95%) 3979.16 ( 3.13%) 4048.76 ( 1.44%)
    Min real-xfsrepair 441.51 ( 0.00%) 463.96 ( -5.08%) 449.50 ( -1.81%) 440.08 ( 0.32%) 439.87 ( 0.37%)
    Min syst-xfsrepair 195.76 ( 0.00%) 278.47 (-42.25%) 262.34 (-34.01%) 203.70 ( -4.06%) 143.64 ( 26.62%)
    Amean real-fsmark 1188.30 ( 0.00%) 1177.34 ( 0.92%) 1157.97 ( 2.55%) 1158.21 ( 2.53%) 1182.22 ( 0.51%)
    Amean syst-fsmark 4111.37 ( 0.00%) 4055.70 ( 1.35%) 3987.19 ( 3.02%) 3998.72 ( 2.74%) 4061.69 ( 1.21%)
    Amean real-xfsrepair 450.88 ( 0.00%) 468.32 ( -3.87%) 454.14 ( -0.72%) 442.36 ( 1.89%) 440.59 ( 2.28%)
    Amean syst-xfsrepair 199.66 ( 0.00%) 290.60 (-45.55%) 277.20 (-38.84%) 204.68 ( -2.51%) 150.55 ( 24.60%)
    Stddev real-fsmark 4.12 ( 0.00%) 10.82 (-162.29%) 4.14 ( -0.28%) 5.98 (-45.05%) 4.60 (-11.53%)
    Stddev syst-fsmark 2.63 ( 0.00%) 20.32 (-671.82%) 0.37 ( 85.89%) 16.47 (-525.59%) 15.05 (-471.79%)
    Stddev real-xfsrepair 6.87 ( 0.00%) 4.55 ( 33.75%) 3.46 ( 49.58%) 1.78 ( 74.12%) 0.52 ( 92.50%)
    Stddev syst-xfsrepair 3.02 ( 0.00%) 10.30 (-241.37%) 13.17 (-336.37%) 0.71 ( 76.63%) 5.00 (-65.61%)
    CoeffVar real-fsmark 0.35 ( 0.00%) 0.92 (-164.73%) 0.36 ( -2.91%) 0.52 (-48.82%) 0.39 (-12.10%)
    CoeffVar syst-fsmark 0.06 ( 0.00%) 0.50 (-682.41%) 0.01 ( 85.45%) 0.41 (-543.22%) 0.37 (-478.78%)
    CoeffVar real-xfsrepair 1.52 ( 0.00%) 0.97 ( 36.21%) 0.76 ( 49.94%) 0.40 ( 73.62%) 0.12 ( 92.33%)
    CoeffVar syst-xfsrepair 1.51 ( 0.00%) 3.54 (-134.54%) 4.75 (-214.31%) 0.34 ( 77.20%) 3.32 (-119.63%)
    Max real-fsmark 1193.39 ( 0.00%) 1191.77 ( 0.14%) 1162.90 ( 2.55%) 1166.66 ( 2.24%) 1188.50 ( 0.41%)
    Max syst-fsmark 4114.18 ( 0.00%) 4075.45 ( 0.94%) 3987.65 ( 3.08%) 4019.45 ( 2.30%) 4082.80 ( 0.76%)
    Max real-xfsrepair 457.80 ( 0.00%) 474.60 ( -3.67%) 457.82 ( -0.00%) 444.42 ( 2.92%) 441.03 ( 3.66%)
    Max syst-xfsrepair 203.11 ( 0.00%) 303.65 (-49.50%) 294.35 (-44.92%) 205.33 ( -1.09%) 155.28 ( 23.55%)

    The really relevant lines are syst-xfsrepair, which is the system CPU
    usage when running xfsrepair. Note that on my machine the overhead was
    45% higher on 4.0-rc4, which may be part of what Dave is seeing. Once we
    preserve the write bit across faults, it's only 2.51% higher on average.
    With the full series applied, system CPU usage is 24.6% lower on
    average.

    Again, the impact of preserving the write bit on minor faults is obvious
    and the impact of slowing scanning after migration failures is obvious
    on the PTE updates. Note also that the number of pages migrated is much
    reduced even though the headline performance is comparable.

    3.19.0 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4
    vanilla vanilla vmwrite-v5r8 preserve-v5r8 slowscan-v5r8
    Minor Faults 153466827 254507978 249163829 153501373 105737890
    Major Faults 610 702 690 649 724
    NUMA base PTE updates 217735049 210756527 217729596 216937111 144344993
    NUMA huge PMD updates 129294 85044 106921 127246 79887
    NUMA pages migrated 21938995 29705270 28594162 22687324 16258075

    3.19.0 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4
    vanilla vanilla vmwrite-v5r8 preserve-v5r8 slowscan-v5r8
    Mean sdb-avgqusz 13.47 2.54 2.55 2.47 2.49
    Mean sdb-avgrqsz 202.32 140.22 139.50 139.02 138.12
    Mean sdb-await 25.92 5.09 5.33 5.02 5.22
    Mean sdb-r_await 4.71 0.19 0.83 0.51 0.11
    Mean sdb-w_await 104.13 5.21 5.38 5.05 5.32
    Mean sdb-svctm 0.59 0.13 0.14 0.13 0.14
    Mean sdb-rrqm 0.16 0.00 0.00 0.00 0.00
    Mean sdb-wrqm 3.59 1799.43 1826.84 1812.21 1785.67
    Max sdb-avgqusz 111.06 12.13 14.05 11.66 15.60
    Max sdb-avgrqsz 255.60 190.34 190.01 187.33 191.78
    Max sdb-await 168.24 39.28 49.22 44.64 65.62
    Max sdb-r_await 660.00 52.00 280.00 76.00 12.00
    Max sdb-w_await 7804.00 39.28 49.22 44.64 65.62
    Max sdb-svctm 4.00 2.82 2.86 1.98 2.84
    Max sdb-rrqm 8.30 0.00 0.00 0.00 0.00
    Max sdb-wrqm 34.20 5372.80 5278.60 5386.60 5546.15

    FWIW, I also checked SPECjbb in different configurations, and the
    observations are similar -- minor faults lower, PTE update activity
    lower and performance roughly comparable against 3.19.

    This patch (of 3):

    Threads that share writable data within pages are grouped together as
    related tasks. This decision is based on whether the PTE is marked
    dirty which is subject to timing races between the PTE scanner update
    and when the application writes the page. If the page is file-backed,
    then background flushes and sync also affect placement. This is
    unpredictable behaviour which is impossible to reason about so this
    patch makes grouping decisions based on the VMA flags.

    Signed-off-by: Mel Gorman
    Reported-by: Dave Chinner
    Tested-by: Dave Chinner
    Cc: Ingo Molnar
    Cc: Aneesh Kumar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

12 Mar, 2015

1 commit

  • Dave Chinner reported that commit 4d9424669946 ("mm: convert
    p[te|md]_mknonnuma and remaining page table manipulations") slowed down
    his xfsrepair test enormously. In particular, it was using more system
    time due to extra TLB flushing.

    The ultimate reason turns out to be how the change to use the regular
    page table accessor functions broke the NUMA grouping logic. The old
    special mknuma/mknonnuma code accessed the page table present bit and
    the magic NUMA bit directly, while the new code just changes the page
    protections using PROT_NONE and the regular vma protections.

    That sounds equivalent, and from a fault standpoint it really is, but a
    subtle side effect is that the *other* protection bits of the page table
    entries also change. And the code to decide how to group the NUMA
    entries together used the writable bit to decide whether a particular
    page was likely to be shared read-only or not.

    And with the change to make the NUMA handling use the regular permission
    setting functions, that writable bit was basically always cleared for
    private mappings due to COW. So even if the page actually ends up being
    written to in the end, the NUMA balancing would act as if it was always
    shared RO.

    This code is a heuristic anyway, so the fix - at least for now - is to
    instead check whether the page is dirty rather than writable. The bit
    doesn't change with protection changes.

    NOTE! This also adds a FIXME comment to revisit this issue:

    Not only should we probably re-visit the whole "is this a shared
    read-only page" heuristic (we might want to take the vma permissions
    into account and base this more on those than the per-page ones, and
    also look at whether the particular access that triggers it is a write
    or not), but the whole COW issue shows that we should think about the
    NUMA fault handling some more.

    For example, maybe we should do the early-COW thing that a regular fault
    does. Or maybe we should accept that while using the same bits as
    PROTNONE was a good thing (and got rid of the special NUMA bit), we
    might still want to just preserve the other protection bits across NUMA
    faulting.

    Those are bigger questions, left for later. This just fixes up the
    heuristic so that it at least approximates working again. More analysis
    and work needed.
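
    The heuristic change itself is a one-liner in the NUMA fault path; a
    sketch (simplified from do_numa_page()):

    /*
     * Avoid grouping on RO pages in general.  COW clears pte_write() on
     * private mappings even for pages that will be written, so use the
     * dirty bit as the "was this written" approximation instead.
     */
    if (!pte_dirty(pte))
            flags |= TNF_NO_GROUP;  /* was: if (!pte_write(pte)) */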

    Reported-by: Dave Chinner
    Tested-by: Mel Gorman
    Cc: Andrew Morton
    Cc: Aneesh Kumar
    Cc: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

17 Feb, 2015

2 commits

  • Currently COW of an XIP file is done by first bringing in a read-only
    mapping, then retrying the fault and copying the page. It is much more
    efficient to tell the fault handler that a COW is being attempted (by
    passing in the pre-allocated page in the vm_fault structure), and allow
    the handler to perform the COW operation itself.

    The handler cannot insert the page itself if there is already a read-only
    mapping at that address, so allow the handler to return VM_FAULT_LOCKED
    and set the fault_page to be NULL. This indicates to the MM code that the
    i_mmap_lock is held instead of the page lock.
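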

    Signed-off-by: Matthew Wilcox
    Acked-by: Kirill A. Shutemov
    Cc: Andreas Dilger
    Cc: Boaz Harrosh
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Jens Axboe
    Cc: Mathieu Desnoyers
    Cc: Randy Dunlap
    Cc: Ross Zwisler
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • DAX is a replacement for the variation of XIP currently supported by the
    ext2 filesystem. We have three different things in the tree called 'XIP',
    and the new focus is on access to data rather than executables, so a name
    change was in order. DAX stands for Direct Access. The X is for
    eXciting.

    The new focus on data access has resulted in more careful attention to
    races that exist in the current XIP code, but are not hit by the use-case
    that it was designed for. XIP's architecture worked fine for ext2, but
    DAX is architected to work with modern filesystems such as ext4 and XFS.
    DAX is not intended for use with btrfs; the value that btrfs adds relies
    on manipulating data and writing data to different locations, while DAX's
    value is for write-in-place and keeping the kernel from touching the data.

    DAX was developed in order to support NV-DIMMs, but it's become clear that
    its usefulness extends beyond NV-DIMMs and there are several potential
    customers including the tracing machinery. Other people want to place the
    kernel log in an area of memory, as long as they have a BIOS that does not
    clear DRAM on reboot.

    Patch 1 is a bug fix, probably worth including in 3.18.

    Patches 2 & 3 are infrastructure for DAX.

    Patches 4-8 replace the XIP code with its DAX equivalents, transforming
    ext2 to use the DAX code as we go. Note that patch 10 is the
    Documentation patch.

    Patches 9-15 clean up after the XIP code, removing the infrastructure
    that is no longer needed and renaming various XIP things to DAX.
    Most of these patches were added after Jan found things he didn't
    like in an earlier version of the ext4 patch ... that had been copied
    from ext2. So ext2 is being transformed to do things the same way that
    ext4 will later. The ability to mount ext2 filesystems with the 'xip'
    option is retained, although the 'dax' option is now preferred.

    Patch 16 adds some DAX infrastructure to support ext4.

    Patch 17 adds DAX support to ext4. It is broadly similar to ext2's DAX
    support, but it is more efficient than ext2's due to its support for
    unwritten extents.

    Patch 18 is another cleanup patch renaming XIP to DAX.

    My thanks to Mathieu Desnoyers for his reviews of the v11 patchset. Most
    of the changes below were based on his feedback.

    This patch (of 18):

    Pagecache faults recheck i_size after taking the page lock to ensure that
    the fault didn't race against a truncate. We don't have a page to lock in
    the XIP case, so use i_mmap_lock_read() instead. It is locked in the
    truncate path in unmap_mapping_range() after updating i_size. So while we
    hold it in the fault path, we are guaranteed that either i_size has
    already been updated in the truncate path, or that the truncate will
    subsequently call zap_page_range_single() and so remove the mapping we
    have just inserted.

    There is a window of time in which i_size has been reduced and the thread
    has a mapping to a page which will be removed from the file, but this is
    harmless as the page will not be allocated to a different purpose before
    the thread's access to it is revoked.
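
    A sketch of the recheck pattern being described (illustrative;
    PAGE_CACHE_SIZE was the page-cache unit of that era, PAGE_SIZE is used
    here for brevity):

    i_mmap_lock_read(mapping);
    /* i_size is updated before unmap_mapping_range() takes this lock in
     * the truncate path, so the check below cannot miss a truncate. */
    size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
    if (vmf->pgoff >= size) {
            i_mmap_unlock_read(mapping);
            return VM_FAULT_SIGBUS;
    }
    /* ... insert the PFN while still holding i_mmap_lock_read ... */
    i_mmap_unlock_read(mapping);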

    [akpm@linux-foundation.org: switch to i_mmap_lock_read(), add comment in unmap_single_vma()]
    Signed-off-by: Matthew Wilcox
    Reviewed-by: Jan Kara
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Mathieu Desnoyers
    Cc: Andreas Dilger
    Cc: Boaz Harrosh
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jens Axboe
    Cc: Randy Dunlap
    Cc: Ross Zwisler
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

13 Feb, 2015

5 commits

    For whatever reason, generic_access_phys() only remaps one page, but
    actually allows access of arbitrary size. It's quite easy to trigger
    large reads, like printing out a large structure with gdb, which leads
    to a crash. Fix it by remapping the correct size.
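
    The fix amounts to mapping enough for the whole access instead of a
    single page; a sketch of the relevant line in generic_access_phys()
    (surrounding variables as in the mainline function):

    int offset = addr & (PAGE_SIZE - 1);

    /* was: maddr = ioremap_prot(phys_addr, PAGE_SIZE, prot); */
    maddr = ioremap_prot(phys_addr, PAGE_ALIGN(len + offset), prot);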

    Fixes: 28b2ee20c7cb ("access_process_vm device memory infrastructure")
    Signed-off-by: Grazvydas Ignotas
    Cc: Rik van Riel
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Grazvydas Ignotas
     
  • pte_protnone_numa is only safe to use after VMA checks for PROT_NONE are
    complete. Treating a real PROT_NONE PTE as a NUMA hinting fault is going
    to result in strangeness so add a check for it. BUG_ON looks like
    overkill but if this is hit then it's a serious bug that could result in
    corruption so do not even try recovering. It would have been more
    comprehensive to check VMA flags in pte_protnone_numa but it would have
    made the API ugly just for a debugging check.

    Signed-off-by: Mel Gorman
    Cc: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Dave Jones
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Kirill Shutemov
    Cc: Linus Torvalds
    Cc: Paul Mackerras
    Cc: Rik van Riel
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Faults on the huge zero page are pointless and there is a BUG_ON to catch
    them during fault time. This patch reintroduces a check that avoids
    marking the zero page PAGE_NONE.

    Signed-off-by: Mel Gorman
    Cc: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Dave Jones
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Kirill Shutemov
    Cc: Linus Torvalds
    Cc: Paul Mackerras
    Cc: Rik van Riel
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • With PROT_NONE, the traditional page table manipulation functions are
    sufficient.

    [andre.przywara@arm.com: fix compiler warning in pmdp_invalidate()]
    [akpm@linux-foundation.org: fix build with STRICT_MM_TYPECHECKS]
    Signed-off-by: Mel Gorman
    Acked-by: Linus Torvalds
    Acked-by: Aneesh Kumar
    Tested-by: Sasha Levin
    Cc: Benjamin Herrenschmidt
    Cc: Dave Jones
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Kirill Shutemov
    Cc: Paul Mackerras
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Convert existing users of pte_numa and friends to the new helper. Note
    that the kernel is broken after this patch is applied until the other page
    table modifiers are also altered. This patch layout is to make review
    easier.

    Signed-off-by: Mel Gorman
    Acked-by: Linus Torvalds
    Acked-by: Aneesh Kumar
    Acked-by: Benjamin Herrenschmidt
    Tested-by: Sasha Levin
    Cc: Dave Jones
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Kirill Shutemov
    Cc: Paul Mackerras
    Cc: Rik van Riel
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

12 Feb, 2015

1 commit

    Dave noticed that an unprivileged process can allocate a significant
    amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by the
    oom-killer and memory cgroup. The trick is to allocate a lot of PMD
    page tables. The Linux kernel doesn't account PMD tables to the
    process, only PTE tables.

    The use-cases below use a few tricks to allocate a lot of PMD page
    tables while keeping VmRSS and VmPTE low. oom_score for the process
    will be 0.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/prctl.h>

    #define PUD_SIZE (1UL << 30)
    #define PMD_SIZE (1UL << 21)

    #define NR_PUD 130000

    int main(void)
    {
            char *addr = NULL;
            unsigned long i;

            prctl(PR_SET_THP_DISABLE);
            for (i = 0; i < NR_PUD ; i++) {
                    addr = mmap(addr + PUD_SIZE, PUD_SIZE,
                                PROT_WRITE|PROT_READ,
                                MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
                    if (addr == MAP_FAILED) {
                            perror("mmap");
                            break;
                    }
                    *addr = 'x';
                    munmap(addr, PMD_SIZE);
                    mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
                         MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
                    if (addr == MAP_FAILED)
                            perror("re-mmap"), exit(1);
            }
            printf("PID %d consumed %lu KiB in PMD page tables\n",
                   getpid(), i * 4096 >> 10);
            return pause();
    }

    The patch addresses the issue by accounting PMD tables to the process
    the same way we account PTE tables.

    The main places where PMD tables are accounted are __pmd_alloc() and
    free_pmd_range(). But there are a few corner cases:

    - HugeTLB can share PMD page tables. The patch handles this by
    accounting the table to all processes that share it.

    - x86 PAE pre-allocates a few PMD tables on fork.

    - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust the
    sanity check on exit(2).

    Accounting only happens on configurations where the PMD page table
    level is present (PMD is not folded). As with nr_ptes we use a per-mm
    counter. The counter value is used to calculate the baseline for the
    badness score by the oom-killer.
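
    A sketch of where the accounting happens (simplified from
    __pmd_alloc(); mm_inc_nr_pmds()/mm_dec_nr_pmds() are the helpers this
    patch introduces, mirroring nr_ptes):

    spin_lock(&mm->page_table_lock);
    if (!pud_present(*pud)) {
            mm_inc_nr_pmds(mm);             /* account the new PMD table */
            pud_populate(mm, pud, new);
    } else                                  /* lost the race, no account */
            pmd_free(mm, new);
    spin_unlock(&mm->page_table_lock);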

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dave Hansen
    Cc: Hugh Dickins
    Reviewed-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: David Rientjes
    Tested-by: Sedat Dilek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

11 Feb, 2015

7 commits

  • Merge misc updates from Andrew Morton:
    "Bite-sized chunks this time, to avoid the MTA ratelimiting woes.

    - fs/notify updates

    - ocfs2

    - some of MM"

    That laconic "some MM" is mainly the removal of remap_file_pages(),
    which is a big simplification of the VM, and which gets rid of a *lot*
    of random cruft and special cases because we no longer support the
    non-linear mappings that it used.

    From a user interface perspective, nothing has changed, because the
    remap_file_pages() syscall still exists, it's just done by emulating the
    old behavior by creating a lot of individual small mappings instead of
    one non-linear one.

    The emulation is slower than the old "native" non-linear mappings, but
    nobody really uses or cares about remap_file_pages(), and simplifying
    the VM is a big advantage.

    * emailed patches from Andrew Morton : (78 commits)
    memcg: zap memcg_slab_caches and memcg_slab_mutex
    memcg: zap memcg_name argument of memcg_create_kmem_cache
    memcg: zap __memcg_{charge,uncharge}_slab
    mm/page_alloc.c: place zone_id check before VM_BUG_ON_PAGE check
    mm: hugetlb: fix type of hugetlb_treat_as_movable variable
    mm, hugetlb: remove unnecessary lower bound on sysctl handlers"?
    mm: memory: merge shared-writable dirtying branches in do_wp_page()
    mm: memory: remove ->vm_file check on shared writable vmas
    xtensa: drop _PAGE_FILE and pte_file()-related helpers
    x86: drop _PAGE_FILE and pte_file()-related helpers
    unicore32: drop pte_file()-related helpers
    um: drop _PAGE_FILE and pte_file()-related helpers
    tile: drop pte_file()-related helpers
    sparc: drop pte_file()-related helpers
    sh: drop _PAGE_FILE and pte_file()-related helpers
    score: drop _PAGE_FILE and pte_file()-related helpers
    s390: drop pte_file()-related helpers
    parisc: drop _PAGE_FILE and pte_file()-related helpers
    openrisc: drop _PAGE_FILE and pte_file()-related helpers
    nios2: drop _PAGE_FILE and pte_file()-related helpers
    ...

    Linus Torvalds
     
  • Whether there is a vm_ops->page_mkwrite or not, the page dirtying is
    pretty much the same. Make sure the page references are the same in both
    cases, then merge the two branches.

    It's tempting to go even further and page-lock the !page_mkwrite case, to
    get it in line with everybody else setting the page table and thus further
    simplify the model. But that's not quite compelling enough to justify
    dropping the pte lock, then relocking and verifying the entry for
    filesystems without ->page_mkwrite, which notably includes tmpfs. Leave
    it for now and lock the page late in the !page_mkwrite case.

    Signed-off-by: Johannes Weiner
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Shared anonymous mmaps are implemented with shmem files, so all VMAs with
    shared writable semantics also have an underlying backing file.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Jan Kara
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • One bit in ->vm_flags is unused now!

    Signed-off-by: Kirill A. Shutemov
    Cc: Dan Carpenter
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We don't create non-linear mappings anymore. Let's drop code which
    handles them on page fault.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    We have had remap_file_pages(2) emulation in the -mm tree for a few
    release cycles and we plan to have it mainline in v3.20. This patchset
    removes the rest of the VM_NONLINEAR infrastructure.

    Patches 1-8 take care of generic code. They are pretty straightforward
    and can be applied without the other patches.

    The rest of the patches remove pte_file()-related stuff from
    architecture-specific code. It usually frees up one bit in the
    non-present pte. I've tried to reuse that bit for the swap offset where
    I was able to figure out how to do that.

    For obvious reasons I cannot test all that arch-specific code and would
    like to see acks from maintainers.

    In total, remap_file_pages(2) required about 1.4K lines of
    not-so-trivial kernel code. That's too much for functionality nobody
    uses.

    Tested-by: Felipe Balbi

    This patch (of 38):

    We don't create non-linear mappings anymore. Let's drop code which
    handles them on unmap/zap.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Pull xen features and fixes from David Vrabel:

    - Reworked handling for foreign (grant mapped) pages to simplify the
    code, enable a number of additional use cases and fix a number of
    long-standing bugs.

    - Prefer the TSC over the Xen PV clock when dom0 (and the TSC is
    stable).

    - Assorted other cleanup and minor bug fixes.

    * tag 'stable/for-linus-3.20-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (25 commits)
    xen/manage: Fix USB interaction issues when resuming
    xenbus: Add proper handling of XS_ERROR from Xenbus for transactions.
    xen/gntdev: provide find_special_page VMA operation
    xen/gntdev: mark userspace PTEs as special on x86 PV guests
    xen-blkback: safely unmap grants in case they are still in use
    xen/gntdev: safely unmap grants in case they are still in use
    xen/gntdev: convert priv->lock to a mutex
    xen/grant-table: add a mechanism to safely unmap pages that are in use
    xen-netback: use foreign page information from the pages themselves
    xen: mark grant mapped pages as foreign
    xen/grant-table: add helpers for allocating pages
    x86/xen: require ballooned pages for grant maps
    xen: remove scratch frames for ballooned pages and m2p override
    xen/grant-table: pre-populate kernel unmap ops for xen_gnttab_unmap_refs()
    mm: add 'foreign' alias for the 'pinned' page flag
    mm: provide a find_special_page vma operation
    x86/xen: cleanup arch/x86/xen/mmu.c
    x86/xen: add some __init annotations in arch/x86/xen/mmu.c
    x86/xen: add some __init and static annotations in arch/x86/xen/setup.c
    x86/xen: use correct types for addresses in arch/x86/xen/setup.c
    ...

    Linus Torvalds