
02 Mar, 2017

4 commits

  • We are going to split out of , which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file that just
    maps to to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • …sched/numa_balancing.h>

    We are going to split <linux/sched/numa_balancing.h> out of <linux/sched.h>, which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder <linux/sched/numa_balancing.h> file that just
    maps to <linux/sched.h> to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     
  • We are going to split out of , which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file that just
    maps to to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder <linux/sched/mm.h> file that just maps to
    <linux/sched.h> to make this patch obviously correct and bisectable (a
    sketch of such a placeholder follows this entry).

    The APIs that are going to be moved first are:

    mm_alloc()
    __mmdrop()
    mmdrop()
    mmdrop_async_fn()
    mmdrop_async()
    mmget_not_zero()
    mmput()
    mmput_async()
    get_task_mm()
    mm_access()
    mm_release()

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
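
    A minimal sketch of what such a placeholder header looks like; the
    guard name is an assumption, but the body is just a redirect to
    <linux/sched.h> as described above:

        /* include/linux/sched/mm.h -- placeholder sketch, not verbatim */
        #ifndef _LINUX_SCHED_MM_H
        #define _LINUX_SCHED_MM_H

        /*
         * For now this header simply maps back to <linux/sched.h>; the
         * mm-related APIs listed above are moved here by later patches.
         */
        #include <linux/sched.h>

        #endif /* _LINUX_SCHED_MM_H */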
     

25 Feb, 2017

8 commits

  • Patch series "Numabalancing preserve write fix", v2.

    This patch series addresses an issue w.r.t. THP migration and the
    autonuma preserve-write feature. migrate_misplaced_transhuge_page()
    cannot deal with concurrent modification of the page. It does a page
    copy without following the migration pte sequence. IIUC, this was done
    to keep the migration simpler, and at the time of implementation we
    didn't have THP page cache, which would have required a more elaborate
    migration scheme. That means THP autonuma migration expects the
    protnone with saved write to be done such that both kernel and user
    cannot update the page content. This patch series enables archs like
    ppc64 to do that. We are good with the hash translation mode with the
    current code, because we never create a hardware page table entry for
    a protnone pte.

    This patch (of 2):

    Autonuma preserves the write permission across a numa fault to avoid
    taking a write fault after a numa fault (commit b191f9b106ea "mm: numa:
    preserve PTE write permissions across a NUMA hinting fault").
    Architectures can implement protnone in different ways, and some may
    choose to do so by clearing the Read/Write/Exec bits of the pte.
    Setting the write bit on such a pte can result in wrong behaviour. Fix
    this up by allowing the architecture to override how the write bit is
    saved on a protnone pte (a sketch of the generic fallbacks follows
    this entry).

    [aneesh.kumar@linux.vnet.ibm.com: don't mark pte saved write in case of dirty_accountable]
    Link: http://lkml.kernel.org/r/1487942884-16517-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    [aneesh.kumar@linux.vnet.ibm.com: v3]
    Link: http://lkml.kernel.org/r/1487498625-10891-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/1487050314-3892-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Acked-by: Michael Neuling
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
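
    A minimal sketch of the generic fallbacks this enables, assuming the
    pte_savedwrite()/pte_mk_savedwrite() helper names and illustrative
    variable names; by default the saved-write bit is just the ordinary
    write bit, and an architecture like ppc64 can override these:

        /* Sketch: default definitions an arch may override. */
        #ifndef pte_savedwrite
        #define pte_savedwrite(pte)     pte_write(pte)
        #endif

        #ifndef pte_mk_savedwrite
        #define pte_mk_savedwrite(pte)  pte_mkwrite(pte)
        #endif

        /*
         * NUMA hinting fault path, conceptually: remember whether the pte
         * was writable before it became protnone, so the write bit can be
         * restored later without taking an extra write fault.
         */
        bool was_writable = pte_savedwrite(oldpte);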
     
  • Architectures like ppc64 use a privilege access bit to mark a pte
    non-accessible. This implies that the kernel can do a copy_to_user to
    an address marked for a numa fault. It also implies that there can be
    a parallel hardware update of the pte. set_pte_at() cannot be used in
    such scenarios. Hence switch the pte update to a ptep_get_and_clear()
    and set_pte_at() combination (the pattern is sketched after this
    entry).

    [akpm@linux-foundation.org: remove unwanted ppc change, per Aneesh]
    Link: http://lkml.kernel.org/r/1486400776-28114-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
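
    A rough sketch of the resulting update pattern (context variables are
    illustrative, not verbatim kernel code): clear the pte first so a
    parallel hardware update cannot race with the modification, then
    install the new value:

        pte_t oldpte, newpte;

        oldpte = ptep_get_and_clear(mm, addr, ptep); /* pte now clear; HW cannot update it */
        newpte = pte_modify(oldpte, newprot);        /* e.g. turn it into a protnone pte   */
        set_pte_at(mm, addr, ptep, newpte);          /* publish the final pte in one step  */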
     
  • Fix whitespace issues, extraneous braces.

    Link: http://lkml.kernel.org/r/1485992240-10986-5-git-send-email-me@tobin.cc
    Signed-off-by: Tobin C Harding
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tobin C Harding
     
  • This patch fixes the sparse warning "Using plain integer as NULL
    pointer" by replacing the assignment of 0 to a pointer with a NULL
    assignment.

    Link: http://lkml.kernel.org/r/1485992240-10986-2-git-send-email-me@tobin.cc
    Signed-off-by: Tobin C Harding
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tobin C Harding
     
  • Since the introduction of FAULT_FLAG_SIZE to the vm_fault flags,
    getting the flags set and removed at the correct locations has been
    somewhat painful. More than one kernel oops was introduced due to the
    difficulty of getting the placement correct.

    Remove the flag values and introduce an input parameter to huge_fault
    that indicates the size of the page entry. This makes the code easier
    to trace and should avoid the issues we saw with the fault flags,
    where removal of the flag was necessary in the fallback paths. (The
    resulting handler shape is sketched after this entry.)

    Link: http://lkml.kernel.org/r/148615748258.43180.1690152053774975329.stgit@djiang5-desk3.ch.intel.com
    Signed-off-by: Dave Jiang
    Tested-by: Dan Williams
    Reviewed-by: Jan Kara
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Vlastimil Babka
    Cc: Ross Zwisler
    Cc: Kirill A. Shutemov
    Cc: Nilesh Choudhury
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jiang
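
    The resulting handler shape, sketched from the description above (the
    enum and member names reflect what this series introduces; treat the
    details as illustrative):

        /* The size of the faulting entry is now an explicit argument. */
        enum page_entry_size {
                PE_SIZE_PTE = 0,
                PE_SIZE_PMD,
                PE_SIZE_PUD,
        };

        struct vm_operations_struct {
                /* ... */
                int (*huge_fault)(struct vm_fault *vmf,
                                  enum page_entry_size pe_size);
                /* ... */
        };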
     
  • The current transparent hugepage code only supports PMDs. This patch
    adds support for transparent use of PUDs with DAX. It does not include
    support for anonymous pages. x86 support code is also added.

    Most of this patch simply parallels the work that was done for huge
    PMDs. The only major difference is how the new ->pud_entry method in
    mm_walk works. The ->pmd_entry method replaces the ->pte_entry method,
    whereas the ->pud_entry method works along with either ->pmd_entry or
    ->pte_entry (see the sketch after this entry). The pagewalk code takes
    care of locking the PUD before calling ->pud_entry, so handlers do not
    need to worry about whether the PUD is stable.

    [dave.jiang@intel.com: fix SMP x86 32bit build for native_pud_clear()]
    Link: http://lkml.kernel.org/r/148719066814.31111.3239231168815337012.stgit@djiang5-desk3.ch.intel.com
    [dave.jiang@intel.com: native_pud_clear missing on i386 build]
    Link: http://lkml.kernel.org/r/148640375195.69754.3315433724330910314.stgit@djiang5-desk3.ch.intel.com
    Link: http://lkml.kernel.org/r/148545059381.17912.8602162635537598445.stgit@djiang5-desk3.ch.intel.com
    Signed-off-by: Matthew Wilcox
    Signed-off-by: Dave Jiang
    Tested-by: Alexander Kapshuk
    Cc: Dave Hansen
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Kirill A. Shutemov
    Cc: Nilesh Choudhury
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
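
    A sketch of the pagewalk callbacks described above (the struct is
    abridged; only the entry callbacks are shown):

        struct mm_walk {
                int (*pud_entry)(pud_t *pud, unsigned long addr,
                                 unsigned long next, struct mm_walk *walk);
                int (*pmd_entry)(pmd_t *pmd, unsigned long addr,
                                 unsigned long next, struct mm_walk *walk);
                int (*pte_entry)(pte_t *pte, unsigned long addr,
                                 unsigned long next, struct mm_walk *walk);
                /* ... remaining callbacks and walk state ... */
        };

    The pagewalk core takes the PUD lock before invoking ->pud_entry, so a
    handler can treat the PUD as stable for the duration of the callback.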
     
  • Patch series "1G transparent hugepage support for device dax", v2.

    The following series implements support for 1G transparent hugepages
    on x86 for device dax. The bulk of the code was written by Matthew
    Wilcox a while back supporting transparent 1G hugepages for fs DAX. I
    have forward ported the relevant bits to 4.10-rc. The current
    submission has only the necessary code to support device DAX.

    Comments from Dan Williams: So the motivation and intended users of
    this functionality mirror the motivation and users of 1GB page support
    in hugetlbfs. Given the expected capacities of persistent memory
    devices, an in-memory database may want to reduce TLB pressure beyond
    what it can already achieve with 2MB mappings of a device-dax file. We
    have customer feedback to that effect, as Willy mentioned in his
    previous version of these patches [1].

    [1]: https://lkml.org/lkml/2016/1/31/52

    Comments from Nilesh @ Oracle:

    There are applications which have a process model; if you assume
    10,000 processes attempting to mmap all of the 6TB of memory available
    on a server, we are looking at the following:

    processes : 10,000
    memory : 6TB
    pte @ 4K page size: 8 bytes per 4K of memory * #processes = (6TB / 4K) * 8 * 10,000 = 12GB * 10,000 = 120,000GB
    pmd @ 2M page size: 120,000GB / 512 = ~240GB
    pud @ 1G page size: 240GB / 512 = ~480MB

    As you can see, with 2M pages this system will use up an exorbitant
    amount of DRAM just to hold the page tables, but 1G pages finally
    bring it down to a reasonable level. Memory sizes will keep
    increasing, so this number will keep increasing. (The arithmetic is
    reproduced as a small program after this entry.)

    An argument can be made to convert the applications from a process
    model to a thread model, but in the real world that may not always be
    practical. Hopefully this helps explain the use case where this is
    valuable.

    This patch (of 3):

    In preparation for adding the ability to handle PUD pages, convert
    vm_operations_struct.pmd_fault to vm_operations_struct.huge_fault. The
    vm_fault structure is extended to include a union of the different page
    table pointers that may be needed, and three flag bits are reserved to
    indicate which type of pointer is in the union.

    [ross.zwisler@linux.intel.com: remove unused function ext4_dax_huge_fault()]
    Link: http://lkml.kernel.org/r/1485813172-7284-1-git-send-email-ross.zwisler@linux.intel.com
    [dave.jiang@intel.com: clear PMD or PUD size flags when in fall through path]
    Link: http://lkml.kernel.org/r/148589842696.5820.16078080610311444794.stgit@djiang5-desk3.ch.intel.com
    Link: http://lkml.kernel.org/r/148545058784.17912.6353162518188733642.stgit@djiang5-desk3.ch.intel.com
    Signed-off-by: Matthew Wilcox
    Signed-off-by: Dave Jiang
    Signed-off-by: Ross Zwisler
    Cc: Dave Hansen
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Dan Williams
    Cc: Kirill A. Shutemov
    Cc: Nilesh Choudhury
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Dave Jiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jiang
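
    The page-table arithmetic from the cover letter above, reproduced as a
    small stand-alone program (binary units and the example process count
    are carried over from the text as assumptions):

        #include <stdio.h>

        int main(void)
        {
                unsigned long long mem   = 6ULL << 40; /* 6TB mapped per process */
                unsigned long long procs = 10000;      /* number of processes    */
                unsigned long long esz   = 8;          /* bytes per table entry  */

                /* one leaf entry per 4K, 2M or 1G of mapped memory, per process */
                unsigned long long pte = mem / (4ULL << 10) * esz * procs;
                unsigned long long pmd = mem / (2ULL << 20) * esz * procs;
                unsigned long long pud = mem / (1ULL << 30) * esz * procs;

                printf("4K pages: %llu GB of page tables\n", pte >> 30); /* 120000, the 120,000GB above */
                printf("2M pages: %llu GB\n", pmd >> 30);                /* 234, the ~240GB above       */
                printf("1G pages: %llu MB\n", pud >> 20);                /* 468, the ~480MB above       */
                return 0;
        }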
     
  • ->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to
    take a vma and vmf parameter when the vma already resides in vmf.

    Remove the vma parameter to simplify things.

    [arnd@arndb.de: fix ARM build]
    Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com
    Signed-off-by: Dave Jiang
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Ross Zwisler
    Cc: Theodore Ts'o
    Cc: Darrick J. Wong
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Christoph Hellwig
    Cc: Jan Kara
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jiang
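
    In prototype terms the change amounts to the following (old form shown
    for contrast; the vma is reachable as vmf->vma):

        /* Before: the vma was passed separately, although vmf carries it. */
        int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
        int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
        int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);

        /* After: handlers take only the vm_fault and use vmf->vma. */
        int (*fault)(struct vm_fault *vmf);
        int (*page_mkwrite)(struct vm_fault *vmf);
        int (*pfn_mkwrite)(struct vm_fault *vmf);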
     

23 Feb, 2017

7 commits

  • There are no users of zap_page_range() who want a non-NULL 'details'
    argument. Let's drop it (the resulting prototype is sketched after
    this entry).

    Link: http://lkml.kernel.org/r/20170118122429.43661-3-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
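
    The prototype change this implies, sketched (the pre-change signature
    is recalled rather than quoted and may differ in detail):

        /* Before: every caller passed details == NULL anyway. */
        void zap_page_range(struct vm_area_struct *vma, unsigned long start,
                            unsigned long size, struct zap_details *details);

        /* After: the unused parameter is gone. */
        void zap_page_range(struct vm_area_struct *vma, unsigned long start,
                            unsigned long size);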
     
  • details == NULL would give the same functionality as
    .check_swap_entries == true.

    Link: http://lkml.kernel.org/r/20170118122429.43661-2-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The only user of ignore_dirty is the oom-reaper. But it doesn't
    really use it.

    ignore_dirty only has an effect on file pages mapped with a dirty pte.
    But the oom-reaper skips shared VMAs, so there's no way we can dirty a
    file pte in them.

    Link: http://lkml.kernel.org/r/20170118122429.43661-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The new routine copy_huge_page_from_user() uses kmap_atomic() to map
    PAGE_SIZE pages. However, this prevents page faults in the subsequent
    call to copy_from_user(). This is OK in the case where the routine is
    called with mmap_sem held. However, in other cases we want to allow
    page faults. So, add a new argument allow_pagefault to indicate
    whether the routine should allow page faults (the mapping pattern is
    sketched after this entry).

    [dan.carpenter@oracle.com: unmap the correct pointer]
    Link: http://lkml.kernel.org/r/20170113082608.GA3548@mwanda
    [akpm@linux-foundation.org: kunmap() takes a page*, per Hugh]
    Link: http://lkml.kernel.org/r/20161216144821.5183-20-aarcange@redhat.com
    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Dan Carpenter
    Cc: "Dr. David Alan Gilbert"
    Cc: Hillf Danton
    Cc: Michael Rapoport
    Cc: Mike Rapoport
    Cc: Pavel Emelyanov
    Cc: Hugh Dickins
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
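
    A sketch of the per-page copy with the new flag (variable names are
    illustrative): kmap_atomic() disables page faults for the copy, while
    kmap() allows them.

        void *kaddr;
        unsigned long rc;

        if (allow_pagefault)
                kaddr = kmap(dst_page);        /* copy_from_user() may fault and sleep */
        else
                kaddr = kmap_atomic(dst_page); /* atomic mapping: faults are disabled  */

        rc = copy_from_user(kaddr, usr_src, PAGE_SIZE);

        if (allow_pagefault)
                kunmap(dst_page);              /* kunmap() takes the page              */
        else
                kunmap_atomic(kaddr);          /* kunmap_atomic() takes the mapping    */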
     
  • userfaultfd UFFDIO_COPY allows user level code to copy data to a page at
    fault time. The data is copied from user space to a newly allocated
    huge page. The new routine copy_huge_page_from_user performs this copy.

    Link: http://lkml.kernel.org/r/20161216144821.5183-17-aarcange@redhat.com
    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Hillf Danton
    Cc: Michael Rapoport
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • pmd_fault() and related functions really only need the vmf parameter since
    the additional parameters are all included in the vmf struct. Remove the
    additional parameter and simplify pmd_fault() and friends.

    Link: http://lkml.kernel.org/r/1484085142-2297-8-git-send-email-ross.zwisler@linux.intel.com
    Signed-off-by: Dave Jiang
    Reviewed-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Cc: Dave Chinner
    Cc: Dave Jiang
    Cc: Matthew Wilcox
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jiang
     
  • Instead of passing in multiple parameters in the pmd_fault() handler,
    a vmf can be passed in just like a fault() handler. This will simplify
    code and remove the need for the actual pmd fault handlers to allocate a
    vmf. Related functions are also modified to do the same.

    [dave.jiang@intel.com: fix issue with xfs_tests stall when DAX option is off]
    Link: http://lkml.kernel.org/r/148469861071.195597.3619476895250028518.stgit@djiang5-desk3.ch.intel.com
    Link: http://lkml.kernel.org/r/1484085142-2297-7-git-send-email-ross.zwisler@linux.intel.com
    Signed-off-by: Dave Jiang
    Reviewed-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Cc: Dave Chinner
    Cc: Matthew Wilcox
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jiang
     

11 Jan, 2017

2 commits

  • Currently dax_mapping_entry_mkclean() fails to clean and write protect
    the pmd_t of a DAX PMD entry during an *sync operation. This can result
    in data loss in the following sequence:

    1) mmap write to DAX PMD, dirtying PMD radix tree entry and making the
    pmd_t dirty and writeable
    2) fsync, flushing out PMD data and cleaning the radix tree entry. We
    currently fail to mark the pmd_t as clean and write protected.
    3) more mmap writes to the PMD. These don't cause any page faults since
    the pmd_t is dirty and writeable. The radix tree entry remains clean.
    4) fsync, which fails to flush the dirty PMD data because the radix tree
    entry was clean.
    5) crash - dirty data that should have been fsync'd as part of 4) could
    still have been in the processor cache, and is lost.

    Fix this by marking the pmd_t clean and write protected in
    dax_mapping_entry_mkclean(), which is called as part of the fsync
    operation 2). This will cause the writes in step 3) above to generate
    page faults where we'll re-dirty the PMD radix tree entry, resulting in
    flushes in the fsync that happens in step 4).

    Fixes: 4b4bb46d00b3 ("dax: clear dirty entry tags on cache flush")
    Link: http://lkml.kernel.org/r/1482272586-21177-3-git-send-email-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Cc: Alexander Viro
    Cc: Christoph Hellwig
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
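
    Conceptually, the fix performs for the PMD what is already done for
    PTEs; a rough sketch of the clean-and-write-protect step (helper usage
    assumed, not verbatim kernel code):

        pmd_t pmd;

        flush_cache_page(vma, address, pfn);
        pmd = pmdp_huge_clear_flush(vma, address, pmdp); /* returns the old PMD */
        pmd = pmd_wrprotect(pmd);   /* the next mmap write will fault again ... */
        pmd = pmd_mkclean(pmd);     /* ... and re-dirty the radix tree entry    */
        set_pmd_at(vma->vm_mm, address, pmdp, pmd);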
     
  • Patch series "Write protect DAX PMDs in *sync path".

    Currently dax_mapping_entry_mkclean() fails to clean and write protect
    the pmd_t of a DAX PMD entry during an *sync operation. This can result
    in data loss, as detailed in patch 2.

    This series is based on Dan's "libnvdimm-pending" branch, which is the
    current home for Jan's "dax: Page invalidation fixes" series. You can
    find a working tree here:

    https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=dax_pmd_clean

    This patch (of 2):

    Similar to follow_pte(), follow_pte_pmd() allows either a PTE leaf or a
    huge page PMD leaf to be found and returned.

    Link: http://lkml.kernel.org/r/1482272586-21177-2-git-send-email-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Suggested-by: Dave Hansen
    Cc: Alexander Viro
    Cc: Christoph Hellwig
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
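
    The helper's shape as far as it can be inferred from the description
    above; the exact parameter list is an assumption:

        /*
         * Sketch (assumed signature): on success either *ptepp or *pmdpp
         * is set, and *ptlp is the page-table lock the caller must drop.
         */
        int follow_pte_pmd(struct mm_struct *mm, unsigned long address,
                           pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp);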
     

08 Jan, 2017

1 commit

  • 4.10-rc loadtest (even on x86, and even without THPCache) fails with
    "fork: Cannot allocate memory" or some such; and /proc/meminfo shows
    PageTables growing.

    Commit 953c66c2b22a ("mm: THP page cache support for ppc64") that got
    merged in rc1 removed the freeing of an unused preallocated pagetable
    after do_fault_around() has called map_pages().

    This is usually a good optimization, so that the followup doesn't have
    to reallocate one; but it's not sufficient to shift the freeing into
    alloc_set_pte(), since there are failure cases (most commonly
    VM_FAULT_RETRY) which never reach finish_fault().

    Check and free it at the outer level in do_fault(), then we don't need
    to worry in alloc_set_pte(), and can restore that to how it was (I
    cannot find any reason to pte_free() under lock as it was doing).

    And fix a separate pagetable leak, or crash, introduced by the same
    change, that could only show up on some ppc64: why does do_set_pmd()'s
    failure case attempt to withdraw a pagetable when it never deposited
    one, at the same time overwriting (so leaking) the vmf->prealloc_pte?
    Residue of an earlier implementation, perhaps? Delete it.

    Fixes: 953c66c2b22a ("mm: THP page cache support for ppc64")
    Cc: Aneesh Kumar K.V
    Cc: Kirill A. Shutemov
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Michael Neuling
    Cc: Paul Mackerras
    Cc: Balbir Singh
    Cc: Andrew Morton
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
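
    The outer-level check described above, sketched (field and helper
    names as used in that code path):

        /* End of do_fault(): drop a preallocated page table nobody used. */
        if (vmf->prealloc_pte) {
                pte_free(vma->vm_mm, vmf->prealloc_pte);
                vmf->prealloc_pte = NULL;
        }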
     


15 Dec, 2016

13 commits

  • Merge more updates from Andrew Morton:

    - a few misc things

    - kexec updates

    - DMA-mapping updates to better support networking DMA operations

    - IPC updates

    - various MM changes to improve DAX fault handling

    - lots of radix-tree changes, mainly to the test suite. All leading up
    to reimplementing the IDA/IDR code to be a wrapper layer over the
    radix-tree. However the final trigger-pulling patch is held off for
    4.11.

    * emailed patches from Andrew Morton : (114 commits)
    radix tree test suite: delete unused rcupdate.c
    radix tree test suite: add new tag check
    radix-tree: ensure counts are initialised
    radix tree test suite: cache recently freed objects
    radix tree test suite: add some more functionality
    idr: reduce the number of bits per level from 8 to 6
    rxrpc: abstract away knowledge of IDR internals
    tpm: use idr_find(), not idr_find_slowpath()
    idr: add ida_is_empty
    radix tree test suite: check multiorder iteration
    radix-tree: fix replacement for multiorder entries
    radix-tree: add radix_tree_split_preload()
    radix-tree: add radix_tree_split
    radix-tree: add radix_tree_join
    radix-tree: delete radix_tree_range_tag_if_tagged()
    radix-tree: delete radix_tree_locate_item()
    radix-tree: improve multiorder iterators
    btrfs: fix race in btrfs_free_dummy_fs_info()
    radix-tree: improve dump output
    radix-tree: make radix_tree_find_next_bit more useful
    ...

    Linus Torvalds
     
  • Currently the PTE gets updated in wp_pfn_shared() after
    dax_pfn_mkwrite() has released the corresponding radix tree entry
    lock. When we want to write-protect the PTE on cache flush, we need
    the PTE modification to happen under the radix tree entry lock to
    ensure consistent updates of the PTE and the radix tree (standard
    faults use the page lock to ensure this consistency). So move the
    update of the PTE bit into dax_pfn_mkwrite().

    Link: http://lkml.kernel.org/r/1479460644-25076-20-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Cc: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • DAX will need to implement its own version of page_check_address(). To
    avoid duplicating page table walking code, export follow_pte() which
    does what we need.

    Link: http://lkml.kernel.org/r/1479460644-25076-18-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Cc: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Currently finish_mkwrite_fault() returns 0 when the PTE got changed
    before we acquired the PTE lock, and VM_FAULT_WRITE when we succeeded
    in modifying the PTE. This is somewhat confusing since 0 generally
    means success; it is also inconsistent with finish_fault(), which
    returns 0 on success. Change finish_mkwrite_fault() to return 0 on
    success and VM_FAULT_NOPAGE when the PTE changed. Practically, there
    should be no behavioral difference since we bail out of the fault the
    same way regardless of whether we return 0, VM_FAULT_NOPAGE, or
    VM_FAULT_WRITE. Also note that VM_FAULT_WRITE has no effect for shared
    mappings since the only two places that check it - KSM and GUP - care
    about private mappings only. Generally the meaning of VM_FAULT_WRITE
    for shared mappings is not well defined and we should probably clean
    that up.

    Link: http://lkml.kernel.org/r/1479460644-25076-17-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Kirill A. Shutemov
    Cc: Ross Zwisler
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Provide a helper function for finishing write faults due to the PTE
    being read-only. The helper will be used by DAX to avoid the need to
    complicate generic MM code with DAX locking specifics.

    Link: http://lkml.kernel.org/r/1479460644-25076-16-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • The write-shared fault handling in wp_page_reuse() is needed only by
    wp_page_shared(). Move that handling into wp_page_shared() to make
    wp_page_reuse() simpler and to avoid the strange situation where we
    sometimes pass in a locked page and sometimes an unlocked one, etc.

    Link: http://lkml.kernel.org/r/1479460644-25076-15-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • So far we set vmf->page during WP faults only when we needed to pass it
    to the ->page_mkwrite handler. Set it in all the cases now and use that
    instead of passing page pointer explicitly around.

    Link: http://lkml.kernel.org/r/1479460644-25076-14-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • We will need more information in the ->page_mkwrite() helper for DAX to
    be able to fully finish faults there. Pass vm_fault structure to
    do_page_mkwrite() and use it there so that information propagates
    properly from upper layers.

    Link: http://lkml.kernel.org/r/1479460644-25076-13-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Currently we duplicate handling of shared write faults in
    wp_page_reuse() and do_shared_fault(). Factor them out into a common
    function.

    Link: http://lkml.kernel.org/r/1479460644-25076-12-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Move the final handling of COW faults from generic code into the DAX
    fault handler. That way generic code doesn't have to be aware of the
    peculiarities of DAX locking, so remove that knowledge and make the
    locking functions private to fs/dax.c.

    Link: http://lkml.kernel.org/r/1479460644-25076-11-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Ross Zwisler
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Introduce finish_fault() as a helper function for finishing page
    faults. It is a rather thin wrapper around alloc_set_pte(), but since
    we'd want to call this from DAX code or filesystems, it is still
    useful to avoid some boilerplate code (a usage sketch follows this
    entry).

    Link: http://lkml.kernel.org/r/1479460644-25076-10-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
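
    Roughly how a read-fault path would use the new helper (a sketch of
    the shape, not the exact call sites): the ->fault handler produces
    vmf->page, and finish_fault() maps it.

        int ret;

        ret = __do_fault(vmf);          /* fills vmf->page via the fs ->fault  */
        if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
                return ret;

        ret |= finish_fault(vmf);       /* thin wrapper around alloc_set_pte() */
        unlock_page(vmf->page);
        return ret;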
     
  • Patch series "dax: Clear dirty bits after flushing caches", v5.

    Patchset to clear dirty bits from radix tree of DAX inodes when caches
    for corresponding pfns have been flushed. In principle, these patches
    enable handlers to easily update PTEs and do other work necessary to
    finish the fault without duplicating the functionality present in the
    generic code. I'd like to thank Kirill and Ross for reviews of the
    series!

    This patch (of 20):

    To allow full handling of COW faults add memcg field to struct vm_fault
    and a return value of ->fault() handler meaning that COW fault is fully
    handled and memcg charge must not be canceled. This will allow us to
    remove knowledge about special DAX locking from the generic fault code.

    Link: http://lkml.kernel.org/r/1479460644-25076-9-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Add orig_pte field to vm_fault structure to allow ->page_mkwrite
    handlers to fully handle the fault.

    This also allows us to save some passing of extra arguments around.

    Link: http://lkml.kernel.org/r/1479460644-25076-8-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara