20 Oct, 2020

1 commit

  • Pull fuse updates from Miklos Szeredi:

    - Support directly accessing host page cache from virtiofs. This can
    improve I/O performance for various workloads, as well as reduce
    the memory requirement by eliminating double caching. Thanks to Vivek
    Goyal for doing most of the work on this.

    - Allow automatic submounting inside virtiofs. This allows unique
    st_dev/st_ino values to be assigned inside the guest to files
    residing on different filesystems on the host. Thanks to Max Reitz
    for the patches.

    - Fix an old use-after-free bug found by Pradeep P V K.

    * tag 'fuse-update-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (25 commits)
    virtiofs: calculate number of scatter-gather elements accurately
    fuse: connection remove fix
    fuse: implement crossmounts
    fuse: Allow fuse_fill_super_common() for submounts
    fuse: split fuse_mount off of fuse_conn
    fuse: drop fuse_conn parameter where possible
    fuse: store fuse_conn in fuse_req
    fuse: add submount support to <uapi>
    fuse: fix page dereference after free
    virtiofs: add logic to free up a memory range
    virtiofs: maintain a list of busy elements
    virtiofs: serialize truncate/punch_hole and dax fault path
    virtiofs: define dax address space operations
    virtiofs: add DAX mmap support
    virtiofs: implement dax read/write operations
    virtiofs: introduce setupmapping/removemapping commands
    virtiofs: implement FUSE_INIT map_alignment field
    virtiofs: keep a list of free dax memory ranges
    virtiofs: add a mount option to enable dax
    virtiofs: set up virtio_fs dax_device
    ...

    Linus Torvalds
     

21 Sep, 2020

1 commit

  • Pass the full length to iomap_zero() and dax_iomap_zero(), and have
    them return how many bytes they actually handled. This is preparatory
    work for handling THP, although it looks like DAX could actually take
    advantage of it if there's a larger contiguous area.
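
    The calling-convention change can be pictured roughly as follows
    (prototypes paraphrased from the description above, not copied from the
    patch):

        /* before: the helpers zero at most one page; callers loop */
        int dax_iomap_zero(loff_t pos, unsigned offset, unsigned size,
                           struct iomap *iomap);

        /* after: pass the full length and learn how many bytes were handled */
        s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap);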

    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Matthew Wilcox (Oracle)
     

10 Sep, 2020

1 commit

  • The virtiofs device has a range of memory which is mapped into file inodes
    using dax. This memory is mapped by qemu on the host and maps different
    sections of real files on the host. The size of this memory is limited
    (determined by the administrator) and, depending on the filesystem size, we
    will soon reach a situation where all the memory is in use and we need to
    reclaim some.

    As part of the reclaim process, we will need to make sure that there are
    no active references to pages (taken by get_user_pages()) in the memory
    range we are trying to reclaim. I am planning to use
    dax_layout_busy_page() for this. But in its current form this is per inode
    and scans through all the pages of the inode.

    We want to reclaim only a portion of memory (say, a 2MB page), so we want
    to make sure that only that 2MB range of pages has no outstanding
    references (and we don't want to unmap all the pages of the inode).

    Hence, create a range version of this function named
    dax_layout_busy_page_range() which can be used to pass a range which
    needs to be unmapped.
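
    With the range version in place, the existing helper can remain as the
    whole-file special case; a sketch of the relationship (the real code lives
    in fs/dax.c):

        struct page *dax_layout_busy_page_range(struct address_space *mapping,
                                                loff_t start, loff_t end);

        /* the old per-inode helper becomes the "scan everything" case */
        struct page *dax_layout_busy_page(struct address_space *mapping)
        {
                return dax_layout_busy_page_range(mapping, 0, LLONG_MAX);
        }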

    Cc: Dan Williams
    Cc: linux-nvdimm@lists.01.org
    Cc: Jan Kara
    Cc: Vishal L Verma
    Cc: "Weiny, Ira"
    Signed-off-by: Vivek Goyal
    Reviewed-by: Jan Kara
    Signed-off-by: Miklos Szeredi

    Vivek Goyal
     

24 Aug, 2020

1 commit

  • Replace the existing /* fall through */ comments and their variants with
    the new pseudo-keyword macro fallthrough[1]. Also, remove fall-through
    markings where they are unnecessary.

    [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through
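
    A typical conversion looks like this (illustrative only; the case labels
    and helpers are made up, not taken from the patch):

        switch (cmd) {
        case CMD_PREPARE:
                prepare();
                fallthrough;    /* replaces the old fall-through comment */
        case CMD_RUN:
                run();
                break;
        default:
                break;
        }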

    Signed-off-by: Gustavo A. R. Silva

    Gustavo A. R. Silva
     

31 Jul, 2020

1 commit

  • The argument passed to xas_set_err() to indicate an error should be negative.
    Otherwise, xas_error() will return 0, and grab_mapping_entry() will return the
    found entry instead of 'SIGBUS' when the entry is not in fact valid.
    This would result in problems in subsequent code paths.
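
    In code, the fix is just the sign of the value handed to xas_set_err();
    -EIO below is a stand-in errno, not necessarily the one in the patch:

        xas_set_err(&xas, EIO);    /* wrong: xas_error() treats this as "no error" */
        xas_set_err(&xas, -EIO);   /* right: xas_error() now returns -EIO */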

    Link: https://lore.kernel.org/r/20200729034436.24267-1-lihao2018.fnst@cn.fujitsu.com
    Reviewed-by: Pankaj Gupta
    Signed-off-by: Hao Li
    Signed-off-by: Vishal Verma

    Hao Li
     

29 Jul, 2020

1 commit

  • Passing size to copy_user_dax implies it can copy variable sizes of data,
    when in fact it calls copy_user_page(), which copies exactly one page.

    We are safe because the only caller uses PAGE_SIZE anyway so just remove
    the variable for clarity.

    While we are at it change copy_user_dax() to copy_cow_page_dax() to make
    it clear it is a singleton helper for this one case not implementing
    what dax_iomap_actor() does.
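
    The rename and the dropped argument look roughly like this (prototypes
    paraphrased from the description, not copied from the diff):

        /* before: a "size" argument that in practice is always PAGE_SIZE */
        static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
                                 sector_t sector, size_t size, struct page *to,
                                 unsigned long vaddr);

        /* after: the helper copies exactly one page for the CoW fault path */
        static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_dev,
                                     sector_t sector, struct page *to,
                                     unsigned long vaddr);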

    Link: https://lore.kernel.org/r/20200717072056.73134-11-ira.weiny@intel.com
    Reviewed-by: Ben Widawsky
    Reviewed-by: Dan Williams
    Signed-off-by: Ira Weiny
    Signed-off-by: Vishal Verma

    Ira Weiny
     

03 Apr, 2020

2 commits

  • Add a helper dax_iomap_zero() to zero a range. This patch basically
    merges __dax_zero_page_range() and iomap_dax_zero().

    Suggested-by: Christoph Hellwig
    Signed-off-by: Vivek Goyal
    Reviewed-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/20200228163456.1587-7-vgoyal@redhat.com
    Signed-off-by: Dan Williams

    Vivek Goyal
     
  • Use the new dax native zero-page method for zeroing a page if the I/O is
    page aligned. Otherwise fall back to direct_access() + memcpy().

    This gets rid of one of the dependencies on the block device in the dax path.
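
    In outline, the zeroing path becomes something like the following sketch,
    assuming the dax_zero_page_range(), dax_direct_access() and dax_flush()
    helpers:

        if (page_aligned) {
                /* let the dax device zero whole pages natively */
                rc = dax_zero_page_range(iomap->dax_dev, pgoff, 1);
        } else {
                /* map the range and zero the sub-page piece by hand */
                rc = dax_direct_access(iomap->dax_dev, pgoff, 1, &kaddr, NULL);
                if (rc < 0)
                        return rc;
                memset(kaddr + offset, 0, size);
                dax_flush(iomap->dax_dev, kaddr + offset, size);
                rc = 0;
        }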

    Signed-off-by: Vivek Goyal
    Link: https://lore.kernel.org/r/20200228163456.1587-6-vgoyal@redhat.com
    Signed-off-by: Dan Williams

    Vivek Goyal
     

06 Feb, 2020

1 commit

  • fstests generic/471 reports a failure when run with MOUNT_OPTIONS="-o
    dax". The reason is that the initial pwrite to an empty file with the
    RWF_NOWAIT flag set does not return -EAGAIN. It turns out that
    dax_iomap_rw doesn't pass that flag through to iomap_apply.

    With this patch applied, generic/471 passes for me.
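
    The fix boils down to translating the iocb flag when dax_iomap_rw() builds
    its iomap flags; a sketch of the pattern, not the literal hunk:

        unsigned int flags = IOMAP_WRITE;       /* write path */

        if (iocb->ki_flags & IOCB_NOWAIT)
                flags |= IOMAP_NOWAIT;          /* lets ->iomap_begin bail out with -EAGAIN */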

    Signed-off-by: Jeff Moyer
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Link: https://lore.kernel.org/r/x49r1z86e1d.fsf@segfault.boston.devel.redhat.com
    Signed-off-by: Dan Williams

    Jeff Moyer
     

04 Jan, 2020

1 commit

  • As of now dax_writeback_mapping_range() takes "struct block_device" as a
    parameter and the dax_dev is looked up from the bdev name. This also
    involves taking a fresh reference on the dax_dev and putting that reference
    at the end of the function.

    We are developing a new filesystem, virtio-fs, which uses dax to access the
    host page cache directly. But there is no block device. IOW, we want to
    make use of dax but get rid of the assumption that there is always a block
    device associated with the dax_dev.

    So pass in "struct dax_device" as parameter instead of bdev.

    ext2/ext4/xfs are the current users and they already hold a reference on
    the dax_device, so there is no need to take and drop a reference on each
    call of this function.
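
    The prototype change described above, paraphrased:

        /* before: look the dax_device up from the block device on every call */
        int dax_writeback_mapping_range(struct address_space *mapping,
                                        struct block_device *bdev,
                                        struct writeback_control *wbc);

        /* after: callers such as ext2/ext4/xfs pass their dax_device directly */
        int dax_writeback_mapping_range(struct address_space *mapping,
                                        struct dax_device *dax_dev,
                                        struct writeback_control *wbc);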

    Suggested-by: Christoph Hellwig
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Vivek Goyal
    Link: https://lore.kernel.org/r/20200103183307.GB13350@redhat.com
    Signed-off-by: Dan Williams

    Vivek Goyal
     

01 Dec, 2019

1 commit

  • Pull iomap updates from Darrick Wong:
    "In this release, we hoisted as much of XFS' writeback code into iomap
    as was practicable, refactored the unshare file data function, added
    the ability to perform buffered io copy on write, and tweaked various
    parts of the directio implementation as needed to port ext4's directio
    code (that will be a separate pull).

    Summary:

    - Make iomap_dio_rw callers explicitly tell us if they want us to
    wait

    - Port the xfs writeback code to iomap to complete the buffered io
    library functions

    - Refactor the unshare code to share common pieces

    - Add support for performing copy on write with buffered writes

    - Other minor fixes

    - Fix unchecked return in iomap_bmap

    - Fix a type casting bug in a ternary statement in
    iomap_dio_bio_actor

    - Improve tracepoints for easier diagnostic ability

    - Fix pipe page leakage in directio reads"

    * tag 'iomap-5.5-merge-11' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (31 commits)
    iomap: Fix pipe page leakage during splicing
    iomap: trace iomap_appply results
    iomap: fix return value of iomap_dio_bio_actor on 32bit systems
    iomap: iomap_bmap should check iomap_apply return value
    iomap: Fix overflow in iomap_page_mkwrite
    fs/iomap: remove redundant check in iomap_dio_rw()
    iomap: use a srcmap for a read-modify-write I/O
    iomap: renumber IOMAP_HOLE to 0
    iomap: use write_begin to read pages to unshare
    iomap: move the zeroing case out of iomap_read_page_sync
    iomap: ignore non-shared or non-data blocks in xfs_file_dirty
    iomap: always use AOP_FLAG_NOFS in iomap_write_begin
    iomap: remove the unused iomap argument to __iomap_write_end
    iomap: better document the IOMAP_F_* flags
    iomap: enhance writeback error message
    iomap: pass a struct page to iomap_finish_page_writeback
    iomap: cleanup iomap_ioend_compare
    iomap: move struct iomap_page out of iomap.h
    iomap: warn on inline maps in iomap_writepage_map
    iomap: lift the xfs writeback code to iomap
    ...

    Linus Torvalds
     

23 Oct, 2019

1 commit

  • Users reported a v5.3 performance regression and inability to establish
    huge page mappings. A revised version of the ndctl "dax.sh" huge page
    unit test identifies commit 23c84eb78375 "dax: Fix missed wakeup with
    PMD faults" as the source.

    Update get_unlocked_entry() to check for NULL entries before checking
    the entry order, otherwise NULL is misinterpreted as a present pte
    conflict. The 'order' check needs to happen before the locked check as
    an unlocked entry at the wrong order must fallback to lookup the correct
    order.

    Reported-by: Jeff Smits
    Reported-by: Doug Nelson
    Cc:
    Fixes: 23c84eb78375 ("dax: Fix missed wakeup with PMD faults")
    Reviewed-by: Jan Kara
    Cc: Jeff Moyer
    Cc: Matthew Wilcox (Oracle)
    Reviewed-by: Johannes Thumshirn
    Link: https://lore.kernel.org/r/157167532455.3945484.11971474077040503994.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams

    Dan Williams
     

21 Oct, 2019

1 commit

  • The srcmap is used to identify where the read is to be performed from.
    It is passed to ->iomap_begin, which can fill it in if we need to read
    data for partially written blocks from a different location than the
    write target. The srcmap is only supported for buffered writes so far.
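
    For reference, the extended ->iomap_begin signature then looks roughly
    like this (sketch; see include/linux/iomap.h for the authoritative
    definition):

        struct iomap_ops {
                int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length,
                                   unsigned flags, struct iomap *iomap,
                                   struct iomap *srcmap);
                int (*iomap_end)(struct inode *inode, loff_t pos, loff_t length,
                                 ssize_t written, unsigned flags,
                                 struct iomap *iomap);
        };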

    Signed-off-by: Goldwyn Rodrigues
    [hch: merged two patches, removed the IOMAP_F_COW flag, use iomap as
    srcmap if not set, adjust length down to srcmap end as well]
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Acked-by: Goldwyn Rodrigues

    Goldwyn Rodrigues
     

06 Aug, 2019

1 commit

  • Vivek:

    "As of now dax_layout_busy_page() calls unmap_mapping_range() with last
    argument as 1, which says even unmap cow pages. I am wondering who needs
    to get rid of cow pages as well.

    I noticed one interesting side effect of this. I mounted xfs with -o dax,
    mmapped a file with MAP_PRIVATE and wrote some data to a page, which
    created a cow page. Then I called fallocate() on that file to zero a page
    of the file. fallocate() called dax_layout_busy_page(), which unmapped the
    cow pages as well, and when I then tried to read back the data I had
    written, what I got was the old data from persistent memory. I lost the
    data I had written: the read simply resulted in a new fault which read the
    data back from persistent memory.

    This sounds wrong. Are there any users which need to unmap cow pages
    as well? If not, I am proposing changing it to not unmap cow pages.

    I noticed this while writing the virtio_fs code: when I tried to reclaim a
    memory range, it corrupted the executable I was running from virtio-fs and
    the program got a segmentation violation."

    Dan:

    "In fact the unmap_mapping_range() in this path is only to synchronize
    against get_user_pages_fast() and force it to call back into the
    filesystem to re-establish the mapping. COW pages should be left
    untouched by dax_layout_busy_page()."

    Cc:
    Fixes: 5fac7408d828 ("mm, fs, dax: handle layout changes to pinned dax mappings")
    Signed-off-by: Vivek Goyal
    Link: https://lore.kernel.org/r/20190802192956.GA3032@redhat.com
    Signed-off-by: Dan Williams

    Vivek Goyal
     

30 Jul, 2019

1 commit

  • The condition checking whether put_unlocked_entry() needs to wake up a
    following waiter got broken by commit 23c84eb78375 ("dax: Fix missed
    wakeup with PMD faults"). We need to wake the waiter whenever the passed
    entry is valid (i.e., non-NULL and not the special conflict entry); the
    broken condition could lead to processes never being woken up when waiting
    for the entry lock. Fix the condition.
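
    The corrected condition, in sketch form (helper names as used in fs/dax.c;
    paraphrased rather than quoted from the patch):

        /* wake the next waiter for any real entry, not only for the
         * special conflict value */
        if (entry && !dax_is_conflict(entry))
                dax_wake_entry(xas, entry, false);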

    Cc:
    Link: http://lore.kernel.org/r/20190729120228.GC17833@quack2.suse.cz
    Fixes: 23c84eb78375 ("dax: Fix missed wakeup with PMD faults")
    Signed-off-by: Jan Kara
    Signed-off-by: Dan Williams

    Jan Kara
     

20 Jul, 2019

1 commit

  • Pull iomap split/cleanup from Darrick Wong:
    "As promised, here's the second part of the iomap merge for 5.3, in
    which we break up iomap.c into smaller files grouped by functional
    area so that it'll be easier in the long run to maintain cohesiveness
    of code units and to review incoming patches. There are no functional
    changes and fs/iomap.c split cleanly.

    Summary:

    - Regroup the fs/iomap.c code by major functional area so that we can
    start development for 5.4 from a more stable base"

    * tag 'iomap-5.3-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    iomap: move internal declarations into fs/iomap/
    iomap: move the main iteration code into a separate file
    iomap: move the buffered IO code into a separate file
    iomap: move the direct IO code into a separate file
    iomap: move the SEEK_HOLE code into a separate file
    iomap: move the file mapping reporting code into a separate file
    iomap: move the swapfile code into a separate file
    iomap: start moving code to fs/iomap/

    Linus Torvalds
     

19 Jul, 2019

1 commit

  • Pull dax updates from Dan Williams:
    "The fruits of a bug hunt in the fsdax implementation with Willy and a
    small feature update for device-dax:

    - Fix a hang condition that started triggering after the Xarray
    conversion of fsdax in the v4.20 kernel.

    - Add a 'resource' (root-only physical base address) sysfs attribute
    to device-dax instances to correlate memory-blocks onlined via the
    kmem driver with a given device instance"

    * tag 'dax-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    dax: Fix missed wakeup with PMD faults
    device-dax: Add a 'resource' attribute

    Linus Torvalds
     

17 Jul, 2019

2 commits

  • Move internal function declarations out of fs/internal.h into
    include/linux/iomap.h so that our transition is complete.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • RocksDB can hang indefinitely when using a DAX file. This is due to
    a bug in the XArray conversion when handling a PMD fault and finding a
    PTE entry. We use the wrong index in the hash and end up waiting on
    the wrong waitqueue.

    There's actually no need to wait; if we find a PTE entry while looking
    for a PMD entry, we can return immediately as we know we should fall
    back to a PTE fault (which may not conflict with the lock held).

    We reuse the XA_RETRY_ENTRY to signal a conflicting entry was found.
    This value can never be found in an XArray while holding its lock, so
    it does not create an ambiguity.
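
    The conflict marker is simply XA_RETRY_ENTRY behind a small helper; a
    sketch of the idea:

        static bool dax_is_conflict(void *entry)
        {
                return entry == XA_RETRY_ENTRY;
        }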

    Cc:
    Link: http://lkml.kernel.org/r/CAPcyv4hwHpX-MkUEqxwdTj7wCCZCN4RV-L4jsnuwLGyL_UEG4A@mail.gmail.com
    Fixes: b15cd800682f ("dax: Convert page fault handlers to XArray")
    Signed-off-by: Matthew Wilcox (Oracle)
    Tested-by: Dan Williams
    Reported-by: Robert Barror
    Reported-by: Seema Pandit
    Reviewed-by: Jan Kara
    Signed-off-by: Dan Williams

    Matthew Wilcox (Oracle)
     

09 Jul, 2019

1 commit

  • Pull locking updates from Ingo Molnar:
    "The main changes in this cycle are:

    - rwsem scalability improvements, phase #2, by Waiman Long, which are
    rather impressive:

    "On a 2-socket 40-core 80-thread Skylake system with 40 reader
    and writer locking threads, the min/mean/max locking operations
    done in a 5-second testing window before the patchset were:

    40 readers, Iterations Min/Mean/Max = 1,807/1,808/1,810
    40 writers, Iterations Min/Mean/Max = 1,807/50,344/151,255

    After the patchset, they became:

    40 readers, Iterations Min/Mean/Max = 30,057/31,359/32,741
    40 writers, Iterations Min/Mean/Max = 94,466/95,845/97,098"

    There are a lot of changes to the locking implementation that make
    it similar to qrwlock, including owner handoff for more fair
    locking.

    Another microbenchmark shows how the improvements hold up across
    the spectrum:

    "With a locking microbenchmark running on 5.1 based kernel, the
    total locking rates (in kops/s) on a 2-socket Skylake system
    with equal numbers of readers and writers (mixed) before and
    after this patchset were:

    # of Threads   Before Patch   After Patch
    ------------   ------------   -----------
         2             2,618          4,193
         4             1,202          3,726
         8               802          3,622
        16               729          3,359
        32               319          2,826
        64               102          2,744"

    The changes are extensive and the patch-set has been through
    several iterations addressing various locking workloads. There
    might be more regressions, but unless they are pathological I
    believe we want to use this new implementation as the baseline
    going forward.

    - jump-label optimizations by Daniel Bristot de Oliveira: the primary
    motivation was to remove IPI disturbance of isolated RT-workload
    CPUs, which resulted in the implementation of batched jump-label
    updates. Beyond improving the kernel's real-time characteristics,
    in one test this patchset reduced static key update
    overhead from 57 msecs to just 1.4 msecs - which is a nice speedup
    as well.

    - atomic64_t cross-arch type cleanups by Mark Rutland: over the last
    ~10 years of atomic64_t existence the various types used by the
    APIs only had to be self-consistent within each architecture -
    which means they became wildly inconsistent across architectures.
    Mark puts an end to this by reworking all the atomic64
    implementations to use 's64' as the base type for atomic64_t, and
    to ensure that this type is consistently used for parameters and
    return values in the API, avoiding further problems in this area.

    - A large set of small improvements to lockdep by Yuyang Du: type
    cleanups, output cleanups, function return type and other cleanups
    all around the place.

    - A set of percpu ops cleanups and fixes by Peter Zijlstra.

    - Misc other changes - please see the Git log for more details"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (82 commits)
    locking/lockdep: increase size of counters for lockdep statistics
    locking/atomics: Use sed(1) instead of non-standard head(1) option
    locking/lockdep: Move mark_lock() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING
    x86/jump_label: Make tp_vec_nr static
    x86/percpu: Optimize raw_cpu_xchg()
    x86/percpu, sched/fair: Avoid local_clock()
    x86/percpu, x86/irq: Relax {set,get}_irq_regs()
    x86/percpu: Relax smp_processor_id()
    x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
    locking/rwsem: Guard against making count negative
    locking/rwsem: Adaptive disabling of reader optimistic spinning
    locking/rwsem: Enable time-based spinning on reader-owned rwsem
    locking/rwsem: Make rwsem->owner an atomic_long_t
    locking/rwsem: Enable readers spinning on writer
    locking/rwsem: Clarify usage of owner's nonspinaable bit
    locking/rwsem: Wake up almost all readers in wait queue
    locking/rwsem: More optimal RT task handling of null owner
    locking/rwsem: Always release wait_lock before waking up tasks
    locking/rwsem: Implement lock handoff to prevent lock starvation
    locking/rwsem: Make rwsem_spin_on_owner() return owner state
    ...

    Linus Torvalds
     

05 Jul, 2019

1 commit


17 Jun, 2019

1 commit

  • All callers of lockdep_assert_held_exclusive() use it to verify the
    correct locking state of either a semaphore (ldisc_sem in tty,
    mmap_sem for perf events, i_rwsem of inode for dax) or rwlock by
    apparmor. Thus it makes sense to rename _exclusive to _write since
    that's the semantics callers care about. Additionally, there is already
    lockdep_assert_held_read(), which this new naming is more consistent with.

    No functional changes.
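
    A call site such as the dax one then only changes name; for example
    (illustrative):

        /* before */
        lockdep_assert_held_exclusive(&inode->i_rwsem);

        /* after: pairs naturally with lockdep_assert_held_read() */
        lockdep_assert_held_write(&inode->i_rwsem);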

    Signed-off-by: Nikolay Borisov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190531100651.3969-1-nborisov@suse.com
    Signed-off-by: Ingo Molnar

    Nikolay Borisov
     

07 Jun, 2019

1 commit

  • When inserting an entry into the xarray, we store the mapping and index in
    the corresponding struct pages for memory error handling. When one process
    was mapping a file at PMD granularity while another process mapped it at
    PTE granularity, we could wrongly deassociate the PMD range and then
    reassociate only the PTE range, leaving the rest of the struct pages in the
    PMD range without mapping information, which could later cause missed
    notifications about memory errors. Fix the problem by calling the
    association / deassociation code if and only if we are really going to
    update the xarray (deassociating and associating zero or empty entries is
    just a no-op, so there's no reason to complicate the code by trying to
    avoid the calls for these cases).

    Cc:
    Fixes: d2c997c0f145 ("fs, dax: use page->mapping to warn if truncate...")
    Signed-off-by: Jan Kara
    Signed-off-by: Dan Williams

    Jan Kara
     

05 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms and conditions of the gnu general public license
    version 2 as published by the free software foundation this program
    is distributed in the hope it will be useful but without any
    warranty without even the implied warranty of merchantability or
    fitness for a particular purpose see the gnu general public license
    for more details

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 263 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Reviewed-by: Alexios Zavras
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190529141901.208660670@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 May, 2019

2 commits

  • MADV_DONTNEED is handled with mmap_sem taken in read mode. We call
    page_mkclean without holding mmap_sem.

    MADV_DONTNEED implies that pages in the region are unmapped and subsequent
    access to the pages in that range is handled as a new page fault. This
    implies that if we don't have parallel access to the region when
    MADV_DONTNEED is run, we expect those ranges to be unallocated.

    W.r.t. page_mkclean(), we need to make sure that we don't break the
    MADV_DONTNEED semantics. MADV_DONTNEED checks for pmd_none without holding
    the pmd_lock, which implies we would skip the pmd if we temporarily marked
    the pmd none. Avoid doing that while marking the page clean.

    Keep the sequence the same for dax too, even though we don't support
    MADV_DONTNEED for dax mappings.

    The bug was noticed by code review and I didn't observe any failures w.r.t
    test run. This is similar to

    commit 58ceeb6bec86d9140f9d91d71a710e963523d063
    Author: Kirill A. Shutemov
    Date: Thu Apr 13 14:56:26 2017 -0700

    thp: fix MADV_DONTNEED vs. MADV_FREE race

    commit ced108037c2aa542b3ed8b7afd1576064ad1362a
    Author: Kirill A. Shutemov
    Date: Thu Apr 13 14:56:20 2017 -0700

    thp: fix MADV_DONTNEED vs. numa balancing race

    Link: http://lkml.kernel.org/r/20190321040610.14226-1-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Reviewed-by: Andrew Morton
    Cc: Dan Williams
    Cc:"Kirill A . Shutemov"
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Starting with c6f3c5ee40c1 ("mm/huge_memory.c: fix modifying of page
    protection by insert_pfn_pmd()") vmf_insert_pfn_pmd() internally calls
    pmdp_set_access_flags(). That helper enforces a pmd aligned @address
    argument via VM_BUG_ON() assertion.

    Update the implementation to take a 'struct vm_fault' argument directly
    and apply the address alignment fixup internally to fix crash signatures
    like:

    kernel BUG at arch/x86/mm/pgtable.c:515!
    invalid opcode: 0000 [#1] SMP NOPTI
    CPU: 51 PID: 43713 Comm: java Tainted: G OE 4.19.35 #1
    [..]
    RIP: 0010:pmdp_set_access_flags+0x48/0x50
    [..]
    Call Trace:
    vmf_insert_pfn_pmd+0x198/0x350
    dax_iomap_fault+0xe82/0x1190
    ext4_dax_huge_fault+0x103/0x1f0
    ? __switch_to_asm+0x40/0x70
    __handle_mm_fault+0x3f6/0x1370
    ? __switch_to_asm+0x34/0x70
    ? __switch_to_asm+0x40/0x70
    handle_mm_fault+0xda/0x200
    __do_page_fault+0x249/0x4f0
    do_page_fault+0x32/0x110
    ? page_fault+0x8/0x30
    page_fault+0x1e/0x30
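
    The interface change, roughly (the old prototype is paraphrased):

        /* before: the caller supplied the address and pmd explicitly */
        vm_fault_t vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
                                      pmd_t *pmd, pfn_t pfn, bool write);

        /* after: take the fault context and align the address internally */
        vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);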

    Link: http://lkml.kernel.org/r/155741946350.372037.11148198430068238140.stgit@dwillia2-desk3.amr.corp.intel.com
    Fixes: c6f3c5ee40c1 ("mm/huge_memory.c: fix modifying of page protection by insert_pfn_pmd()")
    Signed-off-by: Dan Williams
    Reported-by: Piotr Balcer
    Tested-by: Yan Ma
    Tested-by: Pankaj Gupta
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Jan Kara
    Reviewed-by: Aneesh Kumar K.V
    Cc: Chandan Rajendra
    Cc: Souptick Joarder
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

14 Mar, 2019

1 commit

  • Architectures like ppc64 use the deposited page table to store hardware
    page table slot information. Make sure we deposit a page table when
    using zero page at the pmd level for hash.

    Without this we hit

    Unable to handle kernel paging request for data at address 0x00000000
    Faulting instruction address: 0xc000000000082a74
    Oops: Kernel access of bad area, sig: 11 [#1]
    ....

    NIP [c000000000082a74] __hash_page_thp+0x224/0x5b0
    LR [c0000000000829a4] __hash_page_thp+0x154/0x5b0
    Call Trace:
    hash_page_mm+0x43c/0x740
    do_hash_page+0x2c/0x3c
    copy_from_iter_flushcache+0xa4/0x4a0
    pmem_copy_from_iter+0x2c/0x50 [nd_pmem]
    dax_copy_from_iter+0x40/0x70
    dax_iomap_actor+0x134/0x360
    iomap_apply+0xfc/0x1b0
    dax_iomap_rw+0xac/0x130
    ext4_file_write_iter+0x254/0x460 [ext4]
    __vfs_write+0x120/0x1e0
    vfs_write+0xd8/0x220
    SyS_write+0x6c/0x110
    system_call+0x3c/0x130

    Fixes: b5beae5e224f ("powerpc/pseries: Add driver for PAPR SCM regions")
    Cc:
    Reviewed-by: Jan Kara
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Dan Williams

    Aneesh Kumar K.V
     

02 Mar, 2019

1 commit

  • The radix tree would rewind the index in an iterator to the lowest index
    of a multi-slot entry. The XArray iterators instead leave the index
    unchanged, but I overlooked that when converting DAX from the radix tree
    to the XArray. Adjust the index that we use for flushing to the start
    of the PMD range.

    Fixes: c1901cd33cf4 ("page cache: Convert find_get_entries_tag to XArray")
    Cc:
    Reported-by: Piotr Balcer
    Tested-by: Dan Williams
    Reviewed-by: Jan Kara
    Signed-off-by: Matthew Wilcox
    Signed-off-by: Dan Williams

    Matthew Wilcox
     

13 Feb, 2019

2 commits


01 Jan, 2019

1 commit


29 Dec, 2018

1 commit

  • To avoid having to change many call sites every time we want to add a
    parameter, use a structure to group all the parameters for the mmu_notifier
    invalidate_range_start/end calls. No functional changes with this patch.
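
    Call sites then group everything into one structure; a minimal sketch of
    the pattern, with the init helper's argument list abridged:

        struct mmu_notifier_range range;

        mmu_notifier_range_init(&range, mm, start, end);   /* arguments abridged */
        mmu_notifier_invalidate_range_start(&range);
        /* ... modify the page tables ... */
        mmu_notifier_invalidate_range_end(&range);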

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Jan Kara
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Felix Kuehling
    Cc: Ralph Campbell
    Cc: John Hubbard
    From: Jérôme Glisse
    Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3

    fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n

    Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

22 Dec, 2018

1 commit

  • get_unlocked_entry() uses an exclusive wait because it is guaranteed to
    eventually obtain the lock and follow on with an unlock+wakeup cycle.
    The wait_entry_unlocked() path does not have the same guarantee. Rather
    than open-code an extra wakeup, just switch to a non-exclusive wait.
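
    The heart of the change is the choice of wait primitive in
    wait_entry_unlocked(); a sketch, with the surrounding code omitted:

        /* before: exclusive wait, so an extra wakeup had to be open-coded */
        prepare_to_wait_exclusive(wq, &ewait.wait, TASK_UNINTERRUPTIBLE);

        /* after: non-exclusive wait; a normal wake-up reaches every waiter */
        prepare_to_wait(wq, &ewait.wait, TASK_UNINTERRUPTIBLE);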

    Cc: Jan Kara
    Cc: Matthew Wilcox
    Reported-by: Linus Torvalds
    Signed-off-by: Dan Williams

    Dan Williams
     

05 Dec, 2018

1 commit

  • Internal to dax_unlock_mapping_entry(), dax_unlock_entry() is used to
    store a replacement entry in the Xarray at the given xas-index with the
    DAX_LOCKED bit clear. When called, dax_unlock_entry() expects the unlocked
    value of the entry relative to the current Xarray state to be specified.

    In most contexts dax_unlock_entry() is operating in the same scope as
    the matched dax_lock_entry(). However, in the dax_unlock_mapping_entry()
    case the implementation needs to recall the original entry. In the case
    where the original entry is a 'pmd' entry, it is possible that the pfn
    used to do the lookup is misaligned relative to the value retrieved from
    the Xarray.

    Change the api to return the unlock cookie from dax_lock_page() and pass
    it to dax_unlock_page(). This fixes a bug where dax_unlock_page() was
    assuming that the page was PMD-aligned if the entry was a PMD entry with
    signatures like:

    WARNING: CPU: 38 PID: 1396 at fs/dax.c:340 dax_insert_entry+0x2b2/0x2d0
    RIP: 0010:dax_insert_entry+0x2b2/0x2d0
    [..]
    Call Trace:
    dax_iomap_pte_fault.isra.41+0x791/0xde0
    ext4_dax_huge_fault+0x16f/0x1f0
    ? up_read+0x1c/0xa0
    __do_fault+0x1f/0x160
    __handle_mm_fault+0x1033/0x1490
    handle_mm_fault+0x18b/0x3d0
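
    With the new API the caller simply hands the cookie back; a minimal
    sketch, assuming a memory-failure-style user:

        dax_entry_t cookie;

        cookie = dax_lock_page(page);
        if (!cookie)
                return -EBUSY;          /* the entry could not be locked */
        /* ... operate on the locked entry ... */
        dax_unlock_page(page, cookie);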

    Link: https://lkml.kernel.org/r/20181130154902.GL10377@bombadil.infradead.org
    Fixes: 9f32d221301c ("dax: Convert dax_lock_mapping_entry to XArray")
    Reported-by: Dan Williams
    Signed-off-by: Matthew Wilcox
    Tested-by: Dan Williams
    Reviewed-by: Jan Kara
    Signed-off-by: Dan Williams

    Matthew Wilcox
     

29 Nov, 2018

2 commits

  • After we drop the i_pages lock, the inode can be freed at any time.
    The get_unlocked_entry() code has no choice but to reacquire the lock,
    so it can't be used here. Create a new wait_entry_unlocked() which takes
    care not to acquire the lock or dereference the address_space in any way.

    Fixes: c2a7d2a11552 ("filesystem-dax: Introduce dax_lock_mapping_entry()")
    Cc:
    Signed-off-by: Matthew Wilcox
    Reviewed-by: Jan Kara
    Signed-off-by: Dan Williams

    Matthew Wilcox
     
  • If we race with inode destroy, it's possible for page->mapping to be
    NULL before we even enter this routine, as well as after having slept
    waiting for the dax entry to become unlocked.

    Fixes: c2a7d2a11552 ("filesystem-dax: Introduce dax_lock_mapping_entry()")
    Cc:
    Reported-by: Jan Kara
    Signed-off-by: Matthew Wilcox
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Jan Kara
    Signed-off-by: Dan Williams

    Matthew Wilcox
     

19 Nov, 2018

1 commit


18 Nov, 2018

2 commits

  • Using xas_load() with a PMD-sized xa_state would work if either a
    PMD-sized entry was present or a PTE sized entry was present in the
    first 64 entries (of the 512 PTEs in a PMD on x86). If there was no
    PTE in the first 64 entries, grab_mapping_entry() would believe there
    were no entries present, allocate a PMD-sized entry and overwrite the
    PTE in the page cache.

    Use xas_find_conflict() instead which turns out to simplify
    both get_unlocked_entry() and grab_mapping_entry(). Also remove a
    WARN_ON_ONCE from grab_mapping_entry() as it will have already triggered
    in get_unlocked_entry().
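
    The key substitution, in sketch form (error handling omitted):

        /* before: only inspects the slot the PMD-sized xa_state points at */
        entry = xas_load(&xas);

        /* after: reports any entry anywhere in the range the xa_state covers */
        entry = xas_find_conflict(&xas);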

    Fixes: cfc93c6c6c96 ("dax: Convert dax_insert_pfn_mkwrite to XArray")
    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     
  • Device DAX PMD pages do not set the PageHead bit for compound pages.
    Fix for now by retrieving the PMD bit from the entry, but eventually we
    will be passed the page size by the caller.

    Reported-by: Dan Williams
    Fixes: 9f32d221301c ("dax: Convert dax_lock_mapping_entry to XArray")
    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     

17 Nov, 2018

1 commit

  • For the device-dax case, it is possible that the inode can go away
    underneath us. The rcu_read_lock() was there to prevent it from
    being freed, and not (as I thought) to protect the tree. Bring back
    the rcu_read_lock() protection. Also add a little kernel-doc; while
    this function is not exported to modules, it is used from outside dax.c.

    Reported-by: Dan Williams
    Fixes: 9f32d221301c ("dax: Convert dax_lock_mapping_entry to XArray")
    Signed-off-by: Matthew Wilcox

    Matthew Wilcox