14 Oct, 2020

3 commits

  • In support of device-dax growing the ability to front physically
    discontiguous ranges of memory, update devm_memremap_pages() to track
    multiple ranges with a single reference counter and devm instance.

    Convert all [devm_]memremap_pages() users to specify the number of ranges
    they are mapping in their 'struct dev_pagemap' instance.
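
    As a rough sketch, a single-range user now looks something like this
    (the type and range values are illustrative, not taken from this patch):

        struct dev_pagemap pgmap = {
                .type = MEMORY_DEVICE_FS_DAX,
                .nr_range = 1,                  /* number of ranges mapped */
                .range = {
                        .start = res->start,    /* physical span of range 0 */
                        .end = res->end,
                },
        };

        void *addr = devm_memremap_pages(dev, &pgmap);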

    Signed-off-by: Dan Williams
    Signed-off-by: Andrew Morton
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Vishal Verma
    Cc: Vivek Goyal
    Cc: Dave Jiang
    Cc: Ben Skeggs
    Cc: David Airlie
    Cc: Daniel Vetter
    Cc: Ira Weiny
    Cc: Bjorn Helgaas
    Cc: Boris Ostrovsky
    Cc: Juergen Gross
    Cc: Stefano Stabellini
    Cc: "Jérôme Glisse"
    Cc: Ard Biesheuvel
    Cc: Borislav Petkov
    Cc: Brice Goglin
    Cc: Catalin Marinas
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: Greg Kroah-Hartman
    Cc: "H. Peter Anvin"
    Cc: Hulk Robot
    Cc: Ingo Molnar
    Cc: Jason Gunthorpe
    Cc: Jason Yan
    Cc: Jeff Moyer
    Cc: "Jérôme Glisse"
    Cc: Jia He
    Cc: Joao Martins
    Cc: Jonathan Cameron
    Cc: kernel test robot
    Cc: Mike Rapoport
    Cc: Pavel Tatashin
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Randy Dunlap
    Cc: Thomas Gleixner
    Cc: Tom Lendacky
    Cc: Wei Yang
    Cc: Will Deacon
    Link: https://lkml.kernel.org/r/159643103789.4062302.18426128170217903785.stgit@dwillia2-desk3.amr.corp.intel.com
    Link: https://lkml.kernel.org/r/160106116293.30709.13350662794915396198.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • The 'struct resource' in 'struct dev_pagemap' is only used for holding
    resource span information. The other fields ('name', 'flags', 'desc',
    'parent', 'sibling', and 'child') are all unused, wasted space.

    This is in preparation for introducing a multi-range extension of
    devm_memremap_pages().

    The bulk of this change is unwinding all the places internal to libnvdimm
    that used 'struct resource' unnecessarily, and replacing instances of
    'struct dev_pagemap'.res with 'struct dev_pagemap'.range.

    P2PDMA had a minor usage of the resource flags field, but only to report
    failures with "%pR". That is replaced with an open-coded print of the
    range.
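
    A before / after sketch of the conversion (call sites paraphrased; assumes
    a range_len() style helper over 'struct range'):

        /* before: a full 'struct resource' carried in the pgmap */
        size = resource_size(&pgmap->res);
        dev_err(dev, "cannot remap %pR\n", &pgmap->res);

        /* after: only the span is kept */
        size = range_len(&pgmap->range);
        dev_err(dev, "cannot remap %#llx-%#llx\n",
                        pgmap->range.start, pgmap->range.end);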

    [dan.carpenter@oracle.com: mm/hmm/test: use after free in dmirror_allocate_chunk()]
    Link: https://lkml.kernel.org/r/20200926121402.GA7467@kadam

    Signed-off-by: Dan Williams
    Signed-off-by: Dan Carpenter
    Signed-off-by: Andrew Morton
    Reviewed-by: Boris Ostrovsky [xen]
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Vishal Verma
    Cc: Vivek Goyal
    Cc: Dave Jiang
    Cc: Ben Skeggs
    Cc: David Airlie
    Cc: Daniel Vetter
    Cc: Ira Weiny
    Cc: Bjorn Helgaas
    Cc: Juergen Gross
    Cc: Stefano Stabellini
    Cc: "Jérôme Glisse"
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Borislav Petkov
    Cc: Brice Goglin
    Cc: Catalin Marinas
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: Greg Kroah-Hartman
    Cc: "H. Peter Anvin"
    Cc: Hulk Robot
    Cc: Ingo Molnar
    Cc: Jason Gunthorpe
    Cc: Jason Yan
    Cc: Jeff Moyer
    Cc: Jia He
    Cc: Joao Martins
    Cc: Jonathan Cameron
    Cc: kernel test robot
    Cc: Mike Rapoport
    Cc: Pavel Tatashin
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Randy Dunlap
    Cc: Thomas Gleixner
    Cc: Tom Lendacky
    Cc: Wei Yang
    Cc: Will Deacon
    Link: https://lkml.kernel.org/r/159643103173.4062302.768998885691711532.stgit@dwillia2-desk3.amr.corp.intel.com
    Link: https://lkml.kernel.org/r/160106115761.30709.13539840236873663620.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Pull block updates from Jens Axboe:

    - Series of merge handling cleanups (Baolin, Christoph)

    - Series of blk-throttle fixes and cleanups (Baolin)

    - Series cleaning up BDI, separating the block device from the
    backing_dev_info (Christoph)

    - Removal of bdget() as a generic API (Christoph)

    - Removal of blkdev_get() as a generic API (Christoph)

    - Cleanup of is-partition checks (Christoph)

    - Series reworking disk revalidation (Christoph)

    - Series cleaning up bio flags (Christoph)

    - bio crypt fixes (Eric)

    - IO stats inflight tweak (Gabriel)

    - blk-mq tags fixes (Hannes)

    - Buffer invalidation fixes (Jan)

    - Allow soft limits for zone append (Johannes)

    - Shared tag set improvements (John, Kashyap)

    - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)

    - DM no-wait support (Mike, Konstantin)

    - Request allocation improvements (Ming)

    - Allow md/dm/bcache to use IO stat helpers (Song)

    - Series improving blk-iocost (Tejun)

    - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
    Xianting, Yang, Yufen, yangerkun)

    * tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
    block: fix uapi blkzoned.h comments
    blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
    blk-mq: get rid of the dead flush handle code path
    block: get rid of unnecessary local variable
    block: fix comment and add lockdep assert
    blk-mq: use helper function to test hw stopped
    block: use helper function to test queue register
    block: remove redundant mq check
    block: invoke blk_mq_exit_sched no matter whether have .exit_sched
    percpu_ref: don't refer to ref->data if it isn't allocated
    block: ratelimit handle_bad_sector() message
    blk-throttle: Re-use the throtl_set_slice_end()
    blk-throttle: Open code __throtl_de/enqueue_tg()
    blk-throttle: Move service tree validation out of the throtl_rb_first()
    blk-throttle: Move the list operation after list validation
    blk-throttle: Fix IO hang for a corner case
    blk-throttle: Avoid tracking latency if low limit is invalid
    blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
    blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
    block: Remove redundant 'return' statement
    ...

    Linus Torvalds
     

06 Oct, 2020

1 commit

  • In reaction to a proposal to introduce a memcpy_mcsafe_fast()
    implementation, Linus points out that memcpy_mcsafe() is poorly named
    relative to communicating the scope of the interface: specifically, which
    addresses are valid to pass as source and destination, and which faults /
    exceptions are handled.

    Of particular concern is that even though x86 might be able to handle
    the semantics of copy_mc_to_user() with its common copy_user_generic()
    implementation, other architectures likely need / want an explicit path
    for this case:

    On Fri, May 1, 2020 at 11:28 AM Linus Torvalds wrote:
    >
    > On Thu, Apr 30, 2020 at 6:21 PM Dan Williams wrote:
    > >
    > > However now I see that copy_user_generic() works for the wrong reason.
    > > It works because the exception on the source address due to poison
    > > looks no different than a write fault on the user address to the
    > > caller, it's still just a short copy. So it makes copy_to_user() work
    > > for the wrong reason relative to the name.
    >
    > Right.
    >
    > And it won't work that way on other architectures. On x86, we have a
    > generic function that can take faults on either side, and we use it
    > for both cases (and for the "in_user" case too), but that's an
    > artifact of the architecture oddity.
    >
    > In fact, it's probably wrong even on x86 - because it can hide bugs -
    > but writing those things is painful enough that everybody prefers
    > having just one function.

    Replace the single top-level memcpy_mcsafe() with either
    copy_mc_to_user() or copy_mc_to_kernel().

    Introduce an x86 copy_mc_fragile() name as the rename for the
    low-level x86 implementation formerly named memcpy_mcsafe(). It is used
    as the slow / careful backend that is supplanted by a fast
    copy_mc_generic() in a follow-on patch.

    One side-effect of this reorganization is that separating copy_mc_64.S
    to its own file means that perf no longer needs to track dependencies
    for its memcpy_64.S benchmarks.
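
    Roughly, call sites change along these lines (sketch; error handling
    abbreviated, and as before a non-zero return means bytes left uncopied):

        /* before: one interface regardless of the destination */
        rem = memcpy_mcsafe(to, from, len);

        /* after: the destination address space selects the interface */
        rem = copy_mc_to_kernel(to, from, len);         /* kernel destination */
        rem = copy_mc_to_user(uto, from, len);          /* user destination */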

    [ bp: Massage a bit. ]

    Signed-off-by: Dan Williams
    Signed-off-by: Borislav Petkov
    Reviewed-by: Tony Luck
    Acked-by: Michael Ellerman
    Cc:
    Link: http://lore.kernel.org/r/CAHk-=wjSqtXAqfUJxFtWNwmguFASTgB0dz1dT3V-78Quiezqbg@mail.gmail.com
    Link: https://lkml.kernel.org/r/160195561680.2163339.11574962055305783722.stgit@dwillia2-desk3.amr.corp.intel.com

    Dan Williams
     

25 Sep, 2020

1 commit

  • BDI_CAP_SYNCHRONOUS_IO is only checked in the swap code, and used to
    decide if ->rw_page can be used on a block device. Just check for the
    presence of the method instead. The only complication is that zram needs
    a second set of block_device_operations, as it can switch between modes
    that actually support ->rw_page and those that don't.
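
    The swap-side check becomes a simple method lookup, roughly (paraphrased,
    not the exact call site):

        /* before */
        if (bdi_cap_synchronous_io(inode_to_bdi(inode)))
                p->flags |= SWP_SYNCHRONOUS_IO;

        /* after: just ask whether ->rw_page exists */
        if (p->bdev && p->bdev->bd_disk->fops->rw_page)
                p->flags |= SWP_SYNCHRONOUS_IO;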

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

02 Sep, 2020

1 commit

  • The nvdimm block driver abuses revalidate_disk in a strange way, one
    totally unrelated to what other drivers do. Simplify this by just
    calling nvdimm_revalidate_disk (which seems rather misnamed) from the
    probe routines, as the additional bdev size revalidation is pointless
    at this point, and remove the revalidate_disk methods given that they
    can only be triggered from add_disk, which runs right before the
    manual calls.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Josef Bacik
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

15 Aug, 2020

1 commit

  • This function returns the number of bytes in a THP. It is like
    page_size(), but compiles to just PAGE_SIZE if CONFIG_TRANSPARENT_HUGEPAGE
    is disabled.
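
    Assuming this refers to the thp_size() helper, the shape of the definition
    is roughly (sketch, not the verbatim upstream code):

        #ifdef CONFIG_TRANSPARENT_HUGEPAGE
        static inline unsigned long thp_size(struct page *page)
        {
                return page_size(page);   /* PAGE_SIZE << compound_order(page) */
        }
        #else
        #define thp_size(page) PAGE_SIZE  /* constant-folds when THP is off */
        #endif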

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Cc: David Hildenbrand
    Cc: Mike Kravetz
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200629151959.15779-5-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

01 Jul, 2020

1 commit

  • The make_request_fn is a little weird in that it sits directly in
    struct request_queue instead of an operation vector. Replace it with
    a block_device_operations method called submit_bio (which describes much
    better what it does). Also remove the request_queue argument to it, as
    the queue can be derived pretty trivially from the bio.
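
    A sketch of a converted driver (pmem-style names used purely for
    illustration):

        static blk_qc_t pmem_submit_bio(struct bio *bio)
        {
                /* the queue is derived from the bio rather than passed in */
                struct pmem_device *pmem = bio->bi_disk->private_data;

                /* ... perform the I/O against pmem ... */
                bio_endio(bio);
                return BLK_QC_T_NONE;
        }

        static const struct block_device_operations pmem_fops = {
                .owner          = THIS_MODULE,
                .submit_bio     = pmem_submit_bio,
        };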

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

09 Jun, 2020

1 commit

  • This seems to lead to some crazy include loops when using
    asm-generic/cacheflush.h on more architectures, so leave it to the arch
    header for now.

    [hch@lst.de: fix warning]
    Link: http://lkml.kernel.org/r/20200520173520.GA11199@lst.de

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Will Deacon
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Anton Ivanov
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: "H. Peter Anvin"
    Cc: Dan Williams
    Cc: Vishal Verma
    Cc: Dave Jiang
    Cc: Keith Busch
    Cc: Ira Weiny
    Cc: Arnd Bergmann
    Link: http://lkml.kernel.org/r/20200515143646.3857579-7-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

09 Apr, 2020

1 commit

  • Pull libnvdimm and dax updates from Dan Williams:
    "There were multiple touches outside of drivers/nvdimm/ this round to
    add cross arch compatibility to the devm_memremap_pages() interface,
    enhance numa information for persistent memory ranges, and add a
    zero_page_range() dax operation.

    This cycle I switched from the patchwork api to Konstantin's b4 script
    for collecting tags (from x86, PowerPC, filesystem, and device-mapper
    folks), and everything looks to have gone ok there. This has all
    appeared in -next with no reported issues.

    Summary:

    - Add support for region alignment configuration and enforcement to
    fix compatibility across architectures and PowerPC page size
    configurations.

    - Introduce 'zero_page_range' as a dax operation. This facilitates
    filesystem-dax operation without a block-device.

    - Introduce phys_to_target_node() to facilitate drivers that want to
    know resulting numa node if a given reserved address range was
    onlined.

    - Advertise a persistence-domain for of_pmem and papr_scm. The
    persistence domain indicates where cpu-store cycles need to reach
    in the platform-memory subsystem before the platform will consider
    them power-fail protected.

    - Promote numa_map_to_online_node() to a cross-kernel generic
    facility.

    - Save x86 numa information to allow for node-id lookups for reserved
    memory ranges, deploy that capability for the e820-pmem driver.

    - Pick up some miscellaneous minor fixes that missed v5.6-final,
    including some smatch reports in the ioctl path and some unit
    test compilation fixups.

    - Fixup some flexible-array declarations"

    * tag 'libnvdimm-for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (29 commits)
    dax: Move mandatory ->zero_page_range() check in alloc_dax()
    dax,iomap: Add helper dax_iomap_zero() to zero a range
    dax: Use new dax zero page method for zeroing a page
    dm,dax: Add dax zero_page_range operation
    s390,dcssblk,dax: Add dax zero_page_range operation to dcssblk driver
    dax, pmem: Add a dax operation zero_page_range
    pmem: Add functions for reading/writing page to/from pmem
    libnvdimm: Update persistence domain value for of_pmem and papr_scm device
    tools/test/nvdimm: Fix out of tree build
    libnvdimm/region: Fix build error
    libnvdimm/region: Replace zero-length array with flexible-array member
    libnvdimm/label: Replace zero-length array with flexible-array member
    ACPI: NFIT: Replace zero-length array with flexible-array member
    libnvdimm/region: Introduce an 'align' attribute
    libnvdimm/region: Introduce NDD_LABELING
    libnvdimm/namespace: Enforce memremap_compat_align()
    libnvdimm/pfn: Prevent raw mode fallback if pfn-infoblock valid
    libnvdimm: Out of bounds read in __nd_ioctl()
    acpi/nfit: improve bounds checking for 'func'
    mm/memremap_pages: Introduce memremap_compat_align()
    ...

    Linus Torvalds
     

03 Apr, 2020

3 commits

  • The zero_page_range() dax operation is mandatory for dax devices. Right
    now that check happens in the dax_zero_page_range() function. Dan thinks
    that's too late and it's better to do the check earlier, in alloc_dax().

    I also modified alloc_dax() to return a pointer with an error code in it
    in case of failure. Right now it returns NULL and the caller assumes the
    failure happened due to -ENOMEM. But with this ->zero_page_range() check,
    I need to return -EINVAL instead.
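
    Callers change from a NULL check to an ERR_PTR check, roughly (sketch;
    argument names illustrative):

        dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops, flags);
        if (IS_ERR(dax_dev))
                return PTR_ERR(dax_dev);        /* -EINVAL, -ENOMEM, ... */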

    Signed-off-by: Vivek Goyal
    Link: https://lore.kernel.org/r/20200401161125.GB9398@redhat.com
    Signed-off-by: Dan Williams

    Vivek Goyal
     
  • Add a dax operation zero_page_range, to zero a page. This will also clear any
    known poison in the page being zeroed.

    As of now, zeroing of one page is allowed in a single call. There are no
    callers which are trying to zero more than a page in a single call. Once
    we grow callers which zero more than a page in a single call, we can add
    that support. The primary reason for not doing that yet is that it would
    add a little complexity to the dm implementation, where a range might span
    multiple underlying targets and one would have to split the range into
    multiple sub-ranges and call zero_page_range() on the individual targets.
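
    The new hook hangs off 'struct dax_operations'; roughly (sketch, with pmem
    as the illustrative provider):

        struct dax_operations {
                /* ... existing ops ... */
                /* zero nr_pages pages at pgoff, clearing known poison */
                int (*zero_page_range)(struct dax_device *dax_dev,
                                       pgoff_t pgoff, size_t nr_pages);
        };

        static const struct dax_operations pmem_dax_ops = {
                /* ... */
                .zero_page_range = pmem_dax_zero_page_range,
        };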

    Suggested-by: Christoph Hellwig
    Signed-off-by: Vivek Goyal
    Reviewed-by: Pankaj Gupta
    Link: https://lore.kernel.org/r/20200228163456.1587-3-vgoyal@redhat.com
    Signed-off-by: Dan Williams

    Vivek Goyal
     
  • This splits pmem_do_bvec() into pmem_do_read() and pmem_do_write().
    pmem_do_write() will also be used by the pmem zero_page_range()
    implementation, hence the split to share the same code.
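
    The split looks roughly like this (prototypes paraphrased; the exact
    parameter list may differ):

        static blk_status_t pmem_do_read(struct pmem_device *pmem,
                        struct page *page, unsigned int page_off,
                        sector_t sector, unsigned int len);
        static blk_status_t pmem_do_write(struct pmem_device *pmem,
                        struct page *page, unsigned int page_off,
                        sector_t sector, unsigned int len);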

    Suggested-by: Christoph Hellwig
    Signed-off-by: Vivek Goyal
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Pankaj Gupta
    Link: https://lore.kernel.org/r/20200228163456.1587-2-vgoyal@redhat.com
    Signed-off-by: Dan Williams

    Vivek Goyal
     

28 Mar, 2020

1 commit

  • Current make_request-based drivers use either blk_alloc_queue_node or
    blk_alloc_queue to allocate a queue, and then set up the make_request_fn
    function pointer and a few parameters using the blk_queue_make_request
    helper. Simplify this by passing the make_request pointer to
    blk_alloc_queue, and while at it merge the _node variant into the main
    helper by always passing a node_id, and remove the superfluous gfp_mask
    parameter. A lower-level __blk_alloc_queue is kept for the blk-mq case.
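
    For a driver the change boils down to (sketch; pmem-style names
    illustrative):

        /* before */
        q = blk_alloc_queue_node(GFP_KERNEL, dev_to_node(dev));
        blk_queue_make_request(q, pmem_make_request);

        /* after: make_request pointer and numa node passed directly */
        q = blk_alloc_queue(pmem_make_request, dev_to_node(dev));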

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

01 Feb, 2020

1 commit

  • After the removal of the device-public infrastructure there are only two
    ->page_free() callbacks in the kernel. One of those is a
    device-private callback in the nouveau driver, the other is a generic
    wakeup needed in the DAX case. In the hopes that all ->page_free()
    callbacks can be migrated to common core kernel functionality, move the
    device-private specific actions in __put_devmap_managed_page() under the
    is_device_private_page() conditional, including the ->page_free()
    callback. For the other page types just open-code the generic wakeup.

    Yes, the wakeup is only needed in the MEMORY_DEVICE_FSDAX case, but it
    does no harm in the MEMORY_DEVICE_DEVDAX and MEMORY_DEVICE_PCI_P2PDMA
    case.
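
    Schematically, the final-put path becomes (sketch; details elided):

        if (is_device_private_page(page)) {
                /* device-private specific teardown, then the driver hook */
                page->pgmap->ops->page_free(page);
        } else {
                /* FSDAX / DEVDAX / PCI_P2PDMA: just the generic wakeup */
                wake_up_var(&page->_refcount);
        }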

    Link: http://lkml.kernel.org/r/20200107224558.2362728-4-jhubbard@nvidia.com
    Signed-off-by: Dan Williams
    Signed-off-by: John Hubbard
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jérôme Glisse
    Cc: Jan Kara
    Cc: Ira Weiny
    Cc: Alex Williamson
    Cc: Aneesh Kumar K.V
    Cc: Björn Töpel
    Cc: Daniel Vetter
    Cc: Hans Verkuil
    Cc: Jason Gunthorpe
    Cc: Jens Axboe
    Cc: Jonathan Corbet
    Cc: Kirill A. Shutemov
    Cc: Leon Romanovsky
    Cc: Mauro Carvalho Chehab
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

18 Nov, 2019

1 commit

  • The entire point of nd-core.h is to hide functionality that no leaf
    driver should touch. In fact, the commit that added it had no need to
    include it.

    Fixes: 06e8ccdab15f ("acpi: nfit: Add support for detect platform...")
    Cc: Ira Weiny
    Cc: Dave Jiang
    Cc: Vishal Verma
    Signed-off-by: Dan Williams

    Dan Williams
     

15 Nov, 2019

1 commit

  • The nvdimm core currently maps the full namespace to an ioremap range
    while probing the namespace mode. This can result in probe failures on
    architectures that have limited ioremap space.

    For example, with a large btt namespace that consumes most of the ioremap
    range, depending on the sequence of namespace initialization, the user can
    see a pfn namespace initialization failure because the ioremap space the
    nvdimm core uses for temporary mappings is unavailable.

    The nvdimm core can avoid this failure by mapping only the reserved
    info-block area to check the pfn superblock type, and mapping the full
    namespace resource only just before the namespace is used.

    Given that personalities like BTT can be layered on top of any namespace
    type, create a generic form of devm_nsio_enable (devm_namespace_enable)
    and use it inside the per-personality attach routines. Now
    devm_namespace_enable() is always paired with a disable unless the mapping
    is going to be used for long-term runtime access.
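
    The resulting pattern in an attach routine is roughly (sketch; the size
    argument and helper names are illustrative):

        rc = devm_namespace_enable(dev, ndns, info_block_reserve);
        if (rc)
                return rc;
        /* ... read and validate the info block ... */
        devm_namespace_disable(dev, ndns);  /* re-enabled for runtime use */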

    Signed-off-by: Aneesh Kumar K.V
    Link: https://lore.kernel.org/r/20191017073308.32645-1-aneesh.kumar@linux.ibm.com
    [djbw: reworks to move devm_namespace_{en,dis}able into *attach helpers]
    Reported-by: kbuild test robot
    Link: https://lore.kernel.org/r/20191031105741.102793-2-aneesh.kumar@linux.ibm.com
    Signed-off-by: Dan Williams

    Aneesh Kumar K.V
     

06 Sep, 2019

1 commit

  • In order to support marking namespaces with unsupported features/versions
    as disabled, the nvdimm core should advance the namespace seed on these
    probe failures. Otherwise, these failed namespaces will be considered seed
    namespaces and will be wrongly used while creating new namespaces.

    Return -EOPNOTSUPP from the pmem probe callback to indicate namespace
    initialization failures due to a pfn superblock feature/version mismatch.

    Signed-off-by: Aneesh Kumar K.V
    Link: https://lore.kernel.org/r/20190905154603.10349-3-aneesh.kumar@linux.ibm.com
    Signed-off-by: Dan Williams

    Aneesh Kumar K.V
     

27 Jul, 2019

1 commit

  • Pull libnvdimm fixes from Dan Williams:
    "A collection of locking and async operations fixes for v5.3-rc2. These
    had been soaking in a branch targeting the merge window, but missed
    due to a regression hunt. This fixed up version has otherwise been in
    -next this past week with no reported issues.

    In order to gain confidence in the locking changes the pull also
    includes a debug / instrumentation patch to enable lockdep coverage
    for libnvdimm subsystem operations that depend on the device_lock for
    exclusion. As mentioned in the changelog it is a hack, but it works
    and documents the locking expectations of the sub-system in a way that
    others can use lockdep to verify. The driver core touches got an ack
    from Greg.

    Summary:

    - Fix duplicate device_unregister() calls (multiple threads competing
    to do unregister work when scheduling device removal from a sysfs
    attribute of the self-same device).

    - Fix badblocks registration order bug. Ensure region badblocks are
    initialized in advance of namespace registration.

    - Fix a deadlock between the bus lock and probe operations.

    - Export device-core infrastructure to coordinate async operations
    via the device ->dead state.

    - Add device-core infrastructure to validate device_lock() usage with
    lockdep"

    * tag 'libnvdimm-fixes-5.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    driver-core, libnvdimm: Let device subsystems add local lockdep coverage
    libnvdimm/bus: Fix wait_nvdimm_bus_probe_idle() ABBA deadlock
    libnvdimm/bus: Stop holding nvdimm_bus_list_mutex over __nd_ioctl()
    libnvdimm/bus: Prepare the nd_ioctl() path to be re-entrant
    libnvdimm/region: Register badblocks before namespaces
    libnvdimm/bus: Prevent duplicate device_unregister() calls
    drivers/base: Introduce kill_device()

    Linus Torvalds
     

19 Jul, 2019

2 commits

  • For good reason, the standard device_lock() is marked
    lockdep_set_novalidate_class() because there is simply no sane way to
    describe the myriad ways the device_lock() is ordered with other locks.
    However, that leaves subsystems that know their own local device_lock()
    ordering rules to find lock ordering mistakes manually. Instead,
    introduce an optional / additional lockdep-enabled lock that a subsystem
    can acquire in all the same paths that the device_lock() is acquired.

    A conversion of the NFIT driver and NVDIMM subsystem to a
    lockdep-validated device_lock() scheme is included. The
    debug_nvdimm_lock() implementation provides the correct lock class and
    stacking order for the libnvdimm device topology hierarchy.

    Yes, this is a hack, but hopefully it is a useful hack for other
    subsystems device_lock() debug sessions. Quoting Greg:

    "Yeah, it feels a bit hacky but it's really up to a subsystem to mess up
    using it as much as anything else, so user beware :)

    I don't object to it if it makes things easier for you to debug."
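
    The pattern, stripped to its essence (illustrative names, not the exact
    libnvdimm helpers):

        static void nd_device_lock(struct device *dev)
        {
                device_lock(dev);       /* novalidate class, opaque to lockdep */
                mutex_lock(&to_nd_debug(dev)->lock);    /* lockdep-visible shadow */
        }

        static void nd_device_unlock(struct device *dev)
        {
                mutex_unlock(&to_nd_debug(dev)->lock);
                device_unlock(dev);
        }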

    Cc: Ingo Molnar
    Cc: Ira Weiny
    Cc: Will Deacon
    Cc: Dave Jiang
    Cc: Keith Busch
    Cc: Peter Zijlstra
    Cc: Vishal Verma
    Cc: "Rafael J. Wysocki"
    Cc: Greg Kroah-Hartman
    Signed-off-by: Dan Williams
    Acked-by: Greg Kroah-Hartman
    Reviewed-by: Ira Weiny
    Link: https://lore.kernel.org/r/156341210661.292348.7014034644265455704.stgit@dwillia2-desk3.amr.corp.intel.com

    Dan Williams
     
  • Pull libnvdimm updates from Dan Williams:
    "Primarily just the virtio_pmem driver:

    - virtio_pmem

    The new virtio_pmem facility introduces a paravirtualized
    persistent memory device that allows a guest VM to use DAX
    mechanisms to access a host-file with host-page-cache. It arranges
    for MAP_SYNC to be disabled and instead triggers a host fsync()
    when a 'write-cache flush' command is sent to the virtual disk
    device.

    - Miscellaneous small fixups"

    * tag 'libnvdimm-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    virtio_pmem: fix sparse warning
    xfs: disable map_sync for async flush
    ext4: disable map_sync for async flush
    dax: check synchronous mapping is supported
    dm: enable synchronous dax
    libnvdimm: add dax_dev sync flag
    virtio-pmem: Add virtio pmem driver
    libnvdimm: nd_region flush callback support
    libnvdimm, namespace: Drop uuid_t implementation detail

    Linus Torvalds
     

06 Jul, 2019

2 commits

  • This patch adds a 'DAXDEV_SYNC' flag which is set for an nd_region that
    performs synchronous flushes. It is later used to disable MAP_SYNC
    functionality in the ext4 and xfs filesystems for devices that don't
    support synchronous flush.
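
    A sketch of how the flag flows in at dax-device allocation time (call site
    paraphrased):

        unsigned long flags = 0;

        if (is_nvdimm_sync(nd_region))
                flags |= DAXDEV_F_SYNC;
        dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops, flags);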

    Signed-off-by: Pankaj Gupta
    Signed-off-by: Dan Williams

    Pankaj Gupta
     
  • This patch adds functionality to perform a flush from guest to host over
    VIRTIO. A callback is registered based on the 'nd_region' type: the
    virtio_pmem driver requires this special flush function, while the rest
    of the region types register the existing flush function. Errors returned
    by a host fsync failure are reported to userspace.

    Signed-off-by: Pankaj Gupta
    Signed-off-by: Dan Williams

    Pankaj Gupta
     

03 Jul, 2019

5 commits

  • Add a flags field to struct dev_pagemap to replace the altmap_valid
    boolean and be a little more extensible. Also add a pgmap_altmap() helper
    to find the optional altmap, and use it to clean up the code that accesses
    the altmap.
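
    The helper is roughly (sketch):

        static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap)
        {
                if (pgmap->flags & PGMAP_ALTMAP_VALID)
                        return &pgmap->altmap;
                return NULL;
        }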

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ira Weiny
    Reviewed-by: Dan Williams
    Tested-by: Dan Williams
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • struct dev_pagemap is always embedded into a containing structure, so
    there is no need for an additional private data field.
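
    With the pgmap embedded, the containing structure is recovered with
    container_of() instead of a private pointer; a pmem-flavored illustration:

        struct pmem_device {
                /* ... */
                struct dev_pagemap pgmap;
        };

        static struct pmem_device *to_pmem(struct dev_pagemap *pgmap)
        {
                return container_of(pgmap, struct pmem_device, pgmap);
        }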

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: Dan Williams
    Tested-by: Dan Williams
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • Just check if there is a ->page_free operation set and take care of the
    static key enable, as well as the put using device managed resources.
    Also check that a ->page_free is provided for the pgmap types that
    require it, and check for a valid type as well while we are at it.

    Note that this also fixes the fact that hmm never called
    dev_pagemap_put_ops and thus would leave the slow path enabled forever,
    even after a device driver unload or disable.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ira Weiny
    Reviewed-by: Dan Williams
    Tested-by: Dan Williams
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • Passing the actual typed structure leads to more understandable code
    vs just passing the ref member.

    Reported-by: Logan Gunthorpe
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Logan Gunthorpe
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: Dan Williams
    Tested-by: Dan Williams
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • The dev_pagemap is growing too many callbacks. Move them into a
    separate ops structure so that they are not duplicated for multiple
    instances, and an attacker can't easily overwrite them.
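
    The rough shape of the new structure (membership and signatures evolved
    across this series; treat this as a sketch):

        struct dev_pagemap_ops {
                void (*page_free)(struct page *page);
                void (*kill)(struct dev_pagemap *pgmap);
                void (*cleanup)(struct dev_pagemap *pgmap);
                vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
        };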

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Logan Gunthorpe
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: Dan Williams
    Tested-by: Dan Williams
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     

14 Jun, 2019

1 commit

  • Logan noticed that devm_memremap_pages_release() kills the percpu_ref,
    drops all the page references that were acquired at init, and then
    immediately proceeds to unplug (arch_remove_memory()) the backing pages
    for the pagemap. If for some reason device shutdown actually collides
    with a busy / elevated-ref-count page, then arch_remove_memory() should
    be deferred until after that reference is dropped.

    As it stands the "wait for last page ref drop" happens *after*
    devm_memremap_pages_release() returns, which is obviously too late and
    can lead to crashes.

    Fix this situation by assigning the responsibility to wait for the
    percpu_ref to go idle to devm_memremap_pages() with a new ->cleanup()
    callback. Implement the new cleanup callback for all
    devm_memremap_pages() users: pmem, devdax, hmm, and p2pdma.
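
    Sketch of the resulting contract for a user (pmem-style names,
    illustrative):

        pmem->pgmap.ref = &q->q_usage_counter;
        pmem->pgmap.kill = pmem_freeze_queue;       /* stop new references */
        pmem->pgmap.cleanup = pmem_release_queue;   /* wait for the final put */
        addr = devm_memremap_pages(dev, &pmem->pgmap);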

    Link: http://lkml.kernel.org/r/155727339156.292046.5432007428235387859.stgit@dwillia2-desk3.amr.corp.intel.com
    Fixes: 41e94a851304 ("add devm_memremap_pages")
    Signed-off-by: Dan Williams
    Reported-by: Logan Gunthorpe
    Reviewed-by: Ira Weiny
    Reviewed-by: Logan Gunthorpe
    Cc: Bjorn Helgaas
    Cc: "Jérôme Glisse"
    Cc: Christoph Hellwig
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

05 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms and conditions of the gnu general public license
    version 2 as published by the free software foundation this program
    is distributed in the hope it will be useful but without any
    warranty without even the implied warranty of merchantability or
    fitness for a particular purpose see the gnu general public license
    for more details

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 263 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Reviewed-by: Alexios Zavras
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190529141901.208660670@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

21 May, 2019

2 commits

  • Jeff discovered that performance improves from ~375K iops to ~519K iops
    on a simple psync-write fio workload when moving the location of 'struct
    page' from the default PMEM location to DRAM. This result is surprising
    because the expectation is that 'struct page' for dax is only needed for
    third party references to dax mappings. For example, a dax-mapped buffer
    passed to another system call for direct-I/O requires 'struct page' for
    sending the request down the driver stack and pinning the page. There is
    no usage of 'struct page' for first party access to a file via
    read(2)/write(2) and friends.

    However, this "no page needed" expectation is violated by
    CONFIG_HARDENED_USERCOPY and the check_copy_size() performed in
    copy_from_iter_full_nocache() and copy_to_iter_mcsafe(). The
    check_heap_object() helper routine assumes the buffer is backed by a
    slab allocator (DRAM) page and applies some checks. Those checks are
    invalid (dax pages do not originate from the slab) and redundant
    (dax_iomap_actor() has already validated that the I/O is within bounds).
    Specifically that routine validates that the logical file offset is
    within bounds of the file, then it does a sector-to-pfn translation
    which validates that the physical mapping is within bounds of the block
    device.

    Bypass additional hardened usercopy overhead and call the 'no check'
    versions of the copy_{to,from}_iter operations directly.
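
    In the pmem dax copy routines this amounts to switching to the
    leading-underscore, no-check variants (sketch):

        /* before: routed through check_copy_size() / check_heap_object() */
        return copy_from_iter_flushcache(addr, bytes, i);

        /* after: the '_' variant skips the hardened-usercopy heap checks */
        return _copy_from_iter_flushcache(addr, bytes, i);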

    Fixes: 0aed55af8834 ("x86, uaccess: introduce copy_from_iter_flushcache...")
    Cc:
    Cc: Jeff Moyer
    Cc: Ingo Molnar
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Thomas Gleixner
    Cc: Matthew Wilcox
    Reported-and-tested-by: Jeff Smits
    Acked-by: Kees Cook
    Acked-by: Jan Kara
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Pankaj reports that starting with commit ad428cdb525a "dax: Check the
    end of the block-device capacity with dax_direct_access()" device-mapper
    no longer allows dax operation. This results from the stricter checks in
    __bdev_dax_supported() that validate that the start and end of a
    block-device map to the same 'pagemap' instance.

    Teach the dax-core and device-mapper to validate the 'pagemap' on a
    per-target basis. This is accomplished by refactoring the
    bdev_dax_supported() internals into generic_fsdax_supported() which
    takes a sector range to validate. Consequently generic_fsdax_supported()
    is suitable to be used in a device-mapper ->iterate_devices() callback.
    A new ->dax_supported() operation is added to allow composite devices to
    split and route upper-level bdev_dax_supported() requests.
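
    Schematically, device-mapper can now answer the question per target
    (sketch; the callback name and the generic_fsdax_supported() signature
    here are assumptions):

        static int device_supports_dax(struct dm_target *ti, struct dm_dev *dev,
                                       sector_t start, sector_t len, void *data)
        {
                return generic_fsdax_supported(dev->dax_dev, dev->bdev,
                                               PAGE_SIZE, start, len);
        }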

    Fixes: ad428cdb525a ("dax: Check the end of the block-device...")
    Cc:
    Cc: Ira Weiny
    Cc: Dave Jiang
    Cc: Keith Busch
    Cc: Matthew Wilcox
    Cc: Vishal Verma
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Reviewed-by: Jan Kara
    Reported-by: Pankaj Gupta
    Reviewed-by: Pankaj Gupta
    Tested-by: Pankaj Gupta
    Tested-by: Vaibhav Jain
    Reviewed-by: Mike Snitzer
    Signed-off-by: Dan Williams

    Dan Williams
     

08 Apr, 2019

1 commit

  • If the offset is not zero and the length is bigger than PAGE_SIZE, this
    will cause an out-of-bounds access to the page memory.

    Fixes: 98cc093cba1e ("block, THP: make block_device_operations.rw_page support THP")
    Co-developed-by: Liang ZhiCheng
    Signed-off-by: Liang ZhiCheng
    Signed-off-by: Li RongQing
    Reviewed-by: Ira Weiny
    Reviewed-by: Jeff Moyer
    Signed-off-by: Dan Williams

    Li RongQing
     

29 Dec, 2018

2 commits

  • Merge misc updates from Andrew Morton:

    - large KASAN update to use arm's "software tag-based mode"

    - a few misc things

    - sh updates

    - ocfs2 updates

    - just about all of MM

    * emailed patches from Andrew Morton: (167 commits)
    kernel/fork.c: mark 'stack_vm_area' with __maybe_unused
    memcg, oom: notify on oom killer invocation from the charge path
    mm, swap: fix swapoff with KSM pages
    include/linux/gfp.h: fix typo
    mm/hmm: fix memremap.h, move dev_page_fault_t callback to hmm
    hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
    hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
    memory_hotplug: add missing newlines to debugging output
    mm: remove __hugepage_set_anon_rmap()
    include/linux/vmstat.h: remove unused page state adjustment macro
    mm/page_alloc.c: allow error injection
    mm: migrate: drop unused argument of migrate_page_move_mapping()
    blkdev: avoid migration stalls for blkdev pages
    mm: migrate: provide buffer_migrate_page_norefs()
    mm: migrate: move migrate_page_lock_buffers()
    mm: migrate: lock buffers before migrate_page_move_mapping()
    mm: migration: factor out code to compute expected number of page references
    mm, page_alloc: enable pcpu_drain with zone capability
    kmemleak: add config to select auto scan
    mm/page_alloc.c: don't call kasan_free_pages() at deferred mem init
    ...

    Linus Torvalds
     
  • The last step before devm_memremap_pages() returns success is to allocate
    a release action, devm_memremap_pages_release(), to tear the entire setup
    down. However, the result from devm_add_action() is not checked.

    Checking the error from devm_add_action() is not enough. The API
    currently relies on the fact that the percpu_ref it is using is killed by
    the time devm_memremap_pages_release() is run. Rather than continue
    this awkward situation, offload the responsibility of killing the
    percpu_ref to devm_memremap_pages_release() directly. This allows
    devm_memremap_pages() to do the right thing relative to init failures and
    shutdown.

    Without this change we could fail to register the teardown of
    devm_memremap_pages(). The likelihood of hitting this failure is tiny as
    small memory allocations almost always succeed. However, the impact of
    the failure is large given that any future reconfiguration, or
    disable/enable, of an nvdimm namespace will fail forever, as subsequent
    calls to devm_memremap_pages() will fail to set up the pgmap_radix since
    there will be stale entries for the physical address range.

    An argument could be made to require that the ->kill() operation be set in
    the @pgmap arg rather than passed in separately. However, it helps code
    readability, tracking the lifetime of a given instance, to be able to grep
    the kill routine directly at the devm_memremap_pages() call site.

    Link: http://lkml.kernel.org/r/154275558526.76910.7535251937849268605.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Fixes: e8d513483300 ("memremap: change devm_memremap_pages interface...")
    Reviewed-by: "Jérôme Glisse"
    Reported-by: Logan Gunthorpe
    Reviewed-by: Logan Gunthorpe
    Reviewed-by: Christoph Hellwig
    Cc: Balbir Singh
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

25 Oct, 2018

1 commit

  • Pull libnvdimm updates from Dan Williams:

    - Improve the efficiency and performance of reading nvdimm-namespace
    labels. Reduce the amount of label data read at driver load time by a
    few orders of magnitude. Reduce heavyweight call-outs to
    platform-firmware routines.

    - Handle media errors located in the 'struct page' array stored on a
    persistent memory namespace. Let the kernel clear these errors rather
    than requiring an awkward userspace workaround.

    - Fix Address Range Scrub (ARS) completion tracking. Correct occasions
    where the kernel indicates completion of ARS before submission.

    - Fix asynchronous device registration reference counting.

    - Add support for reporting an nvdimm dirty-shutdown-count via sysfs.

    - Fix various small libnvdimm core and uapi issues.

    * tag 'libnvdimm-for-4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (21 commits)
    acpi, nfit: Further restrict userspace ARS start requests
    acpi, nfit: Fix Address Range Scrub completion tracking
    UAPI: ndctl: Remove use of PAGE_SIZE
    UAPI: ndctl: Fix g++-unsupported initialisation in headers
    tools/testing/nvdimm: Populate dirty shutdown data
    acpi, nfit: Collect shutdown status
    acpi, nfit: Introduce nfit_mem flags
    libnvdimm, label: Fix sparse warning
    nvdimm: Use namespace index data to reduce number of label reads needed
    nvdimm: Split label init out from the logic for getting config data
    nvdimm: Remove empty if statement
    nvdimm: Clarify comment in sizeof_namespace_index
    nvdimm: Sanity check labeloff
    libnvdimm, dimm: Maximize label transfer size
    libnvdimm, pmem: Fix badblocks population for 'raw' namespaces
    libnvdimm, namespace: Drop the repeat assignment for variable dev->parent
    libnvdimm, region: Fail badblocks listing for inactive regions
    libnvdimm, pfn: during init, clear errors in the metadata area
    libnvdimm: Set device node in nd_device_register
    libnvdimm: Hold reference on parent while scheduling async init
    ...

    Linus Torvalds