14 Oct, 2020

1 commit

  • The 'struct resource' in 'struct dev_pagemap' is only used for holding
    resource span information. The other fields, 'name', 'flags', 'desc',
    'parent', 'sibling', and 'child' are all unused wasted space.

    This is in preparation for introducing a multi-range extension of
    devm_memremap_pages().

    The bulk of this change is unwinding all the places internal to libnvdimm
    that used 'struct resource' unnecessarily, and replacing instances of
    'struct dev_pagemap'.res with 'struct dev_pagemap'.range.

    P2PDMA had a minor usage of the resource flags field, but only to report
    failures with "%pR". That is replaced with an open coded print of the
    range.
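
    As a rough illustration of the span-only representation, here is a
    userspace sketch. The two-field layout and the inclusive 'end' follow
    the kernel's 'struct range'; format_range() is a hypothetical helper,
    and the exact print format is an assumption, not the one used in the
    P2PDMA change.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Inclusive span: only the two fields a pagemap actually needs. */
struct range {
	uint64_t start;
	uint64_t end;
};

/* Number of bytes covered; 'end' is inclusive, hence the +1. */
static uint64_t range_len(const struct range *r)
{
	return r->end - r->start + 1;
}

/*
 * Hypothetical helper: open-coded print of a range, standing in for a
 * "%pR" print that would need a full 'struct resource'.
 */
static int format_range(char *buf, size_t len, const struct range *r)
{
	return snprintf(buf, len, "%#" PRIx64 "-%#" PRIx64,
			r->start, r->end);
}
```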

    [dan.carpenter@oracle.com: mm/hmm/test: use after free in dmirror_allocate_chunk()]
    Link: https://lkml.kernel.org/r/20200926121402.GA7467@kadam

    Signed-off-by: Dan Williams
    Signed-off-by: Dan Carpenter
    Signed-off-by: Andrew Morton
    Reviewed-by: Boris Ostrovsky [xen]
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Vishal Verma
    Cc: Vivek Goyal
    Cc: Dave Jiang
    Cc: Ben Skeggs
    Cc: David Airlie
    Cc: Daniel Vetter
    Cc: Ira Weiny
    Cc: Bjorn Helgaas
    Cc: Juergen Gross
    Cc: Stefano Stabellini
    Cc: "Jérôme Glisse"
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Borislav Petkov
    Cc: Brice Goglin
    Cc: Catalin Marinas
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: Greg Kroah-Hartman
    Cc: "H. Peter Anvin"
    Cc: Hulk Robot
    Cc: Ingo Molnar
    Cc: Jason Gunthorpe
    Cc: Jason Yan
    Cc: Jeff Moyer
    Cc: Jia He
    Cc: Joao Martins
    Cc: Jonathan Cameron
    Cc: kernel test robot
    Cc: Mike Rapoport
    Cc: Pavel Tatashin
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Randy Dunlap
    Cc: Thomas Gleixner
    Cc: Tom Lendacky
    Cc: Wei Yang
    Cc: Will Deacon
    Link: https://lkml.kernel.org/r/159643103173.4062302.768998885691711532.stgit@dwillia2-desk3.amr.corp.intel.com
    Link: https://lkml.kernel.org/r/160106115761.30709.13539840236873663620.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Linus Torvalds

    Dan Williams
     

25 Sep, 2019

1 commit

  • We check for bad blocks during namespace init, and that check uses the
    region bad block list. The bad block list must therefore be initialized
    for volatile regions as well. We also observe the lockdep warning below,
    because the lock is never initialized when bad block init is skipped
    for volatile regions.

    INFO: trying to register non-static key.
    the code is fine but needs lockdep annotation.
    turning off the locking correctness validator.
    CPU: 2 PID: 1 Comm: swapper/0 Not tainted 5.3.0-rc1-15699-g3dee241c937e #149
    Call Trace:
    [c0000000f95cb250] [c00000000147dd84] dump_stack+0xe8/0x164 (unreliable)
    [c0000000f95cb2a0] [c00000000022ccd8] register_lock_class+0x308/0xa60
    [c0000000f95cb3a0] [c000000000229cc0] __lock_acquire+0x170/0x1ff0
    [c0000000f95cb4c0] [c00000000022c740] lock_acquire+0x220/0x270
    [c0000000f95cb580] [c000000000a93230] badblocks_check+0xc0/0x290
    [c0000000f95cb5f0] [c000000000d97540] nd_pfn_validate+0x5c0/0x7f0
    [c0000000f95cb6d0] [c000000000d98300] nd_dax_probe+0xd0/0x1f0
    [c0000000f95cb760] [c000000000d9b66c] nd_pmem_probe+0x10c/0x160
    [c0000000f95cb790] [c000000000d7f5ec] nvdimm_bus_probe+0x10c/0x240
    [c0000000f95cb820] [c000000000d0f844] really_probe+0x254/0x4e0
    [c0000000f95cb8b0] [c000000000d0fdfc] driver_probe_device+0x16c/0x1e0
    [c0000000f95cb930] [c000000000d10238] device_driver_attach+0x68/0xa0
    [c0000000f95cb970] [c000000000d1040c] __driver_attach+0x19c/0x1c0
    [c0000000f95cb9f0] [c000000000d0c4c4] bus_for_each_dev+0x94/0x130
    [c0000000f95cba50] [c000000000d0f014] driver_attach+0x34/0x50
    [c0000000f95cba70] [c000000000d0e208] bus_add_driver+0x178/0x2f0
    [c0000000f95cbb00] [c000000000d117c8] driver_register+0x108/0x170
    [c0000000f95cbb70] [c000000000d7edb0] __nd_driver_register+0xe0/0x100
    [c0000000f95cbbd0] [c000000001a6baa4] nd_pmem_driver_init+0x34/0x48
    [c0000000f95cbbf0] [c0000000000106f4] do_one_initcall+0x1d4/0x4b0
    [c0000000f95cbcd0] [c0000000019f499c] kernel_init_freeable+0x544/0x65c
    [c0000000f95cbdb0] [c000000000010d6c] kernel_init+0x2c/0x180
    [c0000000f95cbe20] [c00000000000b954] ret_from_kernel_thread+0x5c/0x68

    Signed-off-by: Aneesh Kumar K.V
    Link: https://lore.kernel.org/r/20190919083355.26340-1-aneesh.kumar@linux.ibm.com
    Signed-off-by: Dan Williams

    Aneesh Kumar K.V
     

19 Jul, 2019

2 commits

  • For good reason, the standard device_lock() is marked
    lockdep_set_novalidate_class() because there is simply no sane way to
    describe the myriad ways the device_lock() is ordered with other locks.
    However, that leaves subsystems that know their own local device_lock()
    ordering rules to find lock ordering mistakes manually. Instead,
    introduce an optional / additional lockdep-enabled lock that a subsystem
    can acquire in all the same paths that the device_lock() is acquired.

    A conversion of the NFIT driver and NVDIMM subsystem to a
    lockdep-validated device_lock() scheme is included. The
    debug_nvdimm_lock() implementation provides the correct lock-class and
    stacking order for the libnvdimm device topology hierarchy.

    Yes, this is a hack, but hopefully it is a useful hack for other
    subsystems' device_lock() debug sessions. Quoting Greg:

    "Yeah, it feels a bit hacky but it's really up to a subsystem to mess up
    using it as much as anything else, so user beware :)

    I don't object to it if it makes things easier for you to debug."

    Cc: Ingo Molnar
    Cc: Ira Weiny
    Cc: Will Deacon
    Cc: Dave Jiang
    Cc: Keith Busch
    Cc: Peter Zijlstra
    Cc: Vishal Verma
    Cc: "Rafael J. Wysocki"
    Cc: Greg Kroah-Hartman
    Signed-off-by: Dan Williams
    Acked-by: Greg Kroah-Hartman
    Reviewed-by: Ira Weiny
    Link: https://lore.kernel.org/r/156341210661.292348.7014034644265455704.stgit@dwillia2-desk3.amr.corp.intel.com

    Dan Williams
     
  • Namespace activation expects to be able to reference region badblocks.
    The following warning sometimes triggers when asynchronous namespace
    activation races in front of the completion of namespace probing. Move
    all possible namespace probing after region badblocks initialization.

    Otherwise, lockdep sometimes catches the uninitialized state of the
    badblocks seqlock with stack trace signatures like:

    INFO: trying to register non-static key.
    pmem2: detected capacity change from 0 to 136365211648
    the code is fine but needs lockdep annotation.
    turning off the locking correctness validator.
    CPU: 9 PID: 358 Comm: kworker/u80:5 Tainted: G OE 5.2.0-rc4+ #3382
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
    Workqueue: events_unbound async_run_entry_fn
    Call Trace:
    dump_stack+0x85/0xc0
    pmem1.12: detected capacity change from 0 to 8589934592
    register_lock_class+0x56a/0x570
    ? check_object+0x140/0x270
    __lock_acquire+0x80/0x1710
    ? __mutex_lock+0x39d/0x910
    lock_acquire+0x9e/0x180
    ? nd_pfn_validate+0x28f/0x440 [libnvdimm]
    badblocks_check+0x93/0x1f0
    ? nd_pfn_validate+0x28f/0x440 [libnvdimm]
    nd_pfn_validate+0x28f/0x440 [libnvdimm]
    ? lockdep_hardirqs_on+0xf0/0x180
    nd_dax_probe+0x9a/0x120 [libnvdimm]
    nd_pmem_probe+0x6d/0x180 [nd_pmem]
    nvdimm_bus_probe+0x90/0x2c0 [libnvdimm]

    Fixes: 48af2f7e52f4 ("libnvdimm, pfn: during init, clear errors...")
    Cc:
    Cc: Vishal Verma
    Reviewed-by: Vishal Verma
    Link: https://lore.kernel.org/r/156341208365.292348.1547528796026249120.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams

    Dan Williams
     

05 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of version 2 of the gnu general public license as
    published by the free software foundation this program is
    distributed in the hope that it will be useful but without any
    warranty without even the implied warranty of merchantability or
    fitness for a particular purpose see the gnu general public license
    for more details

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 64 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Alexios Zavras
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190529141901.894819585@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

07 Apr, 2018

1 commit

  • The message about constraining number of online cpus to be less than or
    equal to ND_MAX_LANES (256) is only useful for block-aperture
    configurations and BTT. Make it debug since it is only relevant when
    debugging performance.

    Signed-off-by: Dan Williams

    Dan Williams
     

01 Jul, 2017

1 commit

  • We need to hold a reference on the 'dirent' until we are sure there are
    no more notifications that will be sent. As noted in the new comments we
    take advantage of the fact that the references are taken and dropped
    under device_lock() and that nd_device_notify() holds device_lock() over
    new badblocks notifications. The notifications that happen when
    badblocks are cleared only occur while the device is active.

    Also take the opportunity to fix up the error messages to report the
    user visible effect of a sysfs_get_dirent() failure.

    Fixes: 975750a98c26 ("libnvdimm, pmem: Add sysfs notifications to badblocks")
    Cc: Toshi Kani
    Signed-off-by: Dan Williams

    Dan Williams
     

16 Jun, 2017

1 commit

  • Sysfs "badblocks" information may be updated at run-time:
    - MCE, SCI, and sysfs "scrub" may add new bad blocks
    - Writes and ioctl() may clear bad blocks

    Add support for sending sysfs notifications to the sysfs "badblocks"
    files under the region and pmem directories when their badblocks
    information is re-evaluated (but not necessarily changed) at run-time.

    Signed-off-by: Toshi Kani
    Cc: Vishal Verma
    Cc: Linda Knippers
    Signed-off-by: Dan Williams

    Toshi Kani
     

30 Apr, 2017

1 commit

  • Toshi noticed that the new support for a region-level badblocks missed
    the case where errors are cleared due to BTT I/O.

    An initial attempt to fix this ran into a "sleeping while atomic"
    warning due to taking the nvdimm_bus_lock() in the BTT I/O path to
    satisfy the locking requirements of __nvdimm_bus_badblocks_clear().
    However, that lock is not needed since we are not acting on any data that
    is subject to change under that lock. The badblocks instance has its own
    internal lock to handle mutations of the error list.

    So, in order to make it clear that we are just acting on region devices,
    rename __nvdimm_bus_badblocks_clear() to nvdimm_clear_badblocks_regions().
    Eliminate the lock and consolidate all support routines for the new
    nvdimm_account_cleared_poison() in drivers/nvdimm/bus.c. Finally, take
    the opportunity to clean up some unnecessary casts, make the calling
    convention of nvdimm_clear_badblocks_regions() clearer by replacing struct
    resource with the minimal struct clear_badblocks_context, and use the
    DEVICE_ATTR macro.

    Cc: Dave Jiang
    Cc: Vishal Verma
    Reported-by: Toshi Kani
    Signed-off-by: Dan Williams

    Dan Williams
     

13 Apr, 2017

2 commits

  • Provide a mechanism to clear the poison list via the ndctl
    ND_CMD_CLEAR_ERROR call. We update the poison list and also the badblocks
    at region level if the region is in dax mode, or in pmem mode and not
    active. In other words, badblocks must be cleared through write requests
    if the address is currently accessed through a block device; otherwise
    clearing can only be done via the ioctl+dsm path.

    Signed-off-by: Dave Jiang
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Dan Williams

    Dave Jiang
     
  • The badblocks sysfs file will be exported at region level. When the
    nvdimm event notifier fires for NVDIMM_REVALIDATE_POISON, the badblocks
    in the region will be updated.

    Signed-off-by: Dave Jiang
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Dan Williams

    Dave Jiang
     

12 Jul, 2016

2 commits


10 May, 2016

1 commit

  • Device DAX is the device-centric analogue of Filesystem DAX
    (CONFIG_FS_DAX). It allows persistent memory ranges to be allocated and
    mapped without need of an intervening file system. This initial
    infrastructure arranges for a libnvdimm pfn-device to be represented as
    a different device-type so that it can be attached to a driver other
    than the pmem driver.

    Signed-off-by: Dan Williams

    Dan Williams
     

06 Mar, 2016

1 commit


29 Aug, 2015

1 commit

  • Implement the base infrastructure for libnvdimm PFN devices. Similar to
    BTT devices they take a namespace as a backing device and layer
    functionality on top. In this case the functionality is reserving space
    for an array of 'struct page' entries to be handed out through
    pfn_to_page(). For now this is just the basic libnvdimm-device-model for
    configuring the base PFN device.
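
    To give a feel for the size of that reservation, the arithmetic can be
    sketched as below. The 4096-byte page size and 64-byte 'struct page'
    are assumptions for illustration (both vary by architecture and
    configuration), and pfn_array_bytes() is a hypothetical name, not a
    libnvdimm function. Alignment and padding are ignored.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE   4096ULL /* assumed base page size */
#define SKETCH_STRUCT_PAGE 64ULL   /* assumed sizeof(struct page) */

/*
 * Bytes of 'struct page' array needed to describe a namespace of
 * 'namespace_size' bytes, one entry per page frame.
 */
static uint64_t pfn_array_bytes(uint64_t namespace_size)
{
	uint64_t npfns = namespace_size / SKETCH_PAGE_SIZE;

	return npfns * SKETCH_STRUCT_PAGE;
}
```

    Under these assumptions a 1 GiB namespace needs 16 MiB for the array,
    roughly 1.6% of the capacity.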

    As the namespace claiming mechanism for PFN devices is mostly identical
    to BTT devices drivers/nvdimm/claim.c is created to house the common
    bits.

    Cc: Ross Zwisler
    Signed-off-by: Dan Williams

    Dan Williams
     

26 Jun, 2015

2 commits

  • The libnvdimm implementation handles allocating dimm address space (DPA)
    between PMEM and BLK mode interfaces. After DPA has been allocated from
    a BLK-region to a BLK-namespace the nd_blk driver attaches to handle I/O
    as a struct bio based block device. Unlike PMEM, BLK is required to
    handle platform specific details like mmio register formats and memory
    controller interleave. For this reason the libnvdimm generic nd_blk
    driver calls back into the bus provider to carry out the I/O.

    This initial implementation handles the BLK interface defined by the
    ACPI 6 NFIT [1] and the NVDIMM DSM Interface Example [2] composed from
    DCR (dimm control region), BDW (block data window), IDT (interleave
    descriptor) NFIT structures and the hardware register format.
    [1]: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
    [2]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

    Cc: Andy Lutomirski
    Cc: Boaz Harrosh
    Cc: H. Peter Anvin
    Cc: Jens Axboe
    Cc: Ingo Molnar
    Cc: Christoph Hellwig
    Signed-off-by: Ross Zwisler
    Acked-by: Rafael J. Wysocki
    Signed-off-by: Dan Williams

    Ross Zwisler
     
  • BTT stands for Block Translation Table, and is a way to provide power
    fail sector atomicity semantics for block devices that have the ability
    to perform byte granularity IO. It relies on the capability of libnvdimm
    namespace devices to do byte aligned IO.

    The BTT works as a stacked block device, and reserves a chunk of space
    from the backing device for its accounting metadata. It is a bio-based
    driver because all IO is done synchronously, and there is no queuing or
    asynchronous completions at either the device or the driver level.

    The BTT uses 'lanes' to index into various 'on-disk' data structures,
    and lanes also act as a synchronization mechanism in case there are more
    CPUs than available lanes. We did a comparison between two lane lock
    strategies - first where we kept an atomic counter around that tracked
    which was the last lane that was used, and 'our' lane was determined by
    atomically incrementing that. That way, for the nr_cpus > nr_lanes case,
    theoretically, no CPU would be blocked waiting for a lane. The other
    strategy was to use the cpu number we're scheduled on and hash it to
    a lane number. Theoretically, this could block an IO that could've
    otherwise run using a different, free lane. But some fio workloads
    showed that the direct cpu -> lane hash performed faster than tracking
    'last lane' - my reasoning is the cache thrash caused by moving the
    atomic variable made that approach slower than simply waiting out the
    in-progress IO. This supports the conclusion that the driver can be a
    very simple bio-based one that does synchronous IOs instead of queuing.
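
    The two lane-selection strategies compared above can be sketched in
    userspace C as follows. The function names are hypothetical, and the
    plain modulo standing in for the "hash" is an assumption; the sketch
    only shows where the shared state lives, not the lane locking itself.

```c
#include <stdatomic.h>

/* Strategy 1: a shared counter remembers the last lane handed out. */
struct lane_counter {
	atomic_uint last;
};

/* Every caller bumps the same atomic, so its cache line bounces between CPUs. */
static unsigned int lane_from_counter(struct lane_counter *lc,
				      unsigned int nr_lanes)
{
	return atomic_fetch_add(&lc->last, 1) % nr_lanes;
}

/*
 * Strategy 2: derive the lane from the CPU number alone. No shared state
 * is touched, but two CPUs that map to the same lane must wait on each
 * other even if a different lane is idle.
 */
static unsigned int lane_from_cpu(unsigned int cpu, unsigned int nr_lanes)
{
	return cpu % nr_lanes;
}
```

    With 256 lanes, CPUs 4 and 260 land on the same lane under the second
    strategy, which is the blocking case described above.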

    Cc: Andy Lutomirski
    Cc: Boaz Harrosh
    Cc: H. Peter Anvin
    Cc: Jens Axboe
    Cc: Ingo Molnar
    Cc: Christoph Hellwig
    Cc: Neil Brown
    Cc: Jeff Moyer
    Cc: Dave Chinner
    Cc: Greg KH
    [jmoyer: fix nmi watchdog timeout in btt_map_init]
    [jmoyer: move btt initialization to module load path]
    [jmoyer: fix memory leak in the btt initialization path]
    [jmoyer: Don't overwrite corrupted arenas]
    Signed-off-by: Vishal Verma
    Signed-off-by: Dan Williams

    Vishal Verma
     

25 Jun, 2015

3 commits

  • NVDIMM namespaces, in addition to accepting "struct bio" based requests,
    also have the capability to perform byte-aligned accesses. By default
    only the bio/block interface is used. However, if another driver can
    make effective use of the byte-aligned capability it can claim the
    namespace and use the byte-aligned ->rw_bytes() interface.

    The BTT driver is the first consumer of this mechanism to allow
    adding atomic sector update semantics to a pmem or blk namespace. This
    patch is the sysfs infrastructure to allow configuring a BTT instance
    for a namespace. Enabling that BTT and performing i/o is in a
    subsequent patch.

    Cc: Greg KH
    Cc: Neil Brown
    Signed-off-by: Dan Williams

    Dan Williams
     
  • A complete label set is a PMEM-label per-dimm per-interleave-set where
    all the UUIDs match and the interleave set cookie matches the hosting
    interleave set.

    Present sysfs attributes for manipulation of a PMEM-namespace's
    'alt_name', 'uuid', and 'size' attributes. A later patch will make
    these settings persistent by writing back the label.

    Note that PMEM allocations grow forwards from the start of an interleave
    set (lowest dimm-physical-address (DPA)). BLK-namespaces that alias
    with a PMEM interleave set will grow allocations backward from the
    highest DPA.
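
    The opposing growth directions can be modeled with a small sketch. It
    treats the free DPA space of one dimm as a single gap; 'struct
    dpa_space' and both helpers are hypothetical and far simpler than the
    label-based accounting in libnvdimm.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: free DPA space is the single gap [lo, hi). */
struct dpa_space {
	uint64_t lo; /* first free byte, PMEM grows up from here */
	uint64_t hi; /* one past the last free byte, BLK grows down from here */
};

/* PMEM: allocate forward from the lowest free DPA. */
static bool dpa_alloc_pmem(struct dpa_space *s, uint64_t size, uint64_t *start)
{
	if (size > s->hi - s->lo)
		return false;
	*start = s->lo;
	s->lo += size;
	return true;
}

/* BLK: allocate backward from the highest free DPA. */
static bool dpa_alloc_blk(struct dpa_space *s, uint64_t size, uint64_t *start)
{
	if (size > s->hi - s->lo)
		return false;
	s->hi -= size;
	*start = s->hi;
	return true;
}
```

    Growing from opposite ends lets both namespace types keep extending
    until the two allocations meet in the middle.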

    Cc: Greg KH
    Cc: Neil Brown
    Acked-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     
  • The libnvdimm region driver is an intermediary driver that translates
    non-volatile "region"s into "namespace" sub-devices that are surfaced by
    persistent memory block-device drivers (PMEM and BLK).

    ACPI 6 introduces the concept that a given nvdimm may simultaneously
    offer multiple access modes to its media through direct PMEM load/store
    access, or windowed BLK mode. Existing nvdimms mostly implement a PMEM
    interface, some offer a BLK-like mode, but never both as ACPI 6 defines.
    If an nvdimm is single interfaced, then there is no need for dimm
    metadata labels. For these devices we can take the region boundaries
    directly to create a child namespace device (nd_namespace_io).

    Acked-by: Christoph Hellwig
    Tested-by: Toshi Kani
    Signed-off-by: Dan Williams

    Dan Williams