06 Mar, 2016

3 commits

  • While the nfit driver is issuing address range scrub commands and
    reaping the results, do not permit an ars_start command issued from
    userspace. The scrub thread assumes that all ars completions are for
    scrubs initiated by platform firmware at boot, or by the nfit driver.
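
    A minimal sketch of the exclusion described above, written as a
    user-space Python model rather than kernel code. The class name, the
    flag name, and the choice of -EBUSY as the rejection code are
    assumptions made for illustration; the commit message does not say
    which error is returned.

```python
import errno
import threading


class ScrubState:
    """Toy model of a bus whose driver may own the scrub engine."""

    def __init__(self):
        self._lock = threading.Lock()
        self.driver_scrub_active = False  # hypothetical flag name

    def driver_scrub_begin(self):
        with self._lock:
            self.driver_scrub_active = True

    def driver_scrub_end(self):
        with self._lock:
            self.driver_scrub_active = False

    def user_ars_start(self):
        """Return 0 if accepted, -EBUSY (assumed) while the driver scrubs."""
        with self._lock:
            if self.driver_scrub_active:
                return -errno.EBUSY
            return 0
```

    Rejecting the request up front keeps the scrub thread's assumption
    intact: every completion it reaps belongs to firmware or to the driver.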

    Signed-off-by: Dan Williams

    Dan Williams
     
  • Introduce a workqueue that will be used to run address range scrub
    asynchronously with the rest of nvdimm device probing.

    Userspace still wants notification when probing operations complete, so
    introduce a new callback to flush this workqueue when userspace is
    awaiting probe completion.
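
    The queue-then-flush pattern can be sketched in user-space Python as
    follows. This is only an analogy for a kernel workqueue: the class and
    function names are hypothetical, and a single worker thread stands in
    for the real workqueue machinery.

```python
import queue
import threading
import time


class ProbeWorkqueue:
    """Single worker thread standing in for a kernel workqueue."""

    def __init__(self):
        self._items = queue.Queue()
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def _run(self):
        while True:
            work = self._items.get()
            try:
                work()
            finally:
                self._items.task_done()  # lets flush() observe completion

    def queue_work(self, work):
        self._items.put(work)  # returns immediately; probing continues

    def flush(self):
        self._items.join()  # block until every queued item has finished


def slow_scrub(log):
    time.sleep(0.05)  # pretend the scrub takes a while
    log.append("scrub complete")
```

    A caller that needs the "probe finished" guarantee queues the work
    and then calls flush(); callers that do not care never block.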

    Signed-off-by: Dan Williams

    Dan Williams
     
  • The return value from an 'ndctl_fn' reports the command execution
    status, i.e. was the command properly formatted and was it successfully
    submitted to the bus provider. The new 'cmd_rc' parameter allows the bus
    provider to communicate command specific results, translated into
    common error codes.

    Convert the ARS commands to this scheme to:

    1/ Consolidate status reporting

    2/ Prepare for expanding ars unit test cases

    3/ Make the implementation more generic
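
    The split between the two status values might be pictured like this.
    It is a user-space Python sketch, not the kernel's 'ndctl_fn'
    signature: the function name, the firmware status numbers, and the
    specific errno choices are all assumptions made for illustration.

```python
import errno

# Hypothetical firmware status codes, for illustration only.
FW_SUCCESS = 0
FW_BUSY = 6


def ndctl_sketch(cmd, buf, firmware):
    """Return (rc, cmd_rc).

    rc     -- was the command well formed and submitted to the provider?
    cmd_rc -- the command-specific result, translated to a common errno.
    """
    if cmd not in firmware or not isinstance(buf, (bytes, bytearray)):
        return -errno.EINVAL, None  # never reached the bus provider
    status = firmware[cmd](buf)
    translate = {FW_SUCCESS: 0, FW_BUSY: -errno.EBUSY}
    cmd_rc = translate.get(status, -errno.EIO)  # unknown status: generic error
    return 0, cmd_rc
```

    With this shape a caller can tell "the command never ran" (rc < 0)
    apart from "the command ran and reported a problem" (cmd_rc < 0).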

    Cc: Vishal Verma
    Signed-off-by: Dan Williams

    Dan Williams
     

24 Feb, 2016

1 commit

  • The original format of these commands from the "NVDIMM DSM Interface
    Example" [1] is superseded by the ACPI 6.1 definition of the "NVDIMM Root
    Device _DSMs" [2].

    [1]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
    [2]: http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf
    "9.20.7 NVDIMM Root Device _DSMs"

    Changes include:
    1/ New 'restart' fields in ars_status. Unfortunately, these are
    implemented in the middle of the existing definition, so this change
    is not backwards compatible. The expectation is that shipping
    platforms will only ever support the ACPI 6.1 definition.

    2/ New status values for ars_start ('busy') and ars_status ('overflow').
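
    Why fields added mid-structure break compatibility can be seen with a
    small offset calculation. The field lists below are simplified
    assumptions chosen for illustration; they are not the ACPI 6.1 layout
    and not the kernel's structure definition.

```python
import struct

# Hypothetical, simplified layouts: (field name, struct format code).
OLD_FIELDS = [
    ("status", "I"),
    ("out_length", "I"),
    ("address", "Q"),
    ("length", "Q"),
    ("num_records", "I"),
]
NEW_FIELDS = [
    ("status", "I"),
    ("out_length", "I"),
    ("address", "Q"),
    ("length", "Q"),
    ("restart_address", "Q"),  # inserted in the middle
    ("restart_length", "Q"),   # inserted in the middle
    ("num_records", "I"),
]


def offsets(fields):
    """Map each field name to its byte offset in a packed little-endian layout."""
    result = {}
    fmt = "<"
    for name, code in fields:
        result[name] = struct.calcsize(fmt)
        fmt += code
    return result
```

    Every field after the insertion point moves, so a consumer built
    against the old layout reads the wrong bytes for those fields.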

    Cc: Vishal Verma
    Cc: Linda Knippers
    Signed-off-by: Dan Williams

    Dan Williams
     

20 Feb, 2016

1 commit


10 Jan, 2016

1 commit

  • During region creation, perform Address Range Scrubs (ARS) for the SPA
    (System Physical Address) ranges to retrieve known poison locations from
    firmware. Add a new data structure 'nd_poison' which is used as a list
    in nvdimm_bus to store these poison locations.

    When creating a pmem namespace, if there is any known poison associated
    with its physical address space, convert the poison ranges to bad sectors
    that are exposed using the badblocks interface.
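
    The conversion from poison ranges to bad sectors is essentially
    interval clipping plus integer division. A minimal Python sketch,
    assuming 512-byte sectors and byte-granular (start, length) poison
    entries; the function and parameter names are hypothetical:

```python
SECTOR_SIZE = 512  # assumed sector size for the badblocks list


def poison_to_badblocks(ns_start, ns_size, poison):
    """Clip poison ranges to a namespace and express them in sectors.

    poison is a list of (start, length) byte ranges in physical address
    space. Returns (first_sector, sector_count) tuples relative to the
    start of the namespace.
    """
    ns_end = ns_start + ns_size
    bad = []
    for start, length in poison:
        lo = max(start, ns_start)
        hi = min(start + length, ns_end)
        if lo >= hi:
            continue  # no overlap with this namespace
        first = (lo - ns_start) // SECTOR_SIZE
        last = (hi - 1 - ns_start) // SECTOR_SIZE
        bad.append((first, last - first + 1))
    return bad
```

    Note that a single poisoned byte still marks a whole sector bad, and
    a range straddling a sector boundary marks both sectors.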

    Signed-off-by: Vishal Verma
    Signed-off-by: Dan Williams

    Vishal Verma
     

29 Aug, 2015

1 commit

  • The expectation is that the legacy / non-standard pmem discovery method
    (e820 type-12) will only ever be used to describe small quantities of
    persistent memory. Larger capacities will be described via the ACPI
    NFIT. When "allocate struct page from pmem" support is added this default
    policy can be overridden by assigning a legacy pmem namespace to a pfn
    device, however this would only be necessary if a platform used the
    legacy mechanism to define a very large range.

    Cc: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     

26 Jun, 2015

5 commits

  • Add support of sysfs 'numa_node' to I/O-related NVDIMM devices
    under /sys/bus/nd/devices, regionN, namespaceN.0, and bttN.x.

    An example of numa_node values on a 2-socket system with a single
    NVDIMM range on each socket is shown below.
    /sys/bus/nd/devices
    |-- btt0.0/numa_node:0
    |-- btt1.0/numa_node:1
    |-- btt1.1/numa_node:1
    |-- namespace0.0/numa_node:0
    |-- namespace1.0/numa_node:1
    |-- region0/numa_node:0
    |-- region1/numa_node:1

    These numa_node files are then linked under the block class of
    their device names.
    /sys/class/block/pmem0/device/numa_node:0
    /sys/class/block/pmem1s/device/numa_node:1

    This enables numactl(8) to accept 'block:' and 'file:' paths of
    pmem and btt devices as shown in the examples below.
    numactl --preferred block:pmem0 --show
    numactl --preferred file:/dev/pmem1s --show

    Signed-off-by: Toshi Kani
    Signed-off-by: Dan Williams

    Toshi Kani
     
  • ACPI NFIT table has System Physical Address Range Structure entries that
    describe a proximity ID of each range when ACPI_NFIT_PROXIMITY_VALID is
    set in the flags.

    Change acpi_nfit_register_region() to map a proximity ID to its node ID,
    and set it to a new numa_node field of nd_region_desc, which is then
    conveyed to the nd_region device.

    The device core arranges for btt and namespace devices to inherit their
    node from their parent region.

    Signed-off-by: Toshi Kani
    [djbw: move set_dev_node() from region.c to bus.c]
    Signed-off-by: Dan Williams

    Toshi Kani
     
  • Upon detection of an unarmed dimm in a region, arrange for descendant
    BTT, PMEM, or BLK instances to be read-only. A dimm is primarily marked
    "unarmed" via flags passed by platform firmware (NFIT).

    The flags in the NFIT memory device sub-structure indicate the state of
    the data on the nvdimm relative to its energy source or last "flush to
    persistence". For the most part there is nothing the driver can do but
    advertise the state of these flags in sysfs and emit a message if
    firmware indicates that the contents of the device may be corrupted.
    However, for the case of ACPI_NFIT_MEM_ARMED, the driver can arrange for
    the block devices incorporating that nvdimm to be marked read-only.
    This is a safe default as the data is still available and new writes are
    held off until the administrator either forces read-write mode, or the
    energy source becomes armed.

    A 'read_only' attribute is added to REGION devices to allow for
    overriding the default read-only policy of all descendant block devices.
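
    The default policy and its override could be modelled as below. This
    is a sketch only: the bit position given to ACPI_NFIT_MEM_ARMED is an
    assumption, the function name is hypothetical, and the sketch assumes
    that a set bit means the dimm is not armed, as the text above implies.

```python
ACPI_NFIT_MEM_ARMED = 1 << 3  # assumed bit position; set means "not armed"


def region_read_only(dimm_flags, override=None):
    """Decide whether block devices under a region should be read-only.

    dimm_flags -- one NFIT memory-device flags word per dimm in the region
    override   -- None keeps the default policy; True/False models a value
                  written to the region's 'read_only' attribute
    """
    if override is not None:
        return bool(override)
    return any(flags & ACPI_NFIT_MEM_ARMED for flags in dimm_flags)
```

    One unarmed dimm is enough to make the whole region default to
    read-only, since every descendant device incorporates that dimm.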

    Signed-off-by: Dan Williams

    Dan Williams
     
  • The libnvdimm implementation handles allocating dimm address space (DPA)
    between PMEM and BLK mode interfaces. After DPA has been allocated from
    a BLK-region to a BLK-namespace the nd_blk driver attaches to handle I/O
    as a struct bio based block device. Unlike PMEM, BLK is required to
    handle platform specific details like mmio register formats and memory
    controller interleave. For this reason the libnvdimm generic nd_blk
    driver calls back into the bus provider to carry out the I/O.

    This initial implementation handles the BLK interface defined by the
    ACPI 6 NFIT [1] and the NVDIMM DSM Interface Example [2] composed from
    DCR (dimm control region), BDW (block data window), IDT (interleave
    descriptor) NFIT structures and the hardware register format.
    [1]: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
    [2]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

    Cc: Andy Lutomirski
    Cc: Boaz Harrosh
    Cc: H. Peter Anvin
    Cc: Jens Axboe
    Cc: Ingo Molnar
    Cc: Christoph Hellwig
    Signed-off-by: Ross Zwisler
    Acked-by: Rafael J. Wysocki
    Signed-off-by: Dan Williams

    Ross Zwisler
     
  • BTT stands for Block Translation Table, and is a way to provide power
    fail sector atomicity semantics for block devices that have the ability
    to perform byte granularity IO. It relies on the capability of libnvdimm
    namespace devices to do byte aligned IO.

    The BTT works as a stacked block device, and reserves a chunk of space
    from the backing device for its accounting metadata. It is a bio-based
    driver because all IO is done synchronously, and there is no queuing or
    asynchronous completions at either the device or the driver level.

    The BTT uses 'lanes' to index into various 'on-disk' data structures,
    and lanes also act as a synchronization mechanism in case there are more
    CPUs than available lanes. We did a comparison between two lane lock
    strategies - first where we kept an atomic counter around that tracked
    which was the last lane that was used, and 'our' lane was determined by
    atomically incrementing that. That way, for the nr_cpus > nr_lanes case,
    theoretically, no CPU would be blocked waiting for a lane. The other
    strategy was to take the cpu number we're scheduled on and hash it to
    a lane number. Theoretically, this could block an IO that could've
    otherwise run using a different, free lane. But some fio workloads
    showed that the direct cpu -> lane hash performed faster than tracking
    'last lane' - my reasoning is the cache thrash caused by moving the
    atomic variable made that approach slower than simply waiting out the
    in-progress IO. This supports the conclusion that the driver can be a
    very simple bio-based one that does synchronous IOs instead of queuing.
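
    The two lane-selection strategies compared above can be sketched in
    Python. The class names are hypothetical, a plain modulo stands in
    for the hash, and a lock around a counter stands in for the atomic
    increment; none of this is taken from the driver source.

```python
import itertools
import threading


class HashedLanes:
    """Strategy 2: derive the lane directly from the cpu number."""

    def __init__(self, nr_lanes):
        self.nr_lanes = nr_lanes
        self.locks = [threading.Lock() for _ in range(nr_lanes)]

    def lane_for(self, cpu):
        return cpu % self.nr_lanes  # no shared state is touched here

    def acquire(self, cpu):
        lane = self.lane_for(cpu)
        self.locks[lane].acquire()  # may wait if another cpu hashes here
        return lane

    def release(self, lane):
        self.locks[lane].release()


class CounterLanes:
    """Strategy 1: hand out lanes round-robin from a shared counter."""

    def __init__(self, nr_lanes):
        self.nr_lanes = nr_lanes
        self._next = itertools.count()
        self._counter_lock = threading.Lock()  # stands in for an atomic op

    def lane_for(self, cpu):
        with self._counter_lock:  # every caller contends on this one word
            return next(self._next) % self.nr_lanes
```

    The shared counter in CounterLanes is the variable whose cache line
    bounces between CPUs; HashedLanes avoids it at the cost of sometimes
    waiting on a busy lane.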

    Cc: Andy Lutomirski
    Cc: Boaz Harrosh
    Cc: H. Peter Anvin
    Cc: Jens Axboe
    Cc: Ingo Molnar
    Cc: Christoph Hellwig
    Cc: Neil Brown
    Cc: Jeff Moyer
    Cc: Dave Chinner
    Cc: Greg KH
    [jmoyer: fix nmi watchdog timeout in btt_map_init]
    [jmoyer: move btt initialization to module load path]
    [jmoyer: fix memory leak in the btt initialization path]
    [jmoyer: Don't overwrite corrupted arenas]
    Signed-off-by: Vishal Verma
    Signed-off-by: Dan Williams

    Vishal Verma
     

25 Jun, 2015

10 commits

  • A blk label set describes a namespace comprised of one or more
    discontiguous dpa ranges on a single dimm. They may alias with one or
    more pmem interleave sets that include the given dimm.

    This is the runtime/volatile configuration infrastructure for sysfs
    manipulation of 'alt_name', 'uuid', 'size', and 'sector_size'. A later
    patch will make these settings persistent by writing back the label(s).

    Unlike pmem namespaces, multiple blk namespaces can be created per
    region. Once a blk namespace has been created a new seed device
    (unconfigured child of a parent blk region) is instantiated. As long as
    a region has 'available_size' != 0 new child namespaces may be created.

    Cc: Greg KH
    Cc: Neil Brown
    Acked-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     
  • A complete label set is a PMEM-label per-dimm per-interleave-set where
    all the UUIDs match and the interleave set cookie matches the hosting
    interleave set.

    Present sysfs attributes for manipulation of a PMEM-namespace's
    'alt_name', 'uuid', and 'size' attributes. A later patch will make
    these settings persistent by writing back the label.

    Note that PMEM allocations grow forwards from the start of an interleave
    set (lowest dimm-physical-address (DPA)). BLK-namespaces that alias
    with a PMEM interleave set will grow allocations backward from the
    highest DPA.
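
    The opposing growth directions can be illustrated with a toy
    allocator over one aliased DPA range. This is a simplified sketch
    with hypothetical names; it ignores labels, alignment, and
    discontiguous BLK ranges.

```python
class AliasedDpa:
    """Toy allocator: PMEM grows up from 0, BLK grows down from the top."""

    def __init__(self, size):
        self.size = size
        self.pmem_end = 0       # PMEM occupies [0, pmem_end)
        self.blk_start = size   # BLK occupies [blk_start, size)

    def available(self):
        return self.blk_start - self.pmem_end

    def alloc_pmem(self, nbytes):
        if nbytes > self.available():
            raise ValueError("not enough free DPA")
        start = self.pmem_end
        self.pmem_end += nbytes
        return (start, nbytes)

    def alloc_blk(self, nbytes):
        if nbytes > self.available():
            raise ValueError("not enough free DPA")
        self.blk_start -= nbytes
        return (self.blk_start, nbytes)
```

    Growing from opposite ends keeps the free space in one contiguous
    hole in the middle until the two kinds of allocation meet.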

    Cc: Greg KH
    Cc: Neil Brown
    Acked-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     
  • On platforms that have firmware support for reading/writing per-dimm
    label space, a portion of the dimm may be accessible via an interleave
    set PMEM mapping in addition to the dimm's BLK (block-data-window
    aperture(s)) interface. A label, stored in a "configuration data
    region" on the dimm, disambiguates which dimm addresses are accessed
    through which exclusive interface.

    Add infrastructure that allows the kernel to block modifications to a
    label in the set while any member dimm is active. Note that this is
    meant only for enforcing "no modifications of active labels" via the
    coarse ioctl command. Adding/deleting namespaces from an active
    interleave set is always possible via sysfs.

    Another aspect of tracking interleave sets is tracking their integrity
    when DIMMs in a set are physically re-ordered. For this purpose we
    generate an "interleave-set cookie" that can be recorded in a label and
    validated against the current configuration. It is the bus provider
    implementation's responsibility to calculate the interleave set cookie
    and attach it to a given region.
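
    One way a bus provider might compute such a cookie is to hash each
    dimm's identity together with its position in the set. The sketch
    below is an assumption for illustration: the commit message does not
    specify the inputs or the hash, and SHA-256 is used here only because
    it is available in the Python standard library.

```python
import hashlib
import struct


def interleave_set_cookie(members):
    """Compute a 64-bit cookie for an interleave set.

    members is a list of (region_offset, serial_number) pairs, one per
    dimm. Sorting makes the result independent of discovery order, while
    moving a dimm to a different offset changes the pairs and the cookie.
    """
    digest = hashlib.sha256()
    for region_offset, serial in sorted(members):
        digest.update(struct.pack("<QI", region_offset, serial))
    return int.from_bytes(digest.digest()[:8], "little")
```

    A label that recorded the cookie at creation time can then be checked
    against the value computed from the current configuration.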

    Cc: Neil Brown
    Cc: Greg KH
    Cc: Robert Moore
    Cc: Rafael J. Wysocki
    Acked-by: Christoph Hellwig
    Acked-by: Rafael J. Wysocki
    Signed-off-by: Dan Williams

    Dan Williams
     
  • The libnvdimm region driver is an intermediary driver that translates
    non-volatile "region"s into "namespace" sub-devices that are surfaced by
    persistent memory block-device drivers (PMEM and BLK).

    ACPI 6 introduces the concept that a given nvdimm may simultaneously
    offer multiple access modes to its media through direct PMEM load/store
    access, or windowed BLK mode. Existing nvdimms mostly implement a PMEM
    interface, some offer a BLK-like mode, but never both as ACPI 6 defines.
    If an nvdimm is single interfaced, then there is no need for dimm
    metadata labels. For these devices we can take the region boundaries
    directly to create a child namespace device (nd_namespace_io).

    Acked-by: Christoph Hellwig
    Tested-by: Toshi Kani
    Signed-off-by: Dan Williams

    Dan Williams
     
  • A "region" device represents the maximum capacity of a BLK range (mmio
    block-data-window(s)), or a PMEM range (DAX-capable persistent memory or
    volatile memory), without regard for aliasing. Aliasing, in the
    dimm-local address space (DPA), is resolved by metadata on a dimm to
    designate which exclusive interface will access the aliased DPA ranges.
    Support for the per-dimm metadata/label arrives in a subsequent
    patch.

    The name format of "region" devices is "regionN" where, like dimms, N is
    a global ida index assigned at discovery time. This id is not reliable
    across reboots nor in the presence of hotplug. Look to attributes of
    the region or static id-data of the sub-namespace to generate a
    persistent name. However, if the platform configuration does not change
    it is reasonable to expect the same region id to be assigned at the next
    boot.

    "region"s have 2 generic attributes "size", and "mapping"s where:
    - size: the BLK accessible capacity or the span of the
    system physical address range in the case of PMEM.

    - mappingN: a tuple describing a dimm's contribution to the region's
    capacity in the format (,,). For a PMEM-region
    there will be at least one mapping per dimm in the interleave set. For
    a BLK-region there is only "mapping0" listing the starting DPA of the
    BLK-region and the available DPA capacity of that space (matches "size"
    above).

    The max number of mappings per "region" is hard coded per the
    constraints of sysfs attribute groups. That said the number of mappings
    per region should never exceed the maximum number of possible dimms in
    the system. If the current number turns out to not be enough then the
    "mappings" attribute clarifies how many there are supposed to be. "32
    should be enough for anybody...".

    Cc: Neil Brown
    Cc: Greg KH
    Cc: Robert Moore
    Cc: Rafael J. Wysocki
    Acked-by: Christoph Hellwig
    Acked-by: Rafael J. Wysocki
    Tested-by: Toshi Kani
    Signed-off-by: Dan Williams

    Dan Williams
     
  • * Implement the device-model infrastructure for loading modules and
    attaching drivers to nvdimm devices. This is a simple association of a
    nd-device-type number with a driver that has a bitmask of supported
    device types. To facilitate userspace bind/unbind operations 'modalias'
    and 'devtype', that also appear in the uevent, are added as generic
    sysfs attributes for all nvdimm devices. The reason for the device-type
    number is to support sub-types within a given parent devtype, be it a
    vendor-specific sub-type or otherwise.

    * The first consumer of this infrastructure is the driver
    for dimm devices. It simply uses control messages to retrieve and
    store the configuration-data image (label set) from each dimm.

    Note: nd_device_register() arranges for asynchronous registration of
    nvdimm bus devices by default.
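
    The association between a device-type number and a driver's bitmask
    of supported types amounts to a single bit test. A minimal Python
    sketch; the type numbers and function names are hypothetical and not
    taken from the kernel headers.

```python
# Hypothetical device-type numbers, for illustration only.
ND_DEVICE_DIMM = 1
ND_DEVICE_REGION_PMEM = 2
ND_DEVICE_REGION_BLK = 3


def type_mask(*dev_types):
    """Build a driver's bitmask of supported device types."""
    mask = 0
    for dev_type in dev_types:
        mask |= 1 << dev_type
    return mask


def bus_match(dev_type, driver_mask):
    """True if a driver with driver_mask can bind a device of dev_type."""
    return bool(driver_mask & (1 << dev_type))
```

    A single driver can therefore claim several sub-types by setting
    more than one bit, as a region driver would for PMEM and BLK regions.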

    Cc: Greg KH
    Cc: Neil Brown
    Acked-by: Christoph Hellwig
    Tested-by: Toshi Kani
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Most discovery/configuration of the nvdimm-subsystem is done via sysfs
    attributes. However, some nvdimm_bus instances, particularly the
    ACPI.NFIT bus, define a small set of messages that can be passed to the
    platform. For convenience we derive the initial libnvdimm-ioctl command
    formats directly from the NFIT DSM Interface Example formats.

    ND_CMD_SMART: media health and diagnostics
    ND_CMD_GET_CONFIG_SIZE: size of the label space
    ND_CMD_GET_CONFIG_DATA: read label space
    ND_CMD_SET_CONFIG_DATA: write label space
    ND_CMD_VENDOR: vendor-specific command passthrough
    ND_CMD_ARS_CAP: report address-range-scrubbing capabilities
    ND_CMD_ARS_START: initiate scrubbing
    ND_CMD_ARS_STATUS: report on scrubbing state
    ND_CMD_SMART_THRESHOLD: configure alarm thresholds for smart events

    If a platform later defines different commands than this set it is
    straightforward to extend support to those formats.

    Most of the commands target a specific dimm. However, the
    address-range-scrubbing commands target the bus. The 'commands'
    attribute in sysfs of an nvdimm_bus, or nvdimm, enumerates the supported
    commands for that object.
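
    A 'commands' style attribute can be produced by walking a mask of
    supported commands and printing the name of each set bit. In this
    sketch the bit numbers, the lower-case names, and the space-separated
    output format are assumptions made for illustration.

```python
# Hypothetical bit assignments for bus-scoped commands.
BUS_COMMAND_NAMES = {
    1: "ars_cap",
    2: "ars_start",
    3: "ars_status",
}


def commands_attr(mask, names):
    """Render the supported-command mask as a space-separated string."""
    return " ".join(
        names[bit] for bit in sorted(names) if mask & (1 << bit)
    )
```

    Userspace can then discover what an object supports by reading one
    attribute instead of probing each command in turn.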

    Cc: Robert Moore
    Cc: Rafael J. Wysocki
    Reported-by: Nicholas Moulin
    Acked-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Enable nvdimm devices to be registered on a nvdimm_bus. The kernel
    assigned device id for nvdimm devices is dynamic. If userspace needs a
    more static identifier it should consult a provider-specific attribute.
    In the case where NFIT is the provider, the 'nmemX/nfit/handle' or
    'nmemX/nfit/serial' attributes may be used for this purpose.

    Cc: Neil Brown
    Cc: Greg KH
    Cc: Robert Moore
    Cc: Rafael J. Wysocki
    Acked-by: Christoph Hellwig
    Acked-by: Rafael J. Wysocki
    Tested-by: Toshi Kani
    Signed-off-by: Dan Williams

    Dan Williams
     
  • The control device for a nvdimm_bus is registered as an "nd" class
    device. The expectation is that there will usually only be one "nd" bus
    registered under /sys/class/nd. However, we allow for the possibility
    of multiple buses and they will be listed in discovery order as
    ndctl0...ndctlN. This character device hosts the ioctl for passing
    control messages. The initial command set has a 1:1 correlation with
    the commands listed by the "NFIT DSM Example" document [1], but
    this scheme is extensible to future command sets.

    Note, nd_ioctl() and the backing ->ndctl() implementation are defined in
    a subsequent patch. This is simply the initial registrations and sysfs
    attributes.

    [1]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

    Cc: Neil Brown
    Cc: Greg KH
    Cc: Robert Moore
    Cc: Rafael J. Wysocki
    Acked-by: Christoph Hellwig
    Acked-by: Rafael J. Wysocki
    Tested-by: Toshi Kani
    Signed-off-by: Dan Williams

    Dan Williams
     
  • A struct nvdimm_bus is the anchor device for registering nvdimm
    resources and interfaces, for example, a character control device,
    nvdimm devices, and I/O region devices. The ACPI NFIT (NVDIMM Firmware
    Interface Table) is one possible platform description for such
    non-volatile memory resources in a system. The nfit.ko driver attaches
    to the "ACPI0012" device that indicates the presence of the NFIT and
    parses the table to register a struct nvdimm_bus instance.

    Cc: Lv Zheng
    Cc: Robert Moore
    Cc: Rafael J. Wysocki
    Acked-by: Jeff Moyer
    Acked-by: Christoph Hellwig
    Acked-by: Rafael J. Wysocki
    Tested-by: Toshi Kani
    Signed-off-by: Dan Williams

    Dan Williams