02 Sep, 2020

1 commit

  • The nvdimm block drivers abuse revalidate_disk in a strange way that is
    totally unrelated to what other drivers do. Simplify this by just
    calling nvdimm_revalidate_disk (which seems rather misnamed) from the
    probe routines, as the additional bdev size revalidation is pointless
    at this point, and remove the revalidate_disk methods given that
    it can only be triggered from add_disk, which is right before the
    manual calls.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Josef Bacik
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

26 Jul, 2020

1 commit

  • The ND_CMD_CALL format allows for a general passthrough of passlisted
    commands targeting a given command set. However there is no validation
    of the family index relative to what the bus supports.

    - Update the NFIT bus implementation (the only one that supports
    ND_CMD_CALL passthrough) to also passlist the valid set of command
    family indices.

    - Update the generic __nd_ioctl() path to validate that field on behalf
    of all implementations.

    Fixes: 31eca76ba2fc ("nfit, libnvdimm: limited/whitelisted dimm command marshaling mechanism")
    Cc: Vishal Verma
    Cc: Dave Jiang
    Cc: Ira Weiny
    Cc: "Rafael J. Wysocki"
    Cc: Len Brown
    Cc:
    Signed-off-by: Dan Williams
    Signed-off-by: Vishal Verma

    Dan Williams
     

29 Feb, 2020

1 commit

  • The "cmd" comes from the user and it can be up to 255. If it's more
    than the number of bits in a long, it results in an out-of-bounds read
    when we check test_bit(cmd, &cmd_mask). The highest valid value for "cmd" is
    ND_CMD_CALL (10) so I added a compare against that.

    Fixes: 62232e45f4a2 ("libnvdimm: control (ioctl) messages for nvdimm_bus and nvdimm devices")
    Signed-off-by: Dan Carpenter
    Link: https://lore.kernel.org/r/20200225162055.amtosfy7m35aivxg@kili.mountain
    Signed-off-by: Dan Williams

    Dan Carpenter
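
A minimal user-space sketch of the range check described in the 29 Feb, 2020 entry: the command number is compared against ND_CMD_CALL (10, per the commit message) before any bit test is attempted. The names `ND_CMD_CALL_NR`, `mask_test_bit` and `cmd_is_supported` are hypothetical stand-ins for illustration, not the kernel's own symbols.

```c
#include <stdbool.h>

/* Highest command number the commit message cites as valid: ND_CMD_CALL (10). */
#define ND_CMD_CALL_NR 10u

/* User-space stand-in for test_bit() on a mask held in a single long. */
static bool mask_test_bit(unsigned int nr, unsigned long mask)
{
	return ((mask >> nr) & 1UL) != 0;
}

/*
 * Hypothetical helper: reject out-of-range commands first, so the bit
 * test never looks past the one unsigned long that holds the mask.
 */
static bool cmd_is_supported(unsigned int cmd, unsigned long cmd_mask)
{
	if (cmd > ND_CMD_CALL_NR)
		return false;
	return mask_test_bit(cmd, cmd_mask);
}
```

With this ordering, a user-supplied value such as 255 is rejected even if every bit of the mask is set.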
     

02 Dec, 2019

1 commit

  • Pull libnvdimm updates from Dan Williams:
    "The highlight this cycle is continuing integration fixes for PowerPC
    and some resulting optimizations.

    Summary:

    - Updates to better support vmalloc space restrictions on PowerPC
    platforms.

    - Cleanups to move common sysfs attributes to core 'struct
    device_type' objects.

    - Export the 'target_node' attribute (the effective numa node if pmem
    is marked online) for regions and namespaces.

    - Miscellaneous fixups and optimizations"

    * tag 'libnvdimm-for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (21 commits)
    MAINTAINERS: Remove Keith from NVDIMM maintainers
    libnvdimm: Export the target_node attribute for regions and namespaces
    dax: Add numa_node to the default device-dax attributes
    libnvdimm: Simplify root read-only definition for the 'resource' attribute
    dax: Simplify root read-only definition for the 'resource' attribute
    dax: Create a dax device_type
    libnvdimm: Move nvdimm_bus_attribute_group to device_type
    libnvdimm: Move nvdimm_attribute_group to device_type
    libnvdimm: Move nd_mapping_attribute_group to device_type
    libnvdimm: Move nd_region_attribute_group to device_type
    libnvdimm: Move nd_numa_attribute_group to device_type
    libnvdimm: Move nd_device_attribute_group to device_type
    libnvdimm: Move region attribute group definition
    libnvdimm: Move attribute groups to device type
    libnvdimm: Remove prototypes for nonexistent functions
    libnvdimm/btt: fix variable 'rc' set but not used
    libnvdimm/pmem: Delete include of nd-core.h
    libnvdimm/namespace: Differentiate between probe mapping and runtime mapping
    libnvdimm/pfn_dev: Don't clear device memmap area during generic namespace probe
    libnvdimm: Trivial comment fix
    ...

    Linus Torvalds
     

20 Nov, 2019

3 commits

  • Aneesh points out that some platforms may have "local" attached
    persistent memory and "remote" persistent memory that map to the same
    "online" node, or persistent memory devices with different performance
    properties. In this case 'numa_node' is identical for the two instances,
    but 'target_node' is differentiated so platform firmware can communicate
    distinct performance properties per range. Expose 'target_node' by
    default to allow for disambiguation of devices that share the same
    numa_map_to_online_node() result.

    Reported-by: "Aneesh Kumar K.V"
    Reviewed-by: Aneesh Kumar K.V
    Link: https://lore.kernel.org/r/157401274500.43284.2369509941678577768.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams

    Dan Williams
     
  • A 'struct device_type' instance can carry default attributes for the
    device. Use this facility to remove the export of
    nvdimm_bus_attribute_group and put the responsibility on the core rather
    than leaf implementations to define this attribute.

    Cc: Ira Weiny
    Cc: Michael Ellerman
    Cc: "Oliver O'Halloran"
    Cc: Vishal Verma
    Cc: Aneesh Kumar K.V
    Signed-off-by: Dan Williams
    Reviewed-by: Aneesh Kumar K.V
    Link: https://lore.kernel.org/r/157309903815.1582359.6418211876315050283.stgit@dwillia2-desk3.amr.corp.intel.com

    Dan Williams
     
  • A 'struct device_type' instance can carry default attributes for the
    device. Use this facility to remove the export of
    nd_numa_attribute_group and put the responsibility on the core rather
    than leaf implementations to define this attribute.

    Cc: Ira Weiny
    Cc: Michael Ellerman
    Cc: "Oliver O'Halloran"
    Cc: Vishal Verma
    Cc: Aneesh Kumar K.V
    Reviewed-by: Aneesh Kumar K.V
    Link: https://lore.kernel.org/r/157401269537.43284.14411189404186877352.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams

    Dan Williams
     

18 Nov, 2019

1 commit

  • A 'struct device_type' instance can carry default attributes for the
    device. Use this facility to remove the export of
    nd_device_attribute_group and put the responsibility on the core rather
    than leaf implementations to define this attribute.

    For regions this creates a new nd_region_attribute_groups[] added to the
    per-region device-type instances.

    Cc: Ira Weiny
    Cc: Michael Ellerman
    Cc: "Oliver O'Halloran"
    Cc: Vishal Verma
    Cc: Aneesh Kumar K.V
    Reviewed-by: Aneesh Kumar K.V
    Link: https://lore.kernel.org/r/157309901138.1582359.12909354140826530394.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams

    Dan Williams
     

23 Oct, 2019

1 commit

  • The .ioctl and .compat_ioctl file operations have the same prototype so
    they can both point to the same function, which works great almost all
    the time when all the commands are compatible.

    One exception is the s390 architecture, where a compat pointer is only
    31 bit wide, and converting it into a 64-bit pointer requires calling
    compat_ptr(). Most drivers here will never run on s390, but since we now
    have a generic helper for it, it's easy enough to use it consistently.

    I double-checked all these drivers to ensure that all ioctl arguments
    are used as pointers or are ignored, but are not interpreted as integer
    values.

    Acked-by: Jason Gunthorpe
    Acked-by: Daniel Vetter
    Acked-by: Mauro Carvalho Chehab
    Acked-by: Greg Kroah-Hartman
    Acked-by: David Sterba
    Acked-by: Darren Hart (VMware)
    Acked-by: Jonathan Cameron
    Acked-by: Bjorn Andersson
    Acked-by: Dan Williams
    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
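
The pointer-width difference mentioned in the 23 Oct, 2019 entry can be illustrated with a small user-space model. This is only a sketch: the function names are hypothetical, and the 31-bit mask reflects the commit's statement that an s390 compat pointer is 31 bits wide, while the "generic" variant assumes plain zero-extension is enough elsewhere.

```c
#include <stdint.h>

/*
 * Hypothetical model of an s390-style conversion: only the low 31 bits
 * of the 32-bit compat value carry the address, so the top bit is masked.
 */
static uint64_t compat_ptr_s390_model(uint32_t uptr)
{
	return (uint64_t)(uptr & 0x7fffffffu);
}

/* Hypothetical model for architectures where zero-extension suffices. */
static uint64_t compat_ptr_generic_model(uint32_t uptr)
{
	return (uint64_t)uptr;
}
```

In this model a compat value of 0x80001000 becomes 0x1000 on the s390-style path but stays 0x80001000 on the generic path, which is why an ioctl argument that is a pointer needs the conversion while a plain integer argument must not receive it.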
     

25 Sep, 2019

1 commit

  • We check for bad blocks during namespace init, and that check uses the
    region bad block list. We need to initialize the bad block list
    for volatile regions for this to work. We also observe a lockdep
    warning, shown below, because the lock is not initialized correctly
    since we skip bad block init for volatile regions.

    INFO: trying to register non-static key.
    the code is fine but needs lockdep annotation.
    turning off the locking correctness validator.
    CPU: 2 PID: 1 Comm: swapper/0 Not tainted 5.3.0-rc1-15699-g3dee241c937e #149
    Call Trace:
    [c0000000f95cb250] [c00000000147dd84] dump_stack+0xe8/0x164 (unreliable)
    [c0000000f95cb2a0] [c00000000022ccd8] register_lock_class+0x308/0xa60
    [c0000000f95cb3a0] [c000000000229cc0] __lock_acquire+0x170/0x1ff0
    [c0000000f95cb4c0] [c00000000022c740] lock_acquire+0x220/0x270
    [c0000000f95cb580] [c000000000a93230] badblocks_check+0xc0/0x290
    [c0000000f95cb5f0] [c000000000d97540] nd_pfn_validate+0x5c0/0x7f0
    [c0000000f95cb6d0] [c000000000d98300] nd_dax_probe+0xd0/0x1f0
    [c0000000f95cb760] [c000000000d9b66c] nd_pmem_probe+0x10c/0x160
    [c0000000f95cb790] [c000000000d7f5ec] nvdimm_bus_probe+0x10c/0x240
    [c0000000f95cb820] [c000000000d0f844] really_probe+0x254/0x4e0
    [c0000000f95cb8b0] [c000000000d0fdfc] driver_probe_device+0x16c/0x1e0
    [c0000000f95cb930] [c000000000d10238] device_driver_attach+0x68/0xa0
    [c0000000f95cb970] [c000000000d1040c] __driver_attach+0x19c/0x1c0
    [c0000000f95cb9f0] [c000000000d0c4c4] bus_for_each_dev+0x94/0x130
    [c0000000f95cba50] [c000000000d0f014] driver_attach+0x34/0x50
    [c0000000f95cba70] [c000000000d0e208] bus_add_driver+0x178/0x2f0
    [c0000000f95cbb00] [c000000000d117c8] driver_register+0x108/0x170
    [c0000000f95cbb70] [c000000000d7edb0] __nd_driver_register+0xe0/0x100
    [c0000000f95cbbd0] [c000000001a6baa4] nd_pmem_driver_init+0x34/0x48
    [c0000000f95cbbf0] [c0000000000106f4] do_one_initcall+0x1d4/0x4b0
    [c0000000f95cbcd0] [c0000000019f499c] kernel_init_freeable+0x544/0x65c
    [c0000000f95cbdb0] [c000000000010d6c] kernel_init+0x2c/0x180
    [c0000000f95cbe20] [c00000000000b954] ret_from_kernel_thread+0x5c/0x68

    Signed-off-by: Aneesh Kumar K.V
    Link: https://lore.kernel.org/r/20190919083355.26340-1-aneesh.kumar@linux.ibm.com
    Signed-off-by: Dan Williams

    Aneesh Kumar K.V
     

06 Sep, 2019

2 commits

  • In order to support marking namespaces with unsupported feature/versions
    disabled, nvdimm core should advance the namespace seed on these
    probe failures. Otherwise, these failed namespaces will be considered a
    seed namespace and will be wrongly used while creating new namespaces.

    Add -EOPNOTSUPP as a return value from the pmem probe callback to indicate a
    namespace initialization failure due to pfn superblock feature/version mismatch.

    Signed-off-by: Aneesh Kumar K.V
    Link: https://lore.kernel.org/r/20190905154603.10349-3-aneesh.kumar@linux.ibm.com
    Signed-off-by: Dan Williams

    Aneesh Kumar K.V
     
  • The nd_region_probe_success() helper collides seed management with
    nvdimm->busy tracking. Given the 'busy' increment is handled internal to the
    nd_region driver 'probe' path move the decrement to the 'remove' path.
    With that cleanup the routine can be renamed to the more descriptive
    nd_region_advance_seeds().

    The change is prompted by an incoming need to optionally advance the
    seeds on other events besides 'probe' success.

    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Dan Williams
    Signed-off-by: Aneesh Kumar K.V
    Link: https://lore.kernel.org/r/20190905154603.10349-2-aneesh.kumar@linux.ibm.com
    Signed-off-by: Dan Williams

    Dan Williams
     

30 Aug, 2019

1 commit

  • In the process of debugging a system with an NVDIMM that was failing to
    unlock it was found that the kernel is reporting 'locked' while the DIMM
    security interface is 'frozen'. Unfortunately the security state is
    tracked internally as an enum which prevents it from communicating the
    difference between 'locked' and 'locked + frozen'. It follows that the
    enum also prevents the kernel from communicating 'unlocked + frozen'
    which would be useful for debugging why security operations like 'change
    passphrase' are disabled.

    Ditch the security state enum for a set of flags and introduce a new
    sysfs attribute explicitly for the 'frozen' state. The regression risk
    is low because the 'frozen' state was already blocked behind the
    'locked' state, but will need to revisit if there were cases where
    applications need 'frozen' to show up in the primary 'security'
    attribute. The expectation is that communicating 'frozen' is mostly a
    helper for debug and status monitoring.

    Reviewed-by: Dave Jiang
    Reported-by: Jeff Moyer
    Reviewed-by: Jeff Moyer
    Link: https://lore.kernel.org/r/156686729474.184120.5835135644278860826.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams

    Dan Williams
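
A sketch of why the 30 Aug, 2019 entry replaces the enum with flags: a single enum value cannot express 'locked + frozen' or 'unlocked + frozen', whereas independent bits can. The flag names and the two "show" helpers below are hypothetical; they only mirror the described split between the primary 'security' attribute and the separate 'frozen' attribute.

```c
#include <stdbool.h>

/* Hypothetical flag bits; each state can now be set independently. */
enum sec_flag {
	SEC_UNLOCKED = 1u << 0,
	SEC_LOCKED   = 1u << 1,
	SEC_FROZEN   = 1u << 2,
};

/* Models the primary 'security' attribute: reports the lock state only. */
static const char *security_show(unsigned int flags)
{
	if (flags & SEC_LOCKED)
		return "locked";
	if (flags & SEC_UNLOCKED)
		return "unlocked";
	return "disabled";
}

/* Models the separate 'frozen' attribute. */
static bool frozen_show(unsigned int flags)
{
	return (flags & SEC_FROZEN) != 0;
}
```

Under these assumptions, `SEC_LOCKED | SEC_FROZEN` still reads as "locked" in the primary attribute, while the frozen bit is visible through the second helper.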
     

19 Jul, 2019

5 commits

  • For good reason, the standard device_lock() is marked
    lockdep_set_novalidate_class() because there is simply no sane way to
    describe the myriad ways the device_lock() is ordered with other locks.
    However, that leaves subsystems that know their own local device_lock()
    ordering rules to find lock ordering mistakes manually. Instead,
    introduce an optional / additional lockdep-enabled lock that a subsystem
    can acquire in all the same paths that the device_lock() is acquired.

    A conversion of the NFIT driver and NVDIMM subsystem to a
    lockdep-validate device_lock() scheme is included. The
    debug_nvdimm_lock() implementation implements the correct lock-class and
    stacking order for the libnvdimm device topology hierarchy.

    Yes, this is a hack, but hopefully it is a useful hack for other
    subsystems device_lock() debug sessions. Quoting Greg:

    "Yeah, it feels a bit hacky but it's really up to a subsystem to mess up
    using it as much as anything else, so user beware :)

    I don't object to it if it makes things easier for you to debug."

    Cc: Ingo Molnar
    Cc: Ira Weiny
    Cc: Will Deacon
    Cc: Dave Jiang
    Cc: Keith Busch
    Cc: Peter Zijlstra
    Cc: Vishal Verma
    Cc: "Rafael J. Wysocki"
    Cc: Greg Kroah-Hartman
    Signed-off-by: Dan Williams
    Acked-by: Greg Kroah-Hartman
    Reviewed-by: Ira Weiny
    Link: https://lore.kernel.org/r/156341210661.292348.7014034644265455704.stgit@dwillia2-desk3.amr.corp.intel.com

    Dan Williams
     
  • A multithreaded namespace creation/destruction stress test currently
    deadlocks with the following lockup signature:

    INFO: task ndctl:2924 blocked for more than 122 seconds.
    Tainted: G OE 5.2.0-rc4+ #3382
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    ndctl D 0 2924 1176 0x00000000
    Call Trace:
    ? __schedule+0x27e/0x780
    schedule+0x30/0xb0
    wait_nvdimm_bus_probe_idle+0x8a/0xd0 [libnvdimm]
    ? finish_wait+0x80/0x80
    uuid_store+0xe6/0x2e0 [libnvdimm]
    kernfs_fop_write+0xf0/0x1a0
    vfs_write+0xb7/0x1b0
    ksys_write+0x5c/0xd0
    do_syscall_64+0x60/0x240

    INFO: task ndctl:2923 blocked for more than 122 seconds.
    Tainted: G OE 5.2.0-rc4+ #3382
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    ndctl D 0 2923 1175 0x00000000
    Call Trace:
    ? __schedule+0x27e/0x780
    ? __mutex_lock+0x489/0x910
    schedule+0x30/0xb0
    schedule_preempt_disabled+0x11/0x20
    __mutex_lock+0x48e/0x910
    ? nvdimm_namespace_common_probe+0x95/0x4d0 [libnvdimm]
    ? __lock_acquire+0x23f/0x1710
    ? nvdimm_namespace_common_probe+0x95/0x4d0 [libnvdimm]
    nvdimm_namespace_common_probe+0x95/0x4d0 [libnvdimm]
    __dax_pmem_probe+0x5e/0x210 [dax_pmem_core]
    ? nvdimm_bus_probe+0x1d0/0x2c0 [libnvdimm]
    dax_pmem_probe+0xc/0x20 [dax_pmem]
    nvdimm_bus_probe+0x90/0x2c0 [libnvdimm]
    really_probe+0xef/0x390
    driver_probe_device+0xb4/0x100

    In this sequence an 'nd_dax' device is being probed and trying to take
    the lock on its backing namespace to validate that the 'nd_dax' device
    indeed has exclusive access to the backing namespace. Meanwhile, another
    thread is trying to update the uuid property of that same backing
    namespace. So one thread is in the probe path trying to acquire the
    lock, and the other thread has acquired the lock and tries to flush the
    probe path.

    Fix this deadlock by not holding the namespace device_lock over the
    wait_nvdimm_bus_probe_idle() synchronization step. In turn this requires
    the device_lock to be held on entry to wait_nvdimm_bus_probe_idle() and
    subsequently dropped internally to wait_nvdimm_bus_probe_idle().

    Cc:
    Fixes: bf9bccc14c05 ("libnvdimm: pmem label sets and namespace instantiation")
    Cc: Vishal Verma
    Tested-by: Jane Chu
    Link: https://lore.kernel.org/r/156341210094.292348.2384694131126767789.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams

    Dan Williams
     
  • In preparation for fixing a deadlock between wait_for_bus_probe_idle()
    and the nvdimm_bus_list_mutex, arrange for __nd_ioctl() to be called without
    nvdimm_bus_list_mutex held. This also unifies the 'dimm' and 'bus' level
    ioctls into a common nd_ioctl() preamble implementation.

    Marked for -stable as it is a pre-requisite for a follow-on fix.

    Cc:
    Fixes: bf9bccc14c05 ("libnvdimm: pmem label sets and namespace instantiation")
    Cc: Vishal Verma
    Tested-by: Jane Chu
    Link: https://lore.kernel.org/r/156341209518.292348.7183897251740665198.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams

    Dan Williams
     
  • In preparation for not holding a lock over the execution of nd_ioctl(),
    update the implementation to allow multiple threads to be attempting
    ioctls at the same time. The bus lock still prevents multiple in-flight
    ->ndctl() invocations from corrupting each other's state, but static
    global staging buffers are moved to the heap.

    Reported-by: Vishal Verma
    Reviewed-by: Vishal Verma
    Tested-by: Vishal Verma
    Link: https://lore.kernel.org/r/156341208947.292348.10560140326807607481.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams

    Dan Williams
     
  • A multithreaded namespace creation/destruction stress test currently
    fails with signatures like the following:

    sysfs group 'power' not found for kobject 'dax1.1'
    RIP: 0010:sysfs_remove_group+0x76/0x80
    Call Trace:
    device_del+0x73/0x370
    device_unregister+0x16/0x50
    nd_async_device_unregister+0x1e/0x30 [libnvdimm]
    async_run_entry_fn+0x39/0x160
    process_one_work+0x23c/0x5e0
    worker_thread+0x3c/0x390

    BUG: kernel NULL pointer dereference, address: 0000000000000020
    RIP: 0010:klist_put+0x1b/0x6c
    Call Trace:
    klist_del+0xe/0x10
    device_del+0x8a/0x2c9
    ? __switch_to_asm+0x34/0x70
    ? __switch_to_asm+0x40/0x70
    device_unregister+0x44/0x4f
    nd_async_device_unregister+0x22/0x2d [libnvdimm]
    async_run_entry_fn+0x47/0x15a
    process_one_work+0x1a2/0x2eb
    worker_thread+0x1b8/0x26e

    Use the kill_device() helper to atomically resolve the race of multiple
    threads issuing kill, device_unregister(), requests.

    Reported-by: Jane Chu
    Reported-by: Erwin Tsaur
    Fixes: 4d88a97aa9e8 ("libnvdimm, nvdimm: dimm driver and base libnvdimm device-driver...")
    Cc:
    Link: https://github.com/pmem/ndctl/issues/96
    Tested-by: Jane Chu
    Link: https://lore.kernel.org/r/156341207846.292348.10435719262819764054.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams

    Dan Williams
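
The last entry above resolves a race between several threads that all try to unregister the same device. The idea of "only the first caller proceeds" can be sketched in user space as follows. This is an illustration, not the kernel helper: it swaps in a C11 `atomic_flag` for whatever synchronization kill_device() actually relies on, and `fake_device`, `fake_kill_device` and `fake_unregister` are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a device that may be unregistered by many threads. */
struct fake_device {
	atomic_flag dead;       /* set once the device is marked for removal */
	int unregister_calls;   /* counts how often teardown actually ran */
};

/* Returns true only for the first caller; later callers see the flag set. */
static bool fake_kill_device(struct fake_device *dev)
{
	return !atomic_flag_test_and_set(&dev->dead);
}

/* Teardown runs at most once, however many callers request it. */
static void fake_unregister(struct fake_device *dev)
{
	if (!fake_kill_device(dev))
		return;
	dev->unregister_calls++;
}
```

Calling `fake_unregister()` twice on the same object performs the teardown step once, which is the property the stress test above was missing.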
     

05 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of version 2 of the gnu general public license as
    published by the free software foundation this program is
    distributed in the hope that it will be useful but without any
    warranty without even the implied warranty of merchantability or
    fitness for a particular purpose see the gnu general public license
    for more details

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 64 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Alexios Zavras
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190529141901.894819585@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

21 May, 2019

1 commit

  • Several places (dimm_devs.c, core.c etc) include label.h but only
    label.c uses NSINDEX_SIGNATURE, so move its definition to label.c
    instead.

    In file included from drivers/nvdimm/dimm_devs.c:23:
    drivers/nvdimm/label.h:41:19: warning: 'NSINDEX_SIGNATURE' defined but
    not used [-Wunused-const-variable=]

    Also, some places abuse "/**" which is only reserved for the kernel-doc.

    drivers/nvdimm/bus.c:648: warning: cannot understand function prototype:
    'struct attribute_group nd_device_attribute_group = '
    drivers/nvdimm/bus.c:677: warning: cannot understand function prototype:
    'struct attribute_group nd_numa_attribute_group = '

    Those are just some member assignments for the "struct attribute_group"
    instances and it can't be expressed in the kernel-doc.

    Reviewed-by: Vishal Verma
    Signed-off-by: Qian Cai
    Signed-off-by: Dan Williams

    Qian Cai
     

09 Apr, 2019

1 commit

  • %pF and %pf are functionally equivalent to %pS and %ps conversion
    specifiers. The former are deprecated, therefore switch the current users
    to use the preferred variant.

    The changes have been produced by the following command:

    git grep -l '%p[fF]' | grep -v '^\(tools\|Documentation\)/' | \
    while read i; do perl -i -pe 's/%pf/%ps/g; s/%pF/%pS/g;' $i; done

    And verifying the result.

    Link: http://lkml.kernel.org/r/20190325193229.23390-1-sakari.ailus@linux.intel.com
    Cc: Andy Shevchenko
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: sparclinux@vger.kernel.org
    Cc: linux-um@lists.infradead.org
    Cc: xen-devel@lists.xenproject.org
    Cc: linux-acpi@vger.kernel.org
    Cc: linux-pm@vger.kernel.org
    Cc: drbd-dev@lists.linbit.com
    Cc: linux-block@vger.kernel.org
    Cc: linux-mmc@vger.kernel.org
    Cc: linux-nvdimm@lists.01.org
    Cc: linux-pci@vger.kernel.org
    Cc: linux-scsi@vger.kernel.org
    Cc: linux-btrfs@vger.kernel.org
    Cc: linux-f2fs-devel@lists.sourceforge.net
    Cc: linux-mm@kvack.org
    Cc: ceph-devel@vger.kernel.org
    Cc: netdev@vger.kernel.org
    Signed-off-by: Sakari Ailus
    Acked-by: David Sterba (for btrfs)
    Acked-by: Mike Rapoport (for mm/memblock.c)
    Acked-by: Bjorn Helgaas (for drivers/pci)
    Acked-by: Rafael J. Wysocki
    Signed-off-by: Petr Mladek

    Sakari Ailus
     

31 Jan, 2019

1 commit

  • Force the device registration for nvdimm devices to be closer to the actual
    device. This is achieved by using either the NUMA node ID of the region, or
    of the parent. By doing this we can have everything above the region based
    on the region, and everything below the region based on the nvdimm bus.

    By guaranteeing NUMA locality I see an improvement of as high as 25% for
    per-node init of a system with 12TB of persistent memory.

    Reviewed-by: Bart Van Assche
    Signed-off-by: Alexander Duyck
    Signed-off-by: Greg Kroah-Hartman

    Alexander Duyck
     

28 Dec, 2018

1 commit


22 Dec, 2018

1 commit

  • Add support for the NVDIMM_FAMILY_INTEL "overwrite" capability as
    described by the Intel DSM spec v1.7. This will allow triggering of
    overwrite on Intel NVDIMMs. The overwrite operation can take tens of
    minutes. When the overwrite DSM is issued successfully, the NVDIMMs will
    be inaccessible. The kernel will do backoff polling to detect when the
    overwrite process is completed. According to the DSM spec v1.7, the 128G
    NVDIMMs can take up to 15 minutes to perform overwrite and larger DIMMs will
    take longer.

    Given that overwrite puts the DIMM in an indeterminate state until it
    completes, introduce the NDD_SECURITY_OVERWRITE flag to prevent other
    operations from executing when overwrite is happening. The
    NDD_WORK_PENDING flag is added to denote that there is a device reference
    on the nvdimm device for an async workqueue thread context.

    Signed-off-by: Dave Jiang
    Signed-off-by: Dan Williams

    Dave Jiang
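
The 22 Dec, 2018 entry mentions backoff polling without giving the schedule. One possible shape is sketched below, purely as an assumption: the interval doubles after each unsuccessful poll and is capped. The starting value, the doubling rule, the 900-second cap and both function names are hypothetical and are not taken from the commit.

```c
/* Hypothetical polling parameters, in seconds. */
#define POLL_START_SECS 1u
#define POLL_MAX_SECS   900u

/* Next wait interval: double the current one, but never exceed the cap. */
static unsigned int next_poll_interval(unsigned int cur_secs)
{
	unsigned int next = cur_secs * 2u;

	return next > POLL_MAX_SECS ? POLL_MAX_SECS : next;
}

/* Total time waited after a given number of unsuccessful polls. */
static unsigned int total_wait_after(unsigned int polls)
{
	unsigned int interval = POLL_START_SECS;
	unsigned int waited = 0;

	while (polls--) {
		waited += interval;   /* a driver would sleep or requeue work here */
		interval = next_poll_interval(interval);
	}
	return waited;
}
```

A schedule like this keeps early polls cheap for small DIMMs while bounding how stale the completion status can become for long-running overwrites.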
     

14 Dec, 2018

1 commit

  • Some NVDIMMs, like the ones defined by the NVDIMM_FAMILY_INTEL command
    set, expose a security capability to lock the DIMMs at poweroff and
    require a passphrase to unlock them. The security model is derived from
    ATA security. In anticipation of other DIMMs implementing a similar
    scheme, and to abstract the core security implementation away from the
    device-specific details, introduce nvdimm_security_ops.

    Initially only a status retrieval operation, ->state(), is defined,
    along with the base infrastructure and definitions for future
    operations.

    Signed-off-by: Dave Jiang
    Co-developed-by: Dan Williams
    Signed-off-by: Dan Williams

    Dave Jiang
     

11 Dec, 2018

1 commit


05 Dec, 2018

1 commit

  • Add command definitions for the security commands defined in Intel DSM
    specification v1.8 [1]. This includes "get security state", "set
    passphrase", "unlock unit", "freeze lock", "secure erase", "overwrite",
    "overwrite query", "master passphrase enable/disable", and "master
    erase". Since this adds several Intel definitions, move the relevant
    bits to their own header.

    These commands mutate physical data, but that manipulation is not cache
    coherent. The requirement to flush and invalidate caches makes these
    commands unsuitable to be called from userspace, so extra logic is added
    to detect and block these commands from being submitted via the ioctl
    command submission path.

    Lastly, the commands may contain sensitive key material that should not
    be dumped in a standard debug session. Update the nvdimm-command
    payload-dump facility to move security command payloads behind a
    default-off compile time switch.

    [1]: http://pmem.io/documents/NVDIMM_DSM_Interface-V1.8.pdf

    Signed-off-by: Dave Jiang
    Signed-off-by: Dan Williams

    Dave Jiang
     

27 Sep, 2018

2 commits

  • This change makes it so that we don't repeatedly overwrite the device node
    for nvdimm regions. The earliest we can set the node is immediately after
    calling device init, so I have moved the code there so we can avoid
    rewriting the node with each uevent.

    Signed-off-by: Alexander Duyck
    Signed-off-by: Dan Williams

    Alexander Duyck
     
  • Unlike asynchronous initialization in the core we have not yet associated
    the device with the parent, and as such the device doesn't hold a reference
    to the parent.

    In order to resolve that we should be holding a reference on the parent
    until the asynchronous initialization has completed.

    Cc:
    Fixes: 4d88a97aa9e8 ("libnvdimm: ...base ... infrastructure")
    Signed-off-by: Alexander Duyck
    Signed-off-by: Dan Williams

    Alexander Duyck
     

20 Aug, 2018

1 commit

  • Commit efda1b5d87cb ("acpi, nfit, libnvdimm: fix / harden ars_status output length handling")
    introduced additional hardening for ambiguity in the ACPI spec for
    ars_status output sizing. However, it had a couple of cases mixed up.
    Where it should have been checking for (and returning) "out_field[1] -
    4" it was using "out_field[1] - 8" and vice versa.

    This caused a four byte discrepancy in the buffer size passed on to
    the command handler, and in some cases, this caused memory corruption
    like:

    ./daxdev-errors.sh: line 76: 24104 Aborted (core dumped) ./daxdev-errors $busdev $region
    malloc(): memory corruption
    Program received signal SIGABRT, Aborted.
    [...]
    #5 0x00007ffff7865a2e in calloc () from /lib64/libc.so.6
    #6 0x00007ffff7bc2970 in ndctl_bus_cmd_new_ars_status (ars_cap=ars_cap@entry=0x6153b0) at ars.c:136
    #7 0x0000000000401644 in check_ars_status (check=0x7fffffffdeb0, bus=0x604c20) at daxdev-errors.c:144
    #8 test_daxdev_clear_error (region_name=, bus_name=)
    at daxdev-errors.c:332

    Cc:
    Cc: Dave Jiang
    Cc: Keith Busch
    Cc: Lukasz Dorau
    Cc: Dan Williams
    Fixes: efda1b5d87cb ("acpi, nfit, libnvdimm: fix / harden ars_status output length handling")
    Signed-off-by: Vishal Verma
    Reviewed-by: Keith Busch
    Signed-off-by: Dave Jiang

    Vishal Verma
     

03 Jun, 2018

1 commit

  • Instrument nvdimm_bus_probe() to emit timestamps for the start and end
    of libnvdimm device probing. This is useful for identifying sources of
    libnvdimm sub-system initialization latency.

    Signed-off-by: Dan Williams

    Dan Williams
     

01 Jun, 2018

1 commit

  • The pmem driver does not honor a forced read-only setting for very long:
    $ blockdev --setro /dev/pmem0
    $ blockdev --getro /dev/pmem0
    1

    followed by various commands like these:
    $ blockdev --rereadpt /dev/pmem0
    or
    $ mkfs.ext4 /dev/pmem0

    results in this in the kernel serial log:
    nd_pmem namespace0.0: region0 read-write, marking pmem0 read-write

    with the read-only setting lost:
    $ blockdev --getro /dev/pmem0
    0

    That's from bus.c nvdimm_revalidate_disk(), which always applies the
    setting from nd_region (which is initially based on the ACPI NFIT
    NVDIMM state flags not_armed bit).

    In contrast, commit 20bd1d026aac ("scsi: sd: Keep disk read-only when
    re-reading partition") fixed this issue for SCSI devices to preserve
    the previous setting if it was set to read-only.

    This patch modifies bus.c to preserve any previous read-only setting.
    It also eliminates the kernel serial log print except for cases where
    read-write is changed to read-only, so it doesn't print read-only to
    read-only non-changes.

    Cc:
    Fixes: 581388209405 ("libnvdimm, nfit: handle unarmed dimms, mark namespaces read-only")
    Signed-off-by: Robert Elliott
    Signed-off-by: Dan Williams

    Robert Elliott
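
The behavior described in the 01 Jun, 2018 entry (keep a forced read-only setting, and only report a change when read-write becomes read-only) can be modeled in a few lines. This is a user-space sketch with hypothetical names (`fake_disk`, `revalidate_ro`); it mirrors the described policy rather than reproducing the code in bus.c.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the block device's read-only state. */
struct fake_disk {
	bool ro;
};

/*
 * Apply the region's read-only state to the disk without ever clearing
 * a read-only setting that is already in place. Returns true only when
 * the disk changed from read-write to read-only, which is the one case
 * where a log message would be printed.
 */
static bool revalidate_ro(struct fake_disk *disk, bool region_ro)
{
	if (disk->ro || region_ro == disk->ro)
		return false;   /* already read-only, or nothing to change */

	disk->ro = true;
	return true;
}
```

With this policy, a disk set read-only through `blockdev --setro` stays read-only across revalidation even when the region itself is read-write.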
     

07 Apr, 2018

1 commit

  • We want to be able to cross reference the region and bus devices
    with the device tree node that they were spawned from. libNVDIMM
    handles creating the actual devices for these internally, so we
    need to pass in a pointer to the relevant node in the descriptor.

    Signed-off-by: Oliver O'Halloran
    Acked-by: Dan Williams
    Acked-by: Balbir Singh
    Signed-off-by: Dan Williams

    Oliver O'Halloran
     

07 Mar, 2018

1 commit

  • Dynamic debug can be instructed to add the function name to the debug
    output using the +f switch, so there is no need for the libnvdimm
    modules to do it again. If a user decides to add the +f switch for
    libnvdimm's dynamic debug this results in double prints of the function
    name.

    Reported-by: Johannes Thumshirn
    Reported-by: Ross Zwisler
    Signed-off-by: Dan Williams

    Dan Williams
     

05 Dec, 2017

1 commit

  • The kernel's ND_IOCTL_SMART_THRESHOLD command is based on a payload
    definition that has become broken / out-of-sync with recent versions of
    the NVDIMM_FAMILY_INTEL definition. Deprecate the use of the
    ND_IOCTL_SMART_THRESHOLD command in favor of the ND_CMD_CALL approach
    taken by NVDIMM_FAMILY_{HPE,MSFT}, where we can manage the per-vendor
    variance in userspace.

    In a couple years, when the new scheme is widely deployed in userspace
    packages, the ND_IOCTL_SMART_THRESHOLD support can be removed. For now
    we prevent new binaries from compiling against the kernel header
    definitions, but the kernel remains compatible with old binaries. The
    libndctl.h [1] header is now the authoritative interface definition for
    NVDIMM SMART.

    [1]: https://github.com/pmem/ndctl
    Signed-off-by: Dan Williams

    Dan Williams
     

03 Nov, 2017

1 commit

  • nfit_test needs to use the poison list manipulation code as well. Make
    it more generic and in the process rename poison to badrange, and move
    all the related helpers to a new file.

    Signed-off-by: Dave Jiang
    [vishal: Add badrange.o to nfit_test's Kbuild]
    [vishal: add a missed include in bus.c for the new badrange functions]
    [vishal: rename all instances of 'be' to 'bre']
    Signed-off-by: Vishal Verma
    Signed-off-by: Dan Williams

    Dave Jiang
     

05 Sep, 2017

1 commit

  • Delay the check of nd_reserved2 to the actual endpoint (acpi_nfit_ctl)
    that uses it, to prevent a potential double-fetch bug.

    While examining the kernel source code, I found a dangerous operation that
    could turn into a double-fetch situation (a race condition bug), where
    the same userspace memory region is fetched twice into the kernel, with sanity
    checks after the first fetch but none after the second fetch.

    In the case of _IOC_NR(ioctl_cmd) == ND_CMD_CALL:

    1. The first fetch happens in line 935: copy_from_user(&pkg, p, sizeof(pkg))

    2. subsequently `pkg.nd_reserved2` is asserted to be all zeroes
    (line 984 to 986).

    3. The second fetch happens in line 1022: copy_from_user(buf, p, buf_len)

    4. Given that `p` can be fully controlled in userspace, an attacker can
    race to overwrite the header part of `p`, say,
    `((struct nd_cmd_pkg *)p)->nd_reserved2`, with an arbitrary value
    (say nine 0xFFFFFFFF words for `nd_reserved2`) after the first fetch but before the
    second fetch. The changed value will be copied to `buf`.

    5. There are no checks on the second fetch until its use in
    line 1034: nd_cmd_clear_to_send(nvdimm_bus, nvdimm, cmd, buf) and
    line 1038: nd_desc->ndctl(nd_desc, nvdimm, cmd, buf, buf_len, &cmd_rc),
    which means that the assumed relation, that `p->nd_reserved2` is all zeroes, might
    not hold after the second fetch. And once control passes to these functions
    we lose the context to assert the assumed relation.

    6. Based on my manual analysis, `p->nd_reserved2` is not used in function
    `nd_cmd_clear_to_send` or in potential implementations of `nd_desc->ndctl`,
    so there is no working exploit against it right now. However, this could
    easily turn into an exploitable one if careless developers start to use
    `p->nd_reserved2` later and assume that it is all zeroes.

    Move the validation of the nd_reserved2 field to the ->ndctl()
    implementation where it has a stable buffer to evaluate.
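
    The pattern described above can be modelled in userspace. The following is
    a minimal sketch, not the kernel code: `struct pkg_hdr`, `fetch()`,
    `race_hook` and both handler functions are hypothetical names, `fetch()`
    stands in for copy_from_user() using memcpy(), and the hook simulates an
    attacker rewriting the user buffer between the two fetches.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RESERVED_WORDS 9

/* Hypothetical stand-in for the header part of struct nd_cmd_pkg. */
struct pkg_hdr {
    uint32_t reserved[RESERVED_WORDS];
    uint32_t payload_len;
};

/* Simulates an attacker touching user memory between the two fetches. */
typedef void (*race_hook)(struct pkg_hdr *user);

/* Stand-in for copy_from_user(); in this model it always succeeds. */
int fetch(void *dst, const void *user, size_t len)
{
    memcpy(dst, user, len);
    return 0;
}

int reserved_is_zero(const struct pkg_hdr *h)
{
    for (size_t i = 0; i < RESERVED_WORDS; i++)
        if (h->reserved[i])
            return 0;
    return 1;
}

/* Double-fetch: the check runs on the first copy, the second copy is used. */
int handle_check_early(struct pkg_hdr *user, struct pkg_hdr *buf, race_hook hook)
{
    struct pkg_hdr first;

    fetch(&first, user, sizeof(first));     /* first fetch */
    if (!reserved_is_zero(&first))
        return -EINVAL;
    if (hook)
        hook(user);                         /* race window */
    fetch(buf, user, sizeof(*buf));         /* second fetch, never checked */
    return 0;
}

/* Fix: validate the stable copy that later code will actually consume. */
int handle_check_late(struct pkg_hdr *user, struct pkg_hdr *buf, race_hook hook)
{
    struct pkg_hdr first;

    fetch(&first, user, sizeof(first));     /* header used only for sizing */
    if (hook)
        hook(user);
    fetch(buf, user, sizeof(*buf));
    if (!reserved_is_zero(buf))             /* check the buffer that is used */
        return -EINVAL;
    return 0;
}

/* Example hook: overwrite the reserved words with 0xFF bytes. */
void poison_reserved(struct pkg_hdr *user)
{
    memset(user->reserved, 0xFF, sizeof(user->reserved));
}
```

    With `poison_reserved` as the hook, the early-check variant accepts the
    request even though `buf` ends up with non-zero reserved words, while the
    late-check variant rejects it.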

    Signed-off-by: Meng Xu
    Signed-off-by: Dan Williams

    Meng Xu
     

01 Sep, 2017

2 commits

  • Dan reports:
    The patch 62232e45f4a2: "libnvdimm: control (ioctl) messages for
    nvdimm_bus and nvdimm devices" from Jun 8, 2015, leads to the
    following static checker warning:

    drivers/nvdimm/bus.c:1018 __nd_ioctl()
    warn: integer overflows 'buf_len'

    From a casual review, this seems like it might be a real bug. On
    the first iteration we load some data into in_env[]. On the second
    iteration we read a user controlled "in_size" from nd_cmd_in_size().
    It can go up to UINT_MAX - 1. A high number means we will fill the
    whole in_env[] buffer. But we potentially keep looping and adding
    more to in_len so now it can be any value.

    It is simple enough to change, but it feels weird that we keep looping
    even though in_env is totally full. Shouldn't we just return an
    error if we don't have space for desc->in_num?

    We keep looping because the size of the total input is allowed to be
    bigger than the 'envelope', which is a subset of the payload that tells
    us how much data to expect. For safety, explicitly check that buf_len
    does not overflow, which is what the checker flagged.
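
    As an illustration of that kind of check, here is a minimal userspace
    sketch, not the actual patch: `total_buf_len()` and the `MAX_BUFLEN` cap
    are hypothetical. It sums user-controlled 32-bit field sizes in a 64-bit
    accumulator and rejects any total above the cap, so the sum cannot wrap
    the way a 32-bit `buf_len` could.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical upper bound on the combined input/output buffer size. */
#define MAX_BUFLEN (4u * 1024 * 1024)

/*
 * Sum n user-supplied 32-bit sizes. The 64-bit accumulator is checked
 * against the cap after every addition, so it never gets close to wrapping.
 * Returns 0 and stores the total on success, -EINVAL if the cap is exceeded.
 */
int total_buf_len(const uint32_t *sizes, size_t n, uint64_t *total)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < n; i++) {
        sum += sizes[i];
        if (sum > MAX_BUFLEN)
            return -EINVAL;
    }
    *total = sum;
    return 0;
}
```

    For example, the sizes UINT_MAX - 1 and 16 would wrap to 14 if added in
    32 bits; here they are rejected with -EINVAL instead.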

    Fixes: 62232e45f4a2: "libnvdimm: control (ioctl) messages for nvdimm_bus..."
    Reported-by: Dan Carpenter
    Signed-off-by: Dan Williams

    Dan Williams
     
  • With the ACPI NFIT 'DSM' methods, acpi can be called from IO paths.
    Specifically, the DSM to clear media errors is called during writes, so
    that we can provide a writes-fix-errors model.

    However it is easy to imagine a scenario like:
    -> write through the nvdimm driver
    -> acpi allocation
    -> writeback, causes more IO through the nvdimm driver
    -> deadlock

    Fix this by using memalloc_noio_{save,restore}, which sets the GFP_NOIO
    flag for the current scope when issuing commands/IOs that are expected
    to clear errors.
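
    The scoped save/restore idea can be modelled in userspace. This is a
    simplified single-threaded sketch, not kernel code: `task_flags`,
    `PF_NOIO`, `noio_save()`, `noio_restore()` and `alloc_may_do_io()` are
    hypothetical names that only mimic the shape of
    memalloc_noio_{save,restore}. Saving returns the previous state of the
    flag so that nested scopes restore correctly.

```c
/* Hypothetical per-task flag word and "no IO during allocation" bit. */
#define PF_NOIO 0x1u

static unsigned int task_flags;

/* Enter a no-IO scope; returns the previous state of the flag. */
unsigned int noio_save(void)
{
    unsigned int old = task_flags & PF_NOIO;

    task_flags |= PF_NOIO;
    return old;
}

/* Leave a no-IO scope, restoring whatever state was saved on entry. */
void noio_restore(unsigned int old)
{
    task_flags = (task_flags & ~PF_NOIO) | old;
}

/* An allocator would consult this before starting writeback or other IO. */
int alloc_may_do_io(void)
{
    return !(task_flags & PF_NOIO);
}
```

    Because the old state is restored rather than the flag being cleared, an
    inner scope ending does not re-enable IO for an outer scope that is still
    active.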

    Cc: Dan Williams
    Cc: Robert Moore
    Cc: Rafael J. Wysocki
    Signed-off-by: Vishal Verma
    Signed-off-by: Dan Williams

    Vishal Verma
     

04 Jul, 2017

1 commit