Eric Lee / smarc-fsl-linux-kernel

28 Jun, 2017

1 commit

0b0bcacc3 block: don't bother with bounce limits for make_request drivers ... Browse Code »

We only call blk_queue_bounce for request-based drivers, so stop messing
with it for make_request based drivers.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2017-06-28 02:13:45 +0800

13 Jun, 2017

1 commit

fdd050b5b Merge branch 'uuid-types' of bombadil.infradead.org:public_git/uuid into nvme-base Browse Code »

Christoph Hellwig
2017-06-13 17:45:14 +0800

09 Jun, 2017

1 commit

4e4cbee93 block: switch bios to blk_status_t ... Browse Code »

Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep arround at least for now and thus propagate to a
proper blk_status_t value.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2017-06-09 23:27:32 +0800

05 Jun, 2017

1 commit

ef40dda5b uuid: hoist uuid_is_null() helper from libnvdimm ... Browse Code »

Hoist the libnvdimm helper as an inline helper to linux/uuid.h
using an auxiliary const variable uuid_null in lib/uuid.c.

[hch: also add the guid variant. Both do the same but I'd like
to keep casts to a minimum]

The common helper uses the new abstract type uuid_t * instead of
u8 *.

Suggested-by: Christoph Hellwig
Signed-off-by: Amir Goldstein
[hch: added guid_is_null]
Signed-off-by: Christoph Hellwig
Acked-by: Dan Williams
Reviewed-by: Andy Shevchenko

Christoph Hellwig
2017-06-05 22:59:05 +0800

13 May, 2017

1 commit

0fcc3ab23 Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm ... Browse Code »

Pull libnvdimm fixes from Dan Williams:
"Incremental fixes and a small feature addition on top of the main
libnvdimm 4.12 pull request:

- Geert noticed that tinyconfig was bloated by BLOCK selecting DAX.
The size regression is fixed by moving all dax helpers into the
dax-core and only specifying "select DAX" for FS_DAX and
dax-capable drivers. He also asked for clarification of the
NR_DEV_DAX config option which, on closer look, does not need to be
a config option at all. Mike also throws in a DEV_DAX_PMEM fixup
for good measure.

- Ben's attention to detail on -stable patch submissions caught a
case where the recent fixes to arch_copy_from_iter_pmem() missed a
condition where we strand dirty data in the cache. This is tagged
for -stable and will also be included in the rework of the pmem api
to a proposed {memcpy,copy_user}_flushcache() interface for 4.13.

- Vishal adds a feature that missed the initial pull due to pending
review feedback. It allows the kernel to clear media errors when
initializing a BTT (atomic sector update driver) instance on a pmem
namespace.

- Ross noticed that the dax_device + dax_operations conversion broke
__dax_zero_page_range(). The nvdimm unit tests fail to check this
path, but xfstests immediately trips over it. No excuse for missing
this before submitting the 4.12 pull request.

These all pass the nvdimm unit tests and an xfstests spot check. The
set has received a build success notification from the kbuild robot"

* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
filesystem-dax: fix broken __dax_zero_page_range() conversion
libnvdimm, btt: ensure that initializing metadata clears poison
libnvdimm: add an atomic vs process context flag to rw_bytes
x86, pmem: Fix cache flushing for iovec write < 8 bytes
device-dax: kill NR_DEV_DAX
block, dax: move "select DAX" from BLOCK to FS_DAX
device-dax: Tell kbuild DEV_DAX_PMEM depends on DEV_DAX

Linus Torvalds
2017-05-13 06:43:10 +0800

11 May, 2017

2 commits

b177fe85d libnvdimm, btt: ensure that initializing metadata clears poison ... Browse Code »

If we had badblocks/poison in the metadata area of a BTT, recreating the
BTT would not clear the poison in all cases, notably the flog area. This
is because rw_bytes will only clear errors if the request being sent
down is 512B aligned and sized.

Make sure that when writing the map and info blocks, the rw_bytes being
sent are of the correct size/alignment. For the flog, instead of doing
the smaller log_entry writes only, first do a 'wipe' of the entire area
by writing zeroes in large enough chunks so that errors get cleared.

Cc: Andy Rudoff
Cc: Dan Williams
Signed-off-by: Vishal Verma
Signed-off-by: Dan Williams

Vishal Verma
2017-05-11 12:46:22 +0800
3ae3d67ba libnvdimm: add an atomic vs process context flag to rw_bytes ... Browse Code »

nsio_rw_bytes can clear media errors, but this cannot be done while we
are in an atomic context due to locking within ACPI. From the BTT,
->rw_bytes may be called either from atomic or process context depending
on whether the calls happen during initialization or during IO.

During init, we want to ensure error clearing happens, and the flag
marking process context allows nsio_rw_bytes to do that. When called
during IO, we're in atomic context, and error clearing can be skipped.

Cc: Dan Williams
Signed-off-by: Vishal Verma
Signed-off-by: Dan Williams

Vishal Verma
2017-05-11 12:46:22 +0800

09 May, 2017

1 commit

752ade68c treewide: use kv[mz]alloc* rather than opencoded variants ... Browse Code »

There are many code paths opencoding kvmalloc. Let's use the helper
instead. The main difference to kvmalloc is that those users are
usually not considering all the aspects of the memory allocator. E.g.
allocation requests
Reviewed-by: Boris Ostrovsky # Xen bits
Acked-by: Kees Cook
Acked-by: Vlastimil Babka
Acked-by: Andreas Dilger # Lustre
Acked-by: Christian Borntraeger # KVM/s390
Acked-by: Dan Williams # nvdim
Acked-by: David Sterba # btrfs
Acked-by: Ilya Dryomov # Ceph
Acked-by: Tariq Toukan # mlx4
Acked-by: Leon Romanovsky # mlx5
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Cc: Herbert Xu
Cc: Anton Vorontsov
Cc: Colin Cross
Cc: Tony Luck
Cc: "Rafael J. Wysocki"
Cc: Ben Skeggs
Cc: Kent Overstreet
Cc: Santosh Raspatur
Cc: Hariprasad S
Cc: Yishai Hadas
Cc: Oleg Drokin
Cc: "Yan, Zheng"
Cc: Alexander Viro
Cc: Alexei Starovoitov
Cc: Eric Dumazet
Cc: David Miller
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2017-05-09 08:15:13 +0800

06 May, 2017

1 commit

53ef7d0e2 Merge tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm ... Browse Code »

Pull libnvdimm updates from Dan Williams:
"The bulk of this has been in multiple -next releases. There were a few
late breaking fixes and small features that got added in the last
couple days, but the whole set has received a build success
notification from the kbuild robot.

Change summary:

- Region media error reporting: A libnvdimm region device is the
parent to one or more namespaces. To date, media errors have been
reported via the "badblocks" attribute attached to pmem block
devices for namespaces in "raw" or "memory" mode. Given that
namespaces can be in "device-dax" or "btt-sector" mode this new
interface reports media errors generically, i.e. independent of
namespace modes or state.

This subsequently allows userspace tooling to craft "ACPI 6.1
Section 9.20.7.6 Function Index 4 - Clear Uncorrectable Error"
requests and submit them via the ioctl path for NVDIMM root bus
devices.

- Introduce 'struct dax_device' and 'struct dax_operations': Prompted
by a request from Linus and feedback from Christoph this allows for
dax capable drivers to publish their own custom dax operations.
This fixes the broken assumption that all dax operations are
related to a persistent memory device, and makes it easier for
other architectures and platforms to add customized persistent
memory support.

- 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is
available for storage appliance applications to manually trigger
memory controllers to drain write-pending buffers that would
otherwise be flushed automatically by the platform ADR
(asynchronous-DRAM-refresh) mechanism at a power loss event.
Support for "locked" DIMMs is included to prevent namespaces from
surfacing when the namespace label data area is locked. Finally,
fixes for various reported deadlocks and crashes, also tagged for
-stable.

- ACPI / nfit driver updates: General updates of the nfit driver to
add DSM command overrides, ACPI 6.1 health state flags support, DSM
payload debug available by default, and various fixes.

Acknowledgements that came after the branch was pushed:

- commmit 565851c972b5 "device-dax: fix sysfs attribute deadlock":
Tested-by: Yi Zhang

- commit 23f498448362 "libnvdimm: rework region badblocks clearing"
Tested-by: Toshi Kani "

* tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (52 commits)
libnvdimm, pfn: fix 'npfns' vs section alignment
libnvdimm: handle locked label storage areas
libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED
brd: fix uninitialized use of brd->dax_dev
block, dax: use correct format string in bdev_dax_supported
device-dax: fix sysfs attribute deadlock
libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking"
libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering
libnvdimm: rework region badblocks clearing
acpi, nfit: kill ACPI_NFIT_DEBUG
libnvdimm: fix clear length of nvdimm_forget_poison()
libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify
libnvdimm, region: sysfs trigger for nvdimm_flush()
libnvdimm: fix phys_addr for nvdimm_clear_poison
x86, dax, pmem: remove indirection around memcpy_from_pmem()
block: remove block_device_operations ->direct_access()
block, dax: convert bdev_dax_supported() to dax_direct_access()
filesystem-dax: convert to dax_direct_access()
Revert "block: use DAX for partition table reads"
ext2, ext4, xfs: retrieve dax_device for iomap operations
...

Linus Torvalds
2017-05-06 09:49:20 +0800

05 May, 2017

4 commits

736163671 Merge branch 'for-4.12/dax' into libnvdimm-for-next Browse Code »

Dan Williams
2017-05-05 14:38:43 +0800
d5483feda libnvdimm, pfn: fix 'npfns' vs section alignment ... Browse Code »

Fix failures to create namespaces due to the vmem_altmap not advertising
enough free space to store the memmap.

WARNING: CPU: 15 PID: 8022 at arch/x86/mm/init_64.c:656 arch_add_memory+0xde/0xf0
[..]
Call Trace:
dump_stack+0x63/0x83
__warn+0xcb/0xf0
warn_slowpath_null+0x1d/0x20
arch_add_memory+0xde/0xf0
devm_memremap_pages+0x244/0x440
pmem_attach_disk+0x37e/0x490 [nd_pmem]
nd_pmem_probe+0x7e/0xa0 [nd_pmem]
nvdimm_bus_probe+0x71/0x120 [libnvdimm]
driver_probe_device+0x2bb/0x460
bind_store+0x114/0x160
drv_attr_store+0x25/0x30

In commit 658922e57b84 "libnvdimm, pfn: fix memmap reservation sizing"
we arranged for the capacity to be allocated, but failed to also update
the 'npfns' parameter. This leads to cases where there is enough
capacity reserved to hold all the allocated sections, but
vmemmap_populate_hugepages() still encounters -ENOMEM from
altmap_alloc_block_buf().

This fix is a stop-gap until we can teach the core memory hotplug
implementation to permit sub-section hotplug.

Cc:
Fixes: 658922e57b84 ("libnvdimm, pfn: fix memmap reservation sizing")
Reported-by: Anisha Allada
Signed-off-by: Dan Williams

Dan Williams
2017-05-05 10:54:42 +0800
9d62ed965 libnvdimm: handle locked label storage areas ... Browse Code »

Per the latest version of the "NVDIMM DSM Interface Example" [1], the
label data retrieval routine can report a "locked" status. In this case
all regions associated with that DIMM are disabled until the label area
is unlocked. Provide generic libnvdimm enabling for NVDIMMs with label
data area locking capabilities.

[1]: http://pmem.io/documents/

Signed-off-by: Dan Williams

Dan Williams
2017-05-05 06:41:39 +0800
8f078b38d libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED ... Browse Code »

This is a preparation patch for handling locked nvdimm label regions, a
new concept as introduced by the latest DSM document on pmem.io [1]. A
future patch will leverage nvdimm_set_locked() at DIMM probe time to
flag regions that can not be enabled. There should be no functional
difference resulting from this change.

[1]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example-V1.3.pdf

Signed-off-by: Dan Williams

Dan Williams
2017-05-05 05:01:24 +0800

02 May, 2017

2 commits

d3b5d3529 Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull x86 mm updates from Ingo Molnar:
"The main x86 MM changes in this cycle were:

- continued native kernel PCID support preparation patches to the TLB
flushing code (Andy Lutomirski)

- various fixes related to 32-bit compat syscall returning address
over 4Gb in applications, launched from 64-bit binaries - motivated
by C/R frameworks such as Virtuozzo. (Dmitry Safonov)

- continued Intel 5-level paging enablement: in particular the
conversion of x86 GUP to the generic GUP code. (Kirill A. Shutemov)

- x86/mpx ABI corner case fixes/enhancements (Joerg Roedel)

- ... plus misc updates, fixes and cleanups"

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits)
mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash
x86/mm: Fix flush_tlb_page() on Xen
x86/mm: Make flush_tlb_mm_range() more predictable
x86/mm: Remove flush_tlb() and flush_tlb_current_task()
x86/vm86/32: Switch to flush_tlb_mm_range() in mark_screen_rdonly()
x86/mm/64: Fix crash in remove_pagetable()
Revert "x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation"
x86/boot/e820: Remove a redundant self assignment
x86/mm: Fix dump pagetables for 4 levels of page tables
x86/mpx, selftests: Only check bounds-vs-shadow when we keep shadow
x86/mpx: Correctly report do_mpx_bt_fault() failures to user-space
Revert "x86/mm/numa: Remove numa_nodemask_from_meminfo()"
x86/espfix: Add support for 5-level paging
x86/kasan: Extend KASAN to support 5-level paging
x86/mm: Add basic defines/helpers for CONFIG_X86_5LEVEL=y
x86/paravirt: Add 5-level support to the paravirt code
x86/mm: Define virtual memory map for 5-level paging
x86/asm: Remove __VIRTUAL_MASK_SHIFT==47 assert
x86/boot: Detect 5-level paging support
x86/mm/numa: Remove numa_nodemask_from_meminfo()
...

Linus Torvalds
2017-05-02 14:54:56 +0800
a3e9af95f libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking" ... Browse Code »

This continues the 4.11 status quo of disabling of error clearing from
the BTT I/O path. Toshi found that even though we have eliminated all
the libnvdimm sources of sleeping-while-atomic triggers, we still have
sleeping operations that will occur in the path to send the ACPI DSM to
the DIMM to clear the error:

BUG: sleeping function called from invalid context at mm/slab.h:432
in_atomic(): 1, irqs_disabled(): 0, pid: 13353, name: dd
Call Trace:
dump_stack+0x86/0xc3
___might_sleep+0x17d/0x250
__might_sleep+0x4a/0x80
__kmalloc+0x1c0/0x2e0
acpi_os_allocate_zeroed+0x2d/0x2f
acpi_evaluate_object+0x59/0x3b1
acpi_evaluate_dsm+0xbd/0x10c
acpi_nfit_ctl+0x1ef/0x7c0 [nfit]
? nsio_rw_bytes+0x152/0x280
nvdimm_clear_poison+0x77/0x140
nsio_rw_bytes+0x18f/0x280
btt_write_pg+0x1d4/0x3d0 [nd_btt]
btt_make_request+0x119/0x2d0 [nd_btt]

A solution for tracking and handling media errors natively in the BTT is
needed.

Cc: Jeff Moyer
Cc: Dave Jiang
Cc: Vishal Verma
Reported-by: Toshi Kani
Signed-off-by: Dan Williams

Dan Williams
2017-05-02 01:00:02 +0800

01 May, 2017

2 commits

452bae0ae libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering ... Browse Code »

A debug patch to turn the standard device_lock() into something that
lockdep can analyze yielded the following:

======================================================
[ INFO: possible circular locking dependency detected ]
4.11.0-rc4+ #106 Tainted: G O
-------------------------------------------------------
lt-libndctl/1898 is trying to acquire lock:
(&dev->nvdimm_mutex/3){+.+.+.}, at: [] nd_attach_ndns+0x178/0x1b0 [libnvdimm]

but task is already holding lock:
(&nvdimm_bus->reconfig_mutex){+.+.+.}, at: [] nvdimm_bus_lock+0x21/0x30 [libnvdimm]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&nvdimm_bus->reconfig_mutex){+.+.+.}:
lock_acquire+0xf6/0x1f0
__mutex_lock+0x88/0x980
mutex_lock_nested+0x1b/0x20
nvdimm_bus_lock+0x21/0x30 [libnvdimm]
nvdimm_namespace_capacity+0x1b/0x40 [libnvdimm]
nvdimm_namespace_common_probe+0x230/0x510 [libnvdimm]
nd_pmem_probe+0x14/0x180 [nd_pmem]
nvdimm_bus_probe+0xa9/0x260 [libnvdimm]

-> #0 (&dev->nvdimm_mutex/3){+.+.+.}:
__lock_acquire+0x1107/0x1280
lock_acquire+0xf6/0x1f0
__mutex_lock+0x88/0x980
mutex_lock_nested+0x1b/0x20
nd_attach_ndns+0x178/0x1b0 [libnvdimm]
nd_namespace_store+0x308/0x3c0 [libnvdimm]
namespace_store+0x87/0x220 [libnvdimm]

In this case '&dev->nvdimm_mutex/3' mirrors '&dev->mutex'.

Fix this by replacing the use of device_lock() with nvdimm_bus_lock() to protect
nd_{attach,detach}_ndns() operations.

Cc:
Fixes: 8c2f7e8658df ("libnvdimm: infrastructure for btt devices")
Reported-by: Yi Zhang
Signed-off-by: Dan Williams

Dan Williams
2017-05-01 23:29:37 +0800
713897038 mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash ... Browse Code »

The x86 conversion to the generic GUP code included a small change which causes
crashes and data corruption in the pmem code - not good.

The root cause is that the /dev/pmem driver code implicitly relies on the x86
get_user_pages() implementation doing a get_page() on the page refcount, because
get_page() does a get_zone_device_page() which properly refcounts pmem's separate
page struct arrays that are not present in the regular page struct structures.
(The pmem driver does this because it can cover huge memory areas.)

But the x86 conversion to the generic GUP code changed the get_page() to
page_cache_get_speculative() which is faster but doesn't do the
get_zone_device_page() call the pmem code relies on.

One way to solve the regression would be to change the generic GUP code to use
get_page(), but that would slow things down a bit and punish other generic-GUP
using architectures for an x86-ism they did not care about. (Arguably the pmem
driver was probably not working reliably for them: but nvdimm is an Intel
feature, so non-x86 exposure is probably still limited.)

So restructure the pmem code's interface with the MM instead: get rid of the
get/put_zone_device_page() distinction, integrate put_zone_device_page() into
__put_page() and and restructure the pmem completion-wait and teardown machinery:

Kirill points out that the calls to {get,put}_dev_pagemap() can be
removed from the mm fast path if we take a single get_dev_pagemap()
reference to signify that the page is alive and use the final put of the
page to drop that reference.

This does require some care to make sure that any waits for the
percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
since it now maintains its own elevated reference.

This speeds up things while also making the pmem refcounting more robust going
forward.

Suggested-by: Kirill Shutemov
Tested-by: Kirill Shutemov
Signed-off-by: Dan Williams
Reviewed-by: Logan Gunthorpe
Cc: Andrew Morton
Cc: Andy Lutomirski
Cc: Borislav Petkov
Cc: Brian Gerst
Cc: Denys Vlasenko
Cc: H. Peter Anvin
Cc: Josh Poimboeuf
Cc: Jérôme Glisse
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/149339998297.24933.1129582806028305912.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Ingo Molnar

Dan Williams
2017-05-01 15:15:53 +0800

30 Apr, 2017

1 commit

23f498448 libnvdimm: rework region badblocks clearing ... Browse Code »

Toshi noticed that the new support for a region-level badblocks missed
the case where errors are cleared due to BTT I/O.

An initial attempt to fix this ran into a "sleeping while atomic"
warning due to taking the nvdimm_bus_lock() in the BTT I/O path to
satisfy the locking requirements of __nvdimm_bus_badblocks_clear().
However, that lock is not needed since we are not acting on any data that
is subject to change under that lock. The badblocks instance has its own
internal lock to handle mutations of the error list.

So, in order to make it clear that we are just acting on region devices,
rename __nvdimm_bus_badblocks_clear() to nvdimm_clear_badblocks_regions().
Eliminate the lock and consolidate all support routines for the new
nvdimm_account_cleared_poison() in drivers/nvdimm/bus.c. Finally, to the
opportunity to cleanup to some unnecessary casts, make the calling
convention of nvdimm_clear_badblocks_regions() clearer by replacing struct
resource with the minimal struct clear_badblocks_context, and use the
DEVICE_ATTR macro.

Cc: Dave Jiang
Cc: Vishal Verma
Reported-by: Toshi Kani
Signed-off-by: Dan Williams

Dan Williams
2017-04-30 06:24:03 +0800

29 Apr, 2017

3 commits

8d13c0290 libnvdimm: fix clear length of nvdimm_forget_poison() ... Browse Code »

ND_CMD_CLEAR_ERROR command returns 'clear_err.cleared', the length
of error actually cleared, which may be smaller than its requested
'len'.

Change nvdimm_clear_poison() to call nvdimm_forget_poison() with
'clear_err.cleared' when this value is valid.

Cc:
Fixes: e046114af5fc ("libnvdimm: clear the internal poison_list when clearing badblocks")
Cc: Dave Jiang
Cc: Vishal Verma
Signed-off-by: Toshi Kani
Signed-off-by: Dan Williams

Toshi Kani
2017-04-29 06:56:26 +0800
b2518c78c libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify ... Browse Code »

The following BUG was observed when nd_pmem_notify() was called
for a BTT device. The use of a pmem_device pointer is not valid
with BTT.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
IP: nd_pmem_notify+0x30/0xf0 [nd_pmem]
Call Trace:
nd_device_notify+0x40/0x50
child_notify+0x10/0x20
device_for_each_child+0x50/0x90
nd_region_notify+0x20/0x30
nd_device_notify+0x40/0x50
nvdimm_region_notify+0x27/0x30
acpi_nfit_scrub+0x341/0x590 [nfit]
process_one_work+0x197/0x450
worker_thread+0x4e/0x4a0
kthread+0x109/0x140

Fix nd_pmem_notify() by setting nd_region and badblocks pointers
properly for BTT.

Cc:
Cc: Vishal Verma
Fixes: 719994660c24 ("libnvdimm: async notification support")
Signed-off-by: Toshi Kani
Signed-off-by: Dan Williams

Toshi Kani
2017-04-29 03:46:47 +0800
ab630891c libnvdimm, region: sysfs trigger for nvdimm_flush() ... Browse Code »

The nvdimm_flush() mechanism helps to reduce the impact of an ADR
(asynchronous-dimm-refresh) failure. The ADR mechanism handles flushing
platform WPQ (write-pending-queue) buffers when power is removed. The
nvdimm_flush() mechanism performs that same function on-demand.

When a pmem namespace is associated with a block device, an
nvdimm_flush() is triggered with every block-layer REQ_FUA, or REQ_FLUSH
request. These requests are typically associated with filesystem
metadata updates. However, when a namespace is in device-dax mode,
userspace (think database metadata) needs another path to perform the
same flushing. In other words this is not required to make data
persistent, but in the case of metadata it allows for a smaller failure
domain in the unlikely event of an ADR failure.

The new 'deep_flush' attribute is visible when the individual DIMMs
backing a given interleave-set are described by platform firmware. In
ACPI terms this is "NVDIMM Region Mapping Structures" and associated
"Flush Hint Address Structures". Reads return "1" if the region supports
triggering WPQ flushes on all DIMMs. Reads return "0" the flush
operation is a platform nop, and in that case the attribute is
read-only.

Why sysfs and not an ioctl? An ioctl requires establishing a new
ioctl function number space for device-dax. Given that this would be
called on a device-dax fd an application could be forgiven for
accidentally calling this on a filesystem-dax fd. Placing this interface
in libnvdimm sysfs removes that potential for collision with a
filesystem ioctl, and it keeps ioctls out of the generic device-dax
implementation.

Cc: Jeff Moyer
Cc: Masayoshi Mizuma
Signed-off-by: Dan Williams

Dan Williams
2017-04-29 03:46:46 +0800

28 Apr, 2017

1 commit

97681f9b0 libnvdimm: fix phys_addr for nvdimm_clear_poison ... Browse Code »

nvdimm_clear_poison() expects a physical address, not an offset.
Fix nsio_rw_bytes() to call nvdimm_clear_poison() with a physical
address.

Signed-off-by: Toshi Kani
Cc: Dave Jiang
Cc: Vishal Verma
Reviewed-by: Vishal Verma
Signed-off-by: Dan Williams

Toshi Kani
2017-04-28 04:51:18 +0800

26 Apr, 2017

2 commits

6abccd1bf x86, dax, pmem: remove indirection around memcpy_from_pmem() ... Browse Code »

memcpy_from_pmem() maps directly to memcpy_mcsafe(). The wrapper
serves no real benefit aside from affording a more generic function name
than the x86-specific 'mcsafe'. However this would not be the first time
that x86 terminology leaked into the global namespace. For lack of
better name, just use memcpy_mcsafe() directly.

This conversion also catches a place where we should have been using
plain memcpy, acpi_nfit_blk_single_io().

Cc:
Cc: Jan Kara
Cc: Jeff Moyer
Cc: Ingo Molnar
Cc: Christoph Hellwig
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Cc: Matthew Wilcox
Cc: Ross Zwisler
Acked-by: Tony Luck
Signed-off-by: Dan Williams

Dan Williams
2017-04-26 04:20:46 +0800
d4b29fd78 block: remove block_device_operations ->direct_access() ... Browse Code »

Now that all the producers and consumers of dax interfaces have been
converted to using dax_operations on a dax_device, remove the block
device direct_access enabling.

Signed-off-by: Dan Williams

Dan Williams
2017-04-26 04:20:46 +0800

25 Apr, 2017

1 commit

bc042fdfb libnvdimm, region: fix flush hint detection crash ... Browse Code »

In the case where a dimm does not have any associated flush hints the
ndrd->flush_wpq array may be uninitialized leading to crashes with the
following signature:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
IP: region_visible+0x10f/0x160 [libnvdimm]

Call Trace:
internal_create_group+0xbe/0x2f0
sysfs_create_groups+0x40/0x80
device_add+0x2d8/0x650
nd_async_device_register+0x12/0x40 [libnvdimm]
async_run_entry_fn+0x39/0x170
process_one_work+0x212/0x6c0
? process_one_work+0x197/0x6c0
worker_thread+0x4e/0x4a0
kthread+0x10c/0x140
? process_one_work+0x6c0/0x6c0
? kthread_create_on_node+0x60/0x60
ret_from_fork+0x31/0x40

Cc:
Reviewed-by: Jeff Moyer
Fixes: f284a4f23752 ("libnvdimm: introduce nvdimm_flush() and nvdimm_has_flush()")
Signed-off-by: Dan Williams

Dan Williams
2017-04-25 07:01:56 +0800

20 Apr, 2017

1 commit

c1d6e828a pmem: add dax_operations support ... Browse Code »

Setup a dax_device to have the same lifetime as the pmem block device
and add a ->direct_access() method that is equivalent to
pmem_direct_access(). Once fs/dax.c has been converted to use
dax_operations the old pmem_direct_access() will be removed.

Signed-off-by: Dan Williams

Dan Williams
2017-04-20 06:14:35 +0800

15 Apr, 2017

1 commit

e88da7998 Revert "libnvdimm: band aid btt vs clear poison locking" ... Browse Code »

This reverts commit 4aa5615e080a "libnvdimm: band aid btt vs clear
poison locking".

Now that poison list locking has been converted to a spinlock and poison
list entry allocation during i/o has been converted to GFP_NOWAIT,
revert the band-aid that disabled error clearing from btt i/o.

Cc: Vishal Verma
Cc: Dave Jiang
Signed-off-by: Dan Williams

Dan Williams
2017-04-15 04:29:01 +0800

14 Apr, 2017

1 commit

b3b454f69 libnvdimm: fix clear poison locking with spinlock and GFP_NOWAIT allocation ... Browse Code »

The following warning results from holding a lane spinlock,
preempt_disable(), or the btt map spinlock and then trying to take the
reconfig_mutex to walk the poison list and potentially add new entries.

BUG: sleeping function called from invalid context at kernel/locking/mutex.
c:747
in_atomic(): 1, irqs_disabled(): 0, pid: 17159, name: dd
[..]
Call Trace:
dump_stack+0x85/0xc8
___might_sleep+0x184/0x250
__might_sleep+0x4a/0x90
__mutex_lock+0x58/0x9b0
? nvdimm_bus_lock+0x21/0x30 [libnvdimm]
? __nvdimm_bus_badblocks_clear+0x2f/0x60 [libnvdimm]
? acpi_nfit_forget_poison+0x79/0x80 [nfit]
? _raw_spin_unlock+0x27/0x40
mutex_lock_nested+0x1b/0x20
nvdimm_bus_lock+0x21/0x30 [libnvdimm]
nvdimm_forget_poison+0x25/0x50 [libnvdimm]
nvdimm_clear_poison+0x106/0x140 [libnvdimm]
nsio_rw_bytes+0x164/0x270 [libnvdimm]
btt_write_pg+0x1de/0x3e0 [nd_btt]
? blk_queue_enter+0x30/0x290
btt_make_request+0x11a/0x310 [nd_btt]
? blk_queue_enter+0xb7/0x290
? blk_queue_enter+0x30/0x290
generic_make_request+0x118/0x3b0

A spinlock is introduced to protect the poison list. This allows us to not
having to acquire the reconfig_mutex for touching the poison list. The
add_poison() function has been broken out into two helper functions. One to
allocate the poison entry and the other to apppend the entry. This allows us
to unlock the poison_lock in non-I/O path and continue to be able to allocate
the poison entry with GFP_KERNEL. We will use GFP_NOWAIT in the I/O path in
order to satisfy being in atomic context.

Reviewed-by: Vishal Verma
Signed-off-by: Dave Jiang
Signed-off-by: Dan Williams

Dave Jiang
2017-04-14 05:23:51 +0800

13 Apr, 2017

3 commits

006358b35 libnvdimm: add support for clear poison list and badblocks for device dax ... Browse Code »

Providing mechanism to clear poison list via the ndctl ND_CMD_CLEAR_ERROR
call. We will update the poison list and also the badblocks at region level
if the region is in dax mode or in pmem mode and not active. In other
words we force badblocks to be cleared through write requests if the
address is currently accessed through a block device, otherwise it can
only be done via the ioctl+dsm path.

Signed-off-by: Dave Jiang
Reviewed-by: Johannes Thumshirn
Signed-off-by: Dan Williams

Dave Jiang
2017-04-13 12:56:43 +0800
802f4be6f libnvdimm: Add 'resource' sysfs attribute to regions ... Browse Code »

Adding sysfs attribute in order to export the physical address of the
region. This is for supporting of user app poison clear via
ND_IOCTL_CLEAR_ERROR.

Signed-off-by: Dave Jiang
Reviewed-by: Johannes Thumshirn
Signed-off-by: Dan Williams

Dave Jiang
2017-04-13 12:56:42 +0800
6a6bef904 libnvdimm: add mechanism to publish badblocks at the region level ... Browse Code »

badblocks sysfs file will be export at region level. When nvdimm event
notifier happens for NVDIMM_REVALIATE_POISON, the badblocks in the
region will be updated.

Signed-off-by: Dave Jiang
Reviewed-by: Johannes Thumshirn
Signed-off-by: Dan Williams

Dave Jiang
2017-04-13 12:56:42 +0800

11 Apr, 2017

2 commits

4aa5615e0 libnvdimm: band aid btt vs clear poison locking ... Browse Code »

The following warning results from holding a lane spinlock,
preempt_disable(), or the btt map spinlock and then trying to take the
reconfig_mutex to walk the poison list and potentially add new entries.

BUG: sleeping function called from invalid context at kernel/locking/mutex.c:747
in_atomic(): 1, irqs_disabled(): 0, pid: 17159, name: dd
[..]
Call Trace:
dump_stack+0x85/0xc8
___might_sleep+0x184/0x250
__might_sleep+0x4a/0x90
__mutex_lock+0x58/0x9b0
? nvdimm_bus_lock+0x21/0x30 [libnvdimm]
? __nvdimm_bus_badblocks_clear+0x2f/0x60 [libnvdimm]
? acpi_nfit_forget_poison+0x79/0x80 [nfit]
? _raw_spin_unlock+0x27/0x40
mutex_lock_nested+0x1b/0x20
nvdimm_bus_lock+0x21/0x30 [libnvdimm]
nvdimm_forget_poison+0x25/0x50 [libnvdimm]
nvdimm_clear_poison+0x106/0x140 [libnvdimm]
nsio_rw_bytes+0x164/0x270 [libnvdimm]
btt_write_pg+0x1de/0x3e0 [nd_btt]
? blk_queue_enter+0x30/0x290
btt_make_request+0x11a/0x310 [nd_btt]
? blk_queue_enter+0xb7/0x290
? blk_queue_enter+0x30/0x290
generic_make_request+0x118/0x3b0

As a minimal fix, disable error clearing when the BTT is enabled for the
namespace. For the final fix a larger rework of the poison list locking
is needed.

Note that this is not a problem in the blk case since that path never
calls nvdimm_clear_poison().

Cc:
Fixes: 82bf1037f2ca ("libnvdimm: check and clear poison before writing to pmem")
Cc: Dave Jiang
[jeff: dynamically disable error clearing in the btt case]
Suggested-by: Jeff Moyer
Reviewed-by: Jeff Moyer
Reported-by: Vishal Verma
Signed-off-by: Dan Williams

Dan Williams
2017-04-11 08:21:45 +0800
0beb2012a libnvdimm: fix reconfig_mutex, mmap_sem, and jbd2_handle lockdep splat ... Browse Code »

Holding the reconfig_mutex over a potential userspace fault sets up a
lockdep dependency chain between filesystem-DAX and the libnvdimm ioctl
path. Move the user access outside of the lock.

[ INFO: possible circular locking dependency detected ]
4.11.0-rc3+ #13 Tainted: G W O
-------------------------------------------------------
fallocate/16656 is trying to acquire lock:
(&nvdimm_bus->reconfig_mutex){+.+.+.}, at: [] nvdimm_bus_lock+0x21/0x30 [libnvdimm]
but task is already holding lock:
(jbd2_handle){++++..}, at: [] start_this_handle+0x104/0x460

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (jbd2_handle){++++..}:
lock_acquire+0xbd/0x200
start_this_handle+0x16a/0x460
jbd2__journal_start+0xe9/0x2d0
__ext4_journal_start_sb+0x89/0x1c0
ext4_dirty_inode+0x32/0x70
__mark_inode_dirty+0x235/0x670
generic_update_time+0x87/0xd0
touch_atime+0xa9/0xd0
ext4_file_mmap+0x90/0xb0
mmap_region+0x370/0x5b0
do_mmap+0x415/0x4f0
vm_mmap_pgoff+0xd7/0x120
SyS_mmap_pgoff+0x1c5/0x290
SyS_mmap+0x22/0x30
entry_SYSCALL_64_fastpath+0x1f/0xc2

-> #1 (&mm->mmap_sem){++++++}:
lock_acquire+0xbd/0x200
__might_fault+0x70/0xa0
__nd_ioctl+0x683/0x720 [libnvdimm]
nvdimm_ioctl+0x8b/0xe0 [libnvdimm]
do_vfs_ioctl+0xa8/0x740
SyS_ioctl+0x79/0x90
do_syscall_64+0x6c/0x200
return_from_SYSCALL_64+0x0/0x7a

-> #0 (&nvdimm_bus->reconfig_mutex){+.+.+.}:
__lock_acquire+0x16b6/0x1730
lock_acquire+0xbd/0x200
__mutex_lock+0x88/0x9b0
mutex_lock_nested+0x1b/0x20
nvdimm_bus_lock+0x21/0x30 [libnvdimm]
nvdimm_forget_poison+0x25/0x50 [libnvdimm]
nvdimm_clear_poison+0x106/0x140 [libnvdimm]
pmem_do_bvec+0x1c2/0x2b0 [nd_pmem]
pmem_make_request+0xf9/0x270 [nd_pmem]
generic_make_request+0x118/0x3b0
submit_bio+0x75/0x150

Cc:
Fixes: 62232e45f4a2 ("libnvdimm: control (ioctl) messages for nvdimm_bus and nvdimm devices")
Cc: Dave Jiang
Reported-by: Vishal Verma
Signed-off-by: Dan Williams

Dan Williams
2017-04-11 08:21:45 +0800

05 Apr, 2017

1 commit

fe514739d libnvdimm: fix blk free space accounting ... Browse Code »

Commit a1f3e4d6a0c3 "libnvdimm, region: update nd_region_available_dpa()
for multi-pmem support" reworked blk dpa (DIMM Physical Address)
accounting to comprehend multiple pmem namespace allocations aliasing
with a given blk-dpa range.

The following call trace is a result of failing to account for allocated
blk capacity.

WARNING: CPU: 1 PID: 2433 at tools/testing/nvdimm/../../../drivers/nvdimm/names
4 size_store+0x6f3/0x930 [libnvdimm]
nd_region region5: allocation underrun: 0x0 of 0x1000000 bytes
[..]
Call Trace:
dump_stack+0x86/0xc3
__warn+0xcb/0xf0
warn_slowpath_fmt+0x5f/0x80
size_store+0x6f3/0x930 [libnvdimm]
dev_attr_store+0x18/0x30

If a given blk-dpa allocation does not alias with any pmem ranges then
the full allocation should be accounted as busy space, not the size of
the current pmem contribution to the region.

The thinkos that led to this confusion was not realizing that the struct
resource management is already guaranteeing no collisions between pmem
allocations and blk allocations on the same dimm. Also, we do not try to
support blk allocations in aliased pmem holes.

This patch also fixes a case where the available blk goes negative.

Cc:
Fixes: a1f3e4d6a0c3 ("libnvdimm, region: update nd_region_available_dpa() for multi-pmem support").
Reported-by: Dariusz Dokupil
Reported-by: Dave Jiang
Reported-by: Vishal Verma
Tested-by: Dave Jiang
Tested-by: Vishal Verma
Signed-off-by: Dan Williams

Dan Williams
2017-04-05 06:08:36 +0800

01 Mar, 2017

1 commit

86ef58a4e nfit, libnvdimm: fix interleave set cookie calculation ... Browse Code »

The interleave-set cookie is a sum that sanity checks the composition of
an interleave set has not changed from when the namespace was initially
created. The checksum is calculated by sorting the DIMMs by their
location in the interleave-set. The comparison for the sort must be
64-bit wide, not byte-by-byte as performed by memcmp() in the broken
case.

Fix the implementation to accept correct cookie values in addition to
the Linux "memcmp" order cookies, but only allow correct cookies to be
generated going forward. It does mean that namespaces created by
third-party-tooling, or created by newer kernels with this fix, will not
validate on older kernels. However, there are a couple mitigating
conditions:

1/ platforms with namespace-label capable NVDIMMs are not widely
available.

2/ interleave-sets with a single-dimm are by definition not affected
(nothing to sort). This covers the QEMU-KVM NVDIMM emulation case.

The cookie stored in the namespace label will be fixed by any write the
namespace label, the most straightforward way to achieve this is to
write to the "alt_name" attribute of a namespace in sysfs.

Cc:
Fixes: eaf961536e16 ("libnvdimm, nfit: add interleave-set state-tracking infrastructure")
Reported-by: Nicholas Moulin
Tested-by: Nicholas Moulin
Signed-off-by: Dan Williams

Dan Williams
2017-03-01 16:49:42 +0800

05 Feb, 2017

1 commit

bfb34527a libnvdimm, pfn: fix memmap reservation size versus 4K alignment ... Browse Code »

When vmemmap_populate() allocates space for the memmap it does so in 2MB
sized chunks. The libnvdimm-pfn driver incorrectly accounts for this
when the alignment of the device is set to 4K. When this happens we
trigger memory allocation failures in altmap_alloc_block_buf() and
trigger warnings of the form:

WARNING: CPU: 0 PID: 3376 at arch/x86/mm/init_64.c:656 arch_add_memory+0xe4/0xf0
[..]
Call Trace:
dump_stack+0x86/0xc3
__warn+0xcb/0xf0
warn_slowpath_null+0x1d/0x20
arch_add_memory+0xe4/0xf0
devm_memremap_pages+0x29b/0x4e0

Fixes: 315c562536c4 ("libnvdimm, pfn: add 'align' attribute, default to HPAGE_SIZE")
Cc:
Signed-off-by: Dan Williams

Dan Williams
2017-02-05 06:47:31 +0800

01 Feb, 2017

2 commits

9d032f420 libnvdimm, namespace: do not delete namespace-id 0 ... Browse Code »

Given that the naming of pmem devices changes from the pmemX form to the
pmemX.Y form when namespace id is greater than 0, arrange for namespaces
with id-0 to be exempt from deletion. Otherwise a simple reconfiguration
of an existing namespace to a new mode results in a name change of the
resulting block device:

# ndctl list --namespace=namespace1.0
{
"dev":"namespace1.0",
"mode":"raw",
"size":2147483648,
"uuid":"3dadf3dc-89b9-4b24-b20e-abc8a4707ce3",
"blockdev":"pmem1"
}

# ndctl create-namespace --reconfig=namespace1.0 --mode=memory --force
{
"dev":"namespace1.1",
"mode":"memory",
"size":2111832064,
"uuid":"7b4a6341-7318-4219-a02c-fb57c0bbf613",
"blockdev":"pmem1.1"
}

This change does require tooling changes to explicitly look for
namespaceX.0 if the seed has already advanced to another namespace.

Cc:
Fixes: 98a29c39dc68 ("libnvdimm, namespace: allow creation of multiple pmem-namespaces per region")
Reviewed-by: Johannes Thumshirn
Signed-off-by: Dan Williams

Dan Williams
2017-02-01 10:18:21 +0800
970d14e39 nvdimm: constify device_type structures ... Browse Code »

Declare device_type structure as const as it is only stored in the
type field of a device structure. This field is of type const, so add
const to declaration of device_type structure.

File size before:
text data bss dec hex filename
19278 3199 16 22493 57dd nvdimm/namespace_devs.o

File size after:
text data bss dec hex filename
19929 3160 16 23105 5a41 nvdimm/namespace_devs.o

Signed-off-by: Bhumika Goyal
Signed-off-by: Dan Williams

Bhumika Goyal
2017-02-01 10:16:30 +0800

14 Jan, 2017

1 commit

1f19b983a libnvdimm, namespace: fix pmem namespace leak, delete when size set to zero ... Browse Code »

Commit 98a29c39dc68 ("libnvdimm, namespace: allow creation of multiple
pmem-namespaces per region") added support for establishing additional
pmem namespace beyond the seed device, similar to blk namespaces.
However, it neglected to delete the namespace when the size is set to
zero.

Fixes: 98a29c39dc68 ("libnvdimm, namespace: allow creation of multiple pmem-namespaces per region")
Cc:
Signed-off-by: Dan Williams

Dan Williams
2017-01-14 01:50:33 +0800

13 Jan, 2017

1 commit

d47d1d27f pmem: return EIO on read_pmem() failure ... Browse Code »

The read_pmem() function uses memcpy_mcsafe() on x86 where an EFAULT
error code indicates a failed read. Block I/O should use EIO to
indicate failure. Other pmem code paths (like bad blocks) already use
EIO so let's be consistent.

This fixes compatibility with consumers like btrfs that try to parse the
specific error code rather than treat all errors the same.

Reviewed-by: Jeff Moyer
Signed-off-by: Stefan Hajnoczi
Signed-off-by: Dan Williams

Stefan Hajnoczi
2017-01-13 08:40:29 +0800