24 Mar, 2019

4 commits

  • commit 07464e88365e9236febaca9ed1a2e2006d8bc952 upstream.

    Libnvdimm reserves the first 8K of pfn and devicedax namespaces to
    store a superblock describing the namespace. This 8K reservation
    is contained within the altmap area which the kernel uses for the
    vmemmap backing for the pages within the namespace. The altmap
    allows for some pages at the start of the altmap area to be reserved
    and that mechanism is used to protect the superblock from being
    re-used as vmemmap backing.

    The number of PFNs to reserve is calculated using:

    PHYS_PFN(SZ_8K)

    Which is implemented as:

    #define PHYS_PFN(x) ((unsigned long)((x) >> PAGE_SHIFT))

    So on systems where PAGE_SIZE is greater than 8K the reservation
    size is truncated to zero and the superblock area is re-used as
    vmemmap backing. As a result all the namespace information stored
    in the superblock (i.e. if it's a PFN or DAX namespace) is lost
    and the namespace needs to be re-created to get access to the
    contents.

    This patch fixes this by using PFN_UP() rather than PHYS_PFN() to ensure
    that at least one page is reserved. On systems with a 4K page size this
    patch should have no effect.
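
    For comparison, PFN_UP() rounds up rather than truncating, so even an 8K
    reservation on a 64K-page system still claims one full page. Paraphrased
    from include/linux/pfn.h:

    #define PFN_UP(x)   (((x) + PAGE_SIZE - 1) >> PAGE_SHIFT)

    With a 64K PAGE_SIZE, PHYS_PFN(SZ_8K) evaluates to 0 while PFN_UP(SZ_8K)
    evaluates to 1, which is exactly the one-page reservation the superblock
    needs.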

    Cc: stable@vger.kernel.org
    Cc: Dan Williams
    Fixes: ac515c084be9 ("libnvdimm, pmem, pfn: move pfn setup to the core")
    Signed-off-by: Oliver O'Halloran
    Reviewed-by: Vishal Verma
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Oliver O'Halloran
     
  • commit fa7d2e639cd90442d868dfc6ca1d4cc9d8bf206e upstream.

    For recovery, where non-dax access is needed to a given physical address
    range, and testing, allow the 'force_raw' attribute to override the
    default establishment of a dev_pagemap.

    Otherwise, without this capability, it is possible to end up with a
    namespace that cannot be activated due to a corrupted info-block, and one
    that cannot be repaired due to a section collision.

    Cc:
    Fixes: 004f1afbe199 ("libnvdimm, pmem: direct map legacy pmem by default")
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     
  • commit f101ada7da6551127d192c2f1742c1e9e0f62799 upstream.

    When checking whether the current nd_region intersects with others,
    trim_pfn_device() has already calculated *size* expanded up to the
    SECTION size.

    Do not double append 'adjust' to 'size' when calculating whether the end
    of a region collides with the next pmem region.

    Fixes: ae86cbfef381 ("libnvdimm, pfn: Pad pfn namespaces relative to other regions")
    Cc:
    Signed-off-by: Wei Yang
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Wei Yang
     
  • commit 966d23a006ca7b44ac8cf4d0c96b19785e0c3da0 upstream.

    The UEFI 2.7 specification sets expectations that the 'updating' flag is
    eventually cleared. To date, the libnvdimm core has never adhered to
    that protocol. The policy of the core matches the policy of other
    multi-device info-block formats like MD-Software-RAID that expect
    administrator intervention on inconsistent info-blocks, not automatic
    invalidation.

    However, some pre-boot environments may unfortunately attempt to "clean
    up" the labels and invalidate a set when it fails to find at least one
    "non-updating" label in the set. Clear the updating flag after set
    updates to minimize the window of vulnerability to aggressive pre-boot
    environments.

    Ideally implementations would not write to the label area outside of
    creating namespaces.

    Note that this only minimizes the window, it does not close it as the
    system can still crash while clearing the flag and the set can be
    subsequently deleted / invalidated by the pre-boot environment.

    Fixes: f524bf271a5c ("libnvdimm: write pmem label set")
    Cc:
    Cc: Kelly Couch
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     

13 Jan, 2019

1 commit

  • commit a95c90f1e2c253b280385ecf3d4ebfe476926b28 upstream.

    The last step before devm_memremap_pages() returns success is to allocate
    a release action, devm_memremap_pages_release(), to tear the entire setup
    down. However, the result from devm_add_action() is not checked.

    Checking the error from devm_add_action() is not enough. The api
    currently relies on the fact that the percpu_ref it is using is killed by
    the time the devm_memremap_pages_release() is run. Rather than continue
    this awkward situation, offload the responsibility of killing the
    percpu_ref to devm_memremap_pages_release() directly. This allows
    devm_memremap_pages() to do the right thing relative to init failures and
    shutdown.

    Without this change we could fail to register the teardown of
    devm_memremap_pages(). The likelihood of hitting this failure is tiny as
    small memory allocations almost always succeed. However, the impact of
    the failure is large given any future reconfiguration, or disable/enable,
    of an nvdimm namespace will fail forever as subsequent calls to
    devm_memremap_pages() will fail to setup the pgmap_radix since there will
    be stale entries for the physical address range.

    An argument could be made to require that the ->kill() operation be set in
    the @pgmap arg rather than passed in separately. However, it helps code
    readability, tracking the lifetime of a given instance, to be able to grep
    the kill routine directly at the devm_memremap_pages() call site.

    Link: http://lkml.kernel.org/r/154275558526.76910.7535251937849268605.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Fixes: e8d513483300 ("memremap: change devm_memremap_pages interface...")
    Reviewed-by: "Jérôme Glisse"
    Reported-by: Logan Gunthorpe
    Reviewed-by: Logan Gunthorpe
    Reviewed-by: Christoph Hellwig
    Cc: Balbir Singh
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     

13 Dec, 2018

1 commit

  • commit ae86cbfef3818300f1972e52f67a93211acb0e24 upstream.

    Commit cfe30b872058 "libnvdimm, pmem: adjust for section collisions with
    'System RAM'" enabled Linux to workaround occasions where platform
    firmware arranges for "System RAM" and "Persistent Memory" to collide
    within a single section boundary. Unfortunately, as reported in this
    issue [1], platform firmware can inflict the same collision between
    persistent memory regions.

    The approach of interrogating iomem_resource does not work in this
    case because platform firmware may merge multiple regions into a single
    iomem_resource range. Instead provide a method to interrogate regions
    that share the same parent bus.

    This is a stop-gap until the core-MM can grow support for hotplug on
    sub-section boundaries.

    [1]: https://github.com/pmem/ndctl/issues/76

    Fixes: cfe30b872058 ("libnvdimm, pmem: adjust for section collisions with...")
    Cc:
    Reported-by: Patrick Geary
    Tested-by: Patrick Geary
    Reviewed-by: Vishal Verma
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     

14 Nov, 2018

3 commits

  • commit 91ed7ac444ef749603a95629a5ec483988c4f14b upstream.

    The driver is only initializing bb_res in the devm_memremap_pages()
    paths, but the raw namespace case is passing an uninitialized bb_res to
    nvdimm_badblocks_populate().

    Fixes: e8d513483300 ("memremap: change devm_memremap_pages interface...")
    Cc:
    Cc: Christoph Hellwig
    Reported-by: Jacek Zloch
    Reported-by: Krzysztof Rusocki
    Reviewed-by: Vishal Verma
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     
  • commit 5d394eee2c102453278d81d9a7cf94c80253486a upstream.

    While experimenting with region driver loading, the following backtrace
    was triggered:

    INFO: trying to register non-static key.
    the code is fine but needs lockdep annotation.
    turning off the locking correctness validator.
    [..]
    Call Trace:
    dump_stack+0x85/0xcb
    register_lock_class+0x571/0x580
    ? __lock_acquire+0x2ba/0x1310
    ? kernfs_seq_start+0x2a/0x80
    __lock_acquire+0xd4/0x1310
    ? dev_attr_show+0x1c/0x50
    ? __lock_acquire+0x2ba/0x1310
    ? kernfs_seq_start+0x2a/0x80
    ? lock_acquire+0x9e/0x1a0
    lock_acquire+0x9e/0x1a0
    ? dev_attr_show+0x1c/0x50
    badblocks_show+0x70/0x190
    ? dev_attr_show+0x1c/0x50
    dev_attr_show+0x1c/0x50

    This results from a missing successful call to devm_init_badblocks()
    from nd_region_probe(). Block attempts to show badblocks while the
    region is not enabled.
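
    A hedged sketch of the guard described above; the helper and field names
    follow the region driver from memory and may not match the final patch
    exactly:

    static ssize_t region_badblocks_show(struct device *dev,
                    struct device_attribute *attr, char *buf)
    {
            struct nd_region *nd_region = to_nd_region(dev);
            ssize_t rc;

            device_lock(dev);
            if (dev->driver)        /* only valid once the region is enabled */
                    rc = badblocks_show(&nd_region->bb, buf, 0);
            else
                    rc = -ENXIO;
            device_unlock(dev);

            return rc;
    }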

    Fixes: 6a6bef90425e ("libnvdimm: add mechanism to publish badblocks...")
    Cc:
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Dave Jiang
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     
  • commit b6eae0f61db27748606cc00dafcfd1e2c032f0a5 upstream.

    Unlike asynchronous initialization in the core we have not yet associated
    the device with the parent, and as such the device doesn't hold a reference
    to the parent.

    In order to resolve that we should be holding a reference on the parent
    until the asynchronous initialization has completed.
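
    A minimal sketch of the pattern being described, assuming the usual
    get_device()/put_device() pairing around the async registration (names
    are illustrative):

    /* before scheduling the async probe, pin the parent */
    if (dev->parent)
            get_device(dev->parent);
    get_device(dev);
    async_schedule_domain(nd_async_device_register, dev, &nd_async_domain);

    /* ...and at the end of nd_async_device_register() */
    if (dev->parent)
            put_device(dev->parent);
    put_device(dev);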

    Cc:
    Fixes: 4d88a97aa9e8 ("libnvdimm: ...base ... infrastructure")
    Signed-off-by: Alexander Duyck
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Alexander Duyck
     

26 Aug, 2018

2 commits

  • …/linux/kernel/git/nvdimm/nvdimm

    Pull libnvdimm memory-failure update from Dave Jiang:
    "As it stands, memory_failure() gets thoroughly confused by dev_pagemap
    backed mappings. The recovery code has specific enabling for several
    possible page states and needs new enabling to handle poison in dax
    mappings.

    In order to support reliable reverse mapping of user space addresses:

    1/ Add new locking in the memory_failure() rmap path to prevent races
    that would typically be handled by the page lock.

    2/ Since dev_pagemap pages are hidden from the page allocator and the
    "compound page" accounting machinery, add a mechanism to determine
    the size of the mapping that encompasses a given poisoned pfn.

    3/ Given pmem errors can be repaired, change the speculatively
    accessed poison protection, mce_unmap_kpfn(), to be reversible and
    otherwise allow ongoing access from the kernel.

    A side effect of this enabling is that MADV_HWPOISON becomes usable
    for dax mappings, however the primary motivation is to allow the
    system to survive userspace consumption of hardware-poison via dax.
    Specifically the current behavior is:

    mce: Uncorrected hardware memory error in user-access at af34214200
    {1}[Hardware Error]: It has been corrected by h/w and requires no further action
    mce: [Hardware Error]: Machine check events logged
    {1}[Hardware Error]: event severity: corrected
    Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
    [..]
    Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
    mce: Memory error not recovered
    <reboot>

    ...and with these changes:

    Injecting memory failure for pfn 0x20cb00 at process virtual address 0x7f763dd00000
    Memory failure: 0x20cb00: Killing dax-pmd:5421 due to hardware memory corruption
    Memory failure: 0x20cb00: recovery action for dax page: Recovered

    Given all the cross dependencies I propose taking this through
    nvdimm.git with acks from Naoya, x86/core, x86/RAS, and of course dax
    folks"

    * tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm:
    libnvdimm, pmem: Restore page attributes when clearing errors
    x86/memory_failure: Introduce {set, clear}_mce_nospec()
    x86/mm/pat: Prepare {reserve, free}_memtype() for "decoy" addresses
    mm, memory_failure: Teach memory_failure() about dev_pagemap pages
    filesystem-dax: Introduce dax_lock_mapping_entry()
    mm, memory_failure: Collect mapping size in collect_procs()
    mm, madvise_inject_error: Let memory_failure() optionally take a page reference
    mm, dev_pagemap: Do not clear ->mapping on final put
    mm, madvise_inject_error: Disable MADV_SOFT_OFFLINE for ZONE_DEVICE pages
    filesystem-dax: Set page->index
    device-dax: Set page->index
    device-dax: Enable page_mapping()
    device-dax: Convert to vmf_insert_mixed and vm_fault_t

    Linus Torvalds
     
  • Pull libnvdimm updates from Dave Jiang:
    "Collection of misc libnvdimm patches for 4.19 submission:

    - Add support to read locked nvdimm capacity.

    - Change test code to make DSM failure code injection an override.

    - Add support for calculating the maximum contiguous area for a namespace.

    - Add support for queueing a short ARS when there is an ongoing ARS for
    an nvdimm.

    - Allow NULL to be passed in to ->direct_access() for kaddr and pfn
    params.

    - Improve smart injection support for nvdimm emulation testing.

    - Fix test code that supports emulating controller temperature.

    - Fix a hang on error before devm_memremap_pages().

    - Fix a bug that causes user memory corruption when data is returned to
    the user for ars_status.

    - Maintainer updates for Ross Zwisler's email address and adding Jan Kara
    for fsdax"

    * tag 'libnvdimm-for-4.19_misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm:
    libnvdimm: fix ars_status output length calculation
    device-dax: avoid hang on error before devm_memremap_pages()
    tools/testing/nvdimm: improve emulation of smart injection
    filesystem-dax: Do not request kaddr and pfn when not required
    md/dm-writecache: Don't request pointer dummy_addr when not required
    dax/super: Do not request a pointer kaddr when not required
    tools/testing/nvdimm: kaddr and pfn can be NULL to ->direct_access()
    s390, dcssblk: kaddr and pfn can be NULL to ->direct_access()
    libnvdimm, pmem: kaddr and pfn can be NULL to ->direct_access()
    acpi/nfit: queue issuing of ars when an uc error notification comes in
    libnvdimm: Export max available extent
    libnvdimm: Use max contiguous area for namespace size
    MAINTAINERS: Add Jan Kara for filesystem DAX
    MAINTAINERS: update Ross Zwisler's email address
    tools/testing/nvdimm: Fix support for emulating controller temperature
    tools/testing/nvdimm: Make DSM failure code injection an override
    acpi, nfit: Prefer _DSM over _LSR for namespace label reads
    libnvdimm: Introduce locked DIMM capacity support

    Linus Torvalds
     

21 Aug, 2018

1 commit

  • Use clear_mce_nospec() to restore WB mode for the kernel linear mapping
    of a pmem page that was marked 'HWPoison'. A page with 'HWPoison' set
    has also been marked UC in PAT (page attribute table) via
    set_mce_nospec() to prevent speculative retrievals of poison.

    The 'HWPoison' flag is only cleared when overwriting an entire page.

    Signed-off-by: Dan Williams
    Signed-off-by: Dave Jiang

    Dan Williams
     

20 Aug, 2018

1 commit

    Commit efda1b5d87cb ("acpi, nfit, libnvdimm: fix / harden ars_status output length handling")
    introduced additional hardening for ambiguity in the ACPI spec around
    ars_status output sizing. However, it had a couple of cases mixed up:
    where it should have been checking for (and returning) "out_field[1] -
    4" it was using "out_field[1] - 8" and vice versa.

    This caused a four byte discrepancy in the buffer size passed on to
    the command handler, and in some cases, this caused memory corruption
    like:

    ./daxdev-errors.sh: line 76: 24104 Aborted (core dumped) ./daxdev-errors $busdev $region
    malloc(): memory corruption
    Program received signal SIGABRT, Aborted.
    [...]
    #5 0x00007ffff7865a2e in calloc () from /lib64/libc.so.6
    #6 0x00007ffff7bc2970 in ndctl_bus_cmd_new_ars_status (ars_cap=ars_cap@entry=0x6153b0) at ars.c:136
    #7 0x0000000000401644 in check_ars_status (check=0x7fffffffdeb0, bus=0x604c20) at daxdev-errors.c:144
    #8 test_daxdev_clear_error (region_name=, bus_name=)
    at daxdev-errors.c:332

    Cc:
    Cc: Dave Jiang
    Cc: Keith Busch
    Cc: Lukasz Dorau
    Cc: Dan Williams
    Fixes: efda1b5d87cb ("acpi, nfit, libnvdimm: fix / harden ars_status output length handling")
    Signed-off-by: Vishal Verma
    Reviewed-by: Keith Busch
    Signed-off-by: Dave Jiang

    Vishal Verma
     

06 Aug, 2018

1 commit


31 Jul, 2018

1 commit

    pmem_direct_access() needs to check the kaddr and pfn pointers for
    NULL before assigning through them. If either is NULL, it doesn't need
    to calculate that value.

    A NULL pointer means the caller has no need for kaddr or pfn, so this
    patch prepares for allowing callers to pass in NULL instead of having
    to pass in a pointer or local variable that they then just throw away.
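
    A short sketch of the guarded assignments this enables in the pmem driver
    (a simplified excerpt; field names are from memory and may differ):

    long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
                    long nr_pages, void **kaddr, pfn_t *pfn)
    {
            resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset;

            if (kaddr)      /* caller passed NULL: no need to compute it */
                    *kaddr = pmem->virt_addr + offset;
            if (pfn)
                    *pfn = phys_to_pfn_t(pmem->phys_addr + offset,
                                    pmem->pfn_flags);
            return nr_pages;
    }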

    Signed-off-by: Huaisheng Ye
    Reviewed-by: Ross Zwisler
    Signed-off-by: Dave Jiang

    Huaisheng Ye
     

26 Jul, 2018

2 commits

    The 'available_size' attribute, which shows the combined total of all
    unallocated space, isn't always useful for knowing how large a namespace
    a user may be able to allocate when the region is fragmented. This patch
    exports the largest extent of unallocated space that may be allocated
    to create a new namespace.

    Signed-off-by: Keith Busch
    Reviewed-by: Vishal Verma
    Signed-off-by: Dave Jiang

    Keith Busch
     
    This patch finds the max contiguous area to determine the largest
    pmem namespace size that can be created. If the requested size exceeds
    the largest available extent, an ENOSPC error is returned.

    This fixes the allocation underrun error and wrong error return code
    that have otherwise been observed as the following kernel warning:

    WARNING: CPU: PID: at drivers/nvdimm/namespace_devs.c:913 size_store

    Fixes: a1f3e4d6a0c3 ("libnvdimm, region: update nd_region_available_dpa() for multi-pmem support")
    Cc:
    Signed-off-by: Keith Busch
    Reviewed-by: Vishal Verma
    Signed-off-by: Dave Jiang

    Keith Busch
     

18 Jul, 2018

2 commits

  • Add and use a new op_stat_group() function for indexing partition stat
    fields rather than indexing them by rq_data_dir() or bio_data_dir().
    This function works similarly to op_is_sync() in that it takes the
    request::cmd_flags or bio::bi_opf flags and determines which stats
    should be updated.

    In addition, the second parameter to generic_start_io_acct() and
    generic_end_io_acct() is now a REQ_OP rather than simply a read or
    write bit and it uses op_stat_group() on the parameter to determine
    the stat group.

    Note that the partition in_flight counts are not part of the per-cpu
    statistics and as such are not indexed via this function. They are now
    indexed by op_is_write().

    tj: Refreshed on top of v4.17. Updated to pass around REQ_OP.
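
    A hedged sketch of what such a grouping helper looks like and how an
    accounting site might use it; the exact definitions here are assumptions
    rather than quotes from the patch:

    static inline int op_stat_group(unsigned int op)
    {
            /* 0 selects the read stats, 1 the write stats */
            return op_is_write(op);
    }

    /* e.g. at bio accounting time */
    part_stat_inc(cpu, part, ios[op_stat_group(bio->bi_opf)]);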

    Signed-off-by: Michael Callahan
    Signed-off-by: Tejun Heo
    Cc: Minchan Kim
    Cc: Dan Williams
    Cc: Joshua Morris
    Cc: Philipp Reisner
    Cc: Matias Bjorling
    Cc: Kent Overstreet
    Cc: Alasdair Kergon
    Signed-off-by: Jens Axboe

    Michael Callahan
     
  • c11f0c0b5bb9 ("block/mm: make bdev_ops->rw_page() take a bool for
    read/write") replaced @op with boolean @is_write, which limited the
    amount of information going into ->rw_page() and more importantly
    page_endio(), which removed the need to expose block internals to mm.

    Unfortunately, we want to track discards separately and @is_write
    isn't enough information. This patch updates bdev_ops->rw_page() to
    take REQ_OP instead but leaves page_endio() to take bool @is_write.
    This allows the block part of operations to have enough information
    while not leaking it to mm.
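
    A brief sketch of the reworked hook as described here; the prototype is
    an assumption based on the commit text, not a verbatim quote:

    struct block_device_operations {
            /* was: bool is_write; now the full REQ_OP so drivers and the
             * block core can distinguish reads, writes and discards */
            int (*rw_page)(struct block_device *bdev, sector_t sector,
                            struct page *page, unsigned int op);
            /* ... remaining operations unchanged ... */
    };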

    Signed-off-by: Tejun Heo
    Cc: Mike Christie
    Cc: Minchan Kim
    Cc: Dan Williams
    Signed-off-by: Jens Axboe

    Tejun Heo
     

15 Jul, 2018

1 commit


14 Jul, 2018

1 commit

  • Pull libnvdimm fixes from Dave Jiang:

    - ensure that a variable passed in by reference to acpi_nfit_ctl is
    always set to a value. An incremental patch is provided due to a notice
    from testing in -next. The rest of the commits did not exhibit
    issues.

    - fix a return path in nsio_rw_bytes() that was not returning "bytes
    remain" as expected for the function.

    - address an issue where applications polling on scrub-completion for
    the NVDIMM may falsely wake up, read the wrong state value, and
    hang.

    - change the test unit persistent capability attribute to fix up a
    broken assumption in the unit test infrastructure wrt the
    'write_cache' attribute

    - ratelimit dev_info() in the dax device check_vma() function since
    this is easily triggered from userspace

    * tag 'libnvdimm-fixes-4.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    nfit: fix unchecked dereference in acpi_nfit_ctl
    acpi, nfit: Fix scrub idle detection
    tools/testing/nvdimm: advertise a write cache for nfit_test
    acpi/nfit: fix cmd_rc for acpi_nfit_ctl to always return a value
    dev-dax: check_vma: ratelimit dev_info-s
    libnvdimm, pmem: Fix memcpy_mcsafe() return code handling in nsio_rw_bytes()

    Linus Torvalds
     

29 Jun, 2018

2 commits

  • Commit 60622d68227d "x86/asm/memcpy_mcsafe: Return bytes remaining"
    converted callers of memcpy_mcsafe() to expect a positive 'bytes
    remaining' value rather than a negative error code. The nsio_rw_bytes()
    conversion failed to return success. The failure is benign in that
    nsio_rw_bytes() will end up writing back what it just read.

    Fixes: 60622d68227d ("x86/asm/memcpy_mcsafe: Return bytes remaining")
    Cc: Dan Williams
    Reviewed-by: Vishal Verma
    Signed-off-by: Dan Williams

    Dan Williams
     
  • QUEUE_FLAG_DAX is an indication that a given block device supports
    filesystem DAX and should not be set for PMEM namespaces which are in "raw"
    mode. These namespaces lack struct page and are prevented from
    participating in filesystem DAX as of commit 569d0365f571 ("dax: require
    'struct page' by default for filesystem dax").

    Signed-off-by: Ross Zwisler
    Suggested-by: Mike Snitzer
    Fixes: 569d0365f571 ("dax: require 'struct page' by default for filesystem dax")
    Cc: stable@vger.kernel.org
    Acked-by: Dan Williams
    Reviewed-by: Toshi Kani
    Signed-off-by: Mike Snitzer

    Ross Zwisler
     

09 Jun, 2018

2 commits


07 Jun, 2018

3 commits

  • This commit:

    5fdf8e5ba566 ("libnvdimm: re-enable deep flush for pmem devices via fsync()")

    intended to make sure that deep flush was always available even on
    platforms which support a power-fail protected CPU cache. An unintended
    side effect of this change was that we also lost the ability to skip
    flushing CPU caches on platforms with a power-fail protected CPU cache.

    Fix this by skipping the low level cache flushing in dax_flush() if we have
    CPU caches which are power-fail protected. The user can still override this
    behavior by manually setting the write_cache state of a namespace. See
    libndctl's ndctl_namespace_write_cache_is_enabled(),
    ndctl_namespace_enable_write_cache() and
    ndctl_namespace_disable_write_cache() functions.
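
    A hedged example of that manual override via libndctl; the function names
    come from the text above, while the exact signatures and return values
    are assumptions:

    #include <ndctl/libndctl.h>

    /* force *sync to flush CPU caches even on a power-fail protected box */
    static void force_write_cache(struct ndctl_namespace *ndns)
    {
            if (ndctl_namespace_write_cache_is_enabled(ndns) != 1)
                    ndctl_namespace_enable_write_cache(ndns);
    }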

    Cc:
    Fixes: 5fdf8e5ba566 ("libnvdimm: re-enable deep flush for pmem devices via fsync()")
    Signed-off-by: Ross Zwisler
    Signed-off-by: Dan Williams

    Ross Zwisler
     
  • Prior to this commit we would only do a "deep flush" (have nvdimm_flush()
    write to each of the flush hints for a region) in response to an
    msync/fsync/sync call if the nvdimm_has_cache() returned true at the time
    we were setting up the request queue. This happens due to the write cache
    value passed in to blk_queue_write_cache(), which then causes the block
    layer to send down BIOs with REQ_FUA and REQ_PREFLUSH set. We do have a
    "write_cache" sysfs entry for namespaces, i.e.:

    /sys/bus/nd/devices/pfn0.1/block/pmem0/dax/write_cache

    which can be used to control whether or not the kernel thinks a given
    namespace has a write cache, but this didn't modify the deep flush behavior
    that we set up when the driver was initialized. Instead, it only modified
    whether or not DAX would flush CPU caches via dax_flush() in response to
    *sync calls.

    Simplify this by making the *sync deep flush always happen, regardless of
    the write cache setting of a namespace. The DAX CPU cache flushing will
    still be controlled by the write_cache setting of the namespace.

    Cc:
    Fixes: 5fdf8e5ba566 ("libnvdimm: re-enable deep flush for pmem devices via fsync()")
    Signed-off-by: Ross Zwisler
    Signed-off-by: Dan Williams

    Ross Zwisler
     
  • Complete the move from REQ_FLUSH to REQ_PREFLUSH that apparently started
    way back in v4.8.

    Signed-off-by: Ross Zwisler
    Signed-off-by: Dan Williams

    Ross Zwisler
     

03 Jun, 2018

2 commits

  • There is currently a mismatch between the resources that will trigger
    the e820_pmem driver to register/load and the resources that will
    actually be surfaced as pmem ranges. register_e820_pmem() uses
    walk_iomem_res_desc() which includes children and siblings. In contrast,
    e820_pmem_probe() only considers top level resources. For example the
    following resource tree results in the driver being loaded, but no
    resources being registered:

    398000000000-39bfffffffff : PCI Bus 0000:ae
    39be00000000-39bf07ffffff : PCI Bus 0000:af
    39be00000000-39beffffffff : 0000:af:00.0
    39be10000000-39beffffffff : Persistent Memory (legacy)

    Fix this up to allow definitions of "legacy" pmem ranges anywhere in
    system-physical address space. Not that it is recommended or safe to
    define a pmem range in PCI space, but it is useful for debug /
    experimentation, and the restriction on being a top-level resource was
    arbitrary.

    Cc: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Instrument nvdimm_bus_probe() to emit timestamps for the start and end
    of libnvdimm device probing. This is useful for identifying sources of
    libnvdimm sub-system initialization latency.

    Signed-off-by: Dan Williams

    Dan Williams
     

01 Jun, 2018

1 commit

  • The pmem driver does not honor a forced read-only setting for very long:
    $ blockdev --setro /dev/pmem0
    $ blockdev --getro /dev/pmem0
    1

    followed by various commands like these:
    $ blockdev --rereadpt /dev/pmem0
    or
    $ mkfs.ext4 /dev/pmem0

    results in this in the kernel serial log:
    nd_pmem namespace0.0: region0 read-write, marking pmem0 read-write

    with the read-only setting lost:
    $ blockdev --getro /dev/pmem0
    0

    That's from bus.c nvdimm_revalidate_disk(), which always applies the
    setting from nd_region (which is initially based on the ACPI NFIT
    NVDIMM state flags not_armed bit).

    In contrast, commit 20bd1d026aac ("scsi: sd: Keep disk read-only when
    re-reading partition") fixed this issue for SCSI devices to preserve
    the previous setting if it was set to read-only.

    This patch modifies bus.c to preserve any previous read-only setting.
    It also eliminates the kernel serial log print except for cases where
    read-write is changed to read-only, so it doesn't print read-only to
    read-only non-changes.

    Cc:
    Fixes: 581388209405 ("libnvdimm, nfit: handle unarmed dimms, mark namespaces read-only")
    Signed-off-by: Robert Elliott
    Signed-off-by: Dan Williams

    Robert Elliott
     

23 May, 2018

2 commits

  • Use the machine check safe version of copy_to_iter() for the
    ->copy_to_iter() operation published by the pmem driver.

    Signed-off-by: Dan Williams

    Dan Williams
     
  • Similar to the ->copy_from_iter() operation, a platform may want to
    deploy an architecture or device specific routine for handling reads
    from a dax_device like /dev/pmemX. On x86 this routine will point to a
    machine check safe version of copy_to_iter(). For now, add the plumbing
    to device-mapper and the dax core.
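
    A sketch of where the new hook lands in struct dax_operations, alongside
    the existing write-side operation (member layout assumed from the dax
    core of this era):

    struct dax_operations {
            long (*direct_access)(struct dax_device *, pgoff_t, long,
                            void **, pfn_t *);
            size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *,
                            size_t, struct iov_iter *);
            /* new: reads from pmem can go through an MCE-safe copy */
            size_t (*copy_to_iter)(struct dax_device *, pgoff_t, void *,
                            size_t, struct iov_iter *);
    };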

    Cc: Ross Zwisler
    Cc: Mike Snitzer
    Cc: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     

22 May, 2018

1 commit

  • In preparation for fixing dax-dma-vs-unmap issues, filesystems need to
    be able to rely on the fact that they will get wakeups on dev_pagemap
    page-idle events. Introduce MEMORY_DEVICE_FS_DAX and
    generic_dax_page_free() as common indicator / infrastructure for dax
    filesystems to require. With this change there are no users of the
    MEMORY_DEVICE_HOST designation, so remove it.

    The HMM sub-system extended dev_pagemap to arrange a callback when a
    dev_pagemap managed page is freed. Since a dev_pagemap page is free /
    idle when its reference count is 1, it requires an additional branch to
    check the page-type at put_page() time. Given put_page() is a hot-path
    we do not want to incur that check if HMM is not in use, so a static
    branch is used to avoid that overhead when not necessary.

    Now, the FS_DAX implementation wants to reuse this mechanism for
    receiving dev_pagemap ->page_free() callbacks. Rework the HMM-specific
    static-key into a generic mechanism that either HMM or FS_DAX code paths
    can enable.

    For ARCH=um builds, and any other arch that lacks ZONE_DEVICE support,
    care must be taken to compile out the DEV_PAGEMAP_OPS infrastructure.
    However, we still need to support FS_DAX in the FS_DAX_LIMITED case
    implemented by the s390/dcssblk driver.
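
    A hedged sketch of the static-branch pattern being described: put_page()
    only pays for the dev_pagemap page-type check once an FS_DAX or HMM user
    has flipped the key (identifiers are illustrative, not verbatim):

    DEFINE_STATIC_KEY_FALSE(devmap_managed_key);

    static inline bool put_devmap_managed_page(struct page *page)
    {
            if (!static_branch_unlikely(&devmap_managed_key))
                    return false;           /* common case: no extra work */
            if (!is_zone_device_page(page))
                    return false;
            switch (page->pgmap->type) {
            case MEMORY_DEVICE_FS_DAX:
                    /* fires ->page_free() when the count drops to idle */
                    __put_devmap_managed_page(page);
                    return true;
            default:
                    return false;
            }
    }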

    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Michal Hocko
    Reported-by: kbuild test robot
    Reported-by: Thomas Meyer
    Reported-by: Dave Jiang
    Cc: "Jérôme Glisse"
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     

15 May, 2018

1 commit

  • Machine check safe memory copies are currently deployed in the pmem
    driver whenever reading from persistent memory media, so that -EIO is
    returned rather than triggering a kernel panic. While this protects most
    pmem accesses, it is not complete in the filesystem-dax case. When
    filesystem-dax is enabled reads may bypass the block layer and the
    driver via dax_iomap_actor() and its usage of copy_to_iter().

    In preparation for creating a copy_to_iter() variant that can handle
    machine checks, teach memcpy_mcsafe() to return the number of bytes
    remaining rather than -EFAULT when an exception occurs.
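
    The new calling convention, as described above (a sketch of a caller,
    not a quote from the patch):

    unsigned long rem;

    rem = memcpy_mcsafe(dst, src, len);
    if (rem) {
            /* a machine check interrupted the copy; only 'len - rem' bytes
             * made it to 'dst', so report a short copy instead of -EFAULT */
            return len - rem;
    }
    return len;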

    Co-developed-by: Tony Luck
    Signed-off-by: Dan Williams
    Cc: Al Viro
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: hch@lst.de
    Cc: linux-fsdevel@vger.kernel.org
    Cc: linux-nvdimm@lists.01.org
    Link: http://lkml.kernel.org/r/152539238119.31796.14318473522414462886.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Ingo Molnar

    Dan Williams
     

20 Apr, 2018

2 commits


16 Apr, 2018

1 commit

  • The new support for the standard _LSR and _LSW methods neglected to also
    update the nvdimm_init_config_data() and nvdimm_set_config_data() to
    return the translated error code from failed commands. This precision is
    necessary because the locked status that was previously returned on
    ND_CMD_GET_CONFIG_SIZE commands is now returned on
    ND_CMD_{GET,SET}_CONFIG_DATA commands.

    If the kernel misses this indication it can inadvertently fall back to
    label-less mode when it should otherwise avoid all access to locked
    regions.

    Cc:
    Fixes: 4b27db7e26cd ("acpi, nfit: add support for the _LSI, _LSR, and...")
    Signed-off-by: Dan Williams

    Dan Williams
     

11 Apr, 2018

1 commit

  • Pull libnvdimm updates from Dan Williams:
    "This cycle was was not something I ever want to repeat as there were
    several late changes that have only now just settled.

    Half of the branch up to commit d2c997c0f145 ("fs, dax: use
    page->mapping to warn...") have been in -next for several releases.
    The of_pmem driver and the address range scrub rework were late
    arrivals, and the dax work was scaled back at the last moment.

    The of_pmem driver missed a previous merge window due to an oversight.
    A sense of obligation to rectify that miss is why it is included for
    4.17. It has acks from PowerPC folks. Stephen reported a build failure
    that only occurs when merging it with your latest tree; for now I have
    fixed that up by disabling modular builds of of_pmem. A test merge
    with your tree has received a build success report from the 0day robot
    over 156 configs.

    An initial version of the ARS rework was submitted before the merge
    window. It is self-contained to libnvdimm, is a net code reduction, and
    passes all unit tests.

    The filesystem-dax changes are based on the wait_var_event()
    functionality from tip/sched/core. However, late review feedback
    showed that those changes regressed truncate performance to a large
    degree. The branch was rewound to drop the truncate behavior change
    and now only includes preparation patches and cleanups (with full acks
    and reviews). The finalization of this dax-dma-vs-truncate work will
    need to wait for 4.18.

    Summary:

    - A rework of the filesystem-dax implementation provides for detection
    of unmap operations (truncate / hole punch) colliding with
    in-progress device-DMA. A fix for these collisions remains a
    work-in-progress pending resolution of truncate latency and
    starvation regressions.

    - The of_pmem driver expands the users of libnvdimm outside of x86
    and ACPI to describe an implementation of persistent memory on
    PowerPC with Open Firmware / Device tree.

    - Address Range Scrub (ARS) handling is completely rewritten to
    account for the fact that ARS may run for 100s of seconds and there
    is no platform defined way to cancel it. ARS will now no longer
    block namespace initialization.

    - The NVDIMM Namespace Label implementation is updated to handle
    label areas as small as 1K, down from 128K.

    - Miscellaneous cleanups and updates to unit test infrastructure"

    * tag 'libnvdimm-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (39 commits)
    libnvdimm, of_pmem: workaround OF_NUMA=n build error
    nfit, address-range-scrub: add module option to skip initial ars
    nfit, address-range-scrub: rework and simplify ARS state machine
    nfit, address-range-scrub: determine one platform max_ars value
    powerpc/powernv: Create platform devs for nvdimm buses
    doc/devicetree: Persistent memory region bindings
    libnvdimm: Add device-tree based driver
    libnvdimm: Add of_node to region and bus descriptors
    libnvdimm, region: quiet region probe
    libnvdimm, namespace: use a safe lookup for dimm device name
    libnvdimm, dimm: fix dpa reservation vs uninitialized label area
    libnvdimm, testing: update the default smart ctrl_temperature
    libnvdimm, testing: Add emulation for smart injection commands
    nfit, address-range-scrub: introduce nfit_spa->ars_state
    libnvdimm: add an api to cast a 'struct nd_region' to its 'struct device'
    nfit, address-range-scrub: fix scrub in-progress reporting
    dax, dm: allow device-mapper to operate without dax support
    dax: introduce CONFIG_DAX_DRIVER
    fs, dax: use page->mapping to warn if truncate collides with a busy page
    ext2, dax: introduce ext2_dax_aops
    ...

    Linus Torvalds
     

10 Apr, 2018

1 commit