10 Sep, 2018

1 commit

  • commit 286e87718103acdf85f4ed323a37e4839a8a7c05 upstream.

    Commit efda1b5d87cb ("acpi, nfit, libnvdimm: fix / harden ars_status output length handling")
    introduced additional hardening for ambiguity in the ACPI spec for
    ars_status output sizing. However, it had a couple of cases mixed up.
    Where it should have been checking for (and returning) "out_field[1] -
    4" it was using "out_field[1] - 8" and vice versa.

    This caused a four byte discrepancy in the buffer size passed on to
    the command handler, and in some cases, this caused memory corruption
    like:

    ./daxdev-errors.sh: line 76: 24104 Aborted (core dumped) ./daxdev-errors $busdev $region
    malloc(): memory corruption
    Program received signal SIGABRT, Aborted.
    [...]
    #5 0x00007ffff7865a2e in calloc () from /lib64/libc.so.6
    #6 0x00007ffff7bc2970 in ndctl_bus_cmd_new_ars_status (ars_cap=ars_cap@entry=0x6153b0) at ars.c:136
    #7 0x0000000000401644 in check_ars_status (check=0x7fffffffdeb0, bus=0x604c20) at daxdev-errors.c:144
    #8 test_daxdev_clear_error (region_name=, bus_name=)
    at daxdev-errors.c:332

    Cc: Dave Jiang
    Cc: Keith Busch
    Cc: Lukasz Dorau
    Cc: Dan Williams
    Fixes: efda1b5d87cb ("acpi, nfit, libnvdimm: fix / harden ars_status output length handling")
    Signed-off-by: Vishal Verma
    Reviewed-by: Keith Busch
    Signed-off-by: Dave Jiang
    Signed-off-by: Greg Kroah-Hartman

    Vishal Verma
     

03 Jul, 2018

1 commit

  • commit 254a4cd50b9fe2291a12b8902e08e56dcc4e9b10 upstream.

    The pmem driver does not honor a forced read-only setting for very long:
    $ blockdev --setro /dev/pmem0
    $ blockdev --getro /dev/pmem0
    1

    followed by various commands like these:
    $ blockdev --rereadpt /dev/pmem0
    or
    $ mkfs.ext4 /dev/pmem0

    results in this in the kernel serial log:
    nd_pmem namespace0.0: region0 read-write, marking pmem0 read-write

    with the read-only setting lost:
    $ blockdev --getro /dev/pmem0
    0

    That's from bus.c nvdimm_revalidate_disk(), which always applies the
    setting from nd_region (which is initially based on the ACPI NFIT
    NVDIMM state flags not_armed bit).

    In contrast, commit 20bd1d026aac ("scsi: sd: Keep disk read-only when
    re-reading partition") fixed this issue for SCSI devices to preserve
    the previous setting if it was set to read-only.

    This patch modifies bus.c to preserve any previous read-only setting.
    It also eliminates the kernel serial log print except for cases where
    read-write is changed to read-only, so it doesn't print read-only to
    read-only non-changes.
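
    The preserve-read-only rule described above can be sketched as a small
    decision function. This is a minimal model in Python, not the kernel
    code: the function name and its boolean arguments are hypothetical
    stand-ins for the region's and the disk's read-only state.

```python
from typing import Tuple


def revalidate_disk_ro(region_ro: bool, disk_ro: bool) -> Tuple[bool, bool]:
    """Return (new_disk_ro, logged) for one revalidate pass.

    Hypothetical model of the behaviour described in the commit message:
    a read-only disk stays read-only, a read-only region forces the disk
    read-only, and a message is logged only on a read-write -> read-only
    change.
    """
    if disk_ro:
        # Preserve a previous read-only setting (e.g. blockdev --setro).
        return True, False
    if region_ro:
        # The only transition that is applied and logged.
        return True, True
    # Region and disk are both read-write: nothing changes, nothing logged.
    return False, False
```

    With this rule a disk forced read-only via `blockdev --setro` stays
    read-only across a partition rescan even when the region is read-write.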

    Fixes: 581388209405 ("libnvdimm, nfit: handle unarmed dimms, mark namespaces read-only")
    Signed-off-by: Robert Elliott
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Robert Elliott
     

24 Apr, 2018

2 commits

  • commit 4f8672201b7e7ed4f5f6c3cf6dcd080648580582 upstream.

    The following NULL dereference results from incorrectly assuming that
    ndd is valid in this print:

    struct nvdimm_drvdata *ndd = to_ndd(&nd_region->mapping[i]);

    /*
    * Give up if we don't find an instance of a uuid at each
    * position (from 0 to nd_region->ndr_mappings - 1), or if we
    * find a dimm with two instances of the same uuid.
    */
    dev_err(&nd_region->dev, "%s missing label for %pUb\n",
    dev_name(ndd->dev), nd_label->uuid);

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
    IP: nd_region_register_namespaces+0xd67/0x13c0 [libnvdimm]
    PGD 0 P4D 0
    Oops: 0000 [#1] SMP PTI
    CPU: 43 PID: 673 Comm: kworker/u609:10 Not tainted 4.16.0-rc4+ #1
    [..]
    RIP: 0010:nd_region_register_namespaces+0xd67/0x13c0 [libnvdimm]
    [..]
    Call Trace:
    ? devres_add+0x2f/0x40
    ? devm_kmalloc+0x52/0x60
    ? nd_region_activate+0x9c/0x320 [libnvdimm]
    nd_region_probe+0x94/0x260 [libnvdimm]
    ? kernfs_add_one+0xe4/0x130
    nvdimm_bus_probe+0x63/0x100 [libnvdimm]

    Switch to using the nvdimm device directly.

    Fixes: 0e3b0d123c8f ("libnvdimm, namespace: allow multiple pmem...")
    Reported-by: Dave Jiang
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     
  • commit c31898c8c711f2bbbcaebe802a55827e288d875a upstream.

    At initialization time the 'dimm' driver caches a copy of the memory
    device's label area and reserves address space for each of the
    namespaces defined.

    However, as can be seen below, the reservation occurs even when the
    index blocks are invalid:

    nvdimm nmem0: nvdimm_init_config_data: len: 131072 rc: 0
    nvdimm nmem0: config data size: 131072
    nvdimm nmem0: __nd_label_validate: nsindex0 labelsize 1 invalid
    nvdimm nmem0: __nd_label_validate: nsindex1 labelsize 1 invalid
    nvdimm nmem0: : pmem-6025e505: 0x1000000000 @ 0xf50000000 reserve
    Fixes: 4a826c83db4e ("libnvdimm: namespace indices: read and validate")
    Reported-by: Krzysztof Rusocki
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     

29 Mar, 2018

1 commit

  • commit 3ffb0ba9b567a8efb9a04ed3d1ec15ff333ada22 upstream.

    Prior to 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk")
    we needed to temporarily add a zero-capacity disk before registering for
    blk-integrity. But adding a zero-capacity disk caused the partition
    table scanning to bail early, and this resulted in partitions not coming
    up after a probe of the BTT or blk namespaces.

    We can now register for integrity before the disk has been added, and
    this fixes the rescan problems.

    Fixes: 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk")
    Reported-by: Dariusz Dokupil
    Signed-off-by: Vishal Verma
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Vishal Verma
     

30 Dec, 2017

3 commits

  • commit 19deaa217bc04e83b59b5e8c8229eb0e53ad9efc upstream.

    The alignment checks at pfn driver startup fail to properly account for
    the 'start_pad' in the case where the namespace is misaligned relative
    to its internal alignment. This is typically triggered with 1G-aligned
    namespaces, but could theoretically trigger with smaller namespace
    alignments. When this triggers, the kernel reports messages of the form:

    dax2.1: bad offset: 0x3c000000 dax disabled align: 0x40000000

    Fixes: 1ee6667cd8d1 ("libnvdimm, pfn, dax: fix initialization vs autodetect...")
    Reported-by: Jane Chu
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     
  • commit 24e3a7fb60a9187e5df90e5fa655ffc94b9c4f77 upstream.

    Due to a spec misinterpretation, the Linux implementation of the BTT log
    area had a different padding scheme from other implementations, such as
    UEFI and NVML.

    This fixes the padding scheme, and defaults to it for new BTT layouts.
    We attempt to detect the padding scheme in use when probing for an
    existing BTT. If we detect the older/incompatible scheme, we continue
    using it.

    Reported-by: Juston Li
    Cc: Dan Williams
    Fixes: 5212e11fde4d ("nd_btt: atomic sector updates")
    Signed-off-by: Vishal Verma
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Vishal Verma
     
  • commit 41fce90f26333c4fa82e8e43b9ace86c4e8a0120 upstream.

    The following namespace configuration attempt:

    # ndctl create-namespace -e namespace0.0 -m devdax -a 1G -f
    libndctl: ndctl_dax_enable: dax0.1: failed to enable
    Error: namespace0.0: failed to enable

    failed to reconfigure namespace: No such device or address

    ...fails when the backing memory range is not physically aligned to 1G:

    # cat /proc/iomem | grep Persistent
    210000000-30fffffff : Persistent Memory (legacy)

    In the above example the 4G persistent memory range starts and ends on a
    256MB boundary.

    We already handle ranges that violate section alignment (128MB) and
    collide with "System RAM" by padding/truncating, and we simply
    need to extend that padding/truncation for the 1GB alignment use case.
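
    The arithmetic behind the padding/truncation can be checked with a short
    Python sketch using the /proc/iomem range quoted above. The helper names
    (and the 'end_trunc' label) are illustrative assumptions; the sketch only
    computes how much of the range survives 1G alignment and ignores any
    space the driver reserves for its own metadata.

```python
SZ_1G = 1 << 30


def align_up(value: int, align: int) -> int:
    """Round value up to the next multiple of align (a power of two)."""
    return (value + align - 1) & ~(align - 1)


def align_down(value: int, align: int) -> int:
    """Round value down to a multiple of align (a power of two)."""
    return value & ~(align - 1)


def pad_for_alignment(start: int, last: int, align: int):
    """Return (start_pad, end_trunc, usable) for the inclusive range [start, last]."""
    end = last + 1                      # exclusive end of the range
    aligned_start = align_up(start, align)
    aligned_end = align_down(end, align)
    start_pad = aligned_start - start   # bytes skipped at the front
    end_trunc = end - aligned_end       # bytes dropped at the back
    usable = max(aligned_end - aligned_start, 0)
    return start_pad, end_trunc, usable


# Range from the example: 210000000-30fffffff (4G, 256MB aligned)
start_pad, end_trunc, usable = pad_for_alignment(0x210000000, 0x30FFFFFFF, SZ_1G)
print(hex(start_pad), hex(end_trunc), hex(usable))
```

    For this range the front needs 768MB of padding and the back loses 256MB,
    leaving 3G that is aligned to 1G at both ends.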

    Fixes: 315c562536c4 ("libnvdimm, pfn: add 'align' attribute...")
    Reported-and-tested-by: Jane Chu
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     

30 Nov, 2017

5 commits

  • commit c1fb3542074fd0c4d901d778bd52455111e4eb6f upstream.

    For the same reason that /proc/iomem returns 0's for non-root readers
    and acpi tables are root-only, make the 'resource' attribute for
    namespace devices only readable by root. Otherwise we disclose physical
    address information.

    Fixes: bf9bccc14c05 ("libnvdimm: pmem label sets and namespace instantiation")
    Reported-by: Dave Hansen
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     
  • commit b8ff981f88df03c72a4de2f6eaa9ce447a10ac03 upstream.

    For the same reason that /proc/iomem returns 0's for non-root readers
    and acpi tables are root-only, make the 'resource' attribute for region
    devices only readable by root. Otherwise we disclose physical address
    information.

    Fixes: 802f4be6feee ("libnvdimm: Add 'resource' sysfs attribute to regions")
    Cc: Dave Jiang
    Cc: Johannes Thumshirn
    Reported-by: Dave Hansen
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     
  • commit b18d4b8a25af6fe83d7692191d6ff962ea611c4f upstream.

    The set of valid sequence numbers is {1,2,3}. The specification
    indicates that an implementation should consider 0 a sign of a critical
    error:

    UEFI 2.7: 13.19 NVDIMM Label Protocol

    Software never writes the sequence number 00, so a correctly
    check-summed Index Block with this sequence number probably indicates a
    critical error. When software discovers this case it treats it as an
    invalid Index Block indication.

    While the expectation is that the invalid block is just thrown away, the
    Robustness Principle says we should fix this to make both sequence
    numbers valid.
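
    As a rough illustration of the sequence-number rule quoted above, the
    following Python sketch cycles through 1 -> 2 -> 3 -> 1 and treats 0 as
    never valid. The function names and the comparison policy are assumptions
    made for this example; they are not taken from the patch itself.

```python
# Successor table indexed by the current sequence number.
# 0 maps to 0 because software never writes sequence number 0.
NEXT_SEQ = (0, 2, 3, 1)


def inc_seq(seq: int) -> int:
    """Return the sequence number that follows seq in the 1 -> 2 -> 3 -> 1 cycle."""
    return NEXT_SEQ[seq & 3]


def best_seq(a: int, b: int) -> int:
    """Pick the newer of two index block sequence numbers.

    A value of 0 marks an invalid block, so the other value wins.
    Otherwise b is newer only if it directly follows a in the cycle.
    """
    if a == 0 or a == b:
        return b
    if b == 0:
        return a
    if inc_seq(a) == b:
        return b
    return a
```

    Under this model a block carrying sequence number 0 never wins a
    comparison against a block with a valid number.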

    Fixes: f524bf271a5c ("libnvdimm: write pmem label set")
    Reported-by: Juston Li
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     
  • commit 26417ae4fc6108f8db436f24108b08f68bdc520e upstream.

    For the same reason that /proc/iomem returns 0's for non-root readers
    and acpi tables are root-only, make the 'resource' attribute for pfn
    devices only readable by root. Otherwise we disclose physical address
    information.

    Fixes: f6ed58c70d14 ("libnvdimm, pfn: 'resource'-address and 'size'...")
    Reported-by: Dave Hansen
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     
  • commit d34cb808402898e53b9a9bcbbedd01667a78723b upstream.

    If we successfully enable a DIMM then it must not be locked and we can
    clear the label-read failure condition. Otherwise, we need to reload the
    entire bus provider driver to achieve the same effect, and that can
    disrupt unrelated DIMMs and namespaces.

    Fixes: 9d62ed965118 ("libnvdimm: handle locked label storage areas")
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

19 Sep, 2017

1 commit

  • Maurice reports:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
    IP: holder_class_store+0x253/0x2b0 [libnvdimm]

    ...while trying to reconfigure an NVDIMM-N namespace into 'sector' /
    'btt' mode. The crash points to this line:

    (gdb) li *(holder_class_store+0x253)
    0x7773 is in holder_class_store (drivers/nvdimm/namespace_devs.c:1420).
    1415 for (i = 0; i < nd_region->ndr_mappings; i++) {
    1416 struct nd_mapping *nd_mapping = &nd_region->mapping[i];
    1417 struct nvdimm_drvdata *ndd = to_ndd(nd_mapping);
    1418 struct nd_namespace_index *nsindex;
    1419
    1420 nsindex = to_namespace_index(ndd, ndd->ns_current);

    ...where we are failing because ndd is NULL due to NVDIMM-N dimms not
    supporting labels.

    Long story short, default to the BTTv1 format in the label-less /
    NVDIMM-N case.

    Fixes: 14e494542636 ("libnvdimm, btt: BTT updates for UEFI 2.7 format")
    Cc: Vishal Verma
    Reported-by: Maurice A. Saldivar
    Tested-by: Maurice A. Saldivar
    Signed-off-by: Dan Williams

    Dan Williams
     

15 Sep, 2017

1 commit

  • …/device-mapper/linux-dm

    Pull device mapper updates from Mike Snitzer:

    - Some request-based DM core and DM multipath fixes and cleanups

    - Constify a few variables in DM core and DM integrity

    - Add bufio optimization and checksum failure accounting to DM
    integrity

    - Fix DM integrity to avoid checking integrity of failed reads

    - Fix DM integrity to use init_completion

    - A couple DM log-writes target fixes

    - Simplify DAX flushing by eliminating the unnecessary flush
    abstraction that was stood up for DM's use.

    * tag 'for-4.14/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
    dax: remove the pmem_dax_ops->flush abstraction
    dm integrity: use init_completion instead of COMPLETION_INITIALIZER_ONSTACK
    dm integrity: make blk_integrity_profile structure const
    dm integrity: do not check integrity for failed read operations
    dm log writes: fix >512b sectorsize support
    dm log writes: don't use all the cpu while waiting to log blocks
    dm ioctl: constify ioctl lookup table
    dm: constify argument arrays
    dm integrity: count and display checksum failures
    dm integrity: optimize writing dm-bufio buffers that are partially changed
    dm rq: do not update rq partially in each ending bio
    dm rq: make dm-sq requeuing behavior consistent with dm-mq behavior
    dm mpath: complain about unsupported __multipath_map_bio() return values
    dm mpath: avoid that building with W=1 causes gcc 7 to complain about fall-through

    Linus Torvalds
     

12 Sep, 2017

1 commit

  • Pull libnvdimm from Dan Williams:
    "A rework of media error handling in the BTT driver and other updates.
    It has appeared in a few -next releases and collected some late-
    breaking build-error and warning fixups as a result.

    Summary:

    - Media error handling support in the Block Translation Table (BTT)
    driver is reworked to address sleeping-while-atomic locking and
    memory-allocation-context conflicts.

    - The dax_device lookup overhead for xfs and ext4 is moved out of the
    iomap hot-path to a mount-time lookup.

    - A new 'ecc_unit_size' sysfs attribute is added to advertise the
    read-modify-write boundary property of a persistent memory range.

    - Preparatory fix-ups for arm and powerpc pmem support are included
    along with other miscellaneous fixes"

    * tag 'libnvdimm-for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (26 commits)
    libnvdimm, btt: fix format string warnings
    libnvdimm, btt: clean up warning and error messages
    ext4: fix null pointer dereference on sbi
    libnvdimm, nfit: move the check on nd_reserved2 to the endpoint
    dax: fix FS_DAX=n BLOCK=y compilation
    libnvdimm: fix integer overflow static analysis warning
    libnvdimm, nd_blk: remove mmio_flush_range()
    libnvdimm, btt: rework error clearing
    libnvdimm: fix potential deadlock while clearing errors
    libnvdimm, btt: cache sector_size in arena_info
    libnvdimm, btt: ensure that flags were also unchanged during a map_read
    libnvdimm, btt: refactor map entry operations with macros
    libnvdimm, btt: fix a missed NVDIMM_IO_ATOMIC case in the write path
    libnvdimm, nfit: export an 'ecc_unit_size' sysfs attribute
    ext4: perform dax_device lookup at mount
    ext2: perform dax_device lookup at mount
    xfs: perform dax_device lookup at mount
    dax: introduce a fs_dax_get_by_bdev() helper
    libnvdimm, btt: check memory allocation failure
    libnvdimm, label: fix index block size calculation
    ...

    Linus Torvalds
     

11 Sep, 2017

1 commit

  • Commit abebfbe2f731 ("dm: add ->flush() dax operation support") is
    buggy. A DM device may be composed of multiple underlying devices and
    all of them need to be flushed. That commit just routes the flush
    request to the first device and ignores the other devices.

    It could be fixed by adding more complex logic to the device mapper. But
    there is only one implementation of the method pmem_dax_ops->flush - that
    is pmem_dax_flush() - and it calls arch_wb_cache_pmem(). Consequently, we
    don't need the pmem_dax_ops->flush abstraction at all, we can call
    arch_wb_cache_pmem() directly from dax_flush() because dax_dev->ops->flush
    can't ever reach anything different from arch_wb_cache_pmem().

    It should also be pointed out that some uses of persistent memory need
    to flush only a very small amount of data (such as 1 cacheline), and it
    would be overkill to go through the device mapper machinery for a single
    flushed cache line.

    Fix this by removing the pmem_dax_ops->flush abstraction and call
    arch_wb_cache_pmem() directly from dax_flush(). Also, remove the device
    mapper code that forwards the flushes.

    Fixes: abebfbe2f731 ("dm: add ->flush() dax operation support")
    Cc: stable@vger.kernel.org
    Signed-off-by: Mikulas Patocka
    Reviewed-by: Dan Williams
    Signed-off-by: Mike Snitzer

    Mikulas Patocka
     

10 Sep, 2017

1 commit

  • Fix format warnings (seen on i386) in nvdimm/btt.c:

    ../drivers/nvdimm/btt.c: In function ‘btt_map_init’:
    ../drivers/nvdimm/btt.c:430:3: warning: format ‘%lx’ expects argument of type ‘long unsigned int’, but argument 4 has type ‘size_t’ [-Wformat=]
    dev_WARN_ONCE(to_dev(arena), size < 512,
    ^
    ../drivers/nvdimm/btt.c: In function ‘btt_log_init’:
    ../drivers/nvdimm/btt.c:474:3: warning: format ‘%lx’ expects argument of type ‘long unsigned int’, but argument 4 has type ‘size_t’ [-Wformat=]
    dev_WARN_ONCE(to_dev(arena), size < 512,
    ^

    Fixes: 86652d2eb347 ("libnvdimm, btt: clean up warning and error messages")
    Reported-by: Arnd Bergmann
    Reported-by: kbuild test robot
    Cc: Vishal Verma
    Signed-off-by: Randy Dunlap
    Signed-off-by: Dan Williams

    Randy Dunlap
     

08 Sep, 2017

2 commits

  • Convert all WARN* style messages to dev_WARN, and for errors in the IO
    paths, use dev_err_ratelimited. Also remove some BUG_ONs in the IO path
    and replace them with the above - no need to crash the machine in case
    of an unaligned IO.

    Cc: Dan Williams
    Signed-off-by: Vishal Verma
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Dan Williams

    Vishal Verma
     
  • Pull block layer updates from Jens Axboe:
    "This is the first pull request for 4.14, containing most of the code
    changes. It's a quiet series this round, which I think we needed after
    the churn of the last few series. This contains:

    - Fix for a registration race in loop, from Anton Volkov.

    - Overflow complaint fix from Arnd for DAC960.

    - Series of drbd changes from the usual suspects.

    - Conversion of the stec/skd driver to blk-mq. From Bart.

    - A few BFQ improvements/fixes from Paolo.

    - CFQ improvement from Ritesh, allowing idling for group idle.

    - A few fixes found by Dan's smatch, courtesy of Dan.

    - A warning fixup for a race between changing the IO scheduler and
    device removal. From David Jeffery.

    - A few nbd fixes from Josef.

    - Support for cgroup info in blktrace, from Shaohua.

    - Also from Shaohua, new features in the null_blk driver to allow it
    to actually hold data, among other things.

    - Various corner cases and error handling fixes from Weiping Zhang.

    - Improvements to the IO stats tracking for blk-mq from me. Can
    drastically improve performance for fast devices and/or big
    machines.

    - Series from Christoph removing bi_bdev as being needed for IO
    submission, in preparation for nvme multipathing code.

    - Series from Bart, including various cleanups and fixes for switch
    fall through case complaints"

    * 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
    kernfs: checking for IS_ERR() instead of NULL
    drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
    drbd: Fix allyesconfig build, fix recent commit
    drbd: switch from kmalloc() to kmalloc_array()
    drbd: abort drbd_start_resync if there is no connection
    drbd: move global variables to drbd namespace and make some static
    drbd: rename "usermode_helper" to "drbd_usermode_helper"
    drbd: fix race between handshake and admin disconnect/down
    drbd: fix potential deadlock when trying to detach during handshake
    drbd: A single dot should be put into a sequence.
    drbd: fix rmmod cleanup, remove _all_ debugfs entries
    drbd: Use setup_timer() instead of init_timer() to simplify the code.
    drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
    drbd: new disk-option disable-write-same
    drbd: Fix resource role for newly created resources in events2
    drbd: mark symbols static where possible
    drbd: Send P_NEG_ACK upon write error in protocol != C
    drbd: add explicit plugging when submitting batches
    drbd: change list_for_each_safe to while(list_first_entry_or_null)
    drbd: introduce drbd_recv_header_maybe_unplug
    ...

    Linus Torvalds
     

07 Sep, 2017

1 commit

  • The .rw_page in struct block_device_operations is used by the swap
    subsystem to read/write the page contents from/into the corresponding
    swap slot in the swap device. To support the THP (Transparent Huge
    Page) swap optimization, .rw_page is enhanced to read/write a THP
    if possible.

    Link: http://lkml.kernel.org/r/20170724051840.2309-6-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Dan Williams
    Cc: Vishal L Verma
    Cc: Jens Axboe
    Cc: "Kirill A . Shutemov"
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Shaohua Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

05 Sep, 2017

1 commit

  • Delay the check of nd_reserved2 to the actual endpoint (acpi_nfit_ctl)
    that uses it, as a prevention of a potential double-fetch bug.

    While examining the kernel source code, I found a dangerous operation that
    could turn into a double-fetch situation (a race condition bug) where
    the same userspace memory region is fetched twice into the kernel, with sanity
    checks after the first fetch but no checks after the second fetch.

    In the case of _IOC_NR(ioctl_cmd) == ND_CMD_CALL:

    1. The first fetch happens in line 935 copy_from_user(&pkg, p, sizeof(pkg))

    2. subsequently `pkg.nd_reserved2` is asserted to be all zeroes
    (line 984 to 986).

    3. The second fetch happens in line 1022 copy_from_user(buf, p, buf_len)

    4. Given that `p` can be fully controlled in userspace, an attacker can
    race to overwrite the header part of `p`, say,
    `((struct nd_cmd_pkg *)p)->nd_reserved2`, with an arbitrary value
    (say nine 0xFFFFFFFF for `nd_reserved2`) after the first fetch but before the
    second fetch. The changed value will be copied to `buf`.

    5. There are no checks on the second fetch until its use in
    line 1034: nd_cmd_clear_to_send(nvdimm_bus, nvdimm, cmd, buf) and
    line 1038: nd_desc->ndctl(nd_desc, nvdimm, cmd, buf, buf_len, &cmd_rc)
    which means that the assumed relation, that `p->nd_reserved2` is all zeroes, might
    not hold after the second fetch. And once control passes to these functions
    we lose the context to assert the assumed relation.

    6. Based on my manual analysis, `p->nd_reserved2` is not used in function
    `nd_cmd_clear_to_send` and potential implementations of `nd_desc->ndctl`
    so there is no working exploit against it right now. However, this could
    easily turn into an exploitable one if careless developers start to use
    `p->nd_reserved2` later and assume that it is all zeroes.

    Move the validation of the nd_reserved2 field to the ->ndctl()
    implementation where it has a stable buffer to evaluate.
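
    The double-fetch pattern in the numbered steps above can be simulated in a
    few lines of Python. This is only a toy model under stated assumptions:
    a bytearray stands in for userspace memory, `copy_from_user` here is a
    local helper rather than the kernel function, RESERVED_LEN is a made-up
    field size, and the `racer` callback plays the attacker who modifies the
    buffer between the two fetches.

```python
RESERVED_LEN = 4  # hypothetical size of the field that must be all zeroes


def copy_from_user(user_mem: bytearray, length: int) -> bytes:
    """Snapshot `length` bytes of the simulated userspace memory."""
    return bytes(user_mem[:length])


def handle_checked_too_early(user_mem: bytearray, racer=None) -> bytes:
    """Validate the first fetch, then use an unchecked second fetch."""
    header = copy_from_user(user_mem, RESERVED_LEN)   # first fetch
    if any(header):
        raise ValueError("reserved field must be zero")
    if racer is not None:
        racer(user_mem)                               # userspace wins the race
    return copy_from_user(user_mem, len(user_mem))    # second fetch, unchecked


def handle_checked_at_endpoint(user_mem: bytearray, racer=None) -> bytes:
    """Validate the reserved field on the stable copy that is actually used."""
    copy_from_user(user_mem, RESERVED_LEN)            # first fetch (sizing only)
    if racer is not None:
        racer(user_mem)
    buf = copy_from_user(user_mem, len(user_mem))     # stable kernel-side copy
    if any(buf[:RESERVED_LEN]):
        raise ValueError("reserved field must be zero")
    return buf
```

    In the first handler a racing writer can slip a non-zero reserved field
    past the check; in the second the check runs on the same bytes that are
    later consumed, so the race no longer matters.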

    Signed-off-by: Meng Xu
    Signed-off-by: Dan Williams

    Meng Xu
     

01 Sep, 2017

8 commits

  • Dan reports:
    The patch 62232e45f4a2: "libnvdimm: control (ioctl) messages for
    nvdimm_bus and nvdimm devices" from Jun 8, 2015, leads to the
    following static checker warning:

    drivers/nvdimm/bus.c:1018 __nd_ioctl()
    warn: integer overflows 'buf_len'

    From a casual review, this seems like it might be a real bug. On
    the first iteration we load some data into in_env[]. On the second
    iteration we read a user controlled "in_size" from nd_cmd_in_size().
    It can go up to UINT_MAX - 1. A high number means we will fill the
    whole in_env[] buffer. But we potentially keep looping and adding
    more to in_len so now it can be any value.

    It is simple enough to change, but it feels weird that we keep looping
    even though in_env is totally full. Shouldn't we just return an
    error if we don't have space for desc->in_num?

    We keep looping because the size of the total input is allowed to be
    bigger than the 'envelope' which is a subset of the payload that tells
    us how much data to expect. For safety explicitly check that buf_len
    does not overflow which is what the checker flagged.
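
    To make the flagged overflow concrete, here is a small Python sketch that
    emulates 32-bit unsigned accumulation next to a checked variant. The
    function names and the 4MB MAX_BUFLEN cap are assumptions chosen for the
    example, not values taken from the patch.

```python
U32_MAX = 0xFFFFFFFF
MAX_BUFLEN = 4 << 20  # hypothetical upper bound on the total payload size


def total_len_u32(sizes) -> int:
    """Sum field sizes the way a 32-bit unsigned counter would (wraps silently)."""
    total = 0
    for size in sizes:
        total = (total + size) & U32_MAX
    return total


def total_len_checked(sizes, limit: int = MAX_BUFLEN) -> int:
    """Sum field sizes and fail as soon as the running total exceeds limit."""
    total = 0
    for size in sizes:
        total += size  # Python integers do not wrap
        if total > limit:
            raise OverflowError("total buffer length exceeds limit")
    return total


# A user-controlled size close to UINT_MAX followed by a small field
sizes = [U32_MAX - 1, 16]
print(total_len_u32(sizes))  # wrapped value, far smaller than the real total
```

    The wrapped sum looks like a tiny, harmless length, which is why an
    explicit bound on the running total is the safer check.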

    Fixes: 62232e45f4a2: "libnvdimm: control (ioctl) messages for nvdimm_bus..."
    Reported-by: Dan Carpenter
    Signed-off-by: Dan Williams

    Dan Williams
     
  • mmio_flush_range() suffers from a lack of clearly-defined semantics,
    and is somewhat ambiguous to port to other architectures where the
    scope of the writeback implied by "flush" and ordering might matter,
    but MMIO would tend to imply non-cacheable anyway. Per the rationale
    in 67a3e8fe9015 ("nd_blk: change aperture mapping from WC to WB"), the
    only existing use is actually to invalidate clean cache lines for
    ARCH_MEMREMAP_PMEM type mappings *without* writeback. Since the recent
    cleanup of the pmem API, that also now happens to be the exact purpose
    of arch_invalidate_pmem(), which would be a far more well-defined tool
    for the job.

    Rather than risk potentially inconsistent implementations of
    mmio_flush_range() for the sake of one callsite, streamline things by
    removing it entirely and instead move the ARCH_MEMREMAP_PMEM related
    definitions up to the libnvdimm level, so they can be shared by NFIT
    as well. This allows NFIT to be enabled for arm64.

    Signed-off-by: Robin Murphy
    Signed-off-by: Dan Williams

    Robin Murphy
     
  • Clearing errors or badblocks during a BTT write requires sending an ACPI
    DSM, which means potentially sleeping. Since a BTT IO happens in atomic
    context (preemption disabled, spinlocks may be held), we cannot perform
    error clearing in the course of an IO. Due to this, error clearing for
    BTT IOs has hitherto been disabled.

    In this patch we move error clearing out of the atomic section, and thus
    re-enable error clearing with BTTs. When we are about to add a block to
    the free list, we check if it was previously marked as an error, and if
    it was, we add it to the freelist, but also set a flag that says error
    clearing will be required. We then drop the lane (ending the atomic
    context), and send a zero buffer so that the error can be cleared. The
    error flag in the free list is protected by the nd 'lane', and is set
    only by a thread while it holds that lane. When the error is cleared,
    the flag is cleared, but while holding a mutex for that freelist index.

    When writing, we check for two things -
    1/ If the freelist mutex is held or if the error flag is set. If so,
    this is an error block that is being (or about to be) cleared.
    2/ If the block is a known badblock based on nsio->bb

    The second check is required because the BTT map error flag for a map
    entry only gets set when an error LBA is read. If we write to a new
    location that may not have the map error flag set, but still might be in
    the region's badblock list, we can trigger an EIO on the write, which is
    undesirable and completely avoidable.

    Cc: Jeff Moyer
    Cc: Toshi Kani
    Cc: Dan Williams
    Signed-off-by: Vishal Verma
    Signed-off-by: Dan Williams

    Vishal Verma
     
  • With the ACPI NFIT 'DSM' methods, acpi can be called from IO paths.
    Specifically, the DSM to clear media errors is called during writes, so
    that we can provide a writes-fix-errors model.

    However it is easy to imagine a scenario like:
    -> write through the nvdimm driver
    -> acpi allocation
    -> writeback, causes more IO through the nvdimm driver
    -> deadlock

    Fix this by using memalloc_noio_{save,restore}, which sets the GFP_NOIO
    flag for the current scope when issuing commands/IOs that are expected
    to clear errors.

    Cc: Dan Williams
    Cc: Robert Moore
    Cc: Rafael J. Wysocki
    Signed-off-by: Vishal Verma
    Signed-off-by: Dan Williams

    Vishal Verma
     
  • In preparation for the error clearing rework, add sector_size in the
    arena_info struct.

    Signed-off-by: Vishal Verma
    Signed-off-by: Dan Williams

    Vishal Verma
     
  • In btt_map_read, we read the map twice to make sure that the map entry
    didn't change after we added it to the read tracking table. In
    anticipation of expanding the use of the error bit, also make sure that
    the error and zero flags are constant across the two map reads.
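
    A minimal userspace sketch of the read-twice check, assuming a raw
    32-bit map entry in which the block number and both flags live in the
    same word, so comparing raw values covers all three. The names
    read_stable_entry, map_read_fn, scripted_map and scripted_read are
    hypothetical; the scripted reader exists only to simulate an entry
    that changes between reads.

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical callback that fetches the current raw map entry for an LBA. */
    typedef uint32_t (*map_read_fn)(void *ctx, uint32_t lba);

    /*
     * Read the entry, publish it in the tracking slot, then read again.
     * Retry until two consecutive raw reads agree, which means the block
     * number and the flag bits were all unchanged across the publish step.
     */
    static uint32_t read_stable_entry(map_read_fn read_entry, void *ctx,
                                      uint32_t lba, uint32_t *tracked)
    {
        uint32_t ent = read_entry(ctx, lba);

        for (;;) {
            uint32_t again;

            *tracked = ent;              /* record what we intend to read */
            again = read_entry(ctx, lba);
            if (again == ent)
                return ent;
            ent = again;                 /* entry moved, try once more */
        }
    }

    /* Scaffolding for the sketch: returns a scripted sequence of entries. */
    struct scripted_map {
        const uint32_t *vals;
        size_t n;
        size_t pos;
    };

    static uint32_t scripted_read(void *ctx, uint32_t lba)
    {
        struct scripted_map *m = ctx;
        size_t i = m->pos < m->n ? m->pos++ : m->n - 1;

        (void)lba;
        return m->vals[i];
    }
    ```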

    Signed-off-by: Vishal Verma
    Signed-off-by: Dan Williams

    Vishal Verma
     
  • Add helpers for converting a raw map entry to just the block number, or
    either of the 'e' or 'z' flags in preparation for actually using the
    error flag to mark blocks with media errors.
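
    Such helpers might look like the sketch below. It assumes a 32-bit
    map entry with the 'z' flag in bit 31, the 'e' flag in bit 30, and
    the block number in the remaining low bits; the macro and function
    names here are illustrative and not taken from the patch.

    ```c
    #include <stdbool.h>
    #include <stdint.h>

    /* Assumed layout: bit 31 = zero flag, bit 30 = error flag, rest = block. */
    #define MAP_Z_SHIFT  31
    #define MAP_E_SHIFT  30
    #define MAP_LBA_MASK (~((1u << MAP_Z_SHIFT) | (1u << MAP_E_SHIFT)))

    /* Strip both flag bits and return just the block number. */
    static inline uint32_t ent_lba(uint32_t ent)
    {
        return ent & MAP_LBA_MASK;
    }

    /* True if the error flag is set in the raw entry. */
    static inline bool ent_e_flag(uint32_t ent)
    {
        return (ent >> MAP_E_SHIFT) & 1u;
    }

    /* True if the zero flag is set in the raw entry. */
    static inline bool ent_z_flag(uint32_t ent)
    {
        return (ent >> MAP_Z_SHIFT) & 1u;
    }
    ```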

    Signed-off-by: Vishal Verma
    Signed-off-by: Dan Williams

    Vishal Verma
     
  • The IO context conversion for rw_bytes missed a case in the BTT write
    path (btt_map_write) which should've been marked as atomic.

    In reality this should not cause a problem, because map writes are to
    too small for nsio_rw_bytes to attempt error clearing, but it should be
    fixed for posterity.

    Add a might_sleep() in the non-atomic section of nsio_rw_bytes so that
    things like the nfit unit tests, which don't actually sleep, can catch
    bugs like this.

    Cc: Dan Williams
    Signed-off-by: Vishal Verma
    Signed-off-by: Dan Williams

    Vishal Verma
     

30 Aug, 2017

2 commits

  • Check memory allocation failures and return -ENOMEM in such cases, as
    already done a few lines below for another memory allocation.

    This avoids a NULL pointer dereference.

    Cc:
    Fixes: 14e494542636 ("libnvdimm, btt: BTT updates for UEFI 2.7 format")
    Signed-off-by: Christophe JAILLET
    Reviewed-by: Vishal Verma
    Signed-off-by: Dan Williams

    Christophe Jaillet
     
  • The old calculation assumed that the label space was 128k and the label
    size is 128. With v1.2 labels where the label size is 256 this
    calculation will return zero. We are saved by the fact that the
    nsindex_size is always pre-initialized from a previous 128 byte
    assumption and we are lucky that the index sizes turn out the same.

    Fix this going forward in case we start encountering different
    geometries of label areas besides 128k.

    Since the label size can change from one call to the next, drop the
    caching of nsindex_size.

    Signed-off-by: Dan Williams

    Dan Williams
     

24 Aug, 2017

1 commit

  • This way we don't need a block_device structure to submit I/O. The
    block_device has different lifetime rules from the gendisk and
    request_queue and is usually only available when the block device node
    is open. Other callers need to explicitly create one (e.g. the lightnvm
    passthrough code, or the new nvme multipathing code).

    For the actual I/O path all that we need is the gendisk, which exists
    once per block device. But given that the block layer also does
    partition remapping we additionally need a partition index, which is
    used for said remapping in generic_make_request.

    Note that all the block drivers generally want request_queue or
    sometimes the gendisk, so this removes a layer of indirection all
    over the stack.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

16 Aug, 2017

1 commit

  • Now that we properly advertise the supported pte, pmd, and pud sizes,
    restrict the supported alignments that can be set on a namespace. This
    assumes that userspace was not previously relying on the ability to set
    odd alignments. At least ndctl only ever supported setting the namespace
    alignment to 4K, 2M, or 1G.
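
    As a rough userspace illustration of the restriction, the check could
    be written as below. The three sizes are the ones ndctl is said to
    support above and assume 4K base pages with 2M pmd and 1G pud
    mappings; the name align_supported is hypothetical, and the kernel
    would derive the list from the architecture rather than hard-code it.

    ```c
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Assumed supported alignments: 4K (pte), 2M (pmd), 1G (pud). */
    static const uint64_t supported_aligns[] = {
        1ull << 12,
        1ull << 21,
        1ull << 30,
    };

    /* Return true only if the requested alignment is in the supported list. */
    static bool align_supported(uint64_t align)
    {
        size_t n = sizeof(supported_aligns) / sizeof(supported_aligns[0]);

        for (size_t i = 0; i < n; i++)
            if (supported_aligns[i] == align)
                return true;
        return false;
    }
    ```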

    Cc: Oliver O'Halloran
    Signed-off-by: Dan Williams

    Dan Williams
     

12 Aug, 2017

2 commits


10 Aug, 2017

1 commit


05 Aug, 2017

1 commit

  • It is useful to be able to know the position of a DIMM in an
    interleave-set. Consider the case where the order of the DIMMs changes
    causing a namespace to be invalidated because the interleave-set cookie no
    longer matches. If the before and after state of each DIMM position is
    known this state can be debugged by the system owner.

    Signed-off-by: Dan Williams

    Dan Williams
     

26 Jul, 2017

1 commit

  • Currently libnvdimm uses HPAGE_SIZE as the default alignment for DAX and
    PFN devices. HPAGE_SIZE is the default hugetlbfs page size and when
    hugetlbfs is disabled it defaults to PAGE_SIZE. Given DAX has more
    in common with THP than hugetlbfs we should probably be using
    HPAGE_PMD_SIZE, but this is undefined when THP is disabled so let's just
    give it a new name.

    The other usage of HPAGE_SIZE in libnvdimm is when determining how large
    the altmap should be. For the reasons mentioned above it doesn't really
    make sense to use HPAGE_SIZE here either. PMD_SIZE seems to be safe to
    use in generic code and it happens to match the vmemmap allocation block
    on x86 and Power. It's still a hack, but it's a slightly nicer hack.

    Signed-off-by: Oliver O'Halloran
    Signed-off-by: Dan Williams

    Oliver O'Halloran