09 Oct, 2018

33 commits

  • dma allocations for ppa_list and meta_list in rqd are replicated in
    several places across the pblk codebase. Make helpers to encapsulate
    creation and deletion to simplify the code.

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • The lightnvm subsystem provides helpers to retrieve chunk metadata,
    where the target needs to provide a buffer to store the metadata. An
    implicit assumption is that this buffer is contiguous and can be used to
    retrieve the data from the device. If the device exposes too many
    chunks, then kmalloc might fail, thus failing instance creation.

    This patch removes this assumption by implementing an internal buffer in
    the lightnvm subsystem to retrieve chunk metadata. Targets can then
    use virtual memory allocations. Since this is a target API change, adapt
    pblk accordingly.

    Signed-off-by: Javier González
    Reviewed-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • The driver may sleep while holding a spinlock.

    The function call paths (from bottom to top) in Linux-4.16 are:

    [FUNC] nvm_dev_dma_alloc(GFP_KERNEL)
    drivers/lightnvm/pblk-core.c, 754:
    nvm_dev_dma_alloc in pblk_line_submit_smeta_io
    drivers/lightnvm/pblk-core.c, 1048:
    pblk_line_submit_smeta_io in pblk_line_init_bb
    drivers/lightnvm/pblk-core.c, 1434:
    pblk_line_init_bb in pblk_line_replace_data
    drivers/lightnvm/pblk-recovery.c, 980:
    pblk_line_replace_data in pblk_recov_l2p
    drivers/lightnvm/pblk-recovery.c, 976:
    spin_lock in pblk_recov_l2p

    [FUNC] bio_map_kern(GFP_KERNEL)
    drivers/lightnvm/pblk-core.c, 762:
    bio_map_kern in pblk_line_submit_smeta_io
    drivers/lightnvm/pblk-core.c, 1048:
    pblk_line_submit_smeta_io in pblk_line_init_bb
    drivers/lightnvm/pblk-core.c, 1434:
    pblk_line_init_bb in pblk_line_replace_data
    drivers/lightnvm/pblk-recovery.c, 980:
    pblk_line_replace_data in pblk_recov_l2p
    drivers/lightnvm/pblk-recovery.c, 976:
    spin_lock in pblk_recov_l2p

    To fix these bugs, the call to pblk_line_replace_data()
    is moved out of the spinlock protection.

    These bugs were found by my static analysis tool DSAC.

    Signed-off-by: Jia-Ju Bai
    Reviewed-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Jia-Ju Bai
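The shape of this fix can be illustrated with a small user-space sketch. It does not reproduce the pblk code: `toy_lock`, `toy_unlock`, `alloc_may_sleep`, `struct line` and both `replace_data_*` functions are hypothetical names, and a boolean flag stands in for the spinlock. The sketch only shows the idea of keeping the state update inside the critical section while moving the call that may sleep outside it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

static bool lock_held;        /* toy stand-in for a spinlock */
static int sleep_violations;  /* times a may-sleep call ran under the lock */

static void toy_lock(void)   { lock_held = true; }
static void toy_unlock(void) { lock_held = false; }

/* Stand-in for an allocation that may sleep (e.g. GFP_KERNEL in the kernel). */
static void *alloc_may_sleep(size_t size)
{
    if (lock_held)
        sleep_violations++;   /* record the "sleep in atomic context" pattern */
    return malloc(size);
}

struct line {
    int id;
    void *buf;
};

/* Buggy shape: the may-sleep call happens inside the critical section. */
static void replace_data_locked(struct line *l)
{
    toy_lock();
    l->id++;
    l->buf = alloc_may_sleep(16);
    toy_unlock();
}

/* Fixed shape: only the state update stays under the lock. */
static void replace_data_unlocked(struct line *l)
{
    toy_lock();
    l->id++;
    toy_unlock();
    l->buf = alloc_may_sleep(16);
}
```

Calling `replace_data_unlocked()` leaves `sleep_violations` untouched, while `replace_data_locked()` increments it. That mirrors the class of bug the commit describes, under the assumptions stated above.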
     
  • On 1.2 devices, mapping out the remaining sectors in the
    failed write's block can result in an infinite loop that
    stalls the write pipeline. Fix this.

    Fixes: 6a3abf5beef6 ("lightnvm: pblk: rework write error recovery path")
    Signed-off-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
     
  • Pblk should not create a set of global caches every time
    a pblk instance is created. The global caches should be
    made available only when there is one or more pblk instances.

    This patch bundles the global caches together with a kref
    keeping track of whether the caches should be available or not.

    Also, turn the global pblk lock into a mutex that explicitly
    protects the caches (as this was the only purpose of the lock).

    Signed-off-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
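A minimal user-space sketch of the pattern described in this commit follows: a set of shared caches created when the first instance appears and destroyed when the last one goes away, with a mutex guarding the whole lifecycle. It uses a plain counter and a pthread mutex rather than the kernel's kref and mutex, and `struct global_caches`, `caches_get` and `caches_put` are hypothetical names, not the pblk implementation.

```c
#include <pthread.h>
#include <stdlib.h>

struct global_caches {
    pthread_mutex_t lock;  /* protects refs and cache */
    unsigned int refs;     /* number of live instances */
    void *cache;           /* stand-in for the real set of caches */
};

static struct global_caches caches = { .lock = PTHREAD_MUTEX_INITIALIZER };

/* Called on instance creation: create the caches only for the first user. */
static int caches_get(void)
{
    int ret = 0;

    pthread_mutex_lock(&caches.lock);
    if (caches.refs == 0) {
        caches.cache = malloc(64);
        if (!caches.cache)
            ret = -1;
    }
    if (!ret)
        caches.refs++;
    pthread_mutex_unlock(&caches.lock);

    return ret;
}

/* Called on instance teardown: destroy the caches with the last user. */
static void caches_put(void)
{
    pthread_mutex_lock(&caches.lock);
    if (caches.refs > 0 && --caches.refs == 0) {
        free(caches.cache);
        caches.cache = NULL;
    }
    pthread_mutex_unlock(&caches.lock);
}
```

In this sketch, a second `caches_get()` reuses the existing allocation, and the memory is only released once every `caches_get()` has been paired with a `caches_put()`.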
     
  • If a line is padded, calculate the pad distance directly in the helper
    being used for this purpose.

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • Continuing the effort of moving 1.2 and 2.0 specific code to core, move
    64_to_32 and 32_to_64 ppa helpers from pblk to core.

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • Trace state of chunk resets.

    Signed-off-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
     
  • Add trace events for tracking pblk state changes.

    Signed-off-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
     
  • Add trace events for logging line state changes.

    Signed-off-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
     
  • Introduce trace points for tracking chunk states in pblk. This is
    useful for inspecting the entire state of the drive, and really handy
    for both firmware and pblk debugging.

    Signed-off-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
     
  • Remove the debug-only iteration within __pblk_down_page, which
    then allows us to reduce the number of arguments down to pblk and
    the parallel unit in the functions that call it, simplifying the
    callers' logic considerably.

    Also, rename the functions pblk_[down/up]_page to
    pblk_[down/up]_chunk, to communicate that it manages the write
    pointer of the chunk. Note that it also protects the parallel unit
    such that at most one chunk is active per parallel unit.

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • When the user data counter exceeds 32 bits, the write amplification
    calculation does not provide the right value. Fix this by using
    div64_u64 instead of div64.

    Fixes: 76758390f83e ("lightnvm: pblk: export write amplification counters to sysfs")
    Signed-off-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
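The failure mode can be sketched in user space as follows. The two `div_by_*` helpers are hypothetical models of a division helper that takes a 32-bit divisor and one that takes a 64-bit divisor. The write amplification formula `(user + gc + pad) / user`, scaled by 1000, is an assumption made for illustration and is not taken from the commit text.

```c
#include <stdint.h>

/* Hypothetical model of a 64-by-32 division helper: the divisor is
 * narrowed to 32 bits, so any bits above bit 31 are lost. */
static uint64_t div_by_u32(uint64_t dividend, uint32_t divisor)
{
    return dividend / divisor;
}

/* Hypothetical model of a full 64-by-64 division helper. */
static uint64_t div_by_u64(uint64_t dividend, uint64_t divisor)
{
    return dividend / divisor;
}

/* Assumed write amplification formula, scaled by 1000.
 * Narrow variant: wrong once user no longer fits in 32 bits. */
static uint64_t wa_x1000_narrow(uint64_t user, uint64_t gc, uint64_t pad)
{
    return div_by_u32((user + gc + pad) * 1000, (uint32_t)user);
}

/* Wide variant: keeps the full 64-bit divisor. */
static uint64_t wa_x1000_wide(uint64_t user, uint64_t gc, uint64_t pad)
{
    return div_by_u64((user + gc + pad) * 1000, user);
}
```

With `user = gc = 2^32 + 4` and `pad = 0`, the wide variant yields 2000 (a write amplification of 2.0), while the narrow variant divides by 4 and returns a meaningless value. For counters below 2^32 the two variants agree.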
     
  • The prefix when printing ppas in pblk_read_check_rand should be "rnd",
    not "seq". Fix this so we can differentiate between lba mismatches
    in random and sequential reads. Also change the print order so
    we align with pblk_read_check_seq, printing the read lba first.
    Signed-off-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
     
  • The parameters nr_ppas and ppa_list are not used, so remove them.

    Signed-off-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
     
  • Line map bitmap allocations are fairly large and can fail. Allocation
    failures are fatal to pblk, stopping the write pipeline. To avoid this,
    allocate the bitmaps using a mempool instead.

    Mempool allocations never fail if called from a process context,
    and pblk *should* only allocate map bitmaps in process context,
    but keep the failure handling for robustness' sake.

    Signed-off-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
     
  • There are a number of places in the lightnvm subsystem where the user
    iterates over the ppa list. Before iterating, the user must know if it
    is a single or multiple LBAs due to vector commands using either the
    nvm_rq ->ppa_addr or ->ppa_list fields on command submission, which
    leads to open-coding the if/else statement.

    Instead of having multiple if/else's, move it into a function that can
    be called by its users.

    A nice side effect of this cleanup is that this patch fixes up a
    bunch of cases where we don't consider the single-ppa case in pblk.

    Signed-off-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
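The cleanup described in this commit can be sketched as a single accessor that hides the single-versus-multiple distinction from callers. The types below are simplified stand-ins: only the `ppa_addr`, `ppa_list` and `nr_ppas` field names follow the commit text and the kernel sources, while the struct layouts, `rq_to_ppa_list` and `sum_ppas` are hypothetical.

```c
#include <stdint.h>

/* Simplified stand-ins for the kernel types. */
struct ppa_addr {
    uint64_t ppa;
};

struct nvm_rq {
    unsigned int nr_ppas;
    struct ppa_addr ppa_addr;   /* used when nr_ppas == 1 */
    struct ppa_addr *ppa_list;  /* used when nr_ppas > 1 */
};

/* One place for the if/else: always hand back something iterable. */
static struct ppa_addr *rq_to_ppa_list(struct nvm_rq *rqd)
{
    return (rqd->nr_ppas > 1) ? rqd->ppa_list : &rqd->ppa_addr;
}

/* Example caller: iterates without caring which field is in use. */
static uint64_t sum_ppas(struct nvm_rq *rqd)
{
    struct ppa_addr *list = rq_to_ppa_list(rqd);
    uint64_t sum = 0;

    for (unsigned int i = 0; i < rqd->nr_ppas; i++)
        sum += list[i].ppa;

    return sum;
}
```

Because every caller goes through the accessor, a loop written for the vector case also handles a request with a single ppa. This is the side effect the commit message points out.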
     
  • If a line is recovered from open chunks, the memory structures for
    emeta have not necessarily been properly set on line initialization.
    When closing a line, make sure that emeta is consistent so that the line
    can be recovered on the fast path on next reboot.

    Also, remove a couple of empty lines at the end of the function.

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • Removed unused struct ppa_addr variable.

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • Fix comment typo Decrese -> Decrease

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • The current helper to obtain a line from a ppa returns the line id,
    which requires its users to explicitly retrieve the pointer to the line
    with the id.

    Make 2 different helpers: one returning the line id and one returning
    the line directly.

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • Implement helpers to go from ppas to a chunk within a line and an
    address within a chunk.

    These helpers will be used in the patches adding trace support in pblk,
    which will be sent in this window.

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • The read completion path uses the put_line variable to decide whether
    the reference on a line should be released. The function name used for
    that is pblk_read_put_rqd_kref, which could lead one to believe that it
    is the rqd that is releasing the reference, while it is the line
    reference that is put.

    Rename and also split the function in two to account for either rqd or
    single ppa callers and move it to core, such that it later can be used
    in the write path as well.

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Reviewed-by: Heiner Litz
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • The I/O size and capacity checks are already done by the block layer.

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • The calculation of pblk->min_write_pgs should only use the optimal
    write size attribute provided by the drive. It does not correlate to
    the memory page size of the system, which can be smaller or larger
    than the LBA size reported.

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • Both NVM_MAX_VLBA and PBLK_MAX_REQ_ADDRS define how many LBAs
    are available in a vector command. pblk uses them interchangeably
    in its implementation. Use NVM_MAX_VLBA as the main one and remove
    usages of PBLK_MAX_REQ_ADDRS.

    Also remove the power representation that only has one user, and
    instead calculate it at runtime.

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Signed-off-by: Jens Axboe

    Matias Bjørling
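Deriving the power-of-two exponent at runtime, instead of keeping a second constant in sync by hand, can be sketched as below. `EXAMPLE_MAX_VLBA` and its value of 64 are assumptions for illustration, and `ilog2_u32` is a hypothetical user-space helper rather than the kernel's own log2 facility.

```c
#include <stdint.h>

/* Assumed value for illustration; not taken from the commit text. */
#define EXAMPLE_MAX_VLBA 64u

/* Integer log2 by repeated shifting. Returns 0 for v == 0 and v == 1;
 * callers are expected to pass a non-zero power of two. */
static unsigned int ilog2_u32(uint32_t v)
{
    unsigned int log = 0;

    while (v >>= 1)
        log++;

    return log;
}
```

With a single source of truth for the limit, the exponent can no longer drift out of sync with it if the limit is ever changed.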
     
  • pblk implements two data paths for recovering line state: one for 1.2
    and another for 2.0. Instead of having pblk implement these, combine
    them in the core to reduce complexity and make them available to other
    targets.

    The new interface will adhere to the 2.0 chunk definition,
    including managing open chunks with an active write pointer. To provide
    this interface, a 1.2 device recovers the state of the chunks by
    manually detecting if a chunk is either free/open/closed/offline, and if
    open, scanning the flash pages sequentially to find the next writeable
    page. This process takes on average ~10 seconds on a device with 64 dies,
    1024 blocks and 60us read access time. The process can be parallelized
    but is left out for maintenance simplicity, as the 1.2 specification is
    deprecated. For 2.0 devices, the logic is maintained internally in the
    drive and retrieved through the 2.0 interface.

    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • In pblk, when a new line is allocated, metadata for the previously
    written line is scheduled. This is done through a fixed memory region
    that is shared through time and contexts across different lines and
    therefore protected by a lock. Unfortunately, this lock does not properly
    cover all the metadata used for sharing this memory region,
    resulting in a race condition.

    This patch fixes this race condition by protecting this metadata
    properly.

    Fixes: dd2a43437337 ("lightnvm: pblk: sched. metadata on write thread")
    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • A 1.2 device is able to manage the logical to physical mapping
    table internally or leave it to the host.

    A target only supports one of those approaches, and therefore must
    check on initialization. Move this check to core to avoid having each
    target implement the check.

    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • rqd.error is masked by the return value of pblk_submit_io_sync.
    The rqd structure is then passed on to the end_io function, which
    assumes that any error should lead to a chunk being marked
    offline/bad. Since pblk_submit_io_sync can fail before the
    command is issued to the device, the error value may not correspond
    to a media failure, leading to chunks being prematurely retired.

    Also, the pblk_blk_erase_sync function prints an error message in case
    the erase fails. Since the caller prints an error message by itself,
    remove the error message in this function.

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Reviewed-by: Hans Holmberg
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • Add nvm_set_flags helper to enable core to appropriately
    set the command flags for read/write/erase depending on which version
    a drive supports.

    The flags arguments can be distilled into the access hint,
    scrambling, and program/erase suspend. Replace the access hint with
    an "is_seq" parameter. The rest of the flags are dependent on the
    command opcode, which is trivial to detect and set.

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • There is no need to force the NVMe device driver to be compiled in if
    the lightnvm subsystem is selected. There is also no need for PCI to be
    selected, as it would be selected by the device driver that hooks into
    the subsystem.

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • Many controllers may have only one irq vector for completing IO
    requests. The affinity of this single irq vector usually covers all
    possible CPUs; however, on most architectures, only one specific CPU
    may handle the interrupt.

    So if all IOs are completed in hardirq context, IO performance
    inevitably degrades because of increased irq latency.

    This patch tries to address this issue by allowing requests to be
    completed in softirq context, like the legacy IO path.

    An IOPS improvement of ~13% or more is observed in the following
    randread test on raid0 over virtio-scsi.

    mdadm --create --verbose /dev/md0 --level=0 --chunk=1024 --raid-devices=8 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi

    fio --time_based --name=benchmark --runtime=30 --filename=/dev/md0 --nrfiles=1 --ioengine=libaio --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=32 --rw=randread --blocksize=4k

    Cc: Dongli Zhang
    Cc: Zach Marano
    Cc: Christoph Hellwig
    Cc: Bart Van Assche
    Cc: Jianchao Wang
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

08 Oct, 2018

7 commits