07 May, 2019

1 commit

  • The sector bits in the erase command may be uninitialized are
    uninitialized, causing the erase LBA to be unaligned to the chunk size.

    This is unexpected situation, since erase shall always be chunk
    aligned based on OCSSD the 2.0 specification.

    Signed-off-by: Igor Konopko
    Reviewed-by: Javier González
    Reviewed-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Igor Konopko
     

11 Feb, 2019

1 commit

  • This patch fixes a race condition where a write is mapped to the last
    sectors of a line. The write is synced to the device but the L2P is not
    updated yet. When the line is garbage collected before the L2P update
    is performed, the sectors are ignored by the GC logic and the line is
    freed before all sectors are moved. When the L2P is finally updated, it
    contains a mapping to a freed line, subsequent reads of the
    corresponding LBAs fail.

    This patch introduces a per line counter specifying the number of
    sectors that are synced to the device but have not been updated in the
    L2P. Lines with a counter of greater than zero will not be selected
    for GC.

    Signed-off-by: Heiner Litz
    Reviewed-by: Hans Holmberg
    Reviewed-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Heiner Litz
     

12 Dec, 2018

3 commits

  • pblk performs recovery of open lines by storing the LBA in the per LBA
    metadata field. Recovery therefore only works for drives that has this
    field.

    This patch adds support for packed metadata, which store l2p mapping
    for open lines in last sector of every write unit and enables drives
    without per IO metadata to recover open lines.

    After this patch, drives with OOB size
    Signed-off-by: Igor Konopko
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Igor Konopko
     
  • pblk currently assumes that size of OOB metadata on drive is always
    equal to size of pblk_sec_meta struct. This commit add helpers which will
    allow to handle different sizes of OOB metadata on drive in the future.

    After this patch only OOB metadata equal to 16 bytes is supported.

    Reviewed-by: Javier González
    Signed-off-by: Igor Konopko
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Igor Konopko
     
  • If mapping fails (i.e. when running out of lines), handle the error
    and stop writing.

    Signed-off-by: Hans Holmberg
    Reviewed-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
     

09 Oct, 2018

3 commits

  • Add GLP-2.0 SPDX license tag to all pblk files

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • Remove the debug only iteration within __pblk_down_page, which
    then allows us to reduce the number of arguments down to pblk and
    the parallel unit from the functions that calls it. Simplifying the
    callers logic considerably.

    Also, rename the functions pblk_[down/up]_page to
    pblk_[down/up]_chunk, to communicate that it manages the write
    pointer of the chunk. Note that it also protects the parallel unit
    such that at most one chunk is active per parallel unit.

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • There is a number of places in the lightnvm subsystem where the user
    iterates over the ppa list. Before iterating, the user must know if it
    is a single or multiple LBAs due to vector commands using either the
    nvm_rq ->ppa_addr or ->ppa_list fields on command submission, which
    leads to open-coding the if/else statement.

    Instead of having multiple if/else's, move it into a function that can
    be called by its users.

    A nice side effect of this cleanup is that this patch fixes up a
    bunch of cases where we don't consider the single-ppa case in pblk.

    Signed-off-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
     

01 Jun, 2018

1 commit


30 Mar, 2018

2 commits

  • Add support for 2.0 address format. Also, align address bits for 1.2 and
    2.0 to be able to operate on channel and luns without requiring a format
    conversion. Use a generic address format for this purpose.

    Also, convert the generic operations to the generic format in pblk.

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • In a SSD, write amplification, WA, is defined as the average
    number of page writes per user page write. Write amplification
    negatively affects write performance and decreases the lifetime
    of the disk, so it's a useful metric to add to sysfs.

    In plkb's case, the number of writes per user sector is the sum of:

    (1) number of user writes
    (2) number of sectors written by the garbage collector
    (3) number of sectors padded (i.e. due to syncs)

    This patch adds persistent counters for 1-3 and two sysfs attributes
    to export these along with WA calculated with five decimals:

    write_amp_mileage: the accumulated write amplification stats
    for the lifetime of the pblk instance

    write_amp_trip: resetable stats to facilitate delta measurements,
    values reset at creation and if 0 is written
    to the attribute.

    64-bit counters are used as a 32 bit counter would wrap around
    already after about 17 TB worth of user data. It will take a
    long long time before the 64 bit sector counters wrap around.

    The counters are stored after the bad block bitmap in the first
    emeta sector of each written line. There is plenty of space in the
    first emeta sector, so we don't need to bump the major version of
    the line data format.

    Signed-off-by: Hans Holmberg
    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
     

05 Jan, 2018

1 commit

  • Now that rrpc has been removed, the only users of the ppa helpers
    is pblk. However, pblk already defines similar functions.

    Switch pblk to use the internal ones, and remove the generic ppa
    helpers.

    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Matias Bjørling
     

13 Oct, 2017

2 commits

  • During garbage collect, lbas being written can end up
    being invalidated. Make sure that this is reflected in
    the valid lba count.

    Signed-off-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
     
  • Metadata I/Os are scheduled to minimize their impact on user data I/Os.
    When there are enough LUNs instantiated (i.e., enough bandwidth), it is
    easy to interleave metadata and data one after the other so that
    metadata I/Os are the ones being blocked and not vice-versa.

    We do this by calculating the distance between the I/Os in terms of the
    LUNs that are not in used, and selecting a free LUN that satisfies a
    the simple heuristic that metadata is scheduled behind. The per-LUN
    semaphores guarantee consistency. This works fine on >1 LUN
    configuration. However, when a single LUN is instantiated, this design
    leads to a deadlock, where metadata waits to be scheduled on a free LUN.

    This patch implements the 1 LUN case by simply scheduling the metadada
    I/O after the data I/O. In the process, we refactor the way a line is
    replaced to ensure that metadata writes are submitted after data writes
    in order to guarantee block sequentiality. Note that, since there is
    only one LUN, both I/Os will block each other by design. However, such
    configuration only pursues tight read latencies, not write bandwidth.

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     

01 Jul, 2017

1 commit


27 Jun, 2017

5 commits

  • Due to user writes being decoupled from media writes because of the need
    of an intermediate write buffer, irrecoverable media write errors lead
    to pblk stalling; user writes fill up the buffer and end up in an
    infinite retry loop.

    In order to let user writes fail gracefully, it is necessary for pblk to
    keep track of its own internal state and prevent further writes from
    being placed into the write buffer.

    This patch implements a state machine to keep track of internal errors
    and, in case of failure, fail further user writes in an standard way.
    Depending on the type of error, pblk will do its best to persist
    buffered writes (which are already acknowledged) and close down on a
    graceful manner. This way, data might be recovered by re-instantiating
    pblk. Such state machine paves out the way for a state-based FTL log.

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • After refactoring the metadata path, the backpointer controlling
    synced I/Os in a line becomes unnecessary; metadata is scheduled
    on the write thread, thus we know when the end of the line is reached
    and act on it directly.

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • At the moment, line metadata is persisted on a separate work queue, that
    is kicked each time that a line is closed. The assumption when designing
    this was that freeing the write thread from creating a new write request
    was better than the potential impact of writes colliding on the media
    (user I/O and metadata I/O). Experimentation has proven that this
    assumption is wrong; collision can cause up to 25% of bandwidth and
    introduce long tail latencies on the write thread, which potentially
    cause user write threads to spend more time spinning to get a free entry
    on the write buffer.

    This patch moves the metadata logic to the write thread. When a line is
    closed, remaining metadata is written in memory and is placed on a
    metadata queue. The write thread then takes the metadata corresponding
    to the previous line, creates the write request and schedules it to
    minimize collisions on the media. Using this approach, we see that we
    can saturate the media's bandwidth, which helps reducing both write
    latencies and the spinning time for user writer threads.

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • Erase I/Os are scheduled with the following goals in mind: (i) minimize
    LUNs collisions with write I/Os, and (ii) even out the price of erasing
    on every write, instead of putting all the burden on when garbage
    collection runs. This works well on the current design, but is specific
    to the default mapping algorithm.

    This patch generalizes the erase path so that other mapping algorithms
    can select an arbitrary line to be erased instead. It also gets rid of
    the erase semaphore since it creates jittering for user writes.

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • Spare a double calculation on the fast write path.

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     

24 Apr, 2017

1 commit

  • When block erases fail, these blocks are marked bad. The number of valid
    blocks in the line was not updated, which could cause an infinite loop
    on the erase path.

    Fix this atomic counter and, in order to avoid taking an irq lock on the
    interrupt context, make the erase counters atomic too.

    Also, in the case that a significant number of blocks become bad in a
    line, the result is the double shared metadata buffer (emeta) to stop
    the pipeline until all metadata is flushed to the media. Increase the
    number of metadata lines from 2 to 4 to avoid this case.

    Fixes: a4bd217b4326 "lightnvm: physical block device (pblk) target"

    Signed-off-by: Javier González
    Reviewed-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     

17 Apr, 2017

1 commit

  • This patch introduces pblk, a host-side translation layer for
    Open-Channel SSDs to expose them like block devices. The translation
    layer allows data placement decisions, and I/O scheduling to be
    managed by the host, enabling users to optimize the SSD for their
    specific workloads.

    An open-channel SSD has a set of LUNs (parallel units) and a
    collection of blocks. Each block can be read in any order, but
    writes must be sequential. Writes may also fail, and if a block
    requires it, must also be reset before new writes can be
    applied.

    To manage the constraints, pblk maintains a logical to
    physical address (L2P) table, write cache, garbage
    collection logic, recovery scheme, and logic to rate-limit
    user I/Os versus garbage collection I/Os.

    The L2P table is fully-associative and manages sectors at a
    4KB granularity. Pblk stores the L2P table in two places, in
    the out-of-band area of the media and on the last page of a
    line. In the cause of a power failure, pblk will perform a
    scan to recover the L2P table.

    The user data is organized into lines. A line is data
    striped across blocks and LUNs. The lines enable the host to
    reduce the amount of metadata to maintain besides the user
    data and makes it easier to implement RAID or erasure coding
    in the future.

    pblk implements multi-tenant support and can be instantiated
    multiple times on the same drive. Each instance owns a
    portion of the SSD - both regarding I/O bandwidth and
    capacity - providing I/O isolation for each case.

    Finally, pblk also exposes a sysfs interface that allows
    user-space to peek into the internals of pblk. The interface
    is available at /dev/block/*/pblk/ where * is the block
    device name exposed.

    This work also contains contributions from:
    Matias Bjørling
    Simon A. F. Lund
    Young Tack Jin
    Huaicheng Li

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González