09 Oct, 2018

23 commits

  • Introduce trace points for tracking chunk states in pblk - this is
    useful for inspecting the entire state of the drive, and really
    handy for both firmware and pblk debugging. A usage sketch follows
    this entry.

    Signed-off-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
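
    A hedged sketch of how the new trace point might be invoked when a
    chunk changes state (every name below is a placeholder, not
    necessarily what the patch uses):

        /* dev, ppa and chunk are placeholders for illustration. */
        trace_pblk_chunk_state(dev->name, &ppa, chunk->state);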
     
  • Remove the debug-only iteration within __pblk_down_page, which
    allows us to reduce the arguments of the functions that call it
    down to pblk and the parallel unit, simplifying the callers' logic
    considerably.

    Also, rename the functions pblk_[down/up]_page to
    pblk_[down/up]_chunk, to communicate that they manage the write
    pointer of the chunk. Note that they also protect the parallel
    unit such that at most one chunk is active per parallel unit.

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • When the user data counter exceeds 32 bits, the write amplification
    calculation does not produce the right value. Fix this by using
    div64_u64 instead of div64. A sketch of the difference follows this
    entry.

    Fixes: 76758390f83e ("lightnvm: pblk: export write amplification counters to sysfs")
    Signed-off-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
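
    A minimal sketch of the difference (illustrative, not the pblk
    source; the counters stand in for pblk's per-class sector counts):

        #include <linux/math64.h>

        static u64 calc_wa(u64 user, u64 gc, u64 pad)
        {
                /* div_u64() only takes a 32-bit divisor, so the old
                 * code truncated 'user' once it exceeded 32 bits;
                 * div64_u64() divides a u64 by a full u64 divisor. */
                return div64_u64(user + gc + pad, user);
        }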
     
  • The prefix when printing ppas in pblk_read_check_rand should be
    "rnd", not "seq", so fix this so we can differentiate between lba
    mismatches in random and sequential reads. Also change the print
    order so that we align with pblk_read_check_seq, printing the read
    lba first.

    Signed-off-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
     
  • The parameters nr_ppas and ppa_list are not used, so remove them.

    Signed-off-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
     
  • Line map bitmap allocations are fairly large and can fail. Allocation
    failures are fatal to pblk, stopping the write pipeline. To avoid this,
    allocate the bitmaps using a mempool instead.

    Mempool allocations never fail if called from a process context,
    and pblk *should* only allocate map bitmaps in process context,
    but keep the failure handling for robustness' sake. A sketch of
    the mempool pattern follows this entry.

    Signed-off-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
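
    A hedged sketch of the mempool pattern (the pool name and the
    minimum of 16 reserved elements are made up):

        mempool_t *pool = mempool_create_kmalloc_pool(16, bitmap_len);

        /* From process context, a GFP_KERNEL mempool allocation
         * sleeps until an element is available instead of failing. */
        unsigned long *bitmap = mempool_alloc(pool, GFP_KERNEL);
        if (!bitmap)            /* kept anyway, for robustness */
                return -ENOMEM;

        /* ... use the bitmap ... */

        mempool_free(bitmap, pool);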
     
  • There are a number of places in the lightnvm subsystem where the
    user iterates over the ppa list. Before iterating, the user must
    know whether it is a single or multiple LBAs, because vector
    commands use either the nvm_rq ->ppa_addr or ->ppa_list field on
    command submission, which leads to open-coding the if/else
    statement.

    Instead of having multiple if/else's, move it into a function that
    can be called by its users; see the sketch after this entry.

    A nice side effect of this cleanup is that this patch fixes up a
    bunch of cases where we don't consider the single-ppa case in pblk.

    Signed-off-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
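
    The helper's shape is roughly the following (a sketch; consult the
    patch for the exact name and placement):

        static inline struct ppa_addr *nvm_rq_to_ppa_list(struct nvm_rq *rqd)
        {
                /* Vector commands carry a list; single-sector
                 * commands embed the address directly. */
                return (rqd->nr_ppas > 1) ? rqd->ppa_list : &rqd->ppa_addr;
        }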
     
  • If a line is recovered from open chunks, the memory structures for
    emeta have not necessarily been properly set on line initialization.
    When closing a line, make sure that emeta is consistent so that the
    line can be recovered on the fast path on the next reboot.

    Also, remove a couple of empty lines at the end of the function.

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • Remove an unused struct ppa_addr variable.

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • Fix a comment typo: Decrese -> Decrease.

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • The current helper to obtain a line from a ppa returns the line id,
    which requires its users to explicitly retrieve the pointer to the
    line from the id.

    Make two different helpers: one returning the line id and one
    returning the line directly; a sketch follows this entry.

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
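
    A sketch of the pair (bodies and field names illustrative):

        static inline int pblk_ppa_to_line_id(struct ppa_addr p)
        {
                return p.a.blk;
        }

        static inline struct pblk_line *pblk_ppa_to_line(struct pblk *pblk,
                                                         struct ppa_addr p)
        {
                return &pblk->lines[pblk_ppa_to_line_id(p)];
        }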
     
  • Implement helpers to go from ppas to a chunk within a line and an
    address within a chunk.

    These helpers will be used in the patches adding trace support to
    pblk, which will be sent in this window.

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • The read completion path uses the put_line variable to decide whether
    the reference on a line should be released. The function name used for
    that is pblk_read_put_rqd_kref, which could lead one to believe that it
    is the rqd that is releasing the reference, while it is the line
    reference that is put.

    Rename it, and also split the function in two to account for either
    rqd or single-ppa callers, and move it to core such that it can
    later be used in the write path as well.

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Reviewed-by: Heiner Litz
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • The I/O size and capacity checks are already done by the block layer.

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • The calculation of pblk->min_write_pgs should only use the optimal
    write size attribute provided by the drive; it does not correlate
    with the memory page size of the system, which can be smaller or
    larger than the LBA size reported.

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • Both NVM_MAX_VLBA and PBLK_MAX_REQ_ADDRS define how many LBAs are
    available in a vector command. pblk uses them interchangeably in
    its implementation. Use NVM_MAX_VLBA as the main one and remove
    the usages of PBLK_MAX_REQ_ADDRS.

    Also remove the power representation that only has one user, and
    instead calculate it at runtime; see the sketch after this entry.

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Signed-off-by: Jens Axboe

    Matias Bjørling
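
    Where a power-of-two exponent is still needed, it can be derived at
    runtime rather than kept as a parallel define, e.g. (illustrative):

        /* NVM_MAX_VLBA is a power of two, so this equals ilog2(). */
        int max_vlba_order = get_count_order(NVM_MAX_VLBA);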
     
  • pblk implements two data paths for recovering line state: one for
    1.2 and another for 2.0. Instead of having pblk implement these,
    combine them in the core to reduce complexity and make them
    available to other targets.

    The new interface will adhere to the 2.0 chunk definition,
    including managing open chunks with an active write pointer. To provide
    this interface, a 1.2 device recovers the state of the chunks by
    manually detecting if a chunk is either free/open/close/offline, and if
    open, scanning the flash pages sequentially to find the next writeable
    page. This process takes on average ~10 seconds on a device with 64 dies,
    1024 blocks and 60us read access time. The process can be parallelized
    but is left out for maintenance simplicity, as the 1.2 specification is
    deprecated. For 2.0 devices, the logic is maintained internally in the
    drive and retrieved through the 2.0 interface.

    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • In pblk, when a new line is allocated, metadata for the previously
    written line is scheduled. This is done through a fixed memory
    region that is shared over time and across contexts by different
    lines and is therefore protected by a lock. Unfortunately, this
    lock does not properly cover all the metadata used for sharing this
    memory region, resulting in a race condition.

    This patch fixes this race condition by protecting this metadata
    properly.

    Fixes: dd2a43437337 ("lightnvm: pblk: sched. metadata on write thread")
    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • A 1.2 device is able to manage the logical-to-physical mapping
    table internally or leave it to the host.

    A target only supports one of those approaches, and therefore must
    check on initialization. Move this check to core to avoid having
    each target implement it.

    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • rqd.error is masked by the return value of pblk_submit_io_sync.
    The rqd structure is then passed on to the end_io function, which
    assumes that any error should lead to a chunk being marked
    offline/bad. Since pblk_submit_io_sync can fail before the
    command is issued to the device, the error value may not correspond
    to a media failure, leading to chunks being prematurely retired.

    Also, the pblk_blk_erase_sync function prints an error message in case
    the erase fails. Since the caller prints an error message by itself,
    remove the error message in this function.

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Reviewed-by: Hans Holmberg
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • Add an nvm_set_flags helper to enable the core to appropriately
    set the command flags for read/write/erase depending on which
    version a drive supports.

    The flags argument can be distilled into the access hint,
    scrambling, and program/erase suspend. Replace the access hint with
    an "is_seq" parameter. The rest of the flags are dependent on the
    command opcode, which is trivial to detect and set; a sketch
    follows this entry.

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Signed-off-by: Jens Axboe

    Matias Bjørling
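
    A hedged sketch of the helper's logic (flag and field names
    approximate the lightnvm 2.0 definitions; treat the details as
    illustrative):

        static u32 nvm_set_flags(struct nvm_geo *geo, struct nvm_rq *rqd)
        {
                u32 flags = 0;

                if (rqd->is_seq)        /* replaces the access hint */
                        flags |= geo->pln_mode >> 1;

                if (rqd->opcode == NVM_OP_PREAD)
                        flags |= (NVM_IO_SCRAMBLE_ENABLE | NVM_IO_SUSPEND);
                else if (rqd->opcode == NVM_OP_PWRITE)
                        flags |= NVM_IO_SCRAMBLE_ENABLE;

                return flags;
        }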
     
  • No need to force the NVMe device driver to be compiled in if the
    lightnvm subsystem is selected. There is also no need for PCI to be
    selected, as it would be selected by the device driver that hooks
    into the subsystem.

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • Lots of controllers may have only one irq vector for completing IO
    requests. Usually the affinity of that one irq vector spans all
    possible CPUs; however, on most architectures there may be only one
    specific CPU handling this interrupt.

    So if all IOs are completed in hardirq context, it is inevitable
    that IO performance degrades because of increased irq latency.

    This patch tries to address this issue by allowing requests to be
    completed in softirq context, like the legacy IO path.

    IOPS improves by ~13% in the following randread test on raid0 over
    virtio-scsi.

    mdadm --create --verbose /dev/md0 --level=0 --chunk=1024 \
        --raid-devices=8 /dev/sdb /dev/sdc /dev/sdd /dev/sde \
        /dev/sdf /dev/sdg /dev/sdh /dev/sdi

    fio --time_based --name=benchmark --runtime=30 --filename=/dev/md0 \
        --nrfiles=1 --ioengine=libaio --iodepth=32 --direct=1 \
        --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=32 \
        --rw=randread --blocksize=4k

    Cc: Dongli Zhang
    Cc: Zach Marano
    Cc: Christoph Hellwig
    Cc: Bart Van Assche
    Cc: Jianchao Wang
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

08 Oct, 2018

15 commits

  • When the nbuckets of a cache device is smaller than 1024, making
    the cache device triggers a BUG_ON in the kernel. Add a condition
    to avoid this; a sketch of the guard follows this entry.

    Reported-by: nitroxis
    Signed-off-by: Dongbo Cao
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Dongbo Cao
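
    The added guard presumably looks something like this (the 1024
    threshold is from the description above; the error string is
    hypothetical):

        err = "Not enough buckets";
        if (sb->nbuckets < 1024)
                goto err;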
     
  • Split the combined '||' statements in the if() check, to make the
    code easier to debug; see the illustration after this entry.

    Signed-off-by: Dongbo Cao
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Dongbo Cao
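
    For example (conditions illustrative), the transformation turns

        if (cond_a || cond_b || cond_c)
                goto err;       /* cannot tell which one fired */

    into

        if (cond_a)
                goto err;
        if (cond_b)
                goto err;
        if (cond_c)
                goto err;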
     
  • A cache_set holds at most MAX_CACHES_PER_SET caches, and the macro
    is used in the definition of struct cache_set:

        struct cache *cache_by_alloc[MAX_CACHES_PER_SET];

    Use MAX_CACHES_PER_SET instead of the magic number 8 in
    __bch_bucket_alloc_set.

    Signed-off-by: Shenghui Wang
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Shenghui Wang
     
  • In extents.c:bch_extent_bad(), the number 96 is used as a parameter
    to call btree_bug_on(). The purpose is to check whether the stale
    gen value exceeds BUCKET_GC_GEN_MAX, so it is better to use the
    macro BUCKET_GC_GEN_MAX to make the code more understandable.

    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Coly Li
     
  • The parameter "struct kobject *kobj" in bch_debug_init() is unused;
    remove it in this patch.

    Signed-off-by: Dongbo Cao
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Dongbo Cao
     
  • struct kmem_cache *bch_passthrough_cache is not used in
    bcache code. Remove it.

    Signed-off-by: Shenghui Wang
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Shenghui Wang
     
  • Recalculate cached_dev_sectors on cached_dev detach, as is done on
    cached_dev attach.

    Update cached_dev_sectors before bcache_device_detach is called,
    as bcache_device_detach will set bcache_device->c to NULL.

    Signed-off-by: Shenghui Wang
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Shenghui Wang
     
  • refill->end records the last key of the writeback. For example, the
    first time around, keys (1,128K) to (1,1024K) are flushed to the
    backing device, but the end key (1,1024K) is not included, because
    of the code below:

        if (bkey_cmp(k, refill->end) >= 0) {
                ret = MAP_DONE;
                goto out;
        }

    The next time we refill the writeback keybuf, we search for keys
    starting from (1,1024K) and get a key bigger than it, so the key
    (1,1024K) is missed.

    This patch modifies the above code so that the end key is included
    in the writeback key buffer; see the sketch after this entry.

    Signed-off-by: Tang Junhui
    Cc: stable@vger.kernel.org
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Tang Junhui
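
    The fix, as described, amounts to making the comparison strict so
    that the end key itself is still refilled (sketch):

        if (bkey_cmp(k, refill->end) > 0) {     /* was: >= 0 */
                ret = MAP_DONE;
                goto out;
        }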
     
  • Somewhere between Michael Lyle's original
    "bcache: PI controller for writeback rate V2" patch dated 07 Sep
    2017 and commit 1d316e6 ("bcache: implement PI controller for
    writeback rate"), the mapping of the writeback_rate_minimum
    attribute was dropped.

    Re-add the missing sysfs writeback_rate_minimum attribute mapping to
    "allow the user to specify a minimum rate at which dirty blocks are
    retired."

    Fixes: 1d316e6 ("bcache: implement PI controller for writeback rate")
    Signed-off-by: Ben Peddell
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Ben Peddell
     
  • When a bcache device is clean, dirty keys may still exist after
    journal replay, so we need to count these dirty keys even when the
    device is in a clean state; otherwise, after writeback, the amount
    of dirty data would be incorrect.

    Signed-off-by: Tang Junhui
    Cc: stable@vger.kernel.org
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Tang Junhui
     
  • The code comment of closure_return_with_destructor() in closure.h
    marks the function name as closure_return(). This patch fixes the
    typo with the correct name, closure_return_with_destructor.

    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Coly Li
     
  • When doing an ioctl on a flash device, ioctl_dev() in super.c is
    called; we should not try to get the cached device there, since a
    flash-only device has no backing device. This patch moves the
    dc->io_disable judgement to cached_dev_ioctl() to make ioctls on
    flash devices work correctly.

    Fixes: 0f0709e6bfc3c ("bcache: stop bcache device when backing device is offline")
    Signed-off-by: Tang Junhui
    Cc: stable@vger.kernel.org
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Tang Junhui
     
  • In cached_dev_cache_miss() and check_should_bypass(), REQ_META is
    used to check whether a bio is a metadata request. But REQ_META is
    used for blktrace; the correct REQ_ flag is REQ_PRIO. This flag
    means the bio should take priority over other bios, and it is
    frequently used to indicate metadata IO in file system code.

    This patch replaces REQ_META with the correct flag, REQ_PRIO.

    CC Adam Manzanares because he explained to me what REQ_PRIO is for.

    Signed-off-by: Coly Li
    Cc: Adam Manzanares
    Signed-off-by: Jens Axboe

    Coly Li
     
  • Missed read IOs are identified by s->cache_missed, not
    s->cache_miss, so use s->cache_missed in trace_bcache_read() to
    identify whether the IO missed or not; a sketch follows this entry.

    Signed-off-by: Tang Junhui
    Cc: stable@vger.kernel.org
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Tang Junhui
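
    The call presumably becomes something like this (arguments
    illustrative):

        /* The hit/miss flag comes from s->cache_missed. */
        trace_bcache_read(s->orig_bio, !s->cache_missed, s->iop.bypass);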
     
  • UUIDs are considered metadata. __uuid_write should add the number
    of buckets (in sectors) written to disk to ca->meta_sectors_written.
    Currently only 1 bucket is used in the uuid write.

    Steps to test:
    1) Create a fresh backing device and a fresh cache device
       separately; the backing device is not attached to any cache set.
    2) cd /sys/block//bcache
       cat metadata_written    (record the output value)
       cat bucket_size
    3) Attach the backing device to the cache set.
    4) cat metadata_written
       Before the change, the output value is almost the same as the
       value in step 2; after the change, it is bigger by about 1
       bucket size.

    Signed-off-by: Shenghui Wang
    Reviewed-by: Tang Junhui
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Shenghui Wang
     

05 Oct, 2018

2 commits

  • When debugging e.g. the SCSI timeout handler, it is important that
    requests that have not yet been started or that have already
    completed are also reported through debugfs.

    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Martin K. Petersen
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Pull NVMe updates from Christoph:

    "A relatively boring merge window:

    - better AEN tracing (Chaitanya)
    - NUMA aware PCIe multipathing (me)
    - RDMA workqueue fixes (Sagi)
    - better bio usage in the target (Sagi)
    - FC rework for target removal (James)
    - better multipath handling of ->queue_rq failures (James)
    - various cleanups (Milan)"

    * 'nvme-4.20' of git://git.infradead.org/nvme:
    nvmet-rdma: use a private workqueue for delete
    nvme: take node locality into account when selecting a path
    nvmet: don't split large I/Os unconditionally
    nvme: call nvme_complete_rq when nvmf_check_ready fails for mpath I/O
    nvme-core: add async event trace helper
    nvme_fc: add 'nvme_discovery' sysfs attribute to fc transport device
    nvmet_fc: support target port removal with nvmet layer
    nvme-fc: fix for a minor typos
    nvmet: remove redundant module prefix
    nvme: fix typo in nvme_identify_ns_descs

    Jens Axboe