09 Oct, 2018

22 commits

  • Introduce trace points for tracking chunk states in pblk. This is
    useful for inspecting the entire state of the drive, and really
    handy for both firmware and pblk debugging.
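
    A hedged sketch of what such a trace point can look like (the event
    and field names here are illustrative, not necessarily the ones the
    patch adds):

        TRACE_EVENT(pblk_chunk_state,
                TP_PROTO(const char *name, u64 ppa, int state),
                TP_ARGS(name, ppa, state),
                TP_STRUCT__entry(
                        __string(name, name)
                        __field(u64, ppa)
                        __field(int, state)
                ),
                TP_fast_assign(
                        __assign_str(name, name);
                        __entry->ppa = ppa;
                        __entry->state = state;
                ),
                TP_printk("dev=%s ppa=0x%llx state=%d",
                          __get_str(name), __entry->ppa, __entry->state)
        );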

    Signed-off-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
     
  • Remove the debug-only iteration within __pblk_down_page, which
    allows us to reduce the arguments of the functions that call it
    down to pblk and the parallel unit, simplifying the callers' logic
    considerably.

    Also, rename the functions pblk_[down/up]_page to
    pblk_[down/up]_chunk, to communicate that it manages the write
    pointer of the chunk. Note that it also protects the parallel unit
    such that at most one chunk is active per parallel unit.

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • When the user data counter exceeds 32 bits, the write amplification
    calculation does not produce the right value. Fix this by using
    div64_u64 instead of div64.
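
    A minimal sketch of the 64-bit-safe calculation (counter names are
    illustrative):

        #include <linux/math64.h>

        /* write amplification = (user + gc + pad) / user; a 32-bit
         * divisor truncates once the user counter crosses 32 bits,
         * while div64_u64() divides two full u64 values
         */
        static u64 calc_wa(u64 user, u64 gc, u64 pad)
        {
                if (!user)
                        return 0;

                return div64_u64(user + gc + pad, user);
        }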

    Fixes: 76758390f83e ("lightnvm: pblk: export write amplification counters to sysfs")
    Signed-off-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
     
  • The prefix when printing ppas in pblk_read_check_rand should be
    "rnd", not "seq". Fix this so that we can differentiate between lba
    mismatches in random and sequential reads. Also change the print
    order to align with pblk_read_check_seq, printing the read lba
    first.

    Signed-off-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
     
  • The parameters nr_ppas and ppa_list are not used, so remove them.

    Signed-off-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
     
  • Line map bitmap allocations are fairly large and can fail. Allocation
    failures are fatal to pblk, stopping the write pipeline. To avoid this,
    allocate the bitmaps using a mempool instead.

    Mempool allocations never fail if called from process context, and
    pblk *should* only allocate map bitmaps in process context, but
    keep the failure handling for robustness' sake.
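
    A minimal sketch of the mempool approach (the pool size, the
    bitmap_size variable and the names are illustrative):

        #include <linux/mempool.h>

        /* pre-reserve a few bitmap-sized buffers at init time */
        mempool_t *map_bitmap_pool =
                mempool_create_kmalloc_pool(16, bitmap_size);

        /* with GFP_KERNEL, mempool_alloc() sleeps and falls back to
         * the reserved elements rather than failing
         */
        unsigned long *map_bitmap = mempool_alloc(map_bitmap_pool,
                                                  GFP_KERNEL);
        if (!map_bitmap)                /* kept for robustness' sake */
                return -ENOMEM;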

    Signed-off-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
     
  • There are a number of places in the lightnvm subsystem where the
    user iterates over the ppa list. Before iterating, the user must
    know whether the request holds a single or multiple LBAs, because
    vector commands use either the nvm_rq ->ppa_addr or ->ppa_list
    field on command submission, which leads to open-coded if/else
    statements.

    Instead of having multiple if/else's, move the check into a
    function that can be called by its users, as sketched below.

    A nice side effect of this cleanup is that this patch fixes up a
    bunch of cases where the single-ppa case in pblk was not
    considered.
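
    A minimal sketch of such a helper, given the two submission fields
    described above:

        static inline struct ppa_addr *nvm_rq_to_ppa_list(struct nvm_rq *rqd)
        {
                /* vector commands carry an array; single-LBA commands
                 * embed the address directly in the request
                 */
                return (rqd->nr_ppas > 1) ? rqd->ppa_list : &rqd->ppa_addr;
        }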

    Signed-off-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
     
  • If a line is recovered from open chunks, the memory structures for
    emeta have not necessarily been properly set on line initialization.
    When closing a line, make sure that emeta is consistent so that the line
    can be recovered on the fast path on next reboot.

    Also, remove a couple of empty lines at the end of the function.

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • Remove an unused struct ppa_addr variable.

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • Fix a comment typo: Decrese -> Decrease.

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • The current helper to obtain a line from a ppa returns the line id,
    which requires its users to explicitly retrieve the pointer to the
    line from that id.

    Add two helpers: one returning the line id and one returning the
    line directly.
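
    A minimal sketch of the pair, assuming lines are indexed by the
    block id within the ppa (names are illustrative):

        static inline int pblk_ppa_to_line_id(struct ppa_addr p)
        {
                return p.a.blk;
        }

        static inline struct pblk_line *
        pblk_ppa_to_line(struct pblk *pblk, struct ppa_addr p)
        {
                return &pblk->lines[pblk_ppa_to_line_id(p)];
        }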

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • Implement helpers to go from ppas to a chunk within a line and an
    address within a chunk.

    These helpers will be used in the patches adding trace support in
    pblk, which will be sent in this window.
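
    A hedged sketch of one such helper, assuming each line keeps its
    per-chunk metadata in a chks[] array indexed by parallel-unit
    position:

        static inline struct nvm_chk_meta *
        pblk_dev_ppa_to_chunk(struct pblk *pblk, struct ppa_addr p)
        {
                struct pblk_line *line = pblk_ppa_to_line(pblk, p);
                int pos = pblk_ppa_to_pos(&pblk->dev->geo, p);

                return &line->chks[pos];
        }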

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • The read completion path uses the put_line variable to decide
    whether the reference on a line should be released. The function
    name used for that is pblk_read_put_rqd_kref, which could lead one
    to believe that the rqd reference is being released, while it is
    actually the line reference that is put.

    Rename the function and split it in two, to account for either rqd
    or single-ppa callers, and move it to core so that it can later be
    used in the write path as well.
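
    A hedged sketch of the split (the function names are illustrative,
    and the rqd variant builds on the ppa-list helper introduced
    earlier in this series):

        void pblk_ppa_to_line_put(struct pblk *pblk, struct ppa_addr ppa)
        {
                struct pblk_line *line = pblk_ppa_to_line(pblk, ppa);

                kref_put(&line->ref, pblk_line_put_wq);
        }

        void pblk_rq_to_line_put(struct pblk *pblk, struct nvm_rq *rqd)
        {
                struct ppa_addr *ppa_list = nvm_rq_to_ppa_list(rqd);
                int i;

                for (i = 0; i < rqd->nr_ppas; i++)
                        pblk_ppa_to_line_put(pblk, ppa_list[i]);
        }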

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Reviewed-by: Heiner Litz
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • The I/O size and capacity checks are already done by the block layer.

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • The calculation of pblk->min_write_pgs should only use the optimal
    write size attribute provided by the drive; it does not correlate
    to the memory page size of the system, which can be smaller or
    larger than the LBA size reported.

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • Both NVM_MAX_VLBA and PBLK_MAX_REQ_ADDRS define how many LBAs are
    available in a vector command, and pblk uses them interchangeably
    in its implementation. Use NVM_MAX_VLBA as the main one and remove
    the usages of PBLK_MAX_REQ_ADDRS.

    Also remove the power-of-two representation, which only has one
    user, and instead calculate it at runtime.

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • pblk implements two data paths for recovering line state: one for
    1.2 and another for 2.0. Instead of having pblk implement both,
    combine them in the core to reduce complexity and make the
    functionality available to other targets.

    The new interface adheres to the 2.0 chunk definition, including
    managing open chunks with an active write pointer. To provide this
    interface, a 1.2 device recovers the state of its chunks by
    manually detecting whether a chunk is free/open/closed/offline
    and, if open, scanning the flash pages sequentially to find the
    next writable page. This process takes on average ~10 seconds on a
    device with 64 dies, 1024 blocks and 60us read access time. The
    process could be parallelized, but that is left out for
    maintenance simplicity, as the 1.2 specification is deprecated.
    For 2.0 devices, the logic is maintained internally in the drive
    and retrieved through the 2.0 interface.
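
    For reference, a sketch of the 2.0 chunk metadata descriptor that
    the interface adheres to (field comments are paraphrased from the
    OCSSD 2.0 definition):

        struct nvm_chk_meta {
                u8      state;          /* free/open/closed/offline */
                u8      type;           /* write mode of the chunk */
                u8      wi;             /* wear-level index */
                u8      rsvd[5];
                u64     slba;           /* starting LBA of the chunk */
                u64     cnlb;           /* number of LBAs in the chunk */
                u64     wp;             /* current write pointer */
        };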

    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • In pblk, when a new line is allocated, metadata for the previously
    written line is scheduled. This is done through a fixed memory
    region that is shared over time and across contexts between
    different lines, and is therefore protected by a lock.
    Unfortunately, this lock does not properly cover all the metadata
    used for sharing the memory region, resulting in a race condition.

    Fix this race condition by protecting the metadata properly.

    Fixes: dd2a43437337 ("lightnvm: pblk: sched. metadata on write thread")
    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • A 1.2 device is able to manage the logical-to-physical mapping
    table internally or leave it to the host.

    A target only supports one of these approaches, and therefore must
    check which at initialization. Move this check to core to avoid
    having each target implement it.

    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • rqd.error is masked by the return value of pblk_submit_io_sync.
    The rqd structure is then passed on to the end_io function, which
    assumes that any error should lead to the chunk being marked
    offline/bad. Since pblk_submit_io_sync can fail before the command
    is issued to the device, the error value may not correspond to a
    media failure, leading to chunks being prematurely retired.
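
    A hedged sketch of the distinction (mark_chunk_bad() is a
    hypothetical helper standing in for the end_io handling):

        /* a submission failure never reached the media, so it must
         * not retire the chunk
         */
        ret = pblk_submit_io_sync(pblk, &rqd);
        if (ret) {
                pr_err("pblk: erase submission failed: %d\n", ret);
                goto out;
        }

        /* only a completion error reported by the device marks the
         * chunk offline/bad
         */
        if (rqd.error)
                mark_chunk_bad(pblk, rqd.ppa_addr);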

    Also, the pblk_blk_erase_sync function prints an error message when
    the erase fails. Since the caller prints an error message itself,
    remove the message from this function.

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Reviewed-by: Hans Holmberg
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • Add an nvm_set_flags helper that enables core to appropriately set
    the command flags for read/write/erase depending on which version
    a drive supports.

    The flags argument can be distilled into the access hint,
    scrambling, and program/erase suspend. Replace the access hint
    with an "is_seq" parameter; the rest of the flags depend on the
    command opcode, which is trivial to detect and set.
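
    A hedged sketch of such a helper for a 1.2 drive (flag and opcode
    names per the lightnvm header; the exact logic may differ from the
    patch):

        static u32 nvm_set_flags(struct nvm_geo *geo, struct nvm_rq *rqd)
        {
                u32 flags = 0;

                if (rqd->is_seq)        /* access hint: sequential */
                        flags |= geo->pln_mode >> 1;

                if (rqd->opcode == NVM_OP_PREAD)
                        flags |= (NVM_IO_SCRAMBLE_ENABLE | NVM_IO_SUSPEND);
                else if (rqd->opcode == NVM_OP_PWRITE)
                        flags |= NVM_IO_SCRAMBLE_ENABLE;

                return flags;
        }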

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • There is no need to force the NVMe device driver to be compiled in
    when the lightnvm subsystem is selected, and no need to select PCI
    either, as it is selected by the device driver that hooks into the
    subsystem.

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Signed-off-by: Jens Axboe

    Matias Bjørling
     

08 Oct, 2018

15 commits

  • When the nbuckets of a cache device is smaller than 1024, making
    the cache device triggers a BUG_ON in the kernel. Add a check to
    avoid this.
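
    A minimal sketch of such a check at registration time (threshold
    per the description; the error message and return value are
    illustrative):

        if (ca->sb.nbuckets < 1024) {
                pr_err("bcache: cannot register: too few buckets (%llu < 1024)",
                       ca->sb.nbuckets);
                return -EINVAL;
        }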

    Reported-by: nitroxis
    Signed-off-by: Dongbo Cao
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Dongbo Cao
     
  • Split the combined '||' statements in the if() check, to make the
    code easier to debug.

    Signed-off-by: Dongbo Cao
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Dongbo Cao
     
  • A cache_set holds at most MAX_CACHES_PER_SET caches, and the macro
    is used in the definition of struct cache_set:

        struct cache *cache_by_alloc[MAX_CACHES_PER_SET];

    Use MAX_CACHES_PER_SET instead of the magic number 8 in
    __bch_bucket_alloc_set.

    Signed-off-by: Shenghui Wang
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Shenghui Wang
     
  • In extents.c:bch_extent_bad(), the number 96 is used as a
    parameter when calling btree_bug_on(). Its purpose is to check
    whether the stale gen value exceeds BUCKET_GC_GEN_MAX, so it is
    better to use the macro BUCKET_GC_GEN_MAX to make the code more
    understandable.

    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Coly Li
     
  • Parameter "struct kobject *kobj" in bch_debug_init() is useless,
    remove it in this patch.

    Signed-off-by: Dongbo Cao
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Dongbo Cao
     
  • struct kmem_cache *bch_passthrough_cache is not used in
    bcache code. Remove it.

    Signed-off-by: Shenghui Wang
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Shenghui Wang
     
  • Recalculate cached_dev_sectors when a cached_dev is detached, just
    as it is recalculated on attach.

    Update cached_dev_sectors before bcache_device_detach() is called,
    because bcache_device_detach() sets bcache_device->c to NULL.

    Signed-off-by: Shenghui Wang
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Shenghui Wang
     
  • refill->end records the last key of the writeback. For example,
    the first time around, keys (1,128K) to (1,1024K) are flushed to
    the backing device, but the end key (1,1024K) is not included, due
    to the code below:

        if (bkey_cmp(k, refill->end) >= 0) {
                ret = MAP_DONE;
                goto out;
        }

    The next time the writeback keybuf is refilled, the search starts
    from key (1,1024K) and returns a key bigger than it, so the key
    (1,1024K) is missed. Modify the code above to let the end key be
    included in the writeback key buffer.
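
    A minimal sketch of the fix, assuming the comparison is simply
    relaxed so that the end key no longer terminates the refill:

        if (bkey_cmp(k, refill->end) > 0) {
                ret = MAP_DONE;
                goto out;
        }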

    Signed-off-by: Tang Junhui
    Cc: stable@vger.kernel.org
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Tang Junhui
     
  • Somewhere between Michael Lyle's original
    "bcache: PI controller for writeback rate V2" patch dated
    07 Sep 2017 and 1d316e6 ("bcache: implement PI controller for
    writeback rate"), the mapping of the writeback_rate_minimum
    attribute was dropped.

    Re-add the missing sysfs writeback_rate_minimum attribute mapping to
    "allow the user to specify a minimum rate at which dirty blocks are
    retired."

    Fixes: 1d316e6 ("bcache: implement PI controller for writeback rate")
    Signed-off-by: Ben Peddell
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Ben Peddell
     
  • When a bcache device is clean, dirty keys may still exist after
    journal replay, so we need to count these dirty keys even when the
    device is in a clean state; otherwise, after writeback, the amount
    of dirty data would be incorrect.

    Signed-off-by: Tang Junhui
    Cc: stable@vger.kernel.org
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Tang Junhui
     
  • The code comment for closure_return_with_destructor() in closure.h
    marks the function name as closure_return(). Fix the typo with the
    correct name, closure_return_with_destructor.

    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Coly Li
     
  • When doing an ioctl on a flash device, ioctl_dev() in super.c is
    called; in that path we should not fetch the cached device, since
    a flash-only device has no backing device. Move the dc->io_disable
    judgement to cached_dev_ioctl() so that ioctls on flash devices
    work correctly.

    Fixes: 0f0709e6bfc3c ("bcache: stop bcache device when backing device is offline")
    Signed-off-by: Tang Junhui
    Cc: stable@vger.kernel.org
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Tang Junhui
     
  • In cached_dev_cache_miss() and check_should_bypass(), REQ_META is
    used to check whether a bio is a metadata request. But REQ_META is
    used for blktrace; the correct REQ_ flag is REQ_PRIO. This flag
    means the bio should have priority over other bios, and it is
    frequently used to indicate metadata I/O in file system code.

    This patch replaces REQ_META with the correct flag, REQ_PRIO.
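
    Sketched, the check in both call sites becomes (a minimal
    illustration of the described replacement):

        /* was: if (bio->bi_opf & REQ_META) */
        if (bio->bi_opf & REQ_PRIO) {
                /* treat as metadata I/O */
        }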

    CC Adam Manzanares because he explained what REQ_PRIO is for.

    Signed-off-by: Coly Li
    Cc: Adam Manzanares
    Signed-off-by: Jens Axboe

    Coly Li
     
  • Missed read I/Os are identified by s->cache_missed, not
    s->cache_miss, so use s->cache_missed in trace_bcache_read() to
    identify whether the I/O missed the cache.

    Signed-off-by: Tang Junhui
    Cc: stable@vger.kernel.org
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Tang Junhui
     
  • UUIDs are considered metadata. __uuid_write should add the number
    of buckets written to disk (in sectors) to
    ca->meta_sectors_written. Currently only one bucket is used per
    uuid write.
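
    A minimal sketch of the accounting, assuming one bucket (counted
    in sectors) per uuid write:

        /* account the uuid bucket as metadata written, in sectors */
        atomic_long_add(ca->sb.bucket_size, &ca->meta_sectors_written);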

    Steps to test:
    1) Create a fresh backing device and a fresh cache device
       separately; the backing device is not attached to any cache
       set.
    2) cd /sys/block/<bdev>/bcache
       cat metadata_written   # record the output value
       cat bucket_size
    3) Attach the backing device to the cache set.
    4) cat metadata_written
       Before the change, the output value is almost the same as the
       value in step 2. After the change, the value is bigger by about
       one bucket size.

    Signed-off-by: Shenghui Wang
    Reviewed-by: Tang Junhui
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Shenghui Wang
     

05 Oct, 2018

2 commits

  • Pull NVMe updates from Christoph:

    "A relatively boring merge window:

    - better AEN tracing (Chaitanya)
    - NUMA aware PCIe multipathing (me)
    - RDMA workqueue fixes (Sagi)
    - better bio usage in the target (Sagi)
    - FC rework for target removal (James)
    - better multipath handling of ->queue_rq failures (James)
    - various cleanups (Milan)"

    * 'nvme-4.20' of git://git.infradead.org/nvme:
    nvmet-rdma: use a private workqueue for delete
    nvme: take node locality into account when selecting a path
    nvmet: don't split large I/Os unconditionally
    nvme: call nvme_complete_rq when nvmf_check_ready fails for mpath I/O
    nvme-core: add async event trace helper
    nvme_fc: add 'nvme_discovery' sysfs attribute to fc transport device
    nvmet_fc: support target port removal with nvmet layer
    nvme-fc: fix for a minor typos
    nvmet: remove redundant module prefix
    nvme: fix typo in nvme_identify_ns_descs

    Jens Axboe
     
  • Queue deletion is done asynchronously when the last reference on
    the queue is dropped. Thus, to make sure we don't over-allocate
    under a connect/disconnect storm, we let queue deletion complete
    before making forward progress.

    However, given that we flush the system_wq from rdma_cm context,
    which itself runs from a workqueue context, we can get a circular
    locking complaint [1]. Fix that by using a private workqueue for
    queue deletion.
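
    A minimal sketch of the change (workqueue name per the
    description; the allocation flags are an assumption):

        static struct workqueue_struct *nvmet_rdma_delete_wq;

        /* module init: give queue deletion its own workqueue */
        nvmet_rdma_delete_wq = alloc_workqueue("nvmet-rdma-delete-wq",
                                               WQ_UNBOUND, 0);
        if (!nvmet_rdma_delete_wq)
                return -ENOMEM;

        /* release path: no longer flushes or queues on system_wq */
        queue_work(nvmet_rdma_delete_wq, &queue->release_work);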

    [1]:
    ======================================================
    WARNING: possible circular locking dependency detected
    4.19.0-rc4-dbg+ #3 Not tainted
    ------------------------------------------------------
    kworker/5:0/39 is trying to acquire lock:
    00000000a10b6db9 (&id_priv->handler_mutex){+.+.}, at: rdma_destroy_id+0x6f/0x440 [rdma_cm]

    but task is already holding lock:
    00000000331b4e2c ((work_completion)(&queue->release_work)){+.+.}, at: process_one_work+0x3ed/0xa20

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #3 ((work_completion)(&queue->release_work)){+.+.}:
    process_one_work+0x474/0xa20
    worker_thread+0x63/0x5a0
    kthread+0x1cf/0x1f0
    ret_from_fork+0x24/0x30

    -> #2 ((wq_completion)"events"){+.+.}:
    flush_workqueue+0xf3/0x970
    nvmet_rdma_cm_handler+0x133d/0x1734 [nvmet_rdma]
    cma_ib_req_handler+0x72f/0xf90 [rdma_cm]
    cm_process_work+0x2e/0x110 [ib_cm]
    cm_req_handler+0x135b/0x1c30 [ib_cm]
    cm_work_handler+0x2b7/0x38cd [ib_cm]
    process_one_work+0x4ae/0xa20
    nvmet_rdma:nvmet_rdma_cm_handler: nvmet_rdma: disconnected (10): status 0 id 0000000040357082
    worker_thread+0x63/0x5a0
    kthread+0x1cf/0x1f0
    ret_from_fork+0x24/0x30
    nvme nvme0: Reconnecting in 10 seconds...

    -> #1 (&id_priv->handler_mutex/1){+.+.}:
    __mutex_lock+0xfe/0xbe0
    mutex_lock_nested+0x1b/0x20
    cma_ib_req_handler+0x6aa/0xf90 [rdma_cm]
    cm_process_work+0x2e/0x110 [ib_cm]
    cm_req_handler+0x135b/0x1c30 [ib_cm]
    cm_work_handler+0x2b7/0x38cd [ib_cm]
    process_one_work+0x4ae/0xa20
    worker_thread+0x63/0x5a0
    kthread+0x1cf/0x1f0
    ret_from_fork+0x24/0x30

    -> #0 (&id_priv->handler_mutex){+.+.}:
    lock_acquire+0xc5/0x200
    __mutex_lock+0xfe/0xbe0
    mutex_lock_nested+0x1b/0x20
    rdma_destroy_id+0x6f/0x440 [rdma_cm]
    nvmet_rdma_release_queue_work+0x8e/0x1b0 [nvmet_rdma]
    process_one_work+0x4ae/0xa20
    worker_thread+0x63/0x5a0
    kthread+0x1cf/0x1f0
    ret_from_fork+0x24/0x30

    Fixes: 777dc82395de ("nvmet-rdma: occasionally flush ongoing controller teardown")
    Reported-by: Bart Van Assche
    Signed-off-by: Sagi Grimberg
    Tested-by: Bart Van Assche

    Signed-off-by: Christoph Hellwig

    Sagi Grimberg
     

04 Oct, 2018

1 commit

  • Some time ago REQ_DISCARD was renamed into REQ_OP_DISCARD. Some
    comments and documentation files were, however, not updated.
    Update these comments and documentation files. See also commit
    4e1b2d52a80d ("block, fs, drivers: remove REQ_OP compat defs and
    related code").

    Signed-off-by: Bart Van Assche
    Cc: Mike Christie
    Cc: Martin K. Petersen
    Cc: Philipp Reisner
    Cc: Lars Ellenberg
    Signed-off-by: Jens Axboe

    Bart Van Assche