04 Oct, 2017

5 commits

  • When under memory-pressure it is possible that the mempool which backs
    the 'struct request_queue' will make use of up to BLKDEV_MIN_RQ count
    emergency buffers - in case it can't get a regular allocation. These
    buffers are preallocated and once they are also used, they are
    re-supplied with old finished requests from the same request_queue (see
    mempool_free()).

    The bug is that, when re-supplying the emergency pool, the old requests are
    not run through the callback mempool_t->alloc() again, and thus also not
    through the callback bsg_init_rq(). So initialization is skipped, and
    while the sense-buffer should still be good, scsi_request->cmd may have
    become an invalid pointer in the meantime: when the request is initialized
    in bsg.c and the user's CDB is larger than BLK_MAX_CDB, bsg replaces it
    with a custom allocated buffer, which is freed when the user's command
    finishes, so the pointer dangles afterwards. When the user next sends a
    command whose CDB is no larger than BLK_MAX_CDB, bsg assumes that
    scsi_request->cmd is backed by scsi_request->__cmd, makes no custom
    allocation, and writes into undefined memory.

    Fix this by splitting bsg_init_rq() into two functions:
    - bsg_init_rq() is changed to only do the allocation of the
    sense-buffer, which is used to back the bsg job's reply buffer. This
    pointer should never change during the lifetime of a scsi_request, so
    it doesn't need re-initialization.
    - bsg_initialize_rq() is a new function that makes use of
    'struct request_queue's initialize_rq_fn callback (which was
    introduced in v4.12). This is always called before the request is
    given out via blk_get_request(). This function does the remaining
    initialization that was previously done in bsg_init_rq(), and will
    also do it when the request is taken from the emergency-pool of the
    backing mempool.

    Fixes: 50b4d485528d ("bsg-lib: fix kernel panic resulting from missing allocation of reply-buffer")
    Cc: stable@vger.kernel.org # 4.11+
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Benjamin Block
    Signed-off-by: Jens Axboe

    Benjamin Block
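    A minimal sketch of the resulting split, based on the description in this
    entry (the job's embedded scsi_request and its field names are assumptions,
    not taken from the text):

        #include <linux/blkdev.h>
        #include <linux/bsg-lib.h>
        #include <linux/slab.h>
        #include <scsi/scsi_cmnd.h>
        #include <scsi/scsi_request.h>

        /* ->init_rq_fn: runs once, when the mempool allocates the request.
         * Only the sense buffer, which lives as long as the request itself,
         * is set up here. */
        static int bsg_init_rq(struct request_queue *q, struct request *req,
                               gfp_t gfp)
        {
                struct bsg_job *job = blk_mq_rq_to_pdu(req);

                job->sreq.sense = kzalloc(SCSI_SENSE_BUFFERSIZE, gfp);
                return job->sreq.sense ? 0 : -ENOMEM;
        }

        /* ->initialize_rq_fn: runs on every blk_get_request(), including for
         * requests handed back out of the mempool's emergency buffers, so the
         * per-use state (notably scsi_request->cmd pointing at __cmd again)
         * is reset here while the preallocated sense buffer is kept. */
        static void bsg_initialize_rq(struct request *req)
        {
                struct bsg_job *job = blk_mq_rq_to_pdu(req);
                void *sense = job->sreq.sense;

                memset(job, 0, sizeof(*job));
                scsi_req_init(&job->sreq);      /* cmd points back at __cmd */
                job->sreq.sense = sense;
                job->sreq.sense_len = SCSI_SENSE_BUFFERSIZE;
                job->reply = sense;
                job->reply_len = job->sreq.sense_len;
                job->dd_data = job + 1;
        }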
     
  • In blk_mq_debugfs_register(), I remembered to set up the per-hctx sched
    directories if a default scheduler was already configured by
    blk_mq_sched_init() from blk_mq_init_allocated_queue(), but I didn't do
    the same for the device-wide sched directory. Fix it.

    Fixes: d332ce091813 ("blk-mq-debugfs: allow schedulers to register debugfs attributes")
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
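    A hedged sketch of the kind of check described (function and field names
    assumed from the blk-mq debugfs code of that era):

        /* in blk_mq_debugfs_register(): the default scheduler may already
         * have been picked by blk_mq_sched_init() before the queue's debugfs
         * directory existed, so register its device-wide directory now. */
        if (q->elevator && !q->sched_debugfs_dir)
                blk_mq_debugfs_register_sched(q);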
     
    A recent commit made null_blk depend on configfs, which is kind of
    annoying since you now have to find this dependency and enable it as
    well. I discovered this because I no longer had null_blk available on
    a box I needed to debug, since it got dropped when the config was
    updated after the configfs change was merged.

    Fixes: 3bf2bd20734e ("nullb: add configfs interface")
    Reviewed-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • There is a case which will lead to io stall. The case is described as
    follows.
    /test1
      |-subtest1
    /test2
      |-subtest2
    And subtest1 and subtest2 each has 32 queued bios already.

    Now upgrade to max. In throtl_upgrade_state, it will try to dispatch
    bios as follows:
    1) tg=subtest1, do nothing;
    2) tg=test1, transfer 32 queued bios from subtest1 to test1; no pending
    left, no need to schedule next dispatch;
    3) tg=subtest2, do nothing;
    4) tg=test2, transfer 32 queued bios from subtest2 to test2; no pending
    left, no need to schedule next dispatch;
    5) tg=/, transfer 8 queued bios from test1 to /, 8 queued bios from
    test2 to /, 8 queued bios from test1 to /, and 8 queued bios from test2
    to /; note that test1 and test2 each still has 16 queued bios left;
    6) tg=/, try to schedule the next dispatch; but since disptime is now
    (updated in tg_update_disptime with wait=0), the pending timer is in
    fact not scheduled;
    7) in total, throtl_upgrade_state dispatches 32 queued bios, with 32
    left over: test1 and test2 each still have 16 queued bios;
    8) throtl_pending_timer_fn sees the left-over bios but can do nothing,
    because throtl_select_dispatch returns 0 and test1/test2 have no
    pending tg.

    The blktrace shows the following:
    8,32 0 0 2.539007641 0 m N throtl upgrade to max
    8,32 0 0 2.539072267 0 m N throtl /test2 dispatch nr_queued=16 read=0 write=16
    8,32 7 0 2.539077142 0 m N throtl /test1 dispatch nr_queued=16 read=0 write=16

    So force schedule dispatch if there are pending children.

    Reviewed-by: Shaohua Li
    Signed-off-by: Joseph Qi
    Signed-off-by: Jens Axboe

    Joseph Qi
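    A hedged sketch of the fix inside throtl_upgrade_state(), as described in
    this entry (whether the second argument of throtl_schedule_next_dispatch()
    is the force flag is an assumption):

        blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
                struct throtl_grp *tg = blkg_to_tg(blkg);
                struct throtl_service_queue *sq = &tg->service_queue;

                tg->disptime = jiffies - 1;
                throtl_select_dispatch(sq);
                /* force the dispatch schedule: children may still hold
                 * queued bios even though this tg has nothing pending
                 * (step 6 above) */
                throtl_schedule_next_dispatch(sq, true);
        }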
     
  • nbd-general@sourceforge.net becomes nbd@other.debian.org, because
    sourceforge is just a spamtrap these days.

    Signed-off-by: Wouter Verhelst
    Reviewed-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Wouter Verhelst
     

03 Oct, 2017

1 commit

    Since Christoph switched ->queue_rq over to blk_status_t return values,
    whenever we got ERESTARTSYS from sending our packets we ended up
    returning BLK_STS_OK instead of BLK_STS_RESOURCE, which means we'd
    never requeue and just hang. We really need to return the right value
    to the upper layer.

    Fixes: fc17b6534eb8 ("blk-mq: switch ->queue_rq return value to blk_status_t")
    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
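    A hedged sketch of the intent, not the exact driver code (the helper name
    nbd_send_status() is hypothetical): the result of the socket send must be
    translated so that an interrupted send surfaces as BLK_STS_RESOURCE and
    the request gets requeued instead of being treated as done.

        /* map the send path's errno onto the status ->queue_rq must return */
        static blk_status_t nbd_send_status(int ret)
        {
                if (ret == -ERESTARTSYS)
                        return BLK_STS_RESOURCE;  /* blk-mq requeues, retries */
                if (ret < 0)
                        return BLK_STS_IOERR;
                return BLK_STS_OK;
        }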
     

28 Sep, 2017

1 commit

  • Commit 09b3efec ("bcache: Don't reinvent the wheel but use existing llist
    API") replaces the following while loop by llist_for_each_entry(),

    -
    -       while (reverse) {
    -               cl = container_of(reverse, struct closure, list);
    -               reverse = llist_next(reverse);
    -
    +       llist_for_each_entry(cl, reverse, list) {
                    closure_set_waiting(cl, 0);
                    closure_sub(cl, CLOSURE_WAITING + 1);
            }

    This modification introduces a potential race by iterating a corrupted
    list. Here is how it happens.

    In the above modification, closure_sub() may wake up a process which is
    waiting on the reverse list. If this process decides to wait again by
    calling closure_wait(), its cl->list will be added to another wait list.
    Then when llist_for_each_entry() continues to the next node, it will
    traverse the new wait list that was added in closure_wait(), not the
    original reverse list in __closure_wake_up(). This is more likely to
    happen on a UP machine, because the woken process may preempt the
    process that woke it.

    Using llist_for_each_entry_safe() fixes the issue: the safe version
    fetches the next node before waking up a process, so the saved copy of
    the next node keeps the list iteration on the original reverse list.

    Fixes: 09b3efec81de ("bcache: Don't reinvent the wheel but use existing llist API")
    Signed-off-by: Coly Li
    Reported-by: Michael Lyle
    Reviewed-by: Byungchul Park
    Signed-off-by: Jens Axboe

    Coly Li
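    A sketch of the fixed loop described above (the name of the temporary is
    an assumption): the next node is fetched before the current closure is
    woken, so a re-queued closure can no longer redirect the walk.

        struct closure *cl, *t;

        llist_for_each_entry_safe(cl, t, reverse, list) {
                closure_set_waiting(cl, 0);
                closure_sub(cl, CLOSURE_WAITING + 1);
        }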
     

27 Sep, 2017

2 commits


26 Sep, 2017

15 commits

  • Pull compat fix from Al Viro:
    "I really wish gcc warned about conversions from pointer to function
    into void *..."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fix a typo in put_compat_shm_info()

    Linus Torvalds
     
  • "uip" misspelled as "up"; unfortunately, the latter happens to be
    a function and gcc is happy to convert it to void *...

    Signed-off-by: Al Viro

    Al Viro
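    A hedged sketch of the one-character fix, assuming the compat struct is
    copied out at the end of put_compat_shm_info():

        /* before: return copy_to_user(up, &info, sizeof(info));
         *   'up' is the semaphore-release function, silently converted to
         *   void * by the copy_to_user() prototype
         * after: copy to the user pointer that was actually passed in */
        return copy_to_user(uip, &info, sizeof(info));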
     
  • Pull block fixes from Jens Axboe:

    - Two sets of NVMe pull requests from Christoph:
    - Fixes for the Fibre Channel host/target to fix spec compliance
    - Allow a zero keep alive timeout
    - Make the debug printk for broken SGLs work better
    - Fix queue zeroing during initialization
    - Set of RDMA and FC fixes
    - Target div-by-zero fix

    - bsg double-free fix.

    - nbd unknown ioctl fix from Josef.

    - Buffered vs O_DIRECT page cache inconsistency fix. Has been floating
    around for a long time, well reviewed. From Lukas.

    - brd overflow fix from Mikulas.

    - Fix for a loop regression in this merge window, where using a union
    for two members of the loop_cmd turned out to be a really bad idea.
    From Omar.

    - Fix for an iostat regression fix in this series, using the wrong API
    to get at the block queue. From Shaohua.

    - Fix for a potential blktrace deletion deadlock. From Waiman.

    * 'for-linus' of git://git.kernel.dk/linux-block: (30 commits)
    nvme-fcloop: fix port deletes and callbacks
    nvmet-fc: sync header templates with comments
    nvmet-fc: ensure target queue id within range.
    nvmet-fc: on port remove call put outside lock
    nvme-rdma: don't fully stop the controller in error recovery
    nvme-rdma: give up reconnect if state change fails
    nvme-core: Use nvme_wq to queue async events and fw activation
    nvme: fix sqhd reference when admin queue connect fails
    block: fix a crash caused by wrong API
    fs: Fix page cache inconsistency when mixing buffered and AIO DIO
    nvmet: implement valid sqhd values in completions
    nvme-fabrics: Allow 0 as KATO value
    nvme: allow timed-out ios to retry
    nvme: stop aer posting if controller state not live
    nvme-pci: Print invalid SGL only once
    nvme-pci: initialize queue memory before interrupts
    nvmet-fc: fix failing max io queue connections
    nvme-fc: use transport-specific sgl format
    nvme: add transport SGL definitions
    nvme.h: remove FC transport-specific error values
    ...

    Linus Torvalds
     
  • Pull gfs2 fix from Bob Peterson:
    "GFS2: Fix an old regression in GFS2's debugfs interface

    This fixes a regression introduced by commit 88ffbf3e037e ("GFS2: Use
    resizable hash table for glocks"). The regression caused the glock dump
    in debugfs to not report all the glocks, which makes debugging
    extremely difficult"

    * tag 'gfs2-for-linus-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
    gfs2: Fix debugfs glocks dump

    Linus Torvalds
     
  • Pull Microblaze fixes from Michal Simek:

    - Kbuild fix

    - use vma_pages

    - set up little endian as the default

    * tag 'microblaze-4.14-rc3' of git://git.monstr.eu/linux-2.6-microblaze:
    arch: change default endian for microblaze
    microblaze: Cocci spatch "vma_pages"
    microblaze: Add missing kvm_para.h to Kbuild

    Linus Torvalds
     
  • Pull tracing fixes from Steven Rostedt:
    "Stack tracing and RCU has been having issues with each other and
    lockdep has been pointing out constant problems.

    The changes have been going into the stack tracer, but it has been
    discovered that the problem isn't with the stack tracer itself, but it
    is with calling save_stack_trace() from within the internals of RCU.

    The stack tracer is the one that can trigger the issue the easiest,
    but examining the problem further, it could also happen from a WARN()
    in the wrong place, or even if an NMI happened in this area and it did
    an rcu_read_lock().

    The critical area is where RCU is not watching. Which can happen while
    going to and from idle, or bringing up or taking down a CPU.

    The final fix was to put the protection in kernel_text_address() as it
    is the one that requires RCU to be watching while doing the stack
    trace.

    To make this work properly, Paul had to allow rcu_irq_enter() happen
    after rcu_nmi_enter(). This should have been done anyway, since an NMI
    can page fault (reading vmalloc area), and a page fault triggers
    rcu_irq_enter().

    One patch is just a consolidation of code so that the fix only needed
    to be done in one location"

    * tag 'trace-v4.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Remove RCU work arounds from stack tracer
    extable: Enable RCU if it is not watching in kernel_text_address()
    extable: Consolidate *kernel_text_address() functions
    rcu: Allow for page faults in NMI handlers

    Linus Torvalds
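    A hedged, simplified sketch of the kernel_text_address() change described
    above (the exact set of lookups it performs is abbreviated):

        int kernel_text_address(unsigned long addr)
        {
                bool no_rcu;
                int ret = 1;

                if (core_kernel_text(addr))
                        return 1;

                /* The module/bpf lookups below need RCU to be watching,
                 * which may not be the case when we get here from idle or
                 * CPU bring-up/tear-down; treat it like an NMI. */
                no_rcu = !rcu_is_watching();
                if (no_rcu)
                        rcu_nmi_enter();

                if (is_module_text_address(addr))
                        goto out;
                if (is_bpf_text_address(addr))
                        goto out;
                ret = 0;
        out:
                if (no_rcu)
                        rcu_nmi_exit();
                return ret;
        }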
     
    Now that there are potentially long delays between when a remoteport or
    targetport delete call is made and when the callback occurs (dev_loss_tmo
    timeout), no longer block in the delete routines and move the final nport
    puts to the callbacks.

    Moved the fcloop_nport_get/put/free routines to avoid forward declarations.

    Ensure port_info structs used in registrations are nulled in case fields
    are not set (ex: devloss_tmo values).

    Signed-off-by: James Smart
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    James Smart
     
    Comments were incorrect:
    - defer_rcv was in the host port template; moved to the target port template
    - added Mandatory statements for the target port template items

    Signed-off-by: James Smart
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    James Smart
     
    When searching for queue ids, ensure they are within the expected range.

    Signed-off-by: James Smart
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    James Smart
     
  • Avoid calling the put routine, as it may traverse to free routines while
    holding the target lock.

    Signed-off-by: James Smart
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    James Smart
     
    Calling nvme_stop_ctrl on an already failed controller will wait for the
    scan work to complete (which only ends when the identify timeout expires
    after 60 seconds). This is unnecessary when we already know that the
    controller has failed.

    Reported-by: Yi Zhang
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Sagi Grimberg
     
  • If we failed to transition to state LIVE after a successful reconnect,
    then controller deletion already started. In this case there is no
    point moving forward with reconnect.

    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Sagi Grimberg
     
  • async_event_work might race as it is executed from two different
    workqueues at the moment.

    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Sagi Grimberg
     
  • Fix bug in sqhd patch.

    It wasn't the sq that was at risk. In the case where the admin queue
    connect command fails, the sq->size field is not set. Therefore, this
    becomes a divide by zero error.

    Add a quick check to bypass under this failure condition.

    Signed-off-by: James Smart
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    James Smart
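    A hedged sketch of the described check (struct and field names are
    assumptions): skip the sqhd advance when the queue size was never set,
    i.e. when the admin queue connect command failed.

        if (req->sq->size)      /* stays 0 if the connect command failed */
                req->sq->sqhd = (req->sq->sqhd + 1) % req->sq->size;
        req->rsp->sq_head = cpu_to_le16(req->sq->sqhd);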
     
  • The switch to rhashtables (commit 88ffbf3e03) broke the debugfs glock
    dump (/sys/kernel/debug/gfs2//glocks) for dumps bigger than a
    single buffer: the right function for restarting an rhashtable iteration
    from the beginning of the hash table is rhashtable_walk_enter;
    rhashtable_walk_stop + rhashtable_walk_start will just resume from the
    current position.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson
    Cc: stable@vger.kernel.org # v4.3+

    Andreas Gruenbacher
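    A hedged, generic sketch of the API difference, not the gfs2 code itself
    (the table and iterator names are placeholders):

        /* rhashtable_walk_stop()/rhashtable_walk_start() pause and resume a
         * walk at its current position; restarting from the beginning of the
         * hash table requires re-entering the iterator: */
        rhashtable_walk_exit(&iter);
        rhashtable_walk_enter(&hash_table, &iter);
        rhashtable_walk_start(&iter);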
     

25 Sep, 2017

16 commits

  • part_stat_show takes a part device not a disk, so we should use
    part_to_disk.

    Fixes: d62e26b3ffd2 ("block: pass in queue to inflight accounting")
    Cc: Bart Van Assche
    Cc: Omar Sandoval
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
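    A hedged sketch of the change (local variable names assumed):

        struct hd_struct *p = dev_to_part(dev);
        /* a partition device has no queue of its own, so reach the queue
         * through the owning disk rather than via dev_to_disk(dev) */
        struct request_queue *q = part_to_disk(p)->queue;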
     
  • Currently when mixing buffered reads and asynchronous direct writes it
    is possible to end up with the situation where we have stale data in the
    page cache while the new data is already written to disk. This is
    permanent until the affected pages are flushed away. Despite the fact
    that mixing buffered and direct IO is ill-advised, it does pose a
    threat to data integrity, is unexpected, and should be fixed.

    Fix this by deferring completion of asynchronous direct writes to a
    process context in the case that there are mapped pages to be found in
    the inode. Later, before the completion in dio_complete(), invalidate
    the pages in question. This ensures that after the completion the pages
    in the written area are either unmapped, or populated with up-to-date
    data. Also do the same for the iomap case which uses
    iomap_dio_complete() instead.

    This has the side effect of deferring the completion to a process
    context for every AIO DIO that happens on an inode that has mapped
    pages. However, since the consensus is that this is an ill-advised
    practice, the performance implication should not be a problem.

    This was based on proposal from Jeff Moyer, thanks!

    Reviewed-by: Jan Kara
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Jeff Moyer
    Signed-off-by: Lukas Czerner
    Signed-off-by: Jens Axboe

    Lukas Czerner
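    A hedged sketch of the invalidation step in dio_complete() (struct dio
    field names and the offset/ret locals are assumptions): once the deferred
    completion runs in process context, any page cache over the written range
    is dropped so later buffered reads cannot see stale data.

        if (ret > 0 && dio->op == REQ_OP_WRITE &&
            dio->inode->i_mapping->nrpages) {
                err = invalidate_inode_pages2_range(dio->inode->i_mapping,
                                offset >> PAGE_SHIFT,
                                (offset + ret - 1) >> PAGE_SHIFT);
                WARN_ON_ONCE(err);      /* failure here would be unexpected */
        }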
     
  • To support sqhd, for initiators that are following the spec and
    paying attention to sqhd vs their sqtail values:

    - add sqhd to struct nvmet_sq
    - initialize sqhd to 0 in nvmet_sq_setup
    - rather than propagate the 0's-based qsize value from the connect
    message, which requires a +1 in every sqhd update, and as nothing else
    references it, convert to a 1's-based value in the nvmet_sq/cq_setup()
    calls.
    - validate that the connect message sqsize is non-zero per spec.
    - update the sqhd assigned to every completion that goes back.

    Also remove handling the NULL sq case in __nvmet_req_complete, as it can't
    happen with the current code.

    Signed-off-by: James Smart
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Max Gurtovoy
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    James Smart
     
    Currently, the driver code allows the user to set 0 as KATO
    (Keep Alive TimeOut), but this is not being respected.
    This patch enforces the expected behavior.

    Signed-off-by: Guilherme G. Piccoli
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Guilherme G. Piccoli
     
    Currently nvme_req_needs_retry() applies several checks to see if
    a retry is allowed. One of those is whether the current time has exceeded
    the start time of the io plus the timeout length. This check, if an io
    times out, means there is never a retry allowed for the io. Which means
    applications see the io failure.

    Remove this check and allow the io to timeout, like it does on other
    protocols, and retries to be made.

    On the FC transport, a frame can be lost for an individual io, and there
    may be no other errors that escalate for the connection/association.
    The io will timeout, which causes the transport to escalate into creating
    a new association, but the io that timed out, due to this retry logic, has
    already failed back to the application and things are hosed.

    Signed-off-by: James Smart
    Reviewed-by: Keith Busch
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    James Smart
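    A hedged sketch of nvme_req_needs_retry() with the elapsed-time check
    removed (the surrounding checks are assumptions based on the nvme core of
    that era):

        static inline bool nvme_req_needs_retry(struct request *req)
        {
                if (blk_noretry_request(req))
                        return false;
                if (nvme_req(req)->status & NVME_SC_DNR)
                        return false;
                /* removed: a start_time + timeout comparison that made any
                 * timed-out io permanently non-retryable */
                if (nvme_req(req)->retries >= nvme_max_retries)
                        return false;
                return true;
        }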
     
  • If an nvme async_event command completes, in most cases, a new
    async event is posted. However, if the controller enters a
    resetting or reconnecting state, there is nothing to block the
    scheduled work element from posting the async event again. Nor are
    there calls from the transport to stop async events when an
    association dies.

    In the case of FC, where the association is torn down, the aer must
    be aborted on the FC link and completes through the normal job
    completion path. Thus the terminated async event ends up being
    rescheduled even though the controller isn't in a valid state for
    the aer, and the reposting gets the transport into a partially torn
    down data structure.

    It's possible to hit the scenario on rdma, although it is much less
    likely there: an aer completes right as the association is terminated,
    and the association teardown reclaims the blk requests via
    nvme_cancel_request(), so it is immediate rather than a link-related
    action like on FC.

    Fix by putting controller state checks in both the async event
    completion routine where it schedules the async event and in the
    async event work routine before it calls into the transport. It's
    effectively a "stop_async_events()" behavior. The transport, when
    it creates a new association with the subsystem will transition
    the state back to live and is already restarting the async event
    posting.

    Signed-off-by: James Smart
    [hch: remove taking a lock over reading the controller state]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    James Smart
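    A minimal sketch of the kind of check described (the state name comes
    from the nvme core; the exact placement and the set of allowed states
    are assumptions):

        /* don't (re)post an async event unless the controller is live */
        if (ctrl->state != NVME_CTRL_LIVE)
                return;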
     
  • The WARN_ONCE macro returns true if the condition is true, not if the
    warn was raised, so we're printing the scatter list every time it's
    invalid. This is excessive and makes debugging harder, so this patch
    prints it just once.

    Signed-off-by: Keith Busch
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Keith Busch
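    A hedged illustration of the macro semantics involved, not the driver's
    exact code (payload_bytes, nents, sg and dump_sg_list() are hypothetical):
    WARN_ONCE(cond, ...) evaluates to cond on every call and only rate-limits
    the warning itself, so a dump guarded by it still runs on every bad
    request and needs its own once-only state.

        if (WARN_ONCE(dma_len < 0, "Invalid SGL for payload:%d nents:%d\n",
                      payload_bytes, nents)) {
                static bool dumped;

                if (!dumped) {
                        dump_sg_list(sg, nents);  /* print the SGL just once */
                        dumped = true;
                }
        }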
     
  • A spurious interrupt before the nvme driver has initialized the completion
    queue may inadvertently cause the driver to believe it has a completion
    to process. This may result in a NULL dereference since the nvmeq's tags
    are not set at this point.

    The patch initializes the host's CQ memory so that a spurious interrupt
    isn't mistaken for a real completion.

    Signed-off-by: Keith Busch
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Keith Busch
     
  • fc transport is treating NVMET_NR_QUEUES as maximum queue count, e.g.
    admin queue plus NVMET_NR_QUEUES-1 io queues. But NVMET_NR_QUEUES is
    the number of io queues, so maximum queue count is really
    NVMET_NR_QUEUES+1.

    Fix the handling in the target fc transport.

    Signed-off-by: James Smart
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    James Smart
     
  • Sync with NVM Express spec change and FC-NVME 1.18.

    FC transport sets SGL type to Transport SGL Data Block Descriptor and
    subtype to transport-specific value 0x0A.

    Removed the warn-ons on the PRP fields. They are unneeded: they were
    there to check for values from the upper layer that weren't set right,
    and for the most part were fine. But with async events, which reuse the
    same structure, the second time the command was issued the SGL overlay
    had converted those fields to the transport SGL values, so the warn-ons
    were firing errantly.

    Signed-off-by: James Smart
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    James Smart
     
    Add transport SGL definitions from NVMe TP 4008, required for
    the final NVMe-FC standard.

    Signed-off-by: James Smart
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    James Smart
     
    The NVM express group rescinded the reserved range for the transport.
    Remove the FC-centric values that had been defined.

    Signed-off-by: James Smart
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    James Smart
     
    The qla2xxx driver used the FC-specific error when it needed to return an
    error to the FC-NVME transport. Convert it to use a generic value instead.

    Signed-off-by: James Smart
    Acked-by: Himanshu Madhani
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    James Smart
     
    The lpfc driver used the FC-specific error when it needed to return an
    error to the FC-NVME transport. Convert it to use a generic value instead.

    Signed-off-by: James Smart
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    James Smart
     
  • The FC-NVME transport loopback test module used the FC-specific error
    codes in cases where it emulated a transport abort case. Instead of
    using the FC-specific values, now use a generic value (NVME_SC_INTERNAL).

    Signed-off-by: James Smart
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    James Smart
     
  • The FC-NVME target transport used the FC-specific error codes in
    return codes when the transport or lldd failed. Instead of using the
    FC-specific values, now use a generic value (NVME_SC_INTERNAL).

    Signed-off-by: James Smart
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    James Smart