07 Jul, 2016

10 commits

  • The responsibility of the media manager is not to keep track of
    open/closed blocks. This is better maintained within a target,
    that already manages this information on writes.

    Remove the statistics and merge the states NVM_BLK_ST_OPEN and
    NVM_BLK_ST_CLOSED.

    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • A couple of small checkpatch fixups to stop it from complaining.

    ./drivers/lightnvm/core.c:360: WARNING: line over 80 characters
    ./drivers/lightnvm/core.c:360: ERROR: trailing statements should be on
    next line
    ./drivers/lightnvm/core.c:503: WARNING: Block comments use a trailing */
    on a separate line

    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • Checkpatch found two incidents where the type was preferred to be
    written out in full.

    ./drivers/lightnvm/rrpc.h:184: WARNING: Prefer 'unsigned int' to bare
    use of 'unsigned'
    ./drivers/lightnvm/rrpc.h:209: WARNING: Prefer 'unsigned int' to bare
    use of 'unsigned'
    ./drivers/lightnvm/rrpc.c:51: WARNING: Prefer 'unsigned int' to bare use
    of 'unsigned'

    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • Mark functions not used outside of their implementing file as static.

    Signed-off-by: Johannes Thumshirn
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Johannes Thumshirn
     
  • According to the OpenChannel SSD interface specification, the number of
    page pairings field of the NAND flash MLC page pairing information is the
    first two bytes of the MLC Page Pairing data structure. The hardware's
    data structure itself is little endian, so annotate it as such, like the
    rest of lightnvm's data structures.

    Signed-off-by: Johannes Thumshirn
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Johannes Thumshirn
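
To illustrate what the little-endian annotation protects against, here is a minimal userspace sketch of decoding a two-byte little-endian field independent of host byte order. The helper names `get_le16` and `mlc_num_pairs` are hypothetical and are not taken from the lightnvm code; the only detail drawn from the commit text is that the field occupies the first two bytes of the structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a 16-bit little-endian value regardless of host byte order. */
static uint16_t get_le16(const uint8_t *p)
{
	assert(p != NULL);
	return (uint16_t)(p[0] | ((uint16_t)p[1] << 8));
}

/*
 * Hypothetical accessor: per the commit text, the number of page
 * pairings is the first two bytes of the MLC Page Pairing structure.
 */
static uint16_t mlc_num_pairs(const uint8_t *mlc_page_pairing)
{
	return get_le16(mlc_page_pairing);
}
```

Reading the field through a plain `uint16_t` member would yield a byte-swapped value on a big-endian host, which is the kind of mistake the endianness annotation lets static checkers flag.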
     
  • The ->reserved bit is not initialized when allocated on stack.
    This may lead targets to misinterpret the PPA as cached.

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • Expose media manager mark_blk() to targets, as done for the rest of the
    media manager callback functions.

    Signed-off-by: Javier González
    Updated description
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • Break the loop when rqd is not null to avoid
    an unnecessary schedule.

    Signed-off-by: Wenwei Tao
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Wenwei Tao
     
  • We accidentally return zero here when ERR_PTR(-ENOMEM) is intended.

    Fixes: a07b4970f464 ('nvmet: add a generic NVMe target')
    Signed-off-by: Dan Carpenter
    Signed-off-by: Jens Axboe

    Dan Carpenter
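
The difference between returning zero and returning an encoded error can be sketched in userspace as follows. The helpers `err_ptr`, `ptr_err` and `is_err` are simplified stand-ins for the kernel's `ERR_PTR()`, `PTR_ERR()` and `IS_ERR()`; `struct ctrl` and both allocation functions are hypothetical and only illustrate the class of bug, not the actual nvmet code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095	/* upper bound for errno values encoded in a pointer */

/* Userspace stand-ins for the kernel's ERR_PTR(), PTR_ERR() and IS_ERR(). */
static void *err_ptr(long error) { return (void *)(intptr_t)error; }
static long ptr_err(const void *ptr) { return (long)(intptr_t)ptr; }
static int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct ctrl { int id; };	/* hypothetical object */

/* Allocation that can be forced to fail, to simulate out-of-memory. */
static struct ctrl *try_alloc(int simulate_oom)
{
	return simulate_oom ? NULL : calloc(1, sizeof(struct ctrl));
}

/* Buggy variant: failure is reported as NULL, which is_err() does not catch. */
static struct ctrl *ctrl_alloc_buggy(int simulate_oom)
{
	struct ctrl *c = try_alloc(simulate_oom);

	if (!c)
		return NULL;
	return c;
}

/* Fixed variant: failure is reported as an encoded -ENOMEM. */
static struct ctrl *ctrl_alloc_fixed(int simulate_oom)
{
	struct ctrl *c = try_alloc(simulate_oom);

	if (!c)
		return err_ptr(-ENOMEM);
	return c;
}
```

A caller that only checks `is_err()` treats the NULL from the buggy variant as success and dereferences it later, whereas the fixed variant lets the same check fail cleanly.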
     
  • CONFIG_NVME_TARGET has a correct CONFIG_CONFIGFS_FS dependency, but the
    newly added NVME_TARGET_LOOP is missing this, resulting in a link
    failure:

    drivers/nvme/built-in.o: In function `nvmet_init_configfs':
    loop.c:(.init.text+0x2a0): undefined reference to `config_group_init'
    loop.c:(.init.text+0x2c0): undefined reference to `config_group_init_type_name'
    loop.c:(.init.text+0x318): undefined reference to `configfs_register_subsystem'
    drivers/nvme/built-in.o: In function `nvmet_exit_configfs':
    loop.c:(.exit.text+0x9c): undefined reference to `configfs_unregister_subsystem'

    This adds the same dependency here.

    Signed-off-by: Arnd Bergmann
    Fixes: 3a85a5de29ea ("nvme-loop: add a NVMe loopback host driver")
    Signed-off-by: Jens Axboe

    Arnd Bergmann
     

06 Jul, 2016

14 commits

  • We have assigned sb->block_size before the switch,
    so remove the redundant one.

    Reviewed-by: Coly Li
    Signed-off-by: Yijing Wang
    Acked-by: Eric Wheeler
    Signed-off-by: Jens Axboe

    Yijing Wang
     
  • There is no return in continue_at(), update the documentation.

    Signed-off-by: Yijing Wang
    Acked-by: Coly Li
    Signed-off-by: Jens Axboe

    Yijing Wang
     
  • Cache_sb is not used in cache_alloc, and we have copied
    sb info to cache->sb already, remove it.

    Reviewed-by: Coly Li
    Signed-off-by: Yijing Wang
    Signed-off-by: Jens Axboe

    Yijing Wang
     
  • This patch adds nvme-loop, which allows access to local devices
    exported as NVMe over Fabrics namespaces. This module can be useful for
    easy evaluation, testing and also feature experimentation.

    To create an nvme-loop device you need to configure the NVMe target to
    export a loop port (see the nvmetcli documentation for that) and then
    connect to it using

    nvme connect-all -t loop

    which requires the very latest nvme-cli version with Fabrics support.

    Signed-off-by: Jay Freyensee
    Signed-off-by: Ming Lin
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Steve Wise
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • This patch introduces an implementation of NVMe subsystems,
    controllers and a discovery service, which allows exporting
    NVMe namespaces across fabrics such as Ethernet, FC etc.

    The implementation conforms to the NVMe 1.2.1 specification
    and interoperates with NVMe over fabrics host implementations.

    Configuration works using configfs, and is best performed using
    the nvmetcli tool from http://git.infradead.org/users/hch/nvmetcli.git,
    which also has a detailed explanation of the required steps in the
    README file.

    Signed-off-by: Armen Baloyan
    Signed-off-by: Anthony Knapp
    Signed-off-by: Jay Freyensee
    Signed-off-by: Ming Lin
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Steve Wise
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • The new NVMe over fabrics target will make use of this outside from a
    module.

    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Steve Wise
    Signed-off-by: Jens Axboe

    Sagi Grimberg
     
  • Periodic keep-alive is a mandatory feature in NVMe over Fabrics, and
    optional in NVMe 1.2.1 for PCIe. This patch adds periodic keep-alive
    sent from the host to verify that the controller is still responsive
    and vice-versa. The keep-alive timeout is user-defined (with
    keep_alive_tmo connection parameter) and defaults to 5 seconds.

    In order to avoid a race condition where the host sends a keep-alive
    competing with the target side keep-alive timeout expiration, the host
    adds a grace period of 10 seconds when publishing the keep-alive timeout
    to the target.

    In case a keep-alive failed (or timed out), a transport specific error
    recovery kicks in.

    For now only NVMe over Fabrics is wired up to support keep alive, but
    we can add PCIe support easily once controllers actually supporting it
    become available.

    Signed-off-by: Sagi Grimberg
    Reviewed-by: Steve Wise
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Sagi Grimberg
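
The grace-period arithmetic described above can be sketched as follows. The 5 second default and the 10 second grace period come from the commit text; the macro and function names are hypothetical, and publishing the value in milliseconds is an assumption made for this sketch.

```c
#include <assert.h>
#include <limits.h>

#define KATO_DEFAULT_S	5u	/* default keep-alive timeout (from the commit text) */
#define KATO_GRACE_S	10u	/* grace period added by the host (from the commit text) */

/*
 * Timeout the host publishes to the target. The host itself sends a
 * keep-alive every kato_s seconds, so the target-side timer expires
 * KATO_GRACE_S seconds later than the host's send interval.
 * Assumption: the published value is expressed in milliseconds.
 */
static unsigned int published_kato_ms(unsigned int kato_s)
{
	assert(kato_s <= UINT_MAX / 1000u - KATO_GRACE_S);	/* no overflow */
	return (kato_s + KATO_GRACE_S) * 1000u;
}
```

With the default of 5 seconds the host sends a keep-alive every 5 seconds while the target waits 15 seconds before declaring the host dead, so a keep-alive that is merely in flight cannot race with the target-side expiration.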
     
  • KAS: keep-alive support and granularity of kato in units of 100 ms
    nvme_admin_keep_alive opcode: 0x18

    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Sagi Grimberg
     
  • The NVMe over Fabrics library provides an interface for both transports
    and the nvme core to handle fabrics specific commands and attributes
    independent of the underlying transport.

    In addition, the fabrics library adds a misc device interface that allows
    actually creating a fabrics controller, as we can't just autodiscover
    it like in the PCI case. The nvme-cli utility has been enhanced to use
    this interface to support fabric connect and discovery.

    Signed-off-by: Armen Baloyan
    Signed-off-by: Jay Freyensee
    Signed-off-by: Ming Lin
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • The NVMe over Fabrics specification defines a protocol interface and
    related extensions to NVMe that enable operation over network protocols.
    The NVMe over Fabrics specification has an NVMe Transport binding for
    each NVMe Transport.

    This patch adds the fabrics related definitions:
    - fabric specific command set and error codes
    - transport addressing and binding definitions
    - fabrics sgl extensions
    - controller identification fabrics enhancements
    - discovery log page definition

    Signed-off-by: Armen Baloyan
    Signed-off-by: James Smart
    Signed-off-by: Jay Freyensee
    Signed-off-by: Ming Lin
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • - delete_controller: This attribute allows deleting a controller.
    A driver is not obligated to support it (pci doesn't) so it is
    created only if the driver supports it. The new fabrics drivers
    will support it (essentially a disconnect operation).

    Usage:
    echo > /sys/class/nvme/nvme0/delete_controller

    - subsysnqn: This attribute shows the subsystem nqn of the configured
    device. If a driver does not implement the get_subsysnqn method, the
    file will not appear in sysfs.

    - transport: This attribute shows the transport name. Added a "name"
    field to struct nvme_ctrl_ops.

    For loop,
    cat /sys/class/nvme/nvme0/transport
    loop

    For RDMA,
    cat /sys/class/nvme/nvme0/transport
    rdma

    For PCIe,
    cat /sys/class/nvme/nvme0/transport
    pcie

    - address: This attribute shows the controller address. The fabrics
    drivers that will implement get_address can show the address of the
    connected controller.

    example:
    cat /sys/class/nvme/nvme0/address
    traddr=192.168.2.2,trsvcid=1023

    Signed-off-by: Ming Lin
    Reviewed-by: Jay Freyensee
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Ming Lin
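
A userspace consumer of the address attribute might split the `key=value` pairs as sketched below. The input format is taken from the example output above (`traddr=192.168.2.2,trsvcid=1023`); the function name `addr_get_field` is hypothetical, and the sketch assumes the caller has already stripped any trailing newline from the sysfs read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the value stored under `key` in a "k1=v1,k2=v2" string into `out`.
 * Returns true if the key was found and its value fits in `out`.
 */
static bool addr_get_field(const char *addr, const char *key,
			   char *out, size_t out_len)
{
	size_t key_len = strlen(key);
	const char *p = addr;

	assert(out != NULL && out_len > 0);

	while (*p) {
		const char *end = strchr(p, ',');
		size_t len = end ? (size_t)(end - p) : strlen(p);

		/* The field matches if it starts with "<key>=". */
		if (len > key_len && strncmp(p, key, key_len) == 0 &&
		    p[key_len] == '=') {
			size_t val_len = len - key_len - 1;

			if (val_len >= out_len)
				return false;
			memcpy(out, p + key_len + 1, val_len);
			out[val_len] = '\0';
			return true;
		}

		/* Skip this field and the separating comma, if any. */
		p += len;
		if (*p == ',')
			p++;
	}
	return false;
}
```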
     
  • NVMe over fabrics will use __nvme_submit_sync_cmd in the
    transport and require a few tweaks to it. For that we export it
    and add a few more parameters:

    1. allow passing a queue ID to the block layer

    For the NVMe over Fabrics connect command we need to be able to specify a
    queue ID that we want to send the command on. Add a qid parameter to
    the relevant functions to enable this behavior.

    2. allow submitting at_head commands

    In cases where we want to (re)connect to a controller
    where we have inflight queued commands we want to first
    connect and only then allow the other queued commands to
    be kicked. This prevents failures in controller resets
    and reconnects.

    3. allow passing flags to blk_mq_allocate_request

    For both Fabrics connect and the keep-alive feature in NVMe 1.2.1 we
    want to be able to use reserved requests.

    Reviewed-by: Jay Freyensee
    Reviewed-by: Sagi Grimberg
    Tested-by: Ming Lin
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • For Fabrics we're not going through an intermediate reset state
    (at least for now).

    Reviewed-by: Jay Freyensee
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • For some protocols like NVMe over Fabrics we need to be able to send
    initialization commands to a specific queue.

    Based on an earlier patch from Christoph Hellwig.

    Signed-off-by: Ming Lin
    [hch: disallow sleeping allocation, req_op fixes]
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Ming Lin
     

29 Jun, 2016

1 commit

  • MG_DISK_MAJ is defined as 0 so dynamic block major number
    allocation is used by the driver and the assigned major
    number is stored in host->major. This patch fixes error
    path in mg_probe() to use host->major instead of using
    MG_DISK_MAJ.

    Cc: unsik Kim
    Signed-off-by: Bartlomiej Zolnierkiewicz
    Signed-off-by: Jens Axboe

    Bartlomiej Zolnierkiewicz
     

14 Jun, 2016

15 commits

  • crypto_alloc_hash returns an ERR_PTR(), not NULL.

    Also reset peer_integrity_tfm to NULL, to not call crypto_free_hash()
    on an errno in the cleanup path.

    Reported-by: Insu Yun

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg
    Signed-off-by: Jens Axboe

    Lars Ellenberg
     
  • For larger devices, the array of bitmap page pointers can grow very
    large (8000 pointers per TB of storage).

    For each activity log transaction, we need to flush the associated
    bitmap pages to stable storage. Currently, we just "mark" the respective
    pages while setting up the transaction, then tell the bitmap code to
    write out all marked pages, but skip unchanged pages.

    But since one such transaction can affect only a small number of bitmap
    pages, there is no need to scan the full array of several (ten-)thousand
    page pointers to find the few marked ones.

    Instead, remember the index numbers of the few affected pages,
    and later only re-check those to skip duplicates and unchanged ones.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg
    Signed-off-by: Jens Axboe

    Lars Ellenberg
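
The idea of remembering page indexes instead of scanning the whole pointer array can be sketched as below. This is an illustrative data structure, not DRBD's actual implementation: `struct al_txn`, `TXN_MAX_PAGES` and both functions are hypothetical names, and the per-transaction bound of 16 is an arbitrary assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TXN_MAX_PAGES 16	/* hypothetical per-transaction bound */

/* Indexes of the bitmap pages touched by one activity log transaction. */
struct al_txn {
	size_t n;
	unsigned long page_idx[TXN_MAX_PAGES];
};

/*
 * Remember page_nr for this transaction. Duplicates are stored once.
 * Returns false only if the index table is full.
 */
static bool txn_remember_page(struct al_txn *t, unsigned long page_nr)
{
	assert(t != NULL);

	for (size_t i = 0; i < t->n; i++)
		if (t->page_idx[i] == page_nr)
			return true;
	if (t->n == TXN_MAX_PAGES)
		return false;
	t->page_idx[t->n++] = page_nr;
	return true;
}

/*
 * Count the remembered pages that actually changed and need a write.
 * Only t->n entries are inspected, never the full page array.
 */
static size_t txn_count_writes(const struct al_txn *t, const bool *page_changed)
{
	size_t writes = 0;

	for (size_t i = 0; i < t->n; i++)
		if (page_changed[t->page_idx[i]])
			writes++;
	return writes;
}
```

The cost of the flush step then depends on the handful of pages a transaction touched rather than on the device size.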
     
  • Also skip the message unless bitmap IO took longer than 5 ms.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg
    Signed-off-by: Jens Axboe

    Lars Ellenberg
     
  • This should silence a warning about an empty statement. Thanks to Fabian
    Frederick, who sent a patch that I modified to be smaller and
    to avoid an additional indent level.

    Signed-off-by: Roland Kammerer
    Signed-off-by: Philipp Reisner
    Signed-off-by: Jens Axboe

    Roland Kammerer
     
  • This contains various cosmetic fixes ranging from simple typos to
    const-ifying, and using booleans properly.

    Original commit messages from Fabian's patch set:
    drbd: debugfs: constify drbd_version_fops
    drbd: use seq_put instead of seq_print where possible
    drbd: include linux/uaccess.h instead of asm/uaccess.h
    drbd: use const char * const for drbd strings
    drbd: kerneldoc warning fix in w_e_end_data_req()
    drbd: use unsigned for one bit fields
    drbd: use bool for peer is_ states
    drbd: fix typo
    drbd: use | for bitmask combination
    drbd: use true/false for bool
    drbd: fix drbd_bm_init() comments
    drbd: introduce peer state union
    drbd: fix maybe_pull_ahead() locking comments
    drbd: use bool for growing
    drbd: remove redundant declarations
    drbd: replace if/BUG by BUG_ON

    Signed-off-by: Fabian Frederick
    Signed-off-by: Roland Kammerer
    Signed-off-by: Jens Axboe

    Fabian Frederick
     
  • Scenario, starting with normal operation
    Connected Primary/Secondary UpToDate/UpToDate
    NetworkFailure Primary/Unknown UpToDate/DUnknown (frozen)
    ... more failures happen, secondary loses its disk,
    but eventually is able to re-establish the replication link ...
    Connected Primary/Secondary UpToDate/Diskless (resumed; needs to bump uuid!)

    We used to just resume/resend suspended requests,
    without bumping the UUID.

    Which will lead to problems later, when we want to re-attach the disk on
    the peer, without first disconnecting, or if we experience additional
    failures, because we now have diverging data without being able to
    recognize it.

    Make sure we also bump the current data generation UUID,
    if we notice "peer disk unknown" -> "peer disk known bad".

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg
    Signed-off-by: Jens Axboe

    Lars Ellenberg
     
  • We already serialize connection state changes,
    and other, non-connection state changes (role changes)
    while we are establishing a connection.

    But if we have an established connection,
    then trigger a resync handshake (by primary --force or similar),
    until now we just had to be "lucky".

    Consider this sequence (e.g. deployment scenario):
    create-md; up;
    -> Connected Secondary/Secondary Inconsistent/Inconsistent
    then do a racy primary --force on both peers.

    block drbd0: drbd_sync_handshake:
    block drbd0: self 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0
    block drbd0: peer 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0
    block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> Inconsistent )
    block drbd0: peer( Secondary -> Primary ) pdsk( Inconsistent -> UpToDate )
    *** HERE things go wrong. ***
    block drbd0: role( Secondary -> Primary )
    block drbd0: drbd_sync_handshake:
    block drbd0: self 0000000000000005:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0
    block drbd0: peer C90D2FC716D232AB:0000000000000004:0000000000000000:0000000000000000 bits:25590 flags:0
    block drbd0: Becoming sync target due to disk states.
    block drbd0: Writing the whole bitmap, full sync required after drbd_sync_handshake.
    block drbd0: Remote failed to finish a request within 6007ms > ko-count (2) * timeout (30 * 0.1s)
    drbd s0: peer( Primary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )

    The problem here is that the local promotion happens before the sync handshake
    triggered by the remote promotion was completed. Some assumptions elsewhere
    become wrong, and when the expected resync handshake is then received and
    processed, we get stuck in a deadlock, which can only be recovered by reboot :-(

    Fix: if we know the peer has good data,
    and our own disk is present, but NOT good,
    and there is no resync going on yet,
    we expect a sync handshake to happen "soon".
    So reject a racy promotion with SS_IN_TRANSIENT_STATE.

    Result:
    ... as above ...
    block drbd0: peer( Secondary -> Primary ) pdsk( Inconsistent -> UpToDate )
    *** local promotion being postponed until ... ***
    block drbd0: drbd_sync_handshake:
    block drbd0: self 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0
    block drbd0: peer 77868BDA836E12A5:0000000000000004:0000000000000000:0000000000000000 bits:25590 flags:0
    ...
    block drbd0: conn( WFBitMapT -> WFSyncUUID )
    block drbd0: updated sync uuid 85D06D0E8887AD44:0000000000000000:0000000000000000:0000000000000000
    block drbd0: conn( WFSyncUUID -> SyncTarget )
    *** ... after the resync handshake ***
    block drbd0: role( Secondary -> Primary )

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg
    Signed-off-by: Jens Axboe

    Lars Ellenberg
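
The condition under which a promotion is rejected can be written as a small predicate. This is a simplified sketch: the three-value `enum disk_state` is an assumption (the real state machine has more states), and `promotion_is_racy` is a hypothetical name. A true result corresponds to the case the commit rejects with SS_IN_TRANSIENT_STATE.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified disk states; an assumption for this sketch only. */
enum disk_state { D_DISKLESS, D_INCONSISTENT, D_UP_TO_DATE };

/*
 * A local promotion is considered racy when the peer is known to have
 * good data, our own disk is present but not good, and no resync is
 * running yet: a sync handshake is then expected to happen "soon".
 */
static bool promotion_is_racy(enum disk_state own, enum disk_state peer,
			      bool resync_running)
{
	bool peer_good = (peer == D_UP_TO_DATE);
	bool own_present_not_good = (own != D_DISKLESS && own != D_UP_TO_DATE);

	return peer_good && own_present_not_good && !resync_running;
}
```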
     
  • If in a two-primary scenario, we lost our peer, freeze IO,
    and are still frozen (no UUID rotation) when the peer comes back
    as Secondary after a hard crash, we will see identical UUIDs.

    The "rule_nr = 40" chose to use the "CRASHED_PRIMARY" bit as
    arbitration, but that would cause the still running (but frozen) Primary
    to become SyncTarget (which it typically refuses), and the handshake is
    declined.

    Fix: check current roles.
    If we have *one* current primary, the Primary wins.
    (rule_nr = 41)

    Since that is a protocol change, use the newly introduced DRBD_FF_WSAME
    to determine if rule_nr = 41 can be applied.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg
    Signed-off-by: Jens Axboe

    Lars Ellenberg
     
  • We will support WRITE_SAME, if
    * all peers support WRITE_SAME (both in kernel and DRBD version),
    * all peer devices support WRITE_SAME
    * logical_block_size is identical on all peers.

    We may at some point introduce a fallback on the receiving side
    for devices/kernels that do not support WRITE_SAME,
    by open-coding a submit loop. But not yet.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg
    Signed-off-by: Jens Axboe

    Lars Ellenberg
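
The three conditions listed above can be combined into a single check, sketched here under stated assumptions: `struct peer_caps` is a hypothetical summary of what is known about one peer, and `can_use_write_same` is a hypothetical name rather than a DRBD function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-peer capability summary. */
struct peer_caps {
	bool proto_write_same;	/* peer kernel and DRBD version support WRITE_SAME */
	bool dev_write_same;	/* peer backing device supports WRITE_SAME */
	unsigned int logical_block_size;
};

/* WRITE_SAME is usable only if every peer passes all three conditions. */
static bool can_use_write_same(const struct peer_caps *peers, size_t n,
			       unsigned int local_lbs)
{
	assert(peers != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (!peers[i].proto_write_same || !peers[i].dev_write_same)
			return false;
		if (peers[i].logical_block_size != local_lbs)
			return false;
	}
	return true;
}
```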
     
  • Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg
    Signed-off-by: Jens Axboe

    Lars Ellenberg
     
  • Even if discard_zeroes_data != 0,
    if discard_zeroes_if_aligned is set, we assume we can reliably
    zero-out/discard using the drbd_issue_peer_discard() helper.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg
    Signed-off-by: Jens Axboe

    Lars Ellenberg
     
  • When re-attaching the local backend device to a C_STANDALONE D_DISKLESS
    R_PRIMARY with OND_SUSPEND_IO, we may only resume IO if we recognize the
    backend that is being attached as D_UP_TO_DATE.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg
    Signed-off-by: Jens Axboe

    Lars Ellenberg
     
  • If DRBD lost all paths to good data,
    and the on-no-data-accessible policy is OND_SUSPEND_IO,
    all pending and new IO requests are suspended (will block).

    If that setting is OND_IO_ERROR, IO will still be completed.
    READ to "clean" areas (e.g. on a D_INCONSISTENT device,
    and bitmap indicates a block is already in sync) will succeed.
    READ to "unclean" areas (bitmap indicates block is out-of-sync),
    will return EIO.

    If we are already D_DISKLESS (or D_FAILED), we also return EIO.

    Unfortunately, on a former R_PRIMARY C_SYNC_TARGET D_INCONSISTENT,
    after replication link loss, new WRITE requests still went through OK.

    They would also set the "out-of-sync" bit on their way, so READ after
    WRITE would still return EIO. Also, since the data generation UUIDs had
    not been bumped, we would cause data divergence, without being able to
    detect it on the next sync handshake, given the right sequence of events
    in a multiple error scenario and "improper" order of recovery actions.

    The right thing to do is to return EIO for all new writes,
    unless we have access to good, current, D_UP_TO_DATE data.

    The "established best practices" way to avoid these situations in the
    first place is to set OND_SUSPEND_IO, or even do a hard-reset from
    the pri-on-incon-degr policy helper hook.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg
    Signed-off-by: Jens Axboe

    Lars Ellenberg
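
The resulting decision for a new WRITE can be summarized as a small function. This is a sketch of the rule as stated in the commit text, not the DRBD request path: `enum write_action` and `new_write_action` are hypothetical names, and the policy enum is reduced to the two values the text mentions.

```c
#include <assert.h>
#include <stdbool.h>

/* The two on-no-data-accessible policies named in the commit text. */
enum ond_policy { OND_IO_ERROR, OND_SUSPEND_IO };

/* Hypothetical outcome of submitting a new WRITE. */
enum write_action { WRITE_SUBMIT, WRITE_FAIL_EIO, WRITE_SUSPEND };

/*
 * New writes go through only while good, current (D_UP_TO_DATE) data is
 * accessible. Otherwise the policy decides between suspending the
 * request and completing it with EIO.
 */
static enum write_action new_write_action(bool have_up_to_date_data,
					  enum ond_policy policy)
{
	if (have_up_to_date_data)
		return WRITE_SUBMIT;
	return policy == OND_SUSPEND_IO ? WRITE_SUSPEND : WRITE_FAIL_EIO;
}
```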
     
  • Possible sequence of events:
    SyncTarget is made Primary, then loses replication link
    (only path to good data on SyncSource).

    Behavior is then controlled by the on-no-data-accessible policy,
    which defaults to OND_IO_ERROR (may be set to OND_SUSPEND_IO).

    If OND_IO_ERROR is in fact the current policy, we clear the susp_fen
    (IO suspended due to fencing policy) flag, do NOT set the susp_nod
    (IO suspended due to no data) flag.

    But we forgot to call the IO error completion for all pending,
    suspended, requests.

    While at it, also add a race check for a theoretically possible
    race with a new handshake (network hiccup): we may be able to
    re-send requests, and can avoid passing IO errors up the stack.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg
    Signed-off-by: Jens Axboe

    Lars Ellenberg
     
  • When resync is finished, we already call the "after-resync-target"
    handler (on the former sync target, obviously), once per volume.

    Paired with the before-resync-target handler, you can create snapshots,
    before the resync causes the volumes to become inconsistent,
    and discard those snapshots again, once they are no longer needed.

    It was also overloaded to be paired with the "fence-peer" handler,
    to "unfence" once the volumes are up-to-date and known good.

    This has some disadvantages, though: we call "fence-peer" for the whole
    connection (once for the group of volumes), but would call unfence as
    side-effect of after-resync-target once for each volume.

    Also, we fence on a (current, or about to become) Primary,
    which will later become the sync-source.

    Calling unfence only as a side effect of the after-resync-target
    handler opens a race window, between a new fence on the Primary
    (SyncTarget) and the unfence on the SyncTarget, which is difficult to
    close without some kind of "cluster wide lock" in those handlers.

    We would not need those handlers if we could still communicate.
    Which makes trying to acquire a cluster wide lock from those handlers
    seem like a very bad idea.

    This introduces the "unfence-peer" handler, which will be called
    per connection (once for the group of volumes), just like the fence
    handler, only once all volumes are back in sync, and on the SyncSource.

    Which is expected to be the node that previously called "fence", the
    node that is currently allowed to be Primary, and thus the only node
    that could trigger a new "fence" that could race with this unfence.

    Which makes us not need any cluster wide synchronization here,
    serializing two scripts running on the same node is trivial.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg
    Signed-off-by: Jens Axboe

    Lars Ellenberg