30 Jun, 2011

6 commits

  • Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • If we have an asymetrically congested network, we may send P_PING,
    but due to congestion, the corresponding P_PING_ACK would time out,
    and we would drop a (congested, but otherwise) healthy connection
    ("PingAck did not arrive in time.")

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • If we have a good resync rate, we will frequently update the on-disk
    bitmap, which, if not accounted for as resync io, may let an otherwise
    idle device appear to be "busy", and cause us to throttle resync.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • The last commit, drbd: add missing spinlock to bitmap receive,
    introduced a cond_resched_lock(), where the lock in question is taken
    with irqs disabled.

    As we must not schedule with IRQs disabled,
    and cond_resched_lock_irq() does not exist, yet,
    we re-aquire the spin_lock_irq() for each bitmap page processed in turn.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • During bitmap exchange, when using the RLE bitmap compression scheme,
    we have a code path that can set the whole bitmap at once.

    To avoid holding spin_lock_irq() for too long, we used to lock out other
    bitmap modifications during bitmap exchange by other means, and then,
    knowing we have exclusive access to the bitmap, modify it without
    the spinlock, and with IRQs enabled.

    Since we now allow local IO to continue, potentially setting additional
    bits during the bitmap receive phase, this is no longer true, and we get
    uncoordinated updates of bitmap members, causing bm_set to no longer
    accurately reflect the total number of set bits.

    To actually see this, you'd need to have a large bitmap, use RLE bitmap
    compression, and have busy IO during sync handshake and bitmap exchange.

    Fix this by taking the spin_lock_irq() in this code path as well, but
    calling cond_resched_lock() after each page worth of bits processed.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Philipp Reisner
     

24 May, 2011

16 commits

  • When finding or allocating a loop device, loop_probe() did not take
    partition numbers into account so that it can result to a different
    device. Consider following example:

    $ sudo modprobe loop max_part=15
    $ ls -l /dev/loop*
    brw-rw---- 1 root disk 7, 0 2011-05-24 22:16 /dev/loop0
    brw-rw---- 1 root disk 7, 16 2011-05-24 22:16 /dev/loop1
    brw-rw---- 1 root disk 7, 32 2011-05-24 22:16 /dev/loop2
    brw-rw---- 1 root disk 7, 48 2011-05-24 22:16 /dev/loop3
    brw-rw---- 1 root disk 7, 64 2011-05-24 22:16 /dev/loop4
    brw-rw---- 1 root disk 7, 80 2011-05-24 22:16 /dev/loop5
    brw-rw---- 1 root disk 7, 96 2011-05-24 22:16 /dev/loop6
    brw-rw---- 1 root disk 7, 112 2011-05-24 22:16 /dev/loop7
    $ sudo mknod /dev/loop8 b 7 128
    $ sudo losetup /dev/loop8 ~/temp/disk-with-3-parts.img
    $ sudo losetup -a
    /dev/loop128: [0805]:278201 (/home/namhyung/temp/disk-with-3-parts.img)
    $ ls -l /dev/loop*
    brw-rw---- 1 root disk 7, 0 2011-05-24 22:16 /dev/loop0
    brw-rw---- 1 root disk 7, 16 2011-05-24 22:16 /dev/loop1
    brw-rw---- 1 root disk 7, 2048 2011-05-24 22:18 /dev/loop128
    brw-rw---- 1 root disk 7, 2049 2011-05-24 22:18 /dev/loop128p1
    brw-rw---- 1 root disk 7, 2050 2011-05-24 22:18 /dev/loop128p2
    brw-rw---- 1 root disk 7, 2051 2011-05-24 22:18 /dev/loop128p3
    brw-rw---- 1 root disk 7, 32 2011-05-24 22:16 /dev/loop2
    brw-rw---- 1 root disk 7, 48 2011-05-24 22:16 /dev/loop3
    brw-rw---- 1 root disk 7, 64 2011-05-24 22:16 /dev/loop4
    brw-rw---- 1 root disk 7, 80 2011-05-24 22:16 /dev/loop5
    brw-rw---- 1 root disk 7, 96 2011-05-24 22:16 /dev/loop6
    brw-rw---- 1 root disk 7, 112 2011-05-24 22:16 /dev/loop7
    brw-r--r-- 1 root root 7, 128 2011-05-24 22:17 /dev/loop8

    After this patch, /dev/loop8 - instead of /dev/loop128 - was
    accessed correctly.

    In addition, 'range' passed to blk_register_region() should
    include all range of dev_t that LOOP_MAJOR can address. It does
    not need to be limited by partition numbers unless 'max_loop'
    param was specified.

    Signed-off-by: Namhyung Kim
    Cc: Laurent Vivier
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Namhyung Kim
     
  • The 'max_part' parameter controls the number of maximum partition
    a loop block device can have. However if a user specifies very
    large value it would exceed the limitation of device minor number
    and can cause a kernel panic (or, at least, produce invalid
    device nodes in some cases).

    On my desktop system, following command kills the kernel. On qemu,
    it triggers similar oops but the kernel was alive:

    $ sudo modprobe loop max_part0000
    ------------[ cut here ]------------
    kernel BUG at /media/Linux_Data/project/linux/fs/sysfs/group.c:65!
    invalid opcode: 0000 [#1] SMP
    last sysfs file:
    CPU 0
    Modules linked in: loop(+)

    Pid: 43, comm: insmod Tainted: G W 2.6.39-qemu+ #155 Bochs Bochs
    RIP: 0010:[] [] internal_create_group=
    +0x2a/0x170
    RSP: 0018:ffff880007b3fde8 EFLAGS: 00000246
    RAX: 00000000ffffffef RBX: ffff880007b3d878 RCX: 00000000000007b4
    RDX: ffffffff8152da50 RSI: 0000000000000000 RDI: ffff880007b3d878
    RBP: ffff880007b3fe38 R08: ffff880007b3fde8 R09: 0000000000000000
    R10: ffff88000783b4a8 R11: ffff880007b3d878 R12: ffffffff8152da50
    R13: ffff880007b3d868 R14: 0000000000000000 R15: ffff880007b3d800
    FS: 0000000002137880(0063) GS:ffff880007c00000(0000) knlGS:00000000000000=
    00
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000422680 CR3: 0000000007b50000 CR4: 00000000000006b0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 0000000000000000 DR7: 0000000000000000
    Process insmod (pid: 43, threadinfo ffff880007b3e000, task ffff880007afb9c=
    0)
    Stack:
    ffff880007b3fe58 ffffffff811e66dd ffff880007b3fe58 ffffffff811e570b
    0000000000000010 ffff880007b3d800 ffff880007a7b390 ffff880007b3d868
    0000000000400920 ffff880007b3d800 ffff880007b3fe48 ffffffff8113cfc8
    Call Trace:
    [] ? device_add+0x4bc/0x5af
    [] ? dev_set_name+0x3c/0x3e
    [] sysfs_create_group+0xe/0x12
    [] blk_trace_init_sysfs+0x14/0x16
    [] blk_register_queue+0x47/0xf7
    [] add_disk+0xdf/0x290
    [] loop_init+0xeb/0x1b8 [loop]
    [] ? 0xffffffffa0005fff
    [] do_one_initcall+0x7a/0x12e
    [] sys_init_module+0x9c/0x1e0
    [] system_call_fastpath+0x16/0x1b
    Code: c3 55 48 89 e5 41 57 41 56 41 89 f6 41 55 41 54 49 89 d4 53 48 89 fb=
    48 83 ec 28 48 85 ff 74 0b 85 f6 75 0b 48 83 7f 30 00 75 14 0b eb fe =
    48 83 7f 30 00 b9 ea ff ff ff 0f 84 18 01 00 00 49
    RIP [] internal_create_group+0x2a/0x170
    RSP
    ---[ end trace a123eb592043acad ]---

    Signed-off-by: Namhyung Kim
    Cc: Laurent Vivier
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Namhyung Kim
     
  • In file included from drivers/block/drbd/drbd_main.c:54: drivers/block/drbd/drbd_int.h:1190: warning: parameter has incomplete type

    Forward declarations of enums do not work.

    Fix it unpleasantly by moving the prototype.

    Cc: Jens Axboe
    Signed-off-by: Lars Ellenberg
    Signed-off-by: Philipp Reisner
    Signed-off-by: Andrew Morton

    Andrew Morton
     
  • Signed-off-by: Philipp Reisner

    Philipp Reisner
     
  • Found these with the help of ispell -l.

    Signed-off-by: Bart Van Assche
    Signed-off-by: Lars Ellenberg
    Signed-off-by: Philipp Reisner

    Bart Van Assche
     
  • An administrative detach used to request a state change directly to D_DISKLESS,
    first suspending IO to avoid the last put_ldev() occuring from an endio handler,
    potentially in irq context.

    This is not enough on the receiving side (typically secondary), we may miss
    some peer_req on the way to local disk, which then may do the last put_ldev()
    from their drbd_peer_request_endio().

    This patch makes the detach always go through the intermediate D_FAILED state.
    We may consider to rename it D_DETACHING.

    Alternative approach would be to create yet an other work item to be scheduled
    on the worker, do the destructor work from there, and get the timing right.

    manually picked commit 564040f from the drbd 8.4 branch.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • The old (optimistic) implementation could shrink the bio size
    on an primary device.

    Shrinking the bio size on a primary device is bad. Since there
    we might get BIOs with the old (bigger) size shortly after
    we published the new size.

    The new implementation is more conservative, and eventually
    increases the max_bio_size on a primary device (which is valid).
    It does so, when it knows the local limit AND the remote limit.

    We cache the last seen max_bio_size of the peer in the meta
    data, and rely on that, to make the operation of single
    nodes more efficient.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Philipp Reisner
     
  • Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Philipp Reisner
     
  • Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Philipp Reisner
     
  • It seems that the real cause of all the issues where that
    we did not noticed in drbd_try_connect() when the other
    guy closes one socket if the round trip time gets higher
    than 100ms. There were that 100ms hard coded!

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Philipp Reisner
     
  • It is no longer sufficient to trigger on local WRITE,
    we need to check on (rq_state & RQ_IN_ACT_LOG)
    before calling drbd_al_complete_io also in the error path.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Philipp Reisner
     
  • If there is no replication traffic within the idle timeout
    (ping-int seconds), DRBD will send a P_PING,
    and adjust the timeout to ping-timeout.

    If there is no P_PING_ACK received within this ping-timeout,
    DRBD finally drops the connection, and tries to re-establish it.

    To decide which timeout was active, we compared the current timeout
    with the ping-timeout, and dropped the connection, if that was the case.

    By default, ping-int is 10 seconds, ping-timeout is 500 ms.

    Unfortunately, if you configure ping-timeout to be the same as ping-int,
    expiry of the idle-timeout had been mistaken for a missing ping ack,
    and caused an immediate reconnection attempt.

    Fix:
    Allow both timeouts to be equal, use a local variable
    to store which timeout is active.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • We limit ourselves to a configurable maximum number of pages used as
    temporary bio pages.

    If the configured "max_buffers" is not big enough to match the bandwidth
    of the respective deployment, a distributed deadlock could be triggered
    by e.g. fast online verify and heavy application IO.

    TCP connections would block on congestion, because both receivers
    would wait on pages to become available.

    Fortunately the respective senders in this case would be able to give
    back some pages already. So do that.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • For some time we contemplated calling the "struct lru_cache"
    a "struct tracked_set", and some comments kept the ts_ prefix.

    Fix those to match the member field names.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • In case a write failes on the local disk, go into D_INCONSISTENT
    disk state. That causes future reads of that block to be shipped
    to the peer.

    Read retry remote was already in place.

    Actually the documentation needs to get fixed now. Since the
    application is still shielded from the error. (as long as we have
    only a single disk failing) The difference to detach is that
    we keep the disk. And therefore might keep all the other, still
    working sectors up to date.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Philipp Reisner
     

19 May, 2011

3 commits

  • …it://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-2.6.40/drivers

    Jens Axboe
     
  • If the backends, which use these two functions, are compiled as
    a module we need these two functions to be exported.

    Signed-off-by: Konrad Rzeszutek Wilk

    Konrad Rzeszutek Wilk
     
  • We only supported the M2P (and P2M) override only for the
    GNTMAP_contains_pte type mappings. Meaning that we grants
    operations would "contain the machine address of the PTE to update"
    If the flag is unset, then the grant operation is
    "contains a host virtual address". The latter case means that
    the Hypervisor takes care of updating our page table
    (specifically the PTE entry) with the guest's MFN. As such we should
    not try to do anything with the PTE. Previous to this patch
    we would try to clear the PTE which resulted in Xen hypervisor
    being upset with us:

    (XEN) mm.c:1066:d0 Attempt to implicitly unmap a granted PTE c0100000ccc59067
    (XEN) domain_crash called from mm.c:1067
    (XEN) Domain 0 (vcpu#0) crashed on cpu#3:
    (XEN) ----[ Xen-4.0-110228 x86_64 debug=y Not tainted ]----

    and crashing us.

    This patch allows us to inhibit the PTE clearing in the PV guest
    if the GNTMAP_contains_pte is not set.

    On the m2p_remove_override path we provide the same parameter.

    Sadly in the grant-table driver we do not have a mechanism to
    tell m2p_remove_override whether to clear the PTE or not. Since
    the grant-table driver is used by user-space, we can safely assume
    that it operates only on PTE's. Hence the implementation for
    it to work on !GNTMAP_contains_pte returns -EOPNOTSUPP. In the future
    we can implement the support for this. It will require some extra
    accounting structure to keep track of the page[i], and the flag.

    [v1: Added documentation details, made it return -EOPNOTSUPP instead
    of trying to do a half-way implementation]
    Signed-off-by: Konrad Rzeszutek Wilk

    Konrad Rzeszutek Wilk
     

18 May, 2011

1 commit

  • The sector number on empty barrier requests may (will?) be -1, which,
    given that it's being treated as unsigned 64-bit quantity, will almost
    always exceed the actual (virtual) disk's size.

    Inspired by Konrad's "When writting barriers set the sector number to
    zero...".

    While at it also add overflow checking to the math in vbd_translate().

    Signed-off-by: Jan Beulich
    Signed-off-by: Konrad Rzeszutek Wilk

    Jan Beulich
     

13 May, 2011

14 commits