15 Aug, 2014

1 commit

  • Pull DEFINE_PCI_DEVICE_TABLE removal from Bjorn Helgaas:
    "Part two of the PCI changes for v3.17:

    - Remove DEFINE_PCI_DEVICE_TABLE macro use (Benoit Taine)

    It's a mechanical change that removes uses of the
    DEFINE_PCI_DEVICE_TABLE macro. I waited until later in the merge
    window to reduce conflicts, but it's possible you'll still see a few"

    * tag 'pci-v3.17-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
    PCI: Remove DEFINE_PCI_DEVICE_TABLE macro use

    Linus Torvalds
     

14 Aug, 2014

2 commits

  • Pull block driver changes from Jens Axboe:
    "Nothing out of the ordinary here, this pull request contains:

    - A big round of fixes for bcache from Kent Overstreet, Slava Pestov,
    and Surbhi Palande. No new features, just a lot of fixes.

    - The usual round of drbd updates from Andreas Gruenbacher, Lars
    Ellenberg, and Philipp Reisner.

    - virtio_blk was converted to blk-mq back in 3.13, but now Ming Lei
    has taken it one step further and added support for actually using
    more than one queue.

    - Addition of an explicit SG_FLAG_Q_AT_HEAD for block/bsg, to
    complement the default behavior of adding to the tail of the
    queue. From Douglas Gilbert"

    * 'for-3.17/drivers' of git://git.kernel.dk/linux-block: (86 commits)
    bcache: Drop unneeded blk_sync_queue() calls
    bcache: add mutex lock for bch_is_open
    bcache: Correct printing of btree_gc_max_duration_ms
    bcache: try to set b->parent properly
    bcache: fix memory corruption in init error path
    bcache: fix crash with incomplete cache set
    bcache: Fix more early shutdown bugs
    bcache: fix use-after-free in btree_gc_coalesce()
    bcache: Fix an infinite loop in journal replay
    bcache: fix crash in bcache_btree_node_alloc_fail tracepoint
    bcache: bcache_write tracepoint was crashing
    bcache: fix typo in bch_bkey_equal_header
    bcache: Allocate bounce buffers with GFP_NOWAIT
    bcache: Make sure to pass GFP_WAIT to mempool_alloc()
    bcache: fix uninterruptible sleep in writeback thread
    bcache: wait for buckets when allocating new btree root
    bcache: fix crash on shutdown in passthrough mode
    bcache: fix lockdep warnings on shutdown
    bcache allocator: send discards with correct size
    bcache: Fix to remove the rcu_sched stalls.
    ...

    Linus Torvalds
     
  • Pull Ceph updates from Sage Weil:
    "There is a lot of refactoring and hardening of the libceph and rbd
    code here from Ilya that fix various smaller bugs, and a few more
    important fixes with clone overlap. The main fix is a critical change
    to the request_fn handling to not sleep that was exposed by the recent
    mutex changes (which will also go to the 3.16 stable series).

    Yan Zheng has several fixes in here for CephFS fixing ACL handling,
    time stamps, and request resends when the MDS restarts.

    Finally, there are a few cleanups from Himangi Saraogi based on
    Coccinelle"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (39 commits)
    libceph: set last_piece in ceph_msg_data_pages_cursor_init() correctly
    rbd: remove extra newlines from rbd_warn() messages
    rbd: allocate img_request with GFP_NOIO instead GFP_ATOMIC
    rbd: rework rbd_request_fn()
    ceph: fix kick_requests()
    ceph: fix append mode write
    ceph: fix sizeof(struct tYpO *) typo
    ceph: remove redundant memset(0)
    rbd: take snap_id into account when reading in parent info
    rbd: do not read in parent info before snap context
    rbd: update mapping size only on refresh
    rbd: harden rbd_dev_refresh() and callers a bit
    rbd: split rbd_dev_spec_update() into two functions
    rbd: remove unnecessary asserts in rbd_dev_image_probe()
    rbd: introduce rbd_dev_header_info()
    rbd: show the entire chain of parent images
    ceph: replace comma with a semicolon
    rbd: use rbd_segment_name_free() instead of kfree()
    ceph: check zero length in ceph_sync_read()
    ceph: reset r_resend_mds after receiving -ESTALE
    ...

    Linus Torvalds
     

13 Aug, 2014

1 commit

  • We should prefer `struct pci_device_id` over `DEFINE_PCI_DEVICE_TABLE` to
    meet kernel coding style guidelines. This issue was reported by checkpatch.

    A simplified version of the semantic patch that makes this change is as
    follows (http://coccinelle.lip6.fr/):

    // <smpl>

    @@
    identifier i;
    declarer name DEFINE_PCI_DEVICE_TABLE;
    initializer z;
    @@

    - DEFINE_PCI_DEVICE_TABLE(i)
    + const struct pci_device_id i[]
    = z;

    // </smpl>

    [bhelgaas: add semantic patch]
    Signed-off-by: Benoit Taine
    Signed-off-by: Bjorn Helgaas

    Benoit Taine
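
    For illustration, the conversion amounts to the following (a hypothetical
    driver ID table; the device ID shown is made up):

    /* Before: the deprecated macro hides the type and the [] */
    static DEFINE_PCI_DEVICE_TABLE(example_pci_ids) = {
            { PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x1234) },    /* hypothetical ID */
            { 0, }
    };

    /* After: the plain declaration the semantic patch produces */
    static const struct pci_device_id example_pci_ids[] = {
            { PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x1234) },
            { 0, }
    };
    MODULE_DEVICE_TABLE(pci, example_pci_ids);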
     

09 Aug, 2014

1 commit


07 Aug, 2014

7 commits

  • rbd_warn() string should be a single line - rbd_warn() appends \n.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Now that rbd_img_request_create() is called from work functions, no
    need to use GFP_ATOMIC.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Alex Elder

    Ilya Dryomov
     
  • While it was never a good idea to sleep in request_fn(), commit
    34c6bc2c919a ("locking/mutexes: Add extra reschedule point") made it
    a *bad* idea. Since 3.15, mutex_lock() may reschedule *before* putting
    the task on the mutex wait queue, which for tasks in !TASK_RUNNING state
    means blocking forever. request_fn() may be called with !TASK_RUNNING on
    the way to schedule() in io_schedule().

    Offload request handling to a workqueue, one per rbd device, to avoid
    calling blocking primitives from rbd_request_fn().

    Fixes: http://tracker.ceph.com/issues/8818

    Cc: stable@vger.kernel.org # 3.16, needs backporting for 3.15
    Signed-off-by: Ilya Dryomov
    Tested-by: Eric Eastman
    Tested-by: Greg Wilson
    Reviewed-by: Alex Elder

    Ilya Dryomov
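
    A minimal sketch of the offload pattern described above, with hypothetical
    names (this is not the actual rbd code): the function that must not sleep
    only queues work, and the blocking request handling runs later in process
    context on a per-device workqueue.

    #include <linux/workqueue.h>
    #include <linux/mutex.h>
    #include <linux/errno.h>

    struct example_dev {
            struct workqueue_struct *wq;    /* one workqueue per device */
            struct work_struct      work;
            struct mutex            lock;
    };

    static void example_workfn(struct work_struct *work)
    {
            struct example_dev *d = container_of(work, struct example_dev, work);

            mutex_lock(&d->lock);           /* may sleep: fine in a work item */
            /* ... actual request handling goes here ... */
            mutex_unlock(&d->lock);
    }

    /* Called from a context that must not sleep (e.g. a request_fn):
     * just kick the worker instead of taking blocking locks here. */
    static void example_kick(struct example_dev *d)
    {
            queue_work(d->wq, &d->work);
    }

    static int example_init(struct example_dev *d)
    {
            mutex_init(&d->lock);
            INIT_WORK(&d->work, example_workfn);
            d->wq = alloc_workqueue("example_dev", WQ_MEM_RECLAIM, 0);
            return d->wq ? 0 : -ENOMEM;
    }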
     
  • Currently, we use an rwlock, tb_lock, to protect concurrent access to the
    whole zram meta table. However, in the actual access pattern there is
    only a small chance that upper-layer users access the same table[index],
    so the current lock granularity is too coarse.

    The idea of the optimization is to change the lock granularity from the
    whole meta table to a per-entry lock (table -> table[index]), so that we
    protect concurrent access to the same table[index] while allowing maximum
    concurrency.

    With this in mind, several kinds of locks which could be used as a
    per-entry lock were tested and compared:

    Test environment:
    x86-64 Intel Core2 Q8400, system memory 4GB, Ubuntu 12.04,
    kernel v3.15.0-rc3 as base, zram with 4 max_comp_streams LZO.

    iozone test:
    iozone -t 4 -R -r 16K -s 200M -I +Z
    (1GB zram with ext4 filesystem, take the average of 10 tests, KB/s)

    Test            base      CAS       spinlock  rwlock    bit_spinlock
    -------------------------------------------------------------------
    Initial write   1381094   1425435   1422860   1423075   1421521
    Rewrite         1529479   1641199   1668762   1672855   1654910
    Read            8468009   11324979  11305569  11117273  10997202
    Re-read         8467476   11260914  11248059  11145336  10906486
    Reverse Read    6821393   8106334   8282174   8279195   8109186
    Stride read     7191093   8994306   9153982   8961224   9004434
    Random read     7156353   8957932   9167098   8980465   8940476
    Mixed workload  4172747   5680814   5927825   5489578   5972253
    Random write    1483044   1605588   1594329   1600453   1596010
    Pwrite          1276644   1303108   1311612   1314228   1300960
    Pread           4324337   4632869   4618386   4457870   4500166

    To increase the chance of concurrent access to the same table[index],
    zram is given a small disksize (10MB) and the threads run with a large
    loop count.

    fio test:
    fio --bs=32k --randrepeat=1 --randseed=100 --refill_buffers
    --scramble_buffers=1 --direct=1 --loops=3000 --numjobs=4
    --filename=/dev/zram0 --name=seq-write --rw=write --stonewall
    --name=seq-read --rw=read --stonewall --name=seq-readwrite
    --rw=rw --stonewall --name=rand-readwrite --rw=randrw --stonewall
    (10MB zram raw block device, take the average of 10 tests, KB/s)

    Test        base      CAS       spinlock  rwlock    bit_spinlock
    -------------------------------------------------------------
    seq-write   933789    999357    1003298   995961    1001958
    seq-read    5634130   6577930   6380861   6243912   6230006
    seq-rw      1405687   1638117   1640256   1633903   1634459
    rand-rw     1386119   1614664   1617211   1609267   1612471

    All of the lock variants show higher performance than the base; however,
    it is hard to say which method is the most appropriate.

    On the other hand, zram is mostly used on small embedded systems, so we
    don't want to increase the memory footprint at all.

    This patch picks the bit_spinlock method and packs the object size and
    the page flags into a single unsigned long, table.value, so that no
    memory overhead is added on either 32-bit or 64-bit systems.

    Finally, even though the different kinds of locks perform differently,
    the difference can be ignored here: if zram is used as a swap device,
    the swap subsystem already prevents concurrent access to the same swap
    slot; if zram is used as a block device with a filesystem on top, the
    filesystem and the page cache mostly prevent concurrent access to the
    same block. So the performance differences among the locks can be
    ignored.

    Acked-by: Sergey Senozhatsky
    Reviewed-by: Davidlohr Bueso
    Signed-off-by: Weijie Yang
    Signed-off-by: Minchan Kim
    Cc: Jerome Marchand
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang
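
    A rough sketch of the packing and per-entry locking idea (field widths,
    names and the lock bit position here are illustrative, not necessarily
    the exact zram layout):

    #include <linux/bit_spinlock.h>
    #include <linux/types.h>

    #define ENTRY_FLAG_SHIFT  24
    #define ENTRY_SIZE_MASK   ((1UL << ENTRY_FLAG_SHIFT) - 1)
    #define ENTRY_LOCK_BIT    ENTRY_FLAG_SHIFT      /* the lock is one flag bit */

    struct table_entry {
            unsigned long handle;
            unsigned long value;    /* [flags | object size] packed in one word */
    };

    static inline size_t entry_size(struct table_entry *e)
    {
            return e->value & ENTRY_SIZE_MASK;
    }

    static inline void entry_lock(struct table_entry *e)
    {
            bit_spin_lock(ENTRY_LOCK_BIT, &e->value);   /* per-entry, no extra memory */
    }

    static inline void entry_unlock(struct table_entry *e)
    {
            bit_spin_unlock(ENTRY_LOCK_BIT, &e->value);
    }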
     
  • Some architectures (e.g., hexagon and PowerPC) can use a PAGE_SHIFT of 16
    or more. In these cases a page is at least 64 KB, which exceeds the 65535
    maximum of u16, so u16 is not large enough to represent a compressed
    page's size; use size_t instead.

    Signed-off-by: Minchan Kim
    Reported-by: Weijie Yang
    Acked-by: Sergey Senozhatsky
    Cc: Jerome Marchand
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Drop SECTOR_SIZE define, because it's not used.

    Signed-off-by: Sergey Senozhatsky
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Weijie Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Andrew Morton recently noted that `struct table' actually represents a
    table entry and thus should be renamed. Rename it to `zram_table_entry'.

    Signed-off-by: Sergey Senozhatsky
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Weijie Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     

25 Jul, 2014

8 commits

  • If we are mapping a snapshot, we must read in the parent_overlap value
    of that snapshot instead of that of the base image. Not doing so may
    in particular result in us returning zeros instead of user data:

    # cat overlap-snap.sh
    #!/bin/bash
    rbd create --size 10 --image-format 2 foo
    FOO_DEV=$(rbd map foo)
    dd if=/dev/urandom of=$FOO_DEV bs=1M &>/dev/null
    echo "Base image"
    dd if=$FOO_DEV bs=1 count=16 skip=$(((4 << 20) - 8)) 2>/dev/null | xxd
    rbd snap create foo@snap
    rbd snap protect foo@snap
    rbd clone foo@snap bar
    rbd snap create bar@snap
    BAR_DEV=$(rbd map bar@snap)
    echo "Snapshot"
    dd if=$BAR_DEV bs=1 count=16 skip=$(((4 << 20) - 8)) 2>/dev/null | xxd
    rbd resize --allow-shrink --size 4 bar
    echo "Snapshot after base image resize"
    dd if=$BAR_DEV bs=1 count=16 skip=$(((4 << 20) - 8)) 2>/dev/null | xxd

    # ./overlap-snap.sh
    Base image
    0000000: e781 e33b d34b 2225 6034 2845 a2e3 36ed ...;.K"%`4(E..6.
    Snapshot
    0000000: e781 e33b d34b 2225 6034 2845 a2e3 36ed ...;.K"%`4(E..6.
    Resizing image: 100% complete...done.
    Snapshot after base image resize
    0000000: e781 e33b d34b 2225 0000 0000 0000 0000 ...;.K"%........

    Even though bar@snap is taken with the old bar parent_overlap (8M),
    reads from bar@snap beyond the new bar parent_overlap (4M) return
    zeroes. Fix it.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Alex Elder

    Ilya Dryomov
     
  • Currently rbd_dev_v2_header_info() reads in parent info before the snap
    context is read in. This is wrong, because we may need to look at the
    parent_overlap value of the snapshot instead of that of the base image,
    for example when mapping a snapshot - see the next commit. (When mapping
    a snapshot, all we have is its name, and we need the snap context to
    translate that name into an id to know which parent info to look for.)

    The approach taken here is to make sure rbd_dev_v2_parent_info() is
    called after the snap context has been read in. The other approach
    would be to add a parent_overlap field to struct rbd_mapping and
    maintain it the same way rbd_mapping::size is maintained. The reason
    I chose the first approach is that the value of keeping around both
    base image values and the actual mapping values is unclear to me.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Alex Elder

    Ilya Dryomov
     
  • There is no sense in trying to update the mapping size before it's even
    been set.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Alex Elder

    Ilya Dryomov
     
  • Recently discovered watch/notify problems showed that we really can't
    ignore errors in anything refresh related. Alas, currently there is
    not much we can do in response to those errors, except print warnings.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Alex Elder

    Ilya Dryomov
     
  • rbd_dev_spec_update() has two modes of operation, with nothing in
    common between them. Split it into two functions, one for each mode
    and make our expectations more clear.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Alex Elder

    Ilya Dryomov
     
  • spec->image_id assert doesn't buy us much and image_format is asserted
    in rbd_dev_header_name() and rbd_dev_header_info() anyway.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Alex Elder

    Ilya Dryomov
     
  • A wrapper around rbd_dev_v{1,2}_header_info() to reduce duplication.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Alex Elder

    Ilya Dryomov
     
  • Make /sys/bus/rbd/devices/<dev-id>/parent show the entire chain of parent
    images. While at it: kernel sprintf() doesn't return negative values, so
    casting to unsigned long long is no longer necessary, and there is no
    good reason to split the output into multiple sprintf() calls.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Alex Elder

    Ilya Dryomov
     

24 Jul, 2014

2 commits

  • Memory allocated with kmem_cache_zalloc() must be freed with
    kmem_cache_free() rather than kfree(). The helper rbd_segment_name_free()
    does the job here; it is moved above its caller.

    The Coccinelle semantic patch that detects this change is as follows:

    // <smpl>
    @@
    expression x,E,c;
    @@

    x = \(kmem_cache_alloc\|kmem_cache_zalloc\|kmem_cache_alloc_node\)(c,...)
    ... when != x = E
        when != &x
    ?-kfree(x)
    +kmem_cache_free(c,x)
    // </smpl>

    Signed-off-by: Himangi Saraogi
    Acked-by: Julia Lawall
    Signed-off-by: Ilya Dryomov

    Himangi Saraogi
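
    As a generic illustration of the rule the patch enforces (hypothetical
    names, not the rbd code): objects obtained from a kmem_cache must be
    returned with kmem_cache_free(), never kfree().

    #include <linux/slab.h>

    static struct kmem_cache *example_cache;    /* created with kmem_cache_create() */

    struct example_obj {
            int id;
    };

    static struct example_obj *example_get(void)
    {
            return kmem_cache_zalloc(example_cache, GFP_KERNEL);
    }

    static void example_put(struct example_obj *obj)
    {
            kmem_cache_free(example_cache, obj);    /* not kfree(obj) */
    }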
     
  • Sasha reported a lockdep warning [1] introduced by [2].

    It can be fixed by doing the disk revalidation outside of init_lock.
    This is okay because the disk capacity change is still protected by
    init_lock, so revalidate_disk() always sees an up-to-date value and
    there is no race.

    [1] https://lkml.org/lkml/2014/7/3/735
    [2] zram: revalidate disk after capacity change

    Fixes 2e32baea46ce ("zram: revalidate disk after capacity change").

    Signed-off-by: Minchan Kim
    Reported-by: Sasha Levin
    Cc: "Alexander E. Patrakov"
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    CC:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
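
    A minimal sketch of the resulting shape, with hypothetical names (not the
    actual zram code): the capacity is still changed under init_lock, only the
    revalidate_disk() call moves outside of it.

    #include <linux/mutex.h>
    #include <linux/genhd.h>

    #define SECTOR_SHIFT 9

    struct example_zdev {
            struct mutex init_lock;
            struct gendisk *disk;
    };

    static void example_set_capacity(struct example_zdev *zdev, u64 bytes)
    {
            mutex_lock(&zdev->init_lock);
            set_capacity(zdev->disk, bytes >> SECTOR_SHIFT);
            mutex_unlock(&zdev->init_lock);

            /* Done outside init_lock to avoid the reported lockdep splat;
             * the capacity itself was updated under the lock, so
             * revalidate_disk() still sees the up-to-date value. */
            revalidate_disk(zdev->disk);
    }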
     

11 Jul, 2014

18 commits

  • My static checker warns that "data_size" could be negative and underflow
    the limit check. The code looks suspicious but I don't know if it is a
    real bug.

    Signed-off-by: Dan Carpenter
    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Dan Carpenter
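
    A standalone illustration of the class of bug such a checker flags (plain
    C, not the drbd code): an upper-bound check on a signed size does not stop
    a negative value, which memcpy() would then treat as a huge size_t.

    #include <string.h>

    #define LIMIT 64                /* a signed int constant, like a #define'd max */
    static char buf[LIMIT];

    /* With both sides signed, "data_size > LIMIT" alone does not reject -1;
     * the negative value would reach memcpy() and be converted to a huge
     * size_t there. The added lower-bound check (or an unsigned data_size)
     * closes the hole. */
    static int store(const void *src, int data_size)
    {
            if (data_size < 0 || data_size > LIMIT)
                    return -1;
            memcpy(buf, src, data_size);    /* 0 <= data_size <= LIMIT */
            return 0;
    }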
     
  • Don't error out with misleading "out of memory"
    if the cpu-mask has more bits set than there are CPUs.
    Just truncate to nr_cpu_ids implicitly.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • size is always 4096,
    page is always device->md_io.page.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • During resync, if we need to block some specific incoming write because
    of active resync requests to that same range, we potentially caused
    *all* new application writes (to "cold" activity log extents) to block
    until this one request has been processed.

    Improve the do_submit() logic to
    * grab all incoming requests to some "incoming" list
    * process this list
      - move aside requests that are blocked by resync
      - prepare activity log transactions,
      - commit transactions and submit corresponding requests
      - if there are remaining requests that only wait for
        activity log extents to become free, stop the fast path
        (mark activity log as "starving")
      - iterate until no more requests are waiting for the activity log,
        but all potentially remaining requests are only blocked by resync
    * only then grab new incoming requests

    That way, very busy IO on currently "hot" activity log extents cannot
    starve scattered IO to "cold" extents. And blocked-by-resync requests
    are processed once resync traffic on the affected region has ceased,
    without blocking anything else.

    The only blocking mode left is when we cannot start requests to "cold"
    extents because all currently "hot" extents are actually used.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • The data generation identifiers used to be exposed via sysfs
    at /sys/block/drbdX/drbd/meta_data/data_gen_id (out-of-tree),
    for advanced policy scripting.
    Bring that information over to debugfs.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • The information formerly in /sys/block/drbdX/drbd/oldest_requests
    is already available, in higher detail, in these files:
    debugfs/drbd/resource/$name/in_flight_summary,
    debugfs/drbd/resource/$name/volumes/$vnr/oldest_requests

    This patch adds
    debugfs/drbd/resource/$name/connections/peer/oldest_requests

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • Make the first line of debugfs files a version number,
    starting now with "v: 0".

    If we change content or presentation, we will bump that.
    Monitoring or diagnostic scripts that parse these files
    can then easily know when they need to be reviewed.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • Show oldest requests
    * pending master bio completion and,
    * if different, local disk bio completion.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • Add a per-connection worker thread callback_history
    with timing details, call site and callback function.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • * Add details about pending meta data operations to in_flight_summary.

    * Report number of requests waiting for activity log transactions.

    * timing details of peer_requests to in_flight_summary.

    * FLUSH details
    DRBD divides the incoming request stream into "epochs",
    in which peers are allowed to re-order writes independently.

    These epochs are separated by P_BARRIER on the replication link.
    Such barrier packets, depending on configuration, may cause
    the receiving side to drain the lower-level device request queues
    and call blkdev_issue_flush().

    This is known to be another major source of latency in DRBD.

    Track timing details of calls to blkdev_issue_flush(),
    and add them to in_flight_summary.

    * data socket stats
    To be able to diagnose bottlenecks and root causes of "slow" IO on DRBD,
    it is useful to see network buffer stats along with the timing details of
    requests, peer requests, and meta data IO.

    * pending bitmap IO timing details to in_flight_summary.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • Try to close the race between open() and debugfs_remove_recursive()
    from inside an object destructor.
    Once open succeeds, the object should stay around.
    Open should not succeed if the object has already reached its destructor.

    This may be overkill, but to make that happen, we check for existence of
    a parent directory, "stale-ness" of "this" dentry, and serialize
    kref_get_unless_zero() on the outermost object relevant for this file
    with d_delete() on this dentry (using the parent's i_mutex).

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • To help diagnose "high latency" or "hung" IO situations on DRBD,
    present per drbd resource group a summary of operations currently in progress.

    First item is a list of oldest drbd_request objects
    waiting for various things:
    * still being prepared
    * waiting for activity log transaction
    * waiting for local disk
    * waiting to be sent
    * waiting for peer acknowledgement ("receive ack", "write ack")
    * waiting for peer epoch acknowledgement ("barrier ack")

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • Add a new debugfs hierarchy under /sys/kernel/debug/:

    drbd/
      resources/
        $resource_name/connections/peer/$volume_number/
        $resource_name/volumes/$volume_number/
      minors/$minor_number -> ../resources/$resource_name/volumes/$volume_number/

    Followup commits will populate this hierarchy with files containing
    statistics, diagnostic information and some attribute data.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • Track start and submit time of bitmap operations, and
    add pending bitmap IO contexts to a new pending_bitmap_io list.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • Initialize peer_request with timestamp and proper empty list head.
    Add peer_request to list early, so debugfs can find this request and
    report it as "preparing", even if we sleep before we actually submit it.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • To be able to present timing details in debugfs,
    we need to track preparation/submit times of peer requests.

    Track peer request flags early,
    before they are put on the epoch_entry lists.

    Waiting for activity log transactions may be a major latency factor.
    We want to be able to present the peer_request state accurately in
    debugfs, and what it is waiting for.

    Consistently mark/unmark peer requests with EE_CALL_AL_COMPLETE_IO.
    Set it only *after* calling drbd_al_begin_io(),
    clear it as soon as we call drbd_al_complete_io().

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • Background resynchronisation does some "side-stepping", or throttles
    itself, if it detects application IO activity and the current resync
    rate estimate is above the configured "c-min-rate".

    What was not detected: application IO that does not show up as such
    because it is blocked on activity log transactions.

    Introduce a new atomic_t ap_actlog_cnt, tracking such blocked requests,
    and count non-zero as application IO activity.
    This counter is exposed at proc_details level 2 and above.

    Also make sure to release the currently locked resync extent
    if we side-step due to such voluntary throttling.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
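
    A tiny sketch of the counting idea, with hypothetical names and scope (not
    the actual drbd code): requests blocked on activity log transactions bump
    an atomic counter, and the throttling logic treats a non-zero count as
    application IO activity.

    #include <linux/atomic.h>
    #include <linux/types.h>

    static atomic_t actlog_waiters = ATOMIC_INIT(0);

    static void wait_for_al_extent(void)
    {
            atomic_inc(&actlog_waiters);    /* request now blocked on the AL */
            /* ... sleep until an activity log extent becomes available ... */
            atomic_dec(&actlog_waiters);
    }

    static bool application_io_active(void)
    {
            /* Non-zero means application IO exists even though none of it is
             * currently visible on the lower-level request queues. */
            return atomic_read(&actlog_waiters) != 0;
    }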
     
  • A request that is to be shipped to the peer goes through a few stages:
    - queued
    - sent, waiting for ack
    - ack received, waiting for "barrier ack", i.e. the re-order epoch being
      closed on the peer by acknowledging a "cache flush" equivalent
      on the lower-level device.

    In the latter two stages, depending on protocol, we may have already
    completed this request to the upper layers, so it won't be found anymore
    on device->pending_master_completion[] lists.

    Track the oldest request yet to be sent (req_next), the oldest not yet
    acknowledged (req_ack_pending) and the oldest "still waiting for
    something from the peer" (req_not_net_done), doing short list walks on
    the transfer log to find the next pending one whenever such a request
    makes progress.

    Now that we have a fast way to look up the oldest requests, we no longer
    need to do a full transfer log walk every time.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg