15 Aug, 2014

1 commit

  • Pull DEFINE_PCI_DEVICE_TABLE removal from Bjorn Helgaas:
    "Part two of the PCI changes for v3.17:

    - Remove DEFINE_PCI_DEVICE_TABLE macro use (Benoit Taine)

    It's a mechanical change that removes uses of the
    DEFINE_PCI_DEVICE_TABLE macro. I waited until later in the merge
    window to reduce conflicts, but it's possible you'll still see a few"

    * tag 'pci-v3.17-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
    PCI: Remove DEFINE_PCI_DEVICE_TABLE macro use

    Linus Torvalds
     

14 Aug, 2014

2 commits

  • Pull block driver changes from Jens Axboe:
    "Nothing out of the ordinary here, this pull request contains:

    - A big round of fixes for bcache from Kent Overstreet, Slava Pestov,
    and Surbhi Palande. No new features, just a lot of fixes.

    - The usual round of drbd updates from Andreas Gruenbacher, Lars
    Ellenberg, and Philipp Reisner.

    - virtio_blk was converted to blk-mq back in 3.13, but now Ming Lei
    has taken it one step further and added support for actually using
    more than one queue.

    - Addition of an explicit SG_FLAG_Q_AT_HEAD for block/bsg, to
    complement the default behavior of adding to the tail of the
    queue. From Douglas Gilbert"

    * 'for-3.17/drivers' of git://git.kernel.dk/linux-block: (86 commits)
    bcache: Drop unneeded blk_sync_queue() calls
    bcache: add mutex lock for bch_is_open
    bcache: Correct printing of btree_gc_max_duration_ms
    bcache: try to set b->parent properly
    bcache: fix memory corruption in init error path
    bcache: fix crash with incomplete cache set
    bcache: Fix more early shutdown bugs
    bcache: fix use-after-free in btree_gc_coalesce()
    bcache: Fix an infinite loop in journal replay
    bcache: fix crash in bcache_btree_node_alloc_fail tracepoint
    bcache: bcache_write tracepoint was crashing
    bcache: fix typo in bch_bkey_equal_header
    bcache: Allocate bounce buffers with GFP_NOWAIT
    bcache: Make sure to pass GFP_WAIT to mempool_alloc()
    bcache: fix uninterruptible sleep in writeback thread
    bcache: wait for buckets when allocating new btree root
    bcache: fix crash on shutdown in passthrough mode
    bcache: fix lockdep warnings on shutdown
    bcache allocator: send discards with correct size
    bcache: Fix to remove the rcu_sched stalls.
    ...

    Linus Torvalds
     
  • Pull Ceph updates from Sage Weil:
    "There is a lot of refactoring and hardening of the libceph and rbd
    code here from Ilya that fix various smaller bugs, and a few more
    important fixes with clone overlap. The main fix is a critical change
    to the request_fn handling to not sleep that was exposed by the recent
    mutex changes (which will also go to the 3.16 stable series).

    Yan Zheng has several fixes in here for CephFS fixing ACL handling,
    time stamps, and request resends when the MDS restarts.

    Finally, there are a few cleanups from Himangi Saraogi based on
    Coccinelle"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (39 commits)
    libceph: set last_piece in ceph_msg_data_pages_cursor_init() correctly
    rbd: remove extra newlines from rbd_warn() messages
    rbd: allocate img_request with GFP_NOIO instead GFP_ATOMIC
    rbd: rework rbd_request_fn()
    ceph: fix kick_requests()
    ceph: fix append mode write
    ceph: fix sizeof(struct tYpO *) typo
    ceph: remove redundant memset(0)
    rbd: take snap_id into account when reading in parent info
    rbd: do not read in parent info before snap context
    rbd: update mapping size only on refresh
    rbd: harden rbd_dev_refresh() and callers a bit
    rbd: split rbd_dev_spec_update() into two functions
    rbd: remove unnecessary asserts in rbd_dev_image_probe()
    rbd: introduce rbd_dev_header_info()
    rbd: show the entire chain of parent images
    ceph: replace comma with a semicolon
    rbd: use rbd_segment_name_free() instead of kfree()
    ceph: check zero length in ceph_sync_read()
    ceph: reset r_resend_mds after receiving -ESTALE
    ...

    Linus Torvalds
     

13 Aug, 2014

1 commit

  • We should prefer `struct pci_device_id` over `DEFINE_PCI_DEVICE_TABLE` to
    meet kernel coding style guidelines. This issue was reported by checkpatch.

    A simplified version of the semantic patch that makes this change is as
    follows (http://coccinelle.lip6.fr/):

    // <smpl>

    @@
    identifier i;
    declarer name DEFINE_PCI_DEVICE_TABLE;
    initializer z;
    @@

    - DEFINE_PCI_DEVICE_TABLE(i)
    + const struct pci_device_id i[]
    = z;

    // </smpl>

    [bhelgaas: add semantic patch]
    Signed-off-by: Benoit Taine
    Signed-off-by: Bjorn Helgaas

    Benoit Taine
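
    For illustration, the conversion amounts to the following (a hypothetical
    driver ID table; the device ID shown is made up):

    /* Before: the deprecated macro hides the type and the [] */
    static DEFINE_PCI_DEVICE_TABLE(example_pci_ids) = {
            { PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x1234) },    /* hypothetical ID */
            { 0, }
    };

    /* After: the plain declaration the semantic patch produces */
    static const struct pci_device_id example_pci_ids[] = {
            { PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x1234) },
            { 0, }
    };
    MODULE_DEVICE_TABLE(pci, example_pci_ids);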
     

09 Aug, 2014

1 commit


07 Aug, 2014

7 commits

  • rbd_warn() string should be a single line - rbd_warn() appends \n.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Now that rbd_img_request_create() is called from work functions, no
    need to use GFP_ATOMIC.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Alex Elder

    Ilya Dryomov
     
  • While it was never a good idea to sleep in request_fn(), commit
    34c6bc2c919a ("locking/mutexes: Add extra reschedule point") made it
    a *bad* idea. Since 3.15, mutex_lock() may reschedule *before* putting
    the task on the mutex wait queue, which for tasks in !TASK_RUNNING state
    means blocking forever. request_fn() may be called with !TASK_RUNNING on
    the way to schedule() in io_schedule().

    Offload request handling to a workqueue, one per rbd device, to avoid
    calling blocking primitives from rbd_request_fn().

    Fixes: http://tracker.ceph.com/issues/8818

    Cc: stable@vger.kernel.org # 3.16, needs backporting for 3.15
    Signed-off-by: Ilya Dryomov
    Tested-by: Eric Eastman
    Tested-by: Greg Wilson
    Reviewed-by: Alex Elder

    Ilya Dryomov
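
    A minimal sketch of the offload pattern described above, with hypothetical
    names (this is not the actual rbd code): the function that must not sleep
    only queues work, and the blocking request handling runs later in process
    context on a per-device workqueue.

    #include <linux/workqueue.h>
    #include <linux/mutex.h>
    #include <linux/errno.h>

    struct example_dev {
            struct workqueue_struct *wq;    /* one workqueue per device */
            struct work_struct      work;
            struct mutex            lock;
    };

    static void example_workfn(struct work_struct *work)
    {
            struct example_dev *d = container_of(work, struct example_dev, work);

            mutex_lock(&d->lock);           /* may sleep: fine in a work item */
            /* ... actual request handling goes here ... */
            mutex_unlock(&d->lock);
    }

    /* Called from a context that must not sleep (e.g. a request_fn):
     * just kick the worker instead of taking blocking locks here. */
    static void example_kick(struct example_dev *d)
    {
            queue_work(d->wq, &d->work);
    }

    static int example_init(struct example_dev *d)
    {
            mutex_init(&d->lock);
            INIT_WORK(&d->work, example_workfn);
            d->wq = alloc_workqueue("example_dev", WQ_MEM_RECLAIM, 0);
            return d->wq ? 0 : -ENOMEM;
    }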
     
  • Currently, we use an rwlock, tb_lock, to protect concurrent access to the
    whole zram meta table. However, in the actual access pattern there is
    only a small chance that upper-layer users access the same table[index],
    so the current lock granularity is too coarse.

    The idea of the optimization is to change the lock granularity from the
    whole meta table to a per-entry lock (table -> table[index]), so that we
    protect concurrent access to the same table[index] while allowing maximum
    concurrency.

    With this in mind, several kinds of locks which could be used as a
    per-entry lock were tested and compared:

    Test environment:
    x86-64 Intel Core2 Q8400, system memory 4GB, Ubuntu 12.04,
    kernel v3.15.0-rc3 as base, zram with 4 max_comp_streams LZO.

    iozone test:
    iozone -t 4 -R -r 16K -s 200M -I +Z
    (1GB zram with ext4 filesystem, take the average of 10 tests, KB/s)

    Test            base      CAS       spinlock  rwlock    bit_spinlock
    -------------------------------------------------------------------
    Initial write   1381094   1425435   1422860   1423075   1421521
    Rewrite         1529479   1641199   1668762   1672855   1654910
    Read            8468009   11324979  11305569  11117273  10997202
    Re-read         8467476   11260914  11248059  11145336  10906486
    Reverse Read    6821393   8106334   8282174   8279195   8109186
    Stride read     7191093   8994306   9153982   8961224   9004434
    Random read     7156353   8957932   9167098   8980465   8940476
    Mixed workload  4172747   5680814   5927825   5489578   5972253
    Random write    1483044   1605588   1594329   1600453   1596010
    Pwrite          1276644   1303108   1311612   1314228   1300960
    Pread           4324337   4632869   4618386   4457870   4500166

    To increase the chance of concurrent access to the same table[index],
    zram is given a small disksize (10MB) and the threads run with a large
    loop count.

    fio test:
    fio --bs=32k --randrepeat=1 --randseed=100 --refill_buffers
    --scramble_buffers=1 --direct=1 --loops=3000 --numjobs=4
    --filename=/dev/zram0 --name=seq-write --rw=write --stonewall
    --name=seq-read --rw=read --stonewall --name=seq-readwrite
    --rw=rw --stonewall --name=rand-readwrite --rw=randrw --stonewall
    (10MB zram raw block device, take the average of 10 tests, KB/s)

    Test        base      CAS       spinlock  rwlock    bit_spinlock
    -------------------------------------------------------------
    seq-write   933789    999357    1003298   995961    1001958
    seq-read    5634130   6577930   6380861   6243912   6230006
    seq-rw      1405687   1638117   1640256   1633903   1634459
    rand-rw     1386119   1614664   1617211   1609267   1612471

    All of the lock variants show higher performance than the base; however,
    it is hard to say which method is the most appropriate.

    On the other hand, zram is mostly used on small embedded systems, so we
    don't want to increase the memory footprint at all.

    This patch picks the bit_spinlock method and packs the object size and
    the page flags into a single unsigned long, table.value, so that no
    memory overhead is added on either 32-bit or 64-bit systems.

    Finally, even though the different kinds of locks perform differently,
    the difference can be ignored here: if zram is used as a swap device,
    the swap subsystem already prevents concurrent access to the same swap
    slot; if zram is used as a block device with a filesystem on top, the
    filesystem and the page cache mostly prevent concurrent access to the
    same block. So the performance differences among the locks can be
    ignored.

    Acked-by: Sergey Senozhatsky
    Reviewed-by: Davidlohr Bueso
    Signed-off-by: Weijie Yang
    Signed-off-by: Minchan Kim
    Cc: Jerome Marchand
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang
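
    A rough sketch of the packing and per-entry locking idea (field widths,
    names and the lock bit position here are illustrative, not necessarily
    the exact zram layout):

    #include <linux/bit_spinlock.h>
    #include <linux/types.h>

    #define ENTRY_FLAG_SHIFT  24
    #define ENTRY_SIZE_MASK   ((1UL << ENTRY_FLAG_SHIFT) - 1)
    #define ENTRY_LOCK_BIT    ENTRY_FLAG_SHIFT      /* the lock is one flag bit */

    struct table_entry {
            unsigned long handle;
            unsigned long value;    /* [flags | object size] packed in one word */
    };

    static inline size_t entry_size(struct table_entry *e)
    {
            return e->value & ENTRY_SIZE_MASK;
    }

    static inline void entry_lock(struct table_entry *e)
    {
            bit_spin_lock(ENTRY_LOCK_BIT, &e->value);   /* per-entry, no extra memory */
    }

    static inline void entry_unlock(struct table_entry *e)
    {
            bit_spin_unlock(ENTRY_LOCK_BIT, &e->value);
    }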
     
  • Some architectures (e.g., hexagon and PowerPC) can use a PAGE_SHIFT of 16
    or more. In these cases a page is at least 64 KB, which exceeds the 65535
    maximum of u16, so u16 is not large enough to represent a compressed
    page's size; use size_t instead.

    Signed-off-by: Minchan Kim
    Reported-by: Weijie Yang
    Acked-by: Sergey Senozhatsky
    Cc: Jerome Marchand
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Drop SECTOR_SIZE define, because it's not used.

    Signed-off-by: Sergey Senozhatsky
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Weijie Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Andrew Morton recently noted that `struct table' actually represents a
    table entry and thus should be renamed. Rename it to `zram_table_entry'.

    Signed-off-by: Sergey Senozhatsky
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Weijie Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     

25 Jul, 2014

8 commits

  • If we are mapping a snapshot, we must read in the parent_overlap value
    of that snapshot instead of that of the base image. Not doing so may
    in particular result in us returning zeros instead of user data:

    # cat overlap-snap.sh
    #!/bin/bash
    rbd create --size 10 --image-format 2 foo
    FOO_DEV=$(rbd map foo)
    dd if=/dev/urandom of=$FOO_DEV bs=1M &>/dev/null
    echo "Base image"
    dd if=$FOO_DEV bs=1 count=16 skip=$(((4 << 20) - 8)) 2>/dev/null | xxd
    rbd snap create foo@snap
    rbd snap protect foo@snap
    rbd clone foo@snap bar
    rbd snap create bar@snap
    BAR_DEV=$(rbd map bar@snap)
    echo "Snapshot"
    dd if=$BAR_DEV bs=1 count=16 skip=$(((4 << 20) - 8)) 2>/dev/null | xxd
    rbd resize --allow-shrink --size 4 bar
    echo "Snapshot after base image resize"
    dd if=$BAR_DEV bs=1 count=16 skip=$(((4 << 20) - 8)) 2>/dev/null | xxd

    # ./overlap-snap.sh
    Base image
    0000000: e781 e33b d34b 2225 6034 2845 a2e3 36ed ...;.K"%`4(E..6.
    Snapshot
    0000000: e781 e33b d34b 2225 6034 2845 a2e3 36ed ...;.K"%`4(E..6.
    Resizing image: 100% complete...done.
    Snapshot after base image resize
    0000000: e781 e33b d34b 2225 0000 0000 0000 0000 ...;.K"%........

    Even though bar@snap is taken with the old bar parent_overlap (8M),
    reads from bar@snap beyond the new bar parent_overlap (4M) return
    zeroes. Fix it.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Alex Elder

    Ilya Dryomov
     
  • Currently rbd_dev_v2_header_info() reads in parent info before the snap
    context is read in. This is wrong, because we may need to look at the
    parent_overlap value of the snapshot instead of that of the base image,
    for example when mapping a snapshot - see the next commit. (When mapping
    a snapshot, all we have is its name, and we need the snap context to
    translate that name into an id to know which parent info to look for.)

    The approach taken here is to make sure rbd_dev_v2_parent_info() is
    called after the snap context has been read in. The other approach
    would be to add a parent_overlap field to struct rbd_mapping and
    maintain it the same way rbd_mapping::size is maintained. The reason
    I chose the first approach is that the value of keeping around both
    base image values and the actual mapping values is unclear to me.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Alex Elder

    Ilya Dryomov
     
  • There is no sense in trying to update the mapping size before it's even
    been set.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Alex Elder

    Ilya Dryomov
     
  • Recently discovered watch/notify problems showed that we really can't
    ignore errors in anything refresh related. Alas, currently there is
    not much we can do in response to those errors, except print warnings.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Alex Elder

    Ilya Dryomov
     
  • rbd_dev_spec_update() has two modes of operation, with nothing in
    common between them. Split it into two functions, one for each mode
    and make our expectations more clear.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Alex Elder

    Ilya Dryomov
     
  • spec->image_id assert doesn't buy us much and image_format is asserted
    in rbd_dev_header_name() and rbd_dev_header_info() anyway.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Alex Elder

    Ilya Dryomov
     
  • A wrapper around rbd_dev_v{1,2}_header_info() to reduce duplication.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Alex Elder

    Ilya Dryomov
     
  • Make /sys/bus/rbd/devices/<dev-id>/parent show the entire chain of parent
    images. While at it: kernel sprintf() doesn't return negative values, so
    casting to unsigned long long is no longer necessary, and there is no
    good reason to split the output into multiple sprintf() calls.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Alex Elder

    Ilya Dryomov
     

24 Jul, 2014

2 commits

  • Memory allocated with kmem_cache_zalloc() must be freed with
    kmem_cache_free() rather than kfree(). The helper rbd_segment_name_free()
    does the job here; it is moved above its caller.

    The Coccinelle semantic patch that detects this change is as follows:

    // <smpl>
    @@
    expression x,E,c;
    @@

    x = \(kmem_cache_alloc\|kmem_cache_zalloc\|kmem_cache_alloc_node\)(c,...)
    ... when != x = E
        when != &x
    ?-kfree(x)
    +kmem_cache_free(c,x)
    // </smpl>

    Signed-off-by: Himangi Saraogi
    Acked-by: Julia Lawall
    Signed-off-by: Ilya Dryomov

    Himangi Saraogi
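
    As a generic illustration of the rule the patch enforces (hypothetical
    names, not the rbd code): objects obtained from a kmem_cache must be
    returned with kmem_cache_free(), never kfree().

    #include <linux/slab.h>

    static struct kmem_cache *example_cache;    /* created with kmem_cache_create() */

    struct example_obj {
            int id;
    };

    static struct example_obj *example_get(void)
    {
            return kmem_cache_zalloc(example_cache, GFP_KERNEL);
    }

    static void example_put(struct example_obj *obj)
    {
            kmem_cache_free(example_cache, obj);    /* not kfree(obj) */
    }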
     
  • Sasha reported a lockdep warning [1] introduced by [2].

    It can be fixed by doing the disk revalidation outside of init_lock.
    This is okay because the disk capacity change is still protected by
    init_lock, so revalidate_disk() always sees an up-to-date value and
    there is no race.

    [1] https://lkml.org/lkml/2014/7/3/735
    [2] zram: revalidate disk after capacity change

    Fixes 2e32baea46ce ("zram: revalidate disk after capacity change").

    Signed-off-by: Minchan Kim
    Reported-by: Sasha Levin
    Cc: "Alexander E. Patrakov"
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    CC:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
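
    A minimal sketch of the resulting shape, with hypothetical names (not the
    actual zram code): the capacity is still changed under init_lock, only the
    revalidate_disk() call moves outside of it.

    #include <linux/mutex.h>
    #include <linux/genhd.h>

    #define SECTOR_SHIFT 9

    struct example_zdev {
            struct mutex init_lock;
            struct gendisk *disk;
    };

    static void example_set_capacity(struct example_zdev *zdev, u64 bytes)
    {
            mutex_lock(&zdev->init_lock);
            set_capacity(zdev->disk, bytes >> SECTOR_SHIFT);
            mutex_unlock(&zdev->init_lock);

            /* Done outside init_lock to avoid the reported lockdep splat;
             * the capacity itself was updated under the lock, so
             * revalidate_disk() still sees the up-to-date value. */
            revalidate_disk(zdev->disk);
    }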
     

11 Jul, 2014

18 commits

  • My static checker warns that "data_size" could be negative and underflow
    the limit check. The code looks suspicious but I don't know if it is a
    real bug.

    Signed-off-by: Dan Carpenter
    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Dan Carpenter
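
    A standalone illustration of the class of bug such a checker flags (plain
    C, not the drbd code): an upper-bound check on a signed size does not stop
    a negative value, which memcpy() would then treat as a huge size_t.

    #include <string.h>

    #define LIMIT 64                /* a signed int constant, like a #define'd max */
    static char buf[LIMIT];

    /* With both sides signed, "data_size > LIMIT" alone does not reject -1;
     * the negative value would reach memcpy() and be converted to a huge
     * size_t there. The added lower-bound check (or an unsigned data_size)
     * closes the hole. */
    static int store(const void *src, int data_size)
    {
            if (data_size < 0 || data_size > LIMIT)
                    return -1;
            memcpy(buf, src, data_size);    /* 0 <= data_size <= LIMIT */
            return 0;
    }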
     
  • Don't error out with misleading "out of memory"
    if the cpu-mask has more bits set than there are CPUs.
    Just truncate to nr_cpu_ids implicitly.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • size is always 4096,
    page is always device->md_io.page.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • During resync, if we need to block some specific incoming write because
    of active resync requests to that same range, we potentially caused
    *all* new application writes (to "cold" activity log extents) to block
    until this one request has been processed.

    Improve the do_submit() logic to
    * grab all incoming requests to some "incoming" list
    * process this list
      - move aside requests that are blocked by resync
      - prepare activity log transactions,
      - commit transactions and submit corresponding requests
      - if there are remaining requests that only wait for
        activity log extents to become free, stop the fast path
        (mark activity log as "starving")
      - iterate until no more requests are waiting for the activity log,
        but all potentially remaining requests are only blocked by resync
    * only then grab new incoming requests

    That way, very busy IO on currently "hot" activity log extents cannot
    starve scattered IO to "cold" extents. And blocked-by-resync requests
    are processed once resync traffic on the affected region has ceased,
    without blocking anything else.

    The only blocking mode left is when we cannot start requests to "cold"
    extents because all currently "hot" extents are actually used.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • The data generation identifiers used to be exposed via sysfs
    at /sys/block/drbdX/drbd/meta_data/data_gen_id (out-of-tree),
    for advanced policy scripting.
    Bring that information over to debugfs.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • The information formerly in /sys/block/drbdX/drbd/oldest_requests
    is already available, in higher detail, in these files:
    debugfs/drbd/resource/$name/in_flight_summary,
    debugfs/drbd/resource/$name/volumes/$vnr/oldest_requests

    This patch adds
    debugfs/drbd/resource/$name/connections/peer/oldest_requests

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • Make the first line of debugfs files a version number,
    starting now with "v: 0".

    If we change content or presentation, we will bump that.
    Monitoring or diagnostic scripts that parse these files
    can then easily know when they need to be reviewed.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • Show oldest requests
    * pending master bio completion and,
    * if different, local disk bio completion.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • Add a per-connection worker thread callback_history
    with timing details, call site and callback function.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • * Add details about pending meta data operations to in_flight_summary.

    * Report number of requests waiting for activity log transactions.

    * timing details of peer_requests to in_flight_summary.

    * FLUSH details
    DRBD divides the incoming request stream into "epochs",
    in which peers are allowed to re-order writes independently.

    These epochs are separated by P_BARRIER on the replication link.
    Such barrier packets, depending on configuration, may cause
    the receiving side to drain the lower-level device request queues
    and call blkdev_issue_flush().

    This is known to be another major source of latency in DRBD.

    Track timing details of calls to blkdev_issue_flush(),
    and add them to in_flight_summary.

    * data socket stats
    To be able to diagnose bottlenecks and root causes of "slow" IO on DRBD,
    it is useful to see network buffer stats along with the timing details of
    requests, peer requests, and meta data IO.

    * pending bitmap IO timing details to in_flight_summary.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • Try to close the race between open() and debugfs_remove_recursive()
    from inside an object destructor.
    Once open succeeds, the object should stay around.
    Open should not succeed if the object has already reached its destructor.

    This may be overkill, but to make that happen, we check for existence of
    a parent directory, "stale-ness" of "this" dentry, and serialize
    kref_get_unless_zero() on the outermost object relevant for this file
    with d_delete() on this dentry (using the parent's i_mutex).

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • To help diagnose "high latency" or "hung" IO situations on DRBD,
    present per drbd resource group a summary of operations currently in progress.

    First item is a list of oldest drbd_request objects
    waiting for various things:
    * still being prepared
    * waiting for activity log transaction
    * waiting for local disk
    * waiting to be sent
    * waiting for peer acknowledgement ("receive ack", "write ack")
    * waiting for peer epoch acknowledgement ("barrier ack")

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • Add a new debugfs hierarchy under /sys/kernel/debug/:

    drbd/
      resources/
        $resource_name/connections/peer/$volume_number/
        $resource_name/volumes/$volume_number/
      minors/$minor_number -> ../resources/$resource_name/volumes/$volume_number/

    Followup commits will populate this hierarchy with files containing
    statistics, diagnostic information and some attribute data.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • Track start and submit time of bitmap operations, and
    add pending bitmap IO contexts to a new pending_bitmap_io list.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • Initialize peer_request with timestamp and proper empty list head.
    Add peer_request to list early, so debugfs can find this request and
    report it as "preparing", even if we sleep before we actually submit it.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • To be able to present timing details in debugfs,
    we need to track preparation/submit times of peer requests.

    Track peer request flags early,
    before they are put on the epoch_entry lists.

    Waiting for activity log transactions may be a major latency factor.
    We want to be able to present the peer_request state accurately in
    debugfs, and what it is waiting for.

    Consistently mark/unmark peer requests with EE_CALL_AL_COMPLETE_IO.
    Set it only *after* calling drbd_al_begin_io(),
    clear it as soon as we call drbd_al_complete_io().

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
     
  • Background resynchronisation does some "side-stepping", or throttles
    itself, if it detects application IO activity and the current resync
    rate estimate is above the configured "c-min-rate".

    What was not detected: application IO that does not show up as such
    because it is blocked on activity log transactions.

    Introduce a new atomic_t ap_actlog_cnt, tracking such blocked requests,
    and count non-zero as application IO activity.
    This counter is exposed at proc_details level 2 and above.

    Also make sure to release the currently locked resync extent
    if we side-step due to such voluntary throttling.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg
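
    A tiny sketch of the counting idea, with hypothetical names and scope (not
    the actual drbd code): requests blocked on activity log transactions bump
    an atomic counter, and the throttling logic treats a non-zero count as
    application IO activity.

    #include <linux/atomic.h>
    #include <linux/types.h>

    static atomic_t actlog_waiters = ATOMIC_INIT(0);

    static void wait_for_al_extent(void)
    {
            atomic_inc(&actlog_waiters);    /* request now blocked on the AL */
            /* ... sleep until an activity log extent becomes available ... */
            atomic_dec(&actlog_waiters);
    }

    static bool application_io_active(void)
    {
            /* Non-zero means application IO exists even though none of it is
             * currently visible on the lower-level request queues. */
            return atomic_read(&actlog_waiters) != 0;
    }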
     
  • A request that is to be shipped to the peer goes through a few stages:
    - queued
    - sent, waiting for ack
    - ack received, waiting for "barrier ack", i.e. the re-order epoch being
      closed on the peer by acknowledging a "cache flush" equivalent
      on the lower-level device.

    In the latter two stages, depending on protocol, we may have already
    completed this request to the upper layers, so it won't be found anymore
    on device->pending_master_completion[] lists.

    Track the oldest request yet to be sent (req_next), the oldest not yet
    acknowledged (req_ack_pending) and the oldest "still waiting for
    something from the peer" (req_not_net_done), doing short list walks on
    the transfer log to find the next pending one whenever such a request
    makes progress.

    Now that we have a fast way to look up the oldest requests, we no longer
    need to do a full transfer log walk every time.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg

    Lars Ellenberg