26 Jan, 2019

1 commit

  • [ Upstream commit d7e6b8dfc7bcb3f4f3a18313581f67486a725b52 ]

    When using kcopyd to run callbacks through dm_kcopyd_do_callback() or
    submitting copy jobs with a source size of 0, the jobs are pushed
    directly to the complete_jobs list, which could be under processing by
    the kcopyd thread. As a result, the kcopyd thread can continue running
    completed jobs indefinitely, without releasing the CPU, as long as
    someone keeps submitting new completed jobs through the aforementioned
    paths. Processing of work items, queued for execution on the same CPU as
    the currently running kcopyd thread, is thus stalled for excessive
    amounts of time, hurting performance.

    Running the following test, from the device mapper test suite [1],

    dmtest run --suite snapshot -n parallel_io_to_many_snaps_N

    , with 8 active snapshots, we get, in dmesg, messages like the
    following:

    [68899.948523] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 95s!
    [68899.949282] Showing busy workqueues and worker pools:
    [68899.949288] workqueue events: flags=0x0
    [68899.949295] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
    [68899.949306] pending: vmstat_shepherd, cache_reap
    [68899.949331] workqueue mm_percpu_wq: flags=0x8
    [68899.949337] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
    [68899.949345] pending: vmstat_update
    [68899.949387] workqueue dm_bufio_cache: flags=0x8
    [68899.949392] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
    [68899.949400] pending: work_fn [dm_bufio]
    [68899.949423] workqueue kcopyd: flags=0x8
    [68899.949429] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
    [68899.949437] pending: do_work [dm_mod]
    [68899.949452] workqueue kcopyd: flags=0x8
    [68899.949458] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
    [68899.949466] in-flight: 13:do_work [dm_mod]
    [68899.949474] pending: do_work [dm_mod]
    [68899.949487] workqueue kcopyd: flags=0x8
    [68899.949493] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
    [68899.949501] pending: do_work [dm_mod]
    [68899.949515] workqueue kcopyd: flags=0x8
    [68899.949521] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
    [68899.949529] pending: do_work [dm_mod]
    [68899.949541] workqueue kcopyd: flags=0x8
    [68899.949547] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
    [68899.949555] pending: do_work [dm_mod]
    [68899.949568] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=95s workers=4 idle: 27130 27223 1084

    Fix this by splitting the complete_jobs list into two parts: A user
    facing part, named callback_jobs, and one used internally by kcopyd,
    retaining the name complete_jobs. dm_kcopyd_do_callback() and
    dispatch_job() now push their jobs to the callback_jobs list, which is
    spliced to the complete_jobs list once, every time the kcopyd thread
    wakes up. This prevents kcopyd from hogging the CPU indefinitely and
    causing workqueue stalls.

    Re-running the aforementioned test:

    * Workqueue stalls are eliminated
    * The maximum writing time among all targets is reduced from 09m37.10s
    to 06m04.85s and the total run time of the test is reduced from
    10m43.591s to 7m19.199s

    [1] https://github.com/jthornber/device-mapper-test-suite

    Signed-off-by: Nikos Tsironis
    Signed-off-by: Ilias Tsitsimpis
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin

    Nikos Tsironis
     

15 Sep, 2018

1 commit

  • [ Upstream commit 784c9a29e99eb40b842c29ecf1cc3a79e00fb629 ]

    It was reported that softlockups occur when using dm-snapshot ontop of
    slow (rbd) storage. E.g.:

    [ 4047.990647] watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [kworker/10:23:26177]
    ...
    [ 4048.034151] Workqueue: kcopyd do_work [dm_mod]
    [ 4048.034156] RIP: 0010:copy_callback+0x41/0x160 [dm_snapshot]
    ...
    [ 4048.034190] Call Trace:
    [ 4048.034196] ? __chunk_is_tracked+0x70/0x70 [dm_snapshot]
    [ 4048.034200] run_complete_job+0x5f/0xb0 [dm_mod]
    [ 4048.034205] process_jobs+0x91/0x220 [dm_mod]
    [ 4048.034210] ? kcopyd_put_pages+0x40/0x40 [dm_mod]
    [ 4048.034214] do_work+0x46/0xa0 [dm_mod]
    [ 4048.034219] process_one_work+0x171/0x370
    [ 4048.034221] worker_thread+0x1fc/0x3f0
    [ 4048.034224] kthread+0xf8/0x130
    [ 4048.034226] ? max_active_store+0x80/0x80
    [ 4048.034227] ? kthread_bind+0x10/0x10
    [ 4048.034231] ret_from_fork+0x35/0x40
    [ 4048.034233] Kernel panic - not syncing: softlockup: hung tasks

    Fix this by calling cond_resched() after run_complete_job()'s callout to
    the dm_kcopyd_notify_fn (which is dm-snap.c:copy_callback in the above
    trace).

    Signed-off-by: John Pittman
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    John Pittman
     

19 Jun, 2017

1 commit

  • When copyying blocks to host-managed zoned block devices, writes must be
    sequential. However, dm_kcopyd_copy() does not guarantee this as writes
    are issued in the completion order of reads, and reads may complete out
    of order despite being issued sequentially.

    Fix this by introducing the DM_KCOPYD_WRITE_SEQ feature flag. This can
    be specified when calling dm_kcopyd_copy() and should be set
    automatically if one of the destinations is a host-managed zoned block
    device. For a split job, the master job maintains the write position at
    which writes must be issued. This is checked with the pop() function
    which is modified to not return any write I/O sub job that is not at the
    correct write position.

    When DM_KCOPYD_WRITE_SEQ is specified for a job, errors cannot be
    ignored and the flag DM_KCOPYD_IGNORE_ERROR is ignored, even if
    specified by the user.

    Signed-off-by: Damien Le Moal
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Bart Van Assche
    Signed-off-by: Mike Snitzer

    Damien Le Moal
     

09 Apr, 2017

1 commit

  • It seems like the code currently passes whatever it was using for writes
    to WRITE SAME. Just switch it to WRITE ZEROES, although that doesn't
    need any payload.

    Untested, and confused by the code, maybe someone who understands it
    better than me can help..

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

11 Jun, 2016

1 commit

  • Add some seperation between bio-based and request-based DM core code.

    'struct mapped_device' and other DM core only structures and functions
    have been moved to dm-core.h and all relevant DM core .c files have been
    updated to include dm-core.h rather than dm.h

    DM targets should _never_ include dm-core.h!

    [block core merge conflict resolution from Stephen Rothwell]
    Signed-off-by: Mike Snitzer
    Signed-off-by: Stephen Rothwell

    Mike Snitzer
     

08 Jun, 2016

2 commits

  • Separate the op from the rq_flag_bits and have dm
    set/get the bio using bio_set_op_attrs/bio_op.

    Signed-off-by: Mike Christie
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Mike Christie
     
  • We currently set REQ_WRITE/WRITE for all non READ IOs
    like discard, flush, writesame, etc. In the next patches where we
    no longer set up the op as a bitmap, we will not be able to
    detect a operation direction like writesame by testing if REQ_WRITE is
    set.

    This has dm use the op_is_write helper which will do the right
    thing.

    Signed-off-by: Mike Christie
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Mike Christie
     

07 Nov, 2015

1 commit

  • …d avoiding waking kswapd

    __GFP_WAIT has been used to identify atomic context in callers that hold
    spinlocks or are in interrupts. They are expected to be high priority and
    have access one of two watermarks lower than "min" which can be referred
    to as the "atomic reserve". __GFP_HIGH users get access to the first
    lower watermark and can be called the "high priority reserve".

    Over time, callers had a requirement to not block when fallback options
    were available. Some have abused __GFP_WAIT leading to a situation where
    an optimisitic allocation with a fallback option can access atomic
    reserves.

    This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
    cannot sleep and have no alternative. High priority users continue to use
    __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
    are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify
    callers that want to wake kswapd for background reclaim. __GFP_WAIT is
    redefined as a caller that is willing to enter direct reclaim and wake
    kswapd for background reclaim.

    This patch then converts a number of sites

    o __GFP_ATOMIC is used by callers that are high priority and have memory
    pools for those requests. GFP_ATOMIC uses this flag.

    o Callers that have a limited mempool to guarantee forward progress clear
    __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
    into this category where kswapd will still be woken but atomic reserves
    are not used as there is a one-entry mempool to guarantee progress.

    o Callers that are checking if they are non-blocking should use the
    helper gfpflags_allow_blocking() where possible. This is because
    checking for __GFP_WAIT as was done historically now can trigger false
    positives. Some exceptions like dm-crypt.c exist where the code intent
    is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
    flag manipulations.

    o Callers that built their own GFP flags instead of starting with GFP_KERNEL
    and friends now also need to specify __GFP_KSWAPD_RECLAIM.

    The first key hazard to watch out for is callers that removed __GFP_WAIT
    and was depending on access to atomic reserves for inconspicuous reasons.
    In some cases it may be appropriate for them to use __GFP_HIGH.

    The second key hazard is callers that assembled their own combination of
    GFP flags instead of starting with something like GFP_KERNEL. They may
    now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
    if it's missed in most cases as other activity will wake kswapd.

    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vitaly Wool <vitalywool@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     

23 Aug, 2013

1 commit

  • dbf2576e37 ("workqueue: make all workqueues non-reentrant") made
    WQ_NON_REENTRANT no-op and the flag is going away. Remove its usages.

    This patch doesn't introduce any behavior changes.

    Signed-off-by: Tejun Heo
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon
    Acked-by: Joe Thornber

    Tejun Heo
     

02 Mar, 2013

1 commit

  • This patch allows the administrator to reduce the rate at which kcopyd
    issues I/O.

    Each module that uses kcopyd acquires a throttle parameter that can be
    set in /sys/module/*/parameters.

    We maintain a history of kcopyd usage by each module in the variables
    io_period and total_period in struct dm_kcopyd_throttle. The actual
    kcopyd activity is calculated as a percentage of time equal to
    "(100 * io_period / total_period)". This is compared with the user-defined
    throttle percentage threshold and if it is exceeded, we sleep.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     

22 Dec, 2012

1 commit

  • Add WRITE SAME support to dm-io and make it accessible to
    dm_kcopyd_zero(). dm_kcopyd_zero() provides an asynchronous interface
    whereas the blkdev_issue_write_same() interface is synchronous.

    WRITE SAME is a SCSI command that can be leveraged for more efficient
    zeroing of a specified logical extent of a device which supports it.
    Only a single zeroed logical block is transfered to the target for each
    WRITE SAME and the target then writes that same block across the
    specified extent.

    The dm thin target uses this.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     

01 Nov, 2011

1 commit

  • This patch introduces dm_kcopyd_zero() to make it easy to use
    kcopyd to write zeros into the requested areas instead
    instead of copying. It is implemented by passing a NULL
    copying source to dm_kcopyd_copy().

    The forthcoming thin provisioning target uses this.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     

24 Oct, 2011

1 commit

  • Fix memory leak introduced by commit a6e50b409d3f9e0833e69c3c9cca822e8fa4adbb
    (dm snapshot: skip reading origin when overwriting complete chunk).

    When allocating a set of jobs from kc->job_pool, job->master_job must be
    set (to point to itself) so that the mempool item gets freed when the
    master_job completes.

    master_job was introduced by commit c6ea41fbbe08f270a8edef99dc369faf809d1bd6
    (dm kcopyd: preallocate sub jobs to avoid deadlock)

    Reported-by: Michael Leun
    Cc: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Alasdair G Kergon
     

02 Aug, 2011

3 commits

  • If we write a full chunk in the snapshot, skip reading the origin device
    because the whole chunk will be overwritten anyway.

    This patch changes the snapshot write logic when a full chunk is written.
    In this case:
    1. allocate the exception
    2. dispatch the bio (but don't report the bio completion to device mapper)
    3. write the exception record
    4. report bio completed

    Callbacks must be done through the kcopyd thread, because callbacks must not
    race with each other. So we create two new functions:

    dm_kcopyd_prepare_callback: allocate a job structure and prepare the callback.
    (This function must not be called from interrupt context.)

    dm_kcopyd_do_callback: submit callback.
    (This function may be called from interrupt context.)

    Performance test (on snapshots with 4k chunk size):
    without the patch:
    non-direct-io sequential write (dd): 17.7MB/s
    direct-io sequential write (dd): 20.9MB/s
    non-direct-io random write (mkfs.ext2): 0.44s

    with the patch:
    non-direct-io sequential write (dd): 26.5MB/s
    direct-io sequential write (dd): 33.2MB/s
    non-direct-io random write (mkfs.ext2): 0.27s

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • The nr_pages field in struct kcopyd_job is only used temporarily in
    run_pages_job() to count the number of required pages.
    We can use a local variable instead.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • The offset field in struct kcopyd_job is always zero so remove it.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     

27 Jul, 2011

1 commit

  • This allows us to move duplicated code in
    (atomic_inc_not_zero() for now) to

    Signed-off-by: Arun Sharma
    Reviewed-by: Eric Dumazet
    Cc: Ingo Molnar
    Cc: David Miller
    Cc: Eric Dumazet
    Acked-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun Sharma
     

29 May, 2011

8 commits

  • Return client directly from dm_kcopyd_client_create, not through a
    parameter, making it consistent with dm_io_client_create.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Reserve just the minimum of pages needed to process one job.

    Because we allocate pages from page allocator, we don't need to reserve
    a large number of pages. The maximum job size is SUB_JOB_SIZE and we
    calculate the number of reserved pages based on this.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Replace the arbitrary calculation of an initial io struct mempool size
    with a constant.

    The code calculated the number of reserved structures based on the request
    size and used a "magic" multiplication constant of 4. This patch changes
    it to reserve a fixed number - itself still chosen quite arbitrarily.
    Further testing might show if there is a better number to choose.

    Note that if there is no memory pressure, we can still allocate an
    arbitrary number of "struct io" structures. One structure is enough to
    process the whole request.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • This patch changes dm-kcopyd so that it allocates pages from the main
    page allocator with __GFP_NOWARN | __GFP_NORETRY flags (so that it can
    fail in case of memory pressure). If the allocation fails, dm-kcopyd
    allocates pages from its own reserve.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Introduce a parameter for gfp flags to alloc_pl() for use in following
    patches.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Remove the spinlock protecting the pages allocation. The spinlock is only
    taken on initialization or from single-threaded workqueue. Therefore, the
    spinlock is useless.

    The spinlock is taken in kcopyd_get_pages and kcopyd_put_pages.

    kcopyd_get_pages is only called from run_pages_job, which is only
    called from process_jobs called from do_work.

    kcopyd_put_pages is called from client_alloc_pages (which is initialization
    function) or from run_complete_job. run_complete_job is only called from
    process_jobs called from do_work.

    Another spinlock, kc->job_lock is taken each time someone pushes or pops
    some work for the worker thread. Once we take kc->job_lock, we
    guarantee that any written memory is visible to the other CPUs.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • There's a possible theoretical deadlock in dm-kcopyd because multiple
    allocations from the same mempool are required to finish a request.
    Avoid this by preallocating sub jobs.

    There is a mempool of 512 entries. Each request requires up to 9
    entries from the mempool. If we have at least 57 concurrent requests
    running, the mempool may overflow and mempool allocations may start
    blocking until another entry is freed to the mempool. Because the same
    thread is used to free entries to the mempool and allocate entries from
    the mempool, this may result in a deadlock.

    This patch changes it so that one mempool entry contains all 9 "struct
    kcopyd_job" required to fulfill the whole request. The allocation is
    done only once in dm_kcopyd_copy and no further mempool allocations are
    done during request processing.

    If dm_kcopyd_copy is not run in the completion thread, this
    implementation is deadlock-free.

    MIN_JOBS needs reducing accordingly and we've chosen to reduce it
    further to 8.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Don't split SUB_JOB_SIZE jobs

    If the job size equals SUB_JOB_SIZE, there is no point in splitting it.
    Splitting it just unnecessarily wastes time, because the split job size
    is SUB_JOB_SIZE too.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     

10 Mar, 2011

2 commits

  • With the plugging now being explicitly controlled by the
    submitter, callers need not pass down unplugging hints
    to the block layer. If they want to unplug, it's because they
    manually plugged on their own - in which case, they should just
    unplug at will.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Code has been converted over to the new explicit on-stack plugging,
    and delay users have been converted to use the new API for that.
    So lets kill off the old plugging along with aops->sync_page().

    Signed-off-by: Jens Axboe

    Jens Axboe
     

14 Jan, 2011

4 commits

  • kmirrord_wq, kcopyd_work and md->wq are created per dm instance and
    serve only a single work item from the dm instance, so non-reentrant
    workqueues would provide the same ordering guarantees as ordered ones
    while allowing CPU affinity and use of the workqueues for other
    purposes. Switch them to non-reentrant workqueues.

    Signed-off-by: Tejun Heo
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Tejun Heo
     
  • Convert all create[_singlethread]_work() users to the new
    alloc[_ordered]_workqueue(). This conversion is mechanical and
    doesn't introduce any behavior change.

    Signed-off-by: Tejun Heo
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Tejun Heo
     
  • Make kcopyd merge more I/O requests by using device unplugging.

    Without this patch, each I/O request is dispatched separately to the device.
    If the device supports tagged queuing, there are many small requests sent
    to the device. To improve performance, this patch will batch as many requests
    as possible, allowing the queue to merge consecutive requests, and send them
    to the device at once.

    In my tests (15k SCSI disk), this patch improves sequential write throughput:

    Sequential write throughput (chunksize of 4k, 32k, 512k)
    unpatched: 15.2, 18.5, 17.5 MB/s
    patched: 14.4, 22.6, 23.0 MB/s

    In most common uses (snapshot or two-way mirror), kcopyd is only used for
    two devices, one for reading and the other for writing, thus this optimization
    is implemented only for two devices. The optimization may be extended to n-way
    mirrors with some code complexity increase.

    We keep track of two block devices to unplug (one for read and the
    other for write) and unplug them when exiting "do_work" thread. If
    there are more devices used (in theory it could happen, in practice it
    is rare), we unplug immediately.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Remove the REQ_SYNC flag to improve write throughput when writing
    to the origin with a snapshot on the same device (using the CFQ I/O
    scheduler).

    Sequential write throughput (chunksize of 4k, 32k, 512k)
    unpatched: 8.5, 8.6, 9.3 MB/s
    patched: 15.2, 18.5, 17.5 MB/s

    Snapshot exception reallocations are triggered by writes that are
    usually async, so mark the associated dm_io_request as async as well.
    This helps when using the CFQ I/O scheduler because it has separate
    queues for sync and async I/O. Async is optimized for throughput; sync
    for latency. With this change we're consciously favoring throughput over
    latency.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     

08 Aug, 2010

1 commit

  • Remove the current bio flags and reuse the request flags for the bio, too.
    This allows to more easily trace the type of I/O from the filesystem
    down to the block driver. There were two flags in the bio that were
    missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've
    renamed two request flags that had a superflous RW in them.

    Note that the flags are in bio.h despite having the REQ_ name - as
    blkdev.h includes bio.h that is the only way to go for now.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

11 Dec, 2009

1 commit

  • dm-kcopyd: accept zero-size jobs

    This patch changes dm-kcopyd so that it accepts zero-size jobs and completes
    them immediatelly via its completion thread.

    It is needed for multisnapshots snapshot resizing. When we are writing to
    a chunk beyond origin end, no copying is done. To simplify the code, we submit
    an empty request to kcopyd and let kcopyd complete it. If we didn't submit
    a request to kcopyd and called the completion routine immediatelly, it would
    violate the principle that completion is called only from one thread and
    it would need additional locking.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     

09 Apr, 2009

2 commits

  • If the thread calling dm_kcopyd_copy is delayed due to scheduling inside
    split_job/segment_complete and the subjobs complete before the loop in
    split_job completes, the kcopyd callback could be invoked from the
    thread that called dm_kcopyd_copy instead of the kcopyd workqueue.

    dm_kcopyd_copy -> split_job -> segment_complete -> job->fn()

    Snapshots depend on the fact that callbacks are called from the singlethreaded
    kcopyd workqueue and expect that there is no racing between individual
    callbacks. The racing between callbacks can lead to corruption of exception
    store and it can also mean that exception store callbacks are called twice
    for the same exception - a likely reason for crashes reported inside
    pending_complete() / remove_exception().

    This patch fixes two problems:

    1. job->fn being called from the thread that submitted the job (see above).

    - Fix: hand over the completion callback to the kcopyd thread.

    2. job->fn(read_err, write_err, job->context); in segment_complete
    reports the error of the last subjob, not the union of all errors.

    - Fix: pass job->write_err to the callback to report all error bits
    (it is done already in run_complete_job)

    Cc: stable@kernel.org
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Use a variable in segment_complete() to point to the dm_kcopyd_client
    struct and only release job->pages in run_complete_job() if any are
    defined. These changes are needed by the next patch.

    Cc: stable@kernel.org
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     

18 Feb, 2009

1 commit


22 Oct, 2008

2 commits

  • Change #include "dm.h" to #include in all targets.
    Targets should not need direct access to internal DM structures.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Write throughput to LVM snapshot origin volume is an order
    of magnitude slower than those to LV without snapshots or
    snapshot target volumes, especially in the case of sequential
    writes with O_SYNC on.

    The following patch originally written by Kevin Jamieson and
    Jan Blunck and slightly modified for the current RCs by myself
    tries to improve the performance by modifying the behaviour
    of kcopyd, so that it pushes back an I/O job to the head of
    the job queue instead of the tail as process_jobs() currently
    does when it has to wait for free pages. This way, write
    requests aren't shuffled to cause extra seeks.

    I tested the patch against 2.6.27-rc5 and got the following results.
    The test is a dd command writing to snapshot origin followed by fsync
    to the file just created/updated. A couple of filesystem benchmarks
    gave me similar results in case of sequential writes, while random
    writes didn't suffer much.

    dd if=/dev/zero of= bs=4096 count=...
    [conv=notrunc when updating]

    1) linux 2.6.27-rc5 without the patch, write to snapshot origin,
    average throughput (MB/s)
    10M 100M 1000M
    create,dd 511.46 610.72 11.81
    create,dd+fsync 7.10 6.77 8.13
    update,dd 431.63 917.41 12.75
    update,dd+fsync 7.79 7.43 8.12

    compared with write throughput to LV without any snapshots,
    all dd+fsync and 1000 MiB writes perform very poorly.

    10M 100M 1000M
    create,dd 555.03 608.98 123.29
    create,dd+fsync 114.27 72.78 76.65
    update,dd 152.34 1267.27 124.04
    update,dd+fsync 130.56 77.81 77.84

    2) linux 2.6.27-rc5 with the patch, write to snapshot origin,
    average throughput (MB/s)

    10M 100M 1000M
    create,dd 537.06 589.44 46.21
    create,dd+fsync 31.63 29.19 29.23
    update,dd 487.59 897.65 37.76
    update,dd+fsync 34.12 30.07 26.85

    Although still not on par with plain LV performance -
    cannot be avoided because it's copy on write anyway -
    this simple patch successfully improves throughtput
    of dd+fsync while not affecting the rest.

    Signed-off-by: Jan Blunck
    Signed-off-by: Kazuo Ito
    Signed-off-by: Alasdair G Kergon
    Cc: stable@kernel.org

    Kazuo Ito
     

25 Apr, 2008

2 commits

  • Remove an avoidable 3ms delay on some dm-raid1 and kcopyd I/O.

    It is specified that any submitted bio without BIO_RW_SYNC flag may plug the
    queue (i.e. block the requests from being dispatched to the physical device).

    The queue is unplugged when the caller calls blk_unplug() function. Usually, the
    sequence is that someone calls submit_bh to submit IO on a buffer. The IO plugs
    the queue and waits (to be possibly joined with other adjacent bios). Then, when
    the caller calls wait_on_buffer(), it unplugs the queue and submits the IOs to
    the disk.

    This was happenning:

    When doing O_SYNC writes, function fsync_buffers_list() submits a list of
    bios to dm_raid1, the bios are added to dm_raid1 write queue and kmirrord is
    woken up.

    fsync_buffers_list() calls wait_on_buffer(). That unplugs the queue, but
    there are no bios on the device queue as they are still in the dm_raid1 queue.

    wait_on_buffer() starts waiting until the IO is finished.

    kmirrord is scheduled, kmirrord takes bios and submits them to the devices.

    The submitted bio plugs the harddisk queue but there is no one to unplug it.
    (The process that called wait_on_buffer() is already sleeping.)

    So there is a 3ms timeout, after which the queues on the harddisks are
    unplugged and requests are processed.

    This 3ms timeout meant that in certain workloads (e.g. O_SYNC, 8kb writes),
    dm-raid1 is 10 times slower than md raid1.

    Every time we submit something asynchronously via dm_io, we must unplug the
    queue actually to send the request to the device.

    This patch adds an unplug call to kmirrord - while processing requests, it keeps
    the queue plugged (so that adjacent bios can be merged); when it finishes
    processing all the bios, it unplugs the queue to submit the bios.

    It also fixes kcopyd which has the same potential problem. All kcopyd requests
    are submitted with BIO_RW_SYNC.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon
    Acked-by: Jens Axboe

    Mikulas Patocka
     
  • Publish the dm-io, dm-log and dm-kcopyd headers in include/linux.

    Signed-off-by: Alasdair G Kergon

    Alasdair G Kergon