block/bfq-iosched.c 237 KB
  // SPDX-License-Identifier: GPL-2.0-or-later
  /*
   * Budget Fair Queueing (BFQ) I/O scheduler.
   *
   * Based on ideas and code from CFQ:
   * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
   *
   * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
   *		      Paolo Valente <paolo.valente@unimore.it>
   *
   * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
   *                    Arianna Avanzini <avanzini@google.com>
   *
   * Copyright (C) 2017 Paolo Valente <paolo.valente@linaro.org>
   *
   * BFQ is a proportional-share I/O scheduler, with some extra
   * low-latency capabilities. BFQ also supports full hierarchical
   * scheduling through cgroups. The next paragraphs provide an
   * introduction to BFQ's inner workings. Details on BFQ benefits, usage
   * and limitations can be found in Documentation/block/bfq-iosched.rst.
   *
   * BFQ is a proportional-share storage-I/O scheduling algorithm based
   * on the slice-by-slice service scheme of CFQ. But BFQ assigns
   * budgets, measured in number of sectors, to processes instead of
   * time slices. The device is not granted to the in-service process
   * for a given time slice, but until it has exhausted its assigned
   * budget. This change from the time to the service domain enables BFQ
   * to distribute the device throughput among processes as desired,
   * without any distortion due to throughput fluctuations, or to device
   * internal queueing. BFQ uses an ad hoc internal scheduler, called
   * B-WF2Q+, to schedule processes according to their budgets. More
   * precisely, BFQ schedules queues associated with processes. Each
   * process/queue is assigned a user-configurable weight, and B-WF2Q+
   * guarantees that each queue receives a fraction of the throughput
   * proportional to its weight. Thanks to the accurate policy of
   * B-WF2Q+, BFQ can afford to assign high budgets to I/O-bound
   * processes issuing sequential requests (to boost the throughput),
   * and yet guarantee a low latency to interactive and soft real-time
   * applications.
   *
   * In particular, to provide these low-latency guarantees, BFQ
   * explicitly privileges the I/O of two classes of time-sensitive
   * applications: interactive and soft real-time. In more detail, BFQ
   * behaves this way if the low_latency parameter is set (default
   * configuration). This feature enables BFQ to provide applications in
   * these classes with a very low latency.
   *
   * To implement this feature, BFQ constantly tries to detect whether
   * the I/O requests in a bfq_queue come from an interactive or a soft
   * real-time application. For brevity, in these cases, the queue is
   * said to be interactive or soft real-time. In both cases, BFQ
   * privileges the service of the queue, over that of non-interactive
   * and non-soft-real-time queues. This privileging is performed,
   * mainly, by raising the weight of the queue. So, for brevity, we
   * simply call weight-raising periods the time periods during which a
   * queue is privileged because it is deemed interactive or soft real-time.
   *
   * The detection of soft real-time queues/applications is described in
   * detail in the comments on the function
   * bfq_bfqq_softrt_next_start. On the other hand, the detection of an
   * interactive queue works as follows: a queue is deemed interactive
   * if it is constantly non empty only for a limited time interval,
   * after which it does become empty. The queue may be deemed
   * interactive again (for a limited time), if it restarts being
   * constantly non empty, provided that this happens only after the
   * queue has remained empty for a given minimum idle time.
   *
   * By default, BFQ computes automatically the above maximum time
   * interval, i.e., the time interval after which a constantly
   * non-empty queue stops being deemed interactive. Since a queue is
   * weight-raised while it is deemed interactive, this maximum time
   * interval happens to coincide with the (maximum) duration of the
   * weight-raising for interactive queues.
   *
   * Finally, BFQ also features additional heuristics for
   * preserving both a low latency and a high throughput on NCQ-capable,
   * rotational or flash-based devices, and to get the job done quickly
   * for applications consisting of many I/O-bound processes.
   *
   * NOTE: if the main or only goal, with a given device, is to achieve
   * the maximum-possible throughput at all times, then do switch off
   * all low-latency heuristics for that device, by setting low_latency
   * to 0.
   *
   * BFQ is described in [1], where also a reference to the initial,
   * more theoretical paper on BFQ can be found. The interested reader
   * can find in the latter paper full details on the main algorithm, as
   * well as formulas of the guarantees and formal proofs of all the
   * properties.  With respect to the version of BFQ presented in these
   * papers, this implementation adds a few more heuristics, such as the
   * ones that guarantee a low latency to interactive and soft real-time
   * applications, and a hierarchical extension based on H-WF2Q+.
   *
   * B-WF2Q+ is based on WF2Q+, which is described in [2], together with
   * H-WF2Q+, while the augmented tree used here to implement B-WF2Q+
   * with O(log N) complexity derives from the one introduced with EEVDF
   * in [3].
   *
   * [1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
   *     Scheduler", Proceedings of the First Workshop on Mobile System
   *     Technologies (MST-2015), May 2015.
   *     http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf
   *
   * [2] Jon C.R. Bennett and H. Zhang, "Hierarchical Packet Fair Queueing
   *     Algorithms", IEEE/ACM Transactions on Networking, 5(5):675-689,
   *     Oct 1997.
   *     http://www.cs.cmu.edu/~hzhang/papers/TON-97-Oct.ps.gz
   *
   * [3] I. Stoica and H. Abdel-Wahab, "Earliest Eligible Virtual Deadline
   *     First: A Flexible and Accurate Mechanism for Proportional Share
   *     Resource Allocation", technical report.
   *     http://www.cs.berkeley.edu/~istoica/papers/eevdf-tr-95.pdf
   */
  #include <linux/module.h>
  #include <linux/slab.h>
  #include <linux/blkdev.h>
  #include <linux/cgroup.h>
  #include <linux/elevator.h>
  #include <linux/ktime.h>
  #include <linux/rbtree.h>
  #include <linux/ioprio.h>
  #include <linux/sbitmap.h>
  #include <linux/delay.h>
  #include <linux/backing-dev.h>
  
  #include "blk.h"
  #include "blk-mq.h"
  #include "blk-mq-tag.h"
  #include "blk-mq-sched.h"
  #include "bfq-iosched.h"
  #include "blk-wbt.h"

  #define BFQ_BFQQ_FNS(name)						\
  void bfq_mark_bfqq_##name(struct bfq_queue *bfqq)			\
  {									\
  	__set_bit(BFQQF_##name, &(bfqq)->flags);			\
  }									\
  void bfq_clear_bfqq_##name(struct bfq_queue *bfqq)			\
  {									\
  	__clear_bit(BFQQF_##name, &(bfqq)->flags);		\
  }									\
  int bfq_bfqq_##name(const struct bfq_queue *bfqq)			\
  {									\
  	return test_bit(BFQQF_##name, &(bfqq)->flags);		\
  }
  BFQ_BFQQ_FNS(just_created);
  BFQ_BFQQ_FNS(busy);
  BFQ_BFQQ_FNS(wait_request);
  BFQ_BFQQ_FNS(non_blocking_wait_rq);
  BFQ_BFQQ_FNS(fifo_expire);
  BFQ_BFQQ_FNS(has_short_ttime);
  BFQ_BFQQ_FNS(sync);
  BFQ_BFQQ_FNS(IO_bound);
  BFQ_BFQQ_FNS(in_large_burst);
  BFQ_BFQQ_FNS(coop);
  BFQ_BFQQ_FNS(split_coop);
  BFQ_BFQQ_FNS(softrt_update);
  BFQ_BFQQ_FNS(has_waker);
  #undef BFQ_BFQQ_FNS
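  /*
   * Illustration only, not part of the scheduler: a minimal user-space
   * sketch of the accessor triple that one BFQ_BFQQ_FNS(name) invocation
   * generates. The names demo_queue, DEMO_FNS and DEMOF_* are
   * hypothetical, and plain bit operations stand in for the kernel's
   * __set_bit(), __clear_bit() and test_bit().
   */

```c
#include <stdbool.h>

/* Hypothetical flag indices, standing in for the BFQQF_* values. */
enum demo_flag { DEMOF_busy, DEMOF_sync };

struct demo_queue {
	unsigned long flags;	/* one bit per flag */
};

/* One invocation emits a mark/clear/test triple for a single flag. */
#define DEMO_FNS(name)							\
static inline void demo_mark_##name(struct demo_queue *q)		\
{									\
	q->flags |= 1UL << DEMOF_##name;				\
}									\
static inline void demo_clear_##name(struct demo_queue *q)		\
{									\
	q->flags &= ~(1UL << DEMOF_##name);				\
}									\
static inline bool demo_is_##name(const struct demo_queue *q)		\
{									\
	return !!(q->flags & (1UL << DEMOF_##name));			\
}

DEMO_FNS(busy)
DEMO_FNS(sync)
#undef DEMO_FNS
```

  /*
   * Token pasting (##) builds both the function names and the flag
   * identifier from the single macro argument, so adding a flag needs
   * only one extra invocation line.
   */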

  /* Expiration time of async (0) and sync (1) requests, in ns. */
  static const u64 bfq_fifo_expire[2] = { NSEC_PER_SEC / 4, NSEC_PER_SEC / 8 };

  /* Maximum backwards seek (magic number lifted from CFQ), in KiB. */
  static const int bfq_back_max = 16 * 1024;

  /* Penalty of a backwards seek, in number of sectors. */
  static const int bfq_back_penalty = 2;

  /* Idling period duration, in ns. */
  static u64 bfq_slice_idle = NSEC_PER_SEC / 125;

  /* Minimum number of assigned budgets for which stats are safe to compute. */
  static const int bfq_stats_min_budgets = 194;

  /* Default maximum budget values, in sectors and number of requests. */
  static const int bfq_default_max_budget = 16 * 1024;

  /*
   * When a sync request is dispatched, the queue that contains that
   * request, and all the ancestor entities of that queue, are charged
   * with the number of sectors of the request. In contrast, if the
   * request is async, then the queue and its ancestor entities are
   * charged with the number of sectors of the request, multiplied by
   * the factor below. This throttles the bandwidth for async I/O,
   * w.r.t. sync I/O, and it is done to counter the tendency of async
   * writes to steal I/O throughput from reads.
   *
   * The current value of this parameter is the result of a tuning with
   * several hardware and software configurations. We tried to find the
   * lowest value for which writes do not cause noticeable problems to
   * reads. In fact, the lower this parameter, the more stable I/O control,
   * in the following respect.  The lower this parameter is, the less
   * the bandwidth enjoyed by a group decreases
   * - when the group does writes, w.r.t. when it does reads;
   * - when other groups do reads, w.r.t. when they do writes.
   */
  static const int bfq_async_charge_factor = 3;
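  /*
   * Illustration only: a minimal sketch of the charging rule described
   * in the comment above. demo_charge() is a hypothetical helper, not
   * the function the scheduler uses, and the real code may apply
   * further conditions before multiplying by the factor.
   */

```c
#include <stdbool.h>

/* Same value as bfq_async_charge_factor above. */
static const int demo_async_charge_factor = 3;

/*
 * Service, in sectors, charged to a queue (and to its ancestors) for
 * a request of nr_sectors sectors: sync requests are charged at face
 * value, async requests are charged the factor times as much.
 */
static inline unsigned long demo_charge(unsigned long nr_sectors, bool is_sync)
{
	return is_sync ? nr_sectors : nr_sectors * demo_async_charge_factor;
}
```

  /*
   * With this rule an 8-sector async write consumes as much budget as
   * a 24-sector sync read, which is how async I/O gets throttled.
   */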

  /* Default timeout values, in jiffies, approximating CFQ defaults. */
  const int bfq_timeout = HZ / 8;

  /*
   * Time limit for merging (see comments in bfq_setup_cooperator). Set
   * to the lowest value that, in our tests, proved to be effective in
   * removing false positives, while not causing true positives to miss
   * queue merging.
   *
   * As can be deduced from the low time limit below, queue merging, if
   * successful, happens at the very beginning of the I/O of the involved
   * cooperating processes, as a consequence of the arrival of the very
   * first requests from each cooperator.  After that, there is very
   * little chance to find cooperators.
   */
  static const unsigned long bfq_merge_time_limit = HZ/10;
  static struct kmem_cache *bfq_pool;

  /* Below this threshold (in ns), we consider thinktime immediate. */
  #define BFQ_MIN_TT		(2 * NSEC_PER_MSEC)

  /* hw_tag detection: parallel requests threshold and min samples needed. */
  #define BFQ_HW_QUEUE_THRESHOLD	3
  #define BFQ_HW_QUEUE_SAMPLES	32

  #define BFQQ_SEEK_THR		(sector_t)(8 * 100)
  #define BFQQ_SECT_THR_NONROT	(sector_t)(2 * 32)
  #define BFQ_RQ_SEEKY(bfqd, last_pos, rq) \
  	(get_sdist(last_pos, rq) >			\
  	 BFQQ_SEEK_THR &&				\
  	 (!blk_queue_nonrot(bfqd->queue) ||		\
  	  blk_rq_sectors(rq) < BFQQ_SECT_THR_NONROT))
  #define BFQQ_CLOSE_THR		(sector_t)(8 * 1024)
  #define BFQQ_SEEKY(bfqq)	(hweight32(bfqq->seek_history) > 19)
  /*
   * Sync random I/O is likely to be confused with soft real-time I/O,
   * because it is characterized by limited throughput and apparently
   * isochronous arrival pattern. To avoid false positives, queues
   * containing only random (seeky) I/O are prevented from being tagged
   * as soft real-time.
   */
  #define BFQQ_TOTALLY_SEEKY(bfqq)	(bfqq->seek_history == -1)
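  /*
   * Illustration only: a user-space sketch of how the two seek-history
   * tests above behave. It assumes, as a labeled assumption, that
   * seek_history is a 32-bit shift register receiving one bit per
   * request (1 if the request was seeky). All demo_* names are
   * hypothetical, and demo_hweight32() stands in for hweight32().
   */

```c
#include <stdbool.h>
#include <stdint.h>

/* Portable population count, standing in for the kernel's hweight32(). */
static inline unsigned int demo_hweight32(uint32_t w)
{
	unsigned int n = 0;

	for (; w; w &= w - 1)	/* each step clears the lowest set bit */
		n++;
	return n;
}

/* Assumed update rule: shift in one bit per request, 1 if seeky. */
static inline uint32_t demo_record(uint32_t history, bool seeky)
{
	return (history << 1) | (seeky ? 1u : 0u);
}

/* Mirrors BFQQ_SEEKY: more than 19 of the last 32 samples were seeky. */
static inline bool demo_seeky(uint32_t history)
{
	return demo_hweight32(history) > 19;
}

/* Mirrors BFQQ_TOTALLY_SEEKY: all 32 samples were seeky (all bits set). */
static inline bool demo_totally_seeky(uint32_t history)
{
	return history == UINT32_MAX;
}
```

  /*
   * Comparing against -1, as BFQQ_TOTALLY_SEEKY does, is the same
   * all-bits-set test once -1 is converted to the unsigned type.
   */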

  /* Min number of samples required to perform peak-rate update */
  #define BFQ_RATE_MIN_SAMPLES	32
  /* Min observation time interval required to perform a peak-rate update (ns) */
  #define BFQ_RATE_MIN_INTERVAL	(300*NSEC_PER_MSEC)
  /* Target observation time interval for a peak-rate update (ns) */
  #define BFQ_RATE_REF_INTERVAL	NSEC_PER_SEC

  /*
   * Shift used for peak-rate fixed precision calculations.
   * With
   * - the current shift: 16 positions
   * - the current type used to store rate: u32
   * - the current unit of measure for rate: [sectors/usec], or, more precisely,
   *   [(sectors/usec) / 2^BFQ_RATE_SHIFT] to take into account the shift,
   * the range of rates that can be stored is
   * [1 / 2^BFQ_RATE_SHIFT, 2^(32 - BFQ_RATE_SHIFT)] sectors/usec =
   * [1 / 2^16, 2^16] sectors/usec = [15e-6, 65536] sectors/usec =
   * [15, 65G] sectors/sec
   * Which, assuming a sector size of 512B, corresponds to a range of
   * [7.5K, 33T] B/sec
   */
  #define BFQ_RATE_SHIFT		16
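  /*
   * Illustration only: a sketch of the fixed-point representation
   * described in the comment above. demo_rate() and
   * demo_rate_to_bytes_per_sec() are hypothetical helpers; the
   * 512-byte sector size is the same assumption the comment makes.
   */

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_RATE_SHIFT	16	/* same value as BFQ_RATE_SHIFT */

/* Rate in sectors/usec, left-shifted by DEMO_RATE_SHIFT, stored in 32 bits. */
static inline uint32_t demo_rate(uint64_t sectors, uint64_t usecs)
{
	assert(usecs > 0);
	return (uint32_t)((sectors << DEMO_RATE_SHIFT) / usecs);
}

/* Convert a stored rate back to bytes per second, with 512-byte sectors. */
static inline uint64_t demo_rate_to_bytes_per_sec(uint32_t rate)
{
	return ((uint64_t)rate * 1000000ULL * 512ULL) >> DEMO_RATE_SHIFT;
}
```

  /*
   * The smallest non-zero stored value, 1, corresponds to one sector
   * every 2^16 usec, i.e., to the lower end of the range given above.
   */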

  /*
   * When configured for computing the duration of the weight-raising
   * for interactive queues automatically (see the comments at the
   * beginning of this file), BFQ does it using the following formula:
   * duration = (ref_rate / r) * ref_wr_duration,
   * where r is the peak rate of the device, and ref_rate and
   * ref_wr_duration are two reference parameters.  In particular,
   * ref_rate is the peak rate of the reference storage device (see
   * below), and ref_wr_duration is about the maximum time needed, with
   * BFQ and while reading two files in parallel, to load typical large
   * applications on the reference device (see the comments on
   * max_service_from_wr below, for more details on how ref_wr_duration
   * is obtained).  In practice, the slower/faster the device at hand
   * is, the more/less it takes to load applications with respect to the
   * reference device.  Accordingly, the longer/shorter BFQ grants
   * weight raising to interactive applications.
   *
   * BFQ uses two different reference pairs (ref_rate, ref_wr_duration),
   * depending on whether the device is rotational or non-rotational.
   *
   * In the following definitions, ref_rate[0] and ref_wr_duration[0]
   * are the reference values for a rotational device, whereas
   * ref_rate[1] and ref_wr_duration[1] are the reference values for a
   * non-rotational device. The reference rates are not the actual peak
   * rates of the devices used as a reference, but slightly lower
   * values. The reason for using slightly lower values is that the
   * peak-rate estimator tends to yield slightly lower values than the
   * actual peak rate (it can yield the actual peak rate only if there
   * is only one process doing I/O, and the process does sequential
   * I/O).
   *
   * The reference peak rates are measured in sectors/usec, left-shifted
   * by BFQ_RATE_SHIFT.
   */
  static int ref_rate[2] = {14000, 33000};
  /*
   * To improve readability, a conversion function is used to initialize
   * the following array, which entails that the array can be
   * initialized only in a function.
   */
  static int ref_wr_duration[2];
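  /*
   * Illustration only: a sketch of the formula
   *   duration = (ref_rate / r) * ref_wr_duration
   * given above. demo_wr_duration() is a hypothetical helper, and the
   * reference duration used in the note below is an invented number;
   * the actual values of ref_wr_duration[] are not shown in this
   * excerpt.
   */

```c
#include <assert.h>
#include <stdint.h>

/*
 * The product is computed first and divided once, so that integer
 * truncation of ref_rate / peak_rate does not distort the result.
 * Both rates must use the same fixed-point unit; the result has the
 * same time unit as ref_wr_duration.
 */
static inline uint64_t demo_wr_duration(uint64_t ref_rate,
					uint64_t ref_wr_duration,
					uint64_t peak_rate)
{
	assert(peak_rate > 0);
	return ref_rate * ref_wr_duration / peak_rate;
}
```

  /*
   * For example, with ref_rate = 14000 and an invented reference
   * duration of 6000 time units, a device twice as fast as the
   * reference (peak rate 28000) gets 3000 units of weight-raising,
   * and a device half as fast (peak rate 7000) gets 12000.
   */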

  /*
   * BFQ uses the above-detailed, time-based weight-raising mechanism to
   * privilege interactive tasks. This mechanism is vulnerable to the
   * following false positives: I/O-bound applications that will go on
   * doing I/O for much longer than the duration of weight
   * raising. These applications have basically no benefit from being
   * weight-raised at the beginning of their I/O. On the opposite end,
   * while being weight-raised, these applications
   * a) unjustly steal throughput from applications that may actually need
   * low latency;
   * b) make BFQ uselessly perform device idling; device idling results
   * in loss of device throughput with most flash-based storage, and may
   * increase latencies when used purposelessly.
   *
   * BFQ tries to reduce these problems, by adopting the following
   * countermeasure. To introduce this countermeasure, we need first to
   * finish explaining how the duration of weight-raising for
   * interactive tasks is computed.
   *
   * For a bfq_queue deemed as interactive, the duration of weight
   * raising is dynamically adjusted, as a function of the estimated
   * peak rate of the device, so as to be equal to the time needed to
   * execute the 'largest' interactive task we benchmarked so far. By
   * largest task, we mean the task for which each involved process has
   * to do more I/O than for any of the other tasks we benchmarked. This
   * reference interactive task is the start-up of LibreOffice Writer,
   * and in this task each process/bfq_queue needs to have at most ~110K
   * sectors transferred.
   *
   * This last piece of information enables BFQ to reduce the actual
   * duration of weight-raising for at least one class of I/O-bound
   * applications: those doing sequential or quasi-sequential I/O. An
   * example is file copy. In fact, once started, the main I/O-bound
   * processes of these applications usually consume the above 110K
   * sectors in much less time than the processes of an application that
   * is starting, because these I/O-bound processes will greedily devote
   * almost all their CPU cycles only to their target,
   * throughput-friendly I/O operations. This is even more true if BFQ
   * happens to be underestimating the device peak rate, and thus
   * overestimating the duration of weight raising. But, according to
   * our measurements, once they have transferred 110K sectors, these
   * processes have no right to be weight-raised any longer.
   *
   * Based on the last consideration, BFQ ends weight-raising for a
   * bfq_queue if the latter happens to have received an amount of
   * service at least equal to the following constant. The constant is
   * set to slightly more than 110K, to have a minimum safety margin.
   *
   * This early ending of weight-raising reduces the amount of time
   * during which interactive false positives cause the two problems
   * described at the beginning of these comments.
   */
  static const unsigned long max_service_from_wr = 120000;
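  /*
   * Illustration only: a sketch of the early-termination rule described
   * above, under the assumption that a per-queue counter accumulates
   * the sectors served since weight-raising started. The demo_wr_state
   * structure and demo_account_service() are hypothetical names.
   */

```c
#include <stdbool.h>

/* Same value as max_service_from_wr above. */
static const unsigned long demo_max_service_from_wr = 120000;

struct demo_wr_state {
	unsigned long service_from_wr;	/* sectors served while weight-raised */
	bool weight_raised;
};

/*
 * Account for newly served sectors and end weight-raising once the
 * queue has received at least demo_max_service_from_wr sectors.
 */
static inline void demo_account_service(struct demo_wr_state *s,
					unsigned long sectors)
{
	if (!s->weight_raised)
		return;

	s->service_from_wr += sectors;
	if (s->service_from_wr >= demo_max_service_from_wr)
		s->weight_raised = false;
}
```

  /*
   * A sequential file copy reaches the threshold quickly and loses its
   * privilege early, whereas an application start-up that needs about
   * 110K sectors stays below the threshold.
   */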
  #define RQ_BIC(rq)		icq_to_bic((rq)->elv.priv[0])
  #define RQ_BFQQ(rq)		((rq)->elv.priv[1])

  struct bfq_queue *bic_to_bfqq(struct bfq_io_cq *bic, bool is_sync)
  {
  	return bic->bfqq[is_sync];
  }
  void bic_set_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq, bool is_sync)
  {
  	bic->bfqq[is_sync] = bfqq;
  }
  struct bfq_data *bic_to_bfqd(struct bfq_io_cq *bic)
  {
  	return bic->icq.q->elevator->elevator_data;
  }

  /**
   * icq_to_bic - convert iocontext queue structure to bfq_io_cq.
   * @icq: the iocontext queue.
   */
  static struct bfq_io_cq *icq_to_bic(struct io_cq *icq)
  {
  	/* bic->icq is the first member, %NULL will convert to %NULL */
  	return container_of(icq, struct bfq_io_cq, icq);
  }
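  /*
   * Illustration only: a user-space sketch of why the conversion above
   * maps NULL to NULL. demo_container_of() is a simplified stand-in
   * for the kernel's container_of(), and the demo_* structures are
   * hypothetical. The NULL case relies on the embedded member sitting
   * at offset 0, as the comment inside icq_to_bic() points out.
   */

```c
#include <stddef.h>

struct demo_icq {
	int dummy;
};

struct demo_bic {
	struct demo_icq icq;	/* must remain the first member */
	int payload;
};

/* Step back from a member pointer to the enclosing structure. */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static inline struct demo_bic *demo_icq_to_bic(struct demo_icq *icq)
{
	/* Offset of icq is 0, so a NULL input yields a NULL output. */
	return demo_container_of(icq, struct demo_bic, icq);
}
```

  /*
   * If icq were ever moved away from the first position, a NULL input
   * would produce a bogus non-NULL pointer, so callers would need an
   * explicit NULL check.
   */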

  /**
   * bfq_bic_lookup - search into @ioc a bic associated to @bfqd.
   * @bfqd: the lookup key.
   * @ioc: the io_context of the process doing I/O.
   * @q: the request queue.
   */
  static struct bfq_io_cq *bfq_bic_lookup(struct bfq_data *bfqd,
  					struct io_context *ioc,
  					struct request_queue *q)
  {
  	if (ioc) {
  		unsigned long flags;
  		struct bfq_io_cq *icq;

  		spin_lock_irqsave(&q->queue_lock, flags);
  		icq = icq_to_bic(ioc_lookup_icq(ioc, q));
  		spin_unlock_irqrestore(&q->queue_lock, flags);

  		return icq;
  	}

  	return NULL;
  }
  /*
   * Schedule a run of the queue if there are requests pending and no
   * request in the driver that will restart queueing.
   */
  void bfq_schedule_dispatch(struct bfq_data *bfqd)
  {
  	if (bfqd->queued != 0) {
  		bfq_log(bfqd, "schedule dispatch");
  		blk_mq_run_hw_queues(bfqd->queue, true);
  	}
  }
  
  #define bfq_class_idle(bfqq)	((bfqq)->ioprio_class == IOPRIO_CLASS_IDLE)
  
  #define bfq_sample_valid(samples)	((samples) > 80)
  
  /*
   * Lifted from AS - choose which of rq1 and rq2 is best served now.
   * We choose the request that is closer to the head right now.  Distance
   * behind the head is penalized and only allowed to a certain extent.
   */
  static struct request *bfq_choose_req(struct bfq_data *bfqd,
  				      struct request *rq1,
  				      struct request *rq2,
  				      sector_t last)
  {
  	sector_t s1, s2, d1 = 0, d2 = 0;
  	unsigned long back_max;
  #define BFQ_RQ1_WRAP	0x01 /* request 1 wraps */
  #define BFQ_RQ2_WRAP	0x02 /* request 2 wraps */
  	unsigned int wrap = 0; /* bit mask: requests behind the disk head? */
  
  	if (!rq1 || rq1 == rq2)
  		return rq2;
  	if (!rq2)
  		return rq1;
  
  	if (rq_is_sync(rq1) && !rq_is_sync(rq2))
  		return rq1;
  	else if (rq_is_sync(rq2) && !rq_is_sync(rq1))
  		return rq2;
  	if ((rq1->cmd_flags & REQ_META) && !(rq2->cmd_flags & REQ_META))
  		return rq1;
  	else if ((rq2->cmd_flags & REQ_META) && !(rq1->cmd_flags & REQ_META))
  		return rq2;
  
  	s1 = blk_rq_pos(rq1);
  	s2 = blk_rq_pos(rq2);
  
  	/*
  	 * By definition, 1KiB is 2 sectors.
  	 */
  	back_max = bfqd->bfq_back_max * 2;
  
  	/*
  	 * Strict one way elevator _except_ in the case where we allow
  	 * short backward seeks which are biased as twice the cost of a
  	 * similar forward seek.
  	 */
  	if (s1 >= last)
  		d1 = s1 - last;
  	else if (s1 + back_max >= last)
  		d1 = (last - s1) * bfqd->bfq_back_penalty;
  	else
  		wrap |= BFQ_RQ1_WRAP;
  
  	if (s2 >= last)
  		d2 = s2 - last;
  	else if (s2 + back_max >= last)
  		d2 = (last - s2) * bfqd->bfq_back_penalty;
  	else
  		wrap |= BFQ_RQ2_WRAP;
  
  	/* Found required data */
  
  	/*
  	 * By doing switch() on the bit mask "wrap" we avoid having to
  	 * check two variables for all permutations: --> faster!
  	 */
  	switch (wrap) {
  	case 0: /* common case for CFQ: rq1 and rq2 not wrapped */
  		if (d1 < d2)
  			return rq1;
  		else if (d2 < d1)
  			return rq2;
  
  		if (s1 >= s2)
  			return rq1;
  		else
  			return rq2;
  
  	case BFQ_RQ2_WRAP:
  		return rq1;
  	case BFQ_RQ1_WRAP:
  		return rq2;
  	case BFQ_RQ1_WRAP|BFQ_RQ2_WRAP: /* both rqs wrapped */
  	default:
  		/*
  		 * Since both rqs are wrapped,
  		 * start with the one that's further behind head
  		 * (--> only *one* back seek required),
  		 * since back seek takes more time than forward.
  		 */
  		if (s1 <= s2)
  			return rq1;
  		else
  			return rq2;
  	}
  }
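  /*
   * Illustration only: a user-space sketch of the per-request distance
   * computation used in bfq_choose_req() above. demo_seek_cost() and
   * demo_sector_t are hypothetical names; the numbers in the note
   * below use the defaults shown earlier in this file (back_max of
   * 16 * 1024 KiB, i.e., 32768 sectors, and a back penalty of 2).
   */

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t demo_sector_t;

/*
 * Cost of reaching sector s from head position last. Returns false
 * if s lies more than back_max sectors behind the head (the
 * "wrapped" case); otherwise stores the cost in *dist.
 */
static inline bool demo_seek_cost(demo_sector_t s, demo_sector_t last,
				  demo_sector_t back_max,
				  unsigned int back_penalty,
				  demo_sector_t *dist)
{
	if (s >= last)
		*dist = s - last;			/* forward seek */
	else if (s + back_max >= last)
		*dist = (last - s) * back_penalty;	/* short backward seek */
	else
		return false;				/* too far behind the head */

	return true;
}
```

  /*
   * With the head at sector 100000, a request 100 sectors ahead costs
   * 100, a request 100 sectors behind costs 200, and a request at
   * sector 1000 is treated as wrapped because it is more than 32768
   * sectors behind the head.
   */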
  /*
   * Async I/O can easily starve sync I/O (both sync reads and sync
   * writes), by consuming all tags. Similarly, storms of sync writes,
   * such as those that sync(2) may trigger, can starve sync reads.
   * Limit depths of async I/O and sync writes so as to counter both
   * problems.
   */
  static void bfq_limit_depth(unsigned int op, struct blk_mq_alloc_data *data)
  {
  	struct bfq_data *bfqd = data->q->elevator->elevator_data;
  
  	if (op_is_sync(op) && !op_is_write(op))
  		return;
  	data->shallow_depth =
  		bfqd->word_depths[!!bfqd->wr_busy_queues][op_is_sync(op)];
  
  	bfq_log(bfqd, "[%s] wr_busy %d sync %d depth %u",
  			__func__, bfqd->wr_busy_queues, op_is_sync(op),
  			data->shallow_depth);
  }
  static struct bfq_queue *
  bfq_rq_pos_tree_lookup(struct bfq_data *bfqd, struct rb_root *root,
  		     sector_t sector, struct rb_node **ret_parent,
  		     struct rb_node ***rb_link)
  {
  	struct rb_node **p, *parent;
  	struct bfq_queue *bfqq = NULL;
  
  	parent = NULL;
  	p = &root->rb_node;
  	while (*p) {
  		struct rb_node **n;
  
  		parent = *p;
  		bfqq = rb_entry(parent, struct bfq_queue, pos_node);
  
  		/*
  		 * Sort strictly based on sector. Smallest to the left,
  		 * largest to the right.
  		 */
  		if (sector > blk_rq_pos(bfqq->next_rq))
  			n = &(*p)->rb_right;
  		else if (sector < blk_rq_pos(bfqq->next_rq))
  			n = &(*p)->rb_left;
  		else
  			break;
  		p = n;
  		bfqq = NULL;
  	}
  
  	*ret_parent = parent;
  	if (rb_link)
  		*rb_link = p;
  
  	bfq_log(bfqd, "rq_pos_tree_lookup %llu: returning %d",
  		(unsigned long long)sector,
  		bfqq ? bfqq->pid : 0);
  
  	return bfqq;
  }
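  /*
   * Illustration only, not part of bfq-iosched.c: a minimal userspace
   * sketch of the "look up, and remember where to link on a miss"
   * pattern used by bfq_rq_pos_tree_lookup() above. It uses a plain
   * unbalanced binary search tree instead of the kernel rbtree, and all
   * sketch_* names are hypothetical.
   */

```c
#include <stddef.h>

struct sketch_node {
	unsigned long long sector;
	struct sketch_node *left, *right;
};

/*
 * Returns the node at exactly this sector, or NULL. In both cases
 * *ret_parent is the last node visited; on a miss, *link is the empty
 * slot where a node with this sector would be attached.
 */
static struct sketch_node *
sketch_pos_lookup(struct sketch_node **root, unsigned long long sector,
		  struct sketch_node **ret_parent, struct sketch_node ***link)
{
	struct sketch_node **p = root, *parent = NULL, *found = NULL;

	while (*p) {
		parent = *p;
		if (sector > parent->sector)
			p = &parent->right;
		else if (sector < parent->sector)
			p = &parent->left;
		else {
			found = parent;
			break;
		}
	}

	*ret_parent = parent;
	if (link)
		*link = p;
	return found;
}

/* Inserts n unless its sector is already taken; returns 1 if inserted. */
static int sketch_pos_insert(struct sketch_node **root, struct sketch_node *n)
{
	struct sketch_node *parent, **link;

	if (sketch_pos_lookup(root, n->sector, &parent, &link))
		return 0;

	n->left = n->right = NULL;
	*link = n;
	return 1;
}
```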
7b8fa3b90   Paolo Valente   block, bfq: let a...
581
582
583
584
585
586
  static bool bfq_too_late_for_merging(struct bfq_queue *bfqq)
  {
  	return bfqq->service_from_backlogged > 0 &&
  		time_is_before_jiffies(bfqq->first_IO_time +
  				       bfq_merge_time_limit);
  }
8cacc5ab3   Paolo Valente   block, bfq: do no...
587
588
589
590
591
592
593
594
595
596
  /*
   * The following function is marked as __cold not because it is
   * actually cold, but for the same performance goal described in the
   * comments on the likely() at the beginning of
   * bfq_setup_cooperator(). Unexpectedly, to reach an even lower
   * execution time for the case where this function is not invoked, we
   * had to add an unlikely() in each involved if().
   */
  void __cold
  bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq)
36eca8948   Arianna Avanzini   block, bfq: add E...
597
598
599
600
601
602
603
604
  {
  	struct rb_node **p, *parent;
  	struct bfq_queue *__bfqq;
  
  	if (bfqq->pos_root) {
  		rb_erase(&bfqq->pos_node, bfqq->pos_root);
  		bfqq->pos_root = NULL;
  	}
32c59e3a9   Paolo Valente   block, bfq: do no...
605
606
607
  	/* oom_bfqq does not participate in queue merging */
  	if (bfqq == &bfqd->oom_bfqq)
  		return;
7b8fa3b90   Paolo Valente   block, bfq: let a...
608
609
610
611
612
613
614
  	/*
  	 * bfqq cannot be merged any longer (see comments in
  	 * bfq_setup_cooperator): no point in adding bfqq into the
  	 * position tree.
  	 */
  	if (bfq_too_late_for_merging(bfqq))
  		return;
36eca8948   Arianna Avanzini   block, bfq: add E...
615
616
617
618
619
620
621
622
623
624
625
626
627
628
  	if (bfq_class_idle(bfqq))
  		return;
  	if (!bfqq->next_rq)
  		return;
  
  	bfqq->pos_root = &bfq_bfqq_to_bfqg(bfqq)->rq_pos_tree;
  	__bfqq = bfq_rq_pos_tree_lookup(bfqd, bfqq->pos_root,
  			blk_rq_pos(bfqq->next_rq), &parent, &p);
  	if (!__bfqq) {
  		rb_link_node(&bfqq->pos_node, parent, p);
  		rb_insert_color(&bfqq->pos_node, bfqq->pos_root);
  	} else
  		bfqq->pos_root = NULL;
  }
aee69d78d   Paolo Valente   block, bfq: intro...
629
  /*
fb53ac6cd   Paolo Valente   block, bfq: do no...
630
631
632
633
634
635
636
637
638
639
640
   * The following function returns false either if every active queue
   * must receive the same share of the throughput (symmetric scenario),
   * or, as a special case, if bfqq must receive a share of the
   * throughput lower than or equal to the share that every other active
   * queue must receive.  If bfqq does sync I/O, then these are the only
   * two cases where bfqq happens to be guaranteed its share of the
   * throughput even if I/O dispatching is not plugged when bfqq remains
   * temporarily empty (for more details, see the comments in the
   * function bfq_better_to_idle()). For this reason, the return value
   * of this function is used to check whether I/O-dispatch plugging can
   * be avoided.
1de0c4cd9   Arianna Avanzini   block, bfq: reduc...
641
   *
fb53ac6cd   Paolo Valente   block, bfq: do no...
642
   * The above first case (symmetric scenario) occurs when:
1de0c4cd9   Arianna Avanzini   block, bfq: reduc...
643
   * 1) all active queues have the same weight,
73d581184   Paolo Valente   block, bfq: consi...
644
   * 2) all active queues belong to the same I/O-priority class,
1de0c4cd9   Arianna Avanzini   block, bfq: reduc...
645
   * 3) all active groups at the same level in the groups tree have the same
73d581184   Paolo Valente   block, bfq: consi...
646
647
   *    weight,
   * 4) all active groups at the same level in the groups tree have the same
1de0c4cd9   Arianna Avanzini   block, bfq: reduc...
648
649
   *    number of children.
   *
2d29c9f89   Federico Motta   block, bfq: impro...
650
651
   * Unfortunately, keeping the necessary state for evaluating exactly
   * the last two symmetry sub-conditions above would be quite complex
73d581184   Paolo Valente   block, bfq: consi...
652
653
   * and time consuming. Therefore this function evaluates, instead,
   * only the following stronger three sub-conditions, for which it is
2d29c9f89   Federico Motta   block, bfq: impro...
654
   * much easier to maintain the needed state:
1de0c4cd9   Arianna Avanzini   block, bfq: reduc...
655
   * 1) all active queues have the same weight,
73d581184   Paolo Valente   block, bfq: consi...
656
657
   * 2) all active queues belong to the same I/O-priority class,
   * 3) there are no active groups.
2d29c9f89   Federico Motta   block, bfq: impro...
658
659
660
   * In particular, the last condition is always true if hierarchical
   * support or the cgroups interface are not enabled, thus no state
   * needs to be maintained in this case.
1de0c4cd9   Arianna Avanzini   block, bfq: reduc...
661
   */
fb53ac6cd   Paolo Valente   block, bfq: do no...
662
663
  static bool bfq_asymmetric_scenario(struct bfq_data *bfqd,
  				   struct bfq_queue *bfqq)
1de0c4cd9   Arianna Avanzini   block, bfq: reduc...
664
  {
fb53ac6cd   Paolo Valente   block, bfq: do no...
665
666
667
668
669
670
671
  	bool smallest_weight = bfqq &&
  		bfqq->weight_counter &&
  		bfqq->weight_counter ==
  		container_of(
  			rb_first_cached(&bfqd->queue_weights_tree),
  			struct bfq_weight_counter,
  			weights_node);
73d581184   Paolo Valente   block, bfq: consi...
672
673
674
675
  	/*
  	 * For queue weights to differ, queue_weights_tree must contain
  	 * at least two nodes.
  	 */
fb53ac6cd   Paolo Valente   block, bfq: do no...
676
677
678
679
  	bool varied_queue_weights = !smallest_weight &&
  		!RB_EMPTY_ROOT(&bfqd->queue_weights_tree.rb_root) &&
  		(bfqd->queue_weights_tree.rb_root.rb_node->rb_left ||
  		 bfqd->queue_weights_tree.rb_root.rb_node->rb_right);
73d581184   Paolo Valente   block, bfq: consi...
680
681
682
683
684
  
  	bool multiple_classes_busy =
  		(bfqd->busy_queues[0] && bfqd->busy_queues[1]) ||
  		(bfqd->busy_queues[0] && bfqd->busy_queues[2]) ||
  		(bfqd->busy_queues[1] && bfqd->busy_queues[2]);
fb53ac6cd   Paolo Valente   block, bfq: do no...
685
  	return varied_queue_weights || multiple_classes_busy
42b1bd33d   Konstantin Khlebnikov   block/bfq: fix if...
686
  #ifdef CONFIG_BFQ_GROUP_IOSCHED
73d581184   Paolo Valente   block, bfq: consi...
687
688
  	       || bfqd->num_groups_with_pending_reqs > 0
  #endif
fb53ac6cd   Paolo Valente   block, bfq: do no...
689
  		;
1de0c4cd9   Arianna Avanzini   block, bfq: reduc...
690
691
692
693
  }
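  /*
   * Illustration only, not part of bfq-iosched.c: a minimal userspace
   * sketch of the three sub-conditions combined by
   * bfq_asymmetric_scenario() above. The struct and function names are
   * hypothetical, and the tree inspection is replaced by a plain count
   * of distinct weights, which is an assumption of this example.
   */

```c
#include <stdbool.h>

struct sketch_state {
	unsigned int distinct_weights;	/* nodes in the weight-counter tree */
	bool bfqq_has_smallest_weight;
	unsigned int busy_queues[3];	/* one counter per I/O-priority class */
	unsigned int groups_with_pending_reqs;	/* 0 without group support */
};

/* True if at least two of the three classes have busy queues. */
static bool sketch_multiple_classes_busy(const unsigned int busy[3])
{
	int classes = 0;

	for (int i = 0; i < 3; i++)
		classes += busy[i] != 0;
	return classes >= 2;
}

static bool sketch_asymmetric(const struct sketch_state *s)
{
	/* Weights can differ only if the tree holds at least two nodes. */
	bool varied = !s->bfqq_has_smallest_weight &&
		      s->distinct_weights >= 2;

	return varied || sketch_multiple_classes_busy(s->busy_queues) ||
	       s->groups_with_pending_reqs > 0;
}
```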
  
  /*
   * If the weight-counter tree passed as input contains no counter for
2d29c9f89   Federico Motta   block, bfq: impro...
694
   * the weight of the input queue, then add that counter; otherwise just
1de0c4cd9   Arianna Avanzini   block, bfq: reduc...
695
696
697
698
699
700
701
702
703
704
   * increment the existing counter.
   *
   * Note that weight-counter trees contain few nodes in mostly symmetric
   * scenarios. For example, if all queues have the same weight, then the
   * weight-counter tree for the queues may contain at most one node.
   * This holds even if low_latency is on, because weight-raised queues
   * are not inserted in the tree.
   * In most scenarios, the rate at which nodes are created/destroyed
   * should be low too.
   */
2d29c9f89   Federico Motta   block, bfq: impro...
705
  void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq,
fb53ac6cd   Paolo Valente   block, bfq: do no...
706
  			  struct rb_root_cached *root)
1de0c4cd9   Arianna Avanzini   block, bfq: reduc...
707
  {
2d29c9f89   Federico Motta   block, bfq: impro...
708
  	struct bfq_entity *entity = &bfqq->entity;
fb53ac6cd   Paolo Valente   block, bfq: do no...
709
710
  	struct rb_node **new = &(root->rb_root.rb_node), *parent = NULL;
  	bool leftmost = true;
1de0c4cd9   Arianna Avanzini   block, bfq: reduc...
711
712
  
  	/*
2d29c9f89   Federico Motta   block, bfq: impro...
713
  	 * Do not insert if the queue is already associated with a
1de0c4cd9   Arianna Avanzini   block, bfq: reduc...
714
  	 * counter, which happens if:
2d29c9f89   Federico Motta   block, bfq: impro...
715
  	 *   1) a request arrival has caused the queue to become both
1de0c4cd9   Arianna Avanzini   block, bfq: reduc...
716
717
718
  	 *      non-weight-raised, and hence change its weight, and
  	 *      backlogged; in this respect, each of the two events
  	 *      causes an invocation of this function,
2d29c9f89   Federico Motta   block, bfq: impro...
719
  	 *   2) this is the invocation of this function caused by the
1de0c4cd9   Arianna Avanzini   block, bfq: reduc...
720
721
722
723
  	 *      second event. This second invocation is actually useless,
  	 *      and we handle this fact by exiting immediately. More
  	 *      efficient or clearer solutions might possibly be adopted.
  	 */
2d29c9f89   Federico Motta   block, bfq: impro...
724
  	if (bfqq->weight_counter)
1de0c4cd9   Arianna Avanzini   block, bfq: reduc...
725
726
727
728
729
730
731
732
733
  		return;
  
  	while (*new) {
  		struct bfq_weight_counter *__counter = container_of(*new,
  						struct bfq_weight_counter,
  						weights_node);
  		parent = *new;
  
  		if (entity->weight == __counter->weight) {
2d29c9f89   Federico Motta   block, bfq: impro...
734
  			bfqq->weight_counter = __counter;
1de0c4cd9   Arianna Avanzini   block, bfq: reduc...
735
736
737
738
  			goto inc_counter;
  		}
  		if (entity->weight < __counter->weight)
  			new = &((*new)->rb_left);
fb53ac6cd   Paolo Valente   block, bfq: do no...
739
  		else {
1de0c4cd9   Arianna Avanzini   block, bfq: reduc...
740
  			new = &((*new)->rb_right);
fb53ac6cd   Paolo Valente   block, bfq: do no...
741
742
  			leftmost = false;
  		}
1de0c4cd9   Arianna Avanzini   block, bfq: reduc...
743
  	}
2d29c9f89   Federico Motta   block, bfq: impro...
744
745
  	bfqq->weight_counter = kzalloc(sizeof(struct bfq_weight_counter),
  				       GFP_ATOMIC);
1de0c4cd9   Arianna Avanzini   block, bfq: reduc...
746
747
748
  
  	/*
  	 * In the unlucky event of an allocation failure, we just
2d29c9f89   Federico Motta   block, bfq: impro...
749
  	 * exit. This will cause the weight of the queue not to be
fb53ac6cd   Paolo Valente   block, bfq: do no...
750
  	 * considered in bfq_asymmetric_scenario, which, in its turn,
73d581184   Paolo Valente   block, bfq: consi...
751
752
753
754
755
756
757
  	 * causes the scenario to be wrongly deemed symmetric in case
  	 * bfqq's weight would have been the only weight making the
  	 * scenario asymmetric.  On the bright side, no imbalance will
  	 * occur when bfqq becomes inactive again (the invocation of
  	 * this function is triggered by an activation of the
  	 * queue).  In fact, bfq_weights_tree_remove does nothing
  	 * if !bfqq->weight_counter.
1de0c4cd9   Arianna Avanzini   block, bfq: reduc...
758
  	 */
2d29c9f89   Federico Motta   block, bfq: impro...
759
  	if (unlikely(!bfqq->weight_counter))
1de0c4cd9   Arianna Avanzini   block, bfq: reduc...
760
  		return;
2d29c9f89   Federico Motta   block, bfq: impro...
761
762
  	bfqq->weight_counter->weight = entity->weight;
  	rb_link_node(&bfqq->weight_counter->weights_node, parent, new);
fb53ac6cd   Paolo Valente   block, bfq: do no...
763
764
  	rb_insert_color_cached(&bfqq->weight_counter->weights_node, root,
  				leftmost);
1de0c4cd9   Arianna Avanzini   block, bfq: reduc...
765
766
  
  inc_counter:
2d29c9f89   Federico Motta   block, bfq: impro...
767
  	bfqq->weight_counter->num_active++;
9dee8b3b0   Paolo Valente   block, bfq: fix q...
768
  	bfqq->ref++;
1de0c4cd9   Arianna Avanzini   block, bfq: reduc...
769
770
771
  }
  
  /*
2d29c9f89   Federico Motta   block, bfq: impro...
772
   * Decrement the weight counter associated with the queue, and, if the
1de0c4cd9   Arianna Avanzini   block, bfq: reduc...
773
774
775
776
   * counter reaches 0, remove the counter from the tree.
   * See the comments to the function bfq_weights_tree_add() for considerations
   * about overhead.
   */
0471559c2   Paolo Valente   block, bfq: add/r...
777
  void __bfq_weights_tree_remove(struct bfq_data *bfqd,
2d29c9f89   Federico Motta   block, bfq: impro...
778
  			       struct bfq_queue *bfqq,
fb53ac6cd   Paolo Valente   block, bfq: do no...
779
  			       struct rb_root_cached *root)
1de0c4cd9   Arianna Avanzini   block, bfq: reduc...
780
  {
2d29c9f89   Federico Motta   block, bfq: impro...
781
  	if (!bfqq->weight_counter)
1de0c4cd9   Arianna Avanzini   block, bfq: reduc...
782
  		return;
2d29c9f89   Federico Motta   block, bfq: impro...
783
784
  	bfqq->weight_counter->num_active--;
  	if (bfqq->weight_counter->num_active > 0)
1de0c4cd9   Arianna Avanzini   block, bfq: reduc...
785
  		goto reset_entity_pointer;
fb53ac6cd   Paolo Valente   block, bfq: do no...
786
  	rb_erase_cached(&bfqq->weight_counter->weights_node, root);
2d29c9f89   Federico Motta   block, bfq: impro...
787
  	kfree(bfqq->weight_counter);
1de0c4cd9   Arianna Avanzini   block, bfq: reduc...
788
789
  
  reset_entity_pointer:
2d29c9f89   Federico Motta   block, bfq: impro...
790
  	bfqq->weight_counter = NULL;
9dee8b3b0   Paolo Valente   block, bfq: fix q...
791
  	bfq_put_queue(bfqq);
1de0c4cd9   Arianna Avanzini   block, bfq: reduc...
792
793
794
  }
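  /*
   * Illustration only, not part of bfq-iosched.c: a minimal userspace
   * sketch of the per-weight counting done by bfq_weights_tree_add()
   * and __bfq_weights_tree_remove() above. A sorted singly linked list
   * stands in for the cached rbtree, and the queue reference counting
   * is left out; all sketch_* names are hypothetical.
   */

```c
#include <stdlib.h>

struct sketch_counter {
	unsigned int weight;
	unsigned int num_active;	/* active queues with this weight */
	struct sketch_counter *next;
};

/* Returns the counter covering weight, or NULL on allocation failure. */
static struct sketch_counter *
sketch_weights_add(struct sketch_counter **head, unsigned int weight)
{
	struct sketch_counter **p = head, *c;

	while (*p && (*p)->weight < weight)
		p = &(*p)->next;

	if (*p && (*p)->weight == weight) {
		(*p)->num_active++;	/* counter exists: just count */
		return *p;
	}

	c = calloc(1, sizeof(*c));
	if (!c)
		return NULL;	/* caller simply goes without a counter */
	c->weight = weight;
	c->num_active = 1;
	c->next = *p;
	*p = c;
	return c;
}

/* Drops one user of c and frees c once nobody uses it any longer. */
static void sketch_weights_remove(struct sketch_counter **head,
				  struct sketch_counter *c)
{
	struct sketch_counter **p = head;

	if (!c || --c->num_active > 0)
		return;

	while (*p && *p != c)
		p = &(*p)->next;
	if (*p) {
		*p = c->next;
		free(c);
	}
}
```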
  
  /*
2d29c9f89   Federico Motta   block, bfq: impro...
795
796
   * Invoke __bfq_weights_tree_remove on bfqq and decrement the number
   * of active groups for each queue's inactive parent entity.
0471559c2   Paolo Valente   block, bfq: add/r...
797
798
799
800
801
   */
  void bfq_weights_tree_remove(struct bfq_data *bfqd,
  			     struct bfq_queue *bfqq)
  {
  	struct bfq_entity *entity = bfqq->entity.parent;
0471559c2   Paolo Valente   block, bfq: add/r...
802
803
804
805
806
807
808
809
810
811
812
  	for_each_entity(entity) {
  		struct bfq_sched_data *sd = entity->my_sched_data;
  
  		if (sd->next_in_service || sd->in_service_entity) {
  			/*
  			 * entity is still active, because either
  			 * next_in_service or in_service_entity is not
  			 * NULL (see the comments on the definition of
  			 * next_in_service for details on why
  			 * in_service_entity must be checked too).
  			 *
2d29c9f89   Federico Motta   block, bfq: impro...
813
814
815
  			 * As a consequence, its parent entities are
  			 * active as well, and thus this loop must
  			 * stop here.
0471559c2   Paolo Valente   block, bfq: add/r...
816
817
818
  			 */
  			break;
  		}
ba7aeae55   Paolo Valente   block, bfq: fix d...
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
  
  		/*
  		 * The decrement of num_groups_with_pending_reqs is
  		 * not performed immediately upon the deactivation of
  		 * entity, but it is delayed to when it also happens
  		 * that the first leaf descendant bfqq of entity gets
  		 * all its pending requests completed. The following
  		 * instructions perform this delayed decrement, if
  		 * needed. See the comments on
  		 * num_groups_with_pending_reqs for details.
  		 */
  		if (entity->in_groups_with_pending_reqs) {
  			entity->in_groups_with_pending_reqs = false;
  			bfqd->num_groups_with_pending_reqs--;
  		}
0471559c2   Paolo Valente   block, bfq: add/r...
834
  	}
9dee8b3b0   Paolo Valente   block, bfq: fix q...
835
836
837
838
839
840
841
842
843
  
  	/*
  	 * Next function is invoked last, because it causes bfqq to be
  	 * freed if the following holds: bfqq is not in service and
  	 * has no dispatched request. DO NOT use bfqq after the next
  	 * function invocation.
  	 */
  	__bfq_weights_tree_remove(bfqd, bfqq,
  				  &bfqd->queue_weights_tree);
0471559c2   Paolo Valente   block, bfq: add/r...
844
845
846
  }
  
  /*
aee69d78d   Paolo Valente   block, bfq: intro...
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
   * Return expired entry, or NULL to just start from scratch in rbtree.
   */
  static struct request *bfq_check_fifo(struct bfq_queue *bfqq,
  				      struct request *last)
  {
  	struct request *rq;
  
  	if (bfq_bfqq_fifo_expire(bfqq))
  		return NULL;
  
  	bfq_mark_bfqq_fifo_expire(bfqq);
  
  	rq = rq_entry_fifo(bfqq->fifo.next);
  
  	if (rq == last || ktime_get_ns() < rq->fifo_time)
  		return NULL;
  
  	bfq_log_bfqq(bfqq->bfqd, bfqq, "check_fifo: returned %p", rq);
  	return rq;
  }
  
  static struct request *bfq_find_next_rq(struct bfq_data *bfqd,
  					struct bfq_queue *bfqq,
  					struct request *last)
  {
  	struct rb_node *rbnext = rb_next(&last->rb_node);
  	struct rb_node *rbprev = rb_prev(&last->rb_node);
  	struct request *next, *prev = NULL;
  
  	/* Follow expired path, else get first next available. */
  	next = bfq_check_fifo(bfqq, last);
  	if (next)
  		return next;
  
  	if (rbprev)
  		prev = rb_entry_rq(rbprev);
  
  	if (rbnext)
  		next = rb_entry_rq(rbnext);
  	else {
  		rbnext = rb_first(&bfqq->sort_list);
  		if (rbnext && rbnext != &last->rb_node)
  			next = rb_entry_rq(rbnext);
  	}
  
  	return bfq_choose_req(bfqd, next, prev, blk_rq_pos(last));
  }
c074170e6   Paolo Valente   block, bfq: add m...
894
  /* see the definition of bfq_async_charge_factor for details */
aee69d78d   Paolo Valente   block, bfq: intro...
895
896
897
  static unsigned long bfq_serv_to_charge(struct request *rq,
  					struct bfq_queue *bfqq)
  {
02a6d787f   Paolo Valente   block, bfq: do no...
898
  	if (bfq_bfqq_sync(bfqq) || bfqq->wr_coeff > 1 ||
fb53ac6cd   Paolo Valente   block, bfq: do no...
899
  	    bfq_asymmetric_scenario(bfqq->bfqd, bfqq))
c074170e6   Paolo Valente   block, bfq: add m...
900
  		return blk_rq_sectors(rq);
d5801088a   Paolo Valente   block, bfq: reduc...
901
  	return blk_rq_sectors(rq) * bfq_async_charge_factor;
aee69d78d   Paolo Valente   block, bfq: intro...
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
  }
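  /*
   * Illustration only, not part of bfq-iosched.c: a minimal userspace
   * sketch of the charging rule in bfq_serv_to_charge() above. The
   * function name and the async_factor parameter are hypothetical; the
   * actual value of bfq_async_charge_factor is defined elsewhere.
   */

```c
#include <stdbool.h>

/*
 * Sync queues, weight-raised queues and queues in an asymmetric
 * scenario are charged the real request size; other async queues are
 * charged the size multiplied by async_factor.
 */
static unsigned long sketch_serv_to_charge(unsigned long sectors,
					   bool sync, unsigned int wr_coeff,
					   bool asymmetric,
					   unsigned int async_factor)
{
	if (sync || wr_coeff > 1 || asymmetric)
		return sectors;

	return sectors * async_factor;
}
```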
  
  /**
   * bfq_updated_next_req - update the queue after a new next_rq selection.
   * @bfqd: the device data the queue belongs to.
   * @bfqq: the queue to update.
   *
   * If the first request of a queue changes we make sure that the queue
   * has enough budget to serve at least its first request (if the
   * request has grown).  We do this because, if the queue does not have
   * enough budget for its first request, it has to go through two
   * dispatch rounds to actually get it dispatched.
   */
  static void bfq_updated_next_req(struct bfq_data *bfqd,
  				 struct bfq_queue *bfqq)
  {
  	struct bfq_entity *entity = &bfqq->entity;
  	struct request *next_rq = bfqq->next_rq;
  	unsigned long new_budget;
  
  	if (!next_rq)
  		return;
  
  	if (bfqq == bfqd->in_service_queue)
  		/*
  		 * In order not to break guarantees, budgets cannot be
  		 * changed after an entity has been selected.
  		 */
  		return;
f3218ad8c   Paolo Valente   block, bfq: make ...
931
932
933
934
  	new_budget = max_t(unsigned long,
  			   max_t(unsigned long, bfqq->max_budget,
  				 bfq_serv_to_charge(next_rq, bfqq)),
  			   entity->service);
aee69d78d   Paolo Valente   block, bfq: intro...
935
936
937
938
  	if (entity->budget != new_budget) {
  		entity->budget = new_budget;
  		bfq_log_bfqq(bfqd, bfqq, "updated next rq: new budget %lu",
  					 new_budget);
80294c3bb   Paolo Valente   block, bfq: make ...
939
  		bfq_requeue_bfqq(bfqd, bfqq, false);
aee69d78d   Paolo Valente   block, bfq: intro...
940
941
  	}
  }
3e2bdd6df   Paolo Valente   block, bfq: check...
942
943
944
945
946
947
  static unsigned int bfq_wr_duration(struct bfq_data *bfqd)
  {
  	u64 dur;
  
  	if (bfqd->bfq_wr_max_time > 0)
  		return bfqd->bfq_wr_max_time;
e24f1c245   Paolo Valente   block, bfq: remov...
948
  	dur = bfqd->rate_dur_prod;
3e2bdd6df   Paolo Valente   block, bfq: check...
949
950
951
  	do_div(dur, bfqd->peak_rate);
  
  	/*
d450542e3   Davide Sapienza   block, bfq: incre...
952
953
954
955
956
957
958
959
960
  	 * Limit duration between 3 and 25 seconds. The upper limit
  	 * has been conservatively set after the following worst case:
  	 * on a QEMU/KVM virtual machine
  	 * - running in a slow PC
  	 * - with a virtual disk stacked on a slow low-end 5400rpm HDD
  	 * - serving a heavy I/O workload, such as the sequential reading
  	 *   of several files
  	 * mplayer took 23 seconds to start, if constantly weight-raised.
  	 *
636b8fe86   Angelo Ruocco   block, bfq: fix s...
961
  	 * As for values higher than the one accommodating the above bad
d450542e3   Davide Sapienza   block, bfq: incre...
962
963
964
965
  	 * scenario, tests show that higher values would often yield
  	 * the opposite of the desired result, i.e., would worsen
  	 * responsiveness by allowing non-interactive applications to
  	 * preserve weight raising for too long.
3e2bdd6df   Paolo Valente   block, bfq: check...
966
967
968
969
970
  	 *
  	 * On the other hand, values lower than 3 seconds make it
  	 * difficult for most interactive tasks to complete their jobs
  	 * before weight-raising finishes.
  	 */
d450542e3   Davide Sapienza   block, bfq: incre...
971
  	return clamp_val(dur, msecs_to_jiffies(3000), msecs_to_jiffies(25000));
3e2bdd6df   Paolo Valente   block, bfq: check...
972
973
974
975
976
977
978
979
980
981
  }
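  /*
   * Illustration only, not part of bfq-iosched.c: a minimal userspace
   * sketch of the computation in bfq_wr_duration() above, working in
   * milliseconds rather than jiffies. The sketch_* names, the unit
   * choice and the zero-rate guard are assumptions of this example.
   */

```c
#include <stdint.h>

#define SKETCH_MIN_WR_MS 3000u
#define SKETCH_MAX_WR_MS 25000u

static uint64_t sketch_wr_duration_ms(uint64_t max_time_ms,
				      uint64_t rate_dur_prod,
				      uint64_t peak_rate)
{
	uint64_t dur;

	/* An explicitly configured duration wins over the estimate. */
	if (max_time_ms > 0)
		return max_time_ms;

	/* Guard added for the sketch only, to avoid dividing by zero. */
	if (peak_rate == 0)
		return SKETCH_MAX_WR_MS;

	/* Faster device, shorter weight-raising period. */
	dur = rate_dur_prod / peak_rate;

	if (dur < SKETCH_MIN_WR_MS)
		return SKETCH_MIN_WR_MS;
	if (dur > SKETCH_MAX_WR_MS)
		return SKETCH_MAX_WR_MS;
	return dur;
}
```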
  
  /* switch back from soft real-time to interactive weight raising */
  static void switch_back_to_interactive_wr(struct bfq_queue *bfqq,
  					  struct bfq_data *bfqd)
  {
  	bfqq->wr_coeff = bfqd->bfq_wr_coeff;
  	bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
  	bfqq->last_wr_start_finish = bfqq->wr_start_at_switch_to_srt;
  }
36eca8948   Arianna Avanzini   block, bfq: add E...
982
  static void
13c931bd9   Paolo Valente   block, bfq: updat...
983
984
  bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_data *bfqd,
  		      struct bfq_io_cq *bic, bool bfq_already_existing)
36eca8948   Arianna Avanzini   block, bfq: add E...
985
  {
13c931bd9   Paolo Valente   block, bfq: updat...
986
987
  	unsigned int old_wr_coeff = bfqq->wr_coeff;
  	bool busy = bfq_already_existing && bfq_bfqq_busy(bfqq);
d5be3fefc   Paolo Valente   block,bfq: refact...
988
989
  	if (bic->saved_has_short_ttime)
  		bfq_mark_bfqq_has_short_ttime(bfqq);
36eca8948   Arianna Avanzini   block, bfq: add E...
990
  	else
d5be3fefc   Paolo Valente   block,bfq: refact...
991
  		bfq_clear_bfqq_has_short_ttime(bfqq);
36eca8948   Arianna Avanzini   block, bfq: add E...
992
993
994
995
996
  
  	if (bic->saved_IO_bound)
  		bfq_mark_bfqq_IO_bound(bfqq);
  	else
  		bfq_clear_bfqq_IO_bound(bfqq);
fffca087d   Francesco Pollicino   block, bfq: save ...
997
  	bfqq->entity.new_weight = bic->saved_weight;
36eca8948   Arianna Avanzini   block, bfq: add E...
998
999
1000
1001
1002
  	bfqq->ttime = bic->saved_ttime;
  	bfqq->wr_coeff = bic->saved_wr_coeff;
  	bfqq->wr_start_at_switch_to_srt = bic->saved_wr_start_at_switch_to_srt;
  	bfqq->last_wr_start_finish = bic->saved_last_wr_start_finish;
  	bfqq->wr_cur_max_time = bic->saved_wr_cur_max_time;
e1b2324dd   Arianna Avanzini   block, bfq: handl...
1003
  	if (bfqq->wr_coeff > 1 && (bfq_bfqq_in_large_burst(bfqq) ||
36eca8948   Arianna Avanzini   block, bfq: add E...
1004
  	    time_is_before_jiffies(bfqq->last_wr_start_finish +
e1b2324dd   Arianna Avanzini   block, bfq: handl...
1005
  				   bfqq->wr_cur_max_time))) {
3e2bdd6df   Paolo Valente   block, bfq: check...
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
  		if (bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time &&
  		    !bfq_bfqq_in_large_burst(bfqq) &&
  		    time_is_after_eq_jiffies(bfqq->wr_start_at_switch_to_srt +
  					     bfq_wr_duration(bfqd))) {
  			switch_back_to_interactive_wr(bfqq, bfqd);
  		} else {
  			bfqq->wr_coeff = 1;
  			bfq_log_bfqq(bfqq->bfqd, bfqq,
  				     "resume state: switching off wr");
  		}
36eca8948   Arianna Avanzini   block, bfq: add E...
1016
1017
1018
1019
  	}
  
  	/* make sure weight will be updated, however we got here */
  	bfqq->entity.prio_changed = 1;
13c931bd9   Paolo Valente   block, bfq: updat...
1020
1021
1022
1023
1024
1025
1026
1027
  
  	if (likely(!busy))
  		return;
  
  	if (old_wr_coeff == 1 && bfqq->wr_coeff > 1)
  		bfqd->wr_busy_queues++;
  	else if (old_wr_coeff > 1 && bfqq->wr_coeff == 1)
  		bfqd->wr_busy_queues--;
36eca8948   Arianna Avanzini   block, bfq: add E...
1028
1029
1030
1031
  }
  
  static int bfqq_process_refs(struct bfq_queue *bfqq)
  {
33a16a980   Paolo Valente   block, bfq: exten...
1032
  	return bfqq->ref - bfqq->allocated - bfqq->entity.on_st_or_in_serv -
9dee8b3b0   Paolo Valente   block, bfq: fix q...
1033
  		(bfqq->weight_counter != NULL);
36eca8948   Arianna Avanzini   block, bfq: add E...
1034
  }
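  /*
   * Illustration only, not part of bfq-iosched.c: a minimal userspace
   * sketch of the arithmetic in bfqq_process_refs() above. The struct
   * and its field names are hypothetical. The reading that the
   * remainder counts references held by processes is an interpretation
   * based on the function name; the weight-counter term matches the
   * bfqq->ref++ done in bfq_weights_tree_add().
   */

```c
struct sketch_queue {
	int ref;		/* all references to the queue */
	int allocated;		/* references accounted to requests */
	int on_st_or_in_serv;	/* 0 or 1: reference held by the scheduler */
	int has_weight_counter;	/* 0 or 1: reference held via the counter */
};

/* References left once the internal ones are subtracted. */
static int sketch_process_refs(const struct sketch_queue *q)
{
	return q->ref - q->allocated - q->on_st_or_in_serv -
	       (q->has_weight_counter != 0);
}
```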
e1b2324dd   Arianna Avanzini   block, bfq: handl...
1035
1036
1037
1038
1039
1040
1041
1042
  /* Empty burst list and add just bfqq (see comments on bfq_handle_burst) */
  static void bfq_reset_burst_list(struct bfq_data *bfqd, struct bfq_queue *bfqq)
  {
  	struct bfq_queue *item;
  	struct hlist_node *n;
  
  	hlist_for_each_entry_safe(item, n, &bfqd->burst_list, burst_list_node)
  		hlist_del_init(&item->burst_list_node);
84a746891   Paolo Valente   block, bfq: alway...
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
  
  	/*
  	 * Start the creation of a new burst list only if there is no
  	 * active queue. See comments on the conditional invocation of
  	 * bfq_handle_burst().
  	 */
  	if (bfq_tot_busy_queues(bfqd) == 0) {
  		hlist_add_head(&bfqq->burst_list_node, &bfqd->burst_list);
  		bfqd->burst_size = 1;
  	} else
  		bfqd->burst_size = 0;
e1b2324dd   Arianna Avanzini   block, bfq: handl...
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
  	bfqd->burst_parent_entity = bfqq->entity.parent;
  }
  
  /* Add bfqq to the list of queues in current burst (see bfq_handle_burst) */
  static void bfq_add_to_burst(struct bfq_data *bfqd, struct bfq_queue *bfqq)
  {
  	/* Increment burst size to take into account also bfqq */
  	bfqd->burst_size++;
  
  	if (bfqd->burst_size == bfqd->bfq_large_burst_thresh) {
  		struct bfq_queue *pos, *bfqq_item;
  		struct hlist_node *n;
  
  		/*
  		 * Enough queues have been activated shortly after each
  		 * other to consider this burst as large.
  		 */
  		bfqd->large_burst = true;
  
  		/*
  		 * We can now mark all queues in the burst list as
  		 * belonging to a large burst.
  		 */
  		hlist_for_each_entry(bfqq_item, &bfqd->burst_list,
  				     burst_list_node)
  			bfq_mark_bfqq_in_large_burst(bfqq_item);
  		bfq_mark_bfqq_in_large_burst(bfqq);
  
  		/*
  		 * From now on, and until the current burst finishes, any
  		 * new queue being activated shortly after the last queue
  		 * was inserted in the burst can be immediately marked as
  		 * belonging to a large burst. So the burst list is not
  		 * needed any more. Remove it.
  		 */
  		hlist_for_each_entry_safe(pos, n, &bfqd->burst_list,
  					  burst_list_node)
  			hlist_del_init(&pos->burst_list_node);
  	} else /*
  		* Burst not yet large: add bfqq to the burst list. Do
  		* not increment the ref counter for bfqq, because bfqq
  		* is removed from the burst list before freeing bfqq
  		* in put_queue.
  		*/
  		hlist_add_head(&bfqq->burst_list_node, &bfqd->burst_list);
  }
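  /*
   * Illustration only, not part of bfq-iosched.c: a minimal userspace
   * sketch of the burst bookkeeping done by bfq_add_to_burst() above,
   * plus the immediate marking that applies once the burst is already
   * large. Queues are plain integer ids, the burst list is a fixed
   * array, and the "shortly after" timing check is left out; all
   * sketch_* names are hypothetical.
   */

```c
#include <stdbool.h>

#define SKETCH_MAX_BURST 64

struct sketch_burst {
	int members[SKETCH_MAX_BURST];	/* ids waiting in the burst list */
	int list_len;
	int size;		/* queues counted in the current burst */
	int large_thresh;	/* size at which the burst becomes large */
	bool large;
};

/* Returns true if queue id ends up marked as part of a large burst. */
static bool sketch_add_to_burst(struct sketch_burst *b, int id,
				bool in_large_burst[])
{
	/* Burst already large: mark the newcomer straight away. */
	if (b->large) {
		in_large_burst[id] = true;
		return true;
	}

	b->size++;
	if (b->size == b->large_thresh) {
		b->large = true;
		/* Mark everything collected so far, plus the newcomer. */
		for (int i = 0; i < b->list_len; i++)
			in_large_burst[b->members[i]] = true;
		in_large_burst[id] = true;
		/* The list has served its purpose: empty it. */
		b->list_len = 0;
		return true;
	}

	/* Not large yet: remember id in case the burst grows. */
	if (b->list_len < SKETCH_MAX_BURST)
		b->members[b->list_len++] = id;
	return false;
}
```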
  
  /*
   * If many queues belonging to the same group happen to be created
   * shortly after each other, then the processes associated with these
   * queues typically have a common goal. In particular, bursts of queue
   * creations are usually caused by services or applications that spawn
   * many parallel threads/processes. Examples are systemd during boot,
   * or git grep. To help these processes get their job done as soon as
   * possible, it is usually better to not grant either weight-raising
84a746891   Paolo Valente   block, bfq: alway...
1109
1110
   * or device idling to their queues, unless these queues must be
   * protected from the I/O flowing through other active queues.
e1b2324dd   Arianna Avanzini   block, bfq: handl...
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
   *
   * In this comment we describe, firstly, the reasons why this fact
   * holds, and, secondly, the next function, which implements the main
   * steps needed to properly mark these queues so that they can then be
   * treated in a different way.
   *
   * The above services or applications benefit mostly from a high
   * throughput: the quicker the requests of the activated queues are
   * cumulatively served, the sooner the target job of these queues gets
   * completed. As a consequence, weight-raising any of these queues,
   * which also implies idling the device for it, is almost always
84a746891   Paolo Valente   block, bfq: alway...
1122
1123
1124
1125
   * counterproductive, unless there are other active queues to isolate
   * these new queues from. If there are no other active queues, then
   * weight-raising these new queues just lowers throughput in most
   * cases.
e1b2324dd   Arianna Avanzini   block, bfq: handl...
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
   *
   * On the other hand, a burst of queue creations may be caused also by
   * the start of an application that does not consist of a lot of
   * parallel I/O-bound threads. In fact, with a complex application,
   * several short processes may need to be executed to start up the
   * application. In this respect, to start an application as quickly as
   * possible, the best thing to do is in any case to privilege the I/O
   * related to the application with respect to all other
   * I/O. Therefore, the best strategy to start as quickly as possible
   * an application that causes a burst of queue creations is to
   * weight-raise all the queues created during the burst. This is the
   * exact opposite of the best strategy for the other type of bursts.
   *
   * In the end, to take the best action for each of the two cases, the
   * two types of bursts need to be distinguished. Fortunately, this
   * seems relatively easy, by looking at the sizes of the bursts. In
   * particular, we found a threshold such that only bursts with a
   * larger size than that threshold are apparently caused by
   * services or commands such as systemd or git grep. For brevity,
   * hereafter we simply call these bursts 'large'. BFQ *does not*
   * weight-raise queues whose creation occurs in a large burst. In
   * addition, for each of these queues BFQ performs or does not perform
   * idling depending on which choice boosts the throughput more. The
   * exact choice depends on the device and request pattern at
   * hand.
   *
   * Unfortunately, false positives may occur while an interactive task
   * is starting (e.g., an application is being started). The
   * consequence is that the queues associated with the task do not
   * enjoy weight raising as expected. Fortunately these false positives
   * are very rare. They typically occur if some service happens to
   * start doing I/O exactly when the interactive task starts.
   *
84a746891   Paolo Valente   block, bfq: alway...
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
   * Turning back to the next function, it is invoked only if there are
   * no active queues (apart from active queues that would belong to the
   * same possible burst as bfqq), and it implements all
   * the steps needed to detect the occurrence of a large burst and to
   * properly mark all the queues belonging to it (so that they can then
   * be treated in a different way). This goal is achieved by
   * maintaining a "burst list" that holds, temporarily, the queues that
   * belong to the burst in progress. The list is then used to mark
   * these queues as belonging to a large burst if the burst does become
   * large. The main steps are the following.
   *
   * . when the very first queue is created, the queue is inserted into the
   *   list (as it could be the first queue in a possible burst)
   *
   * . if the current burst has not yet become large, and a queue Q that does
   *   not yet belong to the burst is activated shortly after the last time
   *   at which a new queue entered the burst list, then the function appends
   *   Q to the burst list
   *
   * . if, as a consequence of the previous step, the burst size reaches
   *   the large-burst threshold, then
   *
   *     . all the queues in the burst list are marked as belonging to a
   *       large burst
   *
   *     . the burst list is deleted; in fact, the burst list has already
   *       served its purpose (temporarily keeping track of the queues in
   *       a burst, so as to be able to mark them as belonging to a large
   *       burst in the previous sub-step), and is no longer needed
   *
   *     . the device enters a large-burst mode
   *
   * . if a queue Q that does not belong to the burst is created while
   *   the device is in large-burst mode and shortly after the last time
   *   at which a queue either entered the burst list or was marked as
   *   belonging to the current large burst, then Q is immediately marked
   *   as belonging to a large burst.
   *
   * . if a queue Q that does not belong to the burst is created well
   *   after (i.e., not shortly after) the last time at which a queue
   *   either entered the burst list or was marked as belonging to the
   *   current large burst, then the current burst is deemed finished and:
   *
   *        . the large-burst mode is reset if set
   *
   *        . the burst list is emptied
   *
   *        . Q is inserted in the burst list, as Q may be the first queue
   *          in a possible new burst (then the burst list contains just Q
   *          after this step).
   */
  static void bfq_handle_burst(struct bfq_data *bfqd, struct bfq_queue *bfqq)
  {
  	/*
  	 * If bfqq is already in the burst list or is part of a large
  	 * burst, or finally has just been split, then there is
  	 * nothing else to do.
  	 */
  	if (!hlist_unhashed(&bfqq->burst_list_node) ||
  	    bfq_bfqq_in_large_burst(bfqq) ||
  	    time_is_after_eq_jiffies(bfqq->split_time +
  				     msecs_to_jiffies(10)))
  		return;
  
  	/*
  	 * If bfqq's creation happens late enough, or bfqq belongs to
  	 * a different group than the burst group, then the current
  	 * burst is finished, and related data structures must be
  	 * reset.
  	 *
  	 * In this respect, consider the special case where bfqq is
  	 * the very first queue created after BFQ is selected for this
  	 * device. In this case, last_ins_in_burst and
  	 * burst_parent_entity are not yet significant when we get
  	 * here. But it is easy to verify that, whether or not the
  	 * following condition is true, bfqq will end up being
  	 * inserted into the burst list. In particular the list will
  	 * happen to contain only bfqq. And this is exactly what has
  	 * to happen, as bfqq may be the first queue of the first
  	 * burst.
  	 */
  	if (time_is_before_jiffies(bfqd->last_ins_in_burst +
  	    bfqd->bfq_burst_interval) ||
  	    bfqq->entity.parent != bfqd->burst_parent_entity) {
  		bfqd->large_burst = false;
  		bfq_reset_burst_list(bfqd, bfqq);
  		goto end;
  	}
  
  	/*
  	 * If we get here, then bfqq is being activated shortly after the
  	 * last queue. So, if the current burst is also large, we can mark
  	 * bfqq as belonging to this large burst immediately.
  	 */
  	if (bfqd->large_burst) {
  		bfq_mark_bfqq_in_large_burst(bfqq);
  		goto end;
  	}
  
  	/*
  	 * If we get here, then a large-burst state has not yet been
  	 * reached, but bfqq is being activated shortly after the last
  	 * queue. Then we add bfqq to the burst.
  	 */
  	bfq_add_to_burst(bfqd, bfqq);
  end:
  	/*
  	 * At this point, bfqq either has been added to the current
  	 * burst or has caused the current burst to terminate and a
  	 * possible new burst to start. In particular, in the second
  	 * case, bfqq has become the first queue in the possible new
  	 * burst.  In both cases last_ins_in_burst needs to be moved
  	 * forward.
  	 */
  	bfqd->last_ins_in_burst = jiffies;
  }
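
  /*
   * Illustrative sketch, not part of BFQ: the burst-detection steps
   * described above, reduced to a small user-space state machine. All
   * names (sketch_burst_dev, sketch_queue, sketch_handle_burst, ...)
   * and the fixed-size array standing in for the kernel hlist are
   * assumptions made for illustration only. Group membership
   * (burst_parent_entity), the split_time check and jiffies
   * wraparound are deliberately left out.
   */

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_QUEUES 16	/* hypothetical cap on the burst list */

struct sketch_queue {
	bool in_burst_list;
	bool in_large_burst;
};

struct sketch_burst_dev {
	struct sketch_queue *list[SKETCH_MAX_QUEUES];	/* the "burst list" */
	size_t list_len;
	size_t burst_size;		/* queues seen in the current burst */
	size_t large_thresh;		/* large-burst threshold */
	bool large_burst;		/* device is in large-burst mode */
	unsigned long last_ins;		/* time of last insertion, in ticks */
	unsigned long burst_interval;	/* what "shortly after" means */
};

/* Empty the list and restart it with q as its only member. */
static void sketch_reset_burst_list(struct sketch_burst_dev *d,
				    struct sketch_queue *q)
{
	for (size_t i = 0; i < d->list_len; i++)
		d->list[i]->in_burst_list = false;
	d->list[0] = q;
	q->in_burst_list = true;
	d->list_len = 1;
	d->burst_size = 1;
}

/* Add q to the burst; on reaching the threshold, mark every queue. */
static void sketch_add_to_burst(struct sketch_burst_dev *d,
				struct sketch_queue *q)
{
	d->burst_size++;
	if (d->burst_size == d->large_thresh) {
		d->large_burst = true;
		for (size_t i = 0; i < d->list_len; i++) {
			d->list[i]->in_large_burst = true;
			d->list[i]->in_burst_list = false;
		}
		q->in_large_burst = true;
		d->list_len = 0;	/* the list has served its purpose */
	} else {
		assert(d->list_len < SKETCH_MAX_QUEUES);
		d->list[d->list_len++] = q;
		q->in_burst_list = true;
	}
}

static void sketch_handle_burst(struct sketch_burst_dev *d,
				struct sketch_queue *q, unsigned long now)
{
	if (q->in_burst_list || q->in_large_burst)
		return;

	if (now - d->last_ins > d->burst_interval) {
		/* q arrived late: the current burst is over */
		d->large_burst = false;
		sketch_reset_burst_list(d, q);
	} else if (d->large_burst) {
		q->in_large_burst = true;
	} else {
		sketch_add_to_burst(d, q);
	}
	d->last_ins = now;
}
```

  /*
   * In this sketch, with a threshold of 3 and an interval of 10 ticks,
   * three queues arriving at ticks 100, 105 and 108 are all marked as
   * belonging to a large burst, while a queue arriving at tick 200
   * starts a possible new burst.
   */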
  static int bfq_bfqq_budget_left(struct bfq_queue *bfqq)
  {
  	struct bfq_entity *entity = &bfqq->entity;
  
  	return entity->budget - entity->service;
  }
  
  /*
   * If enough samples have been computed, return the current max budget
   * stored in bfqd, which is dynamically updated according to the
   * estimated disk peak rate; otherwise return the default max budget
   */
  static int bfq_max_budget(struct bfq_data *bfqd)
  {
  	if (bfqd->budgets_assigned < bfq_stats_min_budgets)
  		return bfq_default_max_budget;
  	else
  		return bfqd->bfq_max_budget;
  }
  
  /*
   * Return min budget, which is a fraction of the current or default
   * max budget (trying with 1/32)
   */
  static int bfq_min_budget(struct bfq_data *bfqd)
  {
  	if (bfqd->budgets_assigned < bfq_stats_min_budgets)
  		return bfq_default_max_budget / 32;
  	else
  		return bfqd->bfq_max_budget / 32;
  }
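
  /*
   * Illustrative sketch, not part of BFQ: how the two helpers above
   * relate. Until enough budgets have been assigned, both fall back to
   * a default maximum; afterwards they follow the auto-tuned maximum,
   * and the minimum is always 1/32 of whichever maximum is in use. The
   * struct, the function names and the two constant values below are
   * assumptions chosen only for illustration.
   */

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical values, chosen only for illustration. */
#define SKETCH_DEFAULT_MAX_BUDGET	(16 * 1024)	/* sectors */
#define SKETCH_STATS_MIN_BUDGETS	194		/* samples */

struct sketch_budget_dev {
	int budgets_assigned;	/* number of budgets handed out so far */
	int max_budget;		/* auto-tuned estimate, in sectors */
};

/* Use the auto-tuned maximum only once enough samples exist. */
static int sketch_max_budget(const struct sketch_budget_dev *d)
{
	assert(d != NULL);
	if (d->budgets_assigned < SKETCH_STATS_MIN_BUDGETS)
		return SKETCH_DEFAULT_MAX_BUDGET;
	return d->max_budget;
}

/* The minimum budget is a fixed fraction (1/32) of the maximum. */
static int sketch_min_budget(const struct sketch_budget_dev *d)
{
	return sketch_max_budget(d) / 32;
}
```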
  /*
   * The next function, invoked after the input queue bfqq switches from
   * idle to busy, updates the budget of bfqq. The function also tells
   * whether the in-service queue should be expired, by returning
   * true. The purpose of expiring the in-service queue is to give bfqq
   * the chance to possibly preempt the in-service queue, and the reason
   * for preempting the in-service queue is to achieve one of the two
   * goals below.
   *
   * 1. Guarantee to bfqq its reserved bandwidth even if bfqq has
   * expired because it has remained idle. In particular, bfqq may have
   * expired for one of the following two reasons:
   *
   * - BFQQE_NO_MORE_REQUESTS bfqq did not enjoy any device idling
   *   and did not manage to issue a new request before its last
   *   request was served;
   *
   * - BFQQE_TOO_IDLE bfqq did enjoy device idling, but did not issue
   *   a new request before the expiration of the idling-time.
   *
   * Even if bfqq has expired for one of the above reasons, the process
   * associated with the queue may nevertheless be issuing requests greedily,
   * and thus be sensitive to the bandwidth it receives (bfqq may have
   * remained idle for other reasons: CPU high load, bfqq not enjoying
   * idling, I/O throttling somewhere in the path from the process to
   * the I/O scheduler, ...). But if, after every expiration for one of
   * the above two reasons, bfqq has to wait for the service of at least
   * one full budget of another queue before being served again, then
   * bfqq is likely to get a much lower bandwidth or resource time than
   * its reserved ones. To address this issue, two countermeasures need
   * to be taken.
   *
   * First, the budget and the timestamps of bfqq need to be updated in
   * a special way on bfqq reactivation: they need to be updated as if
   * bfqq did not remain idle and did not expire. In fact, if they are
   * computed as if bfqq expired and remained idle until reactivation,
   * then the process associated with bfqq is treated as if, instead of
   * being greedy, it stopped issuing requests when bfqq remained idle,
   * and restarts issuing requests only on this reactivation. In other
   * words, the scheduler does not help the process recover the "service
   * hole" between bfqq expiration and reactivation. As a consequence,
   * the process receives a lower bandwidth than its reserved one. In
   * contrast, to recover this hole, the budget must be updated as if
   * bfqq was not expired at all before this reactivation, i.e., it must
   * be set to the value of the remaining budget when bfqq was
   * expired. Along the same line, timestamps need to be assigned the
   * value they had the last time bfqq was selected for service, i.e.,
   * before last expiration. Thus timestamps need to be back-shifted
   * with respect to their normal computation (see [1] for more details
   * on this tricky aspect).
   *
   * Secondly, to allow the process to recover the hole, the in-service
   * queue must be expired too, to give bfqq the chance to preempt it
   * immediately. In fact, if bfqq has to wait for a full budget of the
   * in-service queue to be completed, then it may become impossible to
   * let the process recover the hole, even if the back-shifted
   * timestamps of bfqq are lower than those of the in-service queue. If
   * this happens for most or all of the holes, then the process may not
   * receive its reserved bandwidth. In this respect, it is worth noting
   * that, since the service of outstanding requests is unpreemptible, a
   * small fraction of the holes may nevertheless be unrecoverable,
   * thereby causing a small loss of bandwidth.
   *
   * The last important point is detecting whether bfqq does need this
   * bandwidth recovery. In this respect, the next function deems the
   * process associated with bfqq greedy, and thus allows it to recover
   * the hole, if: 1) the process is waiting for the arrival of a new
   * request (which implies that bfqq expired for one of the above two
   * reasons), and 2) such a request has arrived soon. The first
   * condition is controlled through the flag non_blocking_wait_rq,
   * while the second through the flag arrived_in_time. If both
   * conditions hold, then the function computes the budget in the
   * above-described special way, and signals that the in-service queue
   * should be expired. Timestamp back-shifting is done later in
   * __bfq_activate_entity.
   *
   * 2. Reduce latency. Even if timestamps are not back-shifted to let
   * the process associated with bfqq recover a service hole, bfqq may
   * nevertheless happen to have, after being (re)activated, a lower
   * finish timestamp than the in-service queue. That is, the next budget
   * of bfqq may have to be completed before the one of the in-service
   * queue. If this is the case, then preempting the in-service queue
   * allows this goal to be achieved, apart from the unpreemptible,
   * outstanding requests mentioned above.
   *
   * Unfortunately, regardless of which of the above two goals one wants
   * to achieve, service trees need first to be updated to know whether
   * the in-service queue must be preempted. To have service trees
   * correctly updated, the in-service queue must be expired and
   * rescheduled, and bfqq must be scheduled too. This is one of the
   * most costly operations (in future versions, the scheduling
   * mechanism may be redesigned in such a way as to make it possible to
   * know whether preemption is needed without needing to update service
   * trees). In addition, queue preemptions almost always cause random
   * I/O, which may in turn cause loss of throughput. Finally, there may
   * even be no in-service queue when the next function is invoked (so,
   * no queue to compare timestamps with). Because of these facts, the
   * next function adopts the following simple scheme to avoid costly
   * operations, too frequent preemptions and too many dependencies on
   * the state of the scheduler: it requests the expiration of the
   * in-service queue (unconditionally) only for queues that need to
   * recover a hole. Then it delegates to other parts of the code the
   * responsibility of handling the above case 2.
   */
  static bool bfq_bfqq_update_budg_for_activation(struct bfq_data *bfqd,
  						struct bfq_queue *bfqq,
  						bool arrived_in_time)
  {
  	struct bfq_entity *entity = &bfqq->entity;
  	/*
  	 * In the next compound condition, we check also whether there
  	 * is some budget left, because otherwise there is no point in
  	 * trying to go on serving bfqq with this same budget: bfqq
  	 * would be expired immediately after being selected for
  	 * service. This would only cause useless overhead.
  	 */
  	if (bfq_bfqq_non_blocking_wait_rq(bfqq) && arrived_in_time &&
  	    bfq_bfqq_budget_left(bfqq) > 0) {
  		/*
  		 * We do not clear the flag non_blocking_wait_rq here, as
  		 * the latter is used in bfq_activate_bfqq to signal
  		 * that timestamps need to be back-shifted (and is
  		 * cleared right after).
  		 */
  
  		/*
  		 * In the next assignment we rely on the fact that neither
  		 * entity->service nor entity->budget is updated
  		 * on expiration if bfqq is empty (see
  		 * __bfq_bfqq_recalc_budget). Thus both quantities
  		 * remain unchanged after such an expiration, and the
  		 * following statement therefore assigns to
  		 * entity->budget the remaining budget on such an
  		 * expiration.
  		 */
  		entity->budget = min_t(unsigned long,
  				       bfq_bfqq_budget_left(bfqq),
  				       bfqq->max_budget);
  		/*
  		 * At this point, we have used entity->service to get
  		 * the budget left (needed for updating
  		 * entity->budget). Thus we finally can, and have to,
  		 * reset entity->service. The latter must be reset
  		 * because bfqq would otherwise be charged again for
  		 * the service it has received during its previous
  		 * service slot(s).
  		 */
  		entity->service = 0;
  		return true;
  	}
  	/*
  	 * We can finally complete expiration, by setting service to 0.
  	 */
  	entity->service = 0;
  	entity->budget = max_t(unsigned long, bfqq->max_budget,
  			       bfq_serv_to_charge(bfqq->next_rq, bfqq));
  	bfq_clear_bfqq_non_blocking_wait_rq(bfqq);
  	return false;
  }
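
  /*
   * Illustrative sketch, not part of BFQ: the two outcomes of the
   * function above, modelled on plain integers. If the queue was
   * waiting for a request, the request arrived in time and some budget
   * is left, the queue resumes with its leftover budget and the caller
   * is told to preempt. Otherwise a fresh budget is assigned. The
   * struct sketch_bq, its fields and sketch_update_budget() are
   * hypothetical stand-ins for the kernel structures and helpers.
   */

```c
#include <assert.h>
#include <stdbool.h>

struct sketch_bq {
	long budget;		/* budget assigned for the current slot */
	long service;		/* service received out of that budget */
	long max_budget;	/* per-queue cap on the budget */
	long next_rq_charge;	/* cost of the next request to serve */
	bool waiting_rq;	/* stands in for non_blocking_wait_rq */
};

static long sketch_min(long a, long b) { return a < b ? a : b; }
static long sketch_max(long a, long b) { return a > b ? a : b; }

/* Returns true if the in-service queue should be expired. */
static bool sketch_update_budget(struct sketch_bq *q, bool arrived_in_time)
{
	long left = q->budget - q->service;

	assert(q->service >= 0);
	if (q->waiting_rq && arrived_in_time && left > 0) {
		/* resume as if the queue had never expired */
		q->budget = sketch_min(left, q->max_budget);
		q->service = 0;
		/* waiting_rq stays set: timestamps are back-shifted later */
		return true;
	}

	/* normal activation: complete the expiration, assign a new budget */
	q->service = 0;
	q->budget = sketch_max(q->max_budget, q->next_rq_charge);
	q->waiting_rq = false;
	return false;
}
```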
  /*
   * Return the farthest past time instant according to jiffies
   * macros.
   */
  static unsigned long bfq_smallest_from_now(void)
  {
  	return jiffies - MAX_JIFFY_OFFSET;
  }
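
  /*
   * Illustrative sketch, not part of BFQ: why subtracting the maximum
   * offset from the current time yields an instant that compares as
   * "in the past" even when the counter is close to zero and the
   * subtraction wraps around. The helpers below mimic the usual
   * wraparound-safe comparison on a free-running unsigned counter;
   * their names and SKETCH_MAX_OFFSET are assumptions for this
   * example, and the unsigned-to-signed cast relies on the
   * two's-complement behaviour of common compilers.
   */

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Largest offset for which the signed comparison below stays valid. */
#define SKETCH_MAX_OFFSET ((unsigned long)((LONG_MAX >> 1) - 1))

/* True if instant a is after instant b, tolerating counter wraparound. */
static bool sketch_time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Farthest past instant that still compares as "before now". */
static unsigned long sketch_smallest_from(unsigned long now)
{
	assert(SKETCH_MAX_OFFSET < (unsigned long)LONG_MAX);
	return now - SKETCH_MAX_OFFSET;
}
```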
  static void bfq_update_bfqq_wr_on_rq_arrival(struct bfq_data *bfqd,
  					     struct bfq_queue *bfqq,
  					     unsigned int old_wr_coeff,
  					     bool wr_or_deserves_wr,
  					     bool interactive,
  					     bool in_burst,
  					     bool soft_rt)
  {
  	if (old_wr_coeff == 1 && wr_or_deserves_wr) {
  		/* start a weight-raising period */
  		if (interactive) {
  			bfqq->service_from_wr = 0;
  			bfqq->wr_coeff = bfqd->bfq_wr_coeff;
  			bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
  		} else {
  			/*
  			 * No interactive weight raising in progress
  			 * here: assign minus infinity to
  			 * wr_start_at_switch_to_srt, to make sure
  			 * that, at the end of the soft-real-time
  			 * weight-raising period that is starting
  			 * now, no interactive weight-raising period
  			 * may be wrongly considered as still in
  			 * progress (and thus actually started by
  			 * mistake).
  			 */
  			bfqq->wr_start_at_switch_to_srt =
  				bfq_smallest_from_now();
  			bfqq->wr_coeff = bfqd->bfq_wr_coeff *
  				BFQ_SOFTRT_WEIGHT_FACTOR;
  			bfqq->wr_cur_max_time =
  				bfqd->bfq_wr_rt_max_time;
  		}
  
  		/*
  		 * If needed, further reduce budget to make sure it is
  		 * close to bfqq's backlog, so as to reduce the
  		 * scheduling-error component due to a too large
  		 * budget. Do not care about throughput consequences,
  		 * but only about latency. Finally, do not assign a
  		 * too small budget either, to avoid increasing
  		 * latency by causing too frequent expirations.
  		 */
  		bfqq->entity.budget = min_t(unsigned long,
  					    bfqq->entity.budget,
  					    2 * bfq_min_budget(bfqd));
  	} else if (old_wr_coeff > 1) {
  		if (interactive) { /* update wr coeff and duration */
  			bfqq->wr_coeff = bfqd->bfq_wr_coeff;
  			bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
  		} else if (in_burst)
  			bfqq->wr_coeff = 1;
  		else if (soft_rt) {
  			/*
  			 * The application is now or still meeting the
  			 * requirements for being deemed soft rt.  We
  			 * can then correctly and safely (re)charge
  			 * the weight-raising duration for the
  			 * application with the weight-raising
  			 * duration for soft rt applications.
  			 *
  			 * In particular, doing this recharge now, i.e.,
  			 * before the weight-raising period for the
  			 * application finishes, reduces the probability
  			 * of the following negative scenario:
  			 * 1) the weight of a soft rt application is
  			 *    raised at startup (as for any newly
  			 *    created application),
  			 * 2) since the application is not interactive,
  			 *    at a certain time weight-raising is
  			 *    stopped for the application,
  			 * 3) at that time the application happens to
  			 *    still have pending requests, and hence
  			 *    is destined to not have a chance to be
  			 *    deemed soft rt before these requests are
  			 *    completed (see the comments to the
  			 *    function bfq_bfqq_softrt_next_start()
  			 *    for details on soft rt detection),
  			 * 4) these pending requests experience a high
  			 *    latency because the application is not
  			 *    weight-raised while they are pending.
  			 */
  			if (bfqq->wr_cur_max_time !=
  				bfqd->bfq_wr_rt_max_time) {
  				bfqq->wr_start_at_switch_to_srt =
  					bfqq->last_wr_start_finish;
  
  				bfqq->wr_cur_max_time =
  					bfqd->bfq_wr_rt_max_time;
  				bfqq->wr_coeff = bfqd->bfq_wr_coeff *
  					BFQ_SOFTRT_WEIGHT_FACTOR;
  			}
  			bfqq->last_wr_start_finish = jiffies;
  		}
  	}
  }
  
  static bool bfq_bfqq_idle_for_long_time(struct bfq_data *bfqd,
  					struct bfq_queue *bfqq)
  {
  	return bfqq->dispatched == 0 &&
  		time_is_before_jiffies(
  			bfqq->budget_timeout +
  			bfqd->bfq_wr_min_idle_time);
  }
  
  /*
   * Return true if bfqq is in a higher priority class, or has a higher
   * weight than the in-service queue.
   */
  static bool bfq_bfqq_higher_class_or_weight(struct bfq_queue *bfqq,
  					    struct bfq_queue *in_serv_bfqq)
  {
  	int bfqq_weight, in_serv_weight;
  
  	if (bfqq->ioprio_class < in_serv_bfqq->ioprio_class)
  		return true;
  
  	if (in_serv_bfqq->entity.parent == bfqq->entity.parent) {
  		bfqq_weight = bfqq->entity.weight;
  		in_serv_weight = in_serv_bfqq->entity.weight;
  	} else {
  		if (bfqq->entity.parent)
  			bfqq_weight = bfqq->entity.parent->weight;
  		else
  			bfqq_weight = bfqq->entity.weight;
  		if (in_serv_bfqq->entity.parent)
  			in_serv_weight = in_serv_bfqq->entity.parent->weight;
  		else
  			in_serv_weight = in_serv_bfqq->entity.weight;
  	}
  
  	return bfqq_weight > in_serv_weight;
  }
  static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
  					     struct bfq_queue *bfqq,
  					     int old_wr_coeff,
  					     struct request *rq,
  					     bool *interactive)
  {
  	bool soft_rt, in_burst,	wr_or_deserves_wr,
  		bfqq_wants_to_preempt,
  		idle_for_long_time = bfq_bfqq_idle_for_long_time(bfqd, bfqq),
  		/*
  		 * See the comments on
  		 * bfq_bfqq_update_budg_for_activation for
  		 * details on the usage of the next variable.
  		 */
  		arrived_in_time =  ktime_get_ns() <=
  			bfqq->ttime.last_end_request +
  			bfqd->bfq_slice_idle * 3;

  	/*
  	 * bfqq deserves to be weight-raised if:
  	 * - it is sync,
  	 * - it does not belong to a large burst,
  	 * - it has been idle for enough time or is soft real-time,
  	 * - it is linked to a bfq_io_cq (it is not shared in any sense).
  	 */
  	in_burst = bfq_bfqq_in_large_burst(bfqq);
  	soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
  		!BFQQ_TOTALLY_SEEKY(bfqq) &&
  		!in_burst &&
  		time_is_before_jiffies(bfqq->soft_rt_next_start) &&
  		bfqq->dispatched == 0;
  	*interactive = !in_burst && idle_for_long_time;
  	wr_or_deserves_wr = bfqd->low_latency &&
  		(bfqq->wr_coeff > 1 ||
  		 (bfq_bfqq_sync(bfqq) &&
  		  bfqq->bic && (*interactive || soft_rt)));
  
  	/*
  	 * Using the last flag, update budget and check whether bfqq
  	 * may want to preempt the in-service queue.
  	 */
  	bfqq_wants_to_preempt =
  		bfq_bfqq_update_budg_for_activation(bfqd, bfqq,
  						    arrived_in_time);

  	/*
  	 * If bfqq happened to be activated in a burst, but has been
  	 * idle for much more than an interactive queue, then we
  	 * assume that, in the overall I/O initiated in the burst, the
  	 * I/O associated with bfqq is finished. So bfqq does not need
  	 * to be treated as a queue belonging to a burst
  	 * anymore. Accordingly, we reset bfqq's in_large_burst flag
  	 * if set, and remove bfqq from the burst list if it's
  	 * there. We do not decrement burst_size, because the fact
  	 * that bfqq does not need to belong to the burst list any
  	 * more does not invalidate the fact that bfqq was created in
  	 * a burst.
  	 */
  	if (likely(!bfq_bfqq_just_created(bfqq)) &&
  	    idle_for_long_time &&
  	    time_is_before_jiffies(
  		    bfqq->budget_timeout +
  		    msecs_to_jiffies(10000))) {
  		hlist_del_init(&bfqq->burst_list_node);
  		bfq_clear_bfqq_in_large_burst(bfqq);
  	}
  
  	bfq_clear_bfqq_just_created(bfqq);
  	if (!bfq_bfqq_IO_bound(bfqq)) {
  		if (arrived_in_time) {
  			bfqq->requests_within_timer++;
  			if (bfqq->requests_within_timer >=
  			    bfqd->bfq_requests_within_timer)
  				bfq_mark_bfqq_IO_bound(bfqq);
  		} else
  			bfqq->requests_within_timer = 0;
  	}
  	if (bfqd->low_latency) {
  		if (unlikely(time_is_after_jiffies(bfqq->split_time)))
  			/* wraparound */
  			bfqq->split_time =
  				jiffies - bfqd->bfq_wr_min_idle_time - 1;
  
  		if (time_is_before_jiffies(bfqq->split_time +
  					   bfqd->bfq_wr_min_idle_time)) {
  			bfq_update_bfqq_wr_on_rq_arrival(bfqd, bfqq,
  							 old_wr_coeff,
  							 wr_or_deserves_wr,
  							 *interactive,
  							 in_burst,
  							 soft_rt);
  
  			if (old_wr_coeff != bfqq->wr_coeff)
  				bfqq->entity.prio_changed = 1;
  		}
  	}
  	bfqq->last_idle_bklogged = jiffies;
  	bfqq->service_from_backlogged = 0;
  	bfq_clear_bfqq_softrt_update(bfqq);
  	bfq_add_bfqq_busy(bfqd, bfqq);
  
  	/*
  	 * Expire in-service queue only if preemption may be needed
  	 * for guarantees. In particular, we care only about two
  	 * cases. The first is that bfqq has to recover a service
  	 * hole, as explained in the comments on
  	 * bfq_bfqq_update_budg_for_activation(), i.e., that
  	 * bfqq_wants_to_preempt is true. However, if bfqq does not
  	 * carry time-critical I/O, then bfqq's bandwidth is less
  	 * important than that of queues that carry time-critical I/O.
  	 * So, as a further constraint, we consider this case only if
  	 * bfqq is at least as weight-raised, i.e., at least as time
  	 * critical, as the in-service queue.
  	 *
  	 * The second case is that bfqq is in a higher priority class,
  	 * or has a higher weight than the in-service queue. If this
  	 * condition does not hold, we don't care because, even if
  	 * bfqq does not start to be served immediately, the resulting
  	 * delay for bfqq's I/O is in any case lower or much lower than
  	 * the ideal completion time to be guaranteed to bfqq's I/O.
  	 *
  	 * In both cases, preemption is needed only if, according to
  	 * the timestamps of both bfqq and of the in-service queue,
  	 * bfqq actually is the next queue to serve. So, to reduce
  	 * useless preemptions, the return value of
  	 * next_queue_may_preempt() is considered in the next compound
  	 * condition too. Yet next_queue_may_preempt() just checks a
  	 * simple, necessary condition for bfqq to be the next queue
  	 * to serve. In fact, to evaluate a sufficient condition, the
  	 * timestamps of the in-service queue would need to be
  	 * updated, and this operation is quite costly (see the
  	 * comments on bfq_bfqq_update_budg_for_activation()).
  	 */
  	if (bfqd->in_service_queue &&
  	    ((bfqq_wants_to_preempt &&
  	      bfqq->wr_coeff >= bfqd->in_service_queue->wr_coeff) ||
  	     bfq_bfqq_higher_class_or_weight(bfqq, bfqd->in_service_queue)) &&
  	    next_queue_may_preempt(bfqd))
  		bfq_bfqq_expire(bfqd, bfqd->in_service_queue,
  				false, BFQQE_PREEMPTED);
  }
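
  /*
   * Illustrative sketch, not part of BFQ: the final compound condition
   * of the function above, isolated as a pure predicate so that its
   * three ingredients can be read separately. The function name and
   * its flattened boolean and integer parameters are assumptions made
   * for this example; in the kernel the same facts are read from bfqd
   * and bfqq.
   */

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Expire the in-service queue only if:
 *  - there is an in-service queue at all,
 *  - the new queue either needs to recover a service hole while being
 *    at least as weight-raised as the in-service queue, or is in a
 *    higher class or has a higher weight,
 *  - and the cheap necessary check on the next queue to serve passes.
 */
static bool sketch_should_expire_in_service(bool has_in_service,
					    bool wants_to_preempt,
					    unsigned int new_wr_coeff,
					    unsigned int in_serv_wr_coeff,
					    bool higher_class_or_weight,
					    bool next_may_preempt)
{
	assert(new_wr_coeff >= 1 && in_serv_wr_coeff >= 1);
	return has_in_service &&
	       ((wants_to_preempt && new_wr_coeff >= in_serv_wr_coeff) ||
		higher_class_or_weight) &&
	       next_may_preempt;
}
```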
  static void bfq_reset_inject_limit(struct bfq_data *bfqd,
  				   struct bfq_queue *bfqq)
  {
  	/* invalidate baseline total service time */
  	bfqq->last_serv_time_ns = 0;
  
  	/*
  	 * Reset pointer in case we are waiting for
  	 * some request completion.
  	 */
  	bfqd->waited_rq = NULL;
  
  	/*
  	 * If bfqq has a short think time, then start by setting the
  	 * inject limit to 0 as a precaution, because the service time
  	 * of an injected I/O request may be higher than the think time
  	 * of bfqq, and therefore, if one request were injected while
  	 * bfqq remains empty, this injected request might delay the
  	 * service of the next I/O request for bfqq significantly. In
  	 * case bfqq can actually tolerate some injection, the
  	 * adaptive update will nevertheless raise the limit soon. This
  	 * lucky circumstance holds exactly because bfqq has a short
  	 * think time, and thus, after remaining empty, is likely to
  	 * get new I/O enqueued---and then completed---before being
  	 * expired. This is the very pattern that gives the
  	 * limit-update algorithm the chance to measure the effect of
  	 * injection on request service times, and then to update the
  	 * limit accordingly.
  	 *
  	 * However, in the following special case, the inject limit is
  	 * left at 1 even if the think time is short: bfqq's I/O is
  	 * synchronized with that of some other queue, i.e., bfqq may
  	 * receive new I/O only after the I/O of the other queue is
  	 * completed. Keeping the inject limit at 1 allows the
  	 * blocking I/O to be served while bfqq is in service. And
  	 * this is very convenient both for bfqq and for overall
  	 * throughput, as explained in detail in the comments in
  	 * bfq_update_has_short_ttime().
  	 *
  	 * On the opposite end, if bfqq has a long think time, then
  	 * start directly by 1, because:
  	 * a) on the bright side, keeping at most one request in
  	 * service in the drive is unlikely to cause any harm to the
  	 * latency of bfqq's requests, as the service time of a single
  	 * request is likely to be lower than the think time of bfqq;
  	 * b) on the downside, after becoming empty, bfqq is likely to
  	 * expire before getting its next request. With this request
  	 * arrival pattern, it is very hard to sample total service
  	 * times and update the inject limit accordingly (see comments
  	 * on bfq_update_inject_limit()). So the limit is likely to be
  	 * never, or at least seldom, updated.  As a consequence, by
  	 * setting the limit to 1, we make sure that injection is not
  	 * ruled out altogether for bfqq. On the downside, this proactive step
  	 * further reduces chances to actually compute the baseline
  	 * total service time. Thus it reduces chances to execute the
  	 * limit-update algorithm and possibly raise the limit to more
  	 * than 1.
  	 */
  	if (bfq_bfqq_has_short_ttime(bfqq))
  		bfqq->inject_limit = 0;
  	else
  		bfqq->inject_limit = 1;
  
  	bfqq->decrease_time_jif = jiffies;
  }
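
  /*
   * Illustrative sketch, not part of BFQ: a minimal user-space model of
   * the reset policy applied by bfq_reset_inject_limit() above. The
   * names inject_model and model_reset_inject_limit are hypothetical,
   * and the think-time state is assumed to be a plain boolean rather
   * than a queue flag.
   */

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the bfq_queue fields touched by the reset. */
struct inject_model {
	bool has_short_ttime;			/* models bfq_bfqq_has_short_ttime() */
	unsigned int inject_limit;		/* models bfqq->inject_limit */
	unsigned long long last_serv_time_ns;	/* baseline total service time */
};

/* Invalidate the baseline and restart the limit from 0 or 1. */
static void model_reset_inject_limit(struct inject_model *q)
{
	assert(q != NULL);

	q->last_serv_time_ns = 0;
	/* Short think time: start prudently from 0; otherwise allow 1. */
	q->inject_limit = q->has_short_ttime ? 0 : 1;
}
```

  /*
   * In this model, a queue with a short think time restarts from a
   * limit of 0 and relies on later adaptive updates to raise it, while
   * any other queue restarts from 1.
   */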
aee69d78d   Paolo Valente   block, bfq: intro...
  static void bfq_add_request(struct request *rq)
  {
  	struct bfq_queue *bfqq = RQ_BFQQ(rq);
  	struct bfq_data *bfqd = bfqq->bfqd;
  	struct request *next_rq, *prev;
44e44a1b3   Paolo Valente   block, bfq: impro...
  	unsigned int old_wr_coeff = bfqq->wr_coeff;
  	bool interactive = false;
aee69d78d   Paolo Valente   block, bfq: intro...
  
  	bfq_log_bfqq(bfqd, bfqq, "add_request %d", rq_is_sync(rq));
  	bfqq->queued[rq_is_sync(rq)]++;
  	bfqd->queued++;
2341d662e   Paolo Valente   block, bfq: tune ...
  	if (RB_EMPTY_ROOT(&bfqq->sort_list) && bfq_bfqq_sync(bfqq)) {
  		/*
13a857a4c   Paolo Valente   block, bfq: detec...
  		 * Detect whether bfqq's I/O seems synchronized with
  		 * that of some other queue, i.e., whether bfqq, after
  		 * remaining empty, happens to receive new I/O only
  		 * right after some I/O request of the other queue has
  		 * been completed. We call waker queue the other
  		 * queue, and we assume, for simplicity, that bfqq may
  		 * have at most one waker queue.
  		 *
  		 * A remarkable throughput boost can be reached by
  		 * unconditionally injecting the I/O of the waker
  		 * queue, every time a new bfq_dispatch_request
  		 * happens to be invoked while I/O is being plugged
  		 * for bfqq.  In addition to boosting throughput, this
  		 * unblocks bfqq's I/O, thereby improving bandwidth
  		 * and latency for bfqq. Note that these same results
  		 * may be achieved with the general injection
  		 * mechanism, but less effectively. For details on
  		 * this aspect, see the comments on the choice of the
  		 * queue for injection in bfq_select_queue().
  		 *
  		 * Turning back to the detection of a waker queue, a
  		 * queue Q is deemed a waker queue for bfqq if, two
  		 * consecutive times, bfqq happens to become non-empty
  		 * right after a request of Q has been
  		 * completed. In particular, on the first time, Q is
  		 * tentatively set as a candidate waker queue, while
  		 * on the second time, the flag
  		 * bfq_bfqq_has_waker(bfqq) is set to confirm that Q
  		 * is a waker queue for bfqq. These detection steps
  		 * are performed only if bfqq has a long think time,
  		 * so as to make it more likely that bfqq's I/O is
  		 * actually being blocked by a synchronization. This
  		 * last filter, plus the above two-times requirement,
  		 * makes false positives less likely.
  		 *
  		 * NOTE
  		 *
  		 * The sooner a waker queue is detected, the sooner
  		 * throughput can be boosted by injecting I/O from the
  		 * waker queue. Fortunately, detection is likely to be
  		 * actually fast, for the following reasons. While
  		 * blocked by synchronization, bfqq has a long think
  		 * time. This implies that bfqq's inject limit is at
  		 * least equal to 1 (see the comments in
  		 * bfq_update_inject_limit()). So, thanks to
  		 * injection, the waker queue is likely to be served
  		 * during the very first I/O-plugging time interval
  		 * for bfqq. This triggers the first step of the
  		 * detection mechanism. Thanks again to injection, the
  		 * candidate waker queue is then likely to be
  		 * confirmed no later than during the next
  		 * I/O-plugging interval for bfqq.
  		 */
08d383a74   Paolo Valente   block, bfq: reset...
  		if (bfqd->last_completed_rq_bfqq &&
  		    !bfq_bfqq_has_short_ttime(bfqq) &&
13a857a4c   Paolo Valente   block, bfq: detec...
  		    ktime_get_ns() - bfqd->last_completion <
  		    200 * NSEC_PER_USEC) {
  			if (bfqd->last_completed_rq_bfqq != bfqq &&
08d383a74   Paolo Valente   block, bfq: reset...
  			    bfqd->last_completed_rq_bfqq !=
  			    bfqq->waker_bfqq) {
13a857a4c   Paolo Valente   block, bfq: detec...
  				/*
  				 * First synchronization detected with
  				 * a candidate waker queue, or with a
  				 * different candidate waker queue
  				 * from the current one.
  				 */
  				bfqq->waker_bfqq = bfqd->last_completed_rq_bfqq;
  
  				/*
  				 * If the waker queue disappears, then
  				 * bfqq->waker_bfqq must be reset. To
  				 * this end, we maintain in each
  				 * waker queue a list, woken_list, of
  				 * all the queues that reference the
  				 * waker queue through their
  				 * waker_bfqq pointer. When the waker
  				 * queue exits, the waker_bfqq pointer
  				 * of all the queues in the woken_list
  				 * is reset.
  				 *
  				 * In addition, if bfqq is already in
  				 * the woken_list of a waker queue,
  				 * then, before being inserted into
  				 * the woken_list of a new waker
  				 * queue, bfqq must be removed from
  				 * the woken_list of the old waker
  				 * queue.
  				 */
  				if (!hlist_unhashed(&bfqq->woken_list_node))
  					hlist_del_init(&bfqq->woken_list_node);
  				hlist_add_head(&bfqq->woken_list_node,
  				    &bfqd->last_completed_rq_bfqq->woken_list);
  
  				bfq_clear_bfqq_has_waker(bfqq);
  			} else if (bfqd->last_completed_rq_bfqq ==
  				   bfqq->waker_bfqq &&
  				   !bfq_bfqq_has_waker(bfqq)) {
  				/*
  				 * synchronization with waker_bfqq
  				 * seen for the second time
  				 */
  				bfq_mark_bfqq_has_waker(bfqq);
  			}
  		}
  
  		/*
2341d662e   Paolo Valente   block, bfq: tune ...
  		 * Periodically reset inject limit, to make sure that
  		 * the latter eventually drops in case workload
  		 * changes, see step (3) in the comments on
  		 * bfq_update_inject_limit().
  		 */
  		if (time_is_before_eq_jiffies(bfqq->decrease_time_jif +
766d61412   Paolo Valente   block, bfq: reset...
  					     msecs_to_jiffies(1000)))
  			bfq_reset_inject_limit(bfqd, bfqq);
2341d662e   Paolo Valente   block, bfq: tune ...
  
  		/*
  		 * The following conditions must hold to set up a new
  		 * sampling of total service time, and then a new
  		 * update of the inject limit:
  		 * - bfqq is in service, because the total service
  		 *   time is evaluated only for the I/O requests of
  		 *   the queues in service;
  		 * - this is the right occasion to compute or to
  		 *   lower the baseline total service time, because
  		 *   there are actually no requests in the drive,
  		 *   or
  		 *   the baseline total service time is available, and
  		 *   this is the right occasion to compute the other
  		 *   quantity needed to update the inject limit, i.e.,
  		 *   the total service time caused by the amount of
  		 *   injection allowed by the current value of the
  		 *   limit. It is the right occasion because injection
  		 *   has actually been performed during the service
  		 *   hole, and there are still in-flight requests,
  		 *   which are very likely to be exactly the injected
  		 *   requests, or part of them;
  		 * - the minimum interval for sampling the total
  		 *   service time and updating the inject limit has
  		 *   elapsed.
  		 */
  		if (bfqq == bfqd->in_service_queue &&
  		    (bfqd->rq_in_driver == 0 ||
  		     (bfqq->last_serv_time_ns > 0 &&
  		      bfqd->rqs_injected && bfqd->rq_in_driver > 0)) &&
  		    time_is_before_eq_jiffies(bfqq->decrease_time_jif +
17c3d2660   Paolo Valente   block, bfq: incre...
  					      msecs_to_jiffies(10))) {
2341d662e   Paolo Valente   block, bfq: tune ...
  			bfqd->last_empty_occupied_ns = ktime_get_ns();
  			/*
  			 * Start the state machine for measuring the
  			 * total service time of rq: setting
  			 * wait_dispatch will cause bfqd->waited_rq to
  			 * be set when rq is dispatched.
  			 */
  			bfqd->wait_dispatch = true;
23ed570ac   Paolo Valente   block, bfq: updat...
  			/*
  			 * If there is no I/O in service in the drive,
  			 * then possible injection occurred before the
  			 * arrival of rq will not affect the total
  			 * service time of rq. So the injection limit
  			 * must not be updated as a function of such
  			 * total service time, unless new injection
  			 * occurs before rq is completed. To have the
  			 * injection limit updated only in the latter
  			 * case, reset rqs_injected here (rqs_injected
  			 * will be set in case injection is performed
  			 * on bfqq before rq is completed).
  			 */
  			if (bfqd->rq_in_driver == 0)
  				bfqd->rqs_injected = false;
2341d662e   Paolo Valente   block, bfq: tune ...
  		}
  	}
aee69d78d   Paolo Valente   block, bfq: intro...
  	elv_rb_add(&bfqq->sort_list, rq);
  
  	/*
  	 * Check if this request is a better next-serve candidate.
  	 */
  	prev = bfqq->next_rq;
  	next_rq = bfq_choose_req(bfqd, bfqq->next_rq, rq, bfqd->last_position);
  	bfqq->next_rq = next_rq;
36eca8948   Arianna Avanzini   block, bfq: add E...
  	/*
  	 * Adjust priority tree position, if next_rq changes.
8cacc5ab3   Paolo Valente   block, bfq: do no...
  	 * See comments on bfq_pos_tree_add_move() for the unlikely().
36eca8948   Arianna Avanzini   block, bfq: add E...
  	 */
8cacc5ab3   Paolo Valente   block, bfq: do no...
  	if (unlikely(!bfqd->nonrot_with_queueing && prev != bfqq->next_rq))
36eca8948   Arianna Avanzini   block, bfq: add E...
  		bfq_pos_tree_add_move(bfqd, bfqq);
aee69d78d   Paolo Valente   block, bfq: intro...
  	if (!bfq_bfqq_busy(bfqq)) /* switching to busy ... */
44e44a1b3   Paolo Valente   block, bfq: impro...
  		bfq_bfqq_handle_idle_busy_switch(bfqd, bfqq, old_wr_coeff,
  						 rq, &interactive);
  	else {
  		if (bfqd->low_latency && old_wr_coeff == 1 && !rq_is_sync(rq) &&
  		    time_is_before_jiffies(
  				bfqq->last_wr_start_finish +
  				bfqd->bfq_wr_min_inter_arr_async)) {
  			bfqq->wr_coeff = bfqd->bfq_wr_coeff;
  			bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
cfd69712a   Paolo Valente   block, bfq: reduc...
  			bfqd->wr_busy_queues++;
44e44a1b3   Paolo Valente   block, bfq: impro...
  			bfqq->entity.prio_changed = 1;
  		}
  		if (prev != bfqq->next_rq)
  			bfq_updated_next_req(bfqd, bfqq);
  	}
  
  	/*
  	 * Assign jiffies to last_wr_start_finish in the following
  	 * cases:
  	 *
  	 * . if bfqq is not going to be weight-raised, because, for
  	 *   non weight-raised queues, last_wr_start_finish stores the
  	 *   arrival time of the last request; as of now, this piece
  	 *   of information is used only for deciding whether to
  	 *   weight-raise async queues
  	 *
  	 * . if bfqq is not weight-raised, because, if bfqq is now
  	 *   switching to weight-raised, then last_wr_start_finish
  	 *   stores the time when weight-raising starts
  	 *
  	 * . if bfqq is interactive, because, regardless of whether
  	 *   bfqq is currently weight-raised, the weight-raising
  	 *   period must start or restart (this case is considered
  	 *   separately because it is not detected by the above
  	 *   conditions, if bfqq is already weight-raised)
77b7dcead   Paolo Valente   block, bfq: reduc...
  	 *
  	 * last_wr_start_finish has to be updated also if bfqq is soft
  	 * real-time, because the weight-raising period is constantly
  	 * restarted on idle-to-busy transitions for these queues, but
  	 * this is already done in bfq_bfqq_handle_idle_busy_switch if
  	 * needed.
44e44a1b3   Paolo Valente   block, bfq: impro...
  	 */
  	if (bfqd->low_latency &&
  		(old_wr_coeff == 1 || bfqq->wr_coeff == 1 || interactive))
  		bfqq->last_wr_start_finish = jiffies;
aee69d78d   Paolo Valente   block, bfq: intro...
  }
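
  /*
   * Illustrative sketch, not part of BFQ: a user-space model of the
   * two-step waker detection performed in bfq_add_request() above. A
   * queue first becomes a candidate waker and is confirmed only when
   * the same synchronization is seen again. The names waker_model,
   * model_detect_waker and MODEL_WAKER_WINDOW_NS are hypothetical; the
   * woken_list bookkeeping is deliberately left out.
   */

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* 200 us window, mirroring the 200 * NSEC_PER_USEC check in the source. */
#define MODEL_WAKER_WINDOW_NS (200u * 1000u)

struct waker_model {
	struct waker_model *waker;	/* candidate or confirmed waker queue */
	bool has_waker;			/* true once the candidate is confirmed */
	bool has_short_ttime;		/* detection runs only if this is false */
};

/*
 * Called when q switches from empty to non-empty. last_completed is
 * the queue whose request completed most recently, and
 * ns_since_completion is the time elapsed since that completion.
 */
static void model_detect_waker(struct waker_model *q,
			       struct waker_model *last_completed,
			       uint64_t ns_since_completion)
{
	assert(q != NULL);

	if (!last_completed || q->has_short_ttime ||
	    ns_since_completion >= MODEL_WAKER_WINDOW_NS)
		return;

	if (last_completed != q && last_completed != q->waker) {
		/* First sighting, or a different candidate: start over. */
		q->waker = last_completed;
		q->has_waker = false;
	} else if (last_completed == q->waker && !q->has_waker) {
		/* Same candidate seen a second time: confirm it. */
		q->has_waker = true;
	}
}
```

  /*
   * In this model, switching to a different candidate clears the
   * confirmation, so a queue needs two sightings of the same candidate
   * before has_waker is set again.
   */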
  
  static struct request *bfq_find_rq_fmerge(struct bfq_data *bfqd,
  					  struct bio *bio,
  					  struct request_queue *q)
  {
  	struct bfq_queue *bfqq = bfqd->bio_bfqq;
  
  
  	if (bfqq)
  		return elv_rb_find(&bfqq->sort_list, bio_end_sector(bio));
  
  	return NULL;
  }
ab0e43e9c   Paolo Valente   block, bfq: modif...
  static sector_t get_sdist(sector_t last_pos, struct request *rq)
  {
  	if (last_pos)
  		return abs(blk_rq_pos(rq) - last_pos);
  
  	return 0;
  }
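
  /*
   * Illustrative sketch, not part of BFQ: the seek-distance rule of
   * get_sdist() above, restated in user space. Sector positions are
   * assumed to be plain uint64_t values, and the hypothetical helper
   * model_sdist compares before subtracting so that the unsigned
   * difference never wraps around.
   */

```c
#include <assert.h>
#include <stdint.h>

/* Distance between the last served position and a new request. */
static uint64_t model_sdist(uint64_t last_pos, uint64_t pos)
{
	if (last_pos == 0)
		return 0;	/* no previous position recorded yet */

	return pos > last_pos ? pos - last_pos : last_pos - pos;
}
```

  /*
   * As in the original, a last position of 0 is treated as "unknown"
   * and yields a distance of 0.
   */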
aee69d78d   Paolo Valente   block, bfq: intro...
  #if 0 /* Still not clear if we can do without next two functions */
  static void bfq_activate_request(struct request_queue *q, struct request *rq)
  {
  	struct bfq_data *bfqd = q->elevator->elevator_data;
  
  	bfqd->rq_in_driver++;
aee69d78d   Paolo Valente   block, bfq: intro...
  }
  
  static void bfq_deactivate_request(struct request_queue *q, struct request *rq)
  {
  	struct bfq_data *bfqd = q->elevator->elevator_data;
  
  	bfqd->rq_in_driver--;
  }
  #endif
  
  static void bfq_remove_request(struct request_queue *q,
  			       struct request *rq)
  {
  	struct bfq_queue *bfqq = RQ_BFQQ(rq);
  	struct bfq_data *bfqd = bfqq->bfqd;
  	const int sync = rq_is_sync(rq);
  
  	if (bfqq->next_rq == rq) {
  		bfqq->next_rq = bfq_find_next_rq(bfqd, bfqq, rq);
  		bfq_updated_next_req(bfqd, bfqq);
  	}
  
  	if (rq->queuelist.prev != &rq->queuelist)
  		list_del_init(&rq->queuelist);
  	bfqq->queued[sync]--;
  	bfqd->queued--;
  	elv_rb_del(&bfqq->sort_list, rq);
  
  	elv_rqhash_del(q, rq);
  	if (q->last_merge == rq)
  		q->last_merge = NULL;
  
  	if (RB_EMPTY_ROOT(&bfqq->sort_list)) {
  		bfqq->next_rq = NULL;
  
  		if (bfq_bfqq_busy(bfqq) && bfqq != bfqd->in_service_queue) {
e21b7a0b9   Arianna Avanzini   block, bfq: add f...
  			bfq_del_bfqq_busy(bfqd, bfqq, false);
aee69d78d   Paolo Valente   block, bfq: intro...
  			/*
  			 * bfqq emptied. In normal operation, when
  			 * bfqq is empty, bfqq->entity.service and
  			 * bfqq->entity.budget must contain,
  			 * respectively, the service received and the
  			 * budget used last time bfqq emptied. These
  			 * facts do not hold in this case, as at least
  			 * this last removal occurred while bfqq is
  			 * not in service. To avoid inconsistencies,
  			 * reset both bfqq->entity.service and
  			 * bfqq->entity.budget, if bfqq still has a
  			 * process that may issue I/O requests to it.
  			 */
  			bfqq->entity.budget = bfqq->entity.service = 0;
  		}
36eca8948   Arianna Avanzini   block, bfq: add E...
  
  		/*
  		 * Remove queue from request-position tree as it is empty.
  		 */
  		if (bfqq->pos_root) {
  			rb_erase(&bfqq->pos_node, bfqq->pos_root);
  			bfqq->pos_root = NULL;
  		}
05e902835   Paolo Valente   block, bfq: add m...
  	} else {
8cacc5ab3   Paolo Valente   block, bfq: do no...
  		/* see comments on bfq_pos_tree_add_move() for the unlikely() */
  		if (unlikely(!bfqd->nonrot_with_queueing))
  			bfq_pos_tree_add_move(bfqd, bfqq);
aee69d78d   Paolo Valente   block, bfq: intro...
  	}
  
  	if (rq->cmd_flags & REQ_META)
  		bfqq->meta_pending--;
e21b7a0b9   Arianna Avanzini   block, bfq: add f...

aee69d78d   Paolo Valente   block, bfq: intro...
  }
14ccb66b3   Christoph Hellwig   block: remove the...
  static bool bfq_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio,
  		unsigned int nr_segs)
aee69d78d   Paolo Valente   block, bfq: intro...
  {
  	struct request_queue *q = hctx->queue;
  	struct bfq_data *bfqd = q->elevator->elevator_data;
  	struct request *free = NULL;
  	/*
  	 * bfq_bic_lookup grabs the queue_lock: invoke it now and
  	 * store its return value for later use, to avoid nesting
  	 * queue_lock inside the bfqd->lock. We assume that the bic
  	 * returned by bfq_bic_lookup does not go away before
  	 * bfqd->lock is taken.
  	 */
  	struct bfq_io_cq *bic = bfq_bic_lookup(bfqd, current->io_context, q);
  	bool ret;
  
  	spin_lock_irq(&bfqd->lock);
  
  	if (bic)
  		bfqd->bio_bfqq = bic_to_bfqq(bic, op_is_sync(bio->bi_opf));
  	else
  		bfqd->bio_bfqq = NULL;
  	bfqd->bio_bic = bic;
14ccb66b3   Christoph Hellwig   block: remove the...
  	ret = blk_mq_sched_try_merge(q, bio, nr_segs, &free);
aee69d78d   Paolo Valente   block, bfq: intro...
  
  	if (free)
  		blk_mq_free_request(free);
  	spin_unlock_irq(&bfqd->lock);
  
  	return ret;
  }
  
  static int bfq_request_merge(struct request_queue *q, struct request **req,
  			     struct bio *bio)
  {
  	struct bfq_data *bfqd = q->elevator->elevator_data;
  	struct request *__rq;
  
  	__rq = bfq_find_rq_fmerge(bfqd, bio, q);
  	if (__rq && elv_bio_merge_ok(__rq, bio)) {
  		*req = __rq;
  		return ELEVATOR_FRONT_MERGE;
  	}
  
  	return ELEVATOR_NO_MERGE;
  }
18e5a57d7   Paolo Valente   block, bfq: postp...
  static struct bfq_queue *bfq_init_rq(struct request *rq);
aee69d78d   Paolo Valente   block, bfq: intro...
  static void bfq_request_merged(struct request_queue *q, struct request *req,
  			       enum elv_merge type)
  {
  	if (type == ELEVATOR_FRONT_MERGE &&
  	    rb_prev(&req->rb_node) &&
  	    blk_rq_pos(req) <
  	    blk_rq_pos(container_of(rb_prev(&req->rb_node),
  				    struct request, rb_node))) {
18e5a57d7   Paolo Valente   block, bfq: postp...
  		struct bfq_queue *bfqq = bfq_init_rq(req);
fd03177c3   Paolo Valente   block, bfq: handl...
  		struct bfq_data *bfqd;
aee69d78d   Paolo Valente   block, bfq: intro...
  		struct request *prev, *next_rq;
fd03177c3   Paolo Valente   block, bfq: handl...
  		if (!bfqq)
  			return;
  
  		bfqd = bfqq->bfqd;
aee69d78d   Paolo Valente   block, bfq: intro...
  		/* Reposition request in its sort_list */
  		elv_rb_del(&bfqq->sort_list, req);
  		elv_rb_add(&bfqq->sort_list, req);
  
  		/* Choose next request to be served for bfqq */
  		prev = bfqq->next_rq;
  		next_rq = bfq_choose_req(bfqd, bfqq->next_rq, req,
  					 bfqd->last_position);
  		bfqq->next_rq = next_rq;
  		/*
36eca8948   Arianna Avanzini   block, bfq: add E...
  		 * If next_rq changes, update both the queue's budget to
  		 * fit the new request and the queue's position in its
  		 * rq_pos_tree.
aee69d78d   Paolo Valente   block, bfq: intro...
  		 */
36eca8948   Arianna Avanzini   block, bfq: add E...
  		if (prev != bfqq->next_rq) {
aee69d78d   Paolo Valente   block, bfq: intro...
  			bfq_updated_next_req(bfqd, bfqq);
8cacc5ab3   Paolo Valente   block, bfq: do no...
  			/*
  			 * See comments on bfq_pos_tree_add_move() for
  			 * the unlikely().
  			 */
  			if (unlikely(!bfqd->nonrot_with_queueing))
  				bfq_pos_tree_add_move(bfqd, bfqq);
36eca8948   Arianna Avanzini   block, bfq: add E...
  		}
aee69d78d   Paolo Valente   block, bfq: intro...
  	}
  }
8abfa4d6f   Paolo Valente   block, bfq: remov...
  /*
   * This function is called to notify the scheduler that the requests
   * rq and 'next' have been merged, with 'next' going away.  BFQ
   * exploits this hook to address the following issue: if 'next' has a
   * fifo_time lower than that of rq, then the fifo_time of rq must be
   * set to the value of 'next', to not forget the greater age of 'next'.
8abfa4d6f   Paolo Valente   block, bfq: remov...
   *
   * NOTE: in this function we assume that rq is in a bfq_queue, on the
   * basis that rq is picked from the hash table q->elevator->hash, which,
   * in its turn, is filled only with I/O requests present in
   * bfq_queues, while BFQ is in use for the request queue q. In fact,
   * the function that fills this hash table (elv_rqhash_add) is called
   * only by bfq_insert_request.
   */
aee69d78d   Paolo Valente   block, bfq: intro...
  static void bfq_requests_merged(struct request_queue *q, struct request *rq,
  				struct request *next)
  {
18e5a57d7   Paolo Valente   block, bfq: postp...
  	struct bfq_queue *bfqq = bfq_init_rq(rq),
  		*next_bfqq = bfq_init_rq(next);
aee69d78d   Paolo Valente   block, bfq: intro...

fd03177c3   Paolo Valente   block, bfq: handl...
  	if (!bfqq)
  		return;
aee69d78d   Paolo Valente   block, bfq: intro...
  	/*
  	 * If next and rq belong to the same bfq_queue and next is older
  	 * than rq, then reposition rq in the fifo (by substituting next
  	 * with rq). Otherwise, if next and rq belong to different
  	 * bfq_queues, never reposition rq: in fact, we would have to
  	 * reposition it with respect to next's position in its own fifo,
  	 * which would most certainly be too expensive with respect to
  	 * the benefits.
  	 */
  	if (bfqq == next_bfqq &&
  	    !list_empty(&rq->queuelist) && !list_empty(&next->queuelist) &&
  	    next->fifo_time < rq->fifo_time) {
  		list_del_init(&rq->queuelist);
  		list_replace_init(&next->queuelist, &rq->queuelist);
  		rq->fifo_time = next->fifo_time;
  	}
  
  	if (bfqq->next_rq == next)
  		bfqq->next_rq = rq;
e21b7a0b9   Arianna Avanzini   block, bfq: add f...
  	bfqg_stats_update_io_merged(bfqq_group(bfqq), next->cmd_flags);
aee69d78d   Paolo Valente   block, bfq: intro...
  }
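
  /*
   * Illustrative sketch, not part of BFQ: the fifo_time inheritance
   * rule of bfq_requests_merged() above, reduced to its core. The
   * struct rq_model, its queue_id field and model_requests_merged are
   * hypothetical; the fifo list manipulation and the list_empty()
   * checks of the original are left out.
   */

```c
#include <assert.h>
#include <stddef.h>

struct rq_model {
	int queue_id;			/* which queue owns the request */
	unsigned long long fifo_time;	/* expiration time; lower means older */
};

/*
 * 'next' has been merged into rq and is going away. rq takes over the
 * age of 'next' only if both belong to the same queue and 'next' is
 * older than rq.
 */
static void model_requests_merged(struct rq_model *rq,
				  const struct rq_model *next)
{
	assert(rq != NULL && next != NULL);

	if (rq->queue_id == next->queue_id &&
	    next->fifo_time < rq->fifo_time)
		rq->fifo_time = next->fifo_time;
}
```

  /*
   * Requests from different queues never exchange their fifo_time in
   * this model, which mirrors the "never reposition rq" rule stated in
   * the comment inside bfq_requests_merged().
   */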
44e44a1b3   Paolo Valente   block, bfq: impro...
  /* Must be called with bfqq != NULL */
  static void bfq_bfqq_end_wr(struct bfq_queue *bfqq)
  {
cfd69712a   Paolo Valente   block, bfq: reduc...
  	if (bfq_bfqq_busy(bfqq))
  		bfqq->bfqd->wr_busy_queues--;
44e44a1b3   Paolo Valente   block, bfq: impro...
  	bfqq->wr_coeff = 1;
  	bfqq->wr_cur_max_time = 0;
77b7dcead   Paolo Valente   block, bfq: reduc...
  	bfqq->last_wr_start_finish = jiffies;
44e44a1b3   Paolo Valente   block, bfq: impro...
  	/*
  	 * Trigger a weight change on the next invocation of
  	 * __bfq_entity_update_weight_prio.
  	 */
  	bfqq->entity.prio_changed = 1;
  }
ea25da480   Paolo Valente   block, bfq: split...
  void bfq_end_wr_async_queues(struct bfq_data *bfqd,
  			     struct bfq_group *bfqg)
44e44a1b3   Paolo Valente   block, bfq: impro...
  {
  	int i, j;
  
  	for (i = 0; i < 2; i++)
  		for (j = 0; j < IOPRIO_BE_NR; j++)
  			if (bfqg->async_bfqq[i][j])
  				bfq_bfqq_end_wr(bfqg->async_bfqq[i][j]);
  	if (bfqg->async_idle_bfqq)
  		bfq_bfqq_end_wr(bfqg->async_idle_bfqq);
  }
  
  static void bfq_end_wr(struct bfq_data *bfqd)
  {
  	struct bfq_queue *bfqq;
  
  	spin_lock_irq(&bfqd->lock);
  
  	list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list)
  		bfq_bfqq_end_wr(bfqq);
  	list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list)
  		bfq_bfqq_end_wr(bfqq);
  	bfq_end_wr_async(bfqd);
  
  	spin_unlock_irq(&bfqd->lock);
  }
36eca8948   Arianna Avanzini   block, bfq: add E...
  static sector_t bfq_io_struct_pos(void *io_struct, bool request)
  {
  	if (request)
  		return blk_rq_pos(io_struct);
  	else
  		return ((struct bio *)io_struct)->bi_iter.bi_sector;
  }
  
  static int bfq_rq_close_to_sector(void *io_struct, bool request,
  				  sector_t sector)
  {
  	return abs(bfq_io_struct_pos(io_struct, request) - sector) <=
  	       BFQQ_CLOSE_THR;
  }
  
  static struct bfq_queue *bfqq_find_close(struct bfq_data *bfqd,
  					 struct bfq_queue *bfqq,
  					 sector_t sector)
  {
  	struct rb_root *root = &bfq_bfqq_to_bfqg(bfqq)->rq_pos_tree;
  	struct rb_node *parent, *node;
  	struct bfq_queue *__bfqq;
  
  	if (RB_EMPTY_ROOT(root))
  		return NULL;
  
  	/*
  	 * First, if we find a request starting at the end of the last
  	 * request, choose it.
  	 */
  	__bfqq = bfq_rq_pos_tree_lookup(bfqd, root, sector, &parent, NULL);
  	if (__bfqq)
  		return __bfqq;
  
  	/*
  	 * If the exact sector wasn't found, the parent of the NULL leaf
  	 * will contain the closest sector (rq_pos_tree sorted by
  	 * next_request position).
  	 */
  	__bfqq = rb_entry(parent, struct bfq_queue, pos_node);
  	if (bfq_rq_close_to_sector(__bfqq->next_rq, true, sector))
  		return __bfqq;
  
  	if (blk_rq_pos(__bfqq->next_rq) < sector)
  		node = rb_next(&__bfqq->pos_node);
  	else
  		node = rb_prev(&__bfqq->pos_node);
  	if (!node)
  		return NULL;
  
  	__bfqq = rb_entry(node, struct bfq_queue, pos_node);
  	if (bfq_rq_close_to_sector(__bfqq->next_rq, true, sector))
  		return __bfqq;
  
  	return NULL;
  }
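
  /*
   * Illustrative sketch, not part of BFQ: the neighbour search of
   * bfqq_find_close() above, modelled on a sorted array instead of an
   * rbtree. The helpers model_dist and model_find_close and the value
   * of MODEL_CLOSE_THR are hypothetical. As a simplification, the
   * sketch tests the successor first and then the predecessor, whereas
   * the original starts from the parent of the NULL leaf.
   */

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_CLOSE_THR 8192u	/* hypothetical closeness threshold, in sectors */

static uint64_t model_dist(uint64_t a, uint64_t b)
{
	return a > b ? a - b : b - a;
}

/*
 * pos[] holds the next-request position of each queue, sorted in
 * ascending order. Returns the index of a queue whose next request is
 * within MODEL_CLOSE_THR of sector, or -1 if there is none.
 */
static long model_find_close(const uint64_t *pos, size_t n, uint64_t sector)
{
	size_t lo = 0, hi = n;

	/* Binary search for the first position >= sector. */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (pos[mid] < sector)
			lo = mid + 1;
		else
			hi = mid;
	}

	/* pos[lo] is the successor (an exact match has distance 0). */
	if (lo < n && model_dist(pos[lo], sector) <= MODEL_CLOSE_THR)
		return (long)lo;
	/* pos[lo - 1] is the predecessor. */
	if (lo > 0 && model_dist(pos[lo - 1], sector) <= MODEL_CLOSE_THR)
		return (long)(lo - 1);

	return -1;
}
```

  /*
   * Only the two neighbours of the lookup position can qualify, so no
   * further traversal is needed once both have been tested.
   */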
  
  static struct bfq_queue *bfq_find_close_cooperator(struct bfq_data *bfqd,
  						   struct bfq_queue *cur_bfqq,
  						   sector_t sector)
  {
  	struct bfq_queue *bfqq;
  
  	/*
  	 * We shall notice if some of the queues are cooperating,
  	 * e.g., working closely on the same area of the device. In
  	 * that case, we can group them together and: 1) don't waste
  	 * time idling, and 2) serve the union of their requests in
  	 * the best possible order for throughput.
  	 */
  	bfqq = bfqq_find_close(bfqd, cur_bfqq, sector);
  	if (!bfqq || bfqq == cur_bfqq)
  		return NULL;
  
  	return bfqq;
  }
  
  static struct bfq_queue *
  bfq_setup_merge(struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)
  {
  	int process_refs, new_process_refs;
  	struct bfq_queue *__bfqq;
  
  	/*
  	 * If there are no process references on the new_bfqq, then it is
  	 * unsafe to follow the ->new_bfqq chain as other bfqq's in the chain
  	 * may have dropped their last reference (not just their last process
  	 * reference).
  	 */
  	if (!bfqq_process_refs(new_bfqq))
  		return NULL;
  
  	/* Avoid a circular list and skip interim queue merges. */
  	while ((__bfqq = new_bfqq->new_bfqq)) {
  		if (__bfqq == bfqq)
  			return NULL;
  		new_bfqq = __bfqq;
  	}
  
  	process_refs = bfqq_process_refs(bfqq);
  	new_process_refs = bfqq_process_refs(new_bfqq);
  	/*
  	 * If the process for the bfqq has gone away, there is no
  	 * sense in merging the queues.
  	 */
  	if (process_refs == 0 || new_process_refs == 0)
  		return NULL;
  
  	bfq_log_bfqq(bfqq->bfqd, bfqq, "scheduling merge with queue %d",
  		new_bfqq->pid);
  
  	/*
  	 * Merging is just a redirection: the requests of the process
  	 * owning one of the two queues are redirected to the other queue.
  	 * The latter queue, in its turn, is set as shared if this is the
  	 * first time that the requests of some process are redirected to
  	 * it.
  	 *
6fa3e8d34   Paolo Valente   block, bfq: remov...
  	 * We redirect bfqq to new_bfqq and not the opposite, because
  	 * we are in the context of the process owning bfqq, thus we
  	 * have the io_cq of this process. So we can immediately
  	 * configure this io_cq to redirect the requests of the
  	 * process to new_bfqq. In contrast, the io_cq of new_bfqq is
  	 * not available any more (new_bfqq->bic == NULL).
36eca8948   Arianna Avanzini   block, bfq: add E...
  	 *
6fa3e8d34   Paolo Valente   block, bfq: remov...
  	 * Anyway, even in case new_bfqq coincides with the in-service
  	 * queue, redirecting requests to the in-service queue is the
  	 * best option, as we feed the in-service queue with new
  	 * requests close to the last request served and, by doing so,
  	 * are likely to increase the throughput.
36eca8948   Arianna Avanzini   block, bfq: add E...
  	 */
  	bfqq->new_bfqq = new_bfqq;
  	new_bfqq->ref += process_refs;
  	return new_bfqq;
  }
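
  /*
   * Illustrative sketch, not part of BFQ: the chain-following logic of
   * bfq_setup_merge() above in a user-space model. The struct
   * merge_model and model_setup_merge are hypothetical, and the number
   * of process references is assumed to be stored directly in a field
   * instead of being derived as bfqq_process_refs() does.
   */

```c
#include <assert.h>
#include <stddef.h>

struct merge_model {
	struct merge_model *new_q;	/* models bfqq->new_bfqq */
	int process_refs;		/* references held by processes */
	int ref;			/* total reference count */
};

/*
 * Schedule a merge of q into the last queue of the chain starting at
 * target. Returns that queue, or NULL if no merge is scheduled.
 */
static struct merge_model *model_setup_merge(struct merge_model *q,
					     struct merge_model *target)
{
	struct merge_model *next;

	assert(q != NULL && target != NULL);

	if (target->process_refs == 0)
		return NULL;	/* unsafe to follow the chain */

	/* Skip interim merges and refuse to close a cycle. */
	while ((next = target->new_q)) {
		if (next == q)
			return NULL;
		target = next;
	}

	if (q->process_refs == 0 || target->process_refs == 0)
		return NULL;	/* an owning process has gone away */

	q->new_q = target;
	target->ref += q->process_refs;
	return target;
}
```

  /*
   * The final queue of the chain receives one extra reference per
   * process that will be redirected to it, matching the
   * new_bfqq->ref += process_refs step of the original.
   */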
  
  static bool bfq_may_be_close_cooperator(struct bfq_queue *bfqq,
  					struct bfq_queue *new_bfqq)
  {
7b8fa3b90   Paolo Valente   block, bfq: let a...
  	if (bfq_too_late_for_merging(new_bfqq))
  		return false;
36eca8948   Arianna Avanzini   block, bfq: add E...
  	if (bfq_class_idle(bfqq) || bfq_class_idle(new_bfqq) ||
  	    (bfqq->ioprio_class != new_bfqq->ioprio_class))
  		return false;
  
  	/*
  	 * If either of the queues has already been detected as seeky,
  	 * then merging it with the other queue is unlikely to lead to
  	 * sequential I/O.
  	 */
  	if (BFQQ_SEEKY(bfqq) || BFQQ_SEEKY(new_bfqq))
  		return false;
  
  	/*
  	 * Interleaved I/O is known to be done by (some) applications
  	 * only for reads, so it does not make sense to merge async
  	 * queues.
  	 */
  	if (!bfq_bfqq_sync(bfqq) || !bfq_bfqq_sync(new_bfqq))
  		return false;
  
  	return true;
  }
  
  /*
   * Attempt to schedule a merge of bfqq with the currently in-service
   * queue or with a close queue among the scheduled queues.  Return
   * NULL if no merge was scheduled, a pointer to the shared bfq_queue
   * structure otherwise.
   *
   * The OOM queue is not allowed to participate in cooperation: in fact, since
   * the requests temporarily redirected to the OOM queue could be redirected
   * again to dedicated queues at any time, the state needed to correctly
   * handle merging with the OOM queue would be quite complex and expensive
   * to maintain. Besides, in a condition as critical as running out of memory,
   * the benefits of queue merging may be of little relevance, or even negligible.
   *
   * WARNING: queue merging may impair fairness among non-weight-raised
   * queues, for at least two reasons: 1) the original weight of a
   * merged queue may change during the merged state, 2) even if the
   * weight stays the same, a merged queue may be bloated with many more
   * requests than the ones produced by its originally-associated
   * process.
   */
  static struct bfq_queue *
  bfq_setup_cooperator(struct bfq_data *bfqd, struct bfq_queue *bfqq,
  		     void *io_struct, bool request)
  {
  	struct bfq_queue *in_service_bfqq, *new_bfqq;
  	/*
  	 * Do not perform queue merging if the device is non
  	 * rotational and performs internal queueing. In fact, such a
  	 * device reaches a high speed through internal parallelism
  	 * and pipelining. This means that, to reach a high
  	 * throughput, it must have many requests enqueued at the same
  	 * time. But, in this configuration, the internal scheduling
  	 * algorithm of the device does exactly the job of queue
  	 * merging: it reorders requests so as to obtain as much as
  	 * possible a sequential I/O pattern. As a consequence, with
  	 * the workload generated by processes doing interleaved I/O,
  	 * the throughput reached by the device is likely to be the
  	 * same, with and without queue merging.
  	 *
  	 * Disabling merging also provides a remarkable benefit in
  	 * terms of throughput. Merging tends to make many workloads
  	 * artificially more uneven, because of shared queues
  	 * remaining non empty for incomparably more time than
  	 * non-merged queues. This may accentuate workload
  	 * asymmetries. For example, if one of the queues in a set of
  	 * merged queues has a higher weight than a normal queue, then
  	 * the shared queue may inherit such a high weight and, by
  	 * staying almost always active, may force BFQ to perform I/O
  	 * plugging most of the time. This evidently makes it harder
  	 * for BFQ to let the device reach a high throughput.
  	 *
  	 * Finally, the likely() macro below is not used because one
  	 * of the two branches is more likely than the other, but to
  	 * have the code path after the following if() executed as
  	 * fast as possible for the case of a non rotational device
  	 * with queueing. We want it because this is the fastest kind
  	 * of device. On the opposite end, the likely() may lengthen
  	 * the execution time of BFQ for the case of slower devices
  	 * (rotational or at least without queueing). But in this case
  	 * the execution time of BFQ matters very little, if not at
  	 * all.
  	 */
  	if (likely(bfqd->nonrot_with_queueing))
  		return NULL;
  
  	/*
  	 * Prevent bfqq from being merged if it has been created too
  	 * long ago. The idea is that true cooperating processes, and
  	 * thus their associated bfq_queues, are supposed to be
  	 * created shortly after each other. This is the case, e.g.,
  	 * for KVM/QEMU and dump I/O threads. Based on this
  	 * assumption, the following filtering greatly reduces the
  	 * probability that two non-cooperating processes, which just
  	 * happen to do close I/O for some short time interval, have
  	 * their queues merged by mistake.
  	 */
  	if (bfq_too_late_for_merging(bfqq))
  		return NULL;
  	if (bfqq->new_bfqq)
  		return bfqq->new_bfqq;
  	if (!io_struct || unlikely(bfqq == &bfqd->oom_bfqq))
  		return NULL;
  
  	/* If there is only one backlogged queue, don't search. */
  	if (bfq_tot_busy_queues(bfqd) == 1)
  		return NULL;
  
  	in_service_bfqq = bfqd->in_service_queue;
  	if (in_service_bfqq && in_service_bfqq != bfqq &&
  	    likely(in_service_bfqq != &bfqd->oom_bfqq) &&
  	    bfq_rq_close_to_sector(io_struct, request,
  				   bfqd->in_serv_last_pos) &&
  	    bfqq->entity.parent == in_service_bfqq->entity.parent &&
  	    bfq_may_be_close_cooperator(bfqq, in_service_bfqq)) {
  		new_bfqq = bfq_setup_merge(bfqq, in_service_bfqq);
  		if (new_bfqq)
  			return new_bfqq;
  	}
  	/*
  	 * Check whether there is a cooperator among currently scheduled
  	 * queues. The only thing we need is that the bio/request is not
  	 * NULL, as we need it to establish whether a cooperator exists.
  	 */
  	new_bfqq = bfq_find_close_cooperator(bfqd, bfqq,
  			bfq_io_struct_pos(io_struct, request));
  	if (new_bfqq && likely(new_bfqq != &bfqd->oom_bfqq) &&
  	    bfq_may_be_close_cooperator(bfqq, new_bfqq))
  		return bfq_setup_merge(bfqq, new_bfqq);
  
  	return NULL;
  }
  
  static void bfq_bfqq_save_state(struct bfq_queue *bfqq)
  {
  	struct bfq_io_cq *bic = bfqq->bic;
  
  	/*
  	 * If !bfqq->bic, the queue is already shared or its requests
  	 * have already been redirected to a shared queue; both idle window
  	 * and weight raising state have already been saved. Do nothing.
  	 */
  	if (!bic)
  		return;
  	bic->saved_weight = bfqq->entity.orig_weight;
  	bic->saved_ttime = bfqq->ttime;
  	bic->saved_has_short_ttime = bfq_bfqq_has_short_ttime(bfqq);
  	bic->saved_IO_bound = bfq_bfqq_IO_bound(bfqq);
  	bic->saved_in_large_burst = bfq_bfqq_in_large_burst(bfqq);
  	bic->was_in_burst_list = !hlist_unhashed(&bfqq->burst_list_node);
  	if (unlikely(bfq_bfqq_just_created(bfqq) &&
  		     !bfq_bfqq_in_large_burst(bfqq) &&
  		     bfqq->bfqd->low_latency)) {
  		/*
  		 * bfqq being merged right after being created: bfqq
  		 * would have deserved interactive weight raising, but
  		 * was not put in a weight-raised state because of this
  		 * early merge. Directly store the weight-raising state
  		 * that would have been assigned to bfqq, so that bfqq
  		 * does not unjustly miss out on weight raising if it is
  		 * split soon.
  		 */
  		bic->saved_wr_coeff = bfqq->bfqd->bfq_wr_coeff;
  		bic->saved_wr_start_at_switch_to_srt = bfq_smallest_from_now();
  		bic->saved_wr_cur_max_time = bfq_wr_duration(bfqq->bfqd);
  		bic->saved_last_wr_start_finish = jiffies;
  	} else {
  		bic->saved_wr_coeff = bfqq->wr_coeff;
  		bic->saved_wr_start_at_switch_to_srt =
  			bfqq->wr_start_at_switch_to_srt;
  		bic->saved_last_wr_start_finish = bfqq->last_wr_start_finish;
  		bic->saved_wr_cur_max_time = bfqq->wr_cur_max_time;
  	}
  }
  void bfq_release_process_ref(struct bfq_data *bfqd, struct bfq_queue *bfqq)
  {
  	/*
  	 * To prevent bfqq's service guarantees from being violated,
  	 * bfqq may be left busy, i.e., queued for service, even if
  	 * empty (see comments in __bfq_bfqq_expire() for
  	 * details). But, if no process will send requests to bfqq any
  	 * longer, then there is no point in keeping bfqq queued for
  	 * service. In addition, keeping bfqq queued for service, but
  	 * with no process ref any longer, may have caused bfqq to be
  	 * freed when dequeued from service. But this is assumed to
  	 * never happen.
  	 */
  	if (bfq_bfqq_busy(bfqq) && RB_EMPTY_ROOT(&bfqq->sort_list) &&
  	    bfqq != bfqd->in_service_queue)
  		bfq_del_bfqq_busy(bfqd, bfqq, false);
  
  	bfq_put_queue(bfqq);
  }
  static void
  bfq_merge_bfqqs(struct bfq_data *bfqd, struct bfq_io_cq *bic,
  		struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)
  {
  	bfq_log_bfqq(bfqd, bfqq, "merging with queue %lu",
  		(unsigned long)new_bfqq->pid);
  	/* Save weight raising and idle window of the merged queues */
  	bfq_bfqq_save_state(bfqq);
  	bfq_bfqq_save_state(new_bfqq);
  	if (bfq_bfqq_IO_bound(bfqq))
  		bfq_mark_bfqq_IO_bound(new_bfqq);
  	bfq_clear_bfqq_IO_bound(bfqq);
  
  	/*
  	 * If bfqq is weight-raised, then let new_bfqq inherit
  	 * weight-raising. To reduce false positives, neglect the case
  	 * where bfqq has just been created, but has not yet made it
  	 * to be weight-raised (which may happen because EQM may merge
  	 * bfqq even before bfq_add_request is executed for the first
  	 * time for bfqq). Handling this case would however be very
  	 * easy, thanks to the flag just_created.
  	 */
  	if (new_bfqq->wr_coeff == 1 && bfqq->wr_coeff > 1) {
  		new_bfqq->wr_coeff = bfqq->wr_coeff;
  		new_bfqq->wr_cur_max_time = bfqq->wr_cur_max_time;
  		new_bfqq->last_wr_start_finish = bfqq->last_wr_start_finish;
  		new_bfqq->wr_start_at_switch_to_srt =
  			bfqq->wr_start_at_switch_to_srt;
  		if (bfq_bfqq_busy(new_bfqq))
  			bfqd->wr_busy_queues++;
  		new_bfqq->entity.prio_changed = 1;
  	}
  
  	if (bfqq->wr_coeff > 1) { /* bfqq has given its wr to new_bfqq */
  		bfqq->wr_coeff = 1;
  		bfqq->entity.prio_changed = 1;
  		if (bfq_bfqq_busy(bfqq))
  			bfqd->wr_busy_queues--;
  	}
  
  	bfq_log_bfqq(bfqd, new_bfqq, "merge_bfqqs: wr_busy %d",
  		     bfqd->wr_busy_queues);
  
  	/*
  	 * Merge queues (that is, let bic redirect its requests to new_bfqq)
  	 */
  	bic_set_bfqq(bic, new_bfqq, 1);
  	bfq_mark_bfqq_coop(new_bfqq);
  	/*
  	 * new_bfqq now belongs to at least two bics (it is a shared queue):
  	 * set new_bfqq->bic to NULL. bfqq either:
  	 * - does not belong to any bic any more, and hence bfqq->bic must
  	 *   be set to NULL, or
  	 * - is a queue whose owning bics have already been redirected to a
  	 *   different queue, hence the queue is destined to not belong to
  	 *   any bic soon and bfqq->bic is already NULL (therefore the next
  	 *   assignment causes no harm).
  	 */
  	new_bfqq->bic = NULL;
  	/*
  	 * If the queue is shared, the pid is the pid of one of the associated
  	 * processes. Which pid depends on the exact sequence of merge events
  	 * the queue underwent. So printing such a pid is useless and confusing
  	 * because it reports a random pid between those of the associated
  	 * processes.
  	 * We mark such a queue with a pid -1, and then print SHARED instead of
  	 * a pid in logging messages.
  	 */
  	new_bfqq->pid = -1;
  	bfqq->bic = NULL;
  	bfq_release_process_ref(bfqd, bfqq);
  }
  static bool bfq_allow_bio_merge(struct request_queue *q, struct request *rq,
  				struct bio *bio)
  {
  	struct bfq_data *bfqd = q->elevator->elevator_data;
  	bool is_sync = op_is_sync(bio->bi_opf);
  	struct bfq_queue *bfqq = bfqd->bio_bfqq, *new_bfqq;
  
  	/*
  	 * Disallow merge of a sync bio into an async request.
  	 */
  	if (is_sync && !rq_is_sync(rq))
  		return false;
  
  	/*
  	 * Lookup the bfqq that this bio will be queued with. Allow
  	 * merge only if rq is queued there.
  	 */
  	if (!bfqq)
  		return false;
  	/*
  	 * We take advantage of this function to perform an early merge
  	 * of the queues of possible cooperating processes.
  	 */
  	new_bfqq = bfq_setup_cooperator(bfqd, bfqq, bio, false);
  	if (new_bfqq) {
  		/*
  		 * bic still points to bfqq, then it has not yet been
  		 * redirected to some other bfq_queue, and a queue
  		 * merge between bfqq and new_bfqq can be safely
  		 * fulfilled, i.e., bic can be redirected to new_bfqq
  		 * and bfqq can be put.
  		 */
  		bfq_merge_bfqqs(bfqd, bfqd->bio_bic, bfqq,
  				new_bfqq);
  		/*
  		 * If we get here, bio will be queued into new_bfqq,
  		 * so use new_bfqq to decide whether bio and rq can be
  		 * merged.
  		 */
  		bfqq = new_bfqq;
  
  		/*
  		 * Change also bfqd->bio_bfqq, as
  		 * bfqd->bio_bic now points to new_bfqq, and
  		 * this function may be invoked again (and then may
  		 * use again bfqd->bio_bfqq).
  		 */
  		bfqd->bio_bfqq = bfqq;
  	}
  	return bfqq == RQ_BFQQ(rq);
  }
  /*
   * Set the maximum time for the in-service queue to consume its
   * budget. This prevents seeky processes from lowering the throughput.
   * In practice, a time-slice service scheme is used with seeky
   * processes.
   */
  static void bfq_set_budget_timeout(struct bfq_data *bfqd,
  				   struct bfq_queue *bfqq)
  {
  	unsigned int timeout_coeff;
  
  	if (bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time)
  		timeout_coeff = 1;
  	else
  		timeout_coeff = bfqq->entity.weight / bfqq->entity.orig_weight;
  	bfqd->last_budget_start = ktime_get();
  
  	bfqq->budget_timeout = jiffies +
  		bfqd->bfq_timeout * timeout_coeff;
  }
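
  /*
   * Illustration only, not part of bfq-iosched.c: a minimal user-space
   * sketch of the arithmetic in bfq_set_budget_timeout() above. The
   * name sketch_budget_timeout() and its plain-integer parameters are
   * hypothetical stand-ins for the bfqd/bfqq fields; soft_rt stands
   * for the check wr_cur_max_time == bfq_wr_rt_max_time.
   */

```c
#include <stdbool.h>

/*
 * A soft real-time queue gets the base timeout; any other queue gets
 * the base timeout scaled by the ratio between its current weight and
 * its original weight (integer division, as in the code above).
 */
unsigned long sketch_budget_timeout(unsigned long now_jiffies,
				    unsigned long base_timeout,
				    unsigned int weight,
				    unsigned int orig_weight,
				    bool soft_rt)
{
	unsigned int timeout_coeff = soft_rt ? 1 : weight / orig_weight;

	return now_jiffies + base_timeout * timeout_coeff;
}
```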
  static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
  				       struct bfq_queue *bfqq)
  {
  	if (bfqq) {
  		bfq_clear_bfqq_fifo_expire(bfqq);
  
  		bfqd->budgets_assigned = (bfqd->budgets_assigned * 7 + 256) / 8;
  		if (time_is_before_jiffies(bfqq->last_wr_start_finish) &&
  		    bfqq->wr_coeff > 1 &&
  		    bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time &&
  		    time_is_before_jiffies(bfqq->budget_timeout)) {
  			/*
  			 * For soft real-time queues, move the start
  			 * of the weight-raising period forward by the
  			 * time the queue has not received any
  			 * service. Otherwise, a relatively long
  			 * service delay is likely to cause the
  			 * weight-raising period of the queue to end,
  			 * because of the short duration of the
  			 * weight-raising period of a soft real-time
  			 * queue.  It is worth noting that this move
  			 * is not so dangerous for the other queues,
  			 * because soft real-time queues are not
  			 * greedy.
  			 *
  			 * To not add a further variable, we use the
  			 * overloaded field budget_timeout to
  			 * determine for how long the queue has not
  			 * received service, i.e., how much time has
  			 * elapsed since the queue expired. However,
  			 * this is a little imprecise, because
  			 * budget_timeout is set to jiffies if bfqq
  			 * not only expires, but also remains with no
  			 * request.
  			 */
  			if (time_after(bfqq->budget_timeout,
  				       bfqq->last_wr_start_finish))
  				bfqq->last_wr_start_finish +=
  					jiffies - bfqq->budget_timeout;
  			else
  				bfqq->last_wr_start_finish = jiffies;
  		}
  		bfq_set_budget_timeout(bfqd, bfqq);
  		bfq_log_bfqq(bfqd, bfqq,
  			     "set_in_service_queue, cur-budget = %d",
  			     bfqq->entity.budget);
  	}
  
  	bfqd->in_service_queue = bfqq;
  }
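
  /*
   * Illustration only, not part of bfq-iosched.c: a minimal user-space
   * sketch of how __bfq_set_in_service_queue() above moves the start
   * of a soft real-time weight-raising period forward. The function
   * name and parameters are hypothetical; expired_at plays the role
   * of the overloaded budget_timeout field, and the signed-difference
   * comparison is assumed to behave like the kernel's time_after().
   */

```c
/*
 * If the queue expired after its weight-raising period started, shift
 * the start forward by the time spent without service; otherwise
 * restart the period now.
 */
unsigned long sketch_shift_wr_start(unsigned long last_wr_start,
				    unsigned long expired_at,
				    unsigned long now)
{
	/* Wraparound-safe check that expired_at is later than last_wr_start. */
	if ((long)(last_wr_start - expired_at) < 0)
		return last_wr_start + (now - expired_at);

	return now;
}
```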
  
  /*
   * Get and set a new queue for service.
   */
  static struct bfq_queue *bfq_set_in_service_queue(struct bfq_data *bfqd)
  {
  	struct bfq_queue *bfqq = bfq_get_next_queue(bfqd);
  
  	__bfq_set_in_service_queue(bfqd, bfqq);
  	return bfqq;
  }
  static void bfq_arm_slice_timer(struct bfq_data *bfqd)
  {
  	struct bfq_queue *bfqq = bfqd->in_service_queue;
  	u32 sl;
  	bfq_mark_bfqq_wait_request(bfqq);
  
  	/*
  	 * We don't want to idle for seeks, but we do want to allow
  	 * fair distribution of slice time for a process doing back-to-back
  	 * seeks. So allow a little bit of time for it to submit a new rq.
  	 */
  	sl = bfqd->bfq_slice_idle;
  	/*
  	 * Unless the queue is being weight-raised or the scenario is
  	 * asymmetric, grant only minimum idle time if the queue
  	 * is seeky. A long idling is preserved for a weight-raised
  	 * queue, or, more in general, in an asymmetric scenario,
  	 * because a long idling is needed for guaranteeing to a queue
  	 * its reserved share of the throughput (in particular, it is
  	 * needed if the queue has a higher weight than some other
  	 * queue).
  	 */
  	if (BFQQ_SEEKY(bfqq) && bfqq->wr_coeff == 1 &&
  	    !bfq_asymmetric_scenario(bfqd, bfqq))
  		sl = min_t(u64, sl, BFQ_MIN_TT);
  	else if (bfqq->wr_coeff > 1)
  		sl = max_t(u32, sl, 20ULL * NSEC_PER_MSEC);
  
  	bfqd->last_idling_start = ktime_get();
  	bfqd->last_idling_start_jiffies = jiffies;
  	hrtimer_start(&bfqd->idle_slice_timer, ns_to_ktime(sl),
  		      HRTIMER_MODE_REL);
  	bfqg_stats_set_start_idle_time(bfqq_group(bfqq));
  }
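
  /*
   * Illustration only, not part of bfq-iosched.c: a minimal user-space
   * sketch of how bfq_arm_slice_timer() above picks the idle time. All
   * names are hypothetical. SKETCH_MIN_TT is an assumed value standing
   * in for BFQ_MIN_TT, whose definition is not shown in this part of
   * the file; the 20 ms floor is the one used in the code above.
   */

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NSEC_PER_MSEC	1000000ULL
#define SKETCH_MIN_TT		(2 * SKETCH_NSEC_PER_MSEC) /* assumed value */

/*
 * A seeky, non-weight-raised queue in a symmetric scenario gets at
 * most the minimum idle time; a weight-raised queue gets at least
 * 20 ms; every other queue gets the configured slice_idle.
 */
uint64_t sketch_idle_slice_ns(uint64_t slice_idle_ns, bool seeky,
			      unsigned int wr_coeff, bool asymmetric)
{
	uint64_t wr_floor_ns = 20 * SKETCH_NSEC_PER_MSEC;

	if (seeky && wr_coeff == 1 && !asymmetric)
		return slice_idle_ns < SKETCH_MIN_TT ?
			slice_idle_ns : SKETCH_MIN_TT;

	if (wr_coeff > 1)
		return slice_idle_ns > wr_floor_ns ?
			slice_idle_ns : wr_floor_ns;

	return slice_idle_ns;
}
```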
  
  /*
   * In autotuning mode, max_budget is dynamically recomputed as the
   * amount of sectors transferred in timeout at the estimated peak
   * rate. This enables BFQ to utilize a full timeslice with a full
   * budget, even if the in-service queue is served at peak rate. And
   * this maximises throughput with sequential workloads.
   */
  static unsigned long bfq_calc_max_budget(struct bfq_data *bfqd)
  {
  	return (u64)bfqd->peak_rate * USEC_PER_MSEC *
  		jiffies_to_msecs(bfqd->bfq_timeout)>>BFQ_RATE_SHIFT;
  }
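
  /*
   * Illustration only, not part of bfq-iosched.c: a minimal user-space
   * sketch of the computation in bfq_calc_max_budget() above, taking
   * the timeout directly in milliseconds. The names are hypothetical,
   * and SKETCH_RATE_SHIFT is an assumed value for BFQ_RATE_SHIFT,
   * which is defined elsewhere in the file. With a peak rate of
   * 1 sector/usec and a 125 ms timeout, the budget is 125000 sectors.
   */

```c
#include <stdint.h>

#define SKETCH_RATE_SHIFT	16	/* assumed value of BFQ_RATE_SHIFT */
#define SKETCH_USEC_PER_MSEC	1000ULL

/*
 * peak_rate_fixed is in sectors/usec, left-shifted by
 * SKETCH_RATE_SHIFT; the result is the number of sectors that can be
 * transferred in timeout_ms at that rate.
 */
uint64_t sketch_max_budget(uint32_t peak_rate_fixed, unsigned int timeout_ms)
{
	return ((uint64_t)peak_rate_fixed * SKETCH_USEC_PER_MSEC *
		timeout_ms) >> SKETCH_RATE_SHIFT;
}
```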
  /*
   * Update parameters related to throughput and responsiveness, as a
   * function of the estimated peak rate. See comments on
   * bfq_calc_max_budget(), and on the ref_wr_duration array.
   */
  static void update_thr_responsiveness_params(struct bfq_data *bfqd)
  {
  	if (bfqd->bfq_user_max_budget == 0) {
  		bfqd->bfq_max_budget =
  			bfq_calc_max_budget(bfqd);
  		bfq_log(bfqd, "new max_budget = %d", bfqd->bfq_max_budget);
  	}
  }
  static void bfq_reset_rate_computation(struct bfq_data *bfqd,
  				       struct request *rq)
  {
  	if (rq != NULL) { /* new rq dispatch now, reset accordingly */
  		bfqd->last_dispatch = bfqd->first_dispatch = ktime_get_ns();
  		bfqd->peak_rate_samples = 1;
  		bfqd->sequential_samples = 0;
  		bfqd->tot_sectors_dispatched = bfqd->last_rq_max_size =
  			blk_rq_sectors(rq);
  	} else /* no new rq dispatched, just reset the number of samples */
  		bfqd->peak_rate_samples = 0; /* full re-init on next disp. */
  
  	bfq_log(bfqd,
  		"reset_rate_computation at end, sample %u/%u tot_sects %llu",
  		bfqd->peak_rate_samples, bfqd->sequential_samples,
  		bfqd->tot_sectors_dispatched);
  }
  
  static void bfq_update_rate_reset(struct bfq_data *bfqd, struct request *rq)
  {
  	u32 rate, weight, divisor;
  
  	/*
  	 * For the convergence property to hold (see comments on
  	 * bfq_update_peak_rate()) and for the assessment to be
  	 * reliable, a minimum number of samples must be present, and
  	 * a minimum amount of time must have elapsed. If not so, do
  	 * not compute new rate. Just reset parameters, to get ready
  	 * for a new evaluation attempt.
  	 */
  	if (bfqd->peak_rate_samples < BFQ_RATE_MIN_SAMPLES ||
  	    bfqd->delta_from_first < BFQ_RATE_MIN_INTERVAL)
  		goto reset_computation;
  
  	/*
  	 * If a new request completion has occurred after last
  	 * dispatch, then, to approximate the rate at which requests
  	 * have been served by the device, it is more precise to
  	 * extend the observation interval to the last completion.
  	 */
  	bfqd->delta_from_first =
  		max_t(u64, bfqd->delta_from_first,
  		      bfqd->last_completion - bfqd->first_dispatch);
  
  	/*
  	 * Rate computed in sects/usec, and not sects/nsec, for
  	 * precision issues.
  	 */
  	rate = div64_ul(bfqd->tot_sectors_dispatched<<BFQ_RATE_SHIFT,
  			div_u64(bfqd->delta_from_first, NSEC_PER_USEC));
  
  	/*
  	 * Peak rate not updated if:
  	 * - the percentage of sequential dispatches is below 3/4 of the
  	 *   total, and rate is below the current estimated peak rate
  	 * - rate is unreasonably high (> 20M sectors/sec)
  	 */
  	if ((bfqd->sequential_samples < (3 * bfqd->peak_rate_samples)>>2 &&
  	     rate <= bfqd->peak_rate) ||
  		rate > 20<<BFQ_RATE_SHIFT)
  		goto reset_computation;
  
  	/*
  	 * We have to update the peak rate, at last! To this purpose,
  	 * we use a low-pass filter. We compute the smoothing constant
  	 * of the filter as a function of the 'weight' of the new
  	 * measured rate.
  	 *
  	 * As can be seen in the next formulas, we define this weight as a
  	 * quantity proportional to how sequential the workload is,
  	 * and to how long the observation time interval is.
  	 *
  	 * The weight runs from 0 to 8. The maximum value of the
  	 * weight, 8, yields the minimum value for the smoothing
  	 * constant. At this minimum value for the smoothing constant,
  	 * the measured rate contributes for half of the next value of
  	 * the estimated peak rate.
  	 *
  	 * So, the first step is to compute the weight as a function
  	 * of how sequential the workload is. Note that the weight
  	 * cannot reach 9, because bfqd->sequential_samples cannot
  	 * become equal to bfqd->peak_rate_samples, which, in its
  	 * turn, holds true because bfqd->sequential_samples is not
  	 * incremented for the first sample.
  	 */
  	weight = (9 * bfqd->sequential_samples) / bfqd->peak_rate_samples;
  
  	/*
  	 * Second step: further refine the weight as a function of the
  	 * duration of the observation interval.
  	 */
  	weight = min_t(u32, 8,
  		       div_u64(weight * bfqd->delta_from_first,
  			       BFQ_RATE_REF_INTERVAL));
  
  	/*
  	 * Divisor ranging from 10, for minimum weight, to 2, for
  	 * maximum weight.
  	 */
  	divisor = 10 - weight;
  
  	/*
  	 * Finally, update peak rate:
  	 *
  	 * peak_rate = peak_rate * (divisor-1) / divisor  +  rate / divisor
  	 */
  	bfqd->peak_rate *= divisor-1;
  	bfqd->peak_rate /= divisor;
  	rate /= divisor; /* smoothing constant alpha = 1/divisor */
  
  	bfqd->peak_rate += rate;
  
  	/*
  	 * For a very slow device, bfqd->peak_rate can reach 0 (see
  	 * the minimum representable values reported in the comments
  	 * on BFQ_RATE_SHIFT). Push to 1 if this happens, to avoid
  	 * divisions by zero where bfqd->peak_rate is used as a
  	 * divisor.
  	 */
  	bfqd->peak_rate = max_t(u32, 1, bfqd->peak_rate);
  	update_thr_responsiveness_params(bfqd);
  
  reset_computation:
  	bfq_reset_rate_computation(bfqd, rq);
  }
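
  /*
   * Illustration only, not part of bfq-iosched.c: a minimal user-space
   * sketch of the low-pass filter in bfq_update_rate_reset() above.
   * The function names are hypothetical, the reference interval is
   * passed as a parameter instead of using BFQ_RATE_REF_INTERVAL, and
   * a 64-bit intermediate is used for the multiplication (a choice of
   * this sketch, not of the kernel code).
   */

```c
#include <stdint.h>

/*
 * Weight in [0, 8]: grows with the fraction of sequential samples and
 * with the length of the observation interval.
 */
unsigned int sketch_filter_weight(uint32_t sequential_samples,
				  uint32_t total_samples,
				  uint64_t observed_ns,
				  uint64_t ref_interval_ns)
{
	uint64_t weight = (9ULL * sequential_samples) / total_samples;

	weight = weight * observed_ns / ref_interval_ns;
	return weight > 8 ? 8 : (unsigned int)weight;
}

/*
 * peak_rate = peak_rate * (divisor - 1) / divisor + rate / divisor,
 * with divisor = 10 - weight, and the result clamped to at least 1.
 */
uint32_t sketch_smooth_peak_rate(uint32_t peak_rate, uint32_t rate,
				 unsigned int weight)
{
	uint32_t divisor = 10 - weight;
	uint64_t smoothed = (uint64_t)peak_rate * (divisor - 1) / divisor;

	smoothed += rate / divisor;
	return smoothed ? (uint32_t)smoothed : 1;
}
```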
  
  /*
   * Update the read/write peak rate (the main quantity used for
   * auto-tuning, see update_thr_responsiveness_params()).
   *
   * It is not trivial to estimate the peak rate (correctly): because of
   * the presence of sw and hw queues between the scheduler and the
   * device components that finally serve I/O requests, it is hard to
   * say exactly when a given dispatched request is served inside the
   * device, and for how long. As a consequence, it is hard to know
   * precisely at what rate a given set of requests is actually served
   * by the device.
   *
   * On the opposite end, the dispatch time of any request is trivially
   * available, and, from this piece of information, the "dispatch rate"
   * of requests can be immediately computed. So, the idea in the next
   * function is to use what is known, namely request dispatch times
   * (plus, when useful, request completion times), to estimate what is
   * unknown, namely in-device request service rate.
   *
   * The main issue is that, because of the above facts, the rate at
   * which a certain set of requests is dispatched over a certain time
   * interval can vary greatly with respect to the rate at which the
   * same requests are then served. But, since the size of any
   * intermediate queue is limited, and the service scheme is lossless
   * (no request is silently dropped), the following obvious convergence
   * property holds: the number of requests dispatched MUST become
   * closer and closer to the number of requests completed as the
   * observation interval grows. This is the key property used in
   * the next function to estimate the peak service rate as a function
   * of the observed dispatch rate. The function assumes that it is
   * invoked on every request dispatch.
   */
  static void bfq_update_peak_rate(struct bfq_data *bfqd, struct request *rq)
  {
  	u64 now_ns = ktime_get_ns();
  
  	if (bfqd->peak_rate_samples == 0) { /* first dispatch */
  		bfq_log(bfqd, "update_peak_rate: goto reset, samples %d",
  			bfqd->peak_rate_samples);
  		bfq_reset_rate_computation(bfqd, rq);
  		goto update_last_values; /* will add one sample */
  	}
  
  	/*
  	 * Device idle for very long: the observation interval lasting
  	 * up to this dispatch cannot be a valid observation interval
  	 * for computing a new peak rate (similarly to the late-
  	 * completion event in bfq_completed_request()). Go to
  	 * update_rate_and_reset to have the following three steps
  	 * taken:
  	 * - close the observation interval at the last (previous)
  	 *   request dispatch or completion
  	 * - compute rate, if possible, for that observation interval
  	 * - start a new observation interval with this dispatch
  	 */
  	if (now_ns - bfqd->last_dispatch > 100*NSEC_PER_MSEC &&
  	    bfqd->rq_in_driver == 0)
  		goto update_rate_and_reset;
  
  	/* Update sampling information */
  	bfqd->peak_rate_samples++;
  
  	if ((bfqd->rq_in_driver > 0 ||
  		now_ns - bfqd->last_completion < BFQ_MIN_TT)
  	    && !BFQ_RQ_SEEKY(bfqd, bfqd->last_position, rq))
  		bfqd->sequential_samples++;
  
  	bfqd->tot_sectors_dispatched += blk_rq_sectors(rq);
  
  	/* Reset max observed rq size every 32 dispatches */
  	if (likely(bfqd->peak_rate_samples % 32))
  		bfqd->last_rq_max_size =
  			max_t(u32, blk_rq_sectors(rq), bfqd->last_rq_max_size);
  	else
  		bfqd->last_rq_max_size = blk_rq_sectors(rq);
  
  	bfqd->delta_from_first = now_ns - bfqd->first_dispatch;
  
  	/* Target observation interval not yet reached, go on sampling */
  	if (bfqd->delta_from_first < BFQ_RATE_REF_INTERVAL)
  		goto update_last_values;
  
  update_rate_and_reset:
  	bfq_update_rate_reset(bfqd, rq);
  update_last_values:
  	bfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq);
  	if (RQ_BFQQ(rq) == bfqd->in_service_queue)
  		bfqd->in_serv_last_pos = bfqd->last_position;
  	bfqd->last_dispatch = now_ns;
  }
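
The per-dispatch sampling done by bfq_update_peak_rate() can be illustrated in user space. The following is a minimal sketch, not kernel code: struct rate_sampler and sampler_add() are hypothetical names, the device_busy flag stands in for the "rq_in_driver > 0 or very recent completion" test, and the seeky flag stands in for BFQ_RQ_SEEKY(). The first-dispatch reset and the idle-device shortcut are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the sampling fields of struct bfq_data. */
struct rate_sampler {
	int samples;               /* dispatches seen in this observation interval */
	int sequential_samples;    /* dispatches classified as sequential */
	uint64_t tot_sectors;      /* sectors dispatched in this interval */
	uint32_t last_rq_max_size; /* largest request in the current 32-sample window */
};

/*
 * Account for one dispatched request of 'sectors' sectors.
 * 'device_busy' and 'seeky' are assumptions that replace the kernel's
 * own checks on in-flight requests and on request position.
 */
static void sampler_add(struct rate_sampler *s, uint32_t sectors,
			bool device_busy, bool seeky)
{
	s->samples++;

	/* A sample counts as sequential only if the device was kept busy */
	if (device_busy && !seeky)
		s->sequential_samples++;

	s->tot_sectors += sectors;

	/* Restart the running maximum whenever samples is a multiple of 32 */
	if (s->samples % 32)
		s->last_rq_max_size = sectors > s->last_rq_max_size ?
				      sectors : s->last_rq_max_size;
	else
		s->last_rq_max_size = sectors;
}
```

Restarting the maximum every 32 samples keeps one unusually large request from dominating the estimate for the rest of the observation interval.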
  
  /*
   * Remove request from internal lists.
   */
  static void bfq_dispatch_remove(struct request_queue *q, struct request *rq)
  {
  	struct bfq_queue *bfqq = RQ_BFQQ(rq);
  
  	/*
  	 * For consistency, the next instruction should have been
  	 * executed after removing the request from the queue and
  	 * dispatching it.  We instead execute this instruction before
  	 * bfq_remove_request() (and hence introduce a temporary
  	 * inconsistency), for efficiency.  In fact, should this
  	 * dispatch occur for a non in-service bfqq, this early
  	 * increment prevents two counters related to bfqq->dispatched
  	 * from being first uselessly decremented, and then
  	 * incremented again when the (new) value of bfqq->dispatched
  	 * happens to be taken into account.
  	 */
  	bfqq->dispatched++;
  	bfq_update_peak_rate(q->elevator->elevator_data, rq);
  
  	bfq_remove_request(q, rq);
  }
  /*
   * There is a case where idling does not have to be performed for
   * throughput concerns, but to preserve the throughput share of
   * the process associated with bfqq.
   *
   * To introduce this case, we can note that allowing the drive
   * to enqueue more than one request at a time, and hence
   * delegating de facto final scheduling decisions to the
   * drive's internal scheduler, entails loss of control on the
   * actual request service order. In particular, the critical
   * situation is when requests from different processes happen
   * to be present, at the same time, in the internal queue(s)
   * of the drive. In such a situation, the drive, by deciding
   * the service order of the internally-queued requests, does
   * determine also the actual throughput distribution among
   * these processes. But the drive typically has no notion or
   * concern about per-process throughput distribution, and
   * makes its decisions only on a per-request basis. Therefore,
   * the service distribution enforced by the drive's internal
   * scheduler is likely to coincide with the desired throughput
   * distribution only in a completely symmetric, or favorably
   * skewed scenario where:
   * (i-a) each of these processes must get the same throughput as
   *	 the others,
   * (i-b) in case (i-a) does not hold, it holds that the process
   *       associated with bfqq must receive a lower or equal
   *	 throughput than any of the other processes;
   * (ii)  the I/O of each process has the same properties, in
   *       terms of locality (sequential or random), direction
   *       (reads or writes), request sizes, greediness
   *       (from I/O-bound to sporadic), and so on;
   *
   * In fact, in such a scenario, the drive tends to treat the requests
   * of each process in about the same way as the requests of the
   * others, and thus to provide each of these processes with about the
   * same throughput.  This is exactly the desired throughput
   * distribution if (i-a) holds, or, if (i-b) holds instead, this is an
   * even more convenient distribution for (the process associated with)
   * bfqq.
   *
   * In contrast, in any asymmetric or unfavorable scenario, device
   * idling (I/O-dispatch plugging) is certainly needed to guarantee
   * that bfqq receives its assigned fraction of the device throughput
   * (see [1] for details).
   *
   * The problem is that idling may significantly reduce throughput with
   * certain combinations of types of I/O and devices. An important
   * example is sync random I/O on flash storage with command
   * queueing. So, unless bfqq falls in cases where idling also boosts
   * throughput, it is important to check conditions (i-a), (i-b) and
   * (ii) accurately, so as to avoid idling when not strictly needed for
   * service guarantees.
   *
   * Unfortunately, it is extremely difficult to thoroughly check
   * condition (ii). And, in case there are active groups, it becomes
   * very difficult to check conditions (i-a) and (i-b) too.  In fact,
   * if there are active groups, then, for conditions (i-a) or (i-b) to
   * become false 'indirectly', it is enough that an active group
   * contains more active processes or sub-groups than some other active
   * group. More precisely, for conditions (i-a) or (i-b) to become
   * false because of such a group, it is not even necessary that the
   * group is (still) active: it is sufficient that, even if the group
   * has become inactive, some of its descendant processes still have
   * some request already dispatched but still waiting for
   * completion. In fact, requests have still to be guaranteed their
   * share of the throughput even after being dispatched. In this
   * respect, it is easy to show that, if a group frequently becomes
   * inactive while still having in-flight requests, and if, when this
   * happens, the group is not considered in the calculation of whether
   * the scenario is asymmetric, then the group may fail to be
   * guaranteed its fair share of the throughput (basically because
   * idling may not be performed for the descendant processes of the
   * group, although it should be).  We address this issue with the following
   * bi-modal behavior, implemented in the function
   * bfq_asymmetric_scenario().
   *
   * If there are groups with requests waiting for completion
   * (as commented above, some of these groups may even be
   * already inactive), then the scenario is tagged as
   * asymmetric, conservatively, without checking any of the
   * conditions (i-a), (i-b) or (ii). So the device is idled for bfqq.
   * This behavior matches also the fact that groups are created
   * exactly if controlling I/O is a primary concern (to
   * preserve bandwidth and latency guarantees).
   *
   * On the opposite end, if there are no groups with requests waiting
   * for completion, then only conditions (i-a) and (i-b) are actually
   * checked, i.e., provided that condition (i-a) or (i-b) holds,
   * idling is not performed, regardless of whether condition (ii)
   * holds.  In other words, idling is allowed only if neither (i-a) nor
   * (i-b) holds, in which case the device tends to be prevented
   * from queueing many requests, possibly of several processes. Since
   * there are no groups with requests waiting for completion, to
   * check conditions (i-a) and (i-b) it is enough to check just
   * whether all the queues with requests waiting for completion also
   * have the same weight.
   *
   * Not checking condition (ii) evidently exposes bfqq to the
   * risk of getting less throughput than its fair share.
   * However, for queues with the same weight, a further
   * mechanism, preemption, mitigates or even eliminates this
   * problem. And it does so without consequences on overall
   * throughput. This mechanism and its benefits are explained
   * in the next three paragraphs.
   *
   * Even if a queue, say Q, is expired when it remains idle, Q
   * can still preempt the new in-service queue if the next
   * request of Q arrives soon (see the comments on
   * bfq_bfqq_update_budg_for_activation). If all queues and
   * groups have the same weight, this form of preemption,
   * combined with the hole-recovery heuristic described in the
   * comments on function bfq_bfqq_update_budg_for_activation,
   * is enough to preserve a correct bandwidth distribution in
   * the mid term, even without idling. In fact, even if not
   * idling allows the internal queues of the device to contain
   * many requests, and thus to reorder requests, we can rather
   * safely assume that the internal scheduler still preserves a
   * minimum of mid-term fairness.
   *
   * More precisely, this preemption-based, idleless approach
   * provides fairness in terms of IOPS, and not sectors per
   * second. This can be seen with a simple example. Suppose
   * that there are two queues with the same weight, but that
   * the first queue receives requests of 8 sectors, while the
   * second queue receives requests of 1024 sectors. In
   * addition, suppose that each of the two queues contains at
   * most one request at a time, which implies that each queue
   * always remains idle after it is served. Finally, after
   * remaining idle, each queue receives very quickly a new
   * request. It follows that the two queues are served
   * alternately, preempting each other if needed. This
   * implies that, although both queues have the same weight,
   * the queue with large requests receives a service that is
   * 1024/8 times as high as the service received by the other
   * queue.
   *
   * The motivation for using preemption instead of idling (for
   * queues with the same weight) is that, by not idling,
   * service guarantees are preserved (completely or at least in
   * part) without sacrificing any throughput. And, if
   * there is no active group, then the primary expectation for
   * this device is probably a high throughput.
   *
   * We are now left only with explaining the two sub-conditions in the
   * additional compound condition that is checked below for deciding
   * whether the scenario is asymmetric. To explain the first
   * sub-condition, we need to add that the function
   * bfq_asymmetric_scenario checks the weights of only
   * non-weight-raised queues, for efficiency reasons (see comments on
   * bfq_weights_tree_add()). Then the fact that bfqq is weight-raised
   * is checked explicitly here. More precisely, the compound condition
   * below takes into account also the fact that, even if bfqq is being
   * weight-raised, the scenario is still symmetric if all queues with
   * requests waiting for completion happen to be
   * weight-raised. Actually, we should be even more precise here, and
   * differentiate between interactive weight raising and soft real-time
   * weight raising.
   *
   * The second sub-condition checked in the compound condition is
   * whether there is a fair amount of already in-flight I/O not
   * belonging to bfqq. If so, I/O dispatching is to be plugged, for the
   * following reason. The drive may decide to serve in-flight
   * non-bfqq's I/O requests before bfqq's ones, thereby delaying the
   * arrival of new I/O requests for bfqq (recall that bfqq is sync). If
   * I/O-dispatching is not plugged, then, while bfqq remains empty, a
   * basically uncontrolled amount of I/O from other queues may be
   * dispatched too, possibly causing the service of bfqq's I/O to be
   * delayed even longer in the drive. This problem gets more and more
   * serious as the speed and the queue depth of the drive grow,
   * because, as these two quantities grow, the probability of finding no
   * queue busy but many requests in flight grows too. By contrast,
   * plugging I/O dispatching minimizes the delay induced by already
   * in-flight I/O, and enables bfqq to recover the bandwidth it may
   * lose because of this delay.
   *
   * As a side note, it is worth considering that the above
   * device-idling countermeasures may however fail in the following
   * unlucky scenario: if I/O-dispatch plugging is (correctly) disabled
   * in a time period during which all symmetry sub-conditions hold, and
   * therefore the device is allowed to enqueue many requests, but at
   * some later point in time some sub-condition stops holding, then it
   * may become impossible to make requests be served in the desired
   * order until all the requests already queued in the device have been
   * served. The last sub-condition commented above somewhat mitigates
   * this problem for weight-raised queues.
   */
  static bool idling_needed_for_service_guarantees(struct bfq_data *bfqd,
  						 struct bfq_queue *bfqq)
  {
  	/* No point in idling for bfqq if it won't get requests any longer */
  	if (unlikely(!bfqq_process_refs(bfqq)))
  		return false;
  	return (bfqq->wr_coeff > 1 &&
  		(bfqd->wr_busy_queues <
  		 bfq_tot_busy_queues(bfqd) ||
  		 bfqd->rq_in_driver >=
  		 bfqq->dispatched + 4)) ||
  		bfq_asymmetric_scenario(bfqd, bfqq);
  }
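
The compound condition returned above can be read as a pure predicate over a handful of counters. The following is a minimal user-space sketch, assuming a hypothetical struct idle_inputs that snapshots those counters; the asymmetric field stands in for the result of bfq_asymmetric_scenario(), and the early exit on process references is omitted.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the counters read by the check; not the kernel structs. */
struct idle_inputs {
	unsigned int wr_coeff; /* > 1 means bfqq is weight-raised */
	int wr_busy_queues;    /* busy weight-raised queues */
	int tot_busy_queues;   /* all busy queues */
	int rq_in_driver;      /* requests in flight in the device */
	int dispatched;        /* in-flight requests that belong to bfqq */
	bool asymmetric;       /* stand-in for bfq_asymmetric_scenario() */
};

static bool idling_needed(const struct idle_inputs *in)
{
	/* First sub-condition: some busy queue is not weight-raised */
	bool some_queue_not_raised = in->wr_busy_queues < in->tot_busy_queues;
	/* Second sub-condition: at least 4 in-flight requests are not bfqq's */
	bool much_foreign_io = in->rq_in_driver >= in->dispatched + 4;

	return (in->wr_coeff > 1 &&
		(some_queue_not_raised || much_foreign_io)) ||
	       in->asymmetric;
}
```

For example, a weight-raised queue with one request in flight does not trigger idling through the second sub-condition when the device holds three requests in total, but does when it holds five.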
  
  static bool __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq,
  			      enum bfqq_expiration reason)
  {
  	/*
  	 * If this bfqq is shared between multiple processes, check
  	 * to make sure that those processes are still issuing I/Os
  	 * within the mean seek distance. If not, it may be time to
  	 * break the queues apart again.
  	 */
  	if (bfq_bfqq_coop(bfqq) && BFQQ_SEEKY(bfqq))
  		bfq_mark_bfqq_split_coop(bfqq);
  	/*
  	 * Consider queues with a higher finish virtual time than
  	 * bfqq. If idling_needed_for_service_guarantees(bfqq) returns
  	 * true, then bfqq's bandwidth would be violated if an
  	 * uncontrolled amount of I/O from these queues were
  	 * dispatched while bfqq is waiting for its new I/O to
  	 * arrive. This is exactly what may happen if this is a forced
  	 * expiration caused by a preemption attempt, and if bfqq is
  	 * not re-scheduled. To prevent this from happening, re-queue
  	 * bfqq if it needs I/O-dispatch plugging, even if it is
  	 * empty. By doing so, bfqq is guaranteed to be served before the
  	 * above queues (provided that bfqq is of course eligible).
  	 */
  	if (RB_EMPTY_ROOT(&bfqq->sort_list) &&
  	    !(reason == BFQQE_PREEMPTED &&
  	      idling_needed_for_service_guarantees(bfqd, bfqq))) {
  		if (bfqq->dispatched == 0)
  			/*
  			 * Overloading budget_timeout field to store
  			 * the time at which the queue remains with no
  			 * backlog and no outstanding request; used by
  			 * the weight-raising mechanism.
  			 */
  			bfqq->budget_timeout = jiffies;
  		bfq_del_bfqq_busy(bfqd, bfqq, true);
  	} else {
  		bfq_requeue_bfqq(bfqd, bfqq, true);
  		/*
  		 * Resort priority tree of potential close cooperators.
  		 * See comments on bfq_pos_tree_add_move() for the unlikely().
  		 */
  		if (unlikely(!bfqd->nonrot_with_queueing &&
  			     !RB_EMPTY_ROOT(&bfqq->sort_list)))
  			bfq_pos_tree_add_move(bfqd, bfqq);
  	}
  
  	/*
  	 * All in-service entities must have been properly deactivated
  	 * or requeued before executing the next function, which
  	 * resets all in-service entities as no more in service. This
  	 * may cause bfqq to be freed. If this happens, the next
  	 * function returns true.
  	 */
  	return __bfq_bfqd_reset_in_service(bfqd);
  }
  
  /**
   * __bfq_bfqq_recalc_budget - try to adapt the budget to the @bfqq behavior.
   * @bfqd: device data.
   * @bfqq: queue to update.
   * @reason: reason for expiration.
   *
   * Handle the feedback on @bfqq budget at queue expiration.
   * See the body for detailed comments.
   */
  static void __bfq_bfqq_recalc_budget(struct bfq_data *bfqd,
  				     struct bfq_queue *bfqq,
  				     enum bfqq_expiration reason)
  {
  	struct request *next_rq;
  	int budget, min_budget;
  	min_budget = bfq_min_budget(bfqd);
  	if (bfqq->wr_coeff == 1)
  		budget = bfqq->max_budget;
  	else /*
  	      * Use a constant, low budget for weight-raised queues,
  	      * to help achieve a low latency. Keep it slightly higher
  	      * than the minimum possible budget, to cause slightly
  	      * fewer expirations.
  	      */
  		budget = 2 * min_budget;
  	bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last budg %d, budg left %d",
  		bfqq->entity.budget, bfq_bfqq_budget_left(bfqq));
  	bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last max_budg %d, min budg %d",
  		budget, bfq_min_budget(bfqd));
  	bfq_log_bfqq(bfqd, bfqq, "recalc_budg: sync %d, seeky %d",
  		bfq_bfqq_sync(bfqq), BFQQ_SEEKY(bfqd->in_service_queue));
  	if (bfq_bfqq_sync(bfqq) && bfqq->wr_coeff == 1) {
  		switch (reason) {
  		/*
  		 * Caveat: in all the following cases we trade latency
  		 * for throughput.
  		 */
  		case BFQQE_TOO_IDLE:
  			/*
  			 * This is the only case where we may reduce
  			 * the budget: if there is no request of the
  			 * process still waiting for completion, then
  			 * we assume (tentatively) that the timer has
  			 * expired because the batch of requests of
  			 * the process could have been served with a
  			 * smaller budget.  Hence, betting that
  			 * process will behave in the same way when it
  			 * becomes backlogged again, we reduce its
  			 * next budget.  As long as we guess right,
  			 * this budget cut reduces the latency
  			 * experienced by the process.
  			 *
  			 * However, if there are still outstanding
  			 * requests, then the process may not yet have
  			 * issued its next request just because it is
  			 * still waiting for the completion of some of
  			 * the still outstanding ones.  So in this
  			 * subcase we do not reduce its budget, on the
  			 * contrary we increase it to possibly boost
  			 * the throughput, as discussed in the
  			 * comments to the BUDGET_TIMEOUT case.
  			 */
  			if (bfqq->dispatched > 0) /* still outstanding reqs */
  				budget = min(budget * 2, bfqd->bfq_max_budget);
  			else {
  				if (budget > 5 * min_budget)
  					budget -= 4 * min_budget;
  				else
  					budget = min_budget;
  			}
  			break;
  		case BFQQE_BUDGET_TIMEOUT:
  			/*
  			 * We double the budget here because it gives
  			 * the chance to boost the throughput if this
  			 * is not a seeky process (and has bumped into
  			 * this timeout because of, e.g., ZBR).
  			 */
  			budget = min(budget * 2, bfqd->bfq_max_budget);
  			break;
  		case BFQQE_BUDGET_EXHAUSTED:
  			/*
  			 * The process still has backlog, and did not
  			 * let either the budget timeout or the disk
  			 * idling timeout expire. Hence it is not
  			 * seeky, has a short thinktime and may be
  			 * happy with a higher budget too. So
  			 * definitely increase the budget of this good
  			 * candidate to boost the disk throughput.
  			 */
  			budget = min(budget * 4, bfqd->bfq_max_budget);
  			break;
  		case BFQQE_NO_MORE_REQUESTS:
  			/*
  			 * For queues that expire for this reason, it
  			 * is particularly important to keep the
  			 * budget close to the actual service they
  			 * need. Doing so reduces the timestamp
  			 * misalignment problem described in the
  			 * comments in the body of
  			 * __bfq_activate_entity. In fact, suppose
  			 * that a queue systematically expires for
  			 * BFQQE_NO_MORE_REQUESTS and presents a
  			 * new request in time to enjoy timestamp
  			 * back-shifting. The larger the budget of the
  			 * queue is with respect to the service the
  			 * queue actually requests in each service
  			 * slot, the more times the queue can be
  			 * reactivated with the same virtual finish
  			 * time. It follows that, even if this finish
  			 * time is pushed to the system virtual time
  			 * to reduce the consequent timestamp
  			 * misalignment, the queue unjustly enjoys for
  			 * many re-activations a lower finish time
  			 * than all newly activated queues.
  			 *
  			 * The service needed by bfqq is measured
  			 * quite precisely by bfqq->entity.service.
  			 * Since bfqq does not enjoy device idling,
  			 * bfqq->entity.service is equal to the number
  			 * of sectors that the process associated with
  			 * bfqq requested to read/write before waiting
  			 * for request completions, or blocking for
  			 * other reasons.
  			 */
  			budget = max_t(int, bfqq->entity.service, min_budget);
  			break;
  		default:
  			return;
  		}
  	} else if (!bfq_bfqq_sync(bfqq)) {
  		/*
  		 * Async queues always get the maximum possible
  		 * budget, as for them we do not care about latency
  		 * (in addition, their ability to dispatch is limited
  		 * by the charging factor).
  		 */
  		budget = bfqd->bfq_max_budget;
  	}
  
  	bfqq->max_budget = budget;
  
  	if (bfqd->budgets_assigned >= bfq_stats_min_budgets &&
  	    !bfqd->bfq_user_max_budget)
  		bfqq->max_budget = min(bfqq->max_budget, bfqd->bfq_max_budget);
  
  	/*
  	 * If there is still backlog, then assign a new budget, making
  	 * sure that it is large enough for the next request.  Since
  	 * the finish time of bfqq must be kept in sync with the
  	 * budget, be sure to call __bfq_bfqq_expire() *after* this
  	 * update.
  	 *
  	 * If there is no backlog, then no need to update the budget;
  	 * it will be updated on the arrival of a new request.
  	 */
  	next_rq = bfqq->next_rq;
  	if (next_rq)
  		bfqq->entity.budget = max_t(unsigned long, bfqq->max_budget,
  					    bfq_serv_to_charge(next_rq, bfqq));
  
  	bfq_log_bfqq(bfqd, bfqq, "head sect: %u, new budget %d",
  			next_rq ? blk_rq_sectors(next_rq) : 0,
  			bfqq->entity.budget);
  }
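
The feedback rules applied to sync, non-weight-raised queues in the switch above can be condensed into one small function. The following is a minimal user-space sketch under stated assumptions: enum expiry_reason and next_budget() are hypothetical names that mirror the four BFQQE_* cases handled above, and the budgets are plain integers counted in sectors.

```c
/* Hypothetical mirror of the four expiration reasons handled in the switch. */
enum expiry_reason {
	TOO_IDLE,
	BUDGET_TIMEOUT,
	BUDGET_EXHAUSTED,
	NO_MORE_REQUESTS
};

static int min_int(int a, int b) { return a < b ? a : b; }
static int max_int(int a, int b) { return a > b ? a : b; }

/*
 * Return the next maximum budget of a sync, non-weight-raised queue.
 * 'dispatched' is the number of its requests still in flight and
 * 'service' is the number of sectors it received in the last slot.
 */
static int next_budget(int budget, int min_budget, int max_budget,
		       int dispatched, int service, enum expiry_reason reason)
{
	switch (reason) {
	case TOO_IDLE:
		/* Outstanding requests: the process may just be waiting */
		if (dispatched > 0)
			return min_int(budget * 2, max_budget);
		/* Otherwise shrink, but never below the minimum budget */
		return budget > 5 * min_budget ?
		       budget - 4 * min_budget : min_budget;
	case BUDGET_TIMEOUT:
		return min_int(budget * 2, max_budget);
	case BUDGET_EXHAUSTED:
		return min_int(budget * 4, max_budget);
	case NO_MORE_REQUESTS:
		/* Track the service actually needed by the queue */
		return max_int(service, min_budget);
	}
	return budget;
}
```

With min_budget = 32 and max_budget = 16384, a queue with budget 1000 that expires as too idle with nothing in flight drops to 872, while the same queue exhausting its budget grows to 4000.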
  /*
   * Return true if the process associated with bfqq is "slow". The slow
   * flag is used, in addition to the budget timeout, to reduce the
   * amount of service provided to seeky processes, and thus reduce
   * their chances to lower the throughput. More details in the comments
   * on the function bfq_bfqq_expire().
   *
   * An important observation is in order: as discussed in the comments
   * on the function bfq_update_peak_rate(), with devices with internal
   * queues, it is hard, if at all possible, to know when and for how long
   * an I/O request is processed by the device (apart from the trivial
   * I/O pattern where a new request is dispatched only after the
   * previous one has been completed). This makes it hard to evaluate
   * the real rate at which the I/O requests of each bfq_queue are
   * served.  In fact, for an I/O scheduler like BFQ, serving a
   * bfq_queue means just dispatching its requests during its service
   * slot (i.e., until the budget of the queue is exhausted, or the
   * queue remains idle, or, finally, a timeout fires). But, during the
   * service slot of a bfq_queue, around 100 ms at most, the device may
   * even still be processing requests of bfq_queues served in previous
   * service slots. On the opposite end, the requests of the in-service
   * bfq_queue may be completed after the service slot of the queue
   * finishes.
   *
   * Anyway, unless more sophisticated solutions are used
   * (where possible), the sum of the sizes of the requests dispatched
   * during the service slot of a bfq_queue is probably the only
   * approximation available for the service received by the bfq_queue
   * during its service slot. And this sum is the quantity used in this
   * function to evaluate the I/O speed of a process.
   */
  static bool bfq_bfqq_is_slow(struct bfq_data *bfqd, struct bfq_queue *bfqq,
  				 bool compensate, enum bfqq_expiration reason,
  				 unsigned long *delta_ms)
  {
  	ktime_t delta_ktime;
  	u32 delta_usecs;
  	bool slow = BFQQ_SEEKY(bfqq); /* if delta too short, use seekiness */

  	if (!bfq_bfqq_sync(bfqq))
  		return false;
  
  	if (compensate)
  		delta_ktime = bfqd->last_idling_start;
  	else
  		delta_ktime = ktime_get();
  	delta_ktime = ktime_sub(delta_ktime, bfqd->last_budget_start);
  	delta_usecs = ktime_to_us(delta_ktime);
  
  	/* don't use too short time intervals */
  	if (delta_usecs < 1000) {
  		if (blk_queue_nonrot(bfqd->queue))
  			 /*
  			  * give same worst-case guarantees as idling
  			  * for seeky
  			  */
  			*delta_ms = BFQ_MIN_TT / NSEC_PER_MSEC;
  		else /* charge at least one seek */
  			*delta_ms = bfq_slice_idle / NSEC_PER_MSEC;
  
  		return slow;
  	}

  	*delta_ms = delta_usecs / USEC_PER_MSEC;
  
  	/*
  	 * Use only long (> 20ms) intervals to filter out excessive
  	 * spikes in service rate estimation.
  	 */
  	if (delta_usecs > 20000) {
  		/*
  		 * Caveat for rotational devices: processes doing I/O
  		 * in the slower disk zones tend to be slow(er) even
  		 * if not seeky. In this respect, the estimated peak
  		 * rate is likely to be an average over the disk
  		 * surface. Accordingly, to not be too harsh with
  		 * unlucky processes, a process is deemed slow only if
  		 * its rate has been lower than half of the estimated
  		 * peak rate.
  		 */
  		slow = bfqq->entity.service < bfqd->bfq_max_budget / 2;
  	}
  	bfq_log_bfqq(bfqd, bfqq, "bfq_bfqq_is_slow: slow %d", slow);

  	return slow;
  }
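
The decision taken by bfq_bfqq_is_slow() for a sync queue depends only on the length of the measured interval, on seekiness, and on the service received. The following is a minimal user-space sketch of that decision, assuming a hypothetical queue_is_slow() helper that receives these quantities as plain arguments; the async early exit and the computation of delta_ms are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Classify a sync queue as slow or not.
 * delta_usecs: duration of the measured interval, in microseconds
 * seeky:       stand-in for BFQQ_SEEKY(bfqq)
 * service:     sectors received by the queue in its service slot
 * max_budget:  stand-in for bfqd->bfq_max_budget, in sectors
 */
static bool queue_is_slow(uint32_t delta_usecs, bool seeky,
			  int service, int max_budget)
{
	bool slow = seeky; /* default: fall back on seekiness */

	/* Interval too short to say anything about the rate */
	if (delta_usecs < 1000)
		return slow;

	/* Only intervals longer than 20 ms give a usable rate estimate */
	if (delta_usecs > 20000)
		slow = service < max_budget / 2;

	return slow;
}
```

Note that a seeky queue measured over a long interval is not deemed slow if it still received at least half of the maximum budget.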
  
  /*
   * To be deemed as soft real-time, an application must meet two
   * requirements. First, the application must not require an average
   * bandwidth higher than the approximate bandwidth required to play back or
   * record a compressed high-definition video.
   * The next function is invoked on the completion of the last request of a
   * batch, to compute the next-start time instant, soft_rt_next_start, such
   * that, if the next request of the application does not arrive before
   * soft_rt_next_start, then the above requirement on the bandwidth is met.
   *
   * The second requirement is that the request pattern of the application is
   * isochronous, i.e., that, after issuing a request or a batch of requests,
   * the application stops issuing new requests until all its pending requests
   * have been completed. After that, the application may issue a new batch,
   * and so on.
   * For this reason the next function is invoked to compute
   * soft_rt_next_start only for applications that meet this requirement,
   * whereas soft_rt_next_start is set to infinity for applications that do
   * not.
   *
a34b02444   Paolo Valente   block, bfq: consi...
3695
3696
3697
3698
3699
3700
3701
3702
3703
3704
3705
3706
3707
3708
3709
3710
3711
3712
3713
3714
3715
3716
3717
3718
3719
3720
3721
3722
3723
3724
3725
3726
3727
3728
3729
3730
3731
3732
3733
3734
3735
3736
3737
3738
3739
3740
3741
3742
3743
3744
3745
3746
3747
3748
3749
3750
3751
3752
3753
3754
3755
3756
3757
3758
   * Unfortunately, even a greedy (i.e., I/O-bound) application may
   * happen to meet, occasionally or systematically, both the above
   * bandwidth and isochrony requirements. This may happen at least in
   * the following circumstances. First, if the CPU load is high. The
   * application may stop issuing requests while the CPUs are busy
   * serving other processes, then restart, then stop again for a while,
   * and so on. The other circumstances are related to the storage
   * device: the storage device is highly loaded or reaches a low-enough
   * throughput with the I/O of the application (e.g., because the I/O
   * is random and/or the device is slow). In all these cases, the
   * I/O of the application may be simply slowed down enough to meet
   * the bandwidth and isochrony requirements. To reduce the probability
   * that greedy applications are deemed as soft real-time in these
   * corner cases, a further rule is used in the computation of
   * soft_rt_next_start: the return value of this function is forced to
   * be higher than the maximum between the following two quantities.
   *
   * (a) Current time plus: (1) the maximum time for which the arrival
   *     of a request is waited for when a sync queue becomes idle,
   *     namely bfqd->bfq_slice_idle, and (2) a few extra jiffies. We
   *     postpone for a moment the reason for adding a few extra
   *     jiffies; we get back to it after next item (b).  Lower-bounding
   *     the return value of this function with the current time plus
   *     bfqd->bfq_slice_idle tends to filter out greedy applications,
   *     because the latter issue their next request as soon as possible
   *     after the last one has been completed. In contrast, a soft
   *     real-time application spends some time processing data, after a
   *     batch of its requests has been completed.
   *
   * (b) Current value of bfqq->soft_rt_next_start. As pointed out
   *     above, greedy applications may happen to meet both the
   *     bandwidth and isochrony requirements under heavy CPU or
   *     storage-device load. In more detail, in these scenarios, these
   *     applications happen, only for limited time periods, to do I/O
   *     slowly enough to meet all the requirements described so far,
   *     including the filtering in above item (a). These slow-speed
   *     time intervals are usually interspersed between other time
   *     intervals during which these applications do I/O at a very high
   *     speed. Fortunately, exactly because of the high speed of the
   *     I/O in the high-speed intervals, the values returned by this
   *     function happen to be so high, near the end of any such
   *     high-speed interval, as to be likely to fall *after* the end of
   *     the low-speed time interval that follows. These high values are
   *     stored in bfqq->soft_rt_next_start after each invocation of
   *     this function. As a consequence, if the last value of
   *     bfqq->soft_rt_next_start is constantly used to lower-bound the
   *     next value that this function may return, then, from the very
   *     beginning of a low-speed interval, bfqq->soft_rt_next_start is
   *     likely to be constantly kept so high that any I/O request
   *     issued during the low-speed interval is considered as arriving
   *     to soon for the application to be deemed as soft
   *     real-time. Then, in the high-speed interval that follows, the
   *     application will not be deemed as soft real-time, just because
   *     it will do I/O at a high speed. And so on.
   *
   * Getting back to the filtering in item (a), in the following two
   * cases this filtering might be easily passed by a greedy
   * application, if the reference quantity was just
   * bfqd->bfq_slice_idle:
   * 1) HZ is so low that the duration of a jiffy is comparable to or
   *    higher than bfqd->bfq_slice_idle. This happens, e.g., on slow
   *    devices with HZ=100. The time granularity may be so coarse
   *    that the approximation, in jiffies, of bfqd->bfq_slice_idle
   *    is rather lower than the exact value.
77b7dcead   Paolo Valente   block, bfq: reduc...
3759
3760
3761
   * 2) jiffies, instead of increasing at a constant rate, may stop increasing
   *    for a while, then suddenly 'jump' by several units to recover the lost
   *    increments. This seems to happen, e.g., inside virtual machines.
a34b02444   Paolo Valente   block, bfq: consi...
3762
3763
3764
3765
3766
   * To address this issue, in the filtering in (a) we do not use as a
   * reference time interval just bfqd->bfq_slice_idle, but
   * bfqd->bfq_slice_idle plus a few jiffies. In particular, we add the
   * minimum number of jiffies for which the filter seems to be quite
   * precise also in embedded systems and KVM/QEMU virtual machines.
77b7dcead   Paolo Valente   block, bfq: reduc...
3767
3768
3769
3770
   */
  static unsigned long bfq_bfqq_softrt_next_start(struct bfq_data *bfqd,
  						struct bfq_queue *bfqq)
  {
a34b02444   Paolo Valente   block, bfq: consi...
3771
3772
3773
3774
3775
  	return max3(bfqq->soft_rt_next_start,
  		    bfqq->last_idle_bklogged +
  		    HZ * bfqq->service_from_backlogged /
  		    bfqd->bfq_wr_max_softrt_rate,
  		    jiffies + nsecs_to_jiffies(bfqq->bfqd->bfq_slice_idle) + 4);
77b7dcead   Paolo Valente   block, bfq: reduc...
3776
  }
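  /*
   * Illustrative user-space sketch (not kernel code) of the three-way
   * lower bound described above. The struct, the helper and all
   * parameter names are hypothetical stand-ins for the bfqq/bfqd
   * fields; time is expressed in plain ticks and counter wraparound
   * is deliberately ignored.
   */

```c
#include <assert.h>

/* Hypothetical stand-in for the per-queue fields used by the real function. */
struct softrt_sketch {
	unsigned long soft_rt_next_start;      /* previously stored value, in ticks */
	unsigned long last_idle_bklogged;      /* tick at which the queue last became backlogged */
	unsigned long service_from_backlogged; /* sectors served since that tick */
};

static unsigned long max3_ul(unsigned long a, unsigned long b, unsigned long c)
{
	unsigned long m = a > b ? a : b;

	return m > c ? m : c;
}

/*
 * Return the earliest tick at which the next request may arrive while the
 * queue still qualifies as soft real-time, under the assumptions above.
 */
static unsigned long softrt_next_start_sketch(const struct softrt_sketch *q,
					      unsigned long now,
					      unsigned long hz,
					      unsigned long max_softrt_rate,
					      unsigned long slice_idle_ticks)
{
	unsigned long bandwidth_bound, idle_bound;

	assert(max_softrt_rate > 0);

	/* Bandwidth requirement: service so far must fit within the max rate. */
	bandwidth_bound = q->last_idle_bklogged +
		hz * q->service_from_backlogged / max_softrt_rate;

	/* Item (a): current time plus the idle slice plus a few extra ticks. */
	idle_bound = now + slice_idle_ticks + 4;

	/* Item (b): never go below the previously stored value. */
	return max3_ul(q->soft_rt_next_start, bandwidth_bound, idle_bound);
}
```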
aee69d78d   Paolo Valente   block, bfq: intro...
3777
3778
3779
3780
3781
3782
3783
  /**
   * bfq_bfqq_expire - expire a queue.
   * @bfqd: device owning the queue.
   * @bfqq: the queue to expire.
   * @compensate: if true, compensate for the time spent idling.
   * @reason: the reason causing the expiration.
   *
c074170e6   Paolo Valente   block, bfq: add m...
3784
3785
3786
3787
3788
3789
3790
3791
3792
3793
3794
3795
3796
   * If the process associated with bfqq does slow I/O (e.g., because it
   * issues random requests), we charge bfqq with the time it has been
   * in service instead of the service it has received (see
   * bfq_bfqq_charge_time for details on how this goal is achieved). As
   * a consequence, bfqq will typically get higher timestamps upon
   * reactivation, and hence it will be rescheduled as if it had
   * received more service than what it has actually received. In the
   * end, bfqq receives less service in proportion to how slowly its
   * associated process consumes its budgets (and hence how seriously it
   * tends to lower the throughput). In addition, this time-charging
   * strategy guarantees time fairness among slow processes. In
   * contrast, if the process associated with bfqq is not slow, we
   * charge bfqq exactly with the service it has received.
aee69d78d   Paolo Valente   block, bfq: intro...
3797
   *
c074170e6   Paolo Valente   block, bfq: add m...
3798
3799
3800
3801
   * Charging time to the first type of queues and the exact service to
   * the other has the effect of using the WF2Q+ policy to schedule the
   * former on a timeslice basis, without violating service domain
   * guarantees among the latter.
aee69d78d   Paolo Valente   block, bfq: intro...
3802
   */
ea25da480   Paolo Valente   block, bfq: split...
3803
3804
3805
3806
  void bfq_bfqq_expire(struct bfq_data *bfqd,
  		     struct bfq_queue *bfqq,
  		     bool compensate,
  		     enum bfqq_expiration reason)
aee69d78d   Paolo Valente   block, bfq: intro...
3807
3808
  {
  	bool slow;
ab0e43e9c   Paolo Valente   block, bfq: modif...
3809
3810
  	unsigned long delta = 0;
  	struct bfq_entity *entity = &bfqq->entity;
aee69d78d   Paolo Valente   block, bfq: intro...
3811
3812
  
  	/*
ab0e43e9c   Paolo Valente   block, bfq: modif...
3813
  	 * Check whether the process is slow (see bfq_bfqq_is_slow).
aee69d78d   Paolo Valente   block, bfq: intro...
3814
  	 */
ab0e43e9c   Paolo Valente   block, bfq: modif...
3815
  	slow = bfq_bfqq_is_slow(bfqd, bfqq, compensate, reason, &delta);
aee69d78d   Paolo Valente   block, bfq: intro...
3816
3817
  
  	/*
c074170e6   Paolo Valente   block, bfq: add m...
3818
3819
3820
3821
3822
3823
3824
3825
3826
3827
3828
3829
3830
  	 * As explained above, charge slow (typically seeky) and
  	 * timed-out queues with the time and not the service
  	 * received, to favor sequential workloads.
  	 *
  	 * Processes doing I/O in the slower disk zones will tend to
  	 * be slow(er) even if not seeky. Therefore, since the
  	 * estimated peak rate is actually an average over the disk
  	 * surface, these processes may timeout just for bad luck. To
  	 * avoid punishing them, do not charge time to processes that
  	 * succeeded in consuming at least 2/3 of their budget. This
  	 * allows BFQ to preserve enough elasticity to still perform
  	 * bandwidth, and not time, distribution with slightly unlucky
  	 * or quasi-sequential processes.
aee69d78d   Paolo Valente   block, bfq: intro...
3831
  	 */
44e44a1b3   Paolo Valente   block, bfq: impro...
3832
3833
3834
3835
  	if (bfqq->wr_coeff == 1 &&
  	    (slow ||
  	     (reason == BFQQE_BUDGET_TIMEOUT &&
  	      bfq_bfqq_budget_left(bfqq) >=  entity->budget / 3)))
c074170e6   Paolo Valente   block, bfq: add m...
3836
  		bfq_bfqq_charge_time(bfqd, bfqq, delta);
aee69d78d   Paolo Valente   block, bfq: intro...
3837
3838
  
  	if (reason == BFQQE_TOO_IDLE &&
ab0e43e9c   Paolo Valente   block, bfq: modif...
3839
  	    entity->service <= 2 * entity->budget / 10)
aee69d78d   Paolo Valente   block, bfq: intro...
3840
  		bfq_clear_bfqq_IO_bound(bfqq);
44e44a1b3   Paolo Valente   block, bfq: impro...
3841
3842
  	if (bfqd->low_latency && bfqq->wr_coeff == 1)
  		bfqq->last_wr_start_finish = jiffies;
77b7dcead   Paolo Valente   block, bfq: reduc...
3843
3844
3845
3846
3847
3848
3849
  	if (bfqd->low_latency && bfqd->bfq_wr_max_softrt_rate > 0 &&
  	    RB_EMPTY_ROOT(&bfqq->sort_list)) {
  		/*
  		 * If we get here, and there are no outstanding
  		 * requests, then the request pattern is isochronous
  		 * (see the comments on the function
  		 * bfq_bfqq_softrt_next_start()). Thus we can compute
20cd32450   Paolo Valente   block, bfq: do no...
3850
3851
3852
3853
3854
3855
3856
3857
3858
3859
3860
3861
3862
3863
3864
3865
3866
3867
3868
3869
  		 * soft_rt_next_start. And we do it, unless bfqq is in
  		 * interactive weight raising. We do not do it in the
  		 * latter subcase, for the following reason. bfqq may
  		 * be conveying the I/O needed to load a soft
  		 * real-time application. Such an application will
  		 * actually exhibit a soft real-time I/O pattern after
  		 * it finally starts doing its job. But, if
  		 * soft_rt_next_start is computed here for an
  		 * interactive bfqq, and bfqq had received a lot of
  		 * service before remaining with no outstanding
  		 * request (likely to happen on a fast device), then
  		 * soft_rt_next_start would be assigned such a high
  		 * value that, for a very long time, bfqq would be
  		 * prevented from being possibly considered as soft
  		 * real time.
  		 *
  		 * If, instead, the queue still has outstanding
  		 * requests, then we have to wait for the completion
  		 * of all the outstanding requests to discover whether
  		 * the request pattern is actually isochronous.
77b7dcead   Paolo Valente   block, bfq: reduc...
3870
  		 */
20cd32450   Paolo Valente   block, bfq: do no...
3871
3872
  		if (bfqq->dispatched == 0 &&
  		    bfqq->wr_coeff != bfqd->bfq_wr_coeff)
77b7dcead   Paolo Valente   block, bfq: reduc...
3873
3874
  			bfqq->soft_rt_next_start =
  				bfq_bfqq_softrt_next_start(bfqd, bfqq);
20cd32450   Paolo Valente   block, bfq: do no...
3875
  		else if (bfqq->dispatched > 0) {
77b7dcead   Paolo Valente   block, bfq: reduc...
3876
  			/*
77b7dcead   Paolo Valente   block, bfq: reduc...
3877
3878
3879
3880
3881
3882
  			 * Schedule an update of soft_rt_next_start to when
  			 * the task may be discovered to be isochronous.
  			 */
  			bfq_mark_bfqq_softrt_update(bfqq);
  		}
  	}
aee69d78d   Paolo Valente   block, bfq: intro...
3883
  	bfq_log_bfqq(bfqd, bfqq,
d5be3fefc   Paolo Valente   block,bfq: refact...
3884
3885
  		"expire (%d, slow %d, num_disp %d, short_ttime %d)", reason,
  		slow, bfqq->dispatched, bfq_bfqq_has_short_ttime(bfqq));
aee69d78d   Paolo Valente   block, bfq: intro...
3886
3887
  
  	/*
2341d662e   Paolo Valente   block, bfq: tune ...
3888
3889
3890
3891
3892
3893
3894
3895
  	 * bfqq expired, so no total service time needs to be computed
  	 * any longer: reset state machine for measuring total service
  	 * times.
  	 */
  	bfqd->rqs_injected = bfqd->wait_dispatch = false;
  	bfqd->waited_rq = NULL;
  
  	/*
aee69d78d   Paolo Valente   block, bfq: intro...
3896
3897
3898
3899
  	 * Increase, decrease or leave budget unchanged according to
  	 * reason.
  	 */
  	__bfq_bfqq_recalc_budget(bfqd, bfqq, reason);
3726112ec   Paolo Valente   block, bfq: re-sc...
3900
  	if (__bfq_bfqq_expire(bfqd, bfqq, reason))
eed47d19d   Paolo Valente   block, bfq: fix u...
3901
  		/* bfqq is gone, no more actions on it */
9fae8dd59   Paolo Valente   block, bfq: fix s...
3902
  		return;
aee69d78d   Paolo Valente   block, bfq: intro...
3903
  	/* mark bfqq as waiting a request only if a bic still points to it */
9fae8dd59   Paolo Valente   block, bfq: fix s...
3904
  	if (!bfq_bfqq_busy(bfqq) &&
aee69d78d   Paolo Valente   block, bfq: intro...
3905
  	    reason != BFQQE_BUDGET_TIMEOUT &&
9fae8dd59   Paolo Valente   block, bfq: fix s...
3906
  	    reason != BFQQE_BUDGET_EXHAUSTED) {
aee69d78d   Paolo Valente   block, bfq: intro...
3907
  		bfq_mark_bfqq_non_blocking_wait_rq(bfqq);
9fae8dd59   Paolo Valente   block, bfq: fix s...
3908
3909
3910
3911
3912
3913
3914
  		/*
  		 * Not setting service to 0, because, if the next rq
  		 * arrives in time, the queue will go on receiving
  		 * service with this same budget (as if it never expired)
  		 */
  	} else
  		entity->service = 0;
8a511ba5f   Paolo Valente   block, bfq: readd...
3915
3916
3917
3918
3919
3920
3921
3922
3923
3924
3925
3926
3927
3928
3929
3930
3931
3932
3933
3934
3935
  
  	/*
  	 * Reset the received-service counter for every parent entity.
  	 * Differently from what happens with bfqq->entity.service,
  	 * the resetting of this counter never needs to be postponed
  	 * for parent entities. In fact, in case bfqq may have a
  	 * chance to go on being served using the last, partially
  	 * consumed budget, bfqq->entity.service needs to be kept,
  	 * because if bfqq then actually goes on being served using
  	 * the same budget, the last value of bfqq->entity.service is
  	 * needed to properly decrement bfqq->entity.budget by the
  	 * portion already consumed. In contrast, it is not necessary
  	 * to keep entity->service for parent entities too, because
  	 * the bubble up of the new value of bfqq->entity.budget will
  	 * make sure that the budgets of parent entities are correct,
  	 * even in case bfqq and thus parent entities go on receiving
  	 * service with the same budget.
  	 */
  	entity = entity->parent;
  	for_each_entity(entity)
  		entity->service = 0;
aee69d78d   Paolo Valente   block, bfq: intro...
3936
3937
3938
3939
3940
3941
3942
3943
3944
  }
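  /*
   * Illustrative user-space sketch (not kernel code) of the
   * time-versus-service charging decision taken in bfq_bfqq_expire().
   * The function and parameter names are hypothetical, and the budget
   * left is assumed to be simply budget minus service received.
   */

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Return true if the queue should be charged with the time it spent in
 * service rather than with the service it actually received.
 */
static bool charge_time_sketch(unsigned int wr_coeff, bool slow,
			       bool budget_timeout, int budget, int service)
{
	int budget_left;

	assert(budget >= service);
	budget_left = budget - service; /* assumption: budget left = budget - service */

	/* Weight-raised queues (wr_coeff > 1) are never charged time. */
	if (wr_coeff != 1)
		return false;

	/*
	 * Charge time to slow queues, and to queues that timed out with
	 * at least one third of their budget still unused (that is, they
	 * consumed no more than about 2/3 of it).
	 */
	return slow || (budget_timeout && budget_left >= budget / 3);
}
```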
  
  /*
   * Budget timeout is not implemented through a dedicated timer, but
   * just checked on request arrivals and completions, as well as on
   * idle timer expirations.
   */
  static bool bfq_bfqq_budget_timeout(struct bfq_queue *bfqq)
  {
44e44a1b3   Paolo Valente   block, bfq: impro...
3945
  	return time_is_before_eq_jiffies(bfqq->budget_timeout);
aee69d78d   Paolo Valente   block, bfq: intro...
3946
3947
3948
3949
3950
3951
3952
3953
3954
3955
3956
3957
3958
3959
3960
3961
3962
3963
3964
3965
3966
3967
3968
  }
  
  /*
   * If we expire a queue that is actively waiting (i.e., with the
   * device idled) for the arrival of a new request, then we may incur
   * the timestamp misalignment problem described in the body of the
   * function __bfq_activate_entity. Hence we return true only if this
   * condition does not hold, or if the queue is slow enough to deserve
   * only to be kicked off for preserving a high throughput.
   */
  static bool bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
  {
  	bfq_log_bfqq(bfqq->bfqd, bfqq,
  		"may_budget_timeout: wait_request %d left %d timeout %d",
  		bfq_bfqq_wait_request(bfqq),
  			bfq_bfqq_budget_left(bfqq) >=  bfqq->entity.budget / 3,
  		bfq_bfqq_budget_timeout(bfqq));
  
  	return (!bfq_bfqq_wait_request(bfqq) ||
  		bfq_bfqq_budget_left(bfqq) >=  bfqq->entity.budget / 3)
  		&&
  		bfq_bfqq_budget_timeout(bfqq);
  }
05c2f5c30   Paolo Valente   block, bfq: split...
3969
3970
  static bool idling_boosts_thr_without_issues(struct bfq_data *bfqd,
  					     struct bfq_queue *bfqq)
aee69d78d   Paolo Valente   block, bfq: intro...
3971
  {
edaf94285   Paolo Valente   block, bfq: boost...
3972
3973
3974
  	bool rot_without_queueing =
  		!blk_queue_nonrot(bfqd->queue) && !bfqd->hw_tag,
  		bfqq_sequential_and_IO_bound,
05c2f5c30   Paolo Valente   block, bfq: split...
3975
  		idling_boosts_thr;
d5be3fefc   Paolo Valente   block,bfq: refact...
3976

f718b0932   Paolo Valente   block, bfq: do no...
3977
3978
3979
  	/* No point in idling for bfqq if it won't get requests any longer */
  	if (unlikely(!bfqq_process_refs(bfqq)))
  		return false;
edaf94285   Paolo Valente   block, bfq: boost...
3980
3981
  	bfqq_sequential_and_IO_bound = !BFQQ_SEEKY(bfqq) &&
  		bfq_bfqq_IO_bound(bfqq) && bfq_bfqq_has_short_ttime(bfqq);
d5be3fefc   Paolo Valente   block,bfq: refact...
3982
  	/*
44e44a1b3   Paolo Valente   block, bfq: impro...
3983
3984
3985
  	 * The next variable takes into account the cases where idling
  	 * boosts the throughput.
  	 *
e01eff01d   Paolo Valente   block, bfq: boost...
3986
3987
  	 * The value of the variable is computed considering, first, that
  	 * idling is virtually always beneficial for the throughput if:
edaf94285   Paolo Valente   block, bfq: boost...
3988
3989
3990
3991
3992
3993
  	 * (a) the device is not NCQ-capable and rotational, or
  	 * (b) regardless of the presence of NCQ, the device is rotational and
  	 *     the request pattern for bfqq is I/O-bound and sequential, or
  	 * (c) regardless of whether it is rotational, the device is
  	 *     not NCQ-capable and the request pattern for bfqq is
  	 *     I/O-bound and sequential.
bf2b79e7c   Paolo Valente   block, bfq: boost...
3994
3995
3996
  	 *
  	 * Secondly, and in contrast to the above item (b), idling an
  	 * NCQ-capable flash-based device would not boost the
e01eff01d   Paolo Valente   block, bfq: boost...
3997
  	 * throughput even with sequential I/O; rather it would lower
bf2b79e7c   Paolo Valente   block, bfq: boost...
3998
3999
  	 * the throughput in proportion to how fast the device
  	 * is. Accordingly, the next variable is true if any of the
edaf94285   Paolo Valente   block, bfq: boost...
4000
4001
4002
  	 * above conditions (a), (b) or (c) is true, and, in
  	 * particular, happens to be false if bfqd is an NCQ-capable
  	 * flash-based device.
aee69d78d   Paolo Valente   block, bfq: intro...
4003
  	 */
edaf94285   Paolo Valente   block, bfq: boost...
4004
4005
4006
  	idling_boosts_thr = rot_without_queueing ||
  		((!blk_queue_nonrot(bfqd->queue) || !bfqd->hw_tag) &&
  		 bfqq_sequential_and_IO_bound);
aee69d78d   Paolo Valente   block, bfq: intro...
4007
4008
  
  	/*
05c2f5c30   Paolo Valente   block, bfq: split...
4009
  	 * The return value of this function is equal to that of
cfd69712a   Paolo Valente   block, bfq: reduc...
4010
4011
4012
4013
4014
4015
4016
4017
4018
4019
4020
4021
4022
4023
4024
4025
  	 * idling_boosts_thr, unless a special case holds. In this
  	 * special case, described below, idling may cause problems to
  	 * weight-raised queues.
  	 *
  	 * When the request pool is saturated (e.g., in the presence
  	 * of write hogs), if the processes associated with
  	 * non-weight-raised queues ask for requests at a lower rate,
  	 * then processes associated with weight-raised queues have a
  	 * higher probability to get a request from the pool
  	 * immediately (or at least soon) when they need one. Thus
  	 * they have a higher probability to actually get a fraction
  	 * of the device throughput proportional to their high
  	 * weight. This is especially true with NCQ-capable drives,
  	 * which enqueue several requests in advance, and further
  	 * reorder internally-queued requests.
  	 *
05c2f5c30   Paolo Valente   block, bfq: split...
4026
4027
4028
4029
4030
4031
4032
4033
4034
4035
4036
4037
4038
4039
4040
4041
4042
  	 * For this reason, we force to false the return value if
  	 * there are weight-raised busy queues. In this case, and if
  	 * bfqq is not weight-raised, this guarantees that the device
  	 * is not idled for bfqq (if, instead, bfqq is weight-raised,
  	 * then idling will be guaranteed by another variable, see
  	 * below). Combined with the timestamping rules of BFQ (see
  	 * [1] for details), this behavior causes bfqq, and hence any
  	 * sync non-weight-raised queue, to get a lower number of
  	 * requests served, and thus to ask for a lower number of
  	 * requests from the request pool, before the busy
  	 * weight-raised queues get served again. This often mitigates
  	 * starvation problems in the presence of heavy write
  	 * workloads and NCQ, thereby guaranteeing a higher
  	 * application and system responsiveness in these hostile
  	 * scenarios.
  	 */
  	return idling_boosts_thr &&
cfd69712a   Paolo Valente   block, bfq: reduc...
4043
  		bfqd->wr_busy_queues == 0;
05c2f5c30   Paolo Valente   block, bfq: split...
4044
  }
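  /*
   * Illustrative user-space sketch (not kernel code) of the boolean
   * logic above, with the device and queue properties passed in as
   * plain flags. All names are hypothetical; "ncq" stands for
   * bfqd->hw_tag and "rotational" for !blk_queue_nonrot().
   */

```c
#include <stdbool.h>

/* Conditions (a), (b) and (c) from the comment above, combined. */
static bool idling_boosts_thr_sketch(bool rotational, bool ncq,
				     bool sequential_and_io_bound)
{
	/* (a) rotational device without internal queueing */
	bool rot_without_queueing = rotational && !ncq;

	/* (b) or (c): rotational, or no NCQ, plus sequential I/O-bound pattern */
	return rot_without_queueing ||
		((rotational || !ncq) && sequential_and_io_bound);
}

/* The final result is forced to false while weight-raised queues are busy. */
static bool idling_boosts_thr_without_issues_sketch(bool rotational, bool ncq,
						    bool sequential_and_io_bound,
						    unsigned int wr_busy_queues)
{
	return idling_boosts_thr_sketch(rotational, ncq,
					sequential_and_io_bound) &&
		wr_busy_queues == 0;
}
```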
cfd69712a   Paolo Valente   block, bfq: reduc...
4045

530c4cbb3   Paolo Valente   block, bfq: uncon...
4046
  /*
05c2f5c30   Paolo Valente   block, bfq: split...
4047
4048
4049
4050
4051
4052
4053
4054
4055
4056
4057
4058
4059
4060
4061
4062
4063
4064
4065
4066
4067
4068
4069
4070
   * For a queue that becomes empty, device idling is allowed only if
   * this function returns true for that queue. As a consequence, since
   * device idling plays a critical role for both throughput boosting
   * and service guarantees, the return value of this function plays a
   * critical role as well.
   *
   * In a nutshell, this function returns true only if idling is
   * beneficial for throughput or, even if detrimental for throughput,
   * idling is however necessary to preserve service guarantees (low
   * latency, desired throughput distribution, ...). In particular, on
   * NCQ-capable devices, this function tries to return false, so as to
   * help keep the drives' internal queues full, whenever this helps the
   * device boost the throughput without causing any service-guarantee
   * issue.
   *
   * Most of the issues taken into account to get the return value of
   * this function are not trivial. We discuss these issues in the two
   * functions providing the main pieces of information needed by this
   * function.
   */
  static bool bfq_better_to_idle(struct bfq_queue *bfqq)
  {
  	struct bfq_data *bfqd = bfqq->bfqd;
  	bool idling_boosts_thr_with_no_issue, idling_needed_for_service_guar;
f718b0932   Paolo Valente   block, bfq: do no...
4071
4072
4073
  	/* No point in idling for bfqq if it won't get requests any longer */
  	if (unlikely(!bfqq_process_refs(bfqq)))
  		return false;
05c2f5c30   Paolo Valente   block, bfq: split...
4074
4075
4076
4077
4078
4079
4080
4081
4082
4083
4084
4085
4086
4087
4088
4089
4090
4091
4092
4093
  	if (unlikely(bfqd->strict_guarantees))
  		return true;
  
  	/*
  	 * Idling is performed only if slice_idle > 0. In addition, we
  	 * do not idle if
  	 * (a) bfqq is async
  	 * (b) bfqq is in the idle io prio class: in this case we do
  	 * not idle because we want to minimize the bandwidth that
  	 * queues in this class can steal from higher-priority queues
  	 */
  	if (bfqd->bfq_slice_idle == 0 || !bfq_bfqq_sync(bfqq) ||
  	   bfq_class_idle(bfqq))
  		return false;
  
  	idling_boosts_thr_with_no_issue =
  		idling_boosts_thr_without_issues(bfqd, bfqq);
  
  	idling_needed_for_service_guar =
  		idling_needed_for_service_guarantees(bfqd, bfqq);
e1b2324dd   Arianna Avanzini   block, bfq: handl...
4094
4095
  
  	/*
05c2f5c30   Paolo Valente   block, bfq: split...
4096
  	 * We now have the two components we need to compute the
d5be3fefc   Paolo Valente   block,bfq: refact...
4097
4098
4099
  	 * return value of the function, which is true only if idling
  	 * either boosts the throughput (without issues), or is
  	 * necessary to preserve service guarantees.
aee69d78d   Paolo Valente   block, bfq: intro...
4100
  	 */
05c2f5c30   Paolo Valente   block, bfq: split...
4101
4102
  	return idling_boosts_thr_with_no_issue ||
  		idling_needed_for_service_guar;
aee69d78d   Paolo Valente   block, bfq: intro...
4103
4104
4105
  }
  
  /*
277a4a9b5   Paolo Valente   block, bfq: give ...
4106
   * If the in-service queue is empty but the function bfq_better_to_idle
aee69d78d   Paolo Valente   block, bfq: intro...
4107
4108
4109
4110
   * returns true, then:
   * 1) the queue must remain in service and cannot be expired, and
   * 2) the device must be idled to wait for the possible arrival of a new
   *    request for the queue.
277a4a9b5   Paolo Valente   block, bfq: give ...
4111
   * See the comments on the function bfq_better_to_idle for the reasons
aee69d78d   Paolo Valente   block, bfq: intro...
4112
   * why performing device idling is the best choice to boost the throughput
277a4a9b5   Paolo Valente   block, bfq: give ...
4113
   * and preserve service guarantees when bfq_better_to_idle itself
aee69d78d   Paolo Valente   block, bfq: intro...
4114
4115
4116
4117
   * returns true.
   */
  static bool bfq_bfqq_must_idle(struct bfq_queue *bfqq)
  {
277a4a9b5   Paolo Valente   block, bfq: give ...
4118
  	return RB_EMPTY_ROOT(&bfqq->sort_list) && bfq_better_to_idle(bfqq);
aee69d78d   Paolo Valente   block, bfq: intro...
4119
  }
2341d662e   Paolo Valente   block, bfq: tune ...
4120
4121
4122
4123
4124
4125
4126
4127
4128
  /*
   * This function chooses the queue from which to pick the next extra
   * I/O request to inject, if it finds a compatible queue. See the
   * comments on bfq_update_inject_limit() for details on the injection
   * mechanism, and for the definitions of the quantities mentioned
   * below.
   */
  static struct bfq_queue *
  bfq_choose_bfqq_for_injection(struct bfq_data *bfqd)
d0edc2473   Paolo Valente   block, bfq: injec...
4129
  {
2341d662e   Paolo Valente   block, bfq: tune ...
4130
4131
4132
4133
4134
4135
4136
4137
4138
4139
4140
4141
4142
4143
4144
4145
  	struct bfq_queue *bfqq, *in_serv_bfqq = bfqd->in_service_queue;
  	unsigned int limit = in_serv_bfqq->inject_limit;
  	/*
  	 * If
  	 * - bfqq is not weight-raised and therefore does not carry
  	 *   time-critical I/O,
  	 * or
  	 * - regardless of whether bfqq is weight-raised, bfqq has
  	 *   however a long think time, during which it can absorb the
  	 *   effect of an appropriate number of extra I/O requests
  	 *   from other queues (see bfq_update_inject_limit for
  	 *   details on the computation of this number);
  	 * then injection can be performed without restrictions.
  	 */
  	bool in_serv_always_inject = in_serv_bfqq->wr_coeff == 1 ||
  		!bfq_bfqq_has_short_ttime(in_serv_bfqq);
d0edc2473   Paolo Valente   block, bfq: injec...
4146
4147
  
  	/*
2341d662e   Paolo Valente   block, bfq: tune ...
4148
4149
4150
4151
4152
4153
4154
4155
4156
4157
4158
4159
4160
4161
4162
4163
4164
4165
4166
4167
4168
4169
4170
  	 * If
  	 * - the baseline total service time could not be sampled yet,
  	 *   so the inject limit happens to be still 0, and
  	 * - a lot of time has elapsed since the plugging of I/O
  	 *   dispatching started, so drive speed is being wasted
  	 *   significantly;
  	 * then temporarily raise inject limit to one request.
  	 */
  	if (limit == 0 && in_serv_bfqq->last_serv_time_ns == 0 &&
  	    bfq_bfqq_wait_request(in_serv_bfqq) &&
  	    time_is_before_eq_jiffies(bfqd->last_idling_start_jiffies +
  				      bfqd->bfq_slice_idle)
  		)
  		limit = 1;
  
  	if (bfqd->rq_in_driver >= limit)
  		return NULL;
  
  	/*
  	 * Linear search of the source queue for injection; but, with
  	 * a high probability, very few steps are needed to find a
  	 * candidate queue, i.e., a queue with enough budget left for
  	 * its next request. In fact:
d0edc2473   Paolo Valente   block, bfq: injec...
4171
4172
4173
4174
  	 * - BFQ dynamically updates the budget of every queue so as
  	 *   to accommodate the expected backlog of the queue;
  	 * - if a queue gets all its requests dispatched as injected
  	 *   service, then the queue is removed from the active list
2341d662e   Paolo Valente   block, bfq: tune ...
4175
4176
  	 *   (and re-added only if it gets new requests, but then it
  	 *   is assigned again enough budget for its new backlog).
d0edc2473   Paolo Valente   block, bfq: injec...
4177
4178
4179
  	 */
  	list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list)
  		if (!RB_EMPTY_ROOT(&bfqq->sort_list) &&
2341d662e   Paolo Valente   block, bfq: tune ...
4180
  		    (in_serv_always_inject || bfqq->wr_coeff > 1) &&
d0edc2473   Paolo Valente   block, bfq: injec...
4181
  		    bfq_serv_to_charge(bfqq->next_rq, bfqq) <=
2341d662e   Paolo Valente   block, bfq: tune ...
4182
4183
4184
4185
4186
4187
4188
4189
4190
4191
4192
4193
4194
4195
4196
4197
4198
4199
4200
4201
4202
4203
4204
4205
4206
4207
4208
4209
4210
4211
  		    bfq_bfqq_budget_left(bfqq)) {
  			/*
  			 * Allow for only one large in-flight request
  			 * on non-rotational devices, for the
  			 * following reason. On non-rotational drives,
  			 * large requests take much longer than
  			 * smaller requests to be served. In addition,
  			 * the drive prefers to serve large requests
  			 * w.r.t. small ones, if it can choose. So,
  			 * having more than one large request queued
  			 * in the drive may easily make the next first
  			 * request of the in-service queue wait for so
  			 * long as to break bfqq's service guarantees. On
  			 * the bright side, large requests let the
  			 * drive reach a very high throughput, even if
  			 * there is only one in-flight large request
  			 * at a time.
  			 */
  			if (blk_queue_nonrot(bfqd->queue) &&
  			    blk_rq_sectors(bfqq->next_rq) >=
  			    BFQQ_SECT_THR_NONROT)
  				limit = min_t(unsigned int, 1, limit);
  			else
  				limit = in_serv_bfqq->inject_limit;
  
  			if (bfqd->rq_in_driver < limit) {
  				bfqd->rqs_injected = true;
  				return bfqq;
  			}
  		}
d0edc2473   Paolo Valente   block, bfq: injec...
4212
4213
4214
  
  	return NULL;
  }
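  /*
   * Illustrative user-space sketch (not kernel code) of how the
   * injection limit is handled above: first the temporary raise to
   * one request, then the per-candidate adjustment for large requests
   * on non-rotational devices. All names are hypothetical and the
   * kernel state is reduced to plain flags and counters.
   */

```c
#include <stdbool.h>

/*
 * Starting limit: the stored inject limit, temporarily raised to 1 when
 * no baseline service time has been sampled yet and the device has been
 * idling for longer than the idle slice.
 */
static unsigned int initial_inject_limit_sketch(unsigned int inject_limit,
						bool baseline_sampled,
						bool waiting_for_request,
						bool idled_past_slice_idle)
{
	if (inject_limit == 0 && !baseline_sampled &&
	    waiting_for_request && idled_past_slice_idle)
		return 1;

	return inject_limit;
}

/*
 * Per-candidate check: a large request on a non-rotational device caps
 * the current limit at one in-flight request; any other candidate is
 * checked against the stored inject limit.
 */
static bool may_inject_sketch(unsigned int limit, unsigned int inject_limit,
			      unsigned int rq_in_driver,
			      bool nonrot, bool large_rq)
{
	if (nonrot && large_rq)
		limit = limit < 1 ? limit : 1;
	else
		limit = inject_limit;

	return rq_in_driver < limit;
}
```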
aee69d78d   Paolo Valente   block, bfq: intro...
4215
4216
4217
4218
4219
4220
4221
4222
4223
4224
4225
4226
4227
4228
4229
  /*
   * Select a queue for service.  If we have a current queue in service,
   * check whether to continue servicing it, or retrieve and set a new one.
   */
  static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
  {
  	struct bfq_queue *bfqq;
  	struct request *next_rq;
  	enum bfqq_expiration reason = BFQQE_BUDGET_TIMEOUT;
  
  	bfqq = bfqd->in_service_queue;
  	if (!bfqq)
  		goto new_queue;
  
  	bfq_log_bfqq(bfqd, bfqq, "select_queue: already in-service queue");
  	/*
  	 * Do not expire bfqq for budget timeout if bfqq may be about
  	 * to enjoy device idling. The reason why, in this case, we
  	 * prevent bfqq from expiring is the same as in the comments
  	 * on the case where bfq_bfqq_must_idle() returns true, in
  	 * bfq_completed_request().
  	 */
  	if (bfq_may_expire_for_budg_timeout(bfqq) &&
  	    !bfq_bfqq_must_idle(bfqq))
  		goto expire;
  
  check_queue:
  	/*
  	 * This loop is rarely executed more than once. Even when it
  	 * happens, it is much more convenient to re-execute this loop
  	 * than to return NULL and trigger a new dispatch to get a
  	 * request served.
  	 */
  	next_rq = bfqq->next_rq;
  	/*
  	 * If bfqq has requests queued and it has enough budget left to
  	 * serve them, keep the queue, otherwise expire it.
  	 */
  	if (next_rq) {
  		if (bfq_serv_to_charge(next_rq, bfqq) >
  			bfq_bfqq_budget_left(bfqq)) {
  			/*
  			 * Expire the queue for budget exhaustion,
  			 * which makes sure that the next budget is
  			 * enough to serve the next request, even if
  			 * it comes from the fifo expired path.
  			 */
  			reason = BFQQE_BUDGET_EXHAUSTED;
  			goto expire;
  		} else {
  			/*
  			 * The idle timer may be pending because we may
  			 * not disable disk idling even when a new request
  			 * arrives.
  			 */
  			if (bfq_bfqq_wait_request(bfqq)) {
  				/*
  				 * If we get here: 1) at least a new request
  				 * has arrived but we have not disabled the
  				 * timer because the request was too small,
  				 * 2) then the block layer has unplugged
  				 * the device, causing the dispatch to be
  				 * invoked.
  				 *
  				 * Since the device is unplugged, now the
  				 * requests are probably large enough to
  				 * provide a reasonable throughput.
  				 * So we disable idling.
  				 */
  				bfq_clear_bfqq_wait_request(bfqq);
  				hrtimer_try_to_cancel(&bfqd->idle_slice_timer);
  			}
  			goto keep_queue;
  		}
  	}
  
  	/*
  	 * No requests pending. However, if the in-service queue is idling
  	 * for a new request, or has requests waiting for a completion and
  	 * may idle after their completion, then keep it anyway.
  	 *
  	 * Yet, inject service from other queues if it boosts
  	 * throughput and is possible.
  	 */
  	if (bfq_bfqq_wait_request(bfqq) ||
  	    (bfqq->dispatched != 0 && bfq_better_to_idle(bfqq))) {
  		struct bfq_queue *async_bfqq =
  			bfqq->bic && bfqq->bic->bfqq[0] &&
  			bfq_bfqq_busy(bfqq->bic->bfqq[0]) &&
  			bfqq->bic->bfqq[0]->next_rq ?
  			bfqq->bic->bfqq[0] : NULL;
  
  		/*
  		 * The next three mutually-exclusive ifs decide
  		 * whether to try injection, and choose the queue to
  		 * pick an I/O request from.
  		 *
  		 * The first if checks whether the process associated
  		 * with bfqq has also async I/O pending. If so, it
  		 * injects such I/O unconditionally. Injecting async
  		 * I/O from the same process can cause no harm to the
  		 * process. On the contrary, it can only increase
  		 * bandwidth and reduce latency for the process.
  		 *
  		 * The second if checks whether there happens to be a
  		 * non-empty waker queue for bfqq, i.e., a queue whose
  		 * I/O needs to be completed for bfqq to receive new
  		 * I/O. This happens, e.g., if bfqq is associated with
  		 * a process that does some sync. A sync generates
  		 * extra blocking I/O, which must be completed before
  		 * the process associated with bfqq can go on with its
  		 * I/O. If the I/O of the waker queue is not served,
  		 * then bfqq remains empty, and no I/O is dispatched,
  		 * until the idle timeout fires for bfqq. This is
  		 * likely to result in lower bandwidth and higher
  		 * latencies for bfqq, and in a severe loss of total
  		 * throughput. The best action to take is therefore to
  		 * serve the waker queue as soon as possible. So do it
  		 * (without relying on the third alternative below for
  		 * eventually serving waker_bfqq's I/O; see the last
  		 * paragraph for further details). This systematic
  		 * injection of I/O from the waker queue does not
  		 * cause any delay to bfqq's I/O. On the contrary,
  		 * next bfqq's I/O is brought forward dramatically,
  		 * for it is not blocked for milliseconds.
  		 *
  		 * The third if checks whether bfqq is a queue for
  		 * which it is better to avoid injection. It is so if
  		 * bfqq delivers more throughput when served without
  		 * any further I/O from other queues in the middle, or
  		 * if the service times of bfqq's I/O requests both
  		 * count more than overall throughput, and may be
  		 * easily increased by injection (this happens if bfqq
  		 * has a short think time). If none of these
  		 * conditions holds, then a candidate queue for
  		 * injection is looked for through
  		 * bfq_choose_bfqq_for_injection(). Note that the
  		 * latter may return NULL (for example if the inject
  		 * limit for bfqq is currently 0).
  		 *
  		 * NOTE: motivation for the second alternative
  		 *
  		 * Thanks to the way the inject limit is updated in
  		 * bfq_update_has_short_ttime(), it is rather likely
  		 * that, if I/O is being plugged for bfqq and the
  		 * waker queue has pending I/O requests that are
  		 * blocking bfqq's I/O, then the third alternative
  		 * above lets the waker queue get served before the
  		 * I/O-plugging timeout fires. So one may deem the
  		 * second alternative superfluous. It is not, because
  		 * the third alternative may be way less effective in
  		 * case of a synchronization, for two main
  		 * reasons. First, throughput may be low because the
  		 * inject limit may be too low to guarantee the same
  		 * amount of injected I/O, from the waker queue or
  		 * other queues, that the second alternative
  		 * guarantees (the second alternative unconditionally
  		 * injects a pending I/O request of the waker queue
  		 * for each bfq_dispatch_request()). Second, with the
  		 * third alternative, the duration of the plugging,
  		 * i.e., the time before bfqq finally receives new I/O,
  		 * may not be minimized, because the waker queue may
  		 * happen to be served only after other queues.
  		 */
  		if (async_bfqq &&
  		    icq_to_bic(async_bfqq->next_rq->elv.icq) == bfqq->bic &&
  		    bfq_serv_to_charge(async_bfqq->next_rq, async_bfqq) <=
  		    bfq_bfqq_budget_left(async_bfqq))
  			bfqq = bfqq->bic->bfqq[0];
  		else if (bfq_bfqq_has_waker(bfqq) &&
  			   bfq_bfqq_busy(bfqq->waker_bfqq) &&
  			   bfqq->waker_bfqq->next_rq &&
  			   bfq_serv_to_charge(bfqq->waker_bfqq->next_rq,
  					      bfqq->waker_bfqq) <=
  			   bfq_bfqq_budget_left(bfqq->waker_bfqq)
  			)
  			bfqq = bfqq->waker_bfqq;
  		else if (!idling_boosts_thr_without_issues(bfqd, bfqq) &&
  			 (bfqq->wr_coeff == 1 || bfqd->wr_busy_queues > 1 ||
  			  !bfq_bfqq_has_short_ttime(bfqq)))
  			bfqq = bfq_choose_bfqq_for_injection(bfqd);
  		else
  			bfqq = NULL;
  		goto keep_queue;
  	}
  
  	reason = BFQQE_NO_MORE_REQUESTS;
  expire:
  	bfq_bfqq_expire(bfqd, bfqq, false, reason);
  new_queue:
  	bfqq = bfq_set_in_service_queue(bfqd);
  	if (bfqq) {
  		bfq_log_bfqq(bfqd, bfqq, "select_queue: checking new queue");
  		goto check_queue;
  	}
  keep_queue:
  	if (bfqq)
  		bfq_log_bfqq(bfqd, bfqq, "select_queue: returned this queue");
  	else
  		bfq_log(bfqd, "select_queue: no queue returned");
  
  	return bfqq;
  }
  static void bfq_update_wr_data(struct bfq_data *bfqd, struct bfq_queue *bfqq)
  {
  	struct bfq_entity *entity = &bfqq->entity;
  
  	if (bfqq->wr_coeff > 1) { /* queue is being weight-raised */
  		bfq_log_bfqq(bfqd, bfqq,
  			"raising period dur %u/%u msec, old coeff %u, w %d(%d)",
  			jiffies_to_msecs(jiffies - bfqq->last_wr_start_finish),
  			jiffies_to_msecs(bfqq->wr_cur_max_time),
  			bfqq->wr_coeff,
  			bfqq->entity.weight, bfqq->entity.orig_weight);
  
  		if (entity->prio_changed)
  			bfq_log_bfqq(bfqd, bfqq, "WARN: pending prio change");
  
  		/*
  		 * If the queue was activated in a burst, or too much
  		 * time has elapsed from the beginning of this
  		 * weight-raising period, then end weight raising.
  		 */
  		if (bfq_bfqq_in_large_burst(bfqq))
  			bfq_bfqq_end_wr(bfqq);
  		else if (time_is_before_jiffies(bfqq->last_wr_start_finish +
  						bfqq->wr_cur_max_time)) {
  			if (bfqq->wr_cur_max_time != bfqd->bfq_wr_rt_max_time ||
  			time_is_before_jiffies(bfqq->wr_start_at_switch_to_srt +
  					       bfq_wr_duration(bfqd)))
  				bfq_bfqq_end_wr(bfqq);
  			else {
  				switch_back_to_interactive_wr(bfqq, bfqd);
  				bfqq->entity.prio_changed = 1;
  			}
  		}
  		if (bfqq->wr_coeff > 1 &&
  		    bfqq->wr_cur_max_time != bfqd->bfq_wr_rt_max_time &&
  		    bfqq->service_from_wr > max_service_from_wr) {
  			/* see comments on max_service_from_wr */
  			bfq_bfqq_end_wr(bfqq);
  		}
  	}
  	/*
  	 * To improve latency (for this or other queues), immediately
  	 * update weight both if it must be raised and if it must be
  	 * lowered. Since entity may be on some active tree here, and
  	 * might have a pending change of its ioprio class, invoke
  	 * next function with the last parameter unset (see the
  	 * comments on the function).
  	 */
  	if ((entity->weight > entity->orig_weight) != (bfqq->wr_coeff > 1))
  		__bfq_entity_update_weight_prio(bfq_entity_service_tree(entity),
  						entity, false);
  }
  /*
   * Dispatch next request from bfqq.
   */
  static struct request *bfq_dispatch_rq_from_bfqq(struct bfq_data *bfqd,
  						 struct bfq_queue *bfqq)
  {
  	struct request *rq = bfqq->next_rq;
  	unsigned long service_to_charge;
  
  	service_to_charge = bfq_serv_to_charge(rq, bfqq);
  
  	bfq_bfqq_served(bfqq, service_to_charge);
  	if (bfqq == bfqd->in_service_queue && bfqd->wait_dispatch) {
  		bfqd->wait_dispatch = false;
  		bfqd->waited_rq = rq;
  	}

  	bfq_dispatch_remove(bfqd->queue, rq);

  	if (bfqq != bfqd->in_service_queue)
  		goto return_rq;

  	/*
  	 * If weight raising has to terminate for bfqq, then next
  	 * function causes an immediate update of bfqq's weight,
  	 * without waiting for next activation. As a consequence, on
  	 * expiration, bfqq will be timestamped as if it has never been
  	 * weight-raised during this service slot, even if it has
  	 * received part or even most of the service as a
  	 * weight-raised queue. This inflates bfqq's timestamps, which
  	 * is beneficial, as bfqq is then more willing to leave the
  	 * device immediately to possible other weight-raised queues.
  	 */
  	bfq_update_wr_data(bfqd, bfqq);
  	/*
  	 * Expire bfqq, pretending that its budget expired, if bfqq
  	 * belongs to CLASS_IDLE and other queues are waiting for
  	 * service.
  	 */
  	if (!(bfq_tot_busy_queues(bfqd) > 1 && bfq_class_idle(bfqq)))
  		goto return_rq;

  	bfq_bfqq_expire(bfqd, bfqq, false, BFQQE_BUDGET_EXHAUSTED);
  
  return_rq:
  	return rq;
  }
  
  static bool bfq_has_work(struct blk_mq_hw_ctx *hctx)
  {
  	struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
  	if (!atomic_read(&hctx->elevator_queued))
  		return false;
  	/*
  	 * Avoiding lock: a race on bfqd->busy_queues should cause at
  	 * most a call to dispatch for nothing
  	 */
  	return !list_empty_careful(&bfqd->dispatch) ||
  		bfq_tot_busy_queues(bfqd) > 0;
  }
  
  static struct request *__bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
  {
  	struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
  	struct request *rq = NULL;
  	struct bfq_queue *bfqq = NULL;
  
  	if (!list_empty(&bfqd->dispatch)) {
  		rq = list_first_entry(&bfqd->dispatch, struct request,
  				      queuelist);
  		list_del_init(&rq->queuelist);
  
  		bfqq = RQ_BFQQ(rq);
  
  		if (bfqq) {
  			/*
  			 * Increment counters here, because this
  			 * dispatch does not follow the standard
  			 * dispatch flow (where counters are
  			 * incremented)
  			 */
  			bfqq->dispatched++;
  
  			goto inc_in_driver_start_rq;
  		}
  
  		/*
  		 * We exploit the bfq_finish_requeue_request hook to
  		 * decrement rq_in_driver, but
  		 * bfq_finish_requeue_request will not be invoked on
  		 * this request. So, to avoid unbalance, just start
  		 * this request, without incrementing rq_in_driver. As
  		 * a negative consequence, rq_in_driver is deceptively
  		 * lower than it should be while this request is in
  		 * service. This may cause bfq_schedule_dispatch to be
  		 * invoked uselessly.
  		 *
  		 * As for implementing an exact solution, the
  		 * bfq_finish_requeue_request hook, if defined, is
  		 * probably invoked also on this request. So, by
  		 * exploiting this hook, we could 1) increment
  		 * rq_in_driver here, and 2) decrement it in
  		 * bfq_finish_requeue_request. Such a solution would
  		 * let the value of the counter be always accurate,
  		 * but it would entail using an extra interface
  		 * function. This cost seems higher than the benefit,
  		 * since the frequency of non-elevator-private
  		 * requests is very low.
  		 */
  		goto start_rq;
  	}
  	bfq_log(bfqd, "dispatch requests: %d busy queues",
  		bfq_tot_busy_queues(bfqd));

  	if (bfq_tot_busy_queues(bfqd) == 0)
  		goto exit;
  
  	/*
  	 * Force device to serve one request at a time if
  	 * strict_guarantees is true. Forcing this service scheme is
  	 * currently the ONLY way to guarantee that the request
  	 * service order enforced by the scheduler is respected by a
  	 * queueing device. Otherwise the device is free even to make
  	 * some unlucky request wait for as long as the device
  	 * wishes.
  	 *
  	 * Of course, serving one request at a time may cause loss of
  	 * throughput.
  	 */
  	if (bfqd->strict_guarantees && bfqd->rq_in_driver > 0)
  		goto exit;
  
  	bfqq = bfq_select_queue(bfqd);
  	if (!bfqq)
  		goto exit;
  
  	rq = bfq_dispatch_rq_from_bfqq(bfqd, bfqq);
  
  	if (rq) {
  inc_in_driver_start_rq:
  		bfqd->rq_in_driver++;
  start_rq:
  		rq->rq_flags |= RQF_STARTED;
  	}
  exit:
  	return rq;
  }
  #ifdef CONFIG_BFQ_CGROUP_DEBUG
  static void bfq_update_dispatch_stats(struct request_queue *q,
  				      struct request *rq,
  				      struct bfq_queue *in_serv_queue,
  				      bool idle_timer_disabled)
  {
  	struct bfq_queue *bfqq = rq ? RQ_BFQQ(rq) : NULL;

  	if (!idle_timer_disabled && !bfqq)
  		return;
  
  	/*
  	 * rq and bfqq are guaranteed to exist until this function
  	 * ends, for the following reasons. First, rq can be
  	 * dispatched to the device, and then can be completed and
  	 * freed, only after this function ends. Second, rq cannot be
  	 * merged (and thus freed because of a merge) any longer,
  	 * because it has already started. Thus rq cannot be freed
  	 * before this function ends, and, since rq has a reference to
  	 * bfqq, the same guarantee holds for bfqq too.
  	 *
  	 * In addition, the following queue lock guarantees that
  	 * bfqq_group(bfqq) exists as well.
  	 */
  	spin_lock_irq(&q->queue_lock);
  	if (idle_timer_disabled)
  		/*
  		 * Since the idle timer has been disabled,
  		 * in_serv_queue contained some request when
  		 * __bfq_dispatch_request was invoked above, which
  		 * implies that rq was picked exactly from
  		 * in_serv_queue. Thus in_serv_queue == bfqq, and is
  		 * therefore guaranteed to exist because of the above
  		 * arguments.
  		 */
  		bfqg_stats_update_idle_time(bfqq_group(in_serv_queue));
  	if (bfqq) {
  		struct bfq_group *bfqg = bfqq_group(bfqq);
  
  		bfqg_stats_update_avg_queue_size(bfqg);
  		bfqg_stats_set_start_empty_time(bfqg);
  		bfqg_stats_update_io_remove(bfqg, rq->cmd_flags);
  	}
  	spin_unlock_irq(&q->queue_lock);
  }
  #else
  static inline void bfq_update_dispatch_stats(struct request_queue *q,
  					     struct request *rq,
  					     struct bfq_queue *in_serv_queue,
  					     bool idle_timer_disabled) {}
  #endif /* CONFIG_BFQ_CGROUP_DEBUG */

  static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
  {
  	struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
  	struct request *rq;
  	struct bfq_queue *in_serv_queue;
  	bool waiting_rq, idle_timer_disabled;
  
  	spin_lock_irq(&bfqd->lock);
  
  	in_serv_queue = bfqd->in_service_queue;
  	waiting_rq = in_serv_queue && bfq_bfqq_wait_request(in_serv_queue);
  
  	rq = __bfq_dispatch_request(hctx);
  
  	idle_timer_disabled =
  		waiting_rq && !bfq_bfqq_wait_request(in_serv_queue);
  
  	spin_unlock_irq(&bfqd->lock);
  
  	bfq_update_dispatch_stats(hctx->queue, rq, in_serv_queue,
  				  idle_timer_disabled);
  	return rq;
  }
  
  /*
   * Task holds one reference to the queue, dropped when task exits.  Each rq
   * in-flight on this queue also holds a reference, dropped when rq is freed.
   *
   * Scheduler lock must be held here. Recall not to use bfqq after calling
   * this function on it.
   */
  void bfq_put_queue(struct bfq_queue *bfqq)
  {
  	struct bfq_queue *item;
  	struct hlist_node *n;
  	struct bfq_group *bfqg = bfqq_group(bfqq);

  	if (bfqq->bfqd)
  		bfq_log_bfqq(bfqq->bfqd, bfqq, "put_queue: %p %d",
  			     bfqq, bfqq->ref);
  
  	bfqq->ref--;
  	if (bfqq->ref)
  		return;
  	if (!hlist_unhashed(&bfqq->burst_list_node)) {
  		hlist_del_init(&bfqq->burst_list_node);
  		/*
  		 * Decrement also burst size after the removal, if the
  		 * process associated with bfqq is exiting, and thus
  		 * does not contribute to the burst any longer. This
  		 * decrement helps filter out false positives of large
  		 * bursts, when some short-lived process (often due to
  		 * the execution of commands by some service) happens
  		 * to start and exit while a complex application is
  		 * starting, and thus spawning several processes that
  		 * do I/O (and that *must not* be treated as a large
  		 * burst, see comments on bfq_handle_burst).
  		 *
  		 * In particular, the decrement is performed only if:
  		 * 1) bfqq is not a merged queue, because, if it is,
  		 * then this free of bfqq is not triggered by the exit
  		 * of the process bfqq is associated with, but exactly
  		 * by the fact that bfqq has just been merged.
  		 * 2) burst_size is greater than 0, to handle
  		 * unbalanced decrements. Unbalanced decrements may
  		 * happen in the following case: bfqq is inserted into
  		 * the current burst list--without incrementing
  		 * burst_size--because of a split, but the current
  		 * burst list is not the burst list bfqq belonged to
  		 * (see comments on the case of a split in
  		 * bfq_set_request).
  		 */
  		if (bfqq->bic && bfqq->bfqd->burst_size > 0)
  			bfqq->bfqd->burst_size--;
  	}

  	/*
  	 * bfqq does not exist any longer, so it cannot be woken by
  	 * any other queue, and cannot wake any other queue. Then bfqq
  	 * must be removed from the woken list of its possible waker
  	 * queue, and all queues in the woken list of bfqq must stop
  	 * having a waker queue. Strictly speaking, these updates
  	 * should be performed when bfqq remains with no I/O source
  	 * attached to it, which happens before bfqq gets freed. In
  	 * particular, this happens when the last process associated
  	 * with bfqq exits or gets associated with a different
  	 * queue. However, both events lead to bfqq being freed soon,
  	 * and dangling references would come out only after bfqq gets
  	 * freed. So these updates are done here, as a simple and safe
  	 * way to handle all cases.
  	 */
  	/* remove bfqq from woken list */
  	if (!hlist_unhashed(&bfqq->woken_list_node))
  		hlist_del_init(&bfqq->woken_list_node);
  
  	/* reset waker for all queues in woken list */
  	hlist_for_each_entry_safe(item, n, &bfqq->woken_list,
  				  woken_list_node) {
  		item->waker_bfqq = NULL;
  		bfq_clear_bfqq_has_waker(item);
  		hlist_del_init(&item->woken_list_node);
  	}
  	if (bfqq->bfqd && bfqq->bfqd->last_completed_rq_bfqq == bfqq)
  		bfqq->bfqd->last_completed_rq_bfqq = NULL;
  	kmem_cache_free(bfq_pool, bfqq);
  	bfqg_and_blkg_put(bfqg);
  }
  static void bfq_put_cooperator(struct bfq_queue *bfqq)
  {
  	struct bfq_queue *__bfqq, *next;
  
  	/*
  	 * If this queue was scheduled to merge with another queue, be
  	 * sure to drop the reference taken on that queue (and others in
  	 * the merge chain). See bfq_setup_merge and bfq_merge_bfqqs.
  	 */
  	__bfqq = bfqq->new_bfqq;
  	while (__bfqq) {
  		if (__bfqq == bfqq)
  			break;
  		next = __bfqq->new_bfqq;
  		bfq_put_queue(__bfqq);
  		__bfqq = next;
  	}
  }
  static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
  {
  	if (bfqq == bfqd->in_service_queue) {
  		__bfq_bfqq_expire(bfqd, bfqq, BFQQE_BUDGET_TIMEOUT);
  		bfq_schedule_dispatch(bfqd);
  	}
  
  	bfq_log_bfqq(bfqd, bfqq, "exit_bfqq: %p, %d", bfqq, bfqq->ref);
  	bfq_put_cooperator(bfqq);
  	bfq_release_process_ref(bfqd, bfqq);
  }
  
  static void bfq_exit_icq_bfqq(struct bfq_io_cq *bic, bool is_sync)
  {
  	struct bfq_queue *bfqq = bic_to_bfqq(bic, is_sync);
  	struct bfq_data *bfqd;
  
  	if (bfqq)
  		bfqd = bfqq->bfqd; /* NULL if scheduler already exited */
  
  	if (bfqq && bfqd) {
  		unsigned long flags;
  
  		spin_lock_irqsave(&bfqd->lock, flags);
  		bfqq->bic = NULL;
  		bfq_exit_bfqq(bfqd, bfqq);
  		bic_set_bfqq(bic, NULL, is_sync);
  		spin_unlock_irqrestore(&bfqd->lock, flags);
  	}
  }
  
  static void bfq_exit_icq(struct io_cq *icq)
  {
  	struct bfq_io_cq *bic = icq_to_bic(icq);
  
  	bfq_exit_icq_bfqq(bic, true);
  	bfq_exit_icq_bfqq(bic, false);
  }
  
  /*
   * Update the entity prio values; note that the new values will not
   * be used until the next (re)activation.
   */
  static void
  bfq_set_next_ioprio_data(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
  {
  	struct task_struct *tsk = current;
  	int ioprio_class;
  	struct bfq_data *bfqd = bfqq->bfqd;
  
  	if (!bfqd)
  		return;
  
  	ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);
  	switch (ioprio_class) {
  	default:
  		pr_err("bdi %s: bfq: bad prio class %d\n",
  				bdi_dev_name(bfqq->bfqd->queue->backing_dev_info),
  				ioprio_class);
  		fallthrough;
  	case IOPRIO_CLASS_NONE:
  		/*
  		 * No prio set, inherit CPU scheduling settings.
  		 */
  		bfqq->new_ioprio = task_nice_ioprio(tsk);
  		bfqq->new_ioprio_class = task_nice_ioclass(tsk);
  		break;
  	case IOPRIO_CLASS_RT:
  		bfqq->new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
  		bfqq->new_ioprio_class = IOPRIO_CLASS_RT;
  		break;
  	case IOPRIO_CLASS_BE:
  		bfqq->new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
  		bfqq->new_ioprio_class = IOPRIO_CLASS_BE;
  		break;
  	case IOPRIO_CLASS_IDLE:
  		bfqq->new_ioprio_class = IOPRIO_CLASS_IDLE;
  		bfqq->new_ioprio = 7;
  		break;
  	}
  
  	if (bfqq->new_ioprio >= IOPRIO_BE_NR) {
  		pr_crit("bfq_set_next_ioprio_data: new_ioprio %d\n",
  			bfqq->new_ioprio);
  		bfqq->new_ioprio = IOPRIO_BE_NR - 1;
  	}
  
  	bfqq->entity.new_weight = bfq_ioprio_to_weight(bfqq->new_ioprio);
  	bfqq->entity.prio_changed = 1;
  }
  static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
  				       struct bio *bio, bool is_sync,
  				       struct bfq_io_cq *bic);
  static void bfq_check_ioprio_change(struct bfq_io_cq *bic, struct bio *bio)
  {
  	struct bfq_data *bfqd = bic_to_bfqd(bic);
  	struct bfq_queue *bfqq;
  	int ioprio = bic->icq.ioc->ioprio;
  
  	/*
  	 * This condition may trigger on a newly created bic, be sure to
  	 * drop the lock before returning.
  	 */
  	if (unlikely(!bfqd) || likely(bic->ioprio == ioprio))
  		return;
  
  	bic->ioprio = ioprio;
  
  	bfqq = bic_to_bfqq(bic, false);
  	if (bfqq) {
  		bfq_release_process_ref(bfqd, bfqq);
  		bfqq = bfq_get_queue(bfqd, bio, BLK_RW_ASYNC, bic);
  		bic_set_bfqq(bic, bfqq, false);
  	}
  
  	bfqq = bic_to_bfqq(bic, true);
  	if (bfqq)
  		bfq_set_next_ioprio_data(bfqq, bic);
  }
  
  static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
  			  struct bfq_io_cq *bic, pid_t pid, int is_sync)
  {
  	RB_CLEAR_NODE(&bfqq->entity.rb_node);
  	INIT_LIST_HEAD(&bfqq->fifo);
  	INIT_HLIST_NODE(&bfqq->burst_list_node);
  	INIT_HLIST_NODE(&bfqq->woken_list_node);
  	INIT_HLIST_HEAD(&bfqq->woken_list);
  
  	bfqq->ref = 0;
  	bfqq->bfqd = bfqd;
  
  	if (bic)
  		bfq_set_next_ioprio_data(bfqq, bic);
  
  	if (is_sync) {
  		/*
  		 * No need to mark as has_short_ttime if in
  		 * idle_class, because no device idling is performed
  		 * for queues in idle class
  		 */
  		if (!bfq_class_idle(bfqq))
  			/* tentatively mark as has_short_ttime */
  			bfq_mark_bfqq_has_short_ttime(bfqq);
  		bfq_mark_bfqq_sync(bfqq);
  		bfq_mark_bfqq_just_created(bfqq);
  	} else
  		bfq_clear_bfqq_sync(bfqq);
  
  	/* set end request to minus infinity from now */
  	bfqq->ttime.last_end_request = ktime_get_ns() + 1;
  
  	bfq_mark_bfqq_IO_bound(bfqq);
  
  	bfqq->pid = pid;
  
  	/* Tentative initial value to trade off between thr and lat */
  	bfqq->max_budget = (2 * bfq_max_budget(bfqd)) / 3;
  	bfqq->budget_timeout = bfq_smallest_from_now();

  	bfqq->wr_coeff = 1;
  	bfqq->last_wr_start_finish = jiffies;
  	bfqq->wr_start_at_switch_to_srt = bfq_smallest_from_now();
  	bfqq->split_time = bfq_smallest_from_now();
  
  	/*
  	 * To not forget the possibly high bandwidth consumed by a
  	 * process/queue in the recent past,
  	 * bfq_bfqq_softrt_next_start() returns a value at least equal
  	 * to the current value of bfqq->soft_rt_next_start (see
  	 * comments on bfq_bfqq_softrt_next_start).  Set
  	 * soft_rt_next_start to now, to mean that bfqq has consumed
  	 * no bandwidth so far.
  	 */
  	bfqq->soft_rt_next_start = jiffies;

  	/* first request is almost certainly seeky */
  	bfqq->seek_history = 1;
  }
  
  static struct bfq_queue **bfq_async_queue_prio(struct bfq_data *bfqd,
  					       struct bfq_group *bfqg,
  					       int ioprio_class, int ioprio)
  {
  	switch (ioprio_class) {
  	case IOPRIO_CLASS_RT:
  		return &bfqg->async_bfqq[0][ioprio];
  	case IOPRIO_CLASS_NONE:
  		ioprio = IOPRIO_NORM;
  		fallthrough;
  	case IOPRIO_CLASS_BE:
  		return &bfqg->async_bfqq[1][ioprio];
  	case IOPRIO_CLASS_IDLE:
  		return &bfqg->async_idle_bfqq;
  	default:
  		return NULL;
  	}
  }
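
/*
 * Illustration only: a user-space sketch of the slot lookup done by
 * bfq_async_queue_prio(). It assumes 8 priority levels and a default
 * level of 4 for requests with no class set. struct group_slots and
 * async_slot() are hypothetical names, and void * stands in for the
 * queue pointer type. Unlike the kernel function, the sketch asserts
 * that the level is within range.
 */
```c
#include <assert.h>
#include <stddef.h>

#define PRIO_LEVELS	8	/* assumed value of IOPRIO_BE_NR */
#define PRIO_NORM	4	/* assumed value of IOPRIO_NORM */

enum { CLASS_NONE, CLASS_RT, CLASS_BE, CLASS_IDLE };

struct group_slots {
	void *async[2][PRIO_LEVELS];	/* row 0: RT, row 1: BE */
	void *async_idle;		/* single shared slot for the idle class */
};

/* Return the address of the slot shared by all async I/O of that priority. */
static void **async_slot(struct group_slots *g, int cls, int level)
{
	assert(g != NULL);

	switch (cls) {
	case CLASS_RT:
		assert(level >= 0 && level < PRIO_LEVELS);
		return &g->async[0][level];
	case CLASS_NONE:
		level = PRIO_NORM;	/* no class set: treat as default BE level */
		return &g->async[1][level];
	case CLASS_BE:
		assert(level >= 0 && level < PRIO_LEVELS);
		return &g->async[1][level];
	case CLASS_IDLE:
		return &g->async_idle;
	default:
		return NULL;
	}
}
```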
  
  static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
  				       struct bio *bio, bool is_sync,
  				       struct bfq_io_cq *bic)
  {
  	const int ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
  	const int ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);
  	struct bfq_queue **async_bfqq = NULL;
  	struct bfq_queue *bfqq;
  	struct bfq_group *bfqg;
  
  	rcu_read_lock();
  	bfqg = bfq_find_set_group(bfqd, __bio_blkcg(bio));
  	if (!bfqg) {
  		bfqq = &bfqd->oom_bfqq;
  		goto out;
  	}
  	if (!is_sync) {
  		async_bfqq = bfq_async_queue_prio(bfqd, bfqg, ioprio_class,
  						  ioprio);
  		bfqq = *async_bfqq;
  		if (bfqq)
  			goto out;
  	}
  
  	bfqq = kmem_cache_alloc_node(bfq_pool,
  				     GFP_NOWAIT | __GFP_ZERO | __GFP_NOWARN,
  				     bfqd->queue->node);
  
  	if (bfqq) {
  		bfq_init_bfqq(bfqd, bfqq, bic, current->pid,
  			      is_sync);
  		bfq_init_entity(&bfqq->entity, bfqg);
  		bfq_log_bfqq(bfqd, bfqq, "allocated");
  	} else {
  		bfqq = &bfqd->oom_bfqq;
  		bfq_log_bfqq(bfqd, bfqq, "using oom bfqq");
  		goto out;
  	}
  
  	/*
  	 * Pin the queue now that it's allocated, scheduler exit will
  	 * prune it.
  	 */
  	if (async_bfqq) {
  		bfqq->ref++; /*
  			      * Extra group reference, w.r.t. sync
  			      * queue. This extra reference is removed
  			      * only if bfqq->bfqg disappears, to
  			      * guarantee that this queue is not freed
  			      * until its group goes away.
  			      */
  		bfq_log_bfqq(bfqd, bfqq, "get_queue, bfqq not in async: %p, %d",
  			     bfqq, bfqq->ref);
  		*async_bfqq = bfqq;
  	}
  
  out:
  	bfqq->ref++; /* get a process reference to this queue */
  	bfq_log_bfqq(bfqd, bfqq, "get_queue, at end: %p, %d", bfqq, bfqq->ref);
  	rcu_read_unlock();
  	return bfqq;
  }
  
  static void bfq_update_io_thinktime(struct bfq_data *bfqd,
  				    struct bfq_queue *bfqq)
  {
  	struct bfq_ttime *ttime = &bfqq->ttime;
  	u64 elapsed = ktime_get_ns() - bfqq->ttime.last_end_request;
  
  	elapsed = min_t(u64, elapsed, 2ULL * bfqd->bfq_slice_idle);
  
  	ttime->ttime_samples = (7*bfqq->ttime.ttime_samples + 256) / 8;
  	ttime->ttime_total = div_u64(7*ttime->ttime_total + 256*elapsed,  8);
  	ttime->ttime_mean = div64_ul(ttime->ttime_total + 128,
  				     ttime->ttime_samples);
  }
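
/*
 * Illustration only: a user-space sketch of the weighted moving average
 * kept by bfq_update_io_thinktime(). Each update keeps 7/8 of the old
 * statistics and adds 1/8 of the new sample, in fixed point with a
 * scale of 256. Plain 64-bit division stands in for div_u64() and
 * div64_ul(); struct ttime_stats and ttime_update() are hypothetical
 * names.
 */
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ttime_stats {
	uint64_t samples;	/* decayed sample count, scaled by 256 */
	uint64_t total;		/* decayed sum of think times, scaled by 256 */
	uint64_t mean;		/* total / samples, rounded */
};

/* Fold one think-time sample (in ns) into the running statistics. */
static void ttime_update(struct ttime_stats *t, uint64_t elapsed_ns,
			 uint64_t slice_idle_ns)
{
	uint64_t cap = 2 * slice_idle_ns;

	assert(t != NULL);

	/* Cap outliers so that one very long pause cannot dominate the mean. */
	if (elapsed_ns > cap)
		elapsed_ns = cap;

	t->samples = (7 * t->samples + 256) / 8;
	t->total = (7 * t->total + 256 * elapsed_ns) / 8;
	/* samples is at least 32 after the first update, so no division by 0. */
	t->mean = (t->total + 128) / t->samples;
}
```

/*
 * Starting from zeroed statistics, a constant think time yields a mean
 * close to that constant from the very first sample, because samples
 * and total decay at the same rate.
 */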
  
  static void
  bfq_update_io_seektime(struct bfq_data *bfqd, struct bfq_queue *bfqq,
  		       struct request *rq)
  {
  	bfqq->seek_history <<= 1;
  	bfqq->seek_history |= BFQ_RQ_SEEKY(bfqd, bfqq->last_request_pos, rq);
  
  	if (bfqq->wr_coeff > 1 &&
  	    bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time &&
  	    BFQQ_TOTALLY_SEEKY(bfqq))
  		bfq_bfqq_end_wr(bfqq);
  }
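
/*
 * Illustration only: a user-space sketch of the seek history kept by
 * bfq_update_io_seektime(), a 32-bit shift register with one bit per
 * request (1 means the request was seeky). The sketch assumes that a
 * queue counts as seeky when more than 19 of the last 32 requests were
 * seeky, and as totally seeky when all 32 were. All names below are
 * hypothetical.
 */
```c
#include <assert.h>
#include <stdint.h>

#define SEEKY_THRESHOLD	19	/* assumed threshold on the number of set bits */

/* Shift in the outcome of the newest request; the oldest one drops out. */
static uint32_t seek_history_push(uint32_t history, int seeky)
{
	assert(seeky == 0 || seeky == 1);
	return (history << 1) | (uint32_t)seeky;
}

/* Count set bits (portable stand-in for the kernel's hweight32()). */
static int popcount32(uint32_t v)
{
	int n = 0;

	while (v) {
		v &= v - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

static int is_seeky(uint32_t history)
{
	return popcount32(history) > SEEKY_THRESHOLD;
}

static int is_totally_seeky(uint32_t history)
{
	return history == UINT32_MAX;
}
```

/*
 * A new queue starts with a history of 1, so it takes 31 further seeky
 * requests in a row before is_totally_seeky() holds.
 */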
  static void bfq_update_has_short_ttime(struct bfq_data *bfqd,
  				       struct bfq_queue *bfqq,
  				       struct bfq_io_cq *bic)
  {
  	bool has_short_ttime = true, state_changed;

  	/*
  	 * No need to update has_short_ttime if bfqq is async or in
  	 * idle io prio class, or if bfq_slice_idle is zero, because
  	 * no device idling is performed for bfqq in this case.
  	 */
  	if (!bfq_bfqq_sync(bfqq) || bfq_class_idle(bfqq) ||
  	    bfqd->bfq_slice_idle == 0)
  		return;
  	/* Idle window just restored, statistics are meaningless. */
  	if (time_is_after_eq_jiffies(bfqq->split_time +
  				     bfqd->bfq_wr_min_idle_time))
  		return;
  	/* Think time is infinite if no process is linked to
  	 * bfqq. Otherwise check average think time to
  	 * decide whether to mark as has_short_ttime
  	 */
  	if (atomic_read(&bic->icq.ioc->active_ref) == 0 ||
  	    (bfq_sample_valid(bfqq->ttime.ttime_samples) &&
  	     bfqq->ttime.ttime_mean > bfqd->bfq_slice_idle))
  		has_short_ttime = false;
  	state_changed = has_short_ttime != bfq_bfqq_has_short_ttime(bfqq);

  	if (has_short_ttime)
  		bfq_mark_bfqq_has_short_ttime(bfqq);
  	else
  		bfq_clear_bfqq_has_short_ttime(bfqq);
  
  	/*
  	 * Until the base value for the total service time gets
  	 * finally computed for bfqq, the inject limit does depend on
  	 * the think-time state (short|long). In particular, the limit
  	 * is 0 or 1 if the think time is deemed, respectively, as
  	 * short or long (details in the comments in
  	 * bfq_update_inject_limit()). Accordingly, the next
  	 * instructions reset the inject limit if the think-time state
  	 * has changed and the above base value is still to be
  	 * computed.
  	 *
  	 * However, the reset is performed only if more than 100 ms
  	 * have elapsed since the last update of the inject limit, or
  	 * (inclusive) if the change is from short to long think
  	 * time. The reason for this waiting is as follows.
  	 *
  	 * bfqq may have a long think time because of a
  	 * synchronization with some other queue, i.e., because the
  	 * I/O of some other queue may need to be completed for bfqq
  	 * to receive new I/O. Details in the comments on the choice
  	 * of the queue for injection in bfq_select_queue().
  	 *
  	 * As stressed in those comments, if such a synchronization is
  	 * actually in place, then, without injection on bfqq, the
  	 * blocking I/O cannot happen to be served while bfqq is in
  	 * service. As a consequence, if bfqq is granted
  	 * I/O-dispatch-plugging, then bfqq remains empty, and no I/O
  	 * is dispatched, until the idle timeout fires. This is likely
  	 * to result in lower bandwidth and higher latencies for bfqq,
  	 * and in a severe loss of total throughput.
  	 *
  	 * On the opposite end, a non-zero inject limit may allow the
  	 * I/O that blocks bfqq to be executed soon, and therefore
  	 * bfqq to receive new I/O soon.
  	 *
  	 * But, if the blocking gets actually eliminated, then the
  	 * next think-time sample for bfqq may be very low. This in
  	 * turn may cause bfqq's think time to be deemed
  	 * short. Without the 100 ms barrier, this new state change
  	 * would cause the body of the next if to be executed
  	 * immediately. But this would set to 0 the inject
  	 * limit. Without injection, the blocking I/O would cause the
  	 * think time of bfqq to become long again, and therefore the
  	 * inject limit to be raised again, and so on. The only effect
  	 * of such a steady oscillation between the two think-time
  	 * states would be to prevent effective injection on bfqq.
  	 *
  	 * In contrast, if the inject limit is not reset during such a
  	 * long time interval as 100 ms, then the number of short
  	 * think time samples can grow significantly before the reset
  	 * is performed. As a consequence, the think time state can
  	 * become stable before the reset. Therefore there will be no
  	 * state change when the 100 ms elapse, and no reset of the
  	 * inject limit. The inject limit remains steadily equal to 1
  	 * both during and after the 100 ms. So injection can be
  	 * performed at all times, and throughput gets boosted.
  	 *
  	 * An inject limit equal to 1 is however in conflict, in
  	 * general, with the fact that the think time of bfqq is
  	 * short, because injection may be likely to delay bfqq's I/O
  	 * (as explained in the comments in
  	 * bfq_update_inject_limit()). But this does not happen in
  	 * this special case, because bfqq's low think time is due to
  	 * an effective handling of a synchronization, through
  	 * injection. In this special case, bfqq's I/O does not get
  	 * delayed by injection; on the contrary, bfqq's I/O is
  	 * brought forward, because it is not blocked for
  	 * milliseconds.
  	 *
  	 * In addition, serving the blocking I/O much sooner, and much
  	 * more frequently than once per I/O-plugging timeout, makes
  	 * it much quicker to detect a waker queue (the concept of
  	 * waker queue is defined in the comments in
  	 * bfq_add_request()). This makes it possible to start sooner
  	 * to boost throughput more effectively, by injecting the I/O
  	 * of the waker queue unconditionally on every
  	 * bfq_dispatch_request().
  	 *
  	 * One last, important benefit of not resetting the inject
  	 * limit before 100 ms is that, during this time interval, the
  	 * base value for the total service time is likely to get
  	 * finally computed for bfqq, freeing the inject limit from
  	 * its relation with the think time.
  	 */
  	if (state_changed && bfqq->last_serv_time_ns == 0 &&
  	    (time_is_before_eq_jiffies(bfqq->decrease_time_jif +
  				      msecs_to_jiffies(100)) ||
  	     !has_short_ttime))
  		bfq_reset_inject_limit(bfqd, bfqq);
  }
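
/*
 * Illustration only: the reset condition at the end of
 * bfq_update_has_short_ttime(), restated as a pure function so that the
 * effect of the 100 ms barrier can be checked case by case. Times are
 * plain milliseconds rather than jiffies, and
 * should_reset_inject_limit() with its parameters is a hypothetical
 * name, not a kernel interface.
 */
```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define INJECT_RESET_BARRIER_MS	100

/*
 * state_changed:        the think-time state flipped on this update
 * last_serv_time_ns:    0 while the base total service time is unknown
 * ms_since_limit_update: time elapsed since the inject limit last changed
 * has_short_ttime:      the new think-time state
 */
static bool should_reset_inject_limit(bool state_changed,
				      uint64_t last_serv_time_ns,
				      uint64_t ms_since_limit_update,
				      bool has_short_ttime)
{
	/* Once the base service time is known, think time no longer matters. */
	if (!state_changed || last_serv_time_ns != 0)
		return false;

	/*
	 * A short-to-long change resets at once. A long-to-short change
	 * must wait for the barrier, which avoids the oscillation
	 * described in the comment above.
	 */
	return ms_since_limit_update >= INJECT_RESET_BARRIER_MS ||
	       !has_short_ttime;
}
```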
  
  /*
   * Called when a new fs request (rq) is added to bfqq.  Check if there's
   * something we should do about it.
   */
  static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
  			    struct request *rq)
  {
  	if (rq->cmd_flags & REQ_META)
  		bfqq->meta_pending++;
  	bfqq->last_request_pos = blk_rq_pos(rq) + blk_rq_sectors(rq);
  
  	if (bfqq == bfqd->in_service_queue && bfq_bfqq_wait_request(bfqq)) {
  		bool small_req = bfqq->queued[rq_is_sync(rq)] == 1 &&
  				 blk_rq_sectors(rq) < 32;
  		bool budget_timeout = bfq_bfqq_budget_timeout(bfqq);
  
  		/*
  		 * There is just this request queued: if
  		 * - the request is small, and
  		 * - we are idling to boost throughput, and
  		 * - the queue is not to be expired,
  		 * then just exit.
  		 *
  		 * In this way, if the device is being idled to wait
  		 * for a new request from the in-service queue, we
  		 * avoid unplugging the device and committing the
  		 * device to serve just a small request. In contrast
  		 * we wait for the block layer to decide when to
  		 * unplug the device: hopefully, new requests will be
  		 * merged to this one quickly, then the device will be
  		 * unplugged and larger requests will be dispatched.
  		 */
  		if (small_req && idling_boosts_thr_without_issues(bfqd, bfqq) &&
  		    !budget_timeout)
  			return;
  
  		/*
  		 * A large enough request arrived, or idling is being
  		 * performed to preserve service guarantees, or
  		 * finally the queue is to be expired: in all these
  		 * cases disk idling is to be stopped, so clear
  		 * wait_request flag and reset timer.
  		 */
  		bfq_clear_bfqq_wait_request(bfqq);
  		hrtimer_try_to_cancel(&bfqd->idle_slice_timer);
  
  		/*
  		 * The queue is not empty, because a new request just
  		 * arrived. Hence we can safely expire the queue, in
  		 * case of budget timeout, without risking that the
  		 * timestamps of the queue are not updated correctly.
  		 * See [1] for more details.
  		 */
  		if (budget_timeout)
  			bfq_bfqq_expire(bfqd, bfqq, false,
  					BFQQE_BUDGET_TIMEOUT);
  	}
  }
  /* returns true if it causes the idle timer to be disabled */
  static bool __bfq_insert_request(struct bfq_data *bfqd, struct request *rq)
  {
  	struct bfq_queue *bfqq = RQ_BFQQ(rq),
  		*new_bfqq = bfq_setup_cooperator(bfqd, bfqq, rq, true);
  	bool waiting, idle_timer_disabled = false;
  
  	if (new_bfqq) {
  		/*
  		 * Release the request's reference to the old bfqq
  		 * and make sure one is taken to the shared queue.
  		 */
  		new_bfqq->allocated++;
  		bfqq->allocated--;
  		new_bfqq->ref++;
  		/*
  		 * If the bic associated with the process
  		 * issuing this request still points to bfqq
  		 * (and thus has not been already redirected
  		 * to new_bfqq or even some other bfq_queue),
  		 * then complete the merge and redirect it to
  		 * new_bfqq.
  		 */
  		if (bic_to_bfqq(RQ_BIC(rq), 1) == bfqq)
  			bfq_merge_bfqqs(bfqd, RQ_BIC(rq),
  					bfqq, new_bfqq);
  
  		bfq_clear_bfqq_just_created(bfqq);
  		/*
  		 * rq is about to be enqueued into new_bfqq,
  		 * release rq reference on bfqq
  		 */
  		bfq_put_queue(bfqq);
  		rq->elv.priv[1] = new_bfqq;
  		bfqq = new_bfqq;
  	}

  	bfq_update_io_thinktime(bfqd, bfqq);
  	bfq_update_has_short_ttime(bfqd, bfqq, RQ_BIC(rq));
  	bfq_update_io_seektime(bfqd, bfqq, rq);
  	waiting = bfqq && bfq_bfqq_wait_request(bfqq);
  	bfq_add_request(rq);
  	idle_timer_disabled = waiting && !bfq_bfqq_wait_request(bfqq);
  
  	rq->fifo_time = ktime_get_ns() + bfqd->bfq_fifo_expire[rq_is_sync(rq)];
  	list_add_tail(&rq->queuelist, &bfqq->fifo);
  
  	bfq_rq_enqueued(bfqd, bfqq, rq);
  
  	return idle_timer_disabled;
  }
  #ifdef CONFIG_BFQ_CGROUP_DEBUG
  static void bfq_update_insert_stats(struct request_queue *q,
  				    struct bfq_queue *bfqq,
  				    bool idle_timer_disabled,
  				    unsigned int cmd_flags)
  {
  	if (!bfqq)
  		return;
  
  	/*
  	 * bfqq still exists, because it can disappear only after
  	 * either it is merged with another queue, or the process it
  	 * is associated with exits. But both actions must be taken by
  	 * the same process currently executing this flow of
  	 * instructions.
  	 *
  	 * In addition, the following queue lock guarantees that
  	 * bfqq_group(bfqq) exists as well.
  	 */
  	spin_lock_irq(&q->queue_lock);
  	bfqg_stats_update_io_add(bfqq_group(bfqq), bfqq, cmd_flags);
  	if (idle_timer_disabled)
  		bfqg_stats_update_idle_time(bfqq_group(bfqq));
  	spin_unlock_irq(&q->queue_lock);
  }
  #else
  static inline void bfq_update_insert_stats(struct request_queue *q,
  					   struct bfq_queue *bfqq,
  					   bool idle_timer_disabled,
  					   unsigned int cmd_flags) {}
  #endif /* CONFIG_BFQ_CGROUP_DEBUG */

  static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
  			       bool at_head)
  {
  	struct request_queue *q = hctx->queue;
  	struct bfq_data *bfqd = q->elevator->elevator_data;
  	struct bfq_queue *bfqq;
  	bool idle_timer_disabled = false;
  	unsigned int cmd_flags;

  #ifdef CONFIG_BFQ_GROUP_IOSCHED
  	if (!cgroup_subsys_on_dfl(io_cgrp_subsys) && rq->bio)
  		bfqg_stats_update_legacy_io(q, rq);
  #endif
  	spin_lock_irq(&bfqd->lock);
  	if (blk_mq_sched_try_insert_merge(q, rq)) {
  		spin_unlock_irq(&bfqd->lock);
  		return;
  	}
  
  	spin_unlock_irq(&bfqd->lock);
  
  	blk_mq_sched_request_inserted(rq);
  
  	spin_lock_irq(&bfqd->lock);
  	bfqq = bfq_init_rq(rq);
  	if (!bfqq || at_head || blk_rq_is_passthrough(rq)) {
  		if (at_head)
  			list_add(&rq->queuelist, &bfqd->dispatch);
  		else
  			list_add_tail(&rq->queuelist, &bfqd->dispatch);
  	} else {
  		idle_timer_disabled = __bfq_insert_request(bfqd, rq);
  		/*
  		 * Update bfqq, because, if a queue merge has occurred
  		 * in __bfq_insert_request, then rq has been
  		 * redirected into a new queue.
  		 */
  		bfqq = RQ_BFQQ(rq);
  
  		if (rq_mergeable(rq)) {
  			elv_rqhash_add(q, rq);
  			if (!q->last_merge)
  				q->last_merge = rq;
  		}
  	}
  	/*
  	 * Cache cmd_flags before releasing scheduler lock, because rq
  	 * may disappear afterwards (for example, because of a request
  	 * merge).
  	 */
  	cmd_flags = rq->cmd_flags;

  	spin_unlock_irq(&bfqd->lock);

  	bfq_update_insert_stats(q, bfqq, idle_timer_disabled,
  				cmd_flags);
  }
  
  static void bfq_insert_requests(struct blk_mq_hw_ctx *hctx,
  				struct list_head *list, bool at_head)
  {
  	while (!list_empty(list)) {
  		struct request *rq;
  
  		rq = list_first_entry(list, struct request, queuelist);
  		list_del_init(&rq->queuelist);
  		bfq_insert_request(hctx, rq, at_head);
  		atomic_inc(&hctx->elevator_queued);
  	}
  }
  
  static void bfq_update_hw_tag(struct bfq_data *bfqd)
  {
  	struct bfq_queue *bfqq = bfqd->in_service_queue;
  	bfqd->max_rq_in_driver = max_t(int, bfqd->max_rq_in_driver,
  				       bfqd->rq_in_driver);
  
  	if (bfqd->hw_tag == 1)
  		return;
  
  	/*
  	 * This sample is valid if the number of outstanding requests
  	 * is large enough to allow a queueing behavior.  Note that the
  	 * sum is not exact, as it's not taking into account deactivated
  	 * requests.
  	 */
  	if (bfqd->rq_in_driver + bfqd->queued <= BFQ_HW_QUEUE_THRESHOLD)
  		return;
  	/*
  	 * If active queue hasn't enough requests and can idle, bfq might not
  	 * dispatch sufficient requests to hardware. Don't zero hw_tag in this
  	 * case
  	 */
  	if (bfqq && bfq_bfqq_has_short_ttime(bfqq) &&
  	    bfqq->dispatched + bfqq->queued[0] + bfqq->queued[1] <
  	    BFQ_HW_QUEUE_THRESHOLD &&
  	    bfqd->rq_in_driver < BFQ_HW_QUEUE_THRESHOLD)
  		return;
  	if (bfqd->hw_tag_samples++ < BFQ_HW_QUEUE_SAMPLES)
  		return;
  
  	bfqd->hw_tag = bfqd->max_rq_in_driver > BFQ_HW_QUEUE_THRESHOLD;
  	bfqd->max_rq_in_driver = 0;
  	bfqd->hw_tag_samples = 0;
  
  	bfqd->nonrot_with_queueing =
  		blk_queue_nonrot(bfqd->queue) && bfqd->hw_tag;
  }
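
/*
 * Illustration only: a user-space sketch of the sampling scheme used by
 * bfq_update_hw_tag() to decide whether the device queues requests
 * internally. It assumes a threshold of 3 requests and a window of 32
 * valid samples, with hw_tag starting at -1 (unknown). The check on the
 * in-service queue is left out for brevity, and struct hw_tag_state and
 * hw_tag_sample() are hypothetical names.
 */
```c
#include <assert.h>
#include <stddef.h>

#define HW_QUEUE_THRESHOLD	3	/* assumed value */
#define HW_QUEUE_SAMPLES	32	/* assumed value */

struct hw_tag_state {
	int hw_tag;		/* -1 unknown, 0 no queueing, 1 queueing */
	int max_rq_in_driver;	/* peak in-flight requests in this window */
	int samples;		/* valid samples collected in this window */
};

/* Feed one observation of the in-flight and queued request counts. */
static void hw_tag_sample(struct hw_tag_state *s, int rq_in_driver, int queued)
{
	assert(s != NULL);

	if (rq_in_driver > s->max_rq_in_driver)
		s->max_rq_in_driver = rq_in_driver;

	if (s->hw_tag == 1)
		return;		/* queueing already detected, nothing to learn */

	/* Too little outstanding I/O to tell whether the device could queue. */
	if (rq_in_driver + queued <= HW_QUEUE_THRESHOLD)
		return;

	if (s->samples++ < HW_QUEUE_SAMPLES)
		return;		/* window not complete yet */

	/* Window complete: decide, then start a new window. */
	s->hw_tag = s->max_rq_in_driver > HW_QUEUE_THRESHOLD;
	s->max_rq_in_driver = 0;
	s->samples = 0;
}
```

/*
 * With these assumed constants the decision is taken on the 33rd valid
 * sample, since the counter is compared before it is incremented.
 */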
  
  static void bfq_completed_request(struct bfq_queue *bfqq, struct bfq_data *bfqd)
  {
  	u64 now_ns;
  	u32 delta_us;
  	bfq_update_hw_tag(bfqd);
  
  	bfqd->rq_in_driver--;
  	bfqq->dispatched--;
  	if (!bfqq->dispatched && !bfq_bfqq_busy(bfqq)) {
  		/*
  		 * Set budget_timeout (which we overload to store the
  		 * time at which the queue remains with no backlog and
  		 * no outstanding request; used by the weight-raising
  		 * mechanism).
  		 */
  		bfqq->budget_timeout = jiffies;

  		bfq_weights_tree_remove(bfqd, bfqq);
  	}
  	now_ns = ktime_get_ns();
  
  	bfqq->ttime.last_end_request = now_ns;
  
  	/*
  	 * Using us instead of ns, to get a reasonable precision in
  	 * computing rate in next check.
  	 */
  	delta_us = div_u64(now_ns - bfqd->last_completion, NSEC_PER_USEC);
  
  	/*
  	 * If the request took rather long to complete, and, according
  	 * to the maximum request size recorded, this completion latency
  	 * implies that the request was certainly served at a very low
  	 * rate (less than 1M sectors/sec), then the whole observation
  	 * interval that lasts up to this time instant cannot be a
  	 * valid time interval for computing a new peak rate.  Invoke
  	 * bfq_update_rate_reset to have the following three steps
  	 * taken:
  	 * - close the observation interval at the last (previous)
  	 *   request dispatch or completion
  	 * - compute rate, if possible, for that observation interval
  	 * - reset to zero samples, which will trigger a proper
  	 *   re-initialization of the observation interval on next
  	 *   dispatch
  	 */
  	if (delta_us > BFQ_MIN_TT/NSEC_PER_USEC &&
  	   (bfqd->last_rq_max_size<<BFQ_RATE_SHIFT)/delta_us <
  			1UL<<(BFQ_RATE_SHIFT - 10))
  		bfq_update_rate_reset(bfqd, NULL);
  	bfqd->last_completion = now_ns;
  	bfqd->last_completed_rq_bfqq = bfqq;
  
  	/*
  	 * If we are waiting to discover whether the request pattern
  	 * of the task associated with the queue is actually
  	 * isochronous, and both requisites for this condition to hold
  	 * are now satisfied, then compute soft_rt_next_start (see the
  	 * comments on the function bfq_bfqq_softrt_next_start()). We
  	 * do not compute soft_rt_next_start if bfqq is in interactive
  	 * weight raising (see the comments in bfq_bfqq_expire() for
  	 * an explanation). We schedule this delayed update when bfqq
  	 * expires, if it still has in-flight requests.
  	 */
  	if (bfq_bfqq_softrt_update(bfqq) && bfqq->dispatched == 0 &&
  	    RB_EMPTY_ROOT(&bfqq->sort_list) &&
  	    bfqq->wr_coeff != bfqd->bfq_wr_coeff)
  		bfqq->soft_rt_next_start =
  			bfq_bfqq_softrt_next_start(bfqd, bfqq);
  
  	/*
  	 * If this is the in-service queue, check if it needs to be expired,
  	 * or if we want to idle in case it has no pending requests.
  	 */
  	if (bfqd->in_service_queue == bfqq) {
  		if (bfq_bfqq_must_idle(bfqq)) {
  			if (bfqq->dispatched == 0)
  				bfq_arm_slice_timer(bfqd);
  			/*
  			 * If we get here, we do not expire bfqq, even
  			 * if bfqq was in budget timeout or had no
  			 * more requests (as controlled in the next
  			 * conditional instructions). The reason for
  			 * not expiring bfqq is as follows.
  			 *
  			 * Here bfqq->dispatched > 0 holds, but
  			 * bfq_bfqq_must_idle() returned true. This
  			 * implies that, even if no request arrives
  			 * for bfqq before bfqq->dispatched reaches 0,
  			 * bfqq will, however, not be expired on the
  			 * completion event that causes bfqq->dispatched
  			 * to reach zero. In contrast, on this event,
  			 * bfqq will start enjoying device idling
  			 * (I/O-dispatch plugging).
  			 *
  			 * But, if we expired bfqq here, bfqq would
  			 * not have the chance to enjoy device idling
  			 * when bfqq->dispatched finally reaches
  			 * zero. This would expose bfqq to violation
  			 * of its reserved service guarantees.
  			 */
  			return;
  		} else if (bfq_may_expire_for_budg_timeout(bfqq))
  			bfq_bfqq_expire(bfqd, bfqq, false,
  					BFQQE_BUDGET_TIMEOUT);
  		else if (RB_EMPTY_ROOT(&bfqq->sort_list) &&
  			 (bfqq->dispatched == 0 ||
  			  !bfq_better_to_idle(bfqq)))
  			bfq_bfqq_expire(bfqd, bfqq, false,
  					BFQQE_NO_MORE_REQUESTS);
  	}
  
  	if (!bfqd->rq_in_driver)
  		bfq_schedule_dispatch(bfqd);
  }
  static void bfq_finish_requeue_request_body(struct bfq_queue *bfqq)
  {
  	bfqq->allocated--;
  
  	bfq_put_queue(bfqq);
  }
  /*
   * The processes associated with bfqq may happen to generate their
   * cumulative I/O at a lower rate than the rate at which the device
   * could serve the same I/O. This is rather probable, e.g., if only
   * one process is associated with bfqq and the device is an SSD. It
   * results in bfqq becoming often empty while in service. In this
   * respect, if BFQ is allowed to switch to another queue when bfqq
   * remains empty, then the device goes on being fed with I/O requests,
   * and the throughput is not affected. In contrast, if BFQ is not
   * allowed to switch to another queue---because bfqq is sync and
   * I/O-dispatch needs to be plugged while bfqq is temporarily
   * empty---then, during the service of bfqq, there will be frequent
   * "service holes", i.e., time intervals during which bfqq gets empty
   * and the device can only consume the I/O already queued in its
   * hardware queues. During service holes, the device may even
   * remain idle. In the end, during the service of bfqq, the device
   * is driven at a lower speed than the one it can reach with the kind
   * of I/O flowing through bfqq.
   *
   * To counter this loss of throughput, BFQ implements a "request
   * injection mechanism", which tries to fill the above service holes
   * with I/O requests taken from other queues. The hard part in this
   * mechanism is finding the right amount of I/O to inject, so as to
   * both boost throughput and not break bfqq's bandwidth and latency
   * guarantees. In this respect, the mechanism maintains a per-queue
   * inject limit, computed as below. While bfqq is empty, the injection
   * mechanism dispatches extra I/O requests only as long as the total
   * number of I/O requests in flight---i.e., already dispatched but not
   * yet completed---remains lower than this limit.
   *
   * A first definition comes in handy to introduce the algorithm by
   * which the inject limit is computed.  We define as first request for
   * bfqq, an I/O request for bfqq that arrives while bfqq is in
   * service, and causes bfqq to switch from empty to non-empty. The
   * algorithm updates the limit as a function of the effect of
   * injection on the service times of only the first requests of
   * bfqq. The reason for this restriction is that these are the
   * requests whose service time is affected most, because they are the
   * first to arrive after injection possibly occurred.
   *
   * To evaluate the effect of injection, the algorithm measures the
   * "total service time" of first requests. We define as total service
   * time of an I/O request, the time that elapses since when the
   * request is enqueued into bfqq, to when it is completed. This
   * quantity allows the whole effect of injection to be measured. It is
   * easy to see why. Suppose that some requests of other queues are
   * actually injected while bfqq is empty, and that a new request R
   * then arrives for bfqq. If the device does start to serve all or
   * part of the injected requests during the service hole, then,
   * because of this extra service, it may delay the next invocation of
   * the dispatch hook of BFQ. Then, even after R gets eventually
   * dispatched, the device may delay the actual service of R if it is
   * still busy serving the extra requests, or if it decides to serve,
   * before R, some extra request still present in its queues. As a
   * conclusion, the cumulative extra delay caused by injection can be
   * easily evaluated by just comparing the total service time of first
   * requests with and without injection.
   *
   * The limit-update algorithm works as follows. On the arrival of a
   * first request of bfqq, the algorithm measures the total time of the
   * request only if one of the three cases below holds, and, for each
   * case, it updates the limit as described below:
   *
   * (1) If there is no in-flight request. This gives a baseline for the
   *     total service time of the requests of bfqq. If the baseline has
   *     not been computed yet, then, after computing it, the limit is
   *     set to 1, to start boosting throughput, and to prepare the
   *     ground for the next case. If the baseline has already been
   *     computed, then it is updated, if the new value turns out to be
   *     lower than the previous one.
   *
   * (2) If the limit is higher than 0 and there are in-flight
   *     requests. By comparing the total service time in this case with
   *     the above baseline, it is possible to know to what extent the
   *     current value of the limit is inflating the total service
   *     time. If the inflation is below a certain threshold, then bfqq
   *     is assumed to be suffering from no perceivable loss of its
   *     service guarantees, and the limit is even tentatively
   *     increased. If the inflation is above the threshold, then the
   *     limit is decreased. Due to the lack of any hysteresis, this
   *     logic makes the limit oscillate even in steady workload
   *     conditions. Yet we opted for it, because it is fast in reaching
   *     the best value for the limit, as a function of the current I/O
   *     workload. To reduce oscillations, this step is disabled for a
   *     short time interval after the limit happens to be decreased.
   *
   * (3) Periodically, after resetting the limit, to make sure that the
   *     limit eventually drops in case the workload changes. This is
   *     needed because, after the limit has gone safely up for a
   *     certain workload, it is impossible to guess whether the
   *     baseline total service time may have changed, without measuring
   *     it again without injection. A more effective version of this
   *     step might be to just sample the baseline, by interrupting
   *     injection only once, and then to reset/lower the limit only if
   *     the total service time with the current limit does happen to be
   *     too large.
   *
   * More details on each step are provided in the comments on the
   * pieces of code that implement these steps: the branch handling the
   * transition from empty to non-empty in bfq_add_request(), the branch
   * handling injection in bfq_select_queue(), and the function
   * bfq_choose_bfqq_for_injection(). These comments also explain some
   * exceptions, made by the injection mechanism in some special cases.
   */
  static void bfq_update_inject_limit(struct bfq_data *bfqd,
  				    struct bfq_queue *bfqq)
  {
  	u64 tot_time_ns = ktime_get_ns() - bfqd->last_empty_occupied_ns;
  	unsigned int old_limit = bfqq->inject_limit;
  	if (bfqq->last_serv_time_ns > 0 && bfqd->rqs_injected) {
  		u64 threshold = (bfqq->last_serv_time_ns * 3)>>1;
  
  		if (tot_time_ns >= threshold && old_limit > 0) {
  			bfqq->inject_limit--;
  			bfqq->decrease_time_jif = jiffies;
  		} else if (tot_time_ns < threshold &&
  			   old_limit <= bfqd->max_rq_in_driver)
  			bfqq->inject_limit++;
  	}
  
  	/*
  	 * Either we still have to compute the base value for the
  	 * total service time, and there seem to be the right
  	 * conditions to do it, or we can lower the last base value
  	 * computed.
  	 *
  	 * NOTE: (bfqd->rq_in_driver == 1) means that there is no I/O
  	 * request in flight, because this function is in the code
  	 * path that handles the completion of a request of bfqq, and,
  	 * in particular, this function is executed before
  	 * bfqd->rq_in_driver is decremented in such a code path.
  	 */
  	if ((bfqq->last_serv_time_ns == 0 && bfqd->rq_in_driver == 1) ||
  	    tot_time_ns < bfqq->last_serv_time_ns) {
  		if (bfqq->last_serv_time_ns == 0) {
  			/*
  			 * Now we certainly have a base value: make sure we
  			 * start trying injection.
  			 */
  			bfqq->inject_limit = max_t(unsigned int, 1, old_limit);
  		}
  		bfqq->last_serv_time_ns = tot_time_ns;
  	} else if (!bfqd->rqs_injected && bfqd->rq_in_driver == 1)
  		/*
  		 * No I/O injected and no request still in service in
  		 * the drive: these are the exact conditions for
  		 * computing the base value of the total service time
  		 * for bfqq. So let's update this value, because it is
  		 * rather variable. For example, it varies if the size
  		 * or the spatial locality of the I/O requests in bfqq
  		 * change.
  		 */
  		bfqq->last_serv_time_ns = tot_time_ns;
  
  	/* update complete, not waiting for any request completion any longer */
  	bfqd->waited_rq = NULL;
  	bfqd->rqs_injected = false;
  }
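
  /*
   * Illustration only, not part of the kernel source: a minimal
   * user-space sketch of the limit-update logic of
   * bfq_update_inject_limit(). The names struct inject_state,
   * inject_limit_update(), alone_in_driver and max_in_driver are
   * hypothetical stand-ins for the bfqq/bfqd fields used above
   * (alone_in_driver plays the role of bfqd->rq_in_driver == 1). The
   * sketch omits locking, decrease_time_jif, and the resetting of
   * waited_rq and rqs_injected.
   */

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-queue injection state. */
struct inject_state {
	uint64_t baseline_ns;	/* role of bfqq->last_serv_time_ns */
	unsigned int limit;	/* role of bfqq->inject_limit */
};

/*
 * Update limit and baseline from one measured total service time.
 * injected:        extra I/O was injected during the measurement
 * alone_in_driver: the measured request was the only one in flight
 * max_in_driver:   cap above which the limit is no longer raised
 */
static void inject_limit_update(struct inject_state *s, uint64_t tot_time_ns,
				bool injected, bool alone_in_driver,
				unsigned int max_in_driver)
{
	unsigned int old_limit = s->limit;

	if (s->baseline_ns > 0 && injected) {
		/* Tolerate up to a 1.5x inflation over the baseline. */
		uint64_t threshold = (s->baseline_ns * 3) >> 1;

		if (tot_time_ns >= threshold && old_limit > 0)
			s->limit--;
		else if (tot_time_ns < threshold && old_limit <= max_in_driver)
			s->limit++;
	}

	if ((s->baseline_ns == 0 && alone_in_driver) ||
	    tot_time_ns < s->baseline_ns) {
		/* First baseline: make sure injection is tried at all. */
		if (s->baseline_ns == 0)
			s->limit = old_limit > 1 ? old_limit : 1;
		s->baseline_ns = tot_time_ns;
	} else if (!injected && alone_in_driver) {
		/* Clean measurement: refresh the baseline. */
		s->baseline_ns = tot_time_ns;
	}
}
```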
  
  /*
   * Handle either a requeue or a finish for rq. The things to do are
   * the same in both cases: all references to rq are to be dropped. In
   * particular, rq is considered completed from the point of view of
   * the scheduler.
   */
  static void bfq_finish_requeue_request(struct request *rq)
  {
  	struct bfq_queue *bfqq = RQ_BFQQ(rq);
  	struct bfq_data *bfqd;
  	/*
  	 * rq either is not associated with any icq, or is an already
  	 * requeued request that has not (yet) been re-inserted into
  	 * a bfq_queue.
  	 */
  	if (!rq->elv.icq || !bfqq)
  		return;
  	bfqd = bfqq->bfqd;

  	if (rq->rq_flags & RQF_STARTED)
  		bfqg_stats_update_completion(bfqq_group(bfqq),
  					     rq->start_time_ns,
  					     rq->io_start_time_ns,
  					     rq->cmd_flags);
  
  	if (likely(rq->rq_flags & RQF_STARTED)) {
  		unsigned long flags;
  
  		spin_lock_irqsave(&bfqd->lock, flags);
  		if (rq == bfqd->waited_rq)
  			bfq_update_inject_limit(bfqd, bfqq);
  		bfq_completed_request(bfqq, bfqd);
  		bfq_finish_requeue_request_body(bfqq);
  		atomic_dec(&rq->mq_hctx->elevator_queued);

  		spin_unlock_irqrestore(&bfqd->lock, flags);
  	} else {
  		/*
  		 * Request rq may be still/already in the scheduler,
  		 * in which case we need to remove it (this should
  		 * never happen in case of requeue). And we cannot
  		 * defer such a check and removal, to avoid
  		 * inconsistencies in the time interval from the end
  		 * of this function to the start of the deferred work.
  		 * This situation seems to occur only in process
  		 * context, as a consequence of a merge. In the
  		 * current version of the code, this implies that the
  		 * lock is held.
  		 */
  		if (!RB_EMPTY_NODE(&rq->rb_node)) {
  			bfq_remove_request(rq->q, rq);
  			bfqg_stats_update_io_remove(bfqq_group(bfqq),
  						    rq->cmd_flags);
  		}
  		bfq_finish_requeue_request_body(bfqq);
  	}
  	/*
  	 * Reset private fields. In case of a requeue, this allows
  	 * this function to correctly do nothing if it is spuriously
  	 * invoked again on this same request (see the check at the
  	 * beginning of the function). Probably, a better general
  	 * design would be to prevent blk-mq from invoking the requeue
  	 * or finish hooks of an elevator, for a request that is not
  	 * referred to by that elevator.
  	 *
  	 * Resetting the following fields would break the
  	 * request-insertion logic if rq is re-inserted into a bfq
  	 * internal queue, without a re-preparation. Here we assume
  	 * that re-insertions of requeued requests, without
  	 * re-preparation, can happen only for pass_through or at_head
  	 * requests (which are not re-inserted into bfq internal
  	 * queues).
  	 */
  	rq->elv.priv[0] = NULL;
  	rq->elv.priv[1] = NULL;
  }
  
  /*
   * Removes the association between the current task and bfqq, assuming
   * that bic points to the bfq iocontext of the task.
   * Returns NULL if a new bfqq should be allocated, or the old bfqq if this
   * was the last process referring to that bfqq.
   */
  static struct bfq_queue *
  bfq_split_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq)
  {
  	bfq_log_bfqq(bfqq->bfqd, bfqq, "splitting queue");
  
  	if (bfqq_process_refs(bfqq) == 1) {
  		bfqq->pid = current->pid;
  		bfq_clear_bfqq_coop(bfqq);
  		bfq_clear_bfqq_split_coop(bfqq);
  		return bfqq;
  	}
  
  	bic_set_bfqq(bic, NULL, 1);
  
  	bfq_put_cooperator(bfqq);
  	bfq_release_process_ref(bfqq->bfqd, bfqq);
  	return NULL;
  }
  
  static struct bfq_queue *bfq_get_bfqq_handle_split(struct bfq_data *bfqd,
  						   struct bfq_io_cq *bic,
  						   struct bio *bio,
  						   bool split, bool is_sync,
  						   bool *new_queue)
  {
  	struct bfq_queue *bfqq = bic_to_bfqq(bic, is_sync);
  
  	if (likely(bfqq && bfqq != &bfqd->oom_bfqq))
  		return bfqq;
  
  	if (new_queue)
  		*new_queue = true;
  
  	if (bfqq)
  		bfq_put_queue(bfqq);
  	bfqq = bfq_get_queue(bfqd, bio, is_sync, bic);
  
  	bic_set_bfqq(bic, bfqq, is_sync);
  	if (split && is_sync) {
  		if ((bic->was_in_burst_list && bfqd->large_burst) ||
  		    bic->saved_in_large_burst)
  			bfq_mark_bfqq_in_large_burst(bfqq);
  		else {
  			bfq_clear_bfqq_in_large_burst(bfqq);
  			if (bic->was_in_burst_list)
  				/*
  				 * If bfqq was in the current
  				 * burst list before being
  				 * merged, then we have to add
  				 * it back. And we do not need
  				 * to increase burst_size, as
  				 * we did not decrement
  				 * burst_size when we removed
  				 * bfqq from the burst list as
  				 * a consequence of a merge
  				 * (see comments in
  				 * bfq_put_queue). In this
  				 * respect, it would be rather
  				 * costly to know whether the
  				 * current burst list is still
  				 * the same burst list from
  				 * which bfqq was removed on
  				 * the merge. To avoid this
  				 * cost, if bfqq was in a
  				 * burst list, then we add
  				 * bfqq to the current burst
  				 * list without any further
  				 * check. This can cause
  				 * inappropriate insertions,
  				 * but rarely enough to not
  				 * harm the detection of large
  				 * bursts significantly.
  				 */
  				hlist_add_head(&bfqq->burst_list_node,
  					       &bfqd->burst_list);
  		}
  		bfqq->split_time = jiffies;
  	}
  
  	return bfqq;
  }
  
  /*
   * Only reset private fields. The actual request preparation will be
   * performed by bfq_init_rq, when rq is either inserted or merged. See
   * comments on bfq_init_rq for the reason behind this delayed
   * preparation.
   */
  static void bfq_prepare_request(struct request *rq)
  {
  	/*
  	 * Regardless of whether we have an icq attached, we have to
  	 * clear the scheduler pointers, as they might point to
  	 * previously allocated bic/bfqq structs.
  	 */
  	rq->elv.priv[0] = rq->elv.priv[1] = NULL;
  }
  
  /*
   * If needed, init rq, allocate bfq data structures associated with
   * rq, and increment reference counters in the destination bfq_queue
   * for rq. Return the destination bfq_queue for rq, or NULL if rq is
   * not associated with any bfq_queue.
   *
   * This function is invoked by the functions that perform rq insertion
   * or merging. One may have expected the above preparation operations
   * to be performed in bfq_prepare_request, and not delayed to when rq
   * is inserted or merged. The rationale behind this delayed
   * preparation is that, after the prepare_request hook is invoked for
   * rq, rq may still be transformed into a request with no icq, i.e., a
   * request not associated with any queue. No bfq hook is invoked to
   * signal this transformation. As a consequence, should these
   * preparation operations be performed when the prepare_request hook
   * is invoked, and should rq be transformed one moment later, bfq
   * would end up in an inconsistent state, because it would have
   * incremented some queue counters for an rq destined to
   * transformation, without any chance to correctly lower these
   * counters back. In contrast, no transformation can still happen for
   * rq after rq has been inserted or merged. So, it is safe to execute
   * these preparation operations when rq is finally inserted or merged.
   */
  static struct bfq_queue *bfq_init_rq(struct request *rq)
  {
  	struct request_queue *q = rq->q;
  	struct bio *bio = rq->bio;
  	struct bfq_data *bfqd = q->elevator->elevator_data;
  	struct bfq_io_cq *bic;
  	const int is_sync = rq_is_sync(rq);
  	struct bfq_queue *bfqq;
  	bool new_queue = false;
  	bool bfqq_already_existing = false, split = false;

  	if (unlikely(!rq->elv.icq))
  		return NULL;
  	/*
  	 * Assuming that elv.priv[1] is set only if everything is set
  	 * for this rq. This holds true, because this function is
  	 * invoked only for insertion or merging, and, after such
  	 * events, a request cannot be manipulated any longer before
  	 * being removed from bfq.
  	 */
  	if (rq->elv.priv[1])
  		return rq->elv.priv[1];

  	bic = icq_to_bic(rq->elv.icq);

  	bfq_check_ioprio_change(bic, bio);
  	bfq_bic_update_cgroup(bic, bio);
  	bfqq = bfq_get_bfqq_handle_split(bfqd, bic, bio, false, is_sync,
  					 &new_queue);
  
  	if (likely(!new_queue)) {
  		/* If the queue was seeky for too long, break it apart. */
  		if (bfq_bfqq_coop(bfqq) && bfq_bfqq_split_coop(bfqq)) {
  			bfq_log_bfqq(bfqd, bfqq, "breaking apart bfqq");
  
  			/* Update bic before losing reference to bfqq */
  			if (bfq_bfqq_in_large_burst(bfqq))
  				bic->saved_in_large_burst = true;
  			bfqq = bfq_split_bfqq(bic, bfqq);
  			split = true;
  
  			if (!bfqq)
  				bfqq = bfq_get_bfqq_handle_split(bfqd, bic, bio,
  								 true, is_sync,
  								 NULL);
  			else
  				bfqq_already_existing = true;
  		}
  	}
  
  	bfqq->allocated++;
  	bfqq->ref++;
  	bfq_log_bfqq(bfqd, bfqq, "get_request %p: bfqq %p, %d",
  		     rq, bfqq, bfqq->ref);
  
  	rq->elv.priv[0] = bic;
  	rq->elv.priv[1] = bfqq;
  	/*
  	 * If a bfq_queue has only one process reference, it is owned
  	 * by only this bic: we can then set bfqq->bic = bic. In
  	 * addition, if the queue has also just been split, we have to
  	 * resume its state.
  	 */
  	if (likely(bfqq != &bfqd->oom_bfqq) && bfqq_process_refs(bfqq) == 1) {
  		bfqq->bic = bic;
  		if (split) {
  			/*
  			 * The queue has just been split from a shared
  			 * queue: restore the idle window and the
  			 * possible weight raising period.
  			 */
  			bfq_bfqq_resume_state(bfqq, bfqd, bic,
  					      bfqq_already_existing);
  		}
  	}
  	/*
  	 * Consider bfqq as possibly belonging to a burst of newly
  	 * created queues only if:
  	 * 1) A burst is actually happening (bfqd->burst_size > 0)
  	 * or
  	 * 2) There is no other active queue. In fact, if, in
  	 *    contrast, there are active queues not belonging to the
  	 *    possible burst bfqq may belong to, then there is no gain
  	 *    in considering bfqq as belonging to a burst, and
  	 *    therefore in not weight-raising bfqq. See comments on
  	 *    bfq_handle_burst().
  	 *
  	 * This filtering also helps eliminate false positives,
  	 * occurring when bfqq does not belong to an actual large
  	 * burst, but some background task (e.g., a service) happens
  	 * to trigger the creation of new queues very close to when
  	 * bfqq and its possible companion queues are created. See
  	 * comments on bfq_handle_burst() for further details also on
  	 * this issue.
  	 */
  	if (unlikely(bfq_bfqq_just_created(bfqq) &&
  		     (bfqd->burst_size > 0 ||
  		      bfq_tot_busy_queues(bfqd) == 0)))
  		bfq_handle_burst(bfqd, bfqq);
  	return bfqq;
  }
  static void
  bfq_idle_slice_timer_body(struct bfq_data *bfqd, struct bfq_queue *bfqq)
  {
  	enum bfqq_expiration reason;
  	unsigned long flags;
  
  	spin_lock_irqsave(&bfqd->lock, flags);

  	/*
  	 * Since bfqq may be subject to a race, first check whether
  	 * bfqq is still in service before operating on it. If bfqq is
  	 * no longer in service, then it has already been expired
  	 * through __bfq_bfqq_expire(), and its wait_request flag has
  	 * been cleared in __bfq_bfqd_reset_in_service().
  	 */
  	if (bfqq != bfqd->in_service_queue) {
  		spin_unlock_irqrestore(&bfqd->lock, flags);
  		return;
  	}
  	bfq_clear_bfqq_wait_request(bfqq);
  	if (bfq_bfqq_budget_timeout(bfqq))
  		/*
  		 * Also here the queue can be safely expired
  		 * for budget timeout without wasting
  		 * guarantees
  		 */
  		reason = BFQQE_BUDGET_TIMEOUT;
  	else if (bfqq->queued[0] == 0 && bfqq->queued[1] == 0)
  		/*
  		 * The queue may not be empty upon timer expiration,
  		 * because we may not disable the timer when the
  		 * first request of the in-service queue arrives
  		 * during disk idling.
  		 */
  		reason = BFQQE_TOO_IDLE;
  	else
  		goto schedule_dispatch;
  
  	bfq_bfqq_expire(bfqd, bfqq, true, reason);
  
  schedule_dispatch:
  	spin_unlock_irqrestore(&bfqd->lock, flags);
  	bfq_schedule_dispatch(bfqd);
  }
  
  /*
   * Handler of the expiration of the timer running if the in-service queue
   * is idling inside its time slice.
   */
  static enum hrtimer_restart bfq_idle_slice_timer(struct hrtimer *timer)
  {
  	struct bfq_data *bfqd = container_of(timer, struct bfq_data,
  					     idle_slice_timer);
  	struct bfq_queue *bfqq = bfqd->in_service_queue;
  
  	/*
  	 * Theoretical race here: the in-service queue can be NULL or
  	 * different from the queue that was idling if a new request
  	 * arrives for the current queue and there is a full dispatch
  	 * cycle that changes the in-service queue.  This can hardly
  	 * happen, but in the worst case we just expire a queue too
  	 * early.
  	 */
  	if (bfqq)
  		bfq_idle_slice_timer_body(bfqd, bfqq);
  
  	return HRTIMER_NORESTART;
  }
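
The handler above recovers its bfq_data from the embedded timer with container_of(). A minimal user-space sketch of that pointer arithmetic follows; the macro and the struct names are hypothetical stand-ins, and the kernel's own macro adds type checking that is omitted here.

```c
#include <stddef.h>

/* Simplified container_of: step back from a member to its enclosing struct. */
#define sketch_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct sketch_timer {
	int armed;
};

struct sketch_data {
	int id;
	struct sketch_timer idle_timer;	/* embedded, like idle_slice_timer */
};

/* Given only the timer, return the structure that embeds it. */
static struct sketch_data *sketch_from_timer(struct sketch_timer *t)
{
	return sketch_container_of(t, struct sketch_data, idle_timer);
}
```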
  
  static void __bfq_put_async_bfqq(struct bfq_data *bfqd,
  				 struct bfq_queue **bfqq_ptr)
  {
  	struct bfq_queue *bfqq = *bfqq_ptr;
  
  	bfq_log(bfqd, "put_async_bfqq: %p", bfqq);
  	if (bfqq) {
  		bfq_bfqq_move(bfqd, bfqq, bfqd->root_group);
  		bfq_log_bfqq(bfqd, bfqq, "put_async_bfqq: putting %p, %d",
  			     bfqq, bfqq->ref);
  		bfq_put_queue(bfqq);
  		*bfqq_ptr = NULL;
  	}
  }
  
  /*
   * Release all the bfqg references to its async queues.  If we are
   * deallocating the group these queues may still contain requests, so
   * we reparent them to the root cgroup (i.e., the only one that will
   * exist for sure until all the requests on a device are gone).
   */
  void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg)
  {
  	int i, j;
  
  	for (i = 0; i < 2; i++)
  		for (j = 0; j < IOPRIO_BE_NR; j++)
  			__bfq_put_async_bfqq(bfqd, &bfqg->async_bfqq[i][j]);

  	__bfq_put_async_bfqq(bfqd, &bfqg->async_idle_bfqq);
  }
  /*
   * See the comments on bfq_limit_depth for the purpose of
   * the depths set in the function. Return minimum shallow depth we'll use.
   */
  static unsigned int bfq_update_depths(struct bfq_data *bfqd,
  				      struct sbitmap_queue *bt)
  {
  	unsigned int i, j, min_shallow = UINT_MAX;
  	/*
  	 * In-word depths if no bfq_queue is being weight-raised:
  	 * leaving 25% of tags only for sync reads.
  	 *
  	 * In the next formulas, right-shift the value
  	 * (1U<<bt->sb.shift), instead of computing directly
  	 * (1U<<(bt->sb.shift - something)), to be robust against
  	 * any possible value of bt->sb.shift, without having to
  	 * limit 'something'.
  	 */
  	/* no more than 50% of tags for async I/O */
  	bfqd->word_depths[0][0] = max(bt->sb.depth >> 1, 1U);
  	/*
  	 * no more than 75% of tags for sync writes (25% extra tags
  	 * w.r.t. async I/O, to prevent async I/O from starving sync
  	 * writes)
  	 */
  	bfqd->word_depths[0][1] = max((bt->sb.depth * 3) >> 2, 1U);
  
  	/*
  	 * In-word depths in case some bfq_queue is being weight-
  	 * raised: leaving ~63% of tags for sync reads. This is the
  	 * highest percentage for which, in our tests, application
  	 * start-up times didn't suffer from any regression due to tag
  	 * shortage.
  	 */
  	/* no more than ~18% of tags for async I/O */
  	bfqd->word_depths[1][0] = max((bt->sb.depth * 3) >> 4, 1U);
  	/* no more than ~37% of tags for sync writes (~20% extra tags) */
  	bfqd->word_depths[1][1] = max((bt->sb.depth * 6) >> 4, 1U);
  
  	for (i = 0; i < 2; i++)
  		for (j = 0; j < 2; j++)
  			min_shallow = min(min_shallow, bfqd->word_depths[i][j]);
  
  	return min_shallow;
  }
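
The depth arithmetic of bfq_update_depths can be checked in user space. The sketch below assumes a plain unsigned tag depth in place of bt->sb.depth; the function and helper names are hypothetical, and only the shifts and the lower bound of 1 are taken from the function above.

```c
#include <limits.h>

/* Local helpers standing in for the kernel's max() and min(). */
static unsigned int sketch_max(unsigned int a, unsigned int b)
{
	return a > b ? a : b;
}

static unsigned int sketch_min(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Fill out[wr][sync] with the same fractions of depth used above and
 * return the smallest entry. wr = 1 models "some queue is weight-raised".
 */
static unsigned int sketch_update_depths(unsigned int depth,
					 unsigned int out[2][2])
{
	unsigned int i, j, min_shallow = UINT_MAX;

	out[0][0] = sketch_max(depth >> 1, 1U);		/* 50% for async I/O */
	out[0][1] = sketch_max((depth * 3) >> 2, 1U);	/* 75% for sync writes */
	out[1][0] = sketch_max((depth * 3) >> 4, 1U);	/* ~18% for async I/O */
	out[1][1] = sketch_max((depth * 6) >> 4, 1U);	/* ~37% for sync writes */

	for (i = 0; i < 2; i++)
		for (j = 0; j < 2; j++)
			min_shallow = sketch_min(min_shallow, out[i][j]);

	return min_shallow;
}
```

With a depth of 64 the four limits come out as 32, 48, 12 and 24, so the minimum shallow depth is 12. With a depth of 1 every shift yields 0 and the lower bound of 1 applies to all four entries.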
  static void bfq_depth_updated(struct blk_mq_hw_ctx *hctx)
  {
  	struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
  	struct blk_mq_tags *tags = hctx->sched_tags;
  	unsigned int min_shallow;

  	min_shallow = bfq_update_depths(bfqd, tags->bitmap_tags);
  	sbitmap_queue_min_shallow_depth(tags->bitmap_tags, min_shallow);
  }
  
  static int bfq_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int index)
  {
  	bfq_depth_updated(hctx);
  	return 0;
  }
  static void bfq_exit_queue(struct elevator_queue *e)
  {
  	struct bfq_data *bfqd = e->elevator_data;
  	struct bfq_queue *bfqq, *n;
  
  	hrtimer_cancel(&bfqd->idle_slice_timer);
  
  	spin_lock_irq(&bfqd->lock);
  	list_for_each_entry_safe(bfqq, n, &bfqd->idle_list, bfqq_list)
  		bfq_deactivate_bfqq(bfqd, bfqq, false, false);
  	spin_unlock_irq(&bfqd->lock);
  
  	hrtimer_cancel(&bfqd->idle_slice_timer);
  	/* release oom-queue reference to root group */
  	bfqg_and_blkg_put(bfqd->root_group);
  #ifdef CONFIG_BFQ_GROUP_IOSCHED
  	blkcg_deactivate_policy(bfqd->queue, &blkcg_policy_bfq);
  #else
  	spin_lock_irq(&bfqd->lock);
  	bfq_put_async_queues(bfqd, bfqd->root_group);
  	kfree(bfqd->root_group);
  	spin_unlock_irq(&bfqd->lock);
  #endif
  	kfree(bfqd);
  }
  static void bfq_init_root_group(struct bfq_group *root_group,
  				struct bfq_data *bfqd)
  {
  	int i;
  
  #ifdef CONFIG_BFQ_GROUP_IOSCHED
  	root_group->entity.parent = NULL;
  	root_group->my_entity = NULL;
  	root_group->bfqd = bfqd;
  #endif
  	root_group->rq_pos_tree = RB_ROOT;
  	for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)
  		root_group->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
  	root_group->sched_data.bfq_class_idle_last_service = jiffies;
  }
  static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
  {
  	struct bfq_data *bfqd;
  	struct elevator_queue *eq;
  
  	eq = elevator_alloc(q, e);
  	if (!eq)
  		return -ENOMEM;
  
  	bfqd = kzalloc_node(sizeof(*bfqd), GFP_KERNEL, q->node);
  	if (!bfqd) {
  		kobject_put(&eq->kobj);
  		return -ENOMEM;
  	}
  	eq->elevator_data = bfqd;
  	spin_lock_irq(&q->queue_lock);
  	q->elevator = eq;
  	spin_unlock_irq(&q->queue_lock);

  	/*
  	 * Our fallback bfqq if bfq_find_alloc_queue() runs into OOM issues.
  	 * Grab a permanent reference to it, so that the normal code flow
  	 * will not attempt to free it.
  	 */
  	bfq_init_bfqq(bfqd, &bfqd->oom_bfqq, NULL, 1, 0);
  	bfqd->oom_bfqq.ref++;
  	bfqd->oom_bfqq.new_ioprio = BFQ_DEFAULT_QUEUE_IOPRIO;
  	bfqd->oom_bfqq.new_ioprio_class = IOPRIO_CLASS_BE;
  	bfqd->oom_bfqq.entity.new_weight =
  		bfq_ioprio_to_weight(bfqd->oom_bfqq.new_ioprio);
  
  	/* oom_bfqq does not participate in bursts */
  	bfq_clear_bfqq_just_created(&bfqd->oom_bfqq);
  	/*
  	 * Trigger weight initialization, according to ioprio, at the
  	 * oom_bfqq's first activation. The oom_bfqq's ioprio and ioprio
  	 * class won't be changed any more.
  	 */
  	bfqd->oom_bfqq.entity.prio_changed = 1;
  
  	bfqd->queue = q;
  	INIT_LIST_HEAD(&bfqd->dispatch);
  
  	hrtimer_init(&bfqd->idle_slice_timer, CLOCK_MONOTONIC,
  		     HRTIMER_MODE_REL);
  	bfqd->idle_slice_timer.function = bfq_idle_slice_timer;
  	bfqd->queue_weights_tree = RB_ROOT_CACHED;
  	bfqd->num_groups_with_pending_reqs = 0;

  	INIT_LIST_HEAD(&bfqd->active_list);
  	INIT_LIST_HEAD(&bfqd->idle_list);
  	INIT_HLIST_HEAD(&bfqd->burst_list);
  
  	bfqd->hw_tag = -1;
  	bfqd->nonrot_with_queueing = blk_queue_nonrot(bfqd->queue);
  
  	bfqd->bfq_max_budget = bfq_default_max_budget;
  
  	bfqd->bfq_fifo_expire[0] = bfq_fifo_expire[0];
  	bfqd->bfq_fifo_expire[1] = bfq_fifo_expire[1];
  	bfqd->bfq_back_max = bfq_back_max;
  	bfqd->bfq_back_penalty = bfq_back_penalty;
  	bfqd->bfq_slice_idle = bfq_slice_idle;
  	bfqd->bfq_timeout = bfq_timeout;
  
  	bfqd->bfq_requests_within_timer = 120;
  	bfqd->bfq_large_burst_thresh = 8;
  	bfqd->bfq_burst_interval = msecs_to_jiffies(180);
  	bfqd->low_latency = true;
  
  	/*
  	 * Trade-off between responsiveness and fairness.
  	 */
  	bfqd->bfq_wr_coeff = 30;
  	bfqd->bfq_wr_rt_max_time = msecs_to_jiffies(300);
  	bfqd->bfq_wr_max_time = 0;
  	bfqd->bfq_wr_min_idle_time = msecs_to_jiffies(2000);
  	bfqd->bfq_wr_min_inter_arr_async = msecs_to_jiffies(500);
  	bfqd->bfq_wr_max_softrt_rate = 7000; /*
  					      * Approximate rate required
  					      * to playback or record a
  					      * high-definition compressed
  					      * video.
  					      */
  	bfqd->wr_busy_queues = 0;
  
  	/*
  	 * Begin by assuming, optimistically, that the device peak
  	 * rate is equal to 2/3 of the highest reference rate.
  	 */
  	bfqd->rate_dur_prod = ref_rate[blk_queue_nonrot(bfqd->queue)] *
  		ref_wr_duration[blk_queue_nonrot(bfqd->queue)];
  	bfqd->peak_rate = ref_rate[blk_queue_nonrot(bfqd->queue)] * 2 / 3;

  	spin_lock_init(&bfqd->lock);

  	/*
  	 * The invocation of the next bfq_create_group_hierarchy
  	 * function is the head of a chain of function calls
  	 * (bfq_create_group_hierarchy->blkcg_activate_policy->
  	 * blk_mq_freeze_queue) that may lead to the invocation of the
  	 * has_work hook function. For this reason,
  	 * bfq_create_group_hierarchy is invoked only after all
  	 * scheduler data has been initialized, apart from the fields
  	 * that can be initialized only after invoking
  	 * bfq_create_group_hierarchy. This, in particular, enables
  	 * has_work to correctly return false. Of course, to avoid
  	 * other inconsistencies, the blk-mq stack must then refrain
  	 * from invoking further scheduler hooks before this init
  	 * function is finished.
  	 */
  	bfqd->root_group = bfq_create_group_hierarchy(bfqd, q->node);
  	if (!bfqd->root_group)
  		goto out_free;
  	bfq_init_root_group(bfqd->root_group, bfqd);
  	bfq_init_entity(&bfqd->oom_bfqq.entity, bfqd->root_group);
  	wbt_disable_default(q);
  	return 0;
  
  out_free:
  	kfree(bfqd);
  	kobject_put(&eq->kobj);
  	return -ENOMEM;
  }
  
  static void bfq_slab_kill(void)
  {
  	kmem_cache_destroy(bfq_pool);
  }
  
  static int __init bfq_slab_setup(void)
  {
  	bfq_pool = KMEM_CACHE(bfq_queue, 0);
  	if (!bfq_pool)
  		return -ENOMEM;
  	return 0;
  }
  
  static ssize_t bfq_var_show(unsigned int var, char *page)
  {
  	return sprintf(page, "%u\n", var);
  }
  static int bfq_var_store(unsigned long *var, const char *page)
  {
  	unsigned long new_val;
  	int ret = kstrtoul(page, 10, &new_val);
  	if (ret)
  		return ret;
  	*var = new_val;
  	return 0;
  }
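
bfq_var_store leaves *var untouched when parsing fails. The user-space sketch below shows the same contract, with strtoul() as an approximate substitute for kstrtoul(). The name `sketch_var_store` is hypothetical, and the sketch is stricter than the kernel helper in one respect: it is assumed here that input must start with a digit, so a leading sign is rejected.

```c
#include <errno.h>
#include <stdlib.h>

/* Parse a decimal value; write *var only on success, like bfq_var_store. */
static int sketch_var_store(unsigned long *var, const char *page)
{
	char *end;
	unsigned long val;

	/* Sketch-only restriction: refuse signs and leading whitespace. */
	if (page[0] < '0' || page[0] > '9')
		return -EINVAL;

	errno = 0;
	val = strtoul(page, &end, 10);
	if (errno == ERANGE)
		return -ERANGE;

	/* A single trailing newline is tolerated, as with sysfs writes. */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;

	*var = val;
	return 0;
}
```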
  
  #define SHOW_FUNCTION(__FUNC, __VAR, __CONV)				\
  static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
  {									\
  	struct bfq_data *bfqd = e->elevator_data;			\
  	u64 __data = __VAR;						\
  	if (__CONV == 1)						\
  		__data = jiffies_to_msecs(__data);			\
  	else if (__CONV == 2)						\
  		__data = div_u64(__data, NSEC_PER_MSEC);		\
  	return bfq_var_show(__data, (page));				\
  }
  SHOW_FUNCTION(bfq_fifo_expire_sync_show, bfqd->bfq_fifo_expire[1], 2);
  SHOW_FUNCTION(bfq_fifo_expire_async_show, bfqd->bfq_fifo_expire[0], 2);
  SHOW_FUNCTION(bfq_back_seek_max_show, bfqd->bfq_back_max, 0);
  SHOW_FUNCTION(bfq_back_seek_penalty_show, bfqd->bfq_back_penalty, 0);
  SHOW_FUNCTION(bfq_slice_idle_show, bfqd->bfq_slice_idle, 2);
  SHOW_FUNCTION(bfq_max_budget_show, bfqd->bfq_user_max_budget, 0);
  SHOW_FUNCTION(bfq_timeout_sync_show, bfqd->bfq_timeout, 1);
  SHOW_FUNCTION(bfq_strict_guarantees_show, bfqd->strict_guarantees, 0);
  SHOW_FUNCTION(bfq_low_latency_show, bfqd->low_latency, 0);
  #undef SHOW_FUNCTION
  
  #define USEC_SHOW_FUNCTION(__FUNC, __VAR)				\
  static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
  {									\
  	struct bfq_data *bfqd = e->elevator_data;			\
  	u64 __data = __VAR;						\
  	__data = div_u64(__data, NSEC_PER_USEC);			\
  	return bfq_var_show(__data, (page));				\
  }
  USEC_SHOW_FUNCTION(bfq_slice_idle_us_show, bfqd->bfq_slice_idle);
  #undef USEC_SHOW_FUNCTION
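
The two show macros above divide a nanosecond value by NSEC_PER_MSEC or NSEC_PER_USEC. A small sketch of that integer arithmetic follows. The constants and function names are local stand-ins, and div_u64() is replaced by plain 64-bit division.

```c
#include <stdint.h>

#define SKETCH_NSEC_PER_MSEC 1000000ULL
#define SKETCH_NSEC_PER_USEC 1000ULL

/* Value reported by a millisecond attribute such as slice_idle. */
static uint64_t sketch_ns_to_ms(uint64_t ns)
{
	return ns / SKETCH_NSEC_PER_MSEC;
}

/* Value reported by a microsecond attribute such as slice_idle_us. */
static uint64_t sketch_ns_to_us(uint64_t ns)
{
	return ns / SKETCH_NSEC_PER_USEC;
}
```

Because the division truncates, a stored value of 500000 ns reads back as 0 through the millisecond view and as 500 through the microsecond view.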
  
  #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
  static ssize_t								\
  __FUNC(struct elevator_queue *e, const char *page, size_t count)	\
  {									\
  	struct bfq_data *bfqd = e->elevator_data;			\
  	unsigned long __data, __min = (MIN), __max = (MAX);		\
  	int ret;							\
  									\
  	ret = bfq_var_store(&__data, (page));				\
  	if (ret)							\
  		return ret;						\
  	if (__data < __min)						\
  		__data = __min;						\
  	else if (__data > __max)					\
  		__data = __max;						\
  	if (__CONV == 1)						\
  		*(__PTR) = msecs_to_jiffies(__data);			\
  	else if (__CONV == 2)						\
  		*(__PTR) = (u64)__data * NSEC_PER_MSEC;			\
  	else								\
  		*(__PTR) = __data;					\
  	return count;							\
  }
  STORE_FUNCTION(bfq_fifo_expire_sync_store, &bfqd->bfq_fifo_expire[1], 1,
  		INT_MAX, 2);
  STORE_FUNCTION(bfq_fifo_expire_async_store, &bfqd->bfq_fifo_expire[0], 1,
  		INT_MAX, 2);
  STORE_FUNCTION(bfq_back_seek_max_store, &bfqd->bfq_back_max, 0, INT_MAX, 0);
  STORE_FUNCTION(bfq_back_seek_penalty_store, &bfqd->bfq_back_penalty, 1,
  		INT_MAX, 0);
  STORE_FUNCTION(bfq_slice_idle_store, &bfqd->bfq_slice_idle, 0, INT_MAX, 2);
  #undef STORE_FUNCTION
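
STORE_FUNCTION clamps the parsed value to [MIN, MAX] before converting it. The sketch below covers only the __CONV == 2 branch (milliseconds to nanoseconds); the function name and the constant are hypothetical stand-ins.

```c
#include <limits.h>
#include <stdint.h>

#define SKETCH_NSEC_PER_MSEC 1000000ULL

/* Clamp data to [min, max], then convert milliseconds to nanoseconds. */
static uint64_t sketch_store_ms(unsigned long data, unsigned long min,
				unsigned long max)
{
	if (data < min)
		data = min;
	else if (data > max)
		data = max;

	return (uint64_t)data * SKETCH_NSEC_PER_MSEC;
}
```

With the bounds used for fifo_expire_sync (1 and INT_MAX), an input of 0 is raised to 1 ms and stored as 1000000 ns.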
  
  #define USEC_STORE_FUNCTION(__FUNC, __PTR, MIN, MAX)			\
  static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)\
  {									\
  	struct bfq_data *bfqd = e->elevator_data;			\
  	unsigned long __data, __min = (MIN), __max = (MAX);		\
  	int ret;							\
  									\
  	ret = bfq_var_store(&__data, (page));				\
  	if (ret)							\
  		return ret;						\
  	if (__data < __min)						\
  		__data = __min;						\
  	else if (__data > __max)					\
  		__data = __max;						\
  	*(__PTR) = (u64)__data * NSEC_PER_USEC;				\
  	return count;							\
  }
  USEC_STORE_FUNCTION(bfq_slice_idle_us_store, &bfqd->bfq_slice_idle, 0,
  		    UINT_MAX);
  #undef USEC_STORE_FUNCTION
  static ssize_t bfq_max_budget_store(struct elevator_queue *e,
  				    const char *page, size_t count)
  {
  	struct bfq_data *bfqd = e->elevator_data;
  	unsigned long __data;
  	int ret;

  	ret = bfq_var_store(&__data, (page));
  	if (ret)
  		return ret;
  
  	if (__data == 0)
  		bfqd->bfq_max_budget = bfq_calc_max_budget(bfqd);
  	else {
  		if (__data > INT_MAX)
  			__data = INT_MAX;
  		bfqd->bfq_max_budget = __data;
  	}
  
  	bfqd->bfq_user_max_budget = __data;
  	return count;
  }
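
In bfq_max_budget_store a written value of 0 selects the automatically computed budget, while any other value is used directly after capping at INT_MAX. A minimal sketch of that rule follows. The struct and function names are hypothetical, and `auto_budget` is a stand-in for the result of bfq_calc_max_budget().

```c
#include <limits.h>

/* The two fields the store function updates. */
struct sketch_budget {
	int max_budget;		/* budget actually in effect */
	int user_max_budget;	/* what the user asked for, 0 = automatic */
};

static void sketch_store_max_budget(struct sketch_budget *b,
				    unsigned long data, int auto_budget)
{
	if (data == 0) {
		b->max_budget = auto_budget;
	} else {
		if (data > INT_MAX)
			data = INT_MAX;
		b->max_budget = (int)data;
	}

	/* Remember the request so that later code can tell 0 from a fixed value. */
	b->user_max_budget = (int)data;
}
```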
  
  /*
   * Leaving this name to preserve name compatibility with cfq
   * parameters, but this timeout is used for both sync and async.
   */
  static ssize_t bfq_timeout_sync_store(struct elevator_queue *e,
  				      const char *page, size_t count)
  {
  	struct bfq_data *bfqd = e->elevator_data;
  	unsigned long __data;
  	int ret;

  	ret = bfq_var_store(&__data, (page));
  	if (ret)
  		return ret;
  
  	if (__data < 1)
  		__data = 1;
  	else if (__data > INT_MAX)
  		__data = INT_MAX;
  
  	bfqd->bfq_timeout = msecs_to_jiffies(__data);
  	if (bfqd->bfq_user_max_budget == 0)
  		bfqd->bfq_max_budget = bfq_calc_max_budget(bfqd);

  	return count;
  }
  
  static ssize_t bfq_strict_guarantees_store(struct elevator_queue *e,
  				     const char *page, size_t count)
  {
  	struct bfq_data *bfqd = e->elevator_data;
  	unsigned long __data;
  	int ret;

  	ret = bfq_var_store(&__data, (page));
  	if (ret)
  		return ret;
  
  	if (__data > 1)
  		__data = 1;
  	if (!bfqd->strict_guarantees && __data == 1
  	    && bfqd->bfq_slice_idle < 8 * NSEC_PER_MSEC)
  		bfqd->bfq_slice_idle = 8 * NSEC_PER_MSEC;
  
  	bfqd->strict_guarantees = __data;
  	return count;
  }
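
Turning strict_guarantees on raises slice_idle to at least 8 ms, and turning it off again does not lower it. The sketch below models that one-way adjustment; the struct, function and constant names are hypothetical.

```c
#include <stdint.h>

#define SKETCH_NSEC_PER_MSEC 1000000ULL

struct sketch_tunables {
	uint64_t slice_idle_ns;
	unsigned int strict_guarantees;
};

static void sketch_store_strict(struct sketch_tunables *t, unsigned long data)
{
	/* Any non-zero input is treated as 1. */
	if (data > 1)
		data = 1;

	/* Only the off-to-on transition touches slice_idle. */
	if (!t->strict_guarantees && data == 1 &&
	    t->slice_idle_ns < 8 * SKETCH_NSEC_PER_MSEC)
		t->slice_idle_ns = 8 * SKETCH_NSEC_PER_MSEC;

	t->strict_guarantees = (unsigned int)data;
}
```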
  static ssize_t bfq_low_latency_store(struct elevator_queue *e,
  				     const char *page, size_t count)
  {
  	struct bfq_data *bfqd = e->elevator_data;
  	unsigned long __data;
  	int ret;

  	ret = bfq_var_store(&__data, (page));
  	if (ret)
  		return ret;
  
  	if (__data > 1)
  		__data = 1;
  	if (__data == 0 && bfqd->low_latency != 0)
  		bfq_end_wr(bfqd);
  	bfqd->low_latency = __data;
  	return count;
  }
  #define BFQ_ATTR(name) \
  	__ATTR(name, 0644, bfq_##name##_show, bfq_##name##_store)
  
  static struct elv_fs_entry bfq_attrs[] = {
  	BFQ_ATTR(fifo_expire_sync),
  	BFQ_ATTR(fifo_expire_async),
  	BFQ_ATTR(back_seek_max),
  	BFQ_ATTR(back_seek_penalty),
  	BFQ_ATTR(slice_idle),
  	BFQ_ATTR(slice_idle_us),
  	BFQ_ATTR(max_budget),
  	BFQ_ATTR(timeout_sync),
  	BFQ_ATTR(strict_guarantees),
  	BFQ_ATTR(low_latency),
  	__ATTR_NULL
  };
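
BFQ_ATTR relies on preprocessor token pasting to derive the show and store function names from the attribute name. The user-space sketch below illustrates the same idea with a simplified, hypothetical attribute type that has only a name and a show callback. It does not reproduce the kernel's __ATTR macro.

```c
#include <stddef.h>
#include <string.h>

/* Simplified attribute: a name plus one callback. */
struct sketch_attr {
	const char *name;
	int (*show)(void);
};

static int sketch_alpha_show(void) { return 1; }
static int sketch_beta_show(void) { return 2; }

/* #n stringifies the name; sketch_##n##_show pastes the callback name. */
#define SKETCH_ATTR(n) { #n, sketch_##n##_show }

static const struct sketch_attr sketch_attrs[] = {
	SKETCH_ATTR(alpha),
	SKETCH_ATTR(beta),
	{ NULL, NULL }		/* terminator, in the role of __ATTR_NULL */
};

/* Walk the table until the terminator, as a sysfs-style lookup would. */
static const struct sketch_attr *sketch_find(const char *name)
{
	const struct sketch_attr *a;

	for (a = sketch_attrs; a->name; a++)
		if (strcmp(a->name, name) == 0)
			return a;
	return NULL;
}
```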
  
  static struct elevator_type iosched_bfq_mq = {
  	.ops = {
  		.limit_depth		= bfq_limit_depth,
  		.prepare_request	= bfq_prepare_request,
  		.requeue_request        = bfq_finish_requeue_request,
  		.finish_request		= bfq_finish_requeue_request,
  		.exit_icq		= bfq_exit_icq,
  		.insert_requests	= bfq_insert_requests,
  		.dispatch_request	= bfq_dispatch_request,
  		.next_request		= elv_rb_latter_request,
  		.former_request		= elv_rb_former_request,
  		.allow_merge		= bfq_allow_bio_merge,
  		.bio_merge		= bfq_bio_merge,
  		.request_merge		= bfq_request_merge,
  		.requests_merged	= bfq_requests_merged,
  		.request_merged		= bfq_request_merged,
  		.has_work		= bfq_has_work,
  		.depth_updated		= bfq_depth_updated,
  		.init_hctx		= bfq_init_hctx,
  		.init_sched		= bfq_init_queue,
  		.exit_sched		= bfq_exit_queue,
  	},
  	.icq_size =		sizeof(struct bfq_io_cq),
  	.icq_align =		__alignof__(struct bfq_io_cq),
  	.elevator_attrs =	bfq_attrs,
  	.elevator_name =	"bfq",
  	.elevator_owner =	THIS_MODULE,
  };
  MODULE_ALIAS("bfq-iosched");
  
  static int __init bfq_init(void)
  {
  	int ret;
  #ifdef CONFIG_BFQ_GROUP_IOSCHED
  	ret = blkcg_policy_register(&blkcg_policy_bfq);
  	if (ret)
  		return ret;
  #endif
  	ret = -ENOMEM;
  	if (bfq_slab_setup())
  		goto err_pol_unreg;
  	/*
  	 * Times to load large popular applications for the typical
  	 * systems installed on the reference devices (see the
  	 * comments before the definition of the next
  	 * array). Actually, we use slightly lower values, as the
  	 * estimated peak rate tends to be smaller than the actual
  	 * peak rate.  The reason for this last fact is that estimates
  	 * are computed over much shorter time intervals than the long
  	 * intervals typically used for benchmarking. Why? First, to
  	 * adapt more quickly to variations. Second, because an I/O
  	 * scheduler cannot rely on a peak-rate-evaluation workload to
  	 * be run for a long time.
  	 */
  	ref_wr_duration[0] = msecs_to_jiffies(7000); /* actually 8 sec */
  	ref_wr_duration[1] = msecs_to_jiffies(2500); /* actually 3 sec */

  	ret = elv_register(&iosched_bfq_mq);
  	if (ret)
  		goto slab_kill;
  
  	return 0;
  slab_kill:
  	bfq_slab_kill();
  err_pol_unreg:
  #ifdef CONFIG_BFQ_GROUP_IOSCHED
  	blkcg_policy_unregister(&blkcg_policy_bfq);
  #endif
  	return ret;
  }
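
bfq_init unwinds partial initialization through stacked goto labels, so each failure point undoes only what has already succeeded. The sketch below models the three steps with booleans. The struct, the `fail_at` parameter and the literal error values are hypothetical, chosen only to make the control flow testable.

```c
#include <stdbool.h>

/* Which of the three resources are currently set up. */
struct sketch_state {
	bool policy;	/* models blkcg_policy_register() */
	bool slab;	/* models bfq_slab_setup() */
	bool elevator;	/* models elv_register() */
};

/* fail_at selects the step that fails (1..3); 0 means every step succeeds. */
static int sketch_init(struct sketch_state *s, int fail_at)
{
	int ret;

	if (fail_at == 1)
		return -1;		/* nothing to undo yet */
	s->policy = true;

	ret = -12;			/* models -ENOMEM */
	if (fail_at == 2)
		goto err_pol_unreg;
	s->slab = true;

	ret = (fail_at == 3) ? -16 : 0;	/* arbitrary error for the last step */
	if (ret)
		goto slab_kill;
	s->elevator = true;

	return 0;

slab_kill:
	s->slab = false;		/* falls through to the next label */
err_pol_unreg:
	s->policy = false;
	return ret;
}
```

The labels appear in reverse order of acquisition, so a jump to `slab_kill` falls through `err_pol_unreg` and releases both resources.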
  
  static void __exit bfq_exit(void)
  {
  	elv_unregister(&iosched_bfq_mq);
  #ifdef CONFIG_BFQ_GROUP_IOSCHED
  	blkcg_policy_unregister(&blkcg_policy_bfq);
  #endif
  	bfq_slab_kill();
  }
  
  module_init(bfq_init);
  module_exit(bfq_exit);
  
  MODULE_AUTHOR("Paolo Valente");
  MODULE_LICENSE("GPL");
  MODULE_DESCRIPTION("MQ Budget Fair Queueing I/O Scheduler");