pkt_sched: sch_qfq: remove a source of high packet delay/jitter

QFQ+ inherits from QFQ a design choice that may cause high packet delay/jitter and severe short-term unfairness. Like QFQ, QFQ+ uses a special quantity, the system virtual time, to track the service provided by the ideal system it approximates. When a packet is dequeued, this quantity must be incremented by the size of the packet, divided by the sum of the weights of the aggregates waiting to be served. Tracking this sum correctly is a non-trivial task because, to preserve tight service guarantees, the decrement of this sum must be delayed in a special way [1]: the sum can be decremented only after its value would also have decreased in the ideal system approximated by QFQ+.

For efficiency, QFQ+ keeps track only of the 'instantaneous' weight sum, which is increased and decreased immediately as the weight of an aggregate changes, and as an aggregate is created or destroyed (which, in turn, happens as a consequence of some class being created, destroyed or changed). However, to avoid the problems that these immediate decreases cause to service guarantees, QFQ+ increments the system virtual time using the maximum value allowed for the weight sum, 2^10, in place of the dynamic, instantaneous value. The instantaneous value of the weight sum is used only to check whether a weight increase or a class creation can be accepted.

Unfortunately, the problems caused by this choice are worse than the temporary degradation of the service guarantees that may occur, when a class is changed or destroyed, if the instantaneous value of the weight sum is used to update the system virtual time. In fact, the fraction of the link bandwidth guaranteed by QFQ+ to each aggregate is equal to the ratio between the weight of the aggregate and the sum of the weights of the competing aggregates. The packet delay guaranteed to the aggregate is, in turn, inversely proportional to the guaranteed bandwidth. By using the maximum possible value, and not the actual value of the weight sum, QFQ+ provides each aggregate with the worst possible service guarantees, and not with service guarantees related to the actual set of competing aggregates.

To see the consequences of this fact, consider the following simple example. Suppose that only the following aggregates are backlogged, i.e., that only the classes in the following aggregates have packets to transmit: one aggregate with weight 10, say A, and ten aggregates with weight 1, say B1, B2, ..., B10. In particular, suppose that these aggregates are always backlogged. Given the weight distribution, the smoothest and fairest service order would be:

    A B1 A B2 A B3 A B4 A B5 A B6 A B7 A B8 A B9 A B10 A B1 A B2 ...

QFQ+ would provide exactly this optimal service if it used the actual value of the weight sum instead of the maximum possible value, i.e., 20 (10 + 10 * 1) instead of 2^10. In contrast, since QFQ+ uses the latter value, it serves aggregates as follows (easy to prove and to reproduce experimentally):

    A B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 A A A A A A A A A A B1 B2 ... B10 A A ...

By replacing 10 with N in the above example, and by increasing N, one can increase at will the maximum packet delay and the jitter experienced by the classes in aggregate A.

This patch addresses the issue by using the 'instantaneous' value of the weight sum, instead of the maximum possible value, when updating the system virtual time. After the instantaneous weight sum is decreased, QFQ+ may deviate from the ideal service for a time interval in the order of the time needed to serve one maximum-size packet for each backlogged class. The worst-case extent of the deviation exhibited by QFQ+ during this time interval [1] is basically the same as that of the deviation described above (but, without this patch, QFQ+ suffers from such a deviation all the time).

Finally, this patch modifies the comment to the function qfq_slot_insert, to make it consistent with the fact that the weight sum used by QFQ+ can now be lower than the maximum possible value.

[1] P. Valente, "Extending WF2Q+ to support a dynamic traffic mix", Proceedings of AAA-IDEA'05, June 2005.

Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: David S. Miller <davem@davemloft.net>
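The two service orders in the example can be reproduced with a small user-space model. The sketch below is not the QFQ+ code: it is a simplified WF2Q+-style scheduler written for illustration, assuming equal-size packets, always-backlogged aggregates, exact (ungrouped) timestamps, and ties broken by lowest index. The names demo_agg and demo_service_order are hypothetical; only FRAC_BITS and ONE_FP mirror sch_qfq.c. The weights in the example sum to 10 + 10 * 1 = 20, and 2^10 is the maximum weight sum quoted in the message.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define FRAC_BITS 30                      /* same fixed point as sch_qfq.c */
#define ONE_FP    ((uint64_t)1 << FRAC_BITS)
#define NAGG      11                      /* A plus B1..B10 */

struct demo_agg {                         /* hypothetical, not the kernel struct */
	const char *name;
	uint64_t inv_w;                   /* ONE_FP / weight */
	uint64_t S, F;                    /* virtual start and finish times */
};

/*
 * Serve npkts equal-size packets from always-backlogged aggregates and
 * write the service order into out. wsum_used is the weight sum used to
 * advance the system virtual time V.
 */
static void demo_service_order(uint32_t wsum_used, int npkts,
			       char *out, size_t outlen)
{
	static const char *names[NAGG] = { "A", "B1", "B2", "B3", "B4", "B5",
					   "B6", "B7", "B8", "B9", "B10" };
	struct demo_agg agg[NAGG];
	const uint64_t len = 1000;        /* packet size, arbitrary */
	uint64_t iwsum = ONE_FP / wsum_used;
	uint64_t V = 0;
	int i, n;

	for (i = 0; i < NAGG; i++) {
		agg[i].name = names[i];
		agg[i].inv_w = ONE_FP / (i == 0 ? 10 : 1);
		agg[i].S = 0;
		agg[i].F = len * agg[i].inv_w;
	}
	out[0] = '\0';

	for (n = 0; n < npkts; n++) {
		uint64_t min_S = agg[0].S;
		int best = -1;

		/* If nobody is eligible, V jumps to the smallest start time. */
		for (i = 1; i < NAGG; i++)
			if (agg[i].S < min_S)
				min_S = agg[i].S;
		if (V < min_S)
			V = min_S;

		/* Among eligible aggregates (S <= V) pick the smallest F. */
		for (i = 0; i < NAGG; i++)
			if (agg[i].S <= V &&
			    (best < 0 || agg[i].F < agg[best].F))
				best = i;

		if (n)
			strncat(out, " ", outlen - strlen(out) - 1);
		strncat(out, agg[best].name, outlen - strlen(out) - 1);

		V += len * iwsum;         /* the update changed by the patch */
		agg[best].S = agg[best].F;
		agg[best].F = agg[best].S + len * agg[best].inv_w;
	}
}
```

Calling demo_service_order(20, 22, buf, sizeof(buf)) with a 256-byte buf yields the alternating order A B1 A B2 ... A B10 A B1, while demo_service_order(1 << 10, 32, buf, sizeof(buf)) yields A, then B1 to B10, then ten consecutive A, then B1 to B10 again. In this model the only difference between the two runs is the divisor used to advance V.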

Paolo Valente · David S. Miller
1 parent 093b9c71b6
Showing 1 changed file with 56 additions and 29 deletions
net/sched/sch_qfq.c
@@ -113,7 +113,6 @@
  
 #define FRAC_BITS		30	/* fixed point arithmetic */
 #define ONE_FP			(1UL << FRAC_BITS)
-#define IWSUM			(ONE_FP/QFQ_MAX_WSUM)
  
 #define QFQ_MTU_SHIFT		16	/* to support TSO/GSO */
 #define QFQ_MIN_LMAX		512	/* see qfq_slot_insert */
@@ -189,6 +188,7 @@
 	struct qfq_aggregate	*in_serv_agg;   /* Aggregate being served. */
 	u32			num_active_agg; /* Num. of active aggregates */
 	u32			wsum;		/* weight sum */
+	u32			iwsum;		/* inverse weight sum */
  
 	unsigned long bitmaps[QFQ_MAX_STATE];	    /* Group bitmaps. */
 	struct qfq_group groups[QFQ_MAX_INDEX + 1]; /* The groups. */
@@ -314,6 +314,7 @@
  
 	q->wsum +=
 		(int) agg->class_weight * (new_num_classes - agg->num_classes);
+	q->iwsum = ONE_FP / q->wsum;
  
 	agg->num_classes = new_num_classes;
 }
@@ -340,6 +341,10 @@
 {
 	if (!hlist_unhashed(&agg->nonfull_next))
 		hlist_del_init(&agg->nonfull_next);
+	q->wsum -= agg->class_weight;
+	if (q->wsum != 0)
+		q->iwsum = ONE_FP / q->wsum;
+
 	if (q->in_serv_agg == agg)
 		q->in_serv_agg = qfq_choose_next_agg(q);
 	kfree(agg);
@@ -834,38 +839,60 @@
 	}
 }
  
-
 /*
- * The index of the slot in which the aggregate is to be inserted must
- * not be higher than QFQ_MAX_SLOTS-2. There is a '-2' and not a '-1'
- * because the start time of the group may be moved backward by one
- * slot after the aggregate has been inserted, and this would cause
- * non-empty slots to be right-shifted by one position.
+ * The index of the slot in which the input aggregate agg is to be
+ * inserted must not be higher than QFQ_MAX_SLOTS-2. There is a '-2'
+ * and not a '-1' because the start time of the group may be moved
+ * backward by one slot after the aggregate has been inserted, and
+ * this would cause non-empty slots to be right-shifted by one
+ * position.
  *
- * If the weight and lmax (max_pkt_size) of the classes do not change,
- * then QFQ+ does meet the above contraint according to the current
- * values of its parameters. In fact, if the weight and lmax of the
- * classes do not change, then, from the theory, QFQ+ guarantees that
- * the slot index is never higher than
- * 2 + QFQ_MAX_AGG_CLASSES * ((1<<QFQ_MTU_SHIFT)/QFQ_MIN_LMAX) *
- * (QFQ_MAX_WEIGHT/QFQ_MAX_WSUM) = 2 + 8 * 128 * (1 / 64) = 18
+ * QFQ+ fully satisfies this bound to the slot index if the parameters
+ * of the classes are not changed dynamically, and if QFQ+ never
+ * happens to postpone the service of agg unjustly, i.e., it never
+ * happens that the aggregate becomes backlogged and eligible, or just
+ * eligible, while an aggregate with a higher approximated finish time
+ * is being served. In particular, in this case QFQ+ guarantees that
+ * the timestamps of agg are low enough that the slot index is never
+ * higher than 2. Unfortunately, QFQ+ cannot provide the same
+ * guarantee if it happens to unjustly postpone the service of agg, or
+ * if the parameters of some class are changed.
  *
- * When the weight of a class is increased or the lmax of the class is
- * decreased, a new aggregate with smaller slot size than the original
- * parent aggregate of the class may happen to be activated. The
- * activation of this aggregate should be properly delayed to when the
- * service of the class has finished in the ideal system tracked by
- * QFQ+. If the activation of the aggregate is not delayed to this
- * reference time instant, then this aggregate may be unjustly served
- * before other aggregates waiting for service. This may cause the
- * above bound to the slot index to be violated for some of these
- * unlucky aggregates.
+ * As for the first event, i.e., an out-of-order service, the
+ * upper bound to the slot index guaranteed by QFQ+ grows to
+ * 2 +
+ * QFQ_MAX_AGG_CLASSES * ((1<<QFQ_MTU_SHIFT)/QFQ_MIN_LMAX) *
+ * (current_max_weight/current_wsum) <= 2 + 8 * 128 * 1.
  *
+ * The following function deals with this problem by backward-shifting
+ * the timestamps of agg, if needed, so as to guarantee that the slot
+ * index is never higher than QFQ_MAX_SLOTS-2. This backward-shift may
+ * cause the service of other aggregates to be postponed, yet the
+ * worst-case guarantees of these aggregates are not violated.  In
+ * fact, in case of no out-of-order service, the timestamps of agg
+ * would have been even lower than they are after the backward shift,
+ * because QFQ+ would have guaranteed a maximum value equal to 2 for
+ * the slot index, and 2 < QFQ_MAX_SLOTS-2. Hence the aggregates whose
+ * service is postponed because of the backward-shift would have
+ * however waited for the service of agg before being served.
+ *
+ * The other event that may cause the slot index to be higher than 2
+ * for agg is a recent change of the parameters of some class. If the
+ * weight of a class is increased or the lmax (max_pkt_size) of the
+ * class is decreased, then a new aggregate with smaller slot size
+ * than the original parent aggregate of the class may happen to be
+ * activated. The activation of this aggregate should be properly
+ * delayed to when the service of the class has finished in the ideal
+ * system tracked by QFQ+. If the activation of the aggregate is not
+ * delayed to this reference time instant, then this aggregate may be
+ * unjustly served before other aggregates waiting for service. This
+ * may cause the above bound to the slot index to be violated for some
+ * of these unlucky aggregates.
+ *
  * Instead of delaying the activation of the new aggregate, which is
- * quite complex, the following inaccurate but simple solution is used:
- * if the slot index is higher than QFQ_MAX_SLOTS-2, then the
- * timestamps of the aggregate are shifted backward so as to let the
- * slot index become equal to QFQ_MAX_SLOTS-2.
+ * quite complex, the above-discussed capping of the slot index is
+ * used to handle also the consequences of a change of the parameters
+ * of a class.
  */
 static void qfq_slot_insert(struct qfq_group *grp, struct qfq_aggregate *agg,
 			    u64 roundedS)
@@ -1136,7 +1163,7 @@
 	else
 		in_serv_agg->budget -= len;
  
-	q->V += (u64)len * IWSUM;
+	q->V += (u64)len * q->iwsum;
 	pr_debug("qfq dequeue: len %u F %lld now %lld\n",
 		 len, (unsigned long long) in_serv_agg->F,
 		 (unsigned long long) q->V);
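
The arithmetic behind the hunks above can be tried in isolation. The following is a minimal user-space sketch, assuming a hypothetical struct demo_sched that keeps only the three fields involved; demo_change_wsum and demo_account are illustrative names, not kernel functions. It caches the inverse weight sum whenever the sum changes, skips the division when the sum drops to zero (as the qfq_destroy_agg hunk does), and advances V by len * iwsum.

```c
#include <stdint.h>

#define FRAC_BITS 30                      /* fixed point, as in sch_qfq.c */
#define ONE_FP    ((uint64_t)1 << FRAC_BITS)

struct demo_sched {          /* hypothetical stand-in for struct qfq_sched */
	uint32_t wsum;       /* instantaneous weight sum */
	uint32_t iwsum;      /* ONE_FP / wsum, cached */
	uint64_t V;          /* system virtual time */
};

/* Apply a signed change to the weight sum and refresh the cached inverse. */
static void demo_change_wsum(struct demo_sched *q, int32_t delta)
{
	q->wsum += delta;
	if (q->wsum != 0)    /* keep the old inverse when nothing is left */
		q->iwsum = (uint32_t)(ONE_FP / q->wsum);
}

/* Advance V for a dequeued packet of len bytes. */
static void demo_account(struct demo_sched *q, uint32_t len)
{
	q->V += (uint64_t)len * q->iwsum;
}
```

With a weight sum of 20, a 1500-byte packet advances V by 1500 * 53687091 = 80530636500; with the value 2^10 quoted in the commit message, the same packet advances V by only 1500 * 1048576 = 1572864000, roughly 51 times less. Caching the inverse keeps the division out of the dequeue path, at the cost of one division per weight-sum change.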