14 Nov, 2018

1 commit

  • [ Upstream commit cbeb869a3d1110450186b738199963c5e68c2a71 ]

    BFQ schedules entities (which represent either per-process queues or
    groups of queues) as a function of their timestamps. In particular, as
    a function of their (virtual) finish times. The finish time of an
    entity is computed as a function of the budget assigned to the entity,
    assuming, tentatively, that the entity, once in service, will receive
    an amount of service equal to its budget. Then, when the entity is
    expired because it finishes to be served, this finish time is updated
    as a function of the actual service received by the entity. This
    allows the entity to be correctly charged with only the service
    received, and then to be correctly re-scheduled.

    Yet an entity may receive service also while not being the entity in
    service (in the scheduling environment of its parent entity), for
    several reasons. If the entity remains with no backlog while receiving
    this 'unofficial' service, then it is expired. Also on such an
    expiration, the finish time of the entity should be updated to account
    for only the service actually received by the entity. Unfortunately,
    such an update is not performed for an entity expiring without being
    the entity in service.

    In a similar vein, the service counter of the entity in service is
    reset when the entity is expired, to be ready to be used for next
    service cycle. This reset too should be performed also in case an
    entity is expired because it remains empty after receiving service
    while not being the entity in service. But in this case the reset is
    not performed.

    This commit performs the above update of the finish time and reset of
    the service received, also for an entity expiring while not being the
    entity in service.

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Paolo Valente
     

31 Aug, 2017

3 commits

  • If the function bfq_update_next_in_service is invoked as a consequence
    of the activation or requeueing of an entity, say E, then it doesn't
    invoke bfq_lookup_next_entity to get the next-in-service entity. In
    contrast, it follows a shorter path: if E happens to be eligible (see
    commit "bfq-sq-mq: make lookup_next_entity push up vtime on
    expirations" for details on eligibility) and to have a lower virtual
    finish time than the current candidate as next-in-service entity, then
    E directly becomes the next-in-service entity. Unfortunately, there is
    a corner case for which this shorter path makes
    bfq_update_next_in_service choose a non eligible entity: it occurs if
    both E and the current next-in-service entity happen to be non
    eligible when bfq_update_next_in_service is invoked. In this case, E
    is not set as next-in-service, and, since bfq_lookup_next_entity is
    not invoked, the state of the parent entity is not updated so as to
    end up with an eligible entity as the proper next-in-service entity.

    In this respect, next-in-service is actually allowed to be non
    eligible while some queue is in service: since no system-virtual-time
    push-up can be performed in that case (see again commit "bfq-sq-mq:
    make lookup_next_entity push up vtime on expirations" for details),
    next-in-service is chosen, speculatively, as a function of the
    possible value that the system virtual time may get after a push
    up. But the correctness of the schedule breaks if next-in-service is
    still a non eligible entity when it is time to set in service the next
    entity. Unfortunately, this may happen in the above corner case.

    This commit fixes this problem by making bfq_update_next_in_service
    invoke bfq_lookup_next_entity not only if the above shorter path
    cannot be taken, but also if the shorter path is taken but fails to
    yield an eligible next-in-service entity.

    Signed-off-by: Paolo Valente
    Tested-by: Lee Tibbert
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • If the function bfq_update_next_in_service is invoked as a consequence
    of the activation or requeueing of an entity, say E, and finds out
    that E belongs to a higher-priority class than that of the current
    next-in-service entity, then it sets next_in_service directly to
    E. But this may lead to anomalous schedules, because E may happen not
    be eligible for service, because its virtual start time is higher than
    the system virtual time for its service tree.

    This commit addresses this issue by simply removing this direct
    switch.

    Signed-off-by: Paolo Valente
    Tested-by: Lee Tibbert
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • To provide a very smooth service, bfq starts to serve a bfq_queue
    only if the queue is 'eligible', i.e., if the same queue would
    have started to be served in the ideal, perfectly fair system that
    bfq simulates internally. This is obtained by associating each
    queue with a virtual start time, and by computing a special system
    virtual time quantity: a queue is eligible only if the system
    virtual time has reached the virtual start time of the
    queue. Finally, bfq guarantees that, when a new queue must be set
    in service, there is always at least one eligible entity for each
    active parent entity in the scheduler. To provide this guarantee,
    the function __bfq_lookup_next_entity pushes up, for each parent
    entity on which it is invoked, the system virtual time to the
    minimum among the virtual start times of the entities in the
    active tree for the parent entity (more precisely, the push up
    occurs if the system virtual time happens to be lower than all
    such virtual start times).

    There is however a circumstance in which __bfq_lookup_next_entity
    cannot push up the system virtual time for a parent entity, even
    if the system virtual time is lower than the virtual start times
    of all the child entities in the active tree. It happens if one of
    the child entities is in service. In fact, in such a case, there
    is already an eligible entity, the in-service one, even if it may
    not be not present in the active tree (because in-service entities
    may be removed from the active tree).

    Unfortunately, in the last re-design of the
    hierarchical-scheduling engine, the reset of the pointer to the
    in-service entity for a given parent entity--reset to be done as a
    consequence of the expiration of the in-service entity--always
    happens after the function __bfq_lookup_next_entity has been
    invoked. This causes the function to think that there is still an
    entity in service for the parent entity, and then that the system
    virtual time cannot be pushed up, even if actually such a
    no-more-in-service entity has already been properly reinserted
    into the active tree (or in some other tree if no more
    active). Yet, the system virtual time *had* to be pushed up, to be
    ready to correctly choose the next queue to serve. Because of the
    lack of this push up, bfq may wrongly set in service a queue that
    had been speculatively pre-computed as the possible
    next-in-service queue, but that would no more be the one to serve
    after the expiration and the reinsertion into the active trees of
    the previously in-service entities.

    This commit addresses this issue by making
    __bfq_lookup_next_entity properly push up the system virtual time
    if an expiration is occurring.

    Signed-off-by: Paolo Valente
    Tested-by: Lee Tibbert
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Jens Axboe

    Paolo Valente
     

30 Jul, 2017

2 commits

  • Groups of BFQ queues are represented by generic entities in BFQ. When
    a queue belonging to a parent entity is deactivated, the parent entity
    may need to be deactivated too, in case the deactivated queue was the
    only active queue for the parent entity. This deactivation may need to
    be propagated upwards if the entity belongs, in its turn, to a further
    higher-level entity, and so on. In particular, the upward propagation
    of deactivation stops at the first parent entity that remains active
    even if one of its child entities has been deactivated.

    To decide whether the last non-deactivation condition holds for a
    parent entity, BFQ checks whether the field next_in_service is still
    not NULL for the parent entity, after the deactivation of one of its
    child entity. If it is not NULL, then there are certainly other active
    entities in the parent entity, and deactivations can stop.

    Unfortunately, this check misses a corner case: if in_service_entity
    is not NULL, then next_in_service may happen to be NULL, although the
    parent entity is evidently active. This happens if: 1) the entity
    pointed by in_service_entity is the only active entity in the parent
    entity, and 2) according to the definition of next_in_service, the
    in_service_entity cannot be considered as next_in_service. See the
    comments on the definition of next_in_service for details on this
    second point.

    Hitting the above corner case causes crashes.

    To address this issue, this commit:
    1) Extends the above check on only next_in_service to controlling both
    next_in_service and in_service_entity (if any of them is not NULL,
    then no further deactivation is performed)
    2) Improves the (important) comments on how next_in_service is defined
    and updated; in particular it fixes a few rather obscure paragraphs

    Reported-by: Eric Wheeler
    Reported-by: Rick Yiu
    Reported-by: Tom X Nguyen
    Signed-off-by: Paolo Valente
    Tested-by: Eric Wheeler
    Tested-by: Rick Yiu
    Tested-by: Laurentiu Nicola
    Tested-by: Tom X Nguyen
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • BFQ implements hierarchical scheduling by representing each group of
    queues with a generic parent entity. For each parent entity, BFQ
    maintains an in_service_entity pointer: if one of the child entities
    happens to be in service, in_service_entity points to it. The
    resetting of these pointers happens only on queue expirations: when
    the in-service queue is expired, i.e., stops to be the queue in
    service, BFQ resets all in_service_entity pointers along the
    parent-entity path from this queue to the root entity.

    Functions handling the scheduling of entities assume, naturally, that
    in-service entities are active, i.e., have pending I/O requests (or,
    as a special case, even if they have no pending requests, they are
    expected to receive a new request very soon, with the scheduler idling
    the storage device while waiting for such an event). Unfortunately,
    the above resetting scheme of the in_service_entity pointers may cause
    this assumption to be violated. For example, the in-service queue may
    happen to remain without requests because of a request merge. In this
    case the queue does become idle, and all related data structures are
    updated accordingly. But in_service_entity still points to the queue
    in the parent entity. This inconsistency may even propagate to
    higher-level parent entities, if they happen to become idle as well,
    as a consequence of the leaf queue becoming idle. For this queue and
    parent entities, scheduling functions have an undefined behaviour,
    and, as reported, may easily lead to kernel crashes or hangs.

    This commit addresses this issue by simply resetting the
    in_service_entity field also when it is detected to point to an entity
    becoming idle (regardless of why the entity becomes idle).

    Reported-by: Laurentiu Nicola
    Signed-off-by: Paolo Valente
    Tested-by: Laurentiu Nicola
    Signed-off-by: Jens Axboe

    Paolo Valente
     

12 Jul, 2017

1 commit


04 Jul, 2017

1 commit

  • On each deactivation or re-scheduling (after being served) of a
    bfq_queue, BFQ invokes the function __bfq_entity_update_weight_prio(),
    to perform pending updates of ioprio, weight and ioprio class for the
    bfq_queue. BFQ also invokes this function on I/O-request dispatches,
    to raise or lower weights more quickly when needed, thereby improving
    latency. However, the entity representing the bfq_queue may be on the
    active (sub)tree of a service tree when this happens, and, although
    with a very low probability, the bfq_queue may happen to also have a
    pending change of its ioprio class. If both conditions hold when
    __bfq_entity_update_weight_prio() is invoked, then the entity moves to
    a sort of hybrid state: the new service tree for the entity, as
    returned by bfq_entity_service_tree(), differs from service tree on
    which the entity still is. The functions that handle activations and
    deactivations of entities do not cope with such a hybrid state (and
    would need to become more complex to cope).

    This commit addresses this issue by just making
    __bfq_entity_update_weight_prio() not perform also a possible pending
    change of ioprio class, when invoked on an I/O-request dispatch for a
    bfq_queue. Such a change is thus postponed to when
    __bfq_entity_update_weight_prio() is invoked on deactivation or
    re-scheduling of the bfq_queue.

    Reported-by: Marco Piazza
    Reported-by: Laurentiu Nicola
    Signed-off-by: Paolo Valente
    Tested-by: Marco Piazza
    Signed-off-by: Jens Axboe

    Paolo Valente
     

10 May, 2017

1 commit

  • In the function __bfq_deactivate_entity, the pointer
    entity->sched_data could happen to be used before being properly
    initialized. This led to a NULL pointer dereference. This commit fixes
    this bug by just using this pointer only where it is safe to do so.

    Reported-by: Tom Harrison
    Tested-by: Tom Harrison
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

19 Apr, 2017

1 commit

  • The BFQ I/O scheduler features an optimal fair-queuing
    (proportional-share) scheduling algorithm, enriched with several
    mechanisms to boost throughput and reduce latency for interactive and
    real-time applications. This makes BFQ a large and complex piece of
    code. This commit addresses this issue by splitting BFQ into three
    main, independent components, and by moving each component into a
    separate source file:
    1. Main algorithm: handles the interaction with the kernel, and
    decides which requests to dispatch; it uses the following two further
    components to achieve its goals.
    2. Scheduling engine (Hierarchical B-WF2Q+ scheduling algorithm):
    computes the schedule, using weights and budgets provided by the above
    component.
    3. cgroups support: handles group operations (creation, destruction,
    move, ...).

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente