Commit b30505c81a9d4adea8b70ecff512b0216929b797

Authored by Darren Hart
Committed by Ingo Molnar
1 parent ba9c22f2c0

futex: add requeue-pi documentation

Add Documentation/futex-requeue-pi.txt describing the motivation for the
newly added FUTEX_*REQUEUE_PI op codes and their implementation.

[ Impact: add documentation ]

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Cc: Sripathi Kodi <sripathik@in.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Dinakar Guniguntala <dino@in.ibm.com>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: Jakub Jelinek <jakub@redhat.com>
LKML-Reference: <4A03634E.3080609@us.ibm.com>
[ reformatted the file ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>

Showing 1 changed file with 131 additions and 0 deletions Side-by-side Diff

Documentation/futex-requeue-pi.txt
  1 +Futex Requeue PI
  2 +----------------
  3 +
  4 +Requeueing of tasks from a non-PI futex to a PI futex requires
  5 +special handling in order to ensure the underlying rt_mutex is never
  6 +left without an owner if it has waiters; doing so would break the PI
  7 +boosting logic [see rt-mutex-desgin.txt] For the purposes of
  8 +brevity, this action will be referred to as "requeue_pi" throughout
  9 +this document. Priority inheritance is abbreviated throughout as
  10 +"PI".
  11 +
  12 +Motivation
  13 +----------
  14 +
  15 +Without requeue_pi, the glibc implementation of
  16 +pthread_cond_broadcast() must resort to waking all the tasks waiting
  17 +on a pthread_condvar and letting them try to sort out which task
  18 +gets to run first in classic thundering-herd formation. An ideal
  19 +implementation would wake the highest-priority waiter, and leave the
  20 +rest to the natural wakeup inherent in unlocking the mutex
  21 +associated with the condvar.
  22 +
  23 +Consider the simplified glibc calls:
  24 +
  25 +/* caller must lock mutex */
  26 +pthread_cond_wait(cond, mutex)
  27 +{
  28 + lock(cond->__data.__lock);
  29 + unlock(mutex);
  30 + do {
  31 + unlock(cond->__data.__lock);
  32 + futex_wait(cond->__data.__futex);
  33 + lock(cond->__data.__lock);
  34 + } while(...)
  35 + unlock(cond->__data.__lock);
  36 + lock(mutex);
  37 +}
  38 +
  39 +pthread_cond_broadcast(cond)
  40 +{
  41 + lock(cond->__data.__lock);
  42 + unlock(cond->__data.__lock);
  43 + futex_requeue(cond->data.__futex, cond->mutex);
  44 +}
  45 +
  46 +Once pthread_cond_broadcast() requeues the tasks, the cond->mutex
  47 +has waiters. Note that pthread_cond_wait() attempts to lock the
  48 +mutex only after it has returned to user space. This will leave the
  49 +underlying rt_mutex with waiters, and no owner, breaking the
  50 +previously mentioned PI-boosting algorithms.
  51 +
  52 +In order to support PI-aware pthread_condvar's, the kernel needs to
  53 +be able to requeue tasks to PI futexes. This support implies that
  54 +upon a successful futex_wait system call, the caller would return to
  55 +user space already holding the PI futex. The glibc implementation
  56 +would be modified as follows:
  57 +
  58 +
  59 +/* caller must lock mutex */
  60 +pthread_cond_wait_pi(cond, mutex)
  61 +{
  62 + lock(cond->__data.__lock);
  63 + unlock(mutex);
  64 + do {
  65 + unlock(cond->__data.__lock);
  66 + futex_wait_requeue_pi(cond->__data.__futex);
  67 + lock(cond->__data.__lock);
  68 + } while(...)
  69 + unlock(cond->__data.__lock);
  70 + /* the kernel acquired the the mutex for us */
  71 +}
  72 +
  73 +pthread_cond_broadcast_pi(cond)
  74 +{
  75 + lock(cond->__data.__lock);
  76 + unlock(cond->__data.__lock);
  77 + futex_requeue_pi(cond->data.__futex, cond->mutex);
  78 +}
  79 +
  80 +The actual glibc implementation will likely test for PI and make the
  81 +necessary changes inside the existing calls rather than creating new
  82 +calls for the PI cases. Similar changes are needed for
  83 +pthread_cond_timedwait() and pthread_cond_signal().
  84 +
  85 +Implementation
  86 +--------------
  87 +
  88 +In order to ensure the rt_mutex has an owner if it has waiters, it
  89 +is necessary for both the requeue code, as well as the waiting code,
  90 +to be able to acquire the rt_mutex before returning to user space.
  91 +The requeue code cannot simply wake the waiter and leave it to
  92 +acquire the rt_mutex as it would open a race window between the
  93 +requeue call returning to user space and the waiter waking and
  94 +starting to run. This is especially true in the uncontended case.
  95 +
  96 +The solution involves two new rt_mutex helper routines,
  97 +rt_mutex_start_proxy_lock() and rt_mutex_finish_proxy_lock(), which
  98 +allow the requeue code to acquire an uncontended rt_mutex on behalf
  99 +of the waiter and to enqueue the waiter on a contended rt_mutex.
  100 +Two new system calls provide the kernel<->user interface to
  101 +requeue_pi: FUTEX_WAIT_REQUEUE_PI and FUTEX_REQUEUE_CMP_PI.
  102 +
  103 +FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait()
  104 +and pthread_cond_timedwait()) to block on the initial futex and wait
  105 +to be requeued to a PI-aware futex. The implementation is the
  106 +result of a high-speed collision between futex_wait() and
  107 +futex_lock_pi(), with some extra logic to check for the additional
  108 +wake-up scenarios.
  109 +
  110 +FUTEX_REQUEUE_CMP_PI is called by the waker
  111 +(pthread_cond_broadcast() and pthread_cond_signal()) to requeue and
  112 +possibly wake the waiting tasks. Internally, this system call is
  113 +still handled by futex_requeue (by passing requeue_pi=1). Before
  114 +requeueing, futex_requeue() attempts to acquire the requeue target
  115 +PI futex on behalf of the top waiter. If it can, this waiter is
  116 +woken. futex_requeue() then proceeds to requeue the remaining
  117 +nr_wake+nr_requeue tasks to the PI futex, calling
  118 +rt_mutex_start_proxy_lock() prior to each requeue to prepare the
  119 +task as a waiter on the underlying rt_mutex. It is possible that
  120 +the lock can be acquired at this stage as well, if so, the next
  121 +waiter is woken to finish the acquisition of the lock.
  122 +
  123 +FUTEX_REQUEUE_PI accepts nr_wake and nr_requeue as arguments, but
  124 +their sum is all that really matters. futex_requeue() will wake or
  125 +requeue up to nr_wake + nr_requeue tasks. It will wake only as many
  126 +tasks as it can acquire the lock for, which in the majority of cases
  127 +should be 0 as good programming practice dictates that the caller of
  128 +either pthread_cond_broadcast() or pthread_cond_signal() acquire the
  129 +mutex prior to making the call. FUTEX_REQUEUE_PI requires that
  130 +nr_wake=1. nr_requeue should be INT_MAX for broadcast and 0 for
  131 +signal.