Commit b30505c81a9d4adea8b70ecff512b0216929b797
Committed by Ingo Molnar
1 parent ba9c22f2c0
Exists in master and in 7 other branches
futex: add requeue-pi documentation
Add Documentation/futex-requeue-pi.txt describing the motivation for the
newly added FUTEX_*REQUEUE_PI op codes and their implementation.

[ Impact: add documentation ]

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Cc: Sripathi Kodi <sripathik@in.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Dinakar Guniguntala <dino@in.ibm.com>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: Jakub Jelinek <jakub@redhat.com>
LKML-Reference: <4A03634E.3080609@us.ibm.com>
[ reformatted the file ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Showing 1 changed file with 131 additions and 0 deletions
Documentation/futex-requeue-pi.txt
1 | +Futex Requeue PI | |
2 | +---------------- | |
3 | + | |
4 | +Requeueing of tasks from a non-PI futex to a PI futex requires | |
5 | +special handling in order to ensure the underlying rt_mutex is never | |
6 | +left without an owner if it has waiters; doing so would break the PI | |
7 | +boosting logic [see rt-mutex-design.txt]. For the purposes of | |
8 | +brevity, this action will be referred to as "requeue_pi" throughout | |
9 | +this document. Priority inheritance is abbreviated throughout as | |
10 | +"PI". | |
11 | + | |
12 | +Motivation | |
13 | +---------- | |
14 | + | |
15 | +Without requeue_pi, the glibc implementation of | |
16 | +pthread_cond_broadcast() must resort to waking all the tasks waiting | |
17 | +on a pthread_condvar and letting them try to sort out which task | |
18 | +gets to run first in classic thundering-herd formation. An ideal | |
19 | +implementation would wake the highest-priority waiter, and leave the | |
20 | +rest to the natural wakeup inherent in unlocking the mutex | |
21 | +associated with the condvar. | |
22 | + | |
23 | +Consider the simplified glibc calls: | |
24 | + | |
25 | +/* caller must lock mutex */ | |
26 | +pthread_cond_wait(cond, mutex) | |
27 | +{ | |
28 | + lock(cond->__data.__lock); | |
29 | + unlock(mutex); | |
30 | + do { | |
31 | + unlock(cond->__data.__lock); | |
32 | + futex_wait(cond->__data.__futex); | |
33 | + lock(cond->__data.__lock); | |
34 | + } while(...); | |
35 | + unlock(cond->__data.__lock); | |
36 | + lock(mutex); | |
37 | +} | |
38 | + | |
39 | +pthread_cond_broadcast(cond) | |
40 | +{ | |
41 | + lock(cond->__data.__lock); | |
42 | + unlock(cond->__data.__lock); | |
43 | + futex_requeue(cond->__data.__futex, cond->mutex); | |
44 | +} | |
45 | + | |
46 | +Once pthread_cond_broadcast() requeues the tasks, the cond->mutex | |
47 | +has waiters. Note that pthread_cond_wait() attempts to lock the | |
48 | +mutex only after it has returned to user space. This will leave the | |
49 | +underlying rt_mutex with waiters, and no owner, breaking the | |
50 | +previously mentioned PI-boosting algorithms. | |
51 | + | |
52 | +In order to support PI-aware pthread_condvars, the kernel needs to | |
53 | +be able to requeue tasks to PI futexes. This support implies that | |
54 | +upon a successful futex_wait system call, the caller would return to | |
55 | +user space already holding the PI futex. The glibc implementation | |
56 | +would be modified as follows: | |
57 | + | |
58 | + | |
59 | +/* caller must lock mutex */ | |
60 | +pthread_cond_wait_pi(cond, mutex) | |
61 | +{ | |
62 | + lock(cond->__data.__lock); | |
63 | + unlock(mutex); | |
64 | + do { | |
65 | + unlock(cond->__data.__lock); | |
66 | + futex_wait_requeue_pi(cond->__data.__futex); | |
67 | + lock(cond->__data.__lock); | |
68 | + } while(...); | |
69 | + unlock(cond->__data.__lock); | |
70 | + /* the kernel acquired the mutex for us */ | |
71 | +} | |
72 | + | |
73 | +pthread_cond_broadcast_pi(cond) | |
74 | +{ | |
75 | + lock(cond->__data.__lock); | |
76 | + unlock(cond->__data.__lock); | |
77 | + futex_requeue_pi(cond->__data.__futex, cond->mutex); | |
78 | +} | |
79 | + | |
80 | +The actual glibc implementation will likely test for PI and make the | |
81 | +necessary changes inside the existing calls rather than creating new | |
82 | +calls for the PI cases. Similar changes are needed for | |
83 | +pthread_cond_timedwait() and pthread_cond_signal(). | |
84 | + | |
85 | +Implementation | |
86 | +-------------- | |
87 | + | |
88 | +In order to ensure the rt_mutex has an owner if it has waiters, it | |
89 | +is necessary for both the requeue code, as well as the waiting code, | |
90 | +to be able to acquire the rt_mutex before returning to user space. | |
91 | +The requeue code cannot simply wake the waiter and leave it to | |
92 | +acquire the rt_mutex as it would open a race window between the | |
93 | +requeue call returning to user space and the waiter waking and | |
94 | +starting to run. This is especially true in the uncontended case. | |
95 | + | |
96 | +The solution involves two new rt_mutex helper routines, | |
97 | +rt_mutex_start_proxy_lock() and rt_mutex_finish_proxy_lock(), which | |
98 | +allow the requeue code to acquire an uncontended rt_mutex on behalf | |
99 | +of the waiter and to enqueue the waiter on a contended rt_mutex. | |
100 | +Two new futex op codes provide the kernel<->user interface to | |
101 | +requeue_pi: FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI. | |
102 | + | |
103 | +FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait() | |
104 | +and pthread_cond_timedwait()) to block on the initial futex and wait | |
105 | +to be requeued to a PI-aware futex. The implementation is the | |
106 | +result of a high-speed collision between futex_wait() and | |
107 | +futex_lock_pi(), with some extra logic to check for the additional | |
108 | +wake-up scenarios. | |
109 | + | |
110 | +FUTEX_CMP_REQUEUE_PI is called by the waker | |
111 | +(pthread_cond_broadcast() and pthread_cond_signal()) to requeue and | |
112 | +possibly wake the waiting tasks. Internally, this system call is | |
113 | +still handled by futex_requeue (by passing requeue_pi=1). Before | |
114 | +requeueing, futex_requeue() attempts to acquire the requeue target | |
115 | +PI futex on behalf of the top waiter. If it can, this waiter is | |
116 | +woken. futex_requeue() then proceeds to requeue the remaining | |
117 | +nr_wake+nr_requeue tasks to the PI futex, calling | |
118 | +rt_mutex_start_proxy_lock() prior to each requeue to prepare the | |
119 | +task as a waiter on the underlying rt_mutex. It is possible that | |
120 | +the lock can be acquired at this stage as well; if so, the next | |
121 | +waiter is woken to finish the acquisition of the lock. | |
122 | + | |
123 | +FUTEX_CMP_REQUEUE_PI accepts nr_wake and nr_requeue as arguments, but | |
124 | +their sum is all that really matters. futex_requeue() will wake or | |
125 | +requeue up to nr_wake + nr_requeue tasks. It will wake only as many | |
126 | +tasks as it can acquire the lock for, which in the majority of cases | |
127 | +should be 0 as good programming practice dictates that the caller of | |
128 | +either pthread_cond_broadcast() or pthread_cond_signal() acquire the | |
129 | +mutex prior to making the call. FUTEX_CMP_REQUEUE_PI requires that | |
130 | +nr_wake=1. nr_requeue should be INT_MAX for broadcast and 0 for | |
131 | +signal. |
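
The argument rules in the last paragraph of the added file (nr_wake=1, nr_requeue of INT_MAX for broadcast and 0 for signal) can be illustrated with a minimal user-space sketch. It is not part of the commit and is not the glibc implementation. It assumes a Linux system whose <linux/futex.h> defines FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI, and it assumes the argument layout documented in the futex(2) manual page. The wrapper names (requeue_pi_nr_requeue, futex_wait_requeue_pi, futex_cmp_requeue_pi) are hypothetical and chosen only for illustration.

```c
#define _GNU_SOURCE
#include <limits.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: pick nr_requeue as the document describes,
 * INT_MAX for a broadcast and 0 for a signal. */
int requeue_pi_nr_requeue(int broadcast)
{
	return broadcast ? INT_MAX : 0;
}

/* Waiter side (hypothetical wrapper): block on the non-PI futex
 * 'cond_futex' while it still holds 'expected', and wait to be requeued
 * to the PI futex 'mutex_futex'. Per futex(2), the timeout for this
 * operation is absolute; NULL means wait indefinitely. */
long futex_wait_requeue_pi(uint32_t *cond_futex, uint32_t expected,
			   const struct timespec *abs_timeout,
			   uint32_t *mutex_futex)
{
	return syscall(SYS_futex, cond_futex, FUTEX_WAIT_REQUEUE_PI,
		       expected, abs_timeout, mutex_futex, 0);
}

/* Waker side (hypothetical wrapper): requeue waiters from 'cond_futex'
 * to the PI futex 'mutex_futex', provided 'cond_futex' still holds
 * 'expected'. nr_wake is fixed at 1, as the operation requires;
 * nr_requeue is passed in the argument slot that other futex
 * operations use for the timeout pointer. */
long futex_cmp_requeue_pi(uint32_t *cond_futex, uint32_t expected,
			  uint32_t *mutex_futex, int broadcast)
{
	return syscall(SYS_futex, cond_futex, FUTEX_CMP_REQUEUE_PI,
		       1 /* nr_wake */,
		       (unsigned long)requeue_pi_nr_requeue(broadcast),
		       mutex_futex, expected);
}
```

In this sketch a condvar broadcast would call futex_cmp_requeue_pi(..., 1) and a signal would call futex_cmp_requeue_pi(..., 0), ideally while the caller holds the mutex, so that the kernel requeues the waiters instead of waking them.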