Commit a6537be9324c67b41f6d98f5a60a1bd5a8e02861

Authored by Steven Rostedt
Committed by Linus Torvalds
1 parent 23f78d4a03

[PATCH] pi-futex: rt mutex docs

Add rt-mutex documentation.

[rostedt@goodmis.org: Update rt-mutex-design.txt as per Randy Dunlap suggestions]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

Showing 3 changed files with 981 additions and 0 deletions

Documentation/pi-futex.txt
  1 +Lightweight PI-futexes
  2 +----------------------
  3 +
  4 +We are calling them lightweight for 3 reasons:
  5 +
  6 + - in the user-space fastpath a PI-enabled futex involves no kernel work
  7 + (or any other PI complexity) at all. No registration, no extra kernel
  8 + calls - just pure fast atomic ops in userspace.
  9 +
  10 + - even in the slowpath, the system call and scheduling pattern is very
  11 + similar to normal futexes.
  12 +
  13 + - the in-kernel PI implementation is streamlined around the mutex
  14 + abstraction, with strict rules that keep the implementation
  15 + relatively simple: only a single owner may own a lock (i.e. no
  16 + read-write lock support), only the owner may unlock a lock, no
  17 + recursive locking, etc.
  18 +
  19 +Priority Inheritance - why?
  20 +---------------------------
  21 +
  22 +The short reply: user-space PI helps achieve/improve determinism for
  23 +user-space applications. In the best-case, it can help achieve
  24 +determinism and well-bound latencies. Even in the worst-case, PI will
  25 +improve the statistical distribution of locking related application
  26 +delays.
  27 +
  28 +The longer reply:
  29 +-----------------
  30 +
  31 +Firstly, sharing locks between multiple tasks is a common programming
  32 +technique that often cannot be replaced with lockless algorithms. As we
  33 +can see in the kernel [which is a quite complex program in itself],
  34 +lockless structures are the exception rather than the norm - the current
  35 +ratio of lockless vs. locky code for shared data structures is somewhere
  36 +between 1:10 and 1:100. Lockless is hard, and the complexity of lockless
  37 +algorithms often endangers the ability to do robust reviews of said code.
  38 +I.e. critical RT apps often choose lock structures to protect critical
  39 +data structures, instead of lockless algorithms. Furthermore, there are
  40 +cases (like shared hardware, or other resource limits) where lockless
  41 +access is mathematically impossible.
  42 +
  43 +Media players (such as Jack) are an example of reasonable application
  44 +design with multiple tasks (with multiple priority levels) sharing
  45 +short-held locks: for example, a highprio audio playback thread is
  46 +combined with medium-prio construct-audio-data threads and low-prio
  47 +display-colory-stuff threads. Add video and decoding to the mix and
  48 +we've got even more priority levels.
  49 +
  50 +So once we accept that synchronization objects (locks) are an
  51 +unavoidable fact of life, and once we accept that multi-task userspace
  52 +apps have a very fair expectation of being able to use locks, we've got
  53 +to think about how to offer the option of a deterministic locking
  54 +implementation to user-space.
  55 +
  56 +Most of the technical counter-arguments against doing priority
  57 +inheritance only apply to kernel-space locks. But user-space locks are
  58 +different, there we cannot disable interrupts or make the task
  59 +non-preemptible in a critical section, so the 'use spinlocks' argument
  60 +does not apply (user-space spinlocks have the same priority inversion
  61 +problems as other user-space locking constructs). Fact is, pretty much
  62 +the only technique that currently enables good determinism for userspace
  63 +locks (such as futex-based pthread mutexes) is priority inheritance:
  64 +
  65 +Currently (without PI), if a high-prio and a low-prio task share a lock
  66 +[this is a quite common scenario for most non-trivial RT applications],
  67 +even if all critical sections are coded carefully to be deterministic
  68 +(i.e. all critical sections are short in duration and only execute a
  69 +limited number of instructions), the kernel cannot guarantee any
  70 +deterministic execution of the high-prio task: any medium-priority task
  71 +could preempt the low-prio task while it holds the shared lock and
  72 +executes the critical section, and could delay it indefinitely.
  73 +
  74 +Implementation:
  75 +---------------
  76 +
  77 +As mentioned before, the userspace fastpath of PI-enabled pthread
  78 +mutexes involves no kernel work at all - they behave quite similarly to
  79 +normal futex-based locks: a 0 value means unlocked, and a value==TID
  80 +means locked. (This is the same method as used by list-based robust
  81 +futexes.) Userspace uses atomic ops to lock/unlock these mutexes without
  82 +entering the kernel.
  83 +
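As a concrete illustration of the fastpath/slowpath split, here is a minimal,
hypothetical sketch of the lock side. The helper name, the GCC __sync builtin
and the raw futex syscall invocation are assumptions of this sketch, not part
of the ABI description above:

#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

/* Hypothetical helper: 'uaddr' is the shared 32-bit futex word, 'tid' is
 * the caller's kernel thread id (as returned by gettid()). */
static void pi_lock(uint32_t *uaddr, uint32_t tid)
{
        /* Fastpath: atomically turn 0 (unlocked) into our TID. */
        if (__sync_val_compare_and_swap(uaddr, 0, tid) == 0)
                return;         /* uncontended - no kernel entry at all */

        /* Slowpath: the kernel queues us, sets FUTEX_WAITERS, boosts the
         * owner via the attached rt-mutex and eventually hands us the lock. */
        syscall(SYS_futex, uaddr, FUTEX_LOCK_PI, 0, NULL, NULL, 0);
}
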
  84 +To handle the slowpath, we have added two new futex ops:
  85 +
  86 + FUTEX_LOCK_PI
  87 + FUTEX_UNLOCK_PI
  88 +
  89 +If the lock-acquire fastpath fails, [i.e. an atomic transition from 0 to
  90 +TID fails], then FUTEX_LOCK_PI is called. The kernel does all the
  91 +remaining work: if there is no futex-queue attached to the futex address
  92 +yet then the code looks up the task that owns the futex [it has put its
  93 +own TID into the futex value], and attaches a 'PI state' structure to
  94 +the futex-queue. The pi_state includes an rt-mutex, which is a PI-aware,
  95 +kernel-based synchronization object. The 'other' task is made the owner
  96 +of the rt-mutex, and the FUTEX_WAITERS bit is atomically set in the
  97 +futex value. Then this task tries to lock the rt-mutex, on which it
  98 +blocks. Once it returns, it has the mutex acquired, and it sets the
  99 +futex value to its own TID and returns. Userspace has no other work to
  100 +perform - it now owns the lock, and the futex value contains
  101 +FUTEX_WAITERS|TID.
  102 +
  103 +If the unlock side fastpath succeeds, [i.e. userspace manages to do a
  104 +TID -> 0 atomic transition of the futex value], then no kernel work is
  105 +triggered.
  106 +
  107 +If the unlock fastpath fails (because the FUTEX_WAITERS bit is set),
  108 +then FUTEX_UNLOCK_PI is called, and the kernel unlocks the futex on
  109 +behalf of userspace - and it also unlocks the attached
  110 +pi_state->rt_mutex and thus wakes up any potential waiters.
  111 +
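The unlock side mirrors this; again a hedged sketch rather than a description
of any particular libc implementation:

/* Hypothetical counterpart to the pi_lock() sketch above. */
static void pi_unlock(uint32_t *uaddr, uint32_t tid)
{
        /* Fastpath: atomically turn our bare TID back into 0. This can
         * only succeed while FUTEX_WAITERS is not set, i.e. nobody queued. */
        if (__sync_val_compare_and_swap(uaddr, tid, 0) == tid)
                return;         /* no waiters - no kernel entry */

        /* Slowpath: the kernel fixes up the futex value and unlocks the
         * attached pi_state->rt_mutex, waking the top waiter. */
        syscall(SYS_futex, uaddr, FUTEX_UNLOCK_PI, 0, NULL, NULL, 0);
}
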
  112 +Note that under this approach, contrary to previous PI-futex approaches,
  113 +there is no prior 'registration' of a PI-futex. [which is not quite
  114 +possible anyway, due to existing ABI properties of pthread mutexes.]
  115 +
  116 +Also, under this scheme, 'robustness' and 'PI' are two orthogonal
  117 +properties of futexes, and all four combinations are possible: futex,
  118 +robust-futex, PI-futex, robust+PI-futex.
  119 +
  120 +More details about priority inheritance can be found in
  121 +Documentation/rt-mutex.txt.
Documentation/rt-mutex-design.txt
  1 +#
  2 +# Copyright (c) 2006 Steven Rostedt
  3 +# Licensed under the GNU Free Documentation License, Version 1.2
  4 +#
  5 +
  6 +RT-mutex implementation design
  7 +------------------------------
  8 +
  9 +This document tries to describe the design of the rtmutex.c implementation.
  10 +It doesn't describe the reasons why rtmutex.c exists. For that please see
  11 +Documentation/rt-mutex.txt. Although this document does explain problems
  12 +that happen without this code, it does so only to the extent needed to
  13 +understand what the code actually does.
  14 +
  15 +The goal of this document is to help others understand the priority
  16 +inheritance (PI) algorithm that is used, as well as reasons for the
  17 +decisions that were made to implement PI in the manner that was done.
  18 +
  19 +
  20 +Unbounded Priority Inversion
  21 +----------------------------
  22 +
  23 +Priority inversion is when a lower priority process executes while a higher
  24 +priority process wants to run. This happens for several reasons, and
  25 +most of the time it can't be helped. Anytime a high priority process wants
  26 +to use a resource that a lower priority process has (a mutex for example),
  27 +the high priority process must wait until the lower priority process is done
  28 +with the resource. This is a priority inversion. What we want to prevent
  29 +is something called unbounded priority inversion. That is when the high
  30 +priority process is prevented from running by a lower priority process for
  31 +an undetermined amount of time.
  32 +
  33 +The classic example of unbounded priority inversion is where you have three
  34 +processes, let's call them processes A, B, and C, where A is the highest
  35 +priority process, C is the lowest, and B is in between. A tries to grab a lock
  36 +that C owns, so it must wait and let C run to release the lock. But in the
  37 +meantime, B executes, and since B is of a higher priority than C, it preempts C,
  38 +but by doing so, it is in fact preempting A which is a higher priority process.
  39 +Now there's no way of knowing how long A will be sleeping waiting for C
  40 +to release the lock, because for all we know, B is a CPU hog and will
  41 +never give C a chance to release the lock. This is called unbounded priority
  42 +inversion.
  43 +
  44 +Here's a little ASCII art to show the problem.
  45 +
  46 +   grab lock L1 (owned by C)
  47 +     |
  48 +A ---+
  49 +        C preempted by B
  50 +          |
  51 +C    +----+
  52 +
  53 +B         +-------->
  54 +                B now keeps A from running.
  55 +
  56 +
  57 +Priority Inheritance (PI)
  58 +-------------------------
  59 +
  60 +There are several ways to solve this issue, but other ways are out of scope
  61 +for this document. Here we only discuss PI.
  62 +
  63 +PI is where a process inherits the priority of another process if the other
  64 +process blocks on a lock owned by the current process. To make this easier
  65 +to understand, let's use the previous example, with processes A, B, and C again.
  66 +
  67 +This time, when A blocks on the lock owned by C, C would inherit the priority
  68 +of A. So now if B becomes runnable, it would not preempt C, since C now has
  69 +the high priority of A. As soon as C releases the lock, it loses its
  70 +inherited priority, and A then can continue with the resource that C had.
  71 +
  72 +Terminology
  73 +-----------
  74 +
  75 +Here I explain some terminology that is used in this document to help describe
  76 +the design that is used to implement PI.
  77 +
  78 +PI chain - The PI chain is an ordered series of locks and processes that cause
  79 + processes to inherit priorities from a previous process that is
  80 + blocked on one of its locks. This is described in more detail
  81 + later in this document.
  82 +
  83 +mutex - In this document, to differentiate the locks that implement
  84 + PI from the spin locks that are used in the PI code, from now on
  85 + the PI locks will be called mutexes.
  86 +
  87 +lock - In this document from now on, I will use the term lock when
  88 + referring to spin locks that are used to protect parts of the PI
  89 + algorithm. These locks disable preemption for UP (when
  90 + CONFIG_PREEMPT is enabled) and on SMP prevent multiple CPUs from
  91 + entering critical sections simultaneously.
  92 +
  93 +spin lock - Same as lock above.
  94 +
  95 +waiter - A waiter is a struct that is stored on the stack of a blocked
  96 + process. Since the scope of the waiter is within the code for
  97 + a process being blocked on the mutex, it is fine to allocate
  98 + the waiter on the process's stack (local variable). This
  99 + structure holds a pointer to the task, as well as the mutex that
  100 + the task is blocked on. It also has the plist node structures to
  101 + place the task in the wait_list of a mutex as well as the
  102 + pi_list of a mutex owner task (described below).
  103 +
  104 + waiter is sometimes used in reference to the task that is waiting
  105 + on a mutex. This is the same as waiter->task.
  106 +
  107 +waiters - A list of processes that are blocked on a mutex.
  108 +
  109 +top waiter - The highest priority process waiting on a specific mutex.
  110 +
  111 +top pi waiter - The highest priority process waiting on one of the mutexes
  112 + that a specific process owns.
  113 +
  114 +Note: task and process are used interchangeably in this document, mostly to
  115 + differentiate between two processes that are being described together.
  116 +
  117 +
  118 +PI chain
  119 +--------
  120 +
  121 +The PI chain is a list of processes and mutexes that may cause priority
  122 +inheritance to take place. Multiple chains may converge, but a chain
  123 +would never diverge, since a process can't be blocked on more than one
  124 +mutex at a time.
  125 +
  126 +Example:
  127 +
  128 + Process: A, B, C, D, E
  129 + Mutexes: L1, L2, L3, L4
  130 +
  131 + A owns: L1
  132 +         B blocked on L1
  133 +         B owns L2
  134 +                C blocked on L2
  135 +                C owns L3
  136 +                       D blocked on L3
  137 +                       D owns L4
  138 +                              E blocked on L4
  139 +
  140 +The chain would be:
  141 +
  142 + E->L4->D->L3->C->L2->B->L1->A
  143 +
  144 +To show where two chains merge, we could add another process F and
  145 +another mutex L5 where B owns L5 and F is blocked on mutex L5.
  146 +
  147 +The chain for F would be:
  148 +
  149 + F->L5->B->L1->A
  150 +
  151 +Since a process may own more than one mutex, but never be blocked on more than
  152 +one, the chains merge.
  153 +
  154 +Here we show both chains:
  155 +
  156 +      E->L4->D->L3->C->L2-+
  157 +                          |
  158 +                          +->B->L1->A
  159 +                          |
  160 +                    F->L5-+
  161 +
  162 +For PI to work, the processes at the right end of these chains (or we may
  163 +also call it the Top of the chain) must be equal to or higher in priority
  164 +than the processes to the left or below in the chain.
  165 +
  166 +Also since a mutex may have more than one process blocked on it, we can
  167 +have multiple chains merge at mutexes. If we add another process G that is
  168 +blocked on mutex L2:
  169 +
  170 + G->L2->B->L1->A
  171 +
  172 +And once again, to show how this can grow I will show the merging chains
  173 +again.
  174 +
  175 +      E->L4->D->L3->C-+
  176 +                      +->L2-+
  177 +                      |     |
  178 +                    G-+     +->B->L1->A
  179 +                            |
  180 +                      F->L5-+
  181 +
  182 +
  183 +Plist
  184 +-----
  185 +
  186 +Before I go further and talk about how the PI chain is stored through lists
  187 +on both mutexes and processes, I'll explain the plist. This is similar to
  188 +the struct list_head functionality that is already in the kernel.
  189 +The implementation of plist is out of scope for this document, but it is
  190 +very important to understand what it does.
  191 +
  192 +There are a few differences between plist and list, the most important one
  193 +being that plist is a priority sorted linked list. This means that the
  194 +priorities of the plist are sorted, such that it takes O(1) to retrieve the
  195 +highest priority item in the list. Obviously this is useful to store processes
  196 +based on their priorities.
  197 +
  198 +Another difference, which is important for implementation, is that, unlike
  199 +list, the head of the list is a different element than the nodes of a list.
  200 +So the head of the list is declared as struct plist_head and nodes that will
  201 +be added to the list are declared as struct plist_node.
  202 +
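As a rough illustration of the plist primitives that the rt-mutex code relies
on, consider the sketch below. It is only an illustration: it assumes a
wait_list head that has already been initialized, and the head-initialization
interface has varied between kernel versions.

#include <linux/plist.h>

/* Queue a node by priority and check whether it became the top (highest
 * priority, i.e. lowest numerical prio) entry of the list. */
static int plist_example(struct plist_head *wait_list,
                         struct plist_node *node, int prio)
{
        int is_top;

        plist_node_init(node, prio);    /* the node carries its priority */
        plist_add(node, wait_list);     /* sorted insert by priority */

        /* plist_first() returns the highest priority node in O(1). */
        is_top = (plist_first(wait_list) == node);

        plist_del(node, wait_list);     /* removal needs node and head */
        return is_top;
}
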
  203 +
  204 +Mutex Waiter List
  205 +-----------------
  206 +
  207 +Every mutex keeps track of all the waiters that are blocked on it. The mutex
  208 +has a plist to store these waiters by priority. This list is protected by
  209 +a spin lock that is located in the struct of the mutex. This lock is called
  210 +wait_lock. Since the modification of the waiter list is never done in
  211 +interrupt context, the wait_lock can be taken without disabling interrupts.
  212 +
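Put together, the mutex described here boils down to roughly the following
fields (a simplified sketch with debug fields omitted and the structure
renamed; the owner field and its flag bits are covered in a later section,
and include/linux/rtmutex.h holds the authoritative definition):

#include <linux/spinlock.h>
#include <linux/plist.h>

struct task_struct;

struct rt_mutex_sketch {
        spinlock_t              wait_lock;  /* protects wait_list */
        struct plist_head       wait_list;  /* waiters, sorted by priority */
        struct task_struct      *owner;     /* owner pointer plus 2 flag bits */
};
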
  213 +
  214 +Task PI List
  215 +------------
  216 +
  217 +To keep track of the PI chains, each process has its own PI list. This is
  218 +a list of all top waiters of the mutexes that are owned by the process.
  219 +Note that this list only holds the top waiters and not all waiters that are
  220 +blocked on mutexes owned by the process.
  221 +
  222 +The top of the task's PI list is always the highest priority task that
  223 +is waiting on a mutex that is owned by the task. So if the task has
  224 +inherited a priority, it will always be the priority of the task that is
  225 +at the top of this list.
  226 +
  227 +This list is stored in the task structure of a process as a plist called
  228 +pi_list. This list is protected by a spin lock also in the task structure,
  229 +called pi_lock. This lock may also be taken in interrupt context, so when
  230 +locking the pi_lock, interrupts must be disabled.
  231 +
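Because the pi_lock can be taken from interrupt context, any code that takes
it from process context has to disable interrupts. A generic sketch of that
pattern follows (not a quote from rtmutex.c; lock types and field names have
changed between kernel versions):

#include <linux/sched.h>
#include <linux/spinlock.h>

static void inspect_pi_list(struct task_struct *task)
{
        unsigned long flags;

        spin_lock_irqsave(&task->pi_lock, flags);
        /* ... walk or update the task's pi_list here ... */
        spin_unlock_irqrestore(&task->pi_lock, flags);
}
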
  232 +
  233 +Depth of the PI Chain
  234 +---------------------
  235 +
  236 +The maximum depth of the PI chain is not dynamic, and could actually be
  237 +defined. But it is very complex to figure out, since it depends on all
  238 +the nesting of mutexes. Let's look at the example where we have 3 mutexes,
  239 +L1, L2, and L3, and four separate functions func1, func2, func3 and func4.
  240 +The following shows a locking order of L1->L2->L3, but may not actually
  241 +be directly nested that way.
  242 +
  243 +void func1(void)
  244 +{
  245 +        mutex_lock(L1);
  246 +
  247 +        /* do anything */
  248 +
  249 +        mutex_unlock(L1);
  250 +}
  251 +
  252 +void func2(void)
  253 +{
  254 +        mutex_lock(L1);
  255 +        mutex_lock(L2);
  256 +
  257 +        /* do something */
  258 +
  259 +        mutex_unlock(L2);
  260 +        mutex_unlock(L1);
  261 +}
  262 +
  263 +void func3(void)
  264 +{
  265 +        mutex_lock(L2);
  266 +        mutex_lock(L3);
  267 +
  268 +        /* do something else */
  269 +
  270 +        mutex_unlock(L3);
  271 +        mutex_unlock(L2);
  272 +}
  273 +
  274 +void func4(void)
  275 +{
  276 +        mutex_lock(L3);
  277 +
  278 +        /* do something again */
  279 +
  280 +        mutex_unlock(L3);
  281 +}
  282 +
  283 +Now we add 4 processes that run each of these functions separately.
  284 +Processes A, B, C, and D run functions func1, func2, func3 and func4
  285 +respectively, with D running first and A last. With D being preempted
  286 +in func4 in the "do something again" area, we get the following locking order:
  287 +
  288 +D owns L3
  289 +       C blocked on L3
  290 +       C owns L2
  291 +              B blocked on L2
  292 +              B owns L1
  293 +                     A blocked on L1
  294 +
  295 +And thus we have the chain A->L1->B->L2->C->L3->D.
  296 +
  297 +This gives us a PI depth of 4 (four processes), but looking at any of the
  298 +functions individually, it seems as though they only have at most a locking
  299 +depth of two. So, although the locking depth is defined at compile time,
  300 +it is still very difficult to determine the possible depths that can result.
  301 +
  302 +Now since mutexes can be defined by user-land applications, we don't want a DOS
  303 +type of application that nests large amounts of mutexes to create a large
  304 +PI chain, and have the code holding spin locks while looking at a large
  305 +amount of data. So to prevent this, the implementation not only implements
  306 +a maximum lock depth, but also only holds at most two different locks at a
  307 +time, as it walks the PI chain. More about this below.
  308 +
  309 +
  310 +Mutex owner and flags
  311 +---------------------
  312 +
  313 +The mutex structure contains a pointer to the owner of the mutex. If the
  314 +mutex is not owned, this owner is set to NULL. Since all architectures
  315 +have the task structure on at least a four byte alignment (and if this is
  316 +not true, the rtmutex.c code will be broken!), this allows for the two
  317 +least significant bits to be used as flags. This part is also described
  318 +in Documentation/rt-mutex.txt, but will also be briefly described here.
  319 +
  320 +Bit 0 is used as the "Pending Owner" flag. This is described later.
  321 +Bit 1 is used as the "Has Waiters" flag. This is also described later
  322 + in more detail, but is set whenever there are waiters on a mutex.
  323 +
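Schematically, the flag handling amounts to masking off the two low bits of
the owner pointer. The helper names and macros below are inventions of this
illustration; the bit assignments follow the text above:

struct task_struct;

#define PENDING_OWNER_FLAG      1UL     /* bit 0: pending owner */
#define HAS_WAITERS_FLAG        2UL     /* bit 1: has waiters */
#define OWNER_FLAG_MASK         (PENDING_OWNER_FLAG | HAS_WAITERS_FLAG)

/* Strip the flag bits to recover the task pointer; valid only because
 * task structures are at least 4-byte aligned. */
static inline struct task_struct *owner_task(unsigned long owner_field)
{
        return (struct task_struct *)(owner_field & ~OWNER_FLAG_MASK);
}

static inline int owner_is_pending(unsigned long owner_field)
{
        return owner_field & PENDING_OWNER_FLAG;
}
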
  324 +
  325 +cmpxchg Tricks
  326 +--------------
  327 +
  328 +Some architectures implement an atomic cmpxchg (Compare and Exchange). This
  329 +is used (when applicable) to keep the fast path of grabbing and releasing
  330 +mutexes short.
  331 +
  332 +cmpxchg is basically the following function performed atomically:
  333 +
  334 +unsigned long _cmpxchg(unsigned long *A, unsigned long *B, unsigned long *C)
  335 +{
  336 +        unsigned long T = *A;
  337 +        if (*A == *B) {
  338 +                *A = *C;
  339 +        }
  340 +        return T;
  341 +}
  342 +#define cmpxchg(a,b,c) _cmpxchg(&a,&b,&c)
  343 +
  344 +This is really nice to have, since it allows you to only update a variable
  345 +if the variable is what you expect it to be. You know if it succeeded if
  346 +the return value (the old value of A) is equal to B.
  347 +
  348 +The macro rt_mutex_cmpxchg is used to try to lock and unlock mutexes. If
  349 +the architecture does not support CMPXCHG, then this macro is simply set
  350 +to fail every time. But if CMPXCHG is supported, then this helps
  351 +enormously in keeping the fast path short.
  352 +
  353 +The use of rt_mutex_cmpxchg with the flags in the owner field helps optimize
  354 +the system for architectures that support it. This will also be explained
  355 +later in this document.
  356 +
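To illustrate how cmpxchg keeps the fast path short, a lock attempt could
look roughly like the sketch below, reusing the hypothetical rt_mutex_sketch
structure from earlier. The real helpers in rtmutex.c differ in their details:

#include <linux/sched.h>        /* for current */

static void example_slowlock(struct rt_mutex_sketch *lock); /* hypothetical */

static void example_lock(struct rt_mutex_sketch *lock)
{
        /* If the owner field is NULL (free, no flag bits set), atomically
         * install ourselves as the owner and we are done. */
        if (cmpxchg(&lock->owner, NULL, current) == NULL)
                return;                 /* fastpath: lock acquired */

        example_slowlock(lock);         /* contended: take the slow path */
}
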
  357 +
  358 +Priority adjustments
  359 +--------------------
  360 +
  361 +The implementation of the PI code in rtmutex.c has several places where a
  362 +process must adjust its priority. With the help of the pi_list of a
  363 +process, it is rather easy to know what needs to be adjusted.
  364 +
  365 +The functions implementing the task adjustments are rt_mutex_adjust_prio,
  366 +__rt_mutex_adjust_prio (same as the former, but expects the task pi_lock
  367 +to already be taken), rt_mutex_getprio, and rt_mutex_setprio.
  368 +
  369 +rt_mutex_getprio and rt_mutex_setprio are only used in __rt_mutex_adjust_prio.
  370 +
  371 +rt_mutex_getprio returns the priority that the task should have. Either the
  372 +task's own normal priority, or if a process of a higher priority is waiting on
  373 +a mutex owned by the task, then that higher priority should be returned.
  374 +Since the pi_list of a task holds a priority-ordered list of all the top
  375 +waiters of all the mutexes that the task owns, rt_mutex_getprio simply needs
  376 +to compare the top pi waiter to its own normal priority, and return the higher
  377 +priority back.
  378 +
  379 +(Note: if looking at the code, you will notice that the lower number of
  380 + prio is returned. This is because the prio field in the task structure
  381 + is an inverse order of the actual priority. So a "prio" of 5 is
  382 + of higher priority than a "prio" of 10.)
  383 +
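Put into approximate code, rt_mutex_getprio boils down to something like the
sketch below; treat it as an illustration, not a verbatim copy of
kernel/rtmutex.c:

int rt_mutex_getprio_sketch(struct task_struct *task)
{
        if (!task_has_pi_waiters(task))
                return task->normal_prio;

        /* Lower numerical prio means higher priority, hence min(). */
        return min(task_top_pi_waiter(task)->pi_list_entry.prio,
                   task->normal_prio);
}
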
  384 +__rt_mutex_adjust_prio examines the result of rt_mutex_getprio, and if the
  385 +result does not equal the task's current priority, then rt_mutex_setprio
  386 +is called to adjust the priority of the task to the new priority.
  387 +Note that rt_mutex_setprio is defined in kernel/sched.c to implement the
  388 +actual change in priority.
  389 +
  390 +It is interesting to note that __rt_mutex_adjust_prio can either increase
  391 +or decrease the priority of the task. In the case that a higher priority
  392 +process has just blocked on a mutex owned by the task, __rt_mutex_adjust_prio
  393 +would increase/boost the task's priority. But if a higher priority task
  394 +were for some reason to leave the mutex (timeout or signal), this same function
  395 +would decrease/unboost the priority of the task. That is because the pi_list
  396 +always contains the highest priority task that is waiting on a mutex owned
  397 +by the task, so we only need to compare the priority of that top pi waiter
  398 +to the normal priority of the given task.
  399 +
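In the same spirit, __rt_mutex_adjust_prio looks roughly like this sketch
(approximate rather than verbatim):

static void adjust_prio_sketch(struct task_struct *task)
{
        int prio = rt_mutex_getprio_sketch(task);

        /* Apply the new effective priority via the scheduler if it changed,
         * in either direction (boost or deboost). */
        if (task->prio != prio)
                rt_mutex_setprio(task, prio);   /* lives in kernel/sched.c */
}
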
  400 +
  401 +High level overview of the PI chain walk
  402 +----------------------------------------
  403 +
  404 +The PI chain walk is implemented by the function rt_mutex_adjust_prio_chain.
  405 +
  406 +The implementation has gone through several iterations, and has ended up
  407 +with what we believe is the best. It walks the PI chain by only grabbing
  408 +at most two locks at a time, and is very efficient.
  409 +
  410 +The rt_mutex_adjust_prio_chain can be used either to boost or lower process
  411 +priorities.
  412 +
  413 +rt_mutex_adjust_prio_chain is called with a task to be checked for PI
  414 +(de)boosting (the owner of a mutex that a process is blocking on), a flag to
  415 +check for deadlocking, the mutex that the task owns, and a pointer to a waiter
  416 +that is the process's waiter struct that is blocked on the mutex (although this
  417 +parameter may be NULL for deboosting).
  418 +
  419 +For this explanation, I will not mention deadlock detection. This explanation
  420 +will try to stay at a high level.
  421 +
  422 +When this function is called, there are no locks held. That also means
  423 +that the state of the owner and lock can change while inside this function.
  424 +
  425 +Before this function is called, the task has already had rt_mutex_adjust_prio
  426 +performed on it. This means that the task is set to the priority that it
  427 +should be at, but the plist nodes of the task's waiter have not been updated
  428 +with the new priorities, and that this task may not be in the proper locations
  429 +in the pi_lists and wait_lists that the task is blocked on. This function
  430 +solves all that.
  431 +
  432 +A loop is entered, where task is the owner to be checked for PI changes that
  433 +was passed by parameter (for the first iteration). The pi_lock of this task is
  434 +taken to prevent any more changes to the pi_list of the task. This also
  435 +prevents new tasks from completing the blocking on a mutex that is owned by this
  436 +task.
  437 +
  438 +If the task is not blocked on a mutex then the loop is exited. We are at
  439 +the top of the PI chain.
  440 +
  441 +A check is now done to see if the original waiter (the process that is blocked
  442 +on the current mutex) is the top pi waiter of the task. That is, is this
  443 +waiter on the top of the task's pi_list. If it is not, it either means that
  444 +there is another process higher in priority that is blocked on one of the
  445 +mutexes that the task owns, or that the waiter has just woken up via a signal
  446 +or timeout and has left the PI chain. In either case, the loop is exited, since
  447 +we don't need to do any more changes to the priority of the current task, or any
  448 +task that owns a mutex that this current task is waiting on. A priority chain
  449 +walk is only needed when a new top pi waiter is made to a task.
  450 +
  451 +The next check sees if the task's waiter plist node has the priority equal to
  452 +the priority the task is set at. If they are equal, then we are done with
  453 +the loop. Remember that the function started with the priority of the
  454 +task adjusted, but the plist nodes that hold the task in other processes'
  455 +pi_lists have not been adjusted.
  456 +
  457 +Next, we look at the mutex that the task is blocked on. The mutex's wait_lock
  458 +is taken. This is done by a spin_trylock, because the locking order of the
  459 +pi_lock and wait_lock goes in the opposite direction. If we fail to grab the
  460 +lock, the pi_lock is released, and we restart the loop.
  461 +
  462 +Now that we have both the pi_lock of the task as well as the wait_lock of
  463 +the mutex the task is blocked on, we update the task's waiter's plist node
  464 +that is located on the mutex's wait_list.
  465 +
  466 +Now we release the pi_lock of the task.
  467 +
  468 +Next the owner of the mutex has its pi_lock taken, so we can update the
  469 +task's entry in the owner's pi_list. If the task is the highest priority
  470 +process on the mutex's wait_list, then we remove the previous top waiter
  471 +from the owner's pi_list, and replace it with the task.
  472 +
  473 +Note: It is possible that the task was the current top waiter on the mutex,
  474 + in which case the task is not yet on the pi_list of the waiter. This
  475 + is OK, since plist_del does nothing if the plist node is not on any
  476 + list.
  477 +
  478 +If the task was not the top waiter of the mutex, but it was before we
  479 +did the priority updates, that means we are deboosting/lowering the
  480 +task. In this case, the task is removed from the pi_list of the owner,
  481 +and the new top waiter is added.
  482 +
  483 +Lastly, we unlock both the pi_lock of the task, as well as the mutex's
  484 +wait_lock, and continue the loop again. On the next iteration of the
  485 +loop, the previous owner of the mutex will be the task that will be
  486 +processed.
  487 +
  488 +Note: One might think that the owner of this mutex might have changed
  489 + since we just grabbed the mutex's wait_lock. And one could be right.
  490 + The important thing to remember is that the owner could not have
  491 + become the task that is being processed in the PI chain, since
  492 + we have taken that task's pi_lock at the beginning of the loop.
  493 + So as long as there is an owner of this mutex that is not the same
  494 + process as the task being worked on, we are OK.
  495 +
  496 + Looking closely at the code, one might be confused. The check for the
  497 + end of the PI chain is when the task isn't blocked on anything or the
  498 + task's waiter structure "task" element is NULL. This check is
  499 + protected only by the task's pi_lock. But the code to unlock the mutex
  500 + sets the task's waiter structure "task" element to NULL with only
  501 + the protection of the mutex's wait_lock, which was not taken yet.
  502 + Isn't this a race condition if the task becomes the new owner?
  503 +
  504 + The answer is No! The trick is the spin_trylock of the mutex's
  505 + wait_lock. If we fail that lock, we release the pi_lock of the
  506 + task and continue the loop, doing the end of PI chain check again.
  507 +
  508 + In the code to release the lock, the wait_lock of the mutex is held
  509 + the entire time, and it is not let go when we grab the pi_lock of the
  510 + new owner of the mutex. So if the switch of a new owner were to happen
  511 + after the check for end of the PI chain and the grabbing of the
  512 + wait_lock, the unlocking code would spin on the new owner's pi_lock
  513 + but never give up the wait_lock. So the PI chain loop is guaranteed to
  514 + fail the spin_trylock on the wait_lock, release the pi_lock, and
  515 + try again.
  516 +
  517 + If you don't quite understand the above, that's OK. You don't have to,
  518 + unless you really want to make a proof out of it ;)
  519 +
  520 +
  521 +Pending Owners and Lock stealing
  522 +--------------------------------
  523 +
  524 +One of the flags in the owner field of the mutex structure is "Pending Owner".
  525 +What this means is that an owner was chosen by the process releasing the
  526 +mutex, but that owner has yet to wake up and actually take the mutex.
  527 +
  528 +Why is this important? Why can't we just give the mutex to another process
  529 +and be done with it?
  530 +
  531 +The PI code is to help with real-time processes, and to let the highest
  532 +priority process run as long as possible with little latencies and delays.
  533 +If a high priority process owns a mutex that a lower priority process is
  534 +blocked on, when the mutex is released it would be given to the lower priority
  535 +process. What if the higher priority process wants to take that mutex again?
  536 +The high priority process would fail to take the mutex that it just gave up,
  537 +and it would then have to boost the lower priority process and wait out the
  538 +full latency of that critical section (since the low priority process just
  539 +entered it).
  540 +
  541 +There's no reason a high priority process that gives up a mutex should be
  542 +penalized if it tries to take that mutex again. If the new owner of the
  543 +mutex has not woken up yet, there's no reason that the higher priority process
  544 +could not take that mutex away.
  545 +
  546 +To solve this, we introduced Pending Ownership and Lock Stealing. When a
  547 +new process is given a mutex that it was blocked on, it is only given
  548 +pending ownership. This means that it's the new owner, unless a higher
  549 +priority process comes in and tries to grab that mutex. If a higher priority
  550 +process does come along and wants that mutex, we let the higher priority
  551 +process "steal" the mutex from the pending owner (only if it is still pending)
  552 +and continue with the mutex.
  553 +
  554 +
  555 +Taking of a mutex (The walk through)
  556 +------------------------------------
  557 +
  558 +OK, now let's take a look at the detailed walk through of what happens when
  559 +taking a mutex.
  560 +
  561 +The first thing that is tried is the fast taking of the mutex. This is
  562 +done when we have CMPXCHG enabled (otherwise the fast taking automatically
  563 +fails). Only when the owner field of the mutex is NULL can the lock be
  564 +taken with the CMPXCHG and nothing else needs to be done.
  565 +
  566 +If there is contention on the lock, whether it is owned or has a pending
  567 +owner, we take the slow path (rt_mutex_slowlock).
  568 +
  569 +The slow path function is where the task's waiter structure is created on
  570 +the stack. This is because the waiter structure is only needed for the
  571 +scope of this function. The waiter structure holds the nodes to store
  572 +the task on the wait_list of the mutex, and if need be, the pi_list of
  573 +the owner.
  574 +
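For reference, the on-stack waiter described here carries roughly the
following fields (a simplified, renamed sketch; the real rt_mutex_waiter lives
in the rtmutex headers and its debug variants add more fields):

#include <linux/plist.h>

struct rt_mutex;
struct task_struct;

struct rt_mutex_waiter_sketch {
        struct plist_node       list_entry;     /* node on the mutex's wait_list */
        struct plist_node       pi_list_entry;  /* node on the owner's pi_list */
        struct task_struct      *task;          /* the blocked task */
        struct rt_mutex         *lock;          /* the mutex it is blocked on */
};
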
  575 +The wait_lock of the mutex is taken since the slow path of unlocking the
  576 +mutex also takes this lock.
  577 +
  578 +We then call try_to_take_rt_mutex. This is where an architecture that
  579 +does not implement CMPXCHG would always grab the lock (if there's no
  580 +contention).
  581 +
  582 +try_to_take_rt_mutex is used every time the task tries to grab a mutex in the
  583 +slow path. The first thing that is done here is an atomic setting of
  584 +the "Has Waiters" flag of the mutex's owner field. Yes, this could really
  585 +be false, because if the mutex has no owner, there are no waiters and
  586 +the current task also won't have any waiters. But we don't have the lock
  587 +yet, so we assume we are going to be a waiter. The reason for this is to
  588 +play nice for those architectures that do have CMPXCHG. By setting this flag
  589 +now, the owner of the mutex can't release the mutex without going into the
  590 +slow unlock path, and it would then need to grab the wait_lock, which this
  591 +code currently holds. So setting the "Has Waiters" flag forces the owner
  592 +to synchronize with this code.
  593 +
  594 +Now that we know that we can't have any races with the owner releasing the
  595 +mutex, we check to see if we can take the ownership. This is done if the
  596 +mutex doesn't have an owner, or if we can steal the mutex from a pending
  597 +owner. Let's look at the situations we have here.
  598 +
  599 + 1) Has owner that is pending
  600 + ----------------------------
  601 +
  602 + The mutex has an owner, but it hasn't woken up and the mutex flag
  603 + "Pending Owner" is set. The first check is to see if the owner isn't the
  604 + current task. This is because this function is also used for the pending
  605 + owner to grab the mutex. When a pending owner wakes up, it checks to see
  606 + if it can take the mutex, and this is done if the owner is already set to
  607 + itself. If so, we succeed and leave the function, clearing the "Pending
  608 + Owner" bit.
  609 +
  610 + If the pending owner is not current, we check to see if the current task's
  611 + priority is higher than the pending owner's. If not, we fail and return.
  612 +
  613 + There's also something special about a pending owner. That is, a pending owner
  614 + is never blocked on a mutex. So there is no PI chain to worry about. It also
  615 + means that if the mutex doesn't have any waiters, there's no accounting needed
  616 + to update the pending owner's pi_list, since we only worry about processes
  617 + blocked on the current mutex.
  618 +
  619 + If there are waiters on this mutex, and we just stole the ownership, we need
  620 + to take the top waiter, remove it from the pi_list of the pending owner, and
  621 + add it to the current pi_list. Note that at this moment, the pending owner
  622 + is no longer on the list of waiters. This is fine, since the pending owner
  623 + would add itself back when it realizes that it had the ownership stolen
  624 + from itself. When the pending owner tries to grab the mutex, it will fail
  625 + in try_to_take_rt_mutex if the owner field points to another process.
  626 +
  627 + 2) No owner
  628 + -----------
  629 +
  630 + If there is no owner (or we successfully stole the lock), we set the owner
  631 + of the mutex to current, and set the flag of "Has Waiters" if the current
  632 + mutex actually has waiters, or we clear the flag if it doesn't. See, it was
  633 + OK that we set that flag early, since now it is cleared.
  634 +
  635 + 3) Failed to grab ownership
  636 + ---------------------------
  637 +
  638 + The most interesting case is when we fail to take ownership. This means that
  639 + there exists an owner, or there's a pending owner with equal or higher
  640 + priority than the current task.
  641 +
  642 +We'll continue on the failed case.
  643 +
  644 +If the mutex has a timeout, we set up a timer to go off to break us out
  645 +of this mutex if we failed to get it after a specified amount of time.
  646 +
  647 +Now we enter a loop that will continue to try to take ownership of the mutex, or
  648 +fail from a timeout or signal.
  649 +
  650 +Once again we try to take the mutex. This will usually fail the first time
  651 +in the loop, since it had just failed to get the mutex. But the second time
  652 +in the loop, this would likely succeed, since the task would likely be
  653 +the pending owner.
  654 +
  655 +If the task's state is TASK_INTERRUPTIBLE, a check for signals and timeout
  656 +is done here.
  657 +
  658 +The waiter structure has a "task" field that points to the task that is blocked
  659 +on the mutex. This field can be NULL the first time it goes through the loop
  660 +or if the task is a pending owner and had its mutex stolen. If the "task"
  661 +field is NULL then we need to set up the accounting for it.
  662 +
  663 +Task blocks on mutex
  664 +--------------------
  665 +
  666 +The accounting of a mutex and process is done with the waiter structure of
  667 +the process. The "task" field is set to the process, and the "lock" field
  668 +to the mutex. The plist nodes are initialized to the process's current
  669 +priority.
  670 +
  671 +Since the wait_lock was taken at the entry of the slow lock, we can safely
  672 +add the waiter to the wait_list. If the current process is the highest
  673 +priority process currently waiting on this mutex, then we remove the
  674 +previous top waiter process (if it exists) from the pi_list of the owner,
  675 +and add the current process to that list. Since the pi_list of the owner
  676 +has changed, we call rt_mutex_adjust_prio on the owner to see if the owner
  677 +should adjust its priority accordingly.
  678 +
  679 +If the owner is also blocked on a lock, and had its pi_list changed
  680 +(or deadlock checking is on), we unlock the wait_lock of the mutex and go ahead
  681 +and run rt_mutex_adjust_prio_chain on the owner, as described earlier.
  682 +
  683 +Now all locks are released, and if the current process is still blocked on a
  684 +mutex (waiter "task" field is not NULL), then we go to sleep (call schedule).
  685 +
  686 +Waking up in the loop
  687 +---------------------
  688 +
  689 +The task can then wake up from schedule for a few reasons.
  690 + 1) we were given pending ownership of the mutex.
  691 + 2) we received a signal and were TASK_INTERRUPTIBLE
  692 + 3) we had a timeout and were TASK_INTERRUPTIBLE
  693 +
  694 +In any of these cases, we continue the loop and once again try to grab the
  695 +ownership of the mutex. If we succeed, we exit the loop. On a signal or a
  696 +timeout we also exit the loop; if instead the mutex was stolen from us, we
  697 +simply add ourselves back on the lists and go back to sleep.
  698 +
  699 +Note: For various reasons, because of timeout and signals, the steal mutex
  700 + algorithm needs to be careful. This is because the current process is
  701 + still on the wait_list. And because of dynamic changing of priorities,
  702 + especially on SCHED_OTHER tasks, the current process can be the
  703 + highest priority task on the wait_list.
  704 +
  705 +Failed to get mutex on Timeout or Signal
  706 +----------------------------------------
  707 +
  708 +If a timeout or signal occurred, the waiter's "task" field would not be
  709 +NULL and the task needs to be taken off the wait_list of the mutex and perhaps
  710 +pi_list of the owner. If this process was a high priority process, then
  711 +the rt_mutex_adjust_prio_chain needs to be executed again on the owner,
  712 +but this time it will be lowering the priorities.
  713 +
  714 +
  715 +Unlocking the Mutex
  716 +-------------------
  717 +
  718 +The unlocking of a mutex also has a fast path for those architectures with
  719 +CMPXCHG. Since the taking of a mutex on contention always sets the
  720 +"Has Waiters" flag of the mutex's owner, we use this to know if we need to
  721 +take the slow path when unlocking the mutex. If the mutex doesn't have any
  722 +waiters, the owner field of the mutex would equal the current process and
  723 +the mutex can be unlocked by just replacing the owner field with NULL.
  724 +
  725 +If the owner field has the "Has Waiters" bit set (or CMPXCHG is not available),
  726 +the slow unlock path is taken.
  727 +
  728 +The first thing done in the slow unlock path is to take the wait_lock of the
  729 +mutex. This synchronizes the locking and unlocking of the mutex.
  730 +
  731 +A check is made to see if the mutex has waiters or not. On architectures that
  732 +do not have CMPXCHG, this is the location that the owner of the mutex will
  733 +determine if a waiter needs to be awoken or not. On architectures that
  734 +do have CMPXCHG, that check is done in the fast path, but it is still needed
  735 +in the slow path too. If a waiter of a mutex woke up because of a signal
  736 +or timeout between the time the owner failed the fast path CMPXCHG check and
  737 +the grabbing of the wait_lock, the mutex may not have any waiters, thus the
  738 +owner still needs to make this check. If there are no waiters then the mutex
  739 +owner field is set to NULL, the wait_lock is released and nothing more is
  740 +needed.
  741 +
  742 +If there are waiters, then we need to wake one up and give that waiter
  743 +pending ownership.
  744 +
  745 +In the wake up code, the pi_lock of the current owner is taken. The top
  746 +waiter of the lock is found and removed from the wait_list of the mutex
  747 +as well as the pi_list of the current owner. The task field of the new
  748 +pending owner's waiter structure is set to NULL, and the owner field of the
  749 +mutex is set to the new owner with the "Pending Owner" bit set, as well
  750 +as the "Has Waiters" bit if there still are other processes blocked on the
  751 +mutex.
  752 +
  753 +The pi_lock of the previous owner is released, and the new pending owner's
  754 +pi_lock is taken. Remember that this is the trick to prevent the race
  755 +condition in rt_mutex_adjust_prio_chain from adding itself as a waiter
  756 +on the mutex.
  757 +
  758 +We now clear the "pi_blocked_on" field of the new pending owner, and if
  759 +the mutex still has waiters pending, we add the new top waiter to the pi_list
  760 +of the pending owner.
  761 +
  762 +Finally we unlock the pi_lock of the pending owner and wake it up.
  763 +
  764 +
  765 +Contact
  766 +-------
  767 +
  768 +For updates on this document, please email Steven Rostedt <rostedt@goodmis.org>
  769 +
  770 +
  771 +Credits
  772 +-------
  773 +
  774 +Author: Steven Rostedt <rostedt@goodmis.org>
  775 +
  776 +Reviewers: Ingo Molnar, Thomas Gleixner, Thomas Duetsch, and Randy Dunlap
  777 +
  778 +Updates
  779 +-------
  780 +
  781 +This document was originally written for 2.6.17-rc3-mm1
Documentation/rt-mutex.txt
  1 +RT-mutex subsystem with PI support
  2 +----------------------------------
  3 +
  4 +RT-mutexes with priority inheritance are used to support PI-futexes,
  5 +which enable pthread_mutex_t priority inheritance attributes
  6 +(PTHREAD_PRIO_INHERIT). [See Documentation/pi-futex.txt for more details
  7 +about PI-futexes.]
  8 +
  9 +This technology was developed in the -rt tree and streamlined for
  10 +pthread_mutex support.
  11 +
  12 +Basic principles:
  13 +-----------------
  14 +
  15 +RT-mutexes extend the semantics of simple mutexes by the priority
  16 +inheritance protocol.
  17 +
  18 +A low priority owner of a rt-mutex inherits the priority of a higher
  19 +priority waiter until the rt-mutex is released. If the temporarily
  20 +boosted owner blocks on a rt-mutex itself it propagates the priority
  21 +boosting to the owner of the other rt_mutex it gets blocked on. The
  22 +priority boosting is immediately removed once the rt_mutex has been
  23 +unlocked.
  24 +
  25 +This approach allows us to shorten the block of high-prio tasks on
  26 +mutexes which protect shared resources. Priority inheritance is not a
  27 +magic bullet for poorly designed applications, but it allows
  28 +well-designed applications to use userspace locks in critical parts of
  29 +a high priority thread, without losing determinism.
  30 +
  31 +The enqueueing of the waiters into the rtmutex waiter list is done in
  32 +priority order. For same priorities FIFO order is chosen. For each
  33 +rtmutex, only the top priority waiter is enqueued into the owner's
  34 +priority waiters list. This list too queues in priority order. Whenever
  35 +the top priority waiter of a task changes (for example it timed out or
  36 +got a signal), the priority of the owner task is readjusted. [The
  37 +priority enqueueing is handled by "plists", see include/linux/plist.h
  38 +for more details.]
  39 +
  40 +RT-mutexes are optimized for fastpath operations and have no internal
  41 +locking overhead when locking an uncontended mutex or unlocking a mutex
  42 +without waiters. The optimized fastpath operations require cmpxchg
  43 +support. [If that is not available then the rt-mutex internal spinlock
  44 +is used]
  45 +
  46 +The state of the rt-mutex is tracked via the owner field of the rt-mutex
  47 +structure:
  48 +
  49 +rt_mutex->owner holds the task_struct pointer of the owner. Bits 0 and 1
  50 +are used to keep track of the "owner is pending" and "rtmutex has
  51 +waiters" state.
  52 +
  53 + owner        bit1  bit0
  54 + NULL         0     0     mutex is free (fast acquire possible)
  55 + NULL         0     1     invalid state
  56 + NULL         1     0     Transitional state*
  57 + NULL         1     1     invalid state
  58 + taskpointer  0     0     mutex is held (fast release possible)
  59 + taskpointer  0     1     task is pending owner
  60 + taskpointer  1     0     mutex is held and has waiters
  61 + taskpointer  1     1     task is pending owner and mutex has waiters
  62 +
  63 +Pending-ownership handling is a performance optimization:
  64 +pending-ownership is assigned to the first (highest priority) waiter of
  65 +the mutex, when the mutex is released. The thread is woken up and once
  66 +it starts executing it can acquire the mutex. Until the mutex is taken
  67 +by it (bit 0 is cleared) a competing higher priority thread can "steal"
  68 +the mutex which puts the woken up thread back on the waiters list.
  69 +
  70 +The pending-ownership optimization is especially important for the
  71 +uninterrupted workflow of high-prio tasks which repeatedly
  72 +take/release locks that have lower-prio waiters. Without this
  73 +optimization the higher-prio thread would ping-pong to the lower-prio
  74 +task [because at unlock time we always assign a new owner].
  75 +
  76 +(*) The "mutex has waiters" bit gets set to take the lock. If the lock
  77 +doesn't already have an owner, this bit is quickly cleared if there are
  78 +no waiters. So this is a transitional state to synchronize with looking
  79 +at the owner field of the mutex and the mutex owner releasing the lock.