Commit a6537be9324c67b41f6d98f5a60a1bd5a8e02861

Authored by Steven Rostedt
Committed by Linus Torvalds
1 parent 23f78d4a03

[PATCH] pi-futex: rt mutex docs

Add rt-mutex documentation.

[rostedt@goodmis.org: Update rt-mutex-design.txt as per Randy Dunlap suggestions]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

Showing 3 changed files with 981 additions and 0 deletions

Documentation/pi-futex.txt
  1 +Lightweight PI-futexes
  2 +----------------------
  3 +
  4 +We are calling them lightweight for 3 reasons:
  5 +
  6 + - in the user-space fastpath a PI-enabled futex involves no kernel work
  7 + (or any other PI complexity) at all. No registration, no extra kernel
  8 + calls - just pure fast atomic ops in userspace.
  9 +
  10 + - even in the slowpath, the system call and scheduling pattern is very
  11 + similar to normal futexes.
  12 +
  13 + - the in-kernel PI implementation is streamlined around the mutex
  14 + abstraction, with strict rules that keep the implementation
  15 + relatively simple: only a single owner may own a lock (i.e. no
  16 + read-write lock support), only the owner may unlock a lock, no
  17 + recursive locking, etc.
  18 +
  19 +Priority Inheritance - why?
  20 +---------------------------
  21 +
  22 +The short reply: user-space PI helps achieve/improve determinism for
  23 +user-space applications. In the best-case, it can help achieve
  24 +determinism and well-bound latencies. Even in the worst-case, PI will
  25 +improve the statistical distribution of locking related application
  26 +delays.
  27 +
  28 +The longer reply:
  29 +-----------------
  30 +
  31 +Firstly, sharing locks between multiple tasks is a common programming
  32 +technique that often cannot be replaced with lockless algorithms. As we
  33 +can see in the kernel [which is a quite complex program in itself],
  34 +lockless structures are the exception rather than the norm - the current
  35 +ratio of lockless vs. locky code for shared data structures is somewhere
  36 +between 1:10 and 1:100. Lockless is hard, and the complexity of lockless
  37 +algorithms often endangers the ability to do robust reviews of said code.
  38 +I.e. critical RT apps often choose lock structures to protect critical
  39 +data structures, instead of lockless algorithms. Furthermore, there are
  40 +cases (like shared hardware, or other resource limits) where lockless
  41 +access is mathematically impossible.
  42 +
  43 +Media players (such as Jack) are an example of reasonable application
  44 +design with multiple tasks (with multiple priority levels) sharing
  45 +short-held locks: for example, a highprio audio playback thread is
  46 +combined with medium-prio construct-audio-data threads and low-prio
  47 +display-colory-stuff threads. Add video and decoding to the mix and
  48 +we've got even more priority levels.
  49 +
  50 +So once we accept that synchronization objects (locks) are an
  51 +unavoidable fact of life, and once we accept that multi-task userspace
  52 +apps have a very fair expectation of being able to use locks, we've got
  53 +to think about how to offer the option of a deterministic locking
  54 +implementation to user-space.
  55 +
  56 +Most of the technical counter-arguments against doing priority
  57 +inheritance only apply to kernel-space locks. But user-space locks are
  58 +different, there we cannot disable interrupts or make the task
  59 +non-preemptible in a critical section, so the 'use spinlocks' argument
  60 +does not apply (user-space spinlocks have the same priority inversion
  61 +problems as other user-space locking constructs). Fact is, pretty much
  62 +the only technique that currently enables good determinism for userspace
  63 +locks (such as futex-based pthread mutexes) is priority inheritance:
  64 +
  65 +Currently (without PI), if a high-prio and a low-prio task share a lock
  66 +[this is a quite common scenario for most non-trivial RT applications],
  67 +even if all critical sections are coded carefully to be deterministic
  68 +(i.e. all critical sections are short in duration and only execute a
  69 +limited number of instructions), the kernel cannot guarantee any
  70 +deterministic execution of the high-prio task: any medium-priority task
  71 +could preempt the low-prio task while it holds the shared lock and
  72 +executes the critical section, and could delay it indefinitely.
  73 +
  74 +Implementation:
  75 +---------------
  76 +
  77 +As mentioned before, the userspace fastpath of PI-enabled pthread
  78 +mutexes involves no kernel work at all - they behave quite similarly to
  79 +normal futex-based locks: a 0 value means unlocked, and a value==TID
  80 +means locked. (This is the same method as used by list-based robust
  81 +futexes.) Userspace uses atomic ops to lock/unlock these mutexes without
  82 +entering the kernel.
  83 +
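As a concrete illustration of the fastpath/slowpath split, here is a minimal,
hypothetical sketch of the lock side. The helper name, the GCC __sync builtin
and the raw futex syscall invocation are assumptions of this sketch, not part
of the ABI description above:

#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

/* Hypothetical helper: 'uaddr' is the shared 32-bit futex word, 'tid' is
 * the caller's kernel thread id (as returned by gettid()). */
static void pi_lock(uint32_t *uaddr, uint32_t tid)
{
        /* Fastpath: atomically turn 0 (unlocked) into our TID. */
        if (__sync_val_compare_and_swap(uaddr, 0, tid) == 0)
                return;         /* uncontended - no kernel entry at all */

        /* Slowpath: the kernel queues us, sets FUTEX_WAITERS, boosts the
         * owner via the attached rt-mutex and eventually hands us the lock. */
        syscall(SYS_futex, uaddr, FUTEX_LOCK_PI, 0, NULL, NULL, 0);
}
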
  84 +To handle the slowpath, we have added two new futex ops:
  85 +
  86 + FUTEX_LOCK_PI
  87 + FUTEX_UNLOCK_PI
  88 +
  89 +If the lock-acquire fastpath fails, [i.e. an atomic transition from 0 to
  90 +TID fails], then FUTEX_LOCK_PI is called. The kernel does all the
  91 +remaining work: if there is no futex-queue attached to the futex address
  92 +yet then the code looks up the task that owns the futex [it has put its
  93 +own TID into the futex value], and attaches a 'PI state' structure to
  94 +the futex-queue. The pi_state includes an rt-mutex, which is a PI-aware,
  95 +kernel-based synchronization object. The 'other' task is made the owner
  96 +of the rt-mutex, and the FUTEX_WAITERS bit is atomically set in the
  97 +futex value. Then this task tries to lock the rt-mutex, on which it
  98 +blocks. Once it returns, it has the mutex acquired, and it sets the
  99 +futex value to its own TID and returns. Userspace has no other work to
  100 +perform - it now owns the lock, and the futex value contains
  101 +FUTEX_WAITERS|TID.
  102 +
  103 +If the unlock side fastpath succeeds, [i.e. userspace manages to do a
  104 +TID -> 0 atomic transition of the futex value], then no kernel work is
  105 +triggered.
  106 +
  107 +If the unlock fastpath fails (because the FUTEX_WAITERS bit is set),
  108 +then FUTEX_UNLOCK_PI is called, and the kernel unlocks the futex on
  109 +behalf of userspace - and it also unlocks the attached
  110 +pi_state->rt_mutex and thus wakes up any potential waiters.
  111 +
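The unlock side mirrors this; again a hedged sketch rather than a description
of any particular libc implementation:

/* Hypothetical counterpart to the pi_lock() sketch above. */
static void pi_unlock(uint32_t *uaddr, uint32_t tid)
{
        /* Fastpath: atomically turn our bare TID back into 0. This can
         * only succeed while FUTEX_WAITERS is not set, i.e. nobody queued. */
        if (__sync_val_compare_and_swap(uaddr, tid, 0) == tid)
                return;         /* no waiters - no kernel entry */

        /* Slowpath: the kernel fixes up the futex value and unlocks the
         * attached pi_state->rt_mutex, waking the top waiter. */
        syscall(SYS_futex, uaddr, FUTEX_UNLOCK_PI, 0, NULL, NULL, 0);
}
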
  112 +Note that under this approach, contrary to previous PI-futex approaches,
  113 +there is no prior 'registration' of a PI-futex. [which is not quite
  114 +possible anyway, due to existing ABI properties of pthread mutexes.]
  115 +
  116 +Also, under this scheme, 'robustness' and 'PI' are two orthogonal
  117 +properties of futexes, and all four combinations are possible: futex,
  118 +robust-futex, PI-futex, robust+PI-futex.
  119 +
  120 +More details about priority inheritance can be found in
  121 +Documentation/rt-mutex.txt.
Documentation/rt-mutex-design.txt
  1 +#
  2 +# Copyright (c) 2006 Steven Rostedt
  3 +# Licensed under the GNU Free Documentation License, Version 1.2
  4 +#
  5 +
  6 +RT-mutex implementation design
  7 +------------------------------
  8 +
  9 +This document tries to describe the design of the rtmutex.c implementation.
  10 +It doesn't describe the reasons why rtmutex.c exists. For that please see
  11 +Documentation/rt-mutex.txt. Although this document does explain problems
  12 +that happen without this code, it does so only to the extent needed to
  13 +understand what the code actually does.
  14 +
  15 +The goal of this document is to help others understand the priority
  16 +inheritance (PI) algorithm that is used, as well as reasons for the
  17 +decisions that were made to implement PI in the manner that was done.
  18 +
  19 +
  20 +Unbounded Priority Inversion
  21 +----------------------------
  22 +
  23 +Priority inversion is when a lower priority process executes while a higher
  24 +priority process wants to run. This happens for several reasons, and
  25 +most of the time it can't be helped. Anytime a high priority process wants
  26 +to use a resource that a lower priority process has (a mutex for example),
  27 +the high priority process must wait until the lower priority process is done
  28 +with the resource. This is a priority inversion. What we want to prevent
  29 +is something called unbounded priority inversion. That is when the high
  30 +priority process is prevented from running by a lower priority process for
  31 +an undetermined amount of time.
  32 +
  33 +The classic example of unbounded priority inversion is where you have three
  34 +processes, let's call them processes A, B, and C, where A is the highest
  35 +priority process, C is the lowest, and B is in between. A tries to grab a lock
  36 +that C owns, so it must wait and let C run to release the lock. But in the
  37 +meantime, B executes, and since B is of a higher priority than C, it preempts C,
  38 +but by doing so, it is in fact preempting A which is a higher priority process.
  39 +Now there's no way of knowing how long A will be sleeping waiting for C
  40 +to release the lock, because for all we know, B is a CPU hog and will
  41 +never give C a chance to release the lock. This is called unbounded priority
  42 +inversion.
  43 +
  44 +Here's a little ASCII art to show the problem.
  45 +
  46 +   grab lock L1 (owned by C)
  47 +     |
  48 +A ---+
  49 +        C preempted by B
  50 +          |
  51 +C    +----+
  52 +
  53 +B         +-------->
  54 +                B now keeps A from running.
  55 +
  56 +
  57 +Priority Inheritance (PI)
  58 +-------------------------
  59 +
  60 +There are several ways to solve this issue, but other ways are out of scope
  61 +for this document. Here we only discuss PI.
  62 +
  63 +PI is where a process inherits the priority of another process if the other
  64 +process blocks on a lock owned by the current process. To make this easier
  65 +to understand, let's use the previous example, with processes A, B, and C again.
  66 +
  67 +This time, when A blocks on the lock owned by C, C would inherit the priority
  68 +of A. So now if B becomes runnable, it would not preempt C, since C now has
  69 +the high priority of A. As soon as C releases the lock, it loses its
  70 +inherited priority, and A then can continue with the resource that C had.
  71 +
  72 +Terminology
  73 +-----------
  74 +
  75 +Here I explain some terminology that is used in this document to help describe
  76 +the design that is used to implement PI.
  77 +
  78 +PI chain - The PI chain is an ordered series of locks and processes that cause
  79 + processes to inherit priorities from a previous process that is
  80 + blocked on one of its locks. This is described in more detail
  81 + later in this document.
  82 +
  83 +mutex - In this document, to differentiate the locks that implement
  84 + PI from the spin locks that are used in the PI code, from now on
  85 + the PI locks will be called mutexes.
  86 +
  87 +lock - In this document from now on, I will use the term lock when
  88 + referring to spin locks that are used to protect parts of the PI
  89 + algorithm. These locks disable preemption for UP (when
  90 + CONFIG_PREEMPT is enabled) and on SMP prevent multiple CPUs from
  91 + entering critical sections simultaneously.
  92 +
  93 +spin lock - Same as lock above.
  94 +
  95 +waiter - A waiter is a struct that is stored on the stack of a blocked
  96 + process. Since the scope of the waiter is within the code for
  97 + a process being blocked on the mutex, it is fine to allocate
  98 + the waiter on the process's stack (local variable). This
  99 + structure holds a pointer to the task, as well as the mutex that
  100 + the task is blocked on. It also has the plist node structures to
  101 + place the task in the wait_list of a mutex as well as the
  102 + pi_list of a mutex owner task (described below).
  103 +
  104 + waiter is sometimes used in reference to the task that is waiting
  105 + on a mutex. This is the same as waiter->task.
  106 +
  107 +waiters - A list of processes that are blocked on a mutex.
  108 +
  109 +top waiter - The highest priority process waiting on a specific mutex.
  110 +
  111 +top pi waiter - The highest priority process waiting on one of the mutexes
  112 + that a specific process owns.
  113 +
  114 +Note: task and process are used interchangeably in this document, mostly to
  115 + differentiate between two processes that are being described together.
  116 +
  117 +
  118 +PI chain
  119 +--------
  120 +
  121 +The PI chain is a list of processes and mutexes that may cause priority
  122 +inheritance to take place. Multiple chains may converge, but a chain
  123 +would never diverge, since a process can't be blocked on more than one
  124 +mutex at a time.
  125 +
  126 +Example:
  127 +
  128 + Process: A, B, C, D, E
  129 + Mutexes: L1, L2, L3, L4
  130 +
  131 + A owns: L1
  132 +         B blocked on L1
  133 +         B owns L2
  134 +                C blocked on L2
  135 +                C owns L3
  136 +                       D blocked on L3
  137 +                       D owns L4
  138 +                              E blocked on L4
  139 +
  140 +The chain would be:
  141 +
  142 + E->L4->D->L3->C->L2->B->L1->A
  143 +
  144 +To show where two chains merge, we could add another process F and
  145 +another mutex L5 where B owns L5 and F is blocked on mutex L5.
  146 +
  147 +The chain for F would be:
  148 +
  149 + F->L5->B->L1->A
  150 +
  151 +Since a process may own more than one mutex, but never be blocked on more than
  152 +one, the chains merge.
  153 +
  154 +Here we show both chains:
  155 +
  156 +      E->L4->D->L3->C->L2-+
  157 +                          |
  158 +                          +->B->L1->A
  159 +                          |
  160 +                    F->L5-+
  161 +
  162 +For PI to work, the processes at the right end of these chains (or we may
  163 +also call it the Top of the chain) must be equal to or higher in priority
  164 +than the processes to the left or below in the chain.
  165 +
  166 +Also since a mutex may have more than one process blocked on it, we can
  167 +have multiple chains merge at mutexes. If we add another process G that is
  168 +blocked on mutex L2:
  169 +
  170 + G->L2->B->L1->A
  171 +
  172 +And once again, to show how this can grow I will show the merging chains
  173 +again.
  174 +
  175 +      E->L4->D->L3->C-+
  176 +                      +->L2-+
  177 +                      |     |
  178 +                    G-+     +->B->L1->A
  179 +                            |
  180 +                      F->L5-+
  181 +
  182 +
  183 +Plist
  184 +-----
  185 +
  186 +Before I go further and talk about how the PI chain is stored through lists
  187 +on both mutexes and processes, I'll explain the plist. This is similar to
  188 +the struct list_head functionality that is already in the kernel.
  189 +The implementation of plist is out of scope for this document, but it is
  190 +very important to understand what it does.
  191 +
  192 +There are a few differences between plist and list, the most important one
  193 +being that plist is a priority sorted linked list. This means that the
  194 +priorities of the plist are sorted, such that it takes O(1) to retrieve the
  195 +highest priority item in the list. Obviously this is useful to store processes
  196 +based on their priorities.
  197 +
  198 +Another difference, which is important for implementation, is that, unlike
  199 +list, the head of the list is a different element than the nodes of a list.
  200 +So the head of the list is declared as struct plist_head and nodes that will
  201 +be added to the list are declared as struct plist_node.
  202 +
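As a rough illustration of the plist primitives that the rt-mutex code relies
on, consider the sketch below. It is only an illustration: it assumes a
wait_list head that has already been initialized, and the head-initialization
interface has varied between kernel versions.

#include <linux/plist.h>

/* Queue a node by priority and check whether it became the top (highest
 * priority, i.e. lowest numerical prio) entry of the list. */
static int plist_example(struct plist_head *wait_list,
                         struct plist_node *node, int prio)
{
        int is_top;

        plist_node_init(node, prio);    /* the node carries its priority */
        plist_add(node, wait_list);     /* sorted insert by priority */

        /* plist_first() returns the highest priority node in O(1). */
        is_top = (plist_first(wait_list) == node);

        plist_del(node, wait_list);     /* removal needs node and head */
        return is_top;
}
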
  203 +
  204 +Mutex Waiter List
  205 +-----------------
  206 +
  207 +Every mutex keeps track of all the waiters that are blocked on it. The mutex
  208 +has a plist to store these waiters by priority. This list is protected by
  209 +a spin lock that is located in the struct of the mutex. This lock is called
  210 +wait_lock. Since the modification of the waiter list is never done in
  211 +interrupt context, the wait_lock can be taken without disabling interrupts.
  212 +
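Put together, the mutex described here boils down to roughly the following
fields (a simplified sketch with debug fields omitted and the structure
renamed; the owner field and its flag bits are covered in a later section,
and include/linux/rtmutex.h holds the authoritative definition):

#include <linux/spinlock.h>
#include <linux/plist.h>

struct task_struct;

struct rt_mutex_sketch {
        spinlock_t              wait_lock;  /* protects wait_list */
        struct plist_head       wait_list;  /* waiters, sorted by priority */
        struct task_struct      *owner;     /* owner pointer plus 2 flag bits */
};
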
  213 +
  214 +Task PI List
  215 +------------
  216 +
  217 +To keep track of the PI chains, each process has its own PI list. This is
  218 +a list of all top waiters of the mutexes that are owned by the process.
  219 +Note that this list only holds the top waiters and not all waiters that are
  220 +blocked on mutexes owned by the process.
  221 +
  222 +The top of the task's PI list is always the highest priority task that
  223 +is waiting on a mutex that is owned by the task. So if the task has
  224 +inherited a priority, it will always be the priority of the task that is
  225 +at the top of this list.
  226 +
  227 +This list is stored in the task structure of a process as a plist called
  228 +pi_list. This list is protected by a spin lock also in the task structure,
  229 +called pi_lock. This lock may also be taken in interrupt context, so when
  230 +locking the pi_lock, interrupts must be disabled.
  231 +
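Because the pi_lock can be taken from interrupt context, any code that takes
it from process context has to disable interrupts. A generic sketch of that
pattern follows (not a quote from rtmutex.c; lock types and field names have
changed between kernel versions):

#include <linux/sched.h>
#include <linux/spinlock.h>

static void inspect_pi_list(struct task_struct *task)
{
        unsigned long flags;

        spin_lock_irqsave(&task->pi_lock, flags);
        /* ... walk or update the task's pi_list here ... */
        spin_unlock_irqrestore(&task->pi_lock, flags);
}
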
  232 +
  233 +Depth of the PI Chain
  234 +---------------------
  235 +
  236 +The maximum depth of the PI chain is not dynamic, and could actually be
  237 +defined. But it is very complex to figure out, since it depends on all
  238 +the nesting of mutexes. Let's look at the example where we have 3 mutexes,
  239 +L1, L2, and L3, and four separate functions func1, func2, func3 and func4.
  240 +The following shows a locking order of L1->L2->L3, but may not actually
  241 +be directly nested that way.
  242 +
  243 +void func1(void)
  244 +{
  245 +        mutex_lock(L1);
  246 +
  247 +        /* do anything */
  248 +
  249 +        mutex_unlock(L1);
  250 +}
  251 +
  252 +void func2(void)
  253 +{
  254 +        mutex_lock(L1);
  255 +        mutex_lock(L2);
  256 +
  257 +        /* do something */
  258 +
  259 +        mutex_unlock(L2);
  260 +        mutex_unlock(L1);
  261 +}
  262 +
  263 +void func3(void)
  264 +{
  265 +        mutex_lock(L2);
  266 +        mutex_lock(L3);
  267 +
  268 +        /* do something else */
  269 +
  270 +        mutex_unlock(L3);
  271 +        mutex_unlock(L2);
  272 +}
  273 +
  274 +void func4(void)
  275 +{
  276 +        mutex_lock(L3);
  277 +
  278 +        /* do something again */
  279 +
  280 +        mutex_unlock(L3);
  281 +}
  282 +
  283 +Now we add 4 processes that run each of these functions separately.
  284 +Processes A, B, C, and D run functions func1, func2, func3 and func4
  285 +respectively, with D running first and A last. With D being preempted
  286 +in func4 in the "do something again" area, we get the following locking order:
  287 +
  288 +D owns L3
  289 +       C blocked on L3
  290 +       C owns L2
  291 +              B blocked on L2
  292 +              B owns L1
  293 +                     A blocked on L1
  294 +
  295 +And thus we have the chain A->L1->B->L2->C->L3->D.
  296 +
  297 +This gives us a PI depth of 4 (four processes), but looking at any of the
  298 +functions individually, it seems as though they only have at most a locking
  299 +depth of two. So, although the locking depth is defined at compile time,
  300 +it is still very difficult to determine the possible depths that can result.
  301 +
  302 +Now since mutexes can be defined by user-land applications, we don't want a DOS
  303 +type of application that nests large amounts of mutexes to create a large
  304 +PI chain, and have the code holding spin locks while looking at a large
  305 +amount of data. So to prevent this, the implementation not only implements
  306 +a maximum lock depth, but also only holds at most two different locks at a
  307 +time, as it walks the PI chain. More about this below.
  308 +
  309 +
  310 +Mutex owner and flags
  311 +---------------------
  312 +
  313 +The mutex structure contains a pointer to the owner of the mutex. If the
  314 +mutex is not owned, this owner is set to NULL. Since all architectures
  315 +have the task structure on at least a four byte alignment (and if this is
  316 +not true, the rtmutex.c code will be broken!), this allows for the two
  317 +least significant bits to be used as flags. This part is also described
  318 +in Documentation/rt-mutex.txt, but will also be briefly described here.
  319 +
  320 +Bit 0 is used as the "Pending Owner" flag. This is described later.
  321 +Bit 1 is used as the "Has Waiters" flag. This is also described later
  322 + in more detail, but is set whenever there are waiters on a mutex.
  323 +
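Schematically, the flag handling amounts to masking off the two low bits of
the owner pointer. The helper names and macros below are inventions of this
illustration; the bit assignments follow the text above:

struct task_struct;

#define PENDING_OWNER_FLAG      1UL     /* bit 0: pending owner */
#define HAS_WAITERS_FLAG        2UL     /* bit 1: has waiters */
#define OWNER_FLAG_MASK         (PENDING_OWNER_FLAG | HAS_WAITERS_FLAG)

/* Strip the flag bits to recover the task pointer; valid only because
 * task structures are at least 4-byte aligned. */
static inline struct task_struct *owner_task(unsigned long owner_field)
{
        return (struct task_struct *)(owner_field & ~OWNER_FLAG_MASK);
}

static inline int owner_is_pending(unsigned long owner_field)
{
        return owner_field & PENDING_OWNER_FLAG;
}
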
  324 +
  325 +cmpxchg Tricks
  326 +--------------
  327 +
  328 +Some architectures implement an atomic cmpxchg (Compare and Exchange). This
  329 +is used (when applicable) to keep the fast path of grabbing and releasing
  330 +mutexes short.
  331 +
  332 +cmpxchg is basically the following function performed atomically:
  333 +
  334 +unsigned long _cmpxchg(unsigned long *A, unsigned long *B, unsigned long *C)
  335 +{
  336 +        unsigned long T = *A;
  337 +        if (*A == *B) {
  338 +                *A = *C;
  339 +        }
  340 +        return T;
  341 +}
  342 +#define cmpxchg(a,b,c) _cmpxchg(&a,&b,&c)
  343 +
  344 +This is really nice to have, since it allows you to only update a variable
  345 +if the variable is what you expect it to be. You know if it succeeded if
  346 +the return value (the old value of A) is equal to B.
  347 +
  348 +The macro rt_mutex_cmpxchg is used to try to lock and unlock mutexes. If
  349 +the architecture does not support CMPXCHG, then this macro is simply set
  350 +to fail every time. But if CMPXCHG is supported, then this helps
  351 +enormously in keeping the fast path short.
  352 +
  353 +The use of rt_mutex_cmpxchg with the flags in the owner field helps optimize
  354 +the system for architectures that support it. This will also be explained
  355 +later in this document.
  356 +
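To illustrate how cmpxchg keeps the fast path short, a lock attempt could
look roughly like the sketch below, reusing the hypothetical rt_mutex_sketch
structure from earlier. The real helpers in rtmutex.c differ in their details:

#include <linux/sched.h>        /* for current */

static void example_slowlock(struct rt_mutex_sketch *lock); /* hypothetical */

static void example_lock(struct rt_mutex_sketch *lock)
{
        /* If the owner field is NULL (free, no flag bits set), atomically
         * install ourselves as the owner and we are done. */
        if (cmpxchg(&lock->owner, NULL, current) == NULL)
                return;                 /* fastpath: lock acquired */

        example_slowlock(lock);         /* contended: take the slow path */
}
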
  357 +
  358 +Priority adjustments
  359 +--------------------
  360 +
  361 +The implementation of the PI code in rtmutex.c has several places where a
  362 +process must adjust its priority. With the help of the pi_list of a
  363 +process, it is rather easy to know what needs to be adjusted.
  364 +
  365 +The functions implementing the task adjustments are rt_mutex_adjust_prio,
  366 +__rt_mutex_adjust_prio (same as the former, but expects the task pi_lock
  367 +to already be taken), rt_mutex_getprio, and rt_mutex_setprio.
  368 +
  369 +rt_mutex_getprio and rt_mutex_setprio are only used in __rt_mutex_adjust_prio.
  370 +
  371 +rt_mutex_getprio returns the priority that the task should have. Either the
  372 +task's own normal priority, or if a process of a higher priority is waiting on
  373 +a mutex owned by the task, then that higher priority should be returned.
  374 +Since the pi_list of a task holds a priority-ordered list of all the top
  375 +waiters of all the mutexes that the task owns, rt_mutex_getprio simply needs
  376 +to compare the top pi waiter to its own normal priority, and return the higher
  377 +priority back.
  378 +
  379 +(Note: if looking at the code, you will notice that the lower number of
  380 + prio is returned. This is because the prio field in the task structure
  381 + is an inverse order of the actual priority. So a "prio" of 5 is
  382 + of higher priority than a "prio" of 10.)
  383 +
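Put into approximate code, rt_mutex_getprio boils down to something like the
sketch below; treat it as an illustration, not a verbatim copy of
kernel/rtmutex.c:

int rt_mutex_getprio_sketch(struct task_struct *task)
{
        if (!task_has_pi_waiters(task))
                return task->normal_prio;

        /* Lower numerical prio means higher priority, hence min(). */
        return min(task_top_pi_waiter(task)->pi_list_entry.prio,
                   task->normal_prio);
}
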
  384 +__rt_mutex_adjust_prio examines the result of rt_mutex_getprio, and if the
  385 +result does not equal the task's current priority, then rt_mutex_setprio
  386 +is called to adjust the priority of the task to the new priority.
  387 +Note that rt_mutex_setprio is defined in kernel/sched.c to implement the
  388 +actual change in priority.
  389 +
  390 +It is interesting to note that __rt_mutex_adjust_prio can either increase
  391 +or decrease the priority of the task. In the case that a higher priority
  392 +process has just blocked on a mutex owned by the task, __rt_mutex_adjust_prio
  393 +would increase/boost the task's priority. But if a higher priority task
  394 +were for some reason to leave the mutex (timeout or signal), this same function
  395 +would decrease/unboost the priority of the task. That is because the pi_list
  396 +always contains the highest priority task that is waiting on a mutex owned
  397 +by the task, so we only need to compare the priority of that top pi waiter
  398 +to the normal priority of the given task.
  399 +
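In the same spirit, __rt_mutex_adjust_prio looks roughly like this sketch
(approximate rather than verbatim):

static void adjust_prio_sketch(struct task_struct *task)
{
        int prio = rt_mutex_getprio_sketch(task);

        /* Apply the new effective priority via the scheduler if it changed,
         * in either direction (boost or deboost). */
        if (task->prio != prio)
                rt_mutex_setprio(task, prio);   /* lives in kernel/sched.c */
}
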
  400 +
  401 +High level overview of the PI chain walk
  402 +----------------------------------------
  403 +
  404 +The PI chain walk is implemented by the function rt_mutex_adjust_prio_chain.
  405 +
  406 +The implementation has gone through several iterations, and has ended up
  407 +with what we believe is the best. It walks the PI chain by only grabbing
  408 +at most two locks at a time, and is very efficient.
  409 +
  410 +The rt_mutex_adjust_prio_chain can be used either to boost or lower process
  411 +priorities.
  412 +
  413 +rt_mutex_adjust_prio_chain is called with a task to be checked for PI
  414 +(de)boosting (the owner of a mutex that a process is blocking on), a flag to
  415 +check for deadlocking, the mutex that the task owns, and a pointer to a waiter
  416 +that is the process's waiter struct that is blocked on the mutex (although this
  417 +parameter may be NULL for deboosting).
  418 +
  419 +For this explanation, I will not mention deadlock detection. This explanation
  420 +will try to stay at a high level.
  421 +
  422 +When this function is called, there are no locks held. That also means
  423 +that the state of the owner and lock can change while inside this function.
  424 +
  425 +Before this function is called, the task has already had rt_mutex_adjust_prio
  426 +performed on it. This means that the task is set to the priority that it
  427 +should be at, but the plist nodes of the task's waiter have not been updated
  428 +with the new priorities, and that this task may not be in the proper locations
  429 +in the pi_lists and wait_lists that the task is blocked on. This function
  430 +solves all that.
  431 +
  432 +A loop is entered, where task is the owner to be checked for PI changes that
  433 +was passed by parameter (for the first iteration). The pi_lock of this task is
  434 +taken to prevent any more changes to the pi_list of the task. This also
  435 +prevents new tasks from completing the blocking on a mutex that is owned by this
  436 +task.
  437 +
  438 +If the task is not blocked on a mutex then the loop is exited. We are at
  439 +the top of the PI chain.
  440 +
  441 +A check is now done to see if the original waiter (the process that is blocked
  442 +on the current mutex) is the top pi waiter of the task. That is, is this
  443 +waiter on the top of the task's pi_list. If it is not, it either means that
  444 +there is another process higher in priority that is blocked on one of the
  445 +mutexes that the task owns, or that the waiter has just woken up via a signal
  446 +or timeout and has left the PI chain. In either case, the loop is exited, since
  447 +we don't need to do any more changes to the priority of the current task, or any
  448 +task that owns a mutex that this current task is waiting on. A priority chain
  449 +walk is only needed when a new top pi waiter is made to a task.
  450 +
  451 +The next check sees if the task's waiter plist node has the priority equal to
  452 +the priority the task is set at. If they are equal, then we are done with
  453 +the loop. Remember that the function started with the priority of the
  454 +task adjusted, but the plist nodes that hold the task in other processes'
  455 +pi_lists have not been adjusted.
  456 +
  457 +Next, we look at the mutex that the task is blocked on. The mutex's wait_lock
  458 +is taken. This is done by a spin_trylock, because the locking order of the
  459 +pi_lock and wait_lock goes in the opposite direction. If we fail to grab the
  460 +lock, the pi_lock is released, and we restart the loop.
  461 +
  462 +Now that we have both the pi_lock of the task as well as the wait_lock of
  463 +the mutex the task is blocked on, we update the task's waiter's plist node
  464 +that is located on the mutex's wait_list.
  465 +
  466 +Now we release the pi_lock of the task.
  467 +
  468 +Next the owner of the mutex has its pi_lock taken, so we can update the
  469 +task's entry in the owner's pi_list. If the task is the highest priority
  470 +process on the mutex's wait_list, then we remove the previous top waiter
  471 +from the owner's pi_list, and replace it with the task.
  472 +
  473 +Note: It is possible that the task was the current top waiter on the mutex,
  474 + in which case the task is not yet on the pi_list of the waiter. This
  475 + is OK, since plist_del does nothing if the plist node is not on any
  476 + list.
  477 +
  478 +If the task was not the top waiter of the mutex, but it was before we
  479 +did the priority updates, that means we are deboosting/lowering the
  480 +task. In this case, the task is removed from the pi_list of the owner,
  481 +and the new top waiter is added.
  482 +
  483 +Lastly, we unlock both the pi_lock of the task, as well as the mutex's
  484 +wait_lock, and continue the loop again. On the next iteration of the
  485 +loop, the previous owner of the mutex will be the task that will be
  486 +processed.
  487 +
  488 +Note: One might think that the owner of this mutex might have changed
  489 + since we just grabbed the mutex's wait_lock. And one could be right.
  490 + The important thing to remember is that the owner could not have
  491 + become the task that is being processed in the PI chain, since
  492 + we have taken that task's pi_lock at the beginning of the loop.
  493 + So as long as there is an owner of this mutex that is not the same
  494 + process as the task being worked on, we are OK.
  495 +
  496 + Looking closely at the code, one might be confused. The check for the
  497 + end of the PI chain is when the task isn't blocked on anything or the
  498 + task's waiter structure "task" element is NULL. This check is
  499 + protected only by the task's pi_lock. But the code to unlock the mutex
  500 + sets the task's waiter structure "task" element to NULL with only
  501 + the protection of the mutex's wait_lock, which was not taken yet.
  502 + Isn't this a race condition if the task becomes the new owner?
  503 +
  504 + The answer is No! The trick is the spin_trylock of the mutex's
  505 + wait_lock. If we fail that lock, we release the pi_lock of the
  506 + task and continue the loop, doing the end of PI chain check again.
  507 +
  508 + In the code to release the lock, the wait_lock of the mutex is held
  509 + the entire time, and it is not let go when we grab the pi_lock of the
  510 + new owner of the mutex. So if the switch of a new owner were to happen
  511 + after the check for end of the PI chain and the grabbing of the
  512 + wait_lock, the unlocking code would spin on the new owner's pi_lock
  513 + but never give up the wait_lock. So the PI chain loop is guaranteed to
  514 + fail the spin_trylock on the wait_lock, release the pi_lock, and
  515 + try again.
  516 +
  517 + If you don't quite understand the above, that's OK. You don't have to,
  518 + unless you really want to make a proof out of it ;)
  519 +
  520 +
  521 +Pending Owners and Lock stealing
  522 +--------------------------------
  523 +
  524 +One of the flags in the owner field of the mutex structure is "Pending Owner".
  525 +What this means is that an owner was chosen by the process releasing the
  526 +mutex, but that owner has yet to wake up and actually take the mutex.
  527 +
  528 +Why is this important? Why can't we just give the mutex to another process
  529 +and be done with it?
  530 +
  531 +The PI code is to help with real-time processes, and to let the highest
  532 +priority process run as long as possible with little latencies and delays.
  533 +If a high priority process owns a mutex that a lower priority process is
  534 +blocked on, when the mutex is released it would be given to the lower priority
  535 +process. What if the higher priority process wants to take that mutex again?
  536 +The high priority process would fail to take the mutex that it just gave up,
  537 +and it would then have to boost the lower priority process and wait out the
  538 +full latency of that critical section (since the low priority process just
  539 +entered it).
  540 +
  541 +There's no reason a high priority process that gives up a mutex should be
  542 +penalized if it tries to take that mutex again. If the new owner of the
  543 +mutex has not woken up yet, there's no reason that the higher priority process
  544 +could not take that mutex away.
  545 +
  546 +To solve this, we introduced Pending Ownership and Lock Stealing. When a
  547 +new process is given a mutex that it was blocked on, it is only given
  548 +pending ownership. This means that it's the new owner, unless a higher
  549 +priority process comes in and tries to grab that mutex. If a higher priority
  550 +process does come along and wants that mutex, we let the higher priority
  551 +process "steal" the mutex from the pending owner (only if it is still pending)
  552 +and continue with the mutex.
  553 +
  554 +
  555 +Taking of a mutex (The walk through)
  556 +------------------------------------
  557 +
  558 +OK, now let's take a look at the detailed walk through of what happens when
  559 +taking a mutex.
  560 +
  561 +The first thing that is tried is the fast taking of the mutex. This is
  562 +done when we have CMPXCHG enabled (otherwise the fast taking automatically
  563 +fails). Only when the owner field of the mutex is NULL can the lock be
  564 +taken with the CMPXCHG and nothing else needs to be done.
  565 +
  566 +If there is contention on the lock, whether it is owned or has a pending
  567 +owner, we take the slow path (rt_mutex_slowlock).
  568 +
  569 +The slow path function is where the task's waiter structure is created on
  570 +the stack. This is because the waiter structure is only needed for the
  571 +scope of this function. The waiter structure holds the nodes to store
  572 +the task on the wait_list of the mutex, and if need be, the pi_list of
  573 +the owner.
  574 +
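For reference, the on-stack waiter described here carries roughly the
following fields (a simplified, renamed sketch; the real rt_mutex_waiter lives
in the rtmutex headers and its debug variants add more fields):

#include <linux/plist.h>

struct rt_mutex;
struct task_struct;

struct rt_mutex_waiter_sketch {
        struct plist_node       list_entry;     /* node on the mutex's wait_list */
        struct plist_node       pi_list_entry;  /* node on the owner's pi_list */
        struct task_struct      *task;          /* the blocked task */
        struct rt_mutex         *lock;          /* the mutex it is blocked on */
};
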
  575 +The wait_lock of the mutex is taken since the slow path of unlocking the
  576 +mutex also takes this lock.
  577 +
  578 +We then call try_to_take_rt_mutex. This is where an architecture that
  579 +does not implement CMPXCHG would always grab the lock (if there's no
  580 +contention).
  581 +
  582 +try_to_take_rt_mutex is used every time the task tries to grab a mutex in the
  583 +slow path. The first thing that is done here is an atomic setting of
  584 +the "Has Waiters" flag of the mutex's owner field. Yes, this could really
  585 +be false, because if the mutex has no owner, there are no waiters and
  586 +the current task also won't have any waiters. But we don't have the lock
  587 +yet, so we assume we are going to be a waiter. The reason for this is to
  588 +play nice for those architectures that do have CMPXCHG. By setting this flag
  589 +now, the owner of the mutex can't release the mutex without going into the
  590 +slow unlock path, and it would then need to grab the wait_lock, which this
  591 +code currently holds. So setting the "Has Waiters" flag forces the owner
  592 +to synchronize with this code.
  593 +
  594 +Now that we know that we can't have any races with the owner releasing the
  595 +mutex, we check to see if we can take the ownership. This is done if the
  596 +mutex doesn't have an owner, or if we can steal the mutex from a pending
  597 +owner. Let's look at the situations we have here.
  598 +
  599 + 1) Has owner that is pending
  600 + ----------------------------
  601 +
  602 + The mutex has an owner, but it hasn't woken up and the mutex flag
  603 + "Pending Owner" is set. The first check is to see if the owner isn't the
  604 + current task. This is because this function is also used for the pending
  605 + owner to grab the mutex. When a pending owner wakes up, it checks to see
  606 + if it can take the mutex, and this is done if the owner is already set to
  607 + itself. If so, we succeed and leave the function, clearing the "Pending
  608 + Owner" bit.
  609 +
  610 + If the pending owner is not current, we check to see if the current task's
  611 + priority is higher than the pending owner's. If not, we fail and return.
  612 +
  613 + There's also something special about a pending owner. That is, a pending owner
  614 + is never blocked on a mutex. So there is no PI chain to worry about. It also
  615 + means that if the mutex doesn't have any waiters, there's no accounting needed
  616 + to update the pending owner's pi_list, since we only worry about processes
  617 + blocked on the current mutex.
  618 +
  619 + If there are waiters on this mutex, and we just stole the ownership, we need
  620 + to take the top waiter, remove it from the pi_list of the pending owner, and
  621 + add it to the current pi_list. Note that at this moment, the pending owner
  622 + is no longer on the list of waiters. This is fine, since the pending owner
  623 + would add itself back when it realizes that it had the ownership stolen
  624 + from itself. When the pending owner tries to grab the mutex, it will fail
  625 + in try_to_take_rt_mutex if the owner field points to another process.
  626 +
  627 + 2) No owner
  628 + -----------
  629 +
  630 + If there is no owner (or we successfully stole the lock), we set the owner
  631 + of the mutex to current, and set the flag of "Has Waiters" if the current
  632 + mutex actually has waiters, or we clear the flag if it doesn't. See, it was
  633 + OK that we set that flag early, since now it is cleared.
  634 +
  635 + 3) Failed to grab ownership
  636 + ---------------------------
  637 +
  638 + The most interesting case is when we fail to take ownership. This means that
  639 + there exists an owner, or there's a pending owner with equal or higher
  640 + priority than the current task.
  641 +
  642 +We'll continue on the failed case.
  643 +
  644 +If the mutex has a timeout, we set up a timer to go off to break us out
  645 +of this mutex if we failed to get it after a specified amount of time.
  646 +
  647 +Now we enter a loop that will continue to try to take ownership of the mutex, or
  648 +fail from a timeout or signal.
  649 +
  650 +Once again we try to take the mutex. This will usually fail the first time
  651 +in the loop, since it had just failed to get the mutex. But the second time
  652 +in the loop, this would likely succeed, since the task would likely be
  653 +the pending owner.
  654 +
  655 +If the task's state is TASK_INTERRUPTIBLE, a check for signals and timeout
  656 +is done here.
  657 +
  658 +The waiter structure has a "task" field that points to the task that is blocked
  659 +on the mutex. This field can be NULL the first time it goes through the loop
  660 +or if the task is a pending owner and had its mutex stolen. If the "task"
  661 +field is NULL then we need to set up the accounting for it.
  662 +
  663 +Task blocks on mutex
  664 +--------------------
  665 +
  666 +The accounting of a mutex and process is done with the waiter structure of
  667 +the process. The "task" field is set to the process, and the "lock" field
  668 +to the mutex. The plist nodes are initialized to the process's current
  669 +priority.
  670 +
  671 +Since the wait_lock was taken at the entry of the slow lock, we can safely
  672 +add the waiter to the wait_list. If the current process is the highest
  673 +priority process currently waiting on this mutex, then we remove the
  674 +previous top waiter process (if it exists) from the pi_list of the owner,
  675 +and add the current process to that list. Since the pi_list of the owner
  676 +has changed, we call rt_mutex_adjust_prio on the owner to see if the owner
  677 +should adjust its priority accordingly.
  678 +
  679 +If the owner is also blocked on a lock, and had its pi_list changed
  680 +(or deadlock checking is on), we unlock the wait_lock of the mutex and go ahead
  681 +and run rt_mutex_adjust_prio_chain on the owner, as described earlier.
  682 +
  683 +Now all locks are released, and if the current process is still blocked on a
  684 +mutex (waiter "task" field is not NULL), then we go to sleep (call schedule).
  685 +
  686 +Waking up in the loop
  687 +---------------------
  688 +
  689 +The task can then wake up from schedule for a few reasons.
  690 + 1) we were given pending ownership of the mutex.
  691 + 2) we received a signal and were TASK_INTERRUPTIBLE
  692 + 3) we had a timeout and were TASK_INTERRUPTIBLE
  693 +
  694 +In any of these cases, we continue the loop and once again try to grab the
  695 +ownership of the mutex. If we succeed, we exit the loop. On a signal or a
  696 +timeout we also exit the loop; if instead the mutex was stolen from us, we
  697 +simply add ourselves back on the lists and go back to sleep.
  698 +
  699 +Note: For various reasons, because of timeout and signals, the steal mutex
  700 + algorithm needs to be careful. This is because the current process is
  701 + still on the wait_list. And because of dynamic changing of priorities,
  702 + especially on SCHED_OTHER tasks, the current process can be the
  703 + highest priority task on the wait_list.
  704 +
  705 +Failed to get mutex on Timeout or Signal
  706 +----------------------------------------
  707 +
  708 +If a timeout or signal occurred, the waiter's "task" field would not be
  709 +NULL and the task needs to be taken off the wait_list of the mutex and perhaps
  710 +pi_list of the owner. If this process was a high priority process, then
  711 +the rt_mutex_adjust_prio_chain needs to be executed again on the owner,
  712 +but this time it will be lowering the priorities.
  713 +
  714 +
  715 +Unlocking the Mutex
  716 +-------------------
  717 +
  718 +The unlocking of a mutex also has a fast path for those architectures with
  719 +CMPXCHG. Since the taking of a mutex on contention always sets the
  720 +"Has Waiters" flag of the mutex's owner, we use this to know if we need to
  721 +take the slow path when unlocking the mutex. If the mutex doesn't have any
  722 +waiters, the owner field of the mutex would equal the current process and
  723 +the mutex can be unlocked by just replacing the owner field with NULL.
  724 +
  725 +If the owner field has the "Has Waiters" bit set (or CMPXCHG is not available),
  726 +the slow unlock path is taken.
  727 +
  728 +The first thing done in the slow unlock path is to take the wait_lock of the
  729 +mutex. This synchronizes the locking and unlocking of the mutex.
  730 +
  731 +A check is made to see if the mutex has waiters or not. On architectures that
  732 +do not have CMPXCHG, this is the location that the owner of the mutex will
  733 +determine if a waiter needs to be awoken or not. On architectures that
  734 +do have CMPXCHG, that check is done in the fast path, but it is still needed
  735 +in the slow path too. If a waiter of a mutex woke up because of a signal
  736 +or timeout between the time the owner failed the fast path CMPXCHG check and
  737 +the grabbing of the wait_lock, the mutex may not have any waiters, thus the
  738 +owner still needs to make this check. If there are no waiters then the mutex
  739 +owner field is set to NULL, the wait_lock is released and nothing more is
  740 +needed.
  741 +
  742 +If there are waiters, then we need to wake one up and give that waiter
  743 +pending ownership.
  744 +
  745 +In the wake up code, the pi_lock of the current owner is taken. The top
  746 +waiter of the lock is found and removed from the wait_list of the mutex
  747 +as well as the pi_list of the current owner. The task field of the new
  748 +pending owner's waiter structure is set to NULL, and the owner field of the
  749 +mutex is set to the new owner with the "Pending Owner" bit set, as well
  750 +as the "Has Waiters" bit if there still are other processes blocked on the
  751 +mutex.
  752 +
  753 +The pi_lock of the previous owner is released, and the new pending owner's
  754 +pi_lock is taken. Remember that this is the trick to prevent the race
  755 +condition in rt_mutex_adjust_prio_chain from adding itself as a waiter
  756 +on the mutex.
  757 +
  758 +We now clear the "pi_blocked_on" field of the new pending owner, and if
  759 +the mutex still has waiters pending, we add the new top waiter to the pi_list
  760 +of the pending owner.
  761 +
  762 +Finally we unlock the pi_lock of the pending owner and wake it up.
  763 +
  764 +
  765 +Contact
  766 +-------
  767 +
  768 +For updates on this document, please email Steven Rostedt <rostedt@goodmis.org>
  769 +
  770 +
  771 +Credits
  772 +-------
  773 +
  774 +Author: Steven Rostedt <rostedt@goodmis.org>
  775 +
  776 +Reviewers: Ingo Molnar, Thomas Gleixner, Thomas Duetsch, and Randy Dunlap
  777 +
  778 +Updates
  779 +-------
  780 +
  781 +This document was originally written for 2.6.17-rc3-mm1
Documentation/rt-mutex.txt
  1 +RT-mutex subsystem with PI support
  2 +----------------------------------
  3 +
  4 +RT-mutexes with priority inheritance are used to support PI-futexes,
  5 +which enable pthread_mutex_t priority inheritance attributes
  6 +(PTHREAD_PRIO_INHERIT). [See Documentation/pi-futex.txt for more details
  7 +about PI-futexes.]
  8 +
  9 +This technology was developed in the -rt tree and streamlined for
  10 +pthread_mutex support.
  11 +
  12 +Basic principles:
  13 +-----------------
  14 +
  15 +RT-mutexes extend the semantics of simple mutexes by the priority
  16 +inheritance protocol.
  17 +
  18 +A low priority owner of a rt-mutex inherits the priority of a higher
  19 +priority waiter until the rt-mutex is released. If the temporarily
  20 +boosted owner blocks on a rt-mutex itself it propagates the priority
  21 +boosting to the owner of the other rt_mutex it gets blocked on. The
  22 +priority boosting is immediately removed once the rt_mutex has been
  23 +unlocked.
  24 +
  25 +This approach allows us to shorten the block of high-prio tasks on
  26 +mutexes which protect shared resources. Priority inheritance is not a
  27 +magic bullet for poorly designed applications, but it allows
  28 +well-designed applications to use userspace locks in critical parts of
  29 +a high priority thread, without losing determinism.
  30 +
  31 +The enqueueing of the waiters into the rtmutex waiter list is done in
  32 +priority order. For same priorities FIFO order is chosen. For each
  33 +rtmutex, only the top priority waiter is enqueued into the owner's
  34 +priority waiters list. This list too queues in priority order. Whenever
  35 +the top priority waiter of a task changes (for example it timed out or
  36 +got a signal), the priority of the owner task is readjusted. [The
  37 +priority enqueueing is handled by "plists", see include/linux/plist.h
  38 +for more details.]
  39 +
  40 +RT-mutexes are optimized for fastpath operations and have no internal
  41 +locking overhead when locking an uncontended mutex or unlocking a mutex
  42 +without waiters. The optimized fastpath operations require cmpxchg
  43 +support. [If that is not available then the rt-mutex internal spinlock
  44 +is used]
  45 +
  46 +The state of the rt-mutex is tracked via the owner field of the rt-mutex
  47 +structure:
  48 +
  49 +rt_mutex->owner holds the task_struct pointer of the owner. Bits 0 and 1
  50 +are used to keep track of the "owner is pending" and "rtmutex has
  51 +waiters" state.
  52 +
  53 + owner        bit1  bit0
  54 + NULL         0     0     mutex is free (fast acquire possible)
  55 + NULL         0     1     invalid state
  56 + NULL         1     0     Transitional state*
  57 + NULL         1     1     invalid state
  58 + taskpointer  0     0     mutex is held (fast release possible)
  59 + taskpointer  0     1     task is pending owner
  60 + taskpointer  1     0     mutex is held and has waiters
  61 + taskpointer  1     1     task is pending owner and mutex has waiters
  62 +
  63 +Pending-ownership handling is a performance optimization:
  64 +pending-ownership is assigned to the first (highest priority) waiter of
  65 +the mutex, when the mutex is released. The thread is woken up and once
  66 +it starts executing it can acquire the mutex. Until the mutex is taken
  67 +by it (bit 0 is cleared) a competing higher priority thread can "steal"
  68 +the mutex which puts the woken up thread back on the waiters list.
  69 +
  70 +The pending-ownership optimization is especially important for the
  71 +uninterrupted workflow of high-prio tasks which repeatedly
  72 +take/release locks that have lower-prio waiters. Without this
  73 +optimization the higher-prio thread would ping-pong to the lower-prio
  74 +task [because at unlock time we always assign a new owner].
  75 +
  76 +(*) The "mutex has waiters" bit gets set to take the lock. If the lock
  77 +doesn't already have an owner, this bit is quickly cleared if there are
  78 +no waiters. So this is a transitional state to synchronize with looking
  79 +at the owner field of the mutex and the mutex owner releasing the lock.