Commit 1c5aefb5b12a90e29866c960a57c1f8f75def617

Authored by Linus Torvalds

Merge branch 'futex-fixes' (futex fixes from Thomas Gleixner)

Merge futex fixes from Thomas Gleixner:
 "So with more awake and less futex wreckaged brain, I went through my
  list of points again and came up with the following 4 patches.

  1) Prevent pi requeueing on the same futex

     I kept Kees' check for uaddr1 == uaddr2 as an early check for private
     futexes and added a key comparison to both futex_requeue and
     futex_wait_requeue_pi (a rough sketch follows after this quoted message).

     Sebastian, sorry for the confusion last night.  I really
     misunderstood your question.

     You are right that the check is pointless for shared futexes, where
     the same physical address can be mapped to two different virtual
     addresses.

  2) Sanity check atomic acquisition in futex_lock_pi_atomic

     That's basically what Darren suggested.

     I just simplified it to use futex_top_waiter() to find kernel-internal
     state.  If state is found, return -EINVAL and do not bother fixing up
     the user space variable.  It's corrupted already.

  3) Ensure state consistency in futex_unlock_pi

     The code is silly versus the owner died bit.  There is no point in
     preserving it on unlock when the user space thread owns the futex.

     What's worse is that it does not update the user space value when
     the owner died bit is set.  So the kernel itself creates observable
     inconsistency.

     Another "optimization" is to retry an atomic unlock.  That's
     pointless as in a sane environment user space would not call into
     that code if it could have unlocked it atomically.  So we always
     check whether there is kernel state around and only if there is
     none, we do the unlock by setting the user space value to 0.

  4) Sanitize lookup_pi_state

     lookup_pi_state is ambiguous about TID == 0 in the user space value.

     This can be a valid state even if there is kernel state on this
     uaddr, but we miss a few corner case checks.

     I tried to come up with a smaller solution by hacking the checks into
     the current cruft, but it turned out to be ugly as hell and I got
     more confused than I was before.  So I rewrote the sanity checks
     along the state documentation, with an awful lot of commentary"
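
As a rough illustration of point 1 above, here is a minimal sketch of the
added guards, assuming the helpers visible in the diff below (get_futex_key(),
match_futex(), FLAGS_SHARED): an early pointer comparison that only covers
private futexes, followed by a key comparison that also covers shared
mappings.  Key reference dropping, fault handling and the actual requeue
work are elided; this is not the verbatim upstream patch.

/*
 * Sketch of the point-1 guards, not the upstream futex_requeue() itself.
 */
static int requeue_pi_sanity_sketch(u32 __user *uaddr1, u32 __user *uaddr2,
				    unsigned int flags, int requeue_pi)
{
	union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
	int ret;

	/*
	 * Early check: requeueing a waiter onto the futex it already
	 * waits on makes no sense.  This only catches private futexes.
	 */
	if (requeue_pi && uaddr1 == uaddr2)
		return -EINVAL;

	ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1, VERIFY_READ);
	if (unlikely(ret != 0))
		return ret;
	ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, VERIFY_WRITE);
	if (unlikely(ret != 0))
		return ret;

	/*
	 * For shared futexes the same physical page can be mapped at two
	 * different virtual addresses, so compare the resolved keys too.
	 */
	if (requeue_pi && match_futex(&key1, &key2))
		return -EINVAL;

	/* ... the actual requeue work would follow here ... */
	return 0;
}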

* emailed patches from Thomas Gleixner <tglx@linutronix.de>:
  futex: Make lookup_pi_state more robust
  futex: Always cleanup owner tid in unlock_pi
  futex: Validate atomic acquisition in futex_lock_pi_atomic()
  futex: Forbid uaddr == uaddr2 in futex_requeue(..., requeue_pi=1)
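
The changes for points 2 and 4 can be seen directly in the diff below (the
futex_top_waiter() validation in futex_lock_pi_atomic() and the rewritten
lookup_pi_state()).  For point 3, here is a minimal sketch of the described
unlock behaviour, assuming the helpers from the diff (futex_top_waiter(),
cmpxchg_futex_value_locked()); wake_pi_waiter() is a hypothetical stand-in
for the actual wake-up/handoff code, and the fault retry loop is elided.

/*
 * Sketch of the point-3 unlock behaviour, not the upstream futex_unlock_pi().
 */
static int unlock_pi_sketch(u32 __user *uaddr, u32 uval,
			    struct futex_hash_bucket *hb, union futex_key *key)
{
	struct futex_q *match;
	u32 curval;

	/*
	 * If there is kernel state around, hand the futex over to the
	 * top waiter; the user space value is not touched here.
	 */
	match = futex_top_waiter(hb, key);
	if (match)
		return wake_pi_waiter(uaddr, uval, match); /* hypothetical */

	/*
	 * No kernel-internal state: unlock by setting the user space
	 * value to 0.  Neither the WAITERS nor the OWNER_DIED bit is
	 * preserved; the calling thread owns the futex, so those bits
	 * carry no information anybody could still need.
	 */
	if (cmpxchg_futex_value_locked(&curval, uaddr, uval, 0))
		return -EFAULT;

	return 0;
}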

Showing 1 changed file (inline diff):

1 /* 1 /*
2 * Fast Userspace Mutexes (which I call "Futexes!"). 2 * Fast Userspace Mutexes (which I call "Futexes!").
3 * (C) Rusty Russell, IBM 2002 3 * (C) Rusty Russell, IBM 2002
4 * 4 *
5 * Generalized futexes, futex requeueing, misc fixes by Ingo Molnar 5 * Generalized futexes, futex requeueing, misc fixes by Ingo Molnar
6 * (C) Copyright 2003 Red Hat Inc, All Rights Reserved 6 * (C) Copyright 2003 Red Hat Inc, All Rights Reserved
7 * 7 *
8 * Removed page pinning, fix privately mapped COW pages and other cleanups 8 * Removed page pinning, fix privately mapped COW pages and other cleanups
9 * (C) Copyright 2003, 2004 Jamie Lokier 9 * (C) Copyright 2003, 2004 Jamie Lokier
10 * 10 *
11 * Robust futex support started by Ingo Molnar 11 * Robust futex support started by Ingo Molnar
12 * (C) Copyright 2006 Red Hat Inc, All Rights Reserved 12 * (C) Copyright 2006 Red Hat Inc, All Rights Reserved
13 * Thanks to Thomas Gleixner for suggestions, analysis and fixes. 13 * Thanks to Thomas Gleixner for suggestions, analysis and fixes.
14 * 14 *
15 * PI-futex support started by Ingo Molnar and Thomas Gleixner 15 * PI-futex support started by Ingo Molnar and Thomas Gleixner
16 * Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com> 16 * Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
17 * Copyright (C) 2006 Timesys Corp., Thomas Gleixner <tglx@timesys.com> 17 * Copyright (C) 2006 Timesys Corp., Thomas Gleixner <tglx@timesys.com>
18 * 18 *
19 * PRIVATE futexes by Eric Dumazet 19 * PRIVATE futexes by Eric Dumazet
20 * Copyright (C) 2007 Eric Dumazet <dada1@cosmosbay.com> 20 * Copyright (C) 2007 Eric Dumazet <dada1@cosmosbay.com>
21 * 21 *
22 * Requeue-PI support by Darren Hart <dvhltc@us.ibm.com> 22 * Requeue-PI support by Darren Hart <dvhltc@us.ibm.com>
23 * Copyright (C) IBM Corporation, 2009 23 * Copyright (C) IBM Corporation, 2009
24 * Thanks to Thomas Gleixner for conceptual design and careful reviews. 24 * Thanks to Thomas Gleixner for conceptual design and careful reviews.
25 * 25 *
26 * Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly 26 * Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
27 * enough at me, Linus for the original (flawed) idea, Matthew 27 * enough at me, Linus for the original (flawed) idea, Matthew
28 * Kirkwood for proof-of-concept implementation. 28 * Kirkwood for proof-of-concept implementation.
29 * 29 *
30 * "The futexes are also cursed." 30 * "The futexes are also cursed."
31 * "But they come in a choice of three flavours!" 31 * "But they come in a choice of three flavours!"
32 * 32 *
33 * This program is free software; you can redistribute it and/or modify 33 * This program is free software; you can redistribute it and/or modify
34 * it under the terms of the GNU General Public License as published by 34 * it under the terms of the GNU General Public License as published by
35 * the Free Software Foundation; either version 2 of the License, or 35 * the Free Software Foundation; either version 2 of the License, or
36 * (at your option) any later version. 36 * (at your option) any later version.
37 * 37 *
38 * This program is distributed in the hope that it will be useful, 38 * This program is distributed in the hope that it will be useful,
39 * but WITHOUT ANY WARRANTY; without even the implied warranty of 39 * but WITHOUT ANY WARRANTY; without even the implied warranty of
40 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 40 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
41 * GNU General Public License for more details. 41 * GNU General Public License for more details.
42 * 42 *
43 * You should have received a copy of the GNU General Public License 43 * You should have received a copy of the GNU General Public License
44 * along with this program; if not, write to the Free Software 44 * along with this program; if not, write to the Free Software
45 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA 45 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
46 */ 46 */
47 #include <linux/slab.h> 47 #include <linux/slab.h>
48 #include <linux/poll.h> 48 #include <linux/poll.h>
49 #include <linux/fs.h> 49 #include <linux/fs.h>
50 #include <linux/file.h> 50 #include <linux/file.h>
51 #include <linux/jhash.h> 51 #include <linux/jhash.h>
52 #include <linux/init.h> 52 #include <linux/init.h>
53 #include <linux/futex.h> 53 #include <linux/futex.h>
54 #include <linux/mount.h> 54 #include <linux/mount.h>
55 #include <linux/pagemap.h> 55 #include <linux/pagemap.h>
56 #include <linux/syscalls.h> 56 #include <linux/syscalls.h>
57 #include <linux/signal.h> 57 #include <linux/signal.h>
58 #include <linux/export.h> 58 #include <linux/export.h>
59 #include <linux/magic.h> 59 #include <linux/magic.h>
60 #include <linux/pid.h> 60 #include <linux/pid.h>
61 #include <linux/nsproxy.h> 61 #include <linux/nsproxy.h>
62 #include <linux/ptrace.h> 62 #include <linux/ptrace.h>
63 #include <linux/sched/rt.h> 63 #include <linux/sched/rt.h>
64 #include <linux/hugetlb.h> 64 #include <linux/hugetlb.h>
65 #include <linux/freezer.h> 65 #include <linux/freezer.h>
66 #include <linux/bootmem.h> 66 #include <linux/bootmem.h>
67 67
68 #include <asm/futex.h> 68 #include <asm/futex.h>
69 69
70 #include "locking/rtmutex_common.h" 70 #include "locking/rtmutex_common.h"
71 71
72 /* 72 /*
73 * READ this before attempting to hack on futexes! 73 * READ this before attempting to hack on futexes!
74 * 74 *
75 * Basic futex operation and ordering guarantees 75 * Basic futex operation and ordering guarantees
76 * ============================================= 76 * =============================================
77 * 77 *
78 * The waiter reads the futex value in user space and calls 78 * The waiter reads the futex value in user space and calls
79 * futex_wait(). This function computes the hash bucket and acquires 79 * futex_wait(). This function computes the hash bucket and acquires
80 * the hash bucket lock. After that it reads the futex user space value 80 * the hash bucket lock. After that it reads the futex user space value
81 * again and verifies that the data has not changed. If it has not changed 81 * again and verifies that the data has not changed. If it has not changed
82 * it enqueues itself into the hash bucket, releases the hash bucket lock 82 * it enqueues itself into the hash bucket, releases the hash bucket lock
83 * and schedules. 83 * and schedules.
84 * 84 *
85 * The waker side modifies the user space value of the futex and calls 85 * The waker side modifies the user space value of the futex and calls
86 * futex_wake(). This function computes the hash bucket and acquires the 86 * futex_wake(). This function computes the hash bucket and acquires the
87 * hash bucket lock. Then it looks for waiters on that futex in the hash 87 * hash bucket lock. Then it looks for waiters on that futex in the hash
88 * bucket and wakes them. 88 * bucket and wakes them.
89 * 89 *
90 * In futex wake up scenarios where no tasks are blocked on a futex, taking 90 * In futex wake up scenarios where no tasks are blocked on a futex, taking
91 * the hb spinlock can be avoided and simply return. In order for this 91 * the hb spinlock can be avoided and simply return. In order for this
92 * optimization to work, ordering guarantees must exist so that the waiter 92 * optimization to work, ordering guarantees must exist so that the waiter
93 * being added to the list is acknowledged when the list is concurrently being 93 * being added to the list is acknowledged when the list is concurrently being
94 * checked by the waker, avoiding scenarios like the following: 94 * checked by the waker, avoiding scenarios like the following:
95 * 95 *
96 * CPU 0 CPU 1 96 * CPU 0 CPU 1
97 * val = *futex; 97 * val = *futex;
98 * sys_futex(WAIT, futex, val); 98 * sys_futex(WAIT, futex, val);
99 * futex_wait(futex, val); 99 * futex_wait(futex, val);
100 * uval = *futex; 100 * uval = *futex;
101 * *futex = newval; 101 * *futex = newval;
102 * sys_futex(WAKE, futex); 102 * sys_futex(WAKE, futex);
103 * futex_wake(futex); 103 * futex_wake(futex);
104 * if (queue_empty()) 104 * if (queue_empty())
105 * return; 105 * return;
106 * if (uval == val) 106 * if (uval == val)
107 * lock(hash_bucket(futex)); 107 * lock(hash_bucket(futex));
108 * queue(); 108 * queue();
109 * unlock(hash_bucket(futex)); 109 * unlock(hash_bucket(futex));
110 * schedule(); 110 * schedule();
111 * 111 *
112 * This would cause the waiter on CPU 0 to wait forever because it 112 * This would cause the waiter on CPU 0 to wait forever because it
113 * missed the transition of the user space value from val to newval 113 * missed the transition of the user space value from val to newval
114 * and the waker did not find the waiter in the hash bucket queue. 114 * and the waker did not find the waiter in the hash bucket queue.
115 * 115 *
116 * The correct serialization ensures that a waiter either observes 116 * The correct serialization ensures that a waiter either observes
117 * the changed user space value before blocking or is woken by a 117 * the changed user space value before blocking or is woken by a
118 * concurrent waker: 118 * concurrent waker:
119 * 119 *
120 * CPU 0 CPU 1 120 * CPU 0 CPU 1
121 * val = *futex; 121 * val = *futex;
122 * sys_futex(WAIT, futex, val); 122 * sys_futex(WAIT, futex, val);
123 * futex_wait(futex, val); 123 * futex_wait(futex, val);
124 * 124 *
125 * waiters++; (a) 125 * waiters++; (a)
126 * mb(); (A) <-- paired with -. 126 * mb(); (A) <-- paired with -.
127 * | 127 * |
128 * lock(hash_bucket(futex)); | 128 * lock(hash_bucket(futex)); |
129 * | 129 * |
130 * uval = *futex; | 130 * uval = *futex; |
131 * | *futex = newval; 131 * | *futex = newval;
132 * | sys_futex(WAKE, futex); 132 * | sys_futex(WAKE, futex);
133 * | futex_wake(futex); 133 * | futex_wake(futex);
134 * | 134 * |
135 * `-------> mb(); (B) 135 * `-------> mb(); (B)
136 * if (uval == val) 136 * if (uval == val)
137 * queue(); 137 * queue();
138 * unlock(hash_bucket(futex)); 138 * unlock(hash_bucket(futex));
139 * schedule(); if (waiters) 139 * schedule(); if (waiters)
140 * lock(hash_bucket(futex)); 140 * lock(hash_bucket(futex));
141 * else wake_waiters(futex); 141 * else wake_waiters(futex);
142 * waiters--; (b) unlock(hash_bucket(futex)); 142 * waiters--; (b) unlock(hash_bucket(futex));
143 * 143 *
144 * Where (A) orders the waiters increment and the futex value read through 144 * Where (A) orders the waiters increment and the futex value read through
145 * atomic operations (see hb_waiters_inc) and where (B) orders the write 145 * atomic operations (see hb_waiters_inc) and where (B) orders the write
146 * to futex and the waiters read -- this is done by the barriers in 146 * to futex and the waiters read -- this is done by the barriers in
147 * get_futex_key_refs(), through either ihold or atomic_inc, depending on the 147 * get_futex_key_refs(), through either ihold or atomic_inc, depending on the
148 * futex type. 148 * futex type.
149 * 149 *
150 * This yields the following case (where X:=waiters, Y:=futex): 150 * This yields the following case (where X:=waiters, Y:=futex):
151 * 151 *
152 * X = Y = 0 152 * X = Y = 0
153 * 153 *
154 * w[X]=1 w[Y]=1 154 * w[X]=1 w[Y]=1
155 * MB MB 155 * MB MB
156 * r[Y]=y r[X]=x 156 * r[Y]=y r[X]=x
157 * 157 *
158 * Which guarantees that x==0 && y==0 is impossible; which translates back into 158 * Which guarantees that x==0 && y==0 is impossible; which translates back into
159 * the guarantee that we cannot both miss the futex variable change and the 159 * the guarantee that we cannot both miss the futex variable change and the
160 * enqueue. 160 * enqueue.
161 * 161 *
162 * Note that a new waiter is accounted for in (a) even when it is possible that 162 * Note that a new waiter is accounted for in (a) even when it is possible that
163 * the wait call can return error, in which case we backtrack from it in (b). 163 * the wait call can return error, in which case we backtrack from it in (b).
164 * Refer to the comment in queue_lock(). 164 * Refer to the comment in queue_lock().
165 * 165 *
166 * Similarly, in order to account for waiters being requeued on another 166 * Similarly, in order to account for waiters being requeued on another
167 * address we always increment the waiters for the destination bucket before 167 * address we always increment the waiters for the destination bucket before
168 * acquiring the lock. It then decrements them again after releasing it - 168 * acquiring the lock. It then decrements them again after releasing it -
169 * the code that actually moves the futex(es) between hash buckets (requeue_futex) 169 * the code that actually moves the futex(es) between hash buckets (requeue_futex)
170 * will do the additional required waiter count housekeeping. This is done for 170 * will do the additional required waiter count housekeeping. This is done for
171 * double_lock_hb() and double_unlock_hb(), respectively. 171 * double_lock_hb() and double_unlock_hb(), respectively.
172 */ 172 */
173 173
174 #ifndef CONFIG_HAVE_FUTEX_CMPXCHG 174 #ifndef CONFIG_HAVE_FUTEX_CMPXCHG
175 int __read_mostly futex_cmpxchg_enabled; 175 int __read_mostly futex_cmpxchg_enabled;
176 #endif 176 #endif
177 177
178 /* 178 /*
179 * Futex flags used to encode options to functions and preserve them across 179 * Futex flags used to encode options to functions and preserve them across
180 * restarts. 180 * restarts.
181 */ 181 */
182 #define FLAGS_SHARED 0x01 182 #define FLAGS_SHARED 0x01
183 #define FLAGS_CLOCKRT 0x02 183 #define FLAGS_CLOCKRT 0x02
184 #define FLAGS_HAS_TIMEOUT 0x04 184 #define FLAGS_HAS_TIMEOUT 0x04
185 185
186 /* 186 /*
187 * Priority Inheritance state: 187 * Priority Inheritance state:
188 */ 188 */
189 struct futex_pi_state { 189 struct futex_pi_state {
190 /* 190 /*
191 * list of 'owned' pi_state instances - these have to be 191 * list of 'owned' pi_state instances - these have to be
192 * cleaned up in do_exit() if the task exits prematurely: 192 * cleaned up in do_exit() if the task exits prematurely:
193 */ 193 */
194 struct list_head list; 194 struct list_head list;
195 195
196 /* 196 /*
197 * The PI object: 197 * The PI object:
198 */ 198 */
199 struct rt_mutex pi_mutex; 199 struct rt_mutex pi_mutex;
200 200
201 struct task_struct *owner; 201 struct task_struct *owner;
202 atomic_t refcount; 202 atomic_t refcount;
203 203
204 union futex_key key; 204 union futex_key key;
205 }; 205 };
206 206
207 /** 207 /**
208 * struct futex_q - The hashed futex queue entry, one per waiting task 208 * struct futex_q - The hashed futex queue entry, one per waiting task
209 * @list: priority-sorted list of tasks waiting on this futex 209 * @list: priority-sorted list of tasks waiting on this futex
210 * @task: the task waiting on the futex 210 * @task: the task waiting on the futex
211 * @lock_ptr: the hash bucket lock 211 * @lock_ptr: the hash bucket lock
212 * @key: the key the futex is hashed on 212 * @key: the key the futex is hashed on
213 * @pi_state: optional priority inheritance state 213 * @pi_state: optional priority inheritance state
214 * @rt_waiter: rt_waiter storage for use with requeue_pi 214 * @rt_waiter: rt_waiter storage for use with requeue_pi
215 * @requeue_pi_key: the requeue_pi target futex key 215 * @requeue_pi_key: the requeue_pi target futex key
216 * @bitset: bitset for the optional bitmasked wakeup 216 * @bitset: bitset for the optional bitmasked wakeup
217 * 217 *
218 * We use this hashed waitqueue, instead of a normal wait_queue_t, so 218 * We use this hashed waitqueue, instead of a normal wait_queue_t, so
219 * we can wake only the relevant ones (hashed queues may be shared). 219 * we can wake only the relevant ones (hashed queues may be shared).
220 * 220 *
221 * A futex_q has a woken state, just like tasks have TASK_RUNNING. 221 * A futex_q has a woken state, just like tasks have TASK_RUNNING.
222 * It is considered woken when plist_node_empty(&q->list) || q->lock_ptr == 0. 222 * It is considered woken when plist_node_empty(&q->list) || q->lock_ptr == 0.
223 * The order of wakeup is always to make the first condition true, then 223 * The order of wakeup is always to make the first condition true, then
224 * the second. 224 * the second.
225 * 225 *
226 * PI futexes are typically woken before they are removed from the hash list via 226 * PI futexes are typically woken before they are removed from the hash list via
227 * the rt_mutex code. See unqueue_me_pi(). 227 * the rt_mutex code. See unqueue_me_pi().
228 */ 228 */
229 struct futex_q { 229 struct futex_q {
230 struct plist_node list; 230 struct plist_node list;
231 231
232 struct task_struct *task; 232 struct task_struct *task;
233 spinlock_t *lock_ptr; 233 spinlock_t *lock_ptr;
234 union futex_key key; 234 union futex_key key;
235 struct futex_pi_state *pi_state; 235 struct futex_pi_state *pi_state;
236 struct rt_mutex_waiter *rt_waiter; 236 struct rt_mutex_waiter *rt_waiter;
237 union futex_key *requeue_pi_key; 237 union futex_key *requeue_pi_key;
238 u32 bitset; 238 u32 bitset;
239 }; 239 };
240 240
241 static const struct futex_q futex_q_init = { 241 static const struct futex_q futex_q_init = {
242 /* list gets initialized in queue_me()*/ 242 /* list gets initialized in queue_me()*/
243 .key = FUTEX_KEY_INIT, 243 .key = FUTEX_KEY_INIT,
244 .bitset = FUTEX_BITSET_MATCH_ANY 244 .bitset = FUTEX_BITSET_MATCH_ANY
245 }; 245 };
246 246
247 /* 247 /*
248 * Hash buckets are shared by all the futex_keys that hash to the same 248 * Hash buckets are shared by all the futex_keys that hash to the same
249 * location. Each key may have multiple futex_q structures, one for each task 249 * location. Each key may have multiple futex_q structures, one for each task
250 * waiting on a futex. 250 * waiting on a futex.
251 */ 251 */
252 struct futex_hash_bucket { 252 struct futex_hash_bucket {
253 atomic_t waiters; 253 atomic_t waiters;
254 spinlock_t lock; 254 spinlock_t lock;
255 struct plist_head chain; 255 struct plist_head chain;
256 } ____cacheline_aligned_in_smp; 256 } ____cacheline_aligned_in_smp;
257 257
258 static unsigned long __read_mostly futex_hashsize; 258 static unsigned long __read_mostly futex_hashsize;
259 259
260 static struct futex_hash_bucket *futex_queues; 260 static struct futex_hash_bucket *futex_queues;
261 261
262 static inline void futex_get_mm(union futex_key *key) 262 static inline void futex_get_mm(union futex_key *key)
263 { 263 {
264 atomic_inc(&key->private.mm->mm_count); 264 atomic_inc(&key->private.mm->mm_count);
265 /* 265 /*
266 * Ensure futex_get_mm() implies a full barrier such that 266 * Ensure futex_get_mm() implies a full barrier such that
267 * get_futex_key() implies a full barrier. This is relied upon 267 * get_futex_key() implies a full barrier. This is relied upon
268 * as full barrier (B), see the ordering comment above. 268 * as full barrier (B), see the ordering comment above.
269 */ 269 */
270 smp_mb__after_atomic_inc(); 270 smp_mb__after_atomic_inc();
271 } 271 }
272 272
273 /* 273 /*
274 * Reflects a new waiter being added to the waitqueue. 274 * Reflects a new waiter being added to the waitqueue.
275 */ 275 */
276 static inline void hb_waiters_inc(struct futex_hash_bucket *hb) 276 static inline void hb_waiters_inc(struct futex_hash_bucket *hb)
277 { 277 {
278 #ifdef CONFIG_SMP 278 #ifdef CONFIG_SMP
279 atomic_inc(&hb->waiters); 279 atomic_inc(&hb->waiters);
280 /* 280 /*
281 * Full barrier (A), see the ordering comment above. 281 * Full barrier (A), see the ordering comment above.
282 */ 282 */
283 smp_mb__after_atomic_inc(); 283 smp_mb__after_atomic_inc();
284 #endif 284 #endif
285 } 285 }
286 286
287 /* 287 /*
288 * Reflects a waiter being removed from the waitqueue by wakeup 288 * Reflects a waiter being removed from the waitqueue by wakeup
289 * paths. 289 * paths.
290 */ 290 */
291 static inline void hb_waiters_dec(struct futex_hash_bucket *hb) 291 static inline void hb_waiters_dec(struct futex_hash_bucket *hb)
292 { 292 {
293 #ifdef CONFIG_SMP 293 #ifdef CONFIG_SMP
294 atomic_dec(&hb->waiters); 294 atomic_dec(&hb->waiters);
295 #endif 295 #endif
296 } 296 }
297 297
298 static inline int hb_waiters_pending(struct futex_hash_bucket *hb) 298 static inline int hb_waiters_pending(struct futex_hash_bucket *hb)
299 { 299 {
300 #ifdef CONFIG_SMP 300 #ifdef CONFIG_SMP
301 return atomic_read(&hb->waiters); 301 return atomic_read(&hb->waiters);
302 #else 302 #else
303 return 1; 303 return 1;
304 #endif 304 #endif
305 } 305 }
306 306
307 /* 307 /*
308 * We hash on the keys returned from get_futex_key (see below). 308 * We hash on the keys returned from get_futex_key (see below).
309 */ 309 */
310 static struct futex_hash_bucket *hash_futex(union futex_key *key) 310 static struct futex_hash_bucket *hash_futex(union futex_key *key)
311 { 311 {
312 u32 hash = jhash2((u32*)&key->both.word, 312 u32 hash = jhash2((u32*)&key->both.word,
313 (sizeof(key->both.word)+sizeof(key->both.ptr))/4, 313 (sizeof(key->both.word)+sizeof(key->both.ptr))/4,
314 key->both.offset); 314 key->both.offset);
315 return &futex_queues[hash & (futex_hashsize - 1)]; 315 return &futex_queues[hash & (futex_hashsize - 1)];
316 } 316 }
317 317
318 /* 318 /*
319 * Return 1 if two futex_keys are equal, 0 otherwise. 319 * Return 1 if two futex_keys are equal, 0 otherwise.
320 */ 320 */
321 static inline int match_futex(union futex_key *key1, union futex_key *key2) 321 static inline int match_futex(union futex_key *key1, union futex_key *key2)
322 { 322 {
323 return (key1 && key2 323 return (key1 && key2
324 && key1->both.word == key2->both.word 324 && key1->both.word == key2->both.word
325 && key1->both.ptr == key2->both.ptr 325 && key1->both.ptr == key2->both.ptr
326 && key1->both.offset == key2->both.offset); 326 && key1->both.offset == key2->both.offset);
327 } 327 }
328 328
329 /* 329 /*
330 * Take a reference to the resource addressed by a key. 330 * Take a reference to the resource addressed by a key.
331 * Can be called while holding spinlocks. 331 * Can be called while holding spinlocks.
332 * 332 *
333 */ 333 */
334 static void get_futex_key_refs(union futex_key *key) 334 static void get_futex_key_refs(union futex_key *key)
335 { 335 {
336 if (!key->both.ptr) 336 if (!key->both.ptr)
337 return; 337 return;
338 338
339 switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) { 339 switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
340 case FUT_OFF_INODE: 340 case FUT_OFF_INODE:
341 ihold(key->shared.inode); /* implies MB (B) */ 341 ihold(key->shared.inode); /* implies MB (B) */
342 break; 342 break;
343 case FUT_OFF_MMSHARED: 343 case FUT_OFF_MMSHARED:
344 futex_get_mm(key); /* implies MB (B) */ 344 futex_get_mm(key); /* implies MB (B) */
345 break; 345 break;
346 } 346 }
347 } 347 }
348 348
349 /* 349 /*
350 * Drop a reference to the resource addressed by a key. 350 * Drop a reference to the resource addressed by a key.
351 * The hash bucket spinlock must not be held. 351 * The hash bucket spinlock must not be held.
352 */ 352 */
353 static void drop_futex_key_refs(union futex_key *key) 353 static void drop_futex_key_refs(union futex_key *key)
354 { 354 {
355 if (!key->both.ptr) { 355 if (!key->both.ptr) {
356 /* If we're here then we tried to put a key we failed to get */ 356 /* If we're here then we tried to put a key we failed to get */
357 WARN_ON_ONCE(1); 357 WARN_ON_ONCE(1);
358 return; 358 return;
359 } 359 }
360 360
361 switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) { 361 switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
362 case FUT_OFF_INODE: 362 case FUT_OFF_INODE:
363 iput(key->shared.inode); 363 iput(key->shared.inode);
364 break; 364 break;
365 case FUT_OFF_MMSHARED: 365 case FUT_OFF_MMSHARED:
366 mmdrop(key->private.mm); 366 mmdrop(key->private.mm);
367 break; 367 break;
368 } 368 }
369 } 369 }
370 370
371 /** 371 /**
372 * get_futex_key() - Get parameters which are the keys for a futex 372 * get_futex_key() - Get parameters which are the keys for a futex
373 * @uaddr: virtual address of the futex 373 * @uaddr: virtual address of the futex
374 * @fshared: 0 for a PROCESS_PRIVATE futex, 1 for PROCESS_SHARED 374 * @fshared: 0 for a PROCESS_PRIVATE futex, 1 for PROCESS_SHARED
375 * @key: address where result is stored. 375 * @key: address where result is stored.
376 * @rw: mapping needs to be read/write (values: VERIFY_READ, 376 * @rw: mapping needs to be read/write (values: VERIFY_READ,
377 * VERIFY_WRITE) 377 * VERIFY_WRITE)
378 * 378 *
379 * Return: a negative error code or 0 379 * Return: a negative error code or 0
380 * 380 *
381 * The key words are stored in *key on success. 381 * The key words are stored in *key on success.
382 * 382 *
383 * For shared mappings, it's (page->index, file_inode(vma->vm_file), 383 * For shared mappings, it's (page->index, file_inode(vma->vm_file),
384 * offset_within_page). For private mappings, it's (uaddr, current->mm). 384 * offset_within_page). For private mappings, it's (uaddr, current->mm).
385 * We can usually work out the index without swapping in the page. 385 * We can usually work out the index without swapping in the page.
386 * 386 *
387 * lock_page() might sleep, the caller should not hold a spinlock. 387 * lock_page() might sleep, the caller should not hold a spinlock.
388 */ 388 */
389 static int 389 static int
390 get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw) 390 get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
391 { 391 {
392 unsigned long address = (unsigned long)uaddr; 392 unsigned long address = (unsigned long)uaddr;
393 struct mm_struct *mm = current->mm; 393 struct mm_struct *mm = current->mm;
394 struct page *page, *page_head; 394 struct page *page, *page_head;
395 int err, ro = 0; 395 int err, ro = 0;
396 396
397 /* 397 /*
398 * The futex address must be "naturally" aligned. 398 * The futex address must be "naturally" aligned.
399 */ 399 */
400 key->both.offset = address % PAGE_SIZE; 400 key->both.offset = address % PAGE_SIZE;
401 if (unlikely((address % sizeof(u32)) != 0)) 401 if (unlikely((address % sizeof(u32)) != 0))
402 return -EINVAL; 402 return -EINVAL;
403 address -= key->both.offset; 403 address -= key->both.offset;
404 404
405 if (unlikely(!access_ok(rw, uaddr, sizeof(u32)))) 405 if (unlikely(!access_ok(rw, uaddr, sizeof(u32))))
406 return -EFAULT; 406 return -EFAULT;
407 407
408 /* 408 /*
409 * PROCESS_PRIVATE futexes are fast. 409 * PROCESS_PRIVATE futexes are fast.
410 * As the mm cannot disappear under us and the 'key' only needs 410 * As the mm cannot disappear under us and the 'key' only needs
411 * virtual address, we dont even have to find the underlying vma. 411 * virtual address, we dont even have to find the underlying vma.
412 * Note : We do have to check 'uaddr' is a valid user address, 412 * Note : We do have to check 'uaddr' is a valid user address,
413 * but access_ok() should be faster than find_vma() 413 * but access_ok() should be faster than find_vma()
414 */ 414 */
415 if (!fshared) { 415 if (!fshared) {
416 key->private.mm = mm; 416 key->private.mm = mm;
417 key->private.address = address; 417 key->private.address = address;
418 get_futex_key_refs(key); /* implies MB (B) */ 418 get_futex_key_refs(key); /* implies MB (B) */
419 return 0; 419 return 0;
420 } 420 }
421 421
422 again: 422 again:
423 err = get_user_pages_fast(address, 1, 1, &page); 423 err = get_user_pages_fast(address, 1, 1, &page);
424 /* 424 /*
425 * If write access is not required (eg. FUTEX_WAIT), try 425 * If write access is not required (eg. FUTEX_WAIT), try
426 * and get read-only access. 426 * and get read-only access.
427 */ 427 */
428 if (err == -EFAULT && rw == VERIFY_READ) { 428 if (err == -EFAULT && rw == VERIFY_READ) {
429 err = get_user_pages_fast(address, 1, 0, &page); 429 err = get_user_pages_fast(address, 1, 0, &page);
430 ro = 1; 430 ro = 1;
431 } 431 }
432 if (err < 0) 432 if (err < 0)
433 return err; 433 return err;
434 else 434 else
435 err = 0; 435 err = 0;
436 436
437 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 437 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
438 page_head = page; 438 page_head = page;
439 if (unlikely(PageTail(page))) { 439 if (unlikely(PageTail(page))) {
440 put_page(page); 440 put_page(page);
441 /* serialize against __split_huge_page_splitting() */ 441 /* serialize against __split_huge_page_splitting() */
442 local_irq_disable(); 442 local_irq_disable();
443 if (likely(__get_user_pages_fast(address, 1, !ro, &page) == 1)) { 443 if (likely(__get_user_pages_fast(address, 1, !ro, &page) == 1)) {
444 page_head = compound_head(page); 444 page_head = compound_head(page);
445 /* 445 /*
446 * page_head is valid pointer but we must pin 446 * page_head is valid pointer but we must pin
447 * it before taking the PG_lock and/or 447 * it before taking the PG_lock and/or
448 * PG_compound_lock. The moment we re-enable 448 * PG_compound_lock. The moment we re-enable
449 * irqs __split_huge_page_splitting() can 449 * irqs __split_huge_page_splitting() can
450 * return and the head page can be freed from 450 * return and the head page can be freed from
451 * under us. We can't take the PG_lock and/or 451 * under us. We can't take the PG_lock and/or
452 * PG_compound_lock on a page that could be 452 * PG_compound_lock on a page that could be
453 * freed from under us. 453 * freed from under us.
454 */ 454 */
455 if (page != page_head) { 455 if (page != page_head) {
456 get_page(page_head); 456 get_page(page_head);
457 put_page(page); 457 put_page(page);
458 } 458 }
459 local_irq_enable(); 459 local_irq_enable();
460 } else { 460 } else {
461 local_irq_enable(); 461 local_irq_enable();
462 goto again; 462 goto again;
463 } 463 }
464 } 464 }
465 #else 465 #else
466 page_head = compound_head(page); 466 page_head = compound_head(page);
467 if (page != page_head) { 467 if (page != page_head) {
468 get_page(page_head); 468 get_page(page_head);
469 put_page(page); 469 put_page(page);
470 } 470 }
471 #endif 471 #endif
472 472
473 lock_page(page_head); 473 lock_page(page_head);
474 474
475 /* 475 /*
476 * If page_head->mapping is NULL, then it cannot be a PageAnon 476 * If page_head->mapping is NULL, then it cannot be a PageAnon
477 * page; but it might be the ZERO_PAGE or in the gate area or 477 * page; but it might be the ZERO_PAGE or in the gate area or
478 * in a special mapping (all cases which we are happy to fail); 478 * in a special mapping (all cases which we are happy to fail);
479 * or it may have been a good file page when get_user_pages_fast 479 * or it may have been a good file page when get_user_pages_fast
480 * found it, but truncated or holepunched or subjected to 480 * found it, but truncated or holepunched or subjected to
481 * invalidate_complete_page2 before we got the page lock (also 481 * invalidate_complete_page2 before we got the page lock (also
482 * cases which we are happy to fail). And we hold a reference, 482 * cases which we are happy to fail). And we hold a reference,
483 * so refcount care in invalidate_complete_page's remove_mapping 483 * so refcount care in invalidate_complete_page's remove_mapping
484 * prevents drop_caches from setting mapping to NULL beneath us. 484 * prevents drop_caches from setting mapping to NULL beneath us.
485 * 485 *
486 * The case we do have to guard against is when memory pressure made 486 * The case we do have to guard against is when memory pressure made
487 * shmem_writepage move it from filecache to swapcache beneath us: 487 * shmem_writepage move it from filecache to swapcache beneath us:
488 * an unlikely race, but we do need to retry for page_head->mapping. 488 * an unlikely race, but we do need to retry for page_head->mapping.
489 */ 489 */
490 if (!page_head->mapping) { 490 if (!page_head->mapping) {
491 int shmem_swizzled = PageSwapCache(page_head); 491 int shmem_swizzled = PageSwapCache(page_head);
492 unlock_page(page_head); 492 unlock_page(page_head);
493 put_page(page_head); 493 put_page(page_head);
494 if (shmem_swizzled) 494 if (shmem_swizzled)
495 goto again; 495 goto again;
496 return -EFAULT; 496 return -EFAULT;
497 } 497 }
498 498
499 /* 499 /*
500 * Private mappings are handled in a simple way. 500 * Private mappings are handled in a simple way.
501 * 501 *
502 * NOTE: When userspace waits on a MAP_SHARED mapping, even if 502 * NOTE: When userspace waits on a MAP_SHARED mapping, even if
503 * it's a read-only handle, it's expected that futexes attach to 503 * it's a read-only handle, it's expected that futexes attach to
504 * the object not the particular process. 504 * the object not the particular process.
505 */ 505 */
506 if (PageAnon(page_head)) { 506 if (PageAnon(page_head)) {
507 /* 507 /*
508 * A RO anonymous page will never change and thus doesn't make 508 * A RO anonymous page will never change and thus doesn't make
509 * sense for futex operations. 509 * sense for futex operations.
510 */ 510 */
511 if (ro) { 511 if (ro) {
512 err = -EFAULT; 512 err = -EFAULT;
513 goto out; 513 goto out;
514 } 514 }
515 515
516 key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */ 516 key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
517 key->private.mm = mm; 517 key->private.mm = mm;
518 key->private.address = address; 518 key->private.address = address;
519 } else { 519 } else {
520 key->both.offset |= FUT_OFF_INODE; /* inode-based key */ 520 key->both.offset |= FUT_OFF_INODE; /* inode-based key */
521 key->shared.inode = page_head->mapping->host; 521 key->shared.inode = page_head->mapping->host;
522 key->shared.pgoff = basepage_index(page); 522 key->shared.pgoff = basepage_index(page);
523 } 523 }
524 524
525 get_futex_key_refs(key); /* implies MB (B) */ 525 get_futex_key_refs(key); /* implies MB (B) */
526 526
527 out: 527 out:
528 unlock_page(page_head); 528 unlock_page(page_head);
529 put_page(page_head); 529 put_page(page_head);
530 return err; 530 return err;
531 } 531 }
532 532
533 static inline void put_futex_key(union futex_key *key) 533 static inline void put_futex_key(union futex_key *key)
534 { 534 {
535 drop_futex_key_refs(key); 535 drop_futex_key_refs(key);
536 } 536 }
537 537
538 /** 538 /**
539 * fault_in_user_writeable() - Fault in user address and verify RW access 539 * fault_in_user_writeable() - Fault in user address and verify RW access
540 * @uaddr: pointer to faulting user space address 540 * @uaddr: pointer to faulting user space address
541 * 541 *
542 * Slow path to fixup the fault we just took in the atomic write 542 * Slow path to fixup the fault we just took in the atomic write
543 * access to @uaddr. 543 * access to @uaddr.
544 * 544 *
545 * We have no generic implementation of a non-destructive write to the 545 * We have no generic implementation of a non-destructive write to the
546 * user address. We know that we faulted in the atomic pagefault 546 * user address. We know that we faulted in the atomic pagefault
547 * disabled section so we can as well avoid the #PF overhead by 547 * disabled section so we can as well avoid the #PF overhead by
548 * calling get_user_pages() right away. 548 * calling get_user_pages() right away.
549 */ 549 */
550 static int fault_in_user_writeable(u32 __user *uaddr) 550 static int fault_in_user_writeable(u32 __user *uaddr)
551 { 551 {
552 struct mm_struct *mm = current->mm; 552 struct mm_struct *mm = current->mm;
553 int ret; 553 int ret;
554 554
555 down_read(&mm->mmap_sem); 555 down_read(&mm->mmap_sem);
556 ret = fixup_user_fault(current, mm, (unsigned long)uaddr, 556 ret = fixup_user_fault(current, mm, (unsigned long)uaddr,
557 FAULT_FLAG_WRITE); 557 FAULT_FLAG_WRITE);
558 up_read(&mm->mmap_sem); 558 up_read(&mm->mmap_sem);
559 559
560 return ret < 0 ? ret : 0; 560 return ret < 0 ? ret : 0;
561 } 561 }
562 562
563 /** 563 /**
564 * futex_top_waiter() - Return the highest priority waiter on a futex 564 * futex_top_waiter() - Return the highest priority waiter on a futex
565 * @hb: the hash bucket the futex_q's reside in 565 * @hb: the hash bucket the futex_q's reside in
566 * @key: the futex key (to distinguish it from other futex futex_q's) 566 * @key: the futex key (to distinguish it from other futex futex_q's)
567 * 567 *
568 * Must be called with the hb lock held. 568 * Must be called with the hb lock held.
569 */ 569 */
570 static struct futex_q *futex_top_waiter(struct futex_hash_bucket *hb, 570 static struct futex_q *futex_top_waiter(struct futex_hash_bucket *hb,
571 union futex_key *key) 571 union futex_key *key)
572 { 572 {
573 struct futex_q *this; 573 struct futex_q *this;
574 574
575 plist_for_each_entry(this, &hb->chain, list) { 575 plist_for_each_entry(this, &hb->chain, list) {
576 if (match_futex(&this->key, key)) 576 if (match_futex(&this->key, key))
577 return this; 577 return this;
578 } 578 }
579 return NULL; 579 return NULL;
580 } 580 }
581 581
582 static int cmpxchg_futex_value_locked(u32 *curval, u32 __user *uaddr, 582 static int cmpxchg_futex_value_locked(u32 *curval, u32 __user *uaddr,
583 u32 uval, u32 newval) 583 u32 uval, u32 newval)
584 { 584 {
585 int ret; 585 int ret;
586 586
587 pagefault_disable(); 587 pagefault_disable();
588 ret = futex_atomic_cmpxchg_inatomic(curval, uaddr, uval, newval); 588 ret = futex_atomic_cmpxchg_inatomic(curval, uaddr, uval, newval);
589 pagefault_enable(); 589 pagefault_enable();
590 590
591 return ret; 591 return ret;
592 } 592 }
593 593
594 static int get_futex_value_locked(u32 *dest, u32 __user *from) 594 static int get_futex_value_locked(u32 *dest, u32 __user *from)
595 { 595 {
596 int ret; 596 int ret;
597 597
598 pagefault_disable(); 598 pagefault_disable();
599 ret = __copy_from_user_inatomic(dest, from, sizeof(u32)); 599 ret = __copy_from_user_inatomic(dest, from, sizeof(u32));
600 pagefault_enable(); 600 pagefault_enable();
601 601
602 return ret ? -EFAULT : 0; 602 return ret ? -EFAULT : 0;
603 } 603 }
604 604
605 605
606 /* 606 /*
607 * PI code: 607 * PI code:
608 */ 608 */
609 static int refill_pi_state_cache(void) 609 static int refill_pi_state_cache(void)
610 { 610 {
611 struct futex_pi_state *pi_state; 611 struct futex_pi_state *pi_state;
612 612
613 if (likely(current->pi_state_cache)) 613 if (likely(current->pi_state_cache))
614 return 0; 614 return 0;
615 615
616 pi_state = kzalloc(sizeof(*pi_state), GFP_KERNEL); 616 pi_state = kzalloc(sizeof(*pi_state), GFP_KERNEL);
617 617
618 if (!pi_state) 618 if (!pi_state)
619 return -ENOMEM; 619 return -ENOMEM;
620 620
621 INIT_LIST_HEAD(&pi_state->list); 621 INIT_LIST_HEAD(&pi_state->list);
622 /* pi_mutex gets initialized later */ 622 /* pi_mutex gets initialized later */
623 pi_state->owner = NULL; 623 pi_state->owner = NULL;
624 atomic_set(&pi_state->refcount, 1); 624 atomic_set(&pi_state->refcount, 1);
625 pi_state->key = FUTEX_KEY_INIT; 625 pi_state->key = FUTEX_KEY_INIT;
626 626
627 current->pi_state_cache = pi_state; 627 current->pi_state_cache = pi_state;
628 628
629 return 0; 629 return 0;
630 } 630 }
631 631
632 static struct futex_pi_state * alloc_pi_state(void) 632 static struct futex_pi_state * alloc_pi_state(void)
633 { 633 {
634 struct futex_pi_state *pi_state = current->pi_state_cache; 634 struct futex_pi_state *pi_state = current->pi_state_cache;
635 635
636 WARN_ON(!pi_state); 636 WARN_ON(!pi_state);
637 current->pi_state_cache = NULL; 637 current->pi_state_cache = NULL;
638 638
639 return pi_state; 639 return pi_state;
640 } 640 }
641 641
642 static void free_pi_state(struct futex_pi_state *pi_state) 642 static void free_pi_state(struct futex_pi_state *pi_state)
643 { 643 {
644 if (!atomic_dec_and_test(&pi_state->refcount)) 644 if (!atomic_dec_and_test(&pi_state->refcount))
645 return; 645 return;
646 646
647 /* 647 /*
648 * If pi_state->owner is NULL, the owner is most probably dying 648 * If pi_state->owner is NULL, the owner is most probably dying
649 * and has cleaned up the pi_state already 649 * and has cleaned up the pi_state already
650 */ 650 */
651 if (pi_state->owner) { 651 if (pi_state->owner) {
652 raw_spin_lock_irq(&pi_state->owner->pi_lock); 652 raw_spin_lock_irq(&pi_state->owner->pi_lock);
653 list_del_init(&pi_state->list); 653 list_del_init(&pi_state->list);
654 raw_spin_unlock_irq(&pi_state->owner->pi_lock); 654 raw_spin_unlock_irq(&pi_state->owner->pi_lock);
655 655
656 rt_mutex_proxy_unlock(&pi_state->pi_mutex, pi_state->owner); 656 rt_mutex_proxy_unlock(&pi_state->pi_mutex, pi_state->owner);
657 } 657 }
658 658
659 if (current->pi_state_cache) 659 if (current->pi_state_cache)
660 kfree(pi_state); 660 kfree(pi_state);
661 else { 661 else {
662 /* 662 /*
663 * pi_state->list is already empty. 663 * pi_state->list is already empty.
664 * clear pi_state->owner. 664 * clear pi_state->owner.
665 * refcount is at 0 - put it back to 1. 665 * refcount is at 0 - put it back to 1.
666 */ 666 */
667 pi_state->owner = NULL; 667 pi_state->owner = NULL;
668 atomic_set(&pi_state->refcount, 1); 668 atomic_set(&pi_state->refcount, 1);
669 current->pi_state_cache = pi_state; 669 current->pi_state_cache = pi_state;
670 } 670 }
671 } 671 }
672 672
673 /* 673 /*
674 * Look up the task based on what TID userspace gave us. 674 * Look up the task based on what TID userspace gave us.
675 * We dont trust it. 675 * We dont trust it.
676 */ 676 */
677 static struct task_struct * futex_find_get_task(pid_t pid) 677 static struct task_struct * futex_find_get_task(pid_t pid)
678 { 678 {
679 struct task_struct *p; 679 struct task_struct *p;
680 680
681 rcu_read_lock(); 681 rcu_read_lock();
682 p = find_task_by_vpid(pid); 682 p = find_task_by_vpid(pid);
683 if (p) 683 if (p)
684 get_task_struct(p); 684 get_task_struct(p);
685 685
686 rcu_read_unlock(); 686 rcu_read_unlock();
687 687
688 return p; 688 return p;
689 } 689 }
690 690
691 /* 691 /*
692 * This task is holding PI mutexes at exit time => bad. 692 * This task is holding PI mutexes at exit time => bad.
693 * Kernel cleans up PI-state, but userspace is likely hosed. 693 * Kernel cleans up PI-state, but userspace is likely hosed.
694 * (Robust-futex cleanup is separate and might save the day for userspace.) 694 * (Robust-futex cleanup is separate and might save the day for userspace.)
695 */ 695 */
696 void exit_pi_state_list(struct task_struct *curr) 696 void exit_pi_state_list(struct task_struct *curr)
697 { 697 {
698 struct list_head *next, *head = &curr->pi_state_list; 698 struct list_head *next, *head = &curr->pi_state_list;
699 struct futex_pi_state *pi_state; 699 struct futex_pi_state *pi_state;
700 struct futex_hash_bucket *hb; 700 struct futex_hash_bucket *hb;
701 union futex_key key = FUTEX_KEY_INIT; 701 union futex_key key = FUTEX_KEY_INIT;
702 702
703 if (!futex_cmpxchg_enabled) 703 if (!futex_cmpxchg_enabled)
704 return; 704 return;
705 /* 705 /*
706 * We are a ZOMBIE and nobody can enqueue itself on 706 * We are a ZOMBIE and nobody can enqueue itself on
707 * pi_state_list anymore, but we have to be careful 707 * pi_state_list anymore, but we have to be careful
708 * versus waiters unqueueing themselves: 708 * versus waiters unqueueing themselves:
709 */ 709 */
710 raw_spin_lock_irq(&curr->pi_lock); 710 raw_spin_lock_irq(&curr->pi_lock);
711 while (!list_empty(head)) { 711 while (!list_empty(head)) {
712 712
713 next = head->next; 713 next = head->next;
714 pi_state = list_entry(next, struct futex_pi_state, list); 714 pi_state = list_entry(next, struct futex_pi_state, list);
715 key = pi_state->key; 715 key = pi_state->key;
716 hb = hash_futex(&key); 716 hb = hash_futex(&key);
717 raw_spin_unlock_irq(&curr->pi_lock); 717 raw_spin_unlock_irq(&curr->pi_lock);
718 718
719 spin_lock(&hb->lock); 719 spin_lock(&hb->lock);
720 720
721 raw_spin_lock_irq(&curr->pi_lock); 721 raw_spin_lock_irq(&curr->pi_lock);
722 /* 722 /*
723 * We dropped the pi-lock, so re-check whether this 723 * We dropped the pi-lock, so re-check whether this
724 * task still owns the PI-state: 724 * task still owns the PI-state:
725 */ 725 */
726 if (head->next != next) { 726 if (head->next != next) {
727 spin_unlock(&hb->lock); 727 spin_unlock(&hb->lock);
728 continue; 728 continue;
729 } 729 }
730 730
731 WARN_ON(pi_state->owner != curr); 731 WARN_ON(pi_state->owner != curr);
732 WARN_ON(list_empty(&pi_state->list)); 732 WARN_ON(list_empty(&pi_state->list));
733 list_del_init(&pi_state->list); 733 list_del_init(&pi_state->list);
734 pi_state->owner = NULL; 734 pi_state->owner = NULL;
735 raw_spin_unlock_irq(&curr->pi_lock); 735 raw_spin_unlock_irq(&curr->pi_lock);
736 736
737 rt_mutex_unlock(&pi_state->pi_mutex); 737 rt_mutex_unlock(&pi_state->pi_mutex);
738 738
739 spin_unlock(&hb->lock); 739 spin_unlock(&hb->lock);
740 740
741 raw_spin_lock_irq(&curr->pi_lock); 741 raw_spin_lock_irq(&curr->pi_lock);
742 } 742 }
743 raw_spin_unlock_irq(&curr->pi_lock); 743 raw_spin_unlock_irq(&curr->pi_lock);
744 } 744 }
745 745
746 /*
747 * We need to check the following states:
748 *
749 * Waiter | pi_state | pi->owner | uTID | uODIED | ?
750 *
751 * [1] NULL | --- | --- | 0 | 0/1 | Valid
752 * [2] NULL | --- | --- | >0 | 0/1 | Valid
753 *
754 * [3] Found | NULL | -- | Any | 0/1 | Invalid
755 *
756 * [4] Found | Found | NULL | 0 | 1 | Valid
757 * [5] Found | Found | NULL | >0 | 1 | Invalid
758 *
759 * [6] Found | Found | task | 0 | 1 | Valid
760 *
761 * [7] Found | Found | NULL | Any | 0 | Invalid
762 *
763 * [8] Found | Found | task | ==taskTID | 0/1 | Valid
764 * [9] Found | Found | task | 0 | 0 | Invalid
765 * [10] Found | Found | task | !=taskTID | 0/1 | Invalid
766 *
767 * [1] Indicates that the kernel can acquire the futex atomically. We
768 * came here due to a stale FUTEX_WAITERS/FUTEX_OWNER_DIED bit.
769 *
770 * [2] Valid, if TID does not belong to a kernel thread. If no matching
771 * thread is found then it indicates that the owner TID has died.
772 *
773 * [3] Invalid. The waiter is queued on a non PI futex
774 *
775 * [4] Valid state after exit_robust_list(), which sets the user space
776 * value to FUTEX_WAITERS | FUTEX_OWNER_DIED.
777 *
778 * [5] The user space value got manipulated between exit_robust_list()
779 * and exit_pi_state_list()
780 *
781 * [6] Valid state after exit_pi_state_list() which sets the new owner in
782 * the pi_state but cannot access the user space value.
783 *
784 * [7] pi_state->owner can only be NULL when the OWNER_DIED bit is set.
785 *
786 * [8] Owner and user space value match
787 *
788 * [9] There is no transient state which sets the user space TID to 0
789 * except exit_robust_list(), but this is indicated by the
790 * FUTEX_OWNER_DIED bit. See [4]
791 *
792 * [10] There is no transient state which leaves owner and user space
793 * TID out of sync.
794 */
746 static int 795 static int
747 lookup_pi_state(u32 uval, struct futex_hash_bucket *hb, 796 lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
748 union futex_key *key, struct futex_pi_state **ps, 797 union futex_key *key, struct futex_pi_state **ps)
749 struct task_struct *task)
750 { 798 {
751 struct futex_pi_state *pi_state = NULL; 799 struct futex_pi_state *pi_state = NULL;
752 struct futex_q *this, *next; 800 struct futex_q *this, *next;
753 struct task_struct *p; 801 struct task_struct *p;
754 pid_t pid = uval & FUTEX_TID_MASK; 802 pid_t pid = uval & FUTEX_TID_MASK;
755 803
756 plist_for_each_entry_safe(this, next, &hb->chain, list) { 804 plist_for_each_entry_safe(this, next, &hb->chain, list) {
757 if (match_futex(&this->key, key)) { 805 if (match_futex(&this->key, key)) {
758 /* 806 /*
759 * Another waiter already exists - bump up 807 * Sanity check the waiter before increasing
760 * the refcount and return its pi_state: 808 * the refcount and attaching to it.
761 */ 809 */
762 pi_state = this->pi_state; 810 pi_state = this->pi_state;
763 /* 811 /*
764 * Userspace might have messed up non-PI and PI futexes 812 * Userspace might have messed up non-PI and
813 * PI futexes [3]
765 */ 814 */
766 if (unlikely(!pi_state)) 815 if (unlikely(!pi_state))
767 return -EINVAL; 816 return -EINVAL;
768 817
769 WARN_ON(!atomic_read(&pi_state->refcount)); 818 WARN_ON(!atomic_read(&pi_state->refcount));
770 819
771 /* 820 /*
772 * When pi_state->owner is NULL then the owner died 821 * Handle the owner died case:
773 * and another waiter is on the fly. pi_state->owner
774 * is fixed up by the task which acquires
775 * pi_state->rt_mutex.
776 *
777 * We do not check for pid == 0 which can happen when
778 * the owner died and robust_list_exit() cleared the
779 * TID.
780 */ 822 */
781 if (pid && pi_state->owner) { 823 if (uval & FUTEX_OWNER_DIED) {
782 /* 824 /*
783 * Bail out if user space manipulated the 825 * exit_pi_state_list sets owner to NULL and
784 * futex value. 826 * wakes the topmost waiter. The task which
827 * acquires the pi_state->rt_mutex will fixup
828 * owner.
785 */ 829 */
786 if (pid != task_pid_vnr(pi_state->owner)) 830 if (!pi_state->owner) {
831 /*
832 * No pi state owner, but the user
833 * space TID is not 0. Inconsistent
834 * state. [5]
835 */
836 if (pid)
837 return -EINVAL;
838 /*
839 * Take a ref on the state and
840 * return. [4]
841 */
842 goto out_state;
843 }
844
845 /*
846 * If TID is 0, then either the dying owner
847 * has not yet executed exit_pi_state_list()
848 * or some waiter acquired the rtmutex in the
849 * pi state, but did not yet fixup the TID in
850 * user space.
851 *
852 * Take a ref on the state and return. [6]
853 */
854 if (!pid)
855 goto out_state;
856 } else {
857 /*
858 * If the owner died bit is not set,
859 * then the pi_state must have an
860 * owner. [7]
861 */
862 if (!pi_state->owner)
787 return -EINVAL; 863 return -EINVAL;
788 } 864 }
789 865
790 /* 866 /*
791 * Protect against a corrupted uval. If uval 867 * Bail out if user space manipulated the
792 * is 0x80000000 then pid is 0 and the waiter 868 * futex value. If pi state exists then the
793 * bit is set. So the deadlock check in the 869 * owner TID must be the same as the user
794 * calling code has failed and we did not fall 870 * space TID. [9/10]
795 * into the check above due to !pid.
796 */ 871 */
797 if (task && pi_state->owner == task) 872 if (pid != task_pid_vnr(pi_state->owner))
798 return -EDEADLK; 873 return -EINVAL;
799 874
875 out_state:
800 atomic_inc(&pi_state->refcount); 876 atomic_inc(&pi_state->refcount);
801 *ps = pi_state; 877 *ps = pi_state;
802
803 return 0; 878 return 0;
804 } 879 }
805 } 880 }
806 881
807 /* 882 /*
808 * We are the first waiter - try to look up the real owner and attach 883 * We are the first waiter - try to look up the real owner and attach
809 * the new pi_state to it, but bail out when TID = 0 884 * the new pi_state to it, but bail out when TID = 0 [1]
810 */ 885 */
811 if (!pid) 886 if (!pid)
812 return -ESRCH; 887 return -ESRCH;
813 p = futex_find_get_task(pid); 888 p = futex_find_get_task(pid);
814 if (!p) 889 if (!p)
815 return -ESRCH; 890 return -ESRCH;
816 891
817 if (!p->mm) { 892 if (!p->mm) {
818 put_task_struct(p); 893 put_task_struct(p);
819 return -EPERM; 894 return -EPERM;
820 } 895 }
821 896
822 /* 897 /*
823 * We need to look at the task state flags to figure out, 898 * We need to look at the task state flags to figure out,
824 * whether the task is exiting. To protect against the do_exit 899 * whether the task is exiting. To protect against the do_exit
825 * change of the task flags, we do this protected by 900 * change of the task flags, we do this protected by
826 * p->pi_lock: 901 * p->pi_lock:
827 */ 902 */
828 raw_spin_lock_irq(&p->pi_lock); 903 raw_spin_lock_irq(&p->pi_lock);
829 if (unlikely(p->flags & PF_EXITING)) { 904 if (unlikely(p->flags & PF_EXITING)) {
830 /* 905 /*
831 * The task is on the way out. When PF_EXITPIDONE is 906 * The task is on the way out. When PF_EXITPIDONE is
832 * set, we know that the task has finished the 907 * set, we know that the task has finished the
833 * cleanup: 908 * cleanup:
834 */ 909 */
835 int ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN; 910 int ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN;
836 911
837 raw_spin_unlock_irq(&p->pi_lock); 912 raw_spin_unlock_irq(&p->pi_lock);
838 put_task_struct(p); 913 put_task_struct(p);
839 return ret; 914 return ret;
840 } 915 }
841 916
917 /*
918 * No existing pi state. First waiter. [2]
919 */
842 pi_state = alloc_pi_state(); 920 pi_state = alloc_pi_state();
843 921
844 /* 922 /*
845 * Initialize the pi_mutex in locked state and make 'p' 923 * Initialize the pi_mutex in locked state and make 'p'
846 * the owner of it: 924 * the owner of it:
847 */ 925 */
848 rt_mutex_init_proxy_locked(&pi_state->pi_mutex, p); 926 rt_mutex_init_proxy_locked(&pi_state->pi_mutex, p);
849 927
850 /* Store the key for possible exit cleanups: */ 928 /* Store the key for possible exit cleanups: */
851 pi_state->key = *key; 929 pi_state->key = *key;
852 930
853 WARN_ON(!list_empty(&pi_state->list)); 931 WARN_ON(!list_empty(&pi_state->list));
854 list_add(&pi_state->list, &p->pi_state_list); 932 list_add(&pi_state->list, &p->pi_state_list);
855 pi_state->owner = p; 933 pi_state->owner = p;
856 raw_spin_unlock_irq(&p->pi_lock); 934 raw_spin_unlock_irq(&p->pi_lock);
857 935
858 put_task_struct(p); 936 put_task_struct(p);
859 937
860 *ps = pi_state; 938 *ps = pi_state;
861 939
862 return 0; 940 return 0;
863 } 941 }
864 942
865 /** 943 /**
866 * futex_lock_pi_atomic() - Atomic work required to acquire a pi aware futex 944 * futex_lock_pi_atomic() - Atomic work required to acquire a pi aware futex
867 * @uaddr: the pi futex user address 945 * @uaddr: the pi futex user address
868 * @hb: the pi futex hash bucket 946 * @hb: the pi futex hash bucket
869 * @key: the futex key associated with uaddr and hb 947 * @key: the futex key associated with uaddr and hb
870 * @ps: the pi_state pointer where we store the result of the 948 * @ps: the pi_state pointer where we store the result of the
871 * lookup 949 * lookup
872 * @task: the task to perform the atomic lock work for. This will 950 * @task: the task to perform the atomic lock work for. This will
873 * be "current" except in the case of requeue pi. 951 * be "current" except in the case of requeue pi.
874 * @set_waiters: force setting the FUTEX_WAITERS bit (1) or not (0) 952 * @set_waiters: force setting the FUTEX_WAITERS bit (1) or not (0)
875 * 953 *
876 * Return: 954 * Return:
877 * 0 - ready to wait; 955 * 0 - ready to wait;
878 * 1 - acquired the lock; 956 * 1 - acquired the lock;
879 * <0 - error 957 * <0 - error
880 * 958 *
881 * The hb->lock and futex_key refs shall be held by the caller. 959 * The hb->lock and futex_key refs shall be held by the caller.
882 */ 960 */
883 static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb, 961 static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
884 union futex_key *key, 962 union futex_key *key,
885 struct futex_pi_state **ps, 963 struct futex_pi_state **ps,
886 struct task_struct *task, int set_waiters) 964 struct task_struct *task, int set_waiters)
887 { 965 {
888 int lock_taken, ret, force_take = 0; 966 int lock_taken, ret, force_take = 0;
889 u32 uval, newval, curval, vpid = task_pid_vnr(task); 967 u32 uval, newval, curval, vpid = task_pid_vnr(task);
890 968
891 retry: 969 retry:
892 ret = lock_taken = 0; 970 ret = lock_taken = 0;
893 971
894 /* 972 /*
895 * To avoid races, we attempt to take the lock here again 973 * To avoid races, we attempt to take the lock here again
896 * (by doing a 0 -> TID atomic cmpxchg), while holding all 974 * (by doing a 0 -> TID atomic cmpxchg), while holding all
897 * the locks. It will most likely not succeed. 975 * the locks. It will most likely not succeed.
898 */ 976 */
899 newval = vpid; 977 newval = vpid;
900 if (set_waiters) 978 if (set_waiters)
901 newval |= FUTEX_WAITERS; 979 newval |= FUTEX_WAITERS;
902 980
903 if (unlikely(cmpxchg_futex_value_locked(&curval, uaddr, 0, newval))) 981 if (unlikely(cmpxchg_futex_value_locked(&curval, uaddr, 0, newval)))
904 return -EFAULT; 982 return -EFAULT;
905 983
906 /* 984 /*
907 * Detect deadlocks. 985 * Detect deadlocks.
908 */ 986 */
909 if ((unlikely((curval & FUTEX_TID_MASK) == vpid))) 987 if ((unlikely((curval & FUTEX_TID_MASK) == vpid)))
910 return -EDEADLK; 988 return -EDEADLK;
911 989
912 /* 990 /*
913 * Surprise - we got the lock. Just return to userspace: 991 * Surprise - we got the lock, but we do not trust user space at all.
914 */ 992 */
915 if (unlikely(!curval)) 993 if (unlikely(!curval)) {
916 return 1; 994 /*
995 * We verify whether there is kernel state for this
996 * futex. If not, we can safely assume that the 0 ->
997 * TID transition is correct. If state exists, we do
998 * not bother to fixup the user space state as it was
999 * corrupted already.
1000 */
1001 return futex_top_waiter(hb, key) ? -EINVAL : 1;
1002 }
917 1003
918 uval = curval; 1004 uval = curval;
919 1005
920 /* 1006 /*
921 * Set the FUTEX_WAITERS flag, so the owner will know it has someone 1007 * Set the FUTEX_WAITERS flag, so the owner will know it has someone
922 * to wake at the next unlock. 1008 * to wake at the next unlock.
923 */ 1009 */
924 newval = curval | FUTEX_WAITERS; 1010 newval = curval | FUTEX_WAITERS;
925 1011
926 /* 1012 /*
927 * Should we force take the futex? See below. 1013 * Should we force take the futex? See below.
928 */ 1014 */
929 if (unlikely(force_take)) { 1015 if (unlikely(force_take)) {
930 /* 1016 /*
931 * Keep the OWNER_DIED and the WAITERS bit and set the 1017 * Keep the OWNER_DIED and the WAITERS bit and set the
932 * new TID value. 1018 * new TID value.
933 */ 1019 */
934 newval = (curval & ~FUTEX_TID_MASK) | vpid; 1020 newval = (curval & ~FUTEX_TID_MASK) | vpid;
935 force_take = 0; 1021 force_take = 0;
936 lock_taken = 1; 1022 lock_taken = 1;
937 } 1023 }
938 1024
939 if (unlikely(cmpxchg_futex_value_locked(&curval, uaddr, uval, newval))) 1025 if (unlikely(cmpxchg_futex_value_locked(&curval, uaddr, uval, newval)))
940 return -EFAULT; 1026 return -EFAULT;
941 if (unlikely(curval != uval)) 1027 if (unlikely(curval != uval))
942 goto retry; 1028 goto retry;
943 1029
944 /* 1030 /*
945 * We took the lock due to a forced takeover. 1031 * We took the lock due to a forced takeover.
946 */ 1032 */
947 if (unlikely(lock_taken)) 1033 if (unlikely(lock_taken))
948 return 1; 1034 return 1;
949 1035
950 /* 1036 /*
951 * We don't have the lock. Look up the PI state (or create it if 1037 * We don't have the lock. Look up the PI state (or create it if
952 * we are the first waiter): 1038 * we are the first waiter):
953 */ 1039 */
954 ret = lookup_pi_state(uval, hb, key, ps, task); 1040 ret = lookup_pi_state(uval, hb, key, ps);
955 1041
956 if (unlikely(ret)) { 1042 if (unlikely(ret)) {
957 switch (ret) { 1043 switch (ret) {
958 case -ESRCH: 1044 case -ESRCH:
959 /* 1045 /*
960 * We failed to find an owner for this 1046 * We failed to find an owner for this
961 * futex. So we have no pi_state to block 1047 * futex. So we have no pi_state to block
962 * on. This can happen in two cases: 1048 * on. This can happen in two cases:
963 * 1049 *
964 * 1) The owner died 1050 * 1) The owner died
965 * 2) A stale FUTEX_WAITERS bit 1051 * 2) A stale FUTEX_WAITERS bit
966 * 1052 *
967 * Re-read the futex value. 1053 * Re-read the futex value.
968 */ 1054 */
969 if (get_futex_value_locked(&curval, uaddr)) 1055 if (get_futex_value_locked(&curval, uaddr))
970 return -EFAULT; 1056 return -EFAULT;
971 1057
972 /* 1058 /*
973 * If the owner died or we have a stale 1059 * If the owner died or we have a stale
974 * WAITERS bit, the owner TID in the user space 1060 * WAITERS bit, the owner TID in the user space
975 * futex is 0. 1061 * futex is 0.
976 */ 1062 */
977 if (!(curval & FUTEX_TID_MASK)) { 1063 if (!(curval & FUTEX_TID_MASK)) {
978 force_take = 1; 1064 force_take = 1;
979 goto retry; 1065 goto retry;
980 } 1066 }
981 default: 1067 default:
982 break; 1068 break;
983 } 1069 }
984 } 1070 }
985 1071
986 return ret; 1072 return ret;
987 } 1073 }
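
  Note (editor's illustration, not part of this patch): the 0 -> TID cmpxchg retried
  above is the kernel-side mirror of the user space fast path. A minimal sketch of a
  PI-aware user space lock is shown below; the pi_lock() helper name is invented for
  this example, while FUTEX_LOCK_PI and the 0 -> TID futex word convention are the
  documented ABI.

      /*
       * Sketch only, assuming raw syscall access. On an uncontended futex the
       * cmpxchg succeeds and the kernel is never entered; otherwise
       * futex_lock_pi_atomic() above does the work.
       */
      #include <linux/futex.h>
      #include <sys/syscall.h>
      #include <unistd.h>
      #include <stdatomic.h>

      static long pi_lock(_Atomic unsigned int *uaddr)
      {
              unsigned int zero = 0;
              unsigned int tid = syscall(SYS_gettid);

              /* Fast path: 0 -> TID, the same transition the kernel retries. */
              if (atomic_compare_exchange_strong(uaddr, &zero, tid))
                      return 0;

              /* Contended: queue in the kernel and priority-boost the owner. */
              return syscall(SYS_futex, uaddr, FUTEX_LOCK_PI, 0, NULL, NULL, 0);
      }
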
988 1074
989 /** 1075 /**
990 * __unqueue_futex() - Remove the futex_q from its futex_hash_bucket 1076 * __unqueue_futex() - Remove the futex_q from its futex_hash_bucket
991 * @q: The futex_q to unqueue 1077 * @q: The futex_q to unqueue
992 * 1078 *
993 * The q->lock_ptr must not be NULL and must be held by the caller. 1079 * The q->lock_ptr must not be NULL and must be held by the caller.
994 */ 1080 */
995 static void __unqueue_futex(struct futex_q *q) 1081 static void __unqueue_futex(struct futex_q *q)
996 { 1082 {
997 struct futex_hash_bucket *hb; 1083 struct futex_hash_bucket *hb;
998 1084
999 if (WARN_ON_SMP(!q->lock_ptr || !spin_is_locked(q->lock_ptr)) 1085 if (WARN_ON_SMP(!q->lock_ptr || !spin_is_locked(q->lock_ptr))
1000 || WARN_ON(plist_node_empty(&q->list))) 1086 || WARN_ON(plist_node_empty(&q->list)))
1001 return; 1087 return;
1002 1088
1003 hb = container_of(q->lock_ptr, struct futex_hash_bucket, lock); 1089 hb = container_of(q->lock_ptr, struct futex_hash_bucket, lock);
1004 plist_del(&q->list, &hb->chain); 1090 plist_del(&q->list, &hb->chain);
1005 hb_waiters_dec(hb); 1091 hb_waiters_dec(hb);
1006 } 1092 }
1007 1093
1008 /* 1094 /*
1009 * The hash bucket lock must be held when this is called. 1095 * The hash bucket lock must be held when this is called.
1010 * Afterwards, the futex_q must not be accessed. 1096 * Afterwards, the futex_q must not be accessed.
1011 */ 1097 */
1012 static void wake_futex(struct futex_q *q) 1098 static void wake_futex(struct futex_q *q)
1013 { 1099 {
1014 struct task_struct *p = q->task; 1100 struct task_struct *p = q->task;
1015 1101
1016 if (WARN(q->pi_state || q->rt_waiter, "refusing to wake PI futex\n")) 1102 if (WARN(q->pi_state || q->rt_waiter, "refusing to wake PI futex\n"))
1017 return; 1103 return;
1018 1104
1019 /* 1105 /*
1020 * We set q->lock_ptr = NULL _before_ we wake up the task. If 1106 * We set q->lock_ptr = NULL _before_ we wake up the task. If
1021 * a non-futex wake up happens on another CPU then the task 1107 * a non-futex wake up happens on another CPU then the task
1022 * might exit and p would dereference a non-existing task 1108 * might exit and p would dereference a non-existing task
1023 * struct. Prevent this by holding a reference on p across the 1109 * struct. Prevent this by holding a reference on p across the
1024 * wake up. 1110 * wake up.
1025 */ 1111 */
1026 get_task_struct(p); 1112 get_task_struct(p);
1027 1113
1028 __unqueue_futex(q); 1114 __unqueue_futex(q);
1029 /* 1115 /*
1030 * The waiting task can free the futex_q as soon as 1116 * The waiting task can free the futex_q as soon as
1031 * q->lock_ptr = NULL is written, without taking any locks. A 1117 * q->lock_ptr = NULL is written, without taking any locks. A
1032 * memory barrier is required here to prevent the following 1118 * memory barrier is required here to prevent the following
1033 * store to lock_ptr from getting ahead of the plist_del. 1119 * store to lock_ptr from getting ahead of the plist_del.
1034 */ 1120 */
1035 smp_wmb(); 1121 smp_wmb();
1036 q->lock_ptr = NULL; 1122 q->lock_ptr = NULL;
1037 1123
1038 wake_up_state(p, TASK_NORMAL); 1124 wake_up_state(p, TASK_NORMAL);
1039 put_task_struct(p); 1125 put_task_struct(p);
1040 } 1126 }
1041 1127
1042 static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this) 1128 static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this)
1043 { 1129 {
1044 struct task_struct *new_owner; 1130 struct task_struct *new_owner;
1045 struct futex_pi_state *pi_state = this->pi_state; 1131 struct futex_pi_state *pi_state = this->pi_state;
1046 u32 uninitialized_var(curval), newval; 1132 u32 uninitialized_var(curval), newval;
1133 int ret = 0;
1047 1134
1048 if (!pi_state) 1135 if (!pi_state)
1049 return -EINVAL; 1136 return -EINVAL;
1050 1137
1051 /* 1138 /*
1052 * If current does not own the pi_state then the futex is 1139 * If current does not own the pi_state then the futex is
1053 * inconsistent and user space fiddled with the futex value. 1140 * inconsistent and user space fiddled with the futex value.
1054 */ 1141 */
1055 if (pi_state->owner != current) 1142 if (pi_state->owner != current)
1056 return -EINVAL; 1143 return -EINVAL;
1057 1144
1058 raw_spin_lock(&pi_state->pi_mutex.wait_lock); 1145 raw_spin_lock(&pi_state->pi_mutex.wait_lock);
1059 new_owner = rt_mutex_next_owner(&pi_state->pi_mutex); 1146 new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
1060 1147
1061 /* 1148 /*
1062 * It is possible that the next waiter (the one that brought 1149 * It is possible that the next waiter (the one that brought
1063 * this owner to the kernel) timed out and is no longer 1150 * this owner to the kernel) timed out and is no longer
1064 * waiting on the lock. 1151 * waiting on the lock.
1065 */ 1152 */
1066 if (!new_owner) 1153 if (!new_owner)
1067 new_owner = this->task; 1154 new_owner = this->task;
1068 1155
1069 /* 1156 /*
1070 * We pass it to the next owner. (The WAITERS bit is always 1157 * We pass it to the next owner. The WAITERS bit is always
1071 * kept enabled while there is PI state around. We must also 1158 * kept enabled while there is PI state around. We clean up the
1072 * preserve the owner died bit.) 1159 * owner died bit, because we are the owner.
1073 */ 1160 */
1074 if (!(uval & FUTEX_OWNER_DIED)) { 1161 newval = FUTEX_WAITERS | task_pid_vnr(new_owner);
1075 int ret = 0;
1076 1162
1077 newval = FUTEX_WAITERS | task_pid_vnr(new_owner); 1163 if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval))
1078 1164 ret = -EFAULT;
1079 if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval)) 1165 else if (curval != uval)
1080 ret = -EFAULT; 1166 ret = -EINVAL;
1081 else if (curval != uval) 1167 if (ret) {
1082 ret = -EINVAL; 1168 raw_spin_unlock(&pi_state->pi_mutex.wait_lock);
1083 if (ret) { 1169 return ret;
1084 raw_spin_unlock(&pi_state->pi_mutex.wait_lock);
1085 return ret;
1086 }
1087 } 1170 }
1088 1171
1089 raw_spin_lock_irq(&pi_state->owner->pi_lock); 1172 raw_spin_lock_irq(&pi_state->owner->pi_lock);
1090 WARN_ON(list_empty(&pi_state->list)); 1173 WARN_ON(list_empty(&pi_state->list));
1091 list_del_init(&pi_state->list); 1174 list_del_init(&pi_state->list);
1092 raw_spin_unlock_irq(&pi_state->owner->pi_lock); 1175 raw_spin_unlock_irq(&pi_state->owner->pi_lock);
1093 1176
1094 raw_spin_lock_irq(&new_owner->pi_lock); 1177 raw_spin_lock_irq(&new_owner->pi_lock);
1095 WARN_ON(!list_empty(&pi_state->list)); 1178 WARN_ON(!list_empty(&pi_state->list));
1096 list_add(&pi_state->list, &new_owner->pi_state_list); 1179 list_add(&pi_state->list, &new_owner->pi_state_list);
1097 pi_state->owner = new_owner; 1180 pi_state->owner = new_owner;
1098 raw_spin_unlock_irq(&new_owner->pi_lock); 1181 raw_spin_unlock_irq(&new_owner->pi_lock);
1099 1182
1100 raw_spin_unlock(&pi_state->pi_mutex.wait_lock); 1183 raw_spin_unlock(&pi_state->pi_mutex.wait_lock);
1101 rt_mutex_unlock(&pi_state->pi_mutex); 1184 rt_mutex_unlock(&pi_state->pi_mutex);
1102 1185
1103 return 0; 1186 return 0;
1104 } 1187 }
1105 1188
1106 static int unlock_futex_pi(u32 __user *uaddr, u32 uval) 1189 static int unlock_futex_pi(u32 __user *uaddr, u32 uval)
1107 { 1190 {
1108 u32 uninitialized_var(oldval); 1191 u32 uninitialized_var(oldval);
1109 1192
1110 /* 1193 /*
1111 * There is no waiter, so we unlock the futex. The owner died 1194 * There is no waiter, so we unlock the futex. The owner died
1112 * bit need not be preserved here. We are the owner: 1195 * bit need not be preserved here. We are the owner:
1113 */ 1196 */
1114 if (cmpxchg_futex_value_locked(&oldval, uaddr, uval, 0)) 1197 if (cmpxchg_futex_value_locked(&oldval, uaddr, uval, 0))
1115 return -EFAULT; 1198 return -EFAULT;
1116 if (oldval != uval) 1199 if (oldval != uval)
1117 return -EAGAIN; 1200 return -EAGAIN;
1118 1201
1119 return 0; 1202 return 0;
1120 } 1203 }
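
  Note (editor's illustration): the unlock path relies on the same contract described
  in the commit message; a sane user space only enters the kernel when it could not
  release the futex atomically. A hedged sketch of that counterpart follows, assuming
  the same includes as the lock sketch above; pi_unlock() is again a made-up name.

      static long pi_unlock(_Atomic unsigned int *uaddr)
      {
              unsigned int tid = syscall(SYS_gettid);

              /* Fast path: uncontended TID -> 0 transition, no syscall needed. */
              if (atomic_compare_exchange_strong(uaddr, &tid, 0))
                      return 0;

              /* FUTEX_WAITERS (or OWNER_DIED) is set: let the kernel hand over. */
              return syscall(SYS_futex, uaddr, FUTEX_UNLOCK_PI, 0, NULL, NULL, 0);
      }
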
1121 1204
1122 /* 1205 /*
1123 * Express the locking dependencies for lockdep: 1206 * Express the locking dependencies for lockdep:
1124 */ 1207 */
1125 static inline void 1208 static inline void
1126 double_lock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2) 1209 double_lock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
1127 { 1210 {
1128 if (hb1 <= hb2) { 1211 if (hb1 <= hb2) {
1129 spin_lock(&hb1->lock); 1212 spin_lock(&hb1->lock);
1130 if (hb1 < hb2) 1213 if (hb1 < hb2)
1131 spin_lock_nested(&hb2->lock, SINGLE_DEPTH_NESTING); 1214 spin_lock_nested(&hb2->lock, SINGLE_DEPTH_NESTING);
1132 } else { /* hb1 > hb2 */ 1215 } else { /* hb1 > hb2 */
1133 spin_lock(&hb2->lock); 1216 spin_lock(&hb2->lock);
1134 spin_lock_nested(&hb1->lock, SINGLE_DEPTH_NESTING); 1217 spin_lock_nested(&hb1->lock, SINGLE_DEPTH_NESTING);
1135 } 1218 }
1136 } 1219 }
1137 1220
1138 static inline void 1221 static inline void
1139 double_unlock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2) 1222 double_unlock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
1140 { 1223 {
1141 spin_unlock(&hb1->lock); 1224 spin_unlock(&hb1->lock);
1142 if (hb1 != hb2) 1225 if (hb1 != hb2)
1143 spin_unlock(&hb2->lock); 1226 spin_unlock(&hb2->lock);
1144 } 1227 }
1145 1228
1146 /* 1229 /*
1147 * Wake up waiters matching bitset queued on this futex (uaddr). 1230 * Wake up waiters matching bitset queued on this futex (uaddr).
1148 */ 1231 */
1149 static int 1232 static int
1150 futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset) 1233 futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
1151 { 1234 {
1152 struct futex_hash_bucket *hb; 1235 struct futex_hash_bucket *hb;
1153 struct futex_q *this, *next; 1236 struct futex_q *this, *next;
1154 union futex_key key = FUTEX_KEY_INIT; 1237 union futex_key key = FUTEX_KEY_INIT;
1155 int ret; 1238 int ret;
1156 1239
1157 if (!bitset) 1240 if (!bitset)
1158 return -EINVAL; 1241 return -EINVAL;
1159 1242
1160 ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_READ); 1243 ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_READ);
1161 if (unlikely(ret != 0)) 1244 if (unlikely(ret != 0))
1162 goto out; 1245 goto out;
1163 1246
1164 hb = hash_futex(&key); 1247 hb = hash_futex(&key);
1165 1248
1166 /* Make sure we really have tasks to wakeup */ 1249 /* Make sure we really have tasks to wakeup */
1167 if (!hb_waiters_pending(hb)) 1250 if (!hb_waiters_pending(hb))
1168 goto out_put_key; 1251 goto out_put_key;
1169 1252
1170 spin_lock(&hb->lock); 1253 spin_lock(&hb->lock);
1171 1254
1172 plist_for_each_entry_safe(this, next, &hb->chain, list) { 1255 plist_for_each_entry_safe(this, next, &hb->chain, list) {
1173 if (match_futex (&this->key, &key)) { 1256 if (match_futex (&this->key, &key)) {
1174 if (this->pi_state || this->rt_waiter) { 1257 if (this->pi_state || this->rt_waiter) {
1175 ret = -EINVAL; 1258 ret = -EINVAL;
1176 break; 1259 break;
1177 } 1260 }
1178 1261
1179 /* Check if one of the bits is set in both bitsets */ 1262 /* Check if one of the bits is set in both bitsets */
1180 if (!(this->bitset & bitset)) 1263 if (!(this->bitset & bitset))
1181 continue; 1264 continue;
1182 1265
1183 wake_futex(this); 1266 wake_futex(this);
1184 if (++ret >= nr_wake) 1267 if (++ret >= nr_wake)
1185 break; 1268 break;
1186 } 1269 }
1187 } 1270 }
1188 1271
1189 spin_unlock(&hb->lock); 1272 spin_unlock(&hb->lock);
1190 out_put_key: 1273 out_put_key:
1191 put_futex_key(&key); 1274 put_futex_key(&key);
1192 out: 1275 out:
1193 return ret; 1276 return ret;
1194 } 1277 }
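
  Note (editor's illustration): the bitset intersection test above (this->bitset & bitset)
  is what makes the bitset variants selective. In the sketch below (helper names invented
  here), a waiter registered with bit 0x1 is not woken by a wake aimed at bit 0x2, because
  the bitsets do not intersect.

      #include <linux/futex.h>
      #include <sys/syscall.h>
      #include <unistd.h>

      /* Waiter: sleep on *uaddr while it still holds val, tagged with bit 0x1. */
      static long wait_on_channel1(unsigned int *uaddr, unsigned int val)
      {
              return syscall(SYS_futex, uaddr, FUTEX_WAIT_BITSET, val,
                             NULL, NULL, 0x1);
      }

      /* Waker: wake at most one waiter whose registered bitset intersects 0x2. */
      static long wake_channel2(unsigned int *uaddr)
      {
              return syscall(SYS_futex, uaddr, FUTEX_WAKE_BITSET, 1,
                             NULL, NULL, 0x2);
      }
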
1195 1278
1196 /* 1279 /*
1197 * Wake up all waiters hashed on the physical page that is mapped 1280 * Wake up all waiters hashed on the physical page that is mapped
1198 * to this virtual address: 1281 * to this virtual address:
1199 */ 1282 */
1200 static int 1283 static int
1201 futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2, 1284 futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
1202 int nr_wake, int nr_wake2, int op) 1285 int nr_wake, int nr_wake2, int op)
1203 { 1286 {
1204 union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT; 1287 union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
1205 struct futex_hash_bucket *hb1, *hb2; 1288 struct futex_hash_bucket *hb1, *hb2;
1206 struct futex_q *this, *next; 1289 struct futex_q *this, *next;
1207 int ret, op_ret; 1290 int ret, op_ret;
1208 1291
1209 retry: 1292 retry:
1210 ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1, VERIFY_READ); 1293 ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1, VERIFY_READ);
1211 if (unlikely(ret != 0)) 1294 if (unlikely(ret != 0))
1212 goto out; 1295 goto out;
1213 ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, VERIFY_WRITE); 1296 ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, VERIFY_WRITE);
1214 if (unlikely(ret != 0)) 1297 if (unlikely(ret != 0))
1215 goto out_put_key1; 1298 goto out_put_key1;
1216 1299
1217 hb1 = hash_futex(&key1); 1300 hb1 = hash_futex(&key1);
1218 hb2 = hash_futex(&key2); 1301 hb2 = hash_futex(&key2);
1219 1302
1220 retry_private: 1303 retry_private:
1221 double_lock_hb(hb1, hb2); 1304 double_lock_hb(hb1, hb2);
1222 op_ret = futex_atomic_op_inuser(op, uaddr2); 1305 op_ret = futex_atomic_op_inuser(op, uaddr2);
1223 if (unlikely(op_ret < 0)) { 1306 if (unlikely(op_ret < 0)) {
1224 1307
1225 double_unlock_hb(hb1, hb2); 1308 double_unlock_hb(hb1, hb2);
1226 1309
1227 #ifndef CONFIG_MMU 1310 #ifndef CONFIG_MMU
1228 /* 1311 /*
1229 * we don't get EFAULT from MMU faults if we don't have an MMU, 1312 * we don't get EFAULT from MMU faults if we don't have an MMU,
1230 * but we might get them from range checking 1313 * but we might get them from range checking
1231 */ 1314 */
1232 ret = op_ret; 1315 ret = op_ret;
1233 goto out_put_keys; 1316 goto out_put_keys;
1234 #endif 1317 #endif
1235 1318
1236 if (unlikely(op_ret != -EFAULT)) { 1319 if (unlikely(op_ret != -EFAULT)) {
1237 ret = op_ret; 1320 ret = op_ret;
1238 goto out_put_keys; 1321 goto out_put_keys;
1239 } 1322 }
1240 1323
1241 ret = fault_in_user_writeable(uaddr2); 1324 ret = fault_in_user_writeable(uaddr2);
1242 if (ret) 1325 if (ret)
1243 goto out_put_keys; 1326 goto out_put_keys;
1244 1327
1245 if (!(flags & FLAGS_SHARED)) 1328 if (!(flags & FLAGS_SHARED))
1246 goto retry_private; 1329 goto retry_private;
1247 1330
1248 put_futex_key(&key2); 1331 put_futex_key(&key2);
1249 put_futex_key(&key1); 1332 put_futex_key(&key1);
1250 goto retry; 1333 goto retry;
1251 } 1334 }
1252 1335
1253 plist_for_each_entry_safe(this, next, &hb1->chain, list) { 1336 plist_for_each_entry_safe(this, next, &hb1->chain, list) {
1254 if (match_futex (&this->key, &key1)) { 1337 if (match_futex (&this->key, &key1)) {
1255 if (this->pi_state || this->rt_waiter) { 1338 if (this->pi_state || this->rt_waiter) {
1256 ret = -EINVAL; 1339 ret = -EINVAL;
1257 goto out_unlock; 1340 goto out_unlock;
1258 } 1341 }
1259 wake_futex(this); 1342 wake_futex(this);
1260 if (++ret >= nr_wake) 1343 if (++ret >= nr_wake)
1261 break; 1344 break;
1262 } 1345 }
1263 } 1346 }
1264 1347
1265 if (op_ret > 0) { 1348 if (op_ret > 0) {
1266 op_ret = 0; 1349 op_ret = 0;
1267 plist_for_each_entry_safe(this, next, &hb2->chain, list) { 1350 plist_for_each_entry_safe(this, next, &hb2->chain, list) {
1268 if (match_futex (&this->key, &key2)) { 1351 if (match_futex (&this->key, &key2)) {
1269 if (this->pi_state || this->rt_waiter) { 1352 if (this->pi_state || this->rt_waiter) {
1270 ret = -EINVAL; 1353 ret = -EINVAL;
1271 goto out_unlock; 1354 goto out_unlock;
1272 } 1355 }
1273 wake_futex(this); 1356 wake_futex(this);
1274 if (++op_ret >= nr_wake2) 1357 if (++op_ret >= nr_wake2)
1275 break; 1358 break;
1276 } 1359 }
1277 } 1360 }
1278 ret += op_ret; 1361 ret += op_ret;
1279 } 1362 }
1280 1363
1281 out_unlock: 1364 out_unlock:
1282 double_unlock_hb(hb1, hb2); 1365 double_unlock_hb(hb1, hb2);
1283 out_put_keys: 1366 out_put_keys:
1284 put_futex_key(&key2); 1367 put_futex_key(&key2);
1285 out_put_key1: 1368 out_put_key1:
1286 put_futex_key(&key1); 1369 put_futex_key(&key1);
1287 out: 1370 out:
1288 return ret; 1371 return ret;
1289 } 1372 }
1290 1373
1291 /** 1374 /**
1292 * requeue_futex() - Requeue a futex_q from one hb to another 1375 * requeue_futex() - Requeue a futex_q from one hb to another
1293 * @q: the futex_q to requeue 1376 * @q: the futex_q to requeue
1294 * @hb1: the source hash_bucket 1377 * @hb1: the source hash_bucket
1295 * @hb2: the target hash_bucket 1378 * @hb2: the target hash_bucket
1296 * @key2: the new key for the requeued futex_q 1379 * @key2: the new key for the requeued futex_q
1297 */ 1380 */
1298 static inline 1381 static inline
1299 void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1, 1382 void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
1300 struct futex_hash_bucket *hb2, union futex_key *key2) 1383 struct futex_hash_bucket *hb2, union futex_key *key2)
1301 { 1384 {
1302 1385
1303 /* 1386 /*
1304 * If key1 and key2 hash to the same bucket, no need to 1387 * If key1 and key2 hash to the same bucket, no need to
1305 * requeue. 1388 * requeue.
1306 */ 1389 */
1307 if (likely(&hb1->chain != &hb2->chain)) { 1390 if (likely(&hb1->chain != &hb2->chain)) {
1308 plist_del(&q->list, &hb1->chain); 1391 plist_del(&q->list, &hb1->chain);
1309 hb_waiters_dec(hb1); 1392 hb_waiters_dec(hb1);
1310 plist_add(&q->list, &hb2->chain); 1393 plist_add(&q->list, &hb2->chain);
1311 hb_waiters_inc(hb2); 1394 hb_waiters_inc(hb2);
1312 q->lock_ptr = &hb2->lock; 1395 q->lock_ptr = &hb2->lock;
1313 } 1396 }
1314 get_futex_key_refs(key2); 1397 get_futex_key_refs(key2);
1315 q->key = *key2; 1398 q->key = *key2;
1316 } 1399 }
1317 1400
1318 /** 1401 /**
1319 * requeue_pi_wake_futex() - Wake a task that acquired the lock during requeue 1402 * requeue_pi_wake_futex() - Wake a task that acquired the lock during requeue
1320 * @q: the futex_q 1403 * @q: the futex_q
1321 * @key: the key of the requeue target futex 1404 * @key: the key of the requeue target futex
1322 * @hb: the hash_bucket of the requeue target futex 1405 * @hb: the hash_bucket of the requeue target futex
1323 * 1406 *
1324 * During futex_requeue, with requeue_pi=1, it is possible to acquire the 1407 * During futex_requeue, with requeue_pi=1, it is possible to acquire the
1325 * target futex if it is uncontended or via a lock steal. Set the futex_q key 1408 * target futex if it is uncontended or via a lock steal. Set the futex_q key
1326 * to the requeue target futex so the waiter can detect the wakeup on the right 1409 * to the requeue target futex so the waiter can detect the wakeup on the right
1327 * futex, but remove it from the hb and NULL the rt_waiter so it can detect 1410 * futex, but remove it from the hb and NULL the rt_waiter so it can detect
1328 * atomic lock acquisition. Set the q->lock_ptr to the requeue target hb->lock 1411 * atomic lock acquisition. Set the q->lock_ptr to the requeue target hb->lock
1329 * to protect access to the pi_state to fixup the owner later. Must be called 1412 * to protect access to the pi_state to fixup the owner later. Must be called
1330 * with both q->lock_ptr and hb->lock held. 1413 * with both q->lock_ptr and hb->lock held.
1331 */ 1414 */
1332 static inline 1415 static inline
1333 void requeue_pi_wake_futex(struct futex_q *q, union futex_key *key, 1416 void requeue_pi_wake_futex(struct futex_q *q, union futex_key *key,
1334 struct futex_hash_bucket *hb) 1417 struct futex_hash_bucket *hb)
1335 { 1418 {
1336 get_futex_key_refs(key); 1419 get_futex_key_refs(key);
1337 q->key = *key; 1420 q->key = *key;
1338 1421
1339 __unqueue_futex(q); 1422 __unqueue_futex(q);
1340 1423
1341 WARN_ON(!q->rt_waiter); 1424 WARN_ON(!q->rt_waiter);
1342 q->rt_waiter = NULL; 1425 q->rt_waiter = NULL;
1343 1426
1344 q->lock_ptr = &hb->lock; 1427 q->lock_ptr = &hb->lock;
1345 1428
1346 wake_up_state(q->task, TASK_NORMAL); 1429 wake_up_state(q->task, TASK_NORMAL);
1347 } 1430 }
1348 1431
1349 /** 1432 /**
1350 * futex_proxy_trylock_atomic() - Attempt an atomic lock for the top waiter 1433 * futex_proxy_trylock_atomic() - Attempt an atomic lock for the top waiter
1351 * @pifutex: the user address of the to futex 1434 * @pifutex: the user address of the to futex
1352 * @hb1: the from futex hash bucket, must be locked by the caller 1435 * @hb1: the from futex hash bucket, must be locked by the caller
1353 * @hb2: the to futex hash bucket, must be locked by the caller 1436 * @hb2: the to futex hash bucket, must be locked by the caller
1354 * @key1: the from futex key 1437 * @key1: the from futex key
1355 * @key2: the to futex key 1438 * @key2: the to futex key
1356 * @ps: address to store the pi_state pointer 1439 * @ps: address to store the pi_state pointer
1357 * @set_waiters: force setting the FUTEX_WAITERS bit (1) or not (0) 1440 * @set_waiters: force setting the FUTEX_WAITERS bit (1) or not (0)
1358 * 1441 *
1359 * Try and get the lock on behalf of the top waiter if we can do it atomically. 1442 * Try and get the lock on behalf of the top waiter if we can do it atomically.
1360 * Wake the top waiter if we succeed. If the caller specified set_waiters, 1443 * Wake the top waiter if we succeed. If the caller specified set_waiters,
1361 * then direct futex_lock_pi_atomic() to force setting the FUTEX_WAITERS bit. 1444 * then direct futex_lock_pi_atomic() to force setting the FUTEX_WAITERS bit.
1362 * hb1 and hb2 must be held by the caller. 1445 * hb1 and hb2 must be held by the caller.
1363 * 1446 *
1364 * Return: 1447 * Return:
1365 * 0 - failed to acquire the lock atomically; 1448 * 0 - failed to acquire the lock atomically;
1366 * >0 - acquired the lock, return value is vpid of the top_waiter 1449 * >0 - acquired the lock, return value is vpid of the top_waiter
1367 * <0 - error 1450 * <0 - error
1368 */ 1451 */
1369 static int futex_proxy_trylock_atomic(u32 __user *pifutex, 1452 static int futex_proxy_trylock_atomic(u32 __user *pifutex,
1370 struct futex_hash_bucket *hb1, 1453 struct futex_hash_bucket *hb1,
1371 struct futex_hash_bucket *hb2, 1454 struct futex_hash_bucket *hb2,
1372 union futex_key *key1, union futex_key *key2, 1455 union futex_key *key1, union futex_key *key2,
1373 struct futex_pi_state **ps, int set_waiters) 1456 struct futex_pi_state **ps, int set_waiters)
1374 { 1457 {
1375 struct futex_q *top_waiter = NULL; 1458 struct futex_q *top_waiter = NULL;
1376 u32 curval; 1459 u32 curval;
1377 int ret, vpid; 1460 int ret, vpid;
1378 1461
1379 if (get_futex_value_locked(&curval, pifutex)) 1462 if (get_futex_value_locked(&curval, pifutex))
1380 return -EFAULT; 1463 return -EFAULT;
1381 1464
1382 /* 1465 /*
1383 * Find the top_waiter and determine if there are additional waiters. 1466 * Find the top_waiter and determine if there are additional waiters.
1384 * If the caller intends to requeue more than 1 waiter to pifutex, 1467 * If the caller intends to requeue more than 1 waiter to pifutex,
1385 * force futex_lock_pi_atomic() to set the FUTEX_WAITERS bit now, 1468 * force futex_lock_pi_atomic() to set the FUTEX_WAITERS bit now,
1386 * as we have means to handle the possible fault. If not, don't set 1469 * as we have means to handle the possible fault. If not, don't set
1387 * the bit unnecessarily as it will force the subsequent unlock to enter 1470 * the bit unnecessarily as it will force the subsequent unlock to enter
1388 * the kernel. 1471 * the kernel.
1389 */ 1472 */
1390 top_waiter = futex_top_waiter(hb1, key1); 1473 top_waiter = futex_top_waiter(hb1, key1);
1391 1474
1392 /* There are no waiters, nothing for us to do. */ 1475 /* There are no waiters, nothing for us to do. */
1393 if (!top_waiter) 1476 if (!top_waiter)
1394 return 0; 1477 return 0;
1395 1478
1396 /* Ensure we requeue to the expected futex. */ 1479 /* Ensure we requeue to the expected futex. */
1397 if (!match_futex(top_waiter->requeue_pi_key, key2)) 1480 if (!match_futex(top_waiter->requeue_pi_key, key2))
1398 return -EINVAL; 1481 return -EINVAL;
1399 1482
1400 /* 1483 /*
1401 * Try to take the lock for top_waiter. Set the FUTEX_WAITERS bit in 1484 * Try to take the lock for top_waiter. Set the FUTEX_WAITERS bit in
1402 * the contended case or if set_waiters is 1. The pi_state is returned 1485 * the contended case or if set_waiters is 1. The pi_state is returned
1403 * in ps in contended cases. 1486 * in ps in contended cases.
1404 */ 1487 */
1405 vpid = task_pid_vnr(top_waiter->task); 1488 vpid = task_pid_vnr(top_waiter->task);
1406 ret = futex_lock_pi_atomic(pifutex, hb2, key2, ps, top_waiter->task, 1489 ret = futex_lock_pi_atomic(pifutex, hb2, key2, ps, top_waiter->task,
1407 set_waiters); 1490 set_waiters);
1408 if (ret == 1) { 1491 if (ret == 1) {
1409 requeue_pi_wake_futex(top_waiter, key2, hb2); 1492 requeue_pi_wake_futex(top_waiter, key2, hb2);
1410 return vpid; 1493 return vpid;
1411 } 1494 }
1412 return ret; 1495 return ret;
1413 } 1496 }
1414 1497
1415 /** 1498 /**
1416 * futex_requeue() - Requeue waiters from uaddr1 to uaddr2 1499 * futex_requeue() - Requeue waiters from uaddr1 to uaddr2
1417 * @uaddr1: source futex user address 1500 * @uaddr1: source futex user address
1418 * @flags: futex flags (FLAGS_SHARED, etc.) 1501 * @flags: futex flags (FLAGS_SHARED, etc.)
1419 * @uaddr2: target futex user address 1502 * @uaddr2: target futex user address
1420 * @nr_wake: number of waiters to wake (must be 1 for requeue_pi) 1503 * @nr_wake: number of waiters to wake (must be 1 for requeue_pi)
1421 * @nr_requeue: number of waiters to requeue (0-INT_MAX) 1504 * @nr_requeue: number of waiters to requeue (0-INT_MAX)
1422 * @cmpval: @uaddr1 expected value (or %NULL) 1505 * @cmpval: @uaddr1 expected value (or %NULL)
1423 * @requeue_pi: if we are attempting to requeue from a non-pi futex to a 1506 * @requeue_pi: if we are attempting to requeue from a non-pi futex to a
1424 * pi futex (pi to pi requeue is not supported) 1507 * pi futex (pi to pi requeue is not supported)
1425 * 1508 *
1426 * Requeue waiters on uaddr1 to uaddr2. In the requeue_pi case, try to acquire 1509 * Requeue waiters on uaddr1 to uaddr2. In the requeue_pi case, try to acquire
1427 * uaddr2 atomically on behalf of the top waiter. 1510 * uaddr2 atomically on behalf of the top waiter.
1428 * 1511 *
1429 * Return: 1512 * Return:
1430 * >=0 - on success, the number of tasks requeued or woken; 1513 * >=0 - on success, the number of tasks requeued or woken;
1431 * <0 - on error 1514 * <0 - on error
1432 */ 1515 */
1433 static int futex_requeue(u32 __user *uaddr1, unsigned int flags, 1516 static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
1434 u32 __user *uaddr2, int nr_wake, int nr_requeue, 1517 u32 __user *uaddr2, int nr_wake, int nr_requeue,
1435 u32 *cmpval, int requeue_pi) 1518 u32 *cmpval, int requeue_pi)
1436 { 1519 {
1437 union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT; 1520 union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
1438 int drop_count = 0, task_count = 0, ret; 1521 int drop_count = 0, task_count = 0, ret;
1439 struct futex_pi_state *pi_state = NULL; 1522 struct futex_pi_state *pi_state = NULL;
1440 struct futex_hash_bucket *hb1, *hb2; 1523 struct futex_hash_bucket *hb1, *hb2;
1441 struct futex_q *this, *next; 1524 struct futex_q *this, *next;
1442 1525
1443 if (requeue_pi) { 1526 if (requeue_pi) {
1444 /* 1527 /*
1528 * Requeue PI only works on two distinct uaddrs. This
1529 * check is only valid for private futexes. See below.
1530 */
1531 if (uaddr1 == uaddr2)
1532 return -EINVAL;
1533
1534 /*
1445 * requeue_pi requires a pi_state, try to allocate it now 1535 * requeue_pi requires a pi_state, try to allocate it now
1446 * without any locks in case it fails. 1536 * without any locks in case it fails.
1447 */ 1537 */
1448 if (refill_pi_state_cache()) 1538 if (refill_pi_state_cache())
1449 return -ENOMEM; 1539 return -ENOMEM;
1450 /* 1540 /*
1451 * requeue_pi must wake as many tasks as it can, up to nr_wake 1541 * requeue_pi must wake as many tasks as it can, up to nr_wake
1452 * + nr_requeue, since it acquires the rt_mutex prior to 1542 * + nr_requeue, since it acquires the rt_mutex prior to
1453 * returning to userspace, so as to not leave the rt_mutex with 1543 * returning to userspace, so as to not leave the rt_mutex with
1454 * waiters and no owner. However, second and third wake-ups 1544 * waiters and no owner. However, second and third wake-ups
1455 * cannot be predicted as they involve race conditions with the 1545 * cannot be predicted as they involve race conditions with the
1456 * first wake and a fault while looking up the pi_state. Both 1546 * first wake and a fault while looking up the pi_state. Both
1457 * pthread_cond_signal() and pthread_cond_broadcast() should 1547 * pthread_cond_signal() and pthread_cond_broadcast() should
1458 * use nr_wake=1. 1548 * use nr_wake=1.
1459 */ 1549 */
1460 if (nr_wake != 1) 1550 if (nr_wake != 1)
1461 return -EINVAL; 1551 return -EINVAL;
1462 } 1552 }
1463 1553
1464 retry: 1554 retry:
1465 if (pi_state != NULL) { 1555 if (pi_state != NULL) {
1466 /* 1556 /*
1467 * We will have to lookup the pi_state again, so free this one 1557 * We will have to lookup the pi_state again, so free this one
1468 * to keep the accounting correct. 1558 * to keep the accounting correct.
1469 */ 1559 */
1470 free_pi_state(pi_state); 1560 free_pi_state(pi_state);
1471 pi_state = NULL; 1561 pi_state = NULL;
1472 } 1562 }
1473 1563
1474 ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1, VERIFY_READ); 1564 ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1, VERIFY_READ);
1475 if (unlikely(ret != 0)) 1565 if (unlikely(ret != 0))
1476 goto out; 1566 goto out;
1477 ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, 1567 ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2,
1478 requeue_pi ? VERIFY_WRITE : VERIFY_READ); 1568 requeue_pi ? VERIFY_WRITE : VERIFY_READ);
1479 if (unlikely(ret != 0)) 1569 if (unlikely(ret != 0))
1480 goto out_put_key1; 1570 goto out_put_key1;
1481 1571
1572 /*
1573 * The check above, which compares uaddrs, is not sufficient for
1574 * shared futexes. We need to compare the keys:
1575 */
1576 if (requeue_pi && match_futex(&key1, &key2)) {
1577 ret = -EINVAL;
1578 goto out_put_keys;
1579 }
1580
1482 hb1 = hash_futex(&key1); 1581 hb1 = hash_futex(&key1);
1483 hb2 = hash_futex(&key2); 1582 hb2 = hash_futex(&key2);
1484 1583
1485 retry_private: 1584 retry_private:
1486 hb_waiters_inc(hb2); 1585 hb_waiters_inc(hb2);
1487 double_lock_hb(hb1, hb2); 1586 double_lock_hb(hb1, hb2);
1488 1587
1489 if (likely(cmpval != NULL)) { 1588 if (likely(cmpval != NULL)) {
1490 u32 curval; 1589 u32 curval;
1491 1590
1492 ret = get_futex_value_locked(&curval, uaddr1); 1591 ret = get_futex_value_locked(&curval, uaddr1);
1493 1592
1494 if (unlikely(ret)) { 1593 if (unlikely(ret)) {
1495 double_unlock_hb(hb1, hb2); 1594 double_unlock_hb(hb1, hb2);
1496 hb_waiters_dec(hb2); 1595 hb_waiters_dec(hb2);
1497 1596
1498 ret = get_user(curval, uaddr1); 1597 ret = get_user(curval, uaddr1);
1499 if (ret) 1598 if (ret)
1500 goto out_put_keys; 1599 goto out_put_keys;
1501 1600
1502 if (!(flags & FLAGS_SHARED)) 1601 if (!(flags & FLAGS_SHARED))
1503 goto retry_private; 1602 goto retry_private;
1504 1603
1505 put_futex_key(&key2); 1604 put_futex_key(&key2);
1506 put_futex_key(&key1); 1605 put_futex_key(&key1);
1507 goto retry; 1606 goto retry;
1508 } 1607 }
1509 if (curval != *cmpval) { 1608 if (curval != *cmpval) {
1510 ret = -EAGAIN; 1609 ret = -EAGAIN;
1511 goto out_unlock; 1610 goto out_unlock;
1512 } 1611 }
1513 } 1612 }
1514 1613
1515 if (requeue_pi && (task_count - nr_wake < nr_requeue)) { 1614 if (requeue_pi && (task_count - nr_wake < nr_requeue)) {
1516 /* 1615 /*
1517 * Attempt to acquire uaddr2 and wake the top waiter. If we 1616 * Attempt to acquire uaddr2 and wake the top waiter. If we
1518 * intend to requeue waiters, force setting the FUTEX_WAITERS 1617 * intend to requeue waiters, force setting the FUTEX_WAITERS
1519 * bit. We force this here where we are able to easily handle 1618 * bit. We force this here where we are able to easily handle
1520 * faults rather than in the requeue loop below. 1619 * faults rather than in the requeue loop below.
1521 */ 1620 */
1522 ret = futex_proxy_trylock_atomic(uaddr2, hb1, hb2, &key1, 1621 ret = futex_proxy_trylock_atomic(uaddr2, hb1, hb2, &key1,
1523 &key2, &pi_state, nr_requeue); 1622 &key2, &pi_state, nr_requeue);
1524 1623
1525 /* 1624 /*
1526 * At this point the top_waiter has either taken uaddr2 or is 1625 * At this point the top_waiter has either taken uaddr2 or is
1527 * waiting on it. If the former, then the pi_state will not 1626 * waiting on it. If the former, then the pi_state will not
1528 * exist yet, look it up one more time to ensure we have a 1627 * exist yet, look it up one more time to ensure we have a
1529 * reference to it. If the lock was taken, ret contains the 1628 * reference to it. If the lock was taken, ret contains the
1530 * vpid of the top waiter task. 1629 * vpid of the top waiter task.
1531 */ 1630 */
1532 if (ret > 0) { 1631 if (ret > 0) {
1533 WARN_ON(pi_state); 1632 WARN_ON(pi_state);
1534 drop_count++; 1633 drop_count++;
1535 task_count++; 1634 task_count++;
1536 /* 1635 /*
1537 * If we acquired the lock, then the user 1636 * If we acquired the lock, then the user
1538 * space value of uaddr2 should be vpid. It 1637 * space value of uaddr2 should be vpid. It
1539 * cannot be changed by the top waiter as it 1638 * cannot be changed by the top waiter as it
1540 * is blocked on hb2 lock if it tries to do 1639 * is blocked on hb2 lock if it tries to do
1541 * so. If something fiddled with it behind our 1640 * so. If something fiddled with it behind our
1542 * back the pi state lookup might unearth 1641 * back the pi state lookup might unearth
1543 * it. So we use the known value rather than 1642 * it. So we use the known value rather than
1544 * rereading and handing potential crap to 1643 * rereading and handing potential crap to
1545 * lookup_pi_state. 1644 * lookup_pi_state.
1546 */ 1645 */
1547 ret = lookup_pi_state(ret, hb2, &key2, &pi_state, NULL); 1646 ret = lookup_pi_state(ret, hb2, &key2, &pi_state);
1548 } 1647 }
1549 1648
1550 switch (ret) { 1649 switch (ret) {
1551 case 0: 1650 case 0:
1552 break; 1651 break;
1553 case -EFAULT: 1652 case -EFAULT:
1554 double_unlock_hb(hb1, hb2); 1653 double_unlock_hb(hb1, hb2);
1555 hb_waiters_dec(hb2); 1654 hb_waiters_dec(hb2);
1556 put_futex_key(&key2); 1655 put_futex_key(&key2);
1557 put_futex_key(&key1); 1656 put_futex_key(&key1);
1558 ret = fault_in_user_writeable(uaddr2); 1657 ret = fault_in_user_writeable(uaddr2);
1559 if (!ret) 1658 if (!ret)
1560 goto retry; 1659 goto retry;
1561 goto out; 1660 goto out;
1562 case -EAGAIN: 1661 case -EAGAIN:
1563 /* The owner was exiting, try again. */ 1662 /* The owner was exiting, try again. */
1564 double_unlock_hb(hb1, hb2); 1663 double_unlock_hb(hb1, hb2);
1565 hb_waiters_dec(hb2); 1664 hb_waiters_dec(hb2);
1566 put_futex_key(&key2); 1665 put_futex_key(&key2);
1567 put_futex_key(&key1); 1666 put_futex_key(&key1);
1568 cond_resched(); 1667 cond_resched();
1569 goto retry; 1668 goto retry;
1570 default: 1669 default:
1571 goto out_unlock; 1670 goto out_unlock;
1572 } 1671 }
1573 } 1672 }
1574 1673
1575 plist_for_each_entry_safe(this, next, &hb1->chain, list) { 1674 plist_for_each_entry_safe(this, next, &hb1->chain, list) {
1576 if (task_count - nr_wake >= nr_requeue) 1675 if (task_count - nr_wake >= nr_requeue)
1577 break; 1676 break;
1578 1677
1579 if (!match_futex(&this->key, &key1)) 1678 if (!match_futex(&this->key, &key1))
1580 continue; 1679 continue;
1581 1680
1582 /* 1681 /*
1583 * FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI should always 1682 * FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI should always
1584 * be paired with each other and no other futex ops. 1683 * be paired with each other and no other futex ops.
1585 * 1684 *
1586 * We should never be requeueing a futex_q with a pi_state, 1685 * We should never be requeueing a futex_q with a pi_state,
1587 * which is awaiting a futex_unlock_pi(). 1686 * which is awaiting a futex_unlock_pi().
1588 */ 1687 */
1589 if ((requeue_pi && !this->rt_waiter) || 1688 if ((requeue_pi && !this->rt_waiter) ||
1590 (!requeue_pi && this->rt_waiter) || 1689 (!requeue_pi && this->rt_waiter) ||
1591 this->pi_state) { 1690 this->pi_state) {
1592 ret = -EINVAL; 1691 ret = -EINVAL;
1593 break; 1692 break;
1594 } 1693 }
1595 1694
1596 /* 1695 /*
1597 * Wake nr_wake waiters. For requeue_pi, if we acquired the 1696 * Wake nr_wake waiters. For requeue_pi, if we acquired the
1598 * lock, we already woke the top_waiter. If not, it will be 1697 * lock, we already woke the top_waiter. If not, it will be
1599 * woken by futex_unlock_pi(). 1698 * woken by futex_unlock_pi().
1600 */ 1699 */
1601 if (++task_count <= nr_wake && !requeue_pi) { 1700 if (++task_count <= nr_wake && !requeue_pi) {
1602 wake_futex(this); 1701 wake_futex(this);
1603 continue; 1702 continue;
1604 } 1703 }
1605 1704
1606 /* Ensure we requeue to the expected futex for requeue_pi. */ 1705 /* Ensure we requeue to the expected futex for requeue_pi. */
1607 if (requeue_pi && !match_futex(this->requeue_pi_key, &key2)) { 1706 if (requeue_pi && !match_futex(this->requeue_pi_key, &key2)) {
1608 ret = -EINVAL; 1707 ret = -EINVAL;
1609 break; 1708 break;
1610 } 1709 }
1611 1710
1612 /* 1711 /*
1613 * Requeue nr_requeue waiters and possibly one more in the case 1712 * Requeue nr_requeue waiters and possibly one more in the case
1614 * of requeue_pi if we couldn't acquire the lock atomically. 1713 * of requeue_pi if we couldn't acquire the lock atomically.
1615 */ 1714 */
1616 if (requeue_pi) { 1715 if (requeue_pi) {
1617 /* Prepare the waiter to take the rt_mutex. */ 1716 /* Prepare the waiter to take the rt_mutex. */
1618 atomic_inc(&pi_state->refcount); 1717 atomic_inc(&pi_state->refcount);
1619 this->pi_state = pi_state; 1718 this->pi_state = pi_state;
1620 ret = rt_mutex_start_proxy_lock(&pi_state->pi_mutex, 1719 ret = rt_mutex_start_proxy_lock(&pi_state->pi_mutex,
1621 this->rt_waiter, 1720 this->rt_waiter,
1622 this->task, 1); 1721 this->task, 1);
1623 if (ret == 1) { 1722 if (ret == 1) {
1624 /* We got the lock. */ 1723 /* We got the lock. */
1625 requeue_pi_wake_futex(this, &key2, hb2); 1724 requeue_pi_wake_futex(this, &key2, hb2);
1626 drop_count++; 1725 drop_count++;
1627 continue; 1726 continue;
1628 } else if (ret) { 1727 } else if (ret) {
1629 /* -EDEADLK */ 1728 /* -EDEADLK */
1630 this->pi_state = NULL; 1729 this->pi_state = NULL;
1631 free_pi_state(pi_state); 1730 free_pi_state(pi_state);
1632 goto out_unlock; 1731 goto out_unlock;
1633 } 1732 }
1634 } 1733 }
1635 requeue_futex(this, hb1, hb2, &key2); 1734 requeue_futex(this, hb1, hb2, &key2);
1636 drop_count++; 1735 drop_count++;
1637 } 1736 }
1638 1737
1639 out_unlock: 1738 out_unlock:
1640 double_unlock_hb(hb1, hb2); 1739 double_unlock_hb(hb1, hb2);
1641 hb_waiters_dec(hb2); 1740 hb_waiters_dec(hb2);
1642 1741
1643 /* 1742 /*
1644 * drop_futex_key_refs() must be called outside the spinlocks. During 1743 * drop_futex_key_refs() must be called outside the spinlocks. During
1645 * the requeue we moved futex_q's from the hash bucket at key1 to the 1744 * the requeue we moved futex_q's from the hash bucket at key1 to the
1646 * one at key2 and updated their key pointer. We no longer need to 1745 * one at key2 and updated their key pointer. We no longer need to
1647 * hold the references to key1. 1746 * hold the references to key1.
1648 */ 1747 */
1649 while (--drop_count >= 0) 1748 while (--drop_count >= 0)
1650 drop_futex_key_refs(&key1); 1749 drop_futex_key_refs(&key1);
1651 1750
1652 out_put_keys: 1751 out_put_keys:
1653 put_futex_key(&key2); 1752 put_futex_key(&key2);
1654 out_put_key1: 1753 out_put_key1:
1655 put_futex_key(&key1); 1754 put_futex_key(&key1);
1656 out: 1755 out:
1657 if (pi_state != NULL) 1756 if (pi_state != NULL)
1658 free_pi_state(pi_state); 1757 free_pi_state(pi_state);
1659 return ret ? ret : task_count; 1758 return ret ? ret : task_count;
1660 } 1759 }
1661 1760
1662 /* The key must be already stored in q->key. */ 1761 /* The key must be already stored in q->key. */
1663 static inline struct futex_hash_bucket *queue_lock(struct futex_q *q) 1762 static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
1664 __acquires(&hb->lock) 1763 __acquires(&hb->lock)
1665 { 1764 {
1666 struct futex_hash_bucket *hb; 1765 struct futex_hash_bucket *hb;
1667 1766
1668 hb = hash_futex(&q->key); 1767 hb = hash_futex(&q->key);
1669 1768
1670 /* 1769 /*
1671 * Increment the counter before taking the lock so that 1770 * Increment the counter before taking the lock so that
1672 * a potential waker won't miss a to-be-slept task that is 1771 * a potential waker won't miss a to-be-slept task that is
1673 * waiting for the spinlock. This is safe as all queue_lock() 1772 * waiting for the spinlock. This is safe as all queue_lock()
1674 * users end up calling queue_me(). Similarly, for housekeeping, 1773 * users end up calling queue_me(). Similarly, for housekeeping,
1675 * decrement the counter at queue_unlock() when some error has 1774 * decrement the counter at queue_unlock() when some error has
1676 * occurred and we don't end up adding the task to the list. 1775 * occurred and we don't end up adding the task to the list.
1677 */ 1776 */
1678 hb_waiters_inc(hb); 1777 hb_waiters_inc(hb);
1679 1778
1680 q->lock_ptr = &hb->lock; 1779 q->lock_ptr = &hb->lock;
1681 1780
1682 spin_lock(&hb->lock); /* implies MB (A) */ 1781 spin_lock(&hb->lock); /* implies MB (A) */
1683 return hb; 1782 return hb;
1684 } 1783 }
1685 1784
1686 static inline void 1785 static inline void
1687 queue_unlock(struct futex_hash_bucket *hb) 1786 queue_unlock(struct futex_hash_bucket *hb)
1688 __releases(&hb->lock) 1787 __releases(&hb->lock)
1689 { 1788 {
1690 spin_unlock(&hb->lock); 1789 spin_unlock(&hb->lock);
1691 hb_waiters_dec(hb); 1790 hb_waiters_dec(hb);
1692 } 1791 }
1693 1792
1694 /** 1793 /**
1695 * queue_me() - Enqueue the futex_q on the futex_hash_bucket 1794 * queue_me() - Enqueue the futex_q on the futex_hash_bucket
1696 * @q: The futex_q to enqueue 1795 * @q: The futex_q to enqueue
1697 * @hb: The destination hash bucket 1796 * @hb: The destination hash bucket
1698 * 1797 *
1699 * The hb->lock must be held by the caller, and is released here. A call to 1798 * The hb->lock must be held by the caller, and is released here. A call to
1700 * queue_me() is typically paired with exactly one call to unqueue_me(). The 1799 * queue_me() is typically paired with exactly one call to unqueue_me(). The
1701 * exceptions involve the PI related operations, which may use unqueue_me_pi() 1800 * exceptions involve the PI related operations, which may use unqueue_me_pi()
1702 * or nothing if the unqueue is done as part of the wake process and the unqueue 1801 * or nothing if the unqueue is done as part of the wake process and the unqueue
1703 * state is implicit in the state of woken task (see futex_wait_requeue_pi() for 1802 * state is implicit in the state of woken task (see futex_wait_requeue_pi() for
1704 * an example). 1803 * an example).
1705 */ 1804 */
1706 static inline void queue_me(struct futex_q *q, struct futex_hash_bucket *hb) 1805 static inline void queue_me(struct futex_q *q, struct futex_hash_bucket *hb)
1707 __releases(&hb->lock) 1806 __releases(&hb->lock)
1708 { 1807 {
1709 int prio; 1808 int prio;
1710 1809
1711 /* 1810 /*
1712 * The priority used to register this element is 1811 * The priority used to register this element is
1713 * - either the real thread-priority for the real-time threads 1812 * - either the real thread-priority for the real-time threads
1714 * (i.e. threads with a priority lower than MAX_RT_PRIO) 1813 * (i.e. threads with a priority lower than MAX_RT_PRIO)
1715 * - or MAX_RT_PRIO for non-RT threads. 1814 * - or MAX_RT_PRIO for non-RT threads.
1716 * Thus, all RT-threads are woken first in priority order, and 1815 * Thus, all RT-threads are woken first in priority order, and
1717 * the others are woken last, in FIFO order. 1816 * the others are woken last, in FIFO order.
1718 */ 1817 */
1719 prio = min(current->normal_prio, MAX_RT_PRIO); 1818 prio = min(current->normal_prio, MAX_RT_PRIO);
1720 1819
1721 plist_node_init(&q->list, prio); 1820 plist_node_init(&q->list, prio);
1722 plist_add(&q->list, &hb->chain); 1821 plist_add(&q->list, &hb->chain);
1723 q->task = current; 1822 q->task = current;
1724 spin_unlock(&hb->lock); 1823 spin_unlock(&hb->lock);
1725 } 1824 }
1726 1825
1727 /** 1826 /**
1728 * unqueue_me() - Remove the futex_q from its futex_hash_bucket 1827 * unqueue_me() - Remove the futex_q from its futex_hash_bucket
1729 * @q: The futex_q to unqueue 1828 * @q: The futex_q to unqueue
1730 * 1829 *
1731 * The q->lock_ptr must not be held by the caller. A call to unqueue_me() must 1830 * The q->lock_ptr must not be held by the caller. A call to unqueue_me() must
1732 * be paired with exactly one earlier call to queue_me(). 1831 * be paired with exactly one earlier call to queue_me().
1733 * 1832 *
1734 * Return: 1833 * Return:
1735 * 1 - if the futex_q was still queued (and we unqueued it); 1834 * 1 - if the futex_q was still queued (and we unqueued it);
1736 * 0 - if the futex_q was already removed by the waking thread 1835 * 0 - if the futex_q was already removed by the waking thread
1737 */ 1836 */
1738 static int unqueue_me(struct futex_q *q) 1837 static int unqueue_me(struct futex_q *q)
1739 { 1838 {
1740 spinlock_t *lock_ptr; 1839 spinlock_t *lock_ptr;
1741 int ret = 0; 1840 int ret = 0;
1742 1841
1743 /* In the common case we don't take the spinlock, which is nice. */ 1842 /* In the common case we don't take the spinlock, which is nice. */
1744 retry: 1843 retry:
1745 lock_ptr = q->lock_ptr; 1844 lock_ptr = q->lock_ptr;
1746 barrier(); 1845 barrier();
1747 if (lock_ptr != NULL) { 1846 if (lock_ptr != NULL) {
1748 spin_lock(lock_ptr); 1847 spin_lock(lock_ptr);
1749 /* 1848 /*
1750 * q->lock_ptr can change between reading it and 1849 * q->lock_ptr can change between reading it and
1751 * spin_lock(), causing us to take the wrong lock. This 1850 * spin_lock(), causing us to take the wrong lock. This
1752 * corrects the race condition. 1851 * corrects the race condition.
1753 * 1852 *
1754 * Reasoning goes like this: if we have the wrong lock, 1853 * Reasoning goes like this: if we have the wrong lock,
1755 * q->lock_ptr must have changed (maybe several times) 1854 * q->lock_ptr must have changed (maybe several times)
1756 * between reading it and the spin_lock(). It can 1855 * between reading it and the spin_lock(). It can
1757 * change again after the spin_lock() but only if it was 1856 * change again after the spin_lock() but only if it was
1758 * already changed before the spin_lock(). It cannot, 1857 * already changed before the spin_lock(). It cannot,
1759 * however, change back to the original value. Therefore 1858 * however, change back to the original value. Therefore
1760 * we can detect whether we acquired the correct lock. 1859 * we can detect whether we acquired the correct lock.
1761 */ 1860 */
1762 if (unlikely(lock_ptr != q->lock_ptr)) { 1861 if (unlikely(lock_ptr != q->lock_ptr)) {
1763 spin_unlock(lock_ptr); 1862 spin_unlock(lock_ptr);
1764 goto retry; 1863 goto retry;
1765 } 1864 }
1766 __unqueue_futex(q); 1865 __unqueue_futex(q);
1767 1866
1768 BUG_ON(q->pi_state); 1867 BUG_ON(q->pi_state);
1769 1868
1770 spin_unlock(lock_ptr); 1869 spin_unlock(lock_ptr);
1771 ret = 1; 1870 ret = 1;
1772 } 1871 }
1773 1872
1774 drop_futex_key_refs(&q->key); 1873 drop_futex_key_refs(&q->key);
1775 return ret; 1874 return ret;
1776 } 1875 }
1777 1876
1778 /* 1877 /*
1779 * PI futexes cannot be requeued and must remove themselves from the 1878 * PI futexes cannot be requeued and must remove themselves from the
1780 * hash bucket. The hash bucket lock (i.e. lock_ptr) is held on entry 1879 * hash bucket. The hash bucket lock (i.e. lock_ptr) is held on entry
1781 * and dropped here. 1880 * and dropped here.
1782 */ 1881 */
1783 static void unqueue_me_pi(struct futex_q *q) 1882 static void unqueue_me_pi(struct futex_q *q)
1784 __releases(q->lock_ptr) 1883 __releases(q->lock_ptr)
1785 { 1884 {
1786 __unqueue_futex(q); 1885 __unqueue_futex(q);
1787 1886
1788 BUG_ON(!q->pi_state); 1887 BUG_ON(!q->pi_state);
1789 free_pi_state(q->pi_state); 1888 free_pi_state(q->pi_state);
1790 q->pi_state = NULL; 1889 q->pi_state = NULL;
1791 1890
1792 spin_unlock(q->lock_ptr); 1891 spin_unlock(q->lock_ptr);
1793 } 1892 }
1794 1893
1795 /* 1894 /*
1796 * Fixup the pi_state owner with the new owner. 1895 * Fixup the pi_state owner with the new owner.
1797 * 1896 *
1798 * Must be called with hash bucket lock held and mm->sem held for non 1897 * Must be called with hash bucket lock held and mm->sem held for non
1799 * private futexes. 1898 * private futexes.
1800 */ 1899 */
1801 static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q, 1900 static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q,
1802 struct task_struct *newowner) 1901 struct task_struct *newowner)
1803 { 1902 {
1804 u32 newtid = task_pid_vnr(newowner) | FUTEX_WAITERS; 1903 u32 newtid = task_pid_vnr(newowner) | FUTEX_WAITERS;
1805 struct futex_pi_state *pi_state = q->pi_state; 1904 struct futex_pi_state *pi_state = q->pi_state;
1806 struct task_struct *oldowner = pi_state->owner; 1905 struct task_struct *oldowner = pi_state->owner;
1807 u32 uval, uninitialized_var(curval), newval; 1906 u32 uval, uninitialized_var(curval), newval;
1808 int ret; 1907 int ret;
1809 1908
1810 /* Owner died? */ 1909 /* Owner died? */
1811 if (!pi_state->owner) 1910 if (!pi_state->owner)
1812 newtid |= FUTEX_OWNER_DIED; 1911 newtid |= FUTEX_OWNER_DIED;
1813 1912
1814 /* 1913 /*
1815 * We are here either because we stole the rtmutex from the 1914 * We are here either because we stole the rtmutex from the
1816 * previous highest priority waiter or we are the highest priority 1915 * previous highest priority waiter or we are the highest priority
1817 * waiter but failed to get the rtmutex the first time. 1916 * waiter but failed to get the rtmutex the first time.
1818 * We have to replace the newowner TID in the user space variable. 1917 * We have to replace the newowner TID in the user space variable.
1819 * This must be atomic as we have to preserve the owner died bit here. 1918 * This must be atomic as we have to preserve the owner died bit here.
1820 * 1919 *
1821 * Note: We write the user space value _before_ changing the pi_state 1920 * Note: We write the user space value _before_ changing the pi_state
1822 * because we can fault here. Imagine swapped out pages or a fork 1921 * because we can fault here. Imagine swapped out pages or a fork
1823 * that marked all the anonymous memory readonly for cow. 1922 * that marked all the anonymous memory readonly for cow.
1824 * 1923 *
1825 * Modifying pi_state _before_ the user space value would 1924 * Modifying pi_state _before_ the user space value would
1826 * leave the pi_state in an inconsistent state when we fault 1925 * leave the pi_state in an inconsistent state when we fault
1827 * here, because we need to drop the hash bucket lock to 1926 * here, because we need to drop the hash bucket lock to
1828 * handle the fault. This might be observed in the PID check 1927 * handle the fault. This might be observed in the PID check
1829 * in lookup_pi_state. 1928 * in lookup_pi_state.
1830 */ 1929 */
1831 retry: 1930 retry:
1832 if (get_futex_value_locked(&uval, uaddr)) 1931 if (get_futex_value_locked(&uval, uaddr))
1833 goto handle_fault; 1932 goto handle_fault;
1834 1933
1835 while (1) { 1934 while (1) {
1836 newval = (uval & FUTEX_OWNER_DIED) | newtid; 1935 newval = (uval & FUTEX_OWNER_DIED) | newtid;
1837 1936
1838 if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval)) 1937 if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval))
1839 goto handle_fault; 1938 goto handle_fault;
1840 if (curval == uval) 1939 if (curval == uval)
1841 break; 1940 break;
1842 uval = curval; 1941 uval = curval;
1843 } 1942 }
1844 1943
1845 /* 1944 /*
1846 * We fixed up user space. Now we need to fix the pi_state 1945 * We fixed up user space. Now we need to fix the pi_state
1847 * itself. 1946 * itself.
1848 */ 1947 */
1849 if (pi_state->owner != NULL) { 1948 if (pi_state->owner != NULL) {
1850 raw_spin_lock_irq(&pi_state->owner->pi_lock); 1949 raw_spin_lock_irq(&pi_state->owner->pi_lock);
1851 WARN_ON(list_empty(&pi_state->list)); 1950 WARN_ON(list_empty(&pi_state->list));
1852 list_del_init(&pi_state->list); 1951 list_del_init(&pi_state->list);
1853 raw_spin_unlock_irq(&pi_state->owner->pi_lock); 1952 raw_spin_unlock_irq(&pi_state->owner->pi_lock);
1854 } 1953 }
1855 1954
1856 pi_state->owner = newowner; 1955 pi_state->owner = newowner;
1857 1956
1858 raw_spin_lock_irq(&newowner->pi_lock); 1957 raw_spin_lock_irq(&newowner->pi_lock);
1859 WARN_ON(!list_empty(&pi_state->list)); 1958 WARN_ON(!list_empty(&pi_state->list));
1860 list_add(&pi_state->list, &newowner->pi_state_list); 1959 list_add(&pi_state->list, &newowner->pi_state_list);
1861 raw_spin_unlock_irq(&newowner->pi_lock); 1960 raw_spin_unlock_irq(&newowner->pi_lock);
1862 return 0; 1961 return 0;
1863 1962
1864 /* 1963 /*
1865 * To handle the page fault we need to drop the hash bucket 1964 * To handle the page fault we need to drop the hash bucket
1866 * lock here. That gives the other task (either the highest priority 1965 * lock here. That gives the other task (either the highest priority
1867 * waiter itself or the task which stole the rtmutex) the 1966 * waiter itself or the task which stole the rtmutex) the
1868 * chance to try the fixup of the pi_state. So once we are 1967 * chance to try the fixup of the pi_state. So once we are
1869 * back from handling the fault we need to check the pi_state 1968 * back from handling the fault we need to check the pi_state
1870 * after reacquiring the hash bucket lock and before trying to 1969 * after reacquiring the hash bucket lock and before trying to
1871 * do another fixup. When the fixup has been done already we 1970 * do another fixup. When the fixup has been done already we
1872 * simply return. 1971 * simply return.
1873 */ 1972 */
1874 handle_fault: 1973 handle_fault:
1875 spin_unlock(q->lock_ptr); 1974 spin_unlock(q->lock_ptr);
1876 1975
1877 ret = fault_in_user_writeable(uaddr); 1976 ret = fault_in_user_writeable(uaddr);
1878 1977
1879 spin_lock(q->lock_ptr); 1978 spin_lock(q->lock_ptr);
1880 1979
1881 /* 1980 /*
1882 * Check if someone else fixed it for us: 1981 * Check if someone else fixed it for us:
1883 */ 1982 */
1884 if (pi_state->owner != oldowner) 1983 if (pi_state->owner != oldowner)
1885 return 0; 1984 return 0;
1886 1985
1887 if (ret) 1986 if (ret)
1888 return ret; 1987 return ret;
1889 1988
1890 goto retry; 1989 goto retry;
1891 } 1990 }
1892 1991
1893 static long futex_wait_restart(struct restart_block *restart); 1992 static long futex_wait_restart(struct restart_block *restart);
1894 1993
1895 /** 1994 /**
1896 * fixup_owner() - Post-lock pi_state and corner case management 1995 * fixup_owner() - Post-lock pi_state and corner case management
1897 * @uaddr: user address of the futex 1996 * @uaddr: user address of the futex
1898 * @q: futex_q (contains pi_state and access to the rt_mutex) 1997 * @q: futex_q (contains pi_state and access to the rt_mutex)
1899 * @locked: if the attempt to take the rt_mutex succeeded (1) or not (0) 1998 * @locked: if the attempt to take the rt_mutex succeeded (1) or not (0)
1900 * 1999 *
1901 * After attempting to lock an rt_mutex, this function is called to cleanup 2000 * After attempting to lock an rt_mutex, this function is called to cleanup
1902 * the pi_state owner as well as handle race conditions that may allow us to 2001 * the pi_state owner as well as handle race conditions that may allow us to
1903 * acquire the lock. Must be called with the hb lock held. 2002 * acquire the lock. Must be called with the hb lock held.
1904 * 2003 *
1905 * Return: 2004 * Return:
1906 * 1 - success, lock taken; 2005 * 1 - success, lock taken;
1907 * 0 - success, lock not taken; 2006 * 0 - success, lock not taken;
1908 * <0 - on error (-EFAULT) 2007 * <0 - on error (-EFAULT)
1909 */ 2008 */
1910 static int fixup_owner(u32 __user *uaddr, struct futex_q *q, int locked) 2009 static int fixup_owner(u32 __user *uaddr, struct futex_q *q, int locked)
1911 { 2010 {
1912 struct task_struct *owner; 2011 struct task_struct *owner;
1913 int ret = 0; 2012 int ret = 0;
1914 2013
1915 if (locked) { 2014 if (locked) {
1916 /* 2015 /*
1917 * Got the lock. We might not be the anticipated owner if we 2016 * Got the lock. We might not be the anticipated owner if we
1918 * did a lock-steal - fix up the PI-state in that case: 2017 * did a lock-steal - fix up the PI-state in that case:
1919 */ 2018 */
1920 if (q->pi_state->owner != current) 2019 if (q->pi_state->owner != current)
1921 ret = fixup_pi_state_owner(uaddr, q, current); 2020 ret = fixup_pi_state_owner(uaddr, q, current);
1922 goto out; 2021 goto out;
1923 } 2022 }
1924 2023
1925 /* 2024 /*
1926 * Catch the rare case where the lock was released while we were on the 2025 * Catch the rare case where the lock was released while we were on the
1927 * way back before we locked the hash bucket. 2026 * way back before we locked the hash bucket.
1928 */ 2027 */
1929 if (q->pi_state->owner == current) { 2028 if (q->pi_state->owner == current) {
1930 /* 2029 /*
1931 * Try to get the rt_mutex now. This might fail as some other 2030 * Try to get the rt_mutex now. This might fail as some other
1932 * task acquired the rt_mutex after we removed ourselves from the 2031 * task acquired the rt_mutex after we removed ourselves from the
1933 * rt_mutex waiters list. 2032 * rt_mutex waiters list.
1934 */ 2033 */
1935 if (rt_mutex_trylock(&q->pi_state->pi_mutex)) { 2034 if (rt_mutex_trylock(&q->pi_state->pi_mutex)) {
1936 locked = 1; 2035 locked = 1;
1937 goto out; 2036 goto out;
1938 } 2037 }
1939 2038
1940 /* 2039 /*
1941 * pi_state is incorrect, some other task did a lock steal and 2040 * pi_state is incorrect, some other task did a lock steal and
1942 * we returned due to timeout or signal without taking the 2041 * we returned due to timeout or signal without taking the
1943 * rt_mutex. Too late. 2042 * rt_mutex. Too late.
1944 */ 2043 */
1945 raw_spin_lock(&q->pi_state->pi_mutex.wait_lock); 2044 raw_spin_lock(&q->pi_state->pi_mutex.wait_lock);
1946 owner = rt_mutex_owner(&q->pi_state->pi_mutex); 2045 owner = rt_mutex_owner(&q->pi_state->pi_mutex);
1947 if (!owner) 2046 if (!owner)
1948 owner = rt_mutex_next_owner(&q->pi_state->pi_mutex); 2047 owner = rt_mutex_next_owner(&q->pi_state->pi_mutex);
1949 raw_spin_unlock(&q->pi_state->pi_mutex.wait_lock); 2048 raw_spin_unlock(&q->pi_state->pi_mutex.wait_lock);
1950 ret = fixup_pi_state_owner(uaddr, q, owner); 2049 ret = fixup_pi_state_owner(uaddr, q, owner);
1951 goto out; 2050 goto out;
1952 } 2051 }
1953 2052
1954 /* 2053 /*
1955 * Paranoia check. If we did not take the lock, then we should not be 2054 * Paranoia check. If we did not take the lock, then we should not be
1956 * the owner of the rt_mutex. 2055 * the owner of the rt_mutex.
1957 */ 2056 */
1958 if (rt_mutex_owner(&q->pi_state->pi_mutex) == current) 2057 if (rt_mutex_owner(&q->pi_state->pi_mutex) == current)
1959 printk(KERN_ERR "fixup_owner: ret = %d pi-mutex: %p " 2058 printk(KERN_ERR "fixup_owner: ret = %d pi-mutex: %p "
1960 "pi-state %p\n", ret, 2059 "pi-state %p\n", ret,
1961 q->pi_state->pi_mutex.owner, 2060 q->pi_state->pi_mutex.owner,
1962 q->pi_state->owner); 2061 q->pi_state->owner);
1963 2062
1964 out: 2063 out:
1965 return ret ? ret : locked; 2064 return ret ? ret : locked;
1966 } 2065 }
1967 2066
1968 /** 2067 /**
1969 * futex_wait_queue_me() - queue_me() and wait for wakeup, timeout, or signal 2068 * futex_wait_queue_me() - queue_me() and wait for wakeup, timeout, or signal
1970 * @hb: the futex hash bucket, must be locked by the caller 2069 * @hb: the futex hash bucket, must be locked by the caller
1971 * @q: the futex_q to queue up on 2070 * @q: the futex_q to queue up on
1972 * @timeout: the prepared hrtimer_sleeper, or null for no timeout 2071 * @timeout: the prepared hrtimer_sleeper, or null for no timeout
1973 */ 2072 */
1974 static void futex_wait_queue_me(struct futex_hash_bucket *hb, struct futex_q *q, 2073 static void futex_wait_queue_me(struct futex_hash_bucket *hb, struct futex_q *q,
1975 struct hrtimer_sleeper *timeout) 2074 struct hrtimer_sleeper *timeout)
1976 { 2075 {
1977 /* 2076 /*
1978 * The task state is guaranteed to be set before another task can 2077 * The task state is guaranteed to be set before another task can
1979 * wake it. set_current_state() is implemented using set_mb() and 2078 * wake it. set_current_state() is implemented using set_mb() and
1980 * queue_me() calls spin_unlock() upon completion, both serializing 2079 * queue_me() calls spin_unlock() upon completion, both serializing
1981 * access to the hash list and forcing another memory barrier. 2080 * access to the hash list and forcing another memory barrier.
1982 */ 2081 */
1983 set_current_state(TASK_INTERRUPTIBLE); 2082 set_current_state(TASK_INTERRUPTIBLE);
1984 queue_me(q, hb); 2083 queue_me(q, hb);
1985 2084
1986 /* Arm the timer */ 2085 /* Arm the timer */
1987 if (timeout) { 2086 if (timeout) {
1988 hrtimer_start_expires(&timeout->timer, HRTIMER_MODE_ABS); 2087 hrtimer_start_expires(&timeout->timer, HRTIMER_MODE_ABS);
1989 if (!hrtimer_active(&timeout->timer)) 2088 if (!hrtimer_active(&timeout->timer))
1990 timeout->task = NULL; 2089 timeout->task = NULL;
1991 } 2090 }
1992 2091
1993 /* 2092 /*
1994 * If we have been removed from the hash list, then another task 2093 * If we have been removed from the hash list, then another task
1995 * has tried to wake us, and we can skip the call to schedule(). 2094 * has tried to wake us, and we can skip the call to schedule().
1996 */ 2095 */
1997 if (likely(!plist_node_empty(&q->list))) { 2096 if (likely(!plist_node_empty(&q->list))) {
1998 /* 2097 /*
1999 * If the timer has already expired, current will already be 2098 * If the timer has already expired, current will already be
2000 * flagged for rescheduling. Only call schedule if there 2099 * flagged for rescheduling. Only call schedule if there
2001 * is no timeout, or if it has yet to expire. 2100 * is no timeout, or if it has yet to expire.
2002 */ 2101 */
2003 if (!timeout || timeout->task) 2102 if (!timeout || timeout->task)
2004 freezable_schedule(); 2103 freezable_schedule();
2005 } 2104 }
2006 __set_current_state(TASK_RUNNING); 2105 __set_current_state(TASK_RUNNING);
2007 } 2106 }
2008 2107
2009 /** 2108 /**
2010 * futex_wait_setup() - Prepare to wait on a futex 2109 * futex_wait_setup() - Prepare to wait on a futex
2011 * @uaddr: the futex userspace address 2110 * @uaddr: the futex userspace address
2012 * @val: the expected value 2111 * @val: the expected value
2013 * @flags: futex flags (FLAGS_SHARED, etc.) 2112 * @flags: futex flags (FLAGS_SHARED, etc.)
2014 * @q: the associated futex_q 2113 * @q: the associated futex_q
2015 * @hb: storage for hash_bucket pointer to be returned to caller 2114 * @hb: storage for hash_bucket pointer to be returned to caller
2016 * 2115 *
2017 * Setup the futex_q and locate the hash_bucket. Get the futex value and 2116 * Setup the futex_q and locate the hash_bucket. Get the futex value and
2018 * compare it with the expected value. Handle atomic faults internally. 2117 * compare it with the expected value. Handle atomic faults internally.
2019 * Return with the hb lock held and a q.key reference on success, and unlocked 2118 * Return with the hb lock held and a q.key reference on success, and unlocked
2020 * with no q.key reference on failure. 2119 * with no q.key reference on failure.
2021 * 2120 *
2022 * Return: 2121 * Return:
2023 * 0 - uaddr contains val and hb has been locked; 2122 * 0 - uaddr contains val and hb has been locked;
2024 * <0 - -EFAULT or -EWOULDBLOCK (uaddr does not contain val) and hb is unlocked 2123 * <0 - -EFAULT or -EWOULDBLOCK (uaddr does not contain val) and hb is unlocked
2025 */ 2124 */
2026 static int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags, 2125 static int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
2027 struct futex_q *q, struct futex_hash_bucket **hb) 2126 struct futex_q *q, struct futex_hash_bucket **hb)
2028 { 2127 {
2029 u32 uval; 2128 u32 uval;
2030 int ret; 2129 int ret;
2031 2130
2032 /* 2131 /*
2033 * Access the page AFTER the hash-bucket is locked. 2132 * Access the page AFTER the hash-bucket is locked.
2034 * Order is important: 2133 * Order is important:
2035 * 2134 *
2036 * Userspace waiter: val = var; if (cond(val)) futex_wait(&var, val); 2135 * Userspace waiter: val = var; if (cond(val)) futex_wait(&var, val);
2037 * Userspace waker: if (cond(var)) { var = new; futex_wake(&var); } 2136 * Userspace waker: if (cond(var)) { var = new; futex_wake(&var); }
2038 * 2137 *
2039 * The basic logical guarantee of a futex is that it blocks ONLY 2138 * The basic logical guarantee of a futex is that it blocks ONLY
2040 * if cond(var) is known to be true at the time of blocking, for 2139 * if cond(var) is known to be true at the time of blocking, for
2041 * any cond. If we locked the hash-bucket after testing *uaddr, that 2140 * any cond. If we locked the hash-bucket after testing *uaddr, that
2042 * would open a race condition where we could block indefinitely with 2141 * would open a race condition where we could block indefinitely with
2043 * cond(var) false, which would violate the guarantee. 2142 * cond(var) false, which would violate the guarantee.
2044 * 2143 *
2045 * On the other hand, we insert q and release the hash-bucket only 2144 * On the other hand, we insert q and release the hash-bucket only
2046 * after testing *uaddr. This guarantees that futex_wait() will NOT 2145 * after testing *uaddr. This guarantees that futex_wait() will NOT
2047 * absorb a wakeup if *uaddr does not match the desired values 2146 * absorb a wakeup if *uaddr does not match the desired values
2048 * while the syscall executes. 2147 * while the syscall executes.
2049 */ 2148 */
2050 retry: 2149 retry:
2051 ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q->key, VERIFY_READ); 2150 ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q->key, VERIFY_READ);
2052 if (unlikely(ret != 0)) 2151 if (unlikely(ret != 0))
2053 return ret; 2152 return ret;
2054 2153
2055 retry_private: 2154 retry_private:
2056 *hb = queue_lock(q); 2155 *hb = queue_lock(q);
2057 2156
2058 ret = get_futex_value_locked(&uval, uaddr); 2157 ret = get_futex_value_locked(&uval, uaddr);
2059 2158
2060 if (ret) { 2159 if (ret) {
2061 queue_unlock(*hb); 2160 queue_unlock(*hb);
2062 2161
2063 ret = get_user(uval, uaddr); 2162 ret = get_user(uval, uaddr);
2064 if (ret) 2163 if (ret)
2065 goto out; 2164 goto out;
2066 2165
2067 if (!(flags & FLAGS_SHARED)) 2166 if (!(flags & FLAGS_SHARED))
2068 goto retry_private; 2167 goto retry_private;
2069 2168
2070 put_futex_key(&q->key); 2169 put_futex_key(&q->key);
2071 goto retry; 2170 goto retry;
2072 } 2171 }
2073 2172
2074 if (uval != val) { 2173 if (uval != val) {
2075 queue_unlock(*hb); 2174 queue_unlock(*hb);
2076 ret = -EWOULDBLOCK; 2175 ret = -EWOULDBLOCK;
2077 } 2176 }
2078 2177
2079 out: 2178 out:
2080 if (ret) 2179 if (ret)
2081 put_futex_key(&q->key); 2180 put_futex_key(&q->key);
2082 return ret; 2181 return ret;
2083 } 2182 }
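
The ordering argument in the comment inside futex_wait_setup() is easiest to see from the user-space side. Below is a minimal sketch, not part of futex.c and with invented variable and helper names, of the waiter/waker pair this code is written against: the waiter passes the value it just observed, and the waker updates the value before issuing the wake.

#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static _Atomic unsigned int var;

/* Waiter: pass the value we just observed; the kernel re-reads *uaddr
 * under the hash bucket lock and returns -EWOULDBLOCK if it changed. */
static void wait_while_equal(unsigned int val)
{
	while (atomic_load(&var) == val)
		syscall(SYS_futex, &var, FUTEX_WAIT, val, NULL, NULL, 0);
}

/* Waker: update the variable first, then wake; a waiter can therefore
 * only block while the condition it tested is still true. */
static void store_and_wake(unsigned int newval)
{
	atomic_store(&var, newval);
	syscall(SYS_futex, &var, FUTEX_WAKE, 1, NULL, NULL, 0);
}
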
2084 2183
2085 static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val, 2184 static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
2086 ktime_t *abs_time, u32 bitset) 2185 ktime_t *abs_time, u32 bitset)
2087 { 2186 {
2088 struct hrtimer_sleeper timeout, *to = NULL; 2187 struct hrtimer_sleeper timeout, *to = NULL;
2089 struct restart_block *restart; 2188 struct restart_block *restart;
2090 struct futex_hash_bucket *hb; 2189 struct futex_hash_bucket *hb;
2091 struct futex_q q = futex_q_init; 2190 struct futex_q q = futex_q_init;
2092 int ret; 2191 int ret;
2093 2192
2094 if (!bitset) 2193 if (!bitset)
2095 return -EINVAL; 2194 return -EINVAL;
2096 q.bitset = bitset; 2195 q.bitset = bitset;
2097 2196
2098 if (abs_time) { 2197 if (abs_time) {
2099 to = &timeout; 2198 to = &timeout;
2100 2199
2101 hrtimer_init_on_stack(&to->timer, (flags & FLAGS_CLOCKRT) ? 2200 hrtimer_init_on_stack(&to->timer, (flags & FLAGS_CLOCKRT) ?
2102 CLOCK_REALTIME : CLOCK_MONOTONIC, 2201 CLOCK_REALTIME : CLOCK_MONOTONIC,
2103 HRTIMER_MODE_ABS); 2202 HRTIMER_MODE_ABS);
2104 hrtimer_init_sleeper(to, current); 2203 hrtimer_init_sleeper(to, current);
2105 hrtimer_set_expires_range_ns(&to->timer, *abs_time, 2204 hrtimer_set_expires_range_ns(&to->timer, *abs_time,
2106 current->timer_slack_ns); 2205 current->timer_slack_ns);
2107 } 2206 }
2108 2207
2109 retry: 2208 retry:
2110 /* 2209 /*
2111 * Prepare to wait on uaddr. On success, holds hb lock and increments 2210 * Prepare to wait on uaddr. On success, holds hb lock and increments
2112 * q.key refs. 2211 * q.key refs.
2113 */ 2212 */
2114 ret = futex_wait_setup(uaddr, val, flags, &q, &hb); 2213 ret = futex_wait_setup(uaddr, val, flags, &q, &hb);
2115 if (ret) 2214 if (ret)
2116 goto out; 2215 goto out;
2117 2216
2118 /* queue_me and wait for wakeup, timeout, or a signal. */ 2217 /* queue_me and wait for wakeup, timeout, or a signal. */
2119 futex_wait_queue_me(hb, &q, to); 2218 futex_wait_queue_me(hb, &q, to);
2120 2219
2121 /* If we were woken (and unqueued), we succeeded, whatever. */ 2220 /* If we were woken (and unqueued), we succeeded, whatever. */
2122 ret = 0; 2221 ret = 0;
2123 /* unqueue_me() drops q.key ref */ 2222 /* unqueue_me() drops q.key ref */
2124 if (!unqueue_me(&q)) 2223 if (!unqueue_me(&q))
2125 goto out; 2224 goto out;
2126 ret = -ETIMEDOUT; 2225 ret = -ETIMEDOUT;
2127 if (to && !to->task) 2226 if (to && !to->task)
2128 goto out; 2227 goto out;
2129 2228
2130 /* 2229 /*
2131 * We expect signal_pending(current), but we might be the 2230 * We expect signal_pending(current), but we might be the
2132 * victim of a spurious wakeup as well. 2231 * victim of a spurious wakeup as well.
2133 */ 2232 */
2134 if (!signal_pending(current)) 2233 if (!signal_pending(current))
2135 goto retry; 2234 goto retry;
2136 2235
2137 ret = -ERESTARTSYS; 2236 ret = -ERESTARTSYS;
2138 if (!abs_time) 2237 if (!abs_time)
2139 goto out; 2238 goto out;
2140 2239
2141 restart = &current_thread_info()->restart_block; 2240 restart = &current_thread_info()->restart_block;
2142 restart->fn = futex_wait_restart; 2241 restart->fn = futex_wait_restart;
2143 restart->futex.uaddr = uaddr; 2242 restart->futex.uaddr = uaddr;
2144 restart->futex.val = val; 2243 restart->futex.val = val;
2145 restart->futex.time = abs_time->tv64; 2244 restart->futex.time = abs_time->tv64;
2146 restart->futex.bitset = bitset; 2245 restart->futex.bitset = bitset;
2147 restart->futex.flags = flags | FLAGS_HAS_TIMEOUT; 2246 restart->futex.flags = flags | FLAGS_HAS_TIMEOUT;
2148 2247
2149 ret = -ERESTART_RESTARTBLOCK; 2248 ret = -ERESTART_RESTARTBLOCK;
2150 2249
2151 out: 2250 out:
2152 if (to) { 2251 if (to) {
2153 hrtimer_cancel(&to->timer); 2252 hrtimer_cancel(&to->timer);
2154 destroy_hrtimer_on_stack(&to->timer); 2253 destroy_hrtimer_on_stack(&to->timer);
2155 } 2254 }
2156 return ret; 2255 return ret;
2157 } 2256 }
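
For reference, the bitset and FLAGS_CLOCKRT handling in futex_wait() corresponds to the FUTEX_WAIT_BITSET operation in user space. A hedged sketch, with an invented futex word, of an absolute CLOCK_REALTIME timed wait as documented in futex(2):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

static unsigned int futex_word;

/* Wait until futex_word changes from val or the absolute deadline passes.
 * Unlike plain FUTEX_WAIT, FUTEX_WAIT_BITSET takes an absolute timeout;
 * FUTEX_CLOCK_REALTIME selects CLOCK_REALTIME over the default monotonic. */
static long timed_wait(unsigned int val, const struct timespec *deadline)
{
	return syscall(SYS_futex, &futex_word,
		       FUTEX_WAIT_BITSET | FUTEX_CLOCK_REALTIME,
		       val, deadline, NULL, FUTEX_BITSET_MATCH_ANY);
}
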
2158 2257
2159 2258
2160 static long futex_wait_restart(struct restart_block *restart) 2259 static long futex_wait_restart(struct restart_block *restart)
2161 { 2260 {
2162 u32 __user *uaddr = restart->futex.uaddr; 2261 u32 __user *uaddr = restart->futex.uaddr;
2163 ktime_t t, *tp = NULL; 2262 ktime_t t, *tp = NULL;
2164 2263
2165 if (restart->futex.flags & FLAGS_HAS_TIMEOUT) { 2264 if (restart->futex.flags & FLAGS_HAS_TIMEOUT) {
2166 t.tv64 = restart->futex.time; 2265 t.tv64 = restart->futex.time;
2167 tp = &t; 2266 tp = &t;
2168 } 2267 }
2169 restart->fn = do_no_restart_syscall; 2268 restart->fn = do_no_restart_syscall;
2170 2269
2171 return (long)futex_wait(uaddr, restart->futex.flags, 2270 return (long)futex_wait(uaddr, restart->futex.flags,
2172 restart->futex.val, tp, restart->futex.bitset); 2271 restart->futex.val, tp, restart->futex.bitset);
2173 } 2272 }
2174 2273
2175 2274
2176 /* 2275 /*
2177 * Userspace tried a 0 -> TID atomic transition of the futex value 2276 * Userspace tried a 0 -> TID atomic transition of the futex value
2178 * and failed. The kernel side here does the whole locking operation: 2277 * and failed. The kernel side here does the whole locking operation:
2179 * if there are waiters then it will block, it does PI, etc. (Due to 2278 * if there are waiters then it will block, it does PI, etc. (Due to
2180 * races the kernel might see a 0 value of the futex too.) 2279 * races the kernel might see a 0 value of the futex too.)
2181 */ 2280 */
2182 static int futex_lock_pi(u32 __user *uaddr, unsigned int flags, int detect, 2281 static int futex_lock_pi(u32 __user *uaddr, unsigned int flags, int detect,
2183 ktime_t *time, int trylock) 2282 ktime_t *time, int trylock)
2184 { 2283 {
2185 struct hrtimer_sleeper timeout, *to = NULL; 2284 struct hrtimer_sleeper timeout, *to = NULL;
2186 struct futex_hash_bucket *hb; 2285 struct futex_hash_bucket *hb;
2187 struct futex_q q = futex_q_init; 2286 struct futex_q q = futex_q_init;
2188 int res, ret; 2287 int res, ret;
2189 2288
2190 if (refill_pi_state_cache()) 2289 if (refill_pi_state_cache())
2191 return -ENOMEM; 2290 return -ENOMEM;
2192 2291
2193 if (time) { 2292 if (time) {
2194 to = &timeout; 2293 to = &timeout;
2195 hrtimer_init_on_stack(&to->timer, CLOCK_REALTIME, 2294 hrtimer_init_on_stack(&to->timer, CLOCK_REALTIME,
2196 HRTIMER_MODE_ABS); 2295 HRTIMER_MODE_ABS);
2197 hrtimer_init_sleeper(to, current); 2296 hrtimer_init_sleeper(to, current);
2198 hrtimer_set_expires(&to->timer, *time); 2297 hrtimer_set_expires(&to->timer, *time);
2199 } 2298 }
2200 2299
2201 retry: 2300 retry:
2202 ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q.key, VERIFY_WRITE); 2301 ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q.key, VERIFY_WRITE);
2203 if (unlikely(ret != 0)) 2302 if (unlikely(ret != 0))
2204 goto out; 2303 goto out;
2205 2304
2206 retry_private: 2305 retry_private:
2207 hb = queue_lock(&q); 2306 hb = queue_lock(&q);
2208 2307
2209 ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current, 0); 2308 ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current, 0);
2210 if (unlikely(ret)) { 2309 if (unlikely(ret)) {
2211 switch (ret) { 2310 switch (ret) {
2212 case 1: 2311 case 1:
2213 /* We got the lock. */ 2312 /* We got the lock. */
2214 ret = 0; 2313 ret = 0;
2215 goto out_unlock_put_key; 2314 goto out_unlock_put_key;
2216 case -EFAULT: 2315 case -EFAULT:
2217 goto uaddr_faulted; 2316 goto uaddr_faulted;
2218 case -EAGAIN: 2317 case -EAGAIN:
2219 /* 2318 /*
2220 * Task is exiting and we just wait for the 2319 * Task is exiting and we just wait for the
2221 * exit to complete. 2320 * exit to complete.
2222 */ 2321 */
2223 queue_unlock(hb); 2322 queue_unlock(hb);
2224 put_futex_key(&q.key); 2323 put_futex_key(&q.key);
2225 cond_resched(); 2324 cond_resched();
2226 goto retry; 2325 goto retry;
2227 default: 2326 default:
2228 goto out_unlock_put_key; 2327 goto out_unlock_put_key;
2229 } 2328 }
2230 } 2329 }
2231 2330
2232 /* 2331 /*
2233 * Only actually queue now that the atomic ops are done: 2332 * Only actually queue now that the atomic ops are done:
2234 */ 2333 */
2235 queue_me(&q, hb); 2334 queue_me(&q, hb);
2236 2335
2237 WARN_ON(!q.pi_state); 2336 WARN_ON(!q.pi_state);
2238 /* 2337 /*
2239 * Block on the PI mutex: 2338 * Block on the PI mutex:
2240 */ 2339 */
2241 if (!trylock) 2340 if (!trylock)
2242 ret = rt_mutex_timed_lock(&q.pi_state->pi_mutex, to, 1); 2341 ret = rt_mutex_timed_lock(&q.pi_state->pi_mutex, to, 1);
2243 else { 2342 else {
2244 ret = rt_mutex_trylock(&q.pi_state->pi_mutex); 2343 ret = rt_mutex_trylock(&q.pi_state->pi_mutex);
2245 /* Fixup the trylock return value: */ 2344 /* Fixup the trylock return value: */
2246 ret = ret ? 0 : -EWOULDBLOCK; 2345 ret = ret ? 0 : -EWOULDBLOCK;
2247 } 2346 }
2248 2347
2249 spin_lock(q.lock_ptr); 2348 spin_lock(q.lock_ptr);
2250 /* 2349 /*
2251 * Fixup the pi_state owner and possibly acquire the lock if we 2350 * Fixup the pi_state owner and possibly acquire the lock if we
2252 * haven't already. 2351 * haven't already.
2253 */ 2352 */
2254 res = fixup_owner(uaddr, &q, !ret); 2353 res = fixup_owner(uaddr, &q, !ret);
2255 /* 2354 /*
2256 * If fixup_owner() returned an error, propagate that. If it acquired 2355 * If fixup_owner() returned an error, propagate that. If it acquired
2257 * the lock, clear our -ETIMEDOUT or -EINTR. 2356 * the lock, clear our -ETIMEDOUT or -EINTR.
2258 */ 2357 */
2259 if (res) 2358 if (res)
2260 ret = (res < 0) ? res : 0; 2359 ret = (res < 0) ? res : 0;
2261 2360
2262 /* 2361 /*
2263 * If fixup_owner() faulted and was unable to handle the fault, unlock 2362 * If fixup_owner() faulted and was unable to handle the fault, unlock
2264 * it and return the fault to userspace. 2363 * it and return the fault to userspace.
2265 */ 2364 */
2266 if (ret && (rt_mutex_owner(&q.pi_state->pi_mutex) == current)) 2365 if (ret && (rt_mutex_owner(&q.pi_state->pi_mutex) == current))
2267 rt_mutex_unlock(&q.pi_state->pi_mutex); 2366 rt_mutex_unlock(&q.pi_state->pi_mutex);
2268 2367
2269 /* Unqueue and drop the lock */ 2368 /* Unqueue and drop the lock */
2270 unqueue_me_pi(&q); 2369 unqueue_me_pi(&q);
2271 2370
2272 goto out_put_key; 2371 goto out_put_key;
2273 2372
2274 out_unlock_put_key: 2373 out_unlock_put_key:
2275 queue_unlock(hb); 2374 queue_unlock(hb);
2276 2375
2277 out_put_key: 2376 out_put_key:
2278 put_futex_key(&q.key); 2377 put_futex_key(&q.key);
2279 out: 2378 out:
2280 if (to) 2379 if (to)
2281 destroy_hrtimer_on_stack(&to->timer); 2380 destroy_hrtimer_on_stack(&to->timer);
2282 return ret != -EINTR ? ret : -ERESTARTNOINTR; 2381 return ret != -EINTR ? ret : -ERESTARTNOINTR;
2283 2382
2284 uaddr_faulted: 2383 uaddr_faulted:
2285 queue_unlock(hb); 2384 queue_unlock(hb);
2286 2385
2287 ret = fault_in_user_writeable(uaddr); 2386 ret = fault_in_user_writeable(uaddr);
2288 if (ret) 2387 if (ret)
2289 goto out_put_key; 2388 goto out_put_key;
2290 2389
2291 if (!(flags & FLAGS_SHARED)) 2390 if (!(flags & FLAGS_SHARED))
2292 goto retry_private; 2391 goto retry_private;
2293 2392
2294 put_futex_key(&q.key); 2393 put_futex_key(&q.key);
2295 goto retry; 2394 goto retry;
2296 } 2395 }
2297 2396
2298 /* 2397 /*
2299 * Userspace attempted a TID -> 0 atomic transition, and failed. 2398 * Userspace attempted a TID -> 0 atomic transition, and failed.
2300 * This is the in-kernel slowpath: we look up the PI state (if any), 2399 * This is the in-kernel slowpath: we look up the PI state (if any),
2301 * and do the rt-mutex unlock. 2400 * and do the rt-mutex unlock.
2302 */ 2401 */
2303 static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags) 2402 static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
2304 { 2403 {
2305 struct futex_hash_bucket *hb; 2404 struct futex_hash_bucket *hb;
2306 struct futex_q *this, *next; 2405 struct futex_q *this, *next;
2307 union futex_key key = FUTEX_KEY_INIT; 2406 union futex_key key = FUTEX_KEY_INIT;
2308 u32 uval, vpid = task_pid_vnr(current); 2407 u32 uval, vpid = task_pid_vnr(current);
2309 int ret; 2408 int ret;
2310 2409
2311 retry: 2410 retry:
2312 if (get_user(uval, uaddr)) 2411 if (get_user(uval, uaddr))
2313 return -EFAULT; 2412 return -EFAULT;
2314 /* 2413 /*
2315 * We release only a lock we actually own: 2414 * We release only a lock we actually own:
2316 */ 2415 */
2317 if ((uval & FUTEX_TID_MASK) != vpid) 2416 if ((uval & FUTEX_TID_MASK) != vpid)
2318 return -EPERM; 2417 return -EPERM;
2319 2418
2320 ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE); 2419 ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE);
2321 if (unlikely(ret != 0)) 2420 if (unlikely(ret != 0))
2322 goto out; 2421 goto out;
2323 2422
2324 hb = hash_futex(&key); 2423 hb = hash_futex(&key);
2325 spin_lock(&hb->lock); 2424 spin_lock(&hb->lock);
2326 2425
2327 /* 2426 /*
2328 * To avoid races, try to do the TID -> 0 atomic transition 2427 * To avoid races, try to do the TID -> 0 atomic transition
2329 * again. If it succeeds then we can return without waking 2428 * again. If it succeeds then we can return without waking
2330 * anyone else up: 2429 * anyone else up. We only try this if neither the waiters nor
2430 * the owner died bit is set.
2331 */ 2431 */
2332 if (!(uval & FUTEX_OWNER_DIED) && 2432 if (!(uval & ~FUTEX_TID_MASK) &&
2333 cmpxchg_futex_value_locked(&uval, uaddr, vpid, 0)) 2433 cmpxchg_futex_value_locked(&uval, uaddr, vpid, 0))
2334 goto pi_faulted; 2434 goto pi_faulted;
2335 /* 2435 /*
2336 * Rare case: we managed to release the lock atomically, 2436 * Rare case: we managed to release the lock atomically,
2337 * no need to wake anyone else up: 2437 * no need to wake anyone else up:
2338 */ 2438 */
2339 if (unlikely(uval == vpid)) 2439 if (unlikely(uval == vpid))
2340 goto out_unlock; 2440 goto out_unlock;
2341 2441
2342 /* 2442 /*
2343 * Ok, other tasks may need to be woken up - check waiters 2443 * Ok, other tasks may need to be woken up - check waiters
2344 * and do the wakeup if necessary: 2444 * and do the wakeup if necessary:
2345 */ 2445 */
2346 plist_for_each_entry_safe(this, next, &hb->chain, list) { 2446 plist_for_each_entry_safe(this, next, &hb->chain, list) {
2347 if (!match_futex (&this->key, &key)) 2447 if (!match_futex (&this->key, &key))
2348 continue; 2448 continue;
2349 ret = wake_futex_pi(uaddr, uval, this); 2449 ret = wake_futex_pi(uaddr, uval, this);
2350 /* 2450 /*
2351 * The atomic access to the futex value 2451 * The atomic access to the futex value
2352 * generated a pagefault, so retry the 2452 * generated a pagefault, so retry the
2353 * user-access and the wakeup: 2453 * user-access and the wakeup:
2354 */ 2454 */
2355 if (ret == -EFAULT) 2455 if (ret == -EFAULT)
2356 goto pi_faulted; 2456 goto pi_faulted;
2357 goto out_unlock; 2457 goto out_unlock;
2358 } 2458 }
2359 /* 2459 /*
2360 * No waiters - kernel unlocks the futex: 2460 * No waiters - kernel unlocks the futex:
2361 */ 2461 */
2362 if (!(uval & FUTEX_OWNER_DIED)) { 2462 ret = unlock_futex_pi(uaddr, uval);
2363 ret = unlock_futex_pi(uaddr, uval); 2463 if (ret == -EFAULT)
2364 if (ret == -EFAULT) 2464 goto pi_faulted;
2365 goto pi_faulted;
2366 }
2367 2465
2368 out_unlock: 2466 out_unlock:
2369 spin_unlock(&hb->lock); 2467 spin_unlock(&hb->lock);
2370 put_futex_key(&key); 2468 put_futex_key(&key);
2371 2469
2372 out: 2470 out:
2373 return ret; 2471 return ret;
2374 2472
2375 pi_faulted: 2473 pi_faulted:
2376 spin_unlock(&hb->lock); 2474 spin_unlock(&hb->lock);
2377 put_futex_key(&key); 2475 put_futex_key(&key);
2378 2476
2379 ret = fault_in_user_writeable(uaddr); 2477 ret = fault_in_user_writeable(uaddr);
2380 if (!ret) 2478 if (!ret)
2381 goto retry; 2479 goto retry;
2382 2480
2383 return ret; 2481 return ret;
2384 } 2482 }
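
The two PI slow paths above, futex_lock_pi() and futex_unlock_pi(), are only entered when the user-space fast path fails. A rough sketch of that fast path, following the conventions documented in futex(2); the lock word and helper names are illustrative, not taken from this file:

#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

/* 0 = unlocked, otherwise the owner's TID; the kernel may OR in
 * FUTEX_WAITERS or FUTEX_OWNER_DIED, which is what defeats the fast path. */
static _Atomic unsigned int pi_lock;

static void pi_lock_acquire(void)
{
	unsigned int expected = 0;
	unsigned int tid = syscall(SYS_gettid);

	/* Fast path: the 0 -> TID transition done entirely in user space. */
	if (atomic_compare_exchange_strong(&pi_lock, &expected, tid))
		return;

	/* Contended (or stale state): futex_lock_pi() queues us, boosts
	 * the current owner and hands the lock over when it is released. */
	syscall(SYS_futex, &pi_lock, FUTEX_LOCK_PI, 0, NULL, NULL, 0);
}

static void pi_lock_release(void)
{
	unsigned int tid = syscall(SYS_gettid);

	/* Fast path: TID -> 0; fails if waiter/owner-died bits are set. */
	if (atomic_compare_exchange_strong(&pi_lock, &tid, 0))
		return;

	/* Slow path: futex_unlock_pi() wakes the top waiter, or clears
	 * the user space value itself when no kernel state exists. */
	syscall(SYS_futex, &pi_lock, FUTEX_UNLOCK_PI, 0, NULL, NULL, 0);
}
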
2385 2483
2386 /** 2484 /**
2387 * handle_early_requeue_pi_wakeup() - Detect early wakeup on the initial futex 2485 * handle_early_requeue_pi_wakeup() - Detect early wakeup on the initial futex
2388 * @hb: the hash_bucket futex_q was originally enqueued on 2486 * @hb: the hash_bucket futex_q was originally enqueued on
2389 * @q: the futex_q woken while waiting to be requeued 2487 * @q: the futex_q woken while waiting to be requeued
2390 * @key2: the futex_key of the requeue target futex 2488 * @key2: the futex_key of the requeue target futex
2391 * @timeout: the timeout associated with the wait (NULL if none) 2489 * @timeout: the timeout associated with the wait (NULL if none)
2392 * 2490 *
2393 * Detect if the task was woken on the initial futex as opposed to the requeue 2491 * Detect if the task was woken on the initial futex as opposed to the requeue
2394 * target futex. If so, determine if it was a timeout or a signal that caused 2492 * target futex. If so, determine if it was a timeout or a signal that caused
2395 * the wakeup and return the appropriate error code to the caller. Must be 2493 * the wakeup and return the appropriate error code to the caller. Must be
2396 * called with the hb lock held. 2494 * called with the hb lock held.
2397 * 2495 *
2398 * Return: 2496 * Return:
2399 * 0 = no early wakeup detected; 2497 * 0 = no early wakeup detected;
2400 * <0 = -ETIMEDOUT or -ERESTARTNOINTR 2498 * <0 = -ETIMEDOUT or -ERESTARTNOINTR
2401 */ 2499 */
2402 static inline 2500 static inline
2403 int handle_early_requeue_pi_wakeup(struct futex_hash_bucket *hb, 2501 int handle_early_requeue_pi_wakeup(struct futex_hash_bucket *hb,
2404 struct futex_q *q, union futex_key *key2, 2502 struct futex_q *q, union futex_key *key2,
2405 struct hrtimer_sleeper *timeout) 2503 struct hrtimer_sleeper *timeout)
2406 { 2504 {
2407 int ret = 0; 2505 int ret = 0;
2408 2506
2409 /* 2507 /*
2410 * With the hb lock held, we avoid races while we process the wakeup. 2508 * With the hb lock held, we avoid races while we process the wakeup.
2411 * We only need to hold hb (and not hb2) to ensure atomicity as the 2509 * We only need to hold hb (and not hb2) to ensure atomicity as the
2412 * wakeup code can't change q.key from uaddr to uaddr2 if we hold hb. 2510 * wakeup code can't change q.key from uaddr to uaddr2 if we hold hb.
2413 * It can't be requeued from uaddr2 to something else since we don't 2511 * It can't be requeued from uaddr2 to something else since we don't
2414 * support a PI aware source futex for requeue. 2512 * support a PI aware source futex for requeue.
2415 */ 2513 */
2416 if (!match_futex(&q->key, key2)) { 2514 if (!match_futex(&q->key, key2)) {
2417 WARN_ON(q->lock_ptr && (&hb->lock != q->lock_ptr)); 2515 WARN_ON(q->lock_ptr && (&hb->lock != q->lock_ptr));
2418 /* 2516 /*
2419 * We were woken prior to requeue by a timeout or a signal. 2517 * We were woken prior to requeue by a timeout or a signal.
2420 * Unqueue the futex_q and determine which it was. 2518 * Unqueue the futex_q and determine which it was.
2421 */ 2519 */
2422 plist_del(&q->list, &hb->chain); 2520 plist_del(&q->list, &hb->chain);
2423 hb_waiters_dec(hb); 2521 hb_waiters_dec(hb);
2424 2522
2425 /* Handle spurious wakeups gracefully */ 2523 /* Handle spurious wakeups gracefully */
2426 ret = -EWOULDBLOCK; 2524 ret = -EWOULDBLOCK;
2427 if (timeout && !timeout->task) 2525 if (timeout && !timeout->task)
2428 ret = -ETIMEDOUT; 2526 ret = -ETIMEDOUT;
2429 else if (signal_pending(current)) 2527 else if (signal_pending(current))
2430 ret = -ERESTARTNOINTR; 2528 ret = -ERESTARTNOINTR;
2431 } 2529 }
2432 return ret; 2530 return ret;
2433 } 2531 }
2434 2532
2435 /** 2533 /**
2436 * futex_wait_requeue_pi() - Wait on uaddr and take uaddr2 2534 * futex_wait_requeue_pi() - Wait on uaddr and take uaddr2
2437 * @uaddr: the futex we initially wait on (non-pi) 2535 * @uaddr: the futex we initially wait on (non-pi)
2438 * @flags: futex flags (FLAGS_SHARED, FLAGS_CLOCKRT, etc.), they must be 2536 * @flags: futex flags (FLAGS_SHARED, FLAGS_CLOCKRT, etc.), they must be
2439 * the same type, no requeueing from private to shared, etc. 2537 * the same type, no requeueing from private to shared, etc.
2440 * @val: the expected value of uaddr 2538 * @val: the expected value of uaddr
2441 * @abs_time: absolute timeout 2539 * @abs_time: absolute timeout
2442 * @bitset: 32 bit wakeup bitset set by userspace, defaults to all 2540 * @bitset: 32 bit wakeup bitset set by userspace, defaults to all
2443 * @uaddr2: the pi futex we will take prior to returning to user-space 2541 * @uaddr2: the pi futex we will take prior to returning to user-space
2444 * 2542 *
2445 * The caller will wait on uaddr and will be requeued by futex_requeue() to 2543 * The caller will wait on uaddr and will be requeued by futex_requeue() to
2446 * uaddr2 which must be PI aware and distinct from uaddr. Normal wakeup will wake 2544 * uaddr2 which must be PI aware and distinct from uaddr. Normal wakeup will wake
2447 * on uaddr2 and complete the acquisition of the rt_mutex prior to returning to 2545 * on uaddr2 and complete the acquisition of the rt_mutex prior to returning to
2448 * userspace. This ensures the rt_mutex maintains an owner when it has waiters; 2546 * userspace. This ensures the rt_mutex maintains an owner when it has waiters;
2449 * without one, the pi logic would not know which task to boost/deboost, if 2547 * without one, the pi logic would not know which task to boost/deboost, if
2450 * there was a need to. 2548 * there was a need to.
2451 * 2549 *
2452 * We call schedule in futex_wait_queue_me() when we enqueue and return there 2550 * We call schedule in futex_wait_queue_me() when we enqueue and return there
2453 * via the following-- 2551 * via the following--
2454 * 1) wakeup on uaddr2 after an atomic lock acquisition by futex_requeue() 2552 * 1) wakeup on uaddr2 after an atomic lock acquisition by futex_requeue()
2455 * 2) wakeup on uaddr2 after a requeue 2553 * 2) wakeup on uaddr2 after a requeue
2456 * 3) signal 2554 * 3) signal
2457 * 4) timeout 2555 * 4) timeout
2458 * 2556 *
2459 * If 3, cleanup and return -ERESTARTNOINTR. 2557 * If 3, cleanup and return -ERESTARTNOINTR.
2460 * 2558 *
2461 * If 2, we may then block on trying to take the rt_mutex and return via: 2559 * If 2, we may then block on trying to take the rt_mutex and return via:
2462 * 5) successful lock 2560 * 5) successful lock
2463 * 6) signal 2561 * 6) signal
2464 * 7) timeout 2562 * 7) timeout
2465 * 8) other lock acquisition failure 2563 * 8) other lock acquisition failure
2466 * 2564 *
2467 * If 6, return -EWOULDBLOCK (restarting the syscall would do the same). 2565 * If 6, return -EWOULDBLOCK (restarting the syscall would do the same).
2468 * 2566 *
2469 * If 4 or 7, we cleanup and return with -ETIMEDOUT. 2567 * If 4 or 7, we cleanup and return with -ETIMEDOUT.
2470 * 2568 *
2471 * Return: 2569 * Return:
2472 * 0 - On success; 2570 * 0 - On success;
2473 * <0 - On error 2571 * <0 - On error
2474 */ 2572 */
2475 static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags, 2573 static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
2476 u32 val, ktime_t *abs_time, u32 bitset, 2574 u32 val, ktime_t *abs_time, u32 bitset,
2477 u32 __user *uaddr2) 2575 u32 __user *uaddr2)
2478 { 2576 {
2479 struct hrtimer_sleeper timeout, *to = NULL; 2577 struct hrtimer_sleeper timeout, *to = NULL;
2480 struct rt_mutex_waiter rt_waiter; 2578 struct rt_mutex_waiter rt_waiter;
2481 struct rt_mutex *pi_mutex = NULL; 2579 struct rt_mutex *pi_mutex = NULL;
2482 struct futex_hash_bucket *hb; 2580 struct futex_hash_bucket *hb;
2483 union futex_key key2 = FUTEX_KEY_INIT; 2581 union futex_key key2 = FUTEX_KEY_INIT;
2484 struct futex_q q = futex_q_init; 2582 struct futex_q q = futex_q_init;
2485 int res, ret; 2583 int res, ret;
2486 2584
2487 if (uaddr == uaddr2) 2585 if (uaddr == uaddr2)
2488 return -EINVAL; 2586 return -EINVAL;
2489 2587
2490 if (!bitset) 2588 if (!bitset)
2491 return -EINVAL; 2589 return -EINVAL;
2492 2590
2493 if (abs_time) { 2591 if (abs_time) {
2494 to = &timeout; 2592 to = &timeout;
2495 hrtimer_init_on_stack(&to->timer, (flags & FLAGS_CLOCKRT) ? 2593 hrtimer_init_on_stack(&to->timer, (flags & FLAGS_CLOCKRT) ?
2496 CLOCK_REALTIME : CLOCK_MONOTONIC, 2594 CLOCK_REALTIME : CLOCK_MONOTONIC,
2497 HRTIMER_MODE_ABS); 2595 HRTIMER_MODE_ABS);
2498 hrtimer_init_sleeper(to, current); 2596 hrtimer_init_sleeper(to, current);
2499 hrtimer_set_expires_range_ns(&to->timer, *abs_time, 2597 hrtimer_set_expires_range_ns(&to->timer, *abs_time,
2500 current->timer_slack_ns); 2598 current->timer_slack_ns);
2501 } 2599 }
2502 2600
2503 /* 2601 /*
2504 * The waiter is allocated on our stack, manipulated by the requeue 2602 * The waiter is allocated on our stack, manipulated by the requeue
2505 * code while we sleep on uaddr. 2603 * code while we sleep on uaddr.
2506 */ 2604 */
2507 debug_rt_mutex_init_waiter(&rt_waiter); 2605 debug_rt_mutex_init_waiter(&rt_waiter);
2508 RB_CLEAR_NODE(&rt_waiter.pi_tree_entry); 2606 RB_CLEAR_NODE(&rt_waiter.pi_tree_entry);
2509 RB_CLEAR_NODE(&rt_waiter.tree_entry); 2607 RB_CLEAR_NODE(&rt_waiter.tree_entry);
2510 rt_waiter.task = NULL; 2608 rt_waiter.task = NULL;
2511 2609
2512 ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, VERIFY_WRITE); 2610 ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, VERIFY_WRITE);
2513 if (unlikely(ret != 0)) 2611 if (unlikely(ret != 0))
2514 goto out; 2612 goto out;
2515 2613
2516 q.bitset = bitset; 2614 q.bitset = bitset;
2517 q.rt_waiter = &rt_waiter; 2615 q.rt_waiter = &rt_waiter;
2518 q.requeue_pi_key = &key2; 2616 q.requeue_pi_key = &key2;
2519 2617
2520 /* 2618 /*
2521 * Prepare to wait on uaddr. On success, increments q.key (key1) ref 2619 * Prepare to wait on uaddr. On success, increments q.key (key1) ref
2522 * count. 2620 * count.
2523 */ 2621 */
2524 ret = futex_wait_setup(uaddr, val, flags, &q, &hb); 2622 ret = futex_wait_setup(uaddr, val, flags, &q, &hb);
2525 if (ret) 2623 if (ret)
2526 goto out_key2; 2624 goto out_key2;
2625
2626 /*
2627 * The check above which compares uaddrs is not sufficient for
2628 * shared futexes. We need to compare the keys:
2629 */
2630 if (match_futex(&q.key, &key2)) {
2631 ret = -EINVAL;
2632 goto out_put_keys;
2633 }
2527 2634
2528 /* Queue the futex_q, drop the hb lock, wait for wakeup. */ 2635 /* Queue the futex_q, drop the hb lock, wait for wakeup. */
2529 futex_wait_queue_me(hb, &q, to); 2636 futex_wait_queue_me(hb, &q, to);
2530 2637
2531 spin_lock(&hb->lock); 2638 spin_lock(&hb->lock);
2532 ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to); 2639 ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
2533 spin_unlock(&hb->lock); 2640 spin_unlock(&hb->lock);
2534 if (ret) 2641 if (ret)
2535 goto out_put_keys; 2642 goto out_put_keys;
2536 2643
2537 /* 2644 /*
2538 * In order for us to be here, we know our q.key == key2, and since 2645 * In order for us to be here, we know our q.key == key2, and since
2539 * we took the hb->lock above, we also know that futex_requeue() has 2646 * we took the hb->lock above, we also know that futex_requeue() has
2540 * completed and we no longer have to concern ourselves with a wakeup 2647 * completed and we no longer have to concern ourselves with a wakeup
2541 * race with the atomic proxy lock acquisition by the requeue code. The 2648 * race with the atomic proxy lock acquisition by the requeue code. The
2542 * futex_requeue dropped our key1 reference and incremented our key2 2649 * futex_requeue dropped our key1 reference and incremented our key2
2543 * reference count. 2650 * reference count.
2544 */ 2651 */
2545 2652
2546 /* Check if the requeue code acquired the second futex for us. */ 2653 /* Check if the requeue code acquired the second futex for us. */
2547 if (!q.rt_waiter) { 2654 if (!q.rt_waiter) {
2548 /* 2655 /*
2549 * Got the lock. We might not be the anticipated owner if we 2656 * Got the lock. We might not be the anticipated owner if we
2550 * did a lock-steal - fix up the PI-state in that case. 2657 * did a lock-steal - fix up the PI-state in that case.
2551 */ 2658 */
2552 if (q.pi_state && (q.pi_state->owner != current)) { 2659 if (q.pi_state && (q.pi_state->owner != current)) {
2553 spin_lock(q.lock_ptr); 2660 spin_lock(q.lock_ptr);
2554 ret = fixup_pi_state_owner(uaddr2, &q, current); 2661 ret = fixup_pi_state_owner(uaddr2, &q, current);
2555 spin_unlock(q.lock_ptr); 2662 spin_unlock(q.lock_ptr);
2556 } 2663 }
2557 } else { 2664 } else {
2558 /* 2665 /*
2559 * We have been woken up by futex_unlock_pi(), a timeout, or a 2666 * We have been woken up by futex_unlock_pi(), a timeout, or a
2560 * signal. futex_unlock_pi() will destroy neither the lock_ptr nor 2667 * signal. futex_unlock_pi() will destroy neither the lock_ptr nor
2561 * the pi_state. 2668 * the pi_state.
2562 */ 2669 */
2563 WARN_ON(!q.pi_state); 2670 WARN_ON(!q.pi_state);
2564 pi_mutex = &q.pi_state->pi_mutex; 2671 pi_mutex = &q.pi_state->pi_mutex;
2565 ret = rt_mutex_finish_proxy_lock(pi_mutex, to, &rt_waiter, 1); 2672 ret = rt_mutex_finish_proxy_lock(pi_mutex, to, &rt_waiter, 1);
2566 debug_rt_mutex_free_waiter(&rt_waiter); 2673 debug_rt_mutex_free_waiter(&rt_waiter);
2567 2674
2568 spin_lock(q.lock_ptr); 2675 spin_lock(q.lock_ptr);
2569 /* 2676 /*
2570 * Fixup the pi_state owner and possibly acquire the lock if we 2677 * Fixup the pi_state owner and possibly acquire the lock if we
2571 * haven't already. 2678 * haven't already.
2572 */ 2679 */
2573 res = fixup_owner(uaddr2, &q, !ret); 2680 res = fixup_owner(uaddr2, &q, !ret);
2574 /* 2681 /*
2575 * If fixup_owner() returned an error, propagate that. If it 2682 * If fixup_owner() returned an error, propagate that. If it
2576 * acquired the lock, clear -ETIMEDOUT or -EINTR. 2683 * acquired the lock, clear -ETIMEDOUT or -EINTR.
2577 */ 2684 */
2578 if (res) 2685 if (res)
2579 ret = (res < 0) ? res : 0; 2686 ret = (res < 0) ? res : 0;
2580 2687
2581 /* Unqueue and drop the lock. */ 2688 /* Unqueue and drop the lock. */
2582 unqueue_me_pi(&q); 2689 unqueue_me_pi(&q);
2583 } 2690 }
2584 2691
2585 /* 2692 /*
2586 * If fixup_pi_state_owner() faulted and was unable to handle the 2693 * If fixup_pi_state_owner() faulted and was unable to handle the
2587 * fault, unlock the rt_mutex and return the fault to userspace. 2694 * fault, unlock the rt_mutex and return the fault to userspace.
2588 */ 2695 */
2589 if (ret == -EFAULT) { 2696 if (ret == -EFAULT) {
2590 if (pi_mutex && rt_mutex_owner(pi_mutex) == current) 2697 if (pi_mutex && rt_mutex_owner(pi_mutex) == current)
2591 rt_mutex_unlock(pi_mutex); 2698 rt_mutex_unlock(pi_mutex);
2592 } else if (ret == -EINTR) { 2699 } else if (ret == -EINTR) {
2593 /* 2700 /*
2594 * We've already been requeued, but cannot restart by calling 2701 * We've already been requeued, but cannot restart by calling
2595 * futex_lock_pi() directly. We could restart this syscall, but 2702 * futex_lock_pi() directly. We could restart this syscall, but
2596 * it would detect that the user space "val" changed and return 2703 * it would detect that the user space "val" changed and return
2597 * -EWOULDBLOCK. Save the overhead of the restart and return 2704 * -EWOULDBLOCK. Save the overhead of the restart and return
2598 * -EWOULDBLOCK directly. 2705 * -EWOULDBLOCK directly.
2599 */ 2706 */
2600 ret = -EWOULDBLOCK; 2707 ret = -EWOULDBLOCK;
2601 } 2708 }
2602 2709
2603 out_put_keys: 2710 out_put_keys:
2604 put_futex_key(&q.key); 2711 put_futex_key(&q.key);
2605 out_key2: 2712 out_key2:
2606 put_futex_key(&key2); 2713 put_futex_key(&key2);
2607 2714
2608 out: 2715 out:
2609 if (to) { 2716 if (to) {
2610 hrtimer_cancel(&to->timer); 2717 hrtimer_cancel(&to->timer);
2611 destroy_hrtimer_on_stack(&to->timer); 2718 destroy_hrtimer_on_stack(&to->timer);
2612 } 2719 }
2613 return ret; 2720 return ret;
2614 } 2721 }
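
From user space, futex_wait_requeue_pi() is driven by the FUTEX_WAIT_REQUEUE_PI / FUTEX_CMP_REQUEUE_PI pair, the condvar-style usage described in futex(2). A heavily simplified sketch with invented variable names, omitting the surrounding mutex and sequence handling:

#include <linux/futex.h>
#include <limits.h>
#include <sys/syscall.h>
#include <unistd.h>

static unsigned int cond_word;	/* non-PI futex, waited on first */
static unsigned int mutex_word;	/* PI futex, acquired before returning */

/* Waiter: sleeps on &cond_word; on success it has been requeued to and
 * now owns the PI futex at &mutex_word (this is futex_wait_requeue_pi()). */
static long cond_wait(unsigned int seen)
{
	return syscall(SYS_futex, &cond_word, FUTEX_WAIT_REQUEUE_PI,
		       seen, NULL, &mutex_word, 0);
}

/* Waker: wakes at most one waiter and requeues the rest onto the PI futex,
 * provided *(&cond_word) still equals the expected value. */
static long cond_broadcast(unsigned int expected)
{
	return syscall(SYS_futex, &cond_word, FUTEX_CMP_REQUEUE_PI,
		       1, INT_MAX, &mutex_word, expected);
}
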
2615 2722
2616 /* 2723 /*
2617 * Support for robust futexes: the kernel cleans up held futexes at 2724 * Support for robust futexes: the kernel cleans up held futexes at
2618 * thread exit time. 2725 * thread exit time.
2619 * 2726 *
2620 * Implementation: user-space maintains a per-thread list of locks it 2727 * Implementation: user-space maintains a per-thread list of locks it
2621 * is holding. Upon do_exit(), the kernel carefully walks this list, 2728 * is holding. Upon do_exit(), the kernel carefully walks this list,
2622 * and marks all locks that are owned by this thread with the 2729 * and marks all locks that are owned by this thread with the
2623 * FUTEX_OWNER_DIED bit, and wakes up a waiter (if any). The list is 2730 * FUTEX_OWNER_DIED bit, and wakes up a waiter (if any). The list is
2624 * always manipulated with the lock held, so the list is private and 2731 * always manipulated with the lock held, so the list is private and
2625 * per-thread. Userspace also maintains a per-thread 'list_op_pending' 2732 * per-thread. Userspace also maintains a per-thread 'list_op_pending'
2626 * field, to allow the kernel to clean up if the thread dies after 2733 * field, to allow the kernel to clean up if the thread dies after
2627 * acquiring the lock, but just before it could have added itself to 2734 * acquiring the lock, but just before it could have added itself to
2628 * the list. There can only be one such pending lock. 2735 * the list. There can only be one such pending lock.
2629 */ 2736 */
2630 2737
2631 /** 2738 /**
2632 * sys_set_robust_list() - Set the robust-futex list head of a task 2739 * sys_set_robust_list() - Set the robust-futex list head of a task
2633 * @head: pointer to the list-head 2740 * @head: pointer to the list-head
2634 * @len: length of the list-head, as userspace expects 2741 * @len: length of the list-head, as userspace expects
2635 */ 2742 */
2636 SYSCALL_DEFINE2(set_robust_list, struct robust_list_head __user *, head, 2743 SYSCALL_DEFINE2(set_robust_list, struct robust_list_head __user *, head,
2637 size_t, len) 2744 size_t, len)
2638 { 2745 {
2639 if (!futex_cmpxchg_enabled) 2746 if (!futex_cmpxchg_enabled)
2640 return -ENOSYS; 2747 return -ENOSYS;
2641 /* 2748 /*
2642 * The kernel knows only one size for now: 2749 * The kernel knows only one size for now:
2643 */ 2750 */
2644 if (unlikely(len != sizeof(*head))) 2751 if (unlikely(len != sizeof(*head)))
2645 return -EINVAL; 2752 return -EINVAL;
2646 2753
2647 current->robust_list = head; 2754 current->robust_list = head;
2648 2755
2649 return 0; 2756 return 0;
2650 } 2757 }
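
The registration described in the comment block above looks roughly like this from user space; it is normally done once per thread by the C library. The struct types come from <linux/futex.h>, everything else here is an illustrative assumption:

#include <linux/futex.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* One robust lock as user space might lay it out: a list node next to the
 * futex word, so a single futex_offset works for every lock. */
struct robust_lock {
	struct robust_list node;
	unsigned int futex;		/* holds the owner's TID while locked */
};

static struct robust_list_head robust_head;

static void robust_list_register(void)
{
	robust_head.list.next = &robust_head.list;	/* empty, circular */
	robust_head.futex_offset = offsetof(struct robust_lock, futex)
				 - offsetof(struct robust_lock, node);
	robust_head.list_op_pending = NULL;

	/* exit_robust_list() below walks this head when the thread dies. */
	syscall(SYS_set_robust_list, &robust_head, sizeof(robust_head));
}
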
2651 2758
2652 /** 2759 /**
2653 * sys_get_robust_list() - Get the robust-futex list head of a task 2760 * sys_get_robust_list() - Get the robust-futex list head of a task
2654 * @pid: pid of the process [zero for current task] 2761 * @pid: pid of the process [zero for current task]
2655 * @head_ptr: pointer to a list-head pointer, the kernel fills it in 2762 * @head_ptr: pointer to a list-head pointer, the kernel fills it in
2656 * @len_ptr: pointer to a length field, the kernel fills in the header size 2763 * @len_ptr: pointer to a length field, the kernel fills in the header size
2657 */ 2764 */
2658 SYSCALL_DEFINE3(get_robust_list, int, pid, 2765 SYSCALL_DEFINE3(get_robust_list, int, pid,
2659 struct robust_list_head __user * __user *, head_ptr, 2766 struct robust_list_head __user * __user *, head_ptr,
2660 size_t __user *, len_ptr) 2767 size_t __user *, len_ptr)
2661 { 2768 {
2662 struct robust_list_head __user *head; 2769 struct robust_list_head __user *head;
2663 unsigned long ret; 2770 unsigned long ret;
2664 struct task_struct *p; 2771 struct task_struct *p;
2665 2772
2666 if (!futex_cmpxchg_enabled) 2773 if (!futex_cmpxchg_enabled)
2667 return -ENOSYS; 2774 return -ENOSYS;
2668 2775
2669 rcu_read_lock(); 2776 rcu_read_lock();
2670 2777
2671 ret = -ESRCH; 2778 ret = -ESRCH;
2672 if (!pid) 2779 if (!pid)
2673 p = current; 2780 p = current;
2674 else { 2781 else {
2675 p = find_task_by_vpid(pid); 2782 p = find_task_by_vpid(pid);
2676 if (!p) 2783 if (!p)
2677 goto err_unlock; 2784 goto err_unlock;
2678 } 2785 }
2679 2786
2680 ret = -EPERM; 2787 ret = -EPERM;
2681 if (!ptrace_may_access(p, PTRACE_MODE_READ)) 2788 if (!ptrace_may_access(p, PTRACE_MODE_READ))
2682 goto err_unlock; 2789 goto err_unlock;
2683 2790
2684 head = p->robust_list; 2791 head = p->robust_list;
2685 rcu_read_unlock(); 2792 rcu_read_unlock();
2686 2793
2687 if (put_user(sizeof(*head), len_ptr)) 2794 if (put_user(sizeof(*head), len_ptr))
2688 return -EFAULT; 2795 return -EFAULT;
2689 return put_user(head, head_ptr); 2796 return put_user(head, head_ptr);
2690 2797
2691 err_unlock: 2798 err_unlock:
2692 rcu_read_unlock(); 2799 rcu_read_unlock();
2693 2800
2694 return ret; 2801 return ret;
2695 } 2802 }
2696 2803
2697 /* 2804 /*
2698 * Process a futex-list entry, check whether it's owned by the 2805 * Process a futex-list entry, check whether it's owned by the
2699 * dying task, and do notification if so: 2806 * dying task, and do notification if so:
2700 */ 2807 */
2701 int handle_futex_death(u32 __user *uaddr, struct task_struct *curr, int pi) 2808 int handle_futex_death(u32 __user *uaddr, struct task_struct *curr, int pi)
2702 { 2809 {
2703 u32 uval, uninitialized_var(nval), mval; 2810 u32 uval, uninitialized_var(nval), mval;
2704 2811
2705 retry: 2812 retry:
2706 if (get_user(uval, uaddr)) 2813 if (get_user(uval, uaddr))
2707 return -1; 2814 return -1;
2708 2815
2709 if ((uval & FUTEX_TID_MASK) == task_pid_vnr(curr)) { 2816 if ((uval & FUTEX_TID_MASK) == task_pid_vnr(curr)) {
2710 /* 2817 /*
2711 * Ok, this dying thread is truly holding a futex 2818 * Ok, this dying thread is truly holding a futex
2712 * of interest. Set the OWNER_DIED bit atomically 2819 * of interest. Set the OWNER_DIED bit atomically
2713 * via cmpxchg, and if the value had FUTEX_WAITERS 2820 * via cmpxchg, and if the value had FUTEX_WAITERS
2714 * set, wake up a waiter (if any). (We have to do a 2821 * set, wake up a waiter (if any). (We have to do a
2715 * futex_wake() even if OWNER_DIED is already set - 2822 * futex_wake() even if OWNER_DIED is already set -
2716 * to handle the rare but possible case of recursive 2823 * to handle the rare but possible case of recursive
2717 * thread-death.) The rest of the cleanup is done in 2824 * thread-death.) The rest of the cleanup is done in
2718 * userspace. 2825 * userspace.
2719 */ 2826 */
2720 mval = (uval & FUTEX_WAITERS) | FUTEX_OWNER_DIED; 2827 mval = (uval & FUTEX_WAITERS) | FUTEX_OWNER_DIED;
2721 /* 2828 /*
2722 * We are not holding a lock here, but we want to have 2829 * We are not holding a lock here, but we want to have
2723 * the pagefault_disable/enable() protection because 2830 * the pagefault_disable/enable() protection because
2724 * we want to handle the fault gracefully. If the 2831 * we want to handle the fault gracefully. If the
2725 * access fails we try to fault in the futex with R/W 2832 * access fails we try to fault in the futex with R/W
2726 * verification via get_user_pages. get_user() above 2833 * verification via get_user_pages. get_user() above
2727 * does not guarantee R/W access. If that fails we 2834 * does not guarantee R/W access. If that fails we
2728 * give up and leave the futex locked. 2835 * give up and leave the futex locked.
2729 */ 2836 */
2730 if (cmpxchg_futex_value_locked(&nval, uaddr, uval, mval)) { 2837 if (cmpxchg_futex_value_locked(&nval, uaddr, uval, mval)) {
2731 if (fault_in_user_writeable(uaddr)) 2838 if (fault_in_user_writeable(uaddr))
2732 return -1; 2839 return -1;
2733 goto retry; 2840 goto retry;
2734 } 2841 }
2735 if (nval != uval) 2842 if (nval != uval)
2736 goto retry; 2843 goto retry;
2737 2844
2738 /* 2845 /*
2739 * Wake robust non-PI futexes here. The wakeup of 2846 * Wake robust non-PI futexes here. The wakeup of
2740 * PI futexes happens in exit_pi_state(): 2847 * PI futexes happens in exit_pi_state():
2741 */ 2848 */
2742 if (!pi && (uval & FUTEX_WAITERS)) 2849 if (!pi && (uval & FUTEX_WAITERS))
2743 futex_wake(uaddr, 1, 1, FUTEX_BITSET_MATCH_ANY); 2850 futex_wake(uaddr, 1, 1, FUTEX_BITSET_MATCH_ANY);
2744 } 2851 }
2745 return 0; 2852 return 0;
2746 } 2853 }
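
Continuing the user-space sketch after sys_set_robust_list() above, this is, roughly, the per-lock discipline that handle_futex_death() and exit_robust_list() rely on. It is a simplification: the real acquire path futex-waits instead of spinning, unlinking assumes locks are released in LIFO order, and the memory ordering is only indicative.

static void robust_lock_acquire(struct robust_lock *l)
{
	unsigned int expected = 0;
	unsigned int tid = syscall(SYS_gettid);

	/* Announce the node first: if we die after the cmpxchg but before
	 * the list insertion, the kernel still finds the lock via
	 * list_op_pending and can mark it FUTEX_OWNER_DIED. */
	robust_head.list_op_pending = &l->node;

	while (!__atomic_compare_exchange_n(&l->futex, &expected, tid, 0,
					    __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
		expected = 0;		/* simplified: real code futex-waits */

	/* Link the lock into the per-thread list, then clear the hint. */
	l->node.next = robust_head.list.next;
	robust_head.list.next = &l->node;
	robust_head.list_op_pending = NULL;
}

static void robust_lock_release(struct robust_lock *l)
{
	robust_head.list_op_pending = &l->node;
	robust_head.list.next = l->node.next;	/* simplified: l was locked last */

	__atomic_store_n(&l->futex, 0, __ATOMIC_RELEASE);
	syscall(SYS_futex, &l->futex, FUTEX_WAKE, 1, NULL, NULL, 0);

	robust_head.list_op_pending = NULL;
}
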
2747 2854
/*
 * Fetch a robust-list pointer. Bit 0 signals PI futexes:
 */
static inline int fetch_robust_entry(struct robust_list __user **entry,
				     struct robust_list __user * __user *head,
				     unsigned int *pi)
{
	unsigned long uentry;

	if (get_user(uentry, (unsigned long __user *)head))
		return -EFAULT;

	*entry = (void __user *)(uentry & ~1UL);
	*pi = uentry & 1;

	return 0;
}

/*
 * Walk curr->robust_list (very carefully, it's a userspace list!)
 * and mark any locks found there dead, and notify any waiters.
 *
 * We silently return on any sign of list-walking problem.
 */
void exit_robust_list(struct task_struct *curr)
{
	struct robust_list_head __user *head = curr->robust_list;
	struct robust_list __user *entry, *next_entry, *pending;
	unsigned int limit = ROBUST_LIST_LIMIT, pi, pip;
	unsigned int uninitialized_var(next_pi);
	unsigned long futex_offset;
	int rc;

	if (!futex_cmpxchg_enabled)
		return;

	/*
	 * Fetch the list head (which was registered earlier, via
	 * sys_set_robust_list()):
	 */
	if (fetch_robust_entry(&entry, &head->list.next, &pi))
		return;
	/*
	 * Fetch the relative futex offset:
	 */
	if (get_user(futex_offset, &head->futex_offset))
		return;
	/*
	 * Fetch any possibly pending lock-add first, and handle it
	 * if it exists:
	 */
	if (fetch_robust_entry(&pending, &head->list_op_pending, &pip))
		return;

	next_entry = NULL;	/* avoid warning with gcc */
	while (entry != &head->list) {
		/*
		 * Fetch the next entry in the list before calling
		 * handle_futex_death:
		 */
		rc = fetch_robust_entry(&next_entry, &entry->next, &next_pi);
		/*
		 * A pending lock might already be on the list, so
		 * don't process it twice:
		 */
		if (entry != pending)
			if (handle_futex_death((void __user *)entry + futex_offset,
						curr, pi))
				return;
		if (rc)
			return;
		entry = next_entry;
		pi = next_pi;
		/*
		 * Avoid excessively long or circular lists:
		 */
		if (!--limit)
			break;

		cond_resched();
	}

	if (pending)
		handle_futex_death((void __user *)pending + futex_offset,
				   curr, pip);
}

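The list walked above is built entirely by userspace; the kernel only reads it when the thread exits. A minimal sketch of that setup follows, assuming a hypothetical struct my_robust_mutex layout (struct robust_list_head and the set_robust_list syscall are the real ABI from <linux/futex.h>):

#include <linux/futex.h>	/* struct robust_list, struct robust_list_head */
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

struct my_robust_mutex {		/* illustrative layout only */
	struct robust_list list;	/* links held mutexes per thread */
	uint32_t futex_word;		/* TID | FUTEX_WAITERS | FUTEX_OWNER_DIED */
};

static __thread struct robust_list_head rhead;

static long register_robust_list(void)
{
	rhead.list.next = &rhead.list;	/* empty circular list */
	/* futex_offset: from a list entry to its futex word, as used above */
	rhead.futex_offset = offsetof(struct my_robust_mutex, futex_word) -
			     offsetof(struct my_robust_mutex, list);
	rhead.list_op_pending = NULL;
	return syscall(SYS_set_robust_list, &rhead, sizeof(rhead));
}

Before blocking on a lock, a thread would point list_op_pending at the entry, link the entry into the list once it owns the lock, and then clear list_op_pending again; that is why the walk above handles the pending entry separately and skips it if it is already on the list.
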
long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
		u32 __user *uaddr2, u32 val2, u32 val3)
{
	int cmd = op & FUTEX_CMD_MASK;
	unsigned int flags = 0;

	if (!(op & FUTEX_PRIVATE_FLAG))
		flags |= FLAGS_SHARED;

	if (op & FUTEX_CLOCK_REALTIME) {
		flags |= FLAGS_CLOCKRT;
		if (cmd != FUTEX_WAIT_BITSET && cmd != FUTEX_WAIT_REQUEUE_PI)
			return -ENOSYS;
	}

	switch (cmd) {
	case FUTEX_LOCK_PI:
	case FUTEX_UNLOCK_PI:
	case FUTEX_TRYLOCK_PI:
	case FUTEX_WAIT_REQUEUE_PI:
	case FUTEX_CMP_REQUEUE_PI:
		if (!futex_cmpxchg_enabled)
			return -ENOSYS;
	}

	switch (cmd) {
	case FUTEX_WAIT:
		val3 = FUTEX_BITSET_MATCH_ANY;
	case FUTEX_WAIT_BITSET:
		return futex_wait(uaddr, flags, val, timeout, val3);
	case FUTEX_WAKE:
		val3 = FUTEX_BITSET_MATCH_ANY;
	case FUTEX_WAKE_BITSET:
		return futex_wake(uaddr, flags, val, val3);
	case FUTEX_REQUEUE:
		return futex_requeue(uaddr, flags, uaddr2, val, val2, NULL, 0);
	case FUTEX_CMP_REQUEUE:
		return futex_requeue(uaddr, flags, uaddr2, val, val2, &val3, 0);
	case FUTEX_WAKE_OP:
		return futex_wake_op(uaddr, flags, uaddr2, val, val2, val3);
	case FUTEX_LOCK_PI:
		return futex_lock_pi(uaddr, flags, val, timeout, 0);
	case FUTEX_UNLOCK_PI:
		return futex_unlock_pi(uaddr, flags);
	case FUTEX_TRYLOCK_PI:
		return futex_lock_pi(uaddr, flags, 0, timeout, 1);
	case FUTEX_WAIT_REQUEUE_PI:
		val3 = FUTEX_BITSET_MATCH_ANY;
		return futex_wait_requeue_pi(uaddr, flags, val, timeout, val3,
					     uaddr2);
	case FUTEX_CMP_REQUEUE_PI:
		return futex_requeue(uaddr, flags, uaddr2, val, val2, &val3, 1);
	}
	return -ENOSYS;
}

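do_futex() is the multiplexer behind the futex(2) system call. As a hedged illustration of the two simplest operations it dispatches (to futex_wait() and futex_wake()), a userspace caller might look like the sketch below; there is no glibc wrapper for futex(2), so the raw syscall is used, and the helper names are hypothetical:

#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

static long sys_futex(uint32_t *uaddr, int op, uint32_t val,
		      const struct timespec *timeout,
		      uint32_t *uaddr2, uint32_t val3)
{
	return syscall(SYS_futex, uaddr, op, val, timeout, uaddr2, val3);
}

/* Block only if *word still equals 'expected'; return on wake or signal. */
static void futex_wait_private(uint32_t *word, uint32_t expected)
{
	sys_futex(word, FUTEX_WAIT | FUTEX_PRIVATE_FLAG, expected,
		  NULL, NULL, 0);
}

/* Wake at most one thread blocked on *word. */
static void futex_wake_one(uint32_t *word)
{
	sys_futex(word, FUTEX_WAKE | FUTEX_PRIVATE_FLAG, 1, NULL, NULL, 0);
}

Because FUTEX_PRIVATE_FLAG is set, the switch above never adds FLAGS_SHARED, so the kernel can key the hash on the mm rather than the underlying inode/page.
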
SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
		struct timespec __user *, utime, u32 __user *, uaddr2,
		u32, val3)
{
	struct timespec ts;
	ktime_t t, *tp = NULL;
	u32 val2 = 0;
	int cmd = op & FUTEX_CMD_MASK;

	if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI ||
		      cmd == FUTEX_WAIT_BITSET ||
		      cmd == FUTEX_WAIT_REQUEUE_PI)) {
		if (copy_from_user(&ts, utime, sizeof(ts)) != 0)
			return -EFAULT;
		if (!timespec_valid(&ts))
			return -EINVAL;

		t = timespec_to_ktime(ts);
		if (cmd == FUTEX_WAIT)
			t = ktime_add_safe(ktime_get(), t);
		tp = &t;
	}
	/*
	 * requeue parameter in 'utime' if cmd == FUTEX_*_REQUEUE_*.
	 * number of waiters to wake in 'utime' if cmd == FUTEX_WAKE_OP.
	 */
	if (cmd == FUTEX_REQUEUE || cmd == FUTEX_CMP_REQUEUE ||
	    cmd == FUTEX_CMP_REQUEUE_PI || cmd == FUTEX_WAKE_OP)
		val2 = (u32) (unsigned long) utime;

	return do_futex(uaddr, op, val, tp, uaddr2, val2, val3);
}

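Note the asymmetric timeout handling above: FUTEX_WAIT receives a relative timeout that the wrapper converts to an absolute deadline with ktime_add_safe(ktime_get(), t), while FUTEX_WAIT_BITSET passes utime through unmodified as an absolute time (against CLOCK_MONOTONIC by default, or CLOCK_REALTIME when FUTEX_CLOCK_REALTIME is set, as enforced in do_futex()). A hedged sketch of the absolute-deadline form, with a hypothetical wrapper name:

#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical wrapper: wait while *word == expected, until the absolute
 * CLOCK_REALTIME time 'deadline'; returns -1 with errno == ETIMEDOUT on expiry. */
static long futex_wait_abs_realtime(uint32_t *word, uint32_t expected,
				    const struct timespec *deadline)
{
	return syscall(SYS_futex, word,
		       FUTEX_WAIT_BITSET | FUTEX_PRIVATE_FLAG |
		       FUTEX_CLOCK_REALTIME,
		       expected, deadline, NULL, FUTEX_BITSET_MATCH_ANY);
}
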
static void __init futex_detect_cmpxchg(void)
{
#ifndef CONFIG_HAVE_FUTEX_CMPXCHG
	u32 curval;

	/*
	 * This will fail and we want it. Some arch implementations do
	 * runtime detection of the futex_atomic_cmpxchg_inatomic()
	 * functionality. We want to know that before we call in any
	 * of the complex code paths. Also we want to prevent
	 * registration of robust lists in that case. NULL is
	 * guaranteed to fault and we get -EFAULT on functional
	 * implementation, the non-functional ones will return
	 * -ENOSYS.
	 */
	if (cmpxchg_futex_value_locked(&curval, NULL, 0, 0) == -EFAULT)
		futex_cmpxchg_enabled = 1;
#endif
}

static int __init futex_init(void)
{
	unsigned int futex_shift;
	unsigned long i;

#if CONFIG_BASE_SMALL
	futex_hashsize = 16;
#else
	futex_hashsize = roundup_pow_of_two(256 * num_possible_cpus());
#endif

	futex_queues = alloc_large_system_hash("futex", sizeof(*futex_queues),
					       futex_hashsize, 0,
					       futex_hashsize < 256 ? HASH_SMALL : 0,