Commit 7a0ea09ad5352efce8fe79ed853150449903b9f5
Author:    Michal Hocko <mhocko@suse.cz>
Committer: Linus Torvalds <torvalds@linux-foundation.org>
Parent:    f4985dc714

futex: futex_find_get_task remove credentials check

futex_find_get_task is currently used (through lookup_pi_state) from two
contexts, futex_requeue and futex_lock_pi_atomic.  Neither path looks
like it needs the credentials check, though.  Differing (e)uids
shouldn't matter at all, because the only thing that is important for a
shared futex is the accessibility of the shared memory.
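
As a minimal sketch (not part of this commit) of the configuration in
question: a process-shared, robust, priority-inheriting mutex placed in
shared memory that both uids can map.  The "/pi-demo" name and the 0666
mode are made-up example choices, and older glibc spells
pthread_mutexattr_setrobust as pthread_mutexattr_setrobust_np:

#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map a mutex into world-accessible shared memory and make it
 * process-shared, robust and PI -- the combination that uses
 * FUTEX_LOCK_PI on the shared futex word when contended. */
static pthread_mutex_t *shared_mutex_init(void)
{
	pthread_mutexattr_t attr;
	pthread_mutex_t *m;
	int fd = shm_open("/pi-demo", O_CREAT | O_RDWR, 0666);

	if (fd < 0 || ftruncate(fd, sizeof(*m)) < 0)
		abort();
	m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (m == MAP_FAILED)
		abort();

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
	pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
	pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
	pthread_mutex_init(m, &attr);
	return m;
}

(build with -lpthread -lrt)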

The credential check results in a glibc assert failure or a process
hang (if glibc is compiled without assert support) for a shared robust
pthread mutex with priority inheritance, if a process tries to lock an
already held lock owned by a process with a different euid:

pthread_mutex_lock.c:312: __pthread_mutex_lock_full: Assertion `(-(e)) != 3 || !robust' failed.
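
The scenario can be reproduced with an outline like the following
(hypothetical reproducer, reusing shared_mutex_init() and the headers
from the sketch above; run as root so the child can switch its euid):

int main(void)
{
	pthread_mutex_t *m = shared_mutex_init();

	if (fork() == 0) {
		if (seteuid(65534) < 0)	/* e.g. nobody: euid differs from holder */
			abort();
		sleep(1);		/* let the parent take the lock first */
		/* contended -> FUTEX_LOCK_PI -> lookup_pi_state() ->
		 * futex_find_get_task() fails the euid check -> ESRCH,
		 * and the assert above fires (or the process hangs) */
		pthread_mutex_lock(m);
		pthread_mutex_unlock(m);
		_exit(0);
	}

	pthread_mutex_lock(m);	/* parent (euid 0) holds the lock */
	sleep(2);		/* keep it held across the child's attempt */
	pthread_mutex_unlock(m);
	return 0;
}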

The problem is that futex_lock_pi_atomic, which is called when we try
to lock an already held lock, checks the current holder (the tid is
stored in the futex value) to get the PI state.  It uses
lookup_pi_state, which in turn gets the task struct from
futex_find_get_task.  ESRCH is returned either when the task is not
found or when the credentials check fails.
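
For reference, the futex word of a PI futex packs the holder's TID
together with two state bits; the masks below are the ones exported by
the futex ABI header:

#include <linux/futex.h>	/* FUTEX_TID_MASK, FUTEX_WAITERS, FUTEX_OWNER_DIED */
#include <stdint.h>
#include <stdio.h>

/* Decode a PI futex word the way futex_lock_pi_atomic() reads it:
 * the low 30 bits hold the owner's TID, the top two bits are flags. */
static void decode_pi_futex(uint32_t val)
{
	printf("owner tid : %u\n", val & FUTEX_TID_MASK);
	printf("waiters   : %s\n", (val & FUTEX_WAITERS) ? "yes" : "no");
	printf("owner died: %s\n", (val & FUTEX_OWNER_DIED) ? "yes" : "no");
}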

futex_lock_pi_atomic simply returns if it gets ESRCH.  glibc code,
however, doesn't expect a robust lock to return ESRCH, because it
should get either success or owner died (EOWNERDEAD).
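
On Linux, ESRCH is errno 3, which is exactly what the assertion quoted
above encodes: "(-(e)) != 3 || !robust" reads "the error is not ESRCH,
or the mutex is not robust".  A standalone illustration (mine, not
glibc source):

#include <assert.h>
#include <errno.h>
#include <stdio.h>

int main(void)
{
	int e = -ESRCH;	/* what the unpatched kernel hands back to glibc */
	int robust = 1;	/* shared robust PI mutex */

	printf("ESRCH = %d\n", ESRCH);	/* prints 3 on Linux */
	assert((-(e)) != 3 || !robust);	/* fires, matching the glibc failure */
	return 0;
}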

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Darren Hart <dvhltc@us.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Showing 1 changed file with 4 additions and 13 deletions:

1 /* 1 /*
2 * Fast Userspace Mutexes (which I call "Futexes!"). 2 * Fast Userspace Mutexes (which I call "Futexes!").
3 * (C) Rusty Russell, IBM 2002 3 * (C) Rusty Russell, IBM 2002
4 * 4 *
5 * Generalized futexes, futex requeueing, misc fixes by Ingo Molnar 5 * Generalized futexes, futex requeueing, misc fixes by Ingo Molnar
6 * (C) Copyright 2003 Red Hat Inc, All Rights Reserved 6 * (C) Copyright 2003 Red Hat Inc, All Rights Reserved
7 * 7 *
8 * Removed page pinning, fix privately mapped COW pages and other cleanups 8 * Removed page pinning, fix privately mapped COW pages and other cleanups
9 * (C) Copyright 2003, 2004 Jamie Lokier 9 * (C) Copyright 2003, 2004 Jamie Lokier
10 * 10 *
11 * Robust futex support started by Ingo Molnar 11 * Robust futex support started by Ingo Molnar
12 * (C) Copyright 2006 Red Hat Inc, All Rights Reserved 12 * (C) Copyright 2006 Red Hat Inc, All Rights Reserved
13 * Thanks to Thomas Gleixner for suggestions, analysis and fixes. 13 * Thanks to Thomas Gleixner for suggestions, analysis and fixes.
14 * 14 *
15 * PI-futex support started by Ingo Molnar and Thomas Gleixner 15 * PI-futex support started by Ingo Molnar and Thomas Gleixner
16 * Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com> 16 * Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
17 * Copyright (C) 2006 Timesys Corp., Thomas Gleixner <tglx@timesys.com> 17 * Copyright (C) 2006 Timesys Corp., Thomas Gleixner <tglx@timesys.com>
18 * 18 *
19 * PRIVATE futexes by Eric Dumazet 19 * PRIVATE futexes by Eric Dumazet
20 * Copyright (C) 2007 Eric Dumazet <dada1@cosmosbay.com> 20 * Copyright (C) 2007 Eric Dumazet <dada1@cosmosbay.com>
21 * 21 *
22 * Requeue-PI support by Darren Hart <dvhltc@us.ibm.com> 22 * Requeue-PI support by Darren Hart <dvhltc@us.ibm.com>
23 * Copyright (C) IBM Corporation, 2009 23 * Copyright (C) IBM Corporation, 2009
24 * Thanks to Thomas Gleixner for conceptual design and careful reviews. 24 * Thanks to Thomas Gleixner for conceptual design and careful reviews.
25 * 25 *
26 * Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly 26 * Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
27 * enough at me, Linus for the original (flawed) idea, Matthew 27 * enough at me, Linus for the original (flawed) idea, Matthew
28 * Kirkwood for proof-of-concept implementation. 28 * Kirkwood for proof-of-concept implementation.
29 * 29 *
30 * "The futexes are also cursed." 30 * "The futexes are also cursed."
31 * "But they come in a choice of three flavours!" 31 * "But they come in a choice of three flavours!"
32 * 32 *
33 * This program is free software; you can redistribute it and/or modify 33 * This program is free software; you can redistribute it and/or modify
34 * it under the terms of the GNU General Public License as published by 34 * it under the terms of the GNU General Public License as published by
35 * the Free Software Foundation; either version 2 of the License, or 35 * the Free Software Foundation; either version 2 of the License, or
36 * (at your option) any later version. 36 * (at your option) any later version.
37 * 37 *
38 * This program is distributed in the hope that it will be useful, 38 * This program is distributed in the hope that it will be useful,
39 * but WITHOUT ANY WARRANTY; without even the implied warranty of 39 * but WITHOUT ANY WARRANTY; without even the implied warranty of
40 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 40 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
41 * GNU General Public License for more details. 41 * GNU General Public License for more details.
42 * 42 *
43 * You should have received a copy of the GNU General Public License 43 * You should have received a copy of the GNU General Public License
44 * along with this program; if not, write to the Free Software 44 * along with this program; if not, write to the Free Software
45 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA 45 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
46 */ 46 */
47 #include <linux/slab.h> 47 #include <linux/slab.h>
48 #include <linux/poll.h> 48 #include <linux/poll.h>
49 #include <linux/fs.h> 49 #include <linux/fs.h>
50 #include <linux/file.h> 50 #include <linux/file.h>
51 #include <linux/jhash.h> 51 #include <linux/jhash.h>
52 #include <linux/init.h> 52 #include <linux/init.h>
53 #include <linux/futex.h> 53 #include <linux/futex.h>
54 #include <linux/mount.h> 54 #include <linux/mount.h>
55 #include <linux/pagemap.h> 55 #include <linux/pagemap.h>
56 #include <linux/syscalls.h> 56 #include <linux/syscalls.h>
57 #include <linux/signal.h> 57 #include <linux/signal.h>
58 #include <linux/module.h> 58 #include <linux/module.h>
59 #include <linux/magic.h> 59 #include <linux/magic.h>
60 #include <linux/pid.h> 60 #include <linux/pid.h>
61 #include <linux/nsproxy.h> 61 #include <linux/nsproxy.h>
62 62
63 #include <asm/futex.h> 63 #include <asm/futex.h>
64 64
65 #include "rtmutex_common.h" 65 #include "rtmutex_common.h"
66 66
67 int __read_mostly futex_cmpxchg_enabled; 67 int __read_mostly futex_cmpxchg_enabled;
68 68
69 #define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8) 69 #define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8)
70 70
71 /* 71 /*
72 * Priority Inheritance state: 72 * Priority Inheritance state:
73 */ 73 */
74 struct futex_pi_state { 74 struct futex_pi_state {
75 /* 75 /*
76 * list of 'owned' pi_state instances - these have to be 76 * list of 'owned' pi_state instances - these have to be
77 * cleaned up in do_exit() if the task exits prematurely: 77 * cleaned up in do_exit() if the task exits prematurely:
78 */ 78 */
79 struct list_head list; 79 struct list_head list;
80 80
81 /* 81 /*
82 * The PI object: 82 * The PI object:
83 */ 83 */
84 struct rt_mutex pi_mutex; 84 struct rt_mutex pi_mutex;
85 85
86 struct task_struct *owner; 86 struct task_struct *owner;
87 atomic_t refcount; 87 atomic_t refcount;
88 88
89 union futex_key key; 89 union futex_key key;
90 }; 90 };
91 91
92 /** 92 /**
93 * struct futex_q - The hashed futex queue entry, one per waiting task 93 * struct futex_q - The hashed futex queue entry, one per waiting task
94 * @task: the task waiting on the futex 94 * @task: the task waiting on the futex
95 * @lock_ptr: the hash bucket lock 95 * @lock_ptr: the hash bucket lock
96 * @key: the key the futex is hashed on 96 * @key: the key the futex is hashed on
97 * @pi_state: optional priority inheritance state 97 * @pi_state: optional priority inheritance state
98 * @rt_waiter: rt_waiter storage for use with requeue_pi 98 * @rt_waiter: rt_waiter storage for use with requeue_pi
99 * @requeue_pi_key: the requeue_pi target futex key 99 * @requeue_pi_key: the requeue_pi target futex key
100 * @bitset: bitset for the optional bitmasked wakeup 100 * @bitset: bitset for the optional bitmasked wakeup
101 * 101 *
102 * We use this hashed waitqueue, instead of a normal wait_queue_t, so 102 * We use this hashed waitqueue, instead of a normal wait_queue_t, so
103 * we can wake only the relevant ones (hashed queues may be shared). 103 * we can wake only the relevant ones (hashed queues may be shared).
104 * 104 *
105 * A futex_q has a woken state, just like tasks have TASK_RUNNING. 105 * A futex_q has a woken state, just like tasks have TASK_RUNNING.
106 * It is considered woken when plist_node_empty(&q->list) || q->lock_ptr == 0. 106 * It is considered woken when plist_node_empty(&q->list) || q->lock_ptr == 0.
107 * The order of wakup is always to make the first condition true, then 107 * The order of wakup is always to make the first condition true, then
108 * the second. 108 * the second.
109 * 109 *
110 * PI futexes are typically woken before they are removed from the hash list via 110 * PI futexes are typically woken before they are removed from the hash list via
111 * the rt_mutex code. See unqueue_me_pi(). 111 * the rt_mutex code. See unqueue_me_pi().
112 */ 112 */
113 struct futex_q { 113 struct futex_q {
114 struct plist_node list; 114 struct plist_node list;
115 115
116 struct task_struct *task; 116 struct task_struct *task;
117 spinlock_t *lock_ptr; 117 spinlock_t *lock_ptr;
118 union futex_key key; 118 union futex_key key;
119 struct futex_pi_state *pi_state; 119 struct futex_pi_state *pi_state;
120 struct rt_mutex_waiter *rt_waiter; 120 struct rt_mutex_waiter *rt_waiter;
121 union futex_key *requeue_pi_key; 121 union futex_key *requeue_pi_key;
122 u32 bitset; 122 u32 bitset;
123 }; 123 };
124 124
125 /* 125 /*
126 * Hash buckets are shared by all the futex_keys that hash to the same 126 * Hash buckets are shared by all the futex_keys that hash to the same
127 * location. Each key may have multiple futex_q structures, one for each task 127 * location. Each key may have multiple futex_q structures, one for each task
128 * waiting on a futex. 128 * waiting on a futex.
129 */ 129 */
130 struct futex_hash_bucket { 130 struct futex_hash_bucket {
131 spinlock_t lock; 131 spinlock_t lock;
132 struct plist_head chain; 132 struct plist_head chain;
133 }; 133 };
134 134
135 static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS]; 135 static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
136 136
137 /* 137 /*
138 * We hash on the keys returned from get_futex_key (see below). 138 * We hash on the keys returned from get_futex_key (see below).
139 */ 139 */
140 static struct futex_hash_bucket *hash_futex(union futex_key *key) 140 static struct futex_hash_bucket *hash_futex(union futex_key *key)
141 { 141 {
142 u32 hash = jhash2((u32*)&key->both.word, 142 u32 hash = jhash2((u32*)&key->both.word,
143 (sizeof(key->both.word)+sizeof(key->both.ptr))/4, 143 (sizeof(key->both.word)+sizeof(key->both.ptr))/4,
144 key->both.offset); 144 key->both.offset);
145 return &futex_queues[hash & ((1 << FUTEX_HASHBITS)-1)]; 145 return &futex_queues[hash & ((1 << FUTEX_HASHBITS)-1)];
146 } 146 }
147 147
148 /* 148 /*
149 * Return 1 if two futex_keys are equal, 0 otherwise. 149 * Return 1 if two futex_keys are equal, 0 otherwise.
150 */ 150 */
151 static inline int match_futex(union futex_key *key1, union futex_key *key2) 151 static inline int match_futex(union futex_key *key1, union futex_key *key2)
152 { 152 {
153 return (key1 && key2 153 return (key1 && key2
154 && key1->both.word == key2->both.word 154 && key1->both.word == key2->both.word
155 && key1->both.ptr == key2->both.ptr 155 && key1->both.ptr == key2->both.ptr
156 && key1->both.offset == key2->both.offset); 156 && key1->both.offset == key2->both.offset);
157 } 157 }
158 158
159 /* 159 /*
160 * Take a reference to the resource addressed by a key. 160 * Take a reference to the resource addressed by a key.
161 * Can be called while holding spinlocks. 161 * Can be called while holding spinlocks.
162 * 162 *
163 */ 163 */
164 static void get_futex_key_refs(union futex_key *key) 164 static void get_futex_key_refs(union futex_key *key)
165 { 165 {
166 if (!key->both.ptr) 166 if (!key->both.ptr)
167 return; 167 return;
168 168
169 switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) { 169 switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
170 case FUT_OFF_INODE: 170 case FUT_OFF_INODE:
171 atomic_inc(&key->shared.inode->i_count); 171 atomic_inc(&key->shared.inode->i_count);
172 break; 172 break;
173 case FUT_OFF_MMSHARED: 173 case FUT_OFF_MMSHARED:
174 atomic_inc(&key->private.mm->mm_count); 174 atomic_inc(&key->private.mm->mm_count);
175 break; 175 break;
176 } 176 }
177 } 177 }
178 178
179 /* 179 /*
180 * Drop a reference to the resource addressed by a key. 180 * Drop a reference to the resource addressed by a key.
181 * The hash bucket spinlock must not be held. 181 * The hash bucket spinlock must not be held.
182 */ 182 */
183 static void drop_futex_key_refs(union futex_key *key) 183 static void drop_futex_key_refs(union futex_key *key)
184 { 184 {
185 if (!key->both.ptr) { 185 if (!key->both.ptr) {
186 /* If we're here then we tried to put a key we failed to get */ 186 /* If we're here then we tried to put a key we failed to get */
187 WARN_ON_ONCE(1); 187 WARN_ON_ONCE(1);
188 return; 188 return;
189 } 189 }
190 190
191 switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) { 191 switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
192 case FUT_OFF_INODE: 192 case FUT_OFF_INODE:
193 iput(key->shared.inode); 193 iput(key->shared.inode);
194 break; 194 break;
195 case FUT_OFF_MMSHARED: 195 case FUT_OFF_MMSHARED:
196 mmdrop(key->private.mm); 196 mmdrop(key->private.mm);
197 break; 197 break;
198 } 198 }
199 } 199 }
200 200
201 /** 201 /**
202 * get_futex_key() - Get parameters which are the keys for a futex 202 * get_futex_key() - Get parameters which are the keys for a futex
203 * @uaddr: virtual address of the futex 203 * @uaddr: virtual address of the futex
204 * @fshared: 0 for a PROCESS_PRIVATE futex, 1 for PROCESS_SHARED 204 * @fshared: 0 for a PROCESS_PRIVATE futex, 1 for PROCESS_SHARED
205 * @key: address where result is stored. 205 * @key: address where result is stored.
206 * 206 *
207 * Returns a negative error code or 0 207 * Returns a negative error code or 0
208 * The key words are stored in *key on success. 208 * The key words are stored in *key on success.
209 * 209 *
210 * For shared mappings, it's (page->index, vma->vm_file->f_path.dentry->d_inode, 210 * For shared mappings, it's (page->index, vma->vm_file->f_path.dentry->d_inode,
211 * offset_within_page). For private mappings, it's (uaddr, current->mm). 211 * offset_within_page). For private mappings, it's (uaddr, current->mm).
212 * We can usually work out the index without swapping in the page. 212 * We can usually work out the index without swapping in the page.
213 * 213 *
214 * lock_page() might sleep, the caller should not hold a spinlock. 214 * lock_page() might sleep, the caller should not hold a spinlock.
215 */ 215 */
216 static int 216 static int
217 get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key) 217 get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)
218 { 218 {
219 unsigned long address = (unsigned long)uaddr; 219 unsigned long address = (unsigned long)uaddr;
220 struct mm_struct *mm = current->mm; 220 struct mm_struct *mm = current->mm;
221 struct page *page; 221 struct page *page;
222 int err; 222 int err;
223 223
224 /* 224 /*
225 * The futex address must be "naturally" aligned. 225 * The futex address must be "naturally" aligned.
226 */ 226 */
227 key->both.offset = address % PAGE_SIZE; 227 key->both.offset = address % PAGE_SIZE;
228 if (unlikely((address % sizeof(u32)) != 0)) 228 if (unlikely((address % sizeof(u32)) != 0))
229 return -EINVAL; 229 return -EINVAL;
230 address -= key->both.offset; 230 address -= key->both.offset;
231 231
232 /* 232 /*
233 * PROCESS_PRIVATE futexes are fast. 233 * PROCESS_PRIVATE futexes are fast.
234 * As the mm cannot disappear under us and the 'key' only needs 234 * As the mm cannot disappear under us and the 'key' only needs
235 * virtual address, we dont even have to find the underlying vma. 235 * virtual address, we dont even have to find the underlying vma.
236 * Note : We do have to check 'uaddr' is a valid user address, 236 * Note : We do have to check 'uaddr' is a valid user address,
237 * but access_ok() should be faster than find_vma() 237 * but access_ok() should be faster than find_vma()
238 */ 238 */
239 if (!fshared) { 239 if (!fshared) {
240 if (unlikely(!access_ok(VERIFY_WRITE, uaddr, sizeof(u32)))) 240 if (unlikely(!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))))
241 return -EFAULT; 241 return -EFAULT;
242 key->private.mm = mm; 242 key->private.mm = mm;
243 key->private.address = address; 243 key->private.address = address;
244 get_futex_key_refs(key); 244 get_futex_key_refs(key);
245 return 0; 245 return 0;
246 } 246 }
247 247
248 again: 248 again:
249 err = get_user_pages_fast(address, 1, 1, &page); 249 err = get_user_pages_fast(address, 1, 1, &page);
250 if (err < 0) 250 if (err < 0)
251 return err; 251 return err;
252 252
253 page = compound_head(page); 253 page = compound_head(page);
254 lock_page(page); 254 lock_page(page);
255 if (!page->mapping) { 255 if (!page->mapping) {
256 unlock_page(page); 256 unlock_page(page);
257 put_page(page); 257 put_page(page);
258 goto again; 258 goto again;
259 } 259 }
260 260
261 /* 261 /*
262 * Private mappings are handled in a simple way. 262 * Private mappings are handled in a simple way.
263 * 263 *
264 * NOTE: When userspace waits on a MAP_SHARED mapping, even if 264 * NOTE: When userspace waits on a MAP_SHARED mapping, even if
265 * it's a read-only handle, it's expected that futexes attach to 265 * it's a read-only handle, it's expected that futexes attach to
266 * the object not the particular process. 266 * the object not the particular process.
267 */ 267 */
268 if (PageAnon(page)) { 268 if (PageAnon(page)) {
269 key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */ 269 key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
270 key->private.mm = mm; 270 key->private.mm = mm;
271 key->private.address = address; 271 key->private.address = address;
272 } else { 272 } else {
273 key->both.offset |= FUT_OFF_INODE; /* inode-based key */ 273 key->both.offset |= FUT_OFF_INODE; /* inode-based key */
274 key->shared.inode = page->mapping->host; 274 key->shared.inode = page->mapping->host;
275 key->shared.pgoff = page->index; 275 key->shared.pgoff = page->index;
276 } 276 }
277 277
278 get_futex_key_refs(key); 278 get_futex_key_refs(key);
279 279
280 unlock_page(page); 280 unlock_page(page);
281 put_page(page); 281 put_page(page);
282 return 0; 282 return 0;
283 } 283 }
284 284
285 static inline 285 static inline
286 void put_futex_key(int fshared, union futex_key *key) 286 void put_futex_key(int fshared, union futex_key *key)
287 { 287 {
288 drop_futex_key_refs(key); 288 drop_futex_key_refs(key);
289 } 289 }
290 290
291 /** 291 /**
292 * fault_in_user_writeable() - Fault in user address and verify RW access 292 * fault_in_user_writeable() - Fault in user address and verify RW access
293 * @uaddr: pointer to faulting user space address 293 * @uaddr: pointer to faulting user space address
294 * 294 *
295 * Slow path to fixup the fault we just took in the atomic write 295 * Slow path to fixup the fault we just took in the atomic write
296 * access to @uaddr. 296 * access to @uaddr.
297 * 297 *
298 * We have no generic implementation of a non destructive write to the 298 * We have no generic implementation of a non destructive write to the
299 * user address. We know that we faulted in the atomic pagefault 299 * user address. We know that we faulted in the atomic pagefault
300 * disabled section so we can as well avoid the #PF overhead by 300 * disabled section so we can as well avoid the #PF overhead by
301 * calling get_user_pages() right away. 301 * calling get_user_pages() right away.
302 */ 302 */
303 static int fault_in_user_writeable(u32 __user *uaddr) 303 static int fault_in_user_writeable(u32 __user *uaddr)
304 { 304 {
305 struct mm_struct *mm = current->mm; 305 struct mm_struct *mm = current->mm;
306 int ret; 306 int ret;
307 307
308 down_read(&mm->mmap_sem); 308 down_read(&mm->mmap_sem);
309 ret = get_user_pages(current, mm, (unsigned long)uaddr, 309 ret = get_user_pages(current, mm, (unsigned long)uaddr,
310 1, 1, 0, NULL, NULL); 310 1, 1, 0, NULL, NULL);
311 up_read(&mm->mmap_sem); 311 up_read(&mm->mmap_sem);
312 312
313 return ret < 0 ? ret : 0; 313 return ret < 0 ? ret : 0;
314 } 314 }
315 315
316 /** 316 /**
317 * futex_top_waiter() - Return the highest priority waiter on a futex 317 * futex_top_waiter() - Return the highest priority waiter on a futex
318 * @hb: the hash bucket the futex_q's reside in 318 * @hb: the hash bucket the futex_q's reside in
319 * @key: the futex key (to distinguish it from other futex futex_q's) 319 * @key: the futex key (to distinguish it from other futex futex_q's)
320 * 320 *
321 * Must be called with the hb lock held. 321 * Must be called with the hb lock held.
322 */ 322 */
323 static struct futex_q *futex_top_waiter(struct futex_hash_bucket *hb, 323 static struct futex_q *futex_top_waiter(struct futex_hash_bucket *hb,
324 union futex_key *key) 324 union futex_key *key)
325 { 325 {
326 struct futex_q *this; 326 struct futex_q *this;
327 327
328 plist_for_each_entry(this, &hb->chain, list) { 328 plist_for_each_entry(this, &hb->chain, list) {
329 if (match_futex(&this->key, key)) 329 if (match_futex(&this->key, key))
330 return this; 330 return this;
331 } 331 }
332 return NULL; 332 return NULL;
333 } 333 }
334 334
335 static u32 cmpxchg_futex_value_locked(u32 __user *uaddr, u32 uval, u32 newval) 335 static u32 cmpxchg_futex_value_locked(u32 __user *uaddr, u32 uval, u32 newval)
336 { 336 {
337 u32 curval; 337 u32 curval;
338 338
339 pagefault_disable(); 339 pagefault_disable();
340 curval = futex_atomic_cmpxchg_inatomic(uaddr, uval, newval); 340 curval = futex_atomic_cmpxchg_inatomic(uaddr, uval, newval);
341 pagefault_enable(); 341 pagefault_enable();
342 342
343 return curval; 343 return curval;
344 } 344 }
345 345
346 static int get_futex_value_locked(u32 *dest, u32 __user *from) 346 static int get_futex_value_locked(u32 *dest, u32 __user *from)
347 { 347 {
348 int ret; 348 int ret;
349 349
350 pagefault_disable(); 350 pagefault_disable();
351 ret = __copy_from_user_inatomic(dest, from, sizeof(u32)); 351 ret = __copy_from_user_inatomic(dest, from, sizeof(u32));
352 pagefault_enable(); 352 pagefault_enable();
353 353
354 return ret ? -EFAULT : 0; 354 return ret ? -EFAULT : 0;
355 } 355 }
356 356
357 357
358 /* 358 /*
359 * PI code: 359 * PI code:
360 */ 360 */
361 static int refill_pi_state_cache(void) 361 static int refill_pi_state_cache(void)
362 { 362 {
363 struct futex_pi_state *pi_state; 363 struct futex_pi_state *pi_state;
364 364
365 if (likely(current->pi_state_cache)) 365 if (likely(current->pi_state_cache))
366 return 0; 366 return 0;
367 367
368 pi_state = kzalloc(sizeof(*pi_state), GFP_KERNEL); 368 pi_state = kzalloc(sizeof(*pi_state), GFP_KERNEL);
369 369
370 if (!pi_state) 370 if (!pi_state)
371 return -ENOMEM; 371 return -ENOMEM;
372 372
373 INIT_LIST_HEAD(&pi_state->list); 373 INIT_LIST_HEAD(&pi_state->list);
374 /* pi_mutex gets initialized later */ 374 /* pi_mutex gets initialized later */
375 pi_state->owner = NULL; 375 pi_state->owner = NULL;
376 atomic_set(&pi_state->refcount, 1); 376 atomic_set(&pi_state->refcount, 1);
377 pi_state->key = FUTEX_KEY_INIT; 377 pi_state->key = FUTEX_KEY_INIT;
378 378
379 current->pi_state_cache = pi_state; 379 current->pi_state_cache = pi_state;
380 380
381 return 0; 381 return 0;
382 } 382 }
383 383
384 static struct futex_pi_state * alloc_pi_state(void) 384 static struct futex_pi_state * alloc_pi_state(void)
385 { 385 {
386 struct futex_pi_state *pi_state = current->pi_state_cache; 386 struct futex_pi_state *pi_state = current->pi_state_cache;
387 387
388 WARN_ON(!pi_state); 388 WARN_ON(!pi_state);
389 current->pi_state_cache = NULL; 389 current->pi_state_cache = NULL;
390 390
391 return pi_state; 391 return pi_state;
392 } 392 }
393 393
394 static void free_pi_state(struct futex_pi_state *pi_state) 394 static void free_pi_state(struct futex_pi_state *pi_state)
395 { 395 {
396 if (!atomic_dec_and_test(&pi_state->refcount)) 396 if (!atomic_dec_and_test(&pi_state->refcount))
397 return; 397 return;
398 398
399 /* 399 /*
400 * If pi_state->owner is NULL, the owner is most probably dying 400 * If pi_state->owner is NULL, the owner is most probably dying
401 * and has cleaned up the pi_state already 401 * and has cleaned up the pi_state already
402 */ 402 */
403 if (pi_state->owner) { 403 if (pi_state->owner) {
404 raw_spin_lock_irq(&pi_state->owner->pi_lock); 404 raw_spin_lock_irq(&pi_state->owner->pi_lock);
405 list_del_init(&pi_state->list); 405 list_del_init(&pi_state->list);
406 raw_spin_unlock_irq(&pi_state->owner->pi_lock); 406 raw_spin_unlock_irq(&pi_state->owner->pi_lock);
407 407
408 rt_mutex_proxy_unlock(&pi_state->pi_mutex, pi_state->owner); 408 rt_mutex_proxy_unlock(&pi_state->pi_mutex, pi_state->owner);
409 } 409 }
410 410
411 if (current->pi_state_cache) 411 if (current->pi_state_cache)
412 kfree(pi_state); 412 kfree(pi_state);
413 else { 413 else {
414 /* 414 /*
415 * pi_state->list is already empty. 415 * pi_state->list is already empty.
416 * clear pi_state->owner. 416 * clear pi_state->owner.
417 * refcount is at 0 - put it back to 1. 417 * refcount is at 0 - put it back to 1.
418 */ 418 */
419 pi_state->owner = NULL; 419 pi_state->owner = NULL;
420 atomic_set(&pi_state->refcount, 1); 420 atomic_set(&pi_state->refcount, 1);
421 current->pi_state_cache = pi_state; 421 current->pi_state_cache = pi_state;
422 } 422 }
423 } 423 }
424 424
425 /* 425 /*
426 * Look up the task based on what TID userspace gave us. 426 * Look up the task based on what TID userspace gave us.
427 * We dont trust it. 427 * We dont trust it.
428 */ 428 */
429 static struct task_struct * futex_find_get_task(pid_t pid) 429 static struct task_struct * futex_find_get_task(pid_t pid)
430 { 430 {
431 struct task_struct *p; 431 struct task_struct *p;
432 const struct cred *cred = current_cred(), *pcred;
433 432
434 rcu_read_lock(); 433 rcu_read_lock();
435 p = find_task_by_vpid(pid); 434 p = find_task_by_vpid(pid);
436 if (!p) { 435 if (p)
437 p = ERR_PTR(-ESRCH); 436 get_task_struct(p);
438 } else {
439 pcred = __task_cred(p);
440 if (cred->euid != pcred->euid &&
441 cred->euid != pcred->uid)
442 p = ERR_PTR(-ESRCH);
443 else
444 get_task_struct(p);
445 }
446 437
447 rcu_read_unlock(); 438 rcu_read_unlock();
448 439
449 return p; 440 return p;
450 } 441 }
451 442
452 /* 443 /*
453 * This task is holding PI mutexes at exit time => bad. 444 * This task is holding PI mutexes at exit time => bad.
454 * Kernel cleans up PI-state, but userspace is likely hosed. 445 * Kernel cleans up PI-state, but userspace is likely hosed.
455 * (Robust-futex cleanup is separate and might save the day for userspace.) 446 * (Robust-futex cleanup is separate and might save the day for userspace.)
456 */ 447 */
457 void exit_pi_state_list(struct task_struct *curr) 448 void exit_pi_state_list(struct task_struct *curr)
458 { 449 {
459 struct list_head *next, *head = &curr->pi_state_list; 450 struct list_head *next, *head = &curr->pi_state_list;
460 struct futex_pi_state *pi_state; 451 struct futex_pi_state *pi_state;
461 struct futex_hash_bucket *hb; 452 struct futex_hash_bucket *hb;
462 union futex_key key = FUTEX_KEY_INIT; 453 union futex_key key = FUTEX_KEY_INIT;
463 454
464 if (!futex_cmpxchg_enabled) 455 if (!futex_cmpxchg_enabled)
465 return; 456 return;
466 /* 457 /*
467 * We are a ZOMBIE and nobody can enqueue itself on 458 * We are a ZOMBIE and nobody can enqueue itself on
468 * pi_state_list anymore, but we have to be careful 459 * pi_state_list anymore, but we have to be careful
469 * versus waiters unqueueing themselves: 460 * versus waiters unqueueing themselves:
470 */ 461 */
471 raw_spin_lock_irq(&curr->pi_lock); 462 raw_spin_lock_irq(&curr->pi_lock);
472 while (!list_empty(head)) { 463 while (!list_empty(head)) {
473 464
474 next = head->next; 465 next = head->next;
475 pi_state = list_entry(next, struct futex_pi_state, list); 466 pi_state = list_entry(next, struct futex_pi_state, list);
476 key = pi_state->key; 467 key = pi_state->key;
477 hb = hash_futex(&key); 468 hb = hash_futex(&key);
478 raw_spin_unlock_irq(&curr->pi_lock); 469 raw_spin_unlock_irq(&curr->pi_lock);
479 470
480 spin_lock(&hb->lock); 471 spin_lock(&hb->lock);
481 472
482 raw_spin_lock_irq(&curr->pi_lock); 473 raw_spin_lock_irq(&curr->pi_lock);
483 /* 474 /*
484 * We dropped the pi-lock, so re-check whether this 475 * We dropped the pi-lock, so re-check whether this
485 * task still owns the PI-state: 476 * task still owns the PI-state:
486 */ 477 */
487 if (head->next != next) { 478 if (head->next != next) {
488 spin_unlock(&hb->lock); 479 spin_unlock(&hb->lock);
489 continue; 480 continue;
490 } 481 }
491 482
492 WARN_ON(pi_state->owner != curr); 483 WARN_ON(pi_state->owner != curr);
493 WARN_ON(list_empty(&pi_state->list)); 484 WARN_ON(list_empty(&pi_state->list));
494 list_del_init(&pi_state->list); 485 list_del_init(&pi_state->list);
495 pi_state->owner = NULL; 486 pi_state->owner = NULL;
496 raw_spin_unlock_irq(&curr->pi_lock); 487 raw_spin_unlock_irq(&curr->pi_lock);
497 488
498 rt_mutex_unlock(&pi_state->pi_mutex); 489 rt_mutex_unlock(&pi_state->pi_mutex);
499 490
500 spin_unlock(&hb->lock); 491 spin_unlock(&hb->lock);
501 492
502 raw_spin_lock_irq(&curr->pi_lock); 493 raw_spin_lock_irq(&curr->pi_lock);
503 } 494 }
504 raw_spin_unlock_irq(&curr->pi_lock); 495 raw_spin_unlock_irq(&curr->pi_lock);
505 } 496 }
506 497
507 static int 498 static int
508 lookup_pi_state(u32 uval, struct futex_hash_bucket *hb, 499 lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
509 union futex_key *key, struct futex_pi_state **ps) 500 union futex_key *key, struct futex_pi_state **ps)
510 { 501 {
511 struct futex_pi_state *pi_state = NULL; 502 struct futex_pi_state *pi_state = NULL;
512 struct futex_q *this, *next; 503 struct futex_q *this, *next;
513 struct plist_head *head; 504 struct plist_head *head;
514 struct task_struct *p; 505 struct task_struct *p;
515 pid_t pid = uval & FUTEX_TID_MASK; 506 pid_t pid = uval & FUTEX_TID_MASK;
516 507
517 head = &hb->chain; 508 head = &hb->chain;
518 509
519 plist_for_each_entry_safe(this, next, head, list) { 510 plist_for_each_entry_safe(this, next, head, list) {
520 if (match_futex(&this->key, key)) { 511 if (match_futex(&this->key, key)) {
521 /* 512 /*
522 * Another waiter already exists - bump up 513 * Another waiter already exists - bump up
523 * the refcount and return its pi_state: 514 * the refcount and return its pi_state:
524 */ 515 */
525 pi_state = this->pi_state; 516 pi_state = this->pi_state;
526 /* 517 /*
527 * Userspace might have messed up non PI and PI futexes 518 * Userspace might have messed up non PI and PI futexes
528 */ 519 */
529 if (unlikely(!pi_state)) 520 if (unlikely(!pi_state))
530 return -EINVAL; 521 return -EINVAL;
531 522
532 WARN_ON(!atomic_read(&pi_state->refcount)); 523 WARN_ON(!atomic_read(&pi_state->refcount));
533 524
534 /* 525 /*
535 * When pi_state->owner is NULL then the owner died 526 * When pi_state->owner is NULL then the owner died
536 * and another waiter is on the fly. pi_state->owner 527 * and another waiter is on the fly. pi_state->owner
537 * is fixed up by the task which acquires 528 * is fixed up by the task which acquires
538 * pi_state->rt_mutex. 529 * pi_state->rt_mutex.
539 * 530 *
540 * We do not check for pid == 0 which can happen when 531 * We do not check for pid == 0 which can happen when
541 * the owner died and robust_list_exit() cleared the 532 * the owner died and robust_list_exit() cleared the
542 * TID. 533 * TID.
543 */ 534 */
544 if (pid && pi_state->owner) { 535 if (pid && pi_state->owner) {
545 /* 536 /*
546 * Bail out if user space manipulated the 537 * Bail out if user space manipulated the
547 * futex value. 538 * futex value.
548 */ 539 */
549 if (pid != task_pid_vnr(pi_state->owner)) 540 if (pid != task_pid_vnr(pi_state->owner))
550 return -EINVAL; 541 return -EINVAL;
551 } 542 }
552 543
553 atomic_inc(&pi_state->refcount); 544 atomic_inc(&pi_state->refcount);
554 *ps = pi_state; 545 *ps = pi_state;
555 546
556 return 0; 547 return 0;
557 } 548 }
558 } 549 }
559 550
560 /* 551 /*
561 * We are the first waiter - try to look up the real owner and attach 552 * We are the first waiter - try to look up the real owner and attach
562 * the new pi_state to it, but bail out when TID = 0 553 * the new pi_state to it, but bail out when TID = 0
563 */ 554 */
564 if (!pid) 555 if (!pid)
565 return -ESRCH; 556 return -ESRCH;
566 p = futex_find_get_task(pid); 557 p = futex_find_get_task(pid);
567 if (IS_ERR(p)) 558 if (!p)
568 return PTR_ERR(p); 559 return -ESRCH;
569 560
570 /* 561 /*
571 * We need to look at the task state flags to figure out, 562 * We need to look at the task state flags to figure out,
572 * whether the task is exiting. To protect against the do_exit 563 * whether the task is exiting. To protect against the do_exit
573 * change of the task flags, we do this protected by 564 * change of the task flags, we do this protected by
574 * p->pi_lock: 565 * p->pi_lock:
575 */ 566 */
576 raw_spin_lock_irq(&p->pi_lock); 567 raw_spin_lock_irq(&p->pi_lock);
577 if (unlikely(p->flags & PF_EXITING)) { 568 if (unlikely(p->flags & PF_EXITING)) {
578 /* 569 /*
579 * The task is on the way out. When PF_EXITPIDONE is 570 * The task is on the way out. When PF_EXITPIDONE is
580 * set, we know that the task has finished the 571 * set, we know that the task has finished the
581 * cleanup: 572 * cleanup:
582 */ 573 */
583 int ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN; 574 int ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN;
584 575
585 raw_spin_unlock_irq(&p->pi_lock); 576 raw_spin_unlock_irq(&p->pi_lock);
586 put_task_struct(p); 577 put_task_struct(p);
587 return ret; 578 return ret;
588 } 579 }
589 580
590 pi_state = alloc_pi_state(); 581 pi_state = alloc_pi_state();
591 582
592 /* 583 /*
593 * Initialize the pi_mutex in locked state and make 'p' 584 * Initialize the pi_mutex in locked state and make 'p'
594 * the owner of it: 585 * the owner of it:
595 */ 586 */
596 rt_mutex_init_proxy_locked(&pi_state->pi_mutex, p); 587 rt_mutex_init_proxy_locked(&pi_state->pi_mutex, p);
597 588
598 /* Store the key for possible exit cleanups: */ 589 /* Store the key for possible exit cleanups: */
599 pi_state->key = *key; 590 pi_state->key = *key;
600 591
601 WARN_ON(!list_empty(&pi_state->list)); 592 WARN_ON(!list_empty(&pi_state->list));
602 list_add(&pi_state->list, &p->pi_state_list); 593 list_add(&pi_state->list, &p->pi_state_list);
603 pi_state->owner = p; 594 pi_state->owner = p;
604 raw_spin_unlock_irq(&p->pi_lock); 595 raw_spin_unlock_irq(&p->pi_lock);
605 596
606 put_task_struct(p); 597 put_task_struct(p);
607 598
608 *ps = pi_state; 599 *ps = pi_state;
609 600
610 return 0; 601 return 0;
611 } 602 }
612 603
613 /** 604 /**
614 * futex_lock_pi_atomic() - Atomic work required to acquire a pi aware futex 605 * futex_lock_pi_atomic() - Atomic work required to acquire a pi aware futex
615 * @uaddr: the pi futex user address 606 * @uaddr: the pi futex user address
616 * @hb: the pi futex hash bucket 607 * @hb: the pi futex hash bucket
617 * @key: the futex key associated with uaddr and hb 608 * @key: the futex key associated with uaddr and hb
618 * @ps: the pi_state pointer where we store the result of the 609 * @ps: the pi_state pointer where we store the result of the
619 * lookup 610 * lookup
620 * @task: the task to perform the atomic lock work for. This will 611 * @task: the task to perform the atomic lock work for. This will
621 * be "current" except in the case of requeue pi. 612 * be "current" except in the case of requeue pi.
622 * @set_waiters: force setting the FUTEX_WAITERS bit (1) or not (0) 613 * @set_waiters: force setting the FUTEX_WAITERS bit (1) or not (0)
623 * 614 *
624 * Returns: 615 * Returns:
625 * 0 - ready to wait 616 * 0 - ready to wait
626 * 1 - acquired the lock 617 * 1 - acquired the lock
627 * <0 - error 618 * <0 - error
628 * 619 *
629 * The hb->lock and futex_key refs shall be held by the caller. 620 * The hb->lock and futex_key refs shall be held by the caller.
630 */ 621 */
631 static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb, 622 static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
632 union futex_key *key, 623 union futex_key *key,
633 struct futex_pi_state **ps, 624 struct futex_pi_state **ps,
634 struct task_struct *task, int set_waiters) 625 struct task_struct *task, int set_waiters)
635 { 626 {
636 int lock_taken, ret, ownerdied = 0; 627 int lock_taken, ret, ownerdied = 0;
637 u32 uval, newval, curval; 628 u32 uval, newval, curval;
638 629
639 retry: 630 retry:
640 ret = lock_taken = 0; 631 ret = lock_taken = 0;
641 632
642 /* 633 /*
643 * To avoid races, we attempt to take the lock here again 634 * To avoid races, we attempt to take the lock here again
644 * (by doing a 0 -> TID atomic cmpxchg), while holding all 635 * (by doing a 0 -> TID atomic cmpxchg), while holding all
645 * the locks. It will most likely not succeed. 636 * the locks. It will most likely not succeed.
646 */ 637 */
647 newval = task_pid_vnr(task); 638 newval = task_pid_vnr(task);
648 if (set_waiters) 639 if (set_waiters)
649 newval |= FUTEX_WAITERS; 640 newval |= FUTEX_WAITERS;
650 641
651 curval = cmpxchg_futex_value_locked(uaddr, 0, newval); 642 curval = cmpxchg_futex_value_locked(uaddr, 0, newval);
652 643
653 if (unlikely(curval == -EFAULT)) 644 if (unlikely(curval == -EFAULT))
654 return -EFAULT; 645 return -EFAULT;
655 646
656 /* 647 /*
657 * Detect deadlocks. 648 * Detect deadlocks.
658 */ 649 */
659 if ((unlikely((curval & FUTEX_TID_MASK) == task_pid_vnr(task)))) 650 if ((unlikely((curval & FUTEX_TID_MASK) == task_pid_vnr(task))))
660 return -EDEADLK; 651 return -EDEADLK;
661 652
662 /* 653 /*
663 * Surprise - we got the lock. Just return to userspace: 654 * Surprise - we got the lock. Just return to userspace:
664 */ 655 */
665 if (unlikely(!curval)) 656 if (unlikely(!curval))
666 return 1; 657 return 1;
667 658
668 uval = curval; 659 uval = curval;
669 660
670 /* 661 /*
671 * Set the FUTEX_WAITERS flag, so the owner will know it has someone 662 * Set the FUTEX_WAITERS flag, so the owner will know it has someone
672 * to wake at the next unlock. 663 * to wake at the next unlock.
673 */ 664 */
674 newval = curval | FUTEX_WAITERS; 665 newval = curval | FUTEX_WAITERS;
675 666
676 /* 667 /*
677 * There are two cases, where a futex might have no owner (the 668 * There are two cases, where a futex might have no owner (the
678 * owner TID is 0): OWNER_DIED. We take over the futex in this 669 * owner TID is 0): OWNER_DIED. We take over the futex in this
679 * case. We also do an unconditional take over, when the owner 670 * case. We also do an unconditional take over, when the owner
680 * of the futex died. 671 * of the futex died.
681 * 672 *
682 * This is safe as we are protected by the hash bucket lock ! 673 * This is safe as we are protected by the hash bucket lock !
683 */ 674 */
684 if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) { 675 if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {
685 /* Keep the OWNER_DIED bit */ 676 /* Keep the OWNER_DIED bit */
686 newval = (curval & ~FUTEX_TID_MASK) | task_pid_vnr(task); 677 newval = (curval & ~FUTEX_TID_MASK) | task_pid_vnr(task);
687 ownerdied = 0; 678 ownerdied = 0;
688 lock_taken = 1; 679 lock_taken = 1;
689 } 680 }
690 681
691 curval = cmpxchg_futex_value_locked(uaddr, uval, newval); 682 curval = cmpxchg_futex_value_locked(uaddr, uval, newval);
692 683
693 if (unlikely(curval == -EFAULT)) 684 if (unlikely(curval == -EFAULT))
694 return -EFAULT; 685 return -EFAULT;
695 if (unlikely(curval != uval)) 686 if (unlikely(curval != uval))
696 goto retry; 687 goto retry;
697 688
698 /* 689 /*
699 * We took the lock due to owner died take over. 690 * We took the lock due to owner died take over.
700 */ 691 */
701 if (unlikely(lock_taken)) 692 if (unlikely(lock_taken))
702 return 1; 693 return 1;
703 694
704 /* 695 /*
705 * We dont have the lock. Look up the PI state (or create it if 696 * We dont have the lock. Look up the PI state (or create it if
706 * we are the first waiter): 697 * we are the first waiter):
707 */ 698 */
708 ret = lookup_pi_state(uval, hb, key, ps); 699 ret = lookup_pi_state(uval, hb, key, ps);
709 700
710 if (unlikely(ret)) { 701 if (unlikely(ret)) {
711 switch (ret) { 702 switch (ret) {
712 case -ESRCH: 703 case -ESRCH:
713 /* 704 /*
714 * No owner found for this futex. Check if the 705 * No owner found for this futex. Check if the
715 * OWNER_DIED bit is set to figure out whether 706 * OWNER_DIED bit is set to figure out whether
716 * this is a robust futex or not. 707 * this is a robust futex or not.
717 */ 708 */
718 if (get_futex_value_locked(&curval, uaddr)) 709 if (get_futex_value_locked(&curval, uaddr))
719 return -EFAULT; 710 return -EFAULT;
720 711
721 /* 712 /*
722 * We simply start over in case of a robust 713 * We simply start over in case of a robust
723 * futex. The code above will take the futex 714 * futex. The code above will take the futex
724 * and return happy. 715 * and return happy.
725 */ 716 */
726 if (curval & FUTEX_OWNER_DIED) { 717 if (curval & FUTEX_OWNER_DIED) {
727 ownerdied = 1; 718 ownerdied = 1;
728 goto retry; 719 goto retry;
729 } 720 }
730 default: 721 default:
731 break; 722 break;
732 } 723 }
733 } 724 }
734 725
735 return ret; 726 return ret;
736 } 727 }
737 728
738 /* 729 /*
739 * The hash bucket lock must be held when this is called. 730 * The hash bucket lock must be held when this is called.
740 * Afterwards, the futex_q must not be accessed. 731 * Afterwards, the futex_q must not be accessed.
741 */ 732 */
742 static void wake_futex(struct futex_q *q) 733 static void wake_futex(struct futex_q *q)
743 { 734 {
744 struct task_struct *p = q->task; 735 struct task_struct *p = q->task;
745 736
746 /* 737 /*
747 * We set q->lock_ptr = NULL _before_ we wake up the task. If 738 * We set q->lock_ptr = NULL _before_ we wake up the task. If
748 * a non futex wake up happens on another CPU then the task 739 * a non futex wake up happens on another CPU then the task
749 * might exit and p would dereference a non existing task 740 * might exit and p would dereference a non existing task
750 * struct. Prevent this by holding a reference on p across the 741 * struct. Prevent this by holding a reference on p across the
751 * wake up. 742 * wake up.
752 */ 743 */
753 get_task_struct(p); 744 get_task_struct(p);
754 745
755 plist_del(&q->list, &q->list.plist); 746 plist_del(&q->list, &q->list.plist);
756 /* 747 /*
757 * The waiting task can free the futex_q as soon as 748 * The waiting task can free the futex_q as soon as
758 * q->lock_ptr = NULL is written, without taking any locks. A 749 * q->lock_ptr = NULL is written, without taking any locks. A
759 * memory barrier is required here to prevent the following 750 * memory barrier is required here to prevent the following
760 * store to lock_ptr from getting ahead of the plist_del. 751 * store to lock_ptr from getting ahead of the plist_del.
761 */ 752 */
762 smp_wmb(); 753 smp_wmb();
763 q->lock_ptr = NULL; 754 q->lock_ptr = NULL;
764 755
765 wake_up_state(p, TASK_NORMAL); 756 wake_up_state(p, TASK_NORMAL);
766 put_task_struct(p); 757 put_task_struct(p);
767 } 758 }
768 759
769 static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this) 760 static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this)
770 { 761 {
771 struct task_struct *new_owner; 762 struct task_struct *new_owner;
772 struct futex_pi_state *pi_state = this->pi_state; 763 struct futex_pi_state *pi_state = this->pi_state;
773 u32 curval, newval; 764 u32 curval, newval;
774 765
775 if (!pi_state) 766 if (!pi_state)
776 return -EINVAL; 767 return -EINVAL;
777 768
778 /* 769 /*
779 * If current does not own the pi_state then the futex is 770 * If current does not own the pi_state then the futex is
780 * inconsistent and user space fiddled with the futex value. 771 * inconsistent and user space fiddled with the futex value.
781 */ 772 */
782 if (pi_state->owner != current) 773 if (pi_state->owner != current)
783 return -EINVAL; 774 return -EINVAL;
784 775
785 raw_spin_lock(&pi_state->pi_mutex.wait_lock); 776 raw_spin_lock(&pi_state->pi_mutex.wait_lock);
786 new_owner = rt_mutex_next_owner(&pi_state->pi_mutex); 777 new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
787 778
788 /* 779 /*
789 * This happens when we have stolen the lock and the original 780 * This happens when we have stolen the lock and the original
790 * pending owner did not enqueue itself back on the rt_mutex. 781 * pending owner did not enqueue itself back on the rt_mutex.
791 * Thats not a tragedy. We know that way, that a lock waiter 782 * Thats not a tragedy. We know that way, that a lock waiter
792 * is on the fly. We make the futex_q waiter the pending owner. 783 * is on the fly. We make the futex_q waiter the pending owner.
793 */ 784 */
794 if (!new_owner) 785 if (!new_owner)
795 new_owner = this->task; 786 new_owner = this->task;
796 787
797 /* 788 /*
798 * We pass it to the next owner. (The WAITERS bit is always 789 * We pass it to the next owner. (The WAITERS bit is always
799 * kept enabled while there is PI state around. We must also 790 * kept enabled while there is PI state around. We must also
800 * preserve the owner died bit.) 791 * preserve the owner died bit.)
801 */ 792 */
802 if (!(uval & FUTEX_OWNER_DIED)) { 793 if (!(uval & FUTEX_OWNER_DIED)) {
803 int ret = 0; 794 int ret = 0;
804 795
805 newval = FUTEX_WAITERS | task_pid_vnr(new_owner); 796 newval = FUTEX_WAITERS | task_pid_vnr(new_owner);
806 797
807 curval = cmpxchg_futex_value_locked(uaddr, uval, newval); 798 curval = cmpxchg_futex_value_locked(uaddr, uval, newval);
808 799
809 if (curval == -EFAULT) 800 if (curval == -EFAULT)
810 ret = -EFAULT; 801 ret = -EFAULT;
811 else if (curval != uval) 802 else if (curval != uval)
812 ret = -EINVAL; 803 ret = -EINVAL;
813 if (ret) { 804 if (ret) {
814 raw_spin_unlock(&pi_state->pi_mutex.wait_lock); 805 raw_spin_unlock(&pi_state->pi_mutex.wait_lock);
815 return ret; 806 return ret;
816 } 807 }
817 } 808 }
818 809
819 raw_spin_lock_irq(&pi_state->owner->pi_lock); 810 raw_spin_lock_irq(&pi_state->owner->pi_lock);
820 WARN_ON(list_empty(&pi_state->list)); 811 WARN_ON(list_empty(&pi_state->list));
821 list_del_init(&pi_state->list); 812 list_del_init(&pi_state->list);
822 raw_spin_unlock_irq(&pi_state->owner->pi_lock); 813 raw_spin_unlock_irq(&pi_state->owner->pi_lock);
823 814
824 raw_spin_lock_irq(&new_owner->pi_lock); 815 raw_spin_lock_irq(&new_owner->pi_lock);
825 WARN_ON(!list_empty(&pi_state->list)); 816 WARN_ON(!list_empty(&pi_state->list));
826 list_add(&pi_state->list, &new_owner->pi_state_list); 817 list_add(&pi_state->list, &new_owner->pi_state_list);
827 pi_state->owner = new_owner; 818 pi_state->owner = new_owner;
828 raw_spin_unlock_irq(&new_owner->pi_lock); 819 raw_spin_unlock_irq(&new_owner->pi_lock);
829 820
830 raw_spin_unlock(&pi_state->pi_mutex.wait_lock); 821 raw_spin_unlock(&pi_state->pi_mutex.wait_lock);
831 rt_mutex_unlock(&pi_state->pi_mutex); 822 rt_mutex_unlock(&pi_state->pi_mutex);
832 823
833 return 0; 824 return 0;
834 } 825 }
835 826
836 static int unlock_futex_pi(u32 __user *uaddr, u32 uval) 827 static int unlock_futex_pi(u32 __user *uaddr, u32 uval)
837 { 828 {
838 u32 oldval; 829 u32 oldval;
839 830
840 /* 831 /*
841 * There is no waiter, so we unlock the futex. The owner died 832 * There is no waiter, so we unlock the futex. The owner died
842 * bit has not to be preserved here. We are the owner: 833 * bit has not to be preserved here. We are the owner:
843 */ 834 */
844 oldval = cmpxchg_futex_value_locked(uaddr, uval, 0); 835 oldval = cmpxchg_futex_value_locked(uaddr, uval, 0);
845 836
846 if (oldval == -EFAULT) 837 if (oldval == -EFAULT)
847 return oldval; 838 return oldval;
848 if (oldval != uval) 839 if (oldval != uval)
849 return -EAGAIN; 840 return -EAGAIN;
850 841
851 return 0; 842 return 0;
852 } 843 }
853 844
854 /* 845 /*
855 * Express the locking dependencies for lockdep: 846 * Express the locking dependencies for lockdep:
856 */ 847 */
857 static inline void 848 static inline void
858 double_lock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2) 849 double_lock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
859 { 850 {
860 if (hb1 <= hb2) { 851 if (hb1 <= hb2) {
861 spin_lock(&hb1->lock); 852 spin_lock(&hb1->lock);
862 if (hb1 < hb2) 853 if (hb1 < hb2)
863 spin_lock_nested(&hb2->lock, SINGLE_DEPTH_NESTING); 854 spin_lock_nested(&hb2->lock, SINGLE_DEPTH_NESTING);
864 } else { /* hb1 > hb2 */ 855 } else { /* hb1 > hb2 */
865 spin_lock(&hb2->lock); 856 spin_lock(&hb2->lock);
866 spin_lock_nested(&hb1->lock, SINGLE_DEPTH_NESTING); 857 spin_lock_nested(&hb1->lock, SINGLE_DEPTH_NESTING);
867 } 858 }
868 } 859 }
869 860
870 static inline void 861 static inline void
871 double_unlock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2) 862 double_unlock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
872 { 863 {
873 spin_unlock(&hb1->lock); 864 spin_unlock(&hb1->lock);
874 if (hb1 != hb2) 865 if (hb1 != hb2)
875 spin_unlock(&hb2->lock); 866 spin_unlock(&hb2->lock);
876 } 867 }
877 868
878 /* 869 /*
879 * Wake up waiters matching bitset queued on this futex (uaddr). 870 * Wake up waiters matching bitset queued on this futex (uaddr).
880 */ 871 */
881 static int futex_wake(u32 __user *uaddr, int fshared, int nr_wake, u32 bitset) 872 static int futex_wake(u32 __user *uaddr, int fshared, int nr_wake, u32 bitset)
882 { 873 {
883 struct futex_hash_bucket *hb; 874 struct futex_hash_bucket *hb;
884 struct futex_q *this, *next; 875 struct futex_q *this, *next;
885 struct plist_head *head; 876 struct plist_head *head;
886 union futex_key key = FUTEX_KEY_INIT; 877 union futex_key key = FUTEX_KEY_INIT;
887 int ret; 878 int ret;
888 879
889 if (!bitset) 880 if (!bitset)
890 return -EINVAL; 881 return -EINVAL;
891 882
892 ret = get_futex_key(uaddr, fshared, &key); 883 ret = get_futex_key(uaddr, fshared, &key);
893 if (unlikely(ret != 0)) 884 if (unlikely(ret != 0))
894 goto out; 885 goto out;
895 886
896 hb = hash_futex(&key); 887 hb = hash_futex(&key);
897 spin_lock(&hb->lock); 888 spin_lock(&hb->lock);
898 head = &hb->chain; 889 head = &hb->chain;
899 890
900 plist_for_each_entry_safe(this, next, head, list) { 891 plist_for_each_entry_safe(this, next, head, list) {
901 if (match_futex (&this->key, &key)) { 892 if (match_futex (&this->key, &key)) {
902 if (this->pi_state || this->rt_waiter) { 893 if (this->pi_state || this->rt_waiter) {
903 ret = -EINVAL; 894 ret = -EINVAL;
904 break; 895 break;
905 } 896 }
906 897
907 /* Check if one of the bits is set in both bitsets */ 898 /* Check if one of the bits is set in both bitsets */
908 if (!(this->bitset & bitset)) 899 if (!(this->bitset & bitset))
909 continue; 900 continue;
910 901
911 wake_futex(this); 902 wake_futex(this);
912 if (++ret >= nr_wake) 903 if (++ret >= nr_wake)
913 break; 904 break;
914 } 905 }
915 } 906 }
916 907
917 spin_unlock(&hb->lock); 908 spin_unlock(&hb->lock);
918 put_futex_key(fshared, &key); 909 put_futex_key(fshared, &key);
919 out: 910 out:
920 return ret; 911 return ret;
921 } 912 }
922 913
923 /* 914 /*
924 * Wake up all waiters hashed on the physical page that is mapped 915 * Wake up all waiters hashed on the physical page that is mapped
925 * to this virtual address: 916 * to this virtual address:
926 */ 917 */
927 static int 918 static int
928 futex_wake_op(u32 __user *uaddr1, int fshared, u32 __user *uaddr2, 919 futex_wake_op(u32 __user *uaddr1, int fshared, u32 __user *uaddr2,
929 int nr_wake, int nr_wake2, int op) 920 int nr_wake, int nr_wake2, int op)
930 { 921 {
931 union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT; 922 union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
932 struct futex_hash_bucket *hb1, *hb2; 923 struct futex_hash_bucket *hb1, *hb2;
933 struct plist_head *head; 924 struct plist_head *head;
934 struct futex_q *this, *next; 925 struct futex_q *this, *next;
935 int ret, op_ret; 926 int ret, op_ret;
936 927
937 retry: 928 retry:
938 ret = get_futex_key(uaddr1, fshared, &key1); 929 ret = get_futex_key(uaddr1, fshared, &key1);
939 if (unlikely(ret != 0)) 930 if (unlikely(ret != 0))
940 goto out; 931 goto out;
941 ret = get_futex_key(uaddr2, fshared, &key2); 932 ret = get_futex_key(uaddr2, fshared, &key2);
942 if (unlikely(ret != 0)) 933 if (unlikely(ret != 0))
943 goto out_put_key1; 934 goto out_put_key1;
944 935
945 hb1 = hash_futex(&key1); 936 hb1 = hash_futex(&key1);
946 hb2 = hash_futex(&key2); 937 hb2 = hash_futex(&key2);
947 938
948 retry_private: 939 retry_private:
949 double_lock_hb(hb1, hb2); 940 double_lock_hb(hb1, hb2);
950 op_ret = futex_atomic_op_inuser(op, uaddr2); 941 op_ret = futex_atomic_op_inuser(op, uaddr2);
951 if (unlikely(op_ret < 0)) { 942 if (unlikely(op_ret < 0)) {
952 943
953 double_unlock_hb(hb1, hb2); 944 double_unlock_hb(hb1, hb2);
954 945
955 #ifndef CONFIG_MMU 946 #ifndef CONFIG_MMU
956 /* 947 /*
957 * we don't get EFAULT from MMU faults if we don't have an MMU, 948 * we don't get EFAULT from MMU faults if we don't have an MMU,
958 * but we might get them from range checking 949 * but we might get them from range checking
959 */ 950 */
960 ret = op_ret; 951 ret = op_ret;
961 goto out_put_keys; 952 goto out_put_keys;
962 #endif 953 #endif
963 954
964 if (unlikely(op_ret != -EFAULT)) { 955 if (unlikely(op_ret != -EFAULT)) {
965 ret = op_ret; 956 ret = op_ret;
966 goto out_put_keys; 957 goto out_put_keys;
967 } 958 }
968 959
969 ret = fault_in_user_writeable(uaddr2); 960 ret = fault_in_user_writeable(uaddr2);
970 if (ret) 961 if (ret)
971 goto out_put_keys; 962 goto out_put_keys;
972 963
973 if (!fshared) 964 if (!fshared)
974 goto retry_private; 965 goto retry_private;
975 966
976 put_futex_key(fshared, &key2); 967 put_futex_key(fshared, &key2);
977 put_futex_key(fshared, &key1); 968 put_futex_key(fshared, &key1);
978 goto retry; 969 goto retry;
979 } 970 }
980 971
981 head = &hb1->chain; 972 head = &hb1->chain;
982 973
983 plist_for_each_entry_safe(this, next, head, list) { 974 plist_for_each_entry_safe(this, next, head, list) {
984 if (match_futex (&this->key, &key1)) { 975 if (match_futex (&this->key, &key1)) {
985 wake_futex(this); 976 wake_futex(this);
986 if (++ret >= nr_wake) 977 if (++ret >= nr_wake)
987 break; 978 break;
988 } 979 }
989 } 980 }
990 981
991 if (op_ret > 0) { 982 if (op_ret > 0) {
992 head = &hb2->chain; 983 head = &hb2->chain;
993 984
994 op_ret = 0; 985 op_ret = 0;
995 plist_for_each_entry_safe(this, next, head, list) { 986 plist_for_each_entry_safe(this, next, head, list) {
996 if (match_futex (&this->key, &key2)) { 987 if (match_futex (&this->key, &key2)) {
997 wake_futex(this); 988 wake_futex(this);
998 if (++op_ret >= nr_wake2) 989 if (++op_ret >= nr_wake2)
999 break; 990 break;
1000 } 991 }
1001 } 992 }
1002 ret += op_ret; 993 ret += op_ret;
1003 } 994 }
1004 995
1005 double_unlock_hb(hb1, hb2); 996 double_unlock_hb(hb1, hb2);
1006 out_put_keys: 997 out_put_keys:
1007 put_futex_key(fshared, &key2); 998 put_futex_key(fshared, &key2);
1008 out_put_key1: 999 out_put_key1:
1009 put_futex_key(fshared, &key1); 1000 put_futex_key(fshared, &key1);
1010 out: 1001 out:
1011 return ret; 1002 return ret;
1012 } 1003 }
1013 1004
/**
 * requeue_futex() - Requeue a futex_q from one hb to another
 * @q:          the futex_q to requeue
 * @hb1:        the source hash_bucket
 * @hb2:        the target hash_bucket
 * @key2:       the new key for the requeued futex_q
 */
static inline
void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
                   struct futex_hash_bucket *hb2, union futex_key *key2)
{

        /*
         * If key1 and key2 hash to the same bucket, no need to
         * requeue.
         */
        if (likely(&hb1->chain != &hb2->chain)) {
                plist_del(&q->list, &hb1->chain);
                plist_add(&q->list, &hb2->chain);
                q->lock_ptr = &hb2->lock;
#ifdef CONFIG_DEBUG_PI_LIST
                q->list.plist.spinlock = &hb2->lock;
#endif
        }
        get_futex_key_refs(key2);
        q->key = *key2;
}

/**
 * requeue_pi_wake_futex() - Wake a task that acquired the lock during requeue
 * @q:          the futex_q
 * @key:        the key of the requeue target futex
 * @hb:         the hash_bucket of the requeue target futex
 *
 * During futex_requeue, with requeue_pi=1, it is possible to acquire the
 * target futex if it is uncontended or via a lock steal.  Set the futex_q key
 * to the requeue target futex so the waiter can detect the wakeup on the right
 * futex, but remove it from the hb and NULL the rt_waiter so it can detect
 * atomic lock acquisition.  Set the q->lock_ptr to the requeue target hb->lock
 * to protect access to the pi_state to fixup the owner later.  Must be called
 * with both q->lock_ptr and hb->lock held.
 */
static inline
void requeue_pi_wake_futex(struct futex_q *q, union futex_key *key,
                           struct futex_hash_bucket *hb)
{
        get_futex_key_refs(key);
        q->key = *key;

        WARN_ON(plist_node_empty(&q->list));
        plist_del(&q->list, &q->list.plist);

        WARN_ON(!q->rt_waiter);
        q->rt_waiter = NULL;

        q->lock_ptr = &hb->lock;
#ifdef CONFIG_DEBUG_PI_LIST
        q->list.plist.spinlock = &hb->lock;
#endif

        wake_up_state(q->task, TASK_NORMAL);
}

/**
 * futex_proxy_trylock_atomic() - Attempt an atomic lock for the top waiter
 * @pifutex:            the user address of the to futex
 * @hb1:                the from futex hash bucket, must be locked by the caller
 * @hb2:                the to futex hash bucket, must be locked by the caller
 * @key1:               the from futex key
 * @key2:               the to futex key
 * @ps:                 address to store the pi_state pointer
 * @set_waiters:        force setting the FUTEX_WAITERS bit (1) or not (0)
 *
 * Try and get the lock on behalf of the top waiter if we can do it atomically.
 * Wake the top waiter if we succeed.  If the caller specified set_waiters,
 * then direct futex_lock_pi_atomic() to force setting the FUTEX_WAITERS bit.
 * hb1 and hb2 must be held by the caller.
 *
 * Returns:
 *  0 - failed to acquire the lock atomically
 *  1 - acquired the lock
 * <0 - error
 */
static int futex_proxy_trylock_atomic(u32 __user *pifutex,
                                      struct futex_hash_bucket *hb1,
                                      struct futex_hash_bucket *hb2,
                                      union futex_key *key1, union futex_key *key2,
                                      struct futex_pi_state **ps, int set_waiters)
{
        struct futex_q *top_waiter = NULL;
        u32 curval;
        int ret;

        if (get_futex_value_locked(&curval, pifutex))
                return -EFAULT;

        /*
         * Find the top_waiter and determine if there are additional waiters.
         * If the caller intends to requeue more than 1 waiter to pifutex,
         * force futex_lock_pi_atomic() to set the FUTEX_WAITERS bit now,
         * as we have means to handle the possible fault.  If not, don't set
         * the bit unnecessarily as it will force the subsequent unlock to
         * enter the kernel.
         */
        top_waiter = futex_top_waiter(hb1, key1);

        /* There are no waiters, nothing for us to do. */
        if (!top_waiter)
                return 0;

        /* Ensure we requeue to the expected futex. */
        if (!match_futex(top_waiter->requeue_pi_key, key2))
                return -EINVAL;

        /*
         * Try to take the lock for top_waiter.  Set the FUTEX_WAITERS bit in
         * the contended case or if set_waiters is 1.  The pi_state is returned
         * in ps in contended cases.
         */
        ret = futex_lock_pi_atomic(pifutex, hb2, key2, ps, top_waiter->task,
                                   set_waiters);
        if (ret == 1)
                requeue_pi_wake_futex(top_waiter, key2, hb2);

        return ret;
}

/**
 * futex_requeue() - Requeue waiters from uaddr1 to uaddr2
 * @uaddr1:     source futex user address
 * @fshared:    whether the futexes are shared (1) or not (0)
 * @uaddr2:     target futex user address
 * @nr_wake:    number of waiters to wake (must be 1 for requeue_pi)
 * @nr_requeue: number of waiters to requeue (0-INT_MAX)
 * @cmpval:     expected value of uaddr1, or NULL to skip the check
 * @requeue_pi: if we are attempting to requeue from a non-pi futex to a
 *              pi futex (pi to pi requeue is not supported)
 *
 * Requeue waiters on uaddr1 to uaddr2. In the requeue_pi case, try to acquire
 * uaddr2 atomically on behalf of the top waiter.
 *
 * Returns:
 * >=0 - on success, the number of tasks requeued or woken
 *  <0 - on error
 */
static int futex_requeue(u32 __user *uaddr1, int fshared, u32 __user *uaddr2,
                         int nr_wake, int nr_requeue, u32 *cmpval,
                         int requeue_pi)
{
        union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
        int drop_count = 0, task_count = 0, ret;
        struct futex_pi_state *pi_state = NULL;
        struct futex_hash_bucket *hb1, *hb2;
        struct plist_head *head1;
        struct futex_q *this, *next;
        u32 curval2;

        if (requeue_pi) {
                /*
                 * requeue_pi requires a pi_state, try to allocate it now
                 * without any locks in case it fails.
                 */
                if (refill_pi_state_cache())
                        return -ENOMEM;
                /*
                 * requeue_pi must wake as many tasks as it can, up to nr_wake
                 * + nr_requeue, since it acquires the rt_mutex prior to
                 * returning to userspace, so as to not leave the rt_mutex with
                 * waiters and no owner.  However, second and third wake-ups
                 * cannot be predicted as they involve race conditions with the
                 * first wake and a fault while looking up the pi_state.  Both
                 * pthread_cond_signal() and pthread_cond_broadcast() should
                 * use nr_wake=1.
                 */
                if (nr_wake != 1)
                        return -EINVAL;
        }

retry:
        if (pi_state != NULL) {
                /*
                 * We will have to lookup the pi_state again, so free this one
                 * to keep the accounting correct.
                 */
                free_pi_state(pi_state);
                pi_state = NULL;
        }

        ret = get_futex_key(uaddr1, fshared, &key1);
        if (unlikely(ret != 0))
                goto out;
        ret = get_futex_key(uaddr2, fshared, &key2);
        if (unlikely(ret != 0))
                goto out_put_key1;

        hb1 = hash_futex(&key1);
        hb2 = hash_futex(&key2);

retry_private:
        double_lock_hb(hb1, hb2);

        if (likely(cmpval != NULL)) {
                u32 curval;

                ret = get_futex_value_locked(&curval, uaddr1);

                if (unlikely(ret)) {
                        double_unlock_hb(hb1, hb2);

                        ret = get_user(curval, uaddr1);
                        if (ret)
                                goto out_put_keys;

                        if (!fshared)
                                goto retry_private;

                        put_futex_key(fshared, &key2);
                        put_futex_key(fshared, &key1);
                        goto retry;
                }
                if (curval != *cmpval) {
                        ret = -EAGAIN;
                        goto out_unlock;
                }
        }

        if (requeue_pi && (task_count - nr_wake < nr_requeue)) {
                /*
                 * Attempt to acquire uaddr2 and wake the top waiter. If we
                 * intend to requeue waiters, force setting the FUTEX_WAITERS
                 * bit. We force this here where we are able to easily handle
                 * faults rather than in the requeue loop below.
                 */
                ret = futex_proxy_trylock_atomic(uaddr2, hb1, hb2, &key1,
                                                 &key2, &pi_state, nr_requeue);

                /*
                 * At this point the top_waiter has either taken uaddr2 or is
                 * waiting on it. If the former, then the pi_state will not
                 * exist yet, look it up one more time to ensure we have a
                 * reference to it.
                 */
                if (ret == 1) {
                        WARN_ON(pi_state);
                        drop_count++;
                        task_count++;
                        ret = get_futex_value_locked(&curval2, uaddr2);
                        if (!ret)
                                ret = lookup_pi_state(curval2, hb2, &key2,
                                                      &pi_state);
                }

                switch (ret) {
                case 0:
                        break;
                case -EFAULT:
                        double_unlock_hb(hb1, hb2);
                        put_futex_key(fshared, &key2);
                        put_futex_key(fshared, &key1);
                        ret = fault_in_user_writeable(uaddr2);
                        if (!ret)
                                goto retry;
                        goto out;
                case -EAGAIN:
                        /* The owner was exiting, try again. */
                        double_unlock_hb(hb1, hb2);
                        put_futex_key(fshared, &key2);
                        put_futex_key(fshared, &key1);
                        cond_resched();
                        goto retry;
                default:
                        goto out_unlock;
                }
        }

        head1 = &hb1->chain;
        plist_for_each_entry_safe(this, next, head1, list) {
                if (task_count - nr_wake >= nr_requeue)
                        break;

                if (!match_futex(&this->key, &key1))
                        continue;

                /*
                 * FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI should always
                 * be paired with each other and no other futex ops.
                 */
                if ((requeue_pi && !this->rt_waiter) ||
                    (!requeue_pi && this->rt_waiter)) {
                        ret = -EINVAL;
                        break;
                }

                /*
                 * Wake nr_wake waiters.  For requeue_pi, if we acquired the
                 * lock, we already woke the top_waiter.  If not, it will be
                 * woken by futex_unlock_pi().
                 */
                if (++task_count <= nr_wake && !requeue_pi) {
                        wake_futex(this);
                        continue;
                }

                /* Ensure we requeue to the expected futex for requeue_pi. */
                if (requeue_pi && !match_futex(this->requeue_pi_key, &key2)) {
                        ret = -EINVAL;
                        break;
                }

                /*
                 * Requeue nr_requeue waiters and possibly one more in the case
                 * of requeue_pi if we couldn't acquire the lock atomically.
                 */
                if (requeue_pi) {
                        /* Prepare the waiter to take the rt_mutex. */
                        atomic_inc(&pi_state->refcount);
                        this->pi_state = pi_state;
                        ret = rt_mutex_start_proxy_lock(&pi_state->pi_mutex,
                                                        this->rt_waiter,
                                                        this->task, 1);
                        if (ret == 1) {
                                /* We got the lock. */
                                requeue_pi_wake_futex(this, &key2, hb2);
                                drop_count++;
                                continue;
                        } else if (ret) {
                                /* -EDEADLK */
                                this->pi_state = NULL;
                                free_pi_state(pi_state);
                                goto out_unlock;
                        }
                }
                requeue_futex(this, hb1, hb2, &key2);
                drop_count++;
        }

out_unlock:
        double_unlock_hb(hb1, hb2);

        /*
         * drop_futex_key_refs() must be called outside the spinlocks. During
         * the requeue we moved futex_q's from the hash bucket at key1 to the
         * one at key2 and updated their key pointer. We no longer need to
         * hold the references to key1.
         */
        while (--drop_count >= 0)
                drop_futex_key_refs(&key1);

out_put_keys:
        put_futex_key(fshared, &key2);
out_put_key1:
        put_futex_key(fshared, &key1);
out:
        if (pi_state != NULL)
                free_pi_state(pi_state);
        return ret ? ret : task_count;
}

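From userspace, the non-pi path of futex_requeue() is reached through FUTEX_CMP_REQUEUE, as a condvar-style broadcast might issue it. A hedged sketch follows; "cond" and "mutex" are illustrative futex words, and nr_requeue is passed in the timeout argument slot.

#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <limits.h>
#include <unistd.h>
#include <stdint.h>

static uint32_t cond, mutex;

/*
 * Wake one waiter on cond and move up to INT_MAX of the rest onto
 * mutex, provided cond still holds expected_val (otherwise -EAGAIN,
 * mirroring the cmpval check above).
 */
static long cmp_requeue_example(uint32_t expected_val)
{
        return syscall(SYS_futex, &cond, FUTEX_CMP_REQUEUE, 1,
                       (void *)(uintptr_t)INT_MAX /* nr_requeue */,
                       &mutex, expected_val);
}
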
/* The key must be already stored in q->key. */
static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
{
        struct futex_hash_bucket *hb;

        get_futex_key_refs(&q->key);
        hb = hash_futex(&q->key);
        q->lock_ptr = &hb->lock;

        spin_lock(&hb->lock);
        return hb;
}

static inline void
queue_unlock(struct futex_q *q, struct futex_hash_bucket *hb)
{
        spin_unlock(&hb->lock);
        drop_futex_key_refs(&q->key);
}

/**
 * queue_me() - Enqueue the futex_q on the futex_hash_bucket
 * @q:  The futex_q to enqueue
 * @hb: The destination hash bucket
 *
 * The hb->lock must be held by the caller, and is released here.  A call to
 * queue_me() is typically paired with exactly one call to unqueue_me().  The
 * exceptions involve the PI related operations, which may use unqueue_me_pi()
 * or nothing if the unqueue is done as part of the wake process and the
 * unqueue state is implicit in the state of the woken task (see
 * futex_wait_requeue_pi() for an example).
 */
static inline void queue_me(struct futex_q *q, struct futex_hash_bucket *hb)
{
        int prio;

        /*
         * The priority used to register this element is
         * - either the real thread-priority for the real-time threads
         * (i.e. threads with a priority lower than MAX_RT_PRIO)
         * - or MAX_RT_PRIO for non-RT threads.
         * Thus, all RT-threads are woken first in priority order, and
         * the others are woken last, in FIFO order.
         */
        prio = min(current->normal_prio, MAX_RT_PRIO);

        plist_node_init(&q->list, prio);
#ifdef CONFIG_DEBUG_PI_LIST
        q->list.plist.spinlock = &hb->lock;
#endif
        plist_add(&q->list, &hb->chain);
        q->task = current;
        spin_unlock(&hb->lock);
}

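The comment in queue_me() explains the wake ordering. A minimal sketch of the same clamp, assuming mainline's MAX_RT_PRIO of 100 and the usual kernel prio mapping (RT tasks get 0..99, lower is better; SCHED_OTHER tasks get 100..139 from their nice value); the numbers are illustrative, not taken from a live task.

#include <stdio.h>

#define MAX_RT_PRIO 100

/*
 * Mirrors queue_me(): RT tasks keep their priority, everything else
 * collapses to MAX_RT_PRIO and therefore wakes in FIFO order after
 * all RT waiters.
 */
static int plist_prio(int normal_prio)
{
        return normal_prio < MAX_RT_PRIO ? normal_prio : MAX_RT_PRIO;
}

int main(void)
{
        printf("SCHED_FIFO rt_priority 10 -> plist prio %d\n",
               plist_prio(MAX_RT_PRIO - 1 - 10));      /* 89 */
        printf("SCHED_OTHER nice 0      -> plist prio %d\n",
               plist_prio(120));                       /* 100 */
        return 0;
}
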
/**
 * unqueue_me() - Remove the futex_q from its futex_hash_bucket
 * @q:  The futex_q to unqueue
 *
 * The q->lock_ptr must not be held by the caller. A call to unqueue_me() must
 * be paired with exactly one earlier call to queue_me().
 *
 * Returns:
 *   1 - if the futex_q was still queued (and we removed it)
 *   0 - if the futex_q was already removed by the waking thread
 */
static int unqueue_me(struct futex_q *q)
{
        spinlock_t *lock_ptr;
        int ret = 0;

        /* In the common case we don't take the spinlock, which is nice. */
retry:
        lock_ptr = q->lock_ptr;
        barrier();
        if (lock_ptr != NULL) {
                spin_lock(lock_ptr);
                /*
                 * q->lock_ptr can change between reading it and
                 * spin_lock(), causing us to take the wrong lock.  This
                 * corrects the race condition.
                 *
                 * Reasoning goes like this: if we have the wrong lock,
                 * q->lock_ptr must have changed (maybe several times)
                 * between reading it and the spin_lock().  It can
                 * change again after the spin_lock() but only if it was
                 * already changed before the spin_lock().  It cannot,
                 * however, change back to the original value.  Therefore
                 * we can detect whether we acquired the correct lock.
                 */
                if (unlikely(lock_ptr != q->lock_ptr)) {
                        spin_unlock(lock_ptr);
                        goto retry;
                }
                WARN_ON(plist_node_empty(&q->list));
                plist_del(&q->list, &q->list.plist);

                BUG_ON(q->pi_state);

                spin_unlock(lock_ptr);
                ret = 1;
        }

        drop_futex_key_refs(&q->key);
        return ret;
}

/*
 * PI futexes can not be requeued and must remove themselves from the
 * hash bucket. The hash bucket lock (i.e. lock_ptr) is held on entry
 * and dropped here.
 */
static void unqueue_me_pi(struct futex_q *q)
{
        WARN_ON(plist_node_empty(&q->list));
        plist_del(&q->list, &q->list.plist);

        BUG_ON(!q->pi_state);
        free_pi_state(q->pi_state);
        q->pi_state = NULL;

        spin_unlock(q->lock_ptr);

        drop_futex_key_refs(&q->key);
}

/*
 * Fixup the pi_state owner with the new owner.
 *
 * Must be called with hash bucket lock held and mm->sem held for non
 * private futexes.
 */
static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q,
                                struct task_struct *newowner, int fshared)
{
        u32 newtid = task_pid_vnr(newowner) | FUTEX_WAITERS;
        struct futex_pi_state *pi_state = q->pi_state;
        struct task_struct *oldowner = pi_state->owner;
        u32 uval, curval, newval;
        int ret;

        /* Owner died? */
        if (!pi_state->owner)
                newtid |= FUTEX_OWNER_DIED;

        /*
         * We are here either because we stole the rtmutex from the
         * pending owner or we are the pending owner which failed to
         * get the rtmutex. We have to replace the pending owner TID
         * in the user space variable. This must be atomic as we have
         * to preserve the owner died bit here.
         *
         * Note: We write the user space value _before_ changing the pi_state
         * because we can fault here. Imagine swapped out pages or a fork
         * that marked all the anonymous memory readonly for cow.
         *
         * Modifying pi_state _before_ the user space value would
         * leave the pi_state in an inconsistent state when we fault
         * here, because we need to drop the hash bucket lock to
         * handle the fault. This might be observed in the PID check
         * in lookup_pi_state.
         */
retry:
        if (get_futex_value_locked(&uval, uaddr))
                goto handle_fault;

        while (1) {
                newval = (uval & FUTEX_OWNER_DIED) | newtid;

                curval = cmpxchg_futex_value_locked(uaddr, uval, newval);

                if (curval == -EFAULT)
                        goto handle_fault;
                if (curval == uval)
                        break;
                uval = curval;
        }

        /*
         * We fixed up user space. Now we need to fix the pi_state
         * itself.
         */
        if (pi_state->owner != NULL) {
                raw_spin_lock_irq(&pi_state->owner->pi_lock);
                WARN_ON(list_empty(&pi_state->list));
                list_del_init(&pi_state->list);
                raw_spin_unlock_irq(&pi_state->owner->pi_lock);
        }

        pi_state->owner = newowner;

        raw_spin_lock_irq(&newowner->pi_lock);
        WARN_ON(!list_empty(&pi_state->list));
        list_add(&pi_state->list, &newowner->pi_state_list);
        raw_spin_unlock_irq(&newowner->pi_lock);
        return 0;

        /*
         * To handle the page fault we need to drop the hash bucket
         * lock here. That gives the other task (either the pending
         * owner itself or the task which stole the rtmutex) the
         * chance to try the fixup of the pi_state. So once we are
         * back from handling the fault we need to check the pi_state
         * after reacquiring the hash bucket lock and before trying to
         * do another fixup. When the fixup has been done already we
         * simply return.
         */
handle_fault:
        spin_unlock(q->lock_ptr);

        ret = fault_in_user_writeable(uaddr);

        spin_lock(q->lock_ptr);

        /*
         * Check if someone else fixed it for us:
         */
        if (pi_state->owner != oldowner)
                return 0;

        if (ret)
                return ret;

        goto retry;
}

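The cmpxchg loop above rewrites the owner TID while preserving the high bits of the futex word. As a hedged sketch of the layout it manipulates, the decode helper below uses the uapi constants from <linux/futex.h>; the helper itself is illustrative.

#include <linux/futex.h>        /* FUTEX_WAITERS, FUTEX_OWNER_DIED, FUTEX_TID_MASK */
#include <stdint.h>
#include <stdio.h>

/*
 * A PI futex word holds the owner's TID in the low bits, plus two
 * flag bits: FUTEX_WAITERS (unlock must enter the kernel) and
 * FUTEX_OWNER_DIED (robust-futex owner death).
 */
static void decode_pi_word(uint32_t val)
{
        printf("owner tid: %u, waiters: %d, owner died: %d\n",
               val & FUTEX_TID_MASK,
               !!(val & FUTEX_WAITERS),
               !!(val & FUTEX_OWNER_DIED));
}
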
/*
 * In case we must use restart_block to restart a futex_wait, we encode in
 * 'flags' whether the futex is shared, which clock the timeout uses, and
 * whether a timeout was supplied at all.
 */
#define FLAGS_SHARED            0x01
#define FLAGS_CLOCKRT           0x02
#define FLAGS_HAS_TIMEOUT       0x04

static long futex_wait_restart(struct restart_block *restart);

/**
 * fixup_owner() - Post lock pi_state and corner case management
 * @uaddr:      user address of the futex
 * @fshared:    whether the futex is shared (1) or not (0)
 * @q:          futex_q (contains pi_state and access to the rt_mutex)
 * @locked:     if the attempt to take the rt_mutex succeeded (1) or not (0)
 *
 * After attempting to lock an rt_mutex, this function is called to cleanup
 * the pi_state owner as well as handle race conditions that may allow us to
 * acquire the lock. Must be called with the hb lock held.
 *
 * Returns:
 *  1 - success, lock taken
 *  0 - success, lock not taken
 * <0 - on error (-EFAULT)
 */
static int fixup_owner(u32 __user *uaddr, int fshared, struct futex_q *q,
                       int locked)
{
        struct task_struct *owner;
        int ret = 0;

        if (locked) {
                /*
                 * Got the lock. We might not be the anticipated owner if we
                 * did a lock-steal - fix up the PI-state in that case:
                 */
                if (q->pi_state->owner != current)
                        ret = fixup_pi_state_owner(uaddr, q, current, fshared);
                goto out;
        }

        /*
         * Catch the rare case, where the lock was released when we were on the
         * way back before we locked the hash bucket.
         */
        if (q->pi_state->owner == current) {
                /*
                 * Try to get the rt_mutex now. This might fail as some other
                 * task acquired the rt_mutex after we removed ourselves from
                 * the rt_mutex waiters list.
                 */
                if (rt_mutex_trylock(&q->pi_state->pi_mutex)) {
                        locked = 1;
                        goto out;
                }

                /*
                 * pi_state is incorrect, some other task did a lock steal and
                 * we returned due to timeout or signal without taking the
                 * rt_mutex. Too late. We can access the rt_mutex_owner without
                 * locking, as the other task is now blocked on the hash bucket
                 * lock. Fix the state up.
                 */
                owner = rt_mutex_owner(&q->pi_state->pi_mutex);
                ret = fixup_pi_state_owner(uaddr, q, owner, fshared);
                goto out;
        }

        /*
         * Paranoia check. If we did not take the lock, then we should not be
         * the owner, nor the pending owner, of the rt_mutex.
         */
        if (rt_mutex_owner(&q->pi_state->pi_mutex) == current)
                printk(KERN_ERR "fixup_owner: ret = %d pi-mutex: %p "
                       "pi-state %p\n", ret,
                       q->pi_state->pi_mutex.owner,
                       q->pi_state->owner);

out:
        return ret ? ret : locked;
}

/**
 * futex_wait_queue_me() - queue_me() and wait for wakeup, timeout, or signal
 * @hb:         the futex hash bucket, must be locked by the caller
 * @q:          the futex_q to queue up on
 * @timeout:    the prepared hrtimer_sleeper, or null for no timeout
 */
static void futex_wait_queue_me(struct futex_hash_bucket *hb, struct futex_q *q,
                                struct hrtimer_sleeper *timeout)
{
        /*
         * The task state is guaranteed to be set before another task can
         * wake it. set_current_state() is implemented using set_mb() and
         * queue_me() calls spin_unlock() upon completion, both serializing
         * access to the hash list and forcing another memory barrier.
         */
        set_current_state(TASK_INTERRUPTIBLE);
        queue_me(q, hb);

        /* Arm the timer */
        if (timeout) {
                hrtimer_start_expires(&timeout->timer, HRTIMER_MODE_ABS);
                if (!hrtimer_active(&timeout->timer))
                        timeout->task = NULL;
        }

        /*
         * If we have been removed from the hash list, then another task
         * has tried to wake us, and we can skip the call to schedule().
         */
        if (likely(!plist_node_empty(&q->list))) {
                /*
                 * If the timer has already expired, current will already be
                 * flagged for rescheduling. Only call schedule if there
                 * is no timeout, or if it has yet to expire.
                 */
                if (!timeout || timeout->task)
                        schedule();
        }
        __set_current_state(TASK_RUNNING);
}

/**
 * futex_wait_setup() - Prepare to wait on a futex
 * @uaddr:      the futex userspace address
 * @val:        the expected value
 * @fshared:    whether the futex is shared (1) or not (0)
 * @q:          the associated futex_q
 * @hb:         storage for hash_bucket pointer to be returned to caller
 *
 * Setup the futex_q and locate the hash_bucket.  Get the futex value and
 * compare it with the expected value.  Handle atomic faults internally.
 * Return with the hb lock held and a q.key reference on success, and unlocked
 * with no q.key reference on failure.
 *
 * Returns:
 *  0 - uaddr contains val and hb has been locked
 * <0 - -EFAULT or -EWOULDBLOCK (uaddr does not contain val) and hb is unlocked
 */
static int futex_wait_setup(u32 __user *uaddr, u32 val, int fshared,
                            struct futex_q *q, struct futex_hash_bucket **hb)
{
        u32 uval;
        int ret;

        /*
         * Access the page AFTER the hash-bucket is locked.
         * Order is important:
         *
         *   Userspace waiter: val = var; if (cond(val)) futex_wait(&var, val);
         *   Userspace waker:  if (cond(var)) { var = new; futex_wake(&var); }
         *
         * The basic logical guarantee of a futex is that it blocks ONLY
         * if cond(var) is known to be true at the time of blocking, for
         * any cond.  If we queued after testing *uaddr, that would open
         * a race condition where we could block indefinitely with
         * cond(var) false, which would violate the guarantee.
         *
         * A consequence is that futex_wait() can return zero and absorb
         * a wakeup when *uaddr != val on entry to the syscall.  This is
         * rare, but normal.
         */
retry:
        q->key = FUTEX_KEY_INIT;
        ret = get_futex_key(uaddr, fshared, &q->key);
        if (unlikely(ret != 0))
                return ret;

retry_private:
        *hb = queue_lock(q);

        ret = get_futex_value_locked(&uval, uaddr);

        if (ret) {
                queue_unlock(q, *hb);

                ret = get_user(uval, uaddr);
                if (ret)
                        goto out;

                if (!fshared)
                        goto retry_private;

                put_futex_key(fshared, &q->key);
                goto retry;
        }

        if (uval != val) {
                queue_unlock(q, *hb);
                ret = -EWOULDBLOCK;
        }

out:
        if (ret)
                put_futex_key(fshared, &q->key);
        return ret;
}

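The ordering comment above describes the canonical userspace waiter/waker pattern. A hedged sketch of that pattern with the raw syscall follows; "futex_word" is illustrative, the atomics are standard C11.

#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t futex_word;

static void wait_until_nonzero(void)
{
        uint32_t val;

        while ((val = atomic_load(&futex_word)) == 0) {
                /*
                 * Blocks only if the word still equals val; otherwise
                 * the kernel returns -EWOULDBLOCK and we re-check, so
                 * a concurrent post() is never missed.
                 */
                syscall(SYS_futex, &futex_word, FUTEX_WAIT, val,
                        NULL, NULL, 0);
        }
}

static void post(void)
{
        atomic_store(&futex_word, 1);
        syscall(SYS_futex, &futex_word, FUTEX_WAKE, 1, NULL, NULL, 0);
}
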
static int futex_wait(u32 __user *uaddr, int fshared,
                      u32 val, ktime_t *abs_time, u32 bitset, int clockrt)
{
        struct hrtimer_sleeper timeout, *to = NULL;
        struct restart_block *restart;
        struct futex_hash_bucket *hb;
        struct futex_q q;
        int ret;

        if (!bitset)
                return -EINVAL;

        q.pi_state = NULL;
        q.bitset = bitset;
        q.rt_waiter = NULL;
        q.requeue_pi_key = NULL;

        if (abs_time) {
                to = &timeout;

                hrtimer_init_on_stack(&to->timer, clockrt ? CLOCK_REALTIME :
                                      CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
                hrtimer_init_sleeper(to, current);
                hrtimer_set_expires_range_ns(&to->timer, *abs_time,
                                             current->timer_slack_ns);
        }

retry:
        /* Prepare to wait on uaddr. */
        ret = futex_wait_setup(uaddr, val, fshared, &q, &hb);
        if (ret)
                goto out;

        /* queue_me and wait for wakeup, timeout, or a signal. */
        futex_wait_queue_me(hb, &q, to);

        /* If we were woken (and unqueued), we succeeded, whatever. */
        ret = 0;
        if (!unqueue_me(&q))
                goto out_put_key;
        ret = -ETIMEDOUT;
        if (to && !to->task)
                goto out_put_key;

        /*
         * We expect signal_pending(current), but we might be the
         * victim of a spurious wakeup as well.
         */
        if (!signal_pending(current)) {
                put_futex_key(fshared, &q.key);
                goto retry;
        }

        ret = -ERESTARTSYS;
        if (!abs_time)
                goto out_put_key;

        restart = &current_thread_info()->restart_block;
        restart->fn = futex_wait_restart;
        restart->futex.uaddr = (u32 *)uaddr;
        restart->futex.val = val;
        restart->futex.time = abs_time->tv64;
        restart->futex.bitset = bitset;
        restart->futex.flags = FLAGS_HAS_TIMEOUT;

        if (fshared)
                restart->futex.flags |= FLAGS_SHARED;
        if (clockrt)
                restart->futex.flags |= FLAGS_CLOCKRT;

        ret = -ERESTART_RESTARTBLOCK;

out_put_key:
        put_futex_key(fshared, &q.key);
out:
        if (to) {
                hrtimer_cancel(&to->timer);
                destroy_hrtimer_on_stack(&to->timer);
        }
        return ret;
}

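The bitset and clockrt parameters of futex_wait() are reachable from userspace via FUTEX_WAIT_BITSET, which takes an absolute timeout. A hedged sketch with a CLOCK_REALTIME deadline and a match-any bitset; "futex_word" and the five-second deadline are illustrative.

#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <time.h>
#include <stdint.h>

static uint32_t futex_word;

static long wait_bitset_example(uint32_t expected)
{
        struct timespec deadline;

        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += 5;   /* absolute timeout, 5s from now */

        /* FUTEX_CLOCK_REALTIME selects the clockrt path above. */
        return syscall(SYS_futex, &futex_word,
                       FUTEX_WAIT_BITSET | FUTEX_CLOCK_REALTIME,
                       expected, &deadline, NULL, FUTEX_BITSET_MATCH_ANY);
}
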
static long futex_wait_restart(struct restart_block *restart)
{
        u32 __user *uaddr = (u32 __user *)restart->futex.uaddr;
        int fshared = 0;
        ktime_t t, *tp = NULL;

        if (restart->futex.flags & FLAGS_HAS_TIMEOUT) {
                t.tv64 = restart->futex.time;
                tp = &t;
        }
        restart->fn = do_no_restart_syscall;
        if (restart->futex.flags & FLAGS_SHARED)
                fshared = 1;
        return (long)futex_wait(uaddr, fshared, restart->futex.val, tp,
                                restart->futex.bitset,
                                restart->futex.flags & FLAGS_CLOCKRT);
}

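futex_lock_pi() below is the slow path behind a userspace 0 -> TID fast path. As a hedged sketch of that fast path, assuming the C11 atomics shown and an illustrative "lock_word":

#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t lock_word;

static void pi_lock(void)
{
        uint32_t zero = 0;
        uint32_t tid = (uint32_t)syscall(SYS_gettid);

        /* Uncontended: take the lock without entering the kernel. */
        if (!atomic_compare_exchange_strong(&lock_word, &zero, tid))
                /*
                 * Contended: the kernel blocks us on the rt_mutex and
                 * handles priority inheritance, exactly the path shown
                 * below.
                 */
                syscall(SYS_futex, &lock_word, FUTEX_LOCK_PI, 0,
                        NULL, NULL, 0);
}
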
/*
 * Userspace tried a 0 -> TID atomic transition of the futex value
 * and failed. The kernel side here does the whole locking operation:
 * if there are waiters then it will block, it does PI, etc. (Due to
 * races the kernel might see a 0 value of the futex too.)
 */
static int futex_lock_pi(u32 __user *uaddr, int fshared,
                         int detect, ktime_t *time, int trylock)
{
        struct hrtimer_sleeper timeout, *to = NULL;
        struct futex_hash_bucket *hb;
        struct futex_q q;
        int res, ret;

        if (refill_pi_state_cache())
                return -ENOMEM;

        if (time) {
                to = &timeout;
                hrtimer_init_on_stack(&to->timer, CLOCK_REALTIME,
                                      HRTIMER_MODE_ABS);
                hrtimer_init_sleeper(to, current);
                hrtimer_set_expires(&to->timer, *time);
        }

        q.pi_state = NULL;
        q.rt_waiter = NULL;
        q.requeue_pi_key = NULL;
retry:
        q.key = FUTEX_KEY_INIT;
        ret = get_futex_key(uaddr, fshared, &q.key);
        if (unlikely(ret != 0))
                goto out;

retry_private:
        hb = queue_lock(&q);

        ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current, 0);
        if (unlikely(ret)) {
                switch (ret) {
                case 1:
                        /* We got the lock. */
                        ret = 0;
                        goto out_unlock_put_key;
                case -EFAULT:
                        goto uaddr_faulted;
                case -EAGAIN:
                        /*
                         * Task is exiting and we just wait for the
                         * exit to complete.
                         */
                        queue_unlock(&q, hb);
                        put_futex_key(fshared, &q.key);
                        cond_resched();
                        goto retry;
                default:
                        goto out_unlock_put_key;
                }
        }

        /*
         * Only actually queue now that the atomic ops are done:
         */
        queue_me(&q, hb);

        WARN_ON(!q.pi_state);
        /*
         * Block on the PI mutex:
         */
        if (!trylock)
                ret = rt_mutex_timed_lock(&q.pi_state->pi_mutex, to, 1);
        else {
                ret = rt_mutex_trylock(&q.pi_state->pi_mutex);
                /* Fixup the trylock return value: */
                ret = ret ? 0 : -EWOULDBLOCK;
        }

        spin_lock(q.lock_ptr);
        /*
         * Fixup the pi_state owner and possibly acquire the lock if we
         * haven't already.
         */
        res = fixup_owner(uaddr, fshared, &q, !ret);
        /*
         * If fixup_owner() returned an error, propagate that.  If it acquired
         * the lock, clear our -ETIMEDOUT or -EINTR.
         */
        if (res)
                ret = (res < 0) ? res : 0;

        /*
         * If fixup_owner() faulted and was unable to handle the fault, unlock
         * it and return the fault to userspace.
         */
        if (ret && (rt_mutex_owner(&q.pi_state->pi_mutex) == current))
                rt_mutex_unlock(&q.pi_state->pi_mutex);

        /* Unqueue and drop the lock */
        unqueue_me_pi(&q);

        goto out_put_key;

out_unlock_put_key:
        queue_unlock(&q, hb);

out_put_key:
        put_futex_key(fshared, &q.key);
out:
2006 if (to) 1997 if (to)
2007 destroy_hrtimer_on_stack(&to->timer); 1998 destroy_hrtimer_on_stack(&to->timer);
2008 return ret != -EINTR ? ret : -ERESTARTNOINTR; 1999 return ret != -EINTR ? ret : -ERESTARTNOINTR;
2009 2000
2010 uaddr_faulted: 2001 uaddr_faulted:
2011 queue_unlock(&q, hb); 2002 queue_unlock(&q, hb);
2012 2003
2013 ret = fault_in_user_writeable(uaddr); 2004 ret = fault_in_user_writeable(uaddr);
2014 if (ret) 2005 if (ret)
2015 goto out_put_key; 2006 goto out_put_key;
2016 2007
2017 if (!fshared) 2008 if (!fshared)
2018 goto retry_private; 2009 goto retry_private;
2019 2010
2020 put_futex_key(fshared, &q.key); 2011 put_futex_key(fshared, &q.key);
2021 goto retry; 2012 goto retry;
2022 } 2013 }
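For illustration, a minimal userspace sketch of the locking protocol this function backs, assuming raw syscall(2) access; the pi_lock() helper is hypothetical, and error handling and EINTR restart are omitted. The fast path is the 0 -> TID compare-and-swap with no kernel involvement; only a contended lock falls back to FUTEX_LOCK_PI.

#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

/* Illustrative sketch only; not part of this commit. */
static void pi_lock(_Atomic uint32_t *futex_word)
{
	uint32_t expected = 0;
	uint32_t tid = (uint32_t)syscall(SYS_gettid);

	/* Fast path: 0 -> TID transition entirely in userspace. */
	if (atomic_compare_exchange_strong(futex_word, &expected, tid))
		return;

	/* Contended: the kernel queues us, blocks, and does PI boosting. */
	syscall(SYS_futex, futex_word, FUTEX_LOCK_PI, 0, NULL, NULL, 0);
}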

/*
 * Userspace attempted a TID -> 0 atomic transition, and failed.
 * This is the in-kernel slowpath: we look up the PI state (if any),
 * and do the rt-mutex unlock.
 */
static int futex_unlock_pi(u32 __user *uaddr, int fshared)
{
	struct futex_hash_bucket *hb;
	struct futex_q *this, *next;
	u32 uval;
	struct plist_head *head;
	union futex_key key = FUTEX_KEY_INIT;
	int ret;

retry:
	if (get_user(uval, uaddr))
		return -EFAULT;
	/*
	 * We release only a lock we actually own:
	 */
	if ((uval & FUTEX_TID_MASK) != task_pid_vnr(current))
		return -EPERM;

	ret = get_futex_key(uaddr, fshared, &key);
	if (unlikely(ret != 0))
		goto out;

	hb = hash_futex(&key);
	spin_lock(&hb->lock);

	/*
	 * To avoid races, try to do the TID -> 0 atomic transition
	 * again. If it succeeds then we can return without waking
	 * anyone else up:
	 */
	if (!(uval & FUTEX_OWNER_DIED))
		uval = cmpxchg_futex_value_locked(uaddr, task_pid_vnr(current), 0);


	if (unlikely(uval == -EFAULT))
		goto pi_faulted;
	/*
	 * Rare case: we managed to release the lock atomically,
	 * no need to wake anyone else up:
	 */
	if (unlikely(uval == task_pid_vnr(current)))
		goto out_unlock;

	/*
	 * Ok, other tasks may need to be woken up - check waiters
	 * and do the wakeup if necessary:
	 */
	head = &hb->chain;

	plist_for_each_entry_safe(this, next, head, list) {
		if (!match_futex (&this->key, &key))
			continue;
		ret = wake_futex_pi(uaddr, uval, this);
		/*
		 * The atomic access to the futex value
		 * generated a pagefault, so retry the
		 * user-access and the wakeup:
		 */
		if (ret == -EFAULT)
			goto pi_faulted;
		goto out_unlock;
	}
	/*
	 * No waiters - kernel unlocks the futex:
	 */
	if (!(uval & FUTEX_OWNER_DIED)) {
		ret = unlock_futex_pi(uaddr, uval);
		if (ret == -EFAULT)
			goto pi_faulted;
	}

out_unlock:
	spin_unlock(&hb->lock);
	put_futex_key(fshared, &key);

out:
	return ret;

pi_faulted:
	spin_unlock(&hb->lock);
	put_futex_key(fshared, &key);

	ret = fault_in_user_writeable(uaddr);
	if (!ret)
		goto retry;

	return ret;
}
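The unlock side mirrors it; a sketch under the same assumptions (hypothetical pi_unlock() helper, no error handling). The userspace TID -> 0 transition only succeeds while no flag bits are set; otherwise the kernel slowpath above hands the lock off.

#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

/* Illustrative sketch only; not part of this commit. */
static void pi_unlock(_Atomic uint32_t *futex_word)
{
	uint32_t tid = (uint32_t)syscall(SYS_gettid);

	/* Fast path: TID -> 0 succeeds only with no waiter/died bits set. */
	if (atomic_compare_exchange_strong(futex_word, &tid, 0))
		return;

	/* FUTEX_WAITERS (or FUTEX_OWNER_DIED) was set: kernel slowpath. */
	syscall(SYS_futex, futex_word, FUTEX_UNLOCK_PI, 0, NULL, NULL, 0);
}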

/**
 * handle_early_requeue_pi_wakeup() - Detect early wakeup on the initial futex
 * @hb:		the hash_bucket futex_q was originally enqueued on
 * @q:		the futex_q woken while waiting to be requeued
 * @key2:	the futex_key of the requeue target futex
 * @timeout:	the timeout associated with the wait (NULL if none)
 *
 * Detect if the task was woken on the initial futex as opposed to the requeue
 * target futex. If so, determine if it was a timeout or a signal that caused
 * the wakeup and return the appropriate error code to the caller. Must be
 * called with the hb lock held.
 *
 * Returns
 *  0 - no early wakeup detected
 * <0 - -ETIMEDOUT or -ERESTARTNOINTR
 */
static inline
int handle_early_requeue_pi_wakeup(struct futex_hash_bucket *hb,
				   struct futex_q *q, union futex_key *key2,
				   struct hrtimer_sleeper *timeout)
{
	int ret = 0;

	/*
	 * With the hb lock held, we avoid races while we process the wakeup.
	 * We only need to hold hb (and not hb2) to ensure atomicity as the
	 * wakeup code can't change q.key from uaddr to uaddr2 if we hold hb.
	 * It can't be requeued from uaddr2 to something else since we don't
	 * support a PI aware source futex for requeue.
	 */
	if (!match_futex(&q->key, key2)) {
		WARN_ON(q->lock_ptr && (&hb->lock != q->lock_ptr));
		/*
		 * We were woken prior to requeue by a timeout or a signal.
		 * Unqueue the futex_q and determine which it was.
		 */
		plist_del(&q->list, &q->list.plist);

		/* Handle spurious wakeups gracefully */
		ret = -EWOULDBLOCK;
		if (timeout && !timeout->task)
			ret = -ETIMEDOUT;
		else if (signal_pending(current))
			ret = -ERESTARTNOINTR;
	}
	return ret;
}

/**
 * futex_wait_requeue_pi() - Wait on uaddr and take uaddr2
 * @uaddr:	the futex we initially wait on (non-pi)
 * @fshared:	whether the futexes are shared (1) or not (0). They must be
 *		the same type, no requeueing from private to shared, etc.
 * @val:	the expected value of uaddr
 * @abs_time:	absolute timeout
 * @bitset:	32 bit wakeup bitset set by userspace, defaults to all
 * @clockrt:	whether to use CLOCK_REALTIME (1) or CLOCK_MONOTONIC (0)
 * @uaddr2:	the pi futex we will take prior to returning to user-space
 *
 * The caller will wait on uaddr and will be requeued by futex_requeue() to
 * uaddr2 which must be PI aware. Normal wakeup will wake on uaddr2 and
 * complete the acquisition of the rt_mutex prior to returning to userspace.
 * This ensures the rt_mutex maintains an owner when it has waiters; without
 * one, the pi logic wouldn't know which task to boost/deboost, if there was a
 * need to.
 *
 * We call schedule in futex_wait_queue_me() when we enqueue and return there
 * via the following:
 * 1) wakeup on uaddr2 after an atomic lock acquisition by futex_requeue()
 * 2) wakeup on uaddr2 after a requeue
 * 3) signal
 * 4) timeout
 *
 * If 3, cleanup and return -ERESTARTNOINTR.
 *
 * If 2, we may then block on trying to take the rt_mutex and return via:
 * 5) successful lock
 * 6) signal
 * 7) timeout
 * 8) other lock acquisition failure
 *
 * If 6, return -EWOULDBLOCK (restarting the syscall would do the same).
 *
 * If 4 or 7, we cleanup and return with -ETIMEDOUT.
 *
 * Returns:
 *  0 - On success
 * <0 - On error
 */
static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
				 u32 val, ktime_t *abs_time, u32 bitset,
				 int clockrt, u32 __user *uaddr2)
{
	struct hrtimer_sleeper timeout, *to = NULL;
	struct rt_mutex_waiter rt_waiter;
	struct rt_mutex *pi_mutex = NULL;
	struct futex_hash_bucket *hb;
	union futex_key key2;
	struct futex_q q;
	int res, ret;

	if (!bitset)
		return -EINVAL;

	if (abs_time) {
		to = &timeout;
		hrtimer_init_on_stack(&to->timer, clockrt ? CLOCK_REALTIME :
				      CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
		hrtimer_init_sleeper(to, current);
		hrtimer_set_expires_range_ns(&to->timer, *abs_time,
					     current->timer_slack_ns);
	}

	/*
	 * The waiter is allocated on our stack, manipulated by the requeue
	 * code while we sleep on uaddr.
	 */
	debug_rt_mutex_init_waiter(&rt_waiter);
	rt_waiter.task = NULL;

	key2 = FUTEX_KEY_INIT;
	ret = get_futex_key(uaddr2, fshared, &key2);
	if (unlikely(ret != 0))
		goto out;

	q.pi_state = NULL;
	q.bitset = bitset;
	q.rt_waiter = &rt_waiter;
	q.requeue_pi_key = &key2;

	/* Prepare to wait on uaddr. */
	ret = futex_wait_setup(uaddr, val, fshared, &q, &hb);
	if (ret)
		goto out_key2;

	/* Queue the futex_q, drop the hb lock, wait for wakeup. */
	futex_wait_queue_me(hb, &q, to);

	spin_lock(&hb->lock);
	ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
	spin_unlock(&hb->lock);
	if (ret)
		goto out_put_keys;

	/*
	 * In order for us to be here, we know our q.key == key2, and since
	 * we took the hb->lock above, we also know that futex_requeue() has
	 * completed and we no longer have to concern ourselves with a wakeup
	 * race with the atomic proxy lock acquisition by the requeue code.
	 */

	/* Check if the requeue code acquired the second futex for us. */
	if (!q.rt_waiter) {
		/*
		 * Got the lock. We might not be the anticipated owner if we
		 * did a lock-steal - fix up the PI-state in that case.
		 */
		if (q.pi_state && (q.pi_state->owner != current)) {
			spin_lock(q.lock_ptr);
			ret = fixup_pi_state_owner(uaddr2, &q, current,
						   fshared);
			spin_unlock(q.lock_ptr);
		}
	} else {
		/*
		 * We have been woken up by futex_unlock_pi(), a timeout, or a
		 * signal. futex_unlock_pi() will not destroy the lock_ptr nor
		 * the pi_state.
		 */
		WARN_ON(!q.pi_state);
		pi_mutex = &q.pi_state->pi_mutex;
		ret = rt_mutex_finish_proxy_lock(pi_mutex, to, &rt_waiter, 1);
		debug_rt_mutex_free_waiter(&rt_waiter);

		spin_lock(q.lock_ptr);
		/*
		 * Fixup the pi_state owner and possibly acquire the lock if we
		 * haven't already.
		 */
		res = fixup_owner(uaddr2, fshared, &q, !ret);
		/*
		 * If fixup_owner() returned an error, propagate that. If it
		 * acquired the lock, clear -ETIMEDOUT or -EINTR.
		 */
		if (res)
			ret = (res < 0) ? res : 0;

		/* Unqueue and drop the lock. */
		unqueue_me_pi(&q);
	}

	/*
	 * If fixup_pi_state_owner() faulted and was unable to handle the
	 * fault, unlock the rt_mutex and return the fault to userspace.
	 */
	if (ret == -EFAULT) {
		if (rt_mutex_owner(pi_mutex) == current)
			rt_mutex_unlock(pi_mutex);
	} else if (ret == -EINTR) {
		/*
		 * We've already been requeued, but cannot restart by calling
		 * futex_lock_pi() directly. We could restart this syscall, but
		 * it would detect that the user space "val" changed and return
		 * -EWOULDBLOCK. Save the overhead of the restart and return
		 * -EWOULDBLOCK directly.
		 */
		ret = -EWOULDBLOCK;
	}

out_put_keys:
	put_futex_key(fshared, &q.key);
out_key2:
	put_futex_key(fshared, &key2);

out:
	if (to) {
		hrtimer_cancel(&to->timer);
		destroy_hrtimer_on_stack(&to->timer);
	}
	return ret;
}
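For illustration, a condvar-style pairing that drives this path, sketched under the same raw-syscall assumptions; cond_wait()/cond_broadcast() and the two futex words are hypothetical. Note that for FUTEX_CMP_REQUEUE_PI the timeout argument slot carries the requeue count (val2), as decoded in the futex syscall entry at the end of this file.

#include <stdint.h>
#include <limits.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

/* Illustrative sketch only; not part of this commit. */
static void cond_wait(uint32_t *cond, uint32_t *mutex, uint32_t condval)
{
	/* (the caller has dropped the PI mutex first, as a condvar would) */
	syscall(SYS_futex, cond, FUTEX_WAIT_REQUEUE_PI, condval,
		NULL /* no timeout */, mutex, 0);
	/* On return the kernel has acquired *mutex for us, or we finished
	 * the rt_mutex acquisition ourselves, per the cases listed above. */
}

static void cond_broadcast(uint32_t *cond, uint32_t *mutex, uint32_t condval)
{
	/* Wake one waiter, requeue the rest onto the PI mutex; the
	 * timeout slot carries the requeue count (val2). */
	syscall(SYS_futex, cond, FUTEX_CMP_REQUEUE_PI, 1,
		(const struct timespec *)(unsigned long)INT_MAX,
		mutex, condval);
}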

/*
 * Support for robust futexes: the kernel cleans up held futexes at
 * thread exit time.
 *
 * Implementation: user-space maintains a per-thread list of locks it
 * is holding. Upon do_exit(), the kernel carefully walks this list,
 * and marks all locks that are owned by this thread with the
 * FUTEX_OWNER_DIED bit, and wakes up a waiter (if any). The list is
 * always manipulated with the lock held, so the list is private and
 * per-thread. Userspace also maintains a per-thread 'list_op_pending'
 * field, to allow the kernel to clean up if the thread dies after
 * acquiring the lock, but just before it could have added itself to
 * the list. There can only be one such pending lock.
 */

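For illustration, the per-thread registration a threading library performs against this design, done once at thread start via sys_set_robust_list() below; a sketch assuming a hypothetical my_robust_mutex layout (a robust_list node followed by the owner-TID futex word).

#include <stddef.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

struct my_robust_mutex {		/* hypothetical layout */
	struct robust_list	list;		/* linked while held */
	uint32_t		futex_word;	/* TID of the owner */
};

static __thread struct robust_list_head robust_head;

/* Illustrative sketch only; not part of this commit. */
static void thread_register_robust_list(void)
{
	robust_head.list.next = &robust_head.list;	/* empty list */
	robust_head.futex_offset =
		offsetof(struct my_robust_mutex, futex_word) -
		offsetof(struct my_robust_mutex, list);
	robust_head.list_op_pending = NULL;
	syscall(SYS_set_robust_list, &robust_head, sizeof(robust_head));
}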
/**
 * sys_set_robust_list() - Set the robust-futex list head of a task
 * @head:	pointer to the list-head
 * @len:	length of the list-head, as userspace expects
 */
SYSCALL_DEFINE2(set_robust_list, struct robust_list_head __user *, head,
		size_t, len)
{
	if (!futex_cmpxchg_enabled)
		return -ENOSYS;
	/*
	 * The kernel knows only one size for now:
	 */
	if (unlikely(len != sizeof(*head)))
		return -EINVAL;

	current->robust_list = head;

	return 0;
}

/**
 * sys_get_robust_list() - Get the robust-futex list head of a task
 * @pid:	pid of the process [zero for current task]
 * @head_ptr:	pointer to a list-head pointer, the kernel fills it in
 * @len_ptr:	pointer to a length field, the kernel fills in the header size
 */
SYSCALL_DEFINE3(get_robust_list, int, pid,
		struct robust_list_head __user * __user *, head_ptr,
		size_t __user *, len_ptr)
{
	struct robust_list_head __user *head;
	unsigned long ret;
	const struct cred *cred = current_cred(), *pcred;

	if (!futex_cmpxchg_enabled)
		return -ENOSYS;

	if (!pid)
		head = current->robust_list;
	else {
		struct task_struct *p;

		ret = -ESRCH;
		rcu_read_lock();
		p = find_task_by_vpid(pid);
		if (!p)
			goto err_unlock;
		ret = -EPERM;
		pcred = __task_cred(p);
		if (cred->euid != pcred->euid &&
		    cred->euid != pcred->uid &&
		    !capable(CAP_SYS_PTRACE))
			goto err_unlock;
		head = p->robust_list;
		rcu_read_unlock();
	}

	if (put_user(sizeof(*head), len_ptr))
		return -EFAULT;
	return put_user(head, head_ptr);

err_unlock:
	rcu_read_unlock();

	return ret;
}

/*
 * Process a futex-list entry, check whether it's owned by the
 * dying task, and do notification if so:
 */
int handle_futex_death(u32 __user *uaddr, struct task_struct *curr, int pi)
{
	u32 uval, nval, mval;

retry:
	if (get_user(uval, uaddr))
		return -1;

	if ((uval & FUTEX_TID_MASK) == task_pid_vnr(curr)) {
		/*
		 * Ok, this dying thread is truly holding a futex
		 * of interest. Set the OWNER_DIED bit atomically
		 * via cmpxchg, and if the value had FUTEX_WAITERS
		 * set, wake up a waiter (if any). (We have to do a
		 * futex_wake() even if OWNER_DIED is already set -
		 * to handle the rare but possible case of recursive
		 * thread-death.) The rest of the cleanup is done in
		 * userspace.
		 */
		mval = (uval & FUTEX_WAITERS) | FUTEX_OWNER_DIED;
		nval = futex_atomic_cmpxchg_inatomic(uaddr, uval, mval);

		if (nval == -EFAULT)
			return -1;

		if (nval != uval)
			goto retry;

		/*
		 * Wake robust non-PI futexes here. The wakeup of
		 * PI futexes happens in exit_pi_state():
		 */
		if (!pi && (uval & FUTEX_WAITERS))
			futex_wake(uaddr, 1, 1, FUTEX_BITSET_MATCH_ANY);
	}
	return 0;
}
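For illustration, how a later acquirer consumes the FUTEX_OWNER_DIED bit set above; a simplified sketch with a hypothetical robust_try_lock() (retry loop and waiter handling omitted). The new owner takes over the lock, keeps the bit set, and reports EOWNERDEAD so the caller can repair the protected state before clearing it.

#include <stdatomic.h>
#include <stdint.h>
#include <errno.h>
#include <linux/futex.h>

/* Illustrative sketch only; not part of this commit. */
static int robust_try_lock(_Atomic uint32_t *futex_word, uint32_t tid)
{
	uint32_t val = atomic_load(futex_word);
	uint32_t expected = 0;

	if (val & FUTEX_OWNER_DIED) {
		/* Take over the dead owner's lock, preserving the bit
		 * until the protected data has been made consistent. */
		if (atomic_compare_exchange_strong(futex_word, &val,
						   tid | FUTEX_OWNER_DIED))
			return EOWNERDEAD;	/* caller must recover */
	}
	if (atomic_compare_exchange_strong(futex_word, &expected, tid))
		return 0;
	return EBUSY;
}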

/*
 * Fetch a robust-list pointer. Bit 0 signals PI futexes:
 */
static inline int fetch_robust_entry(struct robust_list __user **entry,
				     struct robust_list __user * __user *head,
				     int *pi)
{
	unsigned long uentry;

	if (get_user(uentry, (unsigned long __user *)head))
		return -EFAULT;

	*entry = (void __user *)(uentry & ~1UL);
	*pi = uentry & 1;

	return 0;
}
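The matching userspace encoding, for illustration: list entries are word-aligned, so bit 0 is free to mark PI locks, which fetch_robust_entry() strips and reports via *pi. The tag_pi_entry() helper is hypothetical.

#include <stdint.h>
#include <linux/futex.h>

/* Illustrative sketch only; not part of this commit. */
static struct robust_list *tag_pi_entry(struct robust_list *entry)
{
	return (struct robust_list *)((uintptr_t)entry | 1UL);
}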

/*
 * Walk curr->robust_list (very carefully, it's a userspace list!)
 * and mark any locks found there dead, and notify any waiters.
 *
 * We silently return on any sign of list-walking problem.
 */
void exit_robust_list(struct task_struct *curr)
{
	struct robust_list_head __user *head = curr->robust_list;
	struct robust_list __user *entry, *next_entry, *pending;
	unsigned int limit = ROBUST_LIST_LIMIT, pi, next_pi, pip;
	unsigned long futex_offset;
	int rc;

	if (!futex_cmpxchg_enabled)
		return;

	/*
	 * Fetch the list head (which was registered earlier, via
	 * sys_set_robust_list()):
	 */
	if (fetch_robust_entry(&entry, &head->list.next, &pi))
		return;
	/*
	 * Fetch the relative futex offset:
	 */
	if (get_user(futex_offset, &head->futex_offset))
		return;
	/*
	 * Fetch any possibly pending lock-add first, and handle it
	 * if it exists:
	 */
	if (fetch_robust_entry(&pending, &head->list_op_pending, &pip))
		return;

	next_entry = NULL;	/* avoid warning with gcc */
	while (entry != &head->list) {
		/*
		 * Fetch the next entry in the list before calling
		 * handle_futex_death:
		 */
		rc = fetch_robust_entry(&next_entry, &entry->next, &next_pi);
		/*
		 * A pending lock might already be on the list, so
		 * don't process it twice:
		 */
		if (entry != pending)
			if (handle_futex_death((void __user *)entry + futex_offset,
						curr, pi))
				return;
		if (rc)
			return;
		entry = next_entry;
		pi = next_pi;
		/*
		 * Avoid excessively long or circular lists:
		 */
		if (!--limit)
			break;

		cond_resched();
	}

	if (pending)
		handle_futex_death((void __user *)pending + futex_offset,
				   curr, pip);
}

long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
		u32 __user *uaddr2, u32 val2, u32 val3)
{
	int clockrt, ret = -ENOSYS;
	int cmd = op & FUTEX_CMD_MASK;
	int fshared = 0;

	if (!(op & FUTEX_PRIVATE_FLAG))
		fshared = 1;

	clockrt = op & FUTEX_CLOCK_REALTIME;
	if (clockrt && cmd != FUTEX_WAIT_BITSET && cmd != FUTEX_WAIT_REQUEUE_PI)
		return -ENOSYS;

	switch (cmd) {
	case FUTEX_WAIT:
		val3 = FUTEX_BITSET_MATCH_ANY;
	case FUTEX_WAIT_BITSET:
		ret = futex_wait(uaddr, fshared, val, timeout, val3, clockrt);
		break;
	case FUTEX_WAKE:
		val3 = FUTEX_BITSET_MATCH_ANY;
	case FUTEX_WAKE_BITSET:
		ret = futex_wake(uaddr, fshared, val, val3);
		break;
	case FUTEX_REQUEUE:
		ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, NULL, 0);
		break;
	case FUTEX_CMP_REQUEUE:
		ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, &val3,
				    0);
		break;
	case FUTEX_WAKE_OP:
		ret = futex_wake_op(uaddr, fshared, uaddr2, val, val2, val3);
		break;
	case FUTEX_LOCK_PI:
		if (futex_cmpxchg_enabled)
			ret = futex_lock_pi(uaddr, fshared, val, timeout, 0);
		break;
	case FUTEX_UNLOCK_PI:
		if (futex_cmpxchg_enabled)
			ret = futex_unlock_pi(uaddr, fshared);
		break;
	case FUTEX_TRYLOCK_PI:
		if (futex_cmpxchg_enabled)
			ret = futex_lock_pi(uaddr, fshared, 0, timeout, 1);
		break;
	case FUTEX_WAIT_REQUEUE_PI:
		val3 = FUTEX_BITSET_MATCH_ANY;
		ret = futex_wait_requeue_pi(uaddr, fshared, val, timeout, val3,
					    clockrt, uaddr2);
		break;
	case FUTEX_CMP_REQUEUE_PI:
		ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, &val3,
				    1);
		break;
	default:
		ret = -ENOSYS;
	}
	return ret;
}


SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
		struct timespec __user *, utime, u32 __user *, uaddr2,
		u32, val3)
{
	struct timespec ts;
	ktime_t t, *tp = NULL;
	u32 val2 = 0;
	int cmd = op & FUTEX_CMD_MASK;

	if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI ||
		      cmd == FUTEX_WAIT_BITSET ||
		      cmd == FUTEX_WAIT_REQUEUE_PI)) {
		if (copy_from_user(&ts, utime, sizeof(ts)) != 0)
			return -EFAULT;
		if (!timespec_valid(&ts))
			return -EINVAL;

		t = timespec_to_ktime(ts);
		if (cmd == FUTEX_WAIT)
			t = ktime_add_safe(ktime_get(), t);
		tp = &t;
	}
	/*
	 * requeue parameter in 'utime' if cmd == FUTEX_*_REQUEUE_*.
	 * number of waiters to wake in 'utime' if cmd == FUTEX_WAKE_OP.
	 */
	if (cmd == FUTEX_REQUEUE || cmd == FUTEX_CMP_REQUEUE ||
	    cmd == FUTEX_CMP_REQUEUE_PI || cmd == FUTEX_WAKE_OP)
		val2 = (u32) (unsigned long) utime;

	return do_futex(uaddr, op, val, tp, uaddr2, val2, val3);
}
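Since glibc ships no futex() wrapper, userspace reaches this entry point through syscall(2); a sketch with a hypothetical sys_futex() helper, showing how the timeout slot doubles as the integer val2 for the requeue and wake-op commands, exactly as decoded above.

#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

/* Illustrative sketch only; not part of this commit. */
static long sys_futex(uint32_t *uaddr, int op, uint32_t val,
		      const struct timespec *timeout_or_val2,
		      uint32_t *uaddr2, uint32_t val3)
{
	return syscall(SYS_futex, uaddr, op, val, timeout_or_val2,
		       uaddr2, val3);
}

/* e.g. requeue up to 128 waiters from f1 to f2 if *f1 == expected:
 * sys_futex(f1, FUTEX_CMP_REQUEUE, 1,
 *	     (const struct timespec *)(unsigned long)128, f2, expected);
 */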

static int __init futex_init(void)
{
	u32 curval;
	int i;

	/*
	 * This will fail and we want it. Some arch implementations do
	 * runtime detection of the futex_atomic_cmpxchg_inatomic()
	 * functionality. We want to know that before we call in any
	 * of the complex code paths. Also we want to prevent
	 * registration of robust lists in that case. NULL is
	 * guaranteed to fault and we get -EFAULT on functional
	 * implementation, the non functional ones will return
	 * -ENOSYS.
	 */
	curval = cmpxchg_futex_value_locked(NULL, 0, 0);
	if (curval == -EFAULT)
		futex_cmpxchg_enabled = 1;

	for (i = 0; i < ARRAY_SIZE(futex_queues); i++) {
		plist_head_init(&futex_queues[i].chain, &futex_queues[i].lock);
		spin_lock_init(&futex_queues[i].lock);
	}

	return 0;
}
__initcall(futex_init);