futexes: fix fault handling in futex_lock_pi

This patch addresses a very sporadic pi-futex related failure in highly threaded java apps on large SMP systems. David Holmes reported that the pi_state consistency check in lookup_pi_state triggered with his test application. This means that the kernel internal pi_state and the user space futex variable are out of sync. First we assumed that this is a user space data corruption, but deeper investigation revieled that the problem happend because the pi-futex code is not handling a fault in the futex_lock_pi path when the user space variable needs to be fixed up. The fault happens when a fork mapped the anon memory which contains the futex readonly for COW or the page got swapped out exactly between the unlock of the futex and the return of either the new futex owner or the task which was the expected owner but failed to acquire the kernel internal rtmutex. The current futex_lock_pi() code drops out with an inconsistent in case it faults and returns -EFAULT to user space. User space has no way to fixup that state. When we wrote this code we thought that we could not drop the hash bucket lock at this point to handle the fault. After analysing the code again it turned out to be wrong because there are only two tasks involved which might modify the pi_state and the user space variable: - the task which acquired the rtmutex - the pending owner of the pi_state which did not get the rtmutex Both tasks drop into the fixup_pi_state() function before returning to user space. The first task which acquired the hash bucket lock faults in the fixup of the user space variable, drops the spinlock and calls futex_handle_fault() to fault in the page. Now the second task could acquire the hash bucket lock and tries to fixup the user space variable as well. It either faults as well or it succeeds because the first task already faulted the page in. One caveat is to avoid a double fixup. After returning from the fault handling we reacquire the hash bucket lock and check whether the pi_state owner has been modified already. Reported-by: David Holmes <david.holmes@sun.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Holmes <david.holmes@sun.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: <stable@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> kernel/futex.c | 93 ++++++++++++++++++++++++++++++++++++++++++++------------- 1 file changed, 73 insertions(+), 20 deletions(-)

futexes: fix fault handling in futex_lock_pi
This patch addresses a very sporadic pi-futex related failure in highly threaded java apps on large SMP systems. David Holmes reported that the pi_state consistency check in lookup_pi_state triggered with his test application. This means that the kernel internal pi_state and the user space futex variable are out of sync. First we assumed that this is a user space data corruption, but deeper investigation revieled that the problem happend because the pi-futex code is not handling a fault in the futex_lock_pi path when the user space variable needs to be fixed up. The fault happens when a fork mapped the anon memory which contains the futex readonly for COW or the page got swapped out exactly between the unlock of the futex and the return of either the new futex owner or the task which was the expected owner but failed to acquire the kernel internal rtmutex. The current futex_lock_pi() code drops out with an inconsistent in case it faults and returns -EFAULT to user space. User space has no way to fixup that state. When we wrote this code we thought that we could not drop the hash bucket lock at this point to handle the fault. After analysing the code again it turned out to be wrong because there are only two tasks involved which might modify the pi_state and the user space variable: - the task which acquired the rtmutex - the pending owner of the pi_state which did not get the rtmutex Both tasks drop into the fixup_pi_state() function before returning to user space. The first task which acquired the hash bucket lock faults in the fixup of the user space variable, drops the spinlock and calls futex_handle_fault() to fault in the page. Now the second task could acquire the hash bucket lock and tries to fixup the user space variable as well. It either faults as well or it succeeds because the first task already faulted the page in. One caveat is to avoid a double fixup. After returning from the fault handling we reacquire the hash bucket lock and check whether the pi_state owner has been modified already. Reported-by: David Holmes <david.holmes@sun.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Holmes <david.holmes@sun.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: <stable@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> kernel/futex.c | 93 ++++++++++++++++++++++++++++++++++++++++++++------------- 1 file changed, 73 insertions(+), 20 deletions(-)
Thomas Gleixner · Ingo Molnar
1 parent 481c5346d0
Showing 1 changed file with 73 additions and 20 deletions Side-by-side Diff
kernel/futex.c
@@ -1096,21 +1096,64 @@
  * private futexes.
  */
 static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q,
-				struct task_struct *newowner)
+				struct task_struct *newowner,
+				struct rw_semaphore *fshared)
 {
 	u32 newtid = task_pid_vnr(newowner) | FUTEX_WAITERS;
 	struct futex_pi_state *pi_state = q->pi_state;
+	struct task_struct *oldowner = pi_state->owner;
 	u32 uval, curval, newval;
-	int ret;
+	int ret, attempt = 0;
  
 	/* Owner died? */
+	if (!pi_state->owner)
+		newtid |= FUTEX_OWNER_DIED;
+
+	/*
+	 * We are here either because we stole the rtmutex from the
+	 * pending owner or we are the pending owner which failed to
+	 * get the rtmutex. We have to replace the pending owner TID
+	 * in the user space variable. This must be atomic as we have
+	 * to preserve the owner died bit here.
+	 *
+	 * Note: We write the user space value _before_ changing the
+	 * pi_state because we can fault here. Imagine swapped out
+	 * pages or a fork, which was running right before we acquired
+	 * mmap_sem, that marked all the anonymous memory readonly for
+	 * cow.
+	 *
+	 * Modifying pi_state _before_ the user space value would
+	 * leave the pi_state in an inconsistent state when we fault
+	 * here, because we need to drop the hash bucket lock to
+	 * handle the fault. This might be observed in the PID check
+	 * in lookup_pi_state.
+	 */
+retry:
+	if (get_futex_value_locked(&uval, uaddr))
+		goto handle_fault;
+
+	while (1) {
+		newval = (uval & FUTEX_OWNER_DIED) | newtid;
+
+		curval = cmpxchg_futex_value_locked(uaddr, uval, newval);
+
+		if (curval == -EFAULT)
+			goto handle_fault;
+		if (curval == uval)
+			break;
+		uval = curval;
+	}
+
+	/*
+	 * We fixed up user space. Now we need to fix the pi_state
+	 * itself.
+	 */
 	if (pi_state->owner != NULL) {
 		spin_lock_irq(&pi_state->owner->pi_lock);
 		WARN_ON(list_empty(&pi_state->list));
 		list_del_init(&pi_state->list);
 		spin_unlock_irq(&pi_state->owner->pi_lock);
-	} else
-		newtid |= FUTEX_OWNER_DIED;
+	}
  
 	pi_state->owner = newowner;
  
  
  
  
  
  
@@ -1118,26 +1161,35 @@
 	WARN_ON(!list_empty(&pi_state->list));
 	list_add(&pi_state->list, &newowner->pi_state_list);
 	spin_unlock_irq(&newowner->pi_lock);
+	return 0;
  
 	/*
-	 * We own it, so we have to replace the pending owner
-	 * TID. This must be atomic as we have preserve the
-	 * owner died bit here.
+	 * To handle the page fault we need to drop the hash bucket
+	 * lock here. That gives the other task (either the pending
+	 * owner itself or the task which stole the rtmutex) the
+	 * chance to try the fixup of the pi_state. So once we are
+	 * back from handling the fault we need to check the pi_state
+	 * after reacquiring the hash bucket lock and before trying to
+	 * do another fixup. When the fixup has been done already we
+	 * simply return.
 	 */
-	ret = get_futex_value_locked(&uval, uaddr);
+handle_fault:
+	spin_unlock(q->lock_ptr);
  
-	while (!ret) {
-		newval = (uval & FUTEX_OWNER_DIED) | newtid;
+	ret = futex_handle_fault((unsigned long)uaddr, fshared, attempt++);
  
-		curval = cmpxchg_futex_value_locked(uaddr, uval, newval);
+	spin_lock(q->lock_ptr);
  
-		if (curval == -EFAULT)
-			ret = -EFAULT;
-		if (curval == uval)
-			break;
-		uval = curval;
-	}
-	return ret;
+	/*
+	 * Check if someone else fixed it for us:
+	 */
+	if (pi_state->owner != oldowner)
+		return 0;
+
+	if (ret)
+		return ret;
+
+	goto retry;
 }
  
 /*
@@ -1507,7 +1559,7 @@
 		 * that case:
 		 */
 		if (q.pi_state->owner != curr)
-			ret = fixup_pi_state_owner(uaddr, &q, curr);
+			ret = fixup_pi_state_owner(uaddr, &q, curr, fshared);
 	} else {
 		/*
 		 * Catch the rare case, where the lock was released
@@ -1539,7 +1591,8 @@
 				int res;
  
 				owner = rt_mutex_owner(&q.pi_state->pi_mutex);
-				res = fixup_pi_state_owner(uaddr, &q, owner);
+				res = fixup_pi_state_owner(uaddr, &q, owner,
+							   fshared);
  
 				/* propagate -EFAULT, if the fixup failed */
 				if (res)
...	...	@@ -1096,21 +1096,64 @@
1096	1096	* private futexes.
1097	1097	*/
1098	1098	static int fixup_pi_state_owner(u32 __user uaddr, struct futex_q q,
1099		- struct task_struct *newowner)
	1099	+ struct task_struct *newowner,
	1100	+ struct rw_semaphore *fshared)
1100	1101	{
1101	1102	u32 newtid = task_pid_vnr(newowner) \| FUTEX_WAITERS;
1102	1103	struct futex_pi_state *pi_state = q->pi_state;
	1104	+ struct task_struct *oldowner = pi_state->owner;
1103	1105	u32 uval, curval, newval;
1104		- int ret;
	1106	+ int ret, attempt = 0;
1105	1107
1106	1108	/* Owner died? */
	1109	+ if (!pi_state->owner)
	1110	+ newtid \|= FUTEX_OWNER_DIED;
	1111	+
	1112	+ /*
	1113	+ * We are here either because we stole the rtmutex from the
	1114	+ * pending owner or we are the pending owner which failed to
	1115	+ * get the rtmutex. We have to replace the pending owner TID
	1116	+ * in the user space variable. This must be atomic as we have
	1117	+ * to preserve the owner died bit here.
	1118	+ *
	1119	+ * Note: We write the user space value _before_ changing the
	1120	+ * pi_state because we can fault here. Imagine swapped out
	1121	+ * pages or a fork, which was running right before we acquired
	1122	+ * mmap_sem, that marked all the anonymous memory readonly for
	1123	+ * cow.
	1124	+ *
	1125	+ * Modifying pi_state _before_ the user space value would
	1126	+ * leave the pi_state in an inconsistent state when we fault
	1127	+ * here, because we need to drop the hash bucket lock to
	1128	+ * handle the fault. This might be observed in the PID check
	1129	+ * in lookup_pi_state.
	1130	+ */
	1131	+retry:
	1132	+ if (get_futex_value_locked(&uval, uaddr))
	1133	+ goto handle_fault;
	1134	+
	1135	+ while (1) {
	1136	+ newval = (uval & FUTEX_OWNER_DIED) \| newtid;
	1137	+
	1138	+ curval = cmpxchg_futex_value_locked(uaddr, uval, newval);
	1139	+
	1140	+ if (curval == -EFAULT)
	1141	+ goto handle_fault;
	1142	+ if (curval == uval)
	1143	+ break;
	1144	+ uval = curval;
	1145	+ }
	1146	+
	1147	+ /*
	1148	+ * We fixed up user space. Now we need to fix the pi_state
	1149	+ * itself.
	1150	+ */
1107	1151	if (pi_state->owner != NULL) {
1108	1152	spin_lock_irq(&pi_state->owner->pi_lock);
1109	1153	WARN_ON(list_empty(&pi_state->list));
1110	1154	list_del_init(&pi_state->list);
1111	1155	spin_unlock_irq(&pi_state->owner->pi_lock);
1112		- } else
1113		- newtid \|= FUTEX_OWNER_DIED;
	1156	+ }
1114	1157
1115	1158	pi_state->owner = newowner;
1116	1159
1117	1160
1118	1161
1119	1162
1120	1163
1121	1164
...	...	@@ -1118,26 +1161,35 @@
1118	1161	WARN_ON(!list_empty(&pi_state->list));
1119	1162	list_add(&pi_state->list, &newowner->pi_state_list);
1120	1163	spin_unlock_irq(&newowner->pi_lock);
	1164	+ return 0;
1121	1165
1122	1166	/*
1123		- * We own it, so we have to replace the pending owner
1124		- * TID. This must be atomic as we have preserve the
1125		- * owner died bit here.
	1167	+ * To handle the page fault we need to drop the hash bucket
	1168	+ * lock here. That gives the other task (either the pending
	1169	+ * owner itself or the task which stole the rtmutex) the
	1170	+ * chance to try the fixup of the pi_state. So once we are
	1171	+ * back from handling the fault we need to check the pi_state
	1172	+ * after reacquiring the hash bucket lock and before trying to
	1173	+ * do another fixup. When the fixup has been done already we
	1174	+ * simply return.
1126	1175	*/
1127		- ret = get_futex_value_locked(&uval, uaddr);
	1176	+handle_fault:
	1177	+ spin_unlock(q->lock_ptr);
1128	1178
1129		- while (!ret) {
1130		- newval = (uval & FUTEX_OWNER_DIED) \| newtid;
	1179	+ ret = futex_handle_fault((unsigned long)uaddr, fshared, attempt++);
1131	1180
1132		- curval = cmpxchg_futex_value_locked(uaddr, uval, newval);
	1181	+ spin_lock(q->lock_ptr);
1133	1182
1134		- if (curval == -EFAULT)
1135		- ret = -EFAULT;
1136		- if (curval == uval)
1137		- break;
1138		- uval = curval;
1139		- }
1140		- return ret;
	1183	+ /*
	1184	+ * Check if someone else fixed it for us:
	1185	+ */
	1186	+ if (pi_state->owner != oldowner)
	1187	+ return 0;
	1188	+
	1189	+ if (ret)
	1190	+ return ret;
	1191	+
	1192	+ goto retry;
1141	1193	}
1142	1194
1143	1195	/*
...	...	@@ -1507,7 +1559,7 @@
1507	1559	* that case:
1508	1560	*/
1509	1561	if (q.pi_state->owner != curr)
1510		- ret = fixup_pi_state_owner(uaddr, &q, curr);
	1562	+ ret = fixup_pi_state_owner(uaddr, &q, curr, fshared);
1511	1563	} else {
1512	1564	/*
1513	1565	* Catch the rare case, where the lock was released
...	...	@@ -1539,7 +1591,8 @@
1539	1591	int res;
1540	1592
1541	1593	owner = rt_mutex_owner(&q.pi_state->pi_mutex);
1542		- res = fixup_pi_state_owner(uaddr, &q, owner);
	1594	+ res = fixup_pi_state_owner(uaddr, &q, owner,
	1595	+ fshared);
1543	1596
1544	1597	/* propagate -EFAULT, if the fixup failed */
1545	1598	if (res)