Commit 1b7558e457ed0de61023cfc913d2c342c7c3d9f2

Authored by Thomas Gleixner
Committed by Ingo Molnar
1 parent 481c5346d0

futexes: fix fault handling in futex_lock_pi

This patch addresses a very sporadic pi-futex related failure in
highly threaded java apps on large SMP systems.

David Holmes reported that the pi_state consistency check in
lookup_pi_state triggered with his test application. This means that
the kernel internal pi_state and the user space futex variable are out
of sync. First we assumed that this is a user space data corruption,
but deeper investigation revieled that the problem happend because the
pi-futex code is not handling a fault in the futex_lock_pi path when
the user space variable needs to be fixed up.

The fault happens when a fork mapped the anon memory which contains
the futex readonly for COW or the page got swapped out exactly between
the unlock of the futex and the return of either the new futex owner
or the task which was the expected owner but failed to acquire the
kernel internal rtmutex. The current futex_lock_pi() code drops out
with an inconsistent in case it faults and returns -EFAULT to user
space. User space has no way to fixup that state.

When we wrote this code we thought that we could not drop the hash
bucket lock at this point to handle the fault.

After analysing the code again it turned out to be wrong because there
are only two tasks involved which might modify the pi_state and the
user space variable:

 - the task which acquired the rtmutex
 - the pending owner of the pi_state which did not get the rtmutex

Both tasks drop into the fixup_pi_state() function before returning to
user space. The first task which acquired the hash bucket lock faults
in the fixup of the user space variable, drops the spinlock and calls
futex_handle_fault() to fault in the page. Now the second task could
acquire the hash bucket lock and tries to fixup the user space
variable as well. It either faults as well or it succeeds because the
first task already faulted the page in.

One caveat is to avoid a double fixup. After returning from the fault
handling we reacquire the hash bucket lock and check whether the
pi_state owner has been modified already.

Reported-by: David Holmes <david.holmes@sun.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Holmes <david.holmes@sun.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

 kernel/futex.c |   93 ++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 73 insertions(+), 20 deletions(-)

Showing 1 changed file with 73 additions and 20 deletions Side-by-side Diff

... ... @@ -1096,21 +1096,64 @@
1096 1096 * private futexes.
1097 1097 */
1098 1098 static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q,
1099   - struct task_struct *newowner)
  1099 + struct task_struct *newowner,
  1100 + struct rw_semaphore *fshared)
1100 1101 {
1101 1102 u32 newtid = task_pid_vnr(newowner) | FUTEX_WAITERS;
1102 1103 struct futex_pi_state *pi_state = q->pi_state;
  1104 + struct task_struct *oldowner = pi_state->owner;
1103 1105 u32 uval, curval, newval;
1104   - int ret;
  1106 + int ret, attempt = 0;
1105 1107  
1106 1108 /* Owner died? */
  1109 + if (!pi_state->owner)
  1110 + newtid |= FUTEX_OWNER_DIED;
  1111 +
  1112 + /*
  1113 + * We are here either because we stole the rtmutex from the
  1114 + * pending owner or we are the pending owner which failed to
  1115 + * get the rtmutex. We have to replace the pending owner TID
  1116 + * in the user space variable. This must be atomic as we have
  1117 + * to preserve the owner died bit here.
  1118 + *
  1119 + * Note: We write the user space value _before_ changing the
  1120 + * pi_state because we can fault here. Imagine swapped out
  1121 + * pages or a fork, which was running right before we acquired
  1122 + * mmap_sem, that marked all the anonymous memory readonly for
  1123 + * cow.
  1124 + *
  1125 + * Modifying pi_state _before_ the user space value would
  1126 + * leave the pi_state in an inconsistent state when we fault
  1127 + * here, because we need to drop the hash bucket lock to
  1128 + * handle the fault. This might be observed in the PID check
  1129 + * in lookup_pi_state.
  1130 + */
  1131 +retry:
  1132 + if (get_futex_value_locked(&uval, uaddr))
  1133 + goto handle_fault;
  1134 +
  1135 + while (1) {
  1136 + newval = (uval & FUTEX_OWNER_DIED) | newtid;
  1137 +
  1138 + curval = cmpxchg_futex_value_locked(uaddr, uval, newval);
  1139 +
  1140 + if (curval == -EFAULT)
  1141 + goto handle_fault;
  1142 + if (curval == uval)
  1143 + break;
  1144 + uval = curval;
  1145 + }
  1146 +
  1147 + /*
  1148 + * We fixed up user space. Now we need to fix the pi_state
  1149 + * itself.
  1150 + */
1107 1151 if (pi_state->owner != NULL) {
1108 1152 spin_lock_irq(&pi_state->owner->pi_lock);
1109 1153 WARN_ON(list_empty(&pi_state->list));
1110 1154 list_del_init(&pi_state->list);
1111 1155 spin_unlock_irq(&pi_state->owner->pi_lock);
1112   - } else
1113   - newtid |= FUTEX_OWNER_DIED;
  1156 + }
1114 1157  
1115 1158 pi_state->owner = newowner;
1116 1159  
1117 1160  
1118 1161  
1119 1162  
1120 1163  
1121 1164  
... ... @@ -1118,26 +1161,35 @@
1118 1161 WARN_ON(!list_empty(&pi_state->list));
1119 1162 list_add(&pi_state->list, &newowner->pi_state_list);
1120 1163 spin_unlock_irq(&newowner->pi_lock);
  1164 + return 0;
1121 1165  
1122 1166 /*
1123   - * We own it, so we have to replace the pending owner
1124   - * TID. This must be atomic as we have preserve the
1125   - * owner died bit here.
  1167 + * To handle the page fault we need to drop the hash bucket
  1168 + * lock here. That gives the other task (either the pending
  1169 + * owner itself or the task which stole the rtmutex) the
  1170 + * chance to try the fixup of the pi_state. So once we are
  1171 + * back from handling the fault we need to check the pi_state
  1172 + * after reacquiring the hash bucket lock and before trying to
  1173 + * do another fixup. When the fixup has been done already we
  1174 + * simply return.
1126 1175 */
1127   - ret = get_futex_value_locked(&uval, uaddr);
  1176 +handle_fault:
  1177 + spin_unlock(q->lock_ptr);
1128 1178  
1129   - while (!ret) {
1130   - newval = (uval & FUTEX_OWNER_DIED) | newtid;
  1179 + ret = futex_handle_fault((unsigned long)uaddr, fshared, attempt++);
1131 1180  
1132   - curval = cmpxchg_futex_value_locked(uaddr, uval, newval);
  1181 + spin_lock(q->lock_ptr);
1133 1182  
1134   - if (curval == -EFAULT)
1135   - ret = -EFAULT;
1136   - if (curval == uval)
1137   - break;
1138   - uval = curval;
1139   - }
1140   - return ret;
  1183 + /*
  1184 + * Check if someone else fixed it for us:
  1185 + */
  1186 + if (pi_state->owner != oldowner)
  1187 + return 0;
  1188 +
  1189 + if (ret)
  1190 + return ret;
  1191 +
  1192 + goto retry;
1141 1193 }
1142 1194  
1143 1195 /*
... ... @@ -1507,7 +1559,7 @@
1507 1559 * that case:
1508 1560 */
1509 1561 if (q.pi_state->owner != curr)
1510   - ret = fixup_pi_state_owner(uaddr, &q, curr);
  1562 + ret = fixup_pi_state_owner(uaddr, &q, curr, fshared);
1511 1563 } else {
1512 1564 /*
1513 1565 * Catch the rare case, where the lock was released
... ... @@ -1539,7 +1591,8 @@
1539 1591 int res;
1540 1592  
1541 1593 owner = rt_mutex_owner(&q.pi_state->pi_mutex);
1542   - res = fixup_pi_state_owner(uaddr, &q, owner);
  1594 + res = fixup_pi_state_owner(uaddr, &q, owner,
  1595 + fshared);
1543 1596  
1544 1597 /* propagate -EFAULT, if the fixup failed */
1545 1598 if (res)