25 Sep, 2009

1 commit


22 Sep, 2009

5 commits

  • PI futexes do not use the same plist_node_empty() test for wakeup.
    It was possible for the waiter (in futex_wait_requeue_pi()) to set
    TASK_INTERRUPTIBLE after the waker assigned the rtmutex to the
    waiter. The waiter would then note the plist was not empty and call
    schedule(). The task would not be found by any subsequeuent futex
    wakeups, resulting in a userspace hang.

    By moving the setting of TASK_INTERRUPTIBLE to before the call to
    queue_me(), the race with the waker is eliminated. Since we no
    longer call get_user() from within queue_me(), there is no need to
    delay the setting of TASK_INTERRUPTIBLE until after the call to
    queue_me().

    The FUTEX_LOCK_PI operation is not affected as futex_lock_pi()
    relies entirely on the rtmutex code to handle schedule() and
    wakeup. The requeue PI code is affected because the waiter starts
    as a non-PI waiter and is woken on a PI futex.

    Remove the crusty old comment about holding spinlocks() across
    get_user() as we no longer do that. Correct the locking statement
    with a description of why the test is performed.

    Signed-off-by: Darren Hart
    Acked-by: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Eric Dumazet
    Cc: Dinakar Guniguntala
    Cc: John Stultz
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Darren Hart
     
  • Use kernel-doc format to describe struct futex_q.

    Correct the wakeup definition to eliminate the statement about
    waking the waiter between the plist_del() and the q->lock_ptr = 0.

    Note in the comment that PI futexes have a different definition of
    the woken state.

    Signed-off-by: Darren Hart
    Acked-by: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Eric Dumazet
    Cc: Dinakar Guniguntala
    Cc: John Stultz
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Darren Hart
     
  • Make the existing function kernel-doc consistent throughout
    futex.c, following Documentation/kernel-doc-nano-howto.txt as
    closely as possible.

    When unsure, at least be consistent within futex.c.

    Signed-off-by: Darren Hart
    Acked-by: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Eric Dumazet
    Cc: Dinakar Guniguntala
    Cc: John Stultz
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Darren Hart
     
  • The queue_me/unqueue_me commentary is oddly placed and out of date.
    Clean it up and correct the inaccurate bits.

    Signed-off-by: Darren Hart
    Acked-by: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Eric Dumazet
    Cc: Dinakar Guniguntala
    Cc: John Stultz
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Darren Hart
     
  • Correct various typos and formatting inconsistencies in the
    commentary of futex_wait_requeue_pi().

    Signed-off-by: Darren Hart
    Acked-by: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Eric Dumazet
    Cc: Dinakar Guniguntala
    Cc: John Stultz
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Darren Hart
     

12 Sep, 2009

1 commit


16 Aug, 2009

1 commit

  • There is currently no check to ensure that userspace uses the same
    futex requeue target (uaddr2) in futex_requeue() that the waiter used
    in futex_wait_requeue_pi(). A mismatch here could very unexpected
    results as the waiter assumes it either wakes on uaddr1 or uaddr2. We
    could detect this on wakeup in the waiter, but the cleanup is more
    intense after the improper requeue has occured.

    This patch stores the waiter's expected requeue target in a new
    requeue_pi_key pointer in the futex_q which futex_requeue() checks
    prior to attempting to do a proxy lock acquistion or a requeue when
    requeue_pi=1. If they don't match, return -EINVAL from futex_requeue,
    aborting the requeue of any remaining waiters.

    Signed-off-by: Darren Hart
    Cc: Peter Zijlstra
    Cc: Eric Dumazet
    Cc: John Kacur
    Cc: Dinakar Guniguntala
    Cc: John Stultz
    LKML-Reference:
    Signed-off-by: Thomas Gleixner

    Darren Hart
     

11 Aug, 2009

1 commit

  • If futex_requeue(requeue_pi=1) finds a futex_q that was created by a call
    other the futex_wait_requeue_pi(), the q.rt_waiter may be null. If so,
    this will result in an oops from the following call graph:

    futex_requeue()
    rt_mutex_start_proxy_lock()
    task_blocks_on_rt_mutex()
    waiter->task dereference
    OOPS

    We currently WARN_ON() if this is detected, clearly this is inadequate.
    If we detect a mispairing in futex_requeue(), bail out, seding -EINVAL to
    user-space.

    V2: Fix parenthesis warnings.

    Signed-off-by: Darren Hart
    Acked-by: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: John Kacur
    Cc: Eric Dumazet
    Cc: Dinakar Guniguntala
    Cc: John Stultz
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Darren Hart
     

10 Aug, 2009

1 commit

  • futex_requeue() can acquire the lock on behalf of a waiter
    early on or during the requeue loop if it is uncontended or in
    the event of a lock steal or owner died. On wakeup, the waiter
    (in futex_wait_requeue_pi()) cleans up the pi_state owner using
    the lock_ptr to protect against concurrent access to the
    pi_state. The pi_state is hung off futex_q's on the requeue
    target futex hash bucket so the lock_ptr needs to be updated
    accordingly.

    The problem manifested by triggering the WARN_ON in
    lookup_pi_state() about the pid != pi_state->owner->pid. With
    this patch, the pi_state is properly guarded against concurrent
    access via the requeue target hb lock.

    The astute reviewer may notice that there is a window of time
    between when futex_requeue() unlocks the hb locks and when
    futex_wait_requeue_pi() will acquire hb2->lock. During this
    time the pi_state and uval are not in sync with the underlying
    rtmutex owner (but the uval does indicate there are waiters, so
    no atomic changes will occur in userspace). However, this is
    not a problem. Should a contending thread enter
    lookup_pi_state() and acquire hb2->lock before the ownership is
    fixed up, it will find the pi_state hung off a waiter's
    (possibly the pending owner's) futex_q and block on the
    rtmutex. Once futex_wait_requeue_pi() fixes up the owner, it
    will also move the pi_state from the old owner's
    task->pi_state_list to its own.

    v3: Fix plist lock name for application to mainline (rather
    than -rt) Compile tested against tip/v2.6.31-rc5.

    Signed-off-by: Darren Hart
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Eric Dumazet
    Cc: Dinakar Guniguntala
    Cc: John Stultz
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Darren Hart
     

04 Aug, 2009

1 commit

  • The state machine described in the comments wasn't updated with
    a follow-on fix. Address that and cleanup the corresponding
    commentary in the function.

    Signed-off-by: Darren Hart
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Darren Hart
     

11 Jul, 2009

1 commit

  • get_futex_key() can infinitely loop if it is called on a
    virtual address that is within a huge page but not aligned to
    the beginning of that page. The call to get_user_pages_fast
    will return the struct page for a sub-page within the huge page
    and the check for page->mapping will always fail.

    The fix is to call compound_head on the page before checking
    that it's mapped.

    Signed-off-by: Sonny Rao
    Acked-by: Thomas Gleixner
    Cc: stable@kernel.org
    Cc: anton@samba.org
    Cc: rajamony@us.ibm.com
    Cc: speight@us.ibm.com
    Cc: mstephen@us.ibm.com
    Cc: grimm@us.ibm.com
    Cc: mikey@ozlabs.au.ibm.com
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Sonny Rao
     

25 Jun, 2009

2 commits

  • Yanmin noticed that fault_in_user_writeable() requests 4 pages instead
    of one.

    That's the result of blindly trusting Linus' proposal :) I even looked
    up the prototype to verify the correctness: the argument in question
    is confusingly enough named "len" while in reality it means number of
    pages.

    Pointed-out-by: Yanmin Zhang
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • commit 64d1304a64 (futex: setup writeable mapping for futex ops which
    modify user space data) did address only half of the problem of write
    access faults.

    The patch was made on two wrong assumptions:

    1) access_ok(VERIFY_WRITE,...) would actually check write access.

    On x86 it does _NOT_. It's a pure address range check.

    2) a RW mapped region can not go away under us.

    That's wrong as well. Nobody can prevent another thread to call
    mprotect(PROT_READ) on that region where the futex resides. If that
    call hits between the get_user_pages_fast() verification and the
    actual write access in the atomic region we are toast again.

    The solution is to not rely on access_ok and get_user() for any write
    access related fault on private and shared futexes. Instead we need to
    fault it in with verification of write access.

    There is no generic non destructive write mechanism which would fault
    the user page in trough a #PF, but as we already know that we will
    fault we can as well call get_user_pages() directly and avoid the #PF
    overhead.

    If get_user_pages() returns -EFAULT we know that we can not fix it
    anymore and need to bail out to user space.

    Remove a bunch of confusing comments on this issue as well.

    Signed-off-by: Thomas Gleixner
    Cc: stable@kernel.org

    Thomas Gleixner
     

20 May, 2009

5 commits

  • If the waiter has been requeued to the outer PI futex and is
    interrupted by a signal and the thread handles the signal then
    ERESTART_RESTARTBLOCK is changed to EINTR and the restart block is
    discarded. That way we return an unexcpected EINTR to user space
    instead of ending up in futex_lock_pi_restart.

    But we do not need to restart the syscall because we know that the
    condition has changed since we have been requeued. If we would simply
    restart the syscall then we would drop out via the comparison of the
    user space value with EWOULDBLOCK.

    The user space side needs to handle EWOULDBLOCK anyway as the
    enqueueing on the inner futex can race with a requeue/wake. So we can
    simply return EWOULDBLOCK to user space which also signals that we did
    not take the outer futex and let user space handle it in the same way
    it has to handle the requeue/wake race.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • The futex_wait_requeue_pi op should restart unconditionally like
    futex_lock_pi. The user of that function e.g. pthread_cond_wait can
    not be interrupted so we do not care about the SA_RESTART flag of the
    signal. Clean up the FIXMEs.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Reuse the put_key_ref(key2) call in the exit path.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Merge reason: this branch was on an pre -rc1 base, merge it up to -rc6+
    to get the latest upstream fixes.

    Conflicts:
    kernel/futex.c

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • The futex code installs a read only mapping via get_user_pages_fast()
    even if the futex op function has to modify user space data. The
    eventual fault was fixed up by futex_handle_fault() which walked the
    VMA with mmap_sem held.

    After the cleanup patches which removed the mmap_sem dependency of the
    futex code commit 4dc5b7a36a49eff97050894cf1b3a9a02523717 (futex:
    clean up fault logic) removed the private VMA walk logic from the
    futex code. This change results in a stale RO mapping which is not
    fixed up.

    Instead of reintroducing the previous fault logic we set up the
    mapping in get_user_pages_fast() read/write for all operations which
    modify user space data. Also handle private futexes in the same way
    and make the current unconditional access_ok(VERIFY_WRITE) depend on
    the futex op.

    Reported-by: Andreas Schwab
    Signed-off-by: Thomas Gleixner
    CC: stable@kernel.org

    Thomas Gleixner
     

15 May, 2009

1 commit

  • The waitqueue which is used in struct futex_q is a leftover from the
    futexfd implementation. There is no need to use a waitqueue at all, as
    the waiting task is the only user of it. The waitqueue just adds
    additional locking and a loop in the wake up path which both can be
    avoided.

    We have already a task reference in struct futex_q which is used for
    PI futexes. Use it for normal futexes as well and just wake up the
    task directly.

    The logic of signalling the futex wakeup via setting q->lock_ptr to
    NULL is kept with the difference that we set it NULL before doing the
    wakeup. This opens an exit race window vs. a non futex wake up of the
    to be woken up task, which we prevent with get_task_struct /
    put_task_struct on the waiter.

    [ Impact: simplification ]

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

30 Apr, 2009

1 commit

  • The new requeue PI futex op codes were modeled after the existing
    FUTEX_REQUEUE and FUTEX_CMP_REQUEUE calls. I was unaware at the time
    that FUTEX_REQUEUE was only around for compatibility reasons and
    shouldn't be used in new code. Ulrich Drepper elaborates on this in his
    Futexes are Tricky paper: http://people.redhat.com/drepper/futex.pdf.
    The deprecated call doesn't catch changes to the futex corresponding to
    the destination futex which can lead to deadlock.

    Therefor, I feel it best to remove FUTEX_REQUEUE_PI and leave only
    FUTEX_CMP_REQUEUE_PI as there are not yet any existing users of the API.
    This patch does change the OP code value of FUTEX_CMP_REQUEUE_PI to 12
    from 13. Since my test case is the only known user of this API, I felt
    this was the right thing to do, rather than leave a hole in the
    enumeration.

    I chose to continue using the _CMP_ modifier in the OP code to make it
    explicit to the user that the test is being done.

    Builds, boots, and ran several hundred iterations requeue_pi.c.

    Signed-off-by: Darren Hart
    LKML-Reference:
    Signed-off-by: Thomas Gleixner

    Darren Hart
     

11 Apr, 2009

1 commit

  • If the get_futex_key() call were to fail, the existing code would
    try and put_futex_key() prior to returning. This patch makes sure
    we only put_futex_key() if get_futex_key() succeeded.

    Reported-by: Clark Williams
    Signed-off-by: Darren Hart
    LKML-Reference:
    Signed-off-by: Thomas Gleixner

    Darren Hart
     

08 Apr, 2009

1 commit

  • Thomas's testing caught a problem when the requeue target futex is
    unowned and multiple tasks are requeued to it. This patch ensures
    the FUTEX_WAITERS bit gets set if futex_requeue() will requeue one
    or more tasks in addition to the one acquiring the lock.

    Signed-off-by: Darren Hart
    Signed-off-by: Thomas Gleixner

    Darren Hart
     

06 Apr, 2009

8 commits

  • PI Futexes and their underlying rt_mutex cannot be left ownerless if
    there are pending waiters as this will break the PI boosting logic, so
    the standard requeue commands aren't sufficient. The new commands
    properly manage pi futex ownership by ensuring a futex with waiters
    has an owner at all times. This will allow glibc to properly handle
    pi mutexes with pthread_condvars.

    The approach taken here is to create two new futex op codes:

    FUTEX_WAIT_REQUEUE_PI:
    Tasks will use this op code to wait on a futex (such as a non-pi waitqueue)
    and wake after they have been requeued to a pi futex. Prior to returning to
    userspace, they will acquire this pi futex (and the underlying rt_mutex).

    futex_wait_requeue_pi() is the result of a high speed collision between
    futex_wait() and futex_lock_pi() (with the first part of futex_lock_pi() being
    done by futex_proxy_trylock_atomic() on behalf of the top_waiter).

    FUTEX_REQUEUE_PI (and FUTEX_CMP_REQUEUE_PI):
    This call must be used to wake tasks waiting with FUTEX_WAIT_REQUEUE_PI,
    regardless of how many tasks the caller intends to wake or requeue.
    pthread_cond_broadcast() should call this with nr_wake=1 and
    nr_requeue=INT_MAX. pthread_cond_signal() should call this with nr_wake=1 and
    nr_requeue=0. The reason being we need both callers to get the benefit of the
    futex_proxy_trylock_atomic() routine. futex_requeue() also enqueues the
    top_waiter on the rt_mutex via rt_mutex_start_proxy_lock().

    Signed-off-by: Darren Hart
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Thomas Gleixner

    Darren Hart
     
  • Refactor the code to validate the expected futex value in order to
    reuse it with the requeue_pi code.

    Signed-off-by: Darren Hart
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Thomas Gleixner

    Darren Hart
     
  • futex_requeue() is getting a bit long-winded, and will be getting more
    so after the requeue_pi patch. Factor out the actual requeueing into a
    nicely contained inline function to reduce function length and improve
    legibility.

    Signed-off-by: Darren Hart
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Thomas Gleixner

    Darren Hart
     
  • Currently restart is only used if there is a timeout. The requeue_pi
    functionality requires restarting to futex_lock_pi() on signal after
    wakeup in futex_wait_requeue_pi() regardless of if there was a timeout
    or not. Using 0 for the timeout value is confusing as that could
    indicate an expired timer. The flag makes this explicit. While the
    check is not technically needed in futex_wait_restart(), doing so
    makes the code consistent with and will avoid confusion should the
    need arise to restart wait without a timeout.

    Signed-off-by: Darren Hart
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Thomas Gleixner

    Darren Hart
     
  • Refactor the post lock acquisition logic from futex_lock_pi(). This
    code will be reused in futex_wait_requeue_pi().

    Signed-off-by: Darren Hart
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Thomas Gleixner

    Darren Hart
     
  • Refactor the atomic portion of futex_lock_pi() into futex_lock_pi_atomic().

    This logic will be needed by requeue_pi, so modularize it to reduce
    code duplication. The only significant change is passing of the task
    to try and take the lock for. This simplifies the -EDEADLK test as if
    the lock is owned by task t, it's a deadlock, regardless of if we are
    doing requeue pi or not. This patch updates the corresponding comment
    accordingly.

    Signed-off-by: Darren Hart
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Thomas Gleixner

    Darren Hart
     
  • Improve legibility by wrapping finding the top waiter in a function.
    This will be used by the follow-on patches for enabling requeue pi.

    Signed-off-by: Darren Hart
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Thomas Gleixner

    Darren Hart
     
  • Refactor futex_wait() in preparation for futex_wait_requeue_pi(). In
    order to reuse a good chunk of the futex_wait() code for the upcoming
    futex_wait_requeue_pi() function, this patch breaks out the
    queue-to-wakeup section of futex_wait() into futex_wait_queue_me().

    Signed-off-by: Darren Hart
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Thomas Gleixner

    Darren Hart
     

03 Apr, 2009

1 commit

  • We've tripped over the futex_requeue drop_count refering to key2
    instead of key1. The code is actually correct, but is non-intuitive.
    This patch adds an explicit comment explaining the requeue.

    Signed-off-by: Darren Hart
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Ingo Molnar

    Darren Hart
     

13 Mar, 2009

2 commits


12 Mar, 2009

6 commits

  • Impact: cleanup

    Older versions of the futex code held the mmap_sem which had to
    be dropped in order to call get_user(), so a two-pronged fault
    handling mechanism was employed to handle faults of the atomic
    operations. The mmap_sem is no longer held, so get_user()
    should be adequate. This patch greatly simplifies the logic and
    improves legibility.

    Build and boot tested on a 4 way Intel x86_64 workstation.
    Passes basic pthread_mutex and PI tests out of
    ltp/testcases/realtime.

    Signed-off-by: Darren Hart
    Acked-by: Peter Zijlstra
    Cc: Rusty Russell
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Darren Hart
     
  • Impact: rt-mutex failure case fix

    futex_lock_pi can potentially return -EFAULT with the rt_mutex
    held. This seems like the wrong thing to do as userspace should
    assume -EFAULT means the lock was not taken. Even if it could
    figure this out, we'd be leaving the pi_state->owner in an
    inconsistent state. This patch unlocks the rt_mutex prior to
    returning -EFAULT to userspace.

    Build and boot tested on a 4 way Intel x86_64 workstation.
    Passes basic pthread_mutex and PI tests out of
    ltp/testcases/realtime.

    Signed-off-by: Darren Hart
    Acked-by: Peter Zijlstra
    Cc: Rusty Russell
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Darren Hart
     
  • RT tasks should set their timer slack to 0 on their own. This
    patch removes the 'if (rt_task()) slack = 0;' block in
    futex_wait.

    Build and boot tested on a 4 way Intel x86_64 workstation.
    Passes basic pthread_mutex and PI tests out of
    ltp/testcases/realtime.

    Signed-off-by: Darren Hart
    Acked-by: Peter Zijlstra
    Cc: Rusty Russell
    Cc: Arjan van de Ven
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Darren Hart
     
  • Impact: cleanup

    The futex code uses double_lock_hb() which locks the hb->lock's
    in pointer value order. There is no parallel unlock routine,
    and the code unlocks them in name order, ignoring pointer value.

    This patch adds double_unlock_hb() to refactor the duplicated
    code segments.

    Build and boot tested on a 4 way Intel x86_64 workstation.
    Passes basic pthread_mutex and PI tests out of
    ltp/testcases/realtime.

    Signed-off-by: Darren Hart
    Acked-by: Peter Zijlstra
    Cc: Rusty Russell
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Darren Hart
     
  • Impact: fix races

    futex_requeue and futex_lock_pi still had some bad
    (get|put)_futex_key() usage. This patch adds the missing
    put_futex_keys() and corrects a goto in futex_lock_pi() to avoid
    a double get.

    Build and boot tested on a 4 way Intel x86_64 workstation.
    Passes basic pthread_mutex and PI tests out of
    ltp/testcases/realtime.

    Signed-off-by: Darren Hart
    Acked-by: Peter Zijlstra
    Cc: Rusty Russell
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Darren Hart
     
  • Impact: cleanup

    The futex_hash_bucket can be a bit confusing when first looking
    at the code as it is a shared queue (and futex_q isn't a queue
    at all, but rather an element on the queue).

    The mmap_sem is no longer held outside of the
    futex_handle_fault() routine, yet numerous comments refer to it.
    The fshared argument is no an integer. I left some of these
    comments along as they are simply removed in future patches.

    Some of the commentary refering to futexes by virtual page
    mappings was not very clear, and completely accurate (as for
    shared futexes both the page and the offset are used to
    determine the key). For the purposes of the function
    description, just referring to "the futex" seems sufficient.

    With hashed futexes we now access the page after the hash-bucket
    is locked, and not only after it is enqueued.

    Signed-off-by: Darren Hart
    Acked-by: Peter Zijlstra
    Cc: Rusty Russell
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Darren Hart