25 Jun, 2009

2 commits

  • Yanmin noticed that fault_in_user_writeable() requests 4 pages instead
    of one.

    That's the result of blindly trusting Linus' proposal :) I even looked
    up the prototype to verify the correctness: the argument in question
    is confusingly enough named "len" while in reality it means number of
    pages.

    Pointed-out-by: Yanmin Zhang
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • commit 64d1304a64 (futex: setup writeable mapping for futex ops which
    modify user space data) did address only half of the problem of write
    access faults.

    The patch was made on two wrong assumptions:

    1) access_ok(VERIFY_WRITE,...) would actually check write access.

    On x86 it does _NOT_. It's a pure address range check.

    2) a RW mapped region can not go away under us.

    That's wrong as well. Nobody can prevent another thread to call
    mprotect(PROT_READ) on that region where the futex resides. If that
    call hits between the get_user_pages_fast() verification and the
    actual write access in the atomic region we are toast again.

    The solution is to not rely on access_ok and get_user() for any write
    access related fault on private and shared futexes. Instead we need to
    fault it in with verification of write access.

    There is no generic non destructive write mechanism which would fault
    the user page in trough a #PF, but as we already know that we will
    fault we can as well call get_user_pages() directly and avoid the #PF
    overhead.

    If get_user_pages() returns -EFAULT we know that we can not fix it
    anymore and need to bail out to user space.

    Remove a bunch of confusing comments on this issue as well.

    Signed-off-by: Thomas Gleixner
    Cc: stable@kernel.org

    Thomas Gleixner
     

20 May, 2009

5 commits

  • If the waiter has been requeued to the outer PI futex and is
    interrupted by a signal and the thread handles the signal then
    ERESTART_RESTARTBLOCK is changed to EINTR and the restart block is
    discarded. That way we return an unexcpected EINTR to user space
    instead of ending up in futex_lock_pi_restart.

    But we do not need to restart the syscall because we know that the
    condition has changed since we have been requeued. If we would simply
    restart the syscall then we would drop out via the comparison of the
    user space value with EWOULDBLOCK.

    The user space side needs to handle EWOULDBLOCK anyway as the
    enqueueing on the inner futex can race with a requeue/wake. So we can
    simply return EWOULDBLOCK to user space which also signals that we did
    not take the outer futex and let user space handle it in the same way
    it has to handle the requeue/wake race.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • The futex_wait_requeue_pi op should restart unconditionally like
    futex_lock_pi. The user of that function e.g. pthread_cond_wait can
    not be interrupted so we do not care about the SA_RESTART flag of the
    signal. Clean up the FIXMEs.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Reuse the put_key_ref(key2) call in the exit path.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Merge reason: this branch was on an pre -rc1 base, merge it up to -rc6+
    to get the latest upstream fixes.

    Conflicts:
    kernel/futex.c

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • The futex code installs a read only mapping via get_user_pages_fast()
    even if the futex op function has to modify user space data. The
    eventual fault was fixed up by futex_handle_fault() which walked the
    VMA with mmap_sem held.

    After the cleanup patches which removed the mmap_sem dependency of the
    futex code commit 4dc5b7a36a49eff97050894cf1b3a9a02523717 (futex:
    clean up fault logic) removed the private VMA walk logic from the
    futex code. This change results in a stale RO mapping which is not
    fixed up.

    Instead of reintroducing the previous fault logic we set up the
    mapping in get_user_pages_fast() read/write for all operations which
    modify user space data. Also handle private futexes in the same way
    and make the current unconditional access_ok(VERIFY_WRITE) depend on
    the futex op.

    Reported-by: Andreas Schwab
    Signed-off-by: Thomas Gleixner
    CC: stable@kernel.org

    Thomas Gleixner
     

15 May, 2009

1 commit

  • The waitqueue which is used in struct futex_q is a leftover from the
    futexfd implementation. There is no need to use a waitqueue at all, as
    the waiting task is the only user of it. The waitqueue just adds
    additional locking and a loop in the wake up path which both can be
    avoided.

    We have already a task reference in struct futex_q which is used for
    PI futexes. Use it for normal futexes as well and just wake up the
    task directly.

    The logic of signalling the futex wakeup via setting q->lock_ptr to
    NULL is kept with the difference that we set it NULL before doing the
    wakeup. This opens an exit race window vs. a non futex wake up of the
    to be woken up task, which we prevent with get_task_struct /
    put_task_struct on the waiter.

    [ Impact: simplification ]

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

30 Apr, 2009

1 commit

  • The new requeue PI futex op codes were modeled after the existing
    FUTEX_REQUEUE and FUTEX_CMP_REQUEUE calls. I was unaware at the time
    that FUTEX_REQUEUE was only around for compatibility reasons and
    shouldn't be used in new code. Ulrich Drepper elaborates on this in his
    Futexes are Tricky paper: http://people.redhat.com/drepper/futex.pdf.
    The deprecated call doesn't catch changes to the futex corresponding to
    the destination futex which can lead to deadlock.

    Therefor, I feel it best to remove FUTEX_REQUEUE_PI and leave only
    FUTEX_CMP_REQUEUE_PI as there are not yet any existing users of the API.
    This patch does change the OP code value of FUTEX_CMP_REQUEUE_PI to 12
    from 13. Since my test case is the only known user of this API, I felt
    this was the right thing to do, rather than leave a hole in the
    enumeration.

    I chose to continue using the _CMP_ modifier in the OP code to make it
    explicit to the user that the test is being done.

    Builds, boots, and ran several hundred iterations requeue_pi.c.

    Signed-off-by: Darren Hart
    LKML-Reference:
    Signed-off-by: Thomas Gleixner

    Darren Hart
     

11 Apr, 2009

1 commit

  • If the get_futex_key() call were to fail, the existing code would
    try and put_futex_key() prior to returning. This patch makes sure
    we only put_futex_key() if get_futex_key() succeeded.

    Reported-by: Clark Williams
    Signed-off-by: Darren Hart
    LKML-Reference:
    Signed-off-by: Thomas Gleixner

    Darren Hart
     

08 Apr, 2009

1 commit

  • Thomas's testing caught a problem when the requeue target futex is
    unowned and multiple tasks are requeued to it. This patch ensures
    the FUTEX_WAITERS bit gets set if futex_requeue() will requeue one
    or more tasks in addition to the one acquiring the lock.

    Signed-off-by: Darren Hart
    Signed-off-by: Thomas Gleixner

    Darren Hart
     

06 Apr, 2009

8 commits

  • PI Futexes and their underlying rt_mutex cannot be left ownerless if
    there are pending waiters as this will break the PI boosting logic, so
    the standard requeue commands aren't sufficient. The new commands
    properly manage pi futex ownership by ensuring a futex with waiters
    has an owner at all times. This will allow glibc to properly handle
    pi mutexes with pthread_condvars.

    The approach taken here is to create two new futex op codes:

    FUTEX_WAIT_REQUEUE_PI:
    Tasks will use this op code to wait on a futex (such as a non-pi waitqueue)
    and wake after they have been requeued to a pi futex. Prior to returning to
    userspace, they will acquire this pi futex (and the underlying rt_mutex).

    futex_wait_requeue_pi() is the result of a high speed collision between
    futex_wait() and futex_lock_pi() (with the first part of futex_lock_pi() being
    done by futex_proxy_trylock_atomic() on behalf of the top_waiter).

    FUTEX_REQUEUE_PI (and FUTEX_CMP_REQUEUE_PI):
    This call must be used to wake tasks waiting with FUTEX_WAIT_REQUEUE_PI,
    regardless of how many tasks the caller intends to wake or requeue.
    pthread_cond_broadcast() should call this with nr_wake=1 and
    nr_requeue=INT_MAX. pthread_cond_signal() should call this with nr_wake=1 and
    nr_requeue=0. The reason being we need both callers to get the benefit of the
    futex_proxy_trylock_atomic() routine. futex_requeue() also enqueues the
    top_waiter on the rt_mutex via rt_mutex_start_proxy_lock().

    Signed-off-by: Darren Hart
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Thomas Gleixner

    Darren Hart
     
  • Refactor the code to validate the expected futex value in order to
    reuse it with the requeue_pi code.

    Signed-off-by: Darren Hart
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Thomas Gleixner

    Darren Hart
     
  • futex_requeue() is getting a bit long-winded, and will be getting more
    so after the requeue_pi patch. Factor out the actual requeueing into a
    nicely contained inline function to reduce function length and improve
    legibility.

    Signed-off-by: Darren Hart
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Thomas Gleixner

    Darren Hart
     
  • Currently restart is only used if there is a timeout. The requeue_pi
    functionality requires restarting to futex_lock_pi() on signal after
    wakeup in futex_wait_requeue_pi() regardless of if there was a timeout
    or not. Using 0 for the timeout value is confusing as that could
    indicate an expired timer. The flag makes this explicit. While the
    check is not technically needed in futex_wait_restart(), doing so
    makes the code consistent with and will avoid confusion should the
    need arise to restart wait without a timeout.

    Signed-off-by: Darren Hart
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Thomas Gleixner

    Darren Hart
     
  • Refactor the post lock acquisition logic from futex_lock_pi(). This
    code will be reused in futex_wait_requeue_pi().

    Signed-off-by: Darren Hart
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Thomas Gleixner

    Darren Hart
     
  • Refactor the atomic portion of futex_lock_pi() into futex_lock_pi_atomic().

    This logic will be needed by requeue_pi, so modularize it to reduce
    code duplication. The only significant change is passing of the task
    to try and take the lock for. This simplifies the -EDEADLK test as if
    the lock is owned by task t, it's a deadlock, regardless of if we are
    doing requeue pi or not. This patch updates the corresponding comment
    accordingly.

    Signed-off-by: Darren Hart
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Thomas Gleixner

    Darren Hart
     
  • Improve legibility by wrapping finding the top waiter in a function.
    This will be used by the follow-on patches for enabling requeue pi.

    Signed-off-by: Darren Hart
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Thomas Gleixner

    Darren Hart
     
  • Refactor futex_wait() in preparation for futex_wait_requeue_pi(). In
    order to reuse a good chunk of the futex_wait() code for the upcoming
    futex_wait_requeue_pi() function, this patch breaks out the
    queue-to-wakeup section of futex_wait() into futex_wait_queue_me().

    Signed-off-by: Darren Hart
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Thomas Gleixner

    Darren Hart
     

03 Apr, 2009

1 commit

  • We've tripped over the futex_requeue drop_count refering to key2
    instead of key1. The code is actually correct, but is non-intuitive.
    This patch adds an explicit comment explaining the requeue.

    Signed-off-by: Darren Hart
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Ingo Molnar

    Darren Hart
     

13 Mar, 2009

2 commits


12 Mar, 2009

6 commits

  • Impact: cleanup

    Older versions of the futex code held the mmap_sem which had to
    be dropped in order to call get_user(), so a two-pronged fault
    handling mechanism was employed to handle faults of the atomic
    operations. The mmap_sem is no longer held, so get_user()
    should be adequate. This patch greatly simplifies the logic and
    improves legibility.

    Build and boot tested on a 4 way Intel x86_64 workstation.
    Passes basic pthread_mutex and PI tests out of
    ltp/testcases/realtime.

    Signed-off-by: Darren Hart
    Acked-by: Peter Zijlstra
    Cc: Rusty Russell
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Darren Hart
     
  • Impact: rt-mutex failure case fix

    futex_lock_pi can potentially return -EFAULT with the rt_mutex
    held. This seems like the wrong thing to do as userspace should
    assume -EFAULT means the lock was not taken. Even if it could
    figure this out, we'd be leaving the pi_state->owner in an
    inconsistent state. This patch unlocks the rt_mutex prior to
    returning -EFAULT to userspace.

    Build and boot tested on a 4 way Intel x86_64 workstation.
    Passes basic pthread_mutex and PI tests out of
    ltp/testcases/realtime.

    Signed-off-by: Darren Hart
    Acked-by: Peter Zijlstra
    Cc: Rusty Russell
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Darren Hart
     
  • RT tasks should set their timer slack to 0 on their own. This
    patch removes the 'if (rt_task()) slack = 0;' block in
    futex_wait.

    Build and boot tested on a 4 way Intel x86_64 workstation.
    Passes basic pthread_mutex and PI tests out of
    ltp/testcases/realtime.

    Signed-off-by: Darren Hart
    Acked-by: Peter Zijlstra
    Cc: Rusty Russell
    Cc: Arjan van de Ven
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Darren Hart
     
  • Impact: cleanup

    The futex code uses double_lock_hb() which locks the hb->lock's
    in pointer value order. There is no parallel unlock routine,
    and the code unlocks them in name order, ignoring pointer value.

    This patch adds double_unlock_hb() to refactor the duplicated
    code segments.

    Build and boot tested on a 4 way Intel x86_64 workstation.
    Passes basic pthread_mutex and PI tests out of
    ltp/testcases/realtime.

    Signed-off-by: Darren Hart
    Acked-by: Peter Zijlstra
    Cc: Rusty Russell
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Darren Hart
     
  • Impact: fix races

    futex_requeue and futex_lock_pi still had some bad
    (get|put)_futex_key() usage. This patch adds the missing
    put_futex_keys() and corrects a goto in futex_lock_pi() to avoid
    a double get.

    Build and boot tested on a 4 way Intel x86_64 workstation.
    Passes basic pthread_mutex and PI tests out of
    ltp/testcases/realtime.

    Signed-off-by: Darren Hart
    Acked-by: Peter Zijlstra
    Cc: Rusty Russell
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Darren Hart
     
  • Impact: cleanup

    The futex_hash_bucket can be a bit confusing when first looking
    at the code as it is a shared queue (and futex_q isn't a queue
    at all, but rather an element on the queue).

    The mmap_sem is no longer held outside of the
    futex_handle_fault() routine, yet numerous comments refer to it.
    The fshared argument is no an integer. I left some of these
    comments along as they are simply removed in future patches.

    Some of the commentary refering to futexes by virtual page
    mappings was not very clear, and completely accurate (as for
    shared futexes both the page and the offset are used to
    determine the key). For the purposes of the function
    description, just referring to "the futex" seems sufficient.

    With hashed futexes we now access the page after the hash-bucket
    is locked, and not only after it is enqueued.

    Signed-off-by: Darren Hart
    Acked-by: Peter Zijlstra
    Cc: Rusty Russell
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Darren Hart
     

12 Feb, 2009

1 commit

  • Catalin noticed that (38d47c1b7075: futex: rely on get_user_pages() for
    shared futexes) caused an mm_struct leak.

    Some tracing with the function graph tracer quickly pointed out that
    futex_wait() has exit paths with unbalanced reference counts.

    This regression was discovered by kmemleak.

    Reported-by: Catalin Marinas
    Signed-off-by: Peter Zijlstra
    Tested-by: "Pallipadi, Venkatesh"
    Tested-by: Catalin Marinas
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

14 Jan, 2009

2 commits


06 Jan, 2009

1 commit


03 Jan, 2009

1 commit

  • Impact: add debug check

    Following up on my previous key reference accounting patches, this patch
    will catch puts on keys that haven't been "got". This won't catch nested
    get/put mismatches though.

    Build and boot tested, with minimal desktop activity and a run of the
    open_posix_testsuite in LTP for testing. No warnings logged.

    Signed-off-by: Darren Hart
    Cc:
    Signed-off-by: Ingo Molnar

    Darren Hart
     

31 Dec, 2008

1 commit

  • * 'core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (63 commits)
    stacktrace: provide save_stack_trace_tsk() weak alias
    rcu: provide RCU options on non-preempt architectures too
    printk: fix discarding message when recursion_bug
    futex: clean up futex_(un)lock_pi fault handling
    "Tree RCU": scalable classic RCU implementation
    futex: rename field in futex_q to clarify single waiter semantics
    x86/swiotlb: add default swiotlb_arch_range_needs_mapping
    x86/swiotlb: add default physbus conversion
    x86: unify pci iommu setup and allow swiotlb to compile for 32 bit
    x86: add swiotlb allocation functions
    swiotlb: consolidate swiotlb info message printing
    swiotlb: support bouncing of HighMem pages
    swiotlb: factor out copy to/from device
    swiotlb: add arch hook to force mapping
    swiotlb: allow architectures to override physbusphys conversions
    swiotlb: add comment where we handle the overflow of a dma mask on 32 bit
    rcu: fix rcutorture behavior during reboot
    resources: skip sanity check of busy resources
    swiotlb: move some definitions to header
    swiotlb: allow architectures to override swiotlb pool allocation
    ...

    Fix up trivial conflicts in
    arch/x86/kernel/Makefile
    arch/x86/mm/init_32.c
    include/linux/hardirq.h
    as per Ingo's suggestions.

    Linus Torvalds
     

30 Dec, 2008

1 commit

  • Impact: cleanup

    This patch makes the calls to futex_get_key_refs() and futex_drop_key_refs()
    explicitly symmetric by only "putting" keys we successfully "got". Also
    cleanup a couple return points that didn't "put" after a successful "get".

    Build and boot tested on an x86_64 system.

    Signed-off-by: Darren Hart
    Cc:
    Signed-off-by: Ingo Molnar

    Darren Hart
     

19 Dec, 2008

1 commit

  • Impact: cleanup

    Some apparently left over cruft code was complicating the fault logic:

    Testing if uval != -EFAULT doesn't have any meaning, get_user() sets ret
    to either 0 or -EFAULT, there's no need to compare uval, especially not
    against EFAULT which it will never be. This patch removes the superfluous
    test and clarifies the comment blocks.

    Build and boot tested on an 8way x86_64 system.

    Signed-off-by: Darren Hart
    Signed-off-by: Ingo Molnar

    Darren Hart
     

18 Dec, 2008

1 commit

  • Impact: simplify code

    I've tripped over the naming of this field a couple times.

    The futex_q uses a "waiters" list to represent a single blocked task and
    then calles wake_up_all().

    This can lead to confusion in trying to understand the intent of the code,
    which is to have a single futex_q for every task waiting on a futex.

    This patch corrects the problem, using a single pointer to the waiting
    task, and an appropriate call to wake_up, rather than wake_up_all.

    Compile and boot tested on an 8way x86_64 machine.

    Signed-off-by: Darren Hart
    Acked-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Darren Hart
     

25 Nov, 2008

2 commits


14 Nov, 2008

1 commit