09 Mar, 2017

1 commit

  • The scheduler header file split and cleanups ended up exposing a few
    nasty header file dependencies, and in particular it showed how we in
    <linux/wait.h> ended up depending on "signal_pending()", which now
    comes from <linux/sched/signal.h>.

    That's a very subtle and annoying dependency, which already caused a
    semantic merge conflict (see commit e58bc927835a "Pull overlayfs updates
    from Miklos Szeredi", which added that fixup in the merge commit).

    It turns out that we can avoid this dependency _and_ improve code
    generation by moving the guts of the fairly nasty helper #define
    __wait_event_interruptible_locked() to out-of-line code. The code that
    includes the signal_pending() check is all in the slow-path where we
    actually go to sleep waiting for the event anyway, so using a helper
    function is the right thing to do.

    Using a helper function is also what we already did for the non-locked
    versions, see the "__wait_event*()" macros and the "prepare_to_wait*()"
    set of helper functions.

    We might want to try to unify all these macro games; we have a _lot_ of
    subtly different wait-event loops. But this is the minimal patch to fix
    the annoying header dependency.

    Acked-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Linus Torvalds
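
    A sketch of the shape such an out-of-line helper can take (the name
    do_wait_intr() and the exact locking details are illustrative, inferred
    from the description above):

        /*
         * Slow path of __wait_event_interruptible_locked(), out of line
         * so that <linux/wait.h> no longer needs signal_pending().
         */
        int do_wait_intr(wait_queue_head_t *wq, wait_queue_t *wait)
        {
                if (likely(list_empty(&wait->task_list)))
                        __add_wait_queue_tail(wq, wait);

                set_current_state(TASK_INTERRUPTIBLE);
                if (signal_pending(current))
                        return -ERESTARTSYS;

                spin_unlock(&wq->lock);
                schedule();
                spin_lock(&wq->lock);

                return 0;
        }

    The macro then simply loops on this helper, with wq->lock held, until
    the event condition becomes true or the helper returns an error.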
     


28 Oct, 2016

1 commit

  • The per-zone waitqueues exist because of a scalability issue with the
    page waitqueues on some NUMA machines, but it turns out that they hurt
    normal loads, and now with the vmalloced stacks they also end up
    breaking gfs2 that uses a bit_wait on a stack object:

    wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE)

    where 'gh' can be a reference to the local variable 'mount_gh' on the
    stack of fill_super().

    The reason the per-zone hash table breaks for this case is that there is
    no "zone" for virtual allocations, and trying to look up the physical
    page to get at it will fail (with a BUG_ON()).

    It turns out that I actually complained to the mm people about the
    per-zone hash table for another reason just a month ago: it also hurts
    the regular use of "unlock_page()" a lot, because the zone lookup ends
    up forcing several unnecessary cache misses and generates horrible
    code.

    As part of that earlier discussion, we had a much better solution for
    the NUMA scalability issue - by just making the page lock have a
    separate contention bit, the waitqueue doesn't even have to be looked at
    for the normal case.

    Peter Zijlstra already has a patch for that, but let's see if anybody
    even notices. In the meantime, let's fix the actual gfs2 breakage by
    simplifying the bitlock waitqueues and removing the per-zone issue.

    Reported-by: Andreas Gruenbacher
    Tested-by: Bob Peterson
    Acked-by: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andy Lutomirski
    Cc: Steven Whitehouse
    Signed-off-by: Linus Torvalds

    Linus Torvalds
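
    A sketch of the simplified scheme: a single hashed table of waitqueues
    keyed on the word's virtual address, so stack and vmalloc'ed objects
    hash just as well as page addresses (table size and names are
    illustrative):

        #define WAIT_TABLE_BITS 8
        #define WAIT_TABLE_SIZE (1 << WAIT_TABLE_BITS)

        static wait_queue_head_t bit_wait_table[WAIT_TABLE_SIZE] __cacheline_aligned;

        wait_queue_head_t *bit_waitqueue(void *word, int bit)
        {
                const int shift = BITS_PER_LONG == 32 ? 5 : 6;
                unsigned long val = (unsigned long)word << shift | bit;

                /* hash the virtual address directly: no zone or struct
                 * page lookup, hence no BUG_ON() for stack objects */
                return bit_wait_table + hash_long(val, WAIT_TABLE_BITS);
        }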
     

30 Sep, 2016

4 commits

  • The partial initialization of wait_queue_t in prepare_to_wait_event() looks
    ugly. This was done to shrink .text, but we can simply add a new helper
    that does the full initialization and shrinks the compiled code a bit more.

    And this way prepare_to_wait_event() can have more users. In particular,
    we are ready to remove the signal_pending_state() checks from the
    wait_bit_action_f helpers and change __wait_on_bit_lock() to use
    prepare_to_wait_event().

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Al Viro
    Cc: Bart Van Assche
    Cc: Johannes Weiner
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20160906140055.GA6167@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
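
    A sketch of the full-initialization helper (assuming the usual
    autoremove_wake_function() default):

        void init_wait_entry(wait_queue_t *wait, int flags)
        {
                wait->flags = flags;
                wait->private = current;
                wait->func = autoremove_wake_function;
                INIT_LIST_HEAD(&wait->task_list);
        }

    Callers then initialize the whole entry once up front instead of
    prepare_to_wait_event() patching up individual fields.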
     
  • __wait_on_bit_lock() doesn't need abort_exclusive_wait() either. Right
    now it can't use prepare_to_wait_event() (see the next change), but
    it can do an additional finish_wait() if action() fails.

    abort_exclusive_wait() no longer has callers, remove it.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Al Viro
    Cc: Bart Van Assche
    Cc: Johannes Weiner
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20160906140053.GA6164@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
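
    A sketch of the resulting loop (assuming the wait_bit_queue and
    wait_bit_action_f types from <linux/wait.h>):

        int __sched __wait_on_bit_lock(wait_queue_head_t *wq,
                                       struct wait_bit_queue *q,
                                       wait_bit_action_f *action,
                                       unsigned mode)
        {
                int ret = 0;

                for (;;) {
                        prepare_to_wait_exclusive(wq, &q->wait, mode);
                        if (test_bit(q->key.bit_nr, q->key.flags)) {
                                ret = action(&q->key, mode);
                                /* action() failed: undo the wait here
                                 * instead of abort_exclusive_wait() */
                                if (ret)
                                        finish_wait(wq, &q->wait);
                        }
                        if (!test_and_set_bit(q->key.bit_nr, q->key.flags)) {
                                if (!ret)
                                        finish_wait(wq, &q->wait);
                                return 0;
                        } else if (ret) {
                                return ret;
                        }
                }
        }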
     
  • ___wait_event() doesn't really need abort_exclusive_wait(), we can simply
    change prepare_to_wait_event() to remove the waiter from q->task_list if
    it was interrupted.

    This simplifies the code/logic, and this way prepare_to_wait_event() can
    have more users, see the next change.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Al Viro
    Cc: Bart Van Assche
    Cc: Johannes Weiner
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20160908164815.GA18801@redhat.com
    Signed-off-by: Ingo Molnar
    --
    include/linux/wait.h | 7 +------
    kernel/sched/wait.c | 35 +++++++++++++++++++++++++----------
    2 files changed, 26 insertions(+), 16 deletions(-)

    Oleg Nesterov
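
    A sketch of prepare_to_wait_event() with the removal folded in; the
    comment summarizes the exclusive-wakeup subtlety:

        long prepare_to_wait_event(wait_queue_head_t *q, wait_queue_t *wait,
                                   int state)
        {
                unsigned long flags;
                long ret = 0;

                spin_lock_irqsave(&q->lock, flags);
                if (unlikely(signal_pending_state(state, current))) {
                        /* Remove ourselves so a set-condition-plus-wakeup
                         * after this point cannot select us; it will wake
                         * another exclusive waiter if we fail. */
                        list_del_init(&wait->task_list);
                        ret = -ERESTARTSYS;
                } else {
                        if (list_empty(&wait->task_list)) {
                                if (wait->flags & WQ_FLAG_EXCLUSIVE)
                                        __add_wait_queue_tail(q, wait);
                                else
                                        __add_wait_queue(q, wait);
                        }
                        set_current_state(state);
                }
                spin_unlock_irqrestore(&q->lock, flags);

                return ret;
        }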
     
  • Otherwise this logic only works if mode is "compatible" with another
    exclusive waiter.

    If some wq has both TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE waiters,
    abort_exclusive_wait() won't wake an uninterruptible waiter.

    The main user is __wait_on_bit_lock() and currently it is fine but only
    because TASK_KILLABLE includes TASK_UNINTERRUPTIBLE and we do not have
    lock_page_interruptible() yet.

    Just use TASK_NORMAL and remove the "mode" arg from abort_exclusive_wait().
    Yes, this means that (say) wake_up_interruptible() can wake up the non-
    interruptible waiter(s), but I think this is fine. And in fact I think
    that abort_exclusive_wait() must die, see the next change.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Al Viro
    Cc: Bart Van Assche
    Cc: Johannes Weiner
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20160906140047.GA6157@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
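
    A sketch of abort_exclusive_wait() after dropping the 'mode' argument:

        static void abort_exclusive_wait(wait_queue_head_t *q,
                                         wait_queue_t *wait, void *key)
        {
                unsigned long flags;

                __set_current_state(TASK_RUNNING);
                spin_lock_irqsave(&q->lock, flags);
                if (!list_empty(&wait->task_list))
                        list_del_init(&wait->task_list);
                else if (waitqueue_active(q))
                        /* we consumed a wakeup: pass it on to any other
                         * waiter, regardless of its sleep state */
                        __wake_up_locked_key(q, TASK_NORMAL, key);
                spin_unlock_irqrestore(&q->lock, flags);
        }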
     

14 Dec, 2015

1 commit

  • Jan Stancek reported that I wrecked things for him by fixing things for
    Vladimir :/

    His report was due to an UNINTERRUPTIBLE wait getting -EINTR, which
    should not be possible; however, my previous patch made this possible
    by unconditionally checking signal_pending().

    We cannot use current->state as was done previously, because it can be
    changed by the instruction right after the store to that variable. We
    must instead pass the initial state along and use that.

    Fixes: 68985633bccb ("sched/wait: Fix signal handling in bit wait helpers")
    Reported-by: Jan Stancek
    Reported-by: Chris Mason
    Tested-by: Jan Stancek
    Tested-by: Vladimir Murzin
    Tested-by: Chris Mason
    Reviewed-by: Paul Turner
    Cc: Ingo Molnar
    Cc: tglx@linutronix.de
    Cc: Oleg Nesterov
    Cc: hpa@zytor.com
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
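
    A sketch of a bit-wait helper after the change, taking the initial
    sleep state as 'mode' instead of re-reading current->state:

        __sched int bit_wait(struct wait_bit_key *word, int mode)
        {
                schedule();
                /* check against the state we went to sleep in, which a
                 * wakeup cannot have rewritten under us */
                if (signal_pending_state(mode, current))
                        return -EINTR;
                return 0;
        }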
     

04 Dec, 2015

1 commit

  • Vladimir reported getting RCU stall warnings and bisected it back to
    commit:

    743162013d40 ("sched: Remove proliferation of wait_on_bit() action functions")

    That commit inadvertently reversed the calls to schedule() and
    signal_pending(), thereby not handling the case where the signal
    arrives while we sleep.

    Reported-by: Vladimir Murzin
    Tested-by: Vladimir Murzin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: mark.rutland@arm.com
    Cc: neilb@suse.de
    Cc: oleg@redhat.com
    Fixes: 743162013d40 ("sched: Remove proliferation of wait_on_bit() action functions")
    Fixes: cbbce8220949 ("SCHED: add some "wait..on_bit...timeout()" interfaces.")
    Link: http://lkml.kernel.org/r/20151201130404.GL3816@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
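
    Roughly, the fix swaps the two steps in the bit-wait helpers (return
    values illustrative):

        /* broken: a signal delivered while we sleep in schedule()
         * is never noticed, so the waiter can stall */
        if (signal_pending(current))
                return 1;
        schedule();

        /* fixed: sleep first, then check what woke us */
        schedule();
        if (signal_pending(current))
                return -EINTR;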
     

23 Sep, 2015

1 commit

  • This reverts commit 51360155eccb907ff8635bd10fc7de876408c2e0 and adapts
    fs/userfaultfd.c to use the old version of that function.

    It didn't look robust to call __wake_up_common() with "nr == 1" when we
    absolutely require wake-all semantics, but we have full control of what
    we insert in the two waitqueue heads of the blocked userfaults. No
    exclusive waitqueue entry risks being inserted into those two waitqueue
    heads, so we may as well stick with the "nr == 1" of the old code and
    rely purely on the fact that no entry inserted into either of the two
    heads we must treat as wake-all has WQ_FLAG_EXCLUSIVE set in
    wait->flags.

    Signed-off-by: Andrea Arcangeli
    Cc: Dr. David Alan Gilbert
    Cc: Michael Ellerman
    Cc: Shuah Khan
    Cc: Thierry Reding
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
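
    The reasoning rests on how the core wakeup loop treats 'nr': it only
    limits WQ_FLAG_EXCLUSIVE entries. A sketch of the long-standing
    __wake_up_common() loop:

        static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
                                     int nr_exclusive, int wake_flags, void *key)
        {
                wait_queue_t *curr, *next;

                list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
                        unsigned flags = curr->flags;

                        /* non-exclusive waiters are all woken, no matter
                         * what nr_exclusive says */
                        if (curr->func(curr, mode, wake_flags, key) &&
                            (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
                                break;
                }
        }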
     

05 Sep, 2015

1 commit

  • userfaultfd needs to wake all waitqueue entries (pass 0 as the nr
    parameter) instead of the current hardcoded 1 (which would wake just
    the first entry in the head list).

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
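
    A sketch of what such an interface change can look like; which locked
    wake-up wrapper gains the parameter, and the fault_wqh usage line, are
    illustrative here. Passing 0 means the exclusive-wakeup limit never
    triggers:

        void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr)
        {
                __wake_up_common(q, mode, nr, 0, NULL);
        }

        /* userfaultfd: wake every waiter, not just the first one */
        __wake_up_locked(&ctx->fault_wqh, TASK_NORMAL, 0);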
     

23 Jun, 2015

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The main changes are:

    - lockless wakeup support for futexes and IPC message queues
    (Davidlohr Bueso, Peter Zijlstra)

    - Replace spinlocks with atomics in thread_group_cputimer(), to
    improve scalability (Jason Low)

    - NUMA balancing improvements (Rik van Riel)

    - SCHED_DEADLINE improvements (Wanpeng Li)

    - clean up and reorganize preemption helpers (Frederic Weisbecker)

    - decouple page fault disabling machinery from the preemption
    counter, to improve debuggability and robustness (David
    Hildenbrand)

    - SCHED_DEADLINE documentation updates (Luca Abeni)

    - topology CPU masks cleanups (Bartosz Golaszewski)

    - /proc/sched_debug improvements (Srikar Dronamraju)"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (79 commits)
    sched/deadline: Remove needless parameter in dl_runtime_exceeded()
    sched: Remove superfluous resetting of the p->dl_throttled flag
    sched/deadline: Drop duplicate init_sched_dl_class() declaration
    sched/deadline: Reduce rq lock contention by eliminating locking of non-feasible target
    sched/deadline: Make init_sched_dl_class() __init
    sched/deadline: Optimize pull_dl_task()
    sched/preempt: Add static_key() to preempt_notifiers
    sched/preempt: Fix preempt notifiers documentation about hlist_del() within unsafe iteration
    sched/stop_machine: Fix deadlock between multiple stop_two_cpus()
    sched/debug: Add sum_sleep_runtime to /proc/<pid>/sched
    sched/debug: Replace vruntime with wait_sum in /proc/sched_debug
    sched/debug: Properly format runnable tasks in /proc/sched_debug
    sched/numa: Only consider less busy nodes as numa balancing destinations
    Revert 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced")
    sched/fair: Prevent throttling in early pick_next_task_fair()
    preempt: Reorganize the notrace definitions a bit
    preempt: Use preempt_schedule_context() as the official tracing preemption point
    sched: Make preempt_schedule_context() function-tracing safe
    x86: Remove cpu_sibling_mask() and cpu_core_mask()
    x86: Replace cpu_**_mask() with topology_**_cpumask()
    ...

    Linus Torvalds
     

19 May, 2015

1 commit

  • Since set_mb() is really about an smp_mb() -- not an I/O/DMA barrier
    like mb() -- rename it to match the recent smp_load_acquire() and
    smp_store_release().

    Suggested-by: Linus Torvalds
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
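
    The renamed primitive, smp_store_mb(), keeps set_mb()'s semantics: a
    plain store followed by a full SMP barrier. Roughly, for the generic
    fallback (architectures can override it):

        #define smp_store_mb(var, value) \
                do { WRITE_ONCE(var, value); smp_mb(); } while (0)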
     

08 May, 2015

1 commit

  • ACCESS_ONCE() doesn't work reliably on non-scalar types. This patch
    removes the rest of the existing usages of ACCESS_ONCE() in the
    scheduler, and uses the new READ_ONCE() and WRITE_ONCE() APIs as
    appropriate.

    Signed-off-by: Jason Low
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Thomas Gleixner
    Acked-by: Rik van Riel
    Acked-by: Waiman Long
    Cc: Andrew Morton
    Cc: Aswin Chandramouleeswaran
    Cc: Borislav Petkov
    Cc: Davidlohr Bueso
    Cc: Frederic Weisbecker
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Mike Galbraith
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Preeti U Murthy
    Cc: Scott J Norton
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/1430251224-5764-2-git-send-email-jason.low2@hp.com
    Signed-off-by: Ingo Molnar

    Jason Low
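
    An illustrative conversion (rq->curr is a real scheduler example; the
    flags write is a placeholder, not a call site from this patch):

        /* before: ACCESS_ONCE() is unreliable on non-scalar types */
        curr = ACCESS_ONCE(rq->curr);
        ACCESS_ONCE(wait->flags) = 0;

        /* after: explicit one-off read and write */
        curr = READ_ONCE(rq->curr);
        WRITE_ONCE(wait->flags, 0);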
     

04 Nov, 2014

1 commit

  • There is a race between kthread_stop() and the new wait_woken() that
    can result in a lack of progress.

    CPU 0                                   CPU 1

    rfcomm_run()                            kthread_stop()
      ...
      if (!test_bit(KTHREAD_SHOULD_STOP))
                                              set_bit(KTHREAD_SHOULD_STOP)
                                              wake_up_process()
      wait_woken()                            wait_for_completion()
        set_current_state(INTERRUPTIBLE)
        if (!WQ_FLAG_WOKEN)
          schedule_timeout()

    After which both tasks will wait.. forever.

    Fix this by having wait_woken() check for kthread_should_stop() but
    only for kthreads (obviously).

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Peter Hurley
    Cc: Oleg Nesterov
    Cc: Linus Torvalds
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
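
    A sketch of the fix, close to the description above (treat the helper
    name and barrier placement as illustrative):

        static inline bool is_kthread_should_stop(void)
        {
                return (current->flags & PF_KTHREAD) && kthread_should_stop();
        }

        long wait_woken(wait_queue_t *wait, unsigned mode, long timeout)
        {
                set_current_state(mode);
                /* also bail out when a kthread has been asked to stop,
                 * so the wakeup from kthread_stop() cannot be lost */
                if (!(wait->flags & WQ_FLAG_WOKEN) && !is_kthread_should_stop())
                        timeout = schedule_timeout(timeout);
                __set_current_state(TASK_RUNNING);

                smp_wmb(); /* order the flag clear against the sleep above */
                wait->flags &= ~WQ_FLAG_WOKEN;

                return timeout;
        }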
     

28 Oct, 2014

1 commit

  • There are a few places that call blocking primitives from wait loops,
    provide infrastructure to support this without the typical
    task_struct::state collision.

    We record the wakeup in wait_queue_t::flags, which leaves
    task_struct::state free to be used by others.

    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Oleg Nesterov
    Cc: tglx@linutronix.de
    Cc: ilya.dryomov@inktank.com
    Cc: umgwanakikbuti@gmail.com
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140924082242.051202318@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
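
    The usage pattern this enables looks roughly like this (wq_head and
    'condition' are placeholders):

        DEFINE_WAIT_FUNC(wait, woken_wake_function);

        add_wait_queue(&wq_head, &wait);
        for (;;) {
                if (condition)
                        break;
                /* blocking primitives may be called in here: the wakeup
                 * is recorded in wait.flags (WQ_FLAG_WOKEN), not in
                 * task_struct::state, so it cannot be clobbered */
                wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
        }
        remove_wait_queue(&wq_head, &wait);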
     

25 Sep, 2014

1 commit

  • In commit c1221321b7c25b53204447cff9949a6d5a7ddddc ("sched: Allow
    wait_on_bit_action() functions to support a timeout") I suggested
    that a "wait_on_bit_timeout()" interface would not meet my need.
    This isn't true - I was just over-engineering.

    Including a 'private' field in wait_bit_key instead of a focused
    "timeout" field was just premature generalization. If some other
    use is ever found, it can be generalized or added later.

    So this patch renames "private" to "timeout", meaning "stop waiting
    when jiffies reaches or passes timeout", and adds two of the many
    possible wait..bit..timeout() interfaces:

    wait_on_page_bit_killable_timeout(), which is the one I want to use,
    and out_of_line_wait_on_bit_timeout() which is a reasonably general
    example. Others can be added as needed.

    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: NeilBrown
    Acked-by: Ingo Molnar
    Signed-off-by: Trond Myklebust

    NeilBrown
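
    A sketch of the underlying action helper such interfaces can use
    ('timeout' is in jiffies, per the renamed field; signal handling is
    elided):

        __sched int bit_wait_timeout(struct wait_bit_key *word)
        {
                unsigned long now = READ_ONCE(jiffies);

                if (time_after_eq(now, word->timeout))
                        return -EAGAIN;
                schedule_timeout(word->timeout - now);
                return 0;
        }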
     

16 Jul, 2014

2 commits

  • It is currently not possible for various wait_on_bit functions
    to implement a timeout.

    While the "action" function that is called to do the waiting
    could certainly use schedule_timeout(), there is no way to carry
    forward the remaining timeout after a false wake-up.
    As false-wakeups a clearly possible at least due to possible
    hash collisions in bit_waitqueue(), this is a real problem.

    The 'action' function is currently passed a pointer to the word
    containing the bit being waited on. No current action functions
    use this pointer. So changing it to something else will be a
    little noisy but will have no immediate effect.

    This patch changes the 'action' function to take a pointer to
    the "struct wait_bit_key", which contains a pointer to the word
    containing the bit so nothing is really lost.

    It also adds a 'private' field to "struct wait_bit_key", which
    is initialized to zero.

    An action function can now implement a timeout with something
    like

    static int timed_out_waiter(struct wait_bit_key *key)
    {
            unsigned long waited;

            if (key->private == 0) {
                    /* first call: record the start time, avoiding the
                     * value 0, which means "not started yet" */
                    key->private = jiffies;
                    if (key->private == 0)
                            key->private -= 1;
            }
            waited = jiffies - key->private;
            if (waited > 10 * HZ)
                    return -EAGAIN;
            /* sleep for the remainder of the 10-second budget */
            schedule_timeout(10 * HZ - waited);
            return 0;
    }

    If any other need for context in a waiter were found it would be
    easy to use ->private for some other purpose, or even extend
    "struct wait_bit_key".

    My particular need is to support timeouts in nfs_release_page()
    to avoid deadlocks with loopback mounted NFS.

    While wait_on_bit_timeout() would be a cleaner interface, it
    will not meet my need. I need the timeout to be sensitive to
    the state of the connection with the server, which could change.
    So I need to use an 'action' interface.

    Signed-off-by: NeilBrown
    Acked-by: Peter Zijlstra
    Cc: Oleg Nesterov
    Cc: Steve French
    Cc: David Howells
    Cc: Steven Whitehouse
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140707051604.28027.41257.stgit@notabene.brown
    Signed-off-by: Ingo Molnar

    NeilBrown
     
  • The current "wait_on_bit" interface requires an 'action'
    function to be provided which does the actual waiting.
    There are over 20 such functions, many of them identical.
    Most cases can be satisfied by one of just two functions, one
    which uses io_schedule() and one which just uses schedule().

    So:
    Rename wait_on_bit and wait_on_bit_lock to
    wait_on_bit_action and wait_on_bit_lock_action
    to make it explicit that they need an action function.

    Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
    which are *not* given an action function but implicitly use
    a standard one.
    The decision to error-out if a signal is pending is now made
    based on the 'mode' argument rather than being encoded in the action
    function.

    All instances of the old wait_on_bit and wait_on_bit_lock which
    can use the new version have been changed accordingly and their
    action functions have been discarded.
    wait_on_bit{_lock} does not return any specific error code in the
    event of a signal, so the caller must check for a non-zero return
    and substitute their own error code as appropriate.

    The wait_on_bit() call in __fscache_wait_on_invalidate() was
    ambiguous as it specified TASK_UNINTERRUPTIBLE but used
    fscache_wait_bit_interruptible as an action function.
    David Howells confirms this should be uniformly
    "uninterruptible".

    The main remaining user of wait_on_bit{,_lock}_action is NFS
    which needs to use a freezer-aware schedule() call.

    A comment in fs/gfs2/glock.c notes that having multiple 'action'
    functions is useful as they display differently in the 'wchan'
    field of 'ps' (and in /proc/$PID/wchan).
    As the new bit_wait{,_io} functions are tagged "__sched", they
    will not show up there at all; something higher in the stack will
    appear instead. So the distinction will still be visible, only with
    different function names (gfs2_glock_wait versus gfs2_glock_dq_wait
    in the gfs2/glock.c case).

    Since the first version of this patch (against 3.15), two new action
    functions have appeared: one in NFS and one in CIFS. CIFS also now
    uses an action function that makes the same freezer-aware schedule
    call as NFS.

    Signed-off-by: NeilBrown
    Acked-by: David Howells (fscache, keys)
    Acked-by: Steven Whitehouse (gfs2)
    Acked-by: Peter Zijlstra
    Cc: Oleg Nesterov
    Cc: Steve French
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
    Signed-off-by: Ingo Molnar

    NeilBrown
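
    An illustrative before/after of the pattern (the inode call site is
    an example of the shape, not taken from this patch):

        /* before: every caller supplied an action function */
        wait_on_bit(&inode->i_state, __I_NEW,
                    inode_wait, TASK_UNINTERRUPTIBLE);

        /* after: a standard schedule()-based wait is implied, and the
         * 'mode' argument decides whether signals abort the wait */
        wait_on_bit(&inode->i_state, __I_NEW, TASK_UNINTERRUPTIBLE);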
     

18 Apr, 2014

1 commit

  • Mostly scripted conversion of the smp_mb__* barriers.

    Signed-off-by: Peter Zijlstra
    Acked-by: Paul E. McKenney
    Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
    Cc: Linus Torvalds
    Cc: linux-arch@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
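
    The conversion replaces the bitop-specific barrier names with the
    unified atomic ones, e.g. ('bit' and 'word' are placeholders):

        /* before */
        smp_mb__before_clear_bit();
        clear_bit(bit, &word);

        /* after */
        smp_mb__before_atomic();
        clear_bit(bit, &word);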
     

06 Nov, 2013

2 commits

  • For some reason only the wait part of the wait API lives in
    kernel/sched/wait.c and the wake part still lives in
    kernel/sched/core.c; amend this.

    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/n/tip-ftycee88naznulqk7ei5mbci@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Suggested-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/n/tip-5q5yqvdaen0rmapwloeaotx3@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra