10 Feb, 2015

5 commits

  • Pull timer updates from Ingo Molnar:
    "The main changes in this cycle were:

    - rework hrtimer expiry calculation in hrtimer_interrupt(): the
    previous code had a subtle bug where expiry caching would miss an
    expiry, resulting in occasional bogus (late) expiry of hrtimers.

    - continuing Y2038 fixes

    - ktime division optimization

    - misc smaller fixes and cleanups"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    hrtimer: Make __hrtimer_get_next_event() static
    rtc: Convert rtc_set_ntp_time() to use timespec64
    rtc: Remove redundant rtc_valid_tm() from rtc_hctosys()
    rtc: Modify rtc_hctosys() to address y2038 issues
    rtc: Update rtc-dev to use y2038-safe time interfaces
    rtc: Update interface.c to use y2038-safe time interfaces
    time: Expose get_monotonic_boottime64 for in-kernel use
    time: Expose getboottime64 for in-kernel uses
    ktime: Optimize ktime_divns for constant divisors
    hrtimer: Prevent stale expiry time in hrtimer_interrupt()
    ktime.h: Introduce ktime_ms_delta

    Linus Torvalds
     
  • Pull scheduler updates from Ingo Molnar:
    "The main scheduler changes in this cycle were:

    - various sched/deadline fixes and enhancements

    - rescheduling latency fixes/cleanups

    - rework the rq->clock code to be more consistent and more robust.

    - minor micro-optimizations

    - ->avg.decay_count fixes

    - add a stack overflow check to might_sleep()

    - idle-poll handler fix, possibly resulting in power savings

    - misc smaller updates and fixes"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/Documentation: Remove unneeded word
    sched/wait: Introduce wait_on_bit_timeout()
    sched: Pull resched loop to __schedule() callers
    sched/deadline: Remove cpu_active_mask from cpudl_find()
    sched: Fix hrtick_start() on UP
    sched/deadline: Avoid pointless __setscheduler()
    sched/deadline: Fix stale yield state
    sched/deadline: Fix hrtick for a non-leftmost task
    sched/deadline: Modify cpudl::free_cpus to reflect rd->online
    sched/idle: Add missing checks to the exit condition of cpu_idle_poll()
    sched: Fix missing preemption opportunity
    sched/rt: Reduce rq lock contention by eliminating locking of non-feasible target
    sched/debug: Print rq->clock_task
    sched/core: Rework rq->clock update skips
    sched/core: Validate rq_clock*() serialization
    sched/core: Remove check of p->sched_class
    sched/fair: Fix sched_entity::avg::decay_count initialization
    sched/debug: Fix potential call to __ffs(0) in sched_show_task()
    sched/debug: Check for stack overflow in ___might_sleep()
    sched/fair: Fix the dealing with decay_count in __synchronize_entity_decay()

    Linus Torvalds
     
  • Pull perf updates from Ingo Molnar:
    "Kernel side changes:

    - AMD range breakpoints support:

    Extend breakpoint tools and core to support address range through
    perf event with initial backend support for AMD extended
    breakpoints.

    The syntax is:

    perf record -e mem:addr/len:type

    For example, to set a write breakpoint from 0x1000 to 0x1200 (0x1000 + 512):

    perf record -e mem:0x1000/512:w

    - event throttling/rotating fixes

    - various event group handling fixes, cleanups and general paranoia
    code to be more robust against bugs in the future.

    - kernel stack overhead fixes

    User-visible tooling side changes:

    - Show the precise number of samples at the end of a 'record' session
    if processing build ids, since we will then traverse the whole
    perf.data file and see all the PERF_RECORD_SAMPLE records;
    otherwise stop showing the previous off-base, heuristically counted
    number of "samples" (Namhyung Kim).

    - Support to read compressed module from build-id cache (Namhyung
    Kim)

    - Enable sampling loads and stores simultaneously in 'perf mem'
    (Stephane Eranian)

    - 'perf diff' output improvements (Namhyung Kim)

    - Fix error reporting for evsel pgfault constructor (Arnaldo Carvalho
    de Melo)

    Tooling side infrastructure changes:

    - Cache eh/debug frame offset for dwarf unwind (Namhyung Kim)

    - Support parsing parameterized events (Cody P Schafer)

    - Add support for IP address formats in libtraceevent (David Ahern)

    Plus other misc fixes"

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
    perf: Decouple unthrottling and rotating
    perf: Drop module reference on event init failure
    perf: Use POLLIN instead of POLL_IN for perf poll data in flag
    perf: Fix put_event() ctx lock
    perf: Fix move_group() order
    perf: Fix event->ctx locking
    perf: Add a bit of paranoia
    perf symbols: Convert lseek + read to pread
    perf tools: Use perf_data_file__fd() consistently
    perf symbols: Support to read compressed module from build-id cache
    perf evsel: Set attr.task bit for a tracking event
    perf header: Set header version correctly
    perf record: Show precise number of samples
    perf tools: Do not use __perf_session__process_events() directly
    perf callchain: Cache eh/debug frame offset for dwarf unwind
    perf tools: Provide stub for missing pthread_attr_setaffinity_np
    perf evsel: Don't rely on malloc working for sz 0
    tools lib traceevent: Add support for IP address formats
    perf ui/tui: Show fatal error message only if exists
    perf tests: Fix typo in sample-parsing.c
    ...

    Linus Torvalds
     
  • Pull core locking updates from Ingo Molnar:
    "The main changes are:

    - mutex, completions and rtmutex micro-optimizations
    - lock debugging fix
    - various cleanups in the MCS and the futex code"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    locking/rtmutex: Optimize setting task running after being blocked
    locking/rwsem: Use task->state helpers
    sched/completion: Add lock-free checking of the blocking case
    sched/completion: Remove unnecessary ->wait.lock serialization when reading completion state
    locking/mutex: Explicitly mark task as running after wakeup
    futex: Fix argument handling in futex_lock_pi() calls
    doc: Fix misnamed FUTEX_CMP_REQUEUE_PI op constants
    locking/Documentation: Update code path
    softirq/preempt: Add missing current->preempt_disable_ip update
    locking/osq: No need for load/acquire when acquire-polling
    locking/mcs: Better differentiate between MCS variants
    locking/mutex: Introduce ww_mutex_set_context_slowpath()
    locking/mutex: Move MCS related comments to proper location
    locking/mutex: Checking the stamp is WW only

    Linus Torvalds
     
  • Pull RCU updates from Ingo Molnar:
    "The main RCU changes in this cycle are:

    - Documentation updates.

    - Miscellaneous fixes.

    - Preemptible-RCU fixes, including fixing an old bug in the
    interaction of RCU priority boosting and CPU hotplug.

    - SRCU updates.

    - RCU CPU stall-warning updates.

    - RCU torture-test updates"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
    rcu: Initialize tiny RCU stall-warning timeouts at boot
    rcu: Fix RCU CPU stall detection in tiny implementation
    rcu: Add GP-kthread-starvation checks to CPU stall warnings
    rcu: Make cond_resched_rcu_qs() apply to normal RCU flavors
    rcu: Optionally run grace-period kthreads at real-time priority
    ksoftirqd: Use new cond_resched_rcu_qs() function
    ksoftirqd: Enable IRQs and call cond_resched() before poking RCU
    rcutorture: Add more diagnostics in rcu_barrier() test failure case
    torture: Flag console.log file to prevent holdovers from earlier runs
    torture: Add "-enable-kvm -soundhw pcspk" to qemu command line
    rcutorture: Handle different mpstat versions
    rcutorture: Check from beginning to end of grace period
    rcu: Remove redundant rcu_batches_completed() declaration
    rcutorture: Drop rcu_torture_completed() and friends
    rcu: Provide rcu_batches_completed_sched() for TINY_RCU
    rcutorture: Use unsigned for Reader Batch computations
    rcutorture: Make build-output parsing correctly flag RCU's warnings
    rcu: Make _batches_completed() functions return unsigned long
    rcutorture: Issue warnings on close calls due to Reader Batch blows
    documentation: Fix smp typo in memory-barriers.txt
    ...

    Linus Torvalds
     

07 Feb, 2015

3 commits


05 Feb, 2015

1 commit

  • I noticed some CLOCK_TAI timer test failures on one of my
    less-frequently used configurations. And after digging in I
    found that in 76f4108892d9 (Cleanup hrtimer accessors to the
    timekeeping state), the hrtimer_get_softirq_time() tai offset
    calculation was incorrectly rewritten, as the tai offset we
    return should be from CLOCK_MONOTONIC, and not CLOCK_REALTIME.

    This results in CLOCK_TAI timers expiring early on non-highres
    capable machines.

    This patch fixes the issue, calculating the tai time properly
    from the monotonic base.
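
    A minimal sketch of the corrected calculation (a hypothetical reconstruction
    of the hrtimer_get_softirq_time() body, assuming the timekeeping core hands
    back offsets relative to CLOCK_MONOTONIC):

        /* Hypothetical sketch: derive the TAI softirq time from the monotonic
         * base plus the TAI offset, rather than from CLOCK_REALTIME. */
        mono = ktime_get_update_offsets_tick(&off_real, &off_boot, &off_tai);
        boot = ktime_add(mono, off_boot);
        xtim = ktime_add(mono, off_real);
        tai  = ktime_add(mono, off_tai);    /* previously computed from xtim */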

    Signed-off-by: John Stultz
    Cc: Thomas Gleixner
    Cc: stable # 3.17+
    Link: http://lkml.kernel.org/r/1423097126-10236-1-git-send-email-john.stultz@linaro.org
    Signed-off-by: Ingo Molnar

    John Stultz
     

04 Feb, 2015

23 commits

  • Currently the adjustments made as part of perf_event_task_tick() use the
    percpu rotation lists to iterate over any active PMU contexts, but these
    are not used by the context rotation code, having been replaced by
    separate (per-context) hrtimer callbacks. However, some manipulation of
    the rotation lists (i.e. removal of contexts) has remained in
    perf_rotate_context(). This leads to the following issues:

    * Contexts are not always removed from the rotation lists. Removal of
    PMUs which have been placed in rotation lists, but have not been
    removed by a hrtimer callback can result in corruption of the rotation
    lists (when memory backing the context is freed).

    This has been observed to result in hangs when PMU drivers built as
    modules are inserted and removed around the creation of events for
    said PMUs.

    * Contexts which do not require rotation may be removed from the
    rotation lists as a result of a hrtimer, and will not be considered by
    the unthrottling code in perf_event_task_tick.

    This patch fixes the issue by updating the rotation list when events are
    scheduled in/out, ensuring that each rotation list stays in sync with
    the HW state. As each event holds a refcount on the module of its PMU,
    this ensures that when a PMU module is unloaded none of its CPU contexts
    can be in a rotation list. By maintaining a list of perf_event_contexts
    rather than perf_event_cpu_contexts, we don't need separate paths to
    handle the cpu and task contexts, which also makes the code a little
    simpler.

    As the rotation_list variables are not used for rotation, these are
    renamed to active_ctx_list, which better matches their current function.
    perf_pmu_rotate_{start,stop} are renamed to
    perf_pmu_ctx_{activate,deactivate}.
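
    A minimal sketch of the activate side, using the names from the description
    above (hypothetical; the exact identifiers in the final code may differ):

        static void perf_pmu_ctx_activate(struct perf_event_context *ctx)
        {
                struct list_head *head = this_cpu_ptr(&active_ctx_list);

                WARN_ON(!irqs_disabled());
                /* A context must only ever sit on its CPU's active list once. */
                WARN_ON(!list_empty(&ctx->active_ctx_list));

                list_add(&ctx->active_ctx_list, head);
        }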

    Reported-by: Johannes Jensen
    Signed-off-by: Mark Rutland
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Will Deacon
    Cc: Arnaldo Carvalho de Melo
    Cc: Fengguang Wu
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20150129134511.GR17721@leverpostej
    Signed-off-by: Ingo Molnar

    Mark Rutland
     
  • When initialising an event, perf_init_event will call try_module_get() to
    ensure that the PMU's module cannot be removed for the lifetime of the
    event, with __free_event() dropping the reference when the event is
    finally destroyed. If something fails after the event has been
    initialised, but before the event is installed, perf_event_alloc will
    drop the reference on the module.

    However, if we fail to initialise an event for some reason (e.g. we ask
    an uncore PMU to perform sampling, and it refuses to initialise the
    event), we do not drop the refcount. If we try to open such a bogus
    event without a precise IDR type, we will loop over each PMU in the pmus
    list, incrementing each of their refcounts without decrementing them.

    This patch adds a module_put when pmu->event_init(event) fails, ensuring
    that the refcounts are balanced in failure cases. As the innards of the
    precise and search based initialisation look very similar, this logic is
    hoisted out into a new helper function. While the early return for the
    failed try_module_get is removed from the search case, this is handled
    by the remaining return when ret is not -ENOENT.
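
    A minimal sketch of the hoisted helper described above (a hypothetical
    reconstruction; the helper name is illustrative):

        static int perf_try_init_event(struct pmu *pmu, struct perf_event *event)
        {
                int ret;

                if (!try_module_get(pmu->module))
                        return -ENODEV;

                event->pmu = pmu;
                ret = pmu->event_init(event);
                if (ret)
                        module_put(pmu->module);   /* balance the ref on init failure */

                return ret;
        }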

    Signed-off-by: Mark Rutland
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Will Deacon
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1420642611-22667-1-git-send-email-mark.rutland@arm.com
    Signed-off-by: Ingo Molnar

    Mark Rutland
     
  • Currently we flag available data (via the poll syscall) on a perf fd with
    the POLL_IN macro, which is normally used for the SIGIO interface.

    We've been lucky, because POLLIN (0x1) is a subset of POLL_IN (0x20001)
    and sys_poll (the do_pollfd function) cuts the extra bit (0x20000) out.

    Signed-off-by: Jiri Olsa
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Frederic Weisbecker
    Cc: Namhyung Kim
    Cc: Stephane Eranian
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1422467678-22341-1-git-send-email-jolsa@kernel.org
    Signed-off-by: Ingo Molnar

    Jiri Olsa
     
  • So what I suspect (I'm in zombie mode today, it seems) is that while
    I initially thought that it was impossible for ctx to change when the
    refcount dropped to 0, I now suspect it's possible.

    Note that until perf_remove_from_context() the event is still active and
    visible on the lists. So a concurrent sys_perf_event_open() from another
    task into this task can race.

    Reported-by: Vince Weaver
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Stephane Eranian
    Cc: mark.rutland@arm.com
    Cc: Jiri Olsa
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20150129134434.GB26304@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Jiri reported triggering the new WARN_ON_ONCE in event_sched_out over
    the weekend:

    event_sched_out.isra.79+0x2b9/0x2d0
    group_sched_out+0x69/0xc0
    ctx_sched_out+0x106/0x130
    task_ctx_sched_out+0x37/0x70
    __perf_install_in_context+0x70/0x1a0
    remote_function+0x48/0x60
    generic_exec_single+0x15b/0x1d0
    smp_call_function_single+0x67/0xa0
    task_function_call+0x53/0x80
    perf_install_in_context+0x8b/0x110

    I think the below should cure this; if we install a group leader it
    will iterate the (still intact) group list and find its siblings and
    try and install those too -- even though those still have the old
    event->ctx -- in the new ctx.

    Upon installing the first group sibling we'd try and schedule out the
    group and trigger the above warn.

    Fix this by installing the group leader last: installing the siblings
    first has no effect, since they're not reachable through the group lists
    and therefore we don't schedule them.

    Also delay resetting the state until we're absolutely sure the events
    are quiescent.

    Reported-by: Jiri Olsa
    Reported-by: vincent.weaver@maine.edu
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20150126162639.GA21418@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra (Intel)
     
  • There have been a few reported issues wrt. the lack of locking around
    changing event->ctx. This patch tries to address those.

    It avoids the whole rwsem thing; and while it appears to work, please
    give it some thought in review.

    What I did fail at is sensible runtime checks on the use of
    event->ctx; the RCU use makes it very hard.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Paul E. McKenney
    Cc: Jiri Olsa
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20150123125834.209535886@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Add a few WARN()s to catch things that should never happen.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20150123125834.150481799@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • We explicitly mark the task running after returning from
    a __rt_mutex_slowlock() call, which does the actual sleeping
    via wait-wake-trylocking. As such, this patch does two things:

    (1) refactors the code so that setting current to TASK_RUNNING
    is done by __rt_mutex_slowlock(), and not by the callers. The
    downside to this is that it becomes a bit unclear at what
    point we block. As such I've added a comment that the task
    blocks when calling __rt_mutex_slowlock() so readers can figure
    out when it is running again.

    (2) relaxes setting current's state through __set_current_state(),
    instead of its more expensive barrier alternative. There was no
    need for the implied barrier as we're obviously not planning on
    blocking.
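
    For reference, the difference between the two primitives (a minimal
    illustration, not taken from the patch):

        __set_current_state(TASK_RUNNING);   /* plain store, no memory barrier */
        set_current_state(TASK_RUNNING);     /* store plus barrier, needed before
                                                re-checking a blocking condition */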

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1422857784.18096.1.camel@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     
  • Call __set_task_state() instead of assigning the new state
    directly. These interfaces also aid CONFIG_DEBUG_ATOMIC_SLEEP
    environments, keeping track of who last changed the state.
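
    A minimal before/after illustration of the conversion described (a
    hypothetical snippet, not the full patch):

        /* before: a direct assignment, invisible to the sleep-debugging code */
        tsk->state = TASK_UNINTERRUPTIBLE;

        /* after: also records the caller under CONFIG_DEBUG_ATOMIC_SLEEP */
        __set_task_state(tsk, TASK_UNINTERRUPTIBLE);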

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: "Paul E. McKenney"
    Cc: Jason Low
    Cc: Michel Lespinasse
    Cc: Tim Chen
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1422257769-14083-2-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     
  • The "thread would block" case can be checked without grabbing ->wait.lock.

    [ If the check does not return early then grab the lock and recheck.
    A memory barrier is not needed as complete() and complete_all() imply
    a barrier.

    The ACCESS_ONCE() is needed for calls in a loop that, if inlined, could
    optimize out the re-fetching of x->done. ]
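
    A minimal sketch of the resulting fast path (a hypothetical reconstruction
    of try_wait_for_completion() following the description above):

        bool try_wait_for_completion(struct completion *x)
        {
                unsigned long flags;
                int ret = 1;

                /* Lock-free check of the blocking case: bail out early. */
                if (!ACCESS_ONCE(x->done))
                        return 0;

                spin_lock_irqsave(&x->wait.lock, flags);
                if (!x->done)                   /* recheck under the lock */
                        ret = 0;
                else
                        x->done--;
                spin_unlock_irqrestore(&x->wait.lock, flags);
                return ret;
        }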

    Signed-off-by: Nicholas Mc Guire
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1422013307-13200-1-git-send-email-der.herr@hofr.at
    Signed-off-by: Ingo Molnar

    Nicholas Mc Guire
     
  • Signed-off-by: Nicholas Mc Guire
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1421467534-22834-1-git-send-email-der.herr@hofr.at
    Signed-off-by: Ingo Molnar

    Nicholas Mc Guire
     
  • By the time we wake up and get the lock after being asleep
    in the slowpath, we better be running. As good practice,
    be explicit about this and avoid any mischief.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: "Paul E. McKenney"
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1421717961.4903.11.camel@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • The second 'mutex' shouldn't be there; it can't be about the mutex,
    as the mutex can't be freed, only unlocked. The memory where the
    mutex resides, however, can be freed.

    Signed-off-by: Sharon Dvir
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1422827252-31363-1-git-send-email-sharon.dvir1@mail.huji.ac.il
    Signed-off-by: Ingo Molnar

    Sharon Dvir
     
  • __schedule() disables preemption during its job and re-enables it
    afterward without doing a preemption check to avoid recursion.

    But if an event happens after the context switch which requires
    rescheduling, we need to check again if a task of a higher priority
    needs the CPU. A preempt irq can raise such a situation. To handle that,
    __schedule() loops on need_resched().

    But preempt_schedule_*() functions, which call __schedule(), also loop
    on need_resched() to handle missed preempt irqs. Hence we end up with
    the same loop happening twice.

    Let's simplify that by attributing the need_resched() loop responsibility
    to all __schedule() callers.

    There is a risk that the outer loop now handles reschedules that used
    to be handled by the inner loop, with the added overhead of caller details
    (inc/dec of PREEMPT_ACTIVE, irq save/restore), but assuming those inner
    rescheduling loops weren't too frequent, this shouldn't matter. Especially
    since the whole preemption path is now losing one loop in any case.
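
    Conceptually, each caller ends up with a loop of this shape (a minimal
    sketch):

        /* each __schedule() caller now owns the loop */
        do {
                __schedule();
        } while (need_resched());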

    Suggested-by: Linus Torvalds
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/1422404652-29067-2-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • cpu_active_mask is rarely changed (only on hotplug), so remove this
    operation to gain a little performance.

    If there is a change in cpu_active_mask, rq_online_dl() and
    rq_offline_dl() should take care of it normally, so cpudl::free_cpus
    carries enough information for us.

    For the rare case when a task is put onto a dying cpu (which
    rq_offline_dl() can't handle in a timely fashion), it will be
    handled through _cpu_down()->...->multi_cpu_stop()->migration_call()
    ->migrate_tasks(), preventing the task from hanging on the
    dead cpu.

    Cc: Juri Lelli
    Signed-off-by: Xunlei Pang
    [peterz: changelog]
    Signed-off-by: Peter Zijlstra (Intel)
    Link: http://lkml.kernel.org/r/1421642980-10045-2-git-send-email-pang.xunlei@linaro.org
    Cc: Linus Torvalds
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Xunlei Pang
     
  • Commit 177ef2a6315e ("sched/deadline: Fix a precision problem in
    the microseconds range") forgot to change the UP version of
    hrtick_start(); do so now.
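
    A minimal sketch of the UP variant with the same clamp applied
    (hypothetical; the 10 usec lower bound is an assumption carried over from
    the SMP fix referenced above):

        void hrtick_start(struct rq *rq, u64 delay)
        {
                /* Don't schedule slices shorter than 10000ns; it makes no sense. */
                delay = max_t(u64, delay, 10000LL);
                hrtimer_start(&rq->hrtick_timer, ns_to_ktime(delay),
                              HRTIMER_MODE_REL_PINNED);
        }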

    Signed-off-by: Wanpeng Li
    Fixes: 177ef2a6315e ("sched/deadline: Fix a precision problem in the microseconds range")
    [ Fixed the changelog. ]
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Juri Lelli
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1416962647-76792-7-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • There is no need to dequeue/enqueue and push/pull if there are
    no scheduling parameters changed for the DL class.

    Both fair and RT classes already check if parameters changed for
    them to avoid unnecessary overhead. This patch adds the parameters-changed
    test for the DL class in order to reduce overhead.
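
    A minimal sketch of the kind of parameters-changed test described (the
    helper name and field list are illustrative assumptions):

        static bool dl_param_changed(struct task_struct *p,
                                     const struct sched_attr *attr)
        {
                struct sched_dl_entity *dl_se = &p->dl;

                return dl_se->dl_runtime  != attr->sched_runtime  ||
                       dl_se->dl_deadline != attr->sched_deadline ||
                       dl_se->dl_period   != attr->sched_period   ||
                       dl_se->flags       != attr->sched_flags;
        }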

    Signed-off-by: Wanpeng Li
    [ Fixed up the changelog. ]
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Juri Lelli
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1416962647-76792-5-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • When we fail to start the deadline timer in update_curr_dl(), we
    forget to clear ->dl_yielded, resulting in wrecked time keeping.

    The natural place to clear both ->dl_yielded and ->dl_throttled
    is replenish_dl_entity(); both are, after all, waiting for that event.
    Make it so.

    Luckily, since 67dfa1b756f2 ("sched/deadline: Implement
    cancel_dl_timer() to use in switched_from_dl()") the
    task_on_rq_queued() condition in dl_task_timer() must be true, and we
    can therefore call enqueue_task_dl() unconditionally.

    Reported-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Kirill Tkhai
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1416962647-76792-4-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • After update_curr_dl() the current task might not be the leftmost task
    anymore. In that case do not start a new hrtick for it.

    In this case NEED_RESCHED will be set and the next schedule will start
    the hrtick for the new task if and when appropriate.

    Signed-off-by: Wanpeng Li
    Acked-by: Juri Lelli
    [ Rewrote the changelog and comment. ]
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1416962647-76792-2-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Commit 67dfa1b756f2 ("sched/deadline: Implement cancel_dl_timer() to
    use in switched_from_dl()") removed the hrtimer_try_cancel() function
    call out from init_dl_task_timer(), which gets called from
    __setparam_dl().

    The result is that we can now re-init the timer while it's active --
    this is bad and corrupts timer state.

    Furthermore, changing the parameters of an active deadline task is
    tricky in that you want to maintain guarantees, while an immediately
    effective change would allow one to circumvent the CBS guarantees --
    this too is bad, as one (bad) task should not be able to affect the
    others.

    Rework things to avoid both problems. We only need to initialize the
    timer once, so move that to __sched_fork() for new tasks.

    Then make sure __setparam_dl() doesn't affect the current running
    state but only updates the parameters used to calculate the next
    scheduling period -- this guarantees the CBS functions as expected
    (albeit slightly pessimistic).

    This however means we need to make sure __dl_clear_params() resets the
    active state, otherwise new tasks (and tasks flipping between
    classes) will not properly (re)compute their first instance.

    Todo: close class flipping CBS hole.
    Todo: implement delayed BW release.

    Reported-by: Luca Abeni
    Acked-by: Juri Lelli
    Tested-by: Luca Abeni
    Fixes: 67dfa1b756f2 ("sched/deadline: Implement cancel_dl_timer() to use in switched_from_dl()")
    Signed-off-by: Peter Zijlstra (Intel)
    Cc:
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20150128140803.GF23038@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

02 Feb, 2015

1 commit

  • Commit 8eb23b9f35aa ("sched: Debug nested sleeps") added code to report
    on nested sleep conditions, which we generally want to avoid because the
    inner sleeping operation can re-set the thread state to TASK_RUNNING,
    but that will then cause the outer sleep loop to not actually sleep when
    it calls schedule().

    However, that's actually valid traditional behavior, with the inner
    sleep being some fairly rare case (like taking a sleeping lock that
    normally doesn't actually need to sleep).

    And the debug code would actually change the state of the task to
    TASK_RUNNING internally, which makes that kind of traditional and
    working code not work at all, because now the nested sleep doesn't just
    sometimes cause the outer one to not block, but will cause it to happen
    every time.

    In particular, it will cause the cardbus kernel daemon (pccardd) to
    basically busy-loop doing scheduling, converting a laptop into a heater,
    as reported by Bruno Prémont. But there may be other legacy uses of
    that nested sleep model in other drivers that are also likely to never
    get converted to the new model.

    This fixes both cases:

    - don't set TASK_RUNNING when the nested condition happens (note: even
    if WARN_ONCE() only _warns_ once, the return value isn't whether the
    warning happened, but whether the condition for the warning was true.
    So despite the warning only happening once, the "if (WARN_ON(..))"
    would trigger for every nested sleep.)

    - in the cases where we knowingly disable the warning by using
    "sched_annotate_sleep()", don't change the task state (that is used
    for all core scheduling decisions); instead use '->task_state_change',
    which is used for the debugging decision itself.

    (Credit for the second part of the fix goes to Oleg Nesterov: "Can't we
    avoid this subtle change in behaviour DEBUG_ATOMIC_SLEEP adds?" with the
    suggested change to use 'task_state_change' as part of the test)
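
    A minimal sketch of what the second part amounts to (a hypothetical form
    of the annotation helper after the change):

        /* Clear only the debug bookkeeping, not the live task state. */
        #define sched_annotate_sleep()  (current->task_state_change = 0)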

    Reported-and-bisected-by: Bruno Prémont
    Tested-by: Rafael J Wysocki
    Acked-by: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Ilya Dryomov
    Cc: Mike Galbraith
    Cc: Ingo Molnar
    Cc: Peter Hurley
    Cc: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

31 Jan, 2015

6 commits

  • Pull perf fixes from Ingo Molnar:
    "Mostly tooling fixes, but also an event groups fix, two PMU driver
    fixes and a CPU model variant addition"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf: Tighten (and fix) the grouping condition
    perf/x86/intel: Add model number for Airmont
    perf/rapl: Fix crash in rapl_scale()
    perf/x86/intel/uncore: Move uncore_box_init() out of driver initialization
    perf probe: Fix probing kretprobes
    perf symbols: Introduce 'for' method to iterate over the symbols with a given name
    perf probe: Do not rely on map__load() filter to find symbols
    perf symbols: Introduce method to iterate symbols ordered by name
    perf symbols: Return the first entry with a given name in find_by_name method
    perf annotate: Fix memory leaks in LOCK handling
    perf annotate: Handle ins parsing failures
    perf scripting perl: Force to use stdbool
    perf evlist: Remove extraneous 'was' on error message

    Linus Torvalds
     
  • Currently, cpudl::free_cpus contains all CPUs during init, see
    cpudl_init(). When calling cpudl_find(), we have to add rd->span
    to avoid selecting a cpu outside the current root domain, because
    cpus_allowed cannot be depended on when performing clustered
    scheduling using the cpuset, see find_later_rq().

    This patch adds cpudl_set_freecpu() and cpudl_clear_freecpu() for
    changing cpudl::free_cpus when doing rq_online_dl()/rq_offline_dl(),
    so we can avoid the rd->span operation when calling cpudl_find()
    in find_later_rq().
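
    A minimal sketch of the two helpers described above (hypothetical bodies,
    assuming cpudl::free_cpus is a cpumask):

        void cpudl_set_freecpu(struct cpudl *cp, int cpu)
        {
                cpumask_set_cpu(cpu, cp->free_cpus);
        }

        void cpudl_clear_freecpu(struct cpudl *cp, int cpu)
        {
                cpumask_clear_cpu(cpu, cp->free_cpus);
        }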

    Signed-off-by: Xunlei Pang
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1421642980-10045-1-git-send-email-pang.xunlei@linaro.org
    Signed-off-by: Ingo Molnar

    Xunlei Pang
     
  • cpu_idle_poll() is entered when either cpu_idle_force_poll is set or
    tick_check_broadcast_expired() returns true. The exit condition from
    cpu_idle_poll() is tif_need_resched().

    However this does not take into account scenarios where cpu_idle_force_poll
    changes or tick_check_broadcast_expired() returns false, without setting
    the resched flag. So a cpu will be caught in cpu_idle_poll() needlessly,
    thereby wasting power. Add an explicit check on cpu_idle_force_poll and
    tick_check_broadcast_expired() to the exit condition of cpu_idle_poll()
    to avoid this.
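
    A minimal sketch of the tightened polling loop (a hypothetical
    reconstruction following the exit conditions listed above):

        /* Poll only while one of the entry conditions still holds. */
        while (!tif_need_resched() &&
               (cpu_idle_force_poll || tick_check_broadcast_expired()))
                cpu_relax();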

    Signed-off-by: Preeti U Murthy
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20150121105655.15279.59626.stgit@preeti.in.ibm.com
    Signed-off-by: Ingo Molnar

    Preeti U Murthy
     
  • If an interrupt fires in cond_resched(), between the call to __schedule()
    and the PREEMPT_ACTIVE count decrement, and that interrupt sets
    TIF_NEED_RESCHED, the call to preempt_schedule_irq() will be ignored
    due to the PREEMPT_ACTIVE count. This kind of scenario, with irq preemption
    being delayed because it's interrupting a preempt-disabled area, is
    usually fixed up after preemption is re-enabled, with an explicit
    call to preempt_schedule().

    This is what preempt_enable() does, but a raw preempt count decrement as
    performed by __preempt_count_sub(PREEMPT_ACTIVE) doesn't handle the delayed
    preemption check. Therefore when such a race happens, the rescheduling
    is going to be delayed until the next scheduler or preemption entry point.
    This can be a problem for scheduler-latency-sensitive workloads.

    Let's fix that by consolidating cond_resched() with preempt_schedule()
    internals.
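
    A minimal sketch of a consolidated helper both paths could share (the name
    and exact shape are illustrative assumptions):

        static void __sched notrace preempt_schedule_common(void)
        {
                do {
                        __preempt_count_add(PREEMPT_ACTIVE);
                        __schedule();
                        __preempt_count_sub(PREEMPT_ACTIVE);

                        /* Check again in case we missed a preemption
                         * opportunity between schedule and now. */
                        barrier();
                } while (need_resched());
        }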

    Reported-by: Linus Torvalds
    Reported-by: Ingo Molnar
    Original-patch-by: Ingo Molnar
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Peter Zijlstra (Intel)
    Link: http://lkml.kernel.org/r/1421946484-9298-1-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • This patch adds checks that prevent futile attempts to move rt tasks
    to a CPU with active tasks of equal or higher priority.

    This reduces run queue lock contention and improves the performance of
    a well known OLTP benchmark by 0.7%.
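
    A minimal sketch of the added check (hypothetical placement, before taking
    the target rq lock in the push path):

        if (lowest_rq->rt.highest_prio.curr <= task->prio) {
                /* The target already runs something of equal or higher
                 * priority; retrying won't release any lock and is
                 * unlikely to yield a different result. */
                lowest_rq = NULL;
                break;
        }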

    Signed-off-by: Tim Chen
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Shawn Bohrer
    Cc: Suruchi Kadu
    Cc: Doug Nelson
    Cc: Steven Rostedt
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1421430374.2399.27.camel@schen9-desk2.jf.intel.com
    Signed-off-by: Ingo Molnar

    Tim Chen
     
  • Merge all pending fixes and refresh the tree, before applying new changes.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

28 Jan, 2015

1 commit