03 Apr, 2017

1 commit

  • Pull scheduler fixes from Thomas Gleixner:
    "This update provides:

    - make the scheduler clock switch to unstable mode smooth so the
    timestamps stay at microsecond granularity instead of switching to
    tick granularity.

    - unbreak "perf test tsc" by taking into account the new offset which
    was added in order to provide better sched clock continuity

    - switching sched clock to unstable mode runs all clock related
    computations which affect the sched clock output itself from a work
    queue. In case of preemption, sched clock uses half-updated data and
    provides wrong timestamps. Keep the math in the protected context
    and delegate only the static key switch to workqueue context.

    - remove a duplicate header include"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/headers: Remove duplicate #include line
    sched/clock: Fix broken stable to unstable transfer
    sched/clock, x86/perf: Fix "perf test tsc"
    sched/clock: Fix clear_sched_clock_stable() preempt wobbly

    Linus Torvalds
     

27 Mar, 2017

1 commit

  • When it is determined that the clock is actually unstable, and
    we switch from stable to unstable, the __clear_sched_clock_stable()
    function is eventually called.

    In this function we set gtod_offset so the following holds true:

    sched_clock() + raw_offset == ktime_get_ns() + gtod_offset

    But instead of getting the latest timestamps, we use the last values
    from scd, so instead of sched_clock() we use scd->tick_raw, and
    instead of ktime_get_ns() we use scd->tick_gtod.

    However, later, when gtod_offset is used in sched_clock_local(), we do
    not add it to scd->tick_gtod when calculating the clock value and
    determining the min/max clock boundaries.

    This can result in tick granularity sched_clock() values, so fix it.
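
    A minimal sketch of the corrected bound computation in
    sched_clock_local(), derived from the description above rather than
    from the patch itself (variable names are assumptions):

    /* delta = sched_clock() - scd->tick_raw, old_clock = scd->clock */
    gtod      = scd->tick_gtod + gtod_offset;
    clock     = gtod + delta;
    min_clock = wrap_max(gtod, old_clock);
    max_clock = wrap_max(old_clock, gtod + TICK_NSEC);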

    Signed-off-by: Pavel Tatashin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: hpa@zytor.com
    Fixes: 5680d8094ffa ("sched/clock: Provide better clock continuity")
    Link: http://lkml.kernel.org/r/1490214265-899964-2-git-send-email-pasha.tatashin@oracle.com
    Signed-off-by: Ingo Molnar

    Pavel Tatashin
     

23 Mar, 2017

2 commits

  • People reported that commit:

    5680d8094ffa ("sched/clock: Provide better clock continuity")

    broke "perf test tsc".

    That commit added another offset to the reported clock value; so
    take that into account when computing the provided offset values.

    Reported-by: Adrian Hunter
    Reported-by: Arnaldo Carvalho de Melo
    Tested-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 5680d8094ffa ("sched/clock: Provide better clock continuity")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    Paul reported a problem with clear_sched_clock_stable(): since we run
    all of __clear_sched_clock_stable() from workqueue context, there's a
    preemption problem.

    Solve it by only running the static_key_disable() from the workqueue.
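
    A minimal sketch of the resulting split (names follow the message; the
    offset math is elided):

    static void __sched_clock_work(struct work_struct *work)
    {
            /* the only piece that needs to run from process context */
            static_key_disable(&__sched_clock_stable);
    }
    static DECLARE_WORK(sched_clock_work, __sched_clock_work);

    static void __clear_sched_clock_stable(struct sched_clock_data *scd)
    {
            /*
             * Compute gtod_offset and friends here, in the caller's
             * preempt-safe context, so a preemption can never expose
             * half-updated data ...
             */

            /* ... and defer only the key flip to the workqueue */
            schedule_work(&sched_clock_work);
    }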

    Reported-by: Paul E. McKenney
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: fweisbec@gmail.com
    Link: http://lkml.kernel.org/r/20170313124621.GA3328@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

21 Mar, 2017

1 commit

  • sugov_start() only initializes struct sugov_cpu per-CPU structures
    for shared policies, but it should do that for single-CPU policies too.

    That in particular makes the IO-wait boost mechanism work in the
    cases when cpufreq policies correspond to individual CPUs.
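
    A minimal sketch of the resulting initialization loop (simplified;
    field names are taken from the existing schedutil code and may not
    match the patch exactly):

    static int sugov_start(struct cpufreq_policy *policy)
    {
            struct sugov_policy *sg_policy = policy->governor_data;
            unsigned int cpu;

            /* ... sg_policy setup ... */

            /* initialize every sugov_cpu, whether or not the policy is shared */
            for_each_cpu(cpu, policy->cpus) {
                    struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);

                    memset(sg_cpu, 0, sizeof(*sg_cpu));
                    sg_cpu->sg_policy = sg_policy;
                    sg_cpu->iowait_boost_max = policy->cpuinfo.max_freq;
            }
            return 0;
    }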

    Fixes: 21ca6d2c52f8 (cpufreq: schedutil: Add iowait boosting)
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Viresh Kumar
    Cc: 4.9+ # 4.9+

    Rafael J. Wysocki
     

16 Mar, 2017

6 commits

    I was testing Daniel's changes with his test case, and tweaked it a
    little. Instead of having the runtime equal to the deadline, I
    increased the deadline tenfold.

    Daniel's test case had:

    attr.sched_runtime = 2 * 1000 * 1000; /* 2 ms */
    attr.sched_deadline = 2 * 1000 * 1000; /* 2 ms */
    attr.sched_period = 2 * 1000 * 1000 * 1000; /* 2 s */

    To make it more interesting, I changed it to:

    attr.sched_runtime = 2 * 1000 * 1000; /* 2 ms */
    attr.sched_deadline = 20 * 1000 * 1000; /* 20 ms */
    attr.sched_period = 2 * 1000 * 1000 * 1000; /* 2 s */

    The results were rather surprising. The behavior that Daniel's patch
    was fixing came back. The task started using much more than .1% of the
    CPU. More like 20%.

    Looking into this I found that it was due to the dl_entity_overflow()
    constantly returning true. That's because it uses the relative period
    against relative runtime vs the absolute deadline against absolute
    runtime.

    runtime / (deadline - t) > dl_runtime / dl_period

    There's even a comment mentioning this, saying that when the relative
    deadline equals the relative period, the equation is the same as using
    deadline instead of period. That comment is backwards! What we really
    want is:

    runtime / (deadline - t) > dl_runtime / dl_deadline

    We care about if the runtime can make its deadline, not its period. And
    then we can say "when the deadline equals the period, the equation is
    the same as using dl_period instead of dl_deadline".

    After correcting this, now when the task gets enqueued, it can throttle
    correctly, and Daniel's fix to the throttling of sleeping deadline
    tasks works even when the runtime and deadline are not the same.
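
    Schematically, the corrected check, cross-multiplied to avoid the
    division (a sketch only: the real dl_entity_overflow() also shifts the
    operands down to avoid 64-bit multiply overflow and handles the PI
    donor entity):

    /*
     * True when the remaining runtime cannot be served at the task's own
     * density, i.e. runtime / (deadline - t) > dl_runtime / dl_deadline.
     */
    static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
    {
            u64 left  = dl_se->dl_deadline * dl_se->runtime;
            u64 right = (dl_se->deadline - t) * dl_se->dl_runtime;

            return dl_time_before(right, left);
    }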

    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Daniel Bristot de Oliveira
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Luca Abeni
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Romulo Silva de Oliveira
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Tommaso Cucinotta
    Link: http://lkml.kernel.org/r/02135a27f1ae3fe5fd032568a5a2f370e190e8d7.1488392936.git.bristot@redhat.com
    Signed-off-by: Ingo Molnar

    Steven Rostedt (VMware)
     
  • During the activation, CBS checks if it can reuse the current task's
    runtime and period. If the deadline of the task is in the past, CBS
    cannot use the runtime, and so it replenishes the task. This rule
    works fine for implicit deadline tasks (deadline == period), and the
    CBS was designed for implicit deadline tasks. However, a task with a
    constrained deadline (deadline < period) might be awakened after its
    deadline but before the next period. In this case, replenishing the
    task would allow it to run with a bandwidth of runtime / deadline.
    Since deadline < period, CBS would thus let the task consume more than
    its runtime / period reservation. In a very loaded system, this can
    cause a domino effect, making other tasks miss their deadlines.

    To avoid this problem, in the activation of a constrained deadline
    task after the deadline but before the next period, throttle the
    task and set the replenishing timer to the beginning of the next period,
    unless it is boosted.
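
    A rough sketch of that activation-time check, with helper and field
    names assumed from the description rather than copied from the patch:

    static void dl_check_constrained_dl(struct sched_dl_entity *dl_se)
    {
            struct task_struct *p = dl_task_of(dl_se);
            struct rq *rq = rq_of_dl_rq(dl_rq_of_se(dl_se));

            /*
             * Woken after the deadline but before the next period: do not
             * replenish now, throttle until the next period begins.
             */
            if (dl_time_before(dl_se->deadline, rq_clock(rq)) &&
                dl_time_before(rq_clock(rq), dl_next_period(dl_se))) {
                    if (unlikely(dl_se->dl_boosted || !start_dl_timer(p)))
                            return;
                    dl_se->dl_throttled = 1;
            }
    }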

    Reproducer:

    --------------- %< ---------------
    #define _GNU_SOURCE
    #include <linux/types.h>
    #include <sys/syscall.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    #ifndef SCHED_DEADLINE
    #define SCHED_DEADLINE 6
    #endif

    /* struct sched_attr and a sched_setattr() wrapper, since glibc
     * provides neither */
    struct sched_attr {
            __u32 size;
            __u32 sched_policy;
            __u64 sched_flags;
            __s32 sched_nice;
            __u32 sched_priority;
            /* SCHED_DEADLINE parameters, in nanoseconds */
            __u64 sched_runtime;
            __u64 sched_deadline;
            __u64 sched_period;
    };

    static int sched_setattr(pid_t pid, const struct sched_attr *attr,
                             unsigned int flags)
    {
            return syscall(__NR_sched_setattr, pid, attr, flags);
    }

    int main(int argc, char **argv)
    {
            int ret;
            int flags = 0;
            unsigned long l = 0;
            struct timespec ts;
            struct sched_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.size = sizeof(attr);

            attr.sched_policy   = SCHED_DEADLINE;
            attr.sched_runtime  = 2 * 1000 * 1000;          /* 2 ms */
            attr.sched_deadline = 2 * 1000 * 1000;          /* 2 ms */
            attr.sched_period   = 2 * 1000 * 1000 * 1000;   /* 2 s */

            ts.tv_sec = 0;
            ts.tv_nsec = 2000 * 1000;                       /* 2 ms */

            ret = sched_setattr(0, &attr, flags);
            if (ret < 0) {
                    perror("sched_setattr");
                    exit(-1);
            }

            for (;;) {
                    /* XXX: you may need to adjust the loop */
                    for (l = 0; l < 150000; l++)
                            ;
                    /*
                     * The idea is to go to sleep right before the deadline
                     * and then wake up before the next period to receive
                     * a new replenishment.
                     */
                    nanosleep(&ts, NULL);
            }

            exit(0);
    }
    --------------- >% ---------------

    On my box, this reproducer uses almost 50% of the CPU time, which is
    obviously wrong for a task with 2/2000 reservation.

    Signed-off-by: Daniel Bristot de Oliveira
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Luca Abeni
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Romulo Silva de Oliveira
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Tommaso Cucinotta
    Link: http://lkml.kernel.org/r/edf58354e01db46bf42df8d2dd32418833f68c89.1488392936.git.bristot@redhat.com
    Signed-off-by: Ingo Molnar

    Daniel Bristot de Oliveira
     
    Currently, the replenishment timer is set to fire at the deadline
    of a task. Although that works for implicit deadline tasks, because the
    deadline is equal to the beginning of the next period, it is not correct
    for constrained deadline tasks (deadline < period).

    For instance:

    f.c:
    --------------- %< ---------------
    int main(void)
    {
            for (;;);
    }
    --------------- >% ---------------

    # gcc -o f f.c

    # trace-cmd record -e sched:sched_switch \
            -e syscalls:sys_exit_sched_setattr \
            chrt -d --sched-runtime 490000000 \
                    --sched-deadline 500000000 \
                    --sched-period 1000000000 0 ./f

    # trace-cmd report | grep "{pid of ./f}"

    After setting the parameters, the task is replenished and continues
    running until it is throttled:

    f-11295 [003] 13322.113776: sys_exit_sched_setattr: 0x0

    The task is throttled after running for 492318 us, as expected:

    f-11295 [003] 13322.606094: sched_switch: f:11295 [-1] R ==> watchdog/3:32 [0]

    But then, the task is replenished 500719 us after the first
    replenishment:

    -0 [003] 13322.614495: sched_switch: swapper/3:0 [120] R ==> f:11295 [-1]

    Running for 490277 us:

    f-11295 [003] 13323.104772: sched_switch: f:11295 [-1] R ==> swapper/3:0 [120]

    Hence, in the first period, the task runs 2 * runtime, and that is a bug.

    During the first replenishment, the next deadline is set one period away,
    so the runtime / period starts to be respected. However, as the second
    replenishment took place at the wrong instant, the following
    replenishments will also happen at the wrong instants. Rather than
    occurring n periods after the first activation, they take place at
    (n periods - relative deadline) after it.
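
    In other words, the replenishment timer has to be armed for the start
    of the next period rather than for the absolute deadline; a sketch of
    that computation (the helper name is an assumption):

    /*
     * The current period started at (deadline - dl_deadline); the next
     * one therefore starts one dl_period after that point:
     */
    static inline u64 dl_next_period(struct sched_dl_entity *dl_se)
    {
            return dl_se->deadline - dl_se->dl_deadline + dl_se->dl_period;
    }

    and the timer is then armed for dl_next_period(dl_se) instead of
    dl_se->deadline.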

    Signed-off-by: Daniel Bristot de Oliveira
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Luca Abeni
    Reviewed-by: Steven Rostedt (VMware)
    Reviewed-by: Juri Lelli
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Romulo Silva de Oliveira
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Tommaso Cucinotta
    Link: http://lkml.kernel.org/r/ac50d89887c25285b47465638354b63362f8adff.1488392936.git.bristot@redhat.com
    Signed-off-by: Ingo Molnar

    Daniel Bristot de Oliveira
     
  • 'calc_load_update' is accessed without any kind of locking and there's
    a clear assumption in the code that only a single value is read or
    written.

    Make this explicit by using READ_ONCE() and WRITE_ONCE(), to avoid
    unintentionally seeing multiple values or having the loads/stores
    split.

    Technically the loads in calc_global_*() don't require this since
    those are the only functions that update 'calc_load_update', but I've
    added the READ_ONCE() for consistency.
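
    A sketch of the resulting access pattern (not the exact patch):

    void calc_global_load_sketch(void)
    {
            /* readers take one tear-free snapshot of the shared timestamp */
            unsigned long sample_window = READ_ONCE(calc_load_update);

            if (time_before(jiffies, sample_window + 10))
                    return;

            /* ... fold pending idle load, update avenrun[] ... */

            /* the single writer advances it with a matching WRITE_ONCE() */
            WRITE_ONCE(calc_load_update, sample_window + LOAD_FREQ);
    }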

    Suggested-by: Peter Zijlstra
    Signed-off-by: Matt Fleming
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Mike Galbraith
    Cc: Morten Rasmussen
    Cc: Thomas Gleixner
    Cc: Vincent Guittot
    Link: http://lkml.kernel.org/r/20170217120731.11868-3-matt@codeblueprint.co.uk
    Signed-off-by: Ingo Molnar

    Matt Fleming
     
  • If we crossed a sample window while in NO_HZ we will add LOAD_FREQ to
    the pending sample window time on exit, setting the next update not
    one window into the future, but two.

    This situation on exiting NO_HZ is described by:

    this_rq->calc_load_update < jiffies < calc_load_update

    In this scenario, what we should be doing is:

    this_rq->calc_load_update = calc_load_update [ next window ]

    But what we actually do is:

    this_rq->calc_load_update = calc_load_update + LOAD_FREQ [ next+1 window ]

    This has the effect of delaying load average updates for potentially
    up to ~9 seconds.

    This can result in huge spikes in the load average values due to
    per-cpu uninterruptible task counts being out of sync when accumulated
    across all CPUs.

    It's safe to update the per-cpu active count if we wake between sample
    windows because any load that we left in 'calc_load_idle' will have
    been zero'd when the idle load was folded in calc_global_load().

    This issue was easy to reproduce before commit:

    9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")

    just by forking short-lived process pipelines built from ps(1) and
    grep(1) in a loop. I'm unable to reproduce the spikes after that
    commit, but from code review the bug still appears to be present.
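
    A minimal sketch of the corrected NO_HZ-exit path, using the names from
    the description (the real function also allows a small fuzz around the
    window boundary):

    void calc_load_exit_idle_sketch(void)
    {
            struct rq *this_rq = this_rq();

            /* sync with the pending sample window first ... */
            this_rq->calc_load_update = READ_ONCE(calc_load_update);

            /*
             * ... and if we woke up before that window, [ next window ] is
             * already correct; only skip a further LOAD_FREQ ahead when we
             * woke inside the window itself.
             */
            if (time_before(jiffies, this_rq->calc_load_update))
                    return;

            this_rq->calc_load_update += LOAD_FREQ;
    }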

    Signed-off-by: Matt Fleming
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Mike Galbraith
    Cc: Morten Rasmussen
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Vincent Guittot
    Fixes: commit 5167e8d ("sched/nohz: Rewrite and fix load-avg computation -- again")
    Link: http://lkml.kernel.org/r/20170217120731.11868-2-matt@codeblueprint.co.uk
    Signed-off-by: Ingo Molnar

    Matt Fleming
     
    The following warning can be triggered by hot-unplugging the CPU
    on which an active SCHED_DEADLINE task is running:

    ------------[ cut here ]------------
    WARNING: CPU: 7 PID: 0 at kernel/sched/sched.h:833 replenish_dl_entity+0x71e/0xc40
    rq->clock_update_flags < RQCF_ACT_SKIP
    CPU: 7 PID: 0 Comm: swapper/7 Tainted: G B 4.11.0-rc1+ #24
    Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
    Call Trace:

    dump_stack+0x85/0xc4
    __warn+0x172/0x1b0
    warn_slowpath_fmt+0xb4/0xf0
    ? __warn+0x1b0/0x1b0
    ? debug_check_no_locks_freed+0x2c0/0x2c0
    ? cpudl_set+0x3d/0x2b0
    replenish_dl_entity+0x71e/0xc40
    enqueue_task_dl+0x2ea/0x12e0
    ? dl_task_timer+0x777/0x990
    ? __hrtimer_run_queues+0x270/0xa50
    dl_task_timer+0x316/0x990
    ? enqueue_task_dl+0x12e0/0x12e0
    ? enqueue_task_dl+0x12e0/0x12e0
    __hrtimer_run_queues+0x270/0xa50
    ? hrtimer_cancel+0x20/0x20
    ? hrtimer_interrupt+0x119/0x600
    hrtimer_interrupt+0x19c/0x600
    ? trace_hardirqs_off+0xd/0x10
    local_apic_timer_interrupt+0x74/0xe0
    smp_apic_timer_interrupt+0x76/0xa0
    apic_timer_interrupt+0x93/0xa0

    The DL task will be migrated to a suitable later-deadline rq once the DL
    timer fires and the current rq is offline. The rq clock of the new rq
    should be updated. This patch fixes it by updating the rq clock after
    acquiring the new rq's lock.
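
    A sketch of where the update lands in the timer path (simplified; the
    surrounding lockdep pinning is omitted):

    /* dl_task_timer(), after the task has been moved off the dead CPU */
    rq = dl_task_offline_migration(rq, p);  /* returns with the new rq locked */

    /*
     * Nothing has updated the new rq's clock in this path yet, so do it
     * before enqueue_task_dl() reads it (this is what the warning above
     * complains about).
     */
    update_rq_clock(rq);

    enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);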

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Matt Fleming
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1488865888-15894-1-git-send-email-wanpeng.li@hotmail.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     

10 Mar, 2017

1 commit

  • Pull power management fixes from Rafael Wysocki:
    "These fix several issues in the intel_pstate driver and one issue in
    the schedutil cpufreq governor, clean up that governor a bit and hook
    up existing code for disabling cpufreq to a new kernel command line
    option.

    Specifics:

    - Three fixes for intel_pstate problems related to the passive mode
    (in which it acts as a regular cpufreq scaling driver), two for the
    handling of global P-state limits and one for the handling of the
    cpu_frequency tracepoint in that mode (Rafael Wysocki).

    - Three fixes for the handling of P-state limits in intel_pstate in
    the active mode (Rafael Wysocki).

    - Introduction of a new cpufreq.off=1 kernel command line argument
    that will disable cpufreq entirely if passed to the kernel and is
    simply hooked up to the existing code used by Xen (Len Brown).

    - Fix for the schedutil cpufreq governor to prevent it from using
    stale raw frequency values in configurations with multiple CPUs
    sharing one policy object, and a cleanup for it reducing its
    overhead slightly (Viresh Kumar)"

    * tag 'pm-4.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    cpufreq: intel_pstate: Do not reinit performance limits in ->setpolicy
    cpufreq: intel_pstate: Fix intel_pstate_verify_policy()
    cpufreq: intel_pstate: Fix global settings in active mode
    cpufreq: Add the "cpufreq.off=1" cmdline option
    cpufreq: schedutil: Pass sg_policy to get_next_freq()
    cpufreq: schedutil: move cached_raw_freq to struct sugov_policy
    cpufreq: intel_pstate: Avoid triggering cpu_frequency tracepoint unnecessarily
    cpufreq: intel_pstate: Fix intel_cpufreq_verify_policy()
    cpufreq: intel_pstate: Do not use performance_limits in passive mode

    Linus Torvalds
     

09 Mar, 2017

1 commit

    The scheduler header file split and cleanups ended up exposing a few
    nasty header file dependencies, and in particular it showed how we in
    <linux/wait.h> ended up depending on "signal_pending()", which now comes
    from <linux/sched/signal.h>.

    That's a very subtle and annoying dependency, which already caused a
    semantic merge conflict (see commit e58bc927835a "Pull overlayfs updates
    from Miklos Szeredi", which added that fixup in the merge commit).

    It turns out that we can avoid this dependency _and_ improve code
    generation by moving the guts of the fairly nasty helper #define
    __wait_event_interruptible_locked() to out-of-line code. The code that
    includes the signal_pending() check is all in the slow-path where we
    actually go to sleep waiting for the event anyway, so using a helper
    function is the right thing to do.

    Using a helper function is also what we already did for the non-locked
    versions, see the "__wait_event*()" macros and the "prepare_to_wait*()"
    set of helper functions.

    We might want to try to unify all these macro games, we have a _lot_ of
    subtly different wait-event loops. But this is the minimal patch to fix
    the annoying header dependency.
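
    For illustration, such an out-of-line helper would take roughly this
    shape (the name and exact signature here are assumptions, not
    necessarily what the patch uses):

    long do_wait_intr(wait_queue_head_t *wq, wait_queue_t *wait)
    {
            if (likely(list_empty(&wait->task_list)))
                    __add_wait_queue_tail(wq, wait);

            set_current_state(TASK_INTERRUPTIBLE);
            if (signal_pending(current))
                    return -ERESTARTSYS;    /* the slow-path signal check */

            spin_unlock(&wq->lock);
            schedule();
            spin_lock(&wq->lock);
            return 0;
    }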

    Acked-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

08 Mar, 2017

1 commit


06 Mar, 2017

2 commits

  • get_next_freq() uses sg_cpu only to get sg_policy, which the callers of
    get_next_freq() already have. Pass sg_policy instead of sg_cpu to
    get_next_freq(), to make it more efficient.
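
    Schematically, the interface change amounts to (the util/max parameters
    are assumed from the existing code):

    /* before: the policy had to be re-derived from the CPU argument */
    static unsigned int get_next_freq(struct sugov_cpu *sg_cpu,
                                      unsigned long util, unsigned long max);

    /* after: callers already hold sg_policy, so take it directly */
    static unsigned int get_next_freq(struct sugov_policy *sg_policy,
                                      unsigned long util, unsigned long max);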

    Signed-off-by: Viresh Kumar
    Signed-off-by: Rafael J. Wysocki

    Viresh Kumar
     
  • cached_raw_freq applies to the entire cpufreq policy and not individual
    CPUs. Apart from wasting per-cpu memory, it is actually wrong to keep it
    in struct sugov_cpu as we may end up comparing next_freq with a stale
    cached_raw_freq of a random CPU.

    Move cached_raw_freq to struct sugov_policy.
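
    Schematically (other members omitted):

    struct sugov_policy {
            /* ... */
            unsigned int cached_raw_freq;   /* one cached value per policy */
    };

    struct sugov_cpu {
            /* ... no cached_raw_freq here any more */
    };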

    Fixes: 5cbea46984d6 (cpufreq: schedutil: map raw required frequency to driver frequency)
    Signed-off-by: Viresh Kumar
    Signed-off-by: Rafael J. Wysocki

    Viresh Kumar
     

03 Mar, 2017

1 commit

  • …e.h> into <linux/sched/cputime.h>

    Move cputime related functionality out of <linux/sched.h>, as most code
    that includes <linux/sched.h> does not use that functionality.

    Move data types that are not included in task_struct directly to
    the signal definitions, into <linux/sched/signal.h>.

    Also merge the (small) existing <linux/cputime.h> header into <linux/sched/cputime.h>.

    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

02 Mar, 2017

23 commits