20 Jan, 2017

1 commit

  • Mike reported that he could trigger the WARN_ON_ONCE() in
    set_sched_clock_stable() using hotplug.

    This exposed a fundamental problem with the interface: we should never
    mark the TSC stable if we ever find it to be unstable. Therefore
    set_sched_clock_stable() is a broken interface.

    The reason it existed is that not having it is a pain: it means all
    relevant architecture code needs to call clear_sched_clock_stable()
    where appropriate.

    Of the three architectures that select HAVE_UNSTABLE_SCHED_CLOCK, ia64
    and parisc are trivial in that they never called
    set_sched_clock_stable(), so add an unconditional call to
    clear_sched_clock_stable() to them.

    For x86 the story is a lot more involved, and what this patch tries to
    do is ensure we preserve the status quo. So even if Cyrix or Transmeta
    have a usable TSC, they never called set_sched_clock_stable(), so they
    now get explicitly marked unstable.
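
    A rough sketch of what the x86 side amounts to (the vendor init routine
    shown is illustrative of the per-vendor files touched, not the exact
    hunk):

    static void init_transmeta(struct cpuinfo_x86 *c)
    {
            /* ... existing vendor-specific setup ... */

            /* Preserve the status quo: this CPU never called
             * set_sched_clock_stable(), so mark it unstable explicitly. */
            clear_sched_clock_stable();
    }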

    Reported-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 9881b024b7d7 ("sched/clock: Delay switching sched_clock to stable")
    Link: http://lkml.kernel.org/r/20170119133633.GB6536@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

19 Jan, 2017

1 commit

  • Scheduling to the max performance core is enabled by
    default for Turbo Boost Max Technology 3.0 capable platforms.

    Remove the useless sysctl_sched_itmt_enabled check when updating the
    sched topology to add the prioritized core scheduling flag.

    Signed-off-by: Tim Chen
    Cc: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Srinivas Pandruvada
    Cc: Thomas Gleixner
    Cc: bp@suse.de
    Cc: jolsa@redhat.com
    Cc: linux-acpi@vger.kernel.org
    Cc: linux-pm@vger.kernel.org
    Cc: rjw@rjwysocki.net
    Link: http://lkml.kernel.org/r/1484778629-4404-1-git-send-email-tim.c.chen@linux.intel.com
    Signed-off-by: Ingo Molnar

    Tim Chen
     

15 Jan, 2017

1 commit

  • Mike noticed this bogosity:

    > > +# define mutex_lock_nest_io(lock, nest_lock) mutex_io(lock)
    > ^^^^^^^^^^^^^^ typo

    This new locking API is not used yet, so this didn't trigger in testing.

    Fix it.

    Reported-by: Mike Galbraith
    Cc: Tejun Heo
    Cc: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Jens Axboe
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: adilger.kernel@dilger.ca
    Cc: jack@suse.com
    Cc: kernel-team@fb.com
    Cc: mingbo@fb.com
    Cc: tytso@mit.edu
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

14 Jan, 2017

37 commits

  • When an ext4 fs is bogged down by a lot of metadata IOs (in the
    reported case, it was deletion of millions of files, but any massive
    amount of journal writes would do), after the journal is filled up,
    tasks which try to access the filesystem and aren't currently
    performing the journal writes end up waiting in
    __jbd2_log_wait_for_space() for journal->j_checkpoint_mutex.

    Because those mutex sleeps aren't marked as iowait, this condition can
    lead to misleadingly low iowait and /proc/stat:procs_blocked. While
    iowait propagation is far from strict, this condition can be triggered
    fairly easily and annotating these sleeps correctly helps initial
    diagnosis quite a bit.

    Use the new mutex_lock_io() for journal->j_checkpoint_mutex so that
    these sleeps are properly marked as iowait.
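
    A minimal sketch of the resulting pattern (the helper name below is
    hypothetical and the jbd2 details are elided):

    #include <linux/jbd2.h>
    #include <linux/mutex.h>

    static void checkpoint_for_space(journal_t *journal)
    {
            /* Sleeping here is really waiting on journal IO, so take the
             * iowait-annotated variant instead of plain mutex_lock(). */
            mutex_lock_io(&journal->j_checkpoint_mutex);
            /* ... checkpoint transactions and free journal space ... */
            mutex_unlock(&journal->j_checkpoint_mutex);
    }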

    Reported-by: Mingbo Wan
    Signed-off-by: Tejun Heo
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andreas Dilger
    Cc: Andrew Morton
    Cc: Jan Kara
    Cc: Jens Axboe
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Theodore Ts'o
    Cc: Thomas Gleixner
    Cc: kernel-team@fb.com
    Link: http://lkml.kernel.org/r/1477673892-28940-5-git-send-email-tj@kernel.org
    Signed-off-by: Ingo Molnar

    Tejun Heo
     
  • We sometimes end up propagating IO blocking through mutexes; however,
    because there currently is no way of annotating mutex sleeps as
    iowait, there are cases where iowait and /proc/stat:procs_blocked
    report misleading numbers obscuring the actual state of the system.

    This patch adds mutex_lock_io() so that mutex sleeps can be marked as
    iowait in those cases.
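
    Conceptually, the new helper just brackets the ordinary lock with the
    iowait annotation (a simplified sketch, not the exact implementation,
    which also has to cover the lockdep/nesting variants):

    void mutex_lock_io(struct mutex *lock)
    {
            int token;

            token = io_schedule_prepare();  /* mark sleeps below as iowait */
            mutex_lock(lock);
            io_schedule_finish(token);      /* restore the previous state */
    }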

    Signed-off-by: Tejun Heo
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Jens Axboe
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: adilger.kernel@dilger.ca
    Cc: jack@suse.com
    Cc: kernel-team@fb.com
    Cc: mingbo@fb.com
    Cc: tytso@mit.edu
    Link: http://lkml.kernel.org/r/1477673892-28940-4-git-send-email-tj@kernel.org
    Signed-off-by: Ingo Molnar

    Tejun Heo
     
  • Now that IO schedule accounting is done inside __schedule(),
    io_schedule() can be split into three steps - prep, schedule, and
    finish - where the schedule part doesn't need any special annotation.
    This allows marking a sleep as iowait by simply wrapping an existing
    blocking function with io_schedule_prepare() and io_schedule_finish().

    Because task_struct->in_iowait is a single bit, the caller of
    io_schedule_prepare() needs to record and then pass its state to
    io_schedule_finish() to be safe regarding nesting. While this isn't
    the prettiest, these functions are mostly going to be used by core
    functions and we don't want to use more space for ->in_iowait.

    While at it, as it's simple to do now, reimplement io_schedule()
    without unnecessarily going through io_schedule_timeout().
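
    For example, an existing blocking helper can be annotated like this
    (the function below is hypothetical; only the prepare/finish pattern
    is the point):

    #include <linux/completion.h>
    #include <linux/sched.h>

    static int wait_for_disk_event(struct completion *done)
    {
            int token, ret;

            token = io_schedule_prepare();  /* save ->in_iowait, mark as IO sleep */
            ret = wait_for_completion_interruptible(done);
            io_schedule_finish(token);      /* restore the saved state */

            return ret;
    }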

    Signed-off-by: Tejun Heo
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Jens Axboe
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: adilger.kernel@dilger.ca
    Cc: jack@suse.com
    Cc: kernel-team@fb.com
    Cc: mingbo@fb.com
    Cc: tytso@mit.edu
    Link: http://lkml.kernel.org/r/1477673892-28940-3-git-send-email-tj@kernel.org
    Signed-off-by: Ingo Molnar

    Tejun Heo
     
  • For an interface to support blocking for IOs, it must call
    io_schedule() instead of schedule(). This makes it tedious to add IO
    blocking to existing interfaces as the switching between schedule()
    and io_schedule() is often buried deep.

    As we already have a way to mark the task as IO scheduling, this can
    be made easier by separating out io_schedule() into multiple steps so
    that IO schedule preparation can be performed before invoking a
    blocking interface and the actual accounting happens inside the
    scheduler.

    io_schedule_timeout() does the following three things prior to calling
    schedule_timeout().

    1. Mark the task as scheduling for IO.
    2. Flush out plugged IOs.
    3. Account the IO scheduling.

    While #3 is necessary, it can be done close to the actual scheduling.
    This patch moves #3 into the scheduler so that later patches can
    separate out preparation and finish steps from io_schedule().
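
    A simplified sketch of what remains in io_schedule_timeout() after the
    move (the nr_iowait and delay-accounting bookkeeping now happens inside
    __schedule(); details elided):

    long __sched io_schedule_timeout(long timeout)
    {
            int old_iowait = current->in_iowait;
            long ret;

            current->in_iowait = 1;                 /* #1: mark as IO sleep   */
            blk_schedule_flush_plug(current);       /* #2: flush plugged IOs  */
            ret = schedule_timeout(timeout);        /* #3 now done in scheduler */
            current->in_iowait = old_iowait;

            return ret;
    }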

    Patch-originally-by: Peter Zijlstra
    Signed-off-by: Tejun Heo
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: adilger.kernel@dilger.ca
    Cc: akpm@linux-foundation.org
    Cc: axboe@kernel.dk
    Cc: jack@suse.com
    Cc: kernel-team@fb.com
    Cc: mingbo@fb.com
    Cc: tytso@mit.edu
    Link: http://lkml.kernel.org/r/20161207204841.GA22296@htj.duckdns.org
    Signed-off-by: Ingo Molnar

    Tejun Heo
     
  • Signed-off-by: Dietmar Eggemann
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Samuel Thibault
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/e9a4d858-bcf3-36b9-e3a9-449953e34569@arm.com
    Signed-off-by: Ingo Molnar

    Dietmar Eggemann
     
  • The update of the share of a cfs_rq is done when its load_avg is updated
    but before the group_entity's load_avg has been updated for the past time
    slot. This generates wrong load_avg accounting which can be significant
    when small tasks are involved in the scheduling.

    Let's take the example of a task "a" that is dequeued from its task
    group A:

         root
       (cfs_rq)
           \
           (se)
            A
         (cfs_rq)
             \
             (se)
              a

    Task "a" was the only task in task group A which becomes idle when a is
    dequeued.

    We have the sequence:

    - dequeue_entity a->se
        - update_load_avg(a->se)
        - dequeue_entity_load_avg(A->cfs_rq, a->se)
        - update_cfs_shares(A->cfs_rq)
            A->cfs_rq->load.weight == 0
            A->se->load.weight is updated with the new share (0 in this case)
    - dequeue_entity A->se
        - update_load_avg(A->se) but its weight is now null so the last
          time slot (up to a tick) will be accounted with a weight of 0
          instead of its real weight during the time slot. The last time
          slot will be accounted as an idle one whereas it was a running
          one.

    If the running time of task a is short enough that no tick happens when it
    runs, all running time of group entity A->se will be accounted as idle
    time.

    Instead, we should update the share of a cfs_rq (in fact the weight of its
    group entity) only after having updated the load_avg of the group_entity.

    update_cfs_shares() now takes the sched_entity as a parameter instead of the
    cfs_rq, and the weight of the group_entity is updated only once its load_avg
    has been synced with current time.
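
    A rough sketch of the corrected ordering in the dequeue path (the
    helper name is hypothetical and the flags/details are elided):

    static void dequeue_group_entity(struct sched_entity *group_se)
    {
            /* Sync the group entity's load_avg with current time first, so
             * the last time slot is accounted with its old, non-zero weight. */
            update_load_avg(group_se, UPDATE_TG);

            /* Only then recompute its weight/share, which may now become 0. */
            update_cfs_shares(group_se);
    }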

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: pjt@google.com
    Link: http://lkml.kernel.org/r/1482335426-7664-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     
  • Documentation/scheduler/completion.txt says this about complete_all():

    "calls complete_all() to signal all current and future waiters."

    Which doesn't strictly match the current semantics. Currently
    complete_all() is equivalent to UINT_MAX/2 complete() invocations,
    which is distinctly less than 'all current and future waiters'
    (enumerable vs innumerable), although it has worked in practice.

    However, Dmitry had a weird case where it might matter, so change
    completions to use saturation semantics for complete()/complete_all().
    Once done hits UINT_MAX (and complete_all() sets it there) it will
    never again be decremented.
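
    A sketch of the resulting saturation semantics (close to, but not
    necessarily, the literal patch):

    void complete(struct completion *x)
    {
            unsigned long flags;

            spin_lock_irqsave(&x->wait.lock, flags);
            if (x->done != UINT_MAX)                /* saturate, don't wrap */
                    x->done++;
            __wake_up_locked(&x->wait, TASK_NORMAL, 1);
            spin_unlock_irqrestore(&x->wait.lock, flags);
    }

    void complete_all(struct completion *x)
    {
            unsigned long flags;

            spin_lock_irqsave(&x->wait.lock, flags);
            x->done = UINT_MAX;                     /* never decremented again */
            __wake_up_locked(&x->wait, TASK_NORMAL, 0);
            spin_unlock_irqrestore(&x->wait.lock, flags);
    }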

    Requested-by: Dmitry Torokhov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: der.herr@hofr.at
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • This patch allows for reading the current (leftover) runtime and
    absolute deadline of a SCHED_DEADLINE task through /proc/*/sched
    (entries dl.runtime and dl.deadline), while debugging/testing.

    Signed-off-by: Tommaso Cucinotta
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Juri Lelli
    Reviewed-by: Luca Abeni
    Acked-by: Daniel Bistrot de Oliveira
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1477473437-10346-2-git-send-email-tommaso.cucinotta@sssup.it
    Signed-off-by: Ingo Molnar

    Tommaso Cucinotta
     
  • When switching between the unstable and stable variants it is
    currently possible that clock discontinuities occur.

    And while these will mostly be 'small', attempt to do better.

    As observed on my IVB-EP, the sched_clock() is ~1.5s ahead of the
    ktime_get_ns() based timeline at the point of switchover
    (sched_clock_init_late()) after SMP bringup.

    Equally, when the TSC is later found to be unstable -- typically
    because SMM tries to hide its SMI latencies by mucking with the TSC --
    we want to avoid large jumps.

    Since the clocksource watchdog reports the issue after the fact we
    cannot exactly fix up time, but since SMI latencies are typically
    small (~10ns range), the discontinuity is mainly due to drift between
    sched_clock() and ktime_get_ns() (which on my desktop is ~79s over
    24 days).

    I dislike this patch because it adds overhead to the good case in
    favour of dealing with badness. But given the widespread failure of
    TSC stability this is worth it.

    Note that in case the TSC makes drastic jumps after SMP bringup we're
    still hosed. There's just not much we can do in that case without
    stupid overhead.

    If we were to somehow expose tsc_clocksource_reliable (which is hard
    because this code is also used on ia64 and parisc) we could avoid some
    of the newly introduced overhead.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Currently we switch to the stable sched_clock if we guess the TSC is
    usable, and then switch back to the unstable path if it turns out TSC
    isn't stable during SMP bringup after all.

    Delay switching to the stable path until after SMP bringup is
    complete. This way we'll avoid switching during the time we detect the
    worst of the TSC offences.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • sched_clock was still using the deprecated static_key interface.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • PeterZ reported that we'd fail to mark the TSC unstable when the
    clocksource watchdog finds it unsuitable.

    Allow a clocksource to run a custom action when it's being marked
    unstable and hook up the TSC unstable code.

    Reported-by: Peter Zijlstra (Intel)
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • There are no diagnostic checks for figuring out when we've accidentally
    missed update_rq_clock() calls. Let's add some by piggybacking on the
    rq_*pin_lock() wrappers.

    The idea behind the diagnostic checks is that upon pinning the rq lock
    the rq clock should be updated, via update_rq_clock(), before anybody
    reads the clock with rq_clock() or rq_clock_task().

    The exception to this rule is when updates have explicitly been
    disabled with the rq_clock_skip_update() optimisation.

    There are some functions that only unpin the rq lock in order to grab
    some other lock and avoid deadlock. In that case we don't need to
    update the clock again and the previous diagnostic state can be
    carried over in rq_repin_lock() by saving the state in the rq_flags
    context.

    Since this patch adds a new clock update flag and some already exist
    in rq::clock_skip_update, that field has now been renamed. An attempt
    has been made to keep the flag manipulation code small and fast since
    it's used in the heart of the __schedule() fast path.
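
    A simplified sketch of the check (flag values and helper names follow
    the description above; treat the details as illustrative):

    /* Low bits encode the skip request/act state, the next bit records
     * that update_rq_clock() has run since the rq lock was pinned. */
    #define RQCF_REQ_SKIP   0x01
    #define RQCF_ACT_SKIP   0x02
    #define RQCF_UPDATED    0x04

    static inline void assert_clock_updated(struct rq *rq)
    {
            /* Either the clock was updated or skipping is actively allowed. */
            SCHED_WARN_ON(rq->clock_update_flags < RQCF_ACT_SKIP);
    }

    static inline u64 rq_clock(struct rq *rq)
    {
            lockdep_assert_held(&rq->lock);
            assert_clock_updated(rq);
            return rq->clock;
    }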

    For the !CONFIG_SCHED_DEBUG case the only object code change (other
    than addresses) is the following change to reset RQCF_ACT_SKIP inside
    of __schedule(),

    - c7 83 38 09 00 00 00 movl $0x0,0x938(%rbx)
    - 00 00 00
    + 83 a3 38 09 00 00 fc andl $0xfffffffc,0x938(%rbx)

    Suggested-by: Peter Zijlstra
    Signed-off-by: Matt Fleming
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Byungchul Park
    Cc: Frederic Weisbecker
    Cc: Jan Kara
    Cc: Linus Torvalds
    Cc: Luca Abeni
    Cc: Mel Gorman
    Cc: Mike Galbraith
    Cc: Mike Galbraith
    Cc: Petr Mladek
    Cc: Rik van Riel
    Cc: Sergey Senozhatsky
    Cc: Thomas Gleixner
    Cc: Wanpeng Li
    Cc: Yuyang Du
    Link: http://lkml.kernel.org/r/20160921133813.31976-8-matt@codeblueprint.co.uk
    Signed-off-by: Ingo Molnar

    Matt Fleming
     
  • Address this rq-clock update bug:

    WARNING: CPU: 30 PID: 195 at ../kernel/sched/sched.h:797 set_next_entity()
    rq->clock_update_flags < RQCF_ACT_SKIP

    Call Trace:
    dump_stack()
    __warn()
    warn_slowpath_fmt()
    set_next_entity()
    ? _raw_spin_lock()
    set_curr_task_fair()
    set_user_nice.part.85()
    set_user_nice()
    create_worker()
    worker_thread()
    kthread()
    ret_from_fork()

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Add the update_rq_clock() call at the top of the callstack instead of
    at the bottom where we find it missing; this aids the later effort to
    minimize the number of update_rq_clock() calls.

    WARNING: CPU: 30 PID: 194 at ../kernel/sched/sched.h:797 assert_clock_updated()
    rq->clock_update_flags < RQCF_ACT_SKIP

    Call Trace:
    dump_stack()
    __warn()
    warn_slowpath_fmt()
    assert_clock_updated.isra.63.part.64()
    can_migrate_task()
    load_balance()
    pick_next_task_fair()
    __schedule()
    schedule()
    worker_thread()
    kthread()

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Instead of adding the update_rq_clock() call all the way at the bottom
    of the callstack, add one at the top; this aids the later effort to
    minimize update_rq_clock() calls.

    WARNING: CPU: 0 PID: 1 at ../kernel/sched/sched.h:797 detach_task_cfs_rq()
    rq->clock_update_flags < RQCF_ACT_SKIP

    Call Trace:
    dump_stack()
    __warn()
    warn_slowpath_fmt()
    detach_task_cfs_rq()
    switched_from_fair()
    __sched_setscheduler()
    _sched_setscheduler()
    sched_set_stop_task()
    cpu_stop_create()
    __smpboot_create_thread.part.2()
    smpboot_register_percpu_thread_cpumask()
    cpu_stop_init()
    do_one_initcall()
    ? print_cpu_info()
    kernel_init_freeable()
    ? rest_init()
    kernel_init()
    ret_from_fork()

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Address this rq-clock update bug:

    WARNING: CPU: 0 PID: 0 at ../kernel/sched/sched.h:797 post_init_entity_util_avg()
    rq->clock_update_flags < RQCF_ACT_SKIP

    Call Trace:
    __warn()
    post_init_entity_util_avg()
    wake_up_new_task()
    _do_fork()
    kernel_thread()
    rest_init()
    start_kernel()

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Future patches will emit warnings if rq_clock() is called before
    update_rq_clock() inside a rq_pin_lock()/rq_unpin_lock() pair.

    Since there is only one caller of idle_balance() we can push the
    unpin/repin there.

    Signed-off-by: Matt Fleming
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Byungchul Park
    Cc: Frederic Weisbecker
    Cc: Jan Kara
    Cc: Linus Torvalds
    Cc: Luca Abeni
    Cc: Mel Gorman
    Cc: Mike Galbraith
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Petr Mladek
    Cc: Rik van Riel
    Cc: Sergey Senozhatsky
    Cc: Thomas Gleixner
    Cc: Wanpeng Li
    Cc: Yuyang Du
    Link: http://lkml.kernel.org/r/20160921133813.31976-7-matt@codeblueprint.co.uk
    Signed-off-by: Ingo Molnar

    Matt Fleming
     
  • rq_clock() is called from sched_info_{depart,arrive}() after resetting
    RQCF_ACT_SKIP but prior to a call to update_rq_clock().

    In preparation for pending patches that check whether the rq clock has
    been updated inside of a pin context before rq_clock() is called, move
    the reset of rq->clock_skip_update immediately before unpinning the rq
    lock.

    This will avoid the new warnings which check if update_rq_clock() is
    being actively skipped.

    Signed-off-by: Matt Fleming
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Byungchul Park
    Cc: Frederic Weisbecker
    Cc: Jan Kara
    Cc: Linus Torvalds
    Cc: Luca Abeni
    Cc: Mel Gorman
    Cc: Mike Galbraith
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Petr Mladek
    Cc: Rik van Riel
    Cc: Sergey Senozhatsky
    Cc: Thomas Gleixner
    Cc: Wanpeng Li
    Cc: Yuyang Du
    Link: http://lkml.kernel.org/r/20160921133813.31976-6-matt@codeblueprint.co.uk
    Signed-off-by: Ingo Molnar

    Matt Fleming
     
  • In preparation for adding diagnostic checks to catch missing calls to
    update_rq_clock(), provide wrappers for (re)pinning and unpinning
    rq->lock.

    Because the pending diagnostic checks allow state to be maintained in
    rq_flags across pin contexts, swap the 'struct pin_cookie' arguments
    for 'struct rq_flags *'.
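
    A minimal sketch of the wrappers (the diagnostic state is omitted; it
    is added by later patches in this series):

    struct rq_flags {
            unsigned long flags;
            struct pin_cookie cookie;
    };

    static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
    {
            rf->cookie = lockdep_pin_lock(&rq->lock);
    }

    static inline void rq_unpin_lock(struct rq *rq, struct rq_flags *rf)
    {
            lockdep_unpin_lock(&rq->lock, rf->cookie);
    }

    static inline void rq_repin_lock(struct rq *rq, struct rq_flags *rf)
    {
            lockdep_repin_lock(&rq->lock, rf->cookie);
    }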

    Signed-off-by: Matt Fleming
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Byungchul Park
    Cc: Frederic Weisbecker
    Cc: Jan Kara
    Cc: Linus Torvalds
    Cc: Luca Abeni
    Cc: Mel Gorman
    Cc: Mike Galbraith
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Petr Mladek
    Cc: Rik van Riel
    Cc: Sergey Senozhatsky
    Cc: Thomas Gleixner
    Cc: Wanpeng Li
    Cc: Yuyang Du
    Link: http://lkml.kernel.org/r/20160921133813.31976-5-matt@codeblueprint.co.uk
    Signed-off-by: Ingo Molnar

    Matt Fleming
     
  • CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y used to accumulate user time and
    account it on ticks and context switches only through the
    vtime_account_user() function.

    Now this model has been generalized on the 3 archs for all kinds of
    cputime (system, irq, ...) and all the cputime flushing happens under
    vtime_account_user().

    So let's rename this function to better reflect its new role.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Thomas Gleixner
    Acked-by: Martin Schwidefsky
    Cc: Benjamin Herrenschmidt
    Cc: Christian Borntraeger
    Cc: Fenghua Yu
    Cc: Heiko Carstens
    Cc: Linus Torvalds
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Stanislaw Gruszka
    Cc: Tony Luck
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1483636310-6557-11-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • The account_system_time() function is called with a cputime that
    occurred while running in the kernel. The function detects which
    context the CPU is currently running in and accounts the time to
    the correct bucket. This forces the arch code to account the
    cputime for hardirq and softirq immediately.

    Such an accounting function can be costly and perform unwelcome
    divisions and multiplications, among other things.

    The arch code can delay the accounting for system time. For s390
    the accounting is done once per timer tick and for each task switch.

    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Frederic Weisbecker
    [ Rebase against latest linus tree and move account_system_index_scaled(). ]
    Acked-by: Thomas Gleixner
    Cc: Benjamin Herrenschmidt
    Cc: Christian Borntraeger
    Cc: Fenghua Yu
    Cc: Heiko Carstens
    Cc: Linus Torvalds
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Stanislaw Gruszka
    Cc: Tony Luck
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1483636310-6557-10-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Martin Schwidefsky
     
  • Currently CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y accounts the cputime on
    any context boundary: irq entry/exit, guest entry/exit, context switch,
    etc...

    Calling functions such as account_system_time() and account_user_time()
    can be costly, especially if they are called on many fast paths, such
    as twice per IRQ. Those functions do more than just accounting to
    kcpustat and task cputime. Depending on the config, some subsystems can
    perform unpleasant multiplications and divisions, among other things.

    So let's accumulate the cputime instead and delay the accounting to
    ticks and context switches only.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Thomas Gleixner
    Cc: Benjamin Herrenschmidt
    Cc: Christian Borntraeger
    Cc: Fenghua Yu
    Cc: Heiko Carstens
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Stanislaw Gruszka
    Cc: Tony Luck
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1483636310-6557-9-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • Currently CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y accounts the cputime on
    any context boundary: irq entry/exit, guest entry/exit, context switch,
    etc...

    Calling functions such as account_system_time() and account_user_time()
    can be costly, especially if they are called on many fast paths, such
    as twice per IRQ. Those functions do more than just accounting to
    kcpustat and task cputime. Depending on the config, some subsystems can
    perform unpleasant multiplications and divisions, among other things.

    So let's accumulate the cputime instead and delay the accounting to
    ticks and context switches only.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Thomas Gleixner
    Cc: Benjamin Herrenschmidt
    Cc: Christian Borntraeger
    Cc: Fenghua Yu
    Cc: Heiko Carstens
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Stanislaw Gruszka
    Cc: Tony Luck
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1483636310-6557-8-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • This is done in order to gather all cputime accumulation in the same
    place.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Thomas Gleixner
    Cc: Benjamin Herrenschmidt
    Cc: Christian Borntraeger
    Cc: Fenghua Yu
    Cc: Heiko Carstens
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Stanislaw Gruszka
    Cc: Tony Luck
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1483636310-6557-7-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • In order to prepare for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y to delay
    cputime accounting to the tick, provide fine-grained accumulators to
    powerpc in order to store the cputime until flushing.

    While at it, normalize the name of several fields according to common
    cputime naming.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Thomas Gleixner
    Cc: Benjamin Herrenschmidt
    Cc: Christian Borntraeger
    Cc: Fenghua Yu
    Cc: Heiko Carstens
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Stanislaw Gruszka
    Cc: Tony Luck
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1483636310-6557-6-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • In order to prepare for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y to delay
    cputime accounting to the tick, let's allow archs to account cputime
    directly to gtime.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Thomas Gleixner
    Cc: Benjamin Herrenschmidt
    Cc: Christian Borntraeger
    Cc: Fenghua Yu
    Cc: Heiko Carstens
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Stanislaw Gruszka
    Cc: Tony Luck
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1483636310-6557-5-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • In order to prepare for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y to delay
    cputime accounting to the tick, let's provide APIs to account system
    time to precise contexts: hardirq, softirq, pure system, ...
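
    A hedged illustration of the intent (the helper name and signature are
    an assumption based on this series; the CPUTIME_* values are the
    existing kcpustat indexes):

    static void flush_hardirq_time(struct task_struct *p, cputime_t delta)
    {
            /* Account a batch of kernel time straight to the hardirq bucket
             * instead of re-deriving the context at accounting time. */
            account_system_index_time(p, delta, CPUTIME_IRQ);
    }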

    Inspired-by: Martin Schwidefsky
    Signed-off-by: Frederic Weisbecker
    Acked-by: Thomas Gleixner
    Cc: Benjamin Herrenschmidt
    Cc: Christian Borntraeger
    Cc: Fenghua Yu
    Cc: Heiko Carstens
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Stanislaw Gruszka
    Cc: Tony Luck
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1483636310-6557-4-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • On task switch we must initialize the current cputime of the next task
    using the value of the previous task which got freshly updated.

    But the code currently does the opposite, which results in incorrect
    cputime accounting.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Thomas Gleixner
    Cc: Benjamin Herrenschmidt
    Cc: Christian Borntraeger
    Cc: Fenghua Yu
    Cc: Heiko Carstens
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Stanislaw Gruszka
    Cc: Tony Luck
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1483636310-6557-3-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • On context switch with powerpc32, the cputime is accumulated in the
    thread_info struct. So the switching-in task must move forward its
    start time snapshot to the current time in order to later compute the
    delta spent in system mode.

    This is what we do for the normal cputime by initializing the starttime
    field to the value of the previous task's starttime which got freshly
    updated.

    But we are missing the update of the scaled cputime start time. As a
    result we may be accounting too much scaled cputime later.

    Fix this by initializing the scaled cputime the same way we do for
    normal cputime.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Thomas Gleixner
    Cc: Benjamin Herrenschmidt
    Cc: Christian Borntraeger
    Cc: Fenghua Yu
    Cc: Heiko Carstens
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Stanislaw Gruszka
    Cc: Tony Luck
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1483636310-6557-2-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • Pull btrfs fixes from Chris Mason:
    "These are all over the place.

    The tracepoint part of the pull fixes a crash and adds a little more
    information to two tracepoints, while the rest are good old fashioned
    fixes"

    * 'for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    btrfs: make tracepoint format strings more compact
    Btrfs: add truncated_len for ordered extent tracepoints
    Btrfs: add 'inode' for extent map tracepoint
    btrfs: fix crash when tracepoint arguments are freed by wq callbacks
    Btrfs: adjust outstanding_extents counter properly when dio write is split
    Btrfs: fix lockdep warning about log_mutex
    Btrfs: use down_read_nested to make lockdep silent
    btrfs: fix locking when we put back a delayed ref that's too new
    btrfs: fix error handling when run_delayed_extent_op fails
    btrfs: return the actual error value from from btrfs_uuid_tree_iterate

    Linus Torvalds
     
  • Pull ceph fixes from Ilya Dryomov:
    "Two small fixups for the filesystem changes that went into this merge
    window"

    * tag 'ceph-for-4.10-rc4' of git://github.com/ceph/ceph-client:
    ceph: fix get_oldest_context()
    ceph: fix mds cluster availability check

    Linus Torvalds
     
  • Pull VFIO fixes from Alex Williamson:

    - Cleanups and bug fixes for the mtty sample driver (Dan Carpenter)

    - Export and make use of has_capability() to fix incorrect use of
    ns_capable() for testing task capabilities (Jike Song)

    * tag 'vfio-v4.10-rc4' of git://github.com/awilliam/linux-vfio:
    vfio/type1: Remove pid_namespace.h include
    vfio iommu type1: fix the testing of capability for remote task
    capability: export has_capability
    vfio-mdev: remove some dead code
    vfio-mdev: buffer overflow in ioctl()
    vfio-mdev: return -EFAULT if copy_to_user() fails

    Linus Torvalds
     
  • Pull KVM fixes from Paolo Bonzini:

    - fix for module unload vs deferred jump labels (note: there might be
    other buggy modules!)

    - two NULL pointer dereferences from syzkaller

    - also syzkaller: fix emulation of fxsave/fxrstor/sgdt/sidt, problem
    made worse during this merge window, "just" kernel memory leak on
    releases

    - fix emulation of "mov ss" - somewhat serious on AMD, less so on Intel

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
    KVM: x86: fix emulation of "MOV SS, null selector"
    KVM: x86: fix NULL deref in vcpu_scan_ioapic
    KVM: eventfd: fix NULL deref irqbypass consumer
    KVM: x86: Introduce segmented_write_std
    KVM: x86: flush pending lapic jump label updates on module unload
    jump_labels: API for flushing deferred jump label updates

    Linus Torvalds
     
  • Pull arm64 fixes from Catalin Marinas:

    - Fix huge_ptep_set_access_flags() to return "changed" when any of the
    ptes in the contiguous range is changed, not just the last one

    - Fix the adr_l assembly macro to work in modules under KASLR

    * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
    arm64: assembler: make adr_l work in modules under KASLR
    arm64: hugetlb: fix the wrong return value for huge_ptep_set_access_flags

    Linus Torvalds
     
  • Pull SCSI fixes from James Bottomley:
    "The major fix is the bfa firmware, since the latest 10Gb cards fail
    probing with the current firmware.

    The rest is a set of minor fixes: one missed Kconfig dependency
    causing randconfig failures, a missed error return on an error leg, a
    change for how multiqueue waits on a blocked device and a don't reset
    while in reset fix"

    * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
    scsi: bfa: Increase requested firmware version to 3.2.5.1
    scsi: snic: Return error code on memory allocation failure
    scsi: fnic: Avoid sending reset to firmware when another reset is in progress
    scsi: qedi: fix build, depends on UIO
    scsi: scsi-mq: Wait for .queue_rq() if necessary

    Linus Torvalds
     
  • Pull input updates from Dmitry Torokhov:
    "Small driver fixups"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
    Input: elants_i2c - avoid divide by 0 errors on bad touchscreen data
    Input: adxl34x - make it enumerable in ACPI environment
    Input: ALPS - fix TrackStick Y axis handling for SS5 hardware
    Input: synaptics-rmi4 - fix F03 build error when serio is module
    Input: xpad - use correct product id for x360w controllers
    Input: synaptics_i2c - change msleep to usleep_range for small msecs
    Input: i8042 - add Pegatron touchpad to noloop table
    Input: joydev - remove unused linux/miscdevice.h include

    Linus Torvalds