08 Mar, 2017

1 commit


02 Mar, 2017

10 commits

  • Pavan noticed that the following commit:

    49ee576809d8 ("sched/core: Optimize pick_next_task() for idle_sched_class")

    ... broke RT,DL balancing by robbing them of the opportunity to do new-'idle'
    balancing when their last runnable task (on that runqueue) goes away.

    Reported-by: Pavan Kondeti
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Fixes: 49ee576809d8 ("sched/core: Optimize pick_next_task() for idle_sched_class")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • We are going to split out of , which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file that just
    maps to to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • We are going to split out of , which
    will have to be picked up from a couple of .c files.

    Create a trivial placeholder file that just
    maps to to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • We are going to move scheduler ABI details to ,
    which will be used from a number of .c files.

    Create empty placeholder header that maps to .

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • We are going to split out of , which
    will have to be picked up from other headers and .c files.

    Create a trivial placeholder file that just
    maps to to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • So rcupdate.h is a pretty complex header, in particular it includes
    which includes - creating a
    dependency that includes in ,
    which prevents the isolation of from the derived
    header.

    Solve part of the problem by decoupling rcupdate.h from completions:
    this can be done by separating out the rcu_synchronize types and APIs,
    and updating their usage sites.

    Since these are mostly RCU-internal types, this will not just simplify
    's dependencies, but will make all the hundreds of
    .c files that include rcupdate.h but not completions or wait.h build
    faster.

    ( For rcutiny this means that two dependent APIs have to be uninlined,
    but that shouldn't be much of a problem as they are rare variants. )

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • tsk_nr_cpus_allowed() too is a pretty pointless wrapper that
    is not used consistently and which makes the code both harder
    to read and longer as well.

    So remove it - this also shrinks a bit.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • So the original intention of tsk_cpus_allowed() was to 'future-proof'
    the field - but it's pretty ineffectual at that, because half of
    the code uses ->cpus_allowed directly ...

    Also, the wrapper makes the code longer than the original expression!

    So just get rid of it. This also shrinks a bit.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • It's defined in , but nothing outside the scheduler
    uses it - so move it to the sched/core.c usage site.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • The length of TASK_STATE_TO_CHAR_STR was still checked using the old
    link-time manual error method - convert it to BUILD_BUG_ON(). This
    has a couple of advantages:

    - it's more obvious what's going on

    - it reduces the size and complexity of

    - BUILD_BUG_ON() will fail during compilation, with a clearer
    error message than the link-time assert.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
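    The compile-time assert described above can be sketched as follows. This
    is an illustrative stand-in, not the kernel's exact BUILD_BUG_ON()
    definition, and the state string used here is a made-up example:

    ```c
    #include <assert.h>
    #include <string.h>

    /* Minimal stand-in for the kernel's BUILD_BUG_ON(): a negative array
     * size forces a compile-time error when the condition is true. */
    #define MY_BUILD_BUG_ON(cond) ((void)sizeof(char[1 - 2 * !!(cond)]))

    /* Illustrative task-state string (13 characters in this sketch). */
    #define TASK_STATE_TO_CHAR_STR "RSDTtXZxKWPNn"

    int main(void)
    {
        /* Fails at compilation, not at link time, if the string length
         * ever drifts out of sync with the expected state count. */
        MY_BUILD_BUG_ON(sizeof(TASK_STATE_TO_CHAR_STR) - 1 != 13);
        assert(strlen(TASK_STATE_TO_CHAR_STR) == 13);
        return 0;
    }
    ```

    Because the array size is a constant expression, a mismatch is reported
    at the point of use during compilation rather than as an opaque
    link-time error.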
     

01 Mar, 2017

1 commit


28 Feb, 2017

1 commit

  • Apart from adding the helper function itself, the rest of the kernel is
    converted mechanically using:

    git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_count);/mmgrab\(\1\);/'
    git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_count);/mmgrab\(\&\1\);/'

    This is needed for a later patch that hooks into the helper, but might
    be a worthwhile cleanup on its own.

    (Michal Hocko provided most of the kerneldoc comment.)

    Link: http://lkml.kernel.org/r/20161218123229.22952-1-vegard.nossum@oracle.com
    Signed-off-by: Vegard Nossum
    Acked-by: Michal Hocko
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vegard Nossum
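    The mechanical conversion above can be sketched as below. Types are
    simplified stand-ins (a plain int instead of atomic_t), and the helper
    name carries a _sketch suffix to mark it as illustrative:

    ```c
    #include <assert.h>

    /* Simplified mm_struct with the mm_count reference count. */
    struct mm_struct { int mm_count; };

    /* mmgrab()-style helper wrapping the bare refcount increment,
     * as the patch introduces (kernel: atomic_inc(&mm->mm_count);). */
    static void mmgrab_sketch(struct mm_struct *mm)
    {
        mm->mm_count++;
    }

    int main(void)
    {
        struct mm_struct mm = { .mm_count = 1 };
        mmgrab_sketch(&mm);        /* was: atomic_inc(&mm.mm_count); */
        assert(mm.mm_count == 2);
        return 0;
    }
    ```

    Hiding the increment behind a named helper is what later allows a patch
    to hook extra work into the grab path without touching every caller.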
     

24 Feb, 2017

3 commits

  • Commit:

    2f5177f0fd7e ("sched/cgroup: Fix/cleanup cgroup teardown/init")

    ... moved sched_online_group() from css_online() to css_alloc().
    It exposes a half-baked task group in global lists before the
    generic cgroup machinery is initialized.

    LTP testcase (third in cgroup_regression_test) written for testing
    similar race in kernels 2.6.26-2.6.28 easily triggers this oops:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    IP: kernfs_path_from_node_locked+0x260/0x320
    CPU: 1 PID: 30346 Comm: cat Not tainted 4.10.0-rc5-test #4
    Call Trace:
    ? kernfs_path_from_node+0x4f/0x60
    kernfs_path_from_node+0x3e/0x60
    print_rt_rq+0x44/0x2b0
    print_rt_stats+0x7a/0xd0
    print_cpu+0x2fc/0xe80
    ? __might_sleep+0x4a/0x80
    sched_debug_show+0x17/0x30
    seq_read+0xf2/0x3b0
    proc_reg_read+0x42/0x70
    __vfs_read+0x28/0x130
    ? security_file_permission+0x9b/0xc0
    ? rw_verify_area+0x4e/0xb0
    vfs_read+0xa5/0x170
    SyS_read+0x46/0xa0
    entry_SYSCALL_64_fastpath+0x1e/0xad

    Here the task group is already linked into the global RCU-protected 'task_groups'
    list, but the css->cgroup pointer is still NULL.

    This patch reverts this chunk and moves online back to css_online().

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Fixes: 2f5177f0fd7e ("sched/cgroup: Fix/cleanup cgroup teardown/init")
    Link: http://lkml.kernel.org/r/148655324740.424917.5302984537258726349.stgit@buzz
    Signed-off-by: Ingo Molnar

    Konstantin Khlebnikov
     
  • This is triggered during boot when CONFIG_SCHED_DEBUG is enabled:

    ------------[ cut here ]------------
    WARNING: CPU: 6 PID: 81 at kernel/sched/sched.h:812 set_next_entity+0x11d/0x380
    rq->clock_update_flags < RQCF_ACT_SKIP
    CPU: 6 PID: 81 Comm: torture_shuffle Not tainted 4.10.0+ #1
    Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
    Call Trace:
    dump_stack+0x85/0xc2
    __warn+0xcb/0xf0
    warn_slowpath_fmt+0x5f/0x80
    set_next_entity+0x11d/0x380
    set_curr_task_fair+0x2b/0x60
    do_set_cpus_allowed+0x139/0x180
    __set_cpus_allowed_ptr+0x113/0x260
    set_cpus_allowed_ptr+0x10/0x20
    torture_shuffle+0xfd/0x180
    kthread+0x10f/0x150
    ? torture_shutdown_init+0x60/0x60
    ? kthread_create_on_node+0x60/0x60
    ret_from_fork+0x31/0x40
    ---[ end trace dd94d92344cea9c6 ]---

    The task is running && !queued, so there is no rq clock update before calling
    set_curr_task().

    This patch fixes it by updating the rq clock after taking rq->lock/pi_lock,
    just as the other dequeue + put_prev + enqueue + set_curr paths do.

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Matt Fleming
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1487749975-5994-1-git-send-email-wanpeng.li@hotmail.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • The hotplug code still triggers the warning about using a stale
    rq->clock value.

    Fix things up to actually run update_rq_clock() in a place where we
    record the 'UPDATED' flag, and then modify the annotation to retain
    this flag over the rq->lock fiddling that happens as a result of
    actually migrating all the tasks elsewhere.

    Reported-by: Linus Torvalds
    Tested-by: Mike Galbraith
    Tested-by: Sachin Sant
    Tested-by: Borislav Petkov
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Matt Fleming
    Cc: Michael Ellerman
    Cc: Peter Zijlstra
    Cc: Ross Zwisler
    Cc: Thomas Gleixner
    Fixes: 4d25b35ea372 ("sched/fair: Restore previous rq_flags when migrating tasks in hotplug")
    Link: http://lkml.kernel.org/r/20170202155506.GX6515@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

22 Feb, 2017

1 commit

  • Commit 004172bdad64 ("sched/core: Remove unnecessary #include headers")
    removed the inclusion of asm/paravirt.h which is used to get
    declarations of paravirt_steal_rq_enabled and paravirt_steal_clock.

    It is implicitly included on x86, but not on arm and arm64, breaking the
    build if paravirtualization is used. Since things from that header are
    used directly, fix the build by putting the direct inclusion back.

    Signed-off-by: Mark Brown
    Signed-off-by: Linus Torvalds

    Mark Brown
     

10 Feb, 2017

1 commit

  • The check for 'running' in sched_move_task() has an unlikely() around it. That
    is, it is unlikely that the task being moved is running. That used to be
    true. But with a couple of recent updates, it is now likely that the task
    will be running.

    The first change came from ea86cb4b7621 ("sched/cgroup: Fix
    cpu_cgroup_fork() handling") that moved around the use case of
    sched_move_task() in do_fork() where the call is now done after the task is
    woken (hence it is running).

    The second change came from 8e5bfa8c1f84 ("sched/autogroup: Do not use
    autogroup->tg in zombie threads") where sched_move_task() is called by the
    exit path, by the task that is exiting. Hence it too is running.

    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Vincent Guittot
    Link: http://lkml.kernel.org/r/20170206110426.27ca6426@gandalf.local.home
    Signed-off-by: Ingo Molnar

    Steven Rostedt (VMware)
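    The branch-hint change above can be sketched with GCC-style
    __builtin_expect() macros, which is how the kernel's likely()/unlikely()
    are built. The macro and function names here are illustrative, and the
    sketch assumes a GCC/Clang compiler:

    ```c
    #include <assert.h>

    /* Branch hints only guide code layout; semantics are unchanged. */
    #define my_likely(x)   __builtin_expect(!!(x), 1)
    #define my_unlikely(x) __builtin_expect(!!(x), 0)

    static int moved_running;

    /* Hypothetical stand-in for the sched_move_task() check: the patch
     * effectively flips unlikely(running) to the likely case. */
    static void sched_move_sketch(int running)
    {
        if (my_likely(running))
            moved_running++;
    }

    int main(void)
    {
        sched_move_sketch(1);
        assert(moved_running == 1);
        return 0;
    }
    ```

    A stale hint doesn't break correctness, but it pessimizes the hot path,
    which is why the annotation is worth updating when callers change.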
     

07 Feb, 2017

4 commits


01 Feb, 2017

1 commit

  • We added the 'sched_rr_timeslice_ms' SCHED_RR tuning knob in this commit:

    ce0dbbbb30ae ("sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice")

    ... which name suggests to users that it's in milliseconds, while in reality
    it's being set in milliseconds but the result is shown in jiffies.

    This is obviously confusing when HZ is not 1000; it makes it appear as if
    setting the value failed. For example, with HZ=100:

    root# echo 100 > /proc/sys/kernel/sched_rr_timeslice_ms
    root# cat /proc/sys/kernel/sched_rr_timeslice_ms
    10

    Fix this to be milliseconds all around.

    Signed-off-by: Shile Zhang
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1485612049-20923-1-git-send-email-shile.zhang@nokia.com
    Signed-off-by: Ingo Molnar

    Shile Zhang
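    The confusion above comes from converting milliseconds to jiffies on
    write but echoing the raw jiffies value back on read. A minimal sketch,
    with HZ and the conversion helpers as simplified stand-ins:

    ```c
    #include <assert.h>

    #define HZ 100   /* illustrative; a common non-1000 configuration */

    static int ms_to_jiffies(int ms)  { return ms * HZ / 1000; }
    static int jiffies_to_ms(int jif) { return jif * 1000 / HZ; }

    int main(void)
    {
        int stored = ms_to_jiffies(100);       /* kernel stores jiffies */
        assert(stored == 10);                  /* what users read back */
        assert(jiffies_to_ms(stored) == 100);  /* the fix: convert back to ms */
        return 0;
    }
    ```

    Converting back to milliseconds on the read side keeps the knob's units
    consistent regardless of HZ.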
     

30 Jan, 2017

5 commits

  • While in the process of initialising a root domain, if function
    cpupri_init() fails the memory allocated in cpudl_init() is not
    reclaimed.

    Adding a new goto target to clean up the previous initialisation of
    the root_domain's dl_bw structure reclaims said memory.

    Signed-off-by: Mathieu Poirier
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1485292295-21298-2-git-send-email-mathieu.poirier@linaro.org
    Signed-off-by: Ingo Molnar

    Mathieu Poirier
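    The error-unwind pattern the patch restores can be sketched as follows.
    The init steps here are stubs standing in for cpudl_init()/cpupri_init();
    each failing step jumps to a label that frees everything allocated so far:

    ```c
    #include <stdlib.h>
    #include <assert.h>

    /* Hypothetical two-step init: if the second step fails, the memory
     * from the first step must be reclaimed via the unwind labels. */
    static int init_domain_sketch(int fail_second)
    {
        char *dl = malloc(16);        /* stands in for cpudl_init() */
        if (!dl)
            goto out;
        if (fail_second)              /* stands in for cpupri_init() failing */
            goto free_dl;             /* new target: reclaim earlier memory */
        free(dl);
        return 0;

    free_dl:
        free(dl);
    out:
        return -1;
    }

    int main(void)
    {
        assert(init_domain_sketch(0) == 0);
        assert(init_domain_sketch(1) == -1);
        return 0;
    }
    ```

    Stacking the unwind labels in reverse allocation order means each new
    init step only needs one extra goto target rather than duplicated
    cleanup code.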
     
  • If function cpudl_init() fails the memory allocated for &rd->rto_mask
    needs to be freed, something this patch is addressing.

    Signed-off-by: Mathieu Poirier
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1485292295-21298-1-git-send-email-mathieu.poirier@linaro.org
    Signed-off-by: Ingo Molnar

    Mathieu Poirier
     
  • __migrate_task() can return with a different runqueue locked than the
    one we passed as an argument. So that we can repin the lock in
    migrate_tasks() (and keep the update_rq_clock() bit) we need to
    restore the old rq_flags before repinning.

    Note that it wouldn't be correct to change move_queued_task() to repin
    because of the change of runqueue and the fact that having an
    up-to-date clock on the initial rq doesn't mean the new rq has one
    too.

    Signed-off-by: Matt Fleming
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Matt Fleming
     
  • Bug was noticed via this warning:

    WARNING: CPU: 6 PID: 1 at kernel/sched/sched.h:804 detach_task_cfs_rq+0x8e8/0xb80
    rq->clock_update_flags < RQCF_ACT_SKIP
    Modules linked in:
    CPU: 6 PID: 1 Comm: systemd Not tainted 4.10.0-rc5-00140-g0874170baf55-dirty #1
    Hardware name: Supermicro SYS-4048B-TRFT/X10QBi, BIOS 1.0 04/11/2014
    Call Trace:
    dump_stack+0x4d/0x65
    __warn+0xcb/0xf0
    warn_slowpath_fmt+0x5f/0x80
    detach_task_cfs_rq+0x8e8/0xb80
    ? allocate_cgrp_cset_links+0x59/0x80
    task_change_group_fair+0x27/0x150
    sched_change_group+0x48/0xf0
    sched_move_task+0x53/0x150
    cpu_cgroup_attach+0x36/0x70
    cgroup_taskset_migrate+0x175/0x300
    cgroup_migrate+0xab/0xd0
    cgroup_attach_task+0xf0/0x190
    __cgroup_procs_write+0x1ed/0x2f0
    cgroup_procs_write+0x14/0x20
    cgroup_file_write+0x3f/0x100
    kernfs_fop_write+0x104/0x180
    __vfs_write+0x37/0x140
    vfs_write+0xb8/0x1b0
    SyS_write+0x55/0xc0
    do_syscall_64+0x61/0x170
    entry_SYSCALL64_slow_path+0x25/0x25

    Reported-by: Ingo Molnar
    Reported-by: Borislav Petkov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Steve noticed that when we switch from IDLE to SCHED_OTHER we fail to
    take the shortcut, even though all runnable tasks are of the fair
    class, because prev->sched_class != &fair_sched_class.

    Since I reworked the put_prev_task() stuff, we don't really care about
    prev->class here, so removing that condition will allow this case.

    This increases the likely case from 78% to 98% correct for Steve's
    workload.

    Reported-by: Steven Rostedt (VMware)
    Tested-by: Steven Rostedt (VMware)
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20170119174408.GN6485@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

14 Jan, 2017

9 commits

  • Now that IO schedule accounting is done inside __schedule(),
    io_schedule() can be split into three steps - prep, schedule, and
    finish - where the schedule part doesn't need any special annotation.
    This allows marking a sleep as iowait by simply wrapping an existing
    blocking function with io_schedule_prepare() and io_schedule_finish().

    Because task_struct->in_iowait is a single bit, the caller of
    io_schedule_prepare() needs to record and then pass its state to
    io_schedule_finish() to be safe regarding nesting. While this isn't
    the prettiest, these functions are mostly going to be used by core
    functions and we don't want to use more space for ->in_iowait.

    While at it, as it's simple to do now, reimplement io_schedule()
    without unnecessarily going through io_schedule_timeout().

    Signed-off-by: Tejun Heo
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Jens Axboe
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: adilger.kernel@dilger.ca
    Cc: jack@suse.com
    Cc: kernel-team@fb.com
    Cc: mingbo@fb.com
    Cc: tytso@mit.edu
    Link: http://lkml.kernel.org/r/1477673892-28940-3-git-send-email-tj@kernel.org
    Signed-off-by: Ingo Molnar

    Tejun Heo
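    The nesting-safe prepare/finish token-passing described above can be
    sketched as below. All structures and names are simplified stand-ins
    modelled on the commit's description:

    ```c
    #include <assert.h>

    struct task { int in_iowait; };
    static struct task current_task;

    /* Returns the old in_iowait bit; the caller must hand it back to the
     * finish step so nested prepare/finish pairs restore correctly. */
    static int io_schedule_prepare_sketch(void)
    {
        int old = current_task.in_iowait;
        current_task.in_iowait = 1;
        return old;
    }

    static void io_schedule_finish_sketch(int token)
    {
        current_task.in_iowait = token;   /* restore, nesting-safe */
    }

    int main(void)
    {
        int outer = io_schedule_prepare_sketch();
        int inner = io_schedule_prepare_sketch();  /* nested: sees 1 */
        io_schedule_finish_sketch(inner);
        assert(current_task.in_iowait == 1);       /* outer still marked */
        io_schedule_finish_sketch(outer);
        assert(current_task.in_iowait == 0);
        return 0;
    }
    ```

    Returning the old value instead of unconditionally clearing the bit is
    what makes the pair safe to nest without widening ->in_iowait.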
     
  • For an interface to support blocking for IOs, it must call
    io_schedule() instead of schedule(). This makes it tedious to add IO
    blocking to existing interfaces as the switching between schedule()
    and io_schedule() is often buried deep.

    As we already have a way to mark the task as IO scheduling, this can
    be made easier by separating out io_schedule() into multiple steps so
    that IO schedule preparation can be performed before invoking a
    blocking interface and the actual accounting happens inside the
    scheduler.

    io_schedule_timeout() does the following three things prior to calling
    schedule_timeout():

    1. Mark the task as scheduling for IO.
    2. Flush out plugged IOs.
    3. Account the IO scheduling.

    While the first two can be performed in a preparation step, the third
    needs to be done close to the actual scheduling. This patch moves #3 into
    the scheduler so that later patches can separate out preparation and
    finish steps from io_schedule().

    Patch-originally-by: Peter Zijlstra
    Signed-off-by: Tejun Heo
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: adilger.kernel@dilger.ca
    Cc: akpm@linux-foundation.org
    Cc: axboe@kernel.dk
    Cc: jack@suse.com
    Cc: kernel-team@fb.com
    Cc: mingbo@fb.com
    Cc: tytso@mit.edu
    Link: http://lkml.kernel.org/r/20161207204841.GA22296@htj.duckdns.org
    Signed-off-by: Ingo Molnar

    Tejun Heo
     
  • Currently we switch to the stable sched_clock if we guess the TSC is
    usable, and then switch back to the unstable path if it turns out TSC
    isn't stable during SMP bringup after all.

    Delay switching to the stable path until after SMP bringup is
    complete. This way we'll avoid switching during the time we detect the
    worst of the TSC offences.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    There are no diagnostic checks for figuring out when we've accidentally
    missed update_rq_clock() calls. Let's add some by piggybacking on the
    rq_*pin_lock() wrappers.

    The idea behind the diagnostic checks is that upon pinning the rq lock the
    rq clock should be updated, via update_rq_clock(), before anybody
    reads the clock with rq_clock() or rq_clock_task().

    The exception to this rule is when updates have explicitly been
    disabled with the rq_clock_skip_update() optimisation.

    There are some functions that only unpin the rq lock in order to grab
    some other lock and avoid deadlock. In that case we don't need to
    update the clock again and the previous diagnostic state can be
    carried over in rq_repin_lock() by saving the state in the rq_flags
    context.

    Since this patch adds a new clock update flag and some already exist
    in rq::clock_skip_update, that field has now been renamed. An attempt
    has been made to keep the flag manipulation code small and fast since
    it's used in the heart of the __schedule() fast path.

    For the !CONFIG_SCHED_DEBUG case the only object code change (other
    than addresses) is the following change to reset RQCF_ACT_SKIP inside
    of __schedule(),

    - c7 83 38 09 00 00 00 movl $0x0,0x938(%rbx)
    - 00 00 00
    + 83 a3 38 09 00 00 fc andl $0xfffffffc,0x938(%rbx)

    Suggested-by: Peter Zijlstra
    Signed-off-by: Matt Fleming
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Byungchul Park
    Cc: Frederic Weisbecker
    Cc: Jan Kara
    Cc: Linus Torvalds
    Cc: Luca Abeni
    Cc: Mel Gorman
    Cc: Mike Galbraith
    Cc: Mike Galbraith
    Cc: Petr Mladek
    Cc: Rik van Riel
    Cc: Sergey Senozhatsky
    Cc: Thomas Gleixner
    Cc: Wanpeng Li
    Cc: Yuyang Du
    Link: http://lkml.kernel.org/r/20160921133813.31976-8-matt@codeblueprint.co.uk
    Signed-off-by: Ingo Molnar

    Matt Fleming
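    The diagnostic idea above can be sketched with flag bits recording
    whether the clock was updated (or the update explicitly skipped) since
    the lock was pinned. The flag names echo the commit; everything else is
    a simplified stand-in:

    ```c
    #include <assert.h>

    #define RQCF_REQ_SKIP 0x01
    #define RQCF_ACT_SKIP 0x02
    #define RQCF_UPDATED  0x04

    struct rq_sketch { unsigned int clock_update_flags; long clock; };

    static void update_rq_clock_sketch(struct rq_sketch *rq)
    {
        rq->clock++;
        rq->clock_update_flags |= RQCF_UPDATED;
    }

    /* Mirrors the warning condition seen in the splats:
     * rq->clock_update_flags < RQCF_ACT_SKIP means neither an update
     * nor an active skip happened since the lock was pinned. */
    static int rq_clock_ok(struct rq_sketch *rq)
    {
        return rq->clock_update_flags >= RQCF_ACT_SKIP;
    }

    int main(void)
    {
        struct rq_sketch rq = { 0, 0 };
        assert(!rq_clock_ok(&rq));        /* read before update: warn */
        update_rq_clock_sketch(&rq);
        assert(rq_clock_ok(&rq));         /* updated: fine */
        return 0;
    }
    ```

    Encoding all three states in one small bitfield keeps the check cheap
    enough for the __schedule() fast path, matching the two-instruction
    object-code delta shown in the commit.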
     
  • Address this rq-clock update bug:

    WARNING: CPU: 30 PID: 195 at ../kernel/sched/sched.h:797 set_next_entity()
    rq->clock_update_flags < RQCF_ACT_SKIP

    Call Trace:
    dump_stack()
    __warn()
    warn_slowpath_fmt()
    set_next_entity()
    ? _raw_spin_lock()
    set_curr_task_fair()
    set_user_nice.part.85()
    set_user_nice()
    create_worker()
    worker_thread()
    kthread()
    ret_from_fork()

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    Instead of adding the update_rq_clock() call all the way at the bottom of
    the callstack, add one at the top; this aids the later effort to
    minimize update_rq_clock() calls.

    WARNING: CPU: 0 PID: 1 at ../kernel/sched/sched.h:797 detach_task_cfs_rq()
    rq->clock_update_flags < RQCF_ACT_SKIP

    Call Trace:
    dump_stack()
    __warn()
    warn_slowpath_fmt()
    detach_task_cfs_rq()
    switched_from_fair()
    __sched_setscheduler()
    _sched_setscheduler()
    sched_set_stop_task()
    cpu_stop_create()
    __smpboot_create_thread.part.2()
    smpboot_register_percpu_thread_cpumask()
    cpu_stop_init()
    do_one_initcall()
    ? print_cpu_info()
    kernel_init_freeable()
    ? rest_init()
    kernel_init()
    ret_from_fork()

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Address this rq-clock update bug:

    WARNING: CPU: 0 PID: 0 at ../kernel/sched/sched.h:797 post_init_entity_util_avg()
    rq->clock_update_flags < RQCF_ACT_SKIP

    Call Trace:
    __warn()
    post_init_entity_util_avg()
    wake_up_new_task()
    _do_fork()
    kernel_thread()
    rest_init()
    start_kernel()

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • rq_clock() is called from sched_info_{depart,arrive}() after resetting
    RQCF_ACT_SKIP but prior to a call to update_rq_clock().

    In preparation for pending patches that check whether the rq clock has
    been updated inside of a pin context before rq_clock() is called, move
    the reset of rq->clock_skip_update immediately before unpinning the rq
    lock.

    This will avoid the new warnings which check if update_rq_clock() is
    being actively skipped.

    Signed-off-by: Matt Fleming
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Byungchul Park
    Cc: Frederic Weisbecker
    Cc: Jan Kara
    Cc: Linus Torvalds
    Cc: Luca Abeni
    Cc: Mel Gorman
    Cc: Mike Galbraith
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Petr Mladek
    Cc: Rik van Riel
    Cc: Sergey Senozhatsky
    Cc: Thomas Gleixner
    Cc: Wanpeng Li
    Cc: Yuyang Du
    Link: http://lkml.kernel.org/r/20160921133813.31976-6-matt@codeblueprint.co.uk
    Signed-off-by: Ingo Molnar

    Matt Fleming
     
  • In preparation for adding diagnostic checks to catch missing calls to
    update_rq_clock(), provide wrappers for (re)pinning and unpinning
    rq->lock.

    Because the pending diagnostic checks allow state to be maintained in
    rq_flags across pin contexts, swap the 'struct pin_cookie' arguments
    for 'struct rq_flags *'.

    Signed-off-by: Matt Fleming
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Byungchul Park
    Cc: Frederic Weisbecker
    Cc: Jan Kara
    Cc: Linus Torvalds
    Cc: Luca Abeni
    Cc: Mel Gorman
    Cc: Mike Galbraith
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Petr Mladek
    Cc: Rik van Riel
    Cc: Sergey Senozhatsky
    Cc: Thomas Gleixner
    Cc: Wanpeng Li
    Cc: Yuyang Du
    Link: http://lkml.kernel.org/r/20160921133813.31976-5-matt@codeblueprint.co.uk
    Signed-off-by: Ingo Molnar

    Matt Fleming
     

26 Dec, 2016

1 commit

  • ktime_set(S,N) was required for the timespec storage type and is still
    useful for situations where a Seconds and Nanoseconds part of a time value
    needs to be converted. For anything where the Seconds argument is 0, this
    is pointless and can be replaced with a simple assignment.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra

    Thomas Gleixner
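    The simplification above rests on ktime_t being a plain 64-bit
    nanosecond count, so ktime_set(0, N) is just N. A minimal sketch, with
    the helper's arithmetic modelled by an illustrative function:

    ```c
    #include <assert.h>

    typedef long long ktime_sketch_t;
    #define NSEC_PER_SEC 1000000000LL

    /* Models ktime_set(S, N): combine seconds and nanoseconds. */
    static ktime_sketch_t ktime_set_sketch(long long secs, unsigned long nsecs)
    {
        return secs * NSEC_PER_SEC + nsecs;
    }

    int main(void)
    {
        ktime_sketch_t a = ktime_set_sketch(0, 500);  /* old style */
        ktime_sketch_t b = 500;                       /* plain assignment */
        assert(a == b);
        assert(ktime_set_sketch(2, 1) == 2 * NSEC_PER_SEC + 1);
        return 0;
    }
    ```

    With a zero seconds argument the multiply contributes nothing, which is
    why the direct assignment is an exact replacement.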
     

14 Dec, 2016

1 commit

  • Pull power management updates from Rafael Wysocki:
    "Again, cpufreq gets more changes than the other parts this time (one
    new driver, one old driver less, a bunch of enhancements of the
    existing code, new CPU IDs, fixes, cleanups)

    There also are some changes in cpuidle (idle injection rework, a
    couple of new CPU IDs, online/offline rework in intel_idle, fixes and
    cleanups), in the generic power domains framework (mostly related to
    supporting power domains containing CPUs), and in the Operating
    Performance Points (OPP) library (mostly related to supporting devices
    with multiple voltage regulators)

    In addition to that, the system sleep state selection interface is
    modified to make it easier for distributions with unchanged user space
    to support suspend-to-idle as the default system suspend method, some
    issues are fixed in the PM core, the latency tolerance PM QoS
    framework is improved a bit, the Intel RAPL power capping driver is
    cleaned up and there are some fixes and cleanups in the devfreq
    subsystem

    Specifics:

    - New cpufreq driver for Broadcom STB SoCs and a Device Tree binding
    for it (Markus Mayer)

    - Support for ARM Integrator/AP and Integrator/CP in the generic DT
    cpufreq driver and elimination of the old Integrator cpufreq driver
    (Linus Walleij)

    - Support for the zx296718, r8a7743 and r8a7745, Socionext UniPhier,
    and PXA SoCs in the generic DT cpufreq driver (Baoyou Xie,
    Geert Uytterhoeven, Masahiro Yamada, Robert Jarzmik)

    - cpufreq core fix to eliminate races that may lead to using inactive
    policy objects and related cleanups (Rafael Wysocki)

    - cpufreq schedutil governor update to make it use SCHED_FIFO kernel
    threads (instead of regular workqueues) for doing delayed work (to
    reduce the response latency in some cases) and related cleanups
    (Viresh Kumar)

    - New cpufreq sysfs attribute for resetting statistics (Markus Mayer)

    - cpufreq governors fixes and cleanups (Chen Yu, Stratos Karafotis,
    Viresh Kumar)

    - Support for using generic cpufreq governors in the intel_pstate
    driver (Rafael Wysocki)

    - Support for per-logical-CPU P-state limits and the EPP/EPB (Energy
    Performance Preference/Energy Performance Bias) knobs in the
    intel_pstate driver (Srinivas Pandruvada)

    - New CPU ID for Knights Mill in intel_pstate (Piotr Luc)

    - intel_pstate driver modification to use the P-state selection
    algorithm based on CPU load on platforms with the system profile in
    the ACPI tables set to "mobile" (Srinivas Pandruvada)

    - intel_pstate driver cleanups (Arnd Bergmann, Rafael Wysocki,
    Srinivas Pandruvada)

    - cpufreq powernv driver updates including fast switching support
    (for the schedutil governor), fixes and cleanus (Akshay Adiga,
    Andrew Donnellan, Denis Kirjanov)

    - acpi-cpufreq driver rework to switch it over to the new CPU
    offline/online state machine (Sebastian Andrzej Siewior)

    - Assorted cleanups in cpufreq drivers (Wei Yongjun, Prashanth
    Prakash)

    - Idle injection rework (to make it use the regular idle path instead
    of a home-grown custom one) and related powerclamp thermal driver
    updates (Peter Zijlstra, Jacob Pan, Petr Mladek, Sebastian Andrzej
    Siewior)

    - New CPU IDs for Atom Z34xx and Knights Mill in intel_idle (Andy
    Shevchenko, Piotr Luc)

    - intel_idle driver cleanups and switch over to using the new CPU
    offline/online state machine (Anna-Maria Gleixner, Sebastian
    Andrzej Siewior)

    - cpuidle DT driver update to support suspend-to-idle properly
    (Sudeep Holla)

    - cpuidle core cleanups and misc updates (Daniel Lezcano, Pan Bian,
    Rafael Wysocki)

    - Preliminary support for power domains including CPUs in the generic
    power domains (genpd) framework and related DT bindings (Lina Iyer)

    - Assorted fixes and cleanups in the generic power domains (genpd)
    framework (Colin Ian King, Dan Carpenter, Geert Uytterhoeven)

    - Preliminary support for devices with multiple voltage regulators
    and related fixes and cleanups in the Operating Performance Points
    (OPP) library (Viresh Kumar, Masahiro Yamada, Stephen Boyd)

    - System sleep state selection interface rework to make it easier to
    support suspend-to-idle as the default system suspend method
    (Rafael Wysocki)

    - PM core fixes and cleanups, mostly related to the interactions
    between the system suspend and runtime PM frameworks (Ulf Hansson,
    Sahitya Tummala, Tony Lindgren)

    - Latency tolerance PM QoS framework improvements (Andrew Lutomirski)

    - New Knights Mill CPU ID for the Intel RAPL power capping driver
    (Piotr Luc)

    - Intel RAPL power capping driver fixes, cleanups and switch over to
    using the new CPU offline/online state machine (Jacob Pan, Thomas
    Gleixner, Sebastian Andrzej Siewior)

    - Fixes and cleanups in the exynos-ppmu, exynos-nocp, rk3399_dmc,
    rockchip-dfi devfreq drivers and the devfreq core (Axel Lin,
    Chanwoo Choi, Javier Martinez Canillas, MyungJoo Ham, Viresh Kumar)

    - Fix for false-positive KASAN warnings during resume from ACPI S3
    (suspend-to-RAM) on x86 (Josh Poimboeuf)

    - Memory map verification during resume from hibernation on x86 to
    ensure a consistent address space layout (Chen Yu)

    - Wakeup sources debugging enhancement (Xing Wei)

    - rockchip-io AVS driver cleanup (Shawn Lin)"

    * tag 'pm-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (127 commits)
    devfreq: rk3399_dmc: Don't use OPP structures outside of RCU locks
    devfreq: rk3399_dmc: Remove dangling rcu_read_unlock()
    devfreq: exynos: Don't use OPP structures outside of RCU locks
    Documentation: intel_pstate: Document HWP energy/performance hints
    cpufreq: intel_pstate: Support for energy performance hints with HWP
    cpufreq: intel_pstate: Add locking around HWP requests
    PM / sleep: Print active wakeup sources when blocking on wakeup_count reads
    PM / core: Fix bug in the error handling of async suspend
    PM / wakeirq: Fix dedicated wakeirq for drivers not using autosuspend
    PM / Domains: Fix compatible for domain idle state
    PM / OPP: Don't WARN on multiple calls to dev_pm_opp_set_regulators()
    PM / OPP: Allow platform specific custom set_opp() callbacks
    PM / OPP: Separate out _generic_set_opp()
    PM / OPP: Add infrastructure to manage multiple regulators
    PM / OPP: Pass struct dev_pm_opp_supply to _set_opp_voltage()
    PM / OPP: Manage supply's voltage/current in a separate structure
    PM / OPP: Don't use OPP structure outside of rcu protected section
    PM / OPP: Reword binding supporting multiple regulators per device
    PM / OPP: Fix incorrect cpu-supply property in binding
    cpuidle: Add a kerneldoc comment to cpuidle_use_deepest_state()
    ...

    Linus Torvalds
     

13 Dec, 2016

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The main scheduler changes in this cycle were:

    - support Intel Turbo Boost Max Technology 3.0 (TBM3) by introducing a
    notion of 'better cores', which the scheduler will prefer to
    schedule single threaded workloads on. (Tim Chen, Srinivas
    Pandruvada)

    - enhance the handling of asymmetric capacity CPUs further (Morten
    Rasmussen)

    - improve/fix load handling when moving tasks between task groups
    (Vincent Guittot)

    - simplify and clean up the cputime code (Stanislaw Gruszka)

    - improve mass fork()ed task spread a.k.a. hackbench speedup (Vincent
    Guittot)

    - make struct kthread kmalloc()ed and related fixes (Oleg Nesterov)

    - add uaccess atomicity debugging (when using access_ok() in the
    wrong context), under CONFIG_DEBUG_ATOMIC_SLEEP=y (Peter Zijlstra)

    - implement various fixes, cleanups and other enhancements (Daniel
    Bristot de Oliveira, Martin Schwidefsky, Rafael J. Wysocki)"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
    sched/core: Use load_avg for selecting idlest group
    sched/core: Fix find_idlest_group() for fork
    kthread: Don't abuse kthread_create_on_cpu() in __kthread_create_worker()
    kthread: Don't use to_live_kthread() in kthread_[un]park()
    kthread: Don't use to_live_kthread() in kthread_stop()
    Revert "kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function"
    kthread: Make struct kthread kmalloc'ed
    x86/uaccess, sched/preempt: Verify access_ok() context
    sched/x86: Make CONFIG_SCHED_MC_PRIO=y easier to enable
    sched/x86: Change CONFIG_SCHED_ITMT to CONFIG_SCHED_MC_PRIO
    x86/sched: Use #include instead of #include
    cpufreq/intel_pstate: Use CPPC to get max performance
    acpi/bus: Set _OSC for diverse core support
    acpi/bus: Enable HWP CPPC objects
    x86/sched: Add SD_ASYM_PACKING flags to x86 ITMT CPU
    x86/sysctl: Add sysctl for ITMT scheduling feature
    x86: Enable Intel Turbo Boost Max Technology 3.0
    x86/topology: Define x86's arch_update_cpu_topology
    sched: Extend scheduler's asym packing
    sched/fair: Clean up the tunable parameter definitions
    ...

    Linus Torvalds