24 Apr, 2011

1 commit

  • Neil Brown pointed out that lock_depth somehow escaped the BKL
    removal work. Let's get rid of it now.

    Note that the perf scripting utilities still have a bunch of
    code for dealing with common_lock_depth in tracepoints; I have
    left that in place in case anybody wants to use that code with
    older kernels.

    Suggested-by: Neil Brown
    Signed-off-by: Jonathan Corbet
    Cc: Arnd Bergmann
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20110422111910.456c0e84@bike.lwn.net
    Signed-off-by: Ingo Molnar

    Jonathan Corbet
     

14 Apr, 2011

1 commit

  • Provide a generic p->on_rq because the p->se.on_rq semantics are
    unfavourable for lockless wakeups but needed for sched_fair.

    In particular, p->on_rq is only cleared when we actually dequeue the
    task in schedule() and not on any random dequeue as done by things
    like __migrate_task() and __sched_setscheduler().

    This also allows us to remove p->se usage from !sched_fair code.

    Reviewed-by: Frank Rowand
    Cc: Mike Galbraith
    Cc: Nick Piggin
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110405152728.949545047@chello.nl

    Peter Zijlstra
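
    A toy user-space model of the distinction described in the entry above.
    The types and helpers here are made up for illustration and are not the
    kernel's code; the point is only that the generic on_rq flag survives
    "random" dequeues such as a migration and is cleared only by the real
    dequeue on the schedule() path.

    /* Toy model only: hypothetical simplified types, not kernel code. */
    #include <stdio.h>

    struct sched_entity { int on_rq; };                 /* class-specific flag */
    struct task_struct  { int on_rq; struct sched_entity se; };

    static void enqueue_task(struct task_struct *p) { p->se.on_rq = 1; p->on_rq = 1; }
    static void dequeue_task(struct task_struct *p) { p->se.on_rq = 0; }

    /* a "random" dequeue, e.g. during migration: p->on_rq stays set */
    static void migrate_task(struct task_struct *p)
    {
        dequeue_task(p);
        enqueue_task(p);
    }

    /* the real dequeue in schedule(): the only place the generic flag is cleared */
    static void deactivate_in_schedule(struct task_struct *p)
    {
        dequeue_task(p);
        p->on_rq = 0;
    }

    int main(void)
    {
        struct task_struct p = { 0, { 0 } };

        enqueue_task(&p);
        migrate_task(&p);
        printf("after migrate:          p.on_rq = %d\n", p.on_rq);   /* still 1 */
        deactivate_in_schedule(&p);
        printf("after schedule dequeue: p.on_rq = %d\n", p.on_rq);   /* now 0 */
        return 0;
    }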
     

03 Feb, 2011

1 commit

  • Use the buddy mechanism to implement yield_task_fair. This
    allows us to skip onto the next highest priority se at every
    level in the CFS tree, unless doing so would introduce gross
    unfairness in CPU time distribution.

    We order the buddy selection in pick_next_entity to check
    yield first, then last, then next. We need next to be able
    to override yield, because it is possible for the "next" and
    "yield" task to be different processes in the same sub-tree
    of the CFS tree. When they are, we need to go into that
    sub-tree regardless of the "yield" hint, and pick the correct
    entity once we get to the right level. (An ordering sketch
    follows this entry.)

    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Rik van Riel
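
    A compile-only sketch of the check ordering described above; the helpers
    are hypothetical stand-ins, not the kernel's pick_next_entity() internals.
    The point is simply that yield is considered first, then last, then next,
    so the next buddy can still override the yield hint.

    /* Sketch of the check ordering only; helpers are hypothetical stand-ins. */
    struct sched_entity;

    extern struct sched_entity *leftmost(void);         /* best entity in the tree */
    extern struct sched_entity *second_leftmost(void);  /* next-best entity        */
    extern struct sched_entity *yield_hint, *last_buddy, *next_buddy;
    extern int buddy_is_fair(struct sched_entity *se);  /* gross-unfairness guard  */

    struct sched_entity *pick_next_entity_sketch(void)
    {
        struct sched_entity *se = leftmost();

        /* 1. yield: skip the entity that asked to yield, if fairness allows */
        if (se == yield_hint && buddy_is_fair(second_leftmost()))
            se = second_leftmost();

        /* 2. last: prefer the cache-hot previous entity */
        if (last_buddy && buddy_is_fair(last_buddy))
            se = last_buddy;

        /* 3. next: checked last, so it can override the yield hint */
        if (next_buddy && buddy_is_fair(next_buddy))
            se = next_buddy;

        return se;
    }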
     

18 Jan, 2011

3 commits


30 Nov, 2010

1 commit

  • A recurring complaint from CFS users is that parallel kbuild has
    a negative impact on desktop interactivity. This patch
    implements an idea from Linus, to automatically create task
    groups. Currently, only per session autogroups are implemented,
    but the patch leaves the way open for enhancement.

    Implementation: each task's signal struct contains an inherited
    pointer to a refcounted autogroup struct containing a task group
    pointer, the default for all tasks pointing to the
    init_task_group. When a task calls setsid(), a new task group
    is created, the process is moved into the new task group, and a
    reference to the previous task group is dropped. Child
    processes inherit this task group thereafter, and increase its
    refcount. When the last thread of a process exits, the
    process's reference is dropped, such that when the last process
    referencing an autogroup exits, the autogroup is destroyed.

    At runqueue selection time, IFF a task has no cgroup assignment,
    its current autogroup is used.

    Autogroup bandwidth is controllable by setting its nice level
    through the proc filesystem:

    cat /proc/<pid>/autogroup

    Displays the task's group and the group's nice level.

    echo <nice level> > /proc/<pid>/autogroup

    Sets the task group's shares to the weight of a nice <level> task.
    Setting the nice level is rate limited for !admin users due to the
    abuse risk of task group locking. (A small user-space example of this
    interface follows this entry.)

    The feature is enabled from boot by default if
    CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via
    the boot option noautogroup, and can also be turned on/off on
    the fly via:

    echo [01] > /proc/sys/kernel/sched_autogroup_enabled

    ... which will automatically move tasks to/from the root task group.

    Signed-off-by: Mike Galbraith
    Acked-by: Linus Torvalds
    Acked-by: Peter Zijlstra
    Cc: Markus Trippelsdorf
    Cc: Mathieu Desnoyers
    Cc: Paul Turner
    Cc: Oleg Nesterov
    [ Removed the task_group_path() debug code, and fixed !EVENTFD build failure. ]
    Signed-off-by: Ingo Molnar
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
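
    A small user-space example of the /proc interface described above, with
    error handling kept minimal: it prints a task's autogroup line and, if a
    second argument is given, writes a new nice level for the group. The
    output format in the comment is an example, not a guarantee.

    /* Example user of the /proc/<pid>/autogroup interface described above. */
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        char path[64], line[128];
        FILE *f;

        if (argc < 2) {
            fprintf(stderr, "usage: %s <pid> [nice]\n", argv[0]);
            return 1;
        }
        snprintf(path, sizeof(path), "/proc/%s/autogroup", argv[1]);

        f = fopen(path, "r");
        if (!f) { perror(path); return 1; }
        if (fgets(line, sizeof(line), f))            /* e.g. "/autogroup-123 nice 0" */
            printf("%s", line);
        fclose(f);

        if (argc > 2) {                              /* echo <nice> > /proc/<pid>/autogroup */
            f = fopen(path, "w");
            if (!f) { perror(path); return 1; }
            fprintf(f, "%s\n", argv[2]);
            fclose(f);
        }
        return 0;
    }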
     

23 Nov, 2010

1 commit


18 Nov, 2010

1 commit

  • By tracking a per-cpu load-avg for each cfs_rq and folding it into a
    global task_group load on each tick we can rework tg_shares_up to be
    strictly per-cpu.

    This should improve cpu-cgroup performance for smp systems
    significantly.

    [ Paul: changed to use queueing cfs_rq + bug fixes ]

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
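
    A toy model of the folding described above, using made-up names and plain
    arrays rather than the kernel's data structures: on each tick a cpu folds
    only its own load delta into the group-wide sum, and a cpu's share of the
    group weight is then derived from its local load's fraction of that sum.

    /* Toy model only: illustrative names, not kernel code. */
    #include <stdio.h>

    #define NR_CPUS 4

    static long cfs_rq_load_avg[NR_CPUS];   /* per-cpu load average of the group */
    static long cfs_rq_contrib[NR_CPUS];    /* what each cpu last folded in      */
    static long tg_load_weight;             /* global task-group load            */
    static long tg_shares = 1024;           /* group's configured shares         */

    /* Tick on one cpu: fold the local delta into the global sum (per-cpu work). */
    static void update_tg_load(int cpu)
    {
        long delta = cfs_rq_load_avg[cpu] - cfs_rq_contrib[cpu];

        tg_load_weight += delta;
        cfs_rq_contrib[cpu] = cfs_rq_load_avg[cpu];
    }

    /* Per-cpu share of the group's weight, proportional to local load. */
    static long cpu_shares(int cpu)
    {
        if (tg_load_weight == 0)
            return tg_shares;
        return tg_shares * cfs_rq_load_avg[cpu] / tg_load_weight;
    }

    int main(void)
    {
        cfs_rq_load_avg[0] = 300;
        cfs_rq_load_avg[1] = 100;
        update_tg_load(0);
        update_tg_load(1);
        printf("cpu0 shares=%ld cpu1 shares=%ld\n", cpu_shares(0), cpu_shares(1));
        return 0;
    }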
     

22 Jul, 2010

1 commit


28 May, 2010

1 commit

  • Trivial: use the get_nr_threads() helper to read signal->count, which we
    are going to change.

    Like other callers, proc_sched_show_task() doesn't need the exactly
    precise nr_threads.

    David said:

    : Note that get_nr_threads() isn't completely equivalent (it can return 0
    : where proc_sched_show_task() will display a 1). But I don't think this
    : should be a problem.

    Signed-off-by: Oleg Nesterov
    Acked-by: David Howells
    Cc: Peter Zijlstra
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

18 May, 2010

1 commit

  • …/git/tip/linux-2.6-tip

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (49 commits)
    stop_machine: Move local variable closer to the usage site in cpu_stop_cpu_callback()
    sched, wait: Use wrapper functions
    sched: Remove a stale comment
    ondemand: Make the iowait-is-busy time a sysfs tunable
    ondemand: Solve a big performance issue by counting IOWAIT time as busy
    sched: Intoduce get_cpu_iowait_time_us()
    sched: Eliminate the ts->idle_lastupdate field
    sched: Fold updating of the last_update_time_info into update_ts_time_stats()
    sched: Update the idle statistics in get_cpu_idle_time_us()
    sched: Introduce a function to update the idle statistics
    sched: Add a comment to get_cpu_idle_time_us()
    cpu_stop: add dummy implementation for UP
    sched: Remove rq argument to the tracepoints
    rcu: need barrier() in UP synchronize_sched_expedited()
    sched: correctly place paranioa memory barriers in synchronize_sched_expedited()
    sched: kill paranoia check in synchronize_sched_expedited()
    sched: replace migration_thread with cpu_stop
    stop_machine: reimplement using cpu_stop
    cpu_stop: implement stop_cpu[s]()
    sched: Fix select_idle_sibling() logic in select_task_rq_fair()
    ...

    Linus Torvalds
     

05 May, 2010

1 commit

  • With CONFIG_PROVE_RCU=y, a warning can be triggered:

    $ cat /proc/sched_debug

    ...
    kernel/cgroup.c:1649 invoked rcu_dereference_check() without protection!
    ...

    Both cgroup_path() and task_group() should be called with either
    rcu_read_lock or cgroup_mutex held.

    The rcu_dereference_check() does include cgroup_lock_is_held(), so we
    know that this lock is not held. Therefore, in a CONFIG_PREEMPT kernel,
    to say nothing of a CONFIG_PREEMPT_RT kernel, the original code could
    have ended up copying a string out of the freelist.

    This patch inserts RCU read-side primitives needed to prevent this
    scenario.

    Signed-off-by: Li Zefan
    Signed-off-by: Paul E. McKenney

    Li Zefan
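
    A self-contained sketch of the pattern the fix applies. The RCU primitives
    and cgroup helpers below are no-op stand-ins so the fragment compiles on
    its own; in the kernel they are the real rcu_read_lock()/rcu_read_unlock()
    and the cgroup_path()/task_group() lookups.

    /* Pattern sketch; the helpers are stand-ins, not the kernel's functions. */
    #include <stdio.h>

    static void rcu_read_lock(void)   { }            /* stand-in */
    static void rcu_read_unlock(void) { }            /* stand-in */

    struct task_group { const char *path; };

    static struct task_group *task_group_of(const void *task)        /* stand-in */
    {
        static struct task_group tg = { "/example" };

        (void)task;
        return &tg;
    }

    static void cgroup_path_of(const struct task_group *tg, char *buf, int len)
    {
        snprintf(buf, len, "%s", tg->path);                           /* stand-in */
    }

    static void show_group_path(const void *task)
    {
        char buf[64];

        rcu_read_lock();        /* hold RCU across both lookups, as the fix does */
        cgroup_path_of(task_group_of(task), buf, sizeof(buf));
        rcu_read_unlock();

        printf("%s\n", buf);
    }

    int main(void) { show_group_path(NULL); return 0; }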
     

15 Apr, 2010

1 commit


03 Apr, 2010

2 commits

  • This is left over from commit 7c9414385e ("sched: Remove USER_SCHED")

    Signed-off-by: Li Zefan
    Acked-by: Dhaval Giani
    Signed-off-by: Peter Zijlstra
    Cc: David Howells
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Li Zefan
     
  • Latencytop clearing sum_exec_runtime via proc_sched_set_task() breaks
    task_times(). Other places in the kernel use nvcsw and nivcsw, which are
    being cleared as well. Clear task statistics only.

    Reported-by: Török Edwin
    Signed-off-by: Mike Galbraith
    Cc: Hidetoshi Seto
    Cc: Arjan van de Ven
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     

12 Mar, 2010

2 commits

  • Both avg_overlap and avg_wakeup had an inherent problem in that their accuracy
    was detrimentally affected by cross-cpu wakeups, this because we are missing
    the necessary call to update_curr(). This can't be fixed without increasing
    overhead in our already too fat fastpath.

    Additionally, with recent load balancing changes making us prefer to place tasks
    in an idle cache domain (which is good for compute bound loads), communicating
    tasks suffer when a sync wakeup, which would enable affine placement, is turned
    into a non-sync wakeup by SYNC_LESS. With one task on the runqueue, wake_affine()
    rejects the affine wakeup request, leaving the unfortunate task where it was
    placed, taking frequent cache misses.

    Remove it, and recover some fastpath cycles.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Testing the load which led to this heuristic (nfs4 kbuild) shows that it has
    outlived its usefulness. With intervening load balancing changes, I cannot
    see any difference with/without, so recover those fastpath cycles.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     

11 Mar, 2010

1 commit

  • Put all statistic fields of sched_entity in one struct, sched_statistics,
    and embed it into sched_entity.

    This change allows us to memset the sched_statistics to 0 when needed (for
    instance when forking), avoiding bugs from non-initialized fields. (A
    minimal layout sketch follows this entry.)

    Signed-off-by: Lucas De Marchi
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Lucas De Marchi
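
    A minimal sketch of the layout change with hypothetical field names, not
    the kernel's full structures: once all statistics live in one embedded
    struct, fork can reset them with a single memset.

    /* Sketch only: hypothetical field names, not the kernel's structures. */
    #include <string.h>
    #include <stdio.h>

    struct sched_statistics {
        unsigned long long wait_max;
        unsigned long long sleep_max;
        unsigned long long block_max;
        /* ... all other per-entity statistics live here ... */
    };

    struct sched_entity {
        unsigned long long vruntime;           /* scheduling state stays untouched */
        struct sched_statistics statistics;    /* embedded stats block             */
    };

    static void fork_init_stats(struct sched_entity *se)
    {
        /* one memset instead of zeroing each field, so nothing is forgotten */
        memset(&se->statistics, 0, sizeof(se->statistics));
    }

    int main(void)
    {
        struct sched_entity se = { .vruntime = 42, .statistics = { .wait_max = 7 } };

        fork_init_stats(&se);
        printf("vruntime=%llu wait_max=%llu\n", se.vruntime, se.statistics.wait_max);
        return 0;
    }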
     

15 Dec, 2009

1 commit


11 Dec, 2009

1 commit

  • This build warning:

    kernel/sched.c: In function 'set_task_cpu':
    kernel/sched.c:2070: warning: unused variable 'old_rq'

    Made me realize that the forced2_migrations stat looks pretty
    pointless (and a misnomer) - remove it.

    Cc: Peter Zijlstra
    Cc: Mike Galbraith
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

09 Dec, 2009

2 commits

  • As scaling now takes place on all kinds of cpu add/remove events, a user
    who configures values via proc should be able to choose whether his set
    values are still rescaled or kept, whatever happens.

    As the comments state that log2 was just a second guess that worked, the
    interface is not designed as just on/off, but to choose a scaling type.
    Currently this allows none, log and linear, but more importantly it allows
    us to keep the interface even if someone has an even better idea of how to
    scale the values. (An illustration of the three styles follows this entry.)

    Signed-off-by: Christian Ehrhardt
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Christian Ehrhardt
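
    An illustration of the three scaling styles named above, applied to a
    made-up base tunable. The factors (1, 1 + log2(ncpus), ncpus) are one
    plausible reading of "none, log and linear", not a copy of the kernel's
    code.

    /* Illustration only: made-up tunable, plausible scaling factors. */
    #include <stdio.h>
    #include <math.h>

    enum scaling { SCALE_NONE, SCALE_LOG, SCALE_LINEAR };

    static unsigned int scale_factor(enum scaling type, unsigned int ncpus)
    {
        switch (type) {
        case SCALE_NONE:   return 1;
        case SCALE_LOG:    return 1 + (unsigned int)log2(ncpus);
        case SCALE_LINEAR: return ncpus;
        }
        return 1;
    }

    int main(void)
    {
        unsigned int base_latency_us = 6000;   /* made-up base value */
        unsigned int ncpus = 8;

        printf("none:   %u\n", base_latency_us * scale_factor(SCALE_NONE, ncpus));
        printf("log:    %u\n", base_latency_us * scale_factor(SCALE_LOG, ncpus));
        printf("linear: %u\n", base_latency_us * scale_factor(SCALE_LINEAR, ncpus));
        return 0;
    }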
     
  • WAKEUP_RUNNING was an experiment, not sure why that ever ended up being
    merged...

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

05 Nov, 2009

1 commit


17 Sep, 2009

1 commit

  • Create a new wakeup preemption mode: preempt towards tasks that run
    shorter on average. It sets the next buddy to be sure we actually run the
    task we preempted for.

    Test results:

    root@twins:~# while :; do :; done &
    [1] 6537
    root@twins:~# while :; do :; done &
    [2] 6538
    root@twins:~# while :; do :; done &
    [3] 6539
    root@twins:~# while :; do :; done &
    [4] 6540

    root@twins:/home/peter# ./latt -c4 sleep 4
    Entries: 48 (clients=4)

    Averages:
    ------------------------------
    Max 4750 usec
    Avg 497 usec
    Stdev 737 usec

    root@twins:/home/peter# echo WAKEUP_RUNNING > /debug/sched_features

    root@twins:/home/peter# ./latt -c4 sleep 4
    Entries: 48 (clients=4)

    Averages:
    ------------------------------
    Max 14 usec
    Avg 5 usec
    Stdev 3 usec

    Disabled by default - needs more testing.

    Signed-off-by: Peter Zijlstra
    Acked-by: Mike Galbraith
    Signed-off-by: Ingo Molnar
    LKML-Reference:

    Peter Zijlstra
     

02 Sep, 2009

1 commit

  • For counting how long an application has been waiting for
    (disk) IO, there currently is only the HZ sample driven
    information available, while for all other counters in this
    class, a high resolution version is available via
    CONFIG_SCHEDSTATS.

    In order to make an improved bootchart tool possible, we also
    need a higher resolution version of the iowait time.

    The patch below adds this scheduler statistic to the kernel. (A toy
    model follows this entry.)

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Arjan van de Ven
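
    A toy model of keeping such a high-resolution counter, with illustrative
    names and a user-space monotonic clock rather than the kernel's
    implementation: record when the iowait sleep begins and accumulate the
    nanosecond delta at wakeup.

    /* Toy model only: illustrative names, user-space clock. */
    #include <stdio.h>
    #include <time.h>

    struct iowait_stats {
        unsigned long long iowait_count;    /* number of iowait sleeps    */
        unsigned long long iowait_sum_ns;   /* total time spent in iowait */
        unsigned long long sleep_start_ns;  /* set when the sleep begins  */
    };

    static unsigned long long now_ns(void)
    {
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (unsigned long long)ts.tv_sec * 1000000000ull + ts.tv_nsec;
    }

    static void iowait_sleep(struct iowait_stats *s)
    {
        s->sleep_start_ns = now_ns();
    }

    static void iowait_wakeup(struct iowait_stats *s)
    {
        s->iowait_count++;
        s->iowait_sum_ns += now_ns() - s->sleep_start_ns;
    }

    int main(void)
    {
        struct iowait_stats s = { 0, 0, 0 };

        iowait_sleep(&s);
        /* ... the "I/O" completes ... */
        iowait_wakeup(&s);
        printf("iowait: %llu sleeps, %llu ns total\n", s.iowait_count, s.iowait_sum_ns);
        return 0;
    }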
     

18 Jun, 2009

1 commit


25 Mar, 2009

1 commit

  • Impact: cleanup, new schedstat ABI

    Since they are used on in statistics and are always set to zero, the
    following fields from struct rq have been removed: yld_exp_empty,
    yld_act_empty and yld_both_empty.

    Both Sched Debug and SCHEDSTAT_VERSION versions has also been
    incremented since ABIs have been changed.

    The schedtop tool has been updated to properly handle new version of
    schedstat:

    http://rt.wiki.kernel.org/index.php/Schedtop_utility

    Signed-off-by: Luis Henriques
    Acked-by: Gregory Haskins
    Acked-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Luis Henriques
     

18 Mar, 2009

1 commit


15 Jan, 2009

1 commit

  • Introduce a new avg_wakeup statistic.

    avg_wakeup is a measure of how frequently a task wakes up other tasks; it
    represents the average time between wakeups, with a limit of avg_runtime
    for when it doesn't wake up anybody. (A toy model follows this entry.)

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mike Galbraith
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
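
    A toy model of the statistic as described above; the field names and the
    simple halving average are illustrative, not the kernel's exact update
    rule.

    /* Toy model only: illustrative names and averaging, not kernel code. */
    #include <stdio.h>

    struct wakeup_stats {
        double last_wakeup;   /* time this task last woke another task */
        double avg_wakeup;    /* running average of inter-wakeup time   */
        double avg_runtime;   /* cap used when the task wakes nobody    */
    };

    static double running_avg(double avg, double sample)
    {
        return avg ? (avg + sample) / 2.0 : sample;   /* simple 1/2 decay */
    }

    static void task_woke_someone(struct wakeup_stats *s, double now)
    {
        double interval = now - s->last_wakeup;

        if (interval > s->avg_runtime)                /* limit by avg_runtime */
            interval = s->avg_runtime;
        s->avg_wakeup = running_avg(s->avg_wakeup, interval);
        s->last_wakeup = now;
    }

    int main(void)
    {
        struct wakeup_stats s = { 0.0, 0.0, 10.0 };

        task_woke_someone(&s, 1.0);
        task_woke_someone(&s, 3.0);
        printf("avg_wakeup=%.2f\n", s.avg_wakeup);
        return 0;
    }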
     

11 Jan, 2009

1 commit


02 Dec, 2008

1 commit


19 Nov, 2008

1 commit


16 Nov, 2008

1 commit

  • Luis Henriques reported that with CONFIG_PREEMPT=y + CONFIG_PREEMPT_DEBUG=y +
    CONFIG_SCHED_DEBUG=y + CONFIG_LATENCYTOP=y enabled, the following warning
    triggers when using latencytop:

    > [ 775.663239] BUG: using smp_processor_id() in preemptible [00000000] code: latencytop/6585
    > [ 775.663303] caller is native_sched_clock+0x3a/0x80
    > [ 775.663314] Pid: 6585, comm: latencytop Tainted: G W 2.6.28-rc4-00355-g9c7c354 #1
    > [ 775.663322] Call Trace:
    > [ 775.663343] [] debug_smp_processor_id+0xe4/0xf0
    > [ 775.663356] [] native_sched_clock+0x3a/0x80
    > [ 775.663368] [] sched_clock+0x9/0x10
    > [ 775.663381] [] proc_sched_show_task+0x8bd/0x10e0
    > [ 775.663395] [] sched_show+0x3e/0x80
    > [ 775.663408] [] seq_read+0xdb/0x350
    > [ 775.663421] [] ? security_file_permission+0x16/0x20
    > [ 775.663435] [] vfs_read+0xc8/0x170
    > [ 775.663447] [] sys_read+0x55/0x90
    > [ 775.663460] [] system_call_fastpath+0x16/0x1b
    > ...

    This breakage was caused by me via:

    7cbaef9: sched: optimize sched_clock() a bit

    Change the calls to cpu_clock().

    Reported-by: Luis Henriques

    Ingo Molnar
     

11 Nov, 2008

1 commit

  • Impact: extend /proc/sched_debug info

    Since the statistics of a group entity aren't exported directly from the
    kernel, it becomes difficult to obtain some of the group statistics.
    For example, the current method to obtain the exec time of a group entity
    is not always accurate. One has to read the exec times of all
    the tasks (/proc/<pid>/sched) in the group and add them. This method
    fails (or becomes difficult) if we want to collect stats of a group over
    a duration where tasks get created and terminated.

    This patch makes it easier to obtain group stats by directly including
    them in /proc/sched_debug. Stats like group exec time would help user
    programs (like LTP) accurately measure group fairness. (A small reader
    example follows this entry.)

    An example output of group stats from /proc/sched_debug:

    cfs_rq[3]:/3/a/1
    .exec_clock : 89.598007
    .MIN_vruntime : 0.000001
    .min_vruntime : 256300.970506
    .max_vruntime : 0.000001
    .spread : 0.000000
    .spread0 : -25373.372248
    .nr_running : 0
    .load : 0
    .yld_exp_empty : 0
    .yld_act_empty : 0
    .yld_both_empty : 0
    .yld_count : 4474
    .sched_switch : 0
    .sched_count : 40507
    .sched_goidle : 12686
    .ttwu_count : 15114
    .ttwu_local : 11950
    .bkl_count : 67
    .nr_spread_over : 0
    .shares : 0
    .se->exec_start : 113676.727170
    .se->vruntime : 1592.612714
    .se->sum_exec_runtime : 89.598007
    .se->wait_start : 0.000000
    .se->sleep_start : 0.000000
    .se->block_start : 0.000000
    .se->sleep_max : 0.000000
    .se->block_max : 0.000000
    .se->exec_max : 1.000282
    .se->slice_max : 1.999750
    .se->wait_max : 54.981093
    .se->wait_sum : 217.610521
    .se->wait_count : 50
    .se->load.weight : 2

    Signed-off-by: Bharata B Rao
    Acked-by: Srivatsa Vaddagiri
    Acked-by: Dhaval Giani
    Acked-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Bharata B Rao
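
    A small reader example for the output shown above: it scans
    /proc/sched_debug and echoes each cfs_rq header line along with its
    .exec_clock line. The parsing is naive and tied to the format illustrated
    in this entry.

    /* Naive /proc/sched_debug reader for the group stats illustrated above. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char line[256];
        FILE *f = fopen("/proc/sched_debug", "r");

        if (!f) { perror("/proc/sched_debug"); return 1; }

        while (fgets(line, sizeof(line), f)) {
            if (strncmp(line, "cfs_rq[", 7) == 0)     /* e.g. "cfs_rq[3]:/3/a/1" */
                fputs(line, stdout);
            else if (strstr(line, ".exec_clock"))     /* group exec time         */
                fputs(line, stdout);
        }
        fclose(f);
        return 0;
    }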
     

10 Nov, 2008

1 commit

  • Impact: clean up and fix debug info printout

    While looking over the sched_debug code I noticed that we printed the rq
    schedstats for every cfs_rq; amend this.

    Also change nr_spread_over into an int, and fix a little buglet in
    min_vruntime printing.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

04 Nov, 2008

1 commit


30 Oct, 2008

1 commit


10 Oct, 2008

1 commit

  • lock_task_sighand() makes sure task->sighand is protected,
    so we do not need rcu_read_lock().
    [ exec() will take task->sighand->siglock before changing task->sighand! ]

    But code using rcu_read_lock() _just_ to protect lock_task_sighand()
    appears only in procfs (and some code in procfs uses lock_task_sighand()
    without such redundant protection).

    Other subsystems may put lock_task_sighand() into an rcu_read_lock()
    critical region, but those rcu_read_lock() calls are used for protecting
    "for_each_process()", "find_task_by_vpid()" etc., not for protecting
    lock_task_sighand().

    Signed-off-by: Lai Jiangshan
    [ok from Oleg]
    Signed-off-by: Alexey Dobriyan

    Lai Jiangshan
     

27 Jun, 2008

1 commit