27 Feb, 2013

1 commit

  • Pull scheduler fixes from Ingo Molnar.

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    cputime: Use local_clock() for full dynticks cputime accounting
    cputime: Constify timeval_to_cputime(timeval) argument
    sched: Move RR_TIMESLICE from sysctl.h to rt.h
    sched: Fix /proc/sched_debug failure on very very large systems
    sched: Fix /proc/sched_stat failure on very very large systems
    sched/core: Remove the obsolete and unused nr_uninterruptible() function

    Linus Torvalds
     

22 Feb, 2013

1 commit

  • On systems with 4096 cores, attempting to read /proc/sched_debug
    fails because we are trying to push all the data into a single
    kmalloc buffer.

    The issue is that on these very large machines all the data will not
    fit in 4 MB.

    A better solution is to not use the single_open mechanism but to
    provide our own seq_operations and treat each cpu as an
    individual record.

    The output should be identical to the previous version.
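
    A minimal sketch of that seq_operations pattern (illustrative only, not
    the exact patch; the function names and the per-cpu printer below are
    placeholders):

        #include <linux/seq_file.h>
        #include <linux/cpumask.h>

        /* Each record handed to ->show() is one possible CPU; the cookie is
         * cpu + 1 so CPU 0 is not confused with the NULL end-of-sequence. */
        static void *sched_debug_start(struct seq_file *m, loff_t *pos)
        {
                return *pos < nr_cpu_ids ?
                        (void *)(unsigned long)(*pos + 1) : NULL;
        }

        static void *sched_debug_next(struct seq_file *m, void *v, loff_t *pos)
        {
                (*pos)++;
                return sched_debug_start(m, pos);
        }

        static void sched_debug_stop(struct seq_file *m, void *v)
        {
        }

        static int sched_debug_show(struct seq_file *m, void *v)
        {
                int cpu = (unsigned long)v - 1;

                print_cpu_debug(m, cpu);    /* placeholder per-cpu printer */
                return 0;
        }

        static const struct seq_operations sched_debug_sops = {
                .start = sched_debug_start,
                .next  = sched_debug_next,
                .stop  = sched_debug_stop,
                .show  = sched_debug_show,
        };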

    Reported-by: Dave Jones
    Signed-off-by: Nathan Zimmer
    Cc: Peter Zijlstra
    [ Whitespace fixlet]
    [ Fix spello in comment]
    Signed-off-by: Andrew Morton
    Signed-off-by: Ingo Molnar

    Nathan Zimmer
     

21 Feb, 2013

1 commit

  • Pull cgroup changes from Tejun Heo:
    "Nothing too drastic.

    - Removal of synchronize_rcu() from userland visible paths.

    - Various fixes and cleanups from Li.

    - cgroup_rightmost_descendant() added which will be used by cpuset
    changes (it will be a separate pull request)."

    * 'for-3.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: fail if monitored file and event_control are in different cgroup
    cgroup: fix cgroup_rmdir() vs close(eventfd) race
    cpuset: fix cpuset_print_task_mems_allowed() vs rename() race
    cgroup: fix exit() vs rmdir() race
    cgroup: remove bogus comments in cgroup_diput()
    cgroup: remove synchronize_rcu() from cgroup_diput()
    cgroup: remove duplicate RCU free on struct cgroup
    sched: remove redundant NULL cgroup check in task_group_path()
    sched: split out css_online/css_offline from tg creation/destruction
    cgroup: initialize cgrp->dentry before css_alloc()
    cgroup: remove a NULL check in cgroup_exit()
    cgroup: fix bogus kernel warnings when cgroup_create() failed
    cgroup: remove synchronize_rcu() from rebind_subsystems()
    cgroup: remove synchronize_rcu() from cgroup_attach_{task|proc}()
    cgroup: use new hashtable implementation
    cgroups: fix cgroup_event_listener error handling
    cgroups: move cgroup_event_listener.c to tools/cgroup
    cgroup: implement cgroup_rightmost_descendant()
    cgroup: remove unused dummy cgroup_fork_callbacks()

    Linus Torvalds
     

25 Jan, 2013

2 commits

  • The type returned by the atomic64_t accessors can be either unsigned
    long or unsigned long long, depending on the architecture.
    Casting to unsigned long long lets us use the same
    format string for all architectures.

    Without this patch, building with scheduler debugging
    enabled results in:

    kernel/sched/debug.c: In function 'print_cfs_rq':
    kernel/sched/debug.c:225:2: warning: format '%ld' expects argument of type 'long int', but argument 4 has type 'long long int' [-Wformat]
    kernel/sched/debug.c:225:2: warning: format '%ld' expects argument of type 'long int', but argument 3 has type 'long long int' [-Wformat]
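
    A small userspace illustration of the idea (the counter_t typedef is a
    stand-in for the architecture-dependent accessor return type, not a
    real kernel type):

        #include <stdio.h>

        /* Stand-in: long on some configurations, long long on others. */
        #ifdef __LP64__
        typedef long counter_t;
        #else
        typedef long long counter_t;
        #endif

        int main(void)
        {
                counter_t load_avg = 123456;

                /* One cast to unsigned long long means a single format
                 * string works everywhere, avoiding the -Wformat warning. */
                printf("load_avg: %llu\n", (unsigned long long)load_avg);
                return 0;
        }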

    Signed-off-by: Arnd Bergmann
    Cc: Peter Zijlstra
    Cc: Paul Turner
    Cc: linux-arm-kernel@lists.infradead.org
    Link: http://lkml.kernel.org/r/1359123276-15833-7-git-send-email-arnd@arndb.de
    Signed-off-by: Ingo Molnar

    Arnd Bergmann
     
  • A task_group won't be online (thus no one can see it) until
    cpu_cgroup_css_online(), and at that time tg->css.cgroup has
    been initialized, so this NULL check is redundant.

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     

24 Oct, 2012

7 commits

  • Now that the machinery is in place to compute contributed load in a
    bottom-up fashion, replace the shares distribution code within
    update_shares() accordingly.

    Signed-off-by: Paul Turner
    Reviewed-by: Ben Segall
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20120823141507.061208672@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Entities of equal weight should receive equitable distribution of cpu time.
    This is challenging in the case of a task_group's shares as execution may be
    occurring on multiple cpus simultaneously.

    To handle this we divide up the shares into weights proportionate with
    the load on each cfs_rq. This does not, however, account for the fact
    that the sum of the parts may be less than one cpu, so we need to
    normalize:

    load(tg) = min(runnable_avg(tg), 1) * tg->shares

    where runnable_avg is the aggregate time in which the task_group had
    runnable children.
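
    A rough illustration of that normalization (plain floating point for
    clarity; the kernel uses fixed-point arithmetic, and these function and
    parameter names are made up for the example):

        /* Effective group weight: a group runnable for only half a cpu
         * should not receive its full tg->shares. */
        static double group_load(double runnable_avg, double shares)
        {
                double usage = runnable_avg < 1.0 ? runnable_avg : 1.0;

                return usage * shares;
        }

        /* Per-cpu slice: distribute the (normalized) group load in
         * proportion to the load carried by each cfs_rq. */
        static double cfs_rq_share(double cfs_rq_load, double total_load,
                                   double runnable_avg, double shares)
        {
                if (total_load == 0.0)
                        return 0.0;

                return group_load(runnable_avg, shares) *
                       (cfs_rq_load / total_load);
        }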

    Signed-off-by: Paul Turner
    Reviewed-by: Ben Segall
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20120823141506.930124292@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Maintain a global running sum of the average load seen on each cfs_rq belonging
    to each task group so that it may be used in calculating an appropriate
    shares:weight distribution.

    Signed-off-by: Paul Turner
    Reviewed-by: Ben Segall
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20120823141506.792901086@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • We are currently maintaining:

    runnable_load(cfs_rq) = \Sum task_load(t)

    For all running children t of cfs_rq. While this can be naturally
    updated for tasks in a runnable state (as they are scheduled), it does
    not account for the load contributed by blocked task entities.

    This can be solved by introducing a separate accounting for blocked load:

    blocked_load(cfs_rq) = \Sum runnable(b) * weight(b)

    Obviously we do not want to iterate over all blocked entities to account for
    their decay, we instead observe that:

    runnable_load(t) = \Sum p_i*y^i

    and that to account for an additional idle period we only need to compute:

    y*runnable_load(t).

    This means that we can compute all blocked entities at once by evaluating:

    blocked_load(cfs_rq)' = y * blocked_load(cfs_rq)

    Finally we maintain a decay counter so that when a sleeping entity re-awakens
    we can determine how much of its load should be removed from the blocked sum.
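
    A simplified sketch of that bookkeeping (floating point and invented
    names purely for illustration; the kernel uses fixed-point arithmetic
    with y^32 = 1/2):

        #include <math.h>

        #define DECAY_Y 0.9785720620877001      /* ~= 2^(-1/32) */

        struct blocked_sum {
                double load;            /* decayed \Sum runnable(b) * weight(b) */
                unsigned long decays;   /* periods of decay applied so far */
        };

        /* One idle period: decay every blocked entity at once. */
        static void decay_blocked(struct blocked_sum *b)
        {
                b->load *= DECAY_Y;
                b->decays++;
        }

        /* On wakeup, decay the entity's remembered contribution by the same
         * number of periods before removing it from the blocked sum. */
        static void unblock_entity(struct blocked_sum *b, double contrib,
                                   unsigned long contrib_decays)
        {
                unsigned long missed = b->decays - contrib_decays;

                b->load -= contrib * pow(DECAY_Y, missed);
                if (b->load < 0.0)
                        b->load = 0.0;
        }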

    Signed-off-by: Paul Turner
    Reviewed-by: Ben Segall
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20120823141506.585389902@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • For a given task t, we can compute its contribution to load as:

    task_load(t) = runnable_avg(t) * weight(t)

    On a parenting cfs_rq we can then aggregate:

    runnable_load(cfs_rq) = \Sum task_load(t), for all runnable children t

    Maintain this bottom up, with task entities adding their contributed load to
    the parenting cfs_rq sum. When a task entity's load changes we add the same
    delta to the maintained sum.
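
    A minimal sketch of that delta-based maintenance (structure and field
    names are illustrative, not the kernel's):

        struct task_avg {
                double runnable_avg;    /* fraction of time runnable, in [0, 1] */
                double weight;
                double load_contrib;    /* value last added to the parent sum */
        };

        /* Update the parent cfs_rq's runnable_load by the delta in this
         * task's contribution rather than recomputing the whole sum. */
        static void update_task_load(struct task_avg *t,
                                     double *cfs_rq_runnable_load)
        {
                double contrib = t->runnable_avg * t->weight;

                *cfs_rq_runnable_load += contrib - t->load_contrib;
                t->load_contrib = contrib;
        }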

    Signed-off-by: Paul Turner
    Reviewed-by: Ben Segall
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20120823141506.514678907@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Since runqueues do not have a corresponding sched_entity we instead embed a
    sched_avg structure directly.

    Signed-off-by: Ben Segall
    Reviewed-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20120823141506.442637130@google.com
    Signed-off-by: Ingo Molnar

    Ben Segall
     
  • Instead of tracking the average load parented by a cfs_rq, we can track
    entity load directly, with the load for a given cfs_rq then being the
    sum of its children's loads.

    To do this we represent the historical contribution to runnable average
    within each trailing 1024us of execution as the coefficients of a
    geometric series.

    We can express this for a given task t as:

    runnable_sum(t) = \Sum u_i * y^i, runnable_avg_period(t) = \Sum 1024 * y^i
    load(t) = weight_t * runnable_sum(t) / runnable_avg_period(t)

    where u_i is the usage in the i-th most recent 1024us period
    (approximately 1ms) and y is chosen such that y^k = 1/2. We currently
    choose k to be 32, which roughly translates to about one scheduling
    period.
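
    A floating-point sketch of that accumulation (the kernel uses 32-bit
    fixed-point arithmetic and per-entity state; the names here are
    illustrative):

        #define Y_32 0.9785720620877001         /* chosen so y^32 = 1/2 */

        struct entity_avg {
                double runnable_sum;    /* \Sum u_i * y^i */
                double period_sum;      /* \Sum 1024 * y^i */
        };

        /* Fold one elapsed 1024us segment into the running sums; usage_us
         * is how much of that segment the entity spent runnable. */
        static void accumulate_segment(struct entity_avg *a, double usage_us)
        {
                a->runnable_sum = a->runnable_sum * Y_32 + usage_us;
                a->period_sum   = a->period_sum   * Y_32 + 1024.0;
        }

        static double entity_load(const struct entity_avg *a, double weight)
        {
                if (a->period_sum == 0.0)
                        return 0.0;

                return weight * a->runnable_sum / a->period_sum;
        }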

    Signed-off-by: Paul Turner
    Reviewed-by: Ben Segall
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20120823141506.372695337@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     

14 May, 2012

1 commit

  • Some numbers like nr_running and nr_uninterruptible are fundamentally
    unsigned, since it's impossible to have a negative number of tasks, yet
    we still print them as signed to easily recognise the underflow
    condition.

    rq->nr_uninterruptible has 'special' accounting and can in fact very
    easily become negative on a per-cpu basis.

    It was noted that since the P() macro assumes things are long long, and
    the promotion of unsigned 'int/long' to long long on 32bit doesn't
    sign extend, we print silly large numbers instead of the easier-to-read
    signed numbers.

    Therefore extend the P() macro to not require the sign extension.
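
    A small userspace demonstration of the promotion behaviour being worked
    around (not the kernel macro itself):

        #include <stdio.h>

        int main(void)
        {
                /* Per-cpu accounting can legitimately go "negative", which an
                 * unsigned counter stores as a huge wrapped-around value. */
                unsigned int nr_uninterruptible = (unsigned int)-3;

                /* Promotion to long long preserves the value rather than the
                 * sign: this prints 4294967293. */
                printf("%lld\n", (long long)nr_uninterruptible);

                /* Going through the signed type of the same width first
                 * recovers the readable number: this prints -3. */
                printf("%lld\n", (long long)(int)nr_uninterruptible);

                return 0;
        }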

    Reported-by: Diwakar Tundlam
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-gk5tm8t2n4ix2vkpns42uqqp@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

09 May, 2012

1 commit

  • Since there's a PID space limit of 30 bits (see
    futex.h:FUTEX_TID_MASK) and allocating that many tasks (assuming a
    lower bound of 2 pages per task) would still take 8T of memory, it
    seems reasonable to say that unsigned int is sufficient for
    rq->nr_running.
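
    The back-of-the-envelope arithmetic behind that figure, assuming the
    usual 4 KiB pages: 2^30 tasks x 2 pages x 4 KiB = 2^30 x 8 KiB = 8 TiB.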

    When we do get anywhere near that number of tasks I suspect other
    things would go funny; load-balancer load computations would really
    need to be hoisted to 128 bits, etc.

    So save a few bytes and convert rq->nr_running and friends to
    unsigned int.

    Suggested-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-y3tvyszjdmbibade5bw8zl81@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

27 Jan, 2012

1 commit

  • We no longer use the sched_switch field.

    But simply removing the sched_switch field from the middle of the
    sched_stat output would break tools.

    So, to stay compatible, we hardcode it to zero and remove the
    field from the scheduler data structures.

    Update the schedstat documentation accordingly.

    Signed-off-by: Rakib Mullick
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1327422836.27181.5.camel@localhost.localdomain
    Signed-off-by: Ingo Molnar

    Rakib Mullick
     

17 Nov, 2011

1 commit