03 Oct, 2014

1 commit

  • On 32 bit systems cmpxchg cannot handle 64 bit values, so
    some additional magic is required to allow a 32 bit system
    with CONFIG_VIRT_CPU_ACCOUNTING_GEN=y to build.

    Make sure the correct cmpxchg function is used when doing
    an atomic swap of a cputime_t.
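
    A minimal sketch of the kind of size dispatch this describes (the
    cmpxchg_cputime() helper name is illustrative, not necessarily what the
    patch introduces); cmpxchg64() is the kernel's existing 64-bit variant:

        #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
        /* nsec based cputime_t is u64: too wide for a 32-bit cmpxchg() */
        # define cmpxchg_cputime(ptr, old, new)  cmpxchg64(ptr, old, new)
        #else
        /* jiffies based cputime_t is an unsigned long: plain cmpxchg() works */
        # define cmpxchg_cputime(ptr, old, new)  cmpxchg(ptr, old, new)
        #endif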

    Reported-by: Arnd Bergmann
    Signed-off-by: Rik van Riel
    Acked-by: Arnd Bergmann
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: umgwanakikbuti@gmail.com
    Cc: fweisbec@gmail.com
    Cc: srao@redhat.com
    Cc: lwoodman@redhat.com
    Cc: atheurer@redhat.com
    Cc: oleg@redhat.com
    Cc: Andrew Morton
    Cc: Benjamin Herrenschmidt
    Cc: Heiko Carstens
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: linux390@de.ibm.com
    Cc: linux-arch@vger.kernel.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: linux-s390@vger.kernel.org
    Link: http://lkml.kernel.org/r/20140930155947.070cdb1f@annuminas.surriel.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     

19 Sep, 2014

1 commit

  • The sig->stats_lock nests inside the tasklist_lock and the
    sighand->siglock in __exit_signal and wait_task_zombie.

    However, both of those locks can be taken from irq context,
    which means we need to use the interrupt safe variant of
    read_seqbegin_or_lock. This blocks interrupts when the "lock"
    branch is taken (seq is odd), preventing the lock inversion.

    On the first (lockless) pass through the loop, irqs are not
    blocked.
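
    A sketch of the resulting read-side pattern (adapted from the description
    above; the *_irqsave/_irqrestore helper names stand for the interrupt-safe
    variants mentioned here and are assumptions about their exact spelling):

        unsigned int seq, nextseq;
        unsigned long flags;

        rcu_read_lock();
        nextseq = 0;                    /* first pass: lockless, irqs stay enabled */
        do {
                seq = nextseq;
                flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock, &seq);
                /* ... sum up the per-thread and already-exited statistics ... */
                nextseq = 1;            /* if we must retry: take the lock, irqs off */
        } while (need_seqretry(&sig->stats_lock, seq));
        done_seqretry_irqrestore(&sig->stats_lock, seq, flags);
        rcu_read_unlock();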

    Reported-by: Stanislaw Gruszka
    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: prarit@redhat.com
    Cc: oleg@redhat.com
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1410527535-9814-3-git-send-email-riel@redhat.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     

08 Sep, 2014

2 commits

  • The functions task_cputime_adjusted() and thread_group_cputime_adjusted()
    can be called locklessly, as well as concurrently on many different CPUs.

    This can occasionally lead to the utime and stime reported by times(), and
    other syscalls like it, going backward. The cause appears to be
    multiple threads racing in cputime_adjust(), each with a utime or stime
    value larger than the original, but different from one another.

    Sometimes the larger value gets saved first, only to be immediately
    overwritten with a smaller value by another thread.

    Using atomic exchange prevents that problem, and ensures time
    progresses monotonically.
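
    A sketch of that technique (the helper name and the plain cmpxchg() call
    are illustrative; on 32-bit with a 64-bit cputime_t this needs the
    cmpxchg variant discussed in the 03 Oct, 2014 entry above):

        static void cputime_advance(cputime_t *counter, cputime_t new)
        {
                cputime_t old;

                /* Move *counter forward to new, but never backward. */
                while (new > (old = READ_ONCE(*counter))) {
                        if (cmpxchg(counter, old, new) == old)
                                break;  /* we won the race */
                }
        }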

    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: umgwanakikbuti@gmail.com
    Cc: fweisbec@gmail.com
    Cc: akpm@linux-foundation.org
    Cc: srao@redhat.com
    Cc: lwoodman@redhat.com
    Cc: atheurer@redhat.com
    Cc: oleg@redhat.com
    Link: http://lkml.kernel.org/r/1408133138-22048-4-git-send-email-riel@redhat.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • Both times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) have scalability
    issues on large systems, due to both functions being serialized with a
    lock.

    The lock protects against reporting a wrong value when a thread in the
    task group exits and its statistics are folded into the signal struct,
    which could otherwise lead to that exited task's statistics being
    counted twice (or not at all).

    Protecting that with a lock results in times() and clock_gettime() being
    completely serialized on large systems.

    This can be fixed by using a seqlock around the events that gather and
    propagate statistics. As an additional benefit, the protection code can
    be moved into thread_group_cputime(), slightly simplifying the calling
    functions.

    In the case of posix_cpu_clock_get_task() things can be simplified a
    lot, because the calling function already ensures that the task sticks
    around, and the rest is now taken care of in thread_group_cputime().

    This way the statistics reporting code can run lockless.
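
    A sketch of the write side this implies (simplified; the fields follow
    the signal_struct accounting described above, the local variables are
    illustrative):

        /* e.g. when an exiting thread's times are folded into the signal struct */
        write_seqlock(&sig->stats_lock);
        sig->utime += utime;
        sig->stime += stime;
        sig->sum_sched_runtime += sum_exec_runtime;
        write_sequnlock(&sig->stats_lock);

    Lockless readers that race with this bump of the sequence count simply
    retry (or take the lock), instead of serializing every times() call.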

    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alex Thorlton
    Cc: Andrew Morton
    Cc: Daeseok Youn
    Cc: David Rientjes
    Cc: Dongsheng Yang
    Cc: Geert Uytterhoeven
    Cc: Guillaume Morin
    Cc: Ionut Alexa
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Li Zefan
    Cc: Michal Hocko
    Cc: Michal Schmidt
    Cc: Oleg Nesterov
    Cc: Vladimir Davydov
    Cc: umgwanakikbuti@gmail.com
    Cc: fweisbec@gmail.com
    Cc: srao@redhat.com
    Cc: lwoodman@redhat.com
    Cc: atheurer@redhat.com
    Link: http://lkml.kernel.org/r/20140816134010.26a9b572@annuminas.surriel.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     

20 Aug, 2014

1 commit

  • Change thread_group_cputime() to use for_each_thread() instead of
    buggy while_each_thread(). This also makes the pid_alive() check
    unnecessary.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Hidetoshi Seto
    Cc: Frank Mayhar
    Cc: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Sanjay Rao
    Cc: Larry Woodman
    Cc: Rik van Riel
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140813192000.GA19327@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     

07 May, 2014

1 commit

  • Russell reported that irqtime_account_idle_ticks() takes ages due to:

        for (i = 0; i < ticks; i++)
                irqtime_account_process_tick(current, 0, rq);

    It's sad that this code was written way _AFTER_ the NOHZ idle
    functionality was available. I charge myself guilty for not paying
    attention when that crap got merged with commit abb74cefa ("sched:
    Export ns irqtimes through /proc/stat").

    So instead of looping nr_ticks times, just apply the whole thing at
    once.
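
    A sketch of the shape of the fix (the extra ticks argument reflects the
    changelog's description; the exact signature in the patch may differ):

        static void irqtime_account_idle_ticks(int ticks)
        {
                struct rq *rq = this_rq();

                /* charge all pending ticks in one call */
                irqtime_account_process_tick(current, 0, rq, ticks);
        }

    ... with irqtime_account_process_tick() multiplying cputime_one_jiffy by
    the number of ticks instead of being called once per tick.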

    As a side note: The whole cputime_t vs. u64 business in that context
    wants to be cleaned up as well. There is no point in having all these
    back and forth conversions. Let's standardise on u64 nsec for all
    kernel internal accounting and be done with it. Everything else does
    not make sense at all for fine grained accounting. Frederic, can you
    please take care of that?

    Reported-by: Russell King
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Peter Zijlstra
    Cc: Venkatesh Pallipadi
    Cc: Shaun Ruffell
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1405022307000.6261@ionos.tec.linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

02 Apr, 2014

1 commit

  • Pull timer updates from Ingo Molnar:
    "The main purpose is to fix a full dynticks bug related to
    virtualization, where steal time accounting appears to be zero in
    /proc/stat even after a few seconds of competing guests running busy
    loops on the same host CPU. It's not a regression, though, as it has
    been there since the beginning.

    The other commits are preparatory work to fix the bug and various
    cleanups"

    * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    arch: Remove stub cputime.h headers
    sched: Remove needless round trip nsecs tick conversion of steal time
    cputime: Fix jiffies based cputime assumption on steal accounting
    cputime: Bring cputime -> nsecs conversion
    cputime: Default implementation of nsecs -> cputime conversion
    cputime: Fix nsecs_to_cputime() return type cast

    Linus Torvalds
     

13 Mar, 2014

1 commit

  • The guest steal time accounting code assumes that cputime_t is based on
    jiffies. So when CONFIG_NO_HZ_FULL=y, which implies that cputime_t
    is based on nsecs, steal_account_process_tick() passes the delta in
    jiffies to account_steal_time() which then accounts it as if it's a
    value in nsecs.

    As a result, accounting 1 second of steal time (with HZ=100 that would
    be 100 jiffies) is spuriously accounted as 100 nsecs.

    As such /proc/stat may report 0 values of steal time even when two
    guests have run concurrently for a few seconds on the same host and
    same CPU.

    In order to fix this, let's convert the nsecs based steal delta to
    cputime instead of jiffies by using the right conversion API.

    Given that the steal time is stored in cputime_t and this type can have
    a coarser granularity than nsecs, we only account the rounded converted
    value and leave the remaining nsecs for the next deltas.
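
    A sketch of the accounting step described (paravirt_steal_clock() is the
    existing steal clock interface; the conversion helpers are the ones listed
    in the 02 Apr, 2014 pull request above; details are illustrative):

        u64 steal;
        cputime_t steal_ct;

        steal = paravirt_steal_clock(smp_processor_id());
        steal -= this_rq()->prev_steal_time;

        /* account only whole cputime units; the nsec remainder stays in
         * prev_steal_time and is picked up by the next delta */
        steal_ct = nsecs_to_cputime(steal);
        this_rq()->prev_steal_time += cputime_to_nsecs(steal_ct);

        account_steal_time(steal_ct);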

    Reported-by: Huiqingding
    Reported-by: Marcelo Tosatti
    Cc: Ingo Molnar
    Cc: Marcelo Tosatti
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Acked-by: Rik van Riel
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     

09 Feb, 2014

1 commit

  • Since the patch "sched: Move the priority specific bits into a new header file"
    exposes the priority related macros in linux/sched/prio.h, we no longer have to
    implement task_nice() in kernel/sched/core.c.

    This patch implements it in linux/sched/sched.h as a static inline function,
    saving kernel stack and enhancing performance a bit.
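
    The resulting helper is essentially a one-liner on top of the macros now
    exported by linux/sched/prio.h (sketch):

        static inline int task_nice(const struct task_struct *p)
        {
                return PRIO_TO_NICE((p)->static_prio);
        }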

    Signed-off-by: Dongsheng Yang
    Cc: clark.williams@gmail.com
    Cc: rostedt@goodmis.org
    Cc: raistlin@linux.it
    Cc: juri.lelli@gmail.com
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1390878045-7096-1-git-send-email-yangds.fnst@cn.fujitsu.com
    Signed-off-by: Ingo Molnar

    Dongsheng Yang
     


05 Sep, 2013

1 commit

  • Pull timers/nohz changes from Ingo Molnar:
    "It mostly contains fixes and full dynticks off-case optimizations, by
    Frederic Weisbecker"

    * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
    nohz: Include local CPU in full dynticks global kick
    nohz: Optimize full dynticks's sched hooks with static keys
    nohz: Optimize full dynticks state checks with static keys
    nohz: Rename a few state variables
    vtime: Always debug check snapshot source _before_ updating it
    vtime: Always scale generic vtime accounting results
    vtime: Optimize full dynticks accounting off case with static keys
    vtime: Describe overriden functions in dedicated arch headers
    m68k: hardirq_count() only need preempt_mask.h
    hardirq: Split preempt count mask definitions
    context_tracking: Split low level state headers
    vtime: Fix racy cputime delta update
    vtime: Remove a few unneeded generic vtime state checks
    context_tracking: User/kernel broundary cross trace events
    context_tracking: Optimize context switch off case with static keys
    context_tracking: Optimize guest APIs off case with static key
    context_tracking: Optimize main APIs off case with static key
    context_tracking: Ground setup for static key use
    context_tracking: Remove full dynticks' hacky dependency on wide context tracking
    nohz: Only enable context tracking on full dynticks CPUs
    ...

    Linus Torvalds
     

04 Sep, 2013

1 commit

  • scale_stime() silently assumes that stime < rtime. Otherwise, when
    stime == rtime and both values are big enough (operations
    on them do not fit in 32 bits), the resulting scaled stime can
    be bigger than rtime. As a consequence, utime = rtime - stime
    results in a negative value.

    User space visible symptoms of the bug are overflowed TIME
    values on ps/top, for example:

    $ ps aux | grep rcu
    root 8 0.0 0.0 0 0 ? S 12:42 0:00 [rcuc/0]
    root 9 0.0 0.0 0 0 ? S 12:42 0:00 [rcub/0]
    root 10 62422329 0.0 0 0 ? R 12:42 21114581:37 [rcu_preempt]
    root 11 0.1 0.0 0 0 ? S 12:42 0:02 [rcuop/0]
    root 12 62422329 0.0 0 0 ? S 12:42 21114581:35 [rcuop/1]
    root 10 62422329 0.0 0 0 ? R 12:42 21114581:37 [rcu_preempt]

    or overflowed utime values read directly from /proc/$PID/stat

    Reference:

    https://lkml.org/lkml/2013/8/20/259
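
    One way to sidestep the corner case (a hedged sketch of the kind of guard
    involved, not necessarily the exact patch) is to special-case the
    degenerate splits before scaling, so the scaling never runs with
    utime == 0 (the stime == rtime case above):

        if (utime == 0) {
                stime = rtime;          /* everything was system time */
        } else if (stime == 0) {
                utime = rtime;          /* everything was user time */
        } else {
                stime = scale_stime(stime, rtime, stime + utime);
                utime = rtime - stime;
        }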

    Reported-and-tested-by: Sergey Senozhatsky
    Signed-off-by: Stanislaw Gruszka
    Cc: stable@vger.kernel.org
    Cc: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Paul E. McKenney
    Cc: Borislav Petkov
    Link: http://lkml.kernel.org/r/20130904131602.GC2564@redhat.com
    Signed-off-by: Ingo Molnar

    Stanislaw Gruszka
     

16 Aug, 2013

1 commit

  • Using a this_cpu() operation reduces the number of instructions needed
    for accounting (account_user_time()) and frees up some registers. This is in
    the scheduler tick hotpath.
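
    A sketch of the kind of change this refers to (kernel_cpustat is the
    existing per-CPU statistics structure; treat the exact line as
    illustrative):

        /* before: fetch the per-CPU pointer, then modify through it */
        kcpustat_this_cpu->cpustat[index] += tmp;

        /* after: a single per-CPU read-modify-write, no pointer materialized */
        __this_cpu_add(kernel_cpustat.cpustat[index], tmp);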

    Signed-off-by: Christoph Lameter
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/00000140596dd165-338ff7f5-893b-4fec-b251-aaac5557239e-000000@email.amazonses.com
    Signed-off-by: Ingo Molnar

    Christoph Lameter
     

14 Aug, 2013

6 commits

  • The vtime delta update performed by get_vtime_delta() always checks
    that the source of the snapshot is valid.

    Meanwhile, the snapshot updaters that rely on get_vtime_delta() also
    set the new snapshot origin. But some of them do this right before
    the call to get_vtime_delta(), making its debug check useless.

    This is easily fixed by moving the snapshot origin update after
    the call to get_vtime_delta(). The order doesn't matter there.

    Signed-off-by: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Li Zhong
    Cc: Mike Galbraith
    Cc: Kevin Hilman

    Frederic Weisbecker
     
  • The cputime accounting in full dynticks can be a subtle
    mix of CPUs using tick based accounting and others using
    generic vtime.

    As long as the tick has a share in producing these stats, we
    want to scale the result against CFS's precise accounting, as the tick
    can miss tasks hiding between the periodic interrupts.

    Signed-off-by: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Li Zhong
    Cc: Mike Galbraith
    Cc: Kevin Hilman

    Frederic Weisbecker
     
  • If no CPU is in the full dynticks range, we can avoid the full
    dynticks cputime accounting through generic vtime, along with its
    overhead, and use the traditional tick based accounting instead.

    Let's do this and turn the off case into a no-op with static keys.
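
    A generic sketch of the static key pattern (the names here are
    illustrative, not the actual vtime symbols): while the key stays false
    the branch is patched out, so the off case costs about one no-op in the
    hot path.

        static struct static_key vtime_key = STATIC_KEY_INIT_FALSE;    /* illustrative */

        static inline void account_hook(struct task_struct *tsk)
        {
                if (static_key_false(&vtime_key))       /* patched out when disabled */
                        slow_vtime_accounting(tsk);     /* illustrative slow path */
        }

        static void __init vtime_key_enable(void)
        {
                /* flipped once during boot if a nohz_full= range was configured */
                static_key_slow_inc(&vtime_key);
        }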

    Signed-off-by: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Li Zhong
    Cc: Mike Galbraith
    Cc: Kevin Hilman

    Frederic Weisbecker
     
  • get_vtime_delta() must be called under the task vtime_seqlock
    with the code that does the cputime accounting flush.

    Otherwise the cputime reader can be fooled and run into
    a race where it sees the snapshot update but misses the
    cputime flush. As a result it can report a cputime that is
    way too short.

    Fix vtime_account_user(), which wasn't complying with that rule.
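
    A sketch of what complying with that rule looks like for
    vtime_account_user() (simplified; the field names follow the vtime code
    of that era and the exact accounting calls may differ):

        void vtime_account_user(struct task_struct *tsk)
        {
                cputime_t delta_cpu;

                write_seqlock(&tsk->vtime_seqlock);
                delta_cpu = get_vtime_delta(tsk);       /* snapshot update ... */
                tsk->vtime_snap_whence = VTIME_SYS;
                account_user_time(tsk, delta_cpu, cputime_to_scaled(delta_cpu));
                write_sequnlock(&tsk->vtime_seqlock);   /* ... and flush, atomic
                                                           with respect to readers */
        }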

    Signed-off-by: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Li Zhong
    Cc: Mike Galbraith
    Cc: Kevin Hilman

    Frederic Weisbecker
     
  • Some generic vtime APIs check if the vtime accounting
    is enabled on the local CPU before doing their work.

    Some of these checks are not needed because all their callers already
    take care of that. Let's remove them.

    Signed-off-by: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Li Zhong
    Cc: Mike Galbraith
    Cc: Kevin Hilman

    Frederic Weisbecker
     
  • Optimize the guest entry/exit APIs with static keys. This minimizes
    the overhead for those who enable CONFIG_NO_HZ_FULL without
    always using it. With no range passed to nohz_full=, the probe
    overhead is kept to a minimum.

    Signed-off-by: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Li Zhong
    Cc: Mike Galbraith
    Cc: Kevin Hilman

    Frederic Weisbecker
     

13 Aug, 2013

1 commit

  • Update a stale comment from the old vtime era and document some
    locking that might be non-obvious.

    Signed-off-by: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Li Zhong
    Cc: Mike Galbraith
    Cc: Kevin Hilman

    Frederic Weisbecker
     

01 Jul, 2013

1 commit

  • Merge in a recent upstream commit:

    c2853c8df57f include/linux/math64.h: add div64_ul()

    because:

    72a4cf20cb71 sched: Change cfs_rq load avg to unsigned long

    relies on it.

    [ We don't rebase sched/core for this, because the handful of
    followup commits after the broken commit are not behavioral
    changes so are unlikely to be needed during bisection. ]

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

31 May, 2013

1 commit

  • While computing the cputime delta of dynticks CPUs,
    we are mixing up clocks of different natures:

    * local_clock(), which takes care of unstable clock
    sources and fixes them up if needed.

    * sched_clock(), which is the weaker version of
    local_clock(). It doesn't compute any fixup in case
    of an unstable source.

    If the clock source is stable, those two clocks are the
    same and we can safely compute the difference between
    two arbitrary points.

    Otherwise it results in random deltas, as sched_clock()
    can randomly drift away, backward or forward, from local_clock().

    As a consequence, some strange behaviour with an unstable TSC
    has been observed, such as cputime stuck at a constant zero
    (the 'top' command showing no load).

    Fix this by only using local_clock(), or its irq-safe/remote
    equivalent, in vtime code.

    Reported-by: Mike Galbraith
    Suggested-by: Mike Galbraith
    Cc: Steven Rostedt
    Cc: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Li Zhong
    Cc: Mike Galbraith
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     


01 May, 2013

3 commits

  • Dave Hansen reported strange utime/stime values on his system:
    https://lkml.org/lkml/2013/4/4/435

    This happens because the prev->stime value is bigger than the rtime
    value. The root of the problem is non-monotonic rtime values (i.e.
    the current rtime is smaller than the previous rtime), and that should
    be debugged and fixed.

    But since the problem did not manifest itself before commit
    62188451f0d63add7ad0cd2a1ae269d600c1663d ("cputime: Avoid
    multiplication overflow on utime scaling"), it should be treated
    as a regression, which we can easily fix in the cputime_adjust()
    function.

    For now, let's apply this fix, but further work is needed to address
    the root of the problem.

    Reported-and-tested-by: Dave Hansen
    Cc: # 3.9+
    Signed-off-by: Stanislaw Gruszka
    Cc: Frederic Weisbecker
    Cc: rostedt@goodmis.org
    Cc: Linus Torvalds
    Cc: Dave Hansen
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1367314507-9728-3-git-send-email-sgruszka@redhat.com
    Signed-off-by: Ingo Molnar

    Stanislaw Gruszka
     
  • Due to rounding in scale_stime(), for big numbers, scaled stime
    values will grow in chunks. Since rtime grows in jiffies and we
    calculate utime like below:

        prev->stime = max(prev->stime, stime);
        prev->utime = max(prev->utime, rtime - prev->stime);

    we could erroneously account stime values as utime. To prevent
    that, only update the prev->{u,s}time values when they are smaller
    than the current rtime.
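
    A sketch of that guard (simplified from the description above; the exact
    condition in the patch may differ):

        /* only update the exported values if real execution time moved
         * past what was already reported; otherwise keep the old split */
        if (prev->utime + prev->stime < rtime) {
                prev->stime = max(prev->stime, stime);
                prev->utime = max(prev->utime, rtime - prev->stime);
        }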

    Signed-off-by: Stanislaw Gruszka
    Cc: Frederic Weisbecker
    Cc: rostedt@goodmis.org
    Cc: Linus Torvalds
    Cc: Dave Hansen
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1367314507-9728-2-git-send-email-sgruszka@redhat.com
    Signed-off-by: Ingo Molnar

    Stanislaw Gruszka
     
  • This patch adds Linus's cputime scaling algorithm to the
    kernel.

    It is a follow-up (well, a fix) to commit
    d9a3c9823a2e6a543eb7807fb3d15d8233817ec5 ("sched: Lower chances
    of cputime scaling overflow"), which tried to avoid
    multiplication overflow but did not guarantee that the overflow
    would not happen.

    Linus created a different algorithm, which completely avoids the
    multiplication overflow by dropping precision when the numbers are
    big.

    I tested it and it gives a good relative error for the
    scaled numbers. The testing method is described here:
    http://marc.info/?l=linux-kernel&m=136733059505406&w=2
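
    A compact reconstruction of the idea (not the verbatim kernel code):
    halve a numerator factor together with the denominator until every
    operand fits in 32 bits, so the final multiply is a 32x32->64 that
    cannot overflow and only low-order precision is lost.

        static u64 scale_stime_sketch(u64 stime, u64 rtime, u64 total)
        {
                /* each step divides one numerator factor and the denominator
                 * by two, which (roughly) preserves stime * rtime / total */
                while (stime >> 32 || total >> 32) {
                        stime >>= 1;
                        total >>= 1;
                }
                while (rtime >> 32) {
                        rtime >>= 1;
                        total >>= 1;
                }
                if (!total)
                        return rtime;   /* degenerate, avoid dividing by zero */

                return div_u64((u64)(u32)stime * (u32)rtime, (u32)total);
        }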

    Originally-From: Linus Torvalds
    Signed-off-by: Stanislaw Gruszka
    Cc: Frederic Weisbecker
    Cc: rostedt@goodmis.org
    Cc: Dave Hansen
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20130430151441.GC10465@redhat.com
    Signed-off-by: Ingo Molnar

    Stanislaw Gruszka
     

30 Apr, 2013

1 commit

  • Pull scheduler changes from Ingo Molnar:
    "The main changes in this development cycle were:

    - full dynticks preparatory work by Frederic Weisbecker

    - factor out the cpu time accounting code better, by Li Zefan

    - multi-CPU load balancer cleanups and improvements by Joonsoo Kim

    - various smaller fixes and cleanups"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
    sched: Fix init NOHZ_IDLE flag
    sched: Prevent to re-select dst-cpu in load_balance()
    sched: Rename load_balance_tmpmask to load_balance_mask
    sched: Move up affinity check to mitigate useless redoing overhead
    sched: Don't consider other cpus in our group in case of NEWLY_IDLE
    sched: Explicitly cpu_idle_type checking in rebalance_domains()
    sched: Change position of resched_cpu() in load_balance()
    sched: Fix wrong rq's runnable_avg update with rt tasks
    sched: Document task_struct::personality field
    sched/cpuacct/UML: Fix header file dependency bug on the UML build
    cgroup: Kill subsys.active flag
    sched/cpuacct: No need to check subsys active state
    sched/cpuacct: Initialize cpuacct subsystem earlier
    sched/cpuacct: Initialize root cpuacct earlier
    sched/cpuacct: Allocate per_cpu cpuusage for root cpuacct statically
    sched/cpuacct: Clean up cpuacct.h
    sched/cpuacct: Remove redundant NULL checks in cpuacct_acount_field()
    sched/cpuacct: Remove redundant NULL checks in cpuacct_charge()
    sched/cpuacct: Add cpuacct_acount_field()
    sched/cpuacct: Add cpuacct_init()
    ...

    Linus Torvalds
     


08 Apr, 2013

1 commit

  • Recent commit 6fac4829 ("cputime: Use accessors to read task
    cputime stats") introduced a bug where we account the cputime of
    the first thread many times, instead of the cputimes of all the
    different threads.

    Signed-off-by: Stanislaw Gruszka
    Acked-by: Frederic Weisbecker
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20130404085740.GA2495@redhat.com
    Signed-off-by: Ingo Molnar

    Stanislaw Gruszka
     

14 Mar, 2013

1 commit

  • Some users have reported that after running a process with
    hundreds of threads on intensive CPU-bound loads, the cputime
    of the group started to freeze after a few days.

    This is due to how we scale the tick-based cputime against
    the scheduler precise execution time value.

    We add the values of all threads in the group and we multiply
    that against the sum of the scheduler exec runtime of the whole
    group.

    This easily overflows after a few days/weeks of execution.

    A proposed solution to solve this was to compute that multiplication
    on stime instead of utime:
    62188451f0d63add7ad0cd2a1ae269d600c1663d
    ("cputime: Avoid multiplication overflow on utime scaling")

    The rationale behind that was that it's easy for a thread to
    spend most of its time in userspace under an intensive CPU-bound workload,
    but it's much harder to do an intensive CPU-bound long run in the kernel.

    This postulate got defeated when a user recently reported he was still
    seeing cputime freezes after the above patch. The workload that
    triggers this issue relates to intensive networking workloads where
    most of the cputime is consumed in the kernel.

    To further reduce the opportunities for multiplication overflow,
    let's reduce the multiplication factors to the remainders of the division
    between sched exec runtime and cputime. Assuming the difference between
    these shouldn't ever be that large, this should work in many situations.
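
    The identity behind that rework, as a sketch (it narrows the overflow
    window rather than eliminating it, which is exactly the caveat above):

        /* stime * rtime / total
         *     == stime * (rtime / total) + stime * (rtime % total) / total
         * so the factors that get multiplied are a (hopefully small)
         * quotient and a remainder smaller than total. */
        static u64 scale_stime_by_parts(u64 stime, u64 rtime, u64 total)
        {
                u64 quot, rem;

                if (!total)
                        return rtime;

                quot = div64_u64(rtime, total);
                rem  = rtime - quot * total;            /* rtime % total */

                return stime * quot + div64_u64(stime * rem, total);
        }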

    This gets the same results as the upstream scaling code except for
    a small difference: the upstream code always rounds the result down to
    the nearest integer not greater than the precise value, while the new
    code rounds to the nearest integer, which may be either greater or
    smaller. In practice this difference probably doesn't matter, but it's
    worth mentioning.

    If this solution turns out not to be enough in the end, we'll
    need to partly revert to the behaviour prior to commit
    0cf55e1ec08bb5a22e068309e2d8ba1180ab4239
    ("sched, cputime: Introduce thread_group_times()")

    Back then, the scaling was done at exit() time, before adding the cputime
    of an exiting thread to the signal struct. And then we'd need to
    scale the live threads' cputime one by one in thread_group_cputime(). The
    drawback may be slightly slower code at exit time.

    Signed-off-by: Frederic Weisbecker
    Cc: Stanislaw Gruszka
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Andrew Morton

    Frederic Weisbecker
     

08 Mar, 2013

1 commit

  • The full dynticks cputime accounting can account either
    using the tick or the context tracking subsystem. This way
    the housekeeping CPU can keep the low-overhead tick based
    solution.

    The tick based mode has a coarse jiffies resolution and
    needs to be scaled against CFS's precise runtime accounting to
    improve its result. We already do this for CONFIG_TICK_CPU_ACCOUNTING;
    now we also need to expand it to the dynamic off-case of full
    dynticks accounting as well.

    Signed-off-by: Frederic Weisbecker
    Cc: Li Zhong
    Cc: Kevin Hilman
    Cc: Mats Liljegren
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Steven Rostedt
    Cc: Namhyung Kim
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Cc: Paul E. McKenney

    Frederic Weisbecker
     

24 Feb, 2013

1 commit

  • Running the full dynticks cputime accounting with preemptible
    kernel debugging triggers the following warning:

    [ 4.488303] BUG: using smp_processor_id() in preemptible [00000000] code: init/1
    [ 4.490971] caller is native_sched_clock+0x22/0x80
    [ 4.493663] Pid: 1, comm: init Not tainted 3.8.0+ #13
    [ 4.496376] Call Trace:
    [ 4.498996] [] debug_smp_processor_id+0xdb/0xf0
    [ 4.501716] [] native_sched_clock+0x22/0x80
    [ 4.504434] [] sched_clock+0x9/0x10
    [ 4.507185] [] fetch_task_cputime+0xad/0x120
    [ 4.509916] [] task_cputime+0x35/0x60
    [ 4.512622] [] acct_update_integrals+0x1e/0x40
    [ 4.515372] [] do_execve_common+0x4ff/0x5c0
    [ 4.518117] [] ? do_execve_common+0x144/0x5c0
    [ 4.520844] [] ? rest_init+0x160/0x160
    [ 4.523554] [] do_execve+0x37/0x40
    [ 4.526276] [] run_init_process+0x23/0x30
    [ 4.528953] [] kernel_init+0x9c/0xf0
    [ 4.531608] [] ret_from_fork+0x7c/0xb0

    We use sched_clock() to perform and fix up the cputime
    accounting. However, we are calling it with preemption enabled
    from the read side, which triggers the bug above.

    To fix this up, use local_clock() instead. It takes care of
    preemption and also provides a more reliable clock source. This
    is welcome for this kind of statistic, which is widely relied on
    in userspace.

    Reported-by: Thomas Gleixner
    Reported-by: Ingo Molnar
    Suggested-by: Thomas Gleixner
    Signed-off-by: Frederic Weisbecker
    Cc: Li Zhong
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Kevin Hilman
    Link: http://lkml.kernel.org/r/1361636925-22288-3-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     


05 Feb, 2013

1 commit

  • …x/kernel/git/frederic/linux-dynticks into sched/core

    Pull full-dynticks (user-space execution is undisturbed and
    receives no timer IRQs) preparation changes that convert the
    cputime accounting code to be full-dynticks ready,
    from Frederic Weisbecker:

    "This implements the cputime accounting on full dynticks CPUs.

    Typical cputime stats infrastructure relies on the timer tick and
    its periodic polling on the CPU to account the amount of time
    spent by the CPUs and the tasks per high level domains such as
    userspace, kernelspace, guest, ...

    Now we are preparing to implement full dynticks capability on
    Linux for Real Time and HPC users who want full CPU isolation.
    This feature requires a cputime accounting that doesn't depend
    on the timer tick.

    To implement it, this new cputime infrastructure plugs into
    kernel/user/guest boundaries to take snapshots of cputime and
    flush these to the stats when needed. This performs pretty
    much like CONFIG_VIRT_CPU_ACCOUNTING except that context location
    and cputime snapshots are synchronized between the write and read
    side such that the latter can safely retrieve the pending tickless
    cputime of a task and add it to its latest cputime snapshot to
    return the correct result to the user."

    Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

28 Jan, 2013

5 commits

  • While remotely reading the cputime of a task running on a
    full dynticks CPU, the values stored in the utime/stime fields
    of struct task_struct may be stale. They may be those
    of the last kernel/user transition snapshot, and
    we need to add the tickless time spent since that snapshot.

    To fix this, flush the cputime of the dynticks CPUs on
    kernel/user transitions and record the time and context
    where we did this. Then, on top of this snapshot and the current
    time, perform the fixup on the reader side from the task_times()
    accessors.
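
    A sketch of that reader-side fixup (the vtime_snap / vtime_snap_whence
    names are assumptions about the bookkeeping the changelog mentions, and
    'now' stands for a scheduler clock timestamp; see the 24 Feb, 2013 entry
    above for which clock is safe to use there):

        unsigned int seq;
        cputime_t utime, stime;
        u64 now;        /* scheduler clock timestamp captured by the caller */

        /* t is the (possibly remote) task being sampled */
        do {
                seq = read_seqbegin(&t->vtime_seqlock);

                utime = t->utime;
                stime = t->stime;

                /* add the tickless time accumulated since the last flush */
                if (t->vtime_snap_whence == VTIME_USER)
                        utime += nsecs_to_cputime(now - t->vtime_snap);
        } while (read_seqretry(&t->vtime_seqlock, seq));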

    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Li Zhong
    Cc: Namhyung Kim
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    [fixed kvm module related build errors]
    Signed-off-by: Sedat Dilek

    Frederic Weisbecker
     
  • Do some ground preparatory work before adding the guest_enter()
    and guest_exit() context tracking callbacks. Those will later
    be used to read the guest cputime safely when we
    run in full dynticks mode.

    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Gleb Natapov
    Cc: Ingo Molnar
    Cc: Li Zhong
    Cc: Marcelo Tosatti
    Cc: Namhyung Kim
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     
  • This is in preparation for the full dynticks feature. While
    remotely reading the cputime of a task running on a full
    dynticks CPU, we'll need to do some extra computation. This
    way we can account the time it has spent tickless in userspace
    since its last cputime snapshot.

    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Li Zhong
    Cc: Namhyung Kim
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     
  • Allow dynamically switching between tick based and virtual
    cputime accounting. This way we can provide a kind of "on-demand"
    virtual cputime accounting. In this mode, the kernel relies
    on the context tracking subsystem to dynamically probe kernel
    boundaries.

    This is in preparation for being able to stop the timer tick in
    more places than just the idle state. Doing so will depend on
    CONFIG_VIRT_CPU_ACCOUNTING_GEN, which makes it possible to account
    the cputime without the tick by hooking into kernel/user boundaries.

    Depending on whether the tick is stopped or not, we can switch between
    tick based and vtime based accounting at any time in order to minimize
    the overhead associated with the user hooks.

    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Li Zhong
    Cc: Namhyung Kim
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     
  • If we want to stop the tick beyond idle, we need to be
    able to account the cputime without using the tick.

    Virtual cputime accounting solves that problem by
    hooking into kernel/user boundaries.

    However, implementing CONFIG_VIRT_CPU_ACCOUNTING requires
    low level hooks and involves more overhead. But we already
    have a generic context tracking subsystem that is required
    for RCU by archs which plan to shut down the tick
    outside idle.

    This patch implements a generic virtual cputime
    accounting that relies on these generic kernel/user hooks.

    There are some upsides to doing this:

    - No arch code is needed to implement CONFIG_VIRT_CPU_ACCOUNTING
    if context tracking is already built (which is already necessary for
    RCU in full tickless mode).

    - We can rely on the generic context tracking subsystem to dynamically
    (de)activate the hooks, so that we can switch at any time between virtual
    and tick based accounting. This way we don't pay the overhead
    of the virtual accounting when the tick is running periodically.

    And one downside:

    - There is probably more overhead than with a native virtual cputime
    accounting. But this relies on hooks that are already set anyway.

    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Li Zhong
    Cc: Namhyung Kim
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker