25 Jul, 2019

1 commit

  • sched_info_on() is called with an unlikely() hint. However, under
    'make defconfig' the test is a constant (1), so the compiler has
    nothing to act on. Remove the hint.

    Also, fix a lack of {}.
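
    A minimal userspace sketch of why the hint is pointless, assuming the
    common GCC/Clang definition of unlikely() via __builtin_expect(). The
    helper sched_info_on_sketch() and both counting functions are
    hypothetical stand-ins, not the kernel code:

```c
/* Userspace stand-in for the kernel's unlikely(); GCC/Clang builtin. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/*
 * Hypothetical stand-in: in some configurations the real
 * sched_info_on() can reduce to a compile-time constant 1.
 */
static inline int sched_info_on_sketch(void)
{
	return 1;
}

/* Before: a branch hint on a constant. The test is folded away, so the
 * hint has no branch left to influence. */
static inline int count_before(int n)
{
	int hits = 0;

	for (int i = 0; i < n; i++) {
		if (unlikely(sched_info_on_sketch()))
			hits++;
	}
	return hits;
}

/* After: same behaviour without the misleading annotation. */
static inline int count_after(int n)
{
	int hits = 0;

	for (int i = 0; i < n; i++) {
		if (sched_info_on_sketch())
			hits++;
	}
	return hits;
}
```

    Both variants compute the same result; only the annotation differs.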

    Signed-off-by: Yi Wang
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: up2wing@gmail.com
    Cc: wang.liang82@zte.com.cn
    Cc: xue.zhihong@zte.com.cn
    Link: https://lkml.kernel.org/r/1562301307-43002-1-git-send-email-wang.yi59@zte.com.cn
    Signed-off-by: Ingo Molnar

    Yi Wang
     

01 Dec, 2018

1 commit

  • Mel Gorman reports a hackbench regression with psi that would prohibit
    shipping the suse kernel with it default-enabled, but he'd still like
    users to be able to opt in at little to no cost to others.

    With the current combination of CONFIG_PSI and the psi_disabled bool set
    from the commandline, this is a challenge. Do the following things to
    make it easier:

    1. Add a config option CONFIG_PSI_DEFAULT_DISABLED that allows distros
    to enable CONFIG_PSI in their kernel but leave the feature disabled
    unless a user requests it at boot-time.

    To avoid double negatives, rename psi_disabled= to psi=.

    2. Make psi_disabled a static branch to eliminate any branch costs
    when the feature is disabled.
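
    The opt-in logic of point 1 can be sketched in userspace as follows.
    This is only an illustration: PSI_DEFAULT_DISABLED stands in for the
    new config option, psi_enabled_sketch() is a hypothetical helper, and
    the accepted spellings of the boolean are an assumption. The static
    branch of point 2 patches kernel code at runtime and is not modelled
    here.

```c
#include <stdbool.h>
#include <string.h>

/* Stand-in for a kernel built with CONFIG_PSI_DEFAULT_DISABLED=y. */
#define PSI_DEFAULT_DISABLED 1

/*
 * Decide whether psi ends up enabled. 'arg' is the value given to a
 * "psi=" boot parameter, or NULL when the parameter is absent.
 * Unrecognised values leave the build-time default untouched.
 */
static bool psi_enabled_sketch(const char *arg)
{
	bool enable = !PSI_DEFAULT_DISABLED;

	if (!arg)
		return enable;

	if (!strcmp(arg, "1") || !strcmp(arg, "y") || !strcmp(arg, "on"))
		enable = true;
	else if (!strcmp(arg, "0") || !strcmp(arg, "n") || !strcmp(arg, "off"))
		enable = false;

	return enable;
}
```

    With the default-disabled build, only an explicit psi=1 turns the
    feature on, which is the behaviour the positive parameter name is
    meant to make obvious.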

    In terms of numbers before and after this patch, Mel says:

    : The following is a comparison using CONFIG_PSI=n as a baseline against
    : your patch and a vanilla kernel
    :
    :                     4.20.0-rc4            4.20.0-rc4            4.20.0-rc4
    :            kconfigdisable-v1r1               vanilla       psidisable-v1r1
    : Amean 1      1.3100 (   0.00%)     1.3923 (  -6.28%)     1.3427 (  -2.49%)
    : Amean 3      3.8860 (   0.00%)     4.1230 *  -6.10%*     3.8860 (  -0.00%)
    : Amean 5      6.8847 (   0.00%)     8.0390 * -16.77%*     6.7727 (   1.63%)
    : Amean 7      9.9310 (   0.00%)    10.8367 *  -9.12%*     9.9910 (  -0.60%)
    : Amean 12    16.6577 (   0.00%)    18.2363 *  -9.48%*    17.1083 (  -2.71%)
    : Amean 18    26.5133 (   0.00%)    27.8833 *  -5.17%*    25.7663 (   2.82%)
    : Amean 24    34.3003 (   0.00%)    34.6830 (  -1.12%)    32.0450 (   6.58%)
    : Amean 30    40.0063 (   0.00%)    40.5800 (  -1.43%)    41.5087 (  -3.76%)
    : Amean 32    40.1407 (   0.00%)    41.2273 (  -2.71%)    39.9417 (   0.50%)
    :
    : It's showing that the vanilla kernel takes a hit (as the bisection
    : indicated it would) and that disabling PSI by default is reasonably
    : close in terms of performance for this particular workload on this
    : particular machine.

    Link: http://lkml.kernel.org/r/20181127165329.GA29728@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Tested-by: Mel Gorman
    Reported-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

27 Oct, 2018

1 commit

  • When systems are overcommitted and resources become contended, it's hard
    to tell exactly the impact this has on workload productivity, or how close
    the system is to lockups and OOM kills. In particular, when machines work
    multiple jobs concurrently, the impact of overcommit in terms of latency
    and throughput on the individual job can be enormous.

    In order to maximize hardware utilization without sacrificing individual
    job health or risk complete machine lockups, this patch implements a way
    to quantify resource pressure in the system.

    A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
    expose the percentage of time the system is stalled on CPU, memory, or IO,
    respectively. Stall states are aggregate versions of the per-task delay
    accounting delays:

    cpu: some tasks are runnable but not executing on a CPU
    memory: tasks are reclaiming, or waiting for swapin or thrashing cache
    io: tasks are waiting for io completions
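
    As a reading aid, a small sketch of parsing one line of such a file.
    The line layout ("some avg10=... avg60=... avg300=... total=...") is
    taken from the kernel's psi documentation rather than from this commit
    message, and struct psi_line and parse_psi_line() are hypothetical
    names:

```c
#include <stdio.h>

struct psi_line {
	char kind[8];			/* "some" or "full" */
	double avg10, avg60, avg300;	/* percentages of walltime */
	unsigned long long total;	/* cumulative stall time */
};

/* Parse one pressure line. Returns 0 on success, -1 on mismatch. */
static int parse_psi_line(const char *s, struct psi_line *out)
{
	int n = sscanf(s, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		       out->kind, &out->avg10, &out->avg60,
		       &out->avg300, &out->total);

	return n == 5 ? 0 : -1;
}
```

    A caller would read /proc/pressure/cpu, memory or io line by line and
    hand each line to the parser.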

    These percentages of walltime can be thought of as pressure percentages,
    and they give a general sense of system health and productivity loss
    incurred by resource overcommit. They can also indicate when the system
    is approaching lockup scenarios and OOMs.

    To do this, psi keeps track of the task states associated with each CPU
    and samples the time they spend in stall states. Every 2 seconds, the
    samples are averaged across CPUs - weighted by the CPUs' non-idle time to
    eliminate artifacts from unused CPUs - and translated into percentages of
    walltime. A running average of those percentages is maintained over 10s,
    1m, and 5m periods (similar to the loadaverage).
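
    The aggregation described above can be sketched as below. This is one
    plausible floating-point reading of the description, not the kernel
    implementation; weighted_stall_pct() and running_avg() are
    hypothetical helpers, and the decay factor is assumed to be
    exp(-period/window), roughly 0.8187 for a 2s period and a 10s window.

```c
/*
 * Combine per-CPU stall times into one system-wide percentage of the
 * sampling period. Each CPU is weighted by its non-idle time so that
 * unused CPUs do not dilute the result.
 */
static double weighted_stall_pct(const double *stall, const double *nonidle,
				 int ncpus, double period)
{
	double num = 0.0, den = 0.0;

	for (int i = 0; i < ncpus; i++) {
		num += stall[i] * nonidle[i];
		den += nonidle[i];
	}
	if (den == 0.0 || period <= 0.0)
		return 0.0;

	return 100.0 * (num / den) / period;
}

/*
 * One update step of an exponentially decaying running average,
 * in the style of the load average. 'decay' is exp(-period/window).
 */
static double running_avg(double avg, double sample, double decay)
{
	return avg * decay + sample * (1.0 - decay);
}
```

    With two CPUs where one is fully idle and the other stalls for 1s of
    its 2s of non-idle time, the weighted result is 50%, whereas a plain
    mean over both CPUs would report 25%.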

    [hannes@cmpxchg.org: doc fixlet, per Randy]
    Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
    [hannes@cmpxchg.org: code optimization]
    Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
    [hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
    Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
    [hannes@cmpxchg.org: fix build]
    Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
    Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Peter Zijlstra (Intel)
    Tested-by: Daniel Drake
    Tested-by: Suren Baghdasaryan
    Cc: Christopher Lameter
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Mike Galbraith
    Cc: Peter Enderborg
    Cc: Randy Dunlap
    Cc: Shakeel Butt
    Cc: Tejun Heo
    Cc: Vinayak Menon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

03 Mar, 2018

1 commit

  • A good number of small style inconsistencies have accumulated
    in the scheduler core, so do a pass over them to harmonize
    all these details:

    - fix spelling in comments,

    - use curly braces for multi-line statements,

    - remove unnecessary parentheses from integer literals,

    - capitalize consistently,

    - remove stray newlines,

    - add comments where necessary,

    - remove invalid/unnecessary comments,

    - align structure definitions and other data types vertically,

    - add missing newlines for increased readability,

    - fix vertical tabulation where it's misaligned,

    - harmonize preprocessor conditional block labeling
    and vertical alignment,

    - remove line-breaks where they uglify the code,

    - add newline after local variable definitions.

    No change in functionality:

    md5:
    1191fa0a890cfa8132156d2959d7e9e2 built-in.o.before.asm
    1191fa0a890cfa8132156d2959d7e9e2 built-in.o.after.asm

    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

06 Feb, 2018

2 commits

  • These functions are already gated by schedstat_enabled(), there is no
    point in then issuing another static_branch for every individual
    update in them.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The whole of ttwu_stat() is guarded by a single schedstat_enabled(),
    there is absolutely no point in then issuing another static_branch for
    every single schedstat_inc() in there.
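
    A userspace sketch of the idea, assuming a plain bool in place of the
    static branch. The counter 'checks' and struct stats_sketch are
    hypothetical and exist only to make the number of gate evaluations
    visible:

```c
#include <stdbool.h>

static bool schedstats_on;	/* stand-in for the static branch */
static int checks;		/* how often the gate was evaluated */

static bool schedstat_enabled(void)
{
	checks++;
	return schedstats_on;
}

/* Checked variant: evaluates the gate on every update. */
#define schedstat_inc(var)	do { if (schedstat_enabled()) { (var)++; } } while (0)
/* Unchecked variant: for callers that have already tested the gate. */
#define __schedstat_inc(var)	do { (var)++; } while (0)

struct stats_sketch {
	unsigned int ttwu_count;
	unsigned int ttwu_local;
};

/* One gate check up front, then unchecked increments. */
static void ttwu_stat_sketch(struct stats_sketch *s)
{
	if (!schedstat_enabled())
		return;

	__schedstat_inc(s->ttwu_count);
	__schedstat_inc(s->ttwu_local);
}
```

    Using schedstat_inc() inside the function would evaluate the gate
    three times per call instead of once.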

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few thousand files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

03 Mar, 2017

1 commit

  • …e.h> into <linux/sched/cputime.h>

    Move cputime related functionality out of <linux/sched.h>, as most code
    that includes <linux/sched.h> does not use that functionality.

    Move data types that are not included in task_struct directly to
    the signal definitions, into <linux/sched/signal.h>.

    Also merge the (small) existing <linux/cputime.h> header into <linux/sched/cputime.h>.

    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

21 Feb, 2017

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this (fairly busy) cycle were:

    - There was a class of scheduler bugs related to forgetting to update
    the rq-clock timestamp which can cause weird and hard to debug
    problems, so there's a new debug facility for this, which uncovered
    a whole lot of bugs and convinced us that we want to keep the
    debug facility.

    (Peter Zijlstra, Matt Fleming)

    - Various cputime related updates: eliminate cputime and use u64
    nanoseconds directly, simplify and improve the arch interfaces,
    implement delayed accounting more widely, etc. - (Frederic
    Weisbecker)

    - Move code around for better structure plus cleanups (Ingo Molnar)

    - Move IO schedule accounting deeper into the scheduler plus related
    changes to improve the situation (Tejun Heo)

    - ... plus a round of sched/rt and sched/deadline fixes, plus other
    fixes, updates and cleanups"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (85 commits)
    sched/core: Remove unlikely() annotation from sched_move_task()
    sched/autogroup: Rename auto_group.[ch] to autogroup.[ch]
    sched/topology: Split out scheduler topology code from core.c into topology.c
    sched/core: Remove unnecessary #include headers
    sched/rq_clock: Consolidate the ordering of the rq_clock methods
    delayacct: Include
    sched/core: Clean up comments
    sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in milliseconds
    sched/clock: Add dummy clear_sched_clock_stable() stub function
    sched/cputime: Remove generic asm headers
    sched/cputime: Remove unused nsec_to_cputime()
    s390, sched/cputime: Remove unused cputime definitions
    powerpc, sched/cputime: Remove unused cputime definitions
    s390, sched/cputime: Make arch_cpu_idle_time() to return nsecs
    ia64, sched/cputime: Remove unused cputime definitions
    ia64: Convert vtime to use nsec units directly
    ia64, sched/cputime: Move the nsecs based cputime headers to the last arch using it
    sched/cputime: Remove jiffies based cputime
    sched/cputime, vtime: Return nsecs instead of cputime_t to account
    sched/cputime: Complete nsec conversion of tick based accounting
    ...

    Linus Torvalds
     

01 Feb, 2017

1 commit

  • Use the new nsec based cputime accessors as part of the whole cputime
    conversion from cputime_t to nsecs.

    Also convert posix-cpu-timers to use nsec based internal counters to
    simplify it.

    Signed-off-by: Frederic Weisbecker
    Cc: Benjamin Herrenschmidt
    Cc: Fenghua Yu
    Cc: Heiko Carstens
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Stanislaw Gruszka
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1485832191-26889-19-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     

28 Jan, 2017

1 commit


05 Sep, 2016

2 commits

  • The schedstat_val() macro's behavior is kind of surprising: when
    schedstat is runtime disabled, it returns zero. Rename it to
    schedstat_val_or_zero().

    There's also a need for a similar macro which doesn't have the 'if
    (schedstat_enabled())' check, to avoid doing the check twice. Create a
    new 'schedstat_val()' macro for that.
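
    The difference between the two macros can be sketched in userspace
    like this, with a plain bool standing in for the runtime switch and a
    hypothetical struct se_stats_sketch holding one statistic:

```c
#include <stdbool.h>

static bool schedstats_on;	/* stand-in for the runtime switch */

#define schedstat_enabled()		(schedstats_on)

/* Raw value: for callers that already checked schedstat_enabled(). */
#define schedstat_val(var)		(var)
/* Guarded value: reads as zero while schedstats is runtime disabled. */
#define schedstat_val_or_zero(var)	((schedstat_enabled()) ? (var) : 0)

struct se_stats_sketch {
	unsigned long long wait_sum;
};
```

    The new name makes the "silently zero" behaviour visible at the call
    site, while schedstat_val() avoids a second check in code that is
    already gated.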

    Signed-off-by: Josh Poimboeuf
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Matt Fleming
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/3bb1d2367d041fee333b0dde17171e709395b675.1466184592.git.jpoimboe@redhat.com
    Signed-off-by: Ingo Molnar

    Josh Poimboeuf
     
  • The schedstat_*() macros are inconsistent: most of them take a pointer
    and a field which the macro combines, whereas schedstat_set() takes the
    already combined ptr->field.

    The already combined ptr->field argument is actually more intuitive and
    easier to use, and there's no reason to require the user to split the
    variable up, so convert the macros to use the combined argument.
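
    A sketch of the two calling conventions. The _old/_new suffixes and
    struct rq_sketch are hypothetical names used only to show both styles
    side by side:

```c
struct rq_sketch {
	unsigned int yld_count;
};

/* Old style: pointer and field are passed separately and the macro
 * glues them together. */
#define schedstat_inc_old(rq, field)	do { (rq)->field++; } while (0)

/* New style: the caller passes the already combined ptr->field. */
#define schedstat_inc_new(var)		do { (var)++; } while (0)
```

    The combined form also works on any lvalue, for example rq.yld_count
    on a local struct, which the split form cannot express without taking
    an address first.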

    Signed-off-by: Josh Poimboeuf
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Matt Fleming
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/54953ca25bb579f3a5946432dee409b0e05222c6.1466184592.git.jpoimboe@redhat.com
    Signed-off-by: Ingo Molnar

    Josh Poimboeuf
     

08 Jun, 2016

1 commit

  • Commit:

    cb2517653fcc ("sched/debug: Make schedstats a runtime tunable that is disabled by default")

    ... introduced a bug when CONFIG_SCHEDSTATS is enabled and the
    runtime tunable is disabled (which is the default).

    The wait-time, sum-exec, and sum-sleep fields are missing from the
    /proc/sched_debug file in the runnable_tasks section.

    Fix it with a new schedstat_val() macro which returns the field value
    when schedstats is enabled and zero otherwise. The macro works with
    both SCHEDSTATS and !SCHEDSTATS. I put the macro in stats.h since it
    might end up being useful in other places.

    Signed-off-by: Josh Poimboeuf
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Mel Gorman
    Cc: Linus Torvalds
    Cc: Matt Fleming
    Cc: Peter Zijlstra
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Fixes: cb2517653fcc ("sched/debug: Make schedstats a runtime tunable that is disabled by default")
    Link: http://lkml.kernel.org/r/bcda7c2790cf2ccbe586a28c02dd7b6fe7749a2b.1464994423.git.jpoimboe@redhat.com
    Signed-off-by: Ingo Molnar

    Josh Poimboeuf
     

09 Feb, 2016

1 commit

  • schedstats is very useful during debugging and performance tuning but it
    incurs overhead to calculate the stats. As such, even though it can be
    disabled at build time, it is often enabled as the information is useful.

    This patch adds a kernel command-line and sysctl tunable to enable or
    disable schedstats on demand (when it's built in). It is disabled
    by default as someone who knows they need it can also learn to enable
    it when necessary.

    The benefits are dependent on how scheduler-intensive the workload is.
    If it is then the patch reduces the number of cycles spent calculating
    the stats with a small benefit from reducing the cache footprint of the
    scheduler.

    These measurements were taken from a 48-core 2-socket
    machine with Xeon(R) E5-2670 v3 cpus although they were also tested on a
    single-socket 8-core machine with an Intel i7-3770 processor.

    netperf-tcp
                              4.5.0-rc1           4.5.0-rc1
                                vanilla        nostats-v3r1
    Hmean 64          560.45 ( 0.00%)     575.98 ( 2.77%)
    Hmean 128         766.66 ( 0.00%)     795.79 ( 3.80%)
    Hmean 256         950.51 ( 0.00%)     981.50 ( 3.26%)
    Hmean 1024       1433.25 ( 0.00%)    1466.51 ( 2.32%)
    Hmean 2048       2810.54 ( 0.00%)    2879.75 ( 2.46%)
    Hmean 3312       4618.18 ( 0.00%)    4682.09 ( 1.38%)
    Hmean 4096       5306.42 ( 0.00%)    5346.39 ( 0.75%)
    Hmean 8192      10581.44 ( 0.00%)   10698.15 ( 1.10%)
    Hmean 16384     18857.70 ( 0.00%)   18937.61 ( 0.42%)

    Small gains here, UDP_STREAM showed nothing interesting and neither did
    the TCP_RR tests. The gains on the 8-core machine were very similar.

    tbench4
                                    4.5.0-rc1           4.5.0-rc1
                                      vanilla        nostats-v3r1
    Hmean mb/sec-1          500.85 ( 0.00%)     522.43 ( 4.31%)
    Hmean mb/sec-2          984.66 ( 0.00%)    1018.19 ( 3.41%)
    Hmean mb/sec-4         1827.91 ( 0.00%)    1847.78 ( 1.09%)
    Hmean mb/sec-8         3561.36 ( 0.00%)    3611.28 ( 1.40%)
    Hmean mb/sec-16        5824.52 ( 0.00%)    5929.03 ( 1.79%)
    Hmean mb/sec-32       10943.10 ( 0.00%)   10802.83 (-1.28%)
    Hmean mb/sec-64       15950.81 ( 0.00%)   16211.31 ( 1.63%)
    Hmean mb/sec-128      15302.17 ( 0.00%)   15445.11 ( 0.93%)
    Hmean mb/sec-256      14866.18 ( 0.00%)   15088.73 ( 1.50%)
    Hmean mb/sec-512      15223.31 ( 0.00%)   15373.69 ( 0.99%)
    Hmean mb/sec-1024     14574.25 ( 0.00%)   14598.02 ( 0.16%)
    Hmean mb/sec-2048     13569.02 ( 0.00%)   13733.86 ( 1.21%)
    Hmean mb/sec-3072     12865.98 ( 0.00%)   13209.23 ( 2.67%)

    Small gains of 2-4% at low thread counts and otherwise flat. The
    gains on the 8-core machine were slightly different:

    tbench4 on 8-core i7-3770 single socket machine
    Hmean mb/sec-1          442.59 ( 0.00%)     448.73 ( 1.39%)
    Hmean mb/sec-2          796.68 ( 0.00%)     794.39 (-0.29%)
    Hmean mb/sec-4         1322.52 ( 0.00%)    1343.66 ( 1.60%)
    Hmean mb/sec-8         2611.65 ( 0.00%)    2694.86 ( 3.19%)
    Hmean mb/sec-16        2537.07 ( 0.00%)    2609.34 ( 2.85%)
    Hmean mb/sec-32        2506.02 ( 0.00%)    2578.18 ( 2.88%)
    Hmean mb/sec-64        2511.06 ( 0.00%)    2569.16 ( 2.31%)
    Hmean mb/sec-128       2313.38 ( 0.00%)    2395.50 ( 3.55%)
    Hmean mb/sec-256       2110.04 ( 0.00%)    2177.45 ( 3.19%)
    Hmean mb/sec-512       2072.51 ( 0.00%)    2053.97 (-0.89%)

    In contrast, this shows a relatively steady 2-3% gain at higher thread
    counts. Due to the nature of the patch and the type of workload, it's
    not a surprise that the result will depend on the CPU used.

    hackbench-pipes
                            4.5.0-rc1           4.5.0-rc1
                              vanilla        nostats-v3r1
    Amean 1         0.0637 ( 0.00%)     0.0660 (-3.59%)
    Amean 4         0.1229 ( 0.00%)     0.1181 ( 3.84%)
    Amean 7         0.1921 ( 0.00%)     0.1911 ( 0.52%)
    Amean 12        0.3117 ( 0.00%)     0.2923 ( 6.23%)
    Amean 21        0.4050 ( 0.00%)     0.3899 ( 3.74%)
    Amean 30        0.4586 ( 0.00%)     0.4433 ( 3.33%)
    Amean 48        0.5910 ( 0.00%)     0.5694 ( 3.65%)
    Amean 79        0.8663 ( 0.00%)     0.8626 ( 0.43%)
    Amean 110       1.1543 ( 0.00%)     1.1517 ( 0.22%)
    Amean 141       1.4457 ( 0.00%)     1.4290 ( 1.16%)
    Amean 172       1.7090 ( 0.00%)     1.6924 ( 0.97%)
    Amean 192       1.9126 ( 0.00%)     1.9089 ( 0.19%)

    Some small gains and losses and while the variance data is not included,
    it's close to the noise. The UMA machine did not show anything particularly
    different.

    pipetest
                                   4.5.0-rc1           4.5.0-rc1
                                     vanilla        nostats-v2r2
    Min         Time         4.13 ( 0.00%)       3.99 ( 3.39%)
    1st-qrtle   Time         4.38 ( 0.00%)       4.27 ( 2.51%)
    2nd-qrtle   Time         4.46 ( 0.00%)       4.39 ( 1.57%)
    3rd-qrtle   Time         4.56 ( 0.00%)       4.51 ( 1.10%)
    Max-90%     Time         4.67 ( 0.00%)       4.60 ( 1.50%)
    Max-93%     Time         4.71 ( 0.00%)       4.65 ( 1.27%)
    Max-95%     Time         4.74 ( 0.00%)       4.71 ( 0.63%)
    Max-99%     Time         4.88 ( 0.00%)       4.79 ( 1.84%)
    Max         Time         4.93 ( 0.00%)       4.83 ( 2.03%)
    Mean        Time         4.48 ( 0.00%)       4.39 ( 1.91%)
    Best99%Mean Time         4.47 ( 0.00%)       4.39 ( 1.91%)
    Best95%Mean Time         4.46 ( 0.00%)       4.38 ( 1.93%)
    Best90%Mean Time         4.45 ( 0.00%)       4.36 ( 1.98%)
    Best50%Mean Time         4.36 ( 0.00%)       4.25 ( 2.49%)
    Best10%Mean Time         4.23 ( 0.00%)       4.10 ( 3.13%)
    Best5%Mean  Time         4.19 ( 0.00%)       4.06 ( 3.20%)
    Best1%Mean  Time         4.13 ( 0.00%)       4.00 ( 3.39%)

    Small improvement and similar gains were seen on the UMA machine.

    The gain is small but it stands to reason that doing less work in the
    scheduler is a good thing. The downside is that the lack of schedstats and
    tracepoints may be surprising to experts doing performance analysis until
    they find the existence of the schedstats= parameter or schedstats sysctl.
    It will be automatically activated for latencytop and sleep profiling to
    alleviate the problem. For tracepoints, there is a simple warning as it's
    not safe to activate schedstats in the context when it's known the tracepoint
    may be wanted but is unavailable.

    Signed-off-by: Mel Gorman
    Reviewed-by: Matt Fleming
    Reviewed-by: Srikar Dronamraju
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1454663316-22048-1-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Ingo Molnar

    Mel Gorman
     

04 Jul, 2015

1 commit

  • Both CONFIG_SCHEDSTATS=y and CONFIG_TASK_DELAY_ACCT=y track task
    sched_info, which results in ugly #if clauses.

    Simplify the code by introducing a synthetic CONFIG_SCHED_INFO
    switch, selected by both.

    Signed-off-by: Naveen N. Rao
    Cc: Balbir Singh
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: a.p.zijlstra@chello.nl
    Cc: ricklind@us.ibm.com
    Link: http://lkml.kernel.org/r/8d19eef800811a94b0f91bcbeb27430a884d7433.1435255405.git.naveen.n.rao@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Naveen N. Rao
     

08 May, 2015

2 commits

  • Recent optimizations were made to thread_group_cputimer to improve its
    scalability by keeping track of cputime stats without a lock. However,
    the values were open coded to the structure, causing them to be at
    a different abstraction level from the regular task_cputime structure.
    Furthermore, any subsequent similar optimizations would not be able to
    share the new code, since they are specific to thread_group_cputimer.

    This patch adds the new task_cputime_atomic data structure (introduced in
    the previous patch in the series) to thread_group_cputimer for keeping
    track of the cputime atomically, which also helps generalize the code.

    Suggested-by: Ingo Molnar
    Signed-off-by: Jason Low
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Thomas Gleixner
    Acked-by: Rik van Riel
    Cc: Andrew Morton
    Cc: Aswin Chandramouleeswaran
    Cc: Borislav Petkov
    Cc: Davidlohr Bueso
    Cc: Frederic Weisbecker
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Mike Galbraith
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Preeti U Murthy
    Cc: Scott J Norton
    Cc: Steven Rostedt
    Cc: Waiman Long
    Link: http://lkml.kernel.org/r/1430251224-5764-6-git-send-email-jason.low2@hp.com
    Signed-off-by: Ingo Molnar

    Jason Low
     
  • While running a database workload, we found a scalability issue with itimers.

    Much of the problem was caused by the thread_group_cputimer spinlock.
    Each time we account for group system/user time, we need to obtain a
    thread_group_cputimer's spinlock to update the timers. On larger systems
    (such as a 16 socket machine), this caused more than 30% of total time
    spent trying to obtain this kernel lock to update these group timer stats.

    This patch converts the timers to 64-bit atomic variables and uses
    atomic add to update them without a lock. With this patch, the percent
    of total time spent updating thread group cputimer timers was reduced
    from 30% down to less than 1%.
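
    The lockless update can be sketched in userspace with C11 atomics.
    This is an analogy only: the kernel uses its own atomic64_t helpers,
    and struct cputime_atomic_sketch and the functions below are
    hypothetical names.

```c
#include <stdatomic.h>
#include <stdint.h>

struct cputime_atomic_sketch {
	_Atomic uint64_t utime;
	_Atomic uint64_t stime;
	_Atomic uint64_t sum_exec_runtime;
};

static void cputime_init_sketch(struct cputime_atomic_sketch *t)
{
	atomic_init(&t->utime, 0);
	atomic_init(&t->stime, 0);
	atomic_init(&t->sum_exec_runtime, 0);
}

/* Account user time for the whole thread group without a spinlock:
 * concurrent callers simply add to the shared counter. */
static void account_group_user_time_sketch(struct cputime_atomic_sketch *t,
					   uint64_t delta)
{
	atomic_fetch_add_explicit(&t->utime, delta, memory_order_relaxed);
}

/* Sample the current value; each field is read independently. */
static uint64_t sample_utime_sketch(struct cputime_atomic_sketch *t)
{
	return atomic_load_explicit(&t->utime, memory_order_relaxed);
}
```

    Because each field is sampled on its own, a reader may observe the
    three counters at slightly different instants, which is the price of
    dropping the lock.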

    Note: On 32-bit systems using the generic 64-bit atomics, this causes
    sample_group_cputimer() to take locks 3 times instead of just 1 time.
    However, we tested this patch on a 32-bit ARM system using the
    generic atomics and did not find the overhead to be much of an issue.
    An explanation for why this isn't an issue is that 32-bit systems usually
    have small numbers of CPUs, and cacheline contention from extra spinlocks
    called periodically is not really apparent on smaller systems.

    Signed-off-by: Jason Low
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Thomas Gleixner
    Acked-by: Rik van Riel
    Cc: Andrew Morton
    Cc: Aswin Chandramouleeswaran
    Cc: Borislav Petkov
    Cc: Davidlohr Bueso
    Cc: Frederic Weisbecker
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Mike Galbraith
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Preeti U Murthy
    Cc: Scott J Norton
    Cc: Steven Rostedt
    Cc: Waiman Long
    Link: http://lkml.kernel.org/r/1430251224-5764-4-git-send-email-jason.low2@hp.com
    Signed-off-by: Ingo Molnar

    Jason Low
     

25 Sep, 2013

1 commit

  • We always know the rq used, let's just pass it around.
    This seems to cut the size of scheduler core down a tiny bit:

    Before:

    [linux]$ size kernel/sched/core.o.orig
       text    data     bss     dec     hex filename
      62760   16130    3876   82766   1434e kernel/sched/core.o.orig

    After:

    [linux]$ size kernel/sched/core.o.patched
       text    data     bss     dec     hex filename
      62566   16130    3876   82572   1428c kernel/sched/core.o.patched

    Probably speeds it up as well.

    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20130922142054.GA11499@redhat.com
    Signed-off-by: Ingo Molnar

    Michael S. Tsirkin
     

16 Sep, 2013

1 commit

  • sched_info_depart seems to be only called from
    sched_info_switch(), so only on involuntary task switch.

    Fix the comment to match.

    Signed-off-by: Michael S. Tsirkin
    Cc: Peter Zijlstra
    Cc: Frederic Weisbecker
    Cc: KOSAKI Motohiro
    Link: http://lkml.kernel.org/r/20130916083036.GA1113@redhat.com
    Signed-off-by: Ingo Molnar

    Michael S. Tsirkin
     

07 Jul, 2013

1 commit

  • Pull timer core updates from Thomas Gleixner:
    "The timer changes contain:

    - posix timer code consolidation and fixes for odd corner cases

    - sched_clock implementation moved from ARM to core code to avoid
    duplication by other architectures

    - alarm timer updates

    - clocksource and clockevents unregistration facilities

    - clocksource/events support for new hardware

    - precise nanoseconds RTC readout (Xen feature)

    - generic support for Xen suspend/resume oddities

    - the usual lot of fixes and cleanups all over the place

    The parts which touch other areas (ARM/XEN) have been coordinated with
    the relevant maintainers. Though this results in a handful of
    trivial to solve merge conflicts, which we preferred over nasty cross
    tree merge dependencies.

    The patches which have been committed in the last few days are bug
    fixes plus the posix timer lot. The latter was in akpms queue and
    next for quite some time; they just got forgotten and Frederic
    collected them last minute."

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (59 commits)
    hrtimer: Remove unused variable
    hrtimers: Move SMP function call to thread context
    clocksource: Reselect clocksource when watchdog validated high-res capability
    posix-cpu-timers: don't account cpu timer after stopped thread runtime accounting
    posix_timers: fix racy timer delta caching on task exit
    posix-timers: correctly get dying task time sample in posix_cpu_timer_schedule()
    selftests: add basic posix timers selftests
    posix_cpu_timers: consolidate expired timers check
    posix_cpu_timers: consolidate timer list cleanups
    posix_cpu_timer: consolidate expiry time type
    tick: Sanitize broadcast control logic
    tick: Prevent uncontrolled switch to oneshot mode
    tick: Make oneshot broadcast robust vs. CPU offlining
    x86: xen: Sync the CMOS RTC as well as the Xen wallclock
    x86: xen: Sync the wallclock when the system time is set
    timekeeping: Indicate that clock was set in the pvclock gtod notifier
    timekeeping: Pass flags instead of multiple bools to timekeeping_update()
    xen: Remove clock_was_set() call in the resume path
    hrtimers: Support resuming with two or more CPUs online (but stopped)
    timer: Fix jiffies wrap behavior of round_jiffies_common()
    ...

    Linus Torvalds
     

05 Jul, 2013

1 commit

  • When tsk->signal->cputimer->running is 1, signal->cputimer (i.e. per process
    timer account) and tsk->sum_sched_runtime (i.e. per thread timer account)
    increase at the same pace because update_curr() updates both accounts.

    However, there is one exception. When a thread is exiting, __exit_signal() turns
    over the task's sum_sched_runtime to sig->sum_sched_runtime, but it doesn't stop
    signal->cputimer accounting.

    This inconsistency makes POSIX timers wake up too early. This patch fixes it.

    Original-patch-by: Olivier Langlois
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Acked-by: Peter Zijlstra
    Signed-off-by: Olivier Langlois
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Frederic Weisbecker

    KOSAKI Motohiro
     

28 May, 2013

1 commit

  • Read the runqueue clock through an accessor. This
    prepares for adding a debugging infrastructure to
    detect missing or redundant calls to update_rq_clock()
    between a scheduler's entry and exit point.
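
    A sketch of what such an accessor buys. struct rq_sketch and the
    clock_reads counter are hypothetical; the counter only marks the spot
    where a later debugging check could be hooked in once every reader
    goes through the accessor.

```c
#include <stdint.h>

struct rq_sketch {
	uint64_t clock;			/* runqueue clock, in nanoseconds */
	unsigned int clock_reads;	/* hypothetical debug bookkeeping */
};

/* Single point of access to the runqueue clock. */
static inline uint64_t rq_clock(struct rq_sketch *rq)
{
	rq->clock_reads++;	/* a debug facility could validate here */
	return rq->clock;
}
```

    Callers that used to read rq->clock directly would call rq_clock(rq)
    instead, with no change in the value they see.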

    Signed-off-by: Frederic Weisbecker
    Cc: Li Zhong
    Cc: Steven Rostedt
    Cc: Paul Turner
    Cc: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1365724262-20142-6-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     

20 Dec, 2011

1 commit


17 Nov, 2011

1 commit