02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boilerplate text.
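
    In practice the identifier is added at the very top of each file; for a
    C source file under the kernel's default license the added line looks
    like this (the comment style varies with the file type):

      // SPDX-License-Identifier: GPL-2.0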

    This patch is based on work done by Thomas Gleixner, Kate Stewart, and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - the file had no licensing information in it,
    - the file was a */uapi/* one with no licensing information in it,
    - the file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier should be applied
    to a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet and did an initial spot review of a few thousand files.

    The 4.13 kernel was the starting point of the analysis, with 60,537 files
    assessed. Kate Stewart did a file-by-file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    should be applied to each file. She confirmed any determination that was
    not immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source.
    - The file already had some variant of a license header in it (even if
    <5 lines).
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

14 Jan, 2017

1 commit

  • In preparation for adding diagnostic checks to catch missing calls to
    update_rq_clock(), provide wrappers for (re)pinning and unpinning
    rq->lock.

    Because the pending diagnostic checks allow state to be maintained in
    rq_flags across pin contexts, swap the 'struct pin_cookie' arguments
    for 'struct rq_flags *'.
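
    A rough sketch of the resulting wrappers (the cookie is carried inside
    rq_flags so the later diagnostic patches can stash extra state next to
    it; approximate, not the exact kernel code):

      struct rq_flags {
              unsigned long flags;
              struct pin_cookie cookie;
      };

      static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
      {
              rf->cookie = lockdep_pin_lock(&rq->lock);
      }

      static inline void rq_unpin_lock(struct rq *rq, struct rq_flags *rf)
      {
              lockdep_unpin_lock(&rq->lock, rf->cookie);
      }

      static inline void rq_repin_lock(struct rq *rq, struct rq_flags *rf)
      {
              lockdep_repin_lock(&rq->lock, rf->cookie);
      }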

    Signed-off-by: Matt Fleming
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Byungchul Park
    Cc: Frederic Weisbecker
    Cc: Jan Kara
    Cc: Linus Torvalds
    Cc: Luca Abeni
    Cc: Mel Gorman
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Petr Mladek
    Cc: Rik van Riel
    Cc: Sergey Senozhatsky
    Cc: Thomas Gleixner
    Cc: Wanpeng Li
    Cc: Yuyang Du
    Link: http://lkml.kernel.org/r/20160921133813.31976-5-matt@codeblueprint.co.uk
    Signed-off-by: Ingo Molnar

    Matt Fleming
     

30 Sep, 2016

1 commit

    select_idle_siblings() is a known pain point for a number of
    workloads; it either does too much or not enough, and sometimes it
    just gets it plain wrong.

    This rewrite attempts to address a number of issues (but sadly not
    all).

    The current code does an unconditional sched_domain iteration with
    the intent of finding an idle core (on SMT hardware). The problems
    which this patch tries to address are:

    - it's pointless to look for idle cores if the machine is really busy,
    at which point you're just wasting cycles.

    - its behaviour is inconsistent between SMT and !SMT hardware, in
    that !SMT hardware ends up doing a scan for any idle CPU in the LLC
    domain, while SMT hardware does a scan for idle cores and, if that
    fails, falls back to a scan for idle threads on the 'target' core.

    The new code replaces the sched_domain scan with 3 explicit scans:

    1) search for an idle core in the LLC
    2) search for an idle CPU in the LLC
    3) search for an idle thread in the 'target' core

    where 1 and 3 are conditional on SMT support and 1 and 2 have runtime
    heuristics to skip the step.

    Step 1) is conditional on sd_llc_shared->has_idle_cores; when a cpu
    goes idle and sd_llc_shared->has_idle_cores is false, we scan all SMT
    siblings of the CPU going idle. Similarly, we clear
    sd_llc_shared->has_idle_cores when we fail to find an idle core.

    Step 2) tracks the average cost of the scan and compares this to the
    average idle time guesstimate for the CPU doing the wakeup. There is a
    significant fudge factor involved to deal with the variability of the
    averages; hackbench in particular was sensitive to this.

    Step 3) is unconditional; we assume (also per step 1) that scanning
    all SMT siblings in a core is 'cheap'.

    With this, SMT systems gain step 2, which cures a few benchmarks --
    notably one from Facebook.

    One 'feature' of the sched_domain iteration, which we preserve in the
    new code, is that it would start scanning from the 'target' CPU,
    instead of scanning the cpumask in cpu id order. This keeps multiple
    CPUs in the LLC that are scanning for idle from ganging up and finding
    the same CPU quite as often. The downside is that tasks can end up
    hopping across the LLC for no apparent reason.
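
    A simplified sketch of the resulting structure (approximate; the
    runtime heuristics and some early-exit checks are elided):

      static int select_idle_sibling(struct task_struct *p, int prev, int target)
      {
              struct sched_domain *sd;
              int i;

              if (idle_cpu(target))
                      return target;

              sd = rcu_dereference(per_cpu(sd_llc, target));
              if (!sd)
                      return target;

              /* 1) idle core in the LLC (SMT only, gated by has_idle_cores) */
              i = select_idle_core(p, sd, target);
              if ((unsigned)i < nr_cpumask_bits)
                      return i;

              /* 2) any idle CPU in the LLC (skipped if the scan looks too costly) */
              i = select_idle_cpu(p, sd, target);
              if ((unsigned)i < nr_cpumask_bits)
                      return i;

              /* 3) idle SMT sibling of the target core */
              i = select_idle_smt(p, sd, target);
              if ((unsigned)i < nr_cpumask_bits)
                      return i;

              return target;
      }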

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

05 Sep, 2016

1 commit

  • The schedstat_*() macros are inconsistent: most of them take a pointer
    and a field which the macro combines, whereas schedstat_set() takes the
    already combined ptr->field.

    The already combined ptr->field argument is actually more intuitive and
    easier to use, and there's no reason to require the user to split the
    variable up, so convert the macros to use the combined argument.
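
    Illustrative before/after, using schedstat_inc() as an example (a
    sketch of the idea, not the exact macro bodies):

      /* before: the macro combines the pointer and the field */
      #define schedstat_inc(rq, field)   do { (rq)->field++; } while (0)
              schedstat_inc(rq, yld_count);

      /* after: the caller passes the already combined expression */
      #define schedstat_inc(var)         do { var++; } while (0)
              schedstat_inc(rq->yld_count);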

    Signed-off-by: Josh Poimboeuf
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Matt Fleming
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/54953ca25bb579f3a5946432dee409b0e05222c6.1466184592.git.jpoimboe@redhat.com
    Signed-off-by: Ingo Molnar

    Josh Poimboeuf
     

05 May, 2016

1 commit

    The problem with the existing lock pinning is that each pin is of
    value 1; this means you can simply unpin if you know it's pinned,
    without having any extra information.

    This scheme generates a random (16 bit) cookie for each pin and
    requires this same cookie to unpin. This means you have to keep the
    cookie in context.

    No objsize difference for !LOCKDEP kernels.
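
    A rough sketch of the usage pattern this introduces (simplified):

      struct pin_cookie cookie;

      cookie = lockdep_pin_lock(&rq->lock);   /* returns a random 16-bit cookie */
      /* ... rq->lock must not be dropped while pinned ... */
      lockdep_unpin_lock(&rq->lock, cookie);  /* only the matching cookie unpins */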

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

23 Nov, 2015

1 commit

  • Commit cd126afe838d ("sched/fair: Remove rq's runnable avg") got rid of
    rq->avg and so there is no need to update it any more when entering or
    exiting idle.

    Remove the now empty functions idle_{enter|exit}_fair().

    Signed-off-by: Dietmar Eggemann
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Yuyang Du
    Link: http://lkml.kernel.org/r/1445342681-17171-1-git-send-email-dietmar.eggemann@arm.com
    Signed-off-by: Ingo Molnar

    Dietmar Eggemann
     

12 Aug, 2015

1 commit

    Give every class a set_cpus_allowed() method; this enables a small
    optimization in the RT and DL implementations by avoiding a double
    cpumask_weight() call.
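
    A rough sketch of the shape of the change: a common helper that most
    classes point their method at, so the core code can call the method
    unconditionally (approximate, details simplified):

      void set_cpus_allowed_common(struct task_struct *p, const struct cpumask *new_mask)
      {
              cpumask_copy(&p->cpus_allowed, new_mask);
              p->nr_cpus_allowed = cpumask_weight(new_mask);  /* weight computed once */
      }

      /* core code, no NULL check needed any more: */
      p->sched_class->set_cpus_allowed(p, new_mask);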

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dedekind1@gmail.com
    Cc: juri.lelli@arm.com
    Cc: mgorman@suse.de
    Cc: riel@redhat.com
    Cc: rostedt@goodmis.org
    Link: http://lkml.kernel.org/r/20150515154833.614517487@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

24 Nov, 2014

1 commit

    Chris bisected a NULL pointer dereference in task_sched_runtime() to
    commit 6e998916dfe3 'sched/cputime: Fix clock_nanosleep()/clock_gettime()
    inconsistency'.

    Chris observed crashes in atop or other /proc walking programs when he
    started fork bombs on his machine. He assumed that this is a new exit
    race, but that does not make any sense when looking at that commit.

    What's interesting is that the commit provides update_curr callbacks
    for all scheduling classes except stop_task and idle_task.

    While nothing can ever hit that via the clock_nanosleep() and
    clock_gettime() interfaces, which have been the target of the commit in
    question, the author obviously forgot that there are other code paths
    which invoke task_sched_runtime():

    do_task_stat()
      thread_group_cputime_adjusted()
        thread_group_cputime()
          task_cputime()
            task_sched_runtime()
              if (task_current(rq, p) && task_on_rq_queued(p)) {
                      update_rq_clock(rq);
                      p->sched_class->update_curr(rq);
              }

    If the stats are read for a stomp machine task, aka 'migration/N', and
    that task is current on its cpu, this will happily call through the
    NULL stop_task->update_curr pointer. Oops.

    Chris' observation that this happens faster when he runs the fork bomb
    makes sense, as the fork bomb kicks the migration threads more often
    and so increases the probability of hitting the issue.

    Add the missing update_curr callbacks to the scheduler classes stop_task
    and idle_task. While idle tasks cannot be monitored via /proc, we have
    other means to hit the idle case.
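
    A sketch of what the fix amounts to: a no-op callback for the stop
    class, and likewise for the idle class (approximate):

      static void update_curr_stop(struct rq *rq)
      {
      }

      const struct sched_class stop_sched_class = {
              /* ... */
              .update_curr    = update_curr_stop,
      };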

    Fixes: 6e998916dfe3 'sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency'
    Reported-by: Chris Mason
    Reported-and-tested-by: Borislav Petkov
    Signed-off-by: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Stanislaw Gruszka
    Cc: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     

16 Jul, 2014

1 commit

    We always use resched_task() with the rq->curr argument.
    It's not possible to reschedule any task but the rq's current one.

    The patch introduces resched_curr(struct rq *) to
    replace all of the repeating patterns. The main aim
    is cleanup, but there is a small size benefit too:

    (before)
    $ size kernel/sched/built-in.o
    text data bss dec hex filename
    155274 16445 7042 178761 2ba49 kernel/sched/built-in.o

    $ size vmlinux
    text data bss dec hex filename
    7411490 1178376 991232 9581098 92322a vmlinux

    (after)
    $ size kernel/sched/built-in.o
    text data bss dec hex filename
    155130 16445 7042 178617 2b9b9 kernel/sched/built-in.o

    $ size vmlinux
    text data bss dec hex filename
    7411362 1178376 991232 9580970 9231aa vmlinux

    I was choosing between resched_curr() and resched_rq(),
    and the first name looks better to me.

    There is a small lie in Documentation/trace/ftrace.txt: I have not
    actually collected the tracing output again, in the hope that the
    patch won't make execution times much worse. :)
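
    At the call sites the cleanup looks like this:

      /* before */
      resched_task(rq->curr);

      /* after */
      resched_curr(rq);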

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Randy Dunlap
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/20140628200219.1778.18735.stgit@localhost
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     

11 Mar, 2014

1 commit


22 Feb, 2014

2 commits

  • Remove a few gratuitous #ifdefs in pick_next_task*().

    Cc: Ingo Molnar
    Cc: Steven Rostedt
    Cc: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-nnzddp5c4fijyzzxxrwlxghf@git.kernel.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • Dan Carpenter reported:

    > kernel/sched/rt.c:1347 pick_next_task_rt() warn: variable dereferenced before check 'prev' (see line 1338)
    > kernel/sched/deadline.c:1011 pick_next_task_dl() warn: variable dereferenced before check 'prev' (see line 1005)

    Kirill also spotted that migrate_tasks() will have an instant NULL
    deref because pick_next_task() will immediately deref prev.

    Instead of fixing all the corner cases caused by migrate_tasks()
    passing in a NULL prev task in the unlikely case of CPU hot-unplug,
    provide a fake task so that we can remove all the NULL checks from the
    far more common paths.

    A further problem, not previously spotted, is that because we pushed
    pre_schedule() and idle_balance() into pick_next_task() we now need to
    avoid those getting called and pulling more tasks onto our dying CPU.

    We avoid pull_{dl,rt}_task() by setting fake_task.prio to MAX_PRIO+1.
    We also note that since we call pick_next_task() exactly as many times
    as we have runnable tasks present, we should never land in
    idle_balance().
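
    A rough sketch of the fake task (approximate; its put_prev_task method
    is a no-op):

      static void put_prev_task_fake(struct rq *rq, struct task_struct *prev)
      {
      }

      static const struct sched_class fake_sched_class = {
              .put_prev_task = put_prev_task_fake,
      };

      static struct task_struct fake_task = {
              .prio           = MAX_PRIO + 1,         /* avoid pull_{dl,rt}_task() */
              .sched_class    = &fake_sched_class,
      };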

    Fixes: 38033c37faab ("sched: Push down pre_schedule() and idle_balance()")
    Cc: Juri Lelli
    Cc: Ingo Molnar
    Cc: Steven Rostedt
    Reported-by: Kirill Tkhai
    Reported-by: Dan Carpenter
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140212094930.GB3545@laptop.programming.kicks-ass.net
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     

11 Feb, 2014

1 commit

    This patch merges idle_balance() and pre_schedule() and pushes
    both of them into pick_next_task().

    Conceptually pre_schedule() and idle_balance() are rather similar;
    both are used to pull more work onto the current CPU.

    We cannot, however, simply move idle_balance() into pre_schedule_fair(),
    since there is no guarantee the last runnable task is a fair task, and
    thus we would miss newidle balances.

    Similarly, the dl and rt pre_schedule calls must be run before
    idle_balance(), since their respective tasks have higher priority and
    it would not do to delay their execution by searching for less
    important tasks first.

    However, by noticing that pick_next_task() already traverses the
    sched_class hierarchy in the right order, we can get the right
    behaviour and do away with both calls.

    We must, however, change the special-case optimization to also require
    that prev is of the fair sched class (fair_sched_class); otherwise we
    can miss doing a dl or rt pull where we needed one.
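
    A rough sketch of the special-case optimization referred to above
    (approximate):

      /*
       * Optimization: if prev and all runnable tasks are in the fair class,
       * ask that class directly and skip the other classes entirely.
       */
      if (likely(prev->sched_class == &fair_sched_class &&
                 rq->nr_running == rq->cfs.h_nr_running)) {
              p = fair_sched_class.pick_next_task(rq, prev);
              if (likely(p))
                      return p;
      }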

    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/n/tip-a8k6vvaebtn64nie345kx1je@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

10 Feb, 2014

2 commits

    The idle post_schedule flag is just a vile waste of time; furthermore,
    it appears unneeded, so move the idle_enter_fair() call into
    pick_next_task_idle().

    Signed-off-by: Peter Zijlstra
    Cc: Daniel Lezcano
    Cc: Vincent Guittot
    Cc: alex.shi@linaro.org
    Cc: mingo@kernel.org
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/n/tip-aljykihtxJt3mkokxi0qZurb@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In order to avoid having to do put/set on a whole cgroup hierarchy
    when we context switch, push the put into pick_next_task() so that
    both operations are in the same function. Further changes then allow
    us to possibly optimize away redundant work.
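
    The interface change, roughly (a sketch; each class's implementation
    now calls put_prev_task(rq, prev) itself before picking):

      /* before */
      struct task_struct *(*pick_next_task)(struct rq *rq);

      /* after */
      struct task_struct *(*pick_next_task)(struct rq *rq,
                                            struct task_struct *prev);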

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1328936700.2476.17.camel@laptop
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

09 Oct, 2013

1 commit

  • Use the new stop_two_cpus() to implement migrate_swap(), a function that
    flips two tasks between their respective cpus.

    I'm fairly sure there's a less crude way than employing the stop_two_cpus()
    method, but everything I tried either got horribly fragile and/or complex. So
    keep it simple for now.

    The notable detail is how we 'migrate' tasks that aren't runnable
    anymore. We'll make it appear like we migrated them before they went to
    sleep. The sole difference is the previous cpu in the wakeup path, so we
    override this.
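
    A rough sketch of the mechanism (approximate; the helper names and the
    stopper callback that actually flips the two tasks are simplified):

      struct migration_swap_arg {
              struct task_struct *src_task, *dst_task;
              int src_cpu, dst_cpu;
      };

      int migrate_swap(struct task_struct *cur, struct task_struct *p)
      {
              struct migration_swap_arg arg = {
                      .src_task = cur, .src_cpu = task_cpu(cur),
                      .dst_task = p,   .dst_cpu = task_cpu(p),
              };

              /* stop both CPUs and swap the two tasks from stopper context */
              return stop_two_cpus(arg.dst_cpu, arg.src_cpu, migrate_swap_stop, &arg);
      }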

    Signed-off-by: Peter Zijlstra
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Mel Gorman
    Link: http://lkml.kernel.org/r/1381141781-10992-39-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

04 May, 2013

1 commit

  • The scheduler doesn't yet fully support environments
    with a single task running without a periodic tick.

    In order to ensure we still maintain the duties of scheduler_tick(),
    keep at least 1 tick per second.

    This makes sure that we keep the progression of various scheduler
    accounting and background maintenance even with a very low granularity.
    Examples include cpu load, sched average, CFS entity vruntime,
    avenrun and events such as load balancing, amongst other details
    handled in sched_class::task_tick().

    This limitation will be removed in the future once we get
    these individual items to work on full dynticks CPUs.
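
    The core of the change is roughly a cap on how long the tick may be
    deferred (a sketch, approximate):

      u64 scheduler_tick_max_deferment(void)
      {
              struct rq *rq = this_rq();
              unsigned long next, now = READ_ONCE(jiffies);

              next = rq->last_sched_tick + HZ;        /* at most one second away */

              if (time_before_eq(next, now))
                      return 0;

              return jiffies_to_usecs(next - now) * NSEC_PER_USEC;
      }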

    Suggested-by: Ingo Molnar
    Signed-off-by: Frederic Weisbecker
    Cc: Christoph Lameter
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     

21 Apr, 2013

1 commit

  • The current update of the rq's load can be erroneous when RT
    tasks are involved.

    The update of the load of a rq that becomes idle is done only
    if the avg_idle is less than sysctl_sched_migration_cost. If RT
    tasks and short idle durations alternate, the runnable_avg will
    not be updated correctly and the time will be accounted as idle
    time when a CFS task wakes up.

    A new idle_enter function is called when the next task is the
    idle task, so the elapsed time is accounted as run time in the
    load of the rq, whatever the average idle time is. The call to
    update_rq_runnable_avg is removed from idle_balance.

    When an RT task is scheduled on an idle CPU, the update of the
    rq's load is not done when the rq exits idle state because CFS's
    functions are not called. Then idle_balance, which is called
    just before entering the idle function, updates the rq's load
    and assumes that the elapsed time since the last update was
    only running time.

    As a consequence, the rq's load of a CPU that only runs a
    periodic RT task, is close to LOAD_AVG_MAX whatever the running
    duration of the RT task is.

    A new idle_exit function is called when the prev task is the idle
    task, so the elapsed time is accounted as idle time in the rq's
    load.
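
    A sketch of the two hooks (approximate):

      void idle_enter_fair(struct rq *this_rq)
      {
              /* account the elapsed time as run time in the rq's load */
              update_rq_runnable_avg(this_rq, 1);
      }

      void idle_exit_fair(struct rq *this_rq)
      {
              /* account the elapsed time as idle time in the rq's load */
              update_rq_runnable_avg(this_rq, 0);
      }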

    Signed-off-by: Vincent Guittot
    Acked-by: Peter Zijlstra
    Acked-by: Steven Rostedt
    Cc: linaro-kernel@lists.linaro.org
    Cc: peterz@infradead.org
    Cc: pjt@google.com
    Cc: fweisbec@gmail.com
    Cc: efault@gmx.de
    Link: http://lkml.kernel.org/r/1366302867-5055-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     

06 Jul, 2012

1 commit

  • Thanks to Charles Wang for spotting the defects in the current code:

    - If we go idle during the sample window -- after sampling, we get a
    negative bias because we can negate our own sample.

    - If we wake up during the sample window we get a positive bias
    because we push the sample to a known active period.

    So rewrite the entire nohz load-avg muck once again, now adding
    copious documentation to the code.

    Reported-and-tested-by: Doug Smythies
    Reported-and-tested-by: Charles Wang
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: stable@kernel.org
    Link: http://lkml.kernel.org/r/1340373782.18025.74.camel@twins
    [ minor edits ]
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

07 May, 2012

1 commit


17 Nov, 2011

1 commit