09 Jan, 2015
2 commits
-
When alloc_fair_sched_group() in sched_create_group() fails,
free_sched_group() is called, and free_fair_sched_group() is called by
free_sched_group(). Since destroy_cfs_bandwidth() is called by
free_fair_sched_group() without calling init_cfs_bandwidth(),
RCU stall occurs at hrtimer_cancel():

INFO: rcu_sched self-detected stall on CPU { 1} (t=60000 jiffies g=13074 c=13073 q=0)
Task dump for CPU 1:
(fprintd) R running task 0 6249 1 0x00000088
...
Call Trace:
[] sched_show_task+0xa8/0x110
[] dump_cpu_task+0x3d/0x50
[] rcu_dump_cpu_stacks+0x90/0xd0
[] rcu_check_callbacks+0x491/0x700
[] update_process_times+0x4b/0x80
[] tick_sched_handle.isra.20+0x36/0x50
[] tick_sched_timer+0x42/0x70
[] __run_hrtimer+0x69/0x1a0
[] ? tick_sched_handle.isra.20+0x50/0x50
[] hrtimer_interrupt+0xef/0x230
[] local_apic_timer_interrupt+0x3b/0x70
[] smp_apic_timer_interrupt+0x45/0x60
[] apic_timer_interrupt+0x6d/0x80
[] ? lock_hrtimer_base.isra.23+0x18/0x50
[] ? __kmalloc+0x211/0x230
[] hrtimer_try_to_cancel+0x22/0xd0
[] ? __kmalloc+0x211/0x230
[] hrtimer_cancel+0x22/0x30
[] free_fair_sched_group+0x25/0xd0
[] free_sched_group+0x16/0x40
[] sched_create_group+0x4b/0x80
[] sched_autogroup_create_attach+0x43/0x1c0
[] sys_setsid+0x7c/0x110
[] system_call_fastpath+0x12/0x17

Check whether init_cfs_bandwidth() was called before calling
destroy_cfs_bandwidth().

Signed-off-by: Tetsuo Handa
[ Move the check into destroy_cfs_bandwidth() to aid compilability. ]
Signed-off-by: Peter Zijlstra (Intel)
Cc: Paul Turner
Cc: Ben Segall
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/201412252210.GCC30204.SOMVFFOtQJFLOH@I-love.SAKURA.ne.jp
Signed-off-by: Ingo Molnar
-
In effective_load(), we compute (long w * unsigned long tg->shares) / long W.
When w is negative, it is converted to unsigned long and hence the product is
insanely large. Fix this by casting tg->shares to long.

Reported-by: Sasha Levin
Signed-off-by: Yuyang Du
Signed-off-by: Peter Zijlstra (Intel)
Cc: Dave Jones
Cc: Andrey Ryabinin
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/20141219002956.GA25405@intel.com
Signed-off-by: Ingo Molnar
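
The conversion problem described for effective_load() can be reproduced in
plain C. This is a minimal sketch, not the kernel code: the helper names
scaled_buggy() and scaled_fixed() are hypothetical, and only the types of the
three operands follow the commit message.

```c
#include <assert.h>

/*
 * Hypothetical helpers (not kernel symbols) that mirror the expression
 * (long w * unsigned long shares) / long W from the commit message.
 */

/* Buggy form: w is converted to unsigned long before the multiply,
 * so a negative w turns into a huge positive value. */
static long scaled_buggy(long w, unsigned long shares, long W)
{
	assert(W != 0);
	return (w * shares) / W;
}

/* Fixed form: casting shares to long keeps the arithmetic signed. */
static long scaled_fixed(long w, unsigned long shares, long W)
{
	assert(W != 0);
	return (w * (long)shares) / W;
}
```

With w = -1, shares = 1024 and W = 2, the fixed form returns -512, while the
buggy form returns a very large positive number.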
16 Nov, 2014
6 commits
-
Commit caeb178c60f4 ("sched/fair: Make update_sd_pick_busiest() return
'true' on a busier sd") changes groups to be ranked in the order of
overloaded > imbalance > other, and busiest group is picked according
to this order.

sgs->group_capacity_factor is used to check if the group is overloaded.
When the child domain prefers tasks to go to siblings first, the
sgs->group_capacity_factor will be set lower than one in order to
move all the excess tasks away.

However, group overloaded status is not updated when
sgs->group_capacity_factor is set to lower than one, which leads to us
failing to find the busiest group.

This patch fixes it by updating group overloaded status when sg capacity
factor is set to one, in order to find the busiest group accurately.

Signed-off-by: Wanpeng Li
Signed-off-by: Peter Zijlstra (Intel)
Cc: Rik van Riel
Cc: Vincent Guittot
Cc: Kirill Tkhai
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/1415144690-25196-1-git-send-email-wanpeng.li@linux.intel.com
[ Fixed the changelog. ]
Signed-off-by: Ingo Molnar
-
Move the p->nr_cpus_allowed check into kernel/sched/core.c: select_task_rq().
This change will make fair.c, rt.c, and deadline.c all start with the
same logic.

Suggested-and-Acked-by: Steven Rostedt
Signed-off-by: Wanpeng Li
Signed-off-by: Peter Zijlstra (Intel)
Cc: "pang.xunlei"
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/1415150077-59053-1-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar
-
Nobody iterates over numa_group::task_list, this just confuses the readers.
Signed-off-by: Kirill Tkhai
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/1415358456.28592.17.camel@tkhai
Signed-off-by: Ingo Molnar
-
Signed-off-by: Ingo Molnar
-
Commit d670ec13178d0 "posix-cpu-timers: Cure SMP wobbles" fixes one glibc
test case at the cost of breaking another one. After that commit, calling
clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y) can result
in Y being smaller than X.

The reproducer/tester can be found further below; it can be compiled and run with:

  gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread
  while ./tst-cpuclock2 ; do : ; done

This reproducer, when running on a buggy kernel, will complain
about "clock_gettime difference too small".

The issue happens because on start, in thread_group_cputimer(), we initialize
sum_exec_runtime of the cputimer with thread runtime that is not yet accounted,
and then add that thread runtime to the running cputimer again on the scheduler
tick, making its sum_exec_runtime bigger than the actual thread runtime.

KOSAKI Motohiro posted a fix for this problem, but that patch was never
applied: https://lkml.org/lkml/2013/5/26/191 .

This patch takes a different approach to cure the problem. It calls
update_curr() when the cputimer starts, which ensures that we have updated
stats for the running threads, so that on the next scheduler tick we account
only the runtime that elapsed since the cputimer started. That also ensures we
have a consistent state between the CPU times of individual threads and the CPU
time of the process composed of those threads.

Full reproducer (tst-cpuclock2.c):
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <stdio.h>
#include <time.h>
#include <pthread.h>
#include <stdint.h>
#include <inttypes.h>

/* Parameters for the Linux kernel ABI for CPU clocks. */
#define CPUCLOCK_SCHED 2
#define MAKE_PROCESS_CPUCLOCK(pid, clock) \
	((~(clockid_t) (pid) << 3) | (clockid_t) (clock))

static pthread_barrier_t barrier;

/* Help advance the clock. */
static void *chew_cpu(void *arg)
{
	pthread_barrier_wait(&barrier);
	while (1) ;

	return NULL;
}

/* Don't use the glibc wrapper. */
static int do_nanosleep(int flags, const struct timespec *req)
{
	clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED);

	return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL);
}

static int64_t tsdiff(const struct timespec *before, const struct timespec *after)
{
	int64_t before_i = before->tv_sec * 1000000000ULL + before->tv_nsec;
	int64_t after_i = after->tv_sec * 1000000000ULL + after->tv_nsec;

	return after_i - before_i;
}

int main(void)
{
	int result = 0;
	pthread_t th;

	pthread_barrier_init(&barrier, NULL, 2);

	if (pthread_create(&th, NULL, chew_cpu, NULL) != 0) {
		perror("pthread_create");
		return 1;
	}

	pthread_barrier_wait(&barrier);

	/* The test. */
	struct timespec before, after, sleeptimeabs;
	int64_t sleepdiff, diffabs;
	const struct timespec sleeptime = {.tv_sec = 0,.tv_nsec = 100000000 };

	/* The relative nanosleep. Not sure why this is needed, but its presence
	   seems to make it easier to reproduce the problem. */
	if (do_nanosleep(0, &sleeptime) != 0) {
		perror("clock_nanosleep");
		return 1;
	}

	/* Get the current time. */
	if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &before) < 0) {
		perror("clock_gettime[2]");
		return 1;
	}

	/* Compute the absolute sleep time based on the current time. */
	uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec;
	sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000;
	sleeptimeabs.tv_nsec = nsec % 1000000000;

	/* Sleep for the computed time. */
	if (do_nanosleep(TIMER_ABSTIME, &sleeptimeabs) != 0) {
		perror("absolute clock_nanosleep");
		return 1;
	}

	/* Get the time after the sleep. */
	if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &after) < 0) {
		perror("clock_gettime[3]");
		return 1;
	}

	/* The time after sleep should always be equal to or after the absolute sleep
	   time passed to clock_nanosleep. */
	sleepdiff = tsdiff(&sleeptimeabs, &after);
	if (sleepdiff < 0) {
		printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff);
		result = 1;

		printf("Before %llu.%09llu\n", before.tv_sec, before.tv_nsec);
		printf("After %llu.%09llu\n", after.tv_sec, after.tv_nsec);
		printf("Sleep %llu.%09llu\n", sleeptimeabs.tv_sec, sleeptimeabs.tv_nsec);
	}

	/* The difference between the timestamps taken before and after the
	   clock_nanosleep call should be equal to or more than the duration of the
	   sleep. */
	diffabs = tsdiff(&before, &after);
	if (diffabs < sleeptime.tv_nsec) {
		printf("clock_gettime difference too small: %" PRId64 "\n", diffabs);
		result = 1;
	}

	pthread_cancel(th);

	return result;
}

Signed-off-by: Stanislaw Gruszka
Signed-off-by: Peter Zijlstra (Intel)
Cc: Rik van Riel
Cc: Frederic Weisbecker
Cc: KOSAKI Motohiro
Cc: Oleg Nesterov
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/20141112155843.GA24803@redhat.com
Signed-off-by: Ingo Molnar
-
Because the whole numa task selection stuff runs with preemption
enabled (it's long and expensive) we can end up migrating and selecting
oneself as a swap target. This doesn't really work out well -- we end
up trying to acquire the same lock twice for the swap migrate -- so
avoid this.

Reported-and-Tested-by: Sasha Levin
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/20141110100328.GF29390@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar
04 Nov, 2014
2 commits
-
This patch simplifies task_struct by removing the four numa_* pointers
into the same array and replacing them with a single pointer to the array.
On x86_64, this reduces the size of task_struct by 3 ulong pointers (24 bytes).

A new parameter is added to the task_faults_idx function so that it can return
an index to the correct offset, corresponding with the old precalculated
pointers.

All of the code in sched/ that depended on task_faults_idx and numa_* was
changed in order to match the new logic.

Signed-off-by: Iulia Manda
Signed-off-by: Peter Zijlstra (Intel)
Cc: mgorman@suse.de
Cc: dave@stgolabs.net
Cc: riel@redhat.com
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/20141031001331.GA30662@winterfell
Signed-off-by: Ingo Molnar
-
An idle CPU is idler than a non-idle CPU, so we needn't search for least_loaded_cpu
after we have found an idle CPU.

Signed-off-by: Yao Dongdong
Reviewed-by: Srikar Dronamraju
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/1414469286-6023-1-git-send-email-yaodongdong@huawei.com
Signed-off-by: Ingo Molnar
28 Oct, 2014
7 commits
-
In pseudo-interleaved numa_groups, all tasks try to relocate to
the group's preferred_nid. When a group is spread across multiple
NUMA nodes, this can lead to tasks swapping their location with
other tasks inside the same group, instead of swapping location with
tasks from other NUMA groups. This can keep NUMA groups from converging.

Examining all nodes, when dealing with a task in a pseudo-interleaved
NUMA group, avoids this problem. Note that only CPUs in nodes that
improve the task or group score are examined, so the loop isn't too
bad.

Tested-by: Vinod Chegu
Signed-off-by: Rik van Riel
Signed-off-by: Peter Zijlstra (Intel)
Cc: "Vinod Chegu"
Cc: mgorman@suse.de
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/20141009172747.0d97c38c@annuminas.surriel.com
Signed-off-by: Ingo Molnar
-
On systems with complex NUMA topologies, the node scoring is adjusted
to allow workloads to converge on nodes that are near each other.

The way a task group's preferred nid is determined needs to be adjusted,
in order for the preferred_nid to be consistent with group_weight scoring.
This ensures that we actually try to converge workloads on adjacent nodes.

Signed-off-by: Rik van Riel
Tested-by: Chegu Vinod
Signed-off-by: Peter Zijlstra (Intel)
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/1413530994-9732-6-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar
-
In order to do task placement on systems with complex NUMA topologies,
it is necessary to count the faults on nodes nearby the node that is
being examined for a potential move.

In case of a system with a backplane interconnect, we are dealing with
groups of NUMA nodes; each of the nodes within a group is the same number
of hops away from nodes in other groups in the system. Optimal placement
on this topology is achieved by counting all nearby nodes equally. When
comparing nodes A and B at distance N, nearby nodes are those at distances
smaller than N from nodes A or B.

Placement strategy on a system with a glueless mesh NUMA topology needs
to be different, because there are no natural groups of nodes determined
by the hardware. Instead, when dealing with two nodes A and B at distance
N, N >= 2, there will be intermediate nodes at distance < N from both nodes
A and B. Good placement can be achieved by right shifting the faults on
nearby nodes by the number of hops from the node being scored. In this
context, a nearby node is any node less than the maximum distance in the
system away from the node. Nodes at the maximum distance are skipped for
efficiency reasons; there is no real policy reason to do so.

Placement policy on directly connected NUMA systems is not affected.
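
The two scoring rules above can be sketched as follows. This is a simplified
illustration under stated assumptions, not the kernel implementation: the
names score_nearby(), enum topo and NR_NODES are hypothetical, distances are
given directly in hops, and the node's own faults are assumed to be counted
separately by the caller.

```c
#include <assert.h>

#define NR_NODES 4	/* hypothetical small system */

enum topo { TOPO_BACKPLANE, TOPO_GLUELESS_MESH };

/*
 * Sum the faults on nodes near `nid`. hops[a][b] is the hop count between
 * nodes a and b; nodes at `maxdist` hops or more are skipped.
 */
static unsigned long score_nearby(enum topo type, int nid, int maxdist,
				  const int hops[NR_NODES][NR_NODES],
				  const unsigned long faults[NR_NODES])
{
	unsigned long score = 0;
	int node;

	assert(nid >= 0 && nid < NR_NODES);

	for (node = 0; node < NR_NODES; node++) {
		int d = hops[nid][node];

		if (node == nid || d >= maxdist)
			continue;

		if (type == TOPO_GLUELESS_MESH)
			score += faults[node] >> d;	/* farther nodes count less */
		else
			score += faults[node];		/* backplane: count equally */
	}

	return score;
}
```

For a four-node chain with faults {100, 80, 40, 16} and maxdist = 3, scoring
node 0 gives 80/2 + 40/4 = 50 in the mesh case and 80 + 40 = 120 in the
backplane case.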
Signed-off-by: Rik van Riel
Tested-by: Chegu Vinod
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Link: http://lkml.kernel.org/r/1413530994-9732-5-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar
-
Preparatory patch for adding NUMA placement on systems with
complex NUMA topology. Also fix a potential divide by zero
in group_weight().

Signed-off-by: Rik van Riel
Tested-by: Chegu Vinod
Signed-off-by: Peter Zijlstra (Intel)
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/1413530994-9732-4-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar
-
File /proc/sys/kernel/numa_balancing_scan_size_mb allows writing of zero.
This bash command reproduces the problem:

$ while :; do echo 0 > /proc/sys/kernel/numa_balancing_scan_size_mb; \
  echo 256 > /proc/sys/kernel/numa_balancing_scan_size_mb; done

divide error: 0000 [#1] SMP
Modules linked in:
CPU: 0 PID: 24112 Comm: bash Not tainted 3.17.0+ #8
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff88013c852600 ti: ffff880037a68000 task.ti: ffff880037a68000
RIP: 0010:[] [] task_scan_min+0x21/0x50
RSP: 0000:ffff880037a6bce0 EFLAGS: 00010246
RAX: 0000000000000a00 RBX: 00000000000003e8 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88013c852600
RBP: ffff880037a6bcf0 R08: 0000000000000001 R09: 0000000000015c90
R10: ffff880239bf6c00 R11: 0000000000000016 R12: 0000000000003fff
R13: ffff88013c852600 R14: ffffea0008d1b000 R15: 0000000000000003
FS: 00007f12bb048700(0000) GS:ffff88007da00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000001505678 CR3: 0000000234770000 CR4: 00000000000006f0
Stack:
ffff88013c852600 0000000000003fff ffff880037a6bd18 ffffffff810741d1
ffff88013c852600 0000000000003fff 000000000002bfff ffff880037a6bda8
ffffffff81077ef7 ffffea0008a56d40 0000000000000001 0000000000000001
Call Trace:
[] task_scan_max+0x11/0x40
[] task_numa_fault+0x1f7/0xae0
[] ? migrate_misplaced_page+0x276/0x300
[] handle_mm_fault+0x62d/0xba0
[] __do_page_fault+0x191/0x510
[] ? native_smp_send_reschedule+0x42/0x60
[] ? check_preempt_curr+0x80/0xa0
[] ? wake_up_new_task+0x11c/0x1a0
[] ? do_fork+0x14d/0x340
[] ? get_unused_fd_flags+0x2b/0x30
[] ? __fd_install+0x1f/0x60
[] do_page_fault+0xc/0x10
[] page_fault+0x22/0x30
RIP [] task_scan_min+0x21/0x50
RSP
---[ end trace 9a826d16936c04de ]---

Also fix race in task_scan_min (it depends on compiler behaviour).
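
The two problems named here (a zero divisor and a value that may be re-read
between the check and the division) can be illustrated in user space. This is
a sketch under assumptions, not the kernel fix: scan_size_mb, scan_windows()
and the constant MAX_SCAN_WINDOW_MB are stand-ins chosen for the example, and
the zero is clamped at read time here purely for illustration.

```c
#include <assert.h>

#define MAX_SCAN_WINDOW_MB 2560U	/* assumed constant for this sketch */

/* Hypothetical stand-in for the tunable that another context may rewrite. */
static volatile unsigned int scan_size_mb = 256;

/* Number of scan windows; never divides by zero. */
static unsigned int scan_windows(void)
{
	/* Read the tunable once so the check and the division agree. */
	unsigned int size = scan_size_mb;
	unsigned int windows = 1;

	if (size == 0)
		size = 1;	/* treat zero as the smallest valid size */

	if (size < MAX_SCAN_WINDOW_MB)
		windows = MAX_SCAN_WINDOW_MB / size;

	assert(windows >= 1);
	return windows;
}
```

Copying the tunable into a local variable is the design point: without it, the
compiler may load the variable twice, and a concurrent write of 0 between the
comparison and the division would still fault.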
Signed-off-by: Kirill Tkhai
Signed-off-by: Peter Zijlstra (Intel)
Cc: Aaron Tomlin
Cc: Andrew Morton
Cc: Dario Faggioli
Cc: David Rientjes
Cc: Jens Axboe
Cc: Kees Cook
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Rik van Riel
Link: http://lkml.kernel.org/r/1413455977.24793.78.camel@tkhai
Signed-off-by: Ingo Molnar
-
While offlining a node by hot removing memory, the following divide error
occurs:

divide error: 0000 [#1] SMP
[...]
Call Trace:
[...] handle_mm_fault
[...] ? try_to_wake_up
[...] ? wake_up_state
[...] __do_page_fault
[...] ? do_futex
[...] ? put_prev_entity
[...] ? __switch_to
[...] do_page_fault
[...] page_fault
[...]
RIP [] task_numa_fault
RSP

The issue occurs as follows:

1. When page fault occurs and page is allocated from node 1,
   task_struct->numa_faults_buffer_memory[] of node 1 is
   incremented and p->numa_faults_locality[] is also incremented
   as follows:

   o numa_faults_buffer_memory[]       o numa_faults_locality[]
            NR_NUMA_HINT_FAULT_TYPES
           |      0     |     1     |
   ----------------------------------  ----------------------
    node 0 |      0     |     0     |   remote |      0     |
    node 1 |      0     |     1     |   local  |      1     |
   ----------------------------------  ----------------------

2. node 1 is offlined by hot removing memory.

3. When page fault occurs, fault_types[] is calculated by using
   p->numa_faults_buffer_memory[] of all online nodes in
   task_numa_placement(). But node 1 was offlined in step 2. So
   the fault_types[] is calculated by using only
   p->numa_faults_buffer_memory[] of node 0. So both of fault_types[]
   are set to 0.

4. The values (0) of fault_types[] are passed to update_task_scan_period().

5. numa_faults_locality[1] is set to 1. So the following division is
   calculated.

   static void update_task_scan_period(struct task_struct *p,
                   unsigned long shared, unsigned long private){
           ...
           ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
   }

6. But both of private and shared are set to 0. So divide error
   occurs here.

The divide error is a rare case because the trigger is a node going offline.
This patch always increments the denominator to avoid the divide error.

Signed-off-by: Yasuaki Ishimatsu
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/54475703.8000505@jp.fujitsu.com
Signed-off-by: Ingo Molnar
-
Unlocked access to dst_rq->curr in task_numa_compare() is racy.
If the curr task is exiting, this may cause a use-after-free:

task_numa_compare()                    do_exit()
    ...                                    current->flags |= PF_EXITING;
    ...                                    release_task()
    ...                                    ~~delayed_put_task_struct()~~
    ...                                    schedule()
    rcu_read_lock()                        ...
    cur = ACCESS_ONCE(dst_rq->curr)        ...
    ...                                    rq->curr = next;
    ...                                    context_switch()
    ...                                    finish_task_switch()
    ...                                    put_task_struct()
    ...                                    __put_task_struct()
    ...                                    free_task_struct()
    task_numa_assign()                     ...
    get_task_struct()                      ...

As noted by Oleg:
<
Signed-off-by: Kirill Tkhai
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/1413962231.19914.130.camel@tkhai
Signed-off-by: Ingo Molnar
15 Oct, 2014
1 commit
-
Pull percpu consistent-ops changes from Tejun Heo:
"Way back, before the current percpu allocator was implemented, static
and dynamic percpu memory areas were allocated and handled separately
and had their own accessors. The distinction has been gone for many
years now; however, the now duplicate two sets of accessors remained
with the pointer based ones - this_cpu_*() - evolving various other
operations over time. During the process, we also accumulated other
inconsistent operations.

This pull request contains Christoph's patches to clean up the
duplicate accessor situation. __get_cpu_var() uses are replaced with
this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr().

Unfortunately, the former sometimes is tricky thanks to C being a bit
messy with the distinction between lvalues and pointers, which led to
a rather ugly solution for cpumask_var_t involving the introduction of
this_cpu_cpumask_var_ptr().

This converts most of the uses but not all. Christoph will follow up
with the remaining conversions in this merge window and hopefully
remove the obsolete accessors"

* 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits)
irqchip: Properly fetch the per cpu offset
percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix
ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write.
percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t
Revert "powerpc: Replace __get_cpu_var uses"
percpu: Remove __this_cpu_ptr
clocksource: Replace __this_cpu_ptr with raw_cpu_ptr
sparc: Replace __get_cpu_var uses
avr32: Replace __get_cpu_var with __this_cpu_write
blackfin: Replace __get_cpu_var uses
tile: Use this_cpu_ptr() for hardware counters
tile: Replace __get_cpu_var uses
powerpc: Replace __get_cpu_var uses
alpha: Replace __get_cpu_var
ia64: Replace __get_cpu_var uses
s390: cio driver &__get_cpu_var replacements
s390: Replace __get_cpu_var uses
mips: Replace __get_cpu_var uses
MIPS: Replace __get_cpu_var uses in FPU emulator.
arm: Replace __this_cpu_ptr with raw_cpu_ptr
...
13 Oct, 2014
1 commit
-
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle were:

- Optimized support for Intel "Cluster-on-Die" (CoD) topologies (Dave
  Hansen)
- Various sched/idle refinements for better idle handling (Nicolas
  Pitre, Daniel Lezcano, Chuansheng Liu, Vincent Guittot)
- sched/numa updates and optimizations (Rik van Riel)
- sysbench speedup (Vincent Guittot)
- capacity calculation cleanups/refactoring (Vincent Guittot)
- Various cleanups to thread group iteration (Oleg Nesterov)
- Double-rq-lock removal optimization and various refactorings
  (Kirill Tkhai)
- various sched/deadline fixes

... and lots of other changes"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
sched/dl: Use dl_bw_of() under rcu_read_lock_sched()
sched/fair: Delete resched_cpu() from idle_balance()
sched, time: Fix build error with 64 bit cputime_t on 32 bit systems
sched: Improve sysbench performance by fixing spurious active migration
sched/x86: Fix up typo in topology detection
x86, sched: Add new topology for multi-NUMA-node CPUs
sched/rt: Use resched_curr() in task_tick_rt()
sched: Use rq->rd in sched_setaffinity() under RCU read lock
sched: cleanup: Rename 'out_unlock' to 'out_free_new_mask'
sched: Use dl_bw_of() under RCU read lock
sched/fair: Remove duplicate code from can_migrate_task()
sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSW
sched: print_rq(): Don't use tasklist_lock
sched: normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use task_rq_lock()
sched: Fix the task-group check in tg_has_rt_tasks()
sched/fair: Leverage the idle state info when choosing the "idlest" cpu
sched: Let the scheduler see CPU idle states
sched/deadline: Fix inter- exclusive cpusets migrations
sched/deadline: Clear dl_entity params when setscheduling to different class
sched/numa: Kill the wrong/dead TASK_DEAD check in task_numa_fault()
...
10 Oct, 2014
1 commit
-
1. vma_policy_mof(task) is simply not safe unless task == current,
   it can race with do_exit()->mpol_put(). Remove this arg and update
   its single caller.

2. vma can not be NULL, remove this check and simplify the code.

Signed-off-by: Oleg Nesterov
Cc: KAMEZAWA Hiroyuki
Cc: David Rientjes
Cc: KOSAKI Motohiro
Cc: Alexander Viro
Cc: Cyrill Gorcunov
Cc: "Eric W. Biederman"
Cc: "Kirill A. Shutemov"
Cc: Peter Zijlstra
Cc: Hugh Dickins
Cc: Andi Kleen
Cc: Naoya Horiguchi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
03 Oct, 2014
2 commits
-
We already reschedule env.dst_cpu in attach_tasks()->check_preempt_curr()
if this is necessary.

Furthermore, a higher priority class task may be current on dest rq,
we shouldn't disturb it.

Signed-off-by: Kirill Tkhai
Cc: Juri Lelli
Signed-off-by: Peter Zijlstra (Intel)
Link: http://lkml.kernel.org/r/20140930210441.5258.55054.stgit@localhost
Signed-off-by: Ingo Molnar
-
Since commit caeb178c60f4 ("sched/fair: Make update_sd_pick_busiest() ...")
sd_pick_busiest returns a group that can be neither imbalanced nor overloaded
but is only more loaded than others. This change has been introduced to ensure
a better load balance in systems that are not overloaded but as a side effect,
it can also generate useless active migration between groups.

Let's take the example of 3 tasks on a quad-core system. We will always have an
idle core, so the load balance will find a busiest group (core) whenever an ILB
is triggered and it will force an active migration (once above the
nr_balance_failed threshold), so the idle core becomes busy but another core
will become idle. With the next ILB, the freshly idle core will try to pull the
task of a busy CPU.

The number of spurious active migrations is not so huge in a quad-core system
because the ILB is not triggered so much. But it becomes significant as soon as
you have more than one sched_domain level, like on a dual cluster of quad cores
where the ILB is triggered every tick when you have more than 1 busy_cpu.

We need to ensure that the migration generates a real improvement and will not
only move the avg_load imbalance to another CPU.

Before caeb178c60f4f93f1b45c0bc056b5cf6d217b67f, the filtering of such use
case was ensured by the following test in f_b_g:

  if ((local->idle_cpus < busiest->idle_cpus) &&
      busiest->sum_nr_running <= busiest->group_weight)

This patch modifies the condition to take into account the situation where the
busiest group is not overloaded: if the difference between the number of idle
CPUs in the 2 groups is less than or equal to 1 and the busiest group is not
overloaded, moving a task will not improve the load balance but just move it.

A test with sysbench on a dual cluster of quad cores gives the following
results:

command: sysbench --test=cpu --num-threads=5 --max-time=5 run

The HZ is 200, which means that 1000 ticks have fired during the test.

With Mainline, perf gives the following figures:
Samples: 727 of event 'sched:sched_migrate_task'
Event count (approx.): 727
Overhead Command Shared Object Symbol
........ ............... ............. ..............
12.52% migration/1 [unknown] [.] 00000000
12.52% migration/5 [unknown] [.] 00000000
12.52% migration/7 [unknown] [.] 00000000
12.10% migration/6 [unknown] [.] 00000000
11.83% migration/0 [unknown] [.] 00000000
11.83% migration/3 [unknown] [.] 00000000
11.14% migration/4 [unknown] [.] 00000000
10.87% migration/2 [unknown] [.] 00000000
2.75% sysbench [unknown] [.] 00000000
0.83% swapper [unknown] [.] 00000000
0.55% ktps65090charge [unknown] [.] 00000000
0.41% mmcqd/1 [unknown] [.] 00000000
0.14% perf [unknown] [.] 00000000

With this patch, perf gives the following figures:
Samples: 20 of event 'sched:sched_migrate_task'
Event count (approx.): 20
Overhead Command Shared Object Symbol
........ ............... ............. ..............
80.00% sysbench [unknown] [.] 00000000
10.00% swapper [unknown] [.] 00000000
5.00% ktps65090charge [unknown] [.] 00000000
5.00% migration/1 [unknown] [.] 00000000

Signed-off-by: Vincent Guittot
Reviewed-by: Rik van Riel
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/1412170735-5356-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar
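
The filtering rule from the commit above (skip the migration when the busiest
group is not overloaded and the idle-CPU counts differ by at most 1) can be
sketched as a small predicate. This is an illustration only: struct
group_stats and migration_is_pointless() are hypothetical names, and the real
load balancer tracks many more statistics.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified per-group statistics. */
struct group_stats {
	unsigned int idle_cpus;	/* number of idle CPUs in the group */
	bool overloaded;	/* group has more tasks than capacity */
};

/*
 * Return true when pulling a task from `busiest` into `local` would only
 * move the imbalance around instead of reducing it.
 */
static bool migration_is_pointless(const struct group_stats *local,
				   const struct group_stats *busiest)
{
	assert(local && busiest);

	/* Written as an addition so the unsigned counters cannot underflow. */
	return !busiest->overloaded &&
	       local->idle_cpus <= busiest->idle_cpus + 1;
}
```

In the 3-tasks-on-4-cores example, the idle core sees one idle CPU locally and
none in the busiest group, so the predicate is true and no active migration is
forced.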
24 Sep, 2014
3 commits
-
Combine two branches which do the same.
Signed-off-by: Kirill Tkhai
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/20140922183612.11015.64200.stgit@localhost
Signed-off-by: Ingo Molnar
-
The code in find_idlest_cpu() looks for the CPU with the smallest load.
However, if multiple CPUs are idle, the first idle CPU is selected
irrespective of the depth of its idle state.

Among the idle CPUs we should pick the one with the shallowest idle
state, or the latest to have gone idle if all idle CPUs are in the same
state. The latter applies even when cpuidle is configured out.

This patch doesn't cover the following issues:

- The idle exit latency of a CPU might be larger than the time needed
  to migrate the waking task to an already running CPU with sufficient
  capacity, and therefore performance would benefit from task packing
  in such case (in most cases task packing is about power saving).

- Some idle states have a non negligible and non abortable entry latency
  which needs to run to completion before the exit latency can start.
  A concurrent patch series is making this info available to the cpuidle
  core. Once available, the entry latency with the idle timestamp could
  determine when the exit latency may be effective.

Those issues will be handled in due course. In the mean time, what
is implemented here should improve things already compared to the current
state of affairs.

Based on an initial patch from Daniel Lezcano.
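
The selection rule described at the top of this message can be sketched in
user-space C. This is a simplified illustration, not the kernel function:
struct cpu_info, its fields and pick_idlest_cpu() are hypothetical, and the
exit latency is assumed to be 0 for every idle CPU when no idle-state
information is available, so that the idle timestamp decides.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CPU snapshot; the fields are assumptions for this sketch. */
struct cpu_info {
	bool idle;
	unsigned int exit_latency;	/* exit latency of the current idle state */
	uint64_t idle_stamp;		/* time at which the CPU went idle */
	unsigned long load;		/* load of a busy CPU */
};

/* Prefer the shallowest, most recently idled CPU; else the least loaded one. */
static int pick_idlest_cpu(const struct cpu_info *cpus, int nr)
{
	int shallowest = -1, least_loaded = -1;
	unsigned int min_latency = UINT_MAX;
	unsigned long min_load = ULONG_MAX;
	uint64_t latest = 0;
	int i;

	assert(cpus && nr > 0);

	for (i = 0; i < nr; i++) {
		const struct cpu_info *c = &cpus[i];

		if (c->idle) {
			if (c->exit_latency < min_latency) {
				/* Shallower idle state: cheaper to wake up. */
				min_latency = c->exit_latency;
				latest = c->idle_stamp;
				shallowest = i;
			} else if (c->exit_latency == min_latency &&
				   c->idle_stamp > latest) {
				/* Same state: take the most recently idled CPU. */
				latest = c->idle_stamp;
				shallowest = i;
			}
		} else if (c->load < min_load) {
			min_load = c->load;
			least_loaded = i;
		}
	}

	return shallowest != -1 ? shallowest : least_loaded;
}
```

Given one busy CPU and three idle CPUs with exit latencies 200, 20 and 20, the
function returns whichever of the two latency-20 CPUs went idle last.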
Signed-off-by: Nicolas Pitre
Signed-off-by: Peter Zijlstra (Intel)
Cc: Daniel Lezcano
Cc: "Rafael J. Wysocki"
Cc: Linus Torvalds
Cc: linux-pm@vger.kernel.org
Cc: linaro-kernel@lists.linaro.org
Link: http://lkml.kernel.org/n/tip-@git.kernel.org
Signed-off-by: Ingo Molnar
-
current->state == TASK_DEAD means that the task is doing its
last schedule(), page fault is obviously impossible at this
stage.

Signed-off-by: Oleg Nesterov
Acked-by: Mel Gorman
Acked-by: Rik van Riel
Cc: Peter Zijlstra
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/20140921194743.GA30114@redhat.com
Signed-off-by: Ingo Molnar
21 Sep, 2014
1 commit
-
Signed-off-by: Zhihui Zhang
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/1411262676-19928-1-git-send-email-zzhsuny@gmail.com
Signed-off-by: Ingo Molnar
19 Sep, 2014
7 commits
-
Currently the task always wakes affine on this_cpu if the latter is idle.
Before waking up the task on this_cpu, we check that this_cpu capacity is not
significantly reduced because of RT tasks or irq activity.

Use cases where the number of irqs and/or the time spent under irq is important
will benefit from this because the task that is woken up by an irq or softirq
will not use the same CPU as the irq (and softirq) but an idle one.

Signed-off-by: Vincent Guittot
Signed-off-by: Peter Zijlstra (Intel)
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Cc: Morten.Rasmussen@arm.com
Cc: efault@gmx.de
Cc: nicolas.pitre@linaro.org
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/1409051215-16788-8-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar
-
'capacity_orig' is only changed for systems with an SMT sched_domain level in order
to reflect the lower capacity of CPUs. Heterogeneous systems also have to reflect an
original capacity that is different from the default value.

Create a more generic function arch_scale_cpu_capacity that can be also used by
non SMT platforms to set capacity_orig.

The __weak implementation of arch_scale_cpu_capacity() is the previous SMT variant,
in order to keep backward compatibility with the use of capacity_orig.

arch_scale_smt_capacity() and default_scale_smt_capacity() have been removed as
they were not used elsewhere than in arch_scale_cpu_capacity().

Signed-off-by: Vincent Guittot
Reviewed-by: Kamalesh Babulal
Reviewed-by: Preeti U. Murthy
[ Added default_scale_cpu_capacity() back. ]
Signed-off-by: Peter Zijlstra (Intel)
Cc: riel@redhat.com
Cc: Morten.Rasmussen@arm.com
Cc: efault@gmx.de
Cc: nicolas.pitre@linaro.org
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/1409051215-16788-5-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar
-
The computation of avg_load and avg_load_per_task should only take into
account the number of CFS tasks. The non-CFS tasks are already taken into
account by decreasing the CPU's capacity and they will be tracked in the
CPU's utilization (group_utilization) of the next patches.

Reviewed-by: Preeti U Murthy
Signed-off-by: Vincent Guittot
Signed-off-by: Peter Zijlstra (Intel)
Cc: riel@redhat.com
Cc: Morten.Rasmussen@arm.com
Cc: efault@gmx.de
Cc: nicolas.pitre@linaro.org
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/1409051215-16788-4-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar
-
In wake_affine() I have tried to understand the meaning of the condition:
(this_load
Signed-off-by: Peter Zijlstra (Intel)
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Cc: Morten.Rasmussen@arm.com
Cc: efault@gmx.de
Cc: nicolas.pitre@linaro.org
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/1409051215-16788-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar
-
The imbalance flag can stay set even when there is no imbalance.

Let's assume that we have 3 tasks that run on a dual-core/dual-cluster system.
We will have some idle load balances which are triggered during the tick.
Unfortunately, the tick is also used to queue background work, so we can reach
the situation where short work has been queued on a CPU which already runs a
task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle
CPU) and will try to pull the waiting task to the idle CPU. The waiting task is
a worker thread that is pinned to a CPU, so an imbalance due to a pinned task is
detected and the imbalance flag is set.

Then, we will not be able to clear the flag because we have at most 1 task on
each CPU, but the imbalance flag will trigger useless active load balancing
between the idle CPU and the busy CPU.

We need to reset the imbalance flag as soon as we have reached a balanced
state. If all tasks are pinned, we don't consider that as a balanced state and
leave the imbalance flag set.

Signed-off-by: Vincent Guittot
Reviewed-by: Preeti U Murthy
Signed-off-by: Peter Zijlstra (Intel)
Cc: riel@redhat.com
Cc: Morten.Rasmussen@arm.com
Cc: efault@gmx.de
Cc: nicolas.pitre@linaro.org
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/1409051215-16788-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar
-
new_cpu is reassigned below, so we do not need this here.
Signed-off-by: Kirill Tkhai
Signed-off-by: Peter Zijlstra (Intel)
Link: http://lkml.kernel.org/r/1410529276.3569.24.camel@tkhai
Signed-off-by: Ingo Molnar
-
The code in task_numa_compare() will only examine at most one idle CPU per node,
because they all have the same score. However, some idle CPUs are better
candidates than others, due to busy or idle SMT siblings, etc...

The scheduler has logic to find the best CPU within an LLC to place a
task. The NUMA code should probably use it.

This seems to reduce the standard deviation for single instance SPECjbb2005
with a low warehouse count on my 4 node test system.

Signed-off-by: Rik van Riel
Signed-off-by: Peter Zijlstra (Intel)
Cc: mgorman@suse.de
Cc: Mike Galbraith
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/20140904163530.189d410a@cuia.bos.redhat.com
Signed-off-by: Ingo Molnar
09 Sep, 2014
1 commit
-
When running workloads on 2+ socket systems, based on perf profiles, the
update_cfs_rq_blocked_load() function often shows up as taking up a
noticeable % of run time.

Much of the contention is in __update_cfs_rq_tg_load_contrib() when we
update the tg load contribution stats. However, it turns out that in many
cases, they don't need to be updated and "tg_contrib" is 0.

This patch adds a check in __update_cfs_rq_tg_load_contrib() to skip updating
tg load contribution stats when nothing needs to be updated. This avoids
unnecessary cacheline contention.

Reviewed-by: Ben Segall
Reviewed-by: Waiman Long
Signed-off-by: Jason Low
Signed-off-by: Peter Zijlstra
Cc: Paul Turner
Cc: jason.low2@hp.com
Cc: Yuyang Du
Cc: Aswin Chandramouleeswaran
Cc: Chegu Vinod
Cc: Scott J Norton
Cc: Tim Chen
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/1409643684.19197.15.camel@j-VirtualBox
Signed-off-by: Ingo Molnar
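
The early-return check described in this commit can be illustrated with a small user-space sketch. This is not the kernel code: the struct names (tg_sketch, cfs_rq_sketch), the function name update_tg_load_contrib() and its return value are hypothetical, and the 1/8 threshold on the non-forced path is an assumption about the surrounding logic rather than something stated in the changelog. The point is only that a zero delta returns before the shared atomic counter is touched, even when an update is forced.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for a task group: load_avg is shared by all CPUs,
 * so every write to it bounces the cacheline between sockets. */
struct tg_sketch {
	atomic_long load_avg;
};

/* Hypothetical stand-in for a per-CPU CFS runqueue. */
struct cfs_rq_sketch {
	struct tg_sketch *tg;
	long runnable_load_avg;
	long blocked_load_avg;
	long tg_load_contrib;	/* what this runqueue last reported to tg */
};

/* Returns 1 if the shared counter was written, 0 if the update was skipped. */
static int update_tg_load_contrib(struct cfs_rq_sketch *cfs_rq, int force_update)
{
	long tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg
			  - cfs_rq->tg_load_contrib;

	/* The added check: nothing changed, so do not touch the shared line. */
	if (!tg_contrib)
		return 0;

	/* Assumed threshold: only report changes larger than 1/8 of the
	 * previous contribution unless the caller forces the update. */
	if (force_update || labs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
		atomic_fetch_add(&cfs_rq->tg->load_avg, tg_contrib);
		cfs_rq->tg_load_contrib += tg_contrib;
		return 1;
	}
	return 0;
}
```

Without the zero check, a forced update with no change would still perform an atomic add of 0 on the shared counter, which is the contention the changelog describes.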
07 Sep, 2014
1 commit
-
An overrun could happen in function start_hrtick_dl()
when a task with SCHED_DEADLINE runs in the microseconds
range.

For example, if a task with SCHED_DEADLINE has the following parameters:

  Task  runtime  deadline  period
  P1    200us    500us     500us

The deadline and period of task P1 are less than 1ms.
In order to achieve microsecond precision, we need to enable the HRTICK feature
with the following commands:

  PC#echo "HRTICK" > /sys/kernel/debug/sched_features
  PC#trace-cmd record -e sched_switch &
  PC#./schedtool -E -t 200000:500000:500000 -e ./test

The binary test is in an endless while(1) loop here.
Some pieces of trace.dat are as follows:

  -0 157.603157: sched_switch: :R ==> 2481:4294967295: test
  test-2481 157.603203: sched_switch: 2481:R ==> 0:120: swapper/2
  -0 157.605657: sched_switch: :R ==> 2481:4294967295: test
  test-2481 157.608183: sched_switch: 2481:R ==> 2483:120: trace-cmd
  trace-cmd-2483 157.609656: sched_switch:2483:R==>2481:4294967295: test

We can get the runtime of P1 from the information above:

  runtime = 157.608183 - 157.605657
  runtime = 0.002526 (2.526ms)

The correct runtime should be less than or equal to 200us at some point.
The problem is caused by the conditional check "delta > 10000"
in function start_hrtick_dl(): no hrtimer is started to control the rest of
the runtime when the rest of the runtime is less than 10us, so the process
will continue to run until the next tick period.

Move the code with the limit of the least time slice
from hrtick_start_fair() to hrtick_start() because the
EDF schedule class also needs this function in start_hrtick_dl().

To fix this problem, we call hrtimer_start() unconditionally in
start_hrtick_dl(), and make sure the scheduling slice won't be smaller
than 10us in hrtimer_start().

Signed-off-by: Xiaofeng Yan
Reviewed-by: Li Zefan
Acked-by: Juri Lelli
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/1409022941-5880-1-git-send-email-xiaofeng.yan@huawei.com
[ Massaged the changelog and the code. ]
Signed-off-by: Ingo Molnar
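
The difference between the old conditional and the clamp described in this changelog can be sketched in user space as follows. This is a simplified model, not the kernel functions: old_timer_delay_ns(), clamped_timer_delay_ns() and MIN_SLICE_NS are hypothetical names, and a return value of 0 is used here to mean "no hrtimer armed".

```c
#include <assert.h>
#include <stdint.h>

/* 10us expressed in nanoseconds, matching the "delta > 10000" check. */
#define MIN_SLICE_NS 10000ULL

/* Model of the old behaviour: the timer is only armed when the value
 * exceeds 10us. Returns 0 when no timer would be armed, so the task
 * keeps running until the regular tick. */
static uint64_t old_timer_delay_ns(uint64_t delta_ns)
{
	return delta_ns > MIN_SLICE_NS ? delta_ns : 0;
}

/* Model of the fixed behaviour: the timer is always armed, and the
 * delay is raised to at least 10us so slices never get shorter. */
static uint64_t clamped_timer_delay_ns(uint64_t delay_ns)
{
	return delay_ns < MIN_SLICE_NS ? MIN_SLICE_NS : delay_ns;
}
```

With 5us remaining, the old model arms no timer at all, while the clamped model arms one for 10us; with 200us remaining, both arm a timer for 200us.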
05 Sep, 2014
1 commit
-
The use of "rcu_assign_pointer()" is NULLing out the pointer.
According to RCU_INIT_POINTER()'s block comment:

"1. This use of RCU_INIT_POINTER() is NULLing out the pointer"

it is better to use it instead of rcu_assign_pointer() because it has a
smaller overhead.

The following Coccinelle semantic patch was used:

@@
@@

- rcu_assign_pointer
+ RCU_INIT_POINTER
  (..., NULL)

Signed-off-by: Andreea-Cristina Bernat
Signed-off-by: Peter Zijlstra (Intel)
Cc: paulmck@linux.vnet.ibm.com
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/20140822145043.GA580@ada
Signed-off-by: Ingo Molnar
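
Why storing NULL can be cheaper may be easier to see in a user-space analogue built on C11 atomics. This is only an analogy, not the kernel macros: publish_cfg(), clear_cfg(), read_cfg() and struct config are hypothetical names, the release store stands in for the ordering that publishing an initialized object needs, and the acquire load is a stronger stand-in for the dependency ordering used by RCU readers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct config {
	int value;
};

static _Atomic(struct config *) shared_cfg;

/* Analogue of publishing with ordering: readers that see the new pointer
 * must also see the fields written before it, hence the release store. */
static void publish_cfg(struct config *cfg)
{
	atomic_store_explicit(&shared_cfg, cfg, memory_order_release);
}

/* Analogue of NULLing out the pointer: there is no pointed-to data whose
 * initialization must be ordered first, so a relaxed store is enough. */
static void clear_cfg(void)
{
	atomic_store_explicit(&shared_cfg, NULL, memory_order_relaxed);
}

/* Reader side: acquire is used here as a conservative stand-in. */
static struct config *read_cfg(void)
{
	return atomic_load_explicit(&shared_cfg, memory_order_acquire);
}
```

A reader that observes NULL never dereferences anything, which is why the ordering cost can be dropped for that case.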
28 Aug, 2014
1 commit
-
__get_cpu_var can paper over differences in the definitions of
cpumask_var_t and either use the address of the cpumask variable
directly or perform a fetch of the address of the struct cpumask
allocated elsewhere. This is important particularly when using per cpu
cpumask_var_t declarations because in one case we have an offset into
a per cpu area to handle and in the other case we need to fetch a
pointer from the offset.

This patch introduces a new macro

	this_cpu_cpumask_var_ptr()

that is defined where cpumask_var_t is defined and performs the proper
actions. All use cases where __get_cpu_var is used with cpumask_var_t
are converted to the use of this_cpu_cpumask_var_ptr().

Signed-off-by: Christoph Lameter
Signed-off-by: Tejun Heo
20 Aug, 2014
3 commits
-
Avoid double_rq_lock() and use TASK_ON_RQ_MIGRATING for
load_balance(). The advantage is (obviously) not holding two
rq->lock's at the same time and thereby increasing parallelism.

Further note that if there was no task to migrate we will not
have acquired the second rq->lock at all.

The important point to note is that because we acquire dst->lock
immediately after releasing src->lock the potential wait time of
task_rq_lock() callers on TASK_ON_RQ_MIGRATING is not longer
than it would have been in the double rq lock scenario.

Signed-off-by: Kirill Tkhai
Cc: Peter Zijlstra
Cc: Paul Turner
Cc: Oleg Nesterov
Cc: Steven Rostedt
Cc: Mike Galbraith
Cc: Kirill Tkhai
Cc: Tim Chen
Cc: Nicolas Pitre
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/1408528109.23412.94.camel@tkhai
Signed-off-by: Ingo Molnar
-
Avoid double_rq_lock() and use the TASK_ON_RQ_MIGRATING state for
active_load_balance_cpu_stop(). The advantage is (obviously) not
holding two 'rq->lock's at the same time and thereby increasing
parallelism.

Further note that if there was no task to migrate we will not
have acquired the second rq->lock at all.

The important point to note is that because we acquire dst->lock
immediately after releasing src->lock the potential wait time of
task_rq_lock() callers on TASK_ON_RQ_MIGRATING is not longer
than it would have been in the double rq lock scenario.

Signed-off-by: Kirill Tkhai
Cc: Peter Zijlstra
Cc: Paul Turner
Cc: Oleg Nesterov
Cc: Steven Rostedt
Cc: Mike Galbraith
Cc: Kirill Tkhai
Cc: Tim Chen
Cc: Nicolas Pitre
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/1408528081.23412.92.camel@tkhai
Signed-off-by: Ingo Molnar
-
Implement task_on_rq_queued() and use it everywhere instead of
on_rq check. No functional changes.

The only exception is that we do not use the wrapper in
check_for_tasks(), because that would require exporting
task_on_rq_queued() in global header files. The next patch in the series
would revert that change, so we avoid moving it back and forth.

Signed-off-by: Kirill Tkhai
Cc: Peter Zijlstra
Cc: Paul Turner
Cc: Oleg Nesterov
Cc: Steven Rostedt
Cc: Mike Galbraith
Cc: Kirill Tkhai
Cc: Tim Chen
Cc: Nicolas Pitre
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/1408528052.23412.87.camel@tkhai
Signed-off-by: Ingo Molnar
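
A minimal user-space sketch of the wrapper idea follows. The struct task_sketch type is hypothetical, and the numeric values of the two state constants are assumptions made for illustration. The sketch shows why a wrapper is more precise than a bare truth test on on_rq once a second non-zero state such as TASK_ON_RQ_MIGRATING exists.

```c
#include <assert.h>

/* Assumed values for illustration; only their being distinct and
 * non-zero matters for this sketch. */
#define TASK_ON_RQ_QUEUED	1
#define TASK_ON_RQ_MIGRATING	2

/* Hypothetical stand-in for the scheduler's per-task state. */
struct task_sketch {
	int on_rq;
};

/* Wrapper: true only when the task is actually queued on a runqueue. */
static inline int task_on_rq_queued(const struct task_sketch *p)
{
	return p->on_rq == TASK_ON_RQ_QUEUED;
}
```

A task in the migrating state has a non-zero on_rq, so a plain "if (p->on_rq)" would treat it as queued, while the wrapper does not.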