05 Jun, 2014

2 commits

  • tg_set_cfs_bandwidth() sets cfs_b->timer_active to 0 to
    force the period timer restart. This is not safe, because it
    can lead to the deadlock described in commit 927b54fccbf0:
    "__start_cfs_bandwidth calls hrtimer_cancel while holding rq->lock,
    waiting for the hrtimer to finish. However, if sched_cfs_period_timer
    runs for another loop iteration, the hrtimer can attempt to take
    rq->lock, resulting in deadlock."

    Three CPUs must be involved:

    CPU0                          CPU1                           CPU2
    take rq->lock                 period timer fired
    ...                           take cfs_b lock
    ...                           ...                            tg_set_cfs_bandwidth()
    throttle_cfs_rq()             release cfs_b lock             take cfs_b lock
    ...                           distribute_cfs_runtime()       timer_active = 0
    take cfs_b->lock              wait for rq->lock              ...
    __start_cfs_bandwidth()
    {wait for timer callback
     break if timer_active == 1}

    So, CPU0 and CPU1 are deadlocked.

    Instead of resetting cfs_b->timer_active, tg_set_cfs_bandwidth can
    wait for period timer callbacks (ignoring cfs_b->timer_active) and
    restart the timer explicitly.
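
    A rough sketch of the resulting restart path (illustrative only; the
    'force' argument and the exact loop follow the idea of the patch, not
    necessarily its final form):

        static void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, bool force)
        {
                /*
                 * Wait for a still-running period timer callback instead of
                 * clearing cfs_b->timer_active behind its back. With 'force'
                 * set (the tg_set_cfs_bandwidth() case) we keep waiting even
                 * if someone else already re-armed the timer.
                 */
                while (unlikely(hrtimer_active(&cfs_b->period_timer)) &&
                       hrtimer_try_to_cancel(&cfs_b->period_timer) < 0) {
                        /* bounce the lock so the callback can finish */
                        raw_spin_unlock(&cfs_b->lock);
                        cpu_relax();
                        raw_spin_lock(&cfs_b->lock);
                        if (!force && cfs_b->timer_active)
                                return;
                }

                cfs_b->timer_active = 1;
                start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
        }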

    Signed-off-by: Roman Gushchin
    Reviewed-by: Ben Segall
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/87wqdi9g8e.wl%klamm@yandex-team.ru
    Cc: pjt@google.com
    Cc: chris.j.arges@canonical.com
    Cc: gregkh@linuxfoundation.org
    Cc: Linus Torvalds
    Signed-off-by: Ingo Molnar

    Roman Gushchin
     
  • attr.sched_policy is u32, therefore a comparison against < 0 is never true.
    Fix this by casting sched_policy to int.

    This issue was reported by coverity CID 1219934.
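
    A minimal user-space illustration of the pitfall (not the kernel code
    itself, just the C signedness rule involved):

        #include <stdio.h>
        #include <stdint.h>

        int main(void)
        {
                uint32_t policy = (uint32_t)-1;     /* userspace passed -1 */

                if (policy < 0)                     /* never true: unsigned compare */
                        printf("rejected\n");
                if ((int)policy < 0)                /* the fix: compare as signed */
                        printf("rejected after cast\n");
                return 0;
        }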

    Fixes: dbdb22754fde ("sched: Disallow sched_attr::sched_policy < 0")
    Signed-off-by: Richard Weinberger
    Signed-off-by: Peter Zijlstra
    Cc: Michael Kerrisk
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1401741514-7045-1-git-send-email-richard@nod.at
    Signed-off-by: Ingo Molnar

    Richard Weinberger
     

22 May, 2014

5 commits

  • Lai found that:

    WARNING: CPU: 1 PID: 13 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x2d/0x4b()
    ...
    migration_cpu_stop+0x1d/0x22

    was caused by set_cpus_allowed_ptr() assuming that cpu_active_mask is
    always a sub-set of cpu_online_mask.

    This isn't true since 5fbd036b552f ("sched: Cleanup cpu_active madness").

    So set active and online at the same time to avoid this particular
    problem.
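
    The shape of the fix, sketched (illustrative; the exact hotplug hook that
    gains the extra call may differ):

        /* Mark the CPU active at the same point it is marked online, so
         * cpu_active_mask can never contain a CPU that is missing from
         * cpu_online_mask. */
        set_cpu_online(cpu, true);
        set_cpu_active(cpu, true);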

    Fixes: 5fbd036b552f ("sched: Cleanup cpu_active madness")
    Signed-off-by: Lai Jiangshan
    Signed-off-by: Peter Zijlstra
    Cc: Andrew Morton
    Cc: Gautham R. Shenoy
    Cc: Linus Torvalds
    Cc: Michael wang
    Cc: Paul Gortmaker
    Cc: Rafael J. Wysocki
    Cc: Srivatsa S. Bhat
    Cc: Toshi Kani
    Link: http://lkml.kernel.org/r/53758B12.8060609@cn.fujitsu.com
    Signed-off-by: Ingo Molnar

    Lai Jiangshan
     
  • Michael Kerrisk noticed that creating SCHED_DEADLINE reservations
    with certain parameters (e.g, a runtime of something near 2^64 ns)
    can cause a system freeze for some amount of time.

    The problem is that in the interface we have

    u64 sched_runtime;

    while internally we need to have a signed runtime (to cope with
    budget overruns)

    s64 runtime;

    At the time we set up a new dl_entity we copy the first value into
    the second. The cast produces negative values when sched_runtime
    is too big, and this causes the scheduler to go crazy right from
    the start.

    Moreover, considering how we deal with deadlines wraparound

    (s64)(a - b) < 0

    we also have to restrict acceptable values for sched_{deadline,period}.

    This patch fixes this by checking that user parameters are always
    below 2^63 ns (still large enough for everyone).

    It also rewrites other conditions that we check, since in
    __checkparam_dl we don't have to deal with deadline wraparounds
    and what we have now erroneously fails when the difference between
    values is too big.
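
    A sketch of the resulting checks (simplified; the real __checkparam_dl()
    may be structured differently):

        /* all user-supplied values must fit in a signed 64-bit ns quantity */
        if (attr->sched_deadline & (1ULL << 63) ||
            attr->sched_period   & (1ULL << 63))
                return false;

        /* runtime <= deadline <= period (if period != 0), no wraparound math */
        if ((attr->sched_period != 0 &&
             attr->sched_period < attr->sched_deadline) ||
            attr->sched_deadline < attr->sched_runtime)
                return false;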

    Reported-by: Michael Kerrisk
    Suggested-by: Peter Zijlstra
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Cc:
    Cc: Dario Faggioli
    Cc: Dave Jones
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140513141131.20d944f81633ee937f256385@gmail.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
  • The way we read POSIX, one should only call sched_getparam() when
    sched_getscheduler() returns either SCHED_FIFO or SCHED_RR.

    Given that we currently return sched_param::sched_priority=0 for all
    others, extend the same behaviour to SCHED_DEADLINE.
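
    The corresponding user-space pattern, per POSIX (a plain illustration,
    not taken from the patch):

        #include <sched.h>
        #include <stdio.h>

        int main(void)
        {
                struct sched_param sp;
                int policy = sched_getscheduler(0);

                if (policy == SCHED_FIFO || policy == SCHED_RR) {
                        sched_getparam(0, &sp);
                        printf("rt priority %d\n", sp.sched_priority);
                } else {
                        /* SCHED_OTHER, SCHED_BATCH, SCHED_IDLE and now also
                         * SCHED_DEADLINE report sched_priority as 0 */
                        sched_getparam(0, &sp);
                        printf("priority %d\n", sp.sched_priority);
                }
                return 0;
        }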

    Requested-by: Michael Kerrisk
    Signed-off-by: Peter Zijlstra
    Acked-by: Michael Kerrisk
    Cc: Dario Faggioli
    Cc: linux-man
    Cc: "Michael Kerrisk (man-pages)"
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc:
    Link: http://lkml.kernel.org/r/20140512205034.GH13467@laptop.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The scheduler uses policy = -1 to preserve the current policy state in
    order to implement sys_sched_setparam(). This got exposed to userspace
    by accident through sys_sched_setattr(); cure this.

    Reported-by: Michael Kerrisk
    Signed-off-by: Peter Zijlstra
    Acked-by: Michael Kerrisk
    Cc:
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140509085311.GJ30445@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The documented[1] behavior of sched_attr() in the proposed man page text is:

    sched_attr::size must be set to the size of the structure, as in
    sizeof(struct sched_attr), if the provided structure is smaller
    than the kernel structure, any additional fields are assumed
    '0'. If the provided structure is larger than the kernel structure,
    the kernel verifies all additional fields are '0'; if not, the
    syscall will fail with -E2BIG.

    As currently implemented, sched_copy_attr() returns -EFBIG for
    this case, but the logic in sys_sched_setattr() converts that
    error to -EFAULT. This patch fixes the behavior.

    [1] http://thread.gmane.org/gmane.linux.kernel/1615615/focus=1697760
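
    A simplified sketch of the intended size handling on the kernel side
    (helper name and structure are illustrative):

        static int sched_copy_attr_sketch(struct sched_attr __user *uattr,
                                          struct sched_attr *attr)
        {
                u32 size;

                if (get_user(size, &uattr->size))
                        return -EFAULT;
                if (size < SCHED_ATTR_SIZE_VER0 || size > PAGE_SIZE)
                        return -E2BIG;

                if (size > sizeof(*attr)) {
                        unsigned char __user *addr =
                                (unsigned char __user *)uattr + sizeof(*attr);
                        unsigned char __user *end =
                                (unsigned char __user *)uattr + size;
                        unsigned char val;

                        /* every byte the kernel does not know about must be 0,
                         * otherwise fail with -E2BIG (not -EFAULT) */
                        for (; addr < end; addr++) {
                                if (get_user(val, addr))
                                        return -EFAULT;
                                if (val)
                                        return -E2BIG;
                        }
                        size = sizeof(*attr);
                }

                /* smaller-than-kernel structs: missing fields read as 0 */
                memset(attr, 0, sizeof(*attr));
                if (copy_from_user(attr, uattr, size))
                        return -EFAULT;
                return 0;
        }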

    Signed-off-by: Michael Kerrisk
    Signed-off-by: Peter Zijlstra
    Cc:
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/536CEC17.9070903@gmail.com
    Signed-off-by: Ingo Molnar

    Michael Kerrisk
     

07 May, 2014

3 commits

  • Also initialize the per-sd variables for newidle load balancing
    in sd_numa_init().

    Signed-off-by: Jason Low
    Acked-by: morten.rasmussen@arm.com
    Cc: daniel.lezcano@linaro.org
    Cc: alex.shi@linaro.org
    Cc: preeti@linux.vnet.ibm.com
    Cc: efault@gmx.de
    Cc: vincent.guittot@linaro.org
    Cc: aswin@hp.com
    Cc: chegu_vinod@hp.com
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1398303035-18255-3-git-send-email-jason.low2@hp.com
    Signed-off-by: Ingo Molnar

    Jason Low
     
  • Tim wrote:

    "The current code will call pick_next_task_fair a second time in the
    slow path if we did not pull any task in our first try. This is
    really unnecessary as we already know no task can be pulled and it
    doubles the delay for the cpu to enter idle.

    We instrumented some network workloads and saw that
    pick_next_task_fair is frequently called twice before a cpu enters
    idle. The call to pick_next_task_fair can add non trivial latency as
    it calls load_balance which runs find_busiest_group on an hierarchy of
    sched domains spanning the cpus for a large system. For some 4 socket
    systems, we saw almost 0.25 msec spent per call of pick_next_task_fair
    before a cpu can be idled."

    Optimize the second call away for the common case and document the
    dependency.

    Reported-by: Tim Chen
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Len Brown
    Link: http://lkml.kernel.org/r/20140424100047.GP11096@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • yield_task_dl() is broken:

    o it forces current to be throttled, setting its runtime to zero;
    o it sets current's dl_se->dl_new to one, expecting that dl_task_timer()
    will queue it back with proper parameters at replenish time.

    Unfortunately, dl_task_timer() has this check at the very beginning:

    if (!dl_task(p) || dl_se->dl_new)
            goto unlock;

    So, it just bails out and the task is never replenished. It actually
    yielded forever.

    To fix this, introduce a new flag indicating that the task properly yielded
    the CPU before its current runtime expired. While this is a little more
    than needed at the moment, the flag would be useful in the future to
    discriminate between "good" jobs (whose remaining runtime could be
    reclaimed, i.e. recycled) and "bad" jobs (for which dl_throttled has been
    set) that needed to be
    stopped.
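
    A sketch of the resulting yield path (the dl_yielded flag name follows
    the changelog; the surrounding code is illustrative):

        static void yield_task_dl(struct rq *rq)
        {
                struct task_struct *p = rq->curr;

                /*
                 * Force the task to sleep until its current deadline, but
                 * mark this as a voluntary yield rather than pretending the
                 * entity is new, so the replenishment timer no longer bails
                 * out on its dl_new check.
                 */
                if (p->dl.runtime > 0) {
                        p->dl.dl_yielded = 1;
                        p->dl.runtime = 0;
                }
                update_curr_dl(rq);
        }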

    Reported-by: yjay.kim
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140429103953.e68eba1b2ac3309214e3dc5a@gmail.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     

24 Apr, 2014

1 commit

  • When the 'flags' argument to the sched_{set,get}attr() syscalls was
    added in:

    6d35ab48090b ("sched: Add 'flags' argument to sched_{set,get}attr() syscalls")

    no description for 'flags' was added. This causes the following warnings on "make htmldocs":

    Warning(/kernel/sched/core.c:3645): No description found for parameter 'flags'
    Warning(/kernel/sched/core.c:3789): No description found for parameter 'flags'

    Signed-off-by: Masanari Iida
    Cc: peterz@infradead.org
    Link: http://lkml.kernel.org/r/1397753955-2914-1-git-send-email-standby24x7@gmail.com
    Signed-off-by: Ingo Molnar

    Masanari Iida
     

08 Apr, 2014

3 commits

  • Merge second patch-bomb from Andrew Morton:
    - the rest of MM
    - zram updates
    - zswap updates
    - exit
    - procfs
    - exec
    - wait
    - crash dump
    - lib/idr
    - rapidio
    - adfs, affs, bfs, ufs
    - cris
    - Kconfig things
    - initramfs
    - small amount of IPC material
    - percpu enhancements
    - early ioremap support
    - various other misc things

    * emailed patches from Andrew Morton : (156 commits)
    MAINTAINERS: update Intel C600 SAS driver maintainers
    fs/ufs: remove unused ufs_super_block_third pointer
    fs/ufs: remove unused ufs_super_block_second pointer
    fs/ufs: remove unused ufs_super_block_first pointer
    fs/ufs/super.c: add __init to init_inodecache()
    doc/kernel-parameters.txt: add early_ioremap_debug
    arm64: add early_ioremap support
    arm64: initialize pgprot info earlier in boot
    x86: use generic early_ioremap
    mm: create generic early_ioremap() support
    x86/mm: sparse warning fix for early_memremap
    lglock: map to spinlock when !CONFIG_SMP
    percpu: add preemption checks to __this_cpu ops
    vmstat: use raw_cpu_ops to avoid false positives on preemption checks
    slub: use raw_cpu_inc for incrementing statistics
    net: replace __this_cpu_inc in route.c with raw_cpu_inc
    modules: use raw_cpu_write for initialization of per cpu refcount.
    mm: use raw_cpu ops for determining current NUMA node
    percpu: add raw_cpu_ops
    slub: fix leak of 'name' in sysfs_slab_add
    ...

    Linus Torvalds
     
  • To increase compiler portability there is <linux/compiler.h>, which
    provides convenience macros for various gcc constructs, e.g. __weak for
    __attribute__((weak)). I've replaced all instances of gcc attributes
    with the right macro in the kernel subsystem.

    Signed-off-by: Gideon Israel Dsouza
    Cc: "Rafael J. Wysocki"
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gideon Israel Dsouza
     
  • This is the final piece in the puzzle, as all patches to remove the
    last users of \(interruptible_\|\)sleep_on\(_timeout\|\) have made it
    into the 3.15 merge window. The work was long overdue, and this
    interface in particular should not have survived the BKL removal
    that was done a couple of years ago.

    Citing Jon Corbet from http://lwn.net/2001/0201/kernel.php3:

    "[...] it was suggested that the janitors look for and fix all code
    that calls sleep_on() [...] since (1) almost all such code is
    incorrect, and (2) Linus has agreed that those functions should
    be removed in the 2.5 development series".

    We haven't quite made it for 2.5, but maybe we can merge this for 3.15.

    Signed-off-by: Arnd Bergmann
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

04 Apr, 2014

1 commit

  • Pull cgroup updates from Tejun Heo:
    "A lot updates for cgroup:

    - The biggest one is cgroup's conversion to kernfs. cgroup took
    after the long abandoned vfs-entangled sysfs implementation and
    made it even more convoluted over time. cgroup's internal objects
    were fused with vfs objects which also brought in vfs locking and
    object lifetime rules. Naturally, there are places where vfs rules
    don't fit and nasty hacks, such as credential switching or lock
    dance interleaving inode mutex and cgroup_mutex with object serial
    number comparison thrown in to decide whether the operation is
    actually necessary, needed to be employed.

    After conversion to kernfs, internal object lifetime and locking
    rules are mostly isolated from vfs interactions allowing shedding
    of several nasty hacks and overall simplification. This will also
    allow implementation of operations which may affect multiple cgroups
    which weren't possible before as it would have required nesting
    i_mutexes.

    - Various simplifications including dropping of module support,
    easier cgroup name/path handling, simplified cgroup file type
    handling and task_cg_lists optimization.

    - Preparatory changes for the planned unified hierarchy, which is still
    a patchset away from being actually operational. The dummy
    hierarchy is updated to serve as the default unified hierarchy.
    Controllers which aren't claimed by other hierarchies are
    associated with it, which BTW was what the dummy hierarchy was for
    anyway.

    - Various fixes from Li and others. This pull request includes some
    patches to add missing slab.h to various subsystems. This was
    triggered by the xattr.h include removal from cgroup.h. cgroup.h
    indirectly got included into a lot of files, which brought in xattr.h,
    which brought in slab.h.

    There are several merge commits - one to pull in kernfs updates
    necessary for converting cgroup (already in upstream through
    driver-core), others for interfering changes in the fixes branch"

    * 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (74 commits)
    cgroup: remove useless argument from cgroup_exit()
    cgroup: fix spurious lockdep warning in cgroup_exit()
    cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c
    cgroup: break kernfs active_ref protection in cgroup directory operations
    cgroup: fix cgroup_taskset walking order
    cgroup: implement CFTYPE_ONLY_ON_DFL
    cgroup: make cgrp_dfl_root mountable
    cgroup: drop const from @buffer of cftype->write_string()
    cgroup: rename cgroup_dummy_root and related names
    cgroup: move ->subsys_mask from cgroupfs_root to cgroup
    cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding
    cgroup: remove NULL checks from [pr_cont_]cgroup_{name|path}()
    cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root
    cgroup: reorganize cgroup bootstrapping
    cgroup: relocate setting of CGRP_DEAD
    cpuset: use rcu_read_lock() to protect task_cs()
    cgroup_freezer: document freezer_fork() subtleties
    cgroup: update cgroup_transfer_tasks() to either succeed or fail
    cgroup: drop task_lock() protection around task->cgroups
    cgroup: update how a newly forked task gets associated with css_set
    ...

    Linus Torvalds
     

02 Apr, 2014

3 commits

  • Pull core block layer updates from Jens Axboe:
    "This is the pull request for the core block IO bits for the 3.15
    kernel. It's a smaller round this time, it contains:

    - Various little blk-mq fixes and additions from Christoph and
    myself.

    - Cleanup of the IPI usage from the block layer, and associated
    helper code. From Frederic Weisbecker and Jan Kara.

    - Duplicate code cleanup in bio-integrity from Gu Zheng. This will
    give you a merge conflict, but that should be easy to resolve.

    - blk-mq notify spinlock fix for RT from Mike Galbraith.

    - A blktrace partial accounting bug fix from Roman Pen.

    - Missing REQ_SYNC detection fix for blk-mq from Shaohua Li"

    * 'for-3.15/core' of git://git.kernel.dk/linux-block: (25 commits)
    blk-mq: add REQ_SYNC early
    rt,blk,mq: Make blk_mq_cpu_notify_lock a raw spinlock
    blk-mq: support partial I/O completions
    blk-mq: merge blk_mq_insert_request and blk_mq_run_request
    blk-mq: remove blk_mq_alloc_rq
    blk-mq: don't dump CPU -> hw queue map on driver load
    blk-mq: fix wrong usage of hctx->state vs hctx->flags
    blk-mq: allow blk_mq_init_commands() to return failure
    block: remove old blk_iopoll_enabled variable
    blktrace: fix accounting of partially completed requests
    smp: Rename __smp_call_function_single() to smp_call_function_single_async()
    smp: Remove wait argument from __smp_call_function_single()
    watchdog: Simplify a little the IPI call
    smp: Move __smp_call_function_single() below its safe version
    smp: Consolidate the various smp_call_function_single() declensions
    smp: Teach __smp_call_function_single() to check for offline cpus
    smp: Remove unused list_head from csd
    smp: Iterate functions through llist_for_each_entry_safe()
    block: Stop abusing rq->csd.list in blk-softirq
    block: Remove useless IPI struct initialization
    ...

    Linus Torvalds
     
  • Pull timer changes from Thomas Gleixner:
    "This assorted collection provides:

    - A new timer based timer broadcast feature for systems which do not
    provide a global accessible timer device. That allows those
    systems to put CPUs into deep idle states where the per cpu timer
    device stops.

    - A few NOHZ_FULL related improvements to the timer wheel

    - The usual updates to timer devices found in ARM SoCs

    - Small improvements and updates all over the place"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
    tick: Remove code duplication in tick_handle_periodic()
    tick: Fix spelling mistake in tick_handle_periodic()
    x86: hpet: Use proper destructor for delayed work
    workqueue: Provide destroy_delayed_work_on_stack()
    clocksource: CMT, MTU2, TMU and STI should depend on GENERIC_CLOCKEVENTS
    timer: Remove code redundancy while calling get_nohz_timer_target()
    hrtimer: Rearrange comments in the order struct members are declared
    timer: Use variable head instead of &work_list in __run_timers()
    clocksource: exynos_mct: silence a static checker warning
    arm: zynq: Add support for cpufreq
    arm: zynq: Don't use arm_global_timer with cpufreq
    clocksource/cadence_ttc: Overhaul clocksource frequency adjustment
    clocksource/cadence_ttc: Call clockevents_update_freq() with IRQs enabled
    clocksource: Add Kconfig entries for CMT, MTU2, TMU and STI
    sh: Remove Kconfig entries for TMU, CMT and MTU2
    ARM: shmobile: Remove CMT, TMU and STI Kconfig entries
    clocksource: armada-370-xp: Use atomic access for shared registers
    clocksource: orion: Use atomic access for shared registers
    clocksource: timer-keystone: Delete unnecessary variable
    clocksource: timer-keystone: introduce clocksource driver for Keystone
    ...

    Linus Torvalds
     
  • Pull timer updates from Ingo Molnar:
    "The main purpose is to fix a full dynticks bug related to
    virtualization, where steal time accounting appears to be zero in
    /proc/stat even after a few seconds of competing guests running busy
    loops in a same host CPU. It's not a regression though as it was
    there since the beginning.

    The other commits are preparatory work to fix the bug and various
    cleanups"

    * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    arch: Remove stub cputime.h headers
    sched: Remove needless round trip nsecs tick conversion of steal time
    cputime: Fix jiffies based cputime assumption on steal accounting
    cputime: Bring cputime -> nsecs conversion
    cputime: Default implementation of nsecs -> cputime conversion
    cputime: Fix nsecs_to_cputime() return type cast

    Linus Torvalds
     

01 Apr, 2014

1 commit

  • Pull s390 updates from Martin Schwidefsky:
    "There are two memory management related changes, the CMMA support for
    KVM to avoid swap-in of freed pages and the split page table lock for
    the PMD level. These two come with common code changes in mm/.

    A fix for the long standing theoretical TLB flush problem, this one
    comes with a common code change in kernel/sched/.

    Another set of changes is Heiko's uaccess work, included is the initial
    set of patches with more to come.

    And fixes and cleanups as usual"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (36 commits)
    s390/con3270: optionally disable auto update
    s390/mm: remove unecessary parameter from pgste_ipte_notify
    s390/mm: remove unnecessary parameter from gmap_do_ipte_notify
    s390/mm: fixing comment so that parameter name match
    s390/smp: limit number of cpus in possible cpu mask
    hypfs: Add clarification for "weight_min" attribute
    s390: update defconfigs
    s390/ptrace: add support for PTRACE_SINGLEBLOCK
    s390/perf: make print_debug_cf() static
    s390/topology: Remove call to update_cpu_masks()
    s390/compat: remove compat exec domain
    s390: select CONFIG_TTY for use of tty in unconditional keyboard driver
    s390/appldata_os: fix cpu array size calculation
    s390/checksum: remove memset() within csum_partial_copy_from_user()
    s390/uaccess: remove copy_from_user_real()
    s390/sclp_early: Return correct HSA block count also for zero
    s390: add some drivers/subsystems to the MAINTAINERS file
    s390: improve debug feature usage
    s390/airq: add support for irq ranges
    s390/mm: enable split page table lock for PMD level
    ...

    Linus Torvalds
     

20 Mar, 2014

1 commit

  • There are only two users of get_nohz_timer_target(): timer and hrtimer. Both
    call it under the same circumstances, i.e.

    #ifdef CONFIG_NO_HZ_COMMON
            if (!pinned && get_sysctl_timer_migration() && idle_cpu(this_cpu))
                    return get_nohz_timer_target();
    #endif

    So it makes more sense to fold all of this into get_nohz_timer_target()
    instead of duplicating the code in two places. For this, another parameter,
    pinned, needs to be passed to this routine.
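
    A sketch of the consolidated helper (the migration fallback is elided and
    only hinted at in a comment):

        int get_nohz_timer_target(int pinned)
        {
                int cpu = smp_processor_id();

                if (pinned || !get_sysctl_timer_migration() || !idle_cpu(cpu))
                        return cpu;

                /* ... otherwise walk the sched domains for a non-idle CPU
                 * to migrate the timer to ... */
                return cpu;
        }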

    Signed-off-by: Viresh Kumar
    Cc: linaro-kernel@lists.linaro.org
    Cc: fweisbec@gmail.com
    Cc: peterz@infradead.org
    Link: http://lkml.kernel.org/r/1e1b53537217d58d48c2d7a222a9c3ac47d5b64c.1395140107.git.viresh.kumar@linaro.org
    Signed-off-by: Thomas Gleixner

    Viresh Kumar
     

13 Mar, 2014

1 commit

  • When update_rq_clock_task() accounts the pending steal time for a task,
    it converts the steal delta from nsecs to ticks and then from ticks back to nsecs.

    There is no apparent good reason for doing that though because both
    the task clock and the prev steal delta are u64 and store values
    in nsecs.

    So let's remove the needless conversion.
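
    A sketch of the simplified accounting (both values are already in
    nanoseconds, so the jiffies round trip simply disappears):

        steal = paravirt_steal_clock(cpu_of(rq)) - rq->prev_steal_time_rq;
        if (steal > delta)
                steal = delta;

        rq->prev_steal_time_rq += steal;
        delta -= steal;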

    Cc: Ingo Molnar
    Cc: Marcelo Tosatti
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Acked-by: Rik van Riel
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     

12 Mar, 2014

1 commit

  • I decided to run my tests on linux-next, and my wakeup_rt tracer was
    broken. After running a bisect, I found that the problem commit was:

    linux-next commit c365c292d059
    "sched: Consider pi boosting in setscheduler()"

    And the reason the wakeup_rt tracer test was failing was because it had
    no RT task to trace. I first noticed this when running with the
    sched_switch event and saw that my RT task still had normal SCHED_OTHER
    priority. Looking at the problem commit, I found:

    - p->normal_prio = normal_prio(p);
    - p->prio = rt_mutex_getprio(p);

    With no

    + p->normal_prio = normal_prio(p);
    + p->prio = rt_mutex_getprio(p);

    Reading what the commit is supposed to do, I realize that the p->prio
    can't be set if the task is boosted with a higher prio, but the
    p->normal_prio still needs to be set regardless, otherwise, when the
    task is deboosted, it won't get the new priority.

    The p->prio has to be set before "check_class_changed()" is called,
    otherwise the class won't be changed.

    Also added fix to newprio to include a check for deadline policy that
    was missing. This change was suggested by Juri Lelli.
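
    The core of the fix, sketched (illustrative; the actual hunk sits in the
    scheduler's __setscheduler() path):

        p->normal_prio = normal_prio(p);        /* always recompute base prio */
        p->prio = rt_mutex_getprio(p);          /* effective prio honours PI boost */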

    Signed-off-by: Steven Rostedt
    Cc: SebastianAndrzej Siewior
    Cc: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140306120438.638bfe94@gandalf.local.home
    Signed-off-by: Ingo Molnar

    Steven Rostedt
     

11 Mar, 2014

3 commits

  • Bad idea on -rt:

    [ 908.026136] [] rt_spin_lock_slowlock+0xaa/0x2c0
    [ 908.026145] [] task_numa_free+0x31/0x130
    [ 908.026151] [] finish_task_switch+0xce/0x100
    [ 908.026156] [] thread_return+0x48/0x4ae
    [ 908.026160] [] schedule+0x25/0xa0
    [ 908.026163] [] rt_spin_lock_slowlock+0xd5/0x2c0
    [ 908.026170] [] get_signal_to_deliver+0xaf/0x680
    [ 908.026175] [] do_signal+0x3d/0x5b0
    [ 908.026179] [] do_notify_resume+0x90/0xe0
    [ 908.026186] [] int_signal+0x12/0x17
    [ 908.026193] [] 0x7ff2a388b1cf

    and since upstream does not mind where we do this, be a bit nicer ...

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Cc: Mel Gorman
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1393568591.6018.27.camel@marge.simpson.net
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Pick up fixes before queueing up new changes.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Deny the use of SCHED_DEADLINE policy to unprivileged users.
    Even if root users can set the policy for normal users, we
    don't want the latter to be able to change their parameters
    (safest behavior).
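
    A sketch of the check (illustrative placement inside
    __sched_setscheduler()'s permission handling):

        if (user && !capable(CAP_SYS_NICE)) {
                /* no setting or changing of SCHED_DEADLINE for now */
                if (dl_policy(policy))
                        return -EPERM;
                /* ... existing rt/nice permission checks ... */
        }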

    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1393844961-18097-1-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     

27 Feb, 2014

1 commit

  • Michael spotted that the idle_balance() push down created a task
    priority problem.

    Previously, when we called idle_balance() before pick_next_task() it
    wasn't a problem when -- because of the rq->lock droppage -- an rt/dl
    task slipped in.

    Similarly for pre_schedule(), rt pre-schedule could have a dl task
    slip in.

    But by pulling it into the pick_next_task() loop, we'll not try a
    higher task priority again.

    Cure this by creating a re-start condition in pick_next_task(); and
    triggering this from pick_next_task_{rt,fair}().

    It also fixes a live-lock where we get stuck in pick_next_task_fair()
    due to idle_balance() seeing !0 nr_running but there not actually
    being any fair tasks about.
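
    A sketch of the re-start condition (the RETRY_TASK cookie is illustrative
    shorthand for "a higher class may now have runnable tasks, try again"):

        #define RETRY_TASK ((void *)-1UL)

        again:
                for_each_class(class) {
                        p = class->pick_next_task(rq, prev);
                        if (p) {
                                if (unlikely(p == RETRY_TASK))
                                        goto again;     /* restart from the top class */
                                return p;
                        }
                }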

    Reported-by: Michael Wang
    Fixes: 38033c37faab ("sched: Push down pre_schedule() and idle_balance()")
    Tested-by: Sasha Levin
    Signed-off-by: Peter Zijlstra
    Cc: Juri Lelli
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/20140224121218.GR15586@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

25 Feb, 2014

2 commits

  • The name __smp_call_function_single() doesn't tell much about the
    properties of this function, especially when compared to
    smp_call_function_single().

    The comments above the implementation are also misleading. The main
    point of this function is actually not to be able to embed the csd
    in an object. This is actually a requirement that results from the
    purpose of this function, which is to raise an IPI asynchronously.

    As such it can be called with interrupts disabled. And this feature
    comes at the cost of the caller who then needs to serialize the
    IPIs on this csd.

    Let's rename the function and enhance the comments so that they reflect
    these properties.

    Suggested-by: Christoph Hellwig
    Cc: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Jens Axboe

    Frederic Weisbecker
     
  • The main point of calling __smp_call_function_single() is to send
    an IPI in a pure asynchronous way. By embedding a csd in an object,
    a caller can send the IPI without waiting for a previous one to complete
    as is required by smp_call_function_single() for example. As such,
    sending this kind of IPI can be safe even when irqs are disabled.

    This flexibility comes at the expense of the caller who then needs to
    synchronize the csd lifecycle by himself and make sure that IPIs on a
    single csd are serialized.

    This is how __smp_call_function_single() works when wait = 0 and this
    usecase is relevant.

    Now there doesn't seem to be any usecase with wait = 1 that can't be
    covered by smp_call_function_single() instead, which is safer. Let's look
    at the two possible scenarios:

    1) The user calls __smp_call_function_single(wait = 1) on a csd embedded
    in an object. It looks like a nice and convenient pattern at first
    sight because we can then retrieve the object from the IPI handler easily.

    But actually it is a waste of memory space in the object since the csd
    can be allocated from the stack by smp_call_function_single(wait = 1)
    and the object can be passed as the IPI argument.

    Besides that, embedding the csd in an object is more error prone
    because the caller must take care of the serialization of the IPIs
    for this csd.

    2) The user calls __smp_call_function_single(wait = 1) on a csd that
    is allocated on the stack. It's ok but smp_call_function_single()
    can do it as well and it already takes care of the allocation on the
    stack. Again it's simpler and less error prone.

    Therefore, using the underscore-prefixed API version with wait = 1
    is a bad pattern and a sign that the caller can do something safer and
    simpler.

    There was a single user of that which has just been converted.
    So let's remove this option to discourage further users.

    Cc: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Jens Axboe

    Frederic Weisbecker
     

23 Feb, 2014

8 commits

  • Signed-off-by: Dongsheng Yang
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/bd80780f19b4f9b4a765acc353c8dbc130274dd6.1392103744.git.yangds.fnst@cn.fujitsu.com
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Dongsheng Yang
     
  • This is a leftover from commit e23ee74777f389369431d77390c4b09332ce026a
    ("sched/rt: Simplify pull_rt_task() logic and remove .leaf_rt_rq_list").

    Signed-off-by: Li Zefan
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/52F5CBF6.4060901@huawei.com
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Li Zefan
     
  • If a PI-boosted task's policy/priority is modified by a setscheduler()
    call, we unconditionally dequeue and requeue the task if it is on the
    runqueue even if the new priority is lower than the current effective
    boosted priority. This can result in undesired reordering of the
    priority bucket list.

    If the new priority is less than or equal to the current effective one, we
    just store the new parameters in the task struct and leave the
    scheduler class and the runqueue untouched. This is handled when the
    task deboosts itself. Only if the new priority is higher than the
    effective boosted priority we apply the change immediately.
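
    A sketch of the shortcut (function names follow the patch series but the
    exact hunk is illustrative):

        /* requested prio not above the PI-boosted effective prio:
         * just record the parameters, the requeue happens on deboost */
        if (rt_mutex_check_prio(p, newprio)) {
                __setscheduler_params(p, attr);
                task_rq_unlock(rq, p, &flags);
                return 0;
        }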

    Signed-off-by: Thomas Gleixner
    [ Rebase ontop of v3.14-rc1. ]
    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Dario Faggioli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1391803122-4425-7-git-send-email-bigeasy@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • The following scenario does not work correctly:

    Runqueue of CPUx contains two runnable and pinned tasks:

    T1: SCHED_FIFO, prio 80
    T2: SCHED_FIFO, prio 80

    T1 is on the cpu and executes the following syscalls (classic priority
    ceiling scenario):

    sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 90);
    ...
    sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 80);
    ...

    Now T1 gets preempted by T3 (SCHED_FIFO, prio 95). After T3 goes back
    to sleep the scheduler picks T2. Surprise!

    The same happens w/o actual preemption when T1 is forced into the
    scheduler due to a sporadic NEED_RESCHED event. The scheduler invokes
    pick_next_task() which returns T2. So T1 gets preempted and scheduled
    out.

    This happens because sched_setscheduler() dequeues T1 from the prio 90
    list and then enqueues it on the tail of the prio 80 list behind T2.
    This violates the POSIX spec and surprises user space which relies on
    the guarantee that SCHED_FIFO tasks are not scheduled out unless they
    give the CPU up voluntarily or are preempted by a higher priority
    task. In the latter case the preempted task must get back on the CPU
    after the preempting task schedules out again.

    We fixed a similar issue already in commit 60db48c (sched: Queue a
    deboosted task to the head of the RT prio queue). The same treatment
    is necessary for sched_setscheduler(). So enqueue to head of the prio
    bucket list if the priority of the task is lowered.

    It might be possible that existing user space relies on the current
    behaviour, but it can be considered highly unlikely due to the corner
    case nature of the application scenario.
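
    The enqueue side of the fix, sketched (illustrative):

        if (on_rq)
                /* enqueue to head when the priority was lowered, to tail
                 * when it was raised (user space view) */
                enqueue_task(rq, p, oldprio <= p->prio ? ENQUEUE_HEAD : 0);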

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1391803122-4425-6-git-send-email-bigeasy@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • If the policy and priority remain unchanged a possible modification of
    p->sched_reset_on_fork gets lost in the early exit path.

    Signed-off-by: Thomas Gleixner
    [ Rebase ontop of v3.14-rc1. ]
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1391803122-4425-5-git-send-email-bigeasy@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • might_sleep() can tell us where interrupts have been disabled, but we
    have no idea what disabled preemption. Add some debug infrastructure.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1391803122-4425-4-git-send-email-bigeasy@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • Idle is not allowed to call sleeping functions ever!

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1391803122-4425-3-git-send-email-bigeasy@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • We stumbled in RT over an SMP bringup issue on ARM where the
    idle->on_rq == 0 was causing try_to_wakeup() on the other cpu to run
    into nada land.

    After adding that idle->on_rq = 1; I was able to find the root cause
    of the lockup: the idle task on the newly woken up cpu was fiddling
    with a sleeping spinlock, which is a nono.

    I kept the init of idle->on_rq to keep the state consistent and to
    avoid another long lasting debug session.

    As a side note, the whole debug mess could have been avoided if
    might_sleep() would have yelled when called from the idle task. That's
    fixed with patch 2/6 - and that one actually has a changelog :)

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1391803122-4425-2-git-send-email-bigeasy@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

22 Feb, 2014

4 commits

  • Dan Carpenter reported:

    > kernel/sched/rt.c:1347 pick_next_task_rt() warn: variable dereferenced before check 'prev' (see line 1338)
    > kernel/sched/deadline.c:1011 pick_next_task_dl() warn: variable dereferenced before check 'prev' (see line 1005)

    Kirill also spotted that migrate_tasks() will have an instant NULL
    deref because pick_next_task() will immediately deref prev.

    Instead of fixing all the corner cases because migrate_tasks() can
    pass in a NULL prev task in the unlikely case of hot-un-plug, provide
    a fake task such that we can remove all the NULL checks from the far
    more common paths.

    A further problem, not previously spotted, is that because we pushed
    pre_schedule() and idle_balance() into pick_next_task() we now need to
    avoid those getting called and pulling more tasks on our dying CPU.

    We avoid pull_{dl,rt}_task() by setting fake_task.prio to MAX_PRIO+1.
    We also note that since we call pick_next_task() exactly the amount of
    times we have runnable tasks present, we should never land in
    idle_balance().
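
    A sketch of the fake task (close to the idea described above; details
    are illustrative):

        static void put_prev_task_fake(struct rq *rq, struct task_struct *prev)
        {
        }

        static const struct sched_class fake_sched_class = {
                .put_prev_task = put_prev_task_fake,
        };

        static struct task_struct fake_task = {
                /* MAX_PRIO+1 keeps pull_{dl,rt}_task() from triggering */
                .prio        = MAX_PRIO + 1,
                .sched_class = &fake_sched_class,
        };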

    Fixes: 38033c37faab ("sched: Push down pre_schedule() and idle_balance()")
    Cc: Juri Lelli
    Cc: Ingo Molnar
    Cc: Steven Rostedt
    Reported-by: Kirill Tkhai
    Reported-by: Dan Carpenter
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140212094930.GB3545@laptop.programming.kicks-ass.net
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • Because of a recent syscall design debate, it's deemed appropriate for
    each syscall to have a flags argument for future extension, without
    immediately requiring new syscalls.

    Cc: juri.lelli@gmail.com
    Cc: Ingo Molnar
    Suggested-by: Michael Kerrisk
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140214161929.GL27965@twins.programming.kicks-ass.net
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • We're copying the on-stack structure to userspace, but forgot to give
    the right number of bytes to copy. This allows the calling process to
    obtain up to PAGE_SIZE bytes from the stack (and possibly adjacent
    kernel memory).

    This fix copies only as much as we actually have on the stack
    (attr->size defaults to the size of the struct) and leaves the rest of
    the userspace-provided buffer untouched.

    Found using kmemcheck + trinity.
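
    The heart of the fix, sketched (illustrative; attr->size has already been
    set to the amount of data the kernel actually filled in, at most
    sizeof(*attr)):

        if (copy_to_user(uattr, attr, attr->size))
                return -EFAULT;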

    Fixes: d50dde5a10f30 ("sched: Add new scheduler syscalls to support an extended scheduling parameters ABI")
    Cc: Dario Faggioli
    Cc: Juri Lelli
    Cc: Ingo Molnar
    Signed-off-by: Vegard Nossum
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1392585857-10725-1-git-send-email-vegard.nossum@oracle.com
    Signed-off-by: Thomas Gleixner

    Vegard Nossum
     
  • Fix this lockdep warning:

    [ 44.804600] =========================================================
    [ 44.805746] [ INFO: possible irq lock inversion dependency detected ]
    [ 44.805746] 3.14.0-rc2-test+ #14 Not tainted
    [ 44.805746] ---------------------------------------------------------
    [ 44.805746] bash/3674 just changed the state of lock:
    [ 44.805746] (&dl_b->lock){+.....}, at: [] sched_rt_handler+0x132/0x248
    [ 44.805746] but this lock was taken by another, HARDIRQ-safe lock in the past:
    [ 44.805746] (&rq->lock){-.-.-.}

    and interrupts could create inverse lock ordering between them.

    [ 44.805746]
    [ 44.805746] other info that might help us debug this:
    [ 44.805746] Possible interrupt unsafe locking scenario:
    [ 44.805746]
    [ 44.805746]        CPU0                    CPU1
    [ 44.805746]        ----                    ----
    [ 44.805746]   lock(&dl_b->lock);
    [ 44.805746]                                local_irq_disable();
    [ 44.805746]                                lock(&rq->lock);
    [ 44.805746]                                lock(&dl_b->lock);
    [ 44.805746]   <Interrupt>
    [ 44.805746]     lock(&rq->lock);

    Fix this by making the acquisition of dl_b->lock always IRQ safe.
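
    Sketch of the locking pattern after the change (illustrative):

        unsigned long flags;

        raw_spin_lock_irqsave(&dl_b->lock, flags);
        /* ... update the deadline bandwidth accounting ... */
        raw_spin_unlock_irqrestore(&dl_b->lock, flags);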

    Cc: Ingo Molnar
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1392107067-19907-3-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Thomas Gleixner

    Juri Lelli