16 Nov, 2014

1 commit

  • While looking over the cpu-timer code I found that we appear to add
    the delta for the calling task twice, through:

    cpu_timer_sample_group()
      thread_group_cputimer()
        thread_group_cputime()
          times->sum_exec_runtime += task_sched_runtime();

      *sample = cputime.sum_exec_runtime + task_delta_exec();

    Which would make the sample run ahead, making the sleep short.
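
    The eventual fix follows directly from that analysis; a minimal sketch
    (assuming the 3.18-era cpu_timer_sample_group(), not necessarily the
    exact patch):

    case CPUCLOCK_SCHED:
        /* thread_group_cputime() already accounted the caller's delta
         * via task_sched_runtime(), so don't add task_delta_exec() again. */
        *sample = cputime.sum_exec_runtime;
        break;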

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Stanislaw Gruszka
    Cc: Christoph Lameter
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Rik van Riel
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20141112113737.GI10476@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

25 Oct, 2014

2 commits

  • Andrey reported that on a kernel with UBSan enabled he found:

    UBSan: Undefined behaviour in ../kernel/time/clockevents.c:75:34

    I guess it should be 1ULL here instead of 1U:
    (!ismax || evt->mult <= (1U << evt->shift)))

    That's indeed the correct solution because shift might be 32.
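
    A sketch of the corrected condition in cev_delta2ns() (surrounding
    variables clc and rnd as in the 3.17-era kernel/time/clockevents.c):

    /* 1U << 32 is undefined behaviour when shift == 32; a 64-bit
     * constant makes the shift well-defined. */
    if ((~0ULL - clc > rnd) &&
        (!ismax || evt->mult <= (1ULL << evt->shift)))
        clc += rnd;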

    Reported-by: Andrey Ryabinin
    Cc: Peter Zijlstra
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • If userland creates a timer without specifying sigevent info, we'll
    create one ourselves, using a stack-local variable. In particular, we'll
    use the timer ID as sival_int. But as sigev_value is a union containing
    a pointer and an int, that assignment will only partially initialize
    sigev_value on systems where a pointer is bigger than an int. On such
    systems we'll copy the uninitialized stack bytes from the
    timer_create() call to userland when the timer actually fires and we're
    about to deliver the signal.

    Initialize sigev_value with 0 to plug the stack info leak.

    Found in the PaX patch, written by the PaX Team.
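
    A sketch of the plug, simplified from the timer_create() default-
    sigevent path (names per the 3.17-era kernel/time/posix-timers.c, from
    memory):

    } else {
        /* Assigning sival_int alone leaves the rest of the union
         * uninitialized on 64-bit, so zero the whole thing first. */
        memset(&event.sigev_value, 0, sizeof(event.sigev_value));
        event.sigev_notify = SIGEV_SIGNAL;
        event.sigev_signo = SIGALRM;
        event.sigev_value.sival_int = new_timer->it_id;
    }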

    Fixes: 5a9fa7307285 ("posix-timers: kill ->it_sigev_signo and...")
    Signed-off-by: Mathias Krause
    Cc: Oleg Nesterov
    Cc: Brad Spengler
    Cc: PaX Team
    Cc: stable@vger.kernel.org # v2.6.28+
    Link: http://lkml.kernel.org/r/1412456799-32339-1-git-send-email-minipli@googlemail.com
    Signed-off-by: Thomas Gleixner

    Mathias Krause
     

15 Oct, 2014

1 commit

  • Pull percpu consistent-ops changes from Tejun Heo:
    "Way back, before the current percpu allocator was implemented, static
    and dynamic percpu memory areas were allocated and handled separately
    and had their own accessors. The distinction has been gone for many
    years now; however, the now duplicate two sets of accessors remained
    with the pointer based ones - this_cpu_*() - evolving various other
    operations over time. During the process, we also accumulated other
    inconsistent operations.

    This pull request contains Christoph's patches to clean up the
    duplicate accessor situation. __get_cpu_var() uses are replaced with
    this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr().

    Unfortunately, the former sometimes is tricky thanks to C being a bit
    messy with the distinction between lvalues and pointers, which led to
    a rather ugly solution for cpumask_var_t involving the introduction of
    this_cpu_cpumask_var_ptr().

    This converts most of the uses but not all. Christoph will follow up
    with the remaining conversions in this merge window and hopefully
    remove the obsolete accessors"
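
    As a rough before/after sketch of the conversion (the accessors are the
    real kernel APIs; struct foo and my_foo are made up for illustration):

    #include <linux/percpu.h>

    DEFINE_PER_CPU(struct foo, my_foo);   /* hypothetical percpu variable */

    struct foo *p;
    p = &__get_cpu_var(my_foo);   /* old: lvalue-based accessor */
    p = this_cpu_ptr(&my_foo);    /* new: pointer-based accessor */

    p = __this_cpu_ptr(&my_foo);  /* old name for the raw variant ... */
    p = raw_cpu_ptr(&my_foo);     /* ... and its replacement */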

    * 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits)
    irqchip: Properly fetch the per cpu offset
    percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix
    ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write.
    percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t
    Revert "powerpc: Replace __get_cpu_var uses"
    percpu: Remove __this_cpu_ptr
    clocksource: Replace __this_cpu_ptr with raw_cpu_ptr
    sparc: Replace __get_cpu_var uses
    avr32: Replace __get_cpu_var with __this_cpu_write
    blackfin: Replace __get_cpu_var uses
    tile: Use this_cpu_ptr() for hardware counters
    tile: Replace __get_cpu_var uses
    powerpc: Replace __get_cpu_var uses
    alpha: Replace __get_cpu_var
    ia64: Replace __get_cpu_var uses
    s390: cio driver &__get_cpu_var replacements
    s390: Replace __get_cpu_var uses
    mips: Replace __get_cpu_var uses
    MIPS: Replace __get_cpu_var uses in FPU emulator.
    arm: Replace __this_cpu_ptr with raw_cpu_ptr
    ...

    Linus Torvalds
     

14 Oct, 2014

1 commit

  • Pull s390 updates from Martin Schwidefsky:
    "This patch set contains the main portion of the changes for 3.18 in
    regard to the s390 architecture. It is a bit bigger than usual,
    mainly because of a new driver and the vector extension patches.

    The interesting bits are:
    - Quite a bit of work on the tracing front. Uprobes is enabled and
    the ftrace code is reworked to get some of the lost performance
    back if CONFIG_FTRACE is enabled.
    - To improve boot time with CONFIG_DEBUG_PAGEALLOC, support for the
    IPTE range facility is added.
    - The rwlock code is re-factored to improve writer fairness and to be
    able to use the interlocked-access instructions.
    - The kernel part for the support of the vector extension is added.
    - The device driver to access the CD/DVD on the HMC is added, this
    will hopefully come in handy to improve the installation process.
    - Add support for control-unit initiated reconfiguration.
    - The crypto device driver is enhanced to enable the additional AP
    domains and to allow the new crypto hardware to be used.
    - Bug fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (39 commits)
    s390/ftrace: simplify enabling/disabling of ftrace_graph_caller
    s390/ftrace: remove 31 bit ftrace support
    s390/kdump: add support for vector extension
    s390/disassembler: add vector instructions
    s390: add support for vector extension
    s390/zcrypt: Toleration of new crypto hardware
    s390/idle: consolidate idle functions and definitions
    s390/nohz: use a per-cpu flag for arch_needs_cpu
    s390/vtime: do not reset idle data on CPU hotplug
    s390/dasd: add support for control unit initiated reconfiguration
    s390/dasd: fix infinite loop during format
    s390/mm: make use of ipte range facility
    s390/setup: correct 4-level kernel page table detection
    s390/topology: call set_sched_topology early
    s390/uprobes: architecture backend for uprobes
    s390/uprobes: common library for kprobes and uprobes
    s390/rwlock: use the interlocked-access facility 1 instructions
    s390/rwlock: improve writer fairness
    s390/rwlock: remove interrupt-enabling rwlock variant.
    s390/mm: remove change bit override support
    ...

    Linus Torvalds
     

13 Oct, 2014

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Optimized support for Intel "Cluster-on-Die" (CoD) topologies (Dave
    Hansen)

    - Various sched/idle refinements for better idle handling (Nicolas
    Pitre, Daniel Lezcano, Chuansheng Liu, Vincent Guittot)

    - sched/numa updates and optimizations (Rik van Riel)

    - sysbench speedup (Vincent Guittot)

    - capacity calculation cleanups/refactoring (Vincent Guittot)

    - Various cleanups to thread group iteration (Oleg Nesterov)

    - Double-rq-lock removal optimization and various refactorings
    (Kirill Tkhai)

    - various sched/deadline fixes

    ... and lots of other changes"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
    sched/dl: Use dl_bw_of() under rcu_read_lock_sched()
    sched/fair: Delete resched_cpu() from idle_balance()
    sched, time: Fix build error with 64 bit cputime_t on 32 bit systems
    sched: Improve sysbench performance by fixing spurious active migration
    sched/x86: Fix up typo in topology detection
    x86, sched: Add new topology for multi-NUMA-node CPUs
    sched/rt: Use resched_curr() in task_tick_rt()
    sched: Use rq->rd in sched_setaffinity() under RCU read lock
    sched: cleanup: Rename 'out_unlock' to 'out_free_new_mask'
    sched: Use dl_bw_of() under RCU read lock
    sched/fair: Remove duplicate code from can_migrate_task()
    sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSW
    sched: print_rq(): Don't use tasklist_lock
    sched: normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use task_rq_lock()
    sched: Fix the task-group check in tg_has_rt_tasks()
    sched/fair: Leverage the idle state info when choosing the "idlest" cpu
    sched: Let the scheduler see CPU idle states
    sched/deadline: Fix inter- exclusive cpusets migrations
    sched/deadline: Clear dl_entity params when setscheduling to different class
    sched/numa: Kill the wrong/dead TASK_DEAD check in task_numa_fault()
    ...

    Linus Torvalds
     

09 Oct, 2014

3 commits

  • Pull timer updates from Thomas Gleixner:
    "Nothing really exciting this time:

    - a few fixlets in the NOHZ code

    - a new ARM SoC timer abomination. One should expect that we have
    enough of them already, but they insist on inventing new ones.

    - the usual bunch of ARM SoC timer updates. That feels like herding
    cats"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    clocksource: arm_arch_timer: Consolidate arch_timer_evtstrm_enable
    clocksource: arm_arch_timer: Enable counter access for 32-bit ARM
    clocksource: arm_arch_timer: Change clocksource name if CP15 unavailable
    clocksource: sirf: Disable counter before re-setting it
    clocksource: cadence_ttc: Add support for 32bit mode
    clocksource: tcb_clksrc: Sanitize IRQ request
    clocksource: arm_arch_timer: Discard unavailable timers correctly
    clocksource: vf_pit_timer: Support shutdown mode
    ARM: meson6: clocksource: Add Meson6 timer support
    ARM: meson: documentation: Add timer documentation
    clocksource: sh_tmu: Document r8a7779 binding
    clocksource: sh_mtu2: Document r7s72100 binding
    clocksource: sh_cmt: Document SoC specific bindings
    timerfd: Remove an always true check
    nohz: Avoid tick's double reprogramming in highres mode
    nohz: Fix spurious periodic tick behaviour in low-res dynticks mode

    Linus Torvalds
     
  • Pull timer fixes from Ingo Molnar:
    "Main changes:

    - Fix the deadlock reported by Dave Jones et al
    - Clean up and fix nohz_full interaction with arch abilities
    - nohz init code consolidation/cleanup"

    * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    nohz: nohz full depends on irq work self IPI support
    nohz: Consolidate nohz full init code
    arm64: Tell irq work about self IPI support
    arm: Tell irq work about self IPI support
    x86: Tell irq work about self IPI support
    irq_work: Force raised irq work to run on irq work interrupt
    irq_work: Introduce arch_irq_work_has_interrupt()
    nohz: Move nohz full init call to tick init

    Linus Torvalds
     
  • Move the nohz_delay bit from the s390_idle data structure to the
    per-cpu flags. Clear the nohz delay flag in __cpu_disable and
    remove the cpu hotplug notifier that used to do this.

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     

19 Sep, 2014

1 commit

  • schedule(), io_schedule() and schedule_timeout() always return
    with TASK_RUNNING set, so setting it again afterwards is unnecessary.

    (All the places in the patch are visibly fine; the only exception is
    kiblnd_scheduler() from:

    drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c

    whose schedule() call sits one line above the standard 3 lines of
    unified-diff context.)

    None of these places use set_current_state() for its implied memory
    barrier.
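
    The removed pattern looks like this (a generic sketch, not taken from
    any specific file in the patch):

    set_current_state(TASK_INTERRUPTIBLE);
    schedule();
    /* Redundant: schedule() always returns with TASK_RUNNING set. */
    __set_current_state(TASK_RUNNING);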

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Link: http://lkml.kernel.org/r/1410529254.3569.23.camel@tkhai
    Cc: Alasdair Kergon
    Cc: Anil Belur
    Cc: Arnd Bergmann
    Cc: Dave Kleikamp
    Cc: David Airlie
    Cc: David Howells
    Cc: Dmitry Eremin
    Cc: Frank Blaschka
    Cc: Greg Kroah-Hartman
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Isaac Huang
    Cc: James E.J. Bottomley
    Cc: James E.J. Bottomley
    Cc: J. Bruce Fields
    Cc: Jeff Dike
    Cc: Jesper Nilsson
    Cc: Jiri Slaby
    Cc: Laura Abbott
    Cc: Liang Zhen
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Masaru Nomura
    Cc: Michael Opdenacker
    Cc: Mikael Starvik
    Cc: Mike Snitzer
    Cc: Neil Brown
    Cc: Oleg Drokin
    Cc: Peng Tao
    Cc: Richard Weinberger
    Cc: Robert Love
    Cc: Steven Rostedt
    Cc: Trond Myklebust
    Cc: Ursula Braun
    Cc: Zi Shen Lim
    Cc: devel@driverdev.osuosl.org
    Cc: dm-devel@redhat.com
    Cc: dri-devel@lists.freedesktop.org
    Cc: fcoe-devel@open-fcoe.org
    Cc: jfs-discussion@lists.sourceforge.net
    Cc: linux390@de.ibm.com
    Cc: linux-afs@lists.infradead.org
    Cc: linux-cris-kernel@axis.com
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-nfs@vger.kernel.org
    Cc: linux-parisc@vger.kernel.org
    Cc: linux-raid@vger.kernel.org
    Cc: linux-s390@vger.kernel.org
    Cc: linux-scsi@vger.kernel.org
    Cc: qla2xxx-upstream@qlogic.com
    Cc: user-mode-linux-devel@lists.sourceforge.net
    Cc: user-mode-linux-user@lists.sourceforge.net
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     

14 Sep, 2014

4 commits

  • The nohz full functionality depends on IRQ work to trigger its own
    interrupts. As it's used to restart the tick, we can't rely on the tick
    fallback for irq work callbacks, ie: we can't use the tick to restart
    the tick itself.

    Let's reject the full dynticks initialization if that arch support isn't
    available.

    As a side effect, this makes sure that the nohz kick is never called
    from the tick, which would otherwise result in illegal hrtimer
    self-cancellation and a lockup.

    Acked-by: Peter Zijlstra (Intel)
    Cc: Ingo Molnar
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     
  • CONFIG_NO_HZ_FULL_ALL=y and the nohz_full= kernel parameter each have
    their own way of doing the same thing: allocate full dynticks cpumasks,
    fill them and initialize some state variables.

    Let's consolidate all of that in the same place.

    While at it, convert some regular printk messages to warnings when
    fundamental allocations fail.

    Acked-by: Peter Zijlstra (Intel)
    Cc: Ingo Molnar
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     
  • The nohz full kick, which restarts the tick when any resource depends
    on it, can't be executed just anywhere, given the operations it performs
    on timers. If it is called from the scheduler or timer code, chances are
    we run into a deadlock.

    This is why we run the nohz full kick from an irq work: that way we make
    sure the kick runs in a virgin context.

    That holds when irq work runs from its own dedicated self-IPI, but
    things are different on the large group of archs that don't support
    triggering it that way. To support them, irq works are also handled
    from the timer interrupt as a fallback.

    Now when irq works run from the timer interrupt, the context isn't
    blank. More precisely, they can run in the context of the hrtimer that
    runs the tick. But the nohz kick cancels and restarts this hrtimer, and
    cancelling an hrtimer from within itself isn't allowed. This is why we
    run into an endless loop:

    Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2
    CPU: 2 PID: 7538 Comm: kworker/u8:8 Not tainted 3.16.0+ #34
    Workqueue: btrfs-endio-write normal_work_helper [btrfs]
    ffff880244c06c88 000000001b486fe1 ffff880244c06bf0 ffffffff8a7f1e37
    ffffffff8ac52a18 ffff880244c06c78 ffffffff8a7ef928 0000000000000010
    ffff880244c06c88 ffff880244c06c20 000000001b486fe1 0000000000000000
    Call Trace:
    [] dump_stack+0x4e/0x7a
    [] panic+0xd4/0x207
    [] watchdog_overflow_callback+0x118/0x120
    [] __perf_event_overflow+0xae/0x350
    [] ? perf_event_task_disable+0xa0/0xa0
    [] ? x86_perf_event_set_period+0xbf/0x150
    [] perf_event_overflow+0x14/0x20
    [] intel_pmu_handle_irq+0x206/0x410
    [] perf_event_nmi_handler+0x2b/0x50
    [] nmi_handle+0xd2/0x390
    [] ? nmi_handle+0x5/0x390
    [] ? match_held_lock+0x8/0x1b0
    [] default_do_nmi+0x72/0x1c0
    [] do_nmi+0xb8/0x100
    [] end_repeat_nmi+0x1e/0x2e
    [] ? match_held_lock+0x8/0x1b0
    [] ? match_held_lock+0x8/0x1b0
    [] ? match_held_lock+0x8/0x1b0
    [] lock_acquired+0xaf/0x450
    [] ? lock_hrtimer_base.isra.20+0x25/0x50
    [] _raw_spin_lock_irqsave+0x78/0x90
    [] ? lock_hrtimer_base.isra.20+0x25/0x50
    [] lock_hrtimer_base.isra.20+0x25/0x50
    [] hrtimer_try_to_cancel+0x33/0x1e0
    [] hrtimer_cancel+0x1a/0x30
    [] tick_nohz_restart+0x17/0x90
    [] __tick_nohz_full_check+0xc3/0x100
    [] nohz_full_kick_work_func+0xe/0x10
    [] irq_work_run_list+0x44/0x70
    [] irq_work_run+0x2a/0x50
    [] update_process_times+0x5b/0x70
    [] tick_sched_handle.isra.21+0x25/0x60
    [] tick_sched_timer+0x41/0x60
    [] __run_hrtimer+0x72/0x470
    [] ? tick_sched_do_timer+0xb0/0xb0
    [] hrtimer_interrupt+0x117/0x270
    [] local_apic_timer_interrupt+0x37/0x60
    [] smp_apic_timer_interrupt+0x3f/0x50
    [] apic_timer_interrupt+0x6f/0x80

    To fix this, we force non-lazy irq works to run on the irq work
    self-IPI when it's available. Whether the arch can trigger irq work
    self-IPIs is reported by arch_irq_work_has_interrupt().
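
    For reference, the new arch hook is a trivial predicate; a sketch of
    the asm-generic fallback (archs with a self-IPI override it to return
    true):

    static inline bool arch_irq_work_has_interrupt(void)
    {
        /* No self-IPI: irq work must ride on the timer tick. */
        return false;
    }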

    Reported-by: Catalin Iacob
    Reported-by: Dave Jones
    Acked-by: Peter Zijlstra (Intel)
    Cc: Ingo Molnar
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     
  • This way we unbloat main.c a bit and, more importantly, we initialize
    nohz full after init_IRQ(). This dependency will be needed in further
    patches because nohz full needs irq work to raise its own IRQ.
    On ARM64, information about support for this ability is obtained in
    init_IRQ(), which initializes the __smp_cross_call pointer.

    Since tick_init() is called right after init_IRQ(), this is a good place
    to call tick_nohz_init() and prepare for that dependency.
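
    The resulting ordering is then simply (a sketch of the 3.17-era
    kernel/time/tick-common.c, from memory):

    /* Called from start_kernel(), right after init_IRQ(). */
    void __init tick_init(void)
    {
        tick_broadcast_init();
        tick_nohz_init();   /* can now rely on irq work self-IPI info */
    }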

    Acked-by: Peter Zijlstra (Intel)
    Cc: Ingo Molnar
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     

13 Sep, 2014

4 commits

  • Locks the k_itimer's it_lock member when handling the alarm timer's
    expiry callback.

    The regular posix timers defined in posix-timers.c have this lock held
    during timeout processing because their callbacks are routed through
    posix_timer_fn(). The alarm timers follow a different path, so they
    ought to grab the lock somewhere else.
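
    A sketch of the shape of the fix (field names per the 3.17-era
    kernel/time/alarmtimer.c, from memory):

    static enum alarmtimer_restart alarm_handle_timer(struct alarm *alarm,
                                                      ktime_t now)
    {
        struct k_itimer *ptr = container_of(alarm, struct k_itimer,
                                            it.alarm.alarmtimer);
        unsigned long flags;

        spin_lock_irqsave(&ptr->it_lock, flags);
        /* ... deliver the signal under it_lock, as posix_timer_fn() does ... */
        spin_unlock_irqrestore(&ptr->it_lock, flags);
        return ALARMTIMER_NORESTART;
    }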

    Cc: stable@vger.kernel.org
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Richard Cochran
    Cc: Prarit Bhargava
    Cc: Sharvil Nanavati
    Signed-off-by: Richard Larocque
    Signed-off-by: John Stultz

    Richard Larocque
     
  • Avoids sending a signal to alarm timers created with sigev_notify set to
    SIGEV_NONE by checking for that special case in the timeout callback.

    The regular posix timers avoid sending signals to SIGEV_NONE timers by
    not scheduling any callbacks for them in the first place. Although it
    would be possible to do something similar for alarm timers, it's simpler
    to handle this as a special case in the timeout.

    Prior to this patch, the alarm timer would ignore the sigev_notify value
    and try to deliver signals to the process anyway. Even worse, the
    sanity check for the value of sigev_signo is skipped when SIGEV_NONE is
    specified, so the signal number could be bogus. If sigev_signo is an
    uninitialized value (as it often will be if SIGEV_NONE is used), then
    it's hard to predict which signal will be sent.
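
    The special case amounts to a guard in the expiry callback; a sketch
    (assuming the locked alarm_handle_timer() shape shown earlier):

    /* Don't deliver anything for SIGEV_NONE timers. */
    if ((ptr->it_sigev_notify & ~SIGEV_THREAD_ID) != SIGEV_NONE) {
        if (posix_timer_event(ptr, 0) != 0)
            ptr->it_overrun++;
    }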

    Cc: stable@vger.kernel.org
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Richard Cochran
    Cc: Prarit Bhargava
    Cc: Sharvil Nanavati
    Signed-off-by: Richard Larocque
    Signed-off-by: John Stultz

    Richard Larocque
     
  • Returns the time remaining for an alarm timer, rather than the time at
    which it is scheduled to expire. If the timer has already expired or is
    not currently scheduled, the members of it_value are set to zero.

    This new behavior matches that of the other posix-timers and the POSIX
    specifications.

    This is a change in user-visible behavior, and may break existing
    applications. Hopefully, few users rely on the old incorrect behavior.
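
    Conceptually, the get-time path changes from reporting the absolute
    expiry to reporting the remainder; a simplified sketch (the 3.17-era
    ktime_t still exposes .tv64):

    ktime_t remaining = ktime_sub(alarm->node.expires, base->gettime());

    if (remaining.tv64 <= 0) {
        /* Expired or not scheduled: zero it_value, per POSIX. */
        cur_setting->it_value.tv_sec = 0;
        cur_setting->it_value.tv_nsec = 0;
    } else {
        cur_setting->it_value = ktime_to_timespec(remaining);
    }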

    Cc: stable@vger.kernel.org
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Richard Cochran
    Cc: Prarit Bhargava
    Cc: Sharvil Nanavati
    Signed-off-by: Richard Larocque
    [jstultz: minor style tweak]
    Signed-off-by: John Stultz

    Richard Larocque
     
  • timeval_to_jiffies tried to round a timeval up to an integral number
    of jiffies, but the logic for doing so was incorrect: intervals
    corresponding to exactly N jiffies would become N+1. This manifested
    itself particularly when repeatedly stopping/starting an itimer:

    setitimer(ITIMER_PROF, &val, NULL);
    setitimer(ITIMER_PROF, NULL, &val);

    would add a full tick to val, _even if it was exactly representable in
    terms of jiffies_ (say, the result of a previous rounding.) Doing
    this repeatedly would cause unbounded growth in val. So fix the math.

    Here's what was wrong with the conversion: we essentially computed
    (eliding seconds)

    jiffies = usec * (NSEC_PER_USEC/TICK_NSEC)

    using scaling arithmetic, which took the best approximation of
    NSEC_PER_USEC/TICK_NSEC with a denominator of 2^USEC_JIFFIE_SC, i.e.
    x/(2^USEC_JIFFIE_SC), and computed:

    jiffies = (usec * x) >> USEC_JIFFIE_SC

    and rounded this calculation up in the intermediate form (since we
    can't necessarily exactly represent TICK_NSEC in usec.) But the
    scaling arithmetic is a (very slight) *over*approximation of the true
    value; that is, instead of dividing by (1 usec/ 1 jiffie), we
    effectively divided by (1 usec/1 jiffie)-epsilon (rounding
    down). This would normally be fine, but we want to round timeouts up,
    and we did so by adding 2^USEC_JIFFIE_SC - 1 before the shift; this
    would be fine if our division was exact, but dividing this by the
    slightly smaller factor was equivalent to adding just _over_ 1 to the
    final result (instead of just _under_ 1, as desired.)

    In particular, with HZ=1000, we consistently computed that 10000 usec
    was 11 jiffies; the same was true for any exact multiple of
    TICK_NSEC.

    We could possibly still round in the intermediate form, adding
    something less than 2^USEC_JIFFIE_SC - 1, but easier still is to
    convert usec->nsec, round in nanoseconds, and then convert using
    time*spec*_to_jiffies. This adds one constant multiplication, and is
    not observably slower in microbenchmarks on recent x86 hardware.
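
    Conceptually, the fixed path does the following (a sketch of the idea,
    not the exact scaled-arithmetic implementation):

    /* Convert to nanoseconds first; TICK_NSEC is exact there, so
     * rounding up can no longer overshoot by a whole jiffy. */
    u64 nsec = (u64)usec * NSEC_PER_USEC;
    jiffies = div_u64(nsec + TICK_NSEC - 1, TICK_NSEC);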

    Tested: the following program:

    #include <stdio.h>
    #include <sys/time.h>

    int main() {
        struct itimerval zero = {{0, 0}, {0, 0}};
        /* Initially set to 10 ms. */
        struct itimerval initial = zero;
        initial.it_interval.tv_usec = 10000;
        setitimer(ITIMER_PROF, &initial, NULL);
        /* Save and restore several times. */
        for (size_t i = 0; i < 10; ++i) {
            struct itimerval prev;
            setitimer(ITIMER_PROF, &zero, &prev);
            /* on old kernels, this goes up by TICK_USEC every iteration */
            printf("previous value: %ld %ld %ld %ld\n",
                   prev.it_interval.tv_sec, prev.it_interval.tv_usec,
                   prev.it_value.tv_sec, prev.it_value.tv_usec);
            setitimer(ITIMER_PROF, &prev, NULL);
        }
        return 0;
    }

    Cc: stable@vger.kernel.org
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Paul Turner
    Cc: Richard Cochran
    Cc: Prarit Bhargava
    Reviewed-by: Paul Turner
    Reported-by: Aaron Jacobs
    Signed-off-by: Andrew Hunter
    [jstultz: Tweaked to apply to 3.17-rc]
    Signed-off-by: John Stultz

    Andrew Hunter
     

08 Sep, 2014

1 commit

  • Both times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) have scalability
    issues on large systems, due to both functions being serialized with a
    lock.

    The lock protects against reporting a wrong value, due to a thread in the
    task group exiting, its statistics reporting up to the signal struct, and
    that exited task's statistics being counted twice (or not at all).

    Protecting that with a lock results in times() and clock_gettime() being
    completely serialized on large systems.

    This can be fixed by using a seqlock around the events that gather and
    propagate statistics. As an additional benefit, the protection code can
    be moved into thread_group_cputime(), slightly simplifying the calling
    functions.

    In the case of posix_cpu_clock_get_task() things can be simplified a
    lot, because the calling function already ensures that the task sticks
    around, and the rest is now taken care of in thread_group_cputime().

    This way the statistics reporting code can run lockless.
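
    The pattern is the classic seqlock: writers are rare (thread exit), and
    readers retry instead of blocking. A generic sketch (the seqlock API is
    real; group and exiting_task_runtime are made up for illustration, this
    is not the actual cputime code):

    static seqlock_t stats_lock = __SEQLOCK_UNLOCKED(stats_lock);

    /* writer: a thread exits and folds its runtime into the group totals */
    write_seqlock(&stats_lock);
    group->sum_exec_runtime += exiting_task_runtime;
    write_sequnlock(&stats_lock);

    /* reader: times()/clock_gettime() no longer serialize on a lock */
    unsigned int seq;
    u64 runtime;
    do {
        seq = read_seqbegin(&stats_lock);
        runtime = group->sum_exec_runtime;
    } while (read_seqretry(&stats_lock, seq));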

    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alex Thorlton
    Cc: Andrew Morton
    Cc: Daeseok Youn
    Cc: David Rientjes
    Cc: Dongsheng Yang
    Cc: Geert Uytterhoeven
    Cc: Guillaume Morin
    Cc: Ionut Alexa
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Li Zefan
    Cc: Michal Hocko
    Cc: Michal Schmidt
    Cc: Oleg Nesterov
    Cc: Vladimir Davydov
    Cc: umgwanakikbuti@gmail.com
    Cc: fweisbec@gmail.com
    Cc: srao@redhat.com
    Cc: lwoodman@redhat.com
    Cc: atheurer@redhat.com
    Link: http://lkml.kernel.org/r/20140816134010.26a9b572@annuminas.surriel.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     

06 Sep, 2014

1 commit

  • The update_walltime() code works on the shadow timekeeper to make the
    seqcount protected region as short as possible. But that update to the
    shadow timekeeper does not update all timekeeper fields because it's
    sufficient to do that once before it goes live. One of these fields
    is tkr.base_mono. That stays stale in the shadow timekeeper unless an
    operation happens which copies the real timekeeper to the shadow.

    The update function is called after the update calls to vsyscall and
    pvclock. While not correct, it did not cause any problems because none
    of the invoked update functions used base_mono.

    commit cbcf2dd3b3d4 (x86: kvm: Make kvm_get_time_and_clockread()
    nanoseconds based) changed that in the kvm pvclock update function, so
    the stale base_mono value got used and caused kvm-clock to malfunction.

    Put the update where it belongs and fix the issue.
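
    A sketch of the reordering in timekeeping_update() (3.17-era names,
    from memory):

    /* Before the fix, update_vsyscall()/update_pvclock_gtod() ran first
     * and could observe a stale tkr.base_mono. */
    tk_update_ktime_data(tk);
    update_vsyscall(tk);
    update_pvclock_gtod(tk, action & TK_CLOCK_WAS_SET);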

    Reported-by: Chris J Arges
    Reported-by: Paolo Bonzini
    Cc: Gleb Natapov
    Cc: John Stultz
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1409050000570.3333@nanos
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

05 Sep, 2014

1 commit

  • The local nohz kick is currently used by perf, which needs it to be
    NMI-safe. A recent commit, though
    (7d1311b93e58ed55f3a31cc8f94c4b8fe988a2b9), changed its implementation
    to fire the local kick using the remote kick API. That was convenient
    for making the code more generic, but the remote kick isn't NMI-safe.

    As a result:

    WARNING: CPU: 3 PID: 18062 at kernel/irq_work.c:72 irq_work_queue_on+0x11e/0x140()
    CPU: 3 PID: 18062 Comm: trinity-subchil Not tainted 3.16.0+ #34
    0000000000000009 00000000903774d1 ffff880244e06c00 ffffffff9a7f1e37
    0000000000000000 ffff880244e06c38 ffffffff9a0791dd ffff880244fce180
    0000000000000003 ffff880244e06d58 ffff880244e06ef8 0000000000000000
    Call Trace:
    [] dump_stack+0x4e/0x7a
    [] warn_slowpath_common+0x7d/0xa0
    [] warn_slowpath_null+0x1a/0x20
    [] irq_work_queue_on+0x11e/0x140
    [] tick_nohz_full_kick_cpu+0x57/0x90
    [] __perf_event_overflow+0x275/0x350
    [] ? perf_event_task_disable+0xa0/0xa0
    [] ? x86_perf_event_set_period+0xbf/0x150
    [] perf_event_overflow+0x14/0x20
    [] intel_pmu_handle_irq+0x206/0x410
    [] ? arch_vtime_task_switch+0x63/0x130
    [] perf_event_nmi_handler+0x2b/0x50
    [] nmi_handle+0xd2/0x390
    [] ? nmi_handle+0x5/0x390
    [] ? lock_release+0xab/0x330
    [] default_do_nmi+0x72/0x1c0
    [] ? cpuacct_account_field+0xcf/0x200
    [] do_nmi+0xb8/0x100

    Let's fix this by restoring the use of local irq work for the nohz
    local kick.
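
    Restoring it means queueing the per-CPU irq work directly instead of
    going through the remote-queue API; a sketch (names per the 3.17-era
    kernel/time/tick-sched.c, from memory):

    void tick_nohz_full_kick(void)
    {
        if (!tick_nohz_full_cpu(smp_processor_id()))
            return;
        /* irq_work_queue() is NMI-safe; irq_work_queue_on() is not. */
        irq_work_queue(&__get_cpu_var(nohz_full_kick_work));
    }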

    Reported-by: Catalin Iacob
    Reported-and-tested-by: Dave Jones
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     

23 Aug, 2014

2 commits

  • In highres mode, the tick unconditionally reschedules itself to the
    next jiffy.

    While this clock reprogramming is relevant when the tick is in periodic
    mode, it's not that interesting when we run in dynticks mode, because
    the irq exit path is likely going to override the next tick and defer
    it to some point further in the future.

    So let's just get rid of this tick self-rescheduling in dynticks mode.
    This way we can avoid some clockevents double writes in favourable
    scenarios, such as when we stop the tick completely in idle while no
    other hrtimer is pending.
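
    A sketch of the guard in the highres tick handler (tick_sched_timer(),
    3.17-era names, from memory):

    /* No need to reprogram if we are in idle or full dynticks mode;
     * the irq exit path will decide what the next tick should be. */
    if (unlikely(ts->tick_stopped))
        return HRTIMER_NORESTART;

    hrtimer_forward(timer, now, tick_period);
    return HRTIMER_RESTART;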

    Suggested-by: Frederic Weisbecker
    Signed-off-by: Viresh Kumar
    Cc: Thomas Gleixner
    Signed-off-by: Frederic Weisbecker

    Viresh Kumar
     
  • When we reach the end of the tick handler, we unconditionally reschedule
    the next tick to the next jiffy. Then on irq exit, the nohz code
    overrides that setting if needed and defers the next tick as far away in
    the future as possible.

    Now in the best dynticks case, when we actually don't need any tick in
    the future (ie: expires == KTIME_MAX), low-res and high-res behave
    differently. What we want in this case is to cancel the next tick
    programmed by the previous one. That's what we do in high-res mode. OTOH
    we lack a low-res mode equivalent of hrtimer_cancel() so we simply don't
    do anything in this case and the next tick remains scheduled to jiffies + 1.

    As a result, in low-res mode, when the dynticks code determines that no
    tick is needed in the future, we can recursively get a spurious tick
    every jiffy, because the next tick is always reprogrammed from the
    tick handler and never cancelled. This can happen indefinitely, until
    some subsystem actually needs a precise tick in the future, and only
    then do we eventually overwrite the previous tick handler setting to
    defer the next tick.

    We are fixing this by introducing the ONESHOT_STOPPED mode which will
    let us pause a clockevent when no further interrupt is needed. Meanwhile
    we can't expect all drivers to support this new mode.

    So let's greatly reduce the symptoms by skipping the nohz-blind tick
    rescheduling from the tick-handler when the CPU is in dynticks mode.
    That tick rescheduling wrongly assumed periodicity and the low-res
    dynticks code can't cancel such decision. This breaks the recursive (and
    thus the worst) part of the problem. In the worst case now, we'll get
    only one extra tick due to uncancelled tick scheduled before we entered
    dynticks mode.

    This also removes a needless clockevent write on idle ticks. Since
    such clock writes are usually considered slow, it's a general win.
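
    The low-res handler gets the equivalent guard; a sketch
    (tick_nohz_handler(), 3.17-era names, from memory):

    /* No need to reprogram if we are running tickless. */
    if (unlikely(ts->tick_stopped))
        return;

    hrtimer_forward(&ts->sched_timer, now, tick_period);
    tick_program_event(hrtimer_get_expires(&ts->sched_timer), 1);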

    Reviewed-by: Preeti U Murthy
    Signed-off-by: Viresh Kumar
    Cc: Thomas Gleixner
    Signed-off-by: Frederic Weisbecker

    Viresh Kumar
     

15 Aug, 2014

1 commit

  • Benjamin Herrenschmidt pointed out that I further missed modifying
    update_vsyscall after the wall_to_mono value was changed to a
    timespec64. This causes issues on powerpc32, which expects a 32bit
    timespec.

    This patch fixes the problem by properly converting from a timespec64 to
    a timespec before passing the value on to the arch-specific vsyscall
    logic.

    [ Thomas is currently on vacation, but reviewed it and wanted me to send
    this fix on to you directly. ]

    Cc: LKML
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Benjamin Herrenschmidt
    Reported-by: Benjamin Herrenschmidt
    Reviewed-by: Thomas Gleixner
    Signed-off-by: John Stultz
    Signed-off-by: Linus Torvalds

    John Stultz
     

06 Aug, 2014

1 commit

  • Pull timer and time updates from Thomas Gleixner:
    "A rather large update of timers, timekeeping & co

    - Core timekeeping code is year-2038 safe now for 32bit machines.
    Now we just need to fix all in kernel users and the gazillion of
    user space interfaces which rely on timespec/timeval :)

    - Better cache layout for the timekeeping internal data structures.

    - Proper nanosecond based interfaces for in kernel users.

    - Tree wide cleanup of code which wants nanoseconds but does hoops
    and loops to convert back and forth from timespecs. Some of it
    definitely belongs into the ugly code museum.

    - Consolidation of the timekeeping interface zoo.

    - A fast NMI safe accessor to clock monotonic for tracing. This is a
    long standing request to support correlated user/kernel space
    traces. With proper NTP frequency correction it's also suitable
    for correlation of traces across separate machines.

    - Checkpoint/restart support for timerfd.

    - A few NOHZ[_FULL] improvements in the [hr]timer code.

    - Code move from kernel to kernel/time of all time* related code.

    - New clocksource/event drivers from the ARM universe. I'm really
    impressed that despite an architected timer in the newer chips SoC
    manufacturers insist on inventing new and differently broken SoC
    specific timers.

    [ Ed. "Impressed"? I don't think that word means what you think it means ]

    - Another round of code move from arch to drivers. Looks like most
    of the legacy mess in ARM regarding timers is sorted out except for
    a few obnoxious strongholds.

    - The usual updates and fixlets all over the place"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
    timekeeping: Fixup typo in update_vsyscall_old definition
    clocksource: document some basic timekeeping concepts
    timekeeping: Use cached ntp_tick_length when accumulating error
    timekeeping: Rework frequency adjustments to work better w/ nohz
    timekeeping: Minor fixup for timespec64->timespec assignment
    ftrace: Provide trace clocks monotonic
    timekeeping: Provide fast and NMI safe access to CLOCK_MONOTONIC
    seqcount: Add raw_write_seqcount_latch()
    seqcount: Provide raw_read_seqcount()
    timekeeping: Use tk_read_base as argument for timekeeping_get_ns()
    timekeeping: Create struct tk_read_base and use it in struct timekeeper
    timekeeping: Restructure the timekeeper some more
    clocksource: Get rid of cycle_last
    clocksource: Move cycle_last validation to core code
    clocksource: Make delta calculation a function
    wireless: ath9k: Get rid of timespec conversions
    drm: vmwgfx: Use nsec based interfaces
    drm: i915: Use nsec based interfaces
    timekeeping: Provide ktime_get_raw()
    hangcheck-timer: Use ktime_get_ns()
    ...

    Linus Torvalds
     

05 Aug, 2014

3 commits

  • Pull staging driver updates from Greg KH:
    "Here's the big pull request for the staging driver tree for 3.17-rc1.

    Lots of things in here, over 2000 patches, but the best part is this:
    1480 files changed, 39070 insertions(+), 254659 deletions(-)

    Thanks to the great work of Kristina Martšenko, 14 different staging
    drivers have been removed from the tree as they were obsolete and no
    one was willing to work on cleaning them up. Other than the driver
    removals, loads of cleanups are in here (comedi, lustre, etc.) as well
    as the usual IIO driver updates and additions.

    All of this has been in the linux-next tree for a while"

    * tag 'staging-3.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (2199 commits)
    staging: comedi: addi_apci_1564: remove diagnostic interrupt support code
    staging: comedi: addi_apci_1564: add subdevice to check diagnostic status
    staging: wlan-ng: coding style problem fix
    staging: wlan-ng: fixing coding style problems
    staging: comedi: ii_pci20kc: request and ioremap memory
    staging: lustre: bitwise vs logical typo
    staging: dgnc: Remove unneeded dgnc_trace.c and dgnc_trace.h
    staging: dgnc: rephrase comment
    staging: comedi: ni_tio: remove some dead code
    staging: rtl8723au: Fix static symbol sparse warning
    staging: rtl8723au: usb_dvobj_init(): Remove unused variable 'pdev_desc'
    staging: rtl8723au: Do not duplicate kernel provided USB macros
    staging: rtl8723au: Remove never set struct pwrctrl_priv.bHWPowerdown
    staging: rtl8723au: Remove two never set variables
    staging: rtl8723au: RSSI_test is never set
    staging:r8190: coding style: Fixed checkpatch reported Error
    staging:r8180: coding style: Fixed too long lines
    staging:r8180: coding style: Fixed commenting style
    staging: lustre: ptlrpc: lproc_ptlrpc.c - fix dereferenceing user space buffer
    staging: lustre: ldlm: ldlm_resource.c - fix dereferenceing user space buffer
    ...

    Linus Torvalds
     
  • Pull scheduler updates from Ingo Molnar:

    - Move the nohz kick code out of the scheduler tick to a dedicated IPI,
    from Frederic Weisbecker.

    This necessitated quite some background infrastructure rework,
    including:

    * Clean up some irq-work internals
    * Implement remote irq-work
    * Implement nohz kick on top of remote irq-work
    * Move full dynticks timer enqueue notification to new kick
    * Move multi-task notification to new kick
    * Remove unnecessary barriers on multi-task notification

    - Remove proliferation of wait_on_bit() action functions and allow
    wait_on_bit_action() functions to support a timeout. (Neil Brown)

    - Another round of sched/numa improvements, cleanups and fixes. (Rik
    van Riel)

    - Implement fast idling of CPUs when the system is partially loaded,
    for better scalability. (Tim Chen)

    - Restructure and fix the CPU hotplug handling code that may leave
    cfs_rq and rt_rq's throttled when tasks are migrated away from a dead
    cpu. (Kirill Tkhai)

    - Robustify the sched topology setup code. (Peter Zijlstra)

    - Improve sched_feat() handling wrt. static_keys (Jason Baron)

    - Misc fixes.

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
    sched/fair: Fix 'make xmldocs' warning caused by missing description
    sched: Use macro for magic number of -1 for setparam
    sched: Robustify topology setup
    sched: Fix sched_setparam() policy == -1 logic
    sched: Allow wait_on_bit_action() functions to support a timeout
    sched: Remove proliferation of wait_on_bit() action functions
    sched/numa: Revert "Use effective_load() to balance NUMA loads"
    sched: Fix static_key race with sched_feat()
    sched: Remove extra static_key*() function indirection
    sched/rt: Fix replenish_dl_entity() comments to match the current upstream code
    sched: Transform resched_task() into resched_curr()
    sched/deadline: Kill task_struct->pi_top_task
    sched: Rework check_for_tasks()
    sched/rt: Enqueue just unthrottled rt_rq back on the stack in __disable_runtime()
    sched/fair: Disable runtime_enabled on dying rq
    sched/numa: Change scan period code to match intent
    sched/numa: Rework best node setting in task_numa_migrate()
    sched/numa: Examine a task move when examining a task swap
    sched/numa: Simplify task_numa_compare()
    sched/numa: Use effective_load() to balance NUMA loads
    ...

    Linus Torvalds
     
  • Pull RCU changes from Ingo Molnar:
    "The main changes:

    - torture-test updates
    - callback-offloading changes
    - maintainership changes
    - update RCU documentation
    - miscellaneous fixes"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
    rcu: Allow for NULL tick_nohz_full_mask when nohz_full= missing
    rcu: Fix a sparse warning in rcu_report_unblock_qs_rnp()
    rcu: Fix a sparse warning in rcu_initiate_boost()
    rcu: Fix __rcu_reclaim() to use true/false for bool
    rcu: Remove CONFIG_PROVE_RCU_DELAY
    rcu: Use __this_cpu_read() instead of per_cpu_ptr()
    rcu: Don't use NMIs to dump other CPUs' stacks
    rcu: Bind grace-period kthreads to non-NO_HZ_FULL CPUs
    rcu: Simplify priority boosting by putting rt_mutex in rcu_node
    rcu: Check both root and current rcu_node when setting up future grace period
    rcu: Allow post-unlock reference for rt_mutex
    rcu: Loosen __call_rcu()'s rcu_head alignment constraint
    rcu: Eliminate read-modify-write ACCESS_ONCE() calls
    rcu: Remove redundant ACCESS_ONCE() from tick_do_timer_cpu
    rcu: Make rcu node arrays static const char * const
    signal: Explain local_irq_save() call
    rcu: Handle obsolete references to TINY_PREEMPT_RCU
    rcu: Document deadlock-avoidance information for rcu_read_unlock()
    scripts: Teach get_maintainer.pl about the new "R:" tag
    rcu: Update rcu torture maintainership filename patterns
    ...

    Linus Torvalds
     

01 Aug, 2014

1 commit

  • clockevents_increase_min_delta() calls printk() from under
    hrtimer_bases.lock. That causes lock inversion on scheduler locks because
    printk() can call into the scheduler. Lockdep puts it as:

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    3.15.0-rc8-06195-g939f04b #2 Not tainted
    -------------------------------------------------------
    trinity-main/74 is trying to acquire lock:
    (&port_lock_key){-.....}, at: [] serial8250_console_write+0x8c/0x10c

    but task is already holding lock:
    (hrtimer_bases.lock){-.-...}, at: [] hrtimer_try_to_cancel+0x13/0x66

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #5 (hrtimer_bases.lock){-.-...}:
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock_irqsave+0x2e/0x3e
    [] __hrtimer_start_range_ns+0x1c/0x197
    [] perf_swevent_start_hrtimer.part.41+0x7a/0x85
    [] task_clock_event_start+0x3a/0x3f
    [] task_clock_event_add+0xd/0x14
    [] event_sched_in+0xb6/0x17a
    [] group_sched_in+0x44/0x122
    [] ctx_sched_in.isra.67+0x105/0x11f
    [] perf_event_sched_in.isra.70+0x47/0x4b
    [] __perf_install_in_context+0x8b/0xa3
    [] remote_function+0x12/0x2a
    [] smp_call_function_single+0x2d/0x53
    [] task_function_call+0x30/0x36
    [] perf_install_in_context+0x87/0xbb
    [] SYSC_perf_event_open+0x5c6/0x701
    [] SyS_perf_event_open+0x17/0x19
    [] syscall_call+0x7/0xb

    -> #4 (&ctx->lock){......}:
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock+0x21/0x30
    [] __perf_event_task_sched_out+0x1dc/0x34f
    [] __schedule+0x4c6/0x4cb
    [] schedule+0xf/0x11
    [] work_resched+0x5/0x30

    -> #3 (&rq->lock){-.-.-.}:
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock+0x21/0x30
    [] __task_rq_lock+0x33/0x3a
    [] wake_up_new_task+0x25/0xc2
    [] do_fork+0x15c/0x2a0
    [] kernel_thread+0x1a/0x1f
    [] rest_init+0x1a/0x10e
    [] start_kernel+0x303/0x308
    [] i386_start_kernel+0x79/0x7d

    -> #2 (&p->pi_lock){-.-...}:
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock_irqsave+0x2e/0x3e
    [] try_to_wake_up+0x1d/0xd6
    [] default_wake_function+0xb/0xd
    [] __wake_up_common+0x39/0x59
    [] __wake_up+0x29/0x3b
    [] tty_wakeup+0x49/0x51
    [] uart_write_wakeup+0x17/0x19
    [] serial8250_tx_chars+0xbc/0xfb
    [] serial8250_handle_irq+0x54/0x6a
    [] serial8250_default_handle_irq+0x19/0x1c
    [] serial8250_interrupt+0x38/0x9e
    [] handle_irq_event_percpu+0x5f/0x1e2
    [] handle_irq_event+0x2c/0x43
    [] handle_level_irq+0x57/0x80
    [] handle_irq+0x46/0x5c
    [] do_IRQ+0x32/0x89
    [] common_interrupt+0x2e/0x33
    [] _raw_spin_unlock_irqrestore+0x3f/0x49
    [] uart_start+0x2d/0x32
    [] uart_write+0xc7/0xd6
    [] n_tty_write+0xb8/0x35e
    [] tty_write+0x163/0x1e4
    [] redirected_tty_write+0x6d/0x75
    [] vfs_write+0x75/0xb0
    [] SyS_write+0x44/0x77
    [] syscall_call+0x7/0xb

    -> #1 (&tty->write_wait){-.....}:
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock_irqsave+0x2e/0x3e
    [] __wake_up+0x15/0x3b
    [] tty_wakeup+0x49/0x51
    [] uart_write_wakeup+0x17/0x19
    [] serial8250_tx_chars+0xbc/0xfb
    [] serial8250_handle_irq+0x54/0x6a
    [] serial8250_default_handle_irq+0x19/0x1c
    [] serial8250_interrupt+0x38/0x9e
    [] handle_irq_event_percpu+0x5f/0x1e2
    [] handle_irq_event+0x2c/0x43
    [] handle_level_irq+0x57/0x80
    [] handle_irq+0x46/0x5c
    [] do_IRQ+0x32/0x89
    [] common_interrupt+0x2e/0x33
    [] _raw_spin_unlock_irqrestore+0x3f/0x49
    [] uart_start+0x2d/0x32
    [] uart_write+0xc7/0xd6
    [] n_tty_write+0xb8/0x35e
    [] tty_write+0x163/0x1e4
    [] redirected_tty_write+0x6d/0x75
    [] vfs_write+0x75/0xb0
    [] SyS_write+0x44/0x77
    [] syscall_call+0x7/0xb

    -> #0 (&port_lock_key){-.....}:
    [] __lock_acquire+0x9ea/0xc6d
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock_irqsave+0x2e/0x3e
    [] serial8250_console_write+0x8c/0x10c
    [] call_console_drivers.constprop.31+0x87/0x118
    [] console_unlock+0x1d7/0x398
    [] vprintk_emit+0x3da/0x3e4
    [] printk+0x17/0x19
    [] clockevents_program_min_delta+0x104/0x116
    [] clockevents_program_event+0xe7/0xf3
    [] tick_program_event+0x1e/0x23
    [] hrtimer_force_reprogram+0x88/0x8f
    [] __remove_hrtimer+0x5b/0x79
    [] hrtimer_try_to_cancel+0x49/0x66
    [] hrtimer_cancel+0xd/0x18
    [] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
    [] task_clock_event_stop+0x20/0x64
    [] task_clock_event_del+0xd/0xf
    [] event_sched_out+0xab/0x11e
    [] group_sched_out+0x1d/0x66
    [] ctx_sched_out+0xaf/0xbf
    [] __perf_event_task_sched_out+0x1ed/0x34f
    [] __schedule+0x4c6/0x4cb
    [] schedule+0xf/0x11
    [] work_resched+0x5/0x30

    other info that might help us debug this:

    Chain exists of:
    &port_lock_key --> &ctx->lock --> hrtimer_bases.lock

    Possible unsafe locking scenario:

    CPU0                          CPU1
    ----                          ----
    lock(hrtimer_bases.lock);
                                  lock(&ctx->lock);
                                  lock(hrtimer_bases.lock);
    lock(&port_lock_key);

    *** DEADLOCK ***

    4 locks held by trinity-main/74:
    #0: (&rq->lock){-.-.-.}, at: [] __schedule+0xed/0x4cb
    #1: (&ctx->lock){......}, at: [] __perf_event_task_sched_out+0x1dc/0x34f
    #2: (hrtimer_bases.lock){-.-...}, at: [] hrtimer_try_to_cancel+0x13/0x66
    #3: (console_lock){+.+...}, at: [] vprintk_emit+0x3c7/0x3e4

    stack backtrace:
    CPU: 0 PID: 74 Comm: trinity-main Not tainted 3.15.0-rc8-06195-g939f04b #2
    00000000 81c3a310 8b995c14 81426f69 8b995c44 81425a99 8161f671 8161f570
    8161f538 8161f559 8161f538 8b995c78 8b142bb0 00000004 8b142fdc 8b142bb0
    8b995ca8 8104a62d 8b142fac 000016f2 81c3a310 00000001 00000001 00000003
    Call Trace:
    [] dump_stack+0x16/0x18
    [] print_circular_bug+0x18f/0x19c
    [] __lock_acquire+0x9ea/0xc6d
    [] lock_acquire+0x92/0x101
    [] ? serial8250_console_write+0x8c/0x10c
    [] ? wait_for_xmitr+0x76/0x76
    [] _raw_spin_lock_irqsave+0x2e/0x3e
    [] ? serial8250_console_write+0x8c/0x10c
    [] serial8250_console_write+0x8c/0x10c
    [] ? lock_release+0x191/0x223
    [] ? wait_for_xmitr+0x76/0x76
    [] call_console_drivers.constprop.31+0x87/0x118
    [] console_unlock+0x1d7/0x398
    [] vprintk_emit+0x3da/0x3e4
    [] printk+0x17/0x19
    [] clockevents_program_min_delta+0x104/0x116
    [] tick_program_event+0x1e/0x23
    [] hrtimer_force_reprogram+0x88/0x8f
    [] __remove_hrtimer+0x5b/0x79
    [] hrtimer_try_to_cancel+0x49/0x66
    [] hrtimer_cancel+0xd/0x18
    [] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
    [] task_clock_event_stop+0x20/0x64
    [] task_clock_event_del+0xd/0xf
    [] event_sched_out+0xab/0x11e
    [] group_sched_out+0x1d/0x66
    [] ctx_sched_out+0xaf/0xbf
    [] __perf_event_task_sched_out+0x1ed/0x34f
    [] ? __dequeue_entity+0x23/0x27
    [] ? pick_next_task_fair+0xb1/0x120
    [] __schedule+0x4c6/0x4cb
    [] ? trace_hardirqs_off_caller+0xd7/0x108
    [] ? trace_hardirqs_off+0xb/0xd
    [] ? rcu_irq_exit+0x64/0x77

    Fix the problem by using printk_deferred() which does not call into the
    scheduler.
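
    The change itself is mechanical; a sketch of the call site in
    clockevents_increase_min_delta() (message text from memory):

    /* printk() may take scheduler locks; printk_deferred() defers
     * the console output via irq_work instead. */
    printk_deferred(KERN_WARNING
                    "CE: %s increased min_delta_ns to %llu nsec\n",
                    dev->name ? dev->name : "?",
                    (unsigned long long) dev->min_delta_ns);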

    Reported-by: Fengguang Wu
    Signed-off-by: Jan Kara
    Cc: stable@vger.kernel.org
    Signed-off-by: Thomas Gleixner

    Jan Kara
     

24 Jul, 2014

7 commits

  • During suspend we call sched_clock_poll() to update the epoch and
    accumulated time and reprogram the sched_clock_timer to fire
    before the next wrap-around time. Unfortunately,
    sched_clock_poll() doesn't restart the timer; instead it relies
    on the hrtimer layer to do that, and during suspend we aren't
    calling that function from the hrtimer layer. Instead, we're
    reprogramming the expires time while the hrtimer is enqueued,
    which can corrupt the hrtimer tree. Furthermore, we
    restart the timer during suspend but we update the epoch during
    resume which seems counter-intuitive.

    Let's fix this by saving the accumulated state and canceling the
    timer during suspend. On resume we can update the epoch and
    restart the timer similar to what we would do if we were starting
    the clock for the first time.
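
    A sketch of that shape (names per the 3.16-era
    kernel/time/sched_clock.c, from memory):

    static int sched_clock_suspend(void)
    {
        update_sched_clock();                /* save accumulated state */
        hrtimer_cancel(&sched_clock_timer);  /* never poke an enqueued timer */
        cd.suspended = true;
        return 0;
    }

    static void sched_clock_resume(void)
    {
        cd.epoch_cyc = read_sched_clock();   /* fresh epoch, as at boot */
        hrtimer_start(&sched_clock_timer, cd.wrap_kt, HRTIMER_MODE_REL);
        cd.suspended = false;
    }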

    Fixes: a08ca5d1089d "sched_clock: Use an hrtimer instead of timer"
    Signed-off-by: Stephen Boyd
    Signed-off-by: John Stultz
    Link: http://lkml.kernel.org/r/1406174630-23458-1-git-send-email-john.stultz@linaro.org
    Cc: Ingo Molnar
    Cc: stable
    Signed-off-by: Thomas Gleixner

    Stephen Boyd
     
  • By caching the ntp_tick_length() when we correct the frequency error,
    and then using that cached value to accumulate error, we avoid large
    initial errors when the tick length is changed.

    This makes convergence happen much faster in the simulator, since the
    initial error doesn't have to be slowly whittled away.

    This initially seems like an accounting error, but Miroslav pointed out
    that ntp_tick_length() can change mid-tick, so when we apply it in the
    error accumulation, we are applying any recent change to the entire tick.

    This approach chooses to apply changes in the ntp_tick_length() only to
    the next tick, which allows us to calculate the freq correction before
    using the new tick length, which avoids accummulating error.

    Credit to Miroslav for pointing this out, and for providing the
    original patch this functionality was pulled from, along with the
    rationale.

    Cc: Miroslav Lichvar
    Cc: Richard Cochran
    Cc: Prarit Bhargava
    Reported-by: Miroslav Lichvar
    Signed-off-by: John Stultz

    John Stultz
     
  • The existing timekeeping_adjust logic has always been complicated
    to understand. Further, since it was developed prior to NOHZ becoming
    common, it's not surprising that it performs poorly when NOHZ is
    enabled.

    Since Miroslav pointed out the problematic nature of the existing code
    in the NOHZ case, I've tried to refactor the code to perform better.

    The problem with the previous approach was that it tried to adjust
    for the total cumulative error using a scaled dampening factor. This
    resulted in large errors to be corrected slowly, while small errors
    were corrected quickly. With NOHZ the timekeeping code doesn't know
    how far out the next tick will be, so this results in bad
    over-correction to small errors, and insufficient correction to large
    errors.

    Inspired by Miroslav's patch, I've refactored the code to try to
    address the correction in two steps.

    1) Check the future freq error for the next tick, and if the frequency
    error is large, try to make sure we correct it so it doesn't cause
    much accumulated error.

    2) Then make a small single unit adjustment to correct any cumulative
    error that has collected over time.

    This method performs fairly well in the simulator Miroslav created.

    Major credit to Miroslav for pointing out the issue, providing the
    original patch to resolve this, a simulator for testing, as well as
    helping debug and resolve issues in my implementation so that it
    performed closer to his original implementation.

    Cc: Miroslav Lichvar
    Cc: Richard Cochran
    Cc: Prarit Bhargava
    Reported-by: Miroslav Lichvar
    Signed-off-by: John Stultz

    John Stultz
     
  • In the GENERIC_TIME_VSYSCALL_OLD update_vsyscall implementation,
    we take the tk_xtime() value, which returns a timespec64, and
    store it in a timespec.

    This luckily is ok, since the only architectures that use
    GENERIC_TIME_VSYSCALL_OLD are ia64 and ppc64, which are both
    64 bit systems where timespec64 is the same as a timespec.

    Even so, for cleanliness reasons, use the conversion function
    to assign the proper type.
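
    The cleanup is a one-liner; a sketch:

    /* explicit conversion instead of relying on layout equivalence */
    xt = timespec64_to_timespec(tk_xtime(tk));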

    Signed-off-by: John Stultz

    John Stultz
     
  • Tracers want a correlated time between the kernel instrumentation and
    user space. We really do not want to export sched_clock() to user
    space, so we need to provide something sensible for this.

    Using separate data structures with a non-blocking sequence count
    based update mechanism allows us to do that. The data structure
    required for the readout has a sequence counter and two copies of the
    timekeeping data.

    On the update side:

    smp_wmb();
    tkf->seq++;
    smp_wmb();
    update(tkf->base[0], tk);
    smp_wmb();
    tkf->seq++;
    smp_wmb();
    update(tkf->base[1], tk);

    On the reader side:

    do {
        seq = tkf->seq;
        smp_rmb();
        idx = seq & 0x01;
        now = now(tkf->base[idx]);
        smp_rmb();
    } while (seq != tkf->seq)

    So if an NMI hits the update of base[0], it will use base[1], which is
    still consistent, but this timestamp is not guaranteed to be monotonic
    across an update.

    The timestamp is calculated by:

    now = base_mono + clock_delta * slope

    So if the update lowers the slope, readers who are forced to the
    not yet updated second array are still using the old steeper slope.

    tmono
    ^
    |    o  n
    |   o n
    |  u
    | o
    |o
    |12345678---> reader order

    o = old slope
    u = update
    n = new slope

    So reader 6 will observe time going backwards versus reader 5.

    While other CPUs are likely to be able to observe that, the only way
    for a CPU local observation is when an NMI hits in the middle of
    the update. Timestamps taken from that NMI context might be ahead
    of the following timestamps. Callers need to be aware of that and
    deal with it.

    V2: Got rid of clock monotonic raw and reorganized the data
    structures. Folded in the barrier fix from Mathieu.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Mathieu Desnoyers
    Signed-off-by: John Stultz

    Thomas Gleixner
     
  • All the function needs is in the tk_read_base struct. No functional
    change for the current code, just a preparatory patch for the NMI safe
    accessor to clock monotonic which will use struct tk_read_base as well.

    Signed-off-by: Thomas Gleixner
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Mathieu Desnoyers
    Signed-off-by: John Stultz

    Thomas Gleixner
     
  • The members of the new struct are the ones required for the new NMI-
    safe accessor to clock monotonic. In order to reuse the existing
    timekeeping code and to make the update of the fast NMI-safe
    timekeepers a simple memcpy, use the struct for the timekeeper as well
    and convert all users.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Mathieu Desnoyers
    Signed-off-by: John Stultz

    Thomas Gleixner