16 Nov, 2014

1 commit

  • While looking over the cpu-timer code I found that we appear to add
    the delta for the calling task twice, through:

    cpu_timer_sample_group()
      thread_group_cputimer()
        thread_group_cputime()
          times->sum_exec_runtime += task_sched_runtime();

      *sample = cputime.sum_exec_runtime + task_delta_exec();

    Which would make the sample run ahead, making the sleep short.
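
    The eventual fix follows directly from that analysis; a minimal sketch
    (assuming the 3.18-era cpu_timer_sample_group(), not necessarily the
    exact patch):

    case CPUCLOCK_SCHED:
        /* thread_group_cputime() already accounted the caller's delta
         * via task_sched_runtime(), so don't add task_delta_exec() again. */
        *sample = cputime.sum_exec_runtime;
        break;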

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Stanislaw Gruszka
    Cc: Christoph Lameter
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Rik van Riel
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20141112113737.GI10476@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

25 Oct, 2014

2 commits

  • Andrey reported that on a kernel with UBSan enabled he found:

    UBSan: Undefined behaviour in ../kernel/time/clockevents.c:75:34

    I guess it should be 1ULL here instead of 1U:
    (!ismax || evt->mult <= (1U << evt->shift)))

    That's indeed the correct solution because shift might be 32.
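
    A sketch of the corrected condition in cev_delta2ns() (surrounding
    variables clc and rnd as in the 3.17-era kernel/time/clockevents.c):

    /* 1U << 32 is undefined behaviour when shift == 32; a 64-bit
     * constant makes the shift well-defined. */
    if ((~0ULL - clc > rnd) &&
        (!ismax || evt->mult <= (1ULL << evt->shift)))
        clc += rnd;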

    Reported-by: Andrey Ryabinin
    Cc: Peter Zijlstra
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • If userland creates a timer without specifying sigevent info, we'll
    create one ourselves, using a stack-local variable. In particular, we'll
    use the timer ID as sival_int. But as sigev_value is a union containing
    a pointer and an int, that assignment will only partially initialize
    sigev_value on systems where a pointer is bigger than an int. On such
    systems we'll copy the uninitialized stack bytes from the
    timer_create() call to userland when the timer actually fires and we're
    about to deliver the signal.

    Initialize sigev_value with 0 to plug the stack info leak.

    Found in the PaX patch, written by the PaX Team.
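
    A sketch of the plug, simplified from the timer_create() default-
    sigevent path (names per the 3.17-era kernel/time/posix-timers.c, from
    memory):

    } else {
        /* Assigning sival_int alone leaves the rest of the union
         * uninitialized on 64-bit, so zero the whole thing first. */
        memset(&event.sigev_value, 0, sizeof(event.sigev_value));
        event.sigev_notify = SIGEV_SIGNAL;
        event.sigev_signo = SIGALRM;
        event.sigev_value.sival_int = new_timer->it_id;
    }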

    Fixes: 5a9fa7307285 ("posix-timers: kill ->it_sigev_signo and...")
    Signed-off-by: Mathias Krause
    Cc: Oleg Nesterov
    Cc: Brad Spengler
    Cc: PaX Team
    Cc: stable@vger.kernel.org # v2.6.28+
    Link: http://lkml.kernel.org/r/1412456799-32339-1-git-send-email-minipli@googlemail.com
    Signed-off-by: Thomas Gleixner

    Mathias Krause
     

15 Oct, 2014

1 commit

  • Pull percpu consistent-ops changes from Tejun Heo:
    "Way back, before the current percpu allocator was implemented, static
    and dynamic percpu memory areas were allocated and handled separately
    and had their own accessors. The distinction has been gone for many
    years now; however, the now duplicate two sets of accessors remained
    with the pointer based ones - this_cpu_*() - evolving various other
    operations over time. During the process, we also accumulated other
    inconsistent operations.

    This pull request contains Christoph's patches to clean up the
    duplicate accessor situation. __get_cpu_var() uses are replaced with
    this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr().

    Unfortunately, the former sometimes is tricky thanks to C being a bit
    messy with the distinction between lvalues and pointers, which led to
    a rather ugly solution for cpumask_var_t involving the introduction of
    this_cpu_cpumask_var_ptr().

    This converts most of the uses but not all. Christoph will follow up
    with the remaining conversions in this merge window and hopefully
    remove the obsolete accessors"
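
    As a rough before/after sketch of the conversion (the accessors are the
    real kernel APIs; struct foo and my_foo are made up for illustration):

    #include <linux/percpu.h>

    DEFINE_PER_CPU(struct foo, my_foo);   /* hypothetical percpu variable */

    struct foo *p;
    p = &__get_cpu_var(my_foo);   /* old: lvalue-based accessor */
    p = this_cpu_ptr(&my_foo);    /* new: pointer-based accessor */

    p = __this_cpu_ptr(&my_foo);  /* old name for the raw variant ... */
    p = raw_cpu_ptr(&my_foo);     /* ... and its replacement */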

    * 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits)
    irqchip: Properly fetch the per cpu offset
    percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix
    ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write.
    percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t
    Revert "powerpc: Replace __get_cpu_var uses"
    percpu: Remove __this_cpu_ptr
    clocksource: Replace __this_cpu_ptr with raw_cpu_ptr
    sparc: Replace __get_cpu_var uses
    avr32: Replace __get_cpu_var with __this_cpu_write
    blackfin: Replace __get_cpu_var uses
    tile: Use this_cpu_ptr() for hardware counters
    tile: Replace __get_cpu_var uses
    powerpc: Replace __get_cpu_var uses
    alpha: Replace __get_cpu_var
    ia64: Replace __get_cpu_var uses
    s390: cio driver &__get_cpu_var replacements
    s390: Replace __get_cpu_var uses
    mips: Replace __get_cpu_var uses
    MIPS: Replace __get_cpu_var uses in FPU emulator.
    arm: Replace __this_cpu_ptr with raw_cpu_ptr
    ...

    Linus Torvalds
     

14 Oct, 2014

1 commit

  • Pull s390 updates from Martin Schwidefsky:
    "This patch set contains the main portion of the changes for 3.18 in
    regard to the s390 architecture. It is a bit bigger than usual,
    mainly because of a new driver and the vector extension patches.

    The interesting bits are:
    - Quite a bit of work on the tracing front. Uprobes is enabled and
    the ftrace code is reworked to get some of the lost performance
    back if CONFIG_FTRACE is enabled.
    - To improve boot time with CONFIG_DEBUG_PAGEALLOC, support for the
    IPTE range facility is added.
    - The rwlock code is re-factored to improve writer fairness and to be
    able to use the interlocked-access instructions.
    - The kernel part for the support of the vector extension is added.
    - The device driver to access the CD/DVD on the HMC is added, this
    will hopefully come in handy to improve the installation process.
    - Add support for control-unit initiated reconfiguration.
    - The crypto device driver is enhanced to enable the additional AP
    domains and to allow the new crypto hardware to be used.
    - Bug fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (39 commits)
    s390/ftrace: simplify enabling/disabling of ftrace_graph_caller
    s390/ftrace: remove 31 bit ftrace support
    s390/kdump: add support for vector extension
    s390/disassembler: add vector instructions
    s390: add support for vector extension
    s390/zcrypt: Toleration of new crypto hardware
    s390/idle: consolidate idle functions and definitions
    s390/nohz: use a per-cpu flag for arch_needs_cpu
    s390/vtime: do not reset idle data on CPU hotplug
    s390/dasd: add support for control unit initiated reconfiguration
    s390/dasd: fix infinite loop during format
    s390/mm: make use of ipte range facility
    s390/setup: correct 4-level kernel page table detection
    s390/topology: call set_sched_topology early
    s390/uprobes: architecture backend for uprobes
    s390/uprobes: common library for kprobes and uprobes
    s390/rwlock: use the interlocked-access facility 1 instructions
    s390/rwlock: improve writer fairness
    s390/rwlock: remove interrupt-enabling rwlock variant.
    s390/mm: remove change bit override support
    ...

    Linus Torvalds
     

13 Oct, 2014

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Optimized support for Intel "Cluster-on-Die" (CoD) topologies (Dave
    Hansen)

    - Various sched/idle refinements for better idle handling (Nicolas
    Pitre, Daniel Lezcano, Chuansheng Liu, Vincent Guittot)

    - sched/numa updates and optimizations (Rik van Riel)

    - sysbench speedup (Vincent Guittot)

    - capacity calculation cleanups/refactoring (Vincent Guittot)

    - Various cleanups to thread group iteration (Oleg Nesterov)

    - Double-rq-lock removal optimization and various refactorings
    (Kirill Tkhai)

    - various sched/deadline fixes

    ... and lots of other changes"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
    sched/dl: Use dl_bw_of() under rcu_read_lock_sched()
    sched/fair: Delete resched_cpu() from idle_balance()
    sched, time: Fix build error with 64 bit cputime_t on 32 bit systems
    sched: Improve sysbench performance by fixing spurious active migration
    sched/x86: Fix up typo in topology detection
    x86, sched: Add new topology for multi-NUMA-node CPUs
    sched/rt: Use resched_curr() in task_tick_rt()
    sched: Use rq->rd in sched_setaffinity() under RCU read lock
    sched: cleanup: Rename 'out_unlock' to 'out_free_new_mask'
    sched: Use dl_bw_of() under RCU read lock
    sched/fair: Remove duplicate code from can_migrate_task()
    sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSW
    sched: print_rq(): Don't use tasklist_lock
    sched: normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use task_rq_lock()
    sched: Fix the task-group check in tg_has_rt_tasks()
    sched/fair: Leverage the idle state info when choosing the "idlest" cpu
    sched: Let the scheduler see CPU idle states
    sched/deadline: Fix inter- exclusive cpusets migrations
    sched/deadline: Clear dl_entity params when setscheduling to different class
    sched/numa: Kill the wrong/dead TASK_DEAD check in task_numa_fault()
    ...

    Linus Torvalds
     

09 Oct, 2014

3 commits

  • Pull timer updates from Thomas Gleixner:
    "Nothing really exciting this time:

    - a few fixlets in the NOHZ code

    - a new ARM SoC timer abomination. One should expect that we have
    enough of them already, but they insist on inventing new ones.

    - the usual bunch of ARM SoC timer updates. That feels like herding
    cats"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    clocksource: arm_arch_timer: Consolidate arch_timer_evtstrm_enable
    clocksource: arm_arch_timer: Enable counter access for 32-bit ARM
    clocksource: arm_arch_timer: Change clocksource name if CP15 unavailable
    clocksource: sirf: Disable counter before re-setting it
    clocksource: cadence_ttc: Add support for 32bit mode
    clocksource: tcb_clksrc: Sanitize IRQ request
    clocksource: arm_arch_timer: Discard unavailable timers correctly
    clocksource: vf_pit_timer: Support shutdown mode
    ARM: meson6: clocksource: Add Meson6 timer support
    ARM: meson: documentation: Add timer documentation
    clocksource: sh_tmu: Document r8a7779 binding
    clocksource: sh_mtu2: Document r7s72100 binding
    clocksource: sh_cmt: Document SoC specific bindings
    timerfd: Remove an always true check
    nohz: Avoid tick's double reprogramming in highres mode
    nohz: Fix spurious periodic tick behaviour in low-res dynticks mode

    Linus Torvalds
     
  • Pull timer fixes from Ingo Molnar:
    "Main changes:

    - Fix the deadlock reported by Dave Jones et al
    - Clean up and fix nohz_full interaction with arch abilities
    - nohz init code consolidation/cleanup"

    * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    nohz: nohz full depends on irq work self IPI support
    nohz: Consolidate nohz full init code
    arm64: Tell irq work about self IPI support
    arm: Tell irq work about self IPI support
    x86: Tell irq work about self IPI support
    irq_work: Force raised irq work to run on irq work interrupt
    irq_work: Introduce arch_irq_work_has_interrupt()
    nohz: Move nohz full init call to tick init

    Linus Torvalds
     
  • Move the nohz_delay bit from the s390_idle data structure to the
    per-cpu flags. Clear the nohz delay flag in __cpu_disable and
    remove the cpu hotplug notifier that used to do this.

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     

19 Sep, 2014

1 commit

  • schedule(), io_schedule() and schedule_timeout() always return
    with TASK_RUNNING set, so setting it again afterwards is unnecessary.

    (All the places in the patch are visibly fine; the only exception is
    kiblnd_scheduler() from:

    drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c

    whose schedule() call sits one line above the standard 3 lines of
    unified-diff context.)

    None of these places use set_current_state() for its implied memory
    barrier.
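
    The removed pattern looks like this (a generic sketch, not taken from
    any specific file in the patch):

    set_current_state(TASK_INTERRUPTIBLE);
    schedule();
    /* Redundant: schedule() always returns with TASK_RUNNING set. */
    __set_current_state(TASK_RUNNING);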

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Link: http://lkml.kernel.org/r/1410529254.3569.23.camel@tkhai
    Cc: Alasdair Kergon
    Cc: Anil Belur
    Cc: Arnd Bergmann
    Cc: Dave Kleikamp
    Cc: David Airlie
    Cc: David Howells
    Cc: Dmitry Eremin
    Cc: Frank Blaschka
    Cc: Greg Kroah-Hartman
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Isaac Huang
    Cc: James E.J. Bottomley
    Cc: James E.J. Bottomley
    Cc: J. Bruce Fields
    Cc: Jeff Dike
    Cc: Jesper Nilsson
    Cc: Jiri Slaby
    Cc: Laura Abbott
    Cc: Liang Zhen
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Masaru Nomura
    Cc: Michael Opdenacker
    Cc: Mikael Starvik
    Cc: Mike Snitzer
    Cc: Neil Brown
    Cc: Oleg Drokin
    Cc: Peng Tao
    Cc: Richard Weinberger
    Cc: Robert Love
    Cc: Steven Rostedt
    Cc: Trond Myklebust
    Cc: Ursula Braun
    Cc: Zi Shen Lim
    Cc: devel@driverdev.osuosl.org
    Cc: dm-devel@redhat.com
    Cc: dri-devel@lists.freedesktop.org
    Cc: fcoe-devel@open-fcoe.org
    Cc: jfs-discussion@lists.sourceforge.net
    Cc: linux390@de.ibm.com
    Cc: linux-afs@lists.infradead.org
    Cc: linux-cris-kernel@axis.com
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-nfs@vger.kernel.org
    Cc: linux-parisc@vger.kernel.org
    Cc: linux-raid@vger.kernel.org
    Cc: linux-s390@vger.kernel.org
    Cc: linux-scsi@vger.kernel.org
    Cc: qla2xxx-upstream@qlogic.com
    Cc: user-mode-linux-devel@lists.sourceforge.net
    Cc: user-mode-linux-user@lists.sourceforge.net
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     

14 Sep, 2014

4 commits

  • The nohz full functionality depends on IRQ work to trigger its own
    interrupts. As it's used to restart the tick, we can't rely on the tick
    fallback for irq work callbacks, ie: we can't use the tick to restart
    the tick itself.

    Let's reject the full dynticks initialization if that arch support isn't
    available.

    As a side effect, this makes sure that the nohz kick is never called
    from the tick, which would otherwise result in illegal hrtimer
    self-cancellation and a lockup.

    Acked-by: Peter Zijlstra (Intel)
    Cc: Ingo Molnar
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     
  • CONFIG_NO_HZ_FULL_ALL=y and the nohz_full= kernel parameter each have
    their own way of doing the same thing: allocate full dynticks cpumasks,
    fill them and initialize some state variables.

    Let's consolidate all of that in the same place.

    While at it, convert some regular printk messages to warnings when
    fundamental allocations fail.

    Acked-by: Peter Zijlstra (Intel)
    Cc: Ingo Molnar
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     
  • The nohz full kick, which restarts the tick when any resource depends
    on it, can't be executed just anywhere, given the operations it performs
    on timers. If it is called from the scheduler or timer code, chances are
    we run into a deadlock.

    This is why we run the nohz full kick from an irq work: that way we make
    sure the kick runs in a virgin context.

    That holds when irq work runs from its own dedicated self-IPI, but
    things are different on the large group of archs that don't support
    triggering it that way. To support them, irq works are also handled
    from the timer interrupt as a fallback.

    Now when irq works run from the timer interrupt, the context isn't
    blank. More precisely, they can run in the context of the hrtimer that
    runs the tick. But the nohz kick cancels and restarts this hrtimer, and
    cancelling an hrtimer from within itself isn't allowed. This is why we
    run into an endless loop:

    Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2
    CPU: 2 PID: 7538 Comm: kworker/u8:8 Not tainted 3.16.0+ #34
    Workqueue: btrfs-endio-write normal_work_helper [btrfs]
    ffff880244c06c88 000000001b486fe1 ffff880244c06bf0 ffffffff8a7f1e37
    ffffffff8ac52a18 ffff880244c06c78 ffffffff8a7ef928 0000000000000010
    ffff880244c06c88 ffff880244c06c20 000000001b486fe1 0000000000000000
    Call Trace:
    [] dump_stack+0x4e/0x7a
    [] panic+0xd4/0x207
    [] watchdog_overflow_callback+0x118/0x120
    [] __perf_event_overflow+0xae/0x350
    [] ? perf_event_task_disable+0xa0/0xa0
    [] ? x86_perf_event_set_period+0xbf/0x150
    [] perf_event_overflow+0x14/0x20
    [] intel_pmu_handle_irq+0x206/0x410
    [] perf_event_nmi_handler+0x2b/0x50
    [] nmi_handle+0xd2/0x390
    [] ? nmi_handle+0x5/0x390
    [] ? match_held_lock+0x8/0x1b0
    [] default_do_nmi+0x72/0x1c0
    [] do_nmi+0xb8/0x100
    [] end_repeat_nmi+0x1e/0x2e
    [] ? match_held_lock+0x8/0x1b0
    [] ? match_held_lock+0x8/0x1b0
    [] ? match_held_lock+0x8/0x1b0
    [] lock_acquired+0xaf/0x450
    [] ? lock_hrtimer_base.isra.20+0x25/0x50
    [] _raw_spin_lock_irqsave+0x78/0x90
    [] ? lock_hrtimer_base.isra.20+0x25/0x50
    [] lock_hrtimer_base.isra.20+0x25/0x50
    [] hrtimer_try_to_cancel+0x33/0x1e0
    [] hrtimer_cancel+0x1a/0x30
    [] tick_nohz_restart+0x17/0x90
    [] __tick_nohz_full_check+0xc3/0x100
    [] nohz_full_kick_work_func+0xe/0x10
    [] irq_work_run_list+0x44/0x70
    [] irq_work_run+0x2a/0x50
    [] update_process_times+0x5b/0x70
    [] tick_sched_handle.isra.21+0x25/0x60
    [] tick_sched_timer+0x41/0x60
    [] __run_hrtimer+0x72/0x470
    [] ? tick_sched_do_timer+0xb0/0xb0
    [] hrtimer_interrupt+0x117/0x270
    [] local_apic_timer_interrupt+0x37/0x60
    [] smp_apic_timer_interrupt+0x3f/0x50
    [] apic_timer_interrupt+0x6f/0x80

    To fix this, we force non-lazy irq works to run on the irq work
    self-IPI when it's available. Whether the arch can trigger irq work
    self-IPIs is reported by arch_irq_work_has_interrupt().
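
    For reference, the new arch hook is a trivial predicate; a sketch of
    the asm-generic fallback (archs with a self-IPI override it to return
    true):

    static inline bool arch_irq_work_has_interrupt(void)
    {
        /* No self-IPI: irq work must ride on the timer tick. */
        return false;
    }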

    Reported-by: Catalin Iacob
    Reported-by: Dave Jones
    Acked-by: Peter Zijlstra (Intel)
    Cc: Ingo Molnar
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     
  • This way we unbloat main.c a bit and, more importantly, we initialize
    nohz full after init_IRQ(). This dependency will be needed in further
    patches because nohz full needs irq work to raise its own IRQ.
    On ARM64, information about support for this ability is obtained in
    init_IRQ(), which initializes the __smp_cross_call pointer.

    Since tick_init() is called right after init_IRQ(), this is a good place
    to call tick_nohz_init() and prepare for that dependency.
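
    The resulting ordering is then simply (a sketch of the 3.17-era
    kernel/time/tick-common.c, from memory):

    /* Called from start_kernel(), right after init_IRQ(). */
    void __init tick_init(void)
    {
        tick_broadcast_init();
        tick_nohz_init();   /* can now rely on irq work self-IPI info */
    }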

    Acked-by: Peter Zijlstra (Intel)
    Cc: Ingo Molnar
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     

13 Sep, 2014

4 commits

  • Locks the k_itimer's it_lock member when handling the alarm timer's
    expiry callback.

    The regular posix timers defined in posix-timers.c have this lock held
    during timeout processing because their callbacks are routed through
    posix_timer_fn(). The alarm timers follow a different path, so they
    ought to grab the lock somewhere else.
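
    A sketch of the shape of the fix (field names per the 3.17-era
    kernel/time/alarmtimer.c, from memory):

    static enum alarmtimer_restart alarm_handle_timer(struct alarm *alarm,
                                                      ktime_t now)
    {
        struct k_itimer *ptr = container_of(alarm, struct k_itimer,
                                            it.alarm.alarmtimer);
        unsigned long flags;

        spin_lock_irqsave(&ptr->it_lock, flags);
        /* ... deliver the signal under it_lock, as posix_timer_fn() does ... */
        spin_unlock_irqrestore(&ptr->it_lock, flags);
        return ALARMTIMER_NORESTART;
    }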

    Cc: stable@vger.kernel.org
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Richard Cochran
    Cc: Prarit Bhargava
    Cc: Sharvil Nanavati
    Signed-off-by: Richard Larocque
    Signed-off-by: John Stultz

    Richard Larocque
     
  • Avoids sending a signal to alarm timers created with sigev_notify set to
    SIGEV_NONE by checking for that special case in the timeout callback.

    The regular posix timers avoid sending signals to SIGEV_NONE timers by
    not scheduling any callbacks for them in the first place. Although it
    would be possible to do something similar for alarm timers, it's simpler
    to handle this as a special case in the timeout.

    Prior to this patch, the alarm timer would ignore the sigev_notify value
    and try to deliver signals to the process anyway. Even worse, the
    sanity check for the value of sigev_signo is skipped when SIGEV_NONE is
    specified, so the signal number could be bogus. If sigev_signo is an
    uninitialized value (as it often will be if SIGEV_NONE is used), then
    it's hard to predict which signal will be sent.
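
    The special case amounts to a guard in the expiry callback; a sketch
    (assuming the locked alarm_handle_timer() shape shown earlier):

    /* Don't deliver anything for SIGEV_NONE timers. */
    if ((ptr->it_sigev_notify & ~SIGEV_THREAD_ID) != SIGEV_NONE) {
        if (posix_timer_event(ptr, 0) != 0)
            ptr->it_overrun++;
    }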

    Cc: stable@vger.kernel.org
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Richard Cochran
    Cc: Prarit Bhargava
    Cc: Sharvil Nanavati
    Signed-off-by: Richard Larocque
    Signed-off-by: John Stultz

    Richard Larocque
     
  • Returns the time remaining for an alarm timer, rather than the time at
    which it is scheduled to expire. If the timer has already expired or is
    not currently scheduled, the members of it_value are set to zero.

    This new behavior matches that of the other posix-timers and the POSIX
    specifications.

    This is a change in user-visible behavior, and may break existing
    applications. Hopefully, few users rely on the old incorrect behavior.
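
    Conceptually, the get-time path changes from reporting the absolute
    expiry to reporting the remainder; a simplified sketch (the 3.17-era
    ktime_t still exposes .tv64):

    ktime_t remaining = ktime_sub(alarm->node.expires, base->gettime());

    if (remaining.tv64 <= 0) {
        /* Expired or not scheduled: zero it_value, per POSIX. */
        cur_setting->it_value.tv_sec = 0;
        cur_setting->it_value.tv_nsec = 0;
    } else {
        cur_setting->it_value = ktime_to_timespec(remaining);
    }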

    Cc: stable@vger.kernel.org
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Richard Cochran
    Cc: Prarit Bhargava
    Cc: Sharvil Nanavati
    Signed-off-by: Richard Larocque
    [jstultz: minor style tweak]
    Signed-off-by: John Stultz

    Richard Larocque
     
  • timeval_to_jiffies tried to round a timeval up to an integral number
    of jiffies, but the logic for doing so was incorrect: intervals
    corresponding to exactly N jiffies would become N+1. This manifested
    itself particularly when repeatedly stopping/starting an itimer:

    setitimer(ITIMER_PROF, &val, NULL);
    setitimer(ITIMER_PROF, NULL, &val);

    would add a full tick to val, _even if it was exactly representable in
    terms of jiffies_ (say, the result of a previous rounding.) Doing
    this repeatedly would cause unbounded growth in val. So fix the math.

    Here's what was wrong with the conversion: we essentially computed
    (eliding seconds)

    jiffies = usec * (NSEC_PER_USEC/TICK_NSEC)

    using scaling arithmetic, which took the best approximation of
    NSEC_PER_USEC/TICK_NSEC with a denominator of 2^USEC_JIFFIE_SC, i.e.
    x/(2^USEC_JIFFIE_SC), and computed:

    jiffies = (usec * x) >> USEC_JIFFIE_SC

    and rounded this calculation up in the intermediate form (since we
    can't necessarily exactly represent TICK_NSEC in usec.) But the
    scaling arithmetic is a (very slight) *over*approximation of the true
    value; that is, instead of dividing by (1 usec/ 1 jiffie), we
    effectively divided by (1 usec/1 jiffie)-epsilon (rounding
    down). This would normally be fine, but we want to round timeouts up,
    and we did so by adding 2^USEC_JIFFIE_SC - 1 before the shift; this
    would be fine if our division was exact, but dividing this by the
    slightly smaller factor was equivalent to adding just _over_ 1 to the
    final result (instead of just _under_ 1, as desired.)

    In particular, with HZ=1000, we consistently computed that 10000 usec
    was 11 jiffies; the same was true for any exact multiple of
    TICK_NSEC.

    We could possibly still round in the intermediate form, adding
    something less than 2^USEC_JIFFIE_SC - 1, but easier still is to
    convert usec->nsec, round in nanoseconds, and then convert using
    time*spec*_to_jiffies. This adds one constant multiplication, and is
    not observably slower in microbenchmarks on recent x86 hardware.
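
    Conceptually, the fixed path does the following (a sketch of the idea,
    not the exact scaled-arithmetic implementation):

    /* Convert to nanoseconds first; TICK_NSEC is exact there, so
     * rounding up can no longer overshoot by a whole jiffy. */
    u64 nsec = (u64)usec * NSEC_PER_USEC;
    jiffies = div_u64(nsec + TICK_NSEC - 1, TICK_NSEC);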

    Tested: the following program:

    #include <stdio.h>
    #include <sys/time.h>

    int main() {
        struct itimerval zero = {{0, 0}, {0, 0}};
        /* Initially set to 10 ms. */
        struct itimerval initial = zero;
        initial.it_interval.tv_usec = 10000;
        setitimer(ITIMER_PROF, &initial, NULL);
        /* Save and restore several times. */
        for (size_t i = 0; i < 10; ++i) {
            struct itimerval prev;
            setitimer(ITIMER_PROF, &zero, &prev);
            /* on old kernels, this goes up by TICK_USEC every iteration */
            printf("previous value: %ld %ld %ld %ld\n",
                   prev.it_interval.tv_sec, prev.it_interval.tv_usec,
                   prev.it_value.tv_sec, prev.it_value.tv_usec);
            setitimer(ITIMER_PROF, &prev, NULL);
        }
        return 0;
    }

    Cc: stable@vger.kernel.org
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Paul Turner
    Cc: Richard Cochran
    Cc: Prarit Bhargava
    Reviewed-by: Paul Turner
    Reported-by: Aaron Jacobs
    Signed-off-by: Andrew Hunter
    [jstultz: Tweaked to apply to 3.17-rc]
    Signed-off-by: John Stultz

    Andrew Hunter
     

08 Sep, 2014

1 commit

  • Both times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) have scalability
    issues on large systems, due to both functions being serialized with a
    lock.

    The lock protects against reporting a wrong value, due to a thread in the
    task group exiting, its statistics reporting up to the signal struct, and
    that exited task's statistics being counted twice (or not at all).

    Protecting that with a lock results in times() and clock_gettime() being
    completely serialized on large systems.

    This can be fixed by using a seqlock around the events that gather and
    propagate statistics. As an additional benefit, the protection code can
    be moved into thread_group_cputime(), slightly simplifying the calling
    functions.

    In the case of posix_cpu_clock_get_task() things can be simplified a
    lot, because the calling function already ensures that the task sticks
    around, and the rest is now taken care of in thread_group_cputime().

    This way the statistics reporting code can run lockless.
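
    The pattern is the classic seqlock: writers are rare (thread exit), and
    readers retry instead of blocking. A generic sketch (the seqlock API is
    real; group and exiting_task_runtime are made up for illustration, this
    is not the actual cputime code):

    static seqlock_t stats_lock = __SEQLOCK_UNLOCKED(stats_lock);

    /* writer: a thread exits and folds its runtime into the group totals */
    write_seqlock(&stats_lock);
    group->sum_exec_runtime += exiting_task_runtime;
    write_sequnlock(&stats_lock);

    /* reader: times()/clock_gettime() no longer serialize on a lock */
    unsigned int seq;
    u64 runtime;
    do {
        seq = read_seqbegin(&stats_lock);
        runtime = group->sum_exec_runtime;
    } while (read_seqretry(&stats_lock, seq));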

    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alex Thorlton
    Cc: Andrew Morton
    Cc: Daeseok Youn
    Cc: David Rientjes
    Cc: Dongsheng Yang
    Cc: Geert Uytterhoeven
    Cc: Guillaume Morin
    Cc: Ionut Alexa
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Li Zefan
    Cc: Michal Hocko
    Cc: Michal Schmidt
    Cc: Oleg Nesterov
    Cc: Vladimir Davydov
    Cc: umgwanakikbuti@gmail.com
    Cc: fweisbec@gmail.com
    Cc: srao@redhat.com
    Cc: lwoodman@redhat.com
    Cc: atheurer@redhat.com
    Link: http://lkml.kernel.org/r/20140816134010.26a9b572@annuminas.surriel.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     

06 Sep, 2014

1 commit

  • The update_walltime() code works on the shadow timekeeper to make the
    seqcount protected region as short as possible. But that update to the
    shadow timekeeper does not update all timekeeper fields because it's
    sufficient to do that once before it goes live. One of these fields
    is tkr.base_mono. That stays stale in the shadow timekeeper unless an
    operation happens which copies the real timekeeper to the shadow.

    The update function is called after the update calls to vsyscall and
    pvclock. While not correct, it did not cause any problems because none
    of the invoked update functions used base_mono.

    commit cbcf2dd3b3d4 (x86: kvm: Make kvm_get_time_and_clockread()
    nanoseconds based) changed that in the kvm pvclock update function, so
    the stale base_mono value got used and caused kvm-clock to malfunction.

    Put the update where it belongs and fix the issue.
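
    A sketch of the reordering in timekeeping_update() (3.17-era names,
    from memory):

    /* Before the fix, update_vsyscall()/update_pvclock_gtod() ran first
     * and could observe a stale tkr.base_mono. */
    tk_update_ktime_data(tk);
    update_vsyscall(tk);
    update_pvclock_gtod(tk, action & TK_CLOCK_WAS_SET);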

    Reported-by: Chris J Arges
    Reported-by: Paolo Bonzini
    Cc: Gleb Natapov
    Cc: John Stultz
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1409050000570.3333@nanos
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

05 Sep, 2014

1 commit

  • The local nohz kick is currently used by perf, which needs it to be
    NMI-safe. A recent commit, though
    (7d1311b93e58ed55f3a31cc8f94c4b8fe988a2b9), changed its implementation
    to fire the local kick using the remote kick API. That was convenient
    for making the code more generic, but the remote kick isn't NMI-safe.

    As a result:

    WARNING: CPU: 3 PID: 18062 at kernel/irq_work.c:72 irq_work_queue_on+0x11e/0x140()
    CPU: 3 PID: 18062 Comm: trinity-subchil Not tainted 3.16.0+ #34
    0000000000000009 00000000903774d1 ffff880244e06c00 ffffffff9a7f1e37
    0000000000000000 ffff880244e06c38 ffffffff9a0791dd ffff880244fce180
    0000000000000003 ffff880244e06d58 ffff880244e06ef8 0000000000000000
    Call Trace:
    [] dump_stack+0x4e/0x7a
    [] warn_slowpath_common+0x7d/0xa0
    [] warn_slowpath_null+0x1a/0x20
    [] irq_work_queue_on+0x11e/0x140
    [] tick_nohz_full_kick_cpu+0x57/0x90
    [] __perf_event_overflow+0x275/0x350
    [] ? perf_event_task_disable+0xa0/0xa0
    [] ? x86_perf_event_set_period+0xbf/0x150
    [] perf_event_overflow+0x14/0x20
    [] intel_pmu_handle_irq+0x206/0x410
    [] ? arch_vtime_task_switch+0x63/0x130
    [] perf_event_nmi_handler+0x2b/0x50
    [] nmi_handle+0xd2/0x390
    [] ? nmi_handle+0x5/0x390
    [] ? lock_release+0xab/0x330
    [] default_do_nmi+0x72/0x1c0
    [] ? cpuacct_account_field+0xcf/0x200
    [] do_nmi+0xb8/0x100

    Let's fix this by restoring the use of local irq work for the nohz
    local kick.
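
    Restoring it means queueing the per-CPU irq work directly instead of
    going through the remote-queue API; a sketch (names per the 3.17-era
    kernel/time/tick-sched.c, from memory):

    void tick_nohz_full_kick(void)
    {
        if (!tick_nohz_full_cpu(smp_processor_id()))
            return;
        /* irq_work_queue() is NMI-safe; irq_work_queue_on() is not. */
        irq_work_queue(&__get_cpu_var(nohz_full_kick_work));
    }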

    Reported-by: Catalin Iacob
    Reported-and-tested-by: Dave Jones
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     

23 Aug, 2014

2 commits

  • In highres mode, the tick unconditionally reschedules itself to the
    next jiffy.

    While this clock reprogramming is relevant when the tick is in periodic
    mode, it's not that interesting when we run in dynticks mode, because
    the irq exit path is likely going to override the next tick and defer
    it to some point further in the future.

    So let's just get rid of this tick self-rescheduling in dynticks mode.
    This way we can avoid some clockevents double writes in favourable
    scenarios, such as when we stop the tick completely in idle while no
    other hrtimer is pending.
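
    A sketch of the guard in the highres tick handler (tick_sched_timer(),
    3.17-era names, from memory):

    /* No need to reprogram if we are in idle or full dynticks mode;
     * the irq exit path will decide what the next tick should be. */
    if (unlikely(ts->tick_stopped))
        return HRTIMER_NORESTART;

    hrtimer_forward(timer, now, tick_period);
    return HRTIMER_RESTART;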

    Suggested-by: Frederic Weisbecker
    Signed-off-by: Viresh Kumar
    Cc: Thomas Gleixner
    Signed-off-by: Frederic Weisbecker

    Viresh Kumar
     
  • When we reach the end of the tick handler, we unconditionally reschedule
    the next tick to the next jiffy. Then on irq exit, the nohz code
    overrides that setting if needed and defers the next tick as far away in
    the future as possible.

    Now in the best dynticks case, when we actually don't need any tick in
    the future (ie: expires == KTIME_MAX), low-res and high-res behave
    differently. What we want in this case is to cancel the next tick
    programmed by the previous one. That's what we do in high-res mode. OTOH
    we lack a low-res mode equivalent of hrtimer_cancel() so we simply don't
    do anything in this case and the next tick remains scheduled to jiffies + 1.

    As a result, in low-res mode, when the dynticks code determines that no
    tick is needed in the future, we can recursively get a spurious tick
    every jiffy, because the next tick is always reprogrammed from the
    tick handler and never cancelled. This can happen indefinitely, until
    some subsystem actually needs a precise tick in the future, and only
    then do we eventually overwrite the previous tick handler setting to
    defer the next tick.

    We are fixing this by introducing the ONESHOT_STOPPED mode which will
    let us pause a clockevent when no further interrupt is needed. Meanwhile
    we can't expect all drivers to support this new mode.

    So let's greatly reduce the symptoms by skipping the nohz-blind tick
    rescheduling from the tick-handler when the CPU is in dynticks mode.
    That tick rescheduling wrongly assumed periodicity and the low-res
    dynticks code can't cancel such decision. This breaks the recursive (and
    thus the worst) part of the problem. In the worst case now, we'll get
    only one extra tick due to uncancelled tick scheduled before we entered
    dynticks mode.

    This also removes a needless clockevent write on idle ticks. Since
    such clock writes are usually considered slow, it's a general win.
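
    The low-res handler gets the equivalent guard; a sketch
    (tick_nohz_handler(), 3.17-era names, from memory):

    /* No need to reprogram if we are running tickless. */
    if (unlikely(ts->tick_stopped))
        return;

    hrtimer_forward(&ts->sched_timer, now, tick_period);
    tick_program_event(hrtimer_get_expires(&ts->sched_timer), 1);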

    Reviewed-by: Preeti U Murthy
    Signed-off-by: Viresh Kumar
    Cc: Thomas Gleixner
    Signed-off-by: Frederic Weisbecker

    Viresh Kumar
     

15 Aug, 2014

1 commit

  • Benjamin Herrenschmidt pointed out that I further missed modifying
    update_vsyscall after the wall_to_mono value was changed to a
    timespec64. This causes issues on powerpc32, which expects a 32bit
    timespec.

    This patch fixes the problem by properly converting from a timespec64 to
    a timespec before passing the value on to the arch-specific vsyscall
    logic.

    [ Thomas is currently on vacation, but reviewed it and wanted me to send
    this fix on to you directly. ]

    Cc: LKML
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Benjamin Herrenschmidt
    Reported-by: Benjamin Herrenschmidt
    Reviewed-by: Thomas Gleixner
    Signed-off-by: John Stultz
    Signed-off-by: Linus Torvalds

    John Stultz
     

06 Aug, 2014

1 commit

  • Pull timer and time updates from Thomas Gleixner:
    "A rather large update of timers, timekeeping & co

    - Core timekeeping code is year-2038 safe now for 32bit machines.
    Now we just need to fix all in kernel users and the gazillion of
    user space interfaces which rely on timespec/timeval :)

    - Better cache layout for the timekeeping internal data structures.

    - Proper nanosecond based interfaces for in kernel users.

    - Tree wide cleanup of code which wants nanoseconds but does hoops
    and loops to convert back and forth from timespecs. Some of it
    definitely belongs into the ugly code museum.

    - Consolidation of the timekeeping interface zoo.

    - A fast NMI safe accessor to clock monotonic for tracing. This is a
    long standing request to support correlated user/kernel space
    traces. With proper NTP frequency correction it's also suitable
    for correlation of traces across separate machines.

    - Checkpoint/restart support for timerfd.

    - A few NOHZ[_FULL] improvements in the [hr]timer code.

    - Code move from kernel to kernel/time of all time* related code.

    - New clocksource/event drivers from the ARM universe. I'm really
    impressed that despite an architected timer in the newer chips SoC
    manufacturers insist on inventing new and differently broken SoC
    specific timers.

    [ Ed. "Impressed"? I don't think that word means what you think it means ]

    - Another round of code move from arch to drivers. Looks like most
    of the legacy mess in ARM regarding timers is sorted out except for
    a few obnoxious strongholds.

    - The usual updates and fixlets all over the place"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
    timekeeping: Fixup typo in update_vsyscall_old definition
    clocksource: document some basic timekeeping concepts
    timekeeping: Use cached ntp_tick_length when accumulating error
    timekeeping: Rework frequency adjustments to work better w/ nohz
    timekeeping: Minor fixup for timespec64->timespec assignment
    ftrace: Provide trace clocks monotonic
    timekeeping: Provide fast and NMI safe access to CLOCK_MONOTONIC
    seqcount: Add raw_write_seqcount_latch()
    seqcount: Provide raw_read_seqcount()
    timekeeping: Use tk_read_base as argument for timekeeping_get_ns()
    timekeeping: Create struct tk_read_base and use it in struct timekeeper
    timekeeping: Restructure the timekeeper some more
    clocksource: Get rid of cycle_last
    clocksource: Move cycle_last validation to core code
    clocksource: Make delta calculation a function
    wireless: ath9k: Get rid of timespec conversions
    drm: vmwgfx: Use nsec based interfaces
    drm: i915: Use nsec based interfaces
    timekeeping: Provide ktime_get_raw()
    hangcheck-timer: Use ktime_get_ns()
    ...

    Linus Torvalds
     

05 Aug, 2014

3 commits

  • Pull staging driver updates from Greg KH:
    "Here's the big pull request for the staging driver tree for 3.17-rc1.

    Lots of things in here, over 2000 patches, but the best part is this:
    1480 files changed, 39070 insertions(+), 254659 deletions(-)

    Thanks to the great work of Kristina Martšenko, 14 different staging
    drivers have been removed from the tree as they were obsolete and no
    one was willing to work on cleaning them up. Other than the driver
    removals, loads of cleanups are in here (comedi, lustre, etc.) as well
    as the usual IIO driver updates and additions.

    All of this has been in the linux-next tree for a while"

    * tag 'staging-3.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (2199 commits)
    staging: comedi: addi_apci_1564: remove diagnostic interrupt support code
    staging: comedi: addi_apci_1564: add subdevice to check diagnostic status
    staging: wlan-ng: coding style problem fix
    staging: wlan-ng: fixing coding style problems
    staging: comedi: ii_pci20kc: request and ioremap memory
    staging: lustre: bitwise vs logical typo
    staging: dgnc: Remove unneeded dgnc_trace.c and dgnc_trace.h
    staging: dgnc: rephrase comment
    staging: comedi: ni_tio: remove some dead code
    staging: rtl8723au: Fix static symbol sparse warning
    staging: rtl8723au: usb_dvobj_init(): Remove unused variable 'pdev_desc'
    staging: rtl8723au: Do not duplicate kernel provided USB macros
    staging: rtl8723au: Remove never set struct pwrctrl_priv.bHWPowerdown
    staging: rtl8723au: Remove two never set variables
    staging: rtl8723au: RSSI_test is never set
    staging:r8190: coding style: Fixed checkpatch reported Error
    staging:r8180: coding style: Fixed too long lines
    staging:r8180: coding style: Fixed commenting style
    staging: lustre: ptlrpc: lproc_ptlrpc.c - fix dereferenceing user space buffer
    staging: lustre: ldlm: ldlm_resource.c - fix dereferenceing user space buffer
    ...

    Linus Torvalds
     
  • Pull scheduler updates from Ingo Molnar:

    - Move the nohz kick code out of the scheduler tick to a dedicated IPI,
    from Frederic Weisbecker.

    This necessitated quite some background infrastructure rework,
    including:

    * Clean up some irq-work internals
    * Implement remote irq-work
    * Implement nohz kick on top of remote irq-work
    * Move full dynticks timer enqueue notification to new kick
    * Move multi-task notification to new kick
    * Remove unnecessary barriers on multi-task notification

    - Remove proliferation of wait_on_bit() action functions and allow
    wait_on_bit_action() functions to support a timeout. (Neil Brown)

    - Another round of sched/numa improvements, cleanups and fixes. (Rik
    van Riel)

    - Implement fast idling of CPUs when the system is partially loaded,
    for better scalability. (Tim Chen)

    - Restructure and fix the CPU hotplug handling code that may leave
    cfs_rq and rt_rq's throttled when tasks are migrated away from a dead
    cpu. (Kirill Tkhai)

    - Robustify the sched topology setup code. (Peter Zijlstra)

    - Improve sched_feat() handling wrt. static_keys (Jason Baron)

    - Misc fixes.

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
    sched/fair: Fix 'make xmldocs' warning caused by missing description
    sched: Use macro for magic number of -1 for setparam
    sched: Robustify topology setup
    sched: Fix sched_setparam() policy == -1 logic
    sched: Allow wait_on_bit_action() functions to support a timeout
    sched: Remove proliferation of wait_on_bit() action functions
    sched/numa: Revert "Use effective_load() to balance NUMA loads"
    sched: Fix static_key race with sched_feat()
    sched: Remove extra static_key*() function indirection
    sched/rt: Fix replenish_dl_entity() comments to match the current upstream code
    sched: Transform resched_task() into resched_curr()
    sched/deadline: Kill task_struct->pi_top_task
    sched: Rework check_for_tasks()
    sched/rt: Enqueue just unthrottled rt_rq back on the stack in __disable_runtime()
    sched/fair: Disable runtime_enabled on dying rq
    sched/numa: Change scan period code to match intent
    sched/numa: Rework best node setting in task_numa_migrate()
    sched/numa: Examine a task move when examining a task swap
    sched/numa: Simplify task_numa_compare()
    sched/numa: Use effective_load() to balance NUMA loads
    ...

    Linus Torvalds
     
  • Pull RCU changes from Ingo Molnar:
    "The main changes:

    - torture-test updates
    - callback-offloading changes
    - maintainership changes
    - update RCU documentation
    - miscellaneous fixes"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
    rcu: Allow for NULL tick_nohz_full_mask when nohz_full= missing
    rcu: Fix a sparse warning in rcu_report_unblock_qs_rnp()
    rcu: Fix a sparse warning in rcu_initiate_boost()
    rcu: Fix __rcu_reclaim() to use true/false for bool
    rcu: Remove CONFIG_PROVE_RCU_DELAY
    rcu: Use __this_cpu_read() instead of per_cpu_ptr()
    rcu: Don't use NMIs to dump other CPUs' stacks
    rcu: Bind grace-period kthreads to non-NO_HZ_FULL CPUs
    rcu: Simplify priority boosting by putting rt_mutex in rcu_node
    rcu: Check both root and current rcu_node when setting up future grace period
    rcu: Allow post-unlock reference for rt_mutex
    rcu: Loosen __call_rcu()'s rcu_head alignment constraint
    rcu: Eliminate read-modify-write ACCESS_ONCE() calls
    rcu: Remove redundant ACCESS_ONCE() from tick_do_timer_cpu
    rcu: Make rcu node arrays static const char * const
    signal: Explain local_irq_save() call
    rcu: Handle obsolete references to TINY_PREEMPT_RCU
    rcu: Document deadlock-avoidance information for rcu_read_unlock()
    scripts: Teach get_maintainer.pl about the new "R:" tag
    rcu: Update rcu torture maintainership filename patterns
    ...

    Linus Torvalds
     

01 Aug, 2014

1 commit

  • clockevents_increase_min_delta() calls printk() from under
    hrtimer_bases.lock. That causes lock inversion on scheduler locks because
    printk() can call into the scheduler. Lockdep puts it as:

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    3.15.0-rc8-06195-g939f04b #2 Not tainted
    -------------------------------------------------------
    trinity-main/74 is trying to acquire lock:
    (&port_lock_key){-.....}, at: [] serial8250_console_write+0x8c/0x10c

    but task is already holding lock:
    (hrtimer_bases.lock){-.-...}, at: [] hrtimer_try_to_cancel+0x13/0x66

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #5 (hrtimer_bases.lock){-.-...}:
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock_irqsave+0x2e/0x3e
    [] __hrtimer_start_range_ns+0x1c/0x197
    [] perf_swevent_start_hrtimer.part.41+0x7a/0x85
    [] task_clock_event_start+0x3a/0x3f
    [] task_clock_event_add+0xd/0x14
    [] event_sched_in+0xb6/0x17a
    [] group_sched_in+0x44/0x122
    [] ctx_sched_in.isra.67+0x105/0x11f
    [] perf_event_sched_in.isra.70+0x47/0x4b
    [] __perf_install_in_context+0x8b/0xa3
    [] remote_function+0x12/0x2a
    [] smp_call_function_single+0x2d/0x53
    [] task_function_call+0x30/0x36
    [] perf_install_in_context+0x87/0xbb
    [] SYSC_perf_event_open+0x5c6/0x701
    [] SyS_perf_event_open+0x17/0x19
    [] syscall_call+0x7/0xb

    -> #4 (&ctx->lock){......}:
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock+0x21/0x30
    [] __perf_event_task_sched_out+0x1dc/0x34f
    [] __schedule+0x4c6/0x4cb
    [] schedule+0xf/0x11
    [] work_resched+0x5/0x30

    -> #3 (&rq->lock){-.-.-.}:
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock+0x21/0x30
    [] __task_rq_lock+0x33/0x3a
    [] wake_up_new_task+0x25/0xc2
    [] do_fork+0x15c/0x2a0
    [] kernel_thread+0x1a/0x1f
    [] rest_init+0x1a/0x10e
    [] start_kernel+0x303/0x308
    [] i386_start_kernel+0x79/0x7d

    -> #2 (&p->pi_lock){-.-...}:
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock_irqsave+0x2e/0x3e
    [] try_to_wake_up+0x1d/0xd6
    [] default_wake_function+0xb/0xd
    [] __wake_up_common+0x39/0x59
    [] __wake_up+0x29/0x3b
    [] tty_wakeup+0x49/0x51
    [] uart_write_wakeup+0x17/0x19
    [] serial8250_tx_chars+0xbc/0xfb
    [] serial8250_handle_irq+0x54/0x6a
    [] serial8250_default_handle_irq+0x19/0x1c
    [] serial8250_interrupt+0x38/0x9e
    [] handle_irq_event_percpu+0x5f/0x1e2
    [] handle_irq_event+0x2c/0x43
    [] handle_level_irq+0x57/0x80
    [] handle_irq+0x46/0x5c
    [] do_IRQ+0x32/0x89
    [] common_interrupt+0x2e/0x33
    [] _raw_spin_unlock_irqrestore+0x3f/0x49
    [] uart_start+0x2d/0x32
    [] uart_write+0xc7/0xd6
    [] n_tty_write+0xb8/0x35e
    [] tty_write+0x163/0x1e4
    [] redirected_tty_write+0x6d/0x75
    [] vfs_write+0x75/0xb0
    [] SyS_write+0x44/0x77
    [] syscall_call+0x7/0xb

    -> #1 (&tty->write_wait){-.....}:
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock_irqsave+0x2e/0x3e
    [] __wake_up+0x15/0x3b
    [] tty_wakeup+0x49/0x51
    [] uart_write_wakeup+0x17/0x19
    [] serial8250_tx_chars+0xbc/0xfb
    [] serial8250_handle_irq+0x54/0x6a
    [] serial8250_default_handle_irq+0x19/0x1c
    [] serial8250_interrupt+0x38/0x9e
    [] handle_irq_event_percpu+0x5f/0x1e2
    [] handle_irq_event+0x2c/0x43
    [] handle_level_irq+0x57/0x80
    [] handle_irq+0x46/0x5c
    [] do_IRQ+0x32/0x89
    [] common_interrupt+0x2e/0x33
    [] _raw_spin_unlock_irqrestore+0x3f/0x49
    [] uart_start+0x2d/0x32
    [] uart_write+0xc7/0xd6
    [] n_tty_write+0xb8/0x35e
    [] tty_write+0x163/0x1e4
    [] redirected_tty_write+0x6d/0x75
    [] vfs_write+0x75/0xb0
    [] SyS_write+0x44/0x77
    [] syscall_call+0x7/0xb

    -> #0 (&port_lock_key){-.....}:
    [] __lock_acquire+0x9ea/0xc6d
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock_irqsave+0x2e/0x3e
    [] serial8250_console_write+0x8c/0x10c
    [] call_console_drivers.constprop.31+0x87/0x118
    [] console_unlock+0x1d7/0x398
    [] vprintk_emit+0x3da/0x3e4
    [] printk+0x17/0x19
    [] clockevents_program_min_delta+0x104/0x116
    [] clockevents_program_event+0xe7/0xf3
    [] tick_program_event+0x1e/0x23
    [] hrtimer_force_reprogram+0x88/0x8f
    [] __remove_hrtimer+0x5b/0x79
    [] hrtimer_try_to_cancel+0x49/0x66
    [] hrtimer_cancel+0xd/0x18
    [] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
    [] task_clock_event_stop+0x20/0x64
    [] task_clock_event_del+0xd/0xf
    [] event_sched_out+0xab/0x11e
    [] group_sched_out+0x1d/0x66
    [] ctx_sched_out+0xaf/0xbf
    [] __perf_event_task_sched_out+0x1ed/0x34f
    [] __schedule+0x4c6/0x4cb
    [] schedule+0xf/0x11
    [] work_resched+0x5/0x30

    other info that might help us debug this:

    Chain exists of:
    &port_lock_key --> &ctx->lock --> hrtimer_bases.lock

    Possible unsafe locking scenario:

    CPU0                          CPU1
    ----                          ----
    lock(hrtimer_bases.lock);
                                  lock(&ctx->lock);
                                  lock(hrtimer_bases.lock);
    lock(&port_lock_key);

    *** DEADLOCK ***

    4 locks held by trinity-main/74:
    #0: (&rq->lock){-.-.-.}, at: [] __schedule+0xed/0x4cb
    #1: (&ctx->lock){......}, at: [] __perf_event_task_sched_out+0x1dc/0x34f
    #2: (hrtimer_bases.lock){-.-...}, at: [] hrtimer_try_to_cancel+0x13/0x66
    #3: (console_lock){+.+...}, at: [] vprintk_emit+0x3c7/0x3e4

    stack backtrace:
    CPU: 0 PID: 74 Comm: trinity-main Not tainted 3.15.0-rc8-06195-g939f04b #2
    00000000 81c3a310 8b995c14 81426f69 8b995c44 81425a99 8161f671 8161f570
    8161f538 8161f559 8161f538 8b995c78 8b142bb0 00000004 8b142fdc 8b142bb0
    8b995ca8 8104a62d 8b142fac 000016f2 81c3a310 00000001 00000001 00000003
    Call Trace:
    [] dump_stack+0x16/0x18
    [] print_circular_bug+0x18f/0x19c
    [] __lock_acquire+0x9ea/0xc6d
    [] lock_acquire+0x92/0x101
    [] ? serial8250_console_write+0x8c/0x10c
    [] ? wait_for_xmitr+0x76/0x76
    [] _raw_spin_lock_irqsave+0x2e/0x3e
    [] ? serial8250_console_write+0x8c/0x10c
    [] serial8250_console_write+0x8c/0x10c
    [] ? lock_release+0x191/0x223
    [] ? wait_for_xmitr+0x76/0x76
    [] call_console_drivers.constprop.31+0x87/0x118
    [] console_unlock+0x1d7/0x398
    [] vprintk_emit+0x3da/0x3e4
    [] printk+0x17/0x19
    [] clockevents_program_min_delta+0x104/0x116
    [] tick_program_event+0x1e/0x23
    [] hrtimer_force_reprogram+0x88/0x8f
    [] __remove_hrtimer+0x5b/0x79
    [] hrtimer_try_to_cancel+0x49/0x66
    [] hrtimer_cancel+0xd/0x18
    [] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
    [] task_clock_event_stop+0x20/0x64
    [] task_clock_event_del+0xd/0xf
    [] event_sched_out+0xab/0x11e
    [] group_sched_out+0x1d/0x66
    [] ctx_sched_out+0xaf/0xbf
    [] __perf_event_task_sched_out+0x1ed/0x34f
    [] ? __dequeue_entity+0x23/0x27
    [] ? pick_next_task_fair+0xb1/0x120
    [] __schedule+0x4c6/0x4cb
    [] ? trace_hardirqs_off_caller+0xd7/0x108
    [] ? trace_hardirqs_off+0xb/0xd
    [] ? rcu_irq_exit+0x64/0x77

    Fix the problem by using printk_deferred() which does not call into the
    scheduler.
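
    The change itself is mechanical; a sketch of the call site in
    clockevents_increase_min_delta() (message text from memory):

    /* printk() may take scheduler locks; printk_deferred() defers
     * the console output via irq_work instead. */
    printk_deferred(KERN_WARNING
                    "CE: %s increased min_delta_ns to %llu nsec\n",
                    dev->name ? dev->name : "?",
                    (unsigned long long) dev->min_delta_ns);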

    Reported-by: Fengguang Wu
    Signed-off-by: Jan Kara
    Cc: stable@vger.kernel.org
    Signed-off-by: Thomas Gleixner

    Jan Kara
     

24 Jul, 2014

7 commits

  • During suspend we call sched_clock_poll() to update the epoch and
    accumulated time and reprogram the sched_clock_timer to fire
    before the next wrap-around time. Unfortunately,
    sched_clock_poll() doesn't restart the timer; instead it relies
    on the hrtimer layer to do that, and during suspend we aren't
    calling that function from the hrtimer layer. Instead, we're
    reprogramming the expires time while the hrtimer is enqueued,
    which can corrupt the hrtimer tree. Furthermore, we
    restart the timer during suspend but we update the epoch during
    resume which seems counter-intuitive.

    Let's fix this by saving the accumulated state and canceling the
    timer during suspend. On resume we can update the epoch and
    restart the timer similar to what we would do if we were starting
    the clock for the first time.
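
    A sketch of that shape (names per the 3.16-era
    kernel/time/sched_clock.c, from memory):

    static int sched_clock_suspend(void)
    {
        update_sched_clock();                /* save accumulated state */
        hrtimer_cancel(&sched_clock_timer);  /* never poke an enqueued timer */
        cd.suspended = true;
        return 0;
    }

    static void sched_clock_resume(void)
    {
        cd.epoch_cyc = read_sched_clock();   /* fresh epoch, as at boot */
        hrtimer_start(&sched_clock_timer, cd.wrap_kt, HRTIMER_MODE_REL);
        cd.suspended = false;
    }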

    Fixes: a08ca5d1089d "sched_clock: Use an hrtimer instead of timer"
    Signed-off-by: Stephen Boyd
    Signed-off-by: John Stultz
    Link: http://lkml.kernel.org/r/1406174630-23458-1-git-send-email-john.stultz@linaro.org
    Cc: Ingo Molnar
    Cc: stable
    Signed-off-by: Thomas Gleixner

    Stephen Boyd
     
  • By caching the ntp_tick_length() when we correct the frequency error,
    and then using that cached value to accumulate error, we avoid large
    initial errors when the tick length is changed.

    This makes convergence happen much faster in the simulator, since the
    initial error doesn't have to be slowly whittled away.

    This initially seems like an accounting error, but Miroslav pointed out
    that ntp_tick_length() can change mid-tick, so when we apply it in the
    error accumulation, we are applying any recent change to the entire tick.

    This approach chooses to apply changes in the ntp_tick_length() only to
    the next tick, which allows us to calculate the freq correction before
    using the new tick length, which avoids accummulating error.

    Credit to Miroslav for pointing this out, and for providing the
    original patch this functionality was pulled from, along with the
    rationale.

    Cc: Miroslav Lichvar
    Cc: Richard Cochran
    Cc: Prarit Bhargava
    Reported-by: Miroslav Lichvar
    Signed-off-by: John Stultz

    John Stultz
     
  • The existing timekeeping_adjust logic has always been complicated
    to understand. Further, since it was developed prior to NOHZ becoming
    common, it's not surprising that it performs poorly when NOHZ is
    enabled.

    Since Miroslav pointed out the problematic nature of the existing code
    in the NOHZ case, I've tried to refactor the code to perform better.

    The problem with the previous approach was that it tried to adjust
    for the total cumulative error using a scaled dampening factor. This
    resulted in large errors to be corrected slowly, while small errors
    were corrected quickly. With NOHZ the timekeeping code doesn't know
    how far out the next tick will be, so this results in bad
    over-correction to small errors, and insufficient correction to large
    errors.

    Inspired by Miroslav's patch, I've refactored the code to try to
    address the correction in two steps.

    1) Check the future freq error for the next tick, and if the frequency
    error is large, try to make sure we correct it so it doesn't cause
    much accumulated error.

    2) Then make a small single unit adjustment to correct any cumulative
    error that has collected over time.

    This method performs fairly well in the simulator Miroslav created.

    Major credit to Miroslav for pointing out the issue, providing the
    original patch to resolve this, a simulator for testing, as well as
    helping debug and resolve issues in my implementation so that it
    performed closer to his original implementation.

    Cc: Miroslav Lichvar
    Cc: Richard Cochran
    Cc: Prarit Bhargava
    Reported-by: Miroslav Lichvar
    Signed-off-by: John Stultz

    John Stultz
     
  • In the GENERIC_TIME_VSYSCALL_OLD update_vsyscall implementation,
    we take the tk_xtime() value, which returns a timespec64, and
    store it in a timespec.

    This luckily is ok, since the only architectures that use
    GENERIC_TIME_VSYSCALL_OLD are ia64 and ppc64, which are both
    64 bit systems where timespec64 is the same as a timespec.

    Even so, for cleanliness reasons, use the conversion function
    to assign the proper type.
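
    The cleanup is a one-liner; a sketch:

    /* explicit conversion instead of relying on layout equivalence */
    xt = timespec64_to_timespec(tk_xtime(tk));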

    Signed-off-by: John Stultz

    John Stultz
     
  • Tracers want a correlated time between the kernel instrumentation and
    user space. We really do not want to export sched_clock() to user
    space, so we need to provide something sensible for this.

    Using separate data structures with a non-blocking sequence count
    based update mechanism allows us to do that. The data structure
    required for the readout has a sequence counter and two copies of the
    timekeeping data.

    On the update side:

    smp_wmb();
    tkf->seq++;
    smp_wmb();
    update(tkf->base[0], tk);
    smp_wmb();
    tkf->seq++;
    smp_wmb();
    update(tkf->base[1], tk);

    On the reader side:

    do {
        seq = tkf->seq;
        smp_rmb();
        idx = seq & 0x01;
        now = now(tkf->base[idx]);
        smp_rmb();
    } while (seq != tkf->seq)

    So if an NMI hits the update of base[0], it will use base[1], which is
    still consistent, but this timestamp is not guaranteed to be monotonic
    across an update.

    The timestamp is calculated by:

    now = base_mono + clock_delta * slope

    So if the update lowers the slope, readers who are forced to the
    not yet updated second array are still using the old steeper slope.

    tmono
    ^
    |    o  n
    |   o n
    |  u
    | o
    |o
    |12345678---> reader order

    o = old slope
    u = update
    n = new slope

    So reader 6 will observe time going backwards versus reader 5.

    While other CPUs are likely to be able to observe that, the only way
    for a CPU local observation is when an NMI hits in the middle of
    the update. Timestamps taken from that NMI context might be ahead
    of the following timestamps. Callers need to be aware of that and
    deal with it.

    V2: Got rid of clock monotonic raw and reorganized the data
    structures. Folded in the barrier fix from Mathieu.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Mathieu Desnoyers
    Signed-off-by: John Stultz

    Thomas Gleixner
     
  • All the function needs is in the tk_read_base struct. No functional
    change for the current code, just a preparatory patch for the NMI safe
    accessor to clock monotonic which will use struct tk_read_base as well.

    Signed-off-by: Thomas Gleixner
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Mathieu Desnoyers
    Signed-off-by: John Stultz

    Thomas Gleixner
     
  • The members of the new struct are the ones required for the new NMI-
    safe accessor to clock monotonic. In order to reuse the existing
    timekeeping code and to make the update of the fast NMI-safe
    timekeepers a simple memcpy, use the struct for the timekeeper as well
    and convert all users.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Mathieu Desnoyers
    Signed-off-by: John Stultz

    Thomas Gleixner