26 Dec, 2016

2 commits

  • ktime_set(S,N) was required for the timespec storage type and is still
    useful for situations where a Seconds and Nanoseconds part of a time value
    needs to be converted. For anything where the Seconds argument is 0, this
    is pointless and can be replaced with a simple assignment.
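
    As an illustration only, the user-space sketch below models the time value
    as a plain signed 64-bit nanosecond count and shows why a call with a zero
    Seconds argument reduces to an assignment. The names ktime_sketch_t and
    ktime_set_sketch() are invented for this example; the kernel's real helper
    may perform additional range checks that are omitted here.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/* Hypothetical stand-in for ktime_t: scalar nanoseconds. */
typedef int64_t ktime_sketch_t;

/* Combine a seconds part and a nanoseconds part into one value. */
static ktime_sketch_t ktime_set_sketch(int64_t secs, unsigned long nsecs)
{
	return secs * NSEC_PER_SEC + (int64_t)nsecs;
}

/* With secs == 0 the helper degenerates to a plain assignment. */
void ktime_set_demo(void)
{
	ktime_sketch_t via_helper = ktime_set_sketch(0, 5000);
	ktime_sketch_t via_assign = 5000;

	assert(via_helper == via_assign);
	/* The helper is still useful when both parts are present. */
	assert(ktime_set_sketch(2, 5000) == 2000005000LL);
}
```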

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra

    Thomas Gleixner
     
  • ktime is a union because the initial implementation stored the time in
    scalar nanoseconds on 64-bit machines and in an endianness-optimized timespec
    variant for 32-bit machines. The Y2038 cleanup removed the timespec variant
    and switched everything to scalar nanoseconds. The union remained, but
    became completely pointless.

    Get rid of the union and just keep ktime_t as a simple typedef of type s64.

    The conversion was done with coccinelle and some manual mopping up.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra

    Thomas Gleixner
     

25 Dec, 2016

2 commits


19 Dec, 2016

1 commit


15 Dec, 2016

2 commits

  • When a dysfunctional timer, e.g. dummy timer, is installed, the tick core
    tries to set up the broadcast timer.

    If no broadcast device is installed, the kernel crashes with a NULL pointer
    dereference in tick_broadcast_setup_oneshot() because the function has no
    sanity check.

    Reported-by: Mason
    Signed-off-by: Thomas Gleixner
    Cc: Mark Rutland
    Cc: Anna-Maria Gleixner
    Cc: Richard Cochran
    Cc: Sebastian Andrzej Siewior
    Cc: Daniel Lezcano
    Cc: Peter Zijlstra
    Cc: Sebastian Frias
    Cc: Thibaud Cornic
    Cc: Robin Murphy
    Link: http://lkml.kernel.org/r/1147ef90-7877-e4d2-bb2b-5c4fa8d3144b@free.fr

    Thomas Gleixner
     
  • The OpenRISC compiler (so far) fails to optimize away a large portion of
    code containing a reference to posix_timer_event in alarmtimer.c when
    CONFIG_POSIX_TIMERS is unset. Let's give it a direct clue to let the
    build succeed.

    This fixes
    [linux-next:master 6682/7183] alarmtimer.c:undefined reference to `posix_timer_event'
    reported by kbuild test robot.

    Signed-off-by: Nicolas Pitre
    Cc: Thomas Gleixner
    Cc: Josh Triplett

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolas Pitre
     

13 Dec, 2016

3 commits

  • Pull timer updates from Thomas Gleixner:
    "The time/timekeeping/timer folks deliver with this update:

    - Fix a reintroduced signed/unsigned issue and clean up the whole
    signed/unsigned mess in the timekeeping core so this won't happen
    accidentally again.

    - Add a new trace clock based on boot time

    - Prevent injection of random sleep times when PM tracing abuses the
    RTC for storage

    - Make posix timers configurable for real tiny systems

    - Add tracepoints for the alarm timer subsystem so timer based
    suspend wakeups can be instrumented

    - The usual pile of fixes and updates to core and drivers"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
    timekeeping: Use mul_u64_u32_shr() instead of open coding it
    timekeeping: Get rid of pointless typecasts
    timekeeping: Make the conversion call chain consistently unsigned
    timekeeping_Force_unsigned_clocksource_to_nanoseconds_conversion
    alarmtimer: Add tracepoints for alarm timers
    trace: Update documentation for mono, mono_raw and boot clock
    trace: Add an option for boot clock as trace clock
    timekeeping: Add a fast and NMI safe boot clock
    timekeeping/clocksource_cyc2ns: Document intended range limitation
    timekeeping: Ignore the bogus sleep time if pm_trace is enabled
    selftests/timers: Fix spelling mistake "Asyncrhonous" -> "Asynchronous"
    clocksource/drivers/bcm2835_timer: Unmap region obtained by of_iomap
    clocksource/drivers/arm_arch_timer: Map frame with of_io_request_and_map()
    arm64: dts: rockchip: Arch counter doesn't tick in system suspend
    clocksource/drivers/arm_arch_timer: Don't assume clock runs in suspend
    posix-timers: Make them configurable
    posix_cpu_timers: Move the add_device_randomness() call to a proper place
    timer: Move sys_alarm from timer.c to itimer.c
    ptp_clock: Allow for it to be optional
    Kconfig: Regenerate *.c_shipped files after previous changes
    ...

    Linus Torvalds
     
  • Pull smp hotplug updates from Thomas Gleixner:
    "This is the final round of converting the notifier mess to the state
    machine. The removal of the notifiers and the related infrastructure
    will happen around rc1, as there are conversions outstanding in other
    trees.

    The whole exercise removed about 2000 lines of code in total and in
    course of the conversion several dozen bugs got fixed. The new
    mechanism allows to test almost every hotplug step standalone, so
    usage sites can exercise all transitions extensively.

    There is more room for improvement, like integrating all the
    pointlessly different architecture mechanisms of synchronizing,
    setting cpus online etc into the core code"

    * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
    tracing/rb: Init the CPU mask on allocation
    soc/fsl/qbman: Convert to hotplug state machine
    soc/fsl/qbman: Convert to hotplug state machine
    zram: Convert to hotplug state machine
    KVM/PPC/Book3S HV: Convert to hotplug state machine
    arm64/cpuinfo: Convert to hotplug state machine
    arm64/cpuinfo: Make hotplug notifier symmetric
    mm/compaction: Convert to hotplug state machine
    iommu/vt-d: Convert to hotplug state machine
    mm/zswap: Convert pool to hotplug state machine
    mm/zswap: Convert dst-mem to hotplug state machine
    mm/zsmalloc: Convert to hotplug state machine
    mm/vmstat: Convert to hotplug state machine
    mm/vmstat: Avoid on each online CPU loops
    mm/vmstat: Drop get_online_cpus() from init_cpu_node_state/vmstat_cpu_dead()
    tracing/rb: Convert to hotplug state machine
    oprofile/nmi timer: Convert to hotplug state machine
    net/iucv: Use explicit clean up labels in iucv_init()
    x86/pci/amd-bus: Convert to hotplug state machine
    x86/oprofile/nmi: Convert to hotplug state machine
    ...

    Linus Torvalds
     
  • Pull scheduler updates from Ingo Molnar:
    "The main scheduler changes in this cycle were:

    - support Intel Turbo Boost Max Technology 3.0 (TBM3) by introducing a
    notion of 'better cores', which the scheduler will prefer to
    schedule single threaded workloads on. (Tim Chen, Srinivas
    Pandruvada)

    - enhance the handling of asymmetric capacity CPUs further (Morten
    Rasmussen)

    - improve/fix load handling when moving tasks between task groups
    (Vincent Guittot)

    - simplify and clean up the cputime code (Stanislaw Gruszka)

    - improve mass fork()ed task spread a.k.a. hackbench speedup (Vincent
    Guittot)

    - make struct kthread kmalloc()ed and related fixes (Oleg Nesterov)

    - add uaccess atomicity debugging (when using access_ok() in the
    wrong context), under CONFIG_DEBUG_ATOMIC_SLEEP=y (Peter Zijlstra)

    - implement various fixes, cleanups and other enhancements (Daniel
    Bristot de Oliveira, Martin Schwidefsky, Rafael J. Wysocki)"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
    sched/core: Use load_avg for selecting idlest group
    sched/core: Fix find_idlest_group() for fork
    kthread: Don't abuse kthread_create_on_cpu() in __kthread_create_worker()
    kthread: Don't use to_live_kthread() in kthread_[un]park()
    kthread: Don't use to_live_kthread() in kthread_stop()
    Revert "kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function"
    kthread: Make struct kthread kmalloc'ed
    x86/uaccess, sched/preempt: Verify access_ok() context
    sched/x86: Make CONFIG_SCHED_MC_PRIO=y easier to enable
    sched/x86: Change CONFIG_SCHED_ITMT to CONFIG_SCHED_MC_PRIO
    x86/sched: Use #include instead of #include
    cpufreq/intel_pstate: Use CPPC to get max performance
    acpi/bus: Set _OSC for diverse core support
    acpi/bus: Enable HWP CPPC objects
    x86/sched: Add SD_ASYM_PACKING flags to x86 ITMT CPU
    x86/sysctl: Add sysctl for ITMT scheduling feature
    x86: Enable Intel Turbo Boost Max Technology 3.0
    x86/topology: Define x86's arch_update_cpu_topology
    sched: Extend scheduler's asym packing
    sched/fair: Clean up the tunable parameter definitions
    ...

    Linus Torvalds
     

09 Dec, 2016

4 commits

  • The resume code must deal with a clocksource delta which is potentially big
    enough to overflow the 64bit mult.

    Replace the open coded handling with the proper function.
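
    The changelog does not name the function, but the pull summary earlier in
    this log mentions mul_u64_u32_shr(). The following user-space sketch shows
    the general idea of an overflow-aware multiply-and-shift by splitting the
    delta into 32-bit halves. It is an approximation written for illustration,
    not the kernel's code; both function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Naive scaled conversion: the 64-bit product can wrap for large deltas. */
static uint64_t naive_mul_shr(uint64_t delta, uint32_t mult, unsigned int shift)
{
	return (delta * mult) >> shift;
}

/*
 * Overflow-aware variant: split delta into 32-bit halves so that each
 * partial product fits in 64 bits. Valid for shift <= 32, and only as
 * long as the final result itself fits in 64 bits.
 */
static uint64_t mul_u64_u32_shr_sketch(uint64_t delta, uint32_t mult,
				       unsigned int shift)
{
	uint32_t lo = (uint32_t)delta;
	uint32_t hi = (uint32_t)(delta >> 32);
	uint64_t ret;

	assert(shift <= 32);

	ret = ((uint64_t)lo * mult) >> shift;
	if (hi)
		ret += ((uint64_t)hi * mult) << (32 - shift);
	return ret;
}
```

    With delta = 2^40, mult = 2^24 and shift = 24 the naive product is 2^64,
    which wraps to 0, while the split version returns 2^40 as expected.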

    Signed-off-by: Thomas Gleixner
    Reviewed-by: David Gibson
    Acked-by: Peter Zijlstra (Intel)
    Cc: Prarit Bhargava
    Cc: Laurent Vivier
    Cc: "Christopher S. Hall"
    Cc: Chris Metcalf
    Cc: Richard Cochran
    Cc: Liav Rehana
    Cc: John Stultz
    Link: http://lkml.kernel.org/r/20161208204228.921674404@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • cycle_t is defined as u64, so casting it to u64 is a pointless and
    confusing exercise. cycle_t should simply go away and be replaced with a
    plain u64 to avoid further confusion.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: David Gibson
    Acked-by: Peter Zijlstra (Intel)
    Cc: Prarit Bhargava
    Cc: Laurent Vivier
    Cc: "Christopher S. Hall"
    Cc: Chris Metcalf
    Cc: Richard Cochran
    Cc: Liav Rehana
    Cc: John Stultz
    Link: http://lkml.kernel.org/r/20161208204228.844699737@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Propagating an unsigned value through signed variables and functions makes
    absolutely no sense and is just prone to (re)introduce subtle signed
    vs. unsigned issues as happened recently.

    Clean it up.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: David Gibson
    Acked-by: Peter Zijlstra (Intel)
    Cc: Prarit Bhargava
    Cc: Laurent Vivier
    Cc: "Christopher S. Hall"
    Cc: Chris Metcalf
    Cc: Richard Cochran
    Cc: Liav Rehana
    Cc: John Stultz
    Link: http://lkml.kernel.org/r/20161208204228.765843099@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • The clocksource delta to nanoseconds conversion is using signed math, but
    the delta is unsigned. This makes the conversion space smaller than
    necessary and in case of a multiplication overflow the conversion can
    become negative. The conversion is done with scaled math:

    s64 nsec_delta = ((s64)clkdelta * clk->mult) >> clk->shift;

    Shifting a signed integer right obviously preserves the sign, which has
    interesting consequences:

    - Time jumps backwards

    - __iter_div_u64_rem() which is used in one of the calling code paths
    will take forever to piecewise calculate the seconds/nanoseconds part.

    This has been reported by several people with different scenarios:

    David observed that when stopping a VM with a debugger:

    "It was essentially the stopped by debugger case. I forget exactly why,
    but the guest was being explicitly stopped from outside, it wasn't just
    scheduling lag. I think it was something in the vicinity of 10 minutes
    stopped."

    When lifting the stop the machine went dead.

    The stopped by debugger case is not really interesting, but nevertheless it
    would be a good thing not to die completely.

    But this was also observed on a live system by Liav:

    "When the OS is too overloaded, delta will get a high enough value for the
    msb of the sum delta * tkr->mult + tkr->xtime_nsec to be set, and so
    after the shift the nsec variable will gain a value similar to
    0xffffffffff000000."

    Unfortunately this has been reintroduced recently with commit 6bd58f09e1d8
    ("time: Add cycles to nanoseconds translation"). It had been fixed a year
    ago already in commit 35a4933a8959 ("time: Avoid signed overflow in
    timekeeping_get_ns()").

    Though it's not surprising that the issue has been reintroduced because the
    function itself and the whole call chain uses s64 for the result and the
    propagation of it. The change in this recent commit is subtle:

    s64 nsec;

    - nsec = (d * m + n) >> s;
    + nsec = d * m + n;
    + nsec >>= s;

    d being type of cycle_t adds another level of obfuscation.
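
    The difference between the two forms can be reproduced in user space. The
    sketch below is an illustration with invented function names. It assumes
    that converting an out-of-range unsigned value to int64_t wraps and that
    >> on a negative value is an arithmetic shift, which is what GCC and Clang
    do on common targets; ISO C leaves both implementation-defined.

```c
#include <assert.h>
#include <stdint.h>

/* Old form: the whole expression is unsigned, so the shift is logical. */
static int64_t ns_shift_in_expression(uint64_t d, uint32_t m, uint64_t n,
				      unsigned int s)
{
	assert(s < 64);
	return (int64_t)((d * m + n) >> s);
}

/*
 * New form: the unsigned sum is stored in a signed variable first, so a
 * set top bit turns into a sign bit which the shift then preserves.
 */
static int64_t ns_shift_after_store(uint64_t d, uint32_t m, uint64_t n,
				    unsigned int s)
{
	int64_t nsec;

	assert(s < 64);
	nsec = (int64_t)(d * m + n);
	nsec >>= s;
	return nsec;
}
```

    For d = 2^39, m = 2^24, n = 0 and s = 24 the sum is 2^63. The first form
    yields 2^39, while the second yields -2^39 under the stated assumptions.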

    This wouldn't have happened if the previous change to unsigned computation
    would have made the 'nsec' variable u64 right away and a follow up patch
    had cleaned up the whole call chain.

    There have been patches submitted which basically did a revert of the above
    patch leaving everything else unchanged as signed. Back to square one. This
    spawned an admittedly pointless discussion about potential users which rely
    on the unsigned behaviour until someone pointed out that it had been fixed
    before. The changelogs of said patches added further confusion as they made
    finally false claims about the consequences for eventual users which expect
    signed results.

    Despite delta being cycle_t, aka. u64, it's very well possible to hand in
    a signed negative value and the signed computation will happily return the
    correct result. But nobody actually sat down and analyzed the code which
    was added as user after the probably unintended signed conversion.

    Though in sensitive code like this it's better to analyze it properly and
    make sure that nothing relies on this than hunting the subtle wreckage half
    a year later. After analyzing all call chains it stands that no caller can
    hand in a negative value (which actually would work due to the s64 cast)
    and rely on the signed math to do the right thing.

    Change the conversion function to unsigned math. The conversion of all call
    chains is done in a follow up patch.

    This solves the starvation issue, which was caused by the negative result,
    but it does not solve the underlying problem. It merely procrastinates
    it. When the timekeeper update is deferred long enough that the unsigned
    multiplication overflows, then time going backwards is observable again.

    Nor does it solve the issue of clocksources with a small counter width
    which will wrap around possibly several times and cause random time stamps
    to be generated. But those are usually not found on systems used for
    virtualization, so this is likely a non issue.

    I took the liberty to claim authorship for this simply because
    analyzing all callsites and writing the changelog took substantially
    more time than just making the simple s/s64/u64/ change and ignore the
    rest.

    Fixes: 6bd58f09e1d8 ("time: Add cycles to nanoseconds translation")
    Reported-by: David Gibson
    Reported-by: Liav Rehana
    Signed-off-by: Thomas Gleixner
    Reviewed-by: David Gibson
    Acked-by: Peter Zijlstra (Intel)
    Cc: Prarit Bhargava
    Cc: Laurent Vivier
    Cc: "Christopher S. Hall"
    Cc: Chris Metcalf
    Cc: Richard Cochran
    Cc: John Stultz
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20161208204228.688545601@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

08 Dec, 2016

1 commit

  • The CPSW CPTS driver is capable of timestamping tx/rx packets and needs
    to know the mult and shift factors for converting timestamps from raw
    values to nanoseconds (ptp clock). Currently these mult and shift factors
    are calculated manually and provided through DT, which makes it very hard
    to support a large number of platforms, especially if the CPTS refclk is
    not the same across boards and depends on efuse settings (Keystone 2
    platforms). Hence, export clocks_calc_mult_shift() to allow drivers like
    CPSW CPTS (and other ptp drivers) to benefit from automatic calculation of
    mult and shift factors.
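
    To make the idea of automatically calculated factors concrete, here is a
    user-space sketch of one way to derive them from a clock frequency. It is
    an approximation for illustration and not the exported kernel function;
    the names calc_mult_shift_sketch() and cycles_to_ns_sketch() and the
    25 MHz example clock are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pick mult and shift so that (cycles * mult) >> shift converts a counter
 * running at `from` Hz into units of `to` Hz (1000000000 for nanoseconds)
 * without overflowing 64 bits for intervals up to `maxsec` seconds.
 */
static void calc_mult_shift_sketch(uint32_t *mult, uint32_t *shift,
				   uint32_t from, uint32_t to, uint32_t maxsec)
{
	uint64_t tmp;
	uint32_t sft, sftacc = 32;

	assert(from != 0);

	/* Every bit the largest cycle count needs above 32 shrinks mult's budget. */
	tmp = ((uint64_t)maxsec * from) >> 32;
	while (tmp) {
		tmp >>= 1;
		sftacc--;
	}

	/* Take the largest shift whose rounded mult still fits the budget. */
	for (sft = 32; sft > 0; sft--) {
		tmp = (uint64_t)to << sft;
		tmp += from / 2;
		tmp /= from;
		if ((tmp >> sftacc) == 0)
			break;
	}
	*mult = (uint32_t)tmp;
	*shift = sft;
}

/* Scaled-math conversion using the factors computed above. */
static uint64_t cycles_to_ns_sketch(uint64_t cycles, uint32_t mult,
				    uint32_t shift)
{
	return (cycles * mult) >> shift;
}
```

    For a hypothetical 25 MHz reference clock and a 10 second range this
    yields shift = 26 and mult = 40 << 26, so 25000000 cycles convert to
    exactly 1000000000 ns.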

    Cc: John Stultz
    Signed-off-by: Murali Karicheri
    Signed-off-by: Grygorii Strashko
    Acked-by: Thomas Gleixner
    Signed-off-by: David S. Miller

    Murali Karicheri
     

01 Dec, 2016

1 commit

  • Alarm timers are one of the mechanisms to wake up a system from suspend,
    but there exist no tracepoints to analyse which process/thread armed an
    alarmtimer.

    Add tracepoints for start/cancel/expire of individual alarm timers and one
    for tracing the suspend time decision when to resume the system.

    The following trace excerpt illustrates the new mechanism:

    Binder:3292_2-3304 [000] d..2 149.981123: alarmtimer_cancel:
    alarmtimer:ffffffc1319a7800 type:REALTIME
    expires:1325463120000000000 now:1325376810370370245

    Binder:3292_2-3304 [000] d..2 149.981136: alarmtimer_start:
    alarmtimer:ffffffc1319a7800 type:REALTIME
    expires:1325376840000000000 now:1325376810370384591

    Binder:3292_9-3953 [000] d..2 150.212991: alarmtimer_cancel:
    alarmtimer:ffffffc1319a5a00 type:BOOTTIME
    expires:179552000000 now:150154008122

    Binder:3292_9-3953 [000] d..2 150.213006: alarmtimer_start:
    alarmtimer:ffffffc1319a5a00 type:BOOTTIME
    expires:179551000000 now:150154025622

    system_server-3000 [002] ...1 162.701940: alarmtimer_suspend:
    alarmtimer type:REALTIME expires:1325376840000000000

    The wakeup time which is selected at suspend time allows to map it back to
    the task arming the timer: Binder:3292_2.

    [ tglx: Store alarm timer expiry time instead of some useless RTC relative
    information, add proper type information for wakeups which are
    handled via the clock_nanosleep/freezer and massage the changelog. ]

    Signed-off-by: Baolin Wang
    Signed-off-by: John Stultz
    Acked-by: Steven Rostedt
    Cc: Prarit Bhargava
    Cc: Richard Cochran
    Link: http://lkml.kernel.org/r/1480372524-15181-5-git-send-email-john.stultz@linaro.org
    Signed-off-by: Thomas Gleixner

    Baolin Wang
     

30 Nov, 2016

1 commit

  • This boot clock can be used as a tracing clock and will account for
    suspend time.

    To keep it NMI safe since we're accessing from tracing, we're not using a
    separate timekeeper with updates to monotonic clock and boot offset
    protected with seqlocks. This has the following minor side effects:

    (1) It's possible that a timestamp is taken after the boot offset is updated
    but before the timekeeper is updated. If this happens, the new boot offset
    is added to the old timekeeping making the clock appear to update slightly
    earlier:
    CPU 0                                        CPU 1
    timekeeping_inject_sleeptime64()
    __timekeeping_inject_sleeptime(tk, delta);
                                                 timestamp();
    timekeeping_update(tk, TK_CLEAR_NTP...);

    (2) On 32-bit systems, the 64-bit boot offset (tk->offs_boot) may be
    partially updated. Since the tk->offs_boot update is a rare event, this
    should be a rare occurrence which postprocessing should be able to handle.

    Signed-off-by: Joel Fernandes
    Signed-off-by: John Stultz
    Reviewed-by: Thomas Gleixner
    Cc: Prarit Bhargava
    Cc: Richard Cochran
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/1480372524-15181-6-git-send-email-john.stultz@linaro.org
    Signed-off-by: Thomas Gleixner

    Joel Fernandes
     

23 Nov, 2016

1 commit


16 Nov, 2016

3 commits

  • Some embedded systems have no use for them. This removes about
    25KB from the kernel binary size when configured out.

    Corresponding syscalls are routed to a stub logging the attempt to
    use those syscalls which should be enough of a clue if they were
    disabled without proper consideration. They are: timer_create,
    timer_gettime, timer_getoverrun, timer_settime, timer_delete,
    clock_adjtime, setitimer, getitimer, alarm.

    The clock_settime, clock_gettime, clock_getres and clock_nanosleep
    syscalls are replaced by simple wrappers compatible with CLOCK_REALTIME,
    CLOCK_MONOTONIC and CLOCK_BOOTTIME only which should cover the vast
    majority of use cases with very little code.

    Signed-off-by: Nicolas Pitre
    Acked-by: Richard Cochran
    Acked-by: Thomas Gleixner
    Acked-by: John Stultz
    Reviewed-by: Josh Triplett
    Cc: Paul Bolle
    Cc: linux-kbuild@vger.kernel.org
    Cc: netdev@vger.kernel.org
    Cc: Michal Marek
    Cc: Edward Cree
    Link: http://lkml.kernel.org/r/1478841010-28605-7-git-send-email-nicolas.pitre@linaro.org
    Signed-off-by: Thomas Gleixner

    Nicolas Pitre
     
  • There is no logical relation between add_device_randomness() and
    posix_cpu_timers_exit(). Let's move the former to where the latter
    is called. This way, when posix-cpu-timers.c is compiled out, there
    is no need to worry about losing a call to add_device_randomness().

    Signed-off-by: Nicolas Pitre
    Acked-by: John Stultz
    Cc: Paul Bolle
    Cc: linux-kbuild@vger.kernel.org
    Cc: netdev@vger.kernel.org
    Cc: Richard Cochran
    Cc: Josh Triplett
    Cc: Michal Marek
    Cc: Edward Cree
    Link: http://lkml.kernel.org/r/1478841010-28605-6-git-send-email-nicolas.pitre@linaro.org
    Signed-off-by: Thomas Gleixner

    Nicolas Pitre
     
  • Move the only user of alarm_setitimer to itimer.c where it is defined.
    This allows for making alarm_setitimer static, and dropping it from the
    build when __ARCH_WANT_SYS_ALARM is not defined.

    Signed-off-by: Nicolas Pitre
    Acked-by: John Stultz
    Cc: Paul Bolle
    Cc: linux-kbuild@vger.kernel.org
    Cc: netdev@vger.kernel.org
    Cc: Richard Cochran
    Cc: Josh Triplett
    Cc: Michal Marek
    Cc: Edward Cree
    Link: http://lkml.kernel.org/r/1478841010-28605-5-git-send-email-nicolas.pitre@linaro.org
    Signed-off-by: Thomas Gleixner

    Nicolas Pitre
     

15 Nov, 2016

1 commit

  • Now since fetch_task_cputime() has no other users than task_cputime(),
    its code could be used directly in task_cputime().

    Moreover since only 2 task_cputime() calls of 17 use a NULL argument,
    we can add dummy variables to those calls and remove NULL checks from
    task_cputimes().

    Also remove NULL checks from task_cputimes_scaled().

    Signed-off-by: Stanislaw Gruszka
    Signed-off-by: Frederic Weisbecker
    Cc: Benjamin Herrenschmidt
    Cc: Heiko Carstens
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Michael Neuling
    Cc: Paul Mackerras
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1479175612-14718-5-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Stanislaw Gruszka
     

26 Oct, 2016

2 commits

  • The documentation for schedule_timeout(), schedule_hrtimeout(), and
    schedule_hrtimeout_range() all claim that the routines couldn't possibly
    return early if the task state was TASK_UNINTERRUPTIBLE. This is simply
    not true since wake_up_process() will cause those routines to exit early.

    We cannot make schedule_[hr]timeout() loop until the timeout expires if the
    task state is uninterruptible because we have users which rely on the
    existing and designed behaviour.

    Make the documentation match the (correct) implementation.

    schedule_hrtimeout() returns -EINTR even when an uninterruptible task was
    woken up. This might look strange, but making the return code depend on the
    state is too much of an effort as it would affect all the call sites. There
    is no value in doing so, but we spell it out clearly in the documentation.

    Suggested-by: Daniel Kurtz
    Signed-off-by: Douglas Anderson
    Cc: huangtao@rock-chips.com
    Cc: heiko@sntech.de
    Cc: broonie@kernel.org
    Cc: briannorris@chromium.org
    Cc: Andreas Mohr
    Cc: linux-rockchip@lists.infradead.org
    Cc: tony.xie@rock-chips.com
    Cc: John Stultz
    Cc: linux@roeck-us.net
    Cc: tskd08@gmail.com
    Link: http://lkml.kernel.org/r/1477065531-30342-2-git-send-email-dianders@chromium.org
    Signed-off-by: Thomas Gleixner

    Douglas Anderson
     
  • Users of usleep_range() expect that it will _never_ return in less time
    than the minimum passed parameter. However, nothing in the code ensures
    this when the sleeping task is woken by wake_up_process() or any other
    mechanism which can wake a task from uninterruptible state.

    Neither usleep_range() nor schedule_hrtimeout_range*() have any protection
    against wakeups. schedule_hrtimeout_range*() is designed this way despite
    the fact that the API documentation does not mention it.

    msleep() already has code to handle this case since it will loop as long
    as there is still time left. usleep_range() has no such loop, so add one.
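
    The looping approach can be sketched in user space with a fake clock and
    a fake sleep primitive that returns early. This is a deterministic
    illustration under invented names (fake_now_us, fake_sleep_until(),
    sleep_at_least_us()), not the kernel implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Fake clock so the example is deterministic; not a real time source. */
static uint64_t fake_now_us;

/*
 * Fake sleep primitive that, like a task hit by wake_up_process(), may
 * return before the requested deadline: each call only covers half of
 * the remaining time, rounded up.
 */
static void fake_sleep_until(uint64_t deadline_us)
{
	uint64_t remaining;

	assert(deadline_us > fake_now_us);
	remaining = deadline_us - fake_now_us;
	fake_now_us += (remaining + 1) / 2;
}

/* Keep sleeping until the minimum delay has really elapsed. */
static unsigned int sleep_at_least_us(uint64_t min_us)
{
	uint64_t deadline = fake_now_us + min_us;
	unsigned int wakeups = 0;

	while (fake_now_us < deadline) {
		fake_sleep_until(deadline);
		wakeups++;
	}
	return wakeups;
}
```

    A single call to the sleep primitive would return well before the
    minimum; the loop keeps going until the deadline computed up front has
    passed, no matter how many early wakeups occur.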

    Presumably this problem was not detected before because usleep_range() is
    only used in a few places and the function is mostly used in contexts which
    are not exposed to wakeups of any form.

    An effort was made to look for users relying on the old behavior by
    looking for usleep_range() in the same file as wake_up_process().
    No problems were found by this search, though it is conceivable that
    someone could have put the sleep and wakeup in two different files.

    An effort was made to ask several upstream maintainers if they were aware
    of people relying on wake_up_process() to wake up usleep_range(). No
    maintainers were aware of that but they were aware of many people relying
    on usleep_range() never returning before the minimum.

    Reported-by: Tao Huang
    Signed-off-by: Douglas Anderson
    Cc: heiko@sntech.de
    Cc: broonie@kernel.org
    Cc: briannorris@chromium.org
    Cc: Andreas Mohr
    Cc: linux-rockchip@lists.infradead.org
    Cc: tony.xie@rock-chips.com
    Cc: John Stultz
    Cc: djkurtz@chromium.org
    Cc: linux@roeck-us.net
    Cc: tskd08@gmail.com
    Link: http://lkml.kernel.org/r/1477065531-30342-1-git-send-email-dianders@chromium.org
    Signed-off-by: Thomas Gleixner

    Douglas Anderson
     

25 Oct, 2016

4 commits

  • When a timer is enqueued we try to forward the timer base clock. This
    mechanism has two issues:

    1) Forwarding a remote base unlocked

    The forwarding function is called from get_target_base() with the current
    timer base lock held. But if the new target base is a different base than
    the current base (can happen with NOHZ, sigh!) then the forwarding is done
    on an unlocked base. This can lead to corruption of base->clk.

    Solution is simple: Invoke the forwarding after the target base is locked.

    2) Possible corruption due to jiffies advancing

    This is similar to the issue in get_next_timer_interrupt() which was fixed
    in the previous patch. jiffies can advance between check and assignment,
    thereby advancing base->clk beyond the next expiry value.

    So we need to read jiffies into a local variable once and do the checks and
    assignment with the local copy.

    Fixes: a683f390b93f ("timers: Forward the wheel clock whenever possible")
    Reported-by: Ashton Holmes
    Reported-by: Michael Thayer
    Signed-off-by: Thomas Gleixner
    Cc: Michal Necasek
    Cc: Peter Zijlstra
    Cc: knut.osmundsen@oracle.com
    Cc: stable@vger.kernel.org
    Cc: stern@rowland.harvard.edu
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20161022110552.253640125@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Ashton and Michael reported that kernel versions 4.8 and later suffer from
    USB timeouts which are caused by the timer wheel rework.

    This is caused by a bug in the base clock forwarding mechanism, which leads
    to timers expiring early. The scenario which leads to this is:

    run_timers()
      while (jiffies >= base->clk) {
        collect_expired_timers();
        base->clk++;
        expire_timers();
      }

    So base->clk = jiffies + 1. Now the cpu goes idle:

    idle()
      get_next_timer_interrupt()
        nextevt = __next_timer_interrupt();
        if (time_after(nextevt, base->clk))
          base->clk = jiffies;

    jiffies has not advanced since run_timers(), so this assignment effectively
    decrements base->clk by one.

    base->clk is the index into the timer wheel arrays. So let's assume the
    following state after the base->clk increment in run_timers():

    jiffies = 0
    base->clk = 1

    A timer gets enqueued with an expiry delta of 63 ticks (which is the case
    with the USB timeout and HZ=250) so the resulting bucket index is:

    base->clk + delta = 1 + 63 = 64

    The timer goes into the first wheel level. The array size is 64 so it ends
    up in bucket 0, which is correct as it takes 63 ticks to advance base->clk
    to index into bucket 0 again.

    If the cpu goes idle before jiffies advance, then the bug in the forwarding
    mechanism sets base->clk back to 0, so the next invocation of run_timers()
    at the next tick will index into bucket 0 and therefore expire the timer 62
    ticks too early.
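
    The arithmetic of this scenario can be checked with a small user-space
    sketch. It only models the first wheel level with 64 buckets as described
    above; the helper names are invented for the example.

```c
#include <assert.h>

#define LVL_SIZE 64u                 /* buckets in the first wheel level */
#define LVL_MASK (LVL_SIZE - 1)

/* Bucket a timer lands in when enqueued `delta` ticks from base clock `clk`. */
static unsigned int bucket_for(unsigned long clk, unsigned long delta)
{
	assert(delta < LVL_SIZE);    /* the sketch covers the first level only */
	return (unsigned int)((clk + delta) & LVL_MASK);
}

/* Ticks the base clock must advance before it indexes `bucket`. */
static unsigned int ticks_until_bucket(unsigned long clk, unsigned int bucket)
{
	return (unsigned int)((bucket - clk) & LVL_MASK);
}
```

    With clk = 1 and delta = 63 the timer lands in bucket 0, which is 63
    ticks away. Once the clock is wrongly set back to 0 the distance drops
    to 0, so the run at the next tick already expires the timer, which
    matches the 62 ticks quoted above.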

    Instead of blindly setting base->clk to jiffies we must make the forwarding
    conditional on jiffies > base->clk, but we cannot use jiffies for this as
    we might run into the following issue:

    if (time_after(jiffies, base->clk)) {
      if (time_after(nextevt, base->clk))
        base->clk = jiffies;

    jiffies can increment between the check and the assignment far enough to
    advance beyond nextevt. So we need to use a stable value for checking.

    get_next_timer_interrupt() has the basej argument which is the jiffies
    value snapshot taken in the calling code. So we can just use that.
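
    A user-space sketch of forwarding against a stable snapshot follows. It
    is one possible shape of such a check, written for illustration; the
    struct, the function names and the exact branch structure are
    assumptions rather than the patch itself.

```c
#include <assert.h>

/* Wrap-safe "a is after b" for tick counters, in the spirit of time_after(). */
static int tick_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

struct timer_base_sketch {
	unsigned long clk;   /* hypothetical stand-in for base->clk */
};

static void forward_base_sketch(struct timer_base_sketch *base,
				unsigned long basej, unsigned long nextevt)
{
	assert(base);

	/* Never move the clock backwards: only act if the snapshot is ahead. */
	if (!tick_after(basej, base->clk))
		return;

	/* Do not run past the next pending expiry. */
	if (tick_after(nextevt, basej))
		base->clk = basej;
	else if (tick_after(nextevt, base->clk))
		base->clk = nextevt;
}
```

    Because basej is read once by the caller, the comparison and the
    assignment see the same value, and a base clock that is already ahead of
    the snapshot is left untouched.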

    Thanks to Ashton for bisecting and providing trace data!

    Fixes: a683f390b93f ("timers: Forward the wheel clock whenever possible")
    Reported-by: Ashton Holmes
    Reported-by: Michael Thayer
    Signed-off-by: Thomas Gleixner
    Cc: Michal Necasek
    Cc: Peter Zijlstra
    Cc: knut.osmundsen@oracle.com
    Cc: stable@vger.kernel.org
    Cc: stern@rowland.harvard.edu
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20161022110552.175308322@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Linus stumbled over the unlocked modification of the timer expiry value in
    mod_timer() which is an optimization for timers which stay in the same
    bucket - due to the bucket granularity - despite their expiry time getting
    updated.

    The optimization itself still makes sense even if we take the lock, because
    in case that the bucket stays the same, we avoid the pointless
    dequeue/enqueue dance.

    Make the check and the modification of timer->expires protected by the base
    lock and shuffle the remaining code around so we can keep the lock held
    when we actually have to requeue the timer to a different bucket.

    Fixes: f00c0afdfa62 ("timers: Implement optimization for same expiry time in mod_timer()")
    Reported-by: Linus Torvalds
    Signed-off-by: Thomas Gleixner
    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1610241711220.4983@nanos
    Cc: stable@vger.kernel.org
    Cc: Andrew Morton
    Cc: Peter Zijlstra

    Thomas Gleixner
     
  • Linus noticed that lock_timer_base() lacks a READ_ONCE() for accessing the
    timer flags. As a consequence the compiler is allowed to reload the flags
    between the initial check for TIMER_MIGRATION and the following timer base
    computation and the spin lock of the base.

    While this has not been observed (yet), we need to make sure that it never
    happens.

    Fixes: 0eeda71bc30d ("timer: Replace timer base by a cpu index")
    Reported-by: Linus Torvalds
    Signed-off-by: Thomas Gleixner
    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1610241711220.4983@nanos
    Cc: stable@vger.kernel.org
    Cc: Andrew Morton
    Cc: Peter Zijlstra

    Thomas Gleixner
     

17 Oct, 2016

1 commit

  • Remove the set but unused variable base in alarm_clock_get to fix the
    following warning when building with 'W=1':

    kernel/time/alarmtimer.c: In function ‘alarm_timer_create’:
    kernel/time/alarmtimer.c:545:21: warning: variable ‘base’ set but not used [-Wunused-but-set-variable]

    Signed-off-by: Tobias Klauser
    Cc: John Stultz
    Link: http://lkml.kernel.org/r/20161017094702.10873-1-tklauser@distanz.ch
    Signed-off-by: Thomas Gleixner

    Tobias Klauser
     

16 Oct, 2016

1 commit

  • Pull gcc plugins update from Kees Cook:
    "This adds a new gcc plugin named "latent_entropy". It is designed to
    extract as much uncertainty as possible from a running system at boot
    time, hoping to capitalize on any possible variation in
    CPU operation (due to runtime data differences, hardware differences,
    SMP ordering, thermal timing variation, cache behavior, etc).

    At the very least, this plugin is a much more comprehensive example
    for how to manipulate kernel code using the gcc plugin internals"

    * tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    latent_entropy: Mark functions with __latent_entropy
    gcc-plugins: Add latent_entropy plugin

    Linus Torvalds
     

11 Oct, 2016

1 commit

  • The __latent_entropy gcc attribute can be used only on functions and
    variables. If it is on a function then the plugin will instrument it for
    gathering control-flow entropy. If the attribute is on a variable then
    the plugin will initialize it with random contents. The variable must
    be an integer, an integer array type or a structure with integer fields.

    These specific functions have been selected because they are init
    functions (to help gather boot-time entropy), are called at unpredictable
    times, or they have variable loops, each of which provides some level of
    latent entropy.

    Signed-off-by: Emese Revfy
    [kees: expanded commit message]
    Signed-off-by: Kees Cook

    Emese Revfy
     

05 Oct, 2016

1 commit

  • In commit 27727df240c7 ("Avoid taking lock in NMI path with
    CONFIG_DEBUG_TIMEKEEPING"), I changed the logic to open-code
    the timekeeping_get_ns() function, but I forgot to include
    the unit conversion from cycles to nanoseconds, breaking the
    function's output, which impacts users like perf.

    This results in bogus perf timestamps like:
    swapper 0 [000] 253.427536: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 254.426573: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 254.426687: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 254.426800: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 254.426905: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 254.427022: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 254.427127: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 254.427239: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 254.427346: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 254.427463: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 255.426572: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])

    Instead of more reasonable expected timestamps like:
    swapper 0 [000] 39.953768: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 40.064839: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 40.175956: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 40.287103: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 40.398217: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 40.509324: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 40.620437: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 40.731546: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 40.842654: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 40.953772: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 41.064881: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])

    Add the proper use of timekeeping_delta_to_ns() to convert
    the cycle delta to nanoseconds as needed.

    Thanks to Brendan and Alexei for finding this quickly after
    the v4.8 release. Unfortunately the problematic commit has
    landed in some -stable trees so they'll need this fix as
    well.

    Many apologies for this mistake. I'll be looking to add a
    perf-clock sanity test to the kselftest timers tests soon.

    Fixes: 27727df240c7 ("timekeeping: Avoid taking lock in NMI path with CONFIG_DEBUG_TIMEKEEPING")
    Reported-by: Brendan Gregg
    Reported-by: Alexei Starovoitov
    Tested-and-reviewed-by: Mathieu Desnoyers
    Signed-off-by: John Stultz
    Cc: Peter Zijlstra
    Cc: stable
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/1475636148-26539-1-git-send-email-john.stultz@linaro.org
    Signed-off-by: Thomas Gleixner

    John Stultz
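
The missing unit conversion is the usual clocksource scaling, nanoseconds = (cycles * mult) >> shift. The sketch below shows only that arithmetic; toy_delta_to_ns() is a hypothetical helper, not timekeeping_delta_to_ns() itself, and it leaves out the accumulated sub-nanosecond remainder and the overflow handling for very large deltas.

```c
#include <stdint.h>

/*
 * Convert a clocksource cycle delta to nanoseconds using the fixed-point
 * mult/shift pair: ns = (cycles * mult) >> shift.
 * The multiplication can overflow for very large deltas; this sketch does
 * not guard against that.
 */
static uint64_t toy_delta_to_ns(uint64_t cycle_delta, uint32_t mult,
                                uint32_t shift)
{
    return (cycle_delta * mult) >> shift;
}
```

With mult = 2048 and shift = 10 each cycle counts as 2 ns, so a delta of 1000 cycles converts to 2000 ns. Returning the raw cycle delta instead would be wrong for any clocksource whose scale is not exactly 1 ns per cycle.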
     

13 Sep, 2016

1 commit

  • can_stop_full_tick() has no check for offline cpus, so it allows the tick
    to be stopped on an offline cpu from the interrupt return path, which is
    wrong and subsequently makes irq_work_needs_cpu() warn about being called
    for an offline cpu.

    Commit f7ea0fd639c2c4 ("tick: Don't invoke tick_nohz_stop_sched_tick() if
    the cpu is offline") added prevention for can_stop_idle_tick(), but forgot
    to do the same in can_stop_full_tick(). Add it.

    [ tglx: Massaged changelog ]

    Signed-off-by: Wanpeng Li
    Cc: Peter Zijlstra
    Cc: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/1473245473-4463-1-git-send-email-wanpeng.li@hotmail.com
    Signed-off-by: Thomas Gleixner

    Wanpeng Li
     

08 Sep, 2016

1 commit


02 Sep, 2016

1 commit

  • tick_nohz_start_idle() is no longer called if the idle tick can't be
    stopped since commit 1f3b0f8243cb934 ("tick/nohz: Optimize nohz idle
    enter"). As a result, after suspend/resume of the host machine, a full
    dynticks kvm guest will softlockup:

    NMI watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [swapper/0:0]
    Call Trace:
    default_idle+0x31/0x1a0
    arch_cpu_idle+0xf/0x20
    default_idle_call+0x2a/0x50
    cpu_startup_entry+0x39b/0x4d0
    rest_init+0x138/0x140
    ? rest_init+0x5/0x140
    start_kernel+0x4c1/0x4ce
    ? set_init_arg+0x55/0x55
    ? early_idt_handler_array+0x120/0x120
    x86_64_start_reservations+0x24/0x26
    x86_64_start_kernel+0x142/0x14f

    In addition, cat /proc/stat | grep cpu in guest or host shows:

    cpu 398 16 5049 15754 5490 0 1 46 0 0
    cpu0 206 5 450 0 0 0 1 14 0 0
    cpu1 81 0 3937 3149 1514 0 0 9 0 0
    cpu2 45 6 332 6052 2243 0 0 11 0 0
    cpu3 65 2 328 6552 1732 0 0 11 0 0

    The idle and iowait fields are unexpectedly 0 for cpu0 (the housekeeping
    cpu).

    The bug is present in both guest and host kernels, and both show the cpu0
    idle and iowait issue. However, the host kernel's suspend/resume path
    (among others) touches the watchdog, which avoids the softlockup there.

    - The watchdog is not touched in the tick_nohz_stop_idle path (it needs to
    be touched, since the scheduler stall is expected) if the idle_active flag
    is not set.
    - The idle and iowait times are not accounted when exiting the idle loop
    (resched or interrupt) if the idle start time and the idle_active flag are
    not set.

    This patch fixes it by reverting commit 1f3b0f8243cb934, since being unable
    to stop the idle tick does not mean the cpu cannot be idle.

    Fixes: 1f3b0f8243cb934 ("tick/nohz: Optimize nohz idle enter")
    Signed-off-by: Wanpeng Li
    Cc: Sanjeev Yadav
    Cc: Gaurav Jindal
    Cc: stable@vger.kernel.org
    Cc: kvm@vger.kernel.org
    Cc: Radim Krčmář
    Cc: Peter Zijlstra
    Cc: Paolo Bonzini
    Link: http://lkml.kernel.org/r/1472798303-4154-1-git-send-email-wanpeng.li@hotmail.com
    Signed-off-by: Thomas Gleixner

    Wanpeng Li
     

01 Sep, 2016

5 commits

  • I ran into this:

    ================================================================================
    UBSAN: Undefined behaviour in kernel/time/hrtimer.c:310:16
    signed integer overflow:
    9223372036854775807 + 50000 cannot be represented in type 'long long int'
    CPU: 2 PID: 4798 Comm: trinity-c2 Not tainted 4.8.0-rc1+ #91
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
    0000000000000000 ffff88010ce6fb88 ffffffff82344740 0000000041b58ab3
    ffffffff84f97a20 ffffffff82344694 ffff88010ce6fbb0 ffff88010ce6fb60
    000000000000c350 ffff88010ce6f968 dffffc0000000000 ffffffff857bc320
    Call Trace:
    dump_stack+0xac/0xfc
    ? _atomic_dec_and_lock+0xc4/0xc4
    ubsan_epilogue+0xd/0x8a
    handle_overflow+0x202/0x23d
    ? val_to_string.constprop.6+0x11e/0x11e
    ? timerqueue_add+0x151/0x410
    ? hrtimer_start_range_ns+0x3b8/0x1380
    ? memset+0x31/0x40
    __ubsan_handle_add_overflow+0xe/0x10
    hrtimer_nanosleep+0x5d9/0x790
    ? hrtimer_init_sleeper+0x80/0x80
    ? __might_sleep+0x5b/0x260
    common_nsleep+0x20/0x30
    SyS_clock_nanosleep+0x197/0x210
    ? SyS_clock_getres+0x150/0x150
    ? __this_cpu_preempt_check+0x13/0x20
    ? __context_tracking_exit.part.3+0x30/0x1b0
    ? SyS_clock_getres+0x150/0x150
    do_syscall_64+0x1b3/0x4b0
    entry_SYSCALL64_slow_path+0x25/0x25
    ================================================================================

    Add a new ktime_add_unsafe() helper which doesn't check for overflow, but
    doesn't throw a UBSAN warning when it does overflow either.

    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Richard Cochran
    Cc: Prarit Bhargava
    Signed-off-by: Vegard Nossum
    Signed-off-by: John Stultz

    Vegard Nossum
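
The idea behind an addition that may wrap without being undefined behaviour can be sketched as below: do the arithmetic in unsigned 64-bit, where wraparound is well defined, and convert back. The function names are hypothetical, not the kernel helpers. Converting an out-of-range unsigned value back to int64_t is implementation-defined rather than undefined, and this sketch assumes the two's complement wrap that gcc and clang provide. The second function shows one possible way a caller could then clamp a wrapped result, assuming non-negative inputs.

```c
#include <stdint.h>

/* Add two signed 64-bit values; wraps on overflow instead of invoking UB. */
static int64_t toy_add_unsafe(int64_t lhs, int64_t rhs)
{
    return (int64_t)((uint64_t)lhs + (uint64_t)rhs);
}

/*
 * Saturating variant for non-negative inputs: if the wrapped sum went
 * negative or is smaller than an operand, clamp to the maximum value.
 */
static int64_t toy_add_safe(int64_t lhs, int64_t rhs)
{
    int64_t res = toy_add_unsafe(lhs, rhs);

    if (res < 0 || res < lhs || res < rhs)
        res = INT64_MAX;
    return res;
}
```

For the values in the report above, 9223372036854775807 + 50000 wraps to a negative number in toy_add_unsafe(), and toy_add_safe() clamps it to INT64_MAX.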
     
  • I ran into this:

    ================================================================================
    UBSAN: Undefined behaviour in kernel/time/time.c:783:2
    signed integer overflow:
    5273 + 9223372036854771711 cannot be represented in type 'long int'
    CPU: 0 PID: 17363 Comm: trinity-c0 Not tainted 4.8.0-rc1+ #88
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
    0000000000000000 ffff88011457f8f0 ffffffff82344f50 0000000041b58ab3
    ffffffff84f98080 ffffffff82344ea4 ffff88011457f918 ffff88011457f8c8
    ffff88011457f8e0 7fffffffffffefff ffff88011457f6d8 dffffc0000000000
    Call Trace:
    dump_stack+0xac/0xfc
    ? _atomic_dec_and_lock+0xc4/0xc4
    ubsan_epilogue+0xd/0x8a
    handle_overflow+0x202/0x23d
    ? val_to_string.constprop.6+0x11e/0x11e
    ? debug_smp_processor_id+0x17/0x20
    ? __sigqueue_free.part.13+0x51/0x70
    ? rcu_is_watching+0x110/0x110
    __ubsan_handle_add_overflow+0xe/0x10
    timespec64_add_safe+0x298/0x340
    ? timespec_add_safe+0x330/0x330
    ? wait_noreap_copyout+0x1d0/0x1d0
    poll_select_set_timeout+0xf8/0x170
    ? poll_schedule_timeout+0x2b0/0x2b0
    ? __might_sleep+0x5b/0x260
    __sys_recvmmsg+0x107/0x790
    ? SyS_recvmsg+0x20/0x20
    ? hrtimer_start_range_ns+0x3b8/0x1380
    ? _raw_spin_unlock_irqrestore+0x3b/0x60
    ? do_setitimer+0x39a/0x8e0
    ? __might_sleep+0x5b/0x260
    ? __sys_recvmmsg+0x790/0x790
    SyS_recvmmsg+0xd9/0x160
    ? __sys_recvmmsg+0x790/0x790
    ? __this_cpu_preempt_check+0x13/0x20
    ? __context_tracking_exit.part.3+0x30/0x1b0
    ? __sys_recvmmsg+0x790/0x790
    do_syscall_64+0x1b3/0x4b0
    entry_SYSCALL64_slow_path+0x25/0x25
    ================================================================================

    Line 783 is this:

    783 set_normalized_timespec64(&res, lhs.tv_sec + rhs.tv_sec,
    784 lhs.tv_nsec + rhs.tv_nsec);

    In other words, since lhs.tv_sec and rhs.tv_sec are both time64_t, this
    is a signed addition which will cause undefined behaviour on overflow.

    Note that this is not currently a huge concern since the kernel should be
    built with -fno-strict-overflow by default, but it could become a problem
    in the future, with older compilers, or with compilers other than gcc.

    The easiest way to avoid the overflow is to cast one of the arguments to
    unsigned (so the addition will be done using unsigned arithmetic).

    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Richard Cochran
    Cc: Prarit Bhargava
    Signed-off-by: Vegard Nossum
    Signed-off-by: John Stultz

    Vegard Nossum
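
A self-contained sketch of the cast-to-unsigned approach for a seconds/nanoseconds pair. The type toy_ts64 and the function toy_ts64_add_safe() are hypothetical stand-ins, not the kernel's timespec64 code. The sketch assumes both inputs are valid and non-negative with tv_nsec below one second, and it assumes the saturate-on-wrap check shown here, which is one plausible way to report the overflow.

```c
#include <stdint.h>

#define TOY_NSEC_PER_SEC 1000000000L

struct toy_ts64 {
    int64_t tv_sec;
    long tv_nsec;
};

/*
 * Add two non-negative, normalized time values. The seconds are added as
 * unsigned so that overflow wraps instead of being undefined behaviour;
 * a wrapped result is detected afterwards and saturated.
 */
static struct toy_ts64 toy_ts64_add_safe(struct toy_ts64 lhs,
                                         struct toy_ts64 rhs)
{
    struct toy_ts64 res;
    uint64_t sec = (uint64_t)lhs.tv_sec + (uint64_t)rhs.tv_sec;
    long nsec = lhs.tv_nsec + rhs.tv_nsec;   /* below 2 * 10^9, fits in long */

    /* Normalize: carry a full second out of the nanosecond field. */
    if (nsec >= TOY_NSEC_PER_SEC) {
        nsec -= TOY_NSEC_PER_SEC;
        sec++;
    }

    res.tv_sec = (int64_t)sec;
    res.tv_nsec = nsec;

    /* A sum smaller than either operand means the addition wrapped. */
    if (res.tv_sec < lhs.tv_sec || res.tv_sec < rhs.tv_sec) {
        res.tv_sec = INT64_MAX;
        res.tv_nsec = 0;
    }
    return res;
}
```

Adding 1.6 s and 2.7 s gives 4.3 s through the carry, while adding 100 s to a value near INT64_MAX saturates instead of overflowing.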
     
  • In addition to keeping a histogram of suspend times, also
    print out the time spent in suspend to dmesg.

    This helps to keep track of suspend time while debugging using
    kernel logs.

    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Richard Cochran
    Cc: Prarit Bhargava
    Signed-off-by: Ruchi Kandoi
    [jstultz: Tweaked commit message]
    Signed-off-by: John Stultz

    Ruchi Kandoi
     
  • Clocksources don't get the VALID_FOR_HRES flag until they have been
    checked by a watchdog. However, when using an override, the
    clocksource_select logic will clear the override value if the
    clocksource is not marked VALID_FOR_HRES during that initial check.
    When using the boot arguments clocksource=, this selection can
    run before the watchdog, and can cause the override to be incorrectly
    cleared.

    To address this condition, the override_name is only invalidated for
    unstable clocksources. Otherwise, the override is left intact until after
    the watchdog has validated the clocksource as stable/unstable.

    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Richard Cochran
    Cc: Prarit Bhargava
    Cc: Martin Schwidefsky
    Signed-off-by: Kyle Walker
    Signed-off-by: John Stultz

    Kyle Walker
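
The override handling described above can be sketched roughly like this. It is a simplified model, not the kernel's clocksource selection code: the struct, the flag values, toy_override_name and toy_pick_override() are hypothetical, and the hres_active parameter is an assumption standing in for "high resolution mode is in use".

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Illustrative flag bits (assumed for this sketch). */
#define TOY_VALID_FOR_HRES 0x1u
#define TOY_UNSTABLE       0x2u

struct toy_clocksource {
    const char *name;
    unsigned int flags;
};

/* Name requested by the user; an empty string means "no override". */
static char toy_override_name[32];

/*
 * Returns the override clocksource if it can be used right now, else NULL.
 * The override name is only cleared when the clocksource is known to be
 * unstable. If it merely has not been validated yet, the name is kept so
 * a later selection pass can still honour it.
 */
static const struct toy_clocksource *
toy_pick_override(const struct toy_clocksource *list, size_t n,
                  bool hres_active)
{
    for (size_t i = 0; i < n; i++) {
        const struct toy_clocksource *cs = &list[i];

        if (strcmp(cs->name, toy_override_name) != 0)
            continue;

        if (!(cs->flags & TOY_VALID_FOR_HRES) && hres_active) {
            if (cs->flags & TOY_UNSTABLE)
                toy_override_name[0] = '\0';   /* give up on the override */
            return NULL;                       /* otherwise defer */
        }
        return cs;
    }
    return NULL;
}
```

Deferring instead of clearing means a selection pass that runs before the watchdog does not throw away the user's choice.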
     
  • Fix a minor spelling error.

    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Richard Cochran
    Cc: Prarit Bhargava
    Signed-off-by: Pratyush Patel
    [jstultz: Added commit message]
    Signed-off-by: John Stultz

    Pratyush Patel