13 Oct, 2020

1 commit

  • Pull locking updates from Ingo Molnar:
    "These are the locking updates for v5.10:

    - Add deadlock detection for recursive read-locks.

    The rationale is outlined in commit 224ec489d3cd ("lockdep/
    Documention: Recursive read lock detection reasoning")

    The main deadlock pattern we want to detect is:

    TASK A:                 TASK B:

    read_lock(X);
                            write_lock(X);
    read_lock_2(X);

    - Add "latch sequence counters" (seqcount_latch_t):

    A sequence counter variant where the counter even/odd value is used
    to switch between two copies of protected data. This allows the
    read path, typically NMIs, to safely interrupt the write side
    critical section.

    We utilize this new variant for sched-clock, and to make x86 TSC
    handling safer.

    - Other seqlock cleanups, fixes and enhancements

    - KCSAN updates

    - LKMM updates

    - Misc updates, cleanups and fixes"

    * tag 'locking-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (67 commits)
    lockdep: Revert "lockdep: Use raw_cpu_*() for per-cpu variables"
    lockdep: Fix lockdep recursion
    lockdep: Fix usage_traceoverflow
    locking/atomics: Check atomic-arch-fallback.h too
    locking/seqlock: Tweak DEFINE_SEQLOCK() kernel doc
    lockdep: Optimize the memory usage of circular queue
    seqlock: Unbreak lockdep
    seqlock: PREEMPT_RT: Do not starve seqlock_t writers
    seqlock: seqcount_LOCKNAME_t: Introduce PREEMPT_RT support
    seqlock: seqcount_t: Implement all read APIs as statement expressions
    seqlock: Use unique prefix for seqcount_t property accessors
    seqlock: seqcount_LOCKNAME_t: Standardize naming convention
    seqlock: seqcount latch APIs: Only allow seqcount_latch_t
    rbtree_latch: Use seqcount_latch_t
    x86/tsc: Use seqcount_latch_t
    timekeeping: Use seqcount_latch_t
    time/sched_clock: Use seqcount_latch_t
    seqlock: Introduce seqcount_latch_t
    mm/swap: Do not abuse the seqcount_t latching API
    time/sched_clock: Use raw_read_seqcount_latch() during suspend
    ...

    Linus Torvalds
     

10 Sep, 2020

1 commit

  • Latch sequence counters are a multiversion concurrency control mechanism
    where the seqcount_t counter even/odd value is used to switch between
    two data storage copies. This allows the seqcount_t read path to safely
    interrupt its write side critical section (e.g. from NMIs).

    Initially, latch sequence counters were implemented as a single write
    function, raw_write_seqcount_latch(), above plain seqcount_t. The read
    path was expected to use plain seqcount_t raw_read_seqcount().

    A specialized read function was later added, raw_read_seqcount_latch(),
    and became the standardized way for latch read paths. Having unique read
    and write APIs meant that latch sequence counters are basically a data
    type of their own -- just inappropriately overloading plain seqcount_t.
    The seqcount_latch_t data type was thus introduced at seqlock.h.
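
    The latch idea described above can be sketched in userspace C. This is
    only an illustration under stated assumptions: struct latch, struct
    clock_data, latch_write() and latch_read() are hypothetical names, not
    the seqlock.h API, and the plain struct copies would need proper
    atomic accesses and barriers in a truly concurrent setting.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical payload; stands in for whatever data the latch protects. */
struct clock_data {
    uint64_t base_ns;
    uint32_t mult;
};

struct latch {
    atomic_uint seq;              /* even: readers use copy[0], odd: copy[1] */
    struct clock_data copy[2];
};

/* Writer side. Writers must still be serialized against each other. */
static void latch_write(struct latch *l, struct clock_data val)
{
    atomic_fetch_add(&l->seq, 1); /* now odd: readers switch to copy[1] */
    l->copy[0] = val;             /* safe to modify, nobody reads copy[0] */
    atomic_fetch_add(&l->seq, 1); /* now even: readers switch to copy[0] */
    l->copy[1] = val;             /* bring the second copy up to date */
}

/* Reader side. May interrupt latch_write() at any point. */
static struct clock_data latch_read(struct latch *l)
{
    struct clock_data snap;
    unsigned int seq;

    do {
        seq = atomic_load(&l->seq);
        snap = l->copy[seq & 1];  /* pick the copy the writer is not touching */
    } while (atomic_load(&l->seq) != seq);

    return snap;
}
```

    The reader never spins waiting for the writer to finish; it only
    retries if the counter moved while it was copying, which is what makes
    the scheme usable from a context that interrupts the writer.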

    Use that new data type instead of seqcount_raw_spinlock_t. This ensures
    that only latch-safe APIs are to be used with the sequence counter.

    Note that the use of seqcount_raw_spinlock_t was not very useful in the
    first place. Only the "raw_" subset of seqcount_t APIs were used at
    timekeeping.c. This subset was created for contexts where lockdep cannot
    be used. seqcount_LOCKTYPE_t's raison d'être -- verifying that the
    seqcount_t writer serialization lock is held -- cannot thus be done.

    References: 0c3351d451ae ("seqlock: Use raw_ prefix instead of _no_lockdep")
    References: 55f3560df975 ("seqlock: Extend seqcount API with associated locks")
    Signed-off-by: Ahmed S. Darwish
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200827114044.11173-6-a.darwish@linutronix.de

    Ahmed S. Darwish
     

23 Aug, 2020

2 commits

  • printk wants to store various timestamps (MONOTONIC, REALTIME, BOOTTIME) to
    make correlation of dmesg from several systems easier.

    Provide an interface to retrieve all three timestamps in one go.

    There are some caveats:

    1) Boot time and late sleep time injection

    Boot time is a racy access on 32bit systems if the sleep time injection
    happens late during resume and not in timekeeping_resume(). That could be
    avoided by expanding struct tk_read_base with boot offset for 32bit and
    adding more overhead to the update. As this is a hard to observe once per
    resume event which can be filtered with reasonable effort using the
    accurate mono/real timestamps, it's probably not worth the trouble.

    Aside of that it might be possible on 32 and 64 bit to observe the
    following when the sleep time injection happens late:

    CPU 0                                   CPU 1
    timekeeping_resume()
    ktime_get_fast_timestamps()
      mono, real = __ktime_get_real_fast()
                                            inject_sleep_time()
                                              update boot offset
      boot = mono + bootoffset;

    That means that boot time already has the sleep time adjustment, but
    real time does not. On the next readout both are in sync again.
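
    The interleaving above can be replayed in a small single-threaded
    sketch. All names here (toy_timestamps, toy_offsets, toy_racy_read())
    are hypothetical and only model the ordering of the three steps; they
    are not the kernel interface.

```c
#include <stdint.h>

struct toy_timestamps {
    uint64_t mono, boot, real;
};

/* Offsets added to clock MONOTONIC to obtain REALTIME and BOOTTIME. */
struct toy_offsets {
    uint64_t real_off;
    uint64_t boot_off;
};

/* Late sleep time injection: both offsets grow, MONOTONIC does not. */
static void toy_inject_sleep(struct toy_offsets *o, uint64_t sleep_ns)
{
    o->real_off += sleep_ns;
    o->boot_off += sleep_ns;
}

/* Replays the racy order: real is read before, boot after the injection. */
static struct toy_timestamps toy_racy_read(uint64_t mono,
                                           struct toy_offsets *o,
                                           uint64_t sleep_ns)
{
    struct toy_timestamps ts;

    ts.mono = mono;
    ts.real = mono + o->real_off;   /* CPU 0: before injection */
    toy_inject_sleep(o, sleep_ns);  /* CPU 1: runs in between */
    ts.boot = mono + o->boot_off;   /* CPU 0: after injection */
    return ts;
}
```

    In this model the returned boot value includes the sleep time while
    real does not; a second readout using the updated offsets sees both
    adjusted.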

    Preventing this for 64bit is not really feasible without destroying the
    careful cache layout of the timekeeper because the sequence count and
    struct tk_read_base would then need two cache lines instead of one.

    2) Suspend/resume timestamps

    Access to the time keeper clock source is disabled across the innermost
    steps of suspend/resume. The accessors still work, but the timestamps
    are frozen until time keeping is resumed which happens very early.

    For regular suspend/resume there is no observable difference vs. sched
    clock, but it might affect some of the nasty low level debug printks.

    OTOH, access to sched clock is not guaranteed across suspend/resume on
    all systems either so it depends on the hardware in use.

    If that turns out to be a real problem then this could be mitigated by
    using sched clock in a similar way as during early boot. But it's not as
    trivial as on early boot because it needs some careful protection
    against the clock monotonic timestamp jumping backwards on resume.

    Signed-off-by: Thomas Gleixner
    Tested-by: Petr Mladek
    Link: https://lore.kernel.org/r/20200814115512.159981360@linutronix.de

    Thomas Gleixner
     
  • During early boot the NMI safe timekeeper returns 0 until the first
    clocksource becomes available.

    This prevents it from being used for printk or other facilities which today
    use sched clock. sched clock can be available way before timekeeping is
    initialized.

    The obvious workaround for this is to utilize the early sched clock in the
    default dummy clock read function until a clocksource becomes available.

    After switching to the clocksource clock MONOTONIC and BOOTTIME will not
    jump because the timekeeping_init() bases clock MONOTONIC on sched clock
    and the offset between clock MONOTONIC and BOOTTIME is zero during boot.

    Clock REALTIME cannot provide useful timestamps during early boot up to
    the point where a persistent clock becomes available, which is either in
    timekeeping_init() or later when the RTC driver which might depend on I2C
    or other subsystems is initialized.

    There is a minor difference to sched_clock() vs. suspend/resume. As the
    timekeeper clock source might not be accessible during suspend, after
    timekeeping_suspend() timestamps freeze up to the point where
    timekeeping_resume() is invoked. OTOH this is true for some sched clock
    implementations as well.

    Signed-off-by: Thomas Gleixner
    Tested-by: Petr Mladek
    Link: https://lore.kernel.org/r/20200814115512.041422402@linutronix.de

    Thomas Gleixner
     

15 Aug, 2020

1 commit

  • Pull timekeeping updates from Thomas Gleixner:
    "A set of timekeeping/VDSO updates:

    - Preparatory work to allow S390 to switch over to the generic VDSO
    implementation.

    S390 requires that the VDSO data pointer is handed in to the
    counter read function when time namespace support is enabled.
    Adding the pointer is a NOOP for all other architectures because
    the compiler is supposed to optimize that out when it is unused in
    the architecture specific inline. The change also solved a similar
    problem for MIPS which fortunately has time namespaces not yet
    enabled.

    S390 needs to update clock related VDSO data independent of the
    timekeeping updates. This was solved so far with yet another
    sequence counter in the S390 implementation. A better solution is
    to utilize the already existing VDSO sequence count for this. The
    core code now exposes helper functions which allow to serialize
    against the timekeeper code and against concurrent readers.

    S390 needs extra data for their clock readout function. The initial
    common VDSO data structure did not provide a way to add that. It
    now has an embedded architecture specific struct which
    defaults to an empty struct.

    Doing this now avoids tree dependencies and conflicts post rc1 and
    allows all other architectures which work on generic VDSO support
    to work from a common upstream base.

    - A trivial comment fix"

    * tag 'timers-urgent-2020-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    time: Delete repeated words in comments
    lib/vdso: Allow to add architecture-specific vdso data
    timekeeping/vsyscall: Provide vdso_update_begin/end()
    vdso/treewide: Add vdso_data pointer argument to __arch_get_hw_counter()

    Linus Torvalds
     

11 Aug, 2020

2 commits

  • Pull locking updates from Thomas Gleixner:
    "A set of locking fixes and updates:

    - Untangle the header spaghetti which causes build failures in
    various situations caused by the lockdep additions to seqcount to
    validate that the write side critical sections are non-preemptible.

    - The seqcount associated lock debug addons which were blocked by the
    above fallout.

    seqcount writers contrary to seqlock writers must be externally
    serialized, which usually happens via locking - except for strict
    per CPU seqcounts. As the lock is not part of the seqcount, lockdep
    cannot validate that the lock is held.

    This new debug mechanism adds the concept of associated locks.
    sequence count has now lock type variants and corresponding
    initializers which take a pointer to the associated lock used for
    writer serialization. If lockdep is enabled the pointer is stored
    and write_seqcount_begin() has a lockdep assertion to validate that
    the lock is held.

    Aside of the type and the initializer no other code changes are
    required at the seqcount usage sites. The rest of the seqcount API
    is unchanged and determines the type at compile time with the help
    of _Generic which is possible now that the minimal GCC version has
    been moved up.

    Adding this lockdep coverage unearthed a handful of seqcount bugs
    which have been addressed already independent of this.

    While generally useful this comes with a Trojan Horse twist: On RT
    kernels the write side critical section can become preemptible if
    the writers are serialized by an associated lock, which leads to
    the well known reader preempts writer livelock. RT prevents this by
    storing the associated lock pointer independent of lockdep in the
    seqcount and changing the reader side to block on the lock when a
    reader detects that a writer is in the write side critical section.

    - Conversion of seqcount usage sites to associated types and
    initializers"

    * tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
    locking/seqlock, headers: Untangle the spaghetti monster
    locking, arch/ia64: Reduce header dependencies by moving XTP bits into the new header
    x86/headers: Remove APIC headers from
    seqcount: More consistent seqprop names
    seqcount: Compress SEQCNT_LOCKNAME_ZERO()
    seqlock: Fold seqcount_LOCKNAME_init() definition
    seqlock: Fold seqcount_LOCKNAME_t definition
    seqlock: s/__SEQ_LOCKDEP/__SEQ_LOCK/g
    hrtimer: Use sequence counter with associated raw spinlock
    kvm/eventfd: Use sequence counter with associated spinlock
    userfaultfd: Use sequence counter with associated spinlock
    NFSv4: Use sequence counter with associated spinlock
    iocost: Use sequence counter with associated spinlock
    raid5: Use sequence counter with associated spinlock
    vfs: Use sequence counter with associated spinlock
    timekeeping: Use sequence counter with associated raw spinlock
    xfrm: policy: Use sequence counters with associated lock
    netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
    netfilter: conntrack: Use sequence counter with associated spinlock
    sched: tasks: Use sequence counter with associated spinlock
    ...

    Linus Torvalds
     
  • Drop repeated words in kernel/time/. {when, one, into}

    Signed-off-by: Randy Dunlap
    Signed-off-by: Thomas Gleixner
    Acked-by: John Stultz
    Link: https://lore.kernel.org/r/20200807033248.8452-1-rdunlap@infradead.org

    Randy Dunlap
     

06 Aug, 2020

1 commit

  • Architectures can have the requirement to add additional architecture
    specific data to the VDSO data page which needs to be updated independent
    of the timekeeper updates.

    To protect these updates vs. concurrent readers and a conflicting update
    through timekeeping, provide helper functions to make such updates safe.

    vdso_update_begin() takes the timekeeper_lock to protect against a
    potential update from timekeeper code and increments the VDSO sequence
    count to signal data inconsistency to concurrent readers. vdso_update_end()
    makes the sequence count even again to signal data consistency and drops
    the timekeeper lock.
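
    A minimal userspace sketch of the odd/even protocol described above,
    assuming hypothetical names throughout (toy_vdso_data,
    toy_update_begin(), toy_update_end(), toy_try_read()). The boolean
    flag merely stands in for taking timekeeper_lock with interrupts
    disabled; it provides no real mutual exclusion.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_vdso_data {
    atomic_uint seq;        /* odd while an update is in progress */
    uint64_t arch_value;    /* hypothetical architecture-specific field */
};

static bool toy_timekeeper_locked;  /* placeholder for timekeeper_lock */

static void toy_update_begin(struct toy_vdso_data *vd)
{
    toy_timekeeper_locked = true;   /* keep timekeeper updates out */
    atomic_fetch_add(&vd->seq, 1);  /* odd: tell readers data is in flux */
}

static void toy_update_end(struct toy_vdso_data *vd)
{
    atomic_fetch_add(&vd->seq, 1);  /* even: data is consistent again */
    toy_timekeeper_locked = false;
}

/* Returns true and fills *out only if a consistent snapshot was read. */
static bool toy_try_read(struct toy_vdso_data *vd, uint64_t *out)
{
    unsigned int seq = atomic_load(&vd->seq);

    if (seq & 1)
        return false;               /* update in progress, retry later */
    *out = vd->arch_value;
    return atomic_load(&vd->seq) == seq;
}
```

    A reader would call toy_try_read() in a loop until it returns true.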

    [ Sven: Add interrupt disable handling to the functions ]

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Sven Schnelle
    Signed-off-by: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20200804150124.41692-3-svens@linux.ibm.com

    Thomas Gleixner
     

29 Jul, 2020

1 commit

  • A sequence counter write side critical section must be protected by some
    form of locking to serialize writers. A plain seqcount_t does not
    contain the information of which lock must be held when entering a write
    side critical section.

    Use the new seqcount_raw_spinlock_t data type, which allows to associate
    a raw spinlock with the sequence counter. This enables lockdep to verify
    that the raw spinlock used for writer serialization is held when the
    write side critical section is entered.

    If lockdep is disabled this lock association is compiled out and has
    neither storage size nor runtime overhead.
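
    The lock association can be pictured with the following sketch. It is
    an assumption-laden toy, not the seqlock.h implementation: toy_lock,
    seqcount_locked, TOY_LOCKDEP and the toy_write_*() helpers are
    hypothetical, and where lockdep would emit a warning this sketch
    simply refuses the write.

```c
#include <stdbool.h>

#define TOY_LOCKDEP 1   /* remove to compile the association away */

/* Stand-in for a raw spinlock; it only records whether it is held. */
struct toy_lock {
    bool held;
};

struct seqcount_locked {
    unsigned int seq;
#ifdef TOY_LOCKDEP
    const struct toy_lock *lock;    /* lock that serializes writers */
#endif
};

/* Enter the write side critical section. */
static bool toy_write_begin(struct seqcount_locked *s)
{
#ifdef TOY_LOCKDEP
    if (!s->lock->held)
        return false;   /* a debug build would warn at this point */
#endif
    s->seq++;           /* odd: write in progress */
    return true;
}

/* Leave the write side critical section. */
static void toy_write_end(struct seqcount_locked *s)
{
    s->seq++;           /* even: write complete */
}
```

    With TOY_LOCKDEP undefined the structure shrinks to the bare counter
    and the check disappears, mirroring the "compiled out" property
    mentioned above.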

    Signed-off-by: Ahmed S. Darwish
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200720155530.1173732-18-a.darwish@linutronix.de

    Ahmed S. Darwish
     

22 Jul, 2020

1 commit

  • The "ticks" parameter was added in commit 0f004f5a696a ("sched: Cure more
    NO_HZ load average woes") since calc_global_nohz() was called and needed
    the "ticks" argument.

    But in commit c308b56b5398 ("sched: Fix nohz load accounting -- again!")
    it became unused as the function calc_global_nohz() dropped using "ticks".

    Fixes: c308b56b5398 ("sched: Fix nohz load accounting -- again!")
    Signed-off-by: Paul Gortmaker
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/1593628458-32290-1-git-send-email-paul.gortmaker@windriver.com

    Paul Gortmaker
     

11 Jun, 2020

1 commit

  • Mark the relevant functions noinstr, use the plain non-instrumented MSR
    accessors. The only odd part is the instrumentation_begin()/end() pair around the
    indirect machine_check_vector() call as objtool can't figure that out. The
    possible invoked functions are annotated correctly.

    Also use notrace variant of nmi_enter/exit(). If MCEs happen then hardware
    latency tracing is the least of the worries.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Alexandre Chartre
    Acked-by: Peter Zijlstra
    Acked-by: Andy Lutomirski
    Link: https://lkml.kernel.org/r/20200505135315.476734898@linutronix.de

    Thomas Gleixner
     

31 Mar, 2020

1 commit

  • Pull timekeeping and timer updates from Thomas Gleixner:
    "Core:

    - Consolidation of the vDSO build infrastructure to address the
    difficulties of cross-builds for ARM64 compat vDSO libraries by
    restricting the exposure of header content to the vDSO build.

    This is achieved by splitting out header content into separate
    headers which contain only the minimally required information which
    is necessary to build the vDSO. These new headers are included from
    the kernel headers and the vDSO specific files.

    - Enhancements to the generic vDSO library allowing more fine grained
    control over the compiled in code, further reducing architecture
    specific storage and preparing for adopting the generic library by
    PPC.

    - Cleanup and consolidation of the exit related code in posix CPU
    timers.

    - Small cleanups and enhancements here and there

    Drivers:

    - The obligatory new drivers: Ingenic JZ47xx and X1000 TCU support

    - Correct the clock rate of PIT64b global clock

    - setup_irq() cleanup

    - Preparation for PWM and suspend support for the TI DM timer

    - Expand the fttmr010 driver to support ast2600 systems

    - The usual small fixes, enhancements and cleanups all over the
    place"

    * tag 'timers-core-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (80 commits)
    Revert "clocksource/drivers/timer-probe: Avoid creating dead devices"
    vdso: Fix clocksource.h macro detection
    um: Fix header inclusion
    arm64: vdso32: Enable Clang Compilation
    lib/vdso: Enable common headers
    arm: vdso: Enable arm to use common headers
    x86/vdso: Enable x86 to use common headers
    mips: vdso: Enable mips to use common headers
    arm64: vdso32: Include common headers in the vdso library
    arm64: vdso: Include common headers in the vdso library
    arm64: Introduce asm/vdso/processor.h
    arm64: vdso32: Code clean up
    linux/elfnote.h: Replace elf.h with UAPI equivalent
    scripts: Fix the inclusion order in modpost
    common: Introduce processor.h
    linux/ktime.h: Extract common header for vDSO
    linux/jiffies.h: Extract common header for vDSO
    linux/time64.h: Extract common header for vDSO
    linux/time32.h: Extract common header for vDSO
    linux/time.h: Extract common header for vDSO
    ...

    Linus Torvalds
     

21 Mar, 2020

1 commit

  • seqlock consists of a sequence counter and a spinlock_t which is used to
    serialize the writers. spinlock_t is substituted by a "sleeping" spinlock
    on PREEMPT_RT enabled kernels which breaks the usage in the timekeeping
    code as the writers are executed in hard interrupt and therefore
    non-preemptible context even on PREEMPT_RT.

    The spinlock in seqlock cannot be unconditionally replaced by a
    raw_spinlock_t as many seqlock users have nesting spinlock sections or
    other code which is not suitable to run in truly atomic context on RT.

    Instead of providing a raw_seqlock API for a single use case, open code the
    seqlock for the jiffies use case and implement it with a raw_spinlock_t and
    a sequence counter.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200321113242.120587764@linutronix.de

    Thomas Gleixner
     

04 Mar, 2020

1 commit

  • While unlikely, the divisor in scale64_check_overflow() could be wider
    than 32 bits. do_div() truncates the divisor to 32bit at least
    on 32bit platforms.

    Use div64_u64() instead to avoid the truncation to 32-bit.
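
    The effect of the truncation can be shown in plain C. The two helpers
    below are hypothetical userspace stand-ins, not the kernel macros:
    div_truncating() mimics a division that only honours the low 32 bits
    of the divisor, div_full() keeps all 64 bits.

```c
#include <stdint.h>

/*
 * Only the low 32 bits of the divisor are used. If those bits are all
 * zero this divides by zero, so callers must avoid such divisors.
 */
static uint64_t div_truncating(uint64_t dividend, uint64_t divisor)
{
    return dividend / (uint32_t)divisor;
}

/* Full 64-bit by 64-bit division. */
static uint64_t div_full(uint64_t dividend, uint64_t divisor)
{
    return dividend / divisor;
}
```

    For a dividend of 2^40 and a divisor of 2^32 + 2, the truncating
    variant divides by 2 and yields 2^39, while the full division yields
    255.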

    [ tglx: Massaged changelog ]

    Signed-off-by: Wen Yang
    Signed-off-by: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20200120100523.45656-1-wenyang@linux.alibaba.com

    Wen Yang
     

23 Aug, 2019

1 commit

  • The VDSO update for CLOCK_BOOTTIME has an overflow issue as it shifts the
    nanoseconds based boot time offset left by the clocksource shift. That
    overflows once the boot time offset becomes large enough. As a consequence
    CLOCK_BOOTTIME in the VDSO becomes a random number causing applications to
    misbehave.

    Fix it by storing a timespec64 representation of the offset when boot time
    is adjusted and add that to the MONOTONIC base time value in the vdso data
    page. Using the timespec64 representation avoids a 64bit division in the
    update code.
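
    A rough sketch of both halves of this change, using hypothetical
    helper names and an assumed clocksource shift of 24 for the example:
    shift_overflows() tells whether a nanosecond offset still fits after
    the left shift, and toy_ts_add() adds a seconds/nanoseconds offset to
    a base time without any shift or 64bit division.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_NSEC_PER_SEC 1000000000L

struct toy_timespec64 {
    int64_t tv_sec;
    long    tv_nsec;
};

/* True if (offset_ns << shift) no longer fits in 64 bits; shift < 64. */
static bool shift_overflows(uint64_t offset_ns, unsigned int shift)
{
    return offset_ns > (UINT64_MAX >> shift);
}

/* Add two normalized timespecs, carrying nanoseconds into seconds. */
static struct toy_timespec64 toy_ts_add(struct toy_timespec64 a,
                                        struct toy_timespec64 b)
{
    struct toy_timespec64 r;

    r.tv_sec = a.tv_sec + b.tv_sec;
    r.tv_nsec = a.tv_nsec + b.tv_nsec;
    if (r.tv_nsec >= TOY_NSEC_PER_SEC) {
        r.tv_sec++;
        r.tv_nsec -= TOY_NSEC_PER_SEC;
    }
    return r;
}
```

    With the assumed shift of 24 the shifted value overflows once the
    offset reaches 2^40 ns, which is roughly 18 minutes.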

    Fixes: 44f57d788e7d ("timekeeping: Provide a generic update_vsyscall() implementation")
    Reported-by: Chris Clayton
    Signed-off-by: Thomas Gleixner
    Tested-by: Chris Clayton
    Tested-by: Vincenzo Frascino
    Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1908221257580.1983@nanos.tec.linutronix.de

    Thomas Gleixner
     

22 Jun, 2019

1 commit

  • While this doesn't actually amount to a real difference, since the macro
    evaluates to the same thing, every place else operates on ktime_t using
    these functions, so let's not break the pattern.

    Fixes: e3ff9c3678b4 ("timekeeping: Repair ktime_get_coarse*() granularity")
    Signed-off-by: Jason A. Donenfeld
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Arnd Bergmann
    Link: https://lkml.kernel.org/r/20190621203249.3909-1-Jason@zx2c4.com

    Jason A. Donenfeld
     

14 Jun, 2019

1 commit

  • Jason reported that the coarse ktime based time getters advance only once
    per second and not once per tick as advertised.

    The code reads only the monotonic base time, which advances once per
    second. The nanoseconds are accumulated on every tick in xtime_nsec up to
    a second and the regular time getters take this nanoseconds offset into
    account, but the ktime_get_coarse*() implementation fails to do so.

    Add the accumulated xtime_nsec value to the monotonic base time to get the
    proper per tick advancing coarse time.
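
    In simplified form the difference looks like this. struct toy_tk and
    both helpers are hypothetical; the sketch assumes xtime_nsec holds
    nanoseconds shifted left by the clocksource shift, as in the regular
    time getters.

```c
#include <stdint.h>

struct toy_tk {
    uint64_t base_ns;       /* monotonic base, advances once per second */
    uint64_t xtime_nsec;    /* accumulated per tick, in shifted ns */
    unsigned int shift;     /* clocksource shift */
};

/* Old behaviour: ignores the nanoseconds accumulated since the last second. */
static uint64_t coarse_before_fix(const struct toy_tk *tk)
{
    return tk->base_ns;
}

/* Fixed behaviour: add the accumulated nanoseconds to the base. */
static uint64_t coarse_after_fix(const struct toy_tk *tk)
{
    return tk->base_ns + (tk->xtime_nsec >> tk->shift);
}
```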

    Fixes: b9ff604cff11 ("timekeeping: Add ktime_get_coarse_with_offset")
    Reported-by: Jason A. Donenfeld
    Signed-off-by: Thomas Gleixner
    Tested-by: Jason A. Donenfeld
    Cc: Arnd Bergmann
    Cc: Peter Zijlstra
    Cc: Clemens Ladisch
    Cc: Sultan Alsawaf
    Cc: Waiman Long
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1906132136280.1791@nanos.tec.linutronix.de

    Thomas Gleixner
     

08 May, 2019

1 commit

  • Pull audit updates from Paul Moore:
    "We've got a reasonably broad set of audit patches for the v5.2 merge
    window, the highlights are below:

    - The biggest change, and the source of all the arch/* changes, is
    the patchset from Dmitry to help enable some of the work he is
    doing around PTRACE_GET_SYSCALL_INFO.

    To be honest, including this in the audit tree is a bit of a
    stretch, but it does help move audit a little further along towards
    proper syscall auditing for all arches, and everyone else seemed to
    agree that audit was a "good" spot for this to land (or maybe they
    just didn't want to merge it? dunno.).

    - We can now audit time/NTP adjustments.

    - We continue the work to connect associated audit records into a
    single event"

    * tag 'audit-pr-20190507' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: (21 commits)
    audit: fix a memory leak bug
    ntp: Audit NTP parameters adjustment
    timekeeping: Audit clock adjustments
    audit: purge unnecessary list_empty calls
    audit: link integrity evm_write_xattrs record to syscall event
    syscall_get_arch: add "struct task_struct *" argument
    unicore32: define syscall_get_arch()
    Move EM_UNICORE to uapi/linux/elf-em.h
    nios2: define syscall_get_arch()
    nds32: define syscall_get_arch()
    Move EM_NDS32 to uapi/linux/elf-em.h
    m68k: define syscall_get_arch()
    hexagon: define syscall_get_arch()
    Move EM_HEXAGON to uapi/linux/elf-em.h
    h8300: define syscall_get_arch()
    c6x: define syscall_get_arch()
    arc: define syscall_get_arch()
    Move EM_ARCOMPACT and EM_ARCV2 to uapi/linux/elf-em.h
    audit: Make audit_log_cap and audit_copy_inode static
    audit: connect LOGIN record to its syscall record
    ...

    Linus Torvalds
     

16 Apr, 2019

2 commits

  • Emit an audit record every time selected NTP parameters are modified
    from userspace (via adjtimex(2) or clock_adjtime(2)). These parameters
    may be used to indirectly change system clock, and thus their
    modifications should be audited.

    Such events will now generate records of type AUDIT_TIME_ADJNTPVAL
    containing the following fields:
    - op -- which value was adjusted:
    - offset -- corresponding to the time_offset variable
    - freq -- corresponding to the time_freq variable
    - status -- corresponding to the time_status variable
    - adjust -- corresponding to the time_adjust variable
    - tick -- corresponding to the tick_usec variable
    - tai -- corresponding to the timekeeping's TAI offset
    - old -- the old value
    - new -- the new value

    Example records:

    type=TIME_ADJNTPVAL msg=audit(1530616044.507:7): op=status old=64 new=8256
    type=TIME_ADJNTPVAL msg=audit(1530616044.511:11): op=freq old=0 new=49180377088000

    The records of this type will be associated with the corresponding
    syscall records.

    An overview of parameter changes that can be done via do_adjtimex()
    (based on information from Miroslav Lichvar) and whether they are
    audited:
    __timekeeping_set_tai_offset() -- sets the offset from the
    International Atomic Time
    (AUDITED)
    NTP variables:
    time_offset -- can adjust the clock by up to 0.5 seconds per call
    and also speed it up or slow down by up to about
    0.05% (43 seconds per day) (AUDITED)
    time_freq -- can speed up or slow down by up to about 0.05%
    (AUDITED)
    time_status -- can insert/delete leap seconds and it also enables/
    disables synchronization of the hardware real-time
    clock (AUDITED)
    time_maxerror, time_esterror -- change error estimates used to
    inform userspace applications
    (NOT AUDITED)
    time_constant -- controls the speed of the clock adjustments that
    are made when time_offset is set (NOT AUDITED)
    time_adjust -- can temporarily speed up or slow down the clock by up
    to 0.05% (AUDITED)
    tick_usec -- a more extreme version of time_freq; can speed up or
    slow down the clock by up to 10% (AUDITED)

    Signed-off-by: Ondrej Mosnacek
    Reviewed-by: Richard Guy Briggs
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Paul Moore

    Ondrej Mosnacek
     
  • Emit an audit record whenever the system clock is changed (i.e. shifted
    by a non-zero offset) by a syscall from userspace. The syscalls that can
    (at the time of writing) trigger such a record are:
    - settimeofday(2), stime(2), clock_settime(2) -- via
    do_settimeofday64()
    - adjtimex(2), clock_adjtime(2) -- via do_adjtimex()

    The new records have type AUDIT_TIME_INJOFFSET and contain the following
    fields:
    - sec -- the 'seconds' part of the offset
    - nsec -- the 'nanoseconds' part of the offset

    Example record (time was shifted backwards by ~15.875 seconds):

    type=TIME_INJOFFSET msg=audit(1530616049.652:13): sec=-16 nsec=124887145
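
    The sec/nsec pair in the example record reads as a normalized
    timespec, where nsec is always non-negative, so a negative offset
    shows up as a more negative sec plus a positive nsec. A hypothetical
    helper to fold the two fields into nanoseconds:

```c
#include <stdint.h>

/* Combine the 'sec' and 'nsec' record fields into a signed ns offset. */
static int64_t offset_to_ns(int64_t sec, long nsec)
{
    return sec * 1000000000LL + nsec;
}
```

    For sec=-16 and nsec=124887145 this gives -15875112855 ns, i.e. the
    shift of about -15.875 seconds mentioned above.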

    The records of this type will be associated with the corresponding
    syscall records.

    Signed-off-by: Ondrej Mosnacek
    Reviewed-by: Richard Guy Briggs
    Reviewed-by: Thomas Gleixner
    [PM: fixed a line width problem in __audit_tk_injoffset()]
    Signed-off-by: Paul Moore

    Ondrej Mosnacek
     

28 Mar, 2019

1 commit

  • Several people reported testing failures after setting CLOCK_REALTIME close
    to the limits of the kernel internal representation in nanoseconds,
    i.e. year 2262.

    The failures are exposed in subsequent operations, i.e. when arming timers
    or when the advancing CLOCK_MONOTONIC makes the calculation of
    CLOCK_REALTIME overflow into negative space.

    Now people start to paper over the underlying problem by clamping
    calculations to the valid range, but that's just wrong because such
    workarounds will prevent detection of real issues as well.

    It is reasonable to force an upper bound for the various methods of setting
    CLOCK_REALTIME. Year 2262 is the absolute upper bound. Assume a maximum
    uptime of 30 years which is plenty enough even for esoteric embedded
    systems. That results in an upper bound of year 2232 for setting the time.
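
    The arithmetic behind those two years can be checked with a short
    sketch. The TOY_* names and helpers are hypothetical, the 30 year
    uptime is taken as 30 * 365 days, and the year conversion uses the
    average Gregorian year length, so it is an approximation.

```c
#include <stdint.h>

#define TOY_NSEC_PER_SEC    1000000000LL
#define TOY_KTIME_MAX       INT64_MAX                   /* signed 64bit ns */
#define TOY_KTIME_SEC_MAX   (TOY_KTIME_MAX / TOY_NSEC_PER_SEC)
#define TOY_UPTIME_SEC_MAX  (30LL * 365 * 24 * 3600)    /* assumed uptime */
#define TOY_SETTOD_SEC_MAX  (TOY_KTIME_SEC_MAX - TOY_UPTIME_SEC_MAX)
#define TOY_AVG_YEAR_SEC    31556952LL                  /* 365.2425 days */

/* Approximate calendar year for a number of seconds since 1970. */
static int toy_year_of(int64_t sec)
{
    return 1970 + (int)(sec / TOY_AVG_YEAR_SEC);
}

/* Would a request to set the clock to 'sec' be accepted in this model? */
static int toy_settod_valid(int64_t sec)
{
    return sec >= 0 && sec < TOY_SETTOD_SEC_MAX;
}
```

    toy_year_of(TOY_KTIME_SEC_MAX) evaluates to 2262 and
    toy_year_of(TOY_SETTOD_SEC_MAX) to 2232, matching the bounds given in
    the text.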

    Once that limit is reached in reality this limit is only a small part of
    the problem space. But until then this stops people from trying to paper
    over the problem at the wrong places.

    Reported-by: Xiongfeng Wang
    Reported-by: Hongbo Yao
    Signed-off-by: Thomas Gleixner
    Cc: John Stultz
    Cc: Stephen Boyd
    Cc: Miroslav Lichvar
    Cc: Arnd Bergmann
    Cc: Richard Cochran
    Cc: Peter Zijlstra
    Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1903231125480.2157@nanos.tec.linutronix.de

    Thomas Gleixner
     

23 Mar, 2019

1 commit

  • The timekeeping code uses a random mix of "unsigned long" and "unsigned
    int" for the seqcount snapshots (ratio 14:12). Since the seqlock.h API is
    entirely based on unsigned int, use that throughout.

    Signed-off-by: Rasmus Villemoes
    Signed-off-by: Thomas Gleixner
    Cc: Frederic Weisbecker
    Cc: John Stultz
    Cc: Stephen Boyd
    Link: https://lkml.kernel.org/r/20190318195557.20773-1-linux@rasmusvillemoes.dk

    Rasmus Villemoes
     

07 Feb, 2019

1 commit

  • struct timex is not y2038 safe.
    Replace all uses of timex with y2038 safe __kernel_timex.

    Note that struct __kernel_timex is an ABI interface definition.
    We could define a new structure based on __kernel_timex that
    is only available internally instead. Right now, there isn't
    a strong motivation for this as the structure is isolated to
    a few defined struct timex interfaces and such a structure would
    be exactly the same as struct timex.

    The patch was generated by the following coccinelle script:

    virtual patch

    @depends on patch forall@
    identifier ts;
    expression e;
    @@
    (
    - struct timex ts;
    + struct __kernel_timex ts;
    |
    - struct timex ts = {};
    + struct __kernel_timex ts = {};
    |
    - struct timex ts = e;
    + struct __kernel_timex ts = e;
    |
    - struct timex *ts;
    + struct __kernel_timex *ts;
    |
    (memset \| copy_from_user \| copy_to_user \)(...,
    - sizeof(struct timex))
    + sizeof(struct __kernel_timex))
    )

    @depends on patch forall@
    identifier ts;
    identifier fn;
    @@
    fn(...,
    - struct timex *ts,
    + struct __kernel_timex *ts,
    ...) {
    ...
    }

    @depends on patch forall@
    identifier ts;
    identifier fn;
    @@
    fn(...,
    - struct timex *ts) {
    + struct __kernel_timex *ts) {
    ...
    }

    Signed-off-by: Deepa Dinamani
    Cc: linux-alpha@vger.kernel.org
    Cc: netdev@vger.kernel.org
    Signed-off-by: Arnd Bergmann

    Deepa Dinamani
     

29 Dec, 2018

1 commit

  • Pull y2038 updates from Arnd Bergmann:
    "More syscalls and cleanups

    This concludes the main part of the system call rework for 64-bit
    time_t, which has spread over most of year 2018, the last six system
    calls being

    - ppoll
    - pselect6
    - io_pgetevents
    - recvmmsg
    - futex
    - rt_sigtimedwait

    As before, nothing changes for 64-bit architectures, while 32-bit
    architectures gain another entry point that differs only in the layout
    of the timespec structure. Hopefully in the next release we can wire
    up all 22 of those system calls on all 32-bit architectures, which
    gives us a baseline version for glibc to start using them.

    This does not include the clock_adjtime, getrusage/waitid, and
    getitimer/setitimer system calls. I still plan to have new versions of
    those as well, but they are not required for correct operation of the
    C library since they can be emulated using the old 32-bit time_t based
    system calls.

    Aside from the system calls, there are also a few cleanups here,
    removing old kernel internal interfaces that have become unused after
    all references got removed. The arch/sh cleanups are part of this,
    these were posted several times over the past year without a reaction
    from the maintainers, while the corresponding changes made it into all
    other architectures"

    * tag 'y2038-for-4.21' of ssh://gitolite.kernel.org:/pub/scm/linux/kernel/git/arnd/playground:
    timekeeping: remove obsolete time accessors
    vfs: replace current_kernel_time64 with ktime equivalent
    timekeeping: remove timespec_add/timespec_del
    timekeeping: remove unused {read,update}_persistent_clock
    sh: remove board_time_init() callback
    sh: remove unused rtc_sh_get/set_time infrastructure
    sh: sh03: rtc: push down rtc class ops into driver
    sh: dreamcast: rtc: push down rtc class ops into driver
    y2038: signal: Add compat_sys_rt_sigtimedwait_time64
    y2038: signal: Add sys_rt_sigtimedwait_time32
    y2038: socket: Add compat_sys_recvmmsg_time64
    y2038: futex: Add support for __kernel_timespec
    y2038: futex: Move compat implementation into futex.c
    io_pgetevents: use __kernel_timespec
    pselect6: use __kernel_timespec
    ppoll: use __kernel_timespec
    signal: Add restore_user_sigmask()
    signal: Add set_user_sigmask()

    Linus Torvalds
     

18 Dec, 2018

1 commit


05 Dec, 2018

1 commit

  • tk_core.seq is initialized open coded, but that fails to initialize the
    lockdep map when lockdep is enabled. Lockdep splats involving tk_core.seq
    consequently lack a name and are hard to read.

    Use the proper initializer which takes care of the lockdep map
    initialization.

    [ tglx: Massaged changelog ]

    Signed-off-by: Bart Van Assche
    Signed-off-by: Thomas Gleixner
    Cc: peterz@infradead.org
    Cc: tj@kernel.org
    Cc: johannes.berg@intel.com
    Link: https://lkml.kernel.org/r/20181128234325.110011-12-bvanassche@acm.org

    Bart Van Assche
     

23 Nov, 2018

2 commits

  • Update the time(r) core files with the correct SPDX license
    identifier based on the license text in the file itself. The SPDX
    identifier is a legally binding shorthand, which can be used instead of the
    full boilerplate text.

    This work is based on a script and data from Philippe Ombredanne, Kate
    Stewart and myself. The data has been created with two independent license
    scanners and manual inspection.

    The following files do not contain any direct license information and have
    been omitted from the big initial SPDX changes:

    timeconst.bc: The .bc files were not touched
    time.c, timer.c, timekeeping.c: License was deduced from EXPORT_SYMBOL_GPL

    As those files do not contain direct license references they fall under the
    project license, i.e. GPL V2 only.

    Signed-off-by: Thomas Gleixner
    Acked-by: Kees Cook
    Acked-by: Ingo Molnar
    Acked-by: John Stultz
    Acked-by: Corey Minyard
    Cc: Peter Zijlstra
    Cc: Kate Stewart
    Cc: Philippe Ombredanne
    Cc: Russell King
    Cc: Richard Cochran
    Cc: Nicolas Pitre
    Cc: David Riley
    Cc: Colin Cross
    Cc: Mark Brown
    Cc: H. Peter Anvin
    Cc: Paul E. McKenney
    Link: https://lkml.kernel.org/r/20181031182252.879109557@linutronix.de

    Thomas Gleixner
     
  • Remove the pointless filenames in the top level comments. They have no
    value at all and just occupy space. While at it tidy up some of the
    comments and remove a stale one.

    Signed-off-by: Thomas Gleixner
    Acked-by: Nicolas Pitre
    Acked-by: Kees Cook
    Acked-by: Ingo Molnar
    Acked-by: John Stultz
    Acked-by: Corey Minyard
    Cc: Peter Zijlstra
    Cc: Kate Stewart
    Cc: Philippe Ombredanne
    Cc: Peter Anvin
    Cc: Russell King
    Cc: Richard Cochran
    Cc: "Paul E. McKenney"
    Cc: David Riley
    Cc: Colin Cross
    Cc: Mark Brown
    Link: https://lkml.kernel.org/r/20181031182252.794898238@linutronix.de

    Thomas Gleixner
     

27 Aug, 2018

1 commit

  • get_seconds() and do_gettimeofday() are now used by only a few modules
    (waiting for the respective patches to get accepted), and they are
    among the last holdouts of code that is not y2038 safe in the core kernel.

    Move the implementation into the timekeeping32.h header to clean up
    the core kernel and isolate the old interfaces further.

    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     

14 Aug, 2018

1 commit

  • Pull x86 timer updates from Thomas Gleixner:
    "Early TSC based time stamping to allow better boot time analysis.

    This comes with a general cleanup of the TSC calibration code which
    grew warts and duct taping over the years and removes 250 lines of
    code. Initiated and mostly implemented by Pavel with help from various
    folks"

    * 'x86-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
    x86/kvmclock: Mark kvm_get_preset_lpj() as __init
    x86/tsc: Consolidate init code
    sched/clock: Disable interrupts when calling generic_sched_clock_init()
    timekeeping: Prevent false warning when persistent clock is not available
    sched/clock: Close a hole in sched_clock_init()
    x86/tsc: Make use of tsc_calibrate_cpu_early()
    x86/tsc: Split native_calibrate_cpu() into early and late parts
    sched/clock: Use static key for sched_clock_running
    sched/clock: Enable sched clock early
    sched/clock: Move sched clock initialization and merge with generic clock
    x86/tsc: Use TSC as sched clock early
    x86/tsc: Initialize cyc2ns when tsc frequency is determined
    x86/tsc: Calibrate tsc only once
    ARM/time: Remove read_boot_clock64()
    s390/time: Remove read_boot_clock64()
    timekeeping: Default boot time offset to local_clock()
    timekeeping: Replace read_boot_clock64() with read_persistent_wall_and_boot_offset()
    s390/time: Add read_persistent_wall_and_boot_offset()
    x86/xen/time: Output xen sched_clock time from 0
    x86/xen/time: Initialize pv xen time in init_hypervisor_platform()
    ...

    Linus Torvalds
     

31 Jul, 2018

1 commit

  • On arches with no persistent clock a message like this is printed during
    boot:

    [ 0.000000] Persistent clock returned invalid value

    The value is not invalid: Zero means that no persistent clock is available
    and the absence of persistent clock should be quietly accepted.
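
    The distinction above can be sketched as a small userspace model. The
    names classify_persistent_clock() and should_warn() are hypothetical and
    are not kernel functions; the reading is assumed to be in nanoseconds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: possible interpretations of a persistent clock reading. */
enum pclock_status { PCLOCK_OK, PCLOCK_ABSENT, PCLOCK_INVALID };

/* Classify a reading given in nanoseconds since the epoch. */
static enum pclock_status classify_persistent_clock(int64_t ns)
{
    if (ns > 0)
        return PCLOCK_OK;       /* usable wall time */
    if (ns == 0)
        return PCLOCK_ABSENT;   /* no persistent clock: accept quietly */
    return PCLOCK_INVALID;      /* anything else is a real error */
}

/* Only a genuinely invalid value justifies a boot-time warning. */
static bool should_warn(int64_t ns)
{
    return classify_persistent_clock(ns) == PCLOCK_INVALID;
}
```

    In this model a zero reading never triggers the warning, while a negative
    reading still does.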

    Fixes: 3eca993740b8 ("timekeeping: Replace read_boot_clock64() with read_persistent_wall_and_boot_offset()")
    Signed-off-by: Pavel Tatashin
    Signed-off-by: Thomas Gleixner
    Cc: steven.sistare@oracle.com
    Cc: daniel.m.jordan@oracle.com
    Cc: sboyd@kernel.org
    Cc: john.stultz@linaro.org
    Link: https://lkml.kernel.org/r/20180725200018.23722-1-pasha.tatashin@oracle.com

    Pavel Tatashin
     

20 Jul, 2018

5 commits

  • On some hardware with multiple clocksources, we have coarse grained
    clocksources that support the CLOCK_SOURCE_SUSPEND_NONSTOP flag, but
    which are less than ideal for timekeeping whereas other clocksources
    can be better candidates but halt on suspend.

    Currently, the timekeeping core only supports timing suspend using
    CLOCK_SOURCE_SUSPEND_NONSTOP clocksources if that clocksource is the
    current clocksource for timekeeping.

    As a result, some architectures try to implement read_persistent_clock64()
    using those non-stop clocksources, which isn't really ideal and
    introduces more duplicate code. To fix this, provide logic to allow a
    registered SUSPEND_NONSTOP clocksource, which isn't the current
    clocksource, to be used to calculate the suspend time.
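
    As a rough illustration of how a non-stop counter yields the suspend
    time, the sketch below converts a cycle delta to nanoseconds with the
    usual mult/shift scaling. struct cs_model and suspend_ns() are
    hypothetical names for this model, not the kernel's implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a free-running counter that keeps counting in suspend. */
struct cs_model {
    uint64_t mask;   /* valid counter bits, e.g. 0xffffffff for 32 bits */
    uint32_t mult;   /* cycles -> ns multiplier */
    uint32_t shift;  /* cycles -> ns shift */
};

/* Time spent in suspend, derived from counter reads at suspend and resume. */
static uint64_t suspend_ns(const struct cs_model *cs,
                           uint64_t cycles_at_suspend,
                           uint64_t cycles_at_resume)
{
    /* Masking makes the subtraction safe across one counter wraparound. */
    uint64_t delta = (cycles_at_resume - cycles_at_suspend) & cs->mask;

    return (delta * cs->mult) >> cs->shift;
}
```

    For a 32768 Hz counter, mult = 31250000 with shift = 10 gives exactly
    1000000000 ns for 32768 cycles. The model ignores multiplication overflow
    for very long suspends.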

    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Miroslav Lichvar
    Cc: Richard Cochran
    Cc: Prarit Bhargava
    Cc: Stephen Boyd
    Cc: Daniel Lezcano
    Reviewed-by: Thomas Gleixner
    Reviewed-by: Daniel Lezcano
    Suggested-by: Thomas Gleixner
    Signed-off-by: Baolin Wang
    [jstultz: minor tweaks to merge with previous resume changes]
    Signed-off-by: John Stultz

    Baolin Wang
     
  • Currently, there is a corner case when there is only one clocksource,
    e.g. RTC, and the system fails to enter suspend mode. On resume,
    rtc_resume() injects the sleeptime because timekeeping_rtc_skipresume()
    returned 'false' (the default value of sleeptime_injected), which leads
    to a mismatch in timestamps.

    This issue can also occur on a system where more than one clocksource
    is present and the very first suspend fails.

    Success case:
    ------------
    {sleeptime_injected=false}
    rtc_suspend() => timekeeping_suspend() => timekeeping_resume() =>

    (sleeptime injected)
    rtc_resume()

    Failure case:
    ------------
    {failure in sleep path} {sleeptime_injected=false}
    rtc_suspend() => rtc_resume()

    {sleeptime injected again which was not required as the suspend failed}

    Fix this by handling the boolean logic properly.
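
    One way to express the corrected logic is to track whether suspend
    timing is still needed, rather than whether it was already injected. The
    sketch below is a userspace model under that assumption; struct pm_model
    and the model_*() helpers are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the suspend/resume bookkeeping. */
struct pm_model {
    bool suspend_timing_needed; /* set only when timekeeping really suspended */
    long injected_ns;           /* total sleep time injected so far */
};

/* Models the timekeeping suspend step: from here on, sleep time is owed. */
static void model_timekeeping_suspend(struct pm_model *m)
{
    m->suspend_timing_needed = true;
}

/* Models the RTC resume step: inject only if a suspend actually happened. */
static void model_rtc_resume(struct pm_model *m, long slept_ns)
{
    if (!m->suspend_timing_needed)
        return;                 /* failed suspend, or already injected */

    m->injected_ns += slept_ns;
    m->suspend_timing_needed = false;
}
```

    With the flag defaulting to false, a suspend that fails before the
    timekeeping suspend step never causes an injection on resume.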

    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Miroslav Lichvar
    Cc: Richard Cochran
    Cc: Prarit Bhargava
    Cc: Stephen Boyd
    Originally-by: Thomas Gleixner
    Signed-off-by: Mukesh Ojha
    Signed-off-by: John Stultz

    Mukesh Ojha
     
  • Add 'const' to some function arguments and variables to make it easier
    to read the code.

    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Miroslav Lichvar
    Cc: Richard Cochran
    Cc: Prarit Bhargava
    Cc: Stephen Boyd
    Signed-off-by: Ondrej Mosnacek
    [jstultz: Also fixup pre-existing checkpatch warnings for
    prototype arguments with no variable name]
    Signed-off-by: John Stultz

    Ondrej Mosnacek
     
  • read_persistent_wall_and_boot_offset() is called during boot to read
    both the persistent clock and also return the offset between the boot time
    and the value of persistent clock.

    Change the default boot_offset from zero to local_clock() so that
    architectures that do not have a dedicated boot_clock but have an early
    sched_clock(), such as SPARCv9, x86, and possibly more, benefit by getting
    a better and more consistent estimate of the boot time without the need for
    an arch specific implementation.

    Signed-off-by: Pavel Tatashin
    Signed-off-by: Thomas Gleixner
    Cc: steven.sistare@oracle.com
    Cc: daniel.m.jordan@oracle.com
    Cc: linux@armlinux.org.uk
    Cc: schwidefsky@de.ibm.com
    Cc: heiko.carstens@de.ibm.com
    Cc: john.stultz@linaro.org
    Cc: sboyd@codeaurora.org
    Cc: hpa@zytor.com
    Cc: douly.fnst@cn.fujitsu.com
    Cc: peterz@infradead.org
    Cc: prarit@redhat.com
    Cc: feng.tang@intel.com
    Cc: pmladek@suse.com
    Cc: gnomes@lxorguk.ukuu.org.uk
    Cc: linux-s390@vger.kernel.org
    Cc: boris.ostrovsky@oracle.com
    Cc: jgross@suse.com
    Cc: pbonzini@redhat.com
    Link: https://lkml.kernel.org/r/20180719205545.16512-17-pasha.tatashin@oracle.com

    Pavel Tatashin
     
  • If an architecture does not support exact boot time, it is challenging to
    estimate boot time without having a reference to the current persistent
    clock value. Yet, it cannot read the persistent clock time again, because
    this may lead to math discrepancies with the caller of read_boot_clock64(),
    which has read the persistent clock at a different time.

    This is why it is better to provide two values simultaneously: the
    persistent clock value, and the boot time.

    Replace read_boot_clock64() with:
    read_persistent_wall_and_boot_offset(wall_time, boot_offset)

    Where wall_time is returned by read_persistent_clock() and boot_offset is
    wall_time - boot time, which defaults to 0.
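
    The relationship between the two returned values can be sketched in a
    simplified userspace model that uses plain nanosecond counts instead of
    struct timespec64. All model_* names are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the hardware persistent clock, in nanoseconds. */
static int64_t model_persistent_clock_ns;

/*
 * Model of the default behaviour described above: both values come from a
 * single read, so the caller and the boot time estimate cannot disagree.
 */
static void model_read_persistent_wall_and_boot_offset(int64_t *wall_ns,
                                                       int64_t *boot_offset_ns)
{
    *wall_ns = model_persistent_clock_ns;
    *boot_offset_ns = 0;    /* default: no better estimate available */
}

/* boot_offset = wall_time - boot time, hence boot time = wall_time - boot_offset. */
static int64_t model_boot_time_ns(void)
{
    int64_t wall_ns, boot_offset_ns;

    model_read_persistent_wall_and_boot_offset(&wall_ns, &boot_offset_ns);
    return wall_ns - boot_offset_ns;
}
```

    With the default offset of 0, the estimated boot time equals the
    persistent clock reading itself.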

    Signed-off-by: Pavel Tatashin
    Signed-off-by: Thomas Gleixner
    Cc: steven.sistare@oracle.com
    Cc: daniel.m.jordan@oracle.com
    Cc: linux@armlinux.org.uk
    Cc: schwidefsky@de.ibm.com
    Cc: heiko.carstens@de.ibm.com
    Cc: john.stultz@linaro.org
    Cc: sboyd@codeaurora.org
    Cc: hpa@zytor.com
    Cc: douly.fnst@cn.fujitsu.com
    Cc: peterz@infradead.org
    Cc: prarit@redhat.com
    Cc: feng.tang@intel.com
    Cc: pmladek@suse.com
    Cc: gnomes@lxorguk.ukuu.org.uk
    Cc: linux-s390@vger.kernel.org
    Cc: boris.ostrovsky@oracle.com
    Cc: jgross@suse.com
    Cc: pbonzini@redhat.com
    Link: https://lkml.kernel.org/r/20180719205545.16512-16-pasha.tatashin@oracle.com

    Pavel Tatashin
     

13 Jul, 2018

1 commit


11 Jul, 2018

1 commit

  • When the NTP frequency is set directly from userspace using the
    ADJ_FREQUENCY or ADJ_TICK timex mode, immediately update the
    timekeeper's multiplier instead of waiting for the next tick.

    This removes a hidden non-deterministic delay in setting of the
    frequency and allows an extremely tight control of the system clock
    with update rates close to or even exceeding the kernel HZ.

    The update is limited to archs using modern timekeeping
    (!ARCH_USES_GETTIMEOFFSET).
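
    From userspace, such a request is expressed through struct timex. The
    sketch below only fills in the structure and assumes a Linux libc that
    provides <sys/timex.h>. make_freq_request() is a hypothetical helper,
    and the actual adjtimex() call is left out because it needs privileges.

```c
#include <assert.h>
#include <string.h>
#include <sys/timex.h>

/*
 * Build a frequency-adjustment request. The freq field is expressed in
 * ppm with a 16-bit fractional part, so 1 ppm corresponds to 65536.
 */
static struct timex make_freq_request(double ppm)
{
    struct timex tx;

    memset(&tx, 0, sizeof(tx));
    tx.modes = ADJ_FREQUENCY;           /* set the frequency directly */
    tx.freq = (long)(ppm * 65536.0);    /* scaled ppm */
    return tx;
}
```

    A privileged caller would pass the resulting structure to adjtimex().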

    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Miroslav Lichvar
    Cc: Richard Cochran
    Cc: Prarit Bhargava
    Cc: Stephen Boyd
    Signed-off-by: Miroslav Lichvar
    Signed-off-by: John Stultz

    Miroslav Lichvar
     

19 Jun, 2018

1 commit


19 May, 2018

1 commit

  • I have run into a couple of drivers using current_kernel_time()
    suffering from the y2038 problem, and they could be converted
    to using ktime_t, but don't have interfaces that skip the nanosecond
    calculation at the moment.

    This introduces ktime_get_coarse_with_offset() as a simpler
    variant of ktime_get_with_offset(), and adds wrappers for the
    three time domains we support with the existing function.
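
    The difference between the two variants can be sketched with a
    simplified model of the timekeeper state. struct tk_model, the
    MODEL_OFFS_* names and both helpers are hypothetical; the real
    timekeeper has more state, and the choice of real, boot and TAI as the
    three domains is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: three offset domains relative to the monotonic base. */
enum model_offs { MODEL_OFFS_REAL, MODEL_OFFS_BOOT, MODEL_OFFS_TAI, MODEL_OFFS_MAX };

struct tk_model {
    int64_t base_ns;                    /* monotonic time at the last tick */
    int64_t offs_ns[MODEL_OFFS_MAX];    /* per-domain offsets */
    uint64_t cycle_last;                /* counter value at the last tick */
    uint32_t mult, shift;               /* cycles -> ns scaling */
};

/* Fine-grained read: adds the nanoseconds elapsed since the last tick. */
static int64_t model_get_with_offset(const struct tk_model *tk,
                                     enum model_offs o, uint64_t cycles_now)
{
    uint64_t delta = cycles_now - tk->cycle_last;

    return tk->base_ns + tk->offs_ns[o] +
           (int64_t)((delta * tk->mult) >> tk->shift);
}

/* Coarse read: skips the counter read and the nanosecond calculation. */
static int64_t model_get_coarse_with_offset(const struct tk_model *tk,
                                            enum model_offs o)
{
    return tk->base_ns + tk->offs_ns[o];
}
```

    The coarse value lags the fine value by at most the time since the last
    tick, which is the trade-off for avoiding the hardware counter read.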

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Thomas Gleixner
    Cc: Stephen Boyd
    Cc: y2038@lists.linaro.org
    Cc: John Stultz
    Link: https://lkml.kernel.org/r/20180427134016.2525989-5-arnd@arndb.de

    Arnd Bergmann