06 Jan, 2021

1 commit

  • [ Upstream commit ba8ea8e7dd6e1662e34e730eadfc52aa6816f9dd ]

    can_stop_idle_tick() checks whether the do_timer() duty has been taken over
    by a CPU on boot. That's silly because the boot CPU always takes over with
    the initial clockevent device.

    But even if no CPU would have installed a clockevent and taken over the
    duty then the question whether the tick on the current CPU can be stopped
    or not is moot. In that case the current CPU would have no clockevent
    either, so there would be nothing to keep ticking.

    Remove it.

    Signed-off-by: Thomas Gleixner
    Acked-by: Frederic Weisbecker
    Link: https://lore.kernel.org/r/20201206212002.725238293@linutronix.de
    Signed-off-by: Sasha Levin

    Thomas Gleixner
     

26 Oct, 2020

4 commits

  • UBSAN reports:

    Undefined behaviour in ./include/linux/time64.h:127:27
    signed integer overflow:
    17179869187 * 1000000000 cannot be represented in type 'long long int'
    Call Trace:
    timespec64_to_ns include/linux/time64.h:127 [inline]
    set_cpu_itimer+0x65c/0x880 kernel/time/itimer.c:180
    do_setitimer+0x8e/0x740 kernel/time/itimer.c:245
    __x64_sys_setitimer+0x14c/0x2c0 kernel/time/itimer.c:336
    do_syscall_64+0xa1/0x540 arch/x86/entry/common.c:295

    Commit bd40a175769d ("y2038: itimer: change implementation to timespec64")
    replaced the original conversion which handled time clamping correctly with
    timespec64_to_ns() which has no overflow protection.

    Fix it in timespec64_to_ns() as this is not necessarily limited to the
    usage in itimers.

    [ tglx: Added comment and adjusted the fixes tag ]

    Fixes: 361a3bf00582 ("time64: Add time64.h header and define struct timespec64")
    Signed-off-by: Zeng Tao
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Arnd Bergmann
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/1598952616-6416-1-git-send-email-prime.zeng@hisilicon.com

    Zeng Tao
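
The overflow described above can be avoided by clamping before the multiplication. The following is a minimal user-space sketch of that idea, not the kernel's code: `struct ts64` and `ts64_to_ns_clamped()` are hypothetical names, and the limits are modelled on a signed 64-bit nanosecond range.

```c
#include <stdint.h>

#define NSEC_PER_SEC   1000000000LL
#define KTIME_MAX      INT64_MAX
#define KTIME_MIN      INT64_MIN
/* Largest/smallest second counts whose nanosecond value still fits in 64 bits. */
#define KTIME_SEC_MAX  (KTIME_MAX / NSEC_PER_SEC)
#define KTIME_SEC_MIN  (KTIME_MIN / NSEC_PER_SEC)

/* Hypothetical stand-in for the kernel's struct timespec64. */
struct ts64 {
    int64_t tv_sec;
    long    tv_nsec;
};

/* Convert to nanoseconds, saturating instead of overflowing. */
static int64_t ts64_to_ns_clamped(const struct ts64 *ts)
{
    if (ts->tv_sec >= KTIME_SEC_MAX)
        return KTIME_MAX;
    if (ts->tv_sec <= KTIME_SEC_MIN)
        return KTIME_MIN;
    return ts->tv_sec * NSEC_PER_SEC + ts->tv_nsec;
}
```

With the value from the UBSAN report (17179869187 seconds) this sketch returns the maximum representable value rather than performing the overflowing multiplication.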
     
  • There is no caller in tree, remove it.

    Signed-off-by: YueHaibing
    Signed-off-by: Thomas Gleixner
    Link: https://lore.kernel.org/r/20200909134749.32300-1-yuehaibing@huawei.com

    YueHaibing
     
  • There is no caller in tree, remove it.

    Signed-off-by: YueHaibing
    Signed-off-by: Thomas Gleixner
    Link: https://lore.kernel.org/r/20200909134850.21940-1-yuehaibing@huawei.com

    YueHaibing
     
  • Since sched_clock_read_begin() and sched_clock_read_retry() are called
    by notrace function sched_clock(), they shouldn't be traceable either,
    or else ftrace_graph_caller will run into a dead loop on the path
    as below (arm for instance):

    ftrace_graph_caller()
    prepare_ftrace_return()
    function_graph_enter()
    ftrace_push_return_trace()
    trace_clock_local()
    sched_clock()
    sched_clock_read_begin/retry()

    Fixes: 1b86abc1c645 ("sched_clock: Expose struct clock_read_data")
    Signed-off-by: Quanyang Wang
    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra (Intel)
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20200929082027.16787-1-quanyang.wang@windriver.com

    Quanyang Wang
     

25 Oct, 2020

2 commits

  • With the removal of the interrupt perturbations in previous random32
    change (random32: make prandom_u32() output unpredictable), the PRNG
    has become 100% deterministic again. While SipHash is expected to be
    way more robust against brute force than the previous Tausworthe LFSR,
    there's still the risk that whoever has even one temporary access to
    the PRNG's internal state is able to predict all subsequent draws till
    the next reseed (roughly every minute). This may happen through a side
    channel attack or any data leak.

    This patch restores the spirit of commit f227e3ec3b5c ("random32: update
    the net random state on interrupt and activity") in that it will perturb
    the internal PRNG's state using externally collected noise, except that
    it will not pick that noise from the random pool's bits nor upon
    interrupt, but will rather combine a few elements along the Tx path
    that are collectively hard to predict, such as dev, skb and txq
    pointers, packet length and jiffies values. These ones are combined
    using a single round of SipHash into a single long variable that is
    mixed with the net_rand_state upon each invocation.

    The operation was inlined because it produces very small and efficient
    code, typically 3 xor, 2 add and 2 rol. The performance was measured
    to be the same as (even very slightly better than) before the switch to
    SipHash; on a 6-core 12-thread Core i7-8700k equipped with a 40G NIC
    (i40e), the connection rate dropped from 556k/s to 555k/s while the
    SYN cookie rate grew from 5.38 Mpps to 5.45 Mpps.

    Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/
    Cc: George Spelvin
    Cc: Amit Klein
    Cc: Eric Dumazet
    Cc: "Jason A. Donenfeld"
    Cc: Andy Lutomirski
    Cc: Kees Cook
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: tytso@mit.edu
    Cc: Florian Westphal
    Cc: Marc Plumb
    Tested-by: Sedat Dilek
    Signed-off-by: Willy Tarreau

    Willy Tarreau
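
The mixing step described above (a few hard-to-predict values folded into one word with a single SipHash round) might look roughly like the sketch below. It uses the standard SipHash round on 64-bit words. `sip_state`, `noise` and `add_noise()` are hypothetical names for illustration, and the real code keeps its noise word per CPU rather than in one global.

```c
#include <stdint.h>

/* Rotate left; valid for 0 < r < 64. */
static inline uint64_t rol64(uint64_t x, unsigned int r)
{
    return (x << r) | (x >> (64 - r));
}

struct sip_state { uint64_t v0, v1, v2, v3; };

/* One standard SipHash round over four 64-bit words. */
static inline void sip_round(struct sip_state *s)
{
    s->v0 += s->v1; s->v1 = rol64(s->v1, 13); s->v1 ^= s->v0; s->v0 = rol64(s->v0, 32);
    s->v2 += s->v3; s->v3 = rol64(s->v3, 16); s->v3 ^= s->v2;
    s->v0 += s->v3; s->v3 = rol64(s->v3, 21); s->v3 ^= s->v0;
    s->v2 += s->v1; s->v1 = rol64(s->v1, 17); s->v1 ^= s->v2; s->v2 = rol64(s->v2, 32);
}

/* Hypothetical noise accumulator (assumed global here for simplicity). */
static uint64_t noise;

/* Fold four loosely related values (e.g. pointers, length, jiffies) into the noise word. */
static inline void add_noise(uint64_t a, uint64_t b, uint64_t c, uint64_t d)
{
    struct sip_state s = { a ^ noise, b, c, d };

    sip_round(&s);
    noise = s.v3;
}
```

Because the previous noise value is XORed into the first input, each call depends on every earlier call, so the accumulator keeps drifting even if the same values are fed in twice.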
     
  • Non-cryptographic PRNGs may have great statistical properties, but
    are usually trivially predictable to someone who knows the algorithm,
    given a small sample of their output. An LFSR like prandom_u32() is
    particularly simple, even if the sample is widely scattered bits.

    It turns out the network stack uses prandom_u32() for some things like
    random port numbers which it would prefer are *not* trivially predictable.
    Predictability led to a practical DNS spoofing attack. Oops.

    This patch replaces the LFSR with a homebrew cryptographic PRNG based
    on the SipHash round function, which is in turn seeded with 128 bits
    of strong random key. (The authors of SipHash have *not* been consulted
    about this abuse of their algorithm.) Speed is prioritized over security;
    attacks are rare, while performance is always wanted.

    Replacing all callers of prandom_u32() is the quick fix.
    Whether to reinstate a weaker PRNG for uses which can tolerate it
    is an open question.

    Commit f227e3ec3b5c ("random32: update the net random state on interrupt
    and activity") was an earlier attempt at a solution. This patch replaces
    it.

    Reported-by: Amit Klein
    Cc: Willy Tarreau
    Cc: Eric Dumazet
    Cc: "Jason A. Donenfeld"
    Cc: Andy Lutomirski
    Cc: Kees Cook
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: tytso@mit.edu
    Cc: Florian Westphal
    Cc: Marc Plumb
    Fixes: f227e3ec3b5c ("random32: update the net random state on interrupt and activity")
    Signed-off-by: George Spelvin
    Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/
    [ willy: partial reversal of f227e3ec3b5c; moved SIPROUND definitions
    to prandom.h for later use; merged George's prandom_seed() proposal;
    inlined siprand_u32(); replaced the net_rand_state[] array with 4
    members to fix a build issue; cosmetic cleanups to make checkpatch
    happy; fixed RANDOM32_SELFTEST build ]
    Signed-off-by: Willy Tarreau

    George Spelvin
     

19 Oct, 2020

1 commit

  • Pull RCU changes from Ingo Molnar:

    - Debugging for smp_call_function()

    - RT raw/non-raw lock ordering fixes

    - Strict grace periods for KASAN

    - New smp_call_function() torture test

    - Torture-test updates

    - Documentation updates

    - Miscellaneous fixes

    [ This doesn't actually pull the tag - I've dropped the last merge from
    the RCU branch due to questions about the series. - Linus ]

    * tag 'core-rcu-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (77 commits)
    smp: Make symbol 'csd_bug_count' static
    kernel/smp: Provide CSD lock timeout diagnostics
    smp: Add source and destination CPUs to __call_single_data
    rcu: Shrink each possible cpu krcp
    rcu/segcblist: Prevent useless GP start if no CBs to accelerate
    torture: Add gdb support
    rcutorture: Allow pointer leaks to test diagnostic code
    rcutorture: Hoist OOM registry up one level
    refperf: Avoid null pointer dereference when buf fails to allocate
    rcutorture: Properly synchronize with OOM notifier
    rcutorture: Properly set rcu_fwds for OOM handling
    torture: Add kvm.sh --help and update help message
    rcutorture: Add CONFIG_PROVE_RCU_LIST to TREE05
    torture: Update initrd documentation
    rcutorture: Replace HTTP links with HTTPS ones
    locktorture: Make function torture_percpu_rwsem_init() static
    torture: document --allcpus argument added to the kvm.sh script
    rcutorture: Output number of elapsed grace periods
    rcutorture: Remove KCSAN stubs
    rcu: Remove unused "cpu" parameter from rcu_report_qs_rdp()
    ...

    Linus Torvalds
     

13 Oct, 2020

2 commits

  • Pull locking updates from Ingo Molnar:
    "These are the locking updates for v5.10:

    - Add deadlock detection for recursive read-locks.

    The rationale is outlined in commit 224ec489d3cd
    ("lockdep/Documention: Recursive read lock detection reasoning")

    The main deadlock pattern we want to detect is:

    TASK A:                 TASK B:

    read_lock(X);
                            write_lock(X);
    read_lock_2(X);

    - Add "latch sequence counters" (seqcount_latch_t):

    A sequence counter variant where the counter even/odd value is used
    to switch between two copies of protected data. This allows the
    read path, typically NMIs, to safely interrupt the write side
    critical section.

    We utilize this new variant for sched-clock, and to make x86 TSC
    handling safer.

    - Other seqlock cleanups, fixes and enhancements

    - KCSAN updates

    - LKMM updates

    - Misc updates, cleanups and fixes"

    * tag 'locking-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (67 commits)
    lockdep: Revert "lockdep: Use raw_cpu_*() for per-cpu variables"
    lockdep: Fix lockdep recursion
    lockdep: Fix usage_traceoverflow
    locking/atomics: Check atomic-arch-fallback.h too
    locking/seqlock: Tweak DEFINE_SEQLOCK() kernel doc
    lockdep: Optimize the memory usage of circular queue
    seqlock: Unbreak lockdep
    seqlock: PREEMPT_RT: Do not starve seqlock_t writers
    seqlock: seqcount_LOCKNAME_t: Introduce PREEMPT_RT support
    seqlock: seqcount_t: Implement all read APIs as statement expressions
    seqlock: Use unique prefix for seqcount_t property accessors
    seqlock: seqcount_LOCKNAME_t: Standardize naming convention
    seqlock: seqcount latch APIs: Only allow seqcount_latch_t
    rbtree_latch: Use seqcount_latch_t
    x86/tsc: Use seqcount_latch_t
    timekeeping: Use seqcount_latch_t
    time/sched_clock: Use seqcount_latch_t
    seqlock: Introduce seqcount_latch_t
    mm/swap: Do not abuse the seqcount_t latching API
    time/sched_clock: Use raw_read_seqcount_latch() during suspend
    ...

    Linus Torvalds
     
  • Pull timekeeping updates from Thomas Gleixner:
    "Updates for timekeeping, timers and related drivers:

    Core:

    - Early boot support for the NMI safe timekeeper by utilizing
    local_clock() up to the point where timekeeping is initialized.
    This allows printk() to store multiple timestamps in the ringbuffer
    which is useful for coordinating dmesg information across a fleet
    of machines.

    - Provide a multi-timestamp accessor for printk()

    - Make timer init more robust by checking for invalid timer flags.

    - Comma vs semicolon fixes

    Drivers:

    - Support for new platforms in existing drivers (SP804 and Renesas
    CMT)

    - Comma vs semicolon fixes

    * tag 'timers-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    clocksource/drivers/armada-370-xp: Use semicolons rather than commas to separate statements
    clocksource/drivers/mps2-timer: Use semicolons rather than commas to separate statements
    timers: Mask invalid flags in do_init_timer()
    clocksource/drivers/sp804: Enable Hisilicon sp804 timer 64bit mode
    clocksource/drivers/sp804: Add support for Hisilicon sp804 timer
    clocksource/drivers/sp804: Support non-standard register offset
    clocksource/drivers/sp804: Prepare for support non-standard register offset
    clocksource/drivers/sp804: Remove a mismatched comment
    clocksource/drivers/sp804: Delete the leading "__" of some functions
    clocksource/drivers/sp804: Remove unused sp804_timer_disable() and timer-sp804.h
    clocksource/drivers/sp804: Cleanup clk_get_sys()
    dt-bindings: timer: renesas,cmt: Document r8a774e1 CMT support
    dt-bindings: timer: renesas,cmt: Document r8a7742 CMT support
    alarmtimer: Convert comma to semicolon
    timekeeping: Provide multi-timestamp accessor to NMI safe timekeeper
    timekeeping: Utilize local_clock() for NMI safe timekeeper during early boot

    Linus Torvalds
     

09 Oct, 2020

2 commits


25 Sep, 2020

2 commits

  • do_init_timer() accepts any combination of timer flags handed in by the
    caller without a sanity check, but only TIMER_DEFERRABLE, TIMER_PINNED and
    TIMER_IRQSAFE are valid.

    If the supplied flags have other bits set, this could result in
    malfunction. If bits are set in TIMER_CPUMASK the first timer usage could
    dereference a cpu base which is outside the range of possible CPUs. If
    TIMER_MIGRATING is set, then switch_timer_base() will live lock.

    Prevent that with a sanity check which warns when invalid flags are
    supplied and masks them out.

    [ tglx: Made it WARN_ON_ONCE() and added context to the changelog ]

    Signed-off-by: Qianli Zhao
    Signed-off-by: Thomas Gleixner
    Link: https://lore.kernel.org/r/9d79a8aa4eb56713af7379f99f062dedabcde140.1597326756.git.zhaoqianli@xiaomi.com

    Qianli Zhao
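
The sanity check described above amounts to a warn-once-and-mask step. A minimal user-space sketch follows. The flag values are modelled on the kernel's timer flag layout but should be treated as assumptions, `sanitize_timer_flags()` is a hypothetical helper, and the `fprintf()` stands in for WARN_ON_ONCE().

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative flag layout; the exact values are assumptions. */
#define TIMER_CPUMASK     0x0003FFFFu
#define TIMER_MIGRATING   0x00040000u
#define TIMER_DEFERRABLE  0x00080000u
#define TIMER_PINNED      0x00100000u
#define TIMER_IRQSAFE     0x00200000u
/* The only flags a caller may legitimately pass at init time. */
#define TIMER_INIT_FLAGS  (TIMER_DEFERRABLE | TIMER_PINNED | TIMER_IRQSAFE)

/* Warn once about invalid bits, then return only the valid ones. */
static uint32_t sanitize_timer_flags(uint32_t flags)
{
    static int warned;
    uint32_t invalid = flags & ~TIMER_INIT_FLAGS;

    if (invalid && !warned) {
        warned = 1;
        fprintf(stderr, "invalid timer flags 0x%x masked out\n",
                (unsigned int)invalid);
    }
    return flags & TIMER_INIT_FLAGS;
}
```

Masking rather than rejecting keeps a buggy caller's timer usable while the warning points at the call site that needs fixing.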
     
  • This should make it harder for the kernel to corrupt the debug object
    descriptor, used to call functions to fixup state and track debug objects,
    by moving the structure to read-only memory.

    Signed-off-by: Stephen Boyd
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Kees Cook
    Link: https://lore.kernel.org/r/20200815004027.2046113-3-swboyd@chromium.org

    Stephen Boyd
     

10 Sep, 2020

3 commits

  • Latch sequence counters are a multiversion concurrency control mechanism
    where the seqcount_t counter even/odd value is used to switch between
    two data storage copies. This allows the seqcount_t read path to safely
    interrupt its write side critical section (e.g. from NMIs).

    Initially, latch sequence counters were implemented as a single write
    function, raw_write_seqcount_latch(), above plain seqcount_t. The read
    path was expected to use plain seqcount_t raw_read_seqcount().

    A specialized read function was later added, raw_read_seqcount_latch(),
    and became the standardized way for latch read paths. Having unique read
    and write APIs meant that latch sequence counters are basically a data
    type of their own -- just inappropriately overloading plain seqcount_t.
    The seqcount_latch_t data type was thus introduced at seqlock.h.

    Use that new data type instead of seqcount_raw_spinlock_t. This ensures
    that only latch-safe APIs are to be used with the sequence counter.

    Note that the use of seqcount_raw_spinlock_t was not very useful in the
    first place. Only the "raw_" subset of seqcount_t APIs were used at
    timekeeping.c. This subset was created for contexts where lockdep cannot
    be used. seqcount_LOCKTYPE_t's raison d'être -- verifying that the
    seqcount_t writer serialization lock is held -- cannot thus be done.

    References: 0c3351d451ae ("seqlock: Use raw_ prefix instead of _no_lockdep")
    References: 55f3560df975 ("seqlock: Extend seqcount API with associated locks")
    Signed-off-by: Ahmed S. Darwish
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200827114044.11173-6-a.darwish@linutronix.de

    Ahmed S. Darwish
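
The latch mechanism described above can be sketched in user space with C11 atomics. This is an illustration only, not the seqlock.h implementation: `struct latch_u64` and its helpers are hypothetical names, and sequentially consistent atomics are used for simplicity where the kernel uses finer-grained barriers.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Two copies of the data; the low bit of seq selects the one readers use. */
struct latch_u64 {
    atomic_uint      seq;
    _Atomic uint64_t copy[2];
};

static void latch_init(struct latch_u64 *l, uint64_t val)
{
    atomic_init(&l->seq, 0);
    atomic_init(&l->copy[0], val);
    atomic_init(&l->copy[1], val);
}

/* Writer: callers must serialize writers externally. */
static void latch_write(struct latch_u64 *l, uint64_t val)
{
    atomic_fetch_add(&l->seq, 1);      /* odd: readers switch to copy[1] */
    atomic_store(&l->copy[0], val);    /* safe to modify copy[0] now     */
    atomic_fetch_add(&l->seq, 1);      /* even: readers switch to copy[0] */
    atomic_store(&l->copy[1], val);    /* bring copy[1] up to date       */
}

/* Reader: never blocks on a writer, so it may interrupt one (e.g. NMI-like). */
static uint64_t latch_read(struct latch_u64 *l)
{
    unsigned int seq;
    uint64_t val;

    do {
        seq = atomic_load(&l->seq);
        val = atomic_load(&l->copy[seq & 1]);
    } while (atomic_load(&l->seq) != seq);

    return val;
}
```

A reader that interrupts the writer at any point still finds one copy that is not being modified, which is why the write side never needs to exclude readers.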
     
  • Latch sequence counters have unique read and write APIs, and thus
    seqcount_latch_t was recently introduced at seqlock.h.

    Use that new data type instead of plain seqcount_t. This adds the
    necessary type-safety and ensures only latching-safe seqcount APIs are
    to be used.

    Signed-off-by: Ahmed S. Darwish
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200827114044.11173-5-a.darwish@linutronix.de

    Ahmed S. Darwish
     
  • sched_clock uses seqcount_t latching to switch between two storage
    places protected by the sequence counter. This allows it to have
    interruptible, NMI-safe, seqcount_t write side critical sections.

    Since 7fc26327b756 ("seqlock: Introduce raw_read_seqcount_latch()"),
    raw_read_seqcount_latch() became the standardized way for seqcount_t
    latch read paths. Due to the dependent load, it has one read memory
    barrier less than the currently used raw_read_seqcount() API.

    Use raw_read_seqcount_latch() for the suspend path.

    Commit aadd6e5caaac ("time/sched_clock: Use raw_read_seqcount_latch()")
    missed changing that instance of raw_read_seqcount().

    References: 1809bfa44e10 ("timers, sched/clock: Avoid deadlock during read from NMI")
    Signed-off-by: Ahmed S. Darwish
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200715092345.GA231464@debian-buster-darwi.lab.linutronix.de

    Ahmed S. Darwish
     

25 Aug, 2020

2 commits

  • Replace a comma between expression statements by a semicolon.

    Signed-off-by: Xu Wang
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Stephen Boyd
    Link: https://lore.kernel.org/r/20200818062651.21680-1-vulab@iscas.ac.cn

    Xu Wang
     
  • Currently, can_stop_idle_tick() prints "NOHZ: local_softirq_pending HH"
    (where "HH" is the hexadecimal softirq vector number) when one or more
    non-RCU softirq handlers are still enabled when checking to stop the
    scheduler-tick interrupt. This message is not as enlightening as one
    might hope, so this commit changes it to "NOHZ tick-stop error: Non-RCU
    local softirq work is pending, handler #HH".

    Reported-by: Andy Lutomirski
    Cc: Frederic Weisbecker
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

24 Aug, 2020

1 commit

  • Replace the existing /* fall through */ comments and their variants with
    the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
    fall-through markings where applicable.

    [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

    Signed-off-by: Gustavo A. R. Silva

    Gustavo A. R. Silva
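
For readers unfamiliar with the pseudo-keyword, the sketch below shows how such a macro can be defined and used outside the kernel. The definition is an approximation that assumes a compiler supporting `__has_attribute`, and `units_for_level()` is a made-up example function.

```c
/* Map "fallthrough" to the compiler attribute when available. */
#if defined(__has_attribute)
# if __has_attribute(__fallthrough__)
#  define fallthrough __attribute__((__fallthrough__))
# endif
#endif
#ifndef fallthrough
# define fallthrough do {} while (0)   /* fallback: plain empty statement */
#endif

/* Each level includes everything granted by the levels below it. */
static int units_for_level(int level)
{
    int n = 0;

    switch (level) {
    case 2:
        n += 4;
        fallthrough;
    case 1:
        n += 2;
        fallthrough;
    case 0:
        n += 1;
        break;
    default:
        n = -1;
    }
    return n;
}
```

Unlike a comment, the statement form is visible to the compiler, so an accidental fall-through elsewhere can be flagged by warnings such as -Wimplicit-fallthrough.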
     

23 Aug, 2020

2 commits

  • printk wants to store various timestamps (MONOTONIC, REALTIME, BOOTTIME) to
    make correlation of dmesg from several systems easier.

    Provide an interface to retrieve all three timestamps in one go.

    There are some caveats:

    1) Boot time and late sleep time injection

    Boot time is a racy access on 32bit systems if the sleep time injection
    happens late during resume and not in timekeeping_resume(). That could be
    avoided by expanding struct tk_read_base with boot offset for 32bit and
    adding more overhead to the update. As this is a hard to observe once per
    resume event which can be filtered with reasonable effort using the
    accurate mono/real timestamps, it's probably not worth the trouble.

    Aside of that it might be possible on 32 and 64 bit to observe the
    following when the sleep time injection happens late:

    CPU 0                           CPU 1
    timekeeping_resume()
                                    ktime_get_fast_timestamps()
                                      mono, real = __ktime_get_real_fast()
    inject_sleep_time()
      update boot offset
                                      boot = mono + bootoffset;

    That means that boot time already has the sleep time adjustment, but
    real time does not. On the next readout both are in sync again.

    Preventing this for 64bit is not really feasible without destroying the
    careful cache layout of the timekeeper because the sequence count and
    struct tk_read_base would then need two cache lines instead of one.

    2) Suspend/resume timestamps

    Access to the time keeper clock source is disabled across the innermost
    steps of suspend/resume. The accessors still work, but the timestamps
    are frozen until time keeping is resumed which happens very early.

    For regular suspend/resume there is no observable difference vs. sched
    clock, but it might affect some of the nasty low level debug printks.

    OTOH, access to sched clock is not guaranteed across suspend/resume on
    all systems either so it depends on the hardware in use.

    If that turns out to be a real problem then this could be mitigated by
    using sched clock in a similar way as during early boot. But it's not as
    trivial as on early boot because it needs some careful protection
    against the clock monotonic timestamp jumping backwards on resume.

    Signed-off-by: Thomas Gleixner
    Tested-by: Petr Mladek
    Link: https://lore.kernel.org/r/20200814115512.159981360@linutronix.de

    Thomas Gleixner
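
The idea of retrieving all three timestamps in one go can be pictured as deriving them from a single counter readout plus stored offsets. The sketch below is a simplified model under that assumption. `struct tk_snapshot`, its fields and `get_fast_timestamps()` are hypothetical, and it does not model the sequence counting or the late sleep time injection race described in caveat 1.

```c
#include <stdint.h>

/* The three timestamps handed back to the caller, in nanoseconds. */
struct fast_timestamps {
    uint64_t mono;
    uint64_t real;
    uint64_t boot;
};

/* Hypothetical snapshot of timekeeper state; field names are illustrative. */
struct tk_snapshot {
    uint64_t base_mono;   /* monotonic time at the last update       */
    uint64_t offs_real;   /* offset from monotonic to realtime       */
    uint64_t offs_boot;   /* offset from monotonic to boottime       */
};

/* Derive all three clocks from one elapsed-time readout. */
static void get_fast_timestamps(const struct tk_snapshot *tk,
                                uint64_t delta_ns,
                                struct fast_timestamps *ts)
{
    ts->mono = tk->base_mono + delta_ns;
    ts->real = ts->mono + tk->offs_real;
    ts->boot = ts->mono + tk->offs_boot;
}
```

Because every value comes from the same readout, the differences between the three timestamps equal the stored offsets exactly, which is what makes correlating logs from several machines easier.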
     
  • During early boot the NMI safe timekeeper returns 0 until the first
    clocksource becomes available.

    This prevents it from being used for printk or other facilities which today
    use sched clock. sched clock can be available way before timekeeping is
    initialized.

    The obvious workaround for this is to utilize the early sched clock in the
    default dummy clock read function until a clocksource becomes available.

    After switching to the clocksource clock MONOTONIC and BOOTTIME will not
    jump because the timekeeping_init() bases clock MONOTONIC on sched clock
    and the offset between clock MONOTONIC and BOOTTIME is zero during boot.

    Clock REALTIME cannot provide useful timestamps during early boot up to
    the point where a persistent clock becomes available, which is either in
    timekeeping_init() or later when the RTC driver which might depend on I2C
    or other subsystems is initialized.

    There is a minor difference to sched_clock() vs. suspend/resume. As the
    timekeeper clock source might not be accessible during suspend, after
    timekeeping_suspend() timestamps freeze up to the point where
    timekeeping_resume() is invoked. OTOH this is true for some sched clock
    implementations as well.

    Signed-off-by: Thomas Gleixner
    Tested-by: Petr Mladek
    Link: https://lore.kernel.org/r/20200814115512.041422402@linutronix.de

    Thomas Gleixner
     

15 Aug, 2020

2 commits

  • Pull timekeeping updates from Thomas Gleixner:
    "A set of timekeeping/VDSO updates:

    - Preparatory work to allow S390 to switch over to the generic VDSO
    implementation.

    S390 requires that the VDSO data pointer is handed in to the
    counter read function when time namespace support is enabled.
    Adding the pointer is a NOOP for all other architectures because
    the compiler is supposed to optimize that out when it is unused in
    the architecture specific inline. The change also solved a similar
    problem for MIPS which fortunately has time namespaces not yet
    enabled.

    S390 needs to update clock related VDSO data independent of the
    timekeeping updates. This was solved so far with yet another
    sequence counter in the S390 implementation. A better solution is
    to utilize the already existing VDSO sequence count for this. The
    core code now exposes helper functions which allow to serialize
    against the timekeeper code and against concurrent readers.

    S390 needs extra data for their clock readout function. The initial
    common VDSO data structure did not provide a way to add that. It
    now has an embedded architecture specific struct which
    defaults to an empty struct.

    Doing this now avoids tree dependencies and conflicts post rc1 and
    allows all other architectures which work on generic VDSO support
    to work from a common upstream base.

    - A trivial comment fix"

    * tag 'timers-urgent-2020-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    time: Delete repeated words in comments
    lib/vdso: Allow to add architecture-specific vdso data
    timekeeping/vsyscall: Provide vdso_update_begin/end()
    vdso/treewide: Add vdso_data pointer argument to __arch_get_hw_counter()

    Linus Torvalds
     
  • Pull more timer updates from Thomas Gleixner:
    "A set of posix CPU timer changes which allows to defer the heavy work
    of posix CPU timers into task work context. The tick interrupt is
    reduced to a quick check which queues the work which is doing the
    heavy lifting before returning to user space or going back to guest
    mode. Moving this out is deferring the signal delivery slightly but
    posix CPU timers are inaccurate by nature as they depend on the tick
    so there is no real damage. The relevant test cases all passed.

    This lifts the last offender for RT out of the hard interrupt context
    tick handler, but it also has the general benefit that the actual
    heavy work is accounted to the task/process and not to the tick
    interrupt itself.

    Further optimizations are possible to break long sighand lock hold and
    interrupt disabled (on !RT kernels) times when a massive amount of
    posix CPU timers (which are unprivileged) is armed for a
    task/process.

    This is currently only enabled for x86 because the architecture has to
    ensure that task work is handled in KVM before entering a guest, which
    was just established for x86 with the new common entry/exit code which
    got merged post 5.8 and is not the case for other KVM architectures"

    * tag 'timers-core-2020-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86: Select POSIX_CPU_TIMERS_TASK_WORK
    posix-cpu-timers: Provide mechanisms to defer timer handling to task_work
    posix-cpu-timers: Split run_posix_cpu_timers()

    Linus Torvalds
     

11 Aug, 2020

2 commits

  • Pull locking updates from Thomas Gleixner:
    "A set of locking fixes and updates:

    - Untangle the header spaghetti which causes build failures in
    various situations caused by the lockdep additions to seqcount to
    validate that the write side critical sections are non-preemptible.

    - The seqcount associated lock debug addons which were blocked by the
    above fallout.

    seqcount writers contrary to seqlock writers must be externally
    serialized, which usually happens via locking - except for strict
    per CPU seqcounts. As the lock is not part of the seqcount, lockdep
    cannot validate that the lock is held.

    This new debug mechanism adds the concept of associated locks.
    sequence count has now lock type variants and corresponding
    initializers which take a pointer to the associated lock used for
    writer serialization. If lockdep is enabled the pointer is stored
    and write_seqcount_begin() has a lockdep assertion to validate that
    the lock is held.

    Aside of the type and the initializer no other code changes are
    required at the seqcount usage sites. The rest of the seqcount API
    is unchanged and determines the type at compile time with the help
    of _Generic which is possible now that the minimal GCC version has
    been moved up.

    Adding this lockdep coverage unearthed a handful of seqcount bugs
    which have been addressed already independent of this.

    While generally useful this comes with a Trojan Horse twist: On RT
    kernels the write side critical section can become preemptible if
    the writers are serialized by an associated lock, which leads to
    the well known reader preempts writer livelock. RT prevents this by
    storing the associated lock pointer independent of lockdep in the
    seqcount and changing the reader side to block on the lock when a
    reader detects that a writer is in the write side critical section.

    - Conversion of seqcount usage sites to associated types and
    initializers"

    * tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
    locking/seqlock, headers: Untangle the spaghetti monster
    locking, arch/ia64: Reduce header dependencies by moving XTP bits into the new header
    x86/headers: Remove APIC headers from
    seqcount: More consistent seqprop names
    seqcount: Compress SEQCNT_LOCKNAME_ZERO()
    seqlock: Fold seqcount_LOCKNAME_init() definition
    seqlock: Fold seqcount_LOCKNAME_t definition
    seqlock: s/__SEQ_LOCKDEP/__SEQ_LOCK/g
    hrtimer: Use sequence counter with associated raw spinlock
    kvm/eventfd: Use sequence counter with associated spinlock
    userfaultfd: Use sequence counter with associated spinlock
    NFSv4: Use sequence counter with associated spinlock
    iocost: Use sequence counter with associated spinlock
    raid5: Use sequence counter with associated spinlock
    vfs: Use sequence counter with associated spinlock
    timekeeping: Use sequence counter with associated raw spinlock
    xfrm: policy: Use sequence counters with associated lock
    netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
    netfilter: conntrack: Use sequence counter with associated spinlock
    sched: tasks: Use sequence counter with associated spinlock
    ...

    Linus Torvalds
     
  • Drop repeated words in kernel/time/. {when, one, into}

    Signed-off-by: Randy Dunlap
    Signed-off-by: Thomas Gleixner
    Acked-by: John Stultz
    Link: https://lore.kernel.org/r/20200807033248.8452-1-rdunlap@infradead.org

    Randy Dunlap
     

06 Aug, 2020

3 commits

  • Running posix CPU timers in hard interrupt context has a few downsides:

    - For PREEMPT_RT it cannot work as the expiry code needs to take
    sighand lock, which is a 'sleeping spinlock' in RT. The original RT
    approach of offloading the posix CPU timer handling into a high
    priority thread was clumsy and provided no real benefit in general.

    - For fine grained accounting it's just wrong to run this in context of
    the timer interrupt because that way a process specific CPU time is
    accounted to the timer interrupt.

    - Long running timer interrupts caused by a large amount of expiring
    timers which can be created and armed by unprivileged user space.

    There is no hard requirement to expire them in interrupt context.

    If the signal is targeted at the task itself then it won't be delivered
    before the task returns to user space anyway. If the signal is targeted at
    a supervisor process then it might be slightly delayed, but posix CPU
    timers are inaccurate anyway due to the fact that they are tied to the
    tick.

    Provide infrastructure to schedule task work which allows splitting the
    posix CPU timer code into a quick check in interrupt context and a thread
    context expiry and signal delivery function. This has to be enabled by
    architectures as it requires that the architecture specific KVM
    implementation handles pending task work before exiting to guest mode.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar
    Reviewed-by: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lore.kernel.org/r/20200730102337.783470146@linutronix.de

    Thomas Gleixner
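
The split described above (a quick check in interrupt context, with expiry and signal delivery deferred to task context) can be modelled as follows. This is a toy sketch rather than kernel code: `struct task`, `tick_check()` and `run_task_work()` are hypothetical names, and a counter stands in for signal delivery.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy task with a single one-shot CPU timer. */
struct task {
    uint64_t     cputime_ns;      /* CPU time consumed so far      */
    uint64_t     next_expiry_ns;  /* UINT64_MAX means "not armed"  */
    bool         work_pending;    /* deferred expiry work queued   */
    unsigned int signals_sent;    /* stand-in for signal delivery  */
};

/* Tick handler part: a cheap comparison that only queues the work. */
static void tick_check(struct task *t)
{
    if (!t->work_pending && t->cputime_ns >= t->next_expiry_ns)
        t->work_pending = true;
}

/* Task context part: runs before returning to user space or a guest. */
static void run_task_work(struct task *t)
{
    if (!t->work_pending)
        return;
    t->work_pending = false;

    if (t->cputime_ns >= t->next_expiry_ns) {
        t->signals_sent++;                /* the heavy lifting happens here */
        t->next_expiry_ns = UINT64_MAX;   /* one-shot: disarm after expiry  */
    }
}
```

In this model the expensive part is charged to the task that owns the timer, and the tick handler's cost no longer grows with the number of armed timers.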
     
  • Split it up as a preparatory step to move the heavy lifting out of
    interrupt context.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar
    Reviewed-by: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lore.kernel.org/r/20200730102337.677439437@linutronix.de

    Thomas Gleixner
     
  • Architectures can have the requirement to add additional architecture
    specific data to the VDSO data page which needs to be updated independent
    of the timekeeper updates.

    To protect these updates vs. concurrent readers and a conflicting update
    through timekeeping, provide helper functions to make such updates safe.

    vdso_update_begin() takes the timekeeper_lock to protect against a
    potential update from timekeeper code and increments the VDSO sequence
    count to signal data inconsistency to concurrent readers. vdso_update_end()
    makes the sequence count even again to signal data consistency and drops
    the timekeeper lock.

    [ Sven: Add interrupt disable handling to the functions ]

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Sven Schnelle
    Signed-off-by: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20200804150124.41692-3-svens@linux.ibm.com

    Thomas Gleixner
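
    The begin/end protocol described above (take a lock, make the sequence
    count odd, update, make it even again, drop the lock) can be sketched in
    user space with C11 atomics. All names below are hypothetical stand-ins;
    the spin flag only models the timekeeper lock, and the interrupt
    disabling mentioned in the commit has no user-space equivalent here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct vdso_like_data {
    atomic_uint      seq;        /* odd while an update is in progress */
    _Atomic uint64_t arch_value; /* stands in for arch specific data */
};

/* Models the lock that serializes writers; not the real timekeeper_lock. */
static atomic_flag update_lock = ATOMIC_FLAG_INIT;

static void update_begin(struct vdso_like_data *d)
{
    while (atomic_flag_test_and_set_explicit(&update_lock, memory_order_acquire))
        ; /* spin until the writer lock is free */
    atomic_fetch_add_explicit(&d->seq, 1, memory_order_relaxed); /* now odd */
    atomic_thread_fence(memory_order_release);
}

static void update_end(struct vdso_like_data *d)
{
    assert(atomic_load_explicit(&d->seq, memory_order_relaxed) & 1);
    atomic_fetch_add_explicit(&d->seq, 1, memory_order_release); /* even again */
    atomic_flag_clear_explicit(&update_lock, memory_order_release);
}

/* Reader: returns false if an update was in progress or raced with us. */
static bool read_try(struct vdso_like_data *d, uint64_t *out)
{
    unsigned start = atomic_load_explicit(&d->seq, memory_order_acquire);
    if (start & 1)
        return false;
    uint64_t v = atomic_load_explicit(&d->arch_value, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);
    if (atomic_load_explicit(&d->seq, memory_order_relaxed) != start)
        return false;
    *out = v;
    return true;
}
```

    A reader would normally call read_try() in a loop until it succeeds.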
     

05 Aug, 2020

2 commits

  • Pull timer updates from Thomas Gleixner:
    "Time, timers and related driver updates:

    - Prevent unnecessary timer softirq invocations by extending the
    tracking of the next expiring timer in the timer wheel beyond the
    existing NOHZ functionality.

    The tracking overhead at enqueue time is within the noise, but on
    sensitive workloads the avoidance of the soft interrupt invocation
    is a measurable improvement.

    - The obligatory new clocksource driver for Ingenic X1000 OST

    - The usual fixes, improvements, cleanups and extensions for newer
    chip variants all over the driver space"

    * tag 'timers-core-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits)
    timers: Recalculate next timer interrupt only when necessary
    clocksource/drivers/ingenic: Add support for the Ingenic X1000 OST.
    dt-bindings: timer: Add Ingenic X1000 OST bindings.
    clocksource/drivers: Replace HTTP links with HTTPS ones
    clocksource/drivers/nomadik-mtu: Handle 32kHz clock
    clocksource/drivers/sh_cmt: Use "kHz" for kilohertz
    clocksource/drivers/imx: Add support for i.MX TPM driver with ARM64
    clocksource/drivers/ingenic: Add high resolution timer support for SMP/SMT.
    timers: Lower base clock forwarding threshold
    timers: Remove must_forward_clk
    timers: Spare timer softirq until next expiry
    timers: Expand clk forward logic beyond nohz
    timers: Reuse next expiry cache after nohz exit
    timers: Always keep track of next expiry
    timers: Optimize _next_timer_interrupt() level iteration
    timers: Add comments about calc_index() ceiling work
    timers: Move trigger_dyntick_cpu() to enqueue_timer()
    timers: Use only bucket expiry for base->next_expiry value
    timers: Preserve higher bits of expiration on index calculation
    clocksource/drivers/timer-atmel-tcb: Add sama5d2 support
    ...

    Linus Torvalds
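
    The first item in the pull request above, avoiding timer softirq
    invocations by tracking the next expiring timer, can be illustrated as
    follows. This is a simplified sketch with hypothetical names; the real
    timer wheel tracks expiry per bucket and per base, which is omitted here.

```c
#include <assert.h>
#include <stdbool.h>

/* Wrap-safe "a is before b" on a free-running tick counter. */
static bool tick_before(unsigned long a, unsigned long b)
{
    return (long)(a - b) < 0;
}

struct timer_base_sketch {
    unsigned long next_expiry;     /* earliest tick at which a timer may fire */
    unsigned      softirqs_raised; /* counts the expensive invocations */
};

/* Enqueue keeps the cached next expiry up to date; this is the small
 * tracking overhead the pull request refers to. */
static void wheel_enqueue(struct timer_base_sketch *base,
                          unsigned long now, unsigned long expires)
{
    assert(!tick_before(expires, now)); /* sketch handles future timers only */
    if (tick_before(expires, base->next_expiry))
        base->next_expiry = expires;
}

/* Called on every tick: raise the softirq only when something is due.
 * A real implementation would then expire timers and recompute next_expiry. */
static void tick_check(struct timer_base_sketch *base, unsigned long now)
{
    if (tick_before(now, base->next_expiry))
        return; /* nothing can expire yet, skip the softirq */
    base->softirqs_raised++;
}
```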
     
  • Pull thread updates from Christian Brauner:
    "This contains the changes to add the missing support for attaching to
    time namespaces via pidfds.

    Last cycle setns() was changed to support attaching to multiple
    namespaces atomically. This requires all namespaces to have a point of
    no return where they can't fail anymore.

    Specifically, <namespace-type>_install() is allowed to perform
    permission checks and install the namespace into the new struct nsset
    that it has been given but it is not allowed to make visible changes
    to the affected task. Once <namespace-type>_install() returns,
    anything that the given namespace type additionally requires to be
    setup needs to ideally be done in a function that can't fail or if it
    fails the failure must be non-fatal.

    For time namespaces the relevant functions that fell into this
    category were timens_set_vvar_page() and vdso_join_timens(). The
    latter could still fail although it didn't need to. This function is
    only implemented for vdso_join_timens() in current mainline. As
    discussed on-list (cf. [1]), in order to make setns() support time
    namespaces when attaching to multiple namespaces at once properly we
    changed vdso_join_timens() to always succeed. So vdso_join_timens()
    replaces the mmap_write_lock_killable() with mmap_read_lock().

    Please note that arm is about to grow vdso support for time namespaces
    (possibly this merge window). We've synced on this change and arm64
    also uses mmap_read_lock(), i.e. makes vdso_join_timens() a function
    that can't fail. Once the changes here and the arm64 changes have
    landed, vdso_join_timens() should be turned into a void function so
    it's obvious to callers and implementers on other architectures that
    the expectation is that it can't fail.

    We didn't do this right away because it would've introduced
    unnecessary merge conflicts between the two trees for no major gain.

    As always, tests included"

    [1]: https://lore.kernel.org/lkml/20200611110221.pgd3r5qkjrjmfqa2@wittgenstein

    * tag 'threads-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    tests: add CLONE_NEWTIME setns tests
    nsproxy: support CLONE_NEWTIME with setns()
    timens: add timens_commit() helper
    timens: make vdso_join_timens() always succeed

    Linus Torvalds
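
    The "point of no return" rule described in the pull request above can be
    sketched as a two-phase update: a fallible install step that only touches
    a staging set, followed by a commit step that cannot fail. Every type and
    function name below is hypothetical and only models the idea; none of it
    is the kernel's nsproxy code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { NS_MNT, NS_TIME, NS_COUNT };

struct task_sketch  { int ns[NS_COUNT]; };
struct nsset_sketch { int ns[NS_COUNT]; bool set[NS_COUNT]; };
struct ns_request   { int type; int ns_id; bool permitted; };

/* Phase 1: may fail; writes only to the staging set, never to the task. */
static int ns_install(struct nsset_sketch *set, const struct ns_request *req)
{
    assert(req->type >= 0 && req->type < NS_COUNT);
    if (!req->permitted)
        return -1; /* stands in for a failed permission check */
    set->ns[req->type]  = req->ns_id;
    set->set[req->type] = true;
    return 0;
}

/* Phase 2: cannot fail, hence void. Makes the changes visible. */
static void ns_commit(struct task_sketch *t, const struct nsset_sketch *set)
{
    for (int i = 0; i < NS_COUNT; i++)
        if (set->set[i])
            t->ns[i] = set->ns[i];
}

/* Attach to several namespaces at once: all of them or none. */
static int setns_sketch(struct task_sketch *t,
                        const struct ns_request *req, size_t n)
{
    struct nsset_sketch set = {0};

    for (size_t i = 0; i < n; i++)
        if (ns_install(&set, &req[i]))
            return -1; /* the task has not been touched */
    ns_commit(t, &set);
    return 0;
}
```

    Making the commit step void is the same argument the message makes for
    eventually turning vdso_join_timens() into a void function: the type
    itself tells callers that failure is not expected.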
     

04 Aug, 2020

3 commits

  • Pull scheduler updates from Ingo Molnar:

    - Improve uclamp performance by using a static key for the fast path

    - Add the "sched_util_clamp_min_rt_default" sysctl, to optimize for
    better power efficiency of RT tasks on battery powered devices.
    (The default is to maximize performance & reduce RT latencies.)

    - Improve utime and stime tracking accuracy, which had a fixed boundary
    of error and therefore produced larger and larger relative errors as
    the values became larger. This is now replaced with more precise
    arithmetic, using the new mul_u64_u64_div_u64() helper in math64.h.

    - Improve the deadline scheduler, such as making it capacity aware

    - Improve frequency-invariant scheduling

    - Misc cleanups in energy/power aware scheduling

    - Add sched_update_nr_running tracepoint to track changes to nr_running

    - Documentation additions and updates

    - Misc cleanups and smaller fixes

    * tag 'sched-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
    sched/doc: Factorize bits between sched-energy.rst & sched-capacity.rst
    sched/doc: Document capacity aware scheduling
    sched: Document arch_scale_*_capacity()
    arm, arm64: Fix selection of CONFIG_SCHED_THERMAL_PRESSURE
    Documentation/sysctl: Document uclamp sysctl knobs
    sched/uclamp: Add a new sysctl to control RT default boost value
    sched/uclamp: Fix a deadlock when enabling uclamp static key
    sched: Remove duplicated tick_nohz_full_enabled() check
    sched: Fix a typo in a comment
    sched/uclamp: Remove unnecessary mutex_init()
    arm, arm64: Select CONFIG_SCHED_THERMAL_PRESSURE
    sched: Cleanup SCHED_THERMAL_PRESSURE kconfig entry
    arch_topology, sched/core: Cleanup thermal pressure definition
    trace/events/sched.h: fix duplicated word
    linux/sched/mm.h: drop duplicated words in comments
    smp: Fix a potential usage of stale nr_cpus
    sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal
    sched: nohz: stop passing around unused "ticks" parameter.
    sched: Better document ttwu()
    sched: Add a tracepoint to track rq->nr_running
    ...

    Linus Torvalds
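
    The utime/stime item above relies on computing a * b / c without losing
    bits in the intermediate product. A minimal user-space sketch of that
    arithmetic is shown below. It assumes a compiler with the unsigned
    __int128 extension (GCC or Clang on a 64-bit target); mul_div_u64() and
    scale_stime() are hypothetical names, not the kernel helper itself.

```c
#include <assert.h>
#include <stdint.h>

/* a * b / c with a 128-bit intermediate, so the product cannot overflow.
 * The caller must ensure that the quotient fits into 64 bits. */
static uint64_t mul_div_u64(uint64_t a, uint64_t b, uint64_t c)
{
    assert(c != 0);
    return (uint64_t)(((unsigned __int128)a * b) / c);
}

/* Scale the sampled system time so that stime and utime together add up
 * to the precisely measured runtime: stime * rtime / (stime + utime). */
static uint64_t scale_stime(uint64_t stime, uint64_t utime, uint64_t rtime)
{
    uint64_t total = stime + utime;

    if (total == 0)
        return 0; /* nothing sampled yet; the sketch simply reports zero */
    /* stime <= total, so the result is <= rtime and fits into 64 bits. */
    return mul_div_u64(stime, rtime, total);
}
```

    With plain 64-bit multiplication, stime * rtime overflows once both
    values are large, which is the situation a long-running task ends up in.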
     
  • Pull RCU updates from Ingo Molnar:

    - kfree_rcu updates

    - RCU tasks updates

    - Read-side scalability tests

    - SRCU updates

    - Torture-test updates

    - Documentation updates

    - Miscellaneous fixes

    * tag 'core-rcu-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (109 commits)
    torture: Remove obsolete "cd $KVM"
    torture: Avoid duplicate specification of qemu command
    torture: Dump ftrace at shutdown only if requested
    torture: Add kvm-tranform.sh script for qemu-cmd files
    torture: Add more tracing crib notes to kvm.sh
    torture: Improve diagnostic for KCSAN-incapable compilers
    torture: Correctly summarize build-only runs
    torture: Pass --kmake-arg to all make invocations
    rcutorture: Check for unwatched readers
    torture: Abstract out console-log error detection
    torture: Add a stop-run capability
    torture: Create qemu-cmd in --buildonly runs
    rcu/rcutorture: Replace 0 with false
    torture: Add --allcpus argument to the kvm.sh script
    torture: Remove whitespace from identify_qemu_vcpus output
    rcutorture: NULL rcu_torture_current earlier in cleanup code
    rcutorture: Handle non-statistic bang-string error messages
    torture: Set configfile variable to current scenario
    rcutorture: Add races with task-exit processing
    locktorture: Use true and false to assign to bool variables
    ...

    Linus Torvalds
     
  • Pull arm64 and cross-arch updates from Catalin Marinas:
    "Here's a slightly wider-spread set of updates for 5.9.

    Going outside the usual arch/arm64/ area is the removal of
    read_barrier_depends() series from Will and the MSI/IOMMU ID
    translation series from Lorenzo.

    The notable arm64 updates include ARMv8.4 TLBI range operations and
    translation level hint, time namespace support, and perf.

    Summary:

    - Removal of the tremendously unpopular read_barrier_depends()
    barrier, which is a NOP on all architectures apart from Alpha, in
    favour of allowing architectures to override READ_ONCE() and do
    whatever dance they need to do to ensure address dependencies
    provide LOAD -> LOAD/STORE ordering.

    This work also offers a potential solution if compilers are shown
    to convert LOAD -> LOAD address dependencies into control
    dependencies (e.g. under LTO), as weakly ordered architectures will
    effectively be able to upgrade READ_ONCE() to smp_load_acquire().
    The latter case is not used yet, but will be discussed further at
    LPC.

    - Make the MSI/IOMMU input/output ID translation PCI agnostic,
    augment the MSI/IOMMU ACPI/OF ID mapping APIs to accept an input ID
    bus-specific parameter and apply the resulting changes to the
    device ID space provided by the Freescale FSL bus.

    - arm64 support for TLBI range operations and translation table level
    hints (part of the ARMv8.4 architecture version).

    - Time namespace support for arm64.

    - Export the virtual and physical address sizes in vmcoreinfo for
    makedumpfile and crash utilities.

    - CPU feature handling cleanups and checks for programmer errors
    (overlapping bit-fields).

    - ACPI updates for arm64: disallow AML accesses to EFI code regions
    and kernel memory.

    - perf updates for arm64.

    - Miscellaneous fixes and cleanups, most notably PLT counting
    optimisation for module loading, recordmcount fix to ignore
    relocations other than R_AARCH64_CALL26, CMA areas reserved for
    gigantic pages on 16K and 64K configurations.

    - Trivial typos, duplicate words"

    Link: http://lkml.kernel.org/r/20200710165203.31284-1-will@kernel.org
    Link: http://lkml.kernel.org/r/20200619082013.13661-1-lorenzo.pieralisi@arm.com

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (82 commits)
    arm64: use IRQ_STACK_SIZE instead of THREAD_SIZE for irq stack
    arm64/mm: save memory access in check_and_switch_context() fast switch path
    arm64: sigcontext.h: delete duplicated word
    arm64: ptrace.h: delete duplicated word
    arm64: pgtable-hwdef.h: delete duplicated words
    bus: fsl-mc: Add ACPI support for fsl-mc
    bus/fsl-mc: Refactor the MSI domain creation in the DPRC driver
    of/irq: Make of_msi_map_rid() PCI bus agnostic
    of/irq: make of_msi_map_get_device_domain() bus agnostic
    dt-bindings: arm: fsl: Add msi-map device-tree binding for fsl-mc bus
    of/device: Add input id to of_dma_configure()
    of/iommu: Make of_map_rid() PCI agnostic
    ACPI/IORT: Add an input ID to acpi_dma_configure()
    ACPI/IORT: Remove useless PCI bus walk
    ACPI/IORT: Make iort_msi_map_rid() PCI agnostic
    ACPI/IORT: Make iort_get_device_domain IRQ domain agnostic
    ACPI/IORT: Make iort_match_node_callback walk the ACPI namespace for NC
    arm64: enable time namespace support
    arm64/vdso: Restrict splitting VVAR VMA
    arm64/vdso: Handle faults on timens page
    ...

    Linus Torvalds
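
    The read_barrier_depends() item above mentions that weakly ordered
    architectures could upgrade READ_ONCE() to smp_load_acquire(). C11 has no
    exact counterpart to a dependency-ordered load, so the sketch below uses
    an acquire load, which corresponds to that upgraded variant. The names
    are hypothetical and this is user-space C11, not kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct msg { int payload; };

/* Shared pointer; NULL until something has been published. */
static _Atomic(struct msg *) published;

/* Writer: initialize the object, then publish the pointer with release
 * semantics so the payload store cannot be reordered after it. */
static void publish(struct msg *m, int value)
{
    assert(m != NULL);
    m->payload = value;
    atomic_store_explicit(&published, m, memory_order_release);
}

/* Reader: the dereference depends on the loaded pointer. The acquire load
 * guarantees LOAD -> LOAD ordering even if a compiler were to turn the
 * address dependency into a control dependency. */
static int consume(int fallback)
{
    struct msg *m = atomic_load_explicit(&published, memory_order_acquire);

    return m ? m->payload : fallback;
}
```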
     

31 Jul, 2020

1 commit


30 Jul, 2020

1 commit

  • This modifies the first 32 bits out of the 128 bits of a random CPU's
    net_rand_state on interrupt or CPU activity to complicate remote
    observations that could lead to guessing the network RNG's internal
    state.

    Note that depending on some network devices' interrupt rate moderation
    or binding, this re-seeding might happen on every packet or even almost
    never.

    In addition, with NOHZ some CPUs might not even get timer interrupts,
    leaving their local state rarely updated, while they are running
    networked processes making use of the random state. For this reason, we
    also perform this update in update_process_times() in order to at least
    update the state when there is user or system activity, since it's the
    only case we care about.

    Reported-by: Amit Klein
    Suggested-by: Linus Torvalds
    Cc: Eric Dumazet
    Cc: "Jason A. Donenfeld"
    Cc: Andy Lutomirski
    Cc: Kees Cook
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc:
    Signed-off-by: Willy Tarreau
    Signed-off-by: Linus Torvalds

    Willy Tarreau
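
    The perturbation described in the commit above touches only the first 32
    bits of a 128-bit generator state. A user-space sketch of that idea
    follows. The struct layout, the rotate amount and the function names are
    assumptions made for illustration, not a copy of the kernel change.

```c
#include <assert.h>
#include <stdint.h>

/* 128 bits of state, modelled as four 32-bit words (assumed layout). */
struct rnd_state_sketch { uint32_t s1, s2, s3, s4; };

static_assert(sizeof(struct rnd_state_sketch) == 16, "128 bits of state");

/* Rotate a 32-bit word left by s bits. */
static uint32_t rol32(uint32_t w, unsigned s)
{
    s &= 31;
    return (w << s) | (w >> ((32 - s) & 31));
}

/* Mix a tick counter and some event-dependent noise into the first word.
 * Rotating the counter moves its fast-changing low bits into the high
 * bits of the sum. The other 96 bits of state are left alone. */
static void perturb(struct rnd_state_sketch *st, uint32_t ticks, uint32_t noise)
{
    st->s1 += rol32(ticks, 24) + noise;
}
```

    Calling perturb() from both an interrupt path and a per-tick accounting
    path mirrors the two update sites the message describes.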
     

29 Jul, 2020

2 commits

  • A sequence counter write side critical section must be protected by some
    form of locking to serialize writers. A plain seqcount_t does not
    contain the information of which lock must be held when entering a write
    side critical section.

    Use the new seqcount_raw_spinlock_t data type, which allows associating
    a raw spinlock with the sequence counter. This enables lockdep to verify
    that the raw spinlock used for writer serialization is held when the
    write side critical section is entered.

    If lockdep is disabled this lock association is compiled out and has
    neither storage size nor runtime overhead.

    Signed-off-by: Ahmed S. Darwish
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200720155530.1173732-25-a.darwish@linutronix.de

    Ahmed S. Darwish
     
  • A sequence counter write side critical section must be protected by some
    form of locking to serialize writers. A plain seqcount_t does not
    contain the information of which lock must be held when entering a write
    side critical section.

    Use the new seqcount_raw_spinlock_t data type, which allows associating
    a raw spinlock with the sequence counter. This enables lockdep to verify
    that the raw spinlock used for writer serialization is held when the
    write side critical section is entered.

    If lockdep is disabled this lock association is compiled out and has
    neither storage size nor runtime overhead.

    Signed-off-by: Ahmed S. Darwish
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200720155530.1173732-18-a.darwish@linutronix.de

    Ahmed S. Darwish
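
    The lock association described in the two commits above can be modelled
    in a few lines: the sequence counter carries a pointer to its writer lock
    only when checking is enabled, and the write side asserts that the lock
    is held. SKETCH_LOCKDEP and all type and function names are hypothetical;
    this is a single-threaded illustration, not the kernel's seqcount API.

```c
#include <assert.h>
#include <stdbool.h>

#ifndef SKETCH_LOCKDEP
#define SKETCH_LOCKDEP 1 /* set to 0 to compile the association out */
#endif

struct sketch_lock { bool held; };

struct seqcount_locked {
    unsigned seq;
#if SKETCH_LOCKDEP
    struct sketch_lock *lock; /* only present when checking is enabled */
#endif
};

static void seq_init(struct seqcount_locked *s, struct sketch_lock *l)
{
    s->seq = 0;
#if SKETCH_LOCKDEP
    s->lock = l;
#else
    (void)l; /* no storage and no runtime cost without checking */
#endif
}

static void write_begin(struct seqcount_locked *s)
{
#if SKETCH_LOCKDEP
    assert(s->lock->held); /* writer serialization must be in place */
#endif
    s->seq++; /* odd: update in progress */
}

static void write_end(struct seqcount_locked *s)
{
    s->seq++; /* even: data is consistent again */
}
```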
     

24 Jul, 2020

1 commit

  • The nohz tick code recalculates the timer wheel's next expiry on each idle
    loop iteration.

    On the other hand, the base next expiry is now always cached and updated
    upon timer enqueue and execution. Only timer dequeue may leave
    base->next_expiry out of date (but then its stale value won't ever go past
    the actual next expiry to be recalculated).

    Since recalculating the next_expiry isn't a free operation, especially when
    the last wheel level is reached to find out that no timer has been enqueued
    at all, reuse the next expiry cache when it is known to be reliable, which
    it is most of the time.

    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20200723151641.12236-1-frederic@kernel.org

    Frederic Weisbecker
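
    The caching rule in the commit above (enqueue keeps the cache exact,
    dequeue may leave it stale but never later than the real next expiry) can
    be sketched with a flat array in place of the timer wheel. The recalc
    flag and all other names are hypothetical, and tick counter wraparound is
    ignored to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_TIMERS 8
#define NO_TIMER   (~0UL)

struct base_sketch {
    unsigned long expires[MAX_TIMERS]; /* NO_TIMER marks a free slot */
    unsigned long next_expiry;         /* cached earliest expiry */
    bool          recalc;              /* cache may be stale (too early) */
    unsigned      scans;               /* counts full recalculations */
};

static void base_init(struct base_sketch *b)
{
    for (size_t i = 0; i < MAX_TIMERS; i++)
        b->expires[i] = NO_TIMER;
    b->next_expiry = NO_TIMER;
    b->recalc = false;
    b->scans = 0;
}

/* Enqueue updates the cache directly; returns the slot or -1 if full. */
static int timer_enqueue(struct base_sketch *b, unsigned long expires)
{
    for (size_t i = 0; i < MAX_TIMERS; i++) {
        if (b->expires[i] != NO_TIMER)
            continue;
        b->expires[i] = expires;
        if (expires < b->next_expiry)
            b->next_expiry = expires;
        return (int)i;
    }
    return -1;
}

/* Dequeue does not rescan; it only marks the cache as possibly stale. */
static void timer_dequeue(struct base_sketch *b, int slot)
{
    assert(slot >= 0 && slot < MAX_TIMERS);
    b->expires[slot] = NO_TIMER;
    b->recalc = true;
}

/* Idle path: reuse the cache when it is known to be reliable. */
static unsigned long next_timer(struct base_sketch *b)
{
    if (!b->recalc)
        return b->next_expiry;
    b->next_expiry = NO_TIMER;
    for (size_t i = 0; i < MAX_TIMERS; i++)
        if (b->expires[i] < b->next_expiry)
            b->next_expiry = b->expires[i];
    b->recalc = false;
    b->scans++;
    return b->next_expiry;
}
```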
     

22 Jul, 2020

1 commit

  • The "ticks" parameter was added in commit 0f004f5a696a ("sched: Cure more
    NO_HZ load average woes") since calc_global_nohz() was called and needed
    the "ticks" argument.

    But in commit c308b56b5398 ("sched: Fix nohz load accounting -- again!")
    it became unused as the function calc_global_nohz() dropped using "ticks".

    Fixes: c308b56b5398 ("sched: Fix nohz load accounting -- again!")
    Signed-off-by: Paul Gortmaker
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/1593628458-32290-1-git-send-email-paul.gortmaker@windriver.com

    Paul Gortmaker