Blame view

Documentation/timers/hrtimers.txt 8.69 KB
df78488de   Thomas Gleixner   [PATCH] hrtimer: ...
1
2
3
4
5
6
7
8
9
10
11
12
  
  hrtimers - subsystem for high-resolution kernel timers
  ----------------------------------------------------
  
  This patch introduces a new subsystem for high-resolution kernel timers.
  
  One might ask the question: we already have a timer subsystem
  (kernel/timers.c), why do we need two timer subsystems? After a lot of
  back and forth trying to integrate high-resolution and high-precision
  features into the existing timer framework, and after testing various
  such high-resolution timer implementations in practice, we came to the
  conclusion that the timer wheel code is fundamentally not suitable for
fff9289b2   Matt LaPlante   Fix typos in Docu...
13
  such an approach. We initially didn't believe this ('there must be a way
df78488de   Thomas Gleixner   [PATCH] hrtimer: ...
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
  to solve this'), and spent a considerable effort trying to integrate
  things into the timer wheel, but we failed. In hindsight, there are
  several reasons why such integration is hard/impossible:
  
  - the forced handling of low-resolution and high-resolution timers in
    the same way leads to a lot of compromises, macro magic and #ifdef
    mess. The timers.c code is very "tightly coded" around jiffies and
    32-bitness assumptions, and has been honed and micro-optimized for a
    relatively narrow use case (jiffies in a relatively narrow HZ range)
    for many years - and thus even small extensions to it easily break
    the wheel concept, leading to even worse compromises. The timer wheel
    code is very good and tight code, there's zero problems with it in its
    current usage - but it is simply not suitable to be extended for
    high-res timers.
  
  - the unpredictable [O(N)] overhead of cascading leads to delays which
992caacf1   Matt LaPlante   Fix typos in Docu...
30
    necessitate a more complex handling of high resolution timers, which
df78488de   Thomas Gleixner   [PATCH] hrtimer: ...
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
    in turn decreases robustness. Such a design still led to rather large
    timing inaccuracies. Cascading is a fundamental property of the timer
    wheel concept, it cannot be 'designed out' without unevitably
    degrading other portions of the timers.c code in an unacceptable way.
  
  - the implementation of the current posix-timer subsystem on top of
    the timer wheel has already introduced a quite complex handling of
    the required readjusting of absolute CLOCK_REALTIME timers at
    settimeofday or NTP time - further underlying our experience by
    example: that the timer wheel data structure is too rigid for high-res
    timers.
  
  - the timer wheel code is most optimal for use cases which can be
    identified as "timeouts". Such timeouts are usually set up to cover
    error conditions in various I/O paths, such as networking and block
    I/O. The vast majority of those timers never expire and are rarely
    recascaded because the expected correct event arrives in time so they
    can be removed from the timer wheel before any further processing of
    them becomes necessary. Thus the users of these timeouts can accept
    the granularity and precision tradeoffs of the timer wheel, and
    largely expect the timer subsystem to have near-zero overhead.
    Accurate timing for them is not a core purpose - in fact most of the
    timeout values used are ad-hoc. For them it is at most a necessary
    evil to guarantee the processing of actual timeout completions
    (because most of the timeouts are deleted before completion), which
    should thus be as cheap and unintrusive as possible.
  
  The primary users of precision timers are user-space applications that
  utilize nanosleep, posix-timers and itimer interfaces. Also, in-kernel
  users like drivers and subsystems which require precise timed events
53cb47268   Matt LaPlante   Fix typos in Docu...
61
  (e.g. multimedia) can benefit from the availability of a separate
df78488de   Thomas Gleixner   [PATCH] hrtimer: ...
62
63
64
65
66
67
68
69
  high-resolution timer subsystem as well.
  
  While this subsystem does not offer high-resolution clock sources just
  yet, the hrtimer subsystem can be easily extended with high-resolution
  clock capabilities, and patches for that exist and are maturing quickly.
  The increasing demand for realtime and multimedia applications along
  with other potential users for precise timers gives another reason to
  separate the "timeout" and "precise timer" subsystems.
53cb47268   Matt LaPlante   Fix typos in Docu...
70
  Another potential benefit is that such a separation allows even more
df78488de   Thomas Gleixner   [PATCH] hrtimer: ...
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
  special-purpose optimization of the existing timer wheel for the low
  resolution and low precision use cases - once the precision-sensitive
  APIs are separated from the timer wheel and are migrated over to
  hrtimers. E.g. we could decrease the frequency of the timeout subsystem
  from 250 Hz to 100 HZ (or even smaller).
  
  hrtimer subsystem implementation details
  ----------------------------------------
  
  the basic design considerations were:
  
  - simplicity
  
  - data structure not bound to jiffies or any other granularity. All the
    kernel logic works at 64-bit nanoseconds resolution - no compromises.
  
  - simplification of existing, timing related kernel code
  
  another basic requirement was the immediate enqueueing and ordering of
  timers at activation time. After looking at several possible solutions
  such as radix trees and hashes, we chose the red black tree as the basic
  data structure. Rbtrees are available as a library in the kernel and are
  used in various performance-critical areas of e.g. memory management and
  file systems. The rbtree is solely used for time sorted ordering, while
  a separate list is used to give the expiry code fast access to the
  queued timers, without having to walk the rbtree.
53cb47268   Matt LaPlante   Fix typos in Docu...
97
98
  (This separate list is also useful for later when we'll introduce
  high-resolution clocks, where we need separate pending and expired
df78488de   Thomas Gleixner   [PATCH] hrtimer: ...
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
  queues while keeping the time-order intact.)
  
  Time-ordered enqueueing is not purely for the purposes of
  high-resolution clocks though, it also simplifies the handling of
  absolute timers based on a low-resolution CLOCK_REALTIME. The existing
  implementation needed to keep an extra list of all armed absolute
  CLOCK_REALTIME timers along with complex locking. In case of
  settimeofday and NTP, all the timers (!) had to be dequeued, the
  time-changing code had to fix them up one by one, and all of them had to
  be enqueued again. The time-ordered enqueueing and the storage of the
  expiry time in absolute time units removes all this complex and poorly
  scaling code from the posix-timer implementation - the clock can simply
  be set without having to touch the rbtree. This also makes the handling
  of posix-timers simpler in general.
  
  The locking and per-CPU behavior of hrtimers was mostly taken from the
  existing timer wheel code, as it is mature and well suited. Sharing code
  was not really a win, due to the different data structures. Also, the
  hrtimer functions now have clearer behavior and clearer names - such as
  hrtimer_try_to_cancel() and hrtimer_cancel() [which are roughly
  equivalent to del_timer() and del_timer_sync()] - so there's no direct
  1:1 mapping between them on the algorithmical level, and thus no real
  potential for code sharing either.
  
  Basic data types: every time value, absolute or relative, is in a
  special nanosecond-resolution type: ktime_t. The kernel-internal
  representation of ktime_t values and operations is implemented via
  macros and inline functions, and can be switched between a "hybrid
  union" type and a plain "scalar" 64bit nanoseconds representation (at
  compile time). The hybrid union type optimizes time conversions on 32bit
  CPUs. This build-time-selectable ktime_t storage format was implemented
  to avoid the performance impact of 64-bit multiplications and divisions
  on 32bit CPUs. Such operations are frequently necessary to convert
  between the storage formats provided by kernel and userspace interfaces
  and the internal time format. (See include/linux/ktime.h for further
  details.)
  
  hrtimers - rounding of timer values
  -----------------------------------
  
  the hrtimer code will round timer events to lower-resolution clocks
  because it has to. Otherwise it will do no artificial rounding at all.
  
  one question is, what resolution value should be returned to the user by
  the clock_getres() interface. This will return whatever real resolution
  a given clock has - be it low-res, high-res, or artificially-low-res.
  
  hrtimers - testing and verification
  ----------------------------------
  
  We used the high-resolution clock subsystem ontop of hrtimers to verify
  the hrtimer implementation details in praxis, and we also ran the posix
  timer tests in order to ensure specification compliance. We also ran
  tests on low-resolution clocks.
  
  The hrtimer patch converts the following kernel functionality to use
  hrtimers:
  
   - nanosleep
   - itimers
   - posix-timers
  
  The conversion of nanosleep and posix-timers enabled the unification of
  nanosleep and clock_nanosleep.
  
  The code was successfully compiled for the following platforms:
  
   i386, x86_64, ARM, PPC, PPC64, IA64
  
  The code was run-tested on the following platforms:
  
   i386(UP/SMP), x86_64(UP/SMP), ARM, PPC
  
  hrtimers were also integrated into the -rt tree, along with a
  hrtimers-based high-resolution clock implementation, so the hrtimers
  code got a healthy amount of testing and use in practice.
  
  	Thomas Gleixner, Ingo Molnar