  -*-Mode: outline-*-
  
  		Light-weight System Calls for IA-64
  		-----------------------------------
  
  		        Started: 13-Jan-2003
  		    Last update: 27-Sep-2003
  
  	              David Mosberger-Tang
  		      <davidm@hpl.hp.com>
  
  Using the "epc" instruction effectively introduces a new mode of
  execution to the ia64 linux kernel.  We call this mode the
  "fsys-mode".  To recap, the normal states of execution are:
  
    - kernel mode:
  	Both the register stack and the memory stack have been
  	switched over to kernel memory.  The user-level state is saved
  	in a pt_regs structure at the top of the kernel memory stack.
  
    - user mode:
  	Both the register stack and the memory stack are in
  	user memory.  The user-level state is contained in the
  	CPU registers.
  
    - bank 0 interruption-handling mode:
  	This is the non-interruptible state which all
  	interruption-handlers start execution in.  The user-level
  	state remains in the CPU registers and some kernel state may
  	be stored in bank 0 of registers r16-r31.
  
  In contrast, fsys-mode has the following special properties:
  
    - execution is at privilege level 0 (most-privileged)
  
    - CPU registers may contain a mixture of user-level and kernel-level
      state (it is the responsibility of the kernel to ensure that no
      security-sensitive kernel-level state is leaked back to
      user-level)
  
    - execution is interruptible and preemptible (an fsys-mode handler
      can disable interrupts and avoid all other interruption-sources
      to avoid preemption)
  
    - neither the memory-stack nor the register-stack can be trusted while
      in fsys-mode (they point to the user-level stacks, which may
      be invalid, or completely bogus addresses)
  
  In summary, fsys-mode is much more similar to running in user-mode
  than it is to running in kernel-mode.  Of course, given that the
  privilege level is at level 0, this means that fsys-mode requires some
  care (see below).
  
  
  * How to tell fsys-mode
  
  Linux operates in fsys-mode when (a) the privilege level is 0 (most
  privileged) and (b) the stacks have NOT been switched to kernel memory
  yet.  For convenience, the header file <asm-ia64/ptrace.h> provides
  three macros:
  
  	user_mode(regs)
  	user_stack(task,regs)
  	fsys_mode(task,regs)
  
  The "regs" argument is a pointer to a pt_regs structure.  The "task"
  argument is a pointer to the task structure to which the "regs"
  pointer belongs.  user_mode() returns TRUE if the CPU state pointed
  to by "regs" was executing in user mode (privilege level 3).
  user_stack() returns TRUE if the state pointed to by "regs" was
  executing on the user-level stack(s).  Finally, fsys_mode() returns
  TRUE if the CPU state pointed to by "regs" was executing in fsys-mode.
  The fsys_mode() macro is equivalent to the expression:
  
  	!user_mode(regs) && user_stack(task,regs)
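
The relationship between the three predicates can be sketched in plain
C.  This is a simplified model, not the kernel's definition: "struct
model_regs" and its two fields are hypothetical stand-ins for the
privilege level and the stack location that the real macros derive from
the pt_regs and task structures.

```c
#include <stdbool.h>

/* Hypothetical model; the real macros inspect pt_regs and the task structure. */
struct model_regs {
	unsigned int cpl;	/* privilege level of the saved state: 0..3 */
	bool on_user_stack;	/* true if the stacks were not switched yet */
};

/* TRUE if the saved state was executing at privilege level 3. */
static bool model_user_mode(const struct model_regs *regs)
{
	return regs->cpl == 3;
}

/* TRUE if the saved state was executing on the user-level stack(s). */
static bool model_user_stack(const struct model_regs *regs)
{
	return regs->on_user_stack;
}

/* fsys-mode: privileged, but still on the user-level stacks. */
static bool model_fsys_mode(const struct model_regs *regs)
{
	return !model_user_mode(regs) && model_user_stack(regs);
}
```

In this model, a state with cpl == 0 that is still on the user stacks
is the only combination that counts as fsys-mode.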
  
  * How to write an fsyscall handler
  
  The file arch/ia64/kernel/fsys.S contains a table of fsyscall-handlers
  (fsyscall_table).  This table contains one entry for each system call.
  By default, a system call is handled by fsys_fallback_syscall().  This
  routine takes care of entering (full) kernel mode and calling the
  normal Linux system call handler.  For performance-critical system
  calls, it is possible to write a hand-tuned fsyscall-handler.  For
  example, fsys.S contains fsys_getpid(), which is a hand-tuned version
  of the getpid() system call.
  
  The entry and exit-state of an fsyscall handler is as follows:
  
  ** Machine state on entry to fsyscall handler:
  
   - r10	  = 0
   - r11	  = saved ar.pfs (a user-level value)
   - r15	  = system call number
   - r16	  = "current" task pointer (in normal kernel-mode, this is in r13)
   - r32-r39 = system call arguments
   - b6	  = return address (a user-level value)
   - ar.pfs = previous frame-state (a user-level value)
   - PSR.be = cleared to zero (i.e., little-endian byte order is in effect)
   - all other registers may contain values passed in from user-mode
  
  ** Required machine state on exit from fsyscall handler:
  
   - r11	  = saved ar.pfs (as passed into the fsyscall handler)
   - r15	  = system call number (as passed into the fsyscall handler)
   - r32-r39 = system call arguments (as passed into the fsyscall handler)
   - b6	  = return address (as passed into the fsyscall handler)
   - ar.pfs = previous frame-state (as passed into the fsyscall handler)
  
  Fsyscall handlers can execute with very little overhead, but with that
  speed comes a set of restrictions:
  
   o Fsyscall-handlers MUST check for any pending work in the flags
     member of the thread-info structure and if any of the
     TIF_ALLWORK_MASK flags are set, the handler needs to fall back on
     doing a full system call (by calling fsys_fallback_syscall).
  
   o Fsyscall-handlers MUST preserve incoming arguments (r32-r39, r11,
     r15, b6, and ar.pfs) because they will be needed in case of a
     system call restart.  Of course, all "preserved" registers also
     must be preserved, in accordance with the normal calling conventions.
  
   o Fsyscall-handlers MUST check argument registers for containing a
     NaT value before using them in any way that could trigger a
     NaT-consumption fault.  If a system call argument is found to
     contain a NaT value, an fsyscall-handler may return immediately
     with r8=EINVAL, r10=-1.
  
   o Fsyscall-handlers MUST NOT use the "alloc" instruction or perform
     any other operation that would trigger mandatory RSE
     (register-stack engine) traffic.
  
   o Fsyscall-handlers MUST NOT write to any stacked registers because
     it is not safe to assume that user-level called a handler with the
     proper number of arguments.
  
   o Fsyscall-handlers need to be careful when accessing per-CPU variables:
     unless proper safe-guards are taken (e.g., interruptions are avoided),
     execution may be pre-empted and resumed on another CPU at any given
     time.
  
   o Fsyscall-handlers must be careful not to leak sensitive kernel
     information back to user-level.  In particular, before returning to
     user-level, care needs to be taken to clear any scratch registers
     that could contain sensitive information (note that the current
     task pointer is not considered sensitive: it's already exposed
     through ar.k6).
  
   o Fsyscall-handlers MUST NOT access user-memory without first
     validating access-permission (this can be done typically via
     probe.r.fault and/or probe.w.fault) and without guarding against
     memory access exceptions (this can be done with the EX() macros
     defined by asmmacro.h).
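
The first and third rules amount to a short decision made before any
fast-path work is done.  The following C model is only a sketch of that
decision: real handlers are written in assembly, the mask value
MODEL_TIF_ALLWORK_MASK is a made-up placeholder, and the order of the
two checks is an assumption made for illustration.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder value; the real TIF_ALLWORK_MASK is defined by the kernel. */
#define MODEL_TIF_ALLWORK_MASK 0x1fu

enum fsys_action {
	FSYS_FALLBACK,	/* branch to fsys_fallback_syscall */
	FSYS_FAIL,	/* return immediately with an error */
	FSYS_FAST	/* safe to continue on the fast path */
};

struct fsys_result {
	enum fsys_action action;
	long r8;	/* error code when action == FSYS_FAIL */
	long r10;	/* -1 when action == FSYS_FAIL */
};

/* Decide how a handler should proceed, given the thread-info flags
 * and whether an argument register it needs holds a NaT value. */
static struct fsys_result model_fsys_precheck(uint32_t thread_flags,
					      bool arg_is_nat)
{
	struct fsys_result res = { FSYS_FAST, 0, 0 };

	if (thread_flags & MODEL_TIF_ALLWORK_MASK) {
		res.action = FSYS_FALLBACK;
	} else if (arg_is_nat) {
		res.action = FSYS_FAIL;
		res.r8 = EINVAL;
		res.r10 = -1;
	}
	return res;
}
```

Only the FSYS_FAST outcome lets the handler touch its arguments; the
other two outcomes leave all incoming registers untouched.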
  
  The above restrictions may seem draconian, but remember that it's
  possible to trade off some of the restrictions by paying a slightly
  higher overhead.  For example, if an fsyscall-handler could benefit
  from the shadow register bank, it could temporarily disable PSR.i and
  PSR.ic, switch to bank 0 (bsw.0) and then use the shadow registers as
  needed.  In other words, following the above rules yields extremely
  fast system call execution (while fully preserving system call
  semantics), but there is also a lot of flexibility in handling more
  complicated cases.
  
  * Signal handling
  
  The delivery of (asynchronous) signals must be delayed until fsys-mode
  is exited.  This is accomplished with the help of the lower-privilege
  transfer trap: arch/ia64/kernel/process.c:do_notify_resume_user()
  checks whether the interrupted task was in fsys-mode and, if so, sets
  PSR.lp and returns immediately.  When fsys-mode is exited via the
  "br.ret" instruction that lowers the privilege level, a trap will
  occur.  The trap handler clears PSR.lp again and returns immediately.
  The kernel exit path then checks for and delivers any pending signals.
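
The deferral can be pictured as a small state machine.  The C model
below is a sketch under stated assumptions: "struct model_cpu" and the
model_*() functions are hypothetical, and the real logic is spread over
do_notify_resume_user(), the lower-privilege transfer trap handler and
the kernel exit path.

```c
#include <stdbool.h>

/* Hypothetical model of the state involved in signal deferral. */
struct model_cpu {
	bool in_fsys_mode;	/* interrupted context is in fsys-mode */
	bool psr_lp;		/* models PSR.lp */
	bool signal_pending;
	int signals_delivered;
};

/* Models the notify-resume check: defer while in fsys-mode. */
static void model_notify_resume(struct model_cpu *cpu)
{
	if (!cpu->signal_pending)
		return;
	if (cpu->in_fsys_mode) {
		cpu->psr_lp = true;	/* arm the trap and return */
		return;
	}
	cpu->signal_pending = false;
	cpu->signals_delivered++;
}

/* Models the "br.ret" that lowers the privilege level. */
static void model_fsys_return(struct model_cpu *cpu)
{
	cpu->in_fsys_mode = false;
	if (cpu->psr_lp) {
		cpu->psr_lp = false;		/* trap handler clears PSR.lp */
		model_notify_resume(cpu);	/* exit path delivers signals */
	}
}
```

In this model a signal that arrives during fsys-mode is never
delivered before model_fsys_return() runs, and is delivered exactly
once afterwards.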
  
  * PSR Handling
  
  The "epc" instruction doesn't change the contents of PSR at all.  This
  is in contrast to a regular interruption, which clears almost all
  bits.  Because of that, some care needs to be taken to ensure things
  work as expected.  The following discussion describes how each PSR bit
  is handled.
  
  PSR.be	Cleared when entering fsys-mode.  A srlz.d instruction is used
  	to ensure the CPU is in little-endian mode before the first
  	load/store instruction is executed.  PSR.be is normally NOT
  	restored upon return from an fsys-mode handler.  In other
  	words, user-level code must not rely on PSR.be being preserved
  	across a system call.
  PSR.up	Unchanged.
  PSR.ac	Unchanged.
  PSR.mfl	Unchanged.  Note: fsys-mode handlers must not write fp-registers!
  PSR.mfh	Unchanged.  Note: fsys-mode handlers must not write fp-registers!
  PSR.ic	Unchanged.  Note: fsys-mode handlers can clear the bit, if needed.
  PSR.i	Unchanged.  Note: fsys-mode handlers can clear the bit, if needed.
  PSR.pk	Unchanged.
  PSR.dt	Unchanged.
  PSR.dfl	Unchanged.  Note: fsys-mode handlers must not write fp-registers!
  PSR.dfh	Unchanged.  Note: fsys-mode handlers must not write fp-registers!
  PSR.sp	Unchanged.
  PSR.pp	Unchanged.
  PSR.di	Unchanged.
  PSR.si	Unchanged.
  PSR.db	Unchanged.  The kernel prevents user-level from setting a hardware
  	breakpoint that triggers at any privilege level other than 3 (user-mode).
  PSR.lp	Unchanged.
  PSR.tb	Lazy redirect.  If a taken-branch trap occurs while in
  	fsys-mode, the trap-handler modifies the saved machine state
  	such that execution resumes in the gate page at
  	syscall_via_break(), with privilege level 3.  Note: the
  	taken-branch trap would occur on the branch invoking the
  	fsyscall-handler, at which point, by definition, a syscall
  	restart is still safe.  If the system call number is invalid,
  	the fsys-mode handler will return directly to user-level.  This
  	return will trigger a taken-branch trap, but since the trap is
  	taken _after_ restoring the privilege level, the CPU has already
  	left fsys-mode, so no special treatment is needed.
  PSR.rt	Unchanged.
  PSR.cpl	Cleared to 0.
  PSR.is	Unchanged (guaranteed to be 0 on entry to the gate page).
  PSR.mc	Unchanged.
  PSR.it	Unchanged (guaranteed to be 1).
  PSR.id	Unchanged.  Note: the ia64 linux kernel never sets this bit.
  PSR.da	Unchanged.  Note: the ia64 linux kernel never sets this bit.
  PSR.dd	Unchanged.  Note: the ia64 linux kernel never sets this bit.
  PSR.ss	Lazy redirect.  If set, "epc" will cause a Single Step Trap to
  	be taken.  The trap handler then modifies the saved machine
  	state such that execution resumes in the gate page at
  	syscall_via_break(), with privilege level 3.
  PSR.ri	Unchanged.
  PSR.ed	Unchanged.  Note: This bit could only have an effect if an fsys-mode
  	handler performed a speculative load that gets NaTted.  If so, this
  	would be the normal & expected behavior, so no special treatment is
  	needed.
  PSR.bn	Unchanged.  Note: fsys-mode handlers may clear the bit, if needed.
  	Doing so requires clearing PSR.i and PSR.ic as well.
  PSR.ia	Unchanged.  Note: the ia64 linux kernel never sets this bit.
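
As a compact summary of the table, only PSR.be and PSR.cpl are changed
on the way into fsys-mode, while PSR.tb and PSR.ss cause a redirect to
syscall_via_break().  The C sketch below models this on a 64-bit value.
The bit positions are assumptions taken from memory of the architecture
definition and should be checked against the architecture manual before
being relied upon; all MODEL_* names and both functions are
hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions; verify against the architecture manual. */
#define MODEL_PSR_BE	(UINT64_C(1) << 1)	/* big-endian */
#define MODEL_PSR_I	(UINT64_C(1) << 14)	/* interrupts enabled */
#define MODEL_PSR_TB	(UINT64_C(1) << 26)	/* taken-branch trap */
#define MODEL_PSR_CPL	(UINT64_C(3) << 32)	/* current privilege level */
#define MODEL_PSR_SS	(UINT64_C(1) << 40)	/* single-step trap */

/* PSR as seen by an fsyscall handler: be and cpl cleared, rest unchanged. */
static uint64_t model_psr_on_fsys_entry(uint64_t psr)
{
	return psr & ~(MODEL_PSR_BE | MODEL_PSR_CPL);
}

/* TRUE if a trap may redirect the call to syscall_via_break(). */
static bool model_psr_may_redirect(uint64_t psr)
{
	return (psr & (MODEL_PSR_TB | MODEL_PSR_SS)) != 0;
}
```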
  
  * Using fast system calls
  
  To use fast system calls, userspace applications need only call
  __kernel_syscall_via_epc().  For example:
  
  -- example fgettimeofday() call --
  -- fgettimeofday.S --
  
  #include <asm/asmmacro.h>
  
  GLOBAL_ENTRY(fgettimeofday)
  .prologue
  .save ar.pfs, r11
  mov r11 = ar.pfs
  .body 
  
  movl r2 = 0xa000000000020660;;  // gate address (64-bit immediate needs movl)
  			       // found by inspection of System.map for the 
  			       // __kernel_syscall_via_epc() function.  See
  			       // below for how to do this for real.
  
  mov b7 = r2
  mov r15 = 1087		       // gettimeofday syscall
  ;;
  br.call.sptk.many b6 = b7
  ;;
  
  .restore sp
  
  mov ar.pfs = r11
  br.ret.sptk.many rp;;	      // return to caller
  END(fgettimeofday)
  
  -- end fgettimeofday.S --
  
  In reality, the gate address is obtained from two extra values passed
  via the ELF auxiliary vector (include/asm-ia64/elf.h):
  
   o AT_SYSINFO : is the address of __kernel_syscall_via_epc()
   o AT_SYSINFO_EHDR : is the address of the kernel gate ELF DSO
  
  The ELF DSO is a pre-linked library that is mapped in by the kernel at
  the gate page.  It is a proper ELF shared object so, with a dynamic
  loader that recognises the library, you should be able to make calls to
  the exported functions within it as with any other shared library.
  AT_SYSINFO points into the kernel DSO at the
  __kernel_syscall_via_epc() function for historical reasons (it was
  used before the kernel DSO) and as a convenience.
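
From user-level C, the two auxiliary-vector entries can be read with
the C library instead of hard-coding an address.  The sketch below
assumes a glibc that provides getauxval() in <sys/auxv.h> and the AT_*
constants in <elf.h>; the helper names are hypothetical.  On systems
that do not pass AT_SYSINFO, getauxval() returns 0 for it, so callers
need to handle that case.

```c
#include <elf.h>
#include <string.h>
#include <sys/auxv.h>

/* Base address of the kernel-provided ELF DSO, or 0 if not advertised. */
static unsigned long gate_dso_base(void)
{
	return getauxval(AT_SYSINFO_EHDR);
}

/* Address passed as AT_SYSINFO, or 0 if not advertised. */
static unsigned long gate_syscall_entry(void)
{
	return getauxval(AT_SYSINFO);
}

/* Check for the ELF magic number at the start of the mapped DSO. */
static int looks_like_elf(unsigned long base)
{
	return base != 0 &&
	       memcmp((const void *) base, ELFMAG, SELFMAG) == 0;
}
```

A dynamic loader would go on to parse the ELF headers found at
gate_dso_base() to resolve exported symbols; that step is not shown
here.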