  unshare system call
  ===================

  This document describes the new system call, unshare(). The document
  provides an overview of the feature, why it is needed, how it can
  be used, its interface specification, design, implementation and
  how it can be tested.

  Change Log
  ----------
  version 0.1  Initial document, Janak Desai (janak@us.ibm.com), Jan 11, 2006

  Contents
  --------
  	1) Overview
  	2) Benefits
  	3) Cost
  	4) Requirements
  	5) Functional Specification
  	6) High Level Design
  	7) Low Level Design
  	8) Test Specification
  	9) Future Work
  
  1) Overview
  -----------

  Most legacy operating system kernels support an abstraction of threads
  as multiple execution contexts within a process. These kernels provide
  special resources and mechanisms to maintain these "threads". The Linux
  kernel, in a clever and simple manner, does not make a distinction
  between processes and "threads". The kernel allows processes to share
  resources and thus they can achieve legacy "threads" behavior without
  requiring additional data structures and mechanisms in the kernel. The
  power of implementing threads in this manner comes not only from
  its simplicity but also from allowing application programmers to work
  outside the confinement of all-or-nothing shared resources of legacy
  threads. On Linux, at the time of thread creation using the clone system
  call, applications can selectively choose which resources to share
  between threads.

  The unshare() system call adds a primitive to the Linux thread model that
  allows threads to selectively 'unshare' any resources that were being
  shared at the time of their creation. unshare() was conceptualized by
  Al Viro in August 2000, on the Linux-Kernel mailing list, as part
  of the discussion on POSIX threads on Linux. unshare() augments the
  usefulness of Linux threads for applications that would like to control
  shared resources without creating a new process. unshare() is a natural
  addition to the set of available primitives on Linux that implement
  the concept of process/thread as a virtual machine.
  
  2) Benefits
  -----------
  
  unshare() would be useful to large application frameworks such as PAM
  where creating a new process to control sharing/unsharing of process
  resources is not possible. Since namespaces are shared by default
  when creating a new process using fork or clone, unshare() can benefit
  even non-threaded applications if they need to disassociate
  from the default shared namespace. The following lists two use cases
  where unshare() can be used.
  
  2.1 Per-security context namespaces
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  
  unshare() can be used to implement polyinstantiated directories using
  the kernel's per-process namespace mechanism. Polyinstantiated directories,
  such as per-user and/or per-security context instances of /tmp, /var/tmp or
  per-security context instances of a user's home directory, isolate user
  processes when working with these directories. Using unshare(), a PAM
  module can easily set up a private namespace for a user at login.
  Polyinstantiated directories are required for Common Criteria certification
  with Labeled System Protection Profile; however, with the availability
  of shared-tree feature in the Linux kernel, even regular Linux systems
  can benefit from setting up private namespaces at login and
  polyinstantiating /tmp, /var/tmp and other directories deemed
  appropriate by system administrators.
  
  2.2 unsharing of virtual memory and/or open files
f504d47be   Jonathan Corbet   docs: Convert uns...
77
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0d4c3e7a8   JANAK DESAI   [PATCH] unshare s...
78
79
  Consider a client/server application where the server is processing
  client requests by creating processes that share resources such as
  virtual memory and open files. Without unshare(), the server has to
  decide what needs to be shared at the time of creating the process
  which services the request. unshare() gives the server the ability to
  disassociate parts of the context during the servicing of the
  request. For large and complex middleware application frameworks, this
  ability to unshare() after the process was created can be very
  useful.
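
As a rough illustration of this use case, the sketch below creates a worker
with clone(2) and CLONE_FILES so that it initially shares the file descriptor
table, and then lets the worker call unshare(CLONE_FILES) before closing a
descriptor. The function names, the 64 KiB stack and the use of /dev/null are
assumptions made for this example only; they do not come from any real server.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Worker: starts out sharing the parent's file descriptor table. */
static int worker(void *arg)
{
	int fd = *(int *)arg;

	if (unshare(CLONE_FILES) == -1)
		return 2;	/* unshare() unavailable or refused */
	close(fd);		/* affects only the worker's private table */
	return 0;
}

/*
 * Returns 1 if the parent's descriptor survived the worker's close(),
 * 0 if it did not, and -1 if the experiment could not be run.
 */
int fd_survives_worker_close(void)
{
	const size_t stack_size = 64 * 1024;
	int fd, status, result = -1;
	char *stack;
	pid_t pid;

	fd = open("/dev/null", O_RDONLY);
	if (fd == -1)
		return -1;
	stack = malloc(stack_size);
	if (!stack) {
		close(fd);
		return -1;
	}

	/* The stack grows down on common architectures: pass its top. */
	pid = clone(worker, stack + stack_size, CLONE_FILES | SIGCHLD, &fd);
	if (pid != -1 && waitpid(pid, &status, 0) == pid &&
	    WIFEXITED(status) && WEXITSTATUS(status) == 0)
		result = (fcntl(fd, F_GETFD) != -1);

	free(stack);
	close(fd);
	return result;
}
```

A result of -1 is expected in environments where unshare() or clone() is
filtered, for example inside some sandboxes, so callers should treat it as
"not tested" and not as a failure.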
  
  3) Cost
  -------
  
  In order to not duplicate code and to handle the fact that unshare()
  works on an active task (as opposed to clone/fork working on a newly
  allocated inactive task), unshare() had to make minor reorganizational
  changes to the copy_* functions utilized by the clone/fork system calls.
  There is a cost associated with altering existing, well tested and
  stable code to implement a new feature that may not get exercised
  extensively in the beginning. However, with proper design and code
  review of the changes and creation of an unshare() test for the LTP,
  the benefits of this new feature can exceed its cost.
  
  4) Requirements
  ---------------
  
  unshare() reverses sharing that was done using the clone(2) system call,
  so unshare() should have an interface similar to clone(2). That is,
  since flags in clone(int flags, void \*stack) specify what should
  be shared, similar flags in unshare(int flags) should specify
  what should be unshared. Unfortunately, this may appear to invert
  the meaning of the flags from the way they are used in clone(2).
  However, there was no easy solution that was less confusing and that
  allowed incremental context unsharing in the future without an ABI change.
  The unshare() interface should accommodate possible future addition of
  new context flags without requiring a rebuild of old applications.
  If and when new context flags are added, the unshare() design should allow
  incremental unsharing of those resources on an as-needed basis.
  
  5) Functional Specification
  ---------------------------

  NAME
  	unshare - disassociate parts of the process execution context
  
  SYNOPSIS
  	#include <sched.h>
  
  	int unshare(int flags);
  
  DESCRIPTION
  	unshare() allows a process to disassociate parts of its execution
  	context that are currently being shared with other processes. Part
  	of the execution context, such as the namespace, is shared by default
  	when a new process is created using fork(2), while other parts,
  	such as the virtual memory, open file descriptors, etc., may be
  	shared by explicit request to share them when creating a process
  	using clone(2).

  	The main use of unshare() is to allow a process to control its
  	shared execution context without creating a new process.

  	The flags argument specifies one of the following constants, or
  	the bitwise OR of several of them.
  
  	CLONE_FS
  		If CLONE_FS is set, file system information of the caller
  		is disassociated from the shared file system information.
  
  	CLONE_FILES
  		If CLONE_FILES is set, the file descriptor table of the
  		caller is disassociated from the shared file descriptor
  		table.
  
  	CLONE_NEWNS
  		If CLONE_NEWNS is set, the namespace of the caller is
  		disassociated from the shared namespace.
  
  	CLONE_VM
  		If CLONE_VM is set, the virtual memory of the caller is
  		disassociated from the shared virtual memory.
  
  RETURN VALUE
  	On success, zero is returned. On failure, -1 is returned and errno is
  	set to indicate the error.
  
  ERRORS
  	EPERM	CLONE_NEWNS was specified by a non-root process (process
  		without CAP_SYS_ADMIN).
  
  	ENOMEM	Cannot allocate sufficient memory to copy parts of caller's
  		context that need to be unshared.
  
  	EINVAL	Invalid flag was specified as an argument.
  
  CONFORMING TO
  	The unshare() call is Linux-specific and should not be used
  	in programs intended to be portable.
  
  SEE ALSO
  	clone(2), fork(2)
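
The specification above can be exercised from user space roughly as follows.
This is a minimal sketch, assuming a glibc system where unshare() is declared
in <sched.h> under _GNU_SOURCE; the helper names unshare_error_hint() and
try_unshare() are invented for this example, and the hint strings simply
restate the ERRORS section.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Map the errno values listed under ERRORS to a short explanation. */
const char *unshare_error_hint(int err)
{
	switch (err) {
	case EPERM:
		return "CLONE_NEWNS requires CAP_SYS_ADMIN";
	case ENOMEM:
		return "not enough memory to copy the unshared context";
	case EINVAL:
		return "invalid flag";
	default:
		return "unexpected error";
	}
}

/* Returns 0 on success or the errno value reported by unshare(). */
int try_unshare(int flags)
{
	if (unshare(flags) == -1) {
		int err = errno;

		fprintf(stderr, "unshare(0x%x): %s (%s)\n",
			(unsigned int)flags, strerror(err),
			unshare_error_hint(err));
		return err;
	}
	return 0;
}
```

For example, try_unshare(CLONE_FS | CLONE_FILES) requests a private copy of
both the file system information and the file descriptor table in one call.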
  
  6) High Level Design
  --------------------
  
  Depending on the flags argument, the unshare() system call allocates
  appropriate process context structures, populates them with values from
  the current shared version, associates newly duplicated structures
  with the current task structure and releases corresponding shared
  versions. Helper functions of clone (copy_*) could not be used
  directly by unshare() because of the following two reasons.

    1) clone operates on a newly allocated not-yet-active task
       structure, whereas unshare() operates on the current active
       task. Therefore unshare() has to take appropriate task_lock()
       before associating newly duplicated context structures.

    2) unshare() has to allocate and duplicate all context structures
       that are being unshared, before associating them with the
       current task and releasing older shared structures. Failure
       to do so will create race conditions and/or oops when trying
       to back out due to an error. Consider the case of unsharing
       both virtual memory and namespace. After successfully unsharing
       vm, if the system call encounters an error while allocating
       new namespace structure, the error return code will have to
       reverse the unsharing of vm. As part of the reversal the
       system call will have to go back to older, shared, vm
       structure, which may not exist anymore.
  
  Therefore code from copy_* functions that allocated and duplicated
  current context structure was moved into new dup_* functions. Now,
  copy_* functions call dup_* functions to allocate and duplicate
  appropriate context structures and then associate them with the
  task structure that is being constructed. The unshare() system call, on
  the other hand, performs the following:

    1) Check flags to force missing, but implied, flags

    2) For each context structure, call the corresponding unshare()
       helper function to allocate and duplicate a new context
       structure, if the appropriate bit is set in the flags argument.

    3) If there is no error in allocation and duplication and there
       are new context structures then lock the current task structure,
       associate new context structures with the current task structure,
       and release the lock on the current task structure.

    4) Appropriately release older, shared, context structures.
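
Steps 2 to 4 can be modelled in user space as shown below. This is only a
sketch of the ordering argument and not kernel code: the structures, the
MODEL_* flag bits, the pthread mutex standing in for task_lock(), and the
allocs_until_failure test hook are all assumptions made for illustration.
Step 1 (forcing implied flags) is left out.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

#define MODEL_FS	0x1	/* hypothetical flag bits for this model */
#define MODEL_FILES	0x2

struct ctx { int refs; int data; };

struct task {
	pthread_mutex_t lock;	/* stands in for task_lock() */
	struct ctx *fs;
	struct ctx *files;
};

/* Test hook: successful allocations left before one failure; -1 = never. */
static int allocs_until_failure = -1;

static struct ctx *dup_ctx(const struct ctx *old)
{
	struct ctx *copy;

	if (allocs_until_failure == 0) {
		allocs_until_failure = -1;
		return NULL;
	}
	if (allocs_until_failure > 0)
		allocs_until_failure--;
	copy = malloc(sizeof(*copy));
	if (copy) {
		copy->refs = 1;
		copy->data = old->data;
	}
	return copy;
}

static void put_ctx(struct ctx *c)
{
	if (c && --c->refs == 0)
		free(c);
}

int model_unshare(struct task *t, int flags)
{
	struct ctx *new_fs = NULL, *new_files = NULL, *old;

	/* Step 2: allocate and duplicate everything before touching t. */
	if ((flags & MODEL_FS) && !(new_fs = dup_ctx(t->fs)))
		return -ENOMEM;
	if ((flags & MODEL_FILES) && !(new_files = dup_ctx(t->files))) {
		put_ctx(new_fs);	/* back out; t is still untouched */
		return -ENOMEM;
	}

	/* Step 3: switch the pointers while holding the lock. */
	pthread_mutex_lock(&t->lock);
	if (new_fs) {
		old = t->fs;
		t->fs = new_fs;
		new_fs = old;
	}
	if (new_files) {
		old = t->files;
		t->files = new_files;
		new_files = old;
	}
	pthread_mutex_unlock(&t->lock);

	/* Step 4: drop the references to the older, shared structures. */
	put_ctx(new_fs);
	put_ctx(new_files);
	return 0;
}
```

Because nothing in the task is modified until every duplication has
succeeded, the error path only has to free the new copies and never needs to
return to a shared structure that might already be gone.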
  
  7) Low Level Design
  -------------------
  
  Implementation of unshare() can be grouped into the following four
  items:

    a) Reorganization of existing copy_* functions

    b) unshare() system call service function

    c) unshare() helper functions for each different process context

    d) Registration of system call number for different architectures

  7.1) Reorganization of copy_* functions
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  
  Each copy function such as copy_mm, copy_namespace, copy_files,
  etc, had roughly two components. The first component allocated
  and duplicated the appropriate structure and the second component
  linked it to the task structure passed in as an argument to the copy
  function. The first component was split into its own function.
  These dup_* functions allocated and duplicated the appropriate
  context structure. The reorganized copy_* functions invoked
  their corresponding dup_* functions and then linked the newly
  duplicated structures to the task structure with which the
  copy function was called.
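
The split described above might look roughly like this in a user-space model.
The structure layouts, the MODEL_CLONE_FS bit and the function names are
hypothetical stand-ins chosen for this sketch; the real kernel functions take
different arguments.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define MODEL_CLONE_FS	0x1	/* hypothetical stand-in for CLONE_FS */

struct fs_info { int refs; int umask; };
struct mtask { struct fs_info *fs; };

/* First component: allocate and duplicate, with no task involved. */
struct fs_info *dup_fs_info(const struct fs_info *old)
{
	struct fs_info *copy = malloc(sizeof(*copy));

	if (copy) {
		*copy = *old;
		copy->refs = 1;
	}
	return copy;
}

/* Second component: link the result to the task being constructed. */
int copy_fs_info(int clone_flags, const struct mtask *parent,
		 struct mtask *child)
{
	if (clone_flags & MODEL_CLONE_FS) {
		parent->fs->refs++;	/* share: just take a reference */
		child->fs = parent->fs;
		return 0;
	}
	child->fs = dup_fs_info(parent->fs);
	return child->fs ? 0 : -ENOMEM;
}
```

With this shape, an unshare-style caller can reuse dup_fs_info() directly and
decide for itself when to attach the copy to the running task.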
  
  7.2) unshare() system call service function
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
         * Check flags.
           Force implied flags. If CLONE_THREAD is set, force CLONE_VM.
           If CLONE_VM is set, force CLONE_SIGHAND. If CLONE_SIGHAND is
           set and signals are also being shared, force CLONE_THREAD. If
           CLONE_NEWNS is set, force CLONE_FS.

         * For each context flag, invoke the corresponding unshare_*
           helper routine with flags passed into the system call and a
           reference to a pointer to the new unshared structure.

         * If any new structures are created by unshare_* helper
           functions, take the task_lock() on the current task,
           modify appropriate context pointers, and release the
           task lock.

         * For all newly unshared structures, release the corresponding
           older, shared, structures.
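
The flag check in the first step can be written as a small pure function. The
sketch below applies the rules once, in the order the text lists them; the
function name and the signals_shared parameter (standing in for "signals are
also being shared") are assumptions of this example, and the CLONE_*
constants come from <sched.h>.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdbool.h>

/* Return flags with the implied flags forced on. */
unsigned long force_implied_flags(unsigned long flags, bool signals_shared)
{
	if (flags & CLONE_THREAD)
		flags |= CLONE_VM;
	if (flags & CLONE_VM)
		flags |= CLONE_SIGHAND;
	if ((flags & CLONE_SIGHAND) && signals_shared)
		flags |= CLONE_THREAD;
	if (flags & CLONE_NEWNS)
		flags |= CLONE_FS;
	return flags;
}
```

For instance, a request for CLONE_NEWNS alone is widened to
CLONE_NEWNS | CLONE_FS, while a request for CLONE_FILES alone is left
unchanged.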
  
  7.3) unshare_* helper functions
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  For unshare_* helpers corresponding to CLONE_SYSVSEM, CLONE_SIGHAND,
  and CLONE_THREAD, return -EINVAL since they are not implemented yet.
  For others, check the flag value to see if the unsharing is
  required for that structure. If it is, invoke the corresponding
  dup_* function to allocate and duplicate the structure and return
  a pointer to it.
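
A user-space sketch of the two kinds of helper follows. It assumes a
simplified fs_model structure and invented function names; only the control
flow (reject unimplemented flags, otherwise duplicate on request and hand the
copy back through a pointer argument) is taken from the description above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stdlib.h>

struct fs_model { int umask; };	/* hypothetical stand-in structure */

static struct fs_model *dup_fs_model(const struct fs_model *old)
{
	struct fs_model *copy = malloc(sizeof(*copy));

	if (copy)
		*copy = *old;
	return copy;
}

/* Helper for a context whose unsharing is not implemented. */
int unshare_sighand_model(unsigned long flags)
{
	return (flags & CLONE_SIGHAND) ? -EINVAL : 0;
}

/* Helper for an implemented context: duplicate only when requested. */
int unshare_fs_model(unsigned long flags, const struct fs_model *cur,
		     struct fs_model **new_fsp)
{
	*new_fsp = NULL;
	if (!(flags & CLONE_FS))
		return 0;	/* nothing to do for this context */
	*new_fsp = dup_fs_model(cur);
	return *new_fsp ? 0 : -ENOMEM;
}
```

The caller owns any structure returned through new_fsp and is expected to
attach it to the task or free it on a later error.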
  
  7.4) Finally
  ~~~~~~~~~~~~
  
  Appropriately modify architecture specific code to register the
  new system call.
  
  8) Test Specification
  ---------------------
  
  The test for unshare() should test the following:

    1) Valid flags: Test to check that clone flags for signal and
       signal handlers, for which unsharing is not implemented
       yet, return -EINVAL.

    2) Missing/implied flags: Test to make sure that unsharing the
       namespace without specifying unsharing of filesystem correctly
       unshares both namespace and filesystem information.

    3) For each of the four (namespace, filesystem, files and vm)
       supported unsharing, verify that the system call correctly
       unshares the appropriate structure. Verify that unsharing
       them individually as well as in combination with each
       other works as expected.

    4) Concurrent execution: Use shared memory segments and futex on
       an address in the shm segment to synchronize execution of
       about 10 threads. Have a couple of threads execute execve,
       a couple _exit and the rest unshare with different combinations
       of flags. Verify that unsharing is performed as expected and
       that there are no oops or hangs.
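
One possible shape for the filesystem case of item 3 is sketched below. It is
not the LTP test: the function names, the choice of target directories and
the 64 KiB child stack are assumptions of this example. The child is created
with clone(2) and CLONE_FS so that it shares file system information, calls
unshare(CLONE_FS), and then changes directory; the parent checks that its own
working directory did not move.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Child: shares fs information with the parent until it unshares. */
static int child_fn(void *arg)
{
	const char *target = arg;

	if (unshare(CLONE_FS) == -1)
		return 2;		/* cannot run the test here */
	return chdir(target) == 0 ? 0 : 3;
}

/*
 * Returns 1 if the parent's working directory is unaffected by the
 * child's chdir(), 0 if it changed, and -1 if the test could not run.
 */
int test_unshare_fs(void)
{
	const size_t stack_size = 64 * 1024;
	char before[PATH_MAX], after[PATH_MAX];
	const char *target;
	int status, result = -1;
	char *stack;
	pid_t pid;

	if (!getcwd(before, sizeof(before)))
		return -1;
	/* Pick a directory that differs from the current one. */
	target = strcmp(before, "/") ? "/" : "/tmp";

	stack = malloc(stack_size);
	if (!stack)
		return -1;
	pid = clone(child_fn, stack + stack_size, CLONE_FS | SIGCHLD,
		    (void *)target);
	if (pid != -1 && waitpid(pid, &status, 0) == pid &&
	    WIFEXITED(status) && WEXITSTATUS(status) == 0 &&
	    getcwd(after, sizeof(after)))
		result = (strcmp(before, after) == 0);

	free(stack);
	return result;
}
```

The same pattern could be repeated for CLONE_FILES, and combined flags could
be passed in a single unshare() call to cover the "in combination" part of
the item.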
  
  9) Future Work
  --------------
  
  The current implementation of unshare() does not allow unsharing of
  signals and signal handlers. Signals are complex to begin with and
  to unshare signals and/or signal handlers of a currently running
  process is even more complex. If in the future there is a specific
  need to allow unsharing of signals and/or signal handlers, it can
  be incrementally added to unshare() without affecting legacy
  applications using unshare().