Doug / smarc-fsl-linux-kernel | Embedian Git Server

17 Nov, 2013

1 commit

b29c8306a Merge tag 'trace-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace ... Browse Code »

Pull tracing update from Steven Rostedt:
"This batch of changes is mostly clean ups and small bug fixes. The
only real feature that was added this release is from Namhyung Kim,
who introduced "set_graph_notrace" filter that lets you run the
function graph tracer and not trace particular functions and their
call chain.

Tom Zanussi added some updates to the ftrace multibuffer tracing that
made it more consistent with the top level tracing.

One of the fixes for perf function tracing required an API change in
RCU; the addition of "rcu_is_watching()". As Paul McKenney is pushing
that change in this release too, he gave me a branch that included all
the changes to get that working, and I pulled that into my tree in
order to complete the perf function tracing fix"

* tag 'trace-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Add rcu annotation for syscall trace descriptors
tracing: Do not use signed enums with unsigned long long in fgragh output
tracing: Remove unused function ftrace_off_permanent()
tracing: Do not assign filp->private_data to freed memory
tracing: Add helper function tracing_is_disabled()
tracing: Open tracer when ftrace_dump_on_oops is used
tracing: Add support for SOFT_DISABLE to syscall events
tracing: Make register/unregister_ftrace_command __init
tracing: Update event filters for multibuffer
recordmcount.pl: Add support for __fentry__
ftrace: Have control op function callback only trace when RCU is watching
rcu: Do not trace rcu_is_watching() functions
ftrace/x86: skip over the breakpoint for ftrace caller
trace/trace_stat: use rbtree postorder iteration helper instead of opencoding
ftrace: Add set_graph_notrace filter
ftrace: Narrow down the protected area of graph_lock
ftrace: Introduce struct ftrace_graph_data
ftrace: Get rid of ftrace_graph_filter_enabled
tracing: Fix potential out-of-bounds in trace_get_user()
tracing: Show more exact help information about snapshot

Linus Torvalds
2013-11-17 04:23:18 +0800

13 Nov, 2013

1 commit

83460ec8d syscalls.h: use gcc alias instead of assembler aliases for syscalls ... Browse Code »

Use standard gcc __attribute__((alias(foo))) to define the syscall aliases
instead of custom assembler macros.

This is far cleaner, and also fixes my LTO kernel build.

Signed-off-by: Andi Kleen
Cc: Al Viro
Cc: Geert Uytterhoeven
Cc: Tetsuo Handa
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andi Kleen
2013-11-13 11:09:12 +0800

06 Nov, 2013

2 commits

d562aff93 tracing: Add support for SOFT_DISABLE to syscall events ... Browse Code »

The original SOFT_DISABLE patches didn't add support for soft disable
of syscall events; this adds it.

Add an array of ftrace_event_file pointers indexed by syscall number
to the trace array and remove the existing enabled bitmaps, which as a
result are now redundant. The ftrace_event_file structs in turn
contain the soft disable flags we need for per-syscall soft disable
accounting.

Adding ftrace_event_files also means we can remove the USE_CALL_FILTER
bit, thus enabling multibuffer filter support for syscall events.

Link: http://lkml.kernel.org/r/6e72b566e85d8df8042f133efbc6c30e21fb017e.1382620672.git.tom.zanussi@linux.intel.com

Signed-off-by: Tom Zanussi
Signed-off-by: Steven Rostedt

Tom Zanussi
2013-11-06 06:48:49 +0800
f306cc82a tracing: Update event filters for multibuffer ... Browse Code »

The trace event filters are still tied to event calls rather than
event files, which means you don't get what you'd expect when using
filters in the multibuffer case:

Before:

# echo 'bytes_alloc > 8192' > /sys/kernel/debug/tracing/events/kmem/kmalloc/filter
# cat /sys/kernel/debug/tracing/events/kmem/kmalloc/filter
bytes_alloc > 8192
# mkdir /sys/kernel/debug/tracing/instances/test1
# echo 'bytes_alloc > 2048' > /sys/kernel/debug/tracing/instances/test1/events/kmem/kmalloc/filter
# cat /sys/kernel/debug/tracing/events/kmem/kmalloc/filter
bytes_alloc > 2048
# cat /sys/kernel/debug/tracing/instances/test1/events/kmem/kmalloc/filter
bytes_alloc > 2048

Setting the filter in tracing/instances/test1/events shouldn't affect
the same event in tracing/events as it does above.

After:

# echo 'bytes_alloc > 8192' > /sys/kernel/debug/tracing/events/kmem/kmalloc/filter
# cat /sys/kernel/debug/tracing/events/kmem/kmalloc/filter
bytes_alloc > 8192
# mkdir /sys/kernel/debug/tracing/instances/test1
# echo 'bytes_alloc > 2048' > /sys/kernel/debug/tracing/instances/test1/events/kmem/kmalloc/filter
# cat /sys/kernel/debug/tracing/events/kmem/kmalloc/filter
bytes_alloc > 8192
# cat /sys/kernel/debug/tracing/instances/test1/events/kmem/kmalloc/filter
bytes_alloc > 2048

We'd like to just move the filter directly from ftrace_event_call to
ftrace_event_file, but there are a couple cases that don't yet have
multibuffer support and therefore have to continue using the current
event_call-based filters. For those cases, a new USE_CALL_FILTER bit
is added to the event_call flags, whose main purpose is to keep the
old behavior for those cases until they can be updated with
multibuffer support; at that point, the USE_CALL_FILTER flag (and the
new associated call_filter_check_discard() function) can go away.

The multibuffer support also made filter_current_check_discard()
redundant, so this change removes that function as well and replaces
it with filter_check_discard() (or call_filter_check_discard() as
appropriate).

Link: http://lkml.kernel.org/r/f16e9ce4270c62f46b2e966119225e1c3cca7e60.1382620672.git.tom.zanussi@linux.intel.com

Signed-off-by: Tom Zanussi
Signed-off-by: Steven Rostedt

Tom Zanussi
2013-11-06 05:50:20 +0800

12 Sep, 2013

1 commit

f9597f24c syscalls.h: add forward declarations for inplace syscall wrappers ... Browse Code »

Unclutter -Wmissing-prototypes warning types (enabled at make W=1)

linux/include/linux/syscalls.h:190:18: warning: no previous prototype for 'SyS_semctl' [-Wmissing-prototypes]
asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__)) \
^
linux/include/linux/syscalls.h:183:2: note: in expansion of macro '__SYSCALL_DEFINEx'
__SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
^
by adding forward declarations right before definitions.

Signed-off-by: Sergei Trofimovich
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sergei Trofimovich
2013-09-12 06:58:25 +0800

14 Aug, 2013

1 commit

dfa9771a7 microblaze: fix clone syscall ... Browse Code »

Fix inadvertent breakage in the clone syscall ABI for Microblaze that
was introduced in commit f3268edbe6fe ("microblaze: switch to generic
fork/vfork/clone").

The Microblaze syscall ABI for clone takes the parent tid address in the
4th argument; the third argument slot is used for the stack size. The
incorrectly-used CLONE_BACKWARDS type assigned parent tid to the 3rd
slot.

This commit restores the original ABI so that existing userspace libc
code will work correctly.

All kernel versions from v3.8-rc1 were affected.

Signed-off-by: Michal Simek
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Simek
2013-08-14 08:57:48 +0800

06 Mar, 2013

2 commits

99e621f79 syscalls.h: slightly reduce the jungles of macros ... Browse Code »

a) teach __MAP(num, m, ) to take empty
list (with num being 0, of course)
b) fold types__... and args__... declaration and initialization into
SYSCALL_METADATA(num, ...), making their use conditional on num != 0.
That allows to use the SYSCALL_METADATA instead of its near-duplicate
in SYSCALL_DEFINE0.
c) make SYSCALL_METADATA expand to nothing in case if CONFIG_FTRACE_SYSCALLS
is not defined; that allows to make SYSCALL_DEFINE0 and SYSCALL_DEFINEx
definitions independent from CONFIG_FTRACE_SYSCALLS.
d) kill SYSCALL_DEFINE - no users left (SYSCALL_DEFINE[0-6] is, of course,
still alive and well).

Signed-off-by: Al Viro

Al Viro
2013-03-06 04:36:40 +0800
e1fd1f490 get rid of union semop in sys_semctl(2) arguments ... Browse Code »

just have the bugger take unsigned long and deal with SETVAL
case (when we use an int member in the union) explicitly.

Signed-off-by: Al Viro

Al Viro
2013-03-06 04:14:16 +0800

04 Mar, 2013

5 commits

2cf096668 make SYSCALL_DEFINE<n>-generated wrappers do asmlinkage_protect ... Browse Code »

... and switch i386 to HAVE_SYSCALL_WRAPPERS, killing open-coded
uses of asmlinkage_protect() in a bunch of syscalls.

Signed-off-by: Al Viro

Al Viro
2013-03-04 11:58:33 +0800
22d1a35da make HAVE_SYSCALL_WRAPPERS unconditional ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2013-03-04 11:58:30 +0800
e1b5bb6d1 consolidate cond_syscall and SYSCALL_ALIAS declarations ... Browse Code »

take them to asm/linkage.h, with default in linux/linkage.h

Signed-off-by: Al Viro

Al Viro
2013-03-04 11:55:19 +0800
4a0fd5bf0 teach SYSCALL_DEFINE<n> how to deal with long long/unsigned long long ... Browse Code »

... and convert a bunch of SYSCALL_DEFINE ones to SYSCALL_DEFINE,
killing the boilerplate crap around them.

Signed-off-by: Al Viro

Al Viro
2013-03-04 11:46:22 +0800
07fe6e00f get rid of duplicate logics in __SC_....[1-6] definitions ... Browse Code »

All those guys have the same form - "take a list of type/name pairs,
apply some macro to each of them". Abstract that part away, convert
all __SC_FOO##x(__VA_ARGS__) to __MAP(x,__SC_FOO,__VA_ARGS__).

Signed-off-by: Al Viro

Al Viro
2013-03-04 11:46:21 +0800

14 Feb, 2013

1 commit

d64008a8f burying unused conditionals ... Browse Code »

__ARCH_WANT_SYS_RT_SIGACTION,
__ARCH_WANT_SYS_RT_SIGSUSPEND,
__ARCH_WANT_COMPAT_SYS_RT_SIGSUSPEND,
__ARCH_WANT_COMPAT_SYS_SCHED_RR_GET_INTERVAL - not used anymore
CONFIG_GENERIC_{SIGALTSTACK,COMPAT_RT_SIG{ACTION,QUEUEINFO,PENDING,PROCMASK}} -
can be assumed always set.

Al Viro
2013-02-14 22:21:15 +0800

04 Feb, 2013

5 commits

0aa0203fb take sys_rt_sigsuspend() prototype to linux/syscalls.h ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2013-02-04 07:14:23 +0800
495dfbf76 generic sys_sigaction() and compat_sys_sigaction() ... Browse Code »

conditional on OLD_SIGACTION/COMPAT_OLD_SIGACTION

Signed-off-by: Al Viro

Al Viro
2013-02-04 04:09:23 +0800
574c4866e consolidate kernel-side struct sigaction declarations ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2013-02-04 04:09:22 +0800
0a0e8cdf7 old sigsuspend variants in kernel/signal.c ... Browse Code »

conditional on OLD_SIGSUSPEND/OLD_SIGSUSPEND3, depending on which
variety of that fossil is needed.

Signed-off-by: Al Viro

Al Viro
2013-02-04 04:09:20 +0800
eaca6eae3 sanitize rt_sigaction() situation a bit ... Browse Code »

Switch from __ARCH_WANT_SYS_RT_SIGACTION to opposite
(!CONFIG_ODD_RT_SIGACTION); the only two architectures that
need it are alpha and sparc. The reason for use of CONFIG_...
instead of __ARCH_... is that it's needed only kernel-side
and doing it that way avoids a mess with include order on many
architectures.

Signed-off-by: Al Viro

Al Viro
2013-02-04 04:09:18 +0800

21 Dec, 2012

1 commit

54d46ea99 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal ... Browse Code »

Pull signal handling cleanups from Al Viro:
"sigaltstack infrastructure + conversion for x86, alpha and um,
COMPAT_SYSCALL_DEFINE infrastructure.

Note that there are several conflicts between "unify
SS_ONSTACK/SS_DISABLE definitions" and UAPI patches in mainline;
resolution is trivial - just remove definitions of SS_ONSTACK and
SS_DISABLED from arch/*/uapi/asm/signal.h; they are all identical and
include/uapi/linux/signal.h contains the unified variant."

Fixed up conflicts as per Al.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
alpha: switch to generic sigaltstack
new helpers: __save_altstack/__compat_save_altstack, switch x86 and um to those
generic compat_sys_sigaltstack()
introduce generic sys_sigaltstack(), switch x86 and um to it
new helper: compat_user_stack_pointer()
new helper: restore_altstack()
unify SS_ONSTACK/SS_DISABLE definitions
new helper: current_user_stack_pointer()
missing user_stack_pointer() instances
Bury the conditionals from kernel_thread/kernel_execve series
COMPAT_SYSCALL_DEFINE: infrastructure

Linus Torvalds
2012-12-21 10:05:28 +0800

20 Dec, 2012

2 commits

6bf9adfc9 introduce generic sys_sigaltstack(), switch x86 and um to it ... Browse Code »

Conditional on CONFIG_GENERIC_SIGALTSTACK; architectures that do not
select it are completely unaffected

Signed-off-by: Al Viro

Al Viro
2012-12-20 07:07:40 +0800
ae903caae Bury the conditionals from kernel_thread/kernel_execve series ... Browse Code »

All architectures have
CONFIG_GENERIC_KERNEL_THREAD
CONFIG_GENERIC_KERNEL_EXECVE
__ARCH_WANT_SYS_EXECVE
None of them have __ARCH_WANT_KERNEL_EXECVE and there are only two callers
of kernel_execve() (which is a trivial wrapper for do_execve() now) left.
Kill the conditionals and make both callers use do_execve().

Signed-off-by: Al Viro

Al Viro
2012-12-20 07:07:38 +0800

19 Dec, 2012

1 commit

7a684c452 Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux ... Browse Code »

Pull module update from Rusty Russell:
"Nothing all that exciting; a new module-from-fd syscall for those who
want to verify the source of the module (ChromeOS) and/or use standard
IMA on it or other security hooks."

* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
MODSIGN: Fix kbuild output when using default extra_certificates
MODSIGN: Avoid using .incbin in C source
modules: don't hand 0 to vmalloc.
module: Remove a extra null character at the top of module->strtab.
ASN.1: Use the ASN1_LONG_TAG and ASN1_INDEFINITE_LENGTH constants
ASN.1: Define indefinite length marker constant
moduleparam: use __UNIQUE_ID()
__UNIQUE_ID()
MODSIGN: Add modules_sign make target
powerpc: add finit_module syscall.
ima: support new kernel module syscall
add finit_module syscall to asm-generic
ARM: add finit_module syscall to ARM
security: introduce kernel_module_from_file hook
module: add flags arg to sys_finit_module()
module: add syscall to load module from fd

Linus Torvalds
2012-12-19 23:55:08 +0800

18 Dec, 2012

1 commit

965c8e59c lseek: the "whence" argument is called "whence" ... Browse Code »

But the kernel decided to call it "origin" instead. Fix most of the
sites.

Acked-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Morton
2012-12-18 09:15:12 +0800

14 Dec, 2012

2 commits

2f3238aeb module: add flags arg to sys_finit_module() ... Browse Code »

Thanks to Michael Kerrisk for keeping us honest. These flags are actually
useful for eliminating the only case where kmod has to mangle a module's
internals: for overriding module versioning.

Signed-off-by: Rusty Russell
Acked-by: Lucas De Marchi
Acked-by: Kees Cook

Rusty Russell
2012-12-14 10:35:23 +0800
34e1169d9 module: add syscall to load module from fd ... Browse Code »

As part of the effort to create a stronger boundary between root and
kernel, Chrome OS wants to be able to enforce that kernel modules are
being loaded only from our read-only crypto-hash verified (dm_verity)
root filesystem. Since the init_module syscall hands the kernel a module
as a memory blob, no reasoning about the origin of the blob can be made.

Earlier proposals for appending signatures to kernel modules would not be
useful in Chrome OS, since it would involve adding an additional set of
keys to our kernel and builds for no good reason: we already trust the
contents of our root filesystem. We don't need to verify those kernel
modules a second time. Having to do signature checking on module loading
would slow us down and be redundant. All we need to know is where a
module is coming from so we can say yes/no to loading it.

If a file descriptor is used as the source of a kernel module, many more
things can be reasoned about. In Chrome OS's case, we could enforce that
the module lives on the filesystem we expect it to live on. In the case
of IMA (or other LSMs), it would be possible, for example, to examine
extended attributes that may contain signatures over the contents of
the module.

This introduces a new syscall (on x86), similar to init_module, that has
only two arguments. The first argument is used as a file descriptor to
the module and the second argument is a pointer to the NULL terminated
string of module arguments.

Signed-off-by: Kees Cook
Cc: Andrew Morton
Signed-off-by: Rusty Russell (merge fixes)

Kees Cook
2012-12-14 10:35:22 +0800

29 Nov, 2012

3 commits

24465a40b take sys_fork/sys_vfork/sys_clone prototypes to linux/syscalls.h ... Browse Code »

now it can be done...

Signed-off-by: Al Viro

Al Viro
2012-11-29 12:43:27 +0800
da3d4c5fa get rid of pt_regs argument of do_execve() ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2012-11-29 10:53:37 +0800
6b94631f9 consolidate sys_execve() prototype ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2012-11-29 10:53:35 +0800

13 Oct, 2012

1 commit

a74fb73c1 infrastructure for saner ret_from_kernel_thread semantics ... Browse Code »

* allow kernel_execve() leave the actual return to userland to
caller (selected by CONFIG_GENERIC_KERNEL_EXECVE). Callers
updated accordingly.
* architecture that does select GENERIC_KERNEL_EXECVE in its
Kconfig should have its ret_from_kernel_thread() do this:
call schedule_tail
call the callback left for it by copy_thread(); if it ever
returns, that's because it has just done successful kernel_execve()
jump to return from syscall
IOW, its only difference from ret_from_fork() is that it does call the
callback.
* such an architecture should also get rid of ret_from_kernel_execve()
and __ARCH_WANT_KERNEL_EXECVE

This is the last part of infrastructure patches in that area - from
that point on work on different architectures can live independently.

Signed-off-by: Al Viro

Al Viro
2012-10-13 01:35:07 +0800

01 Jun, 2012

1 commit

d97b46a64 syscalls, x86: add __NR_kcmp syscall ... Browse Code »

While doing the checkpoint-restore in the user space one need to determine
whether various kernel objects (like mm_struct-s of file_struct-s) are
shared between tasks and restore this state.

The 2nd step can be solved by using appropriate CLONE_ flags and the
unshare syscall, while there's currently no ways for solving the 1st one.

One of the ways for checking whether two tasks share e.g. mm_struct is to
provide some mm_struct ID of a task to its proc file, but showing such
info considered to be not that good for security reasons.

Thus after some debates we end up in conclusion that using that named
'comparison' syscall might be the best candidate. So here is it --
__NR_kcmp.

It takes up to 5 arguments - the pids of the two tasks (which
characteristics should be compared), the comparison type and (in case of
comparison of files) two file descriptors.

Lookups for pids are done in the caller's PID namespace only.

At moment only x86 is supported and tested.

[akpm@linux-foundation.org: fix up selftests, warnings]
[akpm@linux-foundation.org: include errno.h]
[akpm@linux-foundation.org: tweak comment text]
Signed-off-by: Cyrill Gorcunov
Acked-by: "Eric W. Biederman"
Cc: Pavel Emelyanov
Cc: Andrey Vagin
Cc: KOSAKI Motohiro
Cc: Ingo Molnar
Cc: H. Peter Anvin
Cc: Thomas Gleixner
Cc: Glauber Costa
Cc: Andi Kleen
Cc: Tejun Heo
Cc: Matt Helsley
Cc: Pekka Enberg
Cc: Eric Dumazet
Cc: Vasiliy Kulikov
Cc: Alexey Dobriyan
Cc: Valdis.Kletnieks@vt.edu
Cc: Michal Marek
Cc: Frederic Weisbecker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2012-06-01 08:49:32 +0800

05 Mar, 2012

1 commit

187f1882b BUG: headers with BUG/BUG_ON etc. need linux/bug.h ... Browse Code »

If a header file is making use of BUG, BUG_ON, BUILD_BUG_ON, or any
other BUG variant in a static inline (i.e. not in a #define) then
that header really should be including and not just
expecting it to be implicitly present.

We can make this change risk-free, since if the files using these
headers didn't have exposure to linux/bug.h already, they would have
been causing compile failures/warnings.

Signed-off-by: Paul Gortmaker

Paul Gortmaker
2012-03-05 06:54:34 +0800

22 Feb, 2012

1 commit

faf309009 sys_poll: fix incorrect type for 'timeout' parameter ... Browse Code »

The 'poll()' system call timeout parameter is supposed to be 'int', not
'long'.

Now, the reason this matters is that right now 32-bit compat mode is
broken on at least x86-64, because the 32-bit code just calls
'sys_poll()' directly on x86-64, and the 32-bit argument will have been
zero-extended, turning a signed 'int' into a large unsigned 'long'
value.

We could just introduce a 'compat_sys_poll()' function for this, and
that may eventually be what we have to do, but since the actual standard
poll() semantics is *supposed* to be 'int', and since at least on x86-64
glibc sign-extends the argument before invocing the system call (so
nobody can actually use a 64-bit timeout value in user space _anyway_,
even in 64-bit binaries), the simpler solution would seem to be to just
fix the definition of the system call to match what it should have been
from the very start.

If it turns out that somebody somehow circumvents the user-level libc
64-bit sign extension and actually uses a large unsigned 64-bit timeout
despite that not being how poll() is supposed to work, we will need to
do the compat_sys_poll() approach.

Reported-by: Thomas Meyer
Acked-by: Eric Dumazet
Signed-off-by: Linus Torvalds

Linus Torvalds
2012-02-22 09:24:20 +0800

04 Jan, 2012

5 commits

a218d0fdc switch open and mkdir syscalls to umode_t ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2012-01-04 11:55:19 +0800
1bc94226d switch spu_create(2) to use of SYSCALL_DEFINE4, make it use umode_t ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2012-01-04 11:55:16 +0800
df0a42837 switch mq_open() to umode_t Browse Code »

Al Viro
2012-01-04 11:55:16 +0800
49f0a0767 switch sys_chmod()/sys_fchmod()/sys_fchmodat() to umode_t ... Browse Code »

SYSCALLx magic should take care of things, according to Linus...

Signed-off-by: Al Viro

Al Viro
2012-01-04 11:55:12 +0800
8208a22bb switch sys_mknodat(2) to umode_t ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2012-01-04 11:54:52 +0800

01 Nov, 2011

1 commit

fcf634098 Cross Memory Attach ... Browse Code »

The basic idea behind cross memory attach is to allow MPI programs doing
intra-node communication to do a single copy of the message rather than a
double copy of the message via shared memory.

The following patch attempts to achieve this by allowing a destination
process, given an address and size from a source process, to copy memory
directly from the source process into its own address space via a system
call. There is also a symmetrical ability to copy from the current
process's address space into a destination process's address space.

- Use of /proc/pid/mem has been considered, but there are issues with
using it:
- Does not allow for specifying iovecs for both src and dest, assuming
preadv or pwritev was implemented either the area read from or
written to would need to be contiguous.
- Currently mem_read allows only processes who are currently
ptrace'ing the target and are still able to ptrace the target to read
from the target. This check could possibly be moved to the open call,
but its not clear exactly what race this restriction is stopping
(reason appears to have been lost)
- Having to send the fd of /proc/self/mem via SCM_RIGHTS on unix
domain socket is a bit ugly from a userspace point of view,
especially when you may have hundreds if not (eventually) thousands
of processes that all need to do this with each other
- Doesn't allow for some future use of the interface we would like to
consider adding in the future (see below)
- Interestingly reading from /proc/pid/mem currently actually
involves two copies! (But this could be fixed pretty easily)

As mentioned previously use of vmsplice instead was considered, but has
problems. Since you need the reader and writer working co-operatively if
the pipe is not drained then you block. Which requires some wrapping to
do non blocking on the send side or polling on the receive. In all to all
communication it requires ordering otherwise you can deadlock. And in the
example of many MPI tasks writing to one MPI task vmsplice serialises the
copying.

There are some cases of MPI collectives where even a single copy interface
does not get us the performance gain we could. For example in an
MPI_Reduce rather than copy the data from the source we would like to
instead use it directly in a mathops (say the reduce is doing a sum) as
this would save us doing a copy. We don't need to keep a copy of the data
from the source. I haven't implemented this, but I think this interface
could in the future do all this through the use of the flags - eg could
specify the math operation and type and the kernel rather than just
copying the data would apply the specified operation between the source
and destination and store it in the destination.

Although we don't have a "second user" of the interface (though I've had
some nibbles from people who may be interested in using it for intra
process messaging which is not MPI). This interface is something which
hardware vendors are already doing for their custom drivers to implement
fast local communication. And so in addition to this being useful for
OpenMPI it would mean the driver maintainers don't have to fix things up
when the mm changes.

There was some discussion about how much faster a true zero copy would
go. Here's a link back to the email with some testing I did on that:

http://marc.info/?l=linux-mm&m=130105930902915&w=2

There is a basic man page for the proposed interface here:

http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt

This has been implemented for x86 and powerpc, other architecture should
mainly (I think) just need to add syscall numbers for the process_vm_readv
and process_vm_writev. There are 32 bit compatibility versions for
64-bit kernels.

For arch maintainers there are some simple tests to be able to quickly
verify that the syscalls are working correctly here:

http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz

Signed-off-by: Chris Yeoh
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Cc: Arnd Bergmann
Cc: Paul Mackerras
Cc: Benjamin Herrenschmidt
Cc: David Howells
Cc: James Morris
Cc:
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christopher Yeoh
2011-11-01 08:30:44 +0800

27 Aug, 2011

1 commit

f5b940997 All Arch: remove linkage for sys_nfsservctl system call ... Browse Code »

The nfsservctl system call is now gone, so we should remove all
linkage for it.

Signed-off-by: NeilBrown
Signed-off-by: J. Bruce Fields
Signed-off-by: Linus Torvalds

NeilBrown
2011-08-27 06:09:58 +0800