Doug / smarc-fsl-linux-kernel | Embedian Git Server

02 Jun, 2012

2 commits

8b8c0daa2 nls: fix (and rename) mac NLS table files and config options ... Browse Code »

The config options in the Kconfig file (with _CODEPAGE_ in the name)
didn't match the config option name in the Makefile (no _CODEPAGE_).

And both of them were of the hard-to-read MACXYZZY variety, which made
them hard to parse for normal humans: MACROMAN easily reads as "macro
man", not as "Mac Roman".

So rename the options to be consistent, and be NLS_MAC_xyzzy. Rename
the files to be mac-xyzzy.c too, and drop the "nls" part entirely (it's
already in the directory name).

Signed-off-by: Linus Torvalds

Linus Torvalds
2012-06-02 10:51:22 +0800
92a895636 fs/nls/Makefile: remove bogus CONFIG_ assignments ... Browse Code »

These were debug things which snuck through.

Reported-by: Yinghai Lu
Cc: Vladimir Serbinenko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Morton
2012-06-02 10:47:26 +0800

01 Jun, 2012

38 commits

0a4dd35c6 kconfig: update compression algorithm info ... Browse Code »

There have been new compression algorithms added without updating nearby
relevant descriptive text that refers to (a) the number of compression
algorithms and (b) the most recent one. Fix these inconsistencies.

Signed-off-by: Randy Dunlap
Reported-by:
Cc: Lasse Collin
Cc: "H. Peter Anvin"
Cc: Markus Trippelsdorf
Cc: Alain Knaff
Cc: Albin Tonnerre
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Randy Dunlap
2012-06-01 08:49:33 +0800
b32dfe377 c/r: prctl: add ability to set new mm_struct::exe_file ... Browse Code »

When we do restore we would like to have a way to setup a former
mm_struct::exe_file so that /proc/pid/exe would point to the original
executable file a process had at checkpoint time.

For this the PR_SET_MM_EXE_FILE code is introduced. This option takes a
file descriptor which will be set as a source for new /proc/$pid/exe
symlink.

Note it allows to change /proc/$pid/exe if there are no VM_EXECUTABLE
vmas present for current process, simply because this feature is a special
to C/R and mm::num_exe_file_vmas become meaningless after that.

To minimize the amount of transition the /proc/pid/exe symlink might have,
this feature is implemented in one-shot manner. Thus once changed the
symlink can't be changed again. This should help sysadmins to monitor the
symlinks over all process running in a system.

In particular one could make a snapshot of processes and ring alarm if
there unexpected changes of /proc/pid/exe's in a system.

Note -- this feature is available iif CONFIG_CHECKPOINT_RESTORE is set and
the caller must have CAP_SYS_RESOURCE capability granted, otherwise the
request to change symlink will be rejected.

Signed-off-by: Cyrill Gorcunov
Reviewed-by: Oleg Nesterov
Cc: KOSAKI Motohiro
Cc: Pavel Emelyanov
Cc: Kees Cook
Cc: Tejun Heo
Cc: Matt Helsley
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2012-06-01 08:49:32 +0800
fe8c7f5cb c/r: prctl: extend PR_SET_MM to set up more mm_struct entries ... Browse Code »

During checkpoint we dump whole process memory to a file and the dump
includes process stack memory. But among stack data itself, the stack
carries additional parameters such as command line arguments, environment
data and auxiliary vector.

So when we do restore procedure and once we've restored stack data itself
we need to setup mm_struct::arg_start/end, env_start/end, so restored
process would be able to find command line arguments and environment data
it had at checkpoint time. The same applies to auxiliary vector.

For this reason additional PR_SET_MM_(ARG_START | ARG_END | ENV_START |
ENV_END | AUXV) codes are introduced.

Signed-off-by: Cyrill Gorcunov
Acked-by: Kees Cook
Cc: Tejun Heo
Cc: Andrew Vagin
Cc: Serge Hallyn
Cc: Pavel Emelyanov
Cc: Vasiliy Kulikov
Cc: KAMEZAWA Hiroyuki
Cc: Michael Kerrisk
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2012-06-01 08:49:32 +0800
5b172087f c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat ... Browse Code »

We would like to have an ability to restore command line arguments and
program environment pointers but first we need to obtain them somehow.
Thus we put these values into /proc/$pid/stat. The exit_code is needed to
restore zombie tasks.

Signed-off-by: Cyrill Gorcunov
Acked-by: Kees Cook
Cc: Pavel Emelyanov
Cc: Serge Hallyn
Cc: KAMEZAWA Hiroyuki
Cc: Alexey Dobriyan
Cc: Tejun Heo
Cc: Andrew Vagin
Cc: Vasiliy Kulikov
Cc: Alexey Dobriyan
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2012-06-01 08:49:32 +0800
d97b46a64 syscalls, x86: add __NR_kcmp syscall ... Browse Code »

While doing the checkpoint-restore in the user space one need to determine
whether various kernel objects (like mm_struct-s of file_struct-s) are
shared between tasks and restore this state.

The 2nd step can be solved by using appropriate CLONE_ flags and the
unshare syscall, while there's currently no ways for solving the 1st one.

One of the ways for checking whether two tasks share e.g. mm_struct is to
provide some mm_struct ID of a task to its proc file, but showing such
info considered to be not that good for security reasons.

Thus after some debates we end up in conclusion that using that named
'comparison' syscall might be the best candidate. So here is it --
__NR_kcmp.

It takes up to 5 arguments - the pids of the two tasks (which
characteristics should be compared), the comparison type and (in case of
comparison of files) two file descriptors.

Lookups for pids are done in the caller's PID namespace only.

At moment only x86 is supported and tested.

[akpm@linux-foundation.org: fix up selftests, warnings]
[akpm@linux-foundation.org: include errno.h]
[akpm@linux-foundation.org: tweak comment text]
Signed-off-by: Cyrill Gorcunov
Acked-by: "Eric W. Biederman"
Cc: Pavel Emelyanov
Cc: Andrey Vagin
Cc: KOSAKI Motohiro
Cc: Ingo Molnar
Cc: H. Peter Anvin
Cc: Thomas Gleixner
Cc: Glauber Costa
Cc: Andi Kleen
Cc: Tejun Heo
Cc: Matt Helsley
Cc: Pekka Enberg
Cc: Eric Dumazet
Cc: Vasiliy Kulikov
Cc: Alexey Dobriyan
Cc: Valdis.Kletnieks@vt.edu
Cc: Michal Marek
Cc: Frederic Weisbecker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2012-06-01 08:49:32 +0800
818411616 fs, proc: introduce /proc/<pid>/task/<tid>/children entry ... Browse Code »

When we do checkpoint of a task we need to know the list of children the
task, has but there is no easy and fast way to generate reverse
parent->children chain from arbitrary (while a parent pid is
provided in "PPid" field of /proc//status).

So instead of walking over all pids in the system (creating one big
process tree in memory, just to figure out which children a task has) --
we add explicit /proc//task//children entry, because the kernel
already has this kind of information but it is not yet exported.

This is a first level children, not the whole process tree.

Signed-off-by: Cyrill Gorcunov
Reviewed-by: Oleg Nesterov
Reviewed-by: Kees Cook
Cc: Pavel Emelyanov
Cc: Serge Hallyn
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2012-06-01 08:49:32 +0800
98ed57eef sysctl: make kernel.ns_last_pid control dependent on CHECKPOINT_RESTORE ... Browse Code »

For those who doesn't need C/R functionality there is no need to control
last pid, ie the pid for the next fork() call.

Signed-off-by: Cyrill Gorcunov
Cc: Pavel Emelyanov
Cc: Tejun Heo
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2012-06-01 08:49:32 +0800
ac34ebb3a aio/vfs: cleanup of rw_copy_check_uvector() and compat_rw_copy_check_uvector() ... Browse Code »

A cleanup of rw_copy_check_uvector and compat_rw_copy_check_uvector after
changes made to support CMA in an earlier patch.

Rather than having an additional check_access parameter to these
functions, the first paramater type is overloaded to allow the caller to
specify CHECK_IOVEC_ONLY which means check that the contents of the iovec
are valid, but do not check the memory that they point to. This is used
by process_vm_readv/writev where we need to validate that a iovec passed
to the syscall is valid but do not want to check the memory that it points
to at this point because it refers to an address space in another process.

Signed-off-by: Chris Yeoh
Reviewed-by: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christopher Yeoh
2012-06-01 08:49:32 +0800
ee62c6b2d eventfd: change int to __u64 in eventfd_signal() ... Browse Code »

eventfd_ctx->count is an __u64 counter which is allowed to reach
ULLONG_MAX. eventfd_write() adds a __u64 value to "count", but the kernel
side eventfd_signal() only adds an int value to it. Make them consistent.

[akpm@linux-foundation.org: update interface documentation]
Signed-off-by: Sha Zhengju
Cc: Davide Libenzi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sha Zhengju
2012-06-01 08:49:32 +0800
71ca97da9 fs/nls: add Apple NLS ... Browse Code »

HFS has support for NLS. However the relevant NLS tables are missing.
Here they are automatically transformed from the tables at unicode.org.
Codepages requiring special handling like CJK, RTL or Brahmic ones are not
included in this patch.

[akpm@linux-foundation.org: add unicode.org copyright and permission notices]
Signed-off-by: Vladimir Serbinenko
Cc: Alan Stern
Cc: OGAWA Hirofumi
Cc: Clemens Ladisch
Cc: Al Viro
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vladimir Serbinenko
2012-06-01 08:49:32 +0800
00c10bc13 pidns: make killed children autoreap ... Browse Code »

Force SIGCHLD handling to SIG_IGN so that signals are not generated and so
that the children autoreap. This increases the parallelize and in general
the speed of network namespace shutdown.

Note self reaping childrean can exist past zap_pid_ns_processess but they
will all be reaped before we allow the pid namespace init task with pid ==
1 to be reaped.

[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Eric W. Biederman
Cc: Oleg Nesterov
Cc: Pavel Emelyanov
Cc: Cyrill Gorcunov
Cc: Louis Rilling
Cc: Mike Galbraith
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric W. Biederman
2012-06-01 08:49:32 +0800
320845048 pidns: use task_active_pid_ns in do_notify_parent ... Browse Code »

Using task_active_pid_ns is more robust because it works even after we
have called exit_namespaces. This change allows us to have parent
processes that are zombies. Normally a zombie parent processes is crazy
and the last thing you would want to have but in the case of not letting
the init process of a pid namespace be reaped until all of it's children
are dead and reaped a zombie parent process is exactly what we want.

Signed-off-by: Eric W. Biederman
Cc: Oleg Nesterov
Cc: Pavel Emelyanov
Cc: Cyrill Gorcunov
Cc: Louis Rilling
Cc: Mike Galbraith
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric W. Biederman
2012-06-01 08:49:31 +0800
9eaa3d9bb rapidio/tsi721: add DMA engine support ... Browse Code »

Adds support for DMA Engine API into Tsi721 mport driver.

Includes following changes for Tsi721 driver:
- Modifies BDMA register offset definitions to support per-channel handling
- Separates BDMA channel reserved for RIO Maintenance requests
- Adds DMA Engine callback routines

Signed-off-by: Alexandre Bounine
Cc: Dan Williams
Cc: Vinod Koul
Cc: Li Yang
Cc: Matt Porter
Cc: Paul Gortmaker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexandre Bounine
2012-06-01 08:49:31 +0800
e42d98ebe rapidio: add DMA engine support for RIO data transfers ... Browse Code »

Adds DMA Engine framework support into RapidIO subsystem.

Uses DMA Engine DMA_SLAVE interface to generate data transfers to/from
remote RapidIO target devices.

Introduces RapidIO-specific wrapper for prep_slave_sg() interface with an
extra parameter to pass target specific information.

Uses scatterlist to describe local data buffer. Address flat data buffer
on a remote side.

Signed-off-by: Alexandre Bounine
Cc: Dan Williams
Acked-by: Vinod Koul
Cc: Li Yang
Cc: Matt Porter
Cc: Paul Gortmaker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexandre Bounine
2012-06-01 08:49:31 +0800
ce2d52cc1 ipc/mqueue: add rbtree node caching support ... Browse Code »

When I wrote the first patch that added the rbtree support for message
queue insertion, it sped up the case where the queue was very full
drastically from the original code. It, however, slowed down the case
where the queue was empty (not drastically though).

This patch caches the last freed rbtree node struct so we can quickly
reuse it when we get a new message. This is the common path for any queue
that very frequently goes from 0 to 1 then back to 0 messages in queue.

Andrew Morton didn't like that we were doing a GFP_ATOMIC allocation in
msg_insert, so this patch attempts to speculatively allocate a new node
struct outside of the spin lock when we know we need it, but will still
fall back to a GFP_ATOMIC allocation if it has to.

Once I added the caching, the necessary various ret = ; spin_unlock
gyrations in mq_timedsend were getting pretty ugly, so this also slightly
refactors that function to streamline the flow of the code and the
function exit.

Finally, while working on getting performance back I made sure that all of
the node structs were always fully initialized when they were first used,
rendering the use of kzalloc unnecessary and a waste of CPU cycles.

The net result of all of this is:

1) We will avoid a GFP_ATOMIC allocation when possible, but fall back
on it when necessary.

2) We will speculatively allocate a node struct using GFP_KERNEL if our
cache is empty (and save the struct to our cache if it's still empty
after we have obtained the spin lock).

3) The performance of the common queue empty case has significantly
improved and is now much more in line with the older performance for
this case.

The performance changes are:

Old mqueue new mqueue new mqueue + caching
queue empty
send/recv 305/288ns 349/318ns 310/322ns

I don't think we'll ever be able to get the recv performance back, but
that's because the old recv performance was a direct result and
consequence of the old methods abysmal send performance. The recv path
simply must do more so that the send path does not incur such a penalty
under higher queue depths.

As it turns out, the new caching code also sped up the various queue full
cases relative to my last patch. That could be because of the difference
between the syscall path in 3.3.4-rc5 and 3.3.4-rc6, or because of the
change in code flow in the mq_timedsend routine. Regardless, I'll take
it. It wasn't huge, and I *would* say it was within the margin for error,
but after many repeated runs what I'm seeing is that the old numbers trend
slightly higher (about 10 to 20ns depending on which test is the one
running).

[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Doug Ledford
Cc: Frederic Weisbecker
Cc: Manfred Spraul
Cc: Stephen Rothwell
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Doug Ledford
2012-06-01 08:49:31 +0800
7820b0715 tools/selftests: add mq_perf_tests ... Browse Code »

Add the mq_perf_tests tool I used when creating my mq performance patch.
Also add a local .gitignore to keep the binaries from showing up in git
status output.

[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Doug Ledford
Cc: Stephen Rothwell
Cc: Manfred Spraul
Cc: Frederic Weisbecker
Acked-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Doug Ledford
2012-06-01 08:49:31 +0800
113289cc0 ipc/mqueue: strengthen checks on mqueue creation ... Browse Code »

We already check the mq attr struct if it's passed in, but now that the
admin can set system wide defaults separate from maximums, it's actually
possible to set the defaults to something that would overflow. So, if
there is no attr struct passed in to the open call, check the default
values.

While we are at it, simplify mq_attr_ok() by making it return 0 or an
error condition, so that way if we add more tests to it later, we have the
option of what error should be returned instead of the calling location
having to pick a possibly inaccurate error code.

[akpm@linux-foundation.org: s/ENOMEM/EOVERFLOW/]
Signed-off-by: Doug Ledford
Cc: Stephen Rothwell
Cc: Manfred Spraul
Acked-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Doug Ledford
2012-06-01 08:49:31 +0800
2c12ea498 ipc/mqueue: correct mq_attr_ok test ... Browse Code »

While working on the other parts of the mqueue stuff, I noticed that the
calculation for overflow in mq_attr_ok didn't actually match reality (this
is especially true since my last patch which changed how we account memory
slightly).

In particular, we used to test for overflow using:
msgs * msgsize + msgs * sizeof(struct msg_msg *)

That was never really correct because each message we allocate via
load_msg() is actually a struct msg_msg followed by the data for the
message (and if struct msg_msg + data exceeds PAGE_SIZE we end up
allocating struct msg_msgseg structs too, but accounting for them would
get really tedious, so let's ignore those...they're only a pointer in size
anyway). This patch updates the calculation to be more accurate in
regards to maximum possible memory consumption by the mqueue.

[akpm@linux-foundation.org: add a local to simplify overflow-checking expression]
Signed-off-by: Doug Ledford
Cc: Stephen Rothwell
Cc: Manfred Spraul
Acked-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Doug Ledford
2012-06-01 08:49:31 +0800
d6629859b ipc/mqueue: improve performance of send/recv ... Browse Code »

The existing implementation of the POSIX message queue send and recv
functions is, well, abysmal. Even worse than abysmal. I submitted a
patch to increase the maximum POSIX message queue limit to 65536 due to
customer needs, however, upon looking over the send/recv implementation, I
realized that my customer needs help with that too even if they don't know
it. The basic problem is that, given the fairly typical use case scenario
for a large queue of queueing lots of messages all at the same priority (I
verified with my customer that this is indeed what their app does), the
msg_insert routine is basically a frikkin' bubble sort. I mean, whoa,
that's *so* middle school.

OK, OK, to not slam the original author too much, I'm sure they didn't
envision a queue depth of 50,000+ messages. No one would think that
moving elements in an array, one at a time, and dereferencing each pointer
in that array to check priority of the message being pointed too, again
one at a time, for 50,000+ times would be good. So let's assume that, as
is typical, the users have found a way to break our code simply by using
it in a way we didn't envision. Fair enough.

"So, just how broken is it?", you ask. I wondered the same thing, so I
wrote an app to let me know. It's my next patch. It gave me some
interesting results. Here's what it tested:

Interference with other apps - In continuous mode, the app just sits there
and hits a message queue forever, while you go do something productive on
another terminal using other CPUs. You then measure how long it takes you
to do that something productive. Then you restart the app in fake
continuous mode, and it sits in a tight loop on a CPU while you repeat
your tests. The whole point of this is to keep one CPU tied up (so it
can't be used in your other work) but in one case tied up hitting the
mqueue code so we can see the effect of walking that 65,528 element array
one pointer at a time on the global CPU cache. If it's bad, then it will
slow down your app on the other CPUs just by polluting cache mercilessly.
In the fake case, it will be in a tight loop, but not polluting cache.
Testing the mqueue subsystem directly - Here we just run a number of tests
to see how the mqueue subsystem performs under different conditions. A
couple conditions are known to be worst case for the old system, and some
routines, so this tests all of them.

So, on to the results already:

Subsystem/Test Old New

Time to compile linux
kernel (make -j12 on a
6 core CPU)
Running mqueue test user 49m10.744s user 45m26.294s
sys 5m51.924s sys 4m59.894s
total 55m02.668s total 50m26.188s

Running fake test user 45m32.686s user 45m18.552s
sys 5m12.465s sys 4m56.468s
total 50m45.151s total 50m15.020s

% slowdown from mqueue
cache thrashing ~8% ~.5%

Avg time to send/recv (in nanoseconds per message)
when queue empty 305/288 349/318
when queue full (65528 messages)
constant priority 526589/823 362/314
increasing priority 403105/916 495/445
decreasing priority 73420/594 482/409
random priority 280147/920 546/436

Time to fill/drain queue (65528 messages, in seconds)
constant priority 17.37/.12 .13/.12
increasing priority 4.14/.14 .21/.18
decreasing priority 12.93/.13 .21/.18
random priority 8.88/.16 .22/.17

So, I think the results speak for themselves. It's possible this
implementation could be improved by cacheing at least one priority level
in the node tree (that would bring the queue empty performance more in
line with the old implementation), but this works and is *so* much better
than what we had, especially for the common case of a single priority in
use, that further refinements can be in follow on patches.

[akpm@linux-foundation.org: fix typo in comment, remove stray semicolon]
[levinsasha928@gmail.com: use correct gfp flags in msg_insert]
Signed-off-by: Doug Ledford
Cc: Stephen Rothwell
Cc: Manfred Spraul
Acked-by: KOSAKI Motohiro
Signed-off-by: Sasha Levin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Doug Ledford
2012-06-01 08:49:31 +0800
50069a585 selftests: add mq_open_tests ... Browse Code »

Add a directory to house POSIX message queue subsystem specific tests.
Add first test which checks the operation of mq_open() under various
corner conditions.

Signed-off-by: Doug Ledford
Cc: KOSAKI Motohiro
Cc: Doug Ledford
Cc: Joe Korty
Cc: Amerigo Wang
Cc: Serge E. Hallyn
Cc: Jiri Slaby
Cc: Manfred Spraul
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Doug Ledford
2012-06-01 08:49:31 +0800
cef0184c1 mqueue: separate mqueue default value from maximum value ... Browse Code »

Commit b231cca4381e ("message queues: increase range limits") changed
mqueue default value when attr parameter is specified NULL from hard
coded value to fs.mqueue.{msg,msgsize}_max sysctl value.

This made large side effect. When user need to use two mqueue
applications 1) using !NULL attr parameter and it require big message
size and 2) using NULL attr parameter and only need small size message,
app (1) require to raise fs.mqueue.msgsize_max and app (2) consume large
memory size even though it doesn't need.

Doug Ledford propsed to switch back it to static hard coded value.
However it also has a compatibility problem. Some applications might
started depend on the default value is tunable.

The solution is to separate default value from maximum value.

Signed-off-by: KOSAKI Motohiro
Signed-off-by: Doug Ledford
Acked-by: Doug Ledford
Acked-by: Joe Korty
Cc: Amerigo Wang
Acked-by: Serge E. Hallyn
Cc: Jiri Slaby
Cc: Manfred Spraul
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KOSAKI Motohiro
2012-06-01 08:49:31 +0800
fd1f87d24 mqueue: don't use kmalloc with KMALLOC_MAX_SIZE ... Browse Code »

KMALLOC_MAX_SIZE is not a good threshold. It is extremely high and
problematic. Unfortunately, some silly drivers depend on this and we
can't change it. But any new code needn't use such extreme ugly high
order allocations. It brings us awful fragmentation issues and system
slowdown.

Signed-off-by: KOSAKI Motohiro
Acked-by: Doug Ledford
Acked-by: Joe Korty
Cc: Amerigo Wang
Cc: Serge E. Hallyn
Cc: Jiri Slaby
Cc: Joe Korty
Cc: Manfred Spraul
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KOSAKI Motohiro
2012-06-01 08:49:31 +0800
e6315bb15 mqueue: revert bump up DFLT_*MAX ... Browse Code »

Mqueue limitation is slightly naieve parameter likes other ipcs because
unprivileged user can consume kernel memory by using ipcs.

Thus, too aggressive raise bring us security issue. Example, current
setting allow evil unprivileged user use 256GB (= 256 * 1024 * 1024*1024)
and it's enough large to system will belome unresponsive. Don't do that.

Instead, every admin should adjust the knobs for their own systems.

Signed-off-by: KOSAKI Motohiro
Acked-by: Doug Ledford
Acked-by: Joe Korty
Cc: Amerigo Wang
Acked-by: Serge E. Hallyn
Cc: Jiri Slaby
Cc: Manfred Spraul
Cc: Dave Hansen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KOSAKI Motohiro
2012-06-01 08:49:31 +0800
5b5c4d1a1 ipc/mqueue: update maximums for the mqueue subsystem ... Browse Code »

Commit b231cca4381e ("message queues: increase range limits") changed the
maximum size of a message in a message queue from INT_MAX to 8192*128.
Unfortunately, we had customers that relied on a size much larger than
8192*128 on their production systems. After reviewing POSIX, we found
that it is silent on the maximum message size. We did find a couple other
areas in which it was not silent. Fix up the mqueue maximums so that the
customer's system can continue to work, and document both the POSIX and
real world requirements in ipc_namespace.h so that we don't have this
issue crop back up.

Also, commit 9cf18e1dd74cd0 ("ipc: HARD_MSGMAX should be higher not lower
on 64bit") fiddled with HARD_MSGMAX without realizing that the number was
intentionally in place to limit the msg queue depth to one that was small
enough to kmalloc an array of pointers (hence why we divided 128k by
sizeof(long)). If we wish to meet POSIX requirements, we have no choice
but to change our allocation to a vmalloc instead (at least for the large
queue size case). With that, it's possible to increase our allowed
maximum to the POSIX requirements (or more if we choose).

[sfr@canb.auug.org.au: using vmalloc requires including vmalloc.h]
Signed-off-by: Doug Ledford
Cc: Serge E. Hallyn
Cc: Amerigo Wang
Cc: Joe Korty
Cc: Jiri Slaby
Acked-by: KOSAKI Motohiro
Cc: Manfred Spraul
Signed-off-by: Stephen Rothwell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Doug Ledford
2012-06-01 08:49:30 +0800
02967ea08 ipc/mqueue: enforce hard limits ... Browse Code »

In two places we don't enforce the hard limits for CAP_SYS_RESOURCE apps.
In preparation for making more reasonable hard limits, start enforcing
them even on CAP_SYS_RESOURCE.

Signed-off-by: Doug Ledford
Cc: Serge E. Hallyn
Cc: Amerigo Wang
Cc: Joe Korty
Cc: Jiri Slaby
Acked-by: KOSAKI Motohiro
Cc: Manfred Spraul
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Doug Ledford
2012-06-01 08:49:30 +0800
858ee3784 ipc/mqueue: switch back to using non-max values on create ... Browse Code »

Commit b231cca4381e ("message queues: increase range limits") changed
how we create a queue that does not include an attr struct passed to
open so that it creates the queue with whatever the maximum values are.
However, if the admin has set the maximums to allow flexibility in
creating a queue (aka, both a large size and large queue are allowed,
but combined they create a queue too large for the RLIMIT_MSGQUEUE of
the user), then attempts to create a queue without an attr struct will
fail. Switch back to using acceptable defaults regardless of what the
maximums are.

Note: so far, we only know of a few applications that rely on this
behavior (specifically, set the maximums in /proc, then run the
application which calls mq_open() without passing in an attr struct, and
the application expects the newly created message queue to have the
maximum sizes that were set in /proc used on the mq_open() call, and all
of those applications that we know of are actually part of regression
test suites that were coded to do something like this:

for size in 4096 65536 $((1024 * 1024)) $((16 * 1024 * 1024)); do
echo $size > /proc/sys/fs/mqueue/msgsize_max
mq_open || echo "Error opening mq with size $size"
done

These test suites that depend on any behavior like this are broken. The
concept that programs should rely upon the system wide maximum in order
to get their desired results instead of simply using a attr struct to
specify what they want is fundamentally unfriendly programming practice
for any multi-tasking OS.

Fixing this will break those few apps that we know of (and those app
authors recognize the brokenness of their code and the need to fix it).
However, the following patch "mqueue: separate mqueue default value"
allows a workaround in the form of new knobs for the default msg queue
creation parameters for any software out there that we don't already
know about that might rely on this behavior at the moment.

Signed-off-by: Doug Ledford
Cc: Serge E. Hallyn
Cc: Amerigo Wang
Cc: Joe Korty
Cc: Jiri Slaby
Acked-by: KOSAKI Motohiro
Cc: Manfred Spraul
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Doug Ledford
2012-06-01 08:49:30 +0800
93e6f119c ipc/mqueue: cleanup definition names and locations ... Browse Code »

Since commit b231cca4381e ("message queues: increase range limits") on
Oct 18, 2008, calls to mq_open() that did not pass in an attribute
struct and expected to get default values for the size of the queue and
the max message size now get the system wide maximums instead of
hardwired defaults like they used to get.

This was uncovered when one of the earlier patches in this patch set
increased the default system wide maximums at the same time it increased
the hard ceiling on the system wide maximums (a customer specifically
needed the hard ceiling brought back up, the new ceiling that commit
b231cca4381e introduced was too low for their production systems). By
increasing the default maximums and not realising they were tied to any
attempt to create a message queue without an attribute struct, I had
inadvertently made it such that all message queue creation attempts
without an attribute struct were failing because the new default
maximums would create a queue that exceeded the default rlimit for
message queue bytes.

As a result, the system wide defaults were brought back down to their
previous levels, and the system wide ceilings on the maximums were
raised to meet the customer's needs. However, the fact that the no
attribute struct behavior of mq_open() could be broken by changing the
system wide maximums for message queues was seen as fundamentally broken
itself. So we hardwired the no attribute case back like it used to be.
But, then we realized that on the very off chance that some piece of
software in the wild depended on that behavior, we could work around
that issue by adding two new knobs to /proc that allowed setting the
defaults for message queues created without an attr struct separately
from the system wide maximums.

What is not an option IMO is to leave the current behavior in place. No
piece of software should ever rely on setting the system wide maximums
in order to get a desired message queue. Such a reliance would be so
fundamentally multitasking OS unfriendly as to not really be tolerable.
Fortunately, we don't know of any software in the wild that uses this
except for a regression test program that caught the issue in the first
place. If there is though, we have made accommodations with the two new
/proc knobs (and that's all the accommodations such fundamentally broken
software can be allowed)..

This patch:

The various defines for minimums and maximums of the sysctl controllable
mqueue values are scattered amongst different files and named
inconsistently. Move them all into ipc_namespace.h and make them have
consistent names. Additionally, make the number of queues per namespace
also have a minimum and maximum and use the same sysctl function as the
other two settable variables.

Signed-off-by: Doug Ledford
Acked-by: Serge E. Hallyn
Cc: Amerigo Wang
Cc: Joe Korty
Cc: Jiri Slaby
Acked-by: KOSAKI Motohiro
Cc: Manfred Spraul
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Doug Ledford
2012-06-01 08:49:30 +0800
29a5c67e7 kexec: export kexec.h to user space ... Browse Code »

Add userspace definitions, guard all relevant kernel structures. While at
it document stuff and remove now useless userspace hint.

It is easy to add the relevant system call to respective libc's, but it
seems pointless to have to duplicate the data structures.

This is based on the kexec-tools headers, with the exception of just using
int on return (succes or failure) and using size_t instead of 'unsigned
long int' for the number of segments argument of kexec_load().

Signed-off-by: maximilian attems
Cc: Simon Horman
Cc: Vivek Goyal
Cc: Haren Myneni
Cc: "Eric W. Biederman"
Cc: Martin Schwidefsky
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

maximilian attems
2012-06-01 08:49:30 +0800
e4cc2f873 kernel/cpu.c: document clear_tasks_mm_cpumask() ... Browse Code »

Add more comments on clear_tasks_mm_cpumask, plus adds a runtime check:
the function is only suitable for offlined CPUs, and if called
inappropriately, the kernel should scream aloud.

[akpm@linux-foundation.org: tweak comment: s/walks up/walks/, use 80 cols]
Suggested-by: Andrew Morton
Suggested-by: Peter Zijlstra
Signed-off-by: Anton Vorontsov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Anton Vorontsov
2012-06-01 08:49:30 +0800
2c922c51e um: properly check all process' threads for a live mm ... Browse Code »

kill_off_processes() might miss a valid process, this is because checking
for process->mm is not enough. Process' main thread may exit or detach
its mm via use_mm(), but other threads may still have a valid mm.

To catch this we use find_lock_task_mm(), which walks up all threads and
returns an appropriate task (with task lock held).

Suggested-by: Oleg Nesterov
Signed-off-by: Anton Vorontsov
Cc: Richard Weinberger
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Anton Vorontsov
2012-06-01 08:49:30 +0800
137d1a26c um: fix possible race on task->mm ... Browse Code »

Checking for task->mm is dangerous as ->mm might disappear (exit_mm()
assigns NULL under task_lock(), so tasklist lock is not enough).

We can't use get_task_mm()/mmput() pair as mmput() might sleep, so let's
take the task lock while we care about its mm.

Note that we should also use find_lock_task_mm() to check all process'
threads for a valid mm, but for uml we'll do it in a separate patch.

Signed-off-by: Anton Vorontsov
Cc: Richard Weinberger
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Anton Vorontsov
2012-06-01 08:49:30 +0800
9bd0a0771 um: should hold tasklist_lock while traversing processes ... Browse Code »

Traversing the tasks requires holding tasklist_lock, otherwise it is
unsafe.

p.s. However, I'm not sure that calling os_kill_ptraced_process() in the
atomic context is correct. It seem to work, but please take a closer
look.

Signed-off-by: Anton Vorontsov
Cc: Richard Weinberger
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Anton Vorontsov
2012-06-01 08:49:30 +0800
af1be5a57 blackfin: fix possible deadlock in decode_address() ... Browse Code »

Oleg Nesterov found an interesting deadlock possibility:

> sysrq_showregs_othercpus() does smp_call_function(showacpu)
> and showacpu() show_stack()->decode_address(). Now suppose that IPI
> interrupts the task holding read_lock(tasklist).

To fix this, blackfin should not grab the write_ variant of the
tasklist lock, read_ one is enough.

Suggested-by: Oleg Nesterov
Signed-off-by: Anton Vorontsov
Cc: Mike Frysinger
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Anton Vorontsov
2012-06-01 08:49:30 +0800
2214f707d blackfin: a couple of task->mm handling fixes ... Browse Code »

The patch fixes two problems:

1. Working with task->mm w/o getting mm or grabing the task lock is
dangerous as ->mm might disappear (exit_mm() assigns NULL under
task_lock(), so tasklist lock is not enough).

We can't use get_task_mm()/mmput() pair as mmput() might sleep,
so we have to take the task lock while handle its mm.

2. Checking for process->mm is not enough because process' main
thread may exit or detach its mm via use_mm(), but other threads
may still have a valid mm.

To catch this we use find_lock_task_mm(), which walks up all
threads and returns an appropriate task (with task lock held).

Suggested-by: Oleg Nesterov
Signed-off-by: Anton Vorontsov
Cc: Mike Frysinger
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Anton Vorontsov
2012-06-01 08:49:30 +0800
1198c8b9a sh: use clear_tasks_mm_cpumask() ... Browse Code »

Checking for process->mm is not enough because process' main thread may
exit or detach its mm via use_mm(), but other threads may still have a
valid mm.

To fix this we would need to use find_lock_task_mm(), which would walk up
all threads and returns an appropriate task (with task lock held).

clear_tasks_mm_cpumask() has the issue fixed, so let's use it.

Suggested-by: Oleg Nesterov
Signed-off-by: Anton Vorontsov
Cc: Paul Mundt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Anton Vorontsov
2012-06-01 08:49:30 +0800
73863ab02 powerpc: use clear_tasks_mm_cpumask() ... Browse Code »

Current CPU hotplug code has some task->mm handling issues:

1. Working with task->mm w/o getting mm or grabing the task lock is
dangerous as ->mm might disappear (exit_mm() assigns NULL under
task_lock(), so tasklist lock is not enough).

We can't use get_task_mm()/mmput() pair as mmput() might sleep,
so we must take the task lock while handle its mm.

2. Checking for process->mm is not enough because process' main
thread may exit or detach its mm via use_mm(), but other threads
may still have a valid mm.

To fix this we would need to use find_lock_task_mm(), which would
walk up all threads and returns an appropriate task (with task
lock held).

clear_tasks_mm_cpumask() has all the issues fixed, so let's use it.

Suggested-by: Oleg Nesterov
Signed-off-by: Anton Vorontsov
Acked-by: Benjamin Herrenschmidt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Anton Vorontsov
2012-06-01 08:49:29 +0800
3eaa73bde arm: use clear_tasks_mm_cpumask() ... Browse Code »

Checking for process->mm is not enough because process' main thread may
exit or detach its mm via use_mm(), but other threads may still have a
valid mm.

To fix this we would need to use find_lock_task_mm(), which would walk up
all threads and returns an appropriate task (with task lock held).

clear_tasks_mm_cpumask() has this issue fixed, so let's use it.

Suggested-by: Oleg Nesterov
Signed-off-by: Anton Vorontsov
Cc: Russell King
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Anton Vorontsov
2012-06-01 08:49:29 +0800
cb79295e2 cpu: introduce clear_tasks_mm_cpumask() helper ... Browse Code »

Many architectures clear tasks' mm_cpumask like this:

read_lock(&tasklist_lock);
for_each_process(p) {
if (p->mm)
cpumask_clear_cpu(cpu, mm_cpumask(p->mm));
}
read_unlock(&tasklist_lock);

Depending on the context, the code above may have several problems,
such as:

1. Working with task->mm w/o getting mm or grabing the task lock is
dangerous as ->mm might disappear (exit_mm() assigns NULL under
task_lock(), so tasklist lock is not enough).

2. Checking for process->mm is not enough because process' main
thread may exit or detach its mm via use_mm(), but other threads
may still have a valid mm.

This patch implements a small helper function that does things
correctly, i.e.:

1. We take the task's lock while whe handle its mm (we can't use
get_task_mm()/mmput() pair as mmput() might sleep);

2. To catch exited main thread case, we use find_lock_task_mm(),
which walks up all threads and returns an appropriate task
(with task lock held).

Also, Per Peter Zijlstra's idea, now we don't grab tasklist_lock in
the new helper, instead we take the rcu read lock. We can do this
because the function is called after the cpu is taken down and marked
offline, so no new tasks will get this cpu set in their mm mask.

Signed-off-by: Anton Vorontsov
Cc: Richard Weinberger
Cc: Oleg Nesterov
Cc: Peter Zijlstra
Cc: Russell King
Cc: Benjamin Herrenschmidt
Cc: Mike Frysinger
Cc: Paul Mundt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Anton Vorontsov
2012-06-01 08:49:29 +0800