Eric Lee / smarc-fsl-linux-kernel

09 Jun, 2012

4 commits

724945044 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar.

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Fix the relax_domain_level boot parameter
sched: Validate assumptions in sched_init_numa()
sched: Always initialize cpu-power
sched: Fix domain iteration
sched/rt: Fix lockdep annotation within find_lock_lowest_rq()
sched/numa: Load balance between remote nodes
sched/x86: Calculate booted cores after construction of sibling_mask

Linus Torvalds
2012-06-09 05:59:29 +0800
cd96891d4 sched/fair: fix lots of kernel-doc warnings

Fix lots of new kernel-doc warnings in kernel/sched/fair.c:

Warning(kernel/sched/fair.c:3625): No description found for parameter 'env'
Warning(kernel/sched/fair.c:3625): Excess function parameter 'sd' description in 'update_sg_lb_stats'
Warning(kernel/sched/fair.c:3735): No description found for parameter 'env'
Warning(kernel/sched/fair.c:3735): Excess function parameter 'sd' description in 'update_sd_pick_busiest'
Warning(kernel/sched/fair.c:3735): Excess function parameter 'this_cpu' description in 'update_sd_pick_busiest'
.. more warnings

Signed-off-by: Randy Dunlap
Cc: Ingo Molnar
Cc: Peter Zijlstra
Signed-off-by: Linus Torvalds

Randy Dunlap
2012-06-09 05:59:10 +0800
106544d81 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Ingo Molnar:
"A bit larger than what I'd wish for - half of it is due to hw driver
updates to Intel Ivy-Bridge which info got recently released,
cycles:pp should work there now too, amongst other things. (but we
are generally making exceptions for hardware enablement of this type.)

There are also callchain fixes in it - responding to mostly
theoretical (but valid) concerns. The tooling side sports perf.data
endianness/portability fixes which did not make it for the merge
window - and various other fixes as well."

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
perf/x86: Check user address explicitly in copy_from_user_nmi()
perf/x86: Check if user fp is valid
perf: Limit callchains to 127
perf/x86: Allow multiple stacks
perf/x86: Update SNB PEBS constraints
perf/x86: Enable/Add IvyBridge hardware support
perf/x86: Implement cycles:p for SNB/IVB
perf/x86: Fix Intel shared extra MSR allocation
x86/decoder: Fix bsr/bsf/jmpe decoding with operand-size prefix
perf: Remove duplicate invocation on perf_event_for_each
perf uprobes: Remove unnecessary check before strlist__delete
perf symbols: Check for valid dso before creating map
perf evsel: Fix 32 bit values endianity swap for sample_id_all header
perf session: Handle endianity swap on sample_id_all header data
perf symbols: Handle different endians properly during symbol load
perf evlist: Pass third argument to ioctl explicitly
perf tools: Update ioctl documentation for PERF_IOC_FLAG_GROUP
perf tools: Make --version show kernel version instead of pull req tag
perf tools: Check if callchain is corrupted
perf callchain: Make callchain cursors TLS
...

Linus Torvalds
2012-06-09 00:14:46 +0800
b1e25f410 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull leap second timer fix from Thomas Gleixner.

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timekeeping: Fix CLOCK_MONOTONIC inconsistency during leapsecond

Linus Torvalds
2012-06-09 00:11:33 +0800

08 Jun, 2012

6 commits

48d212a2e Revert "mm: correctly synchronize rss-counters at exit/exec"

This reverts commit 40af1bbdca47e5c8a2044039bb78ca8fd8b20f94.

It's horribly and utterly broken for at least the following reasons:

- calling sync_mm_rss() from mmput() is fundamentally wrong, because
there's absolutely no reason to believe that the task that does the
mmput() always does it on its own VM. Example: fork, ptrace, /proc -
you name it.

- calling it *after* having done mmdrop() on it is doubly insane, since
the mm struct may well be gone now.

- testing mm against NULL before you call it is insane too, since a
NULL mm there would have caused oopses long before.

.. and those are just the three bugs I found before I decided to give up
looking for more and revert it asap. I should have caught it before I
even took it, but I trusted Andrew too much.

Cc: Konstantin Khlebnikov
Cc: Markus Trippelsdorf
Cc: Hugh Dickins
Cc: KAMEZAWA Hiroyuki
Cc: Oleg Nesterov
Cc: Andrew Morton
Signed-off-by: Linus Torvalds

Linus Torvalds
2012-06-08 08:54:07 +0800
40af1bbdc mm: correctly synchronize rss-counters at exit/exec

mm->rss_stat counters have per-task delta: task->rss_stat. Before
changing task->mm pointer the kernel must flush this delta with
sync_mm_rss().

do_exit() already calls sync_mm_rss() to flush the rss-counters before
committing the rss statistics into task->signal->maxrss, taskstats,
audit and other stuff. Unfortunately the kernel does this before
calling mm_release(), which can call put_user() for processing
task->clear_child_tid. So at this point we can trigger page-faults and
task->rss_stat becomes non-zero again. As a result mm->rss_stat becomes
inconsistent and check_mm() will print something like this:

| BUG: Bad rss-counter state mm:ffff88020813c380 idx:1 val:-1
| BUG: Bad rss-counter state mm:ffff88020813c380 idx:2 val:1

This patch moves sync_mm_rss() into mm_release(), and moves mm_release()
out of do_exit() and calls it earlier. After mm_release() there should
be no pagefaults.

[akpm@linux-foundation.org: tweak comment]
Signed-off-by: Konstantin Khlebnikov
Reported-by: Markus Trippelsdorf
Cc: Hugh Dickins
Cc: KAMEZAWA Hiroyuki
Cc: Oleg Nesterov
Cc: [3.4.x]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Konstantin Khlebnikov
2012-06-08 05:43:55 +0800
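
The accounting problem described in 40af1bbdc can be illustrated with a simplified userspace model. This is only a sketch under stated assumptions: `mm_model`, `task_model` and every helper name below are hypothetical stand-ins rather than kernel code, and the model assumes that a page fault is first recorded in the per-task delta while unmapping a page decrements the shared counter directly.

```c
#include <stdio.h>

#define NR_COUNTERS 3	/* stand-in for the kernel's per-mm counter types */

/* Simplified stand-ins for the kernel structures (assumption, not kernel code). */
struct mm_model   { long rss_stat[NR_COUNTERS]; };
struct task_model { struct mm_model *mm; long rss_delta[NR_COUNTERS]; };

/* A page fault is accounted in the cheap per-task delta first. */
static void fault_in_page(struct task_model *t, int idx)
{
	t->rss_delta[idx]++;
}

/* Fold the per-task delta into the shared counters, in the spirit of sync_mm_rss(). */
static void sync_rss(struct task_model *t)
{
	for (int i = 0; i < NR_COUNTERS; i++) {
		t->mm->rss_stat[i] += t->rss_delta[i];
		t->rss_delta[i] = 0;
	}
}

/* Unmapping a page decrements the shared counter directly. */
static void unmap_page(struct mm_model *mm, int idx)
{
	mm->rss_stat[idx]--;
}

/* In the spirit of check_mm(): every counter must be zero at teardown. */
static int count_bad_counters(const struct mm_model *mm)
{
	int bad = 0;

	for (int i = 0; i < NR_COUNTERS; i++) {
		if (mm->rss_stat[i] != 0) {
			printf("Bad rss-counter state idx:%d val:%ld\n",
			       i, mm->rss_stat[i]);
			bad++;
		}
	}
	return bad;
}

/* Returns the number of inconsistent counters for the chosen ordering. */
static int run_exit(int sync_after_late_fault)
{
	struct mm_model mm = { {0} };
	struct task_model t = { &mm, {0} };

	fault_in_page(&t, 1);
	sync_rss(&t);		/* early sync, as do_exit() does */
	fault_in_page(&t, 1);	/* late fault, e.g. caused by put_user() */
	if (sync_after_late_fault)
		sync_rss(&t);
	unmap_page(&mm, 1);
	unmap_page(&mm, 1);
	return count_bad_counters(&mm);
}
```

In this model `run_exit(0)` leaves one counter at -1 because the late fault is never flushed, while `run_exit(1)` ends with all counters at zero.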
736f24d5e c/r: prctl: drop VMA flags test on PR_SET_MM_ stack data assignment

In commit b76437579d13 ("procfs: mark thread stack correctly in
proc/<pid>/maps") the stack allocated via clone() is marked in
/proc/<pid>/maps as [stack:%d] thus it might be out of the former
mm->start_stack/end_stack values (and even has some custom VMA flags
set).

So to be able to restore mm->start_stack/end_stack, drop the VMA flags
test, but still require the underlying VMA to exist.

As always note this feature is under CONFIG_CHECKPOINT_RESTORE and
requires CAP_SYS_RESOURCE to be granted.

Signed-off-by: Cyrill Gorcunov
Cc: Oleg Nesterov
Acked-by: Kees Cook
Cc: Pavel Emelyanov
Cc: Serge Hallyn
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2012-06-08 05:43:55 +0800
300f786b2 c/r: prctl: add ability to get clear_tid_address

Zero is written at clear_tid_address when the process exits. This
functionality is used by pthread_join().

We already have sys_set_tid_address() to change this address for the
current task but there is no way to obtain it from user space.

Without the ability to find this address and dump it we can't restore
pthread'ed apps which call pthread_join() once they have been restored.

This patch introduces the PR_GET_TID_ADDRESS prctl option which allows
the current process to obtain its own clear_tid_address.

This feature is available iff CONFIG_CHECKPOINT_RESTORE is set.

[akpm@linux-foundation.org: fix prctl numbering]
Signed-off-by: Andrew Vagin
Signed-off-by: Cyrill Gorcunov
Cc: Pedro Alves
Cc: Oleg Nesterov
Cc: Pavel Emelyanov
Cc: Tejun Heo
Acked-by: Kees Cook
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2012-06-08 05:43:55 +0800
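
A minimal userspace sketch of how the PR_GET_TID_ADDRESS option from 300f786b2 might be called. The wrapper name `get_clear_tid_address()` is hypothetical, the fallback definition of the option number is taken from linux/prctl.h, and the call is expected to fail with EINVAL on kernels that do not provide the option.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_GET_TID_ADDRESS
#define PR_GET_TID_ADDRESS 40	/* value from linux/prctl.h */
#endif

/*
 * Ask the kernel for the calling thread's clear_tid_address.
 * Returns 0 and stores the address in *addr on success, or -1 with
 * errno set (for example EINVAL when the option is unavailable).
 */
static int get_clear_tid_address(int **addr)
{
	*addr = NULL;
	return prctl(PR_GET_TID_ADDRESS, (unsigned long)addr, 0, 0, 0);
}
```

A checkpoint tool could store the returned address in its dump and hand it back through set_tid_address() on restore; that workflow is an assumption based on the commit message, not something this sketch implements.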
1ad75b9e1 c/r: prctl: add minimal address test to PR_SET_MM

Make sure the address being set is greater than mmap_min_addr (as
suggested by Kees Cook).

Signed-off-by: Cyrill Gorcunov
Acked-by: Kees Cook
Cc: Serge Hallyn
Cc: Tejun Heo
Cc: Pavel Emelyanov
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2012-06-08 05:43:55 +0800
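
The check added by 1ad75b9e1 can be sketched as a plain range test. The names and the limit value below are hypothetical (the real limit comes from the vm.mmap_min_addr sysctl), and whether an address exactly equal to the limit is accepted is an assumption of this sketch.

```c
#include <errno.h>

/* Hypothetical value; the real limit is the vm.mmap_min_addr sysctl. */
static const unsigned long model_mmap_min_addr = 65536;

/* Reject addresses below the minimum mappable address (assumed boundary). */
static int check_prctl_mm_addr(unsigned long addr)
{
	if (addr < model_mmap_min_addr)
		return -EINVAL;
	return 0;
}
```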
bafb282df c/r: prctl: update prctl_set_mm_exe_file() after mm->num_exe_file_vmas removal

A fix for commit b32dfe377102 ("c/r: prctl: add ability to set new
mm_struct::exe_file").

After removing mm->num_exe_file_vmas the kernel keeps mm->exe_file until
the final mmput(); it never becomes NULL while the task is alive.

We can check for other mapped files in mm instead of checking
mm->num_exe_file_vmas, and mark mm with the MMF_EXE_FILE_CHANGED flag in
order to forbid changing mm->exe_file a second time.

Signed-off-by: Konstantin Khlebnikov
Reviewed-by: Cyrill Gorcunov
Cc: Oleg Nesterov
Cc: Matt Helsley
Cc: Kees Cook
Cc: KOSAKI Motohiro
Cc: Tejun Heo
Cc: Pavel Emelyanov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Konstantin Khlebnikov
2012-06-08 05:43:55 +0800

06 Jun, 2012

9 commits

a841f8cef sched: Fix the relax_domain_level boot parameter

It does not get processed because sched_domain_level_max is 0 at the
time that setup_relax_domain_level() is run.

Simply accept the value as it is, as we don't know the value of
sched_domain_level_max until sched domain construction is completed.

Fix sched_relax_domain_level in cpuset. The build_sched_domain() routine calls
the set_domain_attribute() routine prior to setting the sd->level, however,
the set_domain_attribute() routine relies on the sd->level to decide whether
idle load balancing will be off/on.

Signed-off-by: Dimitri Sivanich
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/20120605184436.GA15668@sgi.com
Signed-off-by: Ingo Molnar

Dimitri Sivanich
2012-06-06 23:07:41 +0800
d039ac608 sched: Validate assumptions in sched_init_numa()

Add some code to validate assumptions we're making and output
warnings if they do not hold.

If this triggers we want to know about it.

Signed-off-by: Peter Zijlstra
Cc: Alex Shi
Link: http://lkml.kernel.org/n/tip-6uc3wk5s9udxtdl9cnku0vtt@git.kernel.org
Signed-off-by: Ingo Molnar

Peter Zijlstra
2012-06-06 22:52:30 +0800
c3decf0df sched: Always initialize cpu-power

Often when we run into mis-shapen topologies the balance iteration
fails to update the cpu power properly and we'll end up in /0 traps.

Always initialize the cpu-power to a semi-sane value so that we can
at least boot the machine, even if the load-balancer might not
function correctly.

Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/n/tip-3lbhyj25sr169ha7z3qht5na@git.kernel.org
Signed-off-by: Ingo Molnar

Peter Zijlstra
2012-06-06 22:52:27 +0800
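
The divide-by-zero mentioned in c3decf0df can be pictured with a small model. The struct and helper names are hypothetical, the load formula is a simplified assumption about how the balancer scales load by cpu power, and using SCHED_POWER_SCALE per CPU as the "semi-sane" initial value is also an assumption.

```c
#define SCHED_POWER_SCALE 1024UL	/* nominal power of one CPU */

/* Hypothetical stand-in for a scheduler group. */
struct group_model {
	unsigned long load;
	unsigned long power;
};

/* Give the group a non-zero default power before any balancing runs. */
static void init_group(struct group_model *g, unsigned int nr_cpus)
{
	g->load = 0;
	g->power = SCHED_POWER_SCALE * nr_cpus;
}

/* Load scaled by power; this would trap if power were left at zero. */
static unsigned long avg_load(const struct group_model *g)
{
	return g->load * SCHED_POWER_SCALE / g->power;
}
```

With the default in place, `avg_load()` stays well defined even if a later power update never happens for a misshapen topology.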
c11748768 sched: Fix domain iteration

Weird topologies can lead to asymmetric domain setups. This needs
further consideration since these setups are typically non-minimal
too.

For now, make it work by adding an extra mask selecting which CPUs
are allowed to iterate up.

The topology that triggered it is the one from David Rientjes:

10 20 20 30
20 10 20 20
20 20 10 20
30 20 20 10

resulting in boxes that wouldn't even boot.

Reported-by: David Rientjes
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/n/tip-3p86l9cuaqnxz7uxsojmz5rm@git.kernel.org
Signed-off-by: Ingo Molnar

Peter Zijlstra
2012-06-06 22:52:26 +0800
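
One way to see why the distance table quoted in c11748768 leads to asymmetric domains is to compare the rows node by node. The sketch below is plain C with hypothetical helper names; it is not the scheduler's code, only an illustration of the table itself.

```c
#define NR_NODES 4

/* Node distance table as quoted in the commit message. */
static const int node_distance[NR_NODES][NR_NODES] = {
	{ 10, 20, 20, 30 },
	{ 20, 10, 20, 20 },
	{ 20, 20, 10, 20 },
	{ 30, 20, 20, 10 },
};

/* Count how many distinct distance values appear in one node's row. */
static int distance_levels(int node)
{
	int seen[NR_NODES];
	int n = 0;

	for (int j = 0; j < NR_NODES; j++) {
		int d = node_distance[node][j];
		int k;

		for (k = 0; k < n; k++)
			if (seen[k] == d)
				break;
		if (k == n)
			seen[n++] = d;
	}
	return n;
}

/* Bitmask of the nodes that lie within max_dist of the given node. */
static unsigned int span_mask(int node, int max_dist)
{
	unsigned int mask = 0;

	for (int j = 0; j < NR_NODES; j++)
		if (node_distance[node][j] <= max_dist)
			mask |= 1u << j;
	return mask;
}
```

Nodes 0 and 3 see three distance levels while nodes 1 and 2 see only two, and at distance 20 node 0 spans nodes 0-2 whereas node 1 already spans all four, so domains built per distance level do not line up across nodes.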
7f1b43936 sched/rt: Fix lockdep annotation within find_lock_lowest_rq()

Roland Dreier reported spurious, hard to trigger lockdep warnings
within the scheduler - without any real lockup.

This bit gives us the right clue:

> [89945.640512] [] double_lock_balance+0x5a/0x90
> [89945.640568] [] push_rt_task+0xc6/0x290

if you look at that code you'll find the double_lock_balance() in
question is the one in find_lock_lowest_rq() [yay for inlining].

Now find_lock_lowest_rq() has a bug.. it fails to use
double_unlock_balance() in one exit path, if this results in a retry in
push_rt_task() we'll call double_lock_balance() again, at which point
we'll run into said lockdep confusion.

Reported-by: Roland Dreier
Acked-by: Steven Rostedt
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/1337282386.4281.77.camel@twins
Signed-off-by: Ingo Molnar

Peter Zijlstra
2012-06-06 22:52:26 +0800
10717dcde sched/numa: Load balance between remote nodes

Commit cb83b629b ("sched/numa: Rewrite the CONFIG_NUMA sched
domain support") removed the NODE sched domain and started checking
whether the node distance in the SLIT table is farther than REMOTE_DISTANCE;
if so, the load balance chance at exec/fork/wake_affine points is lost.

But in fact, even when the node distance is farther than REMOTE_DISTANCE,
modern CPUs have QPI-like connections, which ensure that memory
access is not too slow between nodes. So the above change in behavior
on NUMA machines causes a performance regression on various benchmarks:
hackbench, tbench, netperf, oltp, etc.

This patch restores the old scheduler behavior on all my
Intel platforms: NHM EP/EX, WSM EP, SNB EP/EP4S, and thus fixes the
performance regressions. (All of them have just 2 kinds of distance: 10 and 21.)

Signed-off-by: Alex Shi
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/1338965571-9812-1-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar

Alex Shi
2012-06-06 22:52:25 +0800
02e03040a Merge tag 'perf-urgent-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent

Pull perf fixes from Arnaldo Carvalho de Melo:

* Endianness fixes from Jiri Olsa

* Fixes for make perf tarball

* Fix for DSO name in perf script callchains, from David Ahern

* Segfault fixes for perf top --callchain, from Namhyung Kim

* Minor function result fixes from Srikar Dronamraju

* Add missing 3rd ioctl parameter, from Namhyung Kim

* Fix pager usage in minimal embedded systems, from Avik Sil

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Ingo Molnar
2012-06-06 14:46:33 +0800
365f0e173 Merge branch 'for-3.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fix from Tejun Heo:
"This fixes the possible premature superblock release on umount bug
mentioned during v3.5-rc1 pull request.

Originally, cgroup dentry destruction path assumed that cgroup dentry
didn't have any reference left after cgroup removal thus put super
during dentry removal. Now that there can be lingering dentry
references, this led to super being put with live dentries. This
patch fixes the problem by putting super ref on dentry release instead
of removal."

* 'for-3.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: superblock can't be released with active dentries

Linus Torvalds
2012-06-06 02:54:12 +0800
0b3e9f3f2 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar.

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Remove NULL assignment of dattr_cur
sched: Remove the last NULL entry from sched_feat_names
sched: Make sched_feat_names const
sched/rt: Fix SCHED_RR across cgroups
sched: Move nr_cpus_allowed out of 'struct sched_rt_entity'
sched: Make sure to not re-read variables after validation
sched: Fix SD_OVERLAP
sched: Don't try allocating memory from offline nodes
sched/nohz: Fix rq->cpu_load calculations some more
sched/x86: Use cpu_llc_shared_mask(cpu) for coregroup_mask

Linus Torvalds
2012-06-06 00:47:15 +0800

05 Jun, 2012

3 commits

fad0c66c4 timekeeping: Fix CLOCK_MONOTONIC inconsistency during leapsecond

Commit 6b43ae8a61 (ntp: Fix leap-second hrtimer livelock) broke the
leapsecond update of CLOCK_MONOTONIC. The missing leapsecond update to
wall_to_monotonic causes discontinuities in CLOCK_MONOTONIC.

Adjust wall_to_monotonic when NTP inserted a leapsecond.

Reported-by: Richard Cochran
Signed-off-by: John Stultz
Tested-by: Richard Cochran
Cc: stable@kernel.org
Link: http://lkml.kernel.org/r/1338400497-12420-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner

John Stultz
2012-06-05 03:46:29 +0800
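
The relationship behind fad0c66c4 can be sketched with a toy timekeeper in which CLOCK_MONOTONIC is derived as wall time plus an offset. The struct and function names are hypothetical and only whole seconds are modelled; the point is that when the wall clock is stepped for a leap second, the offset has to move by the opposite amount.

```c
/* Toy model: monotonic time = wall time + wall_to_mono offset (seconds only). */
struct timekeeper_model {
	long long xtime_sec;		/* wall clock seconds */
	long long wall_to_mono_sec;	/* offset added to get monotonic time */
};

static long long monotonic_sec(const struct timekeeper_model *tk)
{
	return tk->xtime_sec + tk->wall_to_mono_sec;
}

/*
 * Apply a leap adjustment to the wall clock: leap is -1 for an inserted
 * leap second and +1 for a deleted one. When adjust_offset is non-zero the
 * offset is corrected as well, which keeps the monotonic clock continuous.
 */
static void apply_leap(struct timekeeper_model *tk, int leap, int adjust_offset)
{
	tk->xtime_sec += leap;
	if (adjust_offset)
		tk->wall_to_mono_sec -= leap;
}
```

Without the offset correction the derived monotonic value jumps by one second at the leap, which is the kind of discontinuity the commit message describes.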
9171c670b Merge branches 'irq-urgent-for-linus' and 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq and smpboot updates from Thomas Gleixner:
"Just cleanup patches with no functional change and a fix for suspend
issues."

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Introduce irq_do_set_affinity() to reduce duplicated code
genirq: Add IRQS_PENDING for nested and simple irq

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
smpboot, idle: Fix comment mismatch over idle_threads_init()
smpboot, idle: Optimize calls to smp_processor_id() in idle_threads_init()

Linus Torvalds
2012-06-05 02:36:51 +0800
c22072bdf Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer updates from Thomas Gleixner:
"The clocksource driver is pure hardware enablement and the skew option
is default off, well tested and non dangerous."

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick: Move skew_tick option into the HIGH_RES_TIMER section
clocksource: em_sti: Add DT support
clocksource: em_sti: Emma Mobile STI driver
clockevents: Make clockevents_config() a global symbol
tick: Add tick skew boot option

Linus Torvalds
2012-06-05 02:25:31 +0800

02 Jun, 2012

6 commits

86c47b70f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal

Pull third pile of signal handling patches from Al Viro:
"This time it's mostly helpers and conversions to them; there's a lot
of stuff remaining in the tree, but that'll either go in -rc2
(isolated bug fixes, ideally via arch maintainers' trees) or will sit
there until the next cycle."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
x86: get rid of calling do_notify_resume() when returning to kernel mode
blackfin: check __get_user() return value
whack-a-mole with TIF_FREEZE
FRV: Optimise the system call exit path in entry.S [ver #2]
FRV: Shrink TIF_WORK_MASK [ver #2]
FRV: Prevent syscall exit tracing and notify_resume at end of kernel exceptions
new helper: signal_delivered()
powerpc: get rid of restore_sigmask()
most of set_current_blocked() callers want SIGKILL/SIGSTOP removed from set
set_restore_sigmask() is never called without SIGPENDING (and never should be)
TIF_RESTORE_SIGMASK can be set only when TIF_SIGPENDING is set
don't call try_to_freeze() from do_signal()
pull clearing RESTORE_SIGMASK into block_sigmask()
sh64: failure to build sigframe != signal without handler
openrisc: tracehook_signal_handler() is supposed to be called on success
new helper: sigmask_to_save()
new helper: restore_saved_sigmask()
new helpers: {clear,test,test_and_clear}_restore_sigmask()
HAVE_RESTORE_SIGMASK is defined on all architectures now

Linus Torvalds
2012-06-02 02:53:44 +0800
1193755ac Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs changes from Al Viro.
"A lot of misc stuff. The obvious groups:
* Miklos' atomic_open series; kills the damn abuse of
->d_revalidate() by NFS, which was the major stumbling block for
all work in that area.
* ripping security_file_mmap() and dealing with deadlocks in the
area; sanitizing the neighborhood of vm_mmap()/vm_munmap() in
general.
* ->encode_fh() switched to saner API; insane fake dentry in
mm/cleancache.c gone.
* assorted annotations in fs (endianness, __user)
* parts of Artem's ->s_dirty work (jffs2 and reiserfs parts)
* ->update_time() work from Josef.
* other bits and pieces all over the place.

Normally it would've been in two or three pull requests, but
signal.git stuff had eaten a lot of time during this cycle ;-/"

Fix up trivial conflicts in Documentation/filesystems/vfs.txt (the
'truncate_range' inode method was removed by the VM changes, the VFS
update adds an 'update_time()' method), and in fs/btrfs/ulist.[ch] (due
to sparse fix added twice, with other changes nearby).

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (95 commits)
nfs: don't open in ->d_revalidate
vfs: retry last component if opening stale dentry
vfs: nameidata_to_filp(): don't throw away file on error
vfs: nameidata_to_filp(): inline __dentry_open()
vfs: do_dentry_open(): don't put filp
vfs: split __dentry_open()
vfs: do_last() common post lookup
vfs: do_last(): add audit_inode before open
vfs: do_last(): only return EISDIR for O_CREAT
vfs: do_last(): check LOOKUP_DIRECTORY
vfs: do_last(): make ENOENT exit RCU safe
vfs: make follow_link check RCU safe
vfs: do_last(): use inode variable
vfs: do_last(): inline walk_component()
vfs: do_last(): make exit RCU safe
vfs: split do_lookup()
Btrfs: move over to use ->update_time
fs: introduce inode operation ->update_time
reiserfs: get rid of resierfs_sync_super
reiserfs: mark the superblock as dirty a bit later
...

Linus Torvalds
2012-06-02 01:34:35 +0800
efee984c2 new helper: signal_delivered()

Does block_sigmask() + tracehook_signal_handler(); called when
sigframe has been successfully built. All architectures converted
to it; block_sigmask() itself is gone now (merged into this one).

I'm still not too happy with the signature, but that's a separate
story (IMO we need a structure that would contain signal number +
siginfo + k_sigaction, so that get_signal_to_deliver() would fill one,
signal_delivered(), handle_signal() and probably setup...frame() -
take one).

Signed-off-by: Al Viro

Al Viro
2012-06-02 00:58:52 +0800
77097ae50 most of set_current_blocked() callers want SIGKILL/SIGSTOP removed from set

Only 3 out of 63 do not. Renamed the current variant to __set_current_blocked(),
added set_current_blocked() that will exclude unblockable signals, switched
open-coded instances to it.

Signed-off-by: Al Viro

Al Viro
2012-06-02 00:58:51 +0800
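
The idea behind 77097ae50, stripping the unblockable signals from a mask before applying it, has a direct userspace analogue. The sketch below uses the POSIX sigset functions rather than the kernel's own helpers, and `drop_unblockable()` is a hypothetical name.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>

/* Remove the signals that can never be blocked from a signal set. */
static void drop_unblockable(sigset_t *set)
{
	sigdelset(set, SIGKILL);
	sigdelset(set, SIGSTOP);
}
```

Doing this once inside a common helper means individual callers no longer need to open-code the same two removals, which is the consolidation the commit message describes.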
a610d6e67 pull clearing RESTORE_SIGMASK into block_sigmask()

Signed-off-by: Al Viro

Al Viro
2012-06-02 00:58:49 +0800
754421c8c HAVE_RESTORE_SIGMASK is defined on all architectures now

Everyone either defines it in arch thread_info.h or has TIF_RESTORE_SIGMASK
and picks default set_restore_sigmask() in linux/thread_info.h. Kill the
ifdefs, slap #error in linux/thread_info.h to catch breakage when new ones
get merged.

Signed-off-by: Al Viro

Al Viro
2012-06-02 00:58:46 +0800

01 Jun, 2012

12 commits

fb21affa4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal

Pull second pile of signal handling patches from Al Viro:
"This one is just task_work_add() series + remaining prereqs for it.

There probably will be another pull request from that tree this
cycle - at least for helpers, to get them out of the way for per-arch
fixes remaining in the tree."

Fix trivial conflict in kernel/irq/manage.c: the merge of Andrew's pile
had brought in commit 97fd75b7b8e0 ("kernel/irq/manage.c: use the
pr_foo() infrastructure to prefix printks") which changed one of the
pr_err() calls that this merge moves around.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
keys: kill task_struct->replacement_session_keyring
keys: kill the dummy key_replace_session_keyring()
keys: change keyctl_session_to_parent() to use task_work_add()
genirq: reimplement exit_irq_thread() hook via task_work_add()
task_work_add: generic process-context callbacks
avr32: missed _TIF_NOTIFY_RESUME on one of do_notify_resume callers
parisc: need to check NOTIFY_RESUME when exiting from syscall
move key_repace_session_keyring() into tracehook_notify_resume()
TIF_NOTIFY_RESUME is defined on all targets now

Linus Torvalds
2012-06-01 09:47:30 +0800
08615d7d8 Merge branch 'akpm' (Andrew's patch-bomb)

Merge misc patches from Andrew Morton:

- the "misc" tree - stuff from all over the map

- checkpatch updates

- fatfs

- kmod changes

- procfs

- cpumask

- UML

- kexec

- mqueue

- rapidio

- pidns

- some checkpoint-restore feature work. Reluctantly. Most of it
delayed a release. I'm still rather worried that we don't have a
clear roadmap to completion for this work.

* emailed from Andrew Morton: (78 patches)
kconfig: update compression algorithm info
c/r: prctl: add ability to set new mm_struct::exe_file
c/r: prctl: extend PR_SET_MM to set up more mm_struct entries
c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat
syscalls, x86: add __NR_kcmp syscall
fs, proc: introduce /proc/<pid>/task/<tid>/children entry
sysctl: make kernel.ns_last_pid control dependent on CHECKPOINT_RESTORE
aio/vfs: cleanup of rw_copy_check_uvector() and compat_rw_copy_check_uvector()
eventfd: change int to __u64 in eventfd_signal()
fs/nls: add Apple NLS
pidns: make killed children autoreap
pidns: use task_active_pid_ns in do_notify_parent
rapidio/tsi721: add DMA engine support
rapidio: add DMA engine support for RIO data transfers
ipc/mqueue: add rbtree node caching support
tools/selftests: add mq_perf_tests
ipc/mqueue: strengthen checks on mqueue creation
ipc/mqueue: correct mq_attr_ok test
ipc/mqueue: improve performance of send/recv
selftests: add mq_open_tests
...

Linus Torvalds
2012-06-01 09:10:18 +0800
b32dfe377 c/r: prctl: add ability to set new mm_struct::exe_file

When we do restore we would like to have a way to setup a former
mm_struct::exe_file so that /proc/pid/exe would point to the original
executable file a process had at checkpoint time.

For this the PR_SET_MM_EXE_FILE code is introduced. This option takes a
file descriptor which will be set as a source for new /proc/$pid/exe
symlink.

Note that /proc/$pid/exe can be changed only if there are no VM_EXECUTABLE
vmas present for the current process, simply because this feature is
specific to C/R and mm::num_exe_file_vmas becomes meaningless after that.

To minimize the number of transitions the /proc/pid/exe symlink might have,
this feature is implemented in a one-shot manner. Thus once changed the
symlink can't be changed again. This should help sysadmins to monitor the
symlinks over all processes running in a system.

In particular one could make a snapshot of processes and ring an alarm if
there are unexpected changes of /proc/pid/exe's in a system.

Note -- this feature is available iff CONFIG_CHECKPOINT_RESTORE is set and
the caller must have CAP_SYS_RESOURCE capability granted, otherwise the
request to change symlink will be rejected.

Signed-off-by: Cyrill Gorcunov
Reviewed-by: Oleg Nesterov
Cc: KOSAKI Motohiro
Cc: Pavel Emelyanov
Cc: Kees Cook
Cc: Tejun Heo
Cc: Matt Helsley
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2012-06-01 08:49:32 +0800
fe8c7f5cb c/r: prctl: extend PR_SET_MM to set up more mm_struct entries

During checkpoint we dump the whole process memory to a file and the dump
includes the process stack memory. But besides the stack data itself, the
stack carries additional parameters such as command line arguments,
environment data and the auxiliary vector.

So during the restore procedure, once we've restored the stack data itself,
we need to set up mm_struct::arg_start/end and env_start/end so that the
restored process is able to find the command line arguments and environment
data it had at checkpoint time. The same applies to the auxiliary vector.

For this reason additional PR_SET_MM_(ARG_START | ARG_END | ENV_START |
ENV_END | AUXV) codes are introduced.

Signed-off-by: Cyrill Gorcunov
Acked-by: Kees Cook
Cc: Tejun Heo
Cc: Andrew Vagin
Cc: Serge Hallyn
Cc: Pavel Emelyanov
Cc: Vasiliy Kulikov
Cc: KAMEZAWA Hiroyuki
Cc: Michael Kerrisk
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2012-06-01 08:49:32 +0800
d97b46a64 syscalls, x86: add __NR_kcmp syscall

While doing checkpoint-restore in user space one needs to determine
whether various kernel objects (like mm_struct-s or file_struct-s) are
shared between tasks and restore this state.

The 2nd step can be solved by using appropriate CLONE_ flags and the
unshare syscall, while there is currently no way of solving the 1st one.

One of the ways of checking whether two tasks share e.g. an mm_struct is to
expose some mm_struct ID of a task in its proc file, but showing such
info is considered not that good for security reasons.

Thus after some debate we came to the conclusion that a so-called
'comparison' syscall might be the best candidate. So here it is --
__NR_kcmp.

It takes up to 5 arguments - the pids of the two tasks (whose
characteristics should be compared), the comparison type and (in case of
comparison of files) two file descriptors.

Lookups for pids are done in the caller's PID namespace only.

At the moment only x86 is supported and tested.

[akpm@linux-foundation.org: fix up selftests, warnings]
[akpm@linux-foundation.org: include errno.h]
[akpm@linux-foundation.org: tweak comment text]
Signed-off-by: Cyrill Gorcunov
Acked-by: "Eric W. Biederman"
Cc: Pavel Emelyanov
Cc: Andrey Vagin
Cc: KOSAKI Motohiro
Cc: Ingo Molnar
Cc: H. Peter Anvin
Cc: Thomas Gleixner
Cc: Glauber Costa
Cc: Andi Kleen
Cc: Tejun Heo
Cc: Matt Helsley
Cc: Pekka Enberg
Cc: Eric Dumazet
Cc: Vasiliy Kulikov
Cc: Alexey Dobriyan
Cc: Valdis.Kletnieks@vt.edu
Cc: Michal Marek
Cc: Frederic Weisbecker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2012-06-01 08:49:32 +0800
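
A minimal sketch of calling the syscall introduced by d97b46a64 from user space, comparing two file descriptors. It assumes the installed kernel headers provide `SYS_kcmp` and `KCMP_FILE`; the wrapper name `compare_fds()` is hypothetical, and the call may fail where the syscall is unavailable or filtered.

```c
#define _GNU_SOURCE
#include <linux/kcmp.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Compare the open file behind fd1 in pid1 with the one behind fd2 in pid2.
 * Returns 0 when both descriptors refer to the same open file, a positive
 * value when they differ, and -1 with errno set on error.
 */
static long compare_fds(pid_t pid1, int fd1, pid_t pid2, int fd2)
{
	return syscall(SYS_kcmp, pid1, pid2, KCMP_FILE,
		       (unsigned long)fd1, (unsigned long)fd2);
}
```

For example, a descriptor and its dup() in the same process are expected to compare as equal, since both point at one open file description.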
98ed57eef sysctl: make kernel.ns_last_pid control dependent on CHECKPOINT_RESTORE

For those who don't need C/R functionality there is no need to control
the last pid, i.e. the pid for the next fork() call.

Signed-off-by: Cyrill Gorcunov
Cc: Pavel Emelyanov
Cc: Tejun Heo
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2012-06-01 08:49:32 +0800
00c10bc13 pidns: make killed children autoreap

Force SIGCHLD handling to SIG_IGN so that signals are not generated and so
that the children autoreap. This increases the parallelism and in general
the speed of network namespace shutdown.

Note that self-reaping children can exist past zap_pid_ns_processes() but
they will all be reaped before we allow the pid namespace init task with
pid == 1 to be reaped.

[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Eric W. Biederman
Cc: Oleg Nesterov
Cc: Pavel Emelyanov
Cc: Cyrill Gorcunov
Cc: Louis Rilling
Cc: Mike Galbraith
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric W. Biederman
2012-06-01 08:49:32 +0800
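
The autoreap behaviour that 00c10bc13 relies on can be observed from user space: when SIGCHLD is set to SIG_IGN, exited children are reaped by the kernel and a later waitpid() finds nothing to collect. The sketch below is a userspace demonstration with a hypothetical function name, not the pid namespace code itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Ignore SIGCHLD, fork a child that exits at once, then try to wait for it.
 * Returns the errno reported by waitpid() (ECHILD means the child was
 * already reaped), 0 if waitpid() unexpectedly succeeded, or -1 if the
 * setup itself failed.
 */
static int wait_errno_with_sigchld_ignored(void)
{
	struct sigaction sa, old;
	int status, err = 0;
	pid_t pid;

	sa.sa_handler = SIG_IGN;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0;
	if (sigaction(SIGCHLD, &sa, &old) != 0)
		return -1;

	pid = fork();
	if (pid < 0) {
		sigaction(SIGCHLD, &old, NULL);
		return -1;
	}
	if (pid == 0)
		_exit(0);	/* child: exit immediately */

	if (waitpid(pid, &status, 0) == -1)
		err = errno;

	sigaction(SIGCHLD, &old, NULL);	/* restore the previous disposition */
	return err;
}
```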
320845048 pidns: use task_active_pid_ns in do_notify_parent

Using task_active_pid_ns is more robust because it works even after we
have called exit_namespaces. This change allows us to have parent
processes that are zombies. Normally a zombie parent process is crazy
and the last thing you would want to have, but in the case of not letting
the init process of a pid namespace be reaped until all of its children
are dead and reaped, a zombie parent process is exactly what we want.

Signed-off-by: Eric W. Biederman
Cc: Oleg Nesterov
Cc: Pavel Emelyanov
Cc: Cyrill Gorcunov
Cc: Louis Rilling
Cc: Mike Galbraith
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric W. Biederman
2012-06-01 08:49:31 +0800
e4cc2f873 kernel/cpu.c: document clear_tasks_mm_cpumask()

Add more comments on clear_tasks_mm_cpumask, plus add a runtime check:
the function is only suitable for offlined CPUs, and if called
inappropriately, the kernel should scream aloud.

[akpm@linux-foundation.org: tweak comment: s/walks up/walks/, use 80 cols]
Suggested-by: Andrew Morton
Suggested-by: Peter Zijlstra
Signed-off-by: Anton Vorontsov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Anton Vorontsov
2012-06-01 08:49:30 +0800
cb79295e2 cpu: introduce clear_tasks_mm_cpumask() helper

Many architectures clear tasks' mm_cpumask like this:

	read_lock(&tasklist_lock);
	for_each_process(p) {
		if (p->mm)
			cpumask_clear_cpu(cpu, mm_cpumask(p->mm));
	}
	read_unlock(&tasklist_lock);

Depending on the context, the code above may have several problems,
such as:

1. Working with task->mm w/o getting mm or grabbing the task lock is
dangerous as ->mm might disappear (exit_mm() assigns NULL under
task_lock(), so tasklist lock is not enough).

2. Checking for process->mm is not enough because process' main
thread may exit or detach its mm via use_mm(), but other threads
may still have a valid mm.

This patch implements a small helper function that does things
correctly, i.e.:

1. We take the task's lock while we handle its mm (we can't use
get_task_mm()/mmput() pair as mmput() might sleep);

2. To catch exited main thread case, we use find_lock_task_mm(),
which walks up all threads and returns an appropriate task
(with task lock held).

Also, per Peter Zijlstra's idea, now we don't grab tasklist_lock in
the new helper, instead we take the rcu read lock. We can do this
because the function is called after the cpu is taken down and marked
offline, so no new tasks will get this cpu set in their mm mask.

Signed-off-by: Anton Vorontsov
Cc: Richard Weinberger
Cc: Oleg Nesterov
Cc: Peter Zijlstra
Cc: Russell King
Cc: Benjamin Herrenschmidt
Cc: Mike Frysinger
Cc: Paul Mundt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Anton Vorontsov
2012-06-01 08:49:29 +0800
f7505d64f fork: call complete_vfork_done() after clearing child_tid and flushing rss-counters

Child should wake up the parent from vfork() only after finishing all
operations with shared mm. There is no sense in using
CLONE_CHILD_CLEARTID together with CLONE_VFORK, but it looks more accurate
now.

Signed-off-by: Konstantin Khlebnikov
Cc: Oleg Nesterov
Cc: Hugh Dickins
Cc: KAMEZAWA Hiroyuki
Cc: Konstantin Khlebnikov
Cc: Markus Trippelsdorf
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Konstantin Khlebnikov
2012-06-01 08:49:29 +0800
168eeccbc stack usage: add pid to warning printk in check_stack_usage

In embedded systems, sometimes the same program (busybox) is the cause of
multiple warnings. Outputting the pid with the program name in the
warning printk helps distinguish which instances of a program are using
the stack most.

This is a small patch, but useful.

Signed-off-by: Tim Bird
Cc: Oleg Nesterov
Cc: Frederic Weisbecker
Cc: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tim Bird
2012-06-01 08:49:28 +0800