Eric Lee / smarc-fsl-linux-kernel

15 Jun, 2013

1 commit

8aac62706 move exit_task_namespaces() outside of exit_notify() ... Browse Code »

exit_notify() does exit_task_namespaces() after
forget_original_parent(). This was needed to ensure that ->nsproxy
can't be cleared prematurely, an exiting child we are going to
reparent can do do_notify_parent() and use the parent's (ours) pid_ns.

However, after 32084504 "pidns: use task_active_pid_ns in
do_notify_parent" ->nsproxy != NULL is no longer needed, we rely
on task_active_pid_ns().

Move exit_task_namespaces() from exit_notify() to do_exit(), after
exit_fs() and before exit_task_work().

This solves the problem reported by Andrey, free_ipc_ns()->shm_destroy()
does fput() which needs task_work_add().

Note: this particular problem can be fixed if we change fput(), and
that change makes sense anyway. But there is another reason to move
the callsite. The original reason for exit_task_namespaces() from
the middle of exit_notify() was subtle and it has already gone away,
now this looks confusing. And this allows us do simplify exit_notify(),
we can avoid unlock/lock(tasklist) and we can use ->exit_state instead
of PF_EXITING in forget_original_parent().

Reported-by: Andrey Vagin
Signed-off-by: Oleg Nesterov
Acked-by: "Eric W. Biederman"
Acked-by: Andrey Vagin
Signed-off-by: Al Viro

Oleg Nesterov
2013-06-15 09:39:08 +0800

02 May, 2013

1 commit

20b4fb485 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull VFS updates from Al Viro,

Misc cleanups all over the place, mainly wrt /proc interfaces (switch
create_proc_entry to proc_create(), get rid of the deprecated
create_proc_read_entry() in favor of using proc_create_data() and
seq_file etc).

7kloc removed.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
don't bother with deferred freeing of fdtables
proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
proc: Make the PROC_I() and PDE() macros internal to procfs
proc: Supply a function to remove a proc entry by PDE
take cgroup_open() and cpuset_open() to fs/proc/base.c
ppc: Clean up scanlog
ppc: Clean up rtas_flash driver somewhat
hostap: proc: Use remove_proc_subtree()
drm: proc: Use remove_proc_subtree()
drm: proc: Use minor->index to label things, not PDE->name
drm: Constify drm_proc_list[]
zoran: Don't print proc_dir_entry data in debug
reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
proc: Supply an accessor for getting the data from a PDE's parent
airo: Use remove_proc_subtree()
rtl8192u: Don't need to save device proc dir PDE
rtl8187se: Use a dir under /proc/net/r8180/
proc: Add proc_mkdir_data()
proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
proc: Move PDE_NET() to fs/proc/proc_net.c
...

Linus Torvalds
2013-05-02 08:51:54 +0800

01 May, 2013

1 commit

08d767608 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal ... Browse Code »

Pull compat cleanup from Al Viro:
"Mostly about syscall wrappers this time; there will be another pile
with patches in the same general area from various people, but I'd
rather push those after both that and vfs.git pile are in."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
syscalls.h: slightly reduce the jungles of macros
get rid of union semop in sys_semctl(2) arguments
make do_mremap() static
sparc: no need to sign-extend in sync_file_range() wrapper
ppc compat wrappers for add_key(2) and request_key(2) are pointless
x86: trim sys_ia32.h
x86: sys32_kill and sys32_mprotect are pointless
get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC
merge compat sys_ipc instances
consolidate compat lookup_dcookie()
convert vmsplice to COMPAT_SYSCALL_DEFINE
switch getrusage() to COMPAT_SYSCALL_DEFINE
switch epoll_pwait to COMPAT_SYSCALL_DEFINE
convert sendfile{,64} to COMPAT_SYSCALL_DEFINE
switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE
make SYSCALL_DEFINE-generated wrappers do asmlinkage_protect
make HAVE_SYSCALL_WRAPPERS unconditional
consolidate cond_syscall and SYSCALL_ALIAS declarations
teach SYSCALL_DEFINE how to deal with long long/unsigned long long
get rid of duplicate logics in __SC_....[1-6] definitions

Linus Torvalds
2013-05-01 22:21:43 +0800

10 Apr, 2013

1 commit

4b8a8f1e4 get rid of the last free_pipe_info() callers ... Browse Code »

and rename __free_pipe_info() to free_pipe_info()

Signed-off-by: Al Viro

Al Viro
2013-04-10 02:13:02 +0800

01 Apr, 2013

1 commit

dbf520a9d Revert "lockdep: check that no locks held at freeze time" ... Browse Code »

This reverts commit 6aa9707099c4b25700940eb3d016f16c4434360d.

Commit 6aa9707099c4 ("lockdep: check that no locks held at freeze time")
causes problems with NFS root filesystems. The failures were noticed on
OMAP2 and 3 boards during kernel init:

[ BUG: swapper/0/1 still has locks held! ]
3.9.0-rc3-00344-ga937536 #1 Not tainted
-------------------------------------
1 lock held by swapper/0/1:
#0: (&type->s_umount_key#13/1){+.+.+.}, at: [] sget+0x248/0x574

stack backtrace:
rpc_wait_bit_killable
__wait_on_bit
out_of_line_wait_on_bit
__rpc_execute
rpc_run_task
rpc_call_sync
nfs_proc_get_root
nfs_get_root
nfs_fs_mount_common
nfs_try_mount
nfs_fs_mount
mount_fs
vfs_kern_mount
do_mount
sys_mount
do_mount_root
mount_root
prepare_namespace
kernel_init_freeable
kernel_init

Although the rootfs mounts, the system is unstable. Here's a transcript
from a PM test:

http://www.pwsan.com/omap/testlogs/test_v3.9-rc3/20130317194234/pm/37xxevm/37xxevm_log.txt

Here's what the test log should look like:

http://www.pwsan.com/omap/testlogs/test_v3.8/20130218214403/pm/37xxevm/37xxevm_log.txt

Mailing list discussion is here:

http://lkml.org/lkml/2013/3/4/221

Deal with this for v3.9 by reverting the problem commit, until folks can
figure out the right long-term course of action.

Signed-off-by: Paul Walmsley
Cc: Mandeep Singh Baines
Cc: Jeff Layton
Cc: Shawn Guo
Cc:
Cc: Fengguang Wu
Cc: Trond Myklebust
Cc: Ingo Molnar
Cc: Ben Chan
Cc: Oleg Nesterov
Cc: Tejun Heo
Cc: Rafael J. Wysocki
Cc: Andrew Morton
Signed-off-by: Linus Torvalds

Paul Walmsley
2013-04-01 02:38:33 +0800

04 Mar, 2013

1 commit

2cf096668 make SYSCALL_DEFINE<n>-generated wrappers do asmlinkage_protect ... Browse Code »

... and switch i386 to HAVE_SYSCALL_WRAPPERS, killing open-coded
uses of asmlinkage_protect() in a bunch of syscalls.

Signed-off-by: Al Viro

Al Viro
2013-03-04 11:58:33 +0800

28 Feb, 2013

2 commits

80d26af89 coredump: use a freezable_schedule for the coredump_finish wait ... Browse Code »

Prevents hung_task detector from panicing the machine. This is also
needed to prevent this wait from blocking suspend.

(It doesnt' currently block suspend but it would once the next
patch in this series is applied.)

[yongjun_wei@trendmicro.com.cn: kernel/exit.c: remove duplicated include]
Signed-off-by: Mandeep Singh Baines
Cc: Ben Chan
Cc: Oleg Nesterov
Cc: Tejun Heo
Cc: Rafael J. Wysocki
Cc: Ingo Molnar
Signed-off-by: Wei Yongjun
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mandeep Singh Baines
2013-02-28 11:10:11 +0800
6aa970709 lockdep: check that no locks held at freeze time ... Browse Code »

We shouldn't try_to_freeze if locks are held. Holding a lock can cause a
deadlock if the lock is later acquired in the suspend or hibernate path
(e.g. by dpm). Holding a lock can also cause a deadlock in the case of
cgroup_freezer if a lock is held inside a frozen cgroup that is later
acquired by a process outside that group.

[akpm@linux-foundation.org: export debug_check_no_locks_held]
Signed-off-by: Mandeep Singh Baines
Cc: Ben Chan
Cc: Oleg Nesterov
Cc: Tejun Heo
Cc: Rafael J. Wysocki
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mandeep Singh Baines
2013-02-28 11:10:11 +0800

28 Jan, 2013

1 commit

6fac4829c cputime: Use accessors to read task cputime stats ... Browse Code »

This is in preparation for the full dynticks feature. While
remotely reading the cputime of a task running in a full
dynticks CPU, we'll need to do some extra-computation. This
way we can account the time it spent tickless in userspace
since its last cputime snapshot.

Signed-off-by: Frederic Weisbecker
Cc: Andrew Morton
Cc: Ingo Molnar
Cc: Li Zhong
Cc: Namhyung Kim
Cc: Paul E. McKenney
Cc: Paul Gortmaker
Cc: Peter Zijlstra
Cc: Steven Rostedt
Cc: Thomas Gleixner

Frederic Weisbecker
2013-01-28 02:23:31 +0800

18 Dec, 2012

1 commit

6a2b60b17 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace ... Browse Code »

Pull user namespace changes from Eric Biederman:
"While small this set of changes is very significant with respect to
containers in general and user namespaces in particular. The user
space interface is now complete.

This set of changes adds support for unprivileged users to create user
namespaces and as a user namespace root to create other namespaces.
The tyranny of supporting suid root preventing unprivileged users from
using cool new kernel features is broken.

This set of changes completes the work on setns, adding support for
the pid, user, mount namespaces.

This set of changes includes a bunch of basic pid namespace
cleanups/simplifications. Of particular significance is the rework of
the pid namespace cleanup so it no longer requires sending out
tendrils into all kinds of unexpected cleanup paths for operation. At
least one case of broken error handling is fixed by this cleanup.

The files under /proc//ns/ have been converted from regular files
to magic symlinks which prevents incorrect caching by the VFS,
ensuring the files always refer to the namespace the process is
currently using and ensuring that the ptrace_mayaccess permission
checks are always applied.

The files under /proc//ns/ have been given stable inode numbers
so it is now possible to see if different processes share the same
namespaces.

Through the David Miller's net tree are changes to relax many of the
permission checks in the networking stack to allowing the user
namespace root to usefully use the networking stack. Similar changes
for the mount namespace and the pid namespace are coming through my
tree.

Two small changes to add user namespace support were commited here adn
in David Miller's -net tree so that I could complete the work on the
/proc//ns/ files in this tree.

Work remains to make it safe to build user namespaces and 9p, afs,
ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the
Kconfig guard remains in place preventing that user namespaces from
being built when any of those filesystems are enabled.

Future design work remains to allow root users outside of the initial
user namespace to mount more than just /proc and /sys."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (38 commits)
proc: Usable inode numbers for the namespace file descriptors.
proc: Fix the namespace inode permission checks.
proc: Generalize proc inode allocation
userns: Allow unprivilged mounts of proc and sysfs
userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file
procfs: Print task uids and gids in the userns that opened the proc file
userns: Implement unshare of the user namespace
userns: Implent proc namespace operations
userns: Kill task_user_ns
userns: Make create_new_namespaces take a user_ns parameter
userns: Allow unprivileged use of setns.
userns: Allow unprivileged users to create new namespaces
userns: Allow setting a userns mapping to your current uid.
userns: Allow chown and setgid preservation
userns: Allow unprivileged users to create user namespaces.
userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped
userns: fix return value on mntns_install() failure
vfs: Allow unprivileged manipulation of the mount namespace.
vfs: Only support slave subtrees across different user namespaces
vfs: Add a user namespace reference from struct mnt_namespace
...

Linus Torvalds
2012-12-18 07:44:47 +0800

13 Dec, 2012

1 commit

9977d9b37 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal ... Browse Code »

Pull big execve/kernel_thread/fork unification series from Al Viro:
"All architectures are converted to new model. Quite a bit of that
stuff is actually shared with architecture trees; in such cases it's
literally shared branch pulled by both, not a cherry-pick.

A lot of ugliness and black magic is gone (-3KLoC total in this one):

- kernel_thread()/kernel_execve()/sys_execve() redesign.

We don't do syscalls from kernel anymore for either kernel_thread()
or kernel_execve():

kernel_thread() is essentially clone(2) with callback run before we
return to userland, the callbacks either never return or do
successful do_execve() before returning.

kernel_execve() is a wrapper for do_execve() - it doesn't need to
do transition to user mode anymore.

As a result kernel_thread() and kernel_execve() are
arch-independent now - they live in kernel/fork.c and fs/exec.c
resp. sys_execve() is also in fs/exec.c and it's completely
architecture-independent.

- daemonize() is gone, along with its parts in fs/*.c

- struct pt_regs * is no longer passed to do_fork/copy_process/
copy_thread/do_execve/search_binary_handler/->load_binary/do_coredump.

- sys_fork()/sys_vfork()/sys_clone() unified; some architectures
still need wrappers (ones with callee-saved registers not saved in
pt_regs on syscall entry), but the main part of those suckers is in
kernel/fork.c now."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (113 commits)
do_coredump(): get rid of pt_regs argument
print_fatal_signal(): get rid of pt_regs argument
ptrace_signal(): get rid of unused arguments
get rid of ptrace_signal_deliver() arguments
new helper: signal_pt_regs()
unify default ptrace_signal_deliver
flagday: kill pt_regs argument of do_fork()
death to idle_regs()
don't pass regs to copy_process()
flagday: don't pass regs to copy_thread()
bfin: switch to generic vfork, get rid of pointless wrappers
xtensa: switch to generic clone()
openrisc: switch to use of generic fork and clone
unicore32: switch to generic clone(2)
score: switch to generic fork/vfork/clone
c6x: sanitize copy_thread(), get rid of clone(2) wrapper, switch to generic clone()
take sys_fork/sys_vfork/sys_clone prototypes to linux/syscalls.h
mn10300: switch to generic fork/vfork/clone
h8300: switch to generic fork/vfork/clone
tile: switch to generic clone()
...

Conflicts:
arch/microblaze/include/asm/Kbuild

Linus Torvalds
2012-12-13 04:22:13 +0800

29 Nov, 2012

2 commits

c4144670f kill daemonize() ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2012-11-29 10:49:02 +0800
e80d0a1ae cputime: Rename thread_group_times to thread_group_cputime_adjusted ... Browse Code »

We have thread_group_cputime() and thread_group_times(). The naming
doesn't provide enough information about the difference between
these two APIs.

To lower the confusion, rename thread_group_times() to
thread_group_cputime_adjusted(). This name better suggests that
it's a version of thread_group_cputime() that does some stabilization
on the raw cputime values. ie here: scale on top of CFS runtime
stats and bound lower value for monotonicity.

Signed-off-by: Frederic Weisbecker
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Steven Rostedt
Cc: Paul Gortmaker

Frederic Weisbecker
2012-11-29 00:07:57 +0800

19 Nov, 2012

1 commit

af4b8a83a pidns: Wait in zap_pid_ns_processes until pid_ns->nr_hashed == 1 ... Browse Code »

Looking at pid_ns->nr_hashed is a bit simpler and it works for
disjoint process trees that an unshare or a join of a pid_namespace
may create.

Acked-by: "Serge E. Hallyn"
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2012-11-19 21:59:12 +0800

03 Oct, 2012

1 commit

aab174f0d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull vfs update from Al Viro:

- big one - consolidation of descriptor-related logics; almost all of
that is moved to fs/file.c

(BTW, I'm seriously tempted to rename the result to fd.c. As it is,
we have a situation when file_table.c is about handling of struct
file and file.c is about handling of descriptor tables; the reasons
are historical - file_table.c used to be about a static array of
struct file we used to have way back).

A lot of stray ends got cleaned up and converted to saner primitives,
disgusting mess in android/binder.c is still disgusting, but at least
doesn't poke so much in descriptor table guts anymore. A bunch of
relatively minor races got fixed in process, plus an ext4 struct file
leak.

- related thing - fget_light() partially unuglified; see fdget() in
there (and yes, it generates the code as good as we used to have).

- also related - bits of Cyrill's procfs stuff that got entangled into
that work; _not_ all of it, just the initial move to fs/proc/fd.c and
switch of fdinfo to seq_file.

- Alex's fs/coredump.c spiltoff - the same story, had been easier to
take that commit than mess with conflicts. The rest is a separate
pile, this was just a mechanical code movement.

- a few misc patches all over the place. Not all for this cycle,
there'll be more (and quite a few currently sit in akpm's tree)."

Fix up trivial conflicts in the android binder driver, and some fairly
simple conflicts due to two different changes to the sock_alloc_file()
interface ("take descriptor handling from sock_alloc_file() to callers"
vs "net: Providing protocol type via system.sockprotoname xattr of
/proc/PID/fd entries" adding a dentry name to the socket)

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
MAX_LFS_FILESIZE should be a loff_t
compat: fs: Generic compat_sys_sendfile implementation
fs: push rcu_barrier() from deactivate_locked_super() to filesystems
btrfs: reada_extent doesn't need kref for refcount
coredump: move core dump functionality into its own file
coredump: prevent double-free on an error path in core dumper
usb/gadget: fix misannotations
fcntl: fix misannotations
ceph: don't abuse d_delete() on failure exits
hypfs: ->d_parent is never NULL or negative
vfs: delete surplus inode NULL check
switch simple cases of fget_light to fdget
new helpers: fdget()/fdput()
switch o2hb_region_dev_write() to fget_light()
proc_map_files_readdir(): don't bother with grabbing files
make get_file() return its argument
vhost_set_vring(): turn pollstart/pollstop into bool
switch prctl_set_mm_exe_file() to fget_light()
switch xfs_find_handle() to fget_light()
switch xfs_swapext() to fget_light()
...

Linus Torvalds
2012-10-03 11:25:04 +0800

27 Sep, 2012

2 commits

864bdb3b6 new helper: daemonize_descriptors() ... Browse Code »

descriptor-related parts of daemonize, done right. As the
result we simplify the locking rules for ->files - we
hold task_lock in *all* cases when we modify ->files.

Signed-off-by: Al Viro

Al Viro
2012-09-27 09:10:00 +0800
7cf4dc3c8 move files_struct-related bits from kernel/exit.c to fs/file.c ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2012-09-27 09:08:54 +0800

25 Sep, 2012

1 commit

5640f7685 net: use a per task frag allocator ... Browse Code »

We currently use a per socket order-0 page cache for tcp_sendmsg()
operations.

This page is used to build fragments for skbs.

Its done to increase probability of coalescing small write() into
single segments in skbs still in write queue (not yet sent)

But it wastes a lot of memory for applications handling many mostly
idle sockets, since each socket holds one page in sk->sk_sndmsg_page

Its also quite inefficient to build TSO 64KB packets, because we need
about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
page allocator more than wanted.

This patch adds a per task frag allocator and uses bigger pages,
if available. An automatic fallback is done in case of memory pressure.

(up to 32768 bytes per frag, thats order-3 pages on x86)

This increases TCP stream performance by 20% on loopback device,
but also benefits on other network devices, since 8x less frags are
mapped on transmit and unmapped on tx completion. Alexander Duyck
mentioned a probable performance win on systems with IOMMU enabled.

Its possible some SG enabled hardware cant cope with bigger fragments,
but their ndo_start_xmit() should already handle this, splitting a
fragment in sub fragments, since some arches have PAGE_SIZE=65536

Successfully tested on various ethernet devices.
(ixgbe, igb, bnx2x, tg3, mellanox mlx4)

Signed-off-by: Eric Dumazet
Cc: Ben Hutchings
Cc: Vijay Subramanian
Cc: Alexander Duyck
Tested-by: Vijay Subramanian
Signed-off-by: David S. Miller

Eric Dumazet
2012-09-25 04:31:37 +0800

27 Jul, 2012

1 commit

8ded2bbc1 posix_types.h: Cleanup stale __NFDBITS and related definitions ... Browse Code »

Recently, glibc made a change to suppress sign-conversion warnings in
FD_SET (glibc commit ceb9e56b3d1). This uncovered an issue with the
kernel's definition of __NFDBITS if applications #include
after including . A build failure would
be seen when passing the -Werror=sign-compare and -D_FORTIFY_SOURCE=2
flags to gcc.

It was suggested that the kernel should either match the glibc
definition of __NFDBITS or remove that entirely. The current in-kernel
uses of __NFDBITS can be replaced with BITS_PER_LONG, and there are no
uses of the related __FDELT and __FDMASK defines. Given that, we'll
continue the cleanup that was started with commit 8b3d1cda4f5f
("posix_types: Remove fd_set macros") and drop the remaining unused
macros.

Additionally, linux/time.h has similar macros defined that expand to
nothing so we'll remove those at the same time.

Reported-by: Jeff Law
Suggested-by: Linus Torvalds
CC:
Signed-off-by: Josh Boyer
[ .. and fix up whitespace as per akpm ]
Signed-off-by: Linus Torvalds

Josh Boyer
2012-07-27 04:36:43 +0800

23 Jul, 2012

1 commit

ed3e694d7 move exit_task_work() past exit_files() et.al. ... Browse Code »
43

... and get rid of PF_EXITING check in task_work_add().

Signed-off-by: Al Viro

Al Viro
2012-07-23 03:57:57 +0800

21 Jun, 2012

3 commits

50d75f8da pidns: find_new_reaper() can no longer switch to init_pid_ns.child_reaper ... Browse Code »

find_new_reaper() changes pid_ns->child_reaper, see add0d4df ("pid_ns:
zap_pid_ns_processes: fix the ->child_reaper changing").

The original reason has gone away after the previous patch, ->children
list must be empty after zap_pid_ns_processes().

However now we can not switch to init_pid_ns.child_reaper.
__unhash_process() relies on the "->child_reaper == parent" check, but
this check does not work if the last exiting task is also the child
reaper.

As Eric sugested, we can change __unhash_process() to use the parent's
pid_ns and remove this code.

Also, with this change we can move detach_pid(PIDTYPE_PID) back, where it
was before the previous fix.

Signed-off-by: Oleg Nesterov
Acked-by: "Eric W. Biederman"
Cc: Louis Rilling
Cc: Mike Galbraith
Acked-by: Pavel Emelyanov
Tested-by: Andrew Wagin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2012-06-21 05:39:36 +0800
6347e9009 pidns: guarantee that the pidns init will be the last pidns process reaped ... Browse Code »

Today we have a twofold bug. Sometimes release_task on pid == 1 in a pid
namespace can run before other processes in a pid namespace have had
release task called. With the result that pid_ns_release_proc can be
called before the last proc_flus_task() is done using upid->ns->proc_mnt,
resulting in the use of a stale pointer. This same set of circumstances
can lead to waitpid(...) returning for a processes started with
clone(CLONE_NEWPID) before the every process in the pid namespace has
actually exited.

To fix this modify zap_pid_ns_processess wait until all other processes in
the pid namespace have exited, even EXIT_DEAD zombies.

The delay_group_leader and related tests ensure that the thread gruop
leader will be the last thread of a process group to be reaped, or to
become EXIT_DEAD and self reap. With the change to zap_pid_ns_processes
we get the guarantee that pid == 1 in a pid namespace will be the last
task that release_task is called on.

With pid == 1 being the last task to pass through release_task
pid_ns_release_proc can no longer be called too early nor can wait return
before all of the EXIT_DEAD tasks in a pid namespace have exited.

Signed-off-by: Eric W. Biederman
Signed-off-by: Oleg Nesterov
Cc: Louis Rilling
Cc: Mike Galbraith
Acked-by: Pavel Emelyanov
Tested-by: Andrew Wagin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric W. Biederman
2012-06-21 05:39:36 +0800
4fe7efdbd mm: correctly synchronize rss-counters at exit/exec ... Browse Code »

do_exit() and exec_mmap() call sync_mm_rss() before mm_release() does
put_user(clear_child_tid) which can update task->rss_stat and thus make
mm->rss_stat inconsistent. This triggers the "BUG:" printk in check_mm().

Let's fix this bug in the safest way, and optimize/cleanup this later.

Reported-by: Markus Trippelsdorf
Signed-off-by: Konstantin Khlebnikov
Cc: Oleg Nesterov
Cc: KAMEZAWA Hiroyuki
Cc: Hugh Dickins
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Konstantin Khlebnikov
2012-06-21 05:39:36 +0800

08 Jun, 2012

2 commits

48d212a2e Revert "mm: correctly synchronize rss-counters at exit/exec" ... Browse Code »

This reverts commit 40af1bbdca47e5c8a2044039bb78ca8fd8b20f94.

It's horribly and utterly broken for at least the following reasons:

- calling sync_mm_rss() from mmput() is fundamentally wrong, because
there's absolutely no reason to believe that the task that does the
mmput() always does it on its own VM. Example: fork, ptrace, /proc -
you name it.

- calling it *after* having done mmdrop() on it is doubly insane, since
the mm struct may well be gone now.

- testing mm against NULL before you call it is insane too, since a
NULL mm there would have caused oopses long before.

.. and those are just the three bugs I found before I decided to give up
looking for me and revert it asap. I should have caught it before I
even took it, but I trusted Andrew too much.

Cc: Konstantin Khlebnikov
Cc: Markus Trippelsdorf
Cc: Hugh Dickins
Cc: KAMEZAWA Hiroyuki
Cc: Oleg Nesterov
Cc: Andrew Morton
Signed-off-by: Linus Torvalds

Linus Torvalds
2012-06-08 08:54:07 +0800
40af1bbdc mm: correctly synchronize rss-counters at exit/exec ... Browse Code »
43

mm->rss_stat counters have per-task delta: task->rss_stat. Before
changing task->mm pointer the kernel must flush this delta with
sync_mm_rss().

do_exit() already calls sync_mm_rss() to flush the rss-counters before
committing the rss statistics into task->signal->maxrss, taskstats,
audit and other stuff. Unfortunately the kernel does this before
calling mm_release(), which can call put_user() for processing
task->clear_child_tid. So at this point we can trigger page-faults and
task->rss_stat becomes non-zero again. As a result mm->rss_stat becomes
inconsistent and check_mm() will print something like this:

| BUG: Bad rss-counter state mm:ffff88020813c380 idx:1 val:-1
| BUG: Bad rss-counter state mm:ffff88020813c380 idx:2 val:1

This patch moves sync_mm_rss() into mm_release(), and moves mm_release()
out of do_exit() and calls it earlier. After mm_release() there should
be no pagefaults.

[akpm@linux-foundation.org: tweak comment]
Signed-off-by: Konstantin Khlebnikov
Reported-by: Markus Trippelsdorf
Cc: Hugh Dickins
Cc: KAMEZAWA Hiroyuki
Cc: Oleg Nesterov
Cc: [3.4.x]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Konstantin Khlebnikov
2012-06-08 05:43:55 +0800

01 Jun, 2012

3 commits

fb21affa4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal ... Browse Code »
86

Pull second pile of signal handling patches from Al Viro:
"This one is just task_work_add() series + remaining prereqs for it.

There probably will be another pull request from that tree this
cycle - at least for helpers, to get them out of the way for per-arch
fixes remaining in the tree."

Fix trivial conflict in kernel/irq/manage.c: the merge of Andrew's pile
had brought in commit 97fd75b7b8e0 ("kernel/irq/manage.c: use the
pr_foo() infrastructure to prefix printks") which changed one of the
pr_err() calls that this merge moves around.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
keys: kill task_struct->replacement_session_keyring
keys: kill the dummy key_replace_session_keyring()
keys: change keyctl_session_to_parent() to use task_work_add()
genirq: reimplement exit_irq_thread() hook via task_work_add()
task_work_add: generic process-context callbacks
avr32: missed _TIF_NOTIFY_RESUME on one of do_notify_resume callers
parisc: need to check NOTIFY_RESUME when exiting from syscall
move key_repace_session_keyring() into tracehook_notify_resume()
TIF_NOTIFY_RESUME is defined on all targets now

Linus Torvalds
2012-06-01 09:47:30 +0800
168eeccbc stack usage: add pid to warning printk in check_stack_usage ... Browse Code »

In embedded systems, sometimes the same program (busybox) is the cause of
multiple warnings. Outputting the pid with the program name in the
warning printk helps distinguish which instances of a program are using
the stack most.

This is a small patch, but useful.

Signed-off-by: Tim Bird
Cc: Oleg Nesterov
Cc: Frederic Weisbecker
Cc: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tim Bird
2012-06-01 08:49:28 +0800
43e13cc10 cred: remove task_is_dead() from __task_cred() validation ... Browse Code »

Commit 8f92054e7ca1 ("CRED: Fix __task_cred()'s lockdep check and banner
comment"):

add the following validation condition:

task->exit_state >= 0

to permit the access if the target task is dead and therefore
unable to change its own credentials.

OK, but afaics currently this can only help wait_task_zombie() which calls
__task_cred() without rcu lock.

Remove this validation and change wait_task_zombie() to use task_uid()
instead. This means we do rcu_read_lock() only to shut up the lockdep,
but we already do the same in, say, wait_task_stopped().

task_is_dead() should die, task->exit_state != 0 means that this task has
passed exit_notify(), only do_wait-like code paths should use this.

Unfortunately, we can't kill task_is_dead() right now, it has already
acquired buggy users in drivers/staging. The fix already exists.

Signed-off-by: Oleg Nesterov
Reviewed-by: "Eric W. Biederman"
Acked-by: David Howells
Cc: Jiri Olsa
Cc: Paul E. McKenney
Cc: James Morris
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2012-06-01 08:49:28 +0800

24 May, 2012

2 commits

4d1d61a6b genirq: reimplement exit_irq_thread() hook via task_work_add() ... Browse Code »

exit_irq_thread() and task->irq_thread are needed to handle the unexpected
(and unlikely) exit of irq-thread.

We can use task_work instead and make this all private to
kernel/irq/manage.c, cleanup plus micro-optimization.

1. rename exit_irq_thread() to irq_thread_dtor(), make it
static, and move it up before irq_thread().

2. change irq_thread() to do task_work_add(irq_thread_dtor)
at the start and task_work_cancel() before return.

tracehook_notify_resume() can never play with kthreads,
only do_exit()->exit_task_work() can call the callback
and this is what we want.

3. remove task_struct->irq_thread and the special hook
in do_exit().

Signed-off-by: Oleg Nesterov
Reviewed-by: Thomas Gleixner
Cc: David Howells
Cc: Richard Kuo
Cc: Linus Torvalds
Cc: Alexander Gordeev
Cc: Chris Zankel
Cc: David Smith
Cc: "Frank Ch. Eigler"
Cc: Geert Uytterhoeven
Cc: Larry Woodman
Cc: Peter Zijlstra
Cc: Tejun Heo
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Al Viro

Oleg Nesterov
2012-05-24 10:11:12 +0800
e73f8959a task_work_add: generic process-context callbacks ... Browse Code »

Provide a simple mechanism that allows running code in the (nonatomic)
context of the arbitrary task.

The caller does task_work_add(task, task_work) and this task executes
task_work->func() either from do_notify_resume() or from do_exit(). The
callback can rely on PF_EXITING to detect the latter case.

"struct task_work" can be embedded in another struct, still it has "void
*data" to handle the most common/simple case.

This allows us to kill the ->replacement_session_keyring hack, and
potentially this can have more users.

Performance-wise, this adds 2 "unlikely(!hlist_empty())" checks into
tracehook_notify_resume() and do_exit(). But at the same time we can
remove the "replacement_session_keyring != NULL" checks from
arch/*/signal.c and exit_creds().

Note: task_work_add/task_work_run abuses ->pi_lock. This is only because
this lock is already used by lookup_pi_state() to synchronize with
do_exit() setting PF_EXITING. Fortunately the scope of this lock in
task_work.c is really tiny, and the code is unlikely anyway.

Signed-off-by: Oleg Nesterov
Acked-by: David Howells
Cc: Thomas Gleixner
Cc: Richard Kuo
Cc: Linus Torvalds
Cc: Alexander Gordeev
Cc: Chris Zankel
Cc: David Smith
Cc: "Frank Ch. Eigler"
Cc: Geert Uytterhoeven
Cc: Larry Woodman
Cc: Peter Zijlstra
Cc: Tejun Heo
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Al Viro

Oleg Nesterov
2012-05-24 10:09:21 +0800

18 May, 2012

1 commit

8ca937a66 cred: use correct cred accessor with regards to rcu read lock ... Browse Code »

Commit "userns: Convert setting and getting uid and gid system calls to use
kuid and kgid has modified the accessors in wait_task_continued() and
wait_task_stopped() to use __task_cred() instead of task_uid().

__task_cred() assumes that we're inside a rcu read lock, which is untrue
for these two functions.

Modify it to use task_uid() instead.

Signed-off-by: Sasha Levin
Signed-off-by: Eric W. Biederman

Sasha Levin
2012-05-18 06:50:06 +0800

03 May, 2012

1 commit

a29c33f4e userns: Convert setting and getting uid and gid system calls to use kuid and kgid ... Browse Code »

Convert setregid, setgid, setreuid, setuid,
setresuid, getresuid, setresgid, getresgid, setfsuid, setfsgid,
getuid, geteuid, getgid, getegid,
waitpid, waitid, wait4.

Convert userspace uids and gids into kuids and kgids before
being placed on struct cred. Convert struct cred kuids and
kgids into userspace uids and gids when returning them.

Signed-off-by: Eric W. Biederman

Eric W. Biederman
2012-05-03 18:28:41 +0800

30 Mar, 2012

1 commit

a591afc01 Merge branch 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull x32 support for x86-64 from Ingo Molnar:
"This tree introduces the X32 binary format and execution mode for x86:
32-bit data space binaries using 64-bit instructions and 64-bit kernel
syscalls.

This allows applications whose working set fits into a 32 bits address
space to make use of 64-bit instructions while using a 32-bit address
space with shorter pointers, more compressed data structures, etc."

Fix up trivial context conflicts in arch/x86/{Kconfig,vdso/vma.c}

* 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
x32: Fix alignment fail in struct compat_siginfo
x32: Fix stupid ia32/x32 inversion in the siginfo format
x32: Add ptrace for x32
x32: Switch to a 64-bit clock_t
x32: Provide separate is_ia32_task() and is_x32_task() predicates
x86, mtrr: Use explicit sizing and padding for the 64-bit ioctls
x86/x32: Fix the binutils auto-detect
x32: Warn and disable rather than error if binutils too old
x32: Only clear TIF_X32 flag once
x32: Make sure TS_COMPAT is cleared for x32 tasks
fs: Remove missed ->fds_bits from cessation use of fd_set structs internally
fs: Fix close_on_exec pointer in alloc_fdtable
x32: Drop non-__vdso weak symbols from the x32 VDSO
x32: Fix coding style violations in the x32 VDSO code
x32: Add x32 VDSO support
x32: Allow x32 to be configured
x32: If configured, add x32 system calls to system call tables
x32: Handle process creation
x32: Signal-related system calls
x86: Add #ifdef CONFIG_COMPAT to
...

Linus Torvalds
2012-03-30 09:12:23 +0800

24 Mar, 2012

2 commits

397a21f24 kernel/exit.c: if init dies, log a signal which killed it, if any ... Browse Code »

I just received another user's pleas for help when their init
mysteriously died. I again explained that they need to check whether it
died because of bad instruction, a segv, or something else. Which was
an annoying detour into writing a trivial C program to spawn his init
and print its exit code:

http://lists.busybox.net/pipermail/busybox/2012-January/077172.html

I hear you saying "just test it under /bin/sh". Well, the crashing init
_was_ /bin/sh.

Which prompted me to make kernel do this first step automatically. We can
print exit code, which makes it possible to see that death was from e.g.
SIGILL without writing test programs.

[akpm@linux-foundation.org: add 0x to hex number output]
Signed-off-by: Denys Vlasenko
Acked-by: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Denys Vlasenko
2012-03-24 07:58:32 +0800
ebec18a6d prctl: add PR_{SET,GET}_CHILD_SUBREAPER to allow simple process supervision ... Browse Code »

Userspace service managers/supervisors need to track their started
services. Many services daemonize by double-forking and get implicitly
re-parented to PID 1. The service manager will no longer be able to
receive the SIGCHLD signals for them, and is no longer in charge of
reaping the children with wait(). All information about the children is
lost at the moment PID 1 cleans up the re-parented processes.

With this prctl, a service manager process can mark itself as a sort of
'sub-init', able to stay as the parent for all orphaned processes
created by the started services. All SIGCHLD signals will be delivered
to the service manager.

Receiving SIGCHLD and doing wait() is in cases of a service-manager much
preferred over any possible asynchronous notification about specific
PIDs, because the service manager has full access to the child process
data in /proc and the PID can not be re-used until the wait(), the
service-manager itself is in charge of, has happened.

As a side effect, the relevant parent PID information does not get lost
by a double-fork, which results in a more elaborate process tree and
'ps' output:

before:
# ps afx
253 ? Ss 0:00 /bin/dbus-daemon --system --nofork
294 ? Sl 0:00 /usr/libexec/polkit-1/polkitd
328 ? S 0:00 /usr/sbin/modem-manager
608 ? Sl 0:00 /usr/libexec/colord
658 ? Sl 0:00 /usr/libexec/upowerd
819 ? Sl 0:00 /usr/libexec/imsettings-daemon
916 ? Sl 0:00 /usr/libexec/udisks-daemon
917 ? S 0:00 \_ udisks-daemon: not polling any devices

after:
# ps afx
294 ? Ss 0:00 /bin/dbus-daemon --system --nofork
426 ? Sl 0:00 \_ /usr/libexec/polkit-1/polkitd
449 ? S 0:00 \_ /usr/sbin/modem-manager
635 ? Sl 0:00 \_ /usr/libexec/colord
705 ? Sl 0:00 \_ /usr/libexec/upowerd
959 ? Sl 0:00 \_ /usr/libexec/udisks-daemon
960 ? S 0:00 | \_ udisks-daemon: not polling any devices
977 ? Sl 0:00 \_ /usr/libexec/packagekitd

This prctl is orthogonal to PID namespaces. PID namespaces are isolated
from each other, while a service management process usually requires the
services to live in the same namespace, to be able to talk to each
other.

Users of this will be the systemd per-user instance, which provides
init-like functionality for the user's login session and D-Bus, which
activates bus services on-demand. Both need init-like capabilities to
be able to properly keep track of the services they start.

Many thanks to Oleg for several rounds of review and insights.

[akpm@linux-foundation.org: fix comment layout and spelling]
[akpm@linux-foundation.org: add lengthy code comment from Oleg]
Reviewed-by: Oleg Nesterov
Signed-off-by: Lennart Poettering
Signed-off-by: Kay Sievers
Acked-by: Valdis Kletnieks
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Lennart Poettering
2012-03-24 07:58:32 +0800

23 Mar, 2012

1 commit

95211279c Merge branch 'akpm' (Andrew's patch-bomb) ... Browse Code »

Merge first batch of patches from Andrew Morton:
"A few misc things and all the MM queue"

* emailed from Andrew Morton : (92 commits)
memcg: avoid THP split in task migration
thp: add HPAGE_PMD_* definitions for !CONFIG_TRANSPARENT_HUGEPAGE
memcg: clean up existing move charge code
mm/memcontrol.c: remove unnecessary 'break' in mem_cgroup_read()
mm/memcontrol.c: remove redundant BUG_ON() in mem_cgroup_usage_unregister_event()
mm/memcontrol.c: s/stealed/stolen/
memcg: fix performance of mem_cgroup_begin_update_page_stat()
memcg: remove PCG_FILE_MAPPED
memcg: use new logic for page stat accounting
memcg: remove PCG_MOVE_LOCK flag from page_cgroup
memcg: simplify move_account() check
memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat)
memcg: kill dead prev_priority stubs
memcg: remove PCG_CACHE page_cgroup flag
memcg: let css_get_next() rely upon rcu_read_lock()
cgroup: revert ss_id_lock to spinlock
idr: make idr_get_next() good for rcu_read_lock()
memcg: remove unnecessary thp check in page stat accounting
memcg: remove redundant returns
memcg: enum lru_list lru
...

Linus Torvalds
2012-03-23 00:04:48 +0800

22 Mar, 2012

3 commits

05af2e104 mm, counters: remove task argument to sync_mm_rss() and __sync_task_rss_stat() ... Browse Code »

sync_mm_rss() can only be used for current to avoid race conditions in
iterating and clearing its per-task counters. Remove the task argument
for it and its helper function, __sync_task_rss_stat(), to avoid thinking
it can be used safely for anything other than current.

Signed-off-by: David Rientjes
Acked-by: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2012-03-22 08:54:59 +0800
3556485f1 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security ... Browse Code »

Pull security subsystem updates for 3.4 from James Morris:
"The main addition here is the new Yama security module from Kees Cook,
which was discussed at the Linux Security Summit last year. Its
purpose is to collect miscellaneous DAC security enhancements in one
place. This also marks a departure in policy for LSM modules, which
were previously limited to being standalone access control systems.
Chromium OS is using Yama, and I believe there are plans for Ubuntu,
at least.

This patchset also includes maintenance updates for AppArmor, TOMOYO
and others."

Fix trivial conflict in due to the jumo_label->static_key
rename.

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (38 commits)
AppArmor: Fix location of const qualifier on generated string tables
TOMOYO: Return error if fails to delete a domain
AppArmor: add const qualifiers to string arrays
AppArmor: Add ability to load extended policy
TOMOYO: Return appropriate value to poll().
AppArmor: Move path failure information into aa_get_name and rename
AppArmor: Update dfa matching routines.
AppArmor: Minor cleanup of d_namespace_path to consolidate error handling
AppArmor: Retrieve the dentry_path for error reporting when path lookup fails
AppArmor: Add const qualifiers to generated string tables
AppArmor: Fix oops in policy unpack auditing
AppArmor: Fix error returned when a path lookup is disconnected
KEYS: testing wrong bit for KEY_FLAG_REVOKED
TOMOYO: Fix mount flags checking order.
security: fix ima kconfig warning
AppArmor: Fix the error case for chroot relative path name lookup
AppArmor: fix mapping of META_READ to audit and quiet flags
AppArmor: Fix underflow in xindex calculation
AppArmor: Fix dropping of allowed operations that are force audited
AppArmor: Add mising end of structure test to caps unpacking
...

Linus Torvalds
2012-03-22 04:25:04 +0800
c7c66c0cb Merge tag 'pm-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm ... Browse Code »

Pull power management updates for 3.4 from Rafael Wysocki:
"Assorted extensions and fixes including:

* Introduction of early/late suspend/hibernation device callbacks.
* Generic PM domains extensions and fixes.
* devfreq updates from Axel Lin and MyungJoo Ham.
* Device PM QoS updates.
* Fixes of concurrency problems with wakeup sources.
* System suspend and hibernation fixes."

* tag 'pm-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (43 commits)
PM / Domains: Check domain status during hibernation restore of devices
PM / devfreq: add relation of recommended frequency.
PM / shmobile: Make MTU2 driver use pm_genpd_dev_always_on()
PM / shmobile: Make CMT driver use pm_genpd_dev_always_on()
PM / shmobile: Make TMU driver use pm_genpd_dev_always_on()
PM / Domains: Introduce "always on" device flag
PM / Domains: Fix hibernation restore of devices, v2
PM / Domains: Fix handling of wakeup devices during system resume
sh_mmcif / PM: Use PM QoS latency constraint
tmio_mmc / PM: Use PM QoS latency constraint
PM / QoS: Make it possible to expose PM QoS latency constraints
PM / Sleep: JBD and JBD2 missing set_freezable()
PM / Domains: Fix include for PM_GENERIC_DOMAINS=n case
PM / Freezer: Remove references to TIF_FREEZE in comments
PM / Sleep: Add more wakeup source initialization routines
PM / Hibernate: Enable usermodehelpers in hibernate() error path
PM / Sleep: Make __pm_stay_awake() delete wakeup source timers
PM / Sleep: Fix race conditions related to wakeup source timer function
PM / Sleep: Fix possible infinite loop during wakeup source destruction
PM / Hibernate: print physical addresses consistently with other parts of kernel
...

Linus Torvalds
2012-03-22 01:15:51 +0800

21 Mar, 2012

1 commit

b6e238dce exit_signal: fix the "parent has changed security domain" logic ... Browse Code »

exit_notify() changes ->exit_signal if the parent already did exec.
This doesn't really work, we are not going to send the signal now
if there is another live thread or the exiting task is traced. The
parent can exec before the last dies or the tracer detaches.

Move this check into do_notify_parent() which actually sends the
signal.

The user-visible change is that we do not change ->exit_signal,
and thus the exiting task is still "clone children" for
do_wait()->eligible_child(__WCLONE). Hopefully this is fine, the
current logic is racy anyway.

Signed-off-by: Oleg Nesterov
Signed-off-by: Linus Torvalds

Oleg Nesterov
2012-03-21 05:16:50 +0800