15 Jun, 2013

1 commit

  • exit_notify() does exit_task_namespaces() after
    forget_original_parent(). This was needed to ensure that ->nsproxy
    can't be cleared prematurely, an exiting child we are going to
    reparent can do do_notify_parent() and use the parent's (ours) pid_ns.

    However, after 32084504 "pidns: use task_active_pid_ns in
    do_notify_parent" ->nsproxy != NULL is no longer needed, we rely
    on task_active_pid_ns().

    Move exit_task_namespaces() from exit_notify() to do_exit(), after
    exit_fs() and before exit_task_work().

    This solves the problem reported by Andrey, free_ipc_ns()->shm_destroy()
    does fput() which needs task_work_add().

    Note: this particular problem can be fixed if we change fput(), and
    that change makes sense anyway. But there is another reason to move
    the callsite. The original reason for exit_task_namespaces() from
    the middle of exit_notify() was subtle and it has already gone away,
    now this looks confusing. And this allows us do simplify exit_notify(),
    we can avoid unlock/lock(tasklist) and we can use ->exit_state instead
    of PF_EXITING in forget_original_parent().

    Reported-by: Andrey Vagin
    Signed-off-by: Oleg Nesterov
    Acked-by: "Eric W. Biederman"
    Acked-by: Andrey Vagin
    Signed-off-by: Al Viro

    Oleg Nesterov
     

02 May, 2013

1 commit

  • Pull VFS updates from Al Viro,

    Misc cleanups all over the place, mainly wrt /proc interfaces (switch
    create_proc_entry to proc_create(), get rid of the deprecated
    create_proc_read_entry() in favor of using proc_create_data() and
    seq_file etc).

    7kloc removed.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
    don't bother with deferred freeing of fdtables
    proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
    proc: Make the PROC_I() and PDE() macros internal to procfs
    proc: Supply a function to remove a proc entry by PDE
    take cgroup_open() and cpuset_open() to fs/proc/base.c
    ppc: Clean up scanlog
    ppc: Clean up rtas_flash driver somewhat
    hostap: proc: Use remove_proc_subtree()
    drm: proc: Use remove_proc_subtree()
    drm: proc: Use minor->index to label things, not PDE->name
    drm: Constify drm_proc_list[]
    zoran: Don't print proc_dir_entry data in debug
    reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
    proc: Supply an accessor for getting the data from a PDE's parent
    airo: Use remove_proc_subtree()
    rtl8192u: Don't need to save device proc dir PDE
    rtl8187se: Use a dir under /proc/net/r8180/
    proc: Add proc_mkdir_data()
    proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
    proc: Move PDE_NET() to fs/proc/proc_net.c
    ...

    Linus Torvalds
     

01 May, 2013

1 commit

  • Pull compat cleanup from Al Viro:
    "Mostly about syscall wrappers this time; there will be another pile
    with patches in the same general area from various people, but I'd
    rather push those after both that and vfs.git pile are in."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    syscalls.h: slightly reduce the jungles of macros
    get rid of union semop in sys_semctl(2) arguments
    make do_mremap() static
    sparc: no need to sign-extend in sync_file_range() wrapper
    ppc compat wrappers for add_key(2) and request_key(2) are pointless
    x86: trim sys_ia32.h
    x86: sys32_kill and sys32_mprotect are pointless
    get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC
    merge compat sys_ipc instances
    consolidate compat lookup_dcookie()
    convert vmsplice to COMPAT_SYSCALL_DEFINE
    switch getrusage() to COMPAT_SYSCALL_DEFINE
    switch epoll_pwait to COMPAT_SYSCALL_DEFINE
    convert sendfile{,64} to COMPAT_SYSCALL_DEFINE
    switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE
    make SYSCALL_DEFINE-generated wrappers do asmlinkage_protect
    make HAVE_SYSCALL_WRAPPERS unconditional
    consolidate cond_syscall and SYSCALL_ALIAS declarations
    teach SYSCALL_DEFINE how to deal with long long/unsigned long long
    get rid of duplicate logics in __SC_....[1-6] definitions

    Linus Torvalds
     

10 Apr, 2013

1 commit


01 Apr, 2013

1 commit

  • This reverts commit 6aa9707099c4b25700940eb3d016f16c4434360d.

    Commit 6aa9707099c4 ("lockdep: check that no locks held at freeze time")
    causes problems with NFS root filesystems. The failures were noticed on
    OMAP2 and 3 boards during kernel init:

    [ BUG: swapper/0/1 still has locks held! ]
    3.9.0-rc3-00344-ga937536 #1 Not tainted
    -------------------------------------
    1 lock held by swapper/0/1:
    #0: (&type->s_umount_key#13/1){+.+.+.}, at: [] sget+0x248/0x574

    stack backtrace:
    rpc_wait_bit_killable
    __wait_on_bit
    out_of_line_wait_on_bit
    __rpc_execute
    rpc_run_task
    rpc_call_sync
    nfs_proc_get_root
    nfs_get_root
    nfs_fs_mount_common
    nfs_try_mount
    nfs_fs_mount
    mount_fs
    vfs_kern_mount
    do_mount
    sys_mount
    do_mount_root
    mount_root
    prepare_namespace
    kernel_init_freeable
    kernel_init

    Although the rootfs mounts, the system is unstable. Here's a transcript
    from a PM test:

    http://www.pwsan.com/omap/testlogs/test_v3.9-rc3/20130317194234/pm/37xxevm/37xxevm_log.txt

    Here's what the test log should look like:

    http://www.pwsan.com/omap/testlogs/test_v3.8/20130218214403/pm/37xxevm/37xxevm_log.txt

    Mailing list discussion is here:

    http://lkml.org/lkml/2013/3/4/221

    Deal with this for v3.9 by reverting the problem commit, until folks can
    figure out the right long-term course of action.

    Signed-off-by: Paul Walmsley
    Cc: Mandeep Singh Baines
    Cc: Jeff Layton
    Cc: Shawn Guo
    Cc:
    Cc: Fengguang Wu
    Cc: Trond Myklebust
    Cc: Ingo Molnar
    Cc: Ben Chan
    Cc: Oleg Nesterov
    Cc: Tejun Heo
    Cc: Rafael J. Wysocki
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Walmsley
     

04 Mar, 2013

1 commit


28 Feb, 2013

2 commits

  • Prevents hung_task detector from panicing the machine. This is also
    needed to prevent this wait from blocking suspend.

    (It doesnt' currently block suspend but it would once the next
    patch in this series is applied.)

    [yongjun_wei@trendmicro.com.cn: kernel/exit.c: remove duplicated include]
    Signed-off-by: Mandeep Singh Baines
    Cc: Ben Chan
    Cc: Oleg Nesterov
    Cc: Tejun Heo
    Cc: Rafael J. Wysocki
    Cc: Ingo Molnar
    Signed-off-by: Wei Yongjun
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mandeep Singh Baines
     
  • We shouldn't try_to_freeze if locks are held. Holding a lock can cause a
    deadlock if the lock is later acquired in the suspend or hibernate path
    (e.g. by dpm). Holding a lock can also cause a deadlock in the case of
    cgroup_freezer if a lock is held inside a frozen cgroup that is later
    acquired by a process outside that group.

    [akpm@linux-foundation.org: export debug_check_no_locks_held]
    Signed-off-by: Mandeep Singh Baines
    Cc: Ben Chan
    Cc: Oleg Nesterov
    Cc: Tejun Heo
    Cc: Rafael J. Wysocki
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mandeep Singh Baines
     

28 Jan, 2013

1 commit

  • This is in preparation for the full dynticks feature. While
    remotely reading the cputime of a task running in a full
    dynticks CPU, we'll need to do some extra-computation. This
    way we can account the time it spent tickless in userspace
    since its last cputime snapshot.

    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Li Zhong
    Cc: Namhyung Kim
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     

18 Dec, 2012

1 commit

  • Pull user namespace changes from Eric Biederman:
    "While small this set of changes is very significant with respect to
    containers in general and user namespaces in particular. The user
    space interface is now complete.

    This set of changes adds support for unprivileged users to create user
    namespaces and as a user namespace root to create other namespaces.
    The tyranny of supporting suid root preventing unprivileged users from
    using cool new kernel features is broken.

    This set of changes completes the work on setns, adding support for
    the pid, user, mount namespaces.

    This set of changes includes a bunch of basic pid namespace
    cleanups/simplifications. Of particular significance is the rework of
    the pid namespace cleanup so it no longer requires sending out
    tendrils into all kinds of unexpected cleanup paths for operation. At
    least one case of broken error handling is fixed by this cleanup.

    The files under /proc//ns/ have been converted from regular files
    to magic symlinks which prevents incorrect caching by the VFS,
    ensuring the files always refer to the namespace the process is
    currently using and ensuring that the ptrace_mayaccess permission
    checks are always applied.

    The files under /proc//ns/ have been given stable inode numbers
    so it is now possible to see if different processes share the same
    namespaces.

    Through the David Miller's net tree are changes to relax many of the
    permission checks in the networking stack to allowing the user
    namespace root to usefully use the networking stack. Similar changes
    for the mount namespace and the pid namespace are coming through my
    tree.

    Two small changes to add user namespace support were commited here adn
    in David Miller's -net tree so that I could complete the work on the
    /proc//ns/ files in this tree.

    Work remains to make it safe to build user namespaces and 9p, afs,
    ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the
    Kconfig guard remains in place preventing that user namespaces from
    being built when any of those filesystems are enabled.

    Future design work remains to allow root users outside of the initial
    user namespace to mount more than just /proc and /sys."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (38 commits)
    proc: Usable inode numbers for the namespace file descriptors.
    proc: Fix the namespace inode permission checks.
    proc: Generalize proc inode allocation
    userns: Allow unprivilged mounts of proc and sysfs
    userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file
    procfs: Print task uids and gids in the userns that opened the proc file
    userns: Implement unshare of the user namespace
    userns: Implent proc namespace operations
    userns: Kill task_user_ns
    userns: Make create_new_namespaces take a user_ns parameter
    userns: Allow unprivileged use of setns.
    userns: Allow unprivileged users to create new namespaces
    userns: Allow setting a userns mapping to your current uid.
    userns: Allow chown and setgid preservation
    userns: Allow unprivileged users to create user namespaces.
    userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped
    userns: fix return value on mntns_install() failure
    vfs: Allow unprivileged manipulation of the mount namespace.
    vfs: Only support slave subtrees across different user namespaces
    vfs: Add a user namespace reference from struct mnt_namespace
    ...

    Linus Torvalds
     

13 Dec, 2012

1 commit

  • Pull big execve/kernel_thread/fork unification series from Al Viro:
    "All architectures are converted to new model. Quite a bit of that
    stuff is actually shared with architecture trees; in such cases it's
    literally shared branch pulled by both, not a cherry-pick.

    A lot of ugliness and black magic is gone (-3KLoC total in this one):

    - kernel_thread()/kernel_execve()/sys_execve() redesign.

    We don't do syscalls from kernel anymore for either kernel_thread()
    or kernel_execve():

    kernel_thread() is essentially clone(2) with callback run before we
    return to userland, the callbacks either never return or do
    successful do_execve() before returning.

    kernel_execve() is a wrapper for do_execve() - it doesn't need to
    do transition to user mode anymore.

    As a result kernel_thread() and kernel_execve() are
    arch-independent now - they live in kernel/fork.c and fs/exec.c
    resp. sys_execve() is also in fs/exec.c and it's completely
    architecture-independent.

    - daemonize() is gone, along with its parts in fs/*.c

    - struct pt_regs * is no longer passed to do_fork/copy_process/
    copy_thread/do_execve/search_binary_handler/->load_binary/do_coredump.

    - sys_fork()/sys_vfork()/sys_clone() unified; some architectures
    still need wrappers (ones with callee-saved registers not saved in
    pt_regs on syscall entry), but the main part of those suckers is in
    kernel/fork.c now."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (113 commits)
    do_coredump(): get rid of pt_regs argument
    print_fatal_signal(): get rid of pt_regs argument
    ptrace_signal(): get rid of unused arguments
    get rid of ptrace_signal_deliver() arguments
    new helper: signal_pt_regs()
    unify default ptrace_signal_deliver
    flagday: kill pt_regs argument of do_fork()
    death to idle_regs()
    don't pass regs to copy_process()
    flagday: don't pass regs to copy_thread()
    bfin: switch to generic vfork, get rid of pointless wrappers
    xtensa: switch to generic clone()
    openrisc: switch to use of generic fork and clone
    unicore32: switch to generic clone(2)
    score: switch to generic fork/vfork/clone
    c6x: sanitize copy_thread(), get rid of clone(2) wrapper, switch to generic clone()
    take sys_fork/sys_vfork/sys_clone prototypes to linux/syscalls.h
    mn10300: switch to generic fork/vfork/clone
    h8300: switch to generic fork/vfork/clone
    tile: switch to generic clone()
    ...

    Conflicts:
    arch/microblaze/include/asm/Kbuild

    Linus Torvalds
     

29 Nov, 2012

2 commits

  • Signed-off-by: Al Viro

    Al Viro
     
  • We have thread_group_cputime() and thread_group_times(). The naming
    doesn't provide enough information about the difference between
    these two APIs.

    To lower the confusion, rename thread_group_times() to
    thread_group_cputime_adjusted(). This name better suggests that
    it's a version of thread_group_cputime() that does some stabilization
    on the raw cputime values. ie here: scale on top of CFS runtime
    stats and bound lower value for monotonicity.

    Signed-off-by: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Steven Rostedt
    Cc: Paul Gortmaker

    Frederic Weisbecker
     

19 Nov, 2012

1 commit


03 Oct, 2012

1 commit

  • Pull vfs update from Al Viro:

    - big one - consolidation of descriptor-related logics; almost all of
    that is moved to fs/file.c

    (BTW, I'm seriously tempted to rename the result to fd.c. As it is,
    we have a situation when file_table.c is about handling of struct
    file and file.c is about handling of descriptor tables; the reasons
    are historical - file_table.c used to be about a static array of
    struct file we used to have way back).

    A lot of stray ends got cleaned up and converted to saner primitives,
    disgusting mess in android/binder.c is still disgusting, but at least
    doesn't poke so much in descriptor table guts anymore. A bunch of
    relatively minor races got fixed in process, plus an ext4 struct file
    leak.

    - related thing - fget_light() partially unuglified; see fdget() in
    there (and yes, it generates the code as good as we used to have).

    - also related - bits of Cyrill's procfs stuff that got entangled into
    that work; _not_ all of it, just the initial move to fs/proc/fd.c and
    switch of fdinfo to seq_file.

    - Alex's fs/coredump.c spiltoff - the same story, had been easier to
    take that commit than mess with conflicts. The rest is a separate
    pile, this was just a mechanical code movement.

    - a few misc patches all over the place. Not all for this cycle,
    there'll be more (and quite a few currently sit in akpm's tree)."

    Fix up trivial conflicts in the android binder driver, and some fairly
    simple conflicts due to two different changes to the sock_alloc_file()
    interface ("take descriptor handling from sock_alloc_file() to callers"
    vs "net: Providing protocol type via system.sockprotoname xattr of
    /proc/PID/fd entries" adding a dentry name to the socket)

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
    MAX_LFS_FILESIZE should be a loff_t
    compat: fs: Generic compat_sys_sendfile implementation
    fs: push rcu_barrier() from deactivate_locked_super() to filesystems
    btrfs: reada_extent doesn't need kref for refcount
    coredump: move core dump functionality into its own file
    coredump: prevent double-free on an error path in core dumper
    usb/gadget: fix misannotations
    fcntl: fix misannotations
    ceph: don't abuse d_delete() on failure exits
    hypfs: ->d_parent is never NULL or negative
    vfs: delete surplus inode NULL check
    switch simple cases of fget_light to fdget
    new helpers: fdget()/fdput()
    switch o2hb_region_dev_write() to fget_light()
    proc_map_files_readdir(): don't bother with grabbing files
    make get_file() return its argument
    vhost_set_vring(): turn pollstart/pollstop into bool
    switch prctl_set_mm_exe_file() to fget_light()
    switch xfs_find_handle() to fget_light()
    switch xfs_swapext() to fget_light()
    ...

    Linus Torvalds
     

27 Sep, 2012

2 commits


25 Sep, 2012

1 commit

  • We currently use a per socket order-0 page cache for tcp_sendmsg()
    operations.

    This page is used to build fragments for skbs.

    Its done to increase probability of coalescing small write() into
    single segments in skbs still in write queue (not yet sent)

    But it wastes a lot of memory for applications handling many mostly
    idle sockets, since each socket holds one page in sk->sk_sndmsg_page

    Its also quite inefficient to build TSO 64KB packets, because we need
    about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
    page allocator more than wanted.

    This patch adds a per task frag allocator and uses bigger pages,
    if available. An automatic fallback is done in case of memory pressure.

    (up to 32768 bytes per frag, thats order-3 pages on x86)

    This increases TCP stream performance by 20% on loopback device,
    but also benefits on other network devices, since 8x less frags are
    mapped on transmit and unmapped on tx completion. Alexander Duyck
    mentioned a probable performance win on systems with IOMMU enabled.

    Its possible some SG enabled hardware cant cope with bigger fragments,
    but their ndo_start_xmit() should already handle this, splitting a
    fragment in sub fragments, since some arches have PAGE_SIZE=65536

    Successfully tested on various ethernet devices.
    (ixgbe, igb, bnx2x, tg3, mellanox mlx4)

    Signed-off-by: Eric Dumazet
    Cc: Ben Hutchings
    Cc: Vijay Subramanian
    Cc: Alexander Duyck
    Tested-by: Vijay Subramanian
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Jul, 2012

1 commit

  • Recently, glibc made a change to suppress sign-conversion warnings in
    FD_SET (glibc commit ceb9e56b3d1). This uncovered an issue with the
    kernel's definition of __NFDBITS if applications #include
    after including . A build failure would
    be seen when passing the -Werror=sign-compare and -D_FORTIFY_SOURCE=2
    flags to gcc.

    It was suggested that the kernel should either match the glibc
    definition of __NFDBITS or remove that entirely. The current in-kernel
    uses of __NFDBITS can be replaced with BITS_PER_LONG, and there are no
    uses of the related __FDELT and __FDMASK defines. Given that, we'll
    continue the cleanup that was started with commit 8b3d1cda4f5f
    ("posix_types: Remove fd_set macros") and drop the remaining unused
    macros.

    Additionally, linux/time.h has similar macros defined that expand to
    nothing so we'll remove those at the same time.

    Reported-by: Jeff Law
    Suggested-by: Linus Torvalds
    CC:
    Signed-off-by: Josh Boyer
    [ .. and fix up whitespace as per akpm ]
    Signed-off-by: Linus Torvalds

    Josh Boyer
     

23 Jul, 2012

1 commit


21 Jun, 2012

3 commits

  • find_new_reaper() changes pid_ns->child_reaper, see add0d4df ("pid_ns:
    zap_pid_ns_processes: fix the ->child_reaper changing").

    The original reason has gone away after the previous patch, ->children
    list must be empty after zap_pid_ns_processes().

    However now we can not switch to init_pid_ns.child_reaper.
    __unhash_process() relies on the "->child_reaper == parent" check, but
    this check does not work if the last exiting task is also the child
    reaper.

    As Eric sugested, we can change __unhash_process() to use the parent's
    pid_ns and remove this code.

    Also, with this change we can move detach_pid(PIDTYPE_PID) back, where it
    was before the previous fix.

    Signed-off-by: Oleg Nesterov
    Acked-by: "Eric W. Biederman"
    Cc: Louis Rilling
    Cc: Mike Galbraith
    Acked-by: Pavel Emelyanov
    Tested-by: Andrew Wagin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Today we have a twofold bug. Sometimes release_task on pid == 1 in a pid
    namespace can run before other processes in a pid namespace have had
    release task called. With the result that pid_ns_release_proc can be
    called before the last proc_flus_task() is done using upid->ns->proc_mnt,
    resulting in the use of a stale pointer. This same set of circumstances
    can lead to waitpid(...) returning for a processes started with
    clone(CLONE_NEWPID) before the every process in the pid namespace has
    actually exited.

    To fix this modify zap_pid_ns_processess wait until all other processes in
    the pid namespace have exited, even EXIT_DEAD zombies.

    The delay_group_leader and related tests ensure that the thread gruop
    leader will be the last thread of a process group to be reaped, or to
    become EXIT_DEAD and self reap. With the change to zap_pid_ns_processes
    we get the guarantee that pid == 1 in a pid namespace will be the last
    task that release_task is called on.

    With pid == 1 being the last task to pass through release_task
    pid_ns_release_proc can no longer be called too early nor can wait return
    before all of the EXIT_DEAD tasks in a pid namespace have exited.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Oleg Nesterov
    Cc: Louis Rilling
    Cc: Mike Galbraith
    Acked-by: Pavel Emelyanov
    Tested-by: Andrew Wagin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • do_exit() and exec_mmap() call sync_mm_rss() before mm_release() does
    put_user(clear_child_tid) which can update task->rss_stat and thus make
    mm->rss_stat inconsistent. This triggers the "BUG:" printk in check_mm().

    Let's fix this bug in the safest way, and optimize/cleanup this later.

    Reported-by: Markus Trippelsdorf
    Signed-off-by: Konstantin Khlebnikov
    Cc: Oleg Nesterov
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

08 Jun, 2012

2 commits

  • This reverts commit 40af1bbdca47e5c8a2044039bb78ca8fd8b20f94.

    It's horribly and utterly broken for at least the following reasons:

    - calling sync_mm_rss() from mmput() is fundamentally wrong, because
    there's absolutely no reason to believe that the task that does the
    mmput() always does it on its own VM. Example: fork, ptrace, /proc -
    you name it.

    - calling it *after* having done mmdrop() on it is doubly insane, since
    the mm struct may well be gone now.

    - testing mm against NULL before you call it is insane too, since a
    NULL mm there would have caused oopses long before.

    .. and those are just the three bugs I found before I decided to give up
    looking for me and revert it asap. I should have caught it before I
    even took it, but I trusted Andrew too much.

    Cc: Konstantin Khlebnikov
    Cc: Markus Trippelsdorf
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Oleg Nesterov
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • mm->rss_stat counters have per-task delta: task->rss_stat. Before
    changing task->mm pointer the kernel must flush this delta with
    sync_mm_rss().

    do_exit() already calls sync_mm_rss() to flush the rss-counters before
    committing the rss statistics into task->signal->maxrss, taskstats,
    audit and other stuff. Unfortunately the kernel does this before
    calling mm_release(), which can call put_user() for processing
    task->clear_child_tid. So at this point we can trigger page-faults and
    task->rss_stat becomes non-zero again. As a result mm->rss_stat becomes
    inconsistent and check_mm() will print something like this:

    | BUG: Bad rss-counter state mm:ffff88020813c380 idx:1 val:-1
    | BUG: Bad rss-counter state mm:ffff88020813c380 idx:2 val:1

    This patch moves sync_mm_rss() into mm_release(), and moves mm_release()
    out of do_exit() and calls it earlier. After mm_release() there should
    be no pagefaults.

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: Konstantin Khlebnikov
    Reported-by: Markus Trippelsdorf
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Oleg Nesterov
    Cc: [3.4.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

01 Jun, 2012

3 commits

  • Pull second pile of signal handling patches from Al Viro:
    "This one is just task_work_add() series + remaining prereqs for it.

    There probably will be another pull request from that tree this
    cycle - at least for helpers, to get them out of the way for per-arch
    fixes remaining in the tree."

    Fix trivial conflict in kernel/irq/manage.c: the merge of Andrew's pile
    had brought in commit 97fd75b7b8e0 ("kernel/irq/manage.c: use the
    pr_foo() infrastructure to prefix printks") which changed one of the
    pr_err() calls that this merge moves around.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    keys: kill task_struct->replacement_session_keyring
    keys: kill the dummy key_replace_session_keyring()
    keys: change keyctl_session_to_parent() to use task_work_add()
    genirq: reimplement exit_irq_thread() hook via task_work_add()
    task_work_add: generic process-context callbacks
    avr32: missed _TIF_NOTIFY_RESUME on one of do_notify_resume callers
    parisc: need to check NOTIFY_RESUME when exiting from syscall
    move key_repace_session_keyring() into tracehook_notify_resume()
    TIF_NOTIFY_RESUME is defined on all targets now

    Linus Torvalds
     
  • In embedded systems, sometimes the same program (busybox) is the cause of
    multiple warnings. Outputting the pid with the program name in the
    warning printk helps distinguish which instances of a program are using
    the stack most.

    This is a small patch, but useful.

    Signed-off-by: Tim Bird
    Cc: Oleg Nesterov
    Cc: Frederic Weisbecker
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tim Bird
     
  • Commit 8f92054e7ca1 ("CRED: Fix __task_cred()'s lockdep check and banner
    comment"):

    add the following validation condition:

    task->exit_state >= 0

    to permit the access if the target task is dead and therefore
    unable to change its own credentials.

    OK, but afaics currently this can only help wait_task_zombie() which calls
    __task_cred() without rcu lock.

    Remove this validation and change wait_task_zombie() to use task_uid()
    instead. This means we do rcu_read_lock() only to shut up the lockdep,
    but we already do the same in, say, wait_task_stopped().

    task_is_dead() should die, task->exit_state != 0 means that this task has
    passed exit_notify(), only do_wait-like code paths should use this.

    Unfortunately, we can't kill task_is_dead() right now, it has already
    acquired buggy users in drivers/staging. The fix already exists.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: "Eric W. Biederman"
    Acked-by: David Howells
    Cc: Jiri Olsa
    Cc: Paul E. McKenney
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

24 May, 2012

2 commits

  • exit_irq_thread() and task->irq_thread are needed to handle the unexpected
    (and unlikely) exit of irq-thread.

    We can use task_work instead and make this all private to
    kernel/irq/manage.c, cleanup plus micro-optimization.

    1. rename exit_irq_thread() to irq_thread_dtor(), make it
    static, and move it up before irq_thread().

    2. change irq_thread() to do task_work_add(irq_thread_dtor)
    at the start and task_work_cancel() before return.

    tracehook_notify_resume() can never play with kthreads,
    only do_exit()->exit_task_work() can call the callback
    and this is what we want.

    3. remove task_struct->irq_thread and the special hook
    in do_exit().

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Thomas Gleixner
    Cc: David Howells
    Cc: Richard Kuo
    Cc: Linus Torvalds
    Cc: Alexander Gordeev
    Cc: Chris Zankel
    Cc: David Smith
    Cc: "Frank Ch. Eigler"
    Cc: Geert Uytterhoeven
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Tejun Heo
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Oleg Nesterov
     
  • Provide a simple mechanism that allows running code in the (nonatomic)
    context of the arbitrary task.

    The caller does task_work_add(task, task_work) and this task executes
    task_work->func() either from do_notify_resume() or from do_exit(). The
    callback can rely on PF_EXITING to detect the latter case.

    "struct task_work" can be embedded in another struct, still it has "void
    *data" to handle the most common/simple case.

    This allows us to kill the ->replacement_session_keyring hack, and
    potentially this can have more users.

    Performance-wise, this adds 2 "unlikely(!hlist_empty())" checks into
    tracehook_notify_resume() and do_exit(). But at the same time we can
    remove the "replacement_session_keyring != NULL" checks from
    arch/*/signal.c and exit_creds().

    Note: task_work_add/task_work_run abuses ->pi_lock. This is only because
    this lock is already used by lookup_pi_state() to synchronize with
    do_exit() setting PF_EXITING. Fortunately the scope of this lock in
    task_work.c is really tiny, and the code is unlikely anyway.

    Signed-off-by: Oleg Nesterov
    Acked-by: David Howells
    Cc: Thomas Gleixner
    Cc: Richard Kuo
    Cc: Linus Torvalds
    Cc: Alexander Gordeev
    Cc: Chris Zankel
    Cc: David Smith
    Cc: "Frank Ch. Eigler"
    Cc: Geert Uytterhoeven
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Tejun Heo
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Oleg Nesterov
     

18 May, 2012

1 commit

  • Commit "userns: Convert setting and getting uid and gid system calls to use
    kuid and kgid has modified the accessors in wait_task_continued() and
    wait_task_stopped() to use __task_cred() instead of task_uid().

    __task_cred() assumes that we're inside a rcu read lock, which is untrue
    for these two functions.

    Modify it to use task_uid() instead.

    Signed-off-by: Sasha Levin
    Signed-off-by: Eric W. Biederman

    Sasha Levin
     

03 May, 2012

1 commit


30 Mar, 2012

1 commit

  • Pull x32 support for x86-64 from Ingo Molnar:
    "This tree introduces the X32 binary format and execution mode for x86:
    32-bit data space binaries using 64-bit instructions and 64-bit kernel
    syscalls.

    This allows applications whose working set fits into a 32 bits address
    space to make use of 64-bit instructions while using a 32-bit address
    space with shorter pointers, more compressed data structures, etc."

    Fix up trivial context conflicts in arch/x86/{Kconfig,vdso/vma.c}

    * 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
    x32: Fix alignment fail in struct compat_siginfo
    x32: Fix stupid ia32/x32 inversion in the siginfo format
    x32: Add ptrace for x32
    x32: Switch to a 64-bit clock_t
    x32: Provide separate is_ia32_task() and is_x32_task() predicates
    x86, mtrr: Use explicit sizing and padding for the 64-bit ioctls
    x86/x32: Fix the binutils auto-detect
    x32: Warn and disable rather than error if binutils too old
    x32: Only clear TIF_X32 flag once
    x32: Make sure TS_COMPAT is cleared for x32 tasks
    fs: Remove missed ->fds_bits from cessation use of fd_set structs internally
    fs: Fix close_on_exec pointer in alloc_fdtable
    x32: Drop non-__vdso weak symbols from the x32 VDSO
    x32: Fix coding style violations in the x32 VDSO code
    x32: Add x32 VDSO support
    x32: Allow x32 to be configured
    x32: If configured, add x32 system calls to system call tables
    x32: Handle process creation
    x32: Signal-related system calls
    x86: Add #ifdef CONFIG_COMPAT to
    ...

    Linus Torvalds
     

24 Mar, 2012

2 commits

  • I just received another user's pleas for help when their init
    mysteriously died. I again explained that they need to check whether it
    died because of bad instruction, a segv, or something else. Which was
    an annoying detour into writing a trivial C program to spawn his init
    and print its exit code:

    http://lists.busybox.net/pipermail/busybox/2012-January/077172.html

    I hear you saying "just test it under /bin/sh". Well, the crashing init
    _was_ /bin/sh.

    Which prompted me to make kernel do this first step automatically. We can
    print exit code, which makes it possible to see that death was from e.g.
    SIGILL without writing test programs.

    [akpm@linux-foundation.org: add 0x to hex number output]
    Signed-off-by: Denys Vlasenko
    Acked-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Denys Vlasenko
     
  • Userspace service managers/supervisors need to track their started
    services. Many services daemonize by double-forking and get implicitly
    re-parented to PID 1. The service manager will no longer be able to
    receive the SIGCHLD signals for them, and is no longer in charge of
    reaping the children with wait(). All information about the children is
    lost at the moment PID 1 cleans up the re-parented processes.

    With this prctl, a service manager process can mark itself as a sort of
    'sub-init', able to stay as the parent for all orphaned processes
    created by the started services. All SIGCHLD signals will be delivered
    to the service manager.

    Receiving SIGCHLD and doing wait() is in cases of a service-manager much
    preferred over any possible asynchronous notification about specific
    PIDs, because the service manager has full access to the child process
    data in /proc and the PID can not be re-used until the wait(), the
    service-manager itself is in charge of, has happened.

    As a side effect, the relevant parent PID information does not get lost
    by a double-fork, which results in a more elaborate process tree and
    'ps' output:

    before:
    # ps afx
    253 ? Ss 0:00 /bin/dbus-daemon --system --nofork
    294 ? Sl 0:00 /usr/libexec/polkit-1/polkitd
    328 ? S 0:00 /usr/sbin/modem-manager
    608 ? Sl 0:00 /usr/libexec/colord
    658 ? Sl 0:00 /usr/libexec/upowerd
    819 ? Sl 0:00 /usr/libexec/imsettings-daemon
    916 ? Sl 0:00 /usr/libexec/udisks-daemon
    917 ? S 0:00 \_ udisks-daemon: not polling any devices

    after:
    # ps afx
    294 ? Ss 0:00 /bin/dbus-daemon --system --nofork
    426 ? Sl 0:00 \_ /usr/libexec/polkit-1/polkitd
    449 ? S 0:00 \_ /usr/sbin/modem-manager
    635 ? Sl 0:00 \_ /usr/libexec/colord
    705 ? Sl 0:00 \_ /usr/libexec/upowerd
    959 ? Sl 0:00 \_ /usr/libexec/udisks-daemon
    960 ? S 0:00 | \_ udisks-daemon: not polling any devices
    977 ? Sl 0:00 \_ /usr/libexec/packagekitd

    This prctl is orthogonal to PID namespaces. PID namespaces are isolated
    from each other, while a service management process usually requires the
    services to live in the same namespace, to be able to talk to each
    other.

    Users of this will be the systemd per-user instance, which provides
    init-like functionality for the user's login session and D-Bus, which
    activates bus services on-demand. Both need init-like capabilities to
    be able to properly keep track of the services they start.

    Many thanks to Oleg for several rounds of review and insights.

    [akpm@linux-foundation.org: fix comment layout and spelling]
    [akpm@linux-foundation.org: add lengthy code comment from Oleg]
    Reviewed-by: Oleg Nesterov
    Signed-off-by: Lennart Poettering
    Signed-off-by: Kay Sievers
    Acked-by: Valdis Kletnieks
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lennart Poettering
     

23 Mar, 2012

1 commit

  • Merge first batch of patches from Andrew Morton:
    "A few misc things and all the MM queue"

    * emailed from Andrew Morton : (92 commits)
    memcg: avoid THP split in task migration
    thp: add HPAGE_PMD_* definitions for !CONFIG_TRANSPARENT_HUGEPAGE
    memcg: clean up existing move charge code
    mm/memcontrol.c: remove unnecessary 'break' in mem_cgroup_read()
    mm/memcontrol.c: remove redundant BUG_ON() in mem_cgroup_usage_unregister_event()
    mm/memcontrol.c: s/stealed/stolen/
    memcg: fix performance of mem_cgroup_begin_update_page_stat()
    memcg: remove PCG_FILE_MAPPED
    memcg: use new logic for page stat accounting
    memcg: remove PCG_MOVE_LOCK flag from page_cgroup
    memcg: simplify move_account() check
    memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat)
    memcg: kill dead prev_priority stubs
    memcg: remove PCG_CACHE page_cgroup flag
    memcg: let css_get_next() rely upon rcu_read_lock()
    cgroup: revert ss_id_lock to spinlock
    idr: make idr_get_next() good for rcu_read_lock()
    memcg: remove unnecessary thp check in page stat accounting
    memcg: remove redundant returns
    memcg: enum lru_list lru
    ...

    Linus Torvalds
     

22 Mar, 2012

3 commits

  • sync_mm_rss() can only be used for current to avoid race conditions in
    iterating and clearing its per-task counters. Remove the task argument
    for it and its helper function, __sync_task_rss_stat(), to avoid thinking
    it can be used safely for anything other than current.

    Signed-off-by: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Pull security subsystem updates for 3.4 from James Morris:
    "The main addition here is the new Yama security module from Kees Cook,
    which was discussed at the Linux Security Summit last year. Its
    purpose is to collect miscellaneous DAC security enhancements in one
    place. This also marks a departure in policy for LSM modules, which
    were previously limited to being standalone access control systems.
    Chromium OS is using Yama, and I believe there are plans for Ubuntu,
    at least.

    This patchset also includes maintenance updates for AppArmor, TOMOYO
    and others."

    Fix trivial conflict in due to the jumo_label->static_key
    rename.

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (38 commits)
    AppArmor: Fix location of const qualifier on generated string tables
    TOMOYO: Return error if fails to delete a domain
    AppArmor: add const qualifiers to string arrays
    AppArmor: Add ability to load extended policy
    TOMOYO: Return appropriate value to poll().
    AppArmor: Move path failure information into aa_get_name and rename
    AppArmor: Update dfa matching routines.
    AppArmor: Minor cleanup of d_namespace_path to consolidate error handling
    AppArmor: Retrieve the dentry_path for error reporting when path lookup fails
    AppArmor: Add const qualifiers to generated string tables
    AppArmor: Fix oops in policy unpack auditing
    AppArmor: Fix error returned when a path lookup is disconnected
    KEYS: testing wrong bit for KEY_FLAG_REVOKED
    TOMOYO: Fix mount flags checking order.
    security: fix ima kconfig warning
    AppArmor: Fix the error case for chroot relative path name lookup
    AppArmor: fix mapping of META_READ to audit and quiet flags
    AppArmor: Fix underflow in xindex calculation
    AppArmor: Fix dropping of allowed operations that are force audited
    AppArmor: Add mising end of structure test to caps unpacking
    ...

    Linus Torvalds
     
  • Pull power management updates for 3.4 from Rafael Wysocki:
    "Assorted extensions and fixes including:

    * Introduction of early/late suspend/hibernation device callbacks.
    * Generic PM domains extensions and fixes.
    * devfreq updates from Axel Lin and MyungJoo Ham.
    * Device PM QoS updates.
    * Fixes of concurrency problems with wakeup sources.
    * System suspend and hibernation fixes."

    * tag 'pm-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (43 commits)
    PM / Domains: Check domain status during hibernation restore of devices
    PM / devfreq: add relation of recommended frequency.
    PM / shmobile: Make MTU2 driver use pm_genpd_dev_always_on()
    PM / shmobile: Make CMT driver use pm_genpd_dev_always_on()
    PM / shmobile: Make TMU driver use pm_genpd_dev_always_on()
    PM / Domains: Introduce "always on" device flag
    PM / Domains: Fix hibernation restore of devices, v2
    PM / Domains: Fix handling of wakeup devices during system resume
    sh_mmcif / PM: Use PM QoS latency constraint
    tmio_mmc / PM: Use PM QoS latency constraint
    PM / QoS: Make it possible to expose PM QoS latency constraints
    PM / Sleep: JBD and JBD2 missing set_freezable()
    PM / Domains: Fix include for PM_GENERIC_DOMAINS=n case
    PM / Freezer: Remove references to TIF_FREEZE in comments
    PM / Sleep: Add more wakeup source initialization routines
    PM / Hibernate: Enable usermodehelpers in hibernate() error path
    PM / Sleep: Make __pm_stay_awake() delete wakeup source timers
    PM / Sleep: Fix race conditions related to wakeup source timer function
    PM / Sleep: Fix possible infinite loop during wakeup source destruction
    PM / Hibernate: print physical addresses consistently with other parts of kernel
    ...

    Linus Torvalds
     

21 Mar, 2012

1 commit

  • exit_notify() changes ->exit_signal if the parent already did exec.
    This doesn't really work, we are not going to send the signal now
    if there is another live thread or the exiting task is traced. The
    parent can exec before the last dies or the tracer detaches.

    Move this check into do_notify_parent() which actually sends the
    signal.

    The user-visible change is that we do not change ->exit_signal,
    and thus the exiting task is still "clone children" for
    do_wait()->eligible_child(__WCLONE). Hopefully this is fine, the
    current logic is racy anyway.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Oleg Nesterov