02 Jun, 2012

4 commits


01 Jun, 2012

1 commit

  • Using task_active_pid_ns is more robust because it works even after we
    have called exit_namespaces. This change allows us to have parent
    processes that are zombies. Normally a zombie parent processes is crazy
    and the last thing you would want to have but in the case of not letting
    the init process of a pid namespace be reaped until all of it's children
    are dead and reaped a zombie parent process is exactly what we want.

    Signed-off-by: Eric W. Biederman
    Cc: Oleg Nesterov
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Louis Rilling
    Cc: Mike Galbraith
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

25 May, 2012

1 commit

  • Pull user-space probe instrumentation from Ingo Molnar:
    "The uprobes code originates from SystemTap and has been used for years
    in Fedora and RHEL kernels. This version is much rewritten, reviews
    from PeterZ, Oleg and myself shaped the end result.

    This tree includes uprobes support in 'perf probe' - but SystemTap
    (and other tools) can take advantage of user probe points as well.

    Sample usage of uprobes via perf, for example to profile malloc()
    calls without modifying user-space binaries.

    First boot a new kernel with CONFIG_UPROBE_EVENT=y enabled.

    If you don't know which function you want to probe you can pick one
    from 'perf top' or can get a list all functions that can be probed
    within libc (binaries can be specified as well):

    $ perf probe -F -x /lib/libc.so.6

    To probe libc's malloc():

    $ perf probe -x /lib64/libc.so.6 malloc
    Added new event:
    probe_libc:malloc (on 0x7eac0)

    You can now use it in all perf tools, such as:

    perf record -e probe_libc:malloc -aR sleep 1

    Make use of it to create a call graph (as the flat profile is going to
    look very boring):

    $ perf record -e probe_libc:malloc -gR make
    [ perf record: Woken up 173 times to write data ]
    [ perf record: Captured and wrote 44.190 MB perf.data (~1930712

    $ perf report | less

    32.03% git libc-2.15.so [.] malloc
    |
    --- malloc

    29.49% cc1 libc-2.15.so [.] malloc
    |
    --- malloc
    |
    |--0.95%-- 0x208eb1000000000
    |
    |--0.63%-- htab_traverse_noresize

    11.04% as libc-2.15.so [.] malloc
    |
    --- malloc
    |

    7.15% ld libc-2.15.so [.] malloc
    |
    --- malloc
    |

    5.07% sh libc-2.15.so [.] malloc
    |
    --- malloc
    |
    4.99% python-config libc-2.15.so [.] malloc
    |
    --- malloc
    |
    4.54% make libc-2.15.so [.] malloc
    |
    --- malloc
    |
    |--7.34%-- glob
    | |
    | |--93.18%-- 0x41588f
    | |
    | --6.82%-- glob
    | 0x41588f

    ...

    Or:

    $ perf report -g flat | less

    # Overhead Command Shared Object Symbol
    # ........ ............. ............. ..........
    #
    32.03% git libc-2.15.so [.] malloc
    27.19%
    malloc

    29.49% cc1 libc-2.15.so [.] malloc
    24.77%
    malloc

    11.04% as libc-2.15.so [.] malloc
    11.02%
    malloc

    7.15% ld libc-2.15.so [.] malloc
    6.57%
    malloc

    ...

    The core uprobes design is fairly straightforward: uprobes probe
    points register themselves at (inode:offset) addresses of
    libraries/binaries, after which all existing (or new) vmas that map
    that address will have a software breakpoint injected at that address.
    vmas are COW-ed to preserve original content. The probe points are
    kept in an rbtree.

    If user-space executes the probed inode:offset instruction address
    then an event is generated which can be recovered from the regular
    perf event channels and mmap-ed ring-buffer.

    Multiple probes at the same address are supported, they create a
    dynamic callback list of event consumers.

    The basic model is further complicated by the XOL speedup: the
    original instruction that is probed is copied (in an architecture
    specific fashion) and executed out of line when the probe triggers.
    The XOL area is a single vma per process, with a fixed number of
    entries (which limits probe execution parallelism).

    The API: uprobes are installed/removed via
    /sys/kernel/debug/tracing/uprobe_events, the API is integrated to
    align with the kprobes interface as much as possible, but is separate
    to it.

    Injecting a probe point is privileged operation, which can be relaxed
    by setting perf_paranoid to -1.

    You can use multiple probes as well and mix them with kprobes and
    regular PMU events or tracepoints, when instrumenting a task."

    Fix up trivial conflicts in mm/memory.c due to previous cleanup of
    unmap_single_vma().

    * 'perf-uprobes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
    perf probe: Detect probe target when m/x options are absent
    perf probe: Provide perf interface for uprobes
    tracing: Fix kconfig warning due to a typo
    tracing: Provide trace events interface for uprobes
    tracing: Extract out common code for kprobes/uprobes trace events
    tracing: Modify is_delete, is_return from int to bool
    uprobes/core: Decrement uprobe count before the pages are unmapped
    uprobes/core: Make background page replacement logic account for rss_stat counters
    uprobes/core: Optimize probe hits with the help of a counter
    uprobes/core: Allocate XOL slots for uprobes use
    uprobes/core: Handle breakpoint and singlestep exceptions
    uprobes/core: Rename bkpt to swbp
    uprobes/core: Make order of function parameters consistent across functions
    uprobes/core: Make macro names consistent
    uprobes: Update copyright notices
    uprobes/core: Move insn to arch specific structure
    uprobes/core: Remove uprobe_opcode_sz
    uprobes/core: Make instruction tables volatile
    uprobes: Move to kernel/events/
    uprobes/core: Clean up, refactor and improve the code
    ...

    Linus Torvalds
     

24 May, 2012

2 commits

  • Pull first series of signal handling cleanups from Al Viro:
    "This is just the first part of the queue (about a half of it);
    assorted fixes all over the place in signal handling.

    This one ends with all sigsuspend() implementations switched to
    generic one (->saved_sigmask-based).

    With this, a bunch of assorted old buglets are fixed and most of the
    missing bits of NOTIFY_RESUME hookup are in place. Two more fixes sit
    in arm and um trees respectively, and there's a couple of broken ones
    that need obvious fixes - parisc and avr32 check TIF_NOTIFY_RESUME
    only on one of two codepaths; fixes for that will happen in the next
    series"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (55 commits)
    unicore32: if there's no handler we need to restore sigmask, syscall or no syscall
    xtensa: add handling of TIF_NOTIFY_RESUME
    microblaze: drop 'oldset' argument of do_notify_resume()
    microblaze: handle TIF_NOTIFY_RESUME
    score: add handling of NOTIFY_RESUME to do_notify_resume()
    m68k: add TIF_NOTIFY_RESUME and handle it.
    sparc: kill ancient comment in sparc_sigaction()
    h8300: missing checks of __get_user()/__put_user() return values
    frv: missing checks of __get_user()/__put_user() return values
    cris: missing checks of __get_user()/__put_user() return values
    powerpc: missing checks of __get_user()/__put_user() return values
    sh: missing checks of __get_user()/__put_user() return values
    sparc: missing checks of __get_user()/__put_user() return values
    avr32: struct old_sigaction is never used
    m32r: struct old_sigaction is never used
    xtensa: xtensa_sigaction doesn't exist
    alpha: tidy signal delivery up
    score: don't open-code force_sigsegv()
    cris: don't open-code force_sigsegv()
    blackfin: don't open-code force_sigsegv()
    ...

    Linus Torvalds
     
  • Pull user namespace enhancements from Eric Biederman:
    "This is a course correction for the user namespace, so that we can
    reach an inexpensive, maintainable, and reasonably complete
    implementation.

    Highlights:
    - Config guards make it impossible to enable the user namespace and
    code that has not been converted to be user namespace safe.

    - Use of the new kuid_t type ensures the if you somehow get past the
    config guards the kernel will encounter type errors if you enable
    user namespaces and attempt to compile in code whose permission
    checks have not been updated to be user namespace safe.

    - All uids from child user namespaces are mapped into the initial
    user namespace before they are processed. Removing the need to add
    an additional check to see if the user namespace of the compared
    uids remains the same.

    - With the user namespaces compiled out the performance is as good or
    better than it is today.

    - For most operations absolutely nothing changes performance or
    operationally with the user namespace enabled.

    - The worst case performance I could come up with was timing 1
    billion cache cold stat operations with the user namespace code
    enabled. This went from 156s to 164s on my laptop (or 156ns to
    164ns per stat operation).

    - (uid_t)-1 and (gid_t)-1 are reserved as an internal error value.
    Most uid/gid setting system calls treat these value specially
    anyway so attempting to use -1 as a uid would likely cause
    entertaining failures in userspace.

    - If setuid is called with a uid that can not be mapped setuid fails.
    I have looked at sendmail, login, ssh and every other program I
    could think of that would call setuid and they all check for and
    handle the case where setuid fails.

    - If stat or a similar system call is called from a context in which
    we can not map a uid we lie and return overflowuid. The LFS
    experience suggests not lying and returning an error code might be
    better, but the historical precedent with uids is different and I
    can not think of anything that would break by lying about a uid we
    can't map.

    - Capabilities are localized to the current user namespace making it
    safe to give the initial user in a user namespace all capabilities.

    My git tree covers all of the modifications needed to convert the core
    kernel and enough changes to make a system bootable to runlevel 1."

    Fix up trivial conflicts due to nearby independent changes in fs/stat.c

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits)
    userns: Silence silly gcc warning.
    cred: use correct cred accessor with regards to rcu read lock
    userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq
    userns: Convert cgroup permission checks to use uid_eq
    userns: Convert tmpfs to use kuid and kgid where appropriate
    userns: Convert sysfs to use kgid/kuid where appropriate
    userns: Convert sysctl permission checks to use kuid and kgids.
    userns: Convert proc to use kuid/kgid where appropriate
    userns: Convert ext4 to user kuid/kgid where appropriate
    userns: Convert ext3 to use kuid/kgid where appropriate
    userns: Convert ext2 to use kuid/kgid where appropriate.
    userns: Convert devpts to use kuid/kgid where appropriate
    userns: Convert binary formats to use kuid/kgid where appropriate
    userns: Add negative depends on entries to avoid building code that is userns unsafe
    userns: signal remove unnecessary map_cred_ns
    userns: Teach inode_capable to understand inodes whose uids map to other namespaces.
    userns: Fail exec for suid and sgid binaries with ids outside our user namespace.
    userns: Convert stat to return values mapped from kuids and kgids
    userns: Convert user specfied uids and gids in chown into kuids and kgid
    userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs
    ...

    Linus Torvalds
     

22 May, 2012

1 commit


16 May, 2012

1 commit


03 May, 2012

3 commits


14 Apr, 2012

2 commits

  • Merge in latest upstream (and the latest perf development tree),
    to prepare for tooling changes, and also to pick up v3.4 MM
    changes that the uprobes code needs to take care of.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • This change enables SIGSYS, defines _sigfields._sigsys, and adds
    x86 (compat) arch support. _sigsys defines fields which allow
    a signal handler to receive the triggering system call number,
    the relevant AUDIT_ARCH_* value for that number, and the address
    of the callsite.

    SIGSYS is added to the SYNCHRONOUS_MASK because it is desirable for it
    to have setup_frame() called for it. The goal is to ensure that
    ucontext_t reflects the machine state from the time-of-syscall and not
    from another signal handler.

    The first consumer of SIGSYS would be seccomp filter. In particular,
    a filter program could specify a new return value, SECCOMP_RET_TRAP,
    which would result in the system call being denied and the calling
    thread signaled. This also means that implementing arch-specific
    support can be dependent upon HAVE_ARCH_SECCOMP_FILTER.

    Suggested-by: H. Peter Anvin
    Signed-off-by: Will Drewry
    Acked-by: Serge Hallyn
    Reviewed-by: H. Peter Anvin
    Acked-by: Eric Paris

    v18: - added acked by, rebase
    v17: - rebase and reviewed-by addition
    v14: - rebase/nochanges
    v13: - rebase on to 88ebdda6159ffc15699f204c33feb3e431bf9bdc
    v12: - reworded changelog (oleg@redhat.com)
    v11: - fix dropped words in the change description
    - added fallback copy_siginfo support.
    - added __ARCH_SIGSYS define to allow stepped arch support.
    v10: - first version based on suggestion
    Signed-off-by: James Morris

    Will Drewry
     

08 Apr, 2012

1 commit


29 Mar, 2012

2 commits

  • …m/linux/kernel/git/dhowells/linux-asm_system

    Pull "Disintegrate and delete asm/system.h" from David Howells:
    "Here are a bunch of patches to disintegrate asm/system.h into a set of
    separate bits to relieve the problem of circular inclusion
    dependencies.

    I've built all the working defconfigs from all the arches that I can
    and made sure that they don't break.

    The reason for these patches is that I recently encountered a circular
    dependency problem that came about when I produced some patches to
    optimise get_order() by rewriting it to use ilog2().

    This uses bitops - and on the SH arch asm/bitops.h drags in
    asm-generic/get_order.h by a circuituous route involving asm/system.h.

    The main difficulty seems to be asm/system.h. It holds a number of
    low level bits with no/few dependencies that are commonly used (eg.
    memory barriers) and a number of bits with more dependencies that
    aren't used in many places (eg. switch_to()).

    These patches break asm/system.h up into the following core pieces:

    (1) asm/barrier.h

    Move memory barriers here. This already done for MIPS and Alpha.

    (2) asm/switch_to.h

    Move switch_to() and related stuff here.

    (3) asm/exec.h

    Move arch_align_stack() here. Other process execution related bits
    could perhaps go here from asm/processor.h.

    (4) asm/cmpxchg.h

    Move xchg() and cmpxchg() here as they're full word atomic ops and
    frequently used by atomic_xchg() and atomic_cmpxchg().

    (5) asm/bug.h

    Move die() and related bits.

    (6) asm/auxvec.h

    Move AT_VECTOR_SIZE_ARCH here.

    Other arch headers are created as needed on a per-arch basis."

    Fixed up some conflicts from other header file cleanups and moving code
    around that has happened in the meantime, so David's testing is somewhat
    weakened by that. We'll find out anything that got broken and fix it..

    * tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system: (38 commits)
    Delete all instances of asm/system.h
    Remove all #inclusions of asm/system.h
    Add #includes needed to permit the removal of asm/system.h
    Move all declarations of free_initmem() to linux/mm.h
    Disintegrate asm/system.h for OpenRISC
    Split arch_align_stack() out from asm-generic/system.h
    Split the switch_to() wrapper out of asm-generic/system.h
    Move the asm-generic/system.h xchg() implementation to asm-generic/cmpxchg.h
    Create asm-generic/barrier.h
    Make asm-generic/cmpxchg.h #include asm-generic/cmpxchg-local.h
    Disintegrate asm/system.h for Xtensa
    Disintegrate asm/system.h for Unicore32 [based on ver #3, changed by gxt]
    Disintegrate asm/system.h for Tile
    Disintegrate asm/system.h for Sparc
    Disintegrate asm/system.h for SH
    Disintegrate asm/system.h for Score
    Disintegrate asm/system.h for S390
    Disintegrate asm/system.h for PowerPC
    Disintegrate asm/system.h for PA-RISC
    Disintegrate asm/system.h for MN10300
    ...

    Linus Torvalds
     
  • Disintegrate asm/system.h for Sparc.

    Signed-off-by: David Howells
    cc: sparclinux@vger.kernel.org

    David Howells
     

24 Mar, 2012

2 commits

  • Cosmetic, rename the from_ancestor_ns argument in prepare_signal()
    paths. After the previous change it doesn't match the reality.

    Signed-off-by: Oleg Nesterov
    Cc: Tejun Heo
    Cc: Anton Vorontsov
    Cc: "Eric W. Biederman"
    Cc: KOSAKI Motohiro
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • force_sig_info() and friends have the special semantics for synchronous
    signals, this interface should not be used if the target is not current.
    And it needs the fixes, in particular the clearing of SIGNAL_UNKILLABLE
    is not exactly right.

    However there are callers which have to use force_ exactly because it
    clears SIGNAL_UNKILLABLE and thus it can kill the CLONE_NEWPID tasks,
    although this is almost always is wrong by various reasons.

    With this patch SEND_SIG_FORCED ignores SIGNAL_UNKILLABLE, like we do if
    the signal comes from the ancestor namespace.

    This makes the naming in prepare_signal() paths insane, fixed by the
    next cleanup.

    Note: this only affects SIGKILL/SIGSTOP, but this is enough for
    force_sig() abusers.

    Signed-off-by: Oleg Nesterov
    Cc: Tejun Heo
    Cc: Anton Vorontsov
    Cc: "Eric W. Biederman"
    Cc: KOSAKI Motohiro
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

21 Mar, 2012

1 commit

  • exit_notify() changes ->exit_signal if the parent already did exec.
    This doesn't really work, we are not going to send the signal now
    if there is another live thread or the exiting task is traced. The
    parent can exec before the last dies or the tracer detaches.

    Move this check into do_notify_parent() which actually sends the
    signal.

    The user-visible change is that we do not change ->exit_signal,
    and thus the exiting task is still "clone children" for
    do_wait()->eligible_child(__WCLONE). Hopefully this is fine, the
    current logic is racy anyway.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

14 Mar, 2012

1 commit

  • Uprobes uses exception notifiers to get to know if a thread hit
    a breakpoint or a singlestep exception.

    When a thread hits a uprobe or is singlestepping post a uprobe
    hit, the uprobe exception notifier sets its TIF_UPROBE bit,
    which will then be checked on its return to userspace path
    (do_notify_resume() ->uprobe_notify_resume()), where the
    consumers handlers are run (in task context) based on the
    defined filters.

    Uprobe hits are thread specific and hence we need to maintain
    information about if a task hit a uprobe, what uprobe was hit,
    the slot where the original instruction was copied for xol so
    that it can be singlestepped with appropriate fixups.

    In some cases, special care is needed for instructions that are
    executed out of line (xol). These are architecture specific
    artefacts, such as handling RIP relative instructions on x86_64.

    Since the instruction at which the uprobe was inserted is
    executed out of line, architecture specific fixups are added so
    that the thread continues normal execution in the presence of a
    uprobe.

    Postpone the signals until we execute the probed insn.
    post_xol() path does a recalc_sigpending() before return to
    user-mode, this ensures the signal can't be lost.

    Uprobes relies on DIE_DEBUG notification to notify if a
    singlestep is complete.

    Adds x86 specific uprobe exception notifiers and appropriate
    hooks needed to determine a uprobe hit and subsequent post
    processing.

    Add requisite x86 fixups for xol for uprobes. Specific cases
    needing fixups include relative jumps (x86_64), calls, etc.

    Where possible, we check and skip singlestepping the
    breakpointed instructions. For now we skip single byte as well
    as few multibyte nop instructions. However this can be extended
    to other instructions too.

    Credits to Oleg Nesterov for suggestions/patches related to
    signal, breakpoint, singlestep handling code.

    Signed-off-by: Srikar Dronamraju
    Cc: Linus Torvalds
    Cc: Ananth N Mavinakayanahalli
    Cc: Jim Keniston
    Cc: Linux-mm
    Cc: Oleg Nesterov
    Cc: Andi Kleen
    Cc: Christoph Hellwig
    Cc: Steven Rostedt
    Cc: Arnaldo Carvalho de Melo
    Cc: Masami Hiramatsu
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20120313180011.29771.89027.sendpatchset@srdronam.in.ibm.com
    [ Performed various cleanliness edits ]
    Signed-off-by: Ingo Molnar

    Srikar Dronamraju
     

14 Jan, 2012

2 commits

  • Add trace_signal_generate() into send_sigqueue().

    send_sigqueue() is very similar to __send_signal(), just it uses
    the preallocated info. It should do the same wrt tracing.

    Reported-by: Seiji Aguchi
    Reviewed-by: Seiji Aguchi
    Acked-by: Steven Rostedt
    Signed-off-by: Oleg Nesterov

    Oleg Nesterov
     
  • __send_signal()->trace_signal_generate() doesn't report enough info.
    The users want to know was the signal actually delivered or not, and
    they also need the shared/private info.

    The patch moves trace_signal_generate() at the end of __send_signal()
    and adds the 2 additional arguments.

    This also allows us to kill trace_signal_overflow_fail/lose_info, we
    can simply add the appropriate TRACE_SIGNAL_ "result" codes.

    Reported-by: Seiji Aguchi
    Reviewed-by: Seiji Aguchi
    Acked-by: Steven Rostedt
    Signed-off-by: Oleg Nesterov

    Oleg Nesterov
     

11 Jan, 2012

2 commits

  • ipc/mqueue.c: for __SI_MESQ, convert the uid being sent to recipient's
    user namespace. (new, thanks Oleg)

    __send_signal: convert current's uid to the recipient's user namespace
    for any siginfo which is not SI_FROMKERNEL (patch from Oleg, thanks
    again :)

    do_notify_parent and do_notify_parent_cldstop: map task's uid to parent's
    user namespace

    ptrace_signal maps parent's uid into current's user namespace before
    including in signal to current. IIUC Oleg has argued that this shouldn't
    matter as the debugger will play with it, but it seems like not converting
    the value currently being set is misleading.

    Changelog:
    Sep 20: Inspired by Oleg's suggestion, define map_cred_ns() helper to
    simplify callers and help make clear what we are translating
    (which uid into which namespace). Passing the target task would
    make callers even easier to read, but we pass in user_ns because
    current_user_ns() != task_cred_xxx(current, user_ns).
    Sep 20: As recommended by Oleg, also put task_pid_vnr() under rcu_read_lock
    in ptrace_signal().
    Sep 23: In send_signal(), detect when (user) signal is coming from an
    ancestor or unrelated user namespace. Pass that on to __send_signal,
    which sets si_uid to 0 or overflowuid if needed.
    Oct 12: Base on Oleg's fixup_uid() patch. On top of that, handle all
    SI_FROMKERNEL cases at callers, because we can't assume sender is
    current in those cases.
    Nov 10: (mhelsley) rename fixup_uid to more meaningful usern_fixup_signal_uid
    Nov 10: (akpm) make the !CONFIG_USER_NS case clearer

    Signed-off-by: Serge Hallyn
    Cc: Oleg Nesterov
    Cc: Matt Helsley
    Cc: "Eric W. Biederman"
    From: Serge Hallyn
    Subject: __send_signal: pass q->info, not info, to userns_fixup_signal_uid (v2)

    Eric Biederman pointed out that passing info is a bug and could lead to a
    NULL pointer deref to boot.

    A collection of signal, securebits, filecaps, cap_bounds, and a few other
    ltp tests passed with this kernel.

    Changelog:
    Nov 18: previous patch missed a leading '&'

    Signed-off-by: Serge Hallyn
    Cc: "Eric W. Biederman"
    From: Dan Carpenter
    Subject: ipc/mqueue: lock() => unlock() typo

    There was a double lock typo introduced in b085f4bd6b21 "user namespace:
    make signal.c respect user namespaces"

    Signed-off-by: Dan Carpenter
    Cc: Oleg Nesterov
    Cc: Matt Helsley
    Cc: "Eric W. Biederman"
    Acked-by: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • Abstract the code sequence for adding a signal handler's sa_mask to
    current->blocked because the sequence is identical for all architectures.
    Furthermore, in the past some architectures actually got this code wrong,
    so introduce a wrapper that all architectures can use.

    Signed-off-by: Matt Fleming
    Signed-off-by: Oleg Nesterov
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: Tejun Heo
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Fleming
     

10 Jan, 2012

1 commit

  • * 'for-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
    cgroup: fix to allow mounting a hierarchy by name
    cgroup: move assignement out of condition in cgroup_attach_proc()
    cgroup: Remove task_lock() from cgroup_post_fork()
    cgroup: add sparse annotation to cgroup_iter_start() and cgroup_iter_end()
    cgroup: mark cgroup_rmdir_waitq and cgroup_attach_proc() as static
    cgroup: only need to check oldcgrp==newgrp once
    cgroup: remove redundant get/put of task struct
    cgroup: remove redundant get/put of old css_set from migrate
    cgroup: Remove unnecessary task_lock before fetching css_set on migration
    cgroup: Drop task_lock(parent) on cgroup_fork()
    cgroups: remove redundant get/put of css_set from css_set_check_fetched()
    resource cgroups: remove bogus cast
    cgroup: kill subsys->can_attach_task(), pre_attach() and attach_task()
    cgroup, cpuset: don't use ss->pre_attach()
    cgroup: don't use subsys->can_attach_task() or ->attach_task()
    cgroup: introduce cgroup_taskset and use it in subsys->can_attach(), cancel_attach() and attach()
    cgroup: improve old cgroup handling in cgroup_attach_proc()
    cgroup: always lock threadgroup during migration
    threadgroup: extend threadgroup_lock() to cover exit and exec
    threadgroup: rename signal->threadgroup_fork_lock to ->group_rwsem
    ...

    Fix up conflict in kernel/cgroup.c due to commit e0197aae59e5: "cgroups:
    fix a css_set not found bug in cgroup_attach_proc" that already
    mentioned that the bug is fixed (differently) in Tejun's cgroup
    patchset. This one, in other words.

    Linus Torvalds
     

07 Jan, 2012

1 commit

  • * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
    sched/tracing: Add a new tracepoint for sleeptime
    sched: Disable scheduler warnings during oopses
    sched: Fix cgroup movement of waking process
    sched: Fix cgroup movement of newly created process
    sched: Fix cgroup movement of forking process
    sched: Remove cfs bandwidth period check in tg_set_cfs_period()
    sched: Fix load-balance lock-breaking
    sched: Replace all_pinned with a generic flags field
    sched: Only queue remote wakeups when crossing cache boundaries
    sched: Add missing rcu_dereference() around ->real_parent usage
    [S390] fix cputime overflow in uptime_proc_show
    [S390] cputime: add sparse checking and cleanup
    sched: Mark parent and real_parent as __rcu
    sched, nohz: Fix missing RCU read lock
    sched, nohz: Set the NOHZ_BALANCE_KICK flag for idle load balancer
    sched, nohz: Fix the idle cpu check in nohz_idle_balance
    sched: Use jump_labels for sched_feat
    sched/accounting: Fix parameter passing in task_group_account_field
    sched/accounting: Fix user/system tick double accounting
    sched/accounting: Re-use scheduler statistics for the root cgroup
    ...

    Fix up conflicts in
    - arch/ia64/include/asm/cputime.h, include/asm-generic/cputime.h
    usecs_to_cputime64() vs the sparse cleanups
    - kernel/sched/fair.c, kernel/time/tick-sched.c
    scheduler changes in multiple branches

    Linus Torvalds
     

05 Jan, 2012

1 commit

  • This is the temporary simple fix for 3.2, we need more changes in this
    area.

    1. do_signal_stop() assumes that the running untraced thread in the
    stopped thread group is not possible. This was our goal but it is
    not yet achieved: a stopped-but-resumed tracee can clone the running
    thread which can initiate another group-stop.

    Remove WARN_ON_ONCE(!current->ptrace).

    2. A new thread always starts with ->jobctl = 0. If it is auto-attached
    and this group is stopped, __ptrace_unlink() sets JOBCTL_STOP_PENDING
    but JOBCTL_STOP_SIGMASK part is zero, this triggers WANR_ON(!signr)
    in do_jobctl_trap() if another debugger attaches.

    Change __ptrace_unlink() to set the artificial SIGSTOP for report.

    Alternatively we could change ptrace_init_task() to copy signr from
    current, but this means we can copy it for no reason and hide the
    possible similar problems.

    Acked-by: Tejun Heo
    Cc: [3.1]
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

15 Dec, 2011

1 commit


13 Dec, 2011

1 commit

  • threadgroup_lock() protected only protected against new addition to
    the threadgroup, which was inherently somewhat incomplete and
    problematic for its only user cgroup. On-going migration could race
    against exec and exit leading to interesting problems - the symmetry
    between various attach methods, task exiting during method execution,
    ->exit() racing against attach methods, migrating task switching basic
    properties during exec and so on.

    This patch extends threadgroup_lock() such that it protects against
    all three threadgroup altering operations - fork, exit and exec. For
    exit, threadgroup_change_begin/end() calls are added to exit_signals
    around assertion of PF_EXITING. For exec, threadgroup_[un]lock() are
    updated to also grab and release cred_guard_mutex.

    With this change, threadgroup_lock() guarantees that the target
    threadgroup will remain stable - no new task will be added, no new
    PF_EXITING will be set and exec won't happen.

    The next patch will update cgroup so that it can take full advantage
    of this change.

    -v2: beefed up comment as suggested by Frederic.

    -v3: narrowed scope of protection in exit path as suggested by
    Frederic.

    Signed-off-by: Tejun Heo
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Li Zefan
    Acked-by: Frederic Weisbecker
    Cc: Oleg Nesterov
    Cc: Andrew Morton
    Cc: Paul Menage
    Cc: Linus Torvalds

    Tejun Heo
     

31 Oct, 2011

1 commit

  • The changed files were only including linux/module.h for the
    EXPORT_SYMBOL infrastructure, and nothing else. Revector them
    onto the isolated export header for faster compile times.

    Nothing to see here but a whole lot of instances of:

    -#include
    +#include

    This commit is only changing the kernel dir; next targets
    will probably be mm, fs, the arch dirs, etc.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

30 Sep, 2011

1 commit

  • Add to the dev_state and alloc_async structures the user namespace
    corresponding to the uid and euid. Pass these to kill_pid_info_as_uid(),
    which can then implement a proper, user-namespace-aware uid check.

    Changelog:
    Sep 20: Per Oleg's suggestion: Instead of caching and passing user namespace,
    uid, and euid each separately, pass a struct cred.
    Sep 26: Address Alan Stern's comments: don't define a struct cred at
    usbdev_open(), and take and put a cred at async_completed() to
    ensure it lasts for the duration of kill_pid_info_as_cred().

    Signed-off-by: Serge Hallyn
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Serge Hallyn
     

28 Jul, 2011

1 commit

  • sys_ssetmask(), sys_rt_sigsuspend() and compat_sys_rt_sigsuspend()
    change ->blocked directly. This is not correct, see the changelog in
    e6fa16ab "signal: sigprocmask() should do retarget_shared_pending()"

    Change them to use set_current_blocked().

    Another change is that now we are doing ->saved_sigmask = ->blocked
    lockless, it doesn't make any sense to do this under ->siglock.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Matt Fleming
    Acked-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

23 Jul, 2011

1 commit

  • * 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc: (39 commits)
    ptrace: do_wait(traced_leader_killed_by_mt_exec) can block forever
    ptrace: fix ptrace_signal() && STOP_DEQUEUED interaction
    connector: add an event for monitoring process tracers
    ptrace: dont send SIGSTOP on auto-attach if PT_SEIZED
    ptrace: mv send-SIGSTOP from do_fork() to ptrace_init_task()
    ptrace_init_task: initialize child->jobctl explicitly
    has_stopped_jobs: s/task_is_stopped/SIGNAL_STOP_STOPPED/
    ptrace: make former thread ID available via PTRACE_GETEVENTMSG after PTRACE_EVENT_EXEC stop
    ptrace: wait_consider_task: s/same_thread_group/ptrace_reparented/
    ptrace: kill real_parent_is_ptracer() in in favor of ptrace_reparented()
    ptrace: ptrace_reparented() should check same_thread_group()
    redefine thread_group_leader() as exit_signal >= 0
    do not change dead_task->exit_signal
    kill task_detached()
    reparent_leader: check EXIT_DEAD instead of task_detached()
    make do_notify_parent() __must_check, update the callers
    __ptrace_detach: avoid task_detached(), check do_notify_parent()
    kill tracehook_notify_death()
    make do_notify_parent() return bool
    ptrace: s/tracehook_tracer_task()/ptrace_parent()/
    ...

    Linus Torvalds
     

21 Jul, 2011

2 commits

  • Simple test-case,

    int main(void)
    {
    int pid, status;

    pid = fork();
    if (!pid) {
    pause();
    assert(0);
    return 0x23;
    }

    assert(ptrace(PTRACE_ATTACH, pid, 0,0) == 0);
    assert(wait(&status) == pid);
    assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGSTOP);

    kill(pid, SIGCONT); // ptrace check from ptrace_signal() to its caller,
    get_signal_to_deliver(), this looks more natural.

    Signed-off-by: Oleg Nesterov
    Acked-by: Tejun Heo

    Oleg Nesterov
     
  • The __lock_task_sighand() function calls rcu_read_lock() with interrupts
    and preemption enabled, but later calls rcu_read_unlock() with interrupts
    disabled. It is therefore possible that this RCU read-side critical
    section will be preempted and later RCU priority boosted, which means that
    rcu_read_unlock() will call rt_mutex_unlock() in order to deboost itself, but
    with interrupts disabled. This results in lockdep splats, so this commit
    nests the RCU read-side critical section within the interrupt-disabled
    region of code. This prevents the RCU read-side critical section from
    being preempted, and thus prevents the attempt to deboost with interrupts
    disabled.

    It is quite possible that a better long-term fix is to make rt_mutex_unlock()
    disable irqs when acquiring the rt_mutex structure's ->wait_lock.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

28 Jun, 2011

3 commits

  • Kill real_parent_is_ptracer() and update the callers to use
    ptrace_reparented(), after the previous patch they do the same.

    Remove the unnecessary ->ptrace != 0 check in get_signal_to_deliver(),
    if ptrace_reparented() == T then the task must be ptraced.

    Signed-off-by: Oleg Nesterov
    Acked-by: Tejun Heo

    Oleg Nesterov
     
  • __ptrace_detach() and do_notify_parent() set task->exit_signal = -1
    to mark the task dead. This is no longer needed, nobody checks
    exit_signal to detect the EXIT_DEAD task.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Tejun Heo

    Oleg Nesterov
     
  • - change do_notify_parent() to return a boolean, true if the task should
    be reaped because its parent ignores SIGCHLD.

    - update the only caller which checks the returned value, exit_notify().

    This temporary uglifies exit_notify() even more, will be cleanuped by
    the next change.

    Signed-off-by: Oleg Nesterov
    Acked-by: Tejun Heo

    Oleg Nesterov