25 Jun, 2009

3 commits

  • …/{vfs-2.6,audit-current}

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    another race fix in jfs_check_acl()
    Get "no acls for this inode" right, fix shmem breakage
    inline functions left without protection of ifdef (acl)

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current:
    audit: inode watches depend on CONFIG_AUDIT not CONFIG_AUDIT_SYSCALL

    Linus Torvalds
     
  • Even though one cannot make use of the audit watch code without
    CONFIG_AUDIT_SYSCALL, the spaghetti nature of the audit code means that
    the audit rule filtering requires that it at least be compiled.

    Thus build the audit_watch code when we build auditfilter, like it was
    before cfcad62c74abfef83762dc05a556d21bdf3980a2.

    Clearly this is a point of potential future cleanup.

    Reported-by: Frans Pop
    Signed-off-by: Eric Paris
    Signed-off-by: Al Viro

    Eric Paris
     
  • commit 64d1304a64 (futex: setup writeable mapping for futex ops which
    modify user space data) did address only half of the problem of write
    access faults.

    The patch was made on two wrong assumptions:

    1) access_ok(VERIFY_WRITE,...) would actually check write access.

    On x86 it does _NOT_. It's a pure address range check.

    2) a RW mapped region can not go away under us.

    That's wrong as well. Nothing prevents another thread from calling
    mprotect(PROT_READ) on the region where the futex resides. If that
    call hits between the get_user_pages_fast() verification and the
    actual write access in the atomic region, we are toast again.

    The solution is to not rely on access_ok and get_user() for any write
    access related fault on private and shared futexes. Instead we need to
    fault it in with verification of write access.

    There is no generic non-destructive write mechanism which would fault
    the user page in through a #PF, but as we already know that we will
    fault, we can just as well call get_user_pages() directly and avoid
    the #PF overhead.

    If get_user_pages() returns -EFAULT we know that we can not fix it
    anymore and need to bail out to user space.

    Remove a bunch of confusing comments on this issue as well.

    Signed-off-by: Thomas Gleixner
    Cc: stable@kernel.org

    Thomas Gleixner
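
    The first wrong assumption above can be illustrated with a small
    userspace model. This is only a sketch: USER_LIMIT, struct region,
    range_ok() and write_ok() are hypothetical names invented for the
    illustration and are not kernel interfaces. It shows that an address
    can pass a pure range check while the region it lies in is read-only.

    ```c
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical upper bound of the user address range (illustration only). */
    #define USER_LIMIT ((uintptr_t)0xC0000000u)

    /* Toy stand-in for one mapped region and its current protection. */
    struct region {
        uintptr_t start;
        size_t len;
        bool writable;
    };

    /* Pure range check: it says nothing about the page protection. */
    static bool range_ok(uintptr_t addr, size_t size)
    {
        return size <= USER_LIMIT && addr <= USER_LIMIT - size;
    }

    /* Write check: the access must lie inside the region AND the region
     * must currently be writable. */
    static bool write_ok(const struct region *r, uintptr_t addr, size_t size)
    {
        if (addr < r->start || size > r->len)
            return false;
        if (addr - r->start > r->len - size)
            return false;
        return r->writable;
    }
    ```

    In this model a read-only region passes range_ok() and fails
    write_ok(), and flipping the writable flag (as an mprotect() call from
    another thread would) changes the answer between two checks.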
     

24 Jun, 2009

10 commits

  • If a syscall removes the root of a subtree being watched, we
    definitely do not want the rules referring to that subtree
    to be destroyed without the syscall in question having
    a chance to match them.

    Signed-off-by: Al Viro

    Al Viro
     
  • In a number of places in the audit system we send an op= followed by a
    string that includes spaces. Somehow this works, but it's just wrong. This
    patch changes all of those that I could find to be quoted.

    Example:

    Change From: type=CONFIG_CHANGE msg=audit(1244666690.117:31): auid=0 ses=1
    subj=unconfined_u:unconfined_r:auditctl_t:s0-s0:c0.c1023 op=remove rule
    key="number2" list=4 res=0

    Change To: type=CONFIG_CHANGE msg=audit(1244666690.117:31): auid=0 ses=1
    subj=unconfined_u:unconfined_r:auditctl_t:s0-s0:c0.c1023 op="remove rule"
    key="number2" list=4 res=0

    Signed-off-by: Eric Paris

    Eric Paris
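
    A minimal userspace sketch of why the quoting matters, assuming a
    consumer that splits records on spaces into key=value fields. Both
    format_quoted_field() and count_stray_tokens() are hypothetical
    helpers written for this illustration; they are not part of the audit
    code.

    ```c
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical helper: always emit key="value". */
    static int format_quoted_field(char *buf, size_t size,
                                   const char *key, const char *value)
    {
        return snprintf(buf, size, "%s=\"%s\"", key, value);
    }

    /* Count space-separated tokens that contain no '=' outside quotes.
     * Such tokens are what a naive key=value parser cannot attribute
     * to any field. */
    static int count_stray_tokens(const char *rec)
    {
        int stray = 0;
        bool in_quote = false, in_token = false, has_eq = false;

        for (const char *p = rec; ; p++) {
            char c = *p;
            bool end = (c == '\0') || (c == ' ' && !in_quote);

            if (end) {
                if (in_token && !has_eq)
                    stray++;
                in_token = false;
                has_eq = false;
                if (c == '\0')
                    break;
                continue;
            }
            in_token = true;
            if (c == '"')
                in_quote = !in_quote;
            else if (c == '=' && !in_quote)
                has_eq = true;
        }
        return stray;
    }
    ```

    With the unquoted form, op=remove rule leaves the word "rule" as a
    token that belongs to no key. With the quoted form the whole string
    stays attached to op.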
     
  • audit_get_nd() is only used by audit_watch and could be more cleanly
    implemented by having the audit watch functions call it when needed rather
    than making the generic audit rule parsing code deal with those objects.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • In preparation for converting audit to use fsnotify instead of inotify we
    separate the inode watching code into its own file. This is similar to
    how the audit tree watching code is already separated into audit_tree.c

    Signed-off-by: Eric Paris

    Eric Paris
     
  • It is hard to tell what audit_receive_skb is doing to the netlink
    message. Clean the function up so it is easy and clear to see what is going
    on.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • The audit handling of netlink messages is all over the place. Clean things
    up, use the predefined macros, and generally make it more readable.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • Remove code duplication of skb printk when auditd is not around in userspace
    to deal with this message.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • audit_update_watch() runs through all of the rules for a given watch,
    duplicates them, and attaches a new watch to them. When it finishes that
    process and has called free on all of the old rules (ok, maybe still inside
    the rcu grace period), it proceeds to use the last element from
    list_for_each_entry_safe() as if it were a krule, when it is really the
    audit_watch which was anchoring the list, to output a message about audit
    rules changing.

    This patch unifies the audit message from two different places into a helper
    function and calls it from the correct location in audit_update_rules(). We
    will now get an audit message about the config changing for each rule (with
    each rule's filterkey) rather than the previous garbage.

    Signed-off-by: Eric Paris

    Eric Paris
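
    The cursor problem described above can be reproduced in a few lines of
    userspace C. This is a sketch under stated assumptions: the list
    helpers below are simplified re-implementations written for the
    example, and struct rule and struct watch are hypothetical stand-ins
    for the real audit structures.

    ```c
    #include <stddef.h>

    struct list_head { struct list_head *next, *prev; };

    #define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

    struct rule  { int id; struct list_head node; };       /* list entry  */
    struct watch { int nrules; struct list_head rules; };  /* list anchor */

    static void list_init(struct list_head *h)
    {
        h->next = h->prev = h;
    }

    static void list_add_tail(struct list_head *n, struct list_head *h)
    {
        n->prev = h->prev;
        n->next = h;
        h->prev->next = n;
        h->prev = n;
    }

    /* Walk the list the way an entry-based iterator does: convert each
     * node to its containing entry first, then test for the list head.
     * Returns the cursor as it stands once the loop has terminated. */
    static struct rule *cursor_after_walk(struct watch *w, int *visited)
    {
        struct list_head *pos;
        struct rule *r;

        *visited = 0;
        for (pos = w->rules.next; ; pos = pos->next) {
            r = container_of(pos, struct rule, node);
            if (pos == &w->rules)
                break;      /* r was computed from the head, not a rule */
            (*visited)++;
        }
        return r;
    }
    ```

    After the walk the returned pointer matches none of the rules that
    were added. Its node member is the list head embedded in the watch, so
    reading fields through it as if it were a rule yields garbage.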
     
  • The audit execve record splitting code estimates the length of the message
    generated. But it forgot to include the "" that wrap each string in its
    estimation. This means that execve messages with lots of tiny (1-2 byte)
    arguments could still cause records greater than 8k to be emitted. Simply
    fix the estimate.

    Signed-off-by: Eric Paris

    Eric Paris
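
    A small sketch of the estimation error, assuming for illustration that
    each argument is emitted as a<index>="<value>" preceded by one space.
    The function names are hypothetical and the exact record layout in the
    kernel may differ.

    ```c
    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    /* Length actually emitted for one argument in the assumed format.
     * snprintf(NULL, 0, ...) returns the number of bytes it would write. */
    static size_t emitted_len(int index, const char *arg)
    {
        return (size_t)snprintf(NULL, 0, " a%d=\"%s\"", index, arg);
    }

    /* Estimate that forgets the two quote characters. */
    static size_t estimate_without_quotes(int index, const char *arg)
    {
        return (size_t)snprintf(NULL, 0, " a%d=", index) + strlen(arg);
    }

    /* Corrected estimate: payload plus opening and closing quote. */
    static size_t estimate_with_quotes(int index, const char *arg)
    {
        return estimate_without_quotes(index, arg) + 2;
    }
    ```

    The shortfall is a constant two bytes per argument, so it is
    proportionally largest for very short arguments and adds up when a
    record contains many of them.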
     
  • When an audit watch is added to a parent, the temporary watch inside the
    original krule from userspace is freed. Yet the original watch is used after
    the real watch was created in audit_add_rules().

    Signed-off-by: Eric Paris

    Eric Paris
     

23 Jun, 2009

1 commit

  • SLAB uses get/put_online_cpus() which use a mutex which is itself only
    initialized when cpu_hotplug_init() is called. Currently we hang during
    boot in SLAB due to doing that too late.

    Reported by James Bottomley and Sachin Sant (and possibly others).
    Debugged by Benjamin Herrenschmidt.

    This just removes the dynamic initialization of the data structures, and
    replaces it with a static one, avoiding this dependency entirely, and
    removing one unnecessary special initcall.

    Tested-by: Sachin Sant
    Tested-by: James Bottomley
    Tested-by: Benjamin Herrenschmidt
    Signed-off-by: Linus Torvalds

    Linus Torvalds
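
    The idea of replacing dynamic initialization with a static one has a
    direct userspace analogue. The sketch below uses POSIX threads rather
    than kernel primitives, and the names hotplug_lock_sketch,
    get_online_sketch() and put_online_sketch() are hypothetical. Because
    the lock is initialized at compile time, it is usable no matter how
    early the first caller runs.

    ```c
    #include <pthread.h>

    /* Statically initialized: valid before any init function has run. */
    static pthread_mutex_t hotplug_lock_sketch = PTHREAD_MUTEX_INITIALIZER;
    static int refcount_sketch;

    /* Take a reference under the lock. Returns 0 on success. */
    static int get_online_sketch(void)
    {
        int ret = pthread_mutex_lock(&hotplug_lock_sketch);

        if (ret)
            return ret;
        refcount_sketch++;
        return pthread_mutex_unlock(&hotplug_lock_sketch);
    }

    /* Drop a reference under the lock. Returns 0 on success. */
    static int put_online_sketch(void)
    {
        int ret = pthread_mutex_lock(&hotplug_lock_sketch);

        if (ret)
            return ret;
        refcount_sketch--;
        return pthread_mutex_unlock(&hotplug_lock_sketch);
    }
    ```

    There is no init call whose ordering could go wrong, which is the
    dependency the commit message describes removing.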
     

21 Jun, 2009

6 commits

  • …git/tip/linux-2.6-tip

    * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    genirq, irq.h: Fix kernel-doc warnings
    genirq: fix comment to say IRQ_WAKE_THREAD

    Linus Torvalds
     
  • …x/kernel/git/tip/linux-2.6-tip

    * 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (49 commits)
    perfcounter: Handle some IO return values
    perf_counter: Push perf_sample_data through the swcounter code
    perf_counter tools: Define and use our own u64, s64 etc. definitions
    perf_counter: Close race in perf_lock_task_context()
    perf_counter, x86: Improve interactions with fast-gup
    perf_counter: Simplify and fix task migration counting
    perf_counter tools: Add a data file header
    perf_counter: Update userspace callchain sampling uses
    perf_counter: Make callchain samples extensible
    perf report: Filter to parent set by default
    perf_counter tools: Handle lost events
    perf_counter: Add event overlow handling
    fs: Provide empty .set_page_dirty() aop for anon inodes
    perf_counter: tools: Makefile tweaks for 64-bit powerpc
    perf_counter: powerpc: Add processor back-end for MPC7450 family
    perf_counter: powerpc: Make powerpc perf_counter code safe for 32-bit kernels
    perf_counter: powerpc: Change how processor-specific back-ends get selected
    perf_counter: powerpc: Use unsigned long for register and constraint values
    perf_counter: powerpc: Enable use of software counters on 32-bit powerpc
    perf_counter tools: Add and use isprint()
    ...

    Linus Torvalds
     
  • …l/git/tip/linux-2.6-tip

    * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    sched: Fix out of scope variable access in sched_slice()
    sched: Hide runqueues from direct refer at source code level
    sched: Remove unneeded __ref tag
    sched, x86: Fix cpufreq + sched_clock() TSC scaling

    Linus Torvalds
     
  • …nel/git/tip/linux-2.6-tip

    * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (24 commits)
    tracing/urgent: warn in case of ftrace_start_up inbalance
    tracing/urgent: fix unbalanced ftrace_start_up
    function-graph: add stack frame test
    function-graph: disable when both x86_32 and optimize for size are configured
    ring-buffer: have benchmark test print to trace buffer
    ring-buffer: do not grab locks in nmi
    ring-buffer: add locks around rb_per_cpu_empty
    ring-buffer: check for less than two in size allocation
    ring-buffer: remove useless compile check for buffer_page size
    ring-buffer: remove useless warn on check
    ring-buffer: use BUF_PAGE_HDR_SIZE in calculating index
    tracing: update sample event documentation
    tracing/filters: fix race between filter setting and module unload
    tracing/filters: free filter_string in destroy_preds()
    ring-buffer: use commit counters for commit pointer accounting
    ring-buffer: remove unused variable
    ring-buffer: have benchmark test handle discarded events
    ring-buffer: prevent adding write in discarded area
    tracing/filters: strloc should be unsigned short
    tracing/filters: operand can be negative
    ...

    Fix up kmemcheck-induced conflict in kernel/trace/ring_buffer.c manually

    Linus Torvalds
     
  • …el/git/tip/linux-2.6-tip

    * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    NOHZ: Properly feed cpufreq ondemand governor

    Linus Torvalds
     
  • …/git/rostedt/linux-2.6-trace into tracing/urgent

    Ingo Molnar
     

20 Jun, 2009

5 commits

  • …it/rostedt/linux-2.6-trace into tracing/urgent

    Ingo Molnar
     
  • Push the perf_sample_data further outwards to the swcounter interface,
    to abstract it away some more.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Prevent further ftrace_start_up imbalances so that we avoid
    future nop patching omissions with dynamic ftrace.

    Signed-off-by: Frederic Weisbecker
    Cc: Steven Rostedt

    Frederic Weisbecker
     
  • Perfcounter reports the following stats for a system-wide
    profiling run:

    #
    # (2364 samples)
    #
    # Overhead Symbol
    # ........ ......
    #
    15.40% [k] mwait_idle_with_hints
    8.29% [k] read_hpet
    5.75% [k] ftrace_caller
    3.60% [k] ftrace_call
    [...]

    This snapshot was taken while neither the function tracer nor
    the function graph tracer was running.
    With dynamic ftrace, such results indicate incorrect ftrace behaviour,
    because all calls to ftrace_caller or ftrace_graph_caller (the patched
    calls to mcount) are supposed to be patched into nops if none of those
    tracers are running.

    The problem occurs after the first run of the function tracer. Once we
    launch it a second time, the callsites will never be nopped back,
    unless you set custom filters.
    For example it happens during the self tests at boot time.
    The function tracer selftest runs, and then the dynamic tracing is
    tested too. After that, the callsites are left un-nopped.

    This is because the reset callback of the function tracer tries to
    unregister two ftrace callbacks at once: the common function tracer
    and the function tracer with stack backtrace, regardless of which
    one is currently in use.
    This creates an imbalance in the ftrace_start_up value, which is expected
    to be zero when the last ftrace callback is unregistered. When it
    reaches zero, FTRACE_DISABLE_CALLS is set on the next ftrace
    command, triggering the patching into nops. But since the counter becomes
    unbalanced, i.e. drops below zero, if the kernel functions
    are patched again (as in every further function tracer run), they
    won't ever be nopped back.

    Note that ftrace_call and ftrace_graph_call are still patched back
    to ftrace_stub in the off case, but not the callers of ftrace_call
    and ftrace_graph_caller. This means that tracing is properly deactivated,
    but we waste a useless call in every kernel function.

    This patch just unregisters the right ftrace_ops for the function
    tracer on its reset callback and ignores the other one which is
    not registered, fixing the imbalance. The problem also happens
    in .30.

    Signed-off-by: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: stable@kernel.org

    Frederic Weisbecker
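
    The counter imbalance can be followed in a toy model. This is a
    sketch, not the ftrace implementation: struct tracer_state and the
    functions below are hypothetical, and the model only captures the rule
    stated above that call sites are nopped when the counter reaches
    exactly zero.

    ```c
    #include <stdbool.h>

    struct tracer_state {
        int start_up;   /* models the ftrace_start_up counter            */
        bool nopped;    /* true when call sites are patched into nops    */
    };

    static void do_register(struct tracer_state *s)
    {
        s->start_up++;
        s->nopped = false;      /* call sites are patched to call the tracer */
    }

    static void do_unregister(struct tracer_state *s)
    {
        s->start_up--;
        if (s->start_up == 0)   /* only an exact zero triggers nop patching */
            s->nopped = true;
    }

    /* One tracer run whose reset unregisters two callbacks although only
     * one was registered. */
    static void run_buggy(struct tracer_state *s)
    {
        do_register(s);
        do_unregister(s);
        do_unregister(s);
    }

    /* One tracer run whose reset unregisters only what was registered. */
    static void run_fixed(struct tracer_state *s)
    {
        do_register(s);
        do_unregister(s);
    }
    ```

    In this model the first buggy run still ends with the call sites
    nopped, but leaves the counter at -1. The second run then never passes
    through zero on the way down, so the call sites stay patched, which
    matches the behaviour described for the second launch of the tracer.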
     
  • The bug is ancient.

    If we trace the sub-thread of our natural child and this sub-thread exits,
    we update parent->signal->cxxx fields. But we should not do this until
    the whole thread-group exits, otherwise we account this thread (and all
    other live threads) twice.

    Add the task_detached() check. No need to check thread_group_empty(),
    wait_consider_task()->delay_group_leader() already did this.

    Signed-off-by: Oleg Nesterov
    Cc: Peter Zijlstra
    Acked-by: Roland McGrath
    Cc: Stanislaw Gruszka
    Cc: Vitaly Mayatskikh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

19 Jun, 2009

15 commits

  • perf_lock_task_context() is buggy because it can return a dead
    context.

    The RCU read lock in perf_lock_task_context() only guarantees
    the memory won't get freed, it doesn't guarantee the object is
    valid (in our case refcount > 0).

    Therefore we can return a locked object that can get freed the
    moment we release the rcu read lock.

    perf_pin_task_context() then increases the refcount and does an
    unlock on freed memory.

    That increased refcount will cause a double free, in case it
    started out with 0.

    Amend this by including the get_ctx() functionality in
    perf_lock_task_context() (all users already did this later
    anyway), and return a NULL context when the found one is
    already dead.

    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
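
    The "take a reference only if the object is still alive" pattern can
    be sketched in userspace with C11 atomics. This is an illustration in
    the spirit of the kernel's atomic_inc_not_zero(), not the code from
    the patch; struct ctx_sketch, get_ctx_unless_dead() and
    lock_ctx_sketch() are hypothetical names.

    ```c
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct ctx_sketch {
        atomic_int refcount;    /* 0 means the context is already dead */
    };

    /* Take a reference only while the refcount is still above zero. */
    static bool get_ctx_unless_dead(struct ctx_sketch *ctx)
    {
        int old = atomic_load(&ctx->refcount);

        while (old > 0) {
            /* On failure, old is reloaded with the current value and
             * the loop re-checks it before trying again. */
            if (atomic_compare_exchange_weak(&ctx->refcount, &old, old + 1))
                return true;
        }
        return false;
    }

    /* Hypothetical lookup: return NULL rather than hand out a dead context. */
    static struct ctx_sketch *lock_ctx_sketch(struct ctx_sketch *found)
    {
        if (!found || !get_ctx_unless_dead(found))
            return NULL;
        return found;
    }
    ```

    A context whose count has already dropped to zero is never revived, so
    a later put cannot free it a second time.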
     
  • The task migrations counter was causing rare and hard to decipher
    memory corruptions under load. After a day of debugging and bisection
    we found that the problem was introduced with:

    3f731ca: perf_counter: Fix cpu migration counter

    Turning them off fixes the crashes. Incidentally, the whole
    perf_counter_task_migration() logic can be done simpler as well,
    by injecting a proper sw-counter event.

    This cleanup also fixed the crashes. The precise failure mode is
    not completely clear yet, but we are clearly not unhappy about
    having a fix ;-)

    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Corey Ashford
    Cc: Marcelo Tosatti
    Cc: Arnaldo Carvalho de Melo
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In case gcc does something funny with the stack frames, or the return
    from function code, we would like to detect that.

    An arch may implement passing of a variable that is unique to the
    function and can be saved on entering a function and can be tested
    when exiting the function. Usually the frame pointer can be used for
    this purpose.

    This patch also implements this for x86, where it passes in the stack
    frame of the parent function and tests that frame on exit.

    There was a case in x86_32 with optimize for size (-Os) where, for a
    few functions, gcc would align the stack frame and place a copy of the
    return address into it. The function graph tracer modified the copy and
    not the actual return address. On return from the function, it did not go
    to the tracer hook, but returned to the parent. This broke the function
    graph tracer, because the return of the parent (where gcc did not do
    this funky manipulation) returned to the location that the child function
    was supposed to return to. This caused strange kernel crashes.

    This test detected the problem and pointed out where the issue was.

    This modifies the parameters of one of the functions that the arch
    specific code calls, so it includes changes to arch code to accommodate
    the new prototype.

    Note, I notice that the parisc arch implements its own push_return_trace.
    This is now a generic function and the ftrace_push_return_trace should be
    used instead. This patch does not touch that code.

    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Frederic Weisbecker
    Cc: Helge Deller
    Cc: Kyle McMartin
    Signed-off-by: Steven Rostedt

    Steven Rostedt
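
    The entry/exit test described above can be modelled as a shadow stack
    that records a frame value next to each saved return address. This is
    a simplified sketch: struct ret_stack, push_return() and pop_return()
    are hypothetical, and the real tracer reports a mismatch rather than
    simply returning false.

    ```c
    #include <stdbool.h>
    #include <stdint.h>

    #define DEPTH_SKETCH 16

    struct ret_entry {
        uintptr_t ret;  /* original return address            */
        uintptr_t fp;   /* frame value recorded at entry      */
    };

    struct ret_stack {
        struct ret_entry e[DEPTH_SKETCH];
        int top;
    };

    /* Function entry: remember the return address and the frame value. */
    static bool push_return(struct ret_stack *s, uintptr_t ret, uintptr_t fp)
    {
        if (s->top >= DEPTH_SKETCH)
            return false;
        s->e[s->top].ret = ret;
        s->e[s->top].fp = fp;
        s->top++;
        return true;
    }

    /* Function exit: hand back the return address only if the frame value
     * seen now equals the one recorded at entry. */
    static bool pop_return(struct ret_stack *s, uintptr_t fp, uintptr_t *ret)
    {
        if (s->top <= 0)
            return false;
        s->top--;
        if (s->e[s->top].fp != fp)
            return false;   /* entry and exit disagree about the frame */
        *ret = s->e[s->top].ret;
        return true;
    }
    ```

    A mismatch at exit means the hook is running for a different frame
    than the one that was recorded, which is how the -Os case was caught.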
     
  • On x86_32, when optimize for size is set, gcc may align the frame pointer
    and make a copy of the return address inside the stack frame.
    The return address that is located in the stack frame may not be
    the one used to return to the calling function. This will break the
    function graph tracer.

    The function graph tracer replaces the return address with a jump to a hook
    function that can trace the exit of the function. If it only replaces
    a copy, then the hook will not be called when the function returns.
    Worse yet, when the parent function returns, the function graph tracer
    will return back to the location of the child function which will
    easily crash the kernel with weird results.

    To see the problem, when i386 is compiled with -Os we get:

    c106be03:  57                    push   %edi
    c106be04:  8d 7c 24 08           lea    0x8(%esp),%edi
    c106be08:  83 e4 e0              and    $0xffffffe0,%esp
    c106be0b:  ff 77 fc              pushl  0xfffffffc(%edi)
    c106be0e:  55                    push   %ebp
    c106be0f:  89 e5                 mov    %esp,%ebp
    c106be11:  57                    push   %edi
    c106be12:  56                    push   %esi
    c106be13:  53                    push   %ebx
    c106be14:  81 ec 8c 00 00 00     sub    $0x8c,%esp
    c106be1a:  e8 f5 57 fb ff        call   c1021614

    When it is compiled with -O2 instead we get:

    c10896f0:  55                    push   %ebp
    c10896f1:  89 e5                 mov    %esp,%ebp
    c10896f3:  83 ec 28              sub    $0x28,%esp
    c10896f6:  89 5d f4              mov    %ebx,0xfffffff4(%ebp)
    c10896f9:  89 75 f8              mov    %esi,0xfffffff8(%ebp)
    c10896fc:  89 7d fc              mov    %edi,0xfffffffc(%ebp)
    c10896ff:  e8 d0 08 fa ff        call   c1029fd4

    The compile with -Os aligns the stack pointer and only then sets up the
    frame pointer (%ebp), copying the return address back into the stack
    frame. The change that mcount makes to the return address is applied to
    the copy and not to the real placeholder of the return address.

    The compile with -O2 sets up the frame pointer first, so the change
    that mcount makes to the return address affects where the function
    will jump on exit.

    Reported-by: Jake Edge
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • Enable gcov profiling of the entire kernel on x86_64. Required changes
    include disabling profiling for:

    * arch/kernel/acpi/realmode and arch/kernel/boot/compressed:
    not linked to main kernel
    * arch/vdso, arch/kernel/vsyscall_64 and arch/kernel/hpet:
    profiling causes segfaults during boot (incompatible context)

    Signed-off-by: Peter Oberparleiter
    Cc: Andi Kleen
    Cc: Huang Ying
    Cc: Li Wei
    Cc: Michael Ellerman
    Cc: Ingo Molnar
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Rusty Russell
    Cc: WANG Cong
    Cc: Sam Ravnborg
    Cc: Jeff Dike
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Oberparleiter
     
  • Enable the use of GCC's coverage testing tool gcov [1] with the Linux
    kernel. gcov may be useful for:

    * debugging (has this code been reached at all?)
    * test improvement (how do I change my test to cover these lines?)
    * minimizing kernel configurations (do I need this option if the
    associated code is never run?)

    The profiling patch incorporates the following changes:

    * change kbuild to include profiling flags
    * provide functions needed by profiling code
    * present profiling data as files in debugfs

    Note that on some architectures, enabling gcc's profiling option
    "-fprofile-arcs" for the entire kernel may trigger compile/link/
    run-time problems, some of which are caused by toolchain bugs and
    others which require adjustment of architecture code.

    For this reason profiling the entire kernel is initially restricted
    to those architectures for which it is known to work without changes.
    This restriction can be lifted once an architecture has been tested
    and found compatible with gcc's profiling. Profiling of single files
    or directories is still available on all platforms (see config help
    text).

    [1] http://gcc.gnu.org/onlinedocs/gcc/Gcov.html

    Signed-off-by: Peter Oberparleiter
    Cc: Andi Kleen
    Cc: Huang Ying
    Cc: Li Wei
    Cc: Michael Ellerman
    Cc: Ingo Molnar
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Rusty Russell
    Cc: WANG Cong
    Cc: Sam Ravnborg
    Cc: Jeff Dike
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Oberparleiter
     
  • Call constructors (gcc-generated initcall-like functions) during kernel
    start and module load. Constructors are e.g. used for gcov data
    initialization.

    Disable constructor support for usermode Linux to prevent conflicts with
    host glibc.

    Signed-off-by: Peter Oberparleiter
    Acked-by: Rusty Russell
    Acked-by: WANG Cong
    Cc: Sam Ravnborg
    Cc: Jeff Dike
    Cc: Andi Kleen
    Cc: Huang Ying
    Cc: Li Wei
    Cc: Michael Ellerman
    Cc: Ingo Molnar
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Oberparleiter
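
    For readers unfamiliar with constructors, here is a userspace sketch
    of the mechanism using the GCC/Clang constructor attribute. In a
    normal program the C runtime calls such functions before main(); the
    commit above makes the kernel do the equivalent at boot and at module
    load. The names counters_ready, init_counters_sketch() and
    counters_are_ready() are hypothetical.

    ```c
    /* Set by the constructor below before main() starts. */
    static int counters_ready;

    /* The compiler records this function in a table of constructors;
     * the startup code walks that table and calls each entry once. */
    __attribute__((constructor))
    static void init_counters_sketch(void)
    {
        counters_ready = 1;
    }

    /* Report whether the constructor has already run. */
    static int counters_are_ready(void)
    {
        return counters_ready;
    }
    ```

    If nothing walks the constructor table, the flag stays zero, which is
    the situation gcov data initialization would face in a kernel without
    constructor support.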
     
  • clone_nsproxy() does useless copying of old nsproxy -- every pointer will
    be rewritten to new ns or to old ns. Remove the copying and rename
    clone_nsproxy() to create_nsproxy(), which will be used by C/R code to
    create fresh nsproxy on restart.

    Signed-off-by: Alexey Dobriyan
    Acked-by: Serge Hallyn
    Cc: Pavel Emelyanov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • create_uts_ns() will be used by C/R to create fresh uts_ns.

    Signed-off-by: Alexey Dobriyan
    Acked-by: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • copy_pid_ns() is a perfect example of a case where unwinding leads to more
    code and makes it less clear. Watch the diffstat.

    Signed-off-by: Alexey Dobriyan
    Cc: Pavel Emelyanov
    Cc: "Eric W. Biederman"
    Reviewed-by: Serge Hallyn
    Acked-by: Sukadev Bhattiprolu
    Reviewed-by: WANG Cong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • create_pid_namespace() creates everything, but caller has to assign parent
    pidns by hand, which is unnatural. At the moment of call new ->level has
    to be taken from somewhere and parent pidns is already available.

    Signed-off-by: Alexey Dobriyan
    Cc: Pavel Emelyanov
    Cc: "Eric W. Biederman"
    Acked-by: Serge Hallyn
    Acked-by: Sukadev Bhattiprolu
    Reviewed-by: WANG Cong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • find_task_by_pid_type_ns is only used to implement find_task_by_vpid and
    find_task_by_pid_ns, but both of them pass PIDTYPE_PID as first argument.
    So just fold find_task_by_pid_type_ns into find_task_by_pid_ns and use
    find_task_by_pid_ns to implement find_task_by_vpid.

    While we're at it also remove the exports for find_task_by_pid_ns and
    find_task_by_vpid - we don't have any modular callers left as the only
    modular caller of the old pre pid namespace find_task_by_pid (gfs2) was
    switched to pid_task which operates on a struct pid pointer instead of a
    pid_t. Given the confusion about pid_t values vs namespace that's
    generally the better option anyway and I think we're better off restricting
    modules to do it that way.

    Signed-off-by: Christoph Hellwig
    Cc: Pavel Emelyanov
    Cc: "Eric W. Biederman"
    Cc: Ingo Molnar
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Remove the unused variable 'val' from __do_proc_dointvec()

    The integer has been declared and used as 'val = -val' and there is no
    reference to it anywhere.

    Signed-off-by: Sukanto Ghosh
    Cc: Jaswinder Singh Rajput
    Cc: Sukanto Ghosh
    Cc: Jiri Kosina
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukanto Ghosh
     
  • Now that kthread_stop() can be used even if the task has already exited,
    we can kill the "wait_to_die:" loop in migration_thread(). But we must
    pin rq->migration_thread after creation.

    Actually, I don't think CPU_UP_CANCELED or CPU_DEAD should wait for
    ->migration_thread exit. Perhaps we can simplify this code a bit more.
    migration_call() can set ->should_stop and forget about this thread. But
    we need a new helper in kthread.c for that.

    Signed-off-by: Oleg Nesterov
    Cc: Christoph Hellwig
    Cc: "Eric W. Biederman"
    Cc: Ingo Molnar
    Cc: Pavel Emelyanov
    Cc: Rusty Russell
    Cc: Vitaliy Gusev
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Based on Eric's patch which in turn was based on my patch.

    kthread_stop() has several nasty problems:

    - it runs unpredictably long with the global semaphore held.

    - it deadlocks if kthread itself does kthread_stop() before it obeys
    the kthread_should_stop() request.

    - it is not useable if kthread exits on its own, see for example the
    ugly "wait_to_die:" hack in migration_thread()

    - it is not possible to just tell kthread it should stop, we must always
    wait for its exit.

    With this patch kthread() allocates all necessary data (struct kthread) on
    its own stack, and the kthread_stop_xxx globals are deleted. ->vfork_done
    is used as a pointer into "struct kthread", which means kthread_stop() can
    easily wait for kthread's exit.

    Signed-off-by: Oleg Nesterov
    Cc: Christoph Hellwig
    Cc: "Eric W. Biederman"
    Cc: Ingo Molnar
    Cc: Pavel Emelyanov
    Cc: Rusty Russell
    Cc: Vitaliy Gusev
    Signed-off-by: Linus Torvalds

    Oleg Nesterov