29 Jun, 2009

1 commit

  • …nel/git/tip/linux-2.6-tip

    * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    ftrace: Fix the output of profile
    ring-buffer: Make it generally available
    ftrace: Remove duplicate newline
    tracing: Fix trace_buf_size boot option
    ftrace: Fix t_hash_start()
    ftrace: Don't manipulate @pos in t_start()
    ftrace: Don't increment @pos in g_start()
    tracing: Reset iterator in t_start()
    trace_stat: Don't increment @pos in seq start()
    tracing_bprintk: Don't increment @pos in t_start()
    tracing/events: Don't increment @pos in s_start()

    Linus Torvalds
     

26 Jun, 2009

1 commit

  • The first entry of the ftrace profile was always skipped when
    reading trace_stat/functionX.

    Signed-off-by: Li Zefan
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Li Zefan
     

25 Jun, 2009

6 commits

  • Yanmin noticed that fault_in_user_writeable() requests 4 pages instead
    of one.

    That's the result of blindly trusting Linus' proposal :) I even looked
    up the prototype to verify the correctness: the argument in question
    is confusingly named "len", while in reality it means the number of
    pages.

    Pointed-out-by: Yanmin Zhang
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • While hunting down the cause of the hwlat_detector ring buffer spew in
    my failed -next builds, it became obvious that folks are now treating
    ring_buffer as something generic and independent of tracing and thus
    suitable for public driver consumption.

    Given that there are only a few minor areas in ring_buffer that have any
    reliance on CONFIG_TRACING or CONFIG_FUNCTION_TRACER, provide stubs for
    those and make it generally available.

    Signed-off-by: Paul Mundt
    Cc: Jon Masters
    Cc: Steven Rostedt
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Mundt
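    The stub approach described above can be sketched in plain C. The macro
    and function names below are placeholders, not the actual ring_buffer
    symbols: when the dependency is compiled out, a static inline stub lets
    generic code call the helper unconditionally.

    ```c
    #include <assert.h>

    /* Hypothetical helper names; this only illustrates the stub pattern. */
    #ifdef CONFIG_TRACING
    extern int trace_annotate(const char *msg);   /* real implementation */
    #else
    static inline int trace_annotate(const char *msg)
    {
        (void)msg;      /* tracing compiled out: accept the call, do nothing */
        return 0;
    }
    #endif

    /* Generic code can now call the helper without any #ifdef of its own. */
    static int record_event(const char *msg)
    {
        return trace_annotate(msg);
    }
    ```

    Built without CONFIG_TRACING, the stub is used and callers need no
    conditional compilation of their own.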
     
  • Before:
    # echo 'sys_open:traceon:' > set_ftrace_filter
    # echo 'sys_close:traceoff:5' > set_ftrace_filter
    # cat set_ftrace_filter
    #### all functions enabled ####
    sys_open:traceon:unlimited

    sys_close:traceoff:count=0

    After:
    # cat set_ftrace_filter
    #### all functions enabled ####
    sys_open:traceon:unlimited
    sys_close:traceoff:count=0

    Signed-off-by: Li Zefan
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Li Zefan
     
  • …/{vfs-2.6,audit-current}

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    another race fix in jfs_check_acl()
    Get "no acls for this inode" right, fix shmem breakage
    inline functions left without protection of ifdef (acl)

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current:
    audit: inode watches depend on CONFIG_AUDIT not CONFIG_AUDIT_SYSCALL

    Linus Torvalds
     
  • Even though one cannot make use of the audit watch code without
    CONFIG_AUDIT_SYSCALL, the spaghetti nature of the audit code means that
    the audit rule filtering requires that it at least be compiled.

    Thus build the audit_watch code when we build auditfilter, like it was
    before cfcad62c74abfef83762dc05a556d21bdf3980a2.

    Clearly this is a point of potential future cleanup.

    Reported-by: Frans Pop
    Signed-off-by: Eric Paris
    Signed-off-by: Al Viro

    Eric Paris
     
  • commit 64d1304a64 (futex: setup writeable mapping for futex ops which
    modify user space data) did address only half of the problem of write
    access faults.

    The patch was made on two wrong assumptions:

    1) access_ok(VERIFY_WRITE,...) would actually check write access.

    On x86 it does _NOT_. It's a pure address range check.

    2) a RW mapped region can not go away under us.

    That's wrong as well. Nobody can prevent another thread from calling
    mprotect(PROT_READ) on the region where the futex resides. If that
    call hits between the get_user_pages_fast() verification and the
    actual write access in the atomic region, we are toast again.

    The solution is to not rely on access_ok and get_user() for any write
    access related fault on private and shared futexes. Instead we need to
    fault it in with verification of write access.

    There is no generic non-destructive write mechanism which would fault
    the user page in through a #PF, but as we already know that we will
    fault we can just as well call get_user_pages() directly and avoid the
    #PF overhead.

    If get_user_pages() returns -EFAULT we know that we cannot fix it
    anymore and need to bail out to user space.

    Remove a bunch of confusing comments on this issue as well.

    Signed-off-by: Thomas Gleixner
    Cc: stable@kernel.org

    Thomas Gleixner
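    The first wrong assumption can be demonstrated from userspace with an
    analogous experiment (a sketch, not the kernel fix itself): an address
    range can be perfectly valid, which is all an x86
    access_ok(VERIFY_WRITE, ...) style check verifies, and yet an actual
    kernel write into it still faults once the pages are mapped read-only.

    ```c
    #include <errno.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Map a page read-write, then drop write permission, mimicking another
     * thread calling mprotect(PROT_READ) on the region holding a futex.
     * read() must copy incoming bytes into the buffer, so the kernel's real
     * write attempt fails with EFAULT even though the range itself is valid. */
    static int demo_readonly_efault(void)
    {
        long pagesz = sysconf(_SC_PAGESIZE);
        char *page = mmap(NULL, pagesz, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (page == MAP_FAILED || mprotect(page, pagesz, PROT_READ) != 0)
            return -1;

        int pipefd[2];
        if (pipe(pipefd) != 0 || write(pipefd[1], "x", 1) != 1)
            return -1;

        /* The range check would pass, but the actual write access faults. */
        ssize_t n = read(pipefd[0], page, 1);
        return (n == -1 && errno == EFAULT) ? 0 : -1;
    }
    ```

    This is exactly why the fix faults the page in with explicit
    verification of write access instead of trusting a range check.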
     

24 Jun, 2009

18 commits

  • We should be able to specify [KMG] when setting the trace_buf_size
    boot option, as documented in kernel-parameters.txt.

    Signed-off-by: Li Zefan
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Li Zefan
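    The [KMG] suffix handling boils down to what the kernel's memparse()
    does. A userspace sketch of the idea (parse_size is a hypothetical
    stand-in, not the kernel function):

    ```c
    #include <stdlib.h>

    /* Parse "128", "4k", "1M", "2G" into a byte count, memparse()-style.
     * Each suffix multiplies by 1024; G falls through M falls through K. */
    static unsigned long long parse_size(const char *s)
    {
        char *end;
        unsigned long long v = strtoull(s, &end, 0);

        switch (*end) {
        case 'G': case 'g': v <<= 10; /* fall through */
        case 'M': case 'm': v <<= 10; /* fall through */
        case 'K': case 'k': v <<= 10; break;
        default: break;
        }
        return v;
    }
    ```

    With this, a boot option like trace_buf_size=1M resolves to 1048576
    bytes instead of rejecting the suffix.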
     
  • When the output of set_ftrace_filter is larger than PAGE_SIZE,
    t_hash_start() will be called a second time, and then we start
    from the head of a hlist, which is wrong and causes some entries
    to be output twice.

    Worse, if the hlist is large enough, reading set_ftrace_filter
    won't stop but will loop forever.

    Reviewed-by: Liming Wang
    Signed-off-by: Li Zefan
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Li Zefan
     
  • It's rather confusing that in t_start(), in some cases @pos is
    incremented, and in some cases it's decremented and then incremented.

    This patch rewrites t_start() in a much more general way.

    Thus we fix a bug: if ftrace_filtered == 1, functions that have tracer
    hooks won't be printed, because this branch is always unreachable:

    static void *t_start(...)
    {
            ...
            if (!p)
                    return t_hash_start(m, pos);

            return p;
    }

    Before:
    # echo 'sys_open' > /mnt/tracing/set_ftrace_filter
    # echo 'sys_write:traceon:4' >> /mnt/tracing/set_ftrace_filter
    sys_open

    After:
    # echo 'sys_open' > /mnt/tracing/set_ftrace_filter
    # echo 'sys_write:traceon:4' >> /mnt/tracing/set_ftrace_filter
    sys_open
    sys_write:traceon:count=4

    Reviewed-by: Liming Wang
    Signed-off-by: Li Zefan
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Li Zefan
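    The seq_file contract behind this series of fixes is that start()
    returns the element at *pos without advancing it; only next()
    increments *pos. A small userspace model (names are illustrative, not
    kernel code) shows how a start() that also advances @pos drops an
    entry every time reading resumes at a page boundary:

    ```c
    #include <stddef.h>

    #define NITEMS 6
    static int items[NITEMS] = { 10, 11, 12, 13, 14, 15 };

    /* next(): advance *pos, then return the element there (NULL at the end). */
    static int *demo_next(long long *pos)
    {
        ++*pos;
        return *pos < NITEMS ? &items[*pos] : NULL;
    }

    /* Correct start(): return the element AT *pos, leaving *pos untouched. */
    static int *demo_start(long long *pos)
    {
        return *pos < NITEMS ? &items[*pos] : NULL;
    }

    /* Buggy start(): behaves like next(), advancing @pos, as the fixed
     * iterators used to do. */
    static int *buggy_start(long long *pos)
    {
        return demo_next(pos);
    }

    /* Simulate seq_read(): each "page" holds `chunk` entries; when it fills,
     * we stop and later restart via start() at the saved position. */
    static int read_all(int *(*start_fn)(long long *), int out[], int chunk)
    {
        long long pos = 0;
        int n = 0;
        int *p;

        while ((p = start_fn(&pos)) != NULL) {
            int room = chunk;
            while (p && room-- > 0) {
                out[n++] = *p;            /* "show" the entry */
                p = demo_next(&pos);
            }
            if (p == NULL)                /* iteration finished, not page-full */
                break;
        }
        return n;
    }
    ```

    With a page of 4 entries, the correct start() yields all six items,
    while the buggy one silently loses an entry at each start() call.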
     
  • It's wrong to increment @pos in g_start(). It causes some entries
    to be lost when reading set_graph_function, if the output of the file
    is larger than PAGE_SIZE.

    Reviewed-by: Liming Wang
    Signed-off-by: Li Zefan
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Li Zefan
     
  • The iterator is m->private, but it's not reset to trace_types in
    t_start(). If the output is larger than PAGE_SIZE and t_start()
    is called the 2nd time, things will go wrong.

    Reviewed-by: Liming Wang
    Signed-off-by: Li Zefan
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Li Zefan
     
  • It's wrong to increment @pos in stat_seq_start(). It causes some
    stat entries to be lost when reading a stat file, if the output of
    the file is larger than PAGE_SIZE.

    Reviewed-by: Liming Wang
    Signed-off-by: Li Zefan
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Li Zefan
     
  • It's wrong to increment @pos in t_start(), otherwise we'll lose
    some entries when reading printk_formats, if the output is larger
    than PAGE_SIZE.

    Reported-by: Lai Jiangshan
    Reviewed-by: Liming Wang
    Signed-off-by: Li Zefan
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Li Zefan
     
  • While testing syscall tracepoints posted by Jason, I found 3 entries
    were missing when reading available_events. The output size of
    available_events is < 4 pages, which means we lost 1 entry per page.

    The cause is, it's wrong to increment @pos in s_start().

    Actually there's another bug here -- reading available_events/set_events
    can race with module unload:

    # cat available_events          |
    s_start()                       |
    s_stop()                        |
                                    | # rmmod foo.ko
    s_start()                       |
    call = list_entry(m->private)   |

    @call might be freed and accessing it will lead to a crash.

    Reviewed-by: Liming Wang
    Signed-off-by: Li Zefan
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Li Zefan
     
  • If a syscall removes the root of a subtree being watched, we
    definitely do not want the rules referring to that subtree
    to be destroyed without the syscall in question having
    a chance to match them.

    Signed-off-by: Al Viro

    Al Viro
     
  • In a number of places in the audit system we send an op= followed by a
    string that includes spaces. Somehow this works, but it's just wrong.
    This patch changes all of the instances I could find so that the
    string is quoted.

    Example:

    Change From: type=CONFIG_CHANGE msg=audit(1244666690.117:31): auid=0 ses=1
    subj=unconfined_u:unconfined_r:auditctl_t:s0-s0:c0.c1023 op=remove rule
    key="number2" list=4 res=0

    Change To: type=CONFIG_CHANGE msg=audit(1244666690.117:31): auid=0 ses=1
    subj=unconfined_u:unconfined_r:auditctl_t:s0-s0:c0.c1023 op="remove rule"
    key="number2" list=4 res=0

    Signed-off-by: Eric Paris

    Eric Paris
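    A minimal sketch of the quoting rule (format_op is a made-up helper,
    not the kernel's audit logging machinery): values containing spaces
    get wrapped in quotes so userspace parsers see a single op= field.

    ```c
    #include <stdio.h>
    #include <string.h>

    /* Quote the op= value when it contains a space; leave it bare otherwise. */
    static void format_op(char *buf, size_t len, const char *op)
    {
        if (strchr(op, ' '))
            snprintf(buf, len, "op=\"%s\"", op);
        else
            snprintf(buf, len, "op=%s", op);
    }
    ```

    This turns op=remove rule (two apparent fields) into op="remove rule"
    (one field), matching the example above.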
     
  • audit_get_nd() is only used by audit_watch and could be more cleanly
    implemented by having the audit watch functions call it when needed rather
    than making the generic audit rule parsing code deal with those objects.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • In preparation for converting audit to use fsnotify instead of inotify,
    we separate the inode watching code into its own file. This is similar
    to how the audit tree watching code is already separated into
    audit_tree.c.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • It is hard to tell from audit_receive_skb() what it is doing to the
    netlink message. Clean the function up so it is easy and clear to see
    what is going on.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • The audit handling of netlink messages is all over the place. Clean
    things up, use predetermined macros, and generally make it more
    readable.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • Remove code duplication of skb printk when auditd is not around in userspace
    to deal with this message.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • audit_update_watch() runs all of the rules for a given watch, duplicates
    them, and attaches a new watch to them. Then, when it has finished that
    process and has called free on all of the old rules (ok, maybe still
    inside the rcu grace period), it proceeds to use the last element from
    list_for_each_entry_safe() as if it were a krule, rather than the
    audit_watch which was anchoring the list, to output a message about
    audit rules changing.

    This patch unifies the audit message from two different places into a
    helper function and calls it from the correct location in
    audit_update_rules(). We will now get an audit message about the config
    changing for each rule (with each rule's filterkey) rather than the
    previous garbage.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • The audit execve record splitting code estimates the length of the
    message generated, but it forgot to include the quotes ("") that wrap
    each string in its estimate. This means that execve messages with lots
    of tiny (1-2 byte) arguments could still cause records greater than 8k
    to be emitted. Simply fix the estimate.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • When an audit watch is added to a parent, the temporary watch inside
    the original krule from userspace is freed. Yet the original watch is
    used after the real watch was created in audit_add_rules().

    Signed-off-by: Eric Paris

    Eric Paris
     

23 Jun, 2009

1 commit

  • SLAB uses get/put_online_cpus(), which use a mutex that is itself only
    initialized when cpu_hotplug_init() is called. Currently we hang during
    boot in SLAB due to doing that too late.

    Reported by James Bottomley and Sachin Sant (and possibly others).
    Debugged by Benjamin Herrenschmidt.

    This just removes the dynamic initialization of the data structures, and
    replaces it with a static one, avoiding this dependency entirely, and
    removing one unnecessary special initcall.

    Tested-by: Sachin Sant
    Tested-by: James Bottomley
    Tested-by: Benjamin Herrenschmidt
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

21 Jun, 2009

6 commits

  • …git/tip/linux-2.6-tip

    * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    genirq, irq.h: Fix kernel-doc warnings
    genirq: fix comment to say IRQ_WAKE_THREAD

    Linus Torvalds
     
  • …x/kernel/git/tip/linux-2.6-tip

    * 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (49 commits)
    perfcounter: Handle some IO return values
    perf_counter: Push perf_sample_data through the swcounter code
    perf_counter tools: Define and use our own u64, s64 etc. definitions
    perf_counter: Close race in perf_lock_task_context()
    perf_counter, x86: Improve interactions with fast-gup
    perf_counter: Simplify and fix task migration counting
    perf_counter tools: Add a data file header
    perf_counter: Update userspace callchain sampling uses
    perf_counter: Make callchain samples extensible
    perf report: Filter to parent set by default
    perf_counter tools: Handle lost events
    perf_counter: Add event overlow handling
    fs: Provide empty .set_page_dirty() aop for anon inodes
    perf_counter: tools: Makefile tweaks for 64-bit powerpc
    perf_counter: powerpc: Add processor back-end for MPC7450 family
    perf_counter: powerpc: Make powerpc perf_counter code safe for 32-bit kernels
    perf_counter: powerpc: Change how processor-specific back-ends get selected
    perf_counter: powerpc: Use unsigned long for register and constraint values
    perf_counter: powerpc: Enable use of software counters on 32-bit powerpc
    perf_counter tools: Add and use isprint()
    ...

    Linus Torvalds
     
  • …l/git/tip/linux-2.6-tip

    * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    sched: Fix out of scope variable access in sched_slice()
    sched: Hide runqueues from direct refer at source code level
    sched: Remove unneeded __ref tag
    sched, x86: Fix cpufreq + sched_clock() TSC scaling

    Linus Torvalds
     
  • …nel/git/tip/linux-2.6-tip

    * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (24 commits)
    tracing/urgent: warn in case of ftrace_start_up inbalance
    tracing/urgent: fix unbalanced ftrace_start_up
    function-graph: add stack frame test
    function-graph: disable when both x86_32 and optimize for size are configured
    ring-buffer: have benchmark test print to trace buffer
    ring-buffer: do not grab locks in nmi
    ring-buffer: add locks around rb_per_cpu_empty
    ring-buffer: check for less than two in size allocation
    ring-buffer: remove useless compile check for buffer_page size
    ring-buffer: remove useless warn on check
    ring-buffer: use BUF_PAGE_HDR_SIZE in calculating index
    tracing: update sample event documentation
    tracing/filters: fix race between filter setting and module unload
    tracing/filters: free filter_string in destroy_preds()
    ring-buffer: use commit counters for commit pointer accounting
    ring-buffer: remove unused variable
    ring-buffer: have benchmark test handle discarded events
    ring-buffer: prevent adding write in discarded area
    tracing/filters: strloc should be unsigned short
    tracing/filters: operand can be negative
    ...

    Fix up kmemcheck-induced conflict in kernel/trace/ring_buffer.c manually

    Linus Torvalds
     
  • …el/git/tip/linux-2.6-tip

    * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    NOHZ: Properly feed cpufreq ondemand governor

    Linus Torvalds
     
  • …/git/rostedt/linux-2.6-trace into tracing/urgent

    Ingo Molnar
     

20 Jun, 2009

5 commits

  • …it/rostedt/linux-2.6-trace into tracing/urgent

    Ingo Molnar
     
  • Push the perf_sample_data further outwards to the swcounter interface,
    to abstract it away some more.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Prevent further ftrace_start_up imbalances so that we avoid
    future nop patching omissions with dynamic ftrace.

    Signed-off-by: Frederic Weisbecker
    Cc: Steven Rostedt

    Frederic Weisbecker
     
  • Perfcounter reports the following stats for system-wide
    profiling:

    #
    # (2364 samples)
    #
    # Overhead Symbol
    # ........ ......
    #
    15.40% [k] mwait_idle_with_hints
    8.29% [k] read_hpet
    5.75% [k] ftrace_caller
    3.60% [k] ftrace_call
    [...]

    This snapshot has been taken while neither the function tracer nor
    the function graph tracer was running.
    With dynamic ftrace, such results show a wrong ftrace behaviour
    because all calls to ftrace_caller or ftrace_graph_caller (the patched
    calls to mcount) are supposed to be patched into nop if none of those
    tracers are running.

    The problem occurs after the first run of the function tracer. Once we
    launch it a second time, the callsites will never be nopped back,
    unless you set custom filters.
    For example it happens during the self tests at boot time.
    The function tracer selftest runs, and then the dynamic tracing is
    tested too. After that, the callsites are left un-nopped.

    This is because the reset callback of the function tracer tries to
    unregister two ftrace callbacks at once: the common function tracer
    and the function tracer with stack backtrace, regardless of which
    one is currently in use.

    It then creates an imbalance in the ftrace_start_up value, which is
    expected to be zero when the last ftrace callback is unregistered.
    When it reaches zero, FTRACE_DISABLE_CALLS is set on the next ftrace
    command, triggering the patching into nop. But since it becomes
    unbalanced, i.e. drops below zero, if the kernel functions are patched
    again (as in every further function tracer run), they won't ever be
    nopped back.

    Note that ftrace_call and ftrace_graph_call are still patched back
    to ftrace_stub in the off case, but not the callers of ftrace_call
    and ftrace_graph_caller. It means that the tracing is well deactivated
    but we waste a useless call into every kernel function.

    This patch just unregisters the right ftrace_ops for the function
    tracer on its reset callback and ignores the other one, which is
    not registered, fixing the imbalance. The problem also happens
    in .30.

    Signed-off-by: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: stable@kernel.org

    Frederic Weisbecker
     
  • The bug is ancient.

    If we trace the sub-thread of our natural child and this sub-thread exits,
    we update parent->signal->cxxx fields. But we should not do this until
    the whole thread-group exits, otherwise we account this thread (and all
    other live threads) twice.

    Add the task_detached() check. No need to check thread_group_empty(),
    wait_consider_task()->delay_group_leader() already did this.

    Signed-off-by: Oleg Nesterov
    Cc: Peter Zijlstra
    Acked-by: Roland McGrath
    Cc: Stanislaw Gruszka
    Cc: Vitaly Mayatskikh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

19 Jun, 2009

2 commits

  • perf_lock_task_context() is buggy because it can return a dead
    context.

    The RCU read lock in perf_lock_task_context() only guarantees
    the memory won't get freed; it doesn't guarantee the object is
    valid (in our case, refcount > 0).

    Therefore we can return a locked object that can get freed the
    moment we release the rcu read lock.

    perf_pin_task_context() then increases the refcount and does an
    unlock on freed memory.

    That increased refcount will cause a double free if it
    started out at 0.

    Amend this by including the get_ctx() functionality in
    perf_lock_task_context() (all users already did this later
    anyway), and return a NULL context when the found one is
    already dead.

    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The task migrations counter was causing rare and hard-to-decipher
    memory corruptions under load. After a day of debugging and bisection
    we found that the problem was introduced with:

    3f731ca: perf_counter: Fix cpu migration counter

    Turning them off fixes the crashes. Incidentally, the whole
    perf_counter_task_migration() logic can be done simpler as well,
    by injecting a proper sw-counter event.

    This cleanup also fixed the crashes. The precise failure mode is
    not completely clear yet, but we are clearly not unhappy about
    having a fix ;-)

    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Corey Ashford
    Cc: Marcelo Tosatti
    Cc: Arnaldo Carvalho de Melo
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra