29 Jun, 2009

3 commits

  • …git/tip/linux-2.6-tip

    * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86, delay: tsc based udelay should have rdtsc_barrier
    x86, setup: correct include file in <asm/boot.h>
    x86, setup: Fix typo "CONFIG_x86_64" in <asm/boot.h>
    x86, mce: percpu mcheck_timer should be pinned
    x86: Add sysctl to allow panic on IOCK NMI error
    x86: Fix uv bau sending buffer initialization
    x86, mce: Fix mce resume on 32bit
    x86: Move init_gbpages() to setup_arch()
    x86: ensure percpu lpage doesn't consume too much vmalloc space
    x86: implement percpu_alloc kernel parameter
    x86: fix pageattr handling for lpage percpu allocator and re-enable it
    x86: reorganize cpa_process_alias()
    x86: prepare setup_pcpu_lpage() for pageattr fix
    x86: rename remap percpu first chunk allocator to lpage
    x86: fix duplicate free in setup_pcpu_remap() failure path
    percpu: fix too lazy vunmap cache flushing
    x86: Set cpu_llc_id on AMD CPUs

    Linus Torvalds
     
  • …el/git/tip/linux-2.6-tip

    * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    timer stats: Optimize by adding quick check to avoid function calls
    timers: Fix timer_migration interface which accepts any number as input

    Linus Torvalds
     
  • …nel/git/tip/linux-2.6-tip

    * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    ftrace: Fix the output of profile
    ring-buffer: Make it generally available
    ftrace: Remove duplicate newline
    tracing: Fix trace_buf_size boot option
    ftrace: Fix t_hash_start()
    ftrace: Don't manipulate @pos in t_start()
    ftrace: Don't increment @pos in g_start()
    tracing: Reset iterator in t_start()
    trace_stat: Don't increment @pos in seq start()
    tracing_bprintk: Don't increment @pos in t_start()
    tracing/events: Don't increment @pos in s_start()

    Linus Torvalds
     

26 Jun, 2009

2 commits

  • The first entry of the ftrace profile was always skipped when
    reading trace_stat/functionX.

    Signed-off-by: Li Zefan
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Li Zefan
     
  • This patch introduces a new sysctl:

    /proc/sys/kernel/panic_on_io_nmi

    which defaults to 0 (off).

    When enabled, the kernel panics when the kernel receives an NMI
    caused by an IO error.

    The IO error triggered NMI indicates a serious system
    condition, which could result in IO data corruption. Rather
    than continuing, panicking and dumping might be a better choice,
    so one can figure out what's causing the IO error.

    This could be especially important to companies running IO
    intensive applications where corruption must be avoided, e.g. a
    bank's databases.

    [ SuSE has been shipping it for a while, it was done at the
    request of a large database vendor, for their users. ]

    Signed-off-by: Kurt Garloff
    Signed-off-by: Roberto Angelino
    Signed-off-by: Greg Kroah-Hartman
    Cc: "Eric W. Biederman"
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Kurt Garloff
     

25 Jun, 2009

6 commits

  • Yanmin noticed that fault_in_user_writeable() requests 4 pages instead
    of one.

    That's the result of blindly trusting Linus' proposal :) I even looked
    up the prototype to verify the correctness: the argument in question
    is confusingly enough named "len" while in reality it means number of
    pages.

    Pointed-out-by: Yanmin Zhang
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • In hunting down the cause for the hwlat_detector ring buffer spew in
    my failed -next builds it became obvious that folks are now treating
    ring_buffer as something that is generic independent of tracing and thus,
    suitable for public driver consumption.

    Given that there are only a few minor areas in ring_buffer that have any
    reliance on CONFIG_TRACING or CONFIG_FUNCTION_TRACER, provide stubs for
    those and make it generally available.

    Signed-off-by: Paul Mundt
    Cc: Jon Masters
    Cc: Steven Rostedt
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Mundt
     
  • Before:
    # echo 'sys_open:traceon:' > set_ftrace_filter
    # echo 'sys_close:traceoff:5' > set_ftrace_filter
    # cat set_ftrace_filter
    #### all functions enabled ####
    sys_open:traceon:unlimited

    sys_close:traceoff:count=0

    After:
    # cat set_ftrace_filter
    #### all functions enabled ####
    sys_open:traceon:unlimited
    sys_close:traceoff:count=0

    Signed-off-by: Li Zefan
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Li Zefan
     
  • …/{vfs-2.6,audit-current}

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    another race fix in jfs_check_acl()
    Get "no acls for this inode" right, fix shmem breakage
    inline functions left without protection of ifdef (acl)

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current:
    audit: inode watches depend on CONFIG_AUDIT not CONFIG_AUDIT_SYSCALL

    Linus Torvalds
     
  • Even though one cannot make use of the audit watch code without
    CONFIG_AUDIT_SYSCALL the spaghetti nature of the audit code means that
    the audit rule filtering requires that it at least be compiled.

    Thus build the audit_watch code when we build auditfilter like it was
    before cfcad62c74abfef83762dc05a556d21bdf3980a2

    Clearly this is a point of potential future cleanup.

    Reported-by: Frans Pop
    Signed-off-by: Eric Paris
    Signed-off-by: Al Viro

    Eric Paris
     
  • commit 64d1304a64 (futex: setup writeable mapping for futex ops which
    modify user space data) did address only half of the problem of write
    access faults.

    The patch was made on two wrong assumptions:

    1) access_ok(VERIFY_WRITE,...) would actually check write access.

    On x86 it does _NOT_. It's a pure address range check.

    2) a RW mapped region can not go away under us.

    That's wrong as well. Nobody can prevent another thread from calling
    mprotect(PROT_READ) on the region where the futex resides. If that
    call hits between the get_user_pages_fast() verification and the
    actual write access in the atomic region we are toast again.

    The solution is to not rely on access_ok and get_user() for any write
    access related fault on private and shared futexes. Instead we need to
    fault it in with verification of write access.

    There is no generic non-destructive write mechanism which would fault
    the user page in through a #PF, but as we already know that we will
    fault we can as well call get_user_pages() directly and avoid the #PF
    overhead.

    If get_user_pages() returns -EFAULT we know that we can not fix it
    anymore and need to bail out to user space.

    Remove a bunch of confusing comments on this issue as well.

    Signed-off-by: Thomas Gleixner
    Cc: stable@kernel.org

    Thomas Gleixner
     

24 Jun, 2009

19 commits

  • We should be able to specify [KMG] when setting trace_buf_size
    boot option, as documented in kernel-parameters.txt

    Signed-off-by: Li Zefan
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Li Zefan
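
A minimal userspace sketch of [KMG] suffix handling, in the spirit of the
kernel's memparse() helper. The function name parse_size() is hypothetical
and this is not the patch itself, only an illustration of how a value such
as "16K" or "2M" can be scaled to bytes:

```c
#include <stdlib.h>

/*
 * Parse a size string with an optional K, M or G suffix (hypothetical
 * helper, loosely modelled on what the kernel's memparse() does).
 */
static unsigned long long parse_size(const char *s)
{
    char *end;
    unsigned long long val = strtoull(s, &end, 0);

    switch (*end) {
    case 'G':
    case 'g':
        val <<= 10;
        /* fall through */
    case 'M':
    case 'm':
        val <<= 10;
        /* fall through */
    case 'K':
    case 'k':
        val <<= 10;
        break;
    default:
        break;
    }
    return val;
}
```

With this sketch, parse_size("16K") yields 16384 and a plain "4096" is
returned unchanged.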
     
  • When the kernel is configured with CONFIG_TIMER_STATS but timer
    stats are runtime disabled we still get calls to
    __timer_stats_timer_set_start_info which initializes some
    fields in the corresponding struct timer_list.

    So add some quick checks in the timer stats setup functions
    to avoid function calls to __timer_stats_timer_set_start_info
    when timer stats are disabled.

    In an artificial workload that does nothing but play ping
    pong with a single tcp packet via loopback this decreases cpu
    consumption by 1 - 1.5%.

    This is part of a modified function trace output on SLES11:

    perl-2497 [00] 28630647177732388 [+ 125]: sk_reset_timer
    Cc: Andrew Morton
    Cc: Martin Schwidefsky
    Cc: Mustafa Mesanovic
    Cc: Arjan van de Ven
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Heiko Carstens
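
The idea of a quick inline check in front of an out-of-line function can be
sketched in userspace as follows. All names here (stats_active,
record_start_info_slow(), record_start_info()) are hypothetical stand-ins,
not the kernel's identifiers:

```c
/* Runtime switch for statistics collection (hypothetical name). */
static int stats_active;

/* Counts how often the out-of-line slow path is actually entered. */
static unsigned long slow_calls;

/* Out-of-line slow path: fills in the bookkeeping field. */
static void record_start_info_slow(int *timer_field)
{
    slow_calls++;
    *timer_field = 1;
}

/*
 * Inline wrapper: when statistics are disabled, return before paying
 * for the function call and the stores it would perform.
 */
static inline void record_start_info(int *timer_field)
{
    if (!stats_active)
        return;
    record_start_info_slow(timer_field);
}
```

Callers always go through record_start_info(); the slow path only runs once
stats_active has been set.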
     
    When the output of set_ftrace_filter is larger than PAGE_SIZE,
    t_hash_start() will be called a 2nd time, and then we start
    from the head of a hlist, which is wrong and causes some entries
    to be output twice.

    Worse, if the hlist is large enough, reading set_ftrace_filter
    never stops and spins in an endless loop.

    Reviewed-by: Liming Wang
    Signed-off-by: Li Zefan
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Li Zefan
     
  • It's rather confusing that in t_start(), in some cases @pos is
    incremented, and in some cases it's decremented and then incremented.

    This patch rewrites t_start() in a much more general way.

    Thus we fix a bug where, if ftrace_filtered == 1, functions that have
    tracer hooks won't be printed, because the branch is always unreachable:

    static void *t_start(...)
    {
            ...
            if (!p)
                    return t_hash_start(m, pos);
            return p;
    }

    Before:
    # echo 'sys_open' > /mnt/tracing/set_ftrace_filter
    # echo 'sys_write:traceon:4' >> /mnt/tracing/set_ftrace_filter
    sys_open

    After:
    # echo 'sys_open' > /mnt/tracing/set_ftrace_filter
    # echo 'sys_write:traceon:4' >> /mnt/tracing/set_ftrace_filter
    sys_open
    sys_write:traceon:count=4

    Reviewed-by: Liming Wang
    Signed-off-by: Li Zefan
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Li Zefan
     
    It's wrong to increment @pos in g_start(). It causes some entries
    to be lost when reading set_graph_function, if the output of the file
    is larger than PAGE_SIZE.

    Reviewed-by: Liming Wang
    Signed-off-by: Li Zefan
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Li Zefan
     
  • The iterator is m->private, but it's not reset to trace_types in
    t_start(). If the output is larger than PAGE_SIZE and t_start()
    is called the 2nd time, things will go wrong.

    Reviewed-by: Liming Wang
    Signed-off-by: Li Zefan
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Li Zefan
     
    It's wrong to increment @pos in stat_seq_start(). It causes some
    stat entries to be lost when reading a stat file, if the output of
    the file is larger than PAGE_SIZE.

    Reviewed-by: Liming Wang
    Signed-off-by: Li Zefan
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Li Zefan
     
  • It's wrong to increment @pos in t_start(), otherwise we'll lose
    some entries when reading printk_formats, if the output is larger
    than PAGE_SIZE.

    Reported-by: Lai Jiangshan
    Reviewed-by: Liming Wang
    Signed-off-by: Li Zefan
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Li Zefan
     
  • While testing syscall tracepoints posted by Jason, I found 3 entries
    were missing when reading available_events. The output size of
    available_events is < 4 pages, which means we lost 1 entry per page.

    The cause is, it's wrong to increment @pos in s_start().

    Actually there's another bug here -- reading available_events/set_events
    can race with module unload:

    # cat available_events           |
    s_start()                        |
    s_stop()                         |
                                     | # rmmod foo.ko
    s_start()                        |
    call = list_entry(m->private)    |

    @call might be freed and accessing it will lead to a crash.

    Reviewed-by: Liming Wang
    Signed-off-by: Li Zefan
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Li Zefan
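
The "one entry lost per page" effect can be reproduced with a small
userspace model of a paged reader. This is a simplified sketch, not the
kernel's seq_file code: NENTRIES, PAGE_ENTRIES and all function names are
hypothetical, and a "page" is modelled as a fixed number of records. The
reader calls start() once per page and next() for every further record; a
start() that also advances the position makes the reader skip a record each
time it is called.

```c
#define NENTRIES     10  /* total records in the hypothetical file */
#define PAGE_ENTRIES 3   /* records that fit into one output "page" */

typedef int (*start_fn)(long *pos);

/* Correct start(): return the record at *pos and leave *pos alone. */
static int start_ok(long *pos)
{
    return *pos < NENTRIES ? (int)*pos : -1;
}

/* Buggy start(): also advances *pos, which is the job of next(). */
static int start_buggy(long *pos)
{
    int rec = *pos < NENTRIES ? (int)*pos : -1;

    (*pos)++;
    return rec;
}

/* next(): advance *pos and return the record there, or -1 at the end. */
static int next_rec(long *pos)
{
    (*pos)++;
    return *pos < NENTRIES ? (int)*pos : -1;
}

/* Read everything page by page; returns the number of records seen. */
static int read_all(start_fn start, int *out)
{
    long index = 0;
    int n = 0;

    for (;;) {
        long pos = index;
        int rec = start(&pos);
        int shown = 0;

        if (rec < 0)
            return n;
        out[n++] = rec;
        shown++;

        while (shown < PAGE_ENTRIES) {
            long nxt = pos;

            rec = next_rec(&nxt);
            if (rec < 0)
                return n;
            out[n++] = rec;
            shown++;
            pos = nxt;
        }
        index = pos + 1;  /* resume after the last record shown */
    }
}
```

In this model read_all(start_ok, out) visits all 10 records, while
read_all(start_buggy, out) visits only 7: one record is skipped for each of
the three start() calls that returned data.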
     
    If a syscall removes the root of a subtree being watched, we
    definitely do not want the rules referring to that subtree
    to be destroyed without the syscall in question having
    a chance to match them.

    Signed-off-by: Al Viro

    Al Viro
     
    In a number of places in the audit system we send an op= followed by a string
    that includes spaces. Somehow this works but it's just wrong. This patch
    moves all of those that I could find to be quoted.

    Example:

    Change From: type=CONFIG_CHANGE msg=audit(1244666690.117:31): auid=0 ses=1
    subj=unconfined_u:unconfined_r:auditctl_t:s0-s0:c0.c1023 op=remove rule
    key="number2" list=4 res=0

    Change To: type=CONFIG_CHANGE msg=audit(1244666690.117:31): auid=0 ses=1
    subj=unconfined_u:unconfined_r:auditctl_t:s0-s0:c0.c1023 op="remove rule"
    key="number2" list=4 res=0

    Signed-off-by: Eric Paris

    Eric Paris
     
  • audit_get_nd() is only used by audit_watch and could be more cleanly
    implemented by having the audit watch functions call it when needed rather
    than making the generic audit rule parsing code deal with those objects.

    Signed-off-by: Eric Paris

    Eric Paris
     
    In preparation for converting audit to use fsnotify instead of inotify we
    separate the inode watching code into its own file. This is similar to
    how the audit tree watching code is already separated into audit_tree.c

    Signed-off-by: Eric Paris

    Eric Paris
     
    It is hard to tell what audit_receive_skb is doing to the netlink
    message. Clean the function up so it is easy and clear to see what is going
    on.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • The audit handling of netlink messages is all over the place. Clean things
    up, use predetermined macros, generally make it more readable.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • Remove code duplication of skb printk when auditd is not around in userspace
    to deal with this message.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • audit_update_watch() runs all of the rules for a given watch and duplicates
    them, attaches a new watch to them, and then when it finishes that process
    and has called free on all of the old rules (ok maybe still inside the rcu
    grace period) it proceeds to use the last element from list_for_each_entry_safe()
    as if it were a krule rather than being the audit_watch which was anchoring
    the list to output a message about audit rules changing.

    This patch unifies the audit message from two different places into a helper
    function and calls it from the correct location in audit_update_rules(). We
    will now get an audit message about the config changing for each rule (with
    each rule's filterkey) rather than the previous garbage.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • The audit execve record splitting code estimates the length of the message
    generated. But it forgot to include the "" that wrap each string in its
    estimation. This means that execve messages with lots of tiny (1-2 byte)
    arguments could still cause records greater than 8k to be emitted. Simply
    fix the estimate.

    Signed-off-by: Eric Paris

    Eric Paris
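
A small sketch of why the two quote characters matter for short arguments.
The field layout ` aN="value"` and the helper names are assumptions made for
illustration; the real audit code has more cases (for example hex-encoded
arguments) that are not modelled here:

```c
#include <stdio.h>
#include <string.h>

/* Actual length of one field in the assumed format: ` aN="value"`. */
static size_t actual_len(int idx, const char *arg)
{
    return (size_t)snprintf(NULL, 0, " a%d=\"%s\"", idx, arg);
}

/* Old-style estimate: the prefix plus the string, quotes forgotten. */
static size_t estimate_old(int idx, const char *arg)
{
    return (size_t)snprintf(NULL, 0, " a%d=", idx) + strlen(arg);
}

/* Fixed estimate: add the two quote characters wrapping the string. */
static size_t estimate_fixed(int idx, const char *arg)
{
    return estimate_old(idx, arg) + 2;
}
```

For a one-byte argument the old estimate is short by 2 bytes out of 7, so a
record built from many tiny arguments can noticeably exceed the budget that
the estimate was supposed to enforce.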
     
  • When an audit watch is added to a parent the temporary watch inside the
    original krule from userspace is freed. Yet the original watch is used after
    the real watch was created in audit_add_rules()

    Signed-off-by: Eric Paris

    Eric Paris
     

23 Jun, 2009

2 commits

  • Poornima Nayak reported:

    | Timer migration interface /proc/sys/kernel/timer_migration in
    | 2.6.30-git9 accepts any numerical value as input.
    |
    | Steps to reproduce:
    | 1. echo -6666666 > /proc/sys/kernel/timer_migration
    | 2. cat /proc/sys/kernel/timer_migration
    | -6666666
    |
    | 1. echo 44444444444444444444444444444444444444444444444444444444444 > /proc/sys/kernel/timer_migration
    | 2. cat /proc/sys/kernel/timer_migration
    | -1357789412
    |
    | Expected behavior: Should 'echo: write error: Invalid argument' while
    | setting any value other than 0 & 1

    Restrict valid values to 0 and 1.

    Reported-by: Poornima Nayak
    Tested-by: Poornima Nayak
    Signed-off-by: Arun R Bharadwaj
    Cc: poornima nayak
    Cc: Arun Bharadwaj
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Arun R Bharadwaj
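
The requested behavior can be sketched in userspace as a parser that accepts
only 0 and 1 and otherwise leaves the old value untouched. The helper name
set_bool_knob() is hypothetical; in the kernel, range limits on integer
sysctls are commonly enforced with proc_dointvec_minmax(), and whether this
patch does exactly that is not shown in the log above.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Parse input as a decimal integer and store it in *knob only if it is
 * 0 or 1. Returns 0 on success and -EINVAL otherwise (hypothetical
 * helper, not kernel code).
 */
static int set_bool_knob(int *knob, const char *input)
{
    char *end;
    long val;

    errno = 0;
    val = strtol(input, &end, 10);
    if (end == input || errno == ERANGE)
        return -EINVAL;

    /* Allow trailing whitespace such as the newline added by echo. */
    while (*end == ' ' || *end == '\n')
        end++;
    if (*end != '\0')
        return -EINVAL;

    if (val < 0 || val > 1)
        return -EINVAL;

    *knob = (int)val;
    return 0;
}
```

Both inputs from the report, -6666666 and the very long digit string, are
rejected with -EINVAL here and the previous setting is kept.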
     
  • SLAB uses get/put_online_cpus() which use a mutex which is itself only
    initialized when cpu_hotplug_init() is called. Currently we hang during
    boot in SLAB due to doing that too late.

    Reported by James Bottomley and Sachin Sant (and possibly others).
    Debugged by Benjamin Herrenschmidt.

    This just removes the dynamic initialization of the data structures, and
    replaces it with a static one, avoiding this dependency entirely, and
    removing one unnecessary special initcall.

    Tested-by: Sachin Sant
    Tested-by: James Bottomley
    Tested-by: Benjamin Herrenschmidt
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

21 Jun, 2009

6 commits

  • …git/tip/linux-2.6-tip

    * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    genirq, irq.h: Fix kernel-doc warnings
    genirq: fix comment to say IRQ_WAKE_THREAD

    Linus Torvalds
     
  • …x/kernel/git/tip/linux-2.6-tip

    * 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (49 commits)
    perfcounter: Handle some IO return values
    perf_counter: Push perf_sample_data through the swcounter code
    perf_counter tools: Define and use our own u64, s64 etc. definitions
    perf_counter: Close race in perf_lock_task_context()
    perf_counter, x86: Improve interactions with fast-gup
    perf_counter: Simplify and fix task migration counting
    perf_counter tools: Add a data file header
    perf_counter: Update userspace callchain sampling uses
    perf_counter: Make callchain samples extensible
    perf report: Filter to parent set by default
    perf_counter tools: Handle lost events
    perf_counter: Add event overlow handling
    fs: Provide empty .set_page_dirty() aop for anon inodes
    perf_counter: tools: Makefile tweaks for 64-bit powerpc
    perf_counter: powerpc: Add processor back-end for MPC7450 family
    perf_counter: powerpc: Make powerpc perf_counter code safe for 32-bit kernels
    perf_counter: powerpc: Change how processor-specific back-ends get selected
    perf_counter: powerpc: Use unsigned long for register and constraint values
    perf_counter: powerpc: Enable use of software counters on 32-bit powerpc
    perf_counter tools: Add and use isprint()
    ...

    Linus Torvalds
     
  • …l/git/tip/linux-2.6-tip

    * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    sched: Fix out of scope variable access in sched_slice()
    sched: Hide runqueues from direct refer at source code level
    sched: Remove unneeded __ref tag
    sched, x86: Fix cpufreq + sched_clock() TSC scaling

    Linus Torvalds
     
  • …nel/git/tip/linux-2.6-tip

    * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (24 commits)
    tracing/urgent: warn in case of ftrace_start_up inbalance
    tracing/urgent: fix unbalanced ftrace_start_up
    function-graph: add stack frame test
    function-graph: disable when both x86_32 and optimize for size are configured
    ring-buffer: have benchmark test print to trace buffer
    ring-buffer: do not grab locks in nmi
    ring-buffer: add locks around rb_per_cpu_empty
    ring-buffer: check for less than two in size allocation
    ring-buffer: remove useless compile check for buffer_page size
    ring-buffer: remove useless warn on check
    ring-buffer: use BUF_PAGE_HDR_SIZE in calculating index
    tracing: update sample event documentation
    tracing/filters: fix race between filter setting and module unload
    tracing/filters: free filter_string in destroy_preds()
    ring-buffer: use commit counters for commit pointer accounting
    ring-buffer: remove unused variable
    ring-buffer: have benchmark test handle discarded events
    ring-buffer: prevent adding write in discarded area
    tracing/filters: strloc should be unsigned short
    tracing/filters: operand can be negative
    ...

    Fix up kmemcheck-induced conflict in kernel/trace/ring_buffer.c manually

    Linus Torvalds
     
  • …el/git/tip/linux-2.6-tip

    * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    NOHZ: Properly feed cpufreq ondemand governor

    Linus Torvalds
     
  • …/git/rostedt/linux-2.6-trace into tracing/urgent

    Ingo Molnar
     

20 Jun, 2009

2 commits