11 Apr, 2010

1 commit

  • When CONFIG_DEBUG_BLOCK_EXT_DEVT is set we decode the device
    improperly by old_decode_dev and it results in an error while
    hibernating with s2disk.

    All users already pass the new device number, so switch to
    new_decode_dev().
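
    As a rough user-space illustration of why the choice of decoder matters,
    the sketch below re-implements the two decoding schemes as recalled from
    include/linux/kdev_t.h. All sk_* names are hypothetical stand-ins, not
    kernel symbols, and the example device number (major 8, minor 300) is an
    assumption chosen only because its minor does not fit in 8 bits.

```c
#include <assert.h>
#include <stdint.h>

/* User-space stand-ins for the kernel's major/minor packing (20 minor bits). */
#define SK_MINORBITS 20
#define SK_MKDEV(ma, mi) (((uint32_t)(ma) << SK_MINORBITS) | (uint32_t)(mi))
#define SK_MAJOR(dev) ((uint32_t)(dev) >> SK_MINORBITS)
#define SK_MINOR(dev) ((uint32_t)(dev) & ((1u << SK_MINORBITS) - 1))

/* New-style encoding: low 8 minor bits, then the major, then the high minor bits. */
uint32_t sk_new_encode_dev(uint32_t major, uint32_t minor)
{
    return (minor & 0xff) | (major << 8) | ((minor & ~0xffu) << 12);
}

/* New-style decoding: the inverse of sk_new_encode_dev(). */
uint32_t sk_new_decode_dev(uint32_t dev)
{
    uint32_t major = (dev & 0xfff00) >> 8;
    uint32_t minor = (dev & 0xff) | ((dev >> 12) & 0xfff00);
    return SK_MKDEV(major, minor);
}

/* Old-style decoding: only understands an 8-bit major and an 8-bit minor. */
uint32_t sk_old_decode_dev(uint16_t val)
{
    return SK_MKDEV((val >> 8) & 255, val & 255);
}
```

    Feeding a new-style number with a large minor through the old decoder
    silently drops the high minor bits, which would point the caller at the
    wrong device.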

    Signed-off-by: Jiri Slaby
    Reported-and-tested-by: Jiri Kosina
    Signed-off-by: "Rafael J. Wysocki"

    Jiri Slaby
     

08 Apr, 2010

1 commit


07 Apr, 2010

2 commits


06 Apr, 2010

5 commits

  • taskset on 2.6.34-rc3 fails on one of my ppc64 test boxes with
    the following error:

    sched_getaffinity(0, 16, 0x10029650030) = -1 EINVAL (Invalid argument)

    This box has 128 threads and 16 bytes is enough to cover it.

    Commit cd3d8031eb4311e516329aee03c79a08333141f1 (sched:
    sched_getaffinity(): Allow less than NR_CPUS length) is
    comparing these 16 bytes against nr_cpu_ids.

    Fix it by comparing nr_cpu_ids to the number of bits in the
    cpumask we pass in.
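
    The unit mismatch can be sketched in user space as follows. The function
    names are hypothetical and the checks are simplified to the comparison
    the message describes; this is not the kernel code itself.

```c
#include <assert.h>
#include <stddef.h>

#define SK_BITS_PER_BYTE 8

/* Regressed check as described: a length in bytes compared to a CPU count. */
int sk_len_ok_buggy(size_t len_bytes, unsigned int nr_cpu_ids)
{
    return len_bytes >= nr_cpu_ids;
}

/* Fixed check as described: the number of bits in the mask compared to the CPU count. */
int sk_len_ok_fixed(size_t len_bytes, unsigned int nr_cpu_ids)
{
    return len_bytes * SK_BITS_PER_BYTE >= nr_cpu_ids;
}
```

    With the reported values (16 bytes, 128 threads) the first check rejects
    the call while the second accepts it, since 16 bytes hold 128 bits.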

    Signed-off-by: Anton Blanchard
    Reviewed-by: KOSAKI Motohiro
    Cc: Sharyathi Nagesh
    Cc: Ulrich Drepper
    Cc: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Jack Steiner
    Cc: Russ Anderson
    Cc: Mike Travis
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Anton Blanchard
     
  • Module refcounting is implemented with a per-cpu counter for speed.
    However there is a race when tallying the counter where a reference may
    be taken by one CPU and released by another. Reference count summation
    may then see the decrement without having seen the previous increment,
    leading to lower than expected count. A module which never has its
    actual reference drop below 1 may return a reference count of 0 due to
    this race.

    Module removal generally runs under stop_machine, which prevents this
    race causing bugs due to removal of in-use modules. However there are
    other real bugs in module.c code and driver code (module_refcount is
    exported) where the callers do not run under stop_machine.

    Fix this by maintaining running per-cpu counters for the number of
    module refcount increments and the number of refcount decrements. The
    increments are tallied after the decrements, so any decrement seen will
    always have its corresponding increment counted. The final refcount is
    the difference of the total increments and decrements, preventing a
    low-refcount from being returned.
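
    A minimal single-threaded sketch of the counting scheme described above,
    with hypothetical names and a fixed CPU count. Real per-cpu counters
    would also need the appropriate memory ordering between the two
    summation loops, which this illustration only marks with a comment.

```c
#include <assert.h>

#define SK_NR_CPUS 4

/* One increment counter and one decrement counter per CPU. */
struct sk_module_ref {
    unsigned long incs[SK_NR_CPUS];
    unsigned long decs[SK_NR_CPUS];
};

/* Take a reference on the given CPU. */
void sk_get(struct sk_module_ref *r, int cpu) { r->incs[cpu]++; }

/* Drop a reference on the given CPU (possibly not the CPU that took it). */
void sk_put(struct sk_module_ref *r, int cpu) { r->decs[cpu]++; }

unsigned long sk_refcount(const struct sk_module_ref *r)
{
    unsigned long incs = 0, decs = 0;
    int cpu;

    /* Tally decrements first ... */
    for (cpu = 0; cpu < SK_NR_CPUS; cpu++)
        decs += r->decs[cpu];

    /* (a concurrent implementation would need a read barrier here) */

    /* ... then increments, so every decrement seen has its increment counted. */
    for (cpu = 0; cpu < SK_NR_CPUS; cpu++)
        incs += r->incs[cpu];

    return incs - decs;
}
```

    Because increments are summed last, a racing reader can at worst
    overestimate the count; it cannot report fewer references than existed.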

    Signed-off-by: Nick Piggin
    Acked-by: Rusty Russell
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • There have been a number of reports of people seeing the message:
    "name_count maxed, losing inode data: dev=00:05, inode=3185"
    in dmesg. These usually lead to people reporting problems to the filesystem
    group who are in turn clueless what they mean.

    Eventually someone finds me and I explain what is going on and that
    these come from the audit system. The basic problem is that the
    audit subsystem never expects a single syscall to 'interact' (for some
    wishy-washy meaning of interact) with more than 20 inodes. But in fact
    some operations like loading kernel modules can cause changes to lots of
    inodes in debugfs.

    There are a couple real fixes being bandied about including removing the
    fixed compile time limit of 20 or not auditing changes in debugfs (or
    both) but neither is small and obvious so I am not sending them for
    immediate inclusion (I hope Al forwards a real solution next devel
    window).

    In the meantime this patch simply adds 'audit' to the beginning of the
    crap message so if a user sees it, they come blame me first and we can
    talk about what it means and make sure we understand all of the reasons
    it can happen and make sure this gets solved correctly in the long run.

    Signed-off-by: Eric Paris
    Signed-off-by: Linus Torvalds

    Eric Paris
     
  • * 'slabh' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc:
    eeepc-wmi: include slab.h
    staging/otus: include slab.h from usbdrv.h
    percpu: don't implicitly include slab.h from percpu.h
    kmemcheck: Fix build errors due to missing slab.h
    include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
    iwlwifi: don't include iwl-dev.h from iwl-devtrace.h
    x86: don't include slab.h from arch/x86/include/asm/pgtable_32.h

    Fix up trivial conflicts in include/linux/percpu.h due to
    is_kernel_percpu_address() having been introduced since the slab.h
    cleanup with the percpu_up.c splitup.

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
    module: add stub for is_module_percpu_address
    percpu, module: implement and use is_kernel/module_percpu_address()
    module: encapsulate percpu handling better and record percpu_size

    Linus Torvalds
     

05 Apr, 2010

4 commits


03 Apr, 2010

11 commits


01 Apr, 2010

2 commits

  • Scheduler's task migration events don't work because they always
    pass NULL regs to perf_sw_event(). The event hence gets filtered
    in perf_swevent_add().

    Scheduler's context switch events use task_pt_regs() to get
    the context when the event occurred, which is the wrong thing to
    do as this won't give us the place in the kernel where we went
    to sleep but the place where we left userspace. The result is
    even more wrong if we switch from a kernel thread.

    Use the hot regs snapshot for both events as they belong to the
    non-interrupt/exception based events family. Unlike page faults
    or so that provide the regs matching the exact origin of the event,
    we need to save the current context.

    This makes the task migration event work and fixes the context
    switch callchains and origin ip.

    Example: perf record -a -e cs

    Before:

    10.91% ksoftirqd/0 0 [k] 0000000000000000
    |
    --- (nil)
    perf_callchain
    perf_prepare_sample
    __perf_event_overflow
    perf_swevent_overflow
    perf_swevent_add
    perf_swevent_ctx_event
    do_perf_sw_event
    __perf_sw_event
    perf_event_task_sched_out
    schedule
    run_ksoftirqd
    kthread
    kernel_thread_helper

    After:

    23.77% hald-addon-stor [kernel.kallsyms] [k] schedule
    |
    --- schedule
    |
    |--60.00%-- schedule_timeout
    | wait_for_common
    | wait_for_completion
    | blk_execute_rq
    | scsi_execute
    | scsi_execute_req
    | sr_test_unit_ready
    | |
    | |--66.67%-- sr_media_change
    | | media_changed
    | | cdrom_media_changed
    | | sr_block_media_changed
    | | check_disk_change
    | | cdrom_open

    v2: Always build perf_arch_fetch_caller_regs() now that software
    events need that too. They don't need it from modules, unlike trace
    events, so we keep the EXPORT_SYMBOL in trace_event_perf.c

    Signed-off-by: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: David Miller

    Frederic Weisbecker
     
  • The trace event buffer used by perf to record raw sample events
    is typed as an array of char and may then not be aligned to 8
    by alloc_percpu().

    But we need it to be aligned to 8 on sparc64 because we cast
    this buffer into an arbitrary structure type built by the TRACE_EVENT()
    macro to store the traces. So if a 64-bit field is accessed
    inside, it may not be properly aligned.

    Use an array of long instead to force the appropriate alignment, and
    perform a compile time check to ensure the size in bytes of the buffer
    is a multiple of sizeof(long) so that its actual size doesn't get
    shrunk under us.
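
    The idea can be sketched in plain C11 as below. The buffer size of 2048
    bytes and the sk_* names are assumptions for illustration, and
    _Static_assert stands in for whatever compile-time check the kernel
    patch actually uses.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed buffer size in bytes, for illustration only. */
#define SK_TRACE_BUF_SIZE 2048

/* Compile-time check: the byte size must divide evenly into longs,
 * otherwise the array below would be silently smaller than requested. */
_Static_assert(SK_TRACE_BUF_SIZE % sizeof(long) == 0,
               "trace buffer size must be a multiple of sizeof(long)");

/* An array of long instead of char: same size, but long alignment. */
typedef long sk_trace_buf_t[SK_TRACE_BUF_SIZE / sizeof(long)];

/* Size of the buffer in bytes, unchanged by the type switch. */
size_t sk_trace_buf_bytes(void)
{
    return sizeof(sk_trace_buf_t);
}
```

    On a 64-bit target where long is 8 bytes wide, this gives the 8-byte
    alignment the message asks for while keeping the byte size identical.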

    This fixes unaligned accesses reported while using perf lock
    on sparc64.

    Suggested-by: David Miller
    Suggested-by: Tejun Heo
    Signed-off-by: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: David Miller
    Cc: Steven Rostedt

    Frederic Weisbecker
     

31 Mar, 2010

1 commit

  • Network folks reported that directing all MSI-X vectors of their multi
    queue NICs to a single core can cause interrupt stack overflows when
    enough interrupts fire at the same time.

    This is caused by the fact that we run interrupt handlers by default
    with interrupts enabled unless the driver requests the interrupt with
    the IRQF_DISABLED flag set. The NIC handlers do not set this flag, so
    simultaneous interrupts can nest without limit and cause the stack
    overflow.

    The only safe countermeasure is to run the interrupt handlers with
    interrupts disabled. We can't switch to this mode in general right
    now, but it is safe to do so for MSI interrupts.

    Force IRQF_DISABLED for MSI interrupt handlers.

    Signed-off-by: Thomas Gleixner
    Cc: Andi Kleen
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Alan Cox
    Cc: David Miller
    Cc: Greg Kroah-Hartman
    Cc: Arnaldo Carvalho de Melo
    Cc: stable@kernel.org

    Thomas Gleixner
     

30 Mar, 2010

9 commits

  • …s/security-testing-2.6

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6:
    CRED: Fix memory leak in error handling

    Linus Torvalds
     
  • …git/tip/linux-2.6-tip

    * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86: Do not free zero sized per cpu areas
    x86: Make sure free_init_pages() frees pages on page boundary
    x86: Make smp_locks end with page alignment

    Linus Torvalds
     
  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    The percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include those
    headers directly instead of assuming availability. As this conversion
    needs to touch a large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     
  • Fix a memory leak on an OOM condition in prepare_usermodehelper_creds().

    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: David Howells
    Signed-off-by: James Morris

    Mathieu Desnoyers
     
  • In some error handling cases the lock is not unlocked. The return is
    converted to a goto, to share the unlock at the end of the function.
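
    The pattern being applied looks roughly like the sketch below. The lock
    type and function names are hypothetical stand-ins (a plain flag rather
    than a real kernel lock), and the function body is invented only to show
    the shape of the fix.

```c
#include <assert.h>

/* Toy lock: a flag that records whether the lock is currently held. */
struct sk_lock { int held; };

void sk_lock_acquire(struct sk_lock *l) { assert(!l->held); l->held = 1; }
void sk_lock_release(struct sk_lock *l) { assert(l->held); l->held = 0; }

/* Store value into *slot under the lock; negative values are rejected. */
int sk_update(struct sk_lock *l, int *slot, int value)
{
    int ret = 0;

    sk_lock_acquire(l);

    if (value < 0) {
        ret = -1;   /* a bare "return -1;" here would leak the lock */
        goto out;
    }

    *slot = value;

out:
    sk_lock_release(l);   /* single unlock shared by all exit paths */
    return ret;
}
```

    Routing the error path through the shared label means the lock is
    released no matter which branch the function takes.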

    A simplified version of the semantic patch that finds this problem is as
    follows: (http://coccinelle.lip6.fr/)

    //
    @r exists@
    expression E1;
    identifier f;
    @@

    f (...) { }
    //

    Signed-off-by: Julia Lawall
    LKML-Reference:
    Signed-off-by: Steven Rostedt

    Julia Lawall
     
  • # echo 1 > events/enable
    # echo global > trace_clock

    ------------[ cut here ]------------
    WARNING: at kernel/lockdep.c:3162 check_flags+0xb2/0x190()
    ...
    ---[ end trace 3f86734a89416623 ]---
    possible reason: unannotated irqs-on.
    ...

    There's no reason to use the raw_local_irq_save() in trace_clock_global.
    The local_irq_save() version is fine, and does not cause the bug in lockdep.

    Acked-by: Peter Zijlstra
    Signed-off-by: Li Zefan
    LKML-Reference:
    Signed-off-by: Steven Rostedt

    Li Zefan
     
  • This avoids an infinite loop in free_early_partial().

    Add a warning to free_early_partial() to catch future problems.

    -v5: put start > end back into WARN_ONCE()
    -v6: use one line for warning, suggested by Linus
    -v7: more tests
    -v8: remove the function name as suggested by Johannes
    WARN_ONCE() will print out that function name.

    Signed-off-by: Ian Campbell
    Signed-off-by: Yinghai Lu
    Tested-by: Konrad Rzeszutek Wilk
    Tested-by: Joel Becker
    Tested-by: Stanislaw Gruszka
    Acked-by: Johannes Weiner
    Cc: Peter Zijlstra
    Cc: David Miller
    Cc: Benjamin Herrenschmidt
    Cc: Linus Torvalds
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ian Campbell
     
  • CONFIG_SLOW_WORK_PROC was changed to CONFIG_SLOW_WORK_DEBUG, but not in all
    instances. Change the remaining instances. This makes the debugfs file
    display the time mark and the owner's description again.

    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Otherwise we can get an oops if the user has no get_ref/put_ref
    requirement.

    Signed-off-by: Dave Airlie
    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    Dave Airlie
     

29 Mar, 2010

2 commits

  • lockdep has custom code to check whether a pointer belongs to static
    percpu area which is somewhat broken. Implement proper
    is_kernel/module_percpu_address() and replace the custom code.

    On UP, percpu variables are regular static variables and can't be
    distinguished from them. Always return %false on UP.

    Signed-off-by: Tejun Heo
    Acked-by: Peter Zijlstra
    Cc: Rusty Russell
    Cc: Ingo Molnar

    Tejun Heo
     
  • Better encapsulate module static percpu area handling so that code
    outside of CONFIG_SMP ifdef doesn't deal with mod->percpu directly
    and add mod->percpu_size and record percpu_size in it. Both percpu
    fields are compiled out on UP. While at it, mark mod->percpu w/
    __percpu.

    This is to prepare for is_module_percpu_address().

    Signed-off-by: Tejun Heo
    Acked-by: Rusty Russell

    Tejun Heo
     

27 Mar, 2010

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6:
    x86/PCI: truncate _CRS windows with _LEN > _MAX - _MIN + 1
    x86/PCI: for host bridge address space collisions, show conflicting resource
    frv/PCI: remove redundant warnings
    x86/PCI: remove redundant warnings
    PCI: don't say we claimed a resource if we failed
    PCI quirk: Disable MSI on VIA K8T890 systems
    PCI quirk: RS780/RS880: work around missing MSI initialization
    PCI quirk: only apply CX700 PCI bus parking quirk if external VT6212L is present
    PCI: complain about devices that seem to be broken
    PCI: print resources consistently with %pR
    PCI: make disabled window printk style match the enabled ones
    PCI: break out primary/secondary/subordinate for readability
    PCI: for address space collisions, show conflicting resource
    resources: add interfaces that return conflict information
    PCI: cleanup error return for pcix get and set mmrbc functions
    PCI: fix access of PCI_X_CMD by pcix get and set mmrbc functions
    PCI: kill off pci_register_set_vga_state() symbol export.
    PCI: fix return value from pcix_get_max_mmrbc()

    Linus Torvalds
     
  • When the cgroup freezer is used to freeze tasks we do not want to thaw
    those tasks during resume. Currently we test the cgroup freezer
    state of the resuming tasks to see if the cgroup is FROZEN. If so
    then we don't thaw the task. However, the FREEZING state also indicates
    that the task should remain frozen.

    This also avoids a problem pointed out by Oren Ladaan: the freezer state
    transition from FREEZING to FROZEN is updated lazily when userspace reads
    or writes the freezer.state file in the cgroup filesystem. This means that
    resume will thaw tasks in cgroups which should be in the FROZEN state if
    there is no read/write of the freezer.state file to trigger this
    transition before suspend.

    NOTE: Another "simple" solution would be to always update the cgroup
    freezer state during resume. However it's a bad choice for several reasons:
    Updating the cgroup freezer state is somewhat expensive because it requires
    walking all the tasks in the cgroup and checking if they are each frozen.
    Worse, this could easily make resume run in N^2 time where N is the number
    of tasks in the cgroup. Finally, updating the freezer state from this code
    path requires trickier locking because of the way locks must be ordered.

    Instead of updating the freezer state we rely on the fact that lazy
    updates only manage the transition from FREEZING to FROZEN. We know that
    a cgroup with the FREEZING state may actually be FROZEN so test for that
    state too. This makes sense in the resume path even for partially-frozen
    cgroups -- those that really are FREEZING but not FROZEN.
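
    The change in the resume-path test can be sketched as follows. The enum
    and function names are hypothetical; only the three state names come
    from the message, and the real code operates on cgroup freezer
    structures rather than a bare enum.

```c
#include <assert.h>
#include <stdbool.h>

/* The freezer states named in the message, as an illustrative enum. */
enum sk_freezer_state { SK_THAWED, SK_FREEZING, SK_FROZEN };

/* Old test: only a cgroup already marked FROZEN keeps its tasks frozen. */
bool sk_skip_thaw_old(enum sk_freezer_state s)
{
    return s == SK_FROZEN;
}

/* New test: FREEZING may really be FROZEN (lazy update), so skip it too. */
bool sk_skip_thaw_new(enum sk_freezer_state s)
{
    return s == SK_FROZEN || s == SK_FREEZING;
}
```

    The only behavioural difference is for cgroups still recorded as
    FREEZING, whose tasks are now left alone during resume.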

    Reported-by: Oren Ladaan
    Signed-off-by: Matt Helsley
    Cc: stable@kernel.org
    Signed-off-by: Rafael J. Wysocki

    Matt Helsley