09 Jun, 2009

2 commits

  • These are defined as static cpumask_var_t so if MAXSMP is not used,
    they are cleared already. Avoid surprises when MAXSMP is enabled.

    Signed-off-by: Yinghai Lu
    Signed-off-by: Rusty Russell

    Yinghai Lu
     
  • Our async work synchronization was broken by "async: make sure
    independent async domains can't accidentally entangle" (commit
    d5a877e8dd409d8c702986d06485c374b705d340), because it would report
    the wrong lowest active async ID when there was both running and
    pending async work.

    This caused things like not being able to read the root filesystem,
    resulting in missing console devices and inability to run 'init',
    causing a boot-time panic.

    This fixes it by properly returning the lowest pending async ID: if
    there is any running async work, that will have a lower ID than any
    pending work, and we should _not_ look at the pending work list.

    There were alternative patches from Jaswinder and James, but this one
    also cleans up the code by removing the pointless 'ret' variable and
    the unnecessary testing for an empty list around 'for_each_entry()' (if
    the list is empty, the for_each_entry() thing just won't execute).

    Fixes-bug: http://bugzilla.kernel.org/show_bug.cgi?id=13474
    Reported-and-tested-by: Chris Clayton
    Cc: Jaswinder Singh Rajput
    Cc: James Bottomley
    Cc: Arjan van de Ven
    Signed-off-by: Linus Torvalds

    Linus Torvalds
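
The rule described above (running work first, then pending work) can be sketched in user-space C. This is only an illustration under stated assumptions: the arrays stand in for the kernel's lists, and `struct async_item`, `sketch_cookie`, and `lowest_in_progress_sketch()` are hypothetical names, not the kernel's code.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long long sketch_cookie;

/* Hypothetical stand-in for one queued async work item. */
struct async_item {
    sketch_cookie cookie; /* monotonically increasing ID */
    int domain;           /* domain the item belongs to */
};

/*
 * Return the lowest cookie still in progress for `domain`.
 * `running` holds started items of `domain`, sorted by cookie.
 * `pending` holds not-yet-started items of all domains, sorted by cookie.
 * `next_cookie` is returned when nothing is in progress.
 */
sketch_cookie lowest_in_progress_sketch(const struct async_item *running,
                                        size_t n_running,
                                        const struct async_item *pending,
                                        size_t n_pending,
                                        int domain, sketch_cookie next_cookie)
{
    size_t i;

    assert(n_running == 0 || running != NULL);
    assert(n_pending == 0 || pending != NULL);

    /* Running work has a lower ID than any pending work: stop here. */
    if (n_running > 0)
        return running[0].cookie;

    /* No empty-list test needed: the loop simply does not execute. */
    for (i = 0; i < n_pending; i++)
        if (pending[i].domain == domain)
            return pending[i].cookie;

    return next_cookie;
}
```

With one running item of the domain, the pending array is never consulted; with none, the first pending item of the matching domain decides the result.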
     

05 Jun, 2009

2 commits

  • Commit 95a3540da9c81a5987be810e1d9a83640a366bd5 ("ptrace_detach: the wrong
    wakeup breaks the ERESTARTxxx logic") removed the "extra"
    wake_up_process() from ptrace_detach(), but as Jan pointed out this breaks
    the compatibility.

    I believe the changelog is right and this wake_up() is wrong in many
    ways, but GDB assumes that ptrace(PTRACE_DETACH, child, 0, 0) always
    wakes up the tracee.

    GDB relies on this even though the wakeup breaks the
    SIGNAL_STOP_STOPPED/group_stop_count logic, and even though this
    wake_up_process() can break another assumption: PTRACE_DETACH with
    SIGSTOP should leave the tracee in TASK_STOPPED state. It can break
    because the untraced child can dequeue SIGSTOP and call
    do_signal_stop() before ptrace_detach() calls wake_up_process().

    Revert this change for now. We need some fixes even if we want to keep
    the current behaviour, but these fixes are not for 2.6.30.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Jan Kratochvil
    Cc: Denys Vlasenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The "trace || CLONE_PTRACE" check in tracehook_report_clone() is not right:

    - If the untraced task does clone(CLONE_PTRACE) the new child is not traced,
    we must not queue SIGSTOP.

    - If we forked the traced task, but the tracer exits and untraces both the
    forking task and the new child (after copy_process() drops tasklist_lock),
    we should not queue SIGSTOP too.

    Change the code to check task_ptrace() != 0 instead. This is still racy, but
    the race is harmless.

    We can race with another tracer attaching to this child, or the tracer can
    exit and detach in parallel. But given that we didn't do wake_up_new_task()
    yet, the child must have the pending SIGSTOP anyway.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Christoph Hellwig
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

27 May, 2009

1 commit


25 May, 2009

3 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
    PM: Do not hold dpm_list_mtx while disabling/enabling nonboot CPUs

    Linus Torvalds
     
  • The problem occurs when async_synchronize_full_domain() is called when
    the async_pending list is not empty. This will cause lowest_running()
    to return the cookie of the first entry on the async_pending list, which
    might be nothing at all to do with the domain being asked for and thus
    cause the domain synchronization to wait for an unrelated domain. This
    can cause a deadlock if domain synchronization is used from one domain
    to wait for another.

    Fix by running over the async_pending list to see if any pending items
    actually belong to our domain (and return their cookies if they do).

    Signed-off-by: James Bottomley
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Linus Torvalds

    James Bottomley
     
  • We shouldn't hold dpm_list_mtx while executing
    [disable|enable]_nonboot_cpus(), because theoretically this may lead
    to a deadlock as shown by the following example (provided by Johannes
    Berg):

    CPU 3           CPU 2                       CPU 1
                                                suspend/hibernate
                    something:
                    rtnl_lock()                 device_pm_lock()
                                                -> mutex_lock(&dpm_list_mtx)

                    mutex_lock(&dpm_list_mtx)

    linkwatch_work
    -> rtnl_lock()
                                                disable_nonboot_cpus()
                                                -> flush CPU 3 workqueue

    Fortunately, device drivers are supposed to stop any activities that
    might lead to the registration of new device objects way before
    disable_nonboot_cpus() is called, so it shouldn't be necessary to
    hold dpm_list_mtx over the entire late part of device suspend and
    early part of device resume.

    Thus, during the late suspend and the early resume of devices acquire
    dpm_list_mtx only when dpm_list is going to be traversed and release
    it right after that.

    This patch is reported to fix the regressions tracked as
    http://bugzilla.kernel.org/show_bug.cgi?id=13245.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Alan Stern
    Reported-by: Miles Lane
    Tested-by: Ming Lei

    Rafael J. Wysocki
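
The narrowed lock scope described above can be sketched single-threaded. This is an assumption-laden illustration, not the kernel code: a boolean flag stands in for dpm_list_mtx, and `late_suspend_sketch()`, `traverse_device_list()`, and `offline_nonboot_cpus()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for dpm_list_mtx: a flag instead of a real mutex. */
bool list_lock_held;
int devices_visited;
bool lock_held_during_cpu_offline;

static void list_lock(void)
{
    assert(!list_lock_held);
    list_lock_held = true;
}

static void list_unlock(void)
{
    assert(list_lock_held);
    list_lock_held = false;
}

/* Stand-in for walking dpm_list; must run with the lock held. */
static void traverse_device_list(int n_devices)
{
    int i;

    assert(list_lock_held);
    for (i = 0; i < n_devices; i++)
        devices_visited++;
}

/* Stand-in for disable_nonboot_cpus(): records whether the lock was held. */
static void offline_nonboot_cpus(void)
{
    lock_held_during_cpu_offline = list_lock_held;
}

/* Late suspend: take the lock only around the list traversal. */
void late_suspend_sketch(int n_devices)
{
    list_lock();
    traverse_device_list(n_devices);
    list_unlock(); /* released before the CPUs go offline */

    offline_nonboot_cpus();
}
```

Because the lock is dropped before the CPU offlining step, a workqueue flush triggered there cannot end up waiting on a task that is itself blocked on the list lock.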
     

20 May, 2009

1 commit

  • The futex code installs a read only mapping via get_user_pages_fast()
    even if the futex op function has to modify user space data. The
    eventual fault was fixed up by futex_handle_fault() which walked the
    VMA with mmap_sem held.

    After the cleanup patches which removed the mmap_sem dependency of the
    futex code, commit 4dc5b7a36a49eff97050894cf1b3a9a02523717 (futex:
    clean up fault logic) removed the private VMA walk logic from the
    futex code. This change results in a stale RO mapping which is not
    fixed up.

    Instead of reintroducing the previous fault logic we set up the
    mapping in get_user_pages_fast() read/write for all operations which
    modify user space data. Also handle private futexes in the same way
    and make the current unconditional access_ok(VERIFY_WRITE) depend on
    the futex op.

    Reported-by: Andreas Schwab
    Signed-off-by: Thomas Gleixner
    CC: stable@kernel.org

    Thomas Gleixner
     

19 May, 2009

2 commits


18 May, 2009

1 commit


17 May, 2009

1 commit

  • Ian Campbell noticed that since "Eliminate thousands of warnings with
    gcc 3.2 build" (commit 57adc4d2dbf968fdbe516359688094eef4d46581) all
    WARN_ON()'s currently appear to come from warn_slowpath_null(), eg:

    WARNING: at kernel/softirq.c:143 warn_slowpath_null+0x1c/0x20()

    because now that warn_slowpath_null() is in the call path, the
    __builtin_return_address(0) returns that, rather than the place that
    caused the warning.

    Fix this by splitting up the warn_slowpath_null/fmt cases differently,
    using a common helper function, and getting the return address in the
    right place. This also happens to avoid the unnecessary stack usage for
    the non-stdargs case, and just generally cleans things up.

    Make the function name printout use %pS while at it.

    Cc: Ian Campbell
    Cc: Jesper Nilsson
    Cc: Johannes Weiner
    Cc: Arjan van de Ven
    Cc: Andi Kleen
    Cc: Hugh Dickins
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
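
The idea of "getting the return address in the right place" can be sketched in user-space C with the GCC/Clang builtin `__builtin_return_address()`. The names `warn_common_sketch()`, `warn_null_sketch()`, and `WARN_SKETCH()` are hypothetical, and the sketch prints a raw pointer rather than a symbol name, since %pS is a kernel printk format.

```c
#include <assert.h>
#include <stdio.h>

/* Last caller address and line seen by the helper (illustration only). */
void *last_caller;
int last_line;

/* Common helper: the caller address is passed in, not computed here. */
static void warn_common_sketch(const char *file, int line, void *caller)
{
    assert(file != NULL);
    last_caller = caller;
    last_line = line;
    fprintf(stderr, "WARNING: at %s:%d caller=%p\n", file, line, caller);
}

/* Entry point for the no-format case; noinline so it keeps its own frame. */
__attribute__((noinline))
void warn_null_sketch(const char *file, int line)
{
    /* Taken here, the address points into the code that raised the warning. */
    warn_common_sketch(file, line, __builtin_return_address(0));
}

#define WARN_SKETCH() warn_null_sketch(__FILE__, __LINE__)
```

Had the builtin been called inside the common helper instead, it would report an address inside the entry-point wrapper, which is the symptom shown in the warning line quoted above.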
     

16 May, 2009

2 commits


15 May, 2009

3 commits

  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb:
    kgdb: gdb documentation fix
    kgdb,i386: use address that SP register points to in the exception frame
    sysrq, intel_fb: fix sysrq g collision

    Linus Torvalds
     
  • Commit 79e539453b34e35f39299a899d263b0a1f1670bd introduced a
    regression where you cannot use sysrq 'g' to enter kgdb. The solution
    is to move the intel fb sysrq over to V for video instead of G for
    graphics. The SMP VOYAGER code to register for the sysrq-v is not
    anywhere to be found in the mainline kernel, so the comments in the
    code were cleaned up as well.

    This patch also cleans up the sysrq definitions for kgdb to make it
    generic for the kernel debugger, such that the sysrq 'g' can be used
    in the future to enter a gdbstub or another kernel debugger.

    Signed-off-by: Jason Wessel
    Acked-by: Jesse Barnes
    Acked-by: Randy Dunlap
    Signed-off-by: Andrew Morton

    Jason Wessel
     
  • This reverts commit fafd688e4c0c34da0f3de909881117d374e4c7af.

    Work is progressing to switch away from pdflush as the process backing
    for flushing out dirty data. So it seems pointless to add more knobs
    to control pdflush threads. The original author of the patch did not
    have any specific use cases for adding the knobs, so we can easily
    revert this before 2.6.30 to avoid having to maintain this API
    forever.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

13 May, 2009

1 commit

  • Now that lockdep coverage has increased it has become easier to
    run out of entries:

    [ 21.401387] BUG: MAX_LOCKDEP_ENTRIES too low!
    [ 21.402007] turning off the locking correctness validator.
    [ 21.402007] Pid: 1555, comm: S99local Not tainted 2.6.30-rc5-tip #2
    [ 21.402007] Call Trace:
    [ 21.402007] [] add_lock_to_list+0x53/0xba
    [ 21.402007] [] ? lookup_mnt+0x19/0x53
    [ 21.402007] [] check_prev_add+0x14b/0x1c7
    [ 21.402007] [] validate_chain+0x474/0x52a
    [ 21.402007] [] __lock_acquire+0x342/0x3c7
    [ 21.402007] [] lock_acquire+0xc1/0xe5
    [ 21.402007] [] ? lookup_mnt+0x19/0x53
    [ 21.402007] [] _spin_lock+0x31/0x66

    Double the size - as we've done in the past.

    [ Impact: allow lockdep to cover more locks ]

    Acked-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

09 May, 2009

3 commits


07 May, 2009

2 commits

  • When building with gcc 3.2 I get thousands of warnings such as

    include/linux/gfp.h: In function `allocflags_to_migratetype':
    include/linux/gfp.h:105: warning: null format string

    due to passing a NULL format string to warn_slowpath() in

    #define __WARN() warn_slowpath(__FILE__, __LINE__, NULL)

    Split this case out into a separate call. This also shrinks the kernel
    slightly:

       text    data     bss     dec     hex filename
    4802274  707668  712704 6222646  5ef336 vmlinux
       text    data     bss     dec     hex filename
    4799027  703572  712704 6215303  5ed687 vmlinux

    due to removing one argument from the commonly-called __WARN().

    [akpm@linux-foundation.org: reduce scope of `empty']
    Acked-by: Jesper Nilsson
    Acked-by: Johannes Weiner
    Acked-by: Arjan van de Ven
    Signed-off-by: Andi Kleen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • There is what we believe to be a false positive reported by lockdep.

    inotify_inode_queue_event() => take inotify_mutex => kernel_event() =>
    kmalloc() => SLOB => alloc_pages_node() => page reclaim => slab reclaim =>
    dcache reclaim => inotify_inode_is_dead => take inotify_mutex => deadlock

    The plan is to fix this via lockdep annotation, but that is proving to be
    quite involved.

    The patch flips the allocation over to GFP_NOFS to shut the warning up, for
    the 2.6.30 release.

    Hopefully we will fix this for real in 2.6.31. I'll queue a patch in -mm
    to switch it back to GFP_KERNEL so we don't forget.

    =================================
    [ INFO: inconsistent lock state ]
    2.6.30-rc2-next-20090417 #203
    ---------------------------------
    inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
    kswapd0/380 [HC0[0]:SC0[0]:HE1:SE1] takes:
    (&inode->inotify_mutex){+.+.?.}, at: [] inotify_inode_is_dead+0x35/0xb0
    {RECLAIM_FS-ON-W} state was registered at:
    [] mark_held_locks+0x68/0x90
    [] lockdep_trace_alloc+0xf5/0x100
    [] __kmalloc_node+0x31/0x1e0
    [] kernel_event+0xe2/0x190
    [] inotify_dev_queue_event+0x126/0x230
    [] inotify_inode_queue_event+0xc6/0x110
    [] vfs_create+0xcd/0x140
    [] do_filp_open+0x88d/0xa20
    [] do_sys_open+0x98/0x140
    [] sys_open+0x20/0x30
    [] system_call_fastpath+0x16/0x1b
    [] 0xffffffffffffffff
    irq event stamp: 690455
    hardirqs last enabled at (690455): [] _spin_unlock_irqrestore+0x44/0x80
    hardirqs last disabled at (690454): [] _spin_lock_irqsave+0x32/0xa0
    softirqs last enabled at (690178): [] __do_softirq+0x202/0x220
    softirqs last disabled at (690157): [] call_softirq+0x1c/0x50

    other info that might help us debug this:
    2 locks held by kswapd0/380:
    #0: (shrinker_rwsem){++++..}, at: [] shrink_slab+0x37/0x180
    #1: (&type->s_umount_key#17){++++..}, at: [] shrink_dcache_memory+0x11f/0x1e0

    stack backtrace:
    Pid: 380, comm: kswapd0 Not tainted 2.6.30-rc2-next-20090417 #203
    Call Trace:
    [] print_usage_bug+0x19f/0x200
    [] ? save_stack_trace+0x2f/0x50
    [] mark_lock+0x4bb/0x6d0
    [] ? check_usage_forwards+0x0/0xc0
    [] __lock_acquire+0xc62/0x1ae0
    [] ? slob_free+0x10c/0x370
    [] lock_acquire+0xe1/0x120
    [] ? inotify_inode_is_dead+0x35/0xb0
    [] mutex_lock_nested+0x63/0x420
    [] ? inotify_inode_is_dead+0x35/0xb0
    [] ? inotify_inode_is_dead+0x35/0xb0
    [] ? sched_clock+0x9/0x10
    [] ? lock_release_holdtime+0x35/0x1c0
    [] inotify_inode_is_dead+0x35/0xb0
    [] dentry_iput+0xbc/0xe0
    [] d_kill+0x33/0x60
    [] __shrink_dcache_sb+0x2d3/0x350
    [] shrink_dcache_memory+0x15a/0x1e0
    [] shrink_slab+0x125/0x180
    [] kswapd+0x560/0x7a0
    [] ? isolate_pages_global+0x0/0x2c0
    [] ? autoremove_wake_function+0x0/0x40
    [] ? trace_hardirqs_on+0xd/0x10
    [] ? kswapd+0x0/0x7a0
    [] kthread+0x5b/0xa0
    [] child_rip+0xa/0x20
    [] ? restore_args+0x0/0x30
    [] ? kthread+0x0/0xa0
    [] ? child_rip+0x0/0x20

    [eparis@redhat.com: fix audit too]
    Cc: Al Viro
    Cc: Matt Mackall
    Cc: Christoph Lameter
    Signed-off-by: Wu Fengguang
    Signed-off-by: Eric Paris
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

06 May, 2009

5 commits


03 May, 2009

1 commit

  • Avoid setting less than two pages for vm_dirty_bytes: this is necessary to
    avoid potential division by 0 (like the following) in get_dirty_limits().

    [ 49.951610] divide error: 0000 [#1] PREEMPT SMP
    [ 49.952195] last sysfs file: /sys/devices/pci0000:00/0000:00:01.1/host0/target0:0:0/0:0:0:0/block/sda/uevent
    [ 49.952195] CPU 1
    [ 49.952195] Modules linked in: pcspkr
    [ 49.952195] Pid: 3064, comm: dd Not tainted 2.6.30-rc3 #1
    [ 49.952195] RIP: 0010:[] [] get_dirty_limits+0xe9/0x2c0
    [ 49.952195] RSP: 0018:ffff88001de03a98 EFLAGS: 00010202
    [ 49.952195] RAX: 00000000000000c0 RBX: ffff88001de03b80 RCX: 28f5c28f5c28f5c3
    [ 49.952195] RDX: 0000000000000000 RSI: 00000000000000c0 RDI: 0000000000000000
    [ 49.952195] RBP: ffff88001de03ae8 R08: 0000000000000000 R09: 0000000000000000
    [ 49.952195] R10: ffff88001ddda9a0 R11: 0000000000000001 R12: 0000000000000001
    [ 49.952195] R13: ffff88001fbc8218 R14: ffff88001de03b70 R15: ffff88001de03b78
    [ 49.952195] FS: 00007fe9a435b6f0(0000) GS:ffff8800025d9000(0000) knlGS:0000000000000000
    [ 49.952195] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 49.952195] CR2: 00007fe9a39ab000 CR3: 000000001de38000 CR4: 00000000000006e0
    [ 49.952195] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 49.952195] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    [ 49.952195] Process dd (pid: 3064, threadinfo ffff88001de02000, task ffff88001ddda250)
    [ 49.952195] Stack:
    [ 49.952195] ffff88001fa0de00 ffff88001f2dbd70 ffff88001f9fe800 000080b900000000
    [ 49.952195] 00000000000000c0 ffff8800027a6100 0000000000000400 ffff88001fbc8218
    [ 49.952195] 0000000000000000 0000000000000600 ffff88001de03bb8 ffffffff802d3ed7
    [ 49.952195] Call Trace:
    [ 49.952195] [] balance_dirty_pages_ratelimited_nr+0x1d7/0x3f0
    [ 49.952195] [] ? ext3_writeback_write_end+0x9e/0x120
    [ 49.952195] [] generic_file_buffered_write+0x12f/0x330
    [ 49.952195] [] __generic_file_aio_write_nolock+0x26d/0x460
    [ 49.952195] [] ? generic_file_aio_write+0x52/0xd0
    [ 49.952195] [] generic_file_aio_write+0x69/0xd0
    [ 49.952195] [] ext3_file_write+0x26/0xc0
    [ 49.952195] [] do_sync_write+0xf1/0x140
    [ 49.952195] [] ? get_lock_stats+0x2a/0x60
    [ 49.952195] [] ? autoremove_wake_function+0x0/0x40
    [ 49.952195] [] vfs_write+0xcb/0x190
    [ 49.952195] [] sys_write+0x50/0x90
    [ 49.952195] [] system_call_fastpath+0x16/0x1b
    [ 49.952195] Code: 00 00 00 2b 05 09 1c 17 01 48 89 c6 49 0f af f4 48 c1 ee 02 48 89 f0 48 f7 e1 48 89 d6 31 d2 48 c1 ee 02 48 0f af 75 d0 48 89 f0 f7 f7 41 8b 95 ac 01 00 00 48 89 c7 49 0f af d4 48 c1 ea 02
    [ 49.952195] RIP [] get_dirty_limits+0xe9/0x2c0
    [ 49.952195] RSP
    [ 50.096523] ---[ end trace 008d7aa02f244d7b ]---

    Signed-off-by: Andrea Righi
    Cc: Peter Zijlstra
    Cc: David Rientjes
    Cc: Dave Chinner
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Righi
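
The two-page minimum described above can be sketched as a simple range check. This is a hypothetical user-space illustration: `set_dirty_bytes_sketch()` and `dirty_bytes_sketch` are made-up names, the page size is passed in as a parameter, and the way the kernel actually enforces the limit is not shown in the commit message.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the vm_dirty_bytes tunable. */
unsigned long dirty_bytes_sketch;

/*
 * Accept a new byte limit only if it covers at least two pages.
 * Returns 0 on success, -EINVAL if the value is too small; a rejected
 * value leaves the current setting untouched.
 */
int set_dirty_bytes_sketch(unsigned long bytes, unsigned long page_size)
{
    assert(page_size > 0);

    if (bytes < 2 * page_size)
        return -EINVAL;

    dirty_bytes_sketch = bytes;
    return 0;
}
```

With a 4096-byte page, 4096 is rejected and 8192 is accepted. Rejecting every value under two pages, including 0, is an assumption of this sketch.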
     

02 May, 2009

1 commit

  • tick_handle_periodic() can lock up hard when a one shot clock event
    device is used in combination with jiffies clocksource.

    Avoid an endless loop issue by requiring that a highres valid
    clocksource be installed before we call tick_periodic() in a loop when
    using ONESHOT mode. The result is we will only increment jiffies once
    per interrupt until a continuous hardware clocksource is available.

    Without this, we can run into an endless loop, where each cycle through
    the loop, jiffies is updated which increments time by tick_period or
    more (due to clock steering), which can cause the event programming to
    think the next event was before the newly incremented time and fail
    causing tick_periodic() to be called again and the whole process loops
    forever.

    [ Impact: prevent hard lock up ]

    Signed-off-by: John Stultz
    Signed-off-by: Andrew Morton
    Signed-off-by: Thomas Gleixner
    Cc: stable@kernel.org

    john stultz
     

01 May, 2009

1 commit


30 Apr, 2009

1 commit


29 Apr, 2009

2 commits

  • Andrew Gallatin reported that IRQ and SOFTIRQ times were
    sometimes not reported correctly on recent kernels, and even
    bisected to commit 457533a7d3402d1d91fbc125c8bd1bd16dcd3cd4
    ([PATCH] fix scaled & unscaled cputime accounting) as the first
    bad commit.

    Further analysis showed that commit
    79741dd35713ff4f6fd0eafd59fa94e8a4ba922d ([PATCH] idle cputime
    accounting) was the real cause of the problem.

    account_process_tick() was not taking into account timer IRQ
    interrupting the idle task servicing a hard or soft irq.

    On a mostly idle cpu, irqs were thus not accounted and top or
    mpstat could tell user/admin that cpu was 100 % idle, 0.00 %
    irq, 0.00 % softirq, while it was not.

    [ Impact: fix occasionally incorrect CPU statistics in top/mpstat ]

    Reported-by: Andrew Gallatin
    Re-reported-by: Andrew Morton
    Signed-off-by: Eric Dumazet
    Acked-by: Martin Schwidefsky
    Cc: rick.jones2@hp.com
    Cc: brice@myri.com
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Eric Dumazet
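
The accounting decision described above can be sketched as a small classifier. This is a simplification under stated assumptions: `classify_tick_sketch()` and its parameters are hypothetical, and a single "system" bucket stands in for the kernel's separate system, irq, and softirq times.

```c
#include <assert.h>
#include <stdbool.h>

enum tick_bucket { TICK_USER, TICK_SYSTEM, TICK_IDLE };

/*
 * Decide which bucket one timer tick is charged to.
 * user_tick:   the tick interrupted user-mode code
 * idle_task:   the interrupted task is the idle task
 * irq_nesting: number of active hard/soft irq contexts, counting the
 *              timer interrupt itself (1 means "only the timer irq")
 */
enum tick_bucket classify_tick_sketch(bool user_tick, bool idle_task,
                                      int irq_nesting)
{
    assert(irq_nesting >= 1);

    if (user_tick)
        return TICK_USER;

    /* Idle task, but the timer interrupted another irq: not idle time. */
    if (!idle_task || irq_nesting > 1)
        return TICK_SYSTEM;

    return TICK_IDLE;
}
```

The case the commit message describes is the last branch of the second test: the idle task is current, yet the tick landed while a hard or soft irq was being serviced, so the time must not be counted as idle.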
     
  • The pages allocated for the splice binary buffer did not initialize
    the ref count correctly. This caused pages not to be freed and causes
    a drastic memory leak.

    Thanks to logdev I was able to trace the tracer to find where the leak
    was.

    [ Impact: stop memory leak when using splice ]

    Signed-off-by: Steven Rostedt
    Signed-off-by: Ingo Molnar

    Steven Rostedt
     

27 Apr, 2009

4 commits


25 Apr, 2009

1 commit

  • Commit c751085943362143f84346d274e0011419c84202 ("PM/Hibernate: Wait for
    SCSI devices scan to complete during resume") added a call to
    scsi_complete_async_scans() to software_resume(), so that it waited for
    the SCSI scanning to complete, but the call was added at a wrong place.

    Namely, it should have been added after wait_for_device_probe(), which
    is called only if the image partition hasn't been specified yet. Also,
    it's reasonable to check if the image partition is present and only wait
    for the device probing and SCSI scanning to complete if it is not the
    case.

    Additionally, since noresume is checked right at the beginning of
    software_resume() and the function returns immediately if it's set, it
    doesn't make sense to check it once again later.

    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki