24 Jul, 2006

2 commits

  • Fix ABBA deadlock between lock_cpu_hotplug() and the cpuset
    callback_mutex lock.

    It only happens on cpu_exclusive cpusets, due to the dynamic
    sched domain code trying to take the cpu hotplug lock inside
    the cpuset callback_mutex lock.

    This bug has apparently been here for several months, but didn't
    get hit until the right customer load on a large system.

    This fix appears right from inspection, but it will take a few
    more days running it on that customer's workload to be confident
    we nailed it. We don't have any other reproducible test case.

    The cpu_hotplug_lock() tends to cover large runs of code.
    The other places that hold both that lock and the cpuset callback
    mutex lock always nest the cpuset lock inside the hotplug lock.
    This place tries to do the reverse, risking an ABBA deadlock.

    This is in the cpuset_rmdir() code, where we:
    * take the callback_mutex lock
    * mark the cpuset CS_REMOVED
    * call update_cpu_domains for cpu_exclusive cpusets
    * in that call, take the cpu_hotplug lock if the
    cpuset is marked for removal.

    Thanks to Jack Steiner for identifying this deadlock.

    The fix is to tear down the dynamic sched domain before we grab
    the cpuset callback_mutex lock. This way, the two locks are
    serialized, with the hotplug lock taken and released before
    trying for the cpuset lock.

    I suspect that this bug was introduced when I changed the
    cpuset locking from one lock to two. The dynamic sched domain
    dependency on cpu_exclusive cpusets and its hotplug hooks were
    added to this code earlier, when cpusets had only a single lock.
    It may well have been fine then.

    Signed-off-by: Paul Jackson
    Signed-off-by: Linus Torvalds

    Paul Jackson
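
The lock ordering described in this commit can be modelled in userspace. The sketch below is only an illustration: the two locks are plain flags rather than real mutexes, and the names `lock_state`, `take`, `drop`, `rmdir_before_fix` and `rmdir_after_fix` are hypothetical, not kernel symbols. It counts how often the hotplug lock is requested while the callback lock is already held, which is the inverted half of the ABBA pair.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two kernel locks; flags, not real mutexes. */
enum lock_id { HOTPLUG_LOCK = 0, CALLBACK_LOCK = 1, NR_LOCKS };

struct lock_state {
    bool held[NR_LOCKS];
    int inversions; /* hotplug lock requested while callback lock was held */
};

static void take(struct lock_state *s, enum lock_id id)
{
    assert(!s->held[id]);
    /* Established order: callback lock nests inside the hotplug lock.
     * Requesting the hotplug lock under the callback lock inverts it. */
    if (id == HOTPLUG_LOCK && s->held[CALLBACK_LOCK])
        s->inversions++;
    s->held[id] = true;
}

static void drop(struct lock_state *s, enum lock_id id)
{
    assert(s->held[id]);
    s->held[id] = false;
}

/* Ordering before the fix: the domain teardown runs under the callback lock. */
static int rmdir_before_fix(struct lock_state *s)
{
    take(s, CALLBACK_LOCK);
    take(s, HOTPLUG_LOCK);   /* models the lock taken during teardown */
    drop(s, HOTPLUG_LOCK);
    drop(s, CALLBACK_LOCK);
    return s->inversions;
}

/* Ordering after the fix: tear down first, then take the callback lock. */
static int rmdir_after_fix(struct lock_state *s)
{
    take(s, HOTPLUG_LOCK);
    drop(s, HOTPLUG_LOCK);
    take(s, CALLBACK_LOCK);
    drop(s, CALLBACK_LOCK);
    return s->inversions;
}
```

With a zero-initialized `struct lock_state`, the first ordering records one inversion and the second records none, because the hotplug lock is released before the callback lock is requested.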
     
  • The CPU hotplug locking was quite messy, with a recursive lock to
    handle the fact that the actual up/down sequence wanted to
    protect itself from being re-entered, while the callbacks that it
    called also tended to want to protect themselves from CPU events.

    This splits the lock into two (one to serialize the whole hotplug
    sequence, the other to protect against the CPU present bitmaps
    changing). The latter still allows recursive usage because some
    subsystems (ondemand policy for cpufreq at least) had already gotten
    too used to the lax locking, but the locking mistakes are hopefully
    now less fundamental, and we now warn about recursive lock usage
    when we see it, in the hope that it can be fixed.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

15 Jul, 2006

17 commits

  • In send_cpu_listeners(), which is called on the exit path, a down_write()
    was protecting operations like skb_clone() and genlmsg_unicast() that do
    GFP_KERNEL allocations. If the oom-killer decides to kill tasks to satisfy
    the allocations, the exit of those tasks could block on the same semaphore.

    The down_write() was only needed to allow removal of invalid listeners from
    the listener list. The patch converts the down_write to a down_read and
    defers the removal to a separate critical region. This ensures that even
    if the oom-killer is called, no other task's exit is blocked as it can
    still acquire another down_read.

    Thanks to Andrew Morton & Herbert Xu for pointing out the oom related
    pitfalls, and to Chandra Seetharaman for suggesting this fix instead of
    using something more complex like RCU.

    Signed-off-by: Chandra Seetharaman
    Signed-off-by: Shailabh Nagar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shailabh Nagar
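
The two-phase approach in this commit (mark under the read side, remove in a separate critical region) might look roughly like the following userspace sketch. No real locking is performed here; comments indicate where `down_read()` and `down_write()` would sit. The types and the `deliver_even_only` stub are hypothetical and exist only to make the example self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_LISTENERS 8

struct listener { int pid; bool valid; };
struct listener_list { struct listener items[MAX_LISTENERS]; int count; };

/* Demo stub (assumption): delivery succeeds only for even pids. */
static int deliver_even_only(int pid)
{
    return (pid % 2 == 0) ? 0 : -1;
}

/* Phase 1: would run under down_read(). Failed listeners are only
 * marked invalid, so the list structure itself is never modified here.
 * Returns the number of listeners marked invalid. */
static int send_to_all(struct listener_list *l, int (*deliver)(int pid))
{
    int failed = 0;
    assert(l != NULL && deliver != NULL);
    for (int i = 0; i < l->count; i++) {
        if (deliver(l->items[i].pid) != 0) {
            l->items[i].valid = false;
            failed++;
        }
    }
    return failed;
}

/* Phase 2: would run under down_write(), in a separate short critical
 * region with no allocations. Compacts the list and returns how many
 * entries were removed. */
static int prune_invalid(struct listener_list *l)
{
    int kept = 0;
    for (int i = 0; i < l->count; i++)
        if (l->items[i].valid)
            l->items[kept++] = l->items[i];
    int removed = l->count - kept;
    l->count = kept;
    return removed;
}
```

The point of the split is that the potentially blocking work happens only in phase 1, where other exiting tasks could still take the read side.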
     
  • On systems with a large number of cpus, with even a modest rate of tasks
    exiting per cpu, the volume of taskstats data sent on thread exit can
    overflow a userspace listener's buffers.

    One approach to avoiding overflow is to allow listeners to get data for a
    limited and specific set of cpus. By scaling the number of listeners
    and/or the cpus they monitor, userspace can handle the statistical data
    overload more gracefully.

    In this patch, each listener registers to listen to a specific set of cpus
    by specifying a cpumask. The interest is recorded per-cpu. When a task
    exits on a cpu, its taskstats data is unicast to each listener interested
    in that cpu.

    Thanks to Andrew Morton for pointing out the various scalability and
    general concerns of previous attempts and for suggesting this design.

    [akpm@osdl.org: build fix]
    Signed-off-by: Shailabh Nagar
    Signed-off-by: Balbir Singh
    Signed-off-by: Chandra Seetharaman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shailabh Nagar
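
As a rough illustration of recording interest per cpu, the sketch below keeps a small listener table for each cpu and fills it from a bitmask at registration time. The table sizes, the use of a plain `unsigned int` as the cpumask, and the function names are all assumptions made for the example; the kernel's actual data structures are not shown in this log.

```c
#include <assert.h>

#define NR_CPUS     8   /* assumption: small fixed cpu count */
#define MAX_PER_CPU 4   /* assumption: small fixed table per cpu */

static int cpu_listeners[NR_CPUS][MAX_PER_CPU];
static int cpu_listener_count[NR_CPUS];

/* Record `pid` on every cpu whose bit is set in `cpumask`.
 * Returns the number of cpus the listener was recorded on;
 * cpus whose table is already full are skipped. */
static int register_listener(int pid, unsigned int cpumask)
{
    int added = 0;
    for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (!(cpumask & (1u << cpu)))
            continue;
        if (cpu_listener_count[cpu] == MAX_PER_CPU)
            continue;
        cpu_listeners[cpu][cpu_listener_count[cpu]++] = pid;
        added++;
    }
    return added;
}

/* Collect the pids that would be unicast to when a task exits on `cpu`.
 * Returns how many pids were written to `out`. */
static int listeners_for_cpu(unsigned int cpu, int *out, int out_len)
{
    int n = 0;
    assert(cpu < NR_CPUS);
    for (int i = 0; i < cpu_listener_count[cpu] && n < out_len; i++)
        out[n++] = cpu_listeners[cpu][i];
    return n;
}
```

A listener registered with mask `0x3` is recorded on cpus 0 and 1 only, so exits on any other cpu generate no traffic for it.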
     
  • Send per-tgid data only once during exit of a thread group instead of once
    with each member thread exit.

    Currently, when a thread exits, besides its per-tid data, the per-tgid data
    of its thread group is also sent out, if its thread group is non-empty.
    The per-tgid data sent consists of the sum of per-tid stats for all
    *remaining* threads of the thread group.

    This patch modifies this sending in two ways:

    - the per-tgid data is sent only when the last thread of a thread group
    exits. This cuts down heavily on the overhead of sending/receiving
    per-tgid data, especially when other exploiters of the taskstats
    interface aren't interested in per-tgid stats

    - the semantics of the per-tgid data sent are changed. Instead of being
    the sum of per-tid data for remaining threads, the value now sent is the
    true total accumulated statistics for all threads that are/were part of
    the thread group.

    The patch also addresses a minor issue where failure of one accounting
    subsystem to fill in the taskstats structure was causing the taskstats
    data to not be sent at all.

    The patch has been tested for stability and has run cerberus for over
    4 hours on an SMP system.

    [akpm@osdl.org: bugfixes]
    Signed-off-by: Shailabh Nagar
    Signed-off-by: Balbir Singh
    Cc: Jay Lan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shailabh Nagar
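
The changed per-tgid semantics can be sketched as follows: each exiting thread folds its own statistics into a group total, and the total is handed out only when the last thread leaves. The structure layout and field names here are hypothetical, chosen only to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of per-task statistics. */
struct stats {
    unsigned long cpu_delay;
    unsigned long blkio_delay;
};

struct thread_group {
    int live_threads;     /* threads that have not exited yet */
    struct stats total;   /* accumulated from threads that already exited */
};

/* Called once per exiting thread. Adds the thread's stats to the group
 * total. Returns true and fills *out only when this was the last thread,
 * i.e. the one point at which per-tgid data would be sent. */
static bool thread_exit(struct thread_group *tg,
                        const struct stats *tid_stats,
                        struct stats *out)
{
    assert(tg != NULL && tg->live_threads > 0);
    tg->total.cpu_delay += tid_stats->cpu_delay;
    tg->total.blkio_delay += tid_stats->blkio_delay;
    if (--tg->live_threads > 0)
        return false;
    *out = tg->total;
    return true;
}
```

For a group of three threads, the first two calls return false and the third returns the sum over all three, including the threads that exited earlier.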
     
  • Export I/O delays seen by a task through /proc//stats for use in top
    etc.

    Note that delays for I/O done for swapping in pages (swapin I/O) are
    clubbed together with all other I/O here (this is not the case in the
    netlink interface, where the swapin I/O is kept distinct).

    [akpm@osdl.org: printk warning fix]
    Signed-off-by: Shailabh Nagar
    Signed-off-by: Balbir Singh
    Cc: Jes Sorensen
    Cc: Peter Chubb
    Cc: Erich Focht
    Cc: Levent Serinol
    Cc: Jay Lan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shailabh Nagar
     
  • Usage of taskstats interface by delay accounting.

    Signed-off-by: Shailabh Nagar
    Signed-off-by: Balbir Singh
    Cc: Jes Sorensen
    Cc: Peter Chubb
    Cc: Erich Focht
    Cc: Levent Serinol
    Cc: Jay Lan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shailabh Nagar
     
  • Create a "taskstats" interface based on generic netlink (NETLINK_GENERIC
    family), for getting statistics of tasks and thread groups during their
    lifetime and when they exit. The interface is intended for use by multiple
    accounting packages though it is being created in the context of delay
    accounting.

    This patch creates the interface without populating the fields of the data
    that is sent to the user in response to a command or upon the exit of a task.
    Each accounting package interested in using taskstats has to provide an
    additional patch to add its stats to the common structure.

    [akpm@osdl.org: cleanups, Kconfig fix]
    Signed-off-by: Shailabh Nagar
    Signed-off-by: Balbir Singh
    Cc: Jes Sorensen
    Cc: Peter Chubb
    Cc: Erich Focht
    Cc: Levent Serinol
    Cc: Jay Lan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shailabh Nagar
     
  • Make the task-related schedstats functions callable by delay accounting even
    if schedstats collection isn't turned on. This removes the dependency of
    delay accounting on schedstats.

    Signed-off-by: Chandra Seetharaman
    Signed-off-by: Shailabh Nagar
    Signed-off-by: Balbir Singh
    Cc: Jes Sorensen
    Cc: Peter Chubb
    Cc: Erich Focht
    Cc: Levent Serinol
    Cc: Jay Lan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chandra Seetharaman
     
  • Unlike earlier iterations of the delay accounting patches, now delays are only
    collected for the actual I/O waits rather than try and cover the delays seen
    in I/O submission paths.

    Account separately for block I/O delays incurred as a result of swapin page
    faults whose frequency can be affected by the task/process' rss limit. Hence
    swapin delays can act as feedback for rss limit changes independent of I/O
    priority changes.

    Signed-off-by: Shailabh Nagar
    Signed-off-by: Balbir Singh
    Cc: Jes Sorensen
    Cc: Peter Chubb
    Cc: Erich Focht
    Cc: Levent Serinol
    Cc: Jay Lan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shailabh Nagar
     
  • Initialization code related to collection of per-task "delay" statistics which
    measure how long it had to wait for cpu, sync block io, swapping etc. The
    collection of statistics and the interface are in other patches. This patch
    sets up the data structures and allows the statistics collection to be
    disabled through a kernel boot parameter.

    Signed-off-by: Shailabh Nagar
    Signed-off-by: Balbir Singh
    Cc: Jes Sorensen
    Cc: Peter Chubb
    Cc: Erich Focht
    Cc: Levent Serinol
    Cc: Jay Lan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shailabh Nagar
     
  • On platforms that have __ARCH_WANT_UNLOCKED_CTXSW set and want to implement
    lock validator support there's a bug in rq->lock handling: in this case we
    don't 'carry over' the runqueue lock into another task - but still we did a
    spinlock_release() of it. Fix this by making the spinlock_release() in
    context_switch() dependent on !__ARCH_WANT_UNLOCKED_CTXSW.

    (Reported by Ralf Baechle on MIPS, which has __ARCH_WANT_UNLOCKED_CTXSW.
    This fixes a lockdep-internal BUG message on such platforms.)

    Signed-off-by: Ingo Molnar
    Cc: Ralf Baechle
    Cc: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • IRQs must be disabled before taking ->siglock.

    Noticed by lockdep.

    Signed-off-by: OGAWA Hirofumi
    Cc: Arjan van de Ven
    Cc: Ingo Molnar
    Cc: "Eric W. Biederman"
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
     
  • Resolve problems seen w/ APM suspend.

    Due to resume initialization ordering, it's possible we could get a timer
    interrupt before the timekeeping resume() function is called. This patch
    ensures we don't do any timekeeping accounting before we're fully resumed.

    (akpm: fixes the machine-freezes-on-APM-resume bug)

    Signed-off-by: John Stultz
    Cc: Roman Zippel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    john stultz
     
  • Christoph Hellwig:
    open_softirq just enables a softirq. The softirq array is statically
    allocated so to add a new one you would have to patch the kernel. So
    there's no point to keep this export at all as any user would have to
    patch the enum in include/linux/interrupt.h anyway.

    Signed-off-by: Adrian Bunk
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • When CONFIG_RT_MUTEX_TESTER is enabled kernel refuses to suspend the
    machine because it's unable to freeze the rt-test-* threads.

    Add try_to_freeze() after schedule() so that the threads will be frozen
    correctly; I've tested the patch and it lets the notebook suspend and
    resume nicely.

    Signed-off-by: Luca Tettamanti
    Cc: Ingo Molnar
    Acked-by: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luca Tettamanti
     
  • Relax the CPU in the del_timer_sync() busywait loop.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Remove the now-unneeded kthread_stop_sem().

    Signed-off-by: Adrian Bunk
    Acked-by: Alan Stern
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Got a customer bug report (https://bugzilla.novell.com/190296) about kernel
    symbols longer than 127 characters which end up in a string buffer that is
    not NULL terminated, leading to garbage in /proc/kallsyms. Using strlcpy
    prevents this from happening, even though such symbols still won't come out
    right.

    A better fix would be to not use a fixed-size buffer, but it's probably not
    worth the trouble. (Modversion'ed symbols even have a length limit of 60.)

    [bunk@stusta.de: build fix]
    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andreas Gruenbacher
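
To illustrate why a strlcpy-style copy avoids the unterminated buffer, here is a minimal userspace sketch. `bounded_copy` is a hypothetical helper written for this example (userspace C libraries do not all provide `strlcpy`); it follows the same contract of always terminating the destination and returning the source length so that truncation can be detected.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* strlcpy-style copy: copies at most size - 1 bytes, always writes a
 * terminating NUL when size > 0, and returns strlen(src) so the caller
 * can detect truncation by comparing the result against size. */
static size_t bounded_copy(char *dst, const char *src, size_t size)
{
    assert(dst != NULL && src != NULL);
    size_t len = strlen(src);
    if (size > 0) {
        size_t n = (len < size - 1) ? len : size - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return len;
}
```

Copying an over-long name into an 8-byte buffer leaves a truncated but terminated string, which matches the commit's remark that such symbols still will not come out right but no longer produce garbage.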
     

13 Jul, 2006

3 commits

  • Implement the scheduled unexport of insert_resource.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Adrian Bunk
     
  • Remove the deprecated and no longer used pm_unregister_all().

    Signed-off-by: Adrian Bunk
    Acked-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Adrian Bunk
     
  • Based on a patch from Ernie Petrides

    During security research, Red Hat discovered a behavioral flaw in core
    dump handling. A local user could create a program that would cause a
    core file to be dumped into a directory they would not normally have
    permissions to write to. This could lead to a denial of service (disk
    consumption), or allow the local user to gain root privileges.

    The prctl() system call should never allow setting "dumpable" to the
    value 2, especially not for non-privileged users.

    This can be split into three cases:

    1) running as root -- then core dumps will already be done as root,
    and so prctl(PR_SET_DUMPABLE, 2) is not useful

    2) running as non-root w/setuid-to-root -- this is the debatable case

    3) running as non-root w/setuid-to-non-root -- then you definitely
    do NOT want "dumpable" to get set to 2 because you have the
    privilege escalation vulnerability

    With case #2, the only potential usefulness is for a program that has
    been designed to run with higher privilege (than the user invoking it) that
    wants to be able to create root-owned root-validated core dumps. This
    might be useful as a debugging aid, but would only be safe if the program
    had done a chdir() to a safe directory.

    There is no benefit to a production setuid-to-root utility, because it
    shouldn't be dumping core in the first place. If this is true, then the
    same debugging aid could also be accomplished with the "suid_dumpable"
    sysctl.

    Signed-off-by: Marcel Holtmann
    Signed-off-by: Linus Torvalds

    Marcel Holtmann
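
A check of the kind this commit argues for could be sketched as below. This is not the kernel's code: `struct task` and `set_dumpable` are hypothetical, and the sketch assumes the fix amounts to rejecting every value other than 0 and 1 at the prctl() boundary.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-process dumpable setting. */
struct task {
    int dumpable;
};

/* Accept only 0 (not dumpable) and 1 (dumpable); anything else,
 * including 2, is rejected with -EINVAL and the setting is unchanged. */
static int set_dumpable(struct task *t, long value)
{
    assert(t != NULL);
    if (value < 0 || value > 1)
        return -EINVAL;
    t->dumpable = (int)value;
    return 0;
}
```

Under this assumption, a request for the value 2 fails and the previous setting stays in place.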
     

11 Jul, 2006

15 commits

  • Disable lockdep debugging in two situations where the integrity of the
    kernel no longer is guaranteed: when oopsing and when hitting a
    tainting-condition. The goal is to not get weird lockdep traces that don't
    make sense or are otherwise undebuggable, to not waste time.

    Lockdep assumes that the previous state it knows about is valid to operate,
    which is why lockdep turns itself off after the first violation it reports;
    after that point it can no longer make that assumption.

    A kernel oops means that the integrity of the kernel is compromised; in
    addition anything lockdep would report is of lesser importance than the
    oops.

    All the tainting conditions are of similar integrity-violating nature and
    also make debugging/diagnosing more difficult.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     
  • As announced half a year ago this patch will remove the tasklist_lock
    export. The previous two patches got rid of the remaining modular users.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • allyesconfig vmlinux size delta:

        text    data     bss      dec filename
    20736884 6073834 3075176 29885894 vmlinux.before
    20721009 6073966 3075176 29870151 vmlinux.after

    ~18 bytes per callsite, 15K of text size (~0.1%) saved.

    (as an added bonus this also removes a lockdep annotation.)

    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Do not panic a machine when swsusp signature can't be read.

    Signed-off-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • kernel/power/swap.c: In function 'swsusp_write':
    kernel/power/swap.c:275: warning: 'start' may be used uninitialized in this function

    gcc isn't smart enough, so help it.

    Cc: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • swsusp should not use memcpy for snapshotting memory, because on some
    architectures memcpy may increase preempt_count (i386 does this when
    CONFIG_X86_USE_3DNOW is set). Then, as a result, wrong value of preempt_count
    is stored in the image.

    Replace memcpy in copy_data_pages with an open-coded loop.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
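
An open-coded page copy of the kind described might look like the sketch below. The page size and the function name are assumptions for the example. Note also that an optimizing userspace compiler is free to turn such a loop back into a library call, so this shows only the shape of the loop, not a guarantee about the generated code.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_BYTES 4096   /* assumption: 4 KiB pages */

/* Copy one page word by word instead of calling memcpy(), so that no
 * architecture-specific memcpy side effects are involved. */
static void copy_page_words(unsigned long *dst, const unsigned long *src)
{
    assert(dst != NULL && src != NULL && dst != src);
    for (size_t n = PAGE_BYTES / sizeof(unsigned long); n > 0; n--)
        *dst++ = *src++;
}
```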
     
  • A large number of lost ticks can cause an overadjustment of the clock. To
    compensate for this we look at the current error and the larger the error
    already is, the more careful we are at adjusting the error. As a small
    extra fix, reset the error when the clock is set.

    Signed-off-by: Roman Zippel
    Acked-by: john stultz
    Cc: Uwe Bugla
    Cc: James Bottomley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Zippel
     
  • If futex_lock_pi is called with a reference to a non-PI futex and
    waiters already exist, lookup_pi_state() oopses due to pi_state == NULL.
    Check this condition and return -EINVAL to userspace.

    Signed-off-by: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Jakub Jelinek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • This patch marks an unused export as EXPORT_UNUSED_SYMBOL.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • This patch marks unused exports as EXPORT_UNUSED_SYMBOL.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • lockdep_map is embedded into every lock, which blows up data structure
    sizes all around the kernel. Reduce the class-cache to be for the default
    class only - that is used in 99.9% of the cases and even if we don't have a
    class cached, the lookup in the class-hash is lockless.

    This change reduces the per-lock dep_map overhead by 56 bytes on 64-bit
    platforms and by 28 bytes on 32-bit platforms.

    Signed-off-by: Ingo Molnar
    Cc: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Make lockdep print which lock is held, in the "kfree() of a live lock"
    scenario.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     
  • - Use printk formatting for indentation
    - Don't leave NTFS in the default event filter

    Signed-off-by: Andi Kleen
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • - constify and optimize stat_nam (thanks to Michael Tokarev!)
    - spelling and comment fixes

    Signed-off-by: Andreas Mohr
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andreas Mohr
     
  • Problem:

    In the function __migrate_task(), deactivate_task() followed by
    activate_task() is used to move the task from one run queue to
    another. This has two undesirable effects:

    1. The task's priority is recalculated. (Nowhere else in the
    scheduler code is the priority recalculated for a change of CPU.)

    2. The task's time stamp is set to the current time. At the very least,
    this makes the adjustment of the time stamp before the call to
    deactivate_task() redundant but I believe the problem is more serious
    as the time stamp now holds the time of the queue change instead of
    the time at which the task was woken. In addition, unless dest_rq is
    the same queue as "current" is on, the time stamp could be inaccurate
    due to inter-CPU drift.

    Solution:

    Replace the call to activate_task() with one to __activate_task().

    Signed-off-by: Peter Williams
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Williams
     

05 Jul, 2006

1 commit

  • * master.kernel.org:/pub/scm/linux/kernel/git/davej/cpufreq:
    Move workqueue exports to where the functions are defined.
    [CPUFREQ] Misc cleanups in ondemand.
    [CPUFREQ] Make ondemand sampling per CPU and remove the mutex usage in sampling path.
    [CPUFREQ] Add queue_delayed_work_on() interface for workqueues.
    [CPUFREQ] Remove slowdown from ondemand sampling path.

    Linus Torvalds
     

04 Jul, 2006

2 commits

  • Jiri reports that the stop_machine kthread conversion caused his machine to
    hang when suspending. Hyperthreading is apparently involved.

    I don't see why that would be and I can't reproduce it. Revert to the 2.6.17
    code.

    Cc: "Serge E. Hallyn"
    Cc: Jiri Slaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc:
    powerpc: add defconfig for Freescale MPC8349E-mITX board
    powerpc: Add base support for the Freescale MPC8349E-mITX eval board
    Documentation: correct values in MPC8548E SEC example node
    [POWERPC] Actually copy over i8259.c to arch/ppc/syslib this time
    [POWERPC] Add new interrupt mapping core and change platforms to use it
    [POWERPC] Copy i8259 code back to arch/ppc
    [POWERPC] New device-tree interrupt parsing code
    [POWERPC] Use the genirq framework
    [PATCH] genirq: Allow fasteoi handler to retrigger disabled interrupts
    [POWERPC] Update the SWIM3 (powermac) floppy driver
    [POWERPC] Fix error handling in detecting legacy serial ports
    [POWERPC] Fix booting on Momentum "Apache" board (a Maple derivative)
    [POWERPC] Fix various offb and BootX-related issues
    [POWERPC] Add a default config for 32-bit CHRP machines
    [POWERPC] fix implicit declaration on cell.
    [POWERPC] change get_property to return void *

    Linus Torvalds