29 May, 2005

1 commit

  • The "unhandled interrupts" catcher, note_interrupt(), increments a global
    desc->irq_count and grossly damages scaling of very large systems, e.g.,
    >192p ia64 Altix, because of this highly contended cacheline, especially
    for timer interrupts. 384p is severely crippled, and 512p is unusable.

    All calls to note_interrupt() can be disabled by booting with "noirqdebug",
    but this disables the useful interrupt checking for all interrupts.

    I propose eliminating note_interrupt() for all per-CPU interrupts. This
    was the behavior of linux-2.6.10 and earlier, but in 2.6.11 a code
    restructuring added a call to note_interrupt() for per-CPU interrupts.
    Besides, note_interrupt() is a bit racy for concurrent CPU calls anyway, as
    the desc->irq_count++ increment isn't atomic (which, if done, would make
    scaling even worse).

    Signed-off-by: John Hawkes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Hawkes
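
The per-CPU bypass proposed in the entry above can be sketched in user-space C as follows. This is only an illustration: the `sketch_` names, the flag value, and the struct layout are assumptions and are not taken from the kernel source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag and descriptor, loosely modelled on the kernel's irq_desc. */
#define SKETCH_IRQ_PER_CPU 0x1u

struct sketch_irq_desc {
    unsigned int status;    /* flag bits */
    unsigned int irq_count; /* shared counter bumped by the interrupt catcher */
};

/* Stand-in for note_interrupt(): writes to the shared, contended counter. */
static void sketch_note_interrupt(struct sketch_irq_desc *desc)
{
    desc->irq_count++;
}

/*
 * Returns 1 if the accounting ran and 0 if it was skipped.
 * Per-CPU interrupts never touch the shared counter, and the
 * noirqdebug argument models the boot option that disables all checks.
 */
static int sketch_handle_irq(struct sketch_irq_desc *desc, int noirqdebug)
{
    assert(desc != NULL);
    if (desc->status & SKETCH_IRQ_PER_CPU)
        return 0;
    if (noirqdebug)
        return 0;
    sketch_note_interrupt(desc);
    return 1;
}
```

With this shape, a per-CPU timer interrupt leaves `irq_count` untouched on every CPU, while ordinary interrupts keep the unhandled-interrupt checking.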
     

27 May, 2005

1 commit

  • There is a race in the kernel cpuset code, between the code
    to handle notify_on_release, and the code to remove a cpuset.
    The notify_on_release code can end up trying to access a
    cpuset that has been removed. In the most common case, this
    causes a NULL pointer dereference from the routine cpuset_path.
    However all manner of bad things are possible, in theory at least.

    The existing code decrements the cpuset use count, and if the
    count goes to zero, processes the notify_on_release request,
    if appropriate. However, once the count goes to zero, unless we
    are holding the global cpuset_sem semaphore, there is nothing to
    stop another task from immediately removing the cpuset entirely,
    and recycling its memory.

    The obvious fix would be to always hold the cpuset_sem
    semaphore while decrementing the use count and dealing with
    notify_on_release. However we don't want to force a global
    semaphore into the mainline task exit path, as that might create
    a scaling problem.

    The actual fix is almost as easy - since this is only an issue
    for cpusets using notify_on_release, which the top level big
    cpusets don't normally need to use, only take the cpuset_sem
    for cpusets using notify_on_release.

    This code has been run for hours without a hiccup, while running
    a cpuset create/destroy stress test that could crash the existing
    kernel in seconds. This patch applies to the current -linus
    git kernel.

    Signed-off-by: Paul Jackson
    Acked-by: Simon Derr
    Acked-by: Dinakar Guniguntala
    Signed-off-by: Linus Torvalds

    Paul Jackson
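
The conditional locking described in the entry above might look roughly like the sketch below. It is single-threaded and uses a plain `int` where the kernel would use an atomic counter; the `sketch_` names and the acquisition counter are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_cpuset {
    int count;             /* use count (atomic in a real kernel) */
    int notify_on_release; /* non-zero if release notification is wanted */
};

/* Counts how often the global semaphore would have been taken. */
static int sketch_sem_acquisitions;

static void sketch_sem_down(void) { sketch_sem_acquisitions++; }
static void sketch_sem_up(void) { }

/*
 * Drop one reference. Returns 1 if the caller should process the
 * notify_on_release request, 0 otherwise. The global semaphore is
 * taken only for cpusets that use notify_on_release, so the common
 * exit path stays lock-free.
 */
static int sketch_cpuset_put(struct sketch_cpuset *cs)
{
    int notify = 0;

    assert(cs != NULL && cs->count > 0);
    if (cs->notify_on_release) {
        sketch_sem_down();
        if (--cs->count == 0)
            notify = 1; /* cpuset cannot be removed while we hold the lock */
        sketch_sem_up();
    } else {
        cs->count--;
    }
    return notify;
}
```

The design choice is to pay for the global lock only on the uncommon path, which is why the large top-level cpusets do not become a scaling bottleneck.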
     

25 May, 2005

1 commit

  • If SIGKILL does not have priority, we cannot kill a task instantly, before
    it does some unexpected work. This can be critical, but we were unable to
    reproduce it easily until Heiko Carstens reported this problem on LKML.

    Signed-Off-By: Kirill Korotaev
    Signed-Off-By: Alexey Kuznetsov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev
     

22 May, 2005

1 commit

  • In _spin_unlock_bh(lock):
    do { \
    _raw_spin_unlock(lock); \
    preempt_enable(); \
    local_bh_enable(); \
    __release(lock); \
    } while (0)

    there is no reason for using preempt_enable() instead of a simple
    preempt_enable_no_resched()

    Since we know bottom halves are disabled, preempt_schedule() will always
    return at once (preempt_count!=0), and hence preempt_check_resched() is
    useless here...

    This fixes it by using "preempt_enable_no_resched()" instead of the
    "preempt_enable()", and thus avoids the useless preempt_check_resched()
    just before re-enabling bottom halves.

    Signed-off-by: Samuel Thibault
    Signed-off-by: Linus Torvalds

    Samuel Thibault
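
The reasoning in the entry above can be modelled with a toy preempt counter. This is a user-space sketch under stated assumptions: the `sketch_` names and the offset value are invented here, and the real kernel primitives do considerably more.

```c
#include <assert.h>

/* Hypothetical offset added to the preempt count while bottom halves are off. */
#define SKETCH_SOFTIRQ_OFFSET 0x100

static int sketch_preempt_count;
static int sketch_resched_checks; /* how often the resched check ran */
static int sketch_schedules;      /* how often a reschedule actually happened */

static void sketch_preempt_schedule(void)
{
    if (sketch_preempt_count != 0)
        return; /* preemption still disabled: nothing to do */
    sketch_schedules++;
}

static void sketch_preempt_enable_no_resched(void)
{
    assert(sketch_preempt_count > 0);
    sketch_preempt_count--;
}

static void sketch_preempt_enable(void)
{
    sketch_preempt_enable_no_resched();
    sketch_resched_checks++; /* the check that the patch avoids */
    sketch_preempt_schedule();
}

static void sketch_lock_bh(void)
{
    sketch_preempt_count += SKETCH_SOFTIRQ_OFFSET; /* disable bottom halves */
    sketch_preempt_count++;                        /* disable preemption */
}

/* Old unlock order: the resched check runs while bottom halves are still off. */
static void sketch_unlock_bh_old(void)
{
    sketch_preempt_enable();
    sketch_preempt_count -= SKETCH_SOFTIRQ_OFFSET;
}

/* New unlock order: the pointless check is skipped. */
static void sketch_unlock_bh_new(void)
{
    sketch_preempt_enable_no_resched();
    sketch_preempt_count -= SKETCH_SOFTIRQ_OFFSET;
}
```

In this model the old path performs one check that can never lead to a reschedule, because the count is still non-zero when it runs; the new path performs none.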
     

21 May, 2005

1 commit

  • This patch removes the entwining of cpusets and hotplug code in the "No
    more Mr. Nice Guy" case of sched.c move_task_off_dead_cpu().

    Since the hotplug code is holding a spinlock at this point, we cannot take
    the cpuset semaphore, cpuset_sem, as would seem to be required either to
    update the task's cpuset, or to scan up the nested cpuset chain, looking for
    the nearest cpuset ancestor that still has some CPUs that are online. So
    we just punt and blast the task's cpus_allowed with all bits allowed.

    This reverts these lines of code to what they were before the cpuset patch.
    And it updates the cpuset Doc file, to match.

    The one known alternative to this that seems to work came from Dinakar
    Guniguntala, and required the hotplug code to take the cpuset_sem semaphore
    much earlier in its processing. So far as we know, the increased locking
    entanglement between cpusets and hot plug of this alternative approach is
    not worth doing in this case.

    Signed-off-by: Paul Jackson
    Acked-by: Nathan Lynch
    Acked-by: Dinakar Guniguntala
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
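
A simplified picture of the fallback described in the entry above, written as user-space C. The bitmask type and the `sketch_` helpers are assumptions for illustration; the real function also considers other placement preferences that are left out here.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* One bit per CPU; a hypothetical stand-in for the kernel's cpumask type. */
typedef unsigned long sketch_cpumask;

/* Index of the lowest set bit, or -1 if the mask is empty. */
static int sketch_first_cpu(sketch_cpumask mask)
{
    int cpu;

    for (cpu = 0; cpu < (int)(sizeof(mask) * CHAR_BIT); cpu++)
        if (mask & (1UL << cpu))
            return cpu;
    return -1;
}

/*
 * Pick a destination CPU for a task whose CPU went offline.
 * If none of the task's allowed CPUs is online, widen cpus_allowed
 * to every CPU instead of walking the cpuset hierarchy, which would
 * need a semaphore that cannot be taken under a spinlock.
 */
static int sketch_pick_dest_cpu(sketch_cpumask *cpus_allowed,
                                sketch_cpumask online)
{
    sketch_cpumask usable;

    assert(cpus_allowed != NULL && online != 0);
    usable = *cpus_allowed & online;
    if (usable == 0) {
        *cpus_allowed = ~0UL; /* all bits allowed */
        usable = online;
    }
    return sketch_first_cpu(usable);
}
```

A task allowed only on an offline CPU ends up with every bit set, while a task that still has an online CPU in its mask keeps its original mask.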
     

18 May, 2005

1 commit

  • This patch includes various tweaks in the messaging that appears during
    system pm state transitions:

    * Warn about certain illegal calls in the device tree, like resuming
    child before parent or suspending parent before child. This could
    happen easily enough through sysfs, or in some cases when drivers
    use device_pm_set_parent().

    * Be more consistent about dev_dbg() tracing ... do it for resume() and
    shutdown() too, and never if the driver doesn't have that method.

    * Say which type of system sleep state is being entered.

    Except for the warnings, these only affect debug messaging.

    Signed-off-by: David Brownell
    Acked-by: Pavel Machek
    Signed-off-by: Greg Kroah-Hartman

    David Brownell
     

17 May, 2005

3 commits

  • profile=schedule parsing is not quite what it should be. First, str[7] is
    'e', not ',', but then even if it did fall through, prof_on =
    SCHED_PROFILING would be clobbered inside if (get_option(...)). So a small
    amount of rearrangement is done in this patch to correct it.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    William Lee Irwin III
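
One way to parse the option so that neither problem from the entry above occurs is sketched below. It is a user-space approximation that uses `strtoul` where the kernel uses get_option(); the enum, struct, and function names are assumptions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum sketch_prof_mode {
    SKETCH_PROF_OFF = 0,
    SKETCH_CPU_PROFILING = 1,
    SKETCH_SCHED_PROFILING = 2
};

struct sketch_prof {
    enum sketch_prof_mode on;
    unsigned long shift;
};

/* Parse "schedule", "schedule,<shift>" or "<shift>". */
static void sketch_profile_setup(const char *str, struct sketch_prof *p)
{
    static const char kw[] = "schedule";
    const size_t n = sizeof(kw) - 1; /* 8 characters, so str[7] is 'e' */
    unsigned long par;
    char *end;

    assert(str != NULL && p != NULL);
    p->on = SKETCH_PROF_OFF;
    p->shift = 0;

    if (strncmp(str, kw, n) == 0) {
        p->on = SKETCH_SCHED_PROFILING;
        if (str[n] != ',')
            return; /* bare keyword: keep the default shift */
        str += n + 1;
        par = strtoul(str, &end, 0);
        if (end != str)
            p->shift = par; /* the mode is not overwritten here */
    } else {
        par = strtoul(str, &end, 0);
        if (end != str) {
            p->shift = par;
            p->on = SKETCH_CPU_PROFILING;
        }
    }
}
```

The separator is looked up at index 8, just past the keyword, and the numeric branch that selects CPU profiling is reached only when the keyword is absent.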
     
  • Move add_preferred_console out of CONFIG_PRINTK so serial console does the
    right thing.

    Signed-off-by: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • On my IA64 machine, after kernel 2.6.12-rc3 boots, an edge-triggered
    interrupt (IRQ 46) keeps triggering over and over again. There is no IRQ 46
    interrupt action handler. This has a large impact on performance.

    Kernel 2.6.10 and earlier versions do not have the problem. Basically,
    kernel 2.6.10 masks the spurious edge interrupt if the interrupt is
    triggered a second time and its status includes
    IRQ_DISABLE|IRQ_PENDING.

    Originally, the IA64 kernel had its own _irq_desc definitions in
    arch/ia64/kernel/irq.c. That definition initialized _irq_desc[irq].status to
    IRQ_DISABLE. In kernel 2.6.11 the definition moved to architecture-independent
    code, i.e. kernel/irq/handle.c, but kernel/irq/handle.c initializes
    _irq_desc[irq].status to 0 instead of IRQ_DISABLE.

    Signed-off-by: Zhang Yanmin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang, Yanmin
     

06 May, 2005

6 commits

  • As per http://www.nist.gov/dads/HTML/shellsort.html, this should be
    referred to as a Shell sort. Shell-Metzner is a misnomer.

    Signed-off-by: Daniel Dickman
    Signed-off-by: Domen Puncer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Domen Puncer
     
  • It seems that the code responsible for this is in kernel/itimer.c:126:

    p->signal->real_timer.expires = jiffies + interval;
    add_timer(&p->signal->real_timer);

    If you request an interval of, let's say, 900 usecs, the interval given by
    timeval_to_jiffies will be 1.

    If you request this when we are half-way between two timer ticks, the
    interval will only give 400 usecs.

    If we want to guarantee that we never ever give intervals less than
    requested, the simple solution would be to change that to:

    p->signal->real_timer.expires = jiffies + interval + 1;

    This, however, will produce pathological cases: an idle system that
    requests 1 ms timeouts will systematically get 2 ms timeouts, whereas
    currently it simply gets a few usecs less than 1 ms.

    The complex (and more computationally expensive) solution would be to
    check the gettimeofday time, and compute the correct number of jiffies.
    This way, if we request a 300 usecs timer 200 usecs inside the timer
    tick, we can wait just one tick, but not if we are 800 usecs inside the
    tick. This would also mean that we would have to lock preemption during
    these computations to avoid races, etc.

    I've searched the archives but couldn't find this particular issue being
    discussed before.

    Attached is a patch to do the simple solution, in case anybody thinks
    that it should be used.

    Signed-Off-By: Paulo Marques
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paulo Marques
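
The arithmetic in the entry above can be checked with a small model. It assumes a 1000 usec tick and assumes that a timer armed for n ticks fires on the n-th tick boundary after the call; the `sketch_` function names are invented for illustration.

```c
#include <assert.h>

/* Delay actually observed when a timer is armed for `ticks` ticks,
 * `offset_us` microseconds after the most recent tick boundary. */
static long sketch_actual_delay_us(long ticks, long offset_us, long tick_us)
{
    assert(tick_us > 0 && ticks >= 0);
    assert(offset_us >= 0 && offset_us < tick_us);
    return ticks * tick_us - offset_us;
}

/* Current behaviour: round the request up to whole ticks. */
static long sketch_ticks_naive(long req_us, long tick_us)
{
    return (req_us + tick_us - 1) / tick_us;
}

/* Simple fix: always add one extra tick. */
static long sketch_ticks_simple_fix(long req_us, long tick_us)
{
    return sketch_ticks_naive(req_us, tick_us) + 1;
}

/* Complex fix: account for how far into the current tick we are. */
static long sketch_ticks_exact(long req_us, long offset_us, long tick_us)
{
    return (offset_us + req_us + tick_us - 1) / tick_us;
}
```

Under these assumptions, a 900 usec request made 600 usecs into a tick waits only 400 usecs today and 1400 usecs with the extra tick. A 1000 usec request made right on a tick boundary waits 2000 usecs with the simple fix, which is the pathological case mentioned above. The exact variant gives one tick for a 300 usec request at offset 200 and two ticks at offset 800.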
     
  • Allow registration of multiple kprobes at an address in an architecture
    agnostic way. Corresponding handlers will be invoked in a sequence. But,
    a kprobe and a jprobe can't (yet) co-exist at the same address.

    Signed-off-by: Ananth N Mavinakayanahalli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ananth N Mavinakayanahalli
     
  • The kernel oopses when unregister_kprobe() is called on a non-registered
    kprobe. This patch fixes the problem by checking whether the probe exists
    before unregistering.

    Signed-off-by: Prasanna S Panchamukhi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Prasanna S Panchamukhi
     
  • While looking at code generated by gcc4.0 I noticed some functions still
    had frame pointers, even after we stopped ppc64 from defining
    CONFIG_FRAME_POINTER. It turns out kernel/Makefile hardwires
    -fno-omit-frame-pointer on when compiling sched.c.

    Create CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER and define it on architectures
    that don't require frame pointers in sched.c code.

    (akpm: blame me for the name)

    Signed-off-by: Anton Blanchard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Blanchard
     
  • The PPC32 kernel puts platform-specific functions into separate sections so
    that unneeded parts of it can be freed when we've booted and actually
    worked out what we're running on today.

    This makes kallsyms ignore those functions, because they're not between
    _[se]text or _[se]inittext. Rather than teaching kallsyms about the
    various pmac/chrp/etc sections, this patch adds '_[se]extratext' markers
    for kallsyms.

    Signed-off-by: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Woodhouse
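
The effect of the extra markers in the entry above can be illustrated with a range check over three address ranges. The struct, the function names, and the option to pass a null extra range (to model the old behaviour) are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* An address range delimited by a start marker and an end marker. */
struct sketch_text_range {
    uintptr_t start;
    uintptr_t end;
};

static int sketch_in_range(uintptr_t addr, const struct sketch_text_range *r)
{
    assert(r != NULL);
    return addr >= r->start && addr <= r->end;
}

/*
 * Returns 1 if addr lies in ordinary text, init text, or the extra
 * text range. Passing NULL for `extra` models a symbol filter that
 * only knows about the first two ranges.
 */
static int sketch_is_kernel_text(uintptr_t addr,
                                 const struct sketch_text_range *text,
                                 const struct sketch_text_range *inittext,
                                 const struct sketch_text_range *extra)
{
    if (sketch_in_range(addr, text) || sketch_in_range(addr, inittext))
        return 1;
    if (extra != NULL && sketch_in_range(addr, extra))
        return 1;
    return 0;
}
```

A single additional range keeps the symbol filter unaware of the individual platform sections, which is the trade-off the patch describes.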
     

05 May, 2005

2 commits


04 May, 2005

2 commits

  • Let's recap the problem. The current asynchronous netlink kernel
    message processing is vulnerable to these attacks:

    1) Hit and run: Attacker sends one or more messages and then exits
    before they're processed. This may confuse/disable the next netlink
    user that gets the netlink address of the attacker since it may
    receive the responses to the attacker's messages.

    Proposed solutions:

    a) Synchronous processing.
    b) Stream mode socket.
    c) Restrict/prohibit binding.

    2) Starvation: Because various netlink rcv functions were written
    to not return until all messages have been processed on a socket,
    it is possible for these functions to execute for an arbitrarily
    long period of time. If this is successfully exploited it could
    also be used to hold rtnl forever.

    Proposed solutions:

    a) Synchronous processing.
    b) Stream mode socket.

    Firstly let's cross off solution c). It only solves the first
    problem and it has user-visible impacts. In particular, it'll
    break user space applications that expect to bind or communicate
    with specific netlink addresses (pid's).

    So we're left with a choice of synchronous processing versus
    SOCK_STREAM for netlink.

    For the moment I'm sticking with the synchronous approach as
    suggested by Alexey since it's simpler and I'd rather spend
    my time working on other things.

    However, it does have a number of deficiencies compared to the
    stream mode solution:

    1) User-space to user-space netlink communication is still vulnerable.

    2) Inefficient use of resources. This is especially true for rtnetlink
    since the lock is shared with other users such as networking drivers.
    The latter could hold the rtnl while communicating with hardware which
    causes the rtnetlink user to wait when it could be doing other things.

    3) It is still possible to DoS all netlink users by flooding the kernel
    netlink receive queue. The attacker simply fills the receive socket
    with a single netlink message that fills up the entire queue. The
    attacker then continues to call sendmsg with the same message in a loop.

    Point 3) can be countered by retransmissions in user-space code, however
    it is pretty messy.

    In light of these problems (in particular, point 3), we should implement
    stream mode netlink at some point. In the mean time, here is a patch
    that implements synchronous processing.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • The patch "MCA recovery improvements" added do_exit to mca_drv.c.
    That's fine when the mca recovery code is built in the kernel
    (CONFIG_IA64_MCA_RECOVERY=y) but breaks building the mca recovery
    code as a module (CONFIG_IA64_MCA_RECOVERY=m).

    Most users are currently building this as a module, as loading
    and unloading the module provides a very convenient way to turn
    on/off error recovery.

    This patch exports do_exit, so mca_drv.c can build as a module.

    Signed-off-by: Russ Anderson (rja@sgi.com)
    Signed-off-by: Tony Luck

    Russ Anderson
     

03 May, 2005

2 commits


01 May, 2005

11 commits

  • Another large rollup of various patches from Adrian which make things static
    where they were needlessly exported.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Some KernelDoc descriptions are updated to match the current code.
    No code changes.

    Signed-off-by: Martin Waitz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Waitz
     
  • I have rebuilt the Linux kernel 2.6.11.5 documentation for myself and our
    university students again. The documentation could be extended to cover
    more sources that carry structured comments in recent 2.6 kernels, and I
    have tried to proceed with that task. I have done this several times since
    2.6.0, and it gets boring to make the same changes again and again. The
    kernel compiles for i386 and ARM targets after the changes. I have added
    references to some more files to the kernel-api book, and I have added some
    section names as well. So please check that the changes do not break
    anything and that the categories are not too skewed.

    I have changed kernel-doc to accept the "fastcall" and "asmlinkage" words
    reserved by kernel convention. Most of the other changes are modifications
    to comments to make kernel-doc happy, so that it accepts some parameter
    descriptions and does not bail out on errors. Changed to @pid in the
    description, moved some #ifdefs before comments to correct the
    function-to-comment bindings, etc.

    You can see the result of the modified documentation build at
    http://cmp.felk.cvut.cz/~pisa/linux/lkdb-2.6.11.tar.gz

    Some more sources are ready to be included in the kernel-doc generated
    documentation. Sources have been added to kernel-api for now. Some more
    section names were added, and probably some more chaos was introduced as a
    result of the quick cleanup work.

    Signed-off-by: Pavel Pisa
    Signed-off-by: Martin Waitz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Pisa
     
  • Convert most of the current code that uses _NSIG directly to instead use
    valid_signal(). This avoids gcc -W warnings and off-by-one errors.

    Signed-off-by: Jesper Juhl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Juhl
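
A helper of the kind described in the entry above could look like this. The constant is a hypothetical stand-in for _NSIG (chosen as 64 here), and the function name is prefixed to make clear it is a sketch rather than the kernel's definition.

```c
/* Hypothetical stand-in for the kernel's _NSIG. */
#define SKETCH_NSIG 64

/*
 * Taking the signal number as an unsigned value means a separate
 * "sig < 0" test is not needed: a negative int converts to a very
 * large unsigned value and fails the single comparison below.
 */
static int sketch_valid_signal(unsigned long sig)
{
    return sig <= SKETCH_NSIG ? 1 : 0;
}
```

Centralising the comparison in one place avoids each call site choosing between `<` and `<=` on its own, which is where off-by-one errors tend to appear.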
     
  • Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Rothwell
     
  • This patch changes calls to synchronize_kernel(), deprecated in the earlier
    "Deprecate synchronize_kernel, GPL replacement" patch to instead call the new
    synchronize_rcu() and synchronize_sched() APIs.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul E. McKenney
     
  • The synchronize_kernel() primitive is used for quite a few different purposes:
    waiting for RCU readers, waiting for NMIs, waiting for interrupts, and so on.
    This makes RCU code harder to read, since synchronize_kernel() might or might
    not have matching rcu_read_lock()s. This patch creates a new
    synchronize_rcu() that is to be used for RCU readers and a new
    synchronize_sched() that is used for the rest. These two new primitives
    currently have the same implementation, but this might well change with
    additional real-time support. Both new primitives are GPL-only, the old
    primitive is deprecated.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul E. McKenney
     
  • The gpl exports need to be put back. Moving them to GPL -- but in a
    measured manner, as I proposed on this list some months ago -- is fine.
    Changing these particular exports precipitously is most definitely -not-
    fine. Here is my earlier proposal:

    http://marc.theaimsgroup.com/?l=linux-kernel&m=110520930301813&w=2

    See below for a patch that puts the exports back, along with an updated
    version of my earlier patch that starts the process of moving them to GPL.
    I will also be following this message with RFC patches that introduce two
    (EXPORT_SYMBOL_GPL) interfaces to replace synchronize_kernel(), which then
    becomes deprecated.

    Signed-off-by:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul E. McKenney
     
  • Arrange for all kernel printks to be no-ops. Only available if
    CONFIG_EMBEDDED.

    This patch saves about 375k on my laptop config and nearly 100k on minimal
    configs.

    Signed-off-by: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
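
The general technique behind the entry above is to replace the logging function with an empty stub at compile time. The sketch below shows that idea in user-space C; `SKETCH_CONFIG_PRINTK` and `sketch_printk` are invented names, and the size savings quoted above are not something this example can reproduce.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical switch standing in for the kernel config option.
 * Leave it undefined to compile the messages away. */
/* #define SKETCH_CONFIG_PRINTK 1 */

#ifdef SKETCH_CONFIG_PRINTK
/* Normal build: format the message and return the character count. */
static int sketch_printk(const char *fmt, ...)
{
    va_list args;
    int n;

    va_start(args, fmt);
    n = vfprintf(stderr, fmt, args);
    va_end(args);
    return n;
}
#else
/* Stubbed build: the call does nothing and returns 0. */
static inline int sketch_printk(const char *fmt, ...)
{
    (void)fmt;
    return 0;
}
#endif
```

Because callers keep the same call syntax in both builds, no call site needs to change; an optimising compiler can typically discard calls to the empty stub.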
     
  • Add a pair of rlimits for allowing non-root tasks to raise nice and rt
    priorities. Defaults to traditional behavior. Originally written by
    Chris Wright.

    The patch implements a simple rlimit ceiling for the RT (and nice) priorities
    a task can set. The rlimit defaults to 0, meaning no change in behavior by
    default. A value of 50 means RT priority levels 1-50 are allowed. A value of
    100 means all 99 priority levels from 1 to 99 are allowed. CAP_SYS_NICE is
    blanket permission.

    (akpm: see http://www.uwsg.iu.edu/hypermail/linux/kernel/0503.1/1921.html for
    tips on integrating this with PAM).

    Signed-off-by: Matt Mackall
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
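
The RT priority ceiling described in the entry above reduces to a single comparison. The sketch below covers only the RT case, not the nice case, and the function and parameter names are assumptions.

```c
#include <assert.h>

/*
 * Returns 1 if a task may set real-time priority `prio`, 0 otherwise.
 * `rlim_rtprio` is the task's ceiling (0 by default, so nothing is
 * allowed without privilege). `has_cap_sys_nice` models the
 * CAP_SYS_NICE capability, which grants blanket permission.
 */
static int sketch_can_set_rtprio(unsigned long rlim_rtprio, int prio,
                                 int has_cap_sys_nice)
{
    assert(prio >= 1 && prio <= 99);
    if (has_cap_sys_nice)
        return 1;
    return (unsigned long)prio <= rlim_rtprio ? 1 : 0;
}
```

With a ceiling of 50, priorities 1 through 50 pass and 51 fails; a ceiling of 100 admits every level up to 99, matching the examples in the commit message.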
     
  • Replace a number of memory barriers with smp_ variants. This means we won't
    take the unnecessary hit on UP machines.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@osdl.org
     

30 Apr, 2005

3 commits

  • It's old sanity checking that may have been useful for debugging, but
    is just bogus these days.

    Noticed by Mattia Belletti.

    Linus Torvalds
     
  • Attached is a new patch that solves the issue of getting valid credentials
    into the LOGIN message. The current code was assuming that the audit context
    had already been copied. This is not always the case for LOGIN messages.

    To solve the problem, the patch passes the task struct to the function that
    emits the message where it can get valid credentials.

    Signed-off-by: Steve Grubb
    Signed-off-by: David Woodhouse

    Steve Grubb
     
  • If netlink_unicast() fails, requeue the skb back at the head of the queue
    it just came from, instead of the tail. And do so unless we've exceeded
    the audit_backlog limit; not according to some other arbitrary limit.

    From: Chris Wright
    Signed-off-by: David Woodhouse

    Chris Wright
     

29 Apr, 2005

5 commits