29 May, 2005
1 commit
-
The "unhandled interrupts" catcher, note_interrupt(), increments a global
desc->irq_count and grossly damages scaling of very large systems, e.g.,
>192p ia64 Altix, because of this highly contented cacheline, especially
for timer interrupts. 384p is severely crippled, and 512p is unuseable.All calls to note_interrupt() can be disabled by booting with "noirqdebug",
but this disables the useful interrupt checking for all interrupts.I propose eliminating note_interrupt() for all per-CPU interrupts. This
was the behavior of linux-2.6.10 and earlier, but in 2.6.11 a code
restructuring added a call to note_interrupt() for per-CPU interrupts.
Besides, note_interrupt() is a bit racy for concurrent CPU calls anyway, as
the desc->irq_count++ increment isn't atomic (which, if done, would make
scaling even worse).Signed-off-by: John Hawkes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
27 May, 2005
1 commit
-
There is a race in the kernel cpuset code, between the code
to handle notify_on_release, and the code to remove a cpuset.
The notify_on_release code can end up trying to access a
cpuset that has been removed. In the most common case, this
causes a NULL pointer dereference from the routine cpuset_path.
However all manner of bad things are possible, in theory at least.The existing code decrements the cpuset use count, and if the
count goes to zero, processes the notify_on_release request,
if appropriate. However, once the count goes to zero, unless we
are holding the global cpuset_sem semaphore, there is nothing to
stop another task from immediately removing the cpuset entirely,
and recycling its memory.The obvious fix would be to always hold the cpuset_sem
semaphore while decrementing the use count and dealing with
notify_on_release. However we don't want to force a global
semaphore into the mainline task exit path, as that might create
a scaling problem.The actual fix is almost as easy - since this is only an issue
for cpusets using notify_on_release, which the top level big
cpusets don't normally need to use, only take the cpuset_sem
for cpusets using notify_on_release.This code has been run for hours without a hiccup, while running
a cpuset create/destroy stress test that could crash the existing
kernel in seconds. This patch applies to the current -linus
git kernel.Signed-off-by: Paul Jackson
Acked-by: Simon Derr
Acked-by: Dinakar Guniguntala
Signed-off-by: Linus Torvalds
25 May, 2005
1 commit
-
If SIGKILL does not have priority, we cannot instantly kill task before it
makes some unexpected job. It can be critical, but we were unable to
reproduce this easily until Heiko Carstens
reported this problem on LKML.Signed-Off-By: Kirill Korotaev
Signed-Off-By: Alexey Kuznetsov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
22 May, 2005
1 commit
-
In _spin_unlock_bh(lock):
do { \
_raw_spin_unlock(lock); \
preempt_enable(); \
local_bh_enable(); \
__release(lock); \
} while (0)there is no reason for using preempt_enable() instead of a simple
preempt_enable_no_resched()Since we know bottom halves are disabled, preempt_schedule() will always
return at once (preempt_count!=0), and hence preempt_check_resched() is
useless here...This fixes it by using "preempt_enable_no_resched()" instead of the
"preempt_enable()", and thus avoids the useless preempt_check_resched()
just before re-enabling bottom halves.Signed-off-by: Samuel Thibault
Signed-off-by: Linus Torvalds
21 May, 2005
1 commit
-
This patch removes the entwining of cpusets and hotplug code in the "No
more Mr. Nice Guy" case of sched.c move_task_off_dead_cpu().Since the hotplug code is holding a spinlock at this point, we cannot take
the cpuset semaphore, cpuset_sem, as would seem to be required either to
update the tasks cpuset, or to scan up the nested cpuset chain, looking for
the nearest cpuset ancestor that still has some CPUs that are online. So
we just punt and blast the tasks cpus_allowed with all bits allowed.This reverts these lines of code to what they were before the cpuset patch.
And it updates the cpuset Doc file, to match.The one known alternative to this that seems to work came from Dinakar
Guniguntala, and required the hotplug code to take the cpuset_sem semaphore
much earlier in its processing. So far as we know, the increased locking
entanglement between cpusets and hot plug of this alternative approach is
not worth doing in this case.Signed-off-by: Paul Jackson
Acked-by: Nathan Lynch
Acked-by: Dinakar Guniguntala
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
18 May, 2005
1 commit
-
This patch includes various tweaks in the messaging that appears during
system pm state transitions:* Warn about certain illegal calls in the device tree, like resuming
child before parent or suspending parent before child. This could
happen easily enough through sysfs, or in some cases when drivers
use device_pm_set_parent().* Be more consistent about dev_dbg() tracing ... do it for resume() and
shutdown() too, and never if the driver doesn't have that method.* Say which type of system sleep state is being entered.
Except for the warnings, these only affect debug messaging.
Signed-off-by: David Brownell
Acked-by: Pavel Machek
Signed-off-by: Greg Kroah-Hartman
17 May, 2005
3 commits
-
profile=schedule parsing is not quite what it should be. First, str[7] is
'e', not ',', but then even if it did fall through, prof_on =
SCHED_PROFILING would be clobbered inside if (get_option(...)) So a small
amount of rearrangement is done in this patch to correct it.Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Move add_preferred_console out of CONFIG_PRINTK so serial console does the
right thing.Signed-off-by: Matt Mackall
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
On my IA64 machine, after kernel 2.6.12-rc3 boots, an edge-triggered
interrupt (IRQ 46) keeps triggered over and over again. There is no IRQ 46
interrupt action handler. It has lots of impact on performance.Kernel 2.6.10 and its prior versions have no the problem. Basically,
kernel 2.6.10 will mask the spurious edge interrupt if the interrupt is
triggered for the second time and its status includes
IRQ_DISABLE|IRQ_PENDING.Originally, IA64 kernel has its own specific _irq_desc definitions in file
arch/ia64/kernel/irq.c. The definition initiates _irq_desc[irq].status to
IRQ_DISABLE. Since kernel 2.6.11, it was moved to architecture independent
codes, i.e. kernel/irq/handle.c, but kernel/irq/handle.c initiates
_irq_desc[irq].status to 0 instead of IRQ_DISABLE.Signed-off-by: Zhang Yanmin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
06 May, 2005
6 commits
-
As per http://www.nist.gov/dads/HTML/shellsort.html, this should be
referred to as a Shell sort. Shell-Metzner is a misnomer.Signed-off-by: Daniel Dickman
Signed-off-by: Domen Puncer
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
It seems that the code responsible for this is in kernel/itimer.c:126:
p->signal->real_timer.expires = jiffies + interval;
add_timer(&p->signal->real_timer);If you request an interval of, lets say 900 usecs, the interval given by
timeval_to_jiffies will be 1.If you request this when we are half-way between two timer ticks, the
interval will only give 400 usecs.If we want to guarantee that we never ever give intervals less than
requested, the simple solution would be to change that to:p->signal->real_timer.expires = jiffies + interval + 1;
This however will produce pathological cases, like having a idle system
being requested 1 ms timeouts will give systematically 2 ms timeouts,
whereas currently it simply gives a few usecs less than 1 ms.The complex (and more computationally expensive) solution would be to
check the gettimeofday time, and compute the correct number of jiffies.
This way, if we request a 300 usecs timer 200 usecs inside the timer
tick, we can wait just one tick, but not if we are 800 usecs inside the
tick. This would also mean that we would have to lock preemption during
these computations to avoid races, etc.I've searched the archives but couldn't find this particular issue being
discussed before.Attached is a patch to do the simple solution, in case anybody thinks
that it should be used.Signed-Off-By: Paulo Marques
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Allow registration of multiple kprobes at an address in an architecture
agnostic way. Corresponding handlers will be invoked in a sequence. But,
a kprobe and a jprobe can't (yet) co-exist at the same address.Signed-off-by: Ananth N Mavinakayanahalli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
kernel oops! when unregister_kprobe() is called on a non-registered
kprobe. This patch fixes the above problem by checking if the probe exists
before unregistering.Signed-off-by: Prasanna S Panchamukhi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
While looking at code generated by gcc4.0 I noticed some functions still
had frame pointers, even after we stopped ppc64 from defining
CONFIG_FRAME_POINTER. It turns out kernel/Makefile hardwires
-fno-omit-frame-pointer on when compiling schedule.c.Create CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER and define it on architectures
that dont require frame pointers in sched.c code.(akpm: blame me for the name)
Signed-off-by: Anton Blanchard
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The PPC32 kernel puts platform-specific functions into separate sections so
that unneeded parts of it can be freed when we've booted and actually
worked out what we're running on today.This makes kallsyms ignore those functions, because they're not between
_[se]text or _[se]inittext. Rather than teaching kallsyms about the
various pmac/chrp/etc sections, this patch adds '_[se]extratext' markers
for kallsyms.Signed-off-by: David Woodhouse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
05 May, 2005
2 commits
04 May, 2005
2 commits
-
Let's recap the problem. The current asynchronous netlink kernel
message processing is vulnerable to these attacks:1) Hit and run: Attacker sends one or more messages and then exits
before they're processed. This may confuse/disable the next netlink
user that gets the netlink address of the attacker since it may
receive the responses to the attacker's messages.Proposed solutions:
a) Synchronous processing.
b) Stream mode socket.
c) Restrict/prohibit binding.2) Starvation: Because various netlink rcv functions were written
to not return until all messages have been processed on a socket,
it is possible for these functions to execute for an arbitrarily
long period of time. If this is successfully exploited it could
also be used to hold rtnl forever.Proposed solutions:
a) Synchronous processing.
b) Stream mode socket.Firstly let's cross off solution c). It only solves the first
problem and it has user-visible impacts. In particular, it'll
break user space applications that expect to bind or communicate
with specific netlink addresses (pid's).So we're left with a choice of synchronous processing versus
SOCK_STREAM for netlink.For the moment I'm sticking with the synchronous approach as
suggested by Alexey since it's simpler and I'd rather spend
my time working on other things.However, it does have a number of deficiencies compared to the
stream mode solution:1) User-space to user-space netlink communication is still vulnerable.
2) Inefficient use of resources. This is especially true for rtnetlink
since the lock is shared with other users such as networking drivers.
The latter could hold the rtnl while communicating with hardware which
causes the rtnetlink user to wait when it could be doing other things.3) It is still possible to DoS all netlink users by flooding the kernel
netlink receive queue. The attacker simply fills the receive socket
with a single netlink message that fills up the entire queue. The
attacker then continues to call sendmsg with the same message in a loop.Point 3) can be countered by retransmissions in user-space code, however
it is pretty messy.In light of these problems (in particular, point 3), we should implement
stream mode netlink at some point. In the mean time, here is a patch
that implements synchronous processing.Signed-off-by: Herbert Xu
Signed-off-by: David S. Miller -
The patch "MCA recovery improvements" added do_exit to mca_drv.c.
That's fine when the mca recovery code is built in the kernel
(CONFIG_IA64_MCA_RECOVERY=y) but breaks building the mca recovery
code as a module (CONFIG_IA64_MCA_RECOVERY=m).Most users are currently building this as a module, as loading
and unloading the module provides a very convenient way to turn
on/off error recovery.This patch exports do_exit, so mca_drv.c can build as a module.
Signed-off-by: Russ Anderson (rja@sgi.com)
Signed-off-by: Tony Luck
03 May, 2005
2 commits
-
When adding more formatted audit data to an skb for delivery to userspace,
the kernel will attempt to reuse an skb that has spare room. However, if
the audit message has already been fragmented to multiple skb's, the search
for spare room in the skb uses the head of the list. This will corrupt the
audit message with trailing bytes being placed midway through the stream.
Fix is to look at the end of the list.Signed-off-by: Chris Wright
Signed-off-by: David Woodhouse
01 May, 2005
11 commits
-
Another large rollup of various patches from Adrian which make things static
where they were needlessly exported.Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Some KernelDoc descriptions are updated to match the current code.
No code changes.Signed-off-by: Martin Waitz
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
I have recompiled Linux kernel 2.6.11.5 documentation for me and our
university students again. The documentation could be extended for more
sources which are equipped by structured comments for recent 2.6 kernels. I
have tried to proceed with that task. I have done that more times from 2.6.0
time and it gets boring to do same changes again and again. Linux kernel
compiles after changes for i386 and ARM targets. I have added references to
some more files into kernel-api book, I have added some section names as well.
So please, check that changes do not break something and that categories are
not too much skewed.I have changed kernel-doc to accept "fastcall" and "asmlinkage" words reserved
by kernel convention. Most of the other changes are modifications in the
comments to make kernel-doc happy, accept some parameters description and do
not bail out on errors. Changed to @pid in the description, moved some
#ifdef before comments to correct function to comments bindings, etc.You can see result of the modified documentation build at
http://cmp.felk.cvut.cz/~pisa/linux/lkdb-2.6.11.tar.gzSome more sources are ready to be included into kernel-doc generated
documentation. Sources has been added into kernel-api for now. Some more
section names added and probably some more chaos introduced as result of quick
cleanup work.Signed-off-by: Pavel Pisa
Signed-off-by: Martin Waitz
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Convert most of the current code that uses _NSIG directly to instead use
valid_signal(). This avoids gcc -W warnings and off-by-one errors.Signed-off-by: Jesper Juhl
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Signed-off-by: Stephen Rothwell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This patch changes calls to synchronize_kernel(), deprecated in the earlier
"Deprecate synchronize_kernel, GPL replacement" patch to instead call the new
synchronize_rcu() and synchronize_sched() APIs.Signed-off-by: Paul E. McKenney
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The synchronize_kernel() primitive is used for quite a few different purposes:
waiting for RCU readers, waiting for NMIs, waiting for interrupts, and so on.
This makes RCU code harder to read, since synchronize_kernel() might or might
not have matching rcu_read_lock()s. This patch creates a new
synchronize_rcu() that is to be used for RCU readers and a new
synchronize_sched() that is used for the rest. These two new primitives
currently have the same implementation, but this is might well change with
additional real-time support. Both new primitives are GPL-only, the old
primitive is deprecated.Signed-off-by: Paul E. McKenney
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The gpl exports need to be put back. Moving them to GPL -- but in a
measured manner, as I proposed on this list some months ago -- is fine.
Changing these particular exports precipitously is most definitely -not-
fine. Here is my earlier proposal:http://marc.theaimsgroup.com/?l=linux-kernel&m=110520930301813&w=2
See below for a patch that puts the exports back, along with an updated
version of my earlier patch that starts the process of moving them to GPL.
I will also be following this message with RFC patches that introduce two
(EXPORT_SYMBOL_GPL) interfaces to replace synchronize_kernel(), which then
becomes deprecated.Signed-off-by:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Arrange for all kernel printks to be no-ops. Only available if
CONFIG_EMBEDDED.This patch saves about 375k on my laptop config and nearly 100k on minimal
configs.Signed-off-by: Matt Mackall
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Add a pair of rlimits for allowing non-root tasks to raise nice and rt
priorities. Defaults to traditional behavior. Originally written by
Chris Wright.The patch implements a simple rlimit ceiling for the RT (and nice) priorities
a task can set. The rlimit defaults to 0, meaning no change in behavior by
default. A value of 50 means RT priority levels 1-50 are allowed. A value of
100 means all 99 privilege levels from 1 to 99 are allowed. CAP_SYS_NICE is
blanket permission.(akpm: see http://www.uwsg.iu.edu/hypermail/linux/kernel/0503.1/1921.html for
tips on integrating this with PAM).Signed-off-by: Matt Mackall
Acked-by: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Replace a number of memory barriers with smp_ variants. This means we won't
take the unnecessary hit on UP machines.Signed-off-by: Anton Blanchard
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
30 Apr, 2005
3 commits
-
It's old sanity checking that may have been useful for debugging, but
is just bogus these days.Noticed by Mattia Belletti.
-
Attached is a new patch that solves the issue of getting valid credentials
into the LOGIN message. The current code was assuming that the audit context
had already been copied. This is not always the case for LOGIN messages.To solve the problem, the patch passes the task struct to the function that
emits the message where it can get valid credentials.Signed-off-by: Steve Grubb
Signed-off-by: David Woodhouse -
If netlink_unicast() fails, requeue the skb back at the head of the queue
it just came from, instead of the tail. And do so unless we've exceeded
the audit_backlog limit; not according to some other arbitrary limit.From: Chris Wright
Signed-off-by: David Woodhouse
29 Apr, 2005
5 commits
-
Most audit control messages are sent over netlink.In order to properly
log the identity of the sender of audit control messages, we would like
to add the loginuid to the netlink_creds structure, as per the attached
patch.Signed-off-by: Serge Hallyn
Signed-off-by: David Woodhouse -
They don't seem to work correctly (investigation ongoing), but we don't
actually need to do it anyway.Patch from Peter Martuccelli
Signed-off-by: David Woodhouse -
Attached is a patch that corrects a signed/unsigned warning. I also noticed
that we needlessly init serial to 0. That only needs to occur if the kernel
was compiled without the audit system.-Steve Grubb
Signed-off-by: David Woodhouse
-
We were calling ptrace_notify() after auditing the syscall and arguments,
but the debugger could have _changed_ them before the syscall was actually
invoked. Reorder the calls to fix that.While we're touching ever call to audit_syscall_entry(), we also make it
take an extra argument: the architecture of the syscall which was made,
because some architectures allow more than one type of syscall.Also add an explicit success/failure flag to audit_syscall_exit(), for
the benefit of architectures which return that in a condition register
rather than only returning a single register.Change type of syscall return value to 'long' not 'int'.
Signed-off-by: David Woodhouse