11 Dec, 2006
27 commits
-
Mostly changing alignment. Just some general cleanup.
[akpm@osdl.org: build fix]
Signed-off-by: Daniel Walker
Acked-by: John Stultz
Cc: Thomas Gleixner
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Simply adds some ifdefs to remove the clocksource sysfs code when CONFIG_SYSFS isn't turned on.
Signed-off-by: Daniel Walker
Acked-by: John Stultz
Cc: Thomas Gleixner
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
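A minimal sketch of the shape of this change, assuming the usual conditional-compilation pattern (the function name is illustrative, not from the patch):
```c
/* Compile the clocksource sysfs interface out entirely when CONFIG_SYSFS=n. */
#ifdef CONFIG_SYSFS
static int __init init_clocksource_sysfs(void)
{
        /* register the sysfs device and its attribute files here */
        return 0;
}
device_initcall(init_clocksource_sysfs);
#endif /* CONFIG_SYSFS */
```
-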
Introduce a round_jiffies() function as well as a round_jiffies_relative()
function. These functions round a jiffies value to the next whole second.
The primary purpose of this rounding is to cause all "we don't care exactly
when" timers to happen at the same jiffy.This avoids multiple timers firing within the second for no real reason;
with dynamic ticks these extra timers cause wakeups from deep sleep CPU
sleep states and thus waste power.The exact wakeup moment is skewed by the cpu number, to avoid all cpus from
waking up at the exact same time (and hitting the same lock/cachelines
there)[akpm@osdl.org: fix variable type]
Signed-off-by: Arjan van de Ven
Cc: Thomas Gleixner
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
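A minimal usage sketch of the new helper; the timer itself is hypothetical and assumed to be initialized elsewhere:
```c
#include <linux/timer.h>
#include <linux/jiffies.h>

static struct timer_list housekeeping_timer;    /* assumed set up elsewhere */

static void arm_housekeeping_timer(void)
{
        /*
         * Fire about 5 seconds from now, rounded to the next whole second
         * (and skewed per-cpu internally), so this timer coalesces with
         * other "we don't care exactly when" timers.
         */
        mod_timer(&housekeeping_timer, round_jiffies(jiffies + 5 * HZ));
}
```
-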
An fdtable can either be embedded inside a files_struct or standalone (after
being expanded). When an fdtable is being discarded after all RCU references
to it have expired, we must either free it directly, in the standalone case,
or free the files_struct it is contained within, in the embedded case.
Currently the free_files field controls this behavior, but we can get rid of
it entirely, as all the necessary information is already recorded. We can
distinguish embedded and standalone fdtables using max_fds, and if it is
embedded we can divine the relevant files_struct using container_of().
Signed-off-by: Vadim Lobanov
Cc: Christoph Hellwig
Cc: Al Viro
Cc: Dipankar Sarma
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
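A sketch of the discard logic this describes: a default-capacity fdtable must be the embedded one, so recover and free its container instead (free_fd_arrays() is a hypothetical stand-in for the calls that release the dynamic arrays):
```c
static void fdtable_discard(struct fdtable *fdt)
{
        if (fdt->max_fds <= NR_OPEN_DEFAULT) {
                /* embedded: free the enclosing files_struct */
                struct files_struct *files =
                        container_of(fdt, struct files_struct, fdtab);
                kmem_cache_free(files_cachep, files);
        } else {
                /* standalone: free the dynamic arrays, then the fdtable */
                free_fd_arrays(fdt);    /* hypothetical helper */
                kfree(fdt);
        }
}
```
-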
Currently, each fdtable supports three dynamically-sized arrays of data: the
fdarray and two fdsets. The code allows the number of fds supported by the
fdarray (fdtable->max_fds) to differ from the number of fds supported by each
of the fdsets (fdtable->max_fdset).
In practice, it is wasteful for these two sizes to differ: whenever we hit a
limit on the smaller-capacity structure, we will reallocate the entire fdtable
and all the dynamic arrays within it, so any delta in the memory used by the
larger-capacity structure will never be touched at all.
Rather than hogging this excess, we shouldn't even allocate it in the first
place, and keep the capacities of the fdarray and the fdsets equal. This
patch removes fdtable->max_fdset. As an added bonus, most of the supporting
code becomes simpler.
Signed-off-by: Vadim Lobanov
Cc: Christoph Hellwig
Cc: Al Viro
Cc: Dipankar Sarma
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
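For illustration, the fdtable then looks roughly like this, with a single capacity field governing the fdarray and both fdsets (a sketch of the shape, not the exact declaration):
```c
struct fdtable {
        unsigned int max_fds;   /* one capacity for fd[] and both fdsets */
        struct file **fd;       /* current fd array */
        fd_set *close_on_exec;
        fd_set *open_fds;
        struct rcu_head rcu;
        struct fdtable *next;
};
```
-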
The dup_fd() function creates a new files_struct and fdtable embedded inside
that files_struct, and then possibly expands the fdtable using expand_files().
The out_release error path is invoked when expand_files() returns an error
code. However, when this attempt to expand fails, the fdtable is left in its
original embedded form, so it is pointless to try to free the associated
fdarray and fdsets.
Signed-off-by: Vadim Lobanov
Cc: Dipankar Sarma
Cc: Christoph Hellwig
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
[akpm@osdl.org: additional cleanups]
Signed-off-by: Miguel Ojeda Sandonis
Acked-by: Ingo Molnar
Cc: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
An RT task does not participate in interactivity priority and thus shouldn't be bothered with timestamp and p->sleep_type manipulation when the task is being put on the run queue. Bypass all of them with a single if (rt_task) test.
Signed-off-by: Ken Chen
Acked-by: Ingo Molnar
Cc: Nick Piggin
Cc: "Siddha, Suresh B"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
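The bypass amounts to something like this at the top of the enqueue path (a sketch; the surrounding function is elided):
```c
if (rt_task(p)) {
        /* RT priority is static: skip the timestamp and
         * p->sleep_type bookkeeping entirely */
        __activate_task(p, rq);
        return;
}
/* ... interactivity accounting for normal tasks follows ... */
```
-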
Remove scheduler stats lb_stopbalance counter. This counter can be
calculated by: lb_balanced - lb_nobusyg - lb_nobusyq. There is no need to create a gazillion counters when we can derive the value.
Signed-off-by: Ken Chen
Signed-off-by: Suresh Siddha
Acked-by: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Currently at a particular domain, each cpu in the sched group will do a
load balance at the frequency of balance_interval. The more cores and threads, the more cpus there will be in each sched group at the SMP and NUMA domains, and we end up spending quite a bit of time doing load balancing in those domains.
Fix this by making only one cpu (the first idle cpu, or the first cpu in the group if all the cpus are busy) in the sched group do the load balance at that particular sched domain; this load will slowly percolate down to the other cpus within that group (when they do load balancing at lower domains).
Signed-off-by: Suresh Siddha
Cc: Christoph Lameter
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Co-opt rq->timestamp_last_tick to maintain a cache_hot_time evaluation reference timestamp at both tick and sched times. This prevents said reference, formerly rq->timestamp_last_tick, from being behind task->last_ran at evaluation time, and moves it closer to the current time on the remote processor, the intent being to improve cache-hot evaluation and timestamp adjustment accuracy for task migration.
Also fix a minor sched_time double-accounting error which occurs when a task passing through schedule() does not schedule off and takes the next timer tick.
[kenneth.w.chen@intel.com: cleanup]
Signed-off-by: Mike Galbraith
Acked-by: Ingo Molnar
Acked-by: Ken Chen
Cc: Don Mullis
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Large sched domains can be very expensive to scan. Add an option SD_SERIALIZE
to the sched domain flags. If that flag is set then we make sure that no
other such domain is being balanced.
[akpm@osdl.org: build fix]
Signed-off-by: Christoph Lameter
Cc: Peter Williams
Cc: Nick Piggin
Cc: Christoph Lameter
Cc: "Siddha, Suresh B"
Cc: "Chen, Kenneth W"
Acked-by: Ingo Molnar
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
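The serialization can be a simple global trylock, as in this sketch modeled on the description (the wrapper function is illustrative):
```c
static DEFINE_SPINLOCK(balancing);

static void balance_domain(int this_cpu, struct rq *this_rq,
                           struct sched_domain *sd, enum idle_type idle)
{
        if ((sd->flags & SD_SERIALIZE) && !spin_trylock(&balancing))
                return;         /* another cpu is balancing such a domain */
        load_balance(this_cpu, this_rq, sd, idle);
        if (sd->flags & SD_SERIALIZE)
                spin_unlock(&balancing);
}
```
-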
Trigger softirq less frequently
Before this patch we trigger the softirq at an offset of sd->interval. However, if the queue is busy then it is sufficient to schedule the softirq with sd->interval * busy_factor.
So we modify the calculation of the next time to balance by taking the interval added to last_balance again. This is only the right value if the idle/busy situation continues as is.
There are two potential trouble spots:
- If the queue was idle and now gets busy then we call rebalance early. However, that is not a problem because we will then use the longer interval for the next period.
- If the queue was busy and becomes idle then we potentially wait too long before rebalancing. However, when the task goes idle then idle_balance is called. We add another calculation of the next balance time based on sd->interval in idle_balance so that we will rebalance soon.
V2->V3:
- Calculate the rebalance time based on current jiffies and not based on the jiffies at the last time we load balanced. We no longer rely on staggering and therefore we can afford to do this now.
V3->V4:
- Use functions to do jiffy comparisons.
Signed-off-by: Christoph Lameter
Cc: Peter Williams
Cc: Nick Piggin
Cc: Christoph Lameter
Cc: "Siddha, Suresh B"
Cc: "Chen, Kenneth W"
Acked-by: Ingo Molnar
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
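A sketch of the next-balance computation described above (variable names follow sched.c; the helper wrapper is hypothetical):
```c
static void maybe_balance(int this_cpu, struct rq *this_rq,
                          struct sched_domain *sd, enum idle_type idle)
{
        unsigned long interval = sd->balance_interval;

        if (idle != SCHED_IDLE)
                interval *= sd->busy_factor;    /* busy queues balance less often */

        /* balance_interval is in ms; convert, keeping at least one tick */
        interval = msecs_to_jiffies(interval);
        if (unlikely(!interval))
                interval = 1;

        if (time_after_eq(jiffies, sd->last_balance + interval)) {
                load_balance(this_cpu, this_rq, sd, idle);
                /* V2->V3: base the next period on current jiffies */
                sd->last_balance = jiffies;
        }
}
```
-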
Call rebalance_tick (renamed to run_rebalance_domains) from a newly introduced
softirq.
We calculate the earliest time for each layer of sched domains to be rescanned (this is the rescan time for idle) and use the earliest of those to schedule the softirq via a new field "next_balance" added to struct rq.
Signed-off-by: Christoph Lameter
Cc: Peter Williams
Cc: Nick Piggin
Cc: Christoph Lameter
Cc: "Siddha, Suresh B"
Cc: "Chen, Kenneth W"
Acked-by: Ingo Molnar
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
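A sketch of the wiring; SCHED_SOFTIRQ, run_rebalance_domains() and rq->next_balance are the names this introduces, while the two wrapper functions marking the init and tick sites are hypothetical:
```c
static void run_rebalance_domains(struct softirq_action *h)
{
        /* walk this cpu's sched domains, balance those that are due,
         * and record the earliest future rescan in rq->next_balance */
}

static void sched_init_balancing(void)         /* hypothetical init site */
{
        open_softirq(SCHED_SOFTIRQ, run_rebalance_domains, NULL);
}

static void tick_check_balance(struct rq *rq)  /* hypothetical tick hook */
{
        if (time_after_eq(jiffies, rq->next_balance))
                raise_softirq(SCHED_SOFTIRQ);
}
```
-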
Perform the idle state determination in rebalance_tick.
If we separate balancing from sched_tick then we also need to determine the
idle state in rebalance_tick.
V2->V3:
- Remove useless idle != 0 check. Checking nr_running seems to be sufficient. Thanks Suresh.
Signed-off-by: Christoph Lameter
Cc: Peter Williams
Cc: Nick Piggin
Cc: Christoph Lameter
Cc: "Siddha, Suresh B"
Cc: "Chen, Kenneth W"
Acked-by: Ingo Molnar
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
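The determination itself reduces to a one-liner along these lines (a sketch):
```c
/* at the top of rebalance_tick(), roughly: no runnable tasks on this
 * runqueue means we are balancing from idle */
enum idle_type idle = this_rq->nr_running ? NOT_IDLE : SCHED_IDLE;
```
-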
A load calculation is always done in rebalance_tick() in addition to the real
load balancing activities that only take place when certain jiffie counts have
been reached. Move that processing into a separate function and call it
directly from scheduler_tick().
Also extract the time slice handling from scheduler_tick and put it into a separate function. Then we can clean up scheduler_tick significantly. It will no longer have any gotos.
Signed-off-by: Christoph Lameter
Cc: Peter Williams
Cc: Nick Piggin
Cc: Christoph Lameter
Cc: "Siddha, Suresh B"
Cc: "Chen, Kenneth W"
Acked-by: Ingo Molnar
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
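After the refactor, scheduler_tick() becomes a straight-line sequence of helpers, roughly as below (a sketch of the shape; helper bodies elided):
```c
void scheduler_tick(void)
{
        struct rq *rq = this_rq();
        struct task_struct *p = current;

        update_cpu_load(rq);                    /* load calculation, every tick */
        if (p != rq->idle)
                task_running_tick(rq, p);       /* time-slice handling */
}
```
-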
Interrupts must be disabled for runqueue locks if we want to run load_balance() with interrupts enabled.
Signed-off-by: Christoph Lameter
Cc: Peter Williams
Cc: Nick Piggin
Cc: Christoph Lameter
Cc: "Siddha, Suresh B"
Cc: "Chen, Kenneth W"
Acked-by: Ingo Molnar
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Timer interrupts already are staggered. We do not need an additional layer of
time staggering for short load balancing actions that take a reasonably small
portion of the time slice.
For load balancing on large sched_domains we will add a serialization later that avoids concurrent load balance operations and thus has the same effect as load staggering.
Signed-off-by: Christoph Lameter
Cc: Peter Williams
Cc: Nick Piggin
Cc: Christoph Lameter
Cc: "Siddha, Suresh B"
Cc: "Chen, Kenneth W"
Acked-by: Ingo Molnar
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Avoid taking the runqueue lock in wake_priority_sleeper() if there are no running processes.
Signed-off-by: Christoph Lameter
Cc: Peter Williams
Cc: Nick Piggin
Cc: Christoph Lameter
Cc: "Siddha, Suresh B"
Cc: "Chen, Kenneth W"
Cc: KAMEZAWA Hiroyuki
Acked-by: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
move_task_off_dead_cpu() requires interrupts to be disabled, while migrate_dead() calls it with interrupts enabled. Added appropriate comments to the functions and added BUG_ON(!irqs_disabled()) to double_rq_lock() and double_lock_balance(), which are the origin of such bugs.
Signed-off-by: Kirill Korotaev
Acked-by: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
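The guard looks like this in double_rq_lock() (a sketch; double_lock_balance() gets the same check):
```c
static void double_rq_lock(struct rq *rq1, struct rq *rq2)
{
        BUG_ON(!irqs_disabled());       /* callers must have irqs off */
        if (rq1 == rq2) {
                spin_lock(&rq1->lock);
        } else if (rq1 < rq2) {         /* lock in address order: no ABBA */
                spin_lock(&rq1->lock);
                spin_lock(&rq2->lock);
        } else {
                spin_lock(&rq2->lock);
                spin_lock(&rq1->lock);
        }
}
```
-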
Move the sched group allocations to percpu area. This will minimize cross
node memory references and also cleans up the sched groups allocation for
allnodes sched domain.
Signed-off-by: Suresh Siddha
Acked-by: Ingo Molnar
Acked-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Acked-by: Ingo Molnar
Signed-off-by: Robert P. J. Day
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Deliver IO accounting via taskstats.
Cc: Jay Lan
Cc: Shailabh Nagar
Cc: Balbir Singh
Cc: Chris Sturtivant
Cc: Tony Ernst
Cc: Guillaume Thouvenin
Cc: David Wright
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
The present per-task IO accounting isn't very useful. It simply counts the
number of bytes passed into read() and write(). So if a process reads 1MB
from an already-cached file, it is accused of having performed 1MB of I/O,
which is wrong.
(David Wright had some comments on the applicability of the present logical IO accounting:
For billing purposes it is useless, but for workload analysis it is very useful:
read_bytes/read_calls: average read request size
write_bytes/write_calls: average write request size
read_bytes/read_blocks: ie logical/physical, can indicate hit rate or thrashing
write_bytes/write_blocks: ie logical/physical, a guess since pdflush writes can be missed
I often look for logical larger than physical to see filesystem cache problems. And the bytes/cpusec can help find applications that are dominating the cache and causing slow interactive response from page cache contention.
I want to find the IO intensive applications and make sure they are doing efficient IO. Thus the acctcms(sysV) or csacms command would give the high IO commands.)
This patchset adds new accounting which tries to be more accurate. We account for three things:
reads:
attempt to count the number of bytes which this process really did cause to be fetched from the storage layer. Done at the submit_bio() level, so it is accurate for block-backed filesystems. I also attempt to wire up NFS and CIFS.
writes:
attempt to count the number of bytes which this process caused to be sent to the storage layer. This is done at page-dirtying time.
The big inaccuracy here is truncate. If a process writes 1MB to a file and then deletes the file, it will in fact perform no writeout. But it will have been accounted as having caused 1MB of write.
So...
cancelled_writes:
account the number of bytes which this process caused to not happen, by truncating pagecache.
We _could_ just subtract this from the process's `write' accounting. But that means that some processes would be reported to have done negative amounts of write IO, which is silly.
So we just report the raw number and punt this decision up to userspace.
Now, we _could_ account for writes at the physical I/O level. But:
- This would require that we track memory-dirtying tasks at the per-page level (it would require a new pointer in struct page).
- It would mean that IO statistics for a process are usually only available long after that process has exited, which means that we probably cannot communicate this info via taskstats.
This patch:
Wire up the kernel-private data structures and the accessor functions to manipulate them.
Cc: Jay Lan
Cc: Shailabh Nagar
Cc: Balbir Singh
Cc: Chris Sturtivant
Cc: Tony Ernst
Cc: Guillaume Thouvenin
Cc: David Wright
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
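In sketch form, the kernel-private structure and one accessor being wired up (field names follow the description above; treat this as an illustration, not the exact header):
```c
struct task_io_accounting {
#ifdef CONFIG_TASK_IO_ACCOUNTING
        u64 read_bytes;                 /* bytes really fetched from storage */
        u64 write_bytes;                /* bytes sent toward storage, counted
                                         * at page-dirtying time */
        u64 cancelled_write_bytes;      /* dirty pagecache truncated away
                                         * before it was ever written */
#endif
};

/* called on behalf of current from the submit_bio() paths: */
static inline void task_io_account_read(size_t bytes)
{
#ifdef CONFIG_TASK_IO_ACCOUNTING
        current->ioac.read_bytes += bytes;
#endif
}
```
-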
Signed-off-by: Alexey Dobriyan
Cc: Andi Kleen
Cc: "David S. Miller"
Cc: David Howells
Cc: Ralf Baechle
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
kernel.cap-bound uses only OP_SET and OP_AND
Signed-off-by: Alexey Dobriyan
Cc: "Eric W. Biederman"
Cc: Chris Wright
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
When CONFIG_PROC_FS=n and CONFIG_PROC_SYSCTL=n but CONFIG_SYSVIPC=y, we get
this build error:
kernel/built-in.o:(.data+0xc38): undefined reference to `proc_ipc_doulongvec_minmax'
kernel/built-in.o:(.data+0xc88): undefined reference to `proc_ipc_doulongvec_minmax'
kernel/built-in.o:(.data+0xcd8): undefined reference to `proc_ipc_dointvec'
kernel/built-in.o:(.data+0xd28): undefined reference to `proc_ipc_dointvec'
kernel/built-in.o:(.data+0xd78): undefined reference to `proc_ipc_dointvec'
kernel/built-in.o:(.data+0xdc8): undefined reference to `proc_ipc_dointvec'
kernel/built-in.o:(.data+0xe18): undefined reference to `proc_ipc_dointvec'
make: *** [vmlinux] Error 1
Signed-off-by: Randy Dunlap
Acked-by: Eric Biederman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
10 Dec, 2006
1 commit
-
Use direct assignment rather than cmpxchg() as the latter is unavailable
and unimplementable on some platforms and is actually unnecessary.
The use of cmpxchg() was to guard against two possibilities, neither of which can actually occur:
(1) The pending flag may have been unset or may be cleared. However, given where it's called, the pending flag is _always_ set. I don't think it can be unset whilst we're in set_wq_data().
Once the work is enqueued to be actually run, the only way off the queue is for it to be actually run.
If it's a delayed work item, then the bit can't be cleared by the timer because we haven't started the timer yet. Also, the pending bit can't be cleared by cancelling the delayed work _until_ the work item has had its timer started.
(2) The workqueue pointer might change. This can only happen in two cases:
(a) The work item has just been queued to actually run, and so we're protected by the appropriate workqueue spinlock.
(b) A delayed work item is being queued, and so the timer hasn't been started yet, and so no one else knows about the work item or can access it (the pending bit protects us).
Besides, set_wq_data() _sets_ the workqueue pointer unconditionally, so it can be assigned instead.
So, replacing the set_wq_data() with a straight assignment would be okay in most cases.
The problem is where we end up tangling with test_and_set_bit() emulated using spinlocks, and even then it's not a problem _provided_ test_and_set_bit() doesn't attempt to modify the word if the bit was set.
If that's a problem, then a bitops-proofed assignment will be required - equivalent to atomic_set() vs other atomic_xxx() ops.
Signed-off-by: David Howells
Signed-off-by: Linus Torvalds
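A sketch of the resulting helper, assuming the workqueue.c names of that era:
```c
static inline void set_wq_data(struct work_struct *work,
                               struct cpu_workqueue_struct *cwq)
{
        unsigned long new;

        BUG_ON(!work_pending(work));    /* pending is always set here */

        /* a plain store suffices: the pointer is set unconditionally
         * and the pending bit is re-asserted, so no cmpxchg() is needed */
        new = (unsigned long)cwq | (1UL << WORK_STRUCT_PENDING);
        atomic_long_set(&work->data, new);
}
```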
09 Dec, 2006
12 commits
-
Currently there is a regression and the ipc sysctls don't show up in the
binary sysctl namespace.
This patch adds sysctl_ipc_data to read/write data from the appropriate namespace and deliver it in the expected manner.
[akpm@osdl.org: warning fix]
Signed-off-by: Eric W. Biederman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Refactor the ipc sysctl support so that it is simpler, more readable, and
prepares for fixing the bug with the wrong values being returned in the
sys_sysctl interface.
The function proc_do_ipc_string() was misnamed, as it never handled strings. Its magic of when to work with strings and when to work with longs belonged in the sysctl table. I couldn't tell if the code would work if you disabled the ipc namespace, but it certainly looked like it would have problems.
Signed-off-by: Eric W. Biederman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
The problem: When using sys_sysctl we don't read the proper values for the
variables exported from the uts namespace, nor do we do the proper locking.
This patch introduces sysctl_uts_string which properly fetches the values and does the proper locking.
Signed-off-by: Eric W. Biederman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
The binary interface to the namespace sysctls was never implemented resulting
in some really weird things if you attempted to use sys_sysctl to read your
hostname for example.
This patch series simplifies the code a little and implements the binary sysctl interface.
In testing this patch series I discovered that our 32bit compatibility for the binary sysctl interface is imperfect. In particular KERN_SHMMAX and KERN_SHMALL are size_t sized quantities and are returned as 8 bytes to 32bit binaries using an x86_64 kernel. However this has existed for a long time, so it is not a new regression with the namespace work.
Gads, the whole sysctl thing needs work before it stops being easy to shoot yourself in the foot.
Looking forward a little bit, we need a better way to handle sysctls and namespaces, as our current technique will not work for the network namespace. I think something based on the current overlapping sysctl trees will work, but the proc side needs to be redone before we can use it.
This patch:
Introduce get_uts() and put_uts() (used later) and remove most of the special cases for when the UTS namespace is compiled in.
Signed-off-by: Eric W. Biederman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
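A sketch of the two helpers, modeled on the description: offset the sysctl table's data pointer into the current namespace's copy, under uts_sem:
```c
static void *get_uts(ctl_table *table, int write)
{
        char *which = table->data;
#ifdef CONFIG_UTS_NS
        struct uts_namespace *uts_ns = current->nsproxy->uts_ns;
        which = (which - (char *)&init_uts_ns) + (char *)uts_ns;
#endif
        if (!write)
                down_read(&uts_sem);
        else
                down_write(&uts_sem);
        return which;
}

static void put_uts(ctl_table *table, int write, void *which)
{
        if (!write)
                up_read(&uts_sem);
        else
                up_write(&uts_sem);
}
```
-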
All members of the process group have the same sid and it can't be == 0.
NOTE: this code (and a similar one in sys_setpgid) was needed because it was possible to have ->session == 0. It's not possible any longer since:
[PATCH] pidhash: don't use zero pids
Commit: c7c6464117a02b0d54feb4ebeca4db70fa493678
Signed-off-by: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
All tasks in the process group have the same sid, we don't need to iterate
them all to check that the caller of sys_setpgid() doesn't change its
session.
Signed-off-by: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Add a per pid_namespace child-reaper. This is needed so processes are reaped
within the same pid space and do not spill over to the parent pid space. It's also needed so containers preserve the existing semantic that pid == 1 would reap orphaned children.
This is based on Eric Biederman's patch: http://lkml.org/lkml/2006/2/6/285
Signed-off-by: Sukadev Bhattiprolu
Signed-off-by: Cedric Le Goater
Cc: Kirill Korotaev
Cc: Eric W. Biederman
Cc: Herbert Poetzl
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
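In sketch form, the namespace gains a reaper pointer and orphan reparenting targets it (the struct layout is illustrative):
```c
struct pid_namespace {
        struct kref kref;
        struct pidmap pidmap[PIDMAP_ENTRIES];
        int last_pid;
        struct task_struct *child_reaper;       /* "init" of this pid space */
};

/* hypothetical usage when reparenting an orphan: */
reaper = tsk->nsproxy->pid_ns->child_reaper;
```
-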
Signed-off-by: Cedric Le Goater
Cc: Kirill Korotaev
Cc: Eric W. Biederman
Cc: Herbert Poetzl
Cc: Sukadev Bhattiprolu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Add the pid namespace framework to the nsproxy object. The copy of the pid
namespace only increases the refcount on the global pid namespace,
init_pid_ns, and unshare is not implemented.
There is no configuration option to activate or deactivate this feature because it is not relevant for the moment.
Signed-off-by: Cedric Le Goater
Cc: Kirill Korotaev
Cc: Eric W. Biederman
Cc: Herbert Poetzl
Cc: Sukadev Bhattiprolu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Rename struct pspace to struct pid_namespace for consistency with other
namespaces (uts_namespace and ipc_namespace). Also rename
include/linux/pspace.h to include/linux/pid_namespace.h and variables from
pspace to pid_ns.
Signed-off-by: Sukadev Bhattiprolu
Signed-off-by: Cedric Le Goater
Cc: Kirill Korotaev
Cc: Eric W. Biederman
Cc: Herbert Poetzl
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Add an identifier to nsproxy. The default init_ns_proxy has identifier 0 and
allocated nsproxies are given -1.
This identifier will be used by a new syscall sys_bind_ns.
Signed-off-by: Cedric Le Goater
Cc: Kirill Korotaev
Cc: Eric W. Biederman
Cc: Herbert Poetzl
Cc: Sukadev Bhattiprolu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Rename 'struct namespace' to 'struct mnt_namespace' to avoid confusion with other namespaces being developed for containers: pid, uts, ipc, etc. 'namespace' variables and attributes are also renamed to 'mnt_ns'.
Signed-off-by: Kirill Korotaev
Signed-off-by: Cedric Le Goater
Cc: Eric W. Biederman
Cc: Herbert Poetzl
Cc: Sukadev Bhattiprolu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds