Doug / smarc-fsl-linux-kernel | Embedian Git Server

21 Oct, 2008

2 commits

9301975ec Merge branch 'genirq-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel… ... Browse Code »

…/git/tip/linux-2.6-tip

This merges branches irq/genirq, irq/sparseirq-v4, timers/hpet-percpu
and x86/uv.

The sparseirq branch is just preliminary groundwork: no sparse IRQs are
actually implemented by this tree anymore - just the new APIs are added
while keeping the old way intact as well (the new APIs map 1:1 to
irq_desc[]). The 'real' sparse IRQ support will then be a relatively
small patch ontop of this - with a v2.6.29 merge target.

* 'genirq-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (178 commits)
genirq: improve include files
intr_remapping: fix typo
io_apic: make irq_mis_count available on 64-bit too
genirq: fix name space collisions of nr_irqs in arch/*
genirq: fix name space collision of nr_irqs in autoprobe.c
genirq: use iterators for irq_desc loops
proc: fixup irq iterator
genirq: add reverse iterator for irq_desc
x86: move ack_bad_irq() to irq.c
x86: unify show_interrupts() and proc helpers
x86: cleanup show_interrupts
genirq: cleanup the sparseirq modifications
genirq: remove artifacts from sparseirq removal
genirq: revert dynarray
genirq: remove irq_to_desc_alloc
genirq: remove sparse irq code
genirq: use inline function for irq_to_desc
genirq: consolidate nr_irqs and for_each_irq_desc()
x86: remove sparse irq from Kconfig
genirq: define nr_irqs for architectures with GENERIC_HARDIRQS=n
...

Linus Torvalds
2008-10-21 04:23:01 +0800
99ebcf828 Merge branch 'v28-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel… ... Browse Code »

…/git/tip/linux-2.6-tip

* 'v28-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (36 commits)
fix documentation of sysrq-q really
Fix documentation of sysrq-q
timer_list: add base address to clock base
timer_list: print cpu number of clockevents device
timer_list: print real timer address
NOHZ: restart tick device from irq_enter()
NOHZ: split tick_nohz_restart_sched_tick()
NOHZ: unify the nohz function calls in irq_enter()
timers: fix itimer/many thread hang, fix
timers: fix itimer/many thread hang, v3
ntp: improve adjtimex frequency rounding
timekeeping: fix rounding problem during clock update
ntp: let update_persistent_clock() sleep
hrtimer: reorder struct hrtimer to save 8 bytes on 64bit builds
posix-timers: lock_timer: make it readable
posix-timers: lock_timer: kill the bogus ->it_id check
posix-timers: kill ->it_sigev_signo and ->it_sigev_value
posix-timers: sys_timer_create: cleanup the error handling
posix-timers: move the initialization of timer->sigq from send to create path
posix-timers: sys_timer_create: simplify and s/tasklist/rcu/
...

Fix trivial conflicts due to sysrq-q description clahes in
Documentation/sysrq.txt and drivers/char/sysrq.c

Linus Torvalds
2008-10-21 04:19:56 +0800

20 Oct, 2008

6 commits

85a0ee342 kdump: add is_vmcore_usable() and vmcore_unusable() ... Browse Code »

The usage of elfcorehdr_addr has changed recently such that being set to
ELFCORE_ADDR_MAX is used by is_kdump_kernel() to indicate if the code is
executing in a kernel executed as a crash kernel.

However, arch/ia64/kernel/setup.c:reserve_elfcorehdr will rest
elfcorehdr_addr to ELFCORE_ADDR_MAX on error, which means any subsequent
calls to is_kdump_kernel() will return 0, even though they should return
1.

Ok, at this point in time there are no subsequent calls, but I think its
fair to say that there is ample scope for error or at the very least
confusion.

This patch add an extra state, ELFCORE_ADDR_ERR, which indicates that
elfcorehdr_addr was passed on the command line, and thus execution is
taking place in a crashdump kernel, but vmcore can't be used for some
reason. This is tested for using is_vmcore_usable() and set using
vmcore_unusable(). A subsequent patch makes use of this new code.

To summarise, the states that elfcorehdr_addr can now be in are as follows:

ELFCORE_ADDR_MAX: not a crashdump kernel
ELFCORE_ADDR_ERR: crashdump kernel but vmcore is unusable
any other value: crash dump kernel and vmcore is usable

Signed-off-by: Simon Horman
Cc: Vivek Goyal
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Simon Horman
2008-10-20 23:52:40 +0800
57cac4d18 kdump: make elfcorehdr_addr independent of CONFIG_PROC_VMCORE ... Browse Code »

o elfcorehdr_addr is used by not only the code under CONFIG_PROC_VMCORE
but also by the code which is not inside CONFIG_PROC_VMCORE. For
example, is_kdump_kernel() is used by powerpc code to determine if
kernel is booting after a panic then use previous kernel's TCE table.
So even if CONFIG_PROC_VMCORE is not set in second kernel, one should be
able to correctly determine that we are booting after a panic and setup
calgary iommu accordingly.

o So remove the assumption that elfcorehdr_addr is under
CONFIG_PROC_VMCORE.

o Move definition of elfcorehdr_addr to arch dependent crash files.
(Unfortunately crash dump does not have an arch independent file
otherwise that would have been the best place).

o kexec.c is not the right place as one can Have CRASH_DUMP enabled in
second kernel without KEXEC being enabled.

o I don't see sh setup code parsing the command line for
elfcorehdr_addr. I am wondering how does vmcore interface work on sh.
Anyway, I am atleast defining elfcoredhr_addr so that compilation is not
broken on sh.

Signed-off-by: Vivek Goyal
Acked-by: "Eric W. Biederman"
Acked-by: Simon Horman
Acked-by: Paul Mundt
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vivek Goyal
2008-10-20 23:52:39 +0800
5344b7e64 vmstat: mlocked pages statistics ... Browse Code »

Add NR_MLOCK zone page state, which provides a (conservative) count of
mlocked pages (actually, the number of mlocked pages moved off the LRU).

Reworked by lts to fit in with the modified mlock page support in the
Reclaim Scalability series.

[kosaki.motohiro@jp.fujitsu.com: fix incorrect Mlocked field of /proc/meminfo]
[lee.schermerhorn@hp.com: mlocked-pages: add event counting with statistics]
Signed-off-by: Nick Piggin
Signed-off-by: Lee Schermerhorn
Signed-off-by: Rik van Riel
Signed-off-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2008-10-20 23:52:31 +0800
7b854121e Unevictable LRU Page Statistics ... Browse Code »

Report unevictable pages per zone and system wide.

Kosaki Motohiro added support for memory controller unevictable
statistics.

[riel@redhat.com: fix printk in show_free_areas()]
[akpm@linux-foundation.org: fix units in /proc/vmstats]
Signed-off-by: Lee Schermerhorn
Signed-off-by: Rik van Riel
Signed-off-by: KOSAKI Motohiro
Debugged-by: Hiroshi Shimamoto
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Lee Schermerhorn
2008-10-20 23:50:26 +0800
4f98a2fee vmscan: split LRU lists into anon & file sets ... Browse Code »

Split the LRU lists in two, one set for pages that are backed by real file
systems ("file") and one for pages that are backed by memory and swap
("anon"). The latter includes tmpfs.

The advantage of doing this is that the VM will not have to scan over lots
of anonymous pages (which we generally do not want to swap out), just to
find the page cache pages that it should evict.

This patch has the infrastructure and a basic policy to balance how much
we scan the anon lists and how much we scan the file lists. The big
policy changes are in separate patches.

[lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
[kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
[kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
[hugh@veritas.com: memcg swapbacked pages active]
[hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
[akpm@linux-foundation.org: fix /proc/vmstat units]
[nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
[kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
[kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
Signed-off-by: Rik van Riel
Signed-off-by: Lee Schermerhorn
Signed-off-by: KOSAKI Motohiro
Signed-off-by: Hugh Dickins
Signed-off-by: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Rik van Riel
2008-10-20 23:50:25 +0800
c465a76af Merge branches 'timers/clocksource', 'timers/hrtimers', 'timers/nohz', 'timers/n… ... Browse Code »

…tp', 'timers/posixtimers' and 'timers/debug' into v28-timers-for-linus

Thomas Gleixner
2008-10-20 19:14:06 +0800

17 Oct, 2008

1 commit

f40cbaa5b proc: move sysrq-trigger out of fs/proc/ ... Browse Code »

Move it into sysrq.c, along with the rest of the sysrq implementation.

Signed-off-by: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2008-10-17 02:21:47 +0800

16 Oct, 2008

7 commits

2be3b52a5 proc: fixup irq iterator ... Browse Code »

There is no need for irq_desc here. Even for sparse_irq we can
handle this clever in for_each_irq_nr().

Signed-off-by: Thomas Gleixner

Thomas Gleixner
2008-10-16 22:53:30 +0800
2cc21ef84 genirq: remove sparse irq code ... Browse Code »

This code is not ready, but we need to rip it out instead of rebasing
as we would lose the APIC/IO_APIC unification otherwise.

Signed-off-by: Thomas Gleixner

Thomas Gleixner
2008-10-16 22:53:15 +0800
6d50bc268 x86: use 28 bits irq NR for pci msi/msix and ht ... Browse Code »

also print out irq no in /proc/interrups and /proc/stat in hex, so could
tell bus/dev/func.

Signed-off-by: Yinghai Lu
Signed-off-by: Ingo Molnar

Yinghai Lu
2008-10-16 22:52:52 +0800
52b17329d x86_64: make /proc/interrupts work with dyn irq_desc ... Browse Code »

loop with irq_desc list

Signed-off-by: Yinghai Lu
Signed-off-by: Ingo Molnar

Yinghai Lu
2008-10-16 22:52:51 +0800
c7fb03a47 irq, fs/proc: replace loop with nr_irqs for proc/stat ... Browse Code »

Replace another nr_irqs loop to avoid the allocation of all sparse
irq entries - use for_each_irq_desc instead.

v2: make sure arch without GENERIC_HARDIRQS works too

Signed-off-by: Yinghai Lu
Signed-off-by: Ingo Molnar

Yinghai Lu
2008-10-16 22:52:33 +0800
7f95ec9e4 x86: move kstat_irqs from kstat to irq_desc ... Browse Code »

based on Eric's patch ...

together mold it with dyn_array for irq_desc, will allcate kstat_irqs for
nr_irq_desc alltogether if needed. -- at that point nr_cpus is known already.

v2: make sure system without generic_hardirqs works they don't have irq_desc
v3: fix merging
v4: [mingo@elte.hu] fix typo

[ mingo@elte.hu ] irq: build fix

fix:

arch/x86/xen/spinlock.c: In function 'xen_spin_lock_slow':
arch/x86/xen/spinlock.c:90: error: 'struct kernel_stat' has no member named 'irqs'

Signed-off-by: Yinghai Lu
Signed-off-by: Ingo Molnar

Yinghai Lu
2008-10-16 22:52:32 +0800
da27c118e fs/proc: use nr_irqs ... Browse Code »

Signed-off-by: Yinghai Lu
Signed-off-by: Ingo Molnar

Yinghai Lu
2008-10-16 22:52:07 +0800

15 Oct, 2008

1 commit

8acd3a60b Merge branch 'for-2.6.28' of git://linux-nfs.org/~bfields/linux ... Browse Code »

* 'for-2.6.28' of git://linux-nfs.org/~bfields/linux: (59 commits)
svcrdma: Fix IRD/ORD polarity
svcrdma: Update svc_rdma_send_error to use DMA LKEY
svcrdma: Modify the RPC reply path to use FRMR when available
svcrdma: Modify the RPC recv path to use FRMR when available
svcrdma: Add support to svc_rdma_send to handle chained WR
svcrdma: Modify post recv path to use local dma key
svcrdma: Add a service to register a Fast Reg MR with the device
svcrdma: Query device for Fast Reg support during connection setup
svcrdma: Add FRMR get/put services
NLM: Remove unused argument from svc_addsock() function
NLM: Remove "proto" argument from lockd_up()
NLM: Always start both UDP and TCP listeners
lockd: Remove unused fields in the nlm_reboot structure
lockd: Add helper to sanity check incoming NOTIFY requests
lockd: change nlmclnt_grant() to take a "struct sockaddr *"
lockd: Adjust nlmsvc_lookup_host() to accomodate AF_INET6 addresses
lockd: Adjust nlmclnt_lookup_host() signature to accomodate non-AF_INET
lockd: Support non-AF_INET addresses in nlm_lookup_host()
NLM: Convert nlm_lookup_host() to use a single argument
svcrdma: Add Fast Reg MR Data Types
...

Linus Torvalds
2008-10-15 03:31:14 +0800

10 Oct, 2008

10 commits

3bbfe0596 proc: remove kernel.maps_protect ... Browse Code »

After commit 831830b5a2b5d413407adf380ef62fe17d6fcbf2 aka
"restrict reading from /proc//maps to those who share ->mm or can ptrace"
sysctl stopped being relevant because commit moved security checks from ->show
time to ->start time (mm_for_maps()).

Signed-off-by: Alexey Dobriyan
Acked-by: Kees Cook

Alexey Dobriyan
2008-10-10 08:24:51 +0800
45acb8db0 proc: remove now unneeded ADDBUF macro ... Browse Code »

After local seq_file conversion it was forgotten.

Signed-off-by: Alexey Dobriyan

Alexey Dobriyan
2008-10-10 08:18:58 +0800
478307230 [PATCH] proc: show personality via /proc/pid/personality ... Browse Code »

Make process personality flags visible in /proc. Since a process's
personality is potentially sensitive (e.g. READ_IMPLIES_EXEC), make this
file only readable by the process owner.

Signed-off-by: Kees Cook
Signed-off-by: Alexey Dobriyan

Kees Cook
2008-10-10 08:18:57 +0800
a6bebbc87 [PATCH] signal, procfs: some lock_task_sighand() users do not need rcu_read_lock() ... Browse Code »

lock_task_sighand() make sure task->sighand is being protected,
so we do not need rcu_read_lock().
[ exec() will get task->sighand->siglock before change task->sighand! ]

But code using rcu_read_lock() _just_ to protect lock_task_sighand()
only appear in procfs. (and some code in procfs use lock_task_sighand()
without such redundant protection.)

Other subsystem may put lock_task_sighand() into rcu_read_lock()
critical region, but these rcu_read_lock() are used for protecting
"for_each_process()", "find_task_by_vpid()" etc. , not for protecting
lock_task_sighand().

Signed-off-by: Lai Jiangshan
[ok from Oleg]
Signed-off-by: Alexey Dobriyan

Lai Jiangshan
2008-10-10 08:18:57 +0800
53167a3ef proc: move PROC_PAGE_MONITOR to fs/proc/Kconfig ... Browse Code »

Signed-off-by: Alexey Dobriyan

Alexey Dobriyan
2008-10-10 08:18:57 +0800
81324364b proc: make grab_header() static ... Browse Code »

Signed-off-by: Adrian Bunk
Signed-off-by: Alexey Dobriyan

Adrian Bunk
2008-10-10 08:18:56 +0800
a70973c21 proc: remove unused get_dma_list() ... Browse Code »

Signed-off-by: Alexey Dobriyan

Alexey Dobriyan
2008-10-10 08:18:56 +0800
a04f4de64 proc: remove dummy vmcore_open() ... Browse Code »

Empty ->open is equivalent to always succeeding ->open.

Signed-off-by: Alexey Dobriyan

Alexey Dobriyan
2008-10-10 08:18:55 +0800
e1675231c proc: proc_sys_root tweak ... Browse Code »

Signed-off-by: Alexey Dobriyan

Alexey Dobriyan
2008-10-10 08:18:55 +0800
300b994b7 proc: fix return value of proc_reg_open() in "too late" case ... Browse Code »

If ->open() wasn't called, returning 0 is misleading and, theoretically,
oopsable:
1) remove_proc_entry clears ->proc_fops, drops lock,
2) ->open "succeeds",
3) ->release oopses, because it assumes ->open was called (single_release()).

Signed-off-by: Alexey Dobriyan

Alexey Dobriyan
2008-10-10 08:18:54 +0800

30 Sep, 2008

1 commit

bfcd17a6c Configure out file locking features ... Browse Code »

This patch adds the CONFIG_FILE_LOCKING option which allows to remove
support for advisory locks. With this patch enabled, the flock()
system call, the F_GETLK, F_SETLK and F_SETLKW operations of fcntl()
and NFS support are disabled. These features are not necessarly needed
on embedded systems. It allows to save ~11 Kb of kernel code and data:

text data bss dec hex filename
1125436 118764 212992 1457192 163c28 vmlinux.old
1114299 118564 212992 1445855 160fdf vmlinux
-11137 -200 0 -11337 -2C49 +/-

This patch has originally been written by Matt Mackall
, and is part of the Linux Tiny project.

Signed-off-by: Thomas Petazzoni
Signed-off-by: Matt Mackall
Cc: matthew@wil.cx
Cc: linux-fsdevel@vger.kernel.org
Cc: mpm@selenic.com
Cc: akpm@linux-foundation.org
Signed-off-by: J. Bruce Fields

Thomas Petazzoni
2008-09-30 05:56:57 +0800

14 Sep, 2008

3 commits

f06febc96 timers: fix itimer/many thread hang ... Browse Code »

Overview

This patch reworks the handling of POSIX CPU timers, including the
ITIMER_PROF, ITIMER_VIRT timers and rlimit handling. It was put together
with the help of Roland McGrath, the owner and original writer of this code.

The problem we ran into, and the reason for this rework, has to do with using
a profiling timer in a process with a large number of threads. It appears
that the performance of the old implementation of run_posix_cpu_timers() was
at least O(n*3) (where "n" is the number of threads in a process) or worse.
Everything is fine with an increasing number of threads until the time taken
for that routine to run becomes the same as or greater than the tick time, at
which point things degrade rather quickly.

This patch fixes bug 9906, "Weird hang with NPTL and SIGPROF."

Code Changes

This rework corrects the implementation of run_posix_cpu_timers() to make it
run in constant time for a particular machine. (Performance may vary between
one machine and another depending upon whether the kernel is built as single-
or multiprocessor and, in the latter case, depending upon the number of
running processors.) To do this, at each tick we now update fields in
signal_struct as well as task_struct. The run_posix_cpu_timers() function
uses those fields to make its decisions.

We define a new structure, "task_cputime," to contain user, system and
scheduler times and use these in appropriate places:

struct task_cputime {
cputime_t utime;
cputime_t stime;
unsigned long long sum_exec_runtime;
};

This is included in the structure "thread_group_cputime," which is a new
substructure of signal_struct and which varies for uniprocessor versus
multiprocessor kernels. For uniprocessor kernels, it uses "task_cputime" as
a simple substructure, while for multiprocessor kernels it is a pointer:

struct thread_group_cputime {
struct task_cputime totals;
};

struct thread_group_cputime {
struct task_cputime *totals;
};

We also add a new task_cputime substructure directly to signal_struct, to
cache the earliest expiration of process-wide timers, and task_cputime also
replaces the it_*_expires fields of task_struct (used for earliest expiration
of thread timers). The "thread_group_cputime" structure contains process-wide
timers that are updated via account_user_time() and friends. In the non-SMP
case the structure is a simple aggregator; unfortunately in the SMP case that
simplicity was not achievable due to cache-line contention between CPUs (in
one measured case performance was actually _worse_ on a 16-cpu system than
the same test on a 4-cpu system, due to this contention). For SMP, the
thread_group_cputime counters are maintained as a per-cpu structure allocated
using alloc_percpu(). The timer functions update only the timer field in
the structure corresponding to the running CPU, obtained using per_cpu_ptr().

We define a set of inline functions in sched.h that we use to maintain the
thread_group_cputime structure and hide the differences between UP and SMP
implementations from the rest of the kernel. The thread_group_cputime_init()
function initializes the thread_group_cputime structure for the given task.
The thread_group_cputime_alloc() is a no-op for UP; for SMP it calls the
out-of-line function thread_group_cputime_alloc_smp() to allocate and fill
in the per-cpu structures and fields. The thread_group_cputime_free()
function, also a no-op for UP, in SMP frees the per-cpu structures. The
thread_group_cputime_clone_thread() function (also a UP no-op) for SMP calls
thread_group_cputime_alloc() if the per-cpu structures haven't yet been
allocated. The thread_group_cputime() function fills the task_cputime
structure it is passed with the contents of the thread_group_cputime fields;
in UP it's that simple but in SMP it must also safely check that tsk->signal
is non-NULL (if it is it just uses the appropriate fields of task_struct) and,
if so, sums the per-cpu values for each online CPU. Finally, the three
functions account_group_user_time(), account_group_system_time() and
account_group_exec_runtime() are used by timer functions to update the
respective fields of the thread_group_cputime structure.

Non-SMP operation is trivial and will not be mentioned further.

The per-cpu structure is always allocated when a task creates its first new
thread, via a call to thread_group_cputime_clone_thread() from copy_signal().
It is freed at process exit via a call to thread_group_cputime_free() from
cleanup_signal().

All functions that formerly summed utime/stime/sum_sched_runtime values from
from all threads in the thread group now use thread_group_cputime() to
snapshot the values in the thread_group_cputime structure or the values in
the task structure itself if the per-cpu structure hasn't been allocated.

Finally, the code in kernel/posix-cpu-timers.c has changed quite a bit.
The run_posix_cpu_timers() function has been split into a fast path and a
slow path; the former safely checks whether there are any expired thread
timers and, if not, just returns, while the slow path does the heavy lifting.
With the dedicated thread group fields, timers are no longer "rebalanced" and
the process_timer_rebalance() function and related code has gone away. All
summing loops are gone and all code that used them now uses the
thread_group_cputime() inline. When process-wide timers are set, the new
task_cputime structure in signal_struct is used to cache the earliest
expiration; this is checked in the fast path.

Performance

The fix appears not to add significant overhead to existing operations. It
generally performs the same as the current code except in two cases, one in
which it performs slightly worse (Case 5 below) and one in which it performs
very significantly better (Case 2 below). Overall it's a wash except in those
two cases.

I've since done somewhat more involved testing on a dual-core Opteron system.

Case 1: With no itimer running, for a test with 100,000 threads, the fixed
kernel took 1428.5 seconds, 513 seconds more than the unfixed system,
all of which was spent in the system. There were twice as many
voluntary context switches with the fix as without it.

Case 2: With an itimer running at .01 second ticks and 4000 threads (the most
an unmodified kernel can handle), the fixed kernel ran the test in
eight percent of the time (5.8 seconds as opposed to 70 seconds) and
had better tick accuracy (.012 seconds per tick as opposed to .023
seconds per tick).

Case 3: A 4000-thread test with an initial timer tick of .01 second and an
interval of 10,000 seconds (i.e. a timer that ticks only once) had
very nearly the same performance in both cases: 6.3 seconds elapsed
for the fixed kernel versus 5.5 seconds for the unfixed kernel.

With fewer threads (eight in these tests), the Case 1 test ran in essentially
the same time on both the modified and unmodified kernels (5.2 seconds versus
5.8 seconds). The Case 2 test ran in about the same time as well, 5.9 seconds
versus 5.4 seconds but again with much better tick accuracy, .013 seconds per
tick versus .025 seconds per tick for the unmodified kernel.

Since the fix affected the rlimit code, I also tested soft and hard CPU limits.

Case 4: With a hard CPU limit of 20 seconds and eight threads (and an itimer
running), the modified kernel was very slightly favored in that while
it killed the process in 19.997 seconds of CPU time (5.002 seconds of
wall time), only .003 seconds of that was system time, the rest was
user time. The unmodified kernel killed the process in 20.001 seconds
of CPU (5.014 seconds of wall time) of which .016 seconds was system
time. Really, though, the results were too close to call. The results
were essentially the same with no itimer running.

Case 5: With a soft limit of 20 seconds and a hard limit of 2000 seconds
(where the hard limit would never be reached) and an itimer running,
the modified kernel exhibited worse tick accuracy than the unmodified
kernel: .050 seconds/tick versus .028 seconds/tick. Otherwise,
performance was almost indistinguishable. With no itimer running this
test exhibited virtually identical behavior and times in both cases.

In times past I did some limited performance testing. those results are below.

On a four-cpu Opteron system without this fix, a sixteen-thread test executed
in 3569.991 seconds, of which user was 3568.435s and system was 1.556s. On
the same system with the fix, user and elapsed time were about the same, but
system time dropped to 0.007 seconds. Performance with eight, four and one
thread were comparable. Interestingly, the timer ticks with the fix seemed
more accurate: The sixteen-thread test with the fix received 149543 ticks
for 0.024 seconds per tick, while the same test without the fix received 58720
for 0.061 seconds per tick. Both cases were configured for an interval of
0.01 seconds. Again, the other tests were comparable. Each thread in this
test computed the primes up to 25,000,000.

I also did a test with a large number of threads, 100,000 threads, which is
impossible without the fix. In this case each thread computed the primes only
up to 10,000 (to make the runtime manageable). System time dominated, at
1546.968 seconds out of a total 2176.906 seconds (giving a user time of
629.938s). It received 147651 ticks for 0.015 seconds per tick, still quite
accurate. There is obviously no comparable test without the fix.

Signed-off-by: Frank Mayhar
Cc: Roland McGrath
Cc: Alexey Dobriyan
Cc: Andrew Morton
Signed-off-by: Ingo Molnar

Frank Mayhar
2008-09-14 22:25:35 +0800
d7a3e4959 mm: ifdef Quicklists in /proc/meminfo ... Browse Code »

A "Quicklists: 0 kB" line has just started appearing in
/proc/meminfo, but most architectures (including x86) don't have
them configured, so #ifdef it, like the highmem lines.

And those architectures which do have quicklists configured are
using them for page tables: so let's place it next to PageTables.

Signed-off-by: Hugh Dickins
Acked-by: Christoph Lameter
Acked-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2008-09-14 05:41:51 +0800
665020c35 proc: more debugging for "already registered" case ... Browse Code »

Print parent directory name as well.

The aim is to catch non-creation of parent directory when proc_mkdir will
return NULL and all subsequent registrations go directly in /proc instead
of intended directory.

Signed-off-by: Alexey Dobriyan
Signed-off-by: Andrew Morton
[ Fixed insane printk string while at it. - Linus ]
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2008-09-14 05:41:50 +0800

06 Sep, 2008

1 commit

49048622e sched: fix process time monotonicity ... Browse Code »

Spencer reported a problem where utime and stime were going negative despite
the fixes in commit b27f03d4bdc145a09fb7b0c0e004b29f1ee555fa. The suspected
reason for the problem is that signal_struct maintains it's own utime and
stime (of exited tasks), these are not updated using the new task_utime()
routine, hence sig->utime can go backwards and cause the same problem
to occur (sig->utime, adds tsk->utime and not task_utime()). This patch
fixes the problem

TODO: using max(task->prev_utime, derived utime) works for now, but a more
generic solution is to implement cputime_max() and use the cputime_gt()
function for comparison.

Reported-by: spencer@bluehost.com
Signed-off-by: Balbir Singh
Signed-off-by: Peter Zijlstra
Signed-off-by: Ingo Molnar

Balbir Singh
2008-09-06 00:14:35 +0800

03 Sep, 2008

1 commit

4b8561521 mm: show quicklist usage in /proc/meminfo ... Browse Code »

Quicklists can consume several GB of memory. We should provide a means of
monitoring this.

After this patch is applied, /proc/meminfo will output the following:

% cat /proc/meminfo

MemTotal: 7715392 kB
MemFree: 5401600 kB
Buffers: 80384 kB
Cached: 300800 kB
SwapCached: 0 kB
Active: 235584 kB
Inactive: 262656 kB
SwapTotal: 2031488 kB
SwapFree: 2031488 kB
Dirty: 3520 kB
Writeback: 0 kB
AnonPages: 117696 kB
Mapped: 38528 kB
Slab: 1589952 kB
SReclaimable: 23104 kB
SUnreclaim: 1566848 kB
PageTables: 14656 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 5889152 kB
Committed_AS: 393152 kB
VmallocTotal: 17592177655808 kB
VmallocUsed: 29056 kB
VmallocChunk: 17592177626432 kB
Quicklists: 130944 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 262144 kB

Signed-off-by: KOSAKI Motohiro
Cc: Christoph Lameter
Cc: Keiichiro Tokunaga
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KOSAKI Motohiro
2008-09-03 10:21:38 +0800

25 Aug, 2008

1 commit

cc9960991 [PATCH] proc: inode number fixlet ... Browse Code »

Ouch, if number taken from IDA is too big, the intent was to signal an
error, not check for overflow and still do overflowing addition.

One still needs 2^28 proc entries to notice this.

Signed-off-by: Alexey Dobriyan
Signed-off-by: Al Viro

Alexey Dobriyan
2008-08-25 13:18:03 +0800

21 Aug, 2008

1 commit

1804dc6e1 /proc/self/maps doesn't display the real file offset ... Browse Code »

This addresses

http://bugzilla.kernel.org/show_bug.cgi?id=11318

In function show_map (file: fs/proc/task_mmu.c), if vma->vm_pgoff > 2^20
than (vma->vm_pgoff << PAGE_SIZE) is greater than 2^32 (with PAGE_SIZE
equal to 4096 (i.e. 2^12). The next seq_printf use an unsigned long for
the conversion of (vma->vm_pgoff << PAGE_SIZE), as a result the offset
value displayed in /proc/self/maps is truncated if the page offset is
greater than 2^20.

A test that shows this issue:

#define _GNU_SOURCE
#include
#include
#include
#include
#include
#include
#include
#include

#define PAGE_SIZE (getpagesize())

#if __i386__
# define U64_STR "%llx"
#elif __x86_64
# define U64_STR "%lx"
#else
# error "Architecture Unsupported"
#endif

int main(int argc, char *argv[])
{
int fd;
char *addr;
off64_t offset = 0x10000000;
char *filename = "/dev/zero";

fd = open(filename, O_RDONLY);
if (fd < 0) {
perror("open");
return 1;
}

offset *= 0x10;
printf("offset = " U64_STR "\n", offset);

addr = (char*)mmap64(NULL, PAGE_SIZE, PROT_READ, MAP_PRIVATE, fd,
offset);
if ((void*)addr == MAP_FAILED) {
perror("mmap64");
return 1;
}

{
FILE *fmaps;
char *line = NULL;
size_t len = 0;
ssize_t read;
size_t filename_len = strlen(filename);

fmaps = fopen("/proc/self/maps", "r");
if (!fmaps) {
perror("fopen");
return 1;
}
while ((read = getline(&line, &len, fmaps)) != -1) {
if ((read > filename_len + 1)
&& (strncmp(&line[read - filename_len - 1], filename, filename_len) == 0))
printf("%s", line);
}

if (line)
free(line);

fclose(fmaps);
}

close(fd);
return 0;
}

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Clement Calmels
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Clement Calmels
2008-08-21 06:40:30 +0800

06 Aug, 2008

1 commit

7c44319dc proc: fix warnings ... Browse Code »

proc: fix warnings

fs/proc/base.c:2429: warning: format '%llu' expects type 'long long unsigned int', but argument 3 has type 'u64'
fs/proc/base.c:2429: warning: format '%llu' expects type 'long long unsigned int', but argument 4 has type 'u64'
fs/proc/base.c:2429: warning: format '%llu' expects type 'long long unsigned int', but argument 5 has type 'u64'
fs/proc/base.c:2429: warning: format '%llu' expects type 'long long unsigned int', but argument 6 has type 'u64'
fs/proc/base.c:2429: warning: format '%llu' expects type 'long long unsigned int', but argument 7 has type 'u64'
fs/proc/base.c:2429: warning: format '%llu' expects type 'long long unsigned int', but argument 8 has type 'u64'
fs/proc/base.c:2429: warning: format '%llu' expects type 'long long unsigned int', but argument 9 has type 'u64'

Signed-off-by: Alexander Beregalov
Acked-by: Andrea Righi
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexander Beregalov
2008-08-06 05:33:50 +0800

01 Aug, 2008

2 commits

9a1854091 [PATCH 2/2] proc: switch inode number allocation to IDA ... Browse Code »

proc doesn't use "associate pointer with id" feature of IDR, so switch
to IDA.

NOTE, NOTE, NOTE:
Do not apply if release_inode_number() still mantions MAX_ID_MASK!

Signed-off-by: Alexey Dobriyan
Signed-off-by: Al Viro

Alexey Dobriyan
2008-08-01 23:25:28 +0800
67935df49 [PATCH 1/2] proc: fix inode number bogorithmetic ... Browse Code »

Id which proc gets from IDR for inode number and id which proc removes
from IDR do not match. E.g. 0x11a transforms into 0x8000011a.

Which stayed unnoticed for a long time because, surprise, idr_remove()
masks out that high bit before doing anything.

All of this due to "| ~MAX_ID_MASK" in release_inode_number().

I still don't understand how it's supposed to work, because "| ~MASK"
is not an inversion for "& MAX" operation.

So, use just one nice, working addition. Make start offset unsigned int,
while I'm at it. It's longness is not used anywhere.

Signed-off-by: Alexey Dobriyan
Signed-off-by: Al Viro

Alexey Dobriyan
2008-08-01 23:25:27 +0800

28 Jul, 2008

2 commits

940389b8a task IO accounting: move all IO statistics in struct task_io_accounting ... Browse Code »

Simplify the code of include/linux/task_io_accounting.h.

It is also more reasonable to have all the task i/o-related statistics in a
single struct (task_io_accounting).

Signed-off-by: Andrea Righi
Signed-off-by: Oleg Nesterov
Signed-off-by: Linus Torvalds

Andrea Righi
2008-07-28 07:12:28 +0800
5995477ab task IO accounting: improve code readability ... Browse Code »

Put all i/o statistics in struct proc_io_accounting and use inline functions to
initialize and increment statistics, removing a lot of single variable
assignments.

This also reduces the kernel size as following (with CONFIG_TASK_XACCT=y and
CONFIG_TASK_IO_ACCOUNTING=y).

text data bss dec hex filename
11651 0 0 11651 2d83 kernel/exit.o.before
11619 0 0 11619 2d63 kernel/exit.o.after
10886 132 136 11154 2b92 kernel/fork.o.before
10758 132 136 11026 2b12 kernel/fork.o.after

3082029 807968 4818600 8708597 84e1f5 vmlinux.o.before
3081869 807968 4818600 8708437 84e155 vmlinux.o.after

Signed-off-by: Andrea Righi
Acked-by: Oleg Nesterov
Signed-off-by: Linus Torvalds

Andrea Righi
2008-07-28 00:58:20 +0800