29 Apr, 2008

4 commits

  • When reading from or writing to a table, the root from which the table
    came may affect the table's permissions, depending on who is working with
    the table.

    The core hunk is at the bottom of this patch. All the rest is just pushing
    the ctl_table_root argument up to the sysctl_perm() function.

    This will be mostly (only?) used in the net sysctls.
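
    As a rough, self-contained illustration of the idea (not the kernel code;
    the types and names below only loosely mirror the real ones), the
    permission check is handed the root the table hangs off, so a root can
    substitute a caller-dependent mode for the table's static one:

    #include <stdio.h>

    struct ctl_table { const char *procname; int mode; };

    struct ctl_table_root {
            /* may return a caller-dependent mode, or -1 to keep table->mode */
            int (*permissions)(struct ctl_table *table);
    };

    static int sysctl_perm(struct ctl_table_root *root, struct ctl_table *table,
                           int requested)
    {
            int mode = table->mode;

            if (root && root->permissions) {
                    int override = root->permissions(table);
                    if (override >= 0)
                            mode = override;
            }
            return (mode & requested) == requested;  /* 1 = allowed */
    }

    /* hypothetical net root that exposes everything read-only */
    static int net_permissions(struct ctl_table *table)
    {
            (void)table;
            return 0444;
    }

    int main(void)
    {
            struct ctl_table t = { "somaxconn", 0644 };
            struct ctl_table_root plain = { NULL }, net = { net_permissions };

            printf("write via plain root: %d\n", sysctl_perm(&plain, &t, 0200));
            printf("write via net root:   %d\n", sysctl_perm(&net, &t, 0200));
            return 0;
    }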

    Signed-off-by: Pavel Emelyanov
    Acked-by: David S. Miller
    Cc: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Cc: Denis V. Lunev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • do_sysctl_strategy() isn't used outside kernel/sysctl.c, so it can be made
    static, with its prototype removed from the header.

    Besides, move this function and parse_table() above their callers and drop
    the forward declaration of the latter.

    One more "besides": fix two checkpatch warnings - a space before a '(' and
    an extra space at the end of a line.
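
    A generic sketch of the define-above-the-caller pattern (nothing
    sysctl-specific about it): a static helper defined above its only caller
    needs neither a header prototype nor a forward declaration:

    #include <stdio.h>

    /* Defined above its caller, so no prototype or forward declaration is
     * needed; 'static' also keeps the symbol private to this file. */
    static int parse_entry(const char *name)
    {
            return name && *name;
    }

    int main(void)
    {
            printf("%d\n", parse_entry("kernel.hostname"));
            return 0;
    }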

    Signed-off-by: Pavel Emelyanov
    Acked-by: David S. Miller
    Cc: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Cc: Denis V. Lunev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Disable sysctl_check.c for embedded targets. This saves about 11 kB in
    .text and another 11 kB in .data on a PXA255 embedded platform.

    Signed-off-by: Holger Schurig
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Holger Schurig
     
  • Make the keyring quotas controllable through /proc/sys files:

    (*) /proc/sys/kernel/keys/root_maxkeys
    /proc/sys/kernel/keys/root_maxbytes

    Maximum number of keys that root may have and the maximum total number of
    bytes of data that root may have stored in those keys.

    (*) /proc/sys/kernel/keys/maxkeys
    /proc/sys/kernel/keys/maxbytes

    Maximum number of keys that each non-root user may have and the maximum
    total number of bytes of data that each of those users may have stored in
    their keys.

    Also increase the quotas, as a number of people have been complaining that
    they are not big enough. I'm not sure they are big enough now either, but,
    on the other hand, they can now be set in /etc/sysctl.conf.
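
    For instance, a small user-space reader (assuming the four files land
    exactly where listed above) can show the current limits; the same knobs
    can be made persistent with kernel.keys.* lines in /etc/sysctl.conf:

    /* Read back the four keyring quota knobs described above.
     * Paths are taken from the changelog; error handling is minimal. */
    #include <stdio.h>

    static void show(const char *path)
    {
            char buf[64];
            FILE *f = fopen(path, "r");

            if (!f) {
                    perror(path);
                    return;
            }
            if (fgets(buf, sizeof(buf), f))
                    printf("%-40s %s", path, buf);
            fclose(f);
    }

    int main(void)
    {
            show("/proc/sys/kernel/keys/root_maxkeys");
            show("/proc/sys/kernel/keys/root_maxbytes");
            show("/proc/sys/kernel/keys/maxkeys");
            show("/proc/sys/kernel/keys/maxbytes");
            return 0;
    }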

    Signed-off-by: David Howells
    Cc:
    Cc:
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     

20 Apr, 2008

2 commits


05 Mar, 2008

1 commit

  • The following commits cause a number of regressions:

    commit 58e2d4ca581167c2a079f4ee02be2f0bc52e8729
    Author: Srivatsa Vaddagiri
    Date: Fri Jan 25 21:08:00 2008 +0100
    sched: group scheduling, change how cpu load is calculated

    commit 6b2d7700266b9402e12824e11e0099ae6a4a6a79
    Author: Srivatsa Vaddagiri
    Date: Fri Jan 25 21:08:00 2008 +0100
    sched: group scheduler, fix fairness of cpu bandwidth allocation for task groups

    Namely:
    - very frequent wakeups on SMP, reported by PowerTop users.
    - cacheline trashing on (large) SMP
    - some latencies larger than 500ms

    While there is a mergeable patch to fix the latter, the former issues
    are not fixable in a manner suitable for .25 (we're at -rc3 now).

    Hence we revert them and try again in v2.6.26.

    Signed-off-by: Peter Zijlstra
    CC: Srivatsa Vaddagiri
    Tested-by: Alexey Zaytsev
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

14 Feb, 2008

1 commit

  • proc_doulongvec_minmax() calls copy_to_user()/copy_from_user(), so we can't
    hold hugetlb_lock over the call. Use a dummy variable to store the sysctl
    result, like in hugetlb_sysctl_handler(), then grab the lock to update
    nr_overcommit_huge_pages.
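
    The shape of the fix can be sketched in ordinary user-space C (a pthread
    mutex stands in for hugetlb_lock, and all names here are illustrative
    rather than the kernel's): do the potentially sleeping copy into a scratch
    variable with no lock held, then take the lock only to publish the result.

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
    static unsigned long nr_overcommit_pages;   /* protected by pool_lock */
    static unsigned long sysctl_scratch;        /* written without the lock */

    static int overcommit_handler(const char *user_input)
    {
            /* Step 1: parse/copy while unlocked (this is where the kernel
             * would sleep in copy_from_user(), so no spinlock may be held). */
            if (sscanf(user_input, "%lu", &sysctl_scratch) != 1)
                    return -1;

            /* Step 2: take the lock only for the actual counter update. */
            pthread_mutex_lock(&pool_lock);
            nr_overcommit_pages = sysctl_scratch;
            pthread_mutex_unlock(&pool_lock);
            return 0;
    }

    int main(void)
    {
            overcommit_handler("128");
            printf("nr_overcommit_pages = %lu\n", nr_overcommit_pages);
            return 0;
    }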

    Signed-off-by: Nishanth Aravamudan
    Reported-by: Miles Lane
    Cc: Adam Litke
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     

13 Feb, 2008

1 commit

  • Change the rt_ratio interface to rt_runtime_us, to match rt_period_us.
    This avoids picking a granularity for the ratio.

    Extend the /sys/kernel/uids// interface to allow setting
    the group's rt_runtime.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

09 Feb, 2008

4 commits

  • Makes an embedded image a bit smaller.

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Don't include linux/security.h twice in kernel/sysctl.c

    Signed-off-by: Jesper Juhl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Juhl
     
  • Some time ago the xxx_vnr() calls (e.g. pid_vnr or find_task_by_vpid) were
    _all_ converted to operate on the current pid namespace. After this each call
    like xxx_nr_ns(foo, current->nsproxy->pid_ns) is nothing but a xxx_vnr(foo)
    one.

    Switch all the xxx_nr_ns() callers to use the xxx_vnr() calls where
    appropriate.

    Signed-off-by: Pavel Emelyanov
    Reviewed-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • When I replaced hugetlb_dynamic_pool with nr_overcommit_hugepages I used
    proc_doulongvec_minmax() directly. However, hugetlb.c's locking rules
    require that all counter modifications occur under the hugetlb_lock. Add a
    callback into the hugetlb code similar to the one for nr_hugepages. Grab
    the lock around the manipulation of nr_overcommit_hugepages in
    proc_doulongvec_minmax().

    Signed-off-by: Nishanth Aravamudan
    Acked-by: Adam Litke
    Cc: David Gibson
    Cc: William Lee Irwin III
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     

08 Feb, 2008

1 commit

  • Adds a new sysctl, 'oom_dump_tasks', that enables the kernel to produce a
    dump of all system tasks (excluding kernel threads) when performing an
    OOM-killing. Information includes pid, uid, tgid, vm size, rss, cpu,
    oom_adj score, and name.

    This is helpful for determining why there was an OOM condition and which
    rogue task caused it.

    It is configurable so that large systems, such as those with several
    thousand tasks, do not incur a performance penalty associated with dumping
    data they may not desire.

    If an OOM was triggered as a result of a memory controller, the tasklist
    shall be filtered to exclude tasks that are not a member of the same
    cgroup.
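
    A minimal way to flip the toggle from user space (assuming the file
    appears as /proc/sys/vm/oom_dump_tasks, following the usual vm.* naming;
    requires root):

    /* Enable the task dump on OOM by writing "1" to the new sysctl. */
    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/proc/sys/vm/oom_dump_tasks", "w");

            if (!f) {
                    perror("oom_dump_tasks");
                    return 1;
            }
            fputs("1\n", f);
            fclose(f);
            return 0;
    }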

    Cc: Andrea Arcangeli
    Cc: Christoph Lameter
    Cc: Balbir Singh
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

07 Feb, 2008

1 commit

  • NR_OPEN (historically set to 1024*1024) actually forbids processes from
    opening more than 1024*1024 handles.

    Unfortunately some production servers hit the not so 'ridiculously high
    value' of 1024*1024 file descriptors per process.

    Changing NR_OPEN is not considered safe because of potential vmalloc space
    exhaustion.

    This patch introduces a new sysctl (/proc/sys/fs/nr_open) which defaults
    to 1024*1024, so that admins can decide to change this limit if their
    workload needs it.
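
    A sketch of how the pieces fit together from user space: an administrator
    raises fs.nr_open, and a process can then push its RLIMIT_NOFILE hard
    limit up to (but not beyond) that value:

    /* Read fs.nr_open and try to raise this process's RLIMIT_NOFILE to it.
     * The hard limit can only exceed the old 1024*1024 ceiling once an
     * administrator has raised /proc/sys/fs/nr_open. */
    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
            unsigned long nr_open = 0;
            struct rlimit rl;
            FILE *f = fopen("/proc/sys/fs/nr_open", "r");

            if (!f || fscanf(f, "%lu", &nr_open) != 1) {
                    perror("fs.nr_open");
                    return 1;
            }
            fclose(f);

            rl.rlim_cur = rl.rlim_max = nr_open;
            if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {  /* needs CAP_SYS_RESOURCE */
                    perror("setrlimit");
                    return 1;
            }
            printf("RLIMIT_NOFILE raised to %lu\n", nr_open);
            return 0;
    }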

    [akpm@linux-foundation.org: export it for sparc64]
    Signed-off-by: Eric Dumazet
    Cc: Alan Cox
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: "David S. Miller"
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

06 Feb, 2008

2 commits

  • The capability bounding set is a set beyond which capabilities cannot grow.
    Currently cap_bset is per-system. It can be manipulated through sysctl,
    but only init can add capabilities. Root can remove capabilities. By
    default it includes all caps except CAP_SETPCAP.

    This patch makes the bounding set per-process when file capabilities are
    enabled. It is inherited at fork from the parent. No one can add elements;
    CAP_SETPCAP is required to remove them.

    One example use of this is to start a safer container. For instance, until
    device namespaces or per-container device whitelists are introduced, it is
    best to take CAP_MKNOD away from a container.

    The bounding set will not affect pP and pE immediately. It will only
    affect pP' and pE' after subsequent exec()s. It also does not affect pI,
    and exec() does not constrain pI'. So to really start a shell with no way
    of regaining CAP_MKNOD, you would do

    prctl(PR_CAPBSET_DROP, CAP_MKNOD);
    cap_t cap = cap_get_proc();
    cap_value_t caparray[1];
    caparray[0] = CAP_MKNOD;
    cap_set_flag(cap, CAP_INHERITABLE, 1, caparray, CAP_DROP);
    cap_set_proc(cap);
    cap_free(cap);

    The following test program will get and set the bounding
    set (but not pI). For instance

    ./bset get
    (lists capabilities in bset)
    ./bset drop cap_net_raw
    (starts shell with new bset)
    (use capset, setuid binary, or binary with
    file capabilities to try to increase caps)

    ************************************************************
    cap_bound.c
    ************************************************************
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    #ifndef PR_CAPBSET_READ
    #define PR_CAPBSET_READ 23
    #endif

    #ifndef PR_CAPBSET_DROP
    #define PR_CAPBSET_DROP 24
    #endif

    int usage(char *me)
    {
    printf("Usage: %s get\n", me);
    printf(" %s drop \n", me);
    return 1;
    }

    #define numcaps 32
    char *captable[numcaps] = {
    "cap_chown",
    "cap_dac_override",
    "cap_dac_read_search",
    "cap_fowner",
    "cap_fsetid",
    "cap_kill",
    "cap_setgid",
    "cap_setuid",
    "cap_setpcap",
    "cap_linux_immutable",
    "cap_net_bind_service",
    "cap_net_broadcast",
    "cap_net_admin",
    "cap_net_raw",
    "cap_ipc_lock",
    "cap_ipc_owner",
    "cap_sys_module",
    "cap_sys_rawio",
    "cap_sys_chroot",
    "cap_sys_ptrace",
    "cap_sys_pacct",
    "cap_sys_admin",
    "cap_sys_boot",
    "cap_sys_nice",
    "cap_sys_resource",
    "cap_sys_time",
    "cap_sys_tty_config",
    "cap_mknod",
    "cap_lease",
    "cap_audit_write",
    "cap_audit_control",
    "cap_setfcap"
    };

    int getbcap(void)
    {
    int comma=0;
    unsigned long i;
    int ret;

    printf("i know of %d capabilities\n", numcaps);
    printf("capability bounding set:");
    for (i=0; i<numcaps; i++) {
    ret = prctl(PR_CAPBSET_READ, i);
    if (ret < 0)
    perror("prctl");
    else if (ret==1)
    printf("%s%s", (comma++) ? ", " : " ", captable[i]);
    }
    printf("\n");
    return 0;
    }

    int capdrop(char *str)
    {
    unsigned long i;

    int found=0;
    for (i=0; i
    Signed-off-by: Andrew G. Morgan
    Cc: Stephen Smalley
    Cc: James Morris
    Cc: Chris Wright
    Cc: Casey Schaufler
    Signed-off-by: "Serge E. Hallyn"
    Tested-by: Jiri Slaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • Add vm.highmem_is_dirtyable toggle

    A 32-bit machine with HIGHMEM64 enabled running DCC has an mmapped file of
    approximately 2GB which contains a hash format that is written randomly by
    the dbclean process. On 2.6.16 this process took a few minutes. With
    lowmem-only accounting of dirty ratios, it takes about 12 hours of 100%
    disk IO, all random writes.

    Include a toggle in /proc/sys/vm/highmem_is_dirtyable which can be set to 1 to
    add the highmem back to the total available memory count.

    [akpm@linux-foundation.org: Fix the CONFIG_DETECT_SOFTLOCKUP=y build]
    Signed-off-by: Bron Gondwana
    Cc: Ethan Solomita
    Cc: Peter Zijlstra
    Cc: WU Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bron Gondwana
     

02 Feb, 2008

1 commit

  • execve arguments can be quite large. There is no limit on the number of
    arguments and a 4G limit on the size of an argument.

    This patch prints those arguments in bite-sized pieces. A userspace size
    limitation of 8k was discovered, so this keeps messages around 7.5k.

    Single arguments larger than 7.5k in length are split into multiple
    records and can be identified as aX[Y]=

    Signed-off-by: Eric Paris

    Eric Paris
     

30 Jan, 2008

1 commit

  • various changes to the in_p/out_p delay details:

    - add the io_delay=none method
    - make each method selectable from the kernel config
    - simplify the delay code a bit by getting rid of an indirect function call
    - add the /proc/sys/kernel/io_delay_type sysctl
    - change 'io_delay=standard|alternate' to io_delay=0x80 and io_delay=0xed
    - make the io delay config not depend on CONFIG_DEBUG_KERNEL

    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner
    Tested-by: "David P. Reed"

    Ingo Molnar
     

29 Jan, 2008

4 commits

  • I have removed all the entries from this table (core_table,
    ipv4_table and tr_table), so now we can safely drop it.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • This patch implements the basic infrastructure for per namespace sysctls.

    A list of lists of sysctl headers is added, allowing each namespace to
    have its own list of sysctl headers.

    Each list of sysctl headers has a lookup function to find the first
    sysctl header in the list, allowing the lists to have a per namespace
    instance.

    register_sysctl_root is added to tell sysctl.c about additional lists of
    sysctl headers. As all of the users are expected to be in-kernel, no
    unregister function is provided.

    sysctl_head_next is updated to walk through the list of lists.

    __register_sysctl_paths is added to add a new sysctl table on
    a non-default sysctl list.

    The only intrusive part of this patch is propagating the information
    needed to decide which list of sysctls to use for sysctl_check_table.
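
    A toy user-space model of the "list of lists" structure (hypothetical
    names throughout, none of the kernel's actual types): each root either
    carries a fixed header list or a lookup hook returning the list for the
    current context, and iteration simply walks root by root:

    #include <stdio.h>

    struct header { const char *name; struct header *next; };

    struct root {
            struct root *next;
            struct header *(*lookup)(void *ctx);  /* per-context list, may be NULL */
            struct header *global_list;           /* used when lookup is NULL */
    };

    static struct header *root_list(struct root *r, void *ctx)
    {
            return r->lookup ? r->lookup(ctx) : r->global_list;
    }

    /* One header hanging off the default root, one hanging off a "net" root
     * whose list depends on the context (namespace) passed in. */
    static struct header kernel_hdr = { "kernel.hostname", NULL };
    static struct header net_ns_hdr = { "net.core.somaxconn", NULL };
    static struct header *net_lookup(void *ctx) { return ctx; }

    int main(void)
    {
            struct root net_root = { NULL, net_lookup, NULL };
            struct root default_root = { &net_root, NULL, &kernel_hdr };
            struct root *r;
            struct header *h;

            /* "sysctl_head_next": walk every root, then every header it
             * exposes for the current context. */
            for (r = &default_root; r; r = r->next)
                    for (h = root_list(r, &net_ns_hdr); h; h = h->next)
                            printf("%s\n", h->name);
            return 0;
    }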

    Signed-off-by: Eric W. Biederman
    Cc: Serge Hallyn
    Cc: Daniel Lezcano
    Cc: Cedric Le Goater
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • By doing this we allow users of register_sysctl_paths that build and
    dynamically allocate their ctl_table to be simpler: they can just remember
    the ctl_table_header returned from register_sysctl_paths, from which they
    can now find the ctl_table array they need to free.

    Signed-off-by: Eric W. Biederman
    Cc: Serge Hallyn
    Cc: Daniel Lezcano
    Cc: Cedric Le Goater
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • There are a number of modules that register a sysctl table
    somewhere deeply nested in the sysctl hierarchy, such as
    fs/nfs, fs/xfs, dev/cdrom, etc.

    They all specify several dummy ctl_tables for the path name.
    This patch implements register_sysctl_path that takes
    an additional path name, and makes up dummy sysctl nodes
    for each component.
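
    As a sketch of what a converted caller might look like, assuming the
    register_sysctl_paths()/struct ctl_path interface this series introduces
    (the "fs/example" knob and the exact field layout are from memory, not
    from the patch):

    #include <linux/sysctl.h>
    #include <linux/module.h>
    #include <linux/init.h>
    #include <linux/errno.h>

    static int example_value;                 /* hypothetical knob */

    static struct ctl_table example_table[] = {
            {
                    .ctl_name       = CTL_UNNUMBERED,
                    .procname       = "example_value",
                    .data           = &example_value,
                    .maxlen         = sizeof(int),
                    .mode           = 0644,
                    .proc_handler   = &proc_dointvec,
            },
            { }
    };

    /* The hand-rolled "fs" directory table becomes a path description. */
    static struct ctl_path example_path[] = {
            { .procname = "fs",      .ctl_name = CTL_FS },
            { .procname = "example", .ctl_name = CTL_UNNUMBERED },
            { }
    };

    static struct ctl_table_header *example_header;

    static int __init example_sysctl_init(void)
    {
            example_header = register_sysctl_paths(example_path, example_table);
            return example_header ? 0 : -ENOMEM;
    }

    static void __exit example_sysctl_exit(void)
    {
            unregister_sysctl_table(example_header);
    }

    module_init(example_sysctl_init);
    module_exit(example_sysctl_exit);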

    This patch was originally written by Olaf Kirch and
    brought to my attention and reworked some by Olaf Hering.
    I have changed a few additional things so the bugs are mine.

    After converting all of the easy callers, Olaf Hering observed that for an
    allyesconfig ARCH=i386 build the patch reduces the final binary size by
    9369 bytes.

    .text +897
    .data -7008

    text data bss dec hex filename
    26959310 4045899 4718592 35723801 2211a19 ../vmlinux-vanilla
    26960207 4038891 4718592 35717690 221023a ../O-allyesconfig/vmlinux

    So this change is both a space savings and a code simplification.

    CC: Olaf Kirch
    CC: Olaf Hering
    Signed-off-by: Eric W. Biederman
    Cc: Serge Hallyn
    Cc: Daniel Lezcano
    Cc: Cedric Le Goater
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

26 Jan, 2008

5 commits

  • fix softlockup tunables signedness.

    mark tunables read-mostly.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • LatencyTOP kernel infrastructure; it measures latencies in the scheduler
    and tracks them system-wide and per process.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Ingo Molnar

    Arjan van de Ven
     
  • Very simple time limit on the realtime scheduling classes.
    Allow the rq's realtime class to consume sched_rt_ratio of every
    sched_rt_period slice. If the class exceeds this quota the fair class
    will preempt the realtime class.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • this patch extends the soft-lockup detector to automatically
    detect hung TASK_UNINTERRUPTIBLE tasks. Such hung tasks are
    printed the following way:

    ------------------>
    INFO: task prctl:3042 blocked for more than 120 seconds.
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message
    prctl D fd5e3793 0 3042 2997
    f6050f38 00000046 00000001 fd5e3793 00000009 c06d8264 c06dae80 00000286
    f6050f40 f6050f00 f7d34d90 f7d34fc8 c1e1be80 00000001 f6050000 00000000
    f7e92d00 00000286 f6050f18 c0489d1a f6050f40 00006605 00000000 c0133a5b
    Call Trace:
    [] schedule_timeout+0x6d/0x8b
    [] schedule_timeout_uninterruptible+0x15/0x17
    [] msleep+0x10/0x16
    [] sys_prctl+0x30/0x1e2
    [] sysenter_past_esp+0x5f/0xa5
    =======================
    2 locks held by prctl/3042:
    #0: (&sb->s_type->i_mutex_key#5){--..}, at: [] do_fsync+0x38/0x7a
    #1: (jbd_handle){--..}, at: [] journal_start+0xc7/0xe9
    : CPU hotplug fixes. ]
    [ Andrew Morton : build warning fix. ]

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven

    Ingo Molnar
     
  • The current load balancing scheme isn't good enough for precise
    group fairness.

    For example: on a 8-cpu system, I created 3 groups as under:

    a = 8 tasks (cpu.shares = 1024)
    b = 4 tasks (cpu.shares = 1024)
    c = 3 tasks (cpu.shares = 1024)

    a, b and c are task groups that have equal weight. We would expect each
    of the groups to receive 33.33% of cpu bandwidth under a fair scheduler.

    This is what I get with the latest scheduler git tree:

    Signed-off-by: Ingo Molnar
    --------------------------------------------------------------------------------
    Col1 | Col2 | Col3 | Col4
    ------|---------|-------|-------------------------------------------------------
    a | 277.676 | 57.8% | 54.1% 54.1% 54.1% 54.2% 56.7% 62.2% 62.8% 64.5%
    b | 116.108 | 24.2% | 47.4% 48.1% 48.7% 49.3%
    c | 86.326 | 18.0% | 47.5% 47.9% 48.5%
    --------------------------------------------------------------------------------

    Explanation of o/p:

    Col1 -> Group name
    Col2 -> Cumulative execution time (in seconds) received by all tasks of that
    group in a 60sec window across 8 cpus
    Col3 -> CPU bandwidth received by the group in the 60sec window, expressed in
    percentage. Col3 data is derived as:
    Col3 = 100 * Col2 / (NR_CPUS * 60)
    Col4 -> CPU bandwidth received by each individual task of the group.
    Col4 = 100 * cpu_time_recd_by_task / 60

    [I can share the test case that produces a similar o/p if reqd]

    The deviation from desired group fairness is as below:

    a = +24.47%
    b = -9.13%
    c = -15.33%

    which is quite high.

    After the patch below is applied, here are the results:

    --------------------------------------------------------------------------------
    Col1 | Col2 | Col3 | Col4
    ------|---------|-------|-------------------------------------------------------
    a | 163.112 | 34.0% | 33.2% 33.4% 33.5% 33.5% 33.7% 34.4% 34.8% 35.3%
    b | 156.220 | 32.5% | 63.3% 64.5% 66.1% 66.5%
    c | 160.653 | 33.5% | 85.8% 90.6% 91.4%
    --------------------------------------------------------------------------------

    Deviation from desired group fairness is as below:

    a = +0.67%
    b = -0.83%
    c = +0.17%

    which is far better IMO. Most of other runs have yielded a deviation within
    +-2% at the most, which is good.

    Why do we see bad (group) fairness with the current scheduler?
    =========================================================

    Currently cpu's weight is just the summation of individual task weights.
    This can yield incorrect results. For ex: consider three groups as below
    on a 2-cpu system:

    CPU0 CPU1
    ---------------------------
    A (10) B(5)
    C(5)
    ---------------------------

    Group A has 10 tasks, all on CPU0, Group B and C have 5 tasks each all
    of which are on CPU1. Each task has the same weight (NICE_0_LOAD =
    1024).

    The current scheme would yield a cpu weight of 10240 (10*1024) for each cpu and
    the load balancer will think both CPUs are perfectly balanced and won't
    move around any tasks. This, however, would yield this bandwidth:

    A = 50%
    B = 25%
    C = 25%

    which is not the desired result.

    What's changing in the patch?
    =============================

    - How cpu weights are calculated when CONFIG_FAIR_GROUP_SCHED is
    defined (see below)
    - API Change
    - Two tunables introduced in sysfs (under SCHED_DEBUG) to
    control the frequency at which the load balance monitor
    thread runs.

    The basic change made in this patch is how cpu weight (rq->load.weight) is
    calculated. Its now calculated as the summation of group weights on a cpu,
    rather than summation of task weights. Weight exerted by a group on a
    cpu is dependent on the shares allocated to it and also the number of
    tasks the group has on that cpu compared to the total number of
    (runnable) tasks the group has in the system.

    Let,
    W(K,i) = Weight of group K on cpu i
    T(K,i) = Task load present in group K's cfs_rq on cpu i
    T(K) = Total task load of group K across various cpus
    S(K) = Shares allocated to group K
    NRCPUS = Number of online cpus in the scheduler domain to
    which group K is assigned.

    Then,
    W(K,i) = S(K) * NRCPUS * T(K,i) / T(K)
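
    Plugging the earlier 2-cpu example (group A: 10 tasks on CPU0; groups B
    and C: 5 tasks each on CPU1; equal shares of 1024) into this formula shows
    how the per-group weights expose the imbalance that plain task-weight
    summation hides; a small standalone check:

    /* Worked instance of W(K,i) = S(K) * NRCPUS * T(K,i) / T(K) for the
     * 2-cpu example above. Values come straight from the formula in the
     * changelog; nothing here is kernel-specific. */
    #include <stdio.h>

    #define NICE_0_LOAD     1024UL
    #define NRCPUS          2UL

    static unsigned long group_weight(unsigned long shares,
                                      unsigned long t_ki, unsigned long t_k)
    {
            return shares * NRCPUS * t_ki / t_k;
    }

    int main(void)
    {
            /* Task loads per cpu: A has 10 tasks on cpu0, B and C have 5 each
             * on cpu1, every task at NICE_0_LOAD. */
            unsigned long a0 = 10 * NICE_0_LOAD;
            unsigned long b1 = 5 * NICE_0_LOAD, c1 = 5 * NICE_0_LOAD;

            /* Old scheme: plain sum of task weights per cpu. */
            printf("old: cpu0=%lu cpu1=%lu (looks balanced)\n", a0, b1 + c1);

            /* New scheme: per-group weights, summed per cpu. */
            unsigned long w_a0 = group_weight(1024, a0, a0); /* all of A on cpu0 */
            unsigned long w_b1 = group_weight(1024, b1, b1);
            unsigned long w_c1 = group_weight(1024, c1, c1);
            printf("new: cpu0=%lu cpu1=%lu (imbalance now visible)\n",
                   w_a0, w_b1 + w_c1);
            return 0;
    }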

    A load balance monitor thread is created at bootup, which periodically
    runs and adjusts group's weight on each cpu. To avoid its overhead, two
    min/max tunables are introduced (under SCHED_DEBUG) to control the rate
    at which it runs.

    Fixes from: Peter Zijlstra

    - don't start the load_balance_monitor when there is only a single cpu.
    - rename the kthread because it's currently longer than TASK_COMM_LEN

    Signed-off-by: Srivatsa Vaddagiri
    Signed-off-by: Ingo Molnar

    Srivatsa Vaddagiri
     

18 Dec, 2007

3 commits

  • min_sched_granularity_ns, max_sched_granularity_ns,
    min_wakeup_granularity_ns and max_wakeup_granularity_ns are declared
    "unsigned long".

    This is incorrect since proc_dointvec_minmax() expects plain "int" guard
    values.

    This bug only triggers on big endian 64 bit arches.

    Signed-off-by: Eric Dumazet
    Signed-off-by: Ingo Molnar

    Eric Dumazet
     
  • This reverts commit 54f9f80d6543fb7b157d3b11e2e7911dc1379790 ("hugetlb:
    Add hugetlb_dynamic_pool sysctl")

    Given the new sysctl nr_overcommit_hugepages, the boolean dynamic pool
    sysctl is not needed, as its semantics can be expressed by 0 in the
    overcommit sysctl (no dynamic pool) and non-0 in the overcommit sysctl
    (pool enabled).

    (Needed in 2.6.24 since it reverts a post-2.6.23 userspace-visible change)

    Signed-off-by: Nishanth Aravamudan
    Acked-by: Adam Litke
    Cc: William Lee Irwin III
    Cc: Dave Hansen
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     
  • hugetlb: introduce nr_overcommit_hugepages sysctl

    While examining the code to support /proc/sys/vm/hugetlb_dynamic_pool, I
    became convinced that having a boolean sysctl was insufficient:

    1) To support per-node control of hugepages, I have previously submitted
    patches to add a sysfs attribute related to nr_hugepages. However, with
    a boolean global value and per-mount quota enforcement constraining the
    dynamic pool, adding corresponding control of the dynamic pool on a
    per-node basis seems inconsistent to me.

    2) Administration of the hugetlb dynamic pool with multiple hugetlbfs
    mount points is, arguably, more arduous than it needs to be. Each quota
    would need to be set separately, and the sum would need to be monitored.

    To ease the administration, and to help make the way for per-node
    control of the static & dynamic hugepage pool, I added a separate
    sysctl, nr_overcommit_hugepages. This value serves as a high watermark
    for the overall hugepage pool, while nr_hugepages serves as a low
    watermark. The boolean sysctl can then be removed, as the condition

    nr_overcommit_hugepages > 0

    indicates the same administrative setting as

    hugetlb_dynamic_pool == 1

    Quotas still serve as local enforcement of the size of the pool on a
    per-mount basis.

    A few caveats:

    1) There is a race whereby the global surplus huge page counter is
    incremented before a hugepage has actually been allocated. Another process
    could then try to grow the pool, and fail to convert a surplus huge page
    to a normal
    huge page and instead allocate a fresh huge page. I believe this is
    benign, as no memory is leaked (the actual pages are still tracked
    correctly) and the counters won't go out of sync.

    2) Shrinking the static pool while a surplus is in effect will allow the
    number of surplus huge pages to exceed the overcommit value. As long as
    this condition holds, however, no more surplus huge pages will be
    allowed on the system until one of the two sysctls is increased
    sufficiently, or the surplus huge pages go out of use and are freed.

    Successfully tested on x86_64 with the current libhugetlbfs snapshot,
    modified to use the new sysctl.

    Signed-off-by: Nishanth Aravamudan
    Acked-by: Adam Litke
    Cc: William Lee Irwin III
    Cc: Dave Hansen
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     

06 Dec, 2007

1 commit

  • register_sysctl_table() can return NULL sometimes, e.g. when kmalloc()
    returns NULL or when sysctl check fails.

    I've also noticed that much (most?) of the code in the kernel doesn't
    check the return value from register_sysctl_table() and later simply calls
    unregister_sysctl_table() with a potentially NULL argument.

    This is unlikely on a common kernel configuration, but in case we're
    dealing with modules and/or fault-injection support, there's a slight
    possibility of an OOPS.

    Changing all the users to check the return code from registration does not
    look like a good solution - there is too much code doing this, and failure
    in sysctl table registration is not a good reason to abort module loading
    (in most of the cases).

    So I think that we can just have this check in unregister_sysctl_table()
    to avoid accidental oopses (actually, unregister_sysctl_table() did
    exactly this before start_unregistering() appeared).
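
    The resulting calling convention mirrors free(NULL): teardown silently
    ignores a NULL handle, so callers need not branch on a failed
    registration. A tiny generic illustration of the pattern (not the kernel
    code):

    #include <stdio.h>
    #include <stdlib.h>

    struct table_header { const char *name; };

    static struct table_header *register_table(const char *name)
    {
            struct table_header *h = malloc(sizeof(*h));

            if (h)
                    h->name = name;
            return h;       /* may be NULL, like register_sysctl_table() */
    }

    static void unregister_table(struct table_header *h)
    {
            if (!h)         /* tolerate failed registrations */
                    return;
            free(h);
    }

    int main(void)
    {
            struct table_header *ok = register_table("net");

            unregister_table(ok);
            unregister_table(NULL); /* harmless, no oops */
            printf("done\n");
            return 0;
    }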

    Signed-off-by: Pavel Emelyanov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     

15 Nov, 2007

1 commit


10 Nov, 2007

3 commits

  • SMP balancing is done with IRQs disabled and can iterate the full rq.
    When rqs are large this can cause large irq-latencies. Limit the nr of
    iterations on each run.

    This fixes a scheduling latency regression reported by the -rt folks.

    Signed-off-by: Peter Zijlstra
    Acked-by: Steven Rostedt
    Tested-by: Gregory Haskins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • 1) hardcoded 1000000000 value is used five times in places where
    NSEC_PER_SEC might be more readable.

    2) A conversion from nsec to msec uses the hardcoded 1000000 value,
    which is a candidate for NSEC_PER_MSEC.

    no code changed:

    text data bss dec hex filename
    44359 3326 36 47721 ba69 sched.o.before
    44359 3326 36 47721 ba69 sched.o.after

    Signed-off-by: Eric Dumazet
    Signed-off-by: Ingo Molnar

    Eric Dumazet
     
  • we lost the sched_min_granularity tunable to a clever optimization
    that uses the sched_latency/min_granularity ratio - but the ratio
    is quite unintuitive to users and can also crash the kernel if the
    ratio is set to 0. So reintroduce the min_granularity tunable,
    while keeping the ratio maintained internally.

    no functionality changed.

    [ mingo@elte.hu: some fixlets. ]

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

20 Oct, 2007

2 commits

  • This is the largest patch in the set. Make all (I hope) the places where a
    pid is shown to or obtained from userspace operate on virtual pids.

    The idea is:
    - all in-kernel data structures must store either struct pid itself
    or the pid's global nr, obtained with pid_nr() call;
    - when seeking the task from kernel code with the stored id one
    should use find_task_by_pid() call that works with global pids;
    - when showing a pid's numerical value to the user the virtual one
    should be used; however, when one shows a task's pid outside this
    task's namespace the global one is to be used;
    - when getting a pid from userspace one needs to consider it as
    the virtual one and use the appropriate task/pid-searching functions.

    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: nuther build fix]
    [akpm@linux-foundation.org: yet nuther build fix]
    [akpm@linux-foundation.org: remove unneeded casts]
    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Alexey Dobriyan
    Cc: Sukadev Bhattiprolu
    Cc: Oleg Nesterov
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • is_init() is an ambiguous name for the pid==1 check. Split it into
    is_global_init() and is_container_init().

    A cgroup init has its tsk->pid == 1.

    A global init also has its tsk->pid == 1 and its active pid namespace
    is the init_pid_ns. But rather than check the active pid namespace,
    compare the task structure with 'init_pid_ns.child_reaper', which is
    initialized during boot to the /sbin/init process and never changes.

    Changelog:

    2.6.22-rc4-mm2-pidns1:
    - Use 'init_pid_ns.child_reaper' to determine if a given task is the
    global init (/sbin/init) process. This would improve performance
    and remove dependence on the task_pid().

    2.6.21-mm2-pidns2:

    - [Sukadev Bhattiprolu] Changed is_container_init() calls in {powerpc,
    ppc,avr32}/traps.c for the _exception() call to is_global_init().
    This way, we kill only the cgroup if the cgroup's init has a
    bug rather than force a kernel panic.

    [akpm@linux-foundation.org: fix comment]
    [sukadev@us.ibm.com: Use is_global_init() in arch/m32r/mm/fault.c]
    [bunk@stusta.de: kernel/pid.c: remove unused exports]
    [sukadev@us.ibm.com: Fix capability.c to work with threaded init]
    Signed-off-by: Serge E. Hallyn
    Signed-off-by: Sukadev Bhattiprolu
    Acked-by: Pavel Emelianov
    Cc: Eric W. Biederman
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Herbert Poetzel
    Cc: Kirill Korotaev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     

19 Oct, 2007

2 commits

  • The non-filesystem capability meaning of CAP_SETPCAP is that a process, p1,
    can change the capabilities of another process, p2. This is not the
    meaning that was intended for this capability at all, and this
    implementation came about purely because, without filesystem capabilities,
    there was no way to use capabilities without one process bestowing them on
    another.

    Since we now have a filesystem support for capabilities we can fix the
    implementation of CAP_SETPCAP.

    The most significant thing about this change is that, with it in effect, no
    process can set the capabilities of another process.

    The capabilities of a program are set via the capability convolution
    rules:

    pI(post-exec) = pI(pre-exec)
    pP(post-exec) = (X(aka cap_bset) & fP) | (pI(post-exec) & fI)
    pE(post-exec) = fE ? pP(post-exec) : 0

    at exec() time. As such, the only influence the pre-exec() program can
    have on the post-exec() program's capabilities is through the pI
    capability set.
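
    A quick numeric check of these rules, with the capability sets modelled
    as plain bitmasks (values made up for illustration; CAP_NET_RAW is bit
    13):

    /* Apply the exec-time convolution rules from the text to toy values:
     * pP' = (X & fP) | (pI & fI), pE' = fE ? pP' : 0, pI' = pI. */
    #include <stdio.h>

    int main(void)
    {
            unsigned int X  = 0xffffffff;   /* cap_bset: everything allowed */
            unsigned int pI = 0x00002000;   /* CAP_NET_RAW inheritable       */
            unsigned int fP = 0x00000000;   /* file has no permitted caps    */
            unsigned int fI = 0x00002000;   /* file inherits CAP_NET_RAW     */
            unsigned int fE = 1;            /* file "effective" bit set      */

            unsigned int pP_post = (X & fP) | (pI & fI);
            unsigned int pE_post = fE ? pP_post : 0;

            /* The only way this exec gains CAP_NET_RAW is via pI & fI:
             * exactly the ping-with-"= cap_net_raw+i" scenario below. */
            printf("pP' = %#x, pE' = %#x, pI' = %#x\n", pP_post, pE_post, pI);
            return 0;
    }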

    The correct implementation for CAP_SETPCAP (and that enabled by this patch)
    is that it can be used to add extra pI capabilities to the current process
    - to be picked up by subsequent exec()s when the above convolution rules
    are applied.

    Here is how it works:

    Let's say we have a process, p. It has capability sets, pE, pP and pI.
    Generally, p can change the value of its own pI to pI' where

    (pI' & ~pI) & ~pP = 0.

    That is, the only new things in pI' that were not present in pI need to
    be present in pP.

    The role of CAP_SETPCAP is basically to permit changes to pI beyond
    the above:

    if (pE & CAP_SETPCAP) {
    pI' = anything; /* ie., even (pI' & ~pI) & ~pP != 0 */
    }

    This capability is useful for things like login, which (say, via
    pam_cap) might want to raise certain inheritable capabilities for use
    by the children of the logged-in user's shell, but those capabilities
    are not useful to or needed by the login program itself.

    One such use might be to limit who can run ping. You set the
    capabilities of the 'ping' program to be "= cap_net_raw+i", and then
    only shells that have (pI & CAP_NET_RAW) will be able to run
    it. Without CAP_SETPCAP implemented as described above, login(pam_cap)
    would have to also have (pP & CAP_NET_RAW) in order to raise this
    capability and pass it on through the inheritable set.

    Signed-off-by: Andrew Morgan
    Signed-off-by: Serge E. Hallyn
    Cc: Stephen Smalley
    Cc: James Morris
    Cc: Casey Schaufler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morgan
     
  • After adding checking to register_sysctl_table and finding a whole new set
    of bugs, missed by countless code reviews and testers, I have finally lost
    patience with the binary sysctl interface.

    The binary sysctl interface has been sort of deprecated for years, and
    finding a user space program that uses the syscall is more difficult than
    finding a needle in a haystack. Problems continue to crop up with the
    in-kernel implementation. So, since supporting something that no one uses
    is silly, deprecate sys_sysctl with a sufficient grace period and notice,
    so that the handful of user space applications that care can be fixed or
    replaced.

    The /proc/sys sysctl interface that people use will continue to be
    supported indefinitely.

    This patch moves the tested warning about sysctls from the sys_sysctl path
    to a separate path called from both implementations of sys_sysctl, and it
    adds a proper entry into Documentation/feature-removal-schedule, allowing
    us to revisit this in a couple of years' time and actually kill
    sys_sysctl.

    [lethal@linux-sh.org: sysctl: Fix syscall disabled build]
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Paul Mundt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman