29 Apr, 2008

4 commits

  • When reading from or writing to a table, the root from which the table
    came may affect the table's permissions, depending on who is working with
    the table.

    The core hunk is at the bottom of this patch. All the rest is just pushing
    the ctl_table_root argument up to the sysctl_perm() function.

    This will be mostly (only?) used in the net sysctls.
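
    As a rough, self-contained illustration of the idea (not the kernel code;
    the types and names below only loosely mirror the real ones), the
    permission check is handed the root the table hangs off, so a root can
    substitute a caller-dependent mode for the table's static one:

    #include <stdio.h>

    struct ctl_table { const char *procname; int mode; };

    struct ctl_table_root {
            /* may return a caller-dependent mode, or -1 to keep table->mode */
            int (*permissions)(struct ctl_table *table);
    };

    static int sysctl_perm(struct ctl_table_root *root, struct ctl_table *table,
                           int requested)
    {
            int mode = table->mode;

            if (root && root->permissions) {
                    int override = root->permissions(table);
                    if (override >= 0)
                            mode = override;
            }
            return (mode & requested) == requested;  /* 1 = allowed */
    }

    /* hypothetical net root that exposes everything read-only */
    static int net_permissions(struct ctl_table *table)
    {
            (void)table;
            return 0444;
    }

    int main(void)
    {
            struct ctl_table t = { "somaxconn", 0644 };
            struct ctl_table_root plain = { NULL }, net = { net_permissions };

            printf("write via plain root: %d\n", sysctl_perm(&plain, &t, 0200));
            printf("write via net root:   %d\n", sysctl_perm(&net, &t, 0200));
            return 0;
    }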

    Signed-off-by: Pavel Emelyanov
    Acked-by: David S. Miller
    Cc: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Cc: Denis V. Lunev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • do_sysctl_strategy() isn't used outside kernel/sysctl.c, so it can be made
    static, with its prototype removed from the header.

    Besides, move this function and parse_table() above their callers and drop
    the forward declaration of the latter.

    One more "besides": fix two checkpatch warnings - a space before a '(' and
    an extra space at the end of a line.
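
    A generic sketch of the define-above-the-caller pattern (nothing
    sysctl-specific about it): a static helper defined above its only caller
    needs neither a header prototype nor a forward declaration:

    #include <stdio.h>

    /* Defined above its caller, so no prototype or forward declaration is
     * needed; 'static' also keeps the symbol private to this file. */
    static int parse_entry(const char *name)
    {
            return name && *name;
    }

    int main(void)
    {
            printf("%d\n", parse_entry("kernel.hostname"));
            return 0;
    }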

    Signed-off-by: Pavel Emelyanov
    Acked-by: David S. Miller
    Cc: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Cc: Denis V. Lunev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Disable sysctl_check.c for embedded targets. This saves about 11 kB in
    .text and another 11 kB in .data on a PXA255 embedded platform.

    Signed-off-by: Holger Schurig
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Holger Schurig
     
  • Make the keyring quotas controllable through /proc/sys files:

    (*) /proc/sys/kernel/keys/root_maxkeys
    /proc/sys/kernel/keys/root_maxbytes

    Maximum number of keys that root may have and the maximum total number of
    bytes of data that root may have stored in those keys.

    (*) /proc/sys/kernel/keys/maxkeys
    /proc/sys/kernel/keys/maxbytes

    Maximum number of keys that each non-root user may have and the maximum
    total number of bytes of data that each of those users may have stored in
    their keys.

    Also increase the quotas, as a number of people have been complaining that
    they are not big enough. I'm not sure they are big enough now either, but,
    on the other hand, they can now be set in /etc/sysctl.conf.
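
    For instance, a small user-space reader (assuming the four files land
    exactly where listed above) can show the current limits; the same knobs
    can be made persistent with kernel.keys.* lines in /etc/sysctl.conf:

    /* Read back the four keyring quota knobs described above.
     * Paths are taken from the changelog; error handling is minimal. */
    #include <stdio.h>

    static void show(const char *path)
    {
            char buf[64];
            FILE *f = fopen(path, "r");

            if (!f) {
                    perror(path);
                    return;
            }
            if (fgets(buf, sizeof(buf), f))
                    printf("%-40s %s", path, buf);
            fclose(f);
    }

    int main(void)
    {
            show("/proc/sys/kernel/keys/root_maxkeys");
            show("/proc/sys/kernel/keys/root_maxbytes");
            show("/proc/sys/kernel/keys/maxkeys");
            show("/proc/sys/kernel/keys/maxbytes");
            return 0;
    }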

    Signed-off-by: David Howells
    Cc:
    Cc:
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     

20 Apr, 2008

2 commits


05 Mar, 2008

1 commit

  • The following commits cause a number of regressions:

    commit 58e2d4ca581167c2a079f4ee02be2f0bc52e8729
    Author: Srivatsa Vaddagiri
    Date: Fri Jan 25 21:08:00 2008 +0100
    sched: group scheduling, change how cpu load is calculated

    commit 6b2d7700266b9402e12824e11e0099ae6a4a6a79
    Author: Srivatsa Vaddagiri
    Date: Fri Jan 25 21:08:00 2008 +0100
    sched: group scheduler, fix fairness of cpu bandwidth allocation for task groups

    Namely:
    - very frequent wakeups on SMP, reported by PowerTop users.
    - cacheline trashing on (large) SMP
    - some latencies larger than 500ms

    While there is a mergeable patch to fix the latter, the former issues
    are not fixable in a manner suitable for .25 (we're at -rc3 now).

    Hence we revert them and try again in v2.6.26.

    Signed-off-by: Peter Zijlstra
    CC: Srivatsa Vaddagiri
    Tested-by: Alexey Zaytsev
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

14 Feb, 2008

1 commit

  • proc_doulongvec_minmax() calls copy_to_user()/copy_from_user(), so we can't
    hold hugetlb_lock over the call. Use a dummy variable to store the sysctl
    result, like in hugetlb_sysctl_handler(), then grab the lock to update
    nr_overcommit_huge_pages.
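
    The shape of the fix can be sketched in ordinary user-space C (a pthread
    mutex stands in for hugetlb_lock, and all names here are illustrative
    rather than the kernel's): do the potentially sleeping copy into a scratch
    variable with no lock held, then take the lock only to publish the result.

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
    static unsigned long nr_overcommit_pages;   /* protected by pool_lock */
    static unsigned long sysctl_scratch;        /* written without the lock */

    static int overcommit_handler(const char *user_input)
    {
            /* Step 1: parse/copy while unlocked (this is where the kernel
             * would sleep in copy_from_user(), so no spinlock may be held). */
            if (sscanf(user_input, "%lu", &sysctl_scratch) != 1)
                    return -1;

            /* Step 2: take the lock only for the actual counter update. */
            pthread_mutex_lock(&pool_lock);
            nr_overcommit_pages = sysctl_scratch;
            pthread_mutex_unlock(&pool_lock);
            return 0;
    }

    int main(void)
    {
            overcommit_handler("128");
            printf("nr_overcommit_pages = %lu\n", nr_overcommit_pages);
            return 0;
    }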

    Signed-off-by: Nishanth Aravamudan
    Reported-by: Miles Lane
    Cc: Adam Litke
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     

13 Feb, 2008

1 commit

  • Change the rt_ratio interface to rt_runtime_us, to match rt_period_us.
    This avoids picking a granularity for the ratio.

    Extend the /sys/kernel/uids// interface to allow setting
    the group's rt_runtime.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

09 Feb, 2008

4 commits

  • Makes an embedded image a bit smaller.

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Don't include linux/security.h twice in kernel/sysctl.c

    Signed-off-by: Jesper Juhl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Juhl
     
  • Some time ago the xxx_vnr() calls (e.g. pid_vnr or find_task_by_vpid) were
    _all_ converted to operate on the current pid namespace. After this each call
    like xxx_nr_ns(foo, current->nsproxy->pid_ns) is nothing but a xxx_vnr(foo)
    one.

    Switch all the xxx_nr_ns() callers to use the xxx_vnr() calls where
    appropriate.

    Signed-off-by: Pavel Emelyanov
    Reviewed-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • When I replaced hugetlb_dynamic_pool with nr_overcommit_hugepages I used
    proc_doulongvec_minmax() directly. However, hugetlb.c's locking rules
    require that all counter modifications occur under the hugetlb_lock. Add a
    callback into the hugetlb code similar to the one for nr_hugepages. Grab
    the lock around the manipulation of nr_overcommit_hugepages in
    proc_doulongvec_minmax().

    Signed-off-by: Nishanth Aravamudan
    Acked-by: Adam Litke
    Cc: David Gibson
    Cc: William Lee Irwin III
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     

08 Feb, 2008

1 commit

  • Adds a new sysctl, 'oom_dump_tasks', that enables the kernel to produce a
    dump of all system tasks (excluding kernel threads) when performing an
    OOM-killing. Information includes pid, uid, tgid, vm size, rss, cpu,
    oom_adj score, and name.

    This is helpful for determining why there was an OOM condition and which
    rogue task caused it.

    It is configurable so that large systems, such as those with several
    thousand tasks, do not incur a performance penalty associated with dumping
    data they may not desire.

    If an OOM was triggered as a result of a memory controller, the tasklist
    shall be filtered to exclude tasks that are not a member of the same
    cgroup.
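
    A minimal way to flip the toggle from user space (assuming the file
    appears as /proc/sys/vm/oom_dump_tasks, following the usual vm.* naming;
    requires root):

    /* Enable the task dump on OOM by writing "1" to the new sysctl. */
    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/proc/sys/vm/oom_dump_tasks", "w");

            if (!f) {
                    perror("oom_dump_tasks");
                    return 1;
            }
            fputs("1\n", f);
            fclose(f);
            return 0;
    }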

    Cc: Andrea Arcangeli
    Cc: Christoph Lameter
    Cc: Balbir Singh
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

07 Feb, 2008

1 commit

  • NR_OPEN (historically set to 1024*1024) actually forbids processes from
    opening more than 1024*1024 handles.

    Unfortunately some production servers hit the not so 'ridiculously high
    value' of 1024*1024 file descriptors per process.

    Changing NR_OPEN is not considered safe because of potential vmalloc space
    exhaustion.

    This patch introduces a new sysctl (/proc/sys/fs/nr_open) which defaults
    to 1024*1024, so that admins can decide to change this limit if their
    workload needs it.
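
    A sketch of how the pieces fit together from user space: an administrator
    raises fs.nr_open, and a process can then push its RLIMIT_NOFILE hard
    limit up to (but not beyond) that value:

    /* Read fs.nr_open and try to raise this process's RLIMIT_NOFILE to it.
     * The hard limit can only exceed the old 1024*1024 ceiling once an
     * administrator has raised /proc/sys/fs/nr_open. */
    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
            unsigned long nr_open = 0;
            struct rlimit rl;
            FILE *f = fopen("/proc/sys/fs/nr_open", "r");

            if (!f || fscanf(f, "%lu", &nr_open) != 1) {
                    perror("fs.nr_open");
                    return 1;
            }
            fclose(f);

            rl.rlim_cur = rl.rlim_max = nr_open;
            if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {  /* needs CAP_SYS_RESOURCE */
                    perror("setrlimit");
                    return 1;
            }
            printf("RLIMIT_NOFILE raised to %lu\n", nr_open);
            return 0;
    }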

    [akpm@linux-foundation.org: export it for sparc64]
    Signed-off-by: Eric Dumazet
    Cc: Alan Cox
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: "David S. Miller"
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

06 Feb, 2008

2 commits

  • The capability bounding set is a set beyond which capabilities cannot grow.
    Currently cap_bset is per-system. It can be manipulated through sysctl,
    but only init can add capabilities. Root can remove capabilities. By
    default it includes all caps except CAP_SETPCAP.

    This patch makes the bounding set per-process when file capabilities are
    enabled. It is inherited at fork from the parent. No one can add elements;
    CAP_SETPCAP is required to remove them.

    One example use of this is to start a safer container. For instance, until
    device namespaces or per-container device whitelists are introduced, it is
    best to take CAP_MKNOD away from a container.

    The bounding set will not affect pP and pE immediately. It will only
    affect pP' and pE' after subsequent exec()s. It also does not affect pI,
    and exec() does not constrain pI'. So to really start a shell with no way
    of regaining CAP_MKNOD, you would do

    prctl(PR_CAPBSET_DROP, CAP_MKNOD);
    cap_t cap = cap_get_proc();
    cap_value_t caparray[1];
    caparray[0] = CAP_MKNOD;
    cap_set_flag(cap, CAP_INHERITABLE, 1, caparray, CAP_DROP);
    cap_set_proc(cap);
    cap_free(cap);

    The following test program will get and set the bounding
    set (but not pI). For instance

    ./bset get
    (lists capabilities in bset)
    ./bset drop cap_net_raw
    (starts shell with new bset)
    (use capset, setuid binary, or binary with
    file capabilities to try to increase caps)

    ************************************************************
    cap_bound.c
    ************************************************************
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    #ifndef PR_CAPBSET_READ
    #define PR_CAPBSET_READ 23
    #endif

    #ifndef PR_CAPBSET_DROP
    #define PR_CAPBSET_DROP 24
    #endif

    int usage(char *me)
    {
    printf("Usage: %s get\n", me);
    printf(" %s drop \n", me);
    return 1;
    }

    #define numcaps 32
    char *captable[numcaps] = {
    "cap_chown",
    "cap_dac_override",
    "cap_dac_read_search",
    "cap_fowner",
    "cap_fsetid",
    "cap_kill",
    "cap_setgid",
    "cap_setuid",
    "cap_setpcap",
    "cap_linux_immutable",
    "cap_net_bind_service",
    "cap_net_broadcast",
    "cap_net_admin",
    "cap_net_raw",
    "cap_ipc_lock",
    "cap_ipc_owner",
    "cap_sys_module",
    "cap_sys_rawio",
    "cap_sys_chroot",
    "cap_sys_ptrace",
    "cap_sys_pacct",
    "cap_sys_admin",
    "cap_sys_boot",
    "cap_sys_nice",
    "cap_sys_resource",
    "cap_sys_time",
    "cap_sys_tty_config",
    "cap_mknod",
    "cap_lease",
    "cap_audit_write",
    "cap_audit_control",
    "cap_setfcap"
    };

    int getbcap(void)
    {
    int comma=0;
    unsigned long i;
    int ret;

    printf("i know of %d capabilities\n", numcaps);
    printf("capability bounding set:");
    for (i=0; i<numcaps; i++) {
    ret = prctl(PR_CAPBSET_READ, i);
    if (ret < 0)
    perror("prctl");
    else if (ret==1)
    printf("%s%s", (comma++) ? ", " : " ", captable[i]);
    }
    printf("\n");
    return 0;
    }

    int capdrop(char *str)
    {
    unsigned long i;

    int found=0;
    for (i=0; i
    Signed-off-by: Andrew G. Morgan
    Cc: Stephen Smalley
    Cc: James Morris
    Cc: Chris Wright
    Cc: Casey Schaufler
    Signed-off-by: "Serge E. Hallyn"
    Tested-by: Jiri Slaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • Add vm.highmem_is_dirtyable toggle

    A 32-bit machine with HIGHMEM64 enabled running DCC has an mmapped file of
    approximately 2GB which contains a hash format that is written randomly by
    the dbclean process. On 2.6.16 this process took a few minutes. With
    lowmem-only accounting of dirty ratios, it takes about 12 hours of 100%
    disk IO, all random writes.

    Include a toggle in /proc/sys/vm/highmem_is_dirtyable which can be set to 1 to
    add the highmem back to the total available memory count.

    [akpm@linux-foundation.org: Fix the CONFIG_DETECT_SOFTLOCKUP=y build]
    Signed-off-by: Bron Gondwana
    Cc: Ethan Solomita
    Cc: Peter Zijlstra
    Cc: WU Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bron Gondwana
     

02 Feb, 2008

1 commit

  • execve arguments can be quite large. There is no limit on the number of
    arguments and a 4G limit on the size of an argument.

    This patch prints those arguments in bite-sized pieces. A userspace size
    limitation of 8k was discovered, so this keeps messages around 7.5k.

    Single arguments larger than 7.5k in length are split into multiple
    records and can be identified as aX[Y]=

    Signed-off-by: Eric Paris

    Eric Paris
     

30 Jan, 2008

1 commit

  • various changes to the in_p/out_p delay details:

    - add the io_delay=none method
    - make each method selectable from the kernel config
    - simplify the delay code a bit by getting rid of an indirect function call
    - add the /proc/sys/kernel/io_delay_type sysctl
    - change 'io_delay=standard|alternate' to io_delay=0x80 and io_delay=0xed
    - make the io delay config not depend on CONFIG_DEBUG_KERNEL

    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner
    Tested-by: "David P. Reed"

    Ingo Molnar
     

29 Jan, 2008

4 commits

  • I have removed all the entries from this table (core_table,
    ipv4_table and tr_table), so now we can safely drop it.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • This patch implements the basic infrastructure for per namespace sysctls.

    A list of lists of sysctl headers is added, allowing each namespace to
    have its own list of sysctl headers.

    Each list of sysctl headers has a lookup function to find the first
    sysctl header in the list, allowing the lists to have a per namespace
    instance.

    register_sysctl_root is added to tell sysctl.c about additional lists of
    sysctl headers. As all of the users are expected to be in-kernel, no
    unregister function is provided.

    sysctl_head_next is updated to walk through the list of lists.

    __register_sysctl_paths is added to add a new sysctl table on
    a non-default sysctl list.

    The only intrusive part of this patch is propagating the information
    needed to decide which list of sysctls to use for sysctl_check_table.
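
    A toy user-space model of the "list of lists" structure (hypothetical
    names throughout, none of the kernel's actual types): each root either
    carries a fixed header list or a lookup hook returning the list for the
    current context, and iteration simply walks root by root:

    #include <stdio.h>

    struct header { const char *name; struct header *next; };

    struct root {
            struct root *next;
            struct header *(*lookup)(void *ctx);  /* per-context list, may be NULL */
            struct header *global_list;           /* used when lookup is NULL */
    };

    static struct header *root_list(struct root *r, void *ctx)
    {
            return r->lookup ? r->lookup(ctx) : r->global_list;
    }

    /* One header hanging off the default root, one hanging off a "net" root
     * whose list depends on the context (namespace) passed in. */
    static struct header kernel_hdr = { "kernel.hostname", NULL };
    static struct header net_ns_hdr = { "net.core.somaxconn", NULL };
    static struct header *net_lookup(void *ctx) { return ctx; }

    int main(void)
    {
            struct root net_root = { NULL, net_lookup, NULL };
            struct root default_root = { &net_root, NULL, &kernel_hdr };
            struct root *r;
            struct header *h;

            /* "sysctl_head_next": walk every root, then every header it
             * exposes for the current context. */
            for (r = &default_root; r; r = r->next)
                    for (h = root_list(r, &net_ns_hdr); h; h = h->next)
                            printf("%s\n", h->name);
            return 0;
    }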

    Signed-off-by: Eric W. Biederman
    Cc: Serge Hallyn
    Cc: Daniel Lezcano
    Cc: Cedric Le Goater
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • By doing this we allow users of register_sysctl_paths that build and
    dynamically allocate their ctl_table to be simpler: they can just remember
    the ctl_table_header returned from register_sysctl_paths, from which they
    can now find the ctl_table array they need to free.

    Signed-off-by: Eric W. Biederman
    Cc: Serge Hallyn
    Cc: Daniel Lezcano
    Cc: Cedric Le Goater
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • There are a number of modules that register a sysctl table
    somewhere deeply nested in the sysctl hierarchy, such as
    fs/nfs, fs/xfs, dev/cdrom, etc.

    They all specify several dummy ctl_tables for the path name.
    This patch implements register_sysctl_path that takes
    an additional path name, and makes up dummy sysctl nodes
    for each component.
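
    As a sketch of what a converted caller might look like, assuming the
    register_sysctl_paths()/struct ctl_path interface this series introduces
    (the "fs/example" knob and the exact field layout are from memory, not
    from the patch):

    #include <linux/sysctl.h>
    #include <linux/module.h>
    #include <linux/init.h>
    #include <linux/errno.h>

    static int example_value;                 /* hypothetical knob */

    static struct ctl_table example_table[] = {
            {
                    .ctl_name       = CTL_UNNUMBERED,
                    .procname       = "example_value",
                    .data           = &example_value,
                    .maxlen         = sizeof(int),
                    .mode           = 0644,
                    .proc_handler   = &proc_dointvec,
            },
            { }
    };

    /* The hand-rolled "fs" directory table becomes a path description. */
    static struct ctl_path example_path[] = {
            { .procname = "fs",      .ctl_name = CTL_FS },
            { .procname = "example", .ctl_name = CTL_UNNUMBERED },
            { }
    };

    static struct ctl_table_header *example_header;

    static int __init example_sysctl_init(void)
    {
            example_header = register_sysctl_paths(example_path, example_table);
            return example_header ? 0 : -ENOMEM;
    }

    static void __exit example_sysctl_exit(void)
    {
            unregister_sysctl_table(example_header);
    }

    module_init(example_sysctl_init);
    module_exit(example_sysctl_exit);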

    This patch was originally written by Olaf Kirch and
    brought to my attention and reworked some by Olaf Hering.
    I have changed a few additional things so the bugs are mine.

    After converting all of the easy callers, Olaf Hering observed that for an
    allyesconfig ARCH=i386 build the patch reduces the final binary size by
    9369 bytes.

    .text +897
    .data -7008

    text data bss dec hex filename
    26959310 4045899 4718592 35723801 2211a19 ../vmlinux-vanilla
    26960207 4038891 4718592 35717690 221023a ../O-allyesconfig/vmlinux

    So this change is both a space savings and a code simplification.

    CC: Olaf Kirch
    CC: Olaf Hering
    Signed-off-by: Eric W. Biederman
    Cc: Serge Hallyn
    Cc: Daniel Lezcano
    Cc: Cedric Le Goater
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

26 Jan, 2008

5 commits

  • fix softlockup tunables signedness.

    mark tunables read-mostly.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • LatencyTOP kernel infrastructure; it measures latencies in the scheduler
    and tracks them system-wide and per process.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Ingo Molnar

    Arjan van de Ven
     
  • Very simple time limit on the realtime scheduling classes.
    Allow the rq's realtime class to consume sched_rt_ratio of every
    sched_rt_period slice. If the class exceeds this quota the fair class
    will preempt the realtime class.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • this patch extends the soft-lockup detector to automatically
    detect hung TASK_UNINTERRUPTIBLE tasks. Such hung tasks are
    printed the following way:

    ------------------>
    INFO: task prctl:3042 blocked for more than 120 seconds.
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message
    prctl D fd5e3793 0 3042 2997
    f6050f38 00000046 00000001 fd5e3793 00000009 c06d8264 c06dae80 00000286
    f6050f40 f6050f00 f7d34d90 f7d34fc8 c1e1be80 00000001 f6050000 00000000
    f7e92d00 00000286 f6050f18 c0489d1a f6050f40 00006605 00000000 c0133a5b
    Call Trace:
    [] schedule_timeout+0x6d/0x8b
    [] schedule_timeout_uninterruptible+0x15/0x17
    [] msleep+0x10/0x16
    [] sys_prctl+0x30/0x1e2
    [] sysenter_past_esp+0x5f/0xa5
    =======================
    2 locks held by prctl/3042:
    #0: (&sb->s_type->i_mutex_key#5){--..}, at: [] do_fsync+0x38/0x7a
    #1: (jbd_handle){--..}, at: [] journal_start+0xc7/0xe9
    : CPU hotplug fixes. ]
    [ Andrew Morton : build warning fix. ]

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven

    Ingo Molnar
     
  • The current load balancing scheme isn't good enough for precise
    group fairness.

    For example: on a 8-cpu system, I created 3 groups as under:

    a = 8 tasks (cpu.shares = 1024)
    b = 4 tasks (cpu.shares = 1024)
    c = 3 tasks (cpu.shares = 1024)

    a, b and c are task groups that have equal weight. We would expect each
    of the groups to receive 33.33% of cpu bandwidth under a fair scheduler.

    This is what I get with the latest scheduler git tree:

    Signed-off-by: Ingo Molnar
    --------------------------------------------------------------------------------
    Col1 | Col2 | Col3 | Col4
    ------|---------|-------|-------------------------------------------------------
    a | 277.676 | 57.8% | 54.1% 54.1% 54.1% 54.2% 56.7% 62.2% 62.8% 64.5%
    b | 116.108 | 24.2% | 47.4% 48.1% 48.7% 49.3%
    c | 86.326 | 18.0% | 47.5% 47.9% 48.5%
    --------------------------------------------------------------------------------

    Explanation of o/p:

    Col1 -> Group name
    Col2 -> Cumulative execution time (in seconds) received by all tasks of that
    group in a 60sec window across 8 cpus
    Col3 -> CPU bandwidth received by the group in the 60sec window, expressed in
    percentage. Col3 data is derived as:
    Col3 = 100 * Col2 / (NR_CPUS * 60)
    Col4 -> CPU bandwidth received by each individual task of the group.
    Col4 = 100 * cpu_time_recd_by_task / 60

    [I can share the test case that produces a similar o/p if reqd]

    The deviation from desired group fairness is as below:

    a = +24.47%
    b = -9.13%
    c = -15.33%

    which is quite high.

    After the patch below is applied, here are the results:

    --------------------------------------------------------------------------------
    Col1 | Col2 | Col3 | Col4
    ------|---------|-------|-------------------------------------------------------
    a | 163.112 | 34.0% | 33.2% 33.4% 33.5% 33.5% 33.7% 34.4% 34.8% 35.3%
    b | 156.220 | 32.5% | 63.3% 64.5% 66.1% 66.5%
    c | 160.653 | 33.5% | 85.8% 90.6% 91.4%
    --------------------------------------------------------------------------------

    Deviation from desired group fairness is as below:

    a = +0.67%
    b = -0.83%
    c = +0.17%

    which is far better IMO. Most of other runs have yielded a deviation within
    +-2% at the most, which is good.

    Why do we see bad (group) fairness with the current scheduler?
    =========================================================

    Currently cpu's weight is just the summation of individual task weights.
    This can yield incorrect results. For ex: consider three groups as below
    on a 2-cpu system:

    CPU0 CPU1
    ---------------------------
    A (10) B(5)
    C(5)
    ---------------------------

    Group A has 10 tasks, all on CPU0, Group B and C have 5 tasks each all
    of which are on CPU1. Each task has the same weight (NICE_0_LOAD =
    1024).

    The current scheme would yield a cpu weight of 10240 (10*1024) for each cpu and
    the load balancer will think both CPUs are perfectly balanced and won't
    move around any tasks. This, however, would yield this bandwidth:

    A = 50%
    B = 25%
    C = 25%

    which is not the desired result.

    What's changing in the patch?
    =============================

    - How cpu weights are calculated when CONFIG_FAIR_GROUP_SCHED is
    defined (see below)
    - API Change
    - Two tunables introduced in sysfs (under SCHED_DEBUG) to
    control the frequency at which the load balance monitor
    thread runs.

    The basic change made in this patch is how cpu weight (rq->load.weight) is
    calculated. Its now calculated as the summation of group weights on a cpu,
    rather than summation of task weights. Weight exerted by a group on a
    cpu is dependent on the shares allocated to it and also the number of
    tasks the group has on that cpu compared to the total number of
    (runnable) tasks the group has in the system.

    Let,
    W(K,i) = Weight of group K on cpu i
    T(K,i) = Task load present in group K's cfs_rq on cpu i
    T(K) = Total task load of group K across various cpus
    S(K) = Shares allocated to group K
    NRCPUS = Number of online cpus in the scheduler domain to
    which group K is assigned.

    Then,
    W(K,i) = S(K) * NRCPUS * T(K,i) / T(K)
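
    Plugging the earlier 2-cpu example (group A: 10 tasks on CPU0; groups B
    and C: 5 tasks each on CPU1; equal shares of 1024) into this formula shows
    how the per-group weights expose the imbalance that plain task-weight
    summation hides; a small standalone check:

    /* Worked instance of W(K,i) = S(K) * NRCPUS * T(K,i) / T(K) for the
     * 2-cpu example above. Values come straight from the formula in the
     * changelog; nothing here is kernel-specific. */
    #include <stdio.h>

    #define NICE_0_LOAD     1024UL
    #define NRCPUS          2UL

    static unsigned long group_weight(unsigned long shares,
                                      unsigned long t_ki, unsigned long t_k)
    {
            return shares * NRCPUS * t_ki / t_k;
    }

    int main(void)
    {
            /* Task loads per cpu: A has 10 tasks on cpu0, B and C have 5 each
             * on cpu1, every task at NICE_0_LOAD. */
            unsigned long a0 = 10 * NICE_0_LOAD;
            unsigned long b1 = 5 * NICE_0_LOAD, c1 = 5 * NICE_0_LOAD;

            /* Old scheme: plain sum of task weights per cpu. */
            printf("old: cpu0=%lu cpu1=%lu (looks balanced)\n", a0, b1 + c1);

            /* New scheme: per-group weights, summed per cpu. */
            unsigned long w_a0 = group_weight(1024, a0, a0); /* all of A on cpu0 */
            unsigned long w_b1 = group_weight(1024, b1, b1);
            unsigned long w_c1 = group_weight(1024, c1, c1);
            printf("new: cpu0=%lu cpu1=%lu (imbalance now visible)\n",
                   w_a0, w_b1 + w_c1);
            return 0;
    }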

    A load balance monitor thread is created at bootup, which periodically
    runs and adjusts group's weight on each cpu. To avoid its overhead, two
    min/max tunables are introduced (under SCHED_DEBUG) to control the rate
    at which it runs.

    Fixes from: Peter Zijlstra

    - don't start the load_balance_monitor when there is only a single cpu.
    - rename the kthread because it's currently longer than TASK_COMM_LEN

    Signed-off-by: Srivatsa Vaddagiri
    Signed-off-by: Ingo Molnar

    Srivatsa Vaddagiri
     

18 Dec, 2007

3 commits

  • min_sched_granularity_ns, max_sched_granularity_ns,
    min_wakeup_granularity_ns and max_wakeup_granularity_ns are declared
    "unsigned long".

    This is incorrect since proc_dointvec_minmax() expects plain "int" guard
    values.

    This bug only triggers on big endian 64 bit arches.

    Signed-off-by: Eric Dumazet
    Signed-off-by: Ingo Molnar

    Eric Dumazet
     
  • This reverts commit 54f9f80d6543fb7b157d3b11e2e7911dc1379790 ("hugetlb:
    Add hugetlb_dynamic_pool sysctl")

    Given the new sysctl nr_overcommit_hugepages, the boolean dynamic pool
    sysctl is not needed, as its semantics can be expressed by 0 in the
    overcommit sysctl (no dynamic pool) and non-0 in the overcommit sysctl
    (pool enabled).

    (Needed in 2.6.24 since it reverts a post-2.6.23 userspace-visible change)

    Signed-off-by: Nishanth Aravamudan
    Acked-by: Adam Litke
    Cc: William Lee Irwin III
    Cc: Dave Hansen
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     
  • hugetlb: introduce nr_overcommit_hugepages sysctl

    While examining the code to support /proc/sys/vm/hugetlb_dynamic_pool, I
    became convinced that having a boolean sysctl was insufficient:

    1) To support per-node control of hugepages, I have previously submitted
    patches to add a sysfs attribute related to nr_hugepages. However, with
    a boolean global value and per-mount quota enforcement constraining the
    dynamic pool, adding corresponding control of the dynamic pool on a
    per-node basis seems inconsistent to me.

    2) Administration of the hugetlb dynamic pool with multiple hugetlbfs
    mount points is, arguably, more arduous than it needs to be. Each quota
    would need to be set separately, and the sum would need to be monitored.

    To ease the administration, and to help make the way for per-node
    control of the static & dynamic hugepage pool, I added a separate
    sysctl, nr_overcommit_hugepages. This value serves as a high watermark
    for the overall hugepage pool, while nr_hugepages serves as a low
    watermark. The boolean sysctl can then be removed, as the condition

    nr_overcommit_hugepages > 0

    indicates the same administrative setting as

    hugetlb_dynamic_pool == 1

    Quotas still serve as local enforcement of the size of the pool on a
    per-mount basis.

    A few caveats:

    1) There is a race whereby the global surplus huge page counter is
    incremented before a hugepage has actually been allocated. Another process
    could then try to grow the pool, and fail to convert a surplus huge page
    to a normal
    huge page and instead allocate a fresh huge page. I believe this is
    benign, as no memory is leaked (the actual pages are still tracked
    correctly) and the counters won't go out of sync.

    2) Shrinking the static pool while a surplus is in effect will allow the
    number of surplus huge pages to exceed the overcommit value. As long as
    this condition holds, however, no more surplus huge pages will be
    allowed on the system until one of the two sysctls is increased
    sufficiently, or the surplus huge pages go out of use and are freed.

    Successfully tested on x86_64 with the current libhugetlbfs snapshot,
    modified to use the new sysctl.

    Signed-off-by: Nishanth Aravamudan
    Acked-by: Adam Litke
    Cc: William Lee Irwin III
    Cc: Dave Hansen
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     

06 Dec, 2007

1 commit

  • register_sysctl_table() can return NULL sometimes, e.g. when kmalloc()
    returns NULL or when sysctl check fails.

    I've also noticed that much (most?) of the code in the kernel doesn't
    check the return value from register_sysctl_table() and later simply calls
    unregister_sysctl_table() with a potentially NULL argument.

    This is unlikely on a common kernel configuration, but in case we're
    dealing with modules and/or fault-injection support, there's a slight
    possibility of an OOPS.

    Changing all the users to check the return code from registration does not
    look like a good solution - there is too much code doing this, and failure
    in sysctl table registration is not a good reason to abort module loading
    (in most of the cases).

    So I think that we can just have this check in unregister_sysctl_table()
    to avoid accidental oopses (actually, unregister_sysctl_table() did
    exactly this before start_unregistering() appeared).
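
    The resulting calling convention mirrors free(NULL): teardown silently
    ignores a NULL handle, so callers need not branch on a failed
    registration. A tiny generic illustration of the pattern (not the kernel
    code):

    #include <stdio.h>
    #include <stdlib.h>

    struct table_header { const char *name; };

    static struct table_header *register_table(const char *name)
    {
            struct table_header *h = malloc(sizeof(*h));

            if (h)
                    h->name = name;
            return h;       /* may be NULL, like register_sysctl_table() */
    }

    static void unregister_table(struct table_header *h)
    {
            if (!h)         /* tolerate failed registrations */
                    return;
            free(h);
    }

    int main(void)
    {
            struct table_header *ok = register_table("net");

            unregister_table(ok);
            unregister_table(NULL); /* harmless, no oops */
            printf("done\n");
            return 0;
    }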

    Signed-off-by: Pavel Emelyanov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     

15 Nov, 2007

1 commit


10 Nov, 2007

3 commits

  • SMP balancing is done with IRQs disabled and can iterate the full rq.
    When rqs are large this can cause large irq-latencies. Limit the nr of
    iterations on each run.

    This fixes a scheduling latency regression reported by the -rt folks.

    Signed-off-by: Peter Zijlstra
    Acked-by: Steven Rostedt
    Tested-by: Gregory Haskins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • 1) hardcoded 1000000000 value is used five times in places where
    NSEC_PER_SEC might be more readable.

    2) A conversion from nsec to msec uses the hardcoded 1000000 value,
    which is a candidate for NSEC_PER_MSEC.

    no code changed:

    text data bss dec hex filename
    44359 3326 36 47721 ba69 sched.o.before
    44359 3326 36 47721 ba69 sched.o.after

    Signed-off-by: Eric Dumazet
    Signed-off-by: Ingo Molnar

    Eric Dumazet
     
  • we lost the sched_min_granularity tunable to a clever optimization
    that uses the sched_latency/min_granularity ratio - but the ratio
    is quite unintuitive to users and can also crash the kernel if the
    ratio is set to 0. So reintroduce the min_granularity tunable,
    while keeping the ratio maintained internally.

    no functionality changed.

    [ mingo@elte.hu: some fixlets. ]

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

20 Oct, 2007

2 commits

  • This is the largest patch in the set. Make all (I hope) the places where a
    pid is shown to or obtained from userspace operate on virtual pids.

    The idea is:
    - all in-kernel data structures must store either struct pid itself
    or the pid's global nr, obtained with pid_nr() call;
    - when seeking the task from kernel code with the stored id one
    should use find_task_by_pid() call that works with global pids;
    - when showing a pid's numerical value to the user the virtual one
    should be used; however, when one shows a task's pid outside this
    task's namespace the global one is to be used;
    - when getting a pid from userspace one needs to consider it as
    the virtual one and use the appropriate task/pid-searching functions.

    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: nuther build fix]
    [akpm@linux-foundation.org: yet nuther build fix]
    [akpm@linux-foundation.org: remove unneeded casts]
    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Alexey Dobriyan
    Cc: Sukadev Bhattiprolu
    Cc: Oleg Nesterov
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • is_init() is an ambiguous name for the pid==1 check. Split it into
    is_global_init() and is_container_init().

    A cgroup init has its tsk->pid == 1.

    A global init also has its tsk->pid == 1 and its active pid namespace
    is the init_pid_ns. But rather than check the active pid namespace,
    compare the task structure with 'init_pid_ns.child_reaper', which is
    initialized during boot to the /sbin/init process and never changes.

    Changelog:

    2.6.22-rc4-mm2-pidns1:
    - Use 'init_pid_ns.child_reaper' to determine if a given task is the
    global init (/sbin/init) process. This would improve performance
    and remove dependence on the task_pid().

    2.6.21-mm2-pidns2:

    - [Sukadev Bhattiprolu] Changed is_container_init() calls in {powerpc,
    ppc,avr32}/traps.c for the _exception() call to is_global_init().
    This way, we kill only the cgroup if the cgroup's init has a
    bug rather than force a kernel panic.

    [akpm@linux-foundation.org: fix comment]
    [sukadev@us.ibm.com: Use is_global_init() in arch/m32r/mm/fault.c]
    [bunk@stusta.de: kernel/pid.c: remove unused exports]
    [sukadev@us.ibm.com: Fix capability.c to work with threaded init]
    Signed-off-by: Serge E. Hallyn
    Signed-off-by: Sukadev Bhattiprolu
    Acked-by: Pavel Emelianov
    Cc: Eric W. Biederman
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Herbert Poetzel
    Cc: Kirill Korotaev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     

19 Oct, 2007

2 commits

  • The non-filesystem capability meaning of CAP_SETPCAP is that a process, p1,
    can change the capabilities of another process, p2. This is not the
    meaning that was intended for this capability at all, and this
    implementation came about purely because, without filesystem capabilities,
    there was no way to use capabilities without one process bestowing them on
    another.

    Since we now have a filesystem support for capabilities we can fix the
    implementation of CAP_SETPCAP.

    The most significant thing about this change is that, with it in effect, no
    process can set the capabilities of another process.

    The capabilities of a program are set via the capability convolution
    rules:

    pI(post-exec) = pI(pre-exec)
    pP(post-exec) = (X(aka cap_bset) & fP) | (pI(post-exec) & fI)
    pE(post-exec) = fE ? pP(post-exec) : 0

    at exec() time. As such, the only influence the pre-exec() program can
    have on the post-exec() program's capabilities is through the pI
    capability set.
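
    A quick numeric check of these rules, with the capability sets modelled
    as plain bitmasks (values made up for illustration; CAP_NET_RAW is bit
    13):

    /* Apply the exec-time convolution rules from the text to toy values:
     * pP' = (X & fP) | (pI & fI), pE' = fE ? pP' : 0, pI' = pI. */
    #include <stdio.h>

    int main(void)
    {
            unsigned int X  = 0xffffffff;   /* cap_bset: everything allowed */
            unsigned int pI = 0x00002000;   /* CAP_NET_RAW inheritable       */
            unsigned int fP = 0x00000000;   /* file has no permitted caps    */
            unsigned int fI = 0x00002000;   /* file inherits CAP_NET_RAW     */
            unsigned int fE = 1;            /* file "effective" bit set      */

            unsigned int pP_post = (X & fP) | (pI & fI);
            unsigned int pE_post = fE ? pP_post : 0;

            /* The only way this exec gains CAP_NET_RAW is via pI & fI:
             * exactly the ping-with-"= cap_net_raw+i" scenario below. */
            printf("pP' = %#x, pE' = %#x, pI' = %#x\n", pP_post, pE_post, pI);
            return 0;
    }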

    The correct implementation for CAP_SETPCAP (and that enabled by this patch)
    is that it can be used to add extra pI capabilities to the current process
    - to be picked up by subsequent exec()s when the above convolution rules
    are applied.

    Here is how it works:

    Let's say we have a process, p. It has capability sets, pE, pP and pI.
    Generally, p can change the value of its own pI to pI' where

    (pI' & ~pI) & ~pP = 0.

    That is, the only new things in pI' that were not present in pI need to
    be present in pP.

    The role of CAP_SETPCAP is basically to permit changes to pI beyond
    the above:

    if (pE & CAP_SETPCAP) {
    pI' = anything; /* ie., even (pI' & ~pI) & ~pP != 0 */
    }

    This capability is useful for things like login, which (say, via
    pam_cap) might want to raise certain inheritable capabilities for use
    by the children of the logged-in user's shell, but those capabilities
    are not useful to or needed by the login program itself.

    One such use might be to limit who can run ping. You set the
    capabilities of the 'ping' program to be "= cap_net_raw+i", and then
    only shells that have (pI & CAP_NET_RAW) will be able to run
    it. Without CAP_SETPCAP implemented as described above, login(pam_cap)
    would have to also have (pP & CAP_NET_RAW) in order to raise this
    capability and pass it on through the inheritable set.

    Signed-off-by: Andrew Morgan
    Signed-off-by: Serge E. Hallyn
    Cc: Stephen Smalley
    Cc: James Morris
    Cc: Casey Schaufler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morgan
     
  • After adding checking to register_sysctl_table and finding a whole new set
    of bugs, missed by countless code reviews and testers, I have finally lost
    patience with the binary sysctl interface.

    The binary sysctl interface has been sort of deprecated for years, and
    finding a user space program that uses the syscall is more difficult than
    finding a needle in a haystack. Problems continue to crop up with the
    in-kernel implementation. So, since supporting something that no one uses
    is silly, deprecate sys_sysctl with a sufficient grace period and notice,
    so that the handful of user space applications that care can be fixed or
    replaced.

    The /proc/sys sysctl interface that people use will continue to be
    supported indefinitely.

    This patch moves the tested warning about sysctls from the sys_sysctl path
    to a separate path called from both implementations of sys_sysctl, and it
    adds a proper entry into Documentation/feature-removal-schedule, allowing
    us to revisit this in a couple of years' time and actually kill
    sys_sysctl.

    [lethal@linux-sh.org: sysctl: Fix syscall disabled build]
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Paul Mundt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman