01 Mar, 2010

1 commit

  • …/git/tip/linux-2.6-tip

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (25 commits)
    sched: Fix SCHED_MC regression caused by change in sched cpu_power
    sched: Don't use possibly stale sched_class
    kthread, sched: Remove reference to kthread_create_on_cpu
    sched: cpuacct: Use bigger percpu counter batch values for stats counters
    percpu_counter: Make __percpu_counter_add an inline function on UP
    sched: Remove member rt_se from struct rt_rq
    sched: Change usage of rt_rq->rt_se to rt_rq->tg->rt_se[cpu]
    sched: Remove unused update_shares_locked()
    sched: Use for_each_bit
    sched: Queue a deboosted task to the head of the RT prio queue
    sched: Implement head queueing for sched_rt
    sched: Extend enqueue_task to allow head queueing
    sched: Remove USER_SCHED
    sched: Fix the place where group powers are updated
    sched: Assume *balance is valid
    sched: Remove load_balance_newidle()
    sched: Unify load_balance{,_newidle}()
    sched: Add a lock break for PREEMPT=y
    sched: Remove from fwd decls
    sched: Remove rq_iterator from move_one_task
    ...

    Fix up trivial conflicts in kernel/sched.c

    Linus Torvalds
     

23 Feb, 2010

1 commit


21 Jan, 2010

1 commit

  • Remove the USER_SCHED feature. It has been scheduled to be removed in
    2.6.34 as per http://marc.info/?l=linux-kernel&m=125728479022976&w=2

    Signed-off-by: Dhaval Giani
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Dhaval Giani
     

20 Dec, 2009

1 commit


16 Dec, 2009

1 commit


11 Dec, 2009

1 commit

  • commit c69e8d9 (CRED: Use RCU to access another task's creds and to
    release a task's own creds) added non rcu_read_lock() protected access
    to task creds of the target task in set_prio_one().

    The comment above the function says:
    * - the caller must hold the RCU read lock

    The calling code in sys_setpriority does read_lock(&tasklist_lock) but
    not rcu_read_lock(). This works only when CONFIG_TREE_PREEMPT_RCU=n.
    With CONFIG_TREE_PREEMPT_RCU=y the rcu_callbacks can run in the tick
    interrupt when they see no read side critical section.

    There is another instance of __task_cred() in sys_setpriority() itself
    which is equally unprotected.

    Wrap the whole code section into a rcu read side critical section to
    fix this quick and dirty.

    Will be revisited in course of the read_lock(&tasklist_lock) -> rcu
    crusade.

    Oleg noted further:

    This also fixes another bug here. find_task_by_vpid() is not safe
    without rcu_read_lock(). I do not mean it is not safe to use the
    result, just find_pid_ns() by itself is not safe.

    Usually tasklist gives enough protection, but if copy_process() fails
    it calls free_pid() lockless and does call_rcu(delayed_put_pid().
    This means, without rcu lock find_pid_ns() can't scan the hash table
    safely.

    Signed-off-by: Thomas Gleixner
    LKML-Reference:
    Acked-by: Paul E. McKenney

    Thomas Gleixner
     

10 Dec, 2009

1 commit


03 Dec, 2009

1 commit

  • This is a real fix for problem of utime/stime values decreasing
    described in the thread:

    http://lkml.org/lkml/2009/11/3/522

    Now cputime is accounted in the following way:

    - {u,s}time in task_struct are increased every time when the thread
    is interrupted by a tick (timer interrupt).

    - When a thread exits, its {u,s}time are added to signal->{u,s}time,
    after adjusted by task_times().

    - When all threads in a thread_group exits, accumulated {u,s}time
    (and also c{u,s}time) in signal struct are added to c{u,s}time
    in signal struct of the group's parent.

    So {u,s}time in task struct are "raw" tick count, while
    {u,s}time and c{u,s}time in signal struct are "adjusted" values.

    And accounted values are used by:

    - task_times(), to get cputime of a thread:
    This function returns adjusted values that originates from raw
    {u,s}time and scaled by sum_exec_runtime that accounted by CFS.

    - thread_group_cputime(), to get cputime of a thread group:
    This function returns sum of all {u,s}time of living threads in
    the group, plus {u,s}time in the signal struct that is sum of
    adjusted cputimes of all exited threads belonged to the group.

    The problem is the return value of thread_group_cputime(),
    because it is mixed sum of "raw" value and "adjusted" value:

    group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time)

    This misbehavior can break {u,s}time monotonicity.
    Assume that if there is a thread that have raw values greater
    than adjusted values (e.g. interrupted by 1000Hz ticks 50 times
    but only runs 45ms) and if it exits, cputime will decrease (e.g.
    -5ms).

    To fix this, we could do:

    group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time)

    But task_times() contains hard divisions, so applying it for
    every thread should be avoided.

    This patch fixes the above problem in the following way:

    - Modify thread's exit (= __exit_signal()) not to use task_times().
    It means {u,s}time in signal struct accumulates raw values instead
    of adjusted values. As the result it makes thread_group_cputime()
    to return pure sum of "raw" values.

    - Introduce a new function thread_group_times(*task, *utime, *stime)
    that converts "raw" values of thread_group_cputime() to "adjusted"
    values, in same calculation procedure as task_times().

    - Modify group's exit (= wait_task_zombie()) to use this introduced
    thread_group_times(). It make c{u,s}time in signal struct to
    have adjusted values like before this patch.

    - Replace some thread_group_cputime() by thread_group_times().
    This replacements are only applied where conveys the "adjusted"
    cputime to users, and where already uses task_times() near by it.
    (i.e. sys_times(), getrusage(), and /proc//stat.)

    This patch have a positive side effect:

    - Before this patch, if a group contains many short-life threads
    (e.g. runs 0.9ms and not interrupted by ticks), the group's
    cputime could be invisible since thread's cputime was accumulated
    after adjusted: imagine adjustment function as adj(ticks, runtime),
    {adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0.
    After this patch it will not happen because the adjustment is
    applied after accumulated.

    v2:
    - remove if()s, put new variables into signal_struct.

    Signed-off-by: Hidetoshi Seto
    Acked-by: Peter Zijlstra
    Cc: Spencer Candland
    Cc: Americo Wang
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Stanislaw Gruszka
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto
     

26 Nov, 2009

1 commit

  • Functions task_{u,s}time() are called in pair in almost all
    cases. However task_stime() is implemented to call task_utime()
    from its inside, so such paired calls run task_utime() twice.

    It means we do heavy divisions (div_u64 + do_div) twice to get
    utime and stime which can be obtained at same time by one set
    of divisions.

    This patch introduces a function task_times(*tsk, *utime,
    *stime) to retrieve utime and stime at once in better, optimized
    way.

    Signed-off-by: Hidetoshi Seto
    Acked-by: Peter Zijlstra
    Cc: Stanislaw Gruszka
    Cc: Spencer Candland
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Americo Wang
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto
     

29 Oct, 2009

2 commits

  • * 'hwpoison-2.6.32' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6:
    HWPOISON: fix invalid page count in printk output
    HWPOISON: Allow schedule_on_each_cpu() from keventd
    HWPOISON: fix/proc/meminfo alignment
    HWPOISON: fix oops on ksm pages
    HWPOISON: Fix page count leak in hwpoison late kill in do_swap_page
    HWPOISON: return early on non-LRU pages
    HWPOISON: Add brief hwpoison description to Documentation
    HWPOISON: Clean up PR_MCE_KILL interface

    Linus Torvalds
     
  • Since commit 02b51df1b07b4e9ca823c89284e704cadb323cd1 (proc connector: add
    event for process becoming session leader) we have the following warning:

    Badness at kernel/softirq.c:143
    [...]
    Krnl PSW : 0404c00180000000 00000000001481d4 (local_bh_enable+0xb0/0xe0)
    [...]
    Call Trace:
    ([] 0x13fe04100)
    [] sk_filter+0x9a/0xd0
    [] netlink_broadcast+0x2c0/0x53c
    [] cn_netlink_send+0x272/0x2b0
    [] proc_sid_connector+0xc4/0xd4
    [] __set_special_pids+0x58/0x90
    [] sys_setsid+0xb4/0xd8
    [] sysc_noemu+0x10/0x16
    [] 0x41616cb266

    The warning is
    ---> WARN_ON_ONCE(in_irq() || irqs_disabled());

    The network code must not be called with disabled interrupts but
    sys_setsid holds the tasklist_lock with spinlock_irq while calling the
    connector.

    After a discussion we agreed that we can move proc_sid_connector from
    __set_special_pids to sys_setsid.

    We also agreed that it is sufficient to change the check from
    task_session(curr) != pid into err > 0, since if we don't change the
    session, this means we were already the leader and return -EPERM.

    One last thing:
    There is also daemonize(), and some people might want to get a
    notification in that case. Since daemonize() is only needed if a user
    space does kernel_thread this does not look important (and there seems
    to be no consensus if this connector should be called in daemonize). If
    we really want this, we can add proc_sid_connector to daemonize() in an
    additional patch (Scott?)

    Signed-off-by: Christian Borntraeger
    Cc: Scott James Remnant
    Cc: Matt Helsley
    Cc: David S. Miller
    Acked-by: Oleg Nesterov
    Acked-by: Evgeniy Polyakov
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christian Borntraeger
     

14 Oct, 2009

1 commit


04 Oct, 2009

1 commit

  • While writing the manpage I noticed some shortcomings in the
    current interface.

    - Define symbolic names for all the different values
    - Boundary check the kill mode values
    - For symmetry add a get interface too. This allows library
    code to get/set the current state.
    - For consistency define a PR_MCE_KILL_DEFAULT value

    Signed-off-by: Andi Kleen

    Andi Kleen
     

24 Sep, 2009

1 commit

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
    HWPOISON: Enable error_remove_page on btrfs
    HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
    HWPOISON: Add madvise() based injector for hardware poisoned pages v4
    HWPOISON: Enable error_remove_page for NFS
    HWPOISON: Enable .remove_error_page for migration aware file systems
    HWPOISON: The high level memory error handler in the VM v7
    HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
    HWPOISON: shmem: call set_page_dirty() with locked page
    HWPOISON: Define a new error_remove_page address space op for async truncation
    HWPOISON: Add invalidate_inode_page
    HWPOISON: Refactor truncate to allow direct truncating of page v2
    HWPOISON: check and isolate corrupted free pages v2
    HWPOISON: Handle hardware poisoned pages in try_to_unmap
    HWPOISON: Use bitmask/action code for try_to_unmap behaviour
    HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
    HWPOISON: Add poison check to page fault handling
    HWPOISON: Add basic support for poisoned pages in fault handler v3
    HWPOISON: Add new SIGBUS error codes for hardware poison signals
    HWPOISON: Add support for poison swap entries v2
    HWPOISON: Export some rmap vma locking to outside world
    ...

    Linus Torvalds
     

23 Sep, 2009

1 commit

  • Make ->ru_maxrss value in struct rusage filled accordingly to rss hiwater
    mark. This struct is filled as a parameter to getrusage syscall.
    ->ru_maxrss value is set to KBs which is the way it is done in BSD
    systems. /usr/bin/time (gnu time) application converts ->ru_maxrss to KBs
    which seems to be incorrect behavior. Maintainer of this util was
    notified by me with the patch which corrects it and cc'ed.

    To make this happen we extend struct signal_struct by two fields. The
    first one is ->maxrss which we use to store rss hiwater of the task. The
    second one is ->cmaxrss which we use to store highest rss hiwater of all
    task childs. These values are used in k_getrusage() to actually fill
    ->ru_maxrss. k_getrusage() uses current rss hiwater value directly if mm
    struct exists.

    Note:
    exec() clear mm->hiwater_rss, but doesn't clear sig->maxrss.
    it is intetionally behavior. *BSD getrusage have exec() inheriting.

    test programs
    ========================================================

    getrusage.c
    ===========
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    #include "common.h"

    #define err(str) perror(str), exit(1)

    int main(int argc, char** argv)
    {
    int status;

    printf("allocate 100MB\n");
    consume(100);

    printf("testcase1: fork inherit? \n");
    printf(" expect: initial.self ~= child.self\n");
    show_rusage("initial");
    if (__fork()) {
    wait(&status);
    } else {
    show_rusage("fork child");
    _exit(0);
    }
    printf("\n");

    printf("testcase2: fork inherit? (cont.) \n");
    printf(" expect: initial.children ~= 100MB, but child.children = 0\n");
    show_rusage("initial");
    if (__fork()) {
    wait(&status);
    } else {
    show_rusage("child");
    _exit(0);
    }
    printf("\n");

    printf("testcase3: fork + malloc \n");
    printf(" expect: child.self ~= initial.self + 50MB\n");
    show_rusage("initial");
    if (__fork()) {
    wait(&status);
    } else {
    printf("allocate +50MB\n");
    consume(50);
    show_rusage("fork child");
    _exit(0);
    }
    printf("\n");

    printf("testcase4: grandchild maxrss\n");
    printf(" expect: post_wait.children ~= 300MB\n");
    show_rusage("initial");
    if (__fork()) {
    wait(&status);
    show_rusage("post_wait");
    } else {
    system("./child -n 0 -g 300");
    _exit(0);
    }
    printf("\n");

    printf("testcase5: zombie\n");
    printf(" expect: pre_wait ~= initial, IOW the zombie process is not accounted.\n");
    printf(" post_wait ~= 400MB, IOW wait() collect child's max_rss. \n");
    show_rusage("initial");
    if (__fork()) {
    sleep(1); /* children become zombie */
    show_rusage("pre_wait");
    wait(&status);
    show_rusage("post_wait");
    } else {
    system("./child -n 400");
    _exit(0);
    }
    printf("\n");

    printf("testcase6: SIG_IGN\n");
    printf(" expect: initial ~= after_zombie (child's 500MB alloc should be ignored).\n");
    show_rusage("initial");
    signal(SIGCHLD, SIG_IGN);
    if (__fork()) {
    sleep(1); /* children become zombie */
    show_rusage("after_zombie");
    } else {
    system("./child -n 500");
    _exit(0);
    }
    printf("\n");
    signal(SIGCHLD, SIG_DFL);

    printf("testcase7: exec (without fork) \n");
    printf(" expect: initial ~= exec \n");
    show_rusage("initial");
    execl("./child", "child", "-v", NULL);

    return 0;
    }

    child.c
    =======
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    #include "common.h"

    int main(int argc, char** argv)
    {
    int status;
    int c;
    long consume_size = 0;
    long grandchild_consume_size = 0;
    int show = 0;

    while ((c = getopt(argc, argv, "n:g:v")) != -1) {
    switch (c) {
    case 'n':
    consume_size = atol(optarg);
    break;
    case 'v':
    show = 1;
    break;
    case 'g':

    grandchild_consume_size = atol(optarg);
    break;
    default:
    break;
    }
    }

    if (show)
    show_rusage("exec");

    if (consume_size) {
    printf("child alloc %ldMB\n", consume_size);
    consume(consume_size);
    }

    if (grandchild_consume_size) {
    if (fork()) {
    wait(&status);
    } else {
    printf("grandchild alloc %ldMB\n", grandchild_consume_size);
    consume(grandchild_consume_size);

    exit(0);
    }
    }

    return 0;
    }

    common.c
    ========
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    #include "common.h"
    #define err(str) perror(str), exit(1)

    void show_rusage(char *prefix)
    {
    int err, err2;
    struct rusage rusage_self;
    struct rusage rusage_children;

    printf("%s: ", prefix);
    err = getrusage(RUSAGE_SELF, &rusage_self);
    if (!err)
    printf("self %ld ", rusage_self.ru_maxrss);
    err2 = getrusage(RUSAGE_CHILDREN, &rusage_children);
    if (!err2)
    printf("children %ld ", rusage_children.ru_maxrss);

    printf("\n");
    }

    /* Some buggy OS need this worthless CPU waste. */
    void make_pagefault(void)
    {
    void *addr;
    int size = getpagesize();
    int i;

    for (i=0; i
    Signed-off-by: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Pirko
     

21 Sep, 2009

1 commit

  • Bye-bye Performance Counters, welcome Performance Events!

    In the past few months the perfcounters subsystem has grown out its
    initial role of counting hardware events, and has become (and is
    becoming) a much broader generic event enumeration, reporting, logging,
    monitoring, analysis facility.

    Naming its core object 'perf_counter' and naming the subsystem
    'perfcounters' has become more and more of a misnomer. With pending
    code like hw-breakpoints support the 'counter' name is less and
    less appropriate.

    All in one, we've decided to rename the subsystem to 'performance
    events' and to propagate this rename through all fields, variables
    and API names. (in an ABI compatible fashion)

    The word 'event' is also a bit shorter than 'counter' - which makes
    it slightly more convenient to write/handle as well.

    Thanks goes to Stephane Eranian who first observed this misnomer and
    suggested a rename.

    User-space tooling and ABI compatibility is not affected - this patch
    should be function-invariant. (Also, defconfigs were not touched to
    keep the size down.)

    This patch has been generated via the following script:

    FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')

    sed -i \
    -e 's/PERF_EVENT_/PERF_RECORD_/g' \
    -e 's/PERF_COUNTER/PERF_EVENT/g' \
    -e 's/perf_counter/perf_event/g' \
    -e 's/nb_counters/nb_events/g' \
    -e 's/swcounter/swevent/g' \
    -e 's/tpcounter_event/tp_event/g' \
    $FILES

    for N in $(find . -name perf_counter.[ch]); do
    M=$(echo $N | sed 's/perf_counter/perf_event/g')
    mv $N $M
    done

    FILES=$(find . -name perf_event.*)

    sed -i \
    -e 's/COUNTER_MASK/REG_MASK/g' \
    -e 's/COUNTER/EVENT/g' \
    -e 's/\/event_id/g' \
    -e 's/counter/event/g' \
    -e 's/Counter/Event/g' \
    $FILES

    ... to keep it as correct as possible. This script can also be
    used by anyone who has pending perfcounters patches - it converts
    a Linux kernel tree over to the new naming. We tried to time this
    change to the point in time where the amount of pending patches
    is the smallest: the end of the merge window.

    Namespace clashes were fixed up in a preparatory patch - and some
    stylistic fallout will be fixed up in a subsequent patch.

    ( NOTE: 'counters' are still the proper terminology when we deal
    with hardware registers - and these sed scripts are a bit
    over-eager in renaming them. I've undone some of that, but
    in case there's something left where 'counter' would be
    better than 'event' we can undo that on an individual basis
    instead of touching an otherwise nicely automated patch. )

    Suggested-by: Stephane Eranian
    Acked-by: Peter Zijlstra
    Acked-by: Paul Mackerras
    Reviewed-by: Arjan van de Ven
    Cc: Mike Galbraith
    Cc: Arnaldo Carvalho de Melo
    Cc: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: Kyle McMartin
    Cc: Martin Schwidefsky
    Cc: "David S. Miller"
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc:
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

16 Sep, 2009

1 commit

  • This allows processes to override their early/late kill
    behaviour on hardware memory errors.

    Typically applications which are memory error aware is
    better of with early kill (see the error as soon
    as possible), all others with late kill (only
    see the error when the error is really impacting execution)

    There's a global sysctl, but this way an application
    can set its specific policy.

    We're using two bits, one to signify that the process
    stated its intention and that

    I also made the prctl future proof by enforcing
    the unused arguments are 0.

    The state is inherited to children.

    Note this makes us officially run out of process flags
    on 32bit, but the next patch can easily add another field.

    Manpage patch will be supplied separately.

    Signed-off-by: Andi Kleen

    Andi Kleen
     

17 Jun, 2009

1 commit

  • Move supplementary groups implementation to kernel/groups.c .
    kernel/sys.c already accumulated quite a few random stuff.

    Do strictly copy/paste + add required headers to compile. Compile-tested
    on many configs and archs.

    Signed-off-by: Alexey Dobriyan
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

29 Apr, 2009

1 commit


14 Apr, 2009

1 commit


06 Apr, 2009

1 commit

  • Merge reason: we have gathered quite a few conflicts, need to merge upstream

    Conflicts:
    arch/powerpc/kernel/Makefile
    arch/x86/ia32/ia32entry.S
    arch/x86/include/asm/hardirq.h
    arch/x86/include/asm/unistd_32.h
    arch/x86/include/asm/unistd_64.h
    arch/x86/kernel/cpu/common.c
    arch/x86/kernel/irq.c
    arch/x86/kernel/syscall_table_32.S
    arch/x86/mm/iomap_32.c
    include/linux/sched.h
    kernel/Makefile

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

03 Apr, 2009

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    Remove two unneeded exports and make two symbols static in fs/mpage.c
    Cleanup after commit 585d3bc06f4ca57f975a5a1f698f65a45ea66225
    Trim includes of fdtable.h
    Don't crap into descriptor table in binfmt_som
    Trim includes in binfmt_elf
    Don't mess with descriptor table in load_elf_binary()
    Get rid of indirect include of fs_struct.h
    New helper - current_umask()
    check_unsafe_exec() doesn't care about signal handlers sharing
    New locking/refcounting for fs_struct
    Take fs_struct handling to new file (fs/fs_struct.c)
    Get rid of bumping fs_struct refcount in pivot_root(2)
    Kill unsharing fs_struct in __set_personality()

    Linus Torvalds
     
  • We are wasting 2 words in signal_struct without any reason to implement
    task_pgrp_nr() and task_session_nr().

    task_session_nr() has no callers since
    2e2ba22ea4fd4bb85f0fa37c521066db6775cbef, we can remove it.

    task_pgrp_nr() is still (I believe wrongly) used in fs/autofsX and
    fs/coda.

    This patch reimplements task_pgrp_nr() via task_pgrp_nr_ns(), and kills
    __pgrp/__session and the related helpers.

    The change in drivers/char/tty_io.c is cosmetic, but hopefully makes sense
    anyway.

    Signed-off-by: Oleg Nesterov
    Acked-by: Alan Cox [tty parts]
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Eric Biederman
    Cc: Pavel Emelyanov
    Cc: Serge Hallyn
    Cc: Sukadev Bhattiprolu
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

01 Apr, 2009

1 commit


04 Mar, 2009

1 commit


27 Feb, 2009

1 commit

  • Impact: fix hung task with certain (non-default) rt-limit settings

    Corey Hickey reported that on using setuid to change the uid of a
    rt process, the process would be unkillable and not be running.
    This is because there was no rt runtime for that user group. Add
    in a check to see if a user can attach an rt task to its task group.
    On failure, return EINVAL, which is also returned in
    CONFIG_CGROUP_SCHED.

    Reported-by: Corey Hickey
    Signed-off-by: Dhaval Giani
    Acked-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Dhaval Giani
     

11 Feb, 2009

1 commit


06 Feb, 2009

1 commit

  • Revert commit 0c2d64fb6cae9aae480f6a46cfe79f8d7d48b59f because it causes
    (arguably poorly designed) existing userspace to spend interminable
    periods closing billions of not-open file descriptors.

    We could bring this back, with some sort of opt-in tunable in /proc, which
    defaults to "off".

    Peter's alanysis follows:

    : I spent several hours trying to get to the bottom of a serious
    : performance issue that appeared on one of our servers after upgrading to
    : 2.6.28. In the end it's what could be considered a userspace bug that
    : was triggered by a change in 2.6.28. Since this might also affect other
    : people I figured I'd at least document what I found here, and maybe we
    : can even do something about it:
    :
    :
    : So, I upgraded some of debian.org's machines to 2.6.28.1 and immediately
    : the team maintaining our ftp archive complained that one of their
    : scripts that previously ran in a few minutes still hadn't even come
    : close to being done after an hour or so. Downgrading to 2.6.27 fixed
    : that.
    :
    : Turns out that script is forking a lot and something in it or python or
    : whereever closes all the file descriptors it doesn't want to pass on.
    : That is, it starts at zero and goes up to ulimit -n/RLIMIT_NOFILE and
    : closes them all with a few exceptions.
    :
    : Turns out that takes a long time when your limit -n is now 2^20 (1048576).
    :
    : With 2.6.27.* the ulimit -n was the standard 1024, but with 2.6.28 it is
    : now a thousand times that.
    :
    : 2.6.28 included a patch titled "rlimit: permit setting RLIMIT_NOFILE to
    : RLIM_INFINITY" (0c2d64fb6cae9aae480f6a46cfe79f8d7d48b59f)[1] that
    : allows, as the title implies, to set the limit for number of files to
    : infinity.
    :
    : Closer investigation showed that the broken default ulimit did not apply
    : to "system" processes (like stuff started from init). In the end I
    : could establish that all processes that passed through pam_limit at one
    : point had the bad resource limit.
    :
    : Apparently the pam library in Debian etch (4.0) initializes the limits
    : to some default values when it doesn't have any settings in limit.conf
    : to override them. Turns out that for nofiles this is RLIM_INFINITY.
    : Commenting out "case RLIMIT_NOFILE" in pam_limit.c:267 of our pam
    : package version 0.79-5 fixes that - tho I'm not sure what side effects
    : that has.
    :
    : Debian lenny (the upcoming 5.0 version) doesn't have this issue as it
    : uses a different pam (version).

    Reported-by: Peter Palfrader
    Cc: Adam Tkac
    Cc: Michael Kerrisk
    Cc: [2.6.28.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

21 Jan, 2009

1 commit


14 Jan, 2009

9 commits


11 Jan, 2009

1 commit


07 Jan, 2009

1 commit

  • …l/git/tip/linux-2.6-tip

    * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    sched: fix section mismatch
    sched: fix double kfree in failure path
    sched: clean up arch_reinit_sched_domains()
    sched: mark sched_create_sysfs_power_savings_entries() as __init
    getrusage: RUSAGE_THREAD should return ru_utime and ru_stime
    sched: fix sched_slice()
    sched_clock: prevent scd->clock from moving backwards, take #2
    sched: sched.c declare variables before they get used

    Linus Torvalds