12 Sep, 2007

3 commits

  • Seems to me that this timer will only get started on platforms that say
    they don't want it?

    Signed-off-by: Tony Breeds
    Cc: Paul Mackerras
    Cc: Gabriel Paubert
    Cc: Zachary Amsden
    Acked-by: Thomas Gleixner
    Cc: John Stultz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tony Breeds
     
  • The semantics of call_usermodehelper_pipe() used to be that it would fork
    the helper, and wait for the kernel thread to be started. This was
    implemented by setting sub_info.wait to 0 (implicitly), and doing a
    wait_for_completion().

    As part of the cleanup done in 0ab4dc92278a0f3816e486d6350c6652a72e06c8,
    call_usermodehelper_pipe() was changed to pass 1 as the value for wait to
    call_usermodehelper_exec().

    This is equivalent to setting sub_info.wait to 1, which is a change from
    the previous behaviour. Using 1 instead of 0 causes
    __call_usermodehelper() to start the kernel thread running
    wait_for_helper(), rather than directly calling ____call_usermodehelper().

    The end result is that the calling kernel code blocks until the user mode
    helper finishes. As the helper is expecting input on stdin, and now no one
    is writing anything, everything locks up (observed in do_coredump).

    The fix is to change the 1 to UMH_WAIT_EXEC (aka 0), indicating that we
    want to wait for the kernel thread to be started, but not for the helper to
    finish.

    Signed-off-by: Michael Ellerman
    Acked-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Ellerman
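
    A minimal sketch of the wait modes involved, matching the "aka 0" note
    above (illustrative fragment; UMH_WAIT_EXEC's value comes from the text,
    the neighbouring constants and the exact call are shown as assumptions):

        enum umh_wait {
            UMH_NO_WAIT   = -1, /* don't wait at all */
            UMH_WAIT_EXEC = 0,  /* wait only until the helper has been started */
            UMH_WAIT_PROC = 1,  /* wait until the helper process has finished */
        };

        /*
         * call_usermodehelper_pipe() must return once the helper is running,
         * because the caller still has to feed it data through the pipe:
         */
        return call_usermodehelper_exec(sub_info, UMH_WAIT_EXEC);  /* was: 1 */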
     
  • The futex list traversal on the compat side appears to have
    a bug.

    Its loop termination condition compares:

    while (compat_ptr(uentry) != &head->list)

    But that can't be right because "uentry" has the special
    "pi" indicator bit still potentially set at bit 0. This
    is cleared by fetch_robust_entry() into the "entry"
    return value.

    What this seems to mean is that the list won't terminate
    when list iteration gets back to the head. And we'll
    also process the list head like a normal entry, which could
    cause all kinds of problems.

    So we should check for equality with "entry". That pointer
    is of the non-compat type so we have to do a little casting
    to keep the compiler and sparse happy.

    The same problem can in theory occur with the 'pending'
    variable, although that has not been reported from users
    so far.

    Based on the original patch from David Miller.

    Acked-by: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: David Miller
    Signed-off-by: Arnd Bergmann
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
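
    To make the failure mode concrete, here is a small self-contained
    user-space analogue of the tagged-pointer walk (hypothetical structure
    and names; it only mimics the compat robust-list traversal and is not
    the kernel code):

        #include <stdio.h>
        #include <stdint.h>

        struct rlist {
            struct rlist *next;         /* bit 0 may carry a "pi" flag */
        };

        /* mimic fetch_robust_entry(): strip the flag bit, return a clean pointer */
        static struct rlist *fetch_entry(uintptr_t uentry, unsigned int *pi)
        {
            *pi = uentry & 1;
            return (struct rlist *)(uentry & ~(uintptr_t)1);
        }

        int main(void)
        {
            struct rlist head, a, b;
            unsigned int pi;
            int limit = 10;

            /* two entries, both carrying the flag bit, linking back to the head */
            head.next = (struct rlist *)((uintptr_t)&a | 1);
            a.next    = (struct rlist *)((uintptr_t)&b | 1);
            b.next    = (struct rlist *)((uintptr_t)&head | 1);

            /* buggy condition: the still-tagged value never equals &head, so
             * the walk runs past the head and never stops on its own */
            uintptr_t uentry = (uintptr_t)head.next;
            struct rlist *entry = fetch_entry(uentry, &pi);
            while ((struct rlist *)uentry != &head && limit--) {
                printf("buggy walk visits %p (pi=%u)\n", (void *)entry, pi);
                uentry = (uintptr_t)entry->next;
                entry  = fetch_entry(uentry, &pi);
            }
            printf("buggy walk stopped only because of the safety limit\n");

            /* fixed condition: compare the cleaned pointer against the head */
            uentry = (uintptr_t)head.next;
            entry  = fetch_entry(uentry, &pi);
            while (entry != &head) {
                printf("fixed walk visits %p (pi=%u)\n", (void *)entry, pi);
                uentry = (uintptr_t)entry->next;
                entry  = fetch_entry(uentry, &pi);
            }
            printf("fixed walk terminated at the head\n");
            return 0;
        }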
     

11 Sep, 2007

1 commit

  • When PTRACE_SYSCALL was used and then PTRACE_DETACH is used, the
    TIF_SYSCALL_TRACE flag is left set on the formerly-traced task. This
    means that when a new tracer comes along and does PTRACE_ATTACH, it's
    possible he gets a syscall tracing stop even though he's never used
    PTRACE_SYSCALL. This happens if the task was in the middle of a system
    call when the second PTRACE_ATTACH was done. The symptom is an
    unexpected SIGTRAP when the tracer thinks that only SIGSTOP should have
    been provoked by his ptrace calls so far.

    A few machines already fixed this in ptrace_disable (i386, ia64, m68k).
    But all other machines do not, and still have this bug. On x86_64, this
    constitutes a regression in IA32 compatibility support.

    Since all machines now use TIF_SYSCALL_TRACE for this, I put the
    clearing of TIF_SYSCALL_TRACE in the generic ptrace_detach code rather
    than adding it to every other machine's ptrace_disable.

    Signed-off-by: Roland McGrath
    Signed-off-by: Linus Torvalds

    Roland McGrath
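
    A minimal sketch of the generic change described above, assuming it sits
    next to the existing architecture hook in ptrace_detach() (fragment only,
    not the literal kernel/ptrace.c hunk):

        ptrace_disable(child);                           /* arch-specific hardware disable */
        clear_tsk_thread_flag(child, TIF_SYSCALL_TRACE); /* generic: don't leave the flag
                                                            behind for the next tracer */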
     

05 Sep, 2007

8 commits

  • fix ideal_runtime:

    - do not scale it using niced_granularity();
    it is checked against sum_exec_delta, so it's wall-time, not fair-time.

    - move the whole check into __check_preempt_curr_fair()
    so that wakeup preemption can also benefit from the new logic.

    this also results in code size reduction:

    text data bss dec hex filename
    13391 228 1204 14823 39e7 sched.o.before
    13369 228 1204 14801 39d1 sched.o.after

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Second preparatory patch for fix-ideal runtime:

    Mark prev_sum_exec_runtime at the beginning of our run, the same spot
    that adds our wait period to wait_runtime. This seems a more natural
    location to do this, and it also reduces the code a bit:

    text data bss dec hex filename
    13397 228 1204 14829 39ed sched.o.before
    13391 228 1204 14823 39e7 sched.o.after

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Preparatory patch for fix-ideal-runtime:

    simplify __check_preempt_curr_fair(): get rid of the integer return.

    text data bss dec hex filename
    13404 228 1204 14836 39f4 sched.o.before
    13393 228 1204 14825 39e9 sched.o.after

    functionality is unchanged.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • rename RSR to SRR - 'RSR' is already defined on xtensa.

    found by Adrian Bunk.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • when cleaning sched-stats also clear prev_sum_exec_runtime.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • the cfs_rq->wait_runtime debug/statistics counter was not maintained
    properly - fix this.

    this also removes some code:

    text data bss dec hex filename
    13420 228 1204 14852 3a04 sched.o.before
    13404 228 1204 14836 39f4 sched.o.after

    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra

    Ingo Molnar
     
  • fix niced_granularity(). This resulted in under-scheduling for
    CPU-bound negative nice level tasks (and this in turn caused
    higher than necessary latencies in nice-0 tasks).

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
    First, fix the check

    if (*imbalance + SCHED_LOAD_SCALE_FUZZ < busiest_load_per_task)

    by replacing it with

    if (*imbalance < busiest_load_per_task)

    as the current check is always false for nice-0 tasks: SCHED_LOAD_SCALE_FUZZ
    equals busiest_load_per_task for nice-0 tasks, so the condition reduces to
    *imbalance < 0, which can never hold for the unsigned imbalance.

    With the above change, the imbalance was getting reset to 0 in the
    corner-case condition, making the FUZZ logic fail. Fix that by not
    corrupting the imbalance, and changing it only when the HT/MC optimization
    is found to be needed.

    Signed-off-by: Suresh Siddha
    Signed-off-by: Ingo Molnar

    Suresh Siddha
     

01 Sep, 2007

1 commit


31 Aug, 2007

6 commits

  • Spotted by taoyue and Jeremy Katz.

    collect_signal:                         sigqueue_free:

        list_del_init(&first->list);
                                            if (!list_empty(&q->list)) {
                                                    // not taken
                                            }
                                            q->flags &= ~SIGQUEUE_PREALLOC;

        __sigqueue_free(first);             __sigqueue_free(q);

    Now, __sigqueue_free() is called twice on the same "struct sigqueue" with the
    obviously bad implications.

    In particular, this double free breaks the array_cache->avail logic, so the
    same sigqueue could be "allocated" twice, and the bug can manifest itself via
    the "impossible" BUG_ON(!SIGQUEUE_PREALLOC) in sigqueue_free/send_sigqueue.

    Hopefully this can explain these mysterious bug-reports, see

    http://marc.info/?t=118766926500003
    http://marc.info/?t=118466273000005

    Alexey Dobriyan reports that this patch makes the difference for the
    testcase, but nobody has access to the application which originally
    exposed the problem.

    Also, this patch removes the tasklist lock/unlock; ->siglock is enough.

    Signed-off-by: Oleg Nesterov
    Cc: taoyue
    Cc: Jeremy Katz
    Cc: Sukadev Bhattiprolu
    Cc: Alexey Dobriyan
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Roland McGrath
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Signed-off-by: Alexey Dobriyan
    Acked-by: Cedric Le Goater
    Acked-by: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Mariusz Kozlowski reported lockdep's warning:

    > =================================
    > [ INFO: inconsistent lock state ]
    > 2.6.23-rc2-mm1 #7
    > ---------------------------------
    > inconsistent {in-hardirq-W} -> {hardirq-on-W} usage.
    > ifconfig/5492 [HC0[0]:SC0[0]:HE1:SE1] takes:
    > (&tp->lock){+...}, at: [] rtl8139_interrupt+0x27/0x46b [8139too]
    > {in-hardirq-W} state was registered at:
    > [] __lock_acquire+0x949/0x11ac
    > [] lock_acquire+0x99/0xb2
    > [] _spin_lock+0x35/0x42
    > [] rtl8139_interrupt+0x27/0x46b [8139too]
    > [] handle_IRQ_event+0x28/0x59
    > [] handle_level_irq+0xad/0x10b
    > [] do_IRQ+0x93/0xd0
    > [] common_interrupt+0x2e/0x34
    ...
    > other info that might help us debug this:
    > 1 lock held by ifconfig/5492:
    > #0: (rtnl_mutex){--..}, at: [] mutex_lock+0x1c/0x1f
    >
    > stack backtrace:
    ...
    > [] _spin_lock+0x35/0x42
    > [] rtl8139_interrupt+0x27/0x46b [8139too]
    > [] free_irq+0x11b/0x146
    > [] rtl8139_close+0x8a/0x14a [8139too]
    > [] dev_close+0x57/0x74
    ...

    This shows that a driver's irq handler was running both in hard interrupt
    and process contexts with irqs enabled. The latter was done during a
    free_irq() call and was possible only with CONFIG_DEBUG_SHIRQ enabled.
    That case was fixed by another patch.

    But a similar problem is possible with request_irq(): any locks taken from
    the irq handler could be vulnerable - especially with soft interrupts. This
    patch fixes it by disabling local interrupts during the handler's run, as
    sketched below. (It seems disabling softirqs should be enough, but this
    needs more checking for possible races and other special cases.)

    Reported-by: Mariusz Kozlowski
    Signed-off-by: Jarek Poplawski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jarek Poplawski
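
    A sketch of the change described above, assuming the spot being patched is
    the CONFIG_DEBUG_SHIRQ test-fire in request_irq() (fragment; details may
    differ from the actual kernel/irq/manage.c hunk):

        #ifdef CONFIG_DEBUG_SHIRQ
            if (irqflags & IRQF_SHARED) {
                unsigned long flags;

                /*
                 * Fire the handler once so unprepared shared-IRQ drivers blow
                 * up here rather than later -- but with local irqs disabled,
                 * so the handler never runs in process context with
                 * interrupts enabled.
                 */
                local_irq_save(flags);
                handler(irq, dev_id);
                local_irq_restore(flags);
            }
        #endif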
     
  • Dependencies of CONFIG_SUSPEND and CONFIG_HIBERNATION introduced by commit
    296699de6bdc717189a331ab6bbe90e05c94db06 "Introduce CONFIG_SUSPEND for
    suspend-to-Ram and standby" are incorrect, as they don't cover the facts that
    (1) not all architectures support suspend and (2) SMP hibernation is only
    possible on X86 and PPC64 (if CONFIG_PPC64_SWSUSP is set).

    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Spotted by Marcin Kowalczyk.

    sys_setpgid(child) fails if the child was forked by a sub-thread.

    Fix the "is it our child" check. The previous commit
    ee0acf90d320c29916ba8c5c1b2e908d81f5057d was not complete.

    (this patch asks for the new same_thread_group() helper, but mainline doesn't
    have it yet).

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc:
    Tested-by: "Marcin 'Qrczak' Kowalczyk"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • taskstats.ac_exitcode is assigned to task_struct.exit_code in bacct_add_tsk()
    through the following kernel function calls:

    do_exit()
    taskstats_exit()
    fill_pid()
    bacct_add_tsk()

    The problem is that in do_exit(), task_struct.exit_code is set to 'code' only
    after taskstats_exit() has been called. So we need to move the assignment
    before taskstats_exit().

    Signed-off-by: Jonathan Lim
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jonathan Lim
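
    A minimal sketch of the reordering in do_exit() that the commit calls for
    (fragment; surrounding code omitted):

        /* record the exit code before the accounting hook samples it ... */
        tsk->exit_code = code;
        /* ... so that bacct_add_tsk() sees the real value via the call chain
         * taskstats_exit() -> fill_pid() -> bacct_add_tsk(): */
        taskstats_exit(tsk, group_dead);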
     

28 Aug, 2007

7 commits

  • cleanup: we have the 'se' and 'curr' entity-pointers already,
    no need to use p->se and current->se.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mike Galbraith

    Ingo Molnar
     
  • small schedstat fix: the cfs_rq->wait_runtime 'sum of all runtimes'
    statistics counters missed newly forked tasks and thus had a constant
    negative skew. Fix this.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mike Galbraith

    Ingo Molnar
     
  • Peter Zijlstra noticed the following bug in SCHED_FEAT_SKIP_INITIAL (which
    is disabled by default at the moment): it relied on se.wait_start_fair
    being 0, but update_stats_wait_end() did not recognize a 0 value,
    so instead of 'skipping' the initial interval we gave the new child
    a maximum boost of +runtime-limit ...

    (No impact on the default kernel, but nice to fix for completeness.)

    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mike Galbraith

    Ingo Molnar
     
  • update the fair-clock before using it for the key value.

    [ mingo@elte.hu: small cleanups. ]

    Signed-off-by: Ting Yang
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra

    Ting Yang
     
  • de-HZ-ification of the granularity defaults unearthed a pre-existing
    property of CFS: while it correctly converges to the granularity goal,
    it does not prevent run-time fluctuations in the range of
    [-gran ... 0 ... +gran].

    With the increase of the granularity due to the removal of HZ
    dependencies, this becomes visible in chew-max output (with 5 tasks
    running):

    out: 28 . 27. 32 | flu: 0 . 0 | ran: 9 . 13 | per: 37 . 40
    out: 27 . 27. 32 | flu: 0 . 0 | ran: 17 . 13 | per: 44 . 40
    out: 27 . 27. 32 | flu: 0 . 0 | ran: 9 . 13 | per: 36 . 40
    out: 29 . 27. 32 | flu: 2 . 0 | ran: 17 . 13 | per: 46 . 40
    out: 28 . 27. 32 | flu: 0 . 0 | ran: 9 . 13 | per: 37 . 40
    out: 29 . 27. 32 | flu: 0 . 0 | ran: 18 . 13 | per: 47 . 40
    out: 28 . 27. 32 | flu: 0 . 0 | ran: 9 . 13 | per: 37 . 40

    average slice is the ideal 13 msecs and the period is picture-perfect 40
    msecs. But the 'ran' field fluctuates around 13.33 msecs and there's no
    mechanism in CFS to keep that from happening: it's a perfectly valid
    solution that CFS finds.

    to fix this we add a granularity/preemption rule that knows about
    the "target latency", which makes tasks that run longer than the ideal
    latency run a bit less. The simplest approach is to simply decrease the
    preemption granularity when a task overruns its ideal latency. For this
    we have to track how much the task executed since its last preemption.

    ( this adds a new field to task_struct, but we can eliminate that
    overhead in 2.6.24 by putting all the scheduler timestamps into an
    anonymous union. )

    with this change in place, chew-max output is fluctuation-less all
    around:

    out: 28 . 27. 39 | flu: 0 . 2 | ran: 13 . 13 | per: 41 . 40
    out: 28 . 27. 39 | flu: 0 . 2 | ran: 13 . 13 | per: 41 . 40
    out: 28 . 27. 39 | flu: 0 . 2 | ran: 13 . 13 | per: 41 . 40
    out: 28 . 27. 39 | flu: 0 . 2 | ran: 13 . 13 | per: 41 . 40
    out: 28 . 27. 39 | flu: 0 . 1 | ran: 13 . 13 | per: 41 . 40
    out: 28 . 27. 39 | flu: 0 . 1 | ran: 13 . 13 | per: 41 . 40

    this patch has no impact on any fastpath or on any globally observable
    scheduling property. (unless you have sharp enough eyes to see
    millisecond-level ruckles in glxgears smoothness :-)

    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mike Galbraith

    Ingo Molnar
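
    Schematically, the new rule boils down to a tick-time check along these
    lines, using the prev_sum_exec_runtime bookkeeping that the 05 Sep entries
    above refine (illustrative pseudo-kernel code, not the literal
    sched_fair.c hunk):

        /* how long has the current task run since it last got the CPU? */
        u64 ran = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;

        /* it has overrun its ideal slice: be more eager to preempt it */
        if (ran > ideal_runtime)
            resched_task(rq_of(cfs_rq)->curr);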
     
  • There is an Amarok song switch time increase (regression) under
    hefty load.

    What is happening is that sleeper_bonus is never consumed, and only
    rarely goes below runtime_limit, so for the most part, Amarok isn't
    getting any bonus at all. We're keeping sleeper_bonus right at
    runtime_limit (sched_latency == sched_runtime_limit == 40ms) forever, i.e.
    we don't consume it if we're lower than that, and don't add to it if we're
    above it. One Amarok thread waking (or anybody else) will push us past the
    threshold, so the next thread waking gets nada, but will reap pain from
    the previous thread waking until we drop back to runtime_limit. It
    looks to me like under load, some random task gets a bonus, and
    everybody else pays, whether deserving or not.

    This diff fixed the regression for me at any load rate.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra

    Mike Galbraith
     
  • Fix bogus DEBUG_PREEMPT warning on x86_64, when cpu brought online after
    bootup: current_is_keventd is right to note its use of smp_processor_id
    is preempt-safe, but should use raw_smp_processor_id to avoid the warning.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

26 Aug, 2007

4 commits

    The runtime limit and the wakeup granularity used to be a function of
    granularity; that was incorrect, and they were changed to sched_latency.

    Fix this to make wakeup granularity a function of min-granularity,
    and the runtime limit equal to latency.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • due to adaptive granularity scheduling the role of sched_granularity
    has changed to "minimum granularity", so rename the variable (and the
    tunable) accordingly.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra

    Ingo Molnar
     
    Instead of specifying the preemption granularity, specify the wanted
    latency. By fixing the granularity to a constant, the wakeup latency
    becomes a function of the number of running tasks on the rq.

    Invert this relation.

    sysctl_sched_granularity becomes a minimum for the dynamic granularity
    computed from the new sysctl_sched_latency.

    Then use this latency to make more intelligent granularity decisions: if
    there are fewer tasks running, we can schedule more coarsely (see the
    sketch below). This helps performance while still always meeting the
    latency target.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
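
    In other words, the dynamic granularity roughly follows this relation
    (schematic only; the actual sched_fair.c arithmetic may differ):

        /*
         * Divide the latency target among the runnable tasks, but never let
         * the dynamic granularity drop below the configured minimum:
         */
        unsigned long gran = sysctl_sched_latency / nr_running;

        if (gran < sysctl_sched_granularity)
            gran = sysctl_sched_granularity;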
     
  • Make the lockdep sysctls not depend on CONFIG_SCHED_DEBUG.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

25 Aug, 2007

8 commits

    fix task startup penalty miscalculation: sysctl_sched_granularity is an
    unsigned int and wait_runtime is a long, so we first have to convert it
    to long before turning it negative ... (see the demo below)

    Signed-off-by: Ingo Molnar

    Ingo Molnar
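
    The underlying C pitfall is easy to reproduce in user space; on a 64-bit
    machine this self-contained demo (unrelated to the kernel sources) prints
    a huge positive value for the uncast version and the intended negative
    penalty for the cast one:

        #include <stdio.h>

        int main(void)
        {
            unsigned int granularity = 10000000;   /* e.g. 10 ms in ns */
            long wait_runtime;

            wait_runtime = -granularity;           /* unsigned negation wraps around */
            printf("without cast: %ld\n", wait_runtime);

            wait_runtime = -(long)granularity;     /* convert first, then negate */
            printf("with cast:    %ld\n", wait_runtime);
            return 0;
        }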
     
  • current code:

    delta = calc_delta_mine(delta_exec, curr->load.weight, lw);
    delta = min((u64)delta, cfs_rq->sleeper_bonus);

    Notice that this calc_delta_mine() line is exactly delta_mine, which
    gives:

    delta = min((u64)delta_mine, cfs_rq->sleeper_bonus);

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • current code:

    delta = min(cfs_rq->sleeper_bonus, (u64)delta_exec);
    delta = calc_delta_mine(delta, curr->load.weight, lw);
    delta = min((u64)delta, cfs_rq->sleeper_bonus);

    drop the first min(), because we clip against sleeper_bonus in the 3rd line
    again. That gives:

    delta = calc_delta_mine(delta_exec, curr->load.weight, lw);
    delta = min((u64)delta, cfs_rq->sleeper_bonus);

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • make the bonus balance more consistent: do not hand out a bonus if
    there's too much in flight already, and only deduct from a runner as much
    as it has the capacity to give. This makes the bonus engine a zero-sum
    game (as intended).

    this also simplifies the code:

    text data bss dec hex filename
    34770 2998 24 37792 93a0 sched.o.before
    34749 2998 24 37771 938b sched.o.after

    and it also avoids overscheduling in sleep-happy workloads like
    hackbench.c.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Mitchell Erblich suggested a quality-of-implementation change to
    not requeue SCHED_RR tasks if there's only a single task on the
    runqueue, by checking for rq->nr_running == 1.

    Provide a more efficient implementation of that, which checks only that
    particular RT priority queue (see the sketch below).

    [ From: mingo@elte.hu ]

    Also first requeue the task then set need_resched - results in slightly
    better machine-instruction ordering. Also clean up the code a bit.

    Signed-off-by: Dmitry Adamushko
    Signed-off-by: Ingo Molnar

    Dmitry Adamushko
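
    A sketch of the per-priority-queue check referred to above, assuming the
    2.6.23-era rt runqueue layout where each task sits on its priority level's
    run_list (fragment only):

        /*
         * Requeue to the end of the queue only if the task is not the sole
         * element on its own RT priority queue -- cheaper than consulting
         * rq->nr_running; requeue first, then mark for reschedule:
         */
        if (p->run_list.prev != p->run_list.next) {
            requeue_task_rt(rq, p);
            set_tsk_need_resched(p);
        }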
     
  • Remove trivial conditional branch in Linux scheduler's
    can_migrate_task() function.

    text data bss dec hex filename
    34770 2998 24 37792 93a0 sched.o.before
    34757 2998 24 37779 9393 sched.o.after

    Signed-off-by: Sven-Thorsten Dietrich
    Signed-off-by: Ingo Molnar

    Sven-Thorsten Dietrich
     
  • remove HZ dependency from the granularity default. Use 10 msec for
    the base granularity, 1 msec for wakeup granularity and 25 msec for
    batch wakeup granularity. (These defaults are close to the values
    that the default HZ=250 setting got previously, and thus it's the
    most common setting.)

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • when I built with CONFIG_FAIR_GROUP_SCHED=y, I needed the following change
    to make things right.

    [ From: mingo@elte.hu ]

    this config option is not upstream-configurable right now but lets fix
    this for completeness.

    Signed-off-by: Bruce Ashfield
    Signed-off-by: Ingo Molnar

    Bruce Ashfield
     

24 Aug, 2007

2 commits