09 Jul, 2008

1 commit


06 Jul, 2008

1 commit


05 Jul, 2008

3 commits

  • Most places in the kernel that go BUG: print a module list (which is
    very useful for doing statistics and finding patterns); however, the
    softlockup detector does not do this yet.

    This patch adds the one-line change that closes this gap.
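
    A minimal sketch of the idea (the surrounding lines approximate
    kernel/softlockup.c of that era; the exact placement is an assumption,
    not a quote of the patch):

        /* in softlockup_tick(), next to the existing lockup report: */
        printk(KERN_ERR "BUG: soft lockup - CPU#%d stuck for %lus!\n",
               this_cpu, now - touch_timestamp);
        print_modules();        /* the one-line addition */
        dump_stack();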

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Ingo Molnar

    Arjan van de Ven
     
  • This commit includes a bugfix for the fragile setuid fixup code in the
    case that filesystem capabilities are supported (in access()). The effect
    of this fix is gated on filesystem capability support because changing
    securebits is only supported when filesystem capabilities support is
    configured.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Andrew G. Morgan
    Acked-by: Serge Hallyn
    Acked-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew G. Morgan
     
  • Remove all clameter@sgi.com addresses from the kernel tree since they will
    become invalid on June 27th. Change my maintainer email address for the
    slab allocators to cl@linux-foundation.org (my new address going forward).

    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Stephen Rothwell
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

04 Jul, 2008

1 commit

  • Due to a possible deadlock, the waking of the softirq was pushed outside
    of the hrtimer base locks. See commit 0c96c5979a522c3323c30a078a70120e29b5bdbc

    Unfortunately this allows the task to migrate after setting up the softirq
    and raising it. Since softirqs run a queue that is per-cpu, we may raise
    the softirq on the wrong CPU, which will keep the queued softirq task from
    running.

    To solve this issue, this patch disables preemption around the releasing
    of the hrtimer lock and raising of the softirq.
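
    A minimal sketch of the pattern (the variable names are illustrative
    assumptions, not quoted from the patch):

        preempt_disable();
        spin_unlock(&cpu_base->lock);     /* release the hrtimer base lock */
        raise_softirq(HRTIMER_SOFTIRQ);   /* queue it on *this* CPU */
        preempt_enable();                 /* migration is possible again */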

    Signed-off-by: Steven Rostedt
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     

03 Jul, 2008

2 commits


01 Jul, 2008

2 commits

  • Dhaval Giani reported this warning during cpu hotplug stress-tests:

    | On running kernel compiles in parallel with cpu hotplug:
    |
    | WARNING: at arch/x86/kernel/smp.c:118
    | native_smp_send_reschedule+0x21/0x36()
    | Modules linked in:
    | Pid: 27483, comm: cc1 Not tainted 2.6.26-rc7 #1
    | [...]
    | [] native_smp_send_reschedule+0x21/0x36
    | [] force_quiescent_state+0x47/0x57
    | [] call_rcu+0x51/0x6d
    | [] __fput+0x130/0x158
    | [] fput+0x17/0x19
    | [] filp_close+0x4d/0x57
    | [] sys_close+0x5c/0x97

    IMHO the warning is a spurious one.

    cpu_online_map is updated by the _cpu_down() using stop_machine_run().
    Since force_quiescent_state is invoked from irqs disabled section,
    stop_machine_run() won't be executing while a cpu is executing
    force_quiescent_state(). Hence the cpu_online_map is stable while we're
    in the irq disabled section.

    However, a cpu might have been offlined _just_ before we disabled irqs
    while entering force_quiescent_state(). And rcu subsystem might not yet
    have handled the CPU_DEAD notification, leading to the offlined cpu's
    bit being set in the rcp->cpumask.

    Hence we use cpumask = (rcp->cpumask & cpu_online_map) to prevent sending
    smp_send_reschedule() to an offlined CPU.

    Here's the timeline:

    CPU_A                                    CPU_B
    --------------------------------------------------------------
    cpu_down():                              .
    .                                        .
    .                                        .
    stop_machine(): /* disables preemption,  .
                     * and irqs */           .
    .                                        .
    .                                        .
    take_cpu_down();                         .
    .                                        .
    .                                        .
    .                                        .
    cpu_disable();  /* this removes cpu      .
                     * from cpu_online_map   .
                     */                      .
    .                                        .
    .                                        .
    restart_machine(); /* enables irqs */    .
    ------WINDOW DURING WHICH rcp->cpumask is stale ---------------
    .                                        call_rcu();
    .                                        /* disables irqs here */
    .                                        .force_quiescent_state();
    .CPU_DEAD:                               .for_each_cpu(rcp->cpumask)
    .                                        .       smp_send_reschedule();
    .                                        .
    .                                        .       WARN_ON() for offlined CPU!
    .
    .
    .
    rcu_cpu_notify:
    .
    -------- WINDOW ENDS ------------------------------------------
    rcu_offline_cpu() /* Which calls cpu_quiet()
                       * which removes
                       * cpu from rcp->cpumask.
                       */

    If a new batch was started just before calling stop_machine_run(), the
    to-be-offlined cpu is still present in rcp->cpumask.

    During a cpu-offline, from take_cpu_down(), we queue an rt-prio idle
    task as the next task to be picked by the scheduler. We also call
    cpu_disable() which will disable any further interrupts and remove the
    cpu's bit from the cpu_online_map.

    Once the stop_machine_run() successfully calls take_cpu_down(), it calls
    schedule(). That's the last time a schedule is called on the offlined
    cpu, and hence the last time when rdp->passed_quiesc will be set to 1
    through rcu_qsctr_inc().

    But cpu_quiet() on this cpu will be called only when the next RCU_SOFTIRQ
    occurs on this CPU. So at this time, the offlined CPU is still set in
    rcp->cpumask.

    Now coming back to the idle_task which truly offlines the CPU: it does
    check for a pending RCU and raises the softirq, since it will find
    rdp->passed_quiesc to be 0 in this case. However, since the cpu is
    offline I am not sure if the softirq will trigger on the CPU.

    Even if it doesn't, rcu_offline_cpu() will find that rcp->completed
    is not the same as rcp->cur, which means that our cpu could be holding
    up the grace period progression. Hence we call cpu_quiet() and move
    ahead.

    But because of the window explained in the timeline, we could still have
    a call_rcu() before the RCU subsystem executes its CPU_DEAD
    notification, and we send smp_send_reschedule() to the offlined cpu while
    trying to force the quiescent states. The appended patch adds comments
    and prevents checking for an offlined cpu every time.
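
    A rough sketch of that masking in force_quiescent_state() (names follow
    the classic RCU implementation of that era and are assumptions here, not
    quoted from the patch):

        int cpu;
        cpumask_t cpumask;

        /* only IPI cpus that still owe a quiescent state AND are online */
        cpus_and(cpumask, rcp->cpumask, cpu_online_map);
        cpu_clear(rdp->cpu, cpumask);
        for_each_cpu_mask(cpu, cpumask)
                smp_send_reschedule(cpu);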

    Reported-by: Dhaval Giani
    Signed-off-by: Gautham R Shenoy
    Acked-by: Dhaval Giani
    Cc: Dipankar Sarma
    Cc: laijs@cn.fujitsu.com
    Cc: Peter Zijlstra
    Cc: Rusty Russell
    Cc: "Paul E. McKenney"
    Signed-off-by: Ingo Molnar

    Gautham R Shenoy
     
  • Here is another little Oops we found while configuring invalid values
    via cgroups:

    echo 0 > /dev/cgroups/0/cpu.rt_period_us
    or
    echo 4294967296 > /dev/cgroups/0/cpu.rt_period_us

    [ 205.509825] divide error: 0000 [#1]
    [ 205.510151] Modules linked in:
    [ 205.510151]
    [ 205.510151] Pid: 2339, comm: bash Not tainted (2.6.26-rc8 #33)
    [ 205.510151] EIP: 0060:[] EFLAGS: 00000293 CPU: 0
    [ 205.510151] EIP is at div64_u64+0x5f/0x70
    [ 205.510151] EAX: 0000389f EBX: 00000000 ECX: 00000000 EDX: 00000000
    [ 205.510151] ESI: d9800000 EDI: 00000000 EBP: c6cede60 ESP: c6cede50
    [ 205.510151] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
    [ 205.510151] Process bash (pid: 2339, ti=c6cec000 task=c79be370 task.ti=c6cec000)
    [ 205.510151] Stack: d9800000 0000389f c05971a0 d9800000 c6cedeb4 c0214dbd 00000000 00000000
    [ 205.510151] c6cede88 c0242bd8 c05377c0 c7a41b40 00000000 00000000 00000000 c05971a0
    [ 205.510151] c780ed20 c7508494 c7a41b40 00000000 00000002 c6cedebc c05971a0 ffffffea
    [ 205.510151] Call Trace:
    [ 205.510151] [] ? __rt_schedulable+0x1cd/0x240
    [ 205.510151] [] ? cgroup_file_open+0x18/0xe0
    [ 205.510151] [] ? tg_set_bandwidth+0xa4/0xf0
    [ 205.510151] [] ? sched_group_set_rt_period+0x36/0x50
    [ 205.510151] [] ? cpu_rt_period_write_uint+0xe/0x10
    [ 205.510151] [] ? cgroup_file_write+0x125/0x160
    [ 205.510151] [] ? hrtimer_interrupt+0x155/0x190
    [ 205.510151] [] ? security_file_permission+0xf/0x20
    [ 205.510151] [] ? rw_verify_area+0x48/0xc0
    [ 205.510151] [] ? dupfd+0x104/0x130
    [ 205.510151] [] ? vfs_write+0x9c/0x160
    [ 205.510151] [] ? cgroup_file_write+0x0/0x160
    [ 205.510151] [] ? sys_write+0x3d/0x70
    [ 205.510151] [] ? sysenter_past_esp+0x6a/0x91
    [ 205.510151] =======================
    [ 205.510151] Code: 0f 45 de 31 f6 0f ad d0 d3 ea f6 c1 20 0f 45 c2 0f 45 d6 89 45 f0 89 55 f4 8b 55 f4 31 c9 8b 45 f0 39 d3 89 c6 77 08 89 d0 31 d2 f3 89 c1 83 c4 08 89 f0 f7 f3 89 ca 5b 5e 5d c3 55 89 e5 56
    [ 205.510151] EIP: [] div64_u64+0x5f/0x70 SS:ESP 0068:c6cede50

    The attached patch solves the issue for me.

    I check for the period not being zero as early as possible since, if it
    is zero, going ahead is useless. This way we also save a mutex_lock() and
    a read_lock() compared to doing the check inside tg_set_bandwidth() or
    __rt_schedulable().
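
    A minimal sketch of the early check; placing it in
    sched_group_set_rt_period() is my assumption, not a quote of the patch:

        int sched_group_set_rt_period(struct task_group *tg, long rt_period_us)
        {
                u64 rt_runtime, rt_period;

                rt_period = (u64)rt_period_us * NSEC_PER_USEC;
                rt_runtime = tg->rt_bandwidth.rt_runtime;

                /* bail out before anything divides by rt_period */
                if (rt_period == 0)
                        return -EINVAL;

                return tg_set_bandwidth(tg, rt_period, rt_runtime);
        }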

    Signed-off-by: Dario Faggioli
    Signed-off-by: Michael Trimarchi
    Signed-off-by: Ingo Molnar

    Raistlin
     

30 Jun, 2008

2 commits


29 Jun, 2008

1 commit

  • The CPU hotplug problems (crashes under high-volume unplug+replug
    tests) seem to be related to migrate_dead_tasks().

    Firstly I added traces to see all tasks being migrated with
    migrate_live_tasks() and migrate_dead_tasks(). On my setup the problem
    pops up (the one with "se == NULL" in the loop of
    pick_next_task_fair()) shortly after the traces indicate that something
    has been migrated with migrate_dead_tasks(). Btw., I can reproduce it
    much faster now with just a plain cpu down/up loop.

    [disclaimer] Well, unless I'm really missing something important in
    this late hour [/disclaimer], pick_next_task() is not something
    appropriate for migrate_dead_tasks() :-)

    The following change seems to eliminate the problem on my setup
    (although I kept it running only for a few minutes, to get a few
    messages indicating migrate_dead_tasks() does move tasks and the
    system is still ok).

    Signed-off-by: Ingo Molnar

    Dmitry Adamushko
     

25 Jun, 2008

3 commits


24 Jun, 2008

2 commits


23 Jun, 2008

2 commits

  • This patch addresses a very sporadic pi-futex related failure in
    highly threaded java apps on large SMP systems.

    David Holmes reported that the pi_state consistency check in
    lookup_pi_state triggered with his test application. This means that
    the kernel internal pi_state and the user space futex variable are out
    of sync. First we assumed that this was user space data corruption,
    but deeper investigation revealed that the problem happened because the
    pi-futex code is not handling a fault in the futex_lock_pi path when
    the user space variable needs to be fixed up.

    The fault happens when a fork mapped the anon memory which contains
    the futex read-only for COW, or the page got swapped out exactly between
    the unlock of the futex and the return of either the new futex owner
    or the task which was the expected owner but failed to acquire the
    kernel internal rtmutex. The current futex_lock_pi() code drops out
    with an inconsistent state in case it faults and returns -EFAULT to user
    space. User space has no way to fix up that state.

    When we wrote this code we thought that we could not drop the hash
    bucket lock at this point to handle the fault.

    After analysing the code again it turned out to be wrong because there
    are only two tasks involved which might modify the pi_state and the
    user space variable:

    - the task which acquired the rtmutex
    - the pending owner of the pi_state which did not get the rtmutex

    Both tasks drop into the fixup_pi_state() function before returning to
    user space. The first task which acquired the hash bucket lock faults
    in the fixup of the user space variable, drops the spinlock and calls
    futex_handle_fault() to fault in the page. Now the second task could
    acquire the hash bucket lock and tries to fixup the user space
    variable as well. It either faults as well or it succeeds because the
    first task already faulted the page in.

    One caveat is to avoid a double fixup. After returning from the fault
    handling we reacquire the hash bucket lock and check whether the
    pi_state owner has been modified already.
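
    A rough sketch of that retry pattern (variable names approximate
    kernel/futex.c of that era and are assumptions; this illustrates the
    idea, it is not the literal patch):

        /* fault while fixing up the user space value */
        spin_unlock(q.lock_ptr);                 /* drop hash bucket lock */
        ret = futex_handle_fault((unsigned long)uaddr, fshared, attempt++);
        spin_lock(q.lock_ptr);                   /* reacquire */

        /* double-fixup check: the other task may have done it meanwhile */
        if (q.pi_state->owner != oldowner)
                goto out;
        if (!ret)
                goto retry;                      /* redo the fixup */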

    Reported-by: David Holmes
    Signed-off-by: Thomas Gleixner
    Cc: Andrew Morton
    Cc: David Holmes
    Cc: Peter Zijlstra
    Cc: Linus Torvalds
    Cc:
    Signed-off-by: Ingo Molnar

    kernel/futex.c | 93 ++++++++++++++++++++++++++++++++++++++++++++-------------
    1 file changed, 73 insertions(+), 20 deletions(-)

    Thomas Gleixner
     
  • Ingo Molnar
     

21 Jun, 2008

1 commit


20 Jun, 2008

3 commits

  • Simplify the code and fix the boundary condition of
    wait_for_completion_timeout(,0).

    We can kill the first __remove_wait_queue() as well.

    Signed-off-by: Ingo Molnar
    Acked-by: Peter Zijlstra

    Oleg Nesterov
     
  • It seems that the current implementation of wait_for_completion_timeout()
    has a small problem under very high load for the common pattern:

    if (!wait_for_completion_timeout(&done, timeout))
            /* handle failure */

    because the implementation very roughly does (lots of code deleted to
    show the basic flow):

    static inline long __sched
    do_wait_for_common(struct completion *x, long timeout, int state)
    {
            if (x->done)
                    return timeout;

            do {
                    timeout = schedule_timeout(timeout);

                    if (!timeout)
                            return timeout;

            } while (!x->done);

            return timeout;
    }

    so if the system is very busy and x->done is not set when
    do_wait_for_common() is entered, it is possible that the first call to
    schedule_timeout() returns 0 because the task doing wait_for_completion
    doesn't get rescheduled for a long time, even if it is woken up early
    enough.

    In this case, wait_for_completion_timeout() returns 0 without even
    checking x->done again, and the code above falls into its failure case
    purely for scheduler reasons, even if the hardware event or whatever was
    being waited for happened early enough.

    It would make sense to add an extra test to do_wait_for() in the timeout
    case and return 1 if x->done is actually set.
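
    A minimal sketch of that extra test in the wait loop (an illustration of
    the suggestion, not the literal patch):

        do {
                timeout = schedule_timeout(timeout);

                if (!timeout)
                        /* the schedule timed out, but the completion may
                         * already have been signalled while we waited to
                         * run again: report success in that case */
                        return x->done ? 1 : timeout;

        } while (!x->done);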

    A quick audit (not exhaustive) of wait_for_completion_timeout() callers
    seems to indicate that no one actually cares about the return value in
    the success case -- they just test for 0 (timed out) versus non-zero
    (wait succeeded).

    Signed-off-by: Ingo Molnar

    Roland Dreier
     
  • So if the group ever gets throttled, it will never wake up again.

    Reported-by: "Daniel K."
    Signed-off-by: Peter Zijlstra
    Tested-by: Daniel K.
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

19 Jun, 2008

9 commits

  • This patch corrects the incorrect value of per process run-queue wait
    time reported by delay statistics. The anomaly was due to the following
    reason. When a process leaves the CPU and immediately starts waiting for
    CPU on the runqueue (which means it remains in the TASK_RUNNABLE state),
    the time of re-entry into the run-queue is never recorded. Due to this,
    the waiting time on the runqueue from this point of re-entry up to the
    next time it hits the CPU is not accounted for. This is solved by
    recording the time of re-entry of a process leaving the CPU in the
    sched_info_depart() function IF the process will go back to waiting on
    the run-queue. This IF condition is verified by checking whether the
    process is still in the TASK_RUNNABLE state.
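
    A minimal sketch of that check in sched_info_depart() (field names follow
    kernel/sched_stats.h of that era, with TASK_RUNNING standing in for the
    runnable state; an illustration, not the literal patch):

        static inline void sched_info_depart(struct task_struct *t)
        {
                unsigned long long delta = task_rq(t)->clock -
                                           t->sched_info.last_arrival;

                t->sched_info.cpu_time += delta;

                /* still runnable: record the time of re-entry onto the
                 * runqueue so the coming wait is accounted for */
                if (t->state == TASK_RUNNING)
                        sched_info_queued(t);
        }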

    The patch was tested on 2.6.26-rc6 using two simple CPU hog programs.
    The values noted prior to the fix did not account for the time spent on
    the runqueue waiting. After the fix, the correct values were reported
    back to user space.

    Signed-off-by: Bharath Ravi
    Signed-off-by: Madhava K R
    Cc: dhaval@linux.vnet.ibm.com
    Cc: vatsa@in.ibm.com
    Cc: balbir@in.ibm.com
    Acked-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Bharath Ravi
     
  • The touch_nmi_watchdog() routine on x86 ultimately calls
    touch_softlockup_watchdog(). The problem is that to touch the
    softlockup watchdog, the cpu_clock code has to be called which could
    involve multiple cpu locks and can lead to a hard hang if one of the
    locks is held by a processor that is not going to return anytime soon
    (such as could be the case with kgdb or perhaps even with some other
    kind of exception).

    This patch causes the public version of the
    touch_softlockup_watchdog() to defer the cpu clock access to a later
    point.
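
    A rough sketch of the deferral (the per-cpu variable follows
    kernel/softlockup.c of that era; the __touch_softlockup_watchdog() split
    is my reading of the fix, not a quote of the patch):

        void touch_softlockup_watchdog(void)
        {
                /* NMI-safe: only flag the timestamp, no cpu_clock() here */
                __raw_get_cpu_var(touch_timestamp) = 0;
        }

        /* later, from softlockup_tick() in timer context: */
        if (per_cpu(touch_timestamp, this_cpu) == 0) {
                __touch_softlockup_watchdog();  /* reads cpu_clock() safely */
                return;
        }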

    The test case for this problem is to use the following kernel config
    options:

    CONFIG_KGDB_TESTS=y
    CONFIG_KGDB_TESTS_ON_BOOT=y
    CONFIG_KGDB_TESTS_BOOT_STRING="V1F100I100000"

    It should be noted that kgdb test suite and these options were not
    available until 2.6.26-rc2, so it was necessary to patch the kgdb
    test suite during the bisection.

    I would consider this patch a regression fix because the problem first
    appeared in commit 27ec4407790d075c325e1f4da0a19c56953cce23 when some
    logic was added to try to periodically sync the clocks. It was
    possible to work around this particular problem by simply not
    performing the sync anytime the system was in a critical context.
    This was ok until commit 3e51f33fcc7f55e6df25d15b55ed10c8b4da84cd,
    which added config option CONFIG_HAVE_UNSTABLE_SCHED_CLOCK and some
    multi-cpu locks to sync the clocks. It became clear that accessing
    this code from an NMI was the source of the lockups. Avoiding the
    access to the low level clock code from code inside the NMI
    processing also fixed the problem with the 27ec44... commit.

    Signed-off-by: Jason Wessel
    Signed-off-by: Ingo Molnar

    Jason Wessel
     
  • In rcupreempt, rcu_batches_completed_bh is defined as a static inline in
    the header file. This does not need to be exported, and not only that,
    this breaks my PPC build.

    Signed-off-by: Steven Rostedt
    Cc: "Paul E. McKenney"
    Cc: paulus@samba.org
    Cc: linuxppc-dev@ozlabs.org
    Signed-off-by: Thomas Gleixner

    Steven Rostedt
     
  • We allow the inputs to be [-1 ... SD_LV_MAX), and return -EINVAL
    for inputs outside this range.
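
    A minimal sketch of the range check, as it would look in the cpuset
    relax_domain_level handler (an illustration, not necessarily the literal
    patch):

        if (val < -1 || val >= SD_LV_MAX)
                return -EINVAL;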

    Signed-off-by: Li Zefan
    Acked-by: Paul Menage
    Acked-by: Paul Jackson
    Acked-by: Hidetoshi Seto
    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner

    Li Zefan
     
  • First issue is not related to the cpusets. We're simply leaking doms_cur.
    It's allocated in arch_init_sched_domains() which is called for every
    hotplug event. So we just keep reallocating doms_cur without freeing it.
    I introduced a free_sched_domains() function that cleans things up.

    Second issue is that sched domains created by the cpusets are
    completely destroyed by the CPU hotplug events. For all CPU hotplug
    events the scheduler attaches all CPUs to the NULL domain and then puts
    them all into a single domain, thereby destroying the domains created
    by the cpusets (partition_sched_domains).
    The solution is simple: when cpusets are enabled, the scheduler should
    not create the default domain and should instead let cpusets do that,
    which is exactly what the patch does.

    Signed-off-by: Max Krasnyansky
    Cc: pj@sgi.com
    Cc: menage@google.com
    Cc: rostedt@goodmis.org
    Acked-by: Peter Zijlstra
    Signed-off-by: Thomas Gleixner

    Max Krasnyansky
     
  • In tick_task_rt() we first call update_curr_rt(), which can dequeue a
    runqueue due to it running out of runtime, and then we try to requeue it
    if it has also exhausted its RR quota. Obviously requeueing something that
    is no longer on the runqueue will not have the expected result.

    Signed-off-by: Peter Zijlstra
    Tested-by: Daniel K.
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The bandwidth throttle code dequeues a group when it runs out of quota, and
    re-queues it once the period rolls over and the quota gets refreshed.

    Sadly it failed to take the hierarchy into consideration. Share more of
    the enqueue/dequeue code with regular task operations.

    Also, some operations like sched_setscheduler() can dequeue/enqueue tasks
    that are in throttled runqueues; we should not inadvertently re-enqueue
    empty runqueues, so check for that.

    Signed-off-by: Peter Zijlstra
    Tested-by: Daniel K.
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Don't re-set the entity's runqueue to the wrong rq after we've set it
    to the right one.

    Signed-off-by: Peter Zijlstra
    Tested-by: Daniel K.
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • When CONFIG_RT_GROUP_SCHED and CONFIG_CGROUP_SCHED are enabled, with:

    echo 10000 > /proc/sys/kernel/sched_rt_period_us

    We get this:

    BUG: unable to handle kernel NULL pointer dereference at 0000008c
    [ 947.682233] IP: [] __rt_schedulable+0x12/0x160
    [ 947.683123] *pde = 00000000
    [ 947.683782] Oops: 0000 [#1]
    [ 947.684307] Modules linked in:
    [ 947.684308]
    [ 947.684308] Pid: 2359, comm: bash Not tainted (2.6.26-rc6 #8)
    [ 947.684308] EIP: 0060:[] EFLAGS: 00000246 CPU: 0
    [ 947.684308] EIP is at __rt_schedulable+0x12/0x160
    [ 947.684308] EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000001
    [ 947.684308] ESI: c0521db4 EDI: 00000001 EBP: c6cc9f00 ESP: c6cc9ed0
    [ 947.684308] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
    [ 947.684308] Process bash (pid: 2359, ti=c6cc8000 task=c7a54f00 task.ti=c6cc8000)
    [ 947.684308] Stack: c0222790 00000000 080f8c08 c0521db4 c6cc9f00 00000001 00000000 00000000
    [ 947.684308] c6cc9f9c 00000000 c0521db4 00000001 c6cc9f28 c0216d40 00000000 00000000
    [ 947.684308] c6cc9f9c 000f4240 000e7ef0 ffffffff c0521db4 c79dfb60 c6cc9f58 c02af2cc
    [ 947.684308] Call Trace:
    [ 947.684308] [] ? do_proc_dointvec_conv+0x0/0x50
    [ 947.684308] [] ? sched_rt_handler+0x80/0x110
    [ 947.684308] [] ? proc_sys_call_handler+0x9c/0xb0
    [ 947.684308] [] ? proc_sys_write+0x1a/0x20
    [ 947.684308] [] ? vfs_write+0x96/0x160
    [ 947.684308] [] ? proc_sys_write+0x0/0x20
    [ 947.684308] [] ? sys_write+0x3d/0x70
    [ 947.684308] [] ? sysenter_past_esp+0x6a/0x91
    [ 947.684308] =======================
    [ 947.684308] Code: 24 04 e8 62 b1 0e 00 89 c7 89 f8 8b 5d f4 8b 75
    f8 8b 7d fc 89 ec 5d c3 90 55 89 e5 57 56 53 83 ec 24 89 45 ec 89 55 e4
    89 4d e8 b8 8c 00 00 00 85 ff 0f 84 c9 00 00 00 8b 57 24 39 55 e8
    8b
    [ 947.684308] EIP: [] __rt_schedulable+0x12/0x160 SS:ESP 0068:c6cc9ed0

    We think the following patch solves the issue.

    Signed-off-by: Dario Faggioli
    Signed-off-by: Michael Trimarchi
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Dario Faggioli
     

17 Jun, 2008

1 commit


13 Jun, 2008

2 commits


12 Jun, 2008

2 commits

  • (overflow means weight >= 2^32 here, because inv_weight = 2^32/weight)

    The weight of a cfs_rq is the sum of the weights of the entities
    queued on it, so it will overflow when there are too many entities.

    Overflow occurs very rarely, but it breaks fairness when it does.
    64-bit systems have more memory than 32-bit systems and can usually
    create more processes, so overflow may occur more frequently there.

    This patch guarantees fairness when overflow happens on 64-bit systems.
    Thanks to compiler optimization, it changes nothing on 32-bit.

    Signed-off-by: Lai Jiangshan
    Acked-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Lai Jiangshan
     
  • I found a bug which can be reproduced this way (linux-2.6.26-rc5, x86-64),
    using 2^32, 2^33, ..., 2^63 as the shares value:

    # mkdir /dev/cpuctl
    # mount -t cgroup -o cpu cpuctl /dev/cpuctl
    # cd /dev/cpuctl
    # mkdir sub
    # echo 0x8000000000000000 > sub/cpu.shares
    # echo $$ > sub/tasks
    oops here! divide by zero.

    This is because do_div() expects the 2nd parameter to be 32 bits,
    but unsigned long is 64 bits on x86_64.

    Peter Zijlstra pointed out that the sane thing to do is to limit the
    shares value to something smaller instead of using an even more
    expensive divide.

    Also, I found another bug about "the shares value is too large":

    pid1 and pid2 are set affinity to cpu#0
    pid1 is attached to cg1 and pid2 is attached to cg2

    if cg1/cpu.shares = 1024 cg2/cpu.shares = 2000000000
    then pid2 got 100% usage of cpu, and pid1 0%

    if cg1/cpu.shares = 1024 cg2/cpu.shares = 20000000000
    then pid2 got 0% usage of cpu, and pid1 100%

    And since the weight of a cfs_rq is the sum of the weights of the
    entities queued on it, the shares value should be limited to a
    smaller value.

    I think that (1UL << 18) is a good limit value:

    1) it's not too large: we can create a lot of groups before overflow
    2) it's several times the weight value for nice=-19 (so not too small)
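
    A minimal sketch of the clamp this suggests (the constant names are my
    own shorthand for illustration, not quoted from the patch):

        #define MIN_SHARES      2
        #define MAX_SHARES      (1UL << 18)

        /* in sched_group_set_shares(), before any arithmetic on shares: */
        if (shares < MIN_SHARES)
                shares = MIN_SHARES;
        else if (shares > MAX_SHARES)
                shares = MAX_SHARES;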

    Signed-off-by: Lai Jiangshan
    Acked-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Lai Jiangshan
     

10 Jun, 2008

1 commit

  • schedule() has the special "TASK_INTERRUPTIBLE && signal_pending()" case;
    this allows us to do

    current->state = TASK_INTERRUPTIBLE;
    schedule();

    without fear of sleeping with a pending signal.

    However, the code like

    current->state = TASK_KILLABLE;
    schedule();

    is not right, schedule() doesn't take TASK_WAKEKILL into account. This means
    that mutex_lock_killable(), wait_for_completion_killable(), down_killable(),
    schedule_timeout_killable() can miss SIGKILL (and btw the second SIGKILL has
    no effect).

    Introduce the new helper, signal_pending_state(), and change schedule() to
    use it. Hopefully it will have more users; that is why the task's state is
    passed separately.

    Note the "__TASK_STOPPED | __TASK_TRACED" check in signal_pending_state().
    This is needed to preserve the current behaviour (ptrace_notify). I hope
    this check will be removed soon, but this (afaics good) change needs a
    separate discussion.
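
    A sketch of what the helper looks like, reconstructed from the
    description above (not necessarily the literal patch):

        static inline int signal_pending_state(long state, struct task_struct *p)
        {
                if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
                        return 0;
                if (!signal_pending(p))
                        return 0;

                /* preserve current ptrace_notify() behaviour, see above */
                if (state & (__TASK_STOPPED | __TASK_TRACED))
                        return 0;

                return (state & TASK_INTERRUPTIBLE) || __fatal_signal_pending(p);
        }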

    The fast path is "(state & (INTERRUPTIBLE | WAKEKILL)) + signal_pending(p)",
    basically the same as what schedule() does now. However, this patch of
    course bloats schedule().

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     

07 Jun, 2008

1 commit