07 Oct, 2008

1 commit


04 Oct, 2008

1 commit


03 Oct, 2008

1 commit


30 Sep, 2008

1 commit

  • …el/git/tip/linux-2.6-tip

    * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    hrtimer: prevent migration of per CPU hrtimers
    hrtimer: mark migration state
    hrtimer: fix migration of CB_IRQSAFE_NO_SOFTIRQ hrtimers
    hrtimer: migrate pending list on cpu offline

    Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

    Linus Torvalds
     

29 Sep, 2008

5 commits

  • There's a race between mm->owner assignment and swapoff, more easily
    seen when task slab poisoning is turned on. The condition occurs when
    try_to_unuse() runs in parallel with an exiting task. A similar race
    can occur with callers of get_task_mm(), such as /proc//
    or ptrace or page migration.

    CPU0                                    CPU1
                                            try_to_unuse
                                            looks at mm = task0->mm
                                            increments mm->mm_users
    task 0 exits
    mm->owner needs to be updated, but no
    new owner is found (mm_users > 1, but
    no other task has task->mm = task0->mm)
    mm_update_next_owner() leaves
                                            mmput(mm) decrements mm->mm_users
    task0 freed
                                            dereferencing mm->owner fails

    The fix is to notify the subsystem via the mm_owner_changed() callback
    if no new owner is found, by specifying the new task as NULL.

    Jiri Slaby:
    mm->owner was set to NULL prior to calling cgroup_mm_owner_callbacks(), but
    must be set after that, so as not to pass NULL as old owner causing oops.

    Daisuke Nishimura:
    mm_update_next_owner() may set mm->owner to NULL, but mem_cgroup_from_task()
    and its callers need to take account of this situation to avoid oops.

    Hugh Dickins:
    Lockdep warning and hang below exec_mmap() when testing these patches.
    exit_mm() up_reads mmap_sem before calling mm_update_next_owner(),
    so exec_mmap() now needs to do the same. And with that repositioning,
    there's now no point in mm_need_new_owner() allowing for NULL mm.

    Reported-by: Hugh Dickins
    Signed-off-by: Balbir Singh
    Signed-off-by: Jiri Slaby
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Impact: per CPU hrtimers can be migrated from a dead CPU

    The hrtimer code has no knowledge about per CPU timers, but we need to
    prevent the migration of such timers and warn when such a timer is
    active at migration time.

    Explicitly mark the timers as per CPU and use a more understandable
    mode descriptor for the interrupt-safe unlocked callback mode, which
    is used by hrtimer_sleeper and the scheduler code.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Impact: during migration active hrtimers can be seen as inactive

    The migration code removes the hrtimers from the queues of the dead
    CPU and temporarily sets the state to INACTIVE. The enqueue code sets it
    to ACTIVE/PENDING again.

    Prevent the wrong state from being seen by using a separate migration
    state bit.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
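
The window described in this commit can be illustrated with a small user-space model. This is only a sketch: the state bit values follow the hrtimer header of that era as far as I recall and should be treated as illustrative, and every `toy_` name is hypothetical rather than kernel API.

```c
#include <stdbool.h>

/* State bits; values are illustrative, modelled on the hrtimer header. */
enum {
    HRTIMER_STATE_INACTIVE = 0x00,
    HRTIMER_STATE_ENQUEUED = 0x01,
    HRTIMER_STATE_CALLBACK = 0x02,
    HRTIMER_STATE_PENDING  = 0x04,
    HRTIMER_STATE_MIGRATE  = 0x08,
};

/* Hypothetical stand-in for struct hrtimer; only the state word matters here. */
struct toy_hrtimer {
    unsigned long state;
};

/* A timer counts as active whenever any state bit is set. */
bool toy_hrtimer_active(const struct toy_hrtimer *t)
{
    return t->state != HRTIMER_STATE_INACTIVE;
}

/* Step 1 of migration: take the timer off the dead CPU's queue.
 * Without the marker bit the timer looks inactive until step 2 runs. */
void toy_migrate_dequeue(struct toy_hrtimer *t, bool use_migrate_bit)
{
    t->state = use_migrate_bit ? HRTIMER_STATE_MIGRATE
                               : HRTIMER_STATE_INACTIVE;
}

/* Step 2: enqueue on the live CPU and drop the migration marker. */
void toy_migrate_enqueue(struct toy_hrtimer *t)
{
    t->state |= HRTIMER_STATE_ENQUEUED;
    t->state &= ~(unsigned long)HRTIMER_STATE_MIGRATE;
}
```

Between the two steps, `toy_hrtimer_active()` reports false when the state is set to INACTIVE and true when the separate migration bit is used, which is the difference the commit message describes.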
     
  • Impact: Stale timers after a CPU went offline.

    commit 37bb6cb4097e29ffee970065b74499cbf10603a3
    hrtimer: unlock hrtimer_wakeup

    changed the hrtimer sleeper callback mode to CB_IRQSAFE_NO_SOFTIRQ due
    to locking problems. A result of this change is that when enqueue is
    called for an already expired hrtimer the callback function is no
    longer called directly from the enqueue code. The normal callers have
    been fixed in the code, but the migration code which moves hrtimers
    from a dead CPU to a live CPU was not made aware of this.

    This can be fixed by checking the timer state after the call to
    enqueue in the migration code.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Impact: hrtimers which are on the pending list are not migrated at cpu
    offline and can be stale forever

    Add the pending list migration when CONFIG_HIGH_RES_TIMERS is enabled

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

26 Sep, 2008

2 commits

  • On the x86 arch, user space single step exceptions should be ignored
    if they occur in the kernel space, such as ptrace stepping through a
    system call.

    First check whether it is kgdb that is executing a single step, then
    ensure that it is not an accidental traversal into user space while in
    kgdb. Any other time TIF_SINGLESTEP is set, kgdb should ignore the
    exception.

    On x86, arm, mips and powerpc, the kgdb_contthread usage was
    inconsistent with the way single stepping is implemented in the kgdb
    core. The arch specific stub should always set the
    kgdb_cpu_doing_single_step correctly if it is single stepping. This
    allows kgdb to correctly process an instruction step if ptrace
    happens to be requesting an instruction step over a system call.

    Signed-off-by: Jason Wessel

    Jason Wessel
     
  • On the ARM architecture, kgdb will crash the kernel if the last byte
    of valid memory is written due to a flush_icache_range flushing
    beyond the memory boundary.

    Signed-off-by: Atsuo Igarashi
    Signed-off-by: Jason Wessel

    Atsuo Igarashi
     

24 Sep, 2008

2 commits


23 Sep, 2008

7 commits

  • A segmentation fault can occur in kimage_add_entry in kexec.c when loading
    a kernel image into memory. The fault occurs because a page is requested
    by calling kimage_alloc_page with gfp_mask GFP_KERNEL and the function may
    actually return a page with gfp_mask GFP_HIGHUSER. The high mem page is
    returned because it was swapped with the kernel page due to the kernel
    page being a page that will shortly be copied to.

    This patch ensures that kimage_alloc_page returns a page that was created
    with the correct gfp flags.

    I have verified the change and fixed the whitespace damage of the original
    patch. Jonathan did a great job of tracking this down after he hit the
    problem. -- Eric

    Signed-off-by: Jonathan Steel
    Signed-off-by: Eric W. Biederman
    Acked-by: Simon Horman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jonathan Steel
     
  • kernel/time/tick-common.c: In function ‘tick_setup_periodic’:
    kernel/time/tick-common.c:113: error: implicit declaration of function ‘tick_broadcast_oneshot_active’

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Impact: timer hang on CPU online observed on AMD C1E systems

    When a CPU is brought online then the broadcast machinery can
    be in the one shot state already. Check this and setup the timer
    device of the new CPU in one shot mode so the broadcast code
    can pick up the next_event value correctly.

    Another AMD C1E oddity, as we switch to broadcast immediately and
    not after the full bring up via the ACPI cpu idle code.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Impact: Possible hang on CPU online observed on AMD C1E machines.

    The broadcast setup code looks at the mode of the tick device to
    determine whether it needs to be shut down or setup. This is wrong
    when the broadcast mode is set to one shot already. This can happen
    when a CPU is brought online as it goes through the periodic setup
    first.

    The problem went unnoticed as sane systems do not call into that code
    before the switch to one shot for the clock event device happens.
    The AMD C1E idle routine switches over immediately and thereby shuts
    down the just setup device before the first interrupt happens.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Impact: possible hang on CPU onlining in timer one shot mode.

    The tick_next_period variable is only used during boot on nohz/highres
    enabled systems, but for CPU onlining it needs to be maintained when
    the per cpu clock events device operates in one shot mode.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Impact: rare hang which can be triggered on CPU online.

    tick_do_timer_cpu keeps track of the CPU which updates jiffies
    via do_timer. The value -1 is used to signal that currently no
    CPU is doing this. There are two cases where the variable can
    have this state:

    boot:
    necessary for systems where the boot cpu id can be != 0

    nohz long idle sleep:
    When the CPU which did the jiffies update last goes into
    a long idle sleep it drops the update jiffies duty so
    another CPU which is not idle can pick it up and keep
    jiffies going.

    Using the same value for both situations is wrong, as the CPU online
    code can see the -1 state when the timer of the newly onlined CPU is
    set up. The setup for a newly onlined CPU goes through periodic mode
    and can pick up the do_timer duty without being aware of the nohz /
    highres mode of the already running system.

    Use two separate states and make them constants to avoid magic
    number confusion.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
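
A minimal sketch of the two-state idea, assuming the constant names and values used by the kernel's tick code (`TICK_DO_TIMER_NONE`, `TICK_DO_TIMER_BOOT`); the helper functions are hypothetical and only model who may claim the do_timer duty.

```c
/* Constant names follow the kernel's tick code; treat the values as illustrative. */
#define TICK_DO_TIMER_NONE (-1)   /* nohz: duty dropped on purpose */
#define TICK_DO_TIMER_BOOT (-2)   /* boot: nobody has claimed the duty yet */

static int tick_do_timer_cpu = TICK_DO_TIMER_BOOT;

/* Hypothetical periodic setup: a CPU may claim the duty only in the
 * boot state, never because another CPU dropped it for nohz idle. */
int toy_tick_setup_periodic(int cpu)
{
    if (tick_do_timer_cpu == TICK_DO_TIMER_BOOT)
        tick_do_timer_cpu = cpu;
    return tick_do_timer_cpu;
}

/* Hypothetical nohz path: the owner drops the duty before a long idle sleep. */
int toy_nohz_long_idle(int cpu)
{
    if (tick_do_timer_cpu == cpu)
        tick_do_timer_cpu = TICK_DO_TIMER_NONE;
    return tick_do_timer_cpu;
}
```

With a single -1 value for both cases, the second call to the setup helper would have let a newly onlined CPU grab the duty while the running system is in nohz mode; with two states it leaves the variable alone.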
     
  • LD kernel/built-in.o
    WARNING: kernel/built-in.o(.text+0x326): Section mismatch in reference
    from the function init_hrtick() to the variable
    .cpuinit.data:hotplug_hrtick_nb.8
    The function init_hrtick() references
    the variable __cpuinitdata hotplug_hrtick_nb.8.
    This is often because init_hrtick lacks a __cpuinitdata
    annotation or the annotation of hotplug_hrtick_nb.8 is wrong.

    Signed-off-by: Md.Rakib H. Mullick
    Signed-off-by: Andrew Morton
    Signed-off-by: Ingo Molnar

    Rakib Mullick
     

20 Sep, 2008

2 commits


17 Sep, 2008

1 commit

  • The device shutdown does not clean up the next_event variable of the
    clock event device. So when the device is reactivated, the possibly
    stale next_event value can prevent the device from being reprogrammed,
    as it claims to be waiting on an event already.

    This is the root cause of the resurfacing suspend/resume problem,
    where systems need a key press to come back to life.

    Fix this by setting next_event to KTIME_MAX when the device is shut
    down. Use a separate function for shutdown which takes care of that
    and only keep the direct set mode call in the broadcast code, where we
    cannot touch the next_event value.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
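
The effect of a stale next_event can be sketched in user space. This is an assumption-laden model, not the clockevents code: `toy_clock_dev` and all `toy_` functions are hypothetical, and the "only reprogram when the new expiry is earlier" caller is one plausible way a stale value blocks reprogramming.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_KTIME_MAX INT64_MAX   /* stand-in for KTIME_MAX */

/* Hypothetical clock event device: just a mode flag and next_event. */
struct toy_clock_dev {
    bool    shut_down;
    int64_t next_event;
};

/* Shutdown that leaves next_event untouched (the behaviour being fixed). */
void toy_shutdown_stale(struct toy_clock_dev *d)
{
    d->shut_down = true;
}

/* Shutdown that also resets next_event, as the commit message describes. */
void toy_shutdown_fixed(struct toy_clock_dev *d)
{
    d->shut_down = true;
    d->next_event = TOY_KTIME_MAX;
}

/* Caller that skips reprogramming when the device claims to be waiting
 * for an event that is already earlier than the requested one. */
bool toy_program_if_earlier(struct toy_clock_dev *d, int64_t expires)
{
    if (expires >= d->next_event)
        return false;             /* device "already waiting": skipped */
    d->shut_down = false;
    d->next_event = expires;
    return true;
}
```

After `toy_shutdown_stale()` a device whose old next_event lies in the past rejects every new expiry, while after `toy_shutdown_fixed()` the first request is programmed.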
     

14 Sep, 2008

1 commit

  • After the patch:

    commit 0b2f630a28d53b5a2082a5275bc3334b10373508
    Author: Miao Xie
    Date: Fri Jul 25 01:47:21 2008 -0700

    cpusets: restructure the function update_cpumask() and update_nodemask()

    It might happen that 'echo 0 > /cpuset/sub/cpus' returned failure but 'cpus'
    has been changed, because cpus was changed before calling heap_init() which
    may return -ENOMEM.

    This patch restores the original behavior.

    Signed-off-by: Li Zefan
    Acked-by: Paul Menage
    Cc: Paul Jackson
    Cc: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     

11 Sep, 2008

2 commits

  • Andrei Gusev wrote:

    > I played witch scheduler settings. After doing something like:
    > echo -n 1000000 >sched_rt_period_us
    >
    > command is locked. I found in kernel.log:
    >
    > Sep 11 00:39:34 zaratustra
    > Sep 11 00:39:34 zaratustra Pid: 4495, comm: bash Tainted: G W
    > (2.6.26.3 #12)
    > Sep 11 00:39:34 zaratustra EIP: 0060:[] EFLAGS: 00210246 CPU: 0
    > Sep 11 00:39:34 zaratustra EIP is at div64_u64+0x57/0x80
    > Sep 11 00:39:34 zaratustra EAX: 0000389f EBX: 00000000 ECX: 00000000
    > EDX: 00000000
    > Sep 11 00:39:34 zaratustra ESI: d9800000 EDI: d9800000 EBP: 0000389f
    > ESP: ea7a6edc
    > Sep 11 00:39:34 zaratustra DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
    > Sep 11 00:39:34 zaratustra Process bash (pid: 4495, ti=ea7a6000
    > task=ea744000 task.ti=ea7a6000)
    > Sep 11 00:39:34 zaratustra Stack: 00000000 000003e8 d9800000 0000389f
    > c0119042 00000000 00000000 00000001
    > Sep 11 00:39:34 zaratustra 00000000 00000000 ea7a6f54 00010000 00000000
    > c04d2e80 00000001 000e7ef0
    > Sep 11 00:39:34 zaratustra c01191a3 00000000 00000000 ea7a6fa0 00000001
    > ffffffff c04d2e80 ea5b2480
    > Sep 11 00:39:34 zaratustra Call Trace:
    > Sep 11 00:39:34 zaratustra [] __rt_schedulable+0x52/0x130
    > Sep 11 00:39:34 zaratustra [] sched_rt_handler+0x83/0x120
    > Sep 11 00:39:34 zaratustra [] proc_sys_call_handler+0xb6/0xd0
    > Sep 11 00:39:34 zaratustra [] proc_sys_write+0x0/0x20
    > Sep 11 00:39:34 zaratustra [] proc_sys_write+0x19/0x20
    > Sep 11 00:39:34 zaratustra [] vfs_write+0xa8/0x140
    > Sep 11 00:39:34 zaratustra [] sys_write+0x41/0x80
    > Sep 11 00:39:34 zaratustra [] sysenter_past_esp+0x6a/0x91
    > Sep 11 00:39:34 zaratustra =======================
    > Sep 11 00:39:34 zaratustra Code: c8 41 0f ad f3 d3 ee f6 c1 20 0f 45 de
    > 31 f6 0f ad ef d3 ed f6 c1 20 0f 45 fd 0f 45 ee 31 c9 39 eb 89 fe 89 ea
    > 77 08 89 e8 31 d2 f3 89 c1 89 f0 8b 7c 24 08 f7 f3 8b 74 24 04 89
    > ca 8b 1c 24
    > Sep 11 00:39:34 zaratustra EIP: [] div64_u64+0x57/0x80 SS:ESP
    > 0068:ea7a6edc
    > Sep 11 00:39:34 zaratustra ---[ end trace 4eaa2a86a8e2da22 ]---

    Fix the boundary condition: sysctl_sched_rt_period=0 causes an exception
    in to_ratio().

    Signed-off-by: Hiroshi Shimamoto
    Signed-off-by: Ingo Molnar

    Hiroshi Shimamoto
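
A hedged sketch of the boundary check, assuming a 16-bit fixed-point ratio like the one to_ratio() computes. `toy_rt_ratio` is a hypothetical helper; the point is only that a period of zero has to be rejected before the division.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical helper: validate the period, then compute runtime/period
 * as a 16-bit fixed-point ratio. A negative runtime is treated here as
 * "no limit", which is an assumption of this sketch. */
int toy_rt_ratio(int64_t period_us, int64_t runtime_us, uint64_t *ratio)
{
    if (period_us <= 0)
        return -EINVAL;           /* would otherwise divide by zero */

    if (runtime_us < 0) {
        *ratio = 1ULL << 16;      /* unlimited runtime: ratio of 1.0 */
        return 0;
    }

    *ratio = ((uint64_t)runtime_us << 16) / (uint64_t)period_us;
    return 0;
}
```

For example, a period of 1000000 us with a runtime of 950000 us yields 62259 (about 0.95 in 16-bit fixed point), and a period of 0 is refused with -EINVAL.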
     
  • On my tulsa x86-64 machine, kernel 2.6.25-rc5 randomly failed to boot.

    Basically, function __enable_runtime forgets to reset rt_rq->rt_throttled
    to 0. As each cpu comes up, its per-cpu migration_thread is created and runs
    very quickly, sometimes causing the corresponding rt_rq->rt_throttled to be
    set to 1 early on. After all cpus are up, with the calling chain below:

    sched_init_smp => arch_init_sched_domains => build_sched_domains => ...
    => cpu_attach_domain => rq_attach_root => set_rq_online => ...
    => __enable_runtime

    __enable_runtime is called against every rt_rq again, so rt_rq->rt_time is
    reset to 0, but rt_rq->rt_throttled might still be 1. Later on, function
    do_sched_rt_period_timer cannot reset it, and no RT task can be
    scheduled to run on that cpu. One such RT task is migration_thread, which is
    woken up when a task is migrated to another cpu.

    Below patch fixes it against 2.6.27-rc5.

    Signed-off-by: Zhang Yanmin
    Signed-off-by: Ingo Molnar

    Zhang, Yanmin
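
The stuck flag can be modelled in a few lines. This is a sketch under assumptions: the field names come from the commit message, but `toy_period_timer` is only a loose guess at why the period timer cannot clear the flag (it skips run queues with no accumulated rt_time), and all `toy_` names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical slice of an RT run queue, using the field names above. */
struct toy_rt_rq {
    uint64_t rt_time;      /* RT time consumed in the current period */
    uint64_t rt_runtime;   /* RT time allowed per period */
    int      rt_throttled; /* 1 while RT tasks are held back */
};

/* Model of __enable_runtime: always resets rt_time; resets the
 * throttled flag only when reset_throttled is set (the patched case). */
void toy_enable_runtime(struct toy_rt_rq *rq, int reset_throttled)
{
    rq->rt_time = 0;
    if (reset_throttled)
        rq->rt_throttled = 0;
}

/* Assumed model of the period timer: it only looks at run queues that
 * have accumulated rt_time, so a queue with rt_time == 0 is skipped. */
void toy_period_timer(struct toy_rt_rq *rq)
{
    if (rq->rt_time) {
        uint64_t refill = rq->rt_time < rq->rt_runtime ? rq->rt_time
                                                       : rq->rt_runtime;
        rq->rt_time -= refill;
        if (rq->rt_throttled && rq->rt_time < rq->rt_runtime)
            rq->rt_throttled = 0;
    }
}
```

In this model a queue that was throttled and then had only rt_time reset stays throttled after any number of timer runs, whereas resetting both fields leaves it runnable.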
     

10 Sep, 2008

1 commit

  • The issue of the endless reprogramming loop due to a too small
    min_delta_ns was fixed with the previous updates of the clock events
    code, but we had no information about the spread of this problem. I
    added a WARN_ON to get automated information via kerneloops.org and to
    get some direct reports, which allowed me to analyse the affected
    machines.

    The WARN_ON has served its purpose and would be annoying for a release
    kernel. Remove it and just keep the information about the increase of
    the min_delta_ns value.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

09 Sep, 2008

1 commit


07 Sep, 2008

3 commits

  • …el/git/tip/linux-2.6-tip

    * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    clocksource, acpi_pm.c: check for monotonicity
    clocksource, acpi_pm.c: use proper read function also in errata mode
    ntp: fix calculation of the next jiffie to trigger RTC sync
    x86: HPET: read back compare register before reading counter
    x86: HPET fix moronic 32/64bit thinko
    clockevents: broadcast fixup possible waiters
    HPET: make minimum reprogramming delta useful
    clockevents: prevent endless loop lockup
    clockevents: prevent multiple init/shutdown
    clockevents: enforce reprogram in oneshot setup
    clockevents: prevent endless loop in periodic broadcast handler
    clockevents: prevent clockevent event_handler ending up handler_noop

    Linus Torvalds
     
  • Ingo Molnar
     
  • What I realized recently is that calling rebuild_sched_domains() in
    arch_reinit_sched_domains() by itself is not enough when cpusets are enabled.
    partition_sched_domains() code is trying to avoid unnecessary domain rebuilds
    and will not actually rebuild anything if new domain masks match the old ones.

    What this means is that doing
    echo 1 > /sys/devices/system/cpu/sched_mc_power_savings
    on a system with cpusets enabled will not take effect until something changes
    in the cpuset setup (i.e. new sets created or deleted).

    This patch restores the correct behaviour, where domains must be rebuilt in
    order to enable MC powersaving flags.

    Tested on a quad-core Core2 box with both CONFIG_CPUSETS and !CONFIG_CPUSETS.
    Also tested on dual-core Core2 laptop. Lockdep is happy and things are working
    as expected.

    Signed-off-by: Max Krasnyansky
    Tested-by: Vaidyanathan Srinivasan
    Signed-off-by: Ingo Molnar

    Max Krasnyansky
     

06 Sep, 2008

4 commits

  • We have a bug in the calculation of the next jiffie to trigger the RTC
    synchronisation. The aim here is to run sync_cmos_clock() as close as
    possible to the middle of a second, which means we want this function to
    be called less than or equal to half a jiffie away from when now.tv_nsec
    equals 5e8 (500000000).

    If this is not the case for a given call to the function, then
    instead of updating the RTC we calculate the offset in nanoseconds to the
    next point in time where now.tv_nsec will equal 5e8. The calculated
    offset is then converted to jiffies, as these are the unit used by the
    timer.

    However, timespec_to_jiffies() as used here applies a ceil()-type rounding
    mode, where the resulting value is rounded up. As a result the range of
    now.tv_nsec when the timer will trigger is from 5e8 to 5e8 + TICK_NSEC
    rather than the desired 5e8 - TICK_NSEC / 2 to 5e8 + TICK_NSEC / 2.

    As a result, if for example sync_cmos_clock() happens to be called at a
    time when now.tv_nsec is between 5e8 + TICK_NSEC / 2 and 5e8 +
    TICK_NSEC, it will simply be rescheduled HZ jiffies later, falling in the
    same range of now.tv_nsec again. The same holds for cases offset by an
    integer multiple of TICK_NSEC.

    This change addresses the problem by subtracting TICK_NSEC / 2 from the
    nanosecond offset to the next point in time where now.tv_nsec will
    equal 5e8, effectively shifting the following rounding in
    timespec_to_jiffies() so that it produces a rounded-to-nearest result.

    Signed-off-by: Maciej W. Rozycki
    Signed-off-by: Andrew Morton
    Signed-off-by: Ingo Molnar

    Maciej W. Rozycki
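
The rounding argument can be checked with a small worked example. This is a sketch, assuming HZ = 100 (so one tick is 10 ms) and a plain ceiling division in place of timespec_to_jiffies(); all `toy_`/`TOY_` names are hypothetical.

```c
#include <stdint.h>

#define NSEC_PER_SEC  1000000000LL
#define TOY_HZ        100                        /* assumed tick rate */
#define TOY_TICK_NSEC (NSEC_PER_SEC / TOY_HZ)    /* 10 ms per jiffy */

/* Jiffies to wait until now.tv_nsec is next near 5e8. With
 * round_to_nearest == 0 this is the plain ceil() conversion; with 1,
 * half a tick is subtracted first, turning ceil() into round-to-nearest. */
int64_t toy_sync_delay_jiffies(int64_t now_nsec, int round_to_nearest)
{
    int64_t next = NSEC_PER_SEC / 2 - now_nsec;

    if (round_to_nearest)
        next -= TOY_TICK_NSEC / 2;
    if (next <= 0)
        next += NSEC_PER_SEC;

    /* ceil()-style conversion from nanoseconds to jiffies */
    return (next + TOY_TICK_NSEC - 1) / TOY_TICK_NSEC;
}

/* Value of tv_nsec when a timer armed at now_nsec fires after `jiffies`. */
int64_t toy_landing_nsec(int64_t now_nsec, int64_t jiffies)
{
    return (now_nsec + jiffies * TOY_TICK_NSEC) % NSEC_PER_SEC;
}
```

Starting at now.tv_nsec = 507000000, the ceil-only version waits 100 jiffies and lands on 507000000 again, outside the half-tick window; with the half-tick subtraction it waits 99 jiffies and lands on 497000000, within 5 ms of 5e8.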
     
  • Until the C1E patches arrived there were no users of periodic broadcast
    before switching to oneshot mode. Now we need to trigger a possible
    waiter for a periodic broadcast when switching to oneshot mode.
    Otherwise we can starve them forever.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Spencer reported a problem where utime and stime were going negative despite
    the fixes in commit b27f03d4bdc145a09fb7b0c0e004b29f1ee555fa. The suspected
    reason for the problem is that signal_struct maintains its own utime and
    stime (of exited tasks), and these are not updated using the new task_utime()
    routine, hence sig->utime can go backwards and cause the same problem
    to occur (sig->utime adds tsk->utime and not task_utime()). This patch
    fixes the problem.

    TODO: using max(task->prev_utime, derived utime) works for now, but a more
    generic solution is to implement cputime_max() and use the cputime_gt()
    function for comparison.

    Reported-by: spencer@bluehost.com
    Signed-off-by: Balbir Singh
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Balbir Singh
     
  • If HLT stops the TSC, we'll fail to account idle time, thereby inflating the
    actual process times. Fix this by re-calibrating the clock against GTOD when
    leaving nohz mode.

    Signed-off-by: Peter Zijlstra
    Tested-by: Avi Kivity
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

05 Sep, 2008

5 commits

  • The C1E/HPET bug reports on AMDX2/RS690 systems were tracked down to a
    too small value of the HPET minimum delta for programming an event.

    The clockevents code needs to enforce an interrupt event on the clock event
    device in some cases. The enforcement code was stupid and naive, as it just
    added the minimum delta to the current time and tried to reprogram the device.
    When the minimum delta is too small, then this loops forever.

    Add a sanity check. Allow reprogramming to fail 3 times, then print a warning
    and double the minimum delta value to make sure that this does not happen again.
    Use the same function for both tick-oneshot and tick-broadcast code.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
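
The retry-then-double policy from this commit message can be sketched as follows. This is a user-space model under assumptions: `toy_event_dev`, its `hw_latency_ns` field and the `toy_` functions are hypothetical, and programming is modelled as failing whenever the requested delta does not exceed the hardware latency.

```c
#include <stdint.h>

/* Hypothetical device: programming an event less than hw_latency_ns
 * ahead always fails because the time has passed before it is armed. */
struct toy_event_dev {
    uint64_t min_delta_ns;
    uint64_t hw_latency_ns;
};

/* Returns 0 on success, -1 as a stand-in for -ETIME. */
static int toy_program(const struct toy_event_dev *d, uint64_t delta_ns)
{
    return delta_ns > d->hw_latency_ns ? 0 : -1;
}

/* Force an event: allow three failed attempts per min_delta_ns value,
 * then double it and try again. Returns the number of doublings, or -1
 * if the delta is zero or cannot grow any further. */
int toy_force_program(struct toy_event_dev *d)
{
    int doublings = 0;

    if (d->min_delta_ns == 0)
        return -1;

    for (;;) {
        for (int i = 0; i < 3; i++) {
            if (toy_program(d, d->min_delta_ns) == 0)
                return doublings;
        }
        if (d->min_delta_ns > UINT64_MAX / 2)
            return -1;            /* doubling would overflow */
        d->min_delta_ns *= 2;     /* a real implementation would warn here */
        doublings++;
    }
}
```

With a starting minimum delta of 1000 ns and a latency of 5000 ns, the loop settles on 8000 ns after three doublings instead of spinning forever at 1000 ns.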
     
  • While chasing the C1E/HPET bugreports I went through the clock events
    code inch by inch and found that the broadcast device can be initialized
    and shut down multiple times. Multiple shutdowns are not critical, but
    a useless waste of time. Multiple initializations are simply broken. Another
    CPU might have the device in use already after the first initialization and
    the second init could just render it unusable again.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • In tick_setup_oneshot() we program the device to the given next_event,
    but we do not check the return value. We need to make sure that
    programming the device is enforced so the interrupt handler engine starts
    working. Split out the reprogramming function from tick_program_event()
    and call it with the device which was handed in to tick_setup_oneshot().
    Set the force argument, so the device fires an interrupt.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • The reprogramming of the periodic broadcast handler was broken
    when the first programming returned -ETIME. The clockevents code
    stores the new expiry value in the clock events device next_event field
    only when the programming time has not elapsed yet. The loop in
    question calculates the new expiry value from the next_event value,
    which therefore never increases.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
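
A minimal sketch of why the loop never made progress, assuming a device that stores next_event only when programming succeeds. The `toy_` names are hypothetical, the clock is frozen at `now` for simplicity, and `max_tries` exists only so the broken variant terminates in this model.

```c
#include <stdint.h>

/* Hypothetical broadcast device: only next_event matters here. */
struct toy_bc_dev {
    int64_t next_event;
};

/* Programming fails (-1, standing in for -ETIME) when the expiry has
 * already passed; next_event is stored only on success. */
static int toy_program_event(struct toy_bc_dev *d, int64_t expires, int64_t now)
{
    if (expires <= now)
        return -1;
    d->next_event = expires;
    return 0;
}

/* Broken variant: recomputes the expiry from dev->next_event each time,
 * which never changes after a failure. Returns tries used, or -1. */
int toy_rearm_buggy(struct toy_bc_dev *d, int64_t period, int64_t now, int max_tries)
{
    for (int tries = 1; tries <= max_tries; tries++) {
        int64_t next = d->next_event + period;
        if (toy_program_event(d, next, now) == 0)
            return tries;
    }
    return -1;
}

/* Variant that carries the expiry in a local, so each retry moves forward. */
int toy_rearm_fixed(struct toy_bc_dev *d, int64_t period, int64_t now, int max_tries)
{
    int64_t next = d->next_event;

    for (int tries = 1; tries <= max_tries; tries++) {
        next += period;
        if (toy_program_event(d, next, now) == 0)
            return tries;
    }
    return -1;
}
```

With next_event = 100, a period of 250 and now = 1000, the broken variant keeps retrying 350 forever, while the local-variable variant tries 350, 600, 850 and succeeds at 1100.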
     
  • There is an ordering-related problem with the clockevents code, due to which
    clockevents_register_device() called after the tickless/highres switch
    will not work. The new clockevent ends up with clockevents_handle_noop as
    event handler, resulting in no timer activity.

    The problematic path seems to be

    * old device already has hrtimer_interrupt as the event_handler
    * new clockevent device registers with a higher rating
    * tick_check_new_device() is called
    * clockevents_exchange_device() gets called
    * old->event_handler is set to clockevents_handle_noop
    * tick_setup_device() is called for the new device
    * which sets new->event_handler using the old->event_handler which is noop.

    Change the ordering so that new device inherits the proper handler.

    This is not an issue in the normal case, as most likely all the clockevent
    devices are set up before the highres switch. But it can potentially affect
    corner cases where HPET force detect happens after the highres switch.
    This was a problem with the HPET in MSI mode code that we have been
    experimenting with.

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Shaohua Li
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
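
The ordering problem listed above can be sketched with two function pointers. This is an illustrative model only: the `toy_` types and functions are hypothetical, and the "fixed" variant shows one way to preserve the handler (reading it before the old device is neutralised), not necessarily how the patch reorders the calls.

```c
#include <stddef.h>

typedef void (*toy_handler_t)(void);

/* Hypothetical clock event device with just an event handler. */
struct toy_clockevent {
    toy_handler_t event_handler;
};

static int toy_noop_calls;
static int toy_irq_calls;

/* Stand-ins for clockevents_handle_noop and hrtimer_interrupt. */
void toy_handle_noop(void)       { toy_noop_calls++; }
void toy_hrtimer_interrupt(void) { toy_irq_calls++; }

/* Exchange step: the old device is parked on the noop handler. */
static void toy_exchange(struct toy_clockevent *old)
{
    old->event_handler = toy_handle_noop;
}

/* Problematic order: exchange first, then inherit the handler,
 * so the new device inherits the noop handler. */
void toy_replace_buggy(struct toy_clockevent *old, struct toy_clockevent *new_dev)
{
    toy_exchange(old);
    new_dev->event_handler = old->event_handler;
}

/* Reordered: remember the real handler before the old device is parked. */
void toy_replace_fixed(struct toy_clockevent *old, struct toy_clockevent *new_dev)
{
    toy_handler_t handler = old->event_handler;

    toy_exchange(old);
    new_dev->event_handler = handler;
}
```

In the first variant the new device ends up with the noop handler and never delivers timer events; in the second it inherits the handler the old device was actually using.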