15 Aug, 2006

1 commit

  • The sys_getppid() optimization can access freed memory. On kernels with
    DEBUG_SLAB turned on, this results in an Oops. As Dave Hansen noted, the
    optimization is also unsafe for memory hotplug.

    So this patch always takes the lock, to be safe.

    [oleg@tv-sign.ru: simplifications]
    Signed-off-by: Kirill Korotaev
    Cc: Dave Hansen
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Kirill Korotaev
     

01 Aug, 2006

3 commits

  • kernel/timer.c defines a (per-cpu) pointer to tvec_base_t, but initializes
    it using { &a_tvec_base_t }, which sparse warns about; change this to just
    &a_tvec_base_t, as sketched below.

    Signed-off-by: Josh Triplett
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josh Triplett
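
    A minimal sketch of the change; boot_tvec_bases stands in for the actual
    boot-time base the per-cpu pointer points at:

    /* before: sparse warns about braces around a scalar initializer */
    static DEFINE_PER_CPU(tvec_base_t *, tvec_bases) = { &boot_tvec_bases };

    /* after: initialize the pointer directly */
    static DEFINE_PER_CPU(tvec_base_t *, tvec_bases) = &boot_tvec_bases;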
     
  • We have

    #define INDEX(N) (base->timer_jiffies >> (TVR_BITS + N * TVN_BITS)) & TVN_MASK

    and it's used via

    list = varray[i + 1]->vec + (INDEX(i + 1));

    So, due to underparenthesisation, this INDEX(i + 1) expands to ... (TVR_BITS
    + i + 1 * TVN_BITS)) ...

    So this bugfix changes behaviour (the corrected macro is shown below). It
    worked before by sheer luck:

    "If i was anything but 0, it was broken. But this was only used by
    s390 and arm. Since it was for the next interrupt, could that next
    interrupt be a problem (going into the second cascade)? But it was
    probably seldom wrong. That is, this would fail if the next
    interrupt was in the second cascade, and was wrapped. Which may
    never have happened. Also if it did happen, it would have just missed
    the interrupt.

    If an interrupt was missed, and no one was there to miss it, was it
    really missed :-)"

    Signed-off-by: Steven Rostedt
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
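
    Per the description above, the fix is to parenthesise the macro argument:

    /* buggy: with N = i + 1, the shift count becomes TVR_BITS + i + 1 * TVN_BITS */
    #define INDEX(N) (base->timer_jiffies >> (TVR_BITS + N * TVN_BITS)) & TVN_MASK

    /* fixed */
    #define INDEX(N) ((base->timer_jiffies >> (TVR_BITS + (N) * TVN_BITS)) & TVN_MASK)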
     
  • A few of the callback functions and notifier blocks that are associated with
    cpu notifications incorrectly carry __devinit and __devinitdata. They should
    be __cpuinit and __cpuinitdata instead (see the sketch below).

    It makes no functional difference but wastes text area when CONFIG_HOTPLUG is
    enabled and CONFIG_HOTPLUG_CPU is not.

    This patch fixes all those instances.

    Signed-off-by: Chandra Seetharaman
    Cc: Ashok Raj
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chandra Seetharaman
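
    A sketch of the annotation change, using a hypothetical notifier;
    __cpuinit/__cpuinitdata sections are discarded when CONFIG_HOTPLUG_CPU is
    off, while __devinit/__devinitdata are kept whenever CONFIG_HOTPLUG is on:

    static int __cpuinit foo_cpu_callback(struct notifier_block *nb,
                                          unsigned long action, void *hcpu)
    {
            return NOTIFY_OK;       /* was marked __devinit */
    }

    static struct notifier_block foo_cpu_notifier __cpuinitdata = {
            .notifier_call = foo_cpu_callback,  /* was __devinitdata */
    };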
     

15 Jul, 2006

2 commits

  • Resolve problems seen with APM suspend.

    Due to resume initialization ordering, it's possible we could get a timer
    interrupt before the timekeeping resume() function is called. This patch
    ensures we don't do any timekeeping accounting before we're fully resumed.

    (akpm: fixes the machine-freezes-on-APM-resume bug)

    Signed-off-by: John Stultz
    Cc: Roman Zippel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    john stultz
     
  • Relax the CPU in the del_timer_sync() busywait loop (sketched below).

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
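
    A sketch of the resulting loop, assuming the try_to_del_timer_sync()
    helper from this era of kernel/timer.c:

    int del_timer_sync(struct timer_list *timer)
    {
            for (;;) {
                    int ret = try_to_del_timer_sync(timer);
                    if (ret >= 0)
                            return ret;
                    cpu_relax();    /* the added relax */
            }
    }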
     

11 Jul, 2006

1 commit

  • A large number of lost ticks can cause an overadjustment of the clock. To
    compensate for this we look at the current error: the larger the error
    already is, the more careful we are about adjusting it further. As a small
    extra fix, the error is reset when the clock is set.

    Signed-off-by: Roman Zippel
    Acked-by: john stultz
    Cc: Uwe Bugla
    Cc: James Bottomley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Zippel
     

04 Jul, 2006

3 commits


28 Jun, 2006

2 commits

  • This patch reverts the notifier_block changes made in 2.6.17.

    Signed-off-by: Chandra Seetharaman
    Cc: Ashok Raj
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chandra Seetharaman
     
  • In 2.6.17, there was a problem with cpu_notifiers and XFS. I provided a
    band-aid solution to solve that problem. In the process, I undid all the
    changes you both were making to ensure that these notifiers were available
    only at init time (unless CONFIG_HOTPLUG_CPU is defined).

    We deferred the real fix to 2.6.18. Here is a set of patches that fixes the
    XFS problem cleanly and makes the cpu notifiers available only at init time
    (unless CONFIG_HOTPLUG_CPU is defined).

    If CONFIG_HOTPLUG_CPU is defined then cpu notifiers are available at run
    time.

    This patch reverts the notifier_call changes made in 2.6.17.

    Signed-off-by: Chandra Seetharaman
    Cc: Ashok Raj
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chandra Seetharaman
     

27 Jun, 2006

6 commits

  • This fixes the clock source updates in update_wall_time() to correctly
    track the time coming in via current_tick_length(). Optimize the fast
    paths to be as short as possible to keep the overhead low.

    Signed-off-by: Roman Zippel
    Acked-by: John Stultz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Zippel
     
  • As suggested by Roman Zippel, change clocksource functions to use
    clocksource_xyz rather than xyz_clocksource to avoid polluting the
    namespace.

    Signed-off-by: John Stultz
    Cc: Roman Zippel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    john stultz
     
  • Introduces clocksource switching code and the arch generic time accessor
    functions that use the clocksource infrastructure.

    Signed-off-by: John Stultz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    john stultz
     
  • Instead of incrementing xtime by tick_nsec + ntp adjustments, use the
    clocksource abstraction to increment and scale time. Using the clocksource
    abstraction allows other clocksources to be used consistently in the face of
    late or lost ticks, while preserving the existing behavior via the jiffies
    clocksource.

    This removes the need to keep time_phase adjustments as we just use the
    current_tick_length() function as the NTP interface and accumulate time using
    shifted nanoseconds.

    The basic design is by Roman Zippel; however, this is my own
    interpretation and implementation, so the credit should go to him and the
    blame to me.

    Signed-off-by: John Stultz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    john stultz
     
  • Change the current_tick_length() function so it takes an argument which
    specifies how much precision to return in shifted nanoseconds. This provides
    a simple way to convert between NTP's internal nanoseconds shifted by
    (SHIFT_SCALE - 10) and other shifted-nanosecond units that are used by the
    clocksource abstraction.

    Signed-off-by: John Stultz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    john stultz
     
  • Modify the update_wall_time function so it increments time using the
    clocksource abstraction instead of jiffies. Since the only clocksource driver
    currently provided is the jiffies clocksource, this should result in no
    functional change. Additionally, timekeeping_init and timekeeping_resume
    functions have been added to initialize and maintain some of the new
    timekeeping state.

    [hirofumi@mail.parknet.co.jp: fixlet]
    Signed-off-by: John Stultz
    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    john stultz
     

26 Jun, 2006

1 commit

  • There are several instances of per_cpu(foo, raw_smp_processor_id()), which
    is semantically equivalent to __get_cpu_var(foo) but without the warning
    that smp_processor_id() can give if CONFIG_DEBUG_PREEMPT is enabled. For
    those architectures with optimized per-cpu implementations, namely ia64,
    powerpc, s390, sparc64 and x86_64, per_cpu() turns into more and slower
    code than __get_cpu_var(), so it would be preferable to use __get_cpu_var
    on those platforms.

    This defines a __raw_get_cpu_var(x) macro which turns into per_cpu(x,
    raw_smp_processor_id()) on architectures that use the generic per-cpu
    implementation, and into __get_cpu_var(x) on architectures that have an
    optimized per-cpu implementation (sketched below).

    Signed-off-by: Paul Mackerras
    Acked-by: David S. Miller
    Acked-by: Ingo Molnar
    Acked-by: Martin Schwidefsky
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Mackerras
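
    A simplified sketch of the two expansions; the real definitions live in
    the per-arch percpu.h headers, and the guard used here is illustrative:

    #ifdef ARCH_HAS_OPTIMIZED_PER_CPU       /* illustrative guard */
    /* optimized per-cpu implementation: cheap direct access */
    #define __raw_get_cpu_var(var)  __get_cpu_var(var)
    #else
    /* generic implementation: index by CPU id, without the
     * CONFIG_DEBUG_PREEMPT warning smp_processor_id() could give */
    #define __raw_get_cpu_var(var)  per_cpu(var, raw_smp_processor_id())
    #endif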
     

23 Jun, 2006

2 commits

  • When CONFIG_BASE_SMALL=1, cascade() may enter an infinite loop. Because of
    CONFIG_BASE_SMALL=1 (TVR_BITS=6 and TVN_BITS=4), the list base->tv5 may
    cascade into base->tv5 itself. So, the kernel enters an infinite loop in
    the function cascade().

    I created a test module to verify this bug, and a patch to fix it.

    /* header names were stripped in the original message; these are
       plausible reconstructions */
    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/timer.h>
    #include <linux/jiffies.h>
    #if 0
    #include <linux/kdb.h>
    #else
    #define kdb_printf printk
    #endif

    #define TVN_BITS (CONFIG_BASE_SMALL ? 4 : 6)
    #define TVR_BITS (CONFIG_BASE_SMALL ? 6 : 8)
    #define TVN_SIZE (1 << TVN_BITS)
    #define TVR_SIZE (1 << TVR_BITS)
    #define TVN_MASK (TVN_SIZE - 1)
    #define TVR_MASK (TVR_SIZE - 1)

    #define TV_SIZE(N) (N*TVN_BITS + TVR_BITS)

    struct timer_list timer0;
    struct timer_list dummy_timer1;
    struct timer_list dummy_timer2;

    void dummy_timer_fun(unsigned long data)
    {
    }

    unsigned long j = 0;

    void check_timer_base(unsigned long data)
    {
            kdb_printf("check_timer_base %08lx\n", jiffies);
            mod_timer(&timer0, (jiffies & ~0xFFF) + 0x1FFF);
    }

    int init_module(void)
    {
            init_timer(&timer0);
            timer0.data = (unsigned long)0;
            timer0.function = check_timer_base;
            mod_timer(&timer0, jiffies + 1);

            init_timer(&dummy_timer1);
            dummy_timer1.data = (unsigned long)0;
            dummy_timer1.function = dummy_timer_fun;

            init_timer(&dummy_timer2);
            dummy_timer2.data = (unsigned long)0;
            dummy_timer2.function = dummy_timer_fun;

            j = jiffies;
            j &= (~((1 << ...

    [The remainder of the test module was truncated in the original message.]

    Cc: Matt Mackall
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Porpoise
     
  • list_splice_init(list, head) does unneeded work if it is known that
    list_empty(head) == 1. We can use list_replace_init() instead (see the
    sketch below).

    Signed-off-by: Oleg Nesterov
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
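
    A sketch of the substitution, wrapped in a hypothetical helper:

    /* pending is known to be empty on entry */
    static void grab_entries(struct list_head *source, struct list_head *pending)
    {
            /* was: list_splice_init(source, pending); -- extra work when
             * pending is already empty */
            list_replace_init(source, pending);
            /* pending now owns source's entries; source is reinitialized */
    }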
     

22 May, 2006

1 commit

  • Under certain timing conditions, a race during boot occurs where timer
    ticks are being processed on remote CPUs. The remote timer ticks can
    increment jiffies, and if this happens during a window when a timeout is
    very close to expiring but a local tick has not yet been delivered, you can
    end up with

    1) No softirq pending
    2) A local timer wheel which is not synced to jiffies
    3) No high resolution timer active
    4) A local timer which is supposed to fire before the current jiffies value.

    In this circumstance, the comparison in next_timer_interrupt overflows,
    because the base of the comparison for high resolution timers is jiffies,
    but for the softirq timer wheel, it is relative to the current base of the
    wheel (jiffies_base).

    Signed-off-by: Zachary Amsden
    Cc: Martin Schwidefsky
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     

26 Apr, 2006

2 commits

  • A few of the notifier_chain_register() callers use __init in the definition
    of notifier_call. This is incorrect, as the function must remain available
    after initialization (the callers do not unregister the notifiers during
    initialization).

    This patch fixes all such usages to _not_ have the notifier_call function
    in the __init section.

    Signed-off-by: Chandra Seetharaman
    Signed-off-by: Linus Torvalds

    Chandra Seetharaman
     
  • A few of the notifier_chain_register() callers use __devinitdata in the
    definition of the notifier_block data structure. This is incorrect, as the
    data structure must remain available after initialization (the callers do
    not unregister the notifiers during initialization).

    This was leading to an oops when notifier_chain_register() was invoked
    for those callback chains after initialization.

    This patch fixes all such usages to _not_ have the notifier_block data
    structure in the init data section.

    Signed-off-by: Chandra Seetharaman
    Signed-off-by: Linus Torvalds

    Chandra Seetharaman
     

11 Apr, 2006

1 commit

  • We need the boot CPU's tvec_bases[] entry to be initialised super-early in
    boot, for early_serial_setup(). That runs within setup_arch(), before even
    per-cpu areas are initialised.

    The patch changes tvec_bases to use compile-time initialisation, and adds a
    separate array `tvec_base_done' to keep track of which CPU has had its
    tvec_bases[] entry initialised (because we can no longer use the zeroness of
    that tvec_bases[] entry to determine whether it has been initialised).

    Thanks to Eugene Surovegin for diagnosing this.

    Cc: Eugene Surovegin
    Cc: Jan Beulich
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

10 Apr, 2006

1 commit

  • If the HPET timer is enabled, the clock can drift by ~3 seconds a day.
    This is due to the HPET timer not being initialized with the correct
    setting (still using PIT count).

    If HZ changes, this drift can become even more pronounced.

    The HPET patch initializes tick_nsec with the correct setting for the
    HPET timer.

    Vojtech comments:

    "It's not entirely correct (it assumes the HPET ticks totally
    exactly), but it's significantly better than assuming the PIT error
    there."

    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Jordan Hargrave
     

02 Apr, 2006

1 commit


01 Apr, 2006

3 commits

  • Currently, count_active_tasks() calls both nr_running() &
    nr_uninterruptible(). Each of these functions does a "for_each_cpu" & reads
    values from the runqueue of each cpu. Although this is not a lot of
    instructions, each runqueue may be located on a different node. Depending
    on the architecture, a unique TLB entry may be required to access each
    runqueue.

    Since there may be more runqueues than cpu TLB entries, a scan of all
    runqueues can thrash the TLB. Each memory reference incurs a TLB miss &
    refill.

    In addition, the runqueue cacheline that contains nr_running &
    nr_uninterruptible may be evicted from the cache between the two passes.
    This causes unnecessary cache misses.

    Combining nr_running() & nr_uninterruptible() into a single function
    substantially reduces the TLB & cache misses on large systems (a sketch is
    below). This should have no measurable effect on smaller systems.

    On a 128p IA64 system running a memory stress workload, the new function
    reduced the overhead of calc_load() from 605 usec/call to 324 usec/call.

    Signed-off-by: Jack Steiner
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jack Steiner
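
    A sketch of the combined helper; cpu_rq() is internal to kernel/sched.c,
    so this is schematic rather than the exact patch:

    unsigned long nr_active(void)
    {
            unsigned long i, running = 0, uninterruptible = 0;

            /* one pass over the runqueues instead of two */
            for_each_online_cpu(i) {
                    running += cpu_rq(i)->nr_running;
                    uninterruptible += cpu_rq(i)->nr_uninterruptible;
            }
            return running + uninterruptible;
    }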
     
  • Since base and new_base are of the same type now, we can save one 'if'
    branch and simplify the code a bit.

    Signed-off-by: Oleg Nesterov
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Commit a4a6198b80cf82eb8160603c98da218d1bd5e104:
    [PATCH] tvec_bases too large for per-cpu data

    introduced "struct tvec_t_base_s boot_tvec_bases" which is visible at
    compile time. This means we can kill __init_timer_base and move
    timer_base_s's content into tvec_t_base_s.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

26 Mar, 2006

2 commits

  • This removes the support for pps. It's completely unused within the kernel
    and is basically in the way for further cleanups. It should be easier to
    re-add proper support for it after the rest has been converted to NTP4
    (where the pps mechanisms are quite different from NTP3 anyway).

    Signed-off-by: Roman Zippel
    Cc: Adrian Bunk
    Cc: john stultz
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Zippel
     
  • alarm() calls the kernel with an unsigned int timeout in seconds. The
    value is stored in the tv_sec field of a struct timeval to set up the
    itimer. The tv_sec field of struct timeval is of type long, which causes
    the tv_sec value to be negative on 32 bit machines if seconds > INT_MAX.

    Before the hrtimer merge (pre 2.6.16) such a negative value was converted
    to the maximum jiffies timeout by the timeval_to_jiffies conversion. It's
    not clear whether this was intended or just happened to be done by the
    timeval_to_jiffies code.

    hrtimers expect a timeval in canonical form and treat a negative timeout as
    already expired. This breaks the legitimate usage of alarm() with a
    timeout value > INT_MAX seconds.

    For 32 bit machines it is therefore necessary to limit the internal seconds
    value to avoid API breakage. Instead of doing this in all implementations
    of sys_alarm, the duplicated sys_alarm code is moved into a common function
    in itimer.c (sketched below).

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
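
    A simplified sketch of the consolidated helper, assuming the
    alarm_setitimer() name in kernel/itimer.c:

    unsigned int alarm_setitimer(unsigned int seconds)
    {
            struct itimerval it_new, it_old;
            unsigned int oldalarm;

    #if BITS_PER_LONG < 64
            /* tv_sec is a long: clamp so the value cannot go negative
             * on 32 bit machines */
            if (seconds > INT_MAX)
                    seconds = INT_MAX;
    #endif
            it_new.it_value.tv_sec = seconds;
            it_new.it_value.tv_usec = 0;
            it_new.it_interval.tv_sec = 0;
            it_new.it_interval.tv_usec = 0;

            do_setitimer(ITIMER_REAL, &it_new, &it_old);

            oldalarm = it_old.it_value.tv_sec;
            /* can't return 0 if an alarm is pending; round half a
             * second or more up to the next second */
            if ((!oldalarm && it_old.it_value.tv_usec) ||
                it_old.it_value.tv_usec >= 500000)
                    oldalarm++;
            return oldalarm;
    }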
     

24 Mar, 2006

2 commits

  • Make the softlockup detector purely timer-interrupt driven, removing
    softirq-context (timer) dependencies. This means that if the softlockup
    watchdog triggers, it has truly observed a longer than 10 seconds
    scheduling delay of a SCHED_FIFO prio 99 task.

    (the patch also turns off the softlockup detector during the initial bootup
    phase and does small style fixes)

    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • With internal Xen-enabled kernels we see the kernel's static per-cpu data
    area exceed the limit of 32k on x86-64, and even native x86-64 kernels get
    fairly close to that limit. I generally question whether it is reasonable
    to have data structures several kb in size allocated as per-cpu data when
    the space there is rather limited.

    The biggest arch-independent consumer is tvec_bases (over 4k on 32-bit
    archs, over 8k on 64-bit ones), which now gets converted to use dynamically
    allocated memory instead.

    Signed-off-by: Jan Beulich
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Beulich
     

17 Mar, 2006

1 commit

  • The pointer to the current time interpolator and the current list of time
    interpolators are typically only changed during bootup. Adding
    __read_mostly takes them away from possibly hot cachelines (see below).

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
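
    The annotation pattern described above, shown schematically for the two
    kernel/timer.c declarations:

    static struct time_interpolator *time_interpolator __read_mostly;
    static struct time_interpolator *time_interpolator_list __read_mostly;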
     

07 Mar, 2006

2 commits

  • Add a compiler barrier so that we don't read jiffies before updating
    jiffies_64 (sketched below).

    Signed-off-by: Atsushi Nemoto
    Cc: Ralf Baechle
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Atsushi Nemoto
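
    A sketch of what the description implies for do_timer(); on 32-bit
    machines jiffies aliases the low word of jiffies_64, and update_times()
    reads jiffies:

    void do_timer(struct pt_regs *regs)
    {
            jiffies_64++;
            /* prevent loading jiffies before storing the new jiffies_64 */
            barrier();
            update_times(regs);
    }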
     
  • Also from Thomas Gleixner

    Function next_timer_interrupt() got broken by a recent patch
    6ba1b91213e81aa92b5cf7539f7d2a94ff54947c when sys_nanosleep() was moved to
    hrtimer. This broke things because next_timer_interrupt() did not check
    the hrtimer tree for the next event.

    Function next_timer_interrupt() is needed with dyntick (CONFIG_NO_IDLE_HZ,
    VST) implementations, as the system can be in idle when next hrtimer event
    was supposed to happen. At least ARM and S390 currently use
    next_timer_interrupt().

    Signed-off-by: Thomas Gleixner
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tony Lindgren
     

03 Mar, 2006

1 commit

  • On some platforms readq performs additional work to make sure I/O is done
    in a coherent way. This is not needed for time retrieval as done by the
    time interpolator. So we can use readq_relaxed instead, which will improve
    performance (see the sketch below).

    It affects sparc64 and ia64 only. Apparently it makes a significant
    difference on ia64.

    Signed-off-by: Christoph Lameter
    Cc: john stultz
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
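
    The substitution, wrapped in a hypothetical helper for illustration:

    static u64 interpolator_read_cycles(void __iomem *addr)
    {
            /* time retrieval needs no I/O ordering guarantees, so skip
             * the coherence work that plain readq() performs */
            return readq_relaxed(addr);
    }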
     

18 Feb, 2006

1 commit

  • This provides an interface for arch code to find out how many
    nanoseconds are going to be added on to xtime by the next call to
    do_timer. The value returned is a fixed-point number in 52.12 format
    in nanoseconds. The reason for this format is that it gives the full
    precision that the timekeeping code is using internally (a worked example
    is below).

    The motivation for this is to fix a problem that has arisen on 32-bit
    powerpc in that the value returned by do_gettimeofday drifts apart
    from xtime if NTP is being used. PowerPC is now using a lockless
    do_gettimeofday based on reading the timebase register and performing
    some simple arithmetic. (This method of getting the time is also
    exported to userspace via the VDSO.) However, the factor and offset
    it uses were calculated based on the nominal tick length and weren't
    being adjusted when NTP varied the tick length.

    Note that 64-bit powerpc has had the lockless do_gettimeofday for a
    long time now. It also had an extremely hairy routine that got called
    from the 32-bit compat routine for adjtimex, which adjusted the
    factor and offset according to what it thought the timekeeping code
    was going to do. Not only was this only called if a 32-bit task did
    adjtimex (i.e. not if a 64-bit task did adjtimex), it was also
    duplicating computations from kernel/timer.c and it wasn't clear that
    it was (still) correct.

    The simple solution is to ask the timekeeping code how long the
    current jiffy will be on each timer interrupt, after calling
    do_timer. If this jiffy will be a different length from the last one,
    we then need to compute new values for the factor and offset used in
    the lockless do_gettimeofday. In this way we can keep xtime and
    do_gettimeofday in sync, even when NTP is varying the tick length.

    Note that when adjtimex varies the tick length, it almost always
    introduces the variation from the next tick on. The only case I could
    see where adjtimex would vary the length of the current tick is when
    an old-style adjtime adjustment is being cancelled. (It's not clear
    to me why the adjustment has to be cancelled immediately rather than
    from the next tick on.) Thus I don't see any real need for a hook in
    adjtimex; the rare case of an old-style adjustment being cancelled can
    be fixed up at the next tick.

    Signed-off-by: Paul Mackerras
    Acked-by: john stultz
    Signed-off-by: Linus Torvalds

    Paul Mackerras
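
    A worked example of the 52.12 format, split by a hypothetical helper
    around the current_tick_length() interface this commit introduces:

    static inline void split_tick_length(u64 *whole_ns, u32 *frac)
    {
            u64 tick_len = current_tick_length();   /* ns << 12 */

            *whole_ns = tick_len >> 12;             /* integer nanoseconds */
            *frac = tick_len & ((1 << 12) - 1);     /* units of 2^-12 ns */
    }

    For a nominal 1 ms tick, tick_len is 1000000 << 12, so whole_ns comes back
    as 1000000 and frac as 0; NTP adjustments show up in both parts.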
     

08 Feb, 2006

1 commit