17 Nov, 2007

2 commits

  • Fix a typo in ntp.c that caused updating of the persistent (RTC)
    clock, when synced to NTP, to behave erratically.

    While debugging a freeze that arises on my AMD64 machines when I
    run the ntpd service, I added a number of printks to monitor the
    sync_cmos_clock procedure. I discovered that it was not syncing to
    the CMOS RTC every 11 minutes as documented, but instead would keep
    trying every second for hours at a time. The reason turned out to be
    a typo in sync_cmos_clock, where it attempts to ensure that
    update_persistent_clock is called very close to 500 msec after a
    1-second boundary (required by the PC RTC's spec). That typo
    referred to "xtime" in one spot rather than "now", which is derived
    from "xtime" but not equal to it. This makes the test erratic,
    creating a "coin-flip" that decides when update_persistent_clock is
    called - when it is called, which is rarely, it may be at any time
    during the one-second period rather than close to 500 msec, so the
    value written is needlessly incorrect, too.
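
    A hedged sketch of the check in sync_cmos_clock() with the typo
    corrected (reconstructed from the description above, not the
    verbatim upstream diff):

        struct timespec now;

        now = current_kernel_time();
        /* fire very close to 500 msec after a 1-second boundary;
         * the bug compared xtime.tv_nsec here instead of now.tv_nsec */
        if (abs(now.tv_nsec - (NSEC_PER_SEC / 2)) <= tick_nsec / 2)
                fail = update_persistent_clock(now);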

    Signed-off-by: David P. Reed
    Signed-off-by: Thomas Gleixner

    David P. Reed
     
  • Don't use the vgetcpu tcache - it's causing problems with tasks
    migrating: they'll see the old cache contents for up to a jiffy
    after the migration, further increasing the cost of the migration.

    In the worst case they see completely bogus information from the
    tcache: a sys_getcpu() call "invalidates" the cache info by
    incrementing both the jiffies stamp _and_ the cpuid info in the
    cache, and the following vdso_getcpu() call happens after
    vdso_jiffies has been incremented.
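
    For reference, a simplified sketch of the tcache fast path being
    removed (hedged; read_current_cpu() is a hypothetical stand-in for
    the real RDTSCP/GDT-based slow path):

        unsigned int p;
        unsigned long j = 0;

        /* vdso side: trust the cached value for up to one jiffy */
        if (tcache && tcache->blob[0] == (j = __jiffies)) {
                p = tcache->blob[1]; /* may still be the pre-migration cpu */
        } else {
                p = read_current_cpu(); /* hypothetical slow path */
                if (tcache) {
                        tcache->blob[0] = j;
                        tcache->blob[1] = p;
                }
        }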

    Signed-off-by: Ingo Molnar
    Signed-off-by: Ulrich Drepper
    Signed-off-by: Thomas Gleixner

    Ingo Molnar
     

16 Nov, 2007

8 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched:
    sched: reorder SCHED_FEAT_ bits
    sched: make sched_nr_latency static
    sched: remove activate_idle_task()
    sched: fix __set_task_cpu() SMP race
    sched: fix SCHED_FIFO tasks & FAIR_GROUP_SCHED
    sched: fix accounting of interrupts during guest execution on s390

    Linus Torvalds
     
  • reorder SCHED_FEAT_ bits so that the used ones come first. Makes
    tuning instructions easier.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • sched_nr_latency can now become static.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Ingo Molnar

    Adrian Bunk
     
  • The cpu_down() code is ok wrt sched_idle_next() placing the 'idle'
    task not at the beginning of the queue.

    So get rid of activate_idle_task() and use activate_task() instead.
    activate_idle_task() is the same as activate_task(), apart from an
    update_rq_clock(rq) call that is redundant.

    Code size goes down:

         text    data    bss      dec    hex  filename
        47853    3934    336    52123   cb9b  sched.o.before
        47828    3934    336    52098   cb82  sched.o.after

    Signed-off-by: Dmitry Adamushko
    Signed-off-by: Ingo Molnar

    Dmitry Adamushko
     
  • Grant Wilson has reported rare SCHED_FAIR_USER crashes on his
    quad-core system - crashes that can only be explained by runqueue
    corruption.

    There is a narrow SMP race in __set_task_cpu(): after ->cpu is set
    to a new value, task_rq_lock(p, ...) can be successfully executed on
    another CPU. We must ensure that updates of per-task data have been
    completed by that moment.

    This bug has been hiding in the Linux scheduler for an eternity (we
    never had any explicit barrier for task->cpu in set_task_cpu() - so
    the bug was introduced in 2.5.1), but it only became visible via
    set_task_cfs_rq() being accidentally put after the task->cpu update.
    It also probably needs a sufficiently out-of-order CPU to trigger.
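
    A minimal sketch of the implied fix (an assumption reconstructed
    from the description, not the exact patch):

        static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
        {
                set_task_cfs_rq(p, cpu);
        #ifdef CONFIG_SMP
                /*
                 * task_rq_lock(p, ...) on another CPU can succeed as
                 * soon as it sees the new ->cpu; order all prior
                 * per-task updates before that store.
                 */
                smp_wmb();
                task_thread_info(p)->cpu = cpu;
        #endif
        }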

    Reported-by: Grant Wilson
    Signed-off-by: Dmitry Adamushko
    Signed-off-by: Ingo Molnar

    Dmitry Adamushko
     
  • Suppose that the SCHED_FIFO task does

    switch_uid(new_user);

    Now, p->se.cfs_rq and p->se.parent both point into the old
    user_struct->tg, because sched_move_task() doesn't call
    set_task_cfs_rq() for the !fair_sched_class case.

    Suppose that old user_struct/task_group is freed/reused, and the task
    does

    sched_setscheduler(SCHED_NORMAL);

    __setscheduler() sets fair_sched_class, but doesn't update
    ->se.cfs_rq/parent which point to the freed memory.

    This means that check_preempt_wakeup(), doing

        while (!is_same_group(se, pse)) {
                se = parent_entity(se);
                pse = parent_entity(pse);
        }

    may OOPS in a similar way if rq->curr or p did something like the
    above.

    Perhaps we need something like the patch below, note that
    __setscheduler() can't do set_task_cfs_rq().

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Currently the scheduler checks for PF_VCPU to decide if this
    timeslice has to be accounted as guest time. On s390, host
    interrupts are not disabled during guest execution, which causes
    these interrupts to be accounted as guest time if
    CONFIG_VIRT_CPU_ACCOUNTING is set. The solution is to check whether
    an interrupt triggered account_system_time. As the tick is timer
    interrupt based, we have to subtract hardirq_offset.

    I tested the patch on s390 with CONFIG_VIRT_CPU_ACCOUNTING and on
    x86_64. Seems to work.
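
    A hedged sketch of the adjusted check in account_system_time()
    (reconstructed from the description; simplified):

        /* The tick itself arrives as a timer interrupt, so subtract
         * hardirq_offset: only a timeslice that was really spent in
         * guest context is accounted as guest time. */
        if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
                account_guest_time(p, cputime);
                return;
        }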

    CC: Avi Kivity
    CC: Laurent Vivier
    Signed-off-by: Christian Borntraeger
    Signed-off-by: Ingo Molnar

    Christian Borntraeger
     
  • The original meaning of the old test (p->state > TASK_STOPPED) was
    "not dead", since it predates both TASK_TRACED and the
    state/exit_state split. It was a wrong correction in commit
    14bf01bb0599c89fc7f426d20353b76e12555308 to make this test check for
    TASK_TRACED instead. It should have been changed when TASK_TRACED
    was introduced, and again when exit_state was introduced.

    Signed-off-by: Roland McGrath
    Cc: Oleg Nesterov
    Cc: Alexey Dobriyan
    Cc: Kees Cook
    Acked-by: Scott James Remnant
    Signed-off-by: Linus Torvalds

    Roland McGrath
     

15 Nov, 2007

11 commits

  • * 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6:
    [NET]: rt_check_expire() can take a long time, add a cond_resched()
    [ISDN] sc: Really, really fix warning
    [ISDN] sc: Fix sndpkt to have the correct number of arguments
    [TCP] FRTO: Clear frto_highmark only after process_frto that uses it
    [NET]: Remove notifier block from chain when register_netdevice_notifier fails
    [FS_ENET]: Fix module build.
    [TCP]: Make sure write_queue_from does not begin with NULL ptr
    [TCP]: Fix size calculation in sk_stream_alloc_pskb
    [S2IO]: Fixed memory leak when MSI-X vector allocation fails
    [BONDING]: Fix resource use after free
    [SYSCTL]: Fix warning for token-ring from sysctl checker
    [NET] random : secure_tcp_sequence_number should not assume CONFIG_KTIME_SCALAR
    [IWLWIFI]: Not correctly dealing with hotunplug.
    [TCP] FRTO: Plug potential LOST-bit leak
    [TCP] FRTO: Limit snd_cwnd if TCP was application limited
    [E1000]: Fix schedule while atomic when called from mii-tool.
    [NETX]: Fix build failure added by 2.6.24 statistics cleanup.
    [EP93xx_ETH]: Build fix after 2.6.24 NAPI changes.
    [PKT_SCHED]: Check subqueue status before calling hard_start_xmit

    Linus Torvalds
     
  • We'd better not call nlmsg_free() on a pointer containing an
    undefined value (and without having anything allocated).

    Spotted by the Coverity checker.

    Signed-off-by: Adrian Bunk
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Lockdep reports a circular locking dependency in the hibernate code
    because
    - during system boot, hibernate code (from an initcall) locks
      pm_mutex and then a sysfs buffer mutex via name_to_dev_t, and
    - during regular operation, hibernate code locks pm_mutex under a
      sysfs buffer mutex because it is called from sysfs methods.

    The deadlock can never happen because during initcall invocation
    nothing can write to sysfs yet. This removes the lockdep report by
    marking the initcall locking as being in a different class.
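
    One plausible shape of the annotation (hedged; the exact lock site
    is an assumption):

        /* boot-time (initcall) path: same lock, distinct lockdep class */
        mutex_lock_nested(&pm_mutex, SINGLE_DEPTH_NESTING);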

    Signed-off-by: Johannes Berg
    Cc: "Rafael J. Wysocki"
    Cc: Alan Stern
    Acked-by: Peter Zijlstra
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Berg
     
  • In __do_IRQ(), the normal case is that IRQ_DISABLED is checked and if set
    the handler (handle_IRQ_event()) is not called.

    Earlier in __do_IRQ(), if IRQ_PER_CPU is set the code does not check
    IRQ_DISABLED and calls the handler even though IRQ_DISABLED is set. This
    behavior seems unintentional.

    One user encountering this behavior is the CPE handler (in
    arch/ia64/kernel/mca.c). When the CPE handler encounters too many CPEs
    (such as a solid single bit error), it sets up a polling timer and disables
    the CPE interrupt (to avoid excessive overhead logging the stream of single
    bit errors). disable_irq_nosync() is called, which sets
    IRQ_DISABLED. The IRQ_PER_CPU flag was previously set (in
    ia64_mca_late_init()). The net result is that the CPE handler gets
    called even though it is marked disabled.

    If the behavior of not checking IRQ_DISABLED when IRQ_PER_CPU is set
    is intentional, it would be worth a comment describing the intended
    behavior. disable_irq_nosync() does call chip->disable() to provide
    a chipset-specific interface for disabling the interrupt, which
    avoids this issue when used.
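
    A hedged sketch of the IRQ_PER_CPU path in __do_IRQ() with the
    missing check added (simplified, not the verbatim code):

        if (CHECK_IRQ_PER_CPU(desc->status)) {
                irqreturn_t action_ret = IRQ_NONE;

                /* no locking required for CPU-local interrupts */
                if (desc->chip->ack)
                        desc->chip->ack(irq);
                if (likely(!(desc->status & IRQ_DISABLED))) /* the fix */
                        action_ret = handle_IRQ_event(irq, desc->action);
                desc->chip->end(irq);
                return 1;
        }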

    Signed-off-by: Russ Anderson
    Cc: "Luck, Tony"
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Bjorn Helgaas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Russ Anderson
     
  • This is my trivial patch to swat innumerable little bugs with a single
    blow.

    After some intensive review (my apologies for not having gotten to
    this sooner), what we have looks like a good base to build on with
    the current pid namespace code, but it is not complete, and it is
    still much too easy to find cases where the kernel does the wrong
    thing outside of the initial pid namespace.

    Until the dust settles and we are certain we have the ABI and the
    implementation is as correct as humanly possible, let's keep process
    ID namespaces behind CONFIG_EXPERIMENTAL.

    This allows us the option of fixing any ABI or other bugs we find,
    as long as they are minor, and allows users of the kernel to avoid
    those bugs simply by ensuring their kernel does not have support for
    multiple pid namespaces.

    [akpm@linux-foundation.org: coding-style cleanups]
    Signed-off-by: Eric W. Biederman
    Cc: Cedric Le Goater
    Cc: Adrian Bunk
    Cc: Jeremy Fitzhardinge
    Cc: Kir Kolyshkin
    Cc: Kirill Korotaev
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Commit faf8c714f4508207a9c81cc94dafc76ed6680b44 caused a regression:
    parameter names longer than MAX_KBUILD_MODNAME will now be rejected,
    although only the module name part needs to be that short. This
    patch restores the old behaviour while still avoiding memchr being
    called with a length parameter larger than the total string length.
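
    A hedged sketch of the adjusted bound (the variable names are
    assumptions):

        /* Only the module-name prefix must fit in MAX_KBUILD_MODNAME;
         * cap the memchr() length at the actual string length so long
         * parameter names are no longer rejected. */
        size_t len = strlen(param);
        dot = memchr(param, '.', min(len, (size_t)MAX_KBUILD_MODNAME));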

    Signed-off-by: Jan Kiszka
    Cc: Dave Young
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kiszka
     
  • Upon module load, we must take the markers mutex. This implies that
    the marker mutex must be nested inside the module mutex, which means
    changing the nesting order: the marker mutex now nests inside the
    module mutex. Make the necessary changes to reverse the order in
    which the mutexes are taken.

    Includes some cleanup from Dave Hansen.

    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathieu Desnoyers
     
  • Revert 62d0df64065e7c135d0002f069444fbdfc64768f.

    This was originally intended as a simple initial example of how to create a
    control groups subsystem; it wasn't intended for mainline, but I didn't make
    this clear enough to Andrew.

    The CFS cgroup subsystem now has better functionality for the per-cgroup usage
    accounting (based directly on CFS stats) than the "usage" status file in this
    patch, and the "load" status file is rather simplistic - although having a
    per-cgroup load average report would be a useful feature, I don't believe this
    patch actually provides it. If it gets into the final 2.6.24 we'd probably
    have to support this interface forever.

    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • i386 and x86-64 register System RAM as IORESOURCE_MEM |
    IORESOURCE_BUSY, but ia64 registers it as IORESOURCE_MEM only.
    In addition, the memory hotplug code registers new memory as
    IORESOURCE_MEM too.

    This difference causes memory unplug on x86-64 to fail. This patch
    fixes it.

    The patch adds IORESOURCE_BUSY to avoid a potential overlap mapping
    by a PCI device.
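
    A hedged sketch of how hot-added memory would be registered after
    the fix (simplified):

        struct resource *res = kzalloc(sizeof(struct resource), GFP_KERNEL);

        res->name = "System RAM";
        res->start = start;
        res->end = start + size - 1;
        /* the fix: also mark the range BUSY so a PCI device cannot be
         * mapped on top of it */
        res->flags = IORESOURCE_MEM | IORESOURCE_BUSY;
        request_resource(&iomem_resource, res);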

    Signed-off-by: Yasunori Goto
    Signed-off-by: Badari Pulavarty
    Cc: "Luck, Tony"
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     
  • When I boot with the 'quiet' parameter, I see on the screen:

    [ 0.000000] Initializing cgroup subsys cpuset
    [ 0.000000] Initializing cgroup subsys cpu
    [ 39.036026] Initializing cgroup subsys cpuacct
    [ 39.036080] Initializing cgroup subsys debug
    [ 39.036118] Initializing cgroup subsys ns

    This patch lowers the priority of those messages, adds a "cgroup: " prefix
    to another couple of printks and kills the useless reference to the source
    file.

    Signed-off-by: Diego Calleja
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Diego Calleja
     
  • The original patch assumed args->nlen < CTL_MAXNAME, but that can be
    false.
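
    The fix presumably boils down to an explicit bounds check (hedged
    sketch):

        if (args->nlen < 0 || args->nlen > CTL_MAXNAME)
                return -ENOTDIR;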

    Signed-off-by: Tetsuo Handa
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

14 Nov, 2007

1 commit


13 Nov, 2007

1 commit

  • While a signal is blocked, it must be posted even if its action is
    SIG_IGN or is SIG_DFL with the default action to ignore. This works
    right most of the time, but is broken when a sigwait (rt_sigtimedwait)
    is in progress. This changes the early-discard check to respect
    real_blocked. ~blocked is the set to check for "should wake up now",
    but ~(blocked|real_blocked) is the set for "blocked" semantics as
    defined by POSIX.

    This fixes bugzilla entry 9347, see

    http://bugzilla.kernel.org/show_bug.cgi?id=9347
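
    A hedged sketch of the adjusted early-discard test (reconstructed
    from the description; simplified):

        /* In sig_ignored(): blocked signals are never ignored, and
         * "blocked" must include real_blocked while a sigwait() is in
         * progress. */
        if (sigismember(&t->blocked, sig) ||
            sigismember(&t->real_blocked, sig))
                return 0;       /* not ignored - post it */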

    Signed-off-by: Roland McGrath
    Signed-off-by: Linus Torvalds

    Roland McGrath
     

10 Nov, 2007

16 commits

  • compat_exit_robust_list() computes a pointer to the
    futex entry in userspace as follows:

    (void __user *)entry + futex_offset

    'entry' is a 'struct robust_list __user *', and
    'futex_offset' is a 'compat_long_t' (typically a 's32').

    Things explode if the 32-bit sign bit is set in futex_offset.

    Type promotion sign extends futex_offset to a 64-bit value before
    adding it to 'entry'.

    This triggered a problem on sparc64 running 32-bit applications which
    would lock up a cpu looping forever in the fault handling for the
    userspace load in handle_futex_death().

    Compat userspace runs with address masking (wherein the cpu zeros
    out the top 32 bits of every effective address given to a memory
    operation instruction), so the sparc64 fault handler accounts for
    this by zeroing out the top 32 bits of the fault address too.

    Since the kernel properly uses the compat_uptr interfaces, kernel side
    accesses to compat userspace work too since they will only use
    addresses with the top 32-bit clear.

    Because of this compat futex layer bug we get into the following loop
    when executing the get_user() load near the top of handle_futex_death():

    1) load from address '0xfffffffff7f16bd8', FAULT
    2) fault handler clears upper 32 bits, processes fault
       for address '0xf7f16bd8', which succeeds
    3) goto #1
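
    The arithmetic is easy to reproduce in a standalone program (the
    values below are chosen to land on the fault address from the
    report; purely illustrative):

        #include <stdio.h>
        #include <stdint.h>

        int main(void)
        {
                uint64_t entry = 0x1000;                    /* 32-bit user pointer */
                int32_t futex_offset = (int32_t)0xf7f15bd8; /* sign bit set */

                /* buggy: futex_offset is sign-extended to 64 bits
                 * before the add */
                uint64_t bad = entry + (int64_t)futex_offset;

                /* fixed: do the math in 32 bits, as compat userspace
                 * expects */
                uint32_t good = (uint32_t)entry + (uint32_t)futex_offset;

                printf("buggy:  0x%llx\n", (unsigned long long)bad); /* 0xfffffffff7f16bd8 */
                printf("compat: 0x%x\n", good);                      /* 0xf7f16bd8 */
                return 0;
        }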

    I want to thank Bernd Zeimetz, Josip Rodin, and Fabio Massimo Di Nitto
    for their tireless efforts helping me track down this bug.

    Signed-off-by: David S. Miller
    Signed-off-by: Linus Torvalds

    David Miller
     
  • This patch adds a proper prototype for migration_init() in
    include/linux/sched.h.

    Since there's no point in always returning 0 to a caller that
    doesn't check the return value, it also changes the function to
    return void.
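
    The result is presumably along these lines (hedged sketch):

        /* include/linux/sched.h */
        #ifdef CONFIG_SMP
        extern void migration_init(void);
        #else
        static inline void migration_init(void) { }
        #endif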

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Ingo Molnar

    Adrian Bunk
     
  • SMP balancing is done with IRQs disabled and can iterate the full
    rq. When rqs are large this can cause large irq-latencies. Limit
    the number of iterations on each run.

    This fixes a scheduling latency regression reported by the -rt folks.
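
    A hedged sketch of the cap (identifier names are assumptions):

        /* max number of tasks to examine per balance pass */
        const_debug unsigned int sysctl_sched_nr_migrate = 32;

        /* inside the balance loop, with IRQs still disabled: */
        if (loops++ > sysctl_sched_nr_migrate)
                break;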

    Signed-off-by: Peter Zijlstra
    Acked-by: Steven Rostedt
    Tested-by: Gregory Haskins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Sukadev Bhattiprolu reported a kernel crash with control groups.
    There are a couple of problems discovered by Suka's test:

    - The test requires the cgroup filesystem to be mounted with at
      least the cpu and ns options (i.e. both the namespace and cpu
      controllers active in the same hierarchy):

          # mkdir /dev/cpuctl
          # mount -t cgroup -ocpu,ns none cpuctl

      or simply (activates all controllers in the same hierarchy):

          # mount -t cgroup none cpuctl

    - The test invokes clone() with CLONE_NEWNS set. This causes a new
      child to be created, along with a new group (do_fork->
      copy_namespaces->ns_cgroup_clone->cgroup_clone), and the child is
      attached to the new group (cgroup_clone->attach_task->
      sched_move_task). At this point the child's scheduler-related
      fields are uninitialized (including its on_rq field, which it has
      inherited from the parent). As a result sched_move_task thinks
      it's on the runqueue when it isn't.

    As a solution to this problem, I moved the sched_fork() call, which
    initializes the scheduler-related fields of a new task, before
    copy_namespaces(). I am not sure though whether moving it up will
    cause other side effects. Do you see any issue?

    - The second problem exposed by this test is that task_new_fair()
      assumes that parent and child will be part of the same group
      (which needn't be true, as this test shows). As a result,
      cfs_rq->curr can be NULL for the child.

    The solution is to test for the curr pointer being NULL in
    task_new_fair().

    With the patch below, I could run ns_exec() fine w/o a crash.
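
    The patch itself is not reproduced in this digest; a hedged sketch
    of the task_new_fair() part:

        struct sched_entity *se = &p->se, *curr = cfs_rq->curr;

        /* parent and child need not be in the same group, so curr can
         * legitimately be NULL for the child */
        if (curr)
                se->vruntime = curr->vruntime;
        place_entity(cfs_rq, se, 1);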

    Reported-by: Sukadev Bhattiprolu
    Signed-off-by: Srivatsa Vaddagiri
    Signed-off-by: Ingo Molnar

    Srivatsa Vaddagiri
     
  • Clean up the preemption check to not use unnecessary 64-bit
    variables. This improves code size:

         text    data    bss      dec    hex  filename
        44227    3326     36    47589   b9e5  sched.o.before
        44201    3326     36    47563   b9cb  sched.o.after

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Clean up the wakeup preemption check. No code changed:

         text    data    bss      dec    hex  filename
        44227    3326     36    47589   b9e5  sched.o.before
        44227    3326     36    47589   b9e5  sched.o.after

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Wakeup preemption fix: do not make preemption dependent on p->prio.
    Preemption purely depends on ->vruntime.

    This improves preemption in mixed-nice-level workloads.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Remove PREEMPT_RESTRICT. (This is a separate commit so that any
    regression related to the removal itself is bisectable.)

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • PREEMPT_RESTRICT was a method aimed at reducing the amount of
    wakeup-related preemption. It has a disadvantage though: it can
    prevent legitimate wakeups if a task is 'unlucky' to be hit too
    early by a tick that clears peer_preempt.

    Now that the wakeup preemption has been cleaned up we don't seem to
    have excessive preemptions anymore, so this feature can be turned
    off (and removed in the next patch).

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • 1) The hardcoded 1000000000 value is used five times in places where
       NSEC_PER_SEC might be more readable.

    2) A conversion from nsec to msec uses the hardcoded 1000000 value,
       which is a candidate for NSEC_PER_MSEC.

    No code changed:

         text    data    bss      dec    hex  filename
        44359    3326     36    47721   ba69  sched.o.before
        44359    3326     36    47721   ba69  sched.o.after
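
    Representative before/after (a hedged illustration, not the actual
    hunks):

        -       if (delta > 1000000000)
        +       if (delta > NSEC_PER_SEC)

        -       do_div(tmp, 1000000);           /* nsec -> msec */
        +       do_div(tmp, NSEC_PER_MSEC);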

    Signed-off-by: Eric Dumazet
    Signed-off-by: Ingo Molnar

    Eric Dumazet
     
  • Yanmin Zhang reported an aim7 regression and bisected it down to:

    | commit 38ad464d410dadceda1563f36bdb0be7fe4c8938
    | Author: Ingo Molnar
    | Date: Mon Oct 15 17:00:02 2007 +0200
    |
    | sched: uniform tunings
    |
    | use the same defaults on both UP and SMP.

    Fix this by reintroducing similar SMP tunings. This resolves the
    regression.

    (also update the comments to match the ilog2(nr_cpus) tuning effect)

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Since powerpc started using CONFIG_GENERIC_CLOCKEVENTS, the
    deterministic CPU accounting (CONFIG_VIRT_CPU_ACCOUNTING) has been
    broken on powerpc, because we end up counting user time twice: once in
    timer_interrupt() and once in update_process_times().

    This fixes the problem by pulling the code in update_process_times
    that updates utime and stime into a separate function called
    account_process_tick. If CONFIG_VIRT_CPU_ACCOUNTING is not defined,
    there is a version of account_process_tick in kernel/timer.c that
    simply accounts a whole tick to either utime or stime as before. If
    CONFIG_VIRT_CPU_ACCOUNTING is defined, then arch code gets to
    implement account_process_tick.

    This also lets us simplify the s390 code a bit; it means that the s390
    timer interrupt can now call update_process_times even when
    CONFIG_VIRT_CPU_ACCOUNTING is turned on, and can just implement a
    suitable account_process_tick().

    account_process_tick() now takes the task_struct * as an argument.
    Tested both with and without CONFIG_VIRT_CPU_ACCOUNTING.
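
    A hedged sketch of the default (!CONFIG_VIRT_CPU_ACCOUNTING) version
    described above:

        /* kernel/timer.c: whole-tick accounting, as before */
        void account_process_tick(struct task_struct *p, int user_tick)
        {
                if (user_tick)
                        account_user_time(p, jiffies_to_cputime(1));
                else
                        account_system_time(p, HARDIRQ_OFFSET,
                                            jiffies_to_cputime(1));
        }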

    Signed-off-by: Paul Mackerras
    Signed-off-by: Ingo Molnar

    Paul Mackerras
     
  • Fix the delay accounting regression introduced by commit
    75d4ef16a6aa84f708188bada182315f80aab6fa: rq no longer has
    sched_info data associated with it; the task_struct sched_info
    structure is used by delay accounting to provide statistics back to
    user space.

    Also remove the direct use of sched_clock() (which is no longer a
    valid thing to do) and use rq->clock instead.

    Signed-off-by: Balbir Singh
    Signed-off-by: Ingo Molnar

    Balbir Singh
     
  • We lost the sched_min_granularity tunable to a clever optimization
    that uses the sched_latency/min_granularity ratio - but the ratio is
    quite unintuitive to users and can also crash the kernel if it is
    set to 0. So reintroduce the min_granularity tunable, while keeping
    the ratio maintained internally.

    No functionality changed.

    [ mingo@elte.hu: some fixlets. ]
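
    A hedged sketch of keeping the internal ratio in sync (the helper
    name is an assumption):

        /* called from the sysctl handlers of both tunables */
        static void sched_update_nr_latency(void)
        {
                sched_nr_latency = DIV_ROUND_UP(sysctl_sched_latency,
                                                sysctl_sched_min_granularity);
        }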

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Add a few comments to place_entity(). No code changed.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • vslice was missing a factor of NICE_0_LOAD, as weight is in
    weight*NICE_0_LOAD units.

    The effect of this bug was larger initial slices and thus
    latency-noisier forks.
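
    A hedged sketch of the corrected scaling (reconstructed from the
    description):

        static u64 __sched_vslice(unsigned long rq_weight, unsigned long nr_running)
        {
                u64 vslice = __sched_period(nr_running);

                vslice *= NICE_0_LOAD;          /* the missing factor */
                do_div(vslice, rq_weight);

                return vslice;
        }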

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

06 Nov, 2007

1 commit