25 Mar, 2008

6 commits

  • The printk() logic for when/how to take the console semaphore was
    unreadable; this splits the code up into a few helper functions and
    makes it easier to follow what is going on.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • In case we're accounting from a sub-namespace, the tgids reported will not
    refer to the right namespace.

    Save the pid_namespace we're accounting in on the acct_glbs and use it in
    do_acct_process.

    Two fewer :) places using the task_struct.tgid member.

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • This is minor, but dereferencing even current's real_parent is not safe on
    debug kernels, since the memory it points to can be unmapped; RCU
    protection is required.

    Besides, the tgid field is deprecated and is to be replaced with a
    task_tgid_xxx call (the 2nd patch), so RCU will be required anyway.

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • As Paul pointed out, the ACCESS_ONCE() calls are not needed because we
    already have the explicit surrounding memory barriers.

    Signed-off-by: Mathieu Desnoyers
    Cc: Christoph Hellwig
    Cc: Mike Mason
    Cc: Dipankar Sarma
    Cc: David Smith
    Cc: "Paul E. McKenney"
    Cc: Steven Rostedt
    Cc: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathieu Desnoyers
     
  • Add comments requested by Andrew.

    Updated comments about synchronize_sched(). Since we use call_rcu and
    rcu_barrier now, these comments were out of sync with the code.

    Signed-off-by: Mathieu Desnoyers
    Cc: Christoph Hellwig
    Cc: Mike Mason
    Cc: Dipankar Sarma
    Cc: David Smith
    Cc: "Paul E. McKenney"
    Cc: Steven Rostedt
    Cc: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathieu Desnoyers
     
  • The printk() can deadlock because it can wake up klogd(), and
    task enqueueing will try to read the time in order to set a hrtimer.

    Reported-by: Marcin Slusarz
    Debugged-by: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

21 Mar, 2008

6 commits

  • Will be called each time the scheduling domains are rebuilt.
    Needed for architectures that don't have a static CPU topology.

    Signed-off-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Ingo Molnar

    Heiko Carstens
     
  • Needed so it can be called from outside of sched.c.

    Signed-off-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Ingo Molnar

    Heiko Carstens
     
  • Combine two unlikely() tests

    Signed-off-by: Roel Kluin
    Signed-off-by: Ingo Molnar

    Roel Kluin
     
  • TREE_AVG and APPROX_AVG are initial task placement policies that have been
    disabled for a long while; time to remove them.

    Signed-off-by: Peter Zijlstra
    CC: Srivatsa Vaddagiri
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (46 commits)
    [NET] ifb: set separate lockdep classes for queue locks
    [IPV6] KCONFIG: Fix description about IPV6_TUNNEL.
    [TCP]: Fix shrinking windows with window scaling
    netpoll: zap_completion_queue: adjust skb->users counter
    bridge: use time_before() in br_fdb_cleanup()
    [TG3]: Fix build warning on sparc32.
    MAINTAINERS: bluez-devel is subscribers-only
    audit: netlink socket can be auto-bound to pid other than current->pid (v2)
    [NET]: Fix permissions of /proc/net
    [SCTP]: Fix a race between module load and protosw access
    [NETFILTER]: ipt_recent: sanity check hit count
    [NETFILTER]: nf_conntrack_h323: logical-bitwise & confusion in process_setup()
    [RT2X00] drivers/net/wireless/rt2x00/rt2x00dev.c: remove dead code, fix warning
    [IPV4]: esp_output() misannotations
    [8021Q]: vlan_dev misannotations
    xfrm: ->eth_proto is __be16
    [IPV4]: ipv4_is_lbcast() misannotations
    [SUNRPC]: net/* NULL noise
    [SCTP]: fix misannotated __sctp_rcv_asconf_lookup()
    [PKT_SCHED]: annotate cls_u32
    ...

    Linus Torvalds
     
  • From: Pavel Emelyanov

    This patch is based on the one from Thomas.

    kauditd_thread() calls netlink_unicast(), passing it
    the audit_pid. The audit_pid, in turn, is received from
    user space, and the tool (I've checked audit v1.6.9)
    uses getpid() to pass one into the kernel. Moreover, this tool
    doesn't bind the netlink socket to this id, but simply creates
    it, allowing the kernel to auto-bind one.

    That's the preamble.

    The problem is that netlink_autobind() does _not_ guarantee
    that the socket will be auto-bound to the current pid. Instead
    it uses the current pid as a hint to start looking for a free
    id. So, in case of a conflict, the audit messages can be sent
    to the wrong socket. This can happen (it's unlikely, but possible)
    when some task opens more than one netlink socket and then
    the audit one starts: in this case the audit pid can be busy
    and its socket will be bound to another id.

    The proposal is to introduce an audit_nlk_pid in the audit subsystem,
    which will point to the netlink socket to send packets to. It
    will most often be equal to audit_pid. The socket id can be
    obtained from the skb's netlink CB right in audit_receive_msg().
    Resetting audit_nlk_pid to 0 is not required, since all the
    decisions are taken based on the audit_pid value only.

    Later, if the audit tools bind the socket themselves, the
    kernel will have to provide a way to set up audit_nlk_pid
    as well.

    A good side effect of this patch is that audit_pid can later
    be converted to struct pid, as it is no longer safe to use
    pid_t in the presence of pid namespaces. But the audit code still
    uses the tgid from task_struct in audit_signal_info and in
    audit_filter_syscall.

    Signed-off-by: Thomas Graf
    Signed-off-by: Pavel Emelyanov
    Acked-by: Eric Paris
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     

20 Mar, 2008

1 commit

  • Revert commit 1ada5cba6a0318f90e45b38557e7b5206a9cba38 ("clocksource:
    make clocksource watchdog cycle through online CPUs") due to the
    regression reported by Gabriel C at

    http://lkml.org/lkml/2008/2/24/281

    (short version: it makes the TSC be marked as always unstable on his
    machine).

    Cc: Andi Kleen
    Acked-by: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Robert Hancock
    Acked-by: Linus Torvalds
    Cc: "Rafael J. Wysocki"
    Cc: Gabriel C
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

19 Mar, 2008

6 commits

  • reduce wake-up granularity for better interactivity.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Wakeup-buddy tasks are cache-hot; this makes it a bit harder
    for the load-balancer to tear them apart (but it's still possible
    if the load is sufficiently asymmetric).

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • improve affine wakeups. Maintain the 'overlap' metric based on CFS's
    sum_exec_runtime - which means the amount of time a task executes
    after it wakes up some other task.

    Use the 'overlap' for the wakeup decisions: if the 'overlap' is short,
    it means there's strong workload coupling between this task and the
    woken up task. If the 'overlap' is large then the workload is decoupled
    and the scheduler will move them to separate CPUs more easily.

    ( Also slightly move the preempt_check within try_to_wake_up() - this has
    no effect on functionality but allows 'early wakeups' (for still-on-rq
    tasks) to be correctly accounted as well.)

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Clean up the code flow. No code changed:

    kernel/sched.o:

    text data bss dec hex filename
    42521 2858 232 45611 b22b sched.o.before
    42521 2858 232 45611 b22b sched.o.after

    md5:
    09b31c44e9aff8666f72773dc433e2df sched.o.before.asm
    09b31c44e9aff8666f72773dc433e2df sched.o.after.asm

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • rename 'cpu' to 'prev_cpu'. No code changed:

    kernel/sched.o:

    text data bss dec hex filename
    42521 2858 232 45611 b22b sched.o.before
    42521 2858 232 45611 b22b sched.o.after

    md5:
    09b31c44e9aff8666f72773dc433e2df sched.o.before.asm
    09b31c44e9aff8666f72773dc433e2df sched.o.after.asm

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • split out the affine-wakeup bits.

    No code changed:

    kernel/sched.o:

    text data bss dec hex filename
    42521 2858 232 45611 b22b sched.o.before
    42521 2858 232 45611 b22b sched.o.after

    md5:
    9d76738f1272aa82f0b7affd2f51df6b sched.o.before.asm
    09b31c44e9aff8666f72773dc433e2df sched.o.after.asm

    (the md5's changed because stack slots changed and some registers
    get scheduled by gcc in a different order - but otherwise the before
    and after assembly is instruction for instruction equivalent.)

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

17 Mar, 2008

1 commit


15 Mar, 2008

7 commits

  • Use the existing calc_delta_mine() calculation for sched_slice(). This
    saves a divide and simplifies the code because we share it with the
    other /cfs_rq->load users.

    It also improves code size:

    text data bss dec hex filename
    42659 2740 144 45543 b1e7 sched.o.before
    42093 2740 144 44977 afb1 sched.o.after

    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra

    Ingo Molnar
     
  • Fair sleepers need to scale their latency target down by runqueue
    weight. Otherwise busy systems will gain an ever larger sleep bonus.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra

    Ingo Molnar
     
  • Currently we schedule the leftmost task in the runqueue. When the
    runtimes are very short because of some server/client ping-pong,
    especially in over-saturated workloads, this will cycle through all
    tasks, thrashing the cache.

    Reduce cache thrashing by keeping dependent tasks together, running
    newly woken tasks first. However, by not running the leftmost task first
    we could starve tasks, because the wakee can gain unlimited runtime.

    Therefore we only run the wakee if it's within a small
    (wakeup_granularity) window of the leftmost task. This preserves
    fairness, but does alternate server/client task groups.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • lw->weight can be 0 for a short time during bootup.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra

    Ingo Molnar
     
  • Clear the cached inverse value when updating load. This is needed for
    calc_delta_mine() to work correctly when using the rq load.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra

    Ingo Molnar
     
  • Current min_vruntime tracking is incorrect and will cause serious
    problems when we don't run the leftmost task for some reason.

    min_vruntime does two things: 1) it's used to determine the forward
    direction when the u64 vruntime wraps, 2) it's used to track the
    leftmost vruntime, from which newly enqueued tasks are positioned.

    The current logic advances min_vruntime whenever the current task's
    vruntime advances. Because the current task may pass the leftmost task
    still waiting, we fail the second goal. This causes new tasks to be
    placed too far ahead, which penalizes their runtime.

    Fix this by making min_vruntime the min vruntime of the waiting tasks,
    tracking it in enqueue/dequeue, and comparing against the current task's
    vruntime to obtain the absolute minimum when placing new tasks.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Fix a hard to trigger crash seen in the -rt kernel that also affects
    the vanilla scheduler.

    There is a race condition between schedule() and some dequeue/enqueue
    functions: rt_mutex_setprio(), __setscheduler() and sched_move_task().

    When scheduling to idle, idle_balance() is called to pull tasks from
    other busy processors. It might drop the rq lock, which means those 3
    functions can encounter on_rq=0 and running=1. The current task should
    be put when running.

    Here is a possible scenario:

    CPU0                          CPU1
                                  |  schedule()
                                  |  ->deactivate_task()
                                  |  ->idle_balance()
                                  |  -->load_balance_newidle()
    rt_mutex_setprio()            |
                                  |  --->double_lock_balance()
    *get lock                     *rel lock
    * on_rq=0, running=1          |
    * sched_class is changed      |
    *rel lock                     *get lock
       :                          |
                                  :
                                  ->put_prev_task_rt()
                                  ->pick_next_task_fair()
                                  => panic

    The current process on CPU1 (P1) is scheduling. P1 is deactivated, and
    the scheduler looks for another process on another CPU's runqueue,
    because CPU1 will be idle. idle_balance(), load_balance_newidle() and
    double_lock_balance() are called, and double_lock_balance() could drop
    the rq lock. On the other hand, CPU0 is trying to boost the priority of
    P1. As a result of the boosting, only P1's prio and sched_class are
    changed to RT. The sched entities of P1 and P1's group are never put.
    This makes the cfs_rq invalid, because the cfs_rq has a curr but no
    leaf; when pick_next_task_fair() is then called, the kernel panics.

    Signed-off-by: Hiroshi Shimamoto
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Hiroshi Shimamoto
     

13 Mar, 2008

2 commits


12 Mar, 2008

1 commit

  • There is a problem in the hibernation code that triggers on some NUMA
    systems on which pfn_valid() returns 'true' for some PFNs that don't
    belong to any zone. Namely, there is a BUG_ON() in
    memory_bm_find_bit() that triggers for PFNs not belonging to any
    zone yet passing the pfn_valid() test. On the affected systems it
    triggers when we mark PFNs reported by the platform as not saveable,
    because the PFNs in question belong to a region mapped directly using
    ioremap() (i.e. the ACPI data area) and they pass the pfn_valid()
    test.

    Modify memory_bm_find_bit() so that it returns an error if the given
    PFN doesn't belong to any zone instead of crashing the kernel, and
    ignore the result it returns in mark_nosave_pages() while marking the
    "nosave" memory regions.

    This doesn't affect the hibernation functionality, as we won't touch
    the PFNs in question anyway.

    http://bugzilla.kernel.org/show_bug.cgi?id=9966

    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Len Brown

    Rafael J. Wysocki
     

11 Mar, 2008

5 commits

  • It is possible for the root-domain cache of online cpus to become
    out of sync with the global cpu_online_map, because we currently
    trigger removal of cpus too early in the notifier chain.
    Other DOWN_PREPARE handlers may in fact run and reconfigure the
    root-domain topology, thereby stomping on our own offline handling.

    The end result is that rd->online may become out of sync with
    cpu_online_map, which results in potential task misrouting.

    So change the offline handling to be more tightly coupled with the
    global offline process by triggering on CPU_DYING instead of
    CPU_DOWN_PREPARE.

    Signed-off-by: Gregory Haskins
    Cc: Gautham R Shenoy
    Cc: "Siddha, Suresh B"
    Cc: "Rafael J. Wysocki"
    Cc: Andrew Morton
    Signed-off-by: Ingo Molnar

    Gregory Haskins
     
  • This reverts commit 393d94d98b19089ec172566e23557997931b137e.

    Let's fix this right.

    Signed-off-by: Gregory Haskins
    Cc: Gautham R Shenoy
    Cc: "Siddha, Suresh B"
    Cc: "Rafael J. Wysocki"
    Cc: Andrew Morton
    Signed-off-by: Ingo Molnar

    Gregory Haskins
     
  • The original preemptible-RCU patch put the choice between classic and
    preemptible RCU into kernel/Kconfig.preempt, which resulted in build failures
    on machines not supporting CONFIG_PREEMPT. This choice was therefore moved to
    init/Kconfig, which worked, but placed the choice between classic and
    preemptible RCU at the top level, a very obtuse choice indeed.

    This patch changes from the Kconfig "choice" mechanism to a pair of booleans,
    only one of which (CONFIG_PREEMPT_RCU) is user-visible, and is located in
    kernel/Kconfig.preempt, where one would expect it to be. The other
    (CONFIG_CLASSIC_RCU) is in init/Kconfig so that it is available to all
    architectures, hopefully avoiding build breakage. Thanks to Roman Zippel for
    suggesting this approach.

    Signed-off-by: Paul E. McKenney
    Cc: Ingo Molnar
    Acked-by: Steven Rostedt
    Cc: Dipankar Sarma
    Cc: Josh Triplett
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Roman Zippel
    Cc: Sam Ravnborg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul E. McKenney
     
  • The return value convention of a module's init function is 0/-E.
    Sometimes, e.g. during forward-porting, mistakes happen and a buggy
    module is created in which the result of a comparison like
    "workqueue != NULL" is propagated all the way up to sys_init_module.
    What happened was that some other module had created the workqueue in
    question, our module created it again, and the module was still
    successfully loaded.

    Or it could be some other bug.

    Let's make such mistakes much more visible. In retrospect, such
    messages would have noticeably shortened some of my head-scratching
    sessions.

    Note that dump_stack() is just a way to get the user's attention.
    Sample message:

    sys_init_module: 'foo'->init suspiciously returned 1, it should follow 0/-E convention
    sys_init_module: loading module anyway...
    Pid: 4223, comm: modprobe Not tainted 2.6.24-25f666300625d894ebe04bac2b4b3aadb907c861 #5

    Call Trace:
    [] sys_init_module+0xe5/0x1d0
    [] system_call_after_swapgs+0x7b/0x80

    Signed-off-by: Alexey Dobriyan
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Commit c9a3ba55 (module: wait for dependent modules doing init.) didn't quite
    work because the waiter holds the module lock, meaning that the state of the
    module it's waiting for cannot change.

    Fortunately, it's fairly simple to update the state outside the lock and do
    the wakeup.

    Thanks to Jan Glauber for tracking this down and testing (qdio and qeth).

    Signed-off-by: Rusty Russell
    Cc: Jan Glauber
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rusty Russell
     

10 Mar, 2008

2 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt:
    time: remove obsolete CLOCK_TICK_ADJUST
    time: don't touch an offlined CPU's ts->tick_stopped in tick_cancel_sched_timer()
    time: prevent the loop in timespec_add_ns() from being optimised away
    ntp: use unsigned input for do_div()

    Linus Torvalds
     
  • We currently set the root-domain online span automatically when the
    domain is added to the cpu if the cpu is already a member of
    cpu_online_map.

    This was done as a hack/bug-fix for s2ram, but it also causes a problem
    with hotplug CPU_DOWN transitioning. The right way to fix the original
    problem is to actually respond to CPU_UP events, instead of CPU_ONLINE,
    which is already too late.

    This solves the hung reboot regression reported by Andrew Morton and
    others.

    Signed-off-by: Gregory Haskins
    Acked-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Gregory Haskins
     

09 Mar, 2008

3 commits

  • The first version of the ntp_interval/tick_length inconsistent usage patch was
    recently merged as bbe4d18ac2e058c56adb0cd71f49d9ed3216a405

    http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=bbe4d18ac2e058c56adb0cd71f49d9ed3216a405

    While the fix did greatly improve the situation, Roman correctly
    pointed out that it has a small bug: if the user changes clocksources
    after the system has been running and NTP has made corrections, the
    corrections made against the old clocksource will be applied against
    the new clocksource, causing error.

    The second attempt, which corrects the issue in the NTP_INTERVAL_LENGTH
    definition, has also made it upstream as commit
    e13a2e61dd5152f5499d2003470acf9c838eab84

    http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e13a2e61dd5152f5499d2003470acf9c838eab84

    Roman has correctly pointed out that CLOCK_TICK_ADJUST is calculated
    based on the PIT's frequency, and isn't really relevant to non-PIT-driven
    clocksources (that is, clocksources other than jiffies and pit).

    This patch reverts both of those changes, and simply removes
    CLOCK_TICK_ADJUST.

    This does remove the granularity error correction for users of the PIT
    and jiffies clocksources, but for the majority of users the error
    should be within the 500ppm range NTP can accommodate.

    For systems with granularity errors greater than 500ppm, the
    "ntp_tick_adj=" boot option can be used to compensate.

    [johnstul@us.ibm.com: provided changelog]
    [mattilinnanvuori@yahoo.com: make ntp_tick_adj static]
    Signed-off-by: Roman Zippel
    Acked-by: john stultz
    Signed-off-by: Matti Linnanvuori
    Signed-off-by: Andrew Morton
    Cc: mingo@elte.hu
    Signed-off-by: Thomas Gleixner

    Roman Zippel
     
  • Silences WARN_ONs in rcu_enter_nohz() and rcu_exit_nohz() that
    previously appeared, caused by (repeated) calls to:
    $ echo 0 > /sys/devices/system/cpu/cpu1/online
    $ echo 1 > /sys/devices/system/cpu/cpu1/online

    Signed-off-by: Karsten Wiese
    Cc: johnstul@us.ibm.com
    Cc: Rafael Wysocki
    Cc: Steven Rostedt
    Cc: Ingo Molnar
    Acked-by: Paul E. McKenney
    Signed-off-by: Andrew Morton
    Signed-off-by: Thomas Gleixner

    Karsten Wiese
     
  • The kernel NTP code shouldn't hand 64-bit *signed* values to do_div(). Make it
    instead hand 64-bit unsigned values. This gets rid of a couple of warnings.

    Signed-off-by: David Howells
    Cc: Roman Zippel
    Cc: Ingo Molnar
    Cc: john stultz
    Signed-off-by: Andrew Morton
    Signed-off-by: Thomas Gleixner

    David Howells