  		   Goals, Design and Implementation of the
  		      new ultra-scalable O(1) scheduler
  
  
    This is an edited version of an email Ingo Molnar sent to
    lkml on 4 Jan 2002.  It describes the goals, design, and
    implementation of Ingo's new ultra-scalable O(1) scheduler.
    Last Updated: 18 April 2002.
  
  
  Goal
  ====
  
  The main goal of the new scheduler is to keep all the good things we know
  and love about the current Linux scheduler:
  
   - good interactive performance even during high load: if the user
     types or clicks then the system must react instantly and must execute
     the user tasks smoothly, even during considerable background load.
  
   - good scheduling/wakeup performance with 1-2 runnable processes.
  
   - fairness: no process should stay without any timeslice for any
     unreasonable amount of time. No process should get an unjustly high
     amount of CPU time.
  
   - priorities: less important tasks can be started with lower priority,
     more important tasks with higher priority.
  
   - SMP efficiency: no CPU should stay idle if there is work to do.
  
   - SMP affinity: processes which run on one CPU should stay affine to
     that CPU. Processes should not bounce between CPUs too frequently.
  
   - plus additional scheduler features: RT scheduling, CPU binding.
  
  and the goal is also to add a few new things:
  
   - fully O(1) scheduling. Are you tired of the recalculation loop
     blowing the L1 cache away every now and then? Do you think the goodness
     loop is taking a bit too long to finish if there are lots of runnable
     processes? This new scheduler takes no prisoners: wakeup(), schedule(),
     the timer interrupt are all O(1) algorithms. There is no recalculation
     loop. There is no goodness loop either.
  
   - 'perfect' SMP scalability. With the new scheduler there is no 'big'
     runqueue_lock anymore - it's all per-CPU runqueues and locks - two
     tasks on two separate CPUs can wake up, schedule and context-switch
     completely in parallel, without any interlocking. All
     scheduling-relevant data is structured for maximum scalability.
  
   - better SMP affinity. The old scheduler has a particular weakness that
     causes random bouncing of tasks between CPUs if/when higher
     priority/interactive tasks are present; this was observed and reported
     by many people. The reason is that the timeslice recalculation loop
     first needs every currently running task to consume its timeslice. But
     when this happens on e.g. an 8-way system, this property starves an
     increasing number of CPUs from executing any process. Once the last
     task that has a timeslice left has finished using up that timeslice,
     the recalculation loop is triggered and other CPUs can start executing
     tasks again - after having idled around for a number of timer ticks.
     The more CPUs, the worse this effect.
  
     Furthermore, the same mechanism causes the bouncing as well:
     whenever there is such a 'timeslice squeeze' of the global runqueue,
     idle processors start executing tasks which are not affine to that CPU.
     (because the affine tasks have finished off their timeslices already.)
  
     The new scheduler solves this problem by distributing timeslices on a
     per-CPU basis, without having any global synchronization or
     recalculation.
  
   - batch scheduling. A significant proportion of computing-intensive tasks
     benefit from batch-scheduling, where timeslices are long and processes
     are roundrobin scheduled. The new scheduler does such batch-scheduling
     of the lowest priority tasks - so nice +19 jobs will get
     'batch-scheduled' automatically. With this scheduler, nice +19 jobs are
     in essence SCHED_IDLE, from an interactiveness point of view.
  
   - handle extreme loads more smoothly, without breakdown and scheduling
     storms.
  
   - O(1) RT scheduling. For those RT folks who are paranoid about the
     O(nr_running) property of the goodness loop and the recalculation loop.
  
   - run fork()ed children before the parent. Andrea pointed out the
     advantages of this a few months ago, but patches for this feature
     do not work with the old scheduler as well as they should,
     because idle processes often steal the new child before the fork()ing
     CPU gets to execute it.
  
  
  Design
  ======
  
  the core of the new scheduler consists of the following mechanisms:
  
   - *two*, priority-ordered 'priority arrays' per CPU. There is an 'active'
     array and an 'expired' array. The active array contains all tasks that
     are affine to this CPU and have timeslices left. The expired array
     contains all tasks which have used up their timeslices - but this array
     is kept sorted as well. The active and expired arrays are not accessed
     directly; they are accessed through two pointers in the per-CPU runqueue
     structure. If all active tasks are used up then we 'switch' the two
     pointers and from now on the ready-to-go (former-) expired array is the
     active array - and the empty active array serves as the new collector
     for expired tasks.
  
   - there is a 64-bit bitmap cache for array indices. Finding the highest
     priority task is thus a matter of two x86 BSFL bit-search instructions.
  
  the split-array solution enables us to have an arbitrary number of active
  and expired tasks, and the recalculation of timeslices can be done
  immediately when the timeslice expires. Because the arrays are always
  accessed through the pointers in the runqueue, switching the two arrays can
  be done very quickly.
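
  to make this concrete, here is a minimal userspace C sketch of the
  arrangement described above. Every name, size and formula in it
  (prio_array, MAX_PRIO, the list layout, the timeslice formula) is an
  illustrative assumption for this text, not the kernel's actual code:

      #include <stdint.h>
      #include <stddef.h>

      #define MAX_PRIO     140                /* assumed number of priority levels */
      #define BITMAP_WORDS ((MAX_PRIO + 63) / 64)

      struct task {
              int prio;                       /* 0 = highest priority */
              int time_slice;                 /* ticks left in the current slice */
              struct task *next;
      };

      struct prio_array {
              uint64_t bitmap[BITMAP_WORDS];  /* bit set => queue[prio] non-empty */
              struct task *queue[MAX_PRIO];   /* one run list per priority level */
              int nr_active;
      };

      struct runqueue {                       /* one per CPU, each with its own lock */
              struct prio_array *active, *expired;
              struct prio_array arrays[2];
      };

      static void enqueue(struct prio_array *a, struct task *t)
      {
              t->next = a->queue[t->prio];    /* a real version keeps FIFO order */
              a->queue[t->prio] = t;
              a->bitmap[t->prio / 64] |= 1ULL << (t->prio % 64);
              a->nr_active++;
      }

      static void dequeue(struct prio_array *a, struct task *t)
      {
              struct task **pp = &a->queue[t->prio];
              while (*pp && *pp != t)
                      pp = &(*pp)->next;
              if (!*pp)
                      return;
              *pp = t->next;
              a->nr_active--;
              if (!a->queue[t->prio])         /* level now empty: clear its bit */
                      a->bitmap[t->prio / 64] &= ~(1ULL << (t->prio % 64));
      }

      /* O(1) selection: a bit search finds the highest-priority non-empty
         list, and the first task on that list runs next.  If every active
         task has used up its slice, the two array pointers are switched. */
      static struct task *pick_next(struct runqueue *rq)
      {
              if (rq->active->nr_active == 0) {
                      struct prio_array *tmp = rq->active;
                      rq->active = rq->expired;
                      rq->expired = tmp;
              }
              struct prio_array *a = rq->active;
              for (int w = 0; w < BITMAP_WORDS; w++)
                      if (a->bitmap[w]) {
                              int prio = w * 64 + __builtin_ctzll(a->bitmap[w]);
                              return a->queue[prio];
                      }
              return NULL;                    /* this CPU is idle */
      }

      /* timer tick: when the running task's slice is gone, a fresh slice is
         computed on the spot from its priority (made-up formula below) and
         the task is moved to the expired array - no global recalculation
         loop is ever needed. */
      static void task_tick(struct runqueue *rq, struct task *curr)
      {
              if (--curr->time_slice > 0)
                      return;
              curr->time_slice = (MAX_PRIO - curr->prio) / 4 + 1;
              dequeue(rq->active, curr);
              enqueue(rq->expired, curr);
      }

  the point of the sketch is that finding the next task, expiring a
  timeslice and switching the arrays each take a small, constant number of
  steps, independent of the number of runnable tasks.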
  
  this is a hybrid priority-list approach coupled with roundrobin
  scheduling and the array-switch method of distributing timeslices.
  
   - there is a per-task 'load estimator'.
  
  one of the toughest things to get right is good interactive feel during
  heavy system load. While playing with various scheduler variants i found
  that the best interactive feel is achieved not by 'boosting' interactive
  tasks, but by 'punishing' tasks that want to use more CPU time than there
  is available. This method is also much easier to do in an O(1) fashion.
  
  to establish the actual 'load' the task contributes to the system, a
  complex-looking but pretty accurate method is used: there is a 4-entry
  'history' ringbuffer of the task's activities during the last 4 seconds.
  This ringbuffer is operated without much overhead. The entries tell the
  scheduler a pretty accurate load-history of the task: has it used up more
  CPU time or less during the past N seconds. [the size '4' and the interval
  of 4x 1 seconds were found by lots of experimentation - this part is
  flexible and can be changed in both directions.]
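
  as an illustration, a possible shape of such a ringbuffer in C is sketched
  below; the 4x 1-second layout follows the description above, but all names
  and details are assumptions, not the actual implementation:

      #include <stdint.h>

      #define HISTORY_SLOTS 4                 /* the last 4 one-second intervals */

      struct load_history {
              uint32_t ran[HISTORY_SLOTS];    /* ticks the task ran in each second */
              uint32_t slot_ticks;            /* ticks elapsed in the current second */
              int      head;                  /* index of the current second */
      };

      /* called from the timer tick: 'ran' is 1 if the task ran during this tick */
      static void load_tick(struct load_history *h, int ran, int ticks_per_sec)
      {
              h->ran[h->head] += ran;
              if (++h->slot_ticks >= (uint32_t)ticks_per_sec) {
                      h->slot_ticks = 0;
                      h->head = (h->head + 1) % HISTORY_SLOTS;
                      h->ran[h->head] = 0;    /* start a fresh one-second slot */
              }
      }

      /* rough load estimate: percentage of CPU the task consumed over the
         whole 4-second window */
      static int load_percent(const struct load_history *h, int ticks_per_sec)
      {
              uint32_t total = 0;
              for (int i = 0; i < HISTORY_SLOTS; i++)
                      total += h->ran[i];
              return (int)(100 * total / (HISTORY_SLOTS * (uint32_t)ticks_per_sec));
      }

  in this sketch, load_percent() is the figure that would drive the priority
  penalty described next.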
  
  the penalty a task gets for generating more load than the CPU can handle
  is a priority decrease - there is a maximum amount to this penalty
  relative to the task's static priority, so even fully CPU-bound tasks will
  observe each other's priorities, and will share the CPU accordingly.
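
  a minimal sketch of such a bounded penalty, with a made-up cap of 5
  priority levels and the load figure from the sketch above as its input:

      #define MAX_PENALTY 5   /* assumed cap on the dynamic-priority penalty */

      /* the more CPU the task consumed, the larger the penalty, but never
         more than MAX_PENALTY levels worse than its static priority, so CPU
         hogs still observe each other's relative priorities */
      static int effective_priority(int static_prio, int load_percent)
      {
              int penalty = load_percent * MAX_PENALTY / 100;
              if (penalty > MAX_PENALTY)
                      penalty = MAX_PENALTY;
              return static_prio + penalty;   /* larger value = lower priority */
      }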
  
  the SMP load-balancer can be extended/switched with additional parallel
  computing and cache hierarchy concepts: NUMA scheduling and multi-core CPUs
  can be supported easily by changing the load-balancer. Right now it's
  tuned for my SMP systems.
  
  i skipped the prev->mm == next->mm advantage - no workload i know of shows
  any sensitivity to this. It can be added back by sacrificing O(1)
  schedule() [the current and the one-lower priority lists can be searched
  for a that->mm == current->mm condition], but doing so costs a fair number
  of cycles during a number of important workloads, so i wanted to avoid
  this as much as possible.
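
  for reference, such a search might look roughly like the sketch below (all
  names hypothetical); the walk over two whole priority lists is exactly the
  non-O(1) cost the paragraph above wants to avoid:

      #include <stddef.h>

      struct task {
              void *mm;                       /* address space, NULL for kernel threads */
              struct task *next;
      };

      /* scan the current priority level and the next lower one for a
         runnable task sharing the address space of 'curr' - an O(n) walk
         over those two lists */
      static struct task *find_mm_match(struct task *queue[], int nr_levels,
                                        int prio, const struct task *curr)
      {
              for (int p = prio; p <= prio + 1 && p < nr_levels; p++)
                      for (struct task *t = queue[p]; t; t = t->next)
                              if (t->mm && t->mm == curr->mm)
                                      return t;
              return NULL;
      }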
  
  - the SMP idle-task startup code was still racy and the new scheduler
  triggered this. So i streamlined the idle-setup code a bit. We do not call
  into schedule() before all processors have started up fully and all idle
  threads are in place.
  
  - the patch also cleans up a number of aspects of sched.c - moves code
  into other areas of the kernel where it's appropriate, and simplifies
  certain code paths and data constructs. As a result, the new scheduler's
  code is smaller than the old one.
  
  	Ingo