16 Jun, 2011

1 commit

  • Recently, Robert Mueller reported (http://lkml.org/lkml/2010/9/12/236)
    that zone_reclaim_mode doesn't work properly on his new NUMA server (Dual
    Xeon E5520 + Intel S5520UR MB). He is using Cyrus IMAPd and it's built on
    a very traditional single-process model.

    * a master process which reads config files and manages the other
    processes
    * multiple imapd processes, one per connection
    * multiple pop3d processes, one per connection
    * multiple lmtpd processes, one per connection
    * periodic "cleanup" processes.

    There are thousands of independent processes. The problem is that recent
    Intel motherboards turn on zone_reclaim_mode by default, and traditional
    prefork-model software doesn't work well on it. Unfortunately, such models
    are still typical even in the 21st century. We can't ignore them.

    This patch raises the zone_reclaim_mode threshold (RECLAIM_DISTANCE) to
    30. The value 30 has no specific meaning, but the old threshold of 20
    catches one-hop QPI/HyperTransport distances, and such relatively cheap
    2-4 socket machines are often used for traditional servers as above. The
    intention is that these machines don't use zone_reclaim_mode.

    Note: ia64 and Power have arch-specific RECLAIM_DISTANCE definitions, so
    this patch doesn't change the behavior of such high-end NUMA machines.

    Dave Hansen said:

    : I know specifically of pieces of x86 hardware that set the information
    : in the BIOS to '21' *specifically* so they'll get the zone_reclaim_mode
    : behavior which that implies.
    :
    : They've done performance testing and run very large and scary benchmarks
    : to make sure that they _want_ this turned on. What this means for them
    : is that they'll probably be de-optimized, at least on newer versions of
    : the kernel.
    :
    : If you want to do this for particular systems, maybe _that_'s what we
    : should do. Have a list of specific configurations that need the
    : defaults overridden either because they're buggy, or they have an
    : unusual hardware configuration not really reflected in the distance
    : table.

    And later said:

    : The original change in the hardware tables was for the benefit of a
    : benchmark. Said benchmark isn't going to get run on mainline until the
    : next batch of enterprise distros drops, at which point the hardware where
    : this was done will be irrelevant for the benchmark. I'm sure any new
    : hardware will just set this distance to another yet arbitrary value to
    : make the kernel do what it wants. :)
    :
    : Also, when the hardware got _set_ to this initially, I complained. So, I
    : guess I'm getting my way now, with this patch. I'm cool with it.

    Reported-by: Robert Mueller
    Signed-off-by: KOSAKI Motohiro
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Benjamin Herrenschmidt
    Cc: "Luck, Tony"
    Acked-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
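
    The gating that the new threshold feeds is tiny. Below is a minimal
    standalone C sketch of the idea -- not the kernel source -- with a
    hypothetical node_distance() standing in for the firmware-provided SLIT
    table; the real check lives in the page allocator's zonelist setup.

        #include <stdio.h>

        #define RECLAIM_DISTANCE 30     /* raised from 20 by this patch */

        /* Hypothetical stand-in for the firmware SLIT distances:
         * 10 for local, 21 for a one-hop QPI/HyperTransport neighbour. */
        static int node_distance(int a, int b)
        {
                return a == b ? 10 : 21;
        }

        int main(void)
        {
                int zone_reclaim_mode = 0;
                int local_node = 0, node;

                /* If any remote node is "far enough away", prefer reclaiming
                 * locally over allocating from it. */
                for (node = 0; node < 2; node++)
                        if (node_distance(local_node, node) > RECLAIM_DISTANCE)
                                zone_reclaim_mode = 1;

                /* With the old threshold of 20, this two-socket box prints 1;
                 * with 30 it prints 0 and stays off zone reclaim. */
                printf("zone_reclaim_mode = %d\n", zone_reclaim_mode);
                return 0;
        }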
     

10 Sep, 2010

1 commit

  • On top of the SMT and MC scheduling domains this adds the BOOK scheduling
    domain. This is useful for NUMA-like machines which do not have an
    interface that tells which piece of memory is attached to which node,
    or where the hardware performs striping.

    Signed-off-by: Heiko Carstens
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Heiko Carstens
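
    For orientation, a purely illustrative C enum (assumed ordering and
    invented names, not a copy of the kernel's sched_domain_level) showing
    where the new BOOK level slots in between the cache-sharing and per-node
    levels:

        /* Illustrative only: scheduling-domain levels from smallest to
         * largest span once BOOK is added. */
        enum sd_level_sketch {
                SD_LV_SIBLING_SKETCH,   /* SMT threads of one core               */
                SD_LV_MC_SKETCH,        /* cores sharing a last-level cache      */
                SD_LV_BOOK_SKETCH,      /* "books": groups of sockets, e.g. s390 */
                SD_LV_CPU_SKETCH,       /* all CPUs of one node                  */
                SD_LV_NODE_SKETCH,      /* NUMA nodes                            */
        };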
     

10 Aug, 2010

1 commit

  • Define stubs for the numa_*_id() generic percpu-related functions for
    non-NUMA configurations, next to the other non-NUMA stubs.

    Fixes the ia64 !NUMA build breakage -- e.g., tiger_defconfig.

    Back out the now-unneeded '#ifndef CONFIG_NUMA' guards from ia64
    smpboot.c.

    Signed-off-by: Lee Schermerhorn
    Tested-by: Tony Luck
    Acked-by: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
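
    A rough sketch of what such non-NUMA stubs look like (assumed shape, not
    the exact hunk): on a single-node build everything is node 0 and the
    setters are no-ops.

        /* Sketch of !CONFIG_NUMA fallbacks: one node, nothing to track. */
        static inline int numa_node_id(void)            { return 0; }
        static inline void set_numa_node(int node)      { (void)node; }
        static inline void set_cpu_numa_node(int cpu, int node)
        {
                (void)cpu;
                (void)node;
        }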
     

09 Jun, 2010

1 commit

  • Check to see if the group is packed in a sched domain.

    This is primarily intended to be used at the sibling level. Some cores
    like POWER7 prefer to use lower-numbered SMT threads. In the case of
    POWER7, it can move to lower SMT modes only when higher threads are
    idle. When in lower SMT modes, the threads perform better since they
    share fewer core resources. Hence when we have idle threads, we want
    them to be the higher-numbered ones.

    This adds a hook into f_b_g() called check_asym_packing() to check the
    packing. This packing function is run on idle threads. It checks whether
    the busiest CPU in this domain (core in the P7 case) has a higher CPU
    number than the CPU the packing function is being run on. If it does,
    it calculates the imbalance and returns the busier, higher-numbered
    thread as the busiest group to f_b_g(). Here we are assuming a lower
    CPU number is equivalent to a lower SMT thread number.

    It also creates a new SD_ASYM_PACKING flag to enable this feature at
    any scheduler domain level.

    It also creates an arch hook to enable this feature at the sibling
    level. The default function doesn't enable this feature.

    Based heavily on patch from Peter Zijlstra.
    Fixes from Srivatsa Vaddagiri.

    Signed-off-by: Michael Neuling
    Signed-off-by: Srivatsa Vaddagiri
    Signed-off-by: Peter Zijlstra
    Cc: Arjan van de Ven
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Michael Neuling
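
    The decision itself is simple; here is a standalone C sketch of the idea
    (hypothetical helper name and CPU numbers, not the scheduler's code): an
    idle, lower-numbered thread pulls work from a higher-numbered one when
    the domain has SD_ASYM_PACKING set.

        #include <stdbool.h>
        #include <stdio.h>

        /* Illustration of the asymmetric-packing rule: pack work onto
         * low-numbered SMT threads so the high-numbered ones go idle. */
        static bool asym_packing_wants_pull(bool sd_asym_packing,
                                            int this_cpu, int busiest_cpu)
        {
                if (!sd_asym_packing)
                        return false;
                /* Only a lower-numbered (preferred) thread pulls downwards. */
                return this_cpu < busiest_cpu;
        }

        int main(void)
        {
                printf("%d\n", asym_packing_wants_pull(true, 0, 2)); /* 1: thread 0 pulls     */
                printf("%d\n", asym_packing_wants_pull(true, 3, 1)); /* 0: thread 3 leaves it */
                return 0;
        }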
     

28 May, 2010

2 commits

  • Introduce numa_mem_id(), based on generic percpu variable infrastructure
    to track "nearest node with memory" for archs that support memoryless
    nodes.

    Define the API in this header when CONFIG_HAVE_MEMORYLESS_NODES is
    defined, else provide stubs. Architectures will define
    HAVE_MEMORYLESS_NODES if/when they support memoryless nodes.

    Archs can override definitions of:

    numa_mem_id() - returns the node number of the "local memory" node
    set_numa_mem() - initialize [this cpu's] per-cpu variable 'numa_mem'
    cpu_to_mem() - return numa_mem for the specified cpu; may be used as an lvalue

    Generic initialization of 'numa_mem' occurs in __build_all_zonelists().
    This will initialize the boot cpu at boot time, and all cpus on change of
    numa_zonelist_order, or when node or memory hot-plug requires zonelist
    rebuild. Archs that support memoryless nodes will need to initialize
    'numa_mem' for secondary cpus as they're brought on-line.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Christoph Lameter
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Nick Piggin
    Cc: David Rientjes
    Cc: Eric Whitney
    Cc: KAMEZAWA Hiroyuki
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: "Luck, Tony"
    Cc: Pekka Enberg
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
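
    A standalone toy model (hypothetical four-node topology, not the kernel
    implementation) of what the cached per-cpu 'numa_mem' value represents:
    the nearest node that actually has memory, which is what numa_mem_id()
    and cpu_to_mem() hand back on memoryless-node systems.

        #include <stdio.h>

        #define NR_NODES 4

        /* Assume node 1 is memoryless; its CPUs must fall back elsewhere. */
        static const int node_has_memory[NR_NODES] = { 1, 0, 1, 1 };

        /* Hypothetical SLIT-style distances: same node 10, adjacent pair of
         * nodes 21, otherwise 31. */
        static int node_distance(int a, int b)
        {
                if (a == b)
                        return 10;
                return (a / 2 == b / 2) ? 21 : 31;
        }

        /* What set_numa_mem() would cache per cpu at bring-up time. */
        static int nearest_node_with_memory(int node)
        {
                int best = -1, n;

                for (n = 0; n < NR_NODES; n++)
                        if (node_has_memory[n] &&
                            (best < 0 ||
                             node_distance(node, n) < node_distance(node, best)))
                                best = n;
                return best;
        }

        int main(void)
        {
                int node;

                for (node = 0; node < NR_NODES; node++)
                        printf("numa_mem for node %d -> node %d\n",
                               node, nearest_node_with_memory(node));
                return 0;
        }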
     
  • Rework the generic version of the numa_node_id() function to use the new
    generic percpu variable infrastructure.

    Guard the new implementation with a new config option:

    CONFIG_USE_PERCPU_NUMA_NODE_ID.

    Archs which support this new implementation will default this option to 'y'
    when NUMA is configured. This config option could be removed if/when all
    archs switch over to the generic percpu implementation of numa_node_id().
    Arch support involves:

    1) converting any existing per cpu variable implementations to use
    this implementation. x86_64 is an instance of such an arch.
    2) archs that don't use a per cpu variable for numa_node_id() will
    need to initialize the new per cpu variable "numa_node" as cpus
    are brought on-line. ia64 is an example.
    3) Defining USE_PERCPU_NUMA_NODE_ID in arch dependent Kconfig--e.g.,
    when NUMA is configured. This is required because I have
    retained the old implementation by default to allow archs to
    be modified incrementally, as desired.

    Subsequent patches will convert x86_64 and ia64 to use this implementation.

    Signed-off-by: Lee Schermerhorn
    Cc: Tejun Heo
    Cc: Mel Gorman
    Reviewed-by: Christoph Lameter
    Cc: Nick Piggin
    Cc: David Rientjes
    Cc: Eric Whitney
    Cc: KAMEZAWA Hiroyuki
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: "Luck, Tony"
    Cc: Pekka Enberg
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
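
    Conceptually, the generic numa_node_id() becomes a read of a per-cpu
    integer that the architecture fills in as CPUs come online. A standalone
    sketch of that shape (a plain array stands in for the per-cpu variable;
    names are illustrative):

        #include <stdio.h>

        #define NR_CPUS 4

        /* Stand-in for the per-cpu variable 'numa_node'. */
        static int numa_node_of_cpu[NR_CPUS];

        /* Arch bring-up code records each CPU's node once... */
        static void set_cpu_numa_node_sketch(int cpu, int node)
        {
                numa_node_of_cpu[cpu] = node;
        }

        /* ...and numa_node_id()/cpu_to_node() become cheap reads. */
        static int cpu_to_node_sketch(int cpu)
        {
                return numa_node_of_cpu[cpu];
        }

        int main(void)
        {
                int cpu;

                for (cpu = 0; cpu < NR_CPUS; cpu++)      /* two CPUs per node */
                        set_cpu_numa_node_sketch(cpu, cpu / 2);

                for (cpu = 0; cpu < NR_CPUS; cpu++)
                        printf("cpu %d -> node %d\n", cpu, cpu_to_node_sketch(cpu));
                return 0;
        }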
     

21 Jan, 2010

1 commit

  • SD_PREFER_SIBLING is set at the CPU domain level if power saving isn't
    enabled, leading to many cache misses on large machines as we traverse
    looking for an idle shared cache to wake to. Change the enabler of
    select_idle_sibling() to SD_SHARE_PKG_RESOURCES, and enable same at the
    sibling domain level.

    Reported-by: Lin Ming
    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     

14 Oct, 2009

1 commit

  • Yanmin reported that both tbench and hackbench were significantly
    hurt by trying to keep tasks local on these domains, esp on small
    cache machines.

    So disable it in order to promote spreading outside of the cache
    domains.

    Reported-by: "Zhang, Yanmin"
    Signed-off-by: Peter Zijlstra
    CC: Mike Galbraith
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

16 Sep, 2009

2 commits

  • Sysbench thinks SD_BALANCE_WAKE is too aggressive, and kbuild doesn't
    really mind too much; SD_BALANCE_NEWIDLE picks up most of the slack.

    On a dual socket, quad core, dual thread nehalem system:

    sysbench (--num_threads=16):

    SD_BALANCE_WAKE-: 13982 tx/s
    SD_BALANCE_WAKE+: 15688 tx/s

    kbuild (-j16):

    SD_BALANCE_WAKE-: 47.648295846 seconds time elapsed ( +- 0.312% )
    SD_BALANCE_WAKE+: 47.608607360 seconds time elapsed ( +- 0.026% )

    (same within noise)

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • And turn it on for NUMA and MC domains. This improves locality in
    balancing decisions by keeping up to a capacity's worth of tasks local
    before looking for idle CPUs (and twice the capacity if
    SD_POWERSAVINGS_BALANCE is set).

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

15 Sep, 2009

5 commits

  • If we're looking to place a new task, we might as well find the
    idlest position _now_, not 1 tick ago.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Make the idle balancer more aggressive, to improve an x264 encoding
    workload provided by Jason Garrett-Glaser:

    NEXT_BUDDY NO_LB_BIAS
    encoded 600 frames, 252.82 fps, 22096.60 kb/s
    encoded 600 frames, 250.69 fps, 22096.60 kb/s
    encoded 600 frames, 245.76 fps, 22096.60 kb/s

    NO_NEXT_BUDDY LB_BIAS
    encoded 600 frames, 344.44 fps, 22096.60 kb/s
    encoded 600 frames, 346.66 fps, 22096.60 kb/s
    encoded 600 frames, 352.59 fps, 22096.60 kb/s

    NO_NEXT_BUDDY NO_LB_BIAS
    encoded 600 frames, 425.75 fps, 22096.60 kb/s
    encoded 600 frames, 425.45 fps, 22096.60 kb/s
    encoded 600 frames, 422.49 fps, 22096.60 kb/s

    Peter pointed out that this is better done via newidle_idx,
    not via LB_BIAS, newidle balancing should look for where
    there is load _now_, not where there was load 2 ticks ago.

    Worst-case latencies are improved as well, since having no buddies
    means less vruntime spread (as per prior lkml discussions).

    This change improves kbuild-peak parallelism as well.

    Reported-by: Jason Garrett-Glaser
    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • CPU level should have WAKE_AFFINE, whereas ALLNODES is dubious.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • When merging select_task_rq_fair() and sched_balance_self() we lost
    the use of wake_idx, restore that and set them to 0 to make wake
    balancing more aggressive.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The problem with wake_idle() is that it doesn't respect things like
    cpu_power, which means it doesn't deal well with SMT or the recent
    RT interaction.

    To cure this, it needs to do what sched_balance_self() does, which
    leads to the possibility of merging select_task_rq_fair() and
    sched_balance_self().

    Modify sched_balance_self() to:

    - update_shares() when walking up the domain tree,
    (it only called it for the top domain, but it should
    have done this anyway), which allows us to remove
    this ugly bit from try_to_wake_up().

    - do wake_affine() on the smallest domain that contains
    both this (the waking) and the prev (the wakee) cpu for
    WAKE invocations.

    Then use the top-down balance steps it had to replace wake_idle().

    This leads to the disappearance of SD_WAKE_BALANCE and
    SD_WAKE_IDLE_FAR, with SD_WAKE_IDLE replaced by SD_BALANCE_WAKE.

    SD_WAKE_AFFINE needs SD_BALANCE_WAKE to be effective.

    Touch all topology bits to replace the old SD flags with the new ones --
    platforms might need re-tuning. Enabling SD_BALANCE_WAKE conditionally on
    NUMA distance seems like a good additional feature; Magny-Cours and small
    Nehalem systems would want this enabled, while systems with slow
    interconnects would not.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
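
    One piece of the rework is easy to picture: pick the smallest scheduling
    domain whose span contains both the waking CPU and the wakee's previous
    CPU, and make the wake_affine()/SD_BALANCE_WAKE decision there. A toy
    standalone sketch with made-up spans (bitmasks for an 8-CPU, two-package
    box), not the kernel's data structures:

        #include <stdio.h>

        /* Domain spans for CPU 0, smallest to largest:
         * sibling {0,1}, MC {0-3}, node {0-7}. */
        static const unsigned int spans[] = { 0x03, 0x0f, 0xff };

        static int smallest_common_domain(int waking_cpu, int prev_cpu)
        {
                unsigned int need = (1u << waking_cpu) | (1u << prev_cpu);
                int level;

                for (level = 0; level < 3; level++)
                        if ((spans[level] & need) == need)
                                return level;
                return -1;
        }

        int main(void)
        {
                printf("%d\n", smallest_common_domain(0, 1)); /* 0: same core     */
                printf("%d\n", smallest_common_domain(0, 3)); /* 1: same package  */
                printf("%d\n", smallest_common_domain(0, 6)); /* 2: cross package */
                return 0;
        }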
     

04 Sep, 2009

3 commits

  • Start the re-tuning of the balancer by turning on newidle.

    It improves hackbench performance and parallelism on a 4x4 box.
    The "perf stat --repeat 10" measurements give us:

    domain0 domain1
    .......................................
    -SD_BALANCE_NEWIDLE -SD_BALANCE_NEWIDLE:
    2041.273208 task-clock-msecs # 9.354 CPUs ( +- 0.363% )

    +SD_BALANCE_NEWIDLE -SD_BALANCE_NEWIDLE:
    2086.326925 task-clock-msecs # 11.934 CPUs ( +- 0.301% )

    +SD_BALANCE_NEWIDLE +SD_BALANCE_NEWIDLE:
    2115.289791 task-clock-msecs # 12.158 CPUs ( +- 0.263% )

    Acked-by: Peter Zijlstra
    Cc: Andreas Herrmann
    Cc: Andreas Herrmann
    Cc: Gautham R Shenoy
    Cc: Balbir Singh
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Re-organize the flag settings so that it's visible at a glance
    which sched-domains flags are set and which not.

    With the new balancer code we'll need to re-tune these details
    anyway, so make it cleaner to make fewer mistakes down the
    road ;-)

    Cc: Peter Zijlstra
    Cc: Andreas Herrmann
    Cc: Andreas Herrmann
    Cc: Gautham R Shenoy
    Cc: Balbir Singh
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • The idea is that multi-threading a core yields more work capacity than
    a single thread; provide a way to express a static gain for the extra
    threads.

    Signed-off-by: Peter Zijlstra
    Tested-by: Andreas Herrmann
    Acked-by: Andreas Herrmann
    Acked-by: Gautham R Shenoy
    Cc: Balbir Singh
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
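
    The gain is plain arithmetic: treat the whole multi-threaded core as
    worth a bit more than one thread, then split that capacity among its
    threads. A standalone sketch using an assumed ~15% gain figure
    (1178 vs. the usual 1024 load scale); the exact default is not quoted
    in this log entry.

        #include <stdio.h>

        #define SCHED_LOAD_SCALE 1024

        int main(void)
        {
                unsigned int smt_gain = 1178;   /* assumed: ~15% over one thread */
                unsigned int nr_threads;

                for (nr_threads = 2; nr_threads <= 4; nr_threads++)
                        printf("%u SMT threads -> per-thread capacity %u (single thread = %u)\n",
                               nr_threads, smt_gain / nr_threads, SCHED_LOAD_SCALE);
                return 0;
        }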
     

13 Mar, 2009

2 commits

  • Impact: cleanup, potential bugfix

    Not sure what changed to expose this, but clearly numa_node_id() doesn't
    belong in mmzone.h (the inline in gfp.h is probably overkill, too).

    In file included from include/linux/topology.h:34,
    from arch/x86/mm/numa.c:2:
    /home/rusty/patches-cpumask/linux-2.6/arch/x86/include/asm/topology.h:64:1: warning: "numa_node_id" redefined
    In file included from include/linux/topology.h:32,
    from arch/x86/mm/numa.c:2:
    include/linux/mmzone.h:770:1: warning: this is the location of the previous definition

    Signed-off-by: Rusty Russell
    Cc: Mike Travis
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Rusty Russell
     
  • Impact: cleanup

  • node_to_cpumask (and the blecherous node_to_cpumask_ptr, which contained
    a declaration) are replaced now that everyone implements
    cpumask_of_node.

    Signed-off-by: Rusty Russell

    Rusty Russell
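
    The shape of the replacement API is the interesting part: instead of
    returning a whole cpumask_t by value (or smuggling a declaration through
    node_to_cpumask_ptr), callers get a pointer to a constant mask and
    iterate it. A toy standalone sketch with a bitmask standing in for
    cpumask_t (invented names, not kernel code):

        #include <stdio.h>

        /* Toy stand-in: a "cpumask" is just a bitmask of 8 CPUs. */
        static const unsigned long node_masks[2] = { 0x0f, 0xf0 };

        /* cpumask_of_node()-style accessor: hand back a pointer, copy nothing. */
        static const unsigned long *cpumask_of_node_sketch(int node)
        {
                return &node_masks[node];
        }

        int main(void)
        {
                const unsigned long *mask = cpumask_of_node_sketch(1);
                int cpu;

                for (cpu = 0; cpu < 8; cpu++)
                        if (*mask & (1ul << cpu))
                                printf("cpu %d is on node 1\n", cpu);
                return 0;
        }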
     

19 Dec, 2008

2 commits

  • Impact: change task balancing to save power more aggressively

    Add SD_BALANCE_NEWIDLE flag at MC level and CPU level
    if sched_mc is set. This helps power savings and
    will not affect performance when sched_mc=0

    Ingo and Mike Galbraith have optimised the SD flags by
    removing SD_BALANCE_NEWIDLE at MC and CPU level. This
    helps performance but hurts power savings since this
    slows down task consolidation by reducing the number
    of times load_balance is run.

    sched: fine-tune SD_MC_INIT
    commit 14800984706bf6936bbec5187f736e928be5c218
    Author: Mike Galbraith
    Date: Fri Nov 7 15:26:50 2008 +0100

    sched: re-tune balancing -- revert
    commit 9fcd18c9e63e325dbd2b4c726623f760788d5aa8
    Author: Ingo Molnar
    Date: Wed Nov 5 16:52:08 2008 +0100

    This patch selectively enables SD_BALANCE_NEWIDLE flag
    only when sched_mc is set to 1 or 2. This helps power savings
    by task consolidation and also does not hurt performance at
    sched_mc=0 where all power saving optimisations are turned off.

    Signed-off-by: Vaidyanathan Srinivasan
    Acked-by: Balbir Singh
    Acked-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Vaidyanathan Srinivasan
     
  • Impact: cleanup

    BALANCE_FOR_MC_POWER and similar macros defined in sched.h are not
    constants; they have various condition checks and a significant amount
    of code that is not suitable to be contained in a macro. Also, there
    could be side effects on the expressions passed to some of them, like
    test_sd_parent().

    This patch converts all complex macros related to power savings
    balance to inline functions.

    Signed-off-by: Vaidyanathan Srinivasan
    Acked-by: Balbir Singh
    Acked-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Vaidyanathan Srinivasan
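
    The motivation is the usual one for function-like macros; a minimal
    invented example (not the actual sched.h definitions) of the hazard and
    of the inline-function form that replaces it:

        static int sched_mc_power_savings_sketch = 1;

        /* Macro form: 'cpu' is evaluated twice, so an argument with side
         * effects (e.g. next_cpu++) would be applied twice. */
        #define POWERSAVINGS_BALANCE_SKETCH(cpu) \
                (((cpu) >= 0) && sched_mc_power_savings_sketch && !((cpu) & 1))

        /* Inline-function form: the argument is evaluated exactly once and
         * is type-checked, with no loss of efficiency. */
        static inline int powersavings_balance_sketch(int cpu)
        {
                return cpu >= 0 && sched_mc_power_savings_sketch && !(cpu & 1);
        }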
     

07 Nov, 2008

2 commits

  • Fine-tune the HT sched-domains parameters as well.

    On a HT capable box, this increases lat_ctx performance from 23.87
    usecs to 1.49 usecs:

    # before

    $ ./lat_ctx -s 0 2

    "size=0k ovr=1.89
    2 23.87

    # after

    $ ./lat_ctx -s 0 2

    "size=0k ovr=1.84
    2 1.49

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Tune SD_MC_INIT the same way as SD_CPU_INIT:
    unset SD_BALANCE_NEWIDLE, and set SD_WAKE_BALANCE.

    This improves vmark by 5%:

    vmark 132102 125968 125497 messages/sec avg 127855.66 .984
    vmark 139404 131719 131272 messages/sec avg 134131.66 1.033

    Signed-off-by: Mike Galbraith
    Acked-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    # *DOCUMENTATION*

    Mike Galbraith
     

06 Nov, 2008

1 commit

  • Impact: improve wakeup affinity on NUMA systems, tweak SMP systems

    Given the fixes+tweaks to the wakeup-buddy code, re-tweak the domain
    balancing defaults on NUMA and SMP systems.

    Turn on SD_WAKE_AFFINE which was off on x86 NUMA - there's no reason
    why we would not want to have wakeup affinity across nodes as well.
    (we already do this in the standard NUMA template.)

    lat_ctx on a NUMA box is particularly happy about this change:

    before:

    | phoenix:~/l> ./lat_ctx -s 0 2
    | "size=0k ovr=2.60
    | 2 5.70

    after:

    | phoenix:~/l> ./lat_ctx -s 0 2
    | "size=0k ovr=2.65
    | 2 2.07

    a 2.75x speedup.

    pipe-test is similarly happy about it too:

    | phoenix:~/sched-tests> ./pipe-test
    | 18.26 usecs/loop.
    | 14.70 usecs/loop.
    | 14.38 usecs/loop.
    | 10.55 usecs/loop. # +WAKE_AFFINE on domain0+domain1
    | 8.63 usecs/loop.
    | 8.59 usecs/loop.
    | 9.03 usecs/loop.
    | 8.94 usecs/loop.
    | 8.96 usecs/loop.
    | 8.63 usecs/loop.

    Also:

    - disable SD_BALANCE_NEWIDLE on NUMA and SMP domains (keep it for siblings)
    - enable SD_WAKE_BALANCE on SMP domains

    Sysbench+postgresql improves all around the board, quite significantly:

    .28-rc3-11474e2c .28-rc3-11474e2c-tune
    -------------------------------------------------
    1: 571 688 +17.08%
    2: 1236 1206 -2.55%
    4: 2381 2642 +9.89%
    8: 4958 5164 +3.99%
    16: 9580 9574 -0.07%
    32: 7128 8118 +12.20%
    64: 7342 8266 +11.18%
    128: 7342 8064 +8.95%
    256: 7519 7884 +4.62%
    512: 7350 7731 +4.93%
    -------------------------------------------------
    SUM: 55412 59341 +6.62%

    So it's a win both for the runup portion, the peak area and the tail.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

13 Jun, 2008

1 commit

  • This can result in an empty topology directory in sysfs, and requires
    in-kernel users to protect all uses with #ifdef.

    The documentation of CPU topology specifies what the defaults should be if
    only partial information is available from the hardware. So we can
    provide these defaults as a fallback.

    This patch:

    - Adds default definitions of the 4 topology macros to this header
    - Changes drivers/base/topology.c to use the topology macros
    unconditionally and to cope with definitions that aren't lvalues
    - Updates the documentation accordingly

    [ From: Andrew Morton
    - fold now-duplicated code
    - fix layout
    ]

    Signed-off-by: Ben Hutchings
    Cc: Vegard Nossum
    Cc: Nick Piggin
    Cc: Chandra Seetharaman
    Cc: Suresh Siddha
    Cc: Mike Travis
    Cc: Christoph Lameter
    Cc: John Hawkes
    Cc: Zhang, Yanmin
    Signed-off-by: Andrew Morton
    Signed-off-by: Ingo Molnar

    Ben Hutchings
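
    A sketch of the fallback pattern being described (assumed macro bodies;
    the real defaults may differ): each macro is defined only if the
    architecture has not already provided it, with a conservative default.

        /* Only fill in what the arch left undefined. */
        #ifndef topology_physical_package_id
        #define topology_physical_package_id(cpu)  ((void)(cpu), -1)  /* unknown */
        #endif

        #ifndef topology_core_id
        #define topology_core_id(cpu)              ((void)(cpu), 0)
        #endif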
     

29 May, 2008

1 commit

  • Improve the sysbench ramp-up phase and its peak throughput on a 16-way
    NUMA box by turning on WAKE_AFFINE:

    tip/sched tip/sched+wake-affine
    -------------------------------------------------
    1: 700 830 +15.65%
    2: 1465 1391 -5.28%
    4: 3017 3105 +2.81%
    8: 5100 6021 +15.30%
    16: 10725 10745 +0.19%
    32: 10135 10150 +0.16%
    64: 9338 9240 -1.06%
    128: 8599 8252 -4.21%
    256: 8475 8144 -4.07%
    -------------------------------------------------
    SUM: 57558 57882 +0.56%

    this change also improves lat_ctx from 6.69 usecs to 1.11 usec:

    $ ./lat_ctx -s 0 2
    "size=0k ovr=1.19
    2 1.11

    $ ./lat_ctx -s 0 2
    "size=0k ovr=1.22
    2 6.69

    in sysbench it's an overall win with some weakness at the lots-of-clients
    side. That happens because we now under-balance this workload
    a bit. To counter that effect, turn on NEWIDLE:

    wake-idle wake-idle+newidle
    -------------------------------------------------
    1: 830 834 +0.43%
    2: 1391 1401 +0.65%
    4: 3105 3091 -0.43%
    8: 6021 6046 +0.42%
    16: 10745 10736 -0.08%
    32: 10150 10206 +0.55%
    64: 9240 9533 +3.08%
    128: 8252 8355 +1.24%
    256: 8144 8384 +2.87%
    -------------------------------------------------
    SUM: 57882 58591 +1.21%

    as a bonus this not only improves the many-clients case but
    also improves the (more important) rampup phase.

    sysbench is a workload that quickly breaks down if the
    scheduler over-balances, so since it showed an improvement
    under NEWIDLE this change is definitely good.

    Ingo Molnar
     

20 Apr, 2008

1 commit

  • * Remove empty cpumask_t (and all non-zero/non-null) variables
    in SD_*_INIT macros. Use memset(0) to clear. Also, don't
    inline the initializer functions to save on stack space in
    build_sched_domains().

    * Merge change to include/linux/topology.h that uses the new
    node_to_cpumask_ptr function in the nr_cpus_node macro into
    this patch.

    Depends on:
    [mm-patch]: asm-generic-add-node_to_cpumask_ptr-macro.patch
    [sched-devel]: sched: add new set_cpus_allowed_ptr function

    Cc: H. Peter Anvin
    Signed-off-by: Mike Travis
    Signed-off-by: Ingo Molnar

    Mike Travis
     
