Commit b9498bfe86530fd54fb855906383c0c905a52c80
Committed by: Linus Torvalds
Parent: 3dd6b5fb43
Exists in: master and 4 other branches
numa: update Documentation/vm/numa, add memoryless node info
Kamezawa Hiroyuki requested documentation for the numa_mem_id() and slab
related changes.  He suggested Documentation/vm/numa for this documentation.

Looking at this file, it seems to me to be hopelessly out of date relative
to current Linux NUMA support.  At the risk of going down a rathole, I have
made an attempt to rewrite the doc at a slightly higher level [I think] and
provide pointers to other in-tree documents and out-of-tree man pages that
cover the details.

Let the games begin.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Showing 1 changed file with 147 additions and 39 deletions
Documentation/vm/numa
1 | 1 | Started Nov 1999 by Kanoj Sarcar <kanoj@sgi.com> |
2 | 2 | |
3 | -The intent of this file is to have an uptodate, running commentary | |
4 | -from different people about NUMA specific code in the Linux vm. | |
3 | +What is NUMA? | |
5 | 4 | |
6 | -What is NUMA? It is an architecture where the memory access times | |
7 | -for different regions of memory from a given processor varies | |
8 | -according to the "distance" of the memory region from the processor. | |
9 | -Each region of memory to which access times are the same from any | |
10 | -cpu, is called a node. On such architectures, it is beneficial if | |
11 | -the kernel tries to minimize inter node communications. Schemes | |
12 | -for this range from kernel text and read-only data replication | |
13 | -across nodes, and trying to house all the data structures that | |
14 | -key components of the kernel need on memory on that node. | |
5 | +This question can be answered from a couple of perspectives: the | |
6 | +hardware view and the Linux software view. | |
15 | 7 | |
16 | -Currently, all the numa support is to provide efficient handling | |
17 | -of widely discontiguous physical memory, so architectures which | |
18 | -are not NUMA but can have huge holes in the physical address space | |
19 | -can use the same code. All this code is bracketed by CONFIG_DISCONTIGMEM. | |
8 | +From the hardware perspective, a NUMA system is a computer platform that | |
9 | +comprises multiple components or assemblies each of which may contain 0 | |
10 | +or more CPUs, local memory, and/or IO buses. For brevity and to | |
11 | +disambiguate the hardware view of these physical components/assemblies | |
12 | +from the software abstraction thereof, we'll call the components/assemblies | |
13 | +'cells' in this document. | |
20 | 14 | |
21 | -The initial port includes NUMAizing the bootmem allocator code by | |
22 | -encapsulating all the pieces of information into a bootmem_data_t | |
23 | -structure. Node specific calls have been added to the allocator. | |
24 | -In theory, any platform which uses the bootmem allocator should | |
25 | -be able to put the bootmem and mem_map data structures anywhere | |
26 | -it deems best. | |
15 | +Each of the 'cells' may be viewed as an SMP [symmetric multi-processor] subset | |
16 | +of the system--although some components necessary for a stand-alone SMP system | |
17 | +may not be populated on any given cell. The cells of the NUMA system are | |
18 | +connected together with some sort of system interconnect--e.g., crossbars or |
19 | +point-to-point links are common types of NUMA system interconnects. Both of |
20 | +these types of interconnects can be aggregated to create NUMA platforms with | |
21 | +cells at multiple distances from other cells. | |
27 | 22 | |
28 | -Each node's page allocation data structures have also been encapsulated | |
29 | -into a pg_data_t. The bootmem_data_t is just one part of this. To | |
30 | -make the code look uniform between NUMA and regular UMA platforms, | |
31 | -UMA platforms have a statically allocated pg_data_t too (contig_page_data). | |
32 | -For the sake of uniformity, the function num_online_nodes() is also defined | |
33 | -for all platforms. As we run benchmarks, we might decide to NUMAize | |
34 | -more variables like low_on_memory, nr_free_pages etc into the pg_data_t. | |
23 | +For Linux, the NUMA platforms of interest are primarily what is known as Cache | |
24 | +Coherent NUMA or ccNUMA systems. With ccNUMA systems, all memory is visible | |
25 | +to and accessible from any CPU attached to any cell and cache coherency | |
26 | +is handled in hardware by the processor caches and/or the system interconnect. | |
35 | 27 | |
36 | -The NUMA aware page allocation code currently tries to allocate pages | |
37 | -from different nodes in a round robin manner. This will be changed to | |
38 | -do concentratic circle search, starting from current node, once the | |
39 | -NUMA port achieves more maturity. The call alloc_pages_node has been | |
40 | -added, so that drivers can make the call and not worry about whether | |
41 | -it is running on a NUMA or UMA platform. | |
28 | +Memory access time and effective memory bandwidth varies depending on how far | |
29 | +away the cell containing the CPU or IO bus making the memory access is from the | |
30 | +cell containing the target memory. For example, access to memory by CPUs | |
31 | +attached to the same cell will experience faster access times and higher | |
32 | +bandwidths than accesses to memory on other, remote cells. NUMA platforms | |
33 | +can have cells at multiple remote distances from any given cell. | |
34 | + | |
35 | +Platform vendors don't build NUMA systems just to make software developers' | |
36 | +lives interesting. Rather, this architecture is a means to provide scalable | |
37 | +memory bandwidth. However, to achieve scalable memory bandwidth, system and | |
38 | +application software must arrange for a large majority of the memory references | |
39 | +[cache misses] to be to "local" memory--memory on the same cell, if any--or | |
40 | +to the closest cell with memory. | |
41 | + | |
42 | +This leads to the Linux software view of a NUMA system: | |
43 | + | |
44 | +Linux divides the system's hardware resources into multiple software | |
45 | +abstractions called "nodes". Linux maps the nodes onto the physical cells | |
46 | +of the hardware platform, abstracting away some of the details for some | |
47 | +architectures. As with physical cells, software nodes may contain 0 or more | |
48 | +CPUs, memory and/or IO buses. And, again, memory accesses to memory on | |
49 | +"closer" nodes--nodes that map to closer cells--will generally experience | |
50 | +faster access times and higher effective bandwidth than accesses to more | |
51 | +remote cells. | |
52 | + | |
53 | +For some architectures, such as x86, Linux will "hide" any node representing a | |
54 | +physical cell that has no memory attached, and reassign any CPUs attached to | |
55 | +that cell to a node representing a cell that does have memory. Thus, on | |
56 | +these architectures, one cannot assume that all CPUs that Linux associates with | |
57 | +a given node will see the same local memory access times and bandwidth. | |
58 | + | |
59 | +In addition, for some architectures--again, x86 is an example--Linux supports |
60 | +the emulation of additional nodes. For NUMA emulation, Linux will carve up |
61 | +the existing nodes--or the system memory for non-NUMA platforms--into multiple | |
62 | +nodes. Each emulated node will manage a fraction of the underlying cells' | |
63 | +physical memory. NUMA emulation is useful for testing NUMA kernel and |
64 | +application features on non-NUMA platforms, and as a sort of memory resource | |
65 | +management mechanism when used together with cpusets. | |
66 | +[see Documentation/cgroups/cpusets.txt] | |
67 | + | |
68 | +For each node with memory, Linux constructs an independent memory management | |
69 | +subsystem, complete with its own free page lists, in-use page lists, usage | |
70 | +statistics and locks to mediate access. In addition, Linux constructs for | |
71 | +each memory zone [one or more of DMA, DMA32, NORMAL, HIGH_MEMORY, MOVABLE], | |
72 | +an ordered "zonelist". A zonelist specifies the zones/nodes to visit when a | |
73 | +selected zone/node cannot satisfy the allocation request. This situation, | |
74 | +when a zone has no available memory to satisfy a request, is called | |
75 | +"overflow" or "fallback". | |
76 | + | |
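A minimal sketch of how a caller steers this zonelist walk, assuming kernel context (the helper name below is illustrative, and GFP_DMA is just one example of a zone modifier):

    #include <linux/gfp.h>

    /* GFP_DMA constrains the request to ZONE_DMA; if the local node's DMA
     * zone cannot satisfy it, the allocator "falls back" along the
     * zonelist to DMA zones on other nodes. */
    static struct page *grab_dma_page(void)
    {
            return alloc_pages(GFP_KERNEL | GFP_DMA, 0);
    }
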
77 | +Because some nodes contain multiple zones containing different types of | |
78 | +memory, Linux must decide whether to order the zonelists such that allocations | |
79 | +fall back to the same zone type on a different node, or to a different zone | |
80 | +type on the same node. This is an important consideration because some zones, | |
81 | +such as DMA or DMA32, represent relatively scarce resources. Linux chooses | |
82 | +a default zonelist order based on the sizes of the various zone types relative | |
83 | +to the total memory of the node and the total memory of the system. The | |
84 | +default zonelist order may be overridden using the numa_zonelist_order kernel | |
85 | +boot parameter or sysctl. [see Documentation/kernel-parameters.txt and | |
86 | +Documentation/sysctl/vm.txt] | |
87 | + | |
88 | +By default, Linux will attempt to satisfy memory allocation requests from the | |
89 | +node to which the CPU that executes the request is assigned. Specifically, | |
90 | +Linux will attempt to allocate from the first node in the appropriate zonelist | |
91 | +for the node where the request originates. This is called "local allocation." | |
92 | +If the "local" node cannot satisfy the request, the kernel will examine other | |
93 | +nodes' zones in the selected zonelist looking for the first zone in the list | |
94 | +that can satisfy the request. | |
95 | + | |
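A minimal sketch of a default, "local" allocation, assuming kernel context (the helper name is illustrative):

    #include <linux/gfp.h>

    /* With no node specified, the request starts at the head of the
     * calling CPU's node's zonelist, so memory is local when possible and
     * overflows to more remote zones only when the local zones cannot
     * satisfy the request. */
    static struct page *grab_local_page(unsigned int order)
    {
            return alloc_pages(GFP_KERNEL, order);
    }
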
96 | +Local allocation will tend to keep subsequent access to the allocated memory | |
97 | +"local" to the underlying physical resources and off the system interconnect-- | |
98 | +as long as the task on whose behalf the kernel allocated some memory does not | |
99 | +later migrate away from that memory. The Linux scheduler is aware of the | |
100 | +NUMA topology of the platform--embodied in the "scheduling domains" data | |
101 | +structures [see Documentation/scheduler/sched-domains.txt]--and the scheduler | |
102 | +attempts to minimize task migration to distant scheduling domains. However, | |
103 | +the scheduler does not take a task's NUMA footprint into account directly. | |
104 | +Thus, under sufficient imbalance, tasks can migrate between nodes, remote | |
105 | +from their initial node and kernel data structures. | |
106 | + | |
107 | +System administrators and application designers can restrict a task's migration | |
108 | +to improve NUMA locality using various CPU affinity command line interfaces, | |
109 | +such as taskset(1) and numactl(1), and program interfaces such as | |
110 | +sched_setaffinity(2). Further, one can modify the kernel's default local | |
111 | +allocation behavior using Linux NUMA memory policy. | |
112 | +[see Documentation/vm/numa_memory_policy.txt] |
113 | + | |
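A user-space sketch of the affinity interfaces mentioned above (CPU 0 is an arbitrary, illustrative choice, and error handling is minimal):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling task to CPU 0 so the scheduler cannot migrate it
     * away from that CPU's node; numactl(1) or set_mempolicy(2) can then
     * further constrain where its memory is allocated. */
    int pin_to_cpu0(void)
    {
            cpu_set_t mask;

            CPU_ZERO(&mask);
            CPU_SET(0, &mask);
            if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
                    perror("sched_setaffinity");
                    return -1;
            }
            return 0;
    }
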
114 | +System administrators can restrict the CPUs and nodes' memories that a non- | |
115 | +privileged user can specify in the scheduling or NUMA commands and functions | |
116 | +using control groups and CPUsets. [see Documentation/cgroups/cpusets.txt] |
117 | + | |
118 | +On architectures that do not hide memoryless nodes, Linux will include only | |
119 | +zones [nodes] with memory in the zonelists. This means that for a memoryless | |
120 | +node, the "local memory node"--the node of the first zone in the CPU's node's |
121 | +zonelist--will not be the node itself. Rather, it will be the node that the | |
122 | +kernel selected as the nearest node with memory when it built the zonelists. | |
123 | +So, by default, local allocations will succeed with the kernel supplying the |
124 | +closest available memory. This is a consequence of the same mechanism that | |
125 | +allows such allocations to fallback to other nearby nodes when a node that | |
126 | +does contain memory overflows. | |
127 | + | |
128 | +Some kernel allocations do not want or cannot tolerate this allocation fallback | |
129 | +behavior. Rather they want to be sure they get memory from the specified node | |
130 | +or get notified that the node has no free memory. This is usually the case when | |
131 | +a subsystem allocates per CPU memory resources, for example. | |
132 | + | |
133 | +A typical model for making such an allocation is to obtain the node id of the | |
134 | +node to which the "current CPU" is attached using one of the kernel's | |
135 | +numa_node_id() or cpu_to_node() functions and then request memory from only |
136 | +the node id returned. When such an allocation fails, the requesting subsystem | |
137 | +may revert to its own fallback path. The slab kernel memory allocator is an | |
138 | +example of this. Or, the subsystem may choose to disable or not to enable | |
139 | +itself on allocation failure. The kernel profiling subsystem is an example of | |
140 | +this. | |
141 | + | |
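A minimal sketch of that model, assuming kernel context (the helper name is illustrative; __GFP_THISNODE is one way to request "this node only" behavior and is not mentioned in the text above):

    #include <linux/gfp.h>
    #include <linux/topology.h>

    /* Request memory strictly from the local node; __GFP_THISNODE
     * suppresses the zonelist fallback, so failure is reported to the
     * caller, which may then take its own fallback path. */
    static struct page *alloc_strictly_local(unsigned int order)
    {
            int nid = numa_node_id();

            return alloc_pages_node(nid, GFP_KERNEL | __GFP_THISNODE, order);
    }
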
142 | +If the architecture supports--does not hide--memoryless nodes, then CPUs | |
143 | +attached to memoryless nodes would always incur the fallback path overhead | |
144 | +or some subsystems would fail to initialize if they attempted to allocate |
145 | +memory exclusively from a node without memory. To support such | |
146 | +architectures transparently, kernel subsystems can use the numa_mem_id() | |
147 | +or cpu_to_mem() function to locate the "local memory node" for the calling or | |
148 | +specified CPU. Again, this is the same node from which default, local page | |
149 | +allocations will be attempted. |
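
A matching sketch using the memoryless-node-aware interface, assuming kernel context (the helper name is illustrative):

    #include <linux/slab.h>
    #include <linux/topology.h>

    /* On a memoryless node, numa_mem_id() returns the nearest node that
     * does have memory, so the allocation is not restricted to a node
     * that can never satisfy it. */
    static void *alloc_near_cpu(size_t size)
    {
            return kmalloc_node(size, GFP_KERNEL, numa_mem_id());
    }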