Commit b9498bfe86530fd54fb855906383c0c905a52c80
Committed by: Linus Torvalds
Parent: 3dd6b5fb43
Exists in: master and 4 other branches
numa: update Documentation/vm/numa, add memoryless node info
Kamezawa Hiroyuki requested documentation for the numa_mem_id() and slab
related changes.  He suggested Documentation/vm/numa for this documentation.

Looking at this file, it seems to me to be hopelessly out of date relative
to current Linux NUMA support.  At the risk of going down a rathole, I have
made an attempt to rewrite the doc at a slightly higher level [I think] and
provide pointers to other in-tree documents and out-of-tree man pages that
cover the details.

Let the games begin.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Showing 1 changed file with 147 additions and 39 deletions
Documentation/vm/numa
1 | 1 | Started Nov 1999 by Kanoj Sarcar <kanoj@sgi.com> |
2 | 2 | |
3 | -The intent of this file is to have an uptodate, running commentary | |
4 | -from different people about NUMA specific code in the Linux vm. | |
3 | +What is NUMA? | |
5 | 4 | |
6 | -What is NUMA? It is an architecture where the memory access times | |
7 | -for different regions of memory from a given processor varies | |
8 | -according to the "distance" of the memory region from the processor. | |
9 | -Each region of memory to which access times are the same from any | |
10 | -cpu, is called a node. On such architectures, it is beneficial if | |
11 | -the kernel tries to minimize inter node communications. Schemes | |
12 | -for this range from kernel text and read-only data replication | |
13 | -across nodes, and trying to house all the data structures that | |
14 | -key components of the kernel need on memory on that node. | |
5 | +This question can be answered from a couple of perspectives: the | |
6 | +hardware view and the Linux software view. | |
15 | 7 | |
16 | -Currently, all the numa support is to provide efficient handling | |
17 | -of widely discontiguous physical memory, so architectures which | |
18 | -are not NUMA but can have huge holes in the physical address space | |
19 | -can use the same code. All this code is bracketed by CONFIG_DISCONTIGMEM. | |
8 | +From the hardware perspective, a NUMA system is a computer platform that | |
9 | +comprises multiple components or assemblies each of which may contain 0 | |
10 | +or more CPUs, local memory, and/or IO buses. For brevity and to | |
11 | +disambiguate the hardware view of these physical components/assemblies | |
12 | +from the software abstraction thereof, we'll call the components/assemblies | |
13 | +'cells' in this document. | |
20 | 14 | |
21 | -The initial port includes NUMAizing the bootmem allocator code by | |
22 | -encapsulating all the pieces of information into a bootmem_data_t | |
23 | -structure. Node specific calls have been added to the allocator. | |
24 | -In theory, any platform which uses the bootmem allocator should | |
25 | -be able to put the bootmem and mem_map data structures anywhere | |
26 | -it deems best. | |
15 | +Each of the 'cells' may be viewed as an SMP [symmetric multi-processor] subset | |
16 | +of the system--although some components necessary for a stand-alone SMP system | |
17 | +may not be populated on any given cell. The cells of the NUMA system are | |
18 | +connected together with some sort of system interconnect--e.g., crossbars or |
19 | +point-to-point links are common types of NUMA system interconnects. Both of |
20 | +these types of interconnects can be aggregated to create NUMA platforms with | |
21 | +cells at multiple distances from other cells. | |
27 | 22 | |
28 | -Each node's page allocation data structures have also been encapsulated | |
29 | -into a pg_data_t. The bootmem_data_t is just one part of this. To | |
30 | -make the code look uniform between NUMA and regular UMA platforms, | |
31 | -UMA platforms have a statically allocated pg_data_t too (contig_page_data). | |
32 | -For the sake of uniformity, the function num_online_nodes() is also defined | |
33 | -for all platforms. As we run benchmarks, we might decide to NUMAize | |
34 | -more variables like low_on_memory, nr_free_pages etc into the pg_data_t. | |
23 | +For Linux, the NUMA platforms of interest are primarily what is known as Cache | |
24 | +Coherent NUMA or ccNUMA systems. With ccNUMA systems, all memory is visible | |
25 | +to and accessible from any CPU attached to any cell and cache coherency | |
26 | +is handled in hardware by the processor caches and/or the system interconnect. | |
35 | 27 | |
36 | -The NUMA aware page allocation code currently tries to allocate pages | |
37 | -from different nodes in a round robin manner. This will be changed to | |
38 | -do concentratic circle search, starting from current node, once the | |
39 | -NUMA port achieves more maturity. The call alloc_pages_node has been | |
40 | -added, so that drivers can make the call and not worry about whether | |
41 | -it is running on a NUMA or UMA platform. | |
28 | +Memory access time and effective memory bandwidth varies depending on how far | |
29 | +away the cell containing the CPU or IO bus making the memory access is from the | |
30 | +cell containing the target memory. For example, access to memory by CPUs | |
31 | +attached to the same cell will experience faster access times and higher | |
32 | +bandwidths than accesses to memory on other, remote cells. NUMA platforms | |
33 | +can have cells at multiple remote distances from any given cell. | |
34 | + | |
35 | +Platform vendors don't build NUMA systems just to make software developers' | |
36 | +lives interesting. Rather, this architecture is a means to provide scalable | |
37 | +memory bandwidth. However, to achieve scalable memory bandwidth, system and | |
38 | +application software must arrange for a large majority of the memory references | |
39 | +[cache misses] to be to "local" memory--memory on the same cell, if any--or | |
40 | +to the closest cell with memory. | |
41 | + | |
42 | +This leads to the Linux software view of a NUMA system: | |
43 | + | |
44 | +Linux divides the system's hardware resources into multiple software | |
45 | +abstractions called "nodes". Linux maps the nodes onto the physical cells | |
46 | +of the hardware platform, abstracting away some of the details for some | |
47 | +architectures. As with physical cells, software nodes may contain 0 or more | |
48 | +CPUs, memory and/or IO buses. And, again, memory accesses to memory on | |
49 | +"closer" nodes--nodes that map to closer cells--will generally experience | |
50 | +faster access times and higher effective bandwidth than accesses to more | |
51 | +remote cells. | |
52 | + | |
53 | +For some architectures, such as x86, Linux will "hide" any node representing a | |
54 | +physical cell that has no memory attached, and reassign any CPUs attached to | |
55 | +that cell to a node representing a cell that does have memory. Thus, on | |
56 | +these architectures, one cannot assume that all CPUs that Linux associates with | |
57 | +a given node will see the same local memory access times and bandwidth. | |
58 | + | |
59 | +In addition, for some architectures--again, x86 is an example--Linux supports |
60 | +the emulation of additional nodes. For NUMA emulation, Linux will carve up |
61 | +the existing nodes--or the system memory for non-NUMA platforms--into multiple | |
62 | +nodes. Each emulated node will manage a fraction of the underlying cells' | |
63 | +physical memory. NUMA emulation is useful for testing NUMA kernel and |
64 | +application features on non-NUMA platforms, and as a sort of memory resource | |
65 | +management mechanism when used together with cpusets. | |
66 | +[see Documentation/cgroups/cpusets.txt] | |
67 | + | |
68 | +For each node with memory, Linux constructs an independent memory management | |
69 | +subsystem, complete with its own free page lists, in-use page lists, usage | |
70 | +statistics and locks to mediate access. In addition, Linux constructs for | |
71 | +each memory zone [one or more of DMA, DMA32, NORMAL, HIGH_MEMORY, MOVABLE], | |
72 | +an ordered "zonelist". A zonelist specifies the zones/nodes to visit when a | |
73 | +selected zone/node cannot satisfy the allocation request. This situation, | |
74 | +when a zone has no available memory to satisfy a request, is called | |
75 | +"overflow" or "fallback". | |
76 | + | |
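A minimal sketch of how a caller steers this zonelist walk, assuming kernel context (the helper name below is illustrative, and GFP_DMA is just one example of a zone modifier):

    #include <linux/gfp.h>

    /* GFP_DMA constrains the request to ZONE_DMA; if the local node's DMA
     * zone cannot satisfy it, the allocator "falls back" along the
     * zonelist to DMA zones on other nodes. */
    static struct page *grab_dma_page(void)
    {
            return alloc_pages(GFP_KERNEL | GFP_DMA, 0);
    }
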
77 | +Because some nodes contain multiple zones containing different types of | |
78 | +memory, Linux must decide whether to order the zonelists such that allocations | |
79 | +fall back to the same zone type on a different node, or to a different zone | |
80 | +type on the same node. This is an important consideration because some zones, | |
81 | +such as DMA or DMA32, represent relatively scarce resources. Linux chooses | |
82 | +a default zonelist order based on the sizes of the various zone types relative | |
83 | +to the total memory of the node and the total memory of the system. The | |
84 | +default zonelist order may be overridden using the numa_zonelist_order kernel | |
85 | +boot parameter or sysctl. [see Documentation/kernel-parameters.txt and | |
86 | +Documentation/sysctl/vm.txt] | |
87 | + | |
88 | +By default, Linux will attempt to satisfy memory allocation requests from the | |
89 | +node to which the CPU that executes the request is assigned. Specifically, | |
90 | +Linux will attempt to allocate from the first node in the appropriate zonelist | |
91 | +for the node where the request originates. This is called "local allocation." | |
92 | +If the "local" node cannot satisfy the request, the kernel will examine other | |
93 | +nodes' zones in the selected zonelist looking for the first zone in the list | |
94 | +that can satisfy the request. | |
95 | + | |
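A minimal sketch of a default, "local" allocation, assuming kernel context (the helper name is illustrative):

    #include <linux/gfp.h>

    /* With no node specified, the request starts at the head of the
     * calling CPU's node's zonelist, so memory is local when possible and
     * overflows to more remote zones only when the local zones cannot
     * satisfy the request. */
    static struct page *grab_local_page(unsigned int order)
    {
            return alloc_pages(GFP_KERNEL, order);
    }
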
96 | +Local allocation will tend to keep subsequent access to the allocated memory | |
97 | +"local" to the underlying physical resources and off the system interconnect-- | |
98 | +as long as the task on whose behalf the kernel allocated some memory does not | |
99 | +later migrate away from that memory. The Linux scheduler is aware of the | |
100 | +NUMA topology of the platform--embodied in the "scheduling domains" data | |
101 | +structures [see Documentation/scheduler/sched-domains.txt]--and the scheduler | |
102 | +attempts to minimize task migration to distant scheduling domains. However, | |
103 | +the scheduler does not take a task's NUMA footprint into account directly. | |
104 | +Thus, under sufficient imbalance, tasks can migrate between nodes, remote | |
105 | +from their initial node and kernel data structures. | |
106 | + | |
107 | +System administrators and application designers can restrict a task's migration | |
108 | +to improve NUMA locality using various CPU affinity command line interfaces, | |
109 | +such as taskset(1) and numactl(1), and program interfaces such as | |
110 | +sched_setaffinity(2). Further, one can modify the kernel's default local | |
111 | +allocation behavior using Linux NUMA memory policy. | |
112 | +[see Documentation/vm/numa_memory_policy.txt] |
113 | + | |
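A user-space sketch of the affinity interfaces mentioned above (CPU 0 is an arbitrary, illustrative choice, and error handling is minimal):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling task to CPU 0 so the scheduler cannot migrate it
     * away from that CPU's node; numactl(1) or set_mempolicy(2) can then
     * further constrain where its memory is allocated. */
    int pin_to_cpu0(void)
    {
            cpu_set_t mask;

            CPU_ZERO(&mask);
            CPU_SET(0, &mask);
            if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
                    perror("sched_setaffinity");
                    return -1;
            }
            return 0;
    }
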
114 | +System administrators can restrict the CPUs and nodes' memories that a non- | |
115 | +privileged user can specify in the scheduling or NUMA commands and functions | |
116 | +using control groups and CPUsets. [see Documentation/cgroups/cpusets.txt] |
117 | + | |
118 | +On architectures that do not hide memoryless nodes, Linux will include only | |
119 | +zones [nodes] with memory in the zonelists. This means that for a memoryless | |
120 | +node, the "local memory node"--the node of the first zone in the CPU's node's |
121 | +zonelist--will not be the node itself. Rather, it will be the node that the | |
122 | +kernel selected as the nearest node with memory when it built the zonelists. | |
123 | +So, by default, local allocations will succeed with the kernel supplying the |
124 | +closest available memory. This is a consequence of the same mechanism that | |
125 | +allows such allocations to fallback to other nearby nodes when a node that | |
126 | +does contain memory overflows. | |
127 | + | |
128 | +Some kernel allocations do not want or cannot tolerate this allocation fallback | |
129 | +behavior. Rather they want to be sure they get memory from the specified node | |
130 | +or get notified that the node has no free memory. This is usually the case when | |
131 | +a subsystem allocates per CPU memory resources, for example. | |
132 | + | |
133 | +A typical model for making such an allocation is to obtain the node id of the | |
134 | +node to which the "current CPU" is attached using one of the kernel's | |
135 | +numa_node_id() or cpu_to_node() functions and then request memory from only |
136 | +the node id returned. When such an allocation fails, the requesting subsystem | |
137 | +may revert to its own fallback path. The slab kernel memory allocator is an | |
138 | +example of this. Or, the subsystem may choose to disable or not to enable | |
139 | +itself on allocation failure. The kernel profiling subsystem is an example of | |
140 | +this. | |
141 | + | |
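A minimal sketch of that model, assuming kernel context (the helper name is illustrative; __GFP_THISNODE is one way to request "this node only" behavior and is not mentioned in the text above):

    #include <linux/gfp.h>
    #include <linux/topology.h>

    /* Request memory strictly from the local node; __GFP_THISNODE
     * suppresses the zonelist fallback, so failure is reported to the
     * caller, which may then take its own fallback path. */
    static struct page *alloc_strictly_local(unsigned int order)
    {
            int nid = numa_node_id();

            return alloc_pages_node(nid, GFP_KERNEL | __GFP_THISNODE, order);
    }
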
142 | +If the architecture supports--does not hide--memoryless nodes, then CPUs | |
143 | +attached to memoryless nodes would always incur the fallback path overhead | |
144 | +or some subsystems would fail to initialize if they attempted to allocate |
145 | +memory exclusively from a node without memory. To support such | |
146 | +architectures transparently, kernel subsystems can use the numa_mem_id() | |
147 | +or cpu_to_mem() function to locate the "local memory node" for the calling or | |
148 | +specified CPU. Again, this is the same node from which default, local page | |
149 | +allocations will be attempted. |
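
A matching sketch using the memoryless-node-aware interface, assuming kernel context (the helper name is illustrative):

    #include <linux/slab.h>
    #include <linux/topology.h>

    /* On a memoryless node, numa_mem_id() returns the nearest node that
     * does have memory, so the allocation is not restricted to a node
     * that can never satisfy it. */
    static void *alloc_near_cpu(size_t size)
    {
            return kmalloc_node(size, GFP_KERNEL, numa_mem_id());
    }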