17 Oct, 2007

40 commits

  • cpusets try to ensure that any node added to a cpuset's mems_allowed is
    on-line and contains memory. The assumption was that online nodes contained
    memory. Thus, it is possible to add memoryless nodes to a cpuset and then add
    tasks to this cpuset. This results in a continuous series of oom-kills and an
    apparent system hang.

    Change cpusets to use node_states[N_HIGH_MEMORY] [a.k.a. node_memory_map] in
    place of node_online_map when vetting memories. Return error if admin
    attempts to write a non-empty mems_allowed node mask containing only
    memoryless-nodes.

    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Bob Picco
    Signed-off-by: Nishanth Aravamudan
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
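
The vetting rule described in this commit can be sketched in plain userspace C. This is a minimal model, not the cpuset code itself: the 64-bit mask type, the function name, the -ENOSPC return value, and the choice to drop memoryless nodes from a mixed mask are all assumptions made for illustration.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical userspace model: bit n set means NUMA node n is in the mask. */
typedef uint64_t nodemask_model_t;

/*
 * Vet a requested mems_allowed mask against the set of nodes that have
 * memory (the role node_states[N_HIGH_MEMORY] plays in the commit text).
 * On success, *result receives the requested nodes that have memory.
 */
int model_vet_mems_allowed(nodemask_model_t requested,
                           nodemask_model_t nodes_with_memory,
                           nodemask_model_t *result)
{
    nodemask_model_t usable = requested & nodes_with_memory;

    /* A non-empty request containing only memoryless nodes is an error. */
    if (requested != 0 && usable == 0)
        return -ENOSPC; /* error code is an assumption of this sketch */

    /* Assumption: memoryless nodes in a mixed mask are silently dropped. */
    *result = usable;
    return 0;
}
```

With nodes 0 and 1 holding memory (mask 0x3), a request for nodes 1 and 2 (mask 0x6) succeeds and yields only node 1, while a request for node 2 alone (mask 0x4) is rejected. An empty request is still accepted.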
     
  • GFP_THISNODE checks that the zone selected is within the pgdat (node) of the
    first zone of a nodelist. That only works if the node has memory. A
    memoryless node will have its first zone on another pgdat (node).

    GFP_THISNODE currently just returns memory from the first pgdat, so it can
    return memory from other nodes. GFP_THISNODE should fail if there is no
    local memory on a node.

    Add a new set of zonelists for each node that only contain the zones that
    belong to the node itself so that no fallback is possible.

    Then modify gfp_type to pick up the right zone based on the presence of
    __GFP_THISNODE.

    Drop the existing GFP_THISNODE checks from the page allocator's hot path.

    Signed-off-by: Christoph Lameter
    Acked-by: Nishanth Aravamudan
    Tested-by: Lee Schermerhorn
    Acked-by: Bob Picco
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
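
The two-zonelist idea can be illustrated with a small userspace model. Everything below is hypothetical (struct names, MODEL_GFP_THISNODE, the round-robin fallback order, one list entry per node rather than per zone). It only shows the design point: once each node carries a node-local list with no fallback entries, the allocation loop needs no special check for the flag.

```c
#define MODEL_MAX_NODES 4
#define MODEL_GFP_THISNODE 0x1u /* stand-in for __GFP_THISNODE */

/* Hypothetical zonelist: an ordered list of node ids to try. */
struct zonelist_model {
    int nodes[MODEL_MAX_NODES];
    int count;
};

struct node_model {
    int total_pages;                /* 0 means the node is memoryless */
    int free_pages;
    struct zonelist_model lists[2]; /* [0] with fallback, [1] node-local only */
};

/* Build both lists for every node; memoryless nodes never appear in a list. */
void model_build_zonelists(struct node_model *nodes, int nr_nodes)
{
    for (int n = 0; n < nr_nodes; n++) {
        struct zonelist_model *fallback = &nodes[n].lists[0];
        struct zonelist_model *local = &nodes[n].lists[1];

        fallback->count = 0;
        local->count = 0;
        /* Fallback order (an assumption): this node first, then the rest. */
        for (int i = 0; i < nr_nodes; i++) {
            int candidate = (n + i) % nr_nodes;
            if (nodes[candidate].total_pages > 0)
                fallback->nodes[fallback->count++] = candidate;
        }
        /* The node-local list holds only the node itself, if it has memory. */
        if (nodes[n].total_pages > 0)
            local->nodes[local->count++] = n;
    }
}

/* Returns the node the page came from, or -1 if the chosen list is exhausted. */
int model_alloc_page(struct node_model *nodes, int node, unsigned int flags)
{
    /* The flag only selects the list; the walk below is identical for both. */
    int idx = (flags & MODEL_GFP_THISNODE) ? 1 : 0;
    struct zonelist_model *zl = &nodes[node].lists[idx];

    for (int i = 0; i < zl->count; i++) {
        int candidate = zl->nodes[i];
        if (nodes[candidate].free_pages > 0) {
            nodes[candidate].free_pages--;
            return candidate;
        }
    }
    return -1;
}
```

In this model, an allocation on a memoryless node 0 without the flag is served from a neighbouring node, while the same request with MODEL_GFP_THISNODE fails because the node-local list is empty.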
     
  • get_pfn_range_for_nid() is called multiple times for each node at boot time.
    Each time, it will warn about nodes with no memory, resulting in boot messages
    like:

    Node 0 active with no memory
    Node 0 active with no memory
    Node 0 active with no memory
    Node 0 active with no memory
    Node 0 active with no memory
    Node 0 active with no memory
    On node 0 totalpages: 0
    Node 0 active with no memory
    Node 0 active with no memory
    DMA zone: 0 pages used for memmap
    Node 0 active with no memory
    Node 0 active with no memory
    Normal zone: 0 pages used for memmap
    Node 0 active with no memory
    Node 0 active with no memory
    Movable zone: 0 pages used for memmap

    and so on for each memoryless node.

    We already have the "On node N totalpages: ..." and other related messages, so
    drop the "Node N active with no memory" warnings.

    Signed-off-by: Lee Schermerhorn
    Cc: Bob Picco
    Cc: Nishanth Aravamudan
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • We need the check for a node with cpu in zone reclaim. Zone reclaim will not
    allow remote zone reclaim if a node has a cpu.

    [Lee.Schermerhorn@hp.com: Move setup of N_CPU node state mask]
    Signed-off-by: Christoph Lameter
    Tested-by: Lee Schermerhorn
    Acked-by: Bob Picco
    Cc: Nishanth Aravamudan
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
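
The check described here amounts to a single condition. The sketch below is a userspace model with hypothetical names; the bitmask stands in for the N_CPU node state, and the function is not the kernel's zone reclaim code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the N_CPU node state: bit n set means node n has a CPU. */
typedef uint64_t nodemask_model_t;

/*
 * Decide whether code running on `local_node` may reclaim from a zone that
 * lives on `zone_node`.
 */
bool model_may_zone_reclaim(int zone_node, int local_node,
                            nodemask_model_t nodes_with_cpu)
{
    bool zone_node_has_cpu = (nodes_with_cpu >> zone_node) & 1;

    /* Remote reclaim is refused when the zone's node has its own CPU. */
    if (zone_node_has_cpu && zone_node != local_node)
        return false;
    return true;
}
```

With CPUs on nodes 0 and 1 only, a task on node 0 may reclaim from its own node and from the CPU-less node 2, but not from node 1.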
     
  • Online nodes now may have no memory. The checks and initialization must
    therefore be changed to no longer use the online functions.

    This will correctly initialize the interleave on bootup to only target nodes
    with memory and will make sys_move_pages return an error when a page is to be
    moved to a memoryless node. Similarly we will get an error if MPOL_BIND or
    MPOL_INTERLEAVE is used on a memoryless node.

    These are somewhat new semantics. So far one could specify memoryless nodes
    and we would maybe do the right thing and just ignore the node (or we'd do
    something strange like with MPOL_INTERLEAVE). If we want to allow the
    specification of memoryless nodes via memory policies then we need to keep
    checking for online nodes.

    Signed-off-by: Christoph Lameter
    Acked-by: Nishanth Aravamudan
    Tested-by: Lee Schermerhorn
    Acked-by: Bob Picco
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Processors on memoryless nodes must be able to fall back to remote nodes in
    order to get a profiling buffer. This may lead to excessive NUMA traffic but
    I think we should allow this rather than failing.

    Signed-off-by: Christoph Lameter
    Acked-by: Nishanth Aravamudan
    Acked-by: Lee Schermerhorn
    Acked-by: Bob Picco
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The checks for node_online in the uncached allocator are made to make sure
    that memory is available on these nodes. Thus switch all the checks to use
    N_HIGH_MEMORY instead of N_ONLINE.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Jes Sorensen
    Acked-by: Lee Schermerhorn
    Acked-by: Bob Picco
    Cc: Nishanth Aravamudan
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Simply switch all for_each_online_node to for_each_node_state(NORMAL_MEMORY).
    That way SLUB only operates on nodes with regular memory. Any allocation
    attempt on a memoryless node or a node with just highmem will fail, whereupon
    SLUB will fetch memory from a nearby node (depending on how memory policies
    and cpuset describe fallback).

    Signed-off-by: Christoph Lameter
    Tested-by: Lee Schermerhorn
    Acked-by: Bob Picco
    Cc: Nishanth Aravamudan
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Slab should not allocate control structures for nodes without memory. This
    may seem to work right now but it's unreliable since not all allocations can
    fall back due to the use of GFP_THISNODE.

    Switching a few for_each_online_node's to N_NORMAL_MEMORY will allow us to
    only allocate for nodes that have regular memory.

    Signed-off-by: Christoph Lameter
    Acked-by: Nishanth Aravamudan
    Acked-by: Lee Schermerhorn
    Acked-by: Bob Picco
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • A node without memory does not need a kswapd. So use the memory map instead
    of the online map when starting kswapd.

    Signed-off-by: Christoph Lameter
    Acked-by: Nishanth Aravamudan
    Tested-by: Lee Schermerhorn
    Acked-by: Bob Picco
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • constrained_alloc() builds its own memory map for nodes with memory. We have
    that available in N_HIGH_MEMORY now. So simplify the code.

    Signed-off-by: Christoph Lameter
    Acked-by: Nishanth Aravamudan
    Acked-by: Lee Schermerhorn
    Acked-by: Bob Picco
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • MPOL_INTERLEAVE currently simply loops over all nodes. Allocations on
    memoryless nodes will be redirected to nodes with memory. This results in an
    imbalance because the neighboring nodes to memoryless nodes will get
    significantly more interleave hits than the rest of the nodes on the system.

    We can avoid this imbalance by clearing the nodes in the interleave node set
    that have no memory. If we use the node map of the memory nodes instead of
    the online nodes then we have only the nodes we want.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Nishanth Aravamudan
    Tested-by: Lee Schermerhorn
    Acked-by: Bob Picco
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
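
The effect of clearing memoryless nodes from the interleave set can be shown with a small round-robin model. The names and the 64-bit mask are assumptions of this sketch, not the mempolicy implementation.

```c
#include <stdint.h>

/* Hypothetical userspace model: bit n set means NUMA node n is in the mask. */
typedef uint64_t nodemask_model_t;

/* Restrict a requested interleave set to the nodes that have memory. */
nodemask_model_t model_interleave_set(nodemask_model_t requested,
                                      nodemask_model_t nodes_with_memory)
{
    return requested & nodes_with_memory;
}

/*
 * Return the next node in `mask` after `prev`, wrapping around.
 * `prev` is expected to be in the range -1..63; returns -1 for an empty mask.
 */
int model_next_interleave_node(nodemask_model_t mask, int prev)
{
    if (mask == 0)
        return -1;
    for (int step = 1; step <= 64; step++) {
        int node = (prev + step) % 64;
        if (mask & ((nodemask_model_t)1 << node))
            return node;
    }
    return -1;
}
```

With nodes 0 to 3 requested and node 2 memoryless, the walk visits 0, 1, 3, 0, and so on. Each node with memory gets one hit per round, instead of a neighbour of node 2 absorbing its share.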
     
  • It is necessary to know if nodes have memory since we have recently begun to
    add support for memoryless nodes. For that purpose we introduce two new
    node states: N_HIGH_MEMORY and N_NORMAL_MEMORY.

    A node has its bit in N_HIGH_MEMORY set if it has any memory regardless of the
    type of memory. If a node has memory then it has at least one zone defined
    in its pgdat structure that is located in the pgdat itself.

    A node has its bit in N_NORMAL_MEMORY set if it has a lower zone than
    ZONE_HIGHMEM. This means it is possible to allocate memory that is not
    subject to kmap.

    N_HIGH_MEMORY and N_NORMAL_MEMORY can then be used in various places to ensure
    that we do the right thing when we encounter a memoryless node.

    [akpm@linux-foundation.org: build fix]
    [Lee.Schermerhorn@hp.com: update N_HIGH_MEMORY node state for memory hotadd]
    [y-goto@jp.fujitsu.com: Fix memory hotplug + sparsemem build]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Nishanth Aravamudan
    Signed-off-by: Christoph Lameter
    Acked-by: Bob Picco
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Yasunori Goto
    Signed-off-by: Paul Mundt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
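
The two definitions can be written down as predicates over a node's zones. This is a userspace sketch with a hypothetical three-zone layout and made-up names. It mirrors the wording above and does not reproduce how the kernel computes the masks.

```c
#include <stdbool.h>

/* Hypothetical zone layout, ordered from lowest to highest. */
enum model_zone_type {
    MODEL_ZONE_DMA,
    MODEL_ZONE_NORMAL,
    MODEL_ZONE_HIGHMEM,
    MODEL_NR_ZONES
};

/* Hypothetical per-node data: pages present in each zone of the node. */
struct pgdat_model {
    unsigned long present_pages[MODEL_NR_ZONES];
};

/* N_HIGH_MEMORY in the text: the node has memory of any type. */
bool model_node_has_high_memory(const struct pgdat_model *pgdat)
{
    for (int z = 0; z < MODEL_NR_ZONES; z++)
        if (pgdat->present_pages[z] > 0)
            return true;
    return false;
}

/* N_NORMAL_MEMORY in the text: the node has a populated zone below HIGHMEM. */
bool model_node_has_normal_memory(const struct pgdat_model *pgdat)
{
    for (int z = 0; z < MODEL_ZONE_HIGHMEM; z++)
        if (pgdat->present_pages[z] > 0)
            return true;
    return false;
}
```

A node with only highmem pages satisfies the first predicate but not the second, which is the case the SLUB and slab changes in this series care about.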
     
  • Why do we need to support memoryless nodes?

    KAMEZAWA Hiroyuki wrote:

    > For fujitsu, problem is called "empty" node.
    >
    > When ACPI's SRAT table includes "possible nodes", ia64 bootstrap(acpi_numa_init)
    > creates nodes, which includes no memory, no cpu.
    >
    > I tried to remove empty-node in past, but that was denied.
    > It was because we can hot-add cpu to the empty node.
    > (node-hotplug triggered by cpu is not implemented now. and it will be ugly.)
    >
    >
    > For HP, (Lee can comment on this later), they have memory-less-node.
    > As far as I hear, HP's machine can have following configuration.
    >
    > (example)
    > Node0: CPU0 memory AAA MB
    > Node1: CPU1 memory AAA MB
    > Node2: CPU2 memory AAA MB
    > Node3: CPU3 memory AAA MB
    > Node4: Memory XXX GB
    >
    > AAA is very small value (below 16MB) and will be omitted by ia64 bootstrap.
    > After boot, only Node 4 has valid memory (but have no cpu.)
    >
    > Maybe this is memory-interleave by firmware config.

    Christoph Lameter wrote:

    > Future SGI platforms (actually also current one can have but nothing like
    > that is deployed to my knowledge) have nodes with only cpus. Current SGI
    > platforms have nodes with just I/O that we so far cannot manage in the
    > core. So the arch code maps them to the nearest memory node.

    Lee Schermerhorn wrote:

    > For the HP platforms, we can configure each cell with from 0% to 100%
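    > "cell local memory". When we configure with <100% CLM, the "missing
    > percentages" are interleaved by hardware on a cache-line granularity to
    > improve bandwidth at the expense of latency for numa-challenged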
    > applications [and OSes, but not our problem ;-)]. When we boot Linux on
    > such a config, all of the real nodes have no memory--it all resides in a
    > single interleaved pseudo-node.
    >
    > When we boot Linux on a 100% CLM configuration [== NUMA], we still have
    > the interleaved pseudo-node. It contains a few hundred MB stolen from
    > the real nodes to contain the DMA zone. [Interleaved memory resides at
    > phys addr 0]. The memoryless-nodes patches, along with the zoneorder
    > patches, support this config as well.
    >
    > Also, when we boot a NUMA config with the "mem=" command line,
    > specifying less memory than actually exists, Linux takes the excluded
    > memory "off the top" rather than distributing it across the nodes. This
    > can result in memoryless nodes, as well.
    >

    This patch:

    Preparation for memoryless node patches.

    Provide a generic way to keep nodemasks describing various characteristics of
    NUMA nodes.

    Remove the node_online_map and the node_possible_map and realize the same
    functionality using two node states: N_POSSIBLE and N_ONLINE.

    [Lee.Schermerhorn@hp.com: Initialize N_*_MEMORY and N_CPU masks for non-NUMA config]
    Signed-off-by: Christoph Lameter
    Tested-by: Lee Schermerhorn
    Acked-by: Lee Schermerhorn
    Acked-by: Bob Picco
    Cc: Nishanth Aravamudan
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Lee Schermerhorn
    Cc: "Serge E. Hallyn"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
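
A generic "array of node masks indexed by state" can be modelled in a few lines of userspace C. The names below are hypothetical and the masks are plain 64-bit integers. The point is only that one array replaces separate per-purpose maps, and that further states can be appended to the enum without adding new accessors.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state list; more states can be appended before the last entry. */
enum model_node_state {
    MODEL_N_POSSIBLE, /* the node could become online at some point */
    MODEL_N_ONLINE,   /* the node is online */
    MODEL_NR_NODE_STATES
};

/* One mask per state; bit n set means node n is in that state (nodes 0..63). */
uint64_t model_node_states[MODEL_NR_NODE_STATES];

void model_node_set_state(int node, enum model_node_state state)
{
    model_node_states[state] |= (uint64_t)1 << node;
}

void model_node_clear_state(int node, enum model_node_state state)
{
    model_node_states[state] &= ~((uint64_t)1 << node);
}

bool model_node_state(int node, enum model_node_state state)
{
    return (model_node_states[state] >> node) & 1;
}

/* Count the nodes currently in the given state. */
int model_num_node_state(enum model_node_state state)
{
    int count = 0;

    for (int node = 0; node < 64; node++)
        if (model_node_state(node, state))
            count++;
    return count;
}
```

A caller that used to consult a dedicated online map would instead ask model_node_state(node, MODEL_N_ONLINE) in this model.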
     
  • prepare/commit_write no longer returns AOP_TRUNCATED_PAGE since OCFS2 and
    GFS2 were converted to the new aops, so we can make some simplifications
    for that.

    [michal.k.k.piotrowski@gmail.com: fix warning]
    Signed-off-by: Nick Piggin
    Cc: Michael Halcrow
    Cc: Mark Fasheh
    Cc: Steven Whitehouse
    Signed-off-by: Michal Piotrowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Implement nobh in new aops. This is a bit tricky. FWIW, nobh_truncate is
    now implemented in a way that does not create blocks in sparse regions,
    which is a silly thing for it to have been doing (isn't it?)

    ext2 survives fsx and fsstress. jfs is converted as well... ext3
    should be easy to do (but not done yet).

    [akpm@linux-foundation.org: coding-style fixes]
    Cc: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Plug ocfs2 into the ->write_begin and ->write_end aops.

    A bunch of custom code is now gone - the iovec iteration stuff during write
    and the ocfs2 splice write actor.

    Signed-off-by: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Cc: Roman Zippel
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Acked-by: Russell King
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Nick Piggin
    Acked-by: Dave Kleikamp
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Nick Piggin
    Cc: Andries Brouwer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Nick Piggin
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Convert udf to new aops. Also seem to have fixed pagecache corruption in
    udf_adinicb_commit_write -- page was marked uptodate when it is not. Also,
    fixed the silly setup where prepare_write was doing a kmap to be used in
    commit_write: just do kmap_atomic in write_end. Use libfs helpers to make
    this easier.

    Signed-off-by: Nick Piggin
    Cc:
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Nick Piggin
    Cc: Evgeniy Dushistov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Nick Piggin
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • This also gets rid of a lot of useless read_file stuff. And also
    optimises the full page write case by marking a !uptodate page uptodate.

    Signed-off-by: Nick Piggin
    Cc: Jeff Dike
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • [mszeredi]
    - don't send zero length write requests
    - it is not legal for the filesystem to return with zero written bytes

    Signed-off-by: Nick Piggin
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • [akpm@linux-foundation.org: fix against git-nfs]
    [peterz@infradead.org: fix against git-nfs]
    Signed-off-by: Nick Piggin
    Acked-by: Trond Myklebust
    Cc: "J. Bruce Fields"
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • This patch makes reiserfs use AOP_FLAG_CONT_EXPAND
    in order to get rid of the special generic_cont_expand routine.

    Signed-off-by: Vladimir Saveliev
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Saveliev
     
  • Convert reiserfs to new aops

    Signed-off-by: Vladimir Saveliev
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Saveliev
     
  • Make reiserfs write via generic routines.
    The original reiserfs write path, optimized for big writes, is deadlock prone.

    Signed-off-by: Vladimir Saveliev
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Saveliev
     
  • Signed-off-by: Nick Piggin
    Acked-by: Anders Larsen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Nick Piggin
    Cc: Tigran Aivazian
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Nick Piggin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Nick Piggin
    Cc: Roman Zippel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Nick Piggin
    Cc: Roman Zippel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Nick Piggin
    Cc: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Rework the generic block "cont" routines to handle the new aops. Supporting
    cont_prepare_write as well would take quite a lot of code, so remove it
    instead (all filesystems that used it are converted to the new routines later).

    write_begin gets passed AOP_FLAG_CONT_EXPAND when called from
    generic_cont_expand, so filesystems can avoid the old hacks they used.

    Signed-off-by: Nick Piggin
    Cc: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin