27 Sep, 2006

11 commits

  • The NUMA_BUILD constant is always available and will be set to 1 in NUMA
    builds. That way, checks that are valid only under CONFIG_NUMA can easily be
    done without #ifdef CONFIG_NUMA.

    F.e.

    if (NUMA_BUILD && <some NUMA-only condition>) {
    ...
    }
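
    As a rough illustration (not part of the original commit text), the constant
    could be defined along these lines in kernel.h:

    /* sketch: a compile-time constant mirroring CONFIG_NUMA */
    #ifdef CONFIG_NUMA
    #define NUMA_BUILD 1
    #else
    #define NUMA_BUILD 0
    #endif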

    [akpm: not a thing we'd normally do, but CONFIG_NUMA is special: it is
    causing ifdef explosion in core kernel, so let's see if this is a comfortable
    way in which to control that]

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • On larger systems, the amount of output dumped on the console when you do
    SysRq-M is beyond insane. This patch tries to reduce it somewhat; even with
    the smaller NUMA systems that have hit the desktop, this seems a fair thing
    to do.

    The philosophy I have taken is as follows:
    1) If a zone is empty, don't print it; we don't need yet another line
    telling us so. The information is still available, since one can look up
    how many zones were initialized in the first place.
    2) Put as much information on a line as possible; if it can be done
    in one line rather than two, then do it in one. I tried to format
    the temperature stuff for easy reading.

    Change show_free_areas() to not print lines for empty zones. If no zone
    output is printed, the zone is empty. This reduces the number of lines
    dumped to the console in sysrq on a large system by several thousand lines.

    Change the zone temperature printouts to use one line per CPU instead of
    two lines (one hot, one cold). On a 1024 CPU, 1024 node system, this
    reduces the console output by over a million lines of output.

    While this is a bigger problem on large NUMA systems, it is also applicable
    to smaller desktop sized and mid range NUMA systems.

    Old format:

    Mem-info:
    Node 0 DMA per-cpu:
    cpu 0 hot: high 42, batch 7 used:24
    cpu 0 cold: high 14, batch 3 used:1
    cpu 1 hot: high 42, batch 7 used:34
    cpu 1 cold: high 14, batch 3 used:0
    cpu 2 hot: high 42, batch 7 used:0
    cpu 2 cold: high 14, batch 3 used:0
    cpu 3 hot: high 42, batch 7 used:0
    cpu 3 cold: high 14, batch 3 used:0
    cpu 4 hot: high 42, batch 7 used:0
    cpu 4 cold: high 14, batch 3 used:0
    cpu 5 hot: high 42, batch 7 used:0
    cpu 5 cold: high 14, batch 3 used:0
    cpu 6 hot: high 42, batch 7 used:0
    cpu 6 cold: high 14, batch 3 used:0
    cpu 7 hot: high 42, batch 7 used:0
    cpu 7 cold: high 14, batch 3 used:0
    Node 0 DMA32 per-cpu: empty
    Node 0 Normal per-cpu: empty
    Node 0 HighMem per-cpu: empty
    Node 1 DMA per-cpu:
    [snip]
    Free pages: 5410688kB (0kB HighMem)
    Active:9536 inactive:4261 dirty:6 writeback:0 unstable:0 free:338168 slab:1931 mapped:1900 pagetables:208
    Node 0 DMA free:1676304kB min:3264kB low:4080kB high:4896kB active:128048kB inactive:61568kB present:1970880kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0
    Node 0 DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0
    Node 0 Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0
    Node 0 HighMem free:0kB min:512kB low:512kB high:512kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0
    Node 1 DMA free:1951728kB min:3280kB low:4096kB high:4912kB active:5632kB inactive:1504kB present:1982464kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0
    ....

    New format:

    Mem-info:
    Node 0 DMA per-cpu:
    CPU 0: Hot: hi: 42, btch: 7 usd: 41 Cold: hi: 14, btch: 3 usd: 2
    CPU 1: Hot: hi: 42, btch: 7 usd: 40 Cold: hi: 14, btch: 3 usd: 1
    CPU 2: Hot: hi: 42, btch: 7 usd: 0 Cold: hi: 14, btch: 3 usd: 0
    CPU 3: Hot: hi: 42, btch: 7 usd: 0 Cold: hi: 14, btch: 3 usd: 0
    CPU 4: Hot: hi: 42, btch: 7 usd: 0 Cold: hi: 14, btch: 3 usd: 0
    CPU 5: Hot: hi: 42, btch: 7 usd: 0 Cold: hi: 14, btch: 3 usd: 0
    CPU 6: Hot: hi: 42, btch: 7 usd: 0 Cold: hi: 14, btch: 3 usd: 0
    CPU 7: Hot: hi: 42, btch: 7 usd: 0 Cold: hi: 14, btch: 3 usd: 0
    Node 1 DMA per-cpu:
    [snip]
    Free pages: 5411088kB (0kB HighMem)
    Active:9558 inactive:4233 dirty:6 writeback:0 unstable:0 free:338193 slab:1942 mapped:1918 pagetables:208
    Node 0 DMA free:1677648kB min:3264kB low:4080kB high:4896kB active:129296kB inactive:58864kB present:1970880kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0
    Node 1 DMA free:1948448kB min:3280kB low:4096kB high:4912kB active:6864kB inactive:3536kB present:1982464kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0

    Signed-off-by: Jes Sorensen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jes Sorensen
     
  • kmalloc_node() falls back to ___cache_alloc() under certain conditions and
    at that point memory policies may be applied redirecting the allocation
    away from the current node. Therefore kmalloc_node(...,numa_node_id()) or
    kmalloc_node(...,-1) may not return memory from the local node.
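
    For illustration (a hypothetical caller, not from the commit), this is the
    kind of call site that expects node-local memory:

    size_t size = 256;                              /* hypothetical size */
    void *buf = kmalloc_node(size, GFP_KERNEL, numa_node_id());
    /* the caller expects buf to come from the local node; before this fix,
       a memory policy applied in the fallback path could redirect it */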

    Fix this by doing the policy check in __cache_alloc() instead of
    ____cache_alloc().

    This version here is a cleanup of Kiran's patch.

    - Tested on ia64.
    - Extra material removed.
    - Consolidate the exit path if alternate_node_alloc() returned an object.

    [akpm@osdl.org: warning fix]
    Signed-off-by: Alok N Kataria
    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Shai Fultheim
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Clean up the invalidate code, and use a common function to safely remove
    the page from pagecache.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • The VM is supposed to minimise the number of pages which get written off the
    LRU (for IO scheduling efficiency, and for high reclaim-success rates). But
    we don't actually have a clear way of showing how true this is.

    So add `nr_vmscan_write' to /proc/vmstat and /proc/zoneinfo - the number of
    pages which have been written by the vm scanner in this zone and globally.

    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Arch-independent zone-sizing determines the size of a node
    (pgdat->node_spanned_pages) based on the physical memory that was
    registered by the architecture. However, when
    CONFIG_MEMORY_HOTPLUG_RESERVE is set, the architecture expects that the
    spanned_pages will be much larger and that a mem_map will be allocated
    which is used later on memory hot-add.

    This patch allows an architecture that sets CONFIG_MEMORY_HOTPLUG_RESERVE
    to call push_node_boundaries() which will set the node beginning and end to
    at *least* the requested boundary.
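
    A hedged sketch of how an architecture might use this (the call site and
    the hotadd_*_pfn variables are assumptions, not quoted from the patch):

    #ifdef CONFIG_MEMORY_HOTPLUG_RESERVE
            /* make the node span at least the reserved hot-add region */
            push_node_boundaries(nid, hotadd_start_pfn, hotadd_end_pfn);
    #endif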

    Cc: Dave Hansen
    Cc: Andy Whitcroft
    Cc: Andi Kleen
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Keith Mannthey"
    Cc: "Luck, Tony"
    Cc: KAMEZAWA Hiroyuki
    Cc: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • absent_pages_in_range() made the assumption that users of the API would not
    care about holes beyond the end of physical memory. This was not the
    case. This patch correctly accounts for ranges outside of physical memory
    as holes.
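
    For example (an illustrative snippet, not taken from the patch), a caller
    can now trust the hole count even when end_pfn lies past physical memory:

    /* pages in [start_pfn, end_pfn) that are not backed by registered memory */
    unsigned long holes = absent_pages_in_range(start_pfn, end_pfn);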

    Cc: Dave Hansen
    Cc: Andy Whitcroft
    Cc: Andi Kleen
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Keith Mannthey"
    Cc: "Luck, Tony"
    Cc: KAMEZAWA Hiroyuki
    Cc: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The x86_64 code accounted for mem_map and some portions of the DMA zone as
    holes. This was because those areas would never be reclaimed and accounting
    for them as memory affects min watermarks. This patch accounts for the
    mem_map as a memory hole. Architectures may optionally use set_dma_reserve()
    if they wish to account for a portion of memory in ZONE_DMA as a hole.
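
    A minimal usage sketch (the variable name is made up for illustration):

    /* tell the zone-sizing code that this many ZONE_DMA pages are reserved
       and should be treated like a hole when calculating watermarks */
    set_dma_reserve(dma_reserve_pages);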

    Signed-off-by: Mel Gorman
    Cc: Dave Hansen
    Cc: Andy Whitcroft
    Cc: Andi Kleen
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Keith Mannthey"
    Cc: "Luck, Tony"
    Cc: KAMEZAWA Hiroyuki
    Cc: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • At a basic level, architectures define structures to record where active
    ranges of page frames are located. Once located, the code to calculate zone
    sizes and holes in each architecture is very similar. Some of this zone and
    hole sizing code is difficult to read for no good reason. This set of patches
    eliminates the similar-looking architecture-specific code.

    The patches introduce a mechanism where architectures register where the
    active ranges of page frames are with add_active_range(). When all areas have
    been discovered, free_area_init_nodes() is called to initialise the pgdat and
    zones. The zone sizes and holes are then calculated in an architecture
    independent manner.

    Patch 1 introduces the mechanism for registering and initialising PFN ranges
    Patch 2 changes ppc to use the mechanism - 139 arch-specific LOC removed
    Patch 3 changes x86 to use the mechanism - 136 arch-specific LOC removed
    Patch 4 changes x86_64 to use the mechanism - 74 arch-specific LOC removed
    Patch 5 changes ia64 to use the mechanism - 52 arch-specific LOC removed
    Patch 6 accounts for mem_map as a memory hole as the pages are not reclaimable.
    It adjusts the watermarks slightly.

    Tony Luck has successfully tested for ia64 on Itanium with tiger_defconfig,
    gensparse_defconfig and defconfig. Bob Picco has also tested and debugged on
    IA64. Jack Steiner successfully boot tested on a mammoth SGI IA64-based
    machine. These were on patches against 2.6.17-rc1 and release 3 of these
    patches but there have been no ia64-changes since release 3.

    There are differences in the zone sizes for x86_64 as the arch-specific code
    for x86_64 accounts the kernel image and the starting mem_maps as memory holes
    but the architecture-independent code accounts the memory as present.

    The big benefit of this set of patches is a sizable reduction of
    architecture-specific code, some of which is very hairy. There should be a
    greater reduction when other architectures use the same mechanisms for zone
    and hole sizing but I lack the hardware to test on.

    Additional credit:
    Dave Hansen for the initial suggestion and comments on early patches
    Andy Whitcroft for reviewing early versions and catching numerous
    errors
    Tony Luck for testing and debugging on IA64
    Bob Picco for fixing bugs related to pfn registration, reviewing a
    number of patch revisions, providing a number of suggestions
    on future direction and testing heavily
    Jack Steiner and Robin Holt for testing on IA64 and clarifying
    issues related to memory holes
    Yasunori for testing on IA64
    Andi Kleen for reviewing and feeding back about x86_64
    Christian Kujau for providing valuable information related to ACPI
    problems on x86_64 and testing potential fixes

    This patch:

    Define the structure to represent an active range of page frames within a node
    in an architecture independent manner. Architectures are expected to register
    active ranges of PFNs using add_active_range(nid, start_pfn, end_pfn) and call
    free_area_init_nodes() passing the PFNs of the end of each zone.
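
    A hedged sketch of the registration sequence an architecture might follow
    (the PFN values and zone layout are illustrative, not from any real port):

    unsigned long max_zone_pfns[MAX_NR_ZONES];

    /* register each active range of physical memory, per node */
    add_active_range(0, 0, 0x80000);            /* node 0, example range */
    add_active_range(1, 0x100000, 0x180000);    /* node 1, example range */

    /* then hand the end PFN of each zone to the generic code */
    max_zone_pfns[ZONE_DMA]    = MAX_DMA_PFN;
    max_zone_pfns[ZONE_NORMAL] = max_pfn;
    free_area_init_nodes(max_zone_pfns);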

    Signed-off-by: Mel Gorman
    Signed-off-by: Bob Picco
    Cc: Dave Hansen
    Cc: Andy Whitcroft
    Cc: Andi Kleen
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Keith Mannthey"
    Cc: "Luck, Tony"
    Cc: KAMEZAWA Hiroyuki
    Cc: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • un-, de-, -free, -destroy, -exit, etc. functions should in general return
    void. Also, there is very little that, say, filesystem driver code can do
    upon a failed kmem_cache_destroy(). If it is decided to BUG in this case,
    the BUG should be put in generic code instead.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • * Roughly half of callers already do so, by not checking the return value.
    * Code in drivers/acpi/osl.c does the following to be sure:

    (void)kmem_cache_destroy(cache);

    * Those who do check it printk something, but slab_error() has already
    printed the name of the failed cache.
    * XFS BUGs on a failed kmem_cache_destroy(), which is not a decision a
    low-level filesystem driver should make. Converted to ignore the return
    value (a usage sketch follows below).
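
    After the conversion, a typical caller simply looks like this (hypothetical
    cache name):

    /* kmem_cache_destroy() now returns void, so there is nothing to check */
    kmem_cache_destroy(foo_cachep);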

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

26 Sep, 2006

29 commits

  • Clean up mm/page_alloc.c#mark_free_pages() and make it avoid clearing
    PageNosaveFree for PageNosave pages. This allows us to get rid of an ugly
    hack in kernel/power/snapshot.c#copy_data_pages().

    Additionally, the page-copying loop in copy_data_pages() is moved to an
    inline function.

    Signed-off-by: Rafael J. Wysocki
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Implement async reads for swsusp resuming.

    Crufty old PIII testbox:
    15.7 MB/s -> 20.3 MB/s

    Sony Vaio:
    14.6 MB/s -> 33.3 MB/s

    I didn't implement the post-resume bio_set_pages_dirty(). I don't really
    understand why resume needs to run set_page_dirty() against these pages.

    It might be a worry that this code modifies PG_Uptodate, PG_Error and
    PG_Locked against the image pages. Can this possibly affect the resumed-into
    kernel? Hopefully not, if we're atomically restoring its mem_map?

    Cc: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Cc: Jens Axboe
    Cc: Laurent Riffard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Switch the swsusp writeout code from 4k-at-a-time to 4MB-at-a-time.

    Crufty old PIII testbox:
    12.9 MB/s -> 20.9 MB/s

    Sony Vaio:
    14.7 MB/s -> 26.5 MB/s

    The implementation is crude. A better one would use larger BIOs, but wouldn't
    gain any performance.

    The memcpys will be mostly pipelined with the IO and basically come for free.

    The ENOMEM path has not been tested. It should be.

    Cc: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • There are many places where we need to determine the node of a zone.
    Currently we use a difficult-to-read sequence of pointer dereferences.
    Put that into an inline function and use it throughout the VM. Maybe we
    can find a way to optimize the lookup in the future.
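
    A minimal sketch of such a helper (assuming it simply wraps the existing
    pgdat dereference; the exact form in the patch may differ):

    static inline int zone_to_nid(struct zone *zone)
    {
            /* replaces open-coded zone->zone_pgdat->node_id chains */
            return zone->zone_pgdat->node_id;
    }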

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • I found two locations in hugetlb.c where we chase pointers instead of using
    page_to_nid(). page_to_nid() is more efficient and can get the node directly
    from the page flags.
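
    For example, the pointer-chasing form and its replacement look roughly like
    this (illustrative, not the exact hunks from the patch):

    nid = page_zone(page)->zone_pgdat->node_id;    /* before: chase pointers   */
    nid = page_to_nid(page);                       /* after: read page->flags  */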

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Update the comments for __oom_kill_task() to reflect the code changes.

    Signed-off-by: Ram Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ram Gupta
     
  • Minor performance fix.

    If we reclaimed enough slab pages from a zone then we can avoid going off
    node with the current allocation. Take care of updating nr_reclaimed when
    reclaiming from the slab.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Currently one can enable slab reclaim by setting an explicit option in
    /proc/sys/vm/zone_reclaim_mode. Slab reclaim is then used as a final
    option if freeing unmapped file-backed pages does not free enough pages
    to allow a local allocation.

    However, that means that the slab can grow excessively and that most memory
    of a node may be used by slabs. We have had a case where a machine with
    46GB of memory was using 40-42GB for slab. Zone reclaim was effective in
    dealing with pagecache pages. However, slab reclaim was only done during
    global reclaim (which is a bit rare on NUMA systems).

    This patch implements slab reclaim during zone reclaim. Zone reclaim
    occurs if there is a danger of an off node allocation. At that point we

    1. Shrink the per node page cache if the number of pagecache
    pages is more than min_unmapped_ratio percent of pages in a zone.

    2. Shrink the slab cache if the number of the node's reclaimable slab pages
    (this patch depends on an earlier one that implements that counter)
    is more than min_slab_ratio (a new /proc/sys/vm tunable).

    The shrinking of the slab cache is a bit problematic since it is not node
    specific. So we simply calculate what point in the slab we want to reach
    (current per-node slab use minus the number of pages that need to be
    allocated) and then repeatedly run the global reclaim until that is
    unsuccessful or we have reached the limit. I hope we will have zone-based
    slab reclaim at some point, which will make that easier.

    The default for min_slab_ratio is 5%.

    Also remove the slab option from /proc/sys/vm/zone_reclaim_mode.

    [akpm@osdl.org: cleanups]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Remove the atomic counter for slab_reclaim_pages and replace the counter
    and NR_SLAB with two ZVC counters that account for unreclaimable and
    reclaimable slab pages: NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE.

    Change the check in vmscan.c to refer to NR_SLAB_RECLAIMABLE. The
    intent seems to be to check for slab pages that could be freed.
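
    Illustrative use of the new counters (a sketch; not the literal vmscan.c
    hunk):

    /* only reclaimable slab pages count toward what reclaim may free */
    unsigned long reclaimable_slab = global_page_state(NR_SLAB_RECLAIMABLE);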

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • *_pages is a better description of the role of the variable.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The allocpercpu functions __alloc_percpu() and __free_percpu() are heavily
    using the slab allocator. However, they are conceptually independent of it,
    so extract them from the slab code. This also simplifies SLOB (at this
    point SLOB may be broken in -mm; this should fix it).

    Signed-off-by: Christoph Lameter
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • If a zone is unpopulated then we do not need to check for pages that are to
    be drained, nor for vm counters that may need to be updated.
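
    The check itself is simple; something along these lines (a sketch, not the
    literal hunk):

    if (!populated_zone(zone))
            continue;    /* nothing to drain, no counters to fold */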

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • free_one_page() currently adds the page to a fake list and calls
    free_pages_bulk(). free_pages_bulk() takes it off again and then calls
    __free_one_page().

    Make free_one_page() go directly to __free_one_page(). This saves the list
    on/off and a temporary list in free_one_page() for higher-order pages.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • On high-end systems (1024 or so CPUs) this can potentially cause stack
    overflow. Fix the stack usage.

    Signed-off-by: Suresh Siddha
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Siddha, Suresh B
     
  • In many places we will need to use the same combination of flags. Specify
    a single GFP_THISNODE definition for ease of use in gfp.h.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • There are frequent references to *z in get_page_from_freelist.

    Add an explicit zone variable that can be used in all these places.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • If the user specified a node where we should move the page to then we
    really do not want any other node.

    Signed-off-by: Christoph Lameter
    Cc: Andy Whitcroft
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • …mory policy restrictions

    Add a new gfp flag __GFP_THISNODE to avoid fallback to other nodes. This
    flag is essential if a kernel component requires memory to be located on a
    certain node. It will be needed for alloc_pages_node() to force allocation
    on the indicated node and for alloc_pages() to force allocation on the
    current node.
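
    A hedged usage sketch (hypothetical call site) of the new flag:

    /* allocate a page on node 'nid' and fail rather than fall back elsewhere */
    struct page *page = alloc_pages_node(nid, GFP_KERNEL | __GFP_THISNODE, 0);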

    Signed-off-by: Christoph Lameter <clameter@sgi.com>
    Cc: Andy Whitcroft <apw@shadowen.org>
    Cc: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>

    Christoph Lameter
     
  • Place the alien array cache locks of the on-slab malloc slab caches in a
    separate lockdep class. This avoids false positives from lockdep.

    [akpm@osdl.org: build fix]
    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Shai Fultheim
    Cc: Thomas Gleixner
    Acked-by: Arjan van de Ven
    Cc: Ingo Molnar
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
     
  • It is fairly easy to get a system to oops by simply sizing a cache via
    /proc in such a way that one of the caches (the shared cache is easiest)
    becomes bigger than the maximum allowed slab allocation size. This occurs
    because enable_cpucache() fails if it cannot reallocate some caches.

    However, enable_cpucache() is used for multiple purposes: resizing caches,
    cache creation and bootstrap.

    If the slab is already up then we already have working caches. The resize
    can fail without a problem. We just need to return the proper error code.
    F.e. after this patch:

    # echo "size-64 10000 50 1000" >/proc/slabinfo
    -bash: echo: write error: Cannot allocate memory

    notice no OOPS.

    If we are doing a kmem_cache_create() then we also should not panic but
    return -ENOMEM.

    If on the other hand we do not have a fully bootstrapped slab allocator yet
    then we should indeed panic since we are unable to bring up the slab to its
    full functionality.

    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The ability to free memory allocated to a slab cache is also useful if an
    error occurs during setup of a slab. So extract the function.

    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • [akpm@osdl.org: export fix]
    Signed-off-by: Christoph Hellwig
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Let's try to keep mm/ comments more useful and up to date. This is a start.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Also, check that we get a valid slabp_cache for off-slab slab descriptors.
    We should always get this. If we don't, then we will have to disable
    off-slab descriptors for this cache and do the calculations again.
    This is a rare case, so add a BUG_ON for now, just in case.

    Signed-off-by: Alok N Kataria
    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Shai Fultheim
    Cc: Pekka Enberg
    Cc: Manfred Spraul
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
     
  • Introduce ARCH_LOW_ADDRESS_LIMIT, which can be set per architecture to
    override the 4GB default limit used by the bootmem allocator within
    __alloc_bootmem_low() and __alloc_bootmem_low_node(). E.g. s390 needs a
    2GB limit instead of 4GB.
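
    For example, an architecture header could override it along these lines
    (the 2GB value shown is an assumption based on the s390 case above):

    /* restrict __alloc_bootmem_low*() to the first 2GB on this architecture */
    #define ARCH_LOW_ADDRESS_LIMIT 0x7fffffffUL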

    Acked-by: Ingo Molnar
    Cc: Martin Schwidefsky
    Signed-off-by: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • Print the name of the task invoking the OOM killer. Could make debugging
    easier.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Skip kernel threads, rather than having them return 0 from badness.
    Theoretically, badness might truncate all results to 0, thus a kernel thread
    might be picked first, causing an infinite loop.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • PF_SWAPOFF processes currently cause select_bad_process to return straight
    away. Instead, give them high priority, so we will kill them first; however,
    we also first ensure no parallel OOM kills are happening at the same time.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Having the oomkilladj == OOM_DISABLE check before the releasing check means
    that exiting tasks with oomkilladj == OOM_DISABLE will not stop the OOM
    killer.

    Moving the test down will give the desired behaviour. Also: it will allow
    them to "OOM-kill" themselves if they are exiting. As per the previous patch,
    this is required to prevent OOM killer deadlocks (and they don't actually get
    killed, because they're already exiting -- they're simply allowed access to
    memory reserves).

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin