26 Sep, 2006

40 commits

  • Fix array initialization in lots of arches

    The number of zones may now be reduced from 4 to 2 for many arches. Fix
    the array initialization for the zones array for all architectures so
    that it does not initialize a fixed number of elements.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • I keep seeing zones on various platforms that are never used and wonder why we
    compile support for them into the kernel. Counters show up for HIGHMEM and
    DMA32 that are always zero.

    This patch allows the removal of ZONE_DMA32 for non x86_64 architectures and
    it will get rid of ZONE_HIGHMEM for arches not using highmem (like 64 bit
    architectures). If an arch does not define CONFIG_HIGHMEM then ZONE_HIGHMEM
    will not be defined. Similarly if an arch does not define CONFIG_ZONE_DMA32
    then ZONE_DMA32 will not be defined.

    No current architecture uses all 4 zones (DMA, DMA32, NORMAL, HIGH) that we
    have now. The patchset will reduce the number of zones for all platforms.

    On many platforms that do not have DMA32 or HIGHMEM this will reduce the
    number of zones by 50%. F.e. ia64 only uses DMA and NORMAL.

    Large amounts of memory can be saved for larger systems that may have a
    few hundred NUMA nodes.

    With ZONE_DMA32 and ZONE_HIGHMEM support optional, MAX_NR_ZONES will be 2
    for many non-i386 platforms, and even for i386 without CONFIG_HIGHMEM set.
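
    As a rough illustration of where the series is heading (a sketch only,
    not the exact kernel definition), the zone list becomes conditional on
    the architecture configuration:

    enum zone_type {
            ZONE_DMA,               /* still unconditional for now */
    #ifdef CONFIG_ZONE_DMA32
            ZONE_DMA32,             /* x86_64 only */
    #endif
            ZONE_NORMAL,
    #ifdef CONFIG_HIGHMEM
            ZONE_HIGHMEM,           /* 32 bit arches that need highmem */
    #endif
            MAX_NR_ZONES            /* 2 when neither option is set */
    };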

    Tested on ia64, x86_64 and on i386 with and without highmem.

    The patchset consists of 11 patches that follow this message.

    One could go even further than this patchset and also make ZONE_DMA optional
    because some platforms do not need a separate DMA zone and can do DMA to all
    of memory. This could reduce MAX_NR_ZONES to 1. Such a patchset will
    hopefully follow soon.

    This patch:

    Fix strange uses of MAX_NR_ZONES

    Sometimes we use MAX_NR_ZONES - x to refer to a zone. Make that explicit.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
    It fixes various coding style issues, especially superfluous spaces. For
    example, the '*' goes next to the function name.

    Signed-off-by: Franck Bui-Huu
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Franck Bui-Huu
     
    It also creates a get_mapsize() helper to make the code more readable
    when calculating the boot bitmap size.

    Signed-off-by: Franck Bui-Huu
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Franck Bui-Huu
     
  • Signed-off-by: Franck Bui-Huu
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Franck Bui-Huu
     
  • Signed-off-by: Franck Bui-Huu
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Franck Bui-Huu
     
  • Signed-off-by: Franck Bui-Huu
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Franck Bui-Huu
     
  • Signed-off-by: Franck Bui-Huu
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Franck Bui-Huu
     
  • __init in headers is pretty useless because the compiler doesn't check it, and
    they get out of sync relatively frequently. So if you see an __init in a
    header file, it's quite unreliable and you need to check the definition
    anyway.
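
    A minimal illustration of the problem (hypothetical function, not taken
    from the patch):

    /* foo.h -- the __init here is never checked by the compiler */
    extern void __init setup_foo(void);

    /* foo.c -- only the annotation on the definition matters; if it is
     * dropped here, the header silently goes stale */
    void setup_foo(void)
    {
            /* ... */
    }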

    Signed-off-by: Franck Bui-Huu
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Franck Bui-Huu
     
    Address a long-standing issue of booting with an initrd on an i386 numa
    system. Currently (and always) the numa kva area is mapped into low
    memory by finding the end of low memory and moving that mark down (thus
    creating space for the kva). The issue with this is that Grub loads
    initrds into this same area, so when the kernel checks the initrd it
    finds it outside max_low_pfn and disables it (it thinks the initrd is not
    mapped into usable memory); thus initrd-enabled kernels can't boot i386
    numa :(

    My solution to the problem just converts the numa kva area to use the
    bootmem allocator to reserve its area (instead of moving the end of low
    memory). Using bootmem allows the kva area to be mapped at more diverse
    addresses (not just the end of low memory) and enables the kva area to be
    mapped below the initrd if present.
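
    Roughly, the idea is the following (an illustrative sketch with made-up
    variable names, not the actual patch):

    /* old approach: carve the kva out of the top of lowmem */
    /*   max_low_pfn -= kva_pages;                          */

    /* new approach: let the bootmem allocator find room for it, which may
     * well be below a Grub-loaded initrd */
    kva_area = alloc_bootmem_pages(kva_pages << PAGE_SHIFT);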

    I have tested this patch on numaq (no initrd) and summit (initrd) i386
    numa-based systems.

    [akpm@osdl.org: cleanups]
    Signed-off-by: Keith Mannthey
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    keith mannthey
     
  • This patch makes the following needlessly global functions static:
    - slab.c: kmem_find_general_cachep()
    - swap.c: __page_cache_release()
    - vmalloc.c: __vmalloc_node()

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • With the tracking of dirty pages properly done now, msync doesn't need to scan
    the PTEs anymore to determine the dirty status.

    From: Hugh Dickins

    In looking to do that, I made some other tidyups: several #includes can
    be removed, and the sys_msync loop termination was not quite right.

    Most of those points are criticisms of the existing sys_msync, not of your
    patch. In particular, the loop termination errors were introduced in 2.6.17:
    I did notice this shortly before it came out, but decided I was more likely to
    get it wrong myself, and make matters worse if I tried to rush a last-minute
    fix in. And it's not terribly likely to go wrong, nor disastrous if it does
    go wrong (may miss reporting an unmapped area; may also fsync file of a
    following vma).

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Wrt. the recent modifications in do_wp_page() Hugh Dickins pointed out:

    "I now realize it's right to the first order (normal case) and to the
    second order (ptrace poke), but not to the third order (ptrace poke
    anon page here to be COWed - perhaps can't occur without intervening
    mprotects)."

    This patch restores the old COW behaviour for anonymous pages.

    Signed-off-by: Peter Zijlstra
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Smallish cleanup to install_page(), could save a memory read (haven't checked
    the asm output) and sure looks nicer.

    Signed-off-by: Peter Zijlstra
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • mprotect() resets the page protections, which could result in extra write
    faults for those pages whose dirty state we track using write faults and are
    dirty already.
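
    A sketch of the idea (illustrative only; 'dirty_accountable' stands for
    "this vma's dirty state is tracked via write faults"): when re-applying
    the protections, leave a pte writable if it is already dirty, so no extra
    fault is needed just to re-mark it dirty:

    if (dirty_accountable && pte_dirty(ptent))
            ptent = pte_mkwrite(ptent);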

    Signed-off-by: Peter Zijlstra
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Now that we can detect writers of shared mappings, throttle them. Avoids OOM
    by surprise.
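
    Presumably this means calling into the dirty-page throttling from the
    write-fault path once the page has been dirtied, along the lines of (a
    sketch, not the actual patch):

    set_page_dirty(page);
    balance_dirty_pages_ratelimited(mapping);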

    Signed-off-by: Peter Zijlstra
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Tracking of dirty pages in shared writeable mmap()s.

    The idea is simple: write protect clean shared writeable pages, catch the
    write-fault, make writeable and set dirty. On page write-back clean all the
    PTE dirty bits and write protect them once again.

    The implementation is a tad harder, mainly because the default
    backing_dev_info capabilities were too loosely maintained. Hence it is not
    enough to test the backing_dev_info for cap_account_dirty.

    The current heuristic is as follows; a VMA is eligible when:
    - it is shared writeable
    (vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED)
    - it is not a 'special' mapping
    (vm_flags & (VM_PFNMAP|VM_INSERTPAGE)) == 0
    - the backing_dev_info is cap_account_dirty
    mapping_cap_account_dirty(vma->vm_file->f_mapping)
    - f_op->mmap() didn't change the default page protection
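
    A sketch of that test, combining the conditions above (the helper name is
    made up and the real code may differ):

    static int vma_wants_dirty_tracking(struct vm_area_struct *vma)
    {
            unsigned long vm_flags = vma->vm_flags;

            /* shared writeable ... */
            if ((vm_flags & (VM_WRITE|VM_SHARED)) != (VM_WRITE|VM_SHARED))
                    return 0;
            /* ... and not a 'special' mapping */
            if (vm_flags & (VM_PFNMAP|VM_INSERTPAGE))
                    return 0;
            /* the backing device must account dirty pages */
            if (!vma->vm_file ||
                !mapping_cap_account_dirty(vma->vm_file->f_mapping))
                    return 0;
            /* f_op->mmap() must not have changed the default protection */
            return pgprot_val(vma->vm_page_prot) ==
                   pgprot_val(protection_map[vm_flags &
                              (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)]);
    }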

    Pages from remap_pfn_range() are explicitly excluded because their COW
    semantics are already horrid enough (see vm_normal_page() in do_wp_page()) and
    because they don't have a backing store anyway.

    mprotect() is taught about the new behaviour as well. However it overrides
    the last condition.

    Cleaning the pages on write-back is done with page_mkclean(), a new rmap
    call. It can be called on any page, but is currently only implemented for
    mapped pages; if the page is found to be of a VMA that accounts dirty
    pages it will also wrprotect the PTE.

    Finally, in fs/buffer.c:try_to_free_buffers(), remove clear_page_dirty() from
    under ->private_lock. This seems to be safe, since ->private_lock is used to
    serialize access to the buffers, not the page itself. This is needed because
    clear_page_dirty() will call into page_mkclean() and would thereby violate
    locking order.

    [dhowells@redhat.com: Provide a page_mkclean() implementation for NOMMU]
    Signed-off-by: Peter Zijlstra
    Cc: Hugh Dickins
    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
    Introduce a VM_BUG_ON, which is turned on with CONFIG_DEBUG_VM. Use this
    in the lightweight, inline refcounting functions, in the PageLRU and
    PageActive checks in vmscan (because they're pretty well confined to
    vmscan), and in the page allocate/free fastpaths, which can be the
    hottest parts of the kernel for kbuilds.

    Unlike BUG_ON, VM_BUG_ON must not be used to execute statements with
    side-effects, and should not be used outside core mm code.
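
    Roughly, the macro is:

    #ifdef CONFIG_DEBUG_VM
    #define VM_BUG_ON(cond)         BUG_ON(cond)
    #else
    #define VM_BUG_ON(cond)         do { } while (0)
    #endif

    which is also why it must be free of side effects: without
    CONFIG_DEBUG_VM the condition is never evaluated.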

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Give non-highmem architectures access to the kmap API for the purposes of
    overriding (this is what the attached patch does).

    The proposal is that we should now require all architectures with coherence
    issues to manage data coherence via the kmap/kunmap API. Thus driver
    writers never have to write code like

    kmap(page)
    modify data in page
    flush_kernel_dcache_page(page)
    kunmap(page)

    instead, kmap/kunmap will manage the coherence and driver (and filesystem)
    writers don't need to worry about how to flush between kmap and kunmap.

    For most architectures, the page only needs to be flushed if it was
    actually written to *and* there are user mappings of it, so the best
    implementation looks to be: clear the pte dirty bit in the kernel page
    tables on kmap; on kunmap, check page->mapping for user maps and then the
    dirty bit, and only flush if the page both has user mappings and is dirty.
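
    A sketch of what such an arch override might look like (illustrative
    only; the kernel-pte dirty-bit helper is hypothetical):

    void kunmap(struct page *page)
    {
            /* flush only if the kernel mapping was written to and the page
             * is also mapped into user space */
            if (page_mapped(page) && test_and_clear_kernel_pte_dirty(page))
                    flush_kernel_dcache_page(page);
    }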

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Bottomley
     
    Original commit code assumes that when a buffer on the BJ_SyncData list
    is locked, it is being written to disk. But this is not true, and hence
    it can lead to potential data loss on a crash. Also, the code didn't
    account for the fact that journal_dirty_data() can steal buffers from a
    committing transaction and hence could write buffers that no longer
    belong to the committing transaction. Finally, one buffer could possibly
    end up being written out several times.

    The patch below tries to solve these problems by a complete rewrite of the
    data commit code. We go through buffers on t_sync_datalist, lock buffers
    needing write out and store them in an array. Buffers are also immediately
    refiled to BJ_Locked list or unfiled (if the write out is completed). When
    the array is full or we have to block on buffer lock, we submit all
    accumulated buffers for IO.

    [suitable for 2.6.18.x around the 2.6.19-rc2 timeframe]

    Signed-off-by: Jan Kara
    Cc: Badari Pulavarty
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • get_cpu_var()/per_cpu()/__get_cpu_var() arguments must be simple
    identifiers. Otherwise the arch dependent implementations might break.

    This patch enforces the correct usage of the macros by producing a syntax
    error if the variable is not a simple identifier.
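
    One way to get that effect (a sketch of the trick, not necessarily the
    exact macros that were merged) is to paste the argument into a dummy
    declaration, which only parses if the argument is a plain identifier:

    #define __get_cpu_var(var) (*({                                 \
            extern int simple_identifier_##var(void);               \
            &per_cpu__##var; }))

    __get_cpu_var(counter);         /* ok */
    __get_cpu_var(p->counter);      /* pastes into "simple_identifier_p ->
                                       counter(void)" -- a syntax error */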

    Signed-off-by: Jan Blunck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Blunck
     
  • The scheduler will stop load balancing if the most busy processor contains
    processes pinned via processor affinity.

    The scheduler currently only does one search for the busiest cpu. If it
    cannot pull any tasks away from the busiest cpu because they were pinned,
    then the scheduler goes into a corner and sulks, leaving the idle
    processors idle.

    F.e. if processor 0 is busy running four tasks pinned via taskset,
    processor 1 has none, and someone has just started two processes on
    processor 2, then the scheduler will not move one of the two processes
    away from processor 2.

    This patch fixes that issue by forcing the scheduler to come out of its
    corner and retrying the load balancing by considering other processors for
    load balancing.
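
    In outline, the retry looks something like this (an illustrative sketch;
    argument lists and locking are elided):

    cpumask_t cpus = CPU_MASK_ALL;          /* candidate source cpus */
    struct rq *busiest;
    int nr_moved;

    redo:
            busiest = find_busiest_queue(..., &cpus);
            nr_moved = move_tasks(this_rq, this_cpu, busiest, ...);
            if (!nr_moved) {
                    /* all movable tasks on 'busiest' were pinned: drop that
                     * cpu from consideration and retry with the next-busiest
                     * one instead of giving up */
                    cpu_clear(busiest->cpu, cpus);
                    if (!cpus_empty(cpus))
                            goto redo;
            }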

    This patch was originally developed by John Hawkes and discussed at

    http://marc.theaimsgroup.com/?l=linux-kernel&m=113901368523205&w=2.

    I have removed extraneous material and gone back to equipping struct rq
    with the cpu the queue is associated with since this makes the patch much
    easier and it is likely that others in the future will have the same
    difficulty of figuring out which processor owns which runqueue.

    The overhead added through these patches is a single word on the stack if
    the kernel is configured to support 32 cpus or less (32 bit). For 32 bit
    environments the maximum number of cpus that can be configured is 255,
    which would result in the use of 32 additional bytes on the stack. On
    IA64 up to 1k cpus can be configured, which will result in the use of 128
    additional bytes on the stack. The maximum additional cache footprint is
    one cacheline. Typically memory use will be much less than a cacheline
    and the additional cpumask will be placed on the stack in a cacheline
    that already contains other local variables.

    Signed-off-by: Christoph Lameter
    Cc: John Hawkes
    Cc: "Siddha, Suresh B"
    Cc: Ingo Molnar
    Cc: Nick Piggin
    Cc: Peter Williams
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • * 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/libata-dev:
    [libata] Fix oops introduced in non-uniform port handling fix
    [PATCH] ata-piix: fixes kerneldoc error

    Linus Torvalds
     
  • Noticed by several people.

    Signed-off-by: Jeff Garzik

    Jeff Garzik
     
  • * master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6:
    [NetLabel]: update docs with website information
    [NetLabel]: rework the Netlink attribute handling (part 2)
    [NetLabel]: rework the Netlink attribute handling (part 1)
    [Netlink]: add nla_validate_nested()
    [NETLINK]: add nla_for_each_nested() to the interface list
    [NetLabel]: change the SELinux permissions
    [NetLabel]: make the CIPSOv4 cache spinlocks bottom half safe
    [NetLabel]: correct improper handling of non-NetLabel peer contexts
    [TCP]: make cubic the default
    [TCP]: default congestion control menu
    [ATM] he: Fix __init/__devinit conflict
    [NETFILTER]: Add dscp,DSCP headers to header-y
    [DCCP]: Introduce dccp_probe
    [DCCP]: Use constants for CCIDs
    [DCCP]: Introduce constants for CCID numbers
    [DCCP]: Allow default/fallback service code.

    Linus Torvalds
     
  • * master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6:
    [SOUND] sparc/amd7930: Use __devinit and __devinitdata as needed.
    [SUNLANCE]: Mark sparc_lance_probe_one as __devinit.
    [SPARC64]: Fix section-mismatch errors in solaris emul module.

    Linus Torvalds
     
  • The v4l2 API documentation for VIDIOC_ENUMSTD says:

    To enumerate all standards applications shall begin at index
    zero, incrementing by one until the driver returns EINVAL.

    The actual code, however, tests the index this way:

    if (index<=0 || index >= vfd->tvnormsize) {
            ret=-EINVAL;

    So any application which passes in index=0 gets EINVAL right off the bat
    - and, in fact, this is what happens to mplayer. So I think the
    following patch is called for, and maybe even appropriate for a 2.6.18.x
    stable release.
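
    The natural fix suggested by the documentation is to accept index zero
    and reject only out-of-range values, roughly (a sketch; the actual patch
    may differ in detail):

    if (index < 0 || index >= vfd->tvnormsize) {
            ret=-EINVAL;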

    Signed-off-by: Jonathan Corbet
    Cc: Mauro Carvalho Chehab
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jonathan Corbet
     
  • Invoking load_module() before param_sysfs_init() is called crashes in
    mod_sysfs_setup(), since the kset in module_subsys is not initialized yet.

    In my case, net-pf-1 is getting modprobed as a result of hotplug trying to
    create a UNIX socket. Calls to hotplug begin after the topology_init
    initcall.

    Another patch for the same symptom (module_subsys-initialize-earlier.patch)
    moves param_sysfs_init() to the subsys initcalls, but this is still not
    early enough in the boot process in some cases. In particular,
    topology_init() causes /sbin/hotplug to run, which requests net-pf-1 (the
    UNIX socket protocol) which can be compiled as a module. Moving
    param_sysfs_init() to the postcore initcalls fixes this particular race,
    but there might well be other cases where a usermodehelper causes a module
    to load earlier still.

    The patch makes load_module() return an error rather than crashing the
    kernel if invoked before module_subsys is initialized.

    Cc: Mark Huang
    Cc: Greg KH
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Swierk
     
    If there is only 1 node in the system, cpus should not think they are
    part of some other node.

    In cases where a real numa system boots with the Flat numa option, make
    sure the cpus don't claim to be part of a non-existent node.

    Signed-off-by: Keith Mannthey
    Cc: Andy Whitcroft
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    keith mannthey
     
    Assume that a cpu is *physically* offlined at boot time...

    Because smpboot.c::smp_boot_cpu_map() cannot find the cpu's sapicid,
    numa.c::build_cpu_to_node_map() cannot build the cpu-to-node map for the
    offlined cpu.

    For such cpus, the cpu_to_node map should be fixed at cpu hot-add.
    This mapping should be done before cpu onlining.

    This patch also handles cpu hotremove case.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Luck, Tony"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
    Problem description:

    We have the additional_cpus= option for allocating possible_cpus. But the
    nid for possible cpus is not fixed at boot time. Cpus which are offlined
    at boot, or which are not in the SRAT, are not tied to their node. This
    will cause a panic at cpu onlining.

    Usually, the pxm_to_nid() mapping is fixed at boot time by the SRAT.

    But, unfortunately, some systems (my system!) do not include a
    full SRAT table for possible cpus. (Then, I use the
    additional_cpus= option.)

    For such possible cpus, the pxm<->nid mapping should be fixed at
    hot-add. We now have acpi_map_pxm_to_node(), which is also
    used at boot. It's suitable here.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Luck, Tony"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
    Seems like not all drivers use the framebuffer_alloc() function and thus
    won't have an initialized mutex. But those don't have a backlight anyway.

    Signed-off-by: Michael Hanselmann
    Cc: Olaf Hering
    Cc: "Antonino A. Daplas"
    Cc: Daniel R Thompson
    Cc: Jon Smirl
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Hanselmann
     
  • Stops panic associated with attempting to free a non slab-allocated
    per_cpu_pageset.

    Signed-off-by: David Rientjes
    Acked-by: Christoph Lameter
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
    With CONFIG_PHYSICAL_START set to a non-default value, the i386
    boot_ioremap code calculates its pte index wrong and users of
    boot_ioremap have their areas incorrectly mapped (for me, the SRAT table
    was not mapped during early boot). This patch removes the addr <
    BOOT_PTE_PTRS constraint.

    [ Keith says this is applicable to 2.6.16 and 2.6.17 as well ]

    Signed-off-by: Keith Mannthey
    Cc: Vivek Goyal
    Cc: Dave Hansen
    Cc:
    Cc: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    keith mannthey
     
  • BUG: warning at kernel/lockdep.c:1816/trace_hardirqs_on() (Not tainted)
    [] show_trace_log_lvl+0x58/0x171
    [] show_trace+0xd/0x10
    [] dump_stack+0x19/0x1b
    [] trace_hardirqs_on+0xa2/0x11e
    [] _spin_unlock_irq+0x22/0x26
    [] rtc_get_rtc_time+0x32/0x176
    [] hpet_rtc_interrupt+0x92/0x14d
    [] handle_IRQ_event+0x20/0x4d
    [] __do_IRQ+0x94/0xef
    [] do_IRQ+0x9e/0xbd
    [] common_interrupt+0x25/0x2c
    DWARF2 unwinder stuck at common_interrupt+0x25/0x2c

    Signed-off-by: Peter Zijlstra
    Acked-by: Ingo Molnar
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • If the timeout of an autofs mount is set to zero then umounts are disabled.
    This works fine; however, the kernel module checks the expire timeout and
    goes no further if it is zero.
    shutdown as the module is passed an option to expire mounts regardless of
    their timeout setting.

    This patch allows autofs to honor the force expire option.

    Signed-off-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ian Kent
     
    Fixes an error in kerneldoc of ata_piix.c.

    Signed-off-by: Henrik Kretzschmar
    Signed-off-by: Jeff Garzik

    Henne
     
  • Fixes section-mismatch errors.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Fixes section mismatch warnings when built as a module.

    Also, mark find_ledma and sun4 init function as __devinit
    too.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • init_socksys() was marked __init but invoked from a
    non-__init function.

    Use the correct module_{init,exit}() facilities while we're
    here and eliminate some seriously bogus ifdefs.

    Signed-off-by: David S. Miller

    David S. Miller