04 Jul, 2006

2 commits

  • Generic lock debugging:

    - generalized lock debugging framework. For example, a bug in one lock
    subsystem turns off debugging in all lock subsystems.

    - got rid of the caller address passing (__IP__/__IP_DECL__/etc.) from
    the mutex/rtmutex debugging code: it caused way too much prototype
    hackery, and lockdep will give the same information anyway.

    - ability to do silent tests

    - check lock freeing in vfree too.

    - more finegrained debugging options, to allow distributions to
    turn off more expensive debugging features.

    There's no separate 'held mutexes' list anymore - but there's a 'held locks'
    stack within lockdep, which unifies deadlock detection across all lock
    classes. (this is independent of the lockdep validation stuff - lockdep first
    checks whether we are holding a lock already)

    Here are the current debugging options:

    CONFIG_DEBUG_MUTEXES=y
    CONFIG_DEBUG_LOCK_ALLOC=y

    which do:

    config DEBUG_MUTEXES
    bool "Mutex debugging, basic checks"

    config DEBUG_LOCK_ALLOC
    bool "Detect incorrect freeing of live mutexes"

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • It turns out that it is advantageous to leave a small portion of unmapped file
    backed pages if all of a zone's pages (or almost all pages) are allocated and
    so the page allocator has to go off-node.

    This allows recently used file I/O buffers to stay on the node and
    reduces the times that zone reclaim is invoked if file I/O occurs
    when we run out of memory in a zone.

    The problem is that zone reclaim runs too frequently when the page cache is
    used for file I/O (read write and therefore unmapped pages!) alone and we have
    almost all pages of the zone allocated. Zone reclaim may remove 32 unmapped
    pages. File I/O will use these pages for the next read/write requests and the
    unmapped pages increase. After the zone has filled up again zone reclaim will
    remove it again after only 32 pages. This cycle is too inefficient and there
    are potentially too many zone reclaim cycles.

    With the 1% boundary we may still remove all unmapped pages for file I/O in
    zone reclaim pass. However. it will take a large number of read and writes
    to get back to 1% again where we trigger zone reclaim again.

    The zone reclaim 2.6.16/17 does not show this behavior because we have a 30
    second timeout.

    [akpm@osdl.org: rename the /proc file and the variable]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

01 Jul, 2006

20 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial:
    Remove obsolete #include
    remove obsolete swsusp_encrypt
    arch/arm26/Kconfig typos
    Documentation/IPMI typos
    Kconfig: Typos in net/sched/Kconfig
    v9fs: do not include linux/version.h
    Documentation/DocBook/mtdnand.tmpl: typo fixes
    typo fixes: specfic -> specific
    typo fixes in Documentation/networking/pktgen.txt
    typo fixes: occuring -> occurring
    typo fixes: infomation -> information
    typo fixes: disadvantadge -> disadvantage
    typo fixes: aquire -> acquire
    typo fixes: mecanism -> mechanism
    typo fixes: bandwith -> bandwidth
    fix a typo in the RTC_CLASS help text
    smb is no longer maintained

    Manually merged trivial conflict in arch/um/kernel/vmlinux.lds.S

    Linus Torvalds
     
  • Post and discussion:
    http://marc.theaimsgroup.com/?t=115074342800003&r=1&w=2

    Code in __shrink_node() duplicates code in cache_reap()

    Add a new function drain_freelist that removes slabs with objects that are
    already free and use that in various places.

    This eliminates the __node_shrink() function and provides the interrupt
    holdoff reduction from slab_free to code that used to call __node_shrink.

    [akpm@osdl.org: build fixes]
    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The remaining counters in page_state after the zoned VM counter patches
    have been applied are all just for show in /proc/vmstat. They have no
    essential function for the VM.

    We use a simple increment of per cpu variables. In order to avoid the most
    severe races we disable preempt. Preempt does not prevent the race between
    an increment and an interrupt handler incrementing the same statistics
    counter. However, that race is exceedingly rare, we may only loose one
    increment or so and there is no requirement (at least not in kernel) that
    the vm event counters have to be accurate.

    In the non preempt case this results in a simple increment for each
    counter. For many architectures this will be reduced by the compiler to a
    single instruction. This single instruction is atomic for i386 and x86_64.
    And therefore even the rare race condition in an interrupt is avoided for
    both architectures in most cases.

    The patchset also adds an off switch for embedded systems that allows a
    building of linux kernels without these counters.

    The implementation of these counters is through inline code that hopefully
    results in only a single instruction increment instruction being emitted
    (i386, x86_64) or in the increment being hidden though instruction
    concurrency (EPIC architectures such as ia64 can get that done).

    Benefits:
    - VM event counter operations usually reduce to a single inline instruction
    on i386 and x86_64.
    - No interrupt disable, only preempt disable for the preempt case.
    Preempt disable can also be avoided by moving the counter into a spinlock.
    - Handling is similar to zoned VM counters.
    - Simple and easily extendable.
    - Can be omitted to reduce memory use for embedded use.

    References:

    RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=113512330605497&w=2
    RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=114988082814934&w=2
    local_t http://marc.theaimsgroup.com/?l=linux-kernel&m=114991748606690&w=2
    V2 http://marc.theaimsgroup.com/?t=115014808400007&r=1&w=2
    V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767022346&w=2
    V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115047968808926&w=2

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The numa statistics are really event counters. But they are per node and
    so we have had special treatment for these counters through additional
    fields on the pcp structure. We can now use the per zone nature of the
    zoned VM counters to realize these.

    This will shrink the size of the pcp structure on NUMA systems. We will
    have some room to add additional per zone counters that will all still fit
    in the same cacheline.

    Bits Prior pcp size Size after patch We can add
    ------------------------------------------------------------------
    64 128 bytes (16 words) 80 bytes (10 words) 48
    32 76 bytes (19 words) 56 bytes (14 words) 8 (64 byte cacheline)
    72 (128 byte)

    Remove the special statistics for numa and replace them with zoned vm
    counters. This has the side effect that global sums of these events now
    show up in /proc/vmstat.

    Also take the opportunity to move the zone_statistics() function from
    page_alloc.c into vmstat.c.

    Discussions:
    V2 http://marc.theaimsgroup.com/?t=115048227000002&r=1&w=2

    Signed-off-by: Christoph Lameter
    Acked-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • No callers.

    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Remove writeback state

    We can remove some functions now that were needed to calculate the page state
    for writeback control since these statistics are now directly available.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Conversion of nr_bounce to a per zone counter

    nr_bounce is only used for proc output. So it could be left as an event
    counter. However, the event counters may not be accurate and nr_bounce is
    categorizing types of pages in a zone. So we really need this to also be a
    per zone counter.

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Conversion of nr_unstable to a per zone counter

    We need to do some special modifications to the nfs code since there are
    multiple cases of disposition and we need to have a page ref for proper
    accounting.

    This converts the last critical page state of the VM and therefore we need to
    remove several functions that were depending on GET_PAGE_STATE_LAST in order
    to make the kernel compile again. We are only left with event type counters
    in page state.

    [akpm@osdl.org: bugfixes]
    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Conversion of nr_writeback to per zone counter.

    This removes the last page_state counter from arch/i386/mm/pgtable.c so we
    drop the page_state from there.

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This makes nr_dirty a per zone counter. Looping over all processors is
    avoided during writeback state determination.

    The counter aggregation for nr_dirty had to be undone in the NFS layer since
    we summed up the page counts from multiple zones. Someone more familiar with
    NFS should probably review what I have done.

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Conversion of nr_page_table_pages to a per zone counter

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • - Allows reclaim to access counter without looping over processor counts.

    - Allows accurate statistics on how many pages are used in a zone by
    the slab. This may become useful to balance slab allocations over
    various zones.

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The zone_reclaim_interval was necessary because we were not able to determine
    how many unmapped pages exist in a zone. Therefore we had to scan in
    intervals to figure out if any pages were unmapped.

    With the zoned counters and NR_ANON_PAGES we now know the number of pagecache
    pages and the number of mapped pages in a zone. So we can simply skip the
    reclaim if there is an insufficient number of unmapped pages. We use
    SWAP_CLUSTER_MAX as the boundary.

    Drop all support for /proc/sys/vm/zone_reclaim_interval.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The current NR_FILE_MAPPED is used by zone reclaim and the dirty load
    calculation as the number of mapped pagecache pages. However, that is not
    true. NR_FILE_MAPPED includes the mapped anonymous pages. This patch
    separates those and therefore allows an accurate tracking of the anonymous
    pages per zone.

    It then becomes possible to determine the number of unmapped pages per zone
    and we can avoid scanning for unmapped pages if there are none.

    Also it may now be possible to determine the mapped/unmapped ratio in
    get_dirty_limit. Isnt the number of anonymous pages irrelevant in that
    calculation?

    Note that this will change the meaning of the number of mapped pages reported
    in /proc/vmstat /proc/meminfo and in the per node statistics. This may affect
    user space tools that monitor these counters! NR_FILE_MAPPED works like
    NR_FILE_DIRTY. It is only valid for pagecache pages.

    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • We can now access the number of pages in a mapped state in an inexpensive way
    in shrink_active_list. So drop the nr_mapped field from scan_control.

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Currently a single atomic variable is used to establish the size of the page
    cache in the whole machine. The zoned VM counters have the same method of
    implementation as the nr_pagecache code but also allow the determination of
    the pagecache size per zone.

    Remove the special implementation for nr_pagecache and make it a zoned counter
    named NR_FILE_PAGES.

    Updates of the page cache counters are always performed with interrupts off.
    We can therefore use the __ variant here.

    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • nr_mapped is important because it allows a determination of how many pages of
    a zone are not mapped, which would allow a more efficient means of determining
    when we need to reclaim memory in a zone.

    We take the nr_mapped field out of the page state structure and define a new
    per zone counter named NR_FILE_MAPPED (the anonymous pages will be split off
    from NR_MAPPED in the next patch).

    We replace the use of nr_mapped in various kernel locations. This avoids the
    looping over all processors in try_to_free_pages(), writeback, reclaim (swap +
    zone reclaim).

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Per zone counter infrastructure

    The counters that we currently have for the VM are split per processor. The
    processor however has not much to do with the zone these pages belong to. We
    cannot tell f.e. how many ZONE_DMA pages are dirty.

    So we are blind to potentially inbalances in the usage of memory in various
    zones. F.e. in a NUMA system we cannot tell how many pages are dirty on a
    particular node. If we knew then we could put measures into the VM to balance
    the use of memory between different zones and different nodes in a NUMA
    system. For example it would be possible to limit the dirty pages per node so
    that fast local memory is kept available even if a process is dirtying huge
    amounts of pages.

    Another example is zone reclaim. We do not know how many unmapped pages exist
    per zone. So we just have to try to reclaim. If it is not working then we
    pause and try again later. It would be better if we knew when it makes sense
    to reclaim unmapped pages from a zone. This patchset allows the determination
    of the number of unmapped pages per zone. We can remove the zone reclaim
    interval with the counters introduced here.

    Futhermore the ability to have various usage statistics available will allow
    the development of new NUMA balancing algorithms that may be able to improve
    the decision making in the scheduler of when to move a process to another node
    and hopefully will also enable automatic page migration through a user space
    program that can analyse the memory load distribution and then rebalance
    memory use in order to increase performance.

    The counter framework here implements differential counters for each processor
    in struct zone. The differential counters are consolidated when a threshold
    is exceeded (like done in the current implementation for nr_pageache), when
    slab reaping occurs or when a consolidation function is called.

    Consolidation uses atomic operations and accumulates counters per zone in the
    zone structure and also globally in the vm_stat array. VM functions can
    access the counts by simply indexing a global or zone specific array.

    The arrangement of counters in an array also simplifies processing when output
    has to be generated for /proc/*.

    Counters can be updated by calling inc/dec_zone_page_state or
    _inc/dec_zone_page_state analogous to *_page_state. The second group of
    functions can be called if it is known that interrupts are disabled.

    Special optimized increment and decrement functions are provided. These can
    avoid certain checks and use increment or decrement instructions that an
    architecture may provide.

    We also add a new CONFIG_DMA_IS_NORMAL that signifies that an architecture can
    do DMA to all memory and therefore ZONE_NORMAL will not be populated. This is
    only currently set for IA64 SGI SN2 and currently only affects
    node_page_state(). In the best case node_page_state can be reduced to
    retrieving a single counter for the one zone on the node.

    [akpm@osdl.org: cleanups]
    [akpm@osdl.org: export vm_stat[] for filesystems]
    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • NOTE: ZVC are *not* the lightweight event counters. ZVCs are reliable whereas
    event counters do not need to be.

    Zone based VM statistics are necessary to be able to determine what the state
    of memory in one zone is. In a NUMA system this can be helpful for local
    reclaim and other memory optimizations that may be able to shift VM load in
    order to get more balanced memory use.

    It is also useful to know how the computing load affects the memory
    allocations on various zones. This patchset allows the retrieval of that data
    from userspace.

    The patchset introduces a framework for counters that is a cross between the
    existing page_stats --which are simply global counters split per cpu-- and the
    approach of deferred incremental updates implemented for nr_pagecache.

    Small per cpu 8 bit counters are added to struct zone. If the counter exceeds
    certain thresholds then the counters are accumulated in an array of
    atomic_long in the zone and in a global array that sums up all zone values.
    The small 8 bit counters are next to the per cpu page pointers and so they
    will be in high in the cpu cache when pages are allocated and freed.

    Access to VM counter information for a zone and for the whole machine is then
    possible by simply indexing an array (Thanks to Nick Piggin for pointing out
    that approach). The access to the total number of pages of various types does
    no longer require the summing up of all per cpu counters.

    Benefits of this patchset right now:

    - Ability for UP and SMP configuration to determine how memory
    is balanced between the DMA, NORMAL and HIGHMEM zones.

    - loops over all processors are avoided in writeback and
    reclaim paths. We can avoid caching the writeback information
    because the needed information is directly accessible.

    - Special handling for nr_pagecache removed.

    - zone_reclaim_interval vanishes since VM stats can now determine
    when it is worth to do local reclaim.

    - Fast inline per node page state determination.

    - Accurate counters in /sys/devices/system/node/node*/meminfo. Current
    counters are counting simply which processor allocated a page somewhere
    and guestimate based on that. So the counters were not useful to show
    the actual distribution of page use on a specific zone.

    - The swap_prefetch patch requires per node statistics in order to
    figure out when processors of a node can prefetch. This patch provides
    some of the needed numbers.

    - Detailed VM counters available in more /proc and /sys status files.

    References to earlier discussions:
    V1 http://marc.theaimsgroup.com/?l=linux-kernel&m=113511649910826&w=2
    V2 http://marc.theaimsgroup.com/?l=linux-kernel&m=114980851924230&w=2
    V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115014697910351&w=2
    V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767318740&w=2

    Performance tests with AIM7 did not show any regressions. Seems to be a tad
    faster even. Tested on ia64/NUMA. Builds fine on i386, SMP / UP. Includes
    fixes for s390/arm/uml arch code.

    This patch:

    Move counter code from page_alloc.c/page-flags.h to vmstat.c/h.

    Create vmstat.c/vmstat.h by separating the counter code and the proc
    functions.

    Move the vm_stat_text array before zoneinfo_show.

    [akpm@osdl.org: s390 build fix]
    [akpm@osdl.org: HOTPLUG_CPU build fix]
    Signed-off-by: Christoph Lameter
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Signed-off-by: Jörn Engel
    Signed-off-by: Adrian Bunk

    Jörn Engel
     

30 Jun, 2006

4 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/devfs-2.6: (22 commits)
    [PATCH] devfs: Remove it from the feature_removal.txt file
    [PATCH] devfs: Last little devfs cleanups throughout the kernel tree.
    [PATCH] devfs: Rename TTY_DRIVER_NO_DEVFS to TTY_DRIVER_DYNAMIC_DEV
    [PATCH] devfs: Remove the tty_driver devfs_name field as it's no longer needed
    [PATCH] devfs: Remove the line_driver devfs_name field as it's no longer needed
    [PATCH] devfs: Remove the videodevice devfs_name field as it's no longer needed
    [PATCH] devfs: Remove the gendisk devfs_name field as it's no longer needed
    [PATCH] devfs: Remove the miscdevice devfs_name field as it's no longer needed
    [PATCH] devfs: Remove the devfs_fs_kernel.h file from the tree
    [PATCH] devfs: Remove devfs_remove() function from the kernel tree
    [PATCH] devfs: Remove devfs_mk_cdev() function from the kernel tree
    [PATCH] devfs: Remove devfs_mk_bdev() function from the kernel tree
    [PATCH] devfs: Remove devfs_mk_symlink() function from the kernel tree
    [PATCH] devfs: Remove devfs_mk_dir() function from the kernel tree
    [PATCH] devfs: Remove devfs_*_tape() functions from the kernel tree
    [PATCH] devfs: Remove devfs support from the sound subsystem
    [PATCH] devfs: Remove devfs support from the ide subsystem.
    [PATCH] devfs: Remove devfs support from the serial subsystem
    [PATCH] devfs: Remove devfs from the init code
    [PATCH] devfs: Remove devfs from the partition code
    ...

    Linus Torvalds
     
  • * master.kernel.org:/pub/scm/linux/kernel/git/gregkh/pci-2.6:
    [PATCH] i386: export memory more than 4G through /proc/iomem
    [PATCH] 64bit Resource: finally enable 64bit resource sizes
    [PATCH] 64bit Resource: convert a few remaining drivers to use resource_size_t where needed
    [PATCH] 64bit resource: change pnp core to use resource_size_t
    [PATCH] 64bit resource: change pci core and arch code to use resource_size_t
    [PATCH] 64bit resource: change resource core to use resource_size_t
    [PATCH] 64bit resource: introduce resource_size_t for the start and end of struct resource
    [PATCH] 64bit resource: fix up printks for resources in misc drivers
    [PATCH] 64bit resource: fix up printks for resources in arch and core code
    [PATCH] 64bit resource: fix up printks for resources in pcmcia drivers
    [PATCH] 64bit resource: fix up printks for resources in video drivers
    [PATCH] 64bit resource: fix up printks for resources in ide drivers
    [PATCH] 64bit resource: fix up printks for resources in mtd drivers
    [PATCH] 64bit resource: fix up printks for resources in pci core and hotplug drivers
    [PATCH] 64bit resource: fix up printks for resources in networks drivers
    [PATCH] 64bit resource: fix up printks for resources in sound drivers
    [PATCH] 64bit resource: C99 changes for struct resource declarations

    Fixed up trivial conflict in drivers/ide/pci/cmd64x.c (the printk that
    was changed by the 64-bit resources had been deleted in the meantime ;)

    Linus Torvalds
     
  • Memory hotplug code of i386 adds memory to only highmem. So, if
    CONFIG_HIGHMEM is not set, CONFIG_MEMORY_HOTPLUG shouldn't be set.
    Otherwise, it causes compile error.

    In addition, many architecture can't use memory hotplug feature yet. So, I
    introduce CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG.

    Signed-off-by: Yasunori Goto
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     
  • The recent generic_file_write() deadlock fix caused
    generic_file_buffered_write() to loop inifinitely when presented with a
    zero-length iovec segment. Fix.

    Note that this fix deliberately avoids calling ->prepare_write(),
    ->commit_write() etc with a zero-length write. This is because I don't trust
    all filesystems to get that right.

    This is a cautious approach, for 2.6.17.x. For 2.6.18 we should just go ahead
    and call ->prepare_write() and ->commit_write() with the zero length and fix
    any broken filesystems. So I'll make that change once this code is stabilised
    and backported into 2.6.17.x.

    The reason for preferring to call ->prepare_write() and ->commit_write() with
    the zero-length segment: a zero-length segment _should_ be sufficiently
    uncommon that this is the correct way of handling it. We don't want to
    optimise for poorly-written userspace at the expense of well-written
    userspace.

    Cc: "Vladimir V. Saveliev"
    Cc: Neil Brown
    Cc: Martin Schwidefsky
    Cc: Chris Wright
    Cc: Greg KH
    Cc:
    Cc: walt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

29 Jun, 2006

1 commit


28 Jun, 2006

13 commits

  • Runtime debugging functionality for rt-mutexes.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Add debug_check_no_locks_freed(), as a central inline to add
    bad-lock-free-debugging functionality to.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Make notifier_calls associated with cpu_notifier as __cpuinit.

    __cpuinit makes sure that the function is init time only unless
    CONFIG_HOTPLUG_CPU is defined.

    [akpm@osdl.org: section fix]
    Signed-off-by: Chandra Seetharaman
    Cc: Ashok Raj
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chandra Seetharaman
     
  • Make notifier_blocks associated with cpu_notifier as __cpuinitdata.

    __cpuinitdata makes sure that the data is init time only unless
    CONFIG_HOTPLUG_CPU is defined.

    Signed-off-by: Chandra Seetharaman
    Cc: Ashok Raj
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chandra Seetharaman
     
  • In 2.6.17, there was a problem with cpu_notifiers and XFS. I provided a
    band-aid solution to solve that problem. In the process, i undid all the
    changes you both were making to ensure that these notifiers were available
    only at init time (unless CONFIG_HOTPLUG_CPU is defined).

    We deferred the real fix to 2.6.18. Here is a set of patches that fixes the
    XFS problem cleanly and makes the cpu notifiers available only at init time
    (unless CONFIG_HOTPLUG_CPU is defined).

    If CONFIG_HOTPLUG_CPU is defined then cpu notifiers are available at run
    time.

    This patch reverts the notifier_call changes made in 2.6.17

    Signed-off-by: Chandra Seetharaman
    Cc: Ashok Raj
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chandra Seetharaman
     
  • generic_file_buffered_write() prefaults in user pages in order to avoid
    deadlock on copying from the same page as write goes to.

    However, it looks like there is a problem when write is vectored:
    fault_in_pages_readable brings in current segment or its part (maxlen).
    OTOH, filemap_copy_from_user_iovec is called to copy number of bytes
    (bytes) which may exceed current segment, so filemap_copy_from_user_iovec
    switches to the next segment which is not brought in yet. Pagefault is
    generated. That causes the deadlock if pagefault is for the same page
    write goes to: page being written is locked and not uptodate, pagefault
    will deadlock trying to lock locked page.

    [akpm@osdl.org: somewhat rewritten]
    Cc: Neil Brown
    Cc: Martin Schwidefsky
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir V. Saveliev
     
  • locking init cleanups:

    - convert " = SPIN_LOCK_UNLOCKED" to spin_lock_init() or DEFINE_SPINLOCK()
    - convert rwlocks in a similar manner

    this patch was generated automatically.

    Motivation:

    - cleanliness
    - lockdep needs control of lock initialization, which the open-coded
    variants do not give
    - it's also useful for -rt and for lock debugging in general

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Localize poison values into one header file for better documentation and
    easier/quicker debugging and so that the same values won't be used for
    multiple purposes.

    Use these constants in core arch., mm, driver, and fs code.

    Signed-off-by: Randy Dunlap
    Acked-by: Matt Mackall
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: "David S. Miller"
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • When new node becomes enable by hot-add, new sysfs file must be created for
    new node. So, if new node is enabled by add_memory(), register_one_node() is
    called to create it. In addition, I386's arch_register_node() and a part of
    register_nodes() of powerpc are consolidated to register_one_node() as a
    generic_code().

    This is tested by Tiger4(IPF) with node hot-plug emulation.

    Signed-off-by: Keiichiro Tokunaga
    Signed-off-by: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     
  • Fix "undefined reference to `arch_add_memory'" on sparc64 allmodconfig.

    sparc64 doesn't support memory hotplug. But we want it to support
    sparsemem.

    Signed-off-by: Yasunori Goto
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     
  • This patch allows hot-add memory which is not aligned to section.

    Now, hot-added memory has to be aligned to section size. Considering big
    section sized archs, this is not useful.

    When hot-added memory is registerd as iomem resoruce by iomem resource
    patch, we can make use of that information to detect valid memory range.

    Note: With this, not-aligned memory can be registerd. To allow hot-add
    memory with holes, we have to do more work around add_memory().
    (It doesn't allows add memory to already existing mem section.)

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Register hot-added memory to iomem_resource. With this, /proc/iomem can
    show hot-added memory.

    Note: kdump uses /proc/iomem to catch memory range when it is installed.
    So, kdump should be re-installed after /proc/iomem change.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Vivek Goyal
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Add node-hot-add support to add_memory().

    node hotadd uses this sequence.
    1. allocate pgdat.
    2. refresh NODE_DATA()
    3. call free_area_init_node() to initialize
    4. create sysfs entry
    5. add memory (old add_memory())
    6. set node online
    7. run kswapd for new node.
    (8). update zonelist after pages are onlined. (This is already merged in -mm
    due to update phase is difference.)

    Note:
    To make common function as much as possible,
    there is 2 changes from v2.
    - The old add_memory(), which is defiend by each archs,
    is renamed to arch_add_memory(). New add_memory becomes
    caller of arch dependent function as a common code.

    - This patch changes add_memory()'s interface
    From: add_memory(start, end)
    TO : add_memory(nid, start, end).
    It was cause of similar code that finding node id from
    physical address is inside of old add_memory() on each arch.

    In addition, acpi memory hotplug driver can find node id easier.
    In v2, it must walk DSDT'S _CRS by matching physical address to
    get the handle of its memory device, then get _PXM and node id.
    Because input is just physical address.
    However, in v3, the acpi driver can use handle to get _PXM and node id
    for the new memory device. It can pass just node id to add_memory().

    Fix interface of arch_add_memory() is in next patche.

    Signed-off-by: Yasunori Goto
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Dave Hansen
    Cc: "Brown, Len"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto