23 Sep, 2006

1 commit

  • * master.kernel.org:/pub/scm/linux/kernel/git/davej/agpgart:
    [AGPGART] Rework AGPv3 modesetting fallback.
    [AGPGART] Add suspend callback for i965
    [AGPGART] Fix number of aperture sizes in 830 gart structs.
    [AGPGART] Intel 965 Express support.
    [AGPGART] agp.h: constify struct agp_bridge_data::version
    [AGPGART] const'ify VIA AGP PCI table.
    [AGPGART] CONFIG_PM=n slim: drivers/char/agp/intel-agp.c
    [AGPGART] CONFIG_PM=n slim: drivers/char/agp/efficeon-agp.c
    [AGPGART] Const'ify the agpgart driver version.
    [AGPGART] remove private page protection map

    Linus Torvalds
     

09 Sep, 2006

1 commit

  • If a CPU faults this page into pagetables after invalidate_mapping_pages()
    checked page_mapped(), invalidate_complete_page() will still proceed to remove
    the page from pagecache. This leaves the page-faulting process with a
    detached page. If it was MAP_SHARED then file data loss will ensue.

    Fix that up by checking the page's refcount after taking tree_lock (a
    sketch follows this entry).

    Cc: Nick Piggin
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
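
    A minimal sketch of the check described above, against the 2.6.18-era
    mm/truncate.c layout (details and surrounding code are assumed, not
    quoted from the patch):

        /* invalidate_complete_page() (sketch): only drop the page from the
         * pagecache if nobody grabbed an extra reference after the earlier
         * page_mapped() check.  Exactly two references are expected here:
         * the caller's and the pagecache's. */
        write_lock_irq(&mapping->tree_lock);
        if (page_count(page) != 2) {
                /* Raced with a fault; leave the page alone. */
                write_unlock_irq(&mapping->tree_lock);
                return 0;
        }
        __remove_from_page_cache(page);
        write_unlock_irq(&mapping->tree_lock);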
     

08 Sep, 2006

1 commit

  • This prevents cross-region mappings on IA64 and SPARC, which could lead
    to a system crash. They were correctly trapped for normal mmap() calls,
    but not for the kernel-internal calls generated by executable loading.

    This code just moves the architecture-specific cross-region checks into
    an arch-specific "arch_mmap_check()" macro, and defines that for the
    architectures that needed it (ia64, sparc and sparc64).

    Architectures that don't have any special requirements can just ignore
    the new cross-region check, since the mmap() code will just notice on
    its own when the macro isn't defined.

    Signed-off-by: Pavel Emelianov
    Signed-off-by: Kirill Korotaev
    Acked-by: David Miller
    Signed-off-by: Greg Kroah-Hartman
    [ Cleaned up to not affect architectures that don't need it ]
    Signed-off-by: Linus Torvalds

    Kirill Korotaev
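
    A minimal sketch of the generic side of the hook described above (the
    exact placement in mm/mmap.c and the location of the fallback are
    assumed):

        /* Architectures with cross-region restrictions (ia64, sparc,
         * sparc64) define arch_mmap_check() in their headers; everyone
         * else falls back to this no-op. */
        #ifndef arch_mmap_check
        #define arch_mmap_check(addr, len, flags)       (0)
        #endif

        error = arch_mmap_check(addr, len, flags);
        if (error)
                return error;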
     

06 Sep, 2006

1 commit


02 Sep, 2006

4 commits

  • Since vma->vm_pgoff is in units of small pages, VMAs for huge pages have
    the lower HPAGE_SHIFT - PAGE_SHIFT bits always cleared, which results in
    bad offsets to the interleave functions. Take this difference from small
    pages into account when calculating the offset (a sketch follows this
    entry). This does add a 0-bit shift into the small-page path (via
    alloc_page_vma()), but I think that is negligible. Also add a BUG_ON to
    prevent the offset from growing due to a negative right-shift, which
    probably shouldn't be allowed anyway.

    Tested on an 8-memory node ppc64 NUMA box and got the interleaving I
    expected.

    Signed-off-by: Nishanth Aravamudan
    Signed-off-by: Adam Litke
    Cc: Andi Kleen
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
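
    A minimal sketch of the offset calculation described above, as it might
    look in mm/mempolicy.c's interleave helper (shift is the mapping's page
    shift; variable names are assumed):

        /* vm_pgoff is in small-page units; convert it to the mapping's
         * page size before adding the offset within the VMA. */
        BUG_ON(shift < PAGE_SHIFT);
        off = vma->vm_pgoff >> (shift - PAGE_SHIFT);
        off += (addr - vma->vm_start) >> shift;
        return offset_il_node(pol, vma, off);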
     
  • This patch works around a complex dm-related deadlock/livelock down in the
    mempool allocator.

    Alasdair said:

    Several dm targets suffer from this.

    Mempools are not yet used correctly everywhere in device-mapper: they can
    get shared when devices are stacked, and some targets share them across
    multiple instances. I made fixing this one of the prerequisites for this
    patch:

    md-dm-reduce-stack-usage-with-stacked-block-devices.patch

    which in some cases makes people more likely to hit the problem.

    There's been some progress on this recently with (unfinished) dm-crypt
    patches at:

    http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/
    (dm-crypt-move-io-to-workqueue.patch plus dependencies)

    and:

    I've no problems with a temporary workaround like that, but Milan Broz (a
    new Red Hat developer in the Czech Republic) has started reviewing all the
    mempool usage in device-mapper, so I'm expecting we'll soon have a proper
    fix for these associated problems. [He's back from holiday at the start of
    next week.]

    For now, this sad-but-safe little patch will allow the machine to recover.

    [akpm@osdl.org: rewrote changelog]
    Cc: Alasdair G Kergon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Mironchik
     
  • The ZVC counter update threshold is currently set to a fixed value of 32.
    This patch sets up the threshold depending on the number of processors and
    the sizes of the zones in the system.

    With the current threshold of 32, I was able to observe slight contention
    when more than 130-140 processors concurrently updated the counters. The
    contention vanished when I either increased the threshold to 64 or used
    Andrew's idea of overstepping the interval (see ZVC overstep patch).

    However, we saw contention again at 220-230 processors. So we need higher
    values for larger systems.

    But the current default is already a bit of an overkill for smaller
    systems. Some systems have tiny zones where precision matters. For
    example i386 and x86_64 have 16M DMA zones and either 900M ZONE_NORMAL or
    ZONE_DMA32. These are even present on SMP and NUMA systems.

    The patch here sets up a threshold based on the number of processors in the
    system and the size of the zone that these counters are used for. The
    threshold should grow logarithmically, so we use fls() as an easy
    approximation.

    Results of tests on a system with 1024 processors (4TB RAM)

    The following output is from a test allocating 1GB of memory concurrently
    on each processor (Forking the process. So contention on mmap_sem and the
    pte locks is not a factor):

    TYPE    CPUS    WALL(X)   WALL(MIN)         SYS     USER     TOTCPU
    fork       1      0.552       0.552       0.540    0.012      0.552
    fork       4      0.552       0.548       2.164    0.036      2.200
    fork      16      0.564       0.548       8.812    0.164      8.976
    fork     128      0.580       0.572      72.204    1.208     73.412
    fork     256      1.300       0.660     310.400    2.160    312.560
    fork     512      3.512       0.696    1526.836    4.816   1531.652
    fork    1020     20.024       0.700   17243.176    6.688  17249.863

    So a threshold of 32 is fine up to 128 processors. At 256 processors contention
    becomes a factor.

    Overstepping the counter (earlier patch) improves the numbers a bit:

    fork       4      0.552       0.548       2.164    0.040      2.204
    fork      16      0.552       0.548       8.640    0.148      8.788
    fork     128      0.556       0.548      69.676    0.956     70.632
    fork     256      0.876       0.636     212.468    2.108    214.576
    fork     512      2.276       0.672     997.324    4.260   1001.584
    fork    1020     13.564       0.680   11586.436    6.088  11592.523

    Still contention at 512 and 1020. Contention at 1020 is down by a third.
    256 still has a slight bit of contention.

    After this patch the counter threshold will be set to 125 which reduces
    contention significantly:

    fork     128      0.560       0.548      69.776    0.932     70.708
    fork     256      0.636       0.556     143.460    2.036    145.496
    fork     512      0.640       0.548     284.244    4.236    288.480
    fork    1020      1.500       0.588    1326.152    8.892   1335.044

    [akpm@osdl.org: !SMP build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
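
    A minimal sketch of the sizing heuristic described above (function shape
    and constant choices as assumed here):

        /* Scale the per-cpu delta threshold with the number of online CPUs
         * and the zone size; fls() gives cheap logarithmic scaling.  Cap
         * the result so per-cpu deltas still fit a signed byte. */
        static int calculate_threshold(struct zone *zone)
        {
                int mem;        /* zone memory in units of 128MB */

                mem = zone->present_pages >> (27 - PAGE_SHIFT);
                return min(125, 2 * fls(num_online_cpus()) * (1 + fls(mem)));
        }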
     
  • Increments and decrements are usually grouped rather than mixed. We can
    optimize the inc and dec functions for that case.

    Increment and decrement the counters by 50% more than the threshold in
    those cases and set the differential accordingly. This decreases the need
    to update the atomic counters.

    The idea came originally from Andrew Morton. The overstepping alone was
    sufficient to address the contention issue found when updating the global
    and the per zone counters from 160 processors.

    Also remove some code in dec_zone_page_state.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
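
    A minimal sketch of the overstep on the increment side (per-cpu
    bookkeeping names are assumed; p points at the per-cpu differential):

        (*p)++;
        if (unlikely(*p > threshold)) {
                int overstep = threshold / 2;

                /* Push threshold/2 extra into the global counter and start
                 * the next run at -threshold/2, so a grouped sequence of
                 * increments goes ~50% longer before the next atomic update. */
                zone_page_state_add(*p + overstep, zone, item);
                *p = -overstep;
        }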
     

28 Aug, 2006

1 commit

  • There is a bug in mm/swapfile.c#swap_type_of() that makes swsusp only be
    able to use the first active swap partition as the resume device. Fix it.

    Signed-off-by: Rafael J. Wysocki
    Cc: Hugh Dickins
    Acked-by: Pavel Machek
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     

15 Aug, 2006

1 commit


06 Aug, 2006

4 commits

  • This patch is a collision-check enhancement for memory hot add.

    It's better to do the resource collision check before doing memory hot add,
    which will touch memory management structures.

    add_section() should also check whether the section already exists before
    calling sparse_add_one_section(). (sparse_add_one_section() will do another
    check anyway, but checking in memory_hotplug.c is easier to understand.)

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: keith mannthey
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • find_next_system_ram() is used to find an available memory resource when
    onlining newly added memory. This patch fixes the following problem.

    find_next_system_ram() cannot catch this case:

    Resource: (start)-------------(end)
    Section :           (start)-------------(end)

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Keith Mannthey
    Cc: Yasunori Goto
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • The ioresource handling code in memory hotplug allows non-aligned memory
    hot add. But when the memmap and other memory structures are initialized,
    the parameters should be aligned (if they are not, initialization of the
    mem_map will go wrong, since it assumes aligned parameters). This patch
    fixes that.

    It also makes the ioresource collision check handle -EEXIST.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Keith Mannthey
    Cc: Yasunori Goto
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • The POSIX_FADV_NOREUSE hint means "the application will use this range of the
    file a single time". It seems to be intended that the implementation will use
    this hint to perform drop-behind of that part of the file when the application
    gets around to reading or writing it.

    However for reasons which aren't obvious (or sane?) I mapped
    POSIX_FADV_NOREUSE onto POSIX_FADV_WILLNEED. ie: it does readahead.

    That's daft. So for now, make POSIX_FADV_NOREUSE a no-op.

    This is a non-back-compatible change. If someone was using POSIX_FADV_NOREUSE
    to perform readahead, they lose. The likelihood is low.

    If/when we later implement POSIX_FADV_NOREUSE things will get interesting -
    to do it fully we'll need to maintain file offset/length ranges and perform
    all sorts of complex tricks, and managing the lifetime of those ranges'
    data structures will be interesting.

    A sensible implementation would probably ignore the file range and would
    simply mark the entire file as needing some form of drop-behind treatment.

    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
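
    The visible effect is just an empty case in sys_fadvise64_64()'s switch,
    roughly:

        case POSIX_FADV_NOREUSE:
                /* Accepted for compatibility but intentionally a no-op now;
                 * it used to fall into the WILLNEED/readahead path. */
                break;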
     

01 Aug, 2006

2 commits


30 Jul, 2006

1 commit


27 Jul, 2006

1 commit


15 Jul, 2006

4 commits

  • Unlike earlier iterations of the delay accounting patches, delays are now
    collected only for the actual I/O waits rather than trying to cover the
    delays seen in I/O submission paths.

    Account separately for block I/O delays incurred as a result of swapin page
    faults whose frequency can be affected by the task/process' rss limit. Hence
    swapin delays can act as feedback for rss limit changes independent of I/O
    priority changes.

    Signed-off-by: Shailabh Nagar
    Signed-off-by: Balbir Singh
    Cc: Jes Sorensen
    Cc: Peter Chubb
    Cc: Erich Focht
    Cc: Levent Serinol
    Cc: Jay Lan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shailabh Nagar
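
    A minimal sketch of how the swapin wait is bracketed in do_swap_page()
    (placement and surrounding code are assumed):

        /* Flag the delay we are about to incur as swapin-related so it is
         * accounted separately from ordinary block I/O delay. */
        delayacct_set_flag(DELAYACCT_PF_SWAPIN);
        page = lookup_swap_cache(entry);
        if (!page)
                page = read_swap_cache_async(entry, vma, address);
        /* ... locking and the rest of the fault handling ... */
        delayacct_clear_flag(DELAYACCT_PF_SWAPIN);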
     
  • nommu.c needs to export two more symbols for drivers to use:
    remap_pfn_range and unmap_mapping_range.

    Signed-off-by: Luke Yang
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luke Yang
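
    The exports themselves are one-liners in mm/nommu.c:

        EXPORT_SYMBOL(remap_pfn_range);
        EXPORT_SYMBOL(unmap_mapping_range);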
     
  • There is a race condition that showed up in a threaded JIT environment.
    The situation is that a process with a JIT code page forks, so the page is
    marked read-only, then some threads are created in the child. One of the
    threads attempts to add a new code block to the JIT page, so a
    copy-on-write fault is taken, and the kernel allocates a new page, copies
    the data, installs the new pte, and then calls lazy_mmu_prot_update() to
    flush caches to make sure that the icache and dcache are in sync.
    Unfortunately, the other thread runs right after the new pte is installed,
    but before the caches have been flushed. It tries to execute some old JIT
    code that was already in this page, but it sees some garbage in the i-cache
    from the previous users of the new physical page.

    Fix: we must make the caches consistent before installing the pte. This is
    an ia64 only fix because lazy_mmu_prot_update() is a no-op on all other
    architectures.

    Signed-off-by: Anil Keshavamurthy
    Signed-off-by: Tony Luck
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anil Keshavamurthy
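
    A minimal sketch of the resulting ordering in do_wp_page()'s COW path
    (surrounding code is assumed):

        entry = mk_pte(new_page, vma->vm_page_prot);
        entry = maybe_mkwrite(pte_mkdirty(entry), vma);
        /* Make the i-cache/d-cache consistent for the new page first... */
        lazy_mmu_prot_update(entry);
        /* ...and only then make the mapping visible to other threads. */
        ptep_establish(vma, address, page_table, entry);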
     
  • __vunmap must not rely on area->nr_pages when picking the release method
    for area->pages; it may be too small when __vmalloc_area_node failed early
    due to lack of memory. Instead, use a flag in vm_struct to differentiate
    (a sketch follows this entry).

    Signed-off-by: Jan Kiszka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kiszka
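
    A minimal sketch of the flag test described above (the flag name used
    here, VM_VPAGES, is an assumption):

        /* area->pages itself is vmalloc'ed for very large areas; remember
         * that in area->flags rather than inferring it from nr_pages,
         * which may be too small after an early allocation failure. */
        if (area->flags & VM_VPAGES)
                vfree(area->pages);
        else
                kfree(area->pages);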
     

14 Jul, 2006

3 commits

  • Chandra Seetharaman reported SLAB crashes caused by the slab.c lock
    annotation patch. There is only one chunk of that patch that has a
    material effect on the slab logic - this patch undoes that chunk.

    This was confirmed to fix the slab problem by Chandra.

    Signed-off-by: Ingo Molnar
    Tested-by: Chandra Seetharaman
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • mm/slab.c uses nested locking when dealing with 'off-slab'
    caches: in that case it allocates the slab header from the
    (on-slab) kmalloc caches. Teach the lock validator about
    this by putting all on-slab caches into a separate class.

    This patch has no effect on non-lockdep kernels.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     
  • Undo the existing mm/slab.c lock-validator annotations, in preparation
    for a new, less intrusive annotation patch.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

11 Jul, 2006

5 commits


04 Jul, 2006

6 commits

  • cleanup: remove task_t and convert all its uses to struct task_struct. I
    introduced it for the scheduler back then, and it was a mistake.

    Conversion was mostly scripted, the result was reviewed and all
    secondary whitespace and style impact (if any) was fixed up by hand.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Teach the lock validator about the special (recursive) locking here (a
    small fragment follows this entry). This has no effect on non-lockdep
    kernels.

    Fix initialize-locks-via-memcpy assumptions.

    Effects on non-lockdep kernels: the subclass nesting parameter is passed into
    cache_free_alien() and __cache_free(), and turns one internal
    kmem_cache_free() call into an open-coded __cache_free() call.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
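
    A minimal, illustrative fragment of what the subclass plumbing enables
    (the local/remote names here are generic placeholders, not lifted from
    the patch):

        /* Lock our own node's list first... */
        spin_lock(&local->lock);
        /* ...then take a second lock of the same class for the remote node;
         * the explicit subclass tells lockdep this nesting is intentional
         * rather than a self-deadlock. */
        spin_lock_nested(&remote->lock, SINGLE_DEPTH_NESTING);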
     
  • Teach the lock validator about the special (recursive) locking here. This
    has no effect on non-lockdep kernels.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Locking init improvement:

    - introduce and use __SPIN_LOCK_UNLOCKED for array initializations,
    to pass in the name string of locks, used by debugging

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
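
    A small, purely illustrative example of the initializer style this
    enables (the array and its size are hypothetical):

        /* Each element gets a named initializer that the lock debugging
         * code can report, instead of the anonymous SPIN_LOCK_UNLOCKED. */
        static spinlock_t bucket_lock[NR_BUCKETS] = {
                [0 ... NR_BUCKETS - 1] = __SPIN_LOCK_UNLOCKED(bucket_lock)
        };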
     
  • Generic lock debugging:

    - generalized lock debugging framework. For example, a bug in one lock
    subsystem turns off debugging in all lock subsystems.

    - got rid of the caller address passing (__IP__/__IP_DECL__/etc.) from
    the mutex/rtmutex debugging code: it caused way too much prototype
    hackery, and lockdep will give the same information anyway.

    - ability to do silent tests

    - check lock freeing in vfree too.

    - more finegrained debugging options, to allow distributions to
    turn off more expensive debugging features.

    There's no separate 'held mutexes' list anymore - but there's a 'held locks'
    stack within lockdep, which unifies deadlock detection across all lock
    classes. (this is independent of the lockdep validation stuff - lockdep first
    checks whether we are holding a lock already)

    Here are the current debugging options:

    CONFIG_DEBUG_MUTEXES=y
    CONFIG_DEBUG_LOCK_ALLOC=y

    which do:

    config DEBUG_MUTEXES
    bool "Mutex debugging, basic checks"

    config DEBUG_LOCK_ALLOC
    bool "Detect incorrect freeing of live mutexes"

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • It turns out that it is advantageous to leave a small portion of unmapped file
    backed pages if all of a zone's pages (or almost all pages) are allocated and
    so the page allocator has to go off-node.

    This allows recently used file I/O buffers to stay on the node and
    reduces the times that zone reclaim is invoked if file I/O occurs
    when we run out of memory in a zone.

    The problem is that zone reclaim runs too frequently when the page cache is
    used for file I/O (read/write and therefore unmapped pages!) alone and we
    have almost all pages of the zone allocated. Zone reclaim may remove 32
    unmapped pages. File I/O will use these pages for the next read/write
    requests and the unmapped pages increase. After the zone has filled up
    again, zone reclaim will remove them again after only 32 pages. This cycle
    is too inefficient and there are potentially too many zone reclaim cycles.

    With the 1% boundary we may still remove all unmapped pages for file I/O in
    a zone reclaim pass (the resulting check is sketched after this entry).
    However, it will take a large number of reads and writes to get back to 1%
    again, where we trigger zone reclaim again.

    Zone reclaim in 2.6.16/17 does not show this behavior because we have a
    30 second timeout.

    [akpm@osdl.org: rename the /proc file and the variable]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
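
    A minimal sketch of the resulting check at the top of zone_reclaim()
    (the 1% figure is the default; the real patch reads the limit from the
    new /proc tunable rather than computing it inline):

        unsigned long min_unmapped = zone->present_pages / 100;

        /* Leave the zone alone while it still holds enough unmapped
         * pagecache for file I/O to reuse. */
        if (zone_page_state(zone, NR_FILE_PAGES) -
            zone_page_state(zone, NR_FILE_MAPPED) <= min_unmapped)
                return 0;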
     

01 Jul, 2006

4 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial:
    Remove obsolete #include
    remove obsolete swsusp_encrypt
    arch/arm26/Kconfig typos
    Documentation/IPMI typos
    Kconfig: Typos in net/sched/Kconfig
    v9fs: do not include linux/version.h
    Documentation/DocBook/mtdnand.tmpl: typo fixes
    typo fixes: specfic -> specific
    typo fixes in Documentation/networking/pktgen.txt
    typo fixes: occuring -> occurring
    typo fixes: infomation -> information
    typo fixes: disadvantadge -> disadvantage
    typo fixes: aquire -> acquire
    typo fixes: mecanism -> mechanism
    typo fixes: bandwith -> bandwidth
    fix a typo in the RTC_CLASS help text
    smb is no longer maintained

    Manually merged trivial conflict in arch/um/kernel/vmlinux.lds.S

    Linus Torvalds
     
  • Post and discussion:
    http://marc.theaimsgroup.com/?t=115074342800003&r=1&w=2

    Code in __shrink_node() duplicates code in cache_reap()

    Add a new function drain_freelist that removes slabs with objects that are
    already free and use that in various places.

    This eliminates the __node_shrink() function and provides the interrupt
    holdoff reduction from slab_free to code that used to call __node_shrink.

    [akpm@osdl.org: build fixes]
    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The remaining counters in page_state after the zoned VM counter patches
    have been applied are all just for show in /proc/vmstat. They have no
    essential function for the VM.

    We use a simple increment of per-cpu variables. In order to avoid the most
    severe races we disable preemption. Preemption does not prevent the race
    between an increment and an interrupt handler incrementing the same
    statistics counter. However, that race is exceedingly rare; we may only
    lose an increment or so, and there is no requirement (at least not in the
    kernel) that the vm event counters have to be accurate.

    In the non preempt case this results in a simple increment for each
    counter. For many architectures this will be reduced by the compiler to a
    single instruction. This single instruction is atomic for i386 and x86_64.
    And therefore even the rare race condition in an interrupt is avoided for
    both architectures in most cases.

    The patchset also adds an off switch for embedded systems that allows
    building Linux kernels without these counters.

    The implementation of these counters is through inline code that hopefully
    results in only a single increment instruction being emitted (i386, x86_64),
    or in the increment being hidden through instruction concurrency (EPIC
    architectures such as ia64 can get that done). A sketch of the increment
    path follows this entry.

    Benefits:
    - VM event counter operations usually reduce to a single inline instruction
    on i386 and x86_64.
    - No interrupt disable, only preempt disable for the preempt case.
    Preempt disable can also be avoided by moving the counter into a spinlock.
    - Handling is similar to zoned VM counters.
    - Simple and easily extendable.
    - Can be omitted to reduce memory use on embedded systems.

    References:

    RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=113512330605497&w=2
    RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=114988082814934&w=2
    local_t http://marc.theaimsgroup.com/?l=linux-kernel&m=114991748606690&w=2
    V2 http://marc.theaimsgroup.com/?t=115014808400007&r=1&w=2
    V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767022346&w=2
    V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115047968808926&w=2

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
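
    A minimal sketch of the increment path described above (structure and
    enum names as assumed here):

        struct vm_event_state {
                unsigned long event[NR_VM_EVENT_ITEMS];
        };

        DECLARE_PER_CPU(struct vm_event_state, vm_event_states);

        static inline void count_vm_event(enum vm_event_item item)
        {
                /* get_cpu_var() disables preemption around the increment;
                 * interrupts are deliberately left enabled (see above). */
                get_cpu_var(vm_event_states).event[item]++;
                put_cpu_var(vm_event_states);
        }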
     
  • The numa statistics are really event counters. But they are per node and
    so we have had special treatment for these counters through additional
    fields on the pcp structure. We can now use the per zone nature of the
    zoned VM counters to realize these.

    This will shrink the size of the pcp structure on NUMA systems. We will
    have some room to add additional per zone counters that will all still fit
    in the same cacheline.

    Bits   Prior pcp size          Size after patch        We can add
    ------------------------------------------------------------------
    64     128 bytes (16 words)    80 bytes (10 words)     48
    32      76 bytes (19 words)    56 bytes (14 words)      8 (64 byte cacheline)
                                                            72 (128 byte cacheline)

    Remove the special statistics for numa and replace them with zoned vm
    counters. This has the side effect that global sums of these events now
    show up in /proc/vmstat.

    Also take the opportunity to move the zone_statistics() function from
    page_alloc.c into vmstat.c.

    Discussions:
    V2 http://marc.theaimsgroup.com/?t=115048227000002&r=1&w=2

    Signed-off-by: Christoph Lameter
    Acked-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
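
    A minimal sketch of zone_statistics() once it operates on the zoned VM
    counters (exact form assumed; z is the zone the page was taken from):

        void zone_statistics(struct zonelist *zonelist, struct zone *z)
        {
                /* Hit or miss relative to the preferred (first) zone's node */
                if (z->zone_pgdat == zonelist->zones[0]->zone_pgdat) {
                        __inc_zone_state(z, NUMA_HIT);
                } else {
                        __inc_zone_state(z, NUMA_MISS);
                        __inc_zone_state(zonelist->zones[0], NUMA_FOREIGN);
                }
                /* Local or remote relative to the allocating CPU's node */
                if (z->zone_pgdat == NODE_DATA(numa_node_id()))
                        __inc_zone_state(z, NUMA_LOCAL);
                else
                        __inc_zone_state(z, NUMA_OTHER);
        }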