16 Jan, 2009

2 commits

  • Revert commit e97a630eb0f5b8b380fd67504de6cedebb489003 ("mm: vmalloc use
    mutex for purge")

    Bryan Donlan reports:

    : After testing 2.6.29-rc1 on xen-x86 with a btrfs root filesystem, I
    : got the OOPS quoted below and a hard freeze shortly after boot.
    : Boot messages and config are attached.
    :
    : ------------[ cut here ]------------
    : Kernel BUG at c05ef80d [verbose debug info unavailable]
    : invalid opcode: 0000 [#1] SMP
    : last sysfs file: /sys/block/xvdc/size
    : Modules linked in:
    :
    : Pid: 0, comm: swapper Not tainted (2.6.29-rc1 #6)
    : EIP: 0061:[] EFLAGS: 00010087 CPU: 2
    : EIP is at schedule+0x7cd/0x950
    : EAX: d5aeca80 EBX: 00000002 ECX: 00000000 EDX: d4cb9a40
    : ESI: c12f5600 EDI: d4cb9a40 EBP: d6033fa4 ESP: d6033ef4
    : DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0069
    : Process swapper (pid: 0, ti=d6032000 task=d6020b70 task.ti=d6032000)
    : Stack:
    : 000d85bc 00000000 000186a0 00000000 0dd11410 c0105417 c12efe00 0dc367c3
    : 00000011 c0105d46 d5a5d310 deadbeef d4cb9a40 c07cc600 c05f1340 c12e0060
    : deadbeef d6020b70 d6020d08 00000002 c014377d 00000000 c12f5600 00002c22
    : Call Trace:
    : [] xen_force_evtchn_callback+0x17/0x30
    : [] check_events+0x8/0x12
    : [] _spin_unlock_irqrestore+0x20/0x40
    : [] hrtimer_start_range_ns+0x12d/0x2e0
    : [] tick_nohz_restart_sched_tick+0x146/0x160
    : [] cpu_idle+0xa5/0xc0

    and bisected it to this commit.

    Let's remove it now while we have a think about the problem.

    Reported-by: Bryan Donlan
    Tested-by: Christophe Saout
    Cc: Nick Piggin
    Cc: Ingo Molnar
    Cc: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • On alpha, we have to map some stuff in the VMALLOC space very early in the
    boot process (to make SRM console callbacks work and so on, see
    arch/alpha/mm/init.c). With the old VM allocator, we just manually placed
    a vm_struct onto the global vmlist, and this worked for ages.

    Unfortunately, the new allocator isn't aware of this, so it constantly
    tries to allocate the VM space which is already in use, making vmalloc on
    alpha defunct.

    This patch forces KVA to import vmlist entries on init.

    [akpm@linux-foundation.org: remove unneeded check (per Johannes)]
    Signed-off-by: Ivan Kokshaysky
    Cc: Nick Piggin
    Cc: Johannes Weiner
    Cc: Richard Henderson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ivan Kokshaysky
     

07 Jan, 2009

4 commits

  • Lazy unmapping in the vmalloc code has now opened the possibility for
    use-after-free bugs to go undetected. We can catch those by forcing an
    unmap and flush (which is going to be slow, but that's what happens).

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • The vmalloc purge lock can be a mutex so we can sleep while a purge is
    going on (purge involves a global kernel TLB invalidate, so it can take
    quite a while).

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • If the vmalloc variants go through an intermediate wrapper, the output of
    files like /proc/vmallocinfo will show things like "vmalloc_32",
    "vmalloc_user", or whichever wrapper was called as the caller. This info
    is not as useful as the real caller of the allocation.

    So, the proposal is to call __vmalloc_node directly, with matching
    parameters, to save the caller information.

    Signed-off-by: Glauber Costa
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • If we can't service a vmalloc allocation, show the size of the allocation
    that actually failed. Useful for debugging.

    Signed-off-by: Glauber Costa
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     

05 Jan, 2009

1 commit


11 Dec, 2008

1 commit

  • Miles Lane, tailing /sys files, hit a BUG which Pekka Enberg has tracked
    to my 966c8c12dc9e77f931e2281ba25d2f0244b06949 ("sprint_symbol(): use
    less stack"), which exposed a bug in slub's list_locations():
    kallsyms_lookup() writes a 0 to namebuf[KSYM_NAME_LEN-1], but that was
    beyond the end of the page provided.

    The slop of 100 which list_locations() allows at the end of the page
    looks roughly enough for all the other stuff it might print after the
    symbol before it checks again: break out KSYM_SYMBOL_LEN earlier than
    before.

    Latencytop and ftrace are using KSYM_NAME_LEN buffers where they
    need KSYM_SYMBOL_LEN buffers, and vmallocinfo a 2*KSYM_NAME_LEN buffer
    where it wants a KSYM_SYMBOL_LEN buffer: fix those before anyone copies
    them.
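
    To illustrate why a name-sized buffer is too small for a fully formatted
    symbol, here is a minimal userspace sketch. NAME_LEN, MODULE_LEN,
    SYMBOL_LEN and format_symbol() are hypothetical stand-ins, and the
    constant values are assumptions rather than the kernel's definitions.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-ins for the kernel constants; the values are assumptions. */
#define NAME_LEN   128   /* plays the role of KSYM_NAME_LEN */
#define MODULE_LEN 56    /* plays the role of the module name limit */

/* Room for "name+0xoff/0xsize [module]": the literal characters of the
 * format, the name, two hex numbers with "0x" prefix, the module name. */
#define SYMBOL_LEN (sizeof("%s+%#lx/%#lx [%s]") + (NAME_LEN - 1) + \
                    2 * (sizeof(long) * 2 + 2) + (MODULE_LEN - 1) + 1)

/* Formats a symbol into buf and returns the length the full string needs
 * (excluding the terminating NUL), which may exceed buflen - 1. */
static int format_symbol(char *buf, size_t buflen, const char *name,
                         unsigned long off, unsigned long size,
                         const char *module)
{
    assert(buf != NULL && buflen > 0);
    return snprintf(buf, buflen, "%s+%#lx/%#lx [%s]",
                    name, off, size, module);
}
```

    With a name of maximum length, a NAME_LEN buffer truncates the offset,
    size and module name, while a SYMBOL_LEN buffer holds the whole string.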

    [akpm@linux-foundation.org: ftrace.h needs module.h]
    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: Miles Lane
    Acked-by: Pekka Enberg
    Acked-by: Steven Rostedt
    Acked-by: Frederic Weisbecker
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

02 Dec, 2008

1 commit

  • Jim Radford has reported that the vmap subsystem rewrite was sometimes
    causing his VIVT ARM system to behave strangely (seemed like going into
    infinite loops trying to fault in pages to userspace).

    We determined that the problem was most likely due to a cache aliasing
    issue. flush_cache_vunmap was only being called at the moment the page
    tables were to be taken down, however with lazy unmapping, this can happen
    after the page has subsequently been freed and allocated for something
    else. The dangling alias may still have dirty data attached to it.

    The fix for this problem is to do the cache flushing when the caller has
    called vunmap -- it would be a bug for them to write anything else to the
    mapping at that point.

    That appeared to solve Jim's problems.

    Reported-by: Jim Radford
    Signed-off-by: Nick Piggin
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

20 Nov, 2008

3 commits

  • Currently, vmalloc restarts the search for a free area when it can't find
    one. The reason is that there are areas which are lazily freed and could
    possibly be freed now. However, the current implementation starts
    searching the tree from the last failing address, which is pretty much by
    definition at the end of the address space. So, we fail.

    The proposal of this patch is to restart the search from the beginning of
    the requested vstart address. This fixes the regression in running KVM
    virtual machines for me, described in http://lkml.org/lkml/2008/10/28/349,
    caused by commit db64fe02258f1507e13fe5212a989922323685ce.
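
    The idea can be sketched with a toy userspace model (not the kernel
    code). The slot array, NSLOTS, find_free_run(), purge_lazy() and
    alloc_run() are all hypothetical names; the point is only that the
    retry after a purge starts again at vstart.

```c
#include <assert.h>

/* Hypothetical toy model: the address space is an array of slots. */
enum slot_state { SLOT_FREE, SLOT_BUSY, SLOT_LAZY };

#define NSLOTS 16

/* First-fit search for len contiguous free slots in [from, NSLOTS).
 * Returns the index of the first slot, or -1 when nothing fits. */
static int find_free_run(const enum slot_state *slots, int from, int len)
{
    int run = 0;

    for (int i = from; i < NSLOTS; i++) {
        run = (slots[i] == SLOT_FREE) ? run + 1 : 0;
        if (run == len)
            return i - len + 1;
    }
    return -1;
}

/* Stand-in for purging lazily freed areas: they become reusable. */
static void purge_lazy(enum slot_state *slots)
{
    for (int i = 0; i < NSLOTS; i++)
        if (slots[i] == SLOT_LAZY)
            slots[i] = SLOT_FREE;
}

/* Allocate len slots at or above vstart. After a failed pass, purge and
 * search again from vstart, not from where the failed search ended. */
static int alloc_run(enum slot_state *slots, int vstart, int len)
{
    assert(vstart >= 0 && vstart < NSLOTS && len > 0);

    int at = find_free_run(slots, vstart, len);
    if (at < 0) {
        purge_lazy(slots);
        at = find_free_run(slots, vstart, len);   /* restart from vstart */
    }
    if (at >= 0)
        for (int i = at; i < at + len; i++)
            slots[i] = SLOT_BUSY;
    return at;
}
```

    Restarting from the failing position would skip the slots that the
    purge has just made available, which is the failure described above.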

    Signed-off-by: Glauber Costa
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • An initial vmalloc failure should start off a synchronous flush of lazy
    areas, in case someone is in progress flushing them already, which could
    cause us to return an allocation failure even if there is plenty of KVA
    free.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Fix off by one bug in the KVA allocator that can leave gaps in the address
    space.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

07 Nov, 2008

2 commits

  • Xen can end up calling vm_unmap_aliases() before vmalloc_init() has
    been called. In this case it's safe to make it a simple no-op.

    Signed-off-by: Jeremy Fitzhardinge
    Cc: Linux Memory Management List
    Cc: Nick Piggin
    Signed-off-by: Ingo Molnar

    Jeremy Fitzhardinge
     
  • As of 73bdf0a60e607f4b8ecc5aec597105976565a84f, the kernel needs
    to know where modules are located in the virtual address space.
    On ARM, we located this region between MODULE_START and MODULE_END.
    Unfortunately, everyone else calls it MODULES_VADDR and MODULES_END.
    Update ARM to use the same naming, so is_vmalloc_or_module_addr()
    can work properly. Also update the comment on mm/vmalloc.c to
    reflect that ARM also places modules in a separate region from the
    vmalloc space.

    Signed-off-by: Russell King

    Russell King
     

31 Oct, 2008

1 commit

  • Delete excess kernel-doc notation in mm/ subdirectory.
    Actually this is a kernel-doc notation fix.

    Warning(/var/linsrc/linux-2.6.27-git10//mm/vmalloc.c:902): Excess function parameter or struct member 'returns' description in 'vm_map_ram'

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

23 Oct, 2008

1 commit


21 Oct, 2008

2 commits


20 Oct, 2008

1 commit

  • Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
    provide a fast, scalable percpu frontend for small vmaps (requires a
    slightly different API, though).

    The biggest problem with vmap is actually vunmap. Presently this requires
    a global kernel TLB flush, which on most architectures is a broadcast IPI
    to all CPUs to flush the cache. This is all done under a global lock. As
    the number of CPUs increases, so will the number of vunmaps a scaled
    workload will want to perform, and so will the cost of a global TLB flush.
    This gives terrible quadratic scalability characteristics.

    Another problem is that the entire vmap subsystem works under a single
    lock. It is a rwlock, but it is actually taken for write in all the fast
    paths, and the read locking would likely never be run concurrently anyway,
    so it's just pointless.

    This is a rewrite of vmap subsystem to solve those problems. The existing
    vmalloc API is implemented on top of the rewritten subsystem.

    The TLB flushing problem is solved by using lazy TLB unmapping. vmap
    addresses do not have to be flushed immediately when they are vunmapped,
    because the kernel will not reuse them again (would be a use-after-free)
    until they are reallocated. So the addresses aren't allocated again until
    a subsequent TLB flush. A single TLB flush then can flush multiple
    vunmaps from each CPU.
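
    The batching effect can be sketched with a small userspace counter
    model. This is an illustration under assumptions, not the kernel
    implementation: struct lazy_vmap, LAZY_MAX, lazy_vunmap() and
    flush_all() are hypothetical names, and the threshold is arbitrary.

```c
#include <assert.h>
#include <stddef.h>

#define LAZY_MAX 4   /* hypothetical threshold of pending unmaps */

struct lazy_vmap {
    size_t nr_lazy;      /* unmapped ranges not yet flushed */
    size_t nr_flushes;   /* global TLB flushes performed so far */
    size_t nr_reusable;  /* ranges whose addresses may be handed out again */
};

/* One global flush covers every pending range; only afterwards do the
 * addresses become available for reallocation. */
static void flush_all(struct lazy_vmap *v)
{
    assert(v != NULL);
    if (v->nr_lazy == 0)
        return;
    v->nr_flushes++;
    v->nr_reusable += v->nr_lazy;
    v->nr_lazy = 0;
}

/* Unmapping only records the range; the flush is deferred until enough
 * ranges have accumulated. */
static void lazy_vunmap(struct lazy_vmap *v)
{
    assert(v != NULL);
    v->nr_lazy++;
    if (v->nr_lazy >= LAZY_MAX)
        flush_all(v);
}
```

    In this model eight unmaps cost two flushes instead of eight, and a
    caller that cannot tolerate stale aliases can force the flush itself by
    calling flush_all().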

    XEN and PAT and such do not like deferred TLB flushing because they can't
    always handle multiple aliasing virtual addresses to a physical address.
    They now call vm_unmap_aliases() in order to flush any deferred mappings.
    That call is very expensive (well, actually not a lot more expensive than
    a single vunmap under the old scheme), however it should be OK if not
    called too often.

    The virtual memory extent information is stored in an rbtree rather than a
    linked list to improve the algorithmic scalability.

    There is a per-CPU allocator for small vmaps, which amortizes or avoids
    global locking.

    To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces
    must be used in place of vmap and vunmap. Vmalloc does not use these
    interfaces at the moment, so it will not be quite so scalable (although it
    will use lazy TLB flushing).

    As a quick test of performance, I ran a test that loops in the kernel,
    linearly mapping then touching then unmapping 4 pages. Different numbers
    of tests were run in parallel on a 4 core, 2 socket Opteron. Results are
    in nanoseconds per map+touch+unmap.

    threads    vanilla    vmap rewrite
    1          14700      2900
    2          33600      3000
    4          49500      2800
    8          70631      2900

    So with 8 cores, the rewritten version is already 25x faster.

    In a slightly more realistic test (although with an older and less
    scalable version of the patch), I ripped the not-very-good vunmap batching
    code out of XFS, and implemented the large buffer mapping with vm_map_ram
    and vm_unmap_ram... along with a couple of other tricks, I was able to
    speed up a large directory workload by 20x on a 64 CPU system. I believe
    vmap/vunmap is actually sped up a lot more than 20x on such a system, but
    I'm running into other locks now. vmap is pretty well blown off the
    profiles.

    Before:
    1352059 total 0.1401
    798784 _write_lock 8320.6667
    Cc: Jeremy Fitzhardinge
    Cc: Krzysztof Helt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

16 Oct, 2008

1 commit

  • Impact: crash on module insertion with CONFIG_DEBUG_VIRTUAL

    We would incorrectly BUG due to:

    VIRTUAL_BUG_ON(!is_vmalloc_addr(vmalloc_addr) &&
                   !is_module_address(addr));

    ... because, at least on x86-64, is_module_address() doesn't do what
    it should. This patch introduces is_vmalloc_or_module_addr(), which
    is what we really want anyway, and uses it instead.
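
    A combined check of this kind might look like the following userspace
    sketch. The range constants and the *_sketch helpers are hypothetical:
    the start addresses are borrowed from typical x86-64 layouts of that
    era and the end addresses are assumptions for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical address ranges; not taken from any kernel header. */
#define SK_VMALLOC_START 0xffffc20000000000ULL
#define SK_VMALLOC_END   0xffffe20000000000ULL
#define SK_MODULES_VADDR 0xffffffffa0000000ULL
#define SK_MODULES_END   0xfffffffffff00000ULL

/* Half-open range test: start <= addr < end. */
static bool in_range(uint64_t addr, uint64_t start, uint64_t end)
{
    assert(start < end);
    return addr >= start && addr < end;
}

static bool is_vmalloc_addr_sketch(uint64_t addr)
{
    return in_range(addr, SK_VMALLOC_START, SK_VMALLOC_END);
}

/* Accepts addresses from the vmalloc area or from the separate region
 * where modules are placed. */
static bool is_vmalloc_or_module_addr_sketch(uint64_t addr)
{
    return in_range(addr, SK_MODULES_VADDR, SK_MODULES_END) ||
           is_vmalloc_addr_sketch(addr);
}
```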

    Signed-off-by: H. Peter Anvin

    Linus Torvalds
     

12 Oct, 2008

1 commit


27 Jul, 2008

1 commit

  • Use WARN() instead of a printk+WARN_ON() pair; this way the message becomes
    part of the warning section for better reporting/collection.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     

25 Jul, 2008

1 commit

  • Christoph recently added /proc/vmallocinfo file to get information about
    vmalloc allocations.

    This patch adds NUMA specific information, giving number of pages
    allocated on each memory node.

    This should help to check that vmalloc() is able to respect NUMA policies.

    Example of output on a four-node machine (one CPU per node):

    1) network hash tables are evenly spread across the four nodes (OK) (same
    point for the inode and dentry hash tables)

    2) iptables tables (x_tables) are correctly allocated on each cpu node
    (OK).

    3) sys_swapon() allocates its memory from one node only.

    4) each loaded module is using memory on one node.

    Sysadmins could tune their setup to change points 3) and 4) if necessary.

    grep "pages=" /proc/vmallocinfo
    0xffffc20000000000-0xffffc20000201000 2101248 alloc_large_system_hash+0x204/0x2c0 pages=512 vmalloc N0=128 N1=128 N2=128 N3=128
    0xffffc20000201000-0xffffc20000302000 1052672 alloc_large_system_hash+0x204/0x2c0 pages=256 vmalloc N0=64 N1=64 N2=64 N3=64
    0xffffc2000031a000-0xffffc2000031d000 12288 alloc_large_system_hash+0x204/0x2c0 pages=2 vmalloc N1=1 N2=1
    0xffffc2000031f000-0xffffc2000032b000 49152 cramfs_uncompress_init+0x2e/0x80 pages=11 vmalloc N0=3 N1=3 N2=2 N3=3
    0xffffc2000033e000-0xffffc20000341000 12288 sys_swapon+0x640/0xac0 pages=2 vmalloc N0=2
    0xffffc20000341000-0xffffc20000344000 12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N0=2
    0xffffc20000344000-0xffffc20000347000 12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N1=2
    0xffffc20000347000-0xffffc2000034a000 12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N2=2
    0xffffc2000034a000-0xffffc2000034d000 12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N3=2
    0xffffc20004381000-0xffffc20004402000 528384 alloc_large_system_hash+0x204/0x2c0 pages=128 vmalloc N0=32 N1=32 N2=32 N3=32
    0xffffc20004402000-0xffffc20004803000 4198400 alloc_large_system_hash+0x204/0x2c0 pages=1024 vmalloc vpages N0=256 N1=256 N2=256 N3=256
    0xffffc20004803000-0xffffc20004904000 1052672 alloc_large_system_hash+0x204/0x2c0 pages=256 vmalloc N0=64 N1=64 N2=64 N3=64
    0xffffc20004904000-0xffffc20004bec000 3047424 sys_swapon+0x640/0xac0 pages=743 vmalloc vpages N0=743
    0xffffffffa0000000-0xffffffffa000f000 61440 sys_init_module+0xc27/0x1d00 pages=14 vmalloc N1=14
    0xffffffffa000f000-0xffffffffa0014000 20480 sys_init_module+0xc27/0x1d00 pages=4 vmalloc N0=4
    0xffffffffa0014000-0xffffffffa0017000 12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N0=2
    0xffffffffa0017000-0xffffffffa0022000 45056 sys_init_module+0xc27/0x1d00 pages=10 vmalloc N1=10
    0xffffffffa0022000-0xffffffffa0028000 24576 sys_init_module+0xc27/0x1d00 pages=5 vmalloc N3=5
    0xffffffffa0028000-0xffffffffa0050000 163840 sys_init_module+0xc27/0x1d00 pages=39 vmalloc N1=39
    0xffffffffa0050000-0xffffffffa0052000 8192 sys_init_module+0xc27/0x1d00 pages=1 vmalloc N1=1
    0xffffffffa0052000-0xffffffffa0056000 16384 sys_init_module+0xc27/0x1d00 pages=3 vmalloc N1=3
    0xffffffffa0056000-0xffffffffa0081000 176128 sys_init_module+0xc27/0x1d00 pages=42 vmalloc N3=42
    0xffffffffa0081000-0xffffffffa00ae000 184320 sys_init_module+0xc27/0x1d00 pages=44 vmalloc N3=44
    0xffffffffa00ae000-0xffffffffa00b1000 12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N3=2
    0xffffffffa00b1000-0xffffffffa00b9000 32768 sys_init_module+0xc27/0x1d00 pages=7 vmalloc N0=7
    0xffffffffa00b9000-0xffffffffa00c4000 45056 sys_init_module+0xc27/0x1d00 pages=10 vmalloc N3=10
    0xffffffffa00c6000-0xffffffffa00e0000 106496 sys_init_module+0xc27/0x1d00 pages=25 vmalloc N2=25
    0xffffffffa00e0000-0xffffffffa00f1000 69632 sys_init_module+0xc27/0x1d00 pages=16 vmalloc N2=16
    0xffffffffa00f1000-0xffffffffa00f4000 12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N3=2
    0xffffffffa00f4000-0xffffffffa00f7000 12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N3=2

    [akpm@linux-foundation.org: fix comment]
    Signed-off-by: Eric Dumazet
    Cc: Christoph Lameter
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

19 Jun, 2008

2 commits

  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Add some (configurable) expensive sanity checking to catch wrong address
    translations on x86.

    - create linux/mmdebug.h file to be able to include this file in
    asm headers without getting unsolvable loops in header files
    - __phys_addr on x86_32 became a function in ioremap.c since
    PAGE_OFFSET, is_vmalloc_addr and VMALLOC_* non-constants are undefined
    if declared in page_32.h
    - add __phys_addr_const for initializing doublefault_tss.__cr3

    Tested on 386, 386pae, x86_64 and x86_64 numa=fake=2.

    Contains Andi's enable numa virtual address debug patch.

    Signed-off-by: Jiri Slaby
    Cc: Andi Kleen
    Signed-off-by: Ingo Molnar

    Jiri Slaby
     

01 May, 2008

1 commit


30 Apr, 2008

1 commit

  • We can see an ever repeating problem pattern with objects of any kind in the
    kernel:

    1) freeing of active objects
    2) reinitialization of active objects

    Both problems can be hard to debug because the crash happens at a point where
    we have no chance to decode the root cause anymore. One problem spot is
    kernel timers, where the detection of the problem often happens in interrupt
    context and usually causes the machine to panic.

    While working on a timer related bug report I had to hack specialized code
    into the timer subsystem to get a reasonable hint for the root cause. This
    debug hack was fine for temporary use, but far from a mergeable solution due
    to the intrusiveness into the timer code.

    The code further lacked the ability to detect and report the root cause
    instantly and keep the system operational.

    Keeping the system operational is important to get hold of the debug
    information without special debugging aids like serial consoles and special
    knowledge of the bug reporter.

    The problems described above are not restricted to timers, but timers tend to
    expose it usually in a full system crash. Other objects are less explosive,
    but the symptoms caused by such mistakes can be even harder to debug.

    Instead of creating specialized debugging code for the timer subsystem a
    generic infrastructure is created which allows developers to verify their code
    and provides an easy to enable debug facility for users in case of trouble.

    The debugobjects core code keeps track of operations on static and dynamic
    objects by inserting them into a hashed list and sanity checking them on
    object operations and provides additional checks whenever kernel memory is
    freed.

    The tracked object operations are:
    - initializing an object
    - adding an object to a subsystem list
    - deleting an object from a subsystem list

    Each operation is sanity checked before the operation is executed, and the
    subsystem specific code can provide a fixup function which can prevent
    the damage of the operation. When the sanity check triggers, a warning message
    and a stack trace are printed.

    The list of operations can be extended if the need arises. For now it's
    limited to the requirements of the first user (timers).

    The core code enqueues the objects into hash buckets. The hash index is
    generated from the address of the object to simplify the lookup for the check
    on kfree/vfree. Each bucket has its own spinlock to avoid contention on a
    global lock.
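
    A rough userspace sketch of address-based bucketing follows. It is an
    assumption-laden illustration, not the debugobjects code: HASH_BITS,
    CHUNK_SHIFT, struct tracked, track() and lookup() are hypothetical, the
    hash is a plain mask, and the per-bucket lock is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HASH_BITS   4
#define NBUCKETS    (1u << HASH_BITS)
#define CHUNK_SHIFT 12   /* hypothetical: group addresses in 4 KiB chunks */

struct tracked {
    const void *addr;        /* address of the tracked object */
    struct tracked *next;
};

struct bucket {
    struct tracked *head;    /* a per-bucket lock would live here */
};

static struct bucket table[NBUCKETS];

/* Derive the bucket from the object's address, so a freed memory range
 * maps straight to the buckets that can hold objects inside it. */
static unsigned bucket_index(const void *addr)
{
    uintptr_t chunk = (uintptr_t)addr >> CHUNK_SHIFT;

    return (unsigned)(chunk & (NBUCKETS - 1));
}

static void track(struct tracked *t, const void *addr)
{
    assert(t != NULL);
    struct bucket *b = &table[bucket_index(addr)];

    t->addr = addr;
    t->next = b->head;
    b->head = t;
}

static struct tracked *lookup(const void *addr)
{
    for (struct tracked *t = table[bucket_index(addr)].head; t; t = t->next)
        if (t->addr == addr)
            return t;
    return NULL;
}
```

    Because only one bucket is touched per operation, a lock per bucket
    keeps unrelated objects from contending with each other.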

    The debug code can be compiled in without being active. The runtime overhead
    is minimal and could be optimized by asm alternatives. A kernel command line
    option enables the debugging code.

    Thanks to Ingo Molnar for review, suggestions and cleanup patches.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar
    Cc: Greg KH
    Cc: Randy Dunlap
    Cc: Kay Sievers
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     

28 Apr, 2008

2 commits

  • Add caller information so that /proc/vmallocinfo shows where the allocation
    request for a slice of vmalloc memory originated.

    Results in output like this:

    0xffffc20000000000-0xffffc20000801000 8392704 alloc_large_system_hash+0x127/0x246 pages=2048 vmalloc vpages
    0xffffc20000801000-0xffffc20000806000 20480 alloc_large_system_hash+0x127/0x246 pages=4 vmalloc
    0xffffc20000806000-0xffffc20000c07000 4198400 alloc_large_system_hash+0x127/0x246 pages=1024 vmalloc vpages
    0xffffc20000c07000-0xffffc20000c0a000 12288 alloc_large_system_hash+0x127/0x246 pages=2 vmalloc
    0xffffc20000c0a000-0xffffc20000c0c000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
    0xffffc20000c0c000-0xffffc20000c0f000 12288 acpi_os_map_memory+0x13/0x1c phys=cff64000 ioremap
    0xffffc20000c10000-0xffffc20000c15000 20480 acpi_os_map_memory+0x13/0x1c phys=cff65000 ioremap
    0xffffc20000c16000-0xffffc20000c18000 8192 acpi_os_map_memory+0x13/0x1c phys=cff69000 ioremap
    0xffffc20000c18000-0xffffc20000c1a000 8192 acpi_os_map_memory+0x13/0x1c phys=fed1f000 ioremap
    0xffffc20000c1a000-0xffffc20000c1c000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
    0xffffc20000c1c000-0xffffc20000c1e000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
    0xffffc20000c1e000-0xffffc20000c20000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
    0xffffc20000c20000-0xffffc20000c22000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
    0xffffc20000c22000-0xffffc20000c24000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
    0xffffc20000c24000-0xffffc20000c26000 8192 acpi_os_map_memory+0x13/0x1c phys=e0081000 ioremap
    0xffffc20000c26000-0xffffc20000c28000 8192 acpi_os_map_memory+0x13/0x1c phys=e0080000 ioremap
    0xffffc20000c28000-0xffffc20000c2d000 20480 alloc_large_system_hash+0x127/0x246 pages=4 vmalloc
    0xffffc20000c2d000-0xffffc20000c31000 16384 tcp_init+0xd5/0x31c pages=3 vmalloc
    0xffffc20000c31000-0xffffc20000c34000 12288 alloc_large_system_hash+0x127/0x246 pages=2 vmalloc
    0xffffc20000c34000-0xffffc20000c36000 8192 init_vdso_vars+0xde/0x1f1
    0xffffc20000c36000-0xffffc20000c38000 8192 pci_iomap+0x8a/0xb4 phys=d8e00000 ioremap
    0xffffc20000c38000-0xffffc20000c3a000 8192 usb_hcd_pci_probe+0x139/0x295 [usbcore] phys=d8e00000 ioremap
    0xffffc20000c3a000-0xffffc20000c3e000 16384 sys_swapon+0x509/0xa15 pages=3 vmalloc
    0xffffc20000c40000-0xffffc20000c61000 135168 e1000_probe+0x1c4/0xa32 phys=d8a20000 ioremap
    0xffffc20000c61000-0xffffc20000c6a000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
    0xffffc20000c6a000-0xffffc20000c73000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
    0xffffc20000c73000-0xffffc20000c7c000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
    0xffffc20000c7c000-0xffffc20000c7f000 12288 e1000e_setup_tx_resources+0x29/0xbe pages=2 vmalloc
    0xffffc20000c80000-0xffffc20001481000 8392704 pci_mmcfg_arch_init+0x90/0x118 phys=e0000000 ioremap
    0xffffc20001481000-0xffffc20001682000 2101248 alloc_large_system_hash+0x127/0x246 pages=512 vmalloc
    0xffffc20001682000-0xffffc20001e83000 8392704 alloc_large_system_hash+0x127/0x246 pages=2048 vmalloc vpages
    0xffffc20001e83000-0xffffc20002204000 3674112 alloc_large_system_hash+0x127/0x246 pages=896 vmalloc vpages
    0xffffc20002204000-0xffffc2000220d000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
    0xffffc2000220d000-0xffffc20002216000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
    0xffffc20002216000-0xffffc2000221f000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
    0xffffc2000221f000-0xffffc20002228000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
    0xffffc20002228000-0xffffc20002231000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
    0xffffc20002231000-0xffffc20002234000 12288 e1000e_setup_rx_resources+0x35/0x122 pages=2 vmalloc
    0xffffc20002240000-0xffffc20002261000 135168 e1000_probe+0x1c4/0xa32 phys=d8a60000 ioremap
    0xffffc20002261000-0xffffc2000270c000 4894720 sys_swapon+0x509/0xa15 pages=1194 vmalloc vpages
    0xffffffffa0000000-0xffffffffa0022000 139264 module_alloc+0x4f/0x55 pages=33 vmalloc
    0xffffffffa0022000-0xffffffffa0029000 28672 module_alloc+0x4f/0x55 pages=6 vmalloc
    0xffffffffa002b000-0xffffffffa0034000 36864 module_alloc+0x4f/0x55 pages=8 vmalloc
    0xffffffffa0034000-0xffffffffa003d000 36864 module_alloc+0x4f/0x55 pages=8 vmalloc
    0xffffffffa003d000-0xffffffffa0049000 49152 module_alloc+0x4f/0x55 pages=11 vmalloc
    0xffffffffa0049000-0xffffffffa0050000 28672 module_alloc+0x4f/0x55 pages=6 vmalloc

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Implement a new proc file that allows the display of the currently allocated
    vmalloc memory.

    It shows the users of vmalloc. That is important if vmalloc space is
    scarce (i386 for example).

    And it's going to be important for the compound page fallback to vmalloc.
    Many of the current users can be switched to use compound pages with fallback.
    This means that the number of users of vmalloc is reduced and page tables are
    no longer necessary to access the memory. /proc/vmallocinfo makes it possible
    to review how that reduction occurs.

    If memory becomes fragmented and larger order allocations are no longer
    possible, then /proc/vmallocinfo shows which compound page allocations
    fell back to virtual compound pages. That is important for new users of
    virtual compound pages, such as order 1 stack allocations, that may
    fall back to virtual compound pages in the future.

    /proc/vmallocinfo permissions are made readable-only-by-root to avoid possible
    information leakage.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: CONFIG_MMU=n build fix]
    Signed-off-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Cc: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

20 Mar, 2008

1 commit

  • Fix various kernel-doc notation in mm/:

    filemap.c: add function short description; convert 2 to kernel-doc
    fremap.c: change parameter 'prot' to @prot
    pagewalk.c: change "-" in function parameters to ":"
    slab.c: fix short description of kmem_ptr_validate()
    swap.c: fix description & parameters of put_pages_list()
    swap_state.c: fix function parameters
    vmalloc.c: change "@returns" to "Returns:" since that is not a parameter

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

09 Feb, 2008

1 commit

  • Background: I've implemented 1K/2K page tables for s390. These sub-page
    page tables are required to properly support the s390 virtualization
    instruction with KVM. The SIE instruction requires that the page tables
    have 256 page table entries (pte) followed by 256 page status table entries
    (pgste). The pgstes are only required if the process is using the SIE
    instruction. The pgstes are updated by the hardware and by the hypervisor
    for a number of reasons, one of them is dirty and reference bit tracking.
    To avoid wasting memory the standard pte table allocation should return
    1K/2K (31/64 bit) and 2K/4K if the process is using SIE.

    Problem: Page size on s390 is 4K, page table size is 1K or 2K. That means
    the s390 version for pte_alloc_one cannot return a pointer to a struct
    page. Trouble is that with the CONFIG_HIGHPTE feature on x86 pte_alloc_one
    cannot return a pointer to a pte either, since that would require more than
    32 bit for the return value of pte_alloc_one (and the pte * would not be
    accessible since it's not kmapped).

    Solution: The only solution I found to this dilemma is a new typedef: a
    pgtable_t. For s390 pgtable_t will be a (pte *) - to be introduced with a
    later patch. For everybody else it will be a (struct page *). The
    additional problem with the initialization of the ptl lock and the
    NR_PAGETABLE accounting is solved with a constructor pgtable_page_ctor and
    a destructor pgtable_page_dtor. The page table allocation and free
    functions need to call these two whenever a page table page is allocated or
    freed. pmd_populate will get a pgtable_t instead of a struct page pointer.
    To get the pgtable_t back from a pmd entry that has been installed with
    pmd_populate a new function pmd_pgtable is added. It replaces the pmd_page
    call in free_pte_range and apply_to_pte_range.

    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Schwidefsky
     

06 Feb, 2008

5 commits

  • When running with a 16M IOREMAP_MAX_ORDER (on armv7) we found that the
    vmlist search routine in __get_vm_area_node can mistakenly allow a driver
    to ioremap a range larger than vmalloc space.

    If at the time of the ioremap all existing vmlist areas sit below the
    determined alignment then the search routine continues past all entries and
    exits the for loop - straight into the found: label - without ever testing
    for integer wrapping or that the requested size fits.

    We were seeing a driver successfully ioremap 128M of flash even though
    there was only 120M of vmalloc space. From that point the system was left
    with the remainder of the first 16M of space to vmalloc/ioremap within.
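
    The missing checks can be sketched in userspace C as follows. This is
    a simplified model, not the kernel routine: struct area, align_up() and
    find_gap() are hypothetical, and the list is a sorted array. What
    matters is that the wrap and fit tests run on every exit path,
    including when the loop runs off the end of the list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a vmlist entry; the array is sorted by addr. */
struct area {
    uintptr_t addr;
    uintptr_t size;
};

/* Round addr up to a power-of-two alignment. */
static uintptr_t align_up(uintptr_t addr, uintptr_t align)
{
    return (addr + align - 1) & ~(align - 1);
}

/* Returns the start of a gap of the given size in [start, end), or 0. */
static uintptr_t find_gap(const struct area *areas, size_t n,
                          uintptr_t start, uintptr_t end,
                          uintptr_t size, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    uintptr_t addr = align_up(start, align);

    for (size_t i = 0; i < n; i++) {
        uintptr_t area_end = areas[i].addr + areas[i].size;

        if (area_end <= addr)
            continue;                      /* area lies below the candidate */
        if (addr + size >= addr && addr + size <= areas[i].addr)
            break;                         /* fits in front of this area */
        addr = align_up(area_end, align);  /* skip past the area */
    }

    /* Checked even when the loop ran past the last entry. */
    if (addr + size < addr)                /* integer wrap */
        return 0;
    if (addr + size > end)                 /* does not fit below end */
        return 0;
    return addr;
}
```

    With 120M of space, one small existing area and a 16M alignment, a
    128M request is rejected here instead of being placed past the end.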

    Signed-off-by: Robert Bragg
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert Bragg
     
  • __vmalloc_area_node() can become static.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • The page array is repeatedly indexed both in vunmap and vmalloc_area_node().
    Add a temporary variable to make it easier to read (and easier to patch
    later).

    Signed-off-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Make vmalloc functions work the same way as kfree() and friends that
    take a const void * argument.

    [akpm@linux-foundation.org: fix consts, coding-style]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • We already have page table manipulation for vmalloc in vmalloc.c. Move the
    vmalloc_to_page() function there as well.

    Move the definitions for vmalloc related functions in mm.h to a newly created
    section. A better place would be vmalloc.h but mm.h is basic and may depend
    on these functions. An alternative would be to include vmalloc.h in mm.h
    (like done for vmstat.h).

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

20 Oct, 2007

1 commit


17 Oct, 2007

1 commit

  • The function of GFP_LEVEL_MASK seems to be unclear. In order to clear up
    the mystery we get rid of it and replace GFP_LEVEL_MASK with 3 sets of GFP
    flags:

    GFP_RECLAIM_MASK       Flags used to control page allocator reclaim behavior.

    GFP_CONSTRAINT_MASK    Flags used to limit where allocations can occur.

    GFP_SLAB_BUG_MASK      Flags that the slab allocator BUG()s on.

    These replace the uses of GFP_LEVEL_MASK in the slab allocators and in
    vmalloc.c.

    The use of the flags not included in these sets may occur as a result of a
    slab allocation standing in for a page allocation when constructing scatter
    gather lists. Extraneous flags are cleared and not passed through to the
    page allocator. __GFP_MOVABLE/RECLAIMABLE, __GFP_COLD and __GFP_COMP will
    now be ignored if passed to a slab allocator.

    Change the allocation of allocator meta data in SLAB and vmalloc to not
    pass through flags listed in GFP_CONSTRAINT_MASK. SLAB already removes the
    __GFP_THISNODE flag for such allocations. Generalize that to also cover
    vmalloc. The use of GFP_CONSTRAINT_MASK also includes __GFP_HARDWALL.

    The impact of allocator metadata placement on access latency to the
    cachelines of the object itself is minimal since metadata is only
    referenced on alloc and free. The attempt is still made to place the meta
    data optimally but we consistently allow fallback both in SLAB and vmalloc
    (SLUB does not need to allocate metadata like that).

    Allocator metadata may serve multiple in kernel users and thus should not
    be subject to the limitations arising from a single allocation context.

    [akpm@linux-foundation.org: fix fallback_alloc()]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

20 Jul, 2007

2 commits

  • lguest does some fairly lowlevel things to support a host, which
    normal modules don't need:

    math_state_restore:
    When the guest triggers a Device Not Available fault, we need
    to be able to restore the FPU

    __put_task_struct:
    We need to hold a reference to another task for inter-guest
    I/O, and put_task_struct() is an inline function which calls
    __put_task_struct.

    access_process_vm:
    We need to access another task for inter-guest I/O.

    map_vm_area & __get_vm_area:
    We need to map the switcher shim (ie. monitor) at 0xFFC01000.

    Signed-off-by: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rusty Russell
     
  • I've noticed lots of failures of vmalloc_32 on machines where it
    shouldn't have failed unless it was doing an atomic operation.

    Looking closely, I noticed that:

    #if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32)
    #define GFP_VMALLOC32 GFP_DMA32
    #elif defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA)
    #define GFP_VMALLOC32 GFP_DMA
    #else
    #define GFP_VMALLOC32 GFP_KERNEL
    #endif

    This seems to be incorrect: it should always OR in the DMA flags
    on top of GFP_KERNEL, thus this patch.

    This fixes frequent errors launching X with the nouveau DRM, for example.
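
    The shape of the corrected definitions can be sketched as below. The
    SK_ flag values are made up for illustration and are not the kernel's,
    and the two CONFIG_ symbols are defined here only to select a branch.

```c
#include <assert.h>

/* Hypothetical flag bits for illustration only; not the kernel's values. */
#define SK_GFP_KERNEL 0x10u
#define SK_GFP_DMA    0x01u
#define SK_GFP_DMA32  0x04u

/* Pretend configuration so that the first branch is taken in this sketch. */
#define CONFIG_64BIT 1
#define CONFIG_ZONE_DMA32 1

/* The zone flag is ORed on top of GFP_KERNEL instead of replacing it. */
#if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32)
#define SK_GFP_VMALLOC32 (SK_GFP_DMA32 | SK_GFP_KERNEL)
#elif defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA)
#define SK_GFP_VMALLOC32 (SK_GFP_DMA | SK_GFP_KERNEL)
#else
#define SK_GFP_VMALLOC32 SK_GFP_KERNEL
#endif

/* Returns the flags a 32-bit-addressable vmalloc would pass down. */
static unsigned vmalloc32_flags(void)
{
    unsigned flags = SK_GFP_VMALLOC32;

    assert((flags & SK_GFP_KERNEL) == SK_GFP_KERNEL);
    return flags;
}
```

    Whatever branch is selected, the GFP_KERNEL behaviour is kept, so the
    allocation is no longer treated like an atomic one.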

    Signed-off-by: Benjamin Herrenschmidt
    Cc: Andi Kleen
    Cc: Dave Airlie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt