21 Oct, 2005

1 commit


20 Oct, 2005

3 commits

  • This introduces a limit parameter to the core bootmem allocator. The new
    parameter indicates that physical memory allocated by the bootmem
    allocator should be within the requested limit.

    We also introduce the alloc_bootmem_low_pages_limit, alloc_bootmem_node_limit,
    and alloc_bootmem_low_pages_node_limit APIs, but alloc_bootmem_low_pages_limit
    is the only API used for swiotlb.

    The existing alloc_bootmem_low_pages() API could instead have been
    changed and made to pass the right limit to the core allocator. But that
    would make the patch more intrusive for 2.6.14, as other arches use
    alloc_bootmem_low_pages(). We may do that post 2.6.14 as a
    cleanup.

    With this, swiotlb gets memory within 4G for both x86_64 and ia64
    arches.
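
    A minimal standalone sketch (not the kernel code) of what the limit
    parameter means: a bump-style allocator that refuses any request which
    would place memory above the caller's physical-address limit. All names
    here are invented for illustration.

    #include <stdint.h>
    #include <stdio.h>

    struct bootmem_model {
            uint64_t next;    /* next free "physical" address */
            uint64_t end;     /* end of the managed region */
    };

    /* Return the start address of the allocation, or 0 if it would exceed
     * either the managed region or the requested limit (e.g. 4G). */
    static uint64_t alloc_below_limit(struct bootmem_model *bm,
                                      uint64_t size, uint64_t limit)
    {
            uint64_t start = bm->next;

            if (start + size > bm->end)
                    return 0;
            if (limit && start + size > limit)
                    return 0;            /* would land above the limit */
            bm->next = start + size;
            return start;
    }

    int main(void)
    {
            struct bootmem_model bm = { 0x100000, 0x200000000ULL };
            /* swiotlb-style request: keep the buffer below 4G */
            uint64_t p = alloc_below_limit(&bm, 64ULL << 20, 1ULL << 32);

            printf("allocated at %#llx\n", (unsigned long long)p);
            return 0;
    }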

    Signed-off-by: Yasunori Goto
    Cc: Ravikiran G Thirumalai
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     
  • hugetlbfs allows truncation of its files (should it?), but hugetlb.c often
    forgets that: crashes and misaccounting ensue.

    copy_hugetlb_page_range better grab the src page_table_lock since we don't
    want to guess what happens if concurrently truncated. unmap_hugepage_range
    rss accounting must not assume the full range was mapped. follow_hugetlb_page
    must guard with page_table_lock and be prepared to exit early.

    Restyle copy_hugetlb_page_range with a for loop like the others there.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The hugetlb pages are currently pre-faulted. At the time of mmap of
    hugepages, we populate the new PTEs. It is possible that HW has already
    cached some of the unused PTEs internally. These stale entries never
    get a chance to be purged in existing control flow.

    This patch extends the check in page fault code for hugepages. Check if
    a faulted address falls within the size of the hugetlb file backing it.
    We return VM_FAULT_MINOR for these cases (assuming that the arch
    specific page-faulting code purges the stale entry for the archs that
    need it).
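
    A standalone model (not the kernel code) of the bounds check being added,
    assuming 2MB huge pages; the helper names are invented for illustration.

    #include <stdbool.h>
    #include <stdio.h>

    #define HPAGE_SIZE_MODEL (2UL << 20)    /* assume 2MB huge pages */

    /* true  -> the faulting huge page is backed by the file, so the fault
     *          can be handled (VM_FAULT_MINOR in the description above)
     * false -> the address lies beyond the size of the backing file */
    static bool fault_within_file(unsigned long vma_start,
                                  unsigned long fault_addr,
                                  unsigned long file_size)
    {
            unsigned long idx = (fault_addr - vma_start) / HPAGE_SIZE_MODEL;
            unsigned long file_pages =
                    (file_size + HPAGE_SIZE_MODEL - 1) / HPAGE_SIZE_MODEL;

            return idx < file_pages;
    }

    int main(void)
    {
            unsigned long start = 0x40000000UL;

            printf("inside:  %d\n",
                   fault_within_file(start, start + (1UL << 20), 4UL << 20));
            printf("outside: %d\n",
                   fault_within_file(start, start + (8UL << 20), 4UL << 20));
            return 0;
    }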

    Signed-off-by: Rohit Seth

    [ This is apparently arguably an ia64 port bug. But the code won't
    hurt, and for now it fixes a real problem on some ia64 machines ]

    Signed-off-by: Linus Torvalds

    Seth, Rohit
     

17 Oct, 2005

1 commit

  • As noticed by Nick Piggin, we need to make sure that we check the page
    count before we check for PageDirty, since the dirty check is only valid
    if the count implies that we're the only possible ones holding the page.

    We always did do this, but the code needs a read memory barrier to make
    sure that the ordering is also honored by the CPU.

    (The writer side is ordered due to the atomic decrement and test on the
    page count, see the discussion on linux-kernel)
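
    A standalone C11 model (not the kernel code) of the ordering requirement:
    load the reference count first, then issue a read barrier, then test the
    dirty flag, so the CPU cannot reorder the dirty load before the count load.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    struct page_model {
            atomic_int  count;
            atomic_bool dirty;
    };

    /* true if the page is unshared and clean, and the check can be trusted */
    static bool clean_and_unshared(struct page_model *p)
    {
            /* Only the sole holder may trust the dirty bit at all. */
            if (atomic_load_explicit(&p->count, memory_order_relaxed) != 1)
                    return false;

            /* The smp_rmb() analogue: order the two loads. */
            atomic_thread_fence(memory_order_acquire);

            return !atomic_load_explicit(&p->dirty, memory_order_relaxed);
    }

    int main(void)
    {
            struct page_model pg;

            atomic_init(&pg.count, 1);
            atomic_init(&pg.dirty, false);
            printf("clean and unshared: %d\n", clean_and_unshared(&pg));
            return 0;
    }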

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

12 Oct, 2005

2 commits

  • Refuse to install a page into a mapping if the mapping count is already
    ridiculously large.

    You probably cannot trigger this on 32-bit architectures, but on a
    64-bit setup we should protect against it.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Revert this recent correctness change: Douglas Crosher
    reported that it broke an existing application, and that madvise() works
    without error on anonymous mappings on Solaris.

    This means that madvise() will remain non-standards-compliant: we should
    return -EBADF for all requests against non-file-backed vma's, but Linux only
    does this for MADV_WILLNEED requests.
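
    A small userspace demonstration of the behaviour described: on an
    anonymous (non-file-backed) mapping, only MADV_WILLNEED is rejected with
    EBADF on kernels of this era, while other advice values succeed. Exact
    behaviour may differ on later kernels.

    #define _DEFAULT_SOURCE
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    static void try_advice(void *p, size_t len, int advice, const char *name)
    {
            if (madvise(p, len, advice) < 0)
                    printf("%s: %s\n", name, strerror(errno));
            else
                    printf("%s: ok\n", name);
    }

    int main(void)
    {
            size_t len = 4096;
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            try_advice(p, len, MADV_WILLNEED, "MADV_WILLNEED");
            try_advice(p, len, MADV_DONTNEED, "MADV_DONTNEED");
            munmap(p, len);
            return 0;
    }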

    Signed-off-by: Suzuki K P
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Suzuki
     

09 Oct, 2005

1 commit

  • - added typedef unsigned int __nocast gfp_t;

    - replaced __nocast uses for gfp flags with gfp_t - it gives exactly
    the same warnings as far as sparse is concerned, doesn't change
    generated code (from gcc point of view we replaced unsigned int with
    typedef) and documents what's going on far better.
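
    A minimal compilable sketch of the change, with a stand-in definition of
    __nocast so it builds outside the kernel tree (sparse defines __CHECKER__
    when it runs):

    #ifdef __CHECKER__
    #define __nocast __attribute__((nocast))
    #else
    #define __nocast
    #endif

    typedef unsigned int __nocast gfp_t;

    /* before: the annotation had to be repeated at every declaration */
    void *old_alloc(unsigned int __nocast flags);

    /* after: the typedef carries the annotation and documents the intent */
    void *new_alloc(gfp_t flags);

    int main(void) { return 0; }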

    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
     

01 Oct, 2005

1 commit

  • As requested by Thomas Gleixner :

    "5d3d0f7704ed0bc7eaca0501eeae3e5da1ea6c87 breaks a couple of ARM
    boards, which depend on the historical bootmem allocation order.
    There is a cleaner solution around to remove the pgdat list
    completely, but this is a topic for post 2.6.14

    Andi signalled ACK already."

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

28 Sep, 2005

2 commits

  • In kmalloc_node we are checking whether the allocation is for the same
    node while interrupts are "on". This may lead to an allocation on a node
    other than the one intended.

    This patch just shifts the check for the current node into
    __cache_alloc_node, where interrupts are disabled.

    Signed-off-by: Alok N Kataria
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alok N Kataria
     
  • Move the ZERO_PAGE remapping complexity to the move_pte macro in
    asm-generic, have it conditionally depend on
    __HAVE_ARCH_MULTIPLE_ZERO_PAGE, which gets defined for MIPS.

    For architectures without __HAVE_ARCH_MULTIPLE_ZERO_PAGE, move_pte becomes
    a noop.

    From: Hugh Dickins

    Fix nasty little bug we've missed in Nick's mremap move ZERO_PAGE patch.
    The "pte" at that point may be a swap entry or a pte_file entry: we must
    check pte_present before perhaps corrupting such an entry.

    Patch below against 2.6.14-rc2-mm1, but the same bug is in 2.6.14-rc2's
    mm/mremap.c, and more dangerous there since it's affecting all arches: I
    think the safest course is to send Nick's patch and Yoichi's build fix and
    this fix (build tested) on to Linus - so only MIPS can be affected.

    Signed-off-by: Nick Piggin
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

24 Sep, 2005

1 commit

  • As davem points out, this wasn't such a great idea. There may be some code
    which does:

    size = 1024*1024;
    while (kmalloc(size, ...) == 0)
            size /= 2;

    which will now explode.
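
    A userspace illustration of the pattern quoted above: probe for the
    largest obtainable buffer by halving the request until the allocator
    succeeds. A hard failure (BUG) on oversized requests would make such
    loops explode, as described.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
            size_t size = 1024 * 1024;
            void *buf;

            /* keep halving until the allocation succeeds */
            while ((buf = malloc(size)) == NULL && size > 0)
                    size /= 2;

            printf("settled on %zu bytes\n", size);
            free(buf);
            return 0;
    }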

    Cc: "David S. Miller"
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

23 Sep, 2005

4 commits

  • Problem: In some circumstances, bd_claim() is returning the wrong error
    code.

    If we try to swapon an unused block device that isn't swap formatted, we
    get -EINVAL. But if that same block device is already mounted, we instead
    get -EBUSY, even though it still isn't a valid swap device.

    This issue came up on the busybox list trying to get the error message
    from "swapon -a" right. If a swap device is already enabled, we get -EBUSY,
    and we shouldn't report this as an error. But we can't distinguish the two
    -EBUSY conditions, which are very different errors.

    In the code, bd_claim() returns either 0 or -EBUSY, but in this case busy
    means "somebody other than sys_swapon has already claimed this", and
    _that_ means this block device can't be a valid swap device. So return
    -EINVAL there.

    Signed-off-by: Rob Landley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rob Landley
     
  • I had an issue on ia64 where I got a bug in kernel/workqueue because
    kzalloc returned a NULL pointer due to the task structure getting too big
    for the slab allocator. Usually these cases are caught by the kmalloc
    macro in include/linux/slab.h.

    Compilation will fail if too large a value is passed to kmalloc.

    However, kzalloc uses __kmalloc, which has no check for that. This patch
    makes __kmalloc BUG if too large an allocation is requested.
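
    A standalone model of the two checks described (the kernel's real kmalloc
    macro differs in detail): constant sizes that are provably too large are
    caught at build time by referencing a function that is never defined,
    while runtime sizes fall through to a check that aborts. Build with
    optimization so the dead branch is discarded. All names are invented.

    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_ALLOC (128 * 1024)          /* assumed cap for the model */

    /* Never defined: if a constant oversized request survives to link
     * time, the build fails, mirroring the compile-time kmalloc check. */
    extern void *model_alloc_too_large(void);

    static void *model_alloc_runtime(size_t size)
    {
            if (size > MAX_ALLOC) {
                    fprintf(stderr, "model_alloc: %zu bytes is too large\n",
                            size);
                    abort();                /* the __kmalloc BUG analogue */
            }
            return malloc(size);
    }

    #define model_alloc(size)                                       \
            (__builtin_constant_p(size) && (size) > MAX_ALLOC       \
                    ? model_alloc_too_large()                       \
                    : model_alloc_runtime(size))

    int main(void)
    {
            void *p = model_alloc(4096);    /* fine */
            /* model_alloc(1 << 24) would fail to link */
            free(p);
            return 0;
    }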

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The NUMA slab allocator may allocate pages from foreign nodes onto the
    lists for a particular node if a node runs out of memory. The
    slab->nodeid field then does not reflect that the page is now in use
    for the slabs of another node.

    This patch fixes that issue by adding a node field to free_block so that
    the caller can indicate which node currently uses a slab.

    Also removes the check for the current node from kmalloc_cache_node,
    since the process may later shift to another node, which may lead to an
    allocation on a node other than the one intended.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • It is essential that index_of() be inlined. But alpha undoes the gcc
    inlining hackery and index_of() ends up out-of-line. So fiddle with things
    to make that function inline again.

    Cc: Richard Henderson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ivan Kokshaysky
     

22 Sep, 2005

2 commits


18 Sep, 2005

1 commit


15 Sep, 2005

2 commits

  • With the new changes that we made in the initialization of the slab
    allocator, we first set up the cache from which array caches are
    allocated, and then the cache from which kmem_list3's are allocated.

    Now if the array cache comes from a cache whose objsize > 32 (in this
    instance size-64), then the size-64 cache will be allocated first, and
    then the size-128 cache (if that is the cache from which kmem_list3's
    are going to be allocated).

    So with these new changes, we are not guaranteed that we will be
    initializing the malloc_sizes array in a serialized order. Thus there is
    a bug in __find_general_cachep, as we are checking whether the first
    cache_sizes ptr is NULL.

    This is replaced by checking whether the array-cache cache is initialized.
    Attached is a patch which does that. Boots fine on a x86-64, with
    DEBUG_SPIN, DEBUG_SLAB, and preempt.

    Signed-off-by: Alok N Kataria
    Signed-off-by: Shobhit Dayal
    Cc: Manfred Spraul
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alok Kataria
     
  • Pavel Emelianov and Kirill Korotaev observe that fs and arch users of
    security_vm_enough_memory tend to forget to vm_unacct_memory when a
    failure occurs further down (typically in setup_arg_pages variants).

    These are all users of insert_vm_struct, and that reservation will only
    be unaccounted on exit if the vma is marked VM_ACCOUNT: which in some
    cases it is (hidden inside VM_STACK_FLAGS) and in some cases it isn't.

    So x86_64 32-bit and ppc64 vDSO ELFs have been leaking memory into
    Committed_AS each time they're run. But don't add VM_ACCOUNT to them,
    it's inappropriate to reserve against the very unlikely case that gdb
    be used to COW a vDSO page - we ought to do something about that in
    do_wp_page, but there are yet other inconsistencies to be resolved.

    The safe and economical way to fix this is to let insert_vm_struct do
    the security_vm_enough_memory check when it finds VM_ACCOUNT is set.

    And the MIPS irix_brk has been calling security_vm_enough_memory before
    calling do_brk which repeats it, doubly accounting and so also leaking.
    Remove that, and all the fs and arch calls to security_vm_enough_memory:
    give it a less misleading name later on.

    Signed-off-by: Hugh Dickins
    Signed-Off-By: Kirill Korotaev
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

13 Sep, 2005

4 commits


12 Sep, 2005

1 commit

  • Move the call to get_mm_counter() in update_mem_hiwater() to be
    inside the check for tsk->mm being null. Otherwise you can be
    following a null pointer here. This patch was submitted by
    Javier Herrero.

    Modify the end check for munmap regions to allow for the
    legacy behavior of 0 being valid. Pretty much all current
    uClinux system libc mallocs pass in 0 as the end point.
    A hard check will fail on these, so change the check so
    that if it is non-zero it must be valid, otherwise it fails.
    A passed-in value will always succeed (as it used to).

    Also export a few more mm system functions - to be consistent
    with the VM code exports.

    Signed-off-by: Greg Ungerer
    Signed-off-by: Linus Torvalds

    Greg Ungerer
     

11 Sep, 2005

5 commits


10 Sep, 2005

5 commits

  • Clean up timer initialization by introducing DEFINE_TIMER, à la
    DEFINE_SPINLOCK. Build and boot-tested on x86. A similar patch has been
    in the -RT tree for some time.
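
    A standalone model of the DEFINE_SPINLOCK/DEFINE_TIMER idea: a single
    macro both declares the object and supplies its static initializer, so
    no separate runtime init call is needed. The struct layout here is
    invented, not the real struct timer_list.

    #include <stdio.h>

    struct model_timer {
            void (*function)(unsigned long);
            unsigned long expires;
            unsigned long data;
    };

    #define DEFINE_MODEL_TIMER(name, fn, exp, d)                    \
            struct model_timer name = {                             \
                    .function = (fn), .expires = (exp), .data = (d) }

    static void tick(unsigned long data)
    {
            printf("tick %lu\n", data);
    }

    /* declared and initialized in one line */
    static DEFINE_MODEL_TIMER(my_timer, tick, 0, 42);

    int main(void)
    {
            my_timer.function(my_timer.data);
            return 0;
    }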

    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • This patch clarifies the NULL handling of kfree() and vfree(). In
    addition, the wording of the calling-context restriction for vfree() and
    vunmap() is changed from "may not" to "must not."

    Signed-off-by: Pekka Enberg
    Acked-by: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
  • The NUMA API change that introduced kmalloc_node was accepted for
    2.6.12-rc3. Now it is possible to do slab allocations on a node to
    localize memory structures. This API was used by the pageset localization
    patch and the block layer localization patch now in mm. The existing
    kmalloc_node is slow since it simply searches through all pages of the slab
    to find a page that is on the node requested. The two patches do a
    one-time allocation of slab structures at initialization and therefore
    the speed of kmalloc_node does not matter.

    This patch allows kmalloc_node to be as fast as kmalloc by introducing node
    specific page lists for partial, free and full slabs. Slab allocation
    improves in a NUMA system so that we are seeing a performance gain in AIM7
    of about 5% with this patch alone.

    More NUMA localizations are possible if kmalloc_node operates as fast
    as kmalloc.

    Test run on a 32p system with 32G RAM.

    w/o patch
    Tasks jobs/min jti jobs/min/task real cpu
    1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005
    100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005
    200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005
    300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005
    400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005
    500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005
    600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005
    700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005
    800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005
    900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005
    1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005

    w/patch
    Tasks jobs/min jti jobs/min/task real cpu
    1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005
    100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005
    200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005
    300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005
    400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005
    500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005
    600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005
    700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005
    800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005
    900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005
    1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005

    These are measurements taken directly after boot and show an improvement
    greater than 5%. However, the performance improvements become smaller if
    the AIM7 runs are repeated and settle down at around 5%.

    Links to earlier discussions:
    http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2
    http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2

    Changelog V4-V5:
    - alloc_arraycache and alloc_aliencache take node parameter instead of cpu
    - fix initialization so that nodes without cpus are properly handled.
    - simplify code in kmem_cache_init
    - patch against Andrew's temp mm3 release
    - Add Shai to credits
    - fallback to __cache_alloc from __cache_alloc_node if the node's cache
    is not available yet.

    Changelog V3-V4:
    - Patch against 2.6.12-rc5-mm1
    - Cleanup patch integrated
    - More and better use of for_each_node and for_each_cpu
    - GCC 2.95 fix (do not use [] use [0])
    - Correct determination of INDEX_AC
    - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes.
    - Remove list3_data and list3_data_ptr macros for better readability

    Changelog V2-V3:
    - Made to patch against 2.6.12-rc4-mm1
    - Revised bootstrap mechanism so that larger size kmem_list3 structs can be
    supported. Do a generic solution so that the right slab can be found
    for the internal structs.
    - use for_each_online_node

    Changelog V1-V2:
    - Batching for freeing of wrong-node objects (alien caches)
    - Locking changes and NUMA #ifdefs as requested by Manfred

    Signed-off-by: Alok N Kataria
    Signed-off-by: Shobhit Dayal
    Signed-off-by: Shai Fultheim
    Signed-off-by: Christoph Lameter
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This patch modifies tmpfs to call the inode_init_security LSM hook to set
    up the incore inode security state for new inodes before the inode becomes
    accessible via the dcache.

    As there is no underlying storage of security xattrs in this case, it is
    not necessary for the hook to return the (name, value, len) triple to the
    tmpfs code, so this patch also modifies the SELinux hook function to
    correctly handle the case where the (name, value, len) pointers are NULL.

    The hook call is needed in tmpfs in order to support proper security
    labeling of tmpfs inodes (e.g. for udev with tmpfs /dev in Fedora). With
    this change in place, we should then be able to remove the
    security_inode_post_create/mkdir/... hooks safely.

    Signed-off-by: Stephen Smalley
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Smalley
     
  • Update the file systems in fs/ implementing a delete_inode() callback to
    call truncate_inode_pages(). One implementation note: In developing this
    patch I put the calls to truncate_inode_pages() at the very top of those
    filesystems' delete_inode() callbacks in order to retain the previous
    behavior. I'm guessing that some of those could probably be optimized.

    Signed-off-by: Mark Fasheh
    Acked-by: Christoph Hellwig
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Fasheh
     

09 Sep, 2005

1 commit

  • Run PCI driver initialization on local node

    Instead of adding messy kmalloc_node()s everywhere, run the
    PCI driver probe on the node local to the device.

    This would not have helped for IDE, but should for
    other, cleaner drivers that do more initialization in probe().
    It won't help for drivers that do most of the work
    on first open (like many network drivers).

    Signed-off-by: Andi Kleen
    Signed-off-by: Greg Kroah-Hartman

    Andi Kleen
     

08 Sep, 2005

3 commits

  • This patch introduces a kzalloc wrapper and converts kernel/ to use it. It
    saves a little program text.
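
    The idea, shown as a standalone libc analogue (the in-kernel kzalloc
    naturally uses kmalloc and gfp flags rather than malloc):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* allocate and zero in one call, so callers can drop their memsets */
    static void *zalloc(size_t size)
    {
            void *p = malloc(size);

            if (p)
                    memset(p, 0, size);
            return p;
    }

    int main(void)
    {
            int *v = zalloc(16 * sizeof(*v));

            printf("v[7] = %d\n", v ? v[7] : -1);
            free(v);
            return 0;
    }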

    Signed-off-by: Pekka Enberg
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka J Enberg
     
  • Now the real motivation for this cpuset mem_exclusive patch series seems
    trivial.

    This patch keeps a task in or under one mem_exclusive cpuset from provoking an
    oom kill of a task under a non-overlapping mem_exclusive cpuset. Since only
    interrupt and GFP_ATOMIC allocations are allowed to escape mem_exclusive
    containment, there is little to gain from oom killing a task under a
    non-overlapping mem_exclusive cpuset, as almost all kernel and user memory
    allocation must come from disjoint memory nodes.

    This patch enables configuring a system so that a runaway job under one
    mem_exclusive cpuset cannot cause the killing of a job in another such cpuset
    that might be using very high compute and memory resources for a prolonged
    time.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • This patch makes use of the previously underutilized cpuset flag
    'mem_exclusive' to provide what amounts to another layer of memory placement
    resolution. With this patch, there are now the following four layers of
    memory placement available:

    1) The whole system (interrupt and GFP_ATOMIC allocations can use this),
    2) The nearest enclosing mem_exclusive cpuset (GFP_KERNEL allocations can use),
    3) The current task's cpuset (GFP_USER allocations constrained to here), and
    4) Specific node placement, using mbind and set_mempolicy.

    These nest - each layer is a subset (same or within) of the previous.

    Layer (2) above is new, with this patch. The call used to check whether a
    zone (its node, actually) is in a cpuset (in its mems_allowed, actually) is
    extended to take a gfp_mask argument, and its logic is extended, in the case
    that __GFP_HARDWALL is not set in the flag bits, to look up the cpuset
    hierarchy for the nearest enclosing mem_exclusive cpuset, to determine if
    placement is allowed. The definition of GFP_USER, which used to be identical
    to GFP_KERNEL, is changed to also set the __GFP_HARDWALL bit, in the previous
    cpuset_gfp_hardwall_flag patch.

    GFP_ATOMIC and GFP_KERNEL allocations will stay within the current task's
    cpuset, so long as any node therein is not too tight on memory, but will
    escape to the larger layer, if need be.
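
    A standalone sketch (invented names, not kernel code) of the lookup: with
    __GFP_HARDWALL the current cpuset is the boundary; without it, walk up to
    the nearest enclosing mem_exclusive cpuset before testing the node.

    #include <stdbool.h>
    #include <stdio.h>

    struct cpuset_model {
            struct cpuset_model *parent;
            bool mem_exclusive;
            unsigned long mems_allowed;     /* bitmask of memory nodes */
    };

    static bool node_allowed(const struct cpuset_model *cs, int node,
                             bool hardwall)
    {
            if (!hardwall)                  /* GFP_KERNEL-style request */
                    while (cs->parent && !cs->mem_exclusive)
                            cs = cs->parent;
            return cs->mems_allowed & (1UL << node);
    }

    int main(void)
    {
            struct cpuset_model root = { NULL,  true,  0xff };
            struct cpuset_model box  = { &root, true,  0x0f }; /* mem_exclusive */
            struct cpuset_model job  = { &box,  false, 0x03 }; /* task's cpuset */

            printf("hardwall   node 2: %d\n", node_allowed(&job, 2, true));
            printf("GFP_KERNEL node 2: %d\n", node_allowed(&job, 2, false));
            return 0;
    }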

    The intended use is to allow something like a batch manager to handle several
    jobs, each job in its own cpuset, but using common kernel memory for caches
    and such. Swapper and oom_kill activity is also constrained to Layer (2). A
    task in or below one mem_exclusive cpuset should not cause swapping on nodes
    in another non-overlapping mem_exclusive cpuset, nor provoke oom_killing of a
    task in another such cpuset. Heavy use of kernel memory for i/o caching and
    such by one job should not impact the memory available to jobs in other
    non-overlapping mem_exclusive cpusets.

    This patch enables providing hardwall, inescapable cpusets for memory
    allocations of each job, while sharing kernel memory allocations between
    several jobs, in an enclosing mem_exclusive cpuset.

    Like Dinakar's patch earlier to enable administering sched domains using the
    cpu_exclusive flag, this patch also provides a useful meaning to a cpuset flag
    that had previously done nothing much useful other than restrict what cpuset
    configurations were allowed.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson