17 Jul, 2007

1 commit

  • Make zonelist creation policy selectable from sysctl/boot option v6.

    This patch makes NUMA's zonelist (of pgdat) order selectable.
    Available orders are Default (automatic), Node-based, and Zone-based.

    [Default Order]
    The kernel selects Node-based or Zone-based order automatically.

    [Node-based Order]
    This policy treats the locality of memory as the most important parameter.
    The zonelist is ordered by each zone's locality. This means lower zones
    (e.g. ZONE_DMA) can be used before a higher zone (e.g. ZONE_NORMAL) is
    exhausted. In other words, ZONE_DMA will be in the middle of the zonelist.
    The current 2.6.21 kernel uses this.

    Pros:
    * A user can expect memory to be as local as possible.
    Cons:
    * A lower zone may be exhausted before a higher zone. This may cause OOM_KILL.

    This order may be suitable if ZONE_DMA is relatively big, you never see
    OOM_KILL because of ZONE_DMA exhaustion, and you need the best locality.

    (example)
    Assume a 2-node NUMA system: node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.

    *node(0)'s memory allocation order:

    node(0)'s NORMAL -> node(0)'s DMA -> node(1)'s NORMAL.

    *node(1)'s memory allocation order:

    node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.

    [Zone-based order]
    This policy treats the zone type as the most important parameter.
    The zonelist is ordered by zone type. This means a lower zone will
    never be used before a higher zone is exhausted.
    In other words, ZONE_DMA will always be at the tail of the zonelist.

    Pros:
    * OOM_KILL (because of a lower zone) occurs only if all zones are exhausted.
    Cons:
    * Memory locality may not be the best.

    (example)
    Assume a 2-node NUMA system: node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.

    *node(0)'s memory allocation order:

    node(0)'s NORMAL -> node(1)'s NORMAL -> node(0)'s DMA.

    *node(1)'s memory allocation order:

    node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.

    bootoption "numa_zonelist_order=" and proc/sysctl is supporetd.

    command:
    %echo N > /proc/sys/vm/numa_zonelist_order

    Will rebuild zonelist in Node-based order.

    command:
    %echo Z > /proc/sys/vm/numa_zonelist_order

    Will rebuild zonelist in Zone-based order.
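
    As a rough sketch of the two orderings (illustrative pseudo-C; the helper
    names are not the kernel's):

        /* Node order: all of a node's zones, high to low, before moving
         * to the next node -- ZONE_DMA lands in the middle. */
        for_each_node_in_fallback_order(node)
                for (type = MAX_NR_ZONES - 1; type >= 0; type--)
                        zonelist_add(zonelist, node, type);

        /* Zone order: one zone type across all nodes before any lower
         * type -- ZONE_DMA lands at the very tail. */
        for (type = MAX_NR_ZONES - 1; type >= 0; type--)
                for_each_node_in_fallback_order(node)
                        zonelist_add(zonelist, node, type);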

    Thanks to Lee Schermerhorn, who gave me much help and code.

    [Lee.Schermerhorn@hp.com: add check_highest_zone to build_zonelists_in_zone_order]
    [akpm@linux-foundation.org: build fix]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Cc: "jesse.barnes@intel.com"
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

12 Jul, 2007

1 commit

  • Add a new security check on mmap operations to see if the user is attempting
    to mmap into the low area of the address space. The amount of space
    protected is indicated by the new proc tunable /proc/sys/vm/mmap_min_addr
    and defaults to 0, preserving existing behavior.
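
    A minimal sketch of the check, with the helper name and error code assumed
    for illustration (the real test sits behind an LSM hook on the mmap path):

        unsigned long mmap_min_addr;    /* set via /proc/sys/vm/mmap_min_addr */

        /* illustrative shape, not the actual hook signature */
        static int deny_low_mmap(unsigned long addr)
        {
                if (addr < mmap_min_addr)
                        return -EACCES; /* unless policy grants the access */
                return 0;
        }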

    This patch uses a new SELinux security class, "memprotect." Policy already
    contains a number of allow rules like a_t self:process * (unconfined_t being
    one of them) which mean that putting this check in the process class (its
    best current fit) would make it useless, as all user processes, which we also
    want to protect against, would be allowed. Naming the new class memprotect
    will also make it possible for us to move some of the other memory protect
    permissions out of 'process' and into the new class next time we bump the
    policy version number (which I also think is a good future idea).

    Acked-by: Stephen Smalley
    Acked-by: Chris Wright
    Signed-off-by: Eric Paris
    Signed-off-by: James Morris

    Eric Paris
     

10 Jul, 2007

3 commits

  • This patch removes xip_file_sendfile, the sendfile implementation for
    xip, without replacement. Those customers that use xip on s390 are not
    using sendfile() as far as we know, and so far s390 is the only platform
    this could potentially be used on.
    Having sendfile is not a popular feature for execute-in-place file
    systems; however, we have a working implementation of splice_read() based
    on fs/splice.c if anyone asks for it.
    At this point in time, it does not seem preferable to merge
    splice_read() for xip because it causes extra maintenance effort due to
    code duplication and it requires struct page behind the xip memory
    segment. We'd like to get rid of that in favor of supporting flash-based
    embedded platforms (Monta Vista work) soon.

    Signed-off-by: Carsten Otte
    Signed-off-by: Jens Axboe

    Carsten Otte
     
  • Remove shmem_file_sendfile and resurrect shmem_readpage, as used by tmpfs
    to support loop and sendfile in 2.4 and 2.5. Now tmpfs can support splice,
    loop and sendfile in the simplest way, using generic_file_splice_read and
    generic_file_splice_write (with the aid of shmem_prepare_write).

    We could make some efficiency tweaks later, if there's a real need;
    but this is stable and works well as is.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Jens Axboe

    Hugh Dickins
     
  • It's no longer used.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

09 Jul, 2007

1 commit

  • Fix a post-2.6.21 regression.

    read_cache_page_async() has two invocations of mark_page_accessed() which will
    launch pages right onto the active list.

    Remove the first one, keeping the latter one. This avoids marking unwanted
    pages active (in the retry loop).

    Signed-off-by: Peter Zijlstra
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

07 Jul, 2007

2 commits

  • kmem_cache_open is static. The EXPORT_SYMBOL was left over from an earlier
    time period when kmem_cache_open was usable outside of SLUB.

    (Fixes powerpc build error)

    Signed-off-by: Christoph Lameter
    Cc: Johannes Berg
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Line up the vmstat_text with zone_stat_item

    enum zone_stat_item {
            /* First 128 byte cacheline (assuming 64 bit words) */
            NR_FREE_PAGES,
            NR_INACTIVE,
            NR_ACTIVE,

    We currently have nr_active and nr_inactive reversed.

    [ "OK with patch, though using initializers canbe handy to prevent such
    things in future:

    static const char * const vmstat_text[] = {
    [NR_FREE_PAGES] = "nr_free_pages",
    ..."
    - Alexey ]
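
    Spelled out, that suggestion would look like this (a sketch, abbreviated):

        static const char * const vmstat_text[] = {
                [NR_FREE_PAGES] = "nr_free_pages",
                [NR_INACTIVE]   = "nr_inactive",
                [NR_ACTIVE]     = "nr_active",
        };

    Designated initializers keep each string bound to its enum slot even if
    zone_stat_item is later reordered.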

    Signed-off-by: Peter Zijlstra
    Acked-by: Alexey Dobriyan
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

06 Jul, 2007

1 commit

  • Commit b46b8f19c9cd435ecac4d9d12b39d78c137ecd66 fixed a couple of bugs
    by switching the redzone to 64 bits. Unfortunately, it neglected to
    ensure that the _second_ redzone, after the slab object, is aligned
    correctly. This caused illegal instruction faults on sparc32, which for
    some reason not entirely clear to me are not trapped and fixed up.

    Two things need to be done to fix this (sketched below):
    - increase the object size, rounding up to alignof(long long) so
      that the second redzone can be aligned correctly.
    - If SLAB_STORE_USER is set but alignof(long long)==8, allow a
      full 64 bits of space for the user word at the end of the buffer,
      even though we may not _use_ the whole 64 bits.
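
    A sketch of the two steps, assuming the kernel's ALIGN() macro (the exact
    code in the patch differs):

        /* 1. pad the object so the trailing redzone is 64-bit aligned */
        size = ALIGN(size, __alignof__(unsigned long long));

        /* 2. reserve a full 64-bit slot for the user word */
        if (flags & SLAB_STORE_USER)
                size += sizeof(unsigned long long);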

    This patch should be a no-op on any 64-bit architecture or any 32-bit
    architecture where alignof(long long) == 4. Of the others, it's tested
    on ppc32 by myself and a very similar patch was tested on sparc32 by
    Mark Fortescue, who reported the new problem.

    Also, fix the conditions for FORCED_DEBUG, which hadn't been adjusted to
    the new sizes. Again noticed by Mark.

    Signed-off-by: David Woodhouse
    Signed-off-by: Linus Torvalds

    David Woodhouse
     

29 Jun, 2007

1 commit

  • validate_anon_vma gave a useful check on the integrity of the anon_vma list
    when Andrea was developing obj rmap; but it was not enabled in SLES9
    itself, nor in mainline, until Nick changed commented-out RMAP_DEBUG to
    configurable CONFIG_DEBUG_VM in 2.6.17. Now Petr Vandrovec reports that
    its BUG_ON(mapcount > 100000) can easily crash a CONFIG_DEBUG_VM=y system.

    That limit was just an arbitrary number to protect against an infinite
    loop. We could raise it to something enormous (depending on sizeof struct
    vma and size of memory?); but I rather think validate_anon_vma has outlived
    its usefulness, and is better just removed - which gives a magnificent
    performance boost to anything like Petr's test program ;)

    Of course, a very long anon_vma list is bad news for preemption latency,
    and I believe there has been one recent report of such: let's not forget
    that, but validate_anon_vma only makes it worse not better.

    Signed-off-by: Hugh Dickins
    Cc: Petr Vandrovec
    Acked-by: Nick Piggin
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

22 Jun, 2007

1 commit

  • Function expand_upwards() did not guard against wrapping
    around to address 0. This fixes the adjtimex02 testcase from
    the Linux Test Project on a 32-bit PARISC kernel.

    [expand_upwards is only used on parisc and ia64; it looks like it does
    the right thing on both. --kyle]
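
    The guard is a plain unsigned-overflow check; the shape below is
    illustrative only, with variable names assumed:

        /* reject growth whose new end wraps past address 0 */
        if (address + grow_size < address)
                return -ENOMEM;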

    Signed-off-by: Helge Deller
    Cc: Tony Luck
    Signed-off-by: Kyle McMartin

    Helge Deller
     

17 Jun, 2007

3 commits

  • If ARCH_KMALLOC_MINALIGN is set to a value greater than 8 (SLUB's smallest
    kmalloc cache) then SLUB may generate duplicate slabs in sysfs (yes, again)
    because the object size is padded to reach ARCH_KMALLOC_MINALIGN. Thus the
    sizes of the small slabs are all the same.

    No arch sets ARCH_KMALLOC_MINALIGN larger than 8, though, except mips, which
    for some reason wants 128-byte alignment.

    This patch increases the size of the smallest cache if
    ARCH_KMALLOC_MINALIGN is greater than 8. In that case more and more of the
    smallest caches are disabled.

    If we do that then the count of the active general caches that is displayed
    on boot is not correct anymore since we may skip elements of the kmalloc
    array. So count them separately.
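
    A sketch of the idea (macro names assumed for illustration):

        #ifndef ARCH_KMALLOC_MINALIGN
        #define ARCH_KMALLOC_MINALIGN 8
        #endif

        /* the smallest cache grows with the arch minimum alignment, so
         * padded caches no longer collide at the same object size */
        #define KMALLOC_MIN_SIZE ARCH_KMALLOC_MINALIGN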

    This approach was tested by Havard yesterday.

    Signed-off-by: Christoph Lameter
    Cc: Haavard Skinnemoen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Some changes done a while ago to avoid pounding on ptep_set_access_flags and
    update_mmu_cache in some race situations break sun4c, which requires
    update_mmu_cache() to always be called on minor faults.

    This patch reworks ptep_set_access_flags() semantics, implementations and
    callers so that it is now responsible for returning whether an update is
    necessary or not (basically whether the PTE actually changed). This allows
    fixing the sparc implementation to always return 1 on sun4c, as sketched
    below.
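
    The reworked contract, roughly (a sketch of the generic form; sun4c's
    implementation simply always returns 1):

        int ptep_set_access_flags(struct vm_area_struct *vma,
                                  unsigned long address, pte_t *ptep,
                                  pte_t entry, int dirty)
        {
                int changed = !pte_same(*ptep, entry);

                if (changed) {
                        set_pte_at(vma->vm_mm, address, ptep, entry);
                        flush_tlb_page(vma, address);
                }
                return changed; /* 0 lets the caller skip the update */
        }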

    [akpm@linux-foundation.org: fixes, cleanups]
    Signed-off-by: Benjamin Herrenschmidt
    Cc: Hugh Dickins
    Cc: David Miller
    Cc: Mark Fortescue
    Acked-by: William Lee Irwin III
    Cc: "Luck, Tony"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • The data structure to manage the information gathered about functions
    allocating and freeing objects is allocated when the list_lock has already
    been taken. We need to allocate with GFP_ATOMIC instead of GFP_KERNEL.
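
    In outline (a sketch; names simplified):

        spin_lock_irqsave(&n->list_lock, flags);
        /* cannot sleep with the spinlock held: GFP_KERNEL might,
         * GFP_ATOMIC will not */
        t = kmalloc(size, GFP_ATOMIC);
        spin_unlock_irqrestore(&n->list_lock, flags);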

    Signed-off-by: Christoph Lameter
    Cc: Mel Gorman
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

16 Jun, 2007

1 commit

  • When building with memory hotplug enabled and cpu hotplug disabled, we
    end up with the following section mismatch:

    WARNING: mm/built-in.o(.text+0x4e58): Section mismatch: reference to
    .init.text: (between 'free_area_init_node' and '__build_all_zonelists')

    This happens as a result of:

    free_area_init_node()
      -> free_area_init_core()
        -> zone_pcp_init()
          -> zone_batchsize()
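
    The one-line fix changes the section annotation on zone_batchsize() so it
    is not placed in discarded init text while the hotplug path can still
    reach it (the specific annotation shown is an assumption, not quoted from
    the patch):

        /* hypothetical: keep zone_batchsize() out of .init.text */
        static int __devinit zone_batchsize(struct zone *zone)
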
    Acked-by: Yasunori Goto
    Signed-off-by: Linus Torvalds

    --

    mm/page_alloc.c | 2 +-
    1 file changed, 1 insertion(+), 1 deletion(-)

    Paul Mundt
     

09 Jun, 2007

4 commits

  • into the appropriate #ifdef.

    Signed-off-by: Stephen Rothwell
    Cc: Yasunori Goto
    Cc: Andy Whitcroft
    Cc: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Rothwell
     
  • Instead of returning the smallest available object return ZERO_SIZE_PTR.

    A ZERO_SIZE_PTR can be legitimately used as an object pointer as long as it
    is not dereferenced. Dereferencing a ZERO_SIZE_PTR causes a distinctive
    fault. kfree can handle a ZERO_SIZE_PTR in the same way as NULL.
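
    A sketch of how such a sentinel can be defined: the value only has to live
    in the first, never-mapped page, so that any dereference faults at a
    distinctive address:

        /* non-NULL, yet guaranteed to fault on dereference */
        #define ZERO_SIZE_PTR ((void *)16)

        /* kfree() side: treat it exactly like NULL */
        if (unlikely(!object || object == ZERO_SIZE_PTR))
                return;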

    This enables functions to use zero-sized objects, e.g. with n being the
    number of objects:

        objects = kmalloc(n * sizeof(*objects), GFP_KERNEL);

        for (i = 0; i < n; i++)
                objects[i].x = y;

        kfree(objects);         /* handles ZERO_SIZE_PTR when n == 0 */

    Signed-off-by: Christoph Lameter
    Acked-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • cache_free_alien must be called regardless of whether we use alien caches
    or not. cache_free_alien() will do the right thing if there are no alien
    caches available.

    Signed-off-by: Christoph Lameter
    Cc: Paul Mundt
    Acked-by: Pekka J Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Randy Dunlap reports that a tmpfs, mounted with NUMA mpol= specifying an
    offline node, crashes as soon as data is allocated upon it. Now restrict it
    to online nodes, where before it was restricted only to nodes below
    MAX_NUMNODES.

    Signed-off-by: Hugh Dickins
    Cc: Robin Holt
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Tested-and-acked-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

01 Jun, 2007

3 commits

  • Hotplug callbacks are performed with interrupts enabled. SLUB requires
    interrupts to be disabled for flushing caches.

    Signed-off-by: Christoph Lameter
    Cc: Michal Piotrowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • zone->present_pages is updated in online_pages(). But, __add_zone() can be
    called twice or more before calling online_pages(). So,
    init_currenty_empty_zone() can be called unnecessary times. It is cause of
    memory leak of zone's wait_table.

    Signed-off-by: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     
  • On systems with a huge amount of physical memory, the VFS cache and the
    memory map (memmap) may eat all available system memory under 4G, and the
    system may then fail to allocate the swiotlb bounce buffer.

    There was a fix for this issue in arch/x86_64/mm/numa.c, but that fix does
    not cover the sparsemem model.

    This patch extends the fix to the sparsemem model by first trying to
    allocate the memmap above 4G.

    Signed-off-by: Zou Nan hai
    Acked-by: Suresh Siddha
    Cc: Andi Kleen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zou Nan hai
     

31 May, 2007

2 commits

  • Fix support for discontinuous memory

    Signed-off-by: Roman Zippel
    Signed-off-by: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Zippel
     
  • We need this patch ASAP. It fixes the mysterious hang that remained
    on some particular configurations with lockdep on after the first fix that
    moved the #ifdef CONFIG_SLUB_DEBUG to the right location. See
    http://marc.info/?t=117963072300001&r=1&w=2

    The kmem_cache_node cache is very special because it is needed for NUMA
    bootstrap. Under certain conditions (like for example if lockdep is
    enabled and significantly increases the size of spinlock_t) the structure
    may become exactly the size as one of the larger caches in the kmalloc
    array.

    That early during bootstrap we cannot perform merging properly. The unique
    id for the kmem_cache_node cache will match that of one of the kmalloc
    caches. Sysfs will complain about a duplicate directory entry. All of this
    occurs while the console is not yet fully operational. Thus boot may appear
    to fail silently.

    The kmem_cache_node cache is very special. During early bootstrap the main
    allocation function is not operational yet, so we have to run our own
    small special alloc function during early boot. It is also special in that
    it is never freed.

    We really do not want any merging on that cache. Set the refcount to -1 and
    forbid merging of slabs that have a negative refcount (see the sketch
    below).
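
    In outline (a sketch; SLUB's merge check works along these lines):

        s->refcount = -1;       /* kmem_cache_node cache: never merge */

        static int slab_unmergeable(struct kmem_cache *s)
        {
                if (s->refcount < 0)
                        return 1;       /* negative refcount forbids merging */
                return 0;               /* other checks omitted */
        }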

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

24 May, 2007

3 commits

  • The check for super-sized slabs, where we can no longer move the free
    pointer behind the object for debugging purposes etc., is accessing a
    field that is not set up yet. We must use objsize here, since the size of
    the slab has not been determined yet.

    The effect of this is that a global slab shrink via "slabinfo -s" will
    show errors about offsets being wrong if booted with slub_debug.
    Potentially there are other troubles with huge slabs under slub_debug
    because the calculated free pointer offset is truncated.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • mm/page_alloc.c:931: warning: 'setup_nr_node_ids' defined but not used

    This is now the only (!) compiler warning I get in my UML build :)

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • The object size calculation is wrong if !CONFIG_SLUB_DEBUG because the
    #ifdef CONFIG_SLUB_DEBUG is now switching off the size adjustments for
    DESTROY_BY_RCU and ctor.

    Signed-off-by: Christoph Lameter
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

22 May, 2007

2 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild-fix:
    mm/slab: fix section mismatch warning
    mm: fix section mismatch warnings
    init/main: use __init_refok to fix section mismatch
    kbuild: introduce __init_refok/__initdata_refok to suppress section mismatch warnings
    all-archs: consolidate .data section definition in asm-generic
    all-archs: consolidate .text section definition in asm-generic
    kbuild: add "Section mismatch" warning whitelist for powerpc
    kbuild: make better section mismatch reports on i386, arm and mips
    kbuild: make modpost section warnings clearer
    kconfig: search harder for curses library in check-lxdialog.sh
    kbuild: include limits.h in sumversion.c for PATH_MAX
    powerpc: Fix the MODALIAS generation in modpost for of devices

    Linus Torvalds
     
  • The first thing mm.h does is include sched.h, solely for the can_do_mlock()
    inline function, which has a "current" dereference inside. By dealing with
    can_do_mlock(), mm.h can be detached from sched.h, which is good. See
    below for why.

    This patch
    a) removes unconditional inclusion of sched.h from mm.h
    b) makes can_do_mlock() a normal function in mm/mlock.c (sketched below)
    c) exports can_do_mlock() to not break compilation
    d) adds sched.h inclusions back to files that were getting it indirectly.
    e) adds less bloated headers to some files (asm/signal.h, jiffies.h) that were
    getting them indirectly
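
    Points (b) and (c) amount to roughly the following (a sketch, close to the
    actual change):

        /* include/linux/mm.h: prototype only, no sched.h required */
        extern int can_do_mlock(void);

        /* mm/mlock.c */
        int can_do_mlock(void)
        {
                if (capable(CAP_IPC_LOCK))
                        return 1;
                if (current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur != 0)
                        return 1;
                return 0;
        }
        EXPORT_SYMBOL(can_do_mlock);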

    Net result is:
    a) mm.h users would get less code to open, read, preprocess, parse, ... if
    they don't need sched.h
    b) sched.h stops being dependency for significant number of files:
    on x86_64 allmodconfig touching sched.h results in recompile of 4083 files,
    after patch it's only 3744 (-8.3%).

    Cross-compile tested on

    all arm defconfigs, all mips defconfigs, all powerpc defconfigs,
    alpha alpha-up
    arm
    i386 i386-up i386-defconfig i386-allnoconfig
    ia64 ia64-up
    m68k
    mips
    parisc parisc-up
    powerpc powerpc-up
    s390 s390-up
    sparc sparc-up
    sparc64 sparc64-up
    um-x86_64
    x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig

    as well as my two usual configs.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

17 May, 2007

6 commits

  • Re-introduce rmap verification patches that Hugh removed when he removed
    PG_map_lock. PG_map_lock actually isn't needed to synchronise access to
    anonymous pages, because PG_locked and PTL together already do.

    These checks were important in discovering and fixing a rare rmap corruption
    in SLES9.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • __vunmap doesn't seem to be used outside of mm/vmalloc.c, and has
    no prototype in any header, so let's make it static.

    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • Currently we have a maze of configuration variables that determine the
    maximum slab size. Worst of all it seems to vary between SLAB and SLUB.

    So define a common maximum size for kmalloc. For convenience's sake we use
    the maximum size ever supported, which is 32 MB. We limit the maximum size
    to a lower value if MAX_ORDER does not allow such large allocations.
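
    Roughly, the cap can be expressed as (a sketch; 1 << 25 == 32 MB, clamped
    by what a MAX_ORDER page block can back):

        #define KMALLOC_SHIFT_HIGH ((MAX_ORDER + PAGE_SHIFT - 1) <= 25 ? \
                                (MAX_ORDER + PAGE_SHIFT - 1) : 25)

        #define KMALLOC_MAX_SIZE (1UL << KMALLOC_SHIFT_HIGH)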

    For many architectures this patch will have the effect of adding large
    kmalloc sizes. x86_64 adds 5 new kmalloc sizes. So a small amount of
    memory will be needed for these caches (contemporary SLAB has dynamically
    sizeable node and cpu structure so the waste is less than in the past)

    Most architectures will then be able to allocate objects with sizes up to
    what MAX_ORDER allows. We have had repeated breakage (in fact whenever we
    doubled the number of supported processors) on IA64 because one or the
    other struct grew beyond what the slab allocators supported. This will
    avoid future issues and e.g. avoid fixes for 2k and 4k cpu support.

    CONFIG_LARGE_ALLOCS is no longer necessary so drop it.

    It fixes sparc64 with SLAB.

    Signed-off-by: Christoph Lameter
    Signed-off-by: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Consolidate functionality into the #ifdef section.

    Extract tracing into one subroutine.

    Move object debug processing into the #ifdef section so that the
    code in __slab_alloc and __slab_free becomes minimal.

    Reduce the number of functions we need to provide stubs for in the
    !SLUB_DEBUG case.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • SLAB_CTOR_CONSTRUCTOR is always specified. No point in checking it.

    Signed-off-by: Christoph Lameter
    Cc: David Howells
    Cc: Jens Axboe
    Cc: Steven French
    Cc: Michael Halcrow
    Cc: OGAWA Hirofumi
    Cc: Miklos Szeredi
    Cc: Steven Whitehouse
    Cc: Roman Zippel
    Cc: David Woodhouse
    Cc: Dave Kleikamp
    Cc: Trond Myklebust
    Cc: "J. Bruce Fields"
    Cc: Anton Altaparmakov
    Cc: Mark Fasheh
    Cc: Paul Mackerras
    Cc: Christoph Hellwig
    Cc: Jan Kara
    Cc: David Chinner
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The atomicity when handling flags in SLUB is not necessary, since both
    flags used by SLUB are not updated in a racy way. Flag updates are either
    done during slab creation or destruction, or under slab_lock. Some of these
    flags do not have the non-atomic variants that we need, so define our own
    (sketched below).
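
    The pattern is the kernel's double-underscore (non-atomic) bitops; the
    specific flag bit below is an assumption:

        /* safe without atomic ops: these flags are only touched at slab
         * creation/destruction or under slab_lock */
        static inline void SetSlabDebug(struct page *page)
        {
                __set_bit(PG_error, &page->flags);
        }

        static inline void ClearSlabDebug(struct page *page)
        {
                __clear_bit(PG_error, &page->flags);
        }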

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter