09 Dec, 2006

4 commits

  • This facility provides three entry points:

    ilog2()      Log base 2 of unsigned long
    ilog2_u32()  Log base 2 of u32
    ilog2_u64()  Log base 2 of u64

    These facilities can either be used inside functions on dynamic data:

    int do_something(long x)
    {
            ...;
            y = ilog2(x);
            ...;
    }

    Or can be used to statically initialise global variables with constant values:

    unsigned n = ilog2(27);

    When performing static initialisation, the compiler will report "error:
    initializer element is not constant" if asked to take a log of zero or of
    something not reducible to a constant. They treat negative numbers as
    unsigned.

    When not dealing with a constant, they fall back to using fls() which permits
    them to use arch-specific log calculation instructions - such as BSR on
    x86/x86_64 or SCAN on FRV - if available.
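
    To illustrate how the constant case can satisfy a static initialiser,
    here is a minimal sketch (hypothetical macro, truncated to 8 bits; the
    kernel version covers the full 64 and falls back to fls() for
    non-constant arguments):

        #define const_ilog2_8(n)                \
                ((n) & (1u << 7) ? 7 :          \
                 (n) & (1u << 6) ? 6 :          \
                 (n) & (1u << 5) ? 5 :          \
                 (n) & (1u << 4) ? 4 :          \
                 (n) & (1u << 3) ? 3 :          \
                 (n) & (1u << 2) ? 2 :          \
                 (n) & (1u << 1) ? 1 : 0)

        /* legal as a static initialiser: folds to 4 at compile time */
        static unsigned n = const_ilog2_8(27);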

    [akpm@osdl.org: MMC fix]
    Signed-off-by: David Howells
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Herbert Xu
    Cc: David Howells
    Cc: Wojtek Kaniewski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Signed-off-by: Josef Sipek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef Sipek
     
  • Change all the uses of f_{dentry,vfsmnt} to f_path.{dentry,mnt} in linux/mm/.
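
    The conversion is mechanical; an illustrative (hypothetical) call site:

        struct dentry *dentry;

        dentry = file->f_dentry;        /* before */
        dentry = file->f_path.dentry;   /* after  */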

    Signed-off-by: Josef "Jeff" Sipek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef "Jeff" Sipek
     
  • fallback_alloc() could end up calling cpuset_zone_allowed() with interrupts
    disabled (by code in kmem_cache_alloc_node()), but without __GFP_HARDWALL
    set, leading to a possible call of a sleeping function with interrupts
    disabled.

    This results in the BUG report:

    BUG: sleeping function called from invalid context at kernel/cpuset.c:1520
    in_atomic():0, irqs_disabled():1
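
    In outline, the failing path has this shape (a sketch of the bug, not
    the exact code):

        /* kmem_cache_alloc_node() has already disabled interrupts here */
        local_irq_save(flags);
        ...;
        obj = fallback_alloc(cachep, flags);
                /* -> cpuset_zone_allowed(zone, flags) without
                 *    __GFP_HARDWALL may take a mutex and sleep,
                 *    triggering the BUG above */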

    Thanks to Paul Menage for catching this one.

    Signed-off-by: Paul Jackson
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

08 Dec, 2006

36 commits

  • - move some file_operations structs into the .rodata section

    - move static strings from policy_types[] array into the .rodata section

    - fix generic seq_operations usages, so that those structs may be defined
    as "const" as well
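
    A minimal userspace illustration of the effect (compile with gcc and
    inspect the symbol with objdump -t; the names are made up):

        struct ops {
                int (*show)(void);
        };

        static int example_show(void) { return 0; }

        /* const + static storage: placed in .rodata, so the table
         * cannot be overwritten at run time */
        static const struct ops example_ops = { .show = example_show };

        int main(void) { return example_ops.show(); }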

    [akpm@osdl.org: couple of fixes]
    Signed-off-by: Helge Deller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Helge Deller
     
  • In time for 2.6.20, we can get rid of this junk.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Burman Yan
     
  • Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • There was lots of #ifdef noise in the kernel due to hotcpu_notifier(fn,
    prio) not correctly marking 'fn' as used in the !HOTPLUG_CPU case, and thus
    generating compiler warnings of unused symbols, hence forcing people to add
    #ifdefs.

    The compiler can skip truly unused functions just fine, as the identical
    before/after image sizes show:

       text    data     bss     dec    hex filename
    1624412  728710 3674856 6027978 5bfaca vmlinux.before
    1624412  728710 3674856 6027978 5bfaca vmlinux.after
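
    The fix keeps a reference to 'fn' in the !HOTPLUG_CPU stub so the
    symbol counts as used, while the optimizer still discards the dead
    function (a sketch of the two variants):

        #ifdef CONFIG_HOTPLUG_CPU
        #define hotcpu_notifier(fn, pri) {                              \
                static struct notifier_block fn##_nb =                  \
                        { .notifier_call = fn, .priority = pri };       \
                register_cpu_notifier(&fn##_nb);                        \
        }
        #else
        /* 'fn' is referenced, so no "defined but not used" warning,
         * yet the compiler can still eliminate it as dead code */
        #define hotcpu_notifier(fn, pri) do { (void)(fn); } while (0)
        #endif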

    [akpm@osdl.org: topology.c fix]
    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • It has no users and it's doubtful that we'll need it again.

    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Use put_pages_list() instead of opencoding it.
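
    For reference, the opencoded pattern the helper replaces looks roughly
    like this (illustrative):

        struct page *victim;

        while (!list_empty(pages)) {
                victim = list_entry(pages->prev, struct page, lru);
                list_del(&victim->lru);
                page_cache_release(victim);
        }

        /* becomes simply */
        put_pages_list(pages);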

    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
     
  • Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Move process freezing functions from include/linux/sched.h to freezer.h, so
    that modifications to the freezer or the kernel configuration don't require
    recompiling just about everything.

    [akpm@osdl.org: fix ueagle driver]
    Signed-off-by: Nigel Cunningham
    Cc: "Rafael J. Wysocki"
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nigel Cunningham
     
  • Currently swsusp saves the contents of highmem pages by copying them to the
    normal zone which is quite inefficient (eg. it requires two normal pages
    to be used for saving one highmem page). This may be improved by using
    highmem for saving the contents of saveable highmem pages.

    Namely, during the suspend phase of the suspend-resume cycle we try to
    allocate as many free highmem pages as there are saveable highmem pages.
    If there are not enough highmem image pages to store the contents of all of
    the saveable highmem pages, some of them will be stored in the "normal"
    memory. Next, we allocate as many free "normal" pages as needed to store
    the (remaining) image data. We use a memory bitmap to mark the allocated
    free pages (ie. highmem as well as "normal" image pages).

    Now, we use another memory bitmap to mark all of the saveable pages
    (highmem as well as "normal") and the contents of the saveable pages are
    copied into the image pages. Then, the second bitmap is used to save the
    pfns corresponding to the saveable pages and the first one is used to save
    their data.

    During the resume phase the pfns of the pages that were saveable during
    the suspend are loaded from the image and used to mark the "unsafe" page
    frames. Next, we try to allocate as many free highmem page frames as are
    needed to load all of the image data that had been in highmem before the
    suspend, and we allocate enough free "normal" page frames that the total
    number of allocated free pages (highmem and "normal") is equal to the
    size of the image. While doing this we have to make sure that there will
    be some extra free "normal" and "safe" page frames for the two lists of
    PBEs constructed later.

    Now, the image data are loaded, if possible, into their "original" page
    frames. The image data that cannot be written into their "original" page
    frames are loaded into "safe" page frames and their "original" kernel
    virtual addresses, as well as the addresses of the "safe" pages containing
    their copies, are stored in one of two lists of PBEs.

    One list of PBEs is for the copies of "normal" suspend pages (ie. "normal"
    pages that were saveable during the suspend) and it is used in the same way
    as previously (ie. by the architecture-dependent parts of swsusp). The
    other list of PBEs is for the copies of highmem suspend pages. The pages
    in this list are restored (in a reversible way) right before the
    arch-dependent code is called.
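
    Condensed into pseudocode, the suspend-phase allocation described above
    looks roughly like this (helper names are hypothetical, mirroring the
    text rather than the actual swsusp functions):

        nr_highmem = count_saveable_highmem_pages();
        to_alloc   = min(nr_highmem, count_free_highmem_pages());

        alloc_highmem_image_pages(&alloc_bm, to_alloc);     /* bitmap 1 */
        alloc_normal_image_pages(&alloc_bm, image_size - to_alloc);

        mark_saveable_pages(&save_bm);                      /* bitmap 2 */
        copy_data_pages(&alloc_bm, &save_bm);  /* contents into image */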

    Signed-off-by: Rafael J. Wysocki
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Make swsusp use block device offsets instead of swap offsets to identify swap
    locations and make it use the same code paths for writing as well as for
    reading data.

    This allows us to use the same code for handling swap files and swap
    partitions and to simplify the code, eg. by dropping rw_swap_page_sync().

    Signed-off-by: Rafael J. Wysocki
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • The Linux kernel handles swap files almost in the same way as it handles swap
    partitions and there are only two differences between these two types of swap
    areas:

    (1) swap files need not be contiguous,

    (2) the header of a swap file is not in the first block of the partition
    that holds it. From the swsusp's point of view (1) is not a problem,
    because it is already taken care of by the swap-handling code, but (2) has
    to be taken into consideration.

    In principle the location of a swap file's header may be determined with
    the help of the appropriate filesystem driver. Unfortunately, however,
    this requires the filesystem holding the swap file to be mounted, and if
    that filesystem is journaled, it cannot be mounted during a resume from
    disk. For this reason we need some other means by which swap areas can
    be identified.

    For example, to identify a swap area we can use the partition that holds the
    area and the offset from the beginning of this partition at which the swap
    header is located.

    The following patch allows swsusp to identify swap areas this way. It changes
    swap_type_of() so that it takes an additional argument representing an offset
    of the swap header within the partition represented by its first argument.
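
    A sketch of the interface as described (the exact types are an
    assumption):

        /* was: int swap_type_of(dev_t device); */
        int swap_type_of(dev_t device, sector_t offset);

        /* offset == 0 keeps the old behaviour for swap partitions;
         * a non-zero offset identifies a swap file's header within
         * the partition that holds it */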

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Make radix tree lookups safe to be performed without locks. Readers are
    protected against nodes being deleted by using RCU based freeing. Readers
    are protected against new node insertion by using memory barriers to ensure
    the node itself will be properly written before it is visible in the radix
    tree.

    Each radix tree node keeps a record of its height (above the leaf
    nodes). This height does not change after insertion -- when the radix
    tree is extended, higher nodes are only inserted at the top. So a lookup
    can take the pointer to what is *now* the root node and traverse down it
    even if the tree is concurrently extended and this node becomes a
    subtree of a new root.

    "Direct" pointers (tree height of 0, where root->rnode points directly to
    the data item) are handled by using the low bit of the pointer to signal
    whether rnode is a direct pointer or a pointer to a radix tree node.

    When a reader wants to traverse the next branch, they will take a copy of
    the pointer. This pointer will be either NULL (and the branch is empty) or
    non-NULL (and will point to a valid node).
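
    A simplified reader-side descent, to make the ordering concrete (a
    sketch: tag handling and the direct-pointer bit are omitted, and this
    is not the actual lookup code):

        rcu_read_lock();
        node = rcu_dereference(root->rnode);    /* possibly a stale root */
        height = node->height;
        shift = (height - 1) * RADIX_TREE_MAP_SHIFT;

        while (height > 0) {
                int offset = (index >> shift) & RADIX_TREE_MAP_MASK;

                node = rcu_dereference(node->slots[offset]);
                if (node == NULL)               /* empty branch */
                        break;
                shift -= RADIX_TREE_MAP_SHIFT;
                height--;
        }
        rcu_read_unlock();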

    [akpm@osdl.org: cleanups]
    [Lee.Schermerhorn@hp.com: bugfixes, comments, simplifications]
    [clameter@sgi.com: build fix]
    Signed-off-by: Nick Piggin
    Cc: "Paul E. McKenney"
    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Currently we use the lru head link of the second page of a compound page
    to hold its destructor. This was ok when it was purely an internal
    implementation detail. However, hugetlbfs overrides this destructor,
    violating the layering. Abstract this out as explicit calls, and also
    introduce a type for the callback function, allowing it to be type
    checked. For each callback we pre-declare the function, causing a type
    error on definition rather than on use elsewhere.
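
    The abstraction amounts to a typed setter/getter pair around the
    existing storage slot (a sketch modeled on the description):

        typedef void (*compound_page_dtor)(struct page *);

        static inline void set_compound_page_dtor(struct page *page,
                                                  compound_page_dtor dtor)
        {
                /* still stashed in the second page's lru head link */
                page[1].lru.next = (void *)dtor;
        }

        static inline compound_page_dtor get_compound_page_dtor(struct page *page)
        {
                return (compound_page_dtor)page[1].lru.next;
        }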

    [akpm@osdl.org: cleanups]
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • Currently we simply attempt to allocate from all allowed nodes using
    GFP_THISNODE. However, GFP_THISNODE does not do reclaim (it won't do any
    at all if the recent GFP_THISNODE patch is accepted). If we truly run
    out of memory in the whole system then fallback_alloc may return NULL
    although memory may still be available if we performed more thorough
    reclaim.

    This patch changes fallback_alloc() so that we first only inspect all the
    per node queues for available slabs. If we find any then we allocate from
    those. This avoids slab fragmentation by first getting rid of all partial
    allocated slabs on every node before allocating new memory.

    If we cannot satisfy the allocation from any per node queue then we extend
    a slab. We now call into the page allocator without specifying
    GFP_THISNODE. The page allocator will then implement its own fallback (in
    the given cpuset context), perform necessary reclaim (again considering not
    a single node but the whole set of allowed nodes) and then return pages for
    a new slab.

    We identify from which node the pages were allocated and then insert the
    pages into the corresponding per node structure. In order to do so we need
    to modify cache_grow() to take a parameter that specifies the new slab.
    kmem_getpages() can no longer set the GFP_THISNODE flag since we need to be
    able to use kmem_getpage to allocate from an arbitrary node. GFP_THISNODE
    needs to be specified when calling cache_grow().

    One key advantage is that the decision from which node to allocate new
    memory is removed from slab fallback processing. The patch allows us to
    go back to using the page allocator's fallback/reclaim logic.
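
    In outline, the reworked fallback (a condensed sketch; the
    node-iteration macro is hypothetical):

        /* stage 1: only inspect the per-node queues of all allowed
         * nodes, without handing out new pages */
        for_each_allowed_node(nid) {
                obj = ____cache_alloc_node(cachep,
                                flags | GFP_THISNODE | __GFP_NOWARN, nid);
                if (obj)
                        return obj;
        }

        /* stage 2: let the page allocator pick a node (with fallback
         * and reclaim over the whole allowed set), then grow the cache
         * on whatever node the pages came from */
        page = alloc_pages(flags & ~GFP_THISNODE, cachep->gfporder);
        nid  = page_to_nid(page);
        cache_grow(cachep, flags | GFP_THISNODE, nid, page);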

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The intent of GFP_THISNODE is to make sure that an allocation occurs on a
    particular node. If this is not possible then NULL needs to be returned so
    that the caller can choose what to do next on its own (the slab allocator
    depends on that).

    However, GFP_THISNODE currently triggers reclaim before returning a
    failure (GFP_THISNODE means GFP_NORETRY is set). If we have
    over-allocated a node then we will currently do some reclaim before
    returning NULL. The caller may want memory from other nodes before
    reclaim should be triggered. (If the caller wants reclaim then it can
    directly use __GFP_THISNODE instead.)

    There is no flag to avoid reclaim in the page allocator and adding yet
    another GFP_xx flag would be difficult given that we are out of available
    flags.

    So just compare and see if all bits for GFP_THISNODE (__GFP_THISNODE,
    __GFP_NORETRY and __GFP_NOWARN) are set. If so then we return NULL before
    waking up kswapd.
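
    The test is a plain bit comparison early in the allocator (sketch
    matching the description):

        /* GFP_THISNODE == __GFP_THISNODE | __GFP_NORETRY | __GFP_NOWARN */
        if ((gfp_mask & GFP_THISNODE) == GFP_THISNODE)
                goto nopage;    /* fail fast: no kswapd wakeup, no reclaim */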

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This addresses two issues:

    1. Kmalloc_node() may intermittently return NULL if we are allocating
    from the current node and are unable to obtain memory for the current
    node from the page allocator. This is because we call ___cache_alloc()
    if nodeid == numa_node_id() and ____cache_alloc() is not able to fall
    back to other nodes.

    This was introduced in the 2.6.19 development cycle.
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • SLAB_DMA is an alias of GFP_DMA. This is the last one so we
    remove the leftover comment too.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • SLAB_KERNEL is an alias of GFP_KERNEL.
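
    The substitution is mechanical (illustrative call site):

        obj = kmem_cache_alloc(cachep, SLAB_KERNEL);    /* before */
        obj = kmem_cache_alloc(cachep, GFP_KERNEL);     /* after  */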

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • SLAB_LEVEL_MASK is only used internally to the slab and is an alias of
    GFP_LEVEL_MASK.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • It is only used internally in the slab.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • David Binderman and his Intel C compiler rightly observe that
    install_file_pte no longer has any use for its pte_val.

    Signed-off-by: Hugh Dickins
    Cc: d binderman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • These patches introduced new switch statements which are indented
    contrary to the consensus in mm/*.c. Fix them up to match that
    consensus.

    [PATCH] node local per-cpu-pages
    [PATCH] ZVC: Scale thresholds depending on the size of the system
    commit e7c8d5c9955a4d2e88e36b640563f5d6d5aba48a
    commit df9ecaba3f152d1ea79f2a5e0b87505e03f47590

    Signed-off-by: Andy Whitcroft
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • The fsfuzzer found this; with a corrupt small swapfile that claims to have
    many pages:

    [root]# file swap.741.img
    swap.741.img: Linux/i386 swap file (new style) 1 (4K pages) size 1040191487 pages
    [root]# ls -l swap.741.img
    -rw-r--r-- 1 root root 16777216 Nov 22 05:18 swap.741.img

    sys_swapon() will try to vmalloc all those pages, and -then- check to see if
    the file is actually that large:

        if (!(p->swap_map = vmalloc(maxpages * sizeof(short)))) {

        if (swapfilesize && maxpages > swapfilesize) {
                printk(KERN_WARNING
                       "Swap area shorter than signature indicates\n");

    It seems to me that it would make more sense to move this test up before
    the vmalloc, with the other checks, to avoid the OOM-killer in this
    situation...
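
    That reordering would look roughly like this (sketch):

        /* validate the claimed size before committing to the big vmalloc */
        if (swapfilesize && maxpages > swapfilesize) {
                printk(KERN_WARNING
                       "Swap area shorter than signature indicates\n");
                goto bad_swap;
        }

        p->swap_map = vmalloc(maxpages * sizeof(short));
        if (!p->swap_map)
                goto bad_swap;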

    Signed-off-by: Eric Sandeen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sandeen
     
  • NUMA node ids are passed as either int or unsigned int almost
    exclusively, yet page_to_nid and zone_to_nid both return unsigned long.
    This is a throwback to when page_to_nid was a #define and thus exposed
    the real type of the page flags field.

    In addition to fixing up the definitions of page_to_nid and zone_to_nid
    (as sketched after the list below), I audited the users of these
    functions, identifying the following incorrect uses:

    1) mm/page_alloc.c show_node() -- printk dumping the node id,
    2) include/asm-ia64/pgalloc.h pgtable_quicklist_free() -- comparison
    against numa_node_id() which returns an int from cpu_to_node(), and
    3) mm/mempolicy.c check_pte_range -- used as an index in node_isset
    which uses bit_set which in generic code takes an int.
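
    After the fix the accessor simply returns int (sketch):

        static inline int page_to_nid(struct page *page)
        {
                return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
        }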

    Signed-off-by: Andy Whitcroft
    Cc: Christoph Lameter
    Cc: "Luck, Tony"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • drain_node_pages() currently drains the complete pageset of all pages. If
    there are a large number of pages in the queues then we may hold off
    interrupts for too long.

    Duplicate the method used in free_hot_cold_page. Only drain pcp->batch
    pages at one time.
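
    The batched variant, in outline (a sketch along the lines of
    free_hot_cold_page(); the locals are illustrative):

        while (pcp->count) {
                int to_drain = min(pcp->count, pcp->batch);

                local_irq_save(flags);
                free_pages_bulk(zone, to_drain, &pcp->list, 0);
                pcp->count -= to_drain;
                local_irq_restore(flags);   /* irqs re-enabled per batch */
        }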

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This patch makes the needlessly global "global_faults" static.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • When booting a NUMA system with nodes that have no memory (eg by
    limiting memory), alloc_bootmem_core tried to find pages in an
    uninitialized bootmem_map, causing a null pointer access. This fix adds
    a check so that NULL is returned. That will enable the caller
    (__alloc_bootmem_nopanic) to allocate memory on other nodes without a
    panic.
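
    The check amounts to bailing out before touching the map (sketch):

        /* node has no memory: the bootmem map was never set up */
        if (!bdata->node_bootmem_map)
                return NULL;    /* caller falls back to another node */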

    Signed-off-by: Christian Krafft
    Cc: Christoph Lameter
    Cc: Andy Whitcroft
    Cc: Martin Bligh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christian Krafft
     
  • The patch (as824b) makes percpu_free() ignore NULL arguments, as one would
    expect for a deallocation routine. (Note that free_percpu is #defined as
    percpu_free in include/linux/percpu.h.) A few callers are updated to remove
    now-unneeded tests for NULL. A few other callers already seem to assume
    that passing a NULL pointer to percpu_free() is okay!

    The patch also removes an unnecessary NULL check in percpu_depopulate().
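
    After the patch the entry point short-circuits on NULL, just like
    kfree() (a sketch of the shape of the change):

        void percpu_free(void *__pdata)
        {
                if (unlikely(!__pdata))
                        return;         /* NULL is now a silent no-op */
                __percpu_depopulate_mask(__pdata, &cpu_possible_map);
                kfree(__percpu_disguise(__pdata));
        }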

    Signed-off-by: Alan Stern
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Stern
     
  • We have variants of kmalloc and kmem_cache_alloc that leave leak tracking to
    the caller. This is used for subsystem-specific allocators like skb_alloc.

    To make skb_alloc node-aware we need similar routines for the node-aware slab
    allocator, which this patch adds.

    Note that the code is rather ugly, but it mirrors the non-node-aware
    code 1:1.
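
    A sketch of the new node-aware entry point, mirroring the existing
    kmalloc_track_caller:

        extern void *__kmalloc_node_track_caller(size_t size, gfp_t flags,
                                                 int node, void *caller);

        #define kmalloc_node_track_caller(size, flags, node)            \
                __kmalloc_node_track_caller(size, flags, node,          \
                                            __builtin_return_address(0))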

    [akpm@osdl.org: add module export]
    Signed-off-by: Christoph Hellwig
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • It would be possible for /proc/swaps to not always print out the header:

    swapon /dev/hdc2
    swapon /dev/hde2
    swapoff /dev/hdc2

    At this point /proc/swaps would not have a header.
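
    The usual seq_file remedy is to print the header for the start token
    instead of tying it to the first entry (sketch):

        static int swap_show(struct seq_file *swap, void *v)
        {
                if (v == SEQ_START_TOKEN) {
                        seq_puts(swap, "Filename\t\t\tType\t\tSize\tUsed\tPriority\n");
                        return 0;
                }
                /* ... print one swap entry as before ... */
                return 0;
        }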

    Signed-off-by: Suleiman Souhlal
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Suleiman Souhlal
     
  • OOM can panic due to processes stuck in __alloc_pages() doing an
    infinite rebalance loop while no memory can be reclaimed. The OOM
    killer tries to kill some processes, but unfortunately the rebalance
    label was moved by someone below the TIF_MEMDIE check, so the buddy
    allocator doesn't see that the process has been OOM-killed and can
    simply fail the allocation :/

    Observed in reality on RHEL4(2.6.9)+OpenVZ kernel when a user doing some
    memory allocation tricks triggered OOM panic.

    Signed-off-by: Denis Lunev
    Signed-off-by: Kirill Korotaev
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev
     
  • mm is defined as vma->vm_mm, so use that.

    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik Bobbaers
     
  • When using numa=fake on non-NUMA hardware there is no benefit to having the
    alien caches, and they consume much memory.

    Add a kernel boot option to disable them.

    Christoph sayeth "This is good to have even on large NUMA. The problem is
    that the alien caches grow by the square of the size of the system in terms of
    nodes."
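
    The plumbing for such an option is small (a sketch; the option added
    here is "noaliencache"):

        static int use_alien_caches __read_mostly = 1;

        static int __init noaliencache_setup(char *s)
        {
                use_alien_caches = 0;
                return 1;
        }
        __setup("noaliencache", noaliencache_setup);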

    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Here's an attempt towards doing away with lock_cpu_hotplug in the slab
    subsystem. This approach also fixes a bug which shows up when cpus are
    being offlined/onlined and slab caches are being tuned simultaneously.

    http://marc.theaimsgroup.com/?l=linux-kernel&m=116098888100481&w=2

    The patch has been stress tested overnight on a 2 socket 4 core AMD box
    with repeated cpu online and offline, while dbench and kernbench
    processes were running and slab caches were being tuned at the same
    time. There were no lockdep warnings either. (This test was on 2.6.18,
    as 2.6.19-rc crashes at __drain_pages
    http://marc.theaimsgroup.com/?l=linux-kernel&m=116172164217678&w=2 )

    The approach here is to hold cache_chain_mutex from CPU_UP_PREPARE until
    CPU_ONLINE (similar in approach to workqueue_mutex). Slab code sensitive
    to cpu_online_map (kmem_cache_create, kmem_cache_destroy, slabinfo_write,
    __cache_shrink) is already serialized with cache_chain_mutex. (This patch
    lengthens the cache_chain_mutex hold time at kmem_cache_destroy to cover
    this.) This patch also takes cache_chain_mutex at kmem_cache_shrink to
    protect the sanity of cpu_online_map at __cache_shrink, as viewed by slab
    (kmem_cache_shrink->__cache_shrink->drain_cpu_caches). But, really,
    kmem_cache_shrink is used in just one place in the acpi subsystem! Do we
    really need to keep kmem_cache_shrink at all?

    Another note: it looks like a cpu hotplug event can send CPU_UP_CANCELED
    to a registered subsystem even if the subsystem did not receive
    CPU_UP_PREPARE. This could be due to a subsystem registered for
    notification earlier than the current subsystem crapping out with
    NOTIFY_BAD. Badness can occur within the CPU_UP_CANCELED code path at
    slab if this happens (the same would apply to workqueue.c as well). To
    overcome this, we might have to either
    a) use a per-subsystem flag and avoid handling CPU_UP_CANCELED,
    b) use special notifier events like LOCK_ACQUIRE/RELEASE as Gautham was
    using in his experiments, or
    c) not send CPU_UP_CANCELED to a subsystem which did not receive
    CPU_UP_PREPARE.

    I would prefer c).
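
    The locking window described above, as notifier pseudocode (sketch):

        case CPU_UP_PREPARE:
                mutex_lock(&cache_chain_mutex);
                /* set up per-cpu queues for the incoming cpu ... */
                break;
        case CPU_ONLINE:
        case CPU_UP_CANCELED:
                /* tear down on cancel; release the mutex either way */
                mutex_unlock(&cache_chain_mutex);
                break;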

    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Shai Fultheim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
     
  • When CONFIG_SLAB_DEBUG is used in combination with ARCH_SLAB_MINALIGN,
    some debug flags should be disabled which depend on BYTES_PER_WORD
    alignment.

    The disabling of these debug flags is not properly handled when

        BYTES_PER_WORD < ARCH_SLAB_MINALIGN < cache_line_size()

    This patch fixes that and also adds an alignment check to
    cache_alloc_debugcheck_after() when ARCH_SLAB_MINALIGN is used.
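
    The guard amounts to stripping the word-alignment-dependent debug flags
    whenever the effective alignment exceeds BYTES_PER_WORD (sketch):

        if (ralign > BYTES_PER_WORD)    /* eg. set by ARCH_SLAB_MINALIGN */
                flags &= ~(SLAB_RED_ZONE | SLAB_STORE_USER);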

    Signed-off-by: Kevin Hilman
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kevin Hilman