26 Sep, 2006

40 commits

  • Make the FRV arch use the generic IRQ code rather than having its own
    routines for doing so.

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • As David Howells points out, binfmt_elf sometimes uses
    off_t, sometimes uses loff_t. Use loff_t throughout.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Take tty_mutex when accessing ->signal->tty in selinux code. Noted by Alan
    Cox. Longer term, we are looking at refactoring the code to provide better
    encapsulation of the tty layer, but this is a simple fix that addresses the
    immediate bug.
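
    For reference, the access pattern being protected is roughly the following
    (an illustrative kernel-style sketch, not the exact patched hook; assumes
    <linux/tty.h> and <linux/sched.h>):

    struct tty_struct *tty;

    mutex_lock(&tty_mutex);
    tty = current->signal->tty;
    if (tty) {
            /* inspect the controlling tty while holding tty_mutex */
    }
    mutex_unlock(&tty_mutex);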

    Signed-off-by: Stephen Smalley
    Acked-by: Alan Cox
    Acked-by: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Smalley
     
  • This patch converts the semaphore in the superblock security struct to a
    mutex. No locking changes or other code changes are done.
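
    The conversion is a mechanical mapping from the old semaphore API to the
    mutex API, roughly as follows (the sbsec->lock field name is shown as an
    assumption):

    struct semaphore sem;       /* becomes */  struct mutex lock;
    init_MUTEX(&sbsec->sem);    /* becomes */  mutex_init(&sbsec->lock);
    down(&sbsec->sem);          /* becomes */  mutex_lock(&sbsec->lock);
    up(&sbsec->sem);            /* becomes */  mutex_unlock(&sbsec->lock);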

    Signed-off-by: Eric Paris
    Acked-by: Stephen Smalley
    Acked-by: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Paris
     
  • This patch converts the remaining isec->sem into a mutex. The locking is
    essentially the same as before, only using the faster, smaller mutex
    rather than a semaphore. An out_unlock path is introduced rather than the
    conditional unlocking found in the original code.
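
    The out_unlock path is the usual kernel error-handling idiom, roughly
    (a sketch only; first_step/second_step and the isec->lock field name are
    illustrative):

    int rc;

    mutex_lock(&isec->lock);
    rc = first_step(isec);
    if (rc)
            goto out_unlock;
    rc = second_step(isec);
    out_unlock:
    mutex_unlock(&isec->lock);
    return rc;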

    Signed-off-by: Eric Paris
    Acked-by: Stephen Smalley
    Acked-by: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Paris
     
  • inode_security_set_sid is only called by security_inode_init_security, which
    is called when a new file is being created and needs to have its incore
    security state initialized and its security xattr set. This helper used to
    be called from other places as well, but this is now its only caller. So
    this patch rolls inode_security_set_sid directly back into
    security_inode_init_security. There is also no need to hold the isec->sem
    while doing this, as the inode is not available to other threads at this
    point in time.

    Signed-off-by: Eric Paris
    Acked-by: Stephen Smalley
    Acked-by: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Paris
     
  • Introduces support for policy version 21. This version of the binary
    kernel policy allows for defining range transitions on security classes
    other than the process security class. As always, backwards compatibility
    for older formats is retained. The security class is read in as specified
    when using the new format, while the "process" security class is assumed
    when using an older policy format.

    Signed-off-by: Darrel Goeddel
    Signed-off-by: Stephen Smalley
    Acked-by: James Morris
    Acked-by: Eric Paris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrel Goeddel
     
  • Enable configuration of SELinux maximum supported policy version to support
    legacy userland (init) that does not gracefully handle kernels that support
    newer policy versions two or more beyond the installed policy, as in FC3
    and FC4.

    [bunk@stusta.de: improve Kconfig help text]
    Signed-off-by: Stephen Smalley
    Acked-by: James Morris
    Acked-by: Eric Paris
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Smalley
     
  • Replace ctxid with sid in selinux_audit_rule_match interface for
    consistency with other interfaces.

    Signed-off-by: Stephen Smalley
    Acked-by: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Smalley
     
  • Rename selinux_ctxid_to_string to selinux_sid_to_string to be
    consistent with other interfaces.

    Signed-off-by: Stephen Smalley
    Acked-by: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Smalley
     
  • Eliminate selinux_task_ctxid since it duplicates selinux_task_get_sid.

    Signed-off-by: Stephen Smalley
    Acked-by: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Smalley
     
  • There are many places where we need to determine the node of a zone.
    Currently we use a difficult-to-read sequence of pointer dereferences.
    Put that into an inline function and use it throughout the VM. Maybe we
    can find a way to optimize the lookup in the future.
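
    The helper is essentially a one-liner along these lines (a sketch of the
    idea; the real patch also has to cover the !CONFIG_NUMA case):

    static inline int zone_to_nid(struct zone *zone)
    {
            return zone->zone_pgdat->node_id;
    }

    Callers can then write zone_to_nid(zone) instead of open-coding the
    pointer chase.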

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • I found two locations in hugetlb.c where we chase pointers instead of using
    page_to_nid(). page_to_nid() is more efficient and can get the node
    directly from the page flags.
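
    The change amounts to replacing the pointer chase with the flags-based
    helper, roughly:

    /* before: chase zone and pgdat pointers */
    nid = page_zone(page)->zone_pgdat->node_id;

    /* after: read the node id encoded in page->flags */
    nid = page_to_nid(page);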

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Update the comments for __oom_kill_task() to reflect the code changes.

    Signed-off-by: Ram Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ram Gupta
     
  • Minor performance fix.

    If we reclaimed enough slab pages from a zone then we can avoid going off
    node with the current allocation. Take care of updating nr_reclaimed when
    reclaiming from the slab.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Currently one can enable slab reclaim by setting an explicit option in
    /proc/sys/vm/zone_reclaim_mode. Slab reclaim is then used as a final
    option if freeing unmapped file-backed pages does not free enough pages to
    allow a local allocation.

    However, that means that the slab can grow excessively and that most memory
    of a node may be used by slabs. We have had a case where a machine with
    46GB of memory was using 40-42GB for slab. Zone reclaim was effective in
    dealing with pagecache pages. However, slab reclaim was only done during
    global reclaim (which is a bit rare on NUMA systems).

    This patch implements slab reclaim during zone reclaim. Zone reclaim
    occurs if there is a danger of an off node allocation. At that point we

    1. Shrink the per node page cache if the number of pagecache
    pages is more than min_unmapped_ratio percent of pages in a zone.

    2. Shrink the slab cache if the number of the node's reclaimable slab pages
    (this patch depends on an earlier one that implements that counter)
    is more than min_slab_ratio (a new /proc/sys/vm tunable).

    The shrinking of the slab cache is a bit problematic since it is not node
    specific. So we simply calculate what point in the slab we want to reach
    (current per-node slab use minus the number of pages that need to be
    allocated) and then repeatedly run global reclaim until that is
    unsuccessful or we have reached the limit. I hope we will have zone-based
    slab reclaim at some point, which will make that easier.

    The default for min_slab_ratio is 5%.

    Also remove the slab option from /proc/sys/vm/zone_reclaim_mode.
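
    The slab side of zone reclaim then boils down to a loop along these lines
    (an illustrative sketch; zone->min_slab_pages, nr_pages, sc, gfp_mask and
    lru_pages are taken from the surrounding zone_reclaim() context and are
    assumptions here):

    if (zone_page_state(zone, NR_SLAB_RECLAIMABLE) > zone->min_slab_pages) {
            /* target: current reclaimable slab minus what we need */
            unsigned long target =
                    zone_page_state(zone, NR_SLAB_RECLAIMABLE) - nr_pages;

            /*
             * shrink_slab() is not node specific, so repeatedly run
             * global slab reclaim until the target is reached or no
             * further progress is made.
             */
            while (zone_page_state(zone, NR_SLAB_RECLAIMABLE) > target &&
                   shrink_slab(sc.nr_scanned, gfp_mask, lru_pages))
                    ;
    }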

    [akpm@osdl.org: cleanups]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Remove the atomic counter for slab_reclaim_pages and replace the counter
    and NR_SLAB with two ZVC counters that account for unreclaimable and
    reclaimable slab pages: NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE.

    Change the check in vmscan.c to refer to NR_SLAB_RECLAIMABLE. The
    intent seems to be to check for slab pages that could be freed.
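
    In practice, reads of the old global atomic become ZVC lookups, roughly:

    /* before: one global atomic counter */
    reclaimable = atomic_read(&slab_reclaim_pages);

    /* after: per-zone ZVC counters, readable globally or per zone */
    reclaimable = global_page_state(NR_SLAB_RECLAIMABLE);
    zone_reclaimable = zone_page_state(zone, NR_SLAB_RECLAIMABLE);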

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • *_pages is a better description of the role of the variable.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The allocpercpu functions __alloc_percpu and __free_percpu() make heavy
    use of the slab allocator. However, they are not conceptually slab
    operations. This also simplifies SLOB (at this point SLOB may be broken
    in mm; this should fix it).

    Signed-off-by: Christoph Lameter
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • If a zone is unpopulated then we do not need to check for pages that are to
    be drained, nor for vm counters that may need to be updated.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • free_one_page currently adds the page to a fake list and calls
    free_page_bulk. free_page_bulk takes it off again and then calls
    __free_one_page.

    Make free_one_page go directly to __free_one_page. This saves the list
    on/off and a temporary list in free_one_page for higher-order pages.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • One of the changes necessary for shared page tables is to standardize the
    pxx_page macros. pte_page and pmd_page have always returned the struct
    page associated with their entry, while pte_page_kernel and pmd_page_kernel
    have returned the kernel virtual address. pud_page and pgd_page, on the
    other hand, return the kernel virtual address.

    Shared page tables need pud_page and pgd_page to return the actual page
    structures. There are very few actual users of these functions, so it is
    simple to standardize their usage.

    Since this is basic cleanup, I am submitting these changes as a standalone
    patch. Per Hugh Dickins' comments about it, I am also changing the
    pxx_page_kernel macros to pxx_page_vaddr to clarify their meaning.
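
    After the rename, the distinction at the pmd level looks roughly like this
    (illustrative fragment):

    struct page *page;
    pte_t *ptep;

    page = pmd_page(*pmd);                  /* struct page of the pte page */
    ptep = (pte_t *)pmd_page_vaddr(*pmd);   /* its kernel virtual address  */

    pud_page() and pgd_page() now likewise return struct page pointers, with
    pud_page_vaddr() and pgd_page_vaddr() returning the virtual address.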

    Signed-off-by: Dave McCracken
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave McCracken
     
  • On high-end systems (1024 or so CPUs) this can potentially cause a stack
    overflow. Fix the stack usage.

    Signed-off-by: Suresh Siddha
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Siddha, Suresh B
     
  • In many places we will need to use the same combination of flags. Specify
    a single GFP_THISNODE definition for ease of use in gfp.h.
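
    In gfp.h this is just a convenience combination of existing flags, along
    the lines of (the exact flag set is shown here as an assumption):

    #ifdef CONFIG_NUMA
    #define GFP_THISNODE    (__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)
    #else
    #define GFP_THISNODE    ((__force gfp_t)0)
    #endif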

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Profiling really suffers with off node buffers. Fail if no memory is
    available on the nodes. The profiling code can deal with these failures
    should they occur.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • There are frequent references to *z in get_page_from_freelist.

    Add an explicit zone variable that can be used in all these places.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The uncached allocator manages per-node pools. Specify __GFP_THISNODE in
    order to force allocation on the indicated node or fail. The uncached
    allocator already has logic to deal with failing allocations.

    Signed-off-by: Christoph Lameter
    Cc: Andy Whitcroft
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • If the user specified a node where we should move the page to then we
    really do not want any other node.

    Signed-off-by: Christoph Lameter
    Cc: Andy Whitcroft
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • …mory policy restrictions

    Add a new gfp flag __GFP_THISNODE to avoid fallback to other nodes. This
    flag is essential if a kernel component requires memory to be located on a
    certain node. It will be needed for alloc_pages_node() to force allocation
    on the indicated node and for alloc_pages() to force allocation on the
    current node.
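
    Usage is then along these lines (illustrative):

    /* allocate one page on node nid, or fail -- no fallback to other nodes */
    struct page *page = alloc_pages_node(nid, GFP_KERNEL | __GFP_THISNODE, 0);

    if (!page)
            return -ENOMEM;         /* the caller handles the failure */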

    Signed-off-by: Christoph Lameter
    Cc: Andy Whitcroft
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Place the alien array cache locks of on-slab malloc slab caches on a
    separate lockdep class. This avoids false positives from lockdep.

    [akpm@osdl.org: build fix]
    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Shai Fultheim
    Cc: Thomas Gleixner
    Acked-by: Arjan van de Ven
    Cc: Ingo Molnar
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
     
  • It is fairly easy to get a system to oops by simply sizing a cache via
    /proc in such a way that one of the caches (shared is easiest) becomes
    bigger than the maximum allowed slab allocation size. This occurs because
    enable_cpucache() fails if it cannot reallocate some caches.

    However, enable_cpucache() is used for multiple purposes: resizing caches,
    cache creation and bootstrap.

    If the slab is already up then we already have working caches. The resize
    can fail without a problem. We just need to return the proper error code.
    F.e. after this patch:

    # echo "size-64 10000 50 1000" >/proc/slabinfo
    -bash: echo: write error: Cannot allocate memory

    notice no OOPS.

    If we are doing a kmem_cache_create() then we also should not panic but
    return -ENOMEM.

    If on the other hand we do not have a fully bootstrapped slab allocator yet
    then we should indeed panic since we are unable to bring up the slab to its
    full functionality.

    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The ability to free memory allocated to a slab cache is also useful if an
    error occurs during setup of a slab. So extract the function.

    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • [akpm@osdl.org: export fix]
    Signed-off-by: Christoph Hellwig
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Let's try to keep mm/ comments more useful and up to date. This is a start.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Also, check that we get a valid slabp_cache for off-slab slab descriptors.
    We should always get this. If we don't, we will have to disable off-slab
    descriptors for this cache and do the calculations again. This is a rare
    case, so add a BUG_ON for now, just in case.

    Signed-off-by: Alok N Kataria
    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Shai Fultheim
    Cc: Pekka Enberg
    Cc: Manfred Spraul
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
     
  • Introduce ARCH_LOW_ADDRESS_LIMIT which can be set per architecture to
    override the 4GB default limit used by the bootmem allocator within
    __alloc_bootmem_low() and __alloc_bootmem_low_node(). E.g. s390 needs a
    2GB limit instead of 4GB.
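
    An architecture can then override the limit in its headers, e.g. (the s390
    value is shown as an assumption):

    /* generic default in <linux/bootmem.h> */
    #ifndef ARCH_LOW_ADDRESS_LIMIT
    #define ARCH_LOW_ADDRESS_LIMIT  0xffffffffUL    /* 4GB */
    #endif

    /* s390 override */
    #define ARCH_LOW_ADDRESS_LIMIT  0x7fffffffUL    /* 2GB */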

    Acked-by: Ingo Molnar
    Cc: Martin Schwidefsky
    Signed-off-by: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • Print the name of the task invoking the OOM killer. Could make debugging
    easier.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Skip kernel threads, rather than having them return 0 from badness.
    Theoretically, badness might truncate all results to 0, thus a kernel thread
    might be picked first, causing an infinite loop.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • PF_SWAPOFF processes currently cause select_bad_process to return straight
    away. Instead, give them high priority so we will kill them first; however,
    we also first ensure that no parallel OOM kills are happening at the same
    time.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Having the oomkilladj == OOM_DISABLE check before the releasing check means
    that oomkilladj == OOM_DISABLE tasks exiting will not stop the OOM killer.

    Moving the test down will give the desired behaviour. Also: it will allow
    them to "OOM-kill" themselves if they are exiting. As per the previous patch,
    this is required to prevent OOM killer deadlocks (and they don't actually get
    killed, because they're already exiting -- they're simply allowed access to
    memory reserves).

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin