17 Dec, 2006

2 commits

  • Hugh Dickins correctly points out that mincore() is actually _supposed_
    to fail on an unmapped hole in the user address space, rather than
    return valid ("empty") information about the hole. This just simplifies
    the problem further (I had been misled by our previous confusing and
    complicated way of doing mincore()).

    Also, in the unlikely situation that we can't allocate a temporary
    kernel buffer, we should actually return EAGAIN, not ENOMEM, to keep the
    "unmapped hole" and "allocation failure" error cases separate.

    Finally, add a comment about our stupid historical lack of support for
    anonymous mappings. I'll fix that if somebody reminds me after 2.6.20
    is out.

    Acked-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Doug Chapman noticed that mincore() will do a "copy_to_user()" of the
    result while holding the mmap semaphore for reading, which is a big
    no-no. While a recursive read-lock on a semaphore in the case of a page
    fault happens to work, we don't actually allow them, due to deadlock
    scenarios with writers that arise from fairness issues.

    Doug and Marcel sent in a patch to fix it, but I decided to just rewrite
    the mess instead - not just fixing the locking problem, but making the
    code smaller and (imho) much easier to understand.

    Cc: Doug Chapman
    Cc: Marcel Holtmann
    Cc: Hugh Dickins
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

14 Dec, 2006

6 commits

  • To allow a more effective copy_user_highpage() on certain architectures,
    a vma argument is added to copy_user_highpage() and cow_user_page(),
    allowing the implementations of these functions to check for the VM_EXEC
    bit.
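
    The widened signatures look roughly like this (a sketch based on the
    description above):

    void copy_user_highpage(struct page *to, struct page *from,
                            unsigned long vaddr, struct vm_area_struct *vma);

    void cow_user_page(struct page *dst, struct page *src,
                       unsigned long va, struct vm_area_struct *vma);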

    The main part of this patch was originally written by Ralf Baechle;
    Atsushi Nemoto did the debugging.

    Signed-off-by: Atsushi Nemoto
    Signed-off-by: Ralf Baechle
    Signed-off-by: Linus Torvalds

    Atsushi Nemoto
     
  • When some objects are allocated by one CPU but freed by another CPU we can
    consume a lot of cycles doing divides in obj_to_index().

    (Typical load on a dual processor machine where network interrupts are
    handled by one particular CPU (allocating skbufs), and the other CPU is
    running the application (consuming and freeing skbufs))

    Here on one production server (dual-core AMD Opteron 285), I noticed this
    divide took 1.20% of CPU_CLK_UNHALTED events in the kernel. But Opterons
    are quite modern CPUs and the divide is much more expensive on older
    architectures:

    On a 200 MHz sparcv9 machine, the division takes 64 cycles instead of 1
    cycle for a multiply.

    Doing some math, we can use a reciprocal multiplication instead of a divide.

    If we want to compute V = (A / B) (A and B being u32 quantities)
    we can instead use:

    V = ((u64)A * RECIPROCAL(B)) >> 32;

    where RECIPROCAL(B) is precalculated to ((1LL << 32) + (B - 1)) / B
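
    As a self-contained illustration of this identity (plain userspace C, not
    the kernel's implementation; the names reciprocal() and recip_divide()
    are chosen here just for the example):

    #include <stdint.h>
    #include <assert.h>

    /* Precompute the rounded-up 32.32 fixed-point reciprocal of b (b != 0). */
    static uint32_t reciprocal(uint32_t b)
    {
            return (uint32_t)(((1ULL << 32) + (b - 1)) / b);
    }

    /* a / b computed as one multiply and one shift. */
    static uint32_t recip_divide(uint32_t a, uint32_t r)
    {
            return (uint32_t)(((uint64_t)a * r) >> 32);
    }

    int main(void)
    {
            uint32_t b = 192;       /* e.g. a slab object size */
            uint32_t r = reciprocal(b);

            /* Exact for the small dividends obj_to_index() deals with. */
            for (uint32_t a = 0; a < 1000000; a++)
                    assert(recip_divide(a, r) == a / b);
            return 0;
    }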

    Note:

    I wrote pure C code for clarity. gcc output for i386 is not optimal but
    acceptable:

    mull 0x14(%ebx)
    mov %edx,%eax // part of the >> 32
    xor %edx,%edx // useless
    mov %eax,(%esp) // could be avoided
    mov %edx,0x4(%esp) // useless
    mov (%esp),%ebx

    [akpm@osdl.org: small cleanups]
    Signed-off-by: Eric Dumazet
    Cc: Christoph Lameter
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • Elaborate the API for calling cpuset_zone_allowed(), so that users have to
    explicitly choose between the two variants:

    cpuset_zone_allowed_hardwall()
    cpuset_zone_allowed_softwall()

    Until now, whether or not you got the hardwall flavor depended solely on
    whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask
    argument.

    If you didn't specify __GFP_HARDWALL, you implicitly got the softwall
    version.

    Unfortunately, this meant that users would end up with the softwall version
    without thinking about it. Since only the softwall version might sleep,
    this led to bugs with possible sleeping in interrupt context on more than
    one occasion.

    The hardwall version requires that the current task's mems_allowed allows
    the node of the specified zone (or that you're in interrupt, that
    __GFP_THISNODE is set, or that you're on a one-cpuset system.)

    The softwall version, depending on the gfp_mask, might allow a node if it
    was allowed in the nearest enclosing cpuset marked mem_exclusive (which
    requires taking the cpuset lock 'callback_mutex' to evaluate.)

    This patch removes the cpuset_zone_allowed() call, and forces the caller to
    explicitly choose between the hardwall and the softwall case.

    If the caller wants the gfp_mask to determine this choice, they should (1)
    be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the
    cpuset_zone_allowed_softwall() routine.
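
    A hedged sketch of the two call styles (the callers' surrounding details
    are assumed):

    /* Hot path, possibly in interrupt context: never sleeps. */
    if (!cpuset_zone_allowed_hardwall(zone, gfp_mask))
            continue;

    /* Process context where sleeping is acceptable: behaves like the old
     * cpuset_zone_allowed(), honouring __GFP_HARDWALL in gfp_mask. */
    if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
            continue;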

    This adds another 100 or 200 bytes to the kernel text space, due to the few
    lines of nearly duplicate code at the top of both cpuset_zone_allowed_*
    routines. It should save a few instructions executed for the calls that
    turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to
    set (before the call) then check (within the call) the __GFP_HARDWALL flag.

    For the most critical call, from get_page_from_freelist(), the same
    instructions are executed as before -- the old cpuset_zone_allowed()
    routine it used to call is the same code as the
    cpuset_zone_allowed_softwall() routine that it calls now.

    Not a perfect win, but it seems worth it, to reduce the chance of hitting
    a sleeping-with-irqs-disabled complaint again.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • More cleanups for slab.h

    1. Remove tabs from weird locations as suggested by Pekka

    2. Drop the check for NUMA and SLAB_DEBUG from the fallback section
    as suggested by Pekka.

    3. Use static inline for the fallback defs, as also suggested by Pekka.

    4. Make kmem_ptr_valid take a const * argument.

    5. Separate the NUMA fallback definitions from the kmalloc_track fallback
    definitions.

    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This is a response to an earlier discussion on linux-mm about splitting
    slab.h components per allocator. Patch is against 2.6.19-git11. See
    http://marc.theaimsgroup.com/?l=linux-mm&m=116469577431008&w=2

    This patch cleans up the slab header definitions. We define the common
    functions of slob and slab in slab.h and put the extra definitions needed
    for slab's kmalloc implementations in <linux/slab_def.h>. In order to get
    a greater set of common functions we add several empty functions to slob.c
    and also rename slob's kmalloc to __kmalloc.

    Slob does not need any special definitions since we introduce a fallback
    case. If there is no need for a slab implementation to provide its own
    kmalloc mess^H^H^Hacros then we simply fall back to __kmalloc functions.
    That is sufficient for SLOB.

    Sort the functions in slab.h according to their functionality: first the
    functions operating on struct kmem_cache *, then the kmalloc-related
    functions, followed by special debug and fallback definitions.

    Also redo a lot of comments.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Fallback_alloc() does not do the check for GFP_WAIT that cache_grow()
    does. Thus interrupts are disabled when we call kmem_getpages(), which
    results in the failure.

    Duplicate the handling of GFP_WAIT in cache_grow().
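
    That handling amounts to a fragment like this (a sketch mirroring
    cache_grow()'s pattern, surrounding code assumed):

    if (local_flags & __GFP_WAIT)
            local_irq_enable();     /* kmem_getpages() may now sleep */
    obj = kmem_getpages(cache, flags, -1);
    if (local_flags & __GFP_WAIT)
            local_irq_disable();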

    Signed-off-by: Christoph Lameter
    Cc: Jay Cliburn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

11 Dec, 2006

7 commits

  • This patch introduces users of the round_jiffies() function in the slab code.

    The slab code has a few "run every second" timers for background work; these
    are obviously not timing critical as long as they happen roughly at the right
    frequency.
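
    For example (a sketch, assuming slab's per-CPU reap work item named
    reap_work):

    /* Let the ~1 Hz cache reaper fire on the same jiffy as other rounded
     * timers, so wakeups can be batched. */
    schedule_delayed_work(&__get_cpu_var(reap_work),
                          round_jiffies_relative(HZ));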

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     
  • The only time it is safe to call aio_complete() is when the ->ki_retry
    function returns -EIOCBQUEUED to the AIO core. direct_io_worker() has
    historically done this by relying on its caller to translate positive return
    codes into -EIOCBQUEUED for the aio case. It did this by trying to keep
    conditionals in sync. direct_io_worker() knew when finished_one_bio() was
    going to call aio_complete(). It would reverse the test and wait and free the
    dio in the cases it thought that finished_one_bio() wasn't going to.

    Not surprisingly, it ended up getting it wrong. 'ret' could be a negative
    errno from the submission path but it failed to communicate this to
    finished_one_bio(). direct_io_worker() would return < 0, its callers
    wouldn't raise -EIOCBQUEUED, and aio_complete() would be called. In the
    future finished_one_bio()'s tests wouldn't reflect this and aio_complete()
    would be called for a second time, which can manifest as an oops.

    The previous cleanups have whittled the sync and async completion paths down
    to the point where we can collapse them and clearly reassert the invariant
    that we must only call aio_complete() after returning -EIOCBQUEUED.
    direct_io_worker() will only return -EIOCBQUEUED when it is not the last to
    drop the dio refcount and the aio bio completion path will only call
    aio_complete() when it is the last to drop the dio refcount.
    direct_io_worker() can ensure that it is the last to drop the reference count
    by waiting for bios to drain. It does this for sync ops, of course, for
    partial dio writes that must fall back to buffered, and for aio ops that
    saw errors during submission.
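
    The tail of direct_io_worker() then reduces to a sketch like this (an
    assumed form, not the exact fs/direct-io.c code):

    if (atomic_dec_and_test(&dio->refcount)) {
            /* We were last: complete synchronously and return the real
             * return code; the aio core will call aio_complete() for us. */
            ret = dio_complete(dio, offset, ret);
            kfree(dio);
    } else {
            /* Bio completion owns the iocb now; it alone may complete it. */
            BUG_ON(ret != -EIOCBQUEUED);
    }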

    This means that operations that end up waiting, even if they were issued as
    aio ops, will not call aio_complete() from dio. Instead we return the return
    code of the operation and let the aio core call aio_complete(). This is
    purposely done to fix a bug where AIO DIO file extensions would call
    aio_complete() before their callers have a chance to update i_size.

    Now that direct_io_worker() is explicitly returning -EIOCBQUEUED its callers
    no longer have to translate for it. XFS needs to be careful not to free
    resources that will be used during AIO completion if -EIOCBQUEUED is returned.
    We maintain the previous behaviour of trying to write fs metadata for O_SYNC
    aio+dio writes.

    Signed-off-by: Zach Brown
    Cc: Badari Pulavarty
    Cc: Suparna Bhattacharya
    Acked-by: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zach Brown
     
  • nfs's ->readpages uses read_cache_pages(). Wire it up there.

    [wfg@mail.ustc.edu.cn: account only successful nfs/fuse reads]
    Cc: Jay Lan
    Cc: Shailabh Nagar
    Cc: Balbir Singh
    Cc: Chris Sturtivant
    Cc: Tony Ernst
    Cc: Guillaume Thouvenin
    Cc: David Wright
    Signed-off-by: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Account for the number of byte writes which this process caused to not happen
    after all.
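
    A sketch of the hook (call site assumed; the helper is part of this
    series):

    /* A dirty page is being thrown away (e.g. on truncate): the writeout
     * we charged earlier will never happen, so credit it back. */
    if (TestClearPageDirty(page))
            task_io_account_cancelled_write(PAGE_CACHE_SIZE);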

    Cc: Jay Lan
    Cc: Shailabh Nagar
    Cc: Balbir Singh
    Cc: Chris Sturtivant
    Cc: Tony Ernst
    Cc: Guillaume Thouvenin
    Cc: David Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Accounting writes is fairly simple: whenever a process flips a page from clean
    to dirty, we accuse it of having caused a write to underlying storage of
    PAGE_CACHE_SIZE bytes.

    This may overestimate the amount of writing: the page-dirtying may cause only
    one buffer_head's worth of writeout. Fixing that is possible, but probably a
    bit messy and isn't obviously important.
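
    The accounting point is essentially (fragment; surrounding code in the
    set_page_dirty paths assumed):

    /* Page went from clean to dirty: charge the current task with one
     * page's worth of eventual writeout. */
    if (!TestSetPageDirty(page))
            task_io_account_write(PAGE_CACHE_SIZE);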

    Cc: Jay Lan
    Cc: Shailabh Nagar
    Cc: Balbir Singh
    Cc: Chris Sturtivant
    Cc: Tony Ernst
    Cc: Guillaume Thouvenin
    Cc: David Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Save a tabstop in __set_page_dirty_nobuffers() and __set_page_dirty_buffers()
    and a few other places. No functional changes.

    Cc: Jay Lan
    Cc: Shailabh Nagar
    Cc: Balbir Singh
    Cc: Chris Sturtivant
    Cc: Tony Ernst
    Cc: Guillaume Thouvenin
    Cc: David Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Ramiro Voicu hits the BUG_ON(!pte_none(*pte)) in zeromap_pte_range: kernel
    bugzilla 7645. Right: read_zero_pagealigned uses down_read of mmap_sem,
    but another thread's racing read of /dev/zero, or a normal fault, can
    easily set that pte again, in between zap_page_range and zeromap_page_range
    getting there. It's been wrong ever since 2.4.3.

    The simple fix is to use down_write instead, but that would serialize reads
    of /dev/zero more than at present: perhaps some app would be badly
    affected. So instead let zeromap_page_range return the error instead of
    BUG_ON, and read_zero_pagealigned break to the slower clear_user loop in
    that case - there's no need to optimize for it.

    Use -EEXIST for when a pte is found: BUG_ON in mmap_zero (the other user of
    zeromap_page_range), though it really isn't interesting there. And since
    mmap_zero wants -EAGAIN for out-of-memory, the zeromaps better return that
    than -ENOMEM.
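
    The read side then becomes, sketched (details assumed):

    /* read_zero_pagealigned(): on any zeromap failure, give up on the
     * fast path and let the caller fall back to the clear_user() loop. */
    zap_page_range(vma, addr, size, NULL);
    if (zeromap_page_range(vma, addr, size, vma->vm_page_prot))
            break;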

    Signed-off-by: Hugh Dickins
    Cc: Ramiro Voicu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

09 Dec, 2006

7 commits

  • Assign defaults most likely to please a new user:
    1) generate some logging output (verbose=2)
    2) avoid injecting failures likely to lock up the UI
       (ignore_gfp_wait=1, ignore_gfp_highmem=1)

    Signed-off-by: Don Mullis
    Cc: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Don Mullis
     
  • This patch provides fault-injection capability for alloc_pages()

    Boot option:

    fail_page_alloc=<interval>,<probability>,<space>,<times>

    <interval> -- specifies the interval of failures.

    <probability> -- specifies how often it should fail in percent.

    <space> -- specifies the size of free space where memory can be
    allocated safely in pages.

    <times> -- specifies how many times failures may happen at most.

    Debugfs:

    /debug/fail_page_alloc/interval
    /debug/fail_page_alloc/probability
    /debug/fail_page_alloc/space
    /debug/fail_page_alloc/times
    /debug/fail_page_alloc/ignore-gfp-highmem
    /debug/fail_page_alloc/ignore-gfp-wait

    Example:

    fail_page_alloc=10,100,0,-1

    The page allocation (alloc_pages(), ...) fails once per 10 times.

    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • This patch provides fault-injection capability for kmalloc.

    Boot option:

    failslab=<interval>,<probability>,<space>,<times>

    <interval> -- specifies the interval of failures.

    <probability> -- specifies how often it should fail in percent.

    <space> -- specifies the size of free space where memory can be
    allocated safely in bytes.

    <times> -- specifies how many times failures may happen at most.

    Debugfs:

    /debug/failslab/interval
    /debug/failslab/probability
    /debug/failslab/space
    /debug/failslab/times
    /debug/failslab/ignore-gfp-highmem
    /debug/failslab/ignore-gfp-wait

    Example:

    failslab=10,100,0,-1

    Slab allocation (kmalloc(), kmem_cache_alloc(), ...) fails once per 10
    times.

    Cc: Pekka Enberg
    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • This facility provides three entry points:

    ilog2()      Log base 2 of unsigned long
    ilog2_u32()  Log base 2 of u32
    ilog2_u64()  Log base 2 of u64

    These facilities can either be used inside functions on dynamic data:

    int do_something(long q)
    {
            ...;
            y = ilog2(x);
            ...;
    }

    Or can be used to statically initialise global variables with constant values:

    unsigned n = ilog2(27);

    When performing static initialisation, the compiler will report "error:
    initializer element is not constant" if asked to take a log of zero or of
    something not reducible to a constant. They treat negative numbers as
    unsigned.

    When not dealing with a constant, they fall back to using fls() which permits
    them to use arch-specific log calculation instructions - such as BSR on
    x86/x86_64 or SCAN on FRV - if available.
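
    The non-constant path is essentially (a sketch, relying on the kernel's
    fls() convention that fls(1) == 1):

    static inline int ilog2_u32(u32 n)
    {
            return fls(n) - 1;      /* undefined for n == 0, like log2(0) */
    }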

    [akpm@osdl.org: MMC fix]
    Signed-off-by: David Howells
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Herbert Xu
    Cc: David Howells
    Cc: Wojtek Kaniewski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Signed-off-by: Josef Sipek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef Sipek
     
  • Change all the uses of f_{dentry,vfsmnt} to f_path.{dentry,mnt} in linux/mm/.

    Signed-off-by: Josef "Jeff" Sipek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef "Jeff" Sipek
     
  • fallback_alloc() could end up calling cpuset_zone_allowed() with interrupts
    disabled (by code in kmem_cache_alloc_node()), but without __GFP_HARDWALL
    set, leading to a possible call of a sleeping function with interrupts
    disabled.

    This results in the BUG report:

    BUG: sleeping function called from invalid context at kernel/cpuset.c:1520
    in_atomic():0, irqs_disabled():1

    Thanks to Paul Menage for catching this one.

    Signed-off-by: Paul Jackson
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

08 Dec, 2006

18 commits

  • - move some file_operations structs into the .rodata section

    - move static strings from policy_types[] array into the .rodata section

    - fix generic seq_operations usages, so that those structs may be defined
    as "const" as well

    [akpm@osdl.org: couple of fixes]
    Signed-off-by: Helge Deller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Helge Deller
     
  • In time for 2.6.20, we can get rid of this junk.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Burman Yan
     
  • Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • There was lots of #ifdef noise in the kernel due to hotcpu_notifier(fn,
    prio) not correctly marking 'fn' as used in the !HOTPLUG_CPU case, and thus
    generating compiler warnings of unused symbols, hence forcing people to add
    #ifdefs.

    The compiler can skip truly unused functions just fine:

    text data bss dec hex filename
    1624412 728710 3674856 6027978 5bfaca vmlinux.before
    1624412 728710 3674856 6027978 5bfaca vmlinux.after

    [akpm@osdl.org: topology.c fix]
    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • It has no users and it's doubtful that we'll need it again.

    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Use put_pages_list() instead of opencoding it.

    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
     
  • Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Move process freezing functions from include/linux/sched.h to freezer.h, so
    that modifications to the freezer or the kernel configuration don't require
    recompiling just about everything.

    [akpm@osdl.org: fix ueagle driver]
    Signed-off-by: Nigel Cunningham
    Cc: "Rafael J. Wysocki"
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nigel Cunningham
     
  • Currently swsusp saves the contents of highmem pages by copying them to the
    normal zone which is quite inefficient (eg. it requires two normal pages
    to be used for saving one highmem page). This may be improved by using
    highmem for saving the contents of saveable highmem pages.

    Namely, during the suspend phase of the suspend-resume cycle we try to
    allocate as many free highmem pages as there are saveable highmem pages.
    If there are not enough highmem image pages to store the contents of all of
    the saveable highmem pages, some of them will be stored in the "normal"
    memory. Next, we allocate as many free "normal" pages as needed to store
    the (remaining) image data. We use a memory bitmap to mark the allocated
    free pages (ie. highmem as well as "normal" image pages).

    Now, we use another memory bitmap to mark all of the saveable pages
    (highmem as well as "normal") and the contents of the saveable pages are
    copied into the image pages. Then, the second bitmap is used to save the
    pfns corresponding to the saveable pages and the first one is used to save
    their data.

    During the resume phase the pfns of the pages that were saveable during the
    suspend are loaded from the image and used to mark the "unsafe" page
    frames. Next, we try to allocate as many free highmem page frames as are
    needed to load all of the image data that had been in highmem before the
    suspend, and we allocate enough free "normal" page frames that the total
    number of allocated free pages (highmem and "normal") equals the size of
    the image. While doing this we have to make sure that there will be some extra
    free "normal" and "safe" page frames for two lists of PBEs constructed
    later.

    Now, the image data are loaded, if possible, into their "original" page
    frames. The image data that cannot be written into their "original" page
    frames are loaded into "safe" page frames and their "original" kernel
    virtual addresses, as well as the addresses of the "safe" pages containing
    their copies, are stored in one of two lists of PBEs.

    One list of PBEs is for the copies of "normal" suspend pages (ie. "normal"
    pages that were saveable during the suspend) and it is used in the same way
    as previously (ie. by the architecture-dependent parts of swsusp). The
    other list of PBEs is for the copies of highmem suspend pages. The pages
    in this list are restored (in a reversible way) right before the
    arch-dependent code is called.

    Signed-off-by: Rafael J. Wysocki
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Make swsusp use block device offsets instead of swap offsets to identify swap
    locations and make it use the same code paths for writing as well as for
    reading data.

    This allows us to use the same code for handling swap files and swap
    partitions and to simplify the code, eg. by dropping rw_swap_page_sync().

    Signed-off-by: Rafael J. Wysocki
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • The Linux kernel handles swap files almost in the same way as it handles swap
    partitions and there are only two differences between these two types of swap
    areas:

    (1) swap files need not be contiguous,

    (2) the header of a swap file is not in the first block of the partition
    that holds it. From the swsusp's point of view (1) is not a problem,
    because it is already taken care of by the swap-handling code, but (2) has
    to be taken into consideration.

    In principle the location of a swap file's header may be determined with
    the help of the appropriate filesystem driver. Unfortunately, however, it
    requires
    the filesystem holding the swap file to be mounted, and if this filesystem is
    journaled, it cannot be mounted during a resume from disk. For this reason we
    need some other means by which swap areas can be identified.

    For example, to identify a swap area we can use the partition that holds the
    area and the offset from the beginning of this partition at which the swap
    header is located.

    The following patch allows swsusp to identify swap areas this way. It changes
    swap_type_of() so that it takes an additional argument representing an offset
    of the swap header within the partition represented by its first argument.
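
    The resulting prototype is roughly (sketch):

    int swap_type_of(dev_t device, sector_t offset);

    Presumably a swap partition keeps passing offset 0, while a swap file
    passes the block offset of its header within the partition that holds it.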

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Make radix tree lookups safe to be performed without locks. Readers are
    protected against nodes being deleted by using RCU based freeing. Readers
    are protected against new node insertion by using memory barriers to ensure
    the node itself will be properly written before it is visible in the radix
    tree.

    Each radix tree node keeps a record of its height (above leaf nodes).
    This height does not change after insertion -- when the radix tree is
    extended, higher nodes are only inserted in the top. So a lookup can take
    the pointer to what is *now* the root node, and traverse down it even if
    the tree is concurrently extended and this node becomes a subtree of a new
    root.

    "Direct" pointers (tree height of 0, where root->rnode points directly to
    the data item) are handled by using the low bit of the pointer to signal
    whether rnode is a direct pointer or a pointer to a radix tree node.

    When a reader wants to traverse the next branch, they will take a copy of
    the pointer. This pointer will be either NULL (and the branch is empty) or
    non-NULL (and will point to a valid node).
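
    A sketch of the low-bit tagging (identifier names assumed for
    illustration):

    #define RADIX_TREE_DIRECT_PTR   1UL    /* low bit marks a data item */

    static inline void *radix_tree_direct_to_ptr(void *ptr)
    {
            return (void *)((unsigned long)ptr & ~RADIX_TREE_DIRECT_PTR);
    }

    /* Reader: snapshot the root once, then work only on the snapshot. */
    node = rcu_dereference(root->rnode);
    if ((unsigned long)node & RADIX_TREE_DIRECT_PTR)
            return index ? NULL : radix_tree_direct_to_ptr(node);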

    [akpm@osdl.org: cleanups]
    [Lee.Schermerhorn@hp.com: bugfixes, comments, simplifications]
    [clameter@sgi.com: build fix]
    Signed-off-by: Nick Piggin
    Cc: "Paul E. McKenney"
    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Currently we use the lru head link of the second page of a compound page
    to hold its destructor. This was ok when it was purely an internal
    implementation detail. However, hugetlbfs overrides this destructor,
    violating the layering. Abstract this out as explicit calls, and also
    introduce a type for the callback function, allowing them to be type
    checked. For each callback we pre-declare the function, causing a type
    error on definition rather than on use elsewhere.
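
    A sketch of the resulting abstraction (storage in the second page as
    described above; exact helper names assumed):

    typedef void compound_page_dtor(struct page *);

    static inline void set_compound_page_dtor(struct page *page,
                                              compound_page_dtor *dtor)
    {
            page[1].lru.next = (void *)dtor;    /* second page's lru link */
    }

    static inline compound_page_dtor *get_compound_page_dtor(struct page *page)
    {
            return (compound_page_dtor *)page[1].lru.next;
    }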

    [akpm@osdl.org: cleanups]
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • Currently we simply attempt to allocate from all allowed nodes using
    GFP_THISNODE. However, GFP_THISNODE does not do reclaim (it won't do any at
    all if the recent GFP_THISNODE patch is accepted). If we truly run out of
    memory in the whole system then fallback_alloc may return NULL although
    memory may still be available if we would perform more thorough reclaim.

    This patch changes fallback_alloc() so that we first only inspect all the
    per node queues for available slabs. If we find any then we allocate from
    those. This avoids slab fragmentation by first getting rid of all partial
    allocated slabs on every node before allocating new memory.
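
    The first pass can be sketched like this (fragment; the zonelist
    iteration details are assumed):

    for (z = zonelist->zones; *z && !obj; z++) {
            int nid = zone_to_nid(*z);

            /* take only objects that already exist on this node */
            if (cache->nodelists[nid] &&
                cache->nodelists[nid]->free_objects)
                    obj = ____cache_alloc_node(cache,
                                    flags | GFP_THISNODE, nid);
    }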

    If we cannot satisfy the allocation from any per node queue then we extend
    a slab. We now call into the page allocator without specifying
    GFP_THISNODE. The page allocator will then implement its own fallback (in
    the given cpuset context), perform necessary reclaim (again considering not
    a single node but the whole set of allowed nodes) and then return pages for
    a new slab.

    We identify from which node the pages were allocated and then insert the
    pages into the corresponding per node structure. In order to do so we need
    to modify cache_grow() to take a parameter that specifies the new slab.
    kmem_getpages() can no longer set the GFP_THISNODE flag since we need to be
    able to use kmem_getpages() to allocate from an arbitrary node. GFP_THISNODE
    needs to be specified when calling cache_grow().

    One key advantage is that the decision from which node to allocate new
    memory is removed from slab fallback processing. The patch allows us to
    go back to using the page allocator's fallback/reclaim logic.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The intent of GFP_THISNODE is to make sure that an allocation occurs on a
    particular node. If this is not possible then NULL needs to be returned so
    that the caller can choose what to do next on its own (the slab allocator
    depends on that).

    However, GFP_THISNODE currently triggers reclaim before returning a failure
    (GFP_THISNODE means __GFP_NORETRY is set). If we have over-allocated a
    node then we will currently do some reclaim before returning NULL. The caller
    may want memory from other nodes before reclaim should be triggered. (If
    the caller wants reclaim then he can directly use __GFP_THISNODE instead).

    There is no flag to avoid reclaim in the page allocator and adding yet
    another GFP_xx flag would be difficult given that we are out of available
    flags.

    So just compare and see if all bits for GFP_THISNODE (__GFP_THISNODE,
    __GFP_NORETRY and __GFP_NOWARN) are set. If so then we return NULL before
    waking up kswapd.
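
    In code this is just an early exit (a sketch of the test described
    above):

    /* GFP_THISNODE == __GFP_THISNODE | __GFP_NORETRY | __GFP_NOWARN */
    if ((gfp_mask & GFP_THISNODE) == GFP_THISNODE)
            goto nopage;    /* fail fast: no kswapd wakeup, no reclaim */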

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This addresses two issues:

    1. kmalloc_node() may intermittently return NULL if we are allocating
    from the current node and are unable to obtain memory for the current
    node from the page allocator. This is because we call ___cache_alloc()
    if nodeid == numa_node_id() and ____cache_alloc() is not able to fall
    back to other nodes.

    This was introduced in the 2.6.19 development cycle.
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • SLAB_DMA is an alias of GFP_DMA. This is the last one so we
    remove the leftover comment too.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter