23 Dec, 2006

1 commit


22 Dec, 2006

2 commits

  • truncate presently invalidates the dirty page's buffer_heads then shoots down
    the page. But try_to_free_buffers() will now bale out because the page is
    dirty.

    Net effect: the LRU gets filled with dirty pages which have invalidated
    buffer_heads attached. They have no ->mapping and hence cannot be cleaned.
    The machine leaks memory at an enormous rate.

    Fix this by cleaning the page before running try_to_free_buffers(), so
    try_to_free_buffers() can do its work.

    Also, remember to do dirty-page-accounting in cancel_dirty_page() so the
    machine won't wedge up trying to write non-existent dirty pages.

    Probably still wrong, but now less so.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • They were horribly easy to mis-use because of their tempting naming, and
    they also did way more than any users of them generally wanted them to
    do.

    A dirty page can become clean under two circumstances:

    (a) when we write it out. We have "clear_page_dirty_for_io()" for
    this, and that function remains unchanged.

    In the "for IO" case it is not sufficient to just clear the dirty
    bit, you also have to mark the page as being under writeback etc.

    (b) when we actually remove a page due to it becoming inaccessible to
    users, notably because it was truncate()'d away or the file (or
    metadata) no longer exists, and we thus want to cancel any
    outstanding dirty state.

    For the (b) case, we now introduce "cancel_dirty_page()", which only
    touches the page state itself, and verifies that the page is not mapped
    (since cancelling writes on a mapped page would be actively wrong as it
    is still accessible to users).

    Some filesystems need to be fixed up for this: CIFS, FUSE, JFS,
    ReiserFS, XFS all use the old confusing functions, and will be fixed
    separately in subsequent commits (with some of them just removing the
    offending logic, and others using clear_page_dirty_for_io()).

    This was confirmed by Martin Michlmayr to fix the apt database
    corruption on ARM.

    Cc: Martin Michlmayr
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Cc: Arjan van de Ven
    Cc: Andrei Popa
    Cc: Andrew Morton
    Cc: Dave Kleikamp
    Cc: Gordon Farquharson
    Cc: Martin Schwidefsky
    Cc: Trond Myklebust
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

18 Dec, 2006

1 commit


17 Dec, 2006

2 commits

  • Hugh Dickins correctly points out that mincore() is actually _supposed_
    to fail on an unmapped hole in the user address space, rather than
    return valid ("empty") information about the hole. This just simplifies
    the problem further (I had been misled by our previous confusing and
    complicated way of doing mincore()).

    Also, in the unlikely situation that we can't allocate a temporary
    kernel buffer, we should actually return EAGAIN, not ENOMEM, to keep the
    "unmapped hole" and "allocation failure" error cases separate.

    Finally, add a comment about our stupid historical lack of support for
    anonymous mappings. I'll fix that if somebody reminds me after 2.6.20
    is out.

    Acked-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Linus Torvalds
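
A small user-space sketch of the behaviour described above: mincore() succeeds on a mapped page and fails with ENOMEM on an unmapped hole. It assumes a current Linux with glibc; the helper names probe_one_page() and map_page_followed_by_hole() are invented for this illustration and are not part of the patch.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes mincore() and MAP_ANONYMOUS in glibc */
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Query residency for the single page starting at addr.
 * Returns 0 on success, or the errno value reported by mincore().
 */
static int probe_one_page(void *addr, unsigned char *vec)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    if (mincore(addr, page, vec) == 0)
        return 0;
    return errno;
}

/*
 * Map two anonymous pages, then unmap the second one to create a hole.
 * Returns the base address (first page still mapped), or NULL on failure.
 */
static char *map_page_followed_by_hole(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *base = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (base == MAP_FAILED)
        return NULL;
    base[0] = 1; /* touch the first page so it is populated */
    if (munmap(base + page, page) != 0)
        return NULL;
    return base;
}
```

Calling probe_one_page() on the returned base should give 0, while calling it on base plus one page should give ENOMEM, which is the "unmapped hole" case the message keeps separate from EAGAIN.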
     
  • Doug Chapman noticed that mincore() will do a "copy_to_user()" of the
    result while holding the mmap semaphore for reading, which is a big
    no-no. While a recursive read-lock on a semaphore in the case of a page
    fault happens to work, we don't actually allow them due to deadlock
    scenarios with writers due to fairness issues.

    Doug and Marcel sent in a patch to fix it, but I decided to just rewrite
    the mess instead - not just fixing the locking problem, but making the
    code smaller and (imho) much easier to understand.

    Cc: Doug Chapman
    Cc: Marcel Holtmann
    Cc: Hugh Dickins
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

14 Dec, 2006

6 commits

  • To allow a more effective copy_user_highpage() on certain architectures,
    a vma argument is added to the function and cow_user_page() allowing
    the implementation of these functions to check for the VM_EXEC bit.

    The main part of this patch was originally written by Ralf Baechle;
    Atsushi Nemoto did the debugging.

    Signed-off-by: Atsushi Nemoto
    Signed-off-by: Ralf Baechle
    Signed-off-by: Linus Torvalds

    Atsushi Nemoto
     
  • When some objects are allocated by one CPU but freed by another CPU we can
    consume a lot of cycles doing divides in obj_to_index().

    (Typical load on a dual processor machine where network interrupts are
    handled by one particular CPU (allocating skbufs), and the other CPU is
    running the application (consuming and freeing skbufs))

    Here on one production server (dual-core AMD Opteron 285), I noticed this
    divide took 1.20 % of CPU_CLK_UNHALTED events in kernel. But Opterons are
    quite modern CPUs, and the divide is much more expensive on older
    architectures:

    On a 200 MHz sparcv9 machine, the division takes 64 cycles instead of 1
    cycle for a multiply.

    Doing some math, we can use a reciprocal multiplication instead of a divide.

    If we want to compute V = (A / B) (A and B being u32 quantities)
    we can instead use :

    V = ((u64)A * RECIPROCAL(B)) >> 32 ;

    where RECIPROCAL(B) is precalculated to ((1LL << 32) + (B - 1)) / B

    Note :

    I wrote pure C code for clarity. gcc output for i386 is not optimal but
    acceptable :

    mull 0x14(%ebx)
    mov %edx,%eax // part of the >> 32
    xor %edx,%edx // useless
    mov %eax,(%esp) // could be avoided
    mov %edx,0x4(%esp) // useless
    mov (%esp),%ebx

    [akpm@osdl.org: small cleanups]
    Signed-off-by: Eric Dumazet
    Cc: Christoph Lameter
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
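
A minimal user-space sketch of the reciprocal trick from the message above, not the kernel code itself. The function names and the choice to hold the reciprocal in 64 bits (so that B == 1 does not overflow) are assumptions made for this illustration.

```c
#include <stdint.h>

/*
 * Precompute RECIPROCAL(b) = ((1 << 32) + (b - 1)) / b, i.e. 2^32 / b
 * rounded up. b must be nonzero. Kept in 64 bits because the value for
 * b == 1 is exactly 2^32.
 */
static uint64_t reciprocal_value(uint32_t b)
{
    return ((1ULL << 32) + (b - 1)) / b;
}

/*
 * Approximate a / b as (a * RECIPROCAL(b)) >> 32.
 * One multiply and one shift replace the divide.
 */
static uint32_t reciprocal_divide(uint32_t a, uint64_t recip)
{
    return (uint32_t)(((uint64_t)a * recip) >> 32);
}
```

Because the reciprocal is rounded up, the result equals a / b as long as the product a * b stays below 2^32, which fits the case of a small offset divided by an object size. For larger operands the result can be one too large, so the trick is not a general replacement for division.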
     
  • Elaborate the API for calling cpuset_zone_allowed(), so that users have to
    explicitly choose between the two variants:

    cpuset_zone_allowed_hardwall()
    cpuset_zone_allowed_softwall()

    Until now, whether or not you got the hardwall flavor depended solely on
    whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask
    argument.

    If you didn't specify __GFP_HARDWALL, you implicitly got the softwall
    version.

    Unfortunately, this meant that users would end up with the softwall version
    without thinking about it. Since only the softwall version might sleep,
    this led to bugs with possible sleeping in interrupt context on more than
    one occasion.

    The hardwall version requires that the current task's mems_allowed allows
    the node of the specified zone (or that you're in interrupt or that
    __GFP_THISNODE is set or that you're on a one cpuset system.)

    The softwall version, depending on the gfp_mask, might allow a node if it
    was allowed in the nearest enclosing cpuset marked mem_exclusive (which
    requires taking the cpuset lock 'callback_mutex' to evaluate.)

    This patch removes the cpuset_zone_allowed() call, and forces the caller to
    explicitly choose between the hardwall and the softwall case.

    If the caller wants the gfp_mask to determine this choice, they should (1)
    be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the
    cpuset_zone_allowed_softwall() routine.

    This adds another 100 or 200 bytes to the kernel text space, due to the few
    lines of nearly duplicate code at the top of both cpuset_zone_allowed_*
    routines. It should save a few instructions executed for the calls that
    turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to
    set (before the call) then check (within the call) the __GFP_HARDWALL flag.

    For the most critical call, from get_page_from_freelist(), the same
    instructions are executed as before -- the old cpuset_zone_allowed()
    routine it used to call is the same code as the
    cpuset_zone_allowed_softwall() routine that it calls now.

    Not a perfect win, but seems worth it, to reduce the chance of hitting a
    sleeping-with-irqs-off complaint again.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • More cleanups for slab.h

    1. Remove tabs from weird locations as suggested by Pekka

    2. Drop the check for NUMA and SLAB_DEBUG from the fallback section
    as suggested by Pekka.

    3. Use static inline for the fallback defs as also suggested by Pekka.

    4. Make kmem_ptr_valid take a const * argument.

    5. Separate the NUMA fallback definitions from the kmalloc_track fallback
    definitions.

    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This is a response to an earlier discussion on linux-mm about splitting
    slab.h components per allocator. Patch is against 2.6.19-git11. See
    http://marc.theaimsgroup.com/?l=linux-mm&m=116469577431008&w=2

    This patch cleans up the slab header definitions. We define the common
    functions of slob and slab in slab.h and put the extra definitions needed
    for slab's kmalloc implementations in . In order to get
    a greater set of common functions we add several empty functions to slob.c
    and also rename slob's kmalloc to __kmalloc.

    Slob does not need any special definitions since we introduce a fallback
    case. If there is no need for a slab implementation to provide its own
    kmalloc mess^H^H^Hacros then we simply fall back to __kmalloc functions.
    That is sufficient for SLOB.

    Sort the functions in slab.h according to their functionality. First the
    functions operating on struct kmem_cache * then the kmalloc related
    functions followed by special debug and fallback definitions.

    Also redo a lot of comments.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • fallback_alloc() does not do the check for GFP_WAIT as done in
    cache_grow(). Thus interrupts are disabled when we call kmem_getpages()
    which results in the failure.

    Duplicate the handling of GFP_WAIT in cache_grow().

    Signed-off-by: Christoph Lameter
    Cc: Jay Cliburn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

11 Dec, 2006

7 commits

  • This patch introduces users of the round_jiffies() function in the slab code.

    The slab code has a few "run every second" timers for background work; these
    are obviously not timing critical as long as they happen roughly at the right
    frequency.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     
  • The only time it is safe to call aio_complete() is when the ->ki_retry
    function returns -EIOCBQUEUED to the AIO core. direct_io_worker() has
    historically done this by relying on its caller to translate positive return
    codes into -EIOCBQUEUED for the aio case. It did this by trying to keep
    conditionals in sync. direct_io_worker() knew when finished_one_bio() was
    going to call aio_complete(). It would reverse the test and wait and free the
    dio in the cases it thought that finished_one_bio() wasn't going to.

    Not surprisingly, it ended up getting it wrong. 'ret' could be a negative
    errno from the submission path but it failed to communicate this to
    finished_one_bio(). direct_io_worker() would return < 0, its callers
    wouldn't raise -EIOCBQUEUED, and aio_complete() would be called. In the
    future finished_one_bio()'s tests wouldn't reflect this and aio_complete()
    would be called for a second time which can manifest as an oops.

    The previous cleanups have whittled the sync and async completion paths down
    to the point where we can collapse them and clearly reassert the invariant
    that we must only call aio_complete() after returning -EIOCBQUEUED.
    direct_io_worker() will only return -EIOCBQUEUED when it is not the last to
    drop the dio refcount and the aio bio completion path will only call
    aio_complete() when it is the last to drop the dio refcount.
    direct_io_worker() can ensure that it is the last to drop the reference count
    by waiting for bios to drain. It does this for sync ops, of course, and for
    partial dio writes that must fall back to buffered and for aio ops that saw
    errors during submission.

    This means that operations that end up waiting, even if they were issued as
    aio ops, will not call aio_complete() from dio. Instead we return the return
    code of the operation and let the aio core call aio_complete(). This is
    purposely done to fix a bug where AIO DIO file extensions would call
    aio_complete() before their callers have a chance to update i_size.

    Now that direct_io_worker() is explicitly returning -EIOCBQUEUED its callers
    no longer have to translate for it. XFS needs to be careful not to free
    resources that will be used during AIO completion if -EIOCBQUEUED is returned.
    We maintain the previous behaviour of trying to write fs metadata for O_SYNC
    aio+dio writes.

    Signed-off-by: Zach Brown
    Cc: Badari Pulavarty
    Cc: Suparna Bhattacharya
    Acked-by: Jeff Moyer
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zach Brown
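
The invariant above (whoever drops the last dio reference owns completion) can be sketched as a single-threaded toy model. Everything here is hypothetical: the struct, the function names, and the numeric value used for EIOCBQUEUED are assumptions for illustration, and real code would need an atomic reference count.

```c
#include <stdbool.h>

#define EIOCBQUEUED_MODEL 529 /* value assumed for this sketch */

struct dio_model {
    int refcount;        /* one for the submitter plus one per in-flight bio */
    bool is_async;       /* issued as an aio operation */
    long result;         /* return code of the finished operation */
    int aio_completions; /* how often the modelled aio_complete() ran */
};

/* Drop one reference; true means this caller dropped the last one. */
static bool dio_put_model(struct dio_model *dio)
{
    return --dio->refcount == 0;
}

/* Bio completion path: complete the aio only when last to drop. */
static void bio_end_io_model(struct dio_model *dio)
{
    if (dio_put_model(dio) && dio->is_async)
        dio->aio_completions++;
}

/*
 * Submission path: if the submitter is last to drop, it returns the
 * result itself and nobody calls aio_complete() from dio. Otherwise it
 * reports that the operation is queued.
 */
static long direct_io_worker_model(struct dio_model *dio)
{
    if (dio_put_model(dio))
        return dio->result;
    return -EIOCBQUEUED_MODEL;
}
```

With one bio in flight, either the submitter drops first (it returns -EIOCBQUEUED_MODEL and the bio path completes the aio exactly once), or the bio finishes first (no completion from the bio path, and the submitter returns the result). In neither ordering do both happen.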
     
  • nfs's ->readpages uses read_cache_pages(). Wire it up there.

    [wfg@mail.ustc.edu.cn: account only successful nfs/fuse reads]
    Cc: Jay Lan
    Cc: Shailabh Nagar
    Cc: Balbir Singh
    Cc: Chris Sturtivant
    Cc: Tony Ernst
    Cc: Guillaume Thouvenin
    Cc: David Wright
    Signed-off-by: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Account for the number of byte writes which this process caused to not happen
    after all.

    Cc: Jay Lan
    Cc: Shailabh Nagar
    Cc: Balbir Singh
    Cc: Chris Sturtivant
    Cc: Tony Ernst
    Cc: Guillaume Thouvenin
    Cc: David Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Accounting writes is fairly simple: whenever a process flips a page from clean
    to dirty, we accuse it of having caused a write to underlying storage of
    PAGE_CACHE_SIZE bytes.

    This may overestimate the amount of writing: the page-dirtying may cause only
    one buffer_head's worth of writeout. Fixing that is possible, but probably a
    bit messy and isn't obviously important.

    Cc: Jay Lan
    Cc: Shailabh Nagar
    Cc: Balbir Singh
    Cc: Chris Sturtivant
    Cc: Tony Ernst
    Cc: Guillaume Thouvenin
    Cc: David Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
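
A toy model of the accounting rule above, together with the cancelled-write counter from the neighbouring entry. The struct and function names and the 4096-byte page size are assumptions for illustration only; they are not the kernel's definitions.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_CACHE_SIZE_MODEL 4096u /* assumed page size for the sketch */

struct task_io_model {
    uint64_t write_bytes;           /* bytes this task is charged for */
    uint64_t cancelled_write_bytes; /* charged writes that never happened */
};

struct io_page_model {
    bool dirty;
};

/* Charge the task only when the page goes from clean to dirty. */
static void set_page_dirty_model(struct task_io_model *task,
                                 struct io_page_model *page)
{
    if (!page->dirty) {
        page->dirty = true;
        task->write_bytes += PAGE_CACHE_SIZE_MODEL;
    }
}

/* Truncating a still-dirty page means the charged write will not occur. */
static void truncate_page_model(struct task_io_model *task,
                                struct io_page_model *page)
{
    if (page->dirty) {
        page->dirty = false;
        task->cancelled_write_bytes += PAGE_CACHE_SIZE_MODEL;
    }
}
```

Dirtying the same page twice charges one page, not two, which is also why the figure can overestimate: the whole page is charged even if only one buffer_head's worth is eventually written.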
     
  • Save a tabstop in __set_page_dirty_nobuffers() and __set_page_dirty_buffers()
    and a few other places. No functional changes.

    Cc: Jay Lan
    Cc: Shailabh Nagar
    Cc: Balbir Singh
    Cc: Chris Sturtivant
    Cc: Tony Ernst
    Cc: Guillaume Thouvenin
    Cc: David Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Ramiro Voicu hits the BUG_ON(!pte_none(*pte)) in zeromap_pte_range: kernel
    bugzilla 7645. Right: read_zero_pagealigned uses down_read of mmap_sem,
    but another thread's racing read of /dev/zero, or a normal fault, can
    easily set that pte again, in between zap_page_range and zeromap_page_range
    getting there. It's been wrong ever since 2.4.3.

    The simple fix is to use down_write instead, but that would serialize reads
    of /dev/zero more than at present: perhaps some app would be badly
    affected. So instead let zeromap_page_range return the error instead of
    BUG_ON, and read_zero_pagealigned break to the slower clear_user loop in
    that case - there's no need to optimize for it.

    Use -EEXIST for when a pte is found: BUG_ON in mmap_zero (the other user of
    zeromap_page_range), though it really isn't interesting there. And since
    mmap_zero wants -EAGAIN for out-of-memory, the zeromaps better return that
    than -ENOMEM.

    Signed-off-by: Hugh Dickins
    Cc: Ramiro Voicu
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

09 Dec, 2006

7 commits

  • Assign defaults most likely to please a new user:
    1) generate some logging output
    (verbose=2)
    2) avoid injecting failures likely to lock up UI
    (ignore_gfp_wait=1, ignore_gfp_highmem=1)

    Signed-off-by: Don Mullis
    Cc: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Don Mullis
     
  • This patch provides fault-injection capability for alloc_pages()

    Boot option:

    fail_page_alloc=<interval>,<probability>,<space>,<times>

    <interval> -- specifies the interval of failures.

    <probability> -- specifies how often it should fail in percent.

    <space> -- specifies the size of free space where memory can be
    allocated safely in pages.

    <times> -- specifies how many times failures may happen at most.

    Debugfs:

    /debug/fail_page_alloc/interval
    /debug/fail_page_alloc/probability
    /debug/fail_page_alloc/space
    /debug/fail_page_alloc/times
    /debug/fail_page_alloc/ignore-gfp-highmem
    /debug/fail_page_alloc/ignore-gfp-wait

    Example:

    fail_page_alloc=10,100,0,-1

    The page allocation (alloc_pages(), ...) fails once per 10 times.

    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
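
To make the four boot parameters concrete, here is a hedged user-space model of how interval, probability, space and times could combine into a fail/no-fail decision. The struct, the function name and the order of the checks are assumptions for illustration, not the kernel's implementation.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model of the boot parameters; not the kernel's struct. */
struct fault_attr_model {
    unsigned long interval;    /* consider failing only every Nth call */
    unsigned long probability; /* chance of failing, in percent */
    long space;                /* safe budget to use up before failing */
    long times;                /* max failures; -1 means unlimited */
    unsigned long count;       /* calls seen once the budget is spent */
};

static bool should_fail_model(struct fault_attr_model *attr, long size)
{
    if (attr->times == 0)
        return false;                 /* failure quota exhausted */
    if (attr->space > size) {
        attr->space -= size;          /* still inside the safe budget */
        return false;
    }
    attr->count++;
    if (attr->interval > 1 && attr->count % attr->interval)
        return false;                 /* not the Nth call */
    if (attr->probability <= (unsigned long)(rand() % 100))
        return false;                 /* lost the percentage roll */
    if (attr->times > 0)
        attr->times--;
    return true;
}
```

With the example setting 10,100,0,-1 this model fails every tenth call, matching the "once per 10 times" description: the 100 percent probability makes the roll always pass, the zero space budget is already spent, and -1 places no cap on the number of failures.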
     
  • This patch provides fault-injection capability for kmalloc.

    Boot option:

    failslab=<interval>,<probability>,<space>,<times>

    <interval> -- specifies the interval of failures.

    <probability> -- specifies how often it should fail in percent.

    <space> -- specifies the size of free space where memory can be
    allocated safely in bytes.

    <times> -- specifies how many times failures may happen at most.

    Debugfs:

    /debug/failslab/interval
    /debug/failslab/probability
    /debug/failslab/space
    /debug/failslab/times
    /debug/failslab/ignore-gfp-highmem
    /debug/failslab/ignore-gfp-wait

    Example:

    failslab=10,100,0,-1

    slab allocation (kmalloc(), kmem_cache_alloc(),..) fails once per 10 times.

    Cc: Pekka Enberg
    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • This facility provides three entry points:

    ilog2() Log base 2 of unsigned long
    ilog2_u32() Log base 2 of u32
    ilog2_u64() Log base 2 of u64

    These facilities can either be used inside functions on dynamic data:

    int do_something(long q)
    {
    ...;
    y = ilog2(q);
    ...;
    }

    Or can be used to statically initialise global variables with constant values:

    unsigned n = ilog2(27);

    When performing static initialisation, the compiler will report "error:
    initializer element is not constant" if asked to take a log of zero or of
    something not reducible to a constant. They treat negative numbers as
    unsigned.

    When not dealing with a constant, they fall back to using fls() which permits
    them to use arch-specific log calculation instructions - such as BSR on
    x86/x86_64 or SCAN on FRV - if available.

    [akpm@osdl.org: MMC fix]
    Signed-off-by: David Howells
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Herbert Xu
    Cc: David Howells
    Cc: Wojtek Kaniewski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
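
For readers who want to see what the non-constant path computes, here is a portable sketch of an integer log base 2. It uses a plain shift loop instead of fls() or an arch instruction, and the names and the -1 result for zero are assumptions of this sketch (the constant-folding variant described above rejects zero at compile time instead).

```c
#include <stdint.h>

/*
 * floor(log2(n)) for n > 0, found by counting how many times n can be
 * shifted right before it reaches zero. Returns -1 for n == 0.
 */
static int ilog2_u64_sketch(uint64_t n)
{
    int log = -1;

    while (n != 0) {
        n >>= 1;
        log++;
    }
    return log;
}

/* 32-bit entry point; the argument widens without changing its value. */
static int ilog2_u32_sketch(uint32_t n)
{
    return ilog2_u64_sketch(n);
}
```

For example, 27 lies between 16 and 32, so its log is 4. A negative int converted to uint32_t becomes a large unsigned value, so -1 yields 31, in line with the note that negative numbers are treated as unsigned.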
     
  • Signed-off-by: Josef Sipek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef Sipek
     
  • Change all the uses of f_{dentry,vfsmnt} to f_path.{dentry,mnt} in linux/mm/.

    Signed-off-by: Josef "Jeff" Sipek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef "Jeff" Sipek
     
  • fallback_alloc() could end up calling cpuset_zone_allowed() with interrupts
    disabled (by code in kmem_cache_alloc_node()), but without __GFP_HARDWALL
    set, leading to a possible call of a sleeping function with interrupts
    disabled.

    This results in the BUG report:

    BUG: sleeping function called from invalid context at kernel/cpuset.c:1520
    in_atomic():0, irqs_disabled():1

    Thanks to Paul Menage for catching this one.

    Signed-off-by: Paul Jackson
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

08 Dec, 2006

14 commits

  • - move some file_operations structs into the .rodata section

    - move static strings from policy_types[] array into the .rodata section

    - fix generic seq_operations usages, so that those structs may be defined
    as "const" as well

    [akpm@osdl.org: couple of fixes]
    Signed-off-by: Helge Deller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Helge Deller
     
  • In time for 2.6.20, we can get rid of this junk.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Burman Yan
     
  • Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • There was lots of #ifdef noise in the kernel due to hotcpu_notifier(fn,
    prio) not correctly marking 'fn' as used in the !HOTPLUG_CPU case, and thus
    generating compiler warnings of unused symbols, hence forcing people to add
    #ifdefs.

    The compiler can skip truly unused functions just fine:

       text    data     bss     dec     hex filename
    1624412  728710 3674856 6027978  5bfaca vmlinux.before
    1624412  728710 3674856 6027978  5bfaca vmlinux.after

    [akpm@osdl.org: topology.c fix]
    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
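
A sketch of the idea, under the assumption that the stub for the !HOTPLUG_CPU case simply evaluates 'fn' as a discarded expression. The macro and function names below are made up for this illustration.

```c
/*
 * Stub used when CPU hotplug is compiled out: mention fn in a void
 * expression so the compiler counts it as used, but emit no code.
 */
#define hotcpu_notifier_stub(fn, pri) do { (void)(fn); } while (0)

/* Hypothetical callback that would otherwise trigger an unused warning. */
static int cpu_callback_example(int event)
{
    return event;
}

/* Caller needs no #ifdef around either the callback or the registration. */
static int register_callbacks_example(void)
{
    hotcpu_notifier_stub(cpu_callback_example, 0);
    return 0;
}
```

Since the callback is never actually called, an optimizing compiler is free to drop it, which is consistent with the identical before and after sizes shown in the message.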
     
  • It has no users and it's doubtful that we'll need it again.

    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Use put_pages_list() instead of opencoding it.

    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
     
  • Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Move process freezing functions from include/linux/sched.h to freezer.h, so
    that modifications to the freezer or the kernel configuration don't require
    recompiling just about everything.

    [akpm@osdl.org: fix ueagle driver]
    Signed-off-by: Nigel Cunningham
    Cc: "Rafael J. Wysocki"
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nigel Cunningham
     
  • Currently swsusp saves the contents of highmem pages by copying them to the
    normal zone which is quite inefficient (eg. it requires two normal pages
    to be used for saving one highmem page). This may be improved by using
    highmem for saving the contents of saveable highmem pages.

    Namely, during the suspend phase of the suspend-resume cycle we try to
    allocate as many free highmem pages as there are saveable highmem pages.
    If there are not enough highmem image pages to store the contents of all of
    the saveable highmem pages, some of them will be stored in the "normal"
    memory. Next, we allocate as many free "normal" pages as needed to store
    the (remaining) image data. We use a memory bitmap to mark the allocated
    free pages (ie. highmem as well as "normal" image pages).

    Now, we use another memory bitmap to mark all of the saveable pages
    (highmem as well as "normal") and the contents of the saveable pages are
    copied into the image pages. Then, the second bitmap is used to save the
    pfns corresponding to the saveable pages and the first one is used to save
    their data.

    During the resume phase the pfns of the pages that were saveable during the
    suspend are loaded from the image and used to mark the "unsafe" page
    frames. Next, we try to allocate as many free highmem page frames as needed to
    load all of the image data that had been in the highmem before the suspend
    and we allocate so many free "normal" page frames that the total number of
    allocated free pages (highmem and "normal") is equal to the size of the
    image. While doing this we have to make sure that there will be some extra
    free "normal" and "safe" page frames for two lists of PBEs constructed
    later.

    Now, the image data are loaded, if possible, into their "original" page
    frames. The image data that cannot be written into their "original" page
    frames are loaded into "safe" page frames and their "original" kernel
    virtual addresses, as well as the addresses of the "safe" pages containing
    their copies, are stored in one of two lists of PBEs.

    One list of PBEs is for the copies of "normal" suspend pages (ie. "normal"
    pages that were saveable during the suspend) and it is used in the same way
    as previously (ie. by the architecture-dependent parts of swsusp). The
    other list of PBEs is for the copies of highmem suspend pages. The pages
    in this list are restored (in a reversible way) right before the
    arch-dependent code is called.

    Signed-off-by: Rafael J. Wysocki
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Make swsusp use block device offsets instead of swap offsets to identify swap
    locations and make it use the same code paths for writing as well as for
    reading data.

    This allows us to use the same code for handling swap files and swap
    partitions and to simplify the code, eg. by dropping rw_swap_page_sync().

    Signed-off-by: Rafael J. Wysocki
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • The Linux kernel handles swap files almost in the same way as it handles swap
    partitions and there are only two differences between these two types of swap
    areas:

    (1) swap files need not be contiguous,

    (2) the header of a swap file is not in the first block of the partition
    that holds it. From the swsusp's point of view (1) is not a problem,
    because it is already taken care of by the swap-handling code, but (2) has
    to be taken into consideration.

    In principle the location of a swap file's header may be determined with the
    help of the appropriate filesystem driver. Unfortunately, however, it requires
    the filesystem holding the swap file to be mounted, and if this filesystem is
    journaled, it cannot be mounted during a resume from disk. For this reason we
    need some other means by which swap areas can be identified.

    For example, to identify a swap area we can use the partition that holds the
    area and the offset from the beginning of this partition at which the swap
    header is located.

    The following patch allows swsusp to identify swap areas this way. It changes
    swap_type_of() so that it takes an additional argument representing an offset
    of the swap header within the partition represented by its first argument.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Make radix tree lookups safe to be performed without locks. Readers are
    protected against nodes being deleted by using RCU based freeing. Readers
    are protected against new node insertion by using memory barriers to ensure
    the node itself will be properly written before it is visible in the radix
    tree.

    Each radix tree node keeps a record of its height (above leaf nodes).
    This height does not change after insertion -- when the radix tree is
    extended, higher nodes are only inserted in the top. So a lookup can take
    the pointer to what is *now* the root node, and traverse down it even if
    the tree is concurrently extended and this node becomes a subtree of a new
    root.

    "Direct" pointers (tree height of 0, where root->rnode points directly to
    the data item) are handled by using the low bit of the pointer to signal
    whether rnode is a direct pointer or a pointer to a radix tree node.

    When a reader wants to traverse the next branch, they will take a copy of
    the pointer. This pointer will be either NULL (and the branch is empty) or
    non-NULL (and will point to a valid node).

    [akpm@osdl.org: cleanups]
    [Lee.Schermerhorn@hp.com: bugfixes, comments, simplifications]
    [clameter@sgi.com: build fix]
    Signed-off-by: Nick Piggin
    Cc: "Paul E. McKenney"
    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
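
The "direct pointer" encoding described above can be sketched as ordinary low-bit pointer tagging. The macro and helper names are hypothetical, and the sketch assumes stored items are at least 2-byte aligned so the low bit is free; it does not model the RCU freeing or the memory barriers.

```c
#include <stdbool.h>
#include <stdint.h>

#define DIRECT_PTR_BIT ((uintptr_t)1) /* hypothetical name for the tag bit */

/* Mark a data item pointer as "direct" before storing it in the root. */
static void *make_direct(void *item)
{
    return (void *)((uintptr_t)item | DIRECT_PTR_BIT);
}

/* A reader checks the low bit to tell a data item from a tree node. */
static bool is_direct(const void *slot)
{
    return ((uintptr_t)slot & DIRECT_PTR_BIT) != 0;
}

/* Strip the tag to recover the original item pointer. */
static void *direct_to_item(void *slot)
{
    return (void *)((uintptr_t)slot & ~DIRECT_PTR_BIT);
}
```

A reader that has taken a copy of the root slot can decide from that single word whether to return the item or to start descending, without consulting any other field that a concurrent writer might be changing.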
     
  • Currently we use the lru head link of the second page of a compound page
    to hold its destructor. This was ok when it was purely an internal
    implementation detail. However, hugetlbfs overrides this destructor
    violating the layering. Abstract this out as explicit calls, also
    introduce a type for the callback function allowing them to be type
    checked. For each callback we pre-declare the function, causing a type
    error on definition rather than on use elsewhere.

    [akpm@osdl.org: cleanups]
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
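
The pre-declaration technique mentioned above can be shown in a few lines of standalone C. The type and function names are invented for this sketch, and the destructor is stored in a typed field rather than in an lru link.

```c
struct cpage_model; /* hypothetical stand-in for struct page */

/* One function type shared by every destructor callback. */
typedef void cpage_dtor_fn(struct cpage_model *);

struct cpage_model {
    cpage_dtor_fn *dtor; /* models the slot kept in the second page */
    int freed;
};

/*
 * Pre-declare the callback using the typedef. If the definition below
 * had a different signature, the compiler would reject the definition
 * itself, rather than some distant call site.
 */
static cpage_dtor_fn free_cpage_model;

/* Explicit accessors replace open-coded pokes at the storage slot. */
static void set_cpage_dtor(struct cpage_model *page, cpage_dtor_fn *dtor)
{
    page->dtor = dtor;
}

static cpage_dtor_fn *get_cpage_dtor(struct cpage_model *page)
{
    return page->dtor;
}

static void free_cpage_model(struct cpage_model *page)
{
    page->freed = 1;
}
```

A different subsystem can install its own destructor through set_cpage_dtor() and still get type checking, which is the layering the message asks for.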