17 Jan, 2006

1 commit


15 Jan, 2006

1 commit


12 Jan, 2006

1 commit


10 Jan, 2006

1 commit


09 Jan, 2006

3 commits

  • We've had two instances recently of overflows when doing

    64_bit_value = (32_bit_value << PAGE_CACHE_SHIFT)

    I did a tree-wide grep of `<page_base)

    Cc: Oleg Drokin
    Cc: David Howells
    Cc: David Woodhouse
    Cc: Christoph Hellwig
    Cc: Anton Altaparmakov
    Cc: Jeff Dike
    Cc: Paolo 'Blaisorblade' Giarrusso
    Cc: Roman Zippel
    Cc: Miklos Szeredi
    Cc: Russell King
    Cc: Trond Myklebust
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
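
    A minimal sketch of the overflow pattern this commit message describes,
    assuming a 32-bit unsigned int. The shift value and the function names are
    illustrative only; they are not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative value; the real PAGE_CACHE_SHIFT depends on the architecture. */
#define DEMO_PAGE_CACHE_SHIFT 12

/* Buggy form: the shift is evaluated in 32 bits, so the high bits are
 * lost before the result is widened to 64 bits. */
uint64_t offset_buggy(uint32_t index)
{
	return index << DEMO_PAGE_CACHE_SHIFT;
}

/* Fixed form: widen first, then shift in 64 bits. */
uint64_t offset_fixed(uint32_t index)
{
	return (uint64_t)index << DEMO_PAGE_CACHE_SHIFT;
}

/* Small self-check: page index 2^20 corresponds to byte offset 2^32. */
void demo_overflow_selfcheck(void)
{
	assert(offset_buggy(0x00100000u) == 0u);
	assert(offset_fixed(0x00100000u) == 0x100000000ULL);
}
```

    For small indices both forms agree; they diverge once the shifted value no
    longer fits in 32 bits.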
     
  • This patch adds EXPORT_SYMBOL(filemap_write_and_wait) and uses it.

    See mm/filemap.c.

    It also changes filemap_write_and_wait() and filemap_write_and_wait_range().

    The current filemap_write_and_wait() doesn't wait if filemap_fdatawrite()
    returns an error. However, even if filemap_fdatawrite() returned an
    error, it may have submitted some of the data pages to the device
    (e.g. in the case of -ENOSPC).

    Andrew Morton writes,

    If filemap_fdatawrite() returns an error, this might be due to some
    I/O problem: dead disk, unplugged cable, etc. Given the generally
    crappy quality of the kernel's handling of such exceptions, there's a
    good chance that the filemap_fdatawait() will get stuck in D state
    forever.

    So, this patch skips the wait only if filemap_fdatawrite() returns -EIO.

    Trond, could you please review the nfs part? In particular, I'm not sure
    whether nfs must use the "filemap_fdatawrite(inode->i_mapping) == 0" check or not.

    Acked-by: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
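
    A minimal sketch of the behaviour this commit message describes: wait after
    a failed write-out, except when the failure is -EIO. The demo_* names are
    hypothetical stand-ins for filemap_fdatawrite() and filemap_fdatawait(), not
    the kernel functions themselves.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical mapping whose write and wait results are preset. */
struct demo_mapping {
	int write_result;   /* what the write-out step reports */
	int wait_result;    /* what the wait step reports */
	int waited;         /* set to 1 when the wait step actually ran */
};

int demo_fdatawrite(struct demo_mapping *m)
{
	return m->write_result;
}

int demo_fdatawait(struct demo_mapping *m)
{
	m->waited = 1;
	return m->wait_result;
}

int demo_write_and_wait(struct demo_mapping *m)
{
	int err;

	assert(m != NULL);
	err = demo_fdatawrite(m);

	/* Pages may have been submitted even on an error such as -ENOSPC,
	 * so still wait for them; skip the wait only for -EIO, where the
	 * wait itself might never return. */
	if (err != -EIO) {
		int err2 = demo_fdatawait(m);

		if (!err)
			err = err2;   /* keep the first error seen */
	}
	return err;
}
```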
     
  • This patch changes generic_cont_expand(), in order to share the code
    with fatfs.

    - Use vmtruncate() if ->prepare_write() returns an error.

    Even if ->prepare_write() returns an error, it may already have added some
    blocks. So, this truncates blocks outside of ->i_size by vmtruncate().

    - Add generic_cont_expand_simple().

    The generic_cont_expand_simple() assumes that ->prepare_write() can handle
    the block boundary. With this, we don't need to care about the extra byte.

    And for expanding a file size by truncate(), fatfs uses the
    added generic_cont_expand_simple().

    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
     

07 Nov, 2005

1 commit


31 Oct, 2005

2 commits

  • If a filesystem passes an idiotic blocksize into bread(), __getblk_slow() will
    warn and will return NULL. We have a report (from Hubert Tonneau) of
    isofs_fill_super() doing this (passing in a silly block size) against an
    unplugged CDROM drive.

    But a couple of __getblk_slow() callers forgot to check for the NULL bh, hence
    oops.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
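
    A minimal sketch of the kind of NULL check this commit message calls for.
    The demo_* functions and the block-size rule are hypothetical; they only
    imitate a lookup that returns NULL for a nonsensical block size.

```c
#include <assert.h>
#include <stddef.h>

struct demo_bh {
	size_t size;
};

/* Hypothetical lookup: like the __getblk_slow() described above, it
 * returns NULL when handed a silly block size. The rule used here
 * (power of two between 512 and 4096) is illustrative only. */
struct demo_bh *demo_getblk(struct demo_bh *slot, size_t size)
{
	assert(slot != NULL);
	if (size < 512 || size > 4096 || (size & (size - 1)) != 0)
		return NULL;
	slot->size = size;
	return slot;
}

/* Caller that checks for NULL instead of dereferencing blindly. */
int demo_read_block(struct demo_bh *slot, size_t size)
{
	struct demo_bh *bh = demo_getblk(slot, size);

	if (bh == NULL)
		return -1;   /* report the failure instead of oopsing */
	return 0;
}
```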
     
  • Fix the problem (BUG 4964) with unmapped buffers in transaction's
    t_sync_data list. The problem is we need to call filesystem's own
    invalidatepage() from block_write_full_page().

    block_write_full_page() must call the filesystem's invalidatepage(). Otherwise
    the following nasty race can happen:

    proc 1                                  proc 2
    ------                                  ------
    - write some new data to 'offset'
      => bh gets to the transactions data list
                                            - starts truncate
                                              => i_size set to new size
    - mpage_writepages()
      - ext3_ordered_writepage() to 'offset'
        - block_write_full_page()
          - page->index > end_index+1
            - block_invalidatepage()
              - discard_buffer()
                - clear_buffer_mapped()

    - commit triggers and finds unmapped buffer - BOOM!

    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

30 Oct, 2005

1 commit

  • Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
    a many-threaded application which concurrently initializes different parts of
    a large anonymous area.

    This patch corrects that, by using a separate spinlock per page table page, to
    guard the page table entries in that page, instead of using the mm's single
    page_table_lock. (But even then, page_table_lock is still used to guard page
    table allocation, and anon_vma allocation.)

    In this implementation, the spinlock is tucked inside the struct page of the
    page table page: with a BUILD_BUG_ON in case it overflows - which it would in
    the case of 32-bit PA-RISC with spinlock debugging enabled.

    Splitting the lock is not quite for free: another cacheline access. Ideally,
    I suppose we would use split ptlock only for multi-threaded processes on
    multi-cpu machines; but deciding that dynamically would have its own costs.
    So for now enable it by config, at some number of cpus - since the Kconfig
    language doesn't support inequalities, let preprocessor compare that with
    NR_CPUS. But I don't think it's worth being user-configurable: for good
    testing of both split and unsplit configs, split now at 4 cpus, and perhaps
    change that to 8 later.

    There is a benefit even for singly threaded processes: kswapd can be attacking
    one part of the mm while another part is busy faulting.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
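
    A minimal sketch of the compile-time choice this commit message describes:
    the preprocessor compares a CPU count against a threshold and selects either
    a per-page-table-page lock or the single per-mm lock. All DEMO_* and demo_*
    names, and the CPU count of 8, are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NR_CPUS           8   /* assumed build-time CPU count */
#define DEMO_SPLIT_PTLOCK_CPUS 4   /* threshold named in the message */

/* The comparison is done by the preprocessor, not by Kconfig. */
#if DEMO_NR_CPUS >= DEMO_SPLIT_PTLOCK_CPUS
#define DEMO_USE_SPLIT_PTLOCK 1
#else
#define DEMO_USE_SPLIT_PTLOCK 0
#endif

struct demo_lock { int locked; };
struct demo_mm { struct demo_lock page_table_lock; };
struct demo_pt_page { struct demo_lock ptl; };   /* lock kept in the page */

/* Return the lock that guards the entries of one page table page. */
struct demo_lock *demo_pte_lockptr(struct demo_mm *mm,
				   struct demo_pt_page *page)
{
	assert(mm != NULL && page != NULL);
#if DEMO_USE_SPLIT_PTLOCK
	return &page->ptl;               /* one lock per page table page */
#else
	return &mm->page_table_lock;     /* one lock for the whole mm */
#endif
}
```

    With the split enabled, two threads working on different page table pages
    take different locks and so do not contend with each other.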
     

28 Oct, 2005

2 commits

  • - ->releasepage() annotated (s/int/gfp_t), instances updated
    - missing gfp_t in fs/* added
    - fixed misannotation from the original sweep caught by bitwise checks:
      XFS used __nocast both for gfp_t and for flags used by XFS allocator.
      The latter left with unsigned int __nocast; we might want to add a
      different type for those but for now let's leave them alone. That,
      BTW, is a case when __nocast use had been actively confusing - it had
      been used in the same code for two different and similar types, with
      no way to catch misuses. Switch of gfp_t to bitwise had caught that
      immediately...

    One tricky bit is left alone to be dealt with later - mapping->flags is
    a mix of gfp_t and error indications. Left alone for now.

    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
     
  • Beginning of gfp_t annotations:

    - -Wbitwise added to CHECKFLAGS
    - old __bitwise renamed to __bitwise__
    - __bitwise defined to either __bitwise__ or nothing, depending on
      __CHECK_ENDIAN__ being defined
    - gfp_t switched from __nocast to __bitwise__
    - force cast to gfp_t added to __GFP_... constants
    - new helper - gfp_zone(); extracts zone bits out of gfp_t value and casts
      the result to int

    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
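
    A minimal sketch of the annotation scheme listed in this commit message: a
    flags typedef that a checker treats as a distinct type, constants that are
    force-cast into it, and a helper that extracts the zone bits as an int. The
    demo_* names and the flag values are illustrative, not the kernel's.

```c
#include <assert.h>

/* A checker such as sparse defines __CHECKER__ and understands these
 * attributes; for a plain compiler they expand to nothing, so the
 * generated code is unchanged. */
#ifdef __CHECKER__
#define demo_bitwise __attribute__((bitwise))
#define demo_force   __attribute__((force))
#else
#define demo_bitwise
#define demo_force
#endif

typedef unsigned int demo_bitwise demo_gfp_t;

/* Illustrative flag values; each constant is force-cast to the type. */
#define DEMO_GFP_DMA      ((demo_force demo_gfp_t)0x01u)
#define DEMO_GFP_HIGHMEM  ((demo_force demo_gfp_t)0x02u)
#define DEMO_GFP_WAIT     ((demo_force demo_gfp_t)0x10u)
#define DEMO_GFP_ZONEMASK 0x03

/* Extract the zone bits and hand them back as a plain int. */
int demo_gfp_zone(demo_gfp_t gfp)
{
	return ((demo_force int)gfp) & DEMO_GFP_ZONEMASK;
}

/* Small self-check: non-zone bits never leak into the zone index. */
void demo_gfp_selfcheck(void)
{
	assert(demo_gfp_zone(DEMO_GFP_WAIT) == 0);
	assert(demo_gfp_zone(DEMO_GFP_WAIT | DEMO_GFP_DMA) == 1);
}
```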
     

09 Oct, 2005

1 commit

  • - added typedef unsigned int __nocast gfp_t;

    - replaced __nocast uses for gfp flags with gfp_t - it gives exactly
      the same warnings as far as sparse is concerned, doesn't change
      generated code (from gcc point of view we replaced unsigned int with
      typedef) and documents what's going on far better.

    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
     

11 Sep, 2005

1 commit

  • This patch (written by me and also containing many suggestions of Arjan van
    de Ven) does a major cleanup of the spinlock code. It does the following
    things:

    - consolidates and enhances the spinlock/rwlock debugging code

    - simplifies the asm/spinlock.h files

    - encapsulates the raw spinlock type and moves generic spinlock
    features (such as ->break_lock) into the generic code.

    - cleans up the spinlock code hierarchy to get rid of the spaghetti.

    Most notably there's now only a single variant of the debugging code,
    located in lib/spinlock_debug.c. (previously we had one SMP debugging
    variant per architecture, plus a separate generic one for UP builds)

    Also, I've enhanced the rwlock debugging facility: it will now track
    write-owners. There is new spinlock-owner/CPU-tracking on SMP builds too.
    All locks have lockup detection now, which will work for both soft and hard
    spin/rwlock lockups.

    The arch-level include files now only contain the minimally necessary
    subset of the spinlock code - all the rest that can be generalized now
    lives in the generic headers:

    include/asm-i386/spinlock_types.h | 16
    include/asm-x86_64/spinlock_types.h | 16

    I have also split up the various spinlock variants into separate files,
    making it easier to see which does what. The new layout is:

    SMP                         | UP
    ----------------------------|-----------------------------------
    asm/spinlock_types_smp.h    | linux/spinlock_types_up.h
    linux/spinlock_types.h      | linux/spinlock_types.h
    asm/spinlock_smp.h          | linux/spinlock_up.h
    linux/spinlock_api_smp.h    | linux/spinlock_api_up.h
    linux/spinlock.h            | linux/spinlock.h

    /*
    * here's the role of the various spinlock/rwlock related include files:
    *
    * on SMP builds:
    *
    * asm/spinlock_types.h: contains the raw_spinlock_t/raw_rwlock_t and the
    * initializers
    *
    * linux/spinlock_types.h:
    * defines the generic type and initializers
    *
    * asm/spinlock.h: contains the __raw_spin_*()/etc. lowlevel
    * implementations, mostly inline assembly code
    *
    * (also included on UP-debug builds:)
    *
    * linux/spinlock_api_smp.h:
    * contains the prototypes for the _spin_*() APIs.
    *
    * linux/spinlock.h: builds the final spin_*() APIs.
    *
    * on UP builds:
    *
    * linux/spinlock_types_up.h:
    * contains the generic, simplified UP spinlock type.
    * (which is an empty structure on non-debug builds)
    *
    * linux/spinlock_types.h:
    * defines the generic type and initializers
    *
    * linux/spinlock_up.h:
    * contains the __raw_spin_*()/etc. version of UP
    * builds. (which are NOPs on non-debug, non-preempt
    * builds)
    *
    * (included on UP-non-debug builds:)
    *
    * linux/spinlock_api_up.h:
    * builds the _spin_*() APIs.
    *
    * linux/spinlock.h: builds the final spin_*() APIs.
    */

    All SMP and UP architectures are converted by this patch.

    arm, i386, ia64, ppc, ppc64, s390/s390x, x64 were build-tested via
    crosscompilers. m32r, mips, sh, sparc have not been tested yet, but should
    be mostly fine.

    From: Grant Grundler

    Booted and lightly tested on a500-44 (64-bit, SMP kernel, dual CPU).
    Builds 32-bit SMP kernel (not booted or tested). I did not try to build
    non-SMP kernels. That should be trivial to fix up later if necessary.

    I converted bit ops atomic_hash lock to raw_spinlock_t. Doing so avoids
    some ugly nesting of linux/*.h and asm/*.h files. Those particular locks
    are well tested and contained entirely inside arch specific code. I do NOT
    expect any new issues to arise with them.

    If someone does ever need to use debug/metrics with them, then they will
    need to unravel this hairball between spinlocks, atomic ops, and bit ops
    that exist only because parisc has exactly one atomic instruction: LDCW
    (load and clear word).

    From: "Luck, Tony"

    ia64 fix

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Grant Grundler
    Cc: Matthew Wilcox
    Signed-off-by: Hirokazu Takata
    Signed-off-by: Mikael Pettersson
    Signed-off-by: Benoit Boissinot
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

08 Sep, 2005

2 commits


08 Jul, 2005

1 commit

  • Use a bit spin lock in the first buffer of the page to synchronise asynch
    IO buffer completions, instead of the global page_uptodate_lock, which is
    showing some scalability problems.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
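
    A minimal user-space sketch of a bit spin lock of the kind this commit
    message mentions: one bit of a per-buffer state word serves as the lock, so
    no global lock is needed. It uses C11 atomics; the demo_* names and the
    choice of bit are hypothetical and unrelated to the kernel's own helpers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define DEMO_BH_LOCK_BIT 0x1ul   /* hypothetical bit reserved for the lock */

struct demo_bh {
	atomic_ulong state;      /* other bits carry unrelated flags */
};

void demo_bit_spin_lock(struct demo_bh *first)
{
	assert(first != NULL);
	/* Spin until this caller is the one that flips the bit 0 -> 1. */
	while (atomic_fetch_or_explicit(&first->state, DEMO_BH_LOCK_BIT,
					memory_order_acquire) & DEMO_BH_LOCK_BIT)
		;
}

void demo_bit_spin_unlock(struct demo_bh *first)
{
	assert(first != NULL);
	/* Clear only the lock bit; the other flag bits are preserved. */
	atomic_fetch_and_explicit(&first->state, ~DEMO_BH_LOCK_BIT,
				  memory_order_release);
}

int demo_bit_is_locked(struct demo_bh *first)
{
	return (atomic_load(&first->state) & DEMO_BH_LOCK_BIT) != 0;
}
```

    Completions for buffers of different pages then take different locks,
    which is the scalability point the message makes.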
     

29 Jun, 2005

1 commit


24 Jun, 2005

2 commits

  • fs/buffer.c::__block_prepare_write() has broken error recovery. It calls
    the get_block() callback with "create = 1" and if that succeeds it
    immediately clears buffer_new on the just allocated buffer (which has
    buffer_new set).

    The bug is that if an error occurs and get_block() returns != 0, we break
    from this loop and go into recovery code. This code has this comment:

    /* Error case: */
    /*
    * Zero out any newly allocated blocks to avoid exposing stale
    * data. If BH_New is set, we know that the block was newly
    * allocated in the above loop.
    */

    So the intent is obviously good in that it wants to clear just allocated
    and hence not zeroed buffers. However the code recognises allocated
    buffers by checking for buffer_new being set.

    Unfortunately __block_prepare_write() as discussed above already cleared
    buffer_new on all allocated buffers thus no buffers will be cleared during
    error recovery and old data will be leaked.

    The simplest way I can see to fix this is to make the current recovery code
    work by _not_ clearing buffer_new after calling get_block() in
    __block_prepare_write().

    We cannot safely allow buffer_new buffers to "leak out" of
    __block_prepare_write(), thus we simply do a quick loop over the buffers
    clearing buffer_new on each of them if it is set just before returning
    "success" from __block_prepare_write().

    Signed-off-by: Anton Altaparmakov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Altaparmakov
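
    A minimal sketch of the recovery scheme this commit message describes: keep
    the "new" flag set while mapping buffers, zero flagged buffers if an error
    occurs, and clear the flag on every buffer before returning. The demo_*
    names, buffer count and failure injection are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_NBUF  4
#define DEMO_BLOCK 8

struct demo_buf {
	int is_new;              /* stands in for buffer_new */
	char data[DEMO_BLOCK];   /* may hold stale contents */
};

/* Hypothetical get_block(): "allocates" buffer i, failing at fail_at. */
int demo_get_block(struct demo_buf *b, int i, int fail_at)
{
	if (i == fail_at)
		return -1;
	b->is_new = 1;           /* freshly allocated, not yet zeroed */
	return 0;
}

int demo_prepare_write(struct demo_buf *bufs, int fail_at)
{
	int i, err = 0;

	assert(bufs != NULL);
	for (i = 0; i < DEMO_NBUF; i++) {
		err = demo_get_block(&bufs[i], i, fail_at);
		if (err)
			break;
		/* is_new is deliberately left set so recovery can see it. */
	}

	for (i = 0; i < DEMO_NBUF; i++) {
		if (!bufs[i].is_new)
			continue;
		if (err)   /* error path: do not expose stale data */
			memset(bufs[i].data, 0, DEMO_BLOCK);
		bufs[i].is_new = 0;   /* never let the flag leak out */
	}
	return err;
}
```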
     
  • This patch consolidates sys_fsync and sys_fdatasync.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

22 Jun, 2005

1 commit

  • try_to_free_pages accepts a third argument, order, but hasn't used it since
    before 2.6.0. The following patch removes the argument and updates all the
    calls to try_to_free_pages.

    Signed-off-by: Darren Hart
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darren Hart
     

17 May, 2005

1 commit

  • If block_read_full_page() detects an error when running get_block() it will
    run SetPageError(), then it will zero out the block in pagecache and will mark
    the buffer_head uptodate.

    So at the end of readahead we end up with a non-uptodate pagecache page which
    is marked PageError. But it has uptodate buffers.

    The pagefault code will run ClearPageError, will launch readpage a second time
    and block_read_full_page() will notice the uptodate buffers and will mark the
    page uptodate as well. We end up with an uptodate, !PageError page full of
    zeros and the error is lost.

    (It seems a little odd that filemap_nopage() runs ClearPageError(). I guess
    all of this adds up to meaning that for each attempted access to the page, the
    pagefault handler will retry the I/O. Which is good and bad. If the app is
    ignoring SIGBUS for some reason we could get a lot of back-to-back I/O
    errors.)

    Fix it by not marking the pagecache buffer_head as uptodate if the attempt to
    map that buffer to a disk block failed.

    Credit-to: Qu Fuping

    For reporting the bug and identifying its source.

    Signed-off-by: Qu Fuping
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
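
    A minimal sketch of the fix this commit message describes: when mapping a
    buffer fails, record the error and zero the data, but do not mark the
    buffer uptodate. The demo_* types and the map_err parameter are
    hypothetical simplifications of the real read path.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct demo_buf {
	int mapped;
	int uptodate;
	char data[8];
};

struct demo_page {
	int error;   /* stands in for the PageError flag */
};

/* Process one buffer of a page; map_err simulates a get_block() failure. */
void demo_read_one(struct demo_page *page, struct demo_buf *bh, int map_err)
{
	assert(page != NULL && bh != NULL);
	if (map_err) {
		page->error = 1;                      /* remember the failure */
		memset(bh->data, 0, sizeof bh->data); /* zero the block */
		/* Deliberately not marked uptodate, so a later retry
		 * cannot mistake the zeros for valid data. */
		return;
	}
	bh->mapped = 1;
	bh->uptodate = 1;   /* stands in for a completed read */
}
```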
     

06 May, 2005

6 commits

  • This patch makes some needlessly global identifiers static.

    Signed-off-by: Adrian Bunk
    Acked-by: Arjan van de Ven
    Acked-by: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • The `last_bh' logic probably isn't worth much. In those situations where only
    the front part of the page is being written out we will save some looping but
    in the vastly more common case of an all-page writeout it just adds more code.

    Nick Piggin

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Remove all those get_bh()'s and put_bh()'s by extending lock_page() to cover
    the troublesome regions.

    (get_bh() and put_bh() happen every time whereas contention on a page's lock
    in there happens basically never).

    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • When running
    fsstress -v -d $DIR/tmp -n 1000 -p 1000 -l 2
    on an ext2 filesystem with 1024 byte block size, on SMP i386 with 4096 byte
    page size over loopback to an image file on a tmpfs filesystem, I would
    very quickly hit
    BUG_ON(!buffer_async_write(bh));
    in fs/buffer.c:end_buffer_async_write

    It seems that more than one request would be submitted for a given bh
    at a time.

    What would happen is the following:
    2 threads doing __mpage_writepages on the same page.
    Thread 1 - lock the page first, and enter __block_write_full_page.
    Thread 1 - (eg.) mark_buffer_async_write on the first 2 buffers.
    Thread 1 - set page writeback, unlock page.
    Thread 2 - lock page, wait on page writeback
    Thread 1 - submit_bh on the first 2 buffers.
    => both requests complete, none of the page buffers are async_write,
    end_page_writeback is called.
    Thread 2 - wakes up. enters __block_write_full_page.
    Thread 2 - mark_buffer_async_write on (eg.) the last buffer
    Thread 1 - finds the last buffer has async_write set, submit_bh on that.
    Thread 2 - submit_bh on the last buffer.
    => oops.

    So change __block_write_full_page to explicitly keep track of the last bh
    we need to issue, so we don't touch anything after issuing the last
    request.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Fix a race where __block_prepare_write can leak out an in-flight read
    against a bh if get_block returns an error. This can lead to the page
    becoming unlocked while the buffer is locked and the read still in flight.
    __mpage_writepage BUGs on this condition.

    BUG sighted on a 2-way Itanium2 system with 16K PAGE_SIZE running

    fsstress -v -d $DIR/tmp -n 1000 -p 1000 -l 2

    where $DIR is a new ext2 filesystem with 4K blocks that is quite
    small (causing get_block to fail often with -ENOSPC).

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • This makes sure that reclaimable buffer headers and reclaimable inodes
    are accounted properly during the overcommit checks.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

01 May, 2005

4 commits


17 Apr, 2005

2 commits