27 Mar, 2006

25 commits

  • The nanosleep cleanup allows removing the data field of hrtimer. The
    callback function can use container_of() to get its own data. Since the
    hrtimer structure is embedded in other structures anyway, this adds no
    overhead.

    Signed-off-by: Roman Zippel
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Zippel
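
The container_of() pattern mentioned above can be sketched in userspace as follows. This is only an illustration, not the hrtimer code itself: toy_timer, toy_sleeper, toy_wakeup and toy_fire are hypothetical names, and the macro is a simplified stand-in for the kernel's container_of().

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical timer type: note that it carries no "void *data" field. */
struct toy_timer {
    int (*function)(struct toy_timer *timer);
};

/* Hypothetical user structure that embeds the timer. */
struct toy_sleeper {
    int wakeups;
    struct toy_timer timer;
};

/* Callback: recover the enclosing structure from the timer pointer. */
static int toy_wakeup(struct toy_timer *timer)
{
    struct toy_sleeper *s = container_of(timer, struct toy_sleeper, timer);

    s->wakeups++;
    return 0;
}

/* Fire a timer the way a timer core might: only the timer is passed. */
static int toy_fire(struct toy_timer *timer)
{
    return timer->function(timer);
}
```

Because the callback receives the embedded member and computes the address of its container, the timer needs no separate data pointer.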
     
  • Remove it_real_value from /proc/*/stat. The last time it returned useful
    data was during 1.2.x (when it was directly maintained by the scheduler);
    now it's only a waste of time to calculate it. Return 0 instead.

    Signed-off-by: Roman Zippel
    Acked-by: Ingo Molnar
    Acked-by: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Zippel
     
  • There is no valid reason why we can't support the "nobh" option for
    filesystems with blocksize != PAGESIZE.

    This patch lets them use the "nobh" option for writeback mode with
    blocksize < pagesize.

    Signed-off-by: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Badari Pulavarty
     
  • Mingming Cao recently added multi-block allocation support for ext3,
    currently used only by DIO. I added support to map multiple blocks for
    mpage_readpages(). This patch adds support for ext3_get_block() to deal
    with multi-block mapping. Basically it renames ext3_direct_io_get_blocks()
    to ext3_get_block().

    Signed-off-by: Badari Pulavarty
    Cc: Mingming Cao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Badari Pulavarty
     
  • - Clean up a few little layout things and comments.

    - Add a WARN_ON to a case which I was wondering about.

    - Tune up some inlines.

    Cc: Mingming Cao
    Cc: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Now that get_block() can handle mapping multiple disk blocks, there is no
    need for ->get_blocks(). This patch removes the fs-specific ->get_blocks()
    added for DIO and makes its users use get_block() instead.

    Signed-off-by: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Badari Pulavarty
     
  • This patch changes mpage_readpages() and get_block() to get the disk mapping
    information for multiple blocks at the same time.

    b_size represents the amount of disk mapping that needs to be mapped. On a
    successful get_block(), b_size indicates the amount of disk mapping that is
    actually mapped. Only the filesystems that care to use this information and
    provide multiple disk blocks at a time can choose to do so.

    No changes are needed for filesystems that want to ignore this.

    [akpm@osdl.org: cleanups]
    Signed-off-by: Badari Pulavarty
    Cc: Mingming Cao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Badari Pulavarty
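
The in/out use of b_size described above can be sketched with a toy model. Everything here is an assumption for illustration: toy_bh, toy_extent and toy_get_block are hypothetical names, the block size is fixed at 4096 bytes, and the file is modelled as a single contiguous extent rather than a real filesystem mapping.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BLOCK_SIZE 4096u  /* assumed block size */

/* Hypothetical cut-down buffer head. */
struct toy_bh {
    unsigned long b_blocknr;  /* out: first disk block of the mapping */
    size_t b_size;            /* in: bytes wanted; out: bytes actually mapped */
    int mapped;               /* out: nonzero if a mapping was found */
};

/* Hypothetical file layout: one contiguous run of blocks on disk. */
struct toy_extent {
    unsigned long file_start; /* first file block covered */
    unsigned long disk_start; /* disk block backing file_start */
    unsigned long len;        /* length in blocks */
};

/* Map as many contiguous blocks as possible, up to the requested b_size. */
static int toy_get_block(const struct toy_extent *ext, unsigned long iblock,
                         struct toy_bh *bh)
{
    unsigned long wanted = bh->b_size / TOY_BLOCK_SIZE;
    unsigned long avail;

    bh->mapped = 0;
    if (iblock < ext->file_start || iblock >= ext->file_start + ext->len)
        return 0;             /* hole: nothing mapped, b_size left untouched */

    if (wanted == 0)
        wanted = 1;           /* always map at least one block */
    avail = ext->file_start + ext->len - iblock;
    if (wanted > avail)
        wanted = avail;       /* do not run past the end of the extent */

    bh->b_blocknr = ext->disk_start + (iblock - ext->file_start);
    bh->b_size = wanted * TOY_BLOCK_SIZE;
    bh->mapped = 1;
    return 0;
}
```

A caller that ignores the returned b_size and only ever asks for one block behaves exactly as before, which is why filesystems that do not care need no changes.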
     
  • Pass the amount of disk that needs to be mapped to get_block(). This way one
    can modify the fs ->get_block() functions to map multiple blocks at the same
    time.

    [akpm@osdl.org: performance tweak]
    [akpm@osdl.org: remove unneeded assignments]
    Signed-off-by: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Badari Pulavarty
     
  • Increase the size of the buffer_head b_size field (only) for 64 bit platforms.
    Update some old and moldy comments in and around the structure as well.

    The b_size increase allows us to perform larger mappings and allocations for
    large I/O requests from userspace, which tie in with other changes allowing
    the get_block_t() interface to map multiple blocks at once.

    Signed-off-by: Nathan Scott
    Signed-off-by: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Badari Pulavarty
     
  • Optimize the block reservation and the multiple block allocation: with
    knowledge of the total number of blocks needed ahead of time, set or adjust
    the reservation window size properly (based on the number of blocks needed)
    before block allocation happens. If there isn't any reservation yet, make
    sure the reservation window is equal to or greater than the number of blocks
    needed before creating a reservation window; if a reservation window already
    exists, try to extend the window size to match the number of blocks to
    allocate. This could increase the possibility of completing a multiple-block
    allocation in a single request, as blocks are only allocated in the range of
    the inode's reservation window.

    Signed-off-by: Mingming Cao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mingming Cao
     
  • Update accounting information (quota, boundary checks, free blocks number etc)
    in ext3_new_blocks().

    Signed-off-by: Mingming Cao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mingming Cao
     
  • Change ext3_try_to_allocate() (called via ext3_new_blocks()) to try to
    allocate the requested number of blocks on a best-effort basis: after
    allocating the first block, it will always attempt to allocate the next few
    adjacent blocks (up to the requested size and not beyond the reservation
    window) at the same time.

    Signed-off-by: Mingming Cao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mingming Cao
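
The best-effort strategy described above can be sketched over a toy bitmap. This is not the ext3 allocator: toy_try_to_allocate, the fixed TOY_NBLOCKS group size, and the exclusive window_end bound are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NBLOCKS 32  /* assumed size of the toy block group */

/*
 * Allocate the first free block at or after goal, then keep taking adjacent
 * free blocks until *count blocks are allocated, an in-use block is hit, or
 * window_end (exclusive) is reached.
 * Returns the first allocated block, or -1 if none was found.
 * On return *count holds the number of blocks actually allocated.
 */
static long toy_try_to_allocate(bool *bitmap, unsigned long goal,
                                unsigned long window_end, unsigned long *count)
{
    unsigned long want = *count;
    unsigned long start, n;

    *count = 0;
    if (want == 0)
        return -1;
    if (window_end > TOY_NBLOCKS)
        window_end = TOY_NBLOCKS;

    /* Find the first free block inside the window. */
    for (start = goal; start < window_end; start++)
        if (!bitmap[start])
            break;
    if (start >= window_end)
        return -1;

    /* The first block is guaranteed; the rest are best effort. */
    bitmap[start] = true;
    n = 1;
    while (n < want && start + n < window_end && !bitmap[start + n]) {
        bitmap[start + n] = true;
        n++;
    }

    *count = n;
    return (long)start;
}
```

The caller always gets at least one block when any is free in the window, and must check the returned count to see how much of the request was satisfied.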
     
  • Add support for multiple block allocation in ext3_get_blocks().

    Look up the disk block mapping and count the total number of blocks to
    allocate, then pass it to ext3_new_block(), where the real block allocation is
    performed. Once multiple blocks are allocated, prepare the branch with the
    information about those just-allocated blocks and finally splice the whole
    branch into the block mapping tree.

    Signed-off-by: Mingming Cao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mingming Cao
     
  • Currently ext3_get_block() only maps or allocates one block at a time. This
    is quite inefficient for sequential IO workloads.

    I have posted an early implementation of simple multiple-block mapping and
    allocation for current ext3. The basic idea is to allocate the 1st block in
    the existing way, and attempt to allocate the next adjacent blocks on a
    best-effort basis. More description of the implementation can be found here:
    http://marc.theaimsgroup.com/?l=ext2-devel&m=112162230003522&w=2

    The following is the latest version of the patch: it breaks the original
    patch into 5 patches, reworks some logic, and fixes some bugs. The
    breakdown is:

    [patch 1] Add mapping of multiple blocks at a time in ext3_get_blocks()
    [patch 2] Extend ext3_get_blocks() to support multiple block allocation
    [patch 3] Implement multiple block allocation in ext3_try_to_allocate()
              (called via ext3_new_block()).
    [patch 4] Proper accounting updates in ext3_new_blocks()
    [patch 5] Adjust reservation window size properly (by the given number
              of blocks to allocate) before block allocation to increase the
              possibility of allocating multiple blocks in a single call.

    Tests done so far include fsx, tiobench and dbench. The following numbers,
    collected from Direct IO tests (1G file creation/read), show that system
    time has been greatly reduced (more than 50% on my 8 cpu system) with the
    patches.

    1G file DIO write:
              2.6.15        2.6.15+patches
    real      0m31.275s     0m31.161s
    user      0m0.000s      0m0.000s
    sys       0m3.384s      0m0.564s

    1G file DIO read:
              2.6.15        2.6.15+patches
    real      0m30.733s     0m30.624s
    user      0m0.000s      0m0.004s
    sys       0m0.748s      0m0.380s

    Some previous tests we did on buffered IO using multiple block allocation
    and delayed allocation showed noticeable improvements in throughput and
    system time.

    This patch:

    Add support for mapping multiple blocks in one call.

    This is useful for DIO reads and re-writes (where blocks are already
    allocated), and is also in line with Christoph's proposal of using
    getblocks() in mpage_readpage() or mpage_readpages().

    Signed-off-by: Mingming Cao
    Cc: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mingming Cao
     
  • Cc: Takashi Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Modify well over a dozen mempool users to call mempool_create_slab_pool()
    rather than calling mempool_create() with extra arguments, saving about 30
    lines of code and increasing readability.

    Signed-off-by: Matthew Dobson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Dobson
     
  • This patch changes several mempool users, all of which are basically just
    wrappers around kmalloc(), to use the common mempool_kmalloc/kfree, rather
    than their own wrapper function, removing a bunch of duplicated code.

    Signed-off-by: Matthew Dobson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Dobson
     
  • Update the NFSv4 server to use the new posix_lock_file_conf() interface.
    Remove unnecessary (and race-prone) posix_test_file() calls.

    Signed-off-by: Andy Adamson
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Trond Myklebust
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Adamson
     
  • Lockd and the NFSv4 server both exercise a race condition where
    posix_test_lock() is called either before or after posix_lock_file() to
    deal with a denied lock request due to a conflicting lock.

    Remove the race condition for the NFSv4 server by adding a new conflicting
    lock parameter to __posix_lock_file(), changing the name to
    __posix_lock_file_conf().

    Keep the posix_lock_file() interface and add a posix_lock_file_conf()
    interface; both call __posix_lock_file_conf().

    [akpm@osdl.org: Put the EXPORT_SYMBOL() where it belongs]
    Signed-off-by: Andy Adamson
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Adamson
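
The idea of returning the conflicting lock from the locking call itself, instead of issuing a separate test call, can be sketched as below. This is a single-threaded toy, not the fs/locks.c code: toy_lock, toy_lock_table and the toy_* functions are hypothetical, and the locking a real implementation would hold around the table is only implied.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_LOCKS 8

/* Hypothetical byte-range lock; start and end are inclusive. */
struct toy_lock {
    int owner;
    long start;
    long end;
    bool exclusive;
};

struct toy_lock_table {
    struct toy_lock locks[TOY_MAX_LOCKS];
    size_t nr;
};

/* Two locks conflict if owners differ, ranges overlap, and one is exclusive. */
static bool toy_conflicts(const struct toy_lock *a, const struct toy_lock *b)
{
    if (a->owner == b->owner)
        return false;
    if (!a->exclusive && !b->exclusive)
        return false;
    return a->start <= b->end && b->start <= a->end;
}

/*
 * Try to take the lock. On conflict return -1 and, if conflict is non-NULL,
 * copy the conflicting lock out in the same step, so the caller never needs
 * a separate "test" call that could observe a different state.
 * Returns -2 if the table is full, 0 on success.
 */
static int toy_lock_file_conf(struct toy_lock_table *t,
                              const struct toy_lock *req,
                              struct toy_lock *conflict)
{
    for (size_t i = 0; i < t->nr; i++) {
        if (toy_conflicts(&t->locks[i], req)) {
            if (conflict)
                *conflict = t->locks[i];
            return -1;
        }
    }
    if (t->nr == TOY_MAX_LOCKS)
        return -2;
    t->locks[t->nr++] = *req;
    return 0;
}

/* The old-style interface is kept as a thin wrapper. */
static int toy_lock_file(struct toy_lock_table *t, const struct toy_lock *req)
{
    return toy_lock_file_conf(t, req, NULL);
}
```

Keeping the old entry point as a wrapper mirrors the commit's approach of having both interfaces call one common function.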
     
  • BUG instead of handling a case that should never happen.

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    J. Bruce Fields
     
  • While hunting with oprofile on an SMP platform, I discovered that dentry
    lookups were slowed down because d_hash_mask, d_hash_shift and
    dentry_hashtable were in a cache line that contained inodes_stat. So each
    time inodes_stat is changed by a cpu, other cpus have to refill their cache
    line.

    This patch moves some variables to the __read_mostly section, in order to
    avoid false sharing. RCU dentry lookups can go full speed.

    Signed-off-by: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
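
The false-sharing effect can be illustrated with struct layouts. Note the swapped-in technique: the kernel's __read_mostly moves variables into a separate section, whereas this userspace sketch uses C11 alignas to push the frequently written field onto its own cache line. The 64-byte line size and all toy_* names are assumptions.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define TOY_CACHE_LINE 64  /* assumed cache line size in bytes */

/* Read-mostly lookup parameters share a line with a hot counter. */
struct toy_packed {
    unsigned int hash_mask;   /* read on every lookup */
    unsigned int hash_shift;  /* read on every lookup */
    long nr_inodes;           /* written frequently */
};

/* Same fields, but the hot counter starts on its own cache line. */
struct toy_split {
    unsigned int hash_mask;
    unsigned int hash_shift;
    alignas(TOY_CACHE_LINE) long nr_inodes;
};

/* Index of the cache line that a given struct offset falls into. */
static size_t toy_cache_line_of(size_t offset)
{
    return offset / TOY_CACHE_LINE;
}
```

In the packed layout every write to the counter invalidates the line holding the lookup parameters on other CPUs; in the split layout it does not.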
     
  • The return value of this function is never used, so let's be honest and
    declare it as void.

    Some places where invalidatepage returned 0, I have inserted comments
    suggesting a BUG_ON.

    [akpm@osdl.org: JBD BUG fix]
    [akpm@osdl.org: rework for git-nfs]
    [akpm@osdl.org: don't go BUG in block_invalidate_page()]
    Signed-off-by: Neil Brown
    Acked-by: Dave Kleikamp
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • The only user ignores the return value, and the only instance
    (block_sync_page) always returns 0...

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Semaphore to mutex conversion.

    The conversion was generated via scripts, and the result was validated
    automatically via a script as well.

    Signed-off-by: Ingo Molnar
    Cc: Eric Van Hensbergen
    Cc: Robert Love
    Cc: Thomas Gleixner
    Cc: David Woodhouse
    Cc: Neil Brown
    Cc: Trond Myklebust
    Cc: Dave Kleikamp
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • It has been discovered that remove_proc_entry has a race when removing
    entries in the proc file system that are siblings. There's no protection
    around the traversal and removal of elements that belong to the same
    subdirectory.

    This subdirectory list is protected in other areas by the BKL. So the BKL was
    at first used to protect this area too, but unfortunately, remove_proc_entry
    may be called with spinlocks held. The BKL may schedule, so this was not a
    solution.

    The final solution was to add a new global spin lock to protect this list,
    called proc_subdir_lock. This lock now protects the list in
    remove_proc_entry, and I also went looking for other places where this
    list is modified and added this protection there too. Care must be taken
    since these locations call several functions that may also schedule.

    Since I don't see any location where the functions that modify the
    subdirectory list are called from interrupts, the irqsave/restore versions
    of the spin lock were _not_ used.

    Signed-off-by: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     

26 Mar, 2006

15 commits

  • * 'audit.b3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current: (22 commits)
    [PATCH] fix audit_init failure path
    [PATCH] EXPORT_SYMBOL patch for audit_log, audit_log_start, audit_log_end and audit_format
    [PATCH] sem2mutex: audit_netlink_sem
    [PATCH] simplify audit_free() locking
    [PATCH] Fix audit operators
    [PATCH] promiscuous mode
    [PATCH] Add tty to syscall audit records
    [PATCH] add/remove rule update
    [PATCH] audit string fields interface + consumer
    [PATCH] SE Linux audit events
    [PATCH] Minor cosmetic cleanups to the code moved into auditfilter.c
    [PATCH] Fix audit record filtering with !CONFIG_AUDITSYSCALL
    [PATCH] Fix IA64 success/failure indication in syscall auditing.
    [PATCH] Miscellaneous bug and warning fixes
    [PATCH] Capture selinux subject/object context information.
    [PATCH] Exclude messages by message type
    [PATCH] Collect more inode information during syscall processing.
    [PATCH] Pass dentry, not just name, in fsnotify creation hooks.
    [PATCH] Define new range of userspace messages.
    [PATCH] Filter rule comparators
    ...

    Fixed trivial conflict in security/selinux/hooks.c

    Linus Torvalds
     
  • * git://git.linux-nfs.org/pub/linux/nfs-2.6: (103 commits)
    SUNRPC,RPCSEC_GSS: spkm3--fix config dependencies
    SUNRPC,RPCSEC_GSS: spkm3: import contexts using NID_cast5_cbc
    LOCKD: Make nlmsvc_traverse_shares return void
    LOCKD: nlmsvc_traverse_blocks return is unused
    SUNRPC,RPCSEC_GSS: fix krb5 sequence numbers.
    NFSv4: Dont list system.nfs4_acl for filesystems that don't support it.
    SUNRPC,RPCSEC_GSS: remove unnecessary kmalloc of a checksum
    SUNRPC: Ensure rpc_call_async() always calls tk_ops->rpc_release()
    SUNRPC: Fix memory barriers for req->rq_received
    NFS: Fix a race in nfs_sync_inode()
    NFS: Clean up nfs_flush_list()
    NFS: Fix a race with PG_private and nfs_release_page()
    NFSv4: Ensure the callback daemon flushes signals
    SUNRPC: Fix a 'Busy inodes' error in rpc_pipefs
    NFS, NLM: Allow blocking locks to respect signals
    NFS: Make nfs_fhget() return appropriate error values
    NFSv4: Fix an oops in nfs4_fill_super
    lockd: blocks should hold a reference to the nlm_file
    NFSv4: SETCLIENTID_CONFIRM should handle NFS4ERR_DELAY/NFS4ERR_RESOURCE
    NFSv4: Send the delegation stateid for SETATTR calls
    ...

    Linus Torvalds
     
  • 8MB is not really very random, use 1GB (or more with larger page sizes)
    instead.

    Also use the low bits of the random generator output now instead of
    throwing them away.

    Only enabled on x86-64 right now. Other architectures need to add
    a suitable STACK_RND_MASK

    Cc: mingo@elte.hu
    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Andi Kleen
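
The arithmetic behind the 8MB and 1GB ranges can be worked through with a page-granular mask. The mask values below are derived from those sizes under an assumed 4KB page size; they are not quoted from the patch, and the toy_* names are hypothetical.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12          /* assumed 4KB pages */
#define TOY_MASK_8MB   0x7ffUL     /* 2048 pages   * 4KB = 8MB */
#define TOY_MASK_1GB   0x3ffffUL   /* 262144 pages * 4KB = 1GB */

/* Keep the low bits of the random value and scale them to whole pages. */
static unsigned long toy_stack_rnd_offset(unsigned long random,
                                          unsigned long mask)
{
    return (random & mask) << TOY_PAGE_SHIFT;
}

/* Total span, in bytes, that a given mask can cover. */
static unsigned long toy_stack_rnd_range(unsigned long mask)
{
    return (mask + 1) << TOY_PAGE_SHIFT;
}
```

With larger pages the same mask covers proportionally more address space, which matches the "or more with larger page sizes" remark.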
     
  • * 'upstream-linus' of git://oss.oracle.com/home/sourcebo/git/ocfs2:
    ocfs2: finally remove MLF* macros
    ocfs2: don't use MLF* in the file system
    ocfs2: don't use MLF* in dlm/ files
    ocfs2: don't use MLF* in cluster/ files
    [PATCH] ocfs2: dlm recovery fixes
    [PATCH] ocfs2: fix hang in dlm lock resource mastery
    ocfs2: use __attribute__ format

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (21 commits)
    BUG_ON() Conversion in drivers/video/
    BUG_ON() Conversion in drivers/parisc/
    BUG_ON() Conversion in drivers/block/
    BUG_ON() Conversion in sound/sparc/cs4231.c
    BUG_ON() Conversion in drivers/s390/block/dasd.c
    BUG_ON() Conversion in lib/swiotlb.c
    BUG_ON() Conversion in kernel/cpu.c
    BUG_ON() Conversion in ipc/msg.c
    BUG_ON() Conversion in block/elevator.c
    BUG_ON() Conversion in fs/coda/
    BUG_ON() Conversion in fs/binfmt_elf_fdpic.c
    BUG_ON() Conversion in input/serio/hil_mlc.c
    BUG_ON() Conversion in md/dm-hw-handler.c
    BUG_ON() Conversion in md/bitmap.c
    The comment describing how MS_ASYNC works in msync.c is confusing
    rcu: undeclared variable used in documentation
    fix typos "wich" -> "which"
    typo patch for fs/ufs/super.c
    Fix simple typos
    tabify drivers/char/Makefile
    ...

    Linus Torvalds
     
  • In binfmt_flat.c, the flat binary loader should check file descriptor table
    and install the fd on the file.

    Convert the function to single-exit and fix this bug.

    Signed-off-by: "Luke Yang"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luke Yang
     
  • nr_segs is unsigned long and thus cannot be negative. We checked against 0
    a few lines before.

    Signed-off-by: Carsten Otte
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Carsten Otte
     
  • Local variable i is unsigned int and thus cannot be negative.

    (akpm: unsigneds shouldn't be called `i'. This value cannot possibly be
    negative anyway).

    Signed-off-by: Carsten Otte
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Carsten Otte
     
  • There is a bug in direct-io in propagating write errors up to the higher I/O
    layer. When performing an async O_DIRECT write to a block device, if a
    device error occurred (like a media error or the disk being pulled), the
    error code is only propagated from the device driver to the DIO layer. The
    error code stops at finished_one_bio(). The async write, however, is
    supposed to have a corresponding AIO event with the appropriate return code
    (in this case -EIO). An application which waits on the async write event
    will hang forever since that AIO event is lost forever (if the app did not
    use the timeout option in the io_getevents call; regardless, an AIO event
    is lost).

    The discovery of the above bug led to the discovery of a potential race
    window with dio->result. The fundamental problem is that dio->result is
    overloaded with dual use: an indicator of the fall back path for a partial
    dio write, and an error indicator used in the I/O completion path. In the
    event of a device error, the setting of -EIO in dio->result clashes with
    the value used to track the partial write that activates the fall back path.

    It was also pointed out that it is impossible to use dio->result to track
    a partial write and at the same time to track an error returned from the
    device driver, because direct_io_work can only determine whether it is a
    partial write at the end of io submission, and in the middle of that io
    submission a return code could be coming back from the driver, thus
    messing up all the subsequent logic.

    The proposed fix is to separate the error code returned by the IO
    completion path from partial IO submit tracking. A new variable is added to
    the dio structure specifically to track the io error returned in the
    completion path.

    Signed-off-by: Ken Chen
    Acked-by: Zach Brown
    Acked-by: Suparna Bhattacharya
    Cc: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
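
The separation of the two roles can be sketched with a toy structure. This is an assumption-laden illustration rather than the fs/direct-io.c change: toy_dio, its io_error field, and the helper functions are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical cut-down dio state with the two roles kept apart. */
struct toy_dio {
    long result;   /* submission side: bytes submitted (short => fall back) */
    int io_error;  /* completion side: first device error, 0 if none */
};

/* Completion path: record the first error without touching result. */
static void toy_bio_complete(struct toy_dio *dio, int error)
{
    if (error && !dio->io_error)
        dio->io_error = error;
}

/* Value to report in the completion event: the error wins if one occurred. */
static long toy_dio_final(const struct toy_dio *dio)
{
    return dio->io_error ? dio->io_error : dio->result;
}

/* Fall back only for a clean but short submission. */
static int toy_needs_fallback(const struct toy_dio *dio, long requested)
{
    return dio->io_error == 0 && dio->result >= 0 && dio->result < requested;
}
```

Because the completion path writes only io_error, a device error arriving mid-submission can no longer be mistaken for, or overwritten by, the partial-write bookkeeping.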
     
  • As Pekka Enberg pointed out, with the if still following the else, you can
    still get a null uid written to the disk if you specify a default uid= without
    uid=forget. In other words, if the desktop user is uid=1000 and the mount
    option uid=1000 is given (which is done automatically on ubuntu and probably
    other distributions that use hal), then if any user other than uid 1000
    owns a file, a 0 will be written to the media as the owning uid instead.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Phillip Susi
     
  • Signed-off-by: Oliver Neukum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oliver Neukum
     
  • Correct some error handling in the compat version of the nfsservctl()
    system call. It was detecting errors while copying in the arguments from
    user space, but then attempting to use the arguments anyway. This didn't
    seem so good.

    Signed-off-by: Peter Staubach
    Cc: Trond Myklebust
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Staubach
     
  • If I mount ext2 "rw", I want it to say "rw", not "rw,nogrpid".

    I caught this writing an automated regression test script for the busybox
    mount command. The symptom is
    /dev/loop0 on /images/ext2.dir type ext2 (rw,nogrpid)
    instead of:
    /dev/loop0 on /images/ext2.dir type ext2 (rw)

    The behavior was introduced by git commit
    8fc2751beb0941966d3a97b26544e8585e428c08.

    Signed-off-by: Rob Landley
    Cc: Mark Bellon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rob Landley
     
  • As prepare_write, commit_write and readpage are allowed to return
    AOP_TRUNCATED_PAGE, page_symlink should respond to them.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • fs/reiserfs/item_ops.c: In function 'indirect_print_item':
    fs/reiserfs/item_ops.c:278: warning: 'num' may be used uninitialized in this function

    (akpm: this is probably just gcc being dumb)

    Signed-off-by: Benoit Boissinot
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benoit Boissinot