22 May, 2007

1 commit

  • The first thing mm.h does is include sched.h, solely for the can_do_mlock()
    inline function, which dereferences "current". Once can_do_mlock() is dealt
    with, mm.h can be detached from sched.h, which is good. See below for why.

    This patch
    a) removes unconditional inclusion of sched.h from mm.h
    b) makes can_do_mlock() a normal function in mm/mlock.c
    c) exports can_do_mlock() to not break compilation
    d) adds sched.h inclusions back to files that were getting it indirectly.
    e) adds less bloated headers to some files (asm/signal.h, jiffies.h) that were
    getting them indirectly

    Net result is:
    a) mm.h users would get less code to open, read, preprocess, parse, ... if
    they don't need sched.h
    b) sched.h stops being a dependency for a significant number of files:
    on x86_64 allmodconfig, touching sched.h results in a recompile of 4083 files;
    after the patch it's only 3744 (-8.3%).

    Cross-compile tested on

    all arm defconfigs, all mips defconfigs, all powerpc defconfigs,
    alpha alpha-up
    arm
    i386 i386-up i386-defconfig i386-allnoconfig
    ia64 ia64-up
    m68k
    mips
    parisc parisc-up
    powerpc powerpc-up
    s390 s390-up
    sparc sparc-up
    sparc64 sparc64-up
    um-x86_64
    x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig

    as well as my two usual configs.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
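
As a rough illustration of why moving can_do_mlock() out of line decouples the headers, here is a minimal user-space sketch. The names demo_task and demo_can_do_mlock and the two fields are hypothetical stand-ins, not the kernel's definitions. The point is only that callers need a forward declaration and a prototype, while the full structure layout is visible solely where the function is defined.

```c
/* What a "light" header would expose: an opaque type and a prototype. */
struct demo_task;
int demo_can_do_mlock(const struct demo_task *t);

/* What only the implementation file needs to see (hypothetical fields). */
struct demo_task {
    int has_ipc_lock_cap;          /* stand-in for a capability check */
    unsigned long memlock_limit;   /* stand-in for a memlock resource limit */
};

/* Out-of-line definition: the only place that dereferences the task. */
int demo_can_do_mlock(const struct demo_task *t)
{
    if (t->has_ipc_lock_cap)
        return 1;
    if (t->memlock_limit != 0)
        return 1;
    return 0;
}
```

Code that only calls demo_can_do_mlock() compiles against the first two declarations, so a change to the structure layout does not force it to be rebuilt.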
     

17 May, 2007

2 commits


15 May, 2007

4 commits

  • - fs/nfs/nfs4xdr.c:2499:42: warning: incorrect type in argument 2
    (different signedness)
    - fs/nfs/nfs4xdr.c:2658:49: warning: incorrect type in argument 4
    (different explicit signedness)
    - fs/nfs/nfs4xdr.c:2683:50: warning: incorrect type in argument 4
    (different explicit signedness)
    - fs/nfs/nfs4xdr.c:3063:68: warning: incorrect type in argument 4
    (different explicit signedness)
    - fs/nfs/nfs4xdr.c:3065:68: warning: incorrect type in argument 4
    (different explicit signedness)

    - fs/nfs/callback_xdr.c:138:31: warning: incorrect type in argument 2
    (different signedness)

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • - fs/nfs/dir.c:610:8: warning: symbol 'nfs_llseek_dir' was not declared.
    Should it be static?
    - fs/nfs/dir.c:636:5: warning: symbol 'nfs_fsync_dir' was not declared.
    Should it be static?
    - fs/nfs/write.c:925:19: warning: symbol 'req' shadows an earlier one
    - fs/nfs/write.c:61:6: warning: symbol 'nfs_commit_rcu_free' was not
    declared. Should it be static?
    - fs/nfs/nfs4proc.c:793:5: warning: symbol 'nfs4_recover_expired_lease'
    was not declared. Should it be static?

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • The XDR code should not depend on the physical allocation size of
    structures like nfs4_stateid and nfs4_verifier since those may have to
    change at some future date. We therefore replace all uses of
    sizeof() with constants like NFS4_VERIFIER_SIZE and NFS4_STATEID_SIZE.

    This also has the side-effect of fixing some warnings of the type
    format ‘%u’ expects type ‘unsigned int’, but argument X has type
    ‘long unsigned int’
    on 64-bit systems

    Signed-off-by: Trond Myklebust

    Trond Myklebust
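
The idea in the commit above can be sketched in user-space C. This is an illustration under stated assumptions, not the kernel code: struct demo_stateid, its debug_cookie field, and both helper functions are hypothetical. Only the two size constants are taken from the commit message, and their values here are assumed to be the protocol sizes of 8 and 16 bytes.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Wire sizes fixed by the protocol (values assumed for this sketch). */
#define NFS4_VERIFIER_SIZE 8
#define NFS4_STATEID_SIZE  16

/* Hypothetical in-memory layout: it may grow without changing the wire format. */
struct demo_stateid {
    unsigned char data[NFS4_STATEID_SIZE];
    unsigned long debug_cookie;    /* extra state that is never sent */
};

/* Encode exactly the wire-sized part; returns the number of bytes written. */
static unsigned int demo_encode_stateid(unsigned char *out,
                                        const struct demo_stateid *sid)
{
    /* Using sizeof(*sid) here would also copy debug_cookie and padding. */
    memcpy(out, sid->data, NFS4_STATEID_SIZE);
    return NFS4_STATEID_SIZE;
}

/* sizeof() yields size_t and would need %zu; the constant fits %u. */
static int demo_format_len(char *buf, size_t buflen)
{
    return snprintf(buf, buflen, "%u", (unsigned int)NFS4_STATEID_SIZE);
}
```

Because the encoder never consults sizeof(struct demo_stateid), adding fields to the structure leaves the encoded length unchanged.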
     
  • Use zero_user_page() instead of the newly deprecated memclear_highpage_flush().

    Signed-off-by: Nate Diller
    Cc: Trond Myklebust
    Cc: "J. Bruce Fields"
    Signed-off-by: Andrew Morton
    Signed-off-by: Trond Myklebust

    Nate Diller
     

10 May, 2007

6 commits


09 May, 2007

2 commits


08 May, 2007

3 commits

  • * 'server-cluster-locking-api' of git://linux-nfs.org/~bfields/linux:
    gfs2: nfs lock support for gfs2
    lockd: add code to handle deferred lock requests
    lockd: always preallocate block in nlmsvc_lock()
    lockd: handle test_lock deferrals
    lockd: pass cookie in nlmsvc_testlock
    lockd: handle fl_grant callbacks
    lockd: save lock state on deferral
    locks: add fl_grant callback for asynchronous lock return
    nfsd4: Convert NFSv4 to new lock interface
    locks: add lock cancel command
    locks: allow {vfs,posix}_lock_file to return conflicting lock
    locks: factor out generic/filesystem switch from setlock code
    locks: factor out generic/filesystem switch from test_lock
    locks: give posix_test_lock same interface as ->lock
    locks: make ->lock release private data before returning in GETLK case
    locks: create posix-to-flock helper functions
    locks: trivial removal of unnecessary parentheses

    Linus Torvalds
     
  • I have never seen a use of SLAB_DEBUG_INITIAL. It is only supported by
    SLAB.

    I think its purpose was to have a callback after an object has been freed
    to verify that the state is the constructor state again? The callback is
    performed before each freeing of an object.

    I would think that it is much easier to check the object state manually
    before the free. That also places the check near the code that
    manipulates the object.

    Also the SLAB_DEBUG_INITIAL callback is only performed if the kernel was
    compiled with SLAB debugging on. If there were code in a constructor
    handling SLAB_DEBUG_INITIAL then it would have to be conditional on
    SLAB_DEBUG, otherwise it would just be dead code. But there is no such code
    in the kernel. I think SLAB_DEBUG_INITIAL is too problematic to make real
    use of, difficult to understand, and there are easier ways to accomplish the
    same effect (i.e. add debug code before kfree).

    There is a related flag SLAB_CTOR_VERIFY that is frequently checked to be
    clear in fs inode caches. Remove the pointless checks (they would even be
    pointless without removal of SLAB_DEBUG_INITIAL) from the fs constructors.

    This is the last slab flag that SLUB did not support. Remove the check for
    unimplemented flags from SLUB.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
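
The alternative the commit message suggests, checking the object state by hand just before freeing it, might look like the following user-space sketch. Everything here is hypothetical: struct demo_obj, its fields, and the demo_* helpers are invented for illustration, with malloc()/free() and assert() standing in for a kernel allocator and a kernel-side assertion.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical object whose constructor state is "no references, not listed". */
struct demo_obj {
    int refcount;
    int on_list;
};

/* Constructor: establishes the state every free object is expected to have. */
static void demo_ctor(struct demo_obj *obj)
{
    obj->refcount = 0;
    obj->on_list = 0;
}

/* The manual check that replaces a debug callback from the allocator. */
static int demo_in_ctor_state(const struct demo_obj *obj)
{
    return obj->refcount == 0 && !obj->on_list;
}

static struct demo_obj *demo_alloc(void)
{
    struct demo_obj *obj = malloc(sizeof(*obj));

    if (obj)
        demo_ctor(obj);
    return obj;
}

/* Verify the state right next to the code that releases the object. */
static void demo_free(struct demo_obj *obj)
{
    assert(demo_in_ctor_state(obj));
    free(obj);
}
```

The check lives beside the code that manipulates the object, and it disappears entirely when assertions are compiled out.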
     
  • Ensure pages are uptodate after returning from read_cache_page, which allows
    us to cut out most of the filesystem-internal PageUptodate calls.

    I didn't have a great look down the call chains, but this appears to fix 7
    possible use-before-uptodate cases in hfs, 2 in hfsplus, 1 in jfs, a few in
    ecryptfs, 1 in jffs2, and a possible overwrite of cleared data by readpage in
    block2mtd. All of these depend on whether the filler is async and/or can
    return with a !uptodate page.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

07 May, 2007

2 commits


05 May, 2007

1 commit


02 May, 2007

2 commits


01 May, 2007

15 commits

  • Try running this script in an NFS mounted directory (Client relatively
    recent - 2.6.18 has the problem as does 2.6.20).

    ------------------------------------------------------
    #!/bin/bash
    #
    # This script will produce the following error message from tar:
    #
    # tar: newdir/innerdir/innerfile: file changed as we read it

    # create dirs
    rm -rf nfstest
    mkdir -p nfstest/dir/innerdir

    # create files (should not be empty)
    echo "Hello World!" >nfstest/dir/file
    echo "Hello World!" >nfstest/dir/innerdir/innerfile

    # problem only happens if we sleep before chmod
    sleep 1

    # change file modes
    chmod -R a+r nfstest

    # rename dir
    mv nfstest/dir nfstest/newdir

    # tar it
    tar -cf nfstest/nfstest.tar -C nfstest newdir

    # restore old dir name
    mv nfstest/newdir nfstest/dir
    --------------------------------------------------------

    What happens:

    The 'chmod -R' does a readdir_plus in each directory and the results
    get cached in the page cache. It then updates the ctime on each file
    by one second. When this happens, the post-op attributes are used to
    update the ctime stored on the client to match the value in the kernel.

    The 'mv' calls shrink_dcache_parent on the directory tree which
    flushes all the dentries (so a new lookup will be required) but
    doesn't flush the inodes or pagecache.

    The 'tar' does a readdir on each directory, but (in the case of
    'innerdir' at least) satisfies it from the pagecache and uses the
    READDIRPLUS data to update all the inodes. In the case of
    'innerdir/innerfile', the ctime is out of date.

    'tar' then calls 'lstat' on innerdir/innerfile getting an old ctime.
    It then opens the file (triggering a GETATTR), reads the content, and
    then calls fstat to see if anything has changed. It finds that ctime
    has changed and so complains.

    The problem seems to be that the cached readdirplus info is kept around
    for too long.

    My patch below discards pagecache data for directories when
    dentry_iput is called on them. This effectively removes the symptom
    which convinces me that I correctly understand the problem. However
    I'm not convinced that is a proper solution, as there could easily be
    other races that trigger the same problem without being affected by
    this 'fix'.

    One possibility would be to require that readdirplus pagecache data be
    only used *once* to instantiate an inode. Somehow it should then be
    invalidated so that if the dentry subsequently disappears, it will
    cause a new request to the server to fill in the stat data.

    Another possibility is to compare the cache_change_attribute on the
    inode with something similar for the readdirplus info and reject the
    info from readdirplus if it is too old.

    I haven't tried to implement these and would value other opinions
    before I do.

    Thanks,
    NeilBrown

    Signed-off-by: Neil Brown
    Signed-off-by: Trond Myklebust

    Neil Brown
     
  • Don't use an uninitialised value for fattr->time_start in readdirplus results.

    The 'fattr' structure filled in by nfs3_decode_dirent does not get a
    value for ->time_start set.
    Thus if an entry is for an inode that we already have in cache,
    when nfs_readdir_lookup calls nfs_fhget, it will call nfs_refresh_inode
    and may update the inode with out-of-date information.

    Directories are read a page at a time, so each page could have a
    different timestamp that "should" be used to set the time_start for
    the fattr for info in that page. However, storing the timestamp per
    page is awkward. (We could stick it in the first 4 bytes and only read 4092
    bytes, but that is a bigger code change than I am interested in.)

    This patch ignores the readdir_plus attributes if a readdir finds the
    information already in cache, and otherwise sets ->time_start to the time
    the readdir request was sent to the server.

    It might be nice to store - in the directory inode - the time stamp for
    the earliest readdir request that is still in the page cache, so that we
    don't ignore attribute data that we don't have to. This patch doesn't do
    that.

    Signed-off-by: Neil Brown
    Signed-off-by: Trond Myklebust

    Neil Brown
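
The policy described in the commit above can be sketched as follows. This is a simplified model, not the kernel's data structures: struct demo_fattr, struct demo_inode, and demo_apply_readdirplus are hypothetical, and timestamps are plain integers.

```c
/* Hypothetical attribute record as decoded from a readdirplus entry. */
struct demo_fattr {
    unsigned long time_start;   /* when the request producing it was sent */
    unsigned long ctime;
};

/* Hypothetical cached inode. */
struct demo_inode {
    int cached;                 /* nonzero if attributes are already cached */
    unsigned long ctime;
};

/*
 * Apply readdirplus attributes to an inode.
 * Returns 1 if the attributes were used, 0 if they were ignored.
 */
static int demo_apply_readdirplus(struct demo_inode *inode,
                                  struct demo_fattr *fattr,
                                  unsigned long request_sent)
{
    /* Already in cache: the readdirplus data may be stale, so skip it. */
    if (inode->cached)
        return 0;

    /* Otherwise stamp the attributes with the time the request was sent. */
    fattr->time_start = request_sent;
    inode->ctime = fattr->ctime;
    inode->cached = 1;
    return 1;
}
```

An inode that is already cached keeps its attributes no matter what the directory page says, which is the conservative behaviour the message describes.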
     
  • READDIRPLUS can be a performance hindrance when the client is working with
    large directories. In addition, some servers still have bugs in their
    implementations (e.g. Tru64 returns wrong values for the fsid).

    Add a mount flag that lets users turn it off at mount time, following the
    implementation in Apple's NFS client.

    Signed-off-by: Steve Dickson
    Signed-off-by: Trond Myklebust

    Steve Dickson
     
  • It is arguable whether NFSROOT will support IPv6, and thus whether
    rpcb_getport_external needs to support rpcbind versions greater than 2.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • The RPC buffer size estimation logic in net/sunrpc/clnt.c always
    significantly overestimates the requirements for the buffer size.
    A little instrumentation demonstrated that in fact rpc_malloc was never
    allocating the buffer from the mempool, but almost always called kmalloc.

    To compute the size of the RPC buffer more precisely, split p_bufsiz into
    two fields; one for the argument size, and one for the result size.

    Then, compute the sum of the exact call and reply header sizes, and split
    the RPC buffer precisely between the two. That should keep almost all RPC
    buffers within the 2KiB buffer mempool limit.

    And, we can finally be rid of RPC_SLACK_SPACE!

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
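
The sizing scheme described above, separate argument and result lengths plus exact header sizes, can be sketched like this. The structure and function names are hypothetical, lengths are assumed to be counted in 4-byte XDR words, and the header sizes are passed in rather than derived, so this shows only the arithmetic.

```c
#include <stddef.h>

/* The 2KiB mempool limit mentioned in the commit message. */
#define DEMO_MEMPOOL_BUFSIZE 2048u

/* Hypothetical per-procedure sizes, in 4-byte XDR words. */
struct demo_procinfo {
    unsigned int arglen;   /* maximum argument size */
    unsigned int replen;   /* maximum result size */
};

/* Byte sizes of the send and receive halves of one RPC buffer. */
struct demo_bufsizes {
    size_t callsize;
    size_t rcvsize;
};

/* Split the buffer exactly: header plus body for each direction. */
static struct demo_bufsizes demo_rpc_bufsizes(const struct demo_procinfo *p,
                                              unsigned int callhdr_words,
                                              unsigned int rephdr_words)
{
    struct demo_bufsizes s;

    s.callsize = (size_t)(callhdr_words + p->arglen) * 4;
    s.rcvsize = (size_t)(rephdr_words + p->replen) * 4;
    return s;
}

/* True if both halves together fit in a mempool buffer. */
static int demo_fits_mempool(const struct demo_bufsizes *s)
{
    return s->callsize + s->rcvsize <= DEMO_MEMPOOL_BUFSIZE;
}
```

With a 12-word call header, an 8-word reply header, and 10 and 20 words of arguments and results, the two halves come to 88 and 112 bytes, well within the pool buffer.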
     
  • Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • It has no business touching wbc->pages_skipped.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Currently we do write coalescing in a very inefficient manner: one pass in
    generic_writepages() in order to lock the pages for writing, then one pass
    in nfs_flush_mapping() and/or nfs_sync_mapping_wait() in order to gather
    the locked pages for coalescing into RPC requests of size "wsize".

    In fact, it turns out there is actually a deadlock possible here since we
    only start I/O on the second pass. If the user signals the process while
    we're in nfs_sync_mapping_wait(), for instance, then we may exit before
    starting I/O on all the requests that have been queued up.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Do the coalescing of read requests into block sized requests at start of
    I/O as we scan through the pages instead of going through a second pass.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
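
Single-pass coalescing of the kind described above can be modelled with a small descriptor that accumulates contiguous pages and issues a request as soon as a run ends or reaches its size limit. This is a sketch with hypothetical names (demo_desc, demo_add_page, demo_flush). It counts requests instead of performing I/O and expresses the limit in pages.

```c
#include <stddef.h>

/* Hypothetical coalescing state carried along while scanning pages. */
struct demo_desc {
    unsigned long next_index;  /* page index that would extend the run */
    size_t count;              /* pages gathered in the current run */
    size_t max;                /* maximum pages per request */
    size_t issued;             /* requests issued so far */
};

/* Issue the current run, if any, as one request. */
static void demo_flush(struct demo_desc *d)
{
    if (d->count) {
        d->issued++;
        d->count = 0;
    }
}

/* Add one page; start I/O immediately when the run cannot be extended. */
static void demo_add_page(struct demo_desc *d, unsigned long index)
{
    if (d->count && (index != d->next_index || d->count == d->max))
        demo_flush(d);
    d->count++;
    d->next_index = index + 1;
}
```

The caller scans its pages once, calling demo_add_page() for each, and finishes with a single demo_flush(), so I/O for earlier runs has already started by the time later pages are examined.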
     
  • Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • It is redundant, and will interfere with the call to
    balance_dirty_pages_ratelimited_nr in generic_file_write().

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • The nfs statfs function returns a success code on error, and fills the
    output buffer with invalid values. The attached patch makes it return a
    correct error code instead.

    Signed-off-by: Amnon Aaronsohn
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Trond Myklebust
    (Modified patch to reinstate the dprintk())

    Amnon Aaronsohn
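
The class of bug fixed above, reporting success and filling the output buffer after a failed request, and its remedy can be sketched as follows. The names demo_statfs and demo_query_server and the structure fields are hypothetical. A flag simulates the server failure.

```c
#include <errno.h>

/* Hypothetical subset of the statistics a statfs call returns. */
struct demo_statfs {
    unsigned long long blocks;
    unsigned long long bfree;
};

/* Hypothetical backend: fails with a negative errno when 'fail' is set. */
static int demo_query_server(int fail, struct demo_statfs *res)
{
    if (fail)
        return -EIO;
    res->blocks = 1000;
    res->bfree = 250;
    return 0;
}

/* Propagate the error and leave the caller's buffer untouched on failure. */
static int demo_statfs(int fail, struct demo_statfs *buf)
{
    struct demo_statfs res;
    int error = demo_query_server(fail, &res);

    if (error < 0)
        return error;
    *buf = res;
    return 0;
}
```

Filling a local structure first and copying it out only on success keeps invalid values from ever reaching the caller.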
     
  • Be more careful about testing page->mapping.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     

21 Apr, 2007

2 commits

  • Protect nfs_set_page_dirty() against races with nfs_inode_add_request.

    Signed-off-by: Trond Myklebust
    Signed-off-by: Linus Torvalds

    Trond Myklebust
     
  • Redirtying a request that is already marked for commit will screw up the
    accounting for NR_UNSTABLE_NFS as well as nfs_i.ncommit.
    Ensure that all requests on the commit queue are labelled with the
    PG_NEED_COMMIT flag, and avoid moving them onto the dirty list inside
    nfs_page_mark_flush().

    Also inline nfs_mark_request_dirty() into nfs_page_mark_flush() for
    atomicity reasons. Avoid dropping the spinlock until we're done marking the
    request in the radix tree and have added it to the ->dirty list.

    Signed-off-by: Trond Myklebust
    Signed-off-by: Linus Torvalds

    Trond Myklebust