29 Sep, 2007

1 commit

  • It doesn't look as if the NFS file name limit is being initialised correctly
    in the struct nfs_server. Make sure that we limit whatever is being set in
    nfs_probe_fsinfo() and nfs_init_server().

    Also ensure that readdirplus and nfs4_path_walk respect our file name
    limits.

    Signed-off-by: Trond Myklebust
    Signed-off-by: Linus Torvalds

    Trond Myklebust
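
The limiting step described in this commit can be sketched in user-space C. This is a minimal illustration, not the kernel code: the helper names `demo_clamp_namelen` and `demo_name_fits` are hypothetical, and the assumption is simply that a server-reported limit of zero or one above the protocol ceiling falls back to the ceiling.

```c
#include <stddef.h>

/* Hypothetical helper: pick the file name limit to store for a server.
 * A reported value of 0 (unset) or one above the protocol ceiling is
 * replaced by the ceiling; anything else is kept as reported. */
unsigned int demo_clamp_namelen(unsigned int reported, unsigned int proto_max)
{
    if (reported == 0 || reported > proto_max)
        return proto_max;
    return reported;
}

/* Hypothetical check a directory-reading or path-walking routine could
 * apply to each name before accepting it. Returns nonzero if it fits. */
int demo_name_fits(size_t name_len, unsigned int namelen_limit)
{
    return name_len <= namelen_limit;
}
```

Clamping once at setup time means later checks only need to compare a name length against a single stored limit.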
     

20 Sep, 2007

1 commit

  • NFS unregisters sysctls only if V4 support is compiled in. However, the
    sysctl table is not V4-specific, so always unregister it.

    Steps to reproduce:

    [build nfs.ko with CONFIG_NFS_V4=n]
    modprobe nfs
    rmmod nfs
    ls /proc/sys

    Unable to handle kernel paging request at ffffffff880661c0 RIP:
    [] proc_sys_readdir+0xd3/0x350
    PGD 203067 PUD 207063 PMD 7e216067 PTE 0
    Oops: 0000 [1] SMP
    CPU 1
    Modules linked in: lockd nfs_acl sunrpc
    Pid: 3335, comm: ls Not tainted 2.6.23-rc3-bloat #2
    RIP: 0010:[] [] proc_sys_readdir+0xd3/0x350
    RSP: 0018:ffff81007fd93e78 EFLAGS: 00010286
    RAX: ffffffff880661c0 RBX: ffffffff80466370 RCX: ffffffff880661c0
    RDX: 00000000000014c0 RSI: ffff81007f3ad020 RDI: ffff81007efd8b40
    RBP: 0000000000000018 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000001 R11: ffffffff802a8570 R12: ffffffff880661c0
    R13: ffff81007e219640 R14: ffff81007efd8b40 R15: ffff81007ded7280
    FS: 00002ba25ef03060(0000) GS:ffff81007ff81258(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: ffffffff880661c0 CR3: 000000007dfaf000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process ls (pid: 3335, threadinfo ffff81007fd92000, task ffff81007d8a0000)
    Stack: ffff81007f3ad150 ffffffff80283f30 ffff81007fd93f48 ffff81007efd8b40
    ffff81007ee00440 0000000422222222 0000000200035593 ffffffff88037e9a
    2222222222222222 ffffffff80466500 ffff81007e416400 ffff81007e219640
    Call Trace:
    [] filldir+0x0/0xf0
    [] filldir+0x0/0xf0
    [] vfs_readdir+0xa7/0xc0
    [] sys_getdents+0x96/0xe0
    [] system_call+0x7e/0x83

    Code: 41 8b 14 24 85 d2 74 dc 49 8b 44 24 08 48 85 c0 74 e7 49 3b
    RIP [] proc_sys_readdir+0xd3/0x350
    RSP
    CR2: ffffffff880661c0
    Kernel panic - not syncing: Fatal exception

    Signed-off-by: Alexey Dobriyan
    Acked-by: Trond Myklebust
    Cc: "J. Bruce Fields"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

01 Sep, 2007

9 commits

  • Ryusuke Konishi says:

    The recent truncate_complete_page() clears the dirty flag from a page
    before calling a_ops->invalidatepage(),
    ^^^^^^
    static void
    truncate_complete_page(struct address_space *mapping, struct page *page)
    {
            ...
            cancel_dirty_page(page, PAGE_CACHE_SIZE);
            ...
            /* a later step will call a_ops->invalidatepage() */
            ...
    }

    and this is disturbing nfs_wb_page_priority() from calling
    nfs_writepage_locked() that is expected to handle the pending
    request (=nfs_page) associated with the page.

    int nfs_wb_page_priority(struct inode *inode, struct page *page, int how)
    {
            ...
            if (clear_page_dirty_for_io(page)) {
                    ret = nfs_writepage_locked(page, &wbc);
                    if (ret < 0)
                            goto out;
            }
            ...
    }

    Since truncate_complete_page() will get rid of the page after
    a_ops->invalidatepage() returns, the request (=nfs_page) associated
    with the page becomes garbage in nfs_inode->nfs_page_tree.
    ------------------------

    Fix this by ensuring that nfs_wb_page_priority() recognises that it may
    also need to clear out non-dirty pages that have an nfs_page associated
    with them.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • According to the mount(2) man page, the proper error return code for the
    mount(2) system call when the special device name or the mounted-on
    directory name is too long is ENAMETOOLONG.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
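
As a rough illustration of the error convention described in this commit, the sketch below rejects an over-long name with -ENAMETOOLONG. The function name `demo_check_name_len` and the caller-supplied limit are assumptions made for the example; the actual limits used by the NFS mount code are not stated in the commit message.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical helper: return 0 if name is at most max_len bytes long,
 * or -ENAMETOOLONG otherwise. Scanning stops as soon as the limit is
 * exceeded, so an over-long string is never walked to its end. */
int demo_check_name_len(const char *name, size_t max_len)
{
    size_t len = 0;

    while (name[len] != '\0') {
        if (++len > max_len)
            return -ENAMETOOLONG;
    }
    return 0;
}
```

A caller would apply the same check to both the device name and the mounted-on directory name before going any further.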
     
  • The hostname was getting truncated in the new text-based NFS mount API.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • Don't filter the return code from the in-kernel rpcbind or NFS mount
    clients. Return the real error code so that callers of the new NFS
    text-based mount API can apply a useful retry strategy.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • The new text-based NFS mount option parsing logic doesn't recognize any
    valid transport protocols due to a silly mistake in the protocol token
    matching logic. This prevents basic mount requests such as:

    mount.nfs server:/export /mnt -o proto=tcp

    from working with the new text-based NFS mount API.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
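
For readers unfamiliar with token-based option parsing, here is a small user-space sketch of matching the value of a `proto=` option. It does not reproduce the mistake that was fixed (the commit message does not describe it), and the names `demo_match_proto`, `demo_parse_proto_option` and the enum values are hypothetical.

```c
#include <string.h>

/* Hypothetical token values for the transport protocol. */
enum demo_proto { DEMO_PROTO_UDP, DEMO_PROTO_TCP, DEMO_PROTO_ERR };

/* Map the text after "proto=" to a token; unknown text is an error. */
enum demo_proto demo_match_proto(const char *value)
{
    if (strcmp(value, "udp") == 0)
        return DEMO_PROTO_UDP;
    if (strcmp(value, "tcp") == 0)
        return DEMO_PROTO_TCP;
    return DEMO_PROTO_ERR;
}

/* Parse a single "proto=<value>" option string. */
enum demo_proto demo_parse_proto_option(const char *opt)
{
    static const char key[] = "proto=";
    size_t key_len = sizeof(key) - 1;

    if (strncmp(opt, key, key_len) != 0)
        return DEMO_PROTO_ERR;
    return demo_match_proto(opt + key_len);
}
```

If every valid value falls through to the error token, a request such as the `proto=tcp` mount above fails even though the option is well formed.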
     
  • This patch fixes an Oops that was reported by Gabriel Barazer.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • This should fix the following Oops reported by Jeff Garzik:

    kernel BUG at fs/nfs/nfs4xdr.c:1040!
    invalid opcode: 0000 [1] SMP
    CPU 0
    Modules linked in: nfs lockd sunrpc af_packet
    ipv6 cpufreq_ondemand acpi_cpufreq battery floppy nvram sg snd_hda_intel
    ata_generic snd_pcm_oss snd_mixer_oss snd_pcm i2c_i801 snd_page_alloc e1000
    firewire_ohci ata_piix i2c_core sr_mod cdrom sata_sil ahci libata sd_mod
    scsi_mod ext3 jbd ehci_hcd uhci_hcd
    Pid: 16353, comm: 10.10.10.1-recl Not tainted 2.6.23-rc3 #1
    RIP: 0010:[] [] :nfs:encode_open+0x1c0/0x330
    RSP: 0018:ffff8100467c5c60 EFLAGS: 00010202
    RAX: ffff81000f89b8b8 RBX: 00000000697a6f6d RCX: ffff81000f89b8b8
    RDX: 0000000000000004 RSI: 0000000000000004 RDI: ffff8100467c5c80
    RBP: ffff8100467c5c80 R08: ffff81000f89bc30 R09: ffff81000f89b83f
    R10: 0000000000000001 R11: ffffffff881e79e0 R12: ffff81003cbd1808
    R13: ffff81000f89b860 R14: ffff81005fc984e0 R15: ffffffff88240af0
    FS: 0000000000000000(0000) GS:ffffffff8052a000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
    CR2: 00002adb9e51a030 CR3: 000000007ea7e000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process 10.10.10.1-recl (pid: 16353, threadinfo ffff8100467c4000, task ffff8100038ce780)
    Stack: ffff81004aeb6a40 ffff81003cbd1808 ffff81003cbd1808 ffffffff88240b5d
    ffff81000f89b8bc ffff81005fc984e8 ffff81000f89bc30 ffff81005fc984e8
    0000000300000000 0000000000000000 0000000000000000 ffff81003cbd1800
    Call Trace:
    [] :nfs:nfs4_xdr_enc_open_noattr+0x6d/0x90
    [] :sunrpc:rpcauth_wrap_req+0x97/0xf0
    [] :nfs:nfs4_xdr_enc_open_noattr+0x0/0x90
    [] :sunrpc:call_transmit+0x18a/0x290
    [] :sunrpc:__rpc_execute+0x6b/0x290
    [] :sunrpc:rpc_do_run_task+0x76/0xd0
    [] :nfs:_nfs4_proc_open+0x76/0x230
    [] :nfs:nfs4_open_recover_helper+0x5e/0xc0
    [] :nfs:nfs4_open_recover+0xe4/0x120
    [] :nfs:nfs4_open_reclaim+0xa4/0xf0
    [] :nfs:nfs4_reclaim_open_state+0x55/0x1b0
    [] :nfs:reclaimer+0x2ca/0x390
    [] :nfs:reclaimer+0x0/0x390
    [] kthread+0x4b/0x80
    [] child_rip+0xa/0x12
    [] kthread+0x0/0x80
    [] child_rip+0x0/0x12

    Code: 0f 0b eb fe 48 89 ef c7 00 00 00 00 02 be 08 00 00 00 e8 79
    RIP [] :nfs:encode_open+0x1c0/0x330
    RSP

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Doh! We can't use cancel_delayed_work_sync because we may have been called
    from an unmount that was being performed by nfs_automount_task.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • This avoids the recent NFS mount regression (returning EBUSY when
    mounting the same filesystem twice with different parameters).

    The best I can do given the constraints appears to be to have the kernel
    first look for a superblock that matches both the fsid and the
    user-specified mount options, and then spawn off a new superblock if
    that search fails.

    Note that this is not the same as specifying nosharecache everywhere
    since nosharecache will never attempt to match an existing superblock.

    Signed-off-by: Trond Myklebust
    Tested-by: Hua Zhong
    Signed-off-by: Linus Torvalds

    Trond Myklebust
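
The lookup policy described in this commit (match on both fsid and mount options, otherwise create a new superblock) can be modelled in a few lines. This is a toy user-space sketch under stated assumptions: the `demo_sb` table, the integer `options` field and the `unshared` argument are all hypothetical stand-ins, not the kernel's data structures.

```c
#include <stddef.h>

/* Hypothetical stand-in for a superblock: just an fsid and option bits. */
struct demo_sb {
    unsigned long fsid;
    unsigned int options;
    int in_use;
};

#define DEMO_MAX_SB 8
static struct demo_sb demo_table[DEMO_MAX_SB];

/* Return a superblock for (fsid, options). Unless unshared is set, an
 * existing entry is reused only when BOTH fsid and options match;
 * otherwise a new entry is created. Returns NULL if the table is full. */
struct demo_sb *demo_get_sb(unsigned long fsid, unsigned int options,
                            int unshared)
{
    size_t i;

    if (!unshared) {
        for (i = 0; i < DEMO_MAX_SB; i++) {
            struct demo_sb *sb = &demo_table[i];

            if (sb->in_use && sb->fsid == fsid && sb->options == options)
                return sb;
        }
    }
    for (i = 0; i < DEMO_MAX_SB; i++) {
        struct demo_sb *sb = &demo_table[i];

        if (!sb->in_use) {
            sb->fsid = fsid;
            sb->options = options;
            sb->in_use = 1;
            return sb;
        }
    }
    return NULL;
}
```

The `unshared` path skips the search entirely, which is the difference from nosharecache noted in the commit message.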
     

08 Aug, 2007

5 commits

  • This will avoid deadlocks of the form:

    stack backtrace:
    [] show_trace_log_lvl+0x1a/0x30
    [] show_trace+0x12/0x20
    [] dump_stack+0x15/0x20
    [] __lock_acquire+0xc22/0x1030
    [] lock_acquire+0x61/0x80
    [] flush_workqueue+0x49/0x70
    [] flush_scheduled_work+0xd/0x10
    [] nfs_release_automount_timer+0x2c/0x30 [nfs]
    [] nfs_free_server+0x9e/0xd0 [nfs]
    [] nfs_kill_super+0x16/0x20 [nfs]
    [] deactivate_super+0x7d/0xa0
    [] mntput_no_expire+0x4b/0x80
    [] expire_mount_list+0xe4/0x140
    [] mark_mounts_for_expiry+0x99/0xb0
    [] nfs_expire_automounts+0xd/0x40 [nfs]
    [] run_workqueue+0x12b/0x1e0
    [] worker_thread+0x9b/0x100
    [] kthread+0x42/0x70
    [] kernel_thread_helper+0x7/0x18
    =======================

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Calling put_rpccred() from an RCU callback, which runs in softirq context,
    would require us to introduce bh-safe locks into put_rpccred().
    This patch fixes the lockdep complaint reported by Marc Dietrich:

    inconsistent {softirq-on-W} -> {in-softirq-W} usage.
    swapper/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
    (rpc_credcache_lock){-+..}, at: []
    _atomic_dec_and_lock+0x17/0x60
    {softirq-on-W} state was registered at:
    [] __lock_acquire+0x650/0x1030
    [] lock_acquire+0x61/0x80
    [] _spin_lock+0x2c/0x40
    [] _atomic_dec_and_lock+0x17/0x60
    [] put_rpccred+0x5d/0x100 [sunrpc]
    [] rpcauth_unbindcred+0x21/0x60 [sunrpc]
    [] a0 [sunrpc]
    [] rpc_call_sync+0x30/0x40 [sunrpc]
    [] rpcb_register+0xdb/0x180 [sunrpc]
    [] svc_register+0x93/0x160 [sunrpc]
    [] __svc_create+0x1ee/0x220 [sunrpc]
    [] svc_create+0x13/0x20 [sunrpc]
    [] nfs_callback_up+0x82/0x120 [nfs]
    [] nfs_get_client+0x176/0x390 [nfs]
    [] nfs4_set_client+0x31/0x190 [nfs]
    [] nfs4_create_server+0x63/0x3b0 [nfs]
    [] nfs4_get_sb+0x346/0x5b0 [nfs]
    [] vfs_kern_mount+0x94/0x110
    [] do_mount+0x1f2/0x7d0
    [] sys_mount+0x66/0xa0
    [] syscall_call+0x7/0xb
    [] 0xffffffff
    irq event stamp: 5277830
    hardirqs last enabled at (5277830): [] kmem_cache_free+0x8a/0xc0
    hardirqs last disabled at (5277829): [] kmem_cache_free+0x52/0xc0
    softirqs last enabled at (5277798): [] __do_softirq+0xa3/0xc0
    softirqs last disabled at (5277817): [] do_softirq+0x47/0x50

    other info that might help us debug this:
    no locks held by swapper/0.

    stack backtrace:
    [] show_trace_log_lvl+0x1a/0x30
    [] show_trace+0x12/0x20
    [] dump_stack+0x15/0x20
    [] print_usage_bug+0x153/0x160
    [] mark_lock+0x449/0x620
    [] __lock_acquire+0x604/0x1030
    [] lock_acquire+0x61/0x80
    [] _spin_lock+0x2c/0x40
    [] _atomic_dec_and_lock+0x17/0x60
    [] put_rpccred+0x5d/0x100 [sunrpc]
    [] nfs_free_delegation_callback+0x13/0x20 [nfs]
    [] __rcu_process_callbacks+0x6a/0x1c0
    [] rcu_process_callbacks+0x12/0x30
    [] tasklet_action+0x38/0x80
    [] __do_softirq+0x55/0xc0
    [] do_softirq+0x47/0x50
    [] irq_exit+0x35/0x40
    [] smp_apic_timer_interrupt+0x43/0x80
    [] apic_timer_interrupt+0x33/0x38
    [] cpuidle_idle_call+0x6f/0x90
    [] cpu_idle+0x43/0x70
    [] rest_init+0x47/0x50
    [] start_kernel+0x22a/0x2b0
    [] 0x0
    =======================

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Do not allow cached open for O_RDONLY or O_WRONLY unless the file has been
    previously opened in these modes.

    Also fix the calculation of the mode in nfs4_close_prepare. We should only
    issue an OPEN_DOWNGRADE if we're sure that we will still be holding the
    correct open modes. This may not be the case if we've been doing delegated
    opens.

    Finally, there is no need to adjust the open mode bit flags in
    nfs4_close_done(): that has already been done in nfs4_close_prepare().

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • We don't really need to clear &state->inode_states inside
    nfs4_set_mode_locked, and doing so without holding the inode->i_lock would
    in any case be a bug...

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • We need to grab the inode->i_lock atomically with the last reference put in
    order to remove the open context that is being freed from the
    nfsi->open_files list.

    Fix by converting the kref to a standard atomic counter and then using
    atomic_dec_and_lock()...

    Thanks to Arnd Bergmann for pointing out the problem.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
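
The "decrement and take the lock only on the last reference" pattern named in this commit can be sketched in user space with C11 atomics. This is an illustrative model, not the kernel's atomic_dec_and_lock(): the toy spin lock built on `atomic_flag` and the names `demo_lock` and `demo_dec_and_lock` are assumptions made for the example.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy spin lock standing in for a real lock such as inode->i_lock. */
struct demo_lock {
    atomic_flag held;
};

static void demo_lock_acquire(struct demo_lock *l)
{
    while (atomic_flag_test_and_set(&l->held))
        ; /* spin until the flag was previously clear */
}

static void demo_lock_release(struct demo_lock *l)
{
    atomic_flag_clear(&l->held);
}

/* Drop one reference. Returns true, WITH the lock held, only if this
 * was the last reference; otherwise returns false without the lock. */
bool demo_dec_and_lock(atomic_int *count, struct demo_lock *lock)
{
    int old = atomic_load(count);

    /* Fast path: not the last reference, so decrement without locking.
     * On failure the compare-exchange reloads old for the next try. */
    while (old > 1) {
        if (atomic_compare_exchange_weak(count, &old, old - 1))
            return false;
    }

    /* Slow path: we may be the last holder, so decide under the lock. */
    demo_lock_acquire(lock);
    if (atomic_fetch_sub(count, 1) == 1)
        return true;
    demo_lock_release(lock);
    return false;
}
```

Because the final decrement happens with the lock held, no other thread can find the object on a lock-protected list after its count has reached zero.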
     

23 Jul, 2007

1 commit


20 Jul, 2007

13 commits


19 Jul, 2007

2 commits

  • Since posix_test_lock(), like fcntl() and ->lock(), indicates absence or
    presence of a conflicting lock by setting fl_type to, respectively, F_UNLCK
    or something other than F_UNLCK, the return value is no longer needed.

    Signed-off-by: "J. Bruce Fields"

    J. Bruce Fields
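
The calling convention described in this commit, where the result is reported through fl_type rather than a return value, can be modelled as follows. This is a simplified user-space sketch: `struct demo_lock`, its byte-range fields and `demo_test_lock` are hypothetical, and only the F_RDLCK/F_WRLCK/F_UNLCK constants come from `<fcntl.h>`.

```c
#include <fcntl.h> /* F_RDLCK, F_WRLCK, F_UNLCK */

/* Hypothetical, simplified lock description: a type and a byte range. */
struct demo_lock {
    short type;
    long start;
    long end;
};

/* Two locks conflict if their ranges overlap and either is a write lock. */
static int demo_conflict(const struct demo_lock *a, const struct demo_lock *b)
{
    if (a->end < b->start || b->end < a->start)
        return 0;
    return a->type == F_WRLCK || b->type == F_WRLCK;
}

/* Test fl against the held locks. On conflict, *fl is overwritten with
 * the conflicting lock; otherwise fl->type is set to F_UNLCK. Nothing
 * is returned: the caller inspects fl->type. */
void demo_test_lock(const struct demo_lock *held, int nheld,
                    struct demo_lock *fl)
{
    int i;

    for (i = 0; i < nheld; i++) {
        if (demo_conflict(&held[i], fl)) {
            *fl = held[i];
            return;
        }
    }
    fl->type = F_UNLCK;
}
```

This mirrors how fcntl(F_GETLK) reports its result through the lock structure the caller passed in.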
     
  • As Peter Staubach says elsewhere
    (http://marc.info/?l=linux-kernel&m=118113649526444&w=2):

    > The problem is that some file system such as NFSv2 and NFSv3 do
    > not have sufficient support to be able to support leases correctly.
    > In particular for these two file systems, there is no over the wire
    > protocol support.
    >
    > Currently, these two file systems fail the fcntl(F_SETLEASE) call
    > accidentally, due to a reference counting difference. These file
    > systems should fail more consciously, with a proper error to
    > indicate that the call is invalid for them.

    Define an nfs setlease method that just returns -EINVAL.

    If someone can demonstrate a real need, perhaps we could reenable
    them in the presence of the "nolock" mount option.

    Signed-off-by: "J. Bruce Fields"
    Cc: Peter Staubach
    Cc: Trond Myklebust

    J. Bruce Fields
     

18 Jul, 2007

2 commits

  • Currently, the freezer treats all tasks as freezable, except for the kernel
    threads that explicitly set the PF_NOFREEZE flag for themselves. This
    approach is problematic, since it requires every kernel thread to either
    set PF_NOFREEZE explicitly, or call try_to_freeze(), even if it doesn't
    care for the freezing of tasks at all.

    It seems better to only require the kernel threads that want to or need to
    be frozen to use some freezer-related code and to remove any
    freezer-related code from the other (nonfreezable) kernel threads, which is
    done in this patch.

    The patch causes all kernel threads to be nonfreezable by default (i.e. to
    have PF_NOFREEZE set by default) and introduces the set_freezable()
    function that should be called by the freezable kernel threads in order to
    unset PF_NOFREEZE. It also makes all of the currently freezable kernel
    threads call set_freezable(), so it shouldn't cause any (intentional)
    change of behaviour to appear. Additionally, it updates documentation to
    describe the freezing of tasks more accurately.

    [akpm@linux-foundation.org: build fixes]
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Nigel Cunningham
    Cc: Pavel Machek
    Cc: Oleg Nesterov
    Cc: Gautham R Shenoy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
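
The opt-in model described in this commit can be illustrated with a toy flag word. This is a user-space sketch only: `struct demo_task`, the bit value chosen for `DEMO_PF_NOFREEZE` and the `demo_` helpers are hypothetical and merely mirror the roles of PF_NOFREEZE and set_freezable().

```c
/* Hypothetical flag bit standing in for PF_NOFREEZE. */
#define DEMO_PF_NOFREEZE 0x1u

/* Hypothetical, minimal task description. */
struct demo_task {
    unsigned int flags;
};

/* New kernel threads start out nonfreezable by default. */
void demo_kthread_init(struct demo_task *t)
{
    t->flags = DEMO_PF_NOFREEZE;
}

/* A thread that wants to take part in freezing opts in explicitly. */
void demo_set_freezable(struct demo_task *t)
{
    t->flags &= ~DEMO_PF_NOFREEZE;
}

/* Nonzero if the freezer may freeze this task. */
int demo_is_freezable(const struct demo_task *t)
{
    return !(t->flags & DEMO_PF_NOFREEZE);
}
```

With this default, a thread that never mentions the freezer is simply left alone, which is the behaviour the commit aims for.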
     
  • I can never remember what the function to register to receive VM pressure
    is called. I have to trace down from __alloc_pages() to find it.

    It's called "set_shrinker()", and it needs Your Help.

    1) Don't hide struct shrinker. It contains no magic.
    2) Don't allocate "struct shrinker". It's not helpful.
    3) Call them "register_shrinker" and "unregister_shrinker".
    4) Call the function "shrink" not "shrinker".
    5) Reduce the 17 lines of waffly comments to 13, but document it properly.

    Signed-off-by: Rusty Russell
    Cc: David Chinner
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rusty Russell
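
The register/unregister shape proposed in this commit can be sketched as a caller-owned structure on a linked list. This is a simplified user-space model: `struct demo_shrinker`, its single-argument `shrink` callback and the "return the number of objects freed" convention are assumptions for the example and differ from the kernel's actual struct shrinker.

```c
#include <stddef.h>

/* Hypothetical shrinker: the caller embeds or declares it directly,
 * so nothing is allocated on its behalf. */
struct demo_shrinker {
    int (*shrink)(int nr_to_scan); /* returns objects freed (simplified) */
    struct demo_shrinker *next;
};

static struct demo_shrinker *demo_shrinker_list;

void demo_register_shrinker(struct demo_shrinker *s)
{
    s->next = demo_shrinker_list;
    demo_shrinker_list = s;
}

void demo_unregister_shrinker(struct demo_shrinker *s)
{
    struct demo_shrinker **p;

    for (p = &demo_shrinker_list; *p != NULL; p = &(*p)->next) {
        if (*p == s) {
            *p = s->next;
            s->next = NULL;
            return;
        }
    }
}

/* Simulate memory pressure: ask every registered shrinker to scan. */
int demo_shrink_all(int nr_to_scan)
{
    struct demo_shrinker *s;
    int freed = 0;

    for (s = demo_shrinker_list; s != NULL; s = s->next)
        freed += s->shrink(nr_to_scan);
    return freed;
}

/* Example cache with ten objects and its shrink callback. */
static int demo_cache_objects = 10;

int demo_cache_shrink(int nr_to_scan)
{
    int n = nr_to_scan < demo_cache_objects ? nr_to_scan : demo_cache_objects;

    demo_cache_objects -= n;
    return n;
}
```

Keeping the structure visible and caller-owned is what makes points 1 and 2 of the list above possible.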
     

17 Jul, 2007

1 commit


11 Jul, 2007

5 commits

  • I ran into a curious issue when a lock is being canceled. The
    cancellation results in a lock request to the vfs layer instead of an
    unlock request. This is particularly insidious when the process that
    owns the lock is exiting. In that case, sometimes the erroneous lock is
    applied AFTER the process has entered zombie state, preventing the lock
    from ever being released. Eventually other processes block on the lock,
    causing a slow degradation of the system. In the 2.6.16 kernel on which
    this was investigated, the problem is compounded by the fact that the
    cl_sem is held while blocking on the vfs lock, which causes most
    processes accessing the nfs file system in question to hang.

    In more detail, here is how the situation occurs:

    first _nfs4_do_setlk():

    static int _nfs4_do_setlk(struct nfs4_state *state, int cmd, struct file_lock *fl, int reclaim)
            ...
            ret = nfs4_wait_for_completion_rpc_task(task);
            if (ret == 0) {
                    ...
            } else
                    data->cancelled = 1;

    then nfs4_lock_release():

    static void nfs4_lock_release(void *calldata)
            ...
            if (data->cancelled != 0) {
                    struct rpc_task *task;
                    task = nfs4_do_unlck(&data->fl, data->ctx, data->lsp,
                                    data->arg.lock_seqid);

    The problem is the same file_lock that was passed in to _nfs4_do_setlk()
    gets passed to nfs4_do_unlck() from nfs4_lock_release(). So the type is
    still F_RDLCK or F_WRLCK, not F_UNLCK. At some point, when cancelling the
    lock, the type needs to be changed to F_UNLCK. It seemed easiest to do
    that in nfs4_do_unlck(), but it could be done in nfs4_lock_release().
    The concern I had with doing it there was if something still needed the
    original file_lock, though it turns out the original file_lock still
    needs to be modified by nfs4_do_unlck() because nfs4_do_unlck() uses the
    original file_lock to pass to the vfs layer, and a copy of the original
    file_lock for the RPC request.

    It seems like the simplest solution is to force all situations where
    nfs4_do_unlck() is being used to result in an unlock, so with that in
    mind, I made the following change:

    Signed-off-by: Frank Filz
    Signed-off-by: Trond Myklebust

    Frank Filz
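
The diff itself is not reproduced in this log. As a rough user-space illustration of the idea stated above (every use of the unlock helper results in an unlock), the sketch below forces the lock type before handing the lock on. `struct demo_file_lock`, `demo_vfs_lock` and `demo_do_unlck` are hypothetical stand-ins, not the kernel functions.

```c
#include <fcntl.h> /* F_RDLCK, F_WRLCK, F_UNLCK */

/* Hypothetical, minimal stand-in for struct file_lock. */
struct demo_file_lock {
    short fl_type;
};

/* Records the type most recently handed to the pretend vfs layer. */
short demo_vfs_last_type;

static void demo_vfs_lock(const struct demo_file_lock *fl)
{
    demo_vfs_last_type = fl->fl_type;
}

/* Unlock helper: whatever type the caller left in fl (F_RDLCK or
 * F_WRLCK after a cancelled lock request), force it to F_UNLCK so the
 * vfs layer always sees an unlock. */
void demo_do_unlck(struct demo_file_lock *fl)
{
    fl->fl_type = F_UNLCK;
    demo_vfs_lock(fl);
}
```

Doing the conversion inside the helper means the cancellation path cannot forget it.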
     
  • Consider the case where the user has mounted the remote filesystem
    server:/foo on the two local directories /bar and /baz using the
    nosharecache mount option. The files /bar/file and /baz/file are
    represented by different inodes in the local namespace, but refer to the
    same file /foo/file on the server.

    Now suppose a process opens both /bar/file and /baz/file, then closes
    /bar/file: because the nfs4_state is not shared between /bar/file and
    /baz/file, the kernel will see that the nfs4_state for /bar/file is no
    longer referenced, so it will send off a CLOSE rpc call. Unless the
    open_owners differ, that CLOSE call will invalidate the open state on
    /baz/file too.

    Conclusion: we cannot share open state owners between two different
    non-shared mount instances of the same filesystem.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Unless the user sets the NFS_MOUNT_NOSHAREDCACHE mount flag, we should
    return EBUSY if the filesystem is already mounted on a superblock that
    has set conflicting mount options.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Prior to David Howells' mount changes in 2.6.18, users who mounted
    different directories which happened to be from the same filesystem on the
    server would get different super blocks, and hence could choose different
    mount options. As long as there were no hard linked files that crossed from
    one subtree to another, this was quite safe.

    Since those changes, if the two directories are on the same filesystem
    (have the same 'fsid'), they will share the same super block, and hence
    the same mount options.

    Add a flag to allow users to elect not to share the NFS super block with
    another mount point, even if the fsids are the same. This will allow
    users to set different mount options for the two different super blocks, as
    was previously possible. It is still up to the user to ensure that there
    are no cache coherency issues when doing this. The default behaviour
    remains to share super blocks whenever two paths result in the same fsid.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever