14 Dec, 2011

1 commit

  • Commit 06222e491e663dac939f04b125c9dc52126a75c4 got the if wrong so that
    it always evaluates as true. This is semantically harmless, but makes
    SEEK_CUR and SEEK_SET needlessly query the server.

    Rewrite the if to explicitly enumerate the cases we DO need a valid i_size
    to make this code less fragile.

    Reported-by: Roel Kluin
    Signed-off-by: Sage Weil

    Sage Weil
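
The always-true condition described above can be reproduced in a few lines of user-space C. This is only a sketch: the helper names are hypothetical, the exact shape of the original faulty test is an assumption, and so is the set of origins treated as needing a valid i_size (SEEK_END, SEEK_DATA, SEEK_HOLE).

```c
#include <assert.h>
#include <stdio.h>   /* SEEK_SET, SEEK_CUR, SEEK_END */

/* SEEK_DATA/SEEK_HOLE may be missing from older libc headers; these are
 * the values Linux uses. */
#ifndef SEEK_DATA
#define SEEK_DATA 3
#endif
#ifndef SEEK_HOLE
#define SEEK_HOLE 4
#endif

/* Assumed buggy shape: no origin can equal both constants at once, so at
 * least one inequality always holds and the result is always 1. */
static int needs_size_buggy(int origin)
{
    return origin != SEEK_CUR || origin != SEEK_SET;
}

/* Less fragile shape: explicitly list the origins that depend on the
 * file size (assumed set, see the lead-in). */
static int needs_size_fixed(int origin)
{
    return origin == SEEK_END || origin == SEEK_DATA || origin == SEEK_HOLE;
}
```

With the enumerated form, SEEK_SET and SEEK_CUR no longer trigger a size query, and an origin added later does not silently fall into the "query the server" branch.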
     

08 Dec, 2011

1 commit

  • We have been using i_lock to protect all kinds of data structures in the
    ceph_inode_info struct, including lists of inodes that we need to iterate
    over while avoiding races with inode destruction. That requires grabbing
    a reference to the inode with the list lock held, but igrab() now
    takes i_lock to check the inode flags.

    Changing the list lock ordering would be a painful process.

    However, using a ceph-specific i_ceph_lock in the ceph inode instead of
    i_lock is a simple mechanical change and avoids the ordering constraints
    imposed by igrab().

    Reported-by: Amon Ott
    Signed-off-by: Sage Weil

    Sage Weil
     

27 Jul, 2011

7 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits)
    ceph: document unlocked d_parent accesses
    ceph: explicitly reference rename old_dentry parent dir in request
    ceph: document locking for ceph_set_dentry_offset
    ceph: avoid d_parent in ceph_dentry_hash; fix ceph_encode_fh() hashing bug
    ceph: protect d_parent access in ceph_d_revalidate
    ceph: protect access to d_parent
    ceph: handle racing calls to ceph_init_dentry
    ceph: set dir complete frag after adding capability
    rbd: set blk_queue request sizes to object size
    ceph: set up readahead size when rsize is not passed
    rbd: cancel watch request when releasing the device
    ceph: ignore lease mask
    ceph: fix ceph_lookup_open intent usage
    ceph: only link open operations to directory unsafe list if O_CREAT|O_TRUNC
    ceph: fix bad parent_inode calc in ceph_lookup_open
    ceph: avoid carrying Fw cap during write into page cache
    libceph: don't time out osd requests that haven't been received
    ceph: report f_bfree based on kb_avail rather than diffing.
    ceph: only queue capsnap if caps are dirty
    ceph: fix snap writeback when racing with writes
    ...

    Linus Torvalds
     
  • d_parent is protected by d_lock: use it when looking up a dentry's parent
    directory inode. Also take a reference and drop it in the caller to avoid
    a use-after-free.

    Reported-by: Al Viro
    Reviewed-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Sage Weil
     
  • We weren't properly calling lookup_instantiate_filp when setting up the
    lookup intent, which could lead to file leakage on errors. So:

    - use separate helper for the hidden snapdir translation, immediately
    following the mds request
    - use ceph_finish_lookup for the final dentry/return value dance in the
    exit path
    - lookup_instantiate_filp on success

    Reported-by: Al Viro
    Reviewed-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Sage Weil
     
  • We only need to put these on the directory unsafe list if they have
    side effects that fsync(2) should flush out.

    Reviewed-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Sage Weil
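
A minimal sketch of the "has side effects" test, assuming (as the merge summary line for this change suggests) that the relevant flags are O_CREAT and O_TRUNC. The helper name is hypothetical and this is user-space C, not the kernel code.

```c
#include <assert.h>
#include <fcntl.h>   /* O_CREAT, O_TRUNC */

/* Returns 1 when an open with these flags may modify the directory or the
 * file, i.e. when a later directory fsync would have something to flush. */
static int open_has_side_effects(int flags)
{
    return (flags & (O_CREAT | O_TRUNC)) != 0;
}
```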
     
  • We were always getting NULL here because the intent file f_dentry is always
    NULL at this point, which means we were always passing NULL to
    ceph_mdsc_do_request. In reality, this was fine, since this isn't
    currently ever a write operation that needs to get strung on the dir's
    unsafe list.

    Use the dir explicitly, and only pass it if this open has side-effects that
    a dir fsync should flush.

    Reviewed-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Sage Weil
     
  • The generic_file_aio_write call may block on balance_dirty_pages while we
    flush data to the OSDs. If we hold a reference to the FILE_WR cap during
    that interval, revocation by the MDS (e.g., to do a stat(2)) may be very
    slow.

    Reviewed-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Sage Weil
     
  • This allows us to force IO through the sync path which you normally only
    get when multiple clients are reading/writing to the same file or by
    mounting with -o sync. Among other things, this lets test programs verify
    correctness with a single mount.

    Reviewed-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Sage Weil
     

21 Jul, 2011

1 commit

  • This converts everybody to handle SEEK_HOLE/SEEK_DATA properly. In some cases
    we just return -EINVAL, in others we do the normal generic thing, and in others
    we're simply making sure that the proper due diligence is done. For example,
    in NFS/CIFS we need to make sure the file size is updated properly for the
    SEEK_HOLE and SEEK_DATA case, but since it calls the generic llseek stuff itself
    that is all we have to do. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Al Viro

    Josef Bacik
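
For a file system that does not track holes, the "normal generic thing" can be pictured as treating the whole file as one data extent followed by a virtual hole at end of file. The sketch below is a user-space illustration under that assumption; the function names are hypothetical and it is not the kernel's generic llseek code.

```c
#include <assert.h>
#include <errno.h>

/* SEEK_DATA: any offset inside the file is already data. */
static long long seek_data_no_holes(long long offset, long long size)
{
    assert(offset >= 0);          /* negative offsets are out of scope here */
    if (offset >= size)
        return -ENXIO;            /* nothing but the virtual hole remains */
    return offset;
}

/* SEEK_HOLE: the only hole is the virtual one at end of file. */
static long long seek_hole_no_holes(long long offset, long long size)
{
    assert(offset >= 0);
    if (offset >= size)
        return -ENXIO;
    return size;
}
```

Both results depend on the file size, which is why a network file system has to make sure its cached size is current before answering.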
     

20 Jul, 2011

1 commit


14 Jun, 2011

2 commits


08 Jun, 2011

3 commits

  • Getting ENOENT is equivalent to reading 0 bytes. Make that correction
    before setting up the hit_stripe and was_short flags.

    Fixes the following case:
    dd if=/dev/zero of=/mnt/fs_depot/dd3 bs=1 seek=1048576 count=0
    dd if=/mnt/fs_depot/dd3 of=/root/ddout1 skip=8 bs=500 count=2 iflag=direct

    Reported-by: Henry C Chang
    Signed-off-by: Sage Weil

    Sage Weil
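
The ordering matters: if the flags are computed before ENOENT is mapped to zero, a missing object looks like an error rather than a short read. A minimal sketch, assuming hit_stripe means "this request was cut at a stripe boundary" and was_short means "the OSD returned fewer bytes than requested"; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>

struct read_status {
    long ret;        /* bytes read, or a negative errno */
    int hit_stripe;  /* request was shorter than what the caller still wants */
    int was_short;   /* OSD returned fewer bytes than requested */
};

/* ret: result of the OSD read; this_len: bytes requested from this object;
 * left: bytes the caller still wants in total. */
static struct read_status classify_read(long ret, long this_len, long left)
{
    struct read_status st;

    /* A nonexistent object is the same as reading 0 bytes. Do this first
     * so the flags below see the corrected value. */
    if (ret == -ENOENT)
        ret = 0;

    st.ret = ret;
    st.hit_stripe = this_len < left;
    st.was_short = ret >= 0 && ret < this_len;
    return st;
}
```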
     
  • If we get a short read from the OSD because the object is small, we need to
    zero the remainder of the buffer. For O_DIRECT reads, the attempted range
    is not trimmed to i_size by the VFS, so we were actually looping
    indefinitely.

    Fix by trimming by i_size, and then unconditionally zeroing the trailing
    range.

    Reported-by: Jeff Wu
    Signed-off-by: Sage Weil

    Sage Weil
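
The two steps of the fix (trim to i_size, then zero the tail) can be sketched in user-space C as follows. The function name and signature are hypothetical; the real code works on pages rather than a flat buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* buf holds `got` valid bytes out of `want` requested, starting at file
 * offset `pos`. Returns the number of bytes the caller may report. */
static size_t finish_short_read(char *buf, size_t want, size_t got,
                                unsigned long long pos,
                                unsigned long long i_size)
{
    assert(got <= want);

    /* Trim the attempted range so it never extends past end of file. */
    if (pos >= i_size)
        return 0;
    if (want > i_size - pos)
        want = (size_t)(i_size - pos);
    if (got > want)
        got = want;

    /* Unconditionally zero whatever the OSD did not fill in. */
    memset(buf + got, 0, want - got);
    return want;
}
```

Because the range is bounded by i_size, a read loop built on this always makes progress or stops at end of file.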
     
  • We should use ihold whenever we already have a stable inode ref, even
    when we aren't holding i_lock. This avoids adding new and unnecessary
    locking dependencies.

    Signed-off-by: Sage Weil

    Sage Weil
     

05 May, 2011

1 commit


22 Mar, 2011

2 commits

  • In sync_write_wait(), we assume that the newest request is at the
    tail of the unsafe write list. We should maintain the semantics here.

    Signed-off-by: Henry C Chang
    Signed-off-by: Sage Weil

    Henry C Chang
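
The invariant "newest request is at the tail" only holds if every insertion appends. The sketch below re-creates a tiny circular doubly-linked list in user space (it is not the kernel's list.h, and all names are hypothetical) to show how head insertion breaks a waiter that looks at the tail.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    struct node *prev, *next;
    int id;                     /* larger id means a newer request */
};

static void list_init(struct node *head)
{
    head->prev = head->next = head;
}

static void list_insert(struct node *n, struct node *prev, struct node *next)
{
    next->prev = n;
    n->next = next;
    n->prev = prev;
    prev->next = n;
}

/* Insert right after the head: the newest entry ends up first. */
static void add_head(struct node *n, struct node *head)
{
    list_insert(n, head, head->next);
}

/* Insert right before the head: the newest entry ends up last. */
static void add_tail(struct node *n, struct node *head)
{
    list_insert(n, head->prev, head);
}

/* What a waiter would inspect if it assumes "newest is at the tail". */
static struct node *tail_entry(struct node *head)
{
    return head->prev == head ? NULL : head->prev;
}
```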
     
  • This fixes a list corruption warning like this:

    ------------[ cut here ]------------
    WARNING: at lib/list_debug.c:30 __list_add+0x68/0x81()
    Hardware name: X8DTU
    list_add corruption. prev->next should be next (ffff880618931250), but was (null). (prev=ffff880c188b9130).
    Modules linked in: nfsd lockd nfs_acl auth_rpcgss exportfs ceph libceph libcrc32c sunrpc ipv6 fuse igb i2c_i801 ioatdma i2c_core iTCO_wdt iTCO_vendor_support joydev dca serio_raw usb_storage [last unloaded: scsi_wait_scan]
    Pid: 10977, comm: smbd Tainted: G W 2.6.32.23-170.Elaster.xendom0.fc12.x86_64 #1
    Call Trace:
    [] warn_slowpath_common+0x7c/0x94
    [] warn_slowpath_fmt+0x41/0x43
    [] __list_add+0x68/0x81
    [] ceph_aio_write+0x614/0x8a2 [ceph]
    [] do_sync_write+0xe8/0x125
    [] ? autoremove_wake_function+0x0/0x39
    [] ? selinux_file_permission+0x5c/0xb3
    [] ? security_file_permission+0x16/0x18
    [] vfs_write+0xae/0x10b
    [] sys_pwrite64+0x5a/0x76
    [] system_call_fastpath+0x16/0x1b
    ---[ end trace 08573eb9f07ff6f4 ]---

    Signed-off-by: Henry C Chang
    Signed-off-by: Sage Weil

    Henry C Chang
     

18 Dec, 2010

1 commit


16 Dec, 2010

1 commit


10 Nov, 2010

2 commits


08 Nov, 2010

1 commit

  • Normally when we open a file we already have a cap, and simply update the
    wanted set. However, if we open a file for write, but don't have an auth
    cap, that doesn't work; we need to open a new cap with the auth MDS. Only
    reuse existing caps if we are opening for read or the existing cap is auth.

    Signed-off-by: Sage Weil

    Sage Weil
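
The reuse rule in the last sentence reduces to a small predicate. This is a sketch with hypothetical names; the real check operates on ceph cap structures rather than booleans.

```c
#include <assert.h>
#include <stdbool.h>

/* An existing cap can be reused when the open is read-only, or when the
 * cap was issued by the authoritative MDS. */
static bool can_reuse_cap(bool have_cap, bool cap_is_auth, bool open_for_write)
{
    if (!have_cap)
        return false;
    return !open_for_write || cap_is_auth;
}
```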
     

21 Oct, 2010

1 commit

  • This factors out protocol and low-level storage parts of ceph into a
    separate libceph module living in net/ceph and include/linux/ceph. This
    is mostly a matter of moving files around. However, a few key pieces
    of the interface change as well:

    - ceph_client becomes ceph_fs_client and ceph_client, where the latter
    captures the mon and osd clients, and the fs_client gets the mds client
    and file system specific pieces.
    - Mount option parsing and debugfs setup is correspondingly broken into
    two pieces.
    - The mon client gets a generic handler callback for otherwise unknown
    messages (mds map, in this case).
    - The basic supported/required feature bits can be expanded (and are by
    ceph_fs_client).

    No functional change, aside from some subtle error handling cases that got
    cleaned up in the refactoring process.

    Signed-off-by: Sage Weil

    Yehuda Sadeh
     

07 Oct, 2010

1 commit


04 Aug, 2010

1 commit


03 Aug, 2010

1 commit

  • Implement flock inode operation to support advisory file locking. All
    lock/unlock operations are synchronous with the MDS. Lock state is
    sent when reconnecting to a recovering MDS to restore the shared lock
    state.

    Signed-off-by: Greg Farnum
    Signed-off-by: Sage Weil

    Greg Farnum
     

02 Aug, 2010

3 commits


28 Jul, 2010

1 commit

  • This fixes an issue triggered by running concurrent syncs. One of the syncs
    would go through while the other would just hang indefinitely. In any case, we
    never actually want to wake a single waiter, so the *_all functions should
    be used.

    Signed-off-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Yehuda Sadeh
     

30 May, 2010

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    ceph: clean up on forwarded aborted mds request
    ceph: fix leak of osd authorizer
    ceph: close out mds, osd connections before stopping auth
    ceph: make lease code DN specific
    fs/ceph: Use ERR_CAST
    ceph: renew auth tickets before they expire
    ceph: do not resend mon requests on auth ticket renewal
    ceph: removed duplicated #includes
    ceph: avoid possible null dereference
    ceph: make mds requests killable, not interruptible
    sched: add wait_for_completion_killable_timeout

    Linus Torvalds
     
  • Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)). The former makes the
    purpose of the operation clearer; the latter otherwise looks like a
    no-op.

    In the case of fs/ceph/inode.c, ERR_CAST is not needed, because the type of
    the returned value is the same as the type of the enclosing function.

    The semantic patch that makes this change is as follows:
    (http://coccinelle.lip6.fr/)

    //
    @@
    type T;
    T x;
    identifier f;
    @@

    T f (...) { }

    @@
    expression x;
    @@

    - ERR_PTR(PTR_ERR(x))
    + ERR_CAST(x)
    //

    Signed-off-by: Julia Lawall
    Signed-off-by: Sage Weil

    Julia Lawall
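
To see why the two spellings are equivalent, here is a user-space re-creation of the error-pointer helpers. It follows the general shape of the kernel's err.h but is a sketch for illustration only; lookup_old and lookup_new are hypothetical functions.

```c
#include <assert.h>
#include <errno.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *ERR_PTR(long error) { return (void *)error; }

/* Decode it again. */
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }

/* Error pointers occupy the top MAX_ERRNO addresses. */
static inline int IS_ERR(const void *ptr)
{
    return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

/* Change only the pointer type; the encoded error is untouched. */
static inline void *ERR_CAST(const void *ptr) { return (void *)ptr; }

struct inode;
struct dentry;

/* Old spelling: decode and re-encode, which reads like a no-op. */
static struct dentry *lookup_old(struct inode *in)
{
    return ERR_PTR(PTR_ERR(in));
}

/* New spelling: states the intent, "same error, different pointer type". */
static struct dentry *lookup_new(struct inode *in)
{
    return ERR_CAST(in);
}
```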
     

24 May, 2010

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (59 commits)
    ceph: reuse mon subscribe message instead of allocated anew
    ceph: avoid resending queued message to monitor
    ceph: Storage class should be before const qualifier
    ceph: all allocation functions should get gfp_mask
    ceph: specify max_bytes on readdir replies
    ceph: cleanup pool op strings
    ceph: Use kzalloc
    ceph: use common helper for aborted dir request invalidation
    ceph: cope with out of order (unsafe after safe) mds reply
    ceph: save peer feature bits in connection structure
    ceph: resync headers with userland
    ceph: use ceph. prefix for virtual xattrs
    ceph: throw out dirty caps metadata, data on session teardown
    ceph: attempt mds reconnect if mds closes our session
    ceph: clean up send_mds_reconnect interface
    ceph: wait for mds OPEN reply to indicate reconnect success
    ceph: only send cap releases when mds is OPEN|HUNG
    ceph: dicard cap releases on mds restart
    ceph: make mon client statfs handling more generic
    ceph: drop src address(es) from message header [new protocol feature]
    ...

    Linus Torvalds
     

22 May, 2010

1 commit

  • Now that the last user passing a NULL file pointer is gone we can remove
    the redundant dentry argument and associated hacks inside vfs_fsync_range.

    The next step will be removing the dentry argument from ->fsync, but given
    the luck with the last round of method prototype changes I'd rather
    defer this until after the main merge window.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

18 May, 2010

4 commits