10 Sep, 2011

1 commit


23 Aug, 2011

1 commit


16 Aug, 2011

1 commit


27 Jul, 2011

21 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits)
    ceph: document unlocked d_parent accesses
    ceph: explicitly reference rename old_dentry parent dir in request
    ceph: document locking for ceph_set_dentry_offset
    ceph: avoid d_parent in ceph_dentry_hash; fix ceph_encode_fh() hashing bug
    ceph: protect d_parent access in ceph_d_revalidate
    ceph: protect access to d_parent
    ceph: handle racing calls to ceph_init_dentry
    ceph: set dir complete frag after adding capability
    rbd: set blk_queue request sizes to object size
    ceph: set up readahead size when rsize is not passed
    rbd: cancel watch request when releasing the device
    ceph: ignore lease mask
    ceph: fix ceph_lookup_open intent usage
    ceph: only link open operations to directory unsafe list if O_CREAT|O_TRUNC
    ceph: fix bad parent_inode calc in ceph_lookup_open
    ceph: avoid carrying Fw cap during write into page cache
    libceph: don't time out osd requests that haven't been received
    ceph: report f_bfree based on kb_avail rather than diffing.
    ceph: only queue capsnap if caps are dirty
    ceph: fix snap writeback when racing with writes
    ...

    Linus Torvalds
     
  • For the most part we don't care about racing with rename when directing
    MDS requests; either the old or new parent is fine. Document that, and
    do some minor cleanup.

    Reviewed-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Sage Weil
     
  • We carry a pin on the parent directory for the rename source and dest
    dentries. For the source it's r_locked_dir; we need to explicitly
    reference the old_dentry parent as well, since the dentry's d_parent may
    change between when the request was created and pinned and when it is
    freed.

    Reviewed-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Sage Weil
     
  • Reviewed-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Sage Weil
     
  • Have the caller pass in a safely obtained reference to the parent directory
    for calculating a dentry's hash value.

    While we're here, simplify the flow through ceph_encode_fh() so that there
    is a single exit point and cleanup.

    Also fix a bug with the dentry hash calculation: calculate the hash for the
    dentry we were given, not its parent.

    Reviewed-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Sage Weil
     
  • Protect d_parent with d_lock. Carry a reference. Simplify the flow so
    that there is a single exit point and cleanup.

    Reviewed-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Sage Weil
     
  • d_parent is protected by d_lock: use it when looking up a dentry's parent
    directory inode. Also take a reference and drop it in the caller to avoid
    a use-after-free.

    Reported-by: Al Viro
    Reviewed-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Sage Weil
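
The pattern described in this entry (read the parent pointer under the lock, take a reference before the lock is dropped, and let the caller release it) can be sketched in user space. This is only an illustration under stated assumptions: `struct dir`, `struct entry`, and every function below are hypothetical stand-ins, not the kernel's dentry API, and a C11 `atomic_flag` spinlock plays the role of d_lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; these are not the kernel's types. */
struct dir {
    atomic_int refs;            /* reference count on the directory */
};

struct entry {
    atomic_flag lock;           /* plays the role of d_lock (a spinlock) */
    struct dir *parent;         /* plays the role of d_parent */
};

static void dir_init(struct dir *d)
{
    atomic_init(&d->refs, 1);   /* the creator holds one reference */
}

static void entry_init(struct entry *e, struct dir *parent)
{
    atomic_flag_clear(&e->lock);
    e->parent = parent;
}

static void entry_lock(struct entry *e)
{
    while (atomic_flag_test_and_set(&e->lock))
        ;                       /* spin until the flag is released */
}

static void entry_unlock(struct entry *e)
{
    atomic_flag_clear(&e->lock);
}

/* Read the parent under the lock and pin it before the lock is dropped. */
static struct dir *entry_get_parent(struct entry *e)
{
    struct dir *p;

    entry_lock(e);
    p = e->parent;
    atomic_fetch_add(&p->refs, 1);
    entry_unlock(e);
    return p;
}

/* A rename changes the parent under the same lock. */
static void entry_set_parent(struct entry *e, struct dir *new_parent)
{
    entry_lock(e);
    e->parent = new_parent;
    entry_unlock(e);
}

/* The caller drops the pin when it is done with the parent. */
static void dir_put(struct dir *d)
{
    int old = atomic_fetch_sub(&d->refs, 1);

    assert(old > 0);
    (void)old;
}
```

Because the reference is taken while the lock is still held, a concurrent `entry_set_parent()` cannot leave the caller holding a pointer to a directory that has already been released.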
     
  • The ->lookup() and prepopulate_readdir() callers are working with unhashed
    dentries, so we don't have to worry. The export.c callers, though, need
    to initialize something they got back from d_obtain_alias() and are
    potentially racing with other callers. Make sure we don't return unless
    the dentry is properly initialized (by us or someone else).

    Reported-by: Al Viro
    Reviewed-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Sage Weil
     
  • Currently ceph_add_cap clears the complete bit if we are newly issued the
    FILE_SHARED cap, which is normally the case for a newly issued cap on a new
    directory. That means we clear the just-set bit. Move the check that sets
    the flag to after the cap is added/updated.

    Reviewed-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Sage Weil
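
The ordering problem in this entry can be shown with a small model. The flag and capability names, `struct inode_state`, and both `fill_new_dir_*` functions are hypothetical; only the behavior (adding a newly issued shared cap clears the complete bit) is taken from the message above.

```c
#include <assert.h>
#include <stddef.h>

#define FLAG_COMPLETE   0x1u    /* hypothetical stand-in for the "complete" bit */
#define CAP_FILE_SHARED 0x1u    /* hypothetical stand-in for the FILE_SHARED cap */

struct inode_state {
    unsigned issued;            /* caps issued so far */
    unsigned flags;
};

/* Models the described behavior: a newly issued shared cap clears COMPLETE. */
static void add_cap(struct inode_state *st, unsigned caps)
{
    unsigned newly;

    assert(st != NULL);
    newly = caps & ~st->issued;
    if (newly & CAP_FILE_SHARED)
        st->flags &= ~FLAG_COMPLETE;
    st->issued |= caps;
}

/* Old order: the flag is set first and then immediately cleared by add_cap(). */
static void fill_new_dir_old(struct inode_state *st, unsigned caps)
{
    st->flags |= FLAG_COMPLETE;
    add_cap(st, caps);
}

/* New order: add the cap first, then set the flag so that it sticks. */
static void fill_new_dir_new(struct inode_state *st, unsigned caps)
{
    add_cap(st, caps);
    st->flags |= FLAG_COMPLETE;
}
```

With a fresh `struct inode_state`, the old order leaves the complete bit clear, while the new order leaves it set.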
     
  • This should improve the default read performance, as without it
    readahead is practically disabled.

    Signed-off-by: Yehuda Sadeh

    Yehuda Sadeh
     
  • The lease mask is no longer used (and it changed a while back). Instead,
    use a non-zero duration to indicate that there is a lease being issued.

    Reviewed-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Sage Weil
     
  • We weren't properly calling lookup_instantiate_filp when setting up the
    lookup intent, which could lead to file leakage on errors. So:

    - use separate helper for the hidden snapdir translation, immediately
    following the mds request
    - use ceph_finish_lookup for the final dentry/return value dance in the
    exit path
    - lookup_instantiate_filp on success

    Reported-by: Al Viro
    Reviewed-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Sage Weil
     
  • We only need to put these on the directory unsafe list if they have
    side effects that fsync(2) should flush out.

    Reviewed-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Sage Weil
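
The merge summary lists this change as "only link open operations to directory unsafe list if O_CREAT|O_TRUNC". A minimal sketch of that test follows; both helper names are hypothetical and the directory is represented by an opaque pointer rather than a real inode.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: does this open have side effects a dir fsync must flush? */
static bool open_has_side_effects(int flags)
{
    return (flags & (O_CREAT | O_TRUNC)) != 0;
}

/*
 * Hypothetical helper: choose what to hand to the request. Only opens with
 * side effects carry the directory; everything else passes NULL.
 */
static const void *unsafe_list_dir(const void *dir, int flags)
{
    assert(dir != NULL);
    return open_has_side_effects(flags) ? dir : NULL;
}
```

A plain `O_RDONLY` open therefore never lands on the directory's unsafe list, while `O_WRONLY | O_CREAT` does.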
     
  • We were always getting NULL here because the intent file f_dentry is always
    NULL at this point, which means we were always passing NULL to
    ceph_mdsc_do_request. In reality, this was fine, since this isn't
    currently ever a write operation that needs to get strung on the dir's
    unsafe list.

    Use the dir explicitly, and only pass it if this open has side-effects that
    a dir fsync should flush.

    Reviewed-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Sage Weil
     
  • The generic_file_aio_write call may block on balance_dirty_pages while we
    flush data to the OSDs. If we hold a reference to the FILE_WR cap during
    that interval, revocation by the MDS (e.g., to do a stat(2)) may be very
    slow.

    Reviewed-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Sage Weil
     
  • Reviewed-by: Yehuda Sadeh
    Signed-off-by: Greg Farnum

    Greg Farnum
     
  • We used to go into this branch if i_wrbuffer_ref_head was non-zero. This
    was an ancient check from before we were careful about dealing with all
    kinds of caps (and not just dirty pages). It is cleaner to only queue a
    capsnap if there is an actual dirty cap. If we are racing with...
    something...we will end up here with ci->i_wrbuffer_refs but no dirty
    caps.

    Reviewed-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Sage Weil
     
  • There are two problems that come up when we try to queue a capsnap while a
    write is in progress:

    - The FILE_WR cap is held, but not yet dirty, so we may queue a capsnap
    with dirty == 0. That will crash later in __ceph_flush_snaps(). Or
    on the FILE_WR cap if a write is in progress.
    - We may not have i_head_snapc set, which causes problems pretty quickly.
    Look to the snaprealm in this case.

    Reviewed-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Sage Weil
     
  • This saves us a word of memory per file.

    Reviewed-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Sage Weil
     
  • This allows us to force IO through the sync path which you normally only
    get when multiple clients are reading/writing to the same file or by
    mounting with -o sync. Among other things, this lets test programs verify
    correctness with a single mount.

    Reviewed-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Sage Weil
     
  • Reviewed-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Sage Weil
     

21 Jul, 2011

3 commits

  • Btrfs needs to be able to control how filemap_write_and_wait_range() is called
    in fsync to make it less of a painful operation, so push taking i_mutex and
    calling filemap_write_and_wait() down into the ->fsync() handlers. Some
    file systems, like ext3 and ocfs2, can apparently drop taking the i_mutex
    altogether. For correctness' sake I just pushed everything down in all cases to make
    sure that we keep the current behavior the same for everybody, and then each
    individual fs maintainer can make up their mind about what to do from there.
    Thanks,

    Acked-by: Jan Kara
    Signed-off-by: Josef Bacik
    Signed-off-by: Al Viro

    Josef Bacik
     
  • This converts everybody to handle SEEK_HOLE/SEEK_DATA properly. In some cases
    we just return -EINVAL, in others we do the normal generic thing, and in others
    we're simply making sure that the proper due diligence is done. For example,
    in NFS/CIFS we need to make sure the file size is updated properly for the
    SEEK_HOLE and SEEK_DATA cases, but since it calls the generic llseek stuff itself
    that is all we have to do. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Al Viro

    Josef Bacik
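
As a rough illustration of what the "normal generic thing" can look like for a file system that does not track holes, the sketch below follows the simplest convention documented for lseek(2): the whole file counts as data, with a single virtual hole at end of file. The enum and the function name are hypothetical, and this is a user-space model rather than the kernel's llseek code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical names; they stand in for SEEK_DATA and SEEK_HOLE. */
enum seek_kind { SK_DATA, SK_HOLE };

/*
 * Whole-file-is-data convention:
 *   - an offset at or past end of file has neither data nor a hole: -ENXIO
 *   - SK_DATA: any offset inside the file is already data
 *   - SK_HOLE: the only hole is the virtual one at end of file
 */
static int64_t generic_seek(enum seek_kind kind, int64_t offset, int64_t size)
{
    assert(size >= 0);
    if (offset < 0 || offset >= size)
        return -ENXIO;
    return kind == SK_DATA ? offset : size;
}
```

For a 100-byte file, `generic_seek(SK_DATA, 10, 100)` yields 10 and `generic_seek(SK_HOLE, 10, 100)` yields 100.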
     
  • Signed-off-by: Al Viro

    Al Viro
     

20 Jul, 2011

5 commits


17 Jul, 2011

1 commit


14 Jun, 2011

2 commits


08 Jun, 2011

4 commits

  • If we request a lock and then abort (e.g., ^C), we need to send a matching
    unlock request to the MDS to unwind our lock attempt to avoid indefinitely
    blocking other clients.

    Reported-by: Brian Chrisman
    Signed-off-by: Sage Weil

    Sage Weil
     
  • Getting ENOENT is equivalent to reading 0 bytes. Make that correction
    before setting up the hit_stripe and was_short flags.

    Fixes the following case:
    dd if=/dev/zero of=/mnt/fs_depot/dd3 bs=1 seek=1048576 count=0
    dd if=/mnt/fs_depot/dd3 of=/root/ddout1 skip=8 bs=500 count=2 iflag=direct

    Reported-by: Henry C Chang
    Signed-off-by: Sage Weil

    Sage Weil
     
  • If we get a short read from the OSD because the object is small, we need to
    zero the remainder of the buffer. For O_DIRECT reads, the attempted range
    is not trimmed to i_size by the VFS, so we were actually looping
    indefinitely.

    Fix by trimming by i_size, and then unconditionally zeroing the trailing
    range.

    Reported-by: Jeff Wu
    Signed-off-by: Sage Weil

    Sage Weil
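
The fix in this entry, together with the ENOENT handling from the entry before it, can be sketched as a small post-processing step on a read result. `finish_read()` and its parameters are hypothetical; the sketch assumes `ret` is either a byte count or a negative errno, and it is not the actual ceph read path.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper. 'ret' is the result of reading 'len' bytes at 'off'
 * into 'buf' (bytes read, or a negative errno). Returns the number of valid
 * bytes now in 'buf', or a negative errno.
 */
static long finish_read(char *buf, long long off, size_t len,
                        long long i_size, long ret)
{
    unsigned long long avail;
    size_t want, got;

    assert(buf != NULL && off >= 0);

    if (ret == -ENOENT)
        ret = 0;                /* a missing object reads as zero bytes */
    if (ret < 0)
        return ret;             /* any other error is passed through */

    if (off >= i_size)
        return 0;               /* nothing to read at or past end of file */

    /* Trim the attempted range to the file size. */
    avail = (unsigned long long)(i_size - off);
    want = len;
    if (want > avail)
        want = (size_t)avail;

    got = (size_t)ret;
    if (got > want)
        got = want;

    /* Unconditionally zero whatever the short read did not fill. */
    memset(buf + got, 0, want - got);
    return (long)want;
}
```

Because the range is trimmed to i_size first, a caller looping until it has the bytes it asked for makes progress even when the backing object is shorter than the file.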
     
  • We should use ihold whenever we already have a stable inode ref, even
    when we aren't holding i_lock. This avoids adding new and unnecessary
    locking dependencies.

    Signed-off-by: Sage Weil

    Sage Weil
     

25 May, 2011

1 commit

  • In e9964c10 we changed cap flushing to do a delicate dance because some
    inodes on the cap_dirty list could be in a migrating state (got EXPORT but
    not IMPORT) in which we couldn't actually flush and move from
    dirty->flushing, breaking the while (!empty) { process first } loop
    structure. It worked for a single sync thread, but was not reentrant and
    triggered infinite loops when multiple syncers came along.

    Instead, move inodes with dirty caps to a separate cap_dirty_migrating list
    when in the limbo export-but-no-import state, allowing us to go back to
    the simple loop structure (which was reentrant). This is cleaner and more
    robust.

    Audited the cap_dirty users and this looks fine:
    list_empty(&ci->i_dirty_item) is still a reliable indicator of whether we
    have dirty caps (which list we're on is irrelevant) and list_del_init()
    calls still do the right thing.

    Signed-off-by: Sage Weil

    Sage Weil
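
The two-list idea can be modelled with a pair of singly linked lists. Everything here (`struct item`, `struct lists`, and the helper names) is a hypothetical user-space sketch; the kernel code uses its own list primitives and far more state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct item {
    bool migrating;             /* exported but not yet imported */
    bool flushed;
    struct item *next;
};

struct lists {
    struct item *dirty;             /* plays the role of cap_dirty */
    struct item *dirty_migrating;   /* plays the role of cap_dirty_migrating */
};

static void push(struct item **head, struct item *it)
{
    it->next = *head;
    *head = it;
}

static struct item *pop(struct item **head)
{
    struct item *it = *head;

    if (it != NULL) {
        *head = it->next;
        it->next = NULL;
    }
    return it;
}

static void unlink_item(struct item **head, struct item *it)
{
    struct item **pp;

    for (pp = head; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == it) {
            *pp = it->next;
            it->next = NULL;
            return;
        }
    }
}

/* Entering the limbo state moves the item off the list the flusher walks. */
static void mark_migrating(struct lists *l, struct item *it)
{
    unlink_item(&l->dirty, it);
    it->migrating = true;
    push(&l->dirty_migrating, it);
}

/* The simple loop: while the list is not empty, process the first item. */
static int flush_all(struct lists *l)
{
    struct item *it;
    int n = 0;

    while ((it = pop(&l->dirty)) != NULL) {
        assert(!it->migrating); /* unflushable items never appear here */
        it->flushed = true;
        n++;
    }
    return n;
}
```

Since every item on the dirty list can be flushed, each iteration removes one item and the loop always terminates, no matter how many callers run it one after another.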