14 Jun, 2011

2 commits


08 Jun, 2011

3 commits

  • Getting ENOENT is equivalent to reading 0 bytes. Make that correction
    before setting up the hit_stripe and was_short flags.

    Fixes the following case:
    dd if=/dev/zero of=/mnt/fs_depot/dd3 bs=1 seek=1048576 count=0
    dd if=/mnt/fs_depot/dd3 of=/root/ddout1 skip=8 bs=500 count=2 iflag=direct

    Reported-by: Henry C Chang
    Signed-off-by: Sage Weil

    Sage Weil
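
The ordering described in this commit can be sketched in user-space C. This is a minimal illustration, not the kernel code: `struct read_status` and `classify_read()` are hypothetical names, and only the flag names `hit_stripe` and `was_short` come from the commit message.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical result of one per-object read attempt. */
struct read_status {
    long ret;        /* bytes read, or a negative errno */
    bool hit_stripe; /* the request was cut at an object boundary */
    bool was_short;  /* fewer bytes came back than were asked for */
};

/*
 * Classify the raw result of reading this_len bytes while `left` bytes
 * remain in the overall request.
 */
static struct read_status classify_read(long raw_ret, long this_len, long left)
{
    struct read_status st;

    /* A missing object holds no data: treat ENOENT as a 0-byte read. */
    st.ret = (raw_ret == -ENOENT) ? 0 : raw_ret;

    /* Derive the flags only now, so they see the corrected value. */
    st.hit_stripe = this_len < left;
    st.was_short = st.ret >= 0 && st.ret < this_len;
    return st;
}
```

With this ordering, a read of a nonexistent object is reported as a short read rather than as an error, which lets the caller apply its normal hole and end-of-file handling.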
     
  • If we get a short read from the OSD because the object is small, we need to
    zero the remainder of the buffer. For O_DIRECT reads, the attempted range
    is not trimmed to i_size by the VFS, so we were actually looping
    indefinitely.

    Fix by trimming by i_size, and then unconditionally zeroing the trailing
    range.

    Reported-by: Jeff Wu
    Signed-off-by: Sage Weil

    Sage Weil
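
The trim-then-zero approach can be sketched in user-space C as follows. This is an illustration under assumptions, not the kernel implementation: `finish_short_read()` and its parameters are hypothetical, and `i_size` is passed in as a plain value.

```c
#include <stddef.h>
#include <string.h>

/*
 * buf has room for `want` bytes of a read at file offset `off`;
 * `got` bytes actually arrived. Returns the number of valid bytes in buf.
 */
static size_t finish_short_read(char *buf, size_t off, size_t want,
                                size_t got, size_t i_size)
{
    /* Trim the attempted range so it never extends past end-of-file. */
    if (off >= i_size)
        return 0;
    if (want > i_size - off)
        want = i_size - off;

    /* Unconditionally zero whatever the short read left unfilled. */
    if (got < want)
        memset(buf + got, 0, want - got);
    return want;
}
```

Because the range is trimmed first, a caller that loops until `want` bytes are accounted for always terminates at end-of-file instead of retrying a range that can never be filled.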
     
  • We should use ihold whenever we already have a stable inode ref, even
    when we aren't holding i_lock. This avoids adding new and unnecessary
    locking dependencies.

    Signed-off-by: Sage Weil

    Sage Weil
     

05 May, 2011

1 commit


22 Mar, 2011

2 commits

  • In sync_write_wait(), we assume that the newest request is at the
    tail of the unsafe write list. We should maintain that ordering here.

    Signed-off-by: Henry C Chang
    Signed-off-by: Sage Weil

    Henry C Chang
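
The invariant can be illustrated with a small user-space re-creation of a circular doubly linked list. This is a sketch under assumptions: `struct write_req`, `unsafe_item`, and `newest_unsafe()` are hypothetical names, and `list_add_tail()` is reimplemented here rather than taken from the kernel headers.

```c
#include <stddef.h>

struct list_head {
    struct list_head *prev, *next;
};

static void list_init(struct list_head *head)
{
    head->prev = head->next = head;
}

/* Insert item just before head, i.e. at the tail of the list. */
static void list_add_tail(struct list_head *item, struct list_head *head)
{
    item->prev = head->prev;
    item->next = head;
    head->prev->next = item;
    head->prev = item;
}

/* Hypothetical write request that is not yet safe on disk. */
struct write_req {
    unsigned long tid;
    struct list_head unsafe_item;
};

/* The newest request is whatever sits at the tail, or NULL if empty. */
static struct write_req *newest_unsafe(struct list_head *head)
{
    if (head->prev == head)
        return NULL;
    return (struct write_req *)((char *)head->prev -
                                offsetof(struct write_req, unsafe_item));
}
```

As long as every new request is appended with tail insertion, a waiter only needs to look at the tail entry to find the most recent request.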
     
  • This fixes a list corruption warning like the following:

    ------------[ cut here ]------------
    WARNING: at lib/list_debug.c:30 __list_add+0x68/0x81()
    Hardware name: X8DTU
    list_add corruption. prev->next should be next (ffff880618931250), but was (null). (prev=ffff880c188b9130).
    Modules linked in: nfsd lockd nfs_acl auth_rpcgss exportfs ceph libceph libcrc32c sunrpc ipv6 fuse igb i2c_i801 ioatdma i2c_core iTCO_wdt iTCO_vendor_support joydev dca serio_raw usb_storage [last unloaded: scsi_wait_scan]
    Pid: 10977, comm: smbd Tainted: G W 2.6.32.23-170.Elaster.xendom0.fc12.x86_64 #1
    Call Trace:
    [] warn_slowpath_common+0x7c/0x94
    [] warn_slowpath_fmt+0x41/0x43
    [] __list_add+0x68/0x81
    [] ceph_aio_write+0x614/0x8a2 [ceph]
    [] do_sync_write+0xe8/0x125
    [] ? autoremove_wake_function+0x0/0x39
    [] ? selinux_file_permission+0x5c/0xb3
    [] ? security_file_permission+0x16/0x18
    [] vfs_write+0xae/0x10b
    [] sys_pwrite64+0x5a/0x76
    [] system_call_fastpath+0x16/0x1b
    ---[ end trace 08573eb9f07ff6f4 ]---

    Signed-off-by: Henry C Chang
    Signed-off-by: Sage Weil

    Henry C Chang
     

18 Dec, 2010

1 commit


16 Dec, 2010

1 commit


10 Nov, 2010

2 commits


08 Nov, 2010

1 commit

  • Normally when we open a file we already have a cap, and simply update the
    wanted set. However, if we open a file for write, but don't have an auth
    cap, that doesn't work; we need to open a new cap with the auth MDS. Only
    reuse existing caps if we are opening for read or the existing cap is auth.

    Signed-off-by: Sage Weil

    Sage Weil
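
The reuse rule stated in this commit reduces to a small predicate. The following C sketch uses hypothetical names (`can_reuse_cap()` and its boolean parameters); the real code works on capability structures rather than flags.

```c
#include <stdbool.h>

/*
 * Decide whether an open can simply update the wanted set on an
 * existing cap, or must open a new cap with the auth MDS.
 */
static bool can_reuse_cap(bool have_cap, bool cap_is_auth, bool open_for_write)
{
    if (!have_cap)
        return false;
    /* Reuse only when opening for read or when the cap is the auth cap. */
    return !open_for_write || cap_is_auth;
}
```

The only combination that changes behavior is a write open with a non-auth cap, which now falls through to requesting a new cap.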
     

21 Oct, 2010

1 commit

  • This factors out protocol and low-level storage parts of ceph into a
    separate libceph module living in net/ceph and include/linux/ceph. This
    is mostly a matter of moving files around. However, a few key pieces
    of the interface change as well:

    - ceph_client becomes ceph_fs_client and ceph_client, where the latter
    captures the mon and osd clients, and the fs_client gets the mds client
    and file system specific pieces.
    - Mount option parsing and debugfs setup is correspondingly broken into
    two pieces.
    - The mon client gets a generic handler callback for otherwise unknown
    messages (mds map, in this case).
    - The basic supported/required feature bits can be expanded (and are by
    ceph_fs_client).

    No functional change, aside from some subtle error handling cases that got
    cleaned up in the refactoring process.

    Signed-off-by: Sage Weil

    Yehuda Sadeh
     

07 Oct, 2010

1 commit


04 Aug, 2010

1 commit


03 Aug, 2010

1 commit

  • Implement flock inode operation to support advisory file locking. All
    lock/unlock operations are synchronous with the MDS. Lock state is
    sent when reconnecting to a recovering MDS to restore the shared lock
    state.

    Signed-off-by: Greg Farnum
    Signed-off-by: Sage Weil

    Greg Farnum
     

02 Aug, 2010

3 commits


28 Jul, 2010

1 commit

  • This fixes an issue triggered by running concurrent syncs. One of the syncs
    would go through while the other would just hang indefinitely. In any case, we
    never actually want to wake a single waiter, so the *_all functions should
    be used.

    Signed-off-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Yehuda Sadeh
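
The difference between waking one waiter and waking all of them can be shown with a toy model. This C sketch is purely illustrative: `struct waiter`, `wake_one()`, and `wake_all()` are hypothetical stand-ins and do not use the kernel wait queue API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a task sleeping on a wait queue. */
struct waiter {
    bool woken;
};

/* Wake only the first sleeping waiter; returns how many were woken. */
static size_t wake_one(struct waiter *q, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!q[i].woken) {
            q[i].woken = true;
            return 1;
        }
    }
    return 0;
}

/* Wake every sleeping waiter; returns how many were woken. */
static size_t wake_all(struct waiter *q, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (!q[i].woken) {
            q[i].woken = true;
            count++;
        }
    }
    return count;
}
```

With two concurrent syncs queued, `wake_one()` leaves the second waiter asleep, which matches the hang described above; `wake_all()` releases both.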
     

30 May, 2010

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    ceph: clean up on forwarded aborted mds request
    ceph: fix leak of osd authorizer
    ceph: close out mds, osd connections before stopping auth
    ceph: make lease code DN specific
    fs/ceph: Use ERR_CAST
    ceph: renew auth tickets before they expire
    ceph: do not resend mon requests on auth ticket renewal
    ceph: removed duplicated #includes
    ceph: avoid possible null dereference
    ceph: make mds requests killable, not interruptible
    sched: add wait_for_completion_killable_timeout

    Linus Torvalds
     
  • Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)). The former makes the
    purpose of the operation clearer; the latter looks like a no-op.

    In the case of fs/ceph/inode.c, ERR_CAST is not needed, because the type of
    the returned value is the same as the type of the enclosing function.

    The semantic patch that makes this change is as follows:
    (http://coccinelle.lip6.fr/)

    //
    @@
    type T;
    T x;
    identifier f;
    @@

    T f (...) { }

    @@
    expression x;
    @@

    - ERR_PTR(PTR_ERR(x))
    + ERR_CAST(x)
    //

    Signed-off-by: Julia Lawall
    Signed-off-by: Sage Weil

    Julia Lawall
     

24 May, 2010

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (59 commits)
    ceph: reuse mon subscribe message instead of allocated anew
    ceph: avoid resending queued message to monitor
    ceph: Storage class should be before const qualifier
    ceph: all allocation functions should get gfp_mask
    ceph: specify max_bytes on readdir replies
    ceph: cleanup pool op strings
    ceph: Use kzalloc
    ceph: use common helper for aborted dir request invalidation
    ceph: cope with out of order (unsafe after safe) mds reply
    ceph: save peer feature bits in connection structure
    ceph: resync headers with userland
    ceph: use ceph. prefix for virtual xattrs
    ceph: throw out dirty caps metadata, data on session teardown
    ceph: attempt mds reconnect if mds closes our session
    ceph: clean up send_mds_reconnect interface
    ceph: wait for mds OPEN reply to indicate reconnect success
    ceph: only send cap releases when mds is OPEN|HUNG
    ceph: dicard cap releases on mds restart
    ceph: make mon client statfs handling more generic
    ceph: drop src address(es) from message header [new protocol feature]
    ...

    Linus Torvalds
     

22 May, 2010

1 commit

  • Now that the last user passing a NULL file pointer is gone we can remove
    the redundant dentry argument and associated hacks inside vfs_fsync_range.

    The next step will be removing the dentry argument from ->fsync, but given
    the luck with the last round of method prototype changes I'd rather
    defer this until after the main merge window.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

18 May, 2010

4 commits


04 May, 2010

1 commit


30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming availability. As this
    conversion needs to touch a large number of source files, the following
    script is used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers, which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

02 Mar, 2010

1 commit

  • Verify the file is actually open for the given caps when we are
    waiting for caps. This ensures we will wake up and return EBADF
    if another thread closes the file out from under us.

    Note that EBADF is also the correct return code from write(2)
    when called on a file handle opened for reading (although the
    vfs should catch that).

    Signed-off-by: Sage Weil

    Sage Weil
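
The check described in this commit can be sketched as a bitmask test. This C example is a simplified illustration: the `SKETCH_CAP_*` bit values and `check_open_for_caps()` are hypothetical and do not reflect the real capability encoding.

```c
#include <errno.h>

/* Hypothetical capability bits, for illustration only. */
#define SKETCH_CAP_RD 0x1u
#define SKETCH_CAP_WR 0x2u

/*
 * file_wanted: union of the cap bits wanted by all current opens.
 * need: cap bits the waiting operation requires.
 * Returns 0 if waiting still makes sense, or -EBADF if no open file
 * wants the needed caps any more.
 */
static int check_open_for_caps(unsigned int file_wanted, unsigned int need)
{
    if ((file_wanted & need) == 0)
        return -EBADF;
    return 0;
}
```

A waiter that re-runs this check each time it is woken will notice a concurrent close and return EBADF instead of sleeping forever.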
     

24 Feb, 2010

1 commit


12 Feb, 2010

3 commits

  • If a sync read gets a short result from the OSD, it may need to do a
    getattr to see if it is short due to reaching end-of-file. The getattr
    was being done while holding a reference to FILE_RD, which can lead to
    a deadlock if the MDS is revoking that capability bit and can't process
    the getattr until it does.

    We fix this by setting a flag if EOF size validation is needed, and doing
    the getattr in ceph_aio_read, after the RD cap ref is dropped. If the
    read needs to be continued, we loop and continue traversing the file.

    Signed-off-by: Sage Weil

    Sage Weil
     
  • In the cases where we either do a sync read or a write, we
    need to make sure that everything in the page cache is flushed.
    In the case of a sync write we invalidate the relevant pages,
    so that subsequent read/write reflects the new data written.

    Signed-off-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Yehuda Sadeh
     
  • Zeroing of holes was not done correctly: page_off was miscalculated and
    zeroing the tail did not adjust the 'read' value to include the zeroed
    portion.

    Signed-off-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Yehuda Sadeh
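
Both parts of this fix can be sketched in user-space C. The example assumes a 4096-byte page size, and `page_off_of()` and `zero_tail()` are hypothetical helpers, not the functions changed by the commit.

```c
#include <stddef.h>
#include <string.h>

/* Assumed page size for this sketch; the real value is per-architecture. */
#define SKETCH_PAGE_SIZE 4096u

/* Offset of a file position within its page. */
static size_t page_off_of(unsigned long long pos)
{
    return (size_t)(pos & (SKETCH_PAGE_SIZE - 1));
}

/*
 * Zero buf[read..want) and return the adjusted byte count, so that the
 * zeroed tail is counted as data that was read.
 */
static size_t zero_tail(char *buf, size_t read, size_t want)
{
    if (read < want) {
        memset(buf + read, 0, want - read);
        read = want;
    }
    return read;
}
```

Returning the adjusted count matters because the caller reports that value to user space; zeroing the tail without counting it would hide the zeroed bytes from the reader.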
     

07 Jan, 2010

1 commit


05 Nov, 2009

1 commit


07 Oct, 2009

1 commit

  • File open and close operations, and read and write methods that ensure
    we have obtained the proper capabilities from the MDS cluster before
    performing IO on a file. We take references on held capabilities for
    the duration of the read/write to avoid prematurely releasing them
    back to the MDS.

    We implement two main paths for read and write: one that is buffered
    (and uses generic_aio_{read,write}), and one that is fully synchronous
    and blocking (operating either on a __user pointer or, if O_DIRECT,
    directly on user pages).

    Signed-off-by: Sage Weil

    Sage Weil