29 Jan, 2014

1 commit


01 Jan, 2014

2 commits


14 Dec, 2013

2 commits


24 Nov, 2013

1 commit


07 Sep, 2013

3 commits

  • A previous patch allowed us to clean up most of the issues with pages marked
    as private_2 when calling ceph_readpages. However, there seems to be a case in
    the error-path cleanup in start_read that still triggers this from time to
    time. I've only seen this a couple of times.

    BUG: Bad page state in process petabucket pfn:335b82
    page:ffffea000cd6e080 count:0 mapcount:0 mapping: (null) index:0x0
    page flags: 0x200000000001000(private_2)
    Call Trace:
    [] dump_stack+0x46/0x58
    [] bad_page+0xc7/0x120
    [] free_pages_prepare+0x10e/0x120
    [] free_hot_cold_page+0x40/0x160
    [] __put_single_page+0x27/0x30
    [] put_page+0x25/0x40
    [] ceph_readpages+0x2e9/0x6f0 [ceph]
    [] __do_page_cache_readahead+0x1af/0x260

    Signed-off-by: Milosz Tanski
    Signed-off-by: Sage Weil

    Milosz Tanski
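    A rough sketch of the cleanup idea behind this fix (the page-flag helpers
    are real kernel accessors; exactly where the clearing belongs in the
    start_read error path is an assumption):

        /* On the error path, strip the fscache mark before dropping the
         * page, so the allocator never sees private_2 still set. */
        if (PagePrivate2(page))
                ClearPagePrivate2(page);
        page_cache_release(page);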
     
  • In some cases the ceph readpages code bails without filling all the pages
    already marked by fscache. When we return to the readahead code, this causes
    a BUG.

    Signed-off-by: Milosz Tanski

    Milosz Tanski
     
  • Adding support for fscache to the Ceph filesystem. This brings it on par
    with some of the other network filesystems in Linux (like NFS, AFS, etc.).

    In order to mount the filesystem with fscache the 'fsc' mount option must be
    passed.

    Signed-off-by: Milosz Tanski
    Signed-off-by: Sage Weil

    Milosz Tanski
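    A typical mount invocation would then look something like this (the monitor
    address, mount point, and auth name are illustrative; only the 'fsc' option
    comes from the entry above):

        mount -t ceph 192.168.0.1:6789:/ /mnt/ceph -o name=admin,fsc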
     

28 Aug, 2013

1 commit

  • We will shortly begin to add memcg dirty page accounting around
    __set_page_dirty_{buffers,nobuffers} in the VFS layer, so we'd better use the
    VFS interface to avoid exporting those details to filesystems.

    Since the VFS set_page_dirty() should be called under the page lock, we no
    longer need elaborate code to handle races, and two WARN_ON()s are added to
    detect such exceptions. Thanks very much to Sage and Yan, Zheng for the
    coaching!

    I tested it in a two-server ceph environment, where one server is the client
    and the other is the mds/osd/mon, and ran the following fsx tests from xfstests:

    ./fsx 1MB -N 50000 -p 10000 -l 1048576
    ./fsx 10MB -N 50000 -p 10000 -l 10485760
    ./fsx 100MB -N 50000 -p 10000 -l 104857600

    fsx does lots of mmap-read/mmap-write/truncate operations, and the tests
    completed successfully without triggering any of the WARN_ON()s.

    Signed-off-by: Sha Zhengju
    Reviewed-by: Sage Weil

    Sha Zhengju
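    A minimal sketch of the resulting pattern (the ceph snap context bookkeeping
    is elided, so this is not the actual patch): let the generic VFS helper do
    the dirty accounting, and warn if the locking assumptions are ever violated.

        static int ceph_set_page_dirty(struct page *page)
        {
                WARN_ON(!PageLocked(page));     /* must be called under the page lock */
                WARN_ON(!page->mapping);        /* page must still have a mapping */

                /* ... ceph snap context bookkeeping elided ... */

                return __set_page_dirty_nobuffers(page);
        }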
     

16 Aug, 2013

2 commits


10 Aug, 2013

1 commit

  • The early bug checks are moot because the VMA layer ensures those things:

    1. It will not call invalidatepage unless PagePrivate (or PagePrivate2) is set.
    2. It will not call invalidatepage without taking the page lock first.
    3. It guarantees that the inode's page is mapped.

    Signed-off-by: Milosz Tanski
    Reviewed-by: Sage Weil

    Milosz Tanski
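    Concretely, checks along these lines become redundant and can be dropped
    (illustrative of the pattern, not the exact lines removed):

        BUG_ON(!PageLocked(page));      /* the VM takes the page lock first */
        if (!PagePrivate(page))         /* only called for private pages */
                return;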
     

10 Jul, 2013

1 commit

  • Pull Ceph updates from Sage Weil:
    "There is some follow-on RBD cleanup after the last window's code drop,
    a series from Yan fixing multi-mds behavior in cephfs, and then a
    sprinkling of bug fixes all around. Some warnings, sleeping while
    atomic, a null dereference, and cleanups"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (36 commits)
    libceph: fix invalid unsigned->signed conversion for timespec encoding
    libceph: call r_unsafe_callback when unsafe reply is received
    ceph: fix race between cap issue and revoke
    ceph: fix cap revoke race
    ceph: fix pending vmtruncate race
    ceph: avoid accessing invalid memory
    libceph: Fix NULL pointer dereference in auth client code
    ceph: Reconstruct the func ceph_reserve_caps.
    ceph: Free mdsc if alloc mdsc->mdsmap failed.
    ceph: remove sb_start/end_write in ceph_aio_write.
    ceph: avoid meaningless calling ceph_caps_revoking if sync_mode == WB_SYNC_ALL.
    ceph: fix sleeping function called from invalid context.
    ceph: move inode to proper flushing list when auth MDS changes
    rbd: fix a couple warnings
    ceph: clear migrate seq when MDS restarts
    ceph: check migrate seq before changing auth cap
    ceph: fix race between page writeback and truncate
    ceph: reset iov_len when discarding cap release messages
    ceph: fix cap release race
    libceph: fix truncate size calculation
    ...

    Linus Torvalds
     

04 Jul, 2013

2 commits


22 May, 2013

2 commits

  • The ->invalidatepage() aop now accepts a range to invalidate, so we can make
    use of it in ceph_invalidatepage().

    Signed-off-by: Lukas Czerner
    Acked-by: Sage Weil
    Cc: ceph-devel@vger.kernel.org

    Lukas Czerner
     
  • Currently there is no way to truncate a partial page where the end truncation
    point is not at the end of the page. This is because it was not needed, and
    the existing functionality was enough for the filesystem truncate operation
    to work properly. However, more filesystems now support the punch hole
    feature, and they can benefit from mm supporting truncation of a page just up
    to a certain point.

    Specifically, with this functionality truncate_inode_pages_range() can
    be changed so it supports truncating partial page at the end of the
    range (currently it will BUG_ON() if 'end' is not at the end of the
    page).

    This commit changes the invalidatepage() address space operation
    prototype to accept range to be invalidated and update all the instances
    for it.

    We also change the block_invalidatepage() in the same way and actually
    make a use of the new length argument implementing range invalidation.

    Actual file system implementations will follow, except for the file systems
    where the changes are really simple and should not change the behaviour in
    any way. An implementation of truncate_page_range(), which will be able to
    accept page-unaligned ranges, will follow as well.

    Signed-off-by: Lukas Czerner
    Cc: Andrew Morton
    Cc: Hugh Dickins

    Lukas Czerner
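    For reference, the aop prototype after this change carries both an offset
    and a length (shown as merged in mainline around this time):

        void (*invalidatepage)(struct page *page,
                               unsigned int offset, unsigned int length);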
     

02 May, 2013

21 commits

  • In the incremental move toward supporting distinct data items in an
    osd request some of the functions had "write_request" parameters to
    indicate, basically, whether the data belonged to in_data or the
    out_data. Now that we maintain the data fields in the op structure
    there is no need to indicate the direction, so get rid of the
    "write_request" parameters.

    Signed-off-by: Alex Elder
    Reviewed-by: Josh Durgin

    Alex Elder
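    Schematically (the function name is from this series; the argument lists are
    simplified and should be treated as illustrative):

        /* before: caller states the data direction explicitly */
        osd_req_op_extent_osd_data_pages(req, 0, true /* write_request */,
                                         pages, len, page_align, false, false);

        /* after: the op itself implies the direction */
        osd_req_op_extent_osd_data_pages(req, 0, pages, len, page_align,
                                         false, false);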
     
  • ceph_writepages_start() reads inode->i_size in two places. It can get
    different values between successive reads, because truncate can change
    inode->i_size at any time. The race can lead to a mismatch between the data
    length of the osd request and the pages marked as writeback. When the osd
    request finishes, it clears the writeback state of pages according to its
    data length, so some pages can be left in the writeback state forever. The
    fix is to read inode->i_size only once, save its value in a local variable,
    and use the local variable wherever i_size is needed (see the sketch below).

    Signed-off-by: Yan, Zheng
    Reviewed-by: Alex Elder

    Yan, Zheng
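    The shape of the fix, as a minimal sketch:

        /* snapshot the size once; truncate may change inode->i_size at
         * any moment, so never re-read it later in the function */
        loff_t i_size = i_size_read(inode);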
     
  • This ends up being a rather large patch but what it's doing is
    somewhat straightforward.

    Basically, this is replacing two calls with one. The first of the
    two calls is initializing a struct ceph_osd_data with data (either a
    page array, a page list, or a bio list); the second is setting an
    osd request op so it associates that data with one of the op's
    parameters. In place of those two will be a single function that
    initializes the op directly.

    That means we sort of fan out a set of the needed functions:
    - extent ops with pages data
    - extent ops with pagelist data
    - extent ops with bio list data
    and
    - class ops with page data for receiving a response

    We also define another one, but it's only used internally:

    Note that we *still* haven't gotten rid of the osd request's
    r_data_in and r_data_out fields. All the osd ops refer to them for
    their data. For now, these data fields are pointers assigned to the
    appropriate r_data_* field when these new functions are called.

    Signed-off-by: Alex Elder
    Reviewed-by: Josh Durgin

    Alex Elder
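    Schematically, for the pages case (function names from this series; the
    argument lists are simplified, e.g. the write_request flag of that era is
    omitted):

        /* before: initialize a ceph_osd_data, then attach it to the op */
        ceph_osd_data_pages_init(&osd_data, pages, len, page_align,
                                 false, false);
        osd_req_op_extent_osd_data(req, 0, &osd_data);

        /* after: one call initializes the op's data directly */
        osd_req_op_extent_osd_data_pages(req, 0, pages, len, page_align,
                                         false, false);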
     
  • An osd request now holds all of its source op structures, and every place
    that initializes one of these is in fact initializing one of the entries in
    the osd request's array.

    So rather than supplying the address of the op to initialize, have the
    caller specify the osd request and an indication of which op it would like
    to initialize. This better hides the details of the op structure (and
    facilitates moving the data pointers they use).

    Since osd_req_op_init() is a common routine, and it's not used
    outside the osd client code, give it static scope. Also make
    it return the address of the specified op (so all the other
    init routines don't have to repeat that code).

    Signed-off-by: Alex Elder
    Reviewed-by: Josh Durgin

    Alex Elder
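    The helper then looks roughly like this (a sketch; the real routine also
    sanity-checks the opcode):

        static struct ceph_osd_req_op *
        osd_req_op_init(struct ceph_osd_request *osd_req, unsigned int which,
                        u16 opcode)
        {
                struct ceph_osd_req_op *op;

                BUG_ON(which >= osd_req->r_num_ops);

                op = &osd_req->r_ops[which];
                memset(op, 0, sizeof(*op));
                op->op = opcode;

                return op;
        }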
     
  • An extent type osd operation currently implies that there will
    be corresponding data supplied in the data portion of the request
    (for write) or response (for read) message. Similarly, an osd class
    method operation implies a data item will be supplied to receive
    the response data from the operation.

    Add a ceph_osd_data pointer to each of those structures, and assign it to
    point to either the incoming or the outgoing data structure in
    the osd message. The data is not always available when an op is
    initially set up, so add two new functions to allow setting them
    after the op has been initialized.

    Begin to make use of the data item pointer available in the osd operation,
    rather than the request's data-in or data-out structure, in places where
    it's convenient. Add some assertions to verify pointers are always set the
    way they're expected to be.

    This is a sort of stepping stone toward really moving the data
    into the osd request ops, to allow for some validation before
    making that jump.

    This is the first in a series of patches that resolve:
    http://tracker.ceph.com/issues/4657

    Signed-off-by: Alex Elder
    Reviewed-by: Josh Durgin

    Alex Elder
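    In sketch form, each affected op variant gains a data pointer (the member
    layout here is abbreviated and partly from memory):

        struct ceph_osd_req_op {
                u16 op;
                union {
                        struct {
                                u64 offset, length;
                                struct ceph_osd_data *osd_data;
                        } extent;
                        struct {
                                const char *class_name;
                                const char *method_name;
                                struct ceph_osd_data *response_data;
                        } cls;
                };
        };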
     
  • An osd request keeps a pointer to the osd operations (ops) array
    that it builds in its request message.

    In order to allow each op in the array to have its own distinct
    data, we will need to keep track of each op's data, and that
    information does not go over the wire.

    As long as we're tracking the data we might as well just track the
    entire (source) op definition for each of the ops. And if we're
    doing that, we'll have no more need to keep a pointer to the
    wire-encoded version.

    This patch makes the array of source ops be kept with the osd
    request structure, and uses that instead of the version encoded in
    the message in places where that was previously used. The array
    will be embedded in the request structure, and the maximum number of
    ops we ever actually use is currently 2. So reduce CEPH_OSD_MAX_OP
    to 2 to reduce the size of the structure.

    The result of doing this sort of ripples back up, and as a result
    various function parameters and local variables become unnecessary.

    Make r_num_ops be unsigned, and move the definition of struct
    ceph_osd_req_op earlier to ensure it's defined where needed.

    It does not yet add per-op data, that's coming soon.

    This resolves:
    http://tracker.ceph.com/issues/4656

    Signed-off-by: Alex Elder
    Reviewed-by: Josh Durgin

    Alex Elder
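    In outline, the request structure then embeds its source ops (a sketch of
    the shape, not the full struct):

        #define CEPH_OSD_MAX_OP 2       /* we never actually use more than 2 */

        struct ceph_osd_request {
                /* ... */
                unsigned int            r_num_ops;
                struct ceph_osd_req_op  r_ops[CEPH_OSD_MAX_OP];
                /* ... */
        };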
     
  • These are very small changes that make use of osd_data local pointers as
    shorthands for the structures being operated on.

    Signed-off-by: Alex Elder
    Reviewed-by: Josh Durgin

    Alex Elder
     
  • Define and use functions that encapsulate the initialization of a
    ceph_osd_data structure.

    Signed-off-by: Alex Elder
    Reviewed-by: Josh Durgin

    Alex Elder
     
  • Hold off building the osd request message in ceph_writepages_start()
    until just before it will be submitted to the osd client for
    execution.

    We'll still create the request and allocate the page pointer array
    after we learn we have at least one page to write. A local variable
    will be used to keep track of the allocated array of pages. Wait
    until just before submitting the request for assigning that page
    array pointer to the request message.

    Create and use a new function osd_req_op_extent_update() whose purpose is to
    serve this one spot where the length value supplied when an osd request's op
    was initially formatted might need to be changed (reduced, never increased)
    before submitting the request.

    Previously, ceph_writepages_start() assigned the message header's
    data length because of this update. That's no longer necessary,
    because ceph_osdc_build_request() will recalculate the right
    value to use based on the content of the ops in the request.

    Signed-off-by: Alex Elder
    Reviewed-by: Josh Durgin

    Alex Elder
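    The new helper's role, sketched (the BUG_ON mirrors the reduce-only rule;
    the exact body is reconstructed from memory):

        void osd_req_op_extent_update(struct ceph_osd_request *osd_req,
                                      unsigned int which, u64 length)
        {
                struct ceph_osd_req_op *op = &osd_req->r_ops[which];

                BUG_ON(length > op->extent.length);     /* never increase */
                op->extent.length = length;
        }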
     
  • Defer building the osd request until just before submitting it in all
    callers except ceph_writepages_start(). (That caller will be handled in the
    next patch.)

    Signed-off-by: Alex Elder
    Reviewed-by: Josh Durgin

    Alex Elder
     
  • There is a helper function alloc_page_vec() that, despite its
    generic-sounding name, depends heavily on an osd request structure being
    populated with certain information.

    There is only one place this function is used, and it ends up
    being a bit simpler to just open code what it does, so get
    rid of the helper.

    The real motivation for this is deferring the building of the osd request
    message; this is a step in that direction.

    Signed-off-by: Alex Elder
    Reviewed-by: Josh Durgin

    Alex Elder
     
  • Mostly for readability, define ceph_writepages_osd_request() and
    use it to allocate the osd request for ceph_writepages_start().

    Signed-off-by: Alex Elder
    Reviewed-by: Josh Durgin

    Alex Elder
     
  • This patch moves the call to ceph_osdc_build_request() out of
    ceph_osdc_new_request() and into its caller.

    This is in order to defer formatting osd operation information into the
    request message until just before the request is started.

    The only unusual (ab)user of ceph_osdc_build_request() is
    ceph_writepages_start(), where the final length of the write request may
    change (downward) based on the current inode size or the oldest snapshot
    context with dirty data for the inode.

    The remaining callers don't change anything in the request after it has
    been built.

    This means the ops array is now supplied by the caller. It also means there
    is no need to pass the mtime to ceph_osdc_new_request() (it gets provided to
    ceph_osdc_build_request()). And rather than passing a do_sync flag, have the
    number of ops in the supplied ops array imply adding a second STARTSYNC
    operation after the requested READ or WRITE.

    This and some of the patches that follow are related to having the
    messenger (only) be responsible for filling the content of the
    message header, as described here:
    http://tracker.ceph.com/issues/4589

    Signed-off-by: Alex Elder
    Reviewed-by: Josh Durgin

    Alex Elder
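    The caller-side pattern after this change, as a sketch (argument lists
    reproduced from memory of this kernel era; treat them as illustrative):

        req = ceph_osdc_new_request(osdc, layout, vino, off, &len,
                                    num_ops, opcode, flags, snapc,
                                    truncate_seq, truncate_size, false);
        /* ... attach the op's data ... */
        ceph_osdc_build_request(req, off, snapc, vino.snap, &mtime);
        ceph_osdc_start_request(osdc, req, false);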
     
  • There's one spot in ceph_writepages_start() that open-codes what
    page_offset() does. Use the macro, which does the shift safely, so we don't
    have to worry about wrapping (see the comparison below).

    This resolves:
    http://tracker.ceph.com/issues/4648

    Signed-off-by: Alex Elder
    Reviewed-by: Josh Durgin

    Alex Elder
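    The difference, schematically:

        /* open-coded: page->index is unsigned long, so the shift can
         * wrap on 32-bit before being widened to loff_t */
        pos = page->index << PAGE_CACHE_SHIFT;

        /* page_offset() casts to loff_t before shifting, so it can't */
        pos = page_offset(page);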
     
  • Record the byte count for an osd request rather than the page count.
    The number of pages can always be derived from the byte count (and
    alignment/offset) but the reverse is not true.

    Signed-off-by: Alex Elder
    Reviewed-by: Josh Durgin

    Alex Elder
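    The one-way derivation is a one-liner with the existing libceph helper
    (usage shown as a sketch):

        /* pages can always be derived from bytes plus alignment */
        num_pages = calc_pages_for(page_align, byte_length);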
     
  • An osd request defines information about where data to be read
    should be placed as well as where data to write comes from.
    Currently these are represented by common fields.

    Keep information about data for writing separate from data to be
    read by splitting these into data_in and data_out fields.

    This is the key patch in this whole series, in that it actually
    identifies which osd requests generate outgoing data and which
    generate incoming data. It's less obvious (currently) that an osd
    CALL op generates both outgoing and incoming data; that's the focus
    of some upcoming work.

    This resolves:
    http://tracker.ceph.com/issues/4127

    Signed-off-by: Alex Elder
    Reviewed-by: Josh Durgin

    Alex Elder
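    Schematically, the request gains one ceph_osd_data per direction (a sketch
    of the shape only):

        struct ceph_osd_request {
                /* ... */
                struct ceph_osd_data    r_data_in;      /* data to be read */
                struct ceph_osd_data    r_data_out;     /* data to be written */
                /* ... */
        };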
     
  • An osd request uses either pages or a bio list for its data. Use a
    union to record information about the two, and add a data type
    tag to select between them.

    Signed-off-by: Alex Elder
    Reviewed-by: Josh Durgin

    Alex Elder
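    In outline (an anonymous-union sketch; in the real code the bio member is
    built only with CONFIG_BLOCK):

        struct ceph_osd_data {
                enum ceph_osd_data_type type;   /* NONE, PAGES, or BIO */
                union {
                        struct {
                                struct page     **pages;
                                u64             length;
                                u32             alignment;
                                bool            pages_from_pool;
                                bool            own_pages;
                        };
                        struct bio      *bio;
                };
        };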
     
  • Pull the fields in an osd request structure that define the data for
    the request out into a separate structure.

    Signed-off-by: Alex Elder
    Reviewed-by: Josh Durgin

    Alex Elder
     
  • Currently ceph_osdc_new_request() assigns an osd request's
    r_num_pages and r_alignment fields. The only thing it does
    after that is call ceph_osdc_build_request(), and that doesn't
    need those fields to be assigned.

    Move the assignment of those fields out of ceph_osdc_new_request()
    and into its caller. As a result, the page_align parameter is no
    longer used, so get rid of it.

    Note that in ceph_sync_write(), the value for req->r_num_pages had already
    been calculated earlier (as num_pages, and fortunately it was computed the
    same way), so don't bother recomputing it. But because it's not needed
    earlier, move that calculation after the call to ceph_osdc_new_request().
    Hold off making the assignment to r_alignment, doing it instead where
    r_pages and r_num_pages are getting set.

    Similarly, in start_read(), nr_pages already holds the number of
    pages in the array (and is calculated the same way), so there's no
    need to recompute it. Move the assignment of the page alignment
    down with the others there as well.

    This and the next few patches are preparation work for:
    http://tracker.ceph.com/issues/4127

    Signed-off-by: Alex Elder
    Reviewed-by: Josh Durgin

    Alex Elder
     
  • There's a spot that computes the number of pages to allocate for a
    page-aligned length by just shifting it. Use calc_pages_for()
    instead, to be consistent with usage everywhere else. The result
    is the same.

    The reason for this is to make it clearer in an upcoming patch that
    this calculation is duplicated.

    Signed-off-by: Alex Elder
    Reviewed-by: Josh Durgin

    Alex Elder
     
  • Commit 22cddde104 breaks the atomicity of the write operation, and it also
    introduces a deadlock between write and truncate.

    Signed-off-by: Yan, Zheng
    Reviewed-by: Greg Farnum

    Conflicts:
    fs/ceph/addr.c

    Sage Weil
     

01 Mar, 2013

1 commit

  • Pull Ceph updates from Sage Weil:
    "A few groups of patches here. Alex has been hard at work improving
    the RBD code, layout groundwork for understanding the new formats and
    doing layering. Most of the infrastructure is now in place for the
    final bits that will come with the next window.

    There are a few changes to the data layout. Jim Schutt's patch fixes
    some non-ideal CRUSH behavior, and a set of patches from me updates
    the client to speak a newer version of the protocol and implement an
    improved hashing strategy across storage nodes (when the server side
    supports it too).

    A pair of patches from Sam Lang fix the atomicity of open+create
    operations. Several patches from Yan, Zheng fix various mds/client
    issues that turned up during multi-mds torture tests.

    A final set of patches expose file layouts via virtual xattrs, and
    allow the policies to be set on directories via xattrs as well
    (avoiding the awkward ioctl interface and providing a consistent
    interface for both kernel mount and ceph-fuse users)."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (143 commits)
    libceph: add support for HASHPSPOOL pool flag
    libceph: update osd request/reply encoding
    libceph: calculate placement based on the internal data types
    ceph: update support for PGID64, PGPOOL3, OSDENC protocol features
    ceph: update "ceph_features.h"
    libceph: decode into cpu-native ceph_pg type
    libceph: rename ceph_pg -> ceph_pg_v1
    rbd: pass length, not op for osd completions
    rbd: move rbd_osd_trivial_callback()
    libceph: use a do..while loop in con_work()
    libceph: use a flag to indicate a fault has occurred
    libceph: separate non-locked fault handling
    libceph: encapsulate connection backoff
    libceph: eliminate sparse warnings
    ceph: eliminate sparse warnings in fs code
    rbd: eliminate sparse warnings
    libceph: define connection flag helpers
    rbd: normalize dout() calls
    rbd: barriers are hard
    rbd: ignore zero-length requests
    ...

    Linus Torvalds