22 Mar, 2012

26 commits

  • In write_partial_msg_pages(), every case now does an identical call
    to kmap(page). Instead, just call it once inside the CRC-computing
    block where it's needed. Move the definition of kaddr inside that
    block, and make it a (char *) to ensure portable pointer arithmetic.

    We still don't kunmap() it until after the sendpage() call, in case
    that also ends up needing to use the mapping.

    Signed-off-by: Alex Elder
    Reviewed-by: Sage Weil

    Alex Elder
     
  • In write_partial_msg_pages() there is a local variable used to
    track the starting offset within a bio segment to use. Its name,
    "page_shift" defies the Linux convention of using that name for
    log-base-2(page size).

    Since it's only used in the bio case rename it "bio_offset". Use it
    along with the page_pos field to compute the memory offset when
    computing CRC's in that function. This makes the bio case match the
    others more closely.

    Signed-off-by: Alex Elder
    Reviewed-by: Sage Weil

    Alex Elder
     
  • There's not a lot of benefit to zero_page_address, which basically
    holds a mapping of the zero page through the life of the messenger
    module. Even with our own mapping, the sendpage interface where
    it's used may need to kmap() it again. It's almost certain to
    be in low memory anyway.

    So stop treating the zero page specially in write_partial_msg_pages()
    and just get rid of zero_page_address entirely.

    Signed-off-by: Alex Elder
    Reviewed-by: Sage Weil

    Alex Elder
     
  • Make ceph_tcp_sendpage() be the only place kernel_sendpage() is
    used, by using this helper in write_partial_msg_pages().

    Signed-off-by: Alex Elder
    Reviewed-by: Sage Weil

    Alex Elder
     
  • If a message queued for send gets revoked, zeroes are sent over the
    wire instead of any unsent data. This is done by constructing a
    message and passing it to kernel_sendmsg() via ceph_tcp_sendmsg().

    Since we are already working with a page in this case we can use
    the sendpage interface instead. Create a new ceph_tcp_sendpage()
    helper that sets up flags to match the way ceph_tcp_sendmsg()
    does now.

    Signed-off-by: Alex Elder
    Reviewed-by: Sage Weil

    Alex Elder
     
  • CRC's are computed for all messages between ceph entities. The CRC
    computation for the data portion of message can optionally be
    disabled using the "nocrc" (common) ceph option. The default is
    for CRC computation for the data portion to be enabled.

    Unfortunately, the code that implements this feature interprets the
    feature flag wrong, meaning that by default the CRC's have *not*
    been computed (or checked) for the data portion of messages unless
    the "nocrc" option was supplied.

    Fix this, in write_partial_msg_pages() and read_partial_message().
    Also change the flag variable in write_partial_msg_pages() to be
    "no_datacrc" to match the usage elsewhere in the file.

    This fixes http://tracker.newdream.net/issues/2064

    Signed-off-by: Alex Elder
    Reviewed-by: Sage Weil

    Alex Elder
     
  • Nothing too big here.
    - define the size of the buffer used for consuming ignored
    incoming data using a symbolic constant
    - simplify the condition determining whether to unmap the page
    in write_partial_msg_pages(): do it for crc but not if the
    page is the zero page

    Signed-off-by: Alex Elder
    Signed-off-by: Sage Weil

    Alex Elder
     
  • Make a small change in the code that counts down kvecs consumed by
    a ceph_tcp_sendmsg() call. Same functionality, just blocked out
    a little differently.

    Signed-off-by: Alex Elder
    Signed-off-by: Sage Weil

    Alex Elder
     
  • Move blocks of code out of loops in read_partial_message_section()
    and read_partial_message(). They were only was getting called at
    the end of the last iteration of the loop anyway.

    Signed-off-by: Alex Elder
    Signed-off-by: Sage Weil

    Alex Elder
     
  • Calculate CRC in a separate step from rearranging the byte order
    of the result, to improve clarity and readability.

    Use offsetof() to determine the number of bytes to include in the
    CRC calculation.

    In read_partial_message(), switch which value gets byte-swapped,
    since the just-computed CRC is already likely to be in a register.

    Signed-off-by: Alex Elder
    Signed-off-by: Sage Weil

    Alex Elder
     
  • Change the name (and type) of a few CRC-related Boolean local
    variables so they contain the word "do", to distingish their purpose
    from variables used for holding an actual CRC value.

    Note that in the process of doing this I identified a fairly serious
    logic error in write_partial_msg_pages(): the value of "do_crc"
    assigned appears to be the opposite of what it should be. No
    attempt to fix this is made here; this change preserves the
    erroneous behavior. The problem I found is documented here:
    http://tracker.newdream.net/issues/2064

    Signed-off-by: Alex Elder
    Signed-off-by: Sage Weil

    Alex Elder
     
  • Many ceph-related Boolean options offer the ability to both enable
    and disable a feature. For all those that don't offer this, add
    a new option so that they do.

    Note that ceph_show_options()--which reports mount options currently
    in effect--only reports the option if it is different from the
    default value.

    Signed-off-by: Alex Elder
    Signed-off-by: Sage Weil

    Alex Elder
     
  • This gathers a number of very minor changes:
    - use %hu when formatting the a socket address's address family
    - null out the ceph_msgr_wq pointer after the queue has been
    destroyed
    - drop a needless cast in ceph_write_space()
    - add a WARN() call in ceph_state_change() in the event an
    unrecognized socket state is encountered
    - rearrange the logic in ceph_con_get() and ceph_con_put() so
    that:
    - the reference counts are only atomically read once
    - the values displayed via dout() calls are known to
    be meaningful at the time they are formatted

    Signed-off-by: Alex Elder
    Signed-off-by: Sage Weil

    Alex Elder
     
  • There is no real need for ceph_tcp_connect() to return the socket
    pointer it creates, since it already assigns it to con->sock, which
    is visible to the caller. Instead, have it return an error code,
    which tidies things up a bit.

    Signed-off-by: Alex Elder
    Signed-off-by: Sage Weil

    Alex Elder
     
  • Define a helper function to perform various cleanup operations. Use
    it both in the exit routine and in the init routine in the event of
    an error.

    Signed-off-by: Alex Elder
    Signed-off-by: Sage Weil

    Alex Elder
     
  • The messenger workqueue has no need to be public. So give it static
    scope.

    Signed-off-by: Alex Elder
    Signed-off-by: Sage Weil

    Alex Elder
     
  • Encapsulate the operation of adding a new chunk of data to the next
    open slot in a ceph_connection's out_kvec array. Also add a "reset"
    operation to make subsequent add operations start at the beginning
    of the array again.

    Use these routines throughout, avoiding duplicate code and ensuring
    all calls are handled consistently.

    Signed-off-by: Alex Elder
    Signed-off-by: Sage Weil

    Alex Elder
     
  • One of the arguments to prepare_write_connect() indicates whether it
    is being called immediately after a call to prepare_write_banner().
    Move the prepare_write_banner() call inside prepare_write_connect(),
    and reinterpret (and rename) the "after_banner" argument so it
    indicates that prepare_write_connect() should *make* the call
    rather than should know it has already been made.

    This was split out from the next patch to highlight this change in
    logic.

    Signed-off-by: Alex Elder
    Signed-off-by: Sage Weil

    Alex Elder
     
  • ceph_parse_options() takes the address of a pointer as an argument
    and uses it to return the address of an allocated structure if
    successful. With this interface is not evident at call sites that
    the pointer is always initialized. Change the interface to return
    the address instead (or a pointer-coded error code) to make the
    validity of the returned pointer obvious.

    Signed-off-by: Alex Elder
    Signed-off-by: Sage Weil

    Alex Elder
     
  • This fixes some spots where a type cast to (void *) was used as
    as a universal type hiding mechanism. Instead, properly cast the
    type to the intended target type.

    Signed-off-by: Alex Elder
    Signed-off-by: Sage Weil

    Alex Elder
     
  • This eliminates type casts in some places where they are not
    required.

    Signed-off-by: Alex Elder
    Signed-off-by: Sage Weil

    Alex Elder
     
  • A spinlock is used to protect a value used for selecting an array
    index for a string used for formatting a socket address for human
    consumption. The index is reset to 0 if it ever reaches the maximum
    index value.

    Instead, use an ever-increasing atomic variable as a sequence
    number, and compute the array index by masking off all but the
    sequence number's lowest bits. Make the number of entries in the
    array a power of two to allow the use of such a mask (to avoid jumps
    in the index value when the sequence number wraps).

    The length of these strings is somewhat arbitrarily set at 60 bytes.
    The worst-case length of a string produced is 54 bytes, for an IPv6
    address that can't be shortened, e.g.:
    [1234:5678:9abc:def0:1111:2222:123.234.210.100]:32767
    Change it so we arbitrarily use 64 bytes instead; if nothing else
    it will make the array of these line up better in hex dumps.

    Rename a few things to reinforce the distinction between the number
    of strings in the array and the length of individual strings.

    Signed-off-by: Alex Elder
    Signed-off-by: Sage Weil

    Alex Elder
     
  • Rearrange ceph_tcp_connect() a bit, making use of "else" rather than
    re-testing a value with consecutive "if" statements. Don't record a
    connection's socket pointer unless the connect operation is
    successful.

    Signed-off-by: Alex Elder
    Signed-off-by: Sage Weil

    Alex Elder
     
  • Each messenger allocates a page to be used when writing zeroes
    out in the event of error or other abnormal condition. Instead,
    use the kernel ZERO_PAGE() for that purpose.

    Signed-off-by: Alex Elder
    Signed-off-by: Sage Weil

    Alex Elder
     
  • The existing overflow check (n > ULONG_MAX / b) didn't work, because
    n = ULONG_MAX / b would both bypass the check and still overflow the
    allocation size a + n * b.

    The correct check should be (n > (ULONG_MAX - a) / b).

    Signed-off-by: Xi Wang
    Signed-off-by: Sage Weil

    Xi Wang
     
  • The Ceph messenger would sometimes queue multiple work items to write
    data to a socket when the socket buffer was full.

    Fix this problem by making ceph_write_space() use SOCK_NOSPACE in the
    same way that net/core/stream.c:sk_stream_write_space() does, i.e.,
    clearing it only when sufficient space is available in the socket buffer.

    Signed-off-by: Jim Schutt
    Reviewed-by: Alex Elder

    Jim Schutt
     

03 Feb, 2012

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    rbd: fix safety of rbd_put_client()
    rbd: fix a memory leak in rbd_get_client()
    ceph: create a new session lock to avoid lock inversion
    ceph: fix length validation in parse_reply_info()
    ceph: initialize client debugfs outside of monc->mutex
    ceph: change "ceph.layout" xattr to be "ceph.file.layout"

    Linus Torvalds
     
  • Initializing debufs under monc->mutex introduces a lock dependency for
    sb->s_type->i_mutex_key, which (combined with several other dependencies)
    leads to an annoying lockdep warning. There's no particular reason to do
    the debugfs setup under this lock, so move it out.

    It used to be the case that our first monmap could come from the OSD; that
    is no longer the case with recent servers, so we will reliably set up the
    client entry during the initial authentication.

    We don't have to worry about racing with debugfs teardown by
    ceph_debugfs_client_cleanup() because ceph_destroy_client() calls
    ceph_msgr_flush() first, which will wait for the message dispatch work
    to complete (and the debugfs init to complete).

    Fixes: #1940
    Signed-off-by: Sage Weil

    Sage Weil
     

11 Jan, 2012

3 commits


14 Dec, 2011

1 commit


13 Dec, 2011

1 commit


22 Nov, 2011

1 commit


12 Nov, 2011

1 commit

  • ceph_osd_request struct allocates a 40-byte buffer for object names.
    RBD image names can be up to 96 chars long (100 with the .rbd suffix),
    which results in the object name for the image being truncated, and a
    subsequent map failure.

    Increase the oid buffer in request messages, in order to avoid the
    truncation.

    Signed-off-by: Stratos Psomadakis
    Signed-off-by: Sage Weil

    Stratos Psomadakis
     

01 Nov, 2011

1 commit


26 Oct, 2011

4 commits