31 May, 2012

1 commit

  • Pull ceph updates from Sage Weil:
    "There are some updates and cleanups to the CRUSH placement code, a bug
    fix with incremental maps, several cleanups and fixes from Josh Durgin
    in the RBD block device code, a series of cleanups and bug fixes from
    Alex Elder in the messenger code, and some miscellaneous bounds
    checking and gfp cleanups/fixes."

    Fix up trivial conflicts in net/ceph/{messenger.c,osdmap.c} due to the
    networking people preferring "unsigned int" over just "unsigned".

    * git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (45 commits)
    libceph: fix pg_temp updates
    libceph: avoid unregistering osd request when not registered
    ceph: add auth buf in prepare_write_connect()
    ceph: rename prepare_connect_authorizer()
    ceph: return pointer from prepare_connect_authorizer()
    ceph: use info returned by get_authorizer
    ceph: have get_authorizer methods return pointers
    ceph: ensure auth ops are defined before use
    ceph: messenger: reduce args to create_authorizer
    ceph: define ceph_auth_handshake type
    ceph: messenger: check return from get_authorizer
    ceph: messenger: rework prepare_connect_authorizer()
    ceph: messenger: check prepare_write_connect() result
    ceph: don't set WRITE_PENDING too early
    ceph: drop msgr argument from prepare_write_connect()
    ceph: messenger: send banner in process_connect()
    ceph: messenger: reset connection kvec caller
    libceph: don't reset kvec in prepare_write_banner()
    ceph: ignore preferred_osd field
    ceph: fully initialize new layout
    ...

    Linus Torvalds
     

17 May, 2012

4 commits

  • Rather than passing a bunch of arguments to be filled in with the
    content of the ceph_auth_handshake buffer now returned by the
    get_authorizer method, just use the returned information in the
    caller, and drop the unnecessary arguments.

    Signed-off-by: Alex Elder
    Reviewed-by: Sage Weil

    Alex Elder
     
  • Have the get_authorizer auth_client method return a ceph_auth
    pointer rather than an integer, pointer-encoding any returned
    error value. This is to pave the way for making use of the
    returned value in an upcoming patch.

    Signed-off-by: Alex Elder
    Reviewed-by: Sage Weil

    Alex Elder
     
  • Make use of the new ceph_auth_handshake structure in order to reduce
    the number of arguments passed to the create_authorizer method in
    ceph_auth_client_ops. Use a local variable of that type as a
    shorthand in the get_authorizer method definitions.

    Signed-off-by: Alex Elder
    Reviewed-by: Sage Weil

    Alex Elder
     
  • The definitions for the ceph_mds_session and ceph_osd both contain
    five fields related only to "authorizers." Encapsulate those fields
    into their own struct type, allowing for better isolation in some
    upcoming patches.

    Fix the #includes in "linux/ceph/osd_client.h" to lay out their more
    complete canonical path.
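
    As an illustration only (not part of the commit), the encapsulation
    described above might look like the sketch below. The struct and member
    names are assumptions made for the example and are not copied from the
    kernel headers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical grouping of the five authorizer-related fields. */
struct auth_handshake {
	void *authorizer;
	void *authorizer_buf;
	size_t authorizer_buf_len;
	void *authorizer_reply_buf;
	size_t authorizer_reply_buf_len;
};

/* A session embeds the group instead of carrying five loose fields. */
struct session {
	int id;
	struct auth_handshake auth;
};

/* Helpers can now take one pointer rather than several arguments. */
static void auth_handshake_set_buf(struct auth_handshake *auth,
				   void *buf, size_t len)
{
	assert(auth != NULL);
	auth->authorizer_buf = buf;
	auth->authorizer_buf_len = len;
}
```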

    Signed-off-by: Alex Elder
    Reviewed-by: Sage Weil

    Alex Elder
     

15 May, 2012

1 commit


08 May, 2012

1 commit

  • This was an ill-conceived feature that has been removed from Ceph. Do
    this gracefully:

    - reject attempts to specify a preferred_osd via the ioctl
    - stop exposing this information via virtual xattrs
    - always fill in -1 for requests, in case we talk to an older server
    - don't calculate preferred_osd placements/pgids

    Reviewed-by: Alex Elder
    Signed-off-by: Sage Weil

    Sage Weil
     

29 Mar, 2012

1 commit

  • Pull Ceph updates for 3.4-rc1 from Sage Weil:
    "Alex has been busy. There are a range of rbd and libceph cleanups,
    especially surrounding device setup and teardown, and a few critical
    fixes in that code. There are more cleanups in the messenger code,
    virtual xattrs, a fix for CRC calculation/checks, and lots of other
    miscellaneous stuff.

    There's a patch from Amon Ott to make inos behave a bit better on
    32-bit boxes, some decode check fixes from Xi Wang, a network
    throttling fix from Jim Schutt, and a couple RBD fixes from Josh
    Durgin.

    No new functionality, just a lot of cleanup and bug fixing."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (65 commits)
    rbd: move snap_rwsem to the device, rename to header_rwsem
    ceph: fix three bugs, two in ceph_vxattrcb_file_layout()
    libceph: isolate kmap() call in write_partial_msg_pages()
    libceph: rename "page_shift" variable to something sensible
    libceph: get rid of zero_page_address
    libceph: only call kernel_sendpage() via helper
    libceph: use kernel_sendpage() for sending zeroes
    libceph: fix inverted crc option logic
    libceph: some simple changes
    libceph: small refactor in write_partial_kvec()
    libceph: do crc calculations outside loop
    libceph: separate CRC calculation from byte swapping
    libceph: use "do" in CRC-related Boolean variables
    ceph: ensure Boolean options support both senses
    libceph: a few small changes
    libceph: make ceph_tcp_connect() return int
    libceph: encapsulate some messenger cleanup code
    libceph: make ceph_msgr_wq private
    libceph: encapsulate connection kvec operations
    libceph: move prepare_write_banner()
    ...

    Linus Torvalds
     

22 Mar, 2012

4 commits

  • Change the name (and type) of a few CRC-related Boolean local
    variables so they contain the word "do", to distinguish their purpose
    from variables used for holding an actual CRC value.

    Note that in the process of doing this I identified a fairly serious
    logic error in write_partial_msg_pages(): the value of "do_crc"
    assigned appears to be the opposite of what it should be. No
    attempt to fix this is made here; this change preserves the
    erroneous behavior. The problem I found is documented here:
    http://tracker.newdream.net/issues/2064

    Signed-off-by: Alex Elder
    Signed-off-by: Sage Weil

    Alex Elder
     
  • The messenger workqueue has no need to be public. So give it static
    scope.

    Signed-off-by: Alex Elder
    Signed-off-by: Sage Weil

    Alex Elder
     
  • ceph_parse_options() takes the address of a pointer as an argument
    and uses it to return the address of an allocated structure if
    successful. With this interface it is not evident at call sites that
    the pointer is always initialized. Change the interface to return
    the address instead (or a pointer-coded error code) to make the
    validity of the returned pointer obvious.
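
    A user-space sketch of the pointer-coded error idiom, for illustration
    only. The helpers err_ptr(), ptr_err() and is_err() imitate the kernel's
    ERR_PTR(), PTR_ERR() and IS_ERR(); parse_options() and struct options
    are hypothetical stand-ins, not the real ceph_parse_options().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *err_ptr(long error)
{
	assert(error < 0 && error >= -MAX_ERRNO);
	return (void *)(intptr_t)error;
}

/* Recover the errno value from an encoded pointer. */
static inline long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True if the pointer lies in the top MAX_ERRNO addresses. */
static inline int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct options {
	int timeout;
};

/* Returns a valid pointer or an encoded error, never an unset pointer. */
static struct options *parse_options(const char *str)
{
	struct options *opt;

	if (str == NULL)
		return err_ptr(-EINVAL);
	opt = malloc(sizeof(*opt));
	if (opt == NULL)
		return err_ptr(-ENOMEM);
	opt->timeout = atoi(str);
	return opt;
}
```

    A call site then checks is_err() on the one returned value, so there is
    no separate out-parameter whose initialization has to be reasoned about.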

    Signed-off-by: Alex Elder
    Signed-off-by: Sage Weil

    Alex Elder
     
  • Each messenger allocates a page to be used when writing zeroes
    out in the event of error or other abnormal condition. Instead,
    use the kernel ZERO_PAGE() for that purpose.

    Signed-off-by: Alex Elder
    Signed-off-by: Sage Weil

    Alex Elder
     

05 Mar, 2012

1 commit

  • If a header file is making use of BUG, BUG_ON, BUILD_BUG_ON, or any
    other BUG variant in a static inline (i.e. not in a #define) then
    that header really should be including <linux/bug.h> and not just
    expecting it to be implicitly present.

    We can make this change risk-free, since if the files using these
    headers didn't have exposure to linux/bug.h already, they would have
    been causing compile failures/warnings.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

12 Nov, 2011

1 commit

  • ceph_osd_request struct allocates a 40-byte buffer for object names.
    RBD image names can be up to 96 chars long (100 with the .rbd suffix),
    which results in the object name for the image being truncated, and a
    subsequent map failure.

    Increase the oid buffer in request messages, in order to avoid the
    truncation.
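
    The truncation can be reproduced in user space with a sketch like the
    following. set_oid() is a hypothetical helper written for this
    illustration; only the 40-byte size and the ".rbd" suffix come from the
    description above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define OLD_OID_SIZE 40	/* buffer size named in the commit message */

/*
 * Format "<image_name>.rbd" into a fixed-size buffer.
 * Returns 1 if the name had to be truncated, 0 if it fit.
 */
static int set_oid(char *oid, size_t oid_size, const char *image_name)
{
	int n;

	assert(oid_size > 0);
	n = snprintf(oid, oid_size, "%s.rbd", image_name);
	return n < 0 || (size_t)n >= oid_size;
}
```

    With a 96-character image name, a buffer of OLD_OID_SIZE bytes keeps only
    the first 39 characters, so the ".rbd" suffix is lost and the lookup uses
    the wrong object name.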

    Signed-off-by: Stratos Psomadakis
    Signed-off-by: Sage Weil

    Stratos Psomadakis
     

29 Oct, 2011

1 commit

  • * 'for-linus' of git://ceph.newdream.net/git/ceph-client:
    libceph: fix double-free of page vector
    ceph: fix 32-bit ino numbers
    libceph: force resend of osd requests if we skip an osdmap
    ceph: use kernel DNS resolver
    ceph: fix ceph_monc_init memory leak
    ceph: let the set_layout ioctl set single traits
    Revert "ceph: don't truncate dirty pages in invalidate work thread"
    ceph: replace leading spaces with tabs
    libceph: warn on msg allocation failures
    libceph: don't complain on msgpool alloc failures
    libceph: always preallocate mon connection
    libceph: create messenger with client
    ceph: document ioctls
    ceph: implement (optional) max read size
    ceph: rename rsize -> rasize
    ceph: make readpages fully async

    Linus Torvalds
     

26 Oct, 2011

2 commits


15 Sep, 2011

2 commits

  • Fast-forward merge with Linus to be able to merge patches
    based on more recent version of the tree.

    Jiri Kosina
     
  • It was pointed out by 'make versioncheck' that some includes of
    linux/version.h are not needed in include/.
    This patch removes them.

    When I last posted the patch, the ceph bit was ACK'ed by Sage Weil, so
    I've added that below.

    The pwc-ioctl change generated quite a bit of discussion about V4L version
    numbers in general, but as far as I can tell, no consensus was reached on
    what the long term solution should be, so in the meantime I think we
    could start by just removing the unneeded include, which is why I'm
    resending the patch with that hunk still included.

    Signed-off-by: Jesper Juhl
    Acked-by: Sage Weil
    Signed-off-by: Jiri Kosina

    Jesper Juhl
     

27 Jul, 2011

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits)
    ceph: document unlocked d_parent accesses
    ceph: explicitly reference rename old_dentry parent dir in request
    ceph: document locking for ceph_set_dentry_offset
    ceph: avoid d_parent in ceph_dentry_hash; fix ceph_encode_fh() hashing bug
    ceph: protect d_parent access in ceph_d_revalidate
    ceph: protect access to d_parent
    ceph: handle racing calls to ceph_init_dentry
    ceph: set dir complete frag after adding capability
    rbd: set blk_queue request sizes to object size
    ceph: set up readahead size when rsize is not passed
    rbd: cancel watch request when releasing the device
    ceph: ignore lease mask
    ceph: fix ceph_lookup_open intent usage
    ceph: only link open operations to directory unsafe list if O_CREAT|O_TRUNC
    ceph: fix bad parent_inode calc in ceph_lookup_open
    ceph: avoid carrying Fw cap during write into page cache
    libceph: don't time out osd requests that haven't been received
    ceph: report f_bfree based on kb_avail rather than diffing.
    ceph: only queue capsnap if caps are dirty
    ceph: fix snap writeback when racing with writes
    ...

    Linus Torvalds
     
  • Keep track of when an outgoing message is ACKed (i.e., the server fully
    received it and, presumably, queued it for processing). Time out OSD
    requests only if it's been too long since they've been received.

    This prevents timeouts and connection thrashing when the OSDs are simply
    busy and are throttling the requests they read off the network.

    Reviewed-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Sage Weil
     

21 Jul, 2011

1 commit

  • All these are instances of
    #define NAME value;
    or
    #define NAME(params_opt) value;

    These of course fail to build when used in contexts like
    if(foo $OP NAME)
    while(bar $OP NAME)
    and may silently generate the wrong code in contexts such as
    foo = NAME + 1; /* foo = value; + 1; */
    bar = NAME - 1; /* bar = value; - 1; */
    baz = NAME & quux; /* baz = value; & quux; */
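
    The silent miscompilation can be seen in a small stand-alone sketch.
    LIMIT_BAD, LIMIT_GOOD and the two functions are names invented for this
    illustration:

```c
#include <assert.h>

#define LIMIT_BAD 10;	/* the stray semicolon is part of the expansion */
#define LIMIT_GOOD 10

static int next_bad(void)
{
	int foo;

	foo = LIMIT_BAD + 1;	/* expands to: foo = 10; + 1; */
	return foo;
}

static int next_good(void)
{
	int foo;

	foo = LIMIT_GOOD + 1;	/* expands to: foo = 10 + 1; */
	return foo;
}
```

    next_bad() returns 10 rather than 11, because "+ 1;" becomes a separate
    expression statement with no effect. A compiler may warn about it but
    is not required to reject it.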

    Reported on comp.lang.c,
    Message-ID:
    Initial analysis of the dangers provided by Keith Thompson in that thread.

    There are many more instances of more complicated macros having unnecessary
    trailing semicolons, but this pile seems to be all of the cases of simple
    values suffering from the problem. (Thus things that are likely to be found
    in one of the contexts above, more complicated ones aren't.)

    Signed-off-by: Phil Carmody
    Signed-off-by: Jiri Kosina

    Phil Carmody
     

25 May, 2011

1 commit


30 Mar, 2011

1 commit


23 Mar, 2011

1 commit

  • Lingering requests are requests that are sent to the OSD normally but
    remain tracked even after we get a successful reply. This keeps the OSD
    connection open and resends the original request if the object moves to
    another OSD. The OSD can then send notification messages back to us
    if another client initiates a notify.

    This framework will be used by RBD so that the client gets notification
    when a snapshot is created by another node or tool.

    Signed-off-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Yehuda Sadeh
     

22 Mar, 2011

3 commits

  • Signed-off-by: Sage Weil

    Sage Weil
     
  • This updates the common header files used by the different ceph
    related modules. Specifically it adds definitions required by
    the rbd watch/notify feature.

    Signed-off-by: Yehuda Sadeh

    Yehuda Sadeh
     
  • If we send a request to osd A, and the request's pg remaps to osd B and
    then back to A in quick succession, we need to resend the request to A. The
    old code was only calling kick_requests after processing all incremental
    maps in a message, so it was very possible to not resend a request that
    needed to be resent. This would make the osd eventually time out (at least
    with the current default of osd timeouts enabled).

    The correct approach is to scan requests on every map incremental. This
    patch refactors the kick code in a few ways:
    - all requests are either on req_lru (in flight), req_unsent (ready to
    send), or req_notarget (currently map to no up osd)
    - mapping always done by map_request (previous map_osds)
    - if the mapping changes, we requeue. requests are resent only after all
    map incrementals are processed.
    - some osd reset code is moved out of kick_requests into a separate
    function
    - the "kick this osd" functionality is moved to kick_osd_requests, as it
    is unrelated to scanning for request->pg->osd mapping changes
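
    The A -> B -> A case can be modelled with a toy sketch, for illustration
    only. Each incremental map is reduced to "which OSD serves the request's
    pg now", and both function names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Old behaviour: compare the mapping only after all incrementals. */
static int needs_resend_checked_at_end(int sent_to,
				       const int *incrementals, size_t n)
{
	int current = sent_to;
	size_t i;

	for (i = 0; i < n; i++)
		current = incrementals[i];
	return current != sent_to;
}

/* New behaviour: note a change on every incremental, resend afterwards. */
static int needs_resend_checked_each_step(int sent_to,
					  const int *incrementals, size_t n)
{
	int current = sent_to;
	int requeued = 0;
	size_t i;

	assert(incrementals != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (incrementals[i] != current)
			requeued = 1;	/* mapping changed: requeue */
		current = incrementals[i];
	}
	return requeued;	/* resent once all incrementals are applied */
}
```

    For a request sent to osd 0 and incrementals that map the pg to osd 1 and
    then back to osd 0, the first function reports no resend while the second
    one does.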

    Signed-off-by: Sage Weil

    Sage Weil
     

05 Mar, 2011

2 commits

  • There was some broken keepalive code using a dead variable. Shift to using
    the proper bit flag.

    Signed-off-by: Sage Weil

    Sage Weil
     
  • With commit f363e45f we replaced a bunch of hacky workqueue mutual
    exclusion logic with the WQ_NON_REENTRANT flag. One piece of fallout is
    that the exponential backoff breaks in certain cases:

    * con_work attempts to connect.
    * we get an immediate failure, and the socket state change handler queues
    immediate work.
    * con_work calls con_fault, we decide to back off, but can't queue delayed
    work.

    In this case, we add a BACKOFF bit to make con_work reschedule delayed work
    next time it runs (which should be immediately).

    Signed-off-by: Sage Weil

    Sage Weil
     

13 Jan, 2011

2 commits

  • The ceph messenger code does a rather complex dance around the
    multithreaded workqueue to make sure the same work item isn't executed concurrently
    on different CPUs. This restriction can be provided by workqueue with
    WQ_NON_REENTRANT.

    Make ceph_msgr_wq a non-reentrant workqueue with the default concurrency
    level and remove the QUEUED/BUSY logic.

    * This removes backoff handling in con_work() but it couldn't reliably
    block execution of con_work() to begin with - queue_con() can be
    called after the work started but before BUSY is set. It seems that
    it was an optimization for a rather cold path and can be safely
    removed.

    * The number of concurrent work items is bound by the number of
    connections and connections are independent from each other. With
    the default concurrency level, different connections will be
    executed independently.

    Signed-off-by: Tejun Heo
    Cc: Sage Weil
    Cc: ceph-devel@vger.kernel.org
    Signed-off-by: Sage Weil

    Tejun Heo
     
  • Add a ceph_dir_layout to the inode, and calculate dentry hash values based
    on the parent directory's specified dir_hash function. This is needed
    because the old default Linux dcache hash function is extremely weak and
    leads to a poor distribution of files among dir fragments.

    Signed-off-by: Sage Weil

    Sage Weil
     

18 Dec, 2010

1 commit


10 Nov, 2010

3 commits

  • The alignment used for reading data into or out of pages used to be taken
    from the data_off field in the message header. This only worked as long
    as the page alignment matched the object offset, breaking direct io to
    non-page aligned offsets.

    Instead, explicitly specify the page alignment next to the page vector
    in the ceph_msg struct, and use that instead of the message header (which
    probably shouldn't be trusted). The alloc_msg callback is responsible for
    filling in this field properly when it sets up the page vector.

    Signed-off-by: Sage Weil

    Sage Weil
     
  • We used to infer alignment of IOs within a page based on the file offset,
    which assumed they matched. This broke with direct IO that was not aligned
    to pages (e.g., 512-byte aligned IO). We were also trusting the alignment
    specified in the OSD reply, which could have been adjusted by the server.

    Explicitly specify the page alignment when setting up OSD IO requests.
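
    A small worked sketch of why the inferred alignment was wrong, for
    illustration only. The 4096-byte page size and both helper names are
    assumptions made for the example:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u	/* assumed page size */

/* Old approach: infer the in-page alignment from the file offset. */
static unsigned int alignment_from_offset(uint64_t file_off)
{
	return (unsigned int)(file_off & (PAGE_SIZE_BYTES - 1));
}

/* Pages spanned by 'len' bytes starting 'align' bytes into a page. */
static uint64_t pages_for(unsigned int align, uint64_t len)
{
	assert(align < PAGE_SIZE_BYTES);
	return (align + len + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES;
}
```

    For a 4096-byte direct IO at file offset 512 from a page-aligned buffer,
    the inferred alignment is 512 and the transfer appears to span two pages,
    while the buffer's real alignment of 0 means it occupies exactly one.
    Passing the alignment explicitly removes that mismatch.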

    Signed-off-by: Sage Weil

    Sage Weil
     
  • The offset/length arguments aren't used.

    Signed-off-by: Sage Weil

    Sage Weil
     

21 Oct, 2010

3 commits

  • Signed-off-by: Sage Weil

    Greg Farnum
     
  • These facilitate preallocation of pages so that we can encode into the pagelist
    in an atomic context.

    Signed-off-by: Greg Farnum
    Signed-off-by: Sage Weil

    Greg Farnum
     
  • This factors out protocol and low-level storage parts of ceph into a
    separate libceph module living in net/ceph and include/linux/ceph. This
    is mostly a matter of moving files around. However, a few key pieces
    of the interface change as well:

    - ceph_client becomes ceph_fs_client and ceph_client, where the latter
    captures the mon and osd clients, and the fs_client gets the mds client
    and file system specific pieces.
    - Mount option parsing and debugfs setup is correspondingly broken into
    two pieces.
    - The mon client gets a generic handler callback for otherwise unknown
    messages (mds map, in this case).
    - The basic supported/required feature bits can be expanded (and are by
    ceph_fs_client).

    No functional change, aside from some subtle error handling cases that got
    cleaned up in the refactoring process.

    Signed-off-by: Sage Weil

    Yehuda Sadeh