22 Feb, 2008

1 commit

  • The recent patch to validate data lengths in rcom_names messages
    failed to account for fake messages that a node directs to itself
    without ever sending them. In this case we need to fill in the message
    length in the header for the validation code to use.

    Signed-off-by: David Teigland

    David Teigland
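
The entry above can be illustrated with a minimal userspace sketch. It assumes a simplified header with a single length field; struct msg_header, struct names_msg, build_self_msg() and validate_names_msg() are hypothetical names for illustration, not the kernel's rcom structures or functions. The point is only that a message delivered locally must carry the same length field that the validation path expects from a message that arrived over the wire.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified layout; NOT the kernel's struct dlm_header. */
struct msg_header {
    uint16_t h_length;          /* total message length, header included */
};

struct names_msg {
    struct msg_header hdr;
    char names[64];             /* payload: packed resource names */
};

/* Reject messages whose claimed length cannot hold a header
 * or exceeds the buffer that holds the message. */
int validate_names_msg(const struct names_msg *m)
{
    size_t len = m->hdr.h_length;

    if (len < offsetof(struct names_msg, names))
        return -1;
    if (len > sizeof(*m))
        return -1;
    return 0;
}

/* Build a "fake" message for local delivery. Because it never passes
 * through the send path, the length must be filled in by hand here. */
void build_self_msg(struct names_msg *m, const char *names, size_t names_len)
{
    assert(names_len <= sizeof(m->names));
    memset(m, 0, sizeof(*m));
    memcpy(m->names, names, names_len);
    m->hdr.h_length = (uint16_t)(offsetof(struct names_msg, names) + names_len);
}
```

Without the final assignment in build_self_msg(), h_length stays zero and validate_names_msg() rejects the node's own message.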
     

07 Feb, 2008

2 commits


06 Feb, 2008

1 commit


04 Feb, 2008

13 commits


31 Jan, 2008

16 commits

  • Also change name_prefix from char pointer to char array.

    Signed-off-by: Denis Cheng
    Signed-off-by: David Teigland

    Denis Cheng
     
  • A couple small clean-ups. Remove unnecessary wrapper-functions in
    rcom.c, and remove unnecessary casting and an unnecessary ASSERT in
    util.c.

    Signed-off-by: David Teigland

    David Teigland
     
  • The 32/64 compatibility code in the DLM does not check the validity of
    the lock name length passed into it, so it can easily overwrite memory
    if the value is rubbish (as early versions of libdlm can cause with
    unlock calls, since they do not zero the field).

    This patch restricts the length of the name to the amount of data
    actually passed into the call.

    Signed-off-by: Patrick Caulfield
    Signed-off-by: David Teigland

    Patrick Caulfield
     
  • To prevent the master of an rsb from changing rapidly, an unused rsb is kept
    on the "toss list" for a period of time to be reused. The toss list was
    being cleared completely for each recovery, which is unnecessary. Much of
    the benefit of the toss list can be maintained if nodes keep rsb's in their
    toss list that they are the master of. These rsb's need to be included
    when the resource directory is rebuilt during recovery.

    Signed-off-by: David Teigland

    David Teigland
     
  • The invalid lockspace messages are normal and can appear relatively
    often. They should be suppressed unless debugging is enabled.

    Signed-off-by: David Teigland

    David Teigland
     
  • The dlm_put_lkb() can free the lkb and its associated ua structure,
    so we can't depend on using the ua struct after the put.

    Signed-off-by: David Teigland

    David Teigland
     
  • In a rare case we may need to repeat a local resource directory lookup
    due to a race with removing the rsb and removing the resdir record.
    We'll never need to do more than a single additional lookup, though,
    so the infinite loop around the lookup can be removed. In addition
    to being unnecessary, the infinite loop is dangerous since some other
    unknown condition may appear causing the loop to never break.

    Signed-off-by: David Teigland

    David Teigland
     
  • Non-forced unlocks should be rejected if the lock is waiting on the
    rsb_lookup list for another lock to establish the master node.

    Signed-off-by: David Teigland

    David Teigland
     
  • There was some hit and miss validation of messages that has now been
    cleaned up and unified. Before processing a message, the new
    validate_message() function checks that the lkb is the appropriate type,
    process-copy or master-copy, and that the message is from the correct
    nodeid for the given lkb. Other checks and assertions on the
    lkb type and nodeid have been removed. The assertions were particularly
    bad since they would panic the machine instead of just ignoring the bad
    message.

    Although other recent patches have made processing old messages unlikely,
    it still may be possible for an old message to be processed and caught
    by these checks.

    Signed-off-by: David Teigland

    David Teigland
     
  • Messages from nodes that are no longer members of the lockspace should be
    ignored. When nodes are removed from the lockspace, recovery can
    sometimes complete quickly enough that messages arrive from a removed node
    after recovery has completed. When processed, these messages would often
    cause an error message, and could in some cases change some state, causing
    problems.

    Signed-off-by: David Teigland

    David Teigland
     
  • When a failed request (EBADR or ENOTBLK) is unlocked/canceled instead of
    retried, there may be other lkb's waiting on the rsb_lookup list for it
    to complete. A call to confirm_master() is needed to move on to the next
    waiting lkb since the current one won't be retried.

    Signed-off-by: David Teigland

    David Teigland
     
  • When recovery looks at locks waiting for replies, it fails to consider
    locks that have already received a reply for their first remote operation,
    but have not received a reply for a secondary, overlapping unlock/cancel. The
    appropriate stub reply needs to be called for these waiters.

    Appears when we start doing recovery in the presence of many overlapping
    unlock/cancel ops.

    Signed-off-by: David Teigland

    David Teigland
     
  • The lkb_ast_type field indicates whether the lkb is on the astqueue list.
    When clearing locks for a process, lkb's were being removed from the astqueue
    list without clearing the field. If release_lockspace then happened
    immediately afterward, it could try to remove the lkb from the list a second
    time.

    Appears when a process calls libdlm dlm_release_lockspace(), which first
    closes the ls dev, triggering clear_proc_locks, and then removes the ls
    (a write to control dev), causing release_lockspace().

    Signed-off-by: David Teigland

    David Teigland
     
  • Some errno values differ across platforms. So if we return things like
    -EINPROGRESS from one node it can get misinterpreted or rejected on
    another one.

    This patch fixes up the errno values passed on the wire so that they
    match the x86 ones (so as not to break the protocol), and re-instates
    the platform-specific ones at the other end.

    Many thanks to Fabio for testing this patch.
    Initial patch from Patrick.

    Signed-off-by: Patrick Caulfield
    Signed-off-by: Fabio M. Di Nitto
    Signed-off-by: David Teigland

    David Teigland
     
  • DLM_RCOM_LOCK_REPLY messages need byte swapping.

    Signed-off-by: Fabio M. Di Nitto
    Signed-off-by: David Teigland

    Fabio M. Di Nitto
     
  • gcc does not guarantee that an auto buffer is 64bit aligned.
    This change allows sparc64 to work.

    Signed-off-by: Fabio M. Di Nitto
    Signed-off-by: David Teigland

    Fabio M. Di Nitto
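
The alignment problem in the entry above can be sketched in userspace C. This is an illustration under assumptions, not the change itself: struct wire_header, INBUF_LEN and read_length_aligned() are hypothetical names. Placing the byte buffer in a union with the header type is one common way to force the automatic buffer to the header's alignment, so that 64-bit fields can be read through a struct pointer on strict-alignment architectures such as sparc64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define INBUF_LEN 128           /* hypothetical buffer size */

/* Hypothetical header with a 64-bit member, so it needs 8-byte alignment
 * on architectures that enforce it. */
struct wire_header {
    uint64_t id;
    uint32_t length;
    uint32_t pad;
};

/* Copy a possibly misaligned message into a local buffer whose alignment
 * is forced by the union, then read a field through the struct view. */
uint32_t read_length_aligned(const unsigned char *raw, size_t len)
{
    union {
        unsigned char bytes[INBUF_LEN];
        struct wire_header hdr; /* forces alignment of the whole union */
    } tmp;

    assert(len >= sizeof(struct wire_header) && len <= sizeof(tmp.bytes));
    memcpy(tmp.bytes, raw, len);
    return tmp.hdr.length;
}
```

A plain `unsigned char bytes[INBUF_LEN]` on the stack carries no such guarantee, and casting it to a header pointer is what can fault on strict-alignment machines.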
     
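
The errno entry in this day's list (fixing wire values to the x86 ones and restoring platform values on receipt) can be sketched as a pair of translation functions. The WIRE_* names and the function names errno_to_wire() and errno_from_wire() are hypothetical, and only a few POSIX errno values are covered here; the numeric constants are the x86 Linux values.

```c
#include <assert.h>
#include <errno.h>

/* Wire values pinned to the x86 Linux numbers (names are hypothetical). */
enum {
    WIRE_EDEADLK     = 35,
    WIRE_EPROTO      = 71,
    WIRE_EOPNOTSUPP  = 95,
    WIRE_ETIMEDOUT   = 110,
    WIRE_EINPROGRESS = 115
};

/* Sender side: map a negative platform errno to its fixed wire value.
 * Values not listed are passed through unchanged. */
int errno_to_wire(int err)
{
    assert(err <= 0);
    switch (err) {
    case -EDEADLK:     return -WIRE_EDEADLK;
    case -EPROTO:      return -WIRE_EPROTO;
    case -EOPNOTSUPP:  return -WIRE_EOPNOTSUPP;
    case -ETIMEDOUT:   return -WIRE_ETIMEDOUT;
    case -EINPROGRESS: return -WIRE_EINPROGRESS;
    }
    return err;
}

/* Receiver side: map a wire value back to this platform's errno. */
int errno_from_wire(int err)
{
    assert(err <= 0);
    switch (err) {
    case -WIRE_EDEADLK:     return -EDEADLK;
    case -WIRE_EPROTO:      return -EPROTO;
    case -WIRE_EOPNOTSUPP:  return -EOPNOTSUPP;
    case -WIRE_ETIMEDOUT:   return -ETIMEDOUT;
    case -WIRE_EINPROGRESS: return -EINPROGRESS;
    }
    return err;
}
```

On x86 both functions are the identity for these values, which is why pinning the wire format to x86 does not break the existing protocol.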

30 Jan, 2008

5 commits

  • This patch addresses a problem introduced with the last round of
    lowcomms patches where the 'othercon' connections do not get freed when
    the DLM shuts down.

    This results in the error message
    "slab error in kmem_cache_destroy(): cache `dlm_conn': Can't free all
    objects"

    and the DLM cannot be restarted without a system reboot.

    See bz#428119

    Signed-off-by: Patrick Caulfield
    Signed-off-by: Fabio M. Di Nitto
    Signed-off-by: David Teigland

    Patrick Caulfield
     
  • The dlm functions in memory.c should use the dlm_ prefix. Also, use
    kzalloc/kfree directly for dlm_direntry's, removing the wrapper functions.

    Signed-off-by: David Teigland

    David Teigland
     
  • Change log_error() to log_debug() for conditions that can occur in
    large numbers in normal operation.

    Signed-off-by: David Teigland

    David Teigland
     
  • This patch adds proper prototypes for some functions in
    fs/dlm/dlm_internal.h

    Signed-off-by: Adrian Bunk
    Signed-off-by: David Teigland

    Adrian Bunk
     
  • A common problem occurs when multiple IP addresses within the same
    subnet are assigned to the same NIC. If we make a connection attempt to
    another address on the same subnet as one of those addresses, the
    connection attempt will not necessarily be routed from the address we
    want.

    In the case of the DLM, the other nodes will quickly drop the connection
    attempt, causing problems.

    This patch makes the DLM bind to the local address it acquired from the
    cluster manager when using TCP prior to making a connection, obviating
    the need for administrators to "fix" their systems or use clever routing
    tricks.

    Signed-off-by: Lon Hohberger
    Signed-off-by: Patrick Caulfield
    Signed-off-by: David Teigland

    Lon Hohberger
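
The bind-before-connect approach in the entry above has a direct userspace analogue with POSIX sockets, sketched here under assumptions. The function names make_local_bind_addr() and connect_from() are hypothetical, and the kernel code uses in-kernel socket calls rather than these. Binding to the chosen local address with port 0 fixes the source address while leaving the kernel free to pick an ephemeral port.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fill `out` with the chosen local IPv4 address and port 0, so the
 * kernel selects an ephemeral source port. Returns 0 or -1. */
int make_local_bind_addr(const char *local_ip, struct sockaddr_in *out)
{
    assert(local_ip != NULL && out != NULL);
    memset(out, 0, sizeof(*out));
    out->sin_family = AF_INET;
    out->sin_port = htons(0);
    return inet_pton(AF_INET, local_ip, &out->sin_addr) == 1 ? 0 : -1;
}

/* Connect to `peer` with the source address pinned to `local_ip`.
 * Returns a connected socket descriptor, or -1 on any failure. */
int connect_from(const char *local_ip, const struct sockaddr_in *peer)
{
    struct sockaddr_in local;
    int fd;

    if (make_local_bind_addr(local_ip, &local) < 0)
        return -1;

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* Bind first: without this the routing table picks the source
     * address, which may be another address on the same subnet. */
    if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0 ||
        connect(fd, (const struct sockaddr *)peer, sizeof(*peer)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The peer then sees the connection arrive from the address it knows for that node, rather than from whichever address on the subnet the routing code happened to select.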
     

25 Jan, 2008

2 commits