Automate Engine region-level mutex


#1
  • Description
    Provide the automate engine code a way to synchronize threads between distributed appliances in the same region by using PostgreSQL advisory locks (mutexes).
    For example:
    $evm.acquire_lock(lock_name)
    $evm.has_lock?(lock_name)
    $evm.release_lock(lock_name)

This functionality is useful for preventing conflicts in identifiers, such as VM naming or IP assignment. It can also help prevent oversubscription, where multiple processes placing VMs all provision onto the same host because each calculated that the host had free resources.

  • State of current development?
    Currently, I have updated miq_ae_service.rb for the $evm methods and added an MiqConcurrency::PGMutex module in my fork that supports the advisory lock calls to PostgreSQL.

  • How much is completed?
    This current approach is completed and working, but I am open to any kind of suggestion and/or changes to the approach. This code leverages work done by Jason Dillaman on private projects.

  • Target release and timeline?
    It is completed now pending approval from the committee.

  • Which ManageIQ release will this be included in? When do you expect the first version that’s usable?
    See above.

  • Core feature or independent project?
    This does change core code, so it will need to be reviewed, voted on, and approved by the current committee.

  • Dependencies
    This does not depend on other parts of MIQ development.
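
As a rough illustration of the advisory-lock calls described in this proposal, the sketch below maps a lock name to a 64-bit key and wraps PostgreSQL's pg_try_advisory_lock and pg_advisory_unlock functions. It is not the code from the fork: the AdvisoryLockSketch class, the key derivation, and the injected connection (anything that responds to exec_params the way the pg gem's PG::Connection does) are assumptions made for the example.

```ruby
require 'digest'

# Hypothetical sketch, not the MiqConcurrency::PGMutex module from the fork.
class AdvisoryLockSketch
  # conn is assumed to respond to exec_params(sql, params) and to return a
  # result with getvalue(row, col), as PG::Connection from the pg gem does.
  def initialize(conn)
    @conn = conn
  end

  # Derive a signed 64-bit key from the lock name (an assumed scheme).
  def self.key_for(lock_name)
    Digest::SHA256.digest(lock_name.to_s).unpack1('q>')
  end

  # Non-blocking: true if the lock was obtained, false otherwise.
  def try_acquire(lock_name)
    boolean_query('SELECT pg_try_advisory_lock($1::bigint)', lock_name)
  end

  # true if this session held the lock and released it.
  def release(lock_name)
    boolean_query('SELECT pg_advisory_unlock($1::bigint)', lock_name)
  end

  private

  def boolean_query(sql, lock_name)
    key = self.class.key_for(lock_name).to_s
    value = @conn.exec_params(sql, [key]).getvalue(0, 0)
    # Without a type map the pg gem returns 't'/'f'; accept true as well.
    [true, 't'].include?(value)
  end
end
```

Because these are session-level locks, they are tied to the database connection that took them, which is why the connection is passed in explicitly here.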


#2

Interesting idea…

@gmccullough @mkanoor What do you think?

One question I have is, what happens if someone doesn’t release the lock? Can it be timed out as a safety net?

Jason


#3

Looks like a neat idea. It also offers one way to implement atomic operations. Some of the questions I have are:

(1) Would the lock last across multiple automate methods?
(2) When a lock cannot be acquired, does the automate method block or does it issue a retry?
(3) How do we recover from errant methods that lock and don’t unlock?
(4) How would has_lock check whether the process/request has a lock? Would the lock identifier be stored in $evm.set_state_var?


#4

Hi Jason,

By default, the end of the db session will release the lock. And a single automate custom method is working within a single session. Does that same-session-ness extend to the custom instance (a single state)?
For sure we aren’t guaranteed same-session-ness from state to state as different workers may pick up the workflow.

But aside from that… if you wanted a safety net within a session, we could use a timeout and a yield in the acquire_lock method. Users would then wrap their own code inside the block. On timeout, we would have to raise an exception for the user to rescue.

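def acquire_lock(lock_name, max_time = 60)
  # keep trying until max_time seconds have passed
  deadline = Time.now + max_time
  while Time.now < deadline
    if try_acquire_lock(lock_name)
      begin
        yield
      ensure
        # always release, even if the user's block raises
        release_lock(lock_name)
      end
      return
    end
    sleep(1) # pause between attempts instead of busy-waiting
  end

  # if we get this far, lock acquisition failed
  raise "timed out after #{max_time}s waiting for lock #{lock_name}"
end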

begin
  acquire_lock(lock_name) do
    do_something_now_that_i_have_a_lock()
  end
rescue => err
  uh_oh_handle_problem()
end

#5

Hi Madhu,

(1) Would the lock last across multiple automate methods?
I don’t expect these locks to last across multiple automate methods. The retry-ability of states seems to preclude this since a state can’t even be assured to be rerun by the same appliance.

(2) When a lock cannot be acquired does the automate method block or does it issue a retry?
That is a good question. I was thinking it was up to the user to check the result of the call like this (no blocking):

def acquire_lock(lock_name)
  # non-blocking: returns true or false immediately
  try_acquire_lock(lock_name)
end

But in response to Jason’s question, one could attempt to acquire and yield to the user’s code block which would block for a period of time. See above.

(3) How to recover from errant methods that lock and don’t unlock?
Locks are eventually released when the session ends, but the onus is on the user to ensure release_lock() is called in their custom automate method. Alternatively, a timeout can be used, as per Jason’s suggestion above.

(4) How would the has_lock check if the process/request has a lock would the lock identifier be stored in $evm.set_state_var?
has_lock?(lock_name) will run a SQL query against the pg_locks view for the matching advisory lock held by the session’s pid. No state variable is necessary.
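
The pg_locks lookup described in (4) could be sketched roughly as follows. For an advisory lock taken with a single bigint key, PostgreSQL shows the high 32 bits of the key in classid, the low 32 bits in objid, and sets objsubid to 1. The HasLockSketch module, its method names, and the way the key reaches the query are assumptions for illustration, not the code in the fork.

```ruby
# Hypothetical helpers illustrating the pg_locks lookup; names are assumptions.
module HasLockSketch
  # Checks whether the current backend holds a given bigint advisory lock.
  HELD_SQL = <<~SQL
    SELECT 1
      FROM pg_locks
     WHERE locktype = 'advisory'
       AND pid = pg_backend_pid()
       AND classid = $1::oid
       AND objid = $2::oid
       AND objsubid = 1
       AND granted
  SQL

  # Split a signed 64-bit key into the unsigned 32-bit halves pg_locks shows.
  def self.split_key(key)
    unsigned = key & 0xFFFF_FFFF_FFFF_FFFF
    [unsigned >> 32, unsigned & 0xFFFF_FFFF]
  end

  # conn is assumed to behave like PG::Connection (exec_params, ntuples).
  def self.held?(conn, key)
    high, low = split_key(key)
    conn.exec_params(HELD_SQL, [high.to_s, low.to_s]).ntuples > 0
  end
end
```

Filtering on pg_backend_pid() is what keeps the check scoped to the calling session, so no state variable is involved.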

Thank you for your comments. I appreciate all feedback.


#6

Another thing that we have to account for is running methods from other methods which get triggered from $evm.instantiate or $evm.execute.
We would have to enforce that these calls are not supported, or fail with an error, once you hold a lock.


#7

Hi Madhu,

You’re right, if the primary lock methods are provided (i.e., acquire_lock, release_lock, has_lock?) we can run into situations where custom code could leave locks unreleased (until the session expires). It’s best that I take them out.

But with the yielded method with_acquire_lock, we can ensure child threads won’t cause the parent thread to leave locks unreleased.

with_acquire_lock(lock_name) do
  $evm.instantiate('/Custom/Instance/Blowup')
end

In this scenario, if the instance Blowup throws an error, the instantiation will simply finish and control will return from the block to the with_acquire_lock method, which ensures a release_lock.
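
To make the release-on-error behaviour concrete, here is a minimal in-memory sketch. The InMemoryLocks class stands in for the real advisory-lock backend and is purely hypothetical; only the begin/ensure shape of with_acquire_lock is the point.

```ruby
# Hypothetical stand-in for the advisory-lock backend; in-memory only.
class InMemoryLocks
  def initialize
    @held = {}
  end

  def try_acquire_lock(name)
    return false if @held[name]

    @held[name] = true
  end

  def release_lock(name)
    !@held.delete(name).nil?
  end

  def has_lock?(name)
    @held.key?(name)
  end

  # Yield only while the lock is held, and always release afterwards.
  def with_acquire_lock(name)
    raise "could not acquire lock #{name}" unless try_acquire_lock(name)

    begin
      yield
    ensure
      release_lock(name)
    end
  end
end

locks = InMemoryLocks.new
begin
  # The block raising simulates a failure inside the locked section.
  locks.with_acquire_lock('vm_naming') { raise 'Blowup failed' }
rescue RuntimeError => err
  puts "rescued: #{err.message}"
end
puts "still locked? #{locks.has_lock?('vm_naming')}"
```

The error still reaches the caller, but the ensure clause runs first, so the lock is no longer held by the time the rescue executes.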


#8

The $evm.instantiate might end up calling a method that uses the same lock and we would have a deadlock, so we would have to guard against this situation. We would need something in the workspace which tracks the locks in use so that we can raise an exception if we detect a deadlock situation.
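
One way to picture the tracking suggested here is a per-workspace registry of the lock names currently in use, consulted before each acquisition. The LockRegistry class and NestedLockError are hypothetical names, and a plain Set stands in for whatever the workspace would actually store.

```ruby
require 'set'

# Hypothetical error raised when a lock is requested while already in use.
class NestedLockError < StandardError; end

# Hypothetical per-workspace registry of lock names in use.
class LockRegistry
  def initialize
    @in_use = Set.new
  end

  # Runs the block with the name registered; raises instead of waiting
  # if this workspace already holds a lock with the same name.
  def guard(lock_name)
    if @in_use.include?(lock_name)
      raise NestedLockError, "lock #{lock_name} is already held"
    end

    @in_use.add(lock_name)
    begin
      yield
    ensure
      @in_use.delete(lock_name)
    end
  end
end

registry = LockRegistry.new
begin
  registry.guard('ip_assignment') do
    # Simulates $evm.instantiate reaching a method that wants the same lock.
    registry.guard('ip_assignment') { :never_reached }
  end
rescue NestedLockError => err
  puts err.message
end
```

Raising immediately turns a potential hang into an error the calling method can rescue and report.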


#9

Can you submit a PR on this enhancement?