How to rollback on provisioning failure?


#1

I have been successfully implementing provisioning and retirement workflows in ManageIQ for a while, and each time, during the design/debug phase, I have to deal with non-working state machines that leave a lot of crap behind in the services ManageIQ is integrated with.

For example, I create fixed address records on my DHCP server, so that the virtual machine is SSH-able once started. During the debug stages, it happens quite regularly that I misconfigure some item and my state machine is aborted. I then have to manually delete the DHCP record to keep my environment clean.

So, I would like to implement a rollback policy that triggers the cleanup steps. Even before ManageIQ was released, this subject had been covered by @ramrexx, who handled it with a call to a method that takes the state and the VM as arguments, looking like this:

def cleanup(state, vm)
  $evm.log("info", "Cleaning up from state '#{state}':")
  case state
  when 'AcquireIPAddress'
    $evm.log("info", "Calling ReleaseIPAddress")
    $evm.instantiate("/Infrastructure/VM/Provisioning/StateMachinesMethods/ReleaseIPAddress")
  when 'RegisterDHCP'
    $evm.log("info", "Calling ReleaseIPAddress")
    $evm.instantiate("/Infrastructure/VM/Provisioning/StateMachinesMethods/ReleaseIPAddress")
    $evm.log("info", "Calling UnregisterDHCP")
    $evm.instantiate("/Infrastructure/VM/Provisioning/StateMachinesMethods/UnregisterDHCP")
  when 'RegisterDNS'
    $evm.log("info", "Calling ReleaseIPAddress")
    $evm.instantiate("/Infrastructure/VM/Provisioning/StateMachinesMethods/ReleaseIPAddress")
    $evm.log("info", "Calling UnregisterDHCP")
    $evm.instantiate("/Infrastructure/VM/Provisioning/StateMachinesMethods/UnregisterDHCP")
    $evm.log("info", "Calling UnregisterDNS")
    $evm.instantiate("/Infrastructure/VM/Provisioning/StateMachinesMethods/UnregisterDNS")
  when 'Provision'
    $evm.log("info", "Calling vm.retire_now for cleanup")
    vm.retire_now
  else
    $evm.log("info", "Nothing to be done.")
    return
  end
end

You can then either call it in the ‘rescue’ statement of your methods, or use the ‘on_error’ method to point to a rollback method. However, I see two caveats to this approach:

  • Maintenance can become quite difficult over time, because the cleanup method has to follow every change in the workflow.
  • It covers only the VMProvision_VM state machine. I need to be able to roll back all the items of a service bundle, in case one of them fails to be provisioned.
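
On the maintenance caveat, here is a minimal sketch of one way to avoid repeating the cleanup calls in each `when` branch. It assumes the provisioning states always run in the same order. `BASE`, `ROLLBACK_STEPS` and `rollback_steps_for` are hypothetical names; the instance paths are the ones used in the method above.

```ruby
# Ordered list of [state, undo instance] pairs, in provisioning order.
# Hypothetical table; the paths are the ones used by the cleanup method above.
BASE = "/Infrastructure/VM/Provisioning/StateMachinesMethods"
ROLLBACK_STEPS = [
  ['AcquireIPAddress', "#{BASE}/ReleaseIPAddress"],
  ['RegisterDHCP',     "#{BASE}/UnregisterDHCP"],
  ['RegisterDNS',      "#{BASE}/UnregisterDNS"]
].freeze

# Returns the undo instances for every state up to and including the
# failed one, in the same order as the original case statement.
def rollback_steps_for(failed_state)
  index = ROLLBACK_STEPS.index { |state, _| state == failed_state }
  return [] if index.nil?

  ROLLBACK_STEPS[0..index].map { |_, undo| undo }
end

# Inside ManageIQ, each returned path would be passed to $evm.instantiate.
puts rollback_steps_for('RegisterDHCP')
```

Adding a state to the workflow then means adding one line to the table rather than touching every branch that comes after it.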

So, here are a few thoughts on this that I would like to develop with your help and wisdom :smile:


Whenever a step in the provisioning workflow fails, ManageIQ should be able to launch a rollback workflow.

How to trigger the rollback?

  • Instantiate a state machine:
    • From the state, in the ‘rescue’ statement,
    • From the state machine engine, in the method called by ‘on_error’.
  • Forward state information to the state machine (keep track of the required information for rollback):
    • $evm.root:
      • The $evm.root is the same because we are in the same workspace.
      • In every state that performs an action, we could add some information in $evm.root or in the
    • $evm.root['miq_provision']: this object might not exist in all state machines.

The code could look like:

begin
  @state = 'RegisterDHCP'
  @stateData = { :rollback_state => 'UnregisterDHCP', :vmname => 'cfme001', :ip_address => '1.1.1.1', :mac_address => '00:00:00:00:00:01', :domain => 'example.com' }

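  # Create the container on first use, otherwise the assignment below
  # would fail on a nil value.
  $evm.root['workflow'] ||= {}
  $evm.root['workflow'][@state] = @stateData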
  exit MIQ_OK

rescue => err
  $evm.log("error", "#{@method} - [#{err}]\n#{err.backtrace.join("\n")}")
  $evm.instantiate("/Infrastructure/VM/Provisioning/StateMachines/VMProvision_VM/template_rollback")
  exit MIQ_ABORT
end

Then we create a state machine /Infrastructure/VM/Provisioning/StateMachines/VMProvision_VM/template_rollback, which is just a “mirror” of the provisioning state machine. Each state can access the $evm.root['workflow'] information to roll back its counterpart state.
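
To illustrate the bookkeeping, here is a rough sketch in plain Ruby, outside of ManageIQ. The local `workflow` hash stands in for $evm.root['workflow'], and `record_state` and `rollback_plan` are hypothetical helpers. It only shows how the recorded data could be replayed in reverse order by the mirror state machine.

```ruby
# Stand-in for $evm.root['workflow']; inside ManageIQ this hash would live
# in the workspace root (an assumption carried over from the post above).
workflow = {}

# Each provisioning state records what its rollback counterpart needs.
def record_state(workflow, state, rollback_state, data)
  workflow[state] = data.merge(:rollback_state => rollback_state)
end

# Ruby hashes keep insertion order, so reversing the values yields the
# rollback states from the most recent action back to the first one.
def rollback_plan(workflow)
  workflow.values.reverse.map do |data|
    [data[:rollback_state], data.reject { |key, _| key == :rollback_state }]
  end
end

record_state(workflow, 'AcquireIPAddress', 'ReleaseIPAddress',
             { :ip_address => '1.1.1.1' })
record_state(workflow, 'RegisterDHCP', 'UnregisterDHCP',
             { :ip_address => '1.1.1.1', :mac_address => '00:00:00:00:00:01' })

rollback_plan(workflow).each do |rollback_state, data|
  puts "#{rollback_state}: #{data}"
end
```

Only the states that actually ran have an entry, so the rollback never tries to undo something that was not done.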

When rolling back a service provisioning state machine, we should roll back all the items that are part of the service. This requires an inspection of the service to trigger the right state machine on each item.


#2

We have on_enter, on_exit and on_error now in the State Machine. Would it be more difficult to include an on_rollback field that can point to the appropriate code to perform the required action at each step? I’m not sure about the internals to support something like that, but I really hate the idea of having to mirror my entire state machine to be able to perform a successful rollback.
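
To make the idea concrete, here is a rough sketch in plain Ruby of how a per-state on_rollback field could behave. Note that on_rollback is not an existing ManageIQ field, and `STATES` and `run_with_rollback` are hypothetical names used only for illustration.

```ruby
# Hypothetical state definitions: on_rollback is NOT an existing ManageIQ
# field, it only illustrates the suggestion above.
STATES = [
  { :name => 'AcquireIPAddress', :run => -> { :ok },    :on_rollback => 'ReleaseIPAddress' },
  { :name => 'RegisterDHCP',     :run => -> { :ok },    :on_rollback => 'UnregisterDHCP' },
  { :name => 'RegisterDNS',      :run => -> { :error }, :on_rollback => 'UnregisterDNS' }
].freeze

# Runs the states in order. On the first error, returns the on_rollback
# targets of the failed state and of every state completed before it,
# most recent first. Returns an empty list when every state succeeds.
def run_with_rollback(states)
  entered = []
  states.each do |state|
    entered.unshift(state)
    return entered.map { |s| s[:on_rollback] }.compact if state[:run].call == :error
  end
  []
end

p run_with_rollback(STATES)
```

The rollback targets sit next to the states they undo, so there is no second state machine to keep in sync.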

This is definitely an area of interest for us! Thanks for bringing the topic up!

Matt


#3

The approach I adopted is to store the rollback logic in the “update_*provision_status” method. The method is the same for all the states of the state machine, and it chooses the required rollback operations based on the value of $evm.root['ae_state']. The rollback steps for each state are stored in a hash. The main advantage of this approach is that it is processed in the same workspace, so the context of the source request is available.

Implementation

The following code shows the method used in the virtual machine provisioning state machine, i.e. the /Infrastructure/VM/Provisioning/StateMachines/VMProvision_VM/update_provision_status method.

# First, we create the rollback matrix. It is a hash containing,
# for each state of the state machine, the list of operations to
# process. Each operation is a link to an instance that points to
# a method.
@rollback_matrix = Hash.new

@rollback_matrix['AcquireIPAddress'] = [
  "#{$evm.root['rollback_class']}/ReleaseIPAddress",
  "#{$evm.root['rollback_class']}/LeaveDomain"
]

# Triggers the rollback process if the rollback matrix has an
# entry for the current state. The rollback process consists of
# instantiating the tasks listed in the rollback matrix entry.
def trigger_rollback
  if @rollback_matrix[$evm.root['ae_state']].nil?
    $evm.log("info", "No rollback steps.")
  else
    @rollback_matrix[$evm.root['ae_state']].each do |step|
      $evm.log("info", ">>> Instantiating '#{step}'")
      $evm.instantiate(step)
    end
  end
end

prov   = $evm.root['miq_provision']
status = $evm.inputs['status']

# Update Status for on_entry,on_exit
if $evm.root['ae_result'] == 'ok' || $evm.root['ae_result'] == 'error'
  prov.message = status
end

# Trigger the rollback
trigger_rollback if $evm.root['ae_result'] == 'error'

For the service provisioning rollback, as we can’t call a state machine, we can still retire the provisioned items through the REST API. When an item fails, it triggers its own rollback process, and the parent service, rather than just failing, can trigger the retirement of the other items.
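
As an illustration of the REST part, here is a minimal sketch that builds, without sending it, a retirement request for one VM. The host, VM id and credentials are placeholders, and `build_retire_request` is a hypothetical helper. The endpoint and payload assume the usual ManageIQ API convention of a POST on /api/vms/:id with an "action" of "retire", which should be checked against the API version in use.

```ruby
require 'net/http'
require 'json'
require 'uri'

# Builds (without sending) the request that asks ManageIQ to retire a VM.
# base_url, vm_id, user and password are placeholders. The endpoint and
# payload are an assumption based on the POST /api/vms/:id convention.
def build_retire_request(base_url, vm_id, user, password)
  uri = URI.join(base_url, "/api/vms/#{vm_id}")
  request = Net::HTTP::Post.new(uri)
  request.basic_auth(user, password)
  request['Content-Type'] = 'application/json'
  request.body = JSON.generate({ 'action' => 'retire' })
  request
end

request = build_retire_request('https://miq.example.com', 42, 'admin', 'secret')
puts "#{request.method} #{request.path} #{request.body}"
```

The parent service would build one such request per remaining item and send it with Net::HTTP over SSL.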