Automate State Machine Enhancements


#1

The Automate state machine processes an ordered list of states. Before each state is executed, the on_entry method is executed. After the state ends, the on_exit or on_error method is executed, depending on how the state ended: if the state ends with ‘ok’ the on_exit method is executed, and if it ends with ‘error’ the on_error method is executed.

We have been getting requests to add 2 new features to the Automate state machine:
(1) Skip a state, based on some logic in the ‘on_entry’ method.
(2) Ignore the error from a state method and continue, based on some logic in the ‘on_error’ method.

The automate methods communicate the state results by setting the ‘ae_result’ attribute in the root object, e.g.

$evm.root['ae_result'] = 'error'

Proposal:

Add 2 new values for ‘ae_result’

  1. ‘skip’, valid from the on_entry method, which will cause the state method to be skipped.

  2. ‘continue’, valid from the on_error method, which will cause the automate engine to ignore the error code from the current state and continue on to the next state.
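
To make the proposed semantics concrete, here is a minimal sketch in plain Ruby. It is a toy model and not the Automate engine: the `run_states` helper, the hash-based state layout and the trace strings are all hypothetical, and `root` is an ordinary Hash standing in for `$evm.root`. It also assumes that on_exit is not run when an error is ignored via ‘continue’, which the proposal does not specify.

```ruby
# Toy model of the proposed 'skip' and 'continue' values for ae_result.
# `root` is a plain Hash standing in for $evm.root (assumption).
def run_states(states, root)
  trace = []
  states.each do |state|
    root['ae_result'] = 'ok'
    state[:on_entry]&.call(root)
    if root['ae_result'] == 'skip'
      # on_entry asked for the state method to be skipped.
      trace << "#{state[:name]}:skipped"
      next
    end

    state[:run].call(root)
    if root['ae_result'] == 'error'
      state[:on_error]&.call(root)
      if root['ae_result'] == 'continue'
        # on_error asked the engine to ignore the error and move on.
        trace << "#{state[:name]}:error_ignored"
        next
      end
      # Default behavior: the error stops the state machine.
      trace << "#{state[:name]}:error"
      return trace
    end

    state[:on_exit]&.call(root)
    trace << "#{state[:name]}:ok"
  end
  trace
end

states = [
  { name: 'S1', on_entry: ->(r) { r['ae_result'] = 'skip' if r['skip_s1'] }, run: ->(_r) {} },
  { name: 'S2', run: ->(r) { r['ae_result'] = 'error' }, on_error: ->(r) { r['ae_result'] = 'continue' } },
  { name: 'S3', run: ->(_r) {} }
]

puts run_states(states, { 'skip_s1' => true }).inspect
# prints ["S1:skipped", "S2:error_ignored", "S3:ok"]
```

Without the on_error handler on S2, the same run would stop at S2 with an error and S3 would never execute, which is today's behavior.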

Please provide comments.


#2

I see another feature: make the State Machine a finite-state automaton that allows a different next state based on some logic in on_exit. Obviously, one should be careful about circular references.


#3

We had thought about letting the current state change which state to go to next, but that is error prone and would lead to complex state machines that are difficult to debug. Today the state machine is a single path thru the states. The only thing that changes is where we start (in the case of retries); otherwise they always start at the top and fall to the bottom. With these enhancements we can skip some of the states or continue on in case of errors.


#4

Then, to keep state machines as small and simple as possible, we should be able to trigger custom events from Ruby code. This way, we would be able to start other state machines in asynchronous mode, dedicated to specific small tasks.


#5

I think that ‘skip’ and ‘continue’ look good for this purpose. They will certainly enhance the usefulness of state machines by adding optional pre-processing and error recovery capability to each state.

pemcg


#6

On further investigation, I learned that the initial state machine implementation required the user to set the next state to go to. That required extra bookkeeping on the part of the user to decide which state to execute next. Users didn’t want to keep track of which state to go to and wanted the states to be executed serially, in the order defined by the priority field. Since then the state machine has been executed sequentially. With a little bit of effort it would be possible to bring back the old behavior and optionally set the next state via an ‘ae_next_state’ attribute in $evm.root. If this attribute is not set, we would execute the next state in sequence.
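
As a rough illustration of how an optional ‘ae_next_state’ could sit on top of sequential execution, here is a small plain-Ruby sketch. It is not engine code: `run_with_next_state`, the `max_steps` guard and the state names are hypothetical, and `root` is a plain Hash standing in for `$evm.root`. The step limit is one simple way to catch the circular references mentioned earlier in the thread.

```ruby
# Hypothetical sketch: states run in their defined order unless a state
# sets root['ae_next_state'] to the name of the state to run next.
def run_with_next_state(states, root, max_steps: 20)
  order = states.keys
  visited = []
  current = order.first
  while current
    # Guard against circular jumps between states.
    raise "possible loop after #{max_steps} steps" if visited.size >= max_steps

    visited << current
    root.delete('ae_next_state')
    states.fetch(current).call(root)
    requested = root['ae_next_state']
    # Fall back to the next state in sequence when nothing was requested.
    current = requested || order[order.index(current) + 1]
  end
  visited
end

states = {
  'validate'  => ->(root) { root['ae_next_state'] = 'notify' if root['quota_exceeded'] },
  'provision' => ->(_root) {},
  'notify'    => ->(_root) {}
}

puts run_with_next_state(states, { 'quota_exceeded' => true }).inspect
# prints ["validate", "notify"]
puts run_with_next_state(states, {}).inspect
# prints ["validate", "provision", "notify"]
```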


#7

fdupont
Do you know what the requirements are for executing a state machine from an already executing state machine?

(1) Does the second state machine require access to the workspace from the first state machine?
(2) Does the first state machine need attributes set by the second state machine?


#8

We are also looking at supporting calling multiple state machines from a single workspace. In the example we have 3 state machines SM1, SM2 and SM3. SM1 can call SM2 directly and SM2 can call SM3 thru some other intermediary object. With the current implementation we can’t invoke one state machine from another state machine, because we only keep 1 set of key state attributes ae_state/ae_state_retries/ae_state_started in the workspace. With the new proposal we would be able to keep the ae_state/ae_state_retries/ae_state_started separate for each state machine that we encounter when resolving the automate model. If a retry is triggered from any of the state machines, the state information is persisted and then reused when the automation model is re-resolved.

When calling another state machine you can link to it directly, or you can reach it via other objects in the automate model.

These multiple state machines all run within the same workspace, so the workspace attributes are available while processing any of them.


#9

@mkanoor could you expand on the diagram above and provide some examples of how the state machines would be processed (and re-entered) based on various success / error conditions during execution? (I also assume all this is still done synchronously but would be great if you could confirm).

It would also be great to expand the proposal to identify how troubleshooting information would be integrated, to make things easy for those who have to debug.


#10

In the diagram above there are 3 state machines SM1, SM2, SM3
SM1 has 3 states (S11, S12, S13)
SM2 has 5 states (S21,S22,S23,S24,S25)
SM3 has 2 states (S31, S32)

SM1 connects to SM2 when it encounters the S12 state. It saves its current state (ae_state/ae_state_retries/ae_state_started) and invokes the new state machine SM2, which starts at the top and runs thru its states. SM2 connects to SM3 at state S22, so again we save the salient state attributes and navigate down to SM3 via some other object in between. When SM3 has finished processing, the previous state information for SM2 is restored and we restart that state machine at state S23. We then process SM2 completely and come back to processing SM1, restore its salient state attributes and resume at S13.

The description above was the normal flow of the state machine, where no errors or retries were encountered.

Let’s look at an error scenario. Say state S31 under SM3 throws an error. This error would be percolated to the top and all the state machines would stop. This is the default behavior; the user can override the error condition with the ‘continue’ feature discussed earlier and continue on with the state machine processing. There isn’t a rollback feature yet in the state machine processing.

Let’s look at a retry scenario. Say state S31 throws a retry. This would stop all the state machines (SM3, SM2 and SM1), and the process would end with a retry error, queueing the next automate task. We would save all the state information for SM1, SM2 and SM3 along with the other key attributes that started this automation task. When the timer expires, the worker would dispatch the Automate job, and we would read the previous state information and resume processing, starting at state machine SM1. Since SM1 was last at state S12 when the retry happened, S11 will be skipped. S12 will get us to SM2, which will skip S21 and move on to processing S22, which will connect to SM3. In SM3 we will resume at S31. The state attributes ae_state_retries/ae_state_started keep track of the number of times the particular state was invoked and enforce the time and retry limits if specified for the state.
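
The save/resume walkthrough can be simulated with a short plain-Ruby sketch. This is only a toy model under several assumptions: `example_machines`, `run_machine`, the `saved` and `handlers` hashes are hypothetical names, only ae_state is tracked per state machine (not ae_state_retries/ae_state_started), and SM2 links to SM3 directly rather than thru an intermediary object.

```ruby
# Toy model of nested state machines that keep a separate ae_state each.
# Each entry is [state_name, target]; a target naming another machine
# means "this state connects to that state machine".
def example_machines
  {
    'SM1' => [['S11', nil], ['S12', 'SM2'], ['S13', nil]],
    'SM2' => [['S21', nil], ['S22', 'SM3'], ['S23', nil], ['S24', nil], ['S25', nil]],
    'SM3' => [['S31', nil], ['S32', nil]]
  }
end

# Runs one machine, resuming at its saved state if there is one.
# Returns 'ok', 'retry' or 'error'; anything but 'ok' stops every machine.
def run_machine(name, machines, saved, log, handlers)
  states = machines.fetch(name)
  start = saved[name] ? (states.index { |s, _| s == saved[name] } || 0) : 0
  states.drop(start).each do |state, target|
    saved[name] = state # per-machine ae_state
    result = if machines.key?(target)
               run_machine(target, machines, saved, log, handlers)
             else
               log << state
               handler = handlers[state]
               handler ? handler.call : 'ok'
             end
    return result unless result == 'ok'
  end
  saved.delete(name) # machine finished, nothing left to resume
  'ok'
end

attempts = 0
handlers = { 'S31' => -> { (attempts += 1) == 1 ? 'retry' : 'ok' } }
saved = {}

first_log = []
puts run_machine('SM1', example_machines, saved, first_log, handlers)
# prints retry
puts saved.inspect
# prints {"SM1"=>"S12", "SM2"=>"S22", "SM3"=>"S31"} (Ruby 3.4+ adds spaces around =>)

second_log = []
puts run_machine('SM1', example_machines, saved, second_log, handlers)
# prints ok
puts second_log.inspect
# prints ["S31", "S32", "S23", "S24", "S25", "S13"]
```

On the second pass S11 and S21 are never run, matching the description above. A handler returning 'error' propagates to the top the same way a retry does.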

All the state machine states are run synchronously, like any other automate state machine. State machines today get asynchronous behavior by using a check method with a retry, which is pretty heavily used during provisioning. That behavior is not going to change.

State machine debugging is currently limited to the Automate logs. We log as we change states and run the on_entry/on_exit/on_error calls. If there are other requirements or ideas around debugging, we can discuss them here.


#11

@mkanoor thanks for the detail, it is pretty straightforward.
For error tracking, it will be critical to understand which state machines are affected when a state errors out (especially in the case of your SM3 state machine erroring out: you will need to understand that SM2 and SM1 erroring out is the result of SM3 erroring out, and that needs to be easily identifiable in the log).


#12

I like the idea of launching child state machines and think it would be necessary in automate to have access to parent workspaces. Basically, being able to be in SM3’s workspace and see the grandparent (SM1) workspace that launched me.


#13

All the state machines are running in the same workspace, so all attributes would be visible to all state machines. If some data needs to be preserved between retries, you would have to use $evm.set_state_var.


#14

This is a great enhancement, allowing for specialized state machines. Would it be possible to make the call asynchronous? This way, we would be able to launch long-running / independent actions without waiting on them before performing further actions.


#15

fdupont
The sync/async is handled today in Automate models using 2 states.
For example, in ManageIQ/Infrastructure/VM/Provisioning/StateMachines/clone_to_vm or template we have 2 states, Provision and CheckProvisioned, that mimic sync behavior using an async launch and retries.
The first one, Provision, launches the task asynchronously and the second one, CheckProvisioned, waits for the task to end. This pattern, the async process plus the waiter, is repeated in most of the provisioning code. The waiter CheckProvisioned waits a few minutes and then checks whether the VM has been created; if it is not done yet, it queues a retry and goes away.

Won’t this behavior suffice for making async calls? If you have stacked state machines and one of them throws a retry, the whole process ends and gets requeued to be run after a while. When we pick it back up after a retry we start where we left off, so this works as long as each state machine is doing the following:

  1. Launch the process asynchronously
  2. Wait for the async task to end with a retry loop
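
The two steps can be sketched in plain Ruby as below. This is an illustration only: `FakeTaskService`, `launch`, `check_done` and `run_until_done` are hypothetical names, the `state_vars` Hash stands in for `$evm.set_state_var`, and the loop stands in for the engine requeueing the task after each retry. In a real Automate method the check state would signal the retry by setting `$evm.root['ae_result'] = 'retry'`.

```ruby
# Hypothetical stand-in for an external system whose task needs a few
# polls before it reports completion.
class FakeTaskService
  def initialize(polls_needed)
    @polls_left = polls_needed
  end

  # Starts the long-running task and returns a handle right away.
  def start
    'task-1'
  end

  def finished?(_task_id)
    @polls_left -= 1
    @polls_left <= 0
  end
end

# State 1: launch the process asynchronously and remember its handle.
def launch(state_vars, service)
  state_vars[:task_id] = service.start
  'ok'
end

# State 2: the waiter. Ask for a retry until the task has ended.
def check_done(state_vars, service)
  service.finished?(state_vars[:task_id]) ? 'ok' : 'retry'
end

# Stand-in for the engine: each loop iteration models one requeued run.
def run_until_done(service, max_retries: 10)
  state_vars = {}
  launch(state_vars, service)
  retries = 0
  until check_done(state_vars, service) == 'ok'
    retries += 1
    return ['error', retries] if retries > max_retries
  end
  ['ok', retries]
end

puts run_until_done(FakeTaskService.new(3)).inspect
# prints ["ok", 2]
```
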

#16

Well, AFAIK, the async nature of the provision state is hard coded, so how do you create an async state through the interface?


#17

fdupont
I am trying to understand the use case. Do you want async tasks to run in a separate workspace from the current workspace? Do they ever have to sync up again?

If you have an example that would help.

I am thinking that if we were integrating with some other system, say using a REST API, we would get back a handle or a task id. This can be stored in $evm.root or with $evm.set_state_var, and a second state method could fetch that task id and wait for it to end within a reasonable time (max_time).

The proposal here was to run dependent state machines. Maybe there is another use for running state machines in parallel, without sharing anything. Can that be done using multiple requests that create independent tasks and state machines for each one?


#18

@mkanoor

I was thinking about launching tasks, like the provisioning one, that may take a long time to complete and would still be linked to the initial request. Exactly like it is done in a service with multiple items that have the same order.

My point is that the ‘provision’ method for a service or a VM is able to start async tasks. Why wouldn’t we expose the same feature for any state machine step? I would see it as a boolean on the state that would default to false.

Say you create your custom request, like “rebuild service”. In the process you restore a database dump and the operation can be quite long. So, to optimize deployment time, I would like to start restoring the database as soon as possible, in async mode, while deploying the other items.