Ivanchuk/5.11 Embedded Ansible Service Git Connection Issue

We discovered an Ansible Service Git connection issue that occurs when the Embedded Ansible Service provision state machine encounters a problem checking out the Git repo. The Service provision fails immediately, without waiting for the Git connection or attempting to retry it.

Why it’s happening now:

In Ivanchuk/5.11, we changed our Embedded Ansible implementation to use ansible-runner instead of Ansible Tower, which was used previously.

Even though Ansible Service provisioning hasn’t been modified, the Embedded Ansible implementation changes affect the Service provisioning behavior.

What can I do about it?

Unfortunately, there’s no hotfix for the current release. Modifying the existing provisioning behavior would require a significant code change.

The good news is that we created a workaround for the issue.

We added back-end code to the release to support the Automate changes that are a key part of the workaround.

The back end changes can be found at: http://github.com/ManageIQ/manageiq/pull/20759

The workaround:

The code that does the Git checkout and calls ansible-runner to run a playbook is invoked during the execute state of the Generic Service state machine. We can insert an Automate method in a state before the execute state, where we check the Git connection and retry the state machine, if necessary, until either the repo becomes available or the maximum number of retries has been exceeded. Once that method ends successfully (meaning it was able to connect to Git), we can proceed to the execute state with a high degree of confidence that the Service provision will complete successfully.
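To make the idea concrete, here is a minimal Ruby sketch of such a pre-execute check. It is not the check_connection method shipped in the git_retry domain: the function names, the use of `git ls-remote` as the probe, and the plain string return values are all assumptions chosen so the example runs outside Automate.

```ruby
require 'open3'

# Hypothetical probe (an assumption, not the shipped implementation):
# ask the remote for its branch heads without cloning anything.
def git_reachable?(repo_url)
  # GIT_TERMINAL_PROMPT=0 keeps git from blocking on a credential prompt.
  env = { 'GIT_TERMINAL_PROMPT' => '0' }
  _out, _err, status = Open3.capture3(env, 'git', 'ls-remote', '--heads', repo_url)
  status.success?
rescue Errno::ENOENT
  false # the git binary is not installed on this host
end

# Maps the probe result to a state result. Inside ManageIQ Automate a method
# would typically report this through $evm.root['ae_result']; returning a
# plain string keeps the sketch self-contained.
def check_connection(repo_url, probe: method(:git_reachable?))
  probe.call(repo_url) ? 'ok' : 'retry'
end
```

Passing the probe as a parameter is a design choice for the sketch only: it lets the decision logic be exercised with a stub, for example `check_connection(url, probe: ->(_) { false })`, without touching the network.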

How it works:

The core of the workaround is a new check_connection Automate method that checks that the Git repo is accessible before it allows the state machine to progress to the execute state.

The check_connection method affects the state machine as follows:

If the Git connection is:

  1. available the first time the method runs, the state machine proceeds to the execute state.
  2. initially unavailable, then becomes available some time before the max_retries attempts are used up, the state machine proceeds to the execute state.
  3. still not available after the max_retries attempts, the state machine aborts with a message that it has exceeded the (configurable) max_retries count specified for the pre5 state.

Automate Note - The max_retries setting on the pre5 state of the Generic Service state machine determines how many times the Git connection check is retried. The default is 100.
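The three outcomes can be illustrated with a small Ruby simulation of a single retried state. This is a sketch under stated assumptions, not ManageIQ's state machine engine: `run_pre5_state`, the result hash, and the error message text are hypothetical, the first attempt is assumed not to count against max_retries, and no delay between attempts is modeled.

```ruby
# Hypothetical driver for one state: re-run the check until it returns 'ok'
# or the retry budget is spent. The block stands in for check_connection.
def run_pre5_state(max_retries:, &check)
  retries = 0
  loop do
    # Outcomes 1 and 2: the connection is (or has become) available.
    return { result: 'ok', retries: retries } if check.call == 'ok'

    # Outcome 3: budget exhausted, so the state machine aborts.
    if retries >= max_retries
      return { result: 'error', retries: retries,
               message: "pre5 exceeded max_retries (#{max_retries})" }
    end

    retries += 1
  end
end

# Usage: the repo is unreachable twice, then reachable on the third attempt.
attempts = [false, false, true].each
outcome = run_pre5_state(max_retries: 100) { attempts.next ? 'ok' : 'retry' }
puts outcome.inspect
```

With a check that never succeeds, the same driver returns an 'error' result once the retry count reaches max_retries, which corresponds to the abort case described above.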

Automate changes:

The git_retry domain we created contains only two changes to the Automate model:

  1. A new check_connection Automate method.
  2. A modified provision instance. The instance was modified to add a call to the check_connection method in the pre5 state.

The screenshot below shows that the provision instance in the ManageIQ system domain has no value in the pre5 state.

ManageIQ Domain:

The screenshot below shows that the provision instance in the git_retry domain has the value METHOD::check_connection in the pre5 state. Notice that the GenericLifecycle class in the git_retry domain contains the check_connection Automate method. The check_connection method is new and does not exist in the ManageIQ domain.

Note - Specifying an Automate method with the METHOD:: prefix (called method notation) lets a state relationship call the method directly, without having to create or use an Instance to call it.

git_retry Domain:

How do I apply the workaround to my environment?

The workaround requires a minimum version of 5.11.10.

  1. Import and enable the git_retry custom domain.
  2. Create an Ansible Playbook Service.
  3. Order the Service.