How to retry an Ansible Tower method?

I’m trying to make an Ansible Tower method retry several times on error. Here is my dev state machine. The onerror_retry method reads $evm.root['ae_state_max_retries'] and sets $evm.root['ae_result'] to 'retry'.


AWXTest is an instance that executes the Ansible Tower method.
In automation.log I see that the CustomizeRequest state reruns as needed, but no actual job template is run on AWX. What am I doing wrong?

The On Error method is only run when there’s an error in the main method. Have you determined what the main error is? There may be no point retrying something that keeps erroring.

pemcg

@pemcg This particular case involves an Ansible job running a PowerShell DSC module. The module occasionally complains about a lack of some resources in Windows (a race condition, I think). On the second run it generally succeeds. So I’m trying to make it run twice before actually failing the deployment.

Hmm, interesting, so you see the first call to the job template method run ok in Tower (but maybe fail), but the subsequent calls of the same method in the retries don’t make it to Tower at all?

Exactly. I see log messages from onerror_retry ("Job failed, retrying… " and so on). But the actual job runs only in the very first run.

Is there any kind of error message in either MIQ or the Tower UI for the subsequent job attempts? And just to confirm: you’re running an automate method of type ‘Ansible Tower Job Template’ (i.e. rather than calling the AnsibleTower/Operations/StateMachines/Job state machine yourself)?

Here is the Jobs list in the MIQ UI. You can see the actual runs.


These are the messages from automation.log.

And here is the output of the onerror_retry method:

[----] I, [2021-03-19T15:22:02.128399 #665669:2ad9dd04194c]  INFO -- : Q-task_id([r1000000001449_miq_provision_1000000001551]) Invoking [inline] method [/Dev/Infrastructure/VM/Provisioning/StateMachines/Dev_VMProvision_VM/onerror_retry] with inputs [{}]
[----] I, [2021-03-19T15:22:02.129224 #665669:2ad9dd04194c]  INFO -- : Q-task_id([r1000000001449_miq_provision_1000000001551]) <AEMethod [/Dev/Infrastructure/VM/Provisioning/StateMachines/Dev_VMProvision_VM/onerror_retry]> Starting
[----] W, [2021-03-19T15:22:02.488477 #665669:2ad9ec4daaa8]  WARN -- : Q-task_id([r1000000001449_miq_provision_1000000001551]) <AEMethod onerror_retry> State machine state failed. Waiting for 60.seconds and retrying. 2 attempts remaining.  <>
[----] I, [2021-03-19T15:22:02.496615 #665669:2ad9dd04194c]  INFO -- : Q-task_id([r1000000001449_miq_provision_1000000001551]) <AEMethod [/Dev/Infrastructure/VM/Provisioning/StateMachines/Dev_VMProvision_VM/onerror_retry]> Ending
[----] I, [2021-03-19T15:23:03.898727 #665669:2ad9dd04194c]  INFO -- : Q-task_id([r1000000001449_miq_provision_1000000001551]) In State=[CustomizeRequest], invoking [on_error] method=[onerror_retry]
[----] I, [2021-03-19T15:23:03.914695 #665669:2ad9dd04194c]  INFO -- : Q-task_id([r1000000001449_miq_provision_1000000001551]) Updated namespace [Infrastructure/VM/Provisioning/StateMachines/Dev_VMProvision_vm/onerror_retry  Dev/Infrastructure/VM/Provisioning/StateMachines]
[----] I, [2021-03-19T15:23:03.916820 #665669:2ad9dd04194c]  INFO -- : Q-task_id([r1000000001449_miq_provision_1000000001551]) Invoking [inline] method [/Dev/Infrastructure/VM/Provisioning/StateMachines/Dev_VMProvision_VM/onerror_retry] with inputs [{}]
[----] I, [2021-03-19T15:23:03.917639 #665669:2ad9dd04194c]  INFO -- : Q-task_id([r1000000001449_miq_provision_1000000001551]) <AEMethod [/Dev/Infrastructure/VM/Provisioning/StateMachines/Dev_VMProvision_VM/onerror_retry]> Starting
[----] W, [2021-03-19T15:23:04.278703 #665669:2ad9eca6b01c]  WARN -- : Q-task_id([r1000000001449_miq_provision_1000000001551]) <AEMethod onerror_retry> State machine state failed. Waiting for 60.seconds and retrying. 1 attempts remaining.  <>
[----] I, [2021-03-19T15:23:04.288571 #665669:2ad9dd04194c]  INFO -- : Q-task_id([r1000000001449_miq_provision_1000000001551]) <AEMethod [/Dev/Infrastructure/VM/Provisioning/StateMachines/Dev_VMProvision_VM/onerror_retry]> Ending
[----] I, [2021-03-19T15:24:07.635962 #665671:2abb208ff95c]  INFO -- : Q-task_id([r1000000001449_miq_provision_1000000001551]) In State=[CustomizeRequest], invoking [on_error] method=[onerror_retry]
[----] I, [2021-03-19T15:24:07.652646 #665671:2abb208ff95c]  INFO -- : Q-task_id([r1000000001449_miq_provision_1000000001551]) Updated namespace [Infrastructure/VM/Provisioning/StateMachines/Dev_VMProvision_vm/onerror_retry  Dev/Infrastructure/VM/Provisioning/StateMachines]
[----] I, [2021-03-19T15:24:07.655766 #665671:2abb208ff95c]  INFO -- : Q-task_id([r1000000001449_miq_provision_1000000001551]) Invoking [inline] method [/Dev/Infrastructure/VM/Provisioning/StateMachines/Dev_VMProvision_VM/onerror_retry] with inputs [{}]
[----] I, [2021-03-19T15:24:07.656991 #665671:2abb208ff95c]  INFO -- : Q-task_id([r1000000001449_miq_provision_1000000001551]) <AEMethod [/Dev/Infrastructure/VM/Provisioning/StateMachines/Dev_VMProvision_VM/onerror_retry]> Starting
[----] E, [2021-03-19T15:24:08.041004 #665671:2abb32ffd314] ERROR -- : Q-task_id([r1000000001449_miq_provision_1000000001551]) <AEMethod onerror_retry> State machine state failed. <>
[----] I, [2021-03-19T15:24:08.048786 #665671:2abb208ff95c]  INFO -- : Q-task_id([r1000000001449_miq_provision_1000000001551]) <AEMethod [/Dev/Infrastructure/VM/Provisioning/StateMachines/Dev_VMProvision_VM/onerror_retry]> Ending

You probably need to look at the failing job object in the Rails console to see what’s going on. Try something like:

ManageIQ::Providers::AnsibleTower::ConfigurationManager::Job.where(:id => 4050).first

to get the job object. You can then poke around at the attributes and associations to see what’s going on.

Alternatively if you’re familiar with object_walker, try a small method containing something like this to dump the object structure:

$evm.root['job'] = $evm.vmdb(:ManageIQ_Providers_AnsibleTower_ConfigurationManager_Job, 4050)
$evm.instantiate('/Discovery/ObjectWalker/object_walker')

Then dump it using the object_walker_reader.

Hope this helps,
pemcg

I’ve solved this.

  1. Set the “Maximum TTL” of the Ansible Tower method higher than the “Max Time” of the respective state in the state machine.

  2. Used the following code in the onerror_retry method:
    interval = '60.seconds'
    # Use the saved counter if present; on the first failure fall back to the state's max retries.
    retry_count = $evm.get_state_var(:retry_count) || $evm.root['ae_state_max_retries']

    retry_count = 0 if retry_count.nil?

    if retry_count > 0
      $evm.log(:warn, "State machine state failed. Waiting for #{interval} and retrying. #{retry_count} attempts remaining. <>")
      $evm.log(:info, "max retries: #{$evm.root['ae_state_max_retries']}; retries: #{$evm.root['ae_state_retries']}")
      # Persist the decremented counter so the next on_error call sees it.
      retry_count -= 1
      $evm.set_state_var(:retry_count, retry_count)
      $evm.root['ae_result'] = 'retry'
      $evm.root['ae_retry_interval'] = interval
    else
      # No attempts left: fail the state for real.
      $evm.set_state_var(:retry_count, 0)
      $evm.log(:error, "State machine state failed. <>")
      $evm.root['ae_result'] = 'error'
    end

    $evm.instantiate('/Infrastructure/VM/Provisioning/StateMachines/VMProvision_VM/update_provision_status')

    exit MIQ_OK
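
For anyone who wants to check the countdown logic outside of Automate, here is a minimal sketch of the same decision in plain Ruby. It is only an illustration: the helper name next_retry_state is hypothetical, and two ordinary hashes stand in for the state variables and $evm.root, so nothing here talks to MIQ or Tower.

```ruby
# Hypothetical helper, not part of ManageIQ: mirrors the retry decision above.
# state_vars stands in for $evm.get_state_var / $evm.set_state_var,
# root stands in for $evm.root.
def next_retry_state(state_vars, root, interval = '60.seconds')
  # Saved counter first, then the state's max retries, then zero.
  retry_count = state_vars[:retry_count] || root['ae_state_max_retries'] || 0

  if retry_count > 0
    # Attempts remain: store the decremented counter and ask for a retry.
    state_vars[:retry_count] = retry_count - 1
    root['ae_result'] = 'retry'
    root['ae_retry_interval'] = interval
  else
    # Nothing left: report the error.
    state_vars[:retry_count] = 0
    root['ae_result'] = 'error'
  end

  root['ae_result']
end

# Simulate three consecutive failures of a state with 2 max retries.
state_vars = {}
root = { 'ae_state_max_retries' => 2 }
results = Array.new(3) { next_retry_state(state_vars, root) }
puts results.inspect
```

With 2 max retries, the first two failures ask for a retry and the third one gives up, which matches the sequence in the log output earlier in the thread.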