Terminating non responsive method with pid

Hello, sometimes I get an error like this:

Q-task_id([r495_service_template_provision_task_1139]) Terminating non responsive method with pid 3819

which stops the currently running automation steps. The last one failed in an automation Ruby script which had been running for several minutes, waiting for cloud-init finalization for an array of hostnames…

Is there any option in the settings where the timeout for “terminating non responsive method” could be adjusted? Or how can it be prevented? Can I notify MIQ during the Ruby method that the method is running fine, even if it takes more time?

I sometimes use the Ruby sleep method in scripts (but not for longer than 1 minute).

I am running hammer-4.

Thank you very much

Automate Methods time out after 10 minutes. There is probably a parameter in the Advanced Settings to configure it, but usually Methods should not run for that long.

Do you have to wait in Ruby? Can’t you just check the state and issue a retry if it is not finished yet?
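
Roughly like this inside the Automate method (just a sketch; finished? is a placeholder for your own cloud-init / job status check):

# Instead of blocking the worker with sleep, finish the method quickly and
# let the state machine run this state again later.
if finished?                                  # hypothetical status check
  $evm.root['ae_result'] = 'ok'
else
  $evm.root['ae_result']         = 'retry'
  $evm.root['ae_retry_interval'] = 1.minute   # re-enter this state in a minute
end

The max_retries / max_time on the state then bound how long you are willing to wait overall.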

Some of our methods run for 30 minutes without problem, but this one is failing after 3-6 minutes. According to the stack trace it always fails on this line:

sleep(15.seconds)

Looking at the code, I don’t think ManageIQ kills the Method on purpose. There should be a log message Method exec failed because ... and the stack trace should give more info on what is happening.

Code: https://github.com/ManageIQ/manageiq-automation_engine/blob/9ab733b5fce3e6c1361482702558d355727ad5a3/lib/miq_automation_engine/engine/miq_ae_engine/miq_ae_method.rb#L208

Thank you @buc for that info. Unfortunately I restarted the whole appliance with a log cleanup, so I don’t have the old logs. And the first provision attempt after the appliance restart went well.

Hello @buc
So I got the error again:

[----] E, [2020-02-11T10:49:52.786198 #31916:f1cf48] ERROR -- : Q-task_id([r567_service_retire_task_1279]) Terminating non responsive method with pid 17841
[----] E, [2020-02-11T10:49:52.786957 #31916:bf3b0a0] ERROR -- : Q-task_id([r567_service_retire_task_1279]) <AEMethod retire_terraform_service> The following error occurred during method evaluation:
[----] E, [2020-02-11T10:49:52.787351 #31916:bf3b0a0] ERROR -- : Q-task_id([r567_service_retire_task_1279]) <AEMethod retire_terraform_service>   SignalException: SIGTERM
[----] E, [2020-02-11T10:49:52.788748 #31916:bf3b0a0] ERROR -- : Q-task_id([r567_service_retire_task_1279]) <AEMethod retire_terraform_service>   /KBCZ/Functions/Awx/utils:24:in `sleep'

It is again in the sleep method (waiting for the AWX job result).

But unfortunately I can’t find Method exec failed anywhere in the logs:

 grep -i "exec failed" *.log

Is this the whole stack trace?

The error message I mentioned is printed to STDERR, so I would not be surprised if it does not show up in the logs; however, it should be in the journal. You could try looking for it with journalctl.

Anyway, we now know that something is sending the process a SIGTERM.
I would guess that your sleep is not the problem here; that is just the line your code happens to be on when something else kills it.
I have no idea where the SIGTERM could be coming from, but here are some ideas:

  • If the Automate Method is killed, it should cause the DRuby connection to fail and leave a second stack trace? It probably won’t be anything useful, but it’s worth a shot
  • Is there some external factor that could send the SIGTERM? Maybe the Kernel OOM Killer? Something like it?
  • Have you looked in all the logs at that specific timestamp? Apparently ManageIQ will kill some worker processes if it experiences memory pressure
    https://github.com/search?p=4&q=org%3AManageIQ+kill&type=Code

You are right about the sleep method; today I noticed an exception from a different method than sleep:

[----] I, [2020-02-26T14:32:27.194566 #32587:405aee8]  INFO -- : Q-task_id([r745_service_template_provision_task_1712]) <AEMethod pre_terraform_openstack_k8s__1_0> Scheduled Job ID: 1832 Ansible Job ID: 4839
[----] I, [2020-02-26T14:32:27.195342 #32587:405aee8]  INFO -- : Q-task_id([r745_service_template_provision_task_1712]) <AEMethod pre_terraform_openstack_k8s__1_0> [awx_run_adhoc_template] jobId: 1832 launched (retries 0)
[----] E, [2020-02-26T14:32:42.351973 #32587:a9af60] ERROR -- : Q-task_id([r745_service_template_provision_task_1712]) Terminating non responsive method with pid 22187
[----] E, [2020-02-26T14:32:42.353119 #32587:ea90048] ERROR -- : Q-task_id([r745_service_template_provision_task_1712]) <AEMethod pre_terraform_openstack_k8s__1_0> The following error occurred during method evaluation:
[----] E, [2020-02-26T14:32:42.353504 #32587:ea90048] ERROR -- : Q-task_id([r745_service_template_provision_task_1712]) <AEMethod pre_terraform_openstack_k8s__1_0>   SignalException: SIGTERM
[----] E, [2020-02-26T14:32:42.356833 #32587:ea90048] ERROR -- : Q-task_id([r745_service_template_provision_task_1712]) <AEMethod pre_terraform_openstack_k8s__1_0>   /usr/local/lib/ruby/2.4.0/drb/drb.rb:575:in `read'
/usr/local/lib/ruby/2.4.0/drb/drb.rb:575:in `load'
/usr/local/lib/ruby/2.4.0/drb/drb.rb:641:in `recv_reply'
/usr/local/lib/ruby/2.4.0/drb/drb.rb:941:in `recv_reply'
/usr/local/lib/ruby/2.4.0/drb/drb.rb:1254:in `send_message'
/usr/local/lib/ruby/2.4.0/drb/drb.rb:1142:in `block (2 levels) in method_missing'
/usr/local/lib/ruby/2.4.0/drb/drb.rb:1229:in `open'
/usr/local/lib/ruby/2.4.0/drb/drb.rb:1141:in `block in method_missing'
/usr/local/lib/ruby/2.4.0/drb/drb.rb:1160:in `with_friend'
/usr/local/lib/ruby/2.4.0/drb/drb.rb:1140:in `method_missing'
/KBCZ/Functions/Awx/utils:68:in `_awx_run_adhoc_check_status'
/KBCZ/Functions/Awx/utils:26:in `awx_run_adhoc_template'
/KBCZ/KbOrchestration/OrchestrationFlows/TemplatesMethods/Methods/util:27:in `block in unregister_hostnames_in_ipam_and_ldap'
/KBCZ/KbOrchestration/OrchestrationFlows/TemplatesMethods/Methods/util:24:in `each'
/KBCZ/KbOrchestration/OrchestrationFlows/TemplatesMethods/Methods/util:24:in `unregister_hostnames_in_ipam_and_ldap'
/KBCZ/KbOrchestration/OrchestrationFlows/TemplatesMethods/Methods/pre_terraform_openstack_k8s__1_0:94:in `rescue in <main>'

Unfortunately there is nothing in journalctl around that time…
But I found a possibly related message in evm.log:

[----] I, [2020-02-26T14:32:42.441132 #32587:a9af60] INFO -- : Q-task_id([r745_service_template_provision_task_1712]) MIQ(ServiceTemplateProvisionTask#after_ae_delivery) ae_result="error"
[----] I, [2020-02-26T14:32:42.747858 #32587:a9af60] INFO -- : Q-task_id([r745_service_template_provision_task_1712]) MIQ(MiqQueue.put) Message id: [29060353], id: , Zone: [default], Role: [automate], Server: , MiqTask id: , Ident: [generic], Target id: , Instance id: , Task id: , Command: [MiqAeEngine.deliver], Timeout: [3600], Priority: [20], State: [ready], Deliver On: , Data: , Args: [{:object_type=>"ServiceAnsibleTower", :object_id=>584, :attrs=>{:event_type=>:service_provisioned, "MiqEvent::miq_event"=>3481138, :miq_event_id=>3481138, "EventStream::event_stream"=>3481138, :event_stream_id=>3481138}, :instance_name=>"Event", :user_id=>38, :miq_group_id=>146, :tenant_id=>22, :automate_message=>nil}]
[----] E, [2020-02-26T14:32:42.748163 #32587:a9af60] ERROR -- : Q-task_id([r745_service_template_provision_task_1712]) MIQ(MiqQueue#deliver) Message id: [29059879], timed out after 600.397411021 seconds. Timeout threshold [600]
[----] I, [2020-02-26T14:32:42.748217 #32587:a9af60] INFO -- : Q-task_id([r745_service_template_provision_task_1712]) MIQ(MiqGenericWorker::Runner#deliver_queue_message) MIQ(MiqGenericWorker::Runner) Reconnecting to DB after timeout error during queue deliver
[----] I, [2020-02-26T14:32:43.496484 #32587:a9af60] INFO -- : Q-task_id([r745_service_template_provision_task_1712]) MIQ(MiqQueue#delivered) Message id: [29059879], State: [timeout], Delivered in [601.145701679] seconds
[----] I, [2020-02-26T14:32:43.562790 #32587:a9af60] INFO -- : Q-task_id([r745_service_template_provision_task_1712]) MIQ(MiqGenericWorker#log_status) [Generic Worker] Worker ID [13238], PID [32587], GUID [365c0c67-33b2-4bd1-93a8-247782500f9a], Last Heartbeat [2020-02-26 13:22:42 UTC], Process Info: Memory Usage [691355648], Memory Size [1063780352], Proportional Set Size: [604480000], Unique Set Size: [601044000], Memory % [2.75], CPU Time [45220.0], CPU % [0.14], Priority [30]
[----] E, [2020-02-26T14:32:43.563226 #32587:a9af60] ERROR -- : Q-task_id([r745_service_template_provision_task_1712]) MIQ(MiqGenericWorker::Runner) ID [13238] PID [32587] GUID [365c0c67-33b2-4bd1-93a8-247782500f9a] Exiting worker due to timeout error Worker exiting.

Could this be the issue?
Message id: [29059879], timed out after 600.397411021 seconds. Timeout threshold [600]

It is probably related to the msg_timeout column in the miq_queue table. How can I modify the default value for the MiqWorker handler_type?

That’s the Automate timeout. You can try to set the queue_timeout for the generic workers in the Advanced Settings.
However, do you have to wait in Ruby? Can’t you use StateMachine retries to wait until cloud-init has finished configuring the VM?
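
If you want to double-check which value the workers actually picked up after the change, you could inspect the effective settings from a Rails console on the appliance. I have not verified the exact path, so treat it as a sketch; the generic worker block should live somewhere under the workers tree:

# Rails console on the appliance; the path below is assumed -- adjust it to your settings tree
Settings.workers.worker_base.queue_worker_base.defaults.to_h
Settings.workers.worker_base.queue_worker_base.generic_worker.to_h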

I changed queue_timeout; so far it is working fine :slight_smile:

About the state machine - for instance we have a dialog for a K8S cluster on OpenStack. The user can set any number of master nodes and worker nodes in a normal text field input. The state machine is based on AWX because we must use Terraform for OpenStack. So in the provisioning step MIQ has the VM count as input (among other variables). Based on that number, which usually varies between 5 and 12, MIQ needs to:

  • call the IPAM service for each future VM (to get a new hostname)
  • call the LDAP service for each VM (via AWX)
  • now we know the hostnames, so we can generate the Terraform inventory file and push it to Git (AWX)
  • apply the Terraform inventory file (AWX)
  • now all VMs exist, so check that cloud-init has finished for each of them (AWX)
  • finally Kubespray and many company post-configuration scripts, CI/CD init (AWX/JenkinsX)

And there are similar dialogs for WebSphere clusters, Docker servers, etc…

When I was thinking about how this could be done in a “pure” state machine way, I would probably have ended up with many state_vars variables to control the state machine (because of the variable VM count), and the state machine(s) would be too complex…

Now I have just one AWX state machine for everything, with one variable pre-AWX step and one variable post-AWX step (2 custom Ruby methods per dialog type) from which I can make those calls…

@buc Sadly the timeout occurred again today :roll_eyes:

Last week I changed the settings and restarted MIQ; are these the correct settings (from the server page)?

I have tried different values for the defaults and the generic worker block, but according to the error, there must still be a 600s timeout somewhere…

[----] I, [2020-03-03T16:39:59.401567 #2637:400f60] INFO -- : MIQ(MiqQueue.put) Message id: [30102578], id: [], Zone: [default], Role: [], Server: [], MiqTask id: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqEvent.raise_evm_event], Timeout: [600], Priority: [100], State: [ready], Deliver On: [], Data: [], Args: [["MiqServer", 4], "evm_worker_killed", {:event_details=>"Worker [MiqGenericWorker] with ID: [13693], PID: [2833], GUID: [a9628c6c-4336-44e4-810a-943260b0b1ba] being killed because it is not responding", :type=>"MiqGenericWorker"}]
