Terminating non responsive method with pid

Hello, sometimes I get error like this:

Q-task_id([r495_service_template_provision_task_1139]) Terminating non responsive method with pid 3819

which stops currently running automation steps. Last one failed from automation ruby script which was running for several minutes and wainting for cloudinit finalization for array of hostnames…

Is there any option in settings where timout for “terminating non responsive method” could be adjusted? Or how can it be prevented? Can I notify MIQ druing ruby method that method is runnign fine? Even if it takes more time?

I use sometimes ruby sleep method in scripts (but not for longer then 1 minute).

I have hammer-4

Thank you very much

Automate Methods timeout after 10 minutes. There is probably a parameter in the Advanced settings to configure it, but usually Methods should not run for that long

Do you have to wait in ruby? Can’t you just check the state and issue a retry if it is not finished yet?

Some of our methods run for 30 minutes without problem, but this one is failing after 3-6 minutes. According to stacktrace it always fails on this line:

sleep(15.seconds)

Looking at the code, I don’t think ManageIQ kills the Method on purpose. There should be a log Method exec failed because ... and the stacktrace should give more info on what is happening

Code: https://github.com/ManageIQ/manageiq-automation_engine/blob/9ab733b5fce3e6c1361482702558d355727ad5a3/lib/miq_automation_engine/engine/miq_ae_engine/miq_ae_method.rb#L208

Thank you @buc for that info. Unfortunately I restarted whole appliance with log cleanup, so I don’t have old logs. And first provision attempt after AP restart went well.

Hello @buc
So I got the error again:

[----] E, [2020-02-11T10:49:52.786198 #31916:f1cf48] ERROR -- : Q-task_id([r567_service_retire_task_1279]) Terminating non responsive method with pid 17841
[----] E, [2020-02-11T10:49:52.786957 #31916:bf3b0a0] ERROR -- : Q-task_id([r567_service_retire_task_1279]) <AEMethod retire_terraform_service> The following error occurred during method evaluation:
[----] E, [2020-02-11T10:49:52.787351 #31916:bf3b0a0] ERROR -- : Q-task_id([r567_service_retire_task_1279]) <AEMethod retire_terraform_service>   SignalException: SIGTERM
[----] E, [2020-02-11T10:49:52.788748 #31916:bf3b0a0] ERROR -- : Q-task_id([r567_service_retire_task_1279]) <AEMethod retire_terraform_service>   /KBCZ/Functions/Awx/utils:24:in `sleep'

It is again in sleep method (waiting for awx job result)

But unfortunately I can’t find Method exec failed anywhere in the logs

 grep -i "exec failed" *.log

Is this the whole stack trace?

The error message I mentioned is printed to STDERR. So I would not be suprised if it does not show up in the logs, however it should be in the journal. you could try looking for it with journalctl

Anyway, we now know that something is sending the process a SIGTERM.
I would guess that your sleep is not the problem here, that is just the line your code happens to be, when something else is killing it.
I have no Idea where the SIGTERM could be comming from, but here are some ideas:

  • If the Automate Method is killed, it should cause the DRuby connection to fail and leave a second stack trace? It probably won’t be anything useful, but it’s worth a shot
  • Is there some external factor that could send the SIGTERM? Maybe the Kernel OOM Killer? Something like it?
  • Have you looked in all the logs at that specific timestamp? Apparently ManageIQ will some worker processes if it experiences memory pressure
    https://github.com/search?p=4&q=org%3AManageIQ+kill&type=Code