Why does a provision request stall out at event=<"request_pending">?

Any idea why, about 25% of the time when I submit a request (via a service dialog), the process gets stuck at the <request_pending> event? See logs:

automation.log:

[----] I, [2020-09-15T13:40:52.318143 #3718:6365f0]  INFO -- : MiqAeEvent.build_evm_event >> event=<"request_pending"> inputs=<{"EventStream::event_stream"=>163756, :event_stream_id=>163756}>

evm.log:

[----] I, [2020-09-15T13:40:52.303400 #3718:6365f0]  INFO -- : MIQ(MiqGenericWorker::Runner#get_message_via_drb) Message id: [5474818], MiqWorker id: [823], Zone: [default], Role: [], Server: [], MiqTask id: [], Ident: [generic], Target id: [], Instance id: [363], Task id: [], Command: [ServiceTemplateProvisionRequest.call_automate_event], Timeout: [3600], Priority: [100], State: [dequeue], Deliver On: [], Data: [], Args: ["request_pending"], Dequeued in: [2.120422974] seconds
[----] I, [2020-09-15T13:40:52.303709 #3718:6365f0]  INFO -- : MIQ(MiqQueue#deliver) Message id: [5474818], Delivering...
[----] I, [2020-09-15T13:40:52.307181 #3718:6365f0]  INFO -- : MIQ(ServiceTemplateProvisionRequest#call_automate_event) Raising event [request_pending] to Automate
[----] I, [2020-09-15T13:40:52.370763 #3718:6365f0]  INFO -- : MIQ(MiqQueue.put) Message id: [5474820],  id: [], Zone: [default], Role: [automate], Server: [], MiqTask id: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqAeEngine.deliver], Timeout: [3600], Priority: [20], State: [ready], Deliver On: [], Data: [], Args: [{:object_type=>"ServiceTemplateProvisionRequest", :object_id=>363, :attrs=>{:event_type=>"request_pending", "EventStream::event_stream"=>163756, :event_stream_id=>163756}, :instance_name=>"Event", :user_id=>25, :miq_group_id=>41, :tenant_id=>1, :automate_message=>nil}]
[----] I, [2020-09-15T13:40:52.371038 #3718:6365f0]  INFO -- : MIQ(ServiceTemplateProvisionRequest#call_automate_event) Raised event [request_pending] to Automate
[----] I, [2020-09-15T13:40:52.371457 #3718:6365f0]  INFO -- : MIQ(MiqQueue#delivered) Message id: [5474818], State: [ok], Delivered in [0.067762559] seconds
[----] I, [2020-09-15T13:40:57.043461 #3504:6365f0]  INFO -- : MIQ(MiqServer#heartbeat) Heartbeat [2020-09-15 17:40:57 UTC]...
[----] I, [2020-09-15T13:40:57.054514 #3504:6365f0]  INFO -- : MIQ(MiqServer#heartbeat) Heartbeat [2020-09-15 17:40:57 UTC]...Complete

It doesn’t proceed to invoke /System/Event/RequestEvent/Request/request_pending

If I go in and approve the request as admin, it works. But without that override it never proceeds past request_pending.

Ironically, I did find an error in evm.log that appears relevant, but it happened with a request that successfully passed the request_pending event:

[----] E, [2020-09-15T13:15:31.955286 #3735:6365f0] ERROR -- : MIQ(MiqAeEngine.deliver) Error delivering {:event_type=>"request_pending", "EventStream::event_stream"=>163739, :event_stream_id=>163739, "User::user"=>1, "ServiceTemplateProvisionRequest::service_template_provision_request"=>361} for object [ServiceTemplateProvisionRequest.361] with state [] to Automate: Method exited with rc=Unknown RC: [1]

The line to take note of here is the MiqQueue.put of message id [5474820].

If you grep for '5474820' in evm.log you should be able to see any processing related to this message:

  • the MiqQueue.put of the message onto the queue, as shown by your log line
  • the retrieval of the message by the target worker, shown by something like “Dequeued in: [6.698342458] seconds” and “Message id: [5474820], Delivering…”
  • the worker’s completion, shown as something like “Delivered in [23.469068759] seconds”

The activity between the “Delivering…” and the “Delivered in” is the worker processing the message.
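If you'd rather do this from a script than with grep, here's a minimal Ruby sketch of the same idea (the log path is the usual appliance location and is an assumption on my part):

# Print every evm.log line that mentions a given queue message id
msg_id = "5474820"
File.foreach("/var/www/miq/vmdb/log/evm.log") do |line|
  puts line if line.include?(msg_id)
end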

It’s also useful to note that, as workers process one and only one message at a time, you can see what a worker is doing as a result of each message. The “#3718:6365f0” in each log line helps here: it shows the PID of the worker process that wrote the log line (in this case 3718). Use the output from rake evm:status on the appliance console to get the worker PIDs, then grep for the worker PID in evm.log, e.g. grep '#3718' evm.log
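If rails console is easier to reach than the appliance console, something like this should list the started workers and their PIDs (a sketch, assuming the standard MiqWorker columns):

# List the currently started workers with their PID and type, for cross-referencing against evm.log
MiqWorker.where(:status => "started").pluck(:pid, :type)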

In many cases a worker will dequeue a message, and almost immediately re-queue another message, which is what you’re seeing with message 5474818. What you’re looking for is the successful completion of each message processing, and any errors that might be related.

Hope this helps,
pemcg

Thank you so much @pemcg. I’ll dig in using this guidance.

@pemcg I tried the approach you suggested on talk.manageiq, but evm.log is showing LOTS of different MiqQueue.put messages all the time, more than 20 per minute, all with different message IDs. Does that seem normal? No one else is using the system, and I have deleted all pending requests from the queue (in the UI).

Yes, this is normal; most of the workers get their “work instructions” via messages, so you’ll be seeing the messages associated with normal event processing, C&U, etc. that go on in the background.

You’re specifically looking for the messages relating to your automation workflow, so they’ll be targeted at a generic worker (“Ident: [generic]”). You’re looking for why a message to raise a request_pending event wasn’t picked up and processed. It may have been dequeued and processed with an error (for example if the generic worker was terminated for exceeding its memory threshold), or it may still be on the queue for some reason.
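As a sketch of how you might check for waiting messages from rails console (assuming the standard MiqQueue columns; the “Ident” shown in the log appears to correspond to queue_name, and the automate event deliveries use class MiqAeEngine and method deliver):

# Any generic-queue messages still waiting to be dequeued?
MiqQueue.where(:queue_name => "generic", :state => "ready").pluck(:id, :class_name, :method_name, :zone, :created_on)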

Hope this helps,
pemcg

Thank you @pemcg.

I tried this again, and got another process stuck at the “request_pending” event. I found the following entries in evm.log, which occurred concurrently with the “request_pending” event entry in automation.log:

[----] I, [2020-09-16T17:20:36.111089 #11713:88660c]  INFO -- : MIQ(ServiceTemplateProvisionRequest#call_automate_event) Raising event [request_pending] to Automate
[----] I, [2020-09-16T17:20:36.130056 #11720:88660c]  INFO -- : Q-task_id([r237_miq_provision_327]) MIQ(ManageIQ::Providers::Azure::CloudManager::Provision#poll_destination_in_vmdb) Unable to find Vm [ansible_train] with ems_ref [{:subscription_id=>"8303ae2b-6207-4094-ae92-f93c11fed04a", :vm_resource_group=>"USW_G15_ManageIQ", :type=>"microsoft.compute/virtualmachines", :vm_name=>"ansible_train"}], will retry
[----] I, [2020-09-16T17:20:36.140417 #11713:88660c]  INFO -- : MIQ(MiqQueue.put) Message id: [5533276],  id: [], Zone: [default], Role: [automate], Server: [], MiqTask id: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqAeEngine.deliver], Timeout: [3600], Priority: [20], State: [ready], Deliver On: [], Data: [], Args: [{:object_type=>"ServiceTemplateProvisionRequest", :object_id=>391, :attrs=>{:event_type=>"request_pending", "EventStream::event_stream"=>164969, :event_stream_id=>164969}, :instance_name=>"Event", :user_id=>25, :miq_group_id=>41, :tenant_id=>1, :automate_message=>nil}]
[----] I, [2020-09-16T17:20:36.140515 #11713:88660c]  INFO -- : MIQ(ServiceTemplateProvisionRequest#call_automate_event) Raised event [request_pending] to Automate

This showed me exactly where evm raises the request_pending event, followed by the MiqQueue.put of message id 5533276.

When I grep for 5533276 I get only one line, and it’s the line above from which I got the message id.

Focusing instead on the process id (11713), I find the following in evm.log:

[----] I, [2020-09-16T17:20:36.140417 #11713:88660c]  INFO -- : MIQ(MiqQueue.put) Message id: [5533276],  id: [], Zone: [default], Role: [automate], Server: [], MiqTask id: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqAeEngine.deliver], Timeout: [3600], Priority: [20], State: [ready], Deliver On: [], Data: [], Args: [{:object_type=>"ServiceTemplateProvisionRequest", :object_id=>391, :attrs=>{:event_type=>"request_pending", "EventStream::event_stream"=>164969, :event_stream_id=>164969}, :instance_name=>"Event", :user_id=>25, :miq_group_id=>41, :tenant_id=>1, :automate_message=>nil}]
[----] I, [2020-09-16T17:20:36.140515 #11713:88660c]  INFO -- : MIQ(ServiceTemplateProvisionRequest#call_automate_event) Raised event [request_pending] to Automate
[----] I, [2020-09-16T17:20:36.140747 #11713:88660c]  INFO -- : MIQ(MiqQueue#delivered) Message id: [5533273], State: [ok], Delivered in [0.031736739] seconds

I don’t find any dequeue or error statements, so I assume it’s still in the queue for some reason?

OK, this is interesting. The message is queued for zone “Default”; is this the only zone that you have? Messages are sometimes not dequeued because there’s no appliance in the requested zone with the target server role.

It’s also worth checking whether any workers are being terminated for exceeding their memory allocation; try the following:

grep 'MiqServer#validate_worker' evm.log | grep "WARN\|ERROR"

pemcg

Yes. This is the only zone we have. Because our system was set up by contractors, I never encountered the question of multiple zones. Looking at the docs, this is something I’m sure we’ll want to address. Thank you!

There are no memory-allocation-exceeded errors in evm.log. I did check that yesterday (per your guidance in another person’s post) and I did find one memory error from 2-3 days ago. In response I increased the generic worker’s memory threshold to 1.5 GB (up from 1 GB).

Do you think the Default zone is perhaps a culprit for my stalling process?

No, it’s fine to have just one zone, although for larger deployments separating your providers into their own zones can be beneficial as they can have their own provider-specific appliances. Having all of your providers in one zone means that all appliances in that zone run the workers for all providers.

For example if you have both VMware and OpenStack providers in a single zone, then each appliance with the C&U, Provider Inventory and Event Monitor roles could potentially have any of the following workers running:

Openstack::Cloud::EventCatcher           
Openstack::Cloud::MetricsCollector       
Openstack::Cloud::MetricsCollector       
Openstack::Cloud::Refresh                
Openstack::Network::EventCatcher         
Openstack::Network::MetricsCollector     
Openstack::Network::MetricsCollector     
Openstack::Storage::Cinder::EventCatcher 
Vmware::Infra::EventCatcher              
Vmware::Infra::MetricsCollector          
Vmware::Infra::MetricsCollector          
Vmware::Infra::Refresh

If you separate the providers into their own zones then the appliances will only run the workers that are relevant for the providers in that zone.

You can see if the message is still on the queue using something like this from rails console:

MiqQueue.where(:id => 5533276)
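If that comes back with a record, a few of its columns will show whether it’s still waiting and for which zone; a small sketch along the same lines (again assuming the standard MiqQueue fields):

msg = MiqQueue.find_by(:id => 5533276)
# nil means the message is gone; otherwise show its current state, zone and queue
msg && [msg.state, msg.zone, msg.queue_name, msg.priority, msg.created_on]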

pemcg

We are still trying to sort out why most requests get stuck at “request_pending” for our system.

Below is a line-by-line comparison of the automation.log entries from two requests/processes, one that got “stuck” at “request_pending” and one that processed successfully.

Lines prefixed with “F” are from the process that failed / got stuck.
Lines prefixed with “S” are from the process that succeeded.

You’ll note that they are in step with one another until line 8, where the stuck process logs an extra entry. After that point the two logs become very similar to one another again, apart from some interesting differences.

I’ll put some commentary below these logs…

Line 1:

"S"[----] I, [2020-09-18T13:51:40.795014 #10072:4598e50]  INFO -- : <AEMethod approve> Sending email to <Mike.Butak@expeditors.com> from <miq@expeditors.com> subject: <Request ID 403 - Your Service provision request has been approved by the level 2 approver.>
"F"[----] I, [2020-09-18T13:53:47.783933 #10063:838fd30]  INFO -- : <AEMethod approve> Sending email to <Mike.Butak@expeditors.com> from <miq@expeditors.com> subject: <Request ID 403 - Your Service provision request has been approved by the level 3 approver.>

Line 2:

"S"[----] I, [2020-09-18T13:51:40.834755 #10072:69a5f0]  INFO -- : <AEMethod [/EXP/custom/multiapprove/Email/approve]> Ending
"F"[----] I, [2020-09-18T13:53:48.223100 #10063:69a5f0]  INFO -- : <AEMethod [/EXP/custom/multiapprove/Email/approve]> Ending

Line 3:

"S"[----] I, [2020-09-18T13:51:40.834883 #10072:69a5f0]  INFO -- : Method exited with rc=MIQ_OK

"F"[----] I, [2020-09-18T13:53:48.223266 #10063:69a5f0]  INFO -- : Method exited with rc=MIQ_OK

Line 4:

"S"[----] I, [2020-09-18T13:51:40.835341 #10072:69a5f0]  INFO -- : Followed  Relationship [miqaedb:/custom/multiapprove/Email/approve#create]
"F"[----] I, [2020-09-18T13:53:48.223701 #10063:69a5f0]  INFO -- : Followed  Relationship [miqaedb:/custom/multiapprove/Email/approve#create]

Line 5:

"S"[----] I, [2020-09-18T13:51:40.835539 #10072:69a5f0]  INFO -- : Followed  Relationship [miqaedb:/System/Policy/ServiceTemplateProvisionRequest_Approved#create]
"F"[----] I, [2020-09-18T13:53:48.223884 #10063:69a5f0]  INFO -- : Followed  Relationship [miqaedb:/System/Policy/ServiceTemplateProvisionRequest_Approved#create]

Line 6:

"S"[----] I, [2020-09-18T13:51:40.836421 #10072:69a5f0]  INFO -- : Followed  Relationship [miqaedb:/System/Policy/request_approved#create]
"F"[----] I, [2020-09-18T13:53:48.224340 #10063:69a5f0]  INFO -- : Followed  Relationship [miqaedb:/System/Policy/request_approved#create]

Line 7:

"S"[----] I, [2020-09-18T13:51:40.837442 #10072:69a5f0]  INFO -- : Followed  Relationship [miqaedb:/System/Event/RequestEvent/Request/request_approved#create]
"F"[----] I, [2020-09-18T13:53:48.224962 #10063:69a5f0]  INFO -- : Followed  Relationship [miqaedb:/System/Event/RequestEvent/Request/request_approved#create]

Line 8:
“S” NO COMPARABLE LOG ENTRY IN THIS FILE. The “Successful” request skips right to the next line…

"F"[----] I, [2020-09-18T13:53:52.840123 #10055:69a5f0]  INFO -- : MiqAeEvent.build_evm_event >> event=<"request_pending"> inputs=<{"EventStream::event_stream"=>166633, :event_stream_id=>166633}>

Line 9:

"S"[----] I, [2020-09-18T13:51:44.974333 #10055:69a5f0]  INFO -- : User [admin] with current group ID [2] name [EvmGroup-super_administrator]
"F"[----] I, [2020-09-18T13:54:39.015296 #10055:69a5f0]  INFO -- : User [admin] with current group ID [1] name [Tenant My Company access]

Line 10:

"S"[----] I, [2020-09-18T13:51:44.990576 #10055:69a5f0]  INFO -- : Instantiating [/System/Process/Event?EventStream%3A%3Aevent_stream=166630&MiqRequest%3A%3Amiq_request=403&MiqServer%3A%3Amiq_server=1&ServiceTemplateProvisionRequest%3A%3Aservice_template_provision_request=403&User%3A%3Auser=1&event_stream_id=166630&event_type=request_pending&object_name=Event&vmdb_object_type=service_template_provision_request]
"F"[----] I, [2020-09-18T13:54:39.017818 #10055:69a5f0]  INFO -- : Instantiating [/System/Process/Event?EventStream%3A%3Aevent_stream=166634&MiqServer%3A%3Amiq_server=1&User%3A%3Auser=1&event_id=166634&event_stream_id=166634&event_type=job_create&object_name=Event&vmdb_object_type=event_stream]

Line 11:

"S"[----] I, [2020-09-18T13:51:45.020073 #10055:69a5f0]  INFO -- : Updated namespace [/System/Process/Event?EventStream%3A%3Aevent_stream=166630&MiqRequest%3A%3Amiq_request=403&MiqServer%3A%3Amiq_server=1&ServiceTemplateProvisionRequest%3A%3Aservice_template_provision_request=403&User%3A%3Auser=1&event_stream_id=166630&event_type=request_pending&object_name=Event&vmdb_object_type=service_template_provision_request  ManageIQ/System]
"F"[----] I, [2020-09-18T13:54:39.046677 #10055:69a5f0]  INFO -- : Updated namespace [/System/Process/Event?EventStream%3A%3Aevent_stream=166634&MiqServer%3A%3Amiq_server=1&User%3A%3Auser=1&event_id=166634&event_stream_id=166634&event_type=job_create&object_name=Event&vmdb_object_type=event_stream  ManageIQ/System]

Line 12:

"S"[----] I, [2020-09-18T13:51:45.088443 #10055:69a5f0]  INFO -- : Following Relationship [miqaedb:/System/Event/RequestEvent/Request/request_pending#create]
"F"[----] I, [2020-09-18T13:54:39.110734 #10055:69a5f0]  INFO -- : Following Relationship [miqaedb:/System/Event/EmsEvent/ANSIBLETOWER/job_create#create]

Commentary:

Lines 1-7: the two processes have basically identical logs
Line 8: the failed process logs a MiqAeEvent.build_evm_event entry for event=“request_pending”. The successful process has no such entry.
Line 9: the successful process says the user (admin) is a member of group “EvmGroup-super_administrator” while the failed process says the user (admin) is a member of group “Tenant My Company access”. No idea what that might mean.
Lines 10-11: the successful process instantiates /System/Process/Event with MiqRequest/ServiceTemplateProvisionRequest objects and event_type=request_pending, whereas the failed process instantiates it with only an EventStream object and event_type=job_create.
Line 12: the successful process proceeds to call RequestEvent/Request/request_pending (which takes it on to the next step in the process) while the failed process instead calls EmsEvent/ANSIBLETOWER/job_create, which is a placeholder we are not using. From this point on the failed process is stuck.

Everything seems to boil down to the unique log entry on line 8 in the failed process (the MiqAeEvent.build_evm_event entry with event=“request_pending”). Researching that method I found a discussion from several years ago (“Openstack Provisioning stucks early with pending state”) where a team had this same behavior, and it came down to excessive events (several thousand) in the miq_queue. I was excited that this might be my problem, but we currently only have 8 event records in our miq_queue table. :frowning:
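A sketch of one way to count those queued event deliveries from rails console (assuming the standard MiqQueue columns; the automate event deliveries appear as class MiqAeEngine with method deliver, as in the log lines above):

# Count event-delivery messages currently on the queue
MiqQueue.where(:class_name => "MiqAeEngine", :method_name => "deliver").count
# ...and list the ones still waiting in the 'ready' state
MiqQueue.where(:class_name => "MiqAeEngine", :method_name => "deliver", :state => "ready").pluck(:id, :created_on, :zone)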

Thanks for reading this far if you are still with me. Any suggestions would be deeply appreciated!

@pemcg I did double the number of Generic and Priority workers, but that doesn’t seem to change the issue of requests getting stuck. Also, as noted in the (super long) post above, I don’t have gobs of events stuck in the queue.

We have every role turned on for our default zone except:
Capacity & Utilization *
Cockpit
Embedded Ansible
Internet Connectivity
SmartProxy

Normally a request_pending event would be associated with a request that hasn’t been approved yet, so I’m slightly confused by the “Your Service provision request has been approved” messages.

It suggests that the setting of the request to a “pending” state possibly hasn’t worked, and that the request has somehow already been approved?

Does the service trigger an Ansible Tower playbook? If so, this would be backed up by the detection of the job_create event from Tower, which would be the processing of the service task.

pemcg

After one of the several approvers approves a request, it gets immediately redirected to pending ($evm.root['miq_request'].pending), where another custom relationship directs to a method that updates the approver in the db and emails the approver and the requestor. It works!
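For anyone landing here later, a minimal sketch of the kind of Automate method described above; the logging and helper steps are illustrative assumptions, only the $evm.root['miq_request'].pending call is taken from the actual flow:

# Illustrative Automate method body: record the approval, notify people,
# then push the request back to 'pending' so the next approver can act.
request  = $evm.root['miq_request']
approver = $evm.root['user']

$evm.log(:info, "Recording approval by #{approver.name} on request #{request.id}")
# ... the real method updates the approver in the db and sends the emails here ...

# Setting the request back to pending re-raises the request_pending event
request.pending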