OpenStack provisioning gets stuck early in pending state


#1

Hey Guys,

It would be great if somebody could help us out. The use case is nothing special:
we are currently defining a workflow for an OpenStack cloud provider. So far we have not customized the cloud provisioning state machine.
But now, when we call the REST API with all parameters, we get a pending process with no further activity or updates… this is all we can get from the logs:

[----] I, [2017-01-26T15:19:08.776788 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) MiqAeEvent.build_evm_event >> event=<"request_starting"> inputs=<{"EventStream::event_stream"=>5000025367289, :event_stream_id=>5000025367289}>
[----] I, [2017-01-26T15:19:08.822292 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Instantiating [/System/Process/Event?EventStream%3A%3Aevent_stream=5000025367289&MiqProvisionRequest%3A%3Amiq_provision_request=5000000000008&MiqRequest%3A%3Amiq_request=5000000000008&MiqServer%3A%3Amiq_server=5000000000003&User%3A%3Auser=5000000000001&event_stream_id=5000025367289&event_type=request_starting&object_name=Event&vmdb_object_type=miq_provision_request]
[----] I, [2017-01-26T15:19:09.095907 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Updated namespace [/System/Process/Event?EventStream%3A%3Aevent_stream=5000025367289&MiqProvisionRequest%3A%3Amiq_provision_request=5000000000008&MiqRequest%3A%3Amiq_request=5000000000008&MiqServer%3A%3Amiq_server=5000000000003&User%3A%3Auser=5000000000001&event_stream_id=5000025367289&event_type=request_starting&object_name=Event&vmdb_object_type=miq_provision_request  ManageIQ/System]
[----] I, [2017-01-26T15:19:09.612222 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Following Relationship [miqaedb:/System/Event/RequestEvent/Request/request_starting#create]
[----] I, [2017-01-26T15:19:09.674332 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Updated namespace [miqaedb:/System/Event/RequestEvent/Request/request_starting#create  ManageIQ/System/Event/RequestEvent]
[----] I, [2017-01-26T15:19:09.778417 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Following Relationship [miqaedb:/System/Policy/request_starting#create]
[----] I, [2017-01-26T15:19:09.795141 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Updated namespace [miqaedb:/System/Policy/request_starting#create  ManageIQ/System]
[----] I, [2017-01-26T15:19:09.826150 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Updated namespace [System/Policy/get_request_type  ManageIQ/System]
[----] I, [2017-01-26T15:19:09.846611 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Invoking [inline] method [/ManageIQ/System/Policy/get_request_type] with inputs [{}]
[----] I, [2017-01-26T15:19:09.874604 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) <AEMethod [/ManageIQ/System/Policy/get_request_type]> Starting 
[----] I, [2017-01-26T15:19:10.944903 #2880:8085b8]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) <AEMethod get_request_type> Request Type:<MiqProvisionRequest>
[----] I, [2017-01-26T15:19:10.984238 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) <AEMethod [/ManageIQ/System/Policy/get_request_type]> Ending
[----] I, [2017-01-26T15:19:10.984342 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Method exited with rc=MIQ_OK
[----] I, [2017-01-26T15:19:10.985070 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Following Relationship [miqaedb:/System/Process/parse_provider_category#create]
[----] I, [2017-01-26T15:19:10.999544 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Updated namespace [miqaedb:/System/Process/parse_provider_category#create  ManageIQ/System]
[----] I, [2017-01-26T15:19:11.018083 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Updated namespace [System/Process/parse_provider_category  ManageIQ/System]
[----] I, [2017-01-26T15:19:11.028330 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Invoking [inline] method [/ManageIQ/System/Process/parse_provider_category] with inputs [{}]
[----] I, [2017-01-26T15:19:11.029477 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) <AEMethod [/ManageIQ/System/Process/parse_provider_category]> Starting 
[----] I, [2017-01-26T15:19:12.363022 #2880:becdb4]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) <AEMethod parse_provider_category> Parse Provider Category Key: "miq_request"  Value: cloud
[----] I, [2017-01-26T15:19:12.406191 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) <AEMethod [/ManageIQ/System/Process/parse_provider_category]> Ending
[----] I, [2017-01-26T15:19:12.406296 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Method exited with rc=MIQ_OK
[----] I, [2017-01-26T15:19:12.407082 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Followed  Relationship [miqaedb:/System/Process/parse_provider_category#create]
[----] I, [2017-01-26T15:19:12.407374 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Following Relationship [miqaedb:/System/Policy/MiqProvisionRequest_starting#create]
[----] I, [2017-01-26T15:19:12.418402 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Updated namespace [miqaedb:/System/Policy/MiqProvisionRequest_starting#create  ManageIQ/System]
[----] I, [2017-01-26T15:19:12.430518 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Following Relationship [miqaedb:/System/CommonMethods/QuotaStateMachine/quota#create]
[----] I, [2017-01-26T15:19:12.540160 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Updated namespace [miqaedb:/System/CommonMethods/QuotaStateMachine/quota#create  ManageIQ/System/CommonMethods]
[----] I, [2017-01-26T15:19:12.583781 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Processing State=[quota_source]
[----] I, [2017-01-26T15:19:12.584012 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Following Relationship [miqaedb:/System/CommonMethods/QuotaMethods/quota_source#create]
[----] I, [2017-01-26T15:19:12.606723 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Updated namespace [miqaedb:/System/CommonMethods/QuotaMethods/quota_source#create  ManageIQ/System/CommonMethods]
[----] I, [2017-01-26T15:19:12.635981 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Updated namespace [System/CommonMethods/QuotaMethods/quota_source  ManageIQ/System/CommonMethods]
[----] I, [2017-01-26T15:19:12.648798 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Invoking [inline] method [/ManageIQ/System/CommonMethods/QuotaMethods/quota_source] with inputs [{}]
[----] I, [2017-01-26T15:19:12.649692 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) <AEMethod [/ManageIQ/System/CommonMethods/QuotaMethods/quota_source]> Starting 
[----] I, [2017-01-26T15:19:13.599836 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) <AEMethod [/ManageIQ/System/CommonMethods/QuotaMethods/quota_source]> Ending
[----] I, [2017-01-26T15:19:13.599943 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Method exited with rc=MIQ_OK
[----] I, [2017-01-26T15:19:13.600335 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Followed  Relationship [miqaedb:/System/CommonMethods/QuotaMethods/quota_source#create]
[----] I, [2017-01-26T15:19:13.600468 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Processed  State=[quota_source] with Result=[ok]
[----] I, [2017-01-26T15:19:13.600573 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Processed State =[quota_source]
[----] I, [2017-01-26T15:19:13.600853 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Next State=[limits]
[----] I, [2017-01-26T15:19:13.601152 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Processing State=[limits]
[----] I, [2017-01-26T15:19:13.601370 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Following Relationship [miqaedb:/System/CommonMethods/QuotaMethods/limits#create]
[----] I, [2017-01-26T15:19:13.628915 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Updated namespace [miqaedb:/System/CommonMethods/QuotaMethods/limits#create  ManageIQ/System/CommonMethods]
[----] I, [2017-01-26T15:19:13.658771 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Updated namespace [System/CommonMethods/QuotaMethods/limits  ManageIQ/System/CommonMethods]
[----] I, [2017-01-26T15:19:13.673136 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Invoking [inline] method [/ManageIQ/System/CommonMethods/QuotaMethods/limits] with inputs [{}]
[----] I, [2017-01-26T15:19:13.674092 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) <AEMethod [/ManageIQ/System/CommonMethods/QuotaMethods/limits]> Starting 
[----] I, [2017-01-26T15:19:14.234357 #2880:d79880]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) <AEMethod limits> Getting Tenant Quota Values for: {}
[----] I, [2017-01-26T15:19:14.238888 #2880:d79880]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) <AEMethod limits> No Quota limits set. No quota check being done.
[----] I, [2017-01-26T15:19:14.284665 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) <AEMethod [/ManageIQ/System/CommonMethods/QuotaMethods/limits]> Ending
[----] I, [2017-01-26T15:19:14.284772 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Method exited with rc=MIQ_OK
[----] I, [2017-01-26T15:19:14.285281 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Followed  Relationship [miqaedb:/System/CommonMethods/QuotaMethods/limits#create]
[----] I, [2017-01-26T15:19:14.285406 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Processed  State=[limits] with Result=[ok]
[----] W, [2017-01-26T15:19:14.285574 #2880:82b978]  WARN -- : Q-task_id([miq_provision_request_5000000000008]) Skipping to state finished
[----] I, [2017-01-26T15:19:14.285690 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Processed State =[limits]
[----] W, [2017-01-26T15:19:14.285823 #2880:82b978]  WARN -- : Q-task_id([miq_provision_request_5000000000008]) Skipping to state finished
[----] I, [2017-01-26T15:19:14.286093 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Next State=[finished]
[----] I, [2017-01-26T15:19:14.286531 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Processed State =[finished]
[----] I, [2017-01-26T15:19:14.286744 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Next State=[]
[----] I, [2017-01-26T15:19:14.286939 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Followed  Relationship [miqaedb:/System/CommonMethods/QuotaStateMachine/quota#create]
[----] I, [2017-01-26T15:19:14.287276 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Followed  Relationship [miqaedb:/System/Policy/MiqProvisionRequest_starting#create]
[----] I, [2017-01-26T15:19:14.287569 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Followed  Relationship [miqaedb:/System/Policy/request_starting#create]
[----] I, [2017-01-26T15:19:14.287909 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Followed  Relationship [miqaedb:/System/Event/RequestEvent/Request/request_starting#create]
[----] I, [2017-01-26T15:19:15.003218 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Instantiating [/System/Process/REQUEST?MiqProvision%3A%3Amiq_provision=5000000000008&MiqServer%3A%3Amiq_server=5000000000003&User%3A%3Auser=5000000000001&message=get_vmname&object_name=REQUEST&request=UI_PROVISION_INFO&vmdb_object_type=miq_provision]
[----] I, [2017-01-26T15:19:15.065303 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Updated namespace [/System/Process/REQUEST?MiqProvision%3A%3Amiq_provision=5000000000008&MiqServer%3A%3Amiq_server=5000000000003&User%3A%3Auser=5000000000001&message=get_vmname&object_name=REQUEST&request=UI_PROVISION_INFO&vmdb_object_type=miq_provision  ManageIQ/System]
[----] I, [2017-01-26T15:19:15.229373 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Updated namespace [System/Process/parse_provider_category  ManageIQ/System]
[----] I, [2017-01-26T15:19:15.240287 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Invoking [inline] method [/ManageIQ/System/Process/parse_provider_category] with inputs [{}]
[----] I, [2017-01-26T15:19:15.241142 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) <AEMethod [/ManageIQ/System/Process/parse_provider_category]> Starting 
[----] I, [2017-01-26T15:19:15.892259 #2880:808b1c]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) <AEMethod parse_provider_category> Parse Provider Category Key: "miq_provision"  Value: cloud
[----] I, [2017-01-26T15:19:15.933251 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) <AEMethod [/ManageIQ/System/Process/parse_provider_category]> Ending
[----] I, [2017-01-26T15:19:15.933399 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Method exited with rc=MIQ_OK
[----] I, [2017-01-26T15:19:15.935256 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Following Relationship [miqaedb:/System/Request/UI_PROVISION_INFO#create]
[----] I, [2017-01-26T15:19:15.962523 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Updated namespace [miqaedb:/System/Request/UI_PROVISION_INFO#create  ManageIQ/System]
[----] I, [2017-01-26T15:19:15.987764 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Following Relationship [miqaedb:/cloud/VM/Provisioning/Profile/EvmGroup-super_administrator#get_vmname]
[----] I, [2017-01-26T15:19:16.187066 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Updated namespace [miqaedb:/cloud/VM/Provisioning/Profile/EvmGroup-super_administrator#get_vmname  ManageIQ/cloud/VM/Provisioning]
[----] I, [2017-01-26T15:19:16.245707 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Following Relationship [miqaedb:/Cloud/VM/Provisioning/Naming/Default#create]
[----] I, [2017-01-26T15:19:16.471986 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Updated namespace [miqaedb:/Cloud/VM/Provisioning/Naming/Default#create  ManageIQ/Cloud/VM/Provisioning]
[----] I, [2017-01-26T15:19:16.510690 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Updated namespace [Cloud/VM/Provisioning/Naming/vmname  ManageIQ/Cloud/VM/Provisioning]
[----] I, [2017-01-26T15:19:16.533405 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Invoking [inline] method [/ManageIQ/Cloud/VM/Provisioning/Naming/vmname] with inputs [{}]
[----] I, [2017-01-26T15:19:16.534401 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) <AEMethod [/ManageIQ/Cloud/VM/Provisioning/Naming/vmname]> Starting 
[----] I, [2017-01-26T15:19:17.120868 #2880:8084b4]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) <AEMethod vmname> Detected vmdb_object_type:<miq_provision>
[----] I, [2017-01-26T15:19:17.133648 #2880:8084b4]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) <AEMethod vmname> VM Name: <somefqdn.here.com>
[----] I, [2017-01-26T15:19:17.174468 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) <AEMethod [/ManageIQ/Cloud/VM/Provisioning/Naming/vmname]> Ending
[----] I, [2017-01-26T15:19:17.174577 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Method exited with rc=MIQ_OK
[----] I, [2017-01-26T15:19:17.175581 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Followed  Relationship [miqaedb:/Cloud/VM/Provisioning/Naming/Default#create]
[----] I, [2017-01-26T15:19:17.175738 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Followed  Relationship [miqaedb:/cloud/VM/Provisioning/Profile/EvmGroup-super_administrator#get_vmname]
[----] I, [2017-01-26T15:19:17.176082 #2880:82b978]  INFO -- : Q-task_id([miq_provision_request_5000000000008]) Followed  Relationship [miqaedb:/System/Request/UI_PROVISION_INFO#create]

At this point we see no more log entries on any worker or in any log file; the request stays in pending state and nothing else happens…

It would be great if somebody could help us, or shed some light on how to debug such issues.

Thanks


#2

Hey there,

we think we found a hint about this issue:

We looked in our database and found more than 2000 events in the miq_queue table.
After some time these events were processed, and the automate engine then processed our provision_request.

The issue is potentially fixed, but…
we may have a problem with the number of events produced by the vCenter. Currently we have a setup with 4 automate-engine workers that handle these events.

Do you have any suggestions on how to scale for so many events? Is there any documentation about that?

Thanks


#3

Hey there,

I moved this issue to the developer category - maybe it will get more attention here…

We’re still investigating this problem. Our findings:
We now have a small loop script that prints a summary of the miq_queue table, with a count per class, method and role.

Output is:
class_name.method_name(role)=count
with this query:

echo "select class_name || '.' || method_name || '(' || coalesce(role, '') || ') =' || count(*) from miq_queue group by class_name,method_name,role" | psql -td vmdb_production
 MiqServer.log_status() =3
 MiqServer.ntp_reload() =1
 EmsRefresh.vc_update(ems_inventory) =9
 MiqAeEngine.deliver(automate) =2638
 EmsRefresh.refresh(ems_inventory) =2
 ManageIQ::Providers::Vmware::InfraManager::Vm.classify_with_parent_folder_path() =31
 JobProxyDispatcher.dispatch(smartstate) =1
 MiqServer.status_update() =4
 Job.signal(smartproxy) =1
 VmdbDatabase.capture_metrics_timer(database_owner) =1
 VmdbDatabaseConnection.log_statistics() =1
 MiqWorker.log_status_all() =3

When our issue appears, we can watch the queue grow huge with VMware events. These events should be processed by the AutomateEngine role.
You can see that MiqAeEngine.deliver had 2638 events in the queue during this pending state, in which the complete platform (1 master and 4 workers) stalls. Once those events are processed, the engine works again.

If we disable the VMware provider, everything is smooth, fast and working as expected.
So for now, the cause seems to be the huge number of events from the vCenter.

The question:
As I mentioned, we currently have 4 workers with the AutomateEngine role enabled. Is there a way to debug the workload by role and see which worker is processing which event, and when?

Additional question:
Correct me if I’m wrong: if we have a problem with the event “load”, is the solution to scale the workers vertically?

Thanks for your time! We appreciate it!


#4

Hi

Try increasing the number of generic and priority workers on each of your appliances with the ‘Automation Engine’ role. Queue messages that deliver a MiqAeEngine.deliver(automate) for an event have a priority of 20, so they’ll be processed by the priority workers. Most other automate tasks have a priority of 100 or so and will be processed by the generic workers. Each worker will only process one automate task at a time though.
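
To see how the automate backlog splits between these priorities, a query along the following lines may help. This is only a sketch: the function name `automate_queue_sql` is made up for illustration, it assumes that miq_queue has `priority` and `state` columns, and it reuses the vmdb_production database name from the query earlier in this thread.

```shell
# Hypothetical helper: print a query that summarises queued automate
# deliveries by priority and state.
# Assumption: miq_queue has "priority" and "state" columns.
automate_queue_sql() {
  cat <<'SQL'
select priority, state, count(*)
  from miq_queue
 where class_name = 'MiqAeEngine' and method_name = 'deliver'
 group by priority, state
 order by priority, state;
SQL
}

# Usage on an appliance (not executed here):
#   automate_queue_sql | psql -td vmdb_production
```

If most rows sit at priority 20, the backlog is event-driven and the priority workers are the likely bottleneck.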

Regards,
pemcg


#5

I should add that you might also need to increase the worker memory slightly (the default values for memory threshold don’t give a lot of headroom over steady-state usage). Try grepping for ‘exceeded limit’ in evm.log and see if you get any hits.
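
A minimal sketch of that check, wrapped in a function so it can be run against several log files at once. The function name `count_exceeded_limit` is hypothetical, and the log path in the usage comment is an assumption about a typical appliance layout; adjust it to wherever your evm.log lives.

```shell
# Hypothetical helper: count 'exceeded limit' lines in each given log file.
count_exceeded_limit() {
  for log in "$@"; do
    # grep -c prints the number of matching lines (0 if there are none)
    printf '%s: %s\n' "$log" "$(grep -c 'exceeded limit' "$log")"
  done
}

# Usage (path is an assumption, not taken from this thread):
#   count_exceeded_limit /var/www/miq/vmdb/log/evm.log
```

A non-zero count suggests workers are being restarted for exceeding their memory threshold.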

Worker count and memory thresholds are set from the Configuration menu. Highlight your appliance in the left hand pane, and click the ‘Workers’ tab in the right hand pane.

Hope this helps,
pemcg


#6

Hey,

Thanks for the fast and great response! The category change helped :wink:

Your hint that the priority workers process the events and the generic workers process the automate code is very useful!

We already raised the thresholds and the memory of all appliances, but this has not helped so far.

We’ll try to raise the worker count and see if this helps…

Is there any further documentation about the worker roles? I found a doc, but it does not go as deep as your short description here :wink:

Thanks !


#7

Hey,

We raised the worker count on every AutomateEngine appliance to 4, with a 1 GB memory threshold. In my opinion the event processing is a little faster, but still slow… 4 appliances with these specs process 600 events in 4 minutes.

Do we have to scale the workers? Or is there another solution?
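
To put that rate in perspective, here is a rough back-of-the-envelope calculation. It is a sketch only: the function name `drain_estimate` is made up, the backlog of 2638 is the MiqAeEngine.deliver count from the earlier post, and it assumes no new events arrive while the queue drains, which is unlikely on a busy vCenter.

```shell
# Hypothetical helper: drain_estimate EVENTS MINUTES BACKLOG
# Prints the observed processing rate and how long BACKLOG would take
# at that rate, assuming nothing new is queued in the meantime.
drain_estimate() {
  awk -v events="$1" -v minutes="$2" -v backlog="$3" 'BEGIN {
    rate = events / (minutes * 60)   # events per second
    printf "%.1f events/s, %d queued events drain in about %.0f s\n", rate, backlog, backlog / rate
  }'
}

drain_estimate 600 4 2638
# prints: 2.5 events/s, 2638 queued events drain in about 1055 s
```

That is roughly 17.6 minutes of stall for a backlog of that size, before counting any newly arriving events.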

Thanks


#8

I hate threads where the problem was fixed and nobody posted the solution. So here is how we solved this issue:

As mentioned, the problem was the huge number of events from the vCenter. But there are many workers and many places the issue could have come from, so we decided to write a small script to collect long-term statistics
(using the query mentioned in a post above):

 while true; do echo -n "$(date)"; echo "select count(*) from miq_queue" | psql -td vmdb_production; echo "select class_name || '.' || method_name || '(' || coalesce(role, '') || ') =' || count(*) from miq_queue group by class_name,method_name,role" | psql -td vmdb_production; sleep 5; done >> /root/queueSize

A second small script makes this readable for a line chart:

grep 'Mon Feb  6' queueSize | while read ds m d t u y c; do t=$(echo $t|tr ':' ','); echo "[[$t], $c],"; done
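
To illustrate what that conversion emits, here is the same read/tr logic wrapped in a function and fed a single sample line. The function name `to_chart_rows` and the sample line are made up for illustration; the sample assumes the default `date` output format followed by the queue count, as written by the collection loop.

```shell
# Hypothetical wrapper around the read/tr logic above: turns lines like
# "Mon Feb  6 15:19:08 CET 2017  2638" into "[[15,19,08], 2638]," rows.
to_chart_rows() {
  # Fields: weekday, month, day, time, timezone, year, count
  while read -r ds m d t u y c; do
    t=$(echo "$t" | tr ':' ',')   # 15:19:08 -> 15,19,08
    echo "[[$t], $c],"
  done
}

# Made-up sample line, not real measurement data:
printf 'Mon Feb  6 15:19:08 CET 2017  2638\n' | to_chart_rows
# prints: [[15,19,08], 2638],
```

On the real file this would be used as `grep 'Mon Feb  6' queueSize | to_chart_rows`.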

so our graph from yesterday showed the following picture:

As suggested by @pemcg in this post, we raised the worker count and the memory threshold. This helped a little, but the events still took too long to be processed. So we decided to vertically scale the worker nodes.

And…:

I think it would be helpful if the documentation about the worker roles were more detailed. Or maybe built-in monitoring that checks whether everything has the right settings/size…

Issue can be closed :slight_smile:


#9

Hi

Thanks for your description of the solution. Could you possibly expand on “vertical scale the worker nodes”, do you mean add more vCPUs? A description of how you performed the “Reindex” might also be useful for other people reading.

It looks like you have both VMware and OpenStack providers, have you split these into separate zones? This can help isolate processing so that the events from one do not adversely affect other providers.

We are always working on improving the documentation, and there will be a fuller description of server roles, workers, messages and the implications for scaling appearing in a doc sometime soon.

Regards,
pemcg


#10

By vertical scaling I mean that we provisioned 3 new worker nodes with the same roles enabled as the other worker nodes in this platform.
The reindex (my wording) process started automatically when we added a new provider.