Platform is stalling


#1

Hey there,

With this topic, we want to understand how things work.

Currently we have an issue which is not clearly definable.
I’ll give it a try:
Our environment (Darga-2) has 8 servers (1 master and 7 workers) with one OpenStack provider configured.
Roughly once a week we run into an issue where we make a provisioning call and it seems that nothing is working anymore…
Normally every request is auto-approved; when this issue appears we get a message that we have to approve the request manually (with the approve or deny button on the request status page). This is the first sign of the issue.
If we approve the request at this point, there is only a message saying “automation starting”, but nothing happens for hours.

The workaround until now has been to restart the platform; this has helped really well in the past, but it cannot be the solution. An additional piece of information: last time we restarted only one worker node and the platform began to process the request… Unfortunately, this is very hard to reproduce for us, and even harder to debug.

Is there anyone, a senior engineer, architect or developer, who knows this issue or can try to explain what is happening there?

In the past, we’ve asked several times for a short description of how to debug requests before the state machine is running. From that point on you can add debug lines and so on, but if a request gets stuck at some earlier point (like I described above), is there any hint or description of how to debug those issues?

It would be great if somebody had a bit of time to discuss our problem.

Greetings


#2

Hi

So here are a few things that I would check.

Firstly, general appliance CPU & memory. Use ‘top’ to check that each MIQ appliance has greater than 20% idle CPU time (several workers run with reniced priorities, so checking idle is easier than summing us + sy + ni), and isn’t swapping. CPU load is especially important to monitor on a dedicated VMDB appliance, as high load there can have an adverse effect on the entire region.
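
As a quick way to snapshot this (a minimal sketch; ‘top’ in batch mode prints the aggregate CPU summary, and ‘free’ shows swap usage):

    # one-shot CPU and memory snapshot on an appliance
    top -b -n 1 | head -n 5    # check that the 'id' value on the %Cpu(s) line stays above ~20%
    free -m                    # check that the Swap 'used' column stays close to 0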

On each appliance check that no workers are exceeding their memory allocation. You can use the command:

grep 'MiqServer#validate_worker' evm.log

If any workers are exceeding their memory limit, increase the worker memory threshold in the WebUI and let it run for a while. Re-check the workers, and also re-check overall appliance memory, as this will have increased.
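
If you have several appliances to check, a small sketch like the following can save some time (it assumes root SSH access between appliances; the IP addresses are placeholders):

    # loop over the appliances and show the most recent worker validation lines
    for appliance in 10.0.0.11 10.0.0.12 10.0.0.13; do
      echo "== ${appliance} =="
      ssh root@${appliance} "grep 'MiqServer#validate_worker' /var/www/miq/vmdb/log/evm.log | tail -n 5"
    done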

See if any messages are timing out, using the following command:

egrep "Message id: \[\d+\], timed out after" evm.log

If any messages are timing out, this needs investigating. Generic worker messages sometimes time out when running an automation workflow that takes longer than 10 minutes to complete.
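
To get a feel for how often this is happening, a small sketch that tallies the timeouts per day (it assumes the standard evm.log timestamp layout shown elsewhere in this thread):

    # count timed-out messages per day; cut extracts the YYYY-MM-DD part of the timestamp
    egrep "Message id: \[[0-9]+\], timed out after" evm.log | cut -c12-21 | sort | uniq -c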

Now check the number of “ready” messages in the queue using the following command:

grep 'count for state=\["ready"\]' evm.log

This will show lines like:

... INFO -- : Q-task_id([log_status]) MIQ(MiqServer.log_system_status) [EVM Server (37435)] MiqQueue count for state=["ready"] by zone and role: {"test"=>{nil=>6, "ems_inventory"=>1, "smartstate"=>1, "ems_metrics_collector"=>474982, "automate"=>4, "ems_metrics_processor"=>2150}, "DB"=>{"smartstate"=>1, "ems_metrics_processor"=>3}, "default"=>{"ems_operations"=>1, "smartstate"=>1, "ems_metrics_collector"=>67, "ems_metrics_processor"=>2143, nil=>12}}

If any zones consistently have more than a small number (1-200ish) of messages, or if the message count is steadily increasing, this needs investigating. It’s possible that some workers need scaling out to handle the message load.
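
Since this line is written periodically by MiqServer.log_system_status, tailing the matches gives a quick trend (a sketch; the second command follows the log live):

    # most recent queue-depth snapshots, then keep watching for new ones
    grep 'count for state=\["ready"\]' evm.log | tail -n 5
    tail -f evm.log | grep 'count for state=\["ready"\]'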

It’ll be useful to compare an automate message from when things are running normally with one from when you see the stalling. You may notice some differences (such as zone). Use the following command on any appliance with the ‘Automation Engine’ role enabled to list automate messages:

grep 'MiqQueue.put' evm.log | grep 'Role: \[automate\]' | grep -v 'EventStream::event_stream'

Run this again when you see the hanging. Note the message ID (something like Message id: [1000000320660]), then grep for the message ID to see if there’s a corresponding MiqGenericWorker::Runner#get_message_via_drb message.
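
As a concrete sketch, using the example message ID quoted above (substitute the ID you noted from the MiqQueue.put line):

    # follow one automate message from enqueue (MiqQueue.put) to dequeue (get_message_via_drb)
    grep '1000000320660' evm.log | grep -E 'MiqQueue\.put|get_message_via_drb'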

To debug requests before they are at the task stage, you can insert your own debugging calls into the /System/Policy instances, such as /System/Policy/MiqProvisionRequest_starting (which is the entry point into the quota-checking workflow). Try calling something like InspectMe or object_walker from here and you should get a dump of the request object.
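
Note that InspectMe and object_walker write their dump via $evm.log, so the output ends up in automation.log rather than evm.log. A sketch for finding it (the log path is the appliance default; adjust if yours differs):

    # look for the most recent InspectMe / object_walker output
    grep -iE 'inspectme|object_walker' /var/www/miq/vmdb/log/automation.log | tail -n 50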

Hope this helps,
pemcg


#3

Hey Peter,

Thanks for these really helpful lines; we found answers to one or two of our questions with them.
Currently, as you mentioned above, we are in a situation where the number of “ems_inventory” operations is above 800 and still increasing… We gave the vim_broker_worker more memory and increased the poll interval…
Our current settings:

    :vim_broker_worker:
      :memory_threshold: 4.gigabytes
      :nice_delta: 3
      :poll: 5.seconds
      :reconnect_retry_interval: 2.minutes
      :vim_broker_status_interval: 5.minutes
      :vim_broker_update_interval: 0.seconds
      :vim_broker_max_wait: 60
      :vim_broker_max_objects: 250

Is there something else we can try, or do we have to scale out the vim_broker_worker? (And how?)
Currently all of our VMware providers seem to have this problem… It would be great if we could reach a stable state here :slight_smile:

Thanks for your time

greetings


#4

Hmm, are you sure that it’s ems_inventory? There should only be at most one of these queued per provider.

The inventory collection is performed by a RefreshWorker, not the VIM Broker, so increasing the broker memory won’t really help you.

Can you post the last line printed from this command:

grep 'count for state=\["ready"\]' evm.log

There’s more information about scaling workers in this reference architecture: https://access.redhat.com/documentation/en-us/reference_architectures/2017/html/deploying_cloudforms_at_scale/

Cheers,
pemcg


#5
[----] I, [2017-10-02T13:25:12.789459 #1632:11af990]  INFO -- : Q-task_id([log_status]) MIQ(MiqServer.log_system_status) [EVM Server (1493)] MiqQueue count for state=["ready"] by zone and role: {"default"=>{"ems_operations"=>4, "ems_inventory"=>767}}
[----] I, [2017-10-02T13:29:55.529111 #1628:11af990]  INFO -- : Q-task_id([log_status]) MIQ(MiqServer.log_system_status) [EVM Server (1493)] MiqQueue count for state=["ready"] by zone and role: {"default"=>{"ems_operations"=>4, "ems_inventory"=>772, "automate"=>2}}
[----] I, [2017-10-02T13:35:00.034372 #1620:11af990]  INFO -- : Q-task_id([log_status]) MIQ(MiqServer.log_system_status) [EVM Server (1493)] MiqQueue count for state=["ready"] by zone and role: {"default"=>{"ems_operations"=>4, "ems_inventory"=>791, "automate"=>1}}
[----] I, [2017-10-02T13:39:58.179251 #1628:11af990]  INFO -- : Q-task_id([log_status]) MIQ(MiqServer.log_system_status) [EVM Server (1493)] MiqQueue count for state=["ready"] by zone and role: {"default"=>{"ems_operations"=>4, nil=>2, "ems_inventory"=>791, "automate"=>2}}

These are the last lines before we had to restart the appliance in order not to miss our deadline :wink:


#6

Interesting, how many providers do you have?

Can you run the following command and post the last couple of records:

vmdb
bin/rails db
vmdb_production=# \x auto
vmdb_production=# select id,target_id,state,created_on,updated_on,queue_name,args,zone from miq_queue where role = 'ems_inventory';
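
If you just want a quick summary rather than the full rows, a sketch using Rails from an appliance console (it counts the queued inventory refreshes per state):

    vmdb
    bin/rails runner "puts MiqQueue.where(:role => 'ems_inventory').group(:state).count"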

#7

Hi,

I’m a colleague of schmandforke and will try to continue this thread.

In that setup we have one provider (vCenter).

I removed the ‘args’ column because its content is very large:

vmdb_production=# select id,target_id,state,created_on,updated_on,queue_name,zone from miq_queue where role = 'ems_inventory';
...
 9000030274236 |           | ready   | 2017-10-09 11:09:39.012869 | 2017-10-09 11:09:39.012869 | ems_9000000000001 | default
 9000030274262 |           | ready   | 2017-10-09 11:10:12.547353 | 2017-10-09 11:10:12.547353 | ems_9000000000001 | default
 9000030274263 |           | ready   | 2017-10-09 11:10:12.727252 | 2017-10-09 11:10:12.727252 | ems_9000000000001 | default
 9000030274264 |           | ready   | 2017-10-09 11:10:12.970728 | 2017-10-09 11:10:12.970728 | ems_9000000000001 | default
 9000030274265 |           | ready   | 2017-10-09 11:10:13.144536 | 2017-10-09 11:10:13.144536 | ems_9000000000001 | default
 9000030274272 |           | ready   | 2017-10-09 11:10:43.366614 | 2017-10-09 11:10:43.366614 | ems_9000000000001 | default
 9000030274273 |           | ready   | 2017-10-09 11:10:44.723798 | 2017-10-09 11:10:44.723798 | ems_9000000000001 | default
 9000030274274 |           | ready   | 2017-10-09 11:10:44.956433 | 2017-10-09 11:10:44.956433 | ems_9000000000001 | default
 9000030274275 |           | ready   | 2017-10-09 11:10:45.190014 | 2017-10-09 11:10:45.190014 | ems_9000000000001 | default
 9000030274276 |           | ready   | 2017-10-09 11:10:45.422277 | 2017-10-09 11:10:45.422277 | ems_9000000000001 | default
 9000030274277 |           | ready   | 2017-10-09 11:10:45.688628 | 2017-10-09 11:10:45.688628 | ems_9000000000001 | default
 9000030274285 |           | ready   | 2017-10-09 11:11:17.540173 | 2017-10-09 11:11:17.540173 | ems_9000000000001 | default
 9000030274286 |           | ready   | 2017-10-09 11:11:21.890915 | 2017-10-09 11:11:21.890915 | ems_9000000000001 | default
(55 rows)

Best regards,
larsux


#8

We have the same trouble with another setup, which consists of one vCenter provider and one OpenStack provider.

When we activate the vCenter provider, the queue fills up very quickly and provisioning, even on the OpenStack provider, is no longer possible, because CloudForms doesn’t recognize when a VM/instance has been created:

vmdb_production=# select count(*) from miq_queue where role = 'event';
count  
--------
523119
(1 row)

$ grep 'count for state=\["ready"\]' evm.log | head -n1
[----] I, [2017-10-01T12:01:27.925946 #2660:5b197c]  INFO -- : Q-task_id([log_status]) MIQ(MiqServer.log_system_status) [EVM Server (2496)] MiqQueue count for state=["ready"] by zone and role: {"default"=>{"event"=>523119, "smartproxy"=>4, "ems_operations"=>1, nil=>2, "ems_inventory"=>1}, nil=>{"automate"=>2}}

#9

Hey lars(ux),

you can split the two providers into different zones to avoid one being affected by the other :wink:
But this seems to be a separate problem with a massive amount of events… It would be great if you could create a separate topic for it.


#10

It looks like you’re seeing an event storm from the provider. See if you can determine which type of event it is; try grepping for ‘caught event’ in evm.log. If you can find this out then you might be able to blacklist it so that it’s not processed.
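
A sketch for tallying the caught events by type (this assumes the event type appears in square brackets on the ‘caught event’ lines; adjust the pattern if your log format differs):

    # count caught provider events by type, most frequent first
    grep -i 'caught event' evm.log | grep -oE '\[[A-Za-z]+Event\]' | sort | uniq -c | sort -rn | head -n 20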

You mention that you’re running Darga-2? If there’s any chance of upgrading to Fine, that release introduced flood monitoring for the VMware provider. Flood monitoring stops provider events from being queued when too many duplicates are received in a short time. By default an event is considered to be flooding if it is received 30 times in one minute.

There’s more information on event storms here: https://access.redhat.com/documentation/en-us/reference_architectures/2017/html/deploying_cloudforms_at_scale/event_handling#event_storms

pemcg


#11

I analysed the events, but they don’t look to me like they belong to an ‘event storm’:

  72298 TaskEvent
  50924 NonVIWorkloadDetectedOnDatastoreEvent
  22836 ExtendedEvent
  21297 DvsPortLinkUpEvent
  20890 DvsPortDisconnectedEvent
  20779 DvsPortConnectedEvent
  20729 DvsPortExitedPassthruEvent
  20704 DvsPortUnblockedEvent
  15100 VmReconfiguredEvent
  11799 VmRemovedEvent
  11761 VmStartingEvent
  11724 VmPoweredOffEvent
  11642 VmPoweredOnEvent
  ...

Perhaps we could blacklist the ‘NonVIWorkloadDetectedOnDatastoreEvent’ event, but all the other events seem valuable.
Because the number of events has been stuck at 523119 for several days, it seems that the event worker isn’t working any more.

I’ve removed the events from the DB. Now it looks okay, for now. We need to monitor the queue for anomalies so that we can detect these issues earlier.
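
For the monitoring part, a minimal sketch of a check that could be run from cron on an appliance (the threshold, log file and use of Rails runner are our own placeholders, not an official mechanism):

    #!/bin/bash
    # warn when the number of 'ready' event messages in miq_queue exceeds a threshold
    THRESHOLD=10000
    cd /var/www/miq/vmdb
    COUNT=$(bin/rails runner "puts MiqQueue.where(:role => 'event', :state => 'ready').count")
    if [ "$COUNT" -gt "$THRESHOLD" ]; then
      echo "$(date): miq_queue event backlog is ${COUNT}" >> /var/log/miq_queue_monitor.log
    fi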

Thanks for the ‘Fine’ hint and the documentation reference. I’ll see whether we can update our setup.