Event Catcher/Handler, Event Storm Discussion


#1

I wanted to share some application performance data I have on the current event catcher / event handler for comparison to the future event switchboard architecture.

In this graph we can see the classic event storm, where the rate of events that a provider is pushing to ManageIQ is greater than this deployment can handle for a portion of the time:

The number of event messages processed off the queue matches the number put on the queue for the first 2 hours, then quickly levels off at a steady rate of around 6,400 event messages handled per hour. While it is levelled off, ManageIQ is putting event messages on the queue at a rate of around 11,400 event messages per hour. Once the event storm ends at hour 11 from the left side of the graph, ManageIQ continues to process the events until the queue is drained of all event messages.

On the left side of the graph the queue wait time is indicated in seconds. You can see how the queuing time for handling event-based messages grows when the appliance cannot handle the rate of incoming messages.
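As a rough back-of-the-envelope check on those numbers, here is a small sketch. It assumes both figures are per-hour rates, that they stay constant from hour 2 to hour 11, and that nothing new arrives while the queue drains. The function names are made up for illustration, and the results are estimates, not values read from the graph.

```python
def backlog_after_storm(arrival_per_hour, handled_per_hour, storm_hours):
    """Messages left on the queue after a storm that outpaces the handler."""
    surplus_per_hour = max(0, arrival_per_hour - handled_per_hour)
    return surplus_per_hour * storm_hours


def drain_hours(backlog, handled_per_hour):
    """Hours needed to empty the backlog if no new messages arrive."""
    return backlog / handled_per_hour


# Assumption: the handler is saturated from hour 2 until the storm ends at hour 11.
backlog = backlog_after_storm(11_400, 6_400, storm_hours=11 - 2)
print(backlog)                                # 45000
print(round(drain_hours(backlog, 6_400), 2))  # 7.03
```

Under those assumptions, the last message queued during the storm would wait roughly 7 hours (about 25,000 seconds) before being handled.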

A few hours later another event storm begins; however, this time the rate at which event messages arrive matches the rate at which ManageIQ can handle them, so the event messages do not queue excessively.

I am particularly interested in if/how the event switchboard intends to handle event storms to avoid queueing as shown above. Does the event switchboard intend to provide a greater level of concurrency to handle a large number of incoming events? In this case the EventHandler is pegged at 100% user CPU usage during the event storm, so perhaps we could improve the concurrency level and add another worker. This should allow this deployment to handle the storm without a growing queue time, at the expense of memory for the additional worker. Thoughts?
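To put a number on the "add another worker" idea, here is a quick sketch. It assumes that each extra worker handles events at the same rate as the existing one (about 6,400 per hour), which ordering constraints or database contention could easily prevent. `workers_needed` is a hypothetical helper, not anything in ManageIQ.

```python
import math


def workers_needed(arrival_per_hour, handled_per_worker_per_hour):
    """Smallest worker count whose combined rate keeps up with arrivals."""
    return math.ceil(arrival_per_hour / handled_per_worker_per_hour)


workers = workers_needed(11_400, 6_400)
print(workers)                   # 2
print(workers * 6_400 - 11_400)  # 1400 messages/hour of headroom
```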


#2

cc @lfu @gmccullough

@akrzos One problem with events is that, depending on the event, we can’t handle them out of order, since there are ordering dependencies. However, we may be able to classify those events somehow so that we can process them in parallel when possible and serially otherwise. Right now, though, I believe there is only one event handler, and it’s not a parallel process.
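One common way to get "parallel when possible, serial otherwise" is to partition events by an ordering key, so that events sharing a key stay in order on one lane while different keys run side by side. The sketch below is only an illustration in Python, not ManageIQ code. The `target` field, the function names, and the use of threads are all assumptions.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def lane_for(key, num_lanes):
    """Map an ordering key to a lane; the same key always lands on the same lane."""
    return zlib.crc32(str(key).encode("utf-8")) % num_lanes


def partition(events, num_lanes, key_fn):
    """Split events into lanes, preserving arrival order inside each lane."""
    lanes = [[] for _ in range(num_lanes)]
    for event in events:
        lanes[lane_for(key_fn(event), num_lanes)].append(event)
    return lanes


def handle_in_lanes(events, handler, num_lanes=4, key_fn=lambda e: e["target"]):
    """Run lanes in parallel; each lane handles its own events serially."""
    def drain(lane):
        return [handler(event) for event in lane]

    lanes = partition(events, num_lanes, key_fn)
    with ThreadPoolExecutor(max_workers=num_lanes) as pool:
        return list(pool.map(drain, lanes))
```

This only helps for events whose ordering dependencies are confined to a single key. Events that depend on each other across keys would still need a serial path.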

Another thought we should consider in the future is perhaps not using the MiqQueue in the database for the event queue. It might make sense to use a dedicated message queuing system. cc @blomquisg


#3

@akrzos I am working on an enhancement to block event storming caused by duplicated events. Duplicated events will not be placed on the queue. Do you have a procedure to generate an event repeatedly to simulate a storming scenario?
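For illustration only, here is a minimal sketch of one way duplicate suppression could work. It assumes a duplicate means "same event type and same target within a time window". The class, the 60 second window, and the sample values are hypothetical and do not describe the actual enhancement.

```python
class DuplicateFilter:
    """Lets one event per (type, target) pair through per time window."""

    def __init__(self, window_seconds=60.0):
        self.window_seconds = window_seconds
        # (event_type, target) -> timestamp of the last event let through
        self._last_enqueued = {}

    def should_enqueue(self, event_type, target, timestamp):
        key = (event_type, target)
        last = self._last_enqueued.get(key)
        if last is not None and timestamp - last < self.window_seconds:
            return False  # duplicate inside the window: keep it off the queue
        self._last_enqueued[key] = timestamp
        return True


dedup = DuplicateFilter(window_seconds=60)
print(dedup.should_enqueue("VmPoweredOffEvent", "vm-1", 0))   # True
print(dedup.should_enqueue("VmPoweredOffEvent", "vm-1", 10))  # False
print(dedup.should_enqueue("VmPoweredOffEvent", "vm-2", 10))  # True
print(dedup.should_enqueue("VmPoweredOffEvent", "vm-1", 61))  # True
```

The window is measured from the last event that was let through, so a sustained storm still produces one queued event per window rather than none. A real implementation would also need to expire old keys so the dictionary does not grow without bound.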


#4

Wouldn’t it be a great use case for a Complex Event Processing (CEP) solution? I don’t know if one exists in Ruby, but we may not require it to be written in Ruby as it can be an external component. As @Fryguy says, not using MiqQueue is being considered for the longer term.


#5

Wouldn’t it be a great use case for a Complex Event Processing (CEP) solution?

Actually, now that I think about it, the work that @lfu is doing to make events go through automate would enable ManageIQ to interface with external CEP systems.


#6

@bill I can certainly help you with generating an event storm.

The method I used to produce the data above was to connect ManageIQ to a VMware provider set up to use the VCSIM (VMware vCenter Simulator). I then wrote a quick script using the VMware API to repeatedly reset/poweron/poweroff/create/delete VMs, which caused the provider to create the events that ManageIQ then consumed.

William Lam wrote a good article on getting started with the VCSIM here:

Here’s my script for creating events on the vcsim:

Let me know if you have any questions.


#7

@akrzos It took me a while to set up a new VCSA. I followed the instructions in the post you provided to configure the VCSIM and restarted the vpxd service, but I could not see the simulated resources such as VMs and datastores.

Nevertheless I would like to try your script. Do you have instructions for running it? I should run it in an SSH console window on the VCSA appliance, right? Are there any dependent Python packages to install?

Thanks for the help!


#8

@bill No worries. It sounds like you might have missed a step to get the VCSIM running, if you ended up manually restarting vpxd.

The steps I follow to get the VCSIM up and running are:

  1. Deploy VCSA template to VM
  2. Access VCSA console at https://(VCSA-IP-Address):5480/
  3. There should be a configuration prompt for a brand new VCSA appliance. You must agree to the VMware license and configure credentials, NTP, etc.
  4. Let the VCSA appliance start up completely, so that you can log in at https://(VCSA-IP-Address):9443/vsphere-client/ (the appliance will have a blank inventory at this step)
  5. Now log in via SSH to the VCSA appliance and configure the vcsim files as the article mentions. Here is a sample configuration I use for a “medium” sized environment:

Simulator:


Inventory:

  6. Once you have the XML files all configured, you must start the vcsim with:

vmware-vcsim-start /etc/vmware-vpx/vcsim/model/vcsim-perf.cfg

After that you should be able to view the inventory, even as the worker is building it in the VCSA appliance.

As for running the script I wrote, I suggest you open a screen session on a server so the script can run long term without stopping if you close your connection to the server. You can simply start it with:

./vcsim-events.py -pwd <VCSIM-Password> <VCSIM-IP-Address> 

Depending on your environment, the default sleep time of 6s and churn of 1 VM per period should force your provider to emit plenty of events. Also, in case this is not obvious, DO NOT point that script at a real provider, since it will start resetting and powering on/off VMs on that provider. To build that script I did have to follow the VMware guide on getting started with pyvmomi:

http://vmware.github.io/pyvmomi-community-samples/#getting-started
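To give a feel for the approach, here is a minimal sketch of a churn loop built on pyvmomi. It is not the original vcsim-events.py: the function names are made up, it only covers power operations (no create/delete), and it assumes the simulator accepts the same calls as a real vCenter. The 6 second sleep and churn of 1 mirror the defaults mentioned above.

```python
import random
import ssl
import time

# Task-returning methods on pyvmomi's vim.VirtualMachine managed object.
OPERATIONS = ("PowerOnVM_Task", "PowerOffVM_Task", "ResetVM_Task")


def list_vms(host, user, pwd, port=443):
    """Connect to the simulator and return every VirtualMachine in its inventory."""
    # Imported lazily so churn_once() can be tried out without pyvmomi installed.
    from pyVim.connect import SmartConnect
    from pyVmomi import vim

    # Assumption: the simulator has a self-signed certificate, so skip verification.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE

    si = SmartConnect(host=host, user=user, pwd=pwd, port=port, sslContext=context)
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    return list(view.view)


def churn_once(vms, churn=1, rng=random):
    """Fire one random power operation at `churn` randomly chosen VMs."""
    issued = []
    for vm in rng.sample(vms, min(churn, len(vms))):
        operation = rng.choice(OPERATIONS)
        try:
            getattr(vm, operation)()  # returns a Task; not awaited, only the events matter
        except Exception:
            # A rejected operation (e.g. powering on a running VM) should not stop the loop.
            pass
        issued.append(operation)
    return issued


def run(host, user, pwd, sleep_seconds=6, churn=1):
    """Churn VMs forever. Only ever point this at a simulator."""
    vms = list_vms(host, user, pwd)
    while True:
        churn_once(vms, churn)
        time.sleep(sleep_seconds)
```

Raising `churn` or lowering `sleep_seconds` should increase the event rate the provider emits, which is the knob to turn when trying to reproduce a storm.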

Hope that gets you closer to a working test environment for your code.