I wanted to share some application performance data I gathered on the current event catcher / event handler, for comparison with the future event switchboard architecture.
This graph shows a classic event storm: for a portion of time, the rate at which a provider pushes events to ManageIQ exceeds what this deployment can handle:
The number of event messages processed off the queue matches the number put on the queue for the first 2 hours, then quickly levels off at a steady rate of around 6,400 event messages handled per hour. Meanwhile, ManageIQ is putting event messages on the queue at a rate of around 11,400 per hour. Once the event storm ends at hour 11 on the graph, ManageIQ continues to process events until the queue is drained of all event messages.
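The backlog implied by those two rates can be sketched with some back-of-the-envelope arithmetic. The rates and the storm duration below are approximations read off the graph (roughly hours 2 through 11), not measured constants:

```ruby
# Rough model of queue growth during the storm, assuming constant rates.
incoming_per_hour = 11_400 # event messages enqueued per hour during the storm
handled_per_hour  = 6_400  # event messages the EventHandler processes per hour
storm_hours       = 9      # queue grows from roughly hour 2 to hour 11

# Backlog accumulates at the difference between the two rates.
backlog = (incoming_per_hour - handled_per_hour) * storm_hours

# After the storm ends, the full handling rate goes toward draining the queue.
drain_hours = backlog.to_f / handled_per_hour

puts "backlog at end of storm: #{backlog} messages"          # 45000
puts "hours to drain after storm: #{drain_hours.round(1)}"   # 7.0
```

Under these assumptions the queue peaks around 45,000 messages and takes about 7 hours to drain once the storm ends, which is consistent with the long tail visible in the graph.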
On the left side of the graph the queue wait time is indicated in seconds. You can see how the queuing time for event-based messages grows when the appliance cannot handle the rate of incoming messages.
A few hours later another event storm begins; however, this time the incoming rate stays within what ManageIQ can handle, so the event messages do not queue excessively.
I am particularly interested in whether, and how, the event switchboard intends to handle event storms so as to avoid the queueing shown above. Does the event switchboard intend to provide a greater level of concurrency for handling a large number of incoming events? In this case the EventHandler is pegged at 100% user CPU usage during the event storm; perhaps we could improve the concurrency level and add another worker. That should allow this deployment to handle the storm without growing queue times, at the expense of the additional worker's memory. Thoughts?
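The concurrency idea above can be sketched as multiple workers draining one shared queue. This is a hypothetical illustration, not ManageIQ's actual worker API; the names (`QUEUE`, `handle`, `WORKER_COUNT`) are invented for the example:

```ruby
# Sketch: N workers draining a shared queue instead of one EventHandler.
QUEUE = Queue.new
1_000.times { |i| QUEUE << { :id => i, :type => "vm_power_on" } }

WORKER_COUNT = 4 # trade memory for throughput

def handle(event)
  # Stand-in for the real per-event work (parsing, policy resolution, etc.)
  event[:id]
end

workers = WORKER_COUNT.times.map do
  Thread.new do
    processed = 0
    loop do
      event = begin
        QUEUE.pop(true) # non-blocking pop; raises ThreadError when empty
      rescue ThreadError
        break           # queue drained, this worker is done
      end
      handle(event)
      processed += 1
    end
    processed
  end
end

counts = workers.map(&:value)
puts "per-worker counts: #{counts.inspect}, total: #{counts.sum}"
```

One caveat: under MRI the GVL keeps CPU-bound work from scaling across threads, so for a handler pegged at 100% user CPU the gain would have to come from an additional worker process rather than threads, which matches the memory trade-off mentioned above.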