Scaling problem: It takes too much time to fetch resource usage from Ceilometer


#1

Hello,

Our OpenStack production cloud now has thousands of instances, and the Ceilometer database holds a large amount of resource usage information.

ManageIQ tries to collect five meters (cpu_util, disk.read.bytes, disk.write.bytes, network.incoming.bytes and network.outgoing.bytes) from each instance by talking with Ceilometer, right?
This means that ManageIQ submits five queries per instance, i.e. “five times the number of instances” queries in total.
This is not a problem in my small test environment.
But in our production case, ManageIQ may submit thousands of queries in every collection period, and that takes too much time.

Are there any ideas for solving this problem?
I think it would be better to have an option to collect metering data per tenant instead of per instance.
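
For illustration, the query counts can be sketched as follows. The instance and tenant counts are hypothetical, and the per-tenant mode is only the proposed option, not an existing ManageIQ feature.

```python
# The five meters named above.
METERS = [
    "cpu_util",
    "disk.read.bytes",
    "disk.write.bytes",
    "network.incoming.bytes",
    "network.outgoing.bytes",
]


def queries_per_cycle(num_instances, num_tenants, per_tenant=False):
    """Return the number of Ceilometer queries in one collection cycle.

    per_tenant=False models one query per meter per instance.
    per_tenant=True models the proposed option (hypothetical): one
    query per meter per tenant.
    """
    targets = num_tenants if per_tenant else num_instances
    return len(METERS) * targets


# Hypothetical cloud size: 3000 instances spread over 50 tenants.
print(queries_per_cycle(3000, 50))                   # per instance
print(queries_per_cycle(3000, 50, per_tenant=True))  # per tenant
```

With these made-up numbers, the per-instance mode issues 15000 queries per cycle and the per-tenant mode 250.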

Thanks,
Wataru


#2

Hi Wataru,

I’m interested in the issue you describe and in any metrics you have around it. Have you determined how many requests to Ceilometer’s API it takes before you see a degradation in API response times? We have found issues with Ceilometer’s ability to collect metrics at a consistent interval when the number of instances exceeds a threshold per Ceilometer collector. This can lead to no metrics being collected in ManageIQ; that has been addressed, but not in a specific build that I have revisited yet. Ceilometer collectors scale in upstream builds by partitioning the compute node workload between multiple Ceilometer agents. Perhaps the API can scale in a similar fashion, partitioned by tenant, but that is just speculation.
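
To make the speculation concrete, here is a minimal sketch of what partitioning by tenant could look like. It is not how Ceilometer’s API or agents are implemented; the function names, the tenant IDs, and the choice of hash-based assignment are all assumptions for illustration.

```python
import hashlib


def assign_worker(tenant_id, num_workers):
    """Map a tenant ID to a worker index in [0, num_workers).

    Uses a stable hash so the same tenant always lands on the same
    worker (hypothetical scheme, not taken from Ceilometer).
    """
    digest = hashlib.sha256(tenant_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_workers


def partition(tenant_ids, num_workers):
    """Group tenant IDs into one bucket per worker."""
    buckets = {i: [] for i in range(num_workers)}
    for tenant_id in tenant_ids:
        buckets[assign_worker(tenant_id, num_workers)].append(tenant_id)
    return buckets


# Hypothetical tenant IDs split across three workers.
tenants = ["tenant-%d" % n for n in range(10)]
for worker, assigned in partition(tenants, 3).items():
    print(worker, assigned)
```

Each worker would then only serve queries for its own bucket of tenants. Note that a plain modulo scheme reshuffles most tenants whenever the number of workers changes.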

One potential solution you could try is to raise the threshold at which you are currently collecting metrics within your OpenStack cloud. You can find this in the advanced configuration: search for “capture_threshold” and change the value for vms. Admittedly, this is not a real solution, as you are simply pushing the problem out to a greater scale, but perhaps we can find a better solution before you hit that scale.

-Alex


#3

This is a known issue in Ceilometer that is being actively worked on. To solve it, the Ceilometer team is putting together a new time-series-oriented data model. Here is a link on the subject:
http://techs.enovance.com/7152/openstack-ceilometer-and-the-gnocchi-experiment

In the meantime, limiting the number of metrics collected is a workaround.

Hope this helps,
Nick