Dashboard “Top CPU and Memory Consumer” widgets stopped reporting

fine

#1

Hi,

I am currently testing ManageIQ Fine beta 1/2 with a vSphere 6.0 cluster, and I am happy with it so far.

A few days ago the “Top CPU/Memory Consumer” Dashboards suddenly stopped working. Top Storage Consumer still works.

What I checked:

  • C & U Collection is enabled for all Clusters and Datastores.
  • VM > Monitoring/Utilization show correct CPU, Memory and Storage Utilization.
  • I checked evm.log for “metric errors” and “ERROR” but couldn’t find anything.

Most reports I tested produce no data. :frowning:

Today I updated to Fine beta 2, but it did not solve the issue.


#2

Hi Sven

I am having the same issue after switching to fine-b2. I am shutting down the b2 VM and falling back to fine-b1.



#3

@Sven_Jansen, I redeployed fine-b2 and added back only one vCenter at a time. The Top CPU/Memory usage summary widgets are now working.

I am going to take VMware snapshot backups more often to avoid lengthy MIQ debugging.


#4

This issue started suddenly with Fine beta 1. I updated to beta 2 in the hope of fixing it, but nothing changed.


#5

Today, stats for CPU and Memory are available again, for whatever reason.


#6

I have the same problem: the “Top CPU Consumer” and “Top Memory Consumer” widgets have no records. The ManageIQ version is euwe-3.20170508081136_. It runs in Docker, and the provider is an OpenStack cluster.
In Configure > Server, I have enabled the Capacity & Utilization Coordinator, Capacity & Utilization Data Collector, Capacity & Utilization Data Processor, and Notifier server roles.
If anyone knows the reason, please help.


#7

Do you have the Scheduler and Reporting roles enabled anywhere in your region?
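For reference, one way to list the roles currently active on a server is a Rails runner one-liner from the appliance console. This is only a sketch: the appliance path /var/www/miq/vmdb and the `active_role_names` method are assumptions that may differ between releases, so verify them against your installation first.

```shell
# Print the roles active on this server (path and method name are
# assumptions; check your ManageIQ release before relying on this).
cd /var/www/miq/vmdb
bin/rails runner 'puts MiqServer.my_server.active_role_names'
```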


#8

This is my server role configuration:


#9

OK, looks good. How many pods, containers and nodes does your provider manage?

Try the following commands to check that C&U messages are being processed in a timely manner:

grep 'count for state=\["ready"\]' evm.log | egrep -o "\"ems_metrics_collector\"=>[[:digit:]]+"

grep 'count for state=\["ready"\]' evm.log | egrep -o "\"ems_metrics_processor\"=>[[:digit:]]+"
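To make the trend easier to read, the matches can be reduced to bare numbers. Below is a sketch run against a fabricated log excerpt -- the sample lines are illustrative stand-ins, not real evm.log output -- but the same pipeline works on a real evm.log:

```shell
# Create a small illustrative sample of the queue-count log lines
# (real evm.log lines contain more fields; only the relevant parts are kept).
cat > /tmp/evm_sample.log <<'EOF'
[----] INFO -- : count for state=["ready"] by role: {"ems_metrics_collector"=>526, "ems_metrics_processor"=>12}
[----] INFO -- : count for state=["ready"] by role: {"ems_metrics_collector"=>519, "ems_metrics_processor"=>8}
[----] INFO -- : count for state=["ready"] by role: {"ems_metrics_collector"=>523, "ems_metrics_processor"=>10}
EOF

# Extract just the collector backlog numbers, in log order.
grep 'count for state=\["ready"\]' /tmp/evm_sample.log \
  | grep -Eo '"ems_metrics_collector"=>[0-9]+' \
  | grep -Eo '[0-9]+$'
```

A roughly flat or oscillating sequence is healthy; a steadily climbing one means the collectors cannot keep up with the queue.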

pemcg


#10

Now I have:

Name     Type       EVM Zone   Instances   Images   ManageIQ Region
pcpsit   OpenStack  default    168         24       Region 0
yuhua    OpenStack  default    128         37       Region 0

I had cleaned the log for debugging, so it is not complete. Running your command now shows:

"ems_metrics_collector"=>526
"ems_metrics_collector"=>526
"ems_metrics_collector"=>526
"ems_metrics_collector"=>526
"ems_metrics_collector"=>526
"ems_metrics_collector"=>526
"ems_metrics_collector"=>526
"ems_metrics_collector"=>526
"ems_metrics_collector"=>520
"ems_metrics_collector"=>519
"ems_metrics_collector"=>523
"ems_metrics_collector"=>523


#11

I followed the log; there is no error, and the tasks seem to have completed successfully, but no data has been saved to the database.

[----] I, [2017-05-11T08:40:21.521237 #728:3fd818a17130] INFO -- : MIQ(MiqReport#build_create_results) Creating report results with hash: [{:name=>"Top Memory Consumers (weekly)", :userid=>"widget_id_6|EvmGroup-super_administrator|schedule", :report_source=>"Generated for widget", :db=>"VmPerformance", :last_run_on=>2017-05-11 08:40:21 UTC, :last_accessed_on=>2017-05-11 08:40:21 UTC, :miq_report_id=>nil, :miq_group_id=>nil}]
[----] I, [2017-05-11T08:40:21.566536 #728:3fd818a17130] INFO -- : MIQ(MiqReport#build_create_results) Finished creating report result with id [1] for report id: [], name: [Top Memory Consumers (weekly)]
[----] I, [2017-05-11T08:40:21.571806 #728:3fd818a17130] INFO -- : MIQ(MiqWidget#generate_one_content_for_group) Widget: [Top Memory Consumers (weekly)] ID: [6] for [MiqGroup] [EvmGroup-super_administrator]...Complete


#12

Aah, I see, the ‘docker’ mention confused me slightly. Your CFME appliance is running podified in OpenShift, but managing an OpenStack provider? Is this just a single appliance? Do you see all of the instances, images, projects etc that you would expect? Do you see utilization graphs for the running instances?

Those metrics_collector messages look fine - what you’re looking for is that the number of messages isn’t gradually increasing.

Check appliance memory using ‘top’; you don’t want any swapping. Also check that no worker processes are exceeding their memory threshold using:

grep 'MiqServer#validate_worker' evm.log


#13

Yes, ManageIQ runs in a Docker container and manages two OpenStack providers. The instances and other inventory seem fine.

On a running instance’s page, the utilization graphs are not OK; the message is “No capacity & Utilization data has been collected for this VM”.

But on the Cloud Intel > Dashboard page, “Top Storage Consumers” is OK.


#14

The top command result is OK; used memory is about half of total memory.

The “grep ‘MiqServer#validate_worker’ evm.log” command shows nothing.


#15

Interesting. What was the count of ems_metrics_processor messages? How long has C&U been running for?


#16

There’s a scaling document that was created for the Red Hat CloudForms product, but it should also be applicable to ManageIQ. It’s available from the Red Hat portal here: https://access.redhat.com/documentation/en-us/reference_architectures/2017/html-single/deploying_cloudforms_at_scale/

It contains a chapter about monitoring and tuning Capacity and Utilization that might be useful. There are also a couple of diagnostic scripts that are mentioned in the guide and are available here: https://github.com/RHsyseng/cfme-log-parsing. They are by no means elegant, but extract some useful timing values from evm.log using regular expressions.

You could try running the perf_process_timings.rb and perf_rollup_timings.rb scripts and see if you’re getting sensible looking timings.


#17

I am at home now and can’t connect to the ManageIQ instance at my company, so I can’t check the count of ems_metrics_processor. But C&U has been running for about 4 days.


#18

Thank you very much! I will check these; I hope it helps.


#19

I read evm.log and found this:

In the database, the metrics tables (metrics_00 to metrics_23) are empty.
Port 8777 seems to be used by Ceilometer, but I had changed the event source to AMQP. Why does it still use Ceilometer to collect metrics?
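One quick way to confirm whether any metric rows exist is to query the appliance database directly. This is a sketch: the database name vmdb_production and user are the usual appliance defaults but are assumptions here, and on PostgreSQL selecting from the parent metrics table also includes the hourly metrics_00 to metrics_23 partitions.

```shell
# Count raw (realtime) metric rows and the hourly/daily rollups.
# Database name and user are assumptions -- adjust for your setup.
psql -U root vmdb_production -c "SELECT COUNT(*) FROM metrics;"
psql -U root vmdb_production -c "SELECT COUNT(*) FROM metric_rollups;"
```

If both counts stay at zero while the C&U workers are running, the collectors are not writing anything, which matches the empty dashboards.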