OpenStack sync fails


#1

Hello,

The OpenStack cloud provider summary page shows that the last refresh had an error:

Error - 2 Days Ago
Unable to obtain a collection: ‘flavors’ in a service: ‘nova’ through API. Please, fix your OpenStack installation and run refresh again.

  1. Running “Refresh relationships and power states” doesn’t seem to change anything, including the age of the error. Why is that?
  2. The Nova API works fine using the nova CLI with the same credentials. How can I debug it further?
  3. If the sync fails on one part, for example flavors, does that mean other parts, such as instances, won’t be updated either?

Thanks,
Alex


#2

Hi @Alexander_Braverman, can you provide your evm.log?
SSH to your appliance, type: vmdb
cd to: logs
You’ll find the log there.


#3

You can try rechecking authentication.


#5

I encountered a similar problem before. My OpenStack provider was down for a day of maintenance and then came back online, but in ManageIQ it still failed. I tried rechecking authentication, and then it was OK.

But I have a question. Previously, when my OpenStack/VMware provider was down for maintenance only briefly, ManageIQ quickly detected it and rechecked once it came back online. When the maintenance lasts too long, ManageIQ does not recover normally. Does ManageIQ stop rechecking once a provider has been down for a certain time?


#6

@Alexander_Braverman this error is thrown when a non-standard response is returned from OpenStack (usually code 500).

In this case the refresh of the data is stopped and you should fix OpenStack. Continuing the refresh would mean losing all the flavor data and metadata on the ManageIQ side.

The best way to track down the error is to run the same API/CLI request from within your ManageIQ appliance. Do that in the scope of every tenant. If all of that works, write a script that repeats the API request 1000 times; I’ve seen a badly configured network cause random API timeouts. When you see a failure, go to the OpenStack logs to find the cause.

Then recheck authentication and refresh again. After a failure, the refresh worker is shut down; rechecking authentication turns the worker back on if it succeeds. @anyisalin I’m not sure how this worked before; maybe we should make it configurable.
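
The repeated-request check described above could be sketched roughly as follows. This is only an illustration, not anything ManageIQ ships: the helper names run_cli and repeat_request are made up here, and it assumes the nova CLI is installed on the appliance and that the OS_* credential variables for the tenant under test are already exported in the shell.

```python
import subprocess
from typing import Callable, List, Tuple


def run_cli(cmd: List[str], timeout_s: float = 60.0) -> None:
    """Run one CLI request; raise if it exits non-zero or times out."""
    subprocess.run(cmd, check=True, capture_output=True, timeout=timeout_s)


def repeat_request(request: Callable[[], None],
                   attempts: int = 1000) -> List[Tuple[int, str]]:
    """Call `request` `attempts` times and collect (attempt_number, error) pairs."""
    failures = []
    for attempt in range(1, attempts + 1):
        try:
            request()
        except Exception as exc:  # timeouts, non-zero exits, missing binary
            failures.append((attempt, repr(exc)))
    return failures


def check_flavor_list(attempts: int = 1000) -> None:
    """Hypothetical entry point: hammer `nova flavor-list` and report failures."""
    # Assumption: `nova` is on PATH and credentials come from the environment.
    failed = repeat_request(lambda: run_cli(["nova", "flavor-list"]), attempts)
    print(f"{len(failed)} of {attempts} requests failed")
    for attempt, error in failed[:10]:
        print(attempt, error)
```

Calling check_flavor_list() once per tenant (with that tenant’s credentials exported) gives a failure count; the attempt numbers of the failures can then be matched against timestamps in the OpenStack logs.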


#7

I don’t have permissions to upload a file, so here is a link:
https://drive.google.com/file/d/0B4l9O-rdsk1KS0lMSlRuSzlueEk/view?usp=sharing

The time of the error is 09/27/16 13:10:44 UTC


#8

Authentication is correct, and after a few days it fixed itself: the sync passes and I no longer see the error. Is this normal behavior?


#9

I don’t know which API/CLI calls it makes during a sync. evm.log seems to collect all kinds of different events, and I can’t tell what relates to what. It becomes even harder when debug mode is on. Is there an easy way to find and reproduce the call?


#10

Does that mean that after a sync failure, ManageIQ stops trying to sync until you manually tell it to recheck authentication?


#11

Highlighting some things from the log that may be of interest:

[----] I, [2016-09-27T13:07:33.496527 #35106:11a998c] INFO -- : MIQ(MiqGenericWorker::Runner#get_message_via_drb) Message id: [1000004410246], MiqWorker id: [1000000074196], Zone: [default], Role: [automate], Server: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqAeEngine.deliver], Timeout: [3600], Priority: [20], State: [dequeue], Deliver On: [], Data: [], Args: [{:object_type=>"MiqServer", :object_id=>1000000000002, :attrs=>{:event_type=>"evm_worker_memory_exceeded", :event_details=>"Worker [MiqPriorityWorker] with ID: [1000000086865], PID: [7312], GUID: [d5c0c03a-84d4-11e6-ac5d-001a4a0aa1a4] process memory usage [441422000] exceeded limit [419430400], requesting worker to exit", :type=>"MiqPriorityWorker", "MiqEvent::miq_event"=>1000002130822, :miq_event_id=>1000002130822, "EventStream::event_stream"=>1000002130822, :event_stream_id=>1000002130822}, :instance_name=>"Event", :user_id=>1000000000001, :miq_group_id=>1000000000002, :tenant_id=>1000000000001, :automate_message=>nil}], Dequeued in: [2.640378959] seconds

[----] I, [2016-09-27T04:01:46.282558 #21144:11a998c] INFO -- : MIQ(MiqEventHandler::Runner#message_sync_active_roles) MIQ(MiqEventHandler::Runner) Synchronizing active roles complete…
/opt/rh/cfme-gemset/gems/fog-core-1.42.0/lib/fog/core/attributes/default.rb:52: warning: redefining `object_id' may cause serious problems

@Ladas or @jrafanie do the messages I pulled mean anything to you? Is there another person we can ask who may be able to assist?


#12

@jprause could you make sure we have a BZ for "redefining `object_id' may cause serious problems"? Somebody will need to take a look at the fog-core gem.


#13

Sure @Ladas I’ll open the issue right now.


#14

Right, the message says: Unable to obtain a collection: ‘flavors’ in a service: ‘nova’ through API.

In the CLI this is nova flavor-list; try that from within the appliance.

I cannot find it in the log; it’s too big. You might want to disable the event monitor roles, since event monitoring doesn’t seem to be working (Ceilometer is either broken or not configured), and that is polluting the log.

I think the refresh worker shuts itself down if it fails validation. I’m not sure whether it also shuts down after a failed refresh.
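
For a log that is too big to read by eye, one option is to keep only the entries stamped close to the reported failure time (09/27/16 13:10:44 UTC). The sketch below assumes every entry starts with a timestamp in the same form as the lines quoted earlier in this thread, and that the log timestamps are in UTC; lines_near and scan are hypothetical helper names, and the log path in the usage note is a guess at your layout.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, Iterator

# Matches the stamp in entries such as "[2016-09-27T13:07:33.496527 #35106:11a998c]"
TIMESTAMP = re.compile(r"\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.\d+ #")


def lines_near(lines: Iterable[str], when: datetime,
               window_s: int = 120, needle: str = "") -> Iterator[str]:
    """Yield entries stamped within window_s seconds of `when` that contain `needle`."""
    window = timedelta(seconds=window_s)
    for line in lines:
        match = TIMESTAMP.search(line)
        if match is None:
            continue  # e.g. gem warnings, which carry no timestamp
        stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
        if abs(stamp - when) <= window and needle in line:
            yield line.rstrip("\n")


def scan(path: str, when: datetime, needle: str = "") -> None:
    """Print the matching entries from the log file at `path`."""
    with open(path, errors="replace") as log:
        for line in lines_near(log, when, needle=needle):
            print(line)
```

For example, scan("log/evm.log", datetime(2016, 9, 27, 13, 10, 44), needle="flavors") would print only entries within two minutes of the failure that mention flavors, assuming the error text is logged there at all. Leaving needle empty shows everything in the window.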


#15

Can you send me the BZ number?


#16

Will it try to validate automatically later?


#17

I think not. You could probably file a BZ RFE for this, so that, based on the settings, it could try to revalidate periodically.


#18

@Alexander_Braverman here’s the BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1382060


#19

Hello,

Is there any further news on this issue? It has not magically resolved itself in my environment, so I do need a fix for it.

The BZ issue is also nearly closed, with a pull request for the fog-openstack gem.

Thanks for the update


#20

Hi @schmandforke, yes, this issue is now resolved in the Euwe-GA release.
You can download the release here: http://manageiq.org/download/