DRb error: connection timeout


#1

Hello,

I run the ManageIQ appliance, and after rebooting the EVM server I got the following errors:

[----] E, [2017-01-12T06:44:15.394235 #32307:73797c] ERROR -- : EMS [] as [AKIAJAK****] ID [150807] PID [32307] GUID [6e0e011e-d8bc-11e6-94b7-06dc150d810d] Error heartbeating to MiqServer because DRb::DRbConnError: Connection reset by peer Worker exiting.
[----] I, [2017-01-12T06:44:15.479441 #32244:73797c]  INFO -- : MIQ(ManageIQ::Providers::Amazon::CloudManager::RefreshWorker#log_status) [Refresh Worker for Cloud/Infrastructure Providers: AWS Singapore] Worker ID [150800], PID [32244], GUID [6dd2de40-d8bc-11e6-94b7-06dc150d810d], Last Heartbeat [2017-01-12 11:44:12 UTC], Process Info: Memory Usage [311087104], Memory Size [650801152], Proportional Set Size: [213718000], Memory % [2.03], CPU Time [137.0], CPU % [0.06], Priority [27]
[----] E, [2017-01-12T06:44:15.479840 #32244:73797c] ERROR -- : EMS [] as [AKIAJAK****] ID [150800] PID [32244] GUID [6dd2de40-d8bc-11e6-94b7-06dc150d810d] Error heartbeating to MiqServer because DRb::DRbConnError: Connection reset by peer Worker exiting.
[----] I, [2017-01-12T06:44:15.510840 #32253:73797c]  INFO -- : MIQ(ManageIQ::Providers::Amazon::CloudManager::RefreshWorker#log_status) [Refresh Worker for Cloud/Infrastructure Providers: AWS Sao Paulo] Worker ID [150801], PID [32253], GUID [6dd7f54c-d8bc-11e6-94b7-06dc150d810d], Last Heartbeat [2017-01-12 11:44:12 UTC], Process Info: Memory Usage [311148544], Memory Size [651853824], Proportional Set Size: [213737000], Memory % [2.03], CPU Time [136.0], CPU % [0.06], Priority [27]
[----] E, [2017-01-12T06:44:15.511207 #32253:73797c] ERROR -- : EMS [] as [AKIAJAK****] ID [150801] PID [32253] GUID [6dd7f54c-d8bc-11e6-94b7-06dc150d810d] Error heartbeating to MiqServer because DRb::DRbConnError: Connection reset by peer Worker exiting.

As a result, the appliance never finishes starting (no web UI or anything else).

The only thing I changed was the memory settings for some of the appliance's workers, to try to resolve memory consumption issues (that is another question for later).

Can someone point me in the right direction here?


#2

@fvillain was this the issue with the schema being out of sync? Did bin/rake db:migrate fix this?

If not, what does grep -E 'WARN|ERROR' log/evm.log show after migrating the database?

From our discussion on Gitter:
“We don’t normally shut down the drb server in the main server process, so it’s possible the server process is gone, the drb server shut down or failed abnormally, or the worker has the wrong URI to connect to the drb server,
or the firewall got messed up and is refusing connections”

Therefore, we need to look at server startup for anything that might be failing to start the drb server properly.
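
To tell these cases apart from the worker's side, a small probe can help. This is only a sketch using Ruby's standard DRb library, not ManageIQ code: drb_reachable? and unused_local_port are hypothetical helper names, and the URI you pass in would be whatever the worker was given. A missing or dead DRb server surfaces as DRb::DRbConnError, the same exception class seen in the heartbeat errors above.

```ruby
require "drb/drb"
require "socket"

# Hypothetical helper (not part of ManageIQ): returns true if a DRb server
# answers at +uri+, false if the connection attempt fails.
def drb_reachable?(uri)
  # respond_to? is forwarded to the remote front object, so this call
  # forces DRb to actually talk to the server behind the URI.
  DRb::DRbObject.new_with_uri(uri).respond_to?(:to_s)
  true
rescue DRb::DRbConnError => e
  warn "DRb server at #{uri} is unreachable: #{e.message}"
  false
end

# Hypothetical helper for demonstration: returns a local port that was free
# a moment ago, so nothing should be listening on it.
def unused_local_port
  server = TCPServer.new("127.0.0.1", 0)
  port = server.addr[1]
  server.close
  port
end
```

Calling drb_reachable?("druby://127.0.0.1:#{unused_local_port}") should return false, which is what a worker holding a stale or wrong URI would run into.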


#3

db:migrate and the firewall are both fine.

It seems like the worker got the wrong URI to connect to, but how can it be changed?
It looks like the DRb port is chosen randomly during MIQ startup. Am I wrong?
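
For reference, this is how Ruby's standard DRb library behaves when asked to listen on port 0: the operating system picks a free port and DRb reports the real URI afterwards. It is a minimal sketch, and it is an assumption that MIQ starts its DRb server this way; start_drb_on_ephemeral_port is a hypothetical name.

```ruby
require "drb/drb"

# Hypothetical helper: starts a DRb service on an OS-assigned port and
# returns the URI and port that were actually bound.
def start_drb_on_ephemeral_port(front = Object.new)
  # Port 0 asks the operating system for any free port.
  DRb.start_service("druby://127.0.0.1:0", front)
  bound_uri = DRb.uri                      # now contains the real port
  bound_port = bound_uri[/:(\d+)\z/, 1].to_i
  [bound_uri, bound_port]
end
```

If the server is started like this, the port changes on every restart, so a worker can only connect with the URI handed to it by the current server process.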


#4

Here are the full startup logs.
I also added a dump of the iptables rules, for reference, in log/iptables.rules.dump.

startup-logs.tgz (81.3 KB)


#5

Any clues on this? The problem occurs here too, and we can’t get MIQ to work again.


#6

We had to remove the Cinder and Swift providers, and MIQ is now up.


#7

@fvillain the server process was failing when trying to sync_workers for one of the worker classes, possibly for the Cinder/Swift providers. For some reason, calling authentications on the provider returns nil instead of an empty collection, which is what a Rails relation should return. It looks like a bug. Can you open an issue for it here: https://github.com/ManageIQ/manageiq/issues? Did you provide authentications for the provider?

/var/www/miq/vmdb/app/models/mixins/authentication_mixin.rb:26:in `authentication_userid_passwords': private method `select' called for nil:NilClass (NoMethodError)
	from /var/www/miq/vmdb/app/models/mixins/authentication_mixin.rb:356:in `available_authentications'
	from /var/www/miq/vmdb/app/models/mixins/authentication_mixin.rb:189:in `authentication_type'
	from /var/www/miq/vmdb/app/models/mixins/authentication_mixin.rb:344:in `authentication_best_fit'
	from /var/www/miq/vmdb/app/models/mixins/authentication_mixin.rb:99:in `authentication_status_ok?'
	from /var/www/miq/vmdb/app/models/mixins/per_ems_worker_mixin.rb:21:in `select'
	from /var/www/miq/vmdb/app/models/mixins/per_ems_worker_mixin.rb:21:in `all_valid_ems_in_zone'
	from /var/www/miq/vmdb/app/models/mixins/per_ems_worker_mixin.rb:26:in `desired_queue_names'
	from /var/www/miq/vmdb/app/models/mixins/per_ems_worker_mixin.rb:32:in `sync_workers'
	from /var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:52:in `block in sync_workers'
	from /var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:50:in `each'
	from /var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:50:in `sync_workers'
	from /var/www/miq/vmdb/app/models/miq_server.rb:158:in `start'
	from /var/www/miq/vmdb/app/models/miq_server.rb:249:in `start'
	from /var/www/miq/vmdb/lib/workers/evm_server.rb:65:in `start'
	from /var/www/miq/vmdb/lib/workers/evm_server.rb:92:in `start'
	from /var/www/miq/vmdb/lib/workers/bin/evm_server.rb:4:in `<main>'
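
The failure at the top of this trace can be reproduced in plain Ruby. The sketch below is not the actual authentication_mixin.rb code: FakeProvider and its methods are hypothetical stand-ins, and the Array() guard is just one possible way to tolerate a nil, not necessarily the fix that belongs in ManageIQ.

```ruby
# Hypothetical stand-in for a provider whose authentications may be nil.
class FakeProvider
  attr_reader :authentications

  def initialize(authentications)
    @authentications = authentications
  end

  # Mirrors the failing pattern: calling select on nil raises NoMethodError,
  # because Kernel#select is a private method on every object.
  def userid_password_auths_unguarded
    authentications.select { |auth| auth[:userid] && auth[:password] }
  end

  # Defensive variant: Array(nil) is [], so a missing collection is treated
  # as "no authentications" instead of crashing the caller.
  def userid_password_auths_guarded
    Array(authentications).select { |auth| auth[:userid] && auth[:password] }
  end
end
```

With FakeProvider.new(nil), the unguarded method raises NoMethodError while the guarded one returns an empty array, so a loop such as sync_workers could skip that provider rather than take the whole server down.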

#8

Yes, of course, authentication parameters are provided.

I created the issue here: https://github.com/ManageIQ/manageiq/issues/13958