OpenStack Service Status


#1

One situation that comes up often is an OpenStack service that is either down or not responding correctly to some requests.

I think it would be good to have a simple status page that shows the connection status of each OpenStack service, based on the most recent attempt to contact that service.
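To make the idea concrete, here is a minimal sketch of the bookkeeping such a page would need. It is written in Python purely for illustration and is not ManageIQ code; `StatusBoard`, `ServiceStatus`, and the service names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional


@dataclass
class ServiceStatus:
    ok: bool               # did the most recent attempt succeed?
    detail: str            # e.g. "HTTP 503" or an exception message
    checked_at: datetime   # when that attempt happened


class StatusBoard:
    """Keeps only the result of the most recent attempt per service."""

    def __init__(self) -> None:
        self._latest: Dict[str, ServiceStatus] = {}

    def record(self, service: str, ok: bool, detail: str = "") -> None:
        # Overwrites any earlier result for the same service.
        self._latest[service] = ServiceStatus(
            ok, detail, datetime.now(timezone.utc)
        )

    def latest(self, service: str) -> Optional[ServiceStatus]:
        return self._latest.get(service)

    def summary(self) -> Dict[str, str]:
        # One "up"/"down" entry per service that has been contacted.
        return {
            name: "up" if status.ok else "down"
            for name, status in sorted(self._latest.items())
        }
```

Every place that talks to a service would call `record` after the attempt, and the status page would just render `summary()`.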

Perhaps there could even be a more detailed view that shows the request/response history for each service. This information can already be tracked with a fog logger. Maybe we could register one and chainsaw it by separating out the various services into their own streams.
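As a rough illustration of the "separate streams" idea, the sketch below keeps a bounded request/response history per service. It is Python for illustration only and does not use the fog logger API; `ServiceHistory` and its methods are hypothetical names.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Tuple

# (HTTP method, request path, response status)
Entry = Tuple[str, str, int]


class ServiceHistory:
    """Routes log entries into one bounded stream per service."""

    def __init__(self, max_entries: int = 50) -> None:
        # Each service gets its own deque; old entries fall off the front.
        self._streams: Dict[str, Deque[Entry]] = defaultdict(
            lambda: deque(maxlen=max_entries)
        )

    def log(self, service: str, method: str, path: str, status: int) -> None:
        self._streams[service].append((method, path, status))

    def history(self, service: str) -> List[Entry]:
        # .get avoids creating an empty stream for unknown services.
        return list(self._streams.get(service, ()))
```

Capping each stream keeps the detailed view cheap to store while still showing the most recent exchanges for a given service.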

Showing this type of information would help users debug their OpenStack connections and would provide some transparency into the underlying OpenStack environment and services.


#2

We already have something similar for VMware, where we recheck the credentials every so often. Would that be a useful mechanism? I would think we should do this for all providers.


#3

I’m pretty sure we double check the credentials already. However, my point was more about the fact that OpenStack has several service endpoints that we have to connect to. Validating credentials for OpenStack means validating only the Keystone service, though I think we also at least attempt to connect to the Nova service as part of the validation. Keep in mind that authentication validations only validate records in the Authentications table. The OpenStack services are not treated as separate Authentication records.

And it’s not always obvious from the OpenStack side that something is wrong with a service other than Keystone until you attempt to use that service. For instance, the OpenStack installer may indicate that it set up the Swift service correctly, and it may even be possible to telnet to the Swift port on the OpenStack server, but the first hint of a problem comes on the first attempt to connect to the Swift service, when the client receives an HTTP 503 error.

This seems to be the case when a Swift proxy service is running, but the backing service is actually dead on arrival. While fixing this may not be directly in our wheelhouse, it’s definitely something we could report to the user.
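The difference between "the port is open" and "the service actually answers" could be reported along these lines. This is a minimal Python sketch using only the standard library, not how ManageIQ does it; the function names, the status strings, and the choice to treat any response below 500 (including a 401) as "responding" are all assumptions.

```python
import socket
import urllib.error
import urllib.request
from typing import Optional


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection succeeds (roughly what telnet tells you)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_status(url: str, timeout: float = 3.0) -> Optional[int]:
    """HTTP status code of a GET, or None if no HTTP response came back."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions but still carry a code.
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def classify(is_port_open: bool, status: Optional[int]) -> str:
    """Turn the two probe results into a message for the user."""
    if not is_port_open:
        return "unreachable"
    if status is None:
        return "port open, no HTTP response"
    if status >= 500:
        return f"degraded (HTTP {status})"
    return "responding"
```

With the Swift case described above, `port_open` would return `True` while `http_status` returns 503, so `classify` reports the service as degraded instead of up.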

I don’t think this is unique to OpenStack. For instance, SCVMM has multiple services that ManageIQ connects to, and RHEV-M has the metrics DB that we connect to. However, in the case of RHEV, I think we are actually validating the authentication record.


#4

@blomquisg

In the case of RHEV/oVirt, we could monitor whether ovirt-engine-dwhd is installed and running. That way, perf_capture messages are not placed on the queue if the service is down. Merely connecting to the database won’t be enough, because the service can be down while the database still accepts connections. It would also help clarify for an end user the source of a problem with C&U collections: whether the collection service on RHEV is down or a database credential issue has occurred.
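A small sketch of that decision, assuming the two facts (service running, database reachable) have already been determined by some other means. It is Python for illustration only; `diagnose_collection`, `should_queue_capture`, and the message strings are hypothetical.

```python
def diagnose_collection(dwh_service_running: bool, db_reachable: bool) -> str:
    """Explain why C&U collection would fail, or return "ok"."""
    if not db_reachable:
        return "metrics database unreachable (check credentials or connectivity)"
    if not dwh_service_running:
        # The database answers, but nothing is feeding it new data.
        return "database reachable, but the collection service is down"
    return "ok"


def should_queue_capture(dwh_service_running: bool, db_reachable: bool) -> bool:
    """Only queue capture work when both checks pass."""
    return diagnose_collection(dwh_service_running, db_reachable) == "ok"
```

The same message that blocks queuing can be shown to the user, which covers both the "don't queue pointless work" and the "tell the user what is wrong" goals.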