High Availability Database


#1

Original use cases are in https://github.com/ManageIQ/manageiq/issues/42

Another use case is automatic database maintenance (e.g., re-indexing) without the need for a DBA. A mirrored database can be taken offline for automated maintenance and then brought back online. This can be done round-robin, with all of the database instances participating in the mirroring taking a turn. A rough sketch of the idea follows.
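To make the round-robin idea concrete, here is a minimal sketch. The replica list, the rotation helpers, and the use of REINDEX are all illustrative assumptions, not an existing ManageIQ API:

```ruby
# Hypothetical round-robin maintenance loop. REPLICAS and the
# rotation helpers are illustrative assumptions, not ManageIQ APIs.
require "pg"

REPLICAS = [
  { host: "db1.example.com", dbname: "vmdb_production", user: "postgres" },
  { host: "db2.example.com", dbname: "vmdb_production", user: "postgres" },
]

REPLICAS.each do |conn_info|
  # take_out_of_rotation(conn_info[:host])  # hypothetical: via the HA/load-balancer tooling

  conn = PG.connect(conn_info)
  begin
    conn.exec("REINDEX DATABASE vmdb_production")  # the automated maintenance step
  ensure
    conn.close
  end

  # put_back_in_rotation(conn_info[:host])  # hypothetical: instance catches up, next one goes
end
```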

Let’s use this forum to discuss the approach and come up with a design that we can implement.


#2

@dillaman A few thoughts on the details of how HA and rubyrep interact in ManageIQ, after a short discussion with @Fryguy (keep me honest if I missed something):

  • The Database Synchronization role should allow failover of the “rubyrep” replication worker. Even though only one worker runs per region, multiple appliances could be configured to take over if this worker dies.

  • The PostgreSQL adapter we use for Rails contains database retry logic in the event of a connection temporarily going away. This code is found here. If the connection goes down and we get a known “retry-able” exception, rubyrep may just pick up with the HA server when HA kicks in. The rubyrep process run via the “Database Synchronization” worker role may or may not need to be restarted, depending on the state of its “connection” object, whether we can automatically fix the connection, and whether the PG HA failover completes before our retry logic times out. If things just work, the rubyrep process should be able to pick up any pending change records in the new database. (A sketch of the general shape of such retry logic follows this list.)

  • The rubyrep tables, such as rr##_pending_changes, rr##_logged_events, and rr##_sync_state, where ## is the region number, should be accounted for in the HA setup, as we may or may not want non-rubyrep HA to replicate them. Note that some of these tables reside in the master db and some in the slave db. Am I missing some tables? (A query to enumerate them follows this list.)

  • Housing the primary database outside of an appliance for HA adds latency and needs to be evaluated; the rubyrep process should be kept on the appliance with the least latency to the database.
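Regarding the retry logic in the second bullet: I don’t have the adapter code inline here, but the general shape of “retry-able” connection handling looks something like this sketch. The exception list, attempt counts, and sleep times are assumptions, not the actual adapter implementation:

```ruby
# Sketch of the general shape of "retry-able" connection handling.
# NOT the actual Rails adapter code -- exception classes, attempt
# counts, and sleep times are illustrative assumptions.
require "pg"
require "active_record"

RETRYABLE_ERRORS = [PG::ConnectionBad, PG::UnableToSend].freeze

def with_connection_retry(max_attempts: 3, wait: 5)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue *RETRYABLE_ERRORS, ActiveRecord::ConnectionNotEstablished
    raise if attempts >= max_attempts
    sleep(wait)  # give the PG HA failover a chance to promote the standby
    ActiveRecord::Base.connection.reconnect! rescue nil
    retry
  end
end

# Usage: a query that dies mid-failover is retried against the new server.
# with_connection_retry { ActiveRecord::Base.connection.select_value("SELECT 1") }
```

If rubyrep’s own “connection” object holds a dead socket, a wrapper like this around its work loop is roughly what “may just pick up with the HA server” would require.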
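And for the third bullet, one way to answer “am I missing some tables?” is to enumerate everything matching the rr##_ prefix on both the master and the slave. This assumes an established ActiveRecord connection to the database being inspected:

```ruby
# Enumerate rubyrep's rr##_* tables so the HA setup can decide
# whether to replicate them. Run against master and slave separately.
require "active_record"

rr_tables = ActiveRecord::Base.connection.select_values(<<-SQL)
  SELECT tablename
  FROM pg_tables
  WHERE tablename ~ '^rr[0-9]+_'
  ORDER BY tablename
SQL

puts rr_tables  # e.g. rr1_logged_events, rr1_pending_changes, ...
```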


#3

Hi @chessbyte and @jrafanie,

Can you review the attached PDF diagram and comment?

manageiq-vmware-hawan.pdf (121.2 KB)

  • This is a one-active-miq WAN HA design for two datacenters connected by a not-so-fast WAN link.

    • All components (Pacemaker, Corosync, and DRBD8) are well-tested/understood open-source software.
    • We will have to write an OCF agent script to allow Pacemaker to start/stop/status evmserverd (a rough sketch of the required agent contract follows this list).
    • PostgreSQL and httpd OCF agents for Pacemaker are already available.
    • Configure the PostgreSQL data partition to write to the shared /dev/drbd0 device between miq01 and miq02.
  • For WAN HA to be successful, the WAN bandwidth needs to be high enough to handle PostgreSQL’s disk write rate.

    • I haven’t tried drbd-proxy yet, but it looks like it can buffer/compress the DRBD disk sync between the two datacenters.
    • Need to measure miq’s disk write metrics, e.g., using sar (a rough measurement sketch follows this list).
  • The same architecture can be used for LAN HA, i.e., miq01 and miq02 in the same DC.

    • LAN HA will most likely work since there is no WAN bandwidth concern.
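On the OCF agent bullet above: real resource agents are usually shell scripts built on ocf-shellfuncs, but the contract Pacemaker expects is small. A rough sketch of that contract, assuming evmserverd is managed by systemd on the appliance:

```ruby
#!/usr/bin/env ruby
# Rough sketch of the OCF contract for an evmserverd resource agent.
# Production agents are normally shell scripts; this only illustrates
# the required actions and the standard OCF exit codes. Assumes systemd
# manages evmserverd on the appliance.

OCF_SUCCESS           = 0
OCF_ERR_GENERIC       = 1
OCF_ERR_UNIMPLEMENTED = 3
OCF_NOT_RUNNING       = 7

def evm_running?
  system("systemctl is-active --quiet evmserverd")
end

case ARGV[0]
when "start"
  exit(system("systemctl start evmserverd") ? OCF_SUCCESS : OCF_ERR_GENERIC)
when "stop"
  exit(system("systemctl stop evmserverd") ? OCF_SUCCESS : OCF_ERR_GENERIC)
when "monitor", "status"
  # Pacemaker requires OCF_NOT_RUNNING (not a generic error) when stopped.
  exit(evm_running? ? OCF_SUCCESS : OCF_NOT_RUNNING)
when "meta-data"
  puts "<!-- resource-agent XML (parameters, actions) goes here -->"
  exit OCF_SUCCESS
else
  exit OCF_ERR_UNIMPLEMENTED
end
```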
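And on measuring the disk write rate: besides sar, a crude sample can be taken straight from /proc/diskstats. The device name is an assumption; use whatever block device backs the PostgreSQL data partition:

```ruby
# Approximate the write rate DRBD would have to replicate by diffing
# the "sectors written" counter in /proc/diskstats (the 10th field;
# sectors are 512 bytes). DEVICE is an assumption -- use the block
# device that backs the PostgreSQL data partition.
DEVICE   = "sda"
INTERVAL = 10 # seconds

def sectors_written(device)
  line = File.readlines("/proc/diskstats").find { |l| l.split[2] == device }
  raise "device #{device} not found" unless line
  line.split[9].to_i
end

before = sectors_written(DEVICE)
sleep INTERVAL
after = sectors_written(DEVICE)

kb_per_sec = (after - before) * 512.0 / 1024 / INTERVAL
puts format("~%.1f KB/s written to %s", kb_per_sec, DEVICE)
```

If the sustained rate is anywhere near the WAN link’s usable throughput, drbd-proxy buffering/compression becomes a hard requirement rather than an optimization.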

#4

@tjyang Interesting. I’m quite unfamiliar with Pacemaker setup and architecture; I’ve mostly seen high-level diagrams and have very little hands-on experience. I haven’t used Corosync or DRBD8 at all.

ManageIQ and PostgreSQL are moving toward logical replication: pglogical for the PG 9.x versions, then logical replication built into PostgreSQL itself once it’s included in PG 10.

@gtanzillo and @carbonin can probably provide better suggestions or comments here.

Thanks,
Joe


#5

@tjyang Oops, I misread your post. Sorry about that. I believe others have used Pacemaker for HA, although I’m not aware of any recommendations related to your architecture. Maybe others have hands-on experience with latency and the various tools you mentioned.