So the logic for when and where to fail the database over is handled by `repmgrd`. You can take a look at the documentation here.
I think what you’re looking for is the section on “witness servers”, which allow a particular network segment to determine whether it has the “voting majority” and should promote a local server.
This has a few implications for our implementation.
- We don’t yet have an interface for creating a witness server (but this is planned)
- The logic that determines which database the app points to depends on losing the connection to the original master.
This means that if we have application servers on both network segments, we need additional logic to decide what to do in that scenario. We haven’t addressed this yet because we don’t have a very satisfying answer to the question. For now, I would suggest keeping the application servers on a single network segment if at all possible, so that they always agree on which database they are pointing to. Once we have a way for servers to communicate without going through the database, this problem will be much easier to tackle.
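As a rough sketch of what a witness setup looks like in repmgr (the hostnames and paths here are hypothetical examples, and the exact syntax varies between repmgr versions):

```shell
# repmgr.conf on the witness node (illustrative values only):
#   node_id=99
#   node_name='witness1'
#   conninfo='host=witness1 dbname=repmgr user=repmgr'

# Register the witness against the current primary (repmgr 4+ syntax;
# older releases used "repmgr witness create" instead).
repmgr -f /etc/repmgr.conf witness register -h primary1
```

The witness holds no data; it only adds a vote so the segment that still sees a majority of nodes knows it is safe to promote.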
The second thing you may be interested in is STONITH (“shoot the other node in the head”). We haven’t tackled this yet because how to take down a “misbehaving” node is very infrastructure-specific and difficult to generalize across use cases. What I would suggest is injecting a script that runs as the `promote_command` configured for `repmgrd`, handling STONITH as well as promoting the local node.
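A minimal sketch of such a wrapper, assuming a hypothetical script path and target host; the fencing step is a placeholder you would replace with whatever your infrastructure provides (an IPMI power-off, a cloud API call, etc.):

```shell
#!/bin/sh
# Hypothetical promote wrapper. Set in repmgr.conf as:
#   promote_command='/usr/local/bin/promote_with_stonith.sh'

OLD_PRIMARY="primary1"   # placeholder; discover this from your own inventory

# 1. STONITH: forcibly take the old primary down so it cannot accept writes.
#    Replace this with your infrastructure's actual fencing mechanism.
ssh "$OLD_PRIMARY" 'sudo systemctl stop postgresql' || {
    echo "WARNING: could not fence $OLD_PRIMARY" >&2
}

# 2. Promote the local standby via repmgr.
exec repmgr standby promote -f /etc/repmgr.conf
```

The key point is ordering: fence first, promote second, so you never have two nodes accepting writes at once.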
TL;DR:
These are really difficult problems to solve, and we have to solve them in a way that generalizes to many different virtualization and networking scenarios. But we’re working on it!