[concept] Cold Standby


This is more of a thought experiment than a decided feature; there are things that weigh for and against it. However, the following is a hopefully sound algorithm to provide cold standby systems.

What does cold standby mean here: a second 'version' of the system that is powered off but able to take over the original system's functionality in the case of an outage.

Pros and opportunities:

  • This is as close to a hot migration as you can get.

  • It provides vastly improved availability for legacy systems.

  • It boosts attractiveness for legacy users.

Cons and risks:

  • Supports use of legacy applications (delays migration towards a proper cloud architecture)

  • When done automagically and not by an operator, it involves the risk of either a Hot/Hot or a Cold/Cold situation, causing undefined behavior.

  • Will use twice the space.

  • Will require CPU and network resources to perform the sync.

I believe that a consensus algorithm (like Raft or Paxos) to enforce strong consistency is important, given that a hot/hot situation will probably be the most devastating outcome.

Algorithm (draft):

A sync interval of 1s is taken as an example here.

F (FiFo, aka sniffle, to provide quorum/consensus)
H1 (1st hypervisor)
H2 (2nd hypervisor)

* (marks the hot hypervisor)

H1 and H2 each run an FSM that
1) is directly connected to its peer
2) has access to F for consensus.

The VM is created on H1*. If a VM has only one hypervisor assigned, this hypervisor is automatically declared hot.
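The role assignment at VM creation could look roughly like the following minimal sketch. The function name `initial_roles` and the "hot"/"cold" labels are assumptions made for illustration, not part of the draft.

```python
def initial_roles(hypervisors):
    """Return a {hypervisor: role} map for a newly created VM.

    Assumption: the first entry is the hypervisor the VM was created on.
    It is declared hot; any further hypervisors start out cold.
    """
    if not hypervisors:
        raise ValueError("a VM needs at least one hypervisor")
    roles = {hypervisors[0]: "hot"}
    for standby in hypervisors[1:]:
        roles[standby] = "cold"
    return roles
```

With a single hypervisor the map contains only one hot entry, which matches the rule that a lone hypervisor is automatically hot.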

H2 is added as a standby.

H2 connects to H1*.

H1* enters connected state.

H1* syncs the last known common state with H2: none

H1* performs a zfs send/receive of snapshot S1 of the VM to H2.

H1* syncs the last known common state with H2: S1

H1* sends an incremental snapshot S2, based on S1, to H2.
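One sync round from the steps above can be sketched as a helper that only builds the commands and does not run them. The dataset name, the use of ssh as transport and the function name `sync_commands` are assumptions for illustration; only `zfs snapshot`, `zfs send` (with `-i` for incrementals) and `zfs receive` are taken from the zfs CLI.

```python
def sync_commands(dataset, new_snap, last_common=None, standby="H2"):
    """Build the commands for one sync round (nothing is executed here).

    dataset     -- hypothetical ZFS dataset backing the VM, e.g. "zones/vm"
    new_snap    -- name of the snapshot to take and send, e.g. "S2"
    last_common -- last snapshot both sides have, or None for the first sync
    standby     -- host name of the cold hypervisor (assumed reachable via ssh)
    """
    snapshot = ["zfs", "snapshot", f"{dataset}@{new_snap}"]
    if last_common is None:
        # First sync: no common state yet, so send the full snapshot.
        send = ["zfs", "send", f"{dataset}@{new_snap}"]
    else:
        # Later syncs: send only the delta since the last common snapshot.
        send = ["zfs", "send", "-i",
                f"{dataset}@{last_common}", f"{dataset}@{new_snap}"]
    # The send stream would be piped into this command on the standby.
    receive = ["ssh", standby, "zfs", "receive", dataset]
    return snapshot, send, receive
```

After a round succeeds, `new_snap` becomes the last known common state and is passed as `last_common` on the next interval.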


H1* goes down.

H2 loses connectivity to H1*.

H2 performs a reconnection attempt and fails.

H2 requests the quorum from F. Since F can't reach H1 either, it grants the quorum to H2, which becomes H2* (F + H2* have more say than H1).

H1 comes online again and boots in cold mode.
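The decision F makes here can be sketched as a plain two-out-of-three majority check. This is a simplification and not Raft or Paxos; the function name `grant_quorum` and its parameters are assumptions for illustration.

```python
def grant_quorum(requester, hot, reachable_by_f):
    """Decide whether F lets `requester` take over as the hot hypervisor.

    requester      -- hypervisor asking for the quorum, e.g. "H2"
    hot            -- hypervisor currently recorded as hot, or None
    reachable_by_f -- set of hypervisors F can currently reach
    """
    if requester not in reachable_by_f:
        # F and the requester cannot form a majority if they cannot talk.
        return False
    if hot is not None and hot != requester and hot in reachable_by_f:
        # The hot hypervisor is still alive from F's point of view, so
        # granting the quorum would risk a hot/hot situation.
        return False
    return True
```

In the walk-through, F can reach H2 but not H1, so `grant_quorum("H2", "H1", {"H2"})` returns `True`. If only the direct link between H1 and H2 had failed, F would still see H1 and refuse.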

H1 requests a list of hypervisors and finds H2* active.

H1 connects to H2* as standby.

(Sync etc. happens as before, just in the opposite direction.)
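The whole walk-through can be condensed into a small state machine per hypervisor. The sketch below is one possible reading of the draft: the state names, event names and the handling of a denied quorum are assumptions and are not specified above.

```python
from enum import Enum


class State(Enum):
    COLD = "cold"                # VM powered off, not connected to a hot peer
    STANDBY = "standby"          # VM powered off, receiving syncs from the hot peer
    SEEKING_QUORUM = "seeking"   # lost the hot peer, asking F for the quorum
    HOT = "hot"                  # running the VM


# (current state, event) -> next state
TRANSITIONS = {
    (State.COLD, "sole_hypervisor"): State.HOT,
    (State.COLD, "connected_to_hot_peer"): State.STANDBY,
    (State.STANDBY, "reconnect_failed"): State.SEEKING_QUORUM,
    (State.SEEKING_QUORUM, "quorum_granted"): State.HOT,
    # Assumption: the draft does not say what happens on denial;
    # staying cold is the choice that avoids hot/hot.
    (State.SEEKING_QUORUM, "quorum_denied"): State.COLD,
    (State.HOT, "reboot"): State.COLD,
}


def step(state, event):
    """Return the next state; events with no transition are ignored (assumption)."""
    return TRANSITIONS.get((state, event), state)


def run(state, events):
    """Feed a sequence of events through the FSM and return the final state."""
    for event in events:
        state = step(state, event)
    return state
```

For example, H2's path is `run(State.COLD, ["connected_to_hot_peer", "reconnect_failed", "quorum_granted"])`, which ends in `State.HOT`, while H1's path after the outage is `run(State.HOT, ["reboot", "connected_to_hot_peer"])`, which ends in `State.STANDBY`.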





Heinz N. Gies

