High Availability (HA)

High Availability (HA) is a Datomic Pro feature for ensuring the availability of a Datomic transactor in the event of a single machine failure.

Enabling HA

To enable HA, simply run two transactor processes on separate machines, with identical transactor properties files (in particular, configured to use the same storage.) Running HA requires the use of a paid Datomic Pro license key in both transactors. You can and should use paid Datomic Pro license keys in development and staging environments, as well as in production, so that all environments have the same capabilities.

The first transactor to communicate with storage will become active, and the second transactor will operate in standby mode until/unless the active transactor fails.

If you are using the Datomic transactor appliance, you can run two transactors in a single stack by setting a property in your CloudFormation template:

aws-autoscaling-group-size=2

If you are managing your own processes, you will need to provide a mechanism that tries to keep two machines running the transactor process at all times.

Peer Recovery Time

The time for a peer to recover from a transactor failure consists of two components:

  • the time for the standby transactor to detect that the active transactor has failed
  • the time for the peer to detect the new transactor

The standby transactor will take over if it sees the active transactor miss two consecutive heartbeats, and the peers will poll for a new transactor once per second, so the total number of seconds for a peer to recover should be at worst

2 * (heartbeat-interval-msec / 1000) + 1

The heartbeat-interval-msec is configured in the transactor properties file, and the default value of 5000 implies a worst-case peer failover time of around 11 seconds. You can improve this time by choosing a smaller heartbeat-interval-msec, so long as you are careful to consider two risks:

  • Network anomalies can trigger unnecessary failover, if they could have resolved themselves in a longer heartbeat window.
  • Long Java garbage collection (GC) pauses should happen rarely or never on a correctly-configured transactor, but could trigger an unnecessary HA failover if heartbeat-interval-msec is tuned too low.

You should test Datomic with your own environment before changing the heartbeat interval.

The minimum legal value for heartbeat-interval-msec is 1000.

Reads Remain Available

Peers can continue to read and query during a transactor failover. In particular, calls to Connection.db will neither block nor fail during the window of a transactor failover.

Transaction Latency After HA Recovery

The current transactor implementation does not replay the log to catch up until after an HA event has occurred.

While the transactor will be available for peer connections as soon as it takes the active role, transactions will queue behind log catch-up, and will experience unusually long latencies (up to several seconds) for several seconds after an HA event.

Application-Level Retry

Datomic does not automatically retry transactions after a transactor failure. Applications, aware of their own semantic needs, can implement an appropriate retry policy, with full access (e.g. via the Connection.log API) to recent transaction history.

Switching to a Different Transactor Version

We recommend using normal HA failover to deploy a different transactor version, as follows:

  • Start a new transactor pair running the version of Datomic you are switching to.
  • Perform whatever operational checks are appropriate in your environment to make sure the new process pair appears healthy. For example, with the AWS appliance you could verify the increase in HeartMonitorMsec count.
  • Kill the original standby transactor.
  • Kill the original active transactor.

It is ok to perform the last two steps simultaneously.

Moving Across Data Centers / Regions

You may want to move a Datomic system from one data center or region to another, e.g. to recover from the catastrophic failure of a data center. You will need to perform the following steps, in this order:

  • Shutdown all Datomic-related processes (transactors and peers) in the data center you are leaving.
  • Restore a consistent copy (see below) of the Datomic data into the new data center.
  • Start all Datomic-related processes in the new data center.

Use Datomic Backup For Consistent Copy

Datomic's backup makes a consistent copy of Datomic data, i.e. backup preserves Datomic's consistency guarantees. Thus Datomic backup is suitable for use in a disaster-recovery plan.

Other Consistent Copy Options

Datomic treats storage as a pluggable component, which raises the possibility of copying Datomic data using the underlying storage engine's backup or replication facilities, an approach hereafter called "storage copy". Storage copy may be preferable to Datomic backup for disaster recovery for reasons including:

  • Operators are already familiar and comfortable with the tools involved.
  • The disaster recovery strategy needs to protect data placed in storage by other applications besides Datomic.
  • Storage facilities provide live replication that can reduce recovery time.

A storage copy is a consistent copy if and only if it always comprises a complete set of writes made up to a specific point in time. For example:

  • SQL log shipping makes a consistent copy and is suitable for disaster recovery.
  • Replication or backup of eventually consistent storages cannot (by definition) make consistent copies and is not suitable for disaster recovery.

Older Versions of Datomic

Versions of Datomic prior to 0.8.3993 have longer peer recovery times, typically 10-15 seconds more than described above, or even longer if the transactor's memory-index-max is set higher than the default value.